
Download Many Files In Parallel? (linux/python?)

I have a big list of remote file locations and the local paths where I would like them to end up. Each file is small, but there are very many of them. I am generating this list within a Python script.

Solution 1:

I normally use pscp for things like this, and call it from Python using subprocess.Popen.

for example:

import subprocess

# -p preserves file attributes, -scp forces the SCP protocol, -unsafe allows server-side wildcards
pscp_command = r'"c:\program files\putty\pscp.exe" -pw <pwd> -p -scp -unsafe <file location on my linux machine, including machine name and login; wildcards work here> <where you want the files to go on a windows machine>'
p = subprocess.Popen(pscp_command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = p.communicate()  # communicate() waits for pscp to finish, so a separate wait() isn't needed

Of course, I'm assuming Linux --> Windows here.
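Since the goal is many small files in parallel, one rough way to extend this (a sketch only: the placeholder paths and the start_copy helper are made up here, not anything pscp provides) is to launch a small batch of pscp processes at a time from Python and wait for each batch before starting the next:

import subprocess

PSCP = r'"c:\program files\putty\pscp.exe"'

def start_copy(remote, local):
    # build one pscp command per file; <pwd> is the same placeholder as above
    cmd = '%s -pw <pwd> -p -scp -unsafe %s %s' % (PSCP, remote, local)
    return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# file_list stands in for the (remote, local) pairs you are already generating
file_list = [('<remote path 1>', '<local path 1>'), ('<remote path 2>', '<local path 2>')]
batch_size = 5  # how many pscp processes to run at once

for i in range(0, len(file_list), batch_size):
    procs = [start_copy(remote, local) for remote, local in file_list[i:i + batch_size]]
    for p in procs:
        p.communicate()  # wait for the whole batch to finish before starting the next one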

Solution 2:

Try wget, a command-line utility installed on most Linux distros and also available via Cygwin on Windows.

You may also have a look at Scrapy, which is a library/framework written in Python.
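If you would rather call wget from Python than from the shell, one possible sketch (assuming wget is on the PATH; fetch and url_and_path_list are made-up names standing in for your own list) is to fan the calls out over a small thread pool:

import subprocess
from multiprocessing.pool import ThreadPool

def fetch(args):
    url, local_path = args
    # -q keeps wget quiet, -O writes the download to the local path you chose
    return subprocess.call(['wget', '-q', '-O', local_path, url])

# url_and_path_list stands in for your list of (remote URL, local path) pairs
url_and_path_list = [('ftp://host/some/file', '/tmp/some/file')]
pool = ThreadPool(processes=10)  # 10 downloads in flight at once
exit_codes = pool.map(fetch, url_and_path_list)
pool.close()
pool.join()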

Solution 3:

If you use a Pool object from the multiprocessing module, urllib2 should be able to handle the FTP downloads for you.

import urllib2
from multiprocessing import Pool

def get_url(url):
    # url should start with 'ftp:'
    try:
        res = urllib2.urlopen(url)
        return url, res.read()
    except Exception:
        # add more meaningful exception handling if you need it, e.g. retry once
        return url, None

# worker processes don't share memory, so collect results via the pool's return
# values rather than a global dict (on Windows, guard this part with
# "if __name__ == '__main__':")
pool = Pool(processes=num_processes)  # num_processes: however many downloads you want in flight
results = dict(pool.map(get_url, url_list))
pool.close()
pool.join()

Of course, spawning processes has some serious overhead. Non-blocking requests will almost certainly be faster if you can use a third-party module like Twisted.

Whether that overhead is a serious problem will depend on how the per-file download time compares with the network latency.

You can try implementing it with Python threads rather than processes, but it gets a bit trickier. See the answer to this question on using urllib2 safely with threads. You would also need to use multiprocessing.pool.ThreadPool instead of the regular Pool.
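The switch itself is small. A rough sketch of the thread-based variant of the pool above, reusing the same get_url worker and the same assumed url_list:

from multiprocessing.pool import ThreadPool

# threads share memory and avoid the process start-up cost,
# but urllib2 must be used thread-safely (see the linked question)
num_threads = 10  # an assumed value; tune it to your connection
pool = ThreadPool(processes=num_threads)
results = dict(pool.map(get_url, url_list))
pool.close()
pool.join()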

Solution 4:

I know it's an old post, but there is a perfect Linux utility for this. If you are transferring files from a remote host, lftp is great. I mainly use it to quickly push stuff to my FTP server (that direction is what the mirror command's --reverse option is for), but it works just as well for pulling files down, and it lets you copy a user-defined number of files in parallel, like you wanted. To copy files from a remote path to a local path, your session would look something like this:

lftp
open ftp://user:password@ftp.site.com
cd some/remote/path
lcd some/local/path
mirror --parallel=2

Be very careful with this command, though: just like other mirror commands, if you get the direction or paths wrong, you WILL delete files.

For more options and documentation, see the lftp man page: http://lftp.yar.ru/lftp-man.html
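Since the question also mentions Python, one possible way to drive lftp from a script (a sketch only; the host, credentials, and paths are the same placeholders as above) is to pass the commands to lftp's -c option via subprocess:

import subprocess

# -c tells lftp to run the given commands and exit;
# mirror <remote dir> <local dir> pulls files down, --parallel sets the concurrency
commands = ('open ftp://user:password@ftp.site.com; '
            'mirror --parallel=2 some/remote/path some/local/path')
subprocess.check_call(['lftp', '-c', commands])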
