downloading a billion files in python · our task is to download all the files on the remote server...

James Saryerwinnie

A case study in multi-threading, multi-processing, and asyncio

Downloading a Billion Files in Python

@jsaryer

Our Task

There is a remote server that stores files

The files can be accessed through a REST API

Our task is to download all the files on the remote server to our client machine

Our Task (the details)

What client machine will this run on?

We have one machine we can use, 16 cores, 64 GB memory

What about the network between the client and server?

Our client machine is on the same network as the service with remote files

How many files are on the remote server?

Approximately one billion files, 100 bytes per file

When do you need this done?

Please have this done as soon as possible

File Server REST API

[Diagram: the files on the server are grouped into pages; each page carries FileNames and a NextMarker.]

GET /list

{"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}

GET /list?next-marker=token

{"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}

File Server REST API

GET /list
{"FileNames": ["file1", "file2", ...], "NextMarker": "pagination-token"}

GET /list?next-marker={token}
{"FileNames": ["file1", "file2", ...]}

GET /get/{filename}
(File blob content)
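
The pagination protocol above can be sketched as a generator that keeps following NextMarker until a response no longer contains one. This is a minimal sketch, not the talk's code: `fetch_page` is a hypothetical callable standing in for the HTTP request, and `fake_server` with its three pages is an invented in-memory stand-in so the loop can run without a server.

```python
def iter_filenames(fetch_page):
    """Yield every filename, following NextMarker until it is absent.

    fetch_page is a hypothetical callable: it takes a pagination token
    (None for the first page) and returns the decoded JSON response.
    """
    marker = None
    while True:
        content = fetch_page(marker)
        yield from content['FileNames']
        if 'NextMarker' not in content:
            return  # the last page carries no pagination token
        marker = content['NextMarker']


# Invented in-memory stand-in for the server: three pages keyed by token.
FAKE_PAGES = {
    None: {'FileNames': ['file1', 'file2'], 'NextMarker': 'token-a'},
    'token-a': {'FileNames': ['file3', 'file4'], 'NextMarker': 'token-b'},
    'token-b': {'FileNames': ['file5']},
}


def fake_server(marker):
    return FAKE_PAGES[marker]


all_names = list(iter_filenames(fake_server))
```

Because each request needs the token from the previous response, the listing itself has to run sequentially; only the per-file downloads can overlap.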

Caveats

This is a simplified case study.

The results shown here don't necessarily generalize.

Not an apples-to-apples comparison; each approach does things slightly differently

Always profile and test for yourself

Sometimes concrete examples can be helpful

Synchronous Version

Simplest thing that could possibly work.

[Diagram: pages are processed one at a time until none remain.]

    import json
    import os

    import requests


    def download_files(host, port, outdir):
        hostname = f'http://{host}:{port}'
        list_url = f'{hostname}/list'
        get_url = f'{hostname}/get'

        # Fetch the first page of filenames.
        response = requests.get(list_url)
        response.raise_for_status()
        content = json.loads(response.content)
        while True:
            # Download every file on the current page, one at a time.
            for filename in content['FileNames']:
                remote_url = f'{get_url}/{filename}'
                download_file(remote_url, os.path.join(outdir, filename))
            # The last page has no pagination token.
            if 'NextMarker' not in content:
                break
            response = requests.get(
                f'{list_url}?next-marker={content["NextMarker"]}')
            response.raise_for_status()
            content = json.loads(response.content)


    def download_file(remote_url, local_filename):
        response = requests.get(remote_url)
        response.raise_for_status()
        with open(local_filename, 'wb') as f:
            f.write(response.content)

Synchronous Results

One request: 0.003 seconds

One billion requests: 3,000,000 seconds

833.3 hours

34.7 days

Multithreading

List Files can't be parallelized. But Get File can be parallelized.

One thread calls List Files and puts the filenames on a queue.Queue

[Diagram: WorkerThread-1, WorkerThread-2, and WorkerThread-3 take filenames from the queue.Queue.]

Results Queue (a second queue.Queue)

Result thread prints progress, tracks overall results, failures, etc.

    import json
    import os
    import queue
    import threading

    import requests


    def download_files(host, port, outdir, num_threads):
        # ... same constants as before ...

        # Bounded queues: put() blocks when a queue is full.
        work_queue = queue.Queue(MAX_SIZE)
        result_queue = queue.Queue(MAX_SIZE)

        threads = []
        for i in range(num_threads):
            t = threading.Thread(
                target=worker_thread, args=(work_queue, result_queue))
            t.start()
            threads.append(t)
        result_thread = threading.Thread(target=result_poller,
                                         args=(result_queue,))
        result_thread.start()
        threads.append(result_thread)

        # ...

        response = requests.get(list_url)
        response.raise_for_status()
        content = json.loads(response.content)
        while True:
            for filename in content['FileNames']:
                remote_url = f'{get_url}/{filename}'
                outfile = os.path.join(outdir, filename)
                # Hand the download off to a worker thread.
                work_queue.put((remote_url, outfile))
            if 'NextMarker' not in content:
                break
            response = requests.get(
                f'{list_url}?next-marker={content["NextMarker"]}')
            response.raise_for_status()
            content = json.loads(response.content)


    def worker_thread(work_queue, result_queue):
        while True:
            work = work_queue.get()
            # A sentinel object tells the worker to exit.
            if work is _SHUTDOWN:
                return
            remote_url, outfile = work
            download_file(remote_url, outfile)
            result_queue.put(_SUCCESS)
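
The slides elide the shutdown logic, so here is one way the sentinel pattern used by worker_thread might be wired up end to end. It is a sketch under stated assumptions rather than the talk's real code: `_SHUTDOWN` is assumed to be a plain `object()`, `fake_download` is a hypothetical stand-in for download_file so nothing touches the network, and the queue size of 10 is arbitrary.

```python
import queue
import threading

_SHUTDOWN = object()  # assumed sentinel: tells a worker to exit


def fake_download(name):
    # Hypothetical stand-in for download_file(); no network involved.
    return f'contents of {name}'


def worker(work_queue, result_queue):
    while True:
        work = work_queue.get()
        if work is _SHUTDOWN:
            return
        result_queue.put((work, fake_download(work)))


def run(filenames, num_threads=3):
    work_queue = queue.Queue(maxsize=10)
    result_queue = queue.Queue()  # unbounded here so workers never block
    threads = [
        threading.Thread(target=worker, args=(work_queue, result_queue))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for name in filenames:
        work_queue.put(name)  # blocks while the work queue is full
    for _ in threads:
        work_queue.put(_SHUTDOWN)  # one sentinel per worker
    for t in threads:
        t.join()
    # All workers have exited, so the result queue is no longer changing.
    results = {}
    while not result_queue.empty():
        name, contents = result_queue.get()
        results[name] = contents
    return results


results = run([f'file{i}' for i in range(25)])
```

Each worker consumes exactly one sentinel before returning, which is why the producer enqueues one per thread. A production version would also need to catch exceptions inside the worker loop so that one failed download does not silently end a thread.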

Multithreaded Results - 10 threads

One billion requests: 3,600,000 seconds

1000.0 hours

41.6 days

Multithreaded Results (second run; thread count not shown in this extract)

One billion requests: 4,200,000 seconds

1166.67 hours

48.6 days

Why?

Not necessarily IO bound due to low latency and small file size

GIL contention, overhead of passing data through queues

Things to keep in mind

The real code is more complicated, ctrl-c, graceful shutdown, etc.

Debugging is much harder, non-deterministic

The more you stray from stdlib abstractions, the more likely you are to encounter race conditions

Can't use concurrent.futures map() because of large number of files
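
To illustrate the last point: Executor.map() is documented as collecting its input iterable immediately rather than lazily, so feeding it a billion filenames would queue a billion pending calls up front. One possible workaround, sketched below, is to call submit() yourself and cap the number of in-flight futures with a semaphore. The names `run_bounded` and `flaky` are hypothetical, and the limit of 8 is arbitrary; this is not code from the talk.

```python
import threading
from concurrent import futures


def run_bounded(executor, fn, items, max_in_flight):
    """Call fn(item) for every item with at most max_in_flight pending.

    Returns (succeeded, failed) counts instead of a list of results,
    since collecting a billion results in memory is impractical.
    """
    slots = threading.BoundedSemaphore(max_in_flight)
    lock = threading.Lock()
    counts = {'succeeded': 0, 'failed': 0}

    def on_done(future):
        try:
            key = 'failed' if future.exception() else 'succeeded'
            with lock:
                counts[key] += 1
        finally:
            slots.release()  # free a slot for the next submission

    for item in items:  # items may be a lazy generator
        slots.acquire()  # blocks while max_in_flight futures are pending
        executor.submit(fn, item).add_done_callback(on_done)

    # Once every slot can be re-acquired, every callback has run.
    for _ in range(max_in_flight):
        slots.acquire()
    return counts['succeeded'], counts['failed']


def flaky(n):
    # Hypothetical work function: fails on multiples of 10.
    if n % 10 == 0:
        raise ValueError(n)
    return n


with futures.ThreadPoolExecutor(max_workers=4) as pool:
    succeeded, failed = run_bounded(pool, flaky, iter(range(100)), 8)
```

The input is consumed one item at a time, so memory use is bounded by `max_in_flight` rather than by the total number of files.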

Multiprocessing

Please have this done as soon as possible

Download one page at a time in parallel across multiple processes

[Diagram: WorkerProcess-1, WorkerProcess-2, and WorkerProcess-3 download the files of the current page.]

    import json
    import os
    from concurrent import futures

    import requests


    def download_files(host, port, outdir):
        hostname = f'http://{host}:{port}'
        list_url = f'{hostname}/list'

        all_pages = iter_all_pages(list_url)
        downloader = Downloader(host, port, outdir)
        with futures.ProcessPoolExecutor() as executor:
            for page in all_pages:
                # Start parallel downloads
                future_to_filename = {}
                for filename in page:
                    future = executor.submit(downloader.download, filename)
                    future_to_filename[future] = filename
                # Wait for downloads to finish
                for future in futures.as_completed(future_to_filename):
                    future.result()


    def iter_all_pages(list_url):
        # Reuse one HTTP session for all List Files calls.
        session = requests.Session()
        response = session.get(list_url)
        response.raise_for_status()
        content = json.loads(response.content)
        while True:
            yield content['FileNames']
            if 'NextMarker' not in content:
                break
            response = session.get(
                f'{list_url}?next-marker={content["NextMarker"]}')
            response.raise_for_status()
            content = json.loads(response.content)


    class Downloader:
        # ...

        def download(self, filename):
            remote_url = f'{self.get_url}/{filename}'
            response = self.session.get(remote_url)
            response.raise_for_status()
            outfile = os.path.join(self.outdir, filename)
            with open(outfile, 'wb') as f:
                f.write(response.content)

Multiprocessing Results - 16 processes

One billion requests: 320,000 seconds

88.88 hours

3.7 days

Things to keep in mind

Speed improvements due to truly running in parallel

Debugging is much harder, non-deterministic, pdb doesn't work out of the box

IPC overhead between processes higher than threads

Tradeoff between entirely in parallel vs. parallel chunks

Asyncio

Create an asyncio.Task for each file. This schedules the download to start as soon as the event loop gets control.

Move on to the next page and start creating tasks.

Meanwhile tasks from the first page will finish downloading their file.

All in a single process

All in a single thread

Switch tasks when waiting for IO

Should keep CPU busy
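
The points above can be seen in a small self-contained sketch. It is an illustration under assumptions, not the talk's code: `fake_download` is a hypothetical coroutine that replaces the HTTP request with `asyncio.sleep`, and the limit of 5 is arbitrary. The `stats` dictionary records how many downloads overlap, and needs no lock because everything runs in one thread.

```python
import asyncio

MAX_CONCURRENT = 5  # assumed limit, chosen only for illustration


async def fake_download(name, semaphore, stats):
    async with semaphore:
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        # Simulated network wait: the event loop runs other tasks here.
        await asyncio.sleep(0.01)
        stats['active'] -= 1
        return name


async def main(num_files=20):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    stats = {'active': 0, 'peak': 0}
    tasks = [
        asyncio.create_task(fake_download(f'file{i}', semaphore, stats))
        for i in range(num_files)
    ]
    names = await asyncio.gather(*tasks)
    return names, stats['peak']


names, peak = asyncio.run(main())
```

Tasks only give up control at an `await`, so the counter updates between awaits cannot interleave. The semaphore keeps the number of overlapping downloads at or below `MAX_CONCURRENT` no matter how many tasks exist.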

    import asyncio
    import json
    import os

    from aiohttp import ClientSession
    import uvloop


    async def download_files(host, port, outdir):
        hostname = f'http://{host}:{port}'
        list_url = f'{hostname}/list'
        get_url = f'{hostname}/get'
        # Caps the number of downloads in flight at once.
        semaphore = asyncio.Semaphore(MAX_CONCURRENT)
        task_queue = asyncio.Queue(MAX_SIZE)
        asyncio.create_task(results_worker(task_queue))
        async with ClientSession() as session:
            async for filename in iter_all_files(session, list_url):
                remote_url = f'{get_url}/{filename}'
                task = asyncio.create_task(
                    download_file(session, semaphore, remote_url,
                                  os.path.join(outdir, filename))
                )
                await task_queue.put(task)


    async def iter_all_files(session, list_url):
        async with session.get(list_url) as response:
            if response.status != 200:
                raise RuntimeError(f"Bad status code: {response.status}")
            content = json.loads(await response.read())
        while True:
            for filename in content['FileNames']:
                yield filename
            # The last page has no pagination token.
            if 'NextMarker' not in content:
                return
            next_page_url = f'{list_url}?next-marker={content["NextMarker"]}'
            async with session.get(next_page_url) as response:
                if response.status != 200:
                    raise RuntimeError(f"Bad status code: {response.status}")
                content = json.loads(await response.read())


    async def download_file(session, semaphore, remote_url, local_filename):
        async with semaphore:
            async with session.get(remote_url) as response:
                contents = await response.read()
            # Sync version.
            with open(local_filename, 'wb') as f:
                f.write(contents)
            return local_filename
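
The slides call results_worker(task_queue) but never show it. The following is a guess at what such a consumer might look like, not the author's implementation: the `_DONE` sentinel, the extra `summary` argument, and `fake_download` are all hypothetical names introduced for this sketch.

```python
import asyncio

_DONE = object()  # hypothetical sentinel marking the end of the queue


async def results_worker(task_queue, summary):
    """Await each queued task in order and tally successes and failures."""
    while True:
        task = await task_queue.get()
        if task is _DONE:
            return
        try:
            await task
            summary['succeeded'] += 1
        except Exception:
            summary['failed'] += 1


async def fake_download(i):
    # Hypothetical stand-in for download_file(); every 4th call fails.
    await asyncio.sleep(0)
    if i % 4 == 0:
        raise RuntimeError(f'download {i} failed')
    return i


async def main():
    summary = {'succeeded': 0, 'failed': 0}
    task_queue = asyncio.Queue(maxsize=10)
    worker = asyncio.create_task(results_worker(task_queue, summary))
    for i in range(12):
        # put() waits while the queue is full, which limits how many
        # unfinished tasks can pile up ahead of the results worker.
        await task_queue.put(asyncio.create_task(fake_download(i)))
    await task_queue.put(_DONE)
    await worker
    return summary


summary = asyncio.run(main())
```

Awaiting every task also retrieves its exception, so failures are counted instead of surfacing later as "exception was never retrieved" warnings.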

Asyncio Results

One billion requests: 560,000 seconds (155.55 hours, 6.48 days)

Summary

Approach        Single request time (s)    Days
Synchronous     0.003                      34.7
Multithread     0.0036                     41.6
Multiprocess    0.00032                    3.7
Asyncio         0.00056                    6.5

Asyncio and Multiprocessing and Multithreading

[Diagram: a main process with an Input Queue and an Output Queue, connected to WorkerProcess-1 and WorkerProcess-2. Each worker process contains Thread-1, Thread-2, an EventLoop, and a Queue.]

The Input/Output queues contain pagination tokens, such as "foo".

The main thread of the worker process is a bridge to the event loop running on a separate thread. It sends the pagination token to the async Queue.
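The bridge described above can be sketched with the standard library alone. This is an assumption about how such a bridge might look, not the talk's code: `bridge`, `consume`, and the `None` shutdown sentinel are hypothetical names, and the List call is replaced by appending to a list.

```python
import asyncio
import threading


async def _make_queue():
    # Create the queue on the loop's own thread so it is bound to that loop.
    return asyncio.Queue()


async def consume(queue, handled):
    # Runs on the event loop thread; None is a hypothetical shutdown sentinel.
    while True:
        token = await queue.get()
        if token is None:
            return handled
        handled.append(token)  # a real worker would make the List call here


def bridge(tokens):
    """Main-thread side: forward tokens to the async queue, thread-safely."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    queue = asyncio.run_coroutine_threadsafe(_make_queue(), loop).result()
    handled = []
    future = asyncio.run_coroutine_threadsafe(consume(queue, handled), loop)
    for token in list(tokens) + [None]:
        # asyncio.Queue is not thread-safe, so schedule the put on the loop.
        loop.call_soon_threadsafe(queue.put_nowait, token)
    result = future.result(timeout=5)
    loop.call_soon_threadsafe(loop.stop)
    thread.join(timeout=5)
    loop.close()
    return result
```

The key point is that the main thread never touches the queue directly. It only asks the loop to run `queue.put_nowait` via `loop.call_soon_threadsafe`.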

The event loop makes the List call with the provided pagination token "foo".

{"FileNames": [...], "NextMarker": "bar"}

The next pagination token, "bar", eventually makes its way back to the main process.

While another process goes through the same steps, WorkerProcess-1 is downloading 1000 files using asyncio.

1. We get to leverage all our cores.

2. We download individual files efficiently with asyncio.

3. Minimal IPC overhead: only pagination tokens are passed across processes, one per thousand files.
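The token flow between the main process and the workers can be sketched as follows. Everything here is an assumption for illustration: `PAGES` is a hypothetical in-memory stand-in for the List API, `DONE` is a hypothetical shutdown sentinel, and the asyncio download step is reduced to counting filenames.

```python
import multiprocessing as mp

# Hypothetical stand-in for the List API: token -> (filenames, next token).
PAGES = {
    None: (["a", "b"], "foo"),
    "foo": (["c", "d"], "bar"),
    "bar": (["e"], None),
}
DONE = "__done__"  # hypothetical sentinel telling a worker to exit


def worker(input_queue, output_queue):
    while True:
        token = input_queue.get()
        if token == DONE:
            return
        filenames, next_token = PAGES[token]
        # Hand the next token back first so another worker can start on it.
        output_queue.put(("token", next_token))
        # A real worker would now download `filenames` with asyncio.
        output_queue.put(("files", len(filenames)))


def coordinate(input_queue, output_queue, num_workers):
    """Main-process loop: recycle tokens until the listing is exhausted."""
    input_queue.put(None)  # None stands for "first page" in this sketch
    total_files, outstanding = 0, 1
    while outstanding:
        kind, value = output_queue.get()
        if kind == "token":
            if value is not None:
                input_queue.put(value)
                outstanding += 1
        else:
            total_files += value
            outstanding -= 1
    for _ in range(num_workers):
        input_queue.put(DONE)
    return total_files


def run(num_workers=2):
    input_queue, output_queue = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(input_queue, output_queue))
             for _ in range(num_workers)]
    for proc in procs:
        proc.start()
    total = coordinate(input_queue, output_queue, num_workers)
    for proc in procs:
        proc.join()
    return total
```

Call `run()` from under an `if __name__ == "__main__":` guard so that it also works with the spawn start method. Only short strings and counts cross the process boundary, which is the point of item 3.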

Combo Results

One billion requests: 30,300 seconds (8.42 hours)

Summary

Approach        Single request time (s)    Days
Synchronous     0.003                      34.7
Multithread     0.0036                     41.6
Multiprocess    0.00032                    3.7
Asyncio         0.00056                    6.5
Combo           0.0000303                  0.35
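The Days column follows from the per-request time multiplied by one billion requests. A small sketch of that arithmetic, using the figures from the table; `projected_days` is an illustrative helper name:

```python
SECONDS_PER_DAY = 24 * 60 * 60
TOTAL_REQUESTS = 1_000_000_000

# Per-request times in seconds, taken from the summary table.
per_request_seconds = {
    "Synchronous": 0.003,
    "Multithread": 0.0036,
    "Multiprocess": 0.00032,
    "Asyncio": 0.00056,
    "Combo": 0.0000303,
}


def projected_days(seconds_per_request, total_requests=TOTAL_REQUESTS):
    # Total wall-clock time, assuming the measured rate holds for the whole run.
    return seconds_per_request * total_requests / SECONDS_PER_DAY


for approach, seconds in per_request_seconds.items():
    print(f"{approach:<13}{projected_days(seconds):7.2f} days")
```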

Lessons Learned

Tradeoff between simplicity and speed

Multiple orders of magnitude difference based on approach used

Need to have max bounds when using queueing or any task scheduling
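One way to put a max bound on a queue is `asyncio.Queue(maxsize=...)`, where the producer suspends once the queue is full. A minimal sketch, assuming a hypothetical `None` end-of-stream sentinel and a no-op stand-in for the download; `run_bounded` is an illustrative name:

```python
import asyncio


async def producer(queue, items, peak):
    for item in items:
        await queue.put(item)  # suspends while the queue is full
        peak[0] = max(peak[0], queue.qsize())
    await queue.put(None)  # hypothetical end-of-stream sentinel


async def consumer(queue, seen):
    while True:
        item = await queue.get()
        if item is None:
            return
        await asyncio.sleep(0)  # stand-in for a download
        seen.append(item)


async def run_bounded(items, maxsize):
    # The bound keeps a fast producer from buffering the whole listing in memory.
    queue = asyncio.Queue(maxsize=maxsize)
    seen, peak = [], [0]
    await asyncio.gather(producer(queue, items, peak), consumer(queue, seen))
    return seen, peak[0]
```

`asyncio.run(run_bounded(range(100), maxsize=10))` returns the processed items and the largest queue size observed, which never exceeds `maxsize`.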

Thanks!

James Saryerwinnie @jsaryer
