Processing biggish data on commodity hardware: simple Python patterns
DESCRIPTION
SciPy 2013 talk on simple Python patterns to efficiently process large datasets. The talk focuses on the patterns and the concepts rather than on the implementations. The implementations can be found in the joblib and scikit-learn codebases.
TRANSCRIPT
Processing biggish data on commodity hardware
Simple Python patterns
Gaël Varoquaux, INRIA/Parietal – Neurospin
Disclaimer: I’m French, I have opinions. We’re in Texas, I hope y’all have left your guns outside.
Yeah, I know, Texas is bigger than France
“Big data”: petabytes... Distributed storage. Computing clusters.
Mere mortals: gigabytes... Python programming. Off-the-shelf computers: ∼ 16 CPUs, 32 GB RAM.
My tools
Python, what else? + NumPy + SciPy
The ndarray is underused by the data community.
My tools
Python, what else? Patterns in this presentation:
scikit-learn: machine learning in Python
joblib: using Python functions as pipeline jobs
Design philosophy
1. Fail gracefully: easy to debug, robust to errors.
2. Don’t solve hard problems: the original problem can be bent.
3. Dependencies suck: distribution is an age-old problem.
4. Performance matters: waiting kills productivity.
Processing big data: speed-ups in Hadoop, CPUs...
Execution pipelines: dataflow programming, parallel computing
Data access: storing, caching
Pipelines can get messy; databases are tedious.
5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
Big how? 2 scenarios:
Many observations – samples – e.g. Twitter
Many descriptors per observation – features – e.g. brain scans
1 On the fly data reduction
Big data is often I/O bound
Layer memory access: CPU caches, RAM, local disks, distant storage
Less data also means less work
1 Dropping data
The number one technique used to handle large datasets (see the sketch below):
1 loop: take a random fraction of the data
2 run the algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation.
Performance tip: run the loop in parallel.
Exploits redundancy across observations. Great when the number of samples is large.
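A minimal sketch of this loop, under illustrative assumptions (the 10% fraction and the column-mean "estimator" are placeholders; any scikit-learn estimator would fit):

import numpy as np
from joblib import Parallel, delayed

def fit_on_fraction(X, fraction=0.1, seed=0):
    # 1: take a random fraction of the observations
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    subset = rng.choice(n_samples, size=int(fraction * n_samples),
                        replace=False)
    # 2: run the "algorithm" on that fraction (placeholder: column means)
    return X[subset].mean(axis=0)

X = np.random.normal(size=(100000, 50))
# Performance tip in action: run the sub-sampling loop in parallel
results = Parallel(n_jobs=2)(
    delayed(fit_on_fraction)(X, seed=i) for i in range(10))
# 3: aggregate the results across sub-samplings
estimate = np.mean(results, axis=0)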
1 Dimension reduction
Often, individual features are low SNR.
Random projections (will average features): sklearn.random_projection
– random linear combinations of the features (short example below)
Fast – sub-optimal – clustering of features: sklearn.cluster.WardAgglomeration
– on images: a super-pixel strategy
Hashing, when observations have varying size (e.g. words):
sklearn.feature_extraction.text.HashingVectorizer
– stateless: can be used in parallel
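A short random-projection sketch with scikit-learn (the data and sizes here are made up for illustration):

import numpy as np
from sklearn.random_projection import SparseRandomProjection

X = np.random.normal(size=(1000, 10000))   # many features per observation

# Replace the 10000 features by 100 random linear combinations of them
projector = SparseRandomProjection(n_components=100, random_state=0)
X_small = projector.fit_transform(X)
print(X_small.shape)                       # (1000, 100)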
1 An example: randomized SVD – sklearn.utils.extmath.randomized_svd
One random projection + power iterations:

X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
2 On-line algorithms
Process the data one sample at a time.
Compute the mean of a gazillion numbers. Hard?
No: just do a running mean.
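A minimal sketch of the running mean; each update uses constant memory, so the stream can be arbitrarily long:

def running_mean(stream):
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # incremental update: no need to store the data
    return mean

# Works on any iterable, e.g. a generator reading from disk
print(running_mean(float(i) for i in range(1000000)))   # 499999.5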
2 Convergence: statistics and speed
If the data are i.i.d., the estimate converges to the expectation.
Mini-batch = bunch observations together.
Trade-off between memory usage and vectorization.

Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2)                            11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)    0.62 s
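MiniBatchKMeans also exposes partial_fit, so the mini-batches can come from anywhere; a sketch with synthetic chunks standing in for data read from disk:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=10)
rng = np.random.RandomState(0)
for _ in range(100):
    chunk = rng.normal(size=(1000, 200))   # one mini-batch at a time
    km.partial_fit(chunk)                  # updates the centers in place
print(km.cluster_centers_.shape)           # (10, 200)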
3 Parallel processing patterns
Focus on embarrassingly parallel for loops: life is too short to worry about deadlocks (a minimal example follows).
Workers compete for data access: the memory bus is a bottleneck; on grids: distributed storage.
The right grain of parallelism: too fine ⇒ overhead; too coarse ⇒ memory shortage.
Scale by the relevant cache pool.
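The canonical embarrassingly parallel loop with joblib (this is the standard joblib idiom):

from math import sqrt
from joblib import Parallel, delayed

# 10 independent jobs dispatched to 2 workers; results come back in order
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)   # [0.0, 1.0, 2.0, ..., 9.0]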
3 Queues – the magic behind joblib.Parallel
Queues: high-performance, concurrency-friendly.
Difficulty: callback on result arrival ⇒ multiple threads in the caller + risk of deadlocks.
The dispatch queue should fill up “slowly” ⇒ pre_dispatch in joblib
⇒ Back-and-forth communication: door open to race conditions.
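pre_dispatch is exposed on joblib.Parallel itself; a sketch where consuming the generator lazily keeps memory bounded (the job function is a made-up stand-in):

import numpy as np
from joblib import Parallel, delayed

def expensive_job(seed):
    # Stand-in for a memory-hungry computation
    return np.random.RandomState(seed).normal(size=100000).sum()

# Only '2*n_jobs' tasks are pre-dispatched at any time, so the input
# generator is consumed slowly instead of being materialized at once
results = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(expensive_job)(i) for i in range(20))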
3 What happens where: grand-central dispatch?
joblib design: caller, dispatch queue, and collect queue live in the same process.
Benefit: robustness.
Grand-central-dispatch design: the dispatch queue has a process of its own.
Benefit: resource management in nested for loops.
4 Caching
For reproducible science: avoid manually chained scripts (make-like usage).
For performance: avoiding re-computation is the crux of optimization.
4 The joblib approach: the memoize pattern

import joblib
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)    # f is the function to memoize
b = g(a)            # computes b = f(a) and stores the result
c = g(a)            # retrieves the result from the store

Challenges in the context of big data: a & b are big.
Design goals: a & b are arbitrary Python objects; no dependencies;
drop-in, framework-less code for caching.
4 Efficient input-argument hashing – joblib.hash
Compute the md5* of the input arguments.
Implementation:
1. Create an md5 hash object
2. Subclass the standard-library pickler
= a state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass the data pointer to the md5 algorithm (“update” method)
- the rest: pickle
4. Update the md5 with the pickle

* md5 is in the Python standard library
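A toy version of the idea (not the real joblib.hash, which avoids the copy below by streaming the data pointer to md5's “update” method):

import hashlib
import pickle
import numpy as np

def toy_hash(obj):
    md5 = hashlib.md5()
    if isinstance(obj, np.ndarray):
        # ndarrays: hash the raw data (the copy here is what joblib avoids)
        md5.update(np.ascontiguousarray(obj).tobytes())
    else:
        # the rest: pickle, then update the md5 with the pickle
        md5.update(pickle.dumps(obj))
    return md5.hexdigest()

print(toy_hash(np.arange(10)))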
4 Fast, disk-based, concurrent store – joblib.dump
Persisting arbitrary objects: once again, subclass the pickler.
Use .npy for large numpy arrays (np.save), pickle for the rest
⇒ multiple files.
Store concurrency issues. Strategy: atomic operations + try/except.
Renaming a directory is atomic; the directory layout stays consistent with remove operations.
Good performance, usable on shared disks (clusters).
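A sketch of the atomic-rename strategy (the helper and paths are hypothetical, not joblib's actual code): write into a temporary directory, then rename it into place, so concurrent readers never see a half-written store:

import os
import shutil
import tempfile

def atomic_store(write_payload, target_dir):
    # Write everything into a scratch directory on the same filesystem
    tmp_dir = tempfile.mkdtemp(dir=os.path.dirname(target_dir))
    try:
        write_payload(tmp_dir)          # dump .npy files, pickles, ...
        os.rename(tmp_dir, target_dir)  # atomic publication on POSIX
    except OSError:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise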
5 Fast I/O
Fast read-outs, for out-of-core computing.
5 Making I/O fast
Fast compression: the CPU may be faster than disk access.
Chunk data for access patterns (PyTables).
Standard library: zlib.compress with buffers (bypass the gzip module to work on-line + in-memory).
Avoiding copies: zlib.compress needs C-contiguous buffers.
Store the raw buffer + meta-information (strides, class...):
- use __reduce__
- rebuild: np.core.multiarray._reconstruct
(not in PyTables)
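A sketch of in-memory compression without an intermediate copy, using only the standard library (metadata handling is simplified to dtype and shape):

import zlib
import numpy as np

a = np.arange(1000000, dtype=np.float64)

# zlib.compress accepts any C-contiguous buffer: no gzip file wrapper,
# no intermediate byte-string copy of the data
compressed = zlib.compress(a.data, 1)    # level 1: fast compression

# Rebuild from the raw bytes + the stored meta-information
b = np.frombuffer(zlib.decompress(compressed),
                  dtype=a.dtype).reshape(a.shape)
assert (a == b).all()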
5 Benchmarking against np.save and PyTables
[Figure: benchmark on NeuroImaging data (MNI atlas); y-axis scale: 1 is np.save]
@GaelVaroquaux
Summing up
5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
The cost of complexity is underestimated.
Know your problem & solve it with simple primitives.
Python modules:
scikit-learn: machine learning
joblib: pipeline-ish patterns
Come work with me! Positions available.