Processing biggish data on commodity hardware: simple Python patterns
DESCRIPTION
SciPy 2013 talk on simple Python patterns to efficiently process large datasets. The talk focuses on the patterns and the concepts rather than on the implementations. The implementations can be found in the joblib and scikit-learn codebases.
TRANSCRIPT
Processing biggish data on commodity hardware
Simple Python patterns
Gaël Varoquaux, INRIA/Parietal – Neurospin
Disclaimer: I’m French, I have opinions. We’re in Texas, I hope y’all have left your guns outside.
Yeah, I know, Texas is bigger than France
“Big data”: petabytes... Distributed storage. Computing clusters.
Mere mortals: gigabytes... Python programming. Off-the-shelf computers: ∼ 16 CPUs, 32 GB RAM.
My tools
Python, what else? + NumPy + SciPy
The ndarray is underused by the data community.
My tools
Python, what else? Patterns in this presentation:
scikit-learn: machine learning in Python
joblib: using Python functions as pipeline jobs
Design philosophy
1. Fail gracefully: easy to debug, robust to errors.
2. Don’t solve hard problems: the original problem can be bent.
3. Dependencies suck: distribution is an age-old problem.
4. Performance matters: waiting kills productivity.
Processing big data: speed-ups in Hadoop, CPUs...
Execution pipelines: dataflow programming, parallel computing
Data access: storing, caching
Pipelines can get messy; databases are tedious.
5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
Big how? 2 scenarios:
Many observations – samples – e.g. Twitter
Many descriptors per observation – features – e.g. brain scans
1 On the fly data reduction
Big data is often I/O bound
Layer memory access: CPU caches, RAM, local disks, distant storage
Less data also means less work
1 Dropping data
The number one technique used to handle large datasets (see the sketch below):
1 loop: take a random fraction of the data
2 run the algorithm on that fraction
3 aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation.
Performance tip: run the loop in parallel.
Exploits redundancy across observations. Great when the number of samples is large.
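A minimal sketch of this loop, under illustrative assumptions (the 10% fraction and the column-mean "estimator" are placeholders; any scikit-learn estimator would fit):

import numpy as np
from joblib import Parallel, delayed

def fit_on_fraction(X, fraction=0.1, seed=0):
    # 1: take a random fraction of the observations
    rng = np.random.RandomState(seed)
    n_samples = X.shape[0]
    subset = rng.choice(n_samples, size=int(fraction * n_samples),
                        replace=False)
    # 2: run the "algorithm" on that fraction (placeholder: column means)
    return X[subset].mean(axis=0)

X = np.random.normal(size=(100000, 50))
# Performance tip in action: run the sub-sampling loop in parallel
results = Parallel(n_jobs=2)(
    delayed(fit_on_fraction)(X, seed=i) for i in range(10))
# 3: aggregate the results across sub-samplings
estimate = np.mean(results, axis=0)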
1 Dimension reduction
Often, individual features are low SNR.
Random projections (will average features): sklearn.random_projection
– random linear combinations of the features (short example below)
Fast – sub-optimal – clustering of features: sklearn.cluster.WardAgglomeration
– on images: a super-pixel strategy
Hashing, when observations have varying size (e.g. words):
sklearn.feature_extraction.text.HashingVectorizer
– stateless: can be used in parallel
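A short random-projection sketch with scikit-learn (the data and sizes here are made up for illustration):

import numpy as np
from sklearn.random_projection import SparseRandomProjection

X = np.random.normal(size=(1000, 10000))   # many features per observation

# Replace the 10000 features by 100 random linear combinations of them
projector = SparseRandomProjection(n_components=100, random_state=0)
X_small = projector.fit_transform(X)
print(X_small.shape)                       # (1000, 100)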
1 An example: randomized SVD – sklearn.utils.extmath.randomized_svd
One random projection + power iterations:

X = np.random.normal(size=(50000, 200))
%timeit lapack = linalg.svd(X, full_matrices=False)
1 loops, best of 3: 6.09 s per loop
%timeit arpack = splinalg.svds(X, 10)
1 loops, best of 3: 2.49 s per loop
%timeit randomized = randomized_svd(X, 10)
1 loops, best of 3: 303 ms per loop
linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
0.0022360679774997738
linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
0.0022121161221386925
2 On-line algorithms
Process the data one sample at a time.
Compute the mean of a gazillion numbers. Hard?
No: just do a running mean.
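A minimal sketch of the running mean; each update uses constant memory, so the stream can be arbitrarily long:

def running_mean(stream):
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # incremental update: no need to store the data
    return mean

# Works on any iterable, e.g. a generator reading from disk
print(running_mean(float(i) for i in range(1000000)))   # 499999.5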
2 Convergence: statistics and speed
If the data are i.i.d., the estimate converges to the expectation.
Mini-batch = bunch observations together.
Trade-off between memory usage and vectorization.

Example: K-Means clustering
X = np.random.normal(size=(10000, 200))
scipy.cluster.vq.kmeans(X, 10, iter=2)                            11.33 s
sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)    0.62 s
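MiniBatchKMeans also exposes partial_fit, so the mini-batches can come from anywhere; a sketch with synthetic chunks standing in for data read from disk:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=10)
rng = np.random.RandomState(0)
for _ in range(100):
    chunk = rng.normal(size=(1000, 200))   # one mini-batch at a time
    km.partial_fit(chunk)                  # updates the centers in place
print(km.cluster_centers_.shape)           # (10, 200)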
3 Parallel processing patterns
Focus on embarrassingly parallel for loops: life is too short to worry about deadlocks (a minimal example follows).
Workers compete for data access: the memory bus is a bottleneck; on grids: distributed storage.
The right grain of parallelism: too fine ⇒ overhead; too coarse ⇒ memory shortage.
Scale by the relevant cache pool.
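The canonical embarrassingly parallel loop with joblib (this is the standard joblib idiom):

from math import sqrt
from joblib import Parallel, delayed

# 10 independent jobs dispatched to 2 workers; results come back in order
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
print(results)   # [0.0, 1.0, 2.0, ..., 9.0]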
3 Queues – the magic behind joblib.Parallel
Queues: high-performance, concurrency-friendly.
Difficulty: callback on result arrival ⇒ multiple threads in the caller + risk of deadlocks.
The dispatch queue should fill up “slowly” ⇒ pre_dispatch in joblib
⇒ Back-and-forth communication: door open to race conditions.
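pre_dispatch is exposed on joblib.Parallel itself; a sketch where consuming the generator lazily keeps memory bounded (the job function is a made-up stand-in):

import numpy as np
from joblib import Parallel, delayed

def expensive_job(seed):
    # Stand-in for a memory-hungry computation
    return np.random.RandomState(seed).normal(size=100000).sum()

# Only '2*n_jobs' tasks are pre-dispatched at any time, so the input
# generator is consumed slowly instead of being materialized at once
results = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
    delayed(expensive_job)(i) for i in range(20))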
3 What happens where: grand-central dispatch?
joblib design: caller, dispatch queue, and collect queue live in the same process.
Benefit: robustness.
Grand-central-dispatch design: the dispatch queue has a process of its own.
Benefit: resource management in nested for loops.
4 Caching
For reproducible science: avoid manually chained scripts (make-like usage).
For performance: avoiding re-computation is the crux of optimization.
4 The joblib approach: the memoize pattern

import joblib
mem = joblib.Memory(cachedir='.')
g = mem.cache(f)    # f is the function to memoize
b = g(a)            # computes b = f(a) and stores the result
c = g(a)            # retrieves the result from the store

Challenges in the context of big data: a & b are big.
Design goals: a & b are arbitrary Python objects; no dependencies;
drop-in, framework-less code for caching.
4 Efficient input-argument hashing – joblib.hash
Compute the md5* of the input arguments.
Implementation:
1. Create an md5 hash object
2. Subclass the standard-library pickler
= a state machine that walks the object graph
3. Walk the object graph:
- ndarrays: pass the data pointer to the md5 algorithm (“update” method)
- the rest: pickle
4. Update the md5 with the pickle

* md5 is in the Python standard library
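A toy version of the idea (not the real joblib.hash, which avoids the copy below by streaming the data pointer to md5's “update” method):

import hashlib
import pickle
import numpy as np

def toy_hash(obj):
    md5 = hashlib.md5()
    if isinstance(obj, np.ndarray):
        # ndarrays: hash the raw data (the copy here is what joblib avoids)
        md5.update(np.ascontiguousarray(obj).tobytes())
    else:
        # the rest: pickle, then update the md5 with the pickle
        md5.update(pickle.dumps(obj))
    return md5.hexdigest()

print(toy_hash(np.arange(10)))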
4 Fast, disk-based, concurrent store – joblib.dump
Persisting arbitrary objects: once again, subclass the pickler.
Use .npy for large numpy arrays (np.save), pickle for the rest
⇒ multiple files.
Store concurrency issues. Strategy: atomic operations + try/except.
Renaming a directory is atomic; the directory layout stays consistent with remove operations.
Good performance, usable on shared disks (clusters).
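A sketch of the atomic-rename strategy (the helper and paths are hypothetical, not joblib's actual code): write into a temporary directory, then rename it into place, so concurrent readers never see a half-written store:

import os
import shutil
import tempfile

def atomic_store(write_payload, target_dir):
    # Write everything into a scratch directory on the same filesystem
    tmp_dir = tempfile.mkdtemp(dir=os.path.dirname(target_dir))
    try:
        write_payload(tmp_dir)          # dump .npy files, pickles, ...
        os.rename(tmp_dir, target_dir)  # atomic publication on POSIX
    except OSError:
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise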
5 Fast I/O
Fast read-outs, for out-of-core computing.
5 Making I/O fast
Fast compression: the CPU may be faster than disk access.
Chunk data for access patterns (PyTables).
Standard library: zlib.compress with buffers (bypass the gzip module to work on-line + in-memory).
Avoiding copies: zlib.compress needs C-contiguous buffers.
Store the raw buffer + meta-information (strides, class...):
- use __reduce__
- rebuild: np.core.multiarray._reconstruct
(not in PyTables)
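A sketch of in-memory compression without an intermediate copy, using only the standard library (metadata handling is simplified to dtype and shape):

import zlib
import numpy as np

a = np.arange(1000000, dtype=np.float64)

# zlib.compress accepts any C-contiguous buffer: no gzip file wrapper,
# no intermediate byte-string copy of the data
compressed = zlib.compress(a.data, 1)    # level 1: fast compression

# Rebuild from the raw bytes + the stored meta-information
b = np.frombuffer(zlib.decompress(compressed),
                  dtype=a.dtype).reshape(a.shape)
assert (a == b).all()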
5 Benchmarking against np.save and PyTables
[Figure: benchmark on NeuroImaging data (MNI atlas); y-axis scale: 1 is np.save]
@GaelVaroquaux
Summing up
5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
The cost of complexity is underestimated.
Know your problem & solve it with simple primitives.
Python modules:
scikit-learn: machine learning
joblib: pipeline-ish patterns
Come work with me! Positions available.