Building a cutting-edge data processing environment on a budget

DESCRIPTION
As a penniless academic I wanted to do "big data" for science. Open source, Python, and simple patterns were the way forward. Staying on top of today's growing datasets is an arms race. Data analytics machinery (clusters, NoSQL, visualization, Hadoop, machine learning, ...) can spread a team's resources thin. Focusing on simple patterns, lightweight technologies, and a good understanding of the applications gets us most of the way for a fraction of the cost. I will present a personal perspective on ten years of scientific data processing with Python. What are the emerging patterns in data processing? How can modern data-mining ideas be used without a big engineering team? What constraints and design trade-offs govern software projects like scikit-learn, Mayavi, or joblib? How can we make the most out of distributed hardware with simple framework-less code?

TRANSCRIPT
Building a cutting-edge data processing environment on a budget
Gael Varoquaux

This talk is not about rocket science!
Disclaimer: this talk is as much about people and projects as it is about code and algorithms.
Growing up as a penniless academic

I did a PhD in quantum physics: vacuum (leaks), electronics (shorts), lasers (mis-alignment).
Best training ever for agile project management.
Computers were only one of the many moving parts: Matlab, instrument control.
This shaped my vision of computing as a means to an end.
Growing up as a penniless academic

2011: tenured researcher in computer science.
Today: a growing team with data science rock stars.
1 Using machine learning to understand brain function

Link neural activity to thoughts and cognition.
1 Functional MRI

Recordings of brain activity over time.
1 Cognitive NeuroImaging

Learn a bilateral link between brain activity and cognitive function.
1 Encoding models of stimuli

Predicting neural response → a window into brain representations of stimuli.
"Feature engineering": a description of the world.
1 Decoding brain activity

"Brain reading."
1 Data processing feats

Visual image reconstruction from human brain activity [Miyawaki, et al. (2008)]: "brain reading."

"If it's not open and verifiable by others, it's not science, or engineering..." (Stodden, 2010)

Make it work, make it right, make it boring.
Code, data, ... just works™
http://nilearn.github.io/auto_examples/plot_miyawaki_reconstruction.html
http://nilearn.github.io

A software development challenge.
1 Data accumulation

When data processing is routine... "big data" for rich models of brain function.
Accumulation of scientific knowledge, and learning formal representations.

"A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observations on the basis of a model that contains only a few arbitrary elements, and it must make definite predictions about the results of future observations."
Stephen Hawking, A Brief History of Time.
1 Petty day-to-day technicalities

Buggy code. Slow code.
The lead data scientist leaves. A new intern to train.
I don't understand the code I wrote a year ago.

A lab is no different from a startup.
Difficulties: recruitment, limited resources (people & hardware).
Risks: bus factor, technical debt.

Our mission is to revolutionize brain data processing on a tight budget.
2 Patterns in data processing
2 The data processing workflow: agile

Interaction... → script... → module... → interaction again...
Consolidation, progressively.
Low tech and short turn-around times.
2 From statistics to statistical learning

Paradigm shift as the dimensionality of data grows: # features, not only # samples.
From parameter inference to prediction.
Statistical learning is spreading everywhere.
3 Let's just make software to solve all these problems.
© Theodore W. Gray
3 Design philosophy

1. Don't solve hard problems. The original problem can be bent.
2. Easy setup, works out of the box. Installing software sucks. Convention over configuration.
3. Fail gracefully. Robust to errors. Easy to debug.
4. Quality, quality, quality. What's not excellent won't be used.

Not "one software to rule them all": break down projects by expertise.
Vision: machine learning without learning the machinery

A black box that can be opened.
The right trade-off between "just works" and versatility (think Apple vs Linux).

We're not going to solve all the problems for you: I don't solve hard problems.
Feature engineering, domain-specific cases... Python is a programming language. Use it.
Cover 80% of the use cases in one package.
3 Performance in high-level programming

High-level programming is what keeps us alive and kicking.

The secret sauce:
- Optimize algorithms, not for loops.
- Know NumPy and SciPy perfectly: significant data should be arrays/memoryviews; avoid memory copies; rely on BLAS/LAPACK.
- line-profiler / memory-profiler (scipy-lectures.github.io)
- Cython, not C/C++.

Example: hierarchical clustering, PR #2199
1. Take the 2 closest clusters.
2. Merge them.
3. Update the distance matrix.
...
Faster with constraints: a sparse distance matrix.
- Keep a heap queue of distances: cheap minimum.
- Need a sparse, growable structure for neighborhoods: a skip-list in Cython! O(log n) insert, remove, access.
- Bind C++ map[int, float] with Cython.
- Fast traversal, possibly in Cython, for step 3.
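The "optimize algorithms, not for loops" point can be illustrated with a stdlib-only toy (the function names are hypothetical, not from the talk): tuning the inner loop of a quadratic algorithm buys a constant factor, while changing the algorithm changes the complexity class.

```python
import random

def closest_gap_naive(xs):
    # O(n^2): however much you micro-optimize this loop, it stays quadratic.
    best = float('inf')
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            best = min(best, abs(xs[i] - xs[j]))
    return best

def closest_gap_sorted(xs):
    # O(n log n): change the algorithm, not the loop.
    # After sorting, the closest pair of values is adjacent.
    s = sorted(xs)
    return min(b - a for a, b in zip(s, s[1:]))

xs = [random.random() for _ in range(500)]
assert closest_gap_naive(xs) == closest_gap_sorted(xs)
```

The same reasoning drives the skip-list in the hierarchical-clustering PR above: a better data structure turns an expensive scan into an O(log n) operation.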
3 Architecture of a data-manipulation toolkit

Separate data from operations, but keep an imperative-like language.
(Figure: data arrays flowing between bokeh, chaco, Hadoop, Mayavi, CPUs.)

The object API exposes a data-processing language: fit, predict, transform, score, partial_fit.
Objects are instantiated without data, but with all the parameters.
Objects for pipelining, merging, etc.

Related ideas: the configuration/run pattern (traits, pyre); currying in functional programming (functools.partial); the MVC pattern.
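The "instantiated without data but with all the parameters" design can be sketched with a minimal, self-contained estimator in the scikit-learn style (the class and its `shrinkage` parameter are hypothetical illustrations, not part of any library):

```python
class MeanPredictor:
    """Toy estimator in the fit/predict style: parameters at
    construction time, data only at fit time."""

    def __init__(self, shrinkage=0.0):
        self.shrinkage = shrinkage   # a parameter; no data yet

    def fit(self, X, y):
        # Learned state conventionally gets a trailing underscore.
        self.mean_ = sum(y) / len(y) * (1.0 - self.shrinkage)
        return self                  # enables chaining: est.fit(X, y).predict(X)

    def predict(self, X):
        return [self.mean_ for _ in X]

est = MeanPredictor()                        # no data involved here
est.fit([[0], [1], [2]], [1.0, 2.0, 3.0])    # data arrives at fit time
assert est.predict([[5]]) == [2.0]
```

Because every object speaks the same small language (fit, predict, ...), objects compose: a pipeline is just another object that forwards these verbs.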
4 Big data on small hardware

Biggish... smallish.
"Big data": petabytes, distributed storage, computing clusters.
Mere mortals: gigabytes, Python programming, off-the-shelf computers.
4 On-line algorithms

Process the data one sample at a time.
Compute the mean of a gazillion numbers. Hard?
No: just do a running mean.
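The running mean mentioned above fits in a few lines of plain Python (a sketch, not code from the talk): one pass, constant memory, one sample at a time.

```python
def running_mean(stream):
    """On-line mean: one pass over the data, O(1) memory."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # incremental update, no big sums kept around
    return mean

# Works on any iterable, including one too large to hold in memory.
assert abs(running_mean(iter(range(10_001))) - 5000.0) < 1e-6
```

The incremental form `mean += (x - mean) / n` also avoids accumulating one huge sum, which helps numerically.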
4 On-line algorithms

Converges to expectations.
Mini-batch = bunch observations for vectorization.

Example: K-Means clustering
    X = np.random.normal(size=(10000, 200))

    scipy.cluster.vq.kmeans(X, 10, iter=2)            # 11.33 s
    sklearn.cluster.MiniBatchKMeans(n_clusters=10,
                                    n_init=2).fit(X)  #  0.62 s
4 On-the-fly data reduction

Big data is often I/O bound.
Layer memory access: CPU caches, RAM, local disks, distant storage.
Less data also means less work.
4 On-the-fly data reduction: dropping data

1. Loop: take a random fraction of the data.
2. Run the algorithm on that fraction.
3. Aggregate results across sub-samplings.

Looks like bagging: bootstrap aggregation.
Exploits redundancy across observations.
Run the loop in parallel.
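The three steps above can be sketched with the standard library alone (the function name `subsample_aggregate` and its parameters are hypothetical, chosen for the illustration):

```python
import random
from statistics import median

def subsample_aggregate(data, estimator, n_runs=20, fraction=0.1, seed=0):
    """Run `estimator` on random fractions of the data and aggregate
    the per-run results with the median (bagging-like)."""
    rng = random.Random(seed)
    k = max(1, int(fraction * len(data)))
    # Each run is independent: this loop is embarrassingly parallel.
    results = [estimator(rng.sample(data, k)) for _ in range(n_runs)]
    return median(results)

rng = random.Random(1)
data = [rng.gauss(10.0, 2.0) for _ in range(10_000)]
est = subsample_aggregate(data, estimator=lambda s: sum(s) / len(s))
assert abs(est - 10.0) < 0.2   # redundancy across observations pays off
```

Because each sub-sampling is independent, the loop parallelizes trivially, which is exactly the "run the loop in parallel" point.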
4 On-the-fly data reduction, continued

Random projections (will average features): sklearn.random_projection
    random linear combinations of the features.

Fast clustering of features: sklearn.cluster.WardAgglomeration
    on images: a super-pixel strategy.

Hashing when observations have varying size (e.g. words):
    sklearn.feature_extraction.text.HashingVectorizer
    stateless: can be used in parallel.
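The hashing trick behind HashingVectorizer can be sketched with the standard library alone (this `hashing_vectorizer` function is a hypothetical illustration, not the sklearn class):

```python
import zlib

def hashing_vectorizer(tokens, n_features=16):
    """Hashing trick: map a variable-length token list to a
    fixed-size count vector, with no vocabulary to store."""
    vec = [0] * n_features
    for tok in tokens:
        h = zlib.crc32(tok.encode('utf-8'))  # any cheap stable hash works
        vec[h % n_features] += 1
    return vec

a = hashing_vectorizer("the cat sat on the mat".split())
assert len(a) == 16 and sum(a) == 6
# Stateless: the same token always lands in the same bucket, so
# independent workers can vectorize shards of a corpus in parallel.
assert hashing_vectorizer(["cat"]) == hashing_vectorizer(["cat"])
```

Statelessness is what makes this parallel-friendly: there is no fitted vocabulary to synchronize between workers.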
4 On-the-fly data reduction: randomized SVD

Random projection: sklearn.utils.extmath.randomized_svd

    X = np.random.normal(size=(50000, 200))
    %timeit lapack = linalg.svd(X, full_matrices=False)
    1 loops, best of 3: 6.09 s per loop
    %timeit arpack = splinalg.svds(X, 10)
    1 loops, best of 3: 2.49 s per loop
    %timeit randomized = randomized_svd(X, 10)
    1 loops, best of 3: 303 ms per loop

    linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
    0.0022360679774997738
    linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
    0.0022121161221386925
4 Biggish iron

Our new box: 15 k€
48 cores, 384 GB RAM, 70 TB storage (SSD cache on the RAID controller).

It gets our work done faster than our 800-CPU cluster. It's the access patterns!

"Nobody ever got fired for using Hadoop on a cluster" (A. Rowstron et al., HotCDP '12)
5 Avoiding the framework
joblib
5 Parallel processing: the big picture

Focus on embarrassingly parallel for loops. Life is too short to worry about deadlocks.
Workers compete for data access: the memory bus is a bottleneck.
The right grain of parallelism: too fine → overhead; too coarse → memory shortage.
Scale by the relevant cache pool.
5 Parallel processing: joblib

Focus on embarrassingly parallel for loops. Life is too short to worry about deadlocks.

    >>> from math import sqrt
    >>> from joblib import Parallel, delayed
    >>> Parallel(n_jobs=2)(delayed(sqrt)(i**2)
    ...                    for i in range(8))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
5 Parallel processing: joblib

IPython, multiprocessing, celery, MPI? joblib is higher-level:
- No dependencies, works everywhere.
- Better traceback reporting.
- Memmapping arrays to share memory (O. Grisel).
- On-the-fly dispatch of jobs: memory-friendly.
- Threads or processes backend.
5 Parallel processing: queues

Queues: high-performance, concurrent-friendly.
Difficulty: callback on result arrival → multiple threads in the caller + risk of deadlocks.
The dispatch queue should fill up "slowly" → pre_dispatch in joblib.
→ Back-and-forth communication; the door is open to race conditions.
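The "fill the dispatch queue slowly" idea can be sketched serially with the standard library (this `bounded_dispatch` function is a hypothetical, single-threaded stand-in for joblib's real threaded pre-dispatch mechanism): only a bounded window of jobs is drawn from the generator, and each finished job pulls in exactly one new one.

```python
import itertools

def bounded_dispatch(job_iter, n_slots):
    """Keep at most n_slots jobs in flight, so an arbitrarily long
    (or memory-hungry) job generator never floods the queue."""
    in_flight = list(itertools.islice(job_iter, n_slots))  # pre-dispatch
    results = []
    while in_flight:
        job = in_flight.pop(0)
        results.append(job())          # a "worker" finishes a job...
        nxt = next(job_iter, None)     # ...and exactly one new job is drawn
        if nxt is not None:
            in_flight.append(nxt)
    return results

jobs = ((lambda i=i: i * i) for i in range(6))
assert bounded_dispatch(jobs, n_slots=2) == [0, 1, 4, 9, 16, 25]
```

This is the back-and-forth communication mentioned above: completion events drive dispatch, which is where the race-condition risk comes from in the concurrent version.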
5 Parallel processing: what happens where

joblib design: caller, dispatch queue, and collect queue live in the same process. Benefit: robustness.
Grand Central Dispatch design: the dispatch queue has a process of its own. Benefit: resource management in nested for loops.
5 Caching: the joblib approach

For reproducibility: avoid manually chained scripts (make-like usage).
For performance: avoiding re-computation is the crux of optimization.

The memoize pattern:
    mem = joblib.Memory(cachedir='.')
    g = mem.cache(f)
    b = g(a)  # computes b from a using f
    c = g(a)  # retrieves the result from the store

Challenges in the context of big data: a & b are big.
Design goals: a & b arbitrary Python objects; no dependencies; drop-in, framework-less code.

Lego bricks for out-of-core algorithms, coming soon:
    >>> result = g.call_and_shelve(a)
    >>> result
    MemorizedResult(cachedir="...", func="g...", argument_hash="...")
    >>> c = result.get()
5 Efficient input-argument hashing: joblib.hash

Compute the md5* of the input arguments.
Trade-off between features and cost: black-boxy, but robust and completely generic.

Implementation:
1. Create an md5 hash object.
2. Subclass the standard-library pickler = a state machine that walks the object graph.
3. Walk the object graph:
   - ndarrays: pass the data pointer to the md5 algorithm ("update" method);
   - the rest: pickle.
4. Update the md5 with the pickle.

* md5 is in the Python standard library.
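The pickle-into-md5 idea can be sketched with the standard library alone (a simplified illustration, not joblib's actual code: `arg_hash` and `_DigestSink` are hypothetical names, and the real joblib.hash additionally special-cases ndarrays by feeding their raw data buffer to md5):

```python
import hashlib
import pickle

class _DigestSink:
    """File-like object that routes every written pickle chunk into md5."""
    def __init__(self, digest):
        self._digest = digest

    def write(self, data):
        self._digest.update(data)

def arg_hash(*args, **kwargs):
    """Hash arbitrary picklable arguments by streaming their pickle
    bytes into an md5 digest, never materializing the full pickle."""
    digest = hashlib.md5()
    pickler = pickle.Pickler(_DigestSink(digest), protocol=2)
    pickler.dump((args, sorted(kwargs.items())))  # sort kwargs for stability
    return digest.hexdigest()

assert arg_hash(1, x=[1, 2]) == arg_hash(1, x=[1, 2])
assert arg_hash(1) != arg_hash(2)
```

Streaming the pickle through a sink is what keeps the cost low for big arguments: no intermediate bytes object of the object's full size is ever built.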
5 Fast, disk-based, concurrent store: joblib.dump

Persisting arbitrary objects: once again, subclass the pickler.
Use .npy for large numpy arrays (np.save), pickle for the rest → multiple files.

Store concurrency issues. Strategy: atomic operations + try/except.
Renaming a directory is atomic; keep the directory layout consistent with remove operations.
Good performance, usable on shared disks (clusters).
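The atomic-rename strategy can be sketched for a single file with the standard library (joblib's store actually renames whole directories holding several files; `atomic_dump` here is a hypothetical single-file illustration of the same idea):

```python
import os
import pickle
import tempfile

def atomic_dump(obj, path):
    """Persist obj so concurrent readers never see a half-written file:
    pickle into a temp file in the same directory, then rename.
    os.replace() is atomic within a single filesystem."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(obj, f)
        os.replace(tmp, path)  # atomic swap: readers see old or new, never partial
    except BaseException:
        os.unlink(tmp)         # never leave a broken temp file behind
        raise

target = os.path.join(tempfile.mkdtemp(), 'result.pkl')
atomic_dump({'answer': 42}, target)
with open(target, 'rb') as f:
    assert pickle.load(f) == {'answer': 42}
```

The temp file must live in the same directory as the target: a rename across filesystems is a copy, not an atomic operation.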
5 Making I/O fast

Fast compression: the CPU may be faster than disk access, in particular in parallel.
Standard library: zlib.compress with buffers (bypass the gzip module to work online + in-memory).

Avoiding copies: zlib.compress wants C-contiguous buffers; copyless storage of the raw buffer + meta-information (strides, class...).

Single-file dump coming soon: file opening is slow on clusters. Challenge: streaming the above to bound memory usage.

What matters on large systems:
- number of bytes stored (brings the network/SATA bus down);
- memory usage (brings compute nodes down);
- number of atomic file accesses (brings shared storage down).
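The "bypass gzip, compress buffers in memory" point looks like this with the standard library (a generic sketch; the zero-filled buffer stands in for an array's raw data, e.g. what a memoryview of an ndarray would expose):

```python
import zlib

# zlib.compress works directly on an in-memory, C-contiguous buffer:
# no file object, no gzip framing, fully on-line.
raw = bytes(1_000_000)            # stand-in for a large array's data buffer
packed = zlib.compress(raw, 3)    # low level: favor speed over ratio
restored = zlib.decompress(packed)

assert restored == raw
assert len(packed) < len(raw)     # fewer bytes stored = less bus traffic
```

On compressible scientific data, trading a little CPU for far fewer stored bytes is often a net win, especially when many workers hit the same disk or network in parallel.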
5 Benchmarking against np.save and pytables

(Figure: I/O benchmark on NeuroImaging data, MNI atlas; y-axis scale: 1 is np.save.)
6 The bigger picture: building an ecosystem

Helping your future self.
6 Community-based development in scikit-learn

Huge feature set: the benefit of a large team.
Project growth: more than 200 contributors, ~12 core contributors, and 1 full-time INRIA programmer from the start.

Estimated cost of development: $6 million
(COCOMO model, http://www.ohloh.net/p/scikit-learn)
6 The economics of open source

Code maintenance is too expensive to shoulder alone:
scikit-learn ~300 emails/month; nipy ~45 emails/month; joblib ~45 emails/month; mayavi ~30 emails/month.

"Hey Gael, I take it you're too busy. That's okay, I spent a day trying to install XXX and I think I'll succeed myself. Next time though please don't ignore my emails, I really don't like it. You can say, 'sorry, I have no time to help you.' Just don't ignore."

Your "benefits" come from a fraction of the code. Data loading? Maybe. Standard algorithms? Nah.
Share the common code... to avoid dying under code.
Code becomes less precious with time, and somebody might contribute features.
6 Many eyes make code fast

Bench WiseRF, anybody?
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
6 Six steps to a community-driven project

1. Focus on quality.
2. Build great docs and examples.
3. Use GitHub.
4. Limit the technicality of your codebase.
5. Releasing and packaging matter.
6. Focus on your contributors: give them credit and decision power.

http://www.slideshare.net/GaelVaroquaux/scikit-learn-dveloppement-communautaire
6 Core project contributors

(Figure: normalized number of commits since 2009-06, per individual committer.
Credit: Fernando Perez, Gist 5843625)
6 The tragedy of the commons

"Individuals, acting independently and rationally according to each one's self-interest, behave contrary to the whole group's long-term best interests by depleting some common resource." (Wikipedia)

Make it work, make it right, make it boring.
Core (boring) projects are taken for granted → hard to fund, less excitement.
They need citation, in papers & on corporate web pages.
@GaelVaroquaux

Solving problems that matter

The 80/20 rule: 80% of the use cases can be solved with 20% of the lines of code.
scikit-learn, joblib, nilearn, ... I hope.
@GaelVaroquaux

Cutting-edge ... environment ... on a budget

1. Set the goals right. Don't solve hard problems. What's your original problem?
2. Use the simplest technological solutions possible. Be very technically sophisticated; don't use that sophistication.
3. Don't forget the human factors: with your users (documentation), and with your contributors.

A perfect design?