It's The Memory, Stupid! or: How I Learned to Stop Worrying about CPU Speed and Love Memory Access. Francesc Alted, Software Architect. Big Data Spain 2012, Madrid (Spain), November 16, 2012


DESCRIPTION

Session presented at the Big Data Spain 2012 Conference, 16th Nov 2012, ETSI Telecomunicacion, UPM, Madrid. www.bigdataspain.org. More info: http://www.bigdataspain.org/es-2012/conference/memory-efficient-applications/francesc-alted

TRANSCRIPT

Page 1

It's The Memory, Stupid!

or: How I Learned to Stop Worrying about CPU Speed

and Love Memory Access

Francesc Alted, Software Architect

Big Data Spain 2012, Madrid (Spain) November 16, 2012

Page 2

About Continuum Analytics

• Develop new ways to store, compute, and visualize data.

• Provide open technologies for Data Integration on a massive scale.

• Provide software tools, training, and integration/consulting services to corporate, government, and educational clients worldwide.

Page 3

Overview

• The Era of ‘Big Data’

• A few words about Python and NumPy

• The Starving CPU problem

• Choosing optimal containers for Big Data

Page 4

The Dawn of ‘Big Data’

“A wind of streaming data, social data and unstructured data is knocking at

the door, and we're starting to let it in. It's a scary place at the moment.”

-- Unidentified bank IT executive, as quoted by “The American Banker”

Page 5

Challenges

• We have to deal with as much data as possible using limited resources

• So we must use our computational resources optimally to get the most out of Big Data

Page 6

Interactivity and Big Data

• Interactivity is crucial for handling data

• Interactivity and performance are crucial for handling Big Data

Page 7

Python and ‘Big Data’

• Python is an interpreted language and hence offers interactivity

• Myth: “Python is slow, so why the hell are you going to use it for Big Data?”

• Answer: Python has access to an incredibly powerful range of libraries that boost its performance far beyond your expectations

• ...and during this talk I will prove it!

Page 8

NumPy: A ‘De Facto’ Standard Container

Page 9

Page 10

Operating with NumPy

• array[2]; array[1, 1:5, :]; array[[3,6,10]]

• (array1**3 / array2) - sin(array3)

• numpy.dot(array1, array2): access to optimized BLAS (*GEMM) functions

• and much more...
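A short, runnable sketch of these operations (array names and shapes are illustrative, not from the slides):

import numpy as np

array = np.arange(60).reshape(3, 4, 5)
array[2]                      # a 4x5 sub-array
array[1, 1:5, :]              # a slice of plane 1
flat = np.arange(20)
flat[[3, 6, 10]]              # fancy indexing: elements 3, 6 and 10

# Element-wise expressions operate on whole arrays at once
array1, array2, array3 = (np.random.rand(1000) for _ in range(3))
result = (array1**3 / array2) - np.sin(array3)

# Matrix multiplication dispatches to optimized BLAS (*GEMM) routines
c = np.dot(np.random.rand(100, 100), np.random.rand(100, 100))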

Page 11

Nothing Is Perfect

• NumPy is just great for many use cases

• However, it also has its own deficiencies:

• Follows the Python evaluation order in complex expressions like (a * b) + c, materializing every intermediate result as a temporary array

• Does not have support for multiprocessors (except within BLAS computations)

Page 12

Numexpr: Dealing with Complex Expressions

• It comes with a specialized virtual machine for evaluating expressions

• It accelerates computations mainly through more efficient memory usage

• It supports extremely easy-to-use multithreading (active by default)
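For reference, a minimal usage sketch (evaluate and set_num_threads are numexpr's actual API; the array size and thread count are arbitrary choices of mine):

import numpy as np
import numexpr as ne

a, b, c = (np.random.rand(10*1000*1000) for _ in range(3))

ne.set_num_threads(4)              # multithreading is active by default
d = ne.evaluate("(a * b) + c")     # one blockwise pass, no large temporaries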

Page 13

Exercise (I)

Evaluate the polynomial:

0.25x^3 + 0.75x^2 + 1.5x - 2

in the range [-1, 1] with a step size of 2*10^-7, using both NumPy and numexpr.

Note: use a single processor for numexpr: numexpr.set_num_threads(1)
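A sketch of how the exercise could be set up (my code, not from the slides; timings vary by machine):

import numpy as np
import numexpr as ne
from time import time

x = np.linspace(-1, 1, 10*1000*1000)   # step size of about 2*10^-7

ne.set_num_threads(1)                  # single processor, as the note asks

t0 = time()
y_numpy = .25*x**3 + .75*x**2 + 1.5*x - 2
print("NumPy:   %.3f s" % (time() - t0))

t0 = time()
y_numexpr = ne.evaluate(".25*x**3 + .75*x**2 + 1.5*x - 2")
print("numexpr: %.3f s" % (time() - t0))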

Page 14

Exercise (II)

Rewrite the polynomial in this notation:

((0.25x + 0.75)x + 1.5)x - 2

and redo the computations. What happens?

Page 15

NumPy vs Numexpr (1 thread). Times to evaluate each expression, in seconds:

Expression                         NumPy   Numexpr
.25*x**3 + .75*x**2 - 1.5*x - 2    1.613   0.138
((.25*x + .75)*x - 1.5)*x - 2      0.301   0.110
x                                  0.052   0.045
sin(x)**2 + cos(x)**2              0.715   0.559

[Figure: two bar charts, “NumPy vs Numexpr (1 thread)” over the four expressions and “Time to evaluate polynomial (1 thread)” for the two polynomial forms; y axis: Time (s), from 0 to 1.8]

Page 16

Power Expansion

Numexpr expands the expression:

0.25x^3 + 0.75x^2 + 1.5x - 2

to:

0.25*x*x*x + 0.75*x*x + 1.5*x - 2

so there is no need to call the transcendental pow().
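An illustrative micro-benchmark of the same idea (my code; note that recent NumPy releases may already special-case small integer powers):

import numpy as np
from time import time

x = np.linspace(-1, 1, 10*1000*1000)

t0 = time(); y1 = x**3;   t_pow = time() - t0   # may go through a pow-style routine
t0 = time(); y2 = x*x*x;  t_mul = time() - t0   # two plain multiplications

print("x**3: %.3f s   x*x*x: %.3f s" % (t_pow, t_mul))
assert np.allclose(y1, y2)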

Page 17

Pending Question

• Why does numexpr continue to be 3x faster than NumPy, even when both are executing exactly the *same* number of operations?

Page 18

The Starving CPU Problem

“Across the industry, today’s chips are largely able to execute code faster than we can feed

them with instructions and data.”

– Richard Sites, after his article “It’s The Memory, Stupid!”,

Microprocessor Report, 10(10), 1996

Page 19

Memory Access Time vs CPU Cycle Time

Page 20

Book in 2009

Page 21

The Status of CPU Starvation in 2012

• Memory latency lags processor speed badly (by a factor of 250x to 500x).

• Memory bandwidth is improving at a better rate than memory latency, but it also lags processors (by 30x to 100x).

Page 22

CPU Caches to the Rescue

• CPU cache latency and throughput are much better than those of main memory

• However: the faster they run the smaller they must be

Page 23

CPU Cache Evolution: up to the end of the 80’s; the 90’s and 2000’s; the 2010’s


implemented several memory layers with different capabilities: lower-level caches (that is, those closer to the CPU) are faster but have reduced capacities and are best suited for performing computations; higher-level caches are slower but have higher capacity and are best suited for storage purposes.

Figure 1 shows the evolution of this hierarchical memory model over time. The forthcoming (or should I say the present?) hierarchical model includes a minimum of six memory levels. Taking advantage of such a deep hierarchy isn’t trivial at all, and programmers must grasp this fact if they want their code to run at an acceptable speed.

Techniques to Fight Data Starvation. Unlike the good old days when the processor was the main bottleneck, memory organization has now become the key factor in optimization. Although learning assembly language to get direct processor access is (relatively) easy, understanding how the hierarchical memory model works, and adapting your data structures accordingly, requires considerable knowledge and experience. Until we have languages that facilitate the development of programs that are aware of memory hierarchy (for an example in progress, see the Sequoia project at www.stanford.edu/group/sequoia), programmers must learn how to deal with this problem at a fairly low level.

There are some common techniques to deal with the CPU data-starvation problem in current hierarchical memory models. Most of them exploit the principles of temporal and spatial locality. In temporal locality, the target dataset is reused several times over a short period. The first time the dataset is accessed, the system must bring it to cache from slow memory; the next time, however, the processor will fetch it directly (and much more quickly) from the cache.

In spatial locality, the dataset is accessed sequentially from memory. In this case, circuits are designed to fetch memory elements that are clumped together much faster than if they’re dispersed. In addition, specialized circuitry (even in current commodity hardware) offers prefetching: it can look at memory-access patterns and predict when a certain chunk of data will be used, and start to transfer it to cache before the CPU has actually asked for it. The net result is that the CPU can retrieve data much faster when spatial locality is properly used.

Programmers should exploit the optimizations inherent in temporal and spatial locality as much as possible. One generally useful technique that leverages these principles is the blocking technique (see Figure 2). When properly applied, the blocking technique guarantees that both spatial and temporal localities are exploited for maximum benefit.

Although the blocking technique is relatively simple in principle, it’s less straightforward to implement in practice. For example, should the basic block fit in cache level one, two, or three? Or would it be better to fit it in main memory, which can be useful when computing large, disk-based datasets? Choosing from among these different possibilities is difficult, and there’s no substitute for experimentation and empirical analysis.

In general, it’s always wise to use libraries that already leverage the blocking technique (and others) for achieving high performance; examples include Lapack (www.netlib.org/lapack) and Numexpr (http://code.google.com/p/numexpr). Numexpr is a virtual machine written in Python and C that lets you evaluate potentially complex arithmetic expressions over arbitrarily large arrays. Using the blocking technique in combination…

Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what’s coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.

[Figure 1 diagram: (a) CPU, main memory, mechanical disk; (b) CPU, Level 1 and 2 caches, main memory, mechanical disk; (c) CPU, Level 1–3 caches, main memory, solid state disk, mechanical disk. Speed increases toward the CPU; capacity increases away from it.]

Page 24

When Are CPU Caches Effective?

Mainly in a couple of scenarios:

• Temporal locality: when the dataset is reused

• Spatial locality: when the dataset is accessed sequentially

Page 25

The Blocking Technique

Use this extensively to leverage spatial and temporal localities

When accessing disk or memory, get a contiguous block that fits in CPU cache, operate upon it and reuse it as much as possible.
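A minimal sketch of the idea in Python (my illustration; the block size is a tunable assumption, and this is essentially what numexpr does internally in C):

import numpy as np

def blocked_poly(x, out, block_size=64*1024):
    # Walk the array in contiguous blocks sized to fit in a CPU cache level,
    # finishing all the work on one block before moving to the next.
    for start in range(0, len(x), block_size):
        b = x[start:start + block_size]
        out[start:start + block_size] = ((0.25*b + 0.75)*b + 1.5)*b - 2
    return out

x = np.linspace(-1, 1, 10*1000*1000)
y = blocked_poly(x, np.empty_like(x))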

Page 26

Time To Answer Pending Questions

(The same benchmark table and charts as on Page 15, shown again:)

Expression                         NumPy   Numexpr
.25*x**3 + .75*x**2 - 1.5*x - 2    1.613   0.138
((.25*x + .75)*x - 1.5)*x - 2      0.301   0.110
x                                  0.052   0.045
sin(x)**2 + cos(x)**2              0.715   0.559

Page 27

Page 28

Page 29

Beyond numexpr: Numba

Page 30

Numexpr Limitations

• Numexpr only implements element-wise operations; i.e., ‘a*b’ is evaluated as:

for i in range(N):
    c[i] = a[i] * b[i]

• In particular, it cannot deal with things like:

for i in range(N):
    c[i] = a[i-1] + a[i] * b[i]

Page 31

Numba: Overcoming numexpr Limitations

• Numba is a just-in-time (JIT) compiler that can translate a subset of the Python language into machine code

• It uses the LLVM infrastructure behind the scenes

• Can achieve similar or better performance than numexpr, but with more flexibility
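For instance, the recurrence from the previous slide compiles directly (a sketch; I use the modern numba.njit spelling here rather than the 2012-era decorator shown on the next example slide):

import numpy as np
import numba as nb

@nb.njit
def recurrence(a, b, c):
    # numexpr cannot express this loop: element i depends on element i-1
    for i in range(1, len(a)):
        c[i] = a[i-1] + a[i] * b[i]
    return c

a = np.random.rand(1000*1000)
b = np.random.rand(1000*1000)
c = np.zeros_like(a)
recurrence(a, b, c)   # first call compiles; later calls run at C-like speed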

Page 32

How Numba Works

Python Function → LLVM-PY → LLVM 3.1 → Machine Code

(Surrounding the LLVM box on the slide: Intel, Nvidia, Apple, AMD; OpenCL, ISPC, CUDA, CLANG, OpenMP)

Page 33

Numba Example: Computing the Polynomial

import numpy as np
import numba as nb

N = 10*1000*1000

x = np.linspace(-1, 1, N)
y = np.empty(N, dtype=np.float64)

@nb.jit(arg_types=[nb.f8[:], nb.f8[:]])
def poly(x, y):
    for i in range(N):
        # y[i] = 0.25*x[i]**3 + 0.75*x[i]**2 + 1.5*x[i] - 2
        y[i] = ((0.25*x[i] + 0.75)*x[i] + 1.5)*x[i] - 2

poly(x, y)  # run through Numba!

Page 34

Times for Computing the Polynomial (in Seconds)

Poly version      (I)     (II)
NumPy             1.086   0.505
numexpr           0.108   0.096
Numba             0.055   0.054
Pure C, OpenMP    0.215   0.054

• Compilation time for Numba: 0.019 s
• Run on Mac OSX, Core2 Duo @ 2.13 GHz

Page 35

Numba: LLVM for Python

Python code can reach C speed without having to program in C itself (and without losing interactivity!)

Page 36

Numba in SC 2012

Page 37

Numba in SC2012: Awesome Python!

Page 38

Optimal Containers for Big Data

If a datastore requires all data to fit in memory, it isn't big data

-- Alex Gaynor (on Twitter)

Page 39

The Need for a Good Data Container

• Too often we are focused on computing as fast as possible

• But we have seen how important data access is

• Hence, having an optimal data structure is critical for getting good performance when processing very large datasets

Page 40

Appending Data in Large NumPy Objects

[Diagram: to enlarge an array, both the array to be enlarged and the new data to append are copied into a new memory allocation that holds the final array object. Copy!]

• Normally a realloc() call will not succeed in place, so everything gets copied
• Both memory areas have to exist simultaneously
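The copy cost is easy to observe from Python (an illustrative sketch; exact numbers depend on the allocator and the machine):

import numpy as np
from time import time

big = np.zeros(50*1000*1000)
new = np.ones(1000)

t0 = time()
big = np.append(big, new)   # allocates a new array and copies all the old data
print("enlarging a NumPy array: %.3f s" % (time() - t0))

chunks = [np.zeros(50*1000*1000)]
t0 = time()
chunks.append(new)          # a chunked container only touches the new data
print("appending a new chunk:   %.6f s" % (time() - t0))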

Page 41

Contiguous vs Chunked

NumPy container: a single stretch of contiguous memory.

Blaze container: discontiguous memory, stored as chunk 1, chunk 2, ..., chunk N.

Page 42

Appending Data in Blaze

[Diagram: the existing chunks (chunk 1, chunk 2, ...) of the array to be enlarged stay untouched; the new data to append is compressed into a new chunk that is added to the final array object.]

Only a small amount of data has to be compressed.
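A toy sketch of such a container (a hypothetical class of mine, not Blaze's actual API; it leans on the python-blosc bindings for chunk compression):

import numpy as np
import blosc   # python-blosc, assumed installed

class ChunkedArray:
    # Toy append-only container that holds its data as compressed chunks.
    def __init__(self, chunklen=256*1024, dtype=np.float64):
        self.chunklen = chunklen
        self.dtype = np.dtype(dtype)
        self.chunks = []                        # compressed, immutable chunks
        self.leftover = np.empty(0, self.dtype)

    def append(self, data):
        buf = np.concatenate([self.leftover,
                              np.asarray(data, dtype=self.dtype)])
        while len(buf) >= self.chunklen:
            head, buf = buf[:self.chunklen], buf[self.chunklen:]
            # Only the new chunk is compressed; old chunks stay untouched
            self.chunks.append(blosc.compress(head.tobytes(),
                                              typesize=self.dtype.itemsize))
        self.leftover = buf

arr = ChunkedArray()
arr.append(np.arange(1000*1000, dtype=np.float64))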

Page 43

Blosc: (de)compressing faster than memcpy()

Transmission + decompression faster than direct transfer?
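One can probe that claim with the python-blosc bindings (a sketch; the result depends heavily on how compressible the data is):

import numpy as np
import blosc
from time import time

a = np.linspace(0, 1, 10*1000*1000)    # very compressible data
buf = a.tobytes()

t0 = time(); packed = blosc.compress(buf, typesize=8); t_c = time() - t0
t0 = time(); plain = blosc.decompress(packed);         t_d = time() - t0
t0 = time(); copied = bytearray(buf);                  t_m = time() - t0  # memcpy-like baseline

print("compress: %.3f s   decompress: %.3f s   copy: %.3f s" % (t_c, t_d, t_m))
print("compression ratio: %.1fx" % (len(buf) / float(len(packed))))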

Page 44

Example of How Blosc Accelerates Genomics I/O: SeqPack (backed by Blosc)

Source: Howison, M. (in press). High-throughput compression of FASTQ data with SeqDB. IEEE Transactions on Computational Biology and Bioinformatics.

TABLE 1. Test Data Sets

#  Source        Identifier  Sequencer            Read Count   Read Length  ID Lengths  FASTQ Size
1  1000 Genomes  ERR000018   Illumina GA          9,280,498    36 bp        40–50       1,105 MB
2  1000 Genomes  SRR493233   Illumina HiSeq 2000  43,225,060   100 bp       51–61       10,916 MB
3  1000 Genomes  SRR497004   AB SOLiD 4           122,924,963  51 bp        78–91       22,990 MB

Fig. 1. In-memory throughputs for several compression schemes applied to increasing block sizes (where each sequence is 256 bytes long).

…into a memory buffer, timed the compression of the block into a second buffer, then timed the decompression of the block back into the first buffer. We repeated each of these tests 10 times and kept the minimum runtime, which represents the best-case scenario when background noise from other processes is at its lowest. As a baseline, we also performed a memcpy of the first buffer to the second, which measures the bidirectional memory bandwidth of the system. We calculated throughput as the size of the original, uncompressed block divided by runtime. Sequences were stored with 256 bytes (100 bytes for sequence and quality score data, and 56 bytes for the ID header) and both power of 2 and power of 10 block sizes were used, ranging from 1,000 to 33,554,432 sequences.

Figure 1 shows how throughput varied drastically with block size for both SeqPack and BLOSC, and how both were able to achieve higher throughput than memcpy, thanks to their compression of the buffer. In contrast, zlib displayed a constant throughput across block sizes that never rose above 5.8 MB/s for compression or 189 MB/s for decompression.

BLOSC attained a maximum throughput of nearly 12 GB/s for decompression, but its highest compression throughput was an order of magnitude lower. Although SeqDB’s throughputs were not as high as BLOSC’s in decompression, SeqDB achieved more consistent throughput across both compression and decompression, topping 8 GB/s for both.

Finally, we also tested a “SeqPack + BLOSC” condition in which we compressed the buffer with SeqPack followed by BLOSC, then decompressed with BLOSC followed by SeqPack. Not surprisingly, with the extra round of compression and decompression, this condition performed worse than either SeqPack or BLOSC alone. Yet, it still outperformed zlib while yielding similar compression ratios, as we will report in more detail in the next section.

4.2 Compression Ratio and Throughput. Compression methods face a trade-off between throughput and compression ratio. Additional processing, and hence lower throughput, can improve compression ratio. Some methods, like BLOSC and zlib, have a tunable parameter for controlling this trade-off, called the compression level.

We tested SeqDB against four alternative methods for compressing NGS data sets. The most commonly used method is probably gzip (which uses the same compression algorithm as zlib). We ran gzip at compression level 6 since this is the default level if none is specified on the command line, and we suspect this is the most common usage. Next, we tested BioHDF with its default HDF5 zlib compression filter…

Page 45

How Blaze Does Out-Of-Core Computations

Virtual Machine: Python, numexpr, Numba
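A rough sketch of that pattern (my illustration, not Blaze's internals): chunks that fit in memory are streamed, one at a time, through a fast kernel such as numexpr.

import numpy as np
import numexpr as ne

def evaluate_out_of_core(expr, chunks):
    # 'chunks' yields arrays small enough to fit comfortably in memory;
    # each one is run through the numexpr virtual machine in turn.
    for x in chunks:
        yield ne.evaluate(expr)   # 'x' in expr resolves to the current chunk

def read_chunks(n_chunks, chunklen):
    # Stand-in for fetching decompressed chunks from an on-disk container
    for _ in range(n_chunks):
        yield np.random.rand(chunklen)

results = list(evaluate_out_of_core("((.25*x + .75)*x - 1.5)*x - 2",
                                    read_chunks(10, 1000*1000)))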

Page 46

Last Message for Today

Big data is tricky to manage:

Look for the optimal containers for your data

Spending some time choosing your appropriate data container can be a big time saver in the long run

Page 47

Summary

• Python is a perfect language for Big Data

• Nowadays you should be aware of the memory system to get good performance

• Choosing appropriate data containers is of the utmost importance when dealing with Big Data

Page 48

“Success with Big Data will go to those developers who are able to look beyond the standard and who are capable of understanding the underlying hardware resources and the variety of available algorithms.”

-- Oscar de Bustos, HPC Line of Business Manager at BULL

Page 49

Thank you! (¡Gracias!)