python as number crunching code glue
DESCRIPTION
Presented to the Boston Python User Group on 9/21/2011TRANSCRIPT
Python as number crunching glue
Jiahao [email protected]@mitpostdoc
theochem.mit.edu
1Thursday, September 22, 2011
This is not a crash course on scientific computing or numerical linear algebraRecommended texts:
2
nr.com
Thursday, September 22, 2011
NumPy and SciPyHow to say:
NumPy: no official pronunciation
SciPy: “sigh pie”
3Thursday, September 22, 2011
NumPy and SciPyHow to say:
NumPy: no official pronunciation
SciPy: “sigh pie”
3
Where to get:
scipy.org, numpy.scipy.org
You might already have it
Otherwise, have fun installing it ;)
Thursday, September 22, 2011
You may already know how to use numpy/scipy!
Similar to Matlab, Octave, Scilab, R.
see:http://mathesaurus.sourceforge.net/
In many cases, Matlab/Octave/Scilab code can be translated easily to use numpy+scipy+matplotlib.
Other interfaces exist: e.g. mlabwrap lets you wrap Python around Matlab.
4Thursday, September 22, 2011
Approximately continuous arithmeticfloating point*
- vs -
Exact discrete arithmeticbooleans, integers, strings, ...
*David Goldberg, “What every computer scientist should know about floating-point arithmetic”
5Thursday, September 22, 2011
Using numpy can make code cleaner
6
a = range(10000000)b = range(10000000)c = []
for i in range(len(a)): c.append(a[i] + b[i])
import numpy as npa = np.arange(10000000)b = np.arange(10000000)c = a + b
What’s different??
Thursday, September 22, 2011
What’s different?
7
a = range(10000000)b = range(10000000)c = [] #a+b is concatenation
for i in range(len(a)): c.append(a[i] + b[i])
import numpy as npa = np.arange(10000000)b = np.arange(10000000)c = a + b #vectorized addition
Using numpy can save lots of time
0.333s7.050s (21x)
a convenient interface to compiled C/Fortran libraries: BLAS, LAPACK, FFTW, UMFPACK,...
creates list ofdynamically typed int
creates ndarray ofstatically typed int
Thursday, September 22, 2011
Numerical sw stack
8
PythonBLAS
NumPy
SciPy
FFTW
...
linearalgebra
Fouriertransforms
External Fortran/C
Your code
LAPACK
...
Thursday, September 22, 2011
“One thing that graduate students eventually learn is that you can hide just about anything in a NxN matrix... (for sufficiently large N)” - anonymous string theorist
9Thursday, September 22, 2011
“One thing that graduate students eventually learn is that you can hide just about anything in a NxN matrix... (for sufficiently large N)” - anonymous string theorist
9
If your data can be put into a matrix/vector, numpy/scipy can help you!
Thursday, September 22, 2011
You may already be working with matrix/vector data...
10
bitmap/video waveform
database table text differential
equation model
graph
Thursday, September 22, 2011
11
# Chapter NumPy SciPy
1 Scientific Computing2 Systems of linear equations X X
3 Linear least squares X
4 Eigenvalue problems X X
5 Nonlinear equations X
6 Optimization X
7 Interpolation X
8 Numerical integration and differntiation X
9 Initial value problems for ODEs X
10 Boundary value problems for ODEs X
11 Partial differential equations X
12 Fast Fourier Transform X
13 Random numbers and stochastic simulation X
Table of contents from Michael Heath’s textbook
Thursday, September 22, 2011
Outline:
* NumPy: explicit data typing with dtypes : array manipulation with ndarrays
* SciPy: high-level numerical routines : use cases
* NumPy/SciPy as code glue: f2py and weave
12Thursday, September 22, 2011
The most fundamental object in NumPy is the ndarray (N-dimensional array)
v[:] vector M[:,:] matrix x[:,:,...,:] higher order tensor
unlike built-in Python data types,ndarrays are designed forhomogeneous, explicitly typed data
13Thursday, September 22, 2011
numpy primitive dtypes
14
Bits Boolean Signedinteger
Unsignedinteger Float Complex
8 bool int8 uint816 int16 uint1632 int32 uint32 float32
64int intp uint float
float64 complex6464int64 uint64
floatfloat64 complex64
128 float128 complex128256 complex256
dtypes bring explicit typing to Python
Thursday, September 22, 2011
>>> mol = np.array(mol, dtype={'atomicnum':('uint8',0), 'coords':('3float64',1)})>>> mol['atomicnum']array([8, 1, 1], dtype=uint8)
Recarray: ndarray of data structure with named fields (record)
15
Structured array: ndarray of data structure
>>> mol = np.zeros(3, dtype=('uint8, 3float64'))>>> mol[0] = 8, (-0.464, 0.177, 0.0)>>> mol[1] = 1, (-0.464, 1.137, 0.0)>>> mol[2] = 1, (0.441, -0.143, 0.0)>>> molarray([(8, [-0.46400000000000002, 0.17699999999999999, 0.0]), (1, [-0.46400000000000002, 1.137, 0.0]), (1, [0.441, -0.14299999999999999, 0.0])], dtype=[('f0', '|u1'), ('f1', '<f8', (3,))])
Thursday, September 22, 2011
The most fundamental object in NumPy is the ndarray (N-dimensional array)In 2D, the matrix class is also useful, especially when porting Matlab/Octave code.* For matrices, a*b is matrix multiply. For ndarrays, a*b is elementwise multiply.
* Matrices have convenient attributes: M.T transpose of M M.H Hermitian conjugate of M M.I matrix inverse of M
* Matrices are always 2D, no matter how you manipulate them. ****** This can lead to some very severe, insidious bugs. ******
using asarray() and asmatrix() views allows the best of both worlds.see: http://docs.scipy.org/doc/numpy/reference/arrays.classes.html#matrix-objects
16Thursday, September 22, 2011
Memory layout of matrices
column major: first dimension is contiguous in memory Fortran, Matlab, R,...
row major: last dimension is contiguous in memory C, Java, numpy,...
Why you should care:• Cache coherence• Transposing a matrix is very expensive
17Thursday, September 22, 2011
• from Python iterable: lists, tuples,...e.g. array([1, 2, 3]) == asarray((1, 2, 3))• from intrinsic functionsempty() allocates memory onlyzeros() initializes to 0ones() initializes to 1arange() creates a uniform rangerand() initializes to uniform randomrandn() initializes to standard normal random...• from binary representation in string/buffer• from file on disk
18
Creating ndarrays
Thursday, September 22, 2011
fromfunction() creates an ndarray whose entries are functions of its indices
e.g. the Hilbert matrix
>>> np.fromfunction(lambda i,j: 1./(i+j+1), (4,4))array([[ 1. , 0.5 , 0.33333333, 0.25 ], [ 0.5 , 0.33333333, 0.25 , 0.2 ], [ 0.33333333, 0.25 , 0.2 , 0.16666667], [ 0.25 , 0.2 , 0.16666667, 0.14285714]])
19
1..n
Generating ndarrays
Thursday, September 22, 2011
arange(): like range() but accepts floats>>> import numpy as np>>> np.arange(2, 2.5, 0.1)array([ 2. , 2.1, 2.2, 2.3, 2.4])
linspace(): creates array with specified number of elements, spaced equally between the specified beginning and ending.>>> np.linspace(2.0, 2.4, 5)array([ 2. , 2.1, 2.2, 2.3, 2.4])
20
Generating ndarrays
Thursday, September 22, 2011
21
ndarray native I/OFormat Reader Writer
pickle pickle.loads() dumps()pickle
np.load()
dumps()
NPY np.load() np.save()NPZ
np.load()np.savez()
Memory map np.memmapnp.memmap
NPY is numpy’s native binary formatNPZ is a zip file of NPYsMemory map: a class useful for handling huge matrices won’t load entire matrix into memory
Thursday, September 22, 2011
22
ndarray text I/OFormat Reader Writer
Stringeval() np.array_repr()
Stringor below with StringIOor below with StringIO
Text filenp.loadtxt()
np.genfromtxt()np.recfromtxt()
savetxt()
CSV np.recfromcsv()Matrix Market scipy.io.mmread() mmwrite()
Thursday, September 22, 2011
23
ndarray binary I/OFormat Reader WriterList np.array() ndarray.tolist()
Stringnp.fromstring() tostring()
Stringor below with StringIOor below with StringIO
Raw binary file
scipy.io.numpyio.fread() ndarray.fromfile()
fwrite().tofile()
MATLAB scipy.io.loadmat() savemat()netCDF scipy.io.netcdf.netcdf_filescipy.io.netcdf.netcdf_file
WAV audio scipy.io.wavfile.read() write()Image
(via PIL)scipy.misc.imread()
scipy.misc.fromimage()imsave()toimage()
Also video (OpenCV), HDF5 (PyTables), FITS (PyFITS)...Thursday, September 22, 2011
Indexing>>> x = np.arange(12).reshape(3,4); xarray([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])>>> x[1,2]6>>> x[2,-1]11>>> x[0][2]2>>> x[(2,2)]10>>> x[:1]array([[0, 1, 2, 3]])>>> x[::2,1:4:2]array([[ 1, 3], [ 9, 11]])
24
#slices return views, not copies
#tuple
row, then column
Thursday, September 22, 2011
Fancy indexing>>> x = np.arange(12).reshape(3,4); xarray([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])>>> x[(2,2)]10>>> x[np.array([2,2])] #same as x[[2,2]]array([[ 8, 9, 10, 11], [ 8, 9, 10, 11]])>>> x[np.array([1,0]), np.array([2,1])]array([6, 1])>>> x[x>8]array([ 9, 10, 11])>>> x>8array([[False, False, False, False], [False, False, False, False], [False, True, True, True]], dtype=bool)
25
array index
Boolean mask
Thursday, September 22, 2011
Fancy indexing II>>> y = np.arange(1*2*3*4).reshape(1,2,3,4); yarray([[[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]],
[[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]])
>>> y[0, Ellipsis, 0] # == y[0, ..., 0] == [0,:,:,0]array([[ 0, 4, 8], [12, 16, 20]])>>> y[0, 0, 0, slice(2,4)] # == y[(0, 0, 0, 2:4)]array([2, 3])
26Thursday, September 22, 2011
Broadcasting
>>> x #.shape = (3,4)array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])>>> y #.shape = (1,2,3,4)array([[[[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]],
[[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]]])
27
>>> y * xarray([[[[ 0, 1, 4, 9], [ 16, 25, 36, 49], [ 64, 81, 100, 121]],
[[ 0, 13, 28, 45], [ 64, 85, 108, 133], [160, 189, 220, 253]]]])
What happens when you multiply ndarrays of different dimensions?
Case I: trailing dimensions match
Thursday, September 22, 2011
Broadcasting
>>> a = np.arange(4); aarray([0, 1, 2, 3])>>> b = np.arange(4)[::-1]; barray([3, 2, 1, 0])>>> a + barray([3, 3, 3, 3])
28
What happens when you multiply ndarrays of different dimensions?
Case II: trailing dimension is 1>>> b.shape = 4,1>>> a + barray([[3, 4, 5, 6], [2, 3, 4, 5], [1, 2, 3, 4], [0, 1, 2, 3]])
>>> b.shape = 1,4>>> a + barray([[3, 3, 3, 3]])
Thursday, September 22, 2011
In 2D, the matrix class is often more useful than ndarrays, especially when porting Matlab/Octave code.* For matrices, a*b is matrix multiply. For ndarrays, a*b is elementwise multiply.
* Matrices have convenient attributes: M.T transpose of M M.H Hermitian conjugate of M M.I matrix inverse of M
* Matrices are always 2D, no matter how you manipulate them. ****** This can lead to some very severe, insidious bugs. ******
using asarray() and asmatrix() views allows the best of both worlds.see: http://docs.scipy.org/doc/numpy/reference/arrays.classes.html#matrix-objects
29
Matrix operations
Thursday, September 22, 2011
Matrix functionsYou can apply a function elementwise to a matrix...>>> from numpy import array, exp>>> X = array([[1, 1], [1, 0]])>>> exp(X)array([[ 2.71828183, 2.71828183], [ 2.71828183, 1.]])
...or a matrix version of that function>>> from scipy.linalg import expm>>> expm(X)array([[ 2.71828183, 7.3890561 ], [ 1. , 2.71828183]])
other functions in scipy.linalg.matfuncs30
Thursday, September 22, 2011
SciPy by example
* Data fitting
* Signal matching
* Disease outbreak modeling (epidemiology)
31
http://scipy-central.org/
Thursday, September 22, 2011
Least-squares curve fittingfrom scipy import *from scipy.optimize import leastsqfrom matplotlib.pyplot import plot
#Make up data x(t) with Gaussian noisenum_points = 150t = linspace(5, 8, num_points)x = 11.86*cos(2*pi/0.81*t-1.32) + 0.64*t\ +4*((0.5-rand(num_points))*\ exp(2*rand(num_points)**2))
# Target functionmodel = lambda p, x: \ p[0]*cos(2*pi/p[1]*x+p[2]) + p[3]*x# Distance to the target functionerror = lambda p, x, y: model(p, x) - y# Initial guess for the parametersp0 = [-15., 0.8, 0., -1.]p1, _ = leastsq(error, p0, args=(t, x))
t2 = linspace(t.min(), t.max(), 100)plot(t, x, "ro", t2, model(p1, t2), "b-")raw_input()
32
fit data to model
Thursday, September 22, 2011
Matching signalsSuppose I have a short audio clip
that I know to be part of a larger file
How can I figure out its offset?
Problem: naïve matching scales as O(N2)
33Thursday, September 22, 2011
An O(N lg N) solutionNaïve matching scales as O(N2)How can we do faster?
phase correlation
Exploit Fourier transforms: they encode relative offsets in complex phase
34
60o
1/6Thursday, September 22, 2011
From math to code
35Thursday, September 22, 2011
From math to code
35
import numpy
#Make up some dataN = 30000idx = 24700size = 300data = numpy.random.rand(N)frag_pad = numpy.zeros(N)frag = data[idx:idx+size]frag_pad[:size] = frag
#Compute phase correlationdata_ft = numpy.fft.rfft(data)frag_ft = numpy.fft.rfft(frag_pad)phase = data_ft * numpy.conj(frag_ft)phase /= abs(phase)cross_correlation = numpy.fft.irfft(phase)offset = numpy.argmax(cross_correlation)
print 'Input offset: %d, computed: %d' % (idx, offset)from matplotlib.pyplot import plotplot(cross_correlation)raw_input() #Pause
Thursday, September 22, 2011
From math to code
35
import numpy
#Make up some dataN = 30000idx = 24700size = 300data = numpy.random.rand(N)frag_pad = numpy.zeros(N)frag = data[idx:idx+size]frag_pad[:size] = frag
#Compute phase correlationdata_ft = numpy.fft.rfft(data)frag_ft = numpy.fft.rfft(frag_pad)phase = data_ft * numpy.conj(frag_ft)phase /= abs(phase)cross_correlation = numpy.fft.irfft(phase)offset = numpy.argmax(cross_correlation)
print 'Input offset: %d, computed: %d' % (idx, offset)from matplotlib.pyplot import plotplot(cross_correlation)raw_input() #Pause
Thursday, September 22, 2011
Modeling a zombie apocalypse
36
http://blogs.cdc.gov/publichealthmatters/2011/05/preparedness-101-zombie-apocalypse/
Thursday, September 22, 2011
Modeling a zombie apocalypse
37
http://www.scipy.org/Cookbook/Zombie_Apocalypse_ODEINT
Normal (S) Zombie Dead (R)
Each person can be in one of three states
Thursday, September 22, 2011
Modeling a zombie apocalypse
38
http://www.scipy.org/Cookbook/Zombie_Apocalypse_ODEINT
Normal (S) Zombie Dead (R)
Various processes connect these states
birth (P) normal death
+
resurrection (G)transmission (B)
destruction (A)
Thursday, September 22, 2011
from numpy import linspacefrom scipy.integrate import odeint
P = 0 # birth rated = 0.0001 # natural death rateB = 0.0095 # transmission rateG = 0.0001 # resurrection rateA = 0.0001 # destruction ratedef f(y, t): Si, Zi, Ri = y return [P - B*Si*Zi - d*Si, B*Si*Zi + G*Ri - A*Si*Zi, d*Si + A*Si*Zi - G*Ri]
y0 = [500, 0, 0] # initial conditionst = linspace(0, 5., 1000) # time grid
soln = odeint(f, y0, t) # solve ODES, Z, R = soln[:, :].T
From math to code
39
http://www.scipy.org/Cookbook/Zombie_Apocalypse_ODEINT
S Z R
r d+
GB
A
Thursday, September 22, 2011
Using external code“NumPy can get you most of the way to compiled speeds through vectorization. In situations where you still need the last ounce of speed in a critical section, or when it either requires a PhD in NumPy-ology to vectorize the solution or it results in too much memory overhead, you can reach for Cython or Weave. If you already know C/C++, then weave is a simple and speedy solution. If, however, you are not already familiar with C then you may find Cython to be exactly what you are looking for to get the speed you need out of Python.” - Travis Oliphant, 2011-06-20
see:http://www.scipy.org/PerformancePythonhttp://technicaldiscovery.blogspot.com/2011/06/speeding-up-python-numpy-cython-and.html
40Thursday, September 22, 2011
Python as code glue- numpy.f2py: wraps * C, Fortran 77/90/95 functions * Fortran 90/95 module data * Fortran 77 COMMON blocks
- scipy.weave * .inline: compiles & runs C/C++ code manipulating Python scalars/ndarrays * .blitz: interfaces with Blitz++
Other wrapper libraries and programs: seehttp://scipy.org/Topical_Software
41Thursday, September 22, 2011
numpy.f2py: Fortran/C
$ cat>invsqrt.f real*8 function invsqrt (a) real*8 a invsqrt = 1.0/sqrt(a) end
$ f2py -c -m invsqrt invsqrt.f$ python -c 'import invsqrt; print invsqrt.invsqrt(4)'0.5
see: http://www.scipy.org/F2py
42
$ cat>invsqrt.c#include <math.h>double invsqrt(a) { return 1.0/sqrt(a);}$ cat>invsqrt.mpython module invsqrtinterface real*8 function invsqrt(x) intent(c) :: invsqrt real*8 intent(in) :: x end function invsqrtend interfaceend python module invsqrt$ f2py invsqrt.m invsqrt.c -c$ python -c 'import invsqrt; print invsqrt.invsqrt(4)'0.5
Thursday, September 22, 2011
scipy.weave.inline
>>> from scipy.weave import inline>>> x = 4.0>>> inline('return_val = 1./sqrt(x));',['x'])0.5
see: https://github.com/scipy/scipy/blob/master/scipy/weave/doc/tutorial.txt
43
inline Extension
pythonscipyweave
distutilscore
on-the-flycompiledC/C++program
Thursday, September 22, 2011
scipy.weave.blitzUses the Blitz++ numerical library for C++Converts between ndarrays and Blitz arrays>>> # Computes five-point average using numpy and weave.blitz>>> import numpy import empty>>> from scipy.weave import blitz>>> a = numpy.zeros((4096,4096)); c = numpy.zeros((4096, 4096))>>> b = numpy.random.randn(4096,4096)>>> c[1:-1,1:-1] = (b[1:-1,1:-1] + b[2:,1:-1] + b[:-2,1:-1] + b[1:-1,2:] + b[1:-1,:-2]) / 5.0>>> blitz("a[1:-1,1:-1] = (b[1:-1,1:-1] + b[2:,1:-1] + b[:-2,1:-1] + b[1:-1,2:] + b[1:-1,:-2]) / 5.")>>> (a == c).all()True
see:https://github.com/scipy/scipy/blob/master/scipy/weave/doc/tutorial.txt
44Thursday, September 22, 2011
ParallelizationThe easy way: numpy/scipy’s primitives automatically use vectorization compiled into external BLAS/LAPACK/... libraries
The usual way:- MPI interfaces (mpi4py,...)- Python threads/multiprocessing/...- OpenMP/pthreads... in external C/Fortran
see:http://www.scipy.org/ParallelProgramming
45Thursday, September 22, 2011
How I use NumPy/Scipy
46
Text input
Matrices Test model Visualize
Text output
scipy.optimizeQuasi-Newton optimizers
External binary
Binary outputndarray.
fromfile()
Thursday, September 22, 2011
Beyond NumPy/SciPy
47
Python
NumPy
SciPyExternal Fortran/C
My script
CVXOpt
many more examples at http://www.scipy.org/Topical_Software
PyTables VTK matplotlib
My interactive session
PylabHDF5
file I/Onumerical
optimization
visualization
PyMol
moleculeviz.
plots
Thursday, September 22, 2011