Effective Numerical Computation in NumPy and SciPy
DESCRIPTION
Presented at PyCon JP 2014. Video is available at http://bit.ly/1tXYhw6
This talk explores case studies of effective usage of NumPy/SciPy and shows that the computational speed sometimes improves drastically with the appropriate derivation of formulas and performance-conscious implementation. I especially focus on scipy.sparse, the module for sparse matrices, which is often useful in the areas of machine learning and natural language processing.
TRANSCRIPT
Effective Numerical Computation in NumPy and SciPy
Kimikazu Kato
PyCon JP 2014
September 13, 2014
1 / 35
About Myself
Kimikazu Kato
Chief Scientist at Silver Egg Technology Co., Ltd.
Ph.D. in Computer Science
Background in Mathematics, Numerical Computation, Algorithms, etc.
<2 years' experience in Python
>10 years' experience in numerical computation
Now designing algorithms for recommendation systems, and doing research on machine learning and data analysis.
2 / 35
This talk...
is about effective usage of NumPy/SciPy
is NOT an exhaustive introduction of capabilities, but shows some case studies based on my experience and interest
3 / 35
Table of Contents
Introduction
Basics about NumPy
  Broadcasting
  Indexing
Sparse matrix
  Usage of scipy.sparse
  Internal structure
Case studies
Conclusion
4 / 35
Numerical Computation
Differential equations
Simulations
Signal processing
Machine Learning
etc...
Why Numerical Computation in Python?
Productivity
  Easy to write
  Easy to debug
Connectivity with visualization tools
  Matplotlib
  IPython
Connectivity with web systems
  Many frameworks (Django, Pyramid, Flask, Bottle, etc.)
5 / 35
But Python is Very Slow!
Code in C
#include <stdio.h>
int main() {
    int i; double s=0;
    for (i=1; i<=100000000; i++) s+=i;
    printf("%.0f\n",s);
}
Code in Python
s=0.
for i in xrange(1,100000001):
    s+=i
print s
Both of the codes compute the sum of integers from 1 to 100,000,000.
Result of benchmark in a certain environment:
Above (C): 0.109 sec (compiled with -O3 option)
Below (Python): 8.657 sec
(about 80 times slower!!)
6 / 35
Better code
import numpy as np
a=np.arange(1,100000001)
print a.sum()
Now it takes 0.188 sec. (Measured by the "time" command in Linux, loading time included)
Still slower than C, but sufficiently fast for a scripting language.
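The gap between the loop and the NumPy version can be checked with a sketch like the one below. It is written for Python 3 (the slides use Python 2), uses a smaller n so that it finishes quickly, and the helper names loop_sum and numpy_sum are made up for illustration. Absolute timings will differ from the numbers on the slides.

```python
import numpy as np
from timeit import timeit

N = 1_000_000  # smaller than the 100,000,000 on the slide, to keep the run short


def loop_sum(n):
    """Sum 1..n with an explicit Python loop (the slow style)."""
    s = 0.0
    for i in range(1, n + 1):
        s += i
    return s


def numpy_sum(n):
    """Sum 1..n by letting NumPy run the loop in compiled code."""
    return np.arange(1, n + 1, dtype=np.int64).sum()


# Time three repetitions of each version.
print("loop :", timeit(lambda: loop_sum(N), number=3))
print("numpy:", timeit(lambda: numpy_sum(N), number=3))
```

Both functions return the same value; only the place where the loop runs changes.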
7 / 35
Lessons
Python is very slow when written badly
Translating C (or Java, C#, etc.) code into Python is often a bad idea.
Python-friendly rewriting sometimes results in drastic performance improvement
8 / 35
Basic rules for better performance
Avoid for loops as far as possible
Utilize libraries' capabilities instead
Forget about the cost of copying memory
  A typical C programmer might care about it, but ...
9 / 35
Basic techniques for NumPy
Broadcasting
Indexing
10 / 35
Broadcasting
>>> import numpy as np
>>> a=np.array([0,1,2])
>>> a*3
array([0, 3, 6])
>>> b=np.array([1,4,9])
>>> np.sqrt(b)
array([ 1., 2., 3.])
A function that is applied to each element when given an array is called a universal function.
11 / 35
Broadcasting (2D)
>>> import numpy as np
>>> a=np.arange(9).reshape((3,3))
>>> b=np.array([1,2,3])
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> b
array([1, 2, 3])
>>> a*b
array([[ 0, 2, 6],
       [ 3, 8, 15],
       [ 6, 14, 24]])
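In the 2D example, b is stretched along the rows, so each column of a is scaled. The following sketch (Python 3, not from the slides) contrasts that with scaling each row, which only requires giving b the shape (3, 1). The variable names by_column and by_row are made up for illustration.

```python
import numpy as np

a = np.arange(9).reshape((3, 3))
b = np.array([1, 2, 3])

# b has shape (3,), so it is stretched along the rows:
# column j of a is multiplied by b[j].
by_column = a * b

# Giving b shape (3, 1) stretches it along the columns instead:
# row i of a is multiplied by b[i].
by_row = a * b[:, np.newaxis]

print(by_column)
print(by_row)
```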
12 / 35
Indexing
>>> import numpy as np
>>> a=np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> indices=np.arange(0,10,2)
>>> indices
array([0, 2, 4, 6, 8])
>>> a[indices]=0
>>> a
array([0, 1, 0, 3, 0, 5, 0, 7, 0, 9])
>>> b=np.arange(100,600,100)
>>> b
array([100, 200, 300, 400, 500])
>>> a[indices]=b
>>> a
array([100, 1, 200, 3, 300, 5, 400, 7, 500, 9])
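Indexing is what usually replaces a loop with an if inside it. The sketch below (Python 3) uses a boolean array as the index, which is a standard NumPy feature but is not shown on the slides; the data and the names looped and masked are made up for illustration.

```python
import numpy as np

a = np.array([3, -1, 4, -1, 5, -9, 2, 6])

# Loop version: visit every element in Python.
looped = a.copy()
for i in range(len(looped)):
    if looped[i] < 0:
        looped[i] = 0

# Indexing version: a boolean array selects the positions to overwrite.
masked = a.copy()
masked[masked < 0] = 0

print(masked)
```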
13 / 35
References
Gabriele Lanaro, "Python High Performance Programming," Packt Publishing, 2013.
Stéfan van der Walt, Numpy Medkit
14 / 35
Sparse matrix
Defined as a matrix in which most elements are zero
A compressed data structure is used to express it, so that it will be...
  Space efficient
  Time efficient
15 / 35
scipy.sparse
The scipy.sparse module has mainly three types for expressing a sparse matrix. (There are other types, but they are not covered here.)
lil_matrix : convenient to set data; setting a[i,j] is fast
csr_matrix : convenient for computation, fast to retrieve a row
csc_matrix : convenient for computation, fast to retrieve a column
Usually, set the data into lil_matrix, and then convert it to csc_matrix or csr_matrix.
For csr_matrix and csc_matrix, calculation between matrices of the same type is fast, but you should avoid calculation between different types.
16 / 35
Use case
>>> from scipy.sparse import lil_matrix, csr_matrix
>>> a=lil_matrix((3,3))
>>> a[0,0]=1.; a[0,2]=2.
>>> a=a.tocsr()
>>> print a
  (0, 0) 1.0
  (0, 2) 2.0
>>> a.todense()
matrix([[ 1., 0., 2.],
        [ 0., 0., 0.],
        [ 0., 0., 0.]])
>>> b=lil_matrix((3,3))
>>> b[1,1]=3.; b[2,0]=4.; b[2,2]=5.
>>> b=b.tocsr()
>>> b.todense()
matrix([[ 0., 0., 0.],
        [ 0., 3., 0.],
        [ 4., 0., 5.]])
>>> c=a.dot(b)
>>> c.todense()
matrix([[ 8., 0., 10.],
        [ 0., 0., 0.],
        [ 0., 0., 0.]])
>>> d=a+b
>>> d.todense()
matrix([[ 1., 0., 2.],
        [ 0., 3., 0.],
        [ 4., 0., 5.]])
17 / 35
Internal structure: csr_matrix
>>> from scipy.sparse import lil_matrix, csr_matrix
>>> a=lil_matrix((3,3))
>>> a[0,1]=1.; a[0,2]=2.; a[1,2]=3.; a[2,0]=4.; a[2,1]=5.
>>> b=a.tocsr()
>>> b.todense()
matrix([[ 0., 1., 2.],
        [ 0., 0., 3.],
        [ 4., 5., 0.]])
>>> b.indices
array([1, 2, 2, 0, 1], dtype=int32)
>>> b.data
array([ 1., 2., 3., 4., 5.])
>>> b.indptr
array([0, 2, 3, 5], dtype=int32)
18 / 35
Internal structure: csc_matrix
>>> from scipy.sparse import lil_matrix, csr_matrix
>>> a=lil_matrix((3,3))
>>> a[0,1]=1.; a[0,2]=2.; a[1,2]=3.; a[2,0]=4.; a[2,1]=5.
>>> b=a.tocsc()
>>> b.todense()
matrix([[ 0., 1., 2.],
        [ 0., 0., 3.],
        [ 4., 5., 0.]])
>>> b.indices
array([2, 0, 2, 0, 1], dtype=int32)
>>> b.data
array([ 4., 1., 5., 2., 3.])
>>> b.indptr
array([0, 1, 3, 5], dtype=int32)
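How the three arrays fit together may be easier to see by decoding them by hand. The sketch below (Python 3) takes the data, indices and indptr values from the csr_matrix slide and rebuilds the dense matrix with explicit loops; the loops are for explanation only, not for speed. For csc_matrix the same reading applies with the roles of rows and columns swapped.

```python
import numpy as np
from scipy.sparse import csr_matrix

# The three arrays shown on the csr_matrix slide.
data = np.array([1., 2., 3., 4., 5.])
indices = np.array([1, 2, 2, 0, 1], dtype=np.int32)
indptr = np.array([0, 2, 3, 5], dtype=np.int32)

n_rows = len(indptr) - 1
dense = np.zeros((n_rows, 3))
for i in range(n_rows):
    # Row i owns the slice indptr[i]:indptr[i+1] of data and indices.
    for k in range(indptr[i], indptr[i + 1]):
        # indices[k] is the column of the k-th stored value.
        dense[i, indices[k]] = data[k]

# The same arrays can be handed to csr_matrix directly.
m = csr_matrix((data, indices, indptr), shape=(3, 3))

print(dense)
```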
19 / 35
Merit of knowing the internal structure
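Setting csr_matrix or csc_matrix with its internal structure is much faster than setting lil_matrix with indices.
See the benchmark of setting the following matrix:
$$
\begin{pmatrix}
2 & 1 &        &        &   \\
  & 2 & 1      &        &   \\
  &   & \ddots & \ddots &   \\
  &   &        & \ddots & 1 \\
  &   &        &        & 2
\end{pmatrix}
$$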
20 / 35
from scipy.sparse import lil_matrix, csr_matrix
import numpy as np
from timeit import timeit

def set_lil(n):
    a=lil_matrix((n,n))
    for i in xrange(n):
        a[i,i]=2.
        if i+1<n:
            a[i,i+1]=1.
    return a

def set_csr(n):
    data=np.empty(2*n-1)
    indices=np.empty(2*n-1,dtype=np.int32)
    indptr=np.empty(n+1,dtype=np.int32)
    # to be fair, a for loop is intentionally used
    # (using indexing technique is faster)
    for i in xrange(n):
        indices[2*i]=i
        data[2*i]=2.
        if i<n-1:
            indices[2*i+1]=i+1
            data[2*i+1]=1.
        indptr[i]=2*i
    indptr[n]=2*n-1
    a=csr_matrix((data,indices,indptr),shape=(n,n))
    return a

print "lil:",timeit("set_lil(10000)", number=10,setup="from __main__ import set_lil")
print "csr:",timeit("set_csr(10000)", number=10,setup="from __main__ import set_csr")
21 / 35
Result:
lil: 11.6730761528
csr: 0.0562081336975
Remark
When you deal with already sorted data, setting csr_matrix or csc_matrix with data, indices, indptr is much faster than setting lil_matrix
But the code tends to be more complicated if you use the internal structure of csr_matrix or csc_matrix
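The comment in set_csr notes that an indexing technique would be faster than the loop. One possible version is sketched below. It is written for Python 3, and the name set_csr_vectorized is made up here rather than taken from the talk; it fills the same three arrays with slice assignments instead of a loop.

```python
import numpy as np
from scipy.sparse import csr_matrix


def set_csr_vectorized(n):
    """Build the n x n matrix with 2 on the diagonal and 1 just above it."""
    data = np.empty(2 * n - 1)
    indices = np.empty(2 * n - 1, dtype=np.int32)
    data[0::2] = 2.0                  # diagonal entries sit at even positions
    data[1::2] = 1.0                  # superdiagonal entries sit at odd positions
    indices[0::2] = np.arange(n)      # column i for a[i, i]
    indices[1::2] = np.arange(1, n)   # column i+1 for a[i, i+1]
    indptr = 2 * np.arange(n + 1, dtype=np.int32)  # row i starts at position 2*i
    indptr[n] = 2 * n - 1             # the last row holds only one entry
    return csr_matrix((data, indices, indptr), shape=(n, n))


print(set_csr_vectorized(4).toarray())
```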
22 / 35
Case Studies
23 / 35
Case 1: Norms
$$\|v\|^2 = \sum_i v_i^2$$
If $v$ is dense:
norm=np.dot(v,v)
Expressed as a product of matrices. (dot means matrix product, but you don't have to take the transpose explicitly.)
When $v$ is sparse, suppose that $v$ is expressed as a $1 \times n$ matrix:
norm=v.multiply(v).sum()
(multiply() is element-wise product)
This is because taking the transpose of a sparse matrix changes the type.
24 / 35
Frobenius norm (squared):
$$\|A\|_{\mathrm{Fro}}^2 = \sum_{ij} a_{ij}^2$$
norm=a.multiply(a).sum()
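The multiply().sum() idiom can be checked against the dense computation on a small example. The sketch below is written for Python 3 with made-up data; note that both expressions give the squared norm, so a square root is still needed for the norm itself.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A 1 x n sparse row vector and its dense counterpart.
v_dense = np.array([0., 3., 0., 4., 0.])
v_sparse = csr_matrix(v_dense.reshape(1, -1))

sq_norm_dense = np.dot(v_dense, v_dense)
sq_norm_sparse = v_sparse.multiply(v_sparse).sum()

# The same idiom on a sparse matrix gives the sum of squared entries.
a_dense = np.array([[1., 0., 2.],
                    [0., 0., 0.],
                    [0., 3., 0.]])
a_sparse = csr_matrix(a_dense)
sq_fro_sparse = a_sparse.multiply(a_sparse).sum()
sq_fro_dense = np.linalg.norm(a_dense, "fro") ** 2

print(sq_norm_dense, sq_norm_sparse)
print(sq_fro_dense, sq_fro_sparse)
```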
25 / 35
Case 2: Applying a function to all of the elements of a sparse matrix
A universal function can be applied to a dense matrix:
>>> import numpy as np
>>> a=np.arange(9).reshape((3,3))
>>> a
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> np.tanh(a)
array([[ 0. , 0.76159416, 0.96402758],
       [ 0.99505475, 0.9993293 , 0.9999092 ],
       [ 0.99998771, 0.99999834, 0.99999977]])
This is convenient and fast.
However, we cannot do the same thing for a sparse matrix.
26 / 35
>>> from scipy.sparse import lil_matrix
>>> a=lil_matrix((3,3))
>>> a[0,0]=1.
>>> a[1,0]=2.
>>> b=a.tocsr()
>>> np.tanh(b)
<3x3 sparse matrix of type '<type 'numpy.float64'>'
 with 2 stored elements in Compressed Sparse Row format>
This is because, for an arbitrary function, the result of applying it to a sparse matrix is not necessarily sparse.
However, if a universal function $f$ satisfies $f(0) = 0$, the sparsity is preserved.
Then, how can we compute it?
27 / 35
Use the internal structure!!
The positions of the non-zero elements are not changed after application ofthe function.
Keep indices and indptr, and just change data.
Solution (here a is a csr_matrix):
b = csr_matrix((np.tanh(a.data), a.indices, a.indptr), shape=a.shape)
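A self-contained check of this solution might look as follows. The sketch is written for Python 3, builds the csr_matrix from a small made-up dense array, and compares the result with applying tanh to the dense matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

a = csr_matrix(np.array([[1., 0., 0.],
                         [2., 0., 0.],
                         [0., 0., 0.]]))

# tanh(0) == 0, so only the stored values need to change;
# indices and indptr are reused as they are.
b = csr_matrix((np.tanh(a.data), a.indices, a.indptr), shape=a.shape)

# Check against the dense computation.
expected = np.tanh(a.toarray())
print(np.allclose(b.toarray(), expected))
```

The same pattern should work for any universal function with f(0) = 0, such as np.sin or np.sqrt. Recent SciPy versions also provide methods such as tanh() directly on csr_matrix for several of these functions.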
28 / 35
Case 3: Formula which appears in a paper
In the algorithm for recommendation system [1], the following formula appears:
$$A^T \cdot D \cdot A$$
where $A$ is an $n \times f$ dense matrix, and $D$ is a diagonal matrix defined from a given array $(d_i)$ as:
$$
D =
\begin{pmatrix}
d_1 &     &        &     \\
    & d_2 &        &     \\
    &     & \ddots &     \\
    &     &        & d_n
\end{pmatrix}
$$
Here, $n$ (which corresponds to the number of users or items) is big and $f$ (which means the number of latent factors) is small.
[1] Hu et al. Collaborative Filtering for Implicit Feedback Datasets, ICDM, 2008.
29 / 35
Solution 1:
There is a special class dia_matrix to deal with a diagonal sparse matrix.
import scipy.sparse as sparse
import numpy as np

def f(a,d):
    """a: 2d array of shape (n,f), d: 1d array of length n"""
    dd=sparse.diags([d],[0])
    return np.dot(a.T,dd.dot(a))
30 / 35
Solution 2:
Pack csr_matrix with data, indices, indptr
data=d
indices=[0,1,...,n-1]
indptr=[0,1,...,n]

def g(a,d):
    n,f=a.shape
    data=d
    indices=np.arange(n)
    indptr=np.arange(n+1)
    dd=sparse.csr_matrix((data,indices,indptr),shape=(n,n))
    return np.dot(a.T,dd.dot(a))
31 / 35
Solution 3:
This is equivalent to the broadcasting!
def h(a,d): return np.dot(a.T*d,a)
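Writing $m$ for the number of columns of $A$ (called $f$ above):
$$
(A^T D) A =
\begin{pmatrix}
a_{11} & a_{21} & \cdots & a_{n1} \\
a_{12} & a_{22} & \cdots & a_{n2} \\
\vdots & \vdots &        & \vdots \\
a_{1m} & a_{2m} & \cdots & a_{nm}
\end{pmatrix}
\times
\begin{pmatrix}
d_1 &     &        &     \\
    & d_2 &        &     \\
    &     & \ddots &     \\
    &     &        & d_n
\end{pmatrix}
\times A
=
\begin{pmatrix}
a_{11} d_1 & a_{21} d_2 & \cdots & a_{n1} d_n \\
a_{12} d_1 & a_{22} d_2 & \cdots & a_{n2} d_n \\
\vdots     & \vdots     &        & \vdots     \\
a_{1m} d_1 & a_{2m} d_2 & \cdots & a_{nm} d_n
\end{pmatrix}
\times A
$$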
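The identity can be verified numerically on a small problem. The sketch below is written for Python 3 and uses made-up sizes; it builds the dense diagonal matrix explicitly as a reference, which is only feasible because n is small here, and compares it with the broadcasting form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, f = 1000, 10
a = rng.random((n, f))
d = rng.random(n)

# Reference: build the dense n x n diagonal matrix explicitly.
reference = a.T @ np.diag(d) @ a

# Broadcasting: a.T has shape (f, n) and d has shape (n,),
# so column i of a.T is scaled by d[i], which is exactly A^T D.
broadcast = np.dot(a.T * d, a)

print(np.allclose(reference, broadcast))
```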
32 / 35
Benchmark
def datagen(n,f):
    np.random.seed(0)
    a=np.random.random((n,f))
    d=np.random.random(n)
    return a,d

from timeit import timeit
print "dia_matrix :",timeit("f(a,d)",number=10,
    setup="from __main__ import f,datagen; a,d=datagen(1000000,10)")
print "csr_matrix :",timeit("g(a,d)",number=10,
    setup="from __main__ import g,datagen; a,d=datagen(1000000,10)")
print "broadcasting :",timeit("h(a,d)",number=10,
    setup="from __main__ import h,datagen; a,d=datagen(1000000,10)")
Result:
dia_matrix : 1.60458707809
csr_matrix : 1.32580018044
broadcasting : 1.30032682419
33 / 35
Conclusion
Try not to use for loops; use libraries' capabilities instead.
Knowledge about the internal structure of the sparse matrix is useful to extract further performance.
Mathematical derivation is important. The key is to find a mathematically equivalent and Python-friendly formula.
Computational speed does not necessarily matter. Finding better code in a short time is valuable. Otherwise, you shouldn't pursue it too much.
34 / 35
Acknowledgment
I would like to thank (@shima__shima), who gave me useful advice on Twitter.
35 / 35