comparing r and python for pca pydata boston 2013files.meetup.com/1676436/pydata-2013-pca.pdf ·...
TRANSCRIPT
© 2013 IBM Corporation
Comparing R and Python for PCA PyData Boston 2013
Vipin Sachdeva
Senior Engineer, IBM Research
© 2013 IBM Corporation
Comparison of R and Python for Principal Component Analysis
§ R and Python are popular choices for data analysis.
§ How do they compare in terms of programmer productivity and performance ?
§ Use a common task for both R and Python – Principal Component Analysis (PCA) – PCA is a very commonly used technique for dimension reduction.
§ Dataframes is an essential part of languages supporting data analysis – R provides data frame with numerous statistical packages. – Python has included numPy (arrays) and Pandas (dataframe) for data handling which
we use.
§ Both language have rich development environments – Rstudio for R – iPython for Python.
§ Both languages have many features that helps in data analysis. – In this talk we compare those features with some code examples to solve our
problem.
§ This talk is not about as much about principal component analysis as about programming and performance of Python and R
§ Let’s get started
© 2013 IBM Corporation
PCA – Short Introduction
§ PCA is a standard tool in modern data analysis.
§ Simple method to extract information from confusing datasets – Reduce a complex dataset to a lower dimension
• PCA projects the data along the direction where data varies the most. • Directions are determined by the direction of the eigenvectors coresponding to
largest eigenvalues
© 2013 IBM Corporation
PCA – Mathematical approaches
§ Find eigenvalues of standardized covariance matrix.
§ Choose eigenvalues with sum exceeding a threshold.
§ Reduction in dimension from N to K: – Create data with subset of eigenvalues (whose sum exceeds that threshold).
-- --
- 7 -
• How to choose the principal components?
- To choose K , use the following criterion:
K
i=1Σ i
N
i=1Σ i
> Threshold (e.g., 0.9 or 0.95)
• What is the error due to dimensionality reduction?
- We saw above that an original vector x can be reconstructed using its the prin-cipla components:
x̂ − x =K
i=1Σ biui or x̂ =
K
i=1Σ biui + x
- It can be shown that the low-dimensional basis based on principal componentsminimizes the reconstruction error:
e = ||x − x̂||
- It can be shown that the error is equal to:
e = 1/2N
i=K+1Σ i
• Standardization
- The principal components are dependent on the units used to measure the orig-inal variables as well as on the range of values they assume.
- We should always standardize the data prior to using PCA.
- A common standardization method is to transform all the data to have zeromean and unit standard deviation:
xi − ( and are the mean and standard deviation of xi’s)
-- --
- 7 -
• How to choose the principal components?
- To choose K , use the following criterion:
K
i=1Σ i
N
i=1Σ i
> Threshold (e.g., 0.9 or 0.95)
• What is the error due to dimensionality reduction?
- We saw above that an original vector x can be reconstructed using its the prin-cipla components:
x̂ − x =K
i=1Σ biui or x̂ =
K
i=1Σ biui + x
- It can be shown that the low-dimensional basis based on principal componentsminimizes the reconstruction error:
e = ||x − x̂||
- It can be shown that the error is equal to:
e = 1/2N
i=K+1Σ i
• Standardization
- The principal components are dependent on the units used to measure the orig-inal variables as well as on the range of values they assume.
- We should always standardize the data prior to using PCA.
- A common standardization method is to transform all the data to have zeromean and unit standard deviation:
xi − ( and are the mean and standard deviation of xi’s)
© 2013 IBM Corporation
PCA using Singular Value Decomposition (SVD)
§ More generalized approach for performing PCA.
§ Decompose X=UDVT
§ D*D is eigenvalues of covariance matrix.
§ Reconstruction of data by zeroing out regions as shown below
§ Choose q (as before)
Figure 5:
We can embed x into an orthogonal space via rotation. D scales, V rotates, and U is aperfect circle.
PCA cuts o↵ SVD at q dimensions. In Figure 6, U is a low dimensional representation.Examples 3 and 1.3 use q = 2 and N = 130. D reflects the variance so we cut o↵ dimensionswith low variance (remember d11 d22...). Lastly, V are the principle components.
Figure 6:
2 Factor Analysis
Figure 7: The hidden variable is the point on the hyperplane (line). The observed value isx, which is dependant on the hidden variable.
Factor analysis is another dimension-reduction technique. The low-dimension represen-tation of higher-dimensional space is a hyperplane drawn through the high dimensionalspace. For each datapoint, we select a point on the hyperplane and choose data from theGaussian around that point. These chosen points are observable whereas the point on thehyperplane is latent.
4
© 2013 IBM Corporation
PCA: What data to use ?
§ How about PCA on current 500 S&P stocks data for a “period of time” ?
§ Download symbols from S&P 500 website and create a vector.
§ Use this vector to download symbols data from 1970 to 2012 in a dataframe (if possible).
§ R and Python have various packages for financial data download – quantMod (R) – pandas.io.data.DataReader (Python)
§ Need a package that provides a single dataframe as output from a single call.
Dates MMM ABT … 01-01-1970 109.62 NA … 01-02-1970 107.12 NA NA … … … … 12-31-2012 NA 108.66 104.32
© 2013 IBM Corporation
Data Download – R only
§ I am a C/C++/Fortran HPC programmer, and I do use for loops in R and Python. – for loops are slow in R
§ Can any package return data for S&P stocks as a single dataframe ? – Use fImport package of R to download daily data. – stocksData<-yahooSeries(symbols_nospaces,from="1970-01-01",to="2012-12-31”)
#symbols_nospaces is S&P stock symbols – Extract columns with closing dates.
§ Write to a csv file for repeated runs (takes a long time to download)
§ Read the file in R/Python to get the data – read.table in R created a R dataframe – Pandas read_table created a Pandas dataframe.
§ Many symbols have NA’s for dates where data is not available.
§ Work with a subset of data – How about 200 stocks for quarter of a century (1988-2012) ?
#Snippet of code to get closing data colname<-paste(symbols_nospaces[i],".Close",sep="") print(colname) stockData_df[,i+1]<-get(colname) colnames(stockData_df)[i+1]<-symbols_nospaces[i]
© 2013 IBM Corporation
Combined Data Preparation – R and Python
§ Read the file in R/Python to get the data – read.table in R created a R dataframe – Pandas read_table created a Pandas dataframe.
§ Many symbols have NA’s for dates where data is not available.
§ Work with a subset of data – How about 200 stocks for quarter of a century (1988-2012) ?
§ Find first occurrence of “1988” in dataframe’s Dates column. – str.contains(“1970”) in Python – agrep(…,fixed=TRUE) in R.
§ Extract stock columns which do not have NA on the first trading day of 1988 – !is.na in R/ not math.isnan in Python – Get 200 stocks which satisfy above requirement
§ Result: combined data for 200 stocks from 1988-2012 in R/Python dataframes. – Drop rows with NA for any stock.
• na.omit()/drop.na() – 6162 entries for 200 stocks in total.
§ Both R and Python are remarkably similar for this step.
© 2013 IBM Corporation
Code in R and Python for data preparation R code extractData<-function(filename,yearrange, numstocks) { stocksreturns<-read.table("./stocksdata_dataframe.txt", header=T) yrpattern<-paste(yearrange[1],"-*",sep="”) stockdates<-stocksreturns$Dates sindex<-agrep(yrpattern,stockdates,fixed=TRUE)[1] eindex<-length(stockdates) colnames<-colnames(stocksreturns[i]) stocksreturns_short<-data.frame(stockdates[startindex:endindex])]) colnames(stocksreturns_short)[1]<-"Dates"
j<-2
stockindex<-0 for(i in 2:501) { if(stockindex<numstocks) { if(!is.na(stocksreturns[,i][k])) { stocksreturns_short[,j]<-stocksreturns[,i][sindex:eindex] colnames(stocksreturns_short)[j]<-colnames[i][i] j<-j+1 stockindex<-stockindex+1}}}}
9
Python code def extractData(filename,lowyear, numstocks): stocksreturns=pd.read_table('stocksdata_dataframe.txt', sep='\s+') yrpattern="%d-*" % (lowyear) x=stocksreturns['"Dates"'].str.contains(str(lowyear)) for i in range(size(x)): if(x[i]==True): break startindex=i colnames=stocksreturns.columns stocksreturns_short=DataFrame(stocksreturns.ix[startindex:size(x)]['"Dates"']) stockindex=0 for i in range(1,501): if stockindex<numstocks: if not math.isnan(stocksreturns[colnames[i]][startindex]): stocksreturns_short[colnames[i]]=stocksreturns[colnames[i]][startindex:len(stocksreturns.index)] stockindex=stockindex+1 return stocksreturns_short
© 2013 IBM Corporation
R/Python packages for PCA
§ Size of data is only about 23 MB in txt format. – No memory-bound issues running on my laptop.
§ R has many choices (one too many) for PCA: – prcomp/princomp/PCA/dudi.pca/acp – prcomp scales and centers data (very convenient)
• prcomp(stocksreturns_short,scale=TRUE,center=TRUE,retx=TRUE) • Reconstruct data with predict function. • prcomp uses svd beneath the covers
§ Python seems to have several choices for PCA as well. – matplotlib.mca.pca – MDP (module for data processing) PCA – numpy.eig/scipy.eig etc
§ Both packages seem to have adequate support for PCA in multiple ways.
§ Our approach: Use SVD in both R/Python: – Do same operations and compare runtimes. – svd in numpy returns transpose(V), while R returns V – Both R and Python return d as a vector; trivial to make a diagonal matrix for
reconstruction of data. – Things start in Python from 0; in R from 1 J
covariance matrix/eigenvalues/eigenvectors approach.
© 2013 IBM Corporation
PCA on combined data using SVD
§ PCA is a SVD operation:
§ X is stocks data (6162x200)
§ D*D is eigenvalues. (p=200)
§ Reconstruction of data by zeroing out regions as shown below
§ Choose q (explained ahead)
Figure 5:
We can embed x into an orthogonal space via rotation. D scales, V rotates, and U is aperfect circle.
PCA cuts o↵ SVD at q dimensions. In Figure 6, U is a low dimensional representation.Examples 3 and 1.3 use q = 2 and N = 130. D reflects the variance so we cut o↵ dimensionswith low variance (remember d11 d22...). Lastly, V are the principle components.
Figure 6:
2 Factor Analysis
Figure 7: The hidden variable is the point on the hyperplane (line). The observed value isx, which is dependant on the hidden variable.
Factor analysis is another dimension-reduction technique. The low-dimension represen-tation of higher-dimensional space is a hyperplane drawn through the high dimensionalspace. For each datapoint, we select a point on the hyperplane and choose data from theGaussian around that point. These chosen points are observable whereas the point on thehyperplane is latent.
4
© 2013 IBM Corporation
PCA on combined data – Approach
§ Perform a SVD of stock returns data.
§ Find number of eigenvalues q comprising 50%,75%,90% and 100% of sum of all the eigenvalues
– Eigenvalues=d*d from SVD
§ Zero out remaining eigenvectors/eigenvalues – In Python, use copy.copy for copying eigenvalues/vectors from SVD (assignment is
done using references)
§ Reconstruct data with matrix-multiply operations. – X_reconstructed=U*D*t(V) – Measure the std_dev(data_reconstructed-original_data)
© 2013 IBM Corporation
Code in Python for PCA on combined data colnames=stocksreturns_short.columns diff_data_combined=np.zeros(shape=(stocksreturns_short.shape[0]-1,stocksreturns_short.shape[1])) for i in range(0,numstocks): diff_data_combined[:,i]=diff(stocksreturns_short[:,i+1]) [u_original,d_original,v_original]=np.linalg.svd(diff_data_combined,full_matrices=False) d_diag_original=diag(d_original) eigvals_combined = d_original*d_original totalsumeigvals=sum(eigvals_combined) for percent in eigvalspercent:
sumeigvals=0 for i in range(0,200): sumeigvals=sumeigvals+eigvals_combined[i] if sumeigvals>=(percent*totalsumeigvals):
neigvals=i+1 break u=copy.copy(u_original) d_diag=copy.copy(d_diag_original) v=copy.copy(v_original) nvals=shape(diff_data_combined)[1] u[:][neigvals:nvals]=0 d_diag[neigvals:nvals][neigvals:nvals]=0 v[neigvals:nvals][:]=0 dproduct=np.dot(u,np.dot(d_diag,v))
13
© 2013 IBM Corporation
Code in R for PCA on combined data data.combined<-diff_stockyr svd.combined<-svd(data.combined) #find SVD eigvalues.combined<-svd.combined$d * svd.combined$d totalsum<-sum(eigvalues.combined) proportionrange<-c(0.5,0.75,0.90,1) for(proportion in proportionrange){ sum<-0 neigvalues<-0 for(i in 1:numstocks) { sum<-sum+eigvalues.combined[i] neigvalues<-neigvalues+1 if((sum/totalsum)>=proportion) { cat(sprintf("Number of eigenvalues for combined data for proportion %f = %d\n",proportion,neigvalues)) break; }} nvals<-dim(data.combined)[2] u<-svd.combined$u d<-diag(svd.combined$d) v<-svd.combined$v #Copy SVD matrices u[,(neigvalues):nvals]<-0 d[(neigvalues):nvals,(neigvalues):nvals]<-0 v[,(neigvalues):nvals]<-0 stock.data<-u %*% d %*% t(v) #Do a matrix multiply to get data
14
© 2013 IBM Corporation
PCA on combined data results
• 138 stocks out of 200 account for 90% of the sum of all the eigenvalues • Reconstruct data with 138 stocks has negligible error (10^-5)
© 2013 IBM Corporation
Yearly PCA
§ Instead of doing a PCA on combined data from 1988-2012, how about yearly PCA ? – PCA on yearly data
§ Separate the combined dataframe into yearly dataframes (1 for each year).
§ Number of observations vary for each year.
§ Calculate number of eigenvalues accounting for 50%, 75% ,90% ,100% of sum of all eigenvalues (same operation as PCA on combined data)
§ Do a reconstruction for each proportion/each year. (step operation as PCA on combined data)
– 25 separate PCA’s – 100 reconstructions in total.
Dates MMM ABT …
01-01-1988 109.62 NA …
01-02-1988 107.12 NA NA
01-03-1988 … NA …
Dataframe 1..
Dates MMM ABT …
01-01-2012 109.62 NA …
01-02-2012 107.12 NA NA
01-03-2012 … NA …
…..Dataframe 25
© 2013 IBM Corporation
Code in Python for extracting yearly dataframes
Python code colnames=stocksreturns_short.columns x=stocksreturns_short['"Dates"'].str.contains(str(years)) stocksreturns_yr=stocksreturns_short.ix[x] shape0,shape1=np.shape(stocksreturns_yr) diff_data_yr=np.zeros((shape0-1,shape1)) for i in range(0,numstocks): data_yr[:,i]=(stocksreturns_yr[:][colnames[i+1]]) diff_data_yr[:,i]=np.diff(data_yr[:,i])
17
R code yrpattern<-paste(years,"-*",sep="") yrindices<-agrep(yrpattern,stocksreturns_short$Dates,fixed=TRUE) val_stockyr<-data.frame(stocksreturns_short$Dates[yrindices[1]:tail(yrindices,n=1)]) colnames(val_stockyr)[1]<-"Dates" for(i in 2:(numstocks+1)) { val_stockyr[,i]<-stocksreturns_short[,i][yrindices[1]:tail(yrindices,n=1)] colnames(val_stockyr)[i]<-colnames(stocksreturns_short)[i] } diff_stockyr<-data.frame(matrix(NA,nrow=(dim(log_stockyr)[1]-1),ncol=numstocks)) for(j in 1:200) diff_stockyr[,j]<-diff(log_stockyr[,j])
© 2013 IBM Corporation
Analysis of yearly PCA data
§ Number of eigenvalues with 50% of total sum drops to 1 in 2008
§ Stock movement is highly correlated due to macro-economic trends.
© 2013 IBM Corporation
PCA on yearly data
§ Being a C/C++/Fortran HPC programmer, I use for loops in R/Python
§ Not efficient for R (for loop is an object; assignment in R is a copy operation)
§ Python’s assignment is done with references so it works better with for loops, and lesser overhead of functions.
§ Development Environment: – ipython-2.7 with pandas and numpy installed through ports package – Rstudio 0.97 with R binary downloaded for Mac – No attempt to optimize the build for either R and Python.
§ Total code for R takes above 20 seconds versus about 11.9 seconds for Python on my Macbook Pro.
§ Timings may change with less reliance on for loops in the code.
© 2013 IBM Corporation
Parallelizing yearly PCA
§ Can we use parallelism in R and Python productively ?
§ Both R and Python provide several ways for parallelization – Multiple cores – Distributed parallelism using MPI or sockets
§ Use coarse-grained parallelism to speed up our computations. – Look into how both packages allow use of multicores on modern day processors
§ Very easy to apply coarse-grained parallelism to yearly PCA – Divide years amongst threads/processes.
§ For R use doMC/foreach package that works on the multiple cores.
§ Python threads does not work well due to global interpreter lock (GIL). – Use iPython ipcluster parallelization framework.
§ Further evaluation using MPI on distributed clusters needed.
© 2013 IBM Corporation
Parallelizing yearly PCA in R
§ foreach depends on a backend for execution § We register DoMC (multiple cores) as backend for the yearly PCA in this case § MPI can also be used as a backend for distributed clusters. § Snow package another option (higher level for distributed clusters). § Not just limited to for loops:
§ Use mclapply for multi-core lapply etc.
#Sequential code
for(years in 1988:2012) { for(proportion in c(0.5,0.75,0.90,1){ Code to extract yearly data, do PCA and then reconstruct
}}
#Parallel code
registerDoMC(4) #Register multicore as backend with 4 cores
foreach(years in 1988:2012) %dopar%{ for(proportion in c(0.5,0.75,0.90,1){ Code to extract yearly data, do PCA and then reconstruct
§ }}
© 2013 IBM Corporation
Timing Results
Intel Core i7 Macbook Pro (4 cores, 8 hyper-threading threads)
Threads
© 2013 IBM Corporation
Parallelizing yearly PCA in Python
§ Using iPython’s Direct Interface – Start backend of iPython – % ipcluster-2.7 start –n 4 (4 is the number of processes)
§ Rewrite pcaData function so that it can be used with the map API of Python – pcaData(stocksreturns_short,year) – pcaData extracts data for year from stocksreturns_short, performs a SVD and then
reconstructions with eigenvalues percentages as before. § Processes (unlike threads in R) makes us reimport all the modules inside the function.
– Higher memory footprint – More heavyweight compared to threads.
§ Create a list for each process’s function arguments. § Parallelize across years as in R
– Each process computes a subset of SVD’s and reuses a single SVD for 4 reconstructions.
§ #code for map_async x=[] for i in range(0,25): x.append(stocksreturns_short) starttime=datetime.now() map_sync(pcaData,x,range(1988,2013)) print(datetime.now()-starttime)
© 2013 IBM Corporation
Timing results in Python
§ Starting ipcluster=8 leads to processes hanging. § ipython with multiple processes led to some memory issues. § Scalability of Python shows similar trend as R
Threads
© 2013 IBM Corporation
Summary
§ Both R and Python offer good choices for PCA
§ R has many packages for tasks such as downloading financial data, PCA etc. – Python has a good support as well.
§ R offers a cohesive framework – Installing packages is pain-free – Parallelization in R is very simple.
§ R seems to be slower as assignment operator requires copy operations which is a lot of overhead (and my use of for loops).
§ Python is more forgiving of usage of for loop, and seems to require lesser statements to do the same work.
– Pandas/Numpy adds dataframe capabilities to Python’s native string handling capabilities to provide a strong platform for data analysis.
© 2013 IBM Corporation
Future Work
§ Profiling of code at statement level etc.
§ How does R/Python work for memory-bound/compute-bound problems ?
§ Work with Distributed matrices (disnumpy for Python,r-pbd for R)
§ Use MPI as backend for parallelization on a cluster
§ Make interpreted code faster for both R/Python through compilers(cmpfun for R, Cython for Python)