Ramon Huerta ECSS Project Summary
Robert Sinkovits, San Diego Supercomputer Center
5/21/13


Page 1

SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Ramon Huerta ECSS Project Summary

Robert Sinkovits
San Diego Supercomputer Center

5/21/13

Page 2


Classification of time series data

Chemical sensors (e-noses) will be placed in the homes of elderly participants to continuously and non-intrusively monitor their living environments. Time series classification algorithms will then be applied to the sensor data to detect anomalous behavior that may suggest a change in health status.

Source: Herb Hauser (U. Scranton) and Ramon Huerta (UCSD). Used by permission, 2012.

In preparation for the pilot study, time series classification software developed by the Huerta lab has been ported to Gordon.

After optimizing the code, linking Intel's MKL, and parallelizing key loops, calculations that had taken 15.5 hours on a local workstation can now be completed in 4 minutes on a single Gordon compute node.

Page 3


Classification of time series data

Source: Herb Hauser (U. Scranton) and Ramon Huerta (UCSD). Used by permission, 2012.

Electronic nose containing 8 metal oxide sensors

Page 4


Summary of optimizations

Notes                                           Cores   Run time   Speedup
Original code, GNU compiler                       1     11:22:00      –
Switch to Intel compiler and enable AVX           1     05:41:49      2.0
Link threaded MKL library, run in parallel       16     00:14:46     46.2
OpenMP directives in loops in kAR and kARtest    16     00:13:10     52.5
Remove duplicate call to kARtest                 16     00:07:58     85.6
Optimization of DSYRK operations                 16     00:04:04    167.7

The original version of the code was serial, compiled with the GNU C++ compiler and linked against the default LAPACK libraries. By changing the compiler and compiler options, linking MKL, enabling threaded execution, eliminating redundant calculations, and parallelizing loops, we obtained a 167x speedup.

Reported speedups are relative to a single core on Gordon. Porting from the Huerta lab workstation (Intel Nehalem) to Gordon resulted in a 1.3x reduction in run time.

Page 5


The lowest hanging fruit – GNU to Intel

Simply changing from "gcc -O3" to "icpc -O3 -xHOST" resulted in a 2x speedup, with absolutely no effort required.

The 2x speedup is exactly what we expect from enabling AVX, since most of the time is spent in linear algebra routines, which contain vectorizable loops that benefit from AVX.


Page 6


The big payoff – MKL and threading

Using Intel's Math Kernel Library (MKL) gives two big advantages: it is tuned for best performance on Intel hardware, and many of its routines are threaded.

Replace "-llapack -lblas" with "-mkl", set OMP_NUM_THREADS to 16 (the number of cores on a Gordon node), and add the headers (omp.h, mkl.h).

A little more work was needed to modernize the function call syntax and get rid of name mangling (e.g., dgemm_(…) → dgemm(…)).


Page 7


Profile, optimize, repeat

Before linking to threaded MKL, the linear algebra routines accounted for virtually all of the run time. Re-profiling the optimized code indicated that a few loops accounted for ~10% of the run time. Simple parallelization using OpenMP directives gave a modest speedup.


Page 8


The simple things will get you – redundant call

The functions kAR and kARtest call the linear algebra routines and comprise the guts of the autoregressive kernel algorithm. A simple programming oversight added 60% to the run time.


Original (kARtest(i,j) is evaluated twice per iteration):

    printf("K=%lf\n", kARtest(i,j)); fflush(stdout);
    Ei += ASV[j]*kARtest(i,j);

Modified (the result is computed once and reused):

    double kARtest_result = kARtest(i,j);
    printf("K=%lf\n", kARtest_result); fflush(stdout);
    Ei += ASV[j]*kARtest_result;

Page 9


Using my brain – optimizing calls to DSYRK

At this point, the linear algebra routine DSYRK was accounting for about 50% of the wall time.

DSYRK = double precision symmetric rank-k update: C := αAAᵀ + βC

At the expense of some additional memory usage (a time-space tradeoff) and modifications to the call tree, the time spent in DSYRK can be sharply reduced.

Rewrite A as the concatenation of two matrices B and C:

        [ a11 a12 a13 a14 ]   [ b11 b12 c11 c12 ]
    A = [ a21 a22 a23 a24 ] = [ b21 b22 c21 c22 ] = [ B  C ]
        [ a31 a32 a33 a34 ]   [ b31 b32 c31 c32 ]
        [ a41 a42 a43 a44 ]   [ b41 b42 c41 c42 ]

Then AAᵀ can be expressed as BBᵀ + CCᵀ:

    AAᵀ = [ B  C ][ B  C ]ᵀ = [ B  C ][ Bᵀ ]  = BBᵀ + CCᵀ
                                      [ Cᵀ ]

Page 10


Using my brain – optimizing calls to DSYRK (cont.)

Original:

    Loop over i
      Loop over j
        A ← [Bi Cj]
        DSYRK(A, …)

A deeper inspection of the code indicated considerable opportunity for reuse of partial calculations: the O(N²) matrix multiplications can be replaced with O(N).

Restructured:

    Loop over i
      BBTi ← (Bi)(Bi)ᵀ
    Loop over j
      CCTj ← (Cj)(Cj)ᵀ
    Loop over i
      Loop over j
        AAT ← BBTi + CCTj


Page 11


Huerta collaboration – lessons learned

•  We (i.e., the ECSS consultants) may think obsessively about compilers and instruction set architectures, but most of our users do not. Go for the low hanging fruit first (GNU → Intel/PGI, -O3, -xHOST).

•  All numerical libraries are not equal. You can potentially get a big win just by linking a more optimized version.

•  Profile, optimize, repeat. After significant progress is made, new hotspots may emerge.

•  Dig deep and look for things that the code developers may never have thought about:
   •  53x to 85x speedup by eliminating an unnecessary call
   •  85x to 167x using a clever time-space tradeoff

•  You can optimize yourself out of a job. This work revolutionized the productivity of the Huerta lab, but they no longer need supercomputers (for now?).