
Intel Math Kernel Library (MKL)

Clay P. Breshears, PhD

Intel Software College

NCSA Multi-core Workshop

July 24, 2007

Performance Libraries: Intel® Math Kernel Library (MKL)

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda

Performance Features

The Library Sections

• BLAS

• LAPACK*

• DFTs

• VML

• VSL

SciMark 2.0 Optimization Case Study (from Henry Gabb)

• SciMark 2.0 overview

• Tuning with the Intel compiler

• Tuning with the Intel Math Kernel Library

Intel® Math Kernel Library Purpose

Performance, Performance, Performance!

Intel’s engineering, scientific, and financial math library

Addresses:

• Solvers (BLAS, LAPACK)

• Eigenvector/eigenvalue solvers (BLAS, LAPACK)

• Some quantum chemistry needs (dgemm)

• PDEs, signal processing, seismic, solid-state physics (FFTs)

• General scientific, financial [vector transcendental functions (VML) and vector random number generators (VSL)]

Tuned for Intel® processors – current and future

Intel® Math Kernel Library Purpose – Don’ts

But don’t use Intel® Math Kernel Library (Intel® MKL) on …

Don’t use Intel® MKL on “small” counts

Don’t call vector math functions on small n

[Figure: geometric transformation, in which a transformation matrix maps a point to (X’, Y’, Z’, W’)]

But you could use Intel® Performance Primitives
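To give a sense of scale, the geometric transformation on this slide applies a 4 x 4 matrix to a 4-component point, which is only 16 multiplications and 12 additions. The sketch below writes that out in plain C. The function name `transform_point` is hypothetical and is not an Intel MKL or Intel Performance Primitives routine. The row-major layout is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Apply a 4x4 row-major transformation matrix m to the homogeneous
   point in = (X, Y, Z, W), writing (X', Y', Z', W') to out.
   Hypothetical helper for illustration only. */
void transform_point(const double m[16], const double in[4], double out[4])
{
    assert(in != out);  /* out must not alias in */
    for (size_t row = 0; row < 4; row++) {
        double sum = 0.0;
        for (size_t col = 0; col < 4; col++)
            sum += m[4 * row + col] * in[col];
        out[row] = sum;
    }
}
```

With so little arithmetic per call, the fixed cost of calling into a general-purpose library can plausibly outweigh the work itself, which is the point of the "small counts" warning.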

Intel® Math Kernel Library Environment

Support 32-bit and 64-bit Intel® processors

Large set of examples and tests

Extensive documentation

            Windows*            Linux*
Compilers   Intel, Microsoft    Intel, GNU
Libraries   .dll, .lib          .a, .so

Resource Limited Optimization

The goal of all optimization is maximum speed

Resource limited optimization – exhaust one or more resources of the system:

• CPU: Register use, FP units

• Cache: Keep data in cache as long as possible; deal with cache interleaving

• TLBs: Maximally use data on each page

• Memory bandwidth: Minimally access memory

• Computer: Use all the processors/cores available using threading

• System: Use all the nodes available (cluster software)
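As an illustration of the cache bullet above, here is a minimal sketch of loop tiling (blocking) for a matrix multiply. This is not Intel MKL source code. The function name `matmul_blocked` and the tile edge `TILE` are assumptions chosen for the example, and a tuned library would select block sizes per processor.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* assumed tile edge; would be tuned to the cache size */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C += A * B for n x n row-major matrices, processed tile by tile so
   that each tile of A, B, and C is reused while it is still in cache. */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t i_end = min_size(ii + TILE, n);
                size_t k_end = min_size(kk + TILE, n);
                size_t j_end = min_size(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

The result is the same as the plain triple loop. Only the order of memory accesses changes, so each operand is touched many times before it leaves the cache.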

Threading

Most of Intel® Math Kernel Library could be threaded but:

• Limited resource is memory bandwidth

• Threading level 1 and level 2 BLAS are mostly ineffective ( O(n) )

There are numerous opportunities for threading:

• Level 3 BLAS ( O(n^3) )

• LAPACK* ( O(n^3) )

• FFTs ( O(n log n) )

• VML, VSL: depends on processor and function
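One way to see why memory bandwidth limits the lower BLAS levels is to compare floating-point operations with memory traffic. The sketch below does this for one Level 1 and one Level 3 operation. It assumes each operand is read once and each result is written once, which ignores caching. The function names are hypothetical.

```c
#include <assert.h>

/* Floating-point operations per double moved to or from memory,
   assuming each operand is read once and each result written once. */

/* Level 1 example, y = a*x + y: 2n flops over 3n memory accesses. */
double axpy_intensity(double n)
{
    assert(n > 0.0);
    return (2.0 * n) / (3.0 * n);
}

/* Level 3 example, C = A*B + C: 2n^3 flops over 4n^2 memory accesses. */
double gemm_intensity(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n * n) / (4.0 * n * n);
}
```

Under these assumptions the vector update performs less than one operation per memory access at any size, so extra threads mostly wait on memory. The matrix multiply ratio grows with n (500 at n = 1000), which leaves room for additional cores to do useful work.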

All threading is via OpenMP*

All Intel MKL is designed and compiled for thread safety

SciMark 2.0

Produced by the National Institute of Standards and Technology

ANSI C and Java versions available

Five floating-point-intensive kernels

• FFT: Compute a complex 1D FFT

• SOR: Jacobi successive over-relaxation in 2D

• MC: Compute π by Monte Carlo integration

• MV: Sparse matrix-vector multiplication

• LU: Dense matrix LU factorization

SciMark 2.0 Problem Sizes

Benchmark   Small problem            Large problem
FFT         N = 1024                 N = 1048576
SOR         100 x 100                1000 x 1000
MC          Problem size not fixed; no distinction between small and large problems
MV          N = 1000, NZ = 5000      N = 100000, NZ = 1000000
LU          100 x 100                1000 x 1000

Benchmark System

Hardware

CPU (dual-processor system) 3.6 GHz Xeon (2 MB L2 cache) EM64T

Motherboard Intel Server Board SE7520AF2

Memory 512 MB DDR2

Version P06

Adjacent Cache Line Prefetch ON

Hardware Prefetch ON

Hyper-Threading Technology OFF

Software

Operating system Red Hat Enterprise Linux AS3

Linux kernel 2.4.21-20.EL #1 SMP

Intel C++ Compiler for Linux 8.1 (l_cce_pc_8.1.024)

Intel Cluster MKL 7.2 (l_cluster_mkl_7.2.008)

GNU C Compiler gcc 3.2.3

GNU Performance Baseline

[Figure: SciMark 2.0 performance with gcc, default vs. optimized builds, for small and large problems]

Aggressive optimization significantly improves performance relative to the default optimization level. The following gcc options were used to establish baseline performance: -O3 -march=nocona -ffast-math -mfpmath=sse

Intel C++ Compiler for Linux

Performance

• Automatic vectorization

• Streaming SIMD Extensions 3

• IPO and PGO

• Automatic parallelization and OpenMP support

• Automatic CPU dispatch

• Much more...

Compatibility

• Source and object compatible with gcc and g++

• Supports GNU inline ASM

• ANSI/ISO C/C++ standards compliance

• Conforms to the C++ ABI standard

• Integrated with the Eclipse IDE

Tuning SciMark 2.0 with the Intel Compiler

[Figure: SciMark 2.0 performance, GNU vs. Intel compiler, for small and large problems]

The Intel C++ Compiler for Linux improves SciMark 2.0 performance relative to the GNU baseline. Intel compiler options: -O3 -xP -ipo -fno-alias.

Intel® Math Kernel Library Contents BLAS

BLAS (Basic Linear Algebra Subprograms)

Level 1 BLAS – vector-vector operations

• 15 function types

• 48 functions

Level 2 BLAS – matrix-vector operations

• 26 function types

• 66 functions

Level 3 BLAS – matrix-matrix operations

• 9 function types

• 30 functions

Extended BLAS – level 1 BLAS for sparse vectors

• 8 function types

• 24 functions

Intel® Math Kernel Library Contents LAPACK

LAPACK (linear algebra package)

• Solvers and eigensolvers. Many hundreds of routines total!

• There are more than 1000 total user callable and support routines

DFTs (Discrete Fourier transforms)

• Mixed radix, multi-dimensional transforms

• Multithreaded

VML (Vector Math Library)

• Set of vectorized transcendental functions

• Most of libm functions, but faster

VSL (Vector Statistical Library)

• Set of vectorized random number generators

Intel® Math Kernel Library Contents

BLAS and LAPACK* are both Fortran

• Legacy of high performance computation

VSL and VML have Fortran and C interfaces

DFTs have Fortran 95 and C interfaces

cblas interface available

• More convenient for a C/C++ programmer

Intel® Math Kernel Library Optimizations in LAPACK*

Most important LAPACK optimizations:

• Threading – effectively uses multiple cores

• Recursive factorization

• Reduces scalar time (Amdahl’s law: t = t_scalar + t_parallel / p)

• Extends blocking further into the code
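The Amdahl's law expression above can be evaluated directly to see why reducing scalar time matters. The following is a minimal sketch. The function names are hypothetical and the timings in the note are made-up inputs, not measurements.

```c
#include <assert.h>

/* Run time on p processors under Amdahl's law:
   t = t_scalar + t_parallel / p. */
double amdahl_time(double t_scalar, double t_parallel, int p)
{
    assert(p >= 1);
    return t_scalar + t_parallel / (double)p;
}

/* Speedup relative to one processor. */
double amdahl_speedup(double t_scalar, double t_parallel, int p)
{
    return amdahl_time(t_scalar, t_parallel, 1) /
           amdahl_time(t_scalar, t_parallel, p);
}
```

For example, with 1 unit of scalar time and 8 units of parallel time, four processors give a run time of 3 units, a speedup of 3 rather than 4. Shrinking the scalar part is the only way to move that figure closer to p.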

No runtime library support required

Tuning the SciMark 2.0 LU Kernel

Replacing the SciMark 2.0 LU kernel with the LAPACK dgetrf function requires attention to detail:

• SciMark 2.0 is written in C

• LAPACK defines a Fortran interface

• C is call-by-value

• Fortran is call-by-reference

• C uses row-major ordering

• Fortran uses column-major ordering

• For best performance, dgetrf requires data to be contiguous in memory

• SciMark 2.0 LU kernel allocates a 2D array as pointers-to-pointers (not necessarily contiguous in memory)
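The layout issues listed above can be handled by copying the matrix into one contiguous, column-major buffer before the LAPACK call. The sketch below shows only that copy. The helper name `to_column_major` is hypothetical and is not part of SciMark 2.0 or Intel MKL.

```c
#include <assert.h>
#include <stdlib.h>

/* Copy an n x n matrix stored as an array of row pointers (the
   pointers-to-pointers layout described above) into one contiguous
   column-major buffer, the layout a Fortran routine expects.
   Returns NULL if allocation fails; the caller frees the buffer. */
double *to_column_major(double **rows, int n)
{
    assert(rows != NULL && n > 0);
    double *buf = malloc((size_t)n * (size_t)n * sizeof *buf);
    if (buf == NULL)
        return NULL;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            buf[(size_t)j * n + i] = rows[i][j];  /* element (i, j) */
    return buf;
}
```

Because Fortran is call-by-reference, the scalar arguments of the subsequent dgetrf call (dimensions, leading dimension, info) would be passed by address from C. The exact prototype should be taken from the MKL headers.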

Tuning the SciMark 2.0 LU Kernel

[Figure: SciMark 2.0 LU kernel performance for small and large problems: GNU baseline, Intel compiler, Intel MKL LAPACK, Intel MKL LAPACK + OpenMP]

The Intel MKL LAPACK significantly improves performance over the original SciMark 2.0 LU source code.

Intel® Math Kernel Library Contents Discrete Fourier Transforms

One dimensional, two-dimensional, three-dimensional…

Multithreaded

Mixed radix

User-specified scaling, transform sign

Transforms on embedded matrices

Multiple one-dimensional transforms on single call

Strides

C and F90 interfaces; FFTW interface support

Using the Intel® Math Kernel Library DFTs

Basically a 3-step Process

Create a descriptor

Status = DftiCreateDescriptor(MDH, …)

Commit the descriptor (instantiates it)

Status = DftiCommitDescriptor(MDH)

Perform the transform

Status = DftiComputeForward(MDH, X)

Optionally free the descriptor

MDH: MyDescriptorHandle

Tuning the SciMark 2.0 FFT Kernel

#include <mkl.h>

int N = 1024;                       // Size of SciMark 2.0 small FFT problem
double scale = 1.0 / (double)N;

// SciMark creates a random vector of size 2*N
// to hold real and imaginary parts
double *x = RandomVector ((2 * N), R);
DFTI_DESCRIPTOR *dftiHandle;        // Structure for MKL DFT descriptor

DftiCreateDescriptor (&dftiHandle,  // Transform descriptor
                      DFTI_DOUBLE,  // Precision
                      DFTI_COMPLEX, // Complex-to-complex
                      1,            // Number of dimensions
                      N);           // Size of transform

// Apply scaling factor to backward transform
DftiSetValue (dftiHandle, DFTI_BACKWARD_SCALE, scale);

DftiCommitDescriptor (dftiHandle);

DftiComputeForward (dftiHandle, x); // Apply DFT to array x
DftiComputeBackward (dftiHandle, x);

DftiFreeDescriptor (&dftiHandle);

Tuning the SciMark 2.0 FFT Kernel

[Figure: SciMark 2.0 FFT kernel performance for small and large problems: GNU baseline, Intel compiler, Intel MKL DFT]

The Intel MKL DFT significantly improves performance over the original SciMark 2.0 FFT source code.

Intel® Math Kernel Library Contents Vector Math Library (VML)

Vector Math Library: vectorized transcendental functions – like libm but better (faster)

Interface: Have both Fortran and C interfaces

Multiple accuracies

• High accuracy ( < 1 ulp )

• Lower accuracy, faster ( < 4 ulps )

Special value handling: √(-a), sin(0), and so on

Error handling – cannot duplicate libm here

VML: Why Does It Matter?

It is important for financial codes (Monte Carlo simulations)

• Exponentials, logarithms

Other scientific codes depend on transcendental functions

Error functions can be big time sinks in some codes

Intel® Math Kernel Library Contents Vector Statistical Library (VSL)

Set of random number generators (RNGs)

Numerous non-uniform distributions

VML used extensively for transformations

Parallel computation support – some functions

User can supply own BRNG or transformations

Five basic RNGs (BRNGs)

• MCG31, R250, MRG32, MCG59, WH

Non-Uniform RNGs

Gaussian (two methods)

Exponential

Laplace

Weibull

Cauchy

Rayleigh

Lognormal

Gumbel

Using VSL

Basically a 3-step Process

Create a stream pointer

VSLStreamStatePtr stream;

Create a stream

vslNewStream(&stream, VSL_BRNG_MCG31, seed);

Generate a set of random numbers

vsRngUniform(VSL_METHOD_SUNIFORM_STD, stream, size, out, start, end);

Delete a stream (optional)

vslDeleteStream(&stream);

Calculating Pi by Monte Carlo

(# of darts hitting circle) / (# of darts in square) = (π r² / 4) / r² = π / 4

π = 4 × (# of darts hitting circle) / (# of darts in square)

hits = 0

Loop I = 1 to N_samples
    x.coor = random [0..1]
    y.coor = random [0..1]
    dist = sqrt (x^2 + y^2)
    if dist <= 1
        hits = hits + 1

Pi = 4 * hits / N_samples
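The pseudocode above can be sketched in portable C without any library support. The random number generator here is a stand-in chosen for the example: a Park-Miller multiplicative congruential generator. It is not one of the Intel MKL VSL generators, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in generator (assumption): the Park-Miller multiplicative
   congruential generator x = 48271 * x mod (2^31 - 1). */
static uint32_t lcg_next(uint32_t *state)
{
    *state = (uint32_t)(((uint64_t)*state * 48271u) % 2147483647u);
    return *state;
}

/* Uniform double strictly between 0 and 1. */
static double lcg_uniform(uint32_t *state)
{
    return (double)lcg_next(state) / 2147483647.0;
}

/* Estimate pi by throwing n_samples darts at the unit square and
   counting those that land inside the quarter circle. */
double estimate_pi(long n_samples, uint32_t seed)
{
    assert(n_samples > 0 && seed > 0 && seed < 2147483647u);
    uint32_t state = seed;
    long hits = 0;
    for (long i = 0; i < n_samples; i++) {
        double x = lcg_uniform(&state);
        double y = lcg_uniform(&state);
        if (x * x + y * y <= 1.0)  /* equivalent to sqrt(...) <= 1 */
            hits++;
    }
    return 4.0 * (double)hits / (double)n_samples;
}
```

Comparing the squared distance with 1 avoids the square root without changing which darts count as hits. The statistical error of the estimate shrinks roughly as one over the square root of the sample count, so many samples are needed for even a few correct digits.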

Tuning the SciMark 2.0 MC Kernel

#include <math.h>   // sqrt
#include <mkl.h>

// BLOCK_SIZE and SEED are assumed to be defined elsewhere in the benchmark source.

double MonteCarlo_integrate (int Num_samples)
{
    int i, j, blocks, under_curve = 0;
    static double rnBuf[2 * BLOCK_SIZE];
    double rnX, rnY;
    VSLStreamStatePtr stream;

    // Samples are drawn in whole blocks, so Num_samples is assumed
    // to be a multiple of BLOCK_SIZE.
    blocks = Num_samples / BLOCK_SIZE;
    vslNewStream (&stream, VSL_BRNG_MCG31, SEED);

    for (i = 0; i < blocks; i++)
    {
        // Fill the buffer with BLOCK_SIZE (x, y) pairs in one call
        vdRngUniform (VSL_METHOD_DUNIFORM_STD, stream,
                      (2 * BLOCK_SIZE), rnBuf, 0.0, 1.0);

        for (j = 0; j < BLOCK_SIZE; j++)
        {
            rnX = rnBuf[2*j];
            rnY = rnBuf[2*j+1];
            if (sqrt(rnX*rnX + rnY*rnY) <= 1.0)
                under_curve++;
        }
    }
    vslDeleteStream (&stream);

    return ((double) under_curve / Num_samples) * 4.0;
}

Tuning the SciMark 2.0 MC Kernel

[Figure: SciMark 2.0 MC kernel performance: GNU baseline, Intel compiler, Intel MKL VSL, Intel MKL VSL + OpenMP]

The Intel MKL VSL significantly improves performance over the original SciMark 2.0 MC source code.

Best SciMark 2.0 Single Node Performance: Small Problems

[Figure: bar chart of the small-problem results, GNU vs. Intel]

Small Problems (MFLOPS)

        GNU    Intel   Speedup
FFT     510    1817    3.6
SOR     524    1092    2.1
MC      206    1003    4.9
MV      857     832    1.0
LU      884    1827    2.1
Comp.   596    1314    2.2

Best SciMark 2.0 Single Node Performance: Large Problems

[Figure: bar chart of the large-problem results, GNU vs. Intel]

Large Problems (MFLOPS)

        GNU    Intel   Speedup
FFT      45     600    13.3
SOR     495    1015     2.1
MC      206    1003     4.9
MV      453     457     1.0
LU      392    6646    16.9
Comp.   318    1944     6.1
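The Comp. rows in both tables are consistent with an arithmetic mean of the five kernel scores. The slides do not state this, so treat it as an inference from the numbers. A minimal sketch of that calculation, with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n kernel scores in MFLOPS. */
double composite_score(const double *mflops, size_t n)
{
    assert(mflops != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += mflops[i];
    return sum / (double)n;
}
```

Applied to the large-problem rows (FFT, SOR, MC, MV, LU), the GNU column averages to about 318 and the Intel column to about 1944, a ratio of about 6.1. Because the mean is taken over MFLOPS, the large LU gain dominates the composite speedup.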

Intel® Cluster MKL

Intel Cluster MKL is a superset of MKL for solving large linear algebra problems on a cluster

Intel Cluster MKL contains:

• ScaLAPACK (Scalable LAPACK)

• BLACS (Basic Linear Algebra Communication Subprograms)

Supports MPICH and the Intel MPI Library

Data Layout Critical to Parallel Performance

ScaLAPACK uses 2D block-cyclic data distribution

Example layouts of a lower triangular matrix for four processes

[Figure: three layouts in order of improving load balance (poor to better): 1D block distribution, 1D block-cyclic distribution, 2D block-cyclic distribution]
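The block-cyclic idea can be stated as a simple ownership rule: blocks of consecutive indices are dealt out to processes round-robin. The sketch below expresses that rule in C. The function names are hypothetical, and the row-major numbering of ranks within the process grid is an assumption of the example rather than something ScaLAPACK requires.

```c
#include <assert.h>

/* Process coordinate (0..nprocs-1) that owns global index i when
   indices are dealt out in blocks of nb, round-robin over nprocs
   processes (the 1D block-cyclic rule). */
int block_cyclic_owner(int i, int nb, int nprocs)
{
    assert(i >= 0 && nb > 0 && nprocs > 0);
    return (i / nb) % nprocs;
}

/* Rank in a row-major prow x pcol process grid that owns matrix entry
   (i, j) under a 2D block-cyclic distribution with nb x nb blocks. */
int block_cyclic_owner_2d(int i, int j, int nb, int prow, int pcol)
{
    return block_cyclic_owner(i, nb, prow) * pcol +
           block_cyclic_owner(j, nb, pcol);
}
```

Because every process owns blocks scattered across the whole matrix, each one keeps a share of the remaining work as a factorization proceeds. With a plain 1D block distribution, by contrast, processes holding the already-finished part of the matrix go idle.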

Parallelizing the SciMark 2.0 LU Kernel with Intel® Cluster MKL

1. Initialize the process grid

2. Create a descriptor for each distributed matrix

3. Replace the call to dgetrf with pdgetrf (the ‘p’ is for parallel)

Result: LU factorization of a 40000 x 40000 matrix on an 8-node, dual 3.0 GHz Xeon cluster achieves 46000 MFLOPS.
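To put the quoted rate in perspective, the usual leading-order operation count for LU factorization of an n x n matrix is 2n^3/3. The sketch below converts that count and a MFLOPS rate into a run time. It assumes the reported figure uses this conventional count, which the slides do not confirm. The function names are hypothetical.

```c
#include <assert.h>

/* Leading-order flop count for LU factorization of an n x n matrix. */
double lu_flops(double n)
{
    assert(n > 0.0);
    return 2.0 * n * n * n / 3.0;
}

/* Seconds needed to execute `flops` operations at `mflops` MFLOPS. */
double seconds_at(double flops, double mflops)
{
    assert(mflops > 0.0);
    return flops / (mflops * 1.0e6);
}
```

Under that assumption, n = 40000 at 46000 MFLOPS works out to a little over 900 seconds, or roughly 15 minutes, for the factorization.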

Performance Libraries: Intel® MKL

What’s Been Covered

Intel® Math Kernel Library is a broad scientific/engineering math library

It is optimized for Intel® processors

It is threaded for effective use on multi-core and SMP machines

The Intel C++ Compiler for Linux improves SciMark 2.0 performance without requiring code modifications

With minor code modifications, Intel MKL dramatically improves the FFT, MC, and LU kernels

Some SciMark 2.0 kernels benefit from parallel computing

Useful Links

Intel Software Products

• http://www.intel.com/software/products/

Intel Software Network

• http://www.intel.com/software/

Intel Software College

• http://www.intel.com/software/college/

SciMark 2.0

• http://math.nist.gov/scimark2/index.html
