
Experts in numerical algorithms and HPC services

Monte Carlo Simulation and its Efficient Implementation

Robert Tong

28 January 2010

Introduction to NAG

• Founded in 1970 as a co-operative project in the UK
• Operates as a commercial, not-for-profit organization
• Worldwide operations
  • Oxford & Manchester, UK
  • Chicago, US
  • Tokyo, Japan
  • Taipei, Taiwan
• Over 3,000 customer sites worldwide
• Staff of ~100, over 50% technical, over 25 PhDs
• £7m+ financial turnover

Portfolio

• Numerical Libraries
  • Highly flexible for use in many computing languages, programming environments, hardware platforms and for high performance computing methods
• Connector Products for Excel, MATLAB and Maple
  • Give users of Excel and of the mathematical software packages MATLAB and Maple access to NAG’s library of highly optimized and often superior numerical routines, and allow easy integration
• NAG Fortran Compiler and GUI-based Windows compiler: Fortran Builder
• Visualization and graphics software
  • Build data visualization applications with NAG’s IRIS Explorer
• Consultancy services

Outline

• Why use Monte Carlo simulation?
• Higher order methods and convergence
• GPU acceleration
• The need for numerical libraries

Why use Monte Carlo methods?

• Essential for high dimensional problems – many degrees of freedom
• For applications with uncertainty in inputs
• In finance
  • Important in risk modelling
  • Pricing/hedging derivatives

The elements of Monte Carlo simulation

• Derivative pricing
  • Simulate a path of asset values
  • Compute payoff from path
  • Compute option value
• Numerical components
  • Pseudo-random number generator
  • Discretization scheme

The demand for ever increasing performance

• In the past
  • Faster solution has been provided by increasing processor speeds
  • Want a quicker solution? Buy a new processor
• Present
  • Multi-core/many-core architectures, without increased processor clock speeds
  • A major challenge for existing numerical algorithms
• The escalator has stopped... or gone into reverse!
  • Existing codes may well run slower on multi-core

Ways to improve performance in Monte Carlo simulation

1. Use higher order discretization
2. Keep low order (Euler) discretization – make use of multi-core potential, e.g. GPU (Graphics Processing Unit)
3. Use high order discretization on GPU
4. Use quasi-random sequence (Sobol’, …) and Brownian Bridge
5. Implement Sobol’ sequence and Brownian Bridge on GPU

Higher order methods – 1 (work by Kai Zhang, University of Warwick, UK)

Higher order methods – 2

Numerical example – 1

Numerical example – 1a

Numerical example – 1b

Numerical example – 2a

Numerical example – 2b

GPU acceleration

• Retain low order Euler discretization
• Use multi-core GPU architecture to achieve speed-up

The Emergence of GPGPU Computing

• Initially – computation carried out by CPU (scalar, serial execution)
• CPU evolves to add cache, SSE instructions, ...
• GPU added to speed graphics display – driven by gaming needs
  • Multi-core, SIMT, limited flexibility
• CPU and GPU move closer
  • CPU becomes multi-core
  • GPU becomes General Purpose (GPGPU) – fully programmable

Current GPU architecture – e.g. NVIDIA Tesla

Tesla – processing power

• SM – Streaming Multiprocessor
  • 8 × SP – Streaming Processor core
  • 2 × Special Function Unit
  • MT – Multithreaded instruction fetch and issue unit
  • Instruction cache
  • Constant cache (read only)
  • Shared memory (16 KB, read/write)
• C1060 – adds double precision
  • 30 double precision cores
  • 240 single precision cores

Tesla C1060 memory (from: M A Martinez-del-Amor et al. (2008) based on E Lindholm et al. (2008))

Programming GPUs – CUDA and OpenCL

• CUDA (Compute Unified Device Architecture, developed by NVIDIA)
  • Extension of C to enable programming of GPU devices
  • Allows easy management of parallel threads executing on GPU
  • Handles communication with ‘host’ CPU
• OpenCL
  • Standard language for multi-device programming
  • Not tied to a particular company
  • Will open up GPU computing
  • Incorporates elements of CUDA

First step – obtaining and installing CUDA

• FREE download from http://www.nvidia.com/object/cuda_learn.html
• See: Quickstart Guide
• Require:
  • CUDA capable GPU – GeForce 8, 9, 200, Tesla, many Quadro
  • Recent version of NVIDIA driver
  • CUDA Toolkit – essential components to compile and build applications
  • CUDA SDK – example projects
• Update environment variables (Linux default shown)
  • PATH /usr/local/cuda/bin
  • LD_LIBRARY_PATH /usr/local/cuda/lib
• CUDA compiler nvcc works with
  • gcc (Linux)
  • MS VC++ (Windows)

Host (CPU) – Device (GPU) Relationship

• Application program initiated on Host (CPU)
• Device ‘kernels’ execute on GPU in SIMT (Single Instruction Multiple Thread) manner
• Host program
  • Transfers data from Host memory to Device (GPU) memory
  • Specifies number and organisation of threads on Device
  • Calls Device ‘kernel’ as a C function, passing parameters
  • Copies output from Device back to Host memory

Organisation of threads on GPU

• SM (Streaming Multiprocessor) manages up to 1024 threads
• Each thread is identified by an index
• Threads execute as Warps of 32 threads
• Threads are grouped in blocks (user specifies number of threads per block)
• Blocks make up a grid
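The arithmetic behind this organisation can be sketched in plain C on the host side. This is an illustrative sketch only: the struct and function names (`launch_config`, `make_launch_config`, `global_thread_index`) are hypothetical, and the index formula is the plain C analogue of the usual one-dimensional CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <assert.h>

#define WARP_SIZE 32  /* threads per warp, as stated on the slide */

typedef struct {
    int blocks_per_grid;    /* number of blocks making up the grid */
    int threads_per_block;  /* chosen by the user */
    int warps_per_block;    /* how many warps each block occupies */
} launch_config;

/* Choose enough blocks to cover n_items with one thread per item. */
launch_config make_launch_config(int n_items, int threads_per_block)
{
    assert(n_items > 0 && threads_per_block > 0);
    launch_config cfg;
    cfg.threads_per_block = threads_per_block;
    /* Round up so that a partially filled last block is still launched. */
    cfg.blocks_per_grid = (n_items + threads_per_block - 1) / threads_per_block;
    cfg.warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
    return cfg;
}

/* Unique index of a thread within a one-dimensional grid. */
int global_thread_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}
```

With 1000 work items and 128 threads per block this gives 8 blocks of 4 warps each, so the last block contains 24 threads with no item to process; a kernel would normally guard against that with a bounds check on the global index.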

Memory hierarchy

• On device can
  • Read/write per-thread
    • Registers
    • Local memory
  • Read/write per-block shared memory
  • Read/write per-grid global memory
  • Read only per-grid constant memory
• On host (CPU) can
  • Read/write per-grid
    • Global memory
    • Constant memory

CUDA terminology

• ‘kernel’ – C function executing on the GPU
• __global__ declares function as a kernel
  • Executed on the Device
  • Callable only from the Host
  • void return type
• __device__ declares function that is
  • Executed on the Device
  • Callable only from the Device

Application to Monte Carlo simulation

Monte Carlo paths lead to highly parallel algorithms

• Applications in finance, e.g. simulation based on SDE (return on asset = drift + Brownian motion)

  \[ \frac{dS_t}{S_t} = \mu\,dt + \sigma\,dW_t \]

• Requires fast pseudorandom or quasi-random number generator
• Additional techniques improve efficiency: Brownian Bridge, stratified sampling, …
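For reference, the low order (Euler) discretization of an SDE of this "drift + Brownian motion" form, written as $dS_t / S_t = \mu\,dt + \sigma\,dW_t$, can be stated as below. The symbols are assumed notation rather than taken from the slide: $\mu$ is the drift, $\sigma$ the volatility, $\Delta t$ the time step and $Z_n$ a standard normal sample.

```latex
S_{n+1} = S_n \left( 1 + \mu\,\Delta t + \sigma \sqrt{\Delta t}\; Z_n \right),
\qquad Z_n \sim N(0,1) \ \text{independent}
```

Each step consumes one normal sample, so a path of $N$ steps needs $N$ random numbers; this is why the speed of the random number generator matters so much for the overall simulation.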

Random Number Generators: choice of algorithm

• Must be highly parallel
• Implementation must satisfy statistical tests of randomness
• Some common generators do not guarantee randomness properties when split into parallel streams
• A suitable choice: MRG32k3a (L’Ecuyer)

MRG32k3a: skip ahead

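Generator combines 2 recurrences:

\[ x_{n,1} = \left( a_1 x_{n-2,1} + b_1 x_{n-3,1} \right) \bmod m_1 \]

\[ x_{n,2} = \left( a_2 x_{n-1,2} + b_2 x_{n-3,2} \right) \bmod m_2 \]

Each recurrence of form (M Giles, note on implementation)

\[ y_n = A\,y_{n-1}, \qquad y_n = \begin{pmatrix} x_n \\ x_{n-1} \\ x_{n-2} \end{pmatrix} \]

Precompute $A^p$ in $O(\log p)$ operations on CPU,

\[ y_{n+p} = A^p\,y_n \bmod m \]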
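The skip-ahead idea can be sketched in plain C for the first of the two recurrences. This is an illustrative sketch, not the implementation referred to on the slide: the function names are hypothetical, the constants are the published MRG32k3a values (m_1 = 2^32 − 209, a_1 = 1403580, b_1 = −810728), and the negative coefficient is stored as m_1 − 810728 so that unsigned arithmetic can be used throughout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define M1 4294967087ULL  /* 2^32 - 209, modulus of the first recurrence */

/* c = a * b (mod m) for 3x3 matrices whose entries are below m < 2^32. */
static void mat_mul(uint64_t c[3][3], uint64_t a[3][3], uint64_t b[3][3],
                    uint64_t m)
{
    uint64_t t[3][3];
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            uint64_t s = 0;
            for (int k = 0; k < 3; k++)
                s = (s + (a[i][k] * b[k][j]) % m) % m;
            t[i][j] = s;
        }
    memcpy(c, t, sizeof t);  /* temporary allows c to alias a or b */
}

/* result = a^p (mod m) by repeated squaring: O(log p) matrix products. */
static void mat_pow(uint64_t result[3][3], uint64_t a[3][3], uint64_t p,
                    uint64_t m)
{
    uint64_t base[3][3];
    uint64_t id[3][3] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}};
    memcpy(base, a, sizeof base);
    memcpy(result, id, sizeof id);
    while (p > 0) {
        if (p & 1) mat_mul(result, result, base, m);
        mat_mul(base, base, base, m);
        p >>= 1;
    }
}

/* Advance the state y = (x_n, x_{n-1}, x_{n-2}) by p steps in one jump. */
void mrg_skip_ahead_1(uint64_t y[3], uint64_t p)
{
    uint64_t a[3][3] = {{0, 1403580ULL, M1 - 810728ULL},
                        {1, 0, 0},
                        {0, 1, 0}};
    uint64_t ap[3][3], t[3];
    mat_pow(ap, a, p, M1);
    for (int i = 0; i < 3; i++) {          /* t = A^p y (mod m_1) */
        uint64_t s = 0;
        for (int k = 0; k < 3; k++)
            s = (s + (ap[i][k] * y[k]) % M1) % M1;
        t[i] = s;
    }
    memcpy(y, t, sizeof t);
}

/* Reference: advance the same recurrence one step at a time. */
void mrg_step_1(uint64_t y[3])
{
    uint64_t next = ((1403580ULL * y[1]) % M1
                     + ((M1 - 810728ULL) * y[2]) % M1) % M1;
    y[2] = y[1];
    y[1] = y[0];
    y[0] = next;
}
```

Calling `mrg_skip_ahead_1(y, p)` should leave `y` in the same state as calling `mrg_step_1(y)` p times. In a parallel setting, each thread would apply its own precomputed power of A to a common seed so that the threads generate disjoint sections of one sequence.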

MRG32k3a: modulus

Combined and individual recurrences

\[ z_n = \left( x_{n,1} - x_{n,2} \right) \bmod m_1 \]

• Can compute using double precision divide – slow
• Use 64 bit integers (supported on GPU) – avoid divide
• Bit shift – faster (used in CPU implementations)
• Note: speed of different possibilities subject to change as NVIDIA updates floating point capability
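One way to take the modulus with 64 bit integers and no divide is to exploit the special form of the moduli, m = 2^32 − c, so that 2^32 is congruent to c. The plain C sketch below illustrates that idea; it is an assumption about how a divide might be avoided, not a reproduction of the implementation discussed in the talk. The helper name `mod_no_divide` and the state struct are hypothetical; the coefficients are the published MRG32k3a constants.

```c
#include <assert.h>
#include <stdint.h>

#define C1 209ULL                 /* m_1 = 2^32 - 209   */
#define C2 22853ULL               /* m_2 = 2^32 - 22853 */
#define M1 (4294967296ULL - C1)
#define M2 (4294967296ULL - C2)

/* Reduce v modulo m = 2^32 - c using shifts and multiplies only.
   Each fold replaces hi * 2^32 by hi * c; valid for c below 2^16. */
static uint64_t mod_no_divide(uint64_t v, uint64_t c)
{
    const uint64_t m = 4294967296ULL - c;
    assert(c < 65536ULL);
    v = (v >> 32) * c + (v & 0xFFFFFFFFULL);
    v = (v >> 32) * c + (v & 0xFFFFFFFFULL);
    v = (v >> 32) * c + (v & 0xFFFFFFFFULL);
    if (v >= m) v -= m;
    return v;
}

typedef struct {
    uint64_t x1[3];  /* x_{n-1,1}, x_{n-2,1}, x_{n-3,1} */
    uint64_t x2[3];  /* x_{n-1,2}, x_{n-2,2}, x_{n-3,2} */
} mrg32k3a_state;

/* One step of the combined generator; returns a double in (0,1). */
double mrg32k3a_next(mrg32k3a_state *s)
{
    /* First recurrence: 1403580 * x_{n-2} - 810728 * x_{n-3} (mod m_1). */
    uint64_t p1 = mod_no_divide(1403580ULL * s->x1[1], C1) + M1
                - mod_no_divide(810728ULL * s->x1[2], C1);
    if (p1 >= M1) p1 -= M1;
    s->x1[2] = s->x1[1]; s->x1[1] = s->x1[0]; s->x1[0] = p1;

    /* Second recurrence: 527612 * x_{n-1} - 1370589 * x_{n-3} (mod m_2). */
    uint64_t p2 = mod_no_divide(527612ULL * s->x2[0], C2) + M2
                - mod_no_divide(1370589ULL * s->x2[2], C2);
    if (p2 >= M2) p2 -= M2;
    s->x2[2] = s->x2[1]; s->x2[1] = s->x2[0]; s->x2[0] = p2;

    /* Combination (x_{n,1} - x_{n,2}) mod m_1, mapped into 1..m_1. */
    uint64_t z = (p1 > p2) ? p1 - p2 : p1 + M1 - p2;

    /* Scale by 1/(m_1 + 1) so the result lies strictly inside (0,1). */
    return (double)z / 4294967088.0;
}
```

All intermediate products stay below 2^53, so the integer state never overflows, and only the final scaling touches floating point. Whether the shift-based reduction beats a hardware divide depends on the device, which is the point of the note above about changing floating point capability.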

MRG32k3a: memory coalescence

• GPU performance limited by memory access
• Require memory coalescence for fast transfer of data
• Order RNG output to retain consecutive memory accesses

$x_{n,t,b}$, the $n$-th element generated by thread $t$ in block $b$, is stored at

\[ t + N_t\,n + N_t N_p\,b \]

sequential ordering:

\[ n + N_p\,t + N_t N_p\,b \]

$N_t$ = num threads, $N_p$ = num points per thread

(Implementation by M Giles)
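The two storage orderings can be compared with a small plain C sketch of the index arithmetic. The function names are hypothetical, and `n_t` is taken here to mean the number of threads per block, which is an assumption about the slide's notation.

```c
#include <assert.h>
#include <stddef.h>

/* Coalesced layout: for a fixed point n, consecutive threads t write to
   consecutive addresses, so one warp touches one contiguous segment. */
size_t coalesced_index(size_t n, size_t t, size_t b, size_t n_t, size_t n_p)
{
    assert(n < n_p && t < n_t);
    return t + n_t * n + n_t * n_p * b;
}

/* Sequential layout: each thread writes its own contiguous run of n_p
   values, so neighbouring threads are n_p elements apart. */
size_t sequential_index(size_t n, size_t t, size_t b, size_t n_t, size_t n_p)
{
    assert(n < n_p && t < n_t);
    return n + n_p * t + n_t * n_p * b;
}
```

With 128 threads per block and 4 points per thread, threads 5 and 6 of the same block write their third point to adjacent locations under the coalesced layout and to locations 4 elements apart under the sequential one. Both layouts fill the same block of memory; only the order in which a thread's values appear differs, which the host has to take into account when reading the results back.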

MRG32k3a: single – double precision

• L’Ecuyer’s example implementation in double precision floating point
• Double precision on high end GPUs – but arithmetic operations much slower in execution than single precision
• GPU implementation in integers – final output cast to double
• Note: output to single precision gives sequence that does not pass randomness tests
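One effect that may contribute to the single precision behaviour can be shown in a few lines of plain C. This is an illustration under assumptions, not a statement about which tests failed or why: it assumes the integer output z in 1..m_1 is scaled by 1/(m_1 + 1), as in L’Ecuyer’s reference code, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MRG_NORM_DENOM 4294967088.0   /* m_1 + 1 for MRG32k3a */

/* Map an integer z in 1..m_1 to a double strictly inside (0,1). */
double to_unit_double(uint32_t z)
{
    assert(z >= 1u && z <= 4294967087u);
    return (double)z / MRG_NORM_DENOM;
}

/* The same value rounded to single precision (24 bit significand). */
float to_unit_float(uint32_t z)
{
    return (float)to_unit_double(z);
}
```

For the largest integer output, `to_unit_double(4294967087u)` is slightly below 1, while `to_unit_float(4294967087u)` rounds to exactly 1.0f, so the open interval is no longer respected. Neighbouring integers such as 4294967086 and 4294967087 also give distinct doubles but the same float, because a float cannot resolve steps of about 2.3E-10.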

MRG32k3a: GPU benchmarking – double precision

• GPU – NVIDIA Tesla C1060
• CPU – serial version of integer implementation running on single core of quad-core Xeon
• VSL – Intel Library MRG32k3a
• ICC – Intel C/C++ compiler
• VC++ – Microsoft Visual C++

              GPU        CPU-ICC    CPU-VC++   VSL-ICC    VSL-VC++
Samples/sec   3.00E+09   3.46E+07   4.77E+07   9.35E+07   9.32E+07

MRG32k3a: GPU benchmarking – single precision

• Note: for double precision all sequences were identical
• For single precision
  • GPU and CPU identical
  • GPU and VSL differ: max abs err 5.96E-008
• Which output preferred? Use statistical tests of randomness

              GPU        CPU-ICC    CPU-VC++   VSL-ICC    VSL-VC++
Samples/sec   3.49E+09   3.58E+07   5.24E+07   1.02E+08   9.75E+07

LIBOR Market Model on GPU

Equally weighted portfolio of 15 swaptions each with same maturity, but different lengths and different strikes

Numerical Libraries for GPUs

• The problem
  • The time-consuming work of writing basic numerical components should not be repeated
  • The general user should not need to spend many days writing each application
• The solution
  • Standard numerical components should be available as libraries for GPUs

NAG Routines for GPUs

nag_gpu_mrg32k3a_uniform

Example program: generate random numbers on GPU

    ...
    // Allocate memory on Host
    host_array = (double *)calloc(N, sizeof(double));
    // Allocate memory on GPU
    cudaMalloc((void **)&device_array, sizeof(double)*N);
    // Call GPU functions
    // Initialise random number generator
    nag_gpu_mrg32k3a_init(V1, V2, offset);
    // Generate random numbers
    nag_gpu_mrg32k3a_uniform(nb, nt, np, device_array);
    // Read back GPU results to host (source must be the array allocated on the GPU)
    cudaMemcpy(host_array, device_array, sizeof(double)*N, cudaMemcpyDeviceToHost);
    ...

NAG Routines for GPUs

nag_gpu_mrg32k3a_next_uniform

Example program – kernel function

    __global__ void mrg32k3a_kernel(int np, FP *d_P)
    {
      unsigned int v1[3], v2[3];
      int n, i0;
      FP x, x2 = nanf("");
      // initialisation for first point
      nag_gpu_mrg32k3a_stream_init(v1, v2, np);
      // now do points
      i0 = threadIdx.x + np*blockDim.x*blockIdx.x;
      for (n=0; n<np; n++) {
        nag_gpu_mrg32k3a_next_uniform(v1, v2, x);
        // store each point; a stride of blockDim.x keeps writes coalesced
        d_P[i0] = x;
        i0 += blockDim.x;
      }
    }

Library issues: Auto-tuning

• Performance affected by mapping of algorithm to GPU via threads, blocks and warps
• Implement a code generator to produce variants using the relevant parameters
• Determine optimal performance
• Li, Dongarra & Tomov (2009)

Early Success with BNP Paribas

• Working with Fixed Income Research & Strategies Team (FIRST)
  • NAG mrg32k3a works well in BNP Paribas CUDA “Local Vol Monte-Carlo”
  • Passes rigorous statistical tests for randomness properties (Diehard, Dieharder, TestU01)
  • Performance good
  • Being able to match the GPU random numbers with the CPU version of mrg32k3a has been very valuable for establishing validity of output

BNP Paribas Results – local vol example

And with Bank of America Merrill Lynch

“The NAG GPU libraries are helping us enormously by providing us with fast, good quality algorithms. This has let us concentrate on our models and deliver GPGPU based pricing much more quickly.”

“A N Other Tier 1” Risk Group

• “Thank you for the GPU code, we have achieved speed ups of x120”
• In a simple uncorrelated loss simulation:
  • Number of simulations: 50,000
  • Time taken in seconds: 2.373606
  • Simulations per second: 21065
  • Simulated default rate: 311.8472
  • Theoretical default rate: 311.9125
• 24 trillion numbers in 6 hours

NAG routines for GPUs – 1

• Currently available
  • Random Number Generator (L’Ecuyer mrg32k3a)
    • Uniform distribution
    • Normal distribution
    • Exponential distribution
    • Gamma distribution
  • Sobol sequence for Quasi-Monte Carlo (to 19,000 dimensions)
  • Brownian Bridge

NAG routines for GPUs – 2

• Future plans
  • Random Number Generator – Mersenne Twister
  • Linear algebra components for PDE option pricing methods
  • Time series analysis – wavelets
  • ...

Summary

• GPUs offer high performance computing for specific massively parallel algorithms such as Monte Carlo simulations
• GPUs are lower cost and require less power than corresponding CPU configurations
• Numerical libraries for GPUs will make these an important computing resource
• Higher order methods for GPUs are being considered

Acknowledgments

• Mike Giles (Mathematical Institute, University of Oxford) – algorithmic input
• Technology Strategy Board through Knowledge Transfer Partnership with Smith Institute
• NVIDIA for technical support and supply of Tesla C1060 and Quadro FX 5800

See www.nag.co.uk/numeric/GPUs/

Contact: [email protected]