
Beyond Simple Monte-Carlo: Parallel Computing with QuantLib

Klaus Spanderen

E.ON Global Commodities

November 14, 2013


Symmetric Multi-Processing

Graphical Processing Units

Message Passing Interface

Conclusion


Symmetric Multi-Processing: Overview

- Moore's Law: the number of transistors doubles every two years.
- Leakage turns out to be the death of CPU scaling.
- Multi-core designs help processor makers manage power dissipation.
- Symmetric Multi-Processing has become a mainstream technology.

Herb Sutter: "The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software."


Multi-Processing with QuantLib

Divide and conquer: spawn several independent OS processes, e.g. along the lines of the sketch below.
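A minimal POSIX sketch of this approach (illustrative only; priceChunk is a hypothetical stand-in for the actual QuantLib pricing code, and the benchmark in the figure below uses its own driver):

#include <unistd.h>
#include <sys/wait.h>
#include <cstdlib>
#include <iostream>

// each child sets up its own QuantLib objects and prices one chunk of
// instruments, fully isolated from the other workers
void priceChunk(int chunk) {
    std::cout << "pricing chunk " << chunk
              << " in process " << getpid() << std::endl;
}

int main() {
    const int nProcesses = 8;            // e.g. one worker per physical core
    for (int i = 0; i < nProcesses; ++i) {
        if (fork() == 0) {               // child: independent address space
            priceChunk(i);
            std::exit(0);
        }
    }
    for (int i = 0; i < nProcesses; ++i)
        wait(nullptr);                   // parent: wait for all workers
    return 0;
}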

[Figure: the QuantLib benchmark on a 32 core (plus 32 HT cores) server; GFlops vs. number of processes (1 to 256).]


Multi-Threading: Overview

- QuantLib is per se not thread-safe.
- Use case one: a really thread-safe QuantLib (see Luigi's talk).
- Use case two: multi-threading to speed up single pricings. Joseph Wang is working with Open Multi-Processing (OpenMP) to parallelize several finite difference and Monte-Carlo algorithms (an illustrative sketch follows after this list).
- Use case three: multi-threading to parallelize several pricings, e.g. parallel pricing to calibrate models.
- Use case four: use of QuantLib in C#, F#, Java or Scala via the SWIG layer and multi-threaded unit tests.
- The focus here is on use cases three and four: the situation is not too bad as long as objects are not shared between different threads.
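A minimal OpenMP sketch of use case two (purely illustrative, not Joseph Wang's actual patch): the path loop of a plain Monte-Carlo pricer is embarrassingly parallel as long as each thread owns its random number generator and no QuantLib objects are shared.

#include <omp.h>
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>

int main() {
    const long nPaths = 1000000;
    const double s0 = 100.0, strike = 100.0, r = 0.03, vol = 0.2, t = 1.0;
    double sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        // thread-local RNG: no state is shared between threads
        std::mt19937_64 rng(42 + omp_get_thread_num());
        std::normal_distribution<double> norm(0.0, 1.0);

        #pragma omp for
        for (long i = 0; i < nPaths; ++i) {
            const double st = s0*std::exp((r - 0.5*vol*vol)*t
                                          + vol*std::sqrt(t)*norm(rng));
            sum += std::max(st - strike, 0.0);
        }
    }
    std::cout << "call price: " << std::exp(-r*t)*sum/nPaths << std::endl;
    return 0;
}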


Multi-Threading: Parallel Model Calibration

C++11 version of a parallel model calibration function

Disposable<Array>
CalibrationFunction::values(const Array& params) const {
    model_->setParams(params);

    // launch one asynchronous pricing task per calibration helper
    std::vector<std::future<Real> > errorFcts;
    std::transform(std::begin(instruments_), std::end(instruments_),
                   std::back_inserter(errorFcts),
                   [](decltype(*begin(instruments_)) h) {
                       return std::async(std::launch::async,
                                         &CalibrationHelper::calibrationError,
                                         h.get()); });

    // collect the calibration errors
    Array values(instruments_.size());
    std::transform(std::begin(errorFcts), std::end(errorFcts),
                   values.begin(),
                   [](std::future<Real>& f) { return f.get(); });
    return values;
}


Multi-Threading: Singleton

- Riccardo's patch: all singletons become thread-local singletons.

template <class T>
T& Singleton<T>::instance() {
    static boost::thread_specific_ptr<T> tss_instance_;
    if (!tss_instance_.get()) {
        tss_instance_.reset(new T);
    }
    return *tss_instance_;
}

- C++11 implementation: Scott Meyers singleton

template <class T>
T& Singleton<T>::instance() {
    static thread_local T t_;
    return t_;
}


Multi-Threading: Observer-Pattern

- Main purpose in QuantLib: distributed event handling.
- The current implementation is highly optimized for single-threaded performance.
- In a thread-local environment this would be sufficient, but ...
- ... the parallel garbage collector in C#/F#, Java or Scala is by definition not thread local!
- Shuo Chen's article "Where Destructors meet Threads" provides a good solution ...
- ... but it is not applicable to QuantLib without a major redesign of the observer pattern.


Multi-Threading: Observer-Pattern

The following Scala example fails immediately with spurious error messages:

- pure virtual function call
- segmentation fault

import org.quantlib.{Array => QArray, _}

object ObserverTest {
    def main(args: Array[String]) : Unit = {
        System.loadLibrary("QuantLibJNI")

        val aSimpleQuote = new SimpleQuote(0)

        while (true) {
            (0 until 10).foreach(_ => {
                new QuoteHandle(aSimpleQuote)
                aSimpleQuote.setValue(aSimpleQuote.value + 1)
            })
            System.gc
        }
    }
}


Multi-Threading: Observer-Pattern

- The observer pattern itself can be made thread-safe using the boost::signals2 library.
- The remaining problem: an observer must be unregistered from all observables before its destructor is called.
- Solution (a rough sketch follows after this list):
  - QuantLib enforces that all observers are instantiated as boost shared pointers.
  - The preprocessor directive BOOST_SP_ENABLE_DEBUG_HOOKS provides a hook into every destructor call of a shared object.
  - If the shared object is an observer, the thread-safe version of Observer::unregisterWithAll detaches the observer from all observables.
- Advantage: this solution is backward compatible, e.g. the test suite can now run multi-threaded.
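A rough sketch of the idea only, not the actual QuantLib patch: it assumes a registry of live observer addresses maintained by a (not shown) patch in the Observer constructor/destructor, and that the address stored in the shared pointer is the Observer base address. The whole build has to be compiled with BOOST_SP_ENABLE_DEBUG_HOOKS defined.

#include <ql/patterns/observable.hpp>
#include <boost/thread/mutex.hpp>
#include <cstddef>
#include <set>

namespace {
    // hypothetical registry, filled by the Observer constructor (not shown)
    boost::mutex registryMutex_;
    std::set<void*> liveObservers_;
}

namespace boost {
    // called by boost::shared_ptr when BOOST_SP_ENABLE_DEBUG_HOOKS is defined
    void sp_scalar_constructor_hook(void*, std::size_t, void*) {}

    void sp_scalar_destructor_hook(void* px, std::size_t, void*) {
        boost::mutex::scoped_lock lock(registryMutex_);
        std::set<void*>::iterator i = liveObservers_.find(px);
        if (i != liveObservers_.end()) {
            // detach the observer before its destructor runs,
            // e.g. in a garbage collector finalizer thread
            static_cast<QuantLib::Observer*>(px)->unregisterWithAll();
            liveObservers_.erase(i);
        }
    }
}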


Finite Differences Methods on GPUs: Overview

- The performance of finite difference methods is mainly driven by the speed of the underlying sparse linear algebra subsystem.
- In QuantLib any finite difference operator can be exported as a boost::numeric::ublas::compressed_matrix<Real>.
- Boost sparse matrices can be exported in Compressed Sparse Row (CSR) format to high-performance libraries (a conversion sketch follows after this list).
- CUDA sparse matrix libraries:
  - cuSPARSE: basic linear algebra subroutines for sparse matrices.
  - cusp: a general template library for sparse iterative solvers.
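A minimal sketch of that export step (the CsrMatrix struct and toCsr are made-up names for illustration; it assumes a row-major, zero-based, fully filled compressed_matrix and QuantLib's Real being double):

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <cstddef>
#include <vector>

typedef double Real;

// the three CSR arrays expected by cuSPARSE or cusp
struct CsrMatrix {
    std::vector<int>  rowPtr;   // size: rows + 1
    std::vector<int>  colInd;   // size: nnz
    std::vector<Real> values;   // size: nnz
};

CsrMatrix toCsr(const boost::numeric::ublas::compressed_matrix<Real>& m) {
    CsrMatrix csr;
    const std::size_t nnz = m.nnz();

    // compressed_matrix already stores its data in CSR layout
    csr.rowPtr.assign(m.index1_data().begin(),
                      m.index1_data().begin() + m.size1() + 1);
    csr.colInd.assign(m.index2_data().begin(),
                      m.index2_data().begin() + nnz);
    csr.values.assign(m.value_data().begin(),
                      m.value_data().begin() + nnz);
    return csr;
}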


Sparse Matrix Libraries for GPUs

Performance pictures from NVIDIA (https://developer.nvidia.com/cuSPARSE)

Speed-ups are smaller than the reported "100x" for Monte-Carlo.

Example I: Heston-Hull-White Model on GPUs

SDE is defined by

\begin{aligned}
dS_t &= (r_t - q_t)\,S_t\,dt + \sqrt{v_t}\,S_t\,dW_t^S \\
dv_t &= \kappa_v(\theta_v - v_t)\,dt + \sigma_v\sqrt{v_t}\,dW_t^v \\
dr_t &= \kappa_r(\theta_{r,t} - r_t)\,dt + \sigma_r\,dW_t^r \\
\rho_{Sv}\,dt &= dW_t^S\,dW_t^v \\
\rho_{Sr}\,dt &= dW_t^S\,dW_t^r \\
\rho_{vr}\,dt &= dW_t^v\,dW_t^r
\end{aligned}

Feynman-Kac gives the corresponding PDE:

\begin{aligned}
\frac{\partial u}{\partial t} =\;
& \frac{1}{2}S^2\nu\frac{\partial^2 u}{\partial S^2}
+ \frac{1}{2}\sigma_\nu^2\nu\frac{\partial^2 u}{\partial \nu^2}
+ \frac{1}{2}\sigma_r^2\frac{\partial^2 u}{\partial r^2} \\
&+ \rho_{S\nu}\sigma_\nu S\nu\frac{\partial^2 u}{\partial S\,\partial\nu}
+ \rho_{Sr}\sigma_r S\sqrt{\nu}\frac{\partial^2 u}{\partial S\,\partial r}
+ \rho_{\nu r}\sigma_r\sigma_\nu\sqrt{\nu}\frac{\partial^2 u}{\partial\nu\,\partial r} \\
&+ (r - q)S\frac{\partial u}{\partial S}
+ \kappa_v(\theta_v - \nu)\frac{\partial u}{\partial\nu}
+ \kappa_r(\theta_{r,t} - r)\frac{\partial u}{\partial r}
- ru
\end{aligned}


Example I: Heston-Hull-White Model on GPUs

- Good news: QuantLib can build the sparse matrix.
- An operator splitting scheme needs to be ported to the GPU (a cusp-based sketch of the explicit part follows after the listing).

void HundsdorferScheme::step(array_type& a, Time t) {
    // explicit predictor step
    Array y = a + dt_*map_->apply(a);
    Array y0 = y;

    // implicit correction in every direction
    for (Size i=0; i < map_->size(); ++i) {
        Array rhs = y - theta_*dt_*map_->apply_direction(i, a);
        y = map_->solve_splitting(i, rhs, -theta_*dt_);
    }

    // corrector step followed by a second round of implicit corrections
    Array yt = y0 + mu_*dt_*map_->apply(y-a);
    for (Size i=0; i < map_->size(); ++i) {
        Array rhs = yt - theta_*dt_*map_->apply_direction(i, y);
        yt = map_->solve_splitting(i, rhs, -theta_*dt_);
    }
    a = yt;
}
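A hypothetical sketch of how the explicit part of such a scheme, y = a + Δt·L·a, could look on the GPU with the cusp library (explicitStep is a made-up name; the CSR operator is assumed to come from an export like the one sketched earlier, and the implicit solve_splitting step would additionally need a tridiagonal or iterative solver on the device):

#include <cusp/array1d.h>
#include <cusp/blas.h>
#include <cusp/csr_matrix.h>
#include <cusp/multiply.h>

typedef cusp::csr_matrix<int, double, cusp::device_memory> DeviceOperator;
typedef cusp::array1d<double, cusp::device_memory> DeviceArray;

// y = a + dt * L*a, evaluated entirely on the device;
// y must already have a.size() elements
void explicitStep(const DeviceOperator& L, const DeviceArray& a,
                  double dt, DeviceArray& y) {
    DeviceArray La(a.size());
    cusp::multiply(L, a, La);               // La = L * a
    cusp::blas::axpby(a, La, y, 1.0, dt);   // y  = 1.0*a + dt*La
}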


Example I: Heston-Hull-White Model on GPUs

[Figure: Heston-Hull-White model, GTX560 vs. Core i7; GPU-vs-CPU speed-up (roughly 2x to 14x) over grid sizes (t,x,v,r) from 20x50x20x10 up to 50x400x100x20, single and double precision.]

Speed-ups are much smaller than for Monte-Carlo pricing.


Example II: Heston Model on GPUs

[Figure: Heston model, GTX560 vs. Core i7; GPU-vs-CPU speed-up (roughly 5x to 20x) over grid sizes (t,x,v) from 50x200x100 up to 100x2000x1000, single and double precision.]

Speed-ups are much smaller than for Monte-Carlo pricing.


Example III: Virtual Power Plant

The Kluge model (two OU processes plus jump diffusion) leads to a three-dimensional partial integro-differential equation:

\begin{aligned}
rV =\;& \frac{\partial V}{\partial t}
+ \frac{\sigma_x^2}{2}\frac{\partial^2 V}{\partial x^2}
- \alpha x\frac{\partial V}{\partial x}
- \beta y\frac{\partial V}{\partial y} \\
&+ \frac{\sigma_u^2}{2}\frac{\partial^2 V}{\partial u^2}
- \kappa u\frac{\partial V}{\partial u}
+ \rho\sigma_x\sigma_u\frac{\partial^2 V}{\partial x\,\partial u} \\
&+ \lambda\int_{\mathbb{R}}\bigl(V(x,y+z,u,t) - V(x,y,u,t)\bigr)\,\omega(z)\,dz
\end{aligned}

Due to the integro part, the discretized equation is not a truly sparse matrix anymore.


Example III: Virtual Power Plant

[Figure: calculation times, GTX560@…/1.6GHz vs. Core i7@…, over grid sizes (x,y,u,s) from 10x10x10x6 up to 100x50x40x6; methods compared: GPU BiCGStab+Tridiag, GPU BiCGStab+non-sym Bridson, GPU BiCGStab, GPU BiCGStab+Diag, CPU Douglas scheme, CPU BiCGStab+Tridiag.]


Quasi Monte-Carlo on GPUs: Overview

- The Koksma-Hlawka inequality is the basis for any QMC method:

\left|\frac{1}{n}\sum_{i=1}^{n} f(x_i) - \int_{[0,1]^d} f(u)\,du\right| \le V(f)\,D^*(x_1,\ldots,x_n),
\qquad D^*(x_1,\ldots,x_n) \ge c\,\frac{(\log n)^d}{n}

- The real advantage of QMC shows up only after drawing N ∼ e^d samples, where d is the dimensionality of the problem.
- Dimensional reduction of the problem is therefore often the first step.
- The Brownian bridge is tailor-made to reduce the number of significant dimensions (a CPU sketch of this step follows below).
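A plain QuantLib (CPU) sketch of that last step: draw a Sobol point, map it to Gaussians and run it through the Brownian bridge so that the leading dimensions carry most of the path variance. samplePath is just an illustrative wrapper; the GPU version uses the same bridge weights and indices, pre-computed on the host.

#include <ql/math/distributions/normaldistribution.hpp>
#include <ql/math/randomnumbers/sobolrsg.hpp>
#include <ql/methods/montecarlo/brownianbridge.hpp>
#include <vector>

using namespace QuantLib;

// the dimensionality of the Sobol generator must equal nSteps
std::vector<Real> samplePath(Size nSteps, SobolRsg& sobol) {
    const std::vector<Real>& u = sobol.nextSequence().value;   // uniforms in (0,1)

    InverseCumulativeNormal icn;
    std::vector<Real> z(nSteps);
    for (Size i = 0; i < nSteps; ++i)
        z[i] = icn(u[i]);                                      // standard normal variates

    BrownianBridge bridge(nSteps);
    std::vector<Real> dw(nSteps);
    bridge.transform(z.begin(), z.end(), dw.begin());          // Wiener increments
    return dw;
}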


Quasi Monte-Carlo on GPUs: Arithmetic Option Example


Quasi Monte-Carlo on GPUs: Exotic Equity Options

Accelerating Exotic Option Pricing and Model Calibration Using GPUs, Bernemann et al., in High Performance Computational Finance (WHPCF), 2010 IEEE Workshop on, pages 17, Nov. 2010.


Quasi Monte-Carlo on GPUs: QuantLib Implementation

- CUDA supports Sobol random numbers up to dimension 20,000 (a minimal cuRAND sketch follows after this list).
- The direction integers are taken from the JoeKuoD7 set.
- On comparable hardware the CUDA Sobol generators are approx. 50 times faster than MKL.
- The weights and indices of the Brownian bridge are calculated by QuantLib.
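A minimal cuRAND sketch (illustrative only, not the generator used for the numbers above): generate a d-dimensional scrambled Sobol sequence on the device; error checking is omitted and the Brownian-bridge transform would be applied in a separate kernel using the host-computed weights and indices.

#include <cuda_runtime.h>
#include <curand.h>
#include <cstddef>
#include <cstdio>

int main() {
    const unsigned int dim = 256;       // number of Sobol dimensions
    const std::size_t nPoints = 65536;  // number of quasi-random points

    double* d_seq;
    cudaMalloc(&d_seq, sizeof(double)*dim*nPoints);

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_QUASI_SCRAMBLED_SOBOL64);
    curandSetQuasiRandomGeneratorDimensions(gen, dim);

    // generates nPoints points of dimension dim, laid out dimension-major
    curandGenerateUniformDouble(gen, d_seq, dim*nPoints);

    curandDestroyGenerator(gen);
    cudaFree(d_seq);
    std::printf("generated %u x %zu Sobol numbers\n", dim, nPoints);
    return 0;
}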


Quasi Monte-Carlo on GPUs: Performance

[Figure: Sobol Brownian bridge, GPU vs. CPU speed-up (up to roughly 80x) over path lengths from 10^1 to 10^4; single precision one factor, single precision four factors, double precision one factor. Comparison GPU (GTX 560@…/1.6GHz) vs. CPU (…).]


Quasi Monte-Carlo on GPUs: Scrambled Sobol Sequences

- In addition, CUDA supports scrambled Sobol sequences.
- Higher order scrambled sequences are a variant of randomized QMC methods.
- They achieve better root mean square errors on smooth integrands.
- Error analysis is difficult: a shifted (t,m,d)-net does not need to be a (t,m,d)-net.

[Figure: RMSE for a benchmark portfolio of Asian options.]


Message Passing Interface (MPI): Overview

- De-facto standard for massively parallel processing (MPP).
- MPI is a complementary standard to OpenMP or threading.
- Vendors provide high-performance / low-latency implementations.
- The roots of the MPI specification go back to the early 90s, and you will feel the age if you use the C API.
- Favour Boost.MPI over the original MPI C++ bindings!
- Boost.MPI can build MPI data types for user-defined types using the Boost.Serialization library (a small sketch follows after this list).
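A tiny Boost.MPI sketch of that last point (CalibrationResult is a made-up type for illustration): a serialize() member is all Boost.MPI needs to send a user-defined type between ranks.

#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <iostream>
#include <vector>

struct CalibrationResult {
    double error;
    std::vector<double> params;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & error;
        ar & params;
    }
};

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;

    CalibrationResult r;
    if (world.rank() == 0) {
        r.error = 0.0;
        r.params.assign(5, 0.1);
    }
    boost::mpi::broadcast(world, r, 0);   // rank 0 sends, every rank receives

    std::cout << "rank " << world.rank() << " received "
              << r.params.size() << " parameters" << std::endl;
    return 0;
}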


Message Passing Interface (MPI): Model Calibration

- Model calibration can be a very time-consuming task, e.g. the calibration of a Heston or a Heston-Hull-White model using American puts with discrete dividends → FDM pricing.
- Minimal approach: introduce an MPICalibrationHelper proxy, which "has a" CalibrationHelper.

class MPICalibrationHelper : public CalibrationHelper {
  public:
    MPICalibrationHelper(
        Integer mpiRankId,
        const Handle<Quote>& volatility,
        const Handle<YieldTermStructure>& termStructure,
        const boost::shared_ptr<CalibrationHelper>& helper);
    ....
  private:
    std::future<Real> modelValueF_;
    const boost::shared_ptr<boost::mpi::communicator> world_;
    ....
};


Message Passing Interface (MPI): Model Calibration

void MPICalibrationHelper::update() {
    if (world_->rank() == mpiRankId_) {
        // only the assigned rank prices this helper, asynchronously
        modelValueF_ = std::async(std::launch::async,
                                  &CalibrationHelper::modelValue, helper_);
    }
    CalibrationHelper::update();
}

Real MPICalibrationHelper::modelValue() const {
    if (world_->rank() == mpiRankId_) {
        modelValue_ = modelValueF_.get();
    }
    // distribute the result from the assigned rank to all other ranks
    boost::mpi::broadcast(*world_, modelValue_, mpiRankId_);
    return modelValue_;
}

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    ....
}
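The resulting binary is then started like any other MPI program, e.g. with something along the lines of mpirun -np 8 ./calibrate; the exact launcher and options depend on the MPI implementation in use.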


Message Passing Interface (MPI): Model Calibration

[Figure: parallel Heston-Hull-White calibration on 2x4 cores; speed-up (roughly 1x to 6x) vs. number of processes (1, 2, 4, 8, 16).]


Conclusion

- Often a simple divide-and-conquer approach on the process level is sufficient to "parallelize" QuantLib.
- In a multi-threading environment the singleton and observer patterns need to be modified.
- Do not share QuantLib objects between different threads.
- A working solution exists for languages with a parallel garbage collector.
- The finite difference speed-up on GPUs is rather 10x than 100x.
- Scrambled Sobol sequences in conjunction with Brownian bridges improve the convergence rate on GPUs.
- Boost.MPI is a convenient library for utilising QuantLib on MPP systems.
