20110620 amst rdam_kpb

Introduction

Computing in databases

Conclusion

Computing near the data: let someone else do the heavy lifting for you

Konrad Banachewicz

AmstRdam, June 20th 2011

“We’re drowning in data and starving for information”

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1 exchange → 400 MB in pure ASCII

different parameters → “clones” of the same instrument

{exchanges} × {instruments} × {days} × ... = A LOT

Problems:

memory

bandwidth

Model 1: regression

Model 2: correlation

Model 3: VaR

Typical approach

read the data to memory

analyze there

save the results
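
A minimal sketch of these three steps, written in Python with an in-memory SQLite database rather than R and a production DBMS. The table names `ticks` and `results` and the toy prices are hypothetical.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (price REAL)")
conn.executemany(
    "INSERT INTO ticks VALUES (?)",
    [(p,) for p in (100.0, 101.5, 99.5, 103.0)],
)

# 1. Read the data to memory: every row crosses the connection.
prices = [row[0] for row in conn.execute("SELECT price FROM ticks")]

# 2. Analyze there.
mean_price = statistics.fmean(prices)

# 3. Save the results.
conn.execute("CREATE TABLE results (name TEXT, value REAL)")
conn.execute("INSERT INTO results VALUES (?, ?)", ("mean_price", mean_price))
conn.commit()
```

Step 1 is where the memory and bandwidth problems show up: the whole table is transferred even though only one number is kept.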

But is it really necessary?

In many cases what we really need is aggregate info.

Example: linear regression

classic estimator: $\hat{\beta} = (X^T X)^{-1} X^T y$

come to think of it, what we really need are sums, sums of squares and cross-products
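
To make the point concrete, here is the estimator written out for the special case of one regressor plus an intercept (an assumption chosen for illustration; the symbols $x_i$, $y_i$ and $n$ denote the observations and their count). Every entry is a count, a sum, a sum of squares or a cross-product:

```latex
X^T X = \begin{pmatrix} n & \sum_i x_i \\ \sum_i x_i & \sum_i x_i^2 \end{pmatrix},
\qquad
X^T y = \begin{pmatrix} \sum_i y_i \\ \sum_i x_i y_i \end{pmatrix}

\hat{\beta}_2 = \frac{n \sum_i x_i y_i - \sum_i x_i \sum_i y_i}
                     {n \sum_i x_i^2 - \left(\sum_i x_i\right)^2},
\qquad
\hat{\beta}_1 = \frac{\sum_i y_i - \hat{\beta}_2 \sum_i x_i}{n}
```

Only five numbers have to leave the data store, however many rows it holds.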

Two possible approaches:

1. Ripley and Chen: extra interface, pure R

2. R + SQL

Ripley and Chen

R (user) → CORBA → R (servant) → DB

Alternative

R (user) ⇄ DB

Two scenarios:

1. pure R processing

2. computations partially in DB

base model: $Y_t = \beta_1 + \beta_2 X_t + \varepsilon_t$

estimator: $\hat{\beta} = (X^T X)^{-1} X^T Y$

in the DB: arithmetic operations on a limited set of columns
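
A minimal sketch of what "arithmetic operations on a limited set of columns" could look like, using Python's standard `sqlite3` module as a stand-in for R plus an ODBC-connected DBMS. The table `obs`, its columns `x` and `y`, and the helper `regression_in_db` are hypothetical names introduced for illustration.

```python
import sqlite3

def regression_in_db(conn, table="obs", x="x", y="y"):
    """Fit y = b1 + b2*x from aggregates computed by the database."""
    # Identifiers are interpolated directly; this sketch treats them as trusted constants.
    query = (
        f"SELECT COUNT(*), SUM({x}), SUM({y}), "
        f"SUM({x}*{x}), SUM({x}*{y}) FROM {table}"
    )
    n, sx, sy, sxx, sxy = conn.execute(query).fetchone()
    # Solve the 2x2 normal equations (X'X) b = X'Y in closed form.
    b2 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b1 = (sy - b2 * sx) / n
    return b1, b2

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
# Toy data lying exactly on y = 1 + 2x, so the fit is easy to check.
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [(i, 1 + 2 * i) for i in range(10)],
)
b1, b2 = regression_in_db(conn)
print(b1, b2)
```

The client receives a single row of five aggregates instead of the full table; the matrix inversion itself stays on the client side.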

Pure R processing

[Figure: Case study 1, method 1. Horizontal axis: dataset size (number of rows), 200,000 to 1,000,000. Series: Ingres VW, Ingres, MySQL, PostgreSQL, DBMS X.]

Computations partially in DB

[Figure: horizontal axis from 200,000 to 1,000,000.]

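base model: $\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$

estimator: $\widehat{\mathrm{Cov}}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right)$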

in the DB: large queries
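
One way to read "large queries": with many columns, a single SELECT has to carry a sum for every column and a cross-product sum for every pair of columns. The sketch below builds such a query programmatically, again with Python's `sqlite3` as a stand-in; the table `returns`, its columns, and both helper functions are hypothetical.

```python
import sqlite3
from itertools import combinations

def covariance_query(table, columns):
    """Build one SELECT with n, every column sum, and every pairwise cross-product sum."""
    pairs = list(combinations(columns, 2))
    terms = ["COUNT(*)"]
    terms += [f"SUM({c})" for c in columns]
    terms += [f"SUM({a}*{b})" for a, b in pairs]
    return f"SELECT {', '.join(terms)} FROM {table}", pairs

def covariances_in_db(conn, table, columns):
    query, pairs = covariance_query(table, columns)
    row = conn.execute(query).fetchone()
    n = row[0]
    sums = dict(zip(columns, row[1:1 + len(columns)]))
    cross = dict(zip(pairs, row[1 + len(columns):]))
    # Estimate Cov(X, Y) as mean(XY) - mean(X) * mean(Y).
    return {
        (a, b): cross[(a, b)] / n - (sums[a] / n) * (sums[b] / n)
        for a, b in pairs
    }

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (a REAL, b REAL, c REAL)")
conn.executemany(
    "INSERT INTO returns VALUES (?, ?, ?)",
    [(1, 2, 3), (2, 4, 1), (3, 6, 2)],
)
cov = covariances_in_db(conn, "returns", ["a", "b", "c"])
print(cov)
```

The number of cross-product terms grows quadratically: 35 columns already give 595 pairs in one statement.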

Pure R processing

[Figure: horizontal axis: dataset size (columns), 15 to 35.]

[Figure: horizontal axis: dataset size (columns), 15 to 35.]

calculate a quantile of the portfolio PnL

$V_p = \inf\{u : F(u) \ge 1 - p\}$

estimator: $\hat{V}_p = X_{[n(1-p)]+1}$

in the DB: sorting
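
A minimal sketch of pushing the sort into the database, assuming the estimator picks the element of rank $[n(1-p)]+1$ from the PnL values sorted in ascending order. Python's `sqlite3` stands in for the DBMS; the table `pnl`, its column `value`, and the helper `var_in_db` are hypothetical.

```python
import math
import sqlite3

def var_in_db(conn, p, table="pnl", column="value"):
    """Return the element of rank [n(1-p)]+1, sorted by the database (non-empty table assumed)."""
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    # A 1-based rank of [n(1-p)] + 1 is a 0-based offset of floor(n(1-p)).
    offset = min(math.floor(n * (1 - p)), n - 1)  # clamp so p = 0 stays in range
    (value,) = conn.execute(
        f"SELECT {column} FROM {table} ORDER BY {column} LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()
    return value

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pnl (value REAL)")
conn.executemany(
    "INSERT INTO pnl VALUES (?)",
    [(v,) for v in (-5, 3, -1, 7, 2, -8, 4, 0)],
)
print(var_in_db(conn, p=0.25))
```

Only one value is returned to the client; the sort runs where the data lives.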

Pure R processing

[Figure: horizontal axis from 2,000,000 to 10,000,000.]

[Figure: horizontal axis from 200,000 to 1,000,000.]

Conclusion

1. with minimal effort, significant speedups are possible

2. ODBC as minimal requirement

3. extensions: parallel computing...
