20110620 amst rdam_kpb
DESCRIPTION
AmstRdam presentationTRANSCRIPT
![Page 1: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/1.jpg)
IntroductionComputing in databases
Conclusion
Computing near the data:let someone else do the heavy lifting for you
Konrad Banachewicz
AmstRdam, June 20th 2011
Konrad Banachewicz Computing near the data
![Page 2: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/2.jpg)
IntroductionComputing in databases
Conclusion
”We’re drowning in data and starving for information”
Konrad Banachewicz Computing near the data
![Page 3: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/3.jpg)
IntroductionComputing in databases
Conclusion
Data coming in from the market:
1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII
different parameters → ”clones” of the same instrument
{ exchanges } x { instruments } x { days }...= A LOT
Konrad Banachewicz Computing near the data
![Page 4: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/4.jpg)
IntroductionComputing in databases
Conclusion
Data coming in from the market:
1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII
different parameters → ”clones” of the same instrument
{ exchanges } x { instruments } x { days }...= A LOT
Konrad Banachewicz Computing near the data
![Page 5: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/5.jpg)
IntroductionComputing in databases
Conclusion
Data coming in from the market:
1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII
different parameters → ”clones” of the same instrument
{ exchanges } x { instruments } x { days }...= A LOT
Konrad Banachewicz Computing near the data
![Page 6: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/6.jpg)
IntroductionComputing in databases
Conclusion
Data coming in from the market:
1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII
different parameters → ”clones” of the same instrument
{ exchanges } x { instruments } x { days }...= A LOT
Konrad Banachewicz Computing near the data
![Page 7: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/7.jpg)
IntroductionComputing in databases
Conclusion
Problems:
memory
bandwidth
Konrad Banachewicz Computing near the data
![Page 8: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/8.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Typical approach
read the data to memory
analyze there
save the results
Konrad Banachewicz Computing near the data
![Page 9: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/9.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Typical approach
read the data to memory
analyze there
save the results
Konrad Banachewicz Computing near the data
![Page 10: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/10.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Typical approach
read the data to memory
analyze there
save the results
Konrad Banachewicz Computing near the data
![Page 11: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/11.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Typical approach
read the data to memory
analyze there
save the results
Konrad Banachewicz Computing near the data
![Page 12: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/12.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
But is it really necessary?
Konrad Banachewicz Computing near the data
![Page 13: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/13.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
In many cases what we really need is aggregate info:Example: linear regression
classic estimatorβ̂ = (XTX )−1XT y
come to think about it, what we really need are sums, sums ofsquares and cross-products
Konrad Banachewicz Computing near the data
![Page 14: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/14.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
In many cases what we really need is aggregate info:Example: linear regression
classic estimatorβ̂ = (XTX )−1XT y
come to think about it, what we really need are sums, sums ofsquares and cross-products
Konrad Banachewicz Computing near the data
![Page 15: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/15.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
In many cases what we really need is aggregate info:Example: linear regression
classic estimatorβ̂ = (XTX )−1XT y
come to think about it, what we really need are sums, sums ofsquares and cross-products
Konrad Banachewicz Computing near the data
![Page 16: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/16.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Two possible approaches:
1 Ripley i Chen: extra interface, pure R
2 R + SQL
Konrad Banachewicz Computing near the data
![Page 17: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/17.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Ripley i Chen
R(user) // CORBA // R(servant)
��DB
Konrad Banachewicz Computing near the data
![Page 18: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/18.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Alternative
R(user) // DBoo
Two scenarios:
1 pure R processing
2 computations partially in DB
Konrad Banachewicz Computing near the data
![Page 19: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/19.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
base model:Yt = β1 + β2Xt + εt
estimator:
β̂ =(XTX
)−1XTY
in the DB: arithmetic operations on a limited set of columns
Konrad Banachewicz Computing near the data
![Page 20: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/20.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
base model:Yt = β1 + β2Xt + εt
estimator:
β̂ =(XTX
)−1XTY
in the DB: arithmetic operations on a limited set of columns
Konrad Banachewicz Computing near the data
![Page 21: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/21.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
base model:Yt = β1 + β2Xt + εt
estimator:
β̂ =(XTX
)−1XTY
in the DB: arithmetic operations on a limited set of columns
Konrad Banachewicz Computing near the data
![Page 22: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/22.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
base model:Yt = β1 + β2Xt + εt
estimator:
β̂ =(XTX
)−1XTY
in the DB: arithmetic operations on a limited set of columns
Konrad Banachewicz Computing near the data
![Page 23: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/23.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Pure R processing
200000 400000 600000 800000 1000000
05
1015
2025
30
Case study 1, method 1
Dataset size (number of rows)
Exec
utio
n tim
e (s
econ
ds)
Ingres VWIngresMySQLPostgreSQLDBMS X
Konrad Banachewicz Computing near the data
![Page 24: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/24.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Computations partially in DB
200000 400000 600000 800000 1000000
05
1015
2025
30
Case study 1, method 2
Dataset size (number of rows)
Exec
utio
n tim
e (s
econ
ds)
Ingres VWIngresMySQLPostgreSQLDBMS X
Konrad Banachewicz Computing near the data
![Page 25: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/25.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
base model:
Cov(X ,Y ) = E [XY ]− EXEY
estimator:
ˆCov(X ,Y ) =1
n
n∑i=1
XiYi −
(1
n
n∑i=1
Xi
)(1
n
n∑i=1
Yi
)
in the DB: large queries
Konrad Banachewicz Computing near the data
![Page 26: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/26.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
base model:
Cov(X ,Y ) = E [XY ]− EXEY
estimator:
ˆCov(X ,Y ) =1
n
n∑i=1
XiYi −
(1
n
n∑i=1
Xi
)(1
n
n∑i=1
Yi
)
in the DB: large queries
Konrad Banachewicz Computing near the data
![Page 27: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/27.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
base model:
Cov(X ,Y ) = E [XY ]− EXEY
estimator:
ˆCov(X ,Y ) =1
n
n∑i=1
XiYi −
(1
n
n∑i=1
Xi
)(1
n
n∑i=1
Yi
)
in the DB: large queries
Konrad Banachewicz Computing near the data
![Page 28: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/28.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
base model:
Cov(X ,Y ) = E [XY ]− EXEY
estimator:
ˆCov(X ,Y ) =1
n
n∑i=1
XiYi −
(1
n
n∑i=1
Xi
)(1
n
n∑i=1
Yi
)
in the DB: large queries
Konrad Banachewicz Computing near the data
![Page 29: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/29.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Pure R processing
15 20 25 30 35
010
2030
4050
60
Case study 1, method 1
Dataset size (columns)
Exec
utio
n tim
e (s
econ
ds)
Ingres VWIngresMySQLPostgreSQLDBMS X
Konrad Banachewicz Computing near the data
![Page 30: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/30.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Computations partially in DB
15 20 25 30 35
010
2030
4050
60
Case study 1, method 1
Dataset size (columns)
Exec
utio
n tim
e (s
econ
ds)
Ingres VWIngresMySQLPostgreSQLDBMS X
Konrad Banachewicz Computing near the data
![Page 31: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/31.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
calculate a quantile of the portfolio PnL
Vp = inf {u : F (u) ≥ 1− p}
estimator:V̂p = X[n(1−p)]+1
in the DB: sorting
Konrad Banachewicz Computing near the data
![Page 32: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/32.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
calculate a quantile of the portfolio PnL
Vp = inf {u : F (u) ≥ 1− p}
estimator:V̂p = X[n(1−p)]+1
in the DB: sorting
Konrad Banachewicz Computing near the data
![Page 33: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/33.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
calculate a quantile of the portfolio PnL
Vp = inf {u : F (u) ≥ 1− p}
estimator:V̂p = X[n(1−p)]+1
in the DB: sorting
Konrad Banachewicz Computing near the data
![Page 34: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/34.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
calculate a quantile of the portfolio PnL
Vp = inf {u : F (u) ≥ 1− p}
estimator:V̂p = X[n(1−p)]+1
in the DB: sorting
Konrad Banachewicz Computing near the data
![Page 35: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/35.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Pure R processing
2000000 4000000 6000000 8000000 10000000
020
4060
8010
0
Case study 3, method 1
Dataset size (number of rows)
Exec
utio
n tim
e (s
econ
ds)
Ingres VWIngresMySQLPostgreSQLDBMS X
Konrad Banachewicz Computing near the data
![Page 36: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/36.jpg)
IntroductionComputing in databases
Conclusion
Model 1: regressionModel 2: correlationModel 3: VaR
Computations partially in DB
200000 400000 600000 800000 1000000
020
4060
8010
0
Case study 3, method 2
Dataset size (number of rows)
Exec
utio
n tim
e (s
econ
ds)
Ingres VWIngresMySQLPostgreSQLDBMS X
Konrad Banachewicz Computing near the data
![Page 37: 20110620 amst rdam_kpb](https://reader034.vdocuments.us/reader034/viewer/2022052413/5598fc6e1a28ab5e718b470b/html5/thumbnails/37.jpg)
IntroductionComputing in databases
Conclusion
1 with minimal effort, significant speedups are possible
2 ODBC as minimal requirement
3 extensions: parallel computing...
Konrad Banachewicz Computing near the data