my journey to r - budapest bi forum2014.budapestbiforum.hu/letoltes/2014_budapestbi/...nov 26, 2014...

My journey to R

26 Nov 2014

Budapest BI Forum

Matt Dowle

2

Background

Graduated 1996: Applied Maths and Computing

Joined Lehman Brothers: Sybase SQL & Visual Basic

1999: Joined Salomon Brothers

They used S-PLUS already.

Patrick Burns (an S-PLUS guru and author) was a member of the team ...

3

Pat shows me S-PLUS

> DF <- data.frame(

A = letters[1:3],

B = c(1,3,5) )

> DF

A B

1 a 1

2 b 3

3 c 5

4

Rows are stored in order

> DF[2:3,]

A B

2 b 3

3 c 5

(unlike SQL and very useful)

5

My first thought

> DF[2:3,]

A B

2 b 3

3 c 5

> DF[2:3, sum(B)]

[1] 8

6

No

sum(DF[2:3,"B"])

or

with(DF[2:3,], sum(B))

7

3 years pass, 2002

One day S-PLUS crashes

It's not my code, but a corruption in S-PLUS

Support: Are you sure it's not your code?

Matt: Yes. See, here's how you reproduce it.

Support: Yes, you're right. We'll fix it, thanks!

Matt: Great, when?

8

When

Support: Immediately for the next release.

Matt: Great, when's that?

Support: 6 months

Matt: Can you do a patch quicker?

Support: No because it's just you with the problem.

Matt: But I'm at Salomon/Citigroup, the biggest financial corporation in the world!

Support: True but it's still just you, Matt.

9

When continued

Matt: I understand. Can you send me the code and I'll fix it? I don't mind - I'll do it for free. I just want to fix it to get my job done.

Support: Sorry, can't do that. Lawyer says no.

Matt: Pat, any ideas?

Pat: Have you tried R?

Matt: What's R?

10

R in 2002

I took the code I had in S-PLUS and ran it in R.

Not only didn't it crash, but it took 1 minute instead of 1 hour.

R had improved the speed of for loops (*) and was in-memory rather than on-disk.

(*) The code generated random portfolios and couldn't be vectorized, due to its nature.

11

Even better

If R does error or crash, I can fix it. We have the source code! Or we can hire someone to fix it for us.

I can get my work done and not wait 6 months for a fix.

And it has packages.

I start to use R.

12

My first thought, again

Matt: Pat, remember how I first thought [.data.frame should work?

DF[2:3, sum(B)]

13

2004, day 1

DF[2:3, sum(B)] is born.

Only possible because R (uniquely) has lazy evaluation.
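A minimal base R sketch of why lazy evaluation matters here. The helper name `query` is hypothetical and this is not how data.table is implemented; it only illustrates that a function can capture the unevaluated argument `sum(B)` with `substitute()` and then evaluate it inside the selected rows.

```r
DF <- data.frame(A = letters[1:3], B = c(1, 3, 5))

# Hypothetical helper: the argument 'j' is not evaluated when query() is called.
query <- function(df, i, j) {
  j_expr <- substitute(j)               # capture the expression, e.g. sum(B)
  rows <- df[i, , drop = FALSE]         # subset the rows first
  eval(j_expr, rows, parent.frame())    # evaluate with the columns visible as variables
}

result <- query(DF, 2:3, sum(B))
print(result)
```

In a language that evaluates arguments eagerly, `sum(B)` would fail before the function body ran, because `B` does not exist outside the data frame.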

14

2004, day 2

I do the same for i

DF[type=="books", sum(sales)]

15

2004, day 3

I realise I need group by:

DF[type=="books", sum(sales), by=country]

16

2004, day 4

I realise chaining comes for free:

DF[, sum(sales), by=country][order(-V1), ]

17

Aug 2008

I release data.table 1.1 to CRAN:

DT[ where, select, group by ][ … ][ … ]
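The general form above can be sketched on a small table. The data below is hypothetical and only reuses the column names from the earlier slides; it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical sales data
DT <- data.table(
  type    = c("books", "books", "music", "books", "music"),
  country = c("HU", "UK", "HU", "HU", "UK"),
  sales   = c(10, 20, 5, 30, 15)
)

# where: type == "books"; select: sum(sales); group by: country
by_country <- DT[type == "books", sum(sales), by = country]

# Chain a second query onto the result; V1 is the default name of the unnamed sum
ranked <- by_country[order(-V1)]
print(ranked)
```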

18

2011

I define := in j to do assignment by reference, combined with subset and grouping

DT[ where, select | update, group by ][ … ][ … ]

From v1.6.3 NEWS:

for (i in 1:1000) DF[i,1] <- i # 591s

for (i in 1:1000) DT[i,V1:=i] # 1s
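A small sketch of `:=` combined with a row subset and with grouping, on hypothetical data and assuming the data.table package is installed. Both statements modify `DT` in place rather than building a modified copy.

```r
library(data.table)

DT <- data.table(V1 = integer(5), grp = c("a", "a", "b", "b", "b"))

# Update a subset of rows by reference
DT[2:3, V1 := 10L]

# Add a new column computed per group, also by reference
DT[, total := sum(V1), by = grp]
print(DT)
```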

19

June 2012

20

data.table answer

NB: It isn't just the speed, but the simplicity. It's easy to write and easy to read.

21

User's reaction

“data.table is awesome! That took about 3 seconds for the whole thing [was 30mins]!!!”

Davy Kavanagh, 15 Jun 2012

22

Present day ...

23

Vector languages

Number of questions

R        71,471
S-PLUS       22
Matlab   39,127
Python  354,437
Julia       542

Quick demo of a vector language
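The live demo is not part of the slide text. As a stand-in, here is a minimal base R sketch of what "vector language" means: operations apply to whole vectors without an explicit loop. The values are made up for illustration.

```r
x <- c(1, 3, 5)

y <- x * 2 + 1     # arithmetic is applied to every element
big <- x[x > 2]    # a logical vector selects elements
total <- sum(x)    # a reduction over the whole vector

print(y)
print(big)
print(total)
```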

24

Number of R user groups

Asia        21
Canada       7
Europe      44
UK           8
Australia    7
US          55
===
           142

25

R Task Views (1)

Bayesian Inference
Chemometrics and Computational Physics
Clinical Trial Design, Monitoring, and Analysis
Cluster Analysis & Finite Mixture Models
Differential Equations
Probability Distributions
Computational Econometrics
Analysis of Ecological and Environmental Data

26

R Task Views (2)

Design of Experiments
Empirical Finance
Statistical Genetics
Graphic Displays
High-Performance and Parallel Computing
Machine Learning & Statistical Learning
Medical Image Analysis
Meta-Analysis

27

R Task Views (3)

Multivariate Statistics
Natural Language Processing
Numerical Mathematics
Official Statistics & Survey Methodology
Optimization and Mathematical Programming
Analysis of Pharmacokinetic Data
Phylogenetics, Especially Comparative Methods
Psychometric Models and Methods

28

R Task Views (4)

Reproducible Research
Robust Statistical Methods
Statistics for the Social Sciences
Analysis of Spatial Data
Handling and Analyzing Spatio-Temporal Data
Survival Analysis
Time Series Analysis
Web Technologies and Services

29

Fast and friendly file reading

e.g. 50MB .csv, 1 million rows x 6 columns

read.csv("test.csv") # 30-60s

read.csv("test.csv", colClasses=, nrows=, etc...) # 10s

fread("test.csv") # 3s

e.g. 20GB .csv, 200 million rows x 16 columns

read.csv("big.csv", ...) # hours

fread("big.csv") # 8m
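A minimal sketch comparing the two readers on a tiny temporary file, assuming the data.table package is installed. The file contents are hypothetical and far too small to reproduce the timings above; the point is only that `fread()` needs no `colClasses` or `nrows` hints.

```r
library(data.table)

# Write a small hypothetical CSV to a temporary file
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, x = c(0.5, 1.5, 2.5, 3.5, 4.5)),
          path, row.names = FALSE)

DF <- read.csv(path)   # base R: returns a data.frame
DT <- fread(path)      # data.table: detects separator and column types itself

print(DT)
unlink(path)
```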

30

Benchmarks

Use large data (don't loop on small data)

Vary data
  large groups vs small groups
  types: integer, character, numeric, factor
  random input or (partially) ordered

Vary queries
  Grouping: sum, mean, median, …
  Joining, Updating, Subsetting

Easy to reproduce

Incomplete, but this is how far I've got ...

31

1 billion rows (50GB)

32

1 billion rows (50GB)

33

Example from Landmark, an Insurance Services company in the U.K.

(with permission) ...

© 2013 DMGT |34

LiDAR data

© 2013 DMGT |35

Individual grid files

Building Outline (red)

Building Outline reduced by 1m

Generated ascii grid file

© 2013 DMGT |36

Reading ascii grid files

library(SDMTools)

asc <- read.asc(file="./0001000012290222.asc",gz=FALSE)

library(raster)

r <- raster("./0001000012290222.asc", crs=CRS("+init=epsg:27700")) # OSGB36

© 2013 DMGT |41

The data.table query used

in.lid.bld.dsm.dt[, c(
    tot.dsm.count = length(which(!is.na(dsm)))
  , tot.dsm.slp.count = length(which(!is.na(dsm.slp)))
  , min.dsm.distance = min(distance, na.rm = TRUE)
  , min.dsm = min(dsm, na.rm = TRUE)
  , max.dsm = max(dsm, na.rm = TRUE)
  , dsm.diff = round(max(dsm)-min(dsm),4)
  , mean.dsm = round(mean(dsm, na.rm = TRUE),2)
  , max.dsm.dtm.diff = round(max(dsm.dtm.diff),4)
  , mean.dsm.dtm.diff = round(mean(dsm.dtm.diff, na.rm = TRUE),4)
  , mean.dsm.slp = round(mean(dsm.slp, na.rm = TRUE),2)
  , mean.dsm.slp.gt15 = round(MeanSlopeGT15(dsm.slp),2)
  , mean.dsm.dtm.diff2 = round(mean(diff(dsm.dtm.diff,1), na.rm = TRUE, trim = 0.1),4)
  , max.dsm.dtm.diff2 = round(max(diff(dsm.dtm.diff,1)),4)
  , flat.cells = CountFlatCells(dsm.slp)
  , flat.cells.0.5 = CountFlatCells.0.5(dsm.slp)
  , flat.cells.5.10 = CountFlatCells.5.10(dsm.slp)
  , flat.cells.10.15 = CountFlatCells.10.15(dsm.slp)
  , RoofOrientation(dsm.slp, dsm.asp.head, dsm.asp.head.comp)  # RoofOrientation() returns a list of 13 more columns
  ),
  by=buiding.id ]  # all 20 million properties in the U.K.

43

“10 R packages to win Kaggle competitions”, useR! 2014

data.table

By Xavier Conort of DataRobot.com

http://www.slideshare.net/DataRobot/final-10-r-xc-36610234

http://DataRobot.com/

44

data.table course on DataCamp

45

setkey(PRC, id, date)

1. PRC[.("SBRY")] # all 3 rows
2. PRC[.("SBRY",20080502),price] # 391.50
3. PRC[.("SBRY",20080505),price] # NA
4. PRC[.("SBRY",20080505),price,roll=TRUE] # 391.50
5. PRC[.("SBRY",20080601),price,roll=TRUE] # 389.00
6. PRC[.("SBRY",20080601),price,roll=TRUE,rollends=FALSE] # NA
7. PRC[.("SBRY",20080601),price,roll=20] # NA
8. PRC[.("SBRY",20080601),price,roll=40] # 389.00

PRC
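The PRC table itself is not reproduced in the slide text. The sketch below shows the same kinds of rolling join on a hypothetical stand-in table `P`, where `day` is a plain day count rather than a calendar date; it assumes the data.table package is installed.

```r
library(data.table)

# Hypothetical prices; not the PRC table from the slide
P <- data.table(id = "SBRY",
                day = c(1L, 2L, 6L),
                price = c(390.0, 391.5, 389.0))
setkey(P, id, day)

exact  <- P[.("SBRY", 5L), price]                # no row for day 5, so NA
rolled <- P[.("SBRY", 5L), price, roll = TRUE]   # last observation (day 2) carried forward
stale  <- P[.("SBRY", 30L), price, roll = 20]    # day 6 is 24 days old: beyond the limit
recent <- P[.("SBRY", 30L), price, roll = 40]    # 24 days old: within the limit

print(c(exact, rolled, stale, recent))
```

A numeric `roll` limits how far a previous observation may be carried forward, which is the difference between queries 7 and 8 above.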

46

roll = "nearest"

x  y  value
A  2  1.1
A  9  1.2
A 11  1.3
B  3  1.4

setkey(DT, x, y)

DT[.("A",7), roll="nearest"]
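The query above can be run as a self-contained sketch using the four rows from the slide, assuming the data.table package is installed. Within group A, 7 is closer to 9 than to 2, so the row with value 1.2 is matched.

```r
library(data.table)

DT <- data.table(x = c("A", "A", "A", "B"),
                 y = c(2, 9, 11, 3),
                 value = c(1.1, 1.2, 1.3, 1.4))
setkey(DT, x, y)

# Match on x exactly, then take the row whose y is nearest to 7
ans <- DT[.("A", 7), roll = "nearest"]
print(ans)
```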

47

Overlapping range joins

48

28hrs down to 4mins

49

data.table support

50

Client / Server Demo

Watch on YouTube (no audio)

http://www.youtube.com/watch?v=rvT8XThGA8o

51

Further links

https://github.com/Rdatatable/data.table/wiki

http://stackoverflow.com/questions/tagged/data.table

3 hour data.table tutorial at useR! 2014, Los Angeles: http://user2014.stat.ucla.edu/files/tutorial_Matt.pdf

https://www.datacamp.com/courses/data-analysis-the-data-table-way

(10% discount code: MY-DATATABLE-COUPON)
