My journey to R
26 Nov 2014
Budapest BI Forum
Matt Dowle
2
Background
Graduated 1996: Applied Maths and Computing
Joined Lehman Brothers: Sybase SQL & Visual Basic
1999 : Joined Salomon Brothers
They used S-PLUS already.
Patrick Burns (an S-PLUS guru and author) was a member of the team ...
3
Pat shows me S-PLUS
> DF <- data.frame(
A = letters[1:3],
B = c(1,3,5) )
> DF
A B
1 a 1
2 b 3
3 c 5
4
Rows are stored in order
> DF[2:3,]
A B
2 b 3
3 c 5
(unlike SQL and very useful)
5
My first thought
> DF[2:3,]
A B
2 b 3
3 c 5
> DF[2:3, sum(B)]
[1] 8
6
No
sum(DF[2:3,"B"])
or
with(DF[2:3,], sum(B))
7
3 years pass, 2002
One day S-PLUS crashes
It's not my code, but a corruption in S-PLUS
Support: Are you sure it's not your code?
Matt: Yes. See, here's how you reproduce it.
Support: Yes, you're right. We'll fix it, thanks!
Matt: Great, when?
8
When
Support: Immediately for the next release.
Matt: Great, when's that?
Support: 6 months
Matt: Can you do a patch quicker?
Support: No because it's just you with the problem.
Matt: But I'm at Salomon/Citigroup, the biggest financial corporation in the world!
Support: True but it's still just you, Matt.
9
When continued
Matt: I understand. Can you send me the code and I'll fix it? I don't mind - I'll do it for free. I just want to fix it to get my job done.
Support: Sorry, can't do that. Lawyer says no.
Matt: Pat, any ideas?
Pat: Have you tried R?
Matt: What's R?
10
R in 2002
I took the code I had in S-PLUS and ran it in R.
Not only didn't it crash, but it took 1 minute instead of 1 hour.
R had improved the speed of for loops (*) and was in-memory rather than on-disk.
(*) The code generated random portfolios and couldn't be vectorized, due to its nature.
11
Even better
If R does error or crash, I can fix it. We have the source code! Or we can hire someone to fix it for us.
I can get my work done and not wait 6 months for a fix.
And it has packages.
I start to use R.
12
My first thought, again
Matt: Pat, remember how I first thought [.data.frame should work?
DF[2:3, sum(B)]
13
2004, day 1
DF[2:3, sum(B)] is born.
Only possible because R (uniquely) has lazy evaluation.
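A minimal base R sketch of the idea, not data.table's actual implementation: because function arguments are evaluated lazily, a function can capture the unevaluated expression `sum(B)` and evaluate it later with the columns of the subset visible as variables. The helper name `subset_eval` is hypothetical.

```r
# Hypothetical helper (an illustration, not data.table's code): evaluate an
# unevaluated j expression inside a row subset of a data.frame.
subset_eval <- function(df, i, j) {
  j_expr <- substitute(j)             # capture the expression; B is not looked up yet
  rows <- df[i, , drop = FALSE]       # take the requested rows first
  eval(j_expr, rows, parent.frame())  # columns of the subset act as variables
}

DF <- data.frame(A = letters[1:3], B = c(1, 3, 5))
result <- subset_eval(DF, 2:3, sum(B))
print(result)  # 8
```

Without lazy evaluation, `sum(B)` would be evaluated before the function was entered and would fail, since no variable `B` exists outside `DF`.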
14
2004, day 2
I do the same for i
DF[type=="books", sum(sales)]
15
2004, day 3
I realise I need group by :
DF[type=="books", sum(sales), by=country]
16
2004, day 4
I realise chaining comes for free:
DF[, sum(sales), by=country][order(-V1), ]
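The day 2 to day 4 queries can be tried with the released data.table package (which postdates this 2004 prototype). This is a sketch on a toy table: the column names follow the slides, the values are made up.

```r
library(data.table)

# Toy data; column names follow the slides, values are invented.
DT <- data.table(
  type    = c("books", "books", "music", "books", "music"),
  country = c("UK", "HU", "UK", "UK", "HU"),
  sales   = c(10, 20, 5, 30, 15)
)

books_total <- DT[type == "books", sum(sales)]                # i and j together
by_country  <- DT[type == "books", sum(sales), by = country]  # add grouping
ranked      <- DT[, sum(sales), by = country][order(-V1)]     # chain a second query

print(books_total)
print(by_country)
print(ranked)
```

The unnamed result of `sum(sales)` is called `V1` by default, which is why the chained query can order by `-V1`.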
17
Aug 2008
I release data.table 1.1 to CRAN :
DT[ where, select, group by ][ … ][ … ]
18
2011
I define := in j to do assignment by reference, combined with subset and grouping
DT[ where, select | update, group by ][ … ][ … ]
From v1.6.3 NEWS :
for (i in 1:1000) DF[i,1] <- i # 591s
for (i in 1:1000) DT[i,V1:=i] # 1s
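A small sketch of what "by reference" means in practice, using made-up data: `:=` changes the table in place, so a second name bound to the same table sees the update, and it combines with a row subset in `i` and with `by`.

```r
library(data.table)

DT  <- data.table(V1 = integer(5), V2 = letters[1:5])
DT2 <- DT                 # a second name for the same table, not a copy

DT[2:3, V1 := 99L]        # update two rows in place (subset + assignment)
print(DT2$V1)             # the change is visible through DT2 as well

DT[, n := .N, by = V1]    # add a grouped column, also by reference
print(DT)
```

Use `copy(DT)` when an independent copy is wanted.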
19
June 2012
20
data.table answer
NB: It isn't just the speed, but the simplicity. It's easy to write and easy to read.
21
User's reaction
"data.table is awesome! That took about 3 seconds for the whole thing [was 30mins]!!!"
Davy Kavanagh, 15 Jun 2012
22
Present day ...
23
Vector languages
Number of questions
R 71,471
S-PLUS 22
Matlab 39,127
Python 354,437
Julia 542
Quick demo of a vector language
24
Number of R user groups
Asia 21
Canada 7
Europe 44
UK 8
Australia 7
US 55
===
142
25
R Task Views (1)
Bayesian Inference
Chemometrics and Computational Physics
Clinical Trial Design, Monitoring, and Analysis
Cluster Analysis & Finite Mixture Models
Differential Equations
Probability Distributions
Computational Econometrics
Analysis of Ecological and Environmental Data
26
R Task Views (2)
Design of Experiments
Empirical Finance
Statistical Genetics
Graphic Displays
High-Performance and Parallel Computing
Machine Learning & Statistical Learning
Medical Image Analysis
Meta-Analysis
27
R Task Views (3)
Multivariate Statistics
Natural Language Processing
Numerical Mathematics
Official Statistics & Survey Methodology
Optimization and Mathematical Programming
Analysis of Pharmacokinetic Data
Phylogenetics, Especially Comparative Methods
Psychometric Models and Methods
28
R Task Views (4)
Reproducible Research
Robust Statistical Methods
Statistics for the Social Sciences
Analysis of Spatial Data
Handling and Analyzing Spatio-Temporal Data
Survival Analysis
Time Series Analysis
Web Technologies and Services
29
Fast and friendly file reading
e.g. 50MB .csv, 1 million rows x 6 columns
read.csv("test.csv") # 30-60s
read.csv("test.csv", colClasses=, nrows=, etc...) # 10s
fread("test.csv") # 3s
e.g. 20GB .csv, 200 million rows x 16 columns
read.csv("big.csv", ...) # hours
fread("big.csv") # 8m
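A self-contained sketch of the "friendly" part, using a tiny made-up file rather than the files timed above: `fread()` detects the separator, header and column types without being told.

```r
library(data.table)

# Write a small invented CSV to a temporary file so the example runs anywhere.
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = 1:3, name = c("a", "b", "c"), x = c(1.5, 2.5, 3.5)),
  path, row.names = FALSE
)

DT <- fread(path)   # no colClasses, nrows or sep needed
str(DT)
unlink(path)
```

The result is a data.table; character columns stay character rather than being converted to factors.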
30
Benchmarks
Use large data (don't loop on small data)
Vary data
  large groups vs small groups
  types: integer, character, numeric, factor
  random input or (partially) ordered
Vary queries
  Grouping : sum, mean, median, …
  Joining, Updating, Subsetting
Easy to reproduce
Incomplete, but this is how far I've got ...
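One way such a benchmark might be set up, as a sketch only: the column names and sizes below are assumptions, and N is kept small so the example runs quickly, whereas the slides argue for far larger data. A fixed seed keeps it reproducible.

```r
library(data.table)

set.seed(1)   # fixed seed so others can reproduce the same input
N <- 1e5      # illustration only; real benchmarks should use much larger N
DT <- data.table(
  big_groups   = sample(10L, N, replace = TRUE),        # few large groups
  small_groups = sample(N %/% 10L, N, replace = TRUE),  # many small groups
  v            = runif(N)
)

# Time one query per data shape, once, on the whole table (no looping on small data).
t_big   <- system.time(r_big   <- DT[, sum(v), by = big_groups])
t_small <- system.time(r_small <- DT[, sum(v), by = small_groups])
print(c(big = t_big[["elapsed"]], small = t_small[["elapsed"]]))
```

The same pattern extends to other column types, to ordered input, and to the other query kinds listed above.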
31
1 billion rows (50GB)
32
1 billion rows (50GB)
33
Example from Landmark, an Insurance Services company in the U.K.
(with permission) ...
© 2013 DMGT |34
LiDAR data
35
Individual grid files
Building Outline (red)
Building Outline reduced by 1m
Generated ascii grid file
36
Reading ascii grid files
library(SDMTools)
asc <- read.asc(file="./0001000012290222.asc",gz=FALSE)
library(raster)
r <- raster("./0001000012290222.asc", crs=CRS("+init=epsg:27700")) # OSGB36
37
Slope & aspect
[Figure labels: 15°, N, NE, SW]
39
Output
[Figure labels: 15°, N, NE, SW, 1.13m]
31m² ≈ (8m x 4m)
Slope = atan (1.13m / 4m) ≈ 16°
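The slope arithmetic can be checked directly. `atan()` returns radians, so the result is converted to degrees; the variable names are only for illustration.

```r
rise <- 1.13   # metres, from the slide
run  <- 4      # metres, from the slide
slope_deg <- atan(rise / run) * 180 / pi   # radians to degrees
print(round(slope_deg, 1))  # 15.8
```

That rounds to the 16° quoted on the slide and is close to the 15° shown in the figure.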
40
Pre Calculated Lidar Points (1m Posting)
41
The data.table query used
in.lid.bld.dsm.dt[, c(
    tot.dsm.count      = length(which(!is.na(dsm)))
  , tot.dsm.slp.count  = length(which(!is.na(dsm.slp)))
  , min.dsm.distance   = min(distance, na.rm = TRUE)
  , min.dsm            = min(dsm, na.rm = TRUE)
  , max.dsm            = max(dsm, na.rm = TRUE)
  , dsm.diff           = round(max(dsm)-min(dsm),4)
  , mean.dsm           = round(mean(dsm, na.rm = TRUE),2)
  , max.dsm.dtm.diff   = round(max(dsm.dtm.diff),4)
  , mean.dsm.dtm.diff  = round(mean(dsm.dtm.diff, na.rm = TRUE),4)
  , mean.dsm.slp       = round(mean(dsm.slp, na.rm = TRUE),2)
  , mean.dsm.slp.gt15  = round(MeanSlopeGT15(dsm.slp),2)
  , mean.dsm.dtm.diff2 = round(mean(diff(dsm.dtm.diff,1), na.rm = TRUE, trim = 0.1),4)
  , max.dsm.dtm.diff2  = round(max(diff(dsm.dtm.diff,1)),4)
  , flat.cells         = CountFlatCells(dsm.slp)
  , flat.cells.0.5     = CountFlatCells.0.5(dsm.slp)
  , flat.cells.5.10    = CountFlatCells.5.10(dsm.slp)
  , flat.cells.10.15   = CountFlatCells.10.15(dsm.slp)
  , RoofOrientation(dsm.slp, dsm.asp.head, dsm.asp.head.comp)  # RoofOrientation() returns a list of 13 more columns
  ), by=buiding.id ]  # all 20 million properties in the U.K.
42
Kim Edmunds
Chief Data Scientist
Landmark Insurance Services, U.K.
43
"10 R packages to win Kaggle competitions", useR! 2014
data.table
By Xavier Conort of DataRobot.com
http://www.slideshare.net/DataRobot/final-10-r-xc-36610234
44
data.table course on DataCamp
45
setkey(PRC, id, date)
1. PRC[.("SBRY")] # all 3 rows
2. PRC[.("SBRY",20080502),price] # 391.50
3. PRC[.("SBRY",20080505),price] # NA
4. PRC[.("SBRY",20080505),price,roll=TRUE] # 391.50
5. PRC[.("SBRY",20080601),price,roll=TRUE] # 389.00
6. PRC[.("SBRY",20080601),price,roll=TRUE,rollends=FALSE] # NA
7. PRC[.("SBRY",20080601),price,roll=20] # NA
8. PRC[.("SBRY",20080601),price,roll=40] # 389.00
PRC
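The PRC table itself is not reproduced in the transcript, so the following is a runnable sketch on a hypothetical stand-in. Only the two prices 391.50 and 389.00 come from the slide; the other rows, the second id and the exact dates are assumptions. Dates are stored as `IDate` here so that the `roll` window is counted in days.

```r
library(data.table)

# Hypothetical stand-in for PRC; only 391.50 and 389.00 appear on the slide.
PRC <- data.table(
  id    = c("SBRY", "SBRY", "SBRY", "VOD"),
  date  = as.IDate(c("2008-05-01", "2008-05-02", "2008-05-06", "2008-05-02")),
  price = c(380.50, 391.50, 389.00, 155.00)
)
setkey(PRC, id, date)

d1 <- as.IDate("2008-05-05")   # no price recorded on this date
d2 <- as.IDate("2008-06-01")   # after the last SBRY observation

exact    <- PRC[.("SBRY", d1), price]                                  # NA: no exact match
rolled   <- PRC[.("SBRY", d1), price, roll = TRUE]                     # 391.5: previous price carried forward
last_obs <- PRC[.("SBRY", d2), price, roll = TRUE]                     # 389: last observation rolled forward
no_ends  <- PRC[.("SBRY", d2), price, roll = TRUE, rollends = FALSE]   # NA: do not roll past the end
win20    <- PRC[.("SBRY", d2), price, roll = 20]                       # NA: last price is 26 days old
win40    <- PRC[.("SBRY", d2), price, roll = 40]                       # 389: within a 40 day window
```

A rolling join matches exactly on `id` and then, within that id, on the last key column using the most recent earlier value, optionally limited to a window.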
46
roll = "nearest"
x y value
A 2 1.1
A 9 1.2
A 11 1.3
B 3 1.4
setkey(DT, x, y)
DT[.("A",7), roll="nearest"]
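The slide's example can be run as follows, with the table built from the four rows shown. For x = "A", the y value nearest to 7 is 9, so that row's value is returned.

```r
library(data.table)

DT <- data.table(
  x     = c("A", "A", "A", "B"),
  y     = c(2, 9, 11, 3),
  value = c(1.1, 1.2, 1.3, 1.4)
)
setkey(DT, x, y)

res <- DT[.("A", 7), roll = "nearest"]
print(res)  # y shows the requested 7; value is 1.2, taken from the row with y = 9
```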
47
Overlapping range joins
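The transcript does not show the code on this slide. As an assumption, data.table's `foverlaps()` is one function for this kind of join; the sketch below uses made-up integer intervals to show its basic shape.

```r
library(data.table)

# Invented intervals; column names are illustrative.
x <- data.table(start = c(5L, 15L), end = c(8L, 25L))
y <- data.table(start = c(1L, 10L, 20L), end = c(6L, 12L, 30L))
setkey(y, start, end)   # foverlaps() requires y to be keyed on its interval columns

# For each interval in x, find the intervals in y that overlap it in any way.
res <- foverlaps(x, y, type = "any")
print(res)
```

In the result, `start` and `end` come from `y`, while `i.start` and `i.end` hold the query intervals from `x`.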
48
28hrs down to 4mins
49
data.table support
51
Further links
https://github.com/Rdatatable/data.table/wiki
http://stackoverflow.com/questions/tagged/data.table
3 hour data.table tutorial at useR! 2014, Los Angeles:
http://user2014.stat.ucla.edu/files/tutorial_Matt.pdf
https://www.datacamp.com/courses/data-analysis-the-data-table-way
(10% discount code: MY-DATATABLE-COUPON)