© 2011 Deloitte Touche Tohmatsu
About me
• Educational background – Applied Econometrics
• 4 years statistical modelling experience
• R experience – 2 years
• Currently Senior Analyst at Deloitte
Hobby – rock climbing, data mining competitions
Why? - Early retirement
Current interest – Text analytics
Topic: The benefits of R from a data mining competitor’s point of view and from the point of view of an employee at Deloitte
Work
Professional and pragmatic
Home
The playful scientist
Agenda
1. Quick introduction to R
2. What I use R for
3. R at work
• Introduction to Deloitte
• Frequently used tools
• Some of the work we do using R
• Examples
• Challenges: Data Storage
• Challenges: Standardisation
• How Deloitte is addressing this issue
4. R at home:
• Some of the work I do using R, at home
• Flexibility and convenience
• Examples
• Prototyping and experimenting
• Examples
5. Questions
6. Essential R packages for everyday use
Quick introduction to R
“A statistical software created by statisticians, for statisticians”
Personally, I use R for data analysis and statistical modelling
Unique features worth noting:
• Open source – free, easy to find help in the active community
• Understands mathematical computations and matrix operations naturally
• Thousands of packages, implementations of almost any algorithm
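As a small illustration of the "matrix operations" point above, here is a minimal sketch in base R. The numbers are made up for the example; it fits a straight line by solving the normal equations directly and cross-checks the result against R's built-in `lm()`.

```r
# Toy data (assumed for illustration): y is exactly 1 + 2 * x
x <- c(1, 2, 3, 4, 5)
y <- c(3, 5, 7, 9, 11)

# Vectorised arithmetic: operates on the whole vector, no loop needed
x_squared <- x^2

# Design matrix with an intercept column
X <- cbind(intercept = 1, x = x)

# Ordinary least squares via the normal equations: solve (X'X) b = X'y
XtX <- t(X) %*% X
Xty <- t(X) %*% y
beta <- solve(XtX, Xty)

# Cross-check against the built-in linear model
beta_lm <- coef(lm(y ~ x))
```

Both `beta` and `beta_lm` should recover the intercept and slope used to build the toy data.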
Introduction to R: thousands of packages, implementations of almost any algorithm
ggplot2
EBImage
randomForest
etc
N = 500+
Packages
R at work
Introduction to Deloitte
1. We help clients capture, manage and analyse data to solve important business problems and make informed decisions
2. A holistic process of data mining
Introduction to Deloitte: Typical activity involved in a project at Deloitte
[Figure: level of activity (y-axis) over the project time line (x-axis) for each phase: Initiating processes, Planning processes, Data loading, Data preparation, Modeling, Closing processes]
But not everything is R
20% - 40% of time spent on modelling
Frequently used tools
Geospatial analytics - Tactician
Segmentation - Self Organising Maps
Modelling
Visualisation
SQL server
Some of the work we do using R
In Deloitte
• Statistical Analysis and Predictive modelling
• Time series analysis
• Social Network Analysis
• Data visualisation
• Text analytics (NEW!)
Examples: Time Series
[Figure: retail activity (y-axis) against time in days (x-axis); legend: Estimate, Actual, Fitted]
R package:
forecast
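The plot on this slide compares actual values, fitted values and a forward estimate. The slide's data and model are not shown, so the following is only a sketch of that kind of comparison under stated assumptions: it uses the built-in `AirPassengers` series as a stand-in for the retail data, and base R's `arima()` and `predict()` rather than the forecast package (whose `auto.arima()` and `forecast()` functions offer a higher-level route to the same result).

```r
# Stand-in data (assumption): built-in monthly series, log-transformed
actual <- log(AirPassengers)

# Seasonal ARIMA(0,1,1)(0,1,1)[12]; the model order is chosen by hand here
fit <- arima(actual,
             order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# In-sample fitted values: actual minus the model residuals
fitted_values <- actual - residuals(fit)

# Out-of-sample estimate for the next 12 periods
estimate <- predict(fit, n.ahead = 12)$pred
```

To draw the three series together, something like `ts.plot(actual, fitted_values, estimate, lty = c(1, 2, 3))` should work.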
Challenges: Data Storage
We have a dedicated tool to store and clean data – SQL
R holds its data in memory, so it cannot handle data sets larger than the available RAM:
Error: cannot allocate vector of size 2097151 Kb
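The error above is a request for roughly 2 GB in a single vector (2097151 Kb). One common workaround, besides aggregating in SQL first, is to stream a file in chunks and keep only running totals. The sketch below shows that idea in base R; the file, the column names `id` and `amount`, and the chunk size are all made up for illustration.

```r
# Write a small demo file (stands in for a large extract)
path <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:1000, amount = rep(c(1.5, 2.5), 500))
write.csv(demo, path, row.names = FALSE)

chunk_size <- 100  # rows read per pass; tune to the memory available
con <- file(path, open = "r")

# First line holds the column names (write.csv quotes them)
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

total <- 0
n_rows <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break  # end of file
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  total <- total + sum(chunk$amount)  # keep only the running summary
  n_rows <- n_rows + nrow(chunk)
}
close(con)

mean_amount <- total / n_rows
```

Only one chunk is in memory at a time, so peak memory use depends on `chunk_size` rather than on the size of the file.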
Challenges: Standardisation
“You’re not the only one using it”
One of the reasons why commercial tools are preferred over R
• Transferable skills across the team
• Reliability of packages
• Standardised functions and procedures
How Deloitte is addressing this issue
Creating a standardised process:
R package:
RODBC
How Deloitte is addressing this issue
Creating standardised functions:
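# Density plot for one column of a dataset
library(ggplot2)
DensityPlot <- function(dataset, col) {
  ds <- data.frame(dataset)
  ds$c <- ds[, col]  # copy the chosen column (by number or name) to a fixed name
  # Kernel density estimate of that column, using a biweight kernel
  ggplot(data = ds, aes(x = c)) + geom_density(kernel = "biweight")
}
# Usage: DensityPlot(dataset, column_number)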
Retrieving data from the database (RODBC):
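library(RODBC)
# Open a connection to SQL Server, run the query, then release the connection
conn <- odbcDriverConnect("driver=SQL Server; database=DataBaseName; server=servername;")
query <- "SELECT * FROM TableName"
df <- sqlQuery(conn, query)
odbcClose(conn)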
R package:
RODBC
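In the same spirit of standardised functions, the retrieval step could be wrapped so that every analyst opens, queries and closes a connection the same way. This is a hypothetical sketch, not Deloitte's actual helper: `build_query` and `fetch_table` are invented names, and the table-name check is an added assumption to keep unexpected strings out of the SQL. Only `odbcDriverConnect`, `sqlQuery` and `odbcClose` come from RODBC.

```r
# Build a SELECT statement after validating the table name (hypothetical helper)
build_query <- function(table) {
  ok <- is.character(table) && length(table) == 1 &&
    grepl("^[A-Za-z_][A-Za-z0-9_.]*$", table)
  if (!ok) stop("Invalid table name: ", table)
  sprintf("SELECT * FROM %s", table)
}

# Fetch a whole table and always release the connection (hypothetical helper)
fetch_table <- function(conn_string, table) {
  if (!requireNamespace("RODBC", quietly = TRUE)) {
    stop("Package 'RODBC' is required")
  }
  conn <- RODBC::odbcDriverConnect(conn_string)
  if (!inherits(conn, "RODBC")) stop("Could not connect to the database")
  on.exit(RODBC::odbcClose(conn))  # runs even if the query fails
  RODBC::sqlQuery(conn, build_query(table))
}
```

A call would then look like `df <- fetch_table("driver=SQL Server; database=DataBaseName; server=servername;", "TableName")`, reusing the placeholder names from the slide.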
R at home
Some of the work I do using R, at home
In Deloitte
• Statistical Analysis and Predictive modelling
• Time series analysis
• Social Network Analysis
• Data visualisation
• Text analytics (NEW!)
(we don’t just use R)
At home (data mining competitions)
• Statistical analysis and Predictive modelling
• Time series analysis
• Social Network Analysis
• Data visualisation
• Text analytics
• Image analysis
(I mainly use R)
Flexibility and convenience
1. R is one of the easier programming languages to pick up
2. Dive into the analysis quickly
Examples
Image analysis
R package:
EBImage
Prototyping and experimenting
1. Access to the latest, most innovative techniques
2. Great for prototyping new algorithms
Examples: Text analytics
1 The latest proof that Google can do no wrong | http://t.co/dSUhwVoO (via @Techland)
2 Teen girls look to YouTube for self-image validation | http://t.co/PSfROdi4 (via @TIMEHealthland)
3 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)
4 PHOTOS: Amazing Photos of the Sun http://t.co/bmYAtNab via @TIME
5 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)
6 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)
R package:
Examples: Word cloud of Twitter feeds
R package:
wordcloud
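A word cloud is driven by word frequencies. The sketch below computes them in base R for the six tweets listed in the text analytics example. The cleaning rules and the short stop-word list are assumptions made for illustration, not the processing used for the slide.

```r
tweets <- c(
  "The latest proof that Google can do no wrong | http://t.co/dSUhwVoO (via @Techland)",
  "Teen girls look to YouTube for self-image validation | http://t.co/PSfROdi4 (via @TIMEHealthland)",
  "Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)",
  "PHOTOS: Amazing Photos of the Sun http://t.co/bmYAtNab via @TIME",
  "Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)",
  "Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)"
)

clean <- tolower(tweets)
clean <- gsub("http[^ ]*", " ", clean)       # drop URLs
clean <- gsub("[@#][a-z0-9_]+", " ", clean)  # drop @mentions and #hashtags
clean <- gsub("[^a-z ]", " ", clean)         # drop punctuation and digits

words <- unlist(strsplit(clean, " +"))

# Small illustrative stop-word list (assumption)
stop_words <- c("the", "that", "can", "do", "no", "to", "for", "of",
                "us", "now", "more", "than", "ever", "why", "via")
words <- words[nchar(words) > 0 & !(words %in% stop_words)]

# Named vector of counts, most frequent first
counts <- table(words)
freq <- sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
```

With the wordcloud package installed, `wordcloud::wordcloud(names(freq), freq, min.freq = 1)` should draw the cloud from these counts.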
Examples: Text analytics
What are the common themes that are being tweeted by Time magazine?
Tweet
R package:
ggplot2
Top words associated with the classification
[Figure: top words per class, panels A, B, C and D]
Classification results
No.  Topic 1  Topic 2  Topic 3  Topic 4  Tweet
1    40%      0%       0%       60%      The latest proof that Google can do no wrong | http://t.co/dSUhwVoO (via @Techland)
2    0%       0%       0%       100%     Teen girls look to YouTube for self-image validation | http://t.co/PSfROdi4 (via @TIMEHealthland)
3    100%     0%       0%       0%       Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)
4    0%       100%     0%       0%       PHOTOS: Amazing Photos of the Sun http://t.co/bmYAtNab via @TIME
5    100%     0%       0%       0%       Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)
6    100%     0%       0%       0%       Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)
7    0%       0%       67%      33%      Astrophysicist @neiltyson responds to @TIME's q: "What is the most astounding fact about the Universe?" http://t.co/91565khw | beautiful vid
8    100%     0%       0%       0%       Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)
9    0%       0%       0%       100%     Living Alone Is The New Norm http://t.co/25BzVSLN (via @TIME) #teamhermit
10   0%       100%     0%       0%       PHOTOS: Seven days of strange landscapes | http://t.co/oLtxFcp8
11   0%       0%       0%       100%     Subject for Debate: Are Women People? http://t.co/IRVthFc8 via @TIME
12   0%       100%     0%       0%       PHOTOS: Seven days of strange landscapes | http://t.co/oLtxFcp8
13   0%       0%       100%     0%       @Time Israel's bogus case for bombing Gaza obscures political motives | Al Akhbar English http://t.co/Fud0mDNN via @AlakhbarEnglish
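One simple way to read the classification results is to assign each tweet to the topic with the largest share and then count tweets per topic. The sketch below does that in base R using the percentages from the results table. The slide does not say how the shares were produced, so the code starts from the shares themselves and makes no claim about the underlying model.

```r
# Topic shares (%) for the 13 tweets, one row per tweet, as in the results table
topic_share <- matrix(
  c( 40,   0,   0,  60,
      0,   0,   0, 100,
    100,   0,   0,   0,
      0, 100,   0,   0,
    100,   0,   0,   0,
    100,   0,   0,   0,
      0,   0,  67,  33,
    100,   0,   0,   0,
      0,   0,   0, 100,
      0, 100,   0,   0,
      0,   0,   0, 100,
      0, 100,   0,   0,
      0,   0, 100,   0),
  ncol = 4, byrow = TRUE,
  dimnames = list(as.character(1:13), paste("Topic", 1:4))
)

# Dominant topic per tweet: the column holding the largest share in each row
dominant_topic <- colnames(topic_share)[apply(topic_share, 1, which.max)]

# Number of tweets assigned to each topic
tweets_per_topic <- table(factor(dominant_topic, levels = colnames(topic_share)))
```

Tweets with a mixed profile, such as tweets 1 and 7, are assigned to their larger share, so this summary discards the secondary topic.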
Questions?
Essential R packages for everyday use
Essential
• ggplot2
• reshape
• RODBC
• randomForest
• rpart
Nice to have
• caret
• forecast
• tm