
Upload: jair-pownell

Post on 31-Mar-2015


© 2011 Deloitte Touche Tohmatsu

About me

• Educational background – Applied Econometrics

• 4 years statistical modelling experience

• R experience – 2 years

• Currently Senior Analyst at Deloitte

Hobby – rock climbing, data mining competitions

Why? - Early retirement

Current interest – Text analytics

Topic: The benefits of R from a data mining competitor’s point of view and from the point of view of an employee at Deloitte

Work

Professional and pragmatic

Home

The playful scientist

Agenda

1. Quick introduction to R

2. What I use R for

3. R at work

• Introduction to Deloitte

• Frequently used tools

• Some of the work we do using R

• Examples

• Challenges: Data Storage

• Challenges: Standardisation

• How Deloitte is addressing this issue

4. R at home:

• Some of the work I do using R, at home

• Flexibility and convenience

• Examples

• Prototyping and experimenting

• Examples

5. Questions

6. Essential R packages for everyday use

Quick introduction to R

“A statistical software created by statisticians, for statisticians”

Personally, I use R for data analysis and statistical modelling

Unique features worth noting:

• Open source – free, easy to find help in the active community

• Understands mathematical computations and matrix operations naturally

• Thousands of packages, implementations of almost any algorithm
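As a small illustration of the point about matrix operations, the sketch below fits a least-squares line using only base R matrix algebra. The data are made up for the example and do not come from the talk.

```r
# Made-up data for illustration: y is exactly 2 + 3 * x
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x

# Design matrix with an intercept column
X <- cbind(intercept = 1, slope = x)

# Normal equations: solve (X'X) beta = X'y
beta <- solve(t(X) %*% X, t(X) %*% y)
beta
```

Transpose (`t`), matrix product (`%*%`) and linear solve (`solve`) are all built in, so the code reads close to the textbook formula.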

Introduction to R: Thousands of packages, implementations of almost any algorithm

[Figure: example packages – ggplot2, EBImage, randomForest, etc. (N = 500+ packages)]

R at work

Introduction to Deloitte

1. We help clients capture, manage and analyse data so that they can solve important business problems and make informed decisions

2. A holistic process of data mining

Introduction to Deloitte: Typical activity involved in a project at Deloitte

[Figure: level of activity over the project time line, with phases labelled initiating processes, planning processes, data loading, data preparation, modelling and closing processes]

But not everything is R

20% - 40% time spent on modelling

Frequently used tools

Geospatial analytics - Tactician

Segmentation - Self Organising maps

Modelling

Visualisation

SQL server

Some of the work we do using R

In Deloitte

• Statistical Analysis and Predictive modelling

• Time series analysis

• Social Network Analysis

• Data visualisation

• Text analytics (NEW!)

Examples: Time Series

[Figure: time series plot; y-axis: retail activity?; x-axis: Time (days); legend: Estimate (dashed line), Actual, Fitted]

R package:

forecast
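A minimal sketch of how a plot like this might be produced with the forecast package. The retail series from the slide is not available, so the built-in monthly `AirPassengers` data stands in for it. The choice of `auto.arima` is an assumption, since the slide does not say which model was used.

```r
library(forecast)  # assumes the forecast package is installed

# AirPassengers ships with R; it stands in for the retail series on the slide
fit <- auto.arima(AirPassengers)

# Forecast the next 12 periods
fc <- forecast(fit, h = 12)

plot(fc)                     # history plus forecast with prediction intervals
lines(fitted(fit), lty = 2)  # in-sample fitted values, dashed
```

Overlaying `fitted(fit)` on the history gives the actual-versus-fitted comparison, while `fc` supplies the estimate beyond the end of the data.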

Challenges: Data Storage

We have a dedicated tool to store and clean data – SQL

R holds data in memory, so it struggles with data sets larger than the available RAM

Error: cannot allocate vector of size 2097151 Kb
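One common way around a memory error like the one above is to process the data in chunks, so that only part of it is in memory at any time. The sketch below does this in base R for a running mean. The file, the column name `value` and the chunk size are made up for illustration, and this is not necessarily how the team handled it.

```r
# Write a small made-up CSV so the sketch is self-contained
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = (1:1000) / 10), path, row.names = FALSE)

chunk_size <- 250  # rows per chunk; tune to the memory available
con <- file(path, open = "r")

# Read the header line once and strip the quotes write.csv adds
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

total <- 0
n <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  total <- total + sum(chunk$value)      # keep only running totals
  n <- n + nrow(chunk)
}
close(con)

mean_value <- total / n
```

When the data already sits in a database, the same idea applies by letting SQL do the aggregation and pulling only the summarised result into R.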

Challenges: Standardisation

“You’re not the only one using it”

One of the reasons why commercial tools are preferred over R

• Transferable skills across the team

• Reliability of packages

• Standardised functions and procedures

How Deloitte is addressing this issue

Creating a standardised process:

R package:

RODBC

How Deloitte is addressing this issue

Creating standardised functions:

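# Density plot for one column of a dataset
library(ggplot2)

DensityPlot <- function(dataset, col) {
  ds <- data.frame(dataset)
  ds$c <- ds[, col]  # copy the chosen column to a fixed name for aes()
  ggplot(data = ds, aes(x = c)) + geom_density(kernel = "biweight")
}

DensityPlot(mtcars, 1)  # usage: DensityPlot(dataset, column number); mtcars is only an example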

Retrieving data from the database (RODBC):

library(RODBC)

conn <- odbcDriverConnect("driver=SQL Server; database=DataBaseName; server=servername;")

query <- "SELECT * FROM TableName"

df <- sqlQuery(conn, query)

odbcClose(conn)  # release the connection when done

R package:

RODBC

R at home

Some of the work I do using R, at home

In Deloitte

• Statistical Analysis and Predictive modelling

• Time series analysis

• Social Network Analysis

• Data visualisation

• Text analytics (NEW!)

(we don’t just use R)

At home (data mining competitions)

• Statistical analysis and Predictive modelling

• Time series analysis

• Social Network Analysis

• Data visualisation

• Text analytics

• Image analysis

(I mainly use R)

Flexibility and convenience

1. R is one of the easier programming languages to pick up

2. Dive into the analysis quickly

Examples

Image analysis

R package:

EBImage

Examples

Image Analysis

R package:

EBImage

Prototyping and experimenting

1. Access to the latest, most innovative techniques

2. Great for prototyping new algorithms

Examples: Text analytics

1 The latest proof that Google can do no wrong | http://t.co/dSUhwVoO (via @Techland)

2 Teen girls look to YouTube for self-image validation | http://t.co/PSfROdi4 (via @TIMEHealthland)

3 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)

4 PHOTOS: Amazing Photos of the Sun http://t.co/bmYAtNab via @TIME

5 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)

6 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)

R package:

twitteR

Examples: Word cloud of twitter feeds

R package:

wordcloud
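A rough sketch of the steps behind a word cloud like this one: clean the tweets, count the words, then plot. The tweets are copied from the earlier slide. The cleaning rules and the tiny stop-word list are ad hoc assumptions for illustration, not the ones used in the talk.

```r
# Tweets taken from the earlier slide
tweets <- c(
  "The latest proof that Google can do no wrong | http://t.co/dSUhwVoO (via @Techland)",
  "Teen girls look to YouTube for self-image validation | http://t.co/PSfROdi4 (via @TIMEHealthland)",
  "Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland)",
  "PHOTOS: Amazing Photos of the Sun http://t.co/bmYAtNab via @TIME"
)

# Basic cleaning: drop URLs, @mentions and #hashtags, lower-case, keep letters only
clean <- gsub("http\\S+|[@#]\\S+", " ", tweets, perl = TRUE)
clean <- gsub("[^a-z ]", " ", tolower(clean))

# Split into words; drop very short words and a small ad hoc stop list
words <- unlist(strsplit(clean, " +"))
stop_words <- c("the", "that", "can", "for", "than", "via", "now")
words <- words[nchar(words) > 2 & !(words %in% stop_words)]

# Word frequencies, most common first
freq <- sort(table(words), decreasing = TRUE)

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 1)
}
```

For larger collections of text, the tm package listed at the end of the deck provides more complete cleaning and stop-word handling.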

Examples: Text analytics

What are the common themes that are being tweeted by Time magazine?

Top words associated with the classification

[Figure: top words for each of the classes A, B, C and D]

R package:

ggplot2

Classification results

Each row: tweet, followed by its share of Topic 1 / Topic 2 / Topic 3 / Topic 4

1 The latest proof that Google can do no wrong | http://t.co/dSUhwVoO (via @Techland): 40% / 0% / 0% / 60%

2 Teen girls look to YouTube for self-image validation | http://t.co/PSfROdi4 (via @TIMEHealthland): 0% / 0% / 0% / 100%

3 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland): 100% / 0% / 0% / 0%

4 PHOTOS: Amazing Photos of the Sun http://t.co/bmYAtNab via @TIME: 0% / 100% / 0% / 0%

5 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland): 100% / 0% / 0% / 0%

6 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland): 100% / 0% / 0% / 0%

7 Astrophysicist @neiltyson responds to @TIME's q: "What is the most astounding fact about the Universe?" http://t.co/91565khw | beautiful vid: 0% / 0% / 67% / 33%

8 Why libraries need us now more than ever #sxsw | http://t.co/OTbutfup (via @Techland): 100% / 0% / 0% / 0%

9 Living Alone Is The New Norm http://t.co/25BzVSLN (via @TIME) #teamhermit: 0% / 0% / 0% / 100%

10 PHOTOS: Seven days of strange landscapes | http://t.co/oLtxFcp8: 0% / 100% / 0% / 0%

11 Subject for Debate: Are Women People? http://t.co/IRVthFc8 via @TIME: 0% / 0% / 0% / 100%

12 PHOTOS: Seven days of strange landscapes | http://t.co/oLtxFcp8: 0% / 100% / 0% / 0%

13 @Time Israel's bogus case for bombing Gaza obscures political motives | Al Akhbar English http://t.co/Fud0mDNN via @AlakhbarEnglish: 0% / 0% / 100% / 0%

Questions?

Essential R packages for everyday use

Essential

• ggplot2

• reshape

• RODBC

• randomForest

• rpart

Nice to have

• caret

• forecast

• tm
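To give a feel for one of the packages listed, here is a minimal sketch of a classification tree with rpart on the built-in `iris` data. The data set and the in-sample accuracy check are chosen only for illustration and are not taken from the talk.

```r
library(rpart)  # ships with standard R distributions as a recommended package

# Fit a classification tree on the built-in iris data
fit <- rpart(Species ~ ., data = iris, method = "class")

# Predicted class for each row, and a confusion table against the truth
pred <- predict(fit, newdata = iris, type = "class")
confusion <- table(actual = iris$Species, predicted = pred)
confusion

# In-sample accuracy (optimistic: no hold-out set is used here)
accuracy <- sum(diag(confusion)) / nrow(iris)
```

randomForest follows the same formula interface, so swapping models in and out while prototyping takes little extra code.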