hadoop for data science: moving from bi dashboards to r models, using hive streaming

22
Hadoop for Hadoop for D Si @BT Data Science @BT

Upload: huguk

Post on 20-Aug-2015

850 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Hadoop forHadoop forD S i @BTData Science @BT

Page 2: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Big Data is transforming the economics of data processing 

In 1980 storing and querying a year’s worth of twitter data (if it existed) 

would have required a spend of 75%of the Apollo program.

In 2014 it costs approximately theIn 2014 it costs approximately the same as a small used car (£6k).

Page 3: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

First operational Hadoop cluster at Tinsley Park

Rest of BT Openreach• 2 x 1.2 PByte Hadoop 

cluster

• Logical separation between Openreachand rest of BT

• Based on learning from Research Hadoop cluster

© British Telecommunications plc

3

Researchclusteractivity

Page 4: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Research cluster  usage

© British Telecommunications plc

4

Page 5: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

What are we doing with Hadoop

Data science to understand our networks and physical assets

• Fleet (vehicle maintenance / predictive parts ordering)

• Buildings (energy consumption, security)

• Networks (inventory, faults…)

'ETL'

• Ingest, transformation…

li• Data quality

StStorage

• Low cost, computational file store

© British Telecommunications plc

5

Page 6: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Predictive ModellingPredictive Modellingin Hadoop

© British Telecommunications plc

6

Page 7: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

From model development to production

Source data

build model101011101010101001000001110101111101111110100101110000100011111001000011010101010000101011111010101101010101010101010110101010

evaluate model

production implementation

In productionmodify model pexecute/monitor

© British Telecommunications plc

7

Page 8: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

In‐Hadoop model scoring(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production A Zolotovitski Y Keselman)(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production, A. Zolotovitski, Y. Keselman)

© British Telecommunications plc

8

Page 9: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Example of building a predictive model

Example adapted from A Handbook of Statistical Analyses using R, 2nd ed.

• Goal ‐ Predict body fat content from common anthropometric measurementsGoal  Predict body fat content from common anthropometric measurements

• Reference measurement made using Dual Energy X‐ray Absorptiometry (DXA)– This is very accurate– This is very accurate– Isn’t practical for general use due to high costs and methodological effort

• Therefore useful to have a simple method to predict DXA measurement of body fatTherefore useful to have a simple method to predict DXA measurement of body fat

age DEXfat waistcirc hipcirc elbowbreadth kneebreadth

57 41.68 100.0 112.0 7.1 9.4

65 43.29 99.5 116.5 6.5 8.9

59 35.41 96.0 108.5 6.2 8.9

58 22.79 72.0 96.5 6.1 9.2

60 36.42 89.5 100.5 7.1 10.0

© British Telecommunications plc

9

… … … … … …

Page 10: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Build a predictive model in R

library(rpart)lib ( t kit)library(partykit)

# get the bodyfat data from the mboost packagedata("bodyfat", package="mboost")

# build the modelbodyfat_rpart = rpart(DEXfat ~ age + waistcirc + hipcirc +

elbowbreadth + kneebreadth, data=bodyfat,control = rpart.control(minsplit=10))

# plot initial regression treeplot(as party(bodyfat rpart) tp args=list(id=FALSE))plot(as.party(bodyfat_rpart), tp_args=list(id=FALSE))

# save the modelmodel = bodyfat_rpartsave(model, file = './model.Rdata')

# save the test datadrops=c("DEXfat", "anthro3a", "anthro3b", "anthro3c", "anthro4")bodyfat.testdata = bodyfat[,!(names(bodyfat) %in% drops)]write.table(bodyfat.testdata, "./testdata.csv", sep=",", row.names=F, col.names=F)

© British Telecommunications plc

10

Page 11: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

5‐parameter regression tree

© British Telecommunications plc

11

Page 12: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

In‐Hadoop model scoring(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production A Zolotovitski Y Keselman)(Large‐Scale Predictive Modelling with R and Apache Hive: from Modeling to Production, A. Zolotovitski, Y. Keselman)

© British Telecommunications plc

12

Page 13: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Model execution in Hive

• Use Hive streaming– TRANSFORM() allows Hive to pipe data through a user provided script

– We will pipe data through ‘scorer.R’, the wrapper code around our regression tree model  

hive>

ADD FILE ./model.RdataADD FILE / R

Make the model file and scoring script ADD FILE ./scorer.R

INSERT OVERWRITE TABLE default.modelscoreSELECT TRANSFORM(

t.age,t.waistcirc,

The results of the query will go into this table in Hive

and scoring script available to Hive

,t.hipcirc,t.elbowbreadth,t.kneebreadth) USING'scorer.R' AS age, score

FROMdefault modelinput t;

Model parameters are passed to scorer.R

Schema of the results

default.modelinput t;

The source data  Hive will stream the 

© British Telecommunications plc

14

comes from this table in Hive

source data through this script

Page 14: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Model execution in Hive

Research clusterHP BL460 blades2 master nodes

Ti t

2 master nodes6 worker nodes

Time toscoredata (s) Half a billion 

records scored inrecords scored in under 4 minutes

Number of records (millions)

183 Mbyte 8.9 GByte

© British Telecommunications plc

15

Page 15: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Persisting ModelsPersisting Modelsin Hadoop

© British Telecommunications plc

16

Page 16: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Persisting Models in Hadoop

Serialize R model, store in DB, fetch at run time via RJDBC• Specify model(s) to fetch with arguments to wrapper script

• Deserialize models once ‐minimal overhead

• Model(s) applied to all input records

Serialize R models, store and stream via Hive• Pre‐join input records with model in the limit – a different model per recordPre join input records with model  in the limit  a different model per record

• Stream records/models via Hive to wrapper script

• Cache models in wrapper scriptpp p

© British Telecommunications plc

17

Page 17: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Persisting and stream models via Hive

© British Telecommunications plc

18

Page 18: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Results…ongoing

Difficult to schedulescheduleTesting…

~10 minutes to score:

10 million records

1 cached model – regression tree, 5 parameters, approx 12kBytes as JSON

© British Telecommunications plc

21

Page 19: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Extracting maximum value from Big Data

PlumbingD i tt d t i tDesign patterns: data ingest, incremental processing, file formats, compression, new compute engines…p g

A l t t t t bl i i htA place to get trustable insightNetwork/service observatory, data provenance & governance, collaborative analysis…y

Enabling data science ‘at scale’In‐Hadoopmodelling, in‐life A‐B trials

© British Telecommunications plc

22

trials…

Page 20: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Bits and pieces….

R Markdown• Capture analytic workflow in a reproducible way

• Publish documents to interested community

R packagesR packages• 'Hive' – simple wrapper around RJDBC to facilitate access to Hive

• TODO: packages to support model creation and persistenceTODO: packages to support model creation and persistence

© British Telecommunications plc

23

Page 21: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Observatory

Data lifecycle: Analysis:Data lifecycle:Tools to support discovery, provenance, comprehension

Analysis:Capture analytical workflow –reproduce, test, validatecomprehension, 

usagevalidate…

Observatory:a place where we can observe and understand the health of our systems

Optimise: Models:

© British Telecommunications plc

24

Opt seOptimise  storage of data assets

Models:PMML

Page 22: Hadoop for Data Science: Moving from BI dashboards to R models, using Hive streaming

Q i ?Questions?