online model updating with spark streaming

17

Click here to load reader

Upload: keira-zhou

Post on 16-Apr-2017

199 views

Category:

Technology


4 download

TRANSCRIPT

Page 1: Online Model Updating with Spark Streaming

StreamingArchitectureforPredictiveModeling

KeiraZhouDataEngineer@CapitalOneLabs

Page 2: Online Model Updating with Spark Streaming

#whoami

• DataEngineer@CapitalOneLabs• Previous:• Fellow@InsightDataEngineering• BS+MSinSystemsEngineering@UVA

• Github Repo:• https://github.com/keiraqz/SparkModeling

1

Page 3: Online Model Updating with Spark Streaming

Batchvs.Streaming

• Batch• Runaprocessinascheduledway• Goodforcomputationonhugeamountofhistoricaldata• Complexanalysis,hardtocomputeinrealtime

• Streaming• Continuous• “Lifeasithappens”:processdataasitcomesin

• Trafficinformation• Heartbeatmonitoring• CapitalOne:doubleswipe

2

Page 4: Online Model Updating with Spark Streaming

TaskandDataset

• Task:PredictPhishingwebsite• e.g.Yourpasswordforxwebsiteisexpiring,updateat:http://abc.com/update• PredictifawebsiteURLisaphishingwebsite

• PhishingWebsitesDataSet• FromUCI:https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#• Extractedfeatures+Labeleddatapoints

3

Page 5: Online Model Updating with Spark Streaming

FeatureSelection

• Descriptionprovidedinthedataset• Someexamples• UsingURLShorteningServices“TinyURL”:

• bit.ly/19DXSk4• AgeofDomain:

• minimumageofthelegitimatedomainis6months• AddingPrefixorSuffixSeparatedby(-)totheDomain:

• http://www.confirme-paypal.com/

4

Page 6: Online Model Updating with Spark Streaming

TrainandPredict

Pre-trainedModel OnlineModelUpdating OnlinePrediction

Method1 YES NO YES

Method 2 NO YES YES

Method3 YES YES YES

5

Page 7: Online Model Updating with Spark Streaming

Demo:Method2&Method3Onlineupdatingwithandwithoutapre-trainedmodel

6

Page 8: Online Model Updating with Spark Streaming

Backend

7

Page 9: Online Model Updating with Spark Streaming

TrainonHistoricalData

• Classificationproblem• LogisticRegressionwithStochasticGradientDescent

8

Page 10: Online Model Updating with Spark Streaming

CodeSnippet#1

9

• Spark2.0feature:Savemodeltofile

• Loadmodelfile

Page 11: Online Model Updating with Spark Streaming

Predict&Update• StreamingLogisticRegressionModelwithStochasticGradientDescent

10

Page 12: Online Model Updating with Spark Streaming

BuildtheStreamingPipeline

11

Page 13: Online Model Updating with Spark Streaming

CodeSnippet#2

12

• Kafka->SparkStreaming:SparkStreamingKafkaconnector

• Dependency:spark-streaming-kafka-0-8_2.11-2.0.1.jar• Usetherightversion• Makesureyouaddthejaranditsdependenciestoyourproject

Page 14: Online Model Updating with Spark Streaming

ParallelisminSpark

• ParallelisminDataReceiving• Forexample,splitKafkainputDStream basedontopics• OneKafkaInputstreamforonetopictoreceivedatainparallel

• numStreams =5• kafkaStreams =[KafkaUtils.createStream(...)for _in range(numStreams)]

• ParallelisminDataProcessing• conf =SparkConf().setMaster("local[2]")

• Meaning- 2threads:“minimal”parallelism• SparkCluster:YARN,Mesos,Standalone

13

Page 15: Online Model Updating with Spark Streaming

CodeSnippet#3

14

• SparkStreaming->Redis:PythonRedis connector(pipinstallredis)

Page 16: Online Model Updating with Spark Streaming

Summary

15

DataScience DataEngineering

• Findtherightfeatures• Getlabeleddata

• Manuallabeling

• TuningSpark• #ofdrivers• #ofexecutors• Memory• levelofparallelism

• SparkProgramming• Sparkconnectoror

Pythonconnector• UseMLlib

Page 17: Online Model Updating with Spark Streaming

Questions?

16