online model updating with spark streaming
TRANSCRIPT
StreamingArchitectureforPredictiveModeling
KeiraZhouDataEngineer@CapitalOneLabs
#whoami
• DataEngineer@CapitalOneLabs• Previous:• Fellow@InsightDataEngineering• BS+MSinSystemsEngineering@UVA
• Github Repo:• https://github.com/keiraqz/SparkModeling
1
Batchvs.Streaming
• Batch• Runaprocessinascheduledway• Goodforcomputationonhugeamountofhistoricaldata• Complexanalysis,hardtocomputeinrealtime
• Streaming• Continuous• “Lifeasithappens”:processdataasitcomesin
• Trafficinformation• Heartbeatmonitoring• CapitalOne:doubleswipe
2
TaskandDataset
• Task:PredictPhishingwebsite• e.g.Yourpasswordforxwebsiteisexpiring,updateat:http://abc.com/update• PredictifawebsiteURLisaphishingwebsite
• PhishingWebsitesDataSet• FromUCI:https://archive.ics.uci.edu/ml/datasets/Phishing+Websites#• Extractedfeatures+Labeleddatapoints
3
FeatureSelection
• Descriptionprovidedinthedataset• Someexamples• UsingURLShorteningServices“TinyURL”:
• bit.ly/19DXSk4• AgeofDomain:
• minimumageofthelegitimatedomainis6months• AddingPrefixorSuffixSeparatedby(-)totheDomain:
• http://www.confirme-paypal.com/
4
TrainandPredict
Pre-trainedModel OnlineModelUpdating OnlinePrediction
Method1 YES NO YES
Method 2 NO YES YES
Method3 YES YES YES
5
Demo:Method2&Method3Onlineupdatingwithandwithoutapre-trainedmodel
6
Backend
7
TrainonHistoricalData
• Classificationproblem• LogisticRegressionwithStochasticGradientDescent
8
CodeSnippet#1
9
• Spark2.0feature:Savemodeltofile
• Loadmodelfile
Predict&Update• StreamingLogisticRegressionModelwithStochasticGradientDescent
10
BuildtheStreamingPipeline
11
CodeSnippet#2
12
• Kafka->SparkStreaming:SparkStreamingKafkaconnector
• Dependency:spark-streaming-kafka-0-8_2.11-2.0.1.jar• Usetherightversion• Makesureyouaddthejaranditsdependenciestoyourproject
ParallelisminSpark
• ParallelisminDataReceiving• Forexample,splitKafkainputDStream basedontopics• OneKafkaInputstreamforonetopictoreceivedatainparallel
• numStreams =5• kafkaStreams =[KafkaUtils.createStream(...)for _in range(numStreams)]
• ParallelisminDataProcessing• conf =SparkConf().setMaster("local[2]")
• Meaning- 2threads:“minimal”parallelism• SparkCluster:YARN,Mesos,Standalone
13
CodeSnippet#3
14
• SparkStreaming->Redis:PythonRedis connector(pipinstallredis)
Summary
15
DataScience DataEngineering
• Findtherightfeatures• Getlabeleddata
• Manuallabeling
• TuningSpark• #ofdrivers• #ofexecutors• Memory• levelofparallelism
• SparkProgramming• Sparkconnectoror
Pythonconnector• UseMLlib
Questions?
16