Interactive Debugging for Big Data Analytics
Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi, Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim
University of California, Los Angeles
Debugging Big Data Analytics
• Today’s platforms lack debugging support
  – Programs (i.e., queries, jobs) are batch executed / black boxes
  – Errors reflect low-level details (e.g., task id?!) not relevant to the logical bug
  – Long program execution time => long development cycles
• What do programmers do?
  – Trial-and-error debugging on sample data
  – Post-mortem analysis of error logs
  – Analyze the physical view of the execution (a job id, failed node, etc.)
Trying to debug a Spark application on a cluster…
“I would like to understand the flow of control through the Spark source code on the worker nodes when I submit my application… I am assuming I should set up Spark on Eclipse… to enable stepping through Spark source code on the worker nodes.”
After a year, still no good answers!
BigDebug Project Overview
• BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark [ICSE 2016]
  – Simulated Breakpoint
  – On-Demand Watchpoint
  – Crash Culprit Remediation
  – Forward and Backward Tracing
• Titian: Data Provenance for Fine-Grained Tracing [PVLDB 2016]
• Vega: Incremental Computation for Interactive Debugging [Under Review]
Example Query Development Session
• Dataset: NYC Open Data Project
  – Calls to non-emergency service centers
  – Dataset contains call records for 2010-2015
• Record contents: call time, agency, caller location, etc.
• Query: Identify the agencies that received the most calls during busy hours
  – E.g., busy hour if number of calls > 10,000
Spark Program
case class Calls(id: String, hour: Int, agency: String, ...)
val format = new SimpleDateFormat("M/d/y h:m:s a")
val input = sc.textFile("hdfs://...")
val calls = input.map(_.split(","))
  .map(r => Calls(r(0), format.parse(r(1)).getHours, r(2), ...))
calls.registerTempTable("calls")
val hist = sqlContext.sql("""
  SELECT agency, count(*)
  FROM calls JOIN (
    SELECT hour FROM calls
    GROUP BY hour
    HAVING count(*) > 100000) counts
  ON calls.hour = counts.hour
  GROUP BY agency""")
hist.show()
Extract dataset from HDFS
Transform it into a DataFrame (i.e., table)
Load it into Spark SQL
Express Query in Spark SQL
Identify the busy hours, i.e., #calls > 10,000
Join busy hours with calls, then group by agency and count the number of “calls” received by each agency
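The two query steps described above can be mimicked on a handful of in-memory records. This is only a plain-Python sketch of the query logic, not Spark code: the sample records, the function name `busiest_agencies`, and the tiny threshold are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical call records as (hour, agency) pairs; the real dataset has more fields.
calls = [
    (9, "NYPD"), (9, "NYPD"), (9, "DOT"),
    (14, "DOT"),
    (0, "Brooklyn Public Library"), (0, "Brooklyn Public Library"), (0, "NYPD"),
]

def busiest_agencies(calls, threshold):
    """Count calls per agency, restricted to hours with more than `threshold` calls."""
    # Inner subquery: GROUP BY hour HAVING count(*) > threshold
    calls_per_hour = Counter(hour for hour, _ in calls)
    busy_hours = {hour for hour, n in calls_per_hour.items() if n > threshold}
    # Outer query: join calls with the busy hours, then GROUP BY agency
    return Counter(agency for hour, agency in calls if hour in busy_hours)

hist = busiest_agencies(calls, threshold=2)
print(hist.most_common())
```

With this toy data, hours 9 and 0 both qualify as busy, so calls logged at hour 0 contribute to the final counts.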
Debugging Query Results
• Analyst observes some unexpected results
  – Agencies that should not appear
    • e.g., Brooklyn Public Library
  – Expected agencies that should appear
    • e.g., NYPD, NYFD
• Titian support for query triage
  – Analyst can trace back from outlier results to contributing data at some intermediate stage
  – Analyst can execute queries against intermediate data leading to outlier results
Query Triage with Titian
• Intermediate results for subquery
  – Trace back to subquery and show distribution of calls per hour
  – On intermediate data leading to outlier results
SELECT hour, count(*) FROM calls GROUP BY hour
Significant skew in the midnight hour = 0!
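The idea of tracing an outlier result back to its contributing records, and then querying those records, can be sketched in plain Python as follows. This is not Titian's API: the record ids, the `lineage` map, and both function names are hypothetical, and the sketch simply records which inputs fed each output row during aggregation.

```python
from collections import Counter, defaultdict

# Hypothetical input records: record id -> (hour, agency).
records = {
    1: (0, "Brooklyn Public Library"),
    2: (0, "Brooklyn Public Library"),
    3: (9, "NYPD"),
    4: (0, "NYPD"),
    5: (9, "DOT"),
}

def count_by_agency_with_lineage(records):
    """Aggregate calls per agency and remember which input ids fed each output."""
    counts = Counter()
    lineage = defaultdict(set)  # output key -> ids of contributing input records
    for rid, (hour, agency) in records.items():
        counts[agency] += 1
        lineage[agency].add(rid)
    return counts, lineage

def trace_back(lineage, output_key):
    """Return the input record ids that contributed to one output row."""
    return sorted(lineage[output_key])

counts, lineage = count_by_agency_with_lineage(records)
suspect_ids = trace_back(lineage, "Brooklyn Public Library")
# Query the traced records only: distribution of calls per hour.
hours = Counter(records[rid][0] for rid in suspect_ids)
print(suspect_ids, dict(hours))
```

In this toy data every record behind the suspicious agency falls in hour 0, which is the kind of skew the slide points out.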
Identify Bug and Revise the Query
• The Bug
  – System assigns default value hour=0 for calls that did not log a time
• Possible course of action
  – Filter out calls assigned to hour=0
SELECT agency, count(*)
FROM calls JOIN (
  SELECT hour FROM calls
  WHERE hour != 0
  GROUP BY hour
  HAVING count(*) > 100000) counts
ON calls.hour = counts.hour
GROUP BY agency
Introduce predicate that filters out midnight hour
Vega: Re-execute Revised Query
• Vega materializes intermediate stage results
  – i.e., the previous subquery result is saved
• Vega Query Rewriter leverages this to rewrite the query into…
SELECT agency, count(*)
FROM calls JOIN counts
ON calls.hour = counts.hour
WHERE counts.hour != 0
GROUP BY agency
  – counts: materialized result from previous execution
  – WHERE clause: rewritten filter that removes hour 0 from joining records
Vega: Modified Query Evaluation
• Execute an incremental join
  – “Diff” records specify changes in the (join) result
  – For this example, we incrementally remove all records for hour 0 from join and final aggregation results
• Vega Optimizer Results
  – Consequence: over an order-of-magnitude runtime improvement
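A rough way to picture the incremental step is to keep the join result from the first run and retract only the records for hour 0 from the saved aggregate, instead of recomputing everything. The sketch below is a plain-Python illustration under that assumption; it is not Vega's implementation, and `materialize`, `remove_hour`, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def materialize(calls, busy_hours):
    """First run: build the join grouped by hour and the per-agency aggregate."""
    joined = defaultdict(list)  # hour -> agencies of the calls joined on that hour
    for hour, agency in calls:
        if hour in busy_hours:
            joined[hour].append(agency)
    totals = Counter(a for agencies in joined.values() for a in agencies)
    return joined, totals

def remove_hour(joined, totals, hour):
    """Incremental step: retract only the saved join records for `hour`."""
    updated = Counter(totals)
    for agency in joined.get(hour, []):  # these play the role of "diff" records
        updated[agency] -= 1
    return +updated  # unary plus drops agencies whose count fell to zero

calls = [(0, "Brooklyn Public Library"), (0, "NYPD"), (9, "NYPD"), (9, "DOT")]
joined, saved = materialize(calls, busy_hours={0, 9})   # original query
incremental = remove_hour(joined, saved, hour=0)        # revised query, incremental
_, fresh = materialize(calls, busy_hours={9})           # revised query, from scratch
print(incremental == fresh)
```

The incremental path touches only the two hour-0 records, while the from-scratch path rescans every call; on large inputs that difference is where the savings would come from.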
Automated Isolation of Failure-Inducing Inputs for Big Data Analytics
• When a program fails, a user may want to investigate a subset of the original input inducing a crash, a failure, or a wrong outcome.
• Delta Debugging [Zeller 1999]
  – Well-known debugging algorithm for minimizing failure-inducing inputs
  – Requires multiple runs to isolate failure-inducing inputs
Background: Delta Debugging [Zeller, FSE 1999]
First, we run the test to find the failure-inducing input dataset
[Figure: the test on the full input dataset fails]
Second, we split the failing input data
[Figure: the failing input is split; one half passes the test and the other half fails. The failing half is split again, and the process repeats.]
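The split-and-retest loop pictured above can be written down as a short function. This is a simplified sketch in the spirit of delta debugging, not the authors' implementation: the name `ddmin`, the list-of-records input, and the `test_fails` predicate are assumptions for illustration. Besides testing each chunk, it also tests the input with one chunk removed, which helps when the failure needs records from more than one chunk.

```python
def ddmin(inputs, test_fails):
    """Shrink `inputs` to a smaller list on which `test_fails` still returns True."""
    assert test_fails(inputs), "the full input must fail"
    n = 2  # number of chunks the current input is split into
    while len(inputs) >= 2:
        size = len(inputs) // n
        chunks = [inputs[i:i + size] for i in range(0, len(inputs), size)]
        reduced = False
        # Try each chunk on its own (the "split and keep the failing half" step).
        for chunk in chunks:
            if test_fails(chunk):
                inputs, n, reduced = chunk, 2, True
                break
        # Otherwise try removing one chunk at a time.
        if not reduced:
            for i in range(len(chunks)):
                rest = [x for j, c in enumerate(chunks) if j != i for x in c]
                if test_fails(rest):
                    inputs, n, reduced = rest, max(n - 1, 2), True
                    break
        # Nothing helped: split more finely, or stop at single-record granularity.
        if not reduced:
            if n >= len(inputs):
                break
            n = min(len(inputs), n * 2)
    return inputs

records = list(range(1, 9))                       # hypothetical record ids 1..8
single = ddmin(records, lambda s: 6 in s)         # one bad record
pair = ddmin(records, lambda s: 3 in s and 7 in s)  # failure needs two records
print(single, pair)
```

Every call to `test_fails` stands for a full rerun of the job on a subset of the data, which is why the number of runs matters so much at big-data scale.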
Scalable Automated Isolation of Failure-Inducing Inputs
• Leverage data provenance to reduce search space
  – Avoid costly executions on data not relevant to the bug
• Leverage Vega to optimize subsequent runs
[Figure labels: Delta Debugging, Titian]
Conclusion
• BigDebug Project
  – Debugging Primitives for Interactive Big Data Processing in Apache Spark
  – https://sites.google.com/site/sparkbigdebug/
• Titian: Interactive Data Provenance
  – Supports traceback queries from a set of results
  – Execution replay from an intermediate point
• Vega: Optimizing modified query execution
  – Novel query rewrite mechanism that pushes changes backwards to save work
  – Incremental evaluation that operates on data changes induced by query modifications