![Page 1: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/1.jpg)
Re-Engineering Software Engineering in a Data-Centric World
Miryung Kim
University of California, Los Angeles
![Page 2: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/2.jpg)
Confluence
Interdisciplinary thinking via confluences, George Varghese @ SIGCOMM 2014 Keynote
![Page 3: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/3.jpg)
Confluence: Interdisciplinary Thinking
Inflection Point
![Page 4: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/4.jpg)
Confluence: Impressionism
Inflection Point
![Page 5: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/5.jpg)
Confluence: Data Analytics and SE
Inflection Point
AI, Big Data, ML
![Page 6: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/6.jpg)
Takeaway Message: A Case for Software Engineering for Data Analytics (SE4DA)
Bug finding is a huge problem in data analytics.
SE4DA is underserved; somehow people have gravitated to applying data analytics to SE.
SE4DA requires re-thinking software engineering techniques.
![Page 7: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/7.jpg)
There is a huge opportunity for data analytics.
![Page 8: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/8.jpg)
Data analytics are in high demand, yet…
![Page 9: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/9.jpg)
Bugs are huge problems in data analytics.
Data analytics used by thousands of scientists produce misleading or wrong results. [BBC News]
The widespread harm ranges from a wrong medical diagnosis to incorrect interpretation of stock history. [Dataversity]
Predictably inaccurate: The prevalence and perils of bad big data. [Deloitte]
![Page 10: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/10.jpg)
Growth of Data Analytics Papers in SE
Data Analytics (AI, Big Data, ML) Growth in ASE Papers
Year            2016  2017  2018  2019
Data Analytics    21    22    28    47
Rest              38    50    40    39
![Page 11: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/11.jpg)
SE4DA is under-investigated. (SE4DA: 13, DA4SE: 105)
SE4DA (4%): Improving SE for data analytics
DA4SE (37%): Applying data analytics to SE
Rest (59%)
![Page 12: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/12.jpg)
Outline: Making a Case for Software Engineering for Data Analytics (SE4DA)
① Studies: Data Scientists. Shift to data-centric SW development
② Differences between traditional SW vs. data-centric SW dev process
③ Tools: Debugging & testing for big data analytics
④ Open problems in SE4DA
![Page 13: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/13.jpg)
Part 1. Data Scientists in Software Teams: State of the Art and Challenges
Miryung Kim, Thomas Zimmermann, Rob DeLine, Andrew Begel
![Page 14: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/14.jpg)
The Emerging Roles of Data Scientists on Software Teams
We are at a tipping point where there are large-scale telemetry, machine, quality, and user data.
Data scientists are emerging roles in SW teams.
To understand working styles and challenges, we conducted the first in-depth interview study and the largest-scale survey of professional data scientists.
① Data Scientists  ② Difference  ③ Tools  ④ Challenges
![Page 15: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/15.jpg)
Methodology for Studying “Data Scientists”
In-Depth Interviews [ICSE’16]:
• 5 women and 11 men from eight different Microsoft organizations
Survey [TSE 2018]:
• 793 responses
• demographics / self-perception
• skills and tool usage
• working styles
• time spent
• challenges and best practices
Fields shown: Computer Science, Physics, Math, BioInformatics, Statistics, Economics, Finance, CogSci, ML
![Page 16: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/16.jpg)
Time Spent on Activities
Hours spent on certain activities (self-reported, survey, N=532)
![Page 17: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/17.jpg)
What is a “Data Scientist”? 9 Distinct Categories
Clustering 532 data scientists at Microsoft based on relative time spent in activities
…
![Page 18: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/18.jpg)
Category 1: Data Shaper
Analyzing and preparing data
Post-graduate degrees
Algorithms, machine learning, and optimizations
Less familiar with front-end programming
![Page 19: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/19.jpg)
Category 2: Platform Builder
Instrument code to collect data
Big data and distributed systems
Back-end and front-end programming
SQL, C, C++ and C#
![Page 20: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/20.jpg)
Category 3: Data Analyzer
Familiar with statistics
Not familiar with front-end programming
Difficulty with data transformation
RStudio or statistical analysis
![Page 21: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/21.jpg)
Common challenges: Data scientists find it difficult to ensure “correctness”
Validation is a major challenge. “Honestly, we don’t have a good method for this.” “Just because the math is right, doesn’t mean that the answer is right.”
Explainability is important: “to gain insights, you must go one level deeper.”
![Page 22: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/22.jpg)
![Page 23: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/23.jpg)
Part 2. How is Traditional Development Different from Big Data Analytics Development?
[Interactions’12] [ICSE-SEIP’19] [NIPS’15] [TSE’19] [ICSE’16] [TSE’18]
![Page 24: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/24.jpg)
Traditional vs. Big Data Analytics Development
Traditional development:
1 Develop
2 Run
3 Test
4 Debug
5 Repeat
Big data analytics development:
1 Develop locally
2 Test locally with sample data
3 Execute the job on the cloud, hoping that it would work
4 Several hours later, the job crashes or produces wrong output
5 Repeat
![Page 25: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/25.jpg)
Traditional vs. Big Data Analytics Development
Step 1: Develop locally
Step 2: Test with sample
Difference 1: Data is huge, remote, and distributed.
![Page 26: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/26.jpg)
Traditional vs. Big Data Analytics Development
Step 2: Test with sample
Difference 2: Writing tests is hard. We don’t even know the full input and don’t know the expected output.
Step 4: The job crashes or produces wrong output
Difference 3: Failures are hard to define.
![Page 27: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/27.jpg)
Traditional vs. Big Data Analytics Development
Step 3: Execute the job on the cloud
Difference 4: The system stack is complex, with little visibility.
[Figure: pipeline with Filter, Map, and Reduce operators]
![Page 28: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/28.jpg)
Traditional vs. Big Data Analytics Development
Step 3: Execute the job on the cloud
Difference 5: Gap between logical vs. physical execution
[Figure: dataflow over the Trips and Zipcode inputs with operators Map, Map, Join (⨝), Map, ReduceByKey, and Filter]
![Page 29: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/29.jpg)
Traditional vs. Big Data Analytics Development
Step 3: Execute the job on the cloud
Step 4: The job crashes or produces wrong output
Step 5: Repeat
Difference 6: Data tracing is hard.
Task 31 failed 3 times; aborting job
ERROR Executor: Exception in task 31 in stage 0 (TID 31)
java.lang.NumberFormatException
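To illustrate why data tracing is hard, here is a minimal single-machine sketch in plain Python (not Spark). A format error tells you that parsing failed but not which record caused it, so a traced variant keeps the offending records. The function names and the `name,age` record layout are hypothetical, chosen only for illustration.

```python
def parse_ages(lines):
    """Parse each 'name,age' line; aborts on the first malformed age."""
    return [int(line.split(",")[1]) for line in lines]


def parse_ages_traced(lines):
    """Same parse, but collect the offending records instead of aborting."""
    good, bad = [], []
    for lineno, line in enumerate(lines):
        try:
            good.append(int(line.split(",")[1]))
        except (ValueError, IndexError):
            # Remember where the bad record came from, not just that it failed.
            bad.append((lineno, line))
    return good, bad


records = ["ann,31", "bob,27", "cat,n/a", "dan,45"]
ages, faulty = parse_ages_traced(records)
print(ages)    # parsed values from well-formed records
print(faulty)  # (line number, raw record) pairs that failed to parse
```

On a cluster the same idea needs runtime support, because the failing record lives in some remote partition rather than in a local list.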
![Page 30: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/30.jpg)
![Page 31: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/31.jpg)
Part 3. Debugging and Testing for Big Data Analytics
Tyson Condie, Ari Ekmekji, Muhammad Ali Gulzar, Miryung Kim, Matteo Interlandi, Shaghayegh Mardani, Todd Millstein, Madanlal Musuvathi, Kshitij Shah, Sai Deep Tetali, Seunghyun Yoo
![Page 32: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/32.jpg)
Insights from Debugging and Testing for Apache Spark
• Designing interactive debug primitives requires deep understanding of the internal execution model, job scheduling, and materialization.
• Providing traceability requires modifying a runtime.
• Abstraction is a powerful force in simplifying program paths.
![Page 33: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/33.jpg)
Enabling interactive debugging requires us to re-think a traditional debugger
• Pausing the entire computation on the cluster could reduce throughput.
• It is clearly infeasible for a user to inspect billions of records through a regular watchpoint.
![Page 34: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/34.jpg)
BigDebug: Interactive Debug Primitives for Big Data Analytics [ICSE 2016]
[Figure: program DAG of Filter, Map, and Reduce operators split into Stage 1 and Stage 2 over stored data records, with the example predicate age < 0]
① Simulated Breakpoint
② On-Demand Watchpoint
③ Realtime Repair
④ Backward Tracing
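The watchpoint idea can be sketched in plain Python, under the assumption that a guard predicate selects which records to copy out while the computation keeps running. This is not BigDebug's API; `watched_map`, its parameters, and the sample records are hypothetical.

```python
def watched_map(records, fn, guard, captured, limit=10):
    """Apply fn to every record; copy at most `limit` records matching guard.

    The pipeline is never paused: matching records are only copied aside
    so a user can inspect a handful of them instead of the whole dataset.
    """
    for rec in records:
        if guard(rec) and len(captured) < limit:
            captured.append(rec)
        yield fn(rec)


people = [
    {"name": "ann", "age": 31},
    {"name": "bob", "age": -4},   # suspicious record
    {"name": "cat", "age": 52},
]
captured = []
decades = list(
    watched_map(people, lambda p: p["age"] // 10, lambda p: p["age"] < 0, captured)
)
print(decades)   # output for every record, including the suspicious one
print(captured)  # only the records that satisfied the guard
```

Capping the number of captured records reflects the point that a user cannot inspect billions of records through a regular watchpoint.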
![Page 35: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/35.jpg)
Titian: Data Provenance for Apache Spark [VLDB 2016]
[Figure: program DAG (Filter, Map, Map, Reduce, Map, Map across Stage 1 and Stage 2) with lineage tables on Worker 1, Worker 2, and Worker 3 combined by joins (⨝)]
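As a rough illustration of data provenance, the sketch below records, for each output key of a word count, which input lines contributed to it, and then traces an output back to its inputs. It is a single-process toy, not Titian's implementation; the `lineage` dictionary stands in for a lineage table and all names are hypothetical.

```python
from collections import defaultdict


def run_with_lineage(lines):
    """Word count that also records which input line ids fed each output key."""
    counts = defaultdict(int)
    lineage = defaultdict(set)  # output key -> ids of contributing input lines
    for input_id, line in enumerate(lines):
        for word in line.split():
            counts[word] += 1
            lineage[word].add(input_id)
    return dict(counts), dict(lineage)


def trace_back(lineage, lines, key):
    """Return the input lines that contributed to the output for `key`."""
    return [lines[i] for i in sorted(lineage.get(key, ()))]


lines = ["spark job failed", "job finished", "spark retry"]
counts, lineage = run_with_lineage(lines)
print(counts["spark"])
print(trace_back(lineage, lines, "spark"))
```

In a distributed setting the lineage is spread across workers, which is why tracing backward involves joining per-worker tables rather than a single dictionary lookup.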
![Page 36: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/36.jpg)
BigSift: Automated Debugging of Big Data Analytics [SoCC 2017]
Input: a program, a test function. Output: faulty records
Titian Data Provenance + Delta Debugging
• Test Predicate Pushdown
• Prioritizing Backward Traces
• Bitmap-based Memoization
[Figure: lineage tables on Worker 1, Worker 2, and Worker 3 combined by joins (⨝), then narrowed by delta debugging]
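The delta debugging half of this combination can be sketched as follows. This is a simplified ddmin over a list of input records, assuming a user-supplied test function that returns True when the output is wrong. It omits the provenance-based pruning and the optimizations listed above; `ddmin` and `test_fails` are illustrative names.

```python
def ddmin(records, test_fails):
    """Shrink `records` to a smaller subset on which test_fails still holds."""
    assert test_fails(records), "the full input must fail the test"
    n = 2  # current number of chunks
    while len(records) >= 2:
        chunk = max(1, len(records) // n)
        subsets = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            if test_fails(subset):
                # The failure is reproduced by this chunk alone.
                records, n, reduced = subset, 2, True
                break
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if len(subsets) > 2 and test_fails(complement):
                # The failure survives after dropping this chunk.
                records, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(records):
                break  # already at single-record granularity
            n = min(len(records), n * 2)
    return records


ages = [31, 27, -4, 45, 52, 18, 60, 39]
# Test function: the job's minimum "decade" must not be negative.
faulty = ddmin(ages, lambda recs: min(a // 10 for a in recs) < 0)
print(faulty)
```

Each call to `test_fails` corresponds to re-running the job on a subset, which is what makes plain delta debugging expensive on large inputs.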
![Page 37: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/37.jpg)
Results on Debugging of Big Data Analytics
• BigDebug enables interactive debugging and repair while retaining the scale-up property. It poses at most 34% overhead [ICSE 2016].
• Titian’s data provenance is orders of magnitude faster than alternatives [VLDB 2016].
• BigSift automatically finds bugs 66X faster than delta debugging. It takes 62% less time to debug than the original job’s run [SoCC 2017].
![Page 38: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/38.jpg)
Why is Testing Big Data Analytics Challenging?
Option 1: Sample Data
• random sampling
• top n sampling
• top k% sampling, etc.
Limitations:
• Low code coverage
• Or increased local testing time
Option 2: Traditional Testing
• 700 KLOC for Apache Spark
Limitations:
• Symbolic execution without abstraction would not scale.
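A small sketch of the low-coverage limitation of sampling: a toy user-defined function has a branch that only one record in the dataset exercises, so a top-n sample never reaches it. The function `classify`, the data, and the sample sizes are all made up for illustration.

```python
import random


def classify(amount):
    """Toy UDF with a rarely taken branch."""
    if amount < 0:
        return "refund"  # rare branch
    if amount > 1000:
        return "large"
    return "normal"


def branches_covered(sample):
    """Set of branch labels exercised by running classify over the sample."""
    return {classify(a) for a in sample}


rng = random.Random(0)
# 10,000 ordinary records followed by a single rare record at the end.
data = [rng.randint(1, 2000) for _ in range(10_000)] + [-5]

top_n = data[:100]                     # top n sampling
random_sample = rng.sample(data, 100)  # random sampling

print(branches_covered(data))
print(branches_covered(top_n))
print(branches_covered(random_sample))
```

Raising the sample size improves the odds of hitting the rare branch, at the cost of the increased local testing time noted above.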
![Page 39: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/39.jpg)
BigTest: White-Box Testing of Big Data Analytics [ESEC/FSE 2019]
Abstract: 700 KLOC Spark as a relational skeleton with logical specifications
JOIN: ∃ tR, tL: cR ∈ CR ∧ cL ∈ CL ∧ cR(tR) ∧ tR.key = tL.key ∧ cL(tL)
Extract: user-defined functions via symbolic execution
Path Constraint / Effect: T.split(",").length ≥ 1 ∧ … ∧ V2 = "ERROR" …
Model: string operations as string constraints
Z.split(",")[1] = "Palms" ∧ Z.split(",").length > 1 ∧ T.split(",")[1] = Z.split(",")[0] ∧ T.split(",").length > 1 ∧ …
Test data shown: "\x00", "Palms"
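To make the string constraints concrete, the sketch below checks which candidate input pairs (T, Z) satisfy the four conjuncts shown on the slide. It is only a checker over hand-written candidates, not the symbolic execution that produces such constraints; the candidate rows and the function name are hypothetical.

```python
def satisfies_join_path(t, z):
    """Check the slide's string constraints for one (T, Z) input pair."""
    t_cols, z_cols = t.split(","), z.split(",")
    return (
        len(z_cols) > 1
        and z_cols[1] == "Palms"       # Z.split(",")[1] = "Palms"
        and len(t_cols) > 1
        and t_cols[1] == z_cols[0]     # T.split(",")[1] = Z.split(",")[0]
    )


candidates = [
    ("t1,90034", "90034,Palms"),   # keys match, name matches
    ("t2,90210", "90034,Palms"),   # keys differ
    ("t3", "90034,Palms"),         # T has too few columns
    ("t4,\x00", "\x00,Palms"),     # minimal key, similar to the slide's test data
]
passing = [pair for pair in candidates if satisfies_join_path(*pair)]
print(passing)
```

A test generator would need one such input per feasible path, so a handful of small rows can stand in for the full dataset.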
![Page 40: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/40.jpg)
Test Size Reduction
[Figure: bar chart (log scale, 1E+00 to 1E+10) of test dataset size in number of rows, BigTest vs. entire dataset, for IncomeAggregate, MovieRatings, AirportLayover, CommuteType, PigMixL2, GradeAnalysis, and WordCount. Entire-dataset sizes shown range from 5.21E+05 to 4.00E+09 rows.]
BigTest reduces tests by 10^5X to 10^8X, achieving a 194X testing speedup.
![Page 41: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/41.jpg)
![Page 42: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/42.jpg)
Part 4. Roadmap for Accelerating Data-Centric Development
[Figure: timeline with the years 2004, 2008, 2014, 2019, 2022, and 2025, labeled DA4SE and SE4DA]
![Page 43: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/43.jpg)
Insight 1: Debugging data analytics requires both data and code analysis.
How to define a bug based on the properties of both data and code?
How to repair both code and data errors?
Data X-Ray [SIGMOD’15]
Data Wrangling [CHI’11]
Program Repair [ICSE’09] [ICSE’13], etc.
Data Cleaning [VLDB’01] [VLDB’15] [SIGMOD’15] [SIGMOD’10]
Data Repair [VLDB’11] [SIGMOD’14]
Bug Patterns [SIGPLAN 2004], etc.
![Page 44: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/44.jpg)
Insight 2: Performance debugging is a pain point.
Manual inspection of top 200 Spark-related posts from Stack Overflow
[Figure: pie chart with categories Performance, Comprehension, Installation and Environment Setting, API Usage, and Correctness; shares shown: 39.5%, 28.5%, 6.5%, 13.0%, 12.5%]
[Figure: pie chart with categories Comprehension-related issue, Configuration Tuning, Performance Scaling, Inefficient operator, Unbalanced task, IO-related issue, and Memory-related issue; shares shown: 7.6%, 16.5%, 16.5%, 21.5%, 25.3%, 5.1%, 7.6%]
![Page 45: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/45.jpg)
Insight 2: Performance debugging requires visibility of system stack, code, and data.
[Figure: system stack with Dev Environment, ML/AI Lib, Runtime, Containers, JVM, Storage, CPU, GPU, and FPGA]
How to estimate performance based on data size?
How to optimize query performance using a cost model?
How to debug computation and data skews?
How to identify the cause of bottlenecks?
Ernest [NSDI’16]
Neo [VLDB’16]
PerfDebug [SoCC’19]
Skewtune [SIGMOD’12]
Causal Profiling [SOSP’15]
Causal Monitoring [SOSP’15]
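For the data skew question, a first diagnostic is to compare the largest partition against the average one. The sketch below assumes integer keys assigned to partitions by `key % num_partitions`; that partitioner, the key distribution, and the function names are assumptions made for illustration, not taken from any of the cited systems.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under modulo partitioning."""
    sizes = Counter(k % num_partitions for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# 900 records share one hot key; 100 records have distinct keys 1..100.
keys = [0] * 900 + list(range(1, 101))
sizes = partition_sizes(keys, 4)
print(sizes)
print(skew_ratio(sizes))
```

A ratio well above 1.0 suggests that one task does most of the work, which matches the "Unbalanced task" category among the performance issues.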
![Page 46: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/46.jpg)
Insight 3: We must relax the strict notion of an incorrect behavior and the root cause.
How to specify oracles for data-centric software? Metamorphic relations are simple or hard to define.
How to quantify importance when debugging faulty inputs for data analytics?
DeepTest [ICSE 2018]
DeepConcolic [ASE 2018]
DeepHunter [ISSTA 2019]
Metamorphic Testing [1998]
Lamp [ESEC/FSE 2017]
MODE [ESEC/FSE’18]
Influence Function [ICML’17]
Training Set Debugging [AAAI’18]
LIME [KDD’16]
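When no expected output is known, a metamorphic relation can serve as a partial oracle. The sketch below checks two simple relations for a word count: shuffling the input must not change the result, and the count of a concatenated input must equal the sum of the counts of its parts. The relations and helper names are illustrative choices, not taken from the cited papers.

```python
import random
from collections import Counter


def word_count(lines):
    """Reference program: count words across all lines."""
    return Counter(word for line in lines for word in line.split())


def check_permutation_relation(program, lines, seed=0):
    """Relation 1: reordering the input must not change the output."""
    shuffled = lines[:]
    random.Random(seed).shuffle(shuffled)
    return program(lines) == program(shuffled)


def check_union_relation(program, a, b):
    """Relation 2: program(a + b) must equal program(a) + program(b)."""
    return program(a + b) == program(a) + program(b)


def buggy_count(lines):
    """Hypothetical faulty program that only reads the first line."""
    return Counter(lines[0].split())


a = ["spark job failed", "job finished"]
b = ["spark retry"]
print(check_permutation_relation(word_count, a + b))
print(check_union_relation(word_count, a, b))
print(check_union_relation(buggy_count, a, b))
```

These relations are easy to state for an aggregation like word count; for a learned model or a multi-stage pipeline, finding relations that are both valid and discriminating is much harder.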
![Page 47: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/47.jpg)
Conclusion: Hope for Software Engineering for Data Analytics (SE4DA)
We are at an inflection point. SE4DA is underserved.
Progress has been made in SE4DA by re-thinking software engineering for big data analytics.
We can work together on open problems in SE4DA.
![Page 48: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/48.jpg)
SE4DA: AI, Big Data, and ML need awesome SE tools
• Debugging
• Intelligent sampling and testing
• Root cause analysis
• Data cleaning
• Performance analytics
• Code analytics
Diagnose, Fix, Optimize
![Page 49: Re-Engineering Software Engineering ina Data-Centric World · Insight 2: Performance debugging is a pain point. Manual inspection of top 200 Spark related posts from Stack Overflow](https://reader034.vdocuments.us/reader034/viewer/2022042223/5ec9955f6cfa76645f1e469a/html5/thumbnails/49.jpg)
Questions?