Big Data Analytics: The Apache Spark Approach · 2017-08-12
Big Data Analytics: The Apache Spark Approach
Michael Franklin
ATPESC
August 2017

Nearly every field of endeavor is transitioning from “data poor” to “data rich”
Astronomy: LSST
Physics: LHC
Oceanography
Sociology: The Web
Biology: Sequencing
Economics: mobile, POS terminals
Neuroscience: EEG, fMRI
Data-Driven Medicine
Sports

The Fourth Paradigm of Science
1. Empirical + experimental
2. Theoretical
3. Computational
4. Data-Intensive

Open Source Ecosystem & Context
2006-2010 Autonomic Computing & Cloud
UC BERKELEY
2011-2016 Big Data Analytics
Usenix HotCloud Workshop 2010

AMPLab Project Vision: “Making Sense of Data at Scale”
Algorithms
• Machine Learning, Statistical Methods
• Prediction, Business Intelligence
Machines
• Clusters and Clouds
• Warehouse Scale Computing
People
• Crowdsourcing, Human Computation
• Data Scientists, Analysts

Berkeley Data Analytics Stack
In House Applications – Genomics, IoT, Energy, Cosmology
Access and Interfaces
Processing Engines
Resource Virtualization
Storage

Some AMPLab numbers
• Funding – roughly 50/50 Govt/Industry Split
  – NSF CISE Expeditions, DARPA, DOE, DHS
  – Google, SAP, Amazon, IBM (Founding Sponsors) + dozens more
• Nearly 2M visits to amplab.cs.berkeley.edu
• 200+ Papers in Sys, ML, DB, …; 3 ACM Dissertation Awards (1 + 2 HM); Numerous Best Paper and Best Demo Awards
• 40+ Ph.D.s granted (so far); Alumni on faculty at Berkeley, Harvey Mudd, Michigan, MIT, Stanford, Texas, Wisconsin, …
• 3 Spinout companies directly from AMPLab:
  – Databricks, Mesosphere, Alluxio
  – Nearly $250M raised to date
• Many industrial products & services based on or using Spark
• 3 Marriages (and numerous long-term relationships)

Apache Spark Meetups (August 2017)
618 groups with 391,371 members
spark.meetup.com

We Hit A Data Management Inflection Point
• Massively scalable processing and storage
• Pay-as-you-go processing and storage (a.k.a. the cloud)
• Flexible schema on read vs. schema on write
• Integration of search, query and analysis
• Sophisticated machine learning/prediction
• Human-in-the-loop analytics
• Open source ecosystem driving innovation

BDAS Unification Strategy
• Specializing MapReduce leads to stovepiped systems
• Instead, generalize MapReduce:
1. Richer Programming Model → Fewer Systems to Master
2. Data Sharing → Less Data Movement leads to Better Performance
Spark showed 10x performance improvement on existing HDFS data with no migration.
[Figure: stack components Spark Streaming, GraphX, …, SparkSQL, MLbase]

Abstraction: Dataflow Operators
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• ...
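Several of these operator names also exist on ordinary Scala collections, so their behavior can be tried locally. The following is a sketch on plain collections, not Spark code: the `DataflowOps` object and its sample data are made up for illustration, and `reduceByKey` is emulated with `groupBy` plus `reduce` because the standard library has no method of that name.

```scala
object DataflowOps {
  // Hypothetical sample data: (word, count) pairs.
  val pairs: Seq[(String, Int)] =
    Seq(("spark", 1), ("hadoop", 1), ("spark", 1), ("mesos", 1))

  // filter + map: keep one key and project the value.
  val sparkCounts: Seq[Int] = pairs.filter(_._1 == "spark").map(_._2)

  // reduce: combine all values with an associative function.
  val total: Int = pairs.map(_._2).reduce(_ + _)

  // Emulated reduceByKey: group by key, then reduce each group's values.
  def reduceByKey[K, V](xs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    xs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  val wordCounts: Map[String, Int] = reduceByKey(pairs)(_ + _)

  // union, sort, take: concatenate two collections and keep the two smallest.
  val smallest: Seq[Int] = (Seq(5, 3) ++ Seq(4, 1)).sorted.take(2)
}
```

In Spark the same operators run in parallel over partitions of a distributed dataset; the local versions only show what each operator computes.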

Iteration in Map-Reduce
[Figure: Training Data, Map, Reduce; Initial Model w(0), then w(1), w(2), w(3), Learned Model]

Cost of Iteration in Map-Reduce
[Figure: Training Data, Map, Reduce; Initial Model w(0), then w(1), w(2), w(3), Learned Model; label "Read 2"]
Repeatedly load same data

Cost of Iteration in Map-Reduce
[Figure: Training Data, Map, Reduce; Initial Model w(0), then w(1), w(2), w(3), Learned Model]
Redundantly save output between stages

Dataflow View
[Figure: Training Data (HDFS) → Map → Reduce → Map → Reduce → Map → Reduce]

Memory Opt. Dataflow
[Figure: Training Data (HDFS) → Map → Reduce → Map → Reduce → Map → Reduce]
Cached Load

Memory Opt. Dataflow View
[Figure: Training Data (HDFS) → Map → Reduce → Map → Reduce → Map → Reduce]
Efficiently move data between stages
Spark: 10-100× faster than Hadoop MapReduce
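The difference between reloading the training data on every iteration and loading it once can be sketched in a few lines. This is an illustration on local collections, not a benchmark and not Spark code: `IterationSketch`, the toy data set, the learning rate, and the `loads` counter are all assumptions made for the example.

```scala
object IterationSketch {
  // Counts how often the (hypothetical) training data is read from storage.
  var loads = 0

  // Stand-in for reading training data from HDFS: points on the line y = 2x.
  def loadTrainingData(): Vector[(Double, Double)] = {
    loads += 1
    Vector((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))
  }

  // One gradient-descent step for the model y = w * x under squared error.
  def step(w: Double, data: Vector[(Double, Double)], rate: Double): Double = {
    val gradient = data.map { case (x, y) => 2 * (w * x - y) * x }.sum / data.size
    w - rate * gradient
  }

  // MapReduce style: every iteration re-reads its input.
  def trainReloading(iterations: Int): Double =
    (1 to iterations).foldLeft(0.0)((w, _) => step(w, loadTrainingData(), 0.05))

  // Memory-optimized style: read once, keep the data in memory, iterate over it.
  def trainCached(iterations: Int): Double = {
    val cached = loadTrainingData()
    (1 to iterations).foldLeft(0.0)((w, _) => step(w, cached, 0.05))
  }
}
```

Both functions arrive at the same model; only the number of loads differs, which is the cost the preceding slides attribute to iteration in Map-Reduce.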

Spark Fault Tolerance
• RDDs: Immutable collections of objects that can be stored in memory or disk across a cluster
  – Built via parallel transformations (map, filter, …)
  – Automatically rebuilt on (partial) failure
messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
HadoopRDD path = hdfs://…
FilteredRDD func = _.contains(...)
MappedRDD func = _.split(…)
M. Zaharia, et al., Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing, NSDI 2012.
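The idea that a dataset remembers how it was derived, so lost results can be recomputed, can be sketched with a small lineage tree. This is a toy model on local collections, not Spark's implementation: `Dataset`, `Source`, `Filtered`, `Mapped`, and the sample log lines are hypothetical names and data chosen to mirror the three nodes on the slide.

```scala
object LineageSketch {
  // A dataset is described by how to derive it, not only by its contents.
  sealed trait Dataset[A] { def compute(): Seq[A] }

  // Source node, standing in for HadoopRDD; `read` is a hypothetical loader.
  final case class Source[A](read: () => Seq[A]) extends Dataset[A] {
    def compute(): Seq[A] = read()
  }

  // Stands in for FilteredRDD: remembers its parent and the predicate.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
  }

  // Stands in for MappedRDD: remembers its parent and the function.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  // Hypothetical tab-separated log lines.
  val lines: Seq[String] =
    Seq("w1\terror\tdisk full", "w2\tinfo\tstarted", "w3\terror\ttimeout")

  // Same shape as the slide's pipeline: source, then filter, then map.
  val messages: Dataset[String] =
    Mapped(
      Filtered(Source(() => lines), (s: String) => s.contains("error")),
      (s: String) => s.split('\t')(2))
}
```

Because `messages` holds only the recipe, a lost in-memory copy of its result can be rebuilt by calling `compute()` again, which walks the lineage back to the source.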

DataFrames (main abstraction in Spark 2.0)
employees
  .join(dept, employees("deptId") === dept("id"))
  .where(employees("gender") === "female")
  .groupBy(dept("id"), dept("name"))
  .agg(count("name"))
Notes:
1) Some people think this is an improvement over SQL :)
2) DataFrames can be typed
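To make clear what the query above computes, here is the same join, filter, group, and count written against plain Scala collections. It is a sketch only: the `Employee` and `Dept` case classes and their rows are hypothetical, and counting rows per group matches `count("name")` only because no name here is null.

```scala
object DataFrameQuerySketch {
  // Hypothetical schemas; the slide shows only the columns used in the query.
  final case class Employee(name: String, gender: String, deptId: Int)
  final case class Dept(id: Int, name: String)

  val employees: Seq[Employee] = Seq(
    Employee("Ada", "female", 1), Employee("Grace", "female", 1),
    Employee("Alan", "male", 1), Employee("Barbara", "female", 2))
  val dept: Seq[Dept] = Seq(Dept(1, "Systems"), Dept(2, "Databases"))

  // Join on deptId = id, keep female employees, group by (id, name), count rows.
  val result: Map[(Int, String), Int] = {
    val joined = for {
      e <- employees
      d <- dept
      if e.deptId == d.id
    } yield (e, d)
    joined
      .filter { case (e, _) => e.gender == "female" }
      .groupBy { case (_, d) => (d.id, d.name) }
      .map { case (key, rows) => key -> rows.size }
  }
}
```

The collection version fixes the evaluation order (join first, then filter), whereas the DataFrame form only declares the result and leaves the plan to the optimizer.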

Catalyst Optimizer
• Typical DB optimizations across SQL and DF
  – Extensibility via Optimization Rules written in Scala
  – Open Source optimizer evolution!
• Code generation for inner-loops, iterator removal
• Extensible Data Sources: CSV, Avro, Parquet, JDBC, … via
  – TableScan (all cols)
  – PrunedScan (project)
  – PrunedFilteredScan (push advisory selects and projects)
  – CatalystScan (push advisory full Catalyst expression trees)
• Extensible (User Defined) Types
M. Armbrust, et al., Spark SQL: Relational Data Processing in Spark, SIGMOD 2015.
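Optimization rules written in Scala typically take the form of pattern matches over expression trees. The sketch below shows that style on a tiny expression type; it is not Catalyst's actual class hierarchy or API. `Expr`, `Literal`, `Attribute`, `Add`, `transformUp`, and `simplify` are defined here purely for illustration.

```scala
object RuleSketch {
  sealed trait Expr
  final case class Literal(value: Int) extends Expr
  final case class Attribute(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Apply a rule bottom-up: rewrite the children first, then the node itself.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Add(l, r) => Add(transformUp(l)(rule), transformUp(r)(rule))
      case leaf      => leaf
    }
    // Nodes the rule does not match are returned unchanged.
    rule.applyOrElse(withNewChildren, (x: Expr) => x)
  }

  // Two simple rules: fold constant additions and drop additions of zero.
  val simplify: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b)) => Literal(a + b)
    case Add(e, Literal(0))          => e
    case Add(Literal(0), e)          => e
  }
}
```

For example, `RuleSketch.transformUp(Add(Attribute("x"), Add(Literal(1), Literal(2))))(RuleSketch.simplify)` rewrites `x + (1 + 2)` to `x + 3`. Adding a rule means adding a `case`, which is the kind of extensibility the slide points to.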

An interesting thing about Spark SQL Performance

Lambda Architecture: one way to combine Real-Time + Batch
• lambda-architecture.net

Spark Streaming
• Microbatch approach provides low latency
Additional operators provide windowed operations
M. Zaharia, et al., Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013
S. Venkataraman et al., Azkar: Fast and Adaptable Stream Processing at Scale, SOSP 2017
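The micro-batch idea and a windowed operation over batches can be sketched on local collections. This is an illustration, not the Spark Streaming API: `Event`, `toBatches`, and `windowedCounts` are hypothetical names, and the window is measured in whole batches for simplicity.

```scala
object MicroBatchSketch {
  // Hypothetical event: arrival time in milliseconds plus a payload.
  final case class Event(timeMs: Long, word: String)

  // Discretize a stream into batches of `batchMs`, keyed by batch index.
  def toBatches(events: Seq[Event], batchMs: Long): Map[Long, Seq[Event]] =
    events.groupBy(_.timeMs / batchMs)

  // Windowed count: for each batch index 0..lastBatch, count the events in the
  // most recent `windowBatches` batches, including the current one.
  def windowedCounts(batches: Map[Long, Seq[Event]],
                     lastBatch: Long,
                     windowBatches: Int): List[Int] =
    (0L to lastBatch).map { i =>
      val start = math.max(0L, i - windowBatches + 1)
      (start to i).map(j => batches.get(j).map(_.size).getOrElse(0)).sum
    }.toList
}
```

With one-second batches and a two-batch window, events arriving at 100, 900, 1200, and 3100 ms fall into batches 0, 0, 1, and 3, and the per-batch windowed counts are 2, 3, 1, 1.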

Spark Structured Streams (unified)
Batch Analytics
Streaming Analytics

Putting it all Together: Multi-modal Analytics
SQL
Machine Learning
Streaming

From: Spark User Survey 2016, 1615 respondents from 900 organizations
http://go.databricks.com/2016-spark-survey

Spark Ecosystem Attributes
• Spark focus was initially on
  – Performance + Scalability with Fault Tolerance
• Eventually, ease of development was a key feature
  – especially across multiple modalities: DB, Graph, Stream, etc.
• This was true of most Big Data software of that generation
• Low Latency (streaming) and Deep Learning are also garnering significant attention lately

What’s Next?
Innovation in (open source) Big Data Software continues. Performance, Scalability, and Fault Tolerance remain important, but we face new challenges, including:
Data Science Lifecycle
• Data Acquisition, Integration, Cleaning (i.e., wrangling)
• Data Integration remains a “wicked problem”
• Model Building
• Communicating results, Curation, “Translational Data Science”
Ease of Development and Deployment
• Can leverage database ideas (e.g., declarative query optimization)
• New components for “model serving” and “model management”
“Safe” Data Science
• End-to-end Bias Mitigation
• Security, Ethics and Data Privacy
• Explaining and influencing decisions
• Human-in-the-loop