introducon to data science with hadoop - dataedge 2018 · introducon to data science with hadoop...
TRANSCRIPT
1/59©Cloudera,Inc.Allrightsreserved.
Introduc;ontoDataSciencewithHadoopGlynnDurham|[email protected]
2/59©Cloudera,Inc.Allrightsreserved.
ShortandSweetHadoopWhatAboutSpark?MachineLearningTheFuture
3/59©Cloudera,Inc.Allrightsreserved.
ShortandSweetHadoopWhatAboutSpark?MachineLearningTheFuture
4/59©Cloudera,Inc.Allrightsreserved.
DataScienceis…
• gatheringdata,poten;allyofmanytypesandfrommanysources,
• wranglingthatdataintousefulforms,and
• applyingsta;s;calprogrammingandmachinelearning,togainnewinforma;onfromthedata.
5/59©Cloudera,Inc.Allrightsreserved.
MachineLearningandDataVolume
“It’snotwhohasthebestalgorithmswhowins.It’swhohasthemostdata.”[BankoandBrill,2001]
6/59©Cloudera,Inc.Allrightsreserved.
Hadoopis…
• anopensourceso\warepla]ormfor• acquiring,storing,andprocessingmassivevolumesofdata,• economically.
7/59©Cloudera,Inc.Allrightsreserved.
TheAgeofMachineLearning
8/59©Cloudera,Inc.Allrightsreserved.
ShortandSweetHadoopWhatAboutSpark?MachineLearningTheFuture
9/59©Cloudera,Inc.Allrightsreserved.
Theword“Hadoop”means
• achild’stoyor
• HadoopCoreor
• theHadoopEcosystem.
10/59©Cloudera,Inc.Allrightsreserved.
HadoopCore
• Afreeopensourceso\wareso\wareproject• Managedtransparentlyonline,attheApacheSo\wareFounda;on(ASF),apache.org
• Theprojectwasstartedin2006,basedonpapersfromGoogle,in2003and2004
• Consistsof:• HDFS(HadoopDistributedFileSystem),forstorage• HadoopMapReduce,forprocessing• YARN(YetAnotherResourceNego;ator)
hadoop.apache.org
11/59©Cloudera,Inc.Allrightsreserved.
HadoopCoremainfeatures:Filestorageandbatchprogramming
12/59©Cloudera,Inc.Allrightsreserved.
HDFSWrites
13/59©Cloudera,Inc.Allrightsreserved.
HDFSReads
14/59©Cloudera,Inc.Allrightsreserved.
GeneralFileInput/Output
15/59©Cloudera,Inc.Allrightsreserved.
HDFSStrengthsandWeaknesses• HDFSisgoodat:
• storingenormousfiles• storinglotsofdatareliably• throughputonsequen;alwrites• throughputonsequen;alreadsofafileorpartofafile
• HDFSisnotgoodat:• highspeed(lowlatency)randomreadsofpartsofafile
• HDFScannot:• updateanypartofafileoncewrijen**butyoucanalwayswriteanewfileand/ordelete,move,andrenamefilesanddirectories
16/59©Cloudera,Inc.Allrightsreserved.
MapReduce:Programmingwithsimplefunc;ons
17/59©Cloudera,Inc.Allrightsreserved.
MapReduceExample:WordCountCountthenumberofoccurrencesofeachwordoveralargeamountofinputdata• Thisisthe‘helloworld’ofMapReduceprogramming
map(String input_key, String input_value) foreach word w in input_value: emit(w, 1)
reduce(String output_key, Iterator<int> intermediate_vals) set count = 0 foreach v in intermediate_vals: count += v emit(output_key, count)
18/59©Cloudera,Inc.Allrightsreserved.
WordCount,con;nuedInputtotheMapper:
OutputfromtheMapper:
(3414, 'the cat sat on the mat') (3437, 'the aardvark sat on the sofa')
('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
19/59©Cloudera,Inc.Allrightsreserved.
WordCount,con;nuedIntermediatedatasenttotheReducer:
FinalReduceroutput:
('aardvark', [1]) ('cat', [1]) ('mat', [1]) ('on', [1, 1]) ('sat', [1, 1]) ('sofa', [1]) ('the', [1, 1, 1, 1])
('aardvark', 1) ('cat', 1) ('mat', 1) ('on', 2) ('sat', 2) ('sofa', 1) ('the', 4)
20/59©Cloudera,Inc.Allrightsreserved.
Sowejustcountedwords.Sowhat?
• Manyproblemsconformtothispajern:• Webloganalysis:map()emitsanIPaddressforeachweblogevent;reduce()countsoccurrencesforeachIPaddress
• Indexing:Foreachdocument,map()emitseachtermofinterestpairedwiththedocumentID;reduce()collectsandemitsalldocumentIDsforeachterm
• Pagerankalgorithm:• Everywebpage(URL)ontheWebgetsanini;alscore.• map()dividesapage’sscoreamongallofitsoutlinks’URLs;reduce()sumsthereceivedscoresforeachURL.
• Iterateonthisprocedure.
21/59©Cloudera,Inc.Allrightsreserved.
MapReduceChains
22/59©Cloudera,Inc.Allrightsreserved.
MapReduceatScale
23/59©Cloudera,Inc.Allrightsreserved.
MapReduceStrengthsandWeaknesses• MapReduceisgoodat:
• processingenormousvolumesofdata• scalingoutasyouaddmoremachines• con;nuingtocomple;on,evenwhensomemachinesdie
• MapReduceisnotgoodat:• runninganyalgorithmyoucanwriteinpseudocode• algorithmsthatrequiresharedstateoverall**butmaybeyoucangetcleverwithyouralgorithmdesign
• MapReducecannot:• runinreal;me:MapReducejobsarebatchjobs
24/59©Cloudera,Inc.Allrightsreserved.
YARN,YetAnotherResourceNego;ator
25/59©Cloudera,Inc.Allrightsreserved.
Sqoop:RDBMStoHadoopandBack
• UsesMapReducetorunconcurrentdatabasequeriesthatextractorinsertdatasqoop.apache.org
26/59©Cloudera,Inc.Allrightsreserved.
Flume:Inges;ngOngoingEventDataflume.apache.org
27/59©Cloudera,Inc.Allrightsreserved.
Kata:GeneralDataStreamingkata.apache.org
28/59©Cloudera,Inc.Allrightsreserved.
HBase:ANoSQLDatabaseSystem
• Ascalablekey/valuestore• Accommodatesgeneralbinarydata• Highvolume,highperformanceaccesstoindividualitems• Randomreadsandwrites• WeakerquerylanguagethanSQL(put,get,scan,delete)• LacksACID-complianttransac;ons
hbase.apache.org
29/59©Cloudera,Inc.Allrightsreserved.
Kudu:Scalablestorageforstructureddatakudu.apache.org
30/59©Cloudera,Inc.Allrightsreserved.
Hive:MapReduce(orSpark)as“SQL”
• Familiarlanguageandprogrammingparadigm
• ProvidesinterfacetomanySQL-complianttools
hive.apache.org
31/59©Cloudera,Inc.Allrightsreserved.
Pig:AnotherLanguageforMapReduce(orSpark)pig.apache.org
32/59©Cloudera,Inc.Allrightsreserved.
Impala:HighSpeedAnaly;csinHadoop
• Purpose-builtforhighspeedanaly;cqueries
• DoesnotuseMapReduceorSpark
• Usually5to30;mesfaster—some;mes100;mesfaster!
incubator.apache.org/projects/impala.html
33/59©Cloudera,Inc.Allrightsreserved.
AndMore
• Serializa;onandefficientfilestorage:AvroandParquet
• Workflow:Oozie
avro.apache.org parquet.apache.org
oozie.apache.org
34/59©Cloudera,Inc.Allrightsreserved.
AndEvenMore…
• Security:SentryandRecordService
• MachineLearninginMapReduce:Mahout
• And…mahout.apache.org
sentry.apache.org recordservice.io
35/59©Cloudera,Inc.Allrightsreserved.
ShortandSweetHadoopWhatAboutSpark?MachineLearningTheFuture
36/59©Cloudera,Inc.Allrightsreserved.
Spark:AnImprovementonMapReduce
• OriginallyaresearchprojectatUCBerkeleyRADLab—latertheAMPLab,in2009
• AddressessomefundamentalpainpointsofMapReduce
• TheSparkStreamingsubprojectof2012addsnearreal-;meprogramming• using“micro-batches”asanadapta;onofbatchprogramming• acapabilityaltogetherlackinginHadoopMapReduce
spark.apache.org
37/59©Cloudera,Inc.Allrightsreserved.
Similari;esofMapReduceandSpark
• Processesmassivevolumesofdatawithascale-out,distributedframework• Theframeworkprovidesreliability,eveninthefaceofmachinefailure• Programmingwithstatelessfunc;ons• Reliesonexpensiveshuffletoreorganizedataforaggrega;on,joins,sor;ng• S;lllacksasharedstateamongallprocesses• CanrununderYARNtoshareprocessingresources
38/59©Cloudera,Inc.Allrightsreserved.
ImprovedAPI
• First-classAPIsinScala,Java,PythonandR• Data-flowprogrammingparadigm(likePig)• Interac;veshell—greatforexploratorywork
• ImprovedsupportforstructureddataandSQL-likeprocessing
39/59©Cloudera,Inc.Allrightsreserved.
ProcessingChains,Improved
func;on func;on func;on
func;onfunc;onfunc;on
EliminateI/O
EliminateI/O
ReduceI/O
Tasks,notnewprocesses(JVMs)Enhancedcachinginmemory
40/59©Cloudera,Inc.Allrightsreserved.
SparkMLlib:MachineLearninginSpark
• SubprojectofSpark• Effec;velyreplacesMahoutformachinelearninginHadoopclusters• Fromspark.apache.org,thefrontpage:
Butjustbeclearwhatyoumeanby“Hadoop”!
41/59©Cloudera,Inc.Allrightsreserved.
CommercialMessage#1
42/59©Cloudera,Inc.Allrightsreserved.
BigEcosystem
oozie.apache.org
43/59©Cloudera,Inc.Allrightsreserved.
CompleteBigDataPla]orm
• ClouderaManagercan• install,monitor,manage,upgradeacoherentbundleoftheseprojectsandmore
• ClouderaDirectorcan• easilyconfigureanddeploythispla]ormoncloudservicesfromAmazon,Google,orMicroso\
• !!!
44/59©Cloudera,Inc.Allrightsreserved.
ShortandSweetHadoopWhatAboutSpark?MachineLearningTheFuture
45/59©Cloudera,Inc.Allrightsreserved.
MachineLearningAlgorithms
• SupervisedLearning:• Startwithcorrectlylabeledrecords,andlearntoes;mateorpredictlabelsfornewrecords
• Con;nuouslabels:Regression• Discretelabels:Logis;cRegression,Classifiers
• UnsupervisedLearning:• Startwithunlabeledrecords,trytoteasepajerns(labels)outofthedata• Thereisnotasingle“correct”answerforlabeling• Con;nuouslabels:Collabora;veFilters(Recommenders)• Discretelabels:Clustering
46/59©Cloudera,Inc.Allrightsreserved.
LinearRegression:SupervisedLearningofaCon;nuousLabel
47/59©Cloudera,Inc.Allrightsreserved.
Logis;cRegression:SupervisedLearningofaBinaryLabel
48/59©Cloudera,Inc.Allrightsreserved.
Classifiers:SupervisedLearningofDiscreteLabels
Training:Cat
Training:Table
Scoring:???
49/59©Cloudera,Inc.Allrightsreserved.
Collabora;veFilters(Recommenders):UnsupervisedLearningofCon;nuousLabels
50/59©Cloudera,Inc.Allrightsreserved.
Clustering:UnsupervisedLearningofDiscreteLabels
51/59©Cloudera,Inc.Allrightsreserved.
SparkMLlib:MachineLearningonHadoop
52/59©Cloudera,Inc.Allrightsreserved.
ShortandSweetHadoopWhatAboutSpark?MachineLearningTheFuture
53/59©Cloudera,Inc.Allrightsreserved.
CommercialMessage#2
54/59©Cloudera,Inc.Allrightsreserved.
MoreDSTeamsintheOrganiza;on
• Collabora;on,repeatabilitywithinteams• Differingsecurityrequirements• Differentpreferredprograminglanguages:Python,R,Scala• Differentso\warelibraries:Pandas,H2O,etc.• Evendifferentversionsofso\ware
55/59©Cloudera,Inc.Allrightsreserved.
ClouderaDataScienceWorkbench
• DevelopmentinPython,Scala,orR• Differingsecurityrequirements
56/59©Cloudera,Inc.Allrightsreserved.
DeepLearning
57/59©Cloudera,Inc.Allrightsreserved.
DeepLearningonHadoop
• DeepLearningreferstoacategoryofclassifieralgorithms,mostlyinventedin2006.
• SparkMLlibdoesnothaveanydirectimplementa;onofDL.• Thereareseveraladdi;onalprojectsthatcanfitDLontoSpark/Hadoop:
• BigDL• Caffe• TensorFlow• DL4J
58/59©Cloudera,Inc.Allrightsreserved.
TheRoad—orRunway(!)—Ahead
• Itisatruismthatorganiza;onstodayhavevaluableinsightshiddenintheirdatathatarewai;ngtobeuncovered.
• 90%ofalldatathatwillexistin2020hasyettobecreated.• Opensourceisheretostay.• Hadoopasadatasciencepla]ormisevolving,anditsuseisgrowingexponen;ally.
59/59©Cloudera,Inc.Allrightsreserved.