Introduction to Data Science with Hadoop - DataEdge 2018

Post on 27-Jul-2018


1/59 © Cloudera, Inc. All rights reserved.

Introduction to Data Science with Hadoop
Glynn Durham | Senior Instructor
glynn@cloudera.com
May 2017

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

Data Science is…

• gathering data, potentially of many types and from many sources,
• wrangling that data into useful forms, and
• applying statistical programming and machine learning, to gain new information from the data.

Machine Learning and Data Volume

“It’s not who has the best algorithms who wins. It’s who has the most data.” [Banko and Brill, 2001]

Hadoop is…

• an open source software platform for
• acquiring, storing, and processing massive volumes of data,
• economically.

The Age of Machine Learning

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

The word “Hadoop” means

• a child’s toy or
• Hadoop Core or
• the Hadoop Ecosystem.

Hadoop Core

• A free open source software project
• Managed transparently online, at the Apache Software Foundation (ASF), apache.org
• The project was started in 2006, based on papers from Google in 2003 and 2004
• Consists of:
  • HDFS (Hadoop Distributed File System), for storage
  • Hadoop MapReduce, for processing
  • YARN (Yet Another Resource Negotiator)

hadoop.apache.org

Hadoop Core main features: File storage and batch programming

HDFS Writes

HDFS Reads

General File Input/Output

HDFS Strengths and Weaknesses

• HDFS is good at:
  • storing enormous files
  • storing lots of data reliably
  • throughput on sequential writes
  • throughput on sequential reads of a file or part of a file
• HDFS is not good at:
  • high speed (low latency) random reads of parts of a file
• HDFS cannot:
  • update any part of a file once written*

* but you can always write a new file and/or delete, move, and rename files and directories
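The footnote's workaround for write-once files can be sketched in Python. This is only an illustration under stated assumptions: a local temporary directory stands in for HDFS, the helper name `replace_file_contents` and the `.new` suffix are hypothetical, and a real cluster would go through an HDFS client rather than the `os` module.

```python
import os
import tempfile

def replace_file_contents(path, transform):
    """'Update' a write-once file by writing a new file and renaming it.

    Assumption: the local filesystem stands in for HDFS. The original
    file is never modified in place; it is read sequentially, a new file
    is written sequentially, and the new file takes over the old name.
    """
    new_path = path + ".new"          # hypothetical naming convention
    with open(path, "r") as src, open(new_path, "w") as dst:
        for line in src:
            dst.write(transform(line))
    os.replace(new_path, path)        # rename the new file into place

# Usage: upper-case every line of a small file.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "events.txt")
with open(data_path, "w") as f:
    f.write("cat\nmat\n")

replace_file_contents(data_path, str.upper)

with open(data_path) as f:
    updated = f.read()
print(updated)
```

The sketch reads and writes whole files front to back, which matches the sequential access the slide lists as an HDFS strength.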

MapReduce: Programming with simple functions

MapReduce Example: Word Count

Count the number of occurrences of each word over a large amount of input data
• This is the ‘hello world’ of MapReduce programming

map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

Word Count, continued

Input to the Mapper:

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

Output from the Mapper:

('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1),
('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

Word Count, continued

Intermediate data sent to the Reducer:

('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [1, 1])
('sat', [1, 1])
('sofa', [1])
('the', [1, 1, 1, 1])

Final Reducer output:

('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)
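The word count walk-through can be reproduced on a single machine with a few lines of Python. This is a minimal sketch, not Hadoop code: `run_word_count` is a hypothetical driver that plays the role of the framework (map, then group by key, then reduce), and the sorting of keys imitates the ordered intermediate data shown on the slide.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    """Emit (word, 1) for every word in one line of input."""
    for word in input_value.split():
        yield (word, 1)

def reduce_fn(output_key, intermediate_vals):
    """Sum the counts collected for one word."""
    return (output_key, sum(intermediate_vals))

def run_word_count(records):
    # Map phase: apply map_fn to every (key, value) input record.
    mapped = [pair for key, value in records for pair in map_fn(key, value)]
    # Shuffle phase: collect all values emitted for the same key.
    grouped = defaultdict(list)
    for word, one in mapped:
        grouped[word].append(one)
    # Reduce phase: one reduce_fn call per key, in sorted key order.
    return [reduce_fn(word, grouped[word]) for word in sorted(grouped)]

records = [(3414, 'the cat sat on the mat'),
           (3437, 'the aardvark sat on the sofa')]
word_counts = run_word_count(records)
print(word_counts)
```

The input keys (3414 and 3437 on the slide) are passed through but ignored by `map_fn`, just as in the pseudocode.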

So we just counted words. So what?

• Many problems conform to this pattern:
  • Web log analysis: map() emits an IP address for each web log event; reduce() counts occurrences for each IP address
  • Indexing: For each document, map() emits each term of interest paired with the document ID; reduce() collects and emits all document IDs for each term
  • PageRank algorithm:
    • Every web page (URL) on the Web gets an initial score.
    • map() divides a page’s score among all of its outlinks’ URLs; reduce() sums the received scores for each URL.
    • Iterate on this procedure.
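The PageRank procedure described above can be sketched as one map/reduce pass repeated in a loop. This is a simplified illustration under stated assumptions: the three-page link graph is made up, every page is assumed to have at least one outlink, and the damping factor used in the published PageRank algorithm is left out because the slide does not mention it.

```python
from collections import defaultdict

def map_fn(url, score, outlinks):
    """Divide a page's score evenly among the URLs it links to."""
    share = score / len(outlinks)
    for target in outlinks:
        yield (target, share)

def reduce_fn(url, received):
    """Sum the score shares received by one URL."""
    return (url, sum(received))

def iterate(scores, links):
    """One map/shuffle/reduce pass over the whole link graph."""
    grouped = defaultdict(list)
    for url, outlinks in links.items():
        for target, share in map_fn(url, scores[url], outlinks):
            grouped[target].append(share)
    # Pages that received nothing get an empty list and a score of 0.
    return dict(reduce_fn(url, grouped[url]) for url in links)

# Hypothetical link graph: page -> pages it links to.
links = {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
scores = {url: 1.0 for url in links}   # every page gets an initial score

first_pass = iterate(scores, links)
for _ in range(20):                    # iterate on this procedure
    scores = iterate(scores, links)
print(first_pass, scores)
```

On a cluster each pass would be a separate MapReduce job whose output feeds the next job.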

MapReduce Chains

MapReduce at Scale

MapReduce Strengths and Weaknesses

• MapReduce is good at:
  • processing enormous volumes of data
  • scaling out as you add more machines
  • continuing to completion, even when some machines die
• MapReduce is not good at:
  • running any algorithm you can write in pseudocode
  • algorithms that require shared state overall*
• MapReduce cannot:
  • run in real time: MapReduce jobs are batch jobs

* but maybe you can get clever with your algorithm design

YARN, Yet Another Resource Negotiator

Sqoop: RDBMS to Hadoop and Back

• Uses MapReduce to run concurrent database queries that extract or insert data

sqoop.apache.org

Flume: Ingesting Ongoing Event Data

flume.apache.org

Kafka: General Data Streaming

kafka.apache.org

HBase: A NoSQL Database System

• A scalable key/value store
• Accommodates general binary data
• High volume, high performance access to individual items
• Random reads and writes
• Weaker query language than SQL (put, get, scan, delete)
• Lacks ACID-compliant transactions

hbase.apache.org
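The four operations named on the slide (put, get, scan, delete) can be illustrated with a toy in-memory store. This is a sketch, not the HBase API: the class `ToyKeyValueStore`, its method signatures, and the row keys are all hypothetical. It assumes keys are kept in sorted byte order and that a scan covers a half-open key range, which is how HBase row scans are usually described.

```python
import bisect

class ToyKeyValueStore:
    """In-memory stand-in with put/get/scan/delete; not the HBase API."""

    def __init__(self):
        self._keys = []     # row keys, kept in sorted order
        self._values = {}   # row key -> arbitrary bytes

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        return self._values.get(key)

    def delete(self, key):
        if key in self._values:
            del self._values[key]
            self._keys.remove(key)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

store = ToyKeyValueStore()
store.put(b'user#002', b'\x00\x01')     # values are general binary data
store.put(b'user#001', b'hello')
store.put(b'video#001', b'\xff')
store.delete(b'video#001')

# Scan every key with the prefix b'user#' ('$' is the byte after '#').
user_rows = list(store.scan(b'user#', b'user$'))
print(user_rows)
```

There is no join, filter expression, or multi-row transaction here, which is the sense in which the slide calls the query language weaker than SQL.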

Kudu: Scalable storage for structured data

kudu.apache.org

Hive: MapReduce (or Spark) as “SQL”

• Familiar language and programming paradigm
• Provides interface to many SQL-compliant tools

hive.apache.org
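To show why a SQL interface counts as a familiar paradigm, the earlier word count can be written as a single GROUP BY query. This is a sketch under a loud assumption: Python's built-in `sqlite3` module stands in for Hive so the example runs anywhere, and the table name `words` is hypothetical. The query itself is ordinary SQL; Hive's own dialect may differ in details.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a Hive table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE words (word TEXT)')

lines = ['the cat sat on the mat', 'the aardvark sat on the sofa']
conn.executemany('INSERT INTO words VALUES (?)',
                 [(w,) for line in lines for w in line.split()])

# The whole map/shuffle/reduce word count collapses into one query.
rows = conn.execute(
    'SELECT word, COUNT(*) AS n FROM words GROUP BY word ORDER BY word'
).fetchall()
print(rows)
conn.close()
```

In Hive, a query of this shape is compiled into MapReduce (or Spark) jobs rather than being executed by a single-process engine.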

Pig: Another Language for MapReduce (or Spark)

pig.apache.org

Impala: High Speed Analytics in Hadoop

• Purpose-built for high speed analytic queries
• Does not use MapReduce or Spark
• Usually 5 to 30 times faster, sometimes 100 times faster!

incubator.apache.org/projects/impala.html

And More

• Serialization and efficient file storage: Avro and Parquet
• Workflow: Oozie

avro.apache.org
parquet.apache.org
oozie.apache.org

And Even More…

• Security: Sentry and RecordService
• Machine Learning in MapReduce: Mahout
• And…

mahout.apache.org
sentry.apache.org
recordservice.io

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

Spark: An Improvement on MapReduce

• Originally a research project at UC Berkeley RAD Lab (later the AMPLab), in 2009
• Addresses some fundamental pain points of MapReduce
• The Spark Streaming subproject of 2012 adds near real-time programming
  • using “micro-batches” as an adaptation of batch programming
  • a capability altogether lacking in Hadoop MapReduce

spark.apache.org
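The micro-batch idea can be sketched in plain Python: cut an ongoing stream into small batches and run an ordinary batch function on each one. This is an illustration, not Spark code. The helper names and the sample IP addresses are hypothetical, and batches are cut by record count here for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice

def micro_batches(events, batch_size):
    """Split a (possibly endless) event stream into small lists."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def count_batch(batch):
    """An ordinary batch computation, applied to one micro-batch."""
    return Counter(batch)

# Hypothetical stream of web log events (one IP address per event).
stream = ['10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.3', '10.0.0.1']

running_total = Counter()
per_batch = []
for batch in micro_batches(stream, batch_size=2):
    result = count_batch(batch)       # results arrive batch by batch
    per_batch.append(result)
    running_total.update(result)
print(per_batch, running_total)
```

Results become available after each small batch instead of after the whole input, which is what makes the approach near real-time rather than real-time.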

Similarities of MapReduce and Spark

• Processes massive volumes of data with a scale-out, distributed framework
• The framework provides reliability, even in the face of machine failure
• Programming with stateless functions
• Relies on expensive shuffle to reorganize data for aggregation, joins, sorting
• Still lacks a shared state among all processes
• Can run under YARN to share processing resources

Improved API

• First-class APIs in Scala, Java, Python and R
• Data-flow programming paradigm (like Pig)
• Interactive shell, great for exploratory work
• Improved support for structured data and SQL-like processing

Processing Chains, Improved

[Figure: chains of functions, with the labels “Eliminate I/O” (twice) and “Reduce I/O”]

Tasks, not new processes (JVMs)
Enhanced caching in memory

Spark MLlib: Machine Learning in Spark

• Subproject of Spark
• Effectively replaces Mahout for machine learning in Hadoop clusters
• From spark.apache.org, the front page:

But just be clear what you mean by “Hadoop”!

Commercial Message #1

Big Ecosystem

oozie.apache.org

Complete Big Data Platform

• Cloudera Manager can
  • install, monitor, manage, upgrade a coherent bundle of these projects and more
• Cloudera Director can
  • easily configure and deploy this platform on cloud services from Amazon, Google, or Microsoft
• !!!

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

Machine Learning Algorithms

• Supervised Learning:
  • Start with correctly labeled records, and learn to estimate or predict labels for new records
  • Continuous labels: Regression
  • Discrete labels: Logistic Regression, Classifiers
• Unsupervised Learning:
  • Start with unlabeled records, try to tease patterns (labels) out of the data
  • There is not a single “correct” answer for labeling
  • Continuous labels: Collaborative Filters (Recommenders)
  • Discrete labels: Clustering

Linear Regression: Supervised Learning of a Continuous Label
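As a small worked example of supervised learning with a continuous label, a straight line can be fit by ordinary least squares. This is a minimal single-machine sketch, not Spark MLlib code: the function `fit_line` and the training data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: y is modeled as slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labeled records: each x comes with its correct label y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)      # training
predicted = slope * 5.0 + intercept      # predict the label of a new record
print(slope, intercept, predicted)
```

The training step learns from labeled records and the last line estimates a label for a record that was not in the training data, which is the supervised pattern described on the previous slide.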

Logistic Regression: Supervised Learning of a Binary Label

Classifiers: Supervised Learning of Discrete Labels

[Figure labels: Training: Cat; Training: Table; Scoring: ???]

Collaborative Filters (Recommenders): Unsupervised Learning of Continuous Labels

Clustering: Unsupervised Learning of Discrete Labels
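Clustering can be illustrated with a small k-means loop on one-dimensional data. This is a sketch under stated assumptions, not Spark MLlib code: the function `k_means_1d`, the points, and the starting centers are all made up, and a different choice of starting centers can produce a different labeling, in line with the earlier remark that unsupervised learning has no single “correct” answer.

```python
def k_means_1d(points, centers, iterations=10):
    """Assign each point to its nearest center, then move each center
    to the mean of its assigned points; repeat a fixed number of times."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    labels = [min(range(len(centers)), key=lambda i: abs(p - centers[i]))
              for p in points]
    return centers, labels

# Hypothetical unlabeled records and arbitrary starting centers.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, labels = k_means_1d(points, centers=[0.0, 5.0])
print(centers, labels)
```

The discrete labels (0 and 1 here) are produced by the algorithm itself rather than supplied with the data.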

Spark MLlib: Machine Learning on Hadoop

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

Commercial Message #2

More DS Teams in the Organization

• Collaboration, repeatability within teams
• Differing security requirements
• Different preferred programming languages: Python, R, Scala
• Different software libraries: Pandas, H2O, etc.
• Even different versions of software

Cloudera Data Science Workbench

• Development in Python, Scala, or R
• Differing security requirements

Deep Learning

Deep Learning on Hadoop

• Deep Learning refers to a category of classifier algorithms, mostly invented in 2006.
• Spark MLlib does not have any direct implementation of DL.
• There are several additional projects that can fit DL onto Spark/Hadoop:
  • BigDL
  • Caffe
  • TensorFlow
  • DL4J

The Road, or Runway (!), Ahead

• It is a truism that organizations today have valuable insights hidden in their data that are waiting to be uncovered.
• 90% of all data that will exist in 2020 has yet to be created.
• Open source is here to stay.
• Hadoop as a data science platform is evolving, and its use is growing exponentially.

Thank you

glynn@cloudera.com
