Introduction to Data Science with Hadoop - DataEDGE 2018

Introduction to Data Science with Hadoop Glynn Durham | Senior Instructor glynn@cloudera.com May 2017


Post on 27-Jul-2018


Page 1: Introduction to Data Science with Hadoop - DataEDGE 2018 · Introduction to Data Science with Hadoop Glynn Durham | Senior Instructor glynn@cloudera.com May 2017 ... • DL4J © Cloudera,

1/59 © Cloudera, Inc. All rights reserved.

Introduction to Data Science with Hadoop
Glynn Durham | Senior Instructor
glynn@cloudera.com
May 2017

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future


Data Science is…

• gathering data, potentially of many types and from many sources,
• wrangling that data into useful forms, and
• applying statistical programming and machine learning, to gain new information from the data.

Machine Learning and Data Volume

“It’s not who has the best algorithms who wins. It’s who has the most data.” [Banko and Brill, 2001]

Hadoop is…

• an open source software platform for
• acquiring, storing, and processing massive volumes of data,
• economically.

The Age of Machine Learning

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

The word “Hadoop” means

• a child’s toy or
• Hadoop Core or
• the Hadoop Ecosystem.

Hadoop Core

• A free open source software project
• Managed transparently online, at the Apache Software Foundation (ASF), apache.org
• The project was started in 2006, based on papers from Google, in 2003 and 2004
• Consists of:
  • HDFS (Hadoop Distributed File System), for storage
  • Hadoop MapReduce, for processing
  • YARN (Yet Another Resource Negotiator)

hadoop.apache.org

Hadoop Core main features: File storage and batch programming

HDFS Writes

HDFS Reads

General File Input/Output

HDFS Strengths and Weaknesses

• HDFS is good at:
  • storing enormous files
  • storing lots of data reliably
  • throughput on sequential writes
  • throughput on sequential reads of a file or part of a file
• HDFS is not good at:
  • high speed (low latency) random reads of parts of a file
• HDFS cannot:
  • update any part of a file once written*

* but you can always write a new file and/or delete, move, and rename files and directories

MapReduce: Programming with simple functions

MapReduce Example: Word Count

Count the number of occurrences of each word over a large amount of input data
• This is the ‘hello world’ of MapReduce programming

map(String input_key, String input_value)
    foreach word w in input_value:
        emit(w, 1)

reduce(String output_key, Iterator<int> intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

Word Count, continued

Input to the Mapper:

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

Output from the Mapper:

('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1),
('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)

Word Count, continued

Intermediate data sent to the Reducer:

('aardvark', [1])
('cat', [1])
('mat', [1])
('on', [1, 1])
('sat', [1, 1])
('sofa', [1])
('the', [1, 1, 1, 1])

Final Reducer output:

('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)
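
The word count walk-through can be imitated on a single machine. The following is a minimal Python sketch, not Hadoop code: `shuffle` and `word_count` are hypothetical helpers that stand in for what the framework does between the map and reduce phases, and the input records are the two lines from the slides.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    """Emit (word, 1) for every word in one input line; the key is ignored."""
    for word in input_value.split():
        yield (word, 1)

def shuffle(pairs):
    """Group values by key and sort the keys, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_fn(output_key, intermediate_vals):
    """Sum the counts collected for one word."""
    return (output_key, sum(intermediate_vals))

def word_count(records):
    """Run map, shuffle, and reduce in sequence over (key, line) records."""
    mapped = [pair for key, value in records for pair in map_fn(key, value)]
    return [reduce_fn(key, vals) for key, vals in shuffle(mapped)]

records = [
    (3414, "the cat sat on the mat"),
    (3437, "the aardvark sat on the sofa"),
]
result = word_count(records)
print(result)
```

On a real cluster the map and reduce calls would run in parallel on many machines; only the two small functions would be written by the programmer.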

So we just counted words. So what?

• Many problems conform to this pattern:
  • Web log analysis: map() emits an IP address for each web log event; reduce() counts occurrences for each IP address
  • Indexing: For each document, map() emits each term of interest paired with the document ID; reduce() collects and emits all document IDs for each term
  • Page rank algorithm:
    • Every web page (URL) on the Web gets an initial score.
    • map() divides a page’s score among all of its outlinks’ URLs; reduce() sums the received scores for each URL.
    • Iterate on this procedure.
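
The page rank procedure described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the three-page `links` graph is hypothetical, every page has at least one outlink, and there is no damping factor, since the slide does not mention one. The `iterate` helper plays the role of the framework's shuffle.

```python
from collections import defaultdict

def map_fn(url, score, outlinks):
    """Divide a page's score equally among the URLs it links to."""
    share = score / len(outlinks)
    for target in outlinks:
        yield (target, share)

def reduce_fn(url, received):
    """Sum the scores a URL received from the pages linking to it."""
    return (url, sum(received))

def iterate(links, scores):
    """One map/shuffle/reduce pass over the whole graph."""
    received = defaultdict(list)
    for url, outlinks in links.items():
        for target, share in map_fn(url, scores[url], outlinks):
            received[target].append(share)
    return dict(reduce_fn(url, received[url]) for url in links)

# Hypothetical three-page web; every page has at least one outlink.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = {url: 1.0 for url in links}  # every page gets the same initial score
for _ in range(40):                   # "Iterate on this procedure."
    scores = iterate(links, scores)
print(scores)
```

Each pass only redistributes score, so the total stays constant while the individual scores settle toward stable values.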

MapReduce Chains

MapReduce at Scale

MapReduce Strengths and Weaknesses

• MapReduce is good at:
  • processing enormous volumes of data
  • scaling out as you add more machines
  • continuing to completion, even when some machines die
• MapReduce is not good at:
  • running any algorithm you can write in pseudocode
  • algorithms that require shared state overall*
• MapReduce cannot:
  • run in real time: MapReduce jobs are batch jobs

* but maybe you can get clever with your algorithm design

YARN, Yet Another Resource Negotiator

Sqoop: RDBMS to Hadoop and Back

• Uses MapReduce to run concurrent database queries that extract or insert data

sqoop.apache.org

Flume: Ingesting Ongoing Event Data

flume.apache.org

Kafka: General Data Streaming

kafka.apache.org

HBase: A NoSQL Database System

• A scalable key/value store
• Accommodates general binary data
• High volume, high performance access to individual items
• Random reads and writes
• Weaker query language than SQL (put, get, scan, delete)
• Lacks ACID-compliant transactions

hbase.apache.org
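
To make the four operations named above concrete, here is a toy in-memory key/value store in Python. It is only a sketch of the put/get/scan/delete vocabulary: `ToyKeyValueStore` is a hypothetical class, not the HBase client API, and it keeps everything in one process with none of the distribution or durability a real system provides. It assumes keys are kept in sorted order, so a scan is a range read.

```python
import bisect

class ToyKeyValueStore:
    """In-memory stand-in for a sorted key/value table; not the HBase client API."""

    def __init__(self):
        self._keys = []    # row keys, kept in sorted order
        self._values = {}  # row key -> value bytes

    def put(self, key: bytes, value: bytes) -> None:
        """Insert a new key or overwrite the value of an existing one."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        """Random read of a single item; returns None if the key is absent."""
        return self._values.get(key)

    def delete(self, key: bytes) -> None:
        """Remove a key if it exists."""
        if key in self._values:
            del self._values[key]
            self._keys.remove(key)

    def scan(self, start: bytes, stop: bytes):
        """Yield (key, value) for start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]

store = ToyKeyValueStore()
store.put(b"user#002", b"Grace")
store.put(b"user#001", b"Ada")
store.put(b"video#001", b"\x00\x01")  # values are arbitrary binary data
store.delete(b"video#001")
rows = list(store.scan(b"user#", b"user#~"))
print(rows)
```

Note what is missing compared with SQL: there are no joins, no secondary predicates, and no multi-row transactions, only access by key or key range.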

Kudu: Scalable storage for structured data

kudu.apache.org

Hive: MapReduce (or Spark) as “SQL”

• Familiar language and programming paradigm
• Provides interface to many SQL-compliant tools

hive.apache.org

Pig: Another Language for MapReduce (or Spark)

pig.apache.org

Impala: High Speed Analytics in Hadoop

• Purpose-built for high speed analytic queries
• Does not use MapReduce or Spark
• Usually 5 to 30 times faster, sometimes 100 times faster!

incubator.apache.org/projects/impala.html

And More

• Serialization and efficient file storage: Avro and Parquet
• Workflow: Oozie

avro.apache.org parquet.apache.org

oozie.apache.org

And Even More…

• Security: Sentry and RecordService
• Machine Learning in MapReduce: Mahout
• And…

mahout.apache.org

sentry.apache.org recordservice.io

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

Spark: An Improvement on MapReduce

• Originally a research project at UC Berkeley RAD Lab (later the AMPLab), in 2009
• Addresses some fundamental pain points of MapReduce
• The Spark Streaming subproject of 2012 adds near real-time programming
  • using “micro-batches” as an adaptation of batch programming
  • a capability altogether lacking in Hadoop MapReduce

spark.apache.org
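
The “micro-batch” idea can be illustrated without Spark. The following plain-Python sketch is not the Spark Streaming API: `micro_batches` and `process_batch` are hypothetical helpers, and batches are cut by event count so the example is deterministic, whereas a streaming engine would normally cut them by a time interval.

```python
from itertools import islice

def micro_batches(events, batch_size):
    """Group a (possibly unbounded) iterator of events into small lists."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def process_batch(batch):
    """An ordinary batch computation, here a count per event type."""
    counts = {}
    for event in batch:
        counts[event] = counts.get(event, 0) + 1
    return counts

# Hypothetical event stream; in practice this would never end.
stream = ["click", "view", "click", "view", "view", "click", "buy"]
results = [process_batch(batch) for batch in micro_batches(stream, batch_size=3)]
print(results)
```

The point of the design is that `process_batch` is the same kind of code one would write for a batch job; only the size of its input changes.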

Similarities of MapReduce and Spark

• Processes massive volumes of data with a scale-out, distributed framework
• The framework provides reliability, even in the face of machine failure
• Programming with stateless functions
• Relies on expensive shuffle to reorganize data for aggregation, joins, sorting
• Still lacks a shared state among all processes
• Can run under YARN to share processing resources

Improved API

• First-class APIs in Scala, Java, Python and R
• Data-flow programming paradigm (like Pig)
• Interactive shell, great for exploratory work
• Improved support for structured data and SQL-like processing

Processing Chains, Improved

[Figure: chains of functions, with annotations: Eliminate I/O; Eliminate I/O; Reduce I/O; Tasks, not new processes (JVMs); Enhanced caching in memory]

Spark MLlib: Machine Learning in Spark

• Subproject of Spark
• Effectively replaces Mahout for machine learning in Hadoop clusters
• From spark.apache.org, the front page:

But just be clear what you mean by “Hadoop”!

Commercial Message #1

Big Ecosystem

oozie.apache.org

Complete Big Data Platform

• Cloudera Manager can
  • install, monitor, manage, upgrade a coherent bundle of these projects and more
• Cloudera Director can
  • easily configure and deploy this platform on cloud services from Amazon, Google, or Microsoft
• !!!

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

Machine Learning Algorithms

• Supervised Learning:
  • Start with correctly labeled records, and learn to estimate or predict labels for new records
  • Continuous labels: Regression
  • Discrete labels: Logistic Regression, Classifiers
• Unsupervised Learning:
  • Start with unlabeled records, try to tease patterns (labels) out of the data
  • There is not a single “correct” answer for labeling
  • Continuous labels: Collaborative Filters (Recommenders)
  • Discrete labels: Clustering

Linear Regression: Supervised Learning of a Continuous Label
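
As a small worked illustration of learning a continuous label from labeled records, the sketch below fits a straight line by ordinary least squares in plain Python. The data points are hypothetical and chosen to lie exactly on y = 2x + 1; `fit_line` and `predict` are illustrative names, not part of Spark MLlib or any other library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training records: feature x with its correct label y.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Estimate the label for a new, unlabeled record."""
    return slope * x + intercept

print(slope, intercept, predict(5.0))
```

Training uses the labeled records; scoring a new record is then just an evaluation of the fitted line.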

Logistic Regression: Supervised Learning of a Binary Label

Classifiers: Supervised Learning of Discrete Labels

Training: Cat

Training: Table

Scoring: ???

Collaborative Filters (Recommenders): Unsupervised Learning of Continuous Labels

Clustering: Unsupervised Learning of Discrete Labels
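
One common clustering technique is k-means. The slide does not name a specific algorithm, so the choice here is an assumption, as are the one-dimensional data points and the starting centers. The sketch is plain Python, not Spark MLlib.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on one-dimensional points with given starting centers."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    labels = [min(range(len(centers)), key=lambda i: abs(p - centers[i])) for p in points]
    return centers, labels

# Hypothetical unlabeled records with two visible groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, labels = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, labels)
```

The discrete labels are simply cluster indices. Different starting centers can yield a different labeling, which reflects the point that unsupervised learning has no single “correct” answer.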

Spark MLlib: Machine Learning on Hadoop

Short and Sweet | Hadoop | What About Spark? | Machine Learning | The Future

Commercial Message #2

More DS Teams in the Organization

• Collaboration, repeatability within teams
• Differing security requirements
• Different preferred programming languages: Python, R, Scala
• Different software libraries: Pandas, H2O, etc.
• Even different versions of software

Cloudera Data Science Workbench

• Development in Python, Scala, or R
• Differing security requirements

Deep Learning

Deep Learning on Hadoop

• Deep Learning refers to a category of classifier algorithms, mostly invented in 2006.
• Spark MLlib does not have any direct implementation of DL.
• There are several additional projects that can fit DL onto Spark/Hadoop:
  • BigDL
  • Caffe
  • TensorFlow
  • DL4J

The Road (or Runway!) Ahead

• It is a truism that organizations today have valuable insights hidden in their data that are waiting to be uncovered.
• 90% of all data that will exist in 2020 has yet to be created.
• Open source is here to stay.
• Hadoop as a data science platform is evolving, and its use is growing exponentially.

glynn@cloudera.com