Experiments with Hadoop and Spark, discussion
André Schaaff, François-Xavier Pineau, Noémie Wali
IVOA meeting, Sydney 2015




Around Big Data…

31/10/2015, IVOA Sydney 2015

An ongoing exploration (started last month) of emerging (or maturing) «Big Data» technologies, with the XMatch as the main use case


Emerging/maturing technologies

Credits: Apache Foundation


Hadoop?

•  High-availability distributed object-oriented platform

– Framework for the distributed processing of large datasets (HDFS (distributed file system), MapReduce)

– Scalable from a single server to thousands of machines
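
To make the MapReduce model named above more concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample records are illustrative assumptions, not part of the original slides. It counts how many sources belong to each catalogue.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit one (key, 1) pair per input record."""
    for catalogue, _source_id in records:
        yield catalogue, 1

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input: (catalogue name, source identifier) records.
records = [("cat_A", 1), ("cat_B", 2), ("cat_A", 3), ("cat_C", 4), ("cat_A", 5)]
counts = reduce_phase(shuffle(map_phase(records)))
```

On a real cluster the map and reduce steps would run in parallel on the nodes holding the data, and the shuffle would move the intermediate pairs between nodes; the sketch only shows the logical flow.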


HDFS: more details

•  Abstraction of the storage

– A set of distributed hard disks is seen as one hard disk

– NameNode
    •  namespace, file tree, metadata
    •  location of the data blocks

– DataNodes
    •  where the data blocks are
    •  the DataNodes inform the NameNode of their content (data blocks)
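
The division of roles described above can be sketched as a toy model in plain Python. This is not the HDFS API: the classes `NameNode` and `DataNode`, their methods, the file path and the block identifiers are all hypothetical and only mirror the slide's description (the NameNode holds the file tree and block locations, the DataNodes report which blocks they store).

```python
from collections import defaultdict

class NameNode:
    """Toy NameNode: keeps file metadata and the location of each block."""

    def __init__(self):
        self.file_blocks = {}                    # file path -> ordered block ids
        self.block_locations = defaultdict(set)  # block id -> DataNode names

    def register_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def block_report(self, datanode_name, block_ids):
        """Called by a DataNode to declare the blocks it stores."""
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode_name)

    def locate(self, path):
        """For each block of the file, list the DataNodes holding a copy."""
        return [(b, sorted(self.block_locations[b])) for b in self.file_blocks[path]]

class DataNode:
    """Toy DataNode: stores blocks and informs the NameNode of its content."""

    def __init__(self, name, block_ids):
        self.name = name
        self.block_ids = set(block_ids)

    def report_to(self, namenode):
        namenode.block_report(self.name, self.block_ids)

# Hypothetical file split into two blocks, stored on two DataNodes.
namenode = NameNode()
namenode.register_file("/catalogues/example.csv", ["blk_1", "blk_2"])
for node in (DataNode("dn1", ["blk_1"]), DataNode("dn2", ["blk_1", "blk_2"])):
    node.report_to(namenode)
locations = namenode.locate("/catalogues/example.csv")
```

A client reading the file would ask the NameNode for `locations` and then fetch each block directly from one of the listed DataNodes, which is why the set of disks can be presented as a single file system.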


HDFS Architecture

Credits: Apache Foundation


Spark?

•  «Fast and general engine for data processing»

•  Runs on Hadoop (HDFS), Mesos, standalone or in the cloud

=> Compatible with Hadoop data

•  Applications can be written in Java, Scala, Python or R

•  Not only MapReduce driven


CDS XMatch

•  The CDS XMatch is an efficient service based on optimized developments and implemented on well-suited hardware


CDS XMatch (2)

•  The data is not distributed on a few servers: all the catalogues are located on one RAID system

•  But…


CDS XMatch (3)

•  …but «organized»


Distributed XMatch

•  In the case of Hadoop/Spark, the data is distributed on clusters of servers

•  Points of interest: how is the data distributed? How to optimize this distribution?
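
One possible answer to the distribution question, given here only as a hedged illustration and not as the scheme used by the CDS XMatch, is to partition the sky into declination zones so that sources that can match end up in the same partition. The sketch below is plain Python; the zone height, the match radius, the function names and the sample sources are all assumptions. Sources of the second catalogue that lie within one radius of a zone border are copied into the neighbouring zone, so each zone can be processed independently.

```python
import math
from collections import defaultdict

def zone_of(dec_deg, zone_height_deg):
    """Index of the declination band containing dec_deg."""
    return math.floor((dec_deg + 90.0) / zone_height_deg)

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Great-circle distance in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(math.sqrt(h)))

def partition(cat_a, cat_b, radius_deg, zone_height_deg):
    """Group sources by zone; B sources are copied to every zone they may match in."""
    zones = defaultdict(lambda: ([], []))
    for src in cat_a:
        zones[zone_of(src[2], zone_height_deg)][0].append(src)
    for src in cat_b:
        low = zone_of(src[2] - radius_deg, zone_height_deg)
        high = zone_of(src[2] + radius_deg, zone_height_deg)
        for z in range(low, high + 1):
            zones[z][1].append(src)
    return zones

def xmatch_zone(a_sources, b_sources, radius_deg):
    """Brute-force match inside one zone: the unit of work one node would run."""
    return [(a[0], b[0]) for a in a_sources for b in b_sources
            if angular_distance_deg(a[1], a[2], b[1], b[2]) <= radius_deg]

# Hypothetical sources: (identifier, RA in degrees, Dec in degrees).
cat_a = [("a1", 10.0, 0.9995), ("a2", 200.0, -45.0)]
cat_b = [("b1", 10.0, 1.0005), ("b2", 200.0, -44.0)]
radius = 0.002
zones = partition(cat_a, cat_b, radius, zone_height_deg=1.0)
matches = sorted(pair for a_srcs, b_srcs in zones.values()
                 for pair in xmatch_zone(a_srcs, b_srcs, radius))
```

In this example `a1` and `b1` fall in different zones but are still matched, because `b1` is replicated across the border. Since every source of the first catalogue lives in exactly one zone, no pair is reported twice. The trade-off is the usual one for such schemes: smaller zones give more parallelism but more replicated sources.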


Why this study?

•  We are evaluating what Hadoop/Spark could bring

– What it could replace or improve, especially in terms of scalability (larger datasets, hardware facilities, deployments, etc.)

– And at what cost (money, manpower, performance up or down?)

•  Learn to use Hadoop/Spark


Discussion

•  Who is using (or has tested) Hadoop, Spark?

– Which use case?

– In production?

– Feedback?

•  In the frame of «bringing the code to the data»?

– Example: I can send a jar file to the master node to be executed «near the data»
