Experiments with Hadoop and Spark, discussion

André Schaaff, François-Xavier Pineau, Noémie Wali

IVOA meeting, Sydney 2015

Around Big Data…

31/10/2015 IVOA Sydney 2015

An ongoing exploration (started last month) of emerging (or maturing) «Big Data» technologies, with the XMatch as the main use case

Emerging / maturing technologies

Credits: Apache Software Foundation

Hadoop?

•  High-availability distributed object-oriented platform

– Framework for the distributed processing of large datasets (HDFS (distributed file system), MapReduce)

– Scalable from a single server to thousands of machines
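The MapReduce model mentioned above can be illustrated with a minimal single-process sketch. It is not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the record layout `(catalogue, ra, dec)` are assumptions made for illustration. On a real cluster the framework would run the map and reduce steps on many machines and perform the shuffle over the network.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit one (key, value) pair per input record.

    A record is assumed to be a (catalogue_name, ra_deg, dec_deg) tuple.
    """
    catalogue_name, _ra, _dec = record
    yield catalogue_name, 1


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle and reduce in sequence (here in a single process)."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For example, `map_reduce([("2MASS", 10.2, -5.3), ("SDSS", 200.5, 45.5), ("2MASS", 11.0, -4.0)])` counts the number of sources per catalogue.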

HDFS: more details

•  Abstraction of the storage

– A set of distributed hard disks is seen as one hard disk

– NameNode

•  namespace, file tree, metadata

•  location of the data blocks

– DataNodes

•  where the data blocks are

•  the DataNodes inform the NameNode of their content (data blocks)
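The division of roles described above can be sketched with a toy in-memory model. This is not the HDFS API: the classes `NameNode` and `DataNode` and all their methods are hypothetical names chosen for illustration, and replication, failures and networking are left out.

```python
class NameNode:
    """Holds the namespace and block locations, but no file data."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of DataNode names

    def add_file(self, path, block_ids):
        """Register a file in the namespace as an ordered list of blocks."""
        self.file_blocks[path] = list(block_ids)

    def receive_block_report(self, datanode_name, block_ids):
        """Record which blocks a DataNode says it currently holds."""
        # Forget what this node reported earlier, then store the new report.
        for holders in self.block_locations.values():
            holders.discard(datanode_name)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode_name)

    def locate(self, path):
        """Return, for each block of the file, the nodes that hold it."""
        return [
            (block_id, sorted(self.block_locations.get(block_id, ())))
            for block_id in self.file_blocks[path]
        ]


class DataNode:
    """Stores the data blocks and reports its content to the NameNode."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def report_to(self, namenode):
        namenode.receive_block_report(self.name, list(self.blocks))
```

A client would ask the NameNode where the blocks of a file are (`locate`) and then read the bytes from the DataNodes directly, which is why the set of disks appears as a single one.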

HDFS Architecture

Credits: Apache Software Foundation

Spark?

•  «Fast and general engine for data processing»

•  Runs on Hadoop (HDFS), Mesos, standalone or in the cloud

=> Compatible with Hadoop data

•  Applications can be written in Java, Scala, Python or R

•  Not only MapReduce driven

CDS XMatch

•  The CDS XMatch is an efficient service based on optimized developments and implemented on well-suited hardware

CDS XMatch (2)

•  The data is not distributed on a few servers: all the catalogues are located on one RAID system

•  But…

CDS XMatch (3)

•  …but«organized»

Distributed XMatch

•  In the case of Hadoop/Spark, the data is distributed on clusters of servers

•  Points of interest: how is the data distributed? How can this distribution be optimized?
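To make the distribution question concrete, here is a minimal sketch of a cross-match in which sources are first grouped by sky cell, so that each cell could in principle be processed on a different node. The fixed RA/Dec grid is an assumption chosen for simplicity; it is not presented as the partitioning used by the CDS XMatch. All function names and the `(id, ra_deg, dec_deg)` source layout are hypothetical. The sketch ignores pairs that straddle a cell border and the shrinking of RA cells near the poles, two of the issues a real distribution scheme would have to handle.

```python
import math
from collections import defaultdict


def cell_of(ra_deg, dec_deg, cell_size_deg=1.0):
    """Return the (ra_index, dec_index) grid cell containing a position."""
    return (
        int(math.floor(ra_deg / cell_size_deg)),
        int(math.floor(dec_deg / cell_size_deg)),
    )


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Angular distance between two positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (
        math.sin((dec2 - dec1) / 2) ** 2
        + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2
    )
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def partition(catalogue, cell_size_deg=1.0):
    """Group (id, ra_deg, dec_deg) sources by grid cell."""
    cells = defaultdict(list)
    for source in catalogue:
        _source_id, ra, dec = source
        cells[cell_of(ra, dec, cell_size_deg)].append(source)
    return cells


def cross_match(cat_a, cat_b, radius_deg, cell_size_deg=1.0):
    """Match sources of two catalogues that fall in the same cell.

    Each cell is independent of the others, which is what would allow
    the work to be spread over a cluster. Border effects are ignored.
    """
    cells_a = partition(cat_a, cell_size_deg)
    cells_b = partition(cat_b, cell_size_deg)
    matches = []
    for cell, sources_a in cells_a.items():
        for id_a, ra_a, dec_a in sources_a:
            for id_b, ra_b, dec_b in cells_b.get(cell, []):
                if angular_separation_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg:
                    matches.append((id_a, id_b))
    return sorted(matches)
```

The cell size trades off balance against overhead: small cells spread the load more evenly but increase the share of sources that lie close to a border.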

Why this study?

•  We are evaluating what Hadoop/Spark could bring

– What it could replace or improve, especially in terms of scalability (larger datasets, hardware facilities, deployments, etc.)

– And at what cost (money, manpower, performance up or down?)

•  Learn to use Hadoop/Spark

Discussion

•  Who is using (or has tested) Hadoop or Spark?

– Which use case?

– In production?

– Feedback?

•  In the context of «bringing the code to the data»?

– Example: I can send a jar file to the master node to be executed «near the data»
