experiments with hadoop and spark, discussion › ... › ivoa-sydney-gws1-session1-cds.pdf · •...
TRANSCRIPT
Experimentswith
HadoopandSpark,
discussion
AndréSchaaff,François-XavierPineau,NoémieWali
IVOAmeeFng,Sydney2015
AroundBigData…
31/10/2015 IVOASydney2015 2
AnongoingexploraFon(startedlastmonth)of
emerging(ormaturing)«BigData»
technologieswiththeXMatchasmainusecase
Emerging/maturingtechnologies
31/10/2015 IVOASydney2015 3
Credits:ApachefundaFon
Hadoop?
• High-availabilitydistributedobject-orientedpla]orm
– Frameworkforthedistributedprocessingoflarge
datasets(HDFS(distributedfilesystem),
MapReduce)
– Scalablefromasingleservertothousandsof
machines
31/10/2015 IVOASydney2015 4
HDFS:moredetails
• AbstracFonofthestorage– Asetofdistributedhardisksisseenasonehardisk
– NameNode• namespace,filetree,metadata
• locaFonofthedatablocks– DataNodes• wherethedatablocksare• theDataNodesinformtheNameNodeoftheircontent
(datablocks)
31/10/2015 IVOASydney2015 5
HDFSArchitecture
31/10/2015 IVOASydney2015 6
Credits:ApachefundaFon
Spark?
• «Fastandgeneralenginefordataprocessing»
• RunsonHadoop(HDFS),Mesos,standaloneor
inthecloud
=>CompaFblewithHadoopdata
• ApplicaFonscanbewrideninJava,Scala,PythonorR
• NotonlyMapReducedriven
31/10/2015 IVOASydney2015 7
CDSXMatch
31/10/2015 IVOASydney2015 8
• TheCDSXMatchisanefficientservicebased
onopFmizeddevelopmentsand
implementedonawellshapedhardware
CDSXMatch(2)
31/10/2015 IVOASydney2015 9
• Thedataisnotdistributedonafewservers:allthecataloguesarelocatedononeRAIDsystem
• But…
CDSXMatch(3)
31/10/2015 IVOASydney2015 10
• …but«organized»
DistributedXMatch
31/10/2015 IVOASydney2015 11
• InthecaseofHadoop/Spark,the
dataisdistributedon
clustersofservers
• Pointsofinterest:howisthedata
distributed?,howto
opFmizethis
distribuFon?
Whythisstudy?...
• WeareevaluaFngwhatHadoop/Sparkcould
bring
– Whatitcouldreplaceorimprove,especiallyinthe
frameofscalability(largerdatasets,hardware
faciliFes,deployments,etc.)
– Andforwhichcost(money,manpower,
performancesup?/down?)
• LearntouseHadoop/Spark
31/10/2015 IVOASydney2015 12
Discussion
• Whois(orhastested)usingHadoop,Spark?
– Whichusecase?
– InproducFon?– Feedback?
• Intheframeof«bringingthecodetothe
data»?
– Example:icansendajarfiletothemasternode
tobeexecuted«nearthedata»
31/10/2015 IVOASydney2015 13