STORAGE-BASED CONVERGENCE BETWEEN HPC AND CLOUD TO HANDLE BIG DATA

Deliverable number: D3.1
Deliverable title: Characterization of the existing cloud and HPC storage systems - Intermediate Report
WP3 HPC-CLOUD Convergence

Editor: Adrien Lebre (Inria)
Authors: Pierre Matri (UPM, ESR03), Fotios Papaodyssefs (Seagate, ESR06), Fotis Nikolaidis (CEA, ESR09), with the support of Linh Thuy Nguyen (Inria, ESR07), Athanasios Kiatipis (Fujitsu, ESR08), Mohammed-Yacine Taleb (Inria, ESR13), Yevhen Alforov (DKRZ, ESR14)

Grant Agreement number: 642963
Project ref. no: MSCA-ITN-2014-ETN-642963
Project acronym: BigStorage
Project full name: BigStorage: Storage-based convergence between HPC and Cloud to handle Big Data
Starting date (dur.): 1/1/2015 (48 months)
Ending date: 31/12/2018
Project website: http://www.bigstorage-project.eu
Coordinator: María S. Pérez
Address: Campus de Montegancedo sn, 28660 Boadilla del Monte, Madrid, Spain
Reply to: mperez@fi.upm.es
Phone: +34-91-336-7380
Executive Summary

This document provides an overview of the progress of the work done until M24 of the project BigStorage (from 01-01-2015 until 31-12-2016) in WP3 HPC-Cloud Convergence.

Although the storage mechanisms available in HPC and Cloud environments differ significantly in their design as well as in the services they deliver (files vs. key/value systems, hierarchical vs. flat namespaces, semantics), understanding the specific techniques used in both areas is a key element to propose new storage systems offering the best of both worlds. The activities conducted in WP3 focused on understanding the specific techniques used in the HPC and Cloud areas. This first step is a key element for our ESRs to propose new storage systems taking advantage of both areas.

The D3.1 deliverable presents an overview of the major systems that have been studied by the ESRs. We chose to classify them into three groups that correspond to the three storage backends that have been identified in the HPC and Cloud worlds:

● Distributed File Systems;
● Binary Large Objects;
● Key/Value store systems.

For each category, we first discuss the general concepts and then present major systems that implement those concepts. For each system, we underline the pros and cons and also indicate the ESRs that act as contact points for the others in order to get additional information if needed. Finally, we conclude the discussion of each category by presenting its relevance with respect to the requirements of the use cases that have been studied in WP1. The document ends with a table that gives an overview of the mapping between the features offered by each system and the identified requirements. We underline that the list of systems presented in this deliverable is in no way exhaustive and is mainly geared towards the systems that have been studied by the ESRs.

In addition to providing strong expertise to our ESRs, this study should enable them to propose innovative contributions: first at the application level, through new I/O middleware mechanisms that guide I/O systems so that they deliver the best performance they can achieve, and second at the distributed file systems level, in order to extend them with cloud capabilities such as elasticity and to revisit some of their internals in order to develop a unified architecture for HPC and Cloud storage backends.
Document Information

IST Project Number: MSCA-ITN-2014-ETN-642963
Acronym: BigStorage
Title: Storage-based convergence between HPC and Cloud to handle Big Data
Project URL: http://www.bigstorage-project.eu
Document URL: http://bigstorage-project.eu/index.php/deliverables
EU Project Officer: Mr. Szymon Sroda
Deliverable: D3.1 Intermediate Report on WP3
Work package: WP3 HPC-CLOUD Convergence
Date of Delivery: Planned: 31.12.2016; Actual: 20.12.2016
Status: Version 1.0 (final)
Nature: report
Dissemination level: consortium
Distribution List: Consortium Partners
Document Location: http://bigstorage-project.eu/index.php/deliverables
Responsible Editor: Adrien Lebre (Inria), adrien.lebre@inria.fr, Tel: +33 (0)2 51 85 82 43
Authors (Partner): Pierre Matri (UPM, ESR03), Fotios Papaodyssefs (Seagate, ESR06), Fotis Nikolaidis (CEA, ESR09)
Reviewers:
Abstract (for dissemination): see Executive Summary
Keywords: Storage Backend, HPC, Cloud, Parallel File Systems, Distributed File Systems, Binary Large Objects, Key/Value Stores

Version history:
0.1 - Initial template and structure - 08.10.2016 - Adrien Lebre, Inria
0.2 - Sections from all authors - 10.11.2016 - All authors
0.3 - Internal version with Editor's comments/amendments - 15.11.2016 - Adrien Lebre, Inria
0.4 - Revisions from all authors - 09.12.2016 - All authors
0.5 - Internal version for review by the consortium - 15.12.2016 - Adrien Lebre, Inria
0.6 - Internal version with Reviewer's comments - 18.12.2016 - Maria S. Perez, UPM
0.6 - Final version to commission - 31.12.2016 - Adrien Lebre, Inria
Project Consortium Information

Participants and contacts:

● Universidad Politécnica de Madrid (UPM), Spain: María S. Pérez, Email: mperez@fi.upm.es
● Barcelona Supercomputing Center (BSC), Spain: Toni Cortes, Email: toni.cortes@bsc.es
● Johannes Gutenberg University (JGU) Mainz, Germany: André Brinkmann, Email: brinkman@uni-mainz.de
● Inria, France: Gabriel Antoniu, Email: gabriel.antoniu@inria.fr; Adrien Lebre, Email: adrien.lebre@inria.fr
● Foundation for Research and Technology - Hellas (FORTH), Greece: Angelos Bilas, Email: bilas@ics.forth.gr
● Seagate, UK: Malcolm Muggeridge, Email: malcolm.muggeridge@seagate.com
● DKRZ, Germany: Thomas Ludwig, Email: ludwig@dkrz.de
● CA Technologies Development Spain (CA), Spain: Victor Muntes, Email: Victor.Muntes@ca.com
● CEA, France: Jacques-Charles Lafoucriere, Email: Charles.LAFOUCRIERE@CEA.FR
● Fujitsu Technology Solutions GMBH, Germany: Sepp Stieger, Email: sepp.stieger@ts.fujitsu.com
Preamble
File systems were invented in the 1960s to provide an interface for end users and applications to store data in a structured manner. Traditionally, files were organized in hierarchical structures ("trees") consisting of directories and files (a directory containing either sub-directories or files). The objective of a file system, then, is to manage the location of data according to a logical sequence and via an easily understood hierarchy of nested directories. Compounding the problem is that as the amount of data grows, so does the number of nested directories and files. The result is a set of large tree structures that makes it cumbersome and challenging to find any particular file, especially if the specific name, the creation date, or the type of the file is not known.

Times have changed, though, and today much of the data that is generated corresponds to unstructured data that does not require the same capabilities as the ones offered by traditional file systems. Consequently, the cost of maintaining a hierarchical structure is often unnecessary and the resulting overhead may hinder the performance of the system.

To overcome this limitation, a new generation of storage systems has been introduced. Among these, BLOB (Binary Large OBject) storage systems came to the surface. The namespace has changed from hierarchical to flat: instead of iterating through deep hierarchies to identify a file, files (i.e., objects in the BLOB terminology) are now directly accessible through a unique id. BLOBs are generally split into multiple small parts, or chunks, that are distributed over the storage cluster. Most of these systems allow modifications to be performed on the BLOBs, either by appending or by writing at random offsets. For applications manipulating smaller or immutable values, key/value systems have been introduced. By reducing the kind of data that can be manipulated, key/value systems can offer higher efficiency.

This document has been written so as to reflect these three system categories.
Table of contents

Parallel File Systems / Distributed File Systems
    Parallel/Distributed File Systems Fundamentals
    Lustre (Overview; Pros and Cons)
    PVFS/OrangeFS (Overview; Pros and Cons)
    HDFS (Overview; Pros and Cons)
    GlusterFS (Overview; Pros and Cons)
    CephFS (Overview; Pros and Cons)
    Relevance w.r.t. WP1 use-cases
Blob Systems
    Blob Fundamentals
    BlobSeer (Overview; Pros/Cons)
    Azure Storage (Overview; Pros/Cons)
    RADOS (Overview; Pros/Cons)
    Relevance w.r.t. WP1 use-cases
Key/Value Store Systems
    KVS Fundamentals
    HBase (Overview; Pros and Cons)
    Cassandra (Overview; Pros and Cons)
    RAMCloud (Overview; Pros and Cons)
    Relevance w.r.t. the WP1 use-cases
Relevance w.r.t. the WP1 use cases - Overview
References
Parallel File Systems / Distributed File Systems

Parallel/Distributed File Systems Fundamentals

Parallel, distributed or clustered file systems are terms used to describe the pooling of storage resources over a network interface. While not strictly defined, a distinction is usually drawn between parallel and distributed systems to refer to their low- and high-level architectural differences as well as the intended use-cases that each file system is geared towards.

Parallel file systems: With the advent of cluster computing, storage often became the bottleneck component of high-performance systems, especially when executing I/O-intensive parallel applications. The need for high-throughput concurrent reads and writes led to the development of parallel file systems. Parallel file systems achieve high throughput by striping files over multiple disks, allowing for collective read and write operations and therefore overcoming the limitations of a single device [FeitelsonD][BentJ]. In general, parallel file systems offer block-level access to the entire pool of storage media, providing a unified view at the lower level, and are geared towards high-performance environments located in the same physical space and connected through dedicated networks. Parallel file systems have been designed for handling structured data in a manner very similar to traditional file systems.

Distributed file systems: While parallel file systems were created out of the need for high-performance computations and the handling of structured scientific data, distributed systems evolved out of the necessity to handle the ever-increasing amounts of data generated by multiple and heterogeneous sources over the Internet. The elasticity offered by cloud computing required a similar paradigm in the storage layer. Unstructured and heterogeneous types of data also introduced new requirements for a system to be able to scale properly, while older requirements were no longer necessary and were impeding performance.

The following table summarizes the differences between parallel and distributed file systems.
                      Parallel                  Distributed
Focus area            HPC / Enterprise          Big Data / Cloud
Usual data type       Structured data           Unstructured / binary
Redundancy model      Hardware replication      Software replication
Device share model    Shared disks              Shared nothing
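To make the striping mechanism mentioned above concrete, the following minimal Python sketch (purely illustrative, not the code of any system presented below) splits a byte buffer into fixed-size chunks and distributes them round-robin over a set of storage targets; collective reads can then fetch chunks from all targets in parallel and reassemble the file.

```python
# Minimal illustration of file striping across storage targets (hypothetical sketch).
def stripe(data: bytes, stripe_size: int, num_targets: int):
    """Split `data` into stripe_size chunks and assign them round-robin to targets."""
    placement = {t: [] for t in range(num_targets)}
    for i in range(0, len(data), stripe_size):
        chunk = data[i:i + stripe_size]
        target = (i // stripe_size) % num_targets   # round-robin placement
        placement[target].append(chunk)
    return placement

def reassemble(placement: dict) -> bytes:
    """Interleave chunks back in round-robin order (reads could hit all targets in parallel)."""
    out = []
    max_rounds = max(len(chunks) for chunks in placement.values())
    for round_ in range(max_rounds):
        for target in sorted(placement):
            if round_ < len(placement[target]):
                out.append(placement[target][round_])
    return b"".join(out)

if __name__ == "__main__":
    data = bytes(range(256)) * 100
    layout = stripe(data, stripe_size=4096, num_targets=4)
    assert reassemble(layout) == data
```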
In the following paragraphs, we present some of the most popular systems that fall into the parallel/distributed categories. First, we give an overview of Lustre [lustre4] and PVFS (OrangeFS) [pvfs1], which are the two most representative parallel file systems. Second, we introduce HDFS [HDFS10] and GlusterFS [GlusterFS13], which are two well-known distributed file systems. Finally, we conclude this list by describing the Ceph system [webciteceph1], which can be seen as a mix between the two categories.
Lustre
ESRs: ESR06, ESR14
Key contact: ESR14
Main area: HPC
Overview
Lustre is an open-source, parallel distributed file system suitable for large-scale cluster computing. Its name is an amalgam of the terms "Linux" and "clusters". It is currently used by a large number of TOP500 supercomputers, including 6 of the top 10 and 60 of the top 100 [lustre1]. Feature-rich and robust, Lustre bundles high performance with data redundancy and high availability. It is a popular file system which has been adopted by large data centers in industry and by scientific communities working in fields such as financial analysis, media, life science, climate change and meteorology.

Lustre is based on the object-storage model and treats files as objects at the file system level. It aims to provide a fully POSIX-I/O-compliant file system (i.e., one that provides strong consistency and improves program portability) while maintaining high-performance I/O operations. An overview of a typical Lustre setup can be seen in Figure 1, along with the main architectural components of the system. It is composed of three main functional units [lustre2]:

● Clients. Lustre exposes the Portable Operating System Interface for Unix (POSIX) [lustre3] to the clients, which allows concurrent read and write access to the file system through the standard set of system calls. I/O throughput increases because Lustre allows massively parallel file access.
● Object Storage Servers (OSSs) manage the file system's data. They provide an object-based interface that clients can use to access byte ranges within the objects. Each OSS is connected to possibly multiple object storage targets (OSTs) that store the actual file data.
● Metadata Servers (MDSs) manage the file system's metadata, such as directories, file names and permissions. MDSs are not involved in the actual I/O but are only contacted once when a file is created or opened. The clients are then able to independently contact the appropriate OSSs. Each MDS is connected to possibly multiple metadata targets (MDTs) that store the actual metadata.

Additional architecture information:
● The Lustre networking module (LNET) enables low-latency network access, necessary for a cluster storage system. In addition to high network performance, LNET allows Remote Direct Memory Access, eliminating CPU overheads.
● Data striping provides significant performance boosts on large single-file reads, since data can be read from multiple storage devices, thus overcoming the throughput limitations of individual devices.

Figure 1 - Lustre: the main components of the Lustre file system [dell.www]

Lustre is based on Linux and uses kernel-based server modules to provide the required performance. Additionally, Lustre offers other useful features, listed below [lustre4]:

● Interoperability: the Lustre file system can run on modern commodity hardware with different CPU architectures.
● Access control lists (ACLs): the security mechanism of Lustre uses the idea of Linux ACLs, a set of access rights and permissions that every user or group has on specific system objects. It is enhanced by POSIX ACLs, based on standard POSIX file system object permissions.
● Quota: the amount of disk space that users or groups of users consume can be limited or changed in a Lustre system with quotas.
● OSS addition: more client or server nodes can easily be added to a cluster that uses the Lustre file system without any system interruption. This feature gives the opportunity to scale the system and to increase its storage capacity and bandwidth.
● Controlled striping: application libraries and utilities can specify different striping settings (such as striping count, striping size and OST selection) for an individual file or directory created in the Lustre file system in order to adjust capacity and performance. The same control is available for every file created in the system.
● Backup tools: the Lustre file system has utilities that allow users to discover modified files and directories, striping information, etc., and provide backup and restoration. However, these procedures are performed in either supervised or semi-supervised ways.
Pros and Cons

Pros
● Provides a POSIX-I/O-compliant interface
● High performance, scalability, stability and availability
● Provides file striping
● Achieves high bandwidth for a small number of files
● Has advanced security mechanisms
● Is an open-source file system

Cons
● Limited in features (e.g., compression [lustre5])
● Difficult to back up and efficiently recover metadata and data
● Striping can lead to metadata overhead
● An OST failure can stall system performance and make files inaccessible
● Cannot handle large numbers of small files (millions or more)
PVFS/OrangeFS
ESRs: ESR03
Key contact: ESR03
Main area: HPC
Overview
PVFS [pvfs1] is a parallel file system that supports both distributing file data and metadata among multiple servers and coordinated access to file data by multiple tasks of a parallel program. PVFS has an object-based design, which is to say all PVFS server requests involve objects called dataspaces. A dataspace can be used to hold file data, file metadata, directory metadata, directory entries, or symbolic links.
Figure 2 - PVFS: the main components of the PVFS file system [pvfs1]

The software consists of two major components: a storage server that runs as a daemon on an I/O node and a client library that provides an interface for user programs running on a compute node. The storage server stores all data in objects known as dataspaces, and all I/O operations are performed relative to one or more of these objects. A dataspace consists of two storage areas: a bytestream that stores arbitrary binary data as a sequence of bytes, and a collection of key/value pairs that allows structured data to be stored and quickly retrieved when needed. This architecture is outlined in Figure 2.
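As a rough mental model of the dataspace abstraction just described (a bytestream plus a key/value collection), the following Python sketch is purely illustrative: the class and attribute names are invented for the example and do not correspond to the actual PVFS/OrangeFS API.

```python
# Illustrative model of a PVFS-style dataspace: a bytestream plus key/value pairs.
# Names are invented for the example; this is not the PVFS/OrangeFS API.
class Dataspace:
    def __init__(self):
        self.bytestream = bytearray()   # arbitrary binary data, addressed by offset
        self.keyval = {}                # structured data, e.g. metadata or directory entries

    def write(self, offset: int, data: bytes):
        end = offset + len(data)
        if end > len(self.bytestream):
            self.bytestream.extend(b"\x00" * (end - len(self.bytestream)))
        self.bytestream[offset:end] = data

    def read(self, offset: int, size: int) -> bytes:
        return bytes(self.bytestream[offset:offset + size])

# A file can then be represented by several dataspaces: one holding file metadata
# (key/value pairs) and one or more holding the actual file data (bytestreams).
meta = Dataspace()
meta.keyval.update({"owner": "alice", "mode": 0o644, "datafile_handles": [42]})
data = Dataspace()
data.write(0, b"hello parallel world")
print(data.read(0, 5))  # b'hello'
```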
PVFS interacts over the network through an interface known as BMI [bmi]. BMI provides a non-blocking post and poll interface for sending messages, receiving messages and checking whether outstanding posts have completed. These are mostly requests from clients. Unexpected messages are limited in size but are useful for most client requests. BMI has implementations for TCP/IP, GM, MX, IB, and other networking fabrics. Multiple networks can be used at the same time, though there are performance issues in doing this.

The PVFS server manages requests through a Job layer. The Job interface is also a non-blocking post and poll design and provides a common interface for having many outstanding operations in flight on the server at one time. Jobs are issued by the server Request processor, which is built using a custom state-machine language, SM. SM allows programs to define fundamental steps in the processing of each request, each ending in posting a job. Once a job is posted, the state machine is suspended until completion, at which point it automatically resumes. SM allows return codes to drive the processing to different states. SM is designed to make the coding of server requests simpler by abstracting away many details, including interacting with BMI, encoding and decoding messages, and asynchronous issues. The PVFS client code is built from many of the same components as the server, including BMI and state machines. The primary difference is that the state machines are written to implement each function in the PVFS System Interface (sysint). The sysint is designed based on operations typically found in a modern OS virtual file system.
Thus, for example, there isn't an open call, but rather a lookup that takes a path name and returns a reference to the file's metadata. Metadata is read or written with getattr and setattr, respectively. User programs are expected to use PVFS via one of several user-level interfaces. These include a VFS kernel module for Linux, a FUSE interface, support via ROMIO (MPI-IO) and a User Interface (usrint) provided with PVFS.

The Orange File System (OrangeFS) [orangefs1] is a branch of the Parallel Virtual File System. Like PVFS, OrangeFS is a parallel file system designed for use on next-generation HPC systems that provides very high-performance access to disk storage for parallel applications. OrangeFS differs from PVFS in that it offers features that are not presently available in the main PVFS distribution. While PVFS development tends to focus on specific very large systems, OrangeFS considers a number of areas that have not been well supported by PVFS in the past. Such areas include virtualized storage over any Linux file system as the underlying local storage on each connected server, or replacement of the Hadoop DFS using a MapReduce extension and JNI, with no modification of MapReduce code needed.
Pros and Cons

Pros
● Unique object-based file data transfer, allowing clients to work on objects without the need to handle underlying storage details such as data blocks
● Ability to have unified data/metadata servers (they can also be separated if needed; the default is unified)
● Distribution of metadata across storage servers, as well as of directory entry metadata
● Ability to configure storage parameters per directory or file, including stripe size, number of servers, and immutable file replication

Cons
● Resides in the kernel (superuser rights are required to install OrangeFS)
● Small community compared to Lustre
HDFS
ESRs: ESR03
Key contact: ESR03
Main area: Cloud
Overview
HDFS [HDFS08] is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS is highly fault-tolerant and has been designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS is a distributed file system that shares many concepts with GoogleFS [googleFS03].

The system consists of a single NameNode (acting as master) and several DataNodes (acting as slaves), which are pieces of software run on commodity machines. The NameNode keeps in its memory (1) the whole namespace tree and (2) the mapping of file blocks to DataNodes. Data content is stored as a sequence of blocks, which are replicated for fault tolerance; all blocks in a file except the last block are the same size. For each file, the block size and replication factor are configurable.

Figure 3 - HDFS architecture [HDFS08]
Read flow
1. The client asks the NameNode for the file's blocks and replica locations.
2. The NameNode returns the list of DataNodes that hold the replicas, ordered by distance from the client.
3. The client contacts the DataNodes directly to request the blocks and reads from there.

Write flow
1. The client asks the NameNode for write permission on a file.
2. The NameNode grants the client a lease for the file, so that no other client can write to this file. A list of DataNodes that will hold the blocks is returned to the client.
3. The DataNodes in the list form a data pipeline, which minimizes the total network distance from the client to the last DataNode.
   a. The client contacts the first DataNode, this DataNode contacts the next DataNode, and so on, to set up the pipeline.
   b. The last DataNode in the list sends an acknowledgement message back to the previous DataNode, and so on back to the client, to indicate that the pipeline is ready.
4. The client sends data to the first DataNode, and the data flows through the pipeline to the last DataNode; the acknowledgement messages flow back in the same way as before.
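These flows can be exercised from Python; the sketch below assumes the third-party HdfsCLI package (a WebHDFS client) and a reachable NameNode, and the URL, user and paths are placeholders used only for illustration. Behind this simple API the client library still resolves block locations via the NameNode and streams data to/from DataNodes as described above.

```python
# Hedged sketch: requires the third-party HdfsCLI package (pip install hdfs) and a
# reachable NameNode; the URL, user and paths below are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.org:50070", user="hadoop")

# Write: the client obtains a lease and a DataNode pipeline from the NameNode,
# then streams the data to the first DataNode in the pipeline.
with client.write("/data/sensor/day1.csv", overwrite=True) as writer:
    writer.write(b"timestamp,value\n1,42\n")

# Read: the client asks the NameNode for block locations and then fetches the
# blocks directly from the closest DataNodes.
with client.read("/data/sensor/day1.csv") as reader:
    content = reader.read()
print(content.decode())
```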
Pros and Cons

Pros
● High availability, scalability, simple architecture, etc.
● Rack awareness (improves performance and reduces network bandwidth usage)
● No lease mechanism for reads
● The client attempts to read blocks from the closest DataNode
● Supports MapReduce
● Checkpointing support

Cons
● Newly written data may not be visible to readers until the file is closed
● The NameNode is a single point of failure; the Backup Node cannot automatically handle the failover event
● Metadata kept in the NameNode's memory limits the scalability of the system
● No support for random writes to files; only appends
GlusterFS
ESRs: ESR09
Key contact: ESR09
Main area: Cloud
Overview
GlusterFS [GlusterFS13] is a scalable network file system consisting of client and server components. Using common off-the-shelf hardware, one can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The GlusterFS client process creates composite virtual volumes from multiple remote servers using stackable translators [webciteglustr1].

Unlike other traditional storage solutions, GlusterFS does not need a metadata server, and locates files algorithmically using an elastic hashing algorithm.
This no-metadata-server architecture ensures better performance, linear scalability, and reliability [webciteglustr2].

Figure 4 - GlusterFS overview [GlusterFS13]
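The idea of locating files algorithmically rather than through a metadata server can be illustrated with a small hash-based placement function. The sketch below is a strong simplification of GlusterFS's elastic hashing (it ignores volume layouts, rebalancing and per-directory ranges, and the brick names are invented); it only shows why no central lookup is needed: every client computes the same brick from the file path.

```python
# Simplified illustration of hash-based file placement (not the actual GlusterFS algorithm).
import hashlib

BRICKS = ["server1:/export/brick0", "server2:/export/brick1",
          "server3:/export/brick2", "server4:/export/brick3"]

def locate(path: str) -> str:
    """Every client derives the owning brick from the file path alone."""
    digest = hashlib.sha1(path.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(BRICKS)
    return BRICKS[index]

# Same answer on every client, with no metadata server lookup involved.
print(locate("/photos/2016/trip.jpg"))
```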
For mutability and adaptability depending on the workload, GlusterFS defines multiple types of volumes:

● Distributed volume: files are distributed across the various bricks in the volume.
● Replicated volume: exact copies of the data are maintained on all bricks.
● Distributed replicated volume: files are distributed across replicated sets of bricks.
● Striped volume: a large file is divided into smaller chunks (equal to the number of bricks in the volume) and each chunk is stored in a brick.
● Distributed striped volume: similar to a striped volume, except that the stripes can now be distributed across a greater number of bricks.
Pros and Cons

Pros
● Industry-standard access protocols, including NFS and SMB
● Complete POSIX compliance
● Scalable thanks to elastic hashing
● File-level asynchronous geo-replication
● Modular, stackable design adaptable to a wide range of workloads
● Well-integrated ecosystem (including VMworld)
● Mature in production

Cons
● File access only (block and object interfaces must be built on top of it)
● Network-intensive design; highly dependent on network properties (failures, latency, throughput)
● Data reconstruction is offloaded to the client, rendering it inherently unusable for cloud applications
CephFS
ESRs: ESR08, ESR03
Key contact: ESR08
Main area: HPC/Cloud
Overview
The CephFS [websiteceph2] distributed file system is a POSIX-compliant file system (supporting all Unix and Linux applications) that uses a Ceph storage cluster for storing data [weil3]. Similarly to other file systems, CephFS relies on a metadata server daemon (MDS). CephFS is actually one component of the Ceph software suite, which also offers block and object storage backends [donvito]. CephFS is the last part of the software suite and is under heavy development. Eventually, the integration of a file system into the Ceph world will give users the opportunity to have object, block and file storage capabilities in a single system. This also provides users with a convenient way to bring legacy applications to the Ceph/OpenStack environment [websiteceph1]. Figure 5 gives an overview of the Ceph ecosystem.

Figure 5 - Relationship between Ceph components [websiteceph]

CephFS runs on top of the object storage system, building a cache-coherent file system on top of RADOS (see the Blob Systems section) [Weil1]. Therefore, it inherits the resilience and scalability of RADOS. Multiple metadata daemons handle the sharded metadata dynamically. Subtree snapshots and recursive statistics are also included. There is no restriction on file count or file system size. A "consistent caching" feature means that clients can cache data; their caches are coherent and the MDS invalidates the data that changed. This invalidation ensures that clients never see stale data. Finally, disaster recovery tools can undertake the operation of rebuilding the file system namespace from scratch, in the extreme case that RADOS loses it or a corruption occurs.

As of November 2016, Red Hat states that CephFS is included in the Ceph Community Edition, with two main features of limited functionality: snapshots are deactivated, and multiple MDSs are
not recommended (only one MDS should be functional). The current CephFS version is mainly used as a tech preview.
Pros and Cons

Pros
● POSIX interface
● Cloud-storage ready
● Scalable, as all clients read/write on multiple OSD nodes [wang]
● Shared, as many clients can work on it simultaneously

Cons
● Previous reliability issues have been solved, but real-world use cases are few and far between at the moment [dutchCSI]
● Snapshots are still in an experimental phase; usage of this feature could cause client nodes or the MDS to terminate unexpectedly
● Only one CephFS per cluster is allowed
Relevance w.r.t. WP1 use-cases

All use cases defined in the WP1 deliverable [D1.1] will most likely make use of either a parallel or a distributed file system as the main storage backbone. The advantages in reliability, performance and scalability are necessary for scientific projects of such magnitude.

Climate science is a typical example of a demanding compute environment with high storage throughput and capacity requirements. Large amounts of both input and output data need to be stored for a long time or, in many cases, indefinitely. The simulations of climate models are pushing the limits of computation capabilities and take a significant amount of time to complete even on HPC systems. This requires frequent checkpointing to safeguard the state of the simulation and, as a result, high write throughput is required to avoid I/O becoming the bottleneck of the process. As defined in the WP1 deliverable, the following requirements can be successfully met by the Lustre parallel file system in its current state:

● CLM1: Parallel I/O
● CLM2: Parallel distributed file systems exploitation
● CLM3: Long-term archiving
● CLM4: Exascale storage capacity

The Scientific Data Processor (SDP) component of the Square Kilometre Array will likely require dedicated HPC facilities to process, store and distribute the vast amounts of observation data that will be constantly produced by this instrument. While the exact architectural choices are still in the design phase, a set of requirements has been identified and presented in the WP1 deliverable. SKA is expected to push the limits of computation and storage, and future advancements in hardware and software have already been taken into account in the design process. It is likely that a custom solution will be tailored to the exact needs of SKA, making use of concepts from both parallel and distributed file systems.
Moreover, one can envision that the SKA will rely on multiple tiers of higher- to lower-performance storage systems, making hierarchical storage management necessary. This capability is currently offered by Lustre; however, the complexity of SKA will require further optimizations in the areas of caching as well as incorporating the usage of heterogeneous storage devices such as NVRAM and SSDs alongside traditional spinning disks.

Last but not least, we underline that the Human Brain Project has funded research into the evaluation of Ceph as a storage system [gudu]. The specific use cases that have been evaluated were data publication with archival, neuro-informatics data analytics, and content delivery networks.
Blob Systems

Blob Fundamentals
Over the years, file systems have proven their worth. Nevertheless, in the mid-2000s, developers were looking for solutions that would enable applications to interact directly with the storage backend (i.e., without going through the complex file system stack). Object storage systems have been proposed to fulfil this demand. Objects are collections of data identified by a unique id in a flat namespace. This abstract notion (i) decouples the storage from OS semantics, (ii) groups together otherwise unrelated datasets and (iii) enables better lifecycle management of a given dataset. However, because the number of objects can grow significantly and thus significantly increase the search time, storage solution designers proposed merging small datasets into a super-collection named a BLOB (Binary Large Object). Maintaining a small set of huge BLOBs comprising billions of small, KB-sized application-level objects is much more feasible than managing billions of small KB-sized objects directly. Although objects are now included in a single container (the BLOB), they are still independent entities. Therefore, a BLOB must provide the fundamentals for concurrent access and an adequate consistency model. If exploited efficiently, this feature introduces a very high degree of parallelism in the application.

Similarly to traditional hash tables, objects are identified using a unique id. Decoupling the identifier from the logical placement allows migration policies to be applied transparently to the end user. The identifier is generated using cryptographic hashes of the object's contents. Therefore, identical datasets produce the same id and different datasets produce different ids. This enables data deduplication and integrity checking.

BLOBs are used for keeping related information together. For instance, an MRI (Magnetic Resonance Imaging) image is grouped with the physician's recorded notes (in an MP3 file) along with the text file that contains the patient's history. In addition, BLOBs contain additional control properties for indexing or automated storage management (e.g., routing and retention policies). BLOBs can be staged from one medium to another based on their current usage. While traditionally data is staged based on aging, metadata can provide useful hints for proper and efficient placement.
In a conventional system, an entity has to go through all the stages in the following list. However, if the access pattern for a specific object is known, some of the stages can be skipped.

● HOT: durable, available and performant object storage for frequently accessed data
● WARM: lower cost, but less durability, for frequently accessed non-critical reproducible data
● COOL: storage class for data that is accessed less frequently but requires rapid access
● COLD: secure, durable, and low-cost storage service for data archiving

Developers access objects through a CRUD (create, read, update and delete) interface. To perform a change, the object must be fetched back, updated locally and then pushed back to the storage provider. Another drawback of this model is the concurrency issue: if two clients fetch the same object, the question arises of whose version will be regarded as the last written. This is, however, solved by content-generated ids, which will produce different ids for the two new objects. Eventually, this means that concurrency control degenerates into proper key management and garbage collection.
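A minimal content-addressed object store captures both points made above: identifiers derived from the data give deduplication for free, and an "update" is really a put of a new object followed by key management and garbage collection. The sketch below is a generic illustration, not the API of any particular BLOB system.

```python
# Generic sketch of a content-addressed object store with a CRUD-style interface.
import hashlib

class ObjectStore:
    def __init__(self):
        self.objects = {}   # id -> bytes

    def put(self, data: bytes) -> str:
        """Create: the id is a cryptographic hash of the content, so identical data
        maps to the same id (deduplication) and the id doubles as an integrity check."""
        oid = hashlib.sha256(data).hexdigest()
        self.objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self.objects[oid]

    def update(self, oid: str, new_data: bytes) -> str:
        """Update: fetch-modify-put produces a *new* id; concurrency control reduces
        to deciding which id the application keeps as 'latest'."""
        return self.put(new_data)

    def delete(self, oid: str):
        """Delete: ids no longer referenced are reclaimed (garbage collection)."""
        self.objects.pop(oid, None)

store = ObjectStore()
v1 = store.put(b"patient history v1")
v2 = store.update(v1, b"patient history v2")
assert v1 != v2 and store.get(v1) == b"patient history v1"
```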
Regarding the placement aspects, objects are in most cases stored on the nodes with the identifier closest to that of the object. To minimize the risk of loss, objects can be replicated. However, the same chunk of data produces the same identifier and would therefore be placed on the same node. To work around this, a separation between signatures and identifiers is necessary. With that, object replicas may have the same signature but be located in different failure domains. To minimize the storage footprint of replication without compromising the reliability level, erasure codes can be used. Furthermore, object immutability enables system versioning, which can be used to provide restore points.
In the following paragraphs, we present three BLOB systems, namely BlobSeer [Blobseer11], Microsoft Azure [azure1] and RADOS [weil1]. The first one is an academic solution while the two others are industrial proposals.
BlobSeer
ESRs: ESR13
Key contact: ESR13
Main area: HPC
Overview
BlobSeer [Blobseer10, Blobseer11] is a large-scale distributed storage service that addresses advanced data management requirements resulting from ever-increasing data sizes. It is built on the idea of leveraging versioning for concurrent manipulation of binary large objects in order to efficiently exploit data-level parallelism and sustain a high throughput despite massively parallel data access.
BlobSeer supports binary large objects (BLOBs) that reach the order of terabytes while allowing fine-grained random write access. BlobSeer behaves very well under high concurrency thanks to its multi-version concurrency control.

A client of BlobSeer manipulates BLOBs using a simple interface that allows users to: create a new empty BLOB; append data to an existing BLOB; and read/write a subsequence of bytes specified by an offset and a size from/to an existing BLOB. Figure 6 presents an architecture overview of the system.

Figure 6 - BlobSeer's architecture [Blobseer10]
In the following, we present the major concepts of the BlobSeer solution.

● Versioning access interface to BLOBs: versioning is explicitly managed by the client. Each time a write or append is performed on a BLOB, a new snapshot reflecting the changes is generated instead of overwriting any existing data. This new snapshot is labelled with an incremental version number, so that all past versions of the BLOB can potentially be accessed, at least as long as they have not been deleted for the sake of storage space.
● Data striping: each BLOB is made up of blocks of a fixed size. The size of these blocks is set to the size of the data piece a Map/Reduce worker is supposed to process. These blocks are distributed among the storage nodes.
● Distributed metadata: metadata is organized as a distributed segment tree. A segment tree is a binary tree in which each node is associated with a range of the BLOB, delimited by an offset and a size. To favour efficient concurrent access to metadata, tree nodes are distributed: they are stored on the metadata providers using a DHT (Distributed Hash Table). Each tree node is identified in the DHT by its version and by the range specified through the offset and the size it covers.
Decentralizing the metadata servers avoids the bottleneck created by concurrent accesses to a centralized metadata server, as found in most distributed file systems, including HDFS.

● Concurrency control: BlobSeer relies on a versioning-based concurrency control algorithm that maximizes the number of operations performed in parallel in the system. This is done by avoiding synchronization as much as possible, both at the data and metadata levels. The key idea is that no existing data or metadata is ever modified. First, any writer or "appender" writes its new data blocks, storing the differential patch. Then, in a second phase, the version number is allocated and the new metadata referring to these blocks is generated. The first phase consists in actually writing the new data on the data providers in a distributed fashion. Since only the difference is stored, each writer can send its data to the corresponding data providers independently of other writers. As no synchronization is necessary, this step can be performed in a fully parallel fashion. In the second phase, the writer asks the version manager to be assigned a version number and then generates the corresponding metadata. This new metadata describes the blocks of the difference and is "weaved" together with the metadata of lower versions, in such a way as to offer the illusion of a fully independent snapshot.
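The two-phase, versioning-based write described above can be condensed into a toy model: writers first store their new chunks without coordinating with each other, then obtain a version number and publish metadata that weaves the new chunks with those of the previous snapshot. The sketch below illustrates the principle only; it does not reflect BlobSeer's actual segment-tree metadata or its API.

```python
# Toy model of versioning-based concurrency control (illustrative, not BlobSeer's API).
import itertools
import threading

class VersionedBlob:
    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}          # (version, chunk_index) -> bytes, never overwritten
        self.snapshots = {0: {}}  # version -> {chunk_index: version owning that chunk}
        self._version_counter = itertools.count(1)
        self._publish_lock = threading.Lock()   # only metadata publication is serialized

    def write(self, offset: int, data: bytes) -> int:
        # Phase 1: store the differential patch; no synchronization with other writers.
        first = offset // self.chunk_size
        patch = {first + i: data[i * self.chunk_size:(i + 1) * self.chunk_size]
                 for i in range((len(data) + self.chunk_size - 1) // self.chunk_size)}
        # Phase 2: obtain a version number and weave new metadata with the previous snapshot.
        with self._publish_lock:
            version = next(self._version_counter)
            for idx, chunk in patch.items():
                self.chunks[(version, idx)] = chunk
            snapshot = dict(self.snapshots[version - 1])
            snapshot.update({idx: version for idx in patch})
            self.snapshots[version] = snapshot
        return version

    def read(self, version: int) -> bytes:
        snap = self.snapshots[version]
        return b"".join(self.chunks[(snap[i], i)] for i in sorted(snap))

blob = VersionedBlob()
v1 = blob.write(0, b"AAAABBBB")
v2 = blob.write(4, b"CCCC")          # past snapshot v1 remains readable
assert blob.read(v1) == b"AAAABBBB" and blob.read(v2) == b"AAAACCCC"
```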
Pros/Cons
Pros
● Fine-grained write access
● Good horizontal scalability
● Applicable to Hadoop Map/Reduce applications (it comes with an HDFS compatibility layer)

Cons
● The version manager is a single point of failure and a hot spot (it is on the critical path for both reads and writes)
● Metadata distribution in a distributed tree may cause high latency as the size of the tree grows
● No eviction of old versions, causing performance to decrease over time
Azure Storage

ESRs involved: ESR03
Key contact: ESR03
Main area: Cloud
Overview
Azure Storage Blobs [azure1] is a proprietary, cloud-based storage system developed by Microsoft and exclusively accessible from the Microsoft Azure cloud. Azure Blobs provide the user with a simple API. Every blob is organized into a container.
Containers also provide a useful way to assign security policies to groups of objects. A storage account can contain any number of containers, and a container can contain any number of BLOBs, up to a limit of 500 TB. Blob storage offers three types of BLOBs: block BLOBs, append BLOBs, and page BLOBs (disks).

● Block BLOBs are optimized for streaming and storing cloud objects, and are a good choice for storing documents, media files, backups, etc.
● Append BLOBs are similar to block BLOBs, but are optimized for append operations. An append BLOB can be updated only by adding a new block to its end. Append BLOBs are a good choice for scenarios such as logging, where new data needs to be written only to the end of the BLOB.
● Page BLOBs are optimized for representing IaaS disks and supporting random writes, and may be up to 1 TB in size. Page BLOBs are used, for instance, for saving the network-attached IaaS disks of Azure virtual machines.
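For illustration, the three blob types can be exercised from Python with Microsoft's azure-storage-blob package; the sketch below uses the newer v12-style client classes rather than the API available at the time of this report, and the connection string, container and blob names are placeholders, so treat it as a hedged example only.

```python
# Hedged sketch using the azure-storage-blob Python package (v12-style API);
# connection string, container and blob names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
container = service.get_container_client("my-container")

# Block blob: good for documents, media files, backups.
container.upload_blob(name="report.pdf", data=b"%PDF-...", overwrite=True)

# Append blob: new data can only be added at the end, e.g. for logs.
log_blob = container.get_blob_client("app.log")
log_blob.create_append_blob()
log_blob.append_block(b"2016-12-15 10:00:00 service started\n")

# Page blob: fixed-size pages supporting random writes, used for IaaS disks.
disk_blob = container.get_blob_client("vm-disk.vhd")
disk_blob.create_page_blob(size=1024 * 1024)              # size is a multiple of 512 bytes
disk_blob.upload_page(b"\x00" * 512, offset=0, length=512)
```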
Because it is a closed-source solution, it is difficult to identify and analyse the internal mechanisms of the system.
Pros/Cons
Pros
● Cloud-based, no setup needed
● Simple API, many client libraries available
● Interesting performance

Cons
● Proprietary license
● Opaque internal architecture
● Few fine-tuning options
● Limited support for random writes: not supported in append or block BLOBs, and writes must be aligned to the page size in page BLOBs
● Low capacity limits
RADOS
ESRs involved: ESR08
Key contact: ESR08
Main area: HPC/Cloud
Overview
RADOS [weil1] is a Reliable Autonomic Distributed Object Store, which provides an extremely scalable storage service for variably sized objects. The underlying storage abstraction provided by RADOS is relatively simple:
● The unit of storage is an object. Each object has a name (currently a fixed-size 20-byte identifier, though that may change), some number of named attributes (i.e., xattrs), and a variable-sized data payload (like a file).
● Objects are stored in object pools. Each pool has a name (e.g., "foo") and forms a distinct object namespace. Each pool also has a few parameters that define how the objects are stored, namely a replication level (2x, 3x, etc.) and a mapping rule describing how replicas should be distributed across the storage cluster (e.g., each replica in a separate rack).
● The storage cluster consists of some (potentially large) number of storage servers, or OSDs (object storage daemons/devices), and the combined cluster can store any number of pools.

A key design feature of RADOS is that the OSDs are able to operate with relative autonomy when it comes to recovering from failures or migrating data in response to cluster expansion. By minimizing the role of the central cluster coordinator (actually a small Paxos cluster managing key cluster state), the overall system is in theory scalable. A small system of a few nodes can seamlessly grow to hundreds or thousands of nodes (or contract again) as needed.
Data in RADOS is distributed according to an algorithm called CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data [weil2]. The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With a computation-based method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
Figure 7 - Data placement in RADOS with the CRUSH algorithm; objects are eventually stored on the OSDs.
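Ceph ships Python bindings (the rados module) that expose this object interface directly. The sketch below assumes a reachable cluster, a configuration file at the usual location and an existing pool named "mypool"; these are illustrative assumptions rather than a recipe from the RADOS documentation.

```python
# Hedged sketch using Ceph's Python rados bindings; the cluster config path and
# pool name are illustrative assumptions.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")              # objects live in a pool
    try:
        # The client computes the object's placement with CRUSH and talks to the
        # responsible OSDs directly; no central broker sits on the data path.
        ioctx.write_full("sensor-42", b"temperature=21.5")
        ioctx.set_xattr("sensor-42", "unit", b"celsius")   # named attributes (xattrs)
        print(ioctx.read("sensor-42"))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```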
Pros/Cons
Pros
● Interfaces for Swift and S3, and the Ceph Block Device
● Erasure code algorithm [miyamae]
● Single storage layer and flexibility with CRUSH
● Unlimited scalability (in theory)

Cons
● There is no interface for RBD on Windows
● The hypervisor implementation needs knowledge about Ceph; at the moment, only KVM has native support
● The use of Paxos can become an issue for very large-scale systems where the number of Monitors becomes too large
Relevance w.r.t. WP1 use-cases

Objects are powerful abstractions, especially for combining and tagging information from different sources and for handling the lifecycle and placement of related non-uniform datasets. The notion of dynamic dataset links can be directly exploited, on the one hand, by the Human Brain Project, in particular the HBP2 "Spatial join based on non-uniform dataset densities" requirement, and on the other hand by smart-city technologies, such as the SCT4 "Batch processing and learning from data" requirement.

Apart from being an entity for dynamic linking, objects can be regarded as individuals without external dependencies. Although parallel and distributed file systems look to be the right choice, Blob systems can be relevant as a backend for the Climate Science use case, in particular for the CLM1 "Parallel I/O" as well as the CLM2 "Parallel distributed file systems exploitation" requirements. Indeed, Blob systems such as BlobSeer propose more or less the same file API as distributed file systems. The operations possible on any given file are roughly the same as the operations possible on BLOBs; both store unstructured opaque byte arrays. Climate applications strongly rely on HDF5, which uses MPI-IO under the hood. MPI-IO relies on only some POSIX features, which are almost all provided by Blob storage systems (MPI-IO does not need hierarchies, permissions or any other feature of a DFS). Considering this, and the fact that most Blob storage systems are aggressively optimized for speed and parallel access, it is natural to consider Blob systems as a drop-in replacement for a DFS in the case of applications such as the climate ones.
Finally, the nature of objects can serve the SKA use case, especially the SKA9 "Specific data structures for massively parallel accesses" requirement.
Key/Value Store Systems

KVS Fundamentals
A key/value store (KVS) [kvs1], or key/value database, is a simple database that uses an associative array (similar to a map or a dictionary) as the fundamental data model, where each key is associated with one and only one value in a collection. This relationship is referred to as a key/value pair.

In each key/value pair, the key is an arbitrary string such as a filename, URI or hash. The value can be any kind of data, such as an image, a user preference file or a document. The underlying data model is similar to that of BLOB storage systems in several ways: the value is stored as an opaque byte array requiring no upfront data modelling or schema definition. This data model obviates the need to index the data to improve performance. However, one cannot filter or control what is returned from a request based on the value, because the value is opaque. In general, key/value stores have no query language. They provide a way to store, retrieve and update data using simple get, put and delete commands; the path to retrieve data is a direct request to the object in memory or on disk. The simplicity of this model makes key/value stores fast, easy to use, scalable, portable and flexible.

Key/value stores generally scale out by implementing partitioning (storing data on more than one node), replication and auto-recovery. They can scale up by maintaining the database in RAM and can minimize the cost of ACID guarantees (a guarantee that committed transactions persist somewhere) by avoiding locks and latches and by using low-overhead server calls. Because each individual store in the cluster operates almost independently, a KVS cluster can offer high throughput and capacity, as demonstrated by large-scale deployments; e.g., Facebook operates a Memcache KVS cluster serving over a billion requests per second for trillions of items. Key/value stores handle size well and are good at processing a constant stream of read/write operations with low latency.
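The whole data model can be summarized in a few lines: opaque values addressed by keys through get/put/delete, with scale-out obtained by hashing keys onto partitions. The sketch below is a generic illustration of these fundamentals, not the interface of any specific store presented later.

```python
# Generic illustration of the key/value data model with hash partitioning.
import hashlib

class KeyValueStore:
    def __init__(self, num_partitions=4):
        # Each partition would live on a different node in a real deployment.
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition(self, key: str) -> dict:
        digest = hashlib.md5(key.encode()).digest()
        return self.partitions[int.from_bytes(digest[:4], "big") % len(self.partitions)]

    def put(self, key: str, value: bytes):
        self._partition(key)[key] = value          # the value is an opaque byte array

    def get(self, key: str) -> bytes:
        return self._partition(key)[key]

    def delete(self, key: str):
        self._partition(key).pop(key, None)

kvs = KeyValueStore()
kvs.put("user:42:avatar", b"\x89PNG...")
print(kvs.get("user:42:avatar"))
kvs.delete("user:42:avatar")
```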
Key/value stores differ in their implementations: some support ordering of keys, like Berkeley DB [bdb] and Memcache [memcache]; some maintain data in memory, like Redis [redis]; and some, like Aerospike [aerospike], are built natively to support both RAM and solid-state drives (SSDs). Others, like Couchbase Server [couchbase], store data in RAM but also support rotating disks.

In the following, we present the HBase [hbase1-6], Cassandra [cassandra1], and RAMCloud [ramcloud1-6] solutions.
HBase
ESRs involved: ESR07
Key contact: ESR07
Main area: Cloud
Overview
HBase [hbase1-6] is an open-source, distributed, versioned, non-relational database modelled after Google's Bigtable [BigTable08] and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (see the Parallel/Distributed File Systems section), providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
Figure 8 - HBase architecture [hbase.www]
Physically, HBase is composed of three types of servers in a master/slave type of architecture, outlined in Figure 8. Region servers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. Region assignment and DDL operations (create, delete tables) are handled by the HBase Master process. Zookeeper, which is part of HDFS, maintains a live cluster state.

The Hadoop DataNode stores the data that the RegionServer is managing. All HBase data is stored in HDFS files. RegionServers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the RegionServers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.
The NameNode maintains metadata information for all the physical data blocks that comprise the files.
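From an application's point of view the interface is a sparse, column-oriented table addressed by row key. The sketch below uses the third-party happybase client (which talks to HBase through its Thrift gateway) and assumes a reachable Thrift server plus a pre-created "events" table with a "cf" column family; all of these names are placeholders.

```python
# Hedged sketch using the third-party happybase client (HBase Thrift gateway);
# host, table name and column family are illustrative assumptions.
import happybase

connection = happybase.Connection("hbase-thrift.example.org")
table = connection.table("events")

# Rows are sparse: only the cells that exist are stored, addressed by row key
# and "family:qualifier" column names.
table.put(b"device42#20161215", {b"cf:temperature": b"21.5", b"cf:status": b"ok"})

row = table.row(b"device42#20161215")
print(row[b"cf:temperature"])

# Range scans over row keys are the main read pattern besides point lookups.
for key, data in table.scan(row_prefix=b"device42#"):
    print(key, data)
```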
Pros and Cons

Pros
● Real-time, random big data access through Hadoop integration
● Column-oriented data model for big sparse tables
● Row-level atomic operation support
● High scalability
● Auto failover
● Simple client interface

Cons
● Single point of failure (SPOF)
● No transaction support
● No joins provided natively (MapReduce usage is necessary)
● Index only on the key
● No authentication support
Cassandra
ESRs involved: ESR03
Key contact: ESR03
Main area: Cloud
Overview
Cassandra [cassandra1] is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. Cassandra addresses the problem of failures by employing a peer-to-peer distributed system across homogeneous nodes, where data is distributed among all nodes in the cluster. Data distribution is done deterministically using a distributed hash table, as depicted in Figure 9.
Figure 9 - Cassandra ring data distribution

Each node exchanges information across the cluster periodically. A sequentially written commit log on each node captures write activity to ensure data durability. Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache. Once the memory structure is full, the data is written to disk in an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Using a process called compaction, Cassandra periodically consolidates SSTables, discarding obsolete data and tombstones (markers indicating that data was deleted).
Cassandra is a row-oriented database. Cassandra's architecture allows any authorized user to connect to any node in any datacenter and access data using the CQL language (Cassandra Query Language). For ease of use, CQL uses a syntax similar to SQL. From the CQL perspective, the database consists of tables. Typically, a cluster has one keyspace per application. Developers can access CQL through cqlsh as well as via drivers for application languages.

Read and write requests can be sent to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.
Pros and Cons

Pros
● Easy data distribution
● Continuous uptime and high fault tolerance
● Transactional capabilities
● High scalability (linear scale performance)

Cons
● Slow transaction processing
● Low secondary index performance
RAMCloud
ESRs involved: ESR13
Key contact: ESR13
Main area: Cloud
Overview
RAMCloud [ramcloud1-6] is an in-memory key/value store. Its main claims are low latency, durability (in the sense of ACID), fast crash recovery, efficient memory usage, and strong consistency. By leveraging Infiniband-like networks, read operations can be completed in a few microseconds.
A RAMCloud cluster consists of three entities: a coordinator maintaining metadata information about storage servers, backup servers, and data location; a set of storage servers that expose their DRAM as storage space; and backups that store replicas of data in their DRAM temporarily and spill it to disk asynchronously. Usually, storage servers and backups are collocated within the same physical machine.

Data in RAMCloud is stored in a set of tables. Each table can span multiple storage servers. A server uses an append-only log-structured memory to store its data and a hash table to index it. The log-structured memory of each server is divided into 8 MB segments. A server stores data in an append-only fashion. Thus, to free unused space, a cleaning mechanism is triggered whenever a server reaches a certain memory utilization threshold. The cleaner copies a segment's live data into the free space (still available in DRAM) and removes the old segment.

Durability and availability are guaranteed by replicating data to remote disks. More precisely, whenever a storage server receives a write request, it appends the object to its latest free segment and forwards a replication request to the backup servers randomly chosen for that segment. The server waits for acknowledgements from all backups before answering a client's update/insert request. Backup servers keep a copy of this segment in DRAM until it fills; only then do they flush the segment to disk and remove it from DRAM.

For each new segment, a random backup in the cluster is chosen, in order to have as many machines as possible performing crash recovery. At run time, each server computes a "will" that specifies how its data will be partitioned if it crashes. If a crashed server is detected, each server replays the segments that were assigned to it according to the crashed server's will.
As the segments are written to a server's memory, they are replicated to new backups. At the end of the recovery, the segments are cleaned from the old backups.
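The core of the write path (append into the latest segment of a log-structured memory, index the object in a hash table, replicate the segment to backups) can be modeled in a few lines. The sketch below is a simplified illustration of those mechanisms, not RAMCloud code, and leaves out cleaning, wills and recovery.

```python
# Simplified illustration of RAMCloud-style log-structured memory (not RAMCloud code).
SEGMENT_SIZE = 8 * 1024 * 1024   # 8 MB segments, as described above

class Backup:
    def __init__(self):
        self.buffered = {}     # kept in DRAM until the segment fills, then spilled to disk
    def replicate(self, segment_id, data):
        self.buffered.setdefault(segment_id, bytearray()).extend(data)

class StorageServer:
    def __init__(self, backups):
        self.segments = [bytearray()]      # append-only log, split into segments
        self.hash_table = {}               # key -> (segment index, offset, length)
        self.backups = backups             # backup servers chosen per segment

    def write(self, key: str, value: bytes):
        seg = self.segments[-1]
        if len(seg) + len(value) > SEGMENT_SIZE:   # seal the segment, open a new one
            self.segments.append(bytearray())
            seg = self.segments[-1]
        offset = len(seg)
        seg.extend(value)                          # append-only: nothing is overwritten
        self.hash_table[key] = (len(self.segments) - 1, offset, len(value))
        # Replication: the client reply would only be sent after all backups acknowledged.
        for backup in self.backups:
            backup.replicate(len(self.segments) - 1, value)

    def read(self, key: str) -> bytes:
        seg_idx, offset, length = self.hash_table[key]
        return bytes(self.segments[seg_idx][offset:offset + length])

server = StorageServer(backups=[Backup(), Backup(), Backup()])
server.write("user:1", b"alice")
print(server.read("user:1"))
```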
Pros and Cons

Pros
● Very low latency
● Strong consistency
● Fast crash recovery
● Efficient memory utilization

Cons
● High CPU usage: one core is dedicated to polling incoming requests in order to achieve low latency
● No "smart" mechanism to distribute data across the cluster
● Needs a high-performance network to really take advantage of the low latency
Relevance w.r.t. the WP1 use-cases

Key-Value Stores are ideal for cases that require data distribution, high scalability and/or a flexible schema. This makes them suitable for:

● The Science Data Processor (SDP) of the SKA project, to temporarily store raw data received from the telescopes. Key-Value Stores are well suited for this case because the data is highly distributable and can be processed in parallel. Moreover, the Key-Value storage mechanism is also suitable for the type of data processed by the project. The following requirements from the SKA project are met by Key-Value Storage systems:
  ○ HBP1: Spatial Data Management Technique
  ○ HBP3: Multi-format ingestion
  ○ HBP5: Sharing
● The storage of individual events in the context of Smart Cities' applications, leveraging at the same time low-latency reads and writes, good scalability properties and low metadata overhead. The following requirements from the Smart Cities use-case are met by Key-Value Storage systems:
  ○ SCT5: Storage in real time
  ○ SCT6: Replicated storage system
  ○ SCT10: Scalable system
Relevance w.r.t. the WP1 use cases - Overview

The table below provides a synthetic view of the relevance of the different systems with respect to the WP1 use-cases that have been identified.

System | HBP | SKA | CLM | SCT
Lustre | HBP5 | SKA2, SKA5, SKA6, SKA9 | CLM1, CLM2, CLM4, CLM5, CLM6, CLM8, CLM9 | SCT3, SCT5, SCT10
OrangeFS | HBP5 | SKA2, SKA5, SKA6, SKA9 | CLM1, CLM2, CLM4, CLM5, CLM6, CLM8, CLM9 | SCT3, SCT5, SCT10
HDFS | HBP1, HBP2 | SKA2, SKA5, SKA7, SKA8 | CLM1, CLM2, CLM3, CLM4 | SCT2, SCT6, SCT8, SCT9
GlusterFS | HBP1 | SKA1, SKA5, SKA8 | CLM1, CLM2 | SCT2, SCT5, SCT6, SCT8, SCT9, SCT10
Ceph | HBP1 | SKA1, SKA5, SKA8 | CLM1, CLM2 | SCT2, SCT5, SCT6, SCT8, SCT9, SCT10
BlobSeer | HBP2 | SKA9 | CLM1, CLM2 | SCT4
Azure | HBP2 | SKA9 | CLM1 | SCT4
Rados | HBP5, HBP8 | SKA3, SKA7, SKA9 | CLM1, CLM3, CLM5, CLM9 | SCT6, SCT7, SCT10
HBase | HBP1, HBP3 | - | CLM1, CLM4 | SCT5, SCT6, SCT10
Cassandra | HBP1, HBP5 | - | CLM1, CLM3, CLM4 | SCT5, SCT6, SCT10
RAMCloud | HBP7 | SKA8, SKA9 | CLM1 | SCT1, SCT6, SCT10
References
[aerospike] http://www.aerospike.com/
[azure1] Understanding Block Blobs, Append Blobs, and Page Blobs – https://docs.microsoft.com/en-us/rest/api/storageservices/fileservices/Understanding-Block-Blobs--Append-Blobs--and-Page-Blobs?redirectedfrom=MSDN

[BentJ] Bent, John, et al. "PLFS: A Checkpoint Filesystem for Parallel Applications."

[bdb] Michael A. Olson, Keith Bostic, and Margo Seltzer. 1999. Berkeley DB. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC '99). USENIX Association, Berkeley, CA, USA, 43-43.

[BigTable08] Chang, Fay, et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.

[Blobseer10] Bogdan Nicolae, Diana Moise, Gabriel Antoniu, Luc Bougé, Matthieu Dorier. BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map-Reduce Applications. 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), Apr 2010, Atlanta, United States.

[Blobseer11] Bogdan Nicolae, Gabriel Antoniu, Luc Bougé, Diana Moise, and Alexandra Carpen-Amarie. 2011. BlobSeer: Next-generation data management for large scale infrastructures. J. Parallel Distrib. Comput. 71, 2 (February 2011), 169-184.

[bmi] P. Carns, W. Ligon, R. Ross and P. Wyckoff, "BMI: a network abstraction layer for parallel I/O," 19th IEEE International Parallel and Distributed Processing Symposium, 2005, 8 pp. doi:10.1109/IPDPS.2005.128

[cassandra1] What is Apache Cassandra? – http://www.planetcassandra.org/what-is-apache-cassandra/

[cassandra2] Rabl, Tilmann, et al. Solving Big Data Challenges for Enterprise Application Performance Management. VLDB (2012).

[couchbase] John Zablocki. 2015. Couchbase Essentials. Packt Publishing.

[D1.1] D1.1 - Use case requirements – BigStorage Deliverable

[dell.www] Lustre – Dell Presentation – http://www.slideshare.net/undiez/dell-lustre-storage-architecture-presentation-mbug-2016

[Departon01] Benjamin Depardon, Gael Le Mahec, Cyril Séguin. Analysis of Six Distributed File Systems. [Research Report] 2013, pp. 44. <hal-00789086>
[donvito] Donvito, G., Marzulli, G., & Diacono, D. (2014). Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiments analysis. In Journal of Physics: Conference Series (Vol. 513, No. 4, p. 042014). IOP Publishing.

[dutchCSI] Dutch Forensic Institute: CephFS in production

[FeitelsonD] Feitelson, Dror G., et al. "Parallel I/O systems and interfaces for parallel computers." Multiprocessor Systems - Design and Integration. World Scientific (1995).

[GlusterFS13] Alex Davies and Alessandro Orsaria. 2013. Scale out with GlusterFS. Linux J. 2013, 235.

[GoogleFS03] Ghemawat, S., Gobioff, H., & Leung, S. T. (2003, October). The Google file system. In ACM SIGOPS Operating Systems Review (Vol. 37, No. 5, pp. 29-43). ACM.

[gudu] Gudu, D., Hardt, M., & Streit, A. (2014). Evaluating the performance and scalability of the Ceph distributed storage system. 2014 IEEE International Conference on Big Data: 177-182.

[hbase1] George, Lars. HBase: The Definitive Guide. O'Reilly Media, Inc., 2011.

[hbase2] Dimiduk, Nick, Amandeep Khurana, and Mark Henry Ryan. HBase in Action. Manning, 2013.

[hbase4] White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, 2009.

[hbase5] Kruchten, Philippe B. "The 4+1 view model of architecture." Software, IEEE 12.6 (1995): 42-50.

[hbase6] Apache HBase, "The Apache HBase Reference Guide", apache.org, 2013

[hbase.www] An in-depth look at the HBase architecture – https://www.mapr.com/blog/in-depth-look-hbase-architecture

[HDFS08] Borthakur, D. (2008). HDFS architecture guide. Hadoop Apache Project. http://hadoop.apache.org/common/docs/current/hdfsdesign.pdf, 39.

[HDFS10] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10). IEEE Computer Society, Washington, DC, USA, 1-10. DOI: http://dx.doi.org/10.1109/MSST.2010.5496972
[kvs1] Sheng Li, Hyeontaek Lim, Victor W. Lee, Jung Ho Ahn, Anuj Kalia, Michael Kaminsky, David G. Andersen, O. Seongil, Sukhan Lee, and Pradeep Dubey. 2015. Architecting to achieve a billion requests per second throughput on a single key-value store server platform. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 476-488.
[lustre1] A Best Practice Analysis of HDF5 and NetCDF-4 Using Lustre (Christopher Bartz, Konstantinos Chasapis, Michael Kuhn, Petra Nerge, Thomas Ludwig), In High Performance Computing, Lecture Notes in Computer Science (9137), pp. 274-281, (Editors: Julian Martin Kunkel, Thomas Ludwig), Springer International Publishing (Switzerland), ISC 2015, Frankfurt, Germany, ISBN: 978-3-319-20118-4, ISSN: 0302-9743, 2015-06

[lustre2] Dynamically Adaptable I/O Semantics for High Performance Computing (Michael Kuhn), In High Performance Computing, Lecture Notes in Computer Science (9137), pp. 240-256, (Editors: Julian Martin Kunkel, Thomas Ludwig), Springer International Publishing (Switzerland), ISC 2015, Frankfurt, Germany, ISBN: 978-3-319-20118-4, ISSN: 0302-9743, 2015-06

[lustre3] POSIX.1 FAQ. The Open Group. 5 October 2011, available online: http://www.opengroup.org/austin/papers/posix_faq.html

[lustre4] LUSTRE™ FILE SYSTEM: High-Performance Storage Architecture and Scalable Cluster File System, White Paper, December 2007, available online: http://www.csee.ogi.edu/~zak/cs506-pslc/lustrefilesystem.pdf

[lustre5] Data Compression for Climate Data (Michael Kuhn, Julian Kunkel, Thomas Ludwig), In Supercomputing Frontiers and Innovations, Series: Volume 3, Number 1, pp. 75-94, (Editors: Jack Dongarra, Vladimir Voevodin), 2016-06
[memcache] https://memcached.org/
[miyamae] Miyamae, T., Nakao, T., & Shiozawa, K. (2014). Erasure code with shingled local parity groups for efficient recovery from multiple disk failures. In 10th Workshop on Hot Topics in System Dependability (HotDep 14).

[orangefs1] OrangeFS – http://www.orangefs.org

[pvfs1] PVFS – http://www.pvfs.org
[ramcloud1] S. M. Rumble, A. Kejriwal, and J. Ousterhout, "Log-structured memory for DRAM-based storage," in Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14), Santa Clara, CA, 2014, pp. 1-16.

[ramcloud2] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum, "Fast crash recovery in RAMCloud," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, ser. SOSP '11, 2011, pp. 29-41.

[ramcloud3] J. Ousterhout, A. Gopalan, A. Gupta, A. Kejriwal, C. Lee, B. Montazeri, D. Ongaro, S. J. Park, H. Qin, M. Rosenblum, S. Rumble, R. Stutsman, and S. Yang, "The RAMCloud storage system," ACM Trans. Comput. Syst., vol. 33, no. 3, pp. 7:1-7:55, Aug. 2015.
[ramcloud4] Collin Lee, Seo Jin Park, Ankita Kejriwal, Satoshi Matsushita, and John Ousterhout. 2015. Implementing linearizability at large scale and low latency. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 71-86.

[ramcloud5] Ryan Stutsman, Collin Lee, and John Ousterhout. 2015. Experience with rules-based programming for distributed, concurrent, fault-tolerant code. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '15). USENIX Association, Berkeley, CA, USA, 17-30.

[ramcloud6] A. Kejriwal, A. Gopalan, A. Gupta, Z. Jia, S. Yang, and J. Ousterhout. SLIK: Scalable Low-Latency Indexes for a Key-Value Store. In USENIX Annual Technical Conference, Denver, CO, June 2016.
[redis] https://redis.io/
[Shvachko] Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (pp. 1-10). IEEE. (http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5496972)

[ska1] SKA-SDP Preliminary Architecture Document

[Subramanian] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. 2014. f4: Facebook's warm BLOB storage system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, Berkeley, CA, USA, 383-398.

[TH01] Thomas Hollstegge, Distributed file systems, Seminar thesis for the seminar Parallel Programming and Parallel Algorithms, available online: http://is.uni-muenster.de/pi/lehre/ws0910/pppa/papers/dfs.pdf

[wang] Wang, F., Nelson, M., Oral, S., Atchley, S., Weil, S., Settlemyer, B. W., ... & Hill, J. (2013, November). Performance and scalability evaluation of the Ceph parallel file system. In Proceedings of the 8th Parallel Data Storage Workshop (pp. 14-19). ACM.
[webciteblob1] http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf
[webciteblob2] https://en.wikipedia.org/wiki/Object_storage
[webciteblob3] http://arstechnica.com/information-technology/2013/10/seagate-introduces-a-new-drive-interface-ethernet/
[webciteblob4] https://azure.microsoft.com/en-us/documentation/articles/fundamentals-introduction-to-azure/
[webciteglustr1] http://moo.nac.uci.edu/~hjm/fs/An_Introduction_To_Gluster_ArchitectureV7_110708.pdf
[webciteglustr2] https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Red_Hat_Storage_Architecture_and_Concepts.html
[websiteceph1] http://thenewstack.io/converging-storage-cephfs-now-production-ready/
[websiteceph] http://docs.ceph.com/docs/master/cephfs/
[weil1] Weil, S. A., Leung, A. W., Brandt, S. A., & Maltzahn, C. (2007, November). RADOS: A scalable, reliable storage service for petabyte-scale storage clusters. In Proceedings of the 2nd International Workshop on Petascale Data Storage: Held in Conjunction with Supercomputing '07 (pp. 35-44). ACM.

[weil2] Weil, S. A., Brandt, S. A., Miller, E. L., & Maltzahn, C. (2006, November). CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (p. 122). ACM.

[weil3] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D., & Maltzahn, C. (2006, November). Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (pp. 307-320). USENIX Association.