STORAGE-BASED CONVERGENCE BETWEEN HPC AND CLOUD TO HANDLE BIG DATA

Deliverable number: D3.1
Deliverable title: Characterization of the existing cloud and HPC storage systems - Intermediate Report
WP3 HPC-CLOUD Convergence

Editor: Adrien Lebre (Inria)
Authors: Pierre Matri (UPM, ESR03), Fotios Papaodyssefs (Seagate, ESR06), Fotis Nikolaidis (CEA, ESR09), with the support of Linh Thuy Nguyen (Inria, ESR07), Athanasios Kiatipis (Fujitsu, ESR08), Mohammed-Yacine Taleb (Inria, ESR13), Yevhen Alforov (DKRZ, ESR14)

Grant Agreement number: 642963
Project ref. no: MSCA-ITN-2014-ETN-642963
Project acronym: BigStorage
Project full name: BigStorage: Storage-based convergence between HPC and Cloud to handle Big Data
Starting date (dur.): 1/1/2015 (48 months)
Ending date: 31/12/2018
Project website: http://www.bigstorage-project.eu
Coordinator: María S. Pérez
Address: Campus de Montegancedo sn, 28660 Boadilla del Monte, Madrid, Spain
Reply to: mperez@fi.upm.es
Phone: +34-91-336-7380
Executive Summary

This document provides an overview of the progress of the work done until M24 of the project BigStorage (from 01-01-2015 until 31-12-2016) in WP3 HPC-Cloud Convergence.

Although the storage mechanisms available in HPC and Cloud environments differ significantly in their design as well as in the services they deliver (files vs. key/value systems, hierarchical vs. flat namespaces, semantics), understanding the specific techniques used in both areas is a key element to propose new storage systems offering the best of both worlds. The activities conducted in WP3 focused on understanding the specific techniques used in the HPC and Cloud areas. This first step is a key element for our ESRs to propose new storage systems taking advantage of both areas.

The D3.1 deliverable presents an overview of the major systems that have been studied by the ESRs. We chose to classify them into three groups that correspond to the three storage backends that have been identified in the HPC and Cloud worlds:

● Distributed File Systems;
● Binary Large Objects;
● Key/Value store systems.

For each category, we first discuss the general concepts and then present major systems that implement those concepts. For each system, we underline the pros and cons and also indicate the ESRs that act as contact points for the others in order to get additional information if needed. Finally, we conclude the discussion of each category by presenting its relevance with respect to the requirements of the use cases that have been studied in WP1. The document ends with a table that gives an overview of the mapping between the features offered by each system and the identified requirements. We underline that the list of systems presented in this deliverable is in no way exhaustive and is mainly geared towards the systems that have been studied by the ESRs.

In addition to providing strong expertise to our ESRs, this study should enable them to propose innovative contributions: first at the application level, through new I/O middleware mechanisms that guide I/O systems so that they deliver the best performance they can achieve, and second at the distributed file systems level, in order to extend them with cloud capabilities such as elasticity and to revisit some of their internals in order to develop a unified architecture for HPC and Cloud storage backends.
Document Information

IST Project Number: MSCA-ITN-2014-ETN-642963
Acronym: BigStorage
Title: Storage-based convergence between HPC and Cloud to handle Big Data
Project URL: http://www.bigstorage-project.eu
Document URL: http://bigstorage-project.eu/index.php/deliverables
EU Project Officer: Mr. Szymon Sroda
Deliverable: D3.1 Intermediate Report on WP3
Work package: WP3 HPC-CLOUD Convergence
Date of Delivery: Planned: 31.12.2016; Actual: 20.12.2016
Status: Version 1.0 (final)
Nature: report
Dissemination level: consortium
Distribution List: Consortium Partners
Document Location: http://bigstorage-project.eu/index.php/deliverables
Responsible Editor: Adrien Lebre (Inria), adrien.lebre@inria.fr, Tel: +33 (0)2 51 85 82 43
Authors (Partner): Pierre Matri (UPM, ESR03), Fotios Papaodyssefs (Seagate, ESR06), Fotis Nikolaidis (CEA, ESR09)
Reviewers:
Abstract (for dissemination): see Executive Summary
Keywords: Storage Backend, HPC, Cloud, Parallel File Systems, Distributed File Systems, Binary Large Objects, Key/Value Stores

Version history:
0.1 - Initial template and structure - 08.10.2016 - Adrien Lebre, Inria
0.2 - Sections from all authors - 10.11.2016 - All authors
0.3 - Internal version with Editor's comments/amendments - 15.11.2016 - Adrien Lebre, Inria
0.4 - Revisions from all authors - 09.12.2016 - All authors
0.5 - Internal version for review by the consortium - 15.12.2016 - Adrien Lebre, Inria
0.6 - Internal version with Reviewer's comments - 18.12.2016 - Maria S. Perez, UPM
0.6 - Final version to commission - 31.12.2016 - Adrien Lebre, Inria
Project Consortium Information

Participants and contacts:

● Universidad Politécnica de Madrid (UPM), Spain: María S. Pérez, Email: mperez@fi.upm.es
● Barcelona Supercomputing Center (BSC), Spain: Toni Cortes, Email: toni.cortes@bsc.es
● Johannes Gutenberg University (JGU) Mainz, Germany: André Brinkmann, Email: brinkman@uni-mainz.de
● Inria, France: Gabriel Antoniu, Email: gabriel.antoniu@inria.fr; Adrien Lebre, Email: adrien.lebre@inria.fr
● Foundation for Research and Technology - Hellas (FORTH), Greece: Angelos Bilas, Email: bilas@ics.forth.gr
● Seagate, UK: Malcolm Muggeridge, Email: malcolm.muggeridge@seagate.com
● DKRZ, Germany: Thomas Ludwig, Email: ludwig@dkrz.de
● CA Technologies Development Spain (CA), Spain: Victor Muntes, Email: Victor.Muntes@ca.com
● CEA, France: Jacques-Charles Lafoucriere, Email: Charles.LAFOUCRIERE@CEA.FR
● Fujitsu Technology Solutions GMBH, Germany: Sepp Stieger, Email: sepp.stieger@ts.fujitsu.com
Preamble
File systems were invented in the 1960s to provide an interface for end users and applications to store data in a structured manner. Traditionally, files were organized in hierarchical structures ("trees") consisting of directories and files (a directory containing either sub-directories or files). The objective of a file system, then, is to manage the location of data according to a logical sequence and via an easily understood hierarchy of nested directories. Compounding the problem is that as the amount of data grows, so does the number of nested directories and files. The result is a set of large tree structures that makes it cumbersome and challenging to find any particular file, especially if the specific name, the creation date, or the type of the file is not known.

Times have changed, though, and today much of the data that is generated corresponds to unstructured data that does not require the same capabilities as the ones offered by traditional file systems. Consequently, the cost of maintaining a hierarchical structure is often unnecessary and the resulting overhead may hinder the performance of the system.

To overcome this limitation, a new generation of storage systems has been introduced. Among these, BLOB (Binary Large OBject) storage systems came to the surface. The namespace has changed from hierarchical to flat: instead of iterating through deep hierarchies to identify a file, files (i.e., objects in the BLOB terminology) are now directly accessible through a unique id. BLOBs are generally split into multiple small parts, or chunks, that are distributed over the storage cluster. Most of these systems allow modifications to be performed on the BLOBs, either by appending or by writing at random offsets. For applications manipulating smaller or immutable values, key/value systems have been introduced. By reducing the kind of data that can be manipulated, key/value systems can offer higher efficiency.

This document has been written so as to reflect these three system categories.
Table of contents

Parallel File Systems / Distributed File Systems
    Parallel/Distributed File Systems Fundamentals
    Lustre (Overview; Pros and Cons)
    PVFS/OrangeFS (Overview; Pros and Cons)
    HDFS (Overview; Pros and Cons)
    GlusterFS (Overview; Pros and Cons)
    CephFS (Overview; Pros and Cons)
    Relevance w.r.t. WP1 use-cases
Blob Systems
    Blob Fundamentals
    BlobSeer (Overview; Pros/Cons)
    Azure Storage (Overview; Pros/Cons)
    RADOS (Overview; Pros/Cons)
    Relevance w.r.t. WP1 use-cases
Key/Value Store Systems
    KVS Fundamentals
    HBase (Overview; Pros and Cons)
    Cassandra (Overview; Pros and Cons)
    RAMCloud (Overview; Pros and Cons)
    Relevance w.r.t. the WP1 use-cases
Relevance w.r.t. the WP1 use cases - Overview
References
Parallel File Systems / Distributed File Systems

Parallel/Distributed File Systems Fundamentals

Parallel, distributed or clustered file systems are terms used to describe the pooling of storage resources over a network interface. While not strictly defined, a distinction is usually drawn between parallel and distributed systems to refer to their low- and high-level architectural differences as well as the intended use-cases that each file system is geared towards.

Parallel file systems: With the advent of cluster computing, storage often became the bottleneck component of high-performance systems, especially when executing I/O-intensive parallel applications. The need for high-throughput concurrent reads and writes led to the development of parallel file systems. Parallel file systems achieve high throughput by striping files over multiple disks, allowing for collective read and write operations and therefore overcoming the limitations of a single device [FeitelsonD][BentJ]. In general, parallel file systems offer block-level access to the entire pool of storage media, providing a unified view at the lower level, and are geared towards high-performance environments located in the same physical space and connected through dedicated networks. Parallel file systems have been designed for handling structured data in a manner very similar to traditional file systems.

Distributed file systems: While parallel file systems were created out of the need for high-performance computations and the handling of structured scientific data, distributed systems evolved out of the necessity to handle the ever-increasing amounts of data generated by multiple and heterogeneous sources over the Internet. The elasticity offered by cloud computing required a similar paradigm in the storage layer. Unstructured and heterogeneous types of data also introduced new requirements for a system to be able to scale properly, while older requirements were no longer necessary and were impeding performance.

The following table summarizes the differences between parallel and distributed file systems.
                      Parallel                  Distributed
Focus area            HPC / Enterprise          Big Data / Cloud
Usual data type       Structured data           Unstructured / binary
Redundancy model      Hardware replication      Software replication
Device share model    Shared disks              Shared nothing
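To make the striping mechanism mentioned above concrete, the following minimal Python sketch (purely illustrative, not the code of any system presented below) splits a byte buffer into fixed-size chunks and distributes them round-robin over a set of storage targets; collective reads can then fetch chunks from all targets in parallel and reassemble the file.

```python
# Minimal illustration of file striping across storage targets (hypothetical sketch).
def stripe(data: bytes, stripe_size: int, num_targets: int):
    """Split `data` into stripe_size chunks and assign them round-robin to targets."""
    placement = {t: [] for t in range(num_targets)}
    for i in range(0, len(data), stripe_size):
        chunk = data[i:i + stripe_size]
        target = (i // stripe_size) % num_targets   # round-robin placement
        placement[target].append(chunk)
    return placement

def reassemble(placement: dict) -> bytes:
    """Interleave chunks back in round-robin order (reads could hit all targets in parallel)."""
    out = []
    max_rounds = max(len(chunks) for chunks in placement.values())
    for round_ in range(max_rounds):
        for target in sorted(placement):
            if round_ < len(placement[target]):
                out.append(placement[target][round_])
    return b"".join(out)

if __name__ == "__main__":
    data = bytes(range(256)) * 100
    layout = stripe(data, stripe_size=4096, num_targets=4)
    assert reassemble(layout) == data
```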
In the following paragraphs, we present some of the most popular systems that fall into the parallel/distributed categories. First, we give an overview of Lustre [lustre4] and PVFS (OrangeFS) [pvfs1], which are the two most representative parallel file systems. Second, we introduce HDFS [HDFS10] and GlusterFS [GlusterFS13], which are two well-known distributed file systems. Finally, we conclude this list by describing the Ceph system [webciteceph1], which can be seen as a mix between the two categories.
Lustre
ESRs: ESR06, ESR14
Key contact: ESR14
Main area: HPC
Overview
Lustre is an open-source, parallel distributed file system suitable for large-scale cluster computing. Its name is an amalgam of the terms "Linux" and "clusters". It is currently used by a large number of TOP500 supercomputers, including 6 of the top 10 and 60 of the top 100 [lustre1]. Feature-rich and robust, Lustre bundles high performance with data redundancy and high availability. It is a popular file system which has been adopted by large data centers in industry and by scientific communities working in fields such as financial analysis, media, life science, climate change and meteorology.

Lustre is based on the object-storage model and treats files as objects at the file system level. It aims to provide a fully POSIX-I/O-compliant file system (i.e., one that provides strong consistency and improves program portability) while maintaining high-performance I/O operations. An overview of a typical Lustre setup can be seen in Figure 1, along with the main architectural components of the system. It is composed of three main functional units [lustre2]:

● Clients. Lustre exposes the Portable Operating System Interface for Unix (POSIX) [lustre3] to the clients, which allows concurrent read and write access to the file system through the standard set of system calls. I/O throughput increases because Lustre allows massively parallel file access.
● Object Storage Servers (OSSs) manage the file system's data. They provide an object-based interface that clients can use to access byte ranges within the objects. Each OSS is connected to possibly multiple object storage targets (OSTs) that store the actual file data.
● Metadata Servers (MDSs) manage the file system's metadata, such as directories, file names and permissions. MDSs are not involved in the actual I/O but are only contacted once when a file is created or opened. The clients are then able to independently contact the appropriate OSSs. Each MDS is connected to possibly multiple metadata targets (MDTs) that store the actual metadata.

Additional architecture information:
● The Lustre networking module (LNET) enables low-latency network access, necessary for a cluster storage system. In addition to high network performance, LNET allows Remote Direct Memory Access, eliminating CPU overheads.
● Data striping provides significant performance boosts on large single-file reads, since data can be read from multiple storage devices, thus overcoming the throughput limitations of individual devices.

Figure 1 - Lustre: the main components of the Lustre file system [dell.www]

Lustre is based on Linux and uses kernel-based server modules to provide the required performance. Additionally, Lustre offers other useful features, listed below [lustre4]:

● Interoperability: the Lustre file system can run on modern commodity hardware with different CPU architectures.
● Access control lists (ACLs): the security mechanism of Lustre uses the idea of Linux ACLs, a set of access rights and permissions that every user or group has on specific system objects. It is enhanced by POSIX ACLs, based on standard POSIX file system object permissions.
● Quota: the amount of disk space that users or groups of users consume can be limited or changed in a Lustre system with quotas.
● OSS addition: more client or server nodes can easily be added to a cluster that uses the Lustre file system without any system interruption. This feature gives the opportunity to scale the system and to increase its storage capacity and bandwidth.
● Controlled striping: application libraries and utilities can specify different striping settings (such as striping count, striping size and OST selection) for an individual file or directory created in the Lustre file system in order to adjust capacity and performance. The same control is available for every file created in the system.
● Backup tools: the Lustre file system has utilities that allow users to discover modified files and directories, striping information, etc., and provide backup and restoration. However, these procedures are performed in either supervised or semi-supervised ways.
Pros and Cons

Pros
● Provides a POSIX-I/O-compliant interface
● High performance, scalability, stability and availability
● Provides file striping
● Achieves high bandwidth for a small number of files
● Has advanced security mechanisms
● Is an open-source file system

Cons
● Limited in features (e.g., compression [lustre5])
● Difficult to back up and efficiently recover metadata and data
● Striping can lead to metadata overhead
● An OST failure can stall system performance and make files inaccessible
● Cannot handle large numbers of small files (millions or more)
PVFS/OrangeFS
ESRs: ESR03
Key contact: ESR03
Main area: HPC
Overview
PVFS [pvfs1] is a parallel file system that supports both distributing file data and metadata among multiple servers and coordinated access to file data by multiple tasks of a parallel program. PVFS has an object-based design, which is to say all PVFS server requests involve objects called dataspaces. A dataspace can be used to hold file data, file metadata, directory metadata, directory entries, or symbolic links.
Figure 2 - PVFS: the main components of the PVFS file system [pvfs1]

The software consists of two major components: a storage server that runs as a daemon on an I/O node and a client library that provides an interface for user programs running on a compute node. The storage server stores all data in objects known as dataspaces, and all I/O operations are performed relative to one or more of these objects. A dataspace consists of two storage areas: a bytestream that stores arbitrary binary data as a sequence of bytes, and a collection of key/value pairs that allows structured data to be stored and quickly retrieved when needed. This architecture is outlined in Figure 2.
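As a rough mental model of the dataspace abstraction just described (a bytestream plus a key/value collection), the following Python sketch is purely illustrative: the class and attribute names are invented for the example and do not correspond to the actual PVFS/OrangeFS API.

```python
# Illustrative model of a PVFS-style dataspace: a bytestream plus key/value pairs.
# Names are invented for the example; this is not the PVFS/OrangeFS API.
class Dataspace:
    def __init__(self):
        self.bytestream = bytearray()   # arbitrary binary data, addressed by offset
        self.keyval = {}                # structured data, e.g. metadata or directory entries

    def write(self, offset: int, data: bytes):
        end = offset + len(data)
        if end > len(self.bytestream):
            self.bytestream.extend(b"\x00" * (end - len(self.bytestream)))
        self.bytestream[offset:end] = data

    def read(self, offset: int, size: int) -> bytes:
        return bytes(self.bytestream[offset:offset + size])

# A file can then be represented by several dataspaces: one holding file metadata
# (key/value pairs) and one or more holding the actual file data (bytestreams).
meta = Dataspace()
meta.keyval.update({"owner": "alice", "mode": 0o644, "datafile_handles": [42]})
data = Dataspace()
data.write(0, b"hello parallel world")
print(data.read(0, 5))  # b'hello'
```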
PVFS interacts over the network through an interface known as BMI [bmi]. BMI provides a non-blocking post and poll interface for sending messages, receiving messages and checking whether outstanding posts have completed. These are mostly requests from clients. Unexpected messages are limited in size but are useful for most client requests. BMI has implementations for TCP/IP, GM, MX, IB, and other networking fabrics. Multiple networks can be used at the same time, though there are performance issues in doing this.

The PVFS server manages requests through a Job layer. The Job interface is also a non-blocking post and poll design and provides a common interface for having many outstanding operations in flight on the server at one time. Jobs are issued by the server Request processor, which is built using a custom state-machine language, SM. SM allows programs to define fundamental steps in the processing of each request, each ending in posting a job. Once a job is posted, the state machine is suspended until completion, at which point it automatically resumes. SM allows return codes to drive the processing to different states. SM is designed to make the coding of server requests simpler by abstracting away many details, including interacting with BMI, encoding and decoding messages, and asynchronous issues. The PVFS client code is built from many of the same components as the server, including BMI and state machines. The primary difference is that the state machines are written to implement each function in the PVFS System Interface (sysint). The sysint is designed based on operations typically found in a modern OS virtual file system.
Thus, for example, there isn't an open call, but rather a lookup that takes a path name and returns a reference to the file's metadata. Metadata is read or written with getattr and setattr, respectively. User programs are expected to use PVFS via one of several user-level interfaces. These include a VFS kernel module for Linux, a FUSE interface, support via ROMIO (MPI-IO) and a User Interface (usrint) provided with PVFS.

The Orange File System (OrangeFS) [orangefs1] is a branch of the Parallel Virtual File System. Like PVFS, OrangeFS is a parallel file system designed for use on next-generation HPC systems that provides very high-performance access to disk storage for parallel applications. OrangeFS differs from PVFS in that it offers features that are not presently available in the main PVFS distribution. While PVFS development tends to focus on specific very large systems, OrangeFS considers a number of areas that have not been well supported by PVFS in the past. Such areas include virtualized storage over any Linux file system as the underlying local storage on each connected server, or replacement of the Hadoop DFS using a MapReduce extension and JNI, with no modification of MapReduce code needed.
Pros and Cons

Pros
● Unique object-based file data transfer, allowing clients to work on objects without the need to handle underlying storage details such as data blocks
● Ability to have unified data/metadata servers (they can also be separated if needed; the default is unified)
● Distribution of metadata across storage servers, as well as of directory entry metadata
● Ability to configure storage parameters per directory or file, including stripe size, number of servers, and immutable file replication

Cons
● Resides in the kernel (superuser rights are required to install OrangeFS)
● Small community compared to Lustre
HDFS
ESRs: ESR03
Key contact: ESR03
Main area: Cloud
Overview
HDFS [HDFS08] is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS is highly fault-tolerant and has been designed to be deployed on low-cost hardware. HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS is a distributed file system that shares many concepts with GoogleFS [googleFS03].

The system consists of a single NameNode (acting as master) and several DataNodes (acting as slaves), which are pieces of software run on commodity machines. The NameNode keeps in its memory (1) the whole namespace tree and (2) the mapping of file blocks to DataNodes. Data content is stored as a sequence of blocks, which are replicated for fault tolerance; all blocks in a file except the last block are the same size. For each file, the block size and replication factor are configurable.

Figure 3 - HDFS architecture [HDFS08]
Read flow
1. The client asks the NameNode for the file's blocks and replica locations.
2. The NameNode returns the list of DataNodes that hold the replicas, ordered by distance from the client.
3. The client contacts the DataNodes directly to request the blocks and reads from there.

Write flow
1. The client asks the NameNode for write permission on a file.
2. The NameNode grants the client a lease for the file, so that no other client can write to this file. A list of DataNodes that will hold the blocks is returned to the client.
3. The DataNodes in the list form a data pipeline, which minimizes the total network distance from the client to the last DataNode.
   a. The client contacts the first DataNode, this DataNode contacts the next DataNode, and so on, to set up the pipeline.
   b. The last DataNode in the list sends an acknowledgement message back to the previous DataNode, and so on back to the client, to indicate that the pipeline is ready.
4. The client sends data to the first DataNode, and the data flows through the pipeline to the last DataNode; the acknowledgement messages flow back in the same way as before.
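These flows can be exercised from Python; the sketch below assumes the third-party HdfsCLI package (a WebHDFS client) and a reachable NameNode, and the URL, user and paths are placeholders used only for illustration. Behind this simple API the client library still resolves block locations via the NameNode and streams data to/from DataNodes as described above.

```python
# Hedged sketch: requires the third-party HdfsCLI package (pip install hdfs) and a
# reachable NameNode; the URL, user and paths below are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode.example.org:50070", user="hadoop")

# Write: the client obtains a lease and a DataNode pipeline from the NameNode,
# then streams the data to the first DataNode in the pipeline.
with client.write("/data/sensor/day1.csv", overwrite=True) as writer:
    writer.write(b"timestamp,value\n1,42\n")

# Read: the client asks the NameNode for block locations and then fetches the
# blocks directly from the closest DataNodes.
with client.read("/data/sensor/day1.csv") as reader:
    content = reader.read()
print(content.decode())
```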
Pros and Cons

Pros
● High availability, scalability, simple architecture, etc.
● Rack awareness (improves performance and reduces network bandwidth usage)
● No lease mechanism for reads
● The client attempts to read blocks from the closest DataNode
● Supports MapReduce
● Checkpointing support

Cons
● Newly written data may not be visible to readers until the file is closed
● The NameNode is a single point of failure; the Backup Node cannot automatically handle the failover event
● Metadata kept in the NameNode's memory limits the scalability of the system
● No support for random writes to files; only appends
GlusterFS
ESRs: ESR09
Key contact: ESR09
Main area: Cloud
Overview
GlusterFS [GlusterFS13] is a scalable network file system consisting of client and server components. Using common off-the-shelf hardware, one can create large, distributed storage solutions for media streaming, data analysis, and other data- and bandwidth-intensive tasks. Servers are typically deployed as storage bricks, with each server running a glusterfsd daemon to export a local file system as a volume. The GlusterFS client process creates composite virtual volumes from multiple remote servers using stackable translators [webciteglustr1].

Unlike other traditional storage solutions, GlusterFS does not need a metadata server, and locates files algorithmically using an elastic hashing algorithm.
This no-metadata-server architecture ensures better performance, linear scalability, and reliability [webciteglustr2].

Figure 4 - GlusterFS overview [GlusterFS13]
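The idea of locating files algorithmically rather than through a metadata server can be illustrated with a small hash-based placement function. The sketch below is a strong simplification of GlusterFS's elastic hashing (it ignores volume layouts, rebalancing and per-directory ranges, and the brick names are invented); it only shows why no central lookup is needed: every client computes the same brick from the file path.

```python
# Simplified illustration of hash-based file placement (not the actual GlusterFS algorithm).
import hashlib

BRICKS = ["server1:/export/brick0", "server2:/export/brick1",
          "server3:/export/brick2", "server4:/export/brick3"]

def locate(path: str) -> str:
    """Every client derives the owning brick from the file path alone."""
    digest = hashlib.sha1(path.encode()).digest()
    index = int.from_bytes(digest[:4], "big") % len(BRICKS)
    return BRICKS[index]

# Same answer on every client, with no metadata server lookup involved.
print(locate("/photos/2016/trip.jpg"))
```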
For mutability and adaptability depending on the workload, GlusterFS defines multiple types of volumes:

● Distributed volume: files are distributed across the various bricks in the volume.
● Replicated volume: exact copies of the data are maintained on all bricks.
● Distributed replicated volume: files are distributed across replicated sets of bricks.
● Striped volume: a large file is divided into smaller chunks (equal to the number of bricks in the volume) and each chunk is stored in a brick.
● Distributed striped volume: similar to a striped volume, except that the stripes can now be distributed across a greater number of bricks.
Pros and Cons

Pros
● Industry-standard access protocols, including NFS and SMB
● Complete POSIX compliance
● Scalable thanks to elastic hashing
● File-level asynchronous geo-replication
● Modular, stackable design adaptable to a wide range of workloads
● Well-integrated ecosystem (including VMworld)
● Mature in production

Cons
● File access only (block and object interfaces must be built on top of it)
● Network-intensive design; highly dependent on network properties (failures, latency, throughput)
● Data reconstruction is offloaded to the client, rendering it inherently unusable for cloud applications
CephFS
ESRs: ESR08, ESR03
Key contact: ESR08
Main area: HPC/Cloud
Overview
The CephFS [websiteceph2] distributed file system is a POSIX-compliant file system (supporting all Unix and Linux applications) that uses a Ceph storage cluster for storing data [weil3]. Similarly to other file systems, CephFS relies on a metadata server daemon (MDS). CephFS is actually one component of the Ceph software suite, which also offers block and object storage backends [donvito]. CephFS is the last part of the software suite and is under heavy development. Eventually, the integration of a file system into the Ceph world will give users the opportunity to have object, block and file storage capabilities in a single system. This also provides users with a convenient way to bring legacy applications to the Ceph/OpenStack environment [websiteceph1]. Figure 5 gives an overview of the Ceph ecosystem.

Figure 5 - Relationship between Ceph components [websiteceph]

CephFS runs on top of the object storage system, building a cache-coherent file system on top of RADOS (see the Blob Systems section) [Weil1]. Therefore, it inherits the resilience and scalability of RADOS. Multiple metadata daemons handle the sharded metadata dynamically. Subtree snapshots and recursive statistics are also included. There is no restriction on file count or file system size. A "consistent caching" feature means that clients can cache data; their caches are coherent and the MDS invalidates the data that changed. This invalidation ensures that clients never see stale data. Finally, disaster recovery tools can undertake the operation of rebuilding the file system namespace from scratch, in the extreme case that RADOS loses it or a corruption occurs.

As of November 2016, Red Hat states that CephFS is included in the Ceph Community Edition, with two main features of limited functionality: snapshots are deactivated, and multiple MDSs are
not recommended (only one MDS should be functional). The current CephFS version is mainly used as a tech preview.
Pros and Cons

Pros
● POSIX interface
● Cloud-storage ready
● Scalable, as all clients read/write on multiple OSD nodes [wang]
● Shared, as many clients can work on it simultaneously

Cons
● Previous reliability issues have been solved, but real-world use cases are few and far between at the moment [dutchCSI]
● Snapshots are still in an experimental phase; usage of this feature could cause client nodes or the MDS to terminate unexpectedly
● Only one CephFS per cluster is allowed
Relevance w.r.t. WP1 use-cases

All use cases defined in the WP1 deliverable [D1.1] will most likely make use of either a parallel or a distributed file system as the main storage backbone. The advantages in reliability, performance and scalability are necessary for scientific projects of such magnitude.

Climate science is a typical example of a demanding compute environment with high storage throughput and capacity requirements. Large amounts of both input and output data need to be stored for a long time or, in many cases, indefinitely. The simulations of climate models are pushing the limits of computation capabilities and take a significant amount of time to complete even on HPC systems. This requires frequent checkpointing to safeguard the state of the simulation and, as a result, high write throughput is required to avoid I/O becoming the bottleneck of the process. As defined in the WP1 deliverable, the following requirements can be successfully met by the Lustre parallel file system in its current state:

● CLM1: Parallel I/O
● CLM2: Parallel distributed file systems exploitation
● CLM3: Long-term archiving
● CLM4: Exascale storage capacity

The Scientific Data Processor (SDP) component of the Square Kilometre Array will likely require dedicated HPC facilities to process, store and distribute the vast amounts of observation data that will be constantly produced by this instrument. While the exact architectural choices are still in the design phase, a set of requirements has been identified and presented in the WP1 deliverable. SKA is expected to push the limits of computation and storage, and future advancements in hardware and software have already been taken into account in the design process. It is likely that a custom solution will be tailored to the exact needs of SKA, making use of concepts from both parallel and distributed file systems.
Moreover, one can envision that the SKA will rely on multiple tiers of higher- to lower-performance storage systems, making hierarchical storage management necessary. This capability is currently offered by Lustre; however, the complexity of SKA will require further optimizations in the areas of caching as well as incorporating the usage of heterogeneous storage devices such as NVRAM and SSDs alongside traditional spinning disks.

Last but not least, we underline that the Human Brain Project has funded research into the evaluation of Ceph as a storage system [gudu]. The specific use cases that have been evaluated were data publication with archival, neuro-informatics data analytics, and content delivery networks.
Blob Systems

Blob Fundamentals
Over the years, file systems have proven their worth. Nevertheless, in the mid-2000s, developers were looking for solutions that would enable applications to interact directly with the storage backend (i.e., without going through the complex file system stack). Object storage systems have been proposed to fulfil this demand. Objects are collections of data identified by a unique id in a flat namespace. This abstract notion (i) decouples the storage from OS semantics, (ii) groups together otherwise unrelated datasets and (iii) enables better lifecycle management of a given dataset. However, because the number of objects can grow significantly and thus significantly increase the search time, storage solution designers proposed merging small datasets into a super-collection named a BLOB (Binary Large Object). Maintaining a small set of huge BLOBs comprising billions of small, KB-sized application-level objects is much more feasible than managing billions of small KB-sized objects directly. Although objects are now included in a single container (the BLOB), they are still independent entities. Therefore, a BLOB must provide the fundamentals for concurrent access and an adequate consistency model. If exploited efficiently, this feature introduces a very high degree of parallelism in the application.

Similarly to traditional hash tables, objects are identified using a unique id. Decoupling the identifier from the logical placement allows migration policies to be applied transparently to the end user. The identifier is generated using cryptographic hashes of the object's contents. Therefore, identical datasets produce the same id and different datasets produce different ids. This enables data deduplication and integrity checking.

BLOBs are used for keeping related information together. For instance, an MRI (Magnetic Resonance Imaging) image is grouped with the physician's recorded notes (in an MP3 file) along with the text file that contains the patient's history. In addition, BLOBs contain additional control properties for indexing or automated storage management (e.g., routing and retention policies). BLOBs can be staged from one medium to another based on their current usage. While traditionally data is staged based on aging, metadata can provide useful hints for proper and efficient placement.
In a conventional system, an entity has to go through all the stages in the following list. However, if the access pattern for a specific object is known, some of the stages can be skipped.

● HOT: durable, available and performant object storage for frequently accessed data
● WARM: lower cost, but less durability, for frequently accessed non-critical reproducible data
● COOL: storage class for data that is accessed less frequently but requires rapid access
● COLD: secure, durable, and low-cost storage service for data archiving

Developers access objects through a CRUD (create, read, update and delete) interface. To perform a change, the object must be fetched back, updated locally and then pushed back to the storage provider. Another drawback of this model is the concurrency issue: if two clients fetch the same object, the question arises of whose version will be regarded as the last written. This is, however, solved by content-generated ids, which will produce different ids for the two new objects. Eventually, this means that concurrency control degenerates into proper key management and garbage collection.
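A minimal content-addressed object store captures both points made above: identifiers derived from the data give deduplication for free, and an "update" is really a put of a new object followed by key management and garbage collection. The sketch below is a generic illustration, not the API of any particular BLOB system.

```python
# Generic sketch of a content-addressed object store with a CRUD-style interface.
import hashlib

class ObjectStore:
    def __init__(self):
        self.objects = {}   # id -> bytes

    def put(self, data: bytes) -> str:
        """Create: the id is a cryptographic hash of the content, so identical data
        maps to the same id (deduplication) and the id doubles as an integrity check."""
        oid = hashlib.sha256(data).hexdigest()
        self.objects[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        return self.objects[oid]

    def update(self, oid: str, new_data: bytes) -> str:
        """Update: fetch-modify-put produces a *new* id; concurrency control reduces
        to deciding which id the application keeps as 'latest'."""
        return self.put(new_data)

    def delete(self, oid: str):
        """Delete: ids no longer referenced are reclaimed (garbage collection)."""
        self.objects.pop(oid, None)

store = ObjectStore()
v1 = store.put(b"patient history v1")
v2 = store.update(v1, b"patient history v2")
assert v1 != v2 and store.get(v1) == b"patient history v1"
```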
Regarding the placement aspects, objects are in most cases stored on the nodes with the identifier closest to that of the object. To minimize the risk of loss, objects can be replicated. However, the same chunk of data produces the same identifier and would therefore be placed on the same node. To work around this, a separation between signatures and identifiers is necessary. With that, object replicas may have the same signature but be located in different failure domains. To minimize the storage footprint of replication without compromising the reliability level, erasure codes can be used. Furthermore, object immutability enables system versioning, which can be used to provide restore points.
In the following paragraphs, we present three BLOB systems, namely BlobSeer [Blobseer11], Microsoft Azure [azure1] and RADOS [weil1]. The first one is an academic solution while the two others are industrial proposals.
BlobSeer
ESRs: ESR13
Key contact: ESR13
Main area: HPC
Overview
BlobSeer [Blobseer10, Blobseer11] is a large-scale distributed storage service that addresses advanced data management requirements resulting from ever-increasing data sizes. It is built on the idea of leveraging versioning for concurrent manipulation of binary large objects in order to efficiently exploit data-level parallelism and sustain a high throughput despite massively parallel data access.
BlobSeer supports binary large objects (BLOBs) that reach the order of terabytes while allowing fine-grained random write access. BlobSeer behaves very well under high concurrency thanks to its multi-version concurrency control.

A client of BlobSeer manipulates BLOBs using a simple interface that allows users to: create a new empty BLOB; append data to an existing BLOB; and read/write a subsequence of bytes specified by an offset and a size from/to an existing BLOB. Figure 6 presents an architecture overview of the system.

Figure 6 - BlobSeer's architecture [Blobseer10]
In the following, we present the major concepts of the BlobSeer solution.

● Versioning access interface to BLOBs: versioning is explicitly managed by the client. Each time a write or append is performed on a BLOB, a new snapshot reflecting the changes is generated instead of overwriting any existing data. This new snapshot is labelled with an incremental version number, so that all past versions of the BLOB can potentially be accessed, at least as long as they have not been deleted for the sake of storage space.
● Data striping: each BLOB is made up of blocks of a fixed size. The size of these blocks is set to the size of the data piece a Map/Reduce worker is supposed to process. These blocks are distributed among the storage nodes.
● Distributed metadata: metadata is organized as a distributed segment tree. A segment tree is a binary tree in which each node is associated with a range of the BLOB, delimited by an offset and a size. To favour efficient concurrent access to metadata, tree nodes are distributed: they are stored on the metadata providers using a DHT (Distributed Hash Table). Each tree node is identified in the DHT by its version and by the range specified through the offset and the size it covers.
Decentralizing the metadata servers avoids the bottleneck created by concurrent accesses to a centralized metadata server, as found in most distributed file systems, including HDFS.

● Concurrency control: BlobSeer relies on a versioning-based concurrency control algorithm that maximizes the number of operations performed in parallel in the system. This is done by avoiding synchronization as much as possible, both at the data and metadata levels. The key idea is that no existing data or metadata is ever modified. First, any writer or "appender" writes its new data blocks, storing the differential patch. Then, in a second phase, the version number is allocated and the new metadata referring to these blocks is generated. The first phase consists in actually writing the new data on the data providers in a distributed fashion. Since only the difference is stored, each writer can send its data to the corresponding data providers independently of other writers. As no synchronization is necessary, this step can be performed in a fully parallel fashion. In the second phase, the writer asks the version manager to be assigned a version number and then generates the corresponding metadata. This new metadata describes the blocks of the difference and is "weaved" together with the metadata of lower versions, in such a way as to offer the illusion of a fully independent snapshot.
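The two-phase, versioning-based write described above can be condensed into a toy model: writers first store their new chunks without coordinating with each other, then obtain a version number and publish metadata that weaves the new chunks with those of the previous snapshot. The sketch below illustrates the principle only; it does not reflect BlobSeer's actual segment-tree metadata or its API.

```python
# Toy model of versioning-based concurrency control (illustrative, not BlobSeer's API).
import itertools
import threading

class VersionedBlob:
    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}          # (version, chunk_index) -> bytes, never overwritten
        self.snapshots = {0: {}}  # version -> {chunk_index: version owning that chunk}
        self._version_counter = itertools.count(1)
        self._publish_lock = threading.Lock()   # only metadata publication is serialized

    def write(self, offset: int, data: bytes) -> int:
        # Phase 1: store the differential patch; no synchronization with other writers.
        first = offset // self.chunk_size
        patch = {first + i: data[i * self.chunk_size:(i + 1) * self.chunk_size]
                 for i in range((len(data) + self.chunk_size - 1) // self.chunk_size)}
        # Phase 2: obtain a version number and weave new metadata with the previous snapshot.
        with self._publish_lock:
            version = next(self._version_counter)
            for idx, chunk in patch.items():
                self.chunks[(version, idx)] = chunk
            snapshot = dict(self.snapshots[version - 1])
            snapshot.update({idx: version for idx in patch})
            self.snapshots[version] = snapshot
        return version

    def read(self, version: int) -> bytes:
        snap = self.snapshots[version]
        return b"".join(self.chunks[(snap[i], i)] for i in sorted(snap))

blob = VersionedBlob()
v1 = blob.write(0, b"AAAABBBB")
v2 = blob.write(4, b"CCCC")          # past snapshot v1 remains readable
assert blob.read(v1) == b"AAAABBBB" and blob.read(v2) == b"AAAACCCC"
```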
Pros/Cons
Pros
● Fine-grained write access
● Good horizontal scalability
● Applicable to Hadoop Map/Reduce applications (it comes with an HDFS compatibility layer)

Cons
● The version manager is a single point of failure and a hot spot (it is on the critical path for both reads and writes)
● Metadata distribution in a distributed tree may cause high latency as the size of the tree grows
● No eviction of old versions, causing performance to decrease over time
Azure Storage

ESRs involved: ESR03
Key contact: ESR03
Main area: Cloud
Overview
Azure Storage Blobs [azure1] is a proprietary, cloud-based storage system developed by Microsoft and exclusively accessible from the Microsoft Azure cloud. Azure Blobs provide the user with a simple API. Every blob is organized into a container.
Containers also provide a useful way to assign security policies to groups of objects. A storage account can contain any number of containers, and a container can contain any number of BLOBs, up to a limit of 500 TB. Blob storage offers three types of BLOBs: block BLOBs, append BLOBs, and page BLOBs (disks).

● Block BLOBs are optimized for streaming and storing cloud objects, and are a good choice for storing documents, media files, backups, etc.
● Append BLOBs are similar to block BLOBs, but are optimized for append operations. An append BLOB can be updated only by adding a new block to its end. Append BLOBs are a good choice for scenarios such as logging, where new data needs to be written only to the end of the BLOB.
● Page BLOBs are optimized for representing IaaS disks and supporting random writes, and may be up to 1 TB in size. Page BLOBs are used, for instance, for saving the network-attached IaaS disks of Azure virtual machines.
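For illustration, the three blob types can be exercised from Python with Microsoft's azure-storage-blob package; the sketch below uses the newer v12-style client classes rather than the API available at the time of this report, and the connection string, container and blob names are placeholders, so treat it as a hedged example only.

```python
# Hedged sketch using the azure-storage-blob Python package (v12-style API);
# connection string, container and blob names are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
container = service.get_container_client("my-container")

# Block blob: good for documents, media files, backups.
container.upload_blob(name="report.pdf", data=b"%PDF-...", overwrite=True)

# Append blob: new data can only be added at the end, e.g. for logs.
log_blob = container.get_blob_client("app.log")
log_blob.create_append_blob()
log_blob.append_block(b"2016-12-15 10:00:00 service started\n")

# Page blob: fixed-size pages supporting random writes, used for IaaS disks.
disk_blob = container.get_blob_client("vm-disk.vhd")
disk_blob.create_page_blob(size=1024 * 1024)              # size is a multiple of 512 bytes
disk_blob.upload_page(b"\x00" * 512, offset=0, length=512)
```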
Because it is a closed-source solution, it is difficult to identify and analyse the internal mechanisms of the system.
Pros/Cons
Pros
● Cloud-based, no setup needed
● Simple API, many client libraries available
● Interesting performance

Cons
● Proprietary license
● Opaque internal architecture
● Few fine-tuning options
● Limited support for random writes: not supported in append or block BLOBs, and writes must be aligned to the page size in page BLOBs
● Low capacity limits
RADOS
ESRs involved: ESR08
Key contact: ESR08
Main area: HPC/Cloud
Overview
RADOS [weil1] is a Reliable Autonomic Distributed Object Store, which provides an extremely scalable storage service for variably sized objects. The underlying storage abstraction provided by RADOS is relatively simple:
● The unit of storage is an object. Each object has a name (currently a fixed-size 20-byte identifier, though that may change), some number of named attributes (i.e., xattrs), and a variable-sized data payload (like a file).
● Objects are stored in object pools. Each pool has a name (e.g., "foo") and forms a distinct object namespace. Each pool also has a few parameters that define how the objects are stored, namely a replication level (2x, 3x, etc.) and a mapping rule describing how replicas should be distributed across the storage cluster (e.g., each replica in a separate rack).
● The storage cluster consists of some (potentially large) number of storage servers, or OSDs (object storage daemons/devices), and the combined cluster can store any number of pools.

A key design feature of RADOS is that the OSDs are able to operate with relative autonomy when it comes to recovering from failures or migrating data in response to cluster expansion. By minimizing the role of the central cluster coordinator (actually a small Paxos cluster managing key cluster state), the overall system is in theory scalable. A small system of a few nodes can seamlessly grow to hundreds or thousands of nodes (or contract again) as needed.
Data in RADOS is distributed according to an algorithm called CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data [weil2]. The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With a computation-based method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
Figure 7 - Data placement in RADOS with the CRUSH algorithm; objects are eventually stored on the OSDs.
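Ceph ships Python bindings (the rados module) that expose this object interface directly. The sketch below assumes a reachable cluster, a configuration file at the usual location and an existing pool named "mypool"; these are illustrative assumptions rather than a recipe from the RADOS documentation.

```python
# Hedged sketch using Ceph's Python rados bindings; the cluster config path and
# pool name are illustrative assumptions.
import rados

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    ioctx = cluster.open_ioctx("mypool")              # objects live in a pool
    try:
        # The client computes the object's placement with CRUSH and talks to the
        # responsible OSDs directly; no central broker sits on the data path.
        ioctx.write_full("sensor-42", b"temperature=21.5")
        ioctx.set_xattr("sensor-42", "unit", b"celsius")   # named attributes (xattrs)
        print(ioctx.read("sensor-42"))
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```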
Pros/Cons
Pros
● Interfaces for Swift and S3, and the Ceph Block Device
● Erasure code algorithm [miyamae]
● Single storage layer and flexibility with CRUSH
● Unlimited scalability (in theory)

Cons
● There is no interface for RBD on Windows
● The hypervisor implementation needs knowledge about Ceph; at the moment, only KVM has native support
● The use of Paxos can become an issue for very large-scale systems where the number of Monitors becomes too large
Relevance w.r.t. WP1 use-cases

Objects are powerful abstractions, especially for combining and tagging information from different sources and for handling the lifecycle and placement of related non-uniform datasets. The notion of dynamic dataset links can be directly exploited, on the one hand, by the Human Brain Project, in particular the HBP2 "Spatial join based on non-uniform dataset densities" requirement, and on the other hand by smart-city technologies, such as the SCT4 "Batch processing and learning from data" requirement.

Apart from being an entity for dynamic linking, objects can be regarded as individuals without external dependencies. Although parallel and distributed file systems look to be the right choice, Blob systems can be relevant as a backend for the Climate Science use case, in particular for the CLM1 "Parallel I/O" as well as the CLM2 "Parallel distributed file systems exploitation" requirements. Indeed, Blob systems such as BlobSeer propose more or less the same file API as distributed file systems. The operations possible on any given file are roughly the same as the operations possible on BLOBs; both store unstructured opaque byte arrays. Climate applications strongly rely on HDF5, which uses MPI-IO under the hood. MPI-IO relies on only some POSIX features, which are almost all provided by Blob storage systems (MPI-IO does not need hierarchies, permissions or any other feature of a DFS). Considering this, and the fact that most Blob storage systems are aggressively optimized for speed and parallel access, it is natural to consider Blob systems as a drop-in replacement for a DFS in the case of applications such as the climate ones.
Finally, the nature of objects can serve the SKA use case, especially the SKA9 "Specific data structures for massively parallel accesses" requirement.
Key/Value Store Systems

KVS Fundamentals
A key/value store (KVS) [kvs1], or key/value database, is a simple database that uses an associative array (similar to a map or a dictionary) as the fundamental data model, where each key is associated with one and only one value in a collection. This relationship is referred to as a key/value pair.

In each key/value pair, the key is an arbitrary string such as a filename, URI or hash. The value can be any kind of data, such as an image, a user preference file or a document. The underlying data model is similar to that of BLOB storage systems in several ways: the value is stored as an opaque byte array requiring no upfront data modelling or schema definition. This data model obviates the need to index the data to improve performance. However, one cannot filter or control what is returned from a request based on the value, because the value is opaque. In general, key/value stores have no query language. They provide a way to store, retrieve and update data using simple get, put and delete commands; the path to retrieve data is a direct request to the object in memory or on disk. The simplicity of this model makes key/value stores fast, easy to use, scalable, portable and flexible.

Key/value stores generally scale out by implementing partitioning (storing data on more than one node), replication and auto-recovery. They can scale up by maintaining the database in RAM and can minimize the cost of ACID guarantees (a guarantee that committed transactions persist somewhere) by avoiding locks and latches and by using low-overhead server calls. Because each individual store in the cluster operates almost independently, a KVS cluster can offer high throughput and capacity, as demonstrated by large-scale deployments; e.g., Facebook operates a Memcache KVS cluster serving over a billion requests per second for trillions of items. Key/value stores handle size well and are good at processing a constant stream of read/write operations with low latency.
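The whole data model can be summarized in a few lines: opaque values addressed by keys through get/put/delete, with scale-out obtained by hashing keys onto partitions. The sketch below is a generic illustration of these fundamentals, not the interface of any specific store presented later.

```python
# Generic illustration of the key/value data model with hash partitioning.
import hashlib

class KeyValueStore:
    def __init__(self, num_partitions=4):
        # Each partition would live on a different node in a real deployment.
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition(self, key: str) -> dict:
        digest = hashlib.md5(key.encode()).digest()
        return self.partitions[int.from_bytes(digest[:4], "big") % len(self.partitions)]

    def put(self, key: str, value: bytes):
        self._partition(key)[key] = value          # the value is an opaque byte array

    def get(self, key: str) -> bytes:
        return self._partition(key)[key]

    def delete(self, key: str):
        self._partition(key).pop(key, None)

kvs = KeyValueStore()
kvs.put("user:42:avatar", b"\x89PNG...")
print(kvs.get("user:42:avatar"))
kvs.delete("user:42:avatar")
```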
Key/value stores differ in their implementations: some support ordering of keys, like Berkeley DB [bdb] and Memcache [memcache]; some maintain data in memory, like Redis [redis]; and some, like Aerospike [aerospike], are built natively to support both RAM and solid-state drives (SSDs). Others, like Couchbase Server [couchbase], store data in RAM but also support rotating disks.

In the following, we present the HBase [hbase1-6], Cassandra [cassandra1], and RAMCloud [ramcloud1-6] solutions.
HBase
ESRs involved: ESR07
Key contact: ESR07
Main area: Cloud
Overview
HBase [hbase1-6] is an open-source, distributed, versioned, non-relational database modelled after Google's Bigtable [BigTable08] and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (see the Parallel/Distributed File Systems section), providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large collection of empty or unimportant data, such as finding the 50 largest items in a group of 2 billion records, or finding the non-zero items representing less than 0.1% of a huge collection).
Figure 8 - HBase architecture [hbase.www]
Physically, HBase is composed of three types of servers in a master/slave type of architecture, outlined in Figure 8. Region servers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. Region assignment and DDL operations (create, delete tables) are handled by the HBase Master process. Zookeeper, which is part of HDFS, maintains a live cluster state.

The Hadoop DataNode stores the data that the RegionServer is managing. All HBase data is stored in HDFS files. RegionServers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the RegionServers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.
The NameNode maintains metadata information for all the physical data blocks that comprise the files.
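From an application's point of view the interface is a sparse, column-oriented table addressed by row key. The sketch below uses the third-party happybase client (which talks to HBase through its Thrift gateway) and assumes a reachable Thrift server plus a pre-created "events" table with a "cf" column family; all of these names are placeholders.

```python
# Hedged sketch using the third-party happybase client (HBase Thrift gateway);
# host, table name and column family are illustrative assumptions.
import happybase

connection = happybase.Connection("hbase-thrift.example.org")
table = connection.table("events")

# Rows are sparse: only the cells that exist are stored, addressed by row key
# and "family:qualifier" column names.
table.put(b"device42#20161215", {b"cf:temperature": b"21.5", b"cf:status": b"ok"})

row = table.row(b"device42#20161215")
print(row[b"cf:temperature"])

# Range scans over row keys are the main read pattern besides point lookups.
for key, data in table.scan(row_prefix=b"device42#"):
    print(key, data)
```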
Pros and Cons

Pros
● Real-time, random big data access through Hadoop integration
● Column-oriented data model for big sparse tables
● Row-level atomic operation support
● High scalability
● Auto failover
● Simple client interface

Cons
● Single point of failure (SPOF)
● No transaction support
● No joins provided natively (MapReduce usage is necessary)
● Index only on the key
● No authentication support
Cassandra
ESRs involved: ESR03
Key contact: ESR03
Main area: Cloud
Overview
Cassandra [cassandra1] is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based on the understanding that system and hardware failures can and do occur. Cassandra addresses the problem of failures by employing a peer-to-peer distributed system across homogeneous nodes, where data is distributed among all nodes in the cluster. Data distribution is done deterministically using a distributed hash table, as depicted in Figure 9.
Figure 9 - Cassandra ring data distribution

Each node exchanges information across the cluster periodically. A sequentially written commit log on each node captures write activity to ensure data durability. Data is then indexed and written to an in-memory structure, called a memtable, which resembles a write-back cache. Once the memory structure is full, the data is written to disk in an SSTable data file. All writes are automatically partitioned and replicated throughout the cluster. Using a process called compaction, Cassandra periodically consolidates SSTables, discarding obsolete data and tombstones (markers indicating that data was deleted).
Cassandra is a row-oriented database. Cassandra's architecture allows any authorized user to connect to any node in any datacenter and access data using the CQL language (Cassandra Query Language). For ease of use, CQL uses a syntax similar to SQL. From the CQL perspective, the database consists of tables. Typically, a cluster has one keyspace per application. Developers can access CQL through cqlsh as well as via drivers for application languages.

Read and write requests can be sent to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured.
Pros and Cons

Pros
● Easy data distribution
● Continuous uptime and high fault tolerance
● Transactional capabilities
● High scalability (linear scale performance)

Cons
● Slow transaction processing
● Low secondary index performance
RAMCloud
ESRs involved: ESR13
Key contact: ESR13
Main area: Cloud
Overview
RAMCloud [ramcloud1-6] is an in-memory key/value store. Its main claims are low latency, durability (in the sense of ACID), fast crash recovery, efficient memory usage, and strong consistency. By leveraging Infiniband-like networks, read operations can be completed in a few microseconds.
A RAMCloud cluster consists of three entities: a coordinator maintaining metadata information about storage servers, backup servers, and data location; a set of storage servers that expose their DRAM as storage space; and backups that store replicas of data in their DRAM temporarily and spill it to disk asynchronously. Usually, storage servers and backups are collocated within the same physical machine.

Data in RAMCloud is stored in a set of tables. Each table can span multiple storage servers. A server uses an append-only log-structured memory to store its data and a hash table to index it. The log-structured memory of each server is divided into 8 MB segments. A server stores data in an append-only fashion. Thus, to free unused space, a cleaning mechanism is triggered whenever a server reaches a certain memory utilization threshold. The cleaner copies a segment's live data into the free space (still available in DRAM) and removes the old segment.

Durability and availability are guaranteed by replicating data to remote disks. More precisely, whenever a storage server receives a write request, it appends the object to its latest free segment and forwards a replication request to the backup servers randomly chosen for that segment. The server waits for acknowledgements from all backups before answering a client's update/insert request. Backup servers keep a copy of this segment in DRAM until it fills; only then do they flush the segment to disk and remove it from DRAM.

For each new segment, a random backup in the cluster is chosen, in order to have as many machines as possible performing crash recovery. At run time, each server computes a "will" that specifies how its data will be partitioned if it crashes. If a crashed server is detected, each server replays the segments that were assigned to it according to the crashed server's will.
As the segments are written to a server's memory, they are replicated to new backups. At the end of the recovery, the segments are cleaned from the old backups.
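The core of the write path (append into the latest segment of a log-structured memory, index the object in a hash table, replicate the segment to backups) can be modeled in a few lines. The sketch below is a simplified illustration of those mechanisms, not RAMCloud code, and leaves out cleaning, wills and recovery.

```python
# Simplified illustration of RAMCloud-style log-structured memory (not RAMCloud code).
SEGMENT_SIZE = 8 * 1024 * 1024   # 8 MB segments, as described above

class Backup:
    def __init__(self):
        self.buffered = {}     # kept in DRAM until the segment fills, then spilled to disk
    def replicate(self, segment_id, data):
        self.buffered.setdefault(segment_id, bytearray()).extend(data)

class StorageServer:
    def __init__(self, backups):
        self.segments = [bytearray()]      # append-only log, split into segments
        self.hash_table = {}               # key -> (segment index, offset, length)
        self.backups = backups             # backup servers chosen per segment

    def write(self, key: str, value: bytes):
        seg = self.segments[-1]
        if len(seg) + len(value) > SEGMENT_SIZE:   # seal the segment, open a new one
            self.segments.append(bytearray())
            seg = self.segments[-1]
        offset = len(seg)
        seg.extend(value)                          # append-only: nothing is overwritten
        self.hash_table[key] = (len(self.segments) - 1, offset, len(value))
        # Replication: the client reply would only be sent after all backups acknowledged.
        for backup in self.backups:
            backup.replicate(len(self.segments) - 1, value)

    def read(self, key: str) -> bytes:
        seg_idx, offset, length = self.hash_table[key]
        return bytes(self.segments[seg_idx][offset:offset + length])

server = StorageServer(backups=[Backup(), Backup(), Backup()])
server.write("user:1", b"alice")
print(server.read("user:1"))
```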
Pros and Cons

Pros
● Very low latency
● Strong consistency
● Fast crash recovery
● Efficient memory utilization

Cons
● High CPU usage: one core is dedicated to polling incoming requests in order to achieve low latency
● No "smart" mechanism to distribute data across the cluster
● Needs a high-performance network to really take advantage of the low latency
Relevance w.r.t. the WP1 use-cases

Key-Value Stores are ideal for cases that require data distribution, high scalability and/or a flexible schema. This makes them suitable for:

● The Science Data Processor (SDP) of the SKA project, to temporarily store raw data received from the telescopes. Key-Value Stores are well suited for this case because the data is highly distributable and can be processed in parallel. Moreover, the Key-Value storage mechanism is also suitable for the type of data processed by the project. The following requirements from the SKA project are met by Key-Value Storage systems:
  ○ HBP1: Spatial Data Management Technique
  ○ HBP3: Multi-format ingestion
  ○ HBP5: Sharing
● The storage of individual events in the context of Smart Cities' applications, leveraging at the same time low-latency reads and writes, good scalability properties and low metadata overhead. The following requirements from the Smart Cities use-case are met by Key-Value Storage systems:
  ○ SCT5: Storage in real time
  ○ SCT6: Replicated storage system
  ○ SCT10: Scalable system
Relevance w.r.t. the WP1 use cases - Overview

The table below provides a synthetic view of the relevance of the different systems with respect to the WP1 use-cases that have been identified.

System | HBP | SKA | CLM | SCT
Lustre | HBP5 | SKA2, SKA5, SKA6, SKA9 | CLM1, CLM2, CLM4, CLM5, CLM6, CLM8, CLM9 | SCT3, SCT5, SCT10
OrangeFS | HBP5 | SKA2, SKA5, SKA6, SKA9 | CLM1, CLM2, CLM4, CLM5, CLM6, CLM8, CLM9 | SCT3, SCT5, SCT10
HDFS | HBP1, HBP2 | SKA2, SKA5, SKA7, SKA8 | CLM1, CLM2, CLM3, CLM4 | SCT2, SCT6, SCT8, SCT9
GlusterFS | HBP1 | SKA1, SKA5, SKA8 | CLM1, CLM2 | SCT2, SCT5, SCT6, SCT8, SCT9, SCT10
Ceph | HBP1 | SKA1, SKA5, SKA8 | CLM1, CLM2 | SCT2, SCT5, SCT6, SCT8, SCT9, SCT10
BlobSeer | HBP2 | SKA9 | CLM1, CLM2 | SCT4
Azure | HBP2 | SKA9 | CLM1 | SCT4
Rados | HBP5, HBP8 | SKA3, SKA7, SKA9 | CLM1, CLM3, CLM5, CLM9 | SCT6, SCT7, SCT10
HBase | HBP1, HBP3 | - | CLM1, CLM4 | SCT5, SCT6, SCT10
Cassandra | HBP1, HBP5 | - | CLM1, CLM3, CLM4 | SCT5, SCT6, SCT10
RAMCloud | HBP7 | SKA8, SKA9 | CLM1 | SCT1, SCT6, SCT10
References
[aerospike] http://www.aerospike.com/
[azure1] Understanding Block Blobs, Append Blobs, and Page Blobs – https://docs.microsoft.com/en-us/rest/api/storageservices/fileservices/Understanding-Block-Blobs--Append-Blobs--and-Page-Blobs?redirectedfrom=MSDN

[BentJ] Bent, John, et al. "PLFS: A Checkpoint Filesystem for Parallel Applications."

[bdb] Michael A. Olson, Keith Bostic, and Margo Seltzer. 1999. Berkeley DB. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC '99). USENIX Association, Berkeley, CA, USA, 43-43.

[BigTable08] Chang, Fay, et al. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS) 26.2 (2008): 4.

[Blobseer10] Bogdan Nicolae, Diana Moise, Gabriel Antoniu, Luc Bougé, Matthieu Dorier. BlobSeer: Bringing High Throughput under Heavy Concurrency to Hadoop Map-Reduce Applications. 24th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2010), Apr 2010, Atlanta, United States.

[Blobseer11] Bogdan Nicolae, Gabriel Antoniu, Luc Bougé, Diana Moise, and Alexandra Carpen-Amarie. 2011. BlobSeer: Next-generation data management for large scale infrastructures. J. Parallel Distrib. Comput. 71, 2 (February 2011), 169-184.

[bmi] P. Carns, W. Ligon, R. Ross and P. Wyckoff, "BMI: a network abstraction layer for parallel I/O," 19th IEEE International Parallel and Distributed Processing Symposium, 2005, 8 pp. doi:10.1109/IPDPS.2005.128

[cassandra1] What is Apache Cassandra? – http://www.planetcassandra.org/what-is-apache-cassandra/

[cassandra2] Rabl, Tilmann, et al. Solving Big Data Challenges for Enterprise Application Performance Management. VLDB (2012).

[couchbase] John Zablocki. 2015. Couchbase Essentials. Packt Publishing.

[D1.1] D1.1 - Use case requirements – BigStorage Deliverable

[dell.www] Lustre – Dell Presentation – http://www.slideshare.net/undiez/dell-lustre-storage-architecture-presentation-mbug-2016

[Departon01] Benjamin Depardon, Gael Le Mahec, Cyril Séguin. Analysis of Six Distributed File Systems. [Research Report] 2013, pp. 44. <hal-00789086>
[donvito] Donvito, G., Marzulli, G., & Diacono, D. (2014). Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiments analysis. In Journal of Physics: Conference Series (Vol. 513, No. 4, p. 042014). IOP Publishing.

[dutchCSI] Dutch Forensic Institute: CephFS in production

[FeitelsonD] Feitelson, Dror G., et al. "Parallel I/O systems and interfaces for parallel computers." Multiprocessor Systems - Design and Integration. World Scientific (1995).

[GlusterFS13] Alex Davies and Alessandro Orsaria. 2013. Scale out with GlusterFS. Linux J. 2013, 235.

[GoogleFS03] Ghemawat, S., Gobioff, H., & Leung, S. T. (2003, October). The Google file system. In ACM SIGOPS Operating Systems Review (Vol. 37, No. 5, pp. 29-43). ACM.

[gudu] Gudu, D., Hardt, M., & Streit, A. (2014). Evaluating the performance and scalability of the Ceph distributed storage system. 2014 IEEE International Conference on Big Data: 177-182.

[hbase1] George, Lars. HBase: The Definitive Guide. O'Reilly Media, Inc., 2011.

[hbase2] Dimiduk, Nick, Amandeep Khurana, and Mark Henry Ryan. HBase in Action. Manning, 2013.

[hbase4] White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, 2009.

[hbase5] Kruchten, Philippe B. "The 4+1 view model of architecture." Software, IEEE 12.6 (1995): 42-50.

[hbase6] Apache HBase, "The Apache HBase Reference Guide", apache.org, 2013

[hbase.www] An in-depth look at the HBase architecture – https://www.mapr.com/blog/in-depth-look-hbase-architecture

[HDFS08] Borthakur, D. (2008). HDFS architecture guide. Hadoop Apache Project. http://hadoop.apache.org/common/docs/current/hdfsdesign.pdf, 39.

[HDFS10] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. 2010. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST '10). IEEE Computer Society, Washington, DC, USA, 1-10. DOI: http://dx.doi.org/10.1109/MSST.2010.5496972
[kvs1] Sheng Li, Hyeontaek Lim, Victor W. Lee, Jung Ho Ahn, Anuj Kalia, Michael Kaminsky, David G. Andersen, O. Seongil, Sukhan Lee, and Pradeep Dubey. 2015. Architecting to achieve a billion requests per second throughput on a single key-value store server platform. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 476-488.
[lustre1] A Best Practice Analysis of HDF5 and NetCDF-4 Using Lustre (Christopher Bartz, Konstantinos Chasapis, Michael Kuhn, Petra Nerge, Thomas Ludwig), In High Performance Computing, Lecture Notes in Computer Science (9137), pp. 274-281, (Editors: Julian Martin Kunkel, Thomas Ludwig), Springer International Publishing (Switzerland), ISC 2015, Frankfurt, Germany, ISBN: 978-3-319-20118-4, ISSN: 0302-9743, 2015-06

[lustre2] Dynamically Adaptable I/O Semantics for High Performance Computing (Michael Kuhn), In High Performance Computing, Lecture Notes in Computer Science (9137), pp. 240-256, (Editors: Julian Martin Kunkel, Thomas Ludwig), Springer International Publishing (Switzerland), ISC 2015, Frankfurt, Germany, ISBN: 978-3-319-20118-4, ISSN: 0302-9743, 2015-06

[lustre3] POSIX.1 FAQ. The Open Group. 5 October 2011, available online: http://www.opengroup.org/austin/papers/posix_faq.html

[lustre4] LUSTRE™ FILE SYSTEM: High-Performance Storage Architecture and Scalable Cluster File System, White Paper, December 2007, available online: http://www.csee.ogi.edu/~zak/cs506-pslc/lustrefilesystem.pdf

[lustre5] Data Compression for Climate Data (Michael Kuhn, Julian Kunkel, Thomas Ludwig), In Supercomputing Frontiers and Innovations, Series: Volume 3, Number 1, pp. 75-94, (Editors: Jack Dongarra, Vladimir Voevodin), 2016-06
[memcache] https://memcached.org/
[miyamae] Miyamae, T., Nakao, T., & Shiozawa, K. (2014). Erasure code with shingled local parity groups for efficient recovery from multiple disk failures. In 10th Workshop on Hot Topics in System Dependability (HotDep 14).

[orangefs1] OrangeFS – http://www.orangefs.org

[pvfs1] PVFS – http://www.pvfs.org
[ramcloud1] S. M. Rumble, A. Kejriwal, and J. Ousterhout, "Log-structured memory for DRAM-based storage," in Proceedings of the 12th USENIX Conference on File and Storage Technologies (FAST 14), Santa Clara, CA, 2014, pp. 1-16.

[ramcloud2] D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum, "Fast crash recovery in RAMCloud," in Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, ser. SOSP '11, 2011, pp. 29-41.

[ramcloud3] J. Ousterhout, A. Gopalan, A. Gupta, A. Kejriwal, C. Lee, B. Montazeri, D. Ongaro, S. J. Park, H. Qin, M. Rosenblum, S. Rumble, R. Stutsman, and S. Yang, "The RAMCloud storage system," ACM Trans. Comput. Syst., vol. 33, no. 3, pp. 7:1-7:55, Aug. 2015.
[ramcloud4] Collin Lee, Seo Jin Park, Ankita Kejriwal, Satoshi Matsushita, and John Ousterhout. 2015. Implementing linearizability at large scale and low latency. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 71-86.

[ramcloud5] Ryan Stutsman, Collin Lee, and John Ousterhout. 2015. Experience with rules-based programming for distributed, concurrent, fault-tolerant code. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '15). USENIX Association, Berkeley, CA, USA, 17-30.

[ramcloud6] A. Kejriwal, A. Gopalan, A. Gupta, Z. Jia, S. Yang, and J. Ousterhout. SLIK: Scalable Low-Latency Indexes for a Key-Value Store. In USENIX Annual Technical Conference, Denver, CO, June 2016.
[redis] https://redis.io/
[Shvachko] Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). The Hadoop Distributed File System. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (pp. 1-10). IEEE. (http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5496972)

[ska1] SKA-SDP Preliminary Architecture Document

[Subramanian] Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, and Sanjeev Kumar. 2014. f4: Facebook's warm BLOB storage system. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, Berkeley, CA, USA, 383-398.

[TH01] Thomas Hollstegge, Distributed file systems, Seminar thesis for the seminar Parallel Programming and Parallel Algorithms, available online: http://is.uni-muenster.de/pi/lehre/ws0910/pppa/papers/dfs.pdf

[wang] Wang, F., Nelson, M., Oral, S., Atchley, S., Weil, S., Settlemyer, B. W., ... & Hill, J. (2013, November). Performance and scalability evaluation of the Ceph parallel file system. In Proceedings of the 8th Parallel Data Storage Workshop (pp. 14-19). ACM.
[webciteblob1] http://www.dell.com/downloads/global/products/pvaul/en/object-storage-overview.pdf
[webciteblob2] https://en.wikipedia.org/wiki/Object_storage
[webciteblob3] http://arstechnica.com/information-technology/2013/10/seagate-introduces-a-new-drive-interface-ethernet/
[webciteblob4] https://azure.microsoft.com/en-us/documentation/articles/fundamentals-introduction-to-azure/
[webciteglustr1] http://moo.nac.uci.edu/~hjm/fs/An_Introduction_To_Gluster_ArchitectureV7_110708.pdf
[webciteglustr2] https://access.redhat.com/documentation/en-US/Red_Hat_Storage/3.1/html/Administration_Guide/chap-Red_Hat_Storage_Architecture_and_Concepts.html
[websiteceph1] http://thenewstack.io/converging-storage-cephfs-now-production-ready/
[websiteceph] http://docs.ceph.com/docs/master/cephfs/
[weil1] Weil, S. A., Leung, A. W., Brandt, S. A., & Maltzahn, C. (2007, November). RADOS: A scalable, reliable storage service for petabyte-scale storage clusters. In Proceedings of the 2nd International Workshop on Petascale Data Storage: Held in Conjunction with Supercomputing '07 (pp. 35-44). ACM.

[weil2] Weil, S. A., Brandt, S. A., Miller, E. L., & Maltzahn, C. (2006, November). CRUSH: Controlled, scalable, decentralized placement of replicated data. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing (p. 122). ACM.

[weil3] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D., & Maltzahn, C. (2006, November). Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (pp. 307-320). USENIX Association.