Using big data in practice: Computational infrastructures

Gianluca Demartini, Information School

University of Sheffield

Big Data

•  Defined as Vs
    – Volume: Just about size: Giga-, Tera-, Petabytes
    – Variety: Formats: text, databases, pictures, Excel
    – Velocity: Speed: 10,000 tweets per second, 2,000 pictures on Instagram per second

Data is huge (Volume)

•  Facebook
    – processes 750 TB/day of data
    – adds 7 PB of photo storage/month

•  This requires computers (a lot of them)

•  Not only internet companies!
    – Banks, package delivery, governments, shops, etc.
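To get a feel for why this volume needs many machines, the daily figure above can be turned into a sustained rate. The sketch below is back-of-envelope arithmetic only; the 100 MB/s per-disk throughput is an assumed round number for illustration, not a figure from the slides.

```python
import math

TB = 10**12                      # decimal terabyte, in bytes
SECONDS_PER_DAY = 24 * 60 * 60

daily_bytes = 750 * TB           # the 750 TB/day quoted above
bytes_per_second = daily_bytes / SECONDS_PER_DAY

# Assumption for illustration only: one disk sustains about 100 MB/s.
ASSUMED_DISK_THROUGHPUT = 100 * 10**6

disks_needed = math.ceil(bytes_per_second / ASSUMED_DISK_THROUGHPUT)

print(f"{bytes_per_second / 10**9:.2f} GB/s sustained")
print(f"at least {disks_needed} disks reading in parallel")
```

Under that assumption, a single machine with a handful of disks cannot keep up, even before any computation is done on the data.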

Data is fast (Velocity)

•  Twitter firehose
    – In 2011, 1,000 Tweets per second (TPS)
    – In 2014, 20,000 TPS
    – With peaks: 143K TPS

•  Services on top
    – DataSift: aggregate, filter and extract insights

•  Not only internet companies!
    – Stock exchange, sensors in water network, etc.

Scale-up vs Scale-out

•  Scale-up
    – Increasing the power of your computer (i.e., disk, memory, processor)

•  Scale-out
    – Use many standard computers and distribute data and computation over them
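As a minimal sketch of the scale-out idea, the example below splits a dataset into chunks, lets each worker compute on its own chunk, and then combines the partial results. Threads stand in for separate machines here, and the function names `split_into_chunks` and `scale_out_sum` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_workers):
    """Deal items round-robin into n_workers chunks (one per 'machine')."""
    return [items[i::n_workers] for i in range(n_workers)]


def scale_out_sum(items, n_workers=4):
    """Each worker sums its own chunk; the partial sums are then combined."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


total = scale_out_sum(list(range(1, 101)), n_workers=4)
print(total)
```

Adding capacity in this model means raising `n_workers`, that is, adding more standard machines, rather than buying one bigger computer.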

Facebook Data Center (Sweden)

Fundamental work

•  Google File System, 2003
    – access to data using large clusters of commodity machines

•  BigTable, 2003-2006
    – data storage system
    – Distributed map Key -> Value

•  Map/Reduce, 2004
    – Programming paradigm over a cluster of machines

Open-source analogues

•  HDFS (Hadoop Distributed File System)
    – Distributed file system

•  Apache HBase http://hbase.apache.org/
    – Distributed database

•  Apache Hadoop http://hadoop.apache.org/
    – Distributed computation

C.L. Philip Chen, Chun-Yang Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Information Sciences, Volume 275, 10 August 2014, Pages 314-347, ISSN 0020-0255, http://dx.doi.org/10.1016/j.ins.2014.01.015. (http://www.sciencedirect.com/science/article/pii/S0020025514000346)

Hadoop Distributed File System (HDFS)

•  Inspired by Google File System

•  Scalable, distributed, portable file system written in Java for the Hadoop framework

•  Primary distributed storage used by Hadoop applications

•  HDFS can be part of a Hadoop cluster or can be a stand-alone general-purpose distributed file system

•  Reliability and fault tolerance ensured by replicating data across multiple hosts

•  ZooKeeper for the distributed coordination
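The replication point can be illustrated with a small sketch: each block of a file is stored on several distinct hosts, so losing some hosts still leaves a readable copy. This is a simplified round-robin placement, not HDFS's actual rack-aware policy; the host names, block names and helper functions are hypothetical. The replication factor of 3 matches the HDFS default.

```python
def place_replicas(block_ids, hosts, replication=3):
    """Assign each block to `replication` distinct hosts, round-robin."""
    if replication > len(hosts):
        raise ValueError("need at least as many hosts as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [hosts[(i + r) % len(hosts)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_hosts):
    """Blocks that still have at least one replica on a live host."""
    return [block for block, replicas in placement.items()
            if any(host not in failed_hosts for host in replicas)]


hosts = ["host-a", "host-b", "host-c", "host-d"]   # hypothetical host names
placement = place_replicas(["blk_0", "blk_1", "blk_2"], hosts)

# Two of the four hosts fail; every block keeps at least one live replica.
survivors = readable_blocks(placement, failed_hosts={"host-a", "host-b"})
print(survivors)
```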

MapReduce (MR)

•  High-level programming model and implementation for large-scale parallel data processing

•  Commodity hardware

•  Fault-tolerant

•  Currently, the most overhyped system in CS

MR Data Model

•  Files!

•  Each file is a set of (key, value) pairs

•  A map-reduce program:
    – Input: a set of (input key, value) pairs
    – Output: a set of (output key, value) pairs

k1 -> v1
k2 -> v2
k3 -> v3

FirstName -> John
LastName -> Smith
Role -> VP
Steps -> 12k

Step 1: the MAP phase

•  User provides the MAP function:
    – Input: one (input key, value)
    – Output: a set of (intermediate key, value) pairs

•  System applies the map function in parallel to all input pairs
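A minimal sketch of the MAP phase, assuming input pairs of (line number, line of text) and a user-supplied function that emits one (word, 1) pair per word. The function `map_fn` and the sample input are hypothetical; a thread pool stands in for the system applying the function in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def map_fn(key, value):
    """Hypothetical MAP function: key is a line number, value is a line of text.
    Emits one (word, 1) intermediate pair per word."""
    return [(word.lower(), 1) for word in value.split()]


input_pairs = [(0, "big data is big"), (1, "data is fast")]

# The "system" side: apply map_fn to every input pair, here with threads.
with ThreadPoolExecutor() as pool:
    per_input = list(pool.map(lambda kv: map_fn(*kv), input_pairs))

# Flatten the per-input lists into one list of intermediate pairs.
intermediate = [pair for pairs in per_input for pair in pairs]
print(intermediate)
```

Note that `map_fn` looks at one input pair at a time and shares no state with other calls, which is what allows the calls to run on different machines.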

Step 2: the REDUCE phase

•  User provides the REDUCE function:
    – Input: intermediate key, and set of values
    – Output: set of output values

•  System groups all pairs with the same key and passes the values to the REDUCE function
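The grouping and reduction can be sketched as follows, assuming intermediate pairs of the form (word, 1) and a REDUCE function that sums the values for each key. Both `group_by_key` and `reduce_fn` are hypothetical names used only for this illustration.

```python
from collections import defaultdict


def reduce_fn(key, values):
    """Hypothetical REDUCE function: sums the values seen for one key."""
    return [sum(values)]


def group_by_key(pairs):
    """The system's grouping step: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


intermediate = [("big", 1), ("data", 1), ("is", 1), ("big", 1), ("data", 1)]
grouped = group_by_key(intermediate)
output = {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}
print(output)
```

Each call to `reduce_fn` sees only one key, so different keys can be reduced on different machines.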

•  There is one master node

•  Master partitions the input file into M splits, by key

•  Master assigns workers (= servers) to the M map tasks and keeps track of their progress

•  Workers write their output to local disk, partitioned into R regions

•  Master assigns workers to the R reduce tasks

•  Reduce workers read regions from the map workers' local disks
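The two partitioning steps above can be sketched under some stated assumptions: the input is cut into M contiguous splits (a simplification of splitting by key), and each intermediate key is assigned to one of R regions by hashing it modulo R. The helper names are hypothetical, and `zlib.crc32` is used only because it gives the same result on every machine, which Python's built-in `hash` for strings does not.

```python
import zlib


def make_splits(records, m):
    """Master side: cut the input into exactly M contiguous splits."""
    n = len(records)
    return [records[i * n // m:(i + 1) * n // m] for i in range(m)]


def region_for(key, r):
    """Map-worker side: choose one of R regions for an intermediate key."""
    return zlib.crc32(key.encode("utf-8")) % r


def partition_output(pairs, r):
    """Write each intermediate pair into the region chosen for its key."""
    regions = [[] for _ in range(r)]
    for key, value in pairs:
        regions[region_for(key, r)].append((key, value))
    return regions


splits = make_splits(list(range(10)), m=3)
regions = partition_output([("a", 1), ("b", 1), ("a", 1)], r=2)
print(splits)
print(regions)
```

Because every map worker uses the same `region_for`, all pairs for a given key end up in the same region number, so a single reduce worker can collect them from all map workers.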

Example: MR word length count
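A small single-machine sketch of this example, assuming the task is to count how many words of each exact length occur across a set of documents (the original slide may have grouped lengths differently). The documents and function names are hypothetical.

```python
from collections import defaultdict


def map_word_length(doc_id, text):
    """MAP: emit (word length, 1) for every word in the document."""
    return [(len(word), 1) for word in text.split()]


def reduce_count(length, ones):
    """REDUCE: count how many words had this length."""
    return sum(ones)


def word_length_count(documents):
    grouped = defaultdict(list)
    for doc_id, text in documents.items():              # map phase
        for length, one in map_word_length(doc_id, text):
            grouped[length].append(one)                 # group by key
    return {length: reduce_count(length, ones)          # reduce phase
            for length, ones in sorted(grouped.items())}


docs = {"d1": "big data is big", "d2": "data is fast"}
result = word_length_count(docs)
print(result)  # {2: 2, 3: 2, 4: 3}
```

Here the intermediate key is the word length rather than the word itself, so "data" and "fast" contribute to the same count.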

MapReduce (MR) tools

•  MR implementation

•  MR Query Language

•  MR Query Engine

Pig Latin

•  High-level language

•  SQL-like

•  For very large datasets; translated into Map and Reduce tasks

O'Reilly® Programming Pig, by Alan Gates

Commercial products

•  Hadoop as catch-all big-data solution
    – Small clusters (10s of machines)
    – Massive in-house clusters
    – Public cloud

•  Hortonworks
    – Consultancy company for Apache Hadoop

•  Cloudera

•  HP Vertica Analytics Platform

Stream processing Big Data tools

•  Batch vs Stream processing
    – Batch: data is large but always available on disk
    – Stream: data is arriving fast and cannot be stored / needs to be processed immediately

•  Streams of data
    – Twitter
    – Internet of Things
    – Smart Cities
    – Nuclear power plant
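The difference between the two modes can be sketched with a simple average. In the batch version the whole dataset is read from storage; in the streaming version each value is processed as it arrives and only a count and the current mean are kept. The sensor readings and function names are hypothetical.

```python
def batch_mean(stored_values):
    """Batch: the whole dataset is available, so read it all and compute."""
    return sum(stored_values) / len(stored_values)


def streaming_mean(stream):
    """Stream: update a running mean per item and yield a result each time.
    Only a count and the current mean are kept, never the items themselves."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count
        yield mean


readings = [10, 20, 30, 40]                 # hypothetical sensor readings
running = list(streaming_mean(iter(readings)))
print(batch_mean(readings))  # 25.0
print(running)               # [10.0, 15.0, 20.0, 25.0]
```

The streaming version produces an up-to-date answer after every reading, which is what a monitoring application on a sensor network would need.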

Apache Spark

•  https://spark.apache.org/, databricks.com

•  General system for large-scale data processing

•  Faster (in-memory), interactive

•  Version 1.6.1 released in March 2016

Conclusion

•  If it fits in memory, it's not big data

•  If it's big data, you cannot use Excel/SPSS/R

•  The way to go is scale-out and distributed computation

•  Map/Reduce for batch processing

•  Hadoop or Spark as out-of-the-box systems
    – Also with high-level query interfaces like Pig/R/Python
