Post on 25-Sep-2020
Using big data in practice: Computational infrastructures
Gianluca Demartini, Information School
University of Sheffield
Big Data
• Defined as Vs
  – Volume: Just about size, Giga, Tera, Petabytes
  – Variety: Formats, text, databases, pictures, Excel
  – Velocity: Speed, 10000 tweets per second, 2000 pictures on Instagram per second
Data is huge (Volume)
• Facebook
  – processes 750 TB/day of data
  – adds 7 PB of photo storage/month
• This requires computers (a lot of them)
• Not only internet companies!
  – Banks, package delivery, governments, shops, etc.
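A rough back-of-the-envelope calculation suggests why this needs a lot of computers. The 750 TB/day figure is the one quoted above; the 100 MB/s sequential read rate for a single disk is an assumed round number chosen for illustration, not a measurement.

```python
import math

# From the slide: about 750 TB of data processed per day (decimal units).
DAILY_VOLUME_BYTES = 750 * 10**12

# Assumption (not from the slides): one disk reads about 100 MB/s sequentially.
ASSUMED_DISK_BYTES_PER_SECOND = 100 * 10**6

SECONDS_PER_DAY = 24 * 60 * 60


def days_to_scan(volume_bytes: int, bytes_per_second: int) -> float:
    """Days a single disk needs to read the given volume once."""
    return volume_bytes / bytes_per_second / SECONDS_PER_DAY


def disks_needed_per_day(volume_bytes: int, bytes_per_second: int) -> int:
    """Minimum number of disks reading in parallel to finish within one day."""
    return math.ceil(days_to_scan(volume_bytes, bytes_per_second))


single_disk_days = days_to_scan(DAILY_VOLUME_BYTES, ASSUMED_DISK_BYTES_PER_SECOND)
parallel_disks = disks_needed_per_day(DAILY_VOLUME_BYTES, ASSUMED_DISK_BYTES_PER_SECOND)

print(f"One disk: {single_disk_days:.1f} days to read one day's data")  # 86.8
print(f"Disks needed just to keep up: {parallel_disks}")                # 87
```

Under that assumption, a single disk falls almost three months behind for every day of data, before any actual computation happens.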
Data is fast (Velocity)
• Twitter firehose
  – In 2011, 1000 Tweets per second (TPS)
  – In 2014, 20000 TPS
  – With peaks: 143K TPS
• Services on top
  – DataSift: aggregate, filter and extract insights
• Not only internet companies!
  – Stock exchange, sensors in water network, etc.
Scale-up vs Scale-out
• Scale-up
  – Increasing the power of your computer (i.e., disk, memory, processor)
• Scale-out
  – Use many standard computers and distribute data and computation over them
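The scale-out idea can be sketched on a single machine: split the data, let each "machine" work only on its own share, then combine the partial results. The helper names below are hypothetical, and the list comprehension stands in for work that a real cluster would run in parallel on separate computers.

```python
from typing import List


def split_into_chunks(data: List[int], n_machines: int) -> List[List[int]]:
    """Distribute the data round-robin over n_machines (hypothetical helper)."""
    return [data[i::n_machines] for i in range(n_machines)]


def local_sum(chunk: List[int]) -> int:
    """Work done independently on one machine, using only its own chunk."""
    return sum(chunk)


def scale_out_sum(data: List[int], n_machines: int) -> int:
    """Compute partial results per machine, then combine them."""
    chunks = split_into_chunks(data, n_machines)
    # On a real cluster these calls would run at the same time on different hosts.
    partial_results = [local_sum(chunk) for chunk in chunks]
    return sum(partial_results)


data = list(range(1, 101))
total = scale_out_sum(data, n_machines=4)
print(total)  # 5050
```

Adding machines changes only `n_machines`, not the program, which is the practical appeal of scale-out over buying one bigger computer.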
Facebook Data Center (Sweden)
Data
Fundamental work
• Google File System, 2003
  – access to data using large clusters of commodity machines
• BigTable, 2003-2006
  – data storage system
  – Distributed map Key -> Value
• Map/Reduce, 2004
  – Programming paradigm over a cluster of machines
Open-source analogues
• HDFS (Hadoop File System)
  – Distributed File System
• Apache HBase http://hbase.apache.org/
  – Distributed database
• Apache Hadoop http://hadoop.apache.org/
  – Distributed computation
C.L. Philip Chen, Chun-Yang Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on Big Data, Information Sciences, Volume 275, 10 August 2014, Pages 314-347, ISSN 0020-0255, http://dx.doi.org/10.1016/j.ins.2014.01.015. (http://www.sciencedirect.com/science/article/pii/S0020025514000346)
Hadoop Distributed File System (HDFS)
• Inspired by Google File System
• Scalable, distributed, portable file system written in Java for the Hadoop framework
• Primary distributed storage used by Hadoop applications
• HDFS can be part of a Hadoop cluster or can be a stand-alone general-purpose distributed file system
• Reliability and fault tolerance ensured by replicating data across multiple hosts
• Zookeeper for the distributed coordination
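To illustrate how replication across hosts gives fault tolerance, here is a minimal sketch. It uses three replicas per block, which is HDFS's usual default, but the round-robin placement, the host names and both function names are simplifying assumptions for illustration; this is not HDFS's actual placement policy or API.

```python
from typing import Dict, List, Set


def place_replicas(n_blocks: int, hosts: List[str], replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct hosts, round-robin.

    Simplified stand-in: real HDFS placement also considers racks and load.
    """
    if replication > len(hosts):
        raise ValueError("need at least as many hosts as replicas")
    placement = {}
    for block in range(n_blocks):
        placement[block] = [hosts[(block + r) % len(hosts)] for r in range(replication)]
    return placement


def readable_blocks(placement: Dict[int, List[str]], failed_hosts: Set[str]) -> List[int]:
    """Blocks that still have at least one replica on a live host."""
    return [
        block
        for block, replicas in placement.items()
        if any(host not in failed_hosts for host in replicas)
    ]


hosts = ["host-a", "host-b", "host-c", "host-d", "host-e"]  # hypothetical host names
placement = place_replicas(n_blocks=10, hosts=hosts)

# With three replicas per block, losing any two hosts leaves every block readable.
still_readable = readable_blocks(placement, failed_hosts={"host-a", "host-b"})
print(len(still_readable), "of", len(placement), "blocks readable")  # 10 of 10 blocks readable
```

Losing as many hosts as there are replicas can make a block unreadable, which is why a real system re-replicates blocks as soon as a host is detected as failed.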
MapReduce (MR)
• High-level programming model and implementation for large-scale parallel data processing
• Commodity hardware
• Fault-tolerant
• Currently, the most overhyped system in CS
MR Data Model
• Files!
• Each file is a set of (key, value) pairs
• A map-reduce program:
  – Input: a set of (input key, value) pairs
  – Output: a set of (output key, value) pairs
k1 -> v1
k2 -> v2
k3 -> v3
FirstName -> John
LastName -> Smith
Role -> VP
Steps -> 12k
Step 1: the MAP phase
• User provides the MAP function:
  – Input: one (input key, value)
  – Output: a set of (intermediate key, value) pairs
• System applies map function in parallel to all input pairs
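The MAP phase can be sketched in a few lines. The word-emitting `map_fn`, the `run_map_phase` helper and the two sample documents are all hypothetical, and a thread pool stands in for the cluster workers that a real MapReduce system would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def map_fn(input_key: str, value: str) -> List[Tuple[str, int]]:
    """User-provided MAP: one (input key, value) in, intermediate pairs out."""
    return [(word.lower(), 1) for word in value.split()]


def run_map_phase(inputs: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Apply map_fn to every input pair; threads stand in for cluster workers."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        per_input = pool.map(lambda pair: map_fn(*pair), inputs)
        # Flatten the per-input lists into one list of intermediate pairs.
        return [kv for pairs in per_input for kv in pairs]


inputs = [("doc1", "big data is big"), ("doc2", "data is fast")]
intermediate = run_map_phase(inputs)
print(intermediate)
```

Because `map_fn` looks at one input pair at a time and shares no state, the system is free to run the calls in any order and on any machine.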
Step 2: the REDUCE phase
• User provides the REDUCE function:
  – Input: intermediate key, and set of values
  – Output: set of output values
• System groups all pairs with same key and passes values to the REDUCE function
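As a minimal sketch of the REDUCE phase, the code below groups intermediate pairs by key and hands each group to a user-provided function. The per-shop sales data and the summing `reduce_fn` are made up for illustration; the grouping step is the part the system would do.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reduce_fn(key: str, values: List[int]) -> List[int]:
    """User-provided REDUCE: an intermediate key and its values in, output values out."""
    return [sum(values)]


def run_reduce_phase(intermediate: List[Tuple[str, int]]) -> Dict[str, List[int]]:
    """System side: group all pairs by key, then call reduce_fn once per key."""
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}


# Hypothetical intermediate pairs, as a MAP phase might have produced them.
intermediate = [("shop-a", 10), ("shop-b", 5), ("shop-a", 7), ("shop-c", 1), ("shop-b", 5)]
output = run_reduce_phase(intermediate)
print(output)  # {'shop-a': [17], 'shop-b': [10], 'shop-c': [1]}
```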
MR Job
• There is one master node
• Master partitions input file into M splits, by key
• Master assigns workers (= servers) to the M map tasks, keeps track of their progress
• Workers write their output to local disk, partitioned into R regions
• Master assigns workers to the R reduce tasks
• Reduce workers read regions from the map workers' local disks
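The partitioning into R regions can be sketched as follows. The slides do not say how a key is assigned to a region; hashing the key modulo R is a common choice and is assumed here. All function names are hypothetical, and in-memory lists stand in for the map workers' local disks.

```python
import zlib
from typing import List, Tuple

Pair = Tuple[str, int]


def region_of(key: str, n_regions: int) -> int:
    """Pick the region for a key.

    crc32 is used instead of Python's built-in hash() because hash() of a
    string is salted per process, and every worker must agree on the region.
    """
    return zlib.crc32(key.encode("utf-8")) % n_regions


def partition_map_output(pairs: List[Pair], n_regions: int) -> List[List[Pair]]:
    """What one map worker does: split its output into R regions."""
    regions: List[List[Pair]] = [[] for _ in range(n_regions)]
    for key, value in pairs:
        regions[region_of(key, n_regions)].append((key, value))
    return regions


def fetch_for_reducer(all_map_outputs: List[List[List[Pair]]], reducer_index: int) -> List[Pair]:
    """What reduce worker r does: read region r from every map worker."""
    return [pair for regions in all_map_outputs for pair in regions[reducer_index]]


R = 3
map_worker_outputs = [
    partition_map_output([("a", 1), ("b", 1), ("c", 1)], R),
    partition_map_output([("a", 1), ("c", 1), ("d", 1)], R),
]
reducer_inputs = [fetch_for_reducer(map_worker_outputs, r) for r in range(R)]
for r, pairs in enumerate(reducer_inputs):
    print(f"reduce task {r}: {pairs}")
```

The property that matters is that all pairs with the same key end up at the same reduce task, no matter which map worker produced them.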
Example: MR word length count
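The figures for this example did not survive in the transcript, so the following is a hedged reconstruction of what a word length count could look like: the map step emits (word length, 1) for every word and the reduce step sums the ones per length. The sample documents and function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_fn(doc_id: str, text: str) -> List[Tuple[int, int]]:
    """Emit (word length, 1) for every word in the document."""
    return [(len(word), 1) for word in text.split()]


def reduce_fn(length: int, counts: List[int]) -> Tuple[int, int]:
    """Sum the ones to get how many words have this length."""
    return (length, sum(counts))


def word_length_count(documents: Dict[str, str]) -> Dict[int, int]:
    """Run map, group by key, then reduce, all in one process."""
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for length, one in map_fn(doc_id, text):
            groups[length].append(one)
    return dict(reduce_fn(length, counts) for length, counts in sorted(groups.items()))


documents = {"doc1": "big data is big", "doc2": "data is fast"}  # hypothetical input
result = word_length_count(documents)
print(result)  # {2: 2, 3: 2, 4: 3}
```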
MapReduce (MR) tools
• MR implementation
• MR Query Language
• MR Query Engine
Pig Latin
• High-level language
• SQL-like
• Very large datasets, translated into Map and Reduce tasks
O'Reilly® Programming Pig, by Alan Gates
Commercial products
• Hadoop as catch-all big-data solution
  – Small clusters (10s of machines)
  – Massive in-house clusters
  – Public Cloud
• Hortonworks
  – Consultancy company for Apache Hadoop
• Cloudera
• HP Vertica Analytics Platform
Stream processing Big Data tools
• Batch vs Stream processing
  – Batch: data is large but always available on disk
  – Stream: data is arriving fast and cannot be stored, so it needs to be processed immediately
• Streams of data
  – Twitter
  – Internet of Things
  – Smart Cities
  – Nuclear power plant
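The difference between the two styles can be sketched with a simple average. The batch version needs the whole dataset at once; the streaming version sees one value at a time and keeps only a count and a running total. The `sensor_readings` source and its values are hypothetical.

```python
from typing import Iterable, Iterator, List


def batch_mean(values: List[float]) -> float:
    """Batch: all data is already stored and can be read in one go."""
    return sum(values) / len(values)


def streaming_mean(stream: Iterable[float]) -> Iterator[float]:
    """Stream: update the answer per arriving value, storing only two numbers."""
    count = 0
    total = 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def sensor_readings() -> Iterator[float]:
    """Hypothetical data source standing in for a live sensor feed."""
    yield from [2.0, 4.0, 6.0, 8.0]


print(batch_mean(list(sensor_readings())))      # 5.0
print(list(streaming_mean(sensor_readings())))  # [2.0, 3.0, 4.0, 5.0]
```

The streaming version gives an up-to-date answer after every reading and never needs the earlier values again, which is what makes it usable when the data cannot be stored.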
Apache Spark
• https://spark.apache.org/, databricks.com
• General system for large-scale data processing
• Faster (in-memory), interactive
• Version 1.6.1 released in March 2016
Conclusion
• If it fits in memory, it's not big data
• If it's big data, you cannot use Excel/SPSS/R
• The way to go is scale-out and distributed computation
• Map/Reduce for batch processing
• Hadoop or Spark as out-of-the-box systems
  – Also with high-level query interfaces like Pig/R/Python