CompSci 516: Database Systems
Lecture 20: Parallel DBMS
Instructor: Sudeepa Roy
Duke CS, Fall 2017
Announcements
• HW3 due on Monday, Nov 20, 11:55 pm (in 2 weeks)
  – See some clarifications on Piazza
Reading Material
• [RG]
  – Parallel DBMS: Chapter 22.1-22.5
  – Distributed DBMS: Chapter 22.6-22.14
• [GUW]
  – Parallel DBMS and map-reduce: Chapter 20.1-20.2
  – Distributed DBMS: Chapter 20.3, 20.4.1-20.4.2, 20.5-20.6
• Recommended readings:
  – Chapter 2 (Sections 1, 2, 3) of Mining of Massive Datasets, by Rajaraman and Ullman: http://i.stanford.edu/~ullman/mmds.html
  – Original Google MR paper by Jeff Dean and Sanjay Ghemawat, OSDI'04: http://research.google.com/archive/mapreduce.html

Acknowledgement: The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke.
Parallel and Distributed Data Processing
• Recall from Lecture 18!
• Data and operation distribution if we have multiple machines
• Parallelism
  – performance
• Data distribution
  – increased availability, e.g. when a site goes down
  – distributed local access to data (e.g. an organization may have branches in several cities)
  – analysis of distributed data
Parallel vs. Distributed DBMS (Lecture 18)

Parallel DBMS
• Parallelization of various operations
  – e.g. loading data, building indexes, evaluating queries
• Data may or may not be distributed initially
• Distribution is governed by performance considerations

Distributed DBMS
• Data is physically stored across different sites
  – Each site is typically managed by an independent DBMS
• Location of data and autonomy of sites have an impact on query optimization, concurrency control, and recovery
• Also governed by other factors:
  – increased availability under system crash
  – local ownership and access
Why Parallel Access To Data?
• Scanning 1 TB of data at 10 MB/s takes about 1.2 days.
• With 1,000-way parallelism, the same scan takes about 1.5 minutes.
• Parallelism: divide a big problem into many smaller ones to be solved in parallel.
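A quick back-of-the-envelope check of those numbers (taking 1 TB ≈ 10^12 bytes and 10 MB/s = 10^7 bytes/s):
\[
\frac{10^{12}\ \text{bytes}}{10^{7}\ \text{bytes/s}} = 10^{5}\ \text{s} \approx 1.2\ \text{days},
\qquad
\frac{10^{5}\ \text{s}}{1000} = 100\ \text{s} \approx 1.5\text{--}2\ \text{minutes}.
\]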
Parallel DBMS
• Parallelism is natural to DBMS processing
  – Pipeline parallelism: many machines each doing one step in a multi-step process.
  – Data-partitioned parallelism: many machines doing the same thing to different pieces of data.
  – Both are natural in DBMS!

[Figure: pipeline parallelism chains sequential programs one after another; partitioned parallelism runs copies of a sequential program side by side, with outputs split N ways and inputs merged M ways]
DBMS: The Parallel Success Story
• DBMSs are the most successful application of parallelism
  – Teradata (1979), Tandem (1974, later acquired by HP), ...
  – Every major DBMS vendor has some parallel server
• Reasons for success:
  – Bulk-processing (= partition parallelism)
  – Natural pipelining
  – Inexpensive hardware can do the trick
  – Users/app-programmers don't need to think in parallel
Some || Terminology
• Speed-Up
  – More resources means proportionally less time for a given amount of data.
• Scale-Up
  – If resources are increased in proportion to the increase in data size, time is constant.

[Ideal graphs: throughput (#ops/sec) vs. #CPUs (degree of ||-ism) is a straight line for linear speed-up; response time (sec/op) vs. #CPUs + database size is flat for linear scale-up]
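One common way to make these definitions precise (a standard formalization, not spelled out on the slide; T denotes elapsed time and D the data size):
\[
\text{speed-up}(N) \;=\; \frac{T(1\ \text{processor},\, D)}{T(N\ \text{processors},\, D)} \qquad \text{ideal: } N
\]
\[
\text{scale-up}(N) \;=\; \frac{T(1\ \text{processor},\, D)}{T(N\ \text{processors},\, N \cdot D)} \qquad \text{ideal: } 1
\]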
Some || Terminology
• In practice, speed-up and scale-up fall short of the ideal due to overhead in parallel processing:
• Start-up cost
  – Starting the operation on many processors; might need to distribute data
• Interference
  – Different processors may compete for the same resources
• Skew
  – The slowest processor (e.g. with a huge fraction of data) may become the bottleneck

[In-practice graphs: actual speed-up and scale-up curves are sub-linear, falling below the ideal linear curves]
Architecture for Parallel DBMS
• Among different computing units:
  – Whether memory is shared
  – Whether disk is shared
Basics of Parallelism
• Units: a collection of processors
  – assume each always has a local cache
  – may or may not have local memory or disk (next)
• A communication facility to pass information among processors
  – a shared bus or a switch
Shared Memory
[Figure: processors (P) connected through an interconnection network to a global shared memory and to the disks (D)]
Shared Disk
[Figure: processors (P), each with its own local memory (M), connected through an interconnection network to shared disks (D)]
Shared Nothing
[Figure: processors (P), each with its own local memory (M) and local disk (D), connected only by an interconnection network]
• No two CPUs can access the same storage area
• All communication goes through a network connection
Architecture: At A Glance

Shared Memory (SMP)
• Easy to program
• Expensive to build
• Low communication overhead: shared memory
• Difficult to scale up (memory contention)
• Examples: Sequent, SGI, Sun

Shared Disk
• A trade-off, but still interference as in shared-memory (contention for memory and network bandwidth)
• Examples: VMScluster, Sysplex

Shared Nothing (network)
• Hard to program and design parallel algorithms
• Cheap to build
• Easy to scale up and speed up
• Considered to be the best architecture
• Examples: Tandem, Teradata, SP2

We will assume shared nothing.
What Systems Worked This Way (as of 9/1995)

Shared Nothing
• Teradata: 400 nodes
• Tandem: 110 nodes
• IBM / SP2 / DB2: 128 nodes
• Informix / SP2: 48 nodes
• ATT & Sybase: ? nodes

Shared Disk
• Oracle: 170 nodes
• DEC Rdb: 24 nodes

Shared Memory
• Informix: 9 nodes
• RedBrick: ? nodes
Different Types of DBMS Parallelism
• Intra-operator parallelism
  – get all machines working to compute a given operation (scan, sort, join)
  – OLAP (decision support)
• Inter-operator parallelism
  – each operator may run concurrently on a different site (exploits pipelining)
  – For both OLAP and OLTP
• Inter-query parallelism
  – different queries run on different sites
  – For OLTP
• We'll focus on intra-operator parallelism

[Figure: a query plan with join (⨝) and group-by (𝝲) operators, illustrating where the different forms of parallelism apply]

Ack: Slide by Prof. Dan Suciu
Data Partitioning
Horizontally partitioning a table (why horizontal?):
• Shared disk and shared memory are less sensitive to partitioning; shared nothing benefits from "good" partitioning.

Range-partition (e.g. A...E | F...J | K...N | O...S | T...Z)
• Good for equijoins, range queries, group-by
• Can lead to data skew

Hash-partition
• Good for equijoins, but only if hashed on that attribute
• Can lead to data skew

Block-partition (or Round Robin)
• Send the i-th tuple to the (i mod n)-th processor
• Good to spread load
• Good when the entire relation is accessed
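A minimal Python sketch of the three schemes, with each partition modeled as a plain list (the function names and the toy relation are illustrative, not from any particular system):

```python
# Sketch of range, hash, and round-robin (block) partitioning of a relation
# across n machines. Partitions are plain lists here; in a shared-nothing
# DBMS each list would live on a different node.

def range_partition(tuples, key, boundaries):
    """boundaries = sorted split points; a tuple goes to the first range containing its key."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        i = sum(key(t) > b for b in boundaries)  # index of the range containing key(t)
        parts[i].append(t)
    return parts

def hash_partition(tuples, key, n):
    """Good for equijoins/group-by on the hashed attribute; can skew if one value dominates."""
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(key(t)) % n].append(t)
    return parts

def round_robin_partition(tuples, n):
    """The i-th tuple goes to processor i mod n; spreads load uniformly."""
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)
    return parts

if __name__ == "__main__":
    R = [(k, chr(ord('A') + 2 * k)) for k in range(10)]  # toy R(key, name): A, C, E, ...
    print(range_partition(R, key=lambda t: t[1], boundaries=['E', 'J', 'N', 'S']))
    print(hash_partition(R, key=lambda t: t[0], n=3))
    print(round_robin_partition(R, n=3))
```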
Example
• R(Key, A, B)
• Can block-partition be skewed?
  – no, uniform
• Can hash-partition be skewed?
  – on the key: uniform with a good hash function
  – on A: may be skewed
    • e.g. when all tuples have the same A-value
Parallelizing Sequential Evaluation Code
• "Streams" from different disks or the output of other operators
  – are "merged" as needed as input to some operator
  – are "split" as needed for subsequent parallel processing
• Different split and merge operations appear in addition to relational operators
• No fixed formula for conversion
• Next: parallelizing individual operations
Parallel Scans
• Scan in parallel, and merge.
• Selection may not require all sites for range or hash partitioning
  – but may lead to skew
  – Suppose σ_{A=10}(R) and R is partitioned according to A
  – Then all matching tuples are in the same partition/processor
• Indexes can be built at each partition
Parallel Sorting
Idea:
• Scan in parallel, and range-partition as you go
  – e.g. salary between 10 and 210, #processors = 20
  – salary in first processor: 10-20, second: 21-30, third: 31-40, ...
• As tuples come in, begin "local" sorting on each
• Resulting data is sorted, and range-partitioned
• Visit the processors in order to get a full sorted order
• Problem: skew!
• Solution: "sample" the data at the start to determine partition points.
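A toy Python sketch of this idea, where each "processor" is just a list and the partition points are chosen by sampling up front to reduce skew (the sample size and helper names are made up for illustration):

```python
import random

def choose_partition_points(partitions, n_procs, sample_size=100):
    """Sample the data up front to pick range boundaries that balance load (avoid skew)."""
    values = [v for p in partitions for v in p]
    sample = sorted(random.sample(values, min(sample_size, len(values))))
    step = len(sample) // n_procs
    return [sample[i * step] for i in range(1, n_procs)]  # n_procs - 1 split points

def parallel_sort(partitions, n_procs):
    boundaries = choose_partition_points(partitions, n_procs)
    # Range-partition as we scan, then sort locally at each processor.
    out = [[] for _ in range(n_procs)]
    for p in partitions:
        for v in p:
            out[sum(v > b for b in boundaries)].append(v)
    for o in out:
        o.sort()  # "local" sort on each processor
    # Visiting the processors in order yields a fully sorted result.
    return [v for o in out for v in o]

if __name__ == "__main__":
    data = [[random.randint(10, 210) for _ in range(50)] for _ in range(3)]
    assert parallel_sort(data, n_procs=4) == sorted(v for p in data for v in p)
```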
Parallel Joins
• Need to send the tuples that will join to the same machine
  – also for GROUP BY
• Nested loop:
  – Each outer tuple must be compared with each inner tuple that might join
  – Easy for range partitioning on join cols, hard otherwise
• Sort-Merge:
  – Sorting gives range-partitioning
  – Merging partitioned tables is local
Parallel Hash Join
• In the first phase, partitions get distributed to different sites:
  – A good hash function automatically distributes work evenly
• Do the second phase at each site.
• Almost always the winner for equi-join

[Figure (Phase 1): the original relations (R then S) are read from disk, hashed by a hash function h into B-1 partitions using B main memory buffers, and the partitions are written back to disk]
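A minimal Python sketch of the two phases, assuming R and S are lists of (key, payload) tuples and each "site" is just a list (the function names are illustrative):

```python
from collections import defaultdict

def repartition(tuples, n_sites, key=lambda t: t[0]):
    """Phase 1: send each tuple to site hash(join key) mod n_sites.
    A good hash function spreads the work evenly across sites."""
    sites = [[] for _ in range(n_sites)]
    for t in tuples:
        sites[hash(key(t)) % n_sites].append(t)
    return sites

def local_hash_join(r_part, s_part):
    """Phase 2: an ordinary sequential hash join, run independently at each site."""
    ht = defaultdict(list)
    for r in r_part:
        ht[r[0]].append(r)
    return [(r, s) for s in s_part for r in ht.get(s[0], [])]

def parallel_hash_join(R, S, n_sites=3):
    r_sites = repartition(R, n_sites)
    s_sites = repartition(S, n_sites)  # same hash fn, so matching keys co-locate
    out = []
    for r_part, s_part in zip(r_sites, s_sites):
        out.extend(local_hash_join(r_part, s_part))
    return out

if __name__ == "__main__":
    R = [(k, "r%d" % k) for k in range(6)]
    S = [(k % 4, "s%d" % k) for k in range(8)]
    print(parallel_hash_join(R, S))
```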
Dataflow Network for Parallel Join
• Good use of split/merge makes it easier to build parallel versions of sequential join code.

Example with parallel hash join between A and B
[Figure: tuples of A and B on Machine 0 and Machine 1 are split by h mod 2; tuples with h mod 2 = 0 are routed to Machine 0 and tuples with h mod 2 = 1 to Machine 1, where the local joins run]

Jim Gray & Gordon Bell: VLDB 95 Parallel Database Systems Survey
Parallel Aggregates
[Figure: a table range-partitioned into A...E | F...J | K...N | O...S | T...Z; a local Count at each partition feeds one global Count]
• For each aggregate function, need a decomposition:
  – count(S) = Σ count(s(i)), ditto for sum()
  – avg(S) = (Σ sum(s(i))) / Σ count(s(i))
  – and so on...
• For group-by:
  – Sub-aggregate groups close to the source.
  – Pass each sub-aggregate to its group's site.
    • Chosen via a hash fn.
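A small Python sketch of the avg decomposition, assuming each partition is a list of numeric values (the function names are illustrative):

```python
def local_subaggregate(partition):
    """Computed at each site, close to the data: only a (sum, count) pair is shipped."""
    return (sum(partition), len(partition))

def global_avg(subaggregates):
    """avg(S) = (sum of local sums) / (sum of local counts)."""
    total = sum(s for s, _ in subaggregates)
    count = sum(c for _, c in subaggregates)
    return total / count

if __name__ == "__main__":
    partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(global_avg([local_subaggregate(p) for p in partitions]))  # 5.0
```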
Which SQL aggregate operators are not good for parallel execution?
• Why?

Best serial plan may not be best parallel plan
• Trivial counter-example:
  – Table partitioned with a local secondary index at two nodes
  – Range query: all of node 1 and 1% of node 2.
  – Node 1 should do a scan of its partition.
  – Node 2 should use the secondary index.
[Figure: the node holding N..Z does a table scan; the node holding A..M does an index scan]
Example Problem: Parallel DBMS
• R(a,b) is horizontally partitioned across N = 3 machines.
• Each machine locally stores approximately 1/N of the tuples in R.
• The tuples are randomly organized across machines (i.e., R is block-partitioned across machines).
• Show an RA plan for this query and how it will be executed across the N = 3 machines.
• Pick an efficient plan that leverages the parallelism as much as possible.

SELECT a, max(b) as topb
FROM R
WHERE a > 0
GROUP BY a
We did this example for Map-Reduce in Lecture 12!

The plan is built up step by step. Machines 1, 2, and 3 each hold 1/3 of R(a,b); the query is SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a.
1. scan: each machine scans its local 1/3 of R (if more than one relation is stored on a machine, then "scan S", "scan R", etc.)
2. σ_{a>0}: each machine filters its tuples locally
3. γ_{a, max(b) → b}: each machine computes a local (partial) max per group
4. Hash on a: results are re-shuffled so that all partial results with the same value of a reach the same machine
5. γ_{a, max(b) → topb}: each machine computes the final aggregate for the groups it received
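A Python sketch of the plan above, with each machine modeled as a list of (a, b) pairs; shuffle_by_hash stands in for the "Hash on a" exchange step (all names here are illustrative):

```python
def local_phase(partition):
    """scan -> sigma_{a>0} -> gamma_{a, max(b) -> b}, all local to one machine."""
    groups = {}
    for a, b in partition:                        # scan
        if a > 0:                                 # filter a > 0
            groups[a] = max(groups.get(a, b), b)  # local partial max per group
    return list(groups.items())                   # partial results (a, local max of b)

def shuffle_by_hash(partial_results, n_machines):
    """'Hash on a': route every partial result for the same a to the same machine."""
    dests = [[] for _ in range(n_machines)]
    for a, b in partial_results:
        dests[hash(a) % n_machines].append((a, b))
    return dests

def parallel_group_max(machines):
    n = len(machines)
    shuffled = [[] for _ in range(n)]
    for m in machines:
        for i, chunk in enumerate(shuffle_by_hash(local_phase(m), n)):
            shuffled[i].extend(chunk)
    # final gamma_{a, max(b) -> topb} on each machine
    result = {}
    for machine in shuffled:
        for a, b in machine:
            result[a] = max(result.get(a, b), b)
    return result

if __name__ == "__main__":
    machines = [[(1, 10), (2, 5), (-1, 99)], [(1, 7), (3, 2)], [(2, 8), (3, 9)]]
    print(parallel_group_max(machines))  # {1: 10, 2: 8, 3: 9} (key order may vary)
```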
Benefit of Hash-Partitioning
• What would change if we hash-partitioned R on R.a before executing the same query, on the previous parallel DBMS plan and in MR?
• First, the parallel DBMS:

SELECT a, max(b) as topb
FROM R
WHERE a > 0
GROUP BY a
Previously, with block-partitioned R, every machine needed all five steps: scan → σ_{a>0} → γ_{a,max(b)→b} → hash on a → γ_{a,max(b)→topb}.
With R(a,b) hash-partitioned on a:
• It would avoid the data re-shuffling phase
• It would compute the aggregates locally

Plan for R(a,b) hash-partitioned on a (Machines 1, 2, 3 each hold 1/3 of R), for the same query:
1. scan: each machine scans its local partition
2. σ_{a>0}: each machine filters locally
3. γ_{a, max(b) → topb}: each machine computes the final aggregate locally; no shuffle is needed because all tuples with the same a are already on the same machine
Benefit of Hash-Partitioning for Map-Reduce
• For MapReduce:
  – Logically, MR won't know that the data is hash-partitioned
  – MR treats map and reduce functions as black boxes and does not perform any optimizations on them
• But, if a local combiner is used:
  – Saves communication cost:
    • fewer tuples will be emitted by the map tasks
  – Saves computation cost in the reducers:
    • the reducers have hardly anything left to do

SELECT a, max(b) as topb
FROM R
WHERE a > 0
GROUP BY a
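A toy Python sketch of what the combiner buys for this query: without one, every filtered (a, b) pair is shuffled; with one, each map task ships at most one pair per group (the mapper/combiner/reducer functions here are generic illustrations, not a real MR API):

```python
def map_task(partition):
    # map: emit (a, b) for every tuple passing the filter a > 0
    return [(a, b) for a, b in partition if a > 0]

def combiner(mapped):
    # local combiner: pre-aggregate max(b) per a within one map task
    out = {}
    for a, b in mapped:
        out[a] = max(out.get(a, b), b)
    return list(out.items())

def reduce_task(key, values):
    return key, max(values)

if __name__ == "__main__":
    partitions = [[(1, 10), (1, 4), (2, 5), (-1, 99)],
                  [(1, 7), (3, 2), (3, 6)],
                  [(2, 8), (2, 1), (3, 9)]]
    without = [kv for p in partitions for kv in map_task(p)]
    with_combiner = [kv for p in partitions for kv in combiner(map_task(p))]
    print(len(without), "tuples shuffled without a combiner,",
          len(with_combiner), "with one")          # 9 vs. 6 for this toy data
    groups = {}
    for a, b in with_combiner:                     # shuffle: group by a
        groups.setdefault(a, []).append(b)
    print(dict(reduce_task(a, vs) for a, vs in groups.items()))  # {1: 10, 2: 8, 3: 9}
```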
Row vs. Column Store
• Row store
  – store all attributes of a tuple together
  – storage like "row-major order" in a matrix
• Column store
  – store all rows for an attribute (column) together
  – storage like "column-major order" in a matrix
• e.g.
  – MonetDB, Vertica (earlier, C-Store), SAP/Sybase IQ, Google Bigtable (with column groups)