
CS 61C: Great Ideas in Computer Architecture

Lecture 19: Thread-Level Parallel Processing

Krste Asanović & Randy H. Katz

http://inst.eecs.berkeley.edu/~cs61c/fa17

11/2/17 Fall 2017 - Lecture #19

Agenda

• MIMD: multiple programs simultaneously
• Threads
• Parallel programming: OpenMP
• Synchronization primitives
• Synchronization in OpenMP
• And, in Conclusion…

Improving Performance

1. Increase clock rate f_s
   − Reached practical maximum for today’s technology
   − < 5 GHz for general-purpose computers
2. Lower CPI (cycles per instruction)
   − SIMD, “instruction-level parallelism”
3. Perform multiple tasks simultaneously
   − Multiple CPUs, each executing a different program
   − Tasks may be related
     § E.g. each CPU performs part of a big matrix multiplication
   − or unrelated
     § E.g. distribute different web HTTP requests over different computers
     § E.g. run pptx (view lecture slides) and browser (YouTube) simultaneously
4. Do all of the above:
   − High f_s, SIMD, multiple parallel tasks

Today’s lecture

New-School Machine Structures (It’s a bit more complicated!)

• Parallel Requests: assigned to computer, e.g., search “Katz”
• Parallel Threads: assigned to core, e.g., lookup, ads
• Parallel Instructions: >1 instruction @ one time, e.g., 5 pipelined instructions
• Parallel Data: >1 data item @ one time, e.g., add of 4 pairs of words
• Hardware descriptions: all gates @ one time
• Programming Languages

Harness Parallelism & Achieve High Performance

[Figure: software/hardware stack, from warehouse-scale computer and smartphone, to computer (cores, memory (cache), input/output), to core (instruction unit(s), functional unit(s) computing A0+B0 … A3+B3), to cache memory and logic gates. Callout: Projects 3 and 5!]

Parallel Computer Architectures

Several separate computers, some means for communication (e.g., Ethernet)

Massive array of computers, fast communication between processors

Multi-core CPU: >1 datapath in a single chip; cores share L3 cache, memory, peripherals. Example: Hive machines

GPU: “graphics processing unit”

Example: CPU with Two Cores

[Figure: Processor “Core” 1 and Processor “Core” 2, each with its own control and datapath (PC, registers, ALU). Both connect to a shared memory (bytes) and to input/output through I/O-memory interfaces. Processor 0 and processor 1 each issue their own memory accesses.]

Multiprocessor Execution Model

• Each processor (core) executes its own instructions
• Separate resources (not shared)
  − Datapath (PC, registers, ALU)
  − Highest-level caches (e.g., 1st and 2nd)
• Shared resources
  − Memory (DRAM)
  − Often 3rd-level cache
    § Often on same silicon chip
    § But not a requirement
• Nomenclature
  − “Multiprocessor Microprocessor”
  − Multicore processor
    § E.g., four-core CPU (central processing unit)
    § Executes four different instruction streams simultaneously

Transition to Multicore

[Figure: sequential app performance]

Pixel 2 vs. iPhone 8

[Figures: processor comparison of Pixel 2 and iPhone 8. Recoverable column labels: ALUs, nm, MHz, GFlops. Recoverable spec: 2.35 GHz + 1.9 GHz, 64-bit octa-core.]

Multiprocessor Execution Model

• Shared memory
  − Each “core” has access to the entire memory in the processor
  − Special hardware keeps caches consistent (next lecture!)
  − Advantages:
    § Simplifies communication in program via shared variables
  − Drawbacks:
    § Does not scale well:
      o “Slow” memory shared by many “customers” (cores)
      o May become bottleneck (Amdahl’s Law)
• Two ways to use a multiprocessor:
  − Job-level parallelism
    § Processors work on unrelated problems
    § No communication between programs
  − Partition work of single task between several cores
    § E.g., each performs part of large matrix multiplication

Parallel Processing

• It’s difficult!
• It’s inevitable
  − Only path to increase performance
  − Only path to lower energy consumption (improve battery life)
• In mobile systems (e.g., smartphones, tablets)
  − Multiple cores
  − Dedicated processors, e.g.,
    § Motion processor, image processor, neural processor in iPhone 8 + X
    § GPU (graphics processing unit)
• Warehouse-scale computers (next week!)
  − Multiple “nodes”
    § “Boxes” with several CPUs, disks per box
  − MIMD (multi-core) and SIMD (e.g. AVX) in each node

Potential Parallel Performance (assuming software can use it)

Year   Cores   SIMD bits/Core   Cores * SIMD bits   Total, e.g. FLOPs/Cycle
2003     2          128                 256                    4
2005     4          128                 512                    8
2007     6          128                 768                   12
2009     8          128                1024                   16
2011    10          256                2560                   40
2013    12          256                3072                   48
2015    14          512                7168                  112
2017    16          512                8192                  128
2019    18         1024               18432                  288
2021    20         1024               20480                  320

Over 12 years: cores (MIMD) 2.5X, at +2 cores / 2 yrs; SIMD width 8X, at 2X / 4 yrs; MIMD & SIMD combined 20X

20x in 12 years: 20^(1/12) = 1.28x → 28% per year, or 2x every 3 years!

IF (!) we can use it
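As a worked step for the 28% figure (values rounded; the three-year doubling is approximate):

```latex
\[
20^{1/12} = e^{\ln 20 / 12} \approx e^{0.2496} \approx 1.28,
\qquad
1.28^{3} \approx 2.1
\]
```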


Programs Running on my Computer

ps -x

PID TTY TIME    CMD
220 ??  0:04.34 /usr/libexec/UserEventAgent (Aqua)
222 ??  0:10.60 /usr/sbin/distnoted agent
224 ??  0:09.11 /usr/sbin/cfprefsd agent
229 ??  0:04.71 /usr/sbin/usernoted
230 ??  0:02.35 /usr/libexec/nsurlsessiond
232 ??  0:28.68 /System/Library/PrivateFrameworks/CalendarAgent.framework/Executables/CalendarAgent
234 ??  0:04.36 /System/Library/PrivateFrameworks/GameCenterFoundation.framework/Versions/A/gamed
235 ??  0:01.90 /System/Library/CoreServices/cloudphotosd.app/Contents/MacOS/cloudphotosd
236 ??  0:49.72 /usr/libexec/secinitd
239 ??  0:01.66 /System/Library/PrivateFrameworks/TCC.framework/Resources/tccd
240 ??  0:12.68 /System/Library/Frameworks/Accounts.framework/Versions/A/Support/accountsd
241 ??  0:09.56 /usr/libexec/SafariCloudHistoryPushAgent
242 ??  0:00.27 /System/Library/PrivateFrameworks/CallHistory.framework/Support/CallHistorySyncHelper
243 ??  0:00.74 /System/Library/CoreServices/mapspushd
244 ??  0:00.79 /usr/libexec/fmfd
246 ??  0:00.09 /System/Library/PrivateFrameworks/AskPermission.framework/Versions/A/Resources/askpermissiond
248 ??  0:01.03 /System/Library/PrivateFrameworks/CloudDocsDaemon.framework/Versions/A/Support/bird
249 ??  0:02.50 /System/Library/PrivateFrameworks/IDS.framework/identityservicesd.app/Contents/MacOS/identityservicesd
250 ??  0:04.81 /usr/libexec/secd
254 ??  0:24.01 /System/Library/PrivateFrameworks/CloudKitDaemon.framework/Support/cloudd
258 ??  0:04.73 /System/Library/PrivateFrameworks/TelephonyUtilities.framework/callservicesd
267 ??  0:02.15 /System/Library/CoreServices/AirPlayUIAgent.app/Contents/MacOS/AirPlayUIAgent --launchd
271 ??  0:03.91 /usr/libexec/nsurlstoraged
274 ??  0:00.90 /System/Library/PrivateFrameworks/CommerceKit.framework/Versions/A/Resources/storeaccountd
282 ??  0:00.09 /usr/sbin/pboard
283 ??  0:00.90 /System/Library/PrivateFrameworks/InternetAccounts.framework/Versions/A/XPCServices/com.apple.internetaccounts.xpc/Contents/MacOS/com.apple.internetaccounts
285 ??  0:04.72 /System/Library/Frameworks/ApplicationServices.framework/Frameworks/ATS.framework/Support/fontd
291 ??  0:00.25 /System/Library/Frameworks/Security.framework/Versions/A/Resources/CloudKeychainProxy.bundle/Contents/MacOS/CloudKeychainProxy
292 ??  0:09.54 /System/Library/CoreServices/CoreServicesUIAgent.app/Contents/MacOS/CoreServicesUIAgent
293 ??  0:00.29 /System/Library/PrivateFrameworks/CloudPhotoServices.framework/Versions/A/Frameworks/CloudPhotoServicesConfiguration.framework/Versions/A/XPCServices/com.apple.CloudPhotosConfiguration.xpc/Contents/MacOS/com.apple.CloudPhotosConfiguration
297 ??  0:00.84 /System/Library/PrivateFrameworks/CloudServices.framework/Resources/com.apple.sbd
302 ??  0:26.11 /System/Library/CoreServices/Dock.app/Contents/MacOS/Dock
303 ??  0:09.55 /System/Library/CoreServices/SystemUIServer.app/Contents/MacOS/SystemUIServer

… 156 total at this moment

How does my laptop do this?

Imagine doing 156 assignments all at the same time!

Threads

• Sequential flow of instructions that performs some task
  − Up to now we just called this a “program”
• Each thread has:
  − Dedicated PC (program counter)
  − Separate registers
  − Accesses the shared memory
• Each physical core provides one (or more)
  − Hardware threads that actively execute instructions
  − Each executes one “hardware thread”
• Operating system multiplexes multiple
  − Software threads onto the available hardware threads
  − All threads except those mapped to hardware threads are waiting

Operating System Threads

Give illusion of many “simultaneously” active threads

1. Multiplex software threads onto hardware threads:
   a) Switch out blocked threads (e.g., cache miss, user input, network access)
   b) Timer (e.g., switch active thread every 1 ms)
2. Remove a software thread from a hardware thread by
   a) Interrupting its execution
   b) Saving its registers and PC to memory
3. Start executing a different software thread by
   a) Loading its previously saved registers into a hardware thread’s registers
   b) Jumping to its saved PC
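Steps 2 and 3 can be sketched in C as a toy model. This is only an illustration: the type names (`thread_state_t`, `hw_thread_t`), the function `context_switch`, and the choice of 32 registers are assumptions, and a real OS does this in assembly with help from the interrupt hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_REGS 32  /* assumption: 32 integer registers, as in RISC-V */

/* State that defines a software thread: its PC and its registers. */
typedef struct {
    uint64_t pc;
    uint64_t regs[NUM_REGS];
} thread_state_t;

/* Hypothetical model of one hardware thread (the live PC and registers). */
typedef struct {
    thread_state_t live;
} hw_thread_t;

/* Step 2: save the running thread's registers and PC to memory.
 * Step 3: load the next thread's saved registers and resume at its saved PC. */
void context_switch(hw_thread_t *hw, thread_state_t *save_to,
                    const thread_state_t *load_from)
{
    assert(hw && save_to && load_from);
    memcpy(save_to, &hw->live, sizeof *save_to);     /* save outgoing thread   */
    memcpy(&hw->live, load_from, sizeof hw->live);   /* restore incoming thread */
}
```

The cost of a switch in this model is two copies of the whole register file, which is why the amount of per-thread state matters.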

Example: Four Cores

Thread pool: list of threads competing for processor

OS maps threads to cores and schedules logical (software) threads

[Figure: Core 1, Core 2, Core 3, Core 4]

Each “core” actively runs one instruction stream at a time

Multithreading

• Typical scenario:
  − Active thread encounters cache miss
  − Active thread waits ~1000 cycles for data from DRAM
  − → switch out and run different thread until data available
• Problem
  − Must save current thread state and load new thread state
    § PC, all registers (could be many, e.g. AVX)
  − → must perform switch in ≪ 1000 cycles
• Can hardware help?
  − Moore’s Law: transistors are plenty

Hardware Assisted Software Multithreading

• Two copies of PC and registers inside processor hardware
• Looks identical to two processors to software (hardware thread 0, hardware thread 1)
• Hyperthreading: both threads can be active simultaneously

[Figure: processor (1 core, 2 threads) with one control unit and one datapath (ALU) but two PCs (PC 0, PC 1) and two register sets (Registers 0, Registers 1), connected to memory (bytes) and input/output through I/O-memory interfaces]

Multithreading

• Logical threads
  − ≈ 1% more hardware, ≈ 10% (?) better performance
    § Separate registers
    § Share datapath, ALU(s), caches
• Multicore
  − => Duplicate processors
  − ≈ 50% more hardware, ≈ 2X better performance?
• Modern machines do both
  − Multiple cores with multiple threads per core

Randy’s Laptop

$ sysctl -a | grep hw

hw.physicalcpu: 2
hw.logicalcpu: 4
hw.l1icachesize: 32,768
hw.l1dcachesize: 32,768
hw.l2cachesize: 262,144
hw.l3cachesize: 4,194,304

• 2 Cores
• 4 Threads total

Example: 6 Cores, 24 Logical Threads

Thread pool: list of threads competing for processor

OS maps threads to cores and schedules logical (software) threads

[Figure: Core 1 through Core 6, each with Thread 1 through Thread 4]

4 logical threads per core (hardware) thread

Break!


Languages Supporting Parallel Programming

ActorScript     Concurrent Pascal    JoCaml      Orc
Ada             Concurrent ML        Join        Oz
Afnix           Concurrent Haskell   Java        Pict
Alef            Curry                Joule       Reia
Alice           CUDA                 Joyce       SALSA
APL             E                    LabVIEW     Scala
Axum            Eiffel               Limbo       SISAL
Chapel          Erlang               Linda       SR
Cilk            Fortran 90           MultiLisp   Stackless Python
Clean           Go                   Modula-3    SuperPascal
Clojure         Io                   Occam       VHDL
Concurrent C    Janus                occam-π     XC

Which one to pick?

Why So Many Parallel Programming Languages?

• Why “intrinsics”?
  − TO Intel: fix your #()&$! compiler!
• It’s happening ... but
  − SIMD features are continually added to compilers (Intel, gcc)
  − Intense area of research
  − Research progress:
    § 20+ years to translate C into good (fast!) assembly
    § How long to translate C into good (fast!) parallel code?
      o General problem is very hard to solve
      o Present state: specialized solutions for specific cases
      o Your opportunity to become famous!

Parallel Programming Languages

• Number of choices is indication of
  − No universal solution
    § Needs are very problem specific
  − E.g.,
    § Scientific computing / machine learning (matrix multiply)
    § Web server: handle many unrelated requests simultaneously
    § Input/output: it’s all happening simultaneously!
• Specialized languages for different tasks
  − Some are easier to use (for some problems)
  − None is particularly “easy” to use
• 61C
  − Parallel language examples for high-performance computing
  − OpenMP

Parallel Loops

• Serial execution:

    for (int i=0; i<100; i++) {
        …
    }

• Parallel execution:

    for (int i=0; i<25; i++) {
        …
    }

    for (int i=25; i<50; i++) {
        …
    }

    for (int i=50; i<75; i++) {
        …
    }

    for (int i=75; i<100; i++) {
        …
    }
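The four sub-ranges above can be computed rather than written by hand. A minimal sketch, assuming contiguous chunks with any remainder spread over the first threads; `chunk_t` and `chunk_bounds` are hypothetical names, not part of any library.

```c
#include <assert.h>

/* Half-open iteration range [begin, end) assigned to one thread. */
typedef struct {
    int begin;
    int end;
} chunk_t;

/* Split iterations 0..n-1 into nthreads contiguous chunks.
 * The first (n % nthreads) threads get one extra iteration. */
chunk_t chunk_bounds(int n, int nthreads, int tid)
{
    assert(n >= 0 && nthreads > 0 && tid >= 0 && tid < nthreads);
    int base = n / nthreads;
    int extra = n % nthreads;
    chunk_t c;
    c.begin = tid * base + (tid < extra ? tid : extra);
    c.end = c.begin + base + (tid < extra ? 1 : 0);
    return c;
}
```

With n = 100 and four threads, thread 1 gets [25, 50), the second loop on the slide. Each thread would then run `for (int i = c.begin; i < c.end; i++)` on its own chunk.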

Parallel for in OpenMP

    #include <omp.h>

    #pragma omp parallel for
    for (int i=0; i<100; i++) {
        …
    }

OpenMP Example

$ gcc-5 -fopenmp for.c; ./a.out
thread 0, i = 0
thread 1, i = 3
thread 2, i = 6
thread 3, i = 8
thread 0, i = 1
thread 1, i = 4
thread 2, i = 7
thread 3, i = 9
thread 0, i = 2
thread 1, i = 5

01 02 03 14 15 16 27 28 39 40
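The source of for.c is not part of this transcript. A minimal sketch of a loop that prints lines in this format, assuming ten iterations; `run_loop` and `current_thread` are hypothetical names, and the `_OPENMP` guard only lets the same file build as a serial program when -fopenmp is omitted.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns the calling thread's id, or 0 when built without OpenMP. */
static int current_thread(void)
{
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

/* Runs n loop iterations, records which thread handled each one in
 * owner[i], and prints one "thread T, i = I" line per iteration. */
void run_loop(int n, int *owner)
{
    assert(owner != NULL);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        owner[i] = current_thread();
        printf("thread %d, i = %d\n", owner[i], i);
    }
}
```

The order of the printed lines can differ from run to run, since the threads are not synchronized with each other.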

OpenMP

• C extension: no new language to learn
• Multi-threaded, shared-memory parallelism
  − Compiler directives, #pragma
  − Runtime library routines, #include <omp.h>
• #pragma
  − Ignored by compilers unaware of OpenMP
  − Same source for multiple architectures
    § E.g., same program for 1 & 16 cores
• Only works with shared memory

OpenMP Programming Model

• Fork-Join Model:
• OpenMP programs begin as single process (master thread)
  − Sequential execution
• When parallel region is encountered
  − Master thread “forks” into team of parallel threads
  − Executed simultaneously
  − At end of parallel region, parallel threads “join”, leaving only master thread
• Process repeats for each parallel region
  − Amdahl’s Law?

What Kind of Threads?

• OpenMP threads are operating system (software) threads
• OS will multiplex requested OpenMP threads onto available hardware threads
• Hopefully each gets a real hardware thread to run on, so no OS-level time-multiplexing
• But other tasks on machine compete for hardware threads!
• Be “careful” (?) when timing results for Project 3!
  − 5 AM?
  − Job queue?

Example 2: Computing π

http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

Sequential π

pi = 3.142425985001

• Resembles π, but not very accurate
• Let’s increase num_steps and parallelize
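The slide's code did not survive in this transcript. A minimal sequential sketch, assuming the usual approach of integrating 4/(1+x²) over [0, 1] with the midpoint rule; the function name `pi_sequential` is hypothetical, while `num_steps` is the name the slide uses.

```c
#include <assert.h>

/* Midpoint-rule estimate of the integral of 4/(1+x^2) over [0,1],
 * which equals pi. More steps give a more accurate result. */
double pi_sequential(long num_steps)
{
    assert(num_steps > 0);
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    for (long i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;   /* midpoint of slice i */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

With only ten steps the estimate is off in the fourth significant digit, which is consistent with the "resembles π, but not very accurate" value above if that run also used ten steps.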

Parallelize (1) …

• Problem: each thread needs access to the shared variable sum
• Code runs sequentially …

Parallelize (2) …

1. Compute sum[0] and sum[1] in parallel
2. Compute sum = sum[0] + sum[1] sequentially

Parallel π

Trial Run

i = 1, id = 1
i = 0, id = 0
i = 2, id = 2
i = 3, id = 3
i = 5, id = 1
i = 4, id = 0
i = 6, id = 2
i = 7, id = 3
i = 9, id = 1
i = 8, id = 0
pi = 3.142425985001
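The trial run shows iteration i going to thread id = i mod 4. A sketch of that scheme with one partial sum per thread, combined sequentially afterwards. This is a variant written so it also builds without OpenMP: it uses `parallel for` over the ids instead of querying the thread id at run time, and `pi_partial_sums` and `NUM_THREADS` are assumed names.

```c
#include <assert.h>

#define NUM_THREADS 4   /* assumption: four threads, matching ids 0..3 in the trial run */

double pi_partial_sums(long num_steps)
{
    assert(num_steps > 0);
    double step = 1.0 / (double)num_steps;
    double sum[NUM_THREADS] = {0.0};

    /* 1. Each id fills in its own sum[id]; iterations are dealt out
     *    round-robin (i = id, id + NUM_THREADS, ...). */
    #pragma omp parallel for
    for (int id = 0; id < NUM_THREADS; id++) {
        double local = 0.0;
        for (long i = id; i < num_steps; i += NUM_THREADS) {
            double x = (i + 0.5) * step;
            local += 4.0 / (1.0 + x * x);
        }
        sum[id] = local;
    }

    /* 2. Combine the partial sums sequentially. */
    double total = 0.0;
    for (int id = 0; id < NUM_THREADS; id++)
        total += sum[id];
    return step * total;
}
```

No two threads ever write the same element of `sum`, so the parallel part needs no synchronization.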

Scale up: num_steps = 10^6

pi = 3.141592653590

You verify how many digits are correct …

Can We Parallelize Computing sum?

Summation inside parallel section
• Insignificant speedup in this example, but …
• pi = 3.138450662641
• Wrong! And value changes between runs?!
• What’s going on?

Always looking for ways to beat Amdahl’s Law …

Peer Instruction

What are the possible values of *(x1) after executing this code by two concurrent threads?

    # *(x1) = 100
    lw   x2, 0(x1)
    addi x2, x2, 1
    sw   x2, 0(x1)

Answer    *(x1)
RED       100 or 101
GREEN     101
ORANGE    101 or 102
YELLOW    100 or 101 or 102

What’s Going On?

• Operation is really pi = pi + sum[id]
• What if >1 threads read the current (same) value of pi, compute the sum, and store the result back to pi?
• Each processor reads same intermediate value of pi!
• Result depends on who gets there when
• A “race” → result is not deterministic
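The race can be explored without real threads by trying every ordering of the instructions. A sketch for the three-instruction load/add/store sequence from the peer instruction question, assuming each thread has its own private x2 and that individual instructions execute one at a time in some global order; `enumerate_interleavings` and `race_result_t` are hypothetical names.

```c
#include <assert.h>

/* Result of trying every interleaving of two threads that each run
 *   lw x2,0(x1); addi x2,x2,1; sw x2,0(x1)
 * on a shared word that starts at `initial`. */
typedef struct {
    int min_final;
    int max_final;
} race_result_t;

race_result_t enumerate_interleavings(int initial)
{
    race_result_t r = { 0, 0 };
    int first = 1;
    /* Bit k of mask picks which thread executes the k-th step overall. */
    for (unsigned mask = 0; mask < 64; mask++) {
        int ones = 0;
        for (int k = 0; k < 6; k++)
            ones += (int)((mask >> k) & 1u);
        if (ones != 3)
            continue;               /* each thread must run exactly 3 steps */

        int mem = initial;          /* shared memory word *(x1) */
        int reg[2] = { 0, 0 };      /* each thread's private x2 */
        int pc[2] = { 0, 0 };       /* next instruction per thread */
        for (int k = 0; k < 6; k++) {
            int t = (int)((mask >> k) & 1u);
            if (pc[t] == 0)      reg[t] = mem;     /* lw   */
            else if (pc[t] == 1) reg[t] += 1;      /* addi */
            else                 mem = reg[t];     /* sw   */
            pc[t]++;
        }
        assert(pc[0] == 3 && pc[1] == 3);
        if (first || mem < r.min_final) r.min_final = mem;
        if (first || mem > r.max_final) r.max_final = mem;
        first = 0;
    }
    return r;
}
```

When both threads load before either stores, one increment is lost; when the sequences do not overlap, both increments survive. The same thing happens to pi = pi + sum[id].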

Administrivia

• Homework 4 (Caches, Floating Point) due tomorrow at 11:59pm
• Project 2-2 due Monday
  − Project office hours that Monday will be well staffed!
  − Test your CPU thoroughly!
    § Write programs with Venus and load them into your circuit
• Project 3 will be released Monday night
  − A two-week performance project
  − Can earn extra credit from the performance contest (Project 5)
• Midterm scores will be released before Tuesday on Gradescope

Break!


Synchronization

• Problem:
  − Limit access to shared resource to 1 actor at a time
  − E.g. only 1 person permitted to edit a file at a time
    § otherwise changes by several people get all mixed up
• Solution:
  − Take turns:
    § Only one person gets the microphone & talks at a time
    § Also good practice for classrooms, btw …

Locks

• Computers use locks to control access to shared resources
  − Serves purpose of microphone in example
  − Also referred to as “semaphore”
• Usually implemented with a variable
  − int lock;
    § 0 for unlocked
    § 1 for locked

Synchronization with Locks

    // wait for lock released
    while (lock != 0) ;
    // lock == 0 now (unlocked)

    // set lock
    lock = 1;

    // access shared resource ...
    // e.g. pi
    // sequential execution! (Amdahl ...)

    // release lock
    lock = 0;

Lock Synchronization

    Thread 1                    Thread 2

    while (lock != 0) ;         while (lock != 0) ;
    lock = 1;                   lock = 1;
    // critical section         // critical section
    lock = 0;                   lock = 0;

• Thread 2 finds lock not set, before thread 1 sets it
• Both threads believe they got and set the lock!

Try as you like, this problem has no solution, not even at the assembly level.

Unless we introduce new instructions, that is!

Hardware Synchronization

• Solution:
  − Atomic read/write
  − Read & write in single instruction
    § No other access permitted between read and write
  − Note:
    § Must use shared memory (multiprocessing)
• Common implementations:
  − Atomic swap of register ↔ memory
  − Pair of instructions for “linked” read and write
    § write fails if memory location has been “tampered” with after linked read
• RISC-V has variations of both, but for simplicity we will focus on the former

RISC-V Atomic Memory Operations (AMOs)

• AMOs atomically perform an operation on an operand in memory and set the destination register to the original memory value
• R-Type instruction format: Add, And, Or, Swap, Xor, Max, Max Unsigned, Min, Min Unsigned

[Figure: R-type instruction format, annotated as follows]
• Load from address in rs1 to “t”
• rd = “t”, i.e., the value in memory
• Store at address in rs1 the calculation “t” <operation> rs2
• aq and rl ensure in-order execution

amoadd.w rd,rs2,(rs1):
    t = M[x[rs1]]; x[rd] = t; M[x[rs1]] = t + x[rs2]

RISC-V Critical Section

• Assume that the lock is in memory location stored in register a0
• The lock is “set” if it is 1; it is “free” if it is 0 (its initial value)

          li t0, 1                    # Get 1 to set lock
    Try:  amoswap.w.aq t1, t0, (a0)   # t1 gets old lock value
                                      # while we set it to 1
          bnez t1, Try                # if it was already 1, another
                                      # thread has the lock,
                                      # so we need to try again
          … critical section goes here …
          amoswap.w.rl x0, x0, (a0)   # store 0 in lock to release
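The same swap-based lock can be sketched in portable C with the C11 `<stdatomic.h>` header, where `atomic_exchange_explicit` plays the role of the atomic swap. The names `spinlock_t`, `spin_init`, `spin_try_lock`, `spin_lock`, and `spin_unlock` are made up for this sketch, and the unlock uses a plain release store rather than a swap.

```c
#include <assert.h>
#include <stdatomic.h>

/* 0 = free, 1 = set, as on the slide. */
typedef struct {
    atomic_int locked;
} spinlock_t;

void spin_init(spinlock_t *l)
{
    assert(l != 0);
    atomic_init(&l->locked, 0);
}

/* One atomic swap: write 1 and get the old value back.
 * Returns 1 if the lock was free (we now own it), 0 otherwise. */
int spin_try_lock(spinlock_t *l)
{
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

/* Same shape as the Try: loop: retry the swap until the old value was 0. */
void spin_lock(spinlock_t *l)
{
    while (!spin_try_lock(l))
        ;   /* another thread has the lock, so try again */
}

/* Store 0 with release ordering to free the lock. */
void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A thread would wrap its critical section between `spin_lock(&l)` and `spin_unlock(&l)`. The acquire and release orderings correspond loosely to the .aq and .rl suffixes in the assembly.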

Lock Synchronization

Broken Synchronization

    while (lock != 0) ;
    lock = 1;
    // critical section
    lock = 0;

Fix (lock is at location (a0))

            li t0, 1
    Try:    amoswap.w.aq t1, t0, (a0)
            bnez t1, Try
    Locked:
            # critical section
    Unlock:
            amoswap.w.rl x0, x0, (a0)

OpenMP Locks

Synchronization in OpenMP

• Locks are typically used in libraries of higher-level parallel programming constructs
• E.g. OpenMP offers #pragmas for common cases:
  − critical
  − atomic
  − barrier
  − ordered
• OpenMP offers many more features
  − See online documentation
  − Or tutorial at
    § http://openmp.org/mp-documents/omp-hands-on-SC08.pdf

OpenMP Critical Section
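The slide's code is not in this transcript. A sketch of how `#pragma omp critical` could protect the shared update in the π example: each thread keeps a private partial sum and only the final addition into `pi` is serialized. `pi_critical` and `NUM_THREADS` are assumed names, and the four-way round-robin split is an assumption carried over from the trial run.

```c
#include <assert.h>

#define NUM_THREADS 4   /* assumption: same four-way split as the trial run */

double pi_critical(long num_steps)
{
    assert(num_steps > 0);
    double step = 1.0 / (double)num_steps;
    double pi = 0.0;

    #pragma omp parallel for
    for (int id = 0; id < NUM_THREADS; id++) {
        double sum = 0.0;                       /* private partial sum */
        for (long i = id; i < num_steps; i += NUM_THREADS) {
            double x = (i + 0.5) * step;
            sum += 4.0 / (1.0 + x * x);
        }
        /* Only one thread at a time may update the shared variable pi. */
        #pragma omp critical
        pi += step * sum;
    }
    return pi;
}
```

The critical section is entered only once per thread, so the serialized part stays small compared with the parallel loop.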

The Trouble with Locks …

• … is deadlocks
• Consider 2 cooks sharing a kitchen
  − Each cooks a meal that requires salt and pepper (locks)
  − Cook 1 grabs salt
  − Cook 2 grabs pepper
  − Cook 1 notices s/he needs pepper
    § it’s not there, so s/he waits
  − Cook 2 realizes s/he needs salt
    § it’s not there, so s/he waits
• A not so common cause of cook starvation
  − But deadlocks are possible in parallel programs
  − Very difficult to debug
    § malloc/free is easy …

And, in Conclusion, …

• Sequential software execution speed is limited
• Parallel processing is the only path to higher performance
  − SIMD: instruction level parallelism
    § Implemented in all high performance CPUs today (x86, ARM, …)
    § Partially supported by compilers
  − MIMD: thread level parallelism
    § Multicore processors
    § Supported by Operating Systems (OS)
    § Requires programmer intervention to exploit at single program level
      o E.g. OpenMP
  − SIMD & MIMD for maximum performance
• Synchronization
  − Requires hardware support: specialized assembly instructions
  − Typically use higher-level support
  − Beware of deadlocks