ECE 462 Object-Oriented Programming using C++ and Java: Design Parallel Programs
Yung-Hsiang Lu, yunglu@purdue.edu


Post on 21-Jul-2020



Compare Java and C++ (Qt) Threads

Java | C++ (Qt)
extends Thread or implements Runnable | : public QThread
public void run() | public: void run()
thread1.start(); | thread1.start();
synchronized | mutex.lock() … mutex.unlock()
try { wait(); } catch … | cond.wait(&mutex);

Design Principles

• separate data and threads
– data: such as a bank account, shared among threads; must be protected by synchronized (Java) or a mutex (Qt)
– threads: such as a depositor; extends Thread (Java) or : public QThread (C++); they modify the data
• choose the granularity of data
– too coarse: most threads are serialized
– too fine: mutex overhead dominates
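The separation of data and threads can be sketched in Java roughly as follows. This is a minimal illustration, not code from the lecture: the names BankAccount, Depositor, DepositDemo, and runDepositors are assumptions made for the example.

```java
// Shared data: every method that touches balance is synchronized.
class BankAccount {
    private int balance = 0;

    synchronized void deposit(int amount) {
        balance += amount;
    }

    synchronized int getBalance() {
        return balance;
    }
}

// Thread: holds only a reference to the shared data it modifies.
class Depositor extends Thread {
    private final BankAccount account;
    private final int times;

    Depositor(BankAccount account, int times) {
        this.account = account;
        this.times = times;
    }

    @Override
    public void run() {
        for (int i = 0; i < times; i++) {
            account.deposit(1);
        }
    }
}

class DepositDemo {
    // Starts the given number of depositors and returns the final balance.
    static int runDepositors(int threads, int depositsPerThread) {
        BankAccount account = new BankAccount();
        Depositor[] depositors = new Depositor[threads];
        for (int i = 0; i < threads; i++) {
            depositors[i] = new Depositor(account, depositsPerThread);
            depositors[i].start();
        }
        for (Depositor d : depositors) {
            try {
                d.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return account.getBalance();
    }

    public static void main(String[] args) {
        System.out.println(runDepositors(4, 10000));
    }
}
```

Here the granularity is one lock per account. Locking a whole bank object instead would be coarser and would serialize depositors that work on different accounts.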

Serialization

• If all threads need to access one shared object often, the program essentially becomes a serial program.
• The program may actually run more slowly due to mutex and thread overhead.
• solution: reduce the "coupling" among threads.

[Figure: threads waiting for one mutex]

Improve Thread Efficiency

reduce the "coupling" among threads
• do not share objects
• do not execute concurrently when objects are shared
• shrink the granularity of shared objects, e.g. all bank accounts ⇒ individual customer's account
• shrink critical sections ⇒ only the operations that are related to the shared objects are in the mutex
• use private (not shared) data for short-term storage
• do not guarantee correctness, remedy later
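Two of these suggestions, shrinking the critical section and using private data for short-term storage, can be sketched together. In this hedged example (the classes SharedTotal, SliceSummer, and CriticalSectionDemo are hypothetical), each thread accumulates into a local variable and takes the lock only once to publish its result.

```java
// Shared data: the lock is held only for one short addition.
class SharedTotal {
    private long total = 0;

    synchronized void add(long value) {
        total += value;
    }

    synchronized long get() {
        return total;
    }
}

class SliceSummer extends Thread {
    private final int[] data;
    private final int from;
    private final int to;
    private final SharedTotal shared;

    SliceSummer(int[] data, int from, int to, SharedTotal shared) {
        this.data = data;
        this.from = from;
        this.to = to;
        this.shared = shared;
    }

    @Override
    public void run() {
        long local = 0; // private short-term storage, no lock needed
        for (int i = from; i < to; i++) {
            local += data[i];
        }
        shared.add(local); // the only critical section in this thread
    }
}

class CriticalSectionDemo {
    // Sums data with the given number of threads (threads must be at least 1).
    static long parallelSum(int[] data, int threads) {
        SharedTotal shared = new SharedTotal();
        SliceSummer[] workers = new SliceSummer[threads];
        int chunk = (data.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            int from = Math.min(data.length, t * chunk);
            int to = Math.min(data.length, from + chunk);
            workers[t] = new SliceSummer(data, from, to, shared);
            workers[t].start();
        }
        for (SliceSummer w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
        return shared.get();
    }
}
```

Calling shared.add(data[i]) inside the loop would give the same answer but would acquire the lock once per element, which is the "too fine" overhead case.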

Producers and Consumers with a Finite Buffer

[Figure: producers and consumers sharing a circular buffer (FIFO)]

A producer can put an element in the buffer if it is not full. A consumer can get an element from the buffer if it is not empty.

Circular Buffer

[Figure: circular buffer with head and tail indices]

• head = tail ⇒ buffer empty
• tail > head and (tail – head) == size – 1 ⇒ buffer full
• head > tail and (head – tail) == 1 ⇒ buffer full
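The empty and full conditions above can be sketched as a small Java class. This is an illustration under assumptions, not the lecture's code: CircularBuffer, BufferDemo, and runOnce are hypothetical names, the buffer stores int values, and both "full" cases are folded into one modular test, (tail + 1) % size == head. One slot always stays unused, so a buffer of the given size holds at most size – 1 elements.

```java
class CircularBuffer {
    private final int[] slots;
    private int head = 0; // index of the next element to get
    private int tail = 0; // index of the next free slot

    CircularBuffer(int size) {
        if (size < 2) {
            throw new IllegalArgumentException("size must be at least 2");
        }
        slots = new int[size];
    }

    synchronized boolean isEmpty() {
        return head == tail;
    }

    // Equivalent to the two "buffer full" cases listed above.
    synchronized boolean isFull() {
        return (tail + 1) % slots.length == head;
    }

    synchronized void put(int value) throws InterruptedException {
        while (isFull()) {
            wait(); // producer waits until a consumer frees a slot
        }
        slots[tail] = value;
        tail = (tail + 1) % slots.length;
        notifyAll();
    }

    synchronized int get() throws InterruptedException {
        while (isEmpty()) {
            wait(); // consumer waits until a producer adds an element
        }
        int value = slots[head];
        head = (head + 1) % slots.length;
        notifyAll();
        return value;
    }
}

class BufferDemo {
    // Sends 0..count-1 through the buffer; returns how many arrived in FIFO order.
    static int runOnce(int count, int size) {
        CircularBuffer buffer = new CircularBuffer(size);
        int[] inOrder = {0};
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    buffer.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    if (buffer.get() == i) {
                        inOrder[0]++;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return inOrder[0];
    }
}
```

The wait() calls sit inside while loops rather than if statements so that a thread rechecks the condition after it wakes up.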


Check Correctness

• wrong result ⇒ the program has bugs
• correct result ⇒ maybe you are lucky
• know the correct results, even though the interleaving may be different
• head ≠ tail after adding an element
• randomize interleaving if possible
• assign invalid values to data that should not appear

Deadlock and Livelock

• deadlock: none of the participating threads can execute any more code
• livelock: some (or all) of the participating threads can execute some more code, but no progress is made. For example, dining philosophers

ECE 462 Object-Oriented Programming using C++ and Java: Deadlock

Deadlock

four necessary conditions for deadlock
– mutual exclusion: a car's current location cannot be occupied by another car
– hold and wait: a car must "hold" the current location while waiting
– no preemption: a car cannot be removed by a scheduler
– circular wait: a car can move if another car moves, but the other car is waiting for this car

[Figure: cars blocking one another]

Programs without Deadlock


Java Example of Deadlock

John has a checking account and a savings account with a bank. He transferred $500 from the checking account to the savings account for a higher interest rate. Later in the day, he wrote a check and transferred $200 from the savings account back to the checking account. In the evening, he went to an ATM to withdraw cash. The ATM said, "Account not Found." What happened?

class Account {
    private int balance;

    // Locks this account first, then locks source inside
    // canWithdraw() and withdraw(). This nesting is what can deadlock.
    synchronized void transfer(Account source, int amount) {
        if (source.canWithdraw(amount)) {
            source.withdraw(amount);
            balance += amount;
        }
    }

    synchronized boolean canWithdraw(int amount) {
        if (balance >= amount) { return true; }
        return false;
    }

    // Not shown on the slide; assumed here so that transfer() compiles.
    synchronized void withdraw(int amount) {
        balance -= amount;
    }
}

thread 1 | thread 2
sav.transfer(chk) | chk.transfer(sav)
acquire sav key | acquire chk key
call chk.canWithdraw | call sav.canWithdraw
acquire chk key | acquire sav key

(time runs from top to bottom)
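In the timeline above, each thread holds one key and waits for the other, which is the circular-wait condition. One common remedy, not necessarily the one used in the lecture, is to acquire the two keys in a fixed global order. The sketch below assumes a hypothetical OrderedAccount class with an id field that exists only to order the locks.

```java
class OrderedAccount {
    private final int id; // hypothetical unique id, used only to order locks
    private int balance;

    OrderedAccount(int id, int balance) {
        this.id = id;
        this.balance = balance;
    }

    synchronized int getBalance() {
        return balance;
    }

    // Moves amount from source into this account.
    // Returns false if source does not have enough money.
    boolean transfer(OrderedAccount source, int amount) {
        // Always lock the account with the smaller id first, so two
        // opposing transfers can never hold one key each.
        OrderedAccount first = (this.id < source.id) ? this : source;
        OrderedAccount second = (first == this) ? source : this;
        synchronized (first) {
            synchronized (second) {
                if (source.balance < amount) {
                    return false;
                }
                source.balance -= amount;
                this.balance += amount;
                return true;
            }
        }
    }
}

class TransferDemo {
    // Runs opposing transfers in two threads; returns the combined final balance.
    static int runOpposingTransfers(int rounds) {
        OrderedAccount sav = new OrderedAccount(1, 1000);
        OrderedAccount chk = new OrderedAccount(2, 1000);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                sav.transfer(chk, 1);
            }
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) {
                chk.transfer(sav, 1);
            }
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sav.getBalance() + chk.getBalance();
    }
}
```

Both threads now request the sav key before the chk key, so whichever thread gets the first key can always get the second and finish.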


ECE 462 Object-Oriented Programming using C++ and Java: Thread Performance

Performance Measurement

ts1 = current time;
execute the program without creating threads;
ts2 = current time;

tp1 = current time;
execute the program with multiple threads;
tp2 = current time;

$$\text{improvement} = \frac{\mathit{ts2} - \mathit{ts1}}{\mathit{tp2} - \mathit{tp1}}$$
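The measurement recipe above might look like this in Java. It is a rough sketch: the class SpeedupDemo and its methods are hypothetical, the workload (summing an array) is chosen only for illustration, and System.nanoTime() stands in for "current time". A single run is noisy because of JIT warm-up and scheduling, so treat one number as indicative at best.

```java
class SpeedupDemo {
    static long serialSum(int[] data) {
        long sum = 0;
        for (int v : data) {
            sum += v;
        }
        return sum;
    }

    // Sums data with the given number of threads (threads must be at least 1).
    static long threadedSum(int[] data, int threads) {
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        int chunk = (data.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int id = t;
            final int from = Math.min(data.length, t * chunk);
            final int to = Math.min(data.length, from + chunk);
            workers[t] = new Thread(() -> {
                long local = 0;
                for (int i = from; i < to; i++) {
                    local += data[i];
                }
                partial[id] = local; // each thread writes only its own slot
            });
            workers[t].start();
        }
        long sum = 0;
        for (int t = 0; t < threads; t++) {
            try {
                workers[t].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
            sum += partial[t];
        }
        return sum;
    }

    // Returns (ts2 - ts1) / (tp2 - tp1) for one run of each version.
    static double improvement(int[] data, int threads) {
        long ts1 = System.nanoTime();
        long serial = serialSum(data);
        long ts2 = System.nanoTime();

        long tp1 = System.nanoTime();
        long threaded = threadedSum(data, threads);
        long tp2 = System.nanoTime();

        if (serial != threaded) {
            throw new IllegalStateException("results differ");
        }
        // Math.max guards against a zero reading from a coarse timer.
        return (double) Math.max(1, ts2 - ts1) / (double) Math.max(1, tp2 - tp1);
    }
}
```

For a small array the improvement is often below 1, because creating and joining threads costs more than the work they share.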

Structure of Multithread Programs

1. generate data or read data from disks
2. partition data into independent portions
3. create threads and assign tasks
4. start thread execution
5. collect data and analyze results

Amdahl's Law

• If a fraction x of a program is parallel code, the remaining 1 - x is sequential code
• the speedup of using n threads (no sync) is

$$\text{speedup} = \frac{1}{(1 - x) + \frac{x}{n}}$$

• if x = 0.9 and n → ∞, speedup = 10

[Figure: speedup versus number of processors (0 to 16) for x = 0.9, 0.99, 0.999, and 1]
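As a quick numerical check of the formula, the following sketch evaluates the speedup for a few thread counts. The class name Amdahl and the helper limit are assumptions for the example; x is the parallel fraction (between 0 and 1) and n the number of threads, as in the slide.

```java
class Amdahl {
    // Speedup with parallel fraction x and n threads: 1 / ((1 - x) + x / n).
    static double speedup(double x, int n) {
        if (x < 0.0 || x > 1.0 || n < 1) {
            throw new IllegalArgumentException("need 0 <= x <= 1 and n >= 1");
        }
        return 1.0 / ((1.0 - x) + x / n);
    }

    // Upper bound as n grows without limit: 1 / (1 - x), for x < 1.
    static double limit(double x) {
        if (x < 0.0 || x >= 1.0) {
            throw new IllegalArgumentException("need 0 <= x < 1");
        }
        return 1.0 / (1.0 - x);
    }

    public static void main(String[] args) {
        int[] counts = {1, 4, 8, 16};
        for (int n : counts) {
            System.out.println("n = " + n + ", speedup = " + speedup(0.9, n));
        }
        System.out.println("limit for x = 0.9: " + limit(0.9));
    }
}
```

With x = 0.9, sixteen threads give a speedup of 1 / (0.1 + 0.9 / 16) = 6.4, already well short of 16 and approaching the limit of 10.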

Multi-Thread = High Performance?

• multi-thread ≠ high performance (faster)
• If a program is I/O-bound, multi-thread (or multi-core) does not really help.
• Finding sufficient parallelism (making x closer to 1) can be difficult.
• Reduce the sequential parts as much as possible
– read data from multiple, parallel sources
– partition data as they arrive
– create long-living threads and reuse them
– reduce contention for mutex keys

Superlinear Speedup

• Sometimes, a multithread program's speedup exceeds the number of processors:

$$\frac{\text{execution time of single thread}}{\text{execution time of multiple threads}} > \text{number of processors}$$

• All modern processors are built upon a series of storage devices, called the memory hierarchy.

[Figure: processor, L1 cache, L2 cache, memory]

Superlinear or Sublinear

[Figure: speedup curves labeled superlinear, linear, and sublinear]

Memory Hierarchy

• Smaller and faster memory (cache) is installed closer to the processor. When data cannot fit into L1 cache, the data are stored in L2 cache, then memory.
• Multiple processors can have larger L1 cache collectively to accommodate more data for faster access. As a result, the program is faster.
• Multi-thread programs can also cause frequent data contention and trigger expensive (i.e. slow) consistency checking code. When this happens, the program can be much slower.

ECE 462 Object-Oriented Programming using C++ and Java: Tile-Based Platform Game


Changing Background

• background map: to provide a changing background without a very large actual background
• sky
• tiles

Tile Map

• The tiles do not need individual copies of the images
• collision detection based on the type of a tile
• complex map ⇒ a special map editor
• simple map ⇒ text editor sufficient
• Each object's location is relative to the map (not to the screen) so that a player can scroll left or right easily.


Tile-Based Collision Detection

If the sprite's next location is (x,y), checking whether collision occurs is easy:

tileMap.get(x,y) ⇒ if a tile exists, collision
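A minimal sketch of this lookup is shown below. The slide names only tileMap.get(x,y); everything else here is an assumption for illustration: the TileMap class stores a boolean per cell instead of tile objects, get takes tile coordinates, and CollisionCheck converts map pixels to tile coordinates with an assumed tile size of 64 pixels.

```java
class TileMap {
    private final boolean[][] solid; // solid[x][y] is true where a tile exists

    TileMap(int width, int height) {
        if (width < 1 || height < 1) {
            throw new IllegalArgumentException("map must be at least 1 x 1");
        }
        solid = new boolean[width][height];
    }

    void setTile(int x, int y) {
        solid[x][y] = true;
    }

    // Returns true if a tile exists at tile coordinates (x, y).
    // Positions outside the map are treated as empty.
    boolean get(int x, int y) {
        if (x < 0 || y < 0 || x >= solid.length || y >= solid[0].length) {
            return false;
        }
        return solid[x][y];
    }
}

class CollisionCheck {
    static final int TILE_SIZE = 64; // assumed tile size in pixels

    // Converts a map-relative pixel position to tile coordinates and looks it up.
    static boolean collides(TileMap map, int pixelX, int pixelY) {
        int tileX = Math.floorDiv(pixelX, TILE_SIZE);
        int tileY = Math.floorDiv(pixelY, TILE_SIZE);
        return map.get(tileX, tileY);
    }
}
```

This tests a single point. A sprite with a width and height would normally be checked at each of its corners, or over every tile its bounding box overlaps.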
