Parallelism
Marco Serafini
COMPSCI 590S, Lecture 3
Announcements
• Reviews
  • First paper posted on the website
  • Review due by this Wednesday 11 PM (hard deadline)
• Data Science Career Mixer (save the date!)
  • November 5, 4-7 pm
  • Campus Center Auditorium
  • Recruiting and industry engagement event
Why multi-core architectures?
Multi-Cores
• We have talked about multi-core architectures
• Why do we actually use multi-cores?
• Why not a single core?
Maximum Clock Rate is Stagnating
Source: https://queue.acm.org/detail.cfm?id=2181798
Two major “laws” are collapsing
• Moore’s law
• Dennard scaling
Moore’s Law
• “The density of transistors in an integrated circuit doubles every two years.” Smaller transistors → signals propagate faster
• So far so good, but the trend is slowing down and it won’t last much longer (Intel’s prediction: until 2021, unless new technologies arise) [1]
[1] https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
[Figure: transistor density over time; note the exponential y-axis]
Dennard Scaling
• “Reducing transistor size does not increase power density → power consumption is proportional to chip area”
• Stopped holding around 2006
  • Its assumptions break when the physical system gets close to its limits
• In the post-Dennard-scaling world of today
  • Huge cooling and power consumption issues
  • If we had kept the same clock frequency trends, today a CPU would have the power density of a nuclear reactor
Heat Dissipation Problem
• Large datacenters consume energy like large cities
• Cooling is the main cost factor

[Photos: Google @ Columbia River valley (2006); Facebook @ Luleå (2015)]
Where is Luleå?
Possible Solutions
• Dynamic Voltage and Frequency Scaling (DVFS)
  • E.g. Intel’s Turbo Boost
  • Only works under low load
• Use part of the chip for coprocessors (e.g. graphics)
  • Lower power consumption
  • Limited number of generic functionalities to offload
More Solutions
• Multicores
  • Replace 1 powerful core with multiple weaker cores on a chip
• SIMD
  • Single Instruction, Multiple Data
  • A massive number of cores with reduced flexibility
• FPGAs
  • Dedicated hardware designed for a specific task
Multi-Core Processors
• Idea: scale computational power linearly
  • Instead of a single 5 GHz core, 2 × 2.5 GHz cores
• Scale heat dissipation linearly
  • k cores have ~k times the heat dissipation of a single core
  • Increasing the frequency of a single core by k times creates a superlinear increase in heat dissipation
Memory Bandwidth Bottleneck
• Cores compete for the same main memory bus
• Caches help in two ways
  • They reduce latency (as we have discussed)
  • They also increase throughput by avoiding bus contention
How to Leverage Multicores
• Run multiple tasks in parallel
  • Multiprocessing
  • Multithreading
• E.g. PCs run many parallel background apps
  • OS, music, antivirus, web browser, …
• How to parallelize a single app is not trivial
• Embarrassingly parallel tasks
  • Can be run by multiple threads
  • No coordination needed (see the sketch below)
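A minimal sketch of an embarrassingly parallel task in Java (class and variable names are ours, not from the slides): each thread scales a disjoint chunk of an array, so the threads need no coordination beyond the final join.

    public class EmbarrassinglyParallel {
        public static void main(String[] args) throws InterruptedException {
            double[] v = new double[1_000_000];
            java.util.Arrays.fill(v, 2.0);

            int nThreads = Runtime.getRuntime().availableProcessors();
            Thread[] threads = new Thread[nThreads];
            int chunk = v.length / nThreads;

            for (int t = 0; t < nThreads; t++) {
                int from = t * chunk;
                int to = (t == nThreads - 1) ? v.length : from + chunk;
                threads[t] = new Thread(() -> {
                    // Each thread touches a disjoint range: no locks needed
                    for (int i = from; i < to; i++) {
                        v[i] = v[i] * Math.PI;
                    }
                });
                threads[t].start();
            }
            for (Thread t : threads) {
                t.join(); // wait for all chunks to finish
            }
            System.out.println(v[0]); // 6.283185...
        }
    }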
SIMD Processors
• Single Instruction Multiple Data (SIMD) processors
• Examples
  • Graphical Processing Units (GPUs)
  • Intel Phi coprocessors
• Q: Possible SIMD snippets? (a Java transcription follows below)

    for i in [0, n-1] do
        v[i] = v[i] * pi

    for i in [0, n-1] do
        if v[i] < 0.01 then
            v[i] = 0
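For reference, the same two loops transcribed into plain Java (our transcription, not from the slides). The first applies one identical operation to every element, which maps directly onto SIMD lanes; the second contains a data-dependent branch, which SIMD hardware typically handles by evaluating the condition on all lanes and masking out the stores where it is false.

    public class SimdCandidates {
        // Loop 1: one identical multiply per element, directly SIMD-friendly
        static void scale(double[] v) {
            for (int i = 0; i < v.length; i++) {
                v[i] = v[i] * Math.PI;
            }
        }

        // Loop 2: a data-dependent branch; SIMD executes it with masking
        static void threshold(double[] v) {
            for (int i = 0; i < v.length; i++) {
                if (v[i] < 0.01) {
                    v[i] = 0;
                }
            }
        }

        public static void main(String[] args) {
            double[] v = {2.0, 0.001, 3.0};
            scale(v);
            threshold(v);
            System.out.println(java.util.Arrays.toString(v)); // [6.28..., 0.0, 9.42...]
        }
    }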
Automatic Parallelization?
• The holy grail of the multi-processor era
• Approaches
  • Programming languages
  • Systems with APIs that help express parallelism
  • Efficient coordination mechanisms
Processes vs. Threads
Processes & Threads
• We have discussed that multicores are the future
• How do we make use of parallelism?
• OS/PL support for parallel programming
  • Processes
  • Threads
Processes vs. Threads
• Process: separate memory space
• Thread: shared memory space (except the stack)

                             Processes     Threads
    Heap                     not shared    shared
    Global variables         not shared    shared
    Local variables (stack)  not shared    not shared
    Code                     shared        shared
    File handles             not shared    shared
Parallel Programming
• Shared memory
  • Threads
  • Access the same memory locations (in the heap and global variables)
• Message passing
  • Processes
  • Explicit communication: message passing
Shared Memory
Shared Memory Example

This is “pseudo-Java” (in C++: pthread_create, pthread_join):

    void main() {
        x = 12;            // assume that x is a global variable
        t = new ThreadX();
        t.start();         // starts thread t
        y = 12 / x;
        System.out.println(y);
        t.join();          // wait until t completes
    }

    class ThreadX extends Thread {
        void run() {
            x = 0;
        }
    }

• Question: What is printed as output? (a runnable Java version follows below)
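A runnable Java version of the pseudo-code might look as follows (a sketch with names of our choosing; the static field plays the role of the global variable). Depending on how the two threads interleave, it prints 1, or throws an ArithmeticException if the new thread zeroes x before the division executes.

    public class SharedMemoryExample {
        static int x; // plays the role of the global variable

        public static void main(String[] args) throws InterruptedException {
            x = 12;
            Thread t = new Thread(() -> x = 0); // the body of ThreadX.run()
            t.start();
            // Race: t may or may not have executed x = 0 by this point
            int y = 12 / x; // 1, or ArithmeticException: / by zero
            System.out.println(y);
            t.join();
        }
    }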
Desired: Atomicity

Thread a: … foo() …    Thread b: … foo() …

    void foo() {
        x = 0;
        x = 1;
        y = 1/x;
    }

DESIRED (time flows downward; the happens-before relationship makes a’s changes visible to b):

    Thread a: x = 0; x = 1; y = 1
        (happens-before: changes become visible)
    Thread b: x = 0; x = 1; y = 1

POSSIBLE (interleaved):

    Thread a: x = 0; x = 1
    Thread b: x = 0
    Thread a: y = 1/0 → division by zero!

foo should be atomic, in the sense of indivisible (from the ancient Greek).
Race Condition
• Non-deterministic access to shared variables
  • Correctness requires a specific sequence of accesses
  • But we cannot rely on it because of non-determinism!
• Solutions
  • Enforce a specific order using synchronization
    • Enforces a sequence of happens-before relationships
    • Locks, mutexes, semaphores: threads block each other
  • Lock-free algorithms: threads do not wait for each other
    • Hard to implement correctly! The typical programmer uses locks
  • Java has optimized thread-safe data structures, e.g. ConcurrentHashMap (see the sketch below)
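As a quick illustration of the last point, a minimal sketch (names ours) of several threads counting words into a ConcurrentHashMap: merge performs the read-modify-write atomically, so no explicit lock is needed.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ConcurrentCount {
        public static void main(String[] args) throws InterruptedException {
            Map<String, Integer> counts = new ConcurrentHashMap<>();
            String[] words = {"a", "b", "a", "c", "a", "b"};

            Thread[] threads = new Thread[4];
            for (int t = 0; t < threads.length; t++) {
                threads[t] = new Thread(() -> {
                    for (String w : words) {
                        counts.merge(w, 1, Integer::sum); // atomic update
                    }
                });
                threads[t].start();
            }
            for (Thread t : threads) t.join();

            System.out.println(counts); // e.g. {a=12, b=8, c=4}
        }
    }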
Locks

We use a lock variable l and use it to synchronize:

Thread a: … l.lock(); foo(); l.unlock()
Thread b: … l.lock(); foo(); l.unlock()

    void foo() {
        x = 0;
        x++;
        y = 1/x;
    }

Impossible now (the interleaving from the previous slide):

    Thread a: x = 0; x = 1
    Thread b: x = 0

Possible (time flows downward):

    Thread a: l.lock(); foo()      Thread b: l.lock() - waits
    Thread a: l.unlock()
                                   Thread b: l.lock() - acquires
                                   Thread b: foo(); l.unlock()

Equivalent in Java: declare synchronized void foo() (a runnable sketch follows below)
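A runnable Java sketch of the same pattern, using java.util.concurrent.locks.ReentrantLock as the lock variable l (class and field names are ours):

    import java.util.concurrent.locks.ReentrantLock;

    public class LockExample {
        static int x, y;
        static final ReentrantLock l = new ReentrantLock();

        static void foo() {
            x = 0;
            x++;
            y = 1 / x; // safe: no other thread can interleave inside foo
        }

        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> {
                l.lock();       // blocks until the lock is available
                try {
                    foo();
                } finally {
                    l.unlock(); // always release, even on exceptions
                }
            };
            Thread a = new Thread(task);
            Thread b = new Thread(task);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(y); // always 1
        }
    }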
Deadlock
• Question: What can go wrong?

Thread a: … l1.lock(); l2.lock(); foo(); l1.unlock(); l2.unlock()
Thread b: … l2.lock(); l1.lock(); foo(); l2.unlock(); l1.unlock()
Requirements for a Deadlock
• Mutual exclusion: resources (locks) are held and non-shareable
• Hold and wait: a thread holds one resource and requests another
• No preemption: a lock can be released only by the thread holding it
• Circular wait: a chain of threads, each waiting for the next
• Question: Simple solution?
  • All threads acquire locks in the same order (see the sketch below)
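To make the circular wait concrete, a minimal sketch (names ours) in which the two threads acquire the locks in opposite orders and can deadlock; making thread b take l1 before l2, like thread a does, removes the cycle.

    import java.util.concurrent.locks.ReentrantLock;

    public class DeadlockDemo {
        static final ReentrantLock l1 = new ReentrantLock();
        static final ReentrantLock l2 = new ReentrantLock();

        public static void main(String[] args) {
            Thread a = new Thread(() -> {
                l1.lock();
                try {
                    sleep(100);  // makes the bad interleaving likely
                    l2.lock();   // waits for b, which waits for us: deadlock
                    try { System.out.println("a ran"); } finally { l2.unlock(); }
                } finally { l1.unlock(); }
            });
            Thread b = new Thread(() -> {
                l2.lock();       // opposite order from a; the fix: lock l1 first
                try {
                    sleep(100);
                    l1.lock();
                    try { System.out.println("b ran"); } finally { l1.unlock(); }
                } finally { l2.unlock(); }
            });
            a.start(); b.start();
        }

        static void sleep(long ms) {
            try { Thread.sleep(ms); } catch (InterruptedException e) { }
        }
    }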
Notify / Wait

Thread a:
    …
    synchronized (o) {
        o.wait();
        foo();
    }

Thread b:
    …
    synchronized (o) {
        foo();
        o.notify();
    }

Execution (time flows downward):

    Thread a: o.wait() … Thread a waits …
    Thread b: foo(); o.notify()
    Thread a: wakes up from o.wait(); foo()

Calling notify on an object sends a signal that wakes up other threads waiting on that object. This code guarantees that Thread b executes foo before Thread a (a runnable version follows below).
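A runnable version of this handoff might look as follows (a sketch; names ours). Idiomatic Java waits inside a loop on a condition flag: wait() can return spuriously, and without the flag a notify() that fires before the other thread starts waiting would be lost.

    public class NotifyWait {
        static final Object o = new Object();
        static boolean bDone = false; // guards against lost and spurious wakeups

        static void foo(String who) { System.out.println("foo() in thread " + who); }

        public static void main(String[] args) {
            Thread a = new Thread(() -> {
                synchronized (o) {
                    while (!bDone) {  // wait in a loop on the condition
                        try { o.wait(); } catch (InterruptedException e) { return; }
                    }
                    foo("a");         // runs only after b's foo()
                }
            });
            Thread b = new Thread(() -> {
                synchronized (o) {
                    foo("b");
                    bDone = true;
                    o.notify();       // wake up thread a
                }
            });
            a.start(); b.start();
        }
    }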
What About Cache Coherency?
• Cache coherency ensures atomicity for
  • Single instructions
  • Single cache lines
• In reality
  • Different variables may reside on different cache lines
  • A variable may be accessed across multiple instructions
    • A single high-level instruction may compile to multiple low-level ones
    • Example: a++ in C may compile to load(a, r0); r0 = r0 + 1; store(r0, a)
• That’s why we need locks (see the sketch below)
• Main lesson learned from the cache coherency discussion: you should partition data
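A small sketch (names ours) that makes the load/increment/store point visible: two threads each increment a shared counter 100,000 times, and plain counter++ loses updates because its three low-level steps can interleave; an AtomicInteger makes each read-modify-write a single atomic operation.

    import java.util.concurrent.atomic.AtomicInteger;

    public class LostUpdates {
        static int counter = 0;                          // plain int: racy
        static AtomicInteger safe = new AtomicInteger(); // atomic updates

        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> {
                for (int i = 0; i < 100_000; i++) {
                    counter++;              // load; add 1; store: can interleave
                    safe.incrementAndGet(); // single atomic operation
                }
            };
            Thread a = new Thread(task), b = new Thread(task);
            a.start(); b.start();
            a.join(); b.join();
            System.out.println(counter);    // usually < 200000: lost updates
            System.out.println(safe.get()); // always 200000
        }
    }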
Challenges with Multi-Threading
• Correctness
  • Heisenbugs: non-deterministic bugs that appear only under certain conditions
  • Hard to reproduce → hard to debug
• Performance
  • Understanding concurrency bottlenecks is hard!
  • “Waiting time” does not show up in profilers (they only report CPU time)
• Load balance
  • Make sure all cores work all the time and do not wait
Critical Path
• Coordination (a barrier) makes load balancing harder
• Critical path: the longest sequential path (here, thread t1 with 10 steps)

[Diagram: t1 starts threads t1, t2, t3; t2 and t3 execute one step each, while t1 executes 9 extra steps; t1 then waits for all threads to complete (barrier) and continues. See the sketch below.]
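For reference, a minimal Java sketch of this fork-join pattern (steps simulated with sleeps; the step counts follow the slide’s example). However many threads we add, the elapsed time cannot drop below t1’s 10 steps, the critical path.

    public class CriticalPath {
        public static void main(String[] args) throws InterruptedException {
            long start = System.currentTimeMillis();

            Thread t1 = new Thread(() -> steps(10)); // 1 step + 9 extra steps
            Thread t2 = new Thread(() -> steps(1));  // one step
            Thread t3 = new Thread(() -> steps(1));  // one step
            t1.start(); t2.start(); t3.start();

            // Barrier: wait for all threads to complete
            t1.join(); t2.join(); t3.join();

            // Elapsed time is about 10 steps: the critical path (t1) dominates
            System.out.println((System.currentTimeMillis() - start) + " ms");
        }

        static void steps(int n) {
            for (int i = 0; i < n; i++) {
                try { Thread.sleep(100); } catch (InterruptedException e) { return; }
            }
        }
    }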
Message Passing
Message Passing
• Processes communicate by exchanging messages
• Sockets: communication endpoints
  • On a network: UDP sockets, TCP sockets
  • Within a node: Inter-Process Communication (IPC)
  • Different technologies, but similar abstractions
Building a Message
• Serialization
  • Message contents are stored at scattered locations in RAM
  • They need to be packed into a byte array to be sent
• Deserialization
  • Receive the byte array
  • Rebuild the original variables
• Pointers do not make sense across nodes anymore!
Example: Serializing a Binary Tree
• Question: How to serialize it?
• Possible solution
  • DFS
  • Mark null pointers with -1
• How to deserialize? (see the sketch below)

[Tree: root 10, with left child 5 and right child 12; all children of 5 and 12 are null]
Serialized (pre-order DFS, -1 for null): 10 5 -1 -1 12 -1 -1
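A minimal sketch of this scheme in Java (class and method names ours; it assumes, as on the slide, that -1 is not a valid node value). Serialization is a pre-order DFS that writes -1 for null; deserialization consumes the tokens in the same order and rebuilds the tree recursively.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class TreeSerializer {
        static class Node {
            int value;
            Node left, right;
            Node(int value) { this.value = value; }
        }

        // Pre-order DFS; null pointers are marked with -1
        static void serialize(Node n, List<Integer> out) {
            if (n == null) { out.add(-1); return; }
            out.add(n.value);
            serialize(n.left, out);
            serialize(n.right, out);
        }

        // Consume tokens in the same order: value, left subtree, right subtree
        static Node deserialize(Iterator<Integer> in) {
            int v = in.next();
            if (v == -1) return null;
            Node n = new Node(v);
            n.left = deserialize(in);
            n.right = deserialize(in);
            return n;
        }

        public static void main(String[] args) {
            Node root = new Node(10);
            root.left = new Node(5);
            root.right = new Node(12);

            List<Integer> tokens = new ArrayList<>();
            serialize(root, tokens);
            System.out.println(tokens); // [10, 5, -1, -1, 12, -1, -1]

            Node copy = deserialize(tokens.iterator());
            System.out.println(copy.right.value); // 12
        }
    }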
Threads + Message Passing
• Client-server model
  • The client sends requests
  • The server computes replies and sends them back
• Threads are often used to hide latency
  • Each client request is handled by a thread
  • The request might wait for resources (e.g. I/O)
  • Other threads execute other requests in the meanwhile (see the sketch below)
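A minimal thread-per-request server sketch in Java (the port number and one-line echo protocol are made up for the example). While one handler thread blocks on I/O, the other threads keep serving their own clients.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class ThreadPerRequestServer {
        public static void main(String[] args) throws Exception {
            try (ServerSocket server = new ServerSocket(9090)) {
                while (true) {
                    Socket client = server.accept();           // one client connection
                    new Thread(() -> handle(client)).start();  // one thread per request
                }
            }
        }

        static void handle(Socket client) {
            try (Socket c = client;
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(c.getInputStream()));
                 PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
                String request = in.readLine(); // may block on I/O; others proceed
                out.println("echo: " + request);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }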
Processes in Different Languages
• Java (interpreted)
  • The Java Virtual Machine (interpreter) is a process
  • Creating a new process entails creating a new JVM
  • Use ProcessBuilder (see the sketch below)
• C/C++ (compiled)
  • OS-specific details of how processes can be created
  • Typical call: fork()
    • Creates a child process, which executes the instructions after fork()
    • The child process is a full copy of the parent
  • More on forking later
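For reference, a minimal ProcessBuilder sketch (the command is an assumption; any executable on the PATH would do). It spawns a child process and waits for it, much like a fork()/exec()/wait() sequence in C.

    import java.io.IOException;

    public class SpawnProcess {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Spawn a child process running an external command
            ProcessBuilder pb = new ProcessBuilder("java", "-version");
            pb.inheritIO();                 // child shares our stdout/stderr
            Process child = pb.start();     // roughly fork() + exec()
            int exitCode = child.waitFor(); // roughly wait()
            System.out.println("child exited with code " + exitCode);
        }
    }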