guang r. gao acm fellow and ieee fellow endowed distinguished professor
DESCRIPTION
Topic 3 -- II: System Software Fundamentals: Multithreaded Execution Models, Virtual Machines and Memory Models. Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor Electrical & Computer Engineering University of Delaware [email protected]. Outline. - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/1.jpg)
CPEG421-2001-F-Topic-3-II 1
Topic 3 -- II: System Software Fundamentals:
Multithreaded Execution Models, Virtual Machines
and Memory Models
Guang R. Gao
ACM Fellow and IEEE FellowEndowed Distinguished ProfessorElectrical & Computer Engineering
University of Delaware
![Page 2: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/2.jpg)
CPEG421-2001-F-Topic-3-II 2
Outline• An introduction to parallel program execution
models• Coarse-grain vs. fine-grain multithreading• Evolution of fine-grain multithreaded program
execution models.• Memory and synchronization. models• Fine-Grain Multithreaded execution and virtual
machine models for peta-scale computing: a case study on HTMT/EARTH
![Page 3: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/3.jpg)
CPEG421-2001-F-Topic-3-II 3
Terminology Clarification
• Parallel Model of Computation– Parallel Models for Algorithm Designers– Parallel Models for System Designers
• Parallel Programming Models• Parallel Execution Models• Parallel Architecture Models
![Page 4: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/4.jpg)
CPEG421-2001-F-Topic-3-II 4
System Characterization
Questions:
Q1: What characteristics of a computational system are required …
Q2: The diversity of existing and potential multi-core architectures…
Response:
R1: An important characteristic of such a compiler should include, at both chip level and system level, a program execution model that should at least include the specification and API
Gao, ECCD Workshop, Washington D.C., Nov. 2007
![Page 5: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/5.jpg)
CPEG421-2001-F-Topic-3-II 5
What Does Program Execution Model (PXM) Mean ?
• The notion of PXM
The program execution model (PXM) is the basic
low-level abstraction of the underlying system
architecture upon which our programming model,
compilation strategy, runtime system, and other software components are developed.
• The PXM (and its API) serves as an interface between the architecture and the software.
![Page 6: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/6.jpg)
CPEG421-2001-F-Topic-3-II 6
Program Execution Model (PXM) – Cont’d
Unlike an instruction set architecture (ISA) specification, which usually focuses on lower level details (such as instruction encoding and organization of registers for a specific processor), the PXM refers to machine organization at a higher level for a whole class of high-end machines as view by the users
Gao, et. al., 2000
![Page 7: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/7.jpg)
CPEG421-2001-F-Topic-3-II 7
What is your “Favorite”
Program Execution Model?
![Page 8: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/8.jpg)
A Generic MIMD Architecture
CPEG421-2001-F-Topic-3-II 8
Memory NICCommunication
Assist
$
P
$
P
IC
Node: Processor(s), Memory System plus Communication assist (Network Interface & Communication Controller)
Full Feature Interconnect Networks. Packet Switching Fabrics. Key: Scalable Network
Objective: Make efficient use of scarce communication resources – providing high bandwidth, low-latency communication between nodes with a minimum cost and energy
![Page 9: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/9.jpg)
Programming Models for Multi-Processor Systems
• Message Passing Model– Multiple address
spaces
– Communication can only be achieved through “messages”
• Shared Memory Model– Memory address space
is accessible to all
– Communication is achieved through memory
CPEG421-2001-F-Topic-3-II 9
Local Memory
Processor
Local Memory
Processor
Messages
Processor Processor
Global Memory
![Page 10: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/10.jpg)
Comparison
Message Passing
+ Less Contention
+ Highly Scalable
+ Simplified Synch – Message Passing Sync +
Comm.
– But does not mean highly programmable
- Load Balancing
- Deadlock prone
- Overhead of small messages
Shared Memory
+ global shared address space
+ Easy to program (?)
+ No (explicit) message passing (e.g. communication through memory put/get operations)
- Synchronization (memory consistency models, cache models)
- Scalability
CPEG421-2001-F-Topic-3-II 10
![Page 11: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/11.jpg)
What is A Shared Memory Execution Model?
CPEG421-2001-F-Topic-3-II 11
Thread ModelA set of rules for creating, destroying and managing threads
Thread ModelA set of rules for creating, destroying and managing threads
Memory ModelDictate the ordering of memory operations
Memory ModelDictate the ordering of memory operations
Synchronization ModelProvide a set of mechanisms to protect from data races
Synchronization ModelProvide a set of mechanisms to protect from data races
Execution Model
The Thread Virtual MachineThe Thread Virtual Machine
![Page 12: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/12.jpg)
CPEG421-2001-F-Topic-3-II 12
Essential Aspects in User-Level Shared Memory Support?
• Shared address space support and management
• Access control and management
- Memory consistency model (MCM)
- Cache management mechanism
![Page 13: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/13.jpg)
CPEG421-2001-F-Topic-3-II 13
Grand Challenge Problems
• How to build a shared-memory multiprocessor that is
scalable both within a (multi-core/many-core chip) and a
system with many chips ?
• How to program and optimize application programs?
Our view: One major obstacle in solving these problems in
the memory coherence assumption in today’s hardware-
centric memory consistency model.
![Page 14: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/14.jpg)
A Parallel Execution Model
CPEG421-2001-F-Topic-3-II 14
Application Programming Interface (API)
Execution / Architecture Model
Thread Model
Memory Model
Synchronization Model
![Page 15: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/15.jpg)
A Parallel Execution Model
CPEG421-2001-F-Topic-3-II 15
Application Programming Interface (API)
With Dataflow Origins
Execution / Architecture Model
Fine Grained Multithreaded
Model
Memory Adaptive /
Aware Model
Fine Grained Synchronization
Model
Our Model
![Page 16: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/16.jpg)
CPEG421-2001-F-Topic-3-II 16
Comment on OS impact?
• Should compiler be OS-Aware too ? If so, how ?
• Or other alternatives ? Compiler-controlled runtime, of compiler-aware kernels, etc.
• Example: software pipelining …
Gao, ECCD Workshop, Washington D.C., Nov. 2007
![Page 17: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/17.jpg)
CPEG421-2001-F-Topic-3-II 17
Outline
• An introduction to multithreaded program execution models
• Coarse-grain vs. fine-grain parallel execution models – a historical overview
• Fine-grain multithreaded program execution models.
• Memory and synchronization. models• Fine-grain multithreaded execution and virtual
machine models for extreme-scale machines: a case study on HTMT/EARTH
![Page 18: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/18.jpg)
Course Grain Execution Models
CPEG421-2001-F-Topic-3-II 18
The Single Instruction Multiple Data (SIMD) Model
The Single Program Multiple Data (SPMD) Model
The Data Parallel Model
Pipelined Vector Unit orPipelined Vector Unit or
Array of ProcessorsArray of Processors
Program
Processor
Program
Processor
Program
Processor
Program
Processor
Task Task Task Task
Data Structure
![Page 19: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/19.jpg)
Data Parallel Model
CPEG421-2001-F-Topic-3-II 19
Difficult to write unstructured programsDifficult to write unstructured programsConvenient only for problems with regular structured parallelism.
Limited composability!Limited composability!Inherent limitation of coarse-grain multi-threading
Compute
Communication
Compute
Communication
?
Limitations
![Page 20: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/20.jpg)
Dataflow Model of Computation
CPEG421-2001-F-Topic-3-II 20
++
++**
a b c d e
1
3
4
3
![Page 21: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/21.jpg)
Dataflow Model of Computation
CPEG421-2001-F-Topic-3-II 21
++
++**
a b c d e
4
3
4
![Page 22: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/22.jpg)
Dataflow Model of Computation
CPEG421-2001-F-Topic-3-II 22
++
++**
a b c d e
7
4
![Page 23: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/23.jpg)
Dataflow Model of Computation
CPEG421-2001-F-Topic-3-II 23
++
++**
a b c d e
28
![Page 24: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/24.jpg)
Dataflow Model of Computation
CPEG421-2001-F-Topic-3-II 24
++
++**
a b c d e
1
3
4
3
28
Dataflow Software Pipelining
![Page 25: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/25.jpg)
CPEG421-2001-F-Topic-3-II 25
Outline
• An introduction to multithreaded program execution models
• Coarse-grain vs. fine-grain parallel execution models – A Historical Overview
• Fine-grain multithreaded program execution models.
• Memory and synchronization. models• Fine-grain multithreaded execution and virtual
machine models for peta-scale machines: a case study on HTMT/EARTH
![Page 26: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/26.jpg)
CPEG421-2001-F-Topic-3-II 26
CPU
Memory
Fine-Grain non-preemptive thread-The “hotel” model
ThreadUnit
ExecutorLocus
Coarse-Grain vs. Fine-Grain Multithreading
A PoolThread
CPU
Memory
ExecutorLocus
A SingleThread
Coarse-Grain thread-The family home model
ThreadUnit
[Gao: invited talk at Fran Allen’s Retirement Workshop, 07/2002]
![Page 27: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/27.jpg)
CPEG421-2001-F-Topic-3-II 27
Evolution of Multithreaded Execution and Architecture Models
Non-dataflowbased
CDC 66001964
MASAHalstead1986
HEPB. Smith1978
Cosmic CubeSeiltz1985
J-MachineDally1988-93
M-MachineDally1994-98
Dataflowmodel inspired
MIT TTDAArvind1980
ManchesterGurd & Watson1982
*T/Start-NGMIT/Motorola1991-
SIGMA-IShimada1988
MonsoonPapadopoulos& Culler 1988
P-RISCNikhil & Arvind1989
EM-5/4/X RWC-11992-97
Iannuci’s1988-92
Others: Multiscalar (1994), SMT (1995), etc.
Flynn’sProcessor1969
CHoPP’77 CHoPP’87
TAMCuller1990
TeraB. Smith1990-
AlwifeAgarwal1989-96
CilkLeiserson
LAUSyre1976
Eldorado
CASCADE
StaticDataflowDennis 1972MIT
Arg-FetchingDataflowDennisGao1987-88
MDFAGao1989-93
MTAHumTheobaldGao 94
EARTH CAREPACT95’, ISCA96, Theobald99
Marquez04
![Page 28: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/28.jpg)
The Von Neumann-type Processing
CPEG421-2001-F-Topic-3-II 28
begin for i = 1 … … endforend
begin for i = 1 … … endforend
Source Code
CompilerSequential Machine
Representation
CPU
Load
Processor
![Page 29: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/29.jpg)
A Multithreaded Architecture
CPEG421-2001-F-Topic-3-II 29
To Other PE’s
One PE
![Page 30: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/30.jpg)
CPEG421-2001-F-Topic-3-II 30
McGill Data FlowArchitecture Model
(MDFA)
![Page 31: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/31.jpg)
CPEG421-2001-F-Topic-3-II 31
n1
n2 n3
stor
e
store
fetchfetch
n1
n2 n3
store
fetch fetch
Argument –flow Principle Argument –fetching Principle
![Page 32: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/32.jpg)
A Dataflow Program Tuple
CPEG421-2001-F-Topic-3-II 32
Program Tuple = { P-Code . S-Code }Program Tuple = { P-Code . S-Code }
P-CodeP-Code
N1: x = a + b;N2: y = c – d;N3: z = x * y;
S-CodeS-Code
22
33n1n1
a
b
22
33n2n2
c
d
22
33n1n1
IPUIPU ISUISU
![Page 33: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/33.jpg)
The McGill Dataflow Architecture Model
CPEG421-2001-F-Topic-3-II 33
Pipelined Instruction Processing Unit (PIPU)
Dataflow Instruction Scheduling Unit (DISU)
Enable Memory & Controller
Signal Processing
Fire Done
![Page 34: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/34.jpg)
The McGill Dataflow Architecture Model
CPEG421-2001-F-Topic-3-II 34
Pipelined Instruction Processing Unit (PIPU)
Dataflow Instruction Scheduling Unit (DISU)
Fire Done
Waiting Instructions
Enabled Instructions = PC
Important Features
Pipeline can be kept fully utilized provided that the program has sufficient parallelism
![Page 35: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/35.jpg)
The Scheduling Memory (Enable)
CPEG421-2001-F-Topic-3-II 35
Dataflow Instruction Scheduling Unit (DISU)
CONTROLLER
1 1
1 1
01
0 0
0 0
0
1 1
1
1 0
0 0
0 1
Signal Processing
Fire Done
Count Signal(s)
0 Waiting Instructions1 Enabled Instructions
![Page 36: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/36.jpg)
CPEG421-2001-F-Topic-3-II 36
Advantages of the McGill Dataflow Architecture Model
• Eliminate unnecessary token copying and transmission overhead
• Instruction scheduling is separated from the main datapath of the processor (e.g. asynchronous, decoupled)
![Page 37: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/37.jpg)
Von Neumann Threads as Macro Dataflow Nodes
CPEG421-2001-F-Topic-3-II 37
1
2
3
k
A sequence of instructions is “packed” into a macro-dataflow node
Synchronization is done at the macro-node level
![Page 38: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/38.jpg)
CPEG421-2001-F-Topic-3-II 38
Hybrid Evaluation Von Neumann Style Instruction Execution” on
the McGill Dataflow Architecture• Group a “sequence” of dataflow instruction into a “thread” or
a macro dataflow node.• Data-driven synchronization among threads.• “Von Neumann style sequencing” within a thread.
Advantage:Preserves the parallelism among threads but avoids unnecessary fine-grain synchronization between instructions within a sequential thread.
![Page 39: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/39.jpg)
CPEG421-2001-F-Topic-3-II 39
What Do We Get?
• A hybrid architecture model without sacrificing the advantage of fine-grain parallelism!(latency-hiding, pipelining support)
![Page 40: Guang R. Gao ACM Fellow and IEEE Fellow Endowed Distinguished Professor](https://reader035.vdocuments.us/reader035/viewer/2022062304/56814624550346895db32eed/html5/thumbnails/40.jpg)
A Realization of the Hybrid Evaluation
CPEG421-2001-F-Topic-3-II 40
Pipelined Instruction Processing Unit (PIPU)
Dataflow Instruction Scheduling Unit (DISU)
Fire Done
Shortcut
1 2 k
Von Neumann bitVon Neumann bit