CPEG421-2001-F-Topic-3-II 1

Topic 3 -- II: System Software Fundamentals:

Multithreaded Execution Models, Virtual Machines

and Memory Models

Guang R. Gao

ACM Fellow and IEEE Fellow

Endowed Distinguished Professor

Electrical & Computer Engineering

University of Delaware



Outline

• An introduction to parallel program execution models

• Coarse-grain vs. fine-grain multithreading

• Evolution of fine-grain multithreaded program execution models

• Memory and synchronization models

• Fine-grain multithreaded execution and virtual machine models for peta-scale computing: a case study on HTMT/EARTH


Terminology Clarification

• Parallel Model of Computation

– Parallel Models for Algorithm Designers

– Parallel Models for System Designers

• Parallel Programming Models

• Parallel Execution Models

• Parallel Architecture Models


System Characterization

Questions:

Q1: What characteristics of a computational system are required …

Q2: The diversity of existing and potential multi-core architectures…

Response:

R1: An important characteristic of such a compiler should include, at both chip level and system level, a program execution model that should at least include the specification and API

Gao, ECCD Workshop, Washington D.C., Nov. 2007


What Does Program Execution Model (PXM) Mean ?

• The notion of PXM

The program execution model (PXM) is the basic low-level abstraction of the underlying system architecture upon which our programming model, compilation strategy, runtime system, and other software components are developed.

• The PXM (and its API) serves as an interface between the architecture and the software.


Program Execution Model (PXM) – Cont’d

Unlike an instruction set architecture (ISA) specification, which usually focuses on lower-level details (such as instruction encoding and the organization of registers for a specific processor), the PXM refers to machine organization at a higher level, for a whole class of high-end machines, as viewed by the users.

Gao et al., 2000


What is your “Favorite”

Program Execution Model?

A Generic MIMD Architecture


[Figure: one node with processors (P), caches ($), memory, and a communication assist (NIC)]

Node: Processor(s), Memory System plus Communication assist (Network Interface & Communication Controller)

Full Feature Interconnect Networks. Packet Switching Fabrics. Key: Scalable Network

Objective: Make efficient use of scarce communication resources – providing high bandwidth, low-latency communication between nodes with a minimum cost and energy

Programming Models for Multi-Processor Systems

• Message Passing Model

– Multiple address spaces

– Communication can only be achieved through "messages"

• Shared Memory Model

– Memory address space is accessible to all

– Communication is achieved through memory


[Figure: message passing shown as two processors, each with its own local memory, exchanging messages; shared memory shown as two processors attached to one global memory]

Comparison

Message Passing

+ Less Contention

+ Highly Scalable

+ Simplified synchronization: message passing = sync + comm.

– But this does not mean highly programmable

- Load Balancing

- Deadlock prone

- Overhead of small messages

Shared Memory

+ global shared address space

+ Easy to program (?)

+ No (explicit) message passing (e.g. communication through memory put/get operations)

- Synchronization (memory consistency models, cache models)

- Scalability
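The two programming models compared above can be contrasted in a few lines. This is a minimal sketch, not taken from the slides: the function names `message_passing_sum` and `shared_memory_sum` are hypothetical, and because Python threads already share one address space, the message passing half only emulates separate address spaces by letting workers communicate through a queue and nothing else.

```python
import queue
import threading


def message_passing_sum(chunks):
    """Each worker owns its chunk and communicates only by sending a message."""
    mailbox = queue.Queue()

    def worker(chunk):
        # One message carries the data and also signals completion,
        # so synchronization and communication happen together.
        mailbox.put(sum(chunk))

    workers = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for w in workers:
        w.start()
    # get() blocks until a message arrives; exactly one message per worker.
    total = sum(mailbox.get() for _ in chunks)
    for w in workers:
        w.join()
    return total


def shared_memory_sum(chunks):
    """All workers update one shared location; a lock prevents a data race."""
    state = {"total": 0}
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:  # explicit synchronization, separate from the data itself
            state["total"] += partial

    workers = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return state["total"]
```

In the first version the only coordination is the message itself. In the second, communication is an ordinary memory update, so a separate mechanism (here a lock) is needed to order the accesses.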


What is A Shared Memory Execution Model?


Thread Model: A set of rules for creating, destroying and managing threads

Memory Model: Dictates the ordering of memory operations

Synchronization Model: Provides a set of mechanisms to protect from data races

Execution Model

The Thread Virtual Machine


Essential Aspects in User-Level Shared Memory Support?

• Shared address space support and management

• Access control and management

- Memory consistency model (MCM)

- Cache management mechanism


Grand Challenge Problems

• How to build a shared-memory multiprocessor that is scalable both within a multi-core/many-core chip and across a system with many chips?

• How to program and optimize application programs?

Our view: One major obstacle in solving these problems is the memory coherence assumption in today's hardware-centric memory consistency model.

A Parallel Execution Model


Application Programming Interface (API)

Execution / Architecture Model

Thread Model

Memory Model

Synchronization Model

A Parallel Execution Model


Application Programming Interface (API)

With Dataflow Origins

Execution / Architecture Model

Fine-Grained Multithreaded Model

Memory Adaptive / Aware Model

Fine-Grained Synchronization Model

Our Model


Comment on OS impact?

• Should compiler be OS-Aware too ? If so, how ?

• Or other alternatives? Compiler-controlled runtime, or compiler-aware kernels, etc.

• Example: software pipelining …

Gao, ECCD Workshop, Washington D.C., Nov. 2007


Outline

• An introduction to multithreaded program execution models

• Coarse-grain vs. fine-grain parallel execution models – a historical overview

• Fine-grain multithreaded program execution models.

• Memory and synchronization models

• Fine-grain multithreaded execution and virtual machine models for extreme-scale machines: a case study on HTMT/EARTH

Coarse-Grain Execution Models


The Single Instruction Multiple Data (SIMD) Model

The Single Program Multiple Data (SPMD) Model

The Data Parallel Model

[Figure: SIMD shown as a pipelined vector unit or array of processors; SPMD shown as four Program/Processor pairs; the data parallel model shown as four tasks operating on one data structure]

Data Parallel Model


Difficult to write unstructured programs: convenient only for problems with regular structured parallelism.

Limited composability! An inherent limitation of coarse-grain multi-threading.

[Figure: alternating Compute and Communication phases, followed by a question mark]

Limitations

Dataflow Model of Computation


[Figure: a dataflow graph with two + nodes and one * node over inputs a, b, c, d, e, shown firing step by step. Tokens on the arcs: 1, 3, 4, 3; then 4, 3, 4; then 7, 4; then the result 28.]
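The firing sequence in the dataflow figure can be reproduced with a small token-driven interpreter. This is a sketch under stated assumptions: the exact wiring of the graph is not recoverable from the slide text, so the three-node graph `GRAPH`, the arc names `t1` and `t2`, and the placement of the initial tokens are hypothetical, chosen only so that the token values evolve as 1, 3, 3, 4, then 4, 3, 4, then 7, 4, then 28.

```python
import operator

# Hypothetical wiring: node name -> (operation, input arcs, output arc).
GRAPH = {
    "add1": (operator.add, ("a", "b"), "t1"),
    "add2": (operator.add, ("t1", "c"), "t2"),
    "mul": (operator.mul, ("t2", "d"), "result"),
}


def run_dataflow(graph, initial_tokens):
    """Fire any node whose input arcs all hold a token, until none can fire."""
    tokens = dict(initial_tokens)
    trace = []
    fired = True
    while fired:
        fired = False
        for name, (op, inputs, output) in graph.items():
            if all(arc in tokens for arc in inputs):
                # Firing consumes the input tokens and produces one output token.
                args = [tokens.pop(arc) for arc in inputs]
                tokens[output] = op(*args)
                trace.append(name)
                fired = True
    return tokens, trace


final_tokens, firing_order = run_dataflow(GRAPH, {"a": 1, "b": 3, "c": 3, "d": 4})
print(final_tokens, firing_order)
```

No program counter appears anywhere: the order of execution is determined only by the availability of operands, which is the defining property of the dataflow model.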

Dataflow Software Pipelining


Outline

• An introduction to multithreaded program execution models

• Coarse-grain vs. fine-grain parallel execution models – A Historical Overview

• Fine-grain multithreaded program execution models.

• Memory and synchronization models

• Fine-grain multithreaded execution and virtual machine models for peta-scale machines: a case study on HTMT/EARTH


Coarse-Grain vs. Fine-Grain Multithreading

[Figure: two CPU/Memory diagrams, each labeled with a Thread Unit and an Executor Locus. Coarse-grain thread, the "family home" model, labeled "A Single Thread". Fine-grain non-preemptive thread, the "hotel" model, labeled "A Pool Thread".]

[Gao: invited talk at Fran Allen’s Retirement Workshop, 07/2002]


Evolution of Multithreaded Execution and Architecture Models

Non-dataflow based

CDC 6600, 1964

MASA, Halstead, 1986

HEP, B. Smith, 1978

Cosmic Cube, Seitz, 1985

J-Machine, Dally, 1988-93

M-Machine, Dally, 1994-98

Dataflow model inspired

MIT TTDA, Arvind, 1980

Manchester, Gurd & Watson, 1982

*T/Start-NG, MIT/Motorola, 1991-

SIGMA-I, Shimada, 1988

Monsoon, Papadopoulos & Culler, 1988

P-RISC, Nikhil & Arvind, 1989

EM-5/4/X, RWC-1, 1992-97

Iannucci's, 1988-92

Others: Multiscalar (1994), SMT (1995), etc.

Flynn's Processor, 1969

CHoPP'77, CHoPP'87

TAM, Culler, 1990

Tera, B. Smith, 1990-

Alewife, Agarwal, 1989-96

Cilk, Leiserson

LAU, Syre, 1976

Eldorado

CASCADE

Static Dataflow, Dennis, 1972, MIT

Arg-Fetching Dataflow, Dennis, Gao, 1987-88

MDFA, Gao, 1989-93

MTA, Hum, Theobald, Gao, 94

EARTH, CARE: PACT'95, ISCA'96, Theobald 99, Marquez 04

The Von Neumann-type Processing


[Figure: source code (begin for i = 1 … … endfor end) goes through the compiler to a sequential machine representation, which is loaded onto the CPU (processor)]

A Multithreaded Architecture


To Other PE’s

One PE


McGill Dataflow Architecture Model (MDFA)


[Figure: two graphs over nodes n1, n2, n3 with store and fetch operations, one illustrating the argument-flow principle and the other the argument-fetching principle]

A Dataflow Program Tuple


Program Tuple = { P-Code . S-Code }

P-Code

N1: x = a + b;

N2: y = c - d;

N3: z = x * y;

S-Code

[Figure: signal graph with entries labeled n1 (operands a, b), n2 (operands c, d), and n1, each annotated with the values 2 and 3]

IPU ISU

The McGill Dataflow Architecture Model


Pipelined Instruction Processing Unit (PIPU)

Dataflow Instruction Scheduling Unit (DISU)

Enable Memory & Controller

Signal Processing

Fire Done

The McGill Dataflow Architecture Model




Fire Done

Waiting Instructions

Enabled Instructions = PC

Important Features

Pipeline can be kept fully utilized provided that the program has sufficient parallelism

The Scheduling Memory (Enable)



[Figure: the CONTROLLER and the enable memory, drawn as an array of bits, with a Signal Processing block and Fire, Done and Count Signal(s) labels. Legend: 0 = waiting instructions, 1 = enabled instructions.]


Advantages of the McGill Dataflow Architecture Model

• Eliminate unnecessary token copying and transmission overhead

• Instruction scheduling is separated from the main datapath of the processor (e.g. asynchronous, decoupled)
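The separation of instruction processing from instruction scheduling can be illustrated with the three-instruction P-Code shown earlier (N1: x = a + b, N2: y = c - d, N3: z = x * y). This is a minimal sketch, not the actual MDFA design: the S-Code encoding (a signal count plus a list of nodes to signal), the function name `run_tuple`, and the assumption that the scheduling unit issues fire and the processing unit answers with done are all illustrative.

```python
from collections import deque

# P-Code: each instruction fetches its operands from memory and stores a result.
P_CODE = {
    "N1": ("x", lambda m: m["a"] + m["b"]),
    "N2": ("y", lambda m: m["c"] - m["d"]),
    "N3": ("z", lambda m: m["x"] * m["y"]),
}

# S-Code (assumed encoding): signals still awaited, and whom to signal on done.
S_CODE = {
    "N1": {"count": 0, "signals": ["N3"]},
    "N2": {"count": 0, "signals": ["N3"]},
    "N3": {"count": 2, "signals": []},
}


def run_tuple(p_code, s_code, memory):
    """Execute a program tuple; data stays in memory, only signals move."""
    memory = dict(memory)
    counts = {node: entry["count"] for node, entry in s_code.items()}
    enabled = deque(node for node, c in counts.items() if c == 0)
    fired = []
    while enabled:
        node = enabled.popleft()            # scheduling unit: fire
        dest, compute = p_code[node]
        memory[dest] = compute(memory)      # processing unit: fetch, execute, store
        fired.append(node)
        for succ in s_code[node]["signals"]:  # done: deliver count signals
            counts[succ] -= 1
            if counts[succ] == 0:
                enabled.append(succ)        # count reached zero: now enabled
    return memory, fired


result, order = run_tuple(P_CODE, S_CODE, {"a": 5, "b": 3, "c": 6, "d": 2})
print(result["z"], order)
```

Operand values are never copied along arcs. Instructions read and write memory directly, and the scheduler handles only small signals, which is the token-copying overhead the first bullet says is eliminated.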

Von Neumann Threads as Macro Dataflow Nodes


[Figure: a macro-dataflow node containing instructions 1, 2, 3 through k]

A sequence of instructions is “packed” into a macro-dataflow node

Synchronization is done at the macro-node level


Hybrid Evaluation: "Von Neumann Style Instruction Execution" on the McGill Dataflow Architecture

• Group a "sequence" of dataflow instructions into a "thread" or a macro dataflow node.

• Data-driven synchronization among threads.

• "Von Neumann style sequencing" within a thread.

Advantage: Preserves the parallelism among threads but avoids unnecessary fine-grain synchronization between instructions within a sequential thread.
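The hybrid scheme can be sketched by attaching the synchronization count to a whole thread instead of to each instruction. The `MacroNode` class, the `run_threads` function and the example threads T1, T2, T3 below are hypothetical names introduced for illustration; they are not part of the McGill architecture description.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MacroNode:
    """A thread: sequential instructions guarded by one synchronization count."""
    name: str
    sync_count: int                  # signals needed before the thread may start
    body: list                       # [(destination, fn(memory))] in program order
    successors: list = field(default_factory=list)


def run_threads(nodes, memory):
    memory = dict(memory)
    by_name = {n.name: n for n in nodes}
    pending = {n.name: n.sync_count for n in nodes}
    ready = deque(n.name for n in nodes if n.sync_count == 0)
    order, sync_signals = [], 0
    while ready:
        node = by_name[ready.popleft()]
        # Von Neumann style sequencing: no per-instruction synchronization here.
        for dest, fn in node.body:
            memory[dest] = fn(memory)
        order.append(node.name)
        # Data-driven synchronization happens only at the thread boundary.
        for succ in node.successors:
            sync_signals += 1
            pending[succ] -= 1
            if pending[succ] == 0:
                ready.append(succ)
    return memory, order, sync_signals


THREADS = [
    MacroNode("T1", 0, [("x", lambda m: m["a"] + m["b"]),
                        ("x2", lambda m: m["x"] * m["x"])], ["T3"]),
    MacroNode("T2", 0, [("y", lambda m: m["c"] - m["d"]),
                        ("y2", lambda m: m["y"] * 2)], ["T3"]),
    MacroNode("T3", 2, [("z", lambda m: m["x2"] + m["y2"])]),
]

memory, order, sync_signals = run_threads(THREADS, {"a": 1, "b": 2, "c": 7, "d": 3})
print(memory["z"], order, sync_signals)
```

In this example five instructions execute but only two synchronization signals are sent. T1 and T2 remain independent of each other, so the parallelism among threads is preserved.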


What Do We Get?

• A hybrid architecture model without sacrificing the advantage of fine-grain parallelism! (latency-hiding, pipelining support)

A Realization of the Hybrid Evaluation




[Figure: Fire and Done signals with a Shortcut path through instructions 1, 2 through k, marked by a Von Neumann bit]
