
Page 1:

Introduction to Parallel Processing
• Parallel Computer Architecture: Definition & broad issues involved
  – A Generic Parallel Computer Architecture
• The Need and Feasibility of Parallel Computing
  – Scientific Supercomputing Trends
  – CPU Performance and Technology Trends, Parallelism in Microprocessor Generations
  – Computer System Peak FLOP Rating History/Near Future
• The Goal of Parallel Processing
• Elements of Parallel Computing
• Factors Affecting Parallel System Performance
• Parallel Architectures History
  – Parallel Programming Models
  – Flynn's 1972 Classification of Computer Architecture
• Current Trends In Parallel Architectures
  – Modern Parallel Architecture Layered Framework
• Shared Address Space Parallel Architectures
• Message-Passing Multicomputers: Message-Passing Programming Tools
• Data Parallel Systems
• Dataflow Architectures
• Systolic Architectures: Matrix Multiplication Systolic Array Example

PCA Chapter 1.1, 1.2

Why?

Page 2:

Parallel Computer Architecture
A parallel computer (or multiple-processor system) is a collection of communicating processing elements (processors) that cooperate to solve large computational problems quickly by dividing such problems into parallel tasks, exploiting Thread-Level Parallelism (TLP), i.e. parallel processing.

• Broad issues involved:
  – The concurrency and communication characteristics of parallel algorithms for a given computational problem (represented by dependency graphs).
  – Computing Resources and Computation Allocation:
    • The number of processing elements (PEs), the computing power of each element, and the amount/organization of physical memory used.
    • What portions of the computation and data are allocated or mapped to each PE.
  – Data Access, Communication and Synchronization:
    • How the processing elements cooperate and communicate.
    • How data is shared/transmitted between processors.
    • Abstractions and primitives for cooperation/communication and synchronization.
    • The characteristics and performance of the parallel system network (system interconnects).
  – Parallel Processing Performance and Scalability Goals:
    • Maximize the performance enhancement of parallelism: maximize speedup.
      – By minimizing parallelization overheads and balancing the workload on processors.
    • Scalability of performance to larger systems/problems.

Processor = programmable computing element that runs stored programs written using a pre-defined instruction set

Processing Elements = PEs = Processors

Task = Computation done on one processor

Goals

Page 3:

A Generic Parallel Computer Architecture

Processing Nodes: each processing node contains one or more processing elements (PEs) or processor(s), a memory system, plus a communication assist (CA): network interface and communication controller. One or more processing elements or processors per node: custom or commercial microprocessors, single or multiple processors (2-8 cores) per chip, homogeneous or heterogeneous.

Parallel machine network (system interconnects): the function of a parallel machine network is to transfer information (data, results, ...) efficiently (i.e. at reduced communication cost) from a source node to a destination node, as needed to allow cooperation among parallel processing nodes in solving large computational problems divided into a number of parallel computational tasks.

[Figure: processing nodes, each with processor(s) P, cache ($), memory (Mem) and a communication assist (CA)/network interface (custom or industry standard), connected by the parallel machine network (custom or industry standard), with the operating system and parallel programming environments layered on top. Parallel computer = multiple processor system.]

Page 4:

The Need and Feasibility of Parallel Computing

• Application demands (driving force): more computing cycles/memory needed.
  – Scientific/engineering computing: CFD, biology, chemistry, physics, ...
  – General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming, ...
  – Mainstream multithreaded programs are similar to parallel programs (+ multi-tasking: multiple independent programs).
• Technology Trends:
  – Number of transistors on a chip growing rapidly (Moore's Law still alive). Clock rates expected to continue to go up, but only slowly. Actual performance returns diminishing due to deeper pipelines.
  – Increased transistor density allows integrating multiple processor cores per chip, creating Chip-Multiprocessors (CMPs), even for mainstream computing applications (desktop/laptop, ...).
• Architecture Trends:
  – Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
  – Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
  – Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor systems, is the most viable approach to further improve performance.
    • Main motivation for the development of chip-multiprocessors (CMPs), i.e. multi-core processors.
• Economics:
  – The increased utilization of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.
    • Today's microprocessors offer high performance and have multiprocessor support, eliminating the need for designing expensive custom PEs.
    • Commercial System Area Networks (SANs) offer an alternative to custom, more costly networks.

Page 5:

Why is Parallel Processing Needed?
Challenging Applications in Applied Science/Engineering

• Astrophysics

• Atmospheric and Ocean Modeling

• Bioinformatics

• Biomolecular simulation: Protein folding

• Computational Chemistry

• Computational Fluid Dynamics (CFD)

• Computational Physics

• Computer vision and image understanding

• Data Mining and Data-intensive Computing

• Engineering analysis (CAD/CAM)

• Global climate modeling and forecasting

• Material Sciences

• Military applications

• Quantum chemistry

• VLSI design

• ….

Such applications have very high (1) computational and (2) memory requirements that cannot be met with single-processor architectures.

Many applications contain a large degree of computational parallelism.

Driving force for High Performance Computing (HPC) and multiple processor system development

Traditional Driving Force For HPC/Parallel Processing

Page 6:

Why is Parallel Processing Needed?
Scientific Computing Demands (Memory Requirement)

Computational and memory demands exceed the capabilities of even the fastest current uniprocessor systems (about 5-16 GFLOPS for a uniprocessor).

Driving force for HPC and multiple processor system development.

GFLOP = 10^9 FLOPS; TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS

Page 7:

Scientific Supercomputing Trends

• Proving ground and driver for innovative architecture and advanced high-performance computing (HPC) techniques:
  – The market is much smaller relative to the commercial (desktop/server) segment.
  – Dominated by costly vector machines starting in the 1970s through the 1980s.
  – Microprocessors have made huge gains in the performance needed for such applications (as shown in the last slide), enabled by high transistor density per chip:
    • High clock rates. (Bad: higher CPI?)
    • Multiple pipelined floating-point units.
    • Instruction-level parallelism.
    • Effective use of caches.
    • Multiple processor cores per chip (2 cores 2002-2005, 4 at the end of 2006, 6-12 cores 2011, 16 cores in 2013).
  However, even the fastest current single-microprocessor systems still cannot meet the needed computational demands.
• Currently: large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (have replaced?) vector supercomputers that utilize custom processors.

Page 8:

Uniprocessor Performance Evaluation

• CPU performance benchmarking is heavily program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
  – Total CPU time = T = TC / f = TC x C = I x CPI x C = I x (CPIexecution + M x k) x C (in seconds)
    where TC = total program execution clock cycles, f = clock rate, C = CPU clock cycle time = 1/f, I = instructions executed count, CPI = cycles per instruction, CPIexecution = CPI with ideal memory, M = memory stall cycles per memory access, k = memory accesses per instruction.
  – MIPS rating = I / (T x 10^6) = f / (CPI x 10^6) = f x I / (TC x 10^6) (in million instructions per second).
  – Throughput rate: Wp = 1/T = f / (I x CPI) = (MIPS) x 10^6 / I (in programs per second).
• Performance factors (I, CPIexecution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.

T = I x CPI x C
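A short worked example of these formulas (with assumed values, not from the slides): for a program with I = 2 x 10^9 instructions, CPIexecution = 1.2, k = 0.4 memory accesses per instruction, M = 5 stall cycles per memory access, and f = 2 GHz (C = 0.5 ns): CPI = 1.2 + 5 x 0.4 = 3.2, T = 2 x 10^9 x 3.2 x 0.5 ns = 3.2 seconds, and MIPS rating = f / (CPI x 10^6) = 2000 / 3.2 = 625 MIPS.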

Page 9:

Single CPU Performance Trends

[Figure: single-CPU performance (log scale, 0.1 to 100) vs. year (1965-1995) for supercomputers, mainframes, minicomputers and microprocessors; custom processors (supercomputers) vs. commodity processors (microprocessors).]

• The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.
• This is even more true with the development of cost-effective multi-core microprocessors that support TLP at the chip level.

Page 10:

Microprocessor Frequency Trend

• Frequency doubled each generation.
• Number of gate delays per clock reduced by 25%.
• Leads to deeper pipelines with more stages (e.g. Intel Pentium 4E has 30+ pipeline stages).

Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle). T = I x CPI x C

Reality check: clock frequency scaling is slowing down! (Did silicon finally hit the wall?)
Why? 1- Power leakage 2- Clock distribution delays
Solution: exploit TLP at the chip level: chip-multiprocessors (CMPs).

[Figure: processor frequency (MHz, log scale 10-10,000) and gate delays per clock (log scale 1-100) vs. year (1987-2005) for Intel (386, 486, Pentium, Pentium Pro, Pentium II), IBM PowerPC (601, 603, 604, 604+, MPC750) and DEC Alpha (21066, 21064, 21064A, 21164, 21164A, 21264, 21264S) processors. Processor frequency scaled by 2x per generation, but this is no longer the case.]

Page 11:

Transistor Count Growth Rate
(Enabling Technology for Chip-Level Thread-Level Parallelism, TLP)

• One billion transistors/chip reached in 2005, two billion in 2008-9, now ~ 7 billion.
• Transistor count grows faster than clock rate: currently ~ 40% per year.
• Single-threaded uniprocessors do not efficiently utilize the increased transistor count (limited ILP, increased cache sizes).

Moore's Law: 2X transistors/chip every 1.5 years (circa 1970); still holds.
Solution: enables Thread-Level Parallelism (TLP) at the chip level: Chip-Multiprocessors (CMPs) + Simultaneous Multithreaded (SMT) processors.

[Figure: transistor count per chip vs. year, from the Intel 4004 (2,300 transistors) to ~7 billion today: a ~3,000,000x transistor density increase in the last 40 years.]

Page 12:

Parallelism in Microprocessor VLSI Generations
(Improving microprocessor generation performance by exploiting more levels of parallelism)

[Figure: transistors per chip (log scale, 1,000 to 100,000,000) vs. year (1970-2005), from the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000 through the Pentium and R10000, showing successive generations of parallelism: bit-level parallelism, then instruction-level parallelism, then thread-level parallelism (?).]

• Not pipelined: CPI >> 1 (multiple micro-operations per cycle, multi-cycle non-pipelined).
• Single-issue pipelined: CPI = 1.
• Superscalar/VLIW: CPI < 1 (ILP, single thread).
• Chip-level TLP / parallel processing (per chip), even more important due to the slowing clock rate increase:
  – Simultaneous Multithreading (SMT), e.g. Intel's Hyper-Threading.
  – Chip-Multiprocessors (CMPs), e.g. IBM Power 4, 5; Intel Pentium D, Core Duo; AMD Athlon 64 X2, Dual-Core Opteron; Sun UltraSparc T1 (Niagara).

ILP = Instruction-Level Parallelism; TLP = Thread-Level Parallelism

Page 13:

Dual-Core Chip-Multiprocessor (CMP) Architectures

• Single die, shared L2 (or L3) cache: on-chip crossbar/switch; cores communicate using the shared cache (lowest communication latency). Examples: IBM POWER4/5; Intel Pentium Core Duo (Yonah), Conroe (Core 2), i7; Sun UltraSparc T1 (Niagara); AMD Phenom, ...
• Single die, private caches, shared system interface: cores communicate using on-chip interconnects (shared system interface). Examples: AMD Dual-Core Opteron, Athlon 64 X2; Intel Itanium 2 (Montecito).
• Two dice, shared package, private caches, private system interfaces: cores communicate over the external Front Side Bus (FSB) (highest communication latency). Examples: Intel Pentium D, Intel quad-core (two dual-core chips).

Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615

Page 14:

Example Six-Core CMP: AMD Phenom II X6
Six processor cores sharing 6 MB of level 3 (L3) cache.

CMP = Chip-Multiprocessor

Page 15:

Example Eight-Core CMP: Intel Nehalem-EX
Eight processor cores sharing 24 MB of level 3 (L3) cache.
Each core is 2-way SMT (2 threads per core), for a total of 16 threads.

Page 16:

Example 100-Core CMP: Tilera TILE-Gx Processor

For more information see: http://www.tilera.com/

No shared cache

Page 17:

Microprocessors Vs. Vector Processors
Uniprocessor Performance: LINPACK

[Figure: LINPACK performance (MFLOPS, log scale 1-10,000) vs. year (1975-2000) for vector processors (CRAY, n = 100 and n = 1,000: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) and microprocessors (n = 100 and n = 1,000: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, HP 9000/735, DEC Alpha, IBM Power2/990, MIPS R4400, DEC 8200). 1 GFLOP = 10^9 FLOPS; now about 5-16 GFLOPS per microprocessor core.]

Page 18:

Parallel Performance: LINPACK

[Figure: LINPACK performance (GFLOPS, log scale 0.1-10,000) vs. year (1985-1996) for CRAY peak (Xmp/416 (4), Ymp/832 (8), C90 (16), T932 (32)) and MPP peak (iPSC/860, nCUBE/2 (1024), CM-2, CM-200, Delta, CM-5, T3D, Paragon XP/S, Paragon XP/S MP (1024), Paragon XP/S MP (6768), ASCI Red). 1 TeraFLOP = 10^12 FLOPS = 1000 GFLOPS.]

Current Top LINPACK Performance (since ~ Nov. 2012):
Now about 17,590,000 GFlop/s = 17,590 TeraFlops/s = 17.59 PetaFlops/s
Cray XK7 - Titan (@ Oak Ridge National Laboratory), 560,640 total processor cores:
299,008 Opteron cores (18,688 AMD Opteron 6274 16-core processors @ 2.2 GHz) +
261,632 GPU cores (18,688 Nvidia Tesla Kepler K20x GPUs @ 0.7 GHz)

Current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org

Page 19:

Why is Parallel Processing Needed?
LINPACK Performance Trends

[Figure: two panels. Left, uniprocessor performance: LINPACK MFLOPS (log scale 1-10,000) vs. year (1975-2000) for CRAY (n = 100, n = 1,000) and microprocessors (n = 100, n = 1,000); 1 GFLOP = 10^9 FLOPS. Right, parallel system performance: LINPACK GFLOPS (log scale 0.1-10,000) vs. year (1985-1996) for CRAY peak and MPP peak systems; 1 TeraFLOP = 10^12 FLOPS = 1000 GFLOPS.]

Page 20:

Computer System Peak FLOP Rating History

[Figure: computer system peak FLOP rating vs. year, passing the teraflop (10^12 FLOPS = 1000 GFLOPS) and the petaflop (10^15 FLOPS = 1000 TeraFLOPS, a quadrillion FLOPS), currently topped by the Cray XK7 - Titan.]

Current Top Peak FP Performance (since ~ Nov. 2012):
Now about 27,112,500 GFlops/s = 27,112.5 TeraFlops/s = 27.1 PetaFlops/s
Cray XK7 - Titan (@ Oak Ridge National Laboratory), 560,640 total processor cores:
299,008 Opteron cores (18,688 AMD Opteron 6274 16-core processors @ 2.2 GHz) +
261,632 GPU cores (18,688 Nvidia Tesla Kepler K20x GPUs @ 0.7 GHz)

Current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org

Page 21:

November 2005 TOP500 list
[Table image.]
Source (and for current list): www.top500.org

Page 22:

TOP500 Supercomputers
32nd List (November 2008): The Top 10
[Table image; power figures in KW.]
Source (and for current list): www.top500.org

Page 23:

TOP500 Supercomputers
34th List (November 2009): The Top 10
[Table image; power figures in KW.]
Source (and for current list): www.top500.org

Page 24:

TOP500 Supercomputers
36th List (November 2010): The Top 10
[Table image; power figures in KW.]
Source (and for current list): www.top500.org

Page 25:

TOP500 Supercomputers
38th List (November 2011): The Top 10
[Table image; power figures in KW.]
Source (and for current list): www.top500.org

Page 26:

TOP500 Supercomputers
Current List: 40th List (November 2012): The Top 10
[Table image; power figures in KW.]
Source (and for current list): www.top500.org

Current #1 Supercomputer:
Cray XK7 - Titan (@ Oak Ridge National Laboratory)
LINPACK Performance: 17.59 PetaFlops/s (quadrillion FLOPS per second)
Peak Performance: 27.1 PetaFlops/s
560,640 total processor cores:
299,008 Opteron cores (18,688 AMD Opteron 6274 16-core processors @ 2.2 GHz) +
261,632 GPU cores (18,688 Nvidia Tesla Kepler K20x GPUs @ 0.7 GHz)

Page 27:

The Goal of Parallel Processing

• Goal of applications in using parallel machines: maximize speedup over single-processor performance.

  Parallel Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time:

  Fixed-problem-size Parallel Speedup (p processors) = Time (1 processor) / Time (p processors)

• Ideal speedup = number of processors = p. Very hard to achieve, due to parallelization overheads: communication cost, dependencies, load imbalance, ...

Page 28:

The Goal of Parallel Processing

• The parallel processing goal is to maximize parallel speedup (for a fixed problem size):

  Speedup = Time(1) / Time(p)
  Speedup <= Sequential work (or time) on one processor / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

  where the Max in the denominator is taken over the processors (i.e. the processor with maximum execution time), and Synch Wait Time, Comm Cost and Extra Work are the parallelization overheads.

• Ideal Speedup = p = number of processors.
  – Very hard to achieve: implies no parallelization overheads and perfect load balance among all processors.
• Maximize parallel speedup by:
  – Balancing computations (and overheads) across processors: every processor does the same amount of work and incurs the same amount of overhead.
  – Minimizing communication cost and other overheads associated with each step of parallel program creation and execution.
• Performance Scalability: achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased.
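A short worked example (assumed numbers, not from the slides): suppose a program takes Time(1) = 100 seconds on one processor. On p = 8 processors the busiest processor does 14 seconds of computation plus 2 seconds of communication and 1 second of synchronization wait, so Time(8) = Max (Work + Synch Wait Time + Comm Cost + Extra Work) = 17 seconds and Speedup = 100/17, about 5.9, well below the ideal speedup of 8 because of the parallelization overheads and load imbalance.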

Page 29:

Elements of Parallel Computing

[Diagram of the elements of parallel computing: Computing Problems (the HPC driving force); Parallel Algorithms and Data Structures (dependency analysis, task dependency graphs); Parallel Programming and High-level Languages; Binding (compile, load) producing the Parallel Program; Mapping (assigning parallel computations/tasks to processors); Parallel Hardware Architecture (processing nodes/network); Operating System; Applications Software; and Performance Evaluation (e.g. parallel speedup).]

Page 30:

Elements of Parallel Computing

1. Computing Problems (driving force):
   – Numerical computing: science and engineering numerical problems demand intensive integer and floating-point computations.
   – Logical reasoning: artificial intelligence (AI) demands logic inferences, symbolic manipulations and large space searches.

2. Parallel Algorithms and Data Structures:
   – Special algorithms and data structures are needed to specify the computations and communication present in computing problems (from dependency analysis).
   – Most numerical algorithms are deterministic, using regular data structures.
   – Symbolic processing may use heuristics or non-deterministic searches.
   – Parallel algorithm development requires interdisciplinary interaction.

Page 31:

Elements of Parallel Computing

3. Hardware Resources:
   – Processors, memory, and peripheral devices (processing nodes) form the hardware core of a computer system (computing power).
   – Processor connectivity (system interconnects, network) and memory organization influence the system architecture (communication/connectivity).

4. Operating Systems:
   – Manage the allocation of resources to running processes.
   – Mapping to match algorithmic structures with hardware architecture, and vice versa: processor scheduling, memory mapping, interprocessor communication.

• Parallelism exploitation is possible at: 1- algorithm design, 2- program writing, 3- compilation, and 4- run time.

Page 32:

Elements of Parallel Computing

5. System Software Support:
   – Needed for the development of efficient programs in high-level languages (HLLs).
   – Assemblers, loaders.
   – Portable parallel programming languages/libraries.
   – User interfaces and tools.

6. Compiler Support (the two approaches to parallel programming, illustrated next):
   (a) Implicit Parallelism Approach:
       • Parallelizing compiler: can automatically detect parallelism in sequential source code and transform it into parallel constructs/code.
       • Source code written in conventional sequential languages.
   (b) Explicit Parallelism Approach:
       • The programmer explicitly specifies parallelism using:
         – a sequential compiler (conventional sequential HLL) and a low-level library of the target parallel computer, or
         – a concurrent (parallel) HLL.
       • Concurrency-preserving compiler: the compiler in this case preserves the parallelism explicitly specified by the programmer. It may perform some program flow analysis, dependence checking, and limited optimizations for parallelism detection.

Page 33:

Approaches to Parallel Programming

(a) Implicit Parallelism: the programmer writes source code in sequential languages (C, C++, FORTRAN, LISP, ...); a parallelizing compiler automatically detects the parallelism in the sequential source code and transforms it into parallel constructs/code; the resulting parallel object code is executed by the runtime system.

(b) Explicit Parallelism: the programmer writes source code in concurrent dialects of C, C++, FORTRAN, LISP, ..., explicitly specifying parallelism using parallel constructs; a concurrency-preserving compiler produces concurrent object code, executed by the runtime system.

Page 34:

Factors Affecting Parallel System Performance

• Parallel Algorithm Related:
  – Available concurrency (i.e. inherent parallelism) and profile, grain size, uniformity, patterns.
    • Dependencies between computations, represented by a dependency graph.
  – Type of parallelism present: functional and/or data parallelism.
  – Required communication/synchronization, uniformity and patterns.
  – Data size requirements.
  – Communication-to-computation ratio (C-to-C ratio, lower is better).
• Parallel Program Related:
  – Programming model used.
  – Resulting data/code memory requirements, locality and working set characteristics.
  – Parallel task grain size.
  – Assignment (mapping) of tasks to processors: dynamic or static.
  – Cost of communication/synchronization primitives.
• Hardware/Architecture Related:
  – Total CPU computational power available (+ number of processors, i.e. hardware parallelism).
  – Types of computation modes supported.
  – Shared address space vs. message passing.
  – Communication network characteristics (topology, bandwidth, latency).
  – Memory hierarchy properties.

Concurrency = Parallelism

Page 35:

A Simple Parallel Execution Example

[Figure: a task dependency graph of tasks A-G, the sequential execution of A-G on one processor (time steps 0-21), and a possible parallel execution schedule on two processors P0 and P1 (time steps 0-16), with communication and idle periods marked.]

Task = computation run on one processor.
Assume the computation time for each task A-G = 3.
Assume the communication time between parallel tasks = 1.
Assume communication can overlap with computation.
Speedup on two processors = T1/T2 = 21/16 = 1.3

What would the speedup be with 3 processors? 4 processors? 5 ... ?

Page 36:

Evolution of Computer Architecture

[Diagram: an evolution tree from scalar sequential (non-pipelined) machines, through lookahead and I/E overlap, to functional parallelism via multiple functional units and pipelining (limited pipelining; pipelined, single or multiple issue), then to implicit and explicit vector machines (vector/data parallel; memory-to-memory and register-to-register), and on to parallel machines: SIMD (processor arrays, associative processors; data parallel) and MIMD (multiprocessors with shared memory; multicomputers with message passing), including Massively Parallel Processors (MPPs) and computer clusters.]

I/E: Instruction Fetch and Execute
SIMD: Single Instruction stream over Multiple Data streams
MIMD: Multiple Instruction streams over Multiple Data streams

Page 37:

Parallel Architectures History

Historically, parallel architectures were tied to parallel programming models:
• Divergent architectures, with no predictable pattern of growth.

[Diagram: application software and system software sitting on divergent architectures: SIMD / data parallel architectures, message passing, shared memory, dataflow, and systolic arrays.]

More on this next lecture

Page 38:

Parallel Programming Models

• Programming methodology used in coding parallel applications.
• Specifies: 1- communication and 2- synchronization.
• Examples:
  – Multiprogramming, or multi-tasking (not true parallel processing!): no communication or synchronization at the program level; a number of independent programs run on different processors in the system. (However, a good way to utilize multi-core processors for the masses!)
  – Shared memory address space (SAS): parallel program threads or tasks communicate implicitly using a shared memory address space (shared data in memory).
  – Message passing: explicit point-to-point communication (via send/receive pairs) is used between parallel program tasks, using messages.
  – Data parallel: more regimented, global actions on data (i.e. the same operation over all elements of an array or vector). Can be implemented with shared address space or message passing.

Page 39:

Flynn's 1972 Classification (Taxonomy) of Computer Architecture

Classified according to the number of instruction streams (threads) and the number of data streams in the architecture:

(a) Single Instruction stream over a Single Data stream (SISD): conventional sequential machines or uniprocessors.
(b) Single Instruction stream over Multiple Data streams (SIMD): vector computers, arrays of synchronized processing elements (data parallel systems).
(c) Multiple Instruction streams and a Single Data stream (MISD): systolic arrays for pipelined execution.
(d) Multiple Instruction streams over Multiple Data streams (MIMD): parallel computers:
    • Shared memory multiprocessors (tightly coupled processors).
    • Multicomputers: unshared, distributed memory; message passing used instead (e.g. clusters) (loosely coupled processors).

Instruction Stream = Thread of Control or Hardware Context

Page 40:

Flynn's Classification of Computer Architecture (Taxonomy)

[Diagrams of the four classes, classified according to the number of instruction streams (threads) and the number of data streams in the architecture (CU = Control Unit, PE = Processing Element, M = Memory):
• SISD: Single Instruction stream over a Single Data stream; conventional sequential machines or uniprocessors.
• SIMD: Single Instruction stream over Multiple Data streams; vector computers, array of synchronized processing elements (shown here).
• MISD: Multiple Instruction streams and a Single Data stream; systolic arrays for pipelined execution.
• MIMD: Multiple Instruction streams over Multiple Data streams; parallel computers or multiprocessor systems (a distributed memory multiprocessor system is shown).]

Page 41:

Current Trends In Parallel Architectures

• The extension of "computer architecture" to support communication and cooperation:
  – OLD: Instruction Set Architecture (ISA) (conventional or sequential).
  – NEW: Communication Architecture.
• Defines:
  1- Critical abstractions, boundaries, and primitives (interfaces).
  2- Organizational structures that implement these interfaces (hardware or software).
• Compilers, libraries and the OS (i.e. software abstraction layers) are important bridges today.

More on this next lecture

Page 42:

Modern Parallel Architecture Layered Framework

[Diagram of software/hardware layers: parallel applications (CAD, database, scientific modeling) over programming models (multiprogramming, shared address, message passing, data parallel), over the communication abstraction (the user/system boundary, separating user space from system space), realized in software by compilation or library and operating systems support, and below the hardware/software boundary by the communication hardware (ISA) and the physical communication medium (hardware: processing nodes & interconnects).]

More on this next lecture

Page 43:

Shared Address Space (SAS) Parallel Architectures
(Sometimes called Tightly-Coupled Parallel Computers)

• Any processor can directly reference any memory location:
  – Communication occurs implicitly as a result of loads and stores (in the shared address space).
• Convenient:
  – Location transparency.
  – Similar programming model to time-sharing (i.e. multi-tasking) on uniprocessors, except that processes run on different processors.
  – Good throughput on multiprogrammed workloads.
• Naturally provided on a wide range of platforms:
  – Wide range of scale: few to hundreds of processors.
• Popularly known as shared memory machines or the shared memory model:
  – Ambiguous: memory may be physically distributed among processing nodes (i.e. distributed shared memory multiprocessors).

Page 44:

Shared Address Space (SAS) Parallel Programming Model

• Process: a virtual address space plus one or more threads of control.
• Portions of the address spaces of processes are shared.
• Writes to a shared address are visible to the other threads (in other processes too).
• Natural extension of the uniprocessor model:
  – Conventional memory operations (loads/stores) are used for communication; thus communication is implicit.
  – Special atomic operations are needed for synchronization (using locks, semaphores etc., i.e. for event ordering and mutual exclusion); thus synchronization is explicit.
• The OS uses shared memory to coordinate processes.

[Figure: the virtual address spaces of a collection of processes P0 ... Pn communicating via shared addresses; the shared portion of each address space maps to common physical addresses in the machine physical address space, while each process keeps a private portion (P0 private, P1 private, P2 private, ..., Pn private); a store by one process to a shared address is seen by a load in another.]

In SAS: communication is implicit via loads/stores; ordering/synchronization is explicit, using synchronization primitives.
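As a minimal sketch of this programming model in C with POSIX threads (the array size, thread count and work split are illustrative assumptions, not from the slides): ordinary loads and stores to the shared array and to the shared sum are the implicit communication, while the mutex provides the explicit synchronization.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static int data[N];               /* shared: visible to all threads (implicit communication) */
static long global_sum = 0;       /* shared result */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* explicit synchronization */

static void *worker(void *arg) {
    long id = (long)arg;
    long local = 0;
    /* each thread reads its portion of the shared array with ordinary loads */
    for (int i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        local += data[i];
    pthread_mutex_lock(&lock);    /* mutual exclusion for the shared update */
    global_sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < N; i++) data[i] = 1;   /* ordinary stores to shared memory */
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("sum = %ld\n", global_sum);         /* prints 1000 */
    return 0;
}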

Page 45:

Models of Shared-Memory Multiprocessors

1. The Uniform/Centralized Memory Access (UMA) Model:
   – All physical memory is shared by all processors.
   – All processors have equal access (i.e. equal memory bandwidth and access latency) to all memory addresses.
   – Also referred to as Symmetric Memory Processors (SMPs).
2. The Distributed-memory or Non-Uniform Memory Access (NUMA) Model:
   – Shared memory is physically distributed locally among processors. Access latency to remote memory is higher.
3. The Cache-Only Memory Architecture (COMA) Model:
   – A special case of a NUMA machine where all distributed main memory is converted to caches.
   – No memory hierarchy at each processor.

Page 46:

Models of Shared-Memory Multiprocessors

[Diagrams (P: processor, M or Mem: memory, C: cache, D: cache directory):
1. Uniform Memory Access (UMA) model, or Symmetric Memory Processors (SMPs): processors and shared memories connected by an interconnect (bus, crossbar, or multistage network), with I/O controllers and I/O devices.
2. Distributed-memory or Non-Uniform Memory Access (NUMA) model: each node contains a processor, cache and local memory, and the nodes are connected by a network.
3. Cache-Only Memory Architecture (COMA): each node contains a processor, cache and cache directory, connected by a network.]

Page 47:

Uniform Memory Access (UMA) Example: Intel Pentium Pro Quad (Circa 1997)
Bus-Based Symmetric Memory Processors (SMPs): 4-way SMP

[Figure: four P-Pro modules (CPU, 256-KB L2 cache, interrupt controller, bus interface) on a shared P-Pro front side bus (64-bit data, 36-bit address, 66 MHz), with a memory controller and MIU to 1-, 2-, or 4-way interleaved DRAM, and PCI bridges to PCI buses and I/O cards.]

• All coherence and multiprocessing glue is in the processor module.
• Highly integrated, targeted at high volume.
• Computing node used in Intel's ASCI-Red MPP.

A single Front Side Bus (FSB) is shared among the processors; this severely limits scalability to only ~ 2-4 processors.

Page 48:

Non-Uniform Memory Access (NUMA) Example: 8-Socket AMD Opteron Node (Circa 2003)

Dedicated point-to-point interconnects (HyperTransport links) are used to connect the processors, alleviating the traditional limitations of FSB-based SMP systems. Each processor has two integrated DDR memory channel controllers: memory bandwidth scales up with the number of processors. This is a NUMA architecture, since a processor can access its own memory at a lower latency than remote memory directly connected to other processors in the system.

Total of 32 processor cores when quad-core Opteron processors are used (128 cores with 16-core processors).

Page 49:

Non-Uniform Memory Access (NUMA) Example: 4-Socket Intel Nehalem-EX Node

Page 50:

Distributed Shared-Memory Multiprocessor System Example: Cray T3E
(NUMA MPP example, circa 1995-1999; MPP = Massively Parallel Processor System)

[Figure: each node contains a processor (P) with cache ($), memory (Mem), and a memory controller and network interface (the communication assist, CA), connected through a switch to a custom 3D torus point-to-point network (X, Y, Z directions) with external I/O.]

• Scales up to 2048 processors; DEC Alpha EV6 microprocessor (COTS).
• Custom 3D torus point-to-point network, 480 MB/s links.
• The memory controller generates communication requests for non-local references.
• No hardware mechanism for coherence (SGI Origin etc. provide this).

Example of Non-Uniform Memory Access (NUMA).
More recent Cray MPP example: the Cray X1E supercomputer.

Page 51:

Message-Passing Multicomputers
(Also called Loosely-Coupled Parallel Computers)

• Comprised of multiple autonomous computers (computing nodes) connected via a suitable network (an industry-standard System Area Network, SAN, or a proprietary network).
• Each node consists of one or more processors, local memory, attached storage and I/O peripherals, and a Communication Assist (CA).
• Local memory is only accessible by the local processors in a node (no shared memory among nodes).
• Inter-node communication is carried out explicitly by message passing through the connection network via send/receive operations; thus communication is explicit.
• Process communication is achieved using a message-passing programming environment (e.g. PVM, MPI; portable, platform-independent).
  – The programming model is more removed or abstracted from basic hardware operations.
• Include:
  1- A number of commercial Massively Parallel Processor systems (MPPs).
  2- Computer clusters that utilize commodity off-the-shelf (COTS) components.

Page 52:

Message-Passing Abstraction

• Send specifies the buffer to be transmitted and the receiving process.
• Receive specifies the sending process and the application storage to receive into.
• Memory-to-memory copy is possible, but processes need to be named.
• Optional tag on the send and matching rule on the receive.
• The user process names local data and entities in process/tag space too.
• In the simplest form, the send/receive match achieves an implicit pairwise synchronization event:
  – Ordering of computations according to data dependencies (i.e. event ordering, in this case).
• Many possible overheads: copying, buffer management, protection, ...

[Figure: process P (sender) executes Send (X, Q, t), naming its local data at address X, the recipient Q, and a tag t; process Q (recipient) executes Receive (Y, P, t), naming its local address Y, the sender P, and the tag t; the match transfers the data between the two local process address spaces and provides pairwise synchronization (with a blocking receive, the recipient blocks/waits until the message is received).]

Communication is explicit via sends/receives; synchronization is implicit.
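A minimal sketch of this abstraction using MPI in C (the tag value, message size and rank assignments are arbitrary choices for illustration, not from the slides): process 0 plays the role of sender P and process 1 the role of recipient Q; the send names the local buffer X, the destination and a tag, and the blocking receive names the local buffer Y, the source and the matching tag.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, tag = 99;
    double X[4] = {1.0, 2.0, 3.0, 4.0};   /* local data in the sender's address space */
    double Y[4];                          /* local storage in the recipient's address space */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* Send X, Q, t : explicit communication to process 1 with tag 99 */
        MPI_Send(X, 4, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive Y, P, t : blocks until a matching message from process 0 arrives */
        MPI_Recv(Y, 4, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %.1f %.1f %.1f %.1f\n", Y[0], Y[1], Y[2], Y[3]);
    }
    MPI_Finalize();
    return 0;
}

The send/receive match (same source, destination and tag) transfers the data and, because the receive blocks, also orders the two processes' computations, which is the implicit pairwise synchronization described above.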

Page 53:

Message-Passing Example: Intel Paragon (Circa 1993)

[Figure: each Intel Paragon node is a 2-way SMP: two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, a memory controller to 4-way interleaved DRAM, and a communication assist (CA: NI, DMA, driver); the nodes attach to an 8-bit, 175 MHz, bidirectional 2D grid point-to-point network with a processing node attached to every switch. Shown: Sandia's Intel Paragon XP/S-based supercomputer.]

Page 54:

Message-Passing Example: IBM SP-2 (MPP, Circa 1994-1998)

[Figure: each IBM SP-2 node is essentially a complete RS6000 workstation: a Power 2 CPU with L2 cache and a memory controller to 4-way interleaved DRAM on the memory bus, plus a MicroChannel I/O bus holding I/O, DRAM, and the communication assist (NIC with an i860, NI and DMA); nodes connect through a general interconnection (multi-stage) network formed from 8-port switches.]

• Made out of essentially complete RS6000 workstations.
• Network interface integrated in the I/O bus (bandwidth limited by the I/O bus).

MPP = Massively Parallel Processor System

Page 55:

Message-Passing MPP Example: IBM Blue Gene/L (Circa 2005)

[Figure: system packaging hierarchy: chip (2 processors; 2.8/5.6 GF/s, 4 MB) -> compute card (2 chips, 2x1x1; 5.6/11.2 GF/s, 0.5 GB DDR) -> node board (32 chips, 4x4x2, 16 compute cards; 90/180 GF/s, 8 GB DDR) -> cabinet (32 node boards, 8x8x16; 2.9/5.7 TF/s, 256 GB DDR) -> system (64 cabinets, 64x32x32; 180/360 TF/s, 16 TB DDR).]

(2 processors/chip) x (2 chips/compute card) x (16 compute cards/node board) x (32 node boards/cabinet) x (64 cabinets) = 128k = 131,072 (0.7 GHz PowerPC 440) processors (64k nodes); 2.8 GFLOPS peak per processor core.

LINPACK Performance: 280,600 GFLOPS = 280.6 TeraFLOPS = 0.2806 PetaFLOPS
Top Peak FP Performance: now about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 PetaFLOPS

System location: Lawrence Livermore National Laboratory.
Networks: a 3D torus point-to-point network and a global tree 3D point-to-point network (both proprietary).
Design goals: high computational power efficiency, high computational density per volume.

Page 56:

Message-Passing Programming Tools

• Message-passing programming environments include:
  – Message Passing Interface (MPI):
    • Provides a standard for writing concurrent message-passing programs.
    • MPI implementations include parallel libraries used by existing programming languages (C, C++).
  – Parallel Virtual Machine (PVM):
    • Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
    • PVM support software executes on each machine in a user-configurable pool and provides a computational environment of concurrent applications.
    • User programs written, for example, in C, Fortran or Java are provided access to PVM through calls to PVM library routines.

Both MPI and PVM are portable (platform-independent) and allow the user to explicitly specify parallelism; both are examples of the explicit parallelism approach to parallel programming.
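As a practical note (assuming a typical MPI installation), the MPI sketch given after the Message-Passing Abstraction slide (Page 52) would be compiled and launched with the MPI compiler wrapper and launcher, for example: mpicc prog.c -o prog, then mpirun -np 2 ./prog (command names can vary slightly across MPI implementations; mpiexec is a common alternative launcher).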

Page 57: EECC756 - Shaaban #1 lec # 1 Spring 2013 3-5-2013 Introduction to Parallel Processing Parallel Computer Architecture: Definition & Broad issues involvedParallel

EECC756 - ShaabanEECC756 - Shaaban#57 lec # 1 Spring 2013 3-5-2013

Data Parallel Systems (SIMD in Flynn's taxonomy)

• Programming model (data parallel):
  – Similar operations performed in parallel on each element of a data structure
  – Logically a single thread of control, performing sequential or parallel steps
  – Conceptually, a processor is associated with each data element
• Architectural model:
  – Array of many simple processors, each with little memory
    • Processors do not sequence through instructions themselves
  – Attached to a control processor that issues the instructions
  – Specialized and general communication, global synchronization
• Example machines:
  – Thinking Machines CM-1, CM-2 (and CM-5)
  – Maspar MP-1 and MP-2

[Figure: 2-D grid of PEs (PE = Processing Element) driven by a control processor; all PEs are synchronized — same instruction or operation in a given cycle.]

Other data parallel architectures: vector machines
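A minimal sketch of what the data-parallel model expresses, written in plain C (the function name is hypothetical): one logical operation, conceptually applied to every element at once. On a SIMD array each element would map to one PE performing the add in the same cycle; on a sequential machine the same step is simply a loop.

    #include <stddef.h>

    /* c = a + b, expressed element-wise: the programmer's view is a single
     * "add" step over the whole data structure; on a data parallel machine
     * each element (and its PE) performs the operation in the same cycle.  */
    void vector_add(const float *a, const float *b, float *c, size_t n)
    {
        for (size_t i = 0; i < n; i++)   /* conceptually: one PE per element */
            c[i] = a[i] + b[i];
    }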


Dataflow Architectures
• Represent computation as a graph of essential data dependencies
  – Non-von Neumann architecture (not PC-based, i.e., not driven by a program counter)
  – Logical processor at each node, activated by availability of operands
  – Messages (tokens) carrying the tag of the next instruction are sent to the next processor
  – Tag compared with others in the matching store; a match fires execution

[Figure: dataflow graph for the example program a = (b + 1) × (b − c); d = c × e; f = a × d — the dependency graph for an entire computation (program) — and one node of the machine: token queue → waiting/matching (token store) → instruction fetch (program store) → execute → form token → token distribution network.]

Tokens = copies of computation results (i.e., data or results)

Research dataflow machine prototypes include:
• The MIT Tagged-Token Architecture
• The Manchester Dataflow Machine

The Tomasulo approach for dynamic instruction execution utilizes a dataflow-driven execution engine:
• The data dependency graph for a small window of instructions is constructed dynamically as instructions are issued in program order.
• The execution of an issued instruction is triggered by the availability of its operands (the data it needs) over the CDB.

(A small sketch of the firing rule on the example program appears below.)
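A rough illustration of the firing rule using the slide's example program, with plain C standing in for the dataflow graph (input values are arbitrary): each operation "fires" as soon as all of its input tokens exist, not in textual order, so the three operations in the first wave below could execute concurrently on three different nodes.

    #include <stdio.h>

    int main(void)
    {
        double b = 2.0, c = 3.0, e = 4.0;   /* initial input tokens (arbitrary) */

        /* Wave 1: operands already available -- these three nodes can fire
         * in parallel, since none depends on another node's result.          */
        double t1 = b + 1.0;                /* node: b + 1                     */
        double t2 = b - c;                  /* node: b - c                     */
        double d  = c * e;                  /* node: d = c * e                 */

        /* Wave 2: fires only once tokens t1 and t2 have both arrived.        */
        double a = t1 * t2;                 /* node: a = (b + 1) * (b - c)     */

        /* Wave 3: fires last, when tokens a and d are both available.        */
        double f = a * d;                   /* node: f = a * d                 */

        printf("a = %g, d = %g, f = %g\n", a, d, f);
        return 0;
    }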


Speculative Execution + Tomasulo's Algorithm (Speculative Tomasulo Processor)

[Figure: speculative Tomasulo processor — instructions issue in order from the instruction queue (IQ); results are stored in an in-order FIFO buffer, usually implemented as a circular buffer, and instructions commit (retire) in order from its head ("next to commit").]

The Tomasulo approach for dynamic instruction execution utilizes a dataflow-driven execution engine:
• The data dependency graph for a small window of instructions is constructed dynamically as instructions are issued in program order.
• The execution of an issued instruction is triggered by the availability of its operands (the data it needs) over the CDB.

From 551 Lecture 6


Systolic Architectures

• Replace a single processor with an array of regular processing elements
• Orchestrate data flow for high throughput with less memory access
• Different from linear pipelining:
  – Nonlinear array structure, multidirectional data flow; each PE may have (small) local instruction and data memory
• Different from SIMD: each PE may do something different
• Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs)
• Represent algorithms directly by chips connected in a regular pattern

[Figure: a conventional organization (memory M feeding a single PE) versus a systolic organization (memory M feeding a chain of PEs); PE = Processing Element, M = Memory.]

A possible example of MISD (Multiple Instruction streams, Single Data stream) in Flynn's classification of computer architecture.


Systolic Array Example: 3x3 Systolic Array Matrix Multiplication (C = A × B)

• Processors arranged in a 2-D grid; PE(i,j) accumulates one element of the product, C(i,j)
• Alignments in time: rows of A enter from the left (row i as a(i,0), a(i,1), a(i,2), delayed by i cycles); columns of B enter from the top (column j as b(0,j), b(1,j), b(2,j), delayed by j cycles)

T = 0: the staggered input streams are aligned at the edges of the array; no products have been computed yet.

Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/


T = 1: a0,0 and b0,0 reach PE(0,0), which computes the first partial product a0,0*b0,0 (the first term of C00); all other elements advance one step in their staggered streams.


T = 2: PE(0,0) receives a0,1 and b1,0 and accumulates a0,0*b0,0 + a0,1*b1,0; meanwhile a0,0 has moved right and b0,0 has moved down, so PE(0,1) computes a0,0*b0,1 and PE(1,0) computes a1,0*b0,0.


T = 3: PE(0,0) adds a0,2*b2,0, completing C00 = a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0. PE(0,1) now holds a0,0*b0,1 + a0,1*b1,1 and PE(1,0) holds a1,0*b0,0 + a1,1*b1,0, while the next anti-diagonal of PEs gets its first terms: PE(0,2) = a0,0*b0,2, PE(1,1) = a1,0*b0,1, PE(2,0) = a2,0*b0,0.


T = 4: PE(0,1) completes C01 = a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 and PE(1,0) completes C10 = a1,0*b0,0 + a1,1*b1,0 + a1,2*b2,0. PE(0,2), PE(1,1), and PE(2,0) each hold two terms; PE(1,2) and PE(2,1) get their first terms, a1,0*b0,2 and a2,0*b0,1.


T = 5: PE(0,2) completes C02 = a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2, PE(1,1) completes C11 = a1,0*b0,1 + a1,1*b1,1 + a1,2*b2,1, and PE(2,0) completes C20 = a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0. PE(1,2) and PE(2,1) each hold two terms; PE(2,2) gets its first term, a2,0*b0,2.


T = 6: PE(1,2) completes C12 = a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2 and PE(2,1) completes C21 = a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1; PE(2,2) now holds a2,0*b0,2 + a2,1*b1,2.


T = 7: PE(2,2) completes C22 = a2,0*b0,2 + a2,1*b1,2 + a2,2*b2,2 — done; all nine elements of C have now been computed.

On one processor this takes O(n^3) = 27 multiply-add steps (for n = 3); the systolic array finishes in t = 7 steps, so speedup = 27/7 ≈ 3.86.
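A small, self-contained C sketch of the same 3x3 systolic schedule (the matrix values, sentinel, and variable names are just for illustration, not from the slides): each PE multiply-accumulates whatever a and b values reach it, then forwards a to the right and b downward; the skewed input streams reproduce the 7-step timeline walked through above.

    #include <stdio.h>

    #define N 3
    #define NONE -1.0e30   /* sentinel meaning "no value on this wire yet" */

    int main(void) {
        double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};   /* arbitrary inputs  */
        double B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
        double C[N][N] = {0};                  /* accumulator in PE(i,j)   */
        double a_reg[N][N], b_reg[N][N];       /* values each PE forwards  */
        double a_prev[N][N], b_prev[N][N];

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a_reg[i][j] = b_reg[i][j] = NONE;

        for (int t = 1; t <= 3 * N - 2; t++) {         /* 7 steps for N = 3 */
            /* snapshot last step's outputs so all PEs update in lockstep   */
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    a_prev[i][j] = a_reg[i][j];
                    b_prev[i][j] = b_reg[i][j];
                }
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                    /* a enters row i from the left, skewed by i cycles     */
                    double a_in = (j == 0)
                        ? ((t - 1 - i >= 0 && t - 1 - i < N) ? A[i][t - 1 - i] : NONE)
                        : a_prev[i][j - 1];
                    /* b enters column j from the top, skewed by j cycles   */
                    double b_in = (i == 0)
                        ? ((t - 1 - j >= 0 && t - 1 - j < N) ? B[t - 1 - j][j] : NONE)
                        : b_prev[i - 1][j];
                    if (a_in != NONE && b_in != NONE)
                        C[i][j] += a_in * b_in;        /* multiply-accumulate */
                    a_reg[i][j] = a_in;                /* pass a to the right */
                    b_reg[i][j] = b_in;                /* pass b downward     */
                }
        }

        for (int i = 0; i < N; i++) {                  /* C should equal A*B  */
            for (int j = 0; j < N; j++)
                printf("%6.0f ", C[i][j]);
            printf("\n");
        }
        return 0;
    }

Running it prints the product A*B after 3n - 2 = 7 synchronized steps, matching the slide-by-slide trace: C00 completes at T = 3, the next anti-diagonal at T = 4 and T = 5, and C22 last at T = 7.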