CS 5234 – Spring 2013 Advanced Parallel Computing

Architecture

Yong Cao

- Sequential Machine and Von-Neumann Model
- Parallel Hardware
  - Distributed vs Shared Memory
- Architecture Classes
  - Multiple-core
  - Many-core (massively parallel)
- NVIDIA GPU Architecture

Von-Neumann Machine (VN)

- PC: Program counter
- MAR: Memory address register
- MDR: Memory data register
- IR: Instruction register
- ALU: Arithmetic Logic Unit
- Acc: Accumulator

[Figure: Von-Neumann machine diagram with Acc, MDR, the OP and ADDRESS fields, Decoder, and MEMORY]

Sequential Execution and Instruction Cycle

- The six phases of the instruction cycle:
  - Fetch
  - Decode
  - Evaluate Address
  - Fetch Operands
  - Execute
  - Store Result

- Fetch
  - MAR ← PC
  - MDR ← MEM[MAR]
  - IR ← MDR

- Decode
  - DECODER ← IR.OP

- Evaluate Address
  - MAR ← IR.ADDR
  - MDR ← MEM[MAR] (this load is the Fetch Operands phase)

- Execute
  - Acc ← Acc + MDR

- Store Result
  - MDR ← Acc
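The register transfers of the instruction cycle can be traced in a toy simulator. This is a minimal sketch, not the machine from the slides: the opcode names (LOAD, ADD, STORE, HALT), the tuple encoding of instructions, the PC increment after fetch, and the final write-back MEM[MAR] ← MDR are all assumptions added for illustration.

```python
# Toy accumulator machine. Opcode names and the (op, address) encoding
# are hypothetical; only the register names come from the slides.
LOAD, ADD, STORE, HALT = "LOAD", "ADD", "STORE", "HALT"


def run(memory):
    """Run the program stored in `memory` and return the final accumulator."""
    pc, acc = 0, 0
    while True:
        # Fetch: MAR <- PC, MDR <- MEM[MAR], IR <- MDR
        mar = pc
        mdr = memory[mar]
        ir = mdr
        pc += 1  # assumption: PC advances right after the fetch

        # Decode: DECODER <- IR.OP
        op, addr = ir
        if op == HALT:
            return acc

        # Evaluate address: MAR <- IR.ADDR
        mar = addr

        if op in (LOAD, ADD):
            # Fetch operands: MDR <- MEM[MAR]
            mdr = memory[mar]
            # Execute: Acc <- MDR (LOAD) or Acc <- Acc + MDR (ADD)
            acc = mdr if op == LOAD else acc + mdr
        elif op == STORE:
            # Store result: MDR <- Acc, then (assumed) MEM[MAR] <- MDR
            mdr = acc
            memory[mar] = mdr
        else:
            raise ValueError(f"unknown opcode: {op!r}")


# Instructions live in cells 0-3, data in cells 5-7 (same memory for both).
program = [(LOAD, 5), (ADD, 6), (STORE, 7), (HALT, 0), 0, 2, 3, 0]
print(run(program), program[7])  # 5 5
```

Instructions and data share one memory here, which is the defining property of the Von-Neumann model.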

- Register File

[Figure: machine diagram with PC, Register File, the OP and ADDRESS fields, Decoder, and MEMORY]

Parallel Hardware

- Shared vs Distributed Memory

[Figure: shared memory, with one MEMORY and multiple Register Files]

- Multi-Core and Many-Core Architecture

Parallel Hardware

- Shared vs Distributed Memory

[Figure: distributed memory, with multiple MEMORY blocks and Register Files]

- Cluster Computing, Grid Computing, Cloud Computing
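The difference between the two memory organizations can be sketched in a few lines. This is an illustration under stated assumptions: Python threads stand in for processors, a `queue.Queue` stands in for the interconnect of a distributed-memory machine, and the function names are hypothetical. On a real cluster the messages would cross a network.

```python
import queue
import threading


def shared_memory_sum(data, num_workers=4):
    """All workers read the same `data` and update one shared total."""
    total = [0]
    lock = threading.Lock()  # shared state needs synchronization

    def worker(k):
        partial = sum(data[k::num_workers])
        with lock:
            total[0] += partial

    threads = [threading.Thread(target=worker, args=(k,))
               for k in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


def distributed_memory_sum(data, num_workers=4):
    """Each worker owns a private copy of its slice and sends one message."""
    mailbox = queue.Queue()  # stand-in for the network

    def worker(local_data):
        mailbox.put(sum(local_data))  # communicate by message, not by sharing

    threads = [threading.Thread(target=worker,
                                args=(list(data[k::num_workers]),))
               for k in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in range(num_workers))


values = list(range(100))
print(shared_memory_sum(values), distributed_memory_sum(values))  # 4950 4950
```

The shared version needs a lock because every worker touches the same total; the distributed version needs no lock but pays for explicit communication.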

Multi-Core vs Many-Core

- Definition of Core: Independent ALU
  - How about a vector processor?
  - SIMD: e.g., Intel's SSE.
- How many is "many"?
  - What if there are too "many" cores in the multi-core design?
  - Shared control logic (PC, IR, Schedule)

Multi-Core

- Each core has its own control (PC and IR)

Many-Core

- A group of cores shares the control (PC, IR and Thread Scheduling)
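The two control schemes can be contrasted with a small model. This is only a sketch: instructions are modeled as plain Python functions, the function names are hypothetical, and the loops run sequentially, so it shows who owns the program counter rather than any real speedup.

```python
def run_multi_core(programs, inputs):
    """Each core has a private PC and may run a different program."""
    results = []
    for program, x in zip(programs, inputs):
        pc = 0  # private program counter for this core
        while pc < len(program):
            x = program[pc](x)
            pc += 1
        results.append(x)
    return results


def run_many_core(program, inputs):
    """One PC drives a whole group of lanes: same instruction, different data."""
    lanes = list(inputs)
    pc = 0  # single program counter shared by the group
    while pc < len(program):
        instruction = program[pc]
        lanes = [instruction(x) for x in lanes]  # every lane executes it
        pc += 1
    return lanes


def double(x):
    return x * 2


def increment(x):
    return x + 1


print(run_multi_core([[double, increment], [increment]], [3, 10]))  # [7, 11]
print(run_many_core([double, increment], [1, 2, 3, 4]))             # [3, 5, 7, 9]
```

Sharing the control logic is what lets a many-core design spend its transistor budget on ALUs, at the cost of every lane in a group following the same instruction stream.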

NVIDIA Fermi Architecture

- 16 Streaming Multiprocessors (SMs)
- 32 cores per SM

Fermi SM

Execution in an SM

A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units. This figure shows how instructions are issued to the execution blocks.
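The numbers quoted for Fermi can be checked with a little arithmetic. The sketch below assumes a warp of 32 threads (a value not stated in these slides) and a simplified model in which each unit in an execution block handles one thread per pass; real issue timing is more involved.

```python
import math

SMS_PER_GPU = 16   # from the slides
CORES_PER_SM = 32  # from the slides
WARP_SIZE = 32     # assumption: threads per warp, not stated in the slides

# Execution blocks inside one SM, as listed in the text above.
EXECUTION_BLOCKS = {
    "core block 0": 16,
    "core block 1": 16,
    "special function units": 4,
    "load/store units": 16,
}


def total_cores():
    """Cores on the whole chip."""
    return SMS_PER_GPU * CORES_PER_SM


def passes_per_warp(units):
    """Passes needed to push one warp through a block with `units` lanes."""
    return math.ceil(WARP_SIZE / units)


print(total_cores())  # 512
for name, units in EXECUTION_BLOCKS.items():
    print(f"{name}: {units} units, {passes_per_warp(units)} passes per warp")
```

Under this model, two 16-core blocks together account for the 32 instructions dispatched per cycle.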

Data Parallel

- Data Parallel vs Task Parallel
  - What to partition? Data or Task?
- Massive Data Parallel
  - Millions (or more) of threads
  - Same instruction, different data elements
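The two ways of partitioning work can be sketched with Python's standard `concurrent.futures` module. The function names are hypothetical, and Python threads are used only to show what gets partitioned; they do not give real parallel speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def data_parallel(values, workers=4):
    """Partition the data: every worker applies the same function."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


def task_parallel(values):
    """Partition the tasks: different functions run side by side."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {
            "sum": pool.submit(sum, values),
            "min": pool.submit(min, values),
            "max": pool.submit(max, values),
        }
        return {name: future.result() for name, future in futures.items()}


print(data_parallel([1, 2, 3, 4]))  # [1, 4, 9, 16]
print(task_parallel([3, 1, 2]))     # {'sum': 6, 'min': 1, 'max': 3}
```

The data-parallel version scales with the number of elements, which is why it suits millions of threads; the task-parallel version is limited by the number of distinct tasks.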

Computing on GPUs

- Stream processing and Vectorization (SIMD)

[Figure: Instructions applied to an Input Stream to produce an Output Stream]

GPU Programming Model: Stream

- Stream Programming Model
  - Streams:
    - An array of data units
  - Kernels:
    - Take streams as input, produce streams as output
    - Perform computation on streams
    - Kernels can be linked together

[Figure: Stream and Kernel diagram]
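The stream model maps naturally onto generators. In this minimal sketch a stream is any iterable of data units and a kernel is a generator function; the kernel names and the constants are made up for illustration.

```python
def scale_kernel(stream, factor):
    """Kernel: consumes a stream and produces a scaled stream."""
    for element in stream:
        yield element * factor


def offset_kernel(stream, delta):
    """Kernel: consumes a stream and produces a shifted stream."""
    for element in stream:
        yield element + delta


def pipeline(input_stream):
    """Two kernels linked together: the output of one feeds the next."""
    return offset_kernel(scale_kernel(input_stream, 2.0), 1.0)


output_stream = list(pipeline([0.0, 1.0, 2.0]))
print(output_stream)  # [1.0, 3.0, 5.0]
```

Each element flows through both kernels independently of its neighbors, which is the property that lets a GPU process many elements at once.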

Why Streams?

- Ample computation by exposing parallelism
  - Streams expose data parallelism
    - Multiple stream elements can be processed in parallel
  - Pipeline (task) parallelism
    - Multiple tasks can be processed in parallel
- Efficient communication
  - Producer-consumer locality
  - Predictable memory access pattern
- Optimize for throughput of all elements, not latency of one
  - Processing many elements at once allows latency hiding
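Latency hiding can be illustrated with an idealized cost model. The model and the cycle counts below are assumptions chosen for illustration, not measurements: each element needs one memory access of `latency` cycles followed by `compute` cycles of work, and in the overlapped case enough elements are in flight that memory waits overlap with computation.

```python
def serial_time(n, latency, compute):
    """Each element waits for its own memory access before computing."""
    return n * (latency + compute)


def overlapped_time(n, latency, compute):
    """Idealized: only the first memory access is exposed, the rest overlap."""
    return latency + n * compute


n, latency, compute = 1000, 400, 4  # hypothetical cycle counts
print(serial_time(n, latency, compute))      # 404000
print(overlapped_time(n, latency, compute))  # 4400
```

The latency of a single element is unchanged (404 cycles in both cases); only the throughput over all elements improves.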
