CS718 : Data Parallel Processors


Page 1: CS718 : Data Parallel Processors

Anshul Kumar, CSE IITD

CS718 : Data Parallel Processors

27th April, 2006

Page 2: CS718 : Data Parallel Processors

Data Parallel Architectures

• SIMD Processors
– Multiple processing elements driven by a single instruction stream

• Associative Processors
– SIMD-like processors with associative memory

• Vector Processors
– Uni-processors with vector instructions

• Systolic Arrays
– Application-specific VLSI structures

Page 3: CS718 : Data Parallel Processors

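SIMD

[Figure: SIMD model. A control unit (C) sends one instruction stream (IS) to the processors (P); each processor has its own data stream (DS) to memory (M).]

One of the earliest models of a parallel computer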

Page 4: CS718 : Data Parallel Processors

ILLIAC IV SIMD Model

[Figure: ILLIAC IV SIMD model. A control unit (CU) drives processing elements PE1, PE2, ..., PEn, each a processor (P) paired with its own memory (M). The PEs are linked by an interconnection network; I/O is attached over a bus.]

Planned for 64 x 4 PEs; only 64 were built

Page 5: CS718 : Data Parallel Processors

Burroughs Scientific Processor (BSP) Model

[Figure: BSP model. A control unit (CU) drives processors P1, P2, ..., Pn, which reach memory modules M1, M2, ..., Mk through an interconnection network; I/O is attached over a bus.]

Page 6: CS718 : Data Parallel Processors

SIMD algorithms: sum of vector elements

a0 a1 a2 a3 a4 a5 a6 a7

step 1: a0+a1   a2+a3   a4+a5   a6+a7

step 2: a0+a1+a2+a3   a4+a5+a6+a7

step 3: a0+a1+a2+a3+a4+a5+a6+a7

step 1: S_i = a_i + a_{i+1},  i = 0, 2, 4, 6

step 2: S_i = S_i + S_{i+2},  i = 0, 4

step 3: S_i = S_i + S_{i+4},  i = 0

OR

step 1: S_i = a_i + a_{i+4},  i = 0, 1, 2, 3

step 2: S_i = S_i + S_{i+2},  i = 0, 1

step 3: S_i = S_i + S_{i+1},  i = 0
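The two three-step schedules on this slide can be sketched in Python as follows. This is only an illustration, not code from the course: the function names are made up, the vector length is assumed to be a non-zero power of two, and each inner `for` loop stands in for one SIMD step in which all the additions would run at the same time on separate processing elements.

```python
def tree_sum_adjacent(a):
    """First schedule: add neighbours at stride 1, then 2, then 4, ..."""
    s = list(a)
    n = len(s)
    stride, steps = 1, 0
    while stride < n:
        # One SIMD step: every i below is handled by a different PE.
        for i in range(0, n - stride, 2 * stride):
            s[i] = s[i] + s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def tree_sum_halving(a):
    """Second schedule: fold the upper half onto the lower half each step."""
    s = list(a)
    half, steps = len(s) // 2, 0
    while half >= 1:
        # One SIMD step: PEs 0 .. half-1 are active.
        for i in range(half):
            s[i] = s[i] + s[i + half]
        half //= 2
        steps += 1
    return s[0], steps


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(tree_sum_adjacent(data))
    print(tree_sum_halving(data))
```

Both functions return the total together with the number of parallel steps, which is log2(n) for a power-of-two n. The schedules differ only in which elements are paired, which matters for how data must move over the interconnection network.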

Page 7: CS718 : Data Parallel Processors

No. of processors vs time

Adding vector elements:
– n processors: log n steps
– n/log n processors: log n steps

Matrix multiplication:
– n processors: n^2 steps
– n^2 processors: n steps
– n^3 processors: log n steps
– n^3/log n processors: log n steps

Important factors: data distribution, network
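One way to see why n/log n processors can still add a vector in a number of steps proportional to log n is the two-phase schedule sketched below. It is an assumption about the intended scheme, not something the slide spells out: each processor first adds a block of about log n elements serially, and the partial sums are then combined with a tree. The function name and the step accounting are illustrative.

```python
import math


def blocked_sum(a):
    """Sum a vector using about n / log2(n) simulated processors.

    Returns (total, processors, parallel_steps).
    """
    a = list(a)
    n = len(a)
    log_n = max(1, int(math.log2(n)))
    p = max(1, n // log_n)            # number of processors
    chunk = math.ceil(n / p)          # elements owned by each processor

    # Phase 1: every processor adds its own block serially.
    # All processors work at once, so this costs chunk - 1 steps.
    partial = [sum(a[k * chunk:(k + 1) * chunk]) for k in range(p)]
    local_steps = chunk - 1

    # Phase 2: tree reduction over the p partial sums.
    stride, tree_steps = 1, 0
    while stride < p:
        for i in range(0, p - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        tree_steps += 1

    return partial[0], p, local_steps + tree_steps
```

For n = 16 this uses 4 processors and 3 + 2 = 5 steps, compared with 4 steps on 16 processors: the step count stays within a constant factor of log n while far fewer processors are needed.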

Page 8: CS718 : Data Parallel Processors

Rise and fall of SIMDs

• Introduced in the 1960s (e.g. Illiac, BSP)

• Problems:
– not cost effective
– serial fraction and Amdahl's law
– I/O bottleneck

• Overshadowed by Vector Processors

• Resurrected in the 1980s (MPP from Goodyear, Connection Machine from Thinking Machines Corp., MP-1 from MasPar)

• Did not survive because of high cost

Page 9: CS718 : Data Parallel Processors

Related ideas

• Coarse-grain SIMD with off-the-shelf processors (synchronized MIMD), e.g. CM-5 of Thinking Machines

• This gave rise to SPMD (single program, multiple data)

• MMX and SIMD instructions in Pentium

Page 10: CS718 : Data Parallel Processors

Vector Processors

[Figure: vector processor organization. Blocks shown: memory, memory control, I-cache, D-cache, I-unit and control, vector registers (V-reg), GPRs, address unit, two vector functional units (VFU), a functional unit (FU), and the buses connecting them.]

Page 11: CS718 : Data Parallel Processors

Four Generations of CRAY systems (vector processors)

System   CPUs   Clock (MHz)   Flops/clock/CPU   Words moved/clk/CPU   Mflops   Gates/chip
CRAY-1      1            80                 2                     1       80            2
X-MP        4           105                 2                     3      840           16
Y-MP        8           166                 2                     3     2667         2500
C90        16           240                 4                     6    15360        10000
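The Mflops column appears to be the product CPUs x clock (MHz) x flops/clock/CPU. That relationship is an assumption inferred from the numbers, not something stated on the slide. The short check below uses a hypothetical helper, `peak_mflops`, and the table values as given. The product matches the X-MP and C90 rows exactly, comes to 2656 for the Y-MP with the rounded 166 MHz clock (the table lists 2667), and comes to 160 for the CRAY-1, where the table lists 80.

```python
def peak_mflops(cpus, clock_mhz, flops_per_clock):
    """Peak rate assuming every CPU completes flops_per_clock results each cycle."""
    return cpus * clock_mhz * flops_per_clock


# (system, CPUs, clock in MHz, flops/clock/CPU, Mflops as listed in the table)
TABLE = [
    ("CRAY-1", 1, 80, 2, 80),
    ("X-MP", 4, 105, 2, 840),
    ("Y-MP", 8, 166, 2, 2667),
    ("C90", 16, 240, 4, 15360),
]

if __name__ == "__main__":
    for name, cpus, mhz, flops, listed in TABLE:
        print(name, "computed:", peak_mflops(cpus, mhz, flops), "listed:", listed)
```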

Page 12: CS718 : Data Parallel Processors

Cray History

• http://www.cray.com/company/history.html

Page 13: CS718 : Data Parallel Processors

CRAY C90

• 8 GB central memory shared by 16 CPUs

• 128 CPU-memory paths

• word = 64 bits + 16 ECC

• Dual vector pipes

• 128-element segments

Memory hierarchy:
– 8 sections
– 8 x 8 subsections
– 8 x 8 x 2 bank groups
– 8 x 8 x 2 x 8 banks
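The memory figures above imply a few derived numbers, worked out in the sketch below. Three assumptions are made here that the slide does not state: 1 GB is taken as 2^30 bytes, the 8 GB figure counts only the 64 data bits of each word (not the ECC bits), and words are spread evenly over all banks. The constant names are illustrative.

```python
# Hierarchy from the slide: sections -> subsections -> bank groups -> banks
SECTIONS = 8
SUBSECTIONS_PER_SECTION = 8
BANK_GROUPS_PER_SUBSECTION = 2
BANKS_PER_GROUP = 8

total_banks = (SECTIONS * SUBSECTIONS_PER_SECTION
               * BANK_GROUPS_PER_SUBSECTION * BANKS_PER_GROUP)

# Word format from the slide: 64 data bits plus 16 ECC bits
DATA_BITS, ECC_BITS = 64, 16
stored_bits_per_word = DATA_BITS + ECC_BITS

# Assumption: 8 GB = 8 * 2**30 bytes of data, excluding ECC
memory_bytes = 8 * 2**30
words = memory_bytes // (DATA_BITS // 8)
words_per_bank = words // total_banks   # assumes an even spread over banks

if __name__ == "__main__":
    print("banks:", total_banks)
    print("bits stored per word:", stored_bits_per_word)
    print("words:", words)
    print("words per bank:", words_per_bank)
```

Under these assumptions there are 1024 banks, each holding 2^20 words.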

Page 14: CS718 : Data Parallel Processors

Convex C4/XA system

• CPU: 7.5 ns clock, 1620 MFLOPS

• Mem: 32 MB x 32 banks, 64-bit word, 50 ns access time

• 3 FP pipes, 2 results each

• Vector regs - FPU crossbar

• 1.1 GB/s per I/O port

[Figure: 5 x 5 crossbar connecting CPUs, memories, and I/O utilities]

Page 15: CS718 : Data Parallel Processors

Other examples

NEC SX-X
• 4 CPUs
• 4 x 2 pipes each

Fujitsu VP5000
• 7 - 222 CPUs
• 2 LS pipes
• 3 Func pipes
• 2 mask pipes

Fujitsu VP2000
• 1 - 2 CPUs

Page 16: CS718 : Data Parallel Processors

Systolic Arrays (H.T. Kung 1978)

Simplicity, Regularity, Concurrency, Communication

Example: Band matrix multiplication

[Figure: C = A x B for 6 x 6 band matrices. A and B each have non-zero entries only at the positions below (row and column index written together); all other entries are 0.]

11 12
21 22 23
31 32 33 34
   42 43 44 45
      53 54 55 56
         64 65 66
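For reference, the computation that the systolic array carries out can be written as an ordinary sequential loop that touches only in-band entries. The sketch below is not a simulation of the array's data flow: it only computes the same result, C = A x B, for band matrices with the shape shown on the slide (two sub-diagonals and one super-diagonal, indices starting at 1). The names `make_band` and `band_matmul` and the dictionary storage are illustrative choices.

```python
def make_band(n, lower=2, upper=1):
    """Band matrix as {(i, j): value}; the value 10*i + j mirrors the slide's labels."""
    return {(i, j): 10 * i + j
            for i in range(1, n + 1)
            for j in range(max(1, i - lower), min(n, i + upper) + 1)}


def band_matmul(a, b, n, lower=2, upper=1):
    """Multiply two n x n band matrices that share the same band shape.

    Only products a[i,k] * b[k,j] with both factors inside the band are formed.
    The result has up to 2*lower sub-diagonals and 2*upper super-diagonals.
    """
    c = {}
    for i in range(1, n + 1):
        for j in range(max(1, i - 2 * lower), min(n, i + 2 * upper) + 1):
            # k must satisfy i-lower <= k <= i+upper (A's band)
            # and j-upper <= k <= j+lower (B's band).
            k_lo = max(1, i - lower, j - upper)
            k_hi = min(n, i + upper, j + lower)
            c[(i, j)] = sum(a[(i, k)] * b[(k, j)] for k in range(k_lo, k_hi + 1))
    return c


if __name__ == "__main__":
    A = make_band(6)
    B = make_band(6)
    C = band_matmul(A, B, 6)
    print(C[(1, 1)])   # a11*b11 + a12*b21
```

With this band shape, c11 is a11 b11 + a12 b21 and c21 is a21 b11 + a22 b21 + a23 b31, the same partial sums that build up cell by cell in the array snapshots on the following slides.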

Page 17: CS718 : Data Parallel Processors

[Figure, T=0: snapshot of the systolic array. Queued inputs: A11, A12, A21, A22, A31, A23; B11, B12, B21, B31.]

Page 18: CS718 : Data Parallel Processors

[Figure, T=1: snapshot of the systolic array. Queued inputs: A11, A12, A21, A22, A31, A23, A32; B11, B12, B21, B31, B22.]

Page 19: CS718 : Data Parallel Processors

[Figure, T=2: snapshot of the systolic array. Queued inputs: A11, A12, A21, A22, A31, A23, A32, A33; B11, B12, B21, B31, B22, B32.]

Page 20: CS718 : Data Parallel Processors

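[Figure, T=3: snapshot of the systolic array. A11 and B11 meet in a cell, with A12 and B21 next in line. Queued inputs: A21, A22, A31, A23, A32, A33, A34, A42; B12, B31, B22, B32, B42, B23.]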

Page 21: CS718 : Data Parallel Processors

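[Figure, T=4: snapshot of the systolic array. Products shown in cells: A11 B11 + A12 B21; A11 B12; A21 B11. Queued inputs: A22, A31, A23, A32, A33, A34, A42, A43; B31, B22, B32, B42, B23, B33.]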

Page 22: CS718 : Data Parallel Processors

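[Figure, T=5: snapshot of the systolic array. C11 is complete. Products shown in cells: A11 B12 + A12 B22; A21 B12; A21 B11 + A22 B21; A31 B11. Queued inputs: A23, A32, A33, A34, A42, A43; B31, B32, B42, B23, B33.]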

Page 23: CS718 : Data Parallel Processors

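[Figure, T=6: snapshot of the systolic array. C11 and C12 are complete. Products shown in cells: A21 B12 + A22 B22; A21 B11 + A22 B21 + A23 B31; A31 B12; A31 B11 + A32 B21; A12 B23. Queued inputs: A33, A34, A42, A43, A53, A44; B32, B42, B33, B43.]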

Page 24: CS718 : Data Parallel Processors

WARP: Programmable Systolic Processor

[Kung, CMU 1987]

In complete contrast to the original idea:

• not application specific

• not a single VLSI chip

• complex cell (pipelined FP adder, multiplier, FIFOs, RAM, crossbar)

• linear

• asynchronous

Page 25: CS718 : Data Parallel Processors

References

• D. Sima, T. Fountain, P. Kacsuk, "Advanced Computer Architectures : A Design Space Approach", Addison Wesley, 1997.

• K. Hwang, "Advanced Computer Architecture : Parallelism, Scalability, Programmability", McGraw Hill, 1993.