Page 1

19 March 2009

One-Chip TeraArchitecture

Gheorghe Stefan

http://arh.pub.ro/gstefan/

Page 2

Outline

One-Chip Parallel Engines - an Emergent Market

One-Chip Parallel Architecture & its Performance

Integral Parallel Architecture

A Case Study: BA1024

Concluding Remarks

Page 3

One-Chip Parallel Engines – an Emergent Market

Parallelism is ubiquitous:

• Instruction Level Parallelism
• Multi-Threaded Execution
• Multi-Core
• Many-Core Engines
• Many-Computer (Message Passing Interface)

Page 4

(Performance / Power or Price) & Market

1. Performance-only approach: supercomputing

2. Performance/Price approach: high-end PCs

3. Performance/Price & Performance/Power approach: embedded computing

Page 5

SoC Market & Programmability

SoC in the nanometer era asks for:
• High complexity
• High intensity
• High flexibility

The one-giga-gate-per-chip era enforces: complexity can’t follow size

Then the key word is PROGRAMMABILITY

Page 6

Embedded Parallel Computing

“Flexible & Feasible ASIC” = programmable parallel engine

An ASIC is a circuit = an inherently heterogeneous parallel system

Flexibility = programmability

Feasibility = segregating all kinds of simple parallel structures from the complex program

Page 7

One-Chip Parallel Architecture & its Performance

Because a programmable structure competes with the ASIC philosophy, a one-chip parallel architecture must be an integral parallel architecture.

The performance is evaluated according to the weight of each type of instruction: float, pointer, word, half-word, byte.

Examples:

• for a half-word machine, a word instruction is executed in 2 cycles
• for a word machine, a float instruction is executed in 20 cycles

Page 9

Weighted Tera Instructions Per Second (TIPS)

Medium intense float:
• Float op: 10%
• Word op: 13%
• Half-word op: 30%
• Byte op: 37%

1 TIPS = 2.75 TOPS

High intense float:
• Float op: 25%
• Word op: 35%
• Half-word op: 12%
• Byte op: 28%

1 TIPS = 5.96 TOPS

Cycles per operation:
• Float op: 20 cycles
• Word op: 2 cycles
• Half-word op: 1 cycle
• Byte op: 0.5 cycles

Then: 1 TIPS = 3 – 6 TOPS
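The two conversion factors on this slide follow from a weighted sum of the cycle costs. Below is a minimal sketch in Python, assuming the percentages are shares of the instruction stream and the cycle counts are measured in half-word operations. The names `CYCLES`, `MEDIUM_FLOAT`, `HIGH_FLOAT` and `ops_per_instruction` are illustrative and do not come from the slides. The medium-float shares as listed sum to 90%; the sketch uses them as given.

```python
# Cycle cost of each operation type, as listed on the slide.
CYCLES = {"float": 20.0, "word": 2.0, "half_word": 1.0, "byte": 0.5}

# Instruction mixes from the slide, expressed as fractions.
MEDIUM_FLOAT = {"float": 0.10, "word": 0.13, "half_word": 0.30, "byte": 0.37}
HIGH_FLOAT = {"float": 0.25, "word": 0.35, "half_word": 0.12, "byte": 0.28}


def ops_per_instruction(mix):
    """Weighted average cost of one instruction, in half-word operations."""
    return sum(share * CYCLES[kind] for kind, share in mix.items())


if __name__ == "__main__":
    # Prints values close to the slide's 2.75 and 5.96.
    print(f"medium float: {ops_per_instruction(MEDIUM_FLOAT):.3f}")
    print(f"high float:   {ops_per_instruction(HIGH_FLOAT):.3f}")
```

Read this way, "1 TIPS = 2.75 TOPS" says that sustaining one tera-instruction per second on the medium-float mix takes roughly 2.75 tera half-word operations per second.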

Page 10

Integral Parallel Architecture (IPA)

Computation is:
• Complex (control intensive)
• Intense (data intensive)

Parallelism is:
• data parallelism (almost SIMD)
• time parallelism (a sort of MIMD)
• speculative parallelism (a true MISD)

Page 11

Complex vs. Intense

Intense computation:

• high latency functional pipe
• array computation
• buffer based hierarchy

• 400 GOPS (half-word ops)
• 0.5 cm2, 6 W, 0.4 GHz

• 800 GOPS/cm2

• 66 GOPS/W

Complex computation:

• OS oriented
• multi-threading
• cache based hierarchy

• 4 GIPS & 2 GFLOPS
• 1.5 cm2, 50 W, 2 GHz

• (2.6 GIPS + 1.3 GFLOPS)/cm2

• (0.08 GIPS + 0.04 GFLOPS)/W
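The density figures above are simple ratios of throughput to area and to power. The sketch below recomputes them in Python from the raw numbers on the slide; the `Engine` class and its field names are hypothetical helpers introduced only for this illustration. For the intense engine, 400 GOPS over 6 W works out to about 66.7 GOPS/W.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Raw figures for one engine, taken from the slide."""
    throughput_g: float  # giga-operations (or giga-instructions) per second
    area_cm2: float
    power_w: float

    def per_cm2(self):
        return self.throughput_g / self.area_cm2

    def per_watt(self):
        return self.throughput_g / self.power_w


intense = Engine(throughput_g=400.0, area_cm2=0.5, power_w=6.0)    # GOPS
complex_int = Engine(throughput_g=4.0, area_cm2=1.5, power_w=50.0)  # GIPS
complex_fp = Engine(throughput_g=2.0, area_cm2=1.5, power_w=50.0)   # GFLOPS

if __name__ == "__main__":
    for name, engine in [("intense", intense),
                         ("complex, integer", complex_int),
                         ("complex, float", complex_fp)]:
        print(f"{name}: {engine.per_cm2():.2f} per cm2, "
              f"{engine.per_watt():.2f} per W")
```

On these numbers the intense engine is roughly 300 times denser per cm2 and roughly 800 times more efficient per watt than the integer side of the complex engine, although the two count different kinds of operations.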

Page 12

Embedded Parallel Organization

Coarse-grain Multi-Core Engine for complex computation & fine-grain Many-Core Engine for intense computation

Multi-Core: 2 – 16 multi-threaded complex processors

Many-Core: 256 – 4096 small & simple execution units (EU) or processing elements (PE)

Page 13

Chip Organization

[Figure: chip block diagram. Labels: Multi-Core (2 – 16), Cache, Many-Core (64 – 4096), Buffer, Interconnection Fabric, DDR SDRAM Interface]

Page 14

A Case Study: BA1024

The organization of BA1024:

• multi-core area of 4 MIPS processors

• many-core data parallel area of 1024 simple PEs

• speculative time parallel pipe of 8 PEs

• interfaces (DDR, PCI, video & audio interfaces for 2 HDTV channels)

Page 15

Overall performance of BA1024

• 400 GOP/sec
• 6.4 GB/sec: external bandwidth
• 800 GB/sec: internal bandwidth
• > 60 GOPS/Watt
• > 8 GOPS/mm2
• 65 nm, standard process

Note: 1 OP = 16-bit simple integer operation (excluding multiplication)

Page 16

Full Vector Operations

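[Figure: array with rows labeled 0 – 511 and columns labeled 0 – 1023, holding 16-bit data operands; Line i and Line j are combined through +, -, *, XOR, etc. to give Line k]

Line k = Line i OP Line j

Line k = Line i OP scalar value (repeated for all elements)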
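The two forms of full vector operation can be sketched in plain Python. This is an illustration under stated assumptions, not the BA1024 instruction set: a line is modeled as a list of 1024 integers (one element per PE, inferred from the 0 – 1023 label), and results are assumed to wrap modulo 2^16 because the operands are 16-bit. The names `LINE_WIDTH`, `MASK16` and `vector_op` are hypothetical.

```python
import operator

LINE_WIDTH = 1024  # elements per line (assumed: one per PE)
MASK16 = 0xFFFF    # assumed wrap-around for 16-bit operands


def vector_op(op, line_i, line_j):
    """Return Line k = Line i OP Line j; line_j may also be a scalar."""
    if isinstance(line_j, int):
        # Scalar form: the value is repeated for all elements.
        line_j = [line_j] * len(line_i)
    if len(line_i) != len(line_j):
        raise ValueError("lines must have the same length")
    return [op(a, b) & MASK16 for a, b in zip(line_i, line_j)]


line_i = list(range(LINE_WIDTH))
line_j = [3] * LINE_WIDTH

line_k = vector_op(operator.add, line_i, line_j)  # Line k = Line i + Line j
line_x = vector_op(operator.xor, line_i, 0x00FF)  # Line k = Line i XOR scalar
```

On the real chip every element of the line would be processed in the same cycle by its own PE; the list comprehension here only models the result, not the timing.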

Page 17

Conditioned Operations Based

[Figure: array with rows labeled 0 – 511 and columns labeled 0 – 1023; Line i and Line j are combined through +, -, *, XOR, etc. to give Line k]

This enables selective processing based on data content.
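A minimal Python sketch of selective processing, assuming each element carries a Boolean flag computed from its own data and that only flagged elements are updated. The slide does not say how the condition is encoded, so the flag list, the `conditioned_op` helper and the threshold used below are all hypothetical.

```python
import operator

MASK16 = 0xFFFF  # assumed wrap-around for 16-bit operands


def conditioned_op(op, line_i, line_j, flags, line_k):
    """Return a new Line k: op(i, j) where the flag is set, old value elsewhere."""
    return [
        (op(a, b) & MASK16) if flag else old
        for a, b, flag, old in zip(line_i, line_j, flags, line_k)
    ]


line_i = [5, 200, 17, 300]
line_j = [1, 1, 1, 1]
line_k = [0, 0, 0, 0]

# The condition depends on the data content: select elements above 100.
flags = [value > 100 for value in line_i]

line_k = conditioned_op(operator.add, line_i, line_j, flags, line_k)
# line_k is now [0, 201, 0, 301]
```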

Page 18

Multi-Core Organization

• Multi-threaded programming model

• Each core supports:
  • block multi-threading
  • interleaved multi-threading

• Number of cores limited by the random access to the external memory

Page 19

Extrapolating BA1024 performance

Medium float environment:

• 45 nm, standard process
• 1 cm2
• 4096 EUs
• 0.7 GHz
• 2.8 TOPS = 1 TIPS
• ~ 25 W

High float environment:

• 45 nm, standard process
• 1 cm2
• 4096 EUs
• 1.5 GHz
• 6 TOPS = 1 TIPS
• ~ 50 W
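The extrapolated figures can be checked with a short calculation. The sketch below assumes each execution unit completes one half-word operation per clock cycle, which is consistent with the numbers on the slide but is not stated there. The function names are illustrative, and the conversion factors 2.75 and 5.96 are the weighted TIPS factors given earlier in the deck.

```python
def peak_tops(execution_units, clock_ghz):
    """Peak throughput in TOPS, assuming one operation per EU per cycle."""
    return execution_units * clock_ghz / 1000.0  # GOPS -> TOPS


def weighted_tips(execution_units, clock_ghz, ops_per_instruction):
    """Weighted TIPS for a given instruction mix."""
    return peak_tops(execution_units, clock_ghz) / ops_per_instruction


if __name__ == "__main__":
    # Medium float environment: 4096 EUs at 0.7 GHz, 1 TIPS = 2.75 TOPS.
    print(f"medium: {peak_tops(4096, 0.7):.2f} TOPS, "
          f"{weighted_tips(4096, 0.7, 2.75):.2f} TIPS")
    # High float environment: 4096 EUs at 1.5 GHz, 1 TIPS = 5.96 TOPS.
    print(f"high:   {peak_tops(4096, 1.5):.2f} TOPS, "
          f"{weighted_tips(4096, 1.5, 5.96):.2f} TIPS")
```

Both configurations land slightly above 1 TIPS, which matches the slide's rounded "2.8 TOPS = 1 TIPS" and "6 TOPS = 1 TIPS".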

Page 20

Concluding Remarks

1. Segregating the complex from the intense is the key
2. Using all forms of parallelism allows competition with the ASIC approach
3. Implementation issues limit true scalability
4. The organization must be kept as simple as possible so that it can be easily hidden from the user
5. "The Landscape of Parallel Computing Research: A View from Berkeley" is a good tool for evaluating our approach

Page 21

Main technical contributors to the project:

Emanuele Altieri, BrightScale Inc., CA
Frank Ho, BrightScale Inc., CA
Mihaela Malita, St. Anselm College, NH
Bogdan Mitu, BrightScale Inc., CA
Marius Stoian, PUB, Romania
Dominique Thiebaut, Smith College, MA
Dan Tomescu, BrightScale Inc., CA