19 March 2009
One-Chip TeraArchitecture
Gheorghe Stefan
http://arh.pub.ro/gstefan/
Outline
One-Chip Parallel Engines - an Emergent Market
One-Chip Parallel Architecture & its Performance
Integral Parallel Architecture
A Case Study: BA1024
Concluding Remarks
One-Chip Parallel Engines – an Emergent Market
Parallelism is ubiquitous:
• Instruction-Level Parallelism
• Multi-Threaded Execution
• Multi-Core
• Many-Core Engines
• Many-Computer (Message Passing Interface)
(Performance / Power or Price) & Market
1. Performance-only approach: supercomputing
2. Performance/Price approach: high-end PCs
3. Performance/Price & Performance/Power approach: embedded computing
SoC Market & Programmability
SoC in the nanometer era asks for:
• High complexity
• High intensity
• High flexibility
The One Giga-Gate per Chip era enforces:
complexity can’t follow size
Then the key word is
PROGRAMMABILITY
Embedded Parallel Computing
“Flexible & Feasible ASIC” = programmable parallel engine
An ASIC is a circuit = an inherently heterogeneous parallel system
Flexibility = programmability
Feasibility = segregating all kinds of simple parallel structures from the complex program
One-Chip Parallel Architecture & its Performance
Because a programmable structure competes with the ASIC philosophy, a one-chip parallel architecture must be an integral parallel architecture
The performance is evaluated according to the weight of each type of instruction: float, pointer, word, half-word, byte
Examples:
• for a half-word machine, a word instruction is executed in 2 cycles
• for a word machine, a float instruction is executed in 20 cycles
Weighted Tera Instruction Per Second (TIPS)
Cycles per operation:
• Float op: 20 cycles
• Word op: 2 cycles
• Half-word op: 1 cycle
• Byte op: 0.5 cycles
Medium intense float:
• Float op: 10%
• Word op: 13%
• Half-word op: 30%
• Byte op: 37%
1 TIPS = 2.75 TOPS
High intense float:
• Float op: 25%
• Word op: 35%
• Half-word op: 12%
• Byte op: 28%
1 TIPS = 5.96 TOPS
Then: 1 TIPS = 3 – 6 TOPS
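The two conversion factors can be checked as a weighted sum of cycle costs over the instruction mix. The following is a minimal sketch, assuming that "TOPS per TIPS" means the average number of cycles per instruction; the function and variable names are illustrative and do not come from the slides. The medium mix as listed sums to 90%, and the sketch uses the figures exactly as given.

```python
# Cycle cost of each instruction type, as listed on the slide.
CYCLES = {"float": 20.0, "word": 2.0, "half_word": 1.0, "byte": 0.5}

# Instruction mixes as fractions, as listed on the slide.
# Note: the medium mix sums to 0.90 in the source; it is used unchanged.
MEDIUM_FLOAT = {"float": 0.10, "word": 0.13, "half_word": 0.30, "byte": 0.37}
HIGH_FLOAT = {"float": 0.25, "word": 0.35, "half_word": 0.12, "byte": 0.28}


def tops_per_tips(mix, cycles=CYCLES):
    """Average cycle cost per instruction for a given instruction mix
    (assumed interpretation of the TOPS-per-TIPS factor)."""
    return sum(fraction * cycles[kind] for kind, fraction in mix.items())


medium = tops_per_tips(MEDIUM_FLOAT)
high = tops_per_tips(HIGH_FLOAT)
print(f"medium float mix: {medium:.3f} TOPS per TIPS")
print(f"high float mix:   {high:.2f} TOPS per TIPS")
```

Under this reading the medium mix gives about 2.745 and the high mix 5.96, consistent with the rounded range of 3 – 6 TOPS per TIPS.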
Integral Parallel Architecture (IPA)
Computation is:
• Complex (control intensive)
• Intense (data intensive)
Parallelism is:
• data parallelism (almost SIMD)
• time parallelism (a sort of MIMD)
• speculative parallelism (a true MISD)
Complex vs. Intense
Intense computation:
• high latency functional pipe
• array computation
• buffer based hierarchy
• 400 GOPS (half-word ops)
• 0.5 cm², 6 W, 0.4 GHz
• 800 GOPS/cm²
• 66.6 GOPS/W
Complex computation:
• OS oriented
• multi-threading
• cache based hierarchy
• 4 GIPS & 2 GFLOPS
• 1.5 cm², 50 W, 2 GHz
• (2.6 GIPS + 1.3 GFLOPS)/cm²
• (0.08 GIPS + 0.04 GFLOPS)/W
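The per-area and per-watt figures follow directly from the raw throughput, area and power numbers. The sketch below recomputes them; the helper name `density` is hypothetical and only illustrates the arithmetic. Note that 400 GOPS over 6 W works out to about 66.7 GOPS/W.

```python
def density(throughput, area_cm2, power_w):
    """Return (throughput per cm^2, throughput per watt)."""
    return throughput / area_cm2, throughput / power_w


# Intense engine: 400 GOPS, 0.5 cm^2, 6 W (figures from the slide).
intense_per_cm2, intense_per_w = density(400.0, 0.5, 6.0)

# Complex engine: 4 GIPS and 2 GFLOPS, 1.5 cm^2, 50 W (figures from the slide).
gips_per_cm2, gips_per_w = density(4.0, 1.5, 50.0)
gflops_per_cm2, gflops_per_w = density(2.0, 1.5, 50.0)

print(f"intense: {intense_per_cm2:.0f} GOPS/cm2, {intense_per_w:.1f} GOPS/W")
print(f"complex: {gips_per_cm2:.2f} GIPS/cm2, {gflops_per_cm2:.2f} GFLOPS/cm2")
print(f"complex: {gips_per_w:.2f} GIPS/W, {gflops_per_w:.2f} GFLOPS/W")
```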
Embedded Parallel Organization
Coarse grain Multi-Core Engine for complex computation
&
fine grain Many-core Engine for intense computation
Multi-Core: 2 – 16 multi-threaded complex processors
Many-Core: 256 – 4096 small & simple execution units (EU) or processing elements (PE)
Chip Organization
[Figure: block diagram of the chip. Blocks: Multi-Core (2 – 16), Cache, Many-Core (64 – 4096), Buffer, Interconnection Fabric, DDR SDRAM Interface]
A Case Study: BA1024
The organization of BA1024:
• multi-core area of 4 MIPS
• many-core data parallel area of 1024 simple PEs
• speculative time parallel pipe of 8 PEs
• interfaces (DDR, PCI, video & audio interfaces for 2 HDTV channels)
Overall performances of BA1024
• 400 GOP/sec
• 6.4 GB/sec: external bandwidth
• 800 GB/sec: internal bandwidth
• > 60 GOPS/Watt
• > 8 GOPS/mm²
65 nm, Standard process
Note: 1 OP = 16-bit simple integer operation (excluding multiplication)
Full Vector Operations
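[Figure: data array with one axis indexed 0 – 511 and the other 0 – 1023; lines i and j are combined into line k with an operator such as +, -, *, XOR, etc.]
Line k = Line i OP Line j
Line k = Line i OP scalar value (repeated for all elements)
16-bit data operand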
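The two forms of full vector operation can be modelled in a few lines. This is a minimal software sketch, not the BA1024 instruction set: it assumes a line is a vector of 1024 elements (one per PE, read from the 0 – 1023 axis of the figure) and that results wrap around to 16 bits; both are assumptions, as are the names `vector_op` and `MASK16`.

```python
import operator

N_ELEMENTS = 1024   # assumed vector length, one element per PE
MASK16 = 0xFFFF     # assumed 16-bit wrap-around of results


def vector_op(op, line_i, line_j):
    """Compute line_k = line_i OP line_j element by element.

    line_j may be a list of the same length as line_i, or a scalar int
    that is repeated for all elements.
    """
    if isinstance(line_j, int):
        line_j = [line_j] * len(line_i)
    if len(line_i) != len(line_j):
        raise ValueError("lines must have the same length")
    return [op(a, b) & MASK16 for a, b in zip(line_i, line_j)]


line_i = list(range(N_ELEMENTS))
line_j = [1] * N_ELEMENTS
line_k = vector_op(operator.add, line_i, line_j)      # vector OP vector
line_s = vector_op(operator.xor, line_i, 0x00FF)      # vector OP scalar
```

On the hardware every element would be processed in the same step; the list comprehension only models the result.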
Conditioned Operations Based
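[Figure: data array with one axis indexed 0 – 511 and the other 0 – 1023; lines i and j are combined into line k with an operator such as +, -, *, XOR, etc.]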
This enables selective processing based on data content.
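Selective processing of this kind is commonly modelled with a per-element activity flag: elements whose flag is set receive the new result, the others keep their previous value. The sketch below assumes that model and a 16-bit wrap-around; the slides do not specify the actual mechanism, and the names `conditioned_op` and `active` are hypothetical.

```python
import operator

MASK16 = 0xFFFF  # assumed 16-bit wrap-around of results


def conditioned_op(op, line_i, line_j, line_k, active):
    """Update line_k with line_i OP line_j only where `active` is true.

    Elements with a false flag keep their old value from line_k.
    """
    if not (len(line_i) == len(line_j) == len(line_k) == len(active)):
        raise ValueError("all lines and the flag vector must have the same length")
    return [
        (op(a, b) & MASK16) if flag else old
        for a, b, old, flag in zip(line_i, line_j, line_k, active)
    ]


line_i = [5, 200, 7, 300]
line_j = [1, 1, 1, 1]
line_k = [0, 0, 0, 0]

# The condition depends on the data content: only elements above 100 are processed.
active = [value > 100 for value in line_i]
result = conditioned_op(operator.add, line_i, line_j, line_k, active)
```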
Multi-Core Organization
• Multi-threaded programming model
• Each core supports:
  • block multi-threading
  • interleaved multi-threading
• Number of cores limited by the random access to the external memory
Extrapolating BA1024 performance
Medium float environment:
• 45 nm, standard process
• 1 cm²
• 4096 EUs
• 0.7 GHz
• 2.8 TOPS = 1 TIPS
• ~ 25 W
High float environment:
• 45 nm, standard process
• 1 cm²
• 4096 EUs
• 1.5 GHz
• 6 TOPS = 1 TIPS
• ~ 50 W
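The extrapolated figures are consistent with a simple peak-throughput estimate of EUs times clock frequency. The sketch below assumes one 16-bit operation per EU per cycle, which is not stated on the slides but matches the BA1024 numbers (1024 PEs at 0.4 GHz, about 400 GOPS). The function name `peak_tops` is illustrative.

```python
def peak_tops(execution_units, clock_ghz, ops_per_cycle=1.0):
    """Peak throughput in TOPS, assuming `ops_per_cycle` operations
    per execution unit per clock cycle (an assumption, see above)."""
    return execution_units * clock_ghz * ops_per_cycle / 1000.0


# Sanity check against the BA1024 figures: 1024 PEs at 0.4 GHz.
ba1024_tops = peak_tops(1024, 0.4)

# Extrapolated 45 nm configurations from the slide.
medium_tops = peak_tops(4096, 0.7)
high_tops = peak_tops(4096, 1.5)

# Conversion to weighted TIPS using the factors quoted earlier in the deck.
medium_tips = medium_tops / 2.75
high_tips = high_tops / 5.96

print(f"BA1024: {ba1024_tops:.2f} TOPS")
print(f"medium: {medium_tops:.2f} TOPS, {medium_tips:.2f} TIPS")
print(f"high:   {high_tops:.2f} TOPS, {high_tips:.2f} TIPS")
```

Both extrapolated configurations land slightly above 1 TIPS under these assumptions.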
Concluding Remarks
1. Segregating the complex from the intense is the key
2. Using all forms of parallelism allows competition with the ASIC approach
3. Implementation issues limit the true scalability
4. The organization must be kept as simple as possible so that it is easily hidden from the user
5. "The Landscape of Parallel Computing Research: A View from Berkeley" is a good tool to evaluate our approach
Main technical contributors to the project:
• Emanuele Altieri, BrightScale Inc., CA
• Frank Ho, BrightScale Inc., CA
• Mihaela Malita, St. Anselm College, NH
• Bogdan Mitu, BrightScale Inc., CA
• Marius Stoian, PUB, Romania
• Dominique Thiebaut, Smith College, MA
• Dan Tomescu, BrightScale Inc., CA