![Page 1: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/1.jpg)
A Roadmap to Restoring Computing's Former Glory
David I. August
Princeton University
(Not speaking for Parakinetics, Inc.)
![Page 2: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/2.jpg)
Golden era of computer architecture
[Figure: SPEC CINT performance (log scale) by year, 1992 to 2012, spanning the CPU92, CPU95, CPU2000, and CPU2006 suites; annotation: "~3 years behind"]
Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters
10 Cores!
10-Core Intel Xeon: “Unparalleled Performance”
![Page 3: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/3.jpg)
P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994)
Automatic Speculation
Automatic Pipelining
Parallel Resources
Automatic Allocation/Scheduling
Commit
![Page 4: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/4.jpg)
MULTICORE ARCHITECTURE (CIRCA 2010)
Automatic Pipelining
Parallel Resources
Automatic Speculation
Automatic Allocation/Scheduling
Commit
![Page 5: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/5.jpg)
![Page 6: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/6.jpg)
Realizable parallelism
Parallel Library Calls
[Figure: two plots of threads (y-axis) against time (x-axis)]
Credit: Jack Dongarra
![Page 7: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/7.jpg)
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
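As a rough worked comparison, the doubling period in the quote can be converted to an annual growth rate. The 18-month doubling period used for contrast is the usual popular statement of Moore's Law; it is an assumption here and does not appear on the slide.

```latex
\[
2^{1/18} \approx 1.039
\quad\Rightarrow\quad
\text{about } 3.9\%\ \text{per year (doubling every 18 years)}
\]
\[
2^{1/1.5} \approx 1.587
\quad\Rightarrow\quad
\text{about } 59\%\ \text{per year (doubling every 18 months)}
\]
```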
![Page 8: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/8.jpg)
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
[Diagram labels: Parallel Programming, Automatic Parallelization, Parallel Libraries, Computer Architecture]
Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization
A Roadmap to restoring computing’s former glory.
![Page 9: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/9.jpg)
Multicore Needs:
1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.
[Diagram labels: New or Existing Sequential Code, Insight Annotation, New or Existing Libraries, Insight Annotation, DSWP Family Optis, Speculative Optis, Other Optis, Complainer/Fixer, Machine Specific Performance Primitives, One Implementation, Parallelized Code]
![Page 10: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/10.jpg)
Spec-PS-DSWP
P6 SUPERSCALAR ARCHITECTURE
[Figure: schedule over time steps 0 to 5 on Cores 1 to 4, with operations labeled LD:1 to LD:5, W:1 to W:4, and C:1 to C:3]
![Page 11: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/11.jpg)
Example
A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }
[Figure: schedule of A1, B1, C1, D1, A2, B2, C2, D2 over time on Cores 1 to 3]
Program Dependence Graph
[Figure: graph over statements A, B, C, D; legend: Control Dependence, Data Dependence]
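The stage split suggested by the dependence graph can be sketched in plain C. This is a single-threaded emulation, not the DSWP implementation from the talk: the `Node` layout, the body of `work`, the queue arrays `q1` and `q2`, and the modeling of `write(res)` as an append to `out` are all assumptions made for illustration. The point it shows is that the pointer-chasing recurrence (A and B) can run ahead on its own, with values flowing forward only to C and then to D.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list node; the slide does not define its layout. */
typedef struct Node { int value; struct Node *next; } Node;

/* Stand-in for the slide's work(); assumed to tolerate the NULL that the
   loop produces on its final iteration. */
static int work(const Node *n) { return n ? n->value * n->value : 0; }

/* Original loop: statements A to D run back to back in each iteration.
   Returns the number of results written to out. */
static size_t run_sequential(Node *node, int *out) {
    assert(out != NULL);
    size_t n = 0;
    while (node) {                 /* A */
        node = node->next;         /* B */
        int res = work(node);      /* C */
        out[n++] = res;            /* D: write(res) modeled as an append */
    }
    return n;
}

/* Staged version: A and B form stage 1, C is stage 2, D is stage 3.
   The stages run one after another here; on a multicore each stage would
   be a thread and q1/q2 would be inter-core queues. */
static size_t run_staged(Node *node, const Node **q1, int *q2, int *out) {
    assert(q1 != NULL && q2 != NULL && out != NULL);
    size_t produced = 0;
    while (node) {                          /* stage 1: A and B */
        node = node->next;
        q1[produced++] = node;
    }
    for (size_t i = 0; i < produced; i++)   /* stage 2: C */
        q2[i] = work(q1[i]);
    for (size_t i = 0; i < produced; i++)   /* stage 3: D */
        out[i] = q2[i];
    return produced;
}
```

Both functions produce the same output sequence for the same list, which is the property a pipelined partition has to preserve.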
![Page 12: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/12.jpg)
Example
A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }
Spec-DOALL
[Figure: schedule of A1, B1, C1, D1, A2, B2, C2, D2 over time on Cores 1 to 3]
Program Dependence Graph
[Figure: graph over statements A, B, C, D; legend: Control Dependence, Data Dependence]
![Page 13: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/13.jpg)
Example
A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }
Spec-DOALL
[Figure: schedule of A1 to D1, A2 to D2, and A3, B3 over time on Cores 1 to 3]
Program Dependence Graph
[Figure: graph over statements A, B, C, D; legend: Control Dependence, Data Dependence]
![Page 14: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/14.jpg)
Example
A: while (node) {   →   while (true) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }
Spec-DOALL
[Figure: schedules over time on Cores 1 to 3, one iteration (B, C, D) per core, shown for iterations 1 to 3 with A1, A2, A3 and for iterations 2 to 4]
Program Dependence Graph
[Figure: graph over statements A, B, C, D; legend: Control Dependence, Data Dependence]
197.parser
Slowdown
![Page 15: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/15.jpg)
Spec-DOACROSS
[Figure: schedule of stages B, C, D for iterations 1 to 7 over time on Cores 1 to 3]
Throughput: 1 iter/cycle
Spec-DSWP
[Figure: schedule of stages B, C, D for iterations 1 to 7 over time on Cores 1 to 3]
Throughput: 1 iter/cycle
![Page 16: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/16.jpg)
Comparison: Spec-DOACROSS and Spec-DSWP
Comm. Latency = 1: Spec-DOACROSS 1 iter/cycle; Spec-DSWP 1 iter/cycle
Comm. Latency = 2: Spec-DOACROSS 0.5 iter/cycle; Spec-DSWP 1 iter/cycle
[Figure: schedules of stages B, C, D over time on Cores 1 to 3 for Spec-DOACROSS and Spec-DSWP; the Spec-DSWP schedule marks the pipeline fill time]
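The throughput figures on this slide can be reproduced with a small scheduling model. The model below is a sketch under stated assumptions, not the evaluation method behind the slide: each of B, C, and D takes one cycle, a latency of 1 means a value is usable on another core in the very next cycle, and the function names and the `fill_time` output are hypothetical. Under those assumptions, the loop-carried dependence in B crosses cores on every DOACROSS iteration, so latency stretches the interval between iterations, while in DSWP it stays on one core and latency only lengthens the pipeline fill.

```c
#include <assert.h>

enum { STAGES = 3 };   /* B, C, D; one cycle each (assumed) */

/* Cycle at which a value from a stage that started at `start` is usable
   on another core. lat = 1 means "the very next cycle". */
static int ready_at(int start, int lat) { return start + 1 + (lat - 1); }

/* Spec-DOACROSS: iteration i runs B, C, D back to back on core i % cores;
   B of iteration i waits for B of iteration i-1 from another core.
   Returns steady-state throughput in iterations per cycle. */
static double doacross_throughput(int iters, int cores, int lat) {
    assert(iters > 1 && cores > 0 && cores <= 64 && lat >= 1);
    int core_free[64] = {0};
    int first_b = 0, prev_b = 0;
    for (int i = 0; i < iters; i++) {
        int c = i % cores;
        int start = core_free[c];
        if (i > 0 && ready_at(prev_b, lat) > start) start = ready_at(prev_b, lat);
        if (i == 0) first_b = start;
        prev_b = start;
        core_free[c] = start + STAGES;
    }
    return (double)(iters - 1) / (prev_b - first_b);
}

/* Spec-DSWP: stage s of every iteration runs on core s. The recurrence in
   B never leaves its core, so latency only delays the later stages.
   fill_time (optional) receives the start cycle of the first D. */
static double dswp_throughput(int iters, int lat, int *fill_time) {
    assert(iters > 1 && lat >= 1);
    int stage_free[STAGES] = {0};
    int first_d = 0, last_d = 0;
    for (int i = 0; i < iters; i++) {
        int ready = 0;   /* when this iteration's input reaches the next stage */
        for (int s = 0; s < STAGES; s++) {
            int start = stage_free[s] > ready ? stage_free[s] : ready;
            stage_free[s] = start + 1;
            ready = ready_at(start, lat);
            if (s == STAGES - 1) {
                if (i == 0) first_d = start;
                last_d = start;
            }
        }
    }
    if (fill_time) *fill_time = first_d;
    return (double)(iters - 1) / (last_d - first_d);
}
```

With three cores, the model gives 1 iteration per cycle for both schemes at latency 1, and 0.5 versus 1 at latency 2, matching the slide; the DSWP fill time grows from 2 to 4 cycles.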
![Page 17: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/17.jpg)
TLS vs. Spec-DSWP [MICRO 2010]
Geomean of 11 benchmarks on the same cluster
[Figure: performance speedup (X), axis 0 to 50, against (Number of Total Cores, Number of Nodes) from (1,1) and (8,2) up to (128,32) in steps of (8,2); series: TLS, Spec-PS-DSWP]
![Page 18: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/18.jpg)
![Page 19: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/19.jpg)
char *memory;
void * alloc(int size);
void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}
Execution Plan
[Figure: calls alloc1 to alloc6 scheduled over time on Cores 1 to 3]
![Page 20: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/20.jpg)
char *memory;
@Commutative
void * alloc(int size);
void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}
Execution Plan
[Figure: calls alloc1 to alloc6 scheduled over time on Cores 1 to 3]
![Page 21: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/21.jpg)
char *memory;
@Commutative
void * alloc(int size);
void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}
Execution Plan
[Figure: calls alloc1 to alloc6 scheduled over time on Cores 1 to 3]
Easily Understood Non-Determinism!
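A small C sketch of why reordering `alloc` calls is harmless: the returned addresses change with call order, but the blocks never overlap and the same total is consumed. The `arena` buffer, `reset_arena`, and `disjoint` are hypothetical helpers added for the demonstration, and `@Commutative` appears only as a comment because it is not C syntax.

```c
#include <assert.h>
#include <stddef.h>

static char arena[1024];          /* hypothetical backing store */
static char *memory = arena;

/* Bump allocator from the slide. The @Commutative annotation there says
   calls may be reordered with respect to each other. */
static void *alloc(int size) {
    assert(size >= 0 && memory + size <= arena + sizeof arena);
    void *ptr = memory;
    memory = memory + size;
    return ptr;
}

/* Rewind the allocator so two call orders can be compared. */
static void reset_arena(void) { memory = arena; }

/* True when [a, a+an) and [b, b+bn) share no byte. */
static int disjoint(const void *a, int an, const void *b, int bn) {
    const char *pa = a, *pb = b;
    return pa + an <= pb || pb + bn <= pa;
}
```

Allocating 16 then 32 bytes, and then 32 then 16 bytes after a reset, yields different pointers for the 16-byte block but disjoint blocks and the same final `memory` value in both orders. That is the sense in which the non-determinism is easy to understand: callers that only need a private block cannot tell the orders apart.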
![Page 22: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/22.jpg)
[MICRO ’07, Top Picks ’08; Automatic: PLDI ’11]
~50 of ½ Million LOCs modified in SpecINT 2000
Mods also include Non-Deterministic Branch
![Page 23: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/23.jpg)
![Page 24: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/24.jpg)
Iterative Compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05]
[Figure: sequences of the Sum Reduction, Unroll, and Rotate optimizations, annotated with speedups 0.90X, 0.10X, 30.0X, 1.1X, 0.8X, and 1.5X]
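The idea of trying optimization orderings and keeping the best can be sketched as an exhaustive search over the three passes named on the slide. Everything here is illustrative: `eval_fn`, `best_ordering`, and `toy_eval` are hypothetical names, and the toy evaluator returns invented numbers that are unrelated to the speedups in the figure. A real system would apply the passes in the given order and then measure or model the resulting code.

```c
#include <assert.h>

enum { NPASS = 3 };   /* 0 = Sum Reduction, 1 = Unroll, 2 = Rotate */

/* Hypothetical evaluator: returns the speedup obtained by applying the
   passes in the given order. */
typedef double (*eval_fn)(const int order[NPASS]);

/* Try every ordering of the passes and record the best one in `best`.
   Returns the best speedup found. */
static double best_ordering(eval_fn eval, int best[NPASS]) {
    assert(eval != 0 && best != 0);
    double best_speedup = -1.0;
    for (int a = 0; a < NPASS; a++)
        for (int b = 0; b < NPASS; b++)
            for (int c = 0; c < NPASS; c++) {
                if (a == b || b == c || a == c) continue;  /* permutations only */
                int order[NPASS] = { a, b, c };
                double s = eval(order);
                if (s > best_speedup) {
                    best_speedup = s;
                    for (int i = 0; i < NPASS; i++) best[i] = order[i];
                }
            }
    return best_speedup;
}

/* Toy evaluator with invented numbers, for illustration only: it rewards
   running Rotate first and Sum Reduction second. */
static double toy_eval(const int order[NPASS]) {
    return (order[0] == 2 && order[1] == 0) ? 2.0 : 1.0;
}
```

Exhaustive search is only practical for a handful of passes, since the number of orderings grows factorially.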
![Page 25: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/25.jpg)
PS-DSWP
Complainer
![Page 26: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/26.jpg)
Red Edges: Deps between malloc() & free()
Blue Edges: Deps between rand() calls
Green Edges: Flow Deps inside Inner Loop
Orange Edges: Deps between function calls
Unroll
Sum Reduction
Rotate
PS-DSWP
Complainer
Who can help me?
Programmer Annotation
![Page 27: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/27.jpg)
PS-DSWP
Complainer
Sum Reduction
![Page 28: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/28.jpg)
PS-DSWP
Complainer
Sum Reduction
PROGRAMMER Commutative
![Page 29: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/29.jpg)
PS-DSWP
Complainer
Sum Reduction
PROGRAMMER Commutative
LIBRARY Commutative
![Page 30: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/30.jpg)
PS-DSWP
Complainer
Sum Reduction
PROGRAMMER Commutative
LIBRARY Commutative
Scalable Speedup!
[Figure: speedup, axis 0 to 50, against core counts 1, 8, 16, 24, 32, 40, 48, 56, 64; series: Parallel HMMER V2, HMMER with Commutative]
![Page 31: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/31.jpg)
![Page 32: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/32.jpg)
Performance relative to Best Sequential
128 Cores in 32 Nodes with Intel Xeon Processors [MICRO 2010]
![Page 33: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/33.jpg)
Restoration of Trend
![Page 34: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/34.jpg)
“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
Compiler Technology
Architecture/Devices
Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters
Compiler technology inspired class of architectures?
![Page 35: A Roadmap to Restoring Computing's Former Glory](https://reader035.vdocuments.us/reader035/viewer/2022081603/568143bc550346895db04953/html5/thumbnails/35.jpg)
The End