Multicores from the Compiler's Perspective
A Blessing or A Curse?
Saman Amarasinghe
Associate Professor, Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science
Computer Science and Artificial Intelligence Laboratory
CTO, Determina Inc.
AMD OpteronDual Core
Intel Montecito1.7 Billion transistors
Dual Core IA/64
Intel TanglewoodDual Core IA/64
Intel Pentium Extreme3.2GHz Dual Core
Intel Tejas & JayhawkUnicore (4GHz P4)
Intel DempseyDual Core Xeon
Intel Pentium D(Smithfield)
Cancelled
Intel YonahDual Core Mobile
IBM Power 6Dual Core
IBM Power 4 and 5Dual Cores Since 2001
IBM Cell Scalable Multicore
Sun Olympus and Niagara8 Processor Cores
MIT Raw 16 Cores
Since 2002
… 1H 2005 1H 2006 2H 20062H 20052H 2004
Multicores are coming!
What is Multicore?
Multiple, externally visible processors on a single die where the processors have independent control-flow, separate internal state and no critical resource sharing.
Multicores have many names… Chip Multiprocessor (CMP) Tiled Processor ….
Why move to Multicores?
Many issues with scaling a unicore Power Efficiency Complexity Wire Delay Diminishing returns from optimizing
a single instruction stream
Moore’s Law
20051985 199019801970 1975 1995 2000
40048008
8086
8080
286
386
486
PentiumP2
P3P4
ItaniumItanium 2
transistors
1,000,000,000
100,000
10,000
1,000
1,000,000
10,000,000
100,000,000
4004 (12mm2 / 8)
Pentium 4 (217mm2 / .18)
Bus Control
L2 Data Cache
L2 Data CacheOp Scheduling
Ren
amin
g
Trace Cache
Decode Fetch
Inte
ger
Cor
e
FPUMMX/SSE
Moore’s Law: Transistors Well Spent?
Itanium 2 (421mm2 / .18)
Integer Core
Floating Point
Outline
Introduction
Overview of Multicores
Success Criteria for a Compiler
Data Level Parallelism
Instruction Level Parallelism
Language Exposed Parallelism
Conclusion
Impact of Multicores How does going from Multiprocessors to
Multicores impact programs?
What changed?
Where is the Impact? Communication Bandwidth Communication Latency
Communication Bandwidth How much data can be communicated
between two cores?
What changed? Number of Wires
IO is the true bottleneck On-chip wire density is very high
Clock rate IO is slower than on-chip
Multiplexing No sharing of pins
Impact on programming model? Massive data exchange is possible Data movement is not the bottleneck
locality is not that important
32 Giga bits/sec ~300 Tera bits/sec
10,000X
Communication Latency How long does it take for a round
trip communication?
What changed? Length of wire
Very short wires are faster Pipeline stages
No multiplexing On-chip is much closer
Impact on programming model? Ultra-fast synchronization Can run real-time apps
on multiple cores
50X
~200 Cycles ~4 cycles
Past, Present and the Future?
PE PE
$$ $$
Memory
PE PE
$$
Memory
$$
PE
$$ X
PE
$$ X
PE
$$ X
PE
$$ X
MemoryMemory
Mem
ory
Mem
ory
Basic Multicore IBM Power5
Traditional Multiprocessor
Integrated Multicore16 Tile MIT Raw
Outline
Introduction
Overview of Multicores
Success Criteria for a Compiler
Data Level Parallelism
Instruction Level Parallelism
Language Exposed Parallelism
Conclusion
When is a compiler successful as a general purpose tool?
General Purpose Programs compiled with the compiler
are in daily use by non-expert users Used by many programmers Used in open source and commercial
settings
Research / niche You know the names of all the users
Success Criteria
1. Effective
2. Stable
3. General
4. Scalable
5. Simple
1: Effective
Good performance improvements on most programs
The speedup graph goes here!
2: Stable
Simple change in the program should not drastically change the performance! Otherwise need to understand the compiler
inside-out Programmers want to treat the compiler as a
black box
3: General Support the diversity of programs
Support Real Languages: C, C++, (Java) Handle rich control and data structures Tolerate aliasing of pointers
Support Real Environments Separate compilation Statically and dynamically linked libraries
Work beyond an ideal laboratory setting
4: Scalable Real applications are large!
Algorithm should scale polynomial or exponential in the program size doesn’t work
Real Programs are Dynamic Dynamically loaded libraries Dynamically generated code
Whole program analysis tractable?
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
Lin
es
of
Co
de
SPEC2000, Mediabench Benchmarks
Linux components
Windows OS
5: Simple Aggressive analysis and complex transformation lead to:
Buggy compilers! Programmers want to trust their compiler! How do you manage a software project when the compiler is broken?
Long time to develop
Simple compiler fast compile-times Current compilers are too complex!
Compiler Lines of CodeGNU GCC ~ 1.2 million
SUIF ~ 250,000
Open Research Compiler ~3.5 million
Trimaran ~ 800,000
StreamIt ~ 300,000
Outline
Introduction
Overview of Multicores
Success Criteria for a Compiler
Data Level Parallelism
Instruction Level Parallelism
Language Exposed Parallelism
Conclusion
Data Level Parallelism Identify loops where each
iteration can run in parallel DOALL parallelism
What affects performance? Parallelism Coverage Granularity of Parallelism Data Locality
TDT = DT MP1 = M+1 NP1 = N+1 EL = N*DX PI = 4.D0*ATAN(1.D0) TPI = PI+PI DI = TPI/M DJ = TPI/N PCF = PI*PI*A*A/(EL*EL)
DO 50 J=1,NP1 DO 50 I=1,MP1 PSI(I,J) = A*SIN(( I-.5D0)*DI)* SIN((J-.5D0)*DJ) P(I,J) = PCF*(COS(2.D0) CONTINUE
DO 60 J=1,N DO 60 I=1,M U(I+1,J) = -(PSI(I+1,J+1) -PSI(I+1,J))/DY V(I,J+1) = (PSI(I+1,J+1)-
PSI(I,J+1))/DX CONTINUE
processors
TIM
E
Parallelism Coverage Amdahl’s Law
Performance improvement to be gained from faster mode of execution is limited by the fraction of the time
the faster mode can be used
Find more parallelism Interprocedural analysis Alias analysis Data-flow analysis ……
processors
Moreprocessors
SUIF Parallelizer Results
0
20
40
60
80
100
1 2 3 4 5 6 7 8Par
alle
lism
Cov
erag
e
S p e e d u p
SPEC95fp, SPEC92fp, Nas, Perfect Benchmark SuitesOn a 8 processor Silicon Graphics Challenge (200MHz MIPS R4000)
Granularity of Parallelism Synchronization is expensive
Need to find very large parallel regions coarse-grain loop nests
Heroic analysis required
TDT = DT MP1 = M+1 NP1 = N+1 EL = N*DX PI = 4.D0*ATAN(1.D0) TPI = PI+PI DI = TPI/M DJ = TPI/N PCF = PI*PI*A*A/(EL*EL)
DO 50 J=1,NP1 DO 50 I=1,MP1 PSI(I,J) = A*SIN(( I-.5D0)*DI)* SIN((J-.5D0)*DJ) P(I,J) = PCF*(COS(2.D0) CONTINUE
DO 60 J=1,N DO 60 I=1,M U(I+1,J) = -(PSI(I+1,J+1) -PSI(I+1,J))/DY V(I,J+1) = (PSI(I+1,J+1)-
PSI(I,J+1))/DX CONTINUE
processors
TIM
E
Granularity of Parallelism Synchronization is expensive
Need to find very large parallel regions coarse-grain loop nests
Heroic analysis required
Single unanalyzable line
turb3d in SPEC95fp
Granularity of Parallelism Synchronization is expensive
Need to find very large parallel regions coarse-grain loop nests
Heroic analysis required
Single unanalyzable line Small Reduction in Coverage Drastic Reduction in
Granularity
turb3d in SPEC95fp
SUIF Parallelizer Results
1
10
100
1000
10000
0
20
40
60
80
100
1 2 3 4 5 6 7 8Par
alle
lism
Cov
erag
eG
ranu
larit
y of
Par
alle
lism S p e e d u p
SUIF Parallelizer Results
1
10
100
1000
10000
0
20
40
60
80
100
1 2 3 4 5 6 7 8Par
alle
lism
Cov
erag
eG
ranu
larit
y of
Par
alle
lism S p e e d u p
Data Locality Non-local data
Stalls due to latency Serialize when lack of
bandwidth
Data Transformations Global impact Whole program analysis
A[0]
A[4]
A[8]
A[12]
A[1]
A[5]
A[9]
A[13]
A[2]
A[6]
A[10]
A[14]
A[3]
A[7]
A[11]
A[15]
DLP on Multiprocessors:Current State
Huge body of work over the years. Vectorization in the ’80s High Performance Computing in ’90s
Commercial DLP compilers exist But…only a very small user community
Can multicores make DLP mainstream?
?
Effectiveness
Main Issue Parallelism Coverage
Compiling to Multiprocessors Amdahl’s law
Many programs have no loop-level parallelism
Compiling to Multicores Nothing much has changed
Stability Main Issue
Granularity of Parallelism
Compiling for Multiprocessors Unpredictable, drastic granularity changes reduce the
stability
Compiling for Multicores Low latency granularity is less important
Generality Main Issue
Changes in general purpose programming styles over time impacts compilation
Compiling for Multiprocessors (In the good old days) Mainly FORTRAN
Loop nests and Arrays
Compiling for Multicores Modern languages/programs are hard to analyze
Aliasing (C, C++ and Java) Complex structures (lists, sets, trees) Complex control (concurrency, recursion) Dynamic (DLLs, Dynamically generated code)
Scalability Main Issue
Whole program analysis and global transformations don’t scale
Compiling for Multiprocessors Interprocedural analysis needed to improve granularity Most data transformations have global impact
Compiling for Multicores High bandwidth and low latency no data transformations Low latency granularity improvements not important
Simplicity Main Issue
Parallelizing compilers are exceedingly complex
Compiling for Multiprocessors Heroic interprocedural analysis and global transformations
are required because of high latency and low bandwidth
Compiling for Multicores Hardware is a lot more forgiving… But…modern languages and programs make life difficult
Outline
Introduction
Overview of Multicores
Success Criteria for a Compiler
Data Level Parallelism
Instruction Level Parallelism
Language Exposed Parallelism
Conclusion
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.o*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
Instruction Level parallelism on a Unicore
tmp0 = (seed*3+2)/2tmp1 = seed*v1+2tmp2 = seed*v2 + 2tmp3 = (seed*6+2)/3v2 = (tmp1 - tmp3)*5v1 = (tmp1 + tmp2)*3v0 = tmp0 - v1v3 = tmp3 - v2
Programs have ILP Modern processors extract the ILP
Superscalars Hardware VLIW Compiler
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALU
ALUBypass
NetworkR
egis
ter
File
Scalar Operand Network (SON)
• Moves results of an operation to dependent instructions
• Superscalars in Hardware
• What makes a good SON?
pval5=seed.0*6.0
seed.0=seed
Scalar Operand Network (SON)
• Moves results of an operation to dependent instructions
• Superscalars in Hardware
• What makes a good SON?• Low latency from producer to
consumer
pval5=seed.0*6.0
seed.0=seed
pval5=seed.0*6.0
seed.0=seed
Scalar Operand Network (SON)
• Moves results of an operation to dependent instructions
• Superscalars in Hardware
• What makes a good SON?• Low latency from producer to
consumer
• Low occupancy at the producer and consumer
pval5=seed.0*6.0
seed.0=seed
Test lockBranchTest lockBranchTest lockBranchTest lock BranchRead memorypval5=seed.0*6.0
seed.0=seedlockWrite memunlock
Scalar Operand Network (SON)
• Moves results of an operation to dependent instructions
• Superscalars in Hardware
• What makes a good SON?• Low latency from producer to
consumer
• Low occupancy at the producer and consumer
• High bandwidth for multiple operations
pval5=seed.0*6.0
seed.0=seed
pval5=seed.0*6.0
seed.0=seed
v2.4=v2
pval3=seed.o*v2.4
tmp2.5=pval3+2.0
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3pval7=tmp1.3+tmp2.5
Is an Integrated Multcore Reedy to be a Scalar Operand Network?
Basic Multicore
Traditional Multiprocessor
IntegratedMulticore
VLIWUnicore
Latency
(cycles)60 4 3 0
Occupancy
(instructions)50 10 0 0
Bandwidth(operands/cycle)
1 2 16 6
Scalable Scalar Operand Network?
Unicores N2 connectivity Need to cluster
introduces latency
Integrated Multicores No bottlenecks in scaling
IntegratedMulticore
Unicore
Compiler Support for Instruction Level Parallelism
Accepted general purpose technique Enhance the
performance of superscalars
Essential for VLIW
Instruction Scheduling List scheduling or
Software pipelining
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.o*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8v0.9=tmp0.1-v1.8
v0=v0.9
seed.0=recv()pval5=seed.0*6.0
pval4=pval5+2.0tmp3.6=pval4/3.0
tmp3=tmp3.6
v2.7=recv()v3.10=tmp3.6-v2.7
v3=v3.10
seed.0=recv()pval2=seed.0*v1.2
tmp1.3=pval2+2.0
send(tmp1.3)tmp1=tmp1.3tmp2.5=recv()pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
tmp0.1=recv()
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
tmp0.1=recv()
ILP on Integrated Multicores:Space-Time Instruction Scheduling
seed.0=recv()pval5=seed.0*6.0
pval4=pval5+2.0tmp3.6=pval4/3.0
tmp3=tmp3.6
v2.7=recv()v3.10=tmp3.6-v2.7
v3=v3.10
route(W,t)
route(W,S)
route(S,t)
send(seed.0)pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
send(tmp0.1)tmp0=tmp0.1
route(t,E)
route(t,E)
v2.4=v2
seed.0=recv(0)pval3=seed.o*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5send(tmp2.5)
tmp1.3=recv()pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
Send(v2.7)v2=v2.7
route(N,t)
route(t,E)
route(E,t)
route(t,E)
v1.2=v1
seed.0=recv()pval2=seed.0*v1.2
tmp1.3=pval2+2.0
send(tmp1.3)tmp1=tmp1.3
tmp2.5=recv()pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
route(N,t)
route(N,t)
route(W,N)
seed.0=seed
route(W,S)
route(W,S)
tmp0.1=recv()
route(t,W)
route(W,t)
route(W,N)
route(t,E)
Partition, placement, route and schedule
Similar to Clustered VLIW
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.o*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8v0.9=tmp0.1-v1.8
v0=v0.9
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.o*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8v0.9=tmp0.1-v1.8
v0=v0.9
Handling Control Flow
Asynchronous global branching Propagate the branch condition to all the tiles
as part of the basic block schedule When finished with the basic block execution
asynchronously switch to another basic block schedule depending on the branch condition
• • •
• • •
br x
x = cmp a, b
• • •
• • •
br x • • •
br x
• • •
• • •
br x
Raw Performance
0
4
8
12
16
20
24
28
32
Spe
edup
• 32 tile Raw
Dense Matrix Multimedia Irregular
Success Criteria
1. Effective If ILP exists same
2. Stable Localized optimization similar
3. General Applies to same type of applications
4. Scalable Local analysis similar
5. Simple Deeper analysis and more transformations
Outline
Introduction
Overview of Multicores
Success Criteria for a Compiler
Data Level Parallelism
Instruction Level Parallelism
Language Exposed Parallelism
Conclusion
Languages are out-of-touch with Architecture
Modernarchitecture
• Two choices:
• Develop cool architecture with complicated, ad-hoc language
• Bend over backwards to supportold languages like C/C++
C von-Neumannmachine
Supporting von Neumann Languages Why C (FORTRAN, C++ etc.) became very successful?
Abstracted out the differences of von Neumann machines Register set structure Functional units and capabilities Pipeline depth/width Memory/cache organization
Directly expose the common properties Single memory image Single control-flow A clear notion of time
Can have a very efficient mapping to a von Neumann machine “C is the portable machine language for von Numann machines”
Today von Neumann languages are a curse We have squeezed out all the performance out of C We can build more powerful machines But, cannot map C into next generation machines Need better languages with more information for optimization
New Languages for Cool Architectures
Processor specific languages Not portable
Increase the burden on programmers Many more tasks for the programmer (parallelism
annotations, memory alias annotations) But, no software engineering benefits
Assembly hacker mentality Worked so hard on putting architectural features Don’t want compilers to squander it away Proof-of-concept done in assembly
Architects don’t know how to design languages
What Motivates Language Designers Primary Motivation Programmer Productivity
Raising the abstraction layer Increasing the expressiveness Facilitating design, development, debugging, maintenance of
large complex applications
Design Considerations Abstraction Reduce the work programmers have to do Malleablility Reduce the interdependencies Safety Use types to prevent runtime errors Portability Architecture/system independent
No consideration given for the architecture For them, performance is a non-issue!
Is There a Win-Win Solution
Languages that increase programmer productivity while making it easier to compile
Example: StreamIt, A spatially-aware Language
A language for streaming applications Provides high-level stream abstraction
Exposes Pipeline Parallelism Improves programmer productivity
Breaks the von Neumann language barrier Each filter has its own control-flow Each filter has its own address space No global time Explicit data movement between filters Compiler is free to reorganize the computation
Example: Radar Array Front EndSplitter
FIRFilter FIRFilterFIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter
FIRFilter FIRFilterFIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter FIRFilter
Joiner
Splitter
Detector
Magnitude
FirFilter
Vector Mult
Detector
Magnitude
FirFilter
Vector Mult
Detector
Magnitude
FirFilter
Vector Mult
Detector
Magnitude
FirFilter
Vector Mult
Joiner
Radar Array Front End on Raw
Executing InstructionsBlocked on Network
Pipeline Stall
Bridging the Abstraction layers
StreamIt language exposes the data movement Graph structure is architecture independent
Each architecture is different in granularity and topology Communication is exposed to the compiler
The compiler needs to efficiently bridge the abstraction Map the computation and communication pattern of the program
to the tiles, memory and the communication substrate
Bridging the Abstraction layers
StreamIt language exposes the data movement Graph structure is architecture independent
Each architecture is different in granularity and topology Communication is exposed to the compiler
The compiler needs to efficiently bridge the abstraction Map the computation and communication pattern of the program
to the tiles, memory and the communication substrate The StreamIt Compiler
Partitioning Placement Scheduling Code generation
Optimized Performance for Radar Array Front End on Raw
Executing InstructionsBlocked on Network
Pipeline Stall
Performance
240
19
640
1,430
0
200
400
600
800
1,000
1,200
1,400
1,600
C program C program UnoptimizedStreamIt
Optimized StreamIt
1 GHz Pentium III 420 MHz single tileRaw
420 MHz 64 tileRaw
420 MHz 16 tileRaw
MF
LO
PS
Success Criteria
1. Effective Information available for more optimizations
2. Stable Much more analyzable
3. General Domain-Specific
4. Scalable No global data structures
5. Simple Heroic analysis vs. more transformations
Von Neumann Languages
StreamLanguage
Compiler for:
Outline
Introduction
Overview of Multicores
Success Criteria for a Compiler
Data Level Parallelism
Instruction Level Parallelism
Language Exposed Parallelism
Conclusion
Overview of Success Criteria
1. Effective
2. Stable
3. General
4. Scalable
5. Simple
Von Neumann Languages
StreamLanguage
Can Compilers take on Multicores? Success Criteria is Somewhat Mixed But….
Don’t need to compete with unicores Multicores will be available regardless
New Opportunities Architectural advances in integrated multicores Domain specific languages Possible compiler support for using multicores for other
than parallelism Security Enforcement Program Introspection ISA extensions
http://cag.csail.mit.edu/commithttp://www.determina.com