TRANSCRIPT
Slide-1
Theory of Multicore Algorithms
Jeremy Kepner and Nadya Bliss
MIT Lincoln Laboratory
HPEC 2008
This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Slide-2
Outline
• Parallel Design
  – Programming Challenge
  – Example Issues
  – Theoretical Approach
  – Integrated Process
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Slide-3
Multicore Programming Challenge
• Great success of the Moore's Law era
  – Simple model: load, op, store
  – Many transistors devoted to delivering this model
• Moore's Law is ending
  – Need transistors for performance
Past Programming Model: Von Neumann
Future Programming Model: ???
Can we describe how an algorithm is supposed to behave on a hierarchical heterogeneous multicore processor?
• Processor topology includes:
  – Registers, cache, local memory, remote memory, disk
• Cell has multiple programming models
Slide-4
Example Issues
Where is the data? How is it distributed?
Where is it running?
Y = X + 1
X,Y : NxN
• A serial algorithm can run on a serial processor with relatively little specification
• A hierarchical heterogeneous multicore algorithm requires a lot more information
Which binary to run?
Initialization policy?
How does the data flow?
What are the allowed message sizes?
Overlap computations and communications?
Slide-5
Theoretical Approach
[Figure: Task1 (scope S1()) holds A : R^NxP(N) across processors P0-P3, each paired with a local memory M0; Task2 (scope S2()) holds B : R^NxP(N) across processors P4-P7. A conduit carries data from A to B via Topic12, and a replicated task (Replica 0 and Replica 1) is also shown.]
• Provide notation and diagrams that allow hierarchical heterogeneous multicore algorithms to be specified
Slide-6
Integrated Development Process
1. Develop serial code (Desktop): Y = X + 1; X,Y : NxN
2. Parallelize code (Cluster): Y = X + 1; X,Y : P(N)xN
3. Deploy code (Embedded Computer): Y = X + 1; X,Y : P(P(N))xN
4. Automatically parallelize code
• Should naturally support standard parallel embedded software development practices
Slide-7
Outline
• Parallel Design
• Distributed Arrays
  – Serial Program
  – Parallel Execution
  – Distributed Arrays
  – Redistribution
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Slide-8
Serial Program
• Math is the language of algorithms
• Allows mathematical expressions to be written concisely
• Multi-dimensional arrays are fundamental to mathematics
Y = X + 1
X,Y : NxN
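As a point of reference for the parallel versions that follow, here is the serial program written out in Python/NumPy; the slides use math notation rather than any particular language, so the language choice and the size N are illustrative.

```python
# Serial version of Y = X + 1 with X,Y : NxN (illustrative Python/NumPy rendering).
import numpy as np

N = 8                 # illustrative size
X = np.zeros((N, N))  # X : NxN
Y = X + 1             # Y : NxN
```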
Slide-9
Parallel Execution
• Run NP copies of the same program
  – Single Program Multiple Data (SPMD)
• Each copy has a unique PID
• Every array is replicated on each copy of the program
Y = X + 1
X,Y : NxN
PID = 0, 1, ..., NP-1
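A minimal SPMD sketch of this slide, assuming MPI (via mpi4py) as the runtime; the slides themselves are library-agnostic, so the module, the size N, and the launch command are illustrative.

```python
# Every rank runs this same script, e.g.: mpirun -n NP python spmd.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
PID = comm.Get_rank()   # unique program ID, 0 .. NP-1
NP = comm.Get_size()    # number of program copies

N = 8
X = np.zeros((N, N))    # with no map, every copy holds the full replicated array
Y = X + 1               # every copy redundantly performs the same computation
```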
Slide-10
Distributed Array Program
• Use P() notation to make a distributed array
• Tells the program which dimension to distribute data over
• Each program implicitly operates on only its own data (owner computes rule; see the sketch below)
Y = X + 1
X,Y : P(N)xN
PID = 0, 1, ..., NP-1
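A sketch of what X,Y : P(N)xN means operationally, assuming a block map over the first dimension and mpi4py/NumPy as a stand-in for the distributed-array library the slides describe.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
PID, NP = comm.Get_rank(), comm.Get_size()

N = 8
my_rows = np.array_split(np.arange(N), NP)[PID]  # this rank's block of global row indices
X_loc = np.zeros((len(my_rows), N))              # only the local piece is allocated
Y_loc = X_loc + 1                                # owner computes: each rank updates only its rows
```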
Slide-11
Explicitly Local Program
• Use .loc notation to explicitly retrieve local part of a distributed array
• The operation is the same as in the serial program, but with different data on each processor (recommended approach; see the sketch below)
Y.loc = X.loc + 1
X,Y : P(N)xN
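A toy illustration of the .loc idea; the DistributedArray class below is hypothetical (it is not the slides' library) and simply exposes the locally owned block of rows as .loc.

```python
import numpy as np

class DistributedArray:                      # hypothetical helper, for illustration only
    def __init__(self, N, PID, NP):
        self.rows = np.array_split(np.arange(N), NP)[PID]  # P(N): block map on dim 0
        self.loc = np.zeros((len(self.rows), N))           # local piece only

PID, NP, N = 0, 4, 8                 # PID/NP would come from the runtime in real SPMD code
X = DistributedArray(N, PID, NP)
Y = DistributedArray(N, PID, NP)
Y.loc[:] = X.loc + 1                 # explicitly local: the statement mirrors the serial code
```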
Slide-12
Parallel Data Maps
• A map is a mapping of array indices to processors
• Can be block, cyclic, block-cyclic, or block with overlap (see the sketch below)
• Use P() notation to set which dimension to split among processors
[Figure: three example maps onto four processors (PID 0-3): P(N)xN splits the rows, NxP(N) splits the columns, and P(N)xP(N) splits both dimensions; the math notation corresponds directly to the array layout on the computer.]
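A small sketch of the index-to-processor mapping behind these maps; block and cyclic are shown, and the sizes are chosen purely for illustration.

```python
import numpy as np

def block_map(N, NP):
    # contiguous chunks: PID p owns roughly N/NP consecutive indices
    chunks = np.array_split(np.arange(N), NP)
    return np.repeat(np.arange(NP), [len(c) for c in chunks])

def cyclic_map(N, NP):
    # round-robin: index i belongs to PID i mod NP
    return np.arange(N) % NP

print(block_map(8, 4))   # [0 0 1 1 2 2 3 3]
print(cyclic_map(8, 4))  # [0 1 2 3 0 1 2 3]
```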
Slide-13
Redistribution of Data
• Different distributed arrays can have different maps
• Assignment between arrays with the "=" operator causes data to be redistributed
• The underlying library determines all the messages to send (see the sketch below)
Y = X + 1
Y : NxP(N)    X : P(N)xN
[Figure: X is split by rows across P0-P3 while Y is split by columns across P0-P3; the overlap of the two maps determines the data each processor must send.]
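A sketch of the bookkeeping the underlying library performs for this example: with X row-distributed and Y column-distributed, source PID p must send the intersection of its rows with destination PID q's columns. The sizes are illustrative.

```python
import numpy as np

N, NP = 8, 4
x_rows = np.array_split(np.arange(N), NP)   # row blocks of X (P(N)xN), per PID
y_cols = np.array_split(np.arange(N), NP)   # column blocks of Y (NxP(N)), per PID

# msg[p][q] = number of elements PID p sends to PID q during Y = X + 1
msg = np.array([[len(x_rows[p]) * len(y_cols[q]) for q in range(NP)]
                for p in range(NP)])
print(msg)   # here every one of the 4x4 messages carries 2*2 = 4 elements
```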
Slide-14
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
  – Serial
  – Parallel
  – Hierarchical
  – Cell
• Hierarchical Arrays
• Tasks and Conduits
• Summary
Slide-15
Single Processor Kuck Diagram
[Figure: processor P0 (box) connected to memory M0 (oval), which holds the array A : R^NxN.]
• Processors denoted by boxes
• Memory denoted by ovals
• Lines connect associated processors and memories
• Subscript denotes level in the memory hierarchy
Slide-16
Parallel Kuck Diagram
[Figure: four P0/M0 processor-memory pairs connected by the network Net0.5; the distributed array A : R^NxP(N) has a local piece in each M0.]
• Replicates serial processors
• Net denotes the network connecting memories at a level in the hierarchy (incremented by 0.5)
• Distributed array has a local piece on each memory
Slide-17
Hierarchical Kuck Diagram
The Kuck notation provides a clear way of describing a hardware architecture along with the memory and communication hierarchy
[Figure: two-level hierarchical Kuck diagram. Each of two groups contains two P0/M0 pairs joined by Net0.5, a shared memory SM1, and a shared memory network SMNet1; the groups are joined at the top level by Net1.5, SM2, and SMNet2.]
Legend:
• P - processor
• Net - inter-processor network
• M - memory
• SM - shared memory
• SMNet - shared memory network
2-LEVEL HIERARCHY
Subscript indicates hierarchy level
x.5 subscript for Net indicates indirect memory access
*High Performance Computing: Challenges for Future Systems, David Kuck, 1996
Slide-18
Cell Example
[Figure: Kuck diagram of the Cell processor. The PPE (a P0/M0 pair) and eight SPEs (P0/M0 pairs numbered 0-7) are joined by Net0.5 and connect through MNet1 to the main memory M1.]
P_PPE = PPE speed (GFLOPS)
M_0,PPE = size of PPE cache (bytes)
P_PPE - M_0,PPE = PPE to cache bandwidth (GB/sec)
P_SPE = SPE speed (GFLOPS)
M_0,SPE = size of SPE local store (bytes)
P_SPE - M_0,SPE = SPE to local store bandwidth (GB/sec)
Net0.5 = SPE to SPE bandwidth (matrix encoding topology, GB/sec)
MNet1 = PPE, SPE to main memory bandwidth (GB/sec)
M1 = size of main memory (bytes)
Kuck diagram for the Sony/Toshiba/IBM processor
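One way to make such a parameterization concrete is to record the Kuck quantities in a simple machine description, sketched below; the field names and numeric values are placeholders for illustration, not measured Cell specifications.

```python
# Illustrative machine description keyed by the Kuck parameters listed above.
# All numbers are placeholders, not actual Cell specifications.
cell_like_machine = {
    "P_PPE_gflops": 25.0,            # PPE speed
    "M0_PPE_bytes": 512 * 1024,      # PPE cache size
    "P_SPE_gflops": 25.0,            # per-SPE speed
    "M0_SPE_bytes": 256 * 1024,      # SPE local store size
    "num_SPE": 8,
    "Net0_5_GBps": 25.0,             # SPE-to-SPE bandwidth (could be a matrix encoding topology)
    "MNet1_GBps": 25.0,              # PPE/SPE to main memory bandwidth
    "M1_bytes": 512 * 1024 * 1024,   # main memory size
}
```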
Slide-19
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
  – Hierarchical Arrays
  – Hierarchical Maps
  – Kuck Diagram
  – Explicitly Local Program
• Tasks and Conduits
• Summary
Slide-20
Hierarchical Arrays
[Figure: a global array split across top-level PIDs 0-3 into local arrays; each local array is further split across local PIDs 0-1.]
• Hierarchical arrays allow algorithms to conform to hierarchical multicore processors
• Each top-level processor controls another set of local processors
• The array piece local to each top-level processor is sub-divided among its local processors
Slide-21
Hierarchical Array and Kuck Diagram
[Figure: the two-level Kuck diagram (P0/M0 pairs joined by net0.5, with SM1 and SMnet1 in each group, and the groups joined by net1.5) overlaid with the hierarchical array A : R^NxP(P(N)); A.loc resides in each group's SM1 and A.loc.loc in each processor's M0.]
• The array is allocated across the SM1 memories of the top-level processors
• Within each SM1, responsibility for processing is divided among the local processors
• Each local processor moves its portion into its local M0
Slide-22
Explicitly Local Hierarchical Program
• Extend the .loc notation to .loc.loc to explicitly retrieve the local part of a local distributed array (assumes SPMD on P)
• Subscripts on .loc provide explicit access to a particular local piece (implicit otherwise); a sketch follows below
Y.loc_p.loc_p = X.loc_p.loc_p + 1
X,Y : P(P(N))xN
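A plain-NumPy sketch of the two-level decomposition behind .loc.loc: an outer block split across top-level processors and an inner split across local processors. The processor counts, the chosen PIDs, and N are illustrative; in real SPMD code they would come from the runtime.

```python
import numpy as np

N = 16
NP_outer, NP_inner = 2, 2            # top-level and local processor counts (illustrative)
pid_outer, pid_inner = 0, 1          # would be supplied by the runtime in real SPMD code

X = np.zeros((N, N))
outer_rows = np.array_split(np.arange(N), NP_outer)[pid_outer]   # rows of X.loc
inner_rows = np.array_split(outer_rows, NP_inner)[pid_inner]     # rows of X.loc.loc

X_loc_loc = X[inner_rows, :]         # piece owned by this (outer, inner) processor pair
Y_loc_loc = X_loc_loc + 1            # the statement itself still mirrors the serial program
```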
Slide-23
Block Hierarchical Arrays
[Figure: a global array split across top-level PIDs 0-3; each local array is further split across local PIDs and divided into core blocks (b = 4), which may be processed in-core or out-of-core.]
• Memory constraints are common at the lowest level of the hierarchy
• Blocking at this level allows control of the size of the data operated on by each local processor
Slide-24
Block Hierarchical Program
• Pb(4) indicates each sub-array should be broken up into blocks of size 4
• .nblk provides the number of blocks, for looping over each block; this allows controlling the size of the data at the lowest level (see the sketch below)
for i = 0, X.loc.loc.nblk-1
  Y.loc.loc.blk_i = X.loc.loc.blk_i + 1
X,Y : P(Pb(4)(N))xN
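The blocked loop rendered in NumPy, assuming a block size of b = 4 rows over this processor's local piece; the array shapes are illustrative.

```python
import numpy as np

N, b = 16, 4
X_loc_loc = np.zeros((8, N))            # rows owned by this (outer, inner) processor pair
Y_loc_loc = np.empty_like(X_loc_loc)

nblk = X_loc_loc.shape[0] // b          # analogue of X.loc.loc.nblk
for i in range(nblk):                   # keep only one core block in fast memory at a time
    blk = slice(i * b, (i + 1) * b)
    Y_loc_loc[blk, :] = X_loc_loc[blk, :] + 1
```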
Slide-25
Outline
• Parallel Design
• Distributed Arrays
• Kuck Diagrams
• Hierarchical Arrays
• Tasks and Conduits
  – Basic Pipeline
  – Replicated Tasks
  – Replicated Pipelines
• Summary
Slide-26
Tasks and Conduits
[Figure: Task1 (scope S1()) holds A : R^NxP(N) on processors P0-P3, each with a memory M0; Task2 (scope S2()) holds B : R^NxP(N) on processors P4-P7; a conduit carries data from A to B via Topic12.]
• The S1() superscript runs the task on a set of processors; distributed arrays are allocated relative to this scope
• Pub/sub conduits move data between tasks (see the sketch below)
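A toy rendering of the pub/sub conduit idea, with Python threads and a queue standing in for the two SPMD task groups and the conduit library; the topic and task names follow the slide, but everything else is illustrative.

```python
import queue
import threading
import numpy as np

topic12 = queue.Queue()                  # the conduit: Task1 publishes, Task2 subscribes

def task1(n_msgs, N=4):
    for _ in range(n_msgs):
        A = np.random.rand(N, N)         # Task1's array (a plain local array in this toy)
        topic12.put(A)                   # publish A to Topic12

def task2(n_msgs):
    for _ in range(n_msgs):
        B = topic12.get()                # subscribe: receive the next A into B
        print("Task2 received a block with sum", B.sum())

t1 = threading.Thread(target=task1, args=(3,))
t2 = threading.Thread(target=task2, args=(3,))
t1.start(); t2.start()
t1.join(); t2.join()
```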
Slide-27
Replicated Tasks
[Figure: the same Task1/Task2 pipeline, but the receiving task carries a subscript 2, creating two replicas (Replica 0 and Replica 1) fed by the Topic12 conduit.]
• A subscript of 2 creates two task replicas; the conduit will round-robin data between them
Slide-28
Replicated Pipelines
[Figure: both tasks carry the subscript 2, so the entire pipeline is replicated: Replica 0 and Replica 1 each consist of a Task1 instance feeding a Task2 instance through its own conduit.]
• Identical subscripts of 2 on both tasks create a replicated pipeline
Slide-29
Summary
• Hierarchical heterogeneous multicore processors are difficult to program
• Specifying how an algorithm is supposed to behave on such a processor is critical
• Proposed notation provides mathematical constructs for concisely describing hierarchical heterogeneous multicore algorithms