liszt los alamos national laboratory aug 2011
DESCRIPTION
A Domain Specific Language for Building Portable Mesh-based PDE SolversTRANSCRIPT
![Page 1: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/1.jpg)
Liszt ���A Domain Specific Language for Building ���
Portable Mesh-based PDE Solvers���
Z DeVito, N Joubert, F Palacios, S Oakley, M Medina, M Barrientos, E Elsen, ���F Ham, A Aiken, K Duraisamy, E Darve, J Alonso, P Hanrahan���
LANL Visit, August 2011. Niels Joubert
![Page 2: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/2.jpg)
Liszt @ LANL ���Niels Joubert, Zach DeVito���
Crystal Lemire, Pat Hanrahan���
LANL Visit, August 2011. Niels Joubert
![Page 3: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/3.jpg)
See more of us this week:
Today: Niels (me) presenting Liszt overview/implementation
This afternoon: Pat talking about EDSLs and the Big Picture
Tuesday: We’re running a Hackathon!
Wednesday: Zach leading design and future directions discussion
![Page 4: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/4.jpg)
Today: Peel back the layers
![Page 5: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/5.jpg)
Confused?���Questions Encouraged!
![Page 6: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/6.jpg)
Science demands serious performance
![Page 7: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/7.jpg)
The Future is Now LANL IBM Roadrunner
Tianhe-1A
ORNL Titan
![Page 8: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/8.jpg)
Specialization leads to efficiency
Hybrid architectures for complex workloads
![Page 9: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/9.jpg)
Different Technologies at Different Scales
![Page 10: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/10.jpg)
How Do We Program These Machines?���
![Page 11: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/11.jpg)
The old way? ���
Hardware Abstraction?
![Page 12: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/12.jpg)
Domain Specific Languages and Compilers
![Page 13: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/13.jpg)
DSLs: Compromising generality
![Page 14: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/14.jpg)
Bigger picture, or Sneak Peek: My talk: Specialization
Pat and Zach: Playing nice
![Page 15: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/15.jpg)
I’m out of pictures…
![Page 16: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/16.jpg)
Compiling to parallel computers
Find Parallelism
Expose Data Locality
Reason about Synchronization
![Page 17: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/17.jpg)
Analyzing data dependency
“What data does this value depend on”
Find Parallelism
Independent data can be computed in parallel
Expose Data Locality
Partition based on dependency
Reason about Synchronization
Don’t compute until the dependent values are known
Can’t be done by compilers in general!
A[i] = B[f(i)] – must compute f(i) to find dependency
![Page 18: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/18.jpg)
Trade off generality���EDSLs solve the dependency problem
Liszt provides domain specific language features to solve the dependency problem:
Parallelism
Data Locality
Synchronization
For solving PDEs on meshes:
All data accesses can be framed in terms of the mesh
![Page 19: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/19.jpg)
Features of high performance PDE solvers
Find Parallelism Data-parallelism on mesh elements
Expose Data Locality
PDE Operators have local support
Stencil captures exact region of support
Reason about Synchronization
Iterative solvers
Read old values to calculate new values
![Page 20: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/20.jpg)
Liszt Language Features
Mesh Elements Vertex, Edge, Face, Cell
Sets
cells(mesh), edges(mesh), faces(mesh), …
Topological Relationships
head(edge), vertices(cell), …
Fields
val vert_position = position(v)
Parallelism
forall statements: for( f <- faces(cell) ) { … }
![Page 21: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/21.jpg)
Example: Heat Conduction on Grid val Position = FieldWithLabel[Vertex,Float3](“position”) val Temperature = FieldWithConst[Vertex,Float](0.0f) val Flux = FieldWithConst [Vertex,Float](0.0f) val JacobiStep = FieldWithConst[Vertex,Float](0.0f) var i = 0; while (i < 1000) { for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2) val step = 1.0f/(length(dP)) Flux(v1) += dT*step Flux(v2) -‐= dT*step JacobiStep(v1) += step JacobiStep(v2) += step } for (p <-‐ vertices(mesh)) { Temperature(p) += 0.01f*Flux(p)/JacobiStep(p) } for (p <-‐ vertices(mesh)) { Flux(p) = 0.f; JacobiStep(p) = 0.f; } i += 1 }
Mesh Elements
Topology Functions
Fields (Data storage)
Parallelizable for
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
![Page 22: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/22.jpg)
Language Features for Parallelism
Data-parallel for-comprehension
Calculations are independent
No assumptions about how it is parallelized
Freedom of underlying runtime implementation
for (e <-‐ edges(mesh)) { … }
![Page 23: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/23.jpg)
Language Features for Locality
Automatically infer stencil (pattern of memory accesses at element)
Restrictions:
Mesh elements only accessed through built-in topological functions; cells(mesh), …
Variable assignments to topological elements and fields are immutable; val v1 = head(e)
Data in Fields can only be accessed using mesh elements JacobiStep(v1)
No recursive functions
![Page 24: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/24.jpg)
Language Features for Synchronization
Phased usage of Fields Fields have field phase state
read-only, write-only, reduce-using-operator field(el) [op]= value
Fields cannot change phase within for-comprehension
Associative Operators
Allow single expensive calculation to write data to multiple elements
Provide atomic scatter operations to fields
e. g. field(el) += value
Introduce write dependencies between instances of for-comprehension
![Page 25: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/25.jpg)
Architecture
Runtime
Platform
Native Compiler
Liszt Compiler
Scala Compiler
.scala Scala frontend
platform-independent analysisLiszt plugin
runtime-specific code gen
MPI CUDA pthreads
Cluster SMP GPU.mesh
MPIapp
partitioning
pthreadsapp
coloring
CUDA appcoloring
mpicxx c++ nvcc
![Page 26: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/26.jpg)
Architecture
One compiler, one set of domain-specific analyses
Build AST, analyze code, extract data-dependencies
Multiple code-generators for the compiler
Generates code for each type of platform we support
Multiple run-times
Single shared base (mesh loading, partitioning, etc.)
Separate runtime API for each platform:
pthreads-specific
CUDA-specific
MPI-specific
![Page 27: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/27.jpg)
How do we infer data accesses from Liszt?
“Stencil” of a piece of code:
Captures just the memory accesses it performs
Infer stencil for each for-comprehension in Liszt
Targetting Many Architectures
![Page 28: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/28.jpg)
Domain Specific Transform:
Analyze code to detect memory access stencil of each top-level for-all comprehension
Extract nested mesh element reads
Extract field operations
Difficult with a traditional library
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2) val step = 1.0f/(length(dP)) Flux(v1) += dT*step Flux(v2) -‐= dT*step JacobiStep(v1) += step JacobiStep(v2) += step }
e in edges(mesh)
head(e) tail(e)
Write Flux, JacobiStep Write Flux, JacobiStepRead Position,Temperature Read Position, Temperature
vertices(mesh)
Read/Write Flux
Write TemperatureRead/Write JacobiStep
![Page 29: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/29.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2) val step = 1.0f/(length(dP)) Flux(v1) += dT*step Flux(v2) -‐= dT*step JacobiStep(v1) += step JacobiStep(v2) += step }
e in edges(mesh)
head(e) tail(e)
Write Flux, JacobiStep Write Flux, JacobiStepRead Position,Temperature Read Position, Temperature
vertices(mesh)
Read/Write Flux
Write TemperatureRead/Write JacobiStep
S(el, E) = (R,W )
(R,W )(el, E)
Expression
Environment mapping free variables to values
elE
Read/Write Sets
![Page 30: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/30.jpg)
Domain Specific Transform:
Problem: Don’t know the mesh at compile time!
Break up inference into:
Static part: Abstractly reason about operators.
generate c++ code that queries mesh to build the specific stencil for each for-comprehension
Dynamic part: Concretely build data structures by analyzing mesh
run c++ code on mesh
Anything that directly depends on mesh becomes generated code, executed at runtime.
This means: It is possible to write low-level c++ code that targets our back-end. But ugly and hard!
![Page 31: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/31.jpg)
Implementing stencil detection
Finding stencil for each for-comprehension means capturing data access pattern. This poses a set of problems:
Cannot guarantee termination
Can be very expensive
Must do before parallelization in single core
S(el, E) = (R,W )
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2) val step = 1.0f/(length(dP)) Flux(v1) += dT*step Flux(v2) -‐= dT*step JacobiStep(v1) += step JacobiStep(v2) += step }
![Page 32: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/32.jpg)
Implementing stencil detection
“Partial execution of a computer program to gain insight into it’s semantics”. Allows us to calculate an approximate stencil
Apply transformation to Liszt code
generate code with desirable properties (terminates, fast)
T
![Page 33: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/33.jpg)
Implementing stencil detection
“Partial execution of a computer program to gain insight into it’s semantics”. Allows us to calculate an approximate stencil
Apply transformation to Liszt code
generate code with desirable properties (terminates, fast)
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2) val step = 1.0f/(length(dP)) Flux(v1) += dT*step Flux(v2) -‐= dT*step JacobiStep(v1) += step JacobiStep(v2) += step }
S(el, E) ⊆ S̄(el, E) = (R,W )
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ }
T
T
![Page 34: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/34.jpg)
Implementing stencil detection
Defining
Conservatively evaluate if-statements
Single Static Assignment of Mesh Variables
Recursively apply to functions
Everything else, recursively apply to subexpressions of expression
TT ( if(ep) et else ee ) = T (ep); T (et); T (ee);
T ( while(ep) eb ) = T (ep); T (eb);
T ( f(a0,...,an) ) = f �(T (a0),...,T (an))
![Page 35: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/35.jpg)
In pictures
![Page 36: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/36.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
![Page 37: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/37.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
![Page 38: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/38.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
![Page 39: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/39.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
e in edges(mesh)
head(e)
![Page 40: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/40.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
e in edges(mesh)
head(e)
e in edges(mesh)
head(e) tail(e)
![Page 41: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/41.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
head(e) tail(e)
Read Position Read Position
![Page 42: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/42.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
head(e) tail(e)
Read Position,Temperature Read Position, Temperature
![Page 43: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/43.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
head(e) tail(e)
Write FluxRead Position,Temperature Read Position, Temperature
![Page 44: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/44.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
head(e) tail(e)
Write Flux Write FluxRead Position,Temperature Read Position, Temperature
![Page 45: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/45.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
head(e) tail(e)
Write Flux, JacobiStep Write FluxRead Position,Temperature Read Position, Temperature
![Page 46: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/46.jpg)
Domain Specific Transform:
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ } ...
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
e in edges(mesh)
head(e) tail(e)
Write Flux, JacobiStep Write Flux, JacobiStepRead Position,Temperature Read Position, Temperature
![Page 47: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/47.jpg)
So how do we execute your code?
![Page 48: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/48.jpg)
Execution Strategies
Partitioning Assign partition to each computational unit
Use ghost elements to coordinate cross-boundary communication.
Ideal for single computational unit per memory space Owned Cell
Ghost Cell
![Page 49: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/49.jpg)
Execution Strategies
Partitioning Assign partition to each computational unit
Use ghost elements to coordinate cross-boundary communication.
Ideal for single computational unit per memory space
Coloring
Calculate interference between work items on domain
Schedule work-items into non-interfering batches
Ideal for many computational units per memory space
Owned Cell
Ghost Cell
1 58 1011 73 0 24 9
Batch 4Batch 3Batch 2Batch 1
Schedule set of nonconflicting threads per color
![Page 50: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/50.jpg)
Implement Strategies using Abstract Interpretation
Track 1: Apply transformation to Liszt code
Use result to generate architecture-specific datastructures
Track 2: Rewrite code for specific parallel implementation
Use generated data structures for parallel computation
for (e <-‐ edges(mesh)) { val v1 = head(e) val v2 = tail(e) val dP = Position(v1) -‐ Position(v2) val dT = Temperature(v1) -‐ Temperature(v2)
Flux(v1) += _ Flux(v2) -‐= _ JacobiStep(v1) += _ JacobiStep(v2) += _ }
T
T
Owned Cell
Ghost Cell
1 58 1011 73 0 24 9
Batch 4Batch 3Batch 2Batch 1
Schedule set of nonconflicting threads per color
![Page 51: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/51.jpg)
MPI: Partitioning with Ghosts
1. Partition Mesh (ParMETIS, G. Karypis)
A
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
Blue Node
Orange Node
![Page 52: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/52.jpg)
MPI: Partitioning with Ghosts
2. Find used mesh elements and field entries using stencil data and duplicate locally into “ghost” elements
Implementation directly depends on algorithm’s access patterns
e in edges(mesh)
head(e) tail(e)
Write Flux, JacobiStep Write Flux, JacobiStepRead Position,Temperature Read Position, Temperature
vertices(mesh)
Read/Write Flux
Write TemperatureRead/Write JacobiStep
Blue Node
Orange Node
Ghosts: EGhosts: D
H
F
E
D G
8
10
117
3
69
E
C
B
D1
5
3
0
2
4
A
![Page 53: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/53.jpg)
MPI: Partitioning with Ghosts
3. Annotate for-comprehensions with field preparation statements
Flux.ensureState<LISZT_SUM>(); JacobiStep.ensureState<LISZT_SUM>(); Position.ensureState<LISZT_READ>(); Temperature.ensureState<LISZT_READ>(); for (e <-‐ edges(mesh)) { val dP = Position(v1) -‐ Position(v2) ... Flux(v1) += dT*step JacobiStep(v1) += step } Temperature.ensureState<LISZT_SUM>(); Flux.ensureState<LISZT_READ>(); JacobiStep.ensureState<LISZT_READ>(); for (p <-‐ vertices(mesh)) { Temperature(p) += 0.01f * Flux(p)/JacobiStep(p) } …
![Page 54: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/54.jpg)
MPI: Partitioning with Ghosts
4. MPI communication is batched during for-comprehensions and only transferred when necessary
D F G H -GhostsLocal
A B C E -GhostsLocal
D F G H 0GhostsLocal
A B C E 0GhostsLocal
D-∆D F G H ∆EGhostsLocal
A B C E-∆E ∆DGhostsLocal
Node Blue
Node Orange
∆D
∆E
D F G H -GhostsLocal
A B C E -GhostsLocal
1:
2:
3:
![Page 55: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/55.jpg)
Applying Program Analysis: Results for(f <- faces(mesh)) { rhoOutside(f) := calc_flux( f,rho(outside(f) )) + calc_flux( f,rho(inside(f) ))}
Initial Partition (ParMETIS)
![Page 56: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/56.jpg)
Applying Program Analysis: Results for(f <- faces(mesh)) { rhoOutside(f) := calc_flux( f,rho(outside(f) )) + calc_flux( f,rho(inside(f) ))}
Ghost Cells
![Page 57: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/57.jpg)
Applying Program Analysis: Results for(f <- faces(mesh)) { rhoOutside(f) := calc_flux( f,rho(outside(f) )) + calc_flux( f,rho(inside(f) ))}
Ghost Cells
![Page 58: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/58.jpg)
GPU: Schedule threads with coloring
Shared Memory
Field updates need to be atomic
Concerns about MPI approach – volume vs surface area
Build a graph of interfering writes:
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
+e in
edges(mesh)
head(e) tail(e)
Write Flux, JacobiStep Write Flux, JacobiStepRead Position,Temperature Read Position, Temperature
vertices(mesh)
Read/Write Flux
Write TemperatureRead/Write JacobiStep
![Page 59: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/59.jpg)
GPU: Schedule threads with coloring
Threads 1 edge assigned to 1 thread
Memory
ForceField:
1 5 8 10 11730 2 4 6 9
A C E F HGDB
e in edges(mesh)
head(e) tail(e)
Write Flux, JacobiStep Write Flux, JacobiStepRead Position,Temperature Read Position, Temperature
vertices(mesh)
Read/Write Flux
Write TemperatureRead/Write JacobiStep
FORALL_SET(e,edges(mesh)) Vertex v_1 = head(e); igraph-‐>addEdge(thread(e).ID(),v_1.ID(),18886); Vertex v_2 = tail(e); igraph-‐>addEdge(thread(e).ID(),v_12.ID(),18886); ENDSET
H
F
E
C
B
D G1
5
8
10
117
3
0
2
469
A
Compile time: Generate code to create field write structure
Runtime: Build this structure using mesh and stencil
![Page 60: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/60.jpg)
GPU: Schedule threads with coloring Threads 1 edge assigned to 1 thread
Memory
ForceField:
1 5 8 10 11730 2 4 6 9
A C E F HGDB
1 5 8 10
11730
2 4 6 9
![Page 61: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/61.jpg)
GPU: Schedule threads with coloring Threads 1 edge assigned to 1 thread
Memory
ForceField:
1 5 8 10 11730 2 4 6 9
A C E F HGDB
1 5 8 10
11730
2 4 6 9
![Page 62: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/62.jpg)
GPU: Schedule threads with coloring Convert each for-comprehension into a GPU kernel, launched
multiple times for sets of non-interfering iterations. __global__ for_01(ColorBatch batch, Force, Position, Velocity, Maxforce) { val dT = Position(v1) -‐ Position(v2) ... Flux(v1) += dT*step Flux(v2) -‐= springForce …. } WorkgroupLauncher launcher = WorkgroupLauncher_forWorkgroup(001); ColorBatch colorBatch; while(launcher.nextBatch(&colorBatch)) { Maxforce.ensureSize(colorBatch.kernel_size()); GlobalContext_copyToGPU(); for_01<<<batch.blocks(),batch.threads()>>>( batch, Force, Position, Velocity, Maxforce); }
1 58 1011 73 0 24 9
Batch 4Batch 3Batch 2Batch 1
Schedule set of nonconflicting threads per color
![Page 63: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/63.jpg)
How fast do we go?
![Page 64: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/64.jpg)
Results
4 example codes with Liszt and C++ implementations:
Euler solver from Joe
Navier-Stokes solver from Joe
Shallow Water simulator Free-surface simulation on globe as per Drake et al.
Second order accurate spatial scheme
Linear FEM Hexahedral mesh
Trilinear basis functions with support at vertices
CG solver
![Page 65: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/65.jpg)
Real applications: Joe in Liszt
Navier-Stokes and Euler equations (unstructured solver).
Explicit time integration using local time stepping.
Second order upwind scheme via variable extrapolation using gradient calculation.
Convective terms evaluated using a HHLC approximate nonlinear Riemann solver.
Fully-viscous approximation of the viscous stresses.
Gradient calculation using a Green-Gauss scheme.
New boundary conditions for adiabatic solid walls
-
![Page 66: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/66.jpg)
Real applications
![Page 67: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/67.jpg)
Scalar Performance Comparisons
Runtime comparisons between hand-tuned C++ and Liszt
Liszt performance within 12% of C++
Euler Navier-Stokes
FEM Shallow Water
Mesh size 367k 668k 216k 327k
Liszt 0.37s 1.31s 0.22s 3.30s
C++ 0.39s 1.55s 0.19s 3.34s
![Page 68: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/68.jpg)
MPI Performance
4-socket 6-core 2.66Ghz Xeon CPU per node (24 cores), 16GB RAM per node. 256 nodes, 8 cores per node
32
128
256
512
1024
32 128 256 512 1024
Spee
dup
Cores
Euler23M cell mesh
LisztC++
32
128
256
512
1024
32 128 256 512 1024Cores
Navier-Stokes21M cell mesh
LisztC++
![Page 69: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/69.jpg)
GPU Performance
Tesla C2050, Double Precision vs
single core, Nehalem E5520 2.26Ghz
0
5
10
15
20
25
30
35
40
Euler NS FEM SW
Spee
dup
over
Sca
lar (
x)
Applications
GPU Performancecuda on tesla c2050
![Page 70: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/70.jpg)
Portability Tested both pthreads (coloring) and MPI (partitioning) runtime on:
8-core Nehalem E5520 2.26Ghz, 8GB RAM
32-core Nehalem-EX X7560 2.26GHz, 128GB RAM
0
5
10
15
20
25
30
35
40
Euler NS FEM SW
Spee
dup
over
Sca
lar (
x)
Applications
Comparison between Liszt runtimespthreads on 8-core
mpi on 8-corepthreads on 32-core
mpi on 32-corecuda on tesla c2050
![Page 71: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/71.jpg)
Vectorization (in progress)
Consider n-wide set of elements at once
Use SSE/AVX/NEON vector lanes
Serialize only conflicting writes
Preliminary results: 30% speedup
![Page 72: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/72.jpg)
Faster GPUs! (in progress)
Aubrey Gress, masters student:
Use a two-level staged coloring approach.
Arrange data in global memory for ideal coalescing
Stage data into shared memory, color shared memory
Preliminary results: another ~30% speedup
![Page 73: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/73.jpg)
Aside: Improved debugging!
![Page 74: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/74.jpg)
Aside: Improved debugging: visualization
![Page 75: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/75.jpg)
Aside: Improved debugging: inspection
![Page 76: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/76.jpg)
Future work – so much cool stuff
This talk:
Low-level implementation flexibility
Deeper in-situ visualization and performance analysis
More runtimes, improved back-end runtime interface
More applications
Other talks:
Language extentions for FEM, sparse matrices
Evolving to become a library (we have some tricks)
Embedding Liszt directly into another language
![Page 77: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/77.jpg)
Future work: flexibility allows optimizations
Multiple mesh representations and mesh specialization
We don’t change mesh layout or assume a type of mesh
But we can optimize for this!
Automatically specialize for grids, structured meshes, etc.
Multiple representations of the same mesh for performance?
Data layout optimizations
We can automatically do AoS-SoA transformations
Modify data layout dynamically
Smart staging and streaming of data (already partially for CUDA)
![Page 78: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/78.jpg)
Future work: more debugging and analysis
In-situ LisztVis-like analysis
Cluster-wide visualization of simulations, in-situ
Interact with simulation as it progresses
eg. Track specific data representations as it cascades
Visual performance analysis
We can deeply instrument specific parts of your code, so…
Present caching behaviour and interaction of code sections
Analyze how you use intruction set, bottleneck instructions
Do it visually.
![Page 79: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/79.jpg)
Future work: runtimes!
Fast GPUs
Staged, hierarchical GPU runtime.
There’s another 30% on the table (Aubrey Gress)
Hierarchical runtimes
MPI+GPU
More exotic things?
Fully Heterogeneous
We can combine approaches, MPI+GPU+SMP
![Page 80: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/80.jpg)
Future work: runtimes!
Coherent back-end interface
Originally the Liszt back-end was for application programmes (!)
We’re converging to a new design:
Single compiler, single code-generator, targetting a single API
Stream Programming-like interface*
Multiple back-ends for multiple execution strategies
Easier and cleaner to add more runtimes
* Gummarahu and Rosenblum. “Stream Programming on General-Purpose
Processors” 2005, MICRO 38
![Page 81: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/81.jpg)
Current/Future work: applications
More applications makes for a better language.
Hello LANL!
* Gummarahu and Rosenblum. “Stream Programming on General-Purpose
Processors” 2005, MICRO 38
![Page 82: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/82.jpg)
![Page 83: Liszt los alamos national laboratory Aug 2011](https://reader034.vdocuments.us/reader034/viewer/2022051612/54bdf6a34a795946588b4646/html5/thumbnails/83.jpg)
Interested in more? To appear in Supercomputing 2011
“Liszt: A Domain Specific Language for Building Portable Mesh-based PDE Solvers”, submitted to Supercomputing 2011
Visit our website, it’s all there.
http://liszt.stanford.edu
Talk to us this week!
Drop by, we’re in building 200 (training room)
Other talks
Pat at 1.30pm today
Zach at 10am Wednesday