![Page 1: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/1.jpg)
Enabling Rapid Development of Parallel Tree-Search Applications
Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis
Jeffrey P. Gardner, Andrew Connolly, Cameron McBride
Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University
![Page 2: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/2.jpg)
How to turn astrophysics simulation output into scientific knowledge
Step 1: Run simulation
Step 2: Analyze simulation on workstation
Step 3: Extract meaningful scientific knowledge
(happy scientist)
Using 300 processors (circa 1995)
![Page 3: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/3.jpg)
How to turn astrophysics simulation output into scientific knowledge
Step 1: Run simulation
Step 2: Analyze simulation on server (in serial)
Step 3: Extract meaningful scientific knowledge
(happy scientist)
Using 1000 processors (circa 2000)
![Page 4: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/4.jpg)
How to turn astrophysics simulation output into scientific knowledge
Step 1: Run simulation
Step 2: Analyze simulation on ???
(unhappy scientist)
Using 4000+ processors (circa 2006)
X
![Page 5: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/5.jpg)
Exploring the Universe can be (Computationally) Expensive
The size of simulations is no longer limited by computational power.
It is limited by the parallelizability of data analysis tools.
This situation will only get worse in the future.
![Page 6: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/6.jpg)
How to turn astrophysics simulation output into scientific knowledge
Step 1: Run simulation
Step 2: Analyze simulation on ???
Using ~1,000,000 cores? (circa 2012)
X
By 2012, we will have machines that will have many hundreds of thousands of cores!
![Page 7: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/7.jpg)
The Challenge of Data Analysis in a Multiprocessor Universe
Parallel programs are difficult to write!
Steep learning curve to learn parallel programming
Parallel programs are expensive to write!
Lengthy development time
Parallel world is dominated by simulations:
Code is often reused for many years by many people
Therefore, you can afford to invest lots of time writing the code.
Example: GASOLINE (a cosmology N-body code)
Required 10 FTE-years of development
![Page 8: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/8.jpg)
The Challenge of Data Analysis in a Multiprocessor Universe
Data Analysis does not work this way:
Rapidly changing scientific inquiries
Less code reuse
Simulation groups do not even write their analysis code in parallel!
Data Mining paradigm mandates rapid software development!
![Page 9: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/9.jpg)
How to turn observational data into scientific knowledge
Step 1: Collect data
Step 2: Analyze data on workstation
Step 3: Extract meaningful scientific knowledge
(happy astronomer)
Observe at Telescope (circa 1990)
![Page 10: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/10.jpg)
How to turn observational data into scientific knowledge
Use Sky Survey Data (circa 2005)
Step 1: Collect data
Step 2: Analyze data on ???
Sloan Digital Sky Survey (500,000 galaxies)
3-point correlation function: ~200,000 node-hours of computation
X
(unhappy astronomer)
![Page 11: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/11.jpg)
How to turn observational data into scientific knowledge
Use Sky Survey Data (circa 2012)
Large Synoptic Survey Telescope (2,000,000 galaxies)
3-point correlation function: ~several petaflop weeks of computation
![Page 12: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/12.jpg)
Tightly-Coupled Parallelism (what this talk is about)
Data and computational domains overlap
Computational elements must communicate with one another
Examples:
Group finding
N-Point correlation functions
New object classification
Density estimation
![Page 13: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/13.jpg)
The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
Build a library that is:
Sophisticated enough to take care of all of the nasty parallel bits for you.
Flexible enough to be used for your own particular astrophysics data analysis application.
Scalable: scales well to thousands of processors.
![Page 14: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/14.jpg)
The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe
Astrophysics uses dynamic, irregular data structures:
Astronomy deals with point-like data in an N-dimensional parameter space
Most efficient methods on these kinds of data use space-partitioning trees.
The most common data structure is a kd-tree.
Build a targeted library for distributed-memory kd-trees that is scalable to thousands of processing elements
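To make the kd-tree idea concrete, here is a minimal serial sketch in Python. It is not N tropy code: the names `Node`, `build` and `count_within` are hypothetical, and the choice to split on the longest axis at the median is an assumption made for illustration. It shows why a space-partitioning tree helps with point data: a range count can accept or reject a whole bounding box without visiting its points.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    lo: Point                      # minimum corner of the bounding box
    hi: Point                      # maximum corner of the bounding box
    count: int                     # number of points under this node
    points: List[Point] = field(default_factory=list)  # filled only at leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(points: Sequence[Point], leaf_size: int = 8) -> Node:
    """Build a kd-tree by median split along the longest box axis (assumed rule)."""
    pts = list(points)
    dim = len(pts[0])
    lo = tuple(min(p[d] for p in pts) for d in range(dim))
    hi = tuple(max(p[d] for p in pts) for d in range(dim))
    node = Node(lo, hi, len(pts))
    if len(pts) <= leaf_size:
        node.points = pts
        return node
    axis = max(range(dim), key=lambda d: hi[d] - lo[d])
    pts.sort(key=lambda p: p[axis])
    mid = len(pts) // 2
    node.left = build(pts[:mid], leaf_size)
    node.right = build(pts[mid:], leaf_size)
    return node


def dist2(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def count_within(node: Node, center: Point, r: float) -> int:
    """Count points within distance r of center, pruning whole boxes when possible."""
    nearest = tuple(min(max(c, l), h) for c, l, h in zip(center, node.lo, node.hi))
    farthest = tuple(l if abs(c - l) > abs(c - h) else h
                     for c, l, h in zip(center, node.lo, node.hi))
    if dist2(center, nearest) > r * r:
        return 0                   # box lies entirely outside the sphere
    if dist2(center, farthest) <= r * r:
        return node.count          # box lies entirely inside the sphere
    if node.left is None:
        return sum(1 for p in node.points if dist2(p, center) <= r * r)
    return count_within(node.left, center, r) + count_within(node.right, center, r)
```

Pair counts for correlation functions and neighbour searches for group finding follow the same pattern of pruning by bounding box. The distributed-memory version described in the talk additionally has to fetch nodes that live on other processors.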
![Page 15: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/15.jpg)
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write:
Work orchestration
Data management
Things that inhibit scalability:
Granularity (synchronization, consistency)
Load balancing
Data locality
Structured data
Memory consistency
![Page 16: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/16.jpg)
Overview of existing paradigms: DSM
There are many existing distributed shared-memory (DSM) tools.
Compilers: UPC, Co-Array Fortran, Titanium, ZPL, Linda
Libraries: Global Arrays, TreadMarks, IVY, JIAJIA, Strings, Mirage, Munin, Quarks, CVM
![Page 17: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/17.jpg)
Overview of existing paradigms: DSM
The Good: These are quite simple to use.
The Good: Can manage data locality pretty well.
The Bad: Existing DSM approaches tend not to scale very well because of fine granularity.
The Ugly: Almost none support structured data (like trees).
![Page 18: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/18.jpg)
Overview of existing paradigms: DSM
There are some DSM approaches that do lend themselves to structured data: e.g. Linda (tuple-space)
The Good: Almost universally flexible
The Bad: These tend to scale even worse than simple unstructured DSM approaches. Granularity is too fine
![Page 19: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/19.jpg)
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write:
Work orchestration
Data management
Things that inhibit scalability:
Granularity
Load balancing
Data locality
DSM
![Page 20: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/20.jpg)
Overview of existing paradigms: RMI
“Remote Method Invocation”
rmi_broadcast(…, (*myFunction));
[Diagram: a master thread holding the computational agenda calls rmi_broadcast() through its RMI layer; the RMI layer on each of Proc. 0, Proc. 1, Proc. 2 and Proc. 3 then invokes myFunction().]
myFunction() is coarsely grained
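The broadcast pattern on this slide can be imitated in a few lines of Python. This is only a sketch under stated assumptions: `rmi_broadcast_sketch` is a hypothetical stand-in (the slide elides the real arguments of `rmi_broadcast`), threads stand in for processors, and the agenda is dealt out round-robin.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def rmi_broadcast_sketch(n_procs: int,
                         agenda: Sequence[T],
                         func: Callable[[int, List[T]], R]) -> List[R]:
    """Invoke func once per simulated processor on that processor's share of the agenda.

    Hypothetical helper: the real RMI layer would ship the call to remote
    processes, whereas this sketch uses one local thread per "processor".
    """
    shares = [list(agenda[p::n_procs]) for p in range(n_procs)]
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        futures = [pool.submit(func, p, shares[p]) for p in range(n_procs)]
        return [f.result() for f in futures]


def my_function(proc_id: int, items: List[int]) -> int:
    # Coarse-grained: one invocation handles a whole batch of work items.
    return sum(x * x for x in items)


partials = rmi_broadcast_sketch(4, range(10), my_function)
total = sum(partials)  # the master combines the per-processor results
```

The point of the coarse granularity is that the master issues one call per processor rather than one message per work item.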
![Page 21: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/21.jpg)
RMI Performance Features
Coarse granularity
Thread virtualization
Queue many instances of myFunction() on each physical thread.
RMI infrastructure can migrate these instances to achieve load balancing.
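A small simulation illustrates why splitting the work into many more pieces than physical threads helps. This is a sketch, not the RMI layer's algorithm: `static_makespan` and `dynamic_makespan` are hypothetical names, and "migration" is modelled by the simplest possible rule, in which whichever thread becomes idle first takes the next queued piece.

```python
import heapq
from typing import List, Sequence


def static_makespan(costs: Sequence[float], n_threads: int) -> float:
    """One contiguous block of pieces per physical thread, with no migration."""
    n = len(costs)
    bounds = [n * t // n_threads for t in range(n_threads + 1)]
    return max(sum(costs[bounds[t]:bounds[t + 1]]) for t in range(n_threads))


def dynamic_makespan(costs: Sequence[float], n_threads: int) -> float:
    """Pieces are handed out one at a time to whichever thread is idle first."""
    finish: List[float] = [0.0] * n_threads   # min-heap of per-thread finish times
    heapq.heapify(finish)
    for cost in costs:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + cost)
    return max(finish)


# Four expensive pieces followed by twelve cheap ones, on four threads.
costs = [10.0] * 4 + [1.0] * 12
static_time = static_makespan(costs, 4)    # one thread gets all the expensive pieces
dynamic_time = dynamic_makespan(costs, 4)  # the expensive pieces are spread out
```

In this made-up workload the static split finishes in 40 time units while the dynamic one finishes in 13, which equals the ideal of total work divided by thread count.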
![Page 22: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/22.jpg)
Overview of existing paradigms: RMI
RMI can be language based: Java, CHARM++
Or library based: RPC, ARMI
![Page 23: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/23.jpg)
Challenges for scalable parallel application development:
Things that make parallel programs difficult to write:
Work orchestration
Data management
Things that inhibit scalability:
Granularity
Load balancing
Data locality
RMI
![Page 24: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/24.jpg)
N tropy: A Library for Rapid Development of kd-tree Applications
No existing paradigm gives us everything we need.
Can we combine existing paradigms beneath a simple, yet flexible API?
![Page 25: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/25.jpg)
N tropy: A Library for Rapid Development of kd-tree Applications
Use RMI for orchestration
Use DSM for data management
Implementation of both is targeted towards astrophysics
![Page 26: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/26.jpg)
A Simple N tropy Example: N-body Gravity Calculation
Cosmological “N-Body” simulation
100,000,000 particles
1 TB of RAM
100 million light years
[Diagram: the simulation volume divided among nine processors]
Proc 0 Proc 1 Proc 2
Proc 3 Proc 4 Proc 5
Proc 6 Proc 7 Proc 8
![Page 27: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/27.jpg)
A Simple N tropy Example: N-body Gravity Calculation
ntropy_Dynamic(…, (*myGravityFunc));
[Diagram: a master thread holding the computational agenda (the particles P1, P2, …, Pn on which to calculate gravitational force) calls ntropy_Dynamic() through the N tropy master RMI layer; the N tropy thread RMI layer on each of Proc. 0, Proc. 1, Proc. 2 and Proc. 3 then invokes myGravityFunc().]
![Page 28: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/28.jpg)
A Simple N tropy Example: N-body Gravity Calculation
Cosmological “N-Body” simulation
100,000,000 particles
1 TB of RAM
100 million light years
To resolve the gravitational force on any single particle requires the entire dataset
Proc 0 Proc 1 Proc 2
Proc 3 Proc 4 Proc 5
Proc 6 Proc 7 Proc 8
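The statement that one particle's force needs the entire dataset is easiest to see in the direct-summation form of Newtonian gravity. The sketch below is an illustration only, not `myGravityFunc()`: the function name, the unit choice `G = 1` and the softening length are assumptions. Tree codes reduce the cost by approximating distant groups of particles, but they still need access to data held on every processor.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def acceleration_on(i: int,
                    pos: Sequence[Vec3],
                    mass: Sequence[float],
                    G: float = 1.0,
                    soft: float = 0.0) -> Vec3:
    """Direct-sum gravitational acceleration on particle i.

    The loop reads every other particle, which is why the whole dataset
    must be reachable from whichever processor owns particle i.
    """
    xi, yi, zi = pos[i]
    ax = ay = az = 0.0
    for j, (xj, yj, zj) in enumerate(pos):
        if j == i:
            continue
        dx, dy, dz = xj - xi, yj - yi, zj - zi
        r2 = dx * dx + dy * dy + dz * dz + soft * soft  # softened squared distance
        inv_r3 = 1.0 / (r2 * math.sqrt(r2))
        ax += G * mass[j] * dx * inv_r3
        ay += G * mass[j] * dy * inv_r3
        az += G * mass[j] * dz * inv_r3
    return ax, ay, az
```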
![Page 29: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/29.jpg)
A Simple N tropy Example: N-body Gravity Calculation
[Diagram: on each of Proc. 0, Proc. 1, Proc. 2 and Proc. 3, an N tropy thread runs myGravityFunc() above its RMI layer (labelled "Work"); beneath each sits an N tropy DSM layer, and together these hold a tree with nodes numbered 0 to 14 (labelled "Data").]
![Page 30: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/30.jpg)
N tropy Performance Features
DSM allows performance features to be provided “under the hood”:
Interprocessor data caching for both reads and writes
< 1 in 100,000 off-PE requests actually result in communication.
Updates through DSM interface must be commutative
Relaxed memory model allows multiple writers with no overhead
Consistency enforced through global synchronization
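The features listed above can be mimicked with a toy model. This is a sketch under assumptions, not the N tropy DSM layer: the class `SketchDSM`, its method names and the rule that cell `c` lives on processor `c % n_procs` are all invented for illustration. It shows three of the ideas: repeated off-processor reads are served from a local cache, commutative updates (here, additions) from several writers are buffered without coordination, and the data only becomes consistent at a global synchronization.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List


class SketchDSM:
    """Toy distributed shared memory over n_procs simulated processors."""

    def __init__(self, n_procs: int, n_cells: int) -> None:
        self.n_procs = n_procs
        # Assumed ownership rule: cell c is stored on processor c % n_procs.
        self.store: List[Dict[int, float]] = [
            {c: 0.0 for c in range(p, n_cells, n_procs)} for p in range(n_procs)
        ]
        self.read_cache: List[Dict[int, float]] = [{} for _ in range(n_procs)]
        self.pending: List[DefaultDict[int, float]] = [
            defaultdict(float) for _ in range(n_procs)
        ]
        self.messages = 0  # number of simulated interprocessor transfers

    def owner(self, cell: int) -> int:
        return cell % self.n_procs

    def read(self, proc: int, cell: int) -> float:
        home = self.owner(cell)
        if home == proc:
            return self.store[home][cell]
        cache = self.read_cache[proc]
        if cell not in cache:
            self.messages += 1            # only a cache miss costs communication
            cache[cell] = self.store[home][cell]
        return cache[cell]

    def accumulate(self, proc: int, cell: int, delta: float) -> None:
        """Commutative update: buffered locally, so writers never wait on each other."""
        self.pending[proc][cell] += delta

    def sync(self) -> None:
        """Global synchronization: flush buffered updates and drop stale cached reads."""
        for proc in range(self.n_procs):
            for cell, delta in self.pending[proc].items():
                home = self.owner(cell)
                if home != proc:
                    self.messages += 1
                self.store[home][cell] += delta
            self.pending[proc].clear()
            self.read_cache[proc].clear()
```

Because the updates commute, the order in which the buffers are flushed does not change the result, which is what lets the relaxed memory model support multiple writers cheaply.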
![Page 31: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/31.jpg)
N tropy Performance Features
RMI allows further performance features
Thread virtualization
Divide workload into many more pieces than physical threads
Dynamic load balancing is achieved by migrating work elements as computation progresses.
![Page 32: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/32.jpg)
N tropy Performance
10 million particles, Spatial 3-Point, 3->4 Mpc
[Figure comparing three configurations:]
No interprocessor data cache, No load balancing
Interprocessor data cache, No load balancing
Interprocessor data cache, Load balancing
![Page 33: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/33.jpg)
Why does the data cache make such a huge difference?
[Diagram: myGravityFunc() on Proc. 0 walking a tree with nodes numbered 0 to 14.]
![Page 34: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/34.jpg)
N tropy “Meaningful” Benchmarks
The purpose of this library is to minimize development time!
Development time for:
1. Parallel N-point correlation function calculator: 2 years -> 3 months
2. Parallel Friends-of-Friends group finder: 8 months -> 3 weeks
![Page 35: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/35.jpg)
Conclusions
Most approaches for parallel application development rely on providing a single paradigm in the most general possible manner
Many scientific problems tend not to map well onto single paradigms
Providing an ultra-general single paradigm inhibits scalability
![Page 36: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/36.jpg)
Conclusions
Scientists often borrow from several paradigms and implement them in a restricted and targeted manner.
Almost all current HPC programs are written in MPI (“paradigm-less”):
MPI is a “lowest common denominator” upon which any paradigm can be imposed.
![Page 37: Enabling Rapid Development of Parallel Tree-Search Applications Harnessing the Power of Massively Parallel Platforms for Astrophysical Data Analysis Jeffrey](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697c01e1a28abf838cd0e07/html5/thumbnails/37.jpg)
Conclusions
N tropy provides:
Remote Method Invocation (RMI)
Distributed Shared-Memory (DSM)
Implementation of these paradigms is “lean and mean”
Targeted specifically for problem domain
This approach successfully enables astrophysics data analysis
Substantially reduces application development time
Scales to thousands of processors
More Information: Go to Wikipedia and search “Ntropy”