![Page 1: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/1.jpg)
The Challenges Ahead for Visualizing and Analyzing Massive Data Sets
Hank Childs
Lawrence Berkeley National Laboratory
February 26, 2010
27B element Rayleigh-Taylor Instability (MIRANDA, BG/L)
2 trillion element mesh
2 billion element Thermal hydraulics (Nek5000, BG/P)
![Page 2: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/2.jpg)
Overview of This Mini-Symposium
• Peterka: we can visualize the results on the supercomputer itself
• Bremer: we can understand and gain insight from these massive data sets
• Childs: visualization and analysis will be a crucial problem on the next generation of supercomputers
• Pugmire: we can make our algorithms work at massive scale
![Page 3: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/3.jpg)
How does the {peta-, exa-} scale affect visualization?
Large # of time steps
Large ensembles
High-res meshes
Large # of variables
![Page 4: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/4.jpg)
The soon-to-be “good ole days” … how visualization is done right now*
[Figure: Parallel Simulation Code (P0 to P9) writes pieces of data (on disk); a parallelized visualization data flow network runs Read, Process, Render on Processor 0, Processor 1, and Processor 2.]
* = Your mileage may vary
- Are you running full machine?
- How much data do you output?
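The Read, Process, Render pipeline on this slide can be sketched in a few lines. This is only an illustration, not the tool's actual implementation: the round-robin assignment, the function names, and the fabricated cell values are all assumptions, and the "processors" run one after another in a plain loop rather than in parallel.

```python
# Hypothetical sketch of pure parallelism: every piece is read at full
# resolution, and each processor handles its own subset independently.

def assign_pieces(num_pieces, num_procs):
    """Round-robin assignment of piece indices to processors.

    This distribution is an assumption; the slide does not say how pieces
    are mapped to processors.
    """
    assignment = {rank: [] for rank in range(num_procs)}
    for piece in range(num_pieces):
        assignment[piece % num_procs].append(piece)
    return assignment

def read(piece):
    # Stand-in for reading one piece from disk: fabricate four cell values.
    return [float(piece * 10 + i) for i in range(4)]

def process(cells):
    # Stand-in for a visualization filter, here a simple threshold.
    return [c for c in cells if c >= 2.0]

def render(cells):
    # Stand-in for rendering: report how many cells would be drawn.
    return len(cells)

def run_processor(pieces):
    """Read -> Process -> Render for every piece owned by one processor."""
    return sum(render(process(read(p))) for p in pieces)

# Ten pieces (P0 to P9) spread over three processors, as in the figure.
assignment = assign_pieces(num_pieces=10, num_procs=3)
drawn = {rank: run_processor(pieces) for rank, pieces in assignment.items()}
print(assignment)
print(drawn)
```

Because every piece passes through `read`, the cost of this approach grows with the number of bytes on disk, which is the point the following slides develop.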
![Page 5: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/5.jpg)
Pure parallelism performance is based on # bytes to process and I/O rates.
Vis is almost always >50% I/O and sometimes 98% I/O
Amount of data to visualize is typically O(total mem)
Relative I/O (ratio of total memory and I/O) is key
[Figure: FLOPs, Memory, and I/O compared for a terascale machine and a “petascale machine”.]
![Page 6: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/6.jpg)
Anecdotal evidence: relative I/O is getting slower.
| Machine name | Main memory | I/O rate | Time to write memory to disk |
| --- | --- | --- | --- |
| ASC purple | 49.0TB | 140GB/s | 5.8min |
| BGL-init | 32.0TB | 24GB/s | 22.2min |
| BGL-cur | 69.0TB | 30GB/s | 38.3min |
| Petascale machine | ?? | ?? | >40min |
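The "time to write memory to disk" figures follow from dividing main memory by the I/O rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB); the function name is made up for illustration:

```python
def minutes_to_write(memory_tb, io_rate_gb_per_s):
    """Minutes needed to write all of main memory to disk once."""
    memory_gb = memory_tb * 1000.0  # assumption: decimal units, 1 TB = 1000 GB
    return memory_gb / io_rate_gb_per_s / 60.0

# (machine name, main memory in TB, I/O rate in GB/s), taken from the slide.
machines = [
    ("ASC purple", 49.0, 140.0),
    ("BGL-init", 32.0, 24.0),
    ("BGL-cur", 69.0, 30.0),
]

for name, memory_tb, rate in machines:
    print(f"{name}: {minutes_to_write(memory_tb, rate):.1f} min")
```

Under that unit assumption the three results round to 5.8, 22.2, and 38.3 minutes, matching the slide.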
![Page 7: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/7.jpg)
Why is relative I/O getting slower?
• “I/O doesn’t pay the bills”
  - And I/O is becoming a dominant cost in the overall supercomputer procurement.
• Simulation codes aren’t as exposed.
![Page 8: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/8.jpg)
Recent runs of trillion cell data sets provide further evidence that I/O dominates
● Weak scaling study: ~62.5M cells/core

| #cores | Problem Size | Type | Machine |
| --- | --- | --- | --- |
| 8K | 0.5TZ | AIX | Purple |
| 16K | 1TZ | Sun Linux | Ranger |
| 16K | 1TZ | Linux | Juno |
| 32K | 2TZ | Cray XT5 | JaguarPF |
| 64K | 4TZ | BG/P | Dawn |
| 16K, 32K | 1TZ, 2TZ | Cray XT4 | Franklin |

[Figures: 2T cells, 32K procs on Jaguar; 2T cells, 32K procs on Franklin]
- Approx I/O time: 2-5 minutes
- Approx processing time: 10 seconds
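Two quick checks on these numbers: the problem sizes follow from the weak-scaling ratio, and the quoted timings imply that I/O takes well over 90% of the run. The sketch below assumes "K" means 1000 cores and that total time is simply I/O plus processing; both are simplifications, not statements from the slide.

```python
CELLS_PER_CORE = 62.5e6  # weak scaling ratio quoted on the slide

def total_cells(cores):
    """Problem size implied by weak scaling at ~62.5M cells per core."""
    return cores * CELLS_PER_CORE

def io_fraction(io_seconds, processing_seconds):
    """Share of time spent in I/O, assuming time = I/O + processing."""
    return io_seconds / (io_seconds + processing_seconds)

# Assumption: "16K" is read as 16,000 cores.
print(f"16K cores: {total_cells(16_000):.2e} cells")

# I/O of 2 to 5 minutes against roughly 10 seconds of processing.
for io_minutes in (2, 5):
    share = io_fraction(io_minutes * 60, 10)
    print(f"{io_minutes} min of I/O: {share:.0%} of the run")
```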
![Page 9: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/9.jpg)
Visualization works because it uses the brain’s highly effective visual processing system.
Trillions of data points
Millions of pixels
But is this still a good idea at the peta-/exascale?
• (Note that visualization is often reducing the data … so we are frequently *not* trying to render all of the data points.)
![Page 10: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/10.jpg)
Visualization works because it uses the brain’s highly effective visual processing system.
Trillions of data points
One idea: add more pixels!
35M pixel powerwall
• Bonus: big displays act as collaboration centers.
![Page 11: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/11.jpg)
Visualization works because it uses the brain’s highly effective visual processing system.
Trillions of data points
One idea: add more pixels!
35M pixel powerwall
Source: Sawant & Healey, NC State
Visual acuity of the human eye is <30M pixels!!
![Page 12: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/12.jpg)
Summary: what are the challenges?
• Scale
  - We can’t read all of the data at full resolution any more. What can we do?
• Insight
  - There is a lot more data than pixels. How are we going to understand it?
![Page 13: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/13.jpg)
How can we deal with so many cells per pixel?
• What should the color of this pixel be?
  - “Random” between the 9 colors?
  - An average value of the 9 colors? (brown)
  - The color of the minimum value?
  - The color of the maximum value?
• We need infrastructure to allow users to have confidence in the pictures we deliver.
A single pixel
Data insight often goes far beyond pictures (see Bremer talk)
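The four options for coloring a single pixel correspond to four different reductions over the cells that land on it. The sketch below is illustrative only: the nine scalar values are made up, and mapping the resulting scalar to a color is left out.

```python
import random
import statistics

def pixel_value(cell_values, strategy, rng=None):
    """Reduce the cells covering one pixel to a single scalar value."""
    if strategy == "random":
        rng = rng or random.Random()
        return rng.choice(cell_values)
    if strategy == "average":
        return statistics.fmean(cell_values)
    if strategy == "minimum":
        return min(cell_values)
    if strategy == "maximum":
        return max(cell_values)
    raise ValueError(f"unknown strategy: {strategy}")

# Nine hypothetical cell values that all project onto the same pixel.
cells = [0.1, 0.9, 0.4, 0.7, 0.2, 0.5, 0.3, 0.8, 0.6]

for strategy in ("random", "average", "minimum", "maximum"):
    print(strategy, pixel_value(cells, strategy))
```

Each strategy gives a different answer for the same pixel, which is why the choice needs to be visible to the user rather than hidden inside the renderer.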
![Page 14: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/14.jpg)
Multi-resolution techniques use coarse representations then refine.
[Figure: Parallel Simulation Code (P0 to P9), pieces of data (on disk), and a parallelized visualization data flow network running Read, Process, Render on Processor 0, Processor 1, and Processor 2; pieces P2 and P4 are labeled separately.]
![Page 15: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/15.jpg)
Multi-resolution: pros and cons
• Summary:
  - “Dive” into data
    • Enough diving results in original data
• Pros
  - Avoid I/O & memory requirements
  - Confidence in pictures; multi-res hierarchy addresses “many cells to one pixel issue”
• Cons
  - Is it meaningful to process simplified version of the data?
  - How do we generate hierarchical representations? What costs do they incur?
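One simple way to build a hierarchical representation is to average blocks of cells repeatedly, then view the levels from coarse to fine. The sketch below does this for a small 2D grid. It is an assumption-laden toy, not the method used in the talk: it uses plain averaging of 2x2 blocks, requires power-of-two dimensions, and all function names are hypothetical.

```python
def coarsen(grid):
    """Halve the resolution of a 2D grid by averaging each 2x2 block.

    Assumes both dimensions are even.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

def build_hierarchy(grid):
    """Return levels from finest (index 0) to coarsest."""
    levels = [grid]
    while len(levels[-1]) > 1 and len(levels[-1][0]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels

def num_cells(level):
    return sum(len(row) for row in level)

# An 8x8 grid of made-up values stands in for the full-resolution data.
fine = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
levels = build_hierarchy(fine)

# "Dive" into the data: start coarse, refine until the original is reached.
for level in reversed(levels):
    print(f"{len(level)} x {len(level[0])}")

# Extra cells stored for the coarse levels, relative to the original data.
extra = sum(num_cells(level) for level in levels[1:]) / num_cells(levels[0])
print(f"extra storage: {extra:.1%}")
```

In this 2D example the coarse levels add roughly a third more cells on top of the original; the figure depends on the dimensionality and the coarsening factor, so it should not be read as a general cost.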
![Page 16: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/16.jpg)
In situ processing does visualization as part of the simulation.
[Figure: Parallel Simulation Code (P0 to P9), pieces of data (on disk), and Read, Process, Render on Processor 0, Processor 1, and Processor 2.]
![Page 17: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/17.jpg)
In situ processing does visualization as part of the simulation.
[Figure: the parallelized visualization data flow network runs inside the Parallel Simulation Code (P0 to P9); each of Processor 0, Processor 1, Processor 2, … Processor 9 runs GetAccessToData, Process, Render.]
![Page 18: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/18.jpg)
In situ: pros and cons
• Pros:
  - No I/O!
  - Lots of compute power available
• Cons:
  - Very memory constrained
  - Many operations not possible
    • Once the simulation has advanced, you cannot go back and analyze it
  - User must know what to look for a priori
    • Expensive resource to hold hostage!
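The in situ pattern replaces the Read stage with direct access to the simulation's memory. The toy below shows the shape of that coupling. Everything in it is hypothetical: `ToySimulation`, the function names, and the choice of a maximum as the analysis are illustrations, not an actual in situ API.

```python
class ToySimulation:
    """Hypothetical stand-in for one processor's share of a simulation."""

    def __init__(self, num_cells):
        self.step = 0
        self.field = [0.0] * num_cells

    def advance(self):
        self.step += 1
        # The field is overwritten each step: earlier states are gone.
        self.field = [self.step * (i + 1) * 0.5 for i in range(len(self.field))]

def get_access_to_data(sim):
    # No read from disk: return a reference to the live in-memory field.
    return sim.field

def process(field):
    # The analysis must be chosen before the run (a priori); here, the max.
    return max(field)

def render(step, value):
    # Stand-in for producing an image: emit a one-line summary instead.
    return f"step {step}: max={value:.1f}"

def run(sim, num_steps, vis_every):
    """Advance the simulation, visualizing in place every few steps."""
    frames = []
    for _ in range(num_steps):
        sim.advance()
        if sim.step % vis_every == 0:
            frames.append(render(sim.step, process(get_access_to_data(sim))))
    return frames

frames = run(ToySimulation(num_cells=4), num_steps=6, vis_every=2)
print(frames)
```

The sketch also shows the main drawback: once `advance` overwrites the field, steps that were not visualized cannot be revisited without re-running the simulation.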
![Page 19: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/19.jpg)
Now we know the tools … what problem are we trying to solve?
• Three primary use cases:
  - Exploration (examples: scientific discovery, debugging)
  - Confirmation (examples: data analysis, images / movies, comparison)
  - Communication (examples: data analysis, images / movies)
![Page 20: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/20.jpg)
Notional decision process
• Need all data at full resolution?
  - No: Multi-resolution (debugging & scientific discovery)
  - Yes: Do you know what you want to do a priori?
    • Yes: In situ (data analysis & images / movies)
    • No: Pure parallelism (anything & esp. comparison)
• Use cases shown alongside the chart: Exploration, Confirmation, Communication
• Also roles for more minor techniques that weren’t discussed, such as streaming and data subsetting.
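The notional decision process reduces to two yes/no questions. A minimal sketch, with a made-up function name and plain strings for the three techniques:

```python
def choose_technique(need_full_resolution, know_a_priori):
    """Pick a technique following the notional decision process.

    need_full_resolution: do you need all data at full resolution?
    know_a_priori: do you know what you want to do before the run?
    """
    if not need_full_resolution:
        return "multi-resolution"
    if know_a_priori:
        return "in situ"
    return "pure parallelism"

# Walk through every combination of answers.
for need_full in (False, True):
    for a_priori in (False, True):
        print(need_full, a_priori, "->", choose_technique(need_full, a_priori))
```

The second question only matters when full resolution is required, so two of the four combinations lead to multi-resolution.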
![Page 21: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/21.jpg)
Prepare for difficult conversations in the future.
• Multi-resolution:
  - Do you understand what a multi-resolution hierarchy should look like for your data?
  - Who do you trust to generate it?
  - Are you comfortable with your I/O routines generating these hierarchies while they write?
  - How much overhead are you willing to tolerate on your dumps? 33+%?
  - Willing to accept that your visualizations are not the “real” data?
![Page 22: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/22.jpg)
Prepare for difficult conversations in the future.
• In situ:
  - How much memory are you willing to give up for visualization?
  - Will you be angry if the vis algorithms crash?
  - Do you know what you want to generate a priori?
    • Can you re-run simulations if necessary?
![Page 23: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/23.jpg)
Summary
• Is there a problem with massive data?
  - Yes, I/O is a major problem
  - Yes, obtaining insight is a major problem
• Why is there a problem? Whose fault is it?
  - As we scale up, some things get cheap, other things (like I/O) stay expensive
• What can we do about it?
  - Multi-res / in-situ
• Will it hurt?
  - Yes.
• Can we do it?
  - Yes, see next three talks
![Page 24: The Challenges Ahead for Visualizing and Analyzing Massive Data Sets Hank Childs Lawrence Berkeley National Laboratory February 26, 2010 27B element Rayleigh-Taylor](https://reader037.vdocuments.us/reader037/viewer/2022110205/56649cb15503460f9497626f/html5/thumbnails/24.jpg)
• Questions???
• Hank Childs, LBL & UC Davis
• Contact info:
  - [email protected] / [email protected]
  - http://vis.lbl.gov/~hrchilds