
Page 1: Extreme Scalability Working Group (XS-WG): Status Update

Nick Nystrom, Director, Strategic Applications
Pittsburgh Supercomputing Center, October 22, 2009

Page 2: Extreme Scalability Working Group (XS-WG): Purpose

• Meet the challenges and opportunities of deploying extreme-scale resources into the TeraGrid, maximizing both scientific output and user productivity.
  – Aggregate, develop, and share wisdom
  – Identify and address needs that are common to multiple sites and projects
  – May require assembling teams and obtaining support for sustained effort

• XS-WG benefits from active involvement of all Track 2 sites, Blue Waters, tool developers, and users.

• The XS-WG leverages and combines RPs’ interests to deliver greater value to the computational science community.

Page 3: XS-WG Participants

• Nick Nystrom PSC, XS-WG lead
• Jay Alameda NCSA
• Martin Berzins Univ. of Utah (U)
• Paul Brown IU
• Shawn Brown PSC
• Lonnie Crosby NICS, IO/Workflows lead
• Tim Dudek GIG EOT
• Victor Eijkhout TACC
• Jeff Gardner U. Washington (U)
• Chris Hempel TACC
• Ken Jansen RPI (U)
• Shantenu Jha LONI
• Nick Karonis NIU (G)
• Dan Katz LONI
• Ricky Kendall ORNL
• Byoung-Do Kim TACC
• Scott Lathrop GIG, EOT AD
• Vickie Lynch ORNL
• Amit Majumdar SDSC, TG AUS AD
• Mahin Mahmoodi PSC, Tools lead
• Allen Malony Univ. of Oregon (P)
• David O’Neal PSC
• Dmitry Pekurovsky SDSC
• Wayne Pfeiffer SDSC
• Raghu Reddy PSC, Scalability lead
• Sergiu Sanielevici PSC
• Sameer Shende Univ. of Oregon (P)
• Ray Sheppard IU
• Alan Snavely SDSC
• Henry Tufo NCAR
• George Turner IU
• John Urbanic PSC
• Joel Welling PSC
• Nick Wright SDSC (P)
• S. Levent Yilmaz* CSM, U. Pittsburgh (P)

U: user; P: performance tool developer; G: grid infrastructure developer; *: joined XS-WG since last TG-ARCH update

Page 4: Technical Challenge Area #1: Scalability and Architecture

• Algorithms, numerics, multicore, etc.
  – Robust, scalable infrastructure (libraries, frameworks, languages) for supporting applications that scale to O(10^4–10^6) cores
  – Numerical stability and convergence issues that emerge at scale
  – Exploiting systems’ architectural strengths
  – Fault tolerance and resilience

• Contributors
  – POC: Raghu Reddy (PSC)
  – Members: Reddy, Majumdar, Urbanic, Kim, Lynch, Jha, Nystrom

• Current activities
  – Understanding performance tradeoffs in hierarchical architectures
    • e.g., partitioning between MPI/OpenMP for different node architectures, interconnects, and software stacks
    • candidate codes for benchmarking: HOMB, WRF, perhaps others
  – Characterizing bandwidth-intensive communication performance

Page 5: Investigating the Effectiveness of Hybrid Programming (MPI+OpenMP)

• Begun in XS-WG, extended through AUS effort in collaboration with Amit Majumdar

• Examples of applications with hybrid implementations: WRF, POP, ENZO

• To exploit more memory per task, threading offers clear benefits.

• But what about performance?
  – Prior results are mixed; pure MPI often seems at least as good.
  – Historically, systems had fewer cores/socket and fewer cores/node than we have today, and far fewer than they will have in the future.
  – Have OpenMP versions been as carefully optimized?

• Reasons to look into hybrid implementations now (a minimal hybrid skeleton is sketched below)
  – Current T2 systems have 8–16 cores per node.
  – Are we at the tipping point for threading offering a win? If not, is there one, and at what core count, and for which kinds of algorithms?
  – What is the potential for performance improvement?
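To make the hybrid model concrete, here is a minimal, hedged sketch (illustrative only; not XS-WG or HOMB code) of how an MPI+OpenMP program is typically structured: MPI_Init_thread requests MPI_THREAD_FUNNELED so that only the master thread makes MPI calls, while OpenMP threads share each rank's memory and parallelize the on-node work that a pure-MPI run would split across additional ranks.

/* Minimal hybrid MPI+OpenMP skeleton (sketch only; compile e.g. with
 * mpicc -fopenmp hybrid.c -o hybrid).                                  */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Ask for FUNNELED support: only the master thread will call MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        if (rank == 0 && tid == 0)
            printf("%d MPI ranks x %d OpenMP threads\n", nranks, nthreads);
        /* ... threaded computation over this rank's subdomain ... */
    }

    MPI_Finalize();
    return 0;
}

Varying the split while holding total cores fixed (for example, 12 ranks x 1 thread versus 1 rank x 12 threads on a 12-core Kraken node) is the kind of comparison the hybrid study makes.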

Page 6: Hybrid OpenMP-MPI Benchmark (HOMB)

• Developed by Jordan Soyke while a student intern at PSC, and subsequently enhanced by Raghu Reddy

• Simple benchmark code

• Permits systematic evaluation by
  – Varying computation-communication ratio
  – Varying message sizes
  – Varying MPI vs. OpenMP balance

• Allows characterization of performance bounds (a simple time model is sketched below)
  – Characterizing the potential hybrid performance of an actual application is possible with adequate understanding of its algorithms and their implementations.
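As a hedged aid to reading the results (this model is illustrative and not taken from the benchmark's documentation): with R MPI ranks in the pure-MPI run and t OpenMP threads per rank in the hybrid run at the same total core count, a simple per-iteration time model is

T_{\text{pure MPI}} \approx T_{\text{comp}} + T_{\text{comm}}(R), \qquad
T_{\text{hybrid}}   \approx T_{\text{comp}} + T_{\text{comm}}(R/t), \qquad
\text{speedup} = \frac{T_{\text{pure MPI}}}{T_{\text{hybrid}}}.

The compute term is ideally unchanged, while the communication term is evaluated at R/t ranks instead of R; when communication is negligible the ratio approaches 1, and the potential advantage grows with the communication fraction, which is the trend the Kraken measurements on Page 8 show.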

Page 7: Characteristics of the Benchmark

• Perfectly parallel with both MPI and OpenMP

• Perfectly load balanced

• Distinct computation and communication sections

• Only nearest-neighbor communication (illustrated in the sketch after this list)
  – Currently no reduction operations
  – No overlap of computation and communication

• Can easily vary computation/communication ratio

• Current tests are with large messages
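A hedged sketch of the kind of kernel such a benchmark exercises (illustrative only; this is not the actual HOMB source): an OpenMP-threaded compute phase on a 1-D-decomposed grid, followed by a nearest-neighbor halo exchange in which only the master thread calls MPI. The sizes and names are placeholders.

/* Sketch of a HOMB-like compute/communicate cycle (illustrative only).
 * 1-D decomposition; each rank owns N interior rows plus 2 halo rows. */
#include <mpi.h>
#include <stdlib.h>

#define N     1024          /* interior rows per rank (hypothetical)   */
#define WIDTH 4096          /* row length; long rows -> large messages */

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    int up   = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int down = (rank == nranks - 1) ? MPI_PROC_NULL : rank + 1;

    double *u = calloc((size_t)(N + 2) * WIDTH, sizeof *u);

    for (int step = 0; step < 100; step++) {
        /* Computation section: threads update this rank's interior rows. */
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            for (int j = 0; j < WIDTH; j++)
                u[i * WIDTH + j] += 1.0;   /* stand-in for stencil work */

        /* Communication section: nearest-neighbor halo exchange,
         * master thread only (MPI_THREAD_FUNNELED).                 */
        MPI_Sendrecv(&u[1 * WIDTH],       WIDTH, MPI_DOUBLE, up,   0,
                     &u[(N + 1) * WIDTH], WIDTH, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[N * WIDTH],       WIDTH, MPI_DOUBLE, down, 1,
                     &u[0],               WIDTH, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(u);
    MPI_Finalize();
    return 0;
}

Sweeping the grid size, step count, and ranks-versus-threads split is how one varies the computation/communication ratio and message size in a kernel of this shape.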

Page 8: Preliminary Results on Kraken: MPI vs. MPI+OpenMP, 12 threads/node

[Figure: HOMB speedup with increasing communication. Speedup of MPI+OpenMP over pure MPI (y-axis, 1.00–1.35) versus communication fraction (x-axis, 0%–80%).]

• Hybrid could be beneficial for other reasons:
  – Application has limited scalability because of the decomposition
  – Application needs more memory
  – Application has dynamic load imbalance

• The hybrid approach provides an increasing performance advantage as the communication fraction increases.
  – … for the current core count per node.
  – Non-threaded sections of an actual application would have an Amdahl’s Law effect; these results constitute a best-case limit (a bound in this spirit is sketched below).
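To state the Amdahl's Law caveat explicitly (an illustrative bound, not a number from the study): if a fraction s of an application's time is in sections that cannot be threaded, and the remaining fraction 1 - s realizes the measured benchmark speedup S_HOMB, then the whole-application gain from going hybrid is bounded by

S_{\text{app}} \;\le\; \frac{1}{s + (1 - s)/S_{\text{HOMB}}} \;\le\; S_{\text{HOMB}},

so the curve above is a best-case envelope that real applications approach only as s tends to 0.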

Page 9: Technical Challenge Area #2: Tools

• Performance tools, debuggers, compilers, etc.
  – Evaluate strengths and interactions; ensure adequate installations
  – Analyze/address gaps in programming environment infrastructure
  – Provide advanced guidance to RP consultants

• Contributors
  – POC: Mahin Mahmoodi (PSC)
  – Members: Mahmoodi, Wright, Alameda, Shende, Sheppard, Brown, Nystrom

• Current activities
  – Focus on testing debuggers and performance tools at large core counts
  – Ongoing, excellent collaboration between SDCI tool projects, plus consideration of complementary tools
  – Submission for a joint POINT/IPM tools tutorial to TG09
  – Installing and evaluating strengths of tools as they apply to complex production applications

Page 10: Collaborative Performance Engineering Tutorials

• TG09: Using Tools to Understand Performance Issues on TeraGrid Machines: IPM and the POINT Project (June 22, 2009)
  – Karl Fuerlinger (UC Berkeley), David Skinner (NERSC/LBNL), Nick Wright (then SDSC), Rui Liu (NCSA), Allen Malony (Univ. of Oregon), Haihang You (UTK), Nick Nystrom (PSC)
  – Analysis and optimization of applications on the TeraGrid, focusing on Ranger and Kraken.

• SC09: Productive Performance Engineering of Petascale Applications with POINT and VI-HPS (Nov. 16, 2009)
  – Allen Malony and Sameer Shende (Univ. of Oregon), Rick Kufrin (NCSA), Brian Wylie and Felix Wolf (JSC), Andreas Knuepfer and Wolfgang Nagel (TU Dresden), Shirley Moore (UTK), Nick Nystrom (PSC)
  – Addresses performance engineering of petascale scientific applications with TAU, PerfSuite, Scalasca, and Vampir.
  – Includes hands-on exercises using a Live-DVD containing all of the tools, helping to prepare participants to apply modern methods for locating and diagnosing typical performance bottlenecks in real-world parallel programs at scale.

Page 11: Technical Challenge Area #3: Workflow, data transport, analysis, visualization, and storage

• Coordinating massive simulations, analysis, and visualization
  – Data movement between RPs involved in complex simulation workflows; staging data from HSM systems across the TeraGrid
  – Technologies and techniques for in situ visualization and analysis

• Contributors
  – POC: Lonnie Crosby (NICS)
  – Members: Crosby, Welling, Nystrom

• Current activities
  – Focus on I/O profiling and determining platform-specific recommendations for obtaining good performance for common parallel I/O scenarios (one such scenario is sketched below)
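As an example of the kind of "common parallel I/O scenario" meant here (a minimal sketch, not an XS-WG recommendation): every rank writes its own contiguous block of a single shared file with a collective MPI-IO call, the pattern whose platform-specific tuning (hints, striping, buffer sizes) such profiling informs. The file name and block size are placeholders.

/* Minimal collective MPI-IO write: each rank writes one contiguous
 * block of a shared file. Illustrative sketch only.                 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const MPI_Offset block = 1 << 20;       /* doubles per rank (assumed) */
    int rank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(block * sizeof *buf);
    for (MPI_Offset i = 0; i < block; i++)
        buf[i] = (double)rank;              /* stand-in for output data */

    /* All ranks open the same file and write at rank-dependent offsets
     * with the collective call, letting the MPI-IO layer aggregate.   */
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at_all(fh, (MPI_Offset)rank * block * sizeof(double),
                          buf, (int)block, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}

Hints passed via MPI_Info, and file-system settings such as Lustre stripe counts, are where the platform-specific recommendations come in.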

Page 12: Co-organized a Workshop on Enabling Data-Intensive Computing: from Systems to Applications

• July 30–31, 2009, University of Pittsburgh: http://www.cs.pitt.edu/~mhh/workshop09/index.html

• 2 days: presentations, breakout discussions
  – architectures
  – software frameworks and middleware
  – algorithms and applications

• Speakers
  – John Abowd - Cornell University
  – David Andersen - Carnegie Mellon University
  – Magda Balazinska - The University of Washington
  – Roger Barga - Microsoft Research
  – Scott Brandt - The University of California at Santa Cruz
  – Mootaz Elnozahy - International Business Machines
  – Ian Foster - Argonne National Laboratory
  – Geoffrey Fox - Indiana University
  – Dave O'Hallaron - Intel Research
  – Michael Wood-Vasey - University of Pittsburgh
  – Mazin Yousif - The University of Arizona
  – Taieb Znati - The National Science Foundation

From R. Kouzes et al., The Changing Paradigm of Data-Intensive Computing, IEEE Computer, January 2009

Page 13: Next TeraGrid/Blue Waters Extreme-Scale Computing Workshop

• To focus on parallel I/O for petascale applications, addressing:
  – multiple levels of applications, middleware (HDF, MPI-IO, etc.), and systems
  – requirements for data transfers to/from archives and remote processing and management facilities

• Tentatively scheduled for the week of March 22, 2010, in Austin

• Builds on preceding Petascale Application Workshops
  – December 2007, Tempe: general issues of petascale applications
  – June 2008, Las Vegas: more general issues of petascale applications
  – March 2009, Albuquerque: fault tolerance and resilience; included significant participation from NNSA, DOE, and DoD