
From Athena to Minerva: A Brief Overview

Ben Cash
Minerva Project Team, Minerva Workshop, GMU/COLA, September 16, 2013

Athena Background

· World Modeling Summit (WMS; May 2008)
  · The Summit called for a revolution in climate modeling to more rapidly advance improvements in climate model resolution, accuracy, and reliability
  · Recommended petascale supercomputers dedicated to climate modeling

· Athena supercomputer
  · The U.S. National Science Foundation responded, offering to dedicate the retiring Athena supercomputer over a six-month period in 2009-2010
  · An international collaboration was formed among groups in the U.S., Japan, and the U.K. to use Athena to take up the challenge

Project Athena

· Dedicated supercomputer
  · Athena was a Cray XT-4 with 18,048 computational cores
  · Replaced by the new Cray XT-5, Kraken, with 99,072 cores (since increased)
  · #21 on the June 2009 Top 500 list
  · 6 months, 24/7, 99.3% utilization
  · Over 1 PB of data generated

· Large international collaboration
  · Over 30 people, 6 groups, 3 continents

· State-of-the-art global AGCMs
  · NICAM (JAMSTEC/U. Tokyo): Nonhydrostatic Icosahedral Atmospheric Model
  · IFS (ECMWF): Integrated Forecast System
  · Highest possible spatial resolution

Athena Science Goals

• Hypothesis: Increasing climate model resolution to accurately resolve mesoscale phenomena in the atmosphere (and ocean and land surface) can dramatically improve the fidelity of the models in simulating climate – mean, variances, covariances, and extreme events.

• Hypothesis: Simulating the effect of increasing greenhouse gases on regional aspects of climate, especially extremes, may, for some regions, depend critically on the spatial resolution of the climate model.

• Hypothesis: Explicitly resolving important processes, such as clouds in the atmosphere (and eddies in the ocean and landscape features on the continental surface), without parameterization, can improve the fidelity of the models, especially in describing the regional structure of weather and climate.

Qualitative Analysis: 2009 NICAM Precipitation and Cloudiness, May 21 – August 31

Athena Lessons Learned

• Dedicated usage of a relatively big supercomputer greatly enhances productivity

• Dealing with only a few users and their requirements allows for more efficient utilization of resources

• Challenge: Dedicated simulation projects like Project Athena can generate enormous amounts of data to be archived, analyzed, and managed. NICS (and TeraGrid) do not currently have enough storage capacity. Data management is a big challenge.

• Preparation time: At least 2 to 3 weeks were needed before the beginning of dedicated runs to test and optimize the codes and to plan strategies for optimal use of the system. Communication throughout the project was essential (weekly telecons, email lists, personal calls, …)

Athena Limitations

• Athena was a tremendous success, generating a tremendous amount of data and a large number of papers for a six-month project.
• BUT…
• Limited number of realizations
  • Athena runs generally consisted of a single realization
  • No way to assess robustness of results (see the ensemble-spread sketch below)
• Uncoupled models
• Multiple, dissimilar models
  • Resources were split between IFS and NICAM
  • Differences in performance meant very different experiments were performed – difficult to directly compare results
• Storage limitations and post-processing demands limited what could be saved for each model
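
To illustrate why single realizations are limiting, here is a minimal sketch of the kind of ensemble check that multiple realizations enable: with several members of the same experiment, the ensemble mean and member-to-member spread indicate whether a signal is robust to initial-condition uncertainty. The array shapes, variable names, and synthetic data below are hypothetical illustrations, not part of either project's actual workflow.

```python
import numpy as np

# Hypothetical ensemble of seasonal-mean anomalies with shape
# (members, lat, lon). With a single realization the member-to-member
# spread is undefined, so robustness cannot be assessed; with many
# members it can.
rng = np.random.default_rng(seed=0)
ensemble = rng.normal(loc=0.5, scale=1.0, size=(15, 90, 180))  # stand-in data

ens_mean = ensemble.mean(axis=0)            # ensemble-mean signal
ens_spread = ensemble.std(axis=0, ddof=1)   # spread across members

# Simple robustness criterion: signal magnitude exceeds member spread
# (signal-to-noise ratio > 1) at a given grid point.
robust = np.abs(ens_mean) > ens_spread
print(f"Fraction of grid points with a robust signal: {robust.mean():.2f}")
```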

Minerva Background

· NCAR Yellowstone
  · In 2012, the NCAR-Wyoming Supercomputing Center (NWSC) debuted Yellowstone, the successor to Bluefire, their previous production platform
  · IBM iDataPlex, 72,280 cores, 1.5 petaflops peak performance
  · #17 on the June 2013 Top 500 list
  · 10.7 PB disk capacity – a vast increase over the capacity available during Athena
  · High-capacity HPSS data archive
  · Dedicated high-memory analysis clusters (Geyser and Caldera)

• Accelerated Scientific Discovery (ASD) program
  · Recognizing that many groups would not be ready to take advantage of the new architecture, NCAR accepted a small number of proposals for early access to Yellowstone
  · 3 months of near-dedicated access before being opened to the general user community
  · Opportunity to continue the successful Athena collaboration between COLA and ECMWF, and to address limitations in the Athena experiments

Minerva Timeline

· March 2012 – Proposal finalized and submitted
  · 31 million core hours requested
· April 2012 – Proposal accepted
  · 21 million core hours approved
  · Anticipated date of production start: July 21
  · Code testing and benchmarking on Janus begins
· October 5, 2012
  · First login to Yellowstone – bcash reportedly user 1
· October – November 23, 2012
  · Jobs are plagued by massive system instabilities and a conflict between the code and the Intel compiler

Minerva Timeline continued

· November 24 – Dec 1, 2012
  · Code conflict resolved; low core count jobs avoid the worst of the system instability
  · Minerva jobs occupy 61,000 cores (!)
  · Peter Towers estimates Minerva easily sets the record for “Most IFS FLOPs in a 24 hour period”
  · Jobs rapidly overrun the initial 250 TB disk allocation, triggering a request for additional resources
    · This becomes a Minerva project theme
  · Due to system instability, user accounts are not charged for jobs at this time
    · Roughly 7 million free core hours as a result: 28 million total
  · 800+ TB generated

Minerva Catalog: Base Experiments

Resolution   Start Dates    Ensembles    Length                         Period of Integration
T319         May 1          15           24 months (total)              1980-2011 **
T639         May 1          15           24 months (total)              1980-2011
T639         May 1, Nov 1   51 (total)   5 and 4 months, respectively   2000-2011

Minerva Catalog: Extended Experiments

Resolution   Start Dates    Ensembles   Length     Period of Integration
T319         May 1, Nov 1   51          7 months   1980-2011
T639         May 1, Nov 1   15          7 months   1980-2011
T1279        May 1          15          7 months   2000-2011

** to be completed
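
To make the catalog concrete, here is a minimal sketch representing the experiment matrix above as structured data, so it can be queried programmatically. The dataclass and field names are illustrative only; the entries are taken verbatim from the two tables, and the "Ensembles" and "Length" columns are kept as text because the "(total)" qualifiers mix counts and totals.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Experiment:
    tier: str              # "base" or "extended"
    resolution: str        # IFS spectral truncation, e.g. "T639"
    start_dates: List[str]
    ensembles: str         # kept as text: the tables mix counts and totals
    length: str
    period: str            # period of integration

# Entries transcribed from the Minerva catalog tables above.
catalog = [
    Experiment("base",     "T319",  ["May 1"],          "15",         "24 months (total)",            "1980-2011"),
    Experiment("base",     "T639",  ["May 1"],          "15",         "24 months (total)",            "1980-2011"),
    Experiment("base",     "T639",  ["May 1", "Nov 1"], "51 (total)", "5 and 4 months, respectively", "2000-2011"),
    Experiment("extended", "T319",  ["May 1", "Nov 1"], "51",         "7 months",                     "1980-2011"),
    Experiment("extended", "T639",  ["May 1", "Nov 1"], "15",         "7 months",                     "1980-2011"),
    Experiment("extended", "T1279", ["May 1"],          "15",         "7 months",                     "2000-2011"),
]

# Example query: which experiments run at the highest resolution?
for exp in catalog:
    if exp.resolution == "T1279":
        print(exp)
```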

Qualitative Analysis: 2010 T1279 Precipitation, May – November

Minerva Lessons Learned

• Dedicated usage of a relatively big supercomputer greatly enhances productivity
  • Experience with the early usage period demonstrates tremendous progress can be made with dedicated access

• Dealing with only a few users allows for more efficient utilization
  • Noticeable decrease in efficiency once scheduling multiple jobs of multiple sizes was turned over to a scheduler
  • NCAR resources were initially overwhelmed by the challenges of the new machine and the individual problems that arose

• Focus on a single model allows for in-depth exploration
  • Data saved at much higher frequency
  • Multiple ensemble members, increased vertical levels, etc.

• Dedicated simulation projects like Athena and Minerva generate enormous amounts of data to be archived, analyzed, and managed. Data management is a big challenge.
  • Other than machine instability, data management and post-processing were solely responsible for halts in production
  • Even on a system designed with lessons from Athena in mind, production capabilities overwhelm storage and processing
  • Post-processing and storage must be incorporated into the production stream (see the sketch below)
  • ‘Rapid burn’ projects such as Athena and Minerva are particularly prone to overwhelming storage resources
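
A minimal sketch of what folding post-processing into the production stream might look like: each completed run is reduced and archived before the next run's output lands on disk, rather than letting raw output pile up until the allocation is exhausted. Everything here (paths, file patterns, function names, the reduction step) is hypothetical and illustrates only the scheduling idea, not the actual Minerva workflow.

```python
import shutil
import subprocess
from pathlib import Path

RAW = Path("/scratch/minerva/raw")        # hypothetical raw-output directory
REDUCED = Path("/scratch/minerva/reduced")

def post_process(run_dir: Path) -> Path:
    """Reduce a completed run's output before it accumulates on disk.
    Stand-in for regridding / subsetting / compression steps."""
    out = REDUCED / run_dir.name
    out.mkdir(parents=True, exist_ok=True)
    for f in run_dir.glob("*.grb"):       # hypothetical file pattern
        target = out / f.name
        shutil.copy(f, target)
        subprocess.run(["gzip", "-f", str(target)], check=True)
    return out

def archive_and_clean(run_dir: Path, reduced: Path) -> None:
    """Send reduced output to the tape archive, then free the scratch
    space so the next run does not overrun the disk allocation."""
    subprocess.run(["hsi", "put", "-R", str(reduced)], check=True)  # HPSS-style put
    shutil.rmtree(run_dir)

# Production loop: post-process each run as it completes, instead of
# deferring all data handling until storage is already overwhelmed.
for run_dir in sorted(RAW.iterdir()):
    archive_and_clean(run_dir, post_process(run_dir))
```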

• Despite advances beyond Athena, more work to be done
  • Focus of Tuesday discussion
  • Fill in the matrix of experiments
  • Further increases in ocean and atmospheric resolution
  • Sensitivity tests (aerosols, greenhouse gases)
  • ??

Beyond Minerva: A New Pantheon