SC13 19-20 November 2013

Page 1: SC13  19-20 November 2013

SC13 19-20 November 2013

Ben Cash, COLA

From Athena to Minerva: COLA’s Experience in the NCAR Advanced Scientific Discovery Program

Animation courtesy of CIMSS

Page 2: SC13  19-20 November 2013

COLA ASD: Project Minerva – SC13 − November 2013 – Ben Cash

Why does climate research need HPC and Big Data?

• Societal demand for information about weather-in-climate and climate impacts on weather on regional scales

• Seamless days-to-decades prediction & unified weather/climate modeling

• Multi-model ensembles and Earth system prediction

• Requirements for data assimilation

Page 3: SC13  19-20 November 2013

Balancing Demands on Resources

[Diagram: competing demands on Data and HPC Resources. Labels: Resolution, Duration and/or Ensemble size, Complexity, Data Assimilation, 1/120]

Page 4: SC13  19-20 November 2013

COLA HPC & Big Data Projects

Project Athena: An International, Dedicated High-End Computing Project to Revolutionize Climate Modeling (Dedicated XT4 at NICS)
Collaborating Groups: COLA, ECMWF, JAMSTEC, NICS, Cray

Project Minerva: Toward Seamless, High-Resolution Prediction at Intra-seasonal and Longer Time Scales (Dedicated Advanced Scientific Discovery resources on NCAR Yellowstone)
Collaborating Groups: COLA, ECMWF, U. Oxford, NCAR

Page 5: SC13  19-20 November 2013

NICS Resources for Project Athena

• The Cray XT4 – Athena – the first NICS machine in 2008
– 4512 nodes: AMD 2.3 GHz quad-core CPUs + 4 GB RAM
– 18,048 cores + 17.6 TB aggregate memory
– 165 TFLOPS peak performance
– Dedicated to this project during October 2009 – March 2010: 72 million core-hours!

• Other resources made available to project:
– 85 TB Lustre file system
– 258 TB auxiliary Lustre file system (called Nakji)
– Verne: 16-core 128-GB system (data analysis) during production phase (2009-2010)
– Nautilus: SGI UV with 1024 Nehalem EX cores, 8 GPUs, 4 TB memory, 960 TB GPFS disk (data analysis) in 2010-11
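The aggregate figures for Athena follow from the per-node numbers, and a short sketch makes the arithmetic explicit. Two inputs below are assumptions that do not appear on the slide: 4 double-precision flops per core per cycle, and a dedicated window running from 1 October 2009 through 31 March 2010.

```python
from datetime import date

# Figures quoted on the slide for the Cray XT4 "Athena".
NODES = 4512
CORES_PER_NODE = 4        # quad-core CPUs
RAM_PER_NODE_GB = 4
CLOCK_GHZ = 2.3
USED_CORE_HOURS = 72e6

# Assumption (not on the slide): 4 double-precision flops per core per cycle.
FLOPS_PER_CYCLE = 4

cores = NODES * CORES_PER_NODE
memory_tb = NODES * RAM_PER_NODE_GB / 1024
peak_tflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE / 1000

# Assumption: the dedicated window ran 1 Oct 2009 through 31 Mar 2010.
window_days = (date(2010, 4, 1) - date(2009, 10, 1)).days
max_core_hours = cores * window_days * 24
utilization = USED_CORE_HOURS / max_core_hours

print(f"cores:            {cores}")
print(f"aggregate memory: {memory_tb:.1f} TB")
print(f"peak performance: {peak_tflops:.0f} TFLOPS")
print(f"max core-hours:   {max_core_hours / 1e6:.1f} M")
print(f"utilization:      {utilization:.0%}")
```

Under these assumptions the core count and memory match the slide, the peak comes out within about 1 TFLOPS of the quoted 165, and 72 million core-hours corresponds to roughly 91% of what the machine could deliver in the window.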

Many thanks to NICS for resources and sustained support!

Page 6: SC13  19-20 November 2013

Regional Climate Change – Beyond CMIP3 Models’ Ability?

Page 7: SC13  19-20 November 2013

Europe Growing Season (Apr-Oct) Precipitation Change: 20th C to 21st C

T159 (125-km) T1279 (16-km)

“Time-slice” runs of the ECMWF IFS global atmospheric model with observed SST for the 20th century and CMIP3 projections of SST for the 21st century at two different model resolutions

The continental-scale pattern of precipitation change in April – October (growing season) associated with global warming is similar, but the regional details are quite different, particularly in southern Europe.

Page 8: SC13  19-20 November 2013

4X probability of extreme summer drought in Great Plains, Florida, Yucatan, and parts of Eurasia

Future Change in Extreme Summer Drought Late 20th C to Late 21st C

10th Percentile Drought: Number of years out of 47 in a simulation of future climate (2071-2117) for which the June-August mean rainfall was less than the 5th driest year of 47 in a simulation of current climate (1961-2007).
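The drought metric defined above can be sketched as a simple count against a rank-based threshold. The function name and the rainfall values are hypothetical: the data are synthetic and only illustrate the counting rule, not any Athena result.

```python
def count_extreme_drought_years(current_jja, future_jja, rank=5):
    """Count future years whose June-August mean rainfall is below the
    rank-th driest year of the current-climate simulation."""
    threshold = sorted(current_jja)[rank - 1]   # 5th driest of 47 by default
    return sum(1 for rain in future_jja if rain < threshold)


# Synthetic 47-year series (arbitrary units), NOT model output.
current = [100 + i for i in range(47)]          # stands in for 1961-2007
future = [value - 10 for value in current]      # stands in for 2071-2117, uniformly drier

baseline = count_extreme_drought_years(current, current)
projected = count_extreme_drought_years(current, future)
print(baseline, projected)
```

By construction, exactly four years of the current-climate series fall below its own 5th-driest year, so a "4X probability" on the slide would correspond to a count of about 16 of the 47 future years.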

Page 9: SC13  19-20 November 2013

Clouds and Precipitation: Summer 2009 (NICAM 7km)

Page 10: SC13  19-20 November 2013

Page 11: SC13  19-20 November 2013

Athena Limitations

• Athena was a tremendous success, generating a tremendous amount of data and a large number of papers for a six-month project.

• BUT…

• Limited number of realizations
– Athena runs generally consisted of a single realization
– No way to assess robustness of results

• Uncoupled models

• Multiple, dissimilar models
– Resources were split between IFS and NICAM
– Differences in performance meant very different experiments were performed, making it difficult to directly compare results
– Storage limitations and post-processing demands limited what could be saved for each model

Page 12: SC13  19-20 November 2013

Project Minerva

• Explore the impact of increased atmospheric resolution on model fidelity and prediction skill in a coupled, seamless framework by using a state-of-the-art coupled operational long-range prediction system to systematically evaluate the prediction skill and reliability of a robust set of hindcast ensembles at low, medium and high atmospheric resolutions

• NCAR Advanced Scientific Discovery Program to inaugurate Yellowstone

• Allocated 21 M core-hours on Yellowstone

• Used ~28 M core-hours

Many thanks to NCAR for resources & sustained support!

Page 13: SC13  19-20 November 2013

Project Minerva: Background

· NCAR Yellowstone
  · In 2012, NCAR-Wyoming Supercomputing Center (NWSC) debuted Yellowstone, the successor to Bluefire
  · IBM iDataplex, 72,280 cores, 1.5 petaflops peak performance
  · #17 on June 2013 Top500 list
  · 10.7 PB disk capacity
  · High capacity HPSS data archive
  · Dedicated large memory and floating point accelerator clusters (Geyser and Caldera)

· Accelerated Scientific Discovery (ASD) program
  · NCAR accepted a small number of proposals for early access to Yellowstone, as it has done in the past with new hardware installs
  · 3 months of near-dedicated access before being opened to general user community

· Opportunity
  · Continue successful Athena collaboration between COLA and ECMWF, and address limitations in the Athena experiments

Page 14: SC13  19-20 November 2013

Project Minerva: Timeline

· March 2012 – ASD proposal submitted
  · 31 million core hours requested
· April 2012 – Proposal approved
  · 21 million core hours approved
· October 5, 2012
  · First login to Yellowstone – bcash = user #1 (Ben Cash)
· November 21 – Dec 1, 2012
  · Minerva production code finalized
  · Yellowstone system instability due to “cable cancer”
  · Minerva’s low core count jobs avoid problem – user accounts not charged for jobs at this time
  · Minerva benefits by using ~7 million free core hours
  · Minerva jobs occupy as many as 61,000 cores (!)
  · Minerva sets record: “Most IFS FLOPs in 24 hours”
· December 1 – project end
  · Failure rate falls to 1%, then to 0.5%; production computing tailed off in March 2013
  · Data management becomes by far the greatest challenge
· Project Minerva consumption: ~28 million core hours total
  · 800+ TB generated

Many thanks to NCAR for resources & sustained support!

Page 15: SC13  19-20 November 2013

Minerva Catalog

Resolution | Start Dates  | Ensembles  | Length                       | Period of Integration
T319       | May 1        | 15         | 24 months (total)            | 1980-2011
T639       | May 1        | 15         | 24 months (total)            | 1980-2011
T639       | May 1, Nov 1 | 51 (total) | 5 and 4 months, respectively | 2000-2011

Minerva Catalog: Extended Experiments

Resolution | Start Dates  | Ensembles | Length   | Period of Integration
T319       | May 1, Nov 1 | 51        | 7 months | 1980-2011
T639       | May 1, Nov 1 | 15        | 7 months | 1980-2011
T1279      | May 1        | 15        | 7 months | 2000-2011

Page 16: SC13  19-20 November 2013

Project Minerva: Selected Results

• Simulated precipitation

• Tropical cyclones

• SST – ENSO

Page 17: SC13  19-20 November 2013

Precipitation: Summer 2010 (IFS 16km)

Page 18: SC13  19-20 November 2013

Page 19: SC13  19-20 November 2013

Minerva: Coupled Prediction of Tropical Cyclones

11-12 June 2005 hurricane off west coast of Mexico: precipitation in mm/day every 3 hours (T1279 coupled forecast initialized on 1 May 2005)

The predicted maximum rainfall rate reaches 725 mm/day (30 mm/hr)

Based on TRMM global TC rainfall observations (1998-2000), the frequency of rainfall rates exceeding 30 mm/hr is roughly 1%

Courtesy Julia Manganello, COLA

Page 20: SC13  19-20 November 2013

Minerva vs. Athena – TC Frequency (NH; JJASON; T1279)

9-Year Mean (2000-2008): OBS 49.9, Athena 59.1, Minerva 48.9 (all members)

[Figure legend: Athena, Minerva, IBTrACS]

Courtesy Julia Manganello, COLA

Page 21: SC13  19-20 November 2013

[Figure labels: Jul, Sep, Nov]

Page 22: SC13  19-20 November 2013

Project Minerva: Lessons Learned

• More evidence that dedicated usage of a relatively big supercomputer greatly enhances productivity
– Experience with ASD period demonstrates tremendous progress can be made with dedicated access

• Dedicated computing campaigns provide demonstrably more efficient utilization
– Noticeable decrease in efficiency once scheduling multiple jobs of multiple sizes was turned over to a scheduler

• In-depth exploration
– Data saved at much higher frequency
– Multiple ensemble members, increased vertical levels, etc.

Page 23: SC13  19-20 November 2013

Project Minerva: Lessons Learned

• Dedicated simulation projects like Athena and Minerva generate enormous amounts of data to be archived, analyzed and managed. Data management is a big challenge.

• Other than machine instability, data management and post-processing were solely responsible for halts in production.

Page 24: SC13  19-20 November 2013

Data Volumes

• Project Athena: Total data volume 1.2 PB (~500 TB unique)*
– Spinning disk: 40 TB at COLA, 0 TB at NICS (was 340 TB)
– * no home after April 2014

• Project Minerva: Total data volume 1.0 PB (~800 TB unique)
– Spinning disk: 100 TB at COLA, 500 TB at NCAR (for now)

• That much data breaks everything: H/W, systems management policies, networks, apps S/W, tools, and shared archive space

• NB: Generating 800 TB using 28 M core-hours took ~3 months; this would take about a week using a comparable fraction of a system with 1M cores!
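The "about a week" estimate can be reproduced with a rough calculation. This is a sketch only: it assumes "~3 months" means 90 days of production and that "a comparable fraction" means the same share of cores that Minerva averaged on Yellowstone (72,280 cores, as quoted on the Background slide).

```python
# Figures quoted in the slides.
CORE_HOURS = 28e6
DATA_TB = 800
YELLOWSTONE_CORES = 72_280
FUTURE_CORES = 1_000_000

# Assumption: "~3 months" is taken as 90 days of production.
PRODUCTION_DAYS = 90

# Average number of cores Minerva kept busy, and its share of the machine.
avg_cores = CORE_HOURS / (PRODUCTION_DAYS * 24)
fraction = avg_cores / YELLOWSTONE_CORES

# Same share of a 1M-core system, same total work.
future_days = CORE_HOURS / (fraction * FUTURE_CORES) / 24

print(f"average cores used:  {avg_cores:,.0f} ({fraction:.0%} of Yellowstone)")
print(f"data rate on Minerva: {DATA_TB / PRODUCTION_DAYS:.1f} TB/day")
print(f"time on 1M cores:     {future_days:.1f} days")
print(f"data rate on 1M cores: {DATA_TB / future_days:.0f} TB/day")
```

Under these assumptions the same 800 TB would arrive in about 6.5 days, which is more than 100 TB per day that the storage and post-processing systems would need to absorb.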

Page 25: SC13  19-20 November 2013

Challenges and Tensions

• Making effective use of large allocations – takes a village

• Exaflood of data

• Resolution vs. parameterization

• Sampling (e.g. extreme events)

TENSIONS

HPC capability vs. Data analysis capacity
Automation/abstraction vs. Human control
Data-driven development vs. Science-driven development
Small, portable code vs. End-to-end tools
Tight, local control of data vs. Distributed data

“Having more data won’t substitute for thinking hard, recognizing anomalies, and exploring deep truths.” Samuel Arbesman, Wash. Post (18 Aug. 2013)

Page 26: SC13  19-20 November 2013

Athena and Minerva: Harbingers of the Exaflood

• Even on a system designed for big projects like these, HPC production capabilities overwhelm storage and processing, a particularly acute problem for ‘rapid burn’ projects such as Athena and Minerva

• Familiar diagnostics are hard to do at very high resolution

• Can’t “just recompute” – years of data analysis and mining after production phase

• Have we wrung all the “science” out of the data sets, given that we can only keep a small percentage of the total data volume on spinning disk? How can we tell?

• Must move from ad hoc problem solving to systematic, repeatable workflows (e.g. incorporate post-processing and data management into production stream): transform Noah’s Ark into a Shipping Industry

“We need exaflood insurance.” - Jennifer Adams

Page 27: SC13  19-20 November 2013

ANY QUESTIONS?

Page 28: SC13  19-20 November 2013
