![Page 1: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/1.jpg)
Final Report on Magellan and Update on Advanced Networking Initiative
Kathy Yelick
Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory
Professor of EECS, UC Berkeley
![Page 2: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/2.jpg)
High Performance Computing in Science
Science at Scale
Science through Volume
Science in Data
![Page 3: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/3.jpg)
Science at Scale: Simulations Aid in Understanding Climate Impacts
• Warming ocean and Antarctic ice sheet key to sea level rise
• Previous models inadequate
• BISICLES ice sheet model, built on FASTMath Chombo, uses AMR to resolve the ice-ocean interface
– Dynamics: very fine resolution (AMR)
– Antarctica still very large (scalability)
• Ongoing collaboration among BISICLES and BER-sponsored IMPACTS, COSIM to couple ice sheet and ocean models
– 19M ALCC hours at NERSC
Figure: Antarctic ice speed (left): AMR enables sub-1 km resolution (black, above) (using NERSC’s Hopper)
Figure: BISICLES Pine Island Glacier simulation; mesh resolution crucial for grounding line behavior
Figure: Enhanced POP ocean model solution for coupling to ice
![Page 4: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/4.jpg)
Science through Volume: Screening Diseases to Batteries
• Large number of simulations covering a variety of related materials, chemicals, proteins,…
Figure labels: Today’s batteries; Interesting materials…; Voltage limit
Materials Genome: Cut in half the 18 years from design to manufacturing, e.g., 20,000 potential battery materials stored in a database
Dynameomics Database: Improve understanding of disease and drug design, e.g., 11,000 protein unfolding simulations stored in a public database
![Page 5: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/5.jpg)
Science in Data: From Simulation to Image Analysis
LBNL Computing key in 3 Nobel Prizes
• Simulations at NERSC modeled the appearance of supernovae
• CMB data analysis done at CRD/NERSC
• IPCC simulations have used NERSC
LBNL Computing key in 4 of 10 Science Breakthroughs of the decade
• 3 Genomics problems + CMB
Data rates from experimental devices will require exascale volume computing
• Cost of sequencing > Moore’s Law
• Rate+Density of CCDs > Moore’s Law
• Computing > Data, O(n²) common
• Computing performance < Moore’s Law
Figure: Projected Rates, increase over 2010 (y-axis 0 to 60, x-axis 2010 to 2015). Series: Sequencers, Detectors, Processors, Memory.
![Page 6: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/6.jpg)
DOE Facilities have Huge Science Data Challenges
• Petabyte data sets today, many growing exponentially
• Processing grows super-linearly
• Exascale is both a driver and solution to Data challenges
Astronomy
Particle Physics
Chemistry and Materials Genomics
Fusion
Petascale to Exascale
![Page 7: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/7.jpg)
Two ARRA Projects to Explore Advanced Technology for Science
• ANI: Advanced Networking Initiative
– Accelerate 100 Gbps networking
– Prototype national network (transition to production later)
– Testbed for networking R&D
• Magellan: Cloud testbed for science
– Can massive numbers of mid-range science jobs make effective use of a cloud?
– Use of a testbed hosted at ALCF at Argonne and NERSC at Berkeley Lab
Science through Volume
Science in Data
![Page 8: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/8.jpg)
ESnet is a Unique Capability for Science
ESnet designed for large data
– Connects 40 DOE sites to 140 other networks
– 72% annual traffic growth exceeds commercial networks
– 50% of traffic is from “big data”
First in performance:
– First 100G continental scale network
– Will transition to production this year
– ANI dark fiber can be leveraged to develop and deliver 1 terabit
– Services: Bandwidth reservations, monitoring, research testbeds
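The 72% annual traffic growth quoted for ESnet compounds quickly. As a rough illustration only, the sketch below assumes the rate stays constant from year to year (an assumption made for this example, not a claim from the slide) and computes how long traffic takes to double and how much it grows over several years. The function names are illustrative.

```python
import math

# 72% annual traffic growth, as quoted on the slide.
ANNUAL_GROWTH = 0.72


def growth_factor(years, rate=ANNUAL_GROWTH):
    """Traffic multiplier after `years` of constant compound growth."""
    return (1.0 + rate) ** years


def years_to_multiply(factor, rate=ANNUAL_GROWTH):
    """Years of constant compound growth needed to reach `factor` times the start."""
    return math.log(factor) / math.log(1.0 + rate)


if __name__ == "__main__":
    print(f"doubling time: {years_to_multiply(2):.2f} years")
    print(f"time to 10x:   {years_to_multiply(10):.2f} years")
    print(f"growth over 5 years: {growth_factor(5):.1f}x")
```

Under this constant-rate assumption, traffic doubles in a little over a year, which is the kind of pressure that motivates the move from 10G to 100G links.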
![Page 9: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/9.jpg)
ESnet Policy Board
Policy Board highlights:
– Outstanding people/operations to be preserved
– Leverage unique dark fiber testbed for data-intensive science and basic networking research
![Page 10: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/10.jpg)
Advanced Networking Initiative
• Goal: Accelerate 100 Gbps networking
• 100 Gbps Prototype National Network
– 4 sites (ALCF, OLCF, NERSC, and NY international exchange point)
• Network Research Testbed
– Dark fiber
– Research project support
• Starting point in 2009:
– No 100 Gbps standard; no carrier plans for 100G; little dark fiber due to consolidation
![Page 11: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/11.jpg)
Advanced Networking Initiative
2009: “Table-top” testbed created; purchased Long Island dark fiber
2010: Transport RFP released; thirteen testbed projects started
2011: Partner with Internet2 (Level3 / Ciena / Alcatel-Lucent); 100Gb prototype to 4 sites
2012: Complete network buildout (Oct); 100G production “ESnet5” (Dec)
![Page 12: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/12.jpg)
100Gbps Prototype Network
• Combines ANI funding with Internet2 stimulus funds to build full national footprint
• Internet2/Level3 Communications/Indiana Univ. manage the optical equipment and supporting infrastructure
• Uses Ciena Activeflex 6500 optical equipment
– Backbone network: chassis and fiber owned by Internet2, but ESnet purchases and owns transponder cards
– Metropolitan networks: All equipment and fiber owned by ESnet
– Ability to provision wavelengths between any two add/drop or regeneration locations on network
• Uses Alcatel-Lucent 7750 routers
– 14 chassis deployed with 33 100 Gbps interfaces
![Page 13: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/13.jpg)
Testbed: Monitoring And Visualization of Energy in Networks (MAVEN)
• Establish energy baseline for end-to-end networking
• Provide real operational data to researchers
• Identify opportunities for improved efficiency
• Optimize globally (network of centers)
• First of kind in ESnet5
Figure: Visualization of energy (alpha version, unreleased) consumed by ESnet’s ANI prototype network.
“what gets measured gets improved”
![Page 14: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/14.jpg)
Testbed: End-to-End Circuit Service with OpenFlow
• Dynamic “tunnels” across wide area
– No manual configuration of virtual circuit
– Automated discovery of circuit end-points
• High Performance RDMA-over-Ethernet (Remote Direct Memory Access)
– 9.8 Gbps out of 10 Gbps NY to WA at SC11
– Low overhead: 4% CPU vs. 80% with 1-stream TCP
– No special host hardware except RDMA
Figure: Fully Automated, End to End, Dynamically Stitched, Virtual Connection (diagram labels: OSCARS/ESnet4; BNL, NY; SC11, Seattle, WA)
![Page 15: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/15.jpg)
ANI Legacy
• Unique 100G networking facility:
– Connects DOE facilities (experimental, computational)
– Enables first-of-kind “Big Data” science
– Optimizations (OSCARS, perfSONAR, ScienceDMZ and Data Transfer Nodes)
• Dark Fiber for future ESnet upgrades
– Future optical gear, routers, systems
• Dark Fiber for networking research
– Enable previously-impossible wide area, high performance research for universities/companies
![Page 16: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/16.jpg)
The Magellan Team
• Magellan/NERSC – Shane Canon, Lavanya Ramakrishnan, Tina Declerck, Iwona Sakrejda, Scott Campbell, Brent Draney, Jeff Broughton
• Magellan/ANL – Susan Coghlan, Adam Scovel, Piotr T Zbiegiel, Narayan Desai, Rick Bradshaw, Anping Liu
• Amazon Benchmarking – Krishna Muriki, Nick Wright, John Shalf, Keith Jackson, Harvey Wasserman, Shreyas Cholia
• Applications – Jared Wilkening, Gabe West, Ed Holohan, Doug Olson, Jan Balewski, STAR collaboration, K. John Wu, Alex Sim, Prabhat, Suren Byna, Victor Markowitz
![Page 17: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/17.jpg)
Cloud Computing Hype
Gartner’s 2010 Emerging Technologies Hype Cycle
Figure: hype cycle chart with a callout on Cloud Computing
![Page 18: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/18.jpg)
What is a Cloud?
According to NIST…
• Resource pooling. Resources are pooled across users for efficiency.
• Broad network access. Capabilities are available over the network.
• Measured Service. Usage is monitored and reported for transparency (pay-as-you-go).
• Elasticity. Capabilities can be rapidly scaled out and in.
• Self-service. Configuration without on-site system administration
![Page 19: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/19.jpg)
Why Clouds for Science?
• Resource pooling.
– HPC Centers run at 90% utilization
– Commercial clouds at 60% utilization
• Measured Service (pay-as-you-go).
– HPC Centers charge in hours (not fungible with cash)
– Commercial clouds charge in dollars
• Elasticity.
– HPC Centers allow job scale-up but users wait in queues
– Commercial clouds allow rapid growth in aggregate work
• Self-service (control vs. ease-of-use).
– HPC Centers: fix some software (OS, compilers)
– EC2 DIY administration; others fix entire software model
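The utilization figures above translate directly into cost per hour of useful work. The sketch below is a minimal illustration, assuming (hypothetically) that both kinds of facility pay the same amount for each provisioned hour, so that utilization is the only difference. The function names are illustrative.

```python
def cost_per_used_hour(cost_per_provisioned_hour, utilization):
    """Cost of one hour of useful work when only `utilization` of capacity is used."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return cost_per_provisioned_hour / utilization


def utilization_penalty(util_a, util_b):
    """How much more a used hour costs at util_b than at util_a, all else equal."""
    return cost_per_used_hour(1.0, util_b) / cost_per_used_hour(1.0, util_a)


if __name__ == "__main__":
    # 90% (HPC centers) versus 60% (commercial clouds), as quoted on the slide.
    print(f"penalty at 60% vs 90% utilization: {utilization_penalty(0.9, 0.6):.2f}x")
```

With the slide's numbers the lower utilization alone makes each used hour 1.5 times as expensive, before any difference in hardware price, staffing, or profit is considered.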
![Page 20: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/20.jpg)
Magellan Research Agenda and Lines of Inquiry
• Are the open source cloud software stacks ready for DOE HPC science?
• Can DOE cyber security requirements be met within a cloud?
• Are the new cloud programming models useful for scientific computing?
• Can DOE HPC applications run efficiently in the cloud? What applications are suitable for clouds?
• How usable are cloud environments for scientific applications?
• When is it cost effective to run DOE HPC science in a cloud?
![Page 21: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/21.jpg)
Magellan Testbed Architected for Flexibility
Architecture diagram components:
• Compute Servers: 504 nodes at ANL, 720 nodes at NERSC; Intel Nehalem, 8 cores/node
• QDR InfiniBand
• Mgt Nodes (12)
• Gateway Nodes (16)
• File Servers (8) (/home), 160 TB
• Active Storage Servers, FLASH/SSD Storage
• GPU Servers: 266 Nvidia cards at ANL
• Big Memory Servers: 1 TB of memory per node; 15 at ANL / 2 at NERSC
• Aggregation Switch and Router
• ESnet 10 Gb/s
• ANI 100 Gb/s (Future)
QDR InfiniBand + 100 Gbps to ANI
![Page 22: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/22.jpg)
In 2009: Significant interest in cloud computing for science
Figure: survey responses (x-axis 0% to 90%). Response labels as captured, several cut off in the chart:
• Access to additional resources
• Access to on-demand (commercial)
• Ability to control software
• Ability to share setup of software or
• Ability to control groups/users
• Exclusive access to the computing
• Easier to acquire/operate than a
• Cost associativity? (i.e., I can get 10
• MapReduce Programming Model/Hadoop File System
• User interfaces/Science Gateways
![Page 23: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/23.jpg)
Demonstration of Cloud Technology for Science
![Page 24: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/24.jpg)
Magellan Timeline
Figure: project timeline covering 2010 and 2011
![Page 25: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/25.jpg)
Federated Clouds provide elasticity, but with significant administrative support
STAR performed real-time analysis of data coming from Brookhaven Nat. Lab
• First time data was analyzed in real-time to a high degree
• Leveraged existing OS image from NERSC system
• Started out with 20 VMs at NERSC and expanded to ANL.
![Page 26: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/26.jpg)
Performance of Clouds for Science
![Page 27: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/27.jpg)
Applications Cover Algorithm and Science Space
Table: science areas versus algorithm classes (Dense, Sparse, Spectral, Particles, Structured, Unstructured, Independent). Column alignment is not preserved; marks and benchmark codes are listed per row in source order.
• Accelerators: X; X IMPACT; X IMPACT; X IMPACT; X
• Fluids / Astro: X; X MAESTRO; X; X; X MAESTRO; X (MAESTRO)
• Chemistry: X GAMESS; X; X; X
• Climate: X CAM; X CAM; X
• Fusion: X; X; X GTC; X GTC; X
• Nuclear QCD: X MILC; X MILC; X MILC; X MILC
• Materials: X PARATEC; X PARATEC; X; X PARATEC
• Biology: X BLAST
Parallel job size and input data drastically reduced for cloud benchmarking
![Page 28: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/28.jpg)
Slowdown of Clouds Relative to an HPC System
Study by Jackson, Ramakrishnan, Muriki, Canon, Cholia, Shalf, Wasserman, Wright
Figure: slowdown relative to HPC system (y-axis 0 to 20). Series: Commercial Cloud. One off-scale value is labeled 53x.
![Page 29: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/29.jpg)
HPC Commercial Cloud Results
• Commercial HPC clouds catch up with clusters if set up as shared cluster
– High speed network (10GigE) and no over-subscription
– Some slowdown from virtualization
Keith Jackson, Lavanya Ramakrishnan, John Shalf, Harvey Wasserman
Figure: runtime relative to supercomputer (y-axis 0 to 20). Series: Commercial Cloud, Magellan, EC2-Beta-Opt. One off-scale value is labeled 53x.
![Page 30: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/30.jpg)
TCP is slower than IB even at modest concurrency
Figure: HPCC PingPong latency (us) versus number of cores (32 to 1024); y-axis 0 to 250. Series: IB, TCPoIB, 10G-TCPoEth, Amazon CC, 10G-TCPoEth VM, 1G-TCPoEth. Annotations: “Better”, “40X”.
![Page 31: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/31.jpg)
Network Hardware and Protocol Matter (PARATEC)
Figure: PARATEC performance versus number of cores (32 to 1024); y-axis 0 to 14. Series: IB, TCPoIB, 10G-TCPoEth, 1G-TCPoEth. Annotations: “Better”, “TCP can’t keep up”.
![Page 32: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/32.jpg)
Virtualization Penalty is Substantial (PARATEC)
Figure: PARATEC performance versus number of cores (32 to 1024); y-axis 0 to 14. Series: IB, 10G-TCPoEth, Amazon CC, 10G-TCPoEth VM. Annotations: “Better”, “Virtualization overhead increases with core count”.
![Page 33: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/33.jpg)
Elasticity Requirements for Science
![Page 34: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/34.jpg)
Job Size Mix on Hopper “Unleashed”
Figure: Breakdown of Computing Hours by Job Size (raw hours, 0% to 100%). Job-size bins: <1%, <10%, <43%, >43%.
• Hopper is a 153,216 core system. During availability period, over 50% of hours were used for jobs larger than 16k cores.
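To relate the "16k cores" threshold to the chart's bins, the sketch below classifies a job by the share of Hopper it occupies. It assumes the bin labels (<1%, <10%, <43%, >43%) are fractions of the machine's 153,216 cores, which the slide does not state explicitly, and it takes "16k" to mean 16,384. Both are assumptions for illustration.

```python
HOPPER_CORES = 153_216  # system size quoted on the slide

# Upper edges of the chart's bins, read here as fractions of the machine (assumption).
BIN_EDGES = [(0.01, "<1%"), (0.10, "<10%"), (0.43, "<43%")]


def machine_fraction(job_cores, total_cores=HOPPER_CORES):
    """Share of the machine a job of `job_cores` cores occupies."""
    return job_cores / total_cores


def size_bin(job_cores, total_cores=HOPPER_CORES):
    """Return the chart bin label for a job of the given size."""
    frac = machine_fraction(job_cores, total_cores)
    for edge, label in BIN_EDGES:
        if frac < edge:
            return label
    return ">43%"


if __name__ == "__main__":
    for cores in (1_024, 15_000, 16_384, 100_000):
        print(f"{cores:>7} cores = {machine_fraction(cores):6.2%} of Hopper, bin {size_bin(cores)}")
```

Under these assumptions a 16,384-core job is just under 11% of the machine, so "jobs larger than 16k cores" corresponds roughly to the two largest bins.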
![Page 35: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/35.jpg)
On-demand science access might be difficult if not impossible
Number of cores required to run a job immediately upon submission to Franklin
![Page 36: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/36.jpg)
Costs of Clouds for Science
![Page 37: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/37.jpg)
Cloud is a business model and can be applied to HPC centers
Comparison of Cloud and HPC Centers:
• NIST Definition
– Cloud: Resource pooling, broad network access, measured service, rapid elasticity, on-demand self service
– HPC Centers: Resource pooling, broad network access, measured service. Limited: rapid elasticity, on-demand self service
• Computational Needs
– Cloud: Bounded computing requirements; sufficient to meet customer demand or transaction rates
– HPC Centers: Virtually unbounded requirements; scientists always have larger, more complicated problems to simulate or analyze
• Scaling Approach
– Cloud: Scale-in. Emphasis on consolidating in a node using virtualization
– HPC Centers: Scale-out. Applications run in parallel across multiple nodes
• Workloads
– Cloud: High-throughput, modest-data workloads
– HPC Centers: Highly synchronous, large-concurrency parallel codes with significant I/O and communication
• Software Stack
– Cloud: Flexible, user-managed custom software stacks
– HPC Centers: Access to parallel file systems and low-latency, high-bandwidth interconnect. Preinstalled, pre-tuned application software stacks for performance
![Page 38: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/38.jpg)
Public clouds compared to private HPC Centers
Component Cost
Compute Systems (1.38B hours) $180,900,000
HPSS (17 PB) $12,200,000
File Systems (2 PB) $2,500,000
Total (Annual Cost) $195,600,000
Overestimate: These are “list” prices, but...
Underestimate:
• Doesn’t include the measured performance slowdown of 2x-10x.
• This still only captures about 65% of NERSC’s $55M annual budget. No consulting staff, no administration, no support.
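The arithmetic behind this slide can be checked in a few lines. The sketch below sums the component prices, derives the implied cloud price per compute hour, and compares the total against 65% of the $55M budget. Treating that 65% share as the like-for-like baseline is one reading of the slide, adopted here as an assumption; the names are illustrative.

```python
# Figures copied from the slide (cloud list prices for one year of NERSC-scale usage).
COMPUTE_COST = 180_900_000     # compute systems, 1.38B hours
HPSS_COST = 12_200_000         # HPSS, 17 PB
FILE_SYSTEM_COST = 2_500_000   # file systems, 2 PB
COMPUTE_HOURS = 1.38e9
NERSC_BUDGET = 55_000_000
BUDGET_FRACTION_COVERED = 0.65  # share of the budget the priced items capture


def cloud_total():
    """Annual cloud cost for compute plus storage."""
    return COMPUTE_COST + HPSS_COST + FILE_SYSTEM_COST


def cloud_price_per_hour():
    """Implied list price per compute hour."""
    return COMPUTE_COST / COMPUTE_HOURS


def cost_ratio():
    """Cloud total relative to the covered share of the NERSC budget (assumed baseline)."""
    return cloud_total() / (NERSC_BUDGET * BUDGET_FRACTION_COVERED)


if __name__ == "__main__":
    print(f"cloud total:        ${cloud_total():,}")
    print(f"price per hour:     ${cloud_price_per_hour():.3f}")
    print(f"cloud / NERSC cost: {cost_ratio():.1f}x")
```

The component prices do sum to the $195.6M total, and the resulting ratio falls inside the 3–7x range quoted in the key findings, before any allowance for the 2x-10x performance slowdown.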
![Page 39: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/39.jpg)
Factors in Price
Table columns: Factor; HPC Center; Public Cloud. $ means “cost disadvantage”. Column placement of the marks is not preserved.
• Utilization (30% private, 90% HPC, 60%? Cloud); note: trades off against wait times, elasticity: $$
• Cost of people; largest machines have lowest people costs/core: $
• Cost of power; advantage for placement of center, bulk: $$
• Energy efficiency (PUE; 1.1-1.3 is possible, 1.8 typical)
• Cost of specialized hardware (interconnect): $
• Cost of commodity hardware: $
• Profit: $$$
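The PUE figures in the energy-efficiency row can be turned into a power comparison. PUE (power usage effectiveness) is total facility power divided by IT equipment power, so for a fixed IT load the facility draw scales linearly with PUE. The sketch below uses a hypothetical 1,000 kW IT load purely for illustration.

```python
def facility_power_kw(it_power_kw, pue):
    """Total facility power for a given IT load, using PUE = facility / IT."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_kw * pue


def power_saving(pue_from, pue_to):
    """Fractional reduction in facility power when PUE improves, same IT load."""
    return 1.0 - pue_to / pue_from


if __name__ == "__main__":
    it_load_kw = 1_000  # hypothetical IT load, not a figure from the slide
    for pue in (1.8, 1.3, 1.1):
        print(f"PUE {pue}: {facility_power_kw(it_load_kw, pue):,.0f} kW facility power")
    print(f"saving from 1.8 to 1.3: {power_saving(1.8, 1.3):.0%}")
    print(f"saving from 1.8 to 1.1: {power_saving(1.8, 1.1):.0%}")
```

Moving from the typical value to the achievable range cuts total power by roughly a quarter to two fifths for the same computing load.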
![Page 40: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/40.jpg)
Where is Moore’s Law (Cores/$) in Commercial Clouds?
• Cost of a small instance at Amazon dropped 18% over 5 years.
• Cores increased 2x-5x per socket; roughly constant cost.
• NERSC cost/core dropped by 10x (20K – 200K cores in 2007-2011)
Figure: Increase in Cores/$ or per Socket Relative to 2006 (y-axis 0% to 1000%, x-axis 2006 to 2010). Series: Amazon (small); Cores - Intel; Cores - AMD.
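The two price trends above can be put on the same footing by converting each to a gain in cores per dollar and an annualized rate. The sketch below assumes a small instance delivered the same compute throughout the period (an assumption for illustration), so an 18% price drop is the whole improvement, and it takes 2007-2011 as four years.

```python
def cores_per_dollar_gain(price_drop_fraction):
    """Gain in cores per dollar when the price of a fixed resource falls by this fraction."""
    return 1.0 / (1.0 - price_drop_fraction)


def annualized(gain, years):
    """Constant yearly multiplier that compounds to `gain` over `years`."""
    return gain ** (1.0 / years)


if __name__ == "__main__":
    amazon_gain = cores_per_dollar_gain(0.18)  # 18% price drop over 5 years
    nersc_gain = 10.0                          # cost/core dropped by 10x, 2007-2011
    print(f"Amazon: {amazon_gain:.2f}x total, {annualized(amazon_gain, 5) - 1:.1%} per year")
    print(f"NERSC:  {nersc_gain:.0f}x total, {annualized(nersc_gain, 4) - 1:.1%} per year")
```

On these assumptions the cloud price improved by a few percent per year, while the center's cost per core improved by well over half each year.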
![Page 41: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/41.jpg)
Cloud Artifacts
• Lessons for HPC Centers from Clouds
– Provide higher service level (for higher price) with guaranteed low wait
– Allow users to control access (buy time)
– Provide for configurable systems software
• Other features associated with Clouds
– Virtualization for over-subscription of nodes
– Map-Reduce programming model
![Page 42: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/42.jpg)
Key Findings
• Cloud approaches provide many useful benefits such as customized environments and access to surge capacity.
• Cloud computing can require significant initial effort and skills in order to port applications to these new models.
• Significant gaps and challenges exist in the areas of managing virtual environments, workflows, data, cyber-security, etc.
• The key economic benefit of clouds comes from the consolidation of resources across a broad community, which results in higher utilization, economies of scale, and operational efficiency. DOE already achieves this with facilities like NERSC and the LCFs.
• Cost analysis shows that DOE centers are cost competitive, typically 3–7x less expensive, when compared to commercial cloud providers.
![Page 43: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/43.jpg)
Magellan Legacy
• Magellan project is complete
• Hardware and infrastructure is still valuable
• DOE Systems Biology Knowledge Base
– BER-funded
– Hardware from Magellan
– Community-Driven Cyberinfrastructure for Sharing and Integrating Data and Analytical Tools to Accelerate Predictive Biology
• GPUs to become next ALCF vis/DA cluster
• Other Strategic Projects at NERSC
– Data at large DOE facilities: Call for Proposals
• Use of private clouds at ANL
Coming Soon!
![Page 44: Final Report on Magellan and Update on Advanced Networking Initiative/media/ascr/ascac/pdf/meetings/Mar12/... · Final Report on Magellan and Update on Advanced Networking Initiative](https://reader031.vdocuments.us/reader031/viewer/2022030920/5b78e7a17f8b9a7f378c7661/html5/thumbnails/44.jpg)
Magellan Final Report
• Final Report released on ASCR website
• Joint ANL/NERSC
• Comprehensive
– 170 pages
– User Experiences
– Benchmarking
– Programming
– Security
– Cost Analysis