Climate Machine Update David Donofrio RAMP Retreat 8/20/2008

Page 1: Climate Machine Update

Climate Machine Update

David Donofrio

RAMP Retreat

8/20/2008

Page 2: Climate Machine Update

Agenda

• Project Overview
• Tensilica Architecture and Design Flow
• Tensilica Tools Demo
• Why we need RAMP
• Current Progress
• Next Steps

Page 3: Climate Machine Update

A New Approach to HPC

• Current HPC design approach:
– Leverage commodity processors from Intel, AMD, etc.
– Once machine is built, optimize problems to run on it
– Power wall prevents scaling to exaflop performance
– Power is the new design point

Olukotun and Sutter

Moore’s Law is still in effect, but it is now the number of processors that doubles every 18 months rather than the clock rate

Page 4: Climate Machine Update

A New Approach to HPC

• Our approach:
– Identify application, then tailor machine using semi-custom design
– Optimize CPU architecture and further extend with semi-custom ISA
– Leverage auto-tuning to access architecture-specific optimizations
– Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
• Learn from embedded market, where Flops / Watt and rapid design cycles are crucial
– Start with building blocks from embedded designs rather than full custom ASIC
– Preserve ability to run general purpose C code
• Application Target: 1km Scale Climate Model

Tailor machine architecture to application to reduce waste
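The efficiency argument on this slide can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the 1/4 throughput ratio comes from the slide, while the per-core power ratio of 1/400 is an assumed figure chosen to show how a 100x gain could arise. It is not a measurement from the talk.

```python
def relative_power_efficiency(perf_ratio, power_ratio):
    """Flops/Watt of a simple core relative to a complex core.

    perf_ratio:  simple-core throughput / complex-core throughput
    power_ratio: simple-core power / complex-core power
    """
    return perf_ratio / power_ratio


PERF_RATIO = 1 / 4      # from the slide: simple core is 1/4 as fast
POWER_RATIO = 1 / 400   # ASSUMPTION: hypothetical per-core power ratio

gain = relative_power_efficiency(PERF_RATIO, POWER_RATIO)

# Simple cores needed to match one complex core's throughput,
# and the fraction of the complex core's power they would draw.
cores_to_match = 1 / PERF_RATIO
power_to_match = cores_to_match * POWER_RATIO

print(f"Relative Flops/Watt: {gain:.0f}x")
print(f"Cores to match one complex core: {cores_to_match:.0f}")
print(f"Power drawn by those cores: {power_to_match:.2%} of the complex core")
```

Under these assumed ratios, the simple cores only pay off if the application exposes enough parallelism to keep many of them busy, which is why the slide pairs the claim with a specific application target.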

Page 5: Climate Machine Update

Climate Model Resource Requirements

• DOE has identified high-resolution climate modeling as a leading justification for exascale computing

• Must express 20M way parallelism
• Requires performance of 200 Pflops peak
• Simulation must run 1000x faster than real time

Randall / CSU

NASA

• Amenable to massively concurrent architectures composed of power efficient embedded cores
• Actively working with the climate science community to enable new Icosahedral model
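The requirements on this slide imply a modest per-thread rate, which is what makes simple embedded cores plausible. The sketch below works through that arithmetic using only the slide's figures. It assumes the peak rate is divided evenly across the parallel units and uses a 365-day year; both are simplifications, not statements from the talk.

```python
PEAK_FLOPS = 200e15           # 200 Pflops peak (from the slide)
PARALLELISM = 20e6            # 20M-way parallelism (from the slide)
SPEEDUP_VS_REAL_TIME = 1000   # 1000x faster than real time (from the slide)

# ASSUMPTION: peak throughput is spread evenly over all parallel units.
flops_per_unit = PEAK_FLOPS / PARALLELISM

# ASSUMPTION: a 365-day year, ignoring leap days.
HOURS_PER_YEAR = 365 * 24
wallclock_hours_per_sim_year = HOURS_PER_YEAR / SPEEDUP_VS_REAL_TIME
sim_years_per_wallclock_day = 24 / wallclock_hours_per_sim_year

print(f"Per-unit rate: {flops_per_unit / 1e9:.0f} Gflop/s")
print(f"Wall-clock hours per simulated year: {wallclock_hours_per_sim_year:.2f}")
print(f"Simulated years per wall-clock day: {sim_years_per_wallclock_day:.2f}")
```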

Page 6: Climate Machine Update

Tensilica Processor Design Flow

• Complete Solution: Hardware, Software and Verification

• Fully customizable
– Required base ISA ensures general purpose applications still run
• Processor configuration submitted to Tensilica’s servers, where synthesis is performed
– Returned design can be spun for ASIC or FPGA
– Bit file available for Avnet boards
• Building block approach drastically reduces design cycle time compared to full-custom design

Tensilica Inc.

Page 7: Climate Machine Update

Tensilica Architecture Features

• Verilog-like TIE language allows for custom ISA extensions
– Functional and performance verification built in
– Auto generated compiler intrinsics
– 64-bit IEEE-DP floating point coded up in TIE and available
• Custom VLIW support
• Inter-processor communication easily enabled through:
– TIE Ports
– TIE Queues
   • Access to direct HW support for interprocessor communication
– TIE Lookups
   • Allows interface to external ROMs or other RTL blocks

Page 8: Climate Machine Update

Tensilica Architecture Overview

Tensilica Inc.

Page 9: Climate Machine Update

Tensilica Performance Debug

• Processor viewed as black box
• State can be compressed (via HW) and pushed out JTAG port
– Intended for program replay
• Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail
– Cache ($) hit / miss with virtual address
– Branch taken / not taken
– Call / return
– Resource dependency
– Etc.
• Opportunity for hundreds of performance counters to be made available

Tensilica Inc.

Page 10: Climate Machine Update

Tensilica Tools Demo

Page 11: Climate Machine Update

Why we need RAMP

• Fast, accurate emulation enables:
– Dual nested loop of HW / SW co-design
   • Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW/SW co-tuning
• RAMP critical to accelerate
– Rapid prototyping and analysis of Tensilica architectural options
– Inter-processor communication architecture exploration
– Running FULL climate code, providing a more complete performance picture
   • Cycle accurate simulator currently running at ~100 kHz vs. 50MHz on Virtex 5 (V5)
– Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed

Tensilica provided emulation environment kick-starts this effort
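The gap between the software simulator and the FPGA is easier to appreciate as wall-clock time. The sketch below computes the speedup from the two clock rates quoted on this slide and applies it to a workload of 10^12 cycles. That cycle count is a hypothetical round number chosen for illustration, not a figure from the talk.

```python
SIM_HZ = 100e3    # ~100 kHz cycle-accurate simulator (from the slide)
FPGA_HZ = 50e6    # 50 MHz on Virtex 5 (from the slide)


def wallclock_seconds(cycles, hz):
    """Time to execute `cycles` target cycles at `hz` cycles per second."""
    return cycles / hz


speedup = FPGA_HZ / SIM_HZ

# ASSUMPTION: hypothetical workload size, for illustration only.
HYPOTHETICAL_CYCLES = 1e12

sim_days = wallclock_seconds(HYPOTHETICAL_CYCLES, SIM_HZ) / 86400
fpga_hours = wallclock_seconds(HYPOTHETICAL_CYCLES, FPGA_HZ) / 3600

print(f"FPGA emulation speedup: {speedup:.0f}x")
print(f"Simulator: {sim_days:.1f} days, FPGA: {fpga_hours:.2f} hours")
```

At this assumed workload size, a run that would occupy the simulator for months fits in an afternoon on the FPGA, which is what makes running the full climate code practical.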

Page 12: Climate Machine Update

Current Status

• ML505 used for initial design exploration
– Basic Xtensa processor + JTAG and memory controller is ~50% of a Virtex 5 50t
– Runs at 50MHz
   • ASIC in 65G process runs at 650MHz
• OnChip Debug working
• Can load / run programs using main memory synthesized from BRAM
• DRAM interface coded - currently being debugged
• RTL license recently obtained - full simulation environment (in ModelSim) being brought up

Page 13: Climate Machine Update

Next Steps…

• Transition to BEE3 from ML505
• Bring up XTOS environment on single Xtensa processor on BEE3
• Run single column of climate code on single processor
– Demo at SC’08 in November
– Continue HW / SW co-tuning optimization
• Begin multi-processor emulation
– Emulation of single socket, 32 core, using networked BEE3s
– Running full 2 Million line climate model

Page 14: Climate Machine Update

Backup

Page 15: Climate Machine Update

The Need for Exascale Computing

• DOE has identified high-resolution climate modeling as a leading justification for exascale computing
– 1 km resolution targeted for accurate cloud resolving model
• Difficult to scale existing systems
– HPC design using commodity processors estimated to draw 179MW
– BlueGene design estimated to draw 20MW
– Leveraging embedded cores and a more application-specific design, a power envelope of 3-5MW is projected
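The three power estimates above can be compared directly as ratios. The sketch below uses only the megawatt figures from this slide. The helper name `reduction` is introduced here for illustration.

```python
COMMODITY_MW = 179.0        # commodity-processor design (from the slide)
BLUEGENE_MW = 20.0          # BlueGene design (from the slide)
EMBEDDED_MW = (3.0, 5.0)    # projected envelope, low and high (from the slide)


def reduction(baseline_mw, target_mw):
    """How many times less power the target draws than the baseline."""
    return baseline_mw / target_mw


low_mw, high_mw = EMBEDDED_MW

# Best case uses the low end of the envelope, worst case the high end.
vs_commodity = (reduction(COMMODITY_MW, high_mw), reduction(COMMODITY_MW, low_mw))
vs_bluegene = (reduction(BLUEGENE_MW, high_mw), reduction(BLUEGENE_MW, low_mw))

print(f"vs. commodity: {vs_commodity[0]:.1f}x to {vs_commodity[1]:.1f}x less power")
print(f"vs. BlueGene:  {vs_bluegene[0]:.1f}x to {vs_bluegene[1]:.1f}x less power")
```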

Icosahedral

LBNL will seek an external vendor to build the machine if our approach is proven valid - LBNL is not entering the commercial HPC market.

Randall / CSU