© 2006 Mercury Computer Systems, Inc. The Cell Broadband Engine Processor Hardware, Software, Performance and Applications John Brickman Director, Business Manager, Performance Computing Group Aerospace & Electronic Systems Society

Upload: derick-williamson

Post on 24-Dec-2015


© 2006 Mercury Computer Systems, Inc.

The Cell Broadband Engine Processor

Hardware, Software, Performance and Applications
John Brickman
Director, Business Manager, Performance Computing Group

Aerospace & Electronic Systems Society


Cell Chip Lives in Two Worlds

• Game console chip market
  - Driven by “game physics” requirements, not just graphics
    • Compute intensive, vector processing, floating and fixed point
  - New consoles introduced every 5+ years, last about 10 years
    • PS3 unveiled May 2005, will launch November 2006, about 6 years after PS2
  - New chip architectures linked to console designs
    • Chip architecture unchanged during lifetime
    • Process shrinks targeted at lower cost and lower power
• High performance processor market
  - Evolving architecture with backwards compatibility
  - Piggy-back off largest volume processor platform that is leading in performance
    • With affordable architecture increments to address high performance needs
  - Previously desktop PC, now game console
• Cell roadmap addresses both game console and high performance markets


Mercury’s Relationship with IBM

In June 2005, Mercury announced a strategic alliance agreement with IBM offering Mercury special access to IBM expertise including the broadly publicized Cell technology.

Multicomputer-on-a-chip


Cell BE Processor Block Diagram

• Cell BE processor boasts nine processors on a single die
  - 1 Power® processor
  - 8 vector processors
• Computational performance
  - 205 GFLOPS @ 3.2 GHz
  - 410 GOPS @ 3.2 GHz
• A high-speed data ring connects everything
  - 205 GB/s maximum sustained bandwidth
• High performance chip interfaces
  - 25.6 GB/s XDR main memory bandwidth
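The headline throughput figures can be reproduced with simple arithmetic. The sketch below is an illustration, not Mercury or IBM code: `peak_gops` is a hypothetical helper, and it assumes the figures count only the eight vector processors, with a 4-lane multiply-add worth two operations per lane per cycle and a 16-lane 8-bit operation worth one.

```c
#include <assert.h>

/* Peak throughput in giga-operations per second for a set of SIMD cores.
 * ops_per_lane is 2 for a multiply-add (one multiply plus one add).
 * Hypothetical helper for illustration only. */
double peak_gops(int cores, int lanes, int ops_per_lane, double clock_ghz)
{
    assert(cores > 0 && lanes > 0 && ops_per_lane > 0 && clock_ghz > 0.0);
    return (double)cores * lanes * ops_per_lane * clock_ghz;
}
```

Under these assumptions, `peak_gops(8, 4, 2, 3.2)` evaluates to about 204.8, in line with the quoted 205 GFLOPS, and `peak_gops(8, 16, 1, 3.2)` to about 409.6, in line with the quoted 410 GOPS.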


Synergistic Processing Element

• Standalone vector processor
  - 128 bit SIMD model
  - 128 registers, each 128 bits wide
    • AltiVec/VMX has only 32 registers, SSE3 only eight
• 256KB local store
  - Load/store instructions can access only local store
• Memory flow controller
  - DMA engine built into each SPE
  - SPE includes DMA instructions for explicitly moving data between local store and main memory
• Performance
  - Dual issue
  - Two- to sixteen-way SIMD
  - 25.6 GFLOPS (single precision), 51 GOPS (8 bit)


SPE 128 Bit SIMD Engine

• Operates on 128 bit vector registers
  - 2 x 64 bits (DP float)
  - 4 x 32 bits (SP float or integer)
  - 8 x 16 bits (integer)
  - 16 x 8 bits (integer)
• Example: Floating point multiply add
  - 4 x 32 bit fma instruction can complete eight floating point operations (FLOPS) every cycle

[Figure: fma vr, v1, v2, v3 on 128 bit registers; four lanes, each a multiply (X) followed by an add (+), with inputs v1, v2, v3 and result vr]
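The multiply add example can be emulated lane by lane in portable C to show where the count of eight operations comes from. This is a sketch, not SPE code: `vec4f` and `fma4` are hypothetical names, the operand order (`a * b + c`) is an assumption about what the slide's `fma vr, v1, v2, v3` computes, and the scalar loop is not guaranteed to round like a fused hardware instruction.

```c
#include <assert.h>

#define LANES 4  /* four 32-bit floats fill one 128-bit register */

typedef struct { float lane[LANES]; } vec4f;

/* Emulates a 4 x 32-bit multiply-add: each lane computes a*b + c.
 * Returns the number of floating point operations performed. */
int fma4(vec4f *r, const vec4f *a, const vec4f *b, const vec4f *c)
{
    int flops = 0;
    assert(r && a && b && c);
    for (int i = 0; i < LANES; i++) {
        r->lane[i] = a->lane[i] * b->lane[i] + c->lane[i];
        flops += 2;  /* one multiply and one add per lane */
    }
    return flops;
}
```

One call performs four multiplies and four adds, which is the eight FLOPS per cycle the slide attributes to a single instruction.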


Power® Processing Element

• 64-bit Power® core with complete AltiVec™/VMX
• High frequency
• Low power consumption
• Hardware multi-threading
• L2 is 512 KB
• Can use any SPE’s DMA engine

Altivec is a registered trademark of Freescale Semiconductor Corp.


Why is Cell So Fast?

• The SPE is a very fast, very lean core
  - SPE (3.2 GHz) is up to 3 times faster than the fastest Pentium core (3.6 GHz) when computing FFTs
  - That’s 24X better performance chip to chip
• Huge internal chip bandwidth
  - 205 GB/s sustained ring bandwidth
  - 25.6 GB/s main memory bandwidth
• High performance DMA
  - DMA can be fully overlapped with SPE computation
  - Software controlled DMAs can bring exactly the right data into local store at the right time
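Overlapping DMA with computation is usually arranged as double buffering: one local buffer is computed on while the next block of data is fetched into the other. The sketch below shows only the control flow in portable C. `dma_get`, `sum_double_buffered` and `TILE` are hypothetical names; `dma_get` is a plain `memcpy` standing in for an asynchronous transfer, so nothing actually overlaps here, and real SPE code would also wait for the transfer to complete before touching the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 64  /* floats per tile; hypothetical size chosen for the sketch */

/* Stand-in for an asynchronous DMA "get" from main memory into local store.
 * Here it is a synchronous memcpy. */
static void dma_get(float *local, const float *main_mem, size_t n)
{
    memcpy(local, main_mem, n * sizeof(float));
}

/* Sums n floats from "main memory" using two local buffers: before tile t
 * is summed, the fetch of tile t+1 into the other buffer has been issued.
 * n must be a multiple of TILE. */
float sum_double_buffered(const float *main_mem, size_t n)
{
    float buf[2][TILE];
    size_t tiles = n / TILE;
    float total = 0.0f;
    int cur = 0;

    assert(n % TILE == 0);
    if (tiles == 0)
        return total;
    dma_get(buf[cur], main_mem, TILE);            /* prefetch first tile */
    for (size_t t = 0; t < tiles; t++) {
        if (t + 1 < tiles)                        /* issue fetch of tile t+1 */
            dma_get(buf[cur ^ 1], main_mem + (t + 1) * TILE, TILE);
        for (size_t i = 0; i < TILE; i++)         /* compute on tile t */
            total += buf[cur][i];
        cur ^= 1;                                 /* swap buffers */
    }
    return total;
}
```

With a truly asynchronous transfer, the inner summation loop would run while the next tile is in flight, which is the overlap the slide describes.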


Mercury Cell Hardware Products


Mercury Cell Related Roadmap

[Roadmap chart: timeline from 3Q 2006 through 2Q 2008]

Blades
• Dual Cell Based Blade: 2 BE, 2 Southbridges, 1GB XDR
• Dual Cell Based Blade 2: Single slot, 2 BE, 2 Comp. Chips, 4GB XDR+DDR2
• Dual Cell Based Blade 3: Single slot, 2 BE, 2 Comp. Chips, up to 32GB DDR2

1U Servers
• Dual Cell Based Server: 2 BE, 2 Southbridges, 1GB XDR
• Dual Cell Based Server 2: 2 BE, 2 Comp. Chips, 4GB XDR+DDR2
• Dual Cell Based Server 3: 2 BE, 2 Comp. Chips, up to 32GB DDR2

Embedded / Rugged
• PowerBlock™ 200 ½ ATR Concept: 1 BE, 1 Companion Chip, 4 GB DDR2, 1GB XDR
• Turismo Chassis Concept
• ATCA Blade Concept: 1 BE, 1 Companion Chip, 4 GB DDR2, 1GB XDR
• VITA 46 / 48 Concept
• PowerStream™ Concept

• CAB PCIe Add-In Card: 1 BE, 1 Companion Chip, 4 GB DDR2, 1GB XDR


Dual Cell Based Blade

• Flexible blade solution based on the Cell BE processor
  - Outstanding performance for HPC applications
  - Designed for distributed processing
  - Cell-optimized software available
  - About 11 TFLOPS in 5 feet of rack height
• Dual-width BladeCenter™ blade
• Two PCI Express x4 expansion slots
  - Initially supports only Infiniband cards
• Evaluation units available since December 2005
• Production October 2006


Dual Cell Based Blade Block Diagram

[Block diagram. Components: two 3.2 GHz Cell processors, two 512 MB XDR DRAM banks, two southbridges, two GbE ports, two Infiniband daughtercards on PCI Express x4, serial port, power, BladeCenter midplane connector. Link labels: 25.6 GB/s (x2), 2.5 GB/s each way (x2), 20 GB/s each way.]


Cell Blade Systems

Complete 19” rack-based systems
• 25U (42.75” high)
  - Up to 14 blades, 5.7 TFLOPS
• 42U (73.5”) chassis
  - Up to 28 blades, 11.5 TFLOPS
• Multi-rack systems scalable using Infiniband and GbE

Cell Technology Evaluation System
• Complete turn-key Cell HW & SW system
• 25U rack
• One Dual Cell-Based Blade
  - All components included to support expansion to 7 blade system
• MultiCore Plus SDK
  - One year subscription to production SW

[Figure: 25U 14-Blade System, front and rear views. Labeled components: monitor and keyboard, serial line concentrator, Xeon based Linux server, external GbE switch, BladeCenter chassis, power distribution.]
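The rack-level TFLOPS figures follow from the blade count. A minimal sketch, assuming two Cell BE chips per blade and a per-chip single precision peak of about 204.8 GFLOPS (the deck quotes 205 GFLOPS at 3.2 GHz); `rack_peak_tflops` is a hypothetical helper.

```c
#include <assert.h>

/* Peak TFLOPS for a rack, given the number of blades, chips per blade
 * and per-chip peak in GFLOPS. Hypothetical helper for illustration. */
double rack_peak_tflops(int blades, int chips_per_blade, double chip_gflops)
{
    assert(blades >= 0 && chips_per_blade > 0 && chip_gflops > 0.0);
    return blades * chips_per_blade * chip_gflops / 1000.0;
}
```

Under these assumptions, 14 blades give roughly 5.73 TFLOPS and 28 blades roughly 11.47 TFLOPS, consistent with the quoted 5.7 and 11.5 TFLOPS.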


1U Dual-Cell Based Server

• Hardware
  - Dual Cell processors at 3.2 GHz
  - 1 GB of XDR DRAM
  - Integrated dual Gigabit Ethernet
  - Serial port
  - Dual full size PCI Express x4 slots
    • Initially supports only Infiniband cards
• Software
  - Toolchain
    • Native (PPE hosted)
    • Cross (x86 hosted)
  - GUI via X-Windows over GbE
    • No direct keyboard / video / mouse support
• Production Q1 2007


Cell Companion Chip

• Under design by IBM since May 2005
  - With significant design input from Mercury
• First parts began preliminary testing June 2006
• Second spin for production in December 2006

[Block diagram. Components: Cell BE interface, two GbE ports, UART, GPIO, PCI-X, mailbox, DMA engine, 405 PPC core, two PCIe 16x interfaces, two DDR2 667 MHz controllers.]

• Cell BE Interface
  - 5 GB/s each way
  - Extends Cell global address space to PCIe, DDR2 etc.
  - Non-coherent (non-cached)
• Low latency, high capacity mailbox
• Multichannel, striding DMA engine
• DDR2 controllers
  - 5 GB/s each
  - Up to 4 GB each
• PCIe 16x interfaces, each configurable:
  - 8x, 4x, 2x and 1x
  - Endpoint or root complex


Dual Cell Based Blade 2

• Single slot blade
  - Up to twice the density
• Uses new companion chip
  - Up to 10x I/O bandwidth
• DDR2 I/O buffer memory
• Production available Q3 2007

[Block diagram: One-Slot Processor Blade and One-Slot I/O Expansion Blade. Components: two 3.2 GHz Cell processors, two companion chips, two 1 GB XDR DRAM banks, two 2-8 GB DDR2 banks, two GbE ports, two PCIe x16 / PCI-X daughtercards, BladeCenter H high speed daughtercard, power. Link labels: 25.6 GB/s (x2), 5 GB/s each way (x2), 20 GB/s each way, PCIe 16x (x2), IB 4x (x2), 2 PCIe 8x.]


Dual Cell Based Blade 3 Concept

• Improved SPE double precision performance
• Expanded memory
  - DDR2 replaces XDR
• Production available Q1 2008

[Block diagram: One-Slot Processor Blade and One-Slot I/O Expansion Blade. Components: two 3.2 GHz Cell processors, two companion chips, two 8-16 GB DDR2 banks, two 1-2 GB DDR2 banks, two GbE ports, two PCIe / PCI-X x16 daughtercards, BladeCenter H high speed daughtercard, power. Link labels: 25.6 GB/s (x2), 5 GB/s each way (x2), 20 GB/s each way, PCIe 16x (x2), 2 IB 4x (x2), 2 PCIe 8x.]


1U Dual-Cell Based Server 2

• 1U solution based on companion chip
• Dual 3.2 GHz Cell processors
• Memory
  - 2 GB of XDR
  - 4-16 GB of DDR2
• I/O
  - Daughtercard site options under consideration
    • PCI-E and PCI-X customer options
  - Dual GigE
  - Dual IB 4x
• Production available Q3 2007


1U Dual-Cell Based Server 3 Concept

• 1U solution with enhanced memory capacity
• Dual 3.2 GHz Cell processors
• Memory
  - 16-32 GB of DDR2
  - Main memory is now DDR2 DIMMs
  - 1-2 GB of DDR2 per companion chip for IO buffering
• I/O
  - PCIe / PCI-X daughtercards
  - Dual GigE
  - Dual IB 4x
• Production available Q1 2008


Cell Accelerator Board

• PCI Express™ accelerator card compatible with high-end workstations

• More than 180 GFLOPS on a desktop

• 1 GB of XDR and 4 GB of DDR2

• Gigabit Ethernet on end bracket

• Internal prototype boards with FPGA bridge received July 2006

• Boards with the prototype bridge silicon received September 2006

• Volume production of boards Q1 2007


Cell Accelerator Board Block Diagram

[Block diagram. Components: 2.8 GHz Cell processor, 1 GB XDR DRAM, companion chip, 4 GB DDR2. Link labels: 22 GB/s, 8 GB/s.]


Software is the Key to Harnessing Cell Performance!

• Mercury’s MultiCore Plus SDK


Cell BE Processor Architecture

• Resembles distributed memory multiprocessor with explicit DMA over a fabric


Mercury Multi-DSP Board (1996)


Programming Cell: What’s Good and What’s Hard

(Table columns: Good, Hard. Table rows: SPE, Ring and XDR, PPE.)

Good
• No second guessing about cache replacement algorithm
• Very deterministic pipeline
• 128 registers mask pipeline latency very well
• DMA has negligible impact on SPE local store bandwidth
• Generous ring bandwidth means topology is seldom an issue
• Standard Power® core

Hard
• Burden on software to get code and data into local store
• Local store is small compared to ring latency
• Branch prediction is manual and very restricted
• 128 byte alignment necessary for best performance
• XDR bandwidth is a bottleneck
• Cell chips linked in coherent mode increases latency
• Performance is modest
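The 128 byte alignment requirement is typically met by allocating transfer buffers on 128-byte boundaries and padding their size. The sketch below uses the standard C11 `aligned_alloc`; the helper names `round_up_dma`, `dma_alloc` and `is_dma_aligned` are hypothetical and are not part of any Cell or Mercury library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DMA_ALIGN 128  /* alignment the slide recommends for best performance */

/* Rounds n up to the next multiple of DMA_ALIGN. */
size_t round_up_dma(size_t n)
{
    return (n + DMA_ALIGN - 1) / DMA_ALIGN * DMA_ALIGN;
}

/* Allocates at least n bytes starting on a 128-byte boundary.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * hence the rounding. Release the buffer with free(). */
void *dma_alloc(size_t n)
{
    return aligned_alloc(DMA_ALIGN, round_up_dma(n ? n : 1));
}

/* Returns 1 if p sits on a 128-byte boundary, 0 otherwise. */
int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p % DMA_ALIGN) == 0;
}
```

Padding the size as well as aligning the start keeps every transfer a whole number of 128-byte blocks, at the cost of up to 127 wasted bytes per buffer.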


How Much Faster Is Cell?

Relative performance of Cell and leading general purpose processors

[Bar chart. Y-axis: Relative Performance, 0 to 60. Benchmarks: 1K point FFT, 8K point FFT, 64K point FFT (single precision complex FFTs); 15x15 16-bit filter, 15x15 8-bit filter (symmetric image filters). Series: Cell BE 3.2 GHz; Freescale 744x 975 MHz; Pentium 3.6 GHz 2MB L2; Opteron 2.4 GHz; PPC 970 2.0 GHz. Data labels as extracted: 32, 56, 30, 13, 25; 1.0, 1.0, 1.0, 1.0, 1.0; 1.5, 2.1, 1.9; 0.9, 1.3, 1.0; 1.8, 2.0, 1.0.]

• Performance relative to 1GHz Freescale 744x (i.e. Freescale = 1)

• In all cases, we are comparing Mercury optimized Cell algorithm implementations with the best available (Mercury or 3rd party) implementations on other processors

• Did not compare with dual core x86 processors


Goals for Programming Cell

• Achieve high performance: The only reason for choosing Cell
• Ease of programming: An important aspect of this is programmer portability
• Code portability
  - Important for large legacy code bases written in C/C++, Fortran
  - And new code developed for Cell should be portable to current and anticipated multiprocessor architectures


Linux OS

• Linux on Cell patches released by IBM Linux Technology Center
  - Kernel Version 2.6.17
  - libspe version 1.1
  - Built and tested with Fedora Core 5 distribution
  - IBM LTC releases packages through Barcelona Supercomputing Center to official kernel website
    www.bsc.es/projects/deepcomputing/linuxoncell/
• Mercury works closely with IBM Linux team on performance optimization
  - Linux now able to achieve maximum hardware performance possible on Dual Cell-Based Blade
  - NUMA support, PPE affinity, SPE affinity, 64KB and 16MB page support
• Mercury uses Terra Soft Solutions Y-HPC Distribution
  - Mercury contracted TSS to port Y-HPC to the Dual Cell Based Blade
  - Distributions are tested and supported on Mercury hardware
  - Mercury assists TSS with driver development
    • GbE, uDAPL, Infiniband


The MultiCore Plus SDK

• MultiCore Framework (MCF)
• Scientific Algorithm Library (SAL)
• MultiCore Plus IDE
• TATL
• SPEAK


Mercury Approach to Programming Cell

• Very pragmatic
  - Can’t wait for tools to mature
  - Develop our own tools when it makes sense
• Emphasis on explicitly programming the architecture rather than trying to hide it
  - When the tools are immature, this allows us to get maximum performance
• Achieve ease-of-use and portability through function offload model
  - Run legacy code on PPE
  - Offload compute intensive workload to SPEs


MultiCore Framework

• An API for programming heterogeneous multicores that contain explicit non-cached memory hierarchies

• Provides an abstract view of the hardware oriented toward computation of multidimensional data sets

• First implementation is for the Cell BE processor


MCF Abstractions

• Function offload model
  - Worker Teams: Allocate tasks to SPEs
  - Plug-ins: Dynamically load and unload functions from within worker programs
• Data movement
  - Distribution Objects: Defining how n-dimensional data is organized in memory
  - Tile Channels: Move data between SPEs and main memory
  - Re-org Channels: Move data among SPEs
  - Multibuffering: Overlap data movement and computation
• Miscellaneous
  - Barrier and semaphore synchronization
  - DMA-friendly memory allocator
  - DMA convenience functions
  - Performance profiling



MCF Distribution Objects

[Figure: Frame: one complete data set in main memory]

• Distribution Object parameters:
  - Number of dimensions
  - Frame size
  - Tile size and tile overlap
  - Array indexing order
  - Compound data type organization (e.g. split / interleaved)
  - Partitioning policy across workers, including partition overlap


MCF Distribution Objects

[Figure: Frame: one complete data set in main memory. Tile: unit of work for an SPE.]


MCF Partition Assignment

[Figure: frame divided into partitions, labeled SPE 0, SPE 1 and SPE 2]
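One simple partitioning policy is to cut the frame into tiles and hand each worker a contiguous run of them. The sketch below illustrates that idea only; it is not MCF's implementation, it ignores tile and partition overlap, and `tile_range`, `tile_count` and `partition_for` are hypothetical names.

```c
#include <assert.h>

/* Half-open range [first, last) of tile indices owned by one worker. */
typedef struct { int first; int last; } tile_range;

/* Number of tiles needed to cover `frame` elements with tiles of `tile`
 * elements (the last tile may be partial). */
int tile_count(int frame, int tile)
{
    assert(frame >= 0 && tile > 0);
    return (frame + tile - 1) / tile;
}

/* Splits n_tiles into contiguous partitions, one per worker. The first
 * (n_tiles % workers) workers receive one extra tile. */
tile_range partition_for(int n_tiles, int workers, int worker)
{
    tile_range r;
    int base, extra;

    assert(n_tiles >= 0 && workers > 0 && worker >= 0 && worker < workers);
    base = n_tiles / workers;
    extra = n_tiles % workers;
    r.first = worker * base + (worker < extra ? worker : extra);
    r.last = r.first + base + (worker < extra ? 1 : 0);
    return r;
}
```

For ten tiles and three workers this yields the ranges [0, 4), [4, 7) and [7, 10), so every tile is owned by exactly one worker and the load differs by at most one tile.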


MCF Tile Channels

[Figure: partitions labeled SPE 0, SPE 1 and SPE 2, with a tile channel]


MCF Tile Channels

• Manager (PPE) generates data set and injects it into input tile channel
• Input tile channel subdivides data set into tiles
• Each worker (SPE) extracts tiles out of input tile channel, computes on input tiles to produce output tiles, and inserts them into output tile channel
• Output tile channel automatically puts tiles into correct location in output data set
• When output data set is complete, manager is notified and extracts data set

[Figure: manager, worker 1, worker 2, worker 3, input tile channel, output tile channel]
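The tile channel flow can be mimicked in a single-threaded C sketch: a data set is cut into tiles, each tile is copied into a worker-local buffer, processed, and written back at the offset it came from. Everything here is hypothetical and for illustration only (`run_tile_channels`, `worker_compute`, `TILE`, and the doubling operation); it does not use the MCF API, and the "workers" run one after another rather than in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 4  /* elements per tile; hypothetical */

/* Worker step: compute on one input tile to produce one output tile.
 * Doubling each element stands in for the real math. */
static void worker_compute(const int *in_tile, int *out_tile, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out_tile[i] = 2 * in_tile[i];
}

/* Runs a data set of n elements through simulated input and output tile
 * channels. Tiles are dealt to `workers` workers round robin, and each
 * output tile lands at the offset of its input tile. tiles_per_worker
 * must have room for `workers` counters. Returns the number of tiles. */
size_t run_tile_channels(const int *in, int *out, size_t n,
                         int workers, size_t *tiles_per_worker)
{
    size_t tiles = 0;

    assert(workers > 0 && tiles_per_worker);
    for (int w = 0; w < workers; w++)
        tiles_per_worker[w] = 0;
    for (size_t off = 0; off < n; off += TILE, tiles++) {
        size_t len = (n - off < TILE) ? n - off : TILE;
        int local_in[TILE], local_out[TILE];   /* worker-local buffers */

        tiles_per_worker[tiles % (size_t)workers]++;
        memcpy(local_in, in + off, len * sizeof(int));    /* get input tile */
        worker_compute(local_in, local_out, len);
        memcpy(out + off, local_out, len * sizeof(int));  /* put output tile */
    }
    return tiles;
}
```

Because each tile carries its own offset, the output data set is assembled correctly regardless of which worker handled which tile, which is the property the output tile channel provides.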


MCF Manager Program

main(int argc, char **argv) {
    // Call arguments are omitted on the slide; only channel names are shown.
    mcf_m_net_create();
    mcf_m_net_initialize();

    // Add worker tasks
    mcf_m_net_add_task();
    mcf_m_team_run_task();

    // Specify data organization
    mcf_m_tile_distribution_create_3d("in");
    mcf_m_tile_distribution_set_partition_overlap("in");
    mcf_m_tile_distribution_create_3d("out");

    // Create and connect to tile channels
    mcf_m_tile_channel_create("in");
    mcf_m_tile_channel_create("out");
    mcf_m_tile_channel_connect("in");
    mcf_m_tile_channel_connect("out");

    // Get empty source buffer
    mcf_m_tile_channel_get_buffer("in");

    // Fill it with data
    // fill input data here

    // Send it to workers
    mcf_m_tile_channel_put_buffer("in");

    // Wait for results from workers
    mcf_m_tile_channel_get_buffer("out");

    // process output data here
}


MCF Worker Program

mcf_w_main (int n_bytes, void * p_arg_ls) {
    // Create and connect to tile channels
    mcf_w_tile_channel_create("in");
    mcf_w_tile_channel_create("out");
    mcf_w_tile_channel_connect("in");
    mcf_w_tile_channel_connect("out");

    while (!mcf_w_tile_channel_is_end_of_channel("in")) {
        // Get full source buffer
        mcf_w_tile_channel_get_buffer("in");

        // Get empty destination buffer
        mcf_w_tile_channel_get_buffer("out");

        // Do math and fill destination buffer
        // Do math here

        // Put back empty source buffer
        mcf_w_tile_channel_put_buffer("in");

        // Put back full destination buffer
        mcf_w_tile_channel_put_buffer("out");
    }
}


MCF Implementation

• Consists of
  - PPE library
  - SPE library and tiny executive (12 KB)
• Utilizes Cell Linux “libspe” support
  - But amortizes expensive system calls
  - Reduces overhead from milliseconds to microseconds
  - Provides faster and smaller footprint memory allocation library
• Based on Data Reorg standard
  - http://www.data-re.org
• Derived from existing Mercury technologies
  - Other Mercury RDMA-based middleware
  - DSP product experience with small footprint, non-cached architectures


SAL Primary Markets

• Radar
• Sonar
• Medical Imaging
• Signals Intelligence
• Defense Imaging
• Semiconductor Inspection


Scientific Algorithm Library

• SAL is a collection of optimized functions
  - Baseline
    • Arithmetic, data type conversions, data moves
  - DSP
    • FFTs, convolutions, correlation, filters, etc.
  - Linear Algebra
    • Linear systems, matrix decomposition, etc.
  - Parallel Algorithms (future)
    • High level algorithms on multiple cores
    • Invoked from application running on PPE
    • Automatically use one or more SPEs
    • Initial work done for 1D and 2D FFTs and fast convolutions
• PIXL – Image Processing Library
    • Edge detection, fixed point operations and analysis, filtering, manipulation, erosion, dilation, histogram, lookup tables, etc.
    • Work in this area depends on customer demand.
• PPE SAL based on Altivec optimizations for G4 and G4A2
  - SAL C source code version also available
• SPE SAL is new implementation optimized for SPE architecture
  - Backwards compatibility with existing SAL API except in very rare cases
  - Some new APIs needed in order to extract best performance from SPE
  - Static and plug-in component versions for each function


Eclipse Framework

• Provides an open platform for creating an Integrated Development Environment (IDE)

• Eclipse Consortium manages continuous development of the tool

• Eclipse plug-ins extend the functionality of the framework

• Written in Java

• Compilers, debuggers, TATL, help files, etc. are all Eclipse plug-ins.


Mercury MultiCore Plus IDE

• PPE and SPE cross build support for
  - gcc/g++
  - XLC/C++
• Eclipse CDT (C/C++ Development Toolkit)
  - Syntax highlighting
  - Code completion
  - Content assistance
  - Makefile generation
  - Remote debugging of PPE and SPE applications
  - TATL plug-in


TATL™ Trace Analysis Tool

• Log events from PPE & SPE threads across multiple Cell chips

• Synchronized global timestamps

• Minimally intrusive in space and time

• Timeline trace and histogram viewers

• Structured log file for use in other tools


SPE Assembly Development Kit (SPE-ADK)

• The SPE architecture encourages “bare metal programmers”
  - Very deterministic architecture
  - Performance benefits from hand tuning the pipelines
• SPE-ADK dramatically improves bare metal productivity
• SPE-ADK consists of
  - Assembler preprocessor, optimizer and macro library
• Using SPE-ADK is similar to programming with SPE C extensions
  - But with more deterministic control of instruction scheduling and hardware resources
• SPE-ADK is a productized version of the internal development tool used by all Mercury SAL developers


SPE-ADK Features

• Alignment of instructions for the even and odd pipelines of the SPU

• Automatic insertion of nops and lnops, or instruction swapping, to maintain dual dispatch

• Alignment of loops to minimize instruction fetching overhead

• Register assignment. It automatically:
  - Finds symbolic register operands
  - Assigns registers to symbols to minimize register usage
  - Eliminates bugs from inconsistent register assignment

• Mapping of register usage, both active line number extents per symbol, and active hardware registers per line

• Analysis of stall cycles due to register dependencies

• Optional C emulation for assembly development allows C-like debugging facilities:
  - Hardware independence for assembly code
  - Setting breakpoints at source line numbers
  - Displaying source code rather than disassembling the object code
  - Displaying register contents by symbol

• Detection of errors to preclude bugs:
  - Inconsistent manual register assignment
  - Write-only variables
  - Uninitialized variables
  - Updated but unused variables
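The dual-dispatch padding listed among the features above can be illustrated with a toy model. This is only a sketch under stated assumptions, not SPE-ADK's actual algorithm: it assumes an instruction stream where even-numbered slots feed the even pipeline and odd-numbered slots feed the odd pipeline, and it fills any mismatched slot with `nop` (even pipeline) or `lnop` (odd pipeline). The function name `pad_for_dual_issue` and the tuple format are hypothetical; a real tool would also consider swapping instructions instead of padding.

```python
def pad_for_dual_issue(instructions):
    """Pad a list of (mnemonic, pipe) pairs so each lands in a matching slot.

    pipe is "even" or "odd". Slot parity is modeled as the position in the
    output list: position 0, 2, 4... is an even-pipeline slot.
    Hypothetical helper for illustration only.
    """
    out = []
    for name, pipe in instructions:
        slot = len(out) % 2                 # 0 = even slot, 1 = odd slot
        wanted = 0 if pipe == "even" else 1
        if slot != wanted:
            # Fill the slot we cannot use: nop occupies an even slot,
            # lnop occupies an odd slot.
            out.append("nop" if slot == 0 else "lnop")
        out.append(name)
    return out


# Two floating-point operations (even pipeline) followed by a load (odd pipeline).
stream = [("fa", "even"), ("fm", "even"), ("lqd", "odd")]
padded = pad_for_dual_issue(stream)
print(padded)
```

In this model the second even-pipeline instruction cannot pair with the first, so an `lnop` fills the odd slot between them, while the load pairs with the instruction before it at no cost.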

Page 52:

Software Summary

• The Cell BE processor can achieve one to two orders of magnitude performance improvement over current general purpose processors
  - Lean SPE core saves space and power
  - And makes it easier for software to approach peak performance

• Cell is a distributed memory multiprocessor on a chip
  - Prior experience on these architectures translates easily to Cell

• But for most programmers, Cell is a new architecture
  - Successful adoption by programmers is Cell’s biggest challenge
  - And the history of other new processor architectures is not encouraging

• We need a range of tools that span the continuum from ease-of-use to high performance

Page 53:

Markets for Cell

• Aerospace and Defense

• Semiconductor

• Medical Imaging

• Oil and Gas

• Visualization

Page 54:

Sales & Marketing Progress for Cell

Very Active

• Semiconductor inspection – active sales engagements; prototypes sold

• Medical imaging – active sales engagements; prototypes sold

• Semiconductor lithography – active sales engagements; prototypes sold

• Defense signal & image processing – active sales engagements; prototypes sold

• Oil & Gas exploration – active sales engagements; prototypes sold

• Video transcoding – active sales engagements

Less Active for Mercury

• Financial modeling (IBM)

• Gaming

• Animation & rendering

• Defense simulation for training (specialized gaming)

Page 55:

Summary

• Mercury has been developing computing solutions for applications well suited for Cell technology for many years.

• Cell technology represents a significant performance breakthrough, with a programming model similar to historical programming models.

• Customers can leverage Cell technology through Mercury to achieve:
  - Unbiased assessment of risks and applicability of deploying Cell-based solutions
  - Significant improvements in performance and bandwidth for certain applications compared to conventional processors

Page 56:

For More Information

(866) 627-6951 (US)

(978) 967-1401 (International)

E-mail: [email protected]

Web: www.mc.com/cell

Page 57:

Backup Slides

Page 58:

Semiconductor DFM Requirements

Page 59:

Moore’s Law Irrelevant

• Processing requirements of semiconductor industry are increasing at an even faster rate

• Driven by:
  - Increased feature density
  - Increased complexity of processing due to sub-wavelength physics
  - Tool specific features

[Figure: growth from Year 1 to Year 4. Moore’s Law: 4X. Processing requirements: 12X.]

Processing needs outpace mainstream computing as data rates and algorithm complexity increase

Page 60:

OPC/RET/DFM – The need for speed

CHALLENGES

• Reduce OPC cycle times from days/weeks to hours

• Simulation models that ensure a mask will work when printed

• Computing goes up by an order of magnitude at every design node (e.g. 65nm to 45nm)

• Resolution Enhancement Technologies (RET)
  - Optical Proximity Correction (OPC)
  - Phase Shift Masks (PSM)
  - Off-axis Illumination (OAI)

• Design for Manufacturing (DFM)

• WYSIWYG no more

Quotes from top chip designers:

“It takes 8 days with 500 nodes to do OPC on a single chip layer … and we need it to be 10 to 100 times faster”

“We have 10,000 blades to do RET”

Source: AMD

Page 61:

Cost of Ownership

• System sizes to do RET and lithography simulation are expanding to the 1000s of 1U servers

• Dense racks of servers are expensive to maintain
  - Cost of electricity to power computers
  - Cost of capital infrastructure for electricity delivery
  - Cost of electricity to power HVAC systems
  - Cost of capital infrastructure for HVAC
  - Challenge of managing air flow

Page 62:

Cost of Ownership

• A single dual processor server
  - Consumes 250-400 Watts
  - Costs $100-200/year just to power (at $.05/kWh)
  - Comparable amounts for HVAC and capital costs

• A rack of 84 such servers
  - Costs $10K+ per year to power
  - Comparable amounts for HVAC and capital costs

• Operators of data centers now see power and cooling costs as more significant than cost of computing hardware
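The power figures on this slide can be checked with a short calculation. The sketch below assumes continuous operation (24 hours a day, 365 days a year), which the slide does not state explicitly; the function name `annual_power_cost` is illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # assumes the server runs continuously


def annual_power_cost(watts, dollars_per_kwh=0.05):
    """Yearly electricity cost in dollars for a constant power draw."""
    kilowatt_hours = watts / 1000 * HOURS_PER_YEAR
    return kilowatt_hours * dollars_per_kwh


# One dual processor server at the low and high end of the quoted range.
server_low = annual_power_cost(250)
server_high = annual_power_cost(400)

# A rack of 84 such servers.
rack_low = 84 * server_low
rack_high = 84 * server_high

print(f"server: ${server_low:.0f} to ${server_high:.0f} per year")
print(f"rack:   ${rack_low:.0f} to ${rack_high:.0f} per year")
```

Under that assumption a single server lands inside the quoted $100-200 range, and a full rack at the upper end of the power range exceeds $10K per year, before HVAC and capital costs are counted.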

Page 63:

Processing Efficiency

• The metric of performance per dollar must be expanded to include not just the cost of the hardware but also the lifetime cost of operating the computer system

• Performance/Watt, which used to just be a metric for the embedded and defense industry, is now important for commercial customers as well

Page 64:

Summary

• Cell processor technology provides:
  - Order-of-magnitude improvement in computing performance per processor for OPC/RET applications
  - Significant improvement in performance per Watt
  - Significant performance breakthrough for other critical computationally intensive applications

• The right software infrastructure is critical for:
  - Taking full advantage of specialized processing units
  - Partitioning the application among a heterogeneous group of processing cores
  - Parallelizing the application among multiple processing nodes

• Cell can significantly improve OPC/RET turnaround time

Page 65:

Ray Tracing

• Mercury Computer Systems

• Visualization and Sciences Group

Page 66:

What is Ray Tracing?

Computer Graphics Rendering Technique which mathematically simulates rays of light

Capable of producing photo-realistic images

Used in a variety of markets

Automotive, aerospace and marine virtual prototyping

Architecture

Industrial Design

Digital Content Creation in film and video

Page 67:

Basic Technique

For each pixel on the screen, send out a ray of light from the viewpoint.

Test the ray against every object in the scene for intersection.

If the ray does not intersect an object, set the pixel to the background color.

If the ray does intersect an object, set the pixel to the color of the first object it intersects.
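The basic technique above can be sketched in a few lines. This is a minimal illustration, not Mercury's or OpenRTRT's implementation: it assumes a scene made only of spheres, an eye at the origin looking down the negative z axis, and single-character "colors" so the image can be printed as text. The names `hit_sphere` and `render` are hypothetical.

```python
import math


def hit_sphere(origin, direction, center, radius):
    """Distance along the ray to the nearest hit in front of it, or None."""
    oc = [o - c for o, c in zip(origin, center)]
    a = sum(d * d for d in direction)
    b = 2 * sum(d * o for d, o in zip(direction, oc))
    c = sum(o * o for o in oc) - radius * radius
    disc = b * b - 4 * a * c
    if disc < 0:
        return None                      # ray misses the sphere
    t = (-b - math.sqrt(disc)) / (2 * a)  # nearer of the two roots
    return t if t > 0 else None          # ignores an eye inside the sphere


def render(width, height, spheres, background="."):
    """Cast one ray per pixel; spheres is a list of (center, radius, color)."""
    eye = (0.0, 0.0, 0.0)
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            # Ray through the pixel center on an image plane at z = -1.
            px = (x + 0.5) / width * 2 - 1
            py = 1 - (y + 0.5) / height * 2
            direction = (px, py, -1.0)
            nearest, color = None, background
            for center, radius, sphere_color in spheres:
                t = hit_sphere(eye, direction, center, radius)
                if t is not None and (nearest is None or t < nearest):
                    nearest, color = t, sphere_color  # keep the first hit
            row.append(color)
        rows.append("".join(row))
    return rows


scene = [((0.0, 0.0, -5.0), 1.0, "B"), ((0.0, 0.0, -3.0), 1.0, "A")]
image = render(5, 5, scene)
print("\n".join(image))
```

The inner loop over every object for every pixel is what makes the naive form expensive; production ray tracers replace it with spatial acceleration structures.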

Page 68:

More Advanced Technique

Page 69:

Characteristics of Ray Tracing

• Simulating the Physics of Light

• Simulates light transport by following “photons”

• Fully parallel: just as in nature

• Demand-driven: starts from the camera

• Correctly orders rendering effects (per pixel!)

• Can account for all global effects

• All effects are orthogonal to each other

• Makes content design easy and fast

• Requires a very large amount of CPU time in order to be interactive

• Driven by intersection calculations

• Every ray checked against all objects

• Each secondary ray becomes a primary ray in a recursive algorithm

• 800 x 600 screen, 3 light sources, 50 opaque objects requires 600 billion intersection tests!

Page 70:

Challenges Implementing on Cell

• In-order instruction execution and SIMD
  - Must carefully optimize instructions to avoid stalls
  - Must parallelize code to take advantage of SIMD instructions

• Memory Access
  - DMA engines must move data into LS from XDR
  - Hiding latency requires overlapped I/O and processing (DMA read latency is a few hundred clock cycles)
  - Even more challenging for irregular data access

• Mapping to 8 SPEs
  - Mapping algorithm very important with Cell architecture
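The overlapped I/O and processing mentioned under Memory Access is usually done with double buffering: while one chunk is being processed, the transfer of the next chunk is already in flight. The sketch below only models that loop structure in Python. A single background thread stands in for the DMA engine, and `fetch`, `process` and `double_buffered_sum_of_squares` are hypothetical stand-ins, not Cell SDK calls.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(main_memory, start, size):
    """Stand-in for a DMA transfer from main memory into local store."""
    return list(main_memory[start:start + size])


def process(chunk):
    """Stand-in for the compute kernel run on one local buffer."""
    return sum(x * x for x in chunk)


def double_buffered_sum_of_squares(main_memory, chunk_size):
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Start the first transfer before entering the loop.
        pending = dma.submit(fetch, main_memory, 0, chunk_size)
        offset = 0
        while offset < len(main_memory):
            current = pending.result()       # wait for the buffer in flight
            next_offset = offset + chunk_size
            if next_offset < len(main_memory):
                # Kick off the next transfer before computing on this one.
                pending = dma.submit(fetch, main_memory, next_offset, chunk_size)
            total += process(current)        # compute overlaps the transfer
            offset = next_offset
    return total


result = double_buffered_sum_of_squares(list(range(10)), 4)
print(result)
```

The pattern works well when the next address is known in advance; with irregular access, as in ray traversal, the next chunk often depends on the result of the current one, which is why the slide calls that case more challenging.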

Page 71:

Linear Speed-up Across SPEs

Page 72:

Results

Frames per Second (Normalized to 2.4 GHz Opteron)

2.4 GHz x86          7.2              3.0             2.5
2.4 GHz SPE          7.4 (+3%)        2.6 (-13%)      1.9 (-24%)
2.4 GHz Cell         58.1 (8x)        20 (6.6x)       16.2 (6.4x)
2.4 GHz Dual Cell    110.9 (15.4x)    37.3 (12.4x)    30.6 (12.2x)
3.2 GHz Cell         67.8 (9.4x)      23.2 (7.7x)     18.9 (7.5x)
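The factors in parentheses follow directly from the frame rates. As a quick check, the sketch below recomputes each configuration's speedup over the 2.4 GHz x86 row, column by column. The frame rates are copied from the table; the variable and function names are illustrative.

```python
# Frames per second from the table, one value per column.
baseline = [7.2, 3.0, 2.5]  # 2.4 GHz x86
results = {
    "2.4 GHz SPE": [7.4, 2.6, 1.9],
    "2.4 GHz Cell": [58.1, 20.0, 16.2],
    "2.4 GHz Dual Cell": [110.9, 37.3, 30.6],
    "3.2 GHz Cell": [67.8, 23.2, 18.9],
}


def speedups(fps, base=baseline):
    """Ratio of each frame rate to the baseline in the same column."""
    return [round(f / b, 2) for f, b in zip(fps, base)]


for name, fps in results.items():
    print(name, speedups(fps))

# Scaling from one Cell to two at the same clock rate.
dual_over_single = [
    round(d / s, 2)
    for d, s in zip(results["2.4 GHz Dual Cell"], results["2.4 GHz Cell"])
]
print("dual / single:", dual_over_single)
```

The recomputed ratios agree with the slide to within rounding, and the dual Cell configuration comes out at about 1.9 times a single Cell in each column.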

Page 73:

What is OpenRTRT from Mercury?

• Highly optimized ray tracing rendering engine

• Enabling high-quality rendering at interactive frame rates

• Supports large model visualisation

• Complements GPU OpenGL-based rendering
  - Realism and rendering effects
  - Quality and accuracy
  - Capacity for large models
  - Performance scalability with multiple CPUs and clusters

Page 74:

OpenRTRT: Real-Time Ray Tracing

• Recognized as outstanding, breakthrough technology
  - Cutting edge research and dramatic optimizations achieved by U. Saarland and inTrace: cache & data layout optimization, parallelization (SIMD/SSE, multi-threading, distribution…)
  - Interactive even on a PC, enough for preparation work for instance

• Scalable performance with multiple CPUs
  - Allows fully interactive visualization
  - Performance depends linearly on the number of pixels, rays and processors
  - Logarithmic in scene size (20 million triangles guaranteed)

• Available for Linux on x86, x86-64, and IA64, and for Windows 32

Page 75:

Background

2000 Start of research at the University of Saarland

2001 Presentation of the first scientific results

2002 Initial projects with the automotive industry. Simulation of Ray Tracing hardware

2003 Foundation of inTrace GmbH. Volkswagen AG as first customer (VR – Lab)

2004 New project visualization center at Wolfsburg based on Ray Tracing. First Ray Tracing hardware prototype

2005 Projects with basically all German car manufacturers: VW, Audi, BMW, DaimlerChrysler + Airbus, Boeing, … First design of fully programmable chip for Ray Tracing

2005 Exclusive agreement for worldwide distribution with Mercury Computer Systems