
1

MESCAL: Design Support for Embedded Processors and Applications

Prof. Kurt Keutzer and the MESCAL team, UC Berkeley

2

When We Got Started

[Figure: IC designs per year, 1995 to 1998, ASIC vs. ASSP. Source: Handel Jones, IBS, 9/23/2002]

3

More Trouble for ASICs

[Figure: four pressures on ASICs: DSM effects, complexity, heterogeneity, time-to-money]

Exponentially more complex, greater design risk, greater variety, and a smaller design window!

Quadruple-Whammy

4

Today’s Environment

• Unprecedented desire for product differentiation using per-application silicon, but …

• ASIC design becoming expensive and unpredictable
– Increasing device complexity
– Deep sub-micron effects: interconnect delay, noise
– Design heterogeneity: analog, digital, processors, memory
– Increasing time-to-market pressure

5

The result: total IC Designs

[Figure: total IC designs per year (0 to 10,000), 1995 to 2007, ASIC vs. ASSP. Source: Handel Jones, IBS, 9/23/2002]

6

Solution: ASIC => ASSP => ASIP

ASIP: Programmable Platforms

• Develop platforms that allow for amortization of design costs over multiple generations

• Make platforms programmable so that they have maximum flexibility with minimum overhead

The MESCAL Mission:

– To bring a disciplined methodology, and a supporting tool set, to the development, deployment and programming of application-specific programmable platforms, aka ASIPs

Invited paper: "From ASIC to ASIP: The Next Design Discontinuity", K. Keutzer, S. Malik, R. Newton, Proceedings of ICCD, pp. 84-91, 2002.

Press coverage, Sept 2002:
Programmable Platforms will Rule: http://www.eetimes.com/story/OEG20020911S0063
High on MESCAL: http://www.eetimes.com/story/OEG20020911S0065

[Figure: Intel IXP1200 block diagram: StrongARM core with I$, D$, and mini D$; six µengines; SDRAM controller; SRAM controller; PCI interface; IX Bus interface; hash engine; scratchpad SRAM]

7

The New Design Target

• Explosion of ASIP programmable platforms
– Diverse types of processing elements
– Diverse communications architecture
– Multiple memories
– Peripherals

[Figure: Intel IXP1200 annotated with its ARM core, µEngines, buses, Ethernet MACs, and RAM]

8

A Discipline of Programmable Platform Design

1. Develop a disciplined approach to selecting application benchmarks

2. Develop a disciplined approach to identifying the architectural/micro-architectural design-space to be explored

3. Develop a convenient and comprehensive environment for the description, simulation, and analysis of potential architectural platforms within the design space

4. Develop an environment to efficiently explore and evaluate the design space of architectural platforms


9






10

Step 1: Disciplined Approach to Benchmarking

• The primary goals of (network processor) benchmarks
– The chosen suite of benchmarks should be
  • Representative
  • Easy to specify
  • Made up of a manageable number of benchmarks
– Enable quantitative comparison of architectures

• Developed three benchmark specifications
– IPv4 Packet Forwarding
– Network Address Port Translation
– Multiprotocol Label Switching (MPLS)

• Implemented benchmarks on the Intel IXP1200 in assembler, C, Click, and a commercial environment (Teja)

• M. Tsai, C. Kulkarni, C. Sauer, N. Shah, K. Keutzer, “A Benchmarking Methodology for Network Processors,” First Workshop on Network Processors at the 8th International Symposium on High Performance Computer Architecture (HPCA8), Cambridge MA, USA, February 2002.
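
To make the first benchmark concrete, the core of IPv4 packet forwarding can be sketched in a few lines of Python. This is only an illustration of the general technique (longest-prefix match plus TTL decrement); the routing table, port numbers, and function names are hypothetical and are not taken from the benchmark specification, which also covers details such as header validation that are omitted here.

```python
import ipaddress

# Hypothetical routing table: (prefix, output port).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), 1),
    (ipaddress.ip_network("10.1.0.0/16"), 2),
    (ipaddress.ip_network("0.0.0.0/0"), 0),  # default route
]

def lookup_port(dst):
    """Longest-prefix match: the most specific matching prefix wins."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, port) for net, port in ROUTES if addr in net]
    return max(matches)[1]

def forward(dst, ttl):
    """Return (output_port, new_ttl), or None if the packet must be dropped."""
    if ttl <= 1:
        return None  # TTL would expire at this hop: drop
    return lookup_port(dst), ttl - 1
```

A linear scan over the table is the simplest possible lookup; a real forwarder would typically use a trie or a hardware-assisted structure.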


11






12

Charted the Architectural Diversity of NPUs

[Figure: number of PEs (0 to 64) versus issue width per PE (0 to 9) for surveyed NPUs: Cognigine, Cisco, EZchip, Xelerated, IBM, Lexra, Motorola, Intel, BRECIS, Broadcom, AppliedMicro, Clearwater, ClearSpeed, Vitesse, Agere, PMC-Sierra, Alchemy, Conexant; reference curves at 8, 16, and 64 instrs/cycle]

Surveyed over 30 network processor platforms
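
The reference curves in the chart (8, 16, and 64 instrs/cycle) presumably correspond to the product of the two axes: number of PEs times issue width per PE. A minimal Python sketch under that assumption follows; the function names are hypothetical and no values here describe any surveyed product.

```python
def peak_instrs_per_cycle(num_pes, issue_width):
    """Upper bound on instructions issued per cycle across all PEs."""
    return num_pes * issue_width

def points_on_curve(target, max_issue_width=9):
    """Integer (issue_width, num_pes) pairs whose product equals target."""
    return [
        (width, target // width)
        for width in range(1, max_issue_width + 1)
        if target % width == 0
    ]

curve_16 = points_on_curve(16)
```

Points on the same curve trade PE-level parallelism against instruction-level parallelism at equal peak throughput.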


13

Step 2: Defined the Architectural Search Space

Focused on NPUs, but this has been a robust classification for ASIPs

5 Axes of the Architectural Design Space

• Approaches to Parallel Processing
– Processing Element (PE) level
– Instruction-level
– Bit-level

• Elements of Special Purpose Hardware

• Structure of Memory Architectures

• Wide variety of On-Chip Communication Mechanisms

• Use of wide range of peripherals

Niraj Shah. Understanding Network Processors. Master's thesis, University of California, Berkeley, September, 2001.

Invited paper: Network Processors: Origin of Species, Niraj Shah, Kurt Keutzer, Proceedings of ISCIS XVII, The Seventeenth International Symposium on Computer and Information Sciences, October, 2002

14







15

Step 3: comprehensive environment for the description, simulation, and analysis of architectural platforms

Three significant sub-problems:

• Individual processor models

• Communication network models

• Task-specific processor models

[Figure: Home Gateway example platform: NPU, MEM, IP-SEC, Media Server, Media Acceleration, and SATA blocks connected by a NoC, with interfaces to Fiber, GbE, Ethernet, 802.11g, POTS, UWB, the Internet, and the Office Network]

16

Key Features

• Natural description
– The environment must enable the easy description of all the key elements of the programmable platform

• Automated high-performance simulation
– The environment must automatically generate simulation models
– Simulation models must be high-performance

• Amenable to analysis
– Analytical or simulation models must provide the relevant information for making key design decisions

• Industrial strength
– The environment must be capable of describing, simulating, and analyzing REAL industrial-strength designs

17



18

Processor Modeling with MADL

• Research focus
– Modeling concurrency and resource utilization in processors
– Automating software tool-chain generation

• Achievements
– Operation State Machine (OSM) as microprocessor model (for StrongARM, PowerPC 750, TMS320C54x)

[Figure: MADL tool flow with the labels MADL, PE model, Application, Model Analyzer, Compiler, Simulator, Machine Code]

W. Qin, S. Malik. Automated Synthesis of Efficient Binary Decoders for Retargetable Software Toolkits, Proceedings of the 40th Design Automation Conference (DAC 03), June 2003, pp. 764-769.

W. Qin, S. Malik. Flexible and Formal Modeling of Microprocessors with Application to Retargetable Simulation, Proceedings of 2003 Design Automation and Test in Europe Conference (DATE 03), Mar. 2003, pp. 556-561.

19



M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, A. Sangiovanni-Vincentelli, "Addressing the System-on-a-Chip Interconnect Woes Through Communication-Based Design", Proceedings of the 38th Design Automation Conference, Los Angeles, CA., pp. 667-672, June 2001.

20

Network-on-a-Chip (NOC) Architectures

• Research focus:
– Design space exploration tools to evaluate and make NOC design choices

• An application-driven approach based on modular modeling environments
– Multiprocessor simulators developed based on SystemC, Liberty Simulation Environment (LSE)

[Figure: a NOC description and a distributed application as inputs to a simulation engine, with timing and power as outputs]

Xinping Zhu, Sharad Malik, A Hierarchical Modeling Framework for On-Chip Communication Architectures, Proceedings of International Conference on Computer-Aided Design, 2002.

Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh and Sharad Malik, Orion: A Power-Performance Simulator for Interconnection Networks, In Proceedings of the 35th International Symposium on Microarchitecture (MICRO), Istanbul, Turkey, November 2002.

21

Power-aware Networks-on-a-Chip

Research focus: Modeling and development of power-efficient network architectures

• Hang-Sheng Wang, Li-Shiuan Peh and Sharad Malik, "Power-Driven Design of Router Microarchitectures in On-Chip Networks", In Proceedings of the 36th International Symposium on Microarchitecture (MICRO), San Diego, November 2003, to appear.

• Hang-Sheng Wang, Li-Shiuan Peh and Sharad Malik, Power Model for Routers: Modeling Alpha 21364 and InfiniBand Routers, In IEEE Micro, Vol. 24, No. 1, January/February 2003 (Best of Hot Interconnects 10).

Power-efficient network architectures/microarchitectures

[Figure: average power savings (0% to 50%) of synthetic and real traces. Benchmarks: 8x8 torus random traffic, 4x4 torus random traffic, TRIPS CMP traces. Series: cut-through crossbar, segmented crossbar, write-through buffer, express cube, all]

Power modeling of interconnection networks

[Figure: energy consumed by a flit: in/out queue energy (buffer model), xb traversal energy (xb model), link energy (link model), arbitration energy (arbiter model)]

Architectural-level power modeling, validated against Raw microprocessor and other routers
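
The per-flit energy breakdown in the figure (queue, crossbar, arbitration, and link energy) can be read as a simple sum per hop. The following Python sketch shows that bookkeeping only; the class name, the picojoule values, and the assumption that every hop costs the same are hypothetical and are not taken from the published power models.

```python
from dataclasses import dataclass

@dataclass
class HopEnergy:
    """Per-hop energy components for one flit (hypothetical values, in pJ)."""
    buffer: float       # in/out queue energy
    crossbar: float     # xb traversal energy
    arbitration: float  # arbitration energy
    link: float         # link energy

    def per_hop(self):
        """Energy for one flit to pass one router and one link."""
        return self.buffer + self.crossbar + self.arbitration + self.link

def flit_energy(hop, hops):
    """Energy for one flit crossing `hops` identical router-plus-link hops."""
    return hops * hop.per_hop()

example = HopEnergy(buffer=2.0, crossbar=1.5, arbitration=0.5, link=1.0)
total = flit_energy(example, hops=3)
```

Keeping the components separate makes it easy to see which router structure dominates before choosing a microarchitectural optimization.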

22


23

TIPI: A design environment for task-specific processors

Research focus: operation-based design approach for datapath-intensive task-specific processors

• Convolution Coding Processor was pushed through the Tipi design methodology by an industrial ASIC designer

• An automatically generated compiled-code simulator executed >100 million instructions/second on a 2.4 GHz P4.

• Synthesizable RTL was also automatically generated.

• "Multi-View Operation-Level Design -- Supporting the Design of Irregular ASIPs", Scott J Weber, Matthew W. Moskewicz, Manuel Loew, and Kurt Keutzer, University of California, Berkeley, UCB/ERL M03/12, April 2003

24


25

The Liberty Simulation Environment: Released!

• Research focus: Validation, Automatic Model Generation, Model Language Theory, Simulator Synthesis

• Detailed micro-architectural modeling

• Users: UCLA, UPC Barcelona, Colorado, UIUC, Rice, Intel, UMich, Princeton, Infineon

• Models: IA-64, DLX, Multiprocessor, Networks/Routers

• Version 1.0 Release Party at MICRO!

Optimizations for a Simulator Construction System Supporting Reusable Components, David A. Penry and David I. August, Proceedings of the 40th Design Automation Conference, June 2003.

Microarchitectural Exploration with Liberty, Manish Vachharajani, Neil Vachharajani, David A. Penry, Jason A. Blome, and David I. August, Proceedings of the 35th International Symposium on Microarchitecture, November 2002. (Best Student Paper Award)

26







27

Step 4: Efficiently explore and evaluate the design space of architectural platforms

[Figure: design space exploration diagram with the labels Application, Architecture, Architecture Platform, and Performance Analysis]

28

Comprehensive Survey of Design Space Exploration Methods

Research focus: comprehensive survey of design space exploration techniques

• Comparison of 16 frameworks, 9 evaluation schemes, 18 covering and automation methods, cost functions, and representations for architectures and applications

• Overall more than 120 papers considered

• M. Gries: Methods for Evaluating and Covering the Design Space during Early Design Development. Technical report, UC Berkeley, UCB/ERL M03/32, 53 pages, Aug. 2003

[Figure: Architecture and Application joined by Mapping, followed by Evaluation]

29

Case Study on Fast Design Space Exploration

Research focus:

• Evaluation of analytical method for fast design space exploration

• Comparison for IPv4 packet forwarding on Intel IXP

M. Gries, C. Kulkarni, C. Sauer, K. Keutzer: Comparing Analytical Modeling with Simulation for Network Processors. DATE, March 2003

[Figure: end-to-end packet delay [µs] (0 to 40) versus packet length [byte] (40, 64, 65, 128, 129, 192, 193, 256), simulation versus NP-GPS analysis]

[Figure: µ-Engine load [%] (0% to 90%) versus packet length [byte] (40, 64, 65, 128, 129, 192, 193, 256) for the analytical model and for simulation, the latter split into computation part and polling artifacts]
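
The sampled packet lengths (64/65, 128/129, 192/193) sit on either side of 64-byte boundaries, which suggests that per-packet work grows in 64-byte steps. The Python sketch below shows that step function only, assuming 64-byte chunking; the function name is hypothetical and this is not the NP-GPS analysis itself.

```python
CHUNK_BYTES = 64  # assumed chunk size, inferred from the sampled packet lengths

def chunks_per_packet(length_bytes):
    """Number of fixed-size chunks needed to carry a packet (ceiling division)."""
    return (length_bytes + CHUNK_BYTES - 1) // CHUNK_BYTES

steps = {n: chunks_per_packet(n) for n in (40, 64, 65, 128, 129, 192, 193, 256)}
```

Under this assumption a 65-byte packet costs as many chunk operations as a 128-byte one, which is why sampling just above each boundary is informative.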

30

Exploring Processing Element Topologies

Research focus

• Which topology performs best?

• What is the impact of choosing a certain topology on the programmability of the device?

• Scaling issues?

[Figure: number of PEs per stage versus number of PE stages, locating the uniprocessor, Intel (IXP1200), Cisco (PXF/Toaster-2), Agere (Payload Plus), Vitesse IQ2000, Broadcom 12500, and Xelerated Packet Devices, together with four 8-PE topologies: 1x8 pool, 2x4 pool of pipelines, 4x2 pool of pipelines, 8x1 pipeline]

M. Gries, C. Kulkarni, C. Sauer, K. Keutzer: Exploring Trade-offs in Performance and Programmability of Processing Element Topologies for Network Processors, In: Network Processor Design: Issues and Practices, volume 2 (NP2 Workshop @ HPCA9), Morgan Kaufmann Publishers, Oct. 2003
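
The four 8-PE topologies named in the figure are the ways of factoring eight PEs into stages and PEs per stage. A short Python sketch that enumerates them follows, assuming the first number in labels such as "2x4" is the number of stages; the function names are hypothetical.

```python
def topologies(num_pes):
    """All (stages, pes_per_stage) pairs that use exactly num_pes PEs."""
    return [
        (stages, num_pes // stages)
        for stages in range(1, num_pes + 1)
        if num_pes % stages == 0
    ]

def describe(stages, width):
    """Name a topology the way the figure labels it."""
    if stages == 1:
        return f"{stages}x{width} pool"
    if width == 1:
        return f"{stages}x{width} pipeline"
    return f"{stages}x{width} pool of pipelines"

names = [describe(stages, width) for stages, width in topologies(8)]
```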


31

The New Design Source

• Heterogeneous applications

• Multiple flavors of concurrency

[Figure: Click graph of a 16-port IP forwarder built from FromDevice(0) to FromDevice(15), IPVerify, LookupIPRoute, DecIPTTL, Discard, and ToDevice(0) to ToDevice(15) elements]

32

Modeling

• Many interesting problems in modeling complex heterogeneous systems

• We are hoping that Metropolis and Ptolemy II solve them all


33

Implementation Gap

The New Implementation Problem

Mapping concurrent heterogeneous applications onto heterogeneous multiprocessor systems

Can we bridge this gap and provide

- Programmer productivity
- Implementation efficiency
- System correctness

34

Implementation Gap

The New Design Problem

Mapping concurrent heterogeneous applications onto heterogeneous multiprocessor systems

Can we bridge this gap and provide

- Programmer productivity
- Implementation efficiency
- System correctness

Goal: Close the gap!

35

MESCAL Approaches

• Bottom-up: generalize from specific instance
– Start with a specific application domain and a specific architecture
– Develop useful abstractions of the device
– Aspire to achieve within 10% of hand-coded performance with 2-5X improvement in productivity
– Should teach us a lot about how to get this right

• Top-down: specify from general approach
– Consider heterogeneous applications that use combinations of models of computation (MoCs)
– Develop a mapping discipline
  • Correct-by-construction implementation
  • Target a broad class of architectures
– Should teach us a lot about how to provide a general solution

36

Bottom-up Approach

• Start with a specific application development environment and a specific architecture instance

• Identify the preferred device: IXP1200

• Identify the preferred programming environment: Click

• Attempt to fill the implementation gap
– Within 10% of hand-coded efficiency
– With 2-5X productivity

37

Target Architecture of choice: Intel IXP1200

[Figure: Intel IXP1200 block diagram: StrongARM core with I$, D$, and mini D$; six µengines; SDRAM controller; SRAM controller; PCI interface; IX Bus interface; hash engine; scratchpad SRAM]

38

IXP1200 Programming Difficulties

• Current programming abstraction: IXP-C
– Subset of C
– Need to write 6 parallel multi-threaded programs
– Not clear where the architectural bottlenecks are

• Programmer must still:
– Divide code among threads
– Take advantage of distributed, heterogeneous memories
– Arbitrate access to shared resources
– Interact with peripherals
– Take advantage of application concurrency

39

Environment of choice: Click

• Domain-specific language for describing networking applications

• Applications are built by composing elements that correspond to common packet processing operations

• Elements communicate via ports that pass packets
– push: initiated by source element
– pull: initiated by destination element

• Current implementation in C++ for Linux workstations

[Figure: example Click graph with FromDevice(0), FromDevice(1), LookupIPRoute, ToDevice(0), and ToDevice(1)]

Source: E. Kohler et al. The Click Modular Router. TOCS, pp. 263-297, August 2000.
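
The difference between push and pull ports is easiest to see in a toy model. The Python sketch below is an illustration only: the class names and methods are hypothetical stand-ins, not Click's C++ API. A push is started by the upstream element, a pull by the downstream element, and a queue sits where the two styles meet.

```python
from collections import deque

class Counter:
    """Push element: counts each packet, then pushes it downstream."""
    def __init__(self, downstream):
        self.count = 0
        self.downstream = downstream

    def push(self, packet):
        self.count += 1
        self.downstream.push(packet)

class Queue:
    """Push-to-pull boundary: stores pushed packets until something pulls them."""
    def __init__(self, capacity):
        self.packets = deque()
        self.capacity = capacity
        self.drops = 0

    def push(self, packet):
        if len(self.packets) >= self.capacity:
            self.drops += 1  # queue full: drop the packet
        else:
            self.packets.append(packet)

    def pull(self):
        return self.packets.popleft() if self.packets else None

class Sink:
    """Pull-driven output: asks upstream for a packet when ready to send."""
    def __init__(self, upstream):
        self.upstream = upstream
        self.sent = []

    def run_once(self):
        packet = self.upstream.pull()
        if packet is not None:
            self.sent.append(packet)
        return packet

queue = Queue(capacity=2)
counter = Counter(queue)
sink = Sink(queue)
for name in ["p0", "p1", "p2"]:
    counter.push(name)  # the source side initiates each transfer
sink.run_once()         # the destination side initiates this transfer
```

With a capacity of two, the third pushed packet is dropped at the queue, and the single pull delivers the oldest packet.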

40

Bridging the Gap

Programming Model (NP-Click)
• raises abstraction of architecture
• facilitates mapping of application environment

Implementation Gap

41

What is a Programming Model?

• A programmer’s view of the architecture that balances:

Opacity

– Abstract architecture

– Obviate need to initially learn microarchitecture

– Ease of programming

Visibility

– Expose key architectural features

– Allow performance improvement

– Enable efficient implementation

• Presents a productive approach to using computational power of the device

42

Our Solution: NP-Click

• NP-Click is a programming model implemented on the Intel IXP1200

• Integrates concepts from Click
– elements
– communication via push and pull of packets

• And an abstraction of the underlying hardware
– thread boundaries
– data layout
– arbitration of shared resources

43

NP-Click: Usage Model

• Methodology: identify what is important to the programmer and narrow the scope of their concerns

• Two steps
– Design application by composing elements
  • determine thread boundaries
  • map shared data to physical memory
  • select/implement arbitration schemes
– Implement elements in IXP-C
  • elements have well-defined I/O
  • data descriptors for scoping of variables
  • simple interface to access shared resources
  • special-purpose hardware
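
The first step of the usage model amounts to writing down a mapping: which thread runs each element, and which memory holds each piece of shared data. The Python sketch below checks such a mapping for obvious mistakes. Everything in it is hypothetical (the description format, the element and data names, and the function), and the figure of four threads per microengine is an assumption of this sketch rather than something stated in the slides; it does not reproduce NP-Click's actual syntax.

```python
NUM_MICROENGINES = 6    # one program per microengine, as noted for the IXP1200
THREADS_PER_ENGINE = 4  # assumption for this sketch
MEMORIES = {"scratchpad", "sram", "sdram"}

def check_mapping(thread_of, memory_of):
    """Return a list of problems found in a mapping description."""
    problems = []
    for element, (engine, thread) in thread_of.items():
        valid_engine = 0 <= engine < NUM_MICROENGINES
        valid_thread = 0 <= thread < THREADS_PER_ENGINE
        if not (valid_engine and valid_thread):
            problems.append(f"{element}: no such thread {engine}.{thread}")
    for data, memory in memory_of.items():
        if memory not in MEMORIES:
            problems.append(f"{data}: unknown memory {memory}")
    return problems

# Hypothetical mapping with one deliberate mistake (microengine 6 does not exist).
thread_of = {"FromDevice0": (0, 0), "LookupIPRoute": (0, 1), "ToDevice0": (6, 0)}
memory_of = {"route_table": "sram", "packet_buffers": "sdram"}
problems = check_mapping(thread_of, memory_of)
```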

44

Evaluating the Methodology

• Implemented a 16-port IPv4 packet forwarder
– NP-Click
– NP-Click with arbitration optimization
– IXP-C (hand-coded)
– Assembler (hand-coded)

• Use maximum sustainable data rate as proxy for performance

• Measured performance across a range of packet sizes, including an IETF-recommended packet mix
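
The slides do not say how the maximum sustainable data rate was found. One common way to locate such a threshold is a binary search over offered load, sketched below in Python under the assumption that the system keeps up with every rate below some limit and fails above it. The function names and the 700 Mbps stand-in are hypothetical and are not measured results.

```python
def max_sustainable_rate(sustains, low=0.0, high=2000.0, tolerance=1.0):
    """Binary search for the highest offered load (Mbps) that `sustains` accepts.

    Assumes sustains(low) is True, sustains(high) is False, and the answer
    is monotone: True up to some rate and False above it.
    """
    while high - low > tolerance:
        mid = (low + high) / 2
        if sustains(mid):
            low = mid
        else:
            high = mid
    return low

def fake_forwarder(rate_mbps):
    """Stand-in for a real measurement run: keeps up with loads up to 700 Mbps."""
    return rate_mbps <= 700.0

rate = max_sustainable_rate(fake_forwarder)
```

In practice each call to `sustains` would be a full simulation or hardware run at one packet size, so the logarithmic number of probes matters.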


45

Initial Results

• NP-Click achieves 35% of IXP-C’s performance

• Poor TFIFO arbitration scheme is responsible for performance shortfall

[Figure: data rate (Mbps, 0 to 1400) versus input packet size (64, 128, 256, 512, 1024, 1280, 1518, IETF mix) for NP-Click and IXP-C]

Source: N. Shah et al, “NP-Click: A Programming Model for the Intel IXP1200,” NP-2, 9th HPCA, 2003.

46

Performance Tuning

• A better arbitration scheme results in >2x performance improvement

• This improved version performs within 10% of IXP-C for larger packets

[Figure: data rate (Mbps, 0 to 1400) for NP-Click, NP-Click (w/arb opt), and IXP-C]

Source: N. Shah et al, “NP-Click: A Programming Model for the Intel IXP1200,” NP-2, 9th HPCA, 2003.


47

Comparison to Assembly Language

• ASM version outperforms IXP-C version by ~15%

• Fine-grain synchronization with TFIFO state machine

[Figure: data rate (Mbps, 0 to 1600) for NP-Click, NP-Click (w/arb opt), IXP-C, and ASM]

Source: N. Shah et al, “A Comparison of Programming Models”, submitted to LCTES 2003.

48

Bottom-up: Lessons Learned

• What does the designer need to see in order to do mapping?
– Application characteristics
– Architectural features

• Concurrency
– Application thread boundaries
– Architectural multiprocessing capabilities
– Match threads with PEs

• State
– Application memory usage
– Multiprocessor memory architecture / memory hierarchy

• Arbitration of shared resources
– Special-purpose function units
– I/O

49

Top-down Approach

• Start with a general application development environment and a broad family of architectures
– Heterogeneous applications are important
– Architectural features evolve during design-space exploration

• Create a formal model of the application
– Capture application concurrency
– Handle heterogeneous combinations of MoCs

• Disciplined approach to mapping
– Enable design-space exploration
– Discover architectural features that give the most performance

• Warpath
– Model heterogeneous applications (with the goal of implementation)
– Map to Teepee architectures

50

Warpath

• Disciplined methodologies and a supporting tool set for the top-down approach

Formal models capture concurrency

Formal model enables automatic exportation

Correct-by-construction implementation

[Figure: Warpath flow with the labels Applications, Application Development Environment, Programmer’s Model, Architecture Instance, Mapping Process, Code Generation Process, and Performance Analysis]

51

Disciplined Design-Space Exploration

• Y-chart (Kienhuis, Deprettere et al. 2001), Polis 2001

[Figure: Y-chart flow with the labels Applications, Programmer’s Model, Mapping Process, Code Generation Process, and Performance Analysis, plus three feedback paths: suggest architectural improvements, modify the applications, use different mapping strategies]
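
The Y-chart feedback loop can be sketched as a search over architecture and mapping choices, scored by a performance analysis step. The Python below is a toy exhaustive version; the candidate names, the utilization factors, and the cost function are made-up stand-ins for real mapping and analysis tools, and the third feedback path (modifying the applications) is left out.

```python
from itertools import product

def explore(architectures, mappings, evaluate):
    """Evaluate every (architecture, mapping) pair; return the lowest-cost one."""
    best = None
    for arch, mapping in product(architectures, mappings):
        cost = evaluate(arch, mapping)
        if best is None or cost < best[0]:
            best = (cost, arch, mapping)
    return best

# Toy stand-ins: PE counts per architecture and utilization per mapping strategy.
ARCHS = {"4 PEs": 4, "8 PEs": 8}
MAPPINGS = {"pipeline": 0.6, "pool": 0.9}

def toy_cost(arch, mapping):
    """Made-up cost: time per packet shrinks with the number of usefully busy PEs."""
    return 1.0 / (ARCHS[arch] * MAPPINGS[mapping])

best_cost, best_arch, best_mapping = explore(ARCHS, MAPPINGS, toy_cost)
```

Exhaustive search only works for tiny spaces; it stands in here for whatever covering strategy a real exploration would use.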

52

Application Development

• Model concurrent applications formally with Models of Computation

• CLICK: MoC and actor library for network processing applications


53

Warpath Application Development Env.

• Good ideas from Ptolemy II
– Models of Computation
– Orthogonalization of computation, communication, and control
– Library of domain-polymorphic components
– Hierarchical heterogeneity

• Targeted for implementation on a Teepee architectural platform
– Strict software interfaces for computation, communication, control
– Separate implementation and visualization
– Get rid of Java
– Don’t assume RISC-like datapaths

54

Teepee Processing Elements

• Control structures are implicit in the model

• Control synthesis strategies:
– Hardcoded state machine
– Horizontal/vertical microcode
– Reconfigurable
– RISC/VLIW
– None of the above

• Runs sequential programs Executes one or more operations each cycle

• Opportunity to customize processing element control to the style of computation the application uses


55

Lessons Learned

• What to capture in an Application Development Environment?
– Ptolemy II
– Separate communications, computation, control

• What to export up from an architecture?
– Processing element capabilities
– Communication architecture capabilities

• Communications Implementation View
– Match application actors with architecture PEs
– Implement communication semantics over communication architecture
– Verify that an implementation is correct

56

MESCAL Summary

• Address the key challenges in supporting the design and deployment of, and implementation on, a new generation of programmable platforms

• Supply new generation of ASIPs with programming models

• Close the implementation gap between application development environments and target ASIPs

• Explore in parallel a "bottom-up" approach seeking "industrial strength" results and a "top-down" approach seeking a generally applicable methodology

• Examine tradeoffs between
– Quality-of-results (e.g. speed, but also power, device cost)
– Programmer productivity (how long does all this take?)

• Active questions:
– What are the costs and benefits of a general approach vs. an application- and architecture-specific approach?
