
Sesame: Opening new doors to Multi-level Design Space Exploration of Embedded Systems Architectures
Andy D. Pimentel

University of Amsterdam, Informatics Institute, Computer Systems Architecture group

Thank you.

Questions?

Outline
- Background and problem statement
- General overview of modeling methodology
- Sesame environment
  - Application modeling layer
  - Architecture modeling layer
  - Mapping layer
- Gradual refinement of architecture models
  - Event refinement using dataflow graphs
  - Both computational and communication refinement
- Current status and future work

Sketching the context
- Let's play a little quiz: what is the most popular microprocessor around?
- You may have answered something like Intel Pentium
  - If so, thanks for playing!
- Intel Pentium has almost 0% market share. Zip. Zilch.
- Pentium is a statistically insignificant chip with tiny sales!
- The answer should (of course?) be: embedded processors (no particular brand)

Sketching the context (contd)
- Relating microprocessors to life on earth: are Pentiums the viruses of the microprocessor market? ;-)

Sketching the context (contd)

Sketching the context (contd)
- Estimation: 5 times as much embedded software as normal software
- Embedded systems are everywhere
  - On average, a human touches about 50 to 100 embedded processors per day
  - The average car has 15 processors, a luxury one ~60!
- The domain of embedded multimedia and signal processing applications plays an important role
  - Camcorders, PDAs, set-top boxes, (digital) TVs, cell phones, etc.

Embedded media systems
- Modern embedded systems for media and signal processing must
  - support multiple applications and various standards
  - often provide real-time performance
- These systems increasingly have heterogeneous system architectures, integrating
  - Dedicated hardware: high performance and low power/cost
  - Embedded processor cores: high flexibility
  - Reconfigurable components (e.g. FPGAs): good performance/power/flexibility

Trends in system design (contd)
- Silicon budgets are increasing (Moore's Law)
  - Integration of functions: Systems-on-Chip
  - (Massively) parallel systems on a single chip!
- Life cycle of systems is decreasing (e.g., look at cell phones)
  - Short time to market

Design crisis
[Figure: log-scale plot over technology nodes (micron): 0.35, 0.25, 0.18, 0.15, 0.12, 0.1]

The system design problem
- Design better products faster
- Design productivity
  - Design technology: architectures, methods, tools, libraries
- Design quality
  - Low cost, low power, flexible, no bugs
- Multi-dimensional design space with many tradeoffs:
  - Cost (silicon area, design time)
  - Performance
  - Power consumption
  - Flexibility
  - Time-to-market
  - etc.

Design tradeoffs: computational efficiency

From Applications to Silicon + Software
[Figure labels: Application(s), HW/SW Architecture, Architecture components (TM, CP1, MIPS), Silicon + Software]

Rethinking system design
- Design complexity forces us to reconsider current design practice
- Classical design methods
  - often start from a single application specification, which is gradually synthesized into a HW/SW implementation
  - lack the generality to cope with highly programmable architectures targeting multiple applications
  - also hamper extensibility to efficiently support future applications

Rethinking system design (contd)
- Traditionally, designers rely only on detailed simulators for design space exploration
  - HW/SW co-simulation
- This approach becomes infeasible for the early design stages
  - The effort to build these simulators is too high as systems become too complex
  - The low speeds of these simulators seriously hamper architectural exploration
  - HW/SW co-simulation requires a HW/SW partitioning
    - A new system model is needed to assess each HW/SW partitioning

Jumping down the design pyramid
[Figure: design pyramid with axes for abstraction, effort, and alternative realizations, each ranging between high and low]

Design by stepwise refinement
[Figure: design pyramid with axes for abstraction, effort, and alternative realizations, each ranging between high and low]

Sesame
- Simulation of Embedded Systems Architectures for Multi-level Exploration
- Part of the Artemisia project
  - Design methods for NoC-based embedded systems
  - Co-operation of
    - Leiden Embedded Research Center, Leiden University (prof. E.F. Deprettere)
    - Computer Engineering group, Delft University of Technology (prof. S. Vassiliadis)
    - Computer Systems Architecture group, University of Amsterdam (prof. C. Jesshope)
    - Philips Research Labs in Eindhoven

Sesame
- Simulation of Embedded Systems Architectures for Multi-level Exploration
- Provides methods and tools to efficiently evaluate the performance of heterogeneous embedded systems and explore their design space
  - Different architectures, applications, and mappings
  - Different HW/SW partitionings
- Smooth transition between abstraction levels
  - Mixed-level simulations
- Promotes reuse of models (reuse of IP)
- Targets the multimedia application domain
  - Techniques and tools also applicable to other application domains

Y-chart Design Methodology [Kienhuis]
[Figure: Y-chart; surviving label: Architecture]

Modeling and simulation using the Y-Chart methodology
- Application model
  - Description of the functional behavior of an application
  - Independent of architecture, HW/SW partitioning and timing characteristics
  - Generates application events representing the workload imposed on the architecture
- Architecture model
  - Parameterized timing behavior of architecture components
  - Models timing consequences of application events
- Explicit mapping of application and architecture models
  - Trace-driven co-simulation [Lieverse]
  - Easy reuse of both application and architecture models!

Application modeling
- Using Kahn Process Networks (KPNs)
  - Parallel (C/C++) processes communicating with each other via unbounded FIFO channels
  - Expresses parallelism in an application and makes communication explicit
  - Blocking reads, non-blocking writes
- Generation of application events:
  - Code is instrumented with annotations describing computational actions
  - Reading from/writing to Kahn channels represents communication behavior
  - Application events can be very coarse-grained, like "compute a DCT" or "read/write a pixel block"
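The idea of an instrumented Kahn process network can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Sesame's application models are written in C/C++, and the names producer, transformer, consumer and run_network, as well as the tuple format of the events, are hypothetical. Each process owns its trace, unbounded queue.Queue objects stand in for Kahn channels (get blocks, put does not), and the multiplication is a stand-in for a real computation such as a DCT.

```python
import queue
import threading


def producer(out_ch, trace, n_blocks):
    """Source process: writes n_blocks tokens, then an end-of-stream marker."""
    for i in range(n_blocks):
        out_ch.put(i)                      # non-blocking write on an unbounded FIFO
        trace.append(("write", "block"))   # communication event
    out_ch.put(None)                       # hypothetical end-of-stream token


def transformer(in_ch, out_ch, trace):
    """Worker process: read a block, compute on it, write the result."""
    while True:
        block = in_ch.get()                # blocking read (Kahn semantics)
        if block is None:
            out_ch.put(None)
            break
        trace.append(("read", "block"))
        result = block * 2                 # stand-in for e.g. a DCT
        trace.append(("execute", "dct"))   # annotation: computational event
        out_ch.put(result)
        trace.append(("write", "block"))


def consumer(in_ch, trace, results):
    """Sink process: reads blocks until the end-of-stream marker arrives."""
    while True:
        block = in_ch.get()
        if block is None:
            break
        trace.append(("read", "block"))
        results.append(block)


def run_network(n_blocks=3):
    """Run the three-process network and return per-process event traces."""
    a_to_b, b_to_c = queue.Queue(), queue.Queue()   # unbounded FIFOs
    traces = {"A": [], "B": [], "C": []}
    results = []
    threads = [
        threading.Thread(target=producer, args=(a_to_b, traces["A"], n_blocks)),
        threading.Thread(target=transformer, args=(a_to_b, b_to_c, traces["B"])),
        threading.Thread(target=consumer, args=(b_to_c, traces["C"], results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return traces, results
```

Because every process only does blocking reads on its own input, the trace of each process is the same no matter how the threads are interleaved, which is the property that lets the traces be replayed against an independently running architecture model.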

Application modeling (contd)
- Why Kahn process networks (KPNs)?
  - They fit the multimedia application domain very well
  - KPNs are deterministic
    - This automatically guarantees validity of event traces when application and architecture simulators are executed independently
- The application model can also be analyzed in isolation from any architecture model
  - Investigation of upper performance bounds and early recognition of bottlenecks within the application

Architecture modeling
- Architecture models react to application trace events to simulate the timing behavior
  - Accounting for functional behavior is not necessary!
- Architecture modeling at varying abstraction levels
  - Starting at black-box level
- Processing cores can model timing behavior of SW, HW or reconfigurable execution
  - Parameterizable latencies for the application events
  - SW execution = high latency, HW execution = low latency
  - Allows for rapid evaluation of different HW/SW partitionings!
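A black-box processing core of this kind can be pictured as little more than a latency table. The sketch below is a minimal illustration, not Sesame code: the function simulate, the event names, and all latency values are hypothetical and chosen only to show how swapping the table switches a task between SW and HW execution.

```python
# Hypothetical latencies in cycles per application event (illustrative values only).
SW_LATENCY = {"read": 20, "write": 20, "dct": 4000}   # DCT executed in software
HW_LATENCY = {"read": 20, "write": 20, "dct": 300}    # DCT executed in dedicated hardware


def simulate(trace, latency):
    """Return the total cycles a black-box core spends on an event trace.

    Only timing is modeled: each event is looked up in the latency table
    and its functional behavior is ignored.
    """
    return sum(latency[event] for event in trace)


# One read/compute/write round per block, for four blocks.
trace = ["read", "dct", "write"] * 4
sw_cycles = simulate(trace, SW_LATENCY)
hw_cycles = simulate(trace, HW_LATENCY)
```

Evaluating another HW/SW partitioning then amounts to changing a parameter table while the application trace stays the same.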

Architecture modeling (contd)
- Models implemented in Pearl
  - Object-based discrete event simulation language
    - Keeps track of virtual time
    - Provides simulation primitives
    - Inter-object communication via message passing
    - Keeps track of simulation statistics
  - RISC-like language: keep it simple and make the common case fast
    - Lacks features not needed for architectural modeling (e.g., no dynamic data structures, no dynamic object creation)
- Result: high-performance modeling & simulation
  - High simulation speed and low modeling effort
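To make the ingredients listed above concrete (virtual time, message passing between objects, statistics), here is a minimal discrete-event kernel in Python. It is not Pearl and does not reproduce Pearl's syntax or semantics; the classes Simulator, Memory and Processor and all latencies are assumptions made purely for illustration.

```python
import heapq
import itertools


class Simulator:
    """Tiny discrete-event kernel: a time-ordered queue of pending messages."""

    def __init__(self):
        self.now = 0                      # virtual time
        self._queue = []
        self._seq = itertools.count()     # tie-breaker keeps heap entries comparable

    def send(self, delay, target, message):
        """Deliver message to target after delay units of virtual time."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), target, message))

    def run(self):
        while self._queue:
            self.now, _, target, message = heapq.heappop(self._queue)
            target.receive(self, message)


class Memory:
    def __init__(self, latency):
        self.latency = latency
        self.accesses = 0                 # simulation statistic

    def receive(self, sim, message):
        _kind, sender = message
        self.accesses += 1
        sim.send(self.latency, sender, ("reply", self))


class Processor:
    def __init__(self, memory, n_loads):
        self.memory = memory
        self.remaining = n_loads
        self.finished_at = None

    def start(self, sim):
        self._issue(sim)

    def _issue(self, sim):
        if self.remaining == 0:
            self.finished_at = sim.now
            return
        self.remaining -= 1
        sim.send(1, self.memory, ("load", self))   # 1 time unit to issue a load

    def receive(self, sim, message):
        self._issue(sim)                  # each reply triggers the next load
```

A run with three loads and a memory latency of 10 looks like this:

```python
sim = Simulator()
mem = Memory(latency=10)
cpu = Processor(mem, n_loads=3)
cpu.start(sim)
sim.run()
```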

Pearl: an example
[Figure: Pearl code example; labels: Processor, object, message]

Architecture modeling (contd)
- Models implemented in SystemC
  - We added a layer on top of SystemC 2.0, called SCPEx (SystemC Pearl Extension)
    - Provides SystemC with Pearl's message-passing semantics
    - Raises the abstraction level of SystemC (e.g., no ports, transparent incorporation of synchronization)
    - Improves transaction-level modeling
  - SCPEx enables reuse of Pearl models in a SystemC context
    - Makes Pearl-to-SystemC translation trivial
  - Provides a link towards possible implementation
  - Facilitates importing SystemC IP models in Sesame

Sesame in layers
[Figure labels: Application model, Event trace, Mapping layer, Architecture model]

Sesame's mapping layer
- Maps application tasks (event traces) to architecture model components
- Guarantees deadlock-free scheduling of application events

Scheduling of communication events
[Figure: application model with processes A, B and C mapped onto an architecture model with two processing cores and a bus; events shown: Read(C), Write(C), Read(B), Write(A)]
- Because Read events are blocking (Kahn), some schedules may yield deadlock

Sesame's mapping layer
- Maps application tasks (event traces) to architecture model components
- Guarantees deadlock-free scheduling of application events
- Accounts for synchronization behavior
- Mapping layer executes in the same time domain as the architecture model
- Transforms application-level events into primitives (events) for the architecture model
  - More on this later on...
- Tool for auto-generation of the mapping layer

Sesame from a software perspective
[Figure: software structure of Sesame; surviving label: (SCPEx)]

Y-chart Modeling Language (YML)
- Flexible and persistent description (XML) of
  - The structure of application and architecture models (connecting library components)
    - SCPEx also supports YML!
  - The mapping of application models onto architecture models (i.e., the mapping layer)
- YML embeds a scripting language within XML
  - Simplifies descriptions of complicated structures
  - Increases expressive power of components
    - E.g., a parameterized complex interconnect component modeling a network of arbitrary size
  - Increases reusability
    - Reuse of components and structures

An illustrative case study: M-JPEG
- Lossy Motion-JPEG encoder
- Accepts both RGB and YUV formats
- Includes dynamic quality control by on-the-fly adaptation of quantization and Huffman tables

The platform architecture
- Bus-based shared-memory multiprocessor architecture

M-JPEG case study (contd)
[Figure labels: mapping, Exploration]

M-JPEG case study (contd)
[Figure labels: Kahn Process Network, Functional behavior, Library approach, Timing behavior]

Screenshot model editor

M-JPEG design space exploration
- Experimented with different
  - HW/SW partitionings
  - Application-architecture mappings
  - Processor speeds
  - Interconnect structures (bus, crossbar and networks)
- This took about 1 person-month (all modeling included)
- Simulation performance: for 128x128 frames, a 270 MHz Sun Ultra 5 Sparcstation simulated 2.3 frames/second (= 0.43 secs/frame)

M-JPEG design space exploration

Mapping problem: implementation gap
[Figure labels: Application behavioral model (what?), Architecture model (how?), Primitive operations (on both sides), Implementation]

Mapping problem
- Application events: Read, Write and Execute
- Typical mismatch between application events and architecture primitives, for example:
  - Architecture primitives operating on different data granularities
  - Architecture primitives more refined than application events
- Trace events from the application layer need to be refined
- How?
  - Refine the application model
  - A transformation mechanism between the application and architecture models

Communication refinement
- Let's take the mismatch of communication primitives as an example
- Assume the following architecture communication primitives
  - Check-Data (CD)
  - Load-Data (Ld)
  - Signal-Room (SR)
  - Check-Room (CR)
  - Store-Data (St)
  - Signal-Data (SD)

Communication refinement (contd)
- Transformation rules for refining application-level communication events [Lieverse]
    R ⇒ CD → Ld → SR   (1)
    W ⇒ CR → St → SD   (2)
    E ⇒ E              (3)
- How to transform traces of application events using (1), (2) and (3)?
[Figure: a process that generates R E W event sequences]

Communication refinement (contd)
[Figure: Processor 1, Processor 2 and Processor 3 connected to a memory (Mem) via a bus]
- Assumption 1: processor 2 has local (block) memory
  - Transforming R E W event sequences from process B:
      R E W  ⇒  CD Ld SR E CR St SD
- Assumption 2: processor 2 has NO local (block) memory
  - Transforming R E W event sequences from process B:
      R E W  ⇒  CD CR Ld E St SR SD
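The two transformations can be mimicked with a small rewriting function. This is a sketch under assumptions, not Sesame's mechanism (which uses dataflow graphs in the mapping layer): the function name refine, the local_memory flag, and the choice to treat a trace as a flat list of event names are all hypothetical. With local memory each event is rewritten on its own using rules (1) to (3); without it, a whole R E W group is rewritten at once so that room is checked before data is loaded.

```python
# Per-event rewrite rules (1)-(3) from the slides.
RULES = {
    "R": ["CD", "Ld", "SR"],
    "W": ["CR", "St", "SD"],
    "E": ["E"],
}

# Order used when the processor has no local (block) memory: the buffer space
# held by the input block is only released (SR) after the result is stored.
NO_LOCAL_MEMORY_REW = ["CD", "CR", "Ld", "E", "St", "SR", "SD"]


def refine(trace, local_memory=True):
    """Rewrite a list of application events into architecture primitives."""
    refined = []
    i = 0
    while i < len(trace):
        if not local_memory and trace[i:i + 3] == ["R", "E", "W"]:
            refined.extend(NO_LOCAL_MEMORY_REW)
            i += 3
        else:
            refined.extend(RULES[trace[i]])
            i += 1
    return refined
```

For example, refine(["R", "E", "W"]) yields the sequence under assumption 1, and refine(["R", "E", "W"], local_memory=False) yields the one under assumption 2.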

IDF-based trace transformation
- Virtual processors in the mapping layer are refined to accomplish trace refinement
- An Integer-controlled DataFlow (IDF) model describes the internal behavior of a virtual processor
  - Application events specify what a virtual processor executes and with whom it communicates
  - The internal IDF model specifies how the computations and communications take place at the architecture layer

A short Dataflow intermezzo
- Synchronous DataFlow (SDF) [Lee, Messerschmitt]
  - Static model of computation allowing compile-time scheduling
  - Basic idea: each actor consumes and produces a fixed number of tokens each time it fires
- Integer-controlled DataFlow (IDF) [Buck]
  - Extends SDF with dynamic integer-controlled switch and select actors to allow data-dependent execution
  - Generalization makes it more powerful (Turing complete) but generally needs dynamic scheduling
  - Hard to analyze statically
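Compile-time scheduling of SDF rests on the balance equations: for every channel, the number of tokens produced per schedule period must equal the number consumed. The sketch below computes the smallest integer firing counts (the repetition vector) for a connected graph. The function name repetition_vector and the edge tuple layout (producer, tokens produced, consumer, tokens consumed) are assumptions for illustration, and the code requires Python 3.9+ for math.lcm.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Solve the SDF balance equations r[src] * produced == r[dst] * consumed.

    actors: list of actor names (the graph is assumed to be connected).
    edges:  list of (src, produced, dst, consumed) tuples with fixed rates.
    Returns the smallest positive integer firing count per actor.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, produced, dst, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no static schedule exists")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}
```

As a usage example, take a producer that emits 3 data units per firing and a consumer that takes 2 units per firing: repetition_vector(["block_pe", "line_pe"], [("block_pe", 3, "line_pe", 2)]) gives 2 firings of the producer for every 3 of the consumer. IDF's switch and select actors break this analysis because their rates depend on run-time control tokens.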

[Figure: three layers. Application model (process network): Process A, Process B, Process C. Mapping layer (dataflow): virtual processors X, Y, Z. Architecture model (discrete event): bus]

IDF-based trace transformation (contd)
- IDF models transform application events into architecture events at run time
- IDF models execute in the same simulation time domain as the architecture model: timed IDF models
- We distinguish three IDF token-channel types:
  - Intra-event dependency channels specify dependencies within the refinement of an application event
  - Inter-event dependency channels specify dependencies between refinements of different application events
  - Token-exchange channels connected to the architecture model (accomplish timed execution)

Communication refinement revisited
[Figure: processes A, B and C; Processor 1, Processor 2 and Processor 3 connected to a memory (Mem) via a bus]
- Assumption: processor 2 has NO local (block) memory
- Transforming R E W event sequences from process B:
    R E W  ⇒  CD CR Ld E St SR SD

Communication refinement revisited (2)
[Figure: IDF graph inside a virtual processor. Labels: event trace of process B (R, E, W), switch, CD, Ld, SR, E, CR, St, SD, virtual processors X, Y and Z, processor 2 and bus in the architecture model]

Communication refinement revisited (3)
[Figure: processes A, B and C mapped via virtual processors X, Y and Z onto Processor 1, Processor 2 and Processor 3, connected to a memory (Mem) via a bus]
- Now assume that
  - processor 2 operates on lines (3 lines = 2 blocks)
  - processor 2 has a single-entry local line buffer
  - processors 1 and 3 still operate at block granularity
- R E W R E W  ⇒  CD CR Ld(line) E(line) St(line) Ld(line) E(line) St(line) Ld(line) E(line) St(line) SR SD

Communication refinement revisited (4)
[Figure: refined IDF graph for the event trace from process B. Labels: switch and select actors, CD, CR, Ld(line), E(line), St(line), SR, SD, token rates on the channels, control streams (0,1,0,1,...), channels from/to virtual processor X, virtual processor Z, processor 2]

A case of computational refinement
- The application models a synthetic 2D-IDCT by computing two consecutive IDCT operations at block level
- High level, so execute(block) = 1D-IDCT on a data block
[Figure: process network annotated with the process bodies]
    while(1) { read(block); execute(block); }
    while(1) { write(block); execute(block); }
    while(1) { read(block); execute(block); write(block); }
    while(1) { read(block); execute(block); write(block); }
    while(1) { read(block); execute(block); write(block); }

Computational refinement (contd)
- Two target architectures are explored:
[Figure: two platforms built from Proc A, Proc B, Proc C and Proc D, one of them with a shared memory (Mem)]
- And two scenarios...
  - Scenario 1: All processing elements (PEs) are modeled at block level
  - Scenario 2: The PE models onto which the IDCT tasks are mapped operate at line level and are pipelined

Computational refinement (contd)
- Trace transformation rules
    R(block) ⇒ R(line) → ... → R(line)   (1)
    W(block) ⇒ W(line) → ... → W(line)   (2)
    E(block) ⇒ E(line) → ... → E(line)   (3)
    E(line)  ⇒ e1 → ... → en             (4)
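Rules (1) to (4) can be read as two successive rewriting passes: first from block granularity to line granularity, then from a line-level execute to the individual pipeline stages. The sketch below illustrates that reading; the function names to_lines and to_stages, the (kind, grain) tuple format, and the parameter values in the usage note are assumptions, since the slides do not state the number of lines per block or pipeline stages.

```python
def to_lines(trace, lines_per_block):
    """Rules (1)-(3): replace every block-level event by one event per line."""
    refined = []
    for kind, grain in trace:
        if grain != "block":
            raise ValueError("expected a block-level event, got %r" % (grain,))
        refined.extend([(kind, "line")] * lines_per_block)
    return refined


def to_stages(trace, n_stages):
    """Rule (4): replace every E(line) by the pipeline stage events e1 .. en."""
    refined = []
    for kind, grain in trace:
        if kind == "E" and grain == "line":
            refined.extend(("e%d" % s, "line") for s in range(1, n_stages + 1))
        else:
            refined.append((kind, grain))
    return refined
```

With a hypothetical block of 8 lines and a 3-stage pipeline, to_stages(to_lines([("R", "block"), ("E", "block"), ("W", "block")], 8), 3) expands three block-level events into 8 line reads, 24 stage events and 8 line writes.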

Computational refinement
[Figure: Process A, Process B and Process C mapped via virtual processors X and Z onto processors connected by a bus]

Computational refinement (contd)

Putting Sesame to use: An example design flow
[Figure labels: Applications (Motion-JPEG encoder, DCT); System-level architecture exploration; Architecture simulation environment; Experimentation; Reconfigurable architecture framework: Compaan/Laura (Leiden University) + Molen (Delft University); Code suitable for FPGA execution]

A real implementation using Compaan/Laura/Molen
- Mapping M-JPEG on the Molen platform architecture
[Figure labels: C++ Compiler, Laura; code fragment, truncated in the source:]
    for k = 1:1:4,
      for j = 1:1:64,
        [Pixel (k,j)] = In(inBlock);
      end
    end
    for k = 1:1:4,
      if k

System-level simulation experiment
- Modeling Molen with DCT mapped onto the CCU
- Validation against real implementation
  - Information from Compaan/Laura/Molen used for calibration of the architecture model
- Apply architecture model refinement
  - Keep the M-JPEG application model untouched
  - DCT component in the architecture model is refined
    - Operates at pixel level
    - Abstract pipeline model, deeply pipelined
  - Other architecture components operate at (pixel-)block level

Sesame's IDF-based model refinement
[Figure: M-JPEG application model (Process A, Process B, Process C), mapping layer (virtual processors X and Z) and Molen architecture model (bus); annotation: map DCT on CCU and refine]

DCT virtual processor
[Figure labels: Event trace, Control trace, Type in, Block in, Block out, scheduler, P1, P2, 2d-dct, 63, To/from architecture model]

Simulation results
- Full software implementation
  - Simulation: 85024000 cycles
  - Real Molen: 84581250 cycles
  - Error: 0.5%
- DCT mapped onto CCU
  - Simulation: 40107869 cycles
  - Real Molen: 39369970 cycles
  - Error: 1.9%
- No tuning was done!
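The reported error percentages can be checked directly from the cycle counts. The short calculation below assumes the error is the relative deviation of the simulated cycle count from the measured one; the function name relative_error is a hypothetical helper.

```python
def relative_error(simulated, measured):
    """Relative deviation of the simulated cycle count, in percent."""
    return abs(simulated - measured) / measured * 100.0


full_sw_error = relative_error(85024000, 84581250)   # full software implementation
dct_ccu_error = relative_error(40107869, 39369970)   # DCT mapped onto the CCU

print("full SW:    %.1f%%" % full_sw_error)   # prints "full SW:    0.5%"
print("DCT on CCU: %.1f%%" % dct_ccu_error)   # prints "DCT on CCU: 1.9%"
```

Both values agree with the numbers on the slide after rounding to one decimal.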

Where are we going?
- Some ongoing and future work

NoC modeling
- So far, we mainly modeled bus-based systems
- Networks-on-Chip (NoC) will be our (near) future
  - Standardized interfaces
  - Scalable (point-to-point) networks
  - Much more complex protocols (protocol stack?)
  - QoS aspects
- Modeling NoCs
  - Topologies, switching & routing methods, flow control, protocols, QoS, etc.
  - Communication mapping
  - Modeling at multiple abstraction levels
    - Gradual refinement
    - Role of IDF models

Communication mapping
- With more complex Networks-on-Chip, routing information is needed

Architecture model calibration
- Initial derivation of latency parameters:
  - documentation
  - educated guess
  - performance budgeting (what is the required parameter range?)
- Next step: calibration with lower-level, external simulation models or prototypes, e.g.
  - Instruction set simulators (ISSs)
  - Compaan/Laura framework

Calibration using an ISS
[Figure: the code of process C, with its read, computation and write fragments annotated through API calls such as API_read(C,..) and API_write(C,..), runs on an ISS (e.g. SimpleScalar)]
- ISS measures cycle times of annotated code fragments

Mixed-level system simulation
- Zoom in on interesting system components in the architecture model
  - Simulate these components at a lower level
  - Retain high abstraction level for other components
    - Saves modeling effort
    - May save simulation overhead
- Integration of external simulation models
  - ISSs, SystemC models, etc.
  - Also allows calibration of higher-level models
- BUT: mixed-level simulation can be complex!
  - multiple time domains and time grain sizes (synchronization)
  - differences in protocol and data granularity of components

Mixed-level system simulation (contd)
[Figure labels: IDF-based refinement, Embedding external models]

Does mixed-level need to be hard? NO!
[Figure: process C runs on an ISS (e.g. SimpleScalar) behind an API; buffers connect it to virtual processors, which receive Read, E(N cycles) and Write events]
- Trace calibration!

Towards real design space exploration
- Sesame supplies basic methods & tools for evaluating application, architecture, and mapping combinations
- Simulating the entire design space is not an option
- More is needed to explore large design spaces
  - What will be the initial design(s) to evaluate?
  - How to react when the evaluated architecture does not suffice?
- We need steering before and during simulation
  - Design decisions using analytical modeling
  - Finding Pareto-optimal candidates using multi-objective optimization
  - Design evaluation using simulation
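Pareto-optimality is easy to state in code: a design point is kept unless some other point is at least as good in every objective and strictly better in one. The sketch below is a brute-force filter, not the optimization method used with Sesame; the names dominates and pareto_front, the two objectives (both minimized), and all design points and numbers are hypothetical.

```python
def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere.

    All objectives are assumed to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(designs):
    """Return the design points that no other point dominates."""
    return {
        name: objectives
        for name, objectives in designs.items()
        if not any(dominates(other, objectives) for other in designs.values())
    }


# Hypothetical design points: (execution time, cost), both to be minimized.
designs = {
    "all_sw": (85, 1),
    "dct_hw": (40, 3),
    "slow_hw": (60, 4),   # dominated by dct_hw: slower and more expensive
    "all_hw": (20, 9),
}
front = pareto_front(designs)
```

This exhaustive filter needs every point to be evaluated first, which is exactly what becomes infeasible for large design spaces and motivates heuristic search.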

Real design space exploration (contd)
- Heuristic methods like evolutionary algorithms

Credits
This work would not have been possible without the (ground-laying work of the) following people:
- Paul Lieverse
- Bart Kienhuis
- Ed Deprettere
- Pieter van der Wolf
- Kees Vissers
- Vladimir Zivkovic
- Todor Stefanov
- Cagkan Erbas
- Simon Polstra
- Berry van Halderen
- Joseph Coffland
- Frank Terpstra
- Mark Thompson

For more information
- URL: www.science.uva.nl/~andy/publications.html
- or email: [email protected]
- Software can be found at: sesamesim.sourceforge.net
