
Eighth IEEE International Symposium on Multimedia (ISM'06), San Diego, CA, USA, December 11-13, 2006

Abstract

Within SoCs for embedded media applications, performance and computational efficiency can only grow through scalability, dedicated instructions, and specialized memory sub-systems. Scalability can only be achieved through short wires and low fan-in and fan-out. The presented multi-processor template makes use of these properties. It provides multiple threads of control (cell-level multiprocessing), each processor cell having a multitude of issue slots, localised register files, memories, and interconnects. The template allows the compiler to control all of these resources separately, eliminating hardware overhead for instruction decoding, pipeline control, hazard detection, and bypass networks. The template also integrates media-oriented and SIMD instructions, which the compiler can automatically select. This paper describes the template underlying several multimedia multi-processor designs.

1. Introduction

Today's embedded media-oriented Systems-on-Chip (SoCs) show an increasing gap between high performance and low power budget [6]. Another such conflicting set of trends is increased complexity and integration versus decreasing time-to-market. Design teams have to decide on vast ranges of integrated functionality and implement that functionality in a very short time, whilst reducing market risks.

Processing rates for typical media algorithms range from 60 GOPS for H.264 HD decoding at 30 frames/s to 1000 GOPS for HD picture-rate up-conversion at 120 frames/s [4]. The required memory bandwidth for such algorithms exceeds 20 GB/s. In order to stay within the power budget of fixed-function appliances, the power consumption cannot exceed 3 W. In mobile appliances, the power budget is less than 300 mW.

In order to combat such a wide set of conflicting requirements, the industry needs to take some radical steps. Market risks can only be reduced by replacing fixed-function ASICs with programmable devices. Power consumption can be reduced through full exploitation of parallelism and reduction of speculative operations (large caches speculate on data being re-used, deep pipelines speculate on code being sequential, branch prediction keeps pipelines filled by speculating on jump statistics tables, etc.). Programmable parallelism can grow through scalability at multiple levels: multi-processing, instruction-level parallelism, and vector processing. Besides these data-path-oriented features, the template must also support scalable memory subsystems with specific 2D and 3D access patterns. Scalability can only be achieved through locality of reference, short wires, and low fan-in and fan-out. Time-to-market decreases through programmability, IP-based design, and IP design at high abstraction levels.

Until now, these requirements were met through dedicated ASICs, because processor-based designs do not provide enough computational efficiency to handle the domain-specific high data and processing rates within embedded and mobile power budgets. The ARM Cortex-A8 covers about 4 mm² (pre-layout, 65 nm, 600 MHz) and achieves about 2 GOPS within 300 mW [1],[2]. Put together, seven of Tensilica's Diamond 545CK DSPs cover an area of about 20 mm² (pre-layout, 90 nm, 200 MHz), consume about 300 mW, and achieve about 20 GOPS [3]. In [4], Alba Pinto et al. describe a 2-processor SIMD video signal processing tile for H.264 HD processing and associated picture improvement algorithms. This tile is based on the proposed architecture template and can be configured to achieve in excess of 120 GOPS when running at 200 MHz. In that case, die area is about 5.9 mm² (pre-layout, 65 nm), and power consumption is estimated at 150 mW. The proposed template allows this tile to be scaled down or replicated, in order to achieve appropriate performance, area, and power consumption datapoints.

The template is an extension of the one discussed in [5]. The next sections discuss the features of the proposed processor template, the processor design methodology, and the compiler, respectively. The last section gives some conclusions.

Multiprocessing Template for Media Applications
Jeroen Leijten and Menno Lindwer
{Jeroen.Leijten,Menno.Lindwer}@philips.com
Silicon Hive

Proceedings of the Eighth IEEE International Symposium on Multimedia (ISM'06), 0-7695-2746-9/06 $20.00 © 2006

2. Template components

The proposed template is fully hierarchical. A core consists of multiple cells, each of which has its own thread of control. Cells consist of Processing and Storage Elements (PSEs) and interconnect networks. PSEs consist of register files, issue slots, and local or shared memories. Issue slots consist of interconnect networks and function units. Cells are very long instruction word (VLIW) machines, in which all issue slots operate in parallel. The hierarchy is depicted in Figure 1. The template is supported by a library of specialised PSEs (control, DSP, media, OFDM) and function units (load/store, branch, arithmetic, shift, MAC, etc.).

Figure 1 shows that cells can have multiple slave and master interfaces connecting to SoC buses. The CoreIO template, to be discussed below, is designed to offer the flexibility needed to incorporate these cores in a wide range of SoC environments, having configurable sets of memory sub-systems, point-to-point connections, and bus interfaces. It is worth noting that the template supports time-stationary pipeline control [8], [9]. This means that all control points of the template components, including the interconnect networks, are visible to the HiveCC2 compiler [12]. The compiler directly controls the complete datapath of the processor. As opposed to other time-stationary architectures, the HiveCC2 compiler allows every operation to have its own pipeline depth, independent of the pipelining of other operations in the same VLIW instruction. The compiler schedules each stage of each operation separately. This way, the pipelining of individual operations can be precisely balanced against each other and against the required overall clock speed.
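The per-stage view of time-stationary control can be illustrated with a small sketch. This is a hypothetical model for illustration only, not the actual HiveCC2 representation: each operation is expanded into its individual pipeline stages, and every stage is assigned its own cycle, so all control points are fixed at compile time and no run-time interlocks are needed.

```python
# Illustrative sketch of time-stationary scheduling (hypothetical model,
# not the actual HiveCC2 internals): each operation is expanded into its
# pipeline stages, and every stage is bound to an explicit cycle.

def expand_stages(operations):
    """operations: list of (name, issue_cycle, latency).
    Returns {cycle: [(name, stage), ...]} -- the per-cycle control words."""
    schedule = {}
    for name, issue_cycle, latency in operations:
        for stage in range(latency):
            schedule.setdefault(issue_cycle + stage, []).append((name, stage))
    return schedule

# Two operations in the same VLIW instruction may have different depths:
ops = [("mul", 0, 3), ("add", 0, 1), ("load", 1, 2)]
ctrl = expand_stages(ops)
# Cycle 0 controls mul stage 0 and the whole add; cycle 2 controls mul
# stage 2 and load stage 1. Every hazard is resolved at compile time.
```

The operation names and the tuple encoding are invented for this sketch; the point is that a 3-cycle multiply and a 1-cycle add can share one instruction because each stage has its own compile-time slot.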

2.1. Processor core

A processor core consists of multiple cells. Cells can have streaming interfaces, which allow the cells to be interconnected. For scalability reasons, usually a nearest-neighbour interconnect strategy is chosen, leading to a mesh structure (comparable to that of the transputer network [7]). In order to reduce the number of external streaming interfaces, the core can be equipped with so-called stream switches, which act as run-time configurable concentrators (not shown in Figure 1).

For sample-based processing, a systolic array architecture is usually chosen. Systolic arrays are homogeneous sets of template cells, interconnected through nearest-neighbour point-to-point connections. For example, FIR filters are typically spread out over several cells, each cell computing one or more taps. Conversely, frame-based processing usually takes place on single-cell template instances. Single-cell cores do not require stream switches.

Next to streaming interfaces, cells can have both master and slave bus interfaces for several different IP interconnect standards. The streaming, master, and slave interfaces are part of CoreIO, as discussed in section 3.

2.2. Processor cell

Each processor cell has its own thread of control. The control PSE (indicated as 'CPSE' in Figure 1) is standardised and has two functions: stepping through program code, and guaranteeing ANSI C compliance. Because of the time-stationary architecture, the control function of the control PSE is much less complex than in traditional RISC or VLIW processors. Also, because all datapath features are visible to and controlled by the compiler, there is no need for hazard detection, bypass networks, branch prediction, or pipeline control. Associated with the control PSE are the Program Memory (PM) and the Control Register File (CRF). Both are also CoreIO elements. A cell contains at least a slave interface, connected via the Slave Routing Network (SRN) to the PM and CRF. Through this slave interface, an external host can upload code into the program memory and control the cell.

Next to the control PSE, a cell normally contains a set of Data PSEs. Please note that the interconnect network between the register files and function units is sparsely connected with regard to connections between PSEs. Usually, within PSEs, the network is denser.

Figure 1: Hierarchy of the architecture template. A Core contains Cells; each Cell contains a CPSE (with PM and CRF) and Data PSEs built from issue slots (IS), register files (RF), interconnect networks (IN), and function units (FU); CoreIO comprises the local memories (LM), PM, CRF, the MRN with arbitration, the SRN, and external interfaces (i/f).


2.3. Processing and storage element (PSE)

A PSE comprises multiple issue slots (ISs). In order to achieve the objectives of having short wires, low fan-in, and low fan-out, the input data for the ISs is distributed over multiple register files (RFs). Interconnect networks (INs) route the data from the outputs of ISs into the RFs and from the RFs into the inputs of ISs. The INs are designed such that the RFs are connected to FUs on a need-to-have basis. This keeps the number of input/output ports of the distributed RFs at a minimum.

2.4. Interconnect network

As discussed above, in order to minimise wire length and RF ports, the INs are specifically set up to obtain a sparsely interconnected design. Design space exploration usually starts from a minimum configuration and builds additional interconnect as required. If not applied carefully, this strategy may preclude future applications.

The HiveCC2 compiler has full control over the interconnect networks. For every output of a register file, it determines the output register and the function unit input to which the register file output has to be routed. Conversely, for every output of every function unit, the compiler produces control bits that determine to which input of which register file the result has to be sent. The interconnect networks support multicasting and broadcasting of result values to multiple register files.
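The routing control described above can be sketched as follows. This is a hypothetical encoding for illustration only (the names and the one-bit-per-destination scheme are invented, not the actual HiveCC2 control format): for each function unit output, the compiler emits one select bit per physically reachable register file, which naturally supports multicast.

```python
# Sketch of compiler-generated routing control for a sparse interconnect
# network (hypothetical encoding, for illustration only): one select bit
# per register file reachable from a given function unit output.

def route_result(fu_output, destinations, reachable):
    """reachable: register files physically wired to this FU output.
    destinations: register files that must receive the result this cycle.
    Returns per-RF select bits (the 'control word' for this output)."""
    for rf in destinations:
        if rf not in reachable:
            # Sparse network: the compiler may only use existing wires.
            raise ValueError(f"no wire from {fu_output} to {rf}")
    return {rf: (rf in destinations) for rf in reachable}

# The ALU output is wired to three of the register files; this cycle its
# result is multicast to two of them:
bits = route_result("alu.out", {"rf0", "rf2"}, ["rf0", "rf1", "rf2"])
# -> {'rf0': True, 'rf1': False, 'rf2': True}
```

Requesting a destination with no physical wire raises an error, mirroring how the sparse design constrains (and is checked by) the scheduler.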

2.5. Issue slots and function units

Issue slots basically consist of multiple function units (FUs). Issue slots are also logical elements, in the sense that the program word contains exactly one operation per IS. This means that in every cycle, one operation per issue slot can be fired. The maximum instruction-level parallelism of a cell therefore depends on the number of issue slots.

The INs within ISs are often more or less fully connected, as they are present either for distribution or for concentration.

FUs determine the semantics of the core. FUs generally do not contain state. This is not a template requirement; however, the compiler cannot perform alias analysis on values stored within function units. FUs can have latencies of multiple cycles. FUs generally can perform several operations (a load/store unit can load and store values of different sizes, either with or without auto-increment, etc.). With every FU instantiation, the core designer determines which of the FU-supported operations are actually implemented. This saves valuable program memory bits, for example when instantiating an ALU which only needs to do additions.
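The program-memory saving from subsetting an FU's operations can be estimated with a back-of-the-envelope sketch (the operation counts below are invented examples): the opcode field for an FU only needs enough bits to distinguish the operations actually instantiated.

```python
import math

# Back-of-the-envelope sketch of the program-memory saving described
# above (illustrative example counts): the opcode field for a function
# unit needs ceil(log2(number of implemented operations)) bits.

def opcode_bits(num_ops):
    """Minimum opcode field width for num_ops operations (at least 1)."""
    return max(1, math.ceil(math.log2(num_ops)))

full_alu = opcode_bits(32)  # an ALU instantiated with all 32 operations
add_only = opcode_bits(2)   # instantiated with just add and subtract
# 5 bits versus 1 bit, saved in every program word addressing this slot.
```

Since the saving recurs in every instruction of every program stored in the PM, trimming unused operations per instantiation compounds quickly.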

Load/store units (LSUs) are special kinds of FUs. Besides being connected to RFs within the core, they are also connected to logical memories. As far as an LSU is concerned, it does not make a difference whether the logical memory connects it to a streaming interface, a bus interface, or a physical local memory.

FUs have parameterised latency. If a certain FU appears to be a bottleneck, its latency can be increased. The compiler will take this into account when scheduling the availability of the FU's results.

2.6. Logical memory

Logical memory and CoreIO (section 3) are orthogonal concepts. PSEs are vertical slices (columns) through the processor cell, consisting of INs, RFs, FUs, LSUs, and logical memories. Each LSU within a PSE is connected to at least one logical memory. However, all logical memories are also part of the horizontal CoreIO component.

Logical memories can be of several types: FIFOs, external (bus) interfaces, local memories, program memory, and control register file. These physical devices have parameters such as width and depth.

The widths of memories can be freely selected. An LSU may read words of different widths out of the logical memories. The LSU may also split the wide memory words and send sub-words through the INs into multiple registers or even multiple register files. Alternatively, the memory words comprise complete SIMD vectors, which the LSU may send to vector registers.
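The sub-word splitting performed by an LSU can be sketched as below. The widths are invented example values, not template parameters from the paper: a wide logical-memory word is either forwarded whole as a SIMD vector or split into scalar sub-words routed to different registers.

```python
# Sketch of LSU sub-word splitting (hypothetical widths, illustration
# only): a wide memory word is split into equal lanes that can be routed
# through the INs to different registers or register files.

def split_word(word, word_bits, sub_bits):
    """Split a word_bits-wide memory word into word_bits // sub_bits
    lanes, least-significant lane first."""
    mask = (1 << sub_bits) - 1
    return [(word >> (i * sub_bits)) & mask
            for i in range(word_bits // sub_bits)]

# A 32-bit memory word split into four 8-bit sub-words:
lanes = split_word(0xDDCCBBAA, 32, 8)
# -> [0xAA, 0xBB, 0xCC, 0xDD]
```

Forwarding the unsplit `word` to a vector register corresponds to the SIMD-vector case mentioned above.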

2.7. Instruction set and code compaction

As part of the design methodology discussed in section 4, the instruction format for each cell is determined automatically from the architecture of that cell. The width of the cell's instructions is not artificially fixed at a particular number (e.g. 32 or 16 bits, as for most RISCs). Rather, the width is exactly determined by the needs of the VLIW architecture. When generating an instruction set architecture, typical compaction mechanisms are applied, such as sharing immediate fields and register address fields.
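How an instruction width follows from the architecture can be sketched with invented field sizes (the numbers below are examples, not figures from the paper): one operation field per issue slot, plus immediate and register-address fields that compaction lets slots share rather than replicate.

```python
# Sketch of deriving a cell's instruction width from its architecture
# (hypothetical field sizes, for illustration): per-slot operation
# fields plus fields shared between issue slots by compaction.

def instruction_width(slot_opcode_bits, shared_fields):
    """slot_opcode_bits: operation-field width per issue slot.
    shared_fields: name -> width of fields shared across slots."""
    return sum(slot_opcode_bits) + sum(shared_fields.values())

width = instruction_width(
    slot_opcode_bits=[5, 5, 3, 4],                   # four issue slots
    shared_fields={"immediate": 16, "reg_addr": 12}  # shared, not per-slot
)
# -> 45 bits: exactly what this VLIW needs, not rounded up to 32 or 64.
```

Without sharing, the immediate and register-address fields would be replicated per slot, which is precisely the overhead the compaction mechanisms avoid.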

Even so, because of the VLIW nature of the cells, the instructions are generally wider than those of RISC machines. From an overall code-size point of view, this is less of an issue than with most VLIW architectures. Because of the application-specific nature of the processors, and because the processors generally are co-developed with the software, mapping efficiency is taken into account for both hardware and software. This means that, eventually, the software will make very efficient use of the available instruction bits.

For those cases where multi-type use of processors is anticipated (e.g. mixed control and data processing), a compaction scheme with multiple decode units is applied. The number of decode units is configurable.

3. The CoreIO input/output concept

Every cell has a CoreIO component. It connects the cell's LSUs to a configurable number of logical memories. Caches may be inserted between any LSU and the devices to which it connects. Within the Silicon Hive cores, caches are used very sparingly, because caches reduce the predictability of software schedules.

CoreIO contains an IN for interconnect between LSUs and external master interfaces/memories. This IN is referred to as the Master Routing Network (MRN). CoreIO also contains an IN for interconnect between external slave interfaces and memories, referred to as the Slave Routing Network (SRN). CoreIO also takes care of arbitration between accesses from multiple LSUs in the cell itself and external accesses.

CoreIO offers a high level of flexibility, needed to easily integrate cores in SoCs. CoreIO is completely scalable in terms of the number of external bus interfaces (both master and slave) and streaming interfaces. The streaming interfaces can be parameterised in terms of width and FIFO depth. CoreIO's external interfaces have parameters such as protocol (proprietary CoreIO protocol, AHB, etc.) and width.

Please note that the template allows one cell to act as a DMA device for another cell. At core level, interfaces can be connected back-to-back, in order to allow one cell to write directly into the memory of another. Within one core, streaming FIFO interfaces can also be connected back-to-back. This allows cells to both exchange data and control each other's communication behaviour.

This feature is very important in the media domain for implementing cells that can act as intelligent and fast DMA units for other cells. The intelligence of DMA cells is used to drastically reduce the bandwidth requirements to external memory [4].

4. Design methodology

The design methodology for generating template-based cores consists of two main iterative loops. The initial loop iterates over the right-hand side of the methodology, as depicted in Figure 2. It establishes the main features of the processor instance (processor design entry), compiles the application programs (HiveCC), and simulates them at a high level and at very high speed. The methodology provides for several system simulation abstraction levels. Function-level simulations run at GOPS speeds, instruction-level simulations run at MOPS speeds, and cycle-level simulations run at 10-100 KOPS. This loop iterates until the cycle-level system simulations produce instruction traces which indicate that the code schedules within the required cycle budget. Each iteration in this loop typically takes in the order of one hour.

The second loop iterates over the left-hand side of the methodology. It is concerned with architecture issues, such as function unit latency, in order to balance cycle budgets within the design and remove timing bottlenecks. The processor configurator takes in the order of seconds to produce a new instance of the HDL code. Subsequently, standard EDA tools are used to derive metrics such as area, clock speed, and power consumption. Iteration times in this loop depend mainly on the available EDA tools; they are generally in the order of hours to days.

5. Software development with HiveCC2

The software development tool set is integrated into an Integrated Development Environment (IDE), shown in Figure 3. The IDE is based on the Eclipse framework [11]. The HiveCC2 compiler is at the heart of the IDE. The compiler is automatically targeted to any processor that results from the design methodology outlined in section 4. The compiler translates C programs into assembly and microcode for the target processor(s). Because of the highly parallel nature of the resulting processors, HiveCC is referred to as a spatial compiler. It handles code generation, spatial scheduling, and spatial resource allocation.

Figure 2: H-chart processor design methodology. ANSI C application code and pre-configured IP blocks feed processor design entry; the Processor Builder produces a processor instance, which HiveCC compiles to assembly, assembled and simulated to yield binary instruction traces; the configurator produces HDL, which standard EDA tools turn into a netlist that is simulated against a testbench.

As outlined in section 4, different levels of simulation/verification are available. At each abstraction level, the assembly code can be co-simulated with the rest of the application (running on a host processor, rather than on an accelerator core).

5.1. Code profiling and other output

The compiler gives graphical feedback on the quality of the code. This output takes the form of HTML tables, which can be viewed with any web browser. The HTML files show the machine utilisation at every program-counter position. In the example, the distribution of the execution cycles over the microcode for the 8K FFT is used to determine whether the code is scheduled optimally.

The HTML table in Figure 4 shows the utilisation of the resources (columns) in a 10-issue-slot processor per program-counter position (lines):
• Green blocks indicate the use of a given issue slot (first 10 lines) at a given program-counter position;
• Red blocks indicate the use of an interconnect line;
• The utilisation of each register (not shown) is also given for every program-counter position, by a colour code indicating whether a given register contains a live value, is written to, read from, or both written and read.

5.2. Scheduler

HiveCC2 uses deterministic constraint-solving techniques that deal with instruction-level parallelism (dozens of issue slots), distributed register files (over a hundred), and partial, constrained interconnect.

Given sufficient compilation time (a few minutes), the generated code is guaranteed to be optimal.

The compiler schedules all operations in time (i.e. the temporal assignment task that is classical in compiler technology) and in space, aiming to maximise locality of reference.

When required, HiveCC2 offers the programmer full control of resource usage. The programmer can control the allocation of data structures to memories, of operations to issue slots, or the operations themselves. It also offers the use of application-specific operations present in the architecture, in the form of intrinsic functions at source-code level.

For different levels of optimisation, two different schedulers can be used. The list scheduler can do software pipelining; it is fast but not optimal. For optimal scheduling, the full constraint-solving scheduler has to be used.
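The fast-but-not-optimal behaviour of list scheduling can be illustrated with a minimal sketch. This is a deliberately simplified model (the real HiveCC2 schedulers also handle distributed register files, interconnect constraints, and software pipelining): each operation, taken in priority order, is greedily placed in the earliest cycle that respects its dependences and has a free issue slot.

```python
# Minimal list-scheduling sketch (illustrative only, far simpler than
# the HiveCC2 schedulers): greedily place each operation in the earliest
# cycle with its dependences satisfied and an issue slot free.

def list_schedule(ops, deps, latency, num_slots):
    """ops: operation names in priority order.
    deps: {op: [predecessor ops]}; latency: {op: cycles}.
    Returns {op: issue_cycle}."""
    cycle_of, slots_used = {}, {}
    for op in ops:
        # Earliest cycle at which all predecessors' results are ready:
        earliest = max((cycle_of[p] + latency[p] for p in deps.get(op, [])),
                       default=0)
        c = earliest
        while slots_used.get(c, 0) >= num_slots:  # all issue slots taken
            c += 1
        cycle_of[op] = c
        slots_used[c] = slots_used.get(c, 0) + 1
    return cycle_of

sched = list_schedule(["a", "b", "c"], {"c": ["a", "b"]},
                      {"a": 2, "b": 1, "c": 1}, num_slots=2)
# a and b issue together in cycle 0; c waits for a's 2-cycle latency.
```

A constraint solver, by contrast, considers placements jointly rather than greedily, which is why it can reach the optimum at the cost of longer compilation times.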

Because applications and processors are designed in conjunction, and because processors are automatically generated along with re-targeting of the compiler, implementing proposed design changes almost instantaneously leads to new RTL code and software development tools. Thus, the emphasis can be laid on specifying and analysing optimisation scenarios.

Figure 3: Eclipse-based IDE

Figure 4: Graphical scheduling information


5.3. Single-instruction loops

Figure 5 shows the resource utilisation when scheduling a 2K complex FFT on 36 out of 41 issue slots of Avispa-OFDM. The shading indicates the loop-nesting level. Darker blocks around program-counter positions 16 and 40 indicate two single-instruction inner loops. The application spends 98% of its execution time in those inner loops.

Single-instruction loops occur when the scheduler manages to software-pipeline the complete bodies of inner loops, such that they fit onto a single very wide instruction. The figure also shows the pre- and post-ambles of these loops, where the software pipeline is first loaded, and unloaded after the loop has finished.

Single-instruction loops are very beneficial for power consumption, because throughout such a loop the cell no longer needs to fetch instructions. Also, its control logic does not need to do any switching at all; there is only data flowing through the datapath. The processor turns into a pure dataflow machine.
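Why a software-pipelined inner loop collapses into a single instruction can be sketched as follows (a simplified model with invented stage names, not the actual scheduler output): with an initiation interval of one cycle, every steady-state cycle executes one stage of each of several consecutive loop iterations, so the instruction word is identical from cycle to cycle.

```python
# Sketch of a steady-state software pipeline with initiation interval 1
# (simplified model, for illustration): each kernel cycle runs one stage
# of several consecutive loop iterations in the same wide instruction.

def kernel_contents(cycle, num_stages):
    """Which (iteration, stage) pairs execute in a given steady-state
    cycle of a pipeline that starts a new iteration every cycle."""
    return [(cycle - stage, stage) for stage in range(num_stages)
            if cycle - stage >= 0]

# With a 3-stage loop body (say load, mac, store), cycle 5 runs:
steady = kernel_contents(5, 3)
# -> [(5, 0), (4, 1), (3, 2)]: iteration 5's load, iteration 4's mac,
# iteration 3's store. The set of stages is the same every cycle, so the
# same instruction word repeats and no fetch or decode switching occurs.
```

The pre-amble corresponds to the early cycles where `cycle - stage < 0` filters out not-yet-started iterations, and the post-amble to draining the last iterations after the loop ends.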

6. Conclusions

The tile architecture proposed by Alba Pinto et al. in [4] illustrates almost all features of the template. It substantiates the claim that the template can be used to address highly complex programmable designs, while reducing time-to-market and market risks.

As far as we know, the discussed example illustrates the highest performance achievable in the mobile media application sphere. This ASIC-level power efficiency is a result of the template actually resulting in an ASIC-like design, with a comparable level of parallelism and locality of reference. The core designer determines which compute and memory facilities are laid out. The HiveCC2 compiler schedules operations, in order to optimally utilise these facilities.

Other design examples [10] show that the template and associated methodology achieve the goal of reducing time-to-market: comparable IC development projects require at least two years and a silicon re-spin, as opposed to nine months until tape-out for this project (and no re-spin).

Bibliography

[1] ARM Cortex-A8, http://www.arm.com/products/CPUs/ARM_Cortex-A8.html
[2] OMAP3430 multimedia applications processor, http://focus.ti.com/pdfs/wtbu/ti_omap3430.pdf
[3] Diamond Standard 545CK, http://www.tensilica.com/diamond/di_545ck.htm
[4] C. Alba Pinto, Video Signal Processing Tile Architecture for Video Coding and Post-Processing, to appear in IEEE ISM'06 proceedings
[5] G.F. Burns, M. Jacobs, M. Lindwer, B. Vandewiele, Silicon Hive's Scalable and Modular Architecture Template for High-Performance Multi-Core Systems, Proceedings of GSPx 2005
[6] John L. Hennessy, Directions and challenges in microprocessor architecture, Holst Memorial Lecture 2002, http://www.holstmemorial.nl/hennessy.ppt
[7] INMOS transputer, http://en.wikipedia.org/wiki/Transputer
[8] J.J. Kim, F.J. Kurdahi, N. Park, Automatic Synthesis of Time-Stationary Controllers for Pipelined Data Paths, ICCAD'91
[9] P.M. Kogge, The Architecture of Pipelined Computers, McGraw-Hill, New York, N.Y., 1981
[10] Philips Demonstrates World's First Fully Programmable Digital TV Demodulator IP Core, http://www.siliconhive.com/t.php?asset-name=text&id=79
[11] http://www.eclipse.org
[12] Lex Augusteijn, The HiveCC Compiler for Massively Parallel ULIW Cores, Embedded Processor Forum, San Jose, May 17-20, 2004

Figure 5: Schedule of an 8K FFT on Avispa-OFDM. In the single-instruction loop, 76% (31 out of 41) of the issue slots are active; the application is scheduled so that 98% of the execution time is spent in pure dataflow mode (i.e., single-instruction loops).
