Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems

Jongsok Choi*, Kevin Nam*, Andrew Canis*, Jason Anderson*, Stephen Brown*, and Tomasz Czajkowski†

*ECE Department, University of Toronto, Toronto, ON, Canada
†Altera Toronto Technology Centre, Toronto, ON, Canada

Abstract—We describe new multi-ported cache designs suitable for use in FPGA-based processor/parallel-accelerator systems, and evaluate their impact on application performance and area. The baseline system comprises a MIPS soft processor and custom hardware accelerators with a shared memory architecture: on-FPGA L1 cache backed by off-chip DDR2 SDRAM. Within this general system model, we evaluate traditional cache design parameters (cache size, line size, associativity). In the parallel accelerator context, we examine the impact of the cache design and its interface. Specifically, we look at how the number of cache ports affects performance when multiple hardware accelerators operate (and access memory) in parallel, and evaluate two different hardware implementations of multi-ported caches using: 1) multi-pumping, and 2) a recently-published approach based on the concept of a live-value table. Results show that application performance depends strongly on the cache interface and architecture: for a system with 6 accelerators, depending on the cache design, speed-up swings from 0.73× to 6.14×, on average, relative to a baseline sequential system (with a single accelerator and a direct-mapped, 2KB cache with 32B lines). Considering both performance and area, the best architecture is found to be a 4-port multi-pump direct-mapped cache with a 16KB cache size and a 128B line size.

I. INTRODUCTION

Field-programmable gate arrays (FPGAs) have recently been garnering attention for their successful use in computing applications, where they implement custom hardware accelerators specially tailored to a particular application. Indeed, recent work has shown that implementing computations using FPGA hardware has the potential to bring orders of magnitude improvement in energy-efficiency and throughput (e.g. [12]) vs. realizing computations in software running on a conventional processor. In general, such FPGA-based computing systems include a processor, which performs a portion of the work in software, as well as one or more FPGA-based accelerators, which perform the compute- and/or energy-intensive portions of the work. The processor may be a high-performance Intel or AMD x86 processor running on a connected host PC, or alternately, in the case of embedded systems, a soft processor is often used, implemented on the same FPGA as the accelerators. This paper deals with a key aspect of such embedded processor/parallel-accelerator systems – the memory architecture design.

While a variety of different memory architectures are possible in processor/accelerator systems, a commonly-used approach is one where data shared between the processor and accelerators resides in a shared memory hierarchy comprised of a cache and main memory. The advantage of such a model is its simplicity, as cache coherency mechanisms are not required. The disadvantage is the potential for contention when multiple accelerators and/or the processor access memory concurrently. Despite this potential limitation, we use the shared memory model as the basis of our initial investigation, with our results being applicable (in future) to multi-cache scenarios. In our work, data shared among the processor and parallel accelerators is accessed through a shared L1 cache, implemented using on-FPGA memory, and backed by off-chip DDR2 SDRAM. Non-shared local data within a single accelerator is implemented within the accelerator itself using on-FPGA RAMs. We explore the following question: given multiple parallel FPGA-based accelerators that access a shared L1 cache memory, how should the cache and its interface be architected to maximize application performance?

We consider the impact of three traditional cache parameters on performance: 1) cache size; 2) line size; and 3) associativity. While such parameters have been investigated in standard processors, an FPGA-based cache implementation presents unique challenges. For example, set associativity is typically realized using multiple memory banks with multiplexers to select (based on cache tags) which bank contains the correct data. Multiplexers are costly to implement in FPGAs, potentially making the area/performance trade-offs of associativity different in FPGAs vs. custom chips. Beyond traditional cache parameters, we also consider the hardware design of the FPGA-based cache and its interface.
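
To make the interplay of these parameters concrete, the following minimal sketch (ours, not taken from the paper's Verilog designs) shows how a 32-bit byte address would split into offset, index, and tag fields for one of the direct-mapped configurations evaluated later (a 2KB cache with 32B lines). A 2-way set-associative cache of the same capacity would use one fewer index bit, compare two tags in parallel, and multiplex between two data banks, which is where the extra FPGA multiplexer cost comes from.

```c
/* Illustrative sketch: offset/index/tag split for a direct-mapped cache
 * with 32-bit byte addresses.  CACHE_BYTES and LINE_BYTES are assumed
 * powers of two, matching the configurations explored later (Table I). */
#include <stdint.h>
#include <stdio.h>

#define CACHE_BYTES (2 * 1024)   /* e.g. 2KB cache  */
#define LINE_BYTES  32           /* e.g. 32B lines  */
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)

/* Number of bits needed to encode a power-of-two value. */
static unsigned log2u(unsigned x) { unsigned b = 0; while (x >>= 1) b++; return b; }

int main(void) {
    uint32_t addr = 0x1234ABCDu;                 /* arbitrary example address */
    unsigned offset_bits = log2u(LINE_BYTES);    /* 5 bits for 32B lines      */
    unsigned index_bits  = log2u(NUM_LINES);     /* 6 bits for 64 lines       */

    uint32_t offset = addr & (LINE_BYTES - 1);
    uint32_t index  = (addr >> offset_bits) & (NUM_LINES - 1);
    uint32_t tag    = addr >> (offset_bits + index_bits);

    printf("offset=%u index=%u tag=0x%X\n",
           (unsigned)offset, (unsigned)index, (unsigned)tag);
    return 0;
}
```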

Dual-ported memory blocks in commercial FPGAs allow a natural implementation of an L1 cache when a single accelerator is present: the processor is given exclusive access to one port, and the accelerator given exclusive access to the second port. With more than two parallel accelerators, the potential for memory contention exists. We explore the extent to which such contention can be mitigated via the cache interface. The accelerators can share a memory port, with arbitration between concurrent accesses happening outside the memory – in our case, using the Altera Avalon Interface [3]. Alternatively, a multi-ported memory can be used. We consider both approaches in this work, and evaluate two multi-ported cache implementations: 1) multi-pumping, where the underlying cache memory operates at a multiple of the system frequency, allowing multiple memory reads/writes to happen in a single system cycle; and 2) an alternative multi-porting approach recently proposed in [18], comprising the use of multiple RAM banks and a small memory, called the live-value table, that tracks which RAM bank holds the most-recently-written value for a memory address. The two porting schemes offer different performance/area trade-offs.

This paper makes the following contributions:

• Parameterized multi-ported cache memory designs for FPGAs based on two different approaches to multi-porting. Verilog parameters allow easy modification of traditional cache attributes such as cache size, line size and associativity. The designs are open source and freely downloadable (www.legup.org).

• The multi-ported caches do not require memory partitioning and allow single-cycle concurrent accesses to all regions of the cache. To the authors' knowledge, this is the first of its kind to be implemented on an FPGA.

• An analysis of the impact of traditional cache parameters on application performance in FPGA-based processor/parallel-accelerator systems.

• An analysis of performance and area trade-offs in multi-ported cache memory design for processor/accelerator systems. To the authors' knowledge, this is the first study of its kind.

• A demonstration of the effectiveness of deploying multiple accelerators, operating in parallel, to improve application performance vs. the single-accelerator case.

We conduct our research using the LegUp open source high-level synthesis (HLS) tool [8] from the University of Toronto. We target the Altera DE4 board, containing a Stratix IV 40nm FPGA [13], and show that application performance depends strongly on the cache design and its interface.

The remainder of this paper is organized as follows: Section II presents background and related work. Section III provides an overview of the targeted system. The multi-ported cache designs are described in Section IV. Section V outlines the set of memory architectures explored in this study – 160 in total. An experimental study is described in Section VI. Section VII concludes and offers suggestions for future work.

II. BACKGROUND

A. Memory Architecture in FPGA-Based Computing

A recent work by Cong et al. proposes memory partitioning as a way of dealing with memory contention imposed by multiple parallel accelerators [11]. The idea is to partition the global memory space based on the regions of memory accessed by each accelerator. Each partition is then implemented as a separate memory bank, with the objective being to minimize the number of accelerators that need concurrent access to a particular bank. The partitioning approach yields good results when accelerators operate on separate memory regions – i.e. for applications with memory parallelism. Ang et al. likewise describe a multi-ported cache, where independent sub-caches correspond to separate partitions of the global address space [6]. The techniques described in our paper, however, are independent of the accelerator memory access patterns and thus are orthogonal to the memory partitioning approach. Combining our techniques with memory partitioning is an area for future research. We believe the cache designs evaluated in this paper are also compatible with the recently-proposed CoRAM and LEAP memory architectures [9], [1], both of which provide a memory abstraction for use in FPGA-based computing, and a generic, well-defined API/model for managing data transfers between off-FPGA and on-FPGA memory.

Another interesting recent work is the CHiMPS HLS project, which uses multiple caches [19]. Each cache corresponds to a particular region of global memory, based on an analysis of a program's access patterns. In CHiMPS, the regions of memory that are "covered" by different caches may overlap, and in such cases, cache coherency is maintained by flushing. One can consider CHiMPS' memory handling to be that of memory partitioning, permitting duplication. Aside from cache size, [19] does not consider other cache parameters (e.g. line size) as in the current work.

Altera's C2H tool targets a hybrid processor/accelerator system with a cache; however, only the processor has access to the cache (the accelerator can only access off-chip memory), and as such, the cache must be flushed before the accelerator is activated if the two are to share memory [2].

The idea of implementing a parameterizable cache on an FPGA has been proposed in several prior works (e.g. [20], [22]). However, the prior efforts only evaluated single-ported caches in isolation; that is, outside of the parallel-accelerator scenario and without a comprehensive multi-benchmark study. Other works (e.g. [15], [10]) have considered implementing configurable caches in the ASIC domain, which presents different trade-offs than FPGAs.

B. LegUp High-Level Synthesis

In this study, the LegUp high-level synthesis tool is used to automatically synthesize C benchmark programs to hybrid processor/parallel-accelerator systems [8]. With LegUp, one can specify that certain C functions (and their descendants) in a program be synthesized to hardware accelerators, with the balance of the program running in software on a MIPS soft processor [21]. The original versions of the HW-designated functions are replaced by wrapper functions in the software; the wrappers invoke and send/receive data to/from the corresponding accelerators via the Altera Avalon interconnect.
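
As a rough illustration of this wrapper mechanism (a hypothetical sketch, not LegUp's actual generated code), a replaced function might look like the following. The ACCEL_BASE address and register offsets are invented for the example; in the real flow the addresses are assigned by SOPC Builder, and in sequential mode the processor is stalled by the interconnect rather than polling (polling is shown only to keep the sketch self-contained).

```c
/* Hypothetical wrapper illustrating the LegUp flow described above.
 * ACCEL_BASE and the register offsets are made-up example values. */
#include <stdint.h>

#define ACCEL_BASE   0x40000000u
#define ACCEL_ARG0   (*(volatile uint32_t *)(ACCEL_BASE + 0x0))
#define ACCEL_START  (*(volatile uint32_t *)(ACCEL_BASE + 0x4))
#define ACCEL_DONE   (*(volatile uint32_t *)(ACCEL_BASE + 0x8))
#define ACCEL_RESULT (*(volatile uint32_t *)(ACCEL_BASE + 0xC))

/* Wrapper that replaces the original HW-designated C function. */
int accumulate(int n) {
    ACCEL_ARG0  = (uint32_t)n;   /* send the argument over Avalon   */
    ACCEL_START = 1;             /* invoke the accelerator          */
    while (!ACCEL_DONE)          /* wait for the "done" status      */
        ;
    return (int)ACCEL_RESULT;    /* retrieve the returned data      */
}
```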

III. SYSTEM ARCHITECTURE OVERVIEW

Our default system architecture is shown in Fig. 1. It is composed of a MIPS soft processor with one or more hardware accelerators, supported by memory components, including the on-chip dual-port data cache and the off-chip DDR2 SDRAM. Note also that accelerators may have local memory for data not shared with other modules in the system. An instruction cache is instantiated within the MIPS processor as it is only accessed by the processor. The components are connected to each other over the Avalon Interface in a point-to-point manner. This interconnection network is generated by Altera's SOPC Builder tool [4]. The point-to-point connections allow multiple independent transfers to occur simultaneously, which is not feasible with a bus topology. Communication between two components occurs via memory-mapped addresses. For example, the MIPS processor communicates with an accelerator by writing to the address associated with the accelerator. When multiple components are connected to a single component, such as the on-chip data cache, a round-robin arbiter is automatically created by SOPC Builder.

Figure 1. Default system architecture.

The solid arrows in Fig. 1 represent the communication links between the processor and accelerators. These links are used by the processor to send arguments to accelerators, invoke accelerators, query an accelerator's "done" status, and retrieve returned data, if applicable. Two modes of execution exist in our system: sequential and parallel. In sequential mode, either the processor or a single accelerator executes at a given time, but not both. Thus, once the processor starts an accelerator, it is stalled until the accelerator finishes. In parallel mode, the processor and multiple accelerators can execute concurrently. The processor first invokes a set of parallel accelerators, then continues to execute without being stalled. The processor may perform other computations, or may check if the accelerators are done by polling addresses assigned to the accelerators.
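
A hypothetical sketch of the parallel mode just described follows; the base address, register layout, and per-accelerator stride are invented for illustration (the real addresses come from SOPC Builder), but the control flow – invoke all accelerators, keep executing, then poll the done flags – follows the description above.

```c
/* Hypothetical sketch of parallel-mode execution; addresses and register
 * offsets are invented for illustration. */
#include <stdint.h>

#define NUM_ACCEL       6
#define ACCEL_BASE(i)   (0x41000000u + (uint32_t)(i) * 0x100u)
#define ACCEL_START(i)  (*(volatile uint32_t *)(ACCEL_BASE(i) + 0x4))
#define ACCEL_DONE(i)   (*(volatile uint32_t *)(ACCEL_BASE(i) + 0x8))

void run_parallel(void) {
    int i, done;

    for (i = 0; i < NUM_ACCEL; i++)
        ACCEL_START(i) = 1;            /* invoke all accelerators; no stall */

    /* ... the processor is free to perform other computations here ... */

    do {                               /* poll each accelerator's done flag */
        done = 1;
        for (i = 0; i < NUM_ACCEL; i++)
            if (!ACCEL_DONE(i))
                done = 0;
    } while (!done);
}
```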

The dotted arrows represent communication links between the processor/accelerators and the shared memory hierarchy. The data cache comprises on-chip dual-port block RAMs and memory controllers. On a read, if it is a cache hit, the data is returned to the requester in a single cycle. On a miss, the memory controller bursts to fetch a cache line from off-chip DDR2 SDRAM. In our system, this takes 24 cycles for a single burst of 256 bits when there is no contention from other accesses. Depending on the cache line size, the number of bursts varies. On a burst, after an initial delay of 23 cycles, each additional cycle returns 256 bits of data, continuing until a cache line is filled. As with many L1 caches, we employ a write-through cache owing to its simplicity (write-through caches do not require bookkeeping to track which cache lines are dirty).
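
From that description, the line-fill latency (with no contention) can be approximated as 23 cycles of initial delay plus one cycle per 256 bits of line data; this is our own back-of-the-envelope reading of the numbers above, not a timing model from the paper.

```c
/* Rough line-fill estimate implied by the burst behaviour described above:
 * 23 cycles of initial delay, then 256 bits returned per additional cycle. */
static unsigned fill_cycles(unsigned line_bytes) {
    unsigned beats = (line_bytes * 8 + 255) / 256;   /* 256-bit beats per line */
    return 23 + beats;
}
/* fill_cycles(32)  == 24  (matches the single-burst figure in the text)
 * fill_cycles(128) == 27,  fill_cycles(256) == 31                        */
```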

Note that our approach is not that of a single monolithic memory hierarchy. Each accelerator has its own local memory for data that is not shared with the processor or other accelerators. This allows single-cycle memory access for all local memories.

IV. MULTI-PORTED CACHE DESIGNS

Current commercial FPGAs contain block RAMs having up to two ports, leading to our default architecture where the processor accesses one port of the RAM and an accelerator accesses the other port. A separate memory controller controls each port. One port is always reserved for the processor, as the processor requires different control signals from the accelerators. Hence, with more than one accelerator, multiple accelerators share the second port as in Fig. 1. This architecture is suitable in two scenarios: 1) sequential execution, where only the processor or a single accelerator is executing at a given time; 2) parallel execution either with a small number of accelerators, or for compute-intensive applications, where accelerators do not access memory often.

For memory-intensive applications, with many accelerators operating in parallel, a dual-ported cache architecture may result in poor performance, as accelerators contend for one port of the cache, leading to the accelerators being stalled most of the time. For increased memory bandwidth, we investigate two types of multi-ported caches, both of which allow multiple concurrent accesses to all regions of the cache in every cycle.

A. Live-Value Table Approach

The first multi-ported cache is based on the work by LaForest et al. [18]. The original work in [18] replicates memory blocks for each read and write port, while keeping read and write as separate ports, and uses a live-value table (LVT) to indicate which of the replicated memories holds the most recent value for a given memory address. Each write port has its own memory bank containing R memories, where R is the number of read ports. On a write, the writing port writes to all memories in its bank, and also writes its port number to the corresponding memory address in the LVT, indicating that it is the most-recent writer to the address. Read ports connect to one memory block in each write-port bank. On a read, the reading port reads from all of its connected memories, and looks up the memory address in the LVT, which returns the previously-written port number. This is used to select the most-recently-written value from one of the connected memories. The LVT is implemented with registers, as multiple ports can read and write from different memory locations at the same time. The original work was intended for register files in processors with separate read and write ports, and uses a simple dual-port memory, where one port is reserved for writes and the other port is reserved for reads. With this architecture, the total memory consumed is N× the original memory size, where N is equal to the number of write ports × the number of read ports.
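
The following behavioural C sketch (ours, for intuition only; the actual designs are in Verilog) mimics the original scheme: W write ports each own a bank of R memory copies, every write updates all copies in the writer's bank and records the writer in the LVT, and every read uses the LVT entry to pick the bank holding the live value. W, R, and DEPTH are placeholder parameters.

```c
/* Behavioural C model of the LVT approach described above:
 * W write ports, R read ports, so W*R replicated memories plus an LVT. */
#include <stdint.h>

#define W      2        /* write ports           */
#define R      2        /* read ports            */
#define DEPTH  1024     /* words per memory copy */

static uint32_t mem[W][R][DEPTH];  /* bank w (one per write port), copy r */
static uint8_t  lvt[DEPTH];        /* last writer for each address        */

/* Write port w: update every copy in its own bank and record itself in the LVT. */
void lvt_write(unsigned w, unsigned addr, uint32_t data) {
    for (unsigned r = 0; r < R; r++)
        mem[w][r][addr] = data;
    lvt[addr] = (uint8_t)w;
}

/* Read port r: the LVT says which write port's bank holds the live value. */
uint32_t lvt_read(unsigned r, unsigned addr) {
    return mem[lvt[addr]][r][addr];
}
```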

In the case of caches, the number of read and write ports is equal, and with n read/write ports, the cache size would grow by n². However, in our system, the read and write ports do not need to be separate. A read and a write port can be combined into a single read/write port, since an accelerator can only read or write at a given time, but not both. One true dual-ported memory can therefore be used for two read/write ports, instead of using 2 simple dual-ported memories with 2 read ports and 2 write ports. This reduces the total memory consumption to less than half of that in [18]. A 4-ported cache using the new architecture is shown in Fig. 2, where each M represents a true dual-ported memory, and MC represents a memory controller. For clarity, only the output (read) lines are shown from each memory block. The input (write) lines are connected from each port to the same memory blocks as shown by the arrows (without the multiplexer). A memory controller is connected to each port of the memory, which is subsequently connected to either the processor or an accelerator.

Figure 2. LVT-based 4-ported cache.

In our variant of the LVT memory approach, it is required that any two ports have one memory block in common. For example, in Fig. 2, port 1 shares memory blocks M1, M3 and M4 with ports 2, 3, and 4, respectively. This allows data written by port 1 to be read by all other ports. On a write, a port writes to all of the memory blocks that it is connected to, which is n−1 blocks. As in the original LVT memory implementation, the port also writes to the LVT to indicate it has updated a memory location most recently. On a read, a port reads from all connected RAM blocks and selects the data according to the port number read from the LVT. Compared to the previous work, which caused memory to grow by n²× with n ports, our LVT variant scales as:

New cache size = [n × (n − 1) / 2] × original cache size    (1)

The 4-port cache in Fig. 2 replicates memory size by 6×, whereas the approach in [18] replicates memory size by 16×. The output multiplexer, which selects between the memory blocks, is also reduced from an n-to-1 multiplexer to an (n−1)-to-1 multiplexer.

This new multi-ported cache based on the LVT approach is referred to as the LVT cache in this paper. It can be compared to multi-cache architectures, where multiple caches are distributed, with each cache connected to a processor or an accelerator, and a cache coherency scheme is implemented to keep the memory synchronized. The main advantage of the LVT cache architecture is that it offers a shared memory space, which acts as one combined piece of memory; thus, no cache coherency scheme is needed, avoiding area and latency costs for synchronization. This multi-ported cache allows all ports to access coherent data concurrently every cycle.

Note that since the cache line sizes can be large, the LVT cannot keep track of every byte in the cache. Hence, the LVT works at cache-line granularity, keeping track of which cache line was most recently updated. This can cause cache line write conflicts when two ports attempt to write to the same cache line in the same cycle, even if they are writing to different bytes in the line. To resolve this, an arbiter is implemented, which serializes the writes to the same cache line from multiple ports. This applies only to cache writes to the same cache line; any reads from the same line or writes to independent cache lines can occur concurrently.

Figure 3. 4-ported cache with double-pumping.

B. Multi-Pumping

Modern FPGAs, such as the Altera Stratix IV, have memory blocks which can operate at over 500 MHz [5] – a speed which is often much faster than the speed of the overall system. We use this advantage to create another type of multi-ported cache, using dual-ported memories clocked at twice the system speed – a well-known technique called memory multi-pumping. In one clock cycle of the system clock, the double-pumped dual-ported memories can perform 4 memory accesses, which from the system's perspective is equivalent to having a 4-ported memory. This architecture is shown in Fig. 3. Four concurrent accesses are broken down into two sets of two accesses. The first set of accesses is presented to the RAM immediately, while the second set is stored in registers. In the second half of a system cycle, the results of the first set of accesses are stored in registers, while the second set of accesses is presented to the RAM. At the end of a complete system cycle, both sets of accesses are complete. In essence, by running the RAM at 2× the system clock frequency, multi-pumping mimics the presence of a RAM with 2× as many ports as are actually present in the hardware.
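
The sequencing can be pictured with the behavioural sketch below (our C model for intuition only; the real design is a dual-port RAM clocked at 2× with registers steering requests and results): four requests per system cycle are serviced as two back-to-back pairs on a two-port memory.

```c
/* Behavioural sketch of double-pumping: one "system cycle" services four
 * requests as two back-to-back pairs on a two-port memory model. */
#include <stdint.h>
#include <stdbool.h>

#define DEPTH 1024

typedef struct { bool write; unsigned addr; uint32_t wdata; uint32_t rdata; } req_t;

static uint32_t ram[DEPTH];   /* stands in for the dual-port block RAM */

/* Service one port's request (one RAM port, one fast-clock edge). */
static void port_access(req_t *r) {
    if (r->write) ram[r->addr] = r->wdata;
    else          r->rdata     = ram[r->addr];
}

/* One slow (system) clock cycle as seen by the four requesters. */
void system_cycle(req_t req[4]) {
    /* First fast cycle: requests 0 and 1 go straight to the two RAM ports;
     * requests 2 and 3 are held in registers (implicit here). */
    port_access(&req[0]);
    port_access(&req[1]);
    /* Second fast cycle: the held requests use the same two ports, while the
     * first pair's read data sits in registers until the system clock edge. */
    port_access(&req[2]);
    port_access(&req[3]);
}
```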

The advantage of this approach is that it does not increase memory consumption. It uses the same amount of memory as a single dual-ported cache, with extra control logic and registers to steer the data in and out of the RAMs. The limitation, however, is that the system needs to run slowly enough that the memory can run at multiples of the system clock. This constraint usually makes it difficult to run the memories at more than twice the system clock. In addition to its reduced memory consumption, the multi-pumping cache bears other advantages over the LVT approach. Since memory blocks are not replicated, no extra logic is required to keep track of which memory block holds the most recent value, thus eliminating the LVT. This in turn allows us to make use of byte-enable signals to write only to the bytes of interest, instead of writing the entire cache line. Thus, this approach does not have the cache line write conflicts described previously. Even if two ports are writing to the same cache line, no arbitration is needed unless they are writing to the same bytes.

The multi-pumping cache uses less logic and memory resources than the LVT cache, which often leads to higher Fmax (shown in Section VI), with the caveat that multi-pumping cannot be scaled to more than 4 ports in our system. The multi-ported cache based on the multi-pumping approach is referred to as the MP cache for the rest of this paper.

Figure 4. Architectures evaluated in this work: a) Sequential Dual-port; b) Parallel Dual-port; c) Parallel 4-port using LVT/MP; d) Parallel 7-port using LVT.

V. SYSTEM ARCHITECTURES EXPLORED

Using the LVT and MP caches, we investigated 5 different systems, shown in Fig. 4, where P denotes the processor and A denotes an accelerator. Fig. 4(a) shows our default architecture, where a single accelerator executes sequentially to perform all of the work in the program while the processor is stalled. Fig. 4(b) illustrates the same cache architecture as Fig. 4(a), but with multiple accelerators which execute in parallel. With one of the ports designated for the processor, all of the accelerators arbitrate for the second port (via Avalon). Fig. 4(c) shows a 4-ported cache architecture, using either an LVT or an MP cache, with two accelerators sharing one port of the cache – Avalon arbitrates between the two accelerators that share a port. Fig. 4(d) shows an architecture where the processor and each accelerator have their own port to the cache. This has the highest possible memory bandwidth, since no contention exists at the cache and all of the cores can access memory at the same time. Contention still exists for access to off-chip memory. Only the LVT cache can be used for the architecture in Fig. 4(d), as the MP memory with its surrounding control logic cannot run at more than 2× the system clock. For the parallel cases, 6 accelerators were chosen to evaluate performance variations with different cache architectures, ranging from high memory contention to no memory contention in accessing the cache.

For each of the 5 different systems, we explore a variety of cache configurations, shown in Table I. Four different cache sizes are investigated, and for each cache size, 4 different line sizes are studied. For each cache and line size, we lastly investigate direct-mapped (one-way) caches vs. 2-way set-associative caches. Hence, we evaluate a total of 5 × 4 × 4 × 2 = 160 different memory architectures across 9 parallel benchmarks (cf. Section VI).

Table I. CACHE CONFIGURATIONS EVALUATED.

Cache Size     Cache line size / # of cache lines
2KB            32B/64     64B/32     128B/16     256B/8
4KB            32B/128    64B/64     128B/32     256B/16
8KB            32B/256    64B/128    128B/64     256B/32
16KB           32B/512    64B/256    128B/128    256B/64
Associativity  Direct-mapped     2-way set-associative

VI. EXPERIMENTAL STUDY

A. Benchmarks

The 9 parallel benchmarks used in this study are all memory-intensive data-parallel programs, meaning that no two accelerators write to the same memory address, but they may read from the same address. Each benchmark includes predefined inputs and golden outputs, with the computed result checked against each golden output at the end of the program.

• The array addition benchmark sums the total of 6 arrays, each with 10,000 integer elements.

• The box filter is a commonly-used kernel in image processing. It has the effect of smoothing an image, by using a 3×3 filter to eliminate any out-of-place pixel values.

• The dot product benchmark does pairwise multiplication of each element in two 6,000-element integer arrays and calculates the total sum.

• The GSM×6 benchmark is adopted from the CHStone benchmark suite [16] and performs 6 linear predictive coding analyses of a global system for mobile communications.

• The histogram benchmark takes an input of 36,000 integers between 1 and 100, and accumulates them into 5 equally-sized bins.

• The line of sight benchmark uses Bresenham's line algorithm [7] to determine whether each pixel in a 2-dimensional grid is visible from the source.

• The matrix multiply benchmark multiplies a 100×6 matrix with a 6×100 matrix.

• The matrix transpose benchmark transposes a 1024×6 matrix.

• The perfect hash benchmark hashes 24,000 integers between 0 and 500,000,000 and creates a perfect hash table.

Note that for all of the benchmarks, the input data is sized to be equally divisible by 6, as 6 parallel accelerators are used, with each accelerator performing an equal amount of work. This can easily be changed to handle different numbers of accelerators.
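
As a hypothetical sketch of that data-parallel structure (not taken from the benchmark source code), an array-addition-style workload could be divided across the 6 accelerators as follows, with each accelerator writing only to its own output slot so that no two accelerators write the same address:

```c
/* Hypothetical work division across 6 accelerators; illustrative only. */
#define NUM_ACCEL 6
#define N         10000              /* elements per accelerator's slice */

int input[NUM_ACCEL * N];            /* shared data, reached through the cache */
int partial_sum[NUM_ACCEL];          /* each accelerator writes only its slot  */

/* Function designated for hardware; six instances run in parallel, each on a
 * disjoint slice (how the instances are created is not shown here). */
void sum_slice(int id) {
    int s = 0;
    for (int i = id * N; i < (id + 1) * N; i++)
        s += input[i];
    partial_sum[id] = s;
}
```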

Each benchmark was first simulated on each type of system architecture for each cache-size/line-size/associativity combination using a ModelSim functional simulation. The total number of execution cycles was extracted from the ModelSim simulation, using an accurate simulation model for the off-chip DDR2 SDRAM on the Altera DE4 board. Following cycle-accurate simulation, each benchmark was synthesized to the Altera Stratix IV with Quartus II (ver. 10.1SP1) to obtain area and critical-path delay (Fmax) numbers. For Fmax, we use slow 900mV 85 deg. C timing models. Execution time for each benchmark is computed as the product of execution cycles and post-routed clock period. Exploring all benchmarks and architectures required a total of 1,440 ModelSim simulation and Quartus synthesis runs, which were performed using SciNet, a top-500 supercomputer [14].

Table II. BASELINE SYSTEM RESULTS.

Time        Cycles     Fmax       Area          Memory
2,028.3 µs  259,216    127.8 MHz  13,460 ALMs   301,693 bits

B. Results

Figs. 5, 6 and 7 show average execution time (speed-up), cycle count (speed-up) and Fmax, respectively, relative to a baseline system which employs a single accelerator running sequentially using a 1-way (direct-mapped) 2KB cache with a 32-byte line size. The geometric mean results for the baseline system (across all benchmarks) are shown in Table II.

Beginning with Fig. 5, observe that larger cache sizes generally provide higher speed-ups, as expected. We see a clear trend in the figure: two clusters of speed-up curves, one cluster clearly above the other. The cluster of six curves with the higher speed-ups corresponds to the three multi-port cache designs considered (each direct-mapped or 2-way set-associative). The best speed-ups of 6.14× and 6.0× occur for the parallel 7-port LVT 1-way and the parallel 4-port MP 1-way, respectively. From the bottom cluster of curves, it is evident that, even with 6 accelerators, using a dual-port cache and relying on the Avalon interconnect for arbitration provides poor speed-up results (even slower than sequential execution in some cases!).

Observe in Fig. 5 that performance varies widely as line sizes are increased. In general, performance increases up to a 64-byte line size for cache sizes of 2 and 4 KB, and up to a 128-byte line size for cache sizes of 8 and 16 KB. For a given cache size, as line size increases, the total number of lines is reduced. With a smaller number of cache lines, a greater number of memory addresses map to the same cache line, making the line more likely to be evicted for a new cache line. Miss rate actually goes up if the line size is too large relative to the cache size [17]. With multiple accelerators accessing memory, this can cause cache thrashing to occur, where the same set of cache lines is evicted and retrieved continuously, causing excessive off-chip memory accesses. Larger line sizes also lead to longer fetch cycles. Indeed, smaller line sizes perform better for smaller caches, as shown in Fig. 6. This effect is mitigated with 2-way set-associative caches, and also as cache sizes themselves become bigger. In Fig. 6, 4-port MP overlaps with 4-port LVT for both 1-way and 2-way, with 4-MP showing slightly better results because it avoids the cache line write conflicts discussed previously. Sequential systems show less sensitivity to cache size and line size, as only one accelerator is accessing memory sequentially.

Fig. 7 shows the impact of multiplexers on Fmax. The most drastic decrease in Fmax occurs for 2-way systems, as they require the most multiplexing logic. A 2-way set-associative cache instantiates two different sets of memories, with multiplexers on the inputs and outputs to select between the two sets. In the case of multi-ported memories, extra multiplexers and/or replicated memory blocks exist within each set. As expected, the 7-port LVT cache shows the greatest decrease in Fmax, as it uses the highest number of replicated memories with the largest output MUX. This is followed by the 4-port LVT and MP caches. On average, parallel dual-port systems show lower Fmax than sequential and 4-ported systems, since 6 accelerators connect to 1 port of the cache, creating a larger MUX (6-to-1) than in the other multi-ported systems. Accelerators generated by LegUp HLS for some benchmarks, such as matrix multiply, exhibited long combinational paths to the cache, which in combination with the 6-to-1 MUX cause a significant impact on Fmax. In general, Fmax also decreases with increasing line sizes, as larger MUXes are required to select specific data from longer lines.

Turning to area (# of ALMs) in Fig. 8, all of the 2-way systems consume more area than their respective 1-way systems, as expected (the area results represent the complete system, including the processor and accelerators). Among all systems, the 7-ported architecture has the largest size, with the most MUXes, followed by the 4-ported LVT architecture. The latter has more area than the 4-ported MP architecture due to its LVT, as well as the output MUXes needed to select between its replicated memory blocks. All of the parallel systems consume more area than the sequential systems due to their having multiple accelerators. Area also increases with increasing line size due to bigger MUXes.

In terms of memory consumption, shown in Fig. 9, the 7-port LVT 2-way consumes the most memory bits, owing to its replicated memories. Note that the graph illustrates the total memory consumption of the system, which includes the data cache, instruction cache, and local accelerator memories, as well as any memory consumed by other components such as the DDR2 controller and dividers. The reason that memory consumption decreases slightly with increasing line sizes is the tags and the valid bit, which are stored as part of each line in addition to the data. Hence, for a given cache size, the total number of lines is reduced with bigger line sizes, reducing the tag/valid-bit overhead.

Note that the 7-port LVT 2-way, 7-port LVT 1-way, and 4-port LVT 2-way systems have missing data points in the figures, as they cannot fit within the Stratix IV on the DE4 board for larger line sizes. This is due to the structure of the Stratix IV M9K memory blocks, which can each store 9,216 bits. Each M9K block in true dual-port mode can operate with 18-bit-wide words. A total of 1,280 M9K blocks are available on the Stratix IV [5]. Thus, a 256-byte line size with tags/valid bit consumes 115 M9K blocks. For a 7-ported LVT 1-way cache, where the memory blocks are replicated by 21× (see Eqn. 1), this totals 2,415 M9K blocks, exceeding the number available. Despite the fact that a 7-port LVT 1-way cache using a 256B line size only consumes a small fraction of the total available memory bits on the device, it cannot fit due to the architecture of the dual-ported M9K blocks. Similar conditions arise for the 7-port LVT 2-way and 4-port LVT 2-way architectures with large line sizes.
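
The 115-block figure can be sanity-checked with some back-of-the-envelope arithmetic (ours, assuming 32-bit byte addresses and the 18-bit-wide true-dual-port M9K configuration cited above; the exact tag width in the real design may differ slightly):

```c
/* Rough check of the M9K count quoted above, for a direct-mapped 16KB cache
 * with 256B lines.  Assumes 32-bit byte addresses; exact tag width may vary. */
#include <stdio.h>

int main(void) {
    int line_bytes   = 256;
    int offset_bits  = 8;                              /* log2(256)              */
    int index_bits   = 6;                              /* 16KB / 256B = 64 lines */
    int tag_bits     = 32 - offset_bits - index_bits;  /* 18 tag bits            */
    int line_width   = line_bytes * 8 + tag_bits + 1;  /* data + tag + valid     */
    int m9k_per_copy = (line_width + 17) / 18;         /* 18-bit-wide M9K slices */
    int copies       = 7 * (7 - 1) / 2;                /* Eqn. 1 with n = 7: 21  */
    printf("%d M9K blocks per copy, %d in total\n",
           m9k_per_copy, m9k_per_copy * copies);       /* prints 115 and 2415    */
    return 0;
}
```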


Figure 5. Execution time speed-up over the baseline (geometric mean), plotted against cache line size (32B–256B) and total cache size (2KB–16KB), with one curve per system: Sequential 1-way, Sequential 2-way, Parallel Dual-port 1-way, Parallel Dual-port 2-way, Parallel 4-port MP 1-way, Parallel 4-port MP 2-way, Parallel 4-port LVT 1-way, Parallel 4-port LVT 2-way, Parallel 7-port LVT 1-way, Parallel 7-port LVT 2-way.

Figure 6. Execution cycle speed-up over the baseline (geometric mean), same axes and systems as Fig. 5.

Figure 7. Fmax relative to the baseline (geometric mean).

Figure 8. Area in Stratix IV ALMs, relative to the baseline (geometric mean).

Figure 9. Memory consumption relative to the baseline (geometric mean).

Table III shows the performance of each benchmark in 4 cases: 1) the worst result for the benchmark, listed in the top line; 2) the best result for the benchmark, shown in the second line; 3) the result for the benchmark using the cache architecture which yields the best performance in geomean results over all benchmarks (the 7-port LVT 1-way cache with a 16KB cache and 64-byte line size), shown in the third line; and lastly, 4) the result with the cache architecture which yields the second-best performance in geomean results: the 4-port MP 1-way cache with a 16KB cache and 128-byte line size. The data in Table III only includes parallel 6-accelerator architectures; i.e. the gains shown are isolated to be due to memory changes and are not due to parallelization. The 'Cache Arch.' column is organized as type of portedness/cache size/line size/associativity. For example, 2-Port/2KB/256B/1 indicates a Parallel Dual-port 1-way cache architecture with a 2KB cache and a 256-byte line size. The 'Time' column shows the execution time in µs, with the number in brackets indicating the speed-up compared to the worst case.

The dot product benchmark shows the biggest improvement, with a 29.4× speed-up between its best and worst cases. This is due to the cache line thrashing effect described previously. The GSM×6 benchmark shows the least sensitivity to the cache design, with the gap between the best and worst being 2.6×. Further analysis showed the GSM benchmark to be the least memory-intensive among all benchmarks. It is interesting to note, however, that the biggest cache size does not always yield the best results, as shown by the box filter, dot product, GSM×6, and histogram benchmarks. Moreover, observe that no single cache architecture is best for all benchmarks.

Table III. INDIVIDUAL BENCHMARK RESULTS.

Benchmark      Cache Arch.          Time (speed-up)   Cycles       Fmax
Add            2-Port/2KB/256B/1    10,596 (1)        1,438,565    135.8
               4-MP/16KB/256B/1     443 (23.9)        61,229       138.1
               7-LVT/16KB/64B/1     524 (20.2)        66,868       127.6
               4-MP/16KB/128B/1     507 (20.9)        67,682       133.58
Box Filter     2-Port/2KB/256B/1    11,635 (1)        1,082,204    93.0
               4-MP/4KB/64B/2       947 (12.3)        114,785      121.2
               7-LVT/16KB/64B/1     1,066 (10.9)      118,446      111.1
               4-MP/16KB/128B/1     1,181 (9.86)      128,775      109.08
Dot Product    2-Port/2KB/256B/1    2,901 (1)         390,864      134.7
               7-LVT/8KB/128B/1     99 (29.4)         11,167       113.2
               7-LVT/16KB/64B/1     100 (29.1)        12,837       128.6
               4-MP/16KB/128B/1     104 (27.8)        13,851       132.73
GSM×6          2-Port/4KB/32B/2     246 (1)           21,903       89.0
               4-MP/4KB/32B/1       94 (2.6)          10,493       111.5
               7-LVT/16KB/64B/1     108 (2.3)         9,626        89.1
               4-MP/16KB/128B/1     105 (2.4)         10,119       96.61
Histogram      2-Port/4KB/32B/2     1,465 (1)         187,882      128.3
               4-LVT/2KB/128B/1     476 (3.1)         64,545       135.6
               7-LVT/16KB/64B/1     560 (2.6)         63,459       113.3
               4-MP/16KB/128B/1     539 (2.7)         64,545       119.76
Line of Sight  2-Port/2KB/256B/1    6,139 (1)         597,016      97.3
               7-LVT/16KB/64B/1     376 (16.3)        43,658       116.0
               7-LVT/16KB/64B/1     376 (16.3)        43,658       116.0
               4-MP/16KB/128B/1     394 (15.6)        44,934       114.2
Matrix Mult    2-Port/2KB/256B/1    836 (1)           68,803       82.3
               7-LVT/16KB/128B/1    55 (15.2)         5,521        100.4
               7-LVT/16KB/64B/1     58 (14.3)         5,983        102.7
               4-MP/16KB/128B/1     62 (13.4)         6,161        98.6
Matrix Trans   2-Port/4KB/256B/2    1,790 (1)         217,357      121.4
               4-LVT/16KB/128B/1    178 (10.1)        22,623       127.4
               7-LVT/16KB/64B/1     228 (7.8)         25,505       111.7
               4-MP/16KB/128B/1     188 (9.5)         22,623       120.5
Perfect Hash   2-Port/2KB/256B/1    19,732 (1)        2,740,215    138.9
               4-MP/16KB/64B/2      2,451 (8.1)       332,281      135.6
               7-LVT/16KB/64B/1     2,799 (7.1)       357,306      127.7
               4-MP/16KB/128B/1     3,621 (5.5)       464,224      128.2

In summary, the 7-LVT/16KB/64B/1 architecture shows comparable results to each best case on a per-benchmark basis. However, with much less area and memory consumption, the 4-MP/16KB/128B/1 architecture also exhibits competitive results relative to the 7-LVT/16KB/64B/1 architecture, sometimes performing better depending on the benchmark. Recall that multi-pumping does not require the cache to be replicated and does not use the LVT, which requires bigger MUXes, making the total area approximately half, with a cache size 21× smaller than the 7-LVT (see Figs. 8 and 9). Thus, if the designer were forced into choosing a memory architecture without knowledge of an application's specific access patterns, the 4-MP/16KB/128B/1 architecture appears to be a good choice.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we explored a wide range of cache architectures in the context of FPGA-based processor/parallel-accelerator platforms. We introduced two types of shared multi-ported caches which do not require memory partitioning and allow single-cycle concurrent accesses to all regions of the cache. The multi-pumping cache uses less logic and does not increase memory consumption, which makes it suitable for memory/area-constrained designs. With enough resources, the LVT cache can allow a larger number of ports for maximum memory bandwidth. While our results with traditional cache parameters conform to other research in the computer architecture domain, an FPGA implementation poses challenges which are unique to the platform. Specifically, multiplexers, which are required to implement different cache line sizes, associativity, and portedness, are expensive in FPGAs and significantly impact performance results. On the other hand, high-speed on-chip block RAMs, which can often run much faster than their surrounding system, offer a differentiating advantage compared to other computing platforms – they make multi-pumping a viable approach in FPGAs. We conclude that the best architecture from a performance angle, considering all aspects of total execution time, area, and memory consumption, is the 4-port multi-pump direct-mapped cache with a 16KB cache size and a 128-byte line size. We believe this architecture to be the best choice for systems which run under 200 MHz.

Future work includes combining the memory architectures analyzed here with work on memory partitioning in HLS (e.g. as described in [11]). We are also interested in analyzing these architectures from an energy-efficiency perspective.

REFERENCES

[1] M. Adler, K. Fleming, A. Parashar, M. Pellauer, and J. Emer. LEAP scratchpads: automatic memory and cache management for reconfigurable logic. In ACM Int'l Symp. on FPGAs, pages 25–28, 2011.
[2] Altera, Corp., San Jose, CA. Nios II C2H Compiler User Guide, 2009.
[3] Altera, Corp., San Jose, CA. Avalon Interface Specification, 2010.
[4] Altera, Corp., San Jose, CA. SOPC Builder User Guide, 2010.
[5] Altera, Corp., San Jose, CA. TriMatrix Embedded Memory Blocks in Stratix IV Devices, 2011.
[6] S.-S. Ang, G. Constantinides, P. Cheung, and W. Luk. A flexible multi-port caching scheme for reconfigurable platforms. In Applied Reconfigurable Computing, pages 205–216, 2006.
[7] J. Bresenham. Algorithm for computer control of a digital plotter. IBM Systems Journal, 4, 1965.
[8] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Czajkowski. LegUp: High-level synthesis for FPGA-based processor/accelerator systems. In ACM Int'l Symp. on FPGAs, pages 33–36, 2011.
[9] E. Chung, J. Hoe, and K. Mai. CoRAM: an in-fabric memory architecture for FPGA-based computing. In ACM Int'l Symp. on FPGAs, pages 97–106, 2011.
[10] J. Cong, K. Gururaj, H. Huang, C. Liu, G. Reinman, and Y. Zou. An energy-efficient adaptive hybrid cache. In IEEE Int'l Symp. on Low-Power Electronics and Design, pages 67–72, 2011.
[11] J. Cong, W. Jiang, B. Liu, and Y. Zou. Automatic memory partitioning and scheduling for throughput and power optimization. ACM Trans. Des. Autom. Electron. Syst., 16, April 2011.
[12] J. Cong and Y. Zou. FPGA-based hardware acceleration of lithographic aerial image simulation. ACM Trans. Reconfigurable Technol. Syst., 2(3):1–29, 2009.
[13] Altera Corp., San Jose, CA. DE4 Development and Education Board, 2012.
[14] C. Loken et al. SciNet: Lessons learned from building a power-efficient top-20 system and data centre. Journal of Physics: Conference Series, 256, 2010.
[15] A. Gordon-Ross, F. Vahid, and N. D. Dutt. Fast configurable-cache tuning with a unified second-level cache. IEEE Trans. on VLSI, pages 80–91, 2009.
[16] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis. Journal of Information Processing, 17:242–254, 2009.
[17] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2006.
[18] C. E. LaForest and J. G. Steffan. Efficient multi-ported memories for FPGAs. In ACM Int'l Symp. on FPGAs, pages 41–50, 2010.
[19] A. Putnam, S. Eggers, D. Bennett, E. Dellinger, J. Mason, H. Styles, P. Sundararajan, and R. Wittig. Performance and power of cache-based reconfigurable computing. In Int'l Symp. on Computer Architecture, pages 395–405, 2009.
[20] A. Santana Gil, J. Benavides Benitez, M. Hernandez Calvino, and E. Herruzo Gomez. Reconfigurable cache implemented on an FPGA. In IEEE Int'l Conf. on Reconfigurable Computing, pages 250–255, 2010.
[21] University of Cambridge. The Tiger "MIPS" processor (http://www.cl.cam.ac.uk/teaching/0910/ECAD+Arch/mips.html), 2010.
[22] P. Yiannacouras and J. Rose. A parameterized automatic cache generator for FPGAs. In IEEE Int'l Conf. on Field-Programmable Technology, pages 324–327, 2003.
