Cost/Performance Analysis of a Multithreaded PIM Architecture

Shyamkumar Thoziyoor, Shannon K. Kuntz, Jay B. Brockman, and Peter M. Kogge
Department of Computer Science and Engineering

University of Notre Dame

Abstract— Many research projects over the past decade have explored embedding processing logic into the memory system of a computer as a means to enhance performance. An obstacle to widespread acceptance of processing-in-memory (PIM) has been the increased cost-per-bit of embedded memory as compared to high-volume commodity memory. Using analytic models, this paper examines the performance and volume production silicon costs for a PIM-enhanced system with multithreading in the memory, versus a baseline system with commodity DRAM. The paper provides insight into the question of which PIM configurations would provide cost-effective performance if they were produced in high volume.

Index Terms— processing-in-memory, PIM, multithreading, DRAM.

I. INTRODUCTION

It is becoming increasingly difficult to economically squeeze additional performance out of a single thread of execution on a conventional PC. Behind this difficulty lies the fact that a random access to main memory, resulting from a cache miss, requires tens to hundreds of processor clock cycles. This “memory wall” associated with transferring data between the processor and DRAM is a well-known limitation to computing performance, and this barrier is getting worse with each new generation of technology. Compounding this limitation is the fact that applications are becoming increasingly data-intensive, placing more and more pressure on the memory system.

One solution that has been suggested by several researchers for breaking through the memory wall is to embed computing within the memory system, which we refer to as processing-in-memory or PIM¹. This approach has been explored in numerous projects over the last decade [1] [2] [3] [4] [5] [6] [7] [8]. While processing within the memory system has shown promise architecturally, a major obstacle to its adoption has been the perceived—and in today’s manufacturing environment, real—cost of embedding processing logic in dense, high-capacity DRAM. In terms of basic circuit technology, there is no barrier: 37 percent of a typical commodity DRAM chip is already devoted to logic functions such as decoding [9]. While the high-threshold transistors used in DRAM do

¹The concept of computation in memory has been explored through many different projects and given a variety of names, including intelligent RAM, merged logic and DRAM, smart memory, active memory, etc. For brevity, we will refer to the general concept as PIM in this paper. We fully acknowledge that in doing so, we gloss over important differences between the various projects, and encourage the reader to read the papers on these projects listed in the references.

have slower switching speeds than ASIC logic transistors, all of the PIM projects cited above seek to compensate for reduced clock rates through a combination of lower memory latency, increased memory bandwidth, and high degrees of concurrency. The issue, rather, is the increase in the per-bit cost of memory with the addition of custom logic. This is especially of concern given the economies of scale and market forces that keep the cost of commodity DRAM very low as compared with ASICs.

Unless PIM parts are produced in sufficiently high volumes, the per-bit cost will be considerably higher than that of commodity DRAM and hence be cost-prohibitive for many systems that require high memory capacity. The main goal of this paper is to determine which PIM chip configurations would provide cost-effective performance if they were produced at high volume, and thus help identify configurations that may be suitable candidates for high-volume production. Our approach is to compare the performance and volume production silicon cost of a system with PIM-enhanced memory against a baseline system with commodity DRAM.

In focusing on high-volume silicon production, our cost model considers key parameters of recurring die cost, including die size, number of processing steps, and yield of repairable and non-repairable areas of the die. The model does not consider non-recurring design and development costs, software costs, or the cost of yield learning in ramping to high volume. The model also does not consider test and packaging costs, both of which are important components of the recurring cost of memory products. Both of these areas are topics of much research and development that go beyond the scope of this paper; projected trends for test and packaging may be found in [9]. Our modeling framework does, however, include I/O count as a parameter, which factors into both test and package cost.

Our analytical performance model calculates the speedup obtainable for applications in a PIM-enhanced system relative to the baseline system. Our performance modeling framework is composed of various architectural parameters that characterize the baseline and PIM-enhanced systems and various application parameters that characterize the workloads. Our performance model does have certain limitations. We do not model dynamic effects such as contention and queueing of memory accesses in the network of the PIM-enhanced system. We also do not model effects such as the impact of multithreading on the cache miss rates of each thread. Such effects that we ignore do limit the use of multithreading in

a system. Nevertheless we ignore these effects and proceed with an optimistic calculation of the speedup by creating an analytical model that captures the critical parameters related to the execution time of a thread. In our analysis, we then examine the sensitivity of speedup with respect to the different parameters by sweeping them over a wide range. While such an exercise clearly has limitations, it does indicate whether multithreaded PIM-enhanced systems have a region of opportunity under optimistic conditions.

Figure 1 illustrates the baseline and the PIM-enhanced systems. The baseline system consists of a commodity microprocessor, optimized for single thread performance, coupled with a DRAM-based memory system. In 2005, a typical memory system for a desktop personal computer would be 1 GB of memory, physically packaged as 16 × 64 MB DRAM chips. In the PIM-enhanced system, PIM chips replace the conventional DRAM chips. The PIM card is assumed to be similar to a commodity memory card, in that it has a memory bus interface to the host processor, but it also contains a simple network for communication between the PIM chips. Each PIM chip contains the same amount of memory as the DRAM chip that it replaces, along with a set of independent lightweight multithreaded processors (LWPs) and a simple on-chip communication network. In the baseline system, a given workload is processed as a single thread running on the host, making memory accesses to the DRAM. In the PIM-enhanced system, the workload is recast as a single heavyweight thread running on the host and a set of concurrent lightweight threads running on the PIM chips.

Fig. 1. Enhanced versus baseline systems. Both systems have heavyweight threads (H) running on the host CPU, while the enhanced system also has lightweight threads (L) running on processors embedded in the memory.

The remainder of this paper is organized as follows. Section II provides an overview of the multithreaded PIM execution model. Sections III and IV describe the frameworks for performance and silicon cost modeling respectively. Section V discusses the results of sweep analyses with the models, showing sensitivities to various model parameters, and Section VI summarizes conclusions.

II. EXECUTION MODEL OVERVIEW

Our multithreaded PIM execution model borrows heavily from early work in hybrid dataflow architectures including P-RISC [10] and Monsoon [11], as well as the Threaded Abstract Machine (TAM) [12]. Lightweight messaging using parcels [2] was also inspired by protocols developed for split-phase memory access in the dataflow work, as well as active messages [13], the MDP [14] and the J-Machine [15]. Multithreaded PIM extends this prior work by adapting lightweight multithreading and communication for use with wide-words from on-chip memory and also integrates short SIMD operations into the architecture.

The following sections describe the key elements of the execution model. We cite particular examples of the implementation of these in the PIM Lite chip [16] [17]. PIM Lite is a 16-bit, multithreaded processing-in-memory architecture and corresponding VLSI implementation, developed for convenient experimentation with many aspects of PIM technology, including execution models, ISA design, microarchitecture, physical layout, system software, and applications programming. Figure 2 shows the layout of the PIM Lite chip, which has been fabricated on a 0.18 µm TSMC process through MOSIS. We also give a brief overview of an implementation of an N-body tree-code [18] that illustrates the application of the execution model from a software perspective.

Fig. 2. 4-node PIM Lite layout (left) and fabricated 1-node chip.

A. Lightweight Multithreading

Multithreading is a technique for explicitly dividing programs into identifiably separate sequences of instructions—called threads—that can execute concurrently. The weight of a thread is a measure of the data that comprises the thread state, together with the time and space overhead in managing threads. At the heavy end of the spectrum would be an OS thread such as a Unix Pthread running on a workstation, which typically accumulates a large data trail including a deep call stack, large cache footprint, and significant branch history data. Further, the operations of starting, stopping, suspending or synchronizing heavyweight OS threads themselves require significant compute cycles and supporting data structures. Because of these factors, heavyweight threads can only be used sparingly within applications to achieve coarse-grain

parallelism. By contrast, a lightweight thread has considerably less state associated with it, perhaps just a register set and a single call frame. The lightweight multithreading model also assumes direct architectural support for thread management, so that the thread operations listed above require only a few instruction cycles to execute. As a result, the program model encourages widespread usage of explicit multithreading to expose large amounts of fine-grain concurrency to the hardware.

B. Frames

In the multithreaded PIM model, nearly all the state information of a thread is taken out of the CPU and kept in memory at all times. In place of named registers in the CPU, thread state is packaged in data frames of memory. The main difference between a frame and a register set is that frames are logically and physically part of the memory system, rather than part of the processor, and that a multithreaded program can have access to many different dynamically-created frames over the course of execution. Logically, a frame is simply a region of contiguous memory locations within a single node. Physically, a frame consists of one or possibly a few rows in a memory block. For a typical DRAM block such as [19], a row of memory fetched in a single internal read cycle is 2048 bits, which is storage equivalent to 32 64-bit registers. In PIM Lite, a frame consists of 32 16-bit words.
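
To make the granularity concrete, the sketch below views a frame as a typed region of memory sized to one DRAM row. This is illustrative only; the frame_t type and its layout are our assumptions, not part of the PIM Lite ISA.

#include <stdint.h>

/* Illustrative only: a frame viewed as a typed region of memory.
 * One 2048-bit DRAM row holds the equivalent of 32 64-bit
 * "registers"; in PIM Lite a frame is instead 32 16-bit words. */
typedef struct frame {
    uint64_t slot[32];   /* 32 x 64 bits = 2048 bits = one DRAM row */
} frame_t;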

C. Distributed Shared Memory with Embedded Anonymous Processing Elements

In this programming model, objects in memory are named, but processors are not. We assume that threads are invoked in the “neighborhood”—on the same PIM chip or even quadrant of a PIM chip—of a data object, rather than on a particular processor, and further assume that there is an anonymous processor nearby that can perform the operations. This ties program performance closely to the spatial distribution of data. If we assume an address mapping scheme where blocks of contiguous virtual addresses are guaranteed to reside on the same PIM chip, then the compiler has an opportunity to keep a thread’s memory references local—and hence lower latency—by building data structures out of these contiguous blocks. Similarly, a program can avoid hot spots and increase concurrency by distributing portions of data structures randomly across the virtual address space. Runtime libraries can support memory allocation to simplify the spatial allocation of memory. PIM Lite has a runtime library call, pim_malloc, which allocates and returns a pointer to a block of PIM memory in the neighborhood of (on the same node as) a given address. Several macros are defined for specifying the address to support data distribution, including PIM_HERE, the current node, and PIM_RANDOM, which randomly chooses an address.
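
As an illustration, the sketch below builds a linked list whose nodes are scattered randomly across the PIM memory. The exact signature of pim_malloc is not given here, so the two-argument form (size, neighborhood address) is an assumption based on the description above.

#include <stddef.h>

/* Assumed prototype, not taken from the PIM Lite documentation:
 *     void *pim_malloc(size_t nbytes, void *near);
 * PIM_RANDOM is the placement macro described in the text. */
typedef struct elt {
    int val;
    struct elt *next;
} elt;

elt *build_scattered_list(int n) {
    elt *head = NULL;
    for (int i = 0; i < n; i++) {
        /* PIM_RANDOM distributes nodes to avoid hot spots and
         * increase concurrency during parallel traversals */
        elt *e = (elt *) pim_malloc(sizeof(elt), PIM_RANDOM);
        e->val = i;
        e->next = head;
        head = e;
    }
    return head;
}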

D. Parcels

A parallel communication element, or parcel, is a type of message for performing operations on remote objects in memory. As with conventional memory operations, parcels are routed to target memory locations, but they greatly expand the set of possible actions beyond simple reads and writes. Parcels may be viewed as an outgrowth of other lightweight messaging schemes for multithreaded systems, such as the MDP [20], the J-Machine [15], TAM [12], and Active Messages [13]. The concept of parcels has formed the basis for several PIM designs including HTMT [21], DIVA [22], and MIND/Gilgamesh [23]. A parcel contains the following information:

• the address of a target memory location,
• the operation to be performed,
• a set of arguments.

In a specific machine architecture that supports a programming model with parcels, the parcels can either be sent explicitly, or as a side-effect of a memory operation that is determined to be remote.
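
For concreteness, one plausible layout is sketched below; the field names and widths are our assumptions (sized to match the 64-byte default parcel of Table VI), not a format defined by any of the cited designs.

#include <stdint.h>

/* Hypothetical parcel layout (64 bytes, matching the default
 * W_parcel in Table VI). Field widths are assumptions. */
typedef struct parcel {
    uint64_t target_addr;   /* virtual address of the target memory location */
    uint32_t opcode;        /* operation to perform at the target */
    uint32_t flags;         /* e.g. explicit send vs. remote-access side effect */
    uint64_t args[6];       /* arguments, e.g. thread state on a migration */
} parcel_t;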

E. Locality Awareness and Travelling Threads

For applications that use large dynamic data structures, it is generally not possible to determine at compile time on which PIM node a given object resides. When a thread needs access to data that is not on the current node, there are two choices: either bring the data to the thread, or move the thread to the data. The latter option is especially attractive when the thread state is small and the likelihood of future references to data on the new node is high [2][24]. In either case, the thread must be able to determine whether a given virtual address is on the current node or not—what we call locality awareness. We assume that the machine architecture can efficiently support this query in hardware, since it need only know which pages are local, not the location of every page.

One way to express a “travelling thread” in a high-level program is to use a moveTo library call. The following example, which traverses a distributed linked list and prints each list element, is drawn from the PIM Lite runtime library. pim_moveTo takes a global pointer as an argument; if the pointer is non-local the thread is moved via a parcel, otherwise no action is taken.

void* printList(void* listArgsVoid) {
    elt *curElt;

    curElt = ((listArgs_t*) listArgsVoid)->head;
    while (curElt != NULL) {
        pim_moveTo(curElt);                        /* migrate the thread if curElt is remote */
        pim_printWhereValue(curElt, curElt->val);  /* print the node id and element value */
        curElt = curElt->next;
    }
    pim_stop();                                    /* terminate the lightweight thread */
}

F. Short-Vector, SIMD Operations

One of the main structural features of on-chip memory is that entire rows of data can be accessed at once. Building on this, a number of past PIM designs, including [25] [26] [22] [27], have used wide-word, SIMD, or short-vector instructions to operate on a row of data. We also consider the use of such instructions as part of the programming model, but later in this paper will also consider their relative value when multithreading is also present.

G. Example—N-Body Tree-code

While PIM can use either conventional shared memory or message passing metaphors to orchestrate parallel programs, it also opens a new possibility: moving the thread to the data, rather than copying data back to an immobile thread. In this section, we demonstrate the use of travelling threads for an N-body tree code.

The goal of the N-body problem is to simulate the motion of bodies in space resulting from interactive forces between the bodies. Since each body exerts a force on every other body, the problem of calculating the net force on each body is O(N²). A number of methods have been devised, however, that reduce the complexity of this in practice. The Barnes-Hut [18] method is based on the heuristic that if a group of bodies is tightly clustered, then it may be approximated as a single body for the purposes of calculating the net force on a single body that is sufficiently far away from the group. Central to the Barnes-Hut method is building a quadtree in 2-space or an octtree in 3-space that recursively subdivides the region into cells, until each cell contains 0 or 1 bodies. To calculate the net force on a given body, the Barnes-Hut algorithm performs a depth-first traversal of the tree. If a given cell is far enough from the body, the entire cell is approximated as a single body at the center-of-mass and the traversal along that branch of the tree stops. Otherwise, the traversal expands to the children of that cell.

In our PIM version of the program, we distributed the tree spatially over a set of PIM nodes, as shown in Figure 3. We generated a thread for each body to traverse the tree in parallel and accumulate the net force acting upon each body. In order to eliminate the bottleneck at the top of the tree, since each thread starts its traversal at the root, we replicated the top few levels of the tree onto each node before beginning the traversal (not shown in the figure). Parcels are only sent over the network when a thread encounters a link to a child off-chip. When this occurs, the entire thread state must be sent, but in this application the thread state is very small, since it consists only of the accumulated force, the mass of the body, its coordinates in 3-space, and a pointer to the body for reporting back the net force when the traversal is complete. Within each thread, there is further opportunity for short-vector parallelism in the calculation of the force as a function of the distance between two points. Overall, the computation is organized as timesteps or phases with serial, multithreaded, and short-vector portions. In our N-body implementation, approximately 20 percent of the executed instructions must be run serially and the remaining 80 percent can be multithreaded. Within each thread, approximately 40 percent can be run as short-vector instructions. This program structure, with varying percentages of each of the workload types, can be applied to a wide variety of applications, including grid and mesh problems, and forms the basis for the application side of our performance model presented later in this paper.

III. PERFORMANCE MODELING FRAMEWORK

In this section, we describe our framework for modeling the performance of the baseline and PIM-enhanced systems. Our

Fig. 3. Distribution of N-body tree over PIM system

modeling framework is composed of parameters that describe the architecture of the system, and those that describe the mapping of an application onto the architecture. By using these architectural and application parameters, we create a model for calculating speedup in a PIM-enhanced system.

A. Architectural Parameters

Table I lists the set of architectural parameters that characterize the baseline and PIM-enhanced systems. The parameters are categorized based on whether they are characteristics of the PIM chip, host processor, plain DRAM or system-level. The meaning of each of these parameters is described next by category.

1) PIM Chip Characteristics: Each PIM chip has a certain amount of embedded DRAM and a certain number (N_lwps) of fine-grained, multithreaded lightweight processors (LWPs). Each LWP has the ability to execute a certain number (N_max-hw-threads) of hardware threads at any time at a maximum clock frequency of f_lwp. On average, in the absence of any idle cycles on the LWP, each compute instruction is assumed to take 1 clock cycle to execute. Each LWP is composed of one 64-bit integer unit and one 64-bit floating point unit. It can also have SIMD capability by incorporating additional integer/floating point units (N_simd-units). Each LWP also contains level-1 SRAM-based instruction and frame caches. We assume that the frame cache has a hit latency of L_lwp-cache and that the width of its cache line is W_lwp-cache-line. On a frame cache miss, data is brought in typically from the on-chip DRAM. We assume that the width of the on-chip data bus is W_channel-on-chip. B_pim represents the maximum rate at which data can be accessed from the on-chip DRAM per bit per unit time. Thread management on an LWP entails various overheads. L_thread-starting represents the latency of starting a thread on the LWP. When a thread migrates from one PIM chip to the other, there is overhead in creating a parcel and packing the state of the thread into the parcel; L_thread-migration represents the latency of a thread migration operation. When a parcel is received by a PIM chip, some time is spent in handling the parcel; L_lwp-parcel-handling represents this parcel handling latency.

TABLE I
ARCHITECTURAL PARAMETERS

PIM chip characteristics:
  N_lwps                  Number of LWPs per PIM chip
  f_lwp                   Clock frequency of an LWP
  N_max-hw-threads        Maximum number of hardware threads that can execute in an LWP at any time
  N_simd-units            Number of SIMD units in an LWP
  L_lwp-cache             Hit latency of LWP level-1 frame cache
  W_lwp-cache-line        Width of LWP level-1 frame cache line
  W_channel-on-chip       Width of on-chip (PIM) bus
  B_pim                   Maximum rate at which data can be accessed from the on-chip DRAM per bit per unit time
  L_thread-starting       Latency of starting a thread on an LWP
  L_thread-migration      Latency of a thread migration operation on an LWP
  L_lwp-parcel-handling   Parcel handling latency on an LWP
  L_pim                   Average latency of an on-chip DRAM access

Host processor characteristics:
  f_host                  Clock frequency of host processor
  CPI_host-non-mem        Minimum number of clock cycles per compute (non-memory) instruction of the host processor
  L_host-cache            Average hit latency of host cache
  W_host-cache-line       Width of host cache line that is transferred between host and memory system
  L_host-parcel-handling  Parcel handling latency on the host

Plain DRAM characteristics:
  L_dram                  Latency of a DRAM access
  B_dram                  Maximum rate of data transfer from DRAM to host per pin per unit time

System-level characteristics:
  N_chips                 Number of PIM (DRAM) chips in the PIM-enhanced (baseline) system
  W_parcel                Size of a parcel
  L_inter-pim             Latency of communication between PIM chips
  L_host-dram             Latency of communication between host and DRAM/PIM chips
  W_channel-host-dram     Width of data channel between host and DRAM/PIM chips
  W_channel-inter-pim     Width of data channel between PIM chips

2) Host Processor Characteristics: We consider the host processor to be a conventional single threaded processor that can operate at a maximum clock frequency of f_host. The host has the ability to execute compute (non-memory) instructions at a minimum clock cycles per instruction of CPI_host-non-mem. We assume that the host has a hierarchy of caches. A cache hit has an average latency of L_host-cache. The width of a cache line that is transferred between the host and memory system is represented by W_host-cache-line. In the PIM-enhanced system, the PIM chips can send parcels to the host; L_host-parcel-handling represents the parcel handling latency of the host.

3) Plain DRAM Characteristics: The latency of an access (read or write) to the plain DRAM in the baseline system is represented by L_dram. B_dram represents the maximum rate of data transfer from DRAM to the host per pin per unit time.

4) System-level Characteristics: At the system level, we consider that the baseline system has N_chips DRAM chips. Since in the PIM-enhanced system PIM chips simply replace the DRAM chips, the PIM-enhanced system has N_chips PIM chips. Communication latencies between host and DRAM/PIM chips, and between PIM chips, are represented by L_host-dram and L_inter-pim respectively. The widths of the data channels between host and DRAM/PIM chips, and between PIM chips, are represented by W_channel-host-dram and W_channel-inter-pim respectively. The size of a parcel in the PIM-enhanced system is represented by W_parcel.

B. Application Parameters

As described in the introduction, in the PIM-enhanced system, an application is decomposed into a single “heavyweight” thread running on the host and a collection of “lightweight” threads running on the PIM chips. Further, within a PIM thread, there is a scalar part and a SIMD part. Figure 4 illustrates the overall structure of a computation, which is organized as timesteps or phases with serial, multithreaded, and short-vector portions. As mentioned earlier, in our N-body example, approximately 20% of the executed instructions must be run serially and the remaining 80% can be multithreaded. Within each thread, about 44% can be run as SIMD instructions. This program structure, with varying percentages of each of the workload types, can be applied to a wide variety of applications, including grid and mesh problems, and forms the basis for selecting the application parameters of our modeling framework, which are listed in Table II. For the application parameters, we need to consider the mappings of both sequential and multithreaded versions of application code to the baseline and PIM-enhanced systems. We classify the parameters based on whether they are characteristics of the sequential or multithreaded versions.

Fig. 4. Phases in a multithreaded, SIMD computation

C. Speedup in a PIM-enhanced System

We calculate the speedup of execution of an application in the PIM-enhanced system with respect to the baseline system using Equation (1) below.

TABLE II
APPLICATION PARAMETERS

Characteristics of sequential version of application code:
  I                        Number of executed instructions
  r_mem                    Fraction of executed instructions that are memory reference instructions
  r_mt                     Fraction of executed instructions that can benefit from multithreading
  r_mt-simd                Fraction of executed instructions that can benefit from multithreading and SIMD hardware
  p_hit-host-cache         Effective hit rate of the host cache

Characteristics of multithreaded PIM version of application code:
  N_app-threads            Number of threads in the application
  r_thread-lwp-host        Frequency of parcel communication in a thread from the LWP to the host
  V_pim-host               Percentage of overlap between PIM and host executions
  p_thread-mem-local       Probability that a memory access of a thread is local
  p_thread-mem-remote      Probability that a memory access of a thread is remote
  p_thread-hit-lwp-cache   Hit rate of a thread’s memory accesses in the LWP level-1 frame cache
  V_thread-comp-comm       Percentage of overlap between computation and memory accesses/communication of an LWP thread

    Speedup = T_baseline / T_pim-enhanced                                    (1)

where,
• T_baseline is the execution time of the application on the baseline system.
• T_pim-enhanced is the execution time of the application on the PIM-enhanced system.

In the rest of this section, we develop equations for T_baseline and T_pim-enhanced in terms of the architectural and application parameters that were presented in Tables I and II.

1) Execution time on the baseline system: We calculate the total execution time of an application on the baseline system, composed of a single-threaded host processor and plain DRAM, as the sum of the times spent on compute operations and memory accesses:

    T_baseline = T_baseline-comp + T_baseline-mem                            (2)

    T_baseline-comp = I (1 − r_mem) CPI_host-non-mem / f_host                (3)

    T_baseline-mem = I r_mem L_host-mem                                      (4)

where L_host-mem is the average memory access time for the host processor, and all other parameters in the above equations are described in Tables I and II. L_host-mem is given by the following equation:

    L_host-mem = p_hit-host-cache L_host-cache + (1 − p_hit-host-cache) L_host-cache-miss      (5)

where L_host-cache-miss is the miss penalty of the host cache, and is given by:

    L_host-cache-miss = L_host-dram + L_dram + (W_host-cache-line − W_channel-host-dram) / (B_dram W_channel-host-dram)      (6)

In the above equation (6), L_host-cache-miss is calculated as the time taken to access a block of data following a miss. The time taken to access the first word in the block is given by L_host-dram + L_dram and represents the communication time between host and DRAM, and the DRAM access time. The last term in Equation 6 represents the time taken to access the remaining data, and is dictated by the maximum per-pin bandwidth obtainable from the DRAM (B_dram), the amount of data that is transferred (W_host-cache-line), and the width of the data channel (W_channel-host-dram) between the host and DRAM.
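
As a concrete check of Equations (2) through (6), the following sketch (ours, not the authors' code) evaluates the baseline execution time with the default parameter values of Tables VI and VII:

#include <stdio.h>

/* Minimal sketch of Equations (2)-(6) using the default values of
 * Tables VI and VII. Times in seconds, frequencies in Hz, widths in
 * bits, bandwidth in bits/s/pin. */
int main(void) {
    double I = 1e8, r_mem = 0.45, p_hit = 0.9;        /* Table VII */
    double f_host = 3.5e9, cpi = 0.1;                 /* Table VI  */
    double L_host_cache = 2.0 / f_host;
    double L_host_dram = 20e-9, L_dram = 30e-9;
    double W_line = 32 * 8, W_chan = 128, B_dram = 668e6;

    /* Eq. (6): first-word latency plus burst time for the rest of the line */
    double L_miss = L_host_dram + L_dram + (W_line - W_chan) / (B_dram * W_chan);
    /* Eq. (5): average memory access time seen by the host */
    double L_mem = p_hit * L_host_cache + (1.0 - p_hit) * L_miss;
    /* Eqs. (2)-(4): compute time plus memory time */
    double T_baseline = I * (1.0 - r_mem) * cpi / f_host + I * r_mem * L_mem;

    printf("T_baseline = %.4f s\n", T_baseline);
    return 0;
}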

2) Execution time on the PIM-enhanced system: To calculate the execution time of the application on the PIM-enhanced system, we consider the execution of the serial portion of the application on the host and that of the threads on the LWPs within the PIM chips, as well as the degree to which these can be overlapped. The equation for execution time on the PIM-enhanced system can be written as follows:

    T_pim-enhanced = Max(T_pim + T_host (1 − V_pim-host), T_host)            (7)

where T_pim and T_host are the execution times of the multithreaded portion on the PIM system and of the serial portion on the host processor. V_pim-host is an application parameter that indicates the degree to which the host execution can be overlapped with PIM execution. If T_host is greater than T_pim, then it is possible that the execution time is determined by T_host only, hence the use of the Max function.

a) Calculation of T_pim: Since all LWPs are processing threads from the same workload, we assume that the execution is uniformly distributed and the execution time for the PIM system can be computed as the execution time on a single LWP. Each LWP supports N_max-hw-threads threads of execution at a time and, for the purposes of this analysis, we assume that the threads are scheduled onto the available LWPs in a number of passes. Thus, if there is a very large number of threads in the workload, then typically there are N_max-hw-threads × N_chips × N_lwps total threads executing in the system concurrently and the total number of passes necessary to complete all threads in the application is given as:

    N_passes = N_app-threads / (N_max-hw-threads N_chips N_lwps)             (8)

The execution time for the threads on an LWP depends on the ability to overlap the memory access and communication latencies with computation from other threads and can be expressed as follows:

    T_pim = (T_thread-comp + T_thread-overhead + (T_thread-mem + T_thread-comm)(1 − V_thread-comp-comm)) N_executing-threads N_passes      (9)

where,
• T_thread-comp is the time spent on compute operations of a thread;
• T_thread-overhead is the time spent on overhead operations in a thread that are incurred as a consequence of multithreading;
• T_thread-mem is the time spent on memory accesses of a thread;
• T_thread-comm is the time spent on communication by a thread, either with another PIM chip or with the host;
• N_executing-threads is the number of threads executing on an LWP in a certain pass; typically equal to N_max-hw-threads;
• V_thread-comp-comm is a parameter that indicates the degree to which time spent by a thread in memory accesses or communication can be overlapped with computation. For a well-designed multithreaded system, V_thread-comp-comm can be expressed using the following equation:

    V_thread-comp-comm = Min((T_thread-comp + T_thread-overhead) / (T_thread-mem + T_thread-comm), 1)      (10)

Compute operations of a thread can be divided into scalar and SIMD operations. The SIMD operations are supported by having multiple SIMD units in an LWP. The time spent on compute operations of a thread can be written as:

    T_thread-comp = (N_thread-comp-scalar + N_thread-comp-simd) / f_lwp      (11)

where N_thread-comp-scalar and N_thread-comp-simd represent the number of scalar and SIMD compute instructions in a thread respectively, and can be expressed using the following equations:

    N_thread-comp-scalar = I r_mt (1 − r_mt-simd)(1 − r_mem) / N_app-threads

    N_thread-comp-simd = I r_mt r_mt-simd (1 − r_mem) / (N_app-threads N_simd-units)      (12)

Overhead operations in a thread can be attributed to three sources: overhead incurred in starting a thread on an LWP, thread migration overhead incurred when a thread migrates from one PIM chip to another, and overhead incurred for handling a parcel.

    T_thread-overhead = L_thread-starting + T_thread-migration + T_thread-parcel-handling      (13)

The overhead incurred during a thread migration is due to the time needed for creating a parcel that contains the relevant state of the lightweight thread. The time taken for this operation can be given by:

    T_thread-migration = N_thread-mem p_thread-mem-migrate L_thread-migration      (14)

where N_thread-mem is the number of memory reference instructions in a thread. N_thread-mem is composed of scalar memory references and SIMD memory references, such as vector load and store, and can be expressed as:

    N_thread-mem = N_thread-mem-scalar + N_thread-mem-simd                   (15)

    N_thread-mem-scalar = I r_mt r_mem (1 − r_mt-simd) / N_app-threads

    N_thread-mem-simd = I r_mt r_mem r_mt-simd / (N_app-threads N_simd-units)      (16)

Each memory reference falls into one of three categories: local memory reference, off-chip (remote) memory reference, or off-chip (remote) thread migration. The probability that a thread will migrate on a memory reference, p_thread-mem-migrate, is given by:

    p_thread-mem-migrate = 1 − p_thread-mem-local − p_thread-mem-remote      (17)

The parcel-handling overhead is incurred for three types of parcels: off-chip (remote) memory accesses, off-chip thread migrations, and PIM-host parcel communications. The time spent by a thread on parcel-handling can be written as:

    T_thread-parcel-handling = (N_thread-mem p_thread-mem-remote + N_thread-mem p_thread-mem-migrate + N_thread-lwp-host) L_lwp-parcel-handling      (18)

where N_thread-lwp-host is the number of times a parcel is sent from the LWP to the host during the execution of a thread, and can be expressed as:

    N_thread-lwp-host = I r_mt r_thread-lwp-host / N_app-threads             (19)

The time spent by a thread on memory accesses, T_thread-mem, can be expressed by considering the time spent on scalar and SIMD memory accesses:

    T_thread-mem = N_thread-mem L_thread-mem                                 (20)

where L_thread-mem is the average latency of a thread memory access. Note that the time spent in communication over the network during a remote memory access or thread migration is calculated separately as part of T_thread-comm, which is the time spent by the thread on communication. The average memory access latency of a thread can be written as:

    L_thread-mem = p_thread-hit-lwp-cache L_lwp-cache + (1 − p_thread-hit-lwp-cache)(p_thread-mem-remote L_pim + (1 − p_thread-mem-remote) L_lwp-cache-miss)      (21)

where L_lwp-cache-miss is the penalty of a frame cache miss when the data transfer takes place on-chip. Whenever a memory

access hits in the frame cache, a latency of L_lwp-cache is incurred. On a frame cache miss, there are three possibilities: the access remains local and data is fetched to the frame cache from the on-chip DRAM, the access becomes a remote access and data is fetched from another PIM chip, or the access results in a thread migration such that the thread state is packaged and shipped to another PIM chip where the access is restarted. For a frame cache miss resulting in a remote memory access, the time spent on the access, excluding the time spent on communication over the network, is equal to L_pim, the average latency of an on-chip DRAM access. For a frame cache miss that does not result in a remote memory access, the access is either local or results in a thread migration. In both cases, the time spent on the memory access is equal to L_lwp-cache-miss, the time taken to transfer a frame cache line from on-chip DRAM, which can be expressed as follows:

    L_lwp-cache-miss = L_pim + (W_lwp-cache-line − W_channel-on-chip) / (B_pim W_channel-on-chip)      (22)

The time spent in communication by a thread, T_thread-comm, is composed of the network latency incurred by off-chip memory references, thread migrations, and PIM-host communications.

    T_thread-comm = p_thread-mem-remote N_thread-mem L_pim-comm-remote + p_thread-mem-migrate N_thread-mem L_pim-comm-migrate + N_thread-lwp-host L_pim-host-comm      (23)

where L_pim-comm-remote and L_pim-comm-migrate are the communication latencies of a remote memory access and thread migration respectively, and L_pim-host-comm is the communication latency when a parcel is sent from an LWP to the host. N_thread-lwp-host is the number of times a parcel is sent from the LWP to the host during the execution of a thread.

L_pim-comm-remote, L_pim-comm-migrate and L_pim-host-comm are given by the following equations:

    L_pim-comm-remote = L_inter-pim + (W_lwp-cache-line − W_channel-inter-pim) / (B_pim W_channel-inter-pim)      (24)

    L_pim-comm-migrate = L_inter-pim + (W_parcel − W_channel-inter-pim) / (B_pim W_channel-inter-pim)      (25)

    L_pim-host-comm = L_host-dram + (W_parcel − W_channel-host-dram) / (B_pim W_channel-host-dram)      (26)

b) Calculation of T_host: Since the host processor is single threaded, the host execution time consists of the time spent on serial computation, memory accesses, and parcel handling.

    T_host = T_host-comp + T_host-mem + T_host-parcel                        (27)

The time spent by the host on serial computation may be calculated using the following equation:

    T_host-comp = I (1 − r_mt)(1 − r_mem) CPI_host-non-mem / f_host          (28)

The time spent by the host on memory accesses can be calculated using the following equation:

    T_host-mem = N_host-mem L_host-mem                                       (29)

where L_host-mem is the average memory latency of the host based on Equation 5, and N_host-mem is the number of memory reference instructions issued by the host, given by the following equation:

    N_host-mem = I (1 − r_mt) r_mem                                          (30)

The time spent by the host processor on parcel-handling, T_host-parcel, is given by:

    T_host-parcel = N_thread-lwp-host N_app-threads L_host-parcel-handling   (31)

Together, these equations provide an analytical performance model for our system.
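
To show how the pieces compose, the following sketch (ours; the per-thread component times of Equations 11 through 26 are taken as precomputed inputs) evaluates Equations (7), (9), and (10):

/* Sketch of Equations (7), (9), and (10): top-level execution time of
 * the PIM-enhanced system, given the per-thread component times.
 * Assumes t_mem + t_comm > 0. */
static double min_d(double a, double b) { return a < b ? a : b; }
static double max_d(double a, double b) { return a > b ? a : b; }

double t_pim_enhanced(double t_comp, double t_overhead,
                      double t_mem, double t_comm,
                      double n_executing_threads, double n_passes,
                      double t_host, double v_pim_host) {
    /* Eq. (10): fraction of memory/communication time hidden by compute */
    double v_overlap = min_d((t_comp + t_overhead) / (t_mem + t_comm), 1.0);
    /* Eq. (9): time for all threads scheduled onto one LWP */
    double t_pim = (t_comp + t_overhead
                    + (t_mem + t_comm) * (1.0 - v_overlap))
                   * n_executing_threads * n_passes;
    /* Eq. (7): host may be partially overlapped with PIM execution */
    return max_d(t_pim + t_host * (1.0 - v_pim_host), t_host);
}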

IV. SILICON COST MODELING FRAMEWORK

In this section, we describe our framework for modeling the silicon cost increase due to PIM. Our model calculates the silicon cost increase of PIM with respect to conventional, commodity DRAM. Using this model, we can study the scaling of the silicon cost increase of PIM over present and future DRAM technology nodes. Our silicon cost model considers only the silicon cost of PIM, based solely on the processing cost of a wafer and the number of obtainable functional die from a wafer. We do not consider the impact of testing or packaging, which can have a significant impact on overall PIM chip cost. Also, we assume negligible cost for the network and routing resources inside a PIM chip.

A. PIM Fabrication Process

We assume that the PIM die are fabricated in a commodity DRAM foundry using a fabrication process that adds more layers of metal to a pure commodity DRAM process. The additional layers of metal are required for a more efficient VLSI implementation of the PIM chip with higher performance and less area. Commodity DRAM fabrication processes usually have fewer layers of metal than MPU processes, the processes used to fabricate microprocessors. Typical present-day commodity DRAM processes have up to only 4 layers of metal as compared to typical present-day MPU processes, which have up to 9 layers of metal [9]. For the PIM cost analysis in this paper, we specifically assume that an efficient implementation of our PIM microarchitecture can be comfortably done with no more than 4 additional layers of metal over a pure commodity DRAM process. We also assume that the PIM fabrication process differs from the commodity DRAM process only in the number of metal layers. In all other respects, the PIM

fabrication process is identical to the commodity DRAM process. Thus, the speed of transistors fabricated using the PIM fabrication process is the same as that of transistors fabricated using the commodity DRAM fabrication process.

B. Silicon Cost Modeling Parameters

The silicon cost modeling parameters may be divided into two categories: parameters that are characteristics of the DRAM/PIM chip, and parameters that are characteristics of the DRAM/PIM fabrication process. These parameters are shown in Table III.

TABLE III
SILICON COST MODELING FRAMEWORK PARAMETERS

DRAM/PIM chip characteristics:
  A_dram             Area of DRAM die
  A_non-core-dram    Area of non-core portion of DRAM die
  A_lwp              Area of an LWP
  N_lwps             Number of LWPs in a PIM die

DRAM/PIM fabrication process characteristics:
  N_masks-dram       Number of masks used in DRAM fabrication process
  d_wafer            Diameter of a wafer used in the DRAM/PIM fabrication process
  Y_systematic-dram  Systematic limited yield of the DRAM fabrication process
  Y_systematic-pim   Systematic limited yield of the PIM fabrication process
  α_dram             Cluster factor of defects for DRAM fabrication process
  α_pim              Cluster factor of defects for PIM fabrication process
  D_fault-dram       Random electrical fault density of DRAM die

C. Area Model for an LWP

Our cost model requires calculation of the area of a 64-bit LWP. In PIM Lite, the execution pipeline was very similar to a typical 5-stage RISC pipeline, with the main exception that it accesses frames in memory in place of a register file [17]. Building upon this, we approximate a 64-bit LWP, minus caches, with a 64-bit MIPS core, also minus caches. We then use the published area of the 64-bit MIPS core in a 130 nm process [28] to create an analytic expression for the area of a 64-bit LWP. Since later in the paper we consider the silicon cost increase of LWP configurations with floating point SIMD, we create an expression for the area occupied by a 64-bit MIPS FPU too, and use that as the area of an LWP FPU. Table IV shows these analytic expressions for area that we use in our cost model. We assume that each LWP has a 32 KB direct-mapped instruction cache and a 32 KB 2-way set associative frame cache. To determine the areas of these caches, we use CACTI [29], an integrated area, timing and power modeling tool for SRAM-based caches.

TABLE IV
AREAS OF 64-BIT LWP CORE AND 64-BIT FPU IN TERMS OF GATE LENGTH L_p OF FABRICATION PROCESS

  Entity                       Average area in terms of L_p
  64-bit LWP (minus caches)    260 * 10^6 * L_p^2
  64-bit LWP FPU               107 * 10^6 * L_p^2

D. Silicon Cost Increase of PIM

In general, die cost depends on the processing cost of the wafer (C_wafer) and the number of obtainable good (functional) die (N_good) from the wafer. Thus, die cost can be represented using the following expression [30]:

    Die cost = C_wafer / N_good                                              (32)

Using the above expression, the silicon cost increase of PIM with respect to commodity DRAM is simply:

    Silicon cost increase of PIM = (C_wafer-pim / C_wafer-dram) · (N_good-dram / N_good-pim)      (33)

where C_wafer-pim and C_wafer-dram are the processing costs for PIM and DRAM wafers respectively, and N_good-pim and N_good-dram are the number of obtainable good PIM and DRAM die respectively. We now discuss how the terms C_wafer-pim/C_wafer-dram and N_good-dram/N_good-pim of Equation 33 are calculated.

1) Calculation of C_wafer-pim/C_wafer-dram: To a first approximation, the processing cost of a wafer is proportional to the number of wafer passes and/or the number of masks used in processing the wafer. According to the 2003 ITRS, in the period 2003-2018, MPUs are expected to use between 29 and 39 masks for their fabrication. In the same period, the number of masks used for fabrication of commodity DRAM is expected to be 24 in 2003-2009, and 26 in 2010-2018. Four additional metal layers in the PIM fabrication process would correspond to an addition of 7 masks to the pure commodity DRAM process (4 masks for the metal layers themselves + 3 masks for intermediate via layers). Thus, with the assumption that the processing cost of a wafer is proportional to the number of masks used, we can write the following expression:

    C_wafer-pim / C_wafer-dram = (N_masks-dram + 7) / N_masks-dram           (34)

2) Calculation of N_good-dram and N_good-pim: N_good-dram and N_good-pim can be calculated using the following equation for the number of good die on a wafer:

    N_good = Y N_max                                                         (35)

where Y is the die yield and N_max is the maximum number of obtainable die from a wafer.

An approximation typically used to calculate N_max is:

    N_max = (π / 4A) (d_wafer − √A)²                                         (36)

where A is the area of the die and d_wafer is the diameter of the wafer.

The die yield Y is typically calculated as:

    Y = Y_systematic Y_random                                                (37)

where,
• Y_systematic is the systematic (or gross) limited yield. 1 − Y_systematic represents the fraction of die that is lost due to overall wafer processing errors.
• Y_random is the random-defect limited yield. 1 − Y_random represents the fraction of die that is lost due to random defects in the die.

Y_systematic is usually process related, and not amenable to further analysis. Y_random is typically calculated using the negative binomial yield expression as:

    Y_random = (1 / (1 + A D_fault / α))^α                                   (38)

where,
• A is the area of each die that, if a defect occurs, results in an uncorrectable die fault;
• D_fault is the random electrical fault density;
• α is a “cluster factor” that relates to the potential for multiple defects to be in the same neighborhood.

N_good-dram can thus be found using Equations 35, 36, 37 and 38 and requires knowledge of the following factors:

• Y_random-dram, random-defect limited yield of the DRAM die. Y_random-dram itself is dependent on the following:
  – A_non-core-dram, non-core area of the DRAM die, which is the area susceptible to defects;
  – D_fault-dram, random electrical fault density of the DRAM die;
  – α_dram, cluster factor of defects of the DRAM die;
• Y_systematic-dram, systematic limited yield of the DRAM die;
• A_dram, area of the DRAM die;
• d_wafer, diameter of a wafer in the DRAM fabrication process.

N_good-pim can also be found using Equations 35, 36, 37 and 38, and thus depends on the following factors:

• Y_random-pim, random-defect limited yield of the PIM die, which depends on the following:
  – A_def-suscept-pim, area of the PIM die that is susceptible to defects;
  – D_fault-pim, random electrical fault density of the PIM die;
  – α_pim, cluster factor of defects of the PIM die;
• Y_systematic-pim, systematic limited yield of the PIM die;
• A_pim, total area of the PIM die;
• d_wafer, diameter of a wafer in the PIM fabrication process.

A_def-suscept-pim, the area of the PIM die that is susceptible to defects, is the sum of the non-core area of the on-chip DRAM and the area occupied by the LWPs. We consider the complete LWP area to be irreparable and do not assume any redundancy in the SRAM-based caches. Thus,

    A_def-suscept-pim = A_non-core-dram + N_lwps A_lwp                       (39)

where N_lwps is the number of LWPs in a PIM chip, and A_lwp is the area of each LWP. A_lwp is calculated based on the area model of an LWP described in Section IV-C.

A_pim, the total area of a PIM die, is simply the sum of the area occupied by the on-chip DRAM and the area occupied by the LWPs, and is thus given by:

    A_pim = A_dram + N_lwps A_lwp                                            (40)

D_fault-pim, the random electrical fault density of the PIM fabrication process, can be expressed using the following equation:

    D_fault-pim = D_defect-pim p_failure                                     (41)

where D_defect-pim is the random electrical defect density of the PIM die, and p_failure is the average probability of failure of the PIM die. We make the assumption that the average probability of failure of the PIM die is the same as that of the corresponding DRAM die of the same technology node, despite the additional number of metal layers in the PIM fabrication process. We also assume that the random electrical defect density of the PIM die is greater than that of the DRAM die by the factor by which the number of mask layers of the PIM fabrication process exceeds that of the DRAM fabrication process. With these assumptions, the random electrical fault density of the PIM die can be written as:

    D_fault-pim = D_fault-dram (N_masks-dram + 7) / N_masks-dram             (42)

Using Equations 33 through 42, the silicon cost increase due to PIM can be calculated.
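
As a worked example, the sketch below (ours, not the authors' code) chains Equations (33) through (42) with the default values of Table V; with 8 LWPs of 8 mm^2 each it prints a silicon cost increase of about 3.2.

#include <math.h>
#include <stdio.h>

/* Sketch of Equations (33)-(42) with the default values of Table V.
 * Areas in mm^2, wafer diameter in mm, fault density in faults/mm^2. */
static const double PI = 3.14159265358979;

static double n_max(double area, double d_wafer) {          /* Eq. (36) */
    double s = d_wafer - sqrt(area);
    return PI / (4.0 * area) * s * s;
}

static double y_random(double a_suscept, double d_fault, double alpha) {
    return pow(1.0 / (1.0 + a_suscept * d_fault / alpha), alpha); /* Eq. (38) */
}

int main(void) {
    double a_dram = 82.0, a_noncore = 30.0, a_lwp = 8.0, n_lwps = 8.0;
    double n_masks = 24.0, d_wafer = 300.0, y_sys = 0.95, alpha = 2.0;
    double d_fault_dram = 3751e-6;      /* 3751 faults/m^2, in faults/mm^2 */

    double wafer_cost_ratio = (n_masks + 7.0) / n_masks;          /* Eq. (34) */
    double a_pim = a_dram + n_lwps * a_lwp;                       /* Eq. (40) */
    double a_suscept = a_noncore + n_lwps * a_lwp;                /* Eq. (39) */
    double d_fault_pim = d_fault_dram * wafer_cost_ratio;         /* Eq. (42) */

    /* Eqs. (35)-(38): good die per wafer for DRAM and PIM */
    double n_good_dram = y_sys * y_random(a_noncore, d_fault_dram, alpha)
                               * n_max(a_dram, d_wafer);
    double n_good_pim  = y_sys * y_random(a_suscept, d_fault_pim, alpha)
                               * n_max(a_pim, d_wafer);

    /* Eq. (33): wafer cost ratio times good-die ratio */
    printf("silicon cost increase = %.2f\n",
           wafer_cost_ratio * n_good_dram / n_good_pim);
    return 0;
}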

V. RESULTS

Given our analytical cost/performance models, in this section we first assign default values to the parameters of our models. Next we examine the sensitivity of cost/performance by sweeping key parameters. After that we examine the impact of technology scaling on cost/performance. Finally, we select suitable cost-effective PIM chip configurations.

A. Default Values for Silicon Cost Modeling Parameters

Default values for the parameters of the silicon cost model of Table III are shown in Table V. For all DRAM technology-node dependent parameters, we choose default values based on their values for the year 2005 obtained from the 2003 ITRS. Note that the ITRS assumes the percentage of non-core area of a DRAM die to remain constant at 37% of DRAM die area across all DRAM technology nodes. The ITRS targets a yield of 85% for each DRAM technology node, which is achieved by assuming a systematic limited yield value of 95% and a random-defect limited yield value of 89.5%. The value of D_fault-dram for each technology node is back-calculated by the ITRS from the values of random-defect limited yield (89.5%) and the DRAM non-core area, which is the area in a DRAM die that is assumed to be susceptible to defects. The ITRS assumes the core area of a DRAM die to be 100% repairable through redundancy in its rows and columns.

TABLE V
DEFAULT VALUES FOR SILICON COST MODELING PARAMETERS

  Parameter           Default value     Rationale for selection of default value
  A_dram              82 mm^2           For year 2005, from ITRS [9]
  A_non-core-dram     30 mm^2           Non-core area of DRAM die is assumed to be 37% of DRAM die by the ITRS [9]
  A_lwp               8 mm^2            For year 2005, using area model of LWP
  N_lwps              8                 Default value in performance model
  N_masks-dram        24                For year 2005, from ITRS [9]
  d_wafer             300 mm            For year 2005, from ITRS [9]
  Y_systematic-dram   95%               Target value specified in ITRS [9]
  Y_systematic-pim    95%               Optimistically assumed to be same as Y_systematic-dram
  α_dram              2                 From ITRS [9]
  α_pim               2                 Assumed to be same as α_dram
  D_fault-dram        3751 faults/m^2   From ITRS [9]

B. Default Values for Performance Modeling Parameters

Default values for the architectural and application parameters of Tables I and II are shown in Tables VI and VII respectively. Some of the default architectural parameters were chosen from different resources for the year 2005. Thus, for instance, the default value (3.5 GHz) for the clock frequency of the host processor was chosen based on the expected clock frequency of a 2005 MPU obtained from the 2003 ITRS [9]. Most other default values were chosen by attributing reasonable values. The third column of Table VI explains the rationale for selection of the default value. N_chips, the number of DRAM/PIM chips, was chosen to be 16 based on our earlier descriptions of the baseline and enhanced systems in Section I. In the baseline system, a 1 GB DIMM was assumed to be composed of 16 64 MB DRAM chips, and in the PIM-enhanced system PIM chips simply replace the DRAM chips. The maximum number of hardware threads that can execute in an LWP at any time, N_max-hw-threads, was fixed to be 128, based on the maximum number of 32-word frames that would fit in the frame cache. However, it should be noted that the execution time in the PIM system obtained through our performance modeling is independent of N_max-hw-threads. As N_max-hw-threads is increased, the number of passes in which all threads finish executing decreases proportionally, but there is no impact on execution time.

Default values for the application parameters that are characteristics of the sequential version of the application code were chosen through an analysis of a sequential version of the Barnes-Hut algorithm for the N-body problem. The default values were obtained through a combination of analyzing the source code and making timing measurements of instrumented code running under both Linux and Solaris, as well as instruction counting on a SPARC processor using Shade [31].

TABLE VI
DEFAULT VALUES FOR ARCHITECTURAL PARAMETERS

Parameter              Default value   Rationale for selection of default value
Nlwps                  8               Nominal value; Nlwps is swept in most charts anyway
flwp                   875 MHz         flwp = fhost/4; slower transistors in DRAM process
Nmax-hw-threads        128             Nominal value
Nsimd-units            1               No SIMD capability in default case
Llwp-cache             2/flwp          Nominal value
Wlwp-cache-line        512 bytes       Wide cache line assumed for frame cache; based on design parameter from [32]
Wchannel-on-chip       512 bits        Nominal value
Lthread-starting       10/flwp         Nominal value
Lthread-migration      10/flwp         Nominal value
Llwp-parcel-handling   10/flwp         Nominal value
Llocal-pim             10 ns           Nominal value
Bpim                   Bdram           Nominal value
fhost                  3.5 GHz         Expected clock frequency of 2005 MPU based on 10 FO4 gate delay (from ITRS [9])
CPIhost-non-mem        0.1             Nominal value
Lhost-cache            2/fhost         Nominal value
Whost-cache-line       32 bytes        Nominal value
Lhost-parcel-handling  10/fhost        Nominal value
Ldram                  30 ns           From a DRAM datasheet [33]; assuming bank precharge can be hidden for each memory access
Bdram                  668 Mb/s/pin    Based on projections by memory expert [34]
Nchips                 16              System definition
Wparcel                64 bytes        Nominal value
Linter-pim             20 ns           Nominal value
Lhost-dram             20 ns           Nominal value
Wchannel-host-dram     128 bits        Based on projections by memory expert [34]
Wchannel-inter-pim     64 bits         Nominal value

TABLE VII
DEFAULT VALUES FOR APPLICATION PARAMETERS

Parameter            Default value   Rationale for selection of default value
I                    100000000       Arbitrary
Napp-threads         2^20            Nominal value
rmem                 0.45            Timing measurements / instruction counting
rmt                  0.8             Timing measurements / instruction counting
rmt-simd             0.44            Timing measurements / instruction counting
phit-host-cache      0.9             Nominal value
rthread-lwp-host     0.001           Nominal value
Vpim-host            0               Nominal value
pthread-mem-local    0.9             Nominal value
pthread-mem-remote   0               Nominal value
phit-lwp-cache       0.9             Nominal value


C. Examining Sensitivity of Cost/Performance to Various Parameters

We present the results of a sensitivity analysis of cost/performance with respect to certain key parameters of our cost/performance models. While a particular parameter is being swept, all other parameters assume their default values.

1) Number of LWPs per PIM Chip: Figure 5(a) shows how speedup varies as the number of LWPs per PIM chip is increased in powers of 2. Speedup increases rapidly at first and then flattens out. The flattening is due to the serial bottleneck in the application, which executes on the host and cannot be overlapped with the threads executing on the PIM chips. Figure 5(b) shows the silicon cost increase of PIM over plain DRAM as the number of LWPs increases. The silicon cost increase keeps growing as the number of LWPs is increased. To find out what number of LWPs gives good performance improvement per unit silicon cost increase, we plot the ratio of speedup to silicon cost increase in Figure 5(c). It can be seen from the figure that, for the assumed default parameters, 2 LWPs per PIM chip give the best performance improvement per unit silicon cost increase. Further investments in silicon area yield diminishing returns in performance.

2) Degree of Multithreading of Application: Figure 6(a) shows the speedup plotted against the number of LWPs per PIM chip for different degrees of multithreading (rmt) in the application. For the range considered, speedup shows close to linear growth with the number of LWPs when the entire application is completely multithreadable (rmt = 1) with no serial portion. In general, speedup in the PIM-enhanced system improves as the number of LWPs in the system increases until the execution time is almost completely dominated by the execution time of the serial portion of the application on the host, beyond which further increases in LWPs do little to improve speedup. As the degree of multithreading of the application decreases, speedup also decreases and quickly flattens out as the number of LWPs per PIM chip is increased. Figure 6(b) shows the ratio of speedup to silicon cost increase of PIM for the different degrees of multithreading. When the degree of multithreading is less than 40%, the performance improvement is always less than the silicon cost increase. Clearly, multithreaded PIM-enhanced systems are cost-beneficial only when there is a sufficient degree of multithreading in the application.
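This saturation is Amdahl-like. Ignoring the memory, bandwidth, and overhead terms of our full model, a hedged first-order approximation is Speedup(Nlwps) ≈ 1 / ((1 - rmt) + rmt / (k * Nlwps)), where k is an illustrative per-LWP throughput factor relative to the host. This expression is bounded above by 1 / (1 - rmt) no matter how many LWPs are added; for the default rmt = 0.8 the bound is 5, consistent with the flattening of the default-parameter speedup curves near 5.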

3) Number of Application Threads: Figure 7 shows speedup plotted against the number of application threads. Because we keep the total number of executed instructions (I) fixed, most components of execution time in the PIM-enhanced system do not change. The only component of execution time that increases with the number of application threads is the overhead incurred in starting the threads, which grows linearly with the number of application threads. However, the impact of this is very limited, as can be seen from Figure 7, in which the number of application threads was increased up to 64 million.

4) Frequency of Parcel Communication in a Thread between LWP and Host:

Fig. 5. Speedup (a), silicon cost increase (b), and speedup to silicon cost increase ratio (c) vs. number of LWPs per PIM chip, with the no-PIM baseline marked in (c).


Fig. 6. Speedup (a) and speedup to silicon cost increase ratio (b) for different degrees of multithreading (rmt) in the application.

Fig. 7. Speedup for different numbers of application threads.

Figure 8 shows speedup plotted against the number of LWPs per PIM chip for different frequencies of parcel communication in a thread between an LWP and the host. There is little change in speedup when a parcel is sent out once every 1000 instructions of a thread instead of once every 10000 instructions. However, speedup drops by a more significant amount when a parcel has to be sent out once every 10 instructions instead of once every 100 instructions.

Fig. 8. Speedup for different frequencies of parcel communication in a thread between LWP and host (rthread-lwp-host from 0.0001 to 0.1).

5) SIMD Capability: Figure 9 shows the speedup and silicon cost increase of PIM with and without floating point SIMD hardware. Speedups with floating point SIMD hardware are shown for a few different values of rmt-simd. The force calculation in the N-body problem can benefit from 4-way SIMD for floating point operations, and so in the SIMD configuration we consider only floating point SIMD with each LWP having 4 FPUs. Speedup improves only slightly with the incorporation of SIMD hardware, and the difference between the speedups with and without SIMD hardware grows smaller as the number of LWPs increases. Even when 100% of the multithreadable code can take advantage of the SIMD hardware, there is not much change in speedup. This is simply because the speedup is limited by the serial portion of the application, which does not benefit from the addition of SIMD hardware in the PIM chips. The silicon cost increase, on the other hand, grows at a faster rate with the incorporation of SIMD hardware, particularly at higher numbers of LWPs per PIM chip.

6) Clock Frequency of LWP: Figure 10 plots speedup against the number of LWPs per PIM chip for different clock frequencies of the LWP. As the clock frequency of the LWP is increased, the speedup increases, and the best speedup is obtained when the LWP clock frequency matches that of the host (3.5 GHz). When the LWP clock frequency is reduced to 875 MHz, there is a drop in speedup, but it is small. When the LWP clock frequency is reduced to 100 MHz, there is a larger drop in speedup.

7) Various Overheads: Figure 11 shows speedup plotted against the number of LWPs per PIM chip for different values of all the overheads in the PIM-enhanced system.


Fig. 9. Speedup and silicon cost increase with and without floating point SIMD capability in the LWPs; speedups with FP SIMD hardware are shown for rmt-simd = 0.44, 0.75, and 1.

Fig. 10. Speedup for different clock frequencies of the LWP (100 MHz, 500 MHz, 875 MHz, 2 GHz, and 3.5 GHz).

The overheads are Lthread-starting, Lthread-migration, Llwp-parcel-handling, and Lhost-parcel-handling. In each curve of Figure 11, each kind of overhead takes the same value as the others in terms of number of cycles. The parcel handling latency of the host, Lhost-parcel-handling, is in terms of host clock cycles, while the other overheads are in terms of LWP clock cycles. A significant amount of overhead can be tolerated in the multithreaded PIM-enhanced system before speedup starts to deteriorate.

8) Communication Latencies: Figure 12 shows speedup plotted against the number of LWPs per PIM chip for different latencies of communication between the host and DRAM/PIM chips. The speedup obtained in the PIM-enhanced system improves as the host gets further away from the DRAM/PIM chips. Figure 13 shows speedup for different latencies of communication between PIM chips. The speedup is quite tolerant of large communication latencies between PIM chips.

D. Technology Scaling of Cost/Performance of PIM

We consider the impact of technology scaling on the following architectural parameters of our performance model:

Fig. 11. Speedup for different values of the overheads in the PIM-enhanced system (1, 10, 100, and 1000 cycles).

Fig. 12. Speedup for different latencies of communication between host and DRAM/PIM chip (Lhost-dram = 1, 20, 100, 1000, and 10000 ns).

the clock frequency of the host processor, fhost, and the DRAM/PIM bandwidths (Bdram, Bpim). We conservatively assume that the speed of transistors in a commodity DRAM does not scale with technology, and so keep the clock frequency of the LWP fixed at its default value of 875 MHz. We project fhost based on information from the ITRS [9]. We project Bdram based on information from [34] and continue to conservatively assume that Bpim is equal to Bdram. Technology scaling also has an impact on the following technology-node dependent parameters of our cost model: Adram, Anon-core-dram, Alwp, Nmasks-dram, dwafer, and Dfault-dram. For Adram, Anon-core-dram, Nmasks-dram, dwafer, and Dfault-dram, we use the ITRS. For Alwp, we simply use the equations from Table IV along with CACTI.

1) Fixed Number of LWPs per PIM Chip: First, we study the impact of technology scaling when the number of LWPs per PIM chip is fixed across technology nodes. We consider 8 LWPs per PIM chip. Figure 14 shows speedup, silicon cost increase, and the speedup to silicon cost increase ratio plotted against year (ITRS technology node).


Fig. 13. Speedup for different latencies of communication between PIM chips (Linter-pim = 1, 20, 100, 1000, and 10000 ns).

Speedup does not change much with technology scaling, mainly because the speedup is relatively insensitive to the architectural parameters that change with technology scaling. However, the silicon cost increase due to PIM drops with technology scaling. The silicon cost increase of PIM is strongly dependent on the relative area occupied by the LWPs in the PIM die, which keeps decreasing because the capacity of on-chip DRAM keeps increasing with technology scaling. Thus, with technology scaling and a fixed number of LWPs per PIM chip, PIM-enhanced systems show a very good trend for performance improvement per unit silicon cost. Such a trend is beneficial for solving problems of constant size.

Fig. 14. Speedup, silicon cost increase, and speedup to silicon cost increase ratio for a fixed number (8) of LWPs per PIM chip, for years 2005 through 2010.

2) Fixed Ratio of Capacity of DRAM to Number of LWPs per PIM Chip: We now study the impact of technology scaling when the number of LWPs per PIM chip scales in the same proportion as on-chip DRAM capacity. Thus, the number of LWPs per PIM chip now grows with the capacity of on-chip DRAM. Figure 15 shows speedup, silicon cost increase, and speedup to silicon cost increase ratio curves plotted against year (ITRS technology node). Speedup again remains relatively insensitive to the effects of technology scaling. Additional LWPs do not improve speedup much, because speedup is limited by the serial portion of the application. The silicon cost increase curve shows slight up-and-down variations across technology nodes. For instance, going from 2005 to 2006, there is a slight increase in the silicon cost. This is because in 2006, based on the ITRS, a DRAM cell experiences a small shrink in its area. Because the DRAM non-core area is calculated by the ITRS as being a constant 37% of DRAM die area across all technology nodes, the non-core area also experiences the same shrink as the DRAM cell. The area occupied by the LWPs, on the other hand, does not experience such a shrink, and so accounts for a greater percentage of PIM die area than at the previous technology node. As the silicon cost increase of PIM is strongly sensitive to the percentage of area occupied by the LWPs, the silicon cost also increases slightly. Another instance of slight variation in the silicon cost increase of PIM can be seen going from 2009 to 2010; in this case there is a slight decrease. This is because in 2010, based on the ITRS, the number of masks used for DRAM fabrication goes up from 24 to 26. The number of masks for PIM fabrication thus also goes up, from 31 (24 + 7) to 33 (26 + 7). Because 33/26 is slightly less than 31/24, the processing cost of PIM relative to DRAM decreases slightly, which brings the silicon cost increase curve down slightly. In general, though, for a fixed ratio of DRAM capacity to number of LWPs per PIM chip across technology nodes, the silicon cost increase of PIM remains almost constant, and all up-and-down variations are very small.
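Concretely, the PIM-to-DRAM mask-count ratio falls from 31/24 ≈ 1.292 to 33/26 ≈ 1.269, a reduction of a little under 2% in the relative processing cost.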

Because of the almost constant trends for speedup and silicon cost increase across technology nodes, the speedup to silicon cost increase ratio curve also remains almost constant. This indicates that a constant performance improvement per unit silicon cost increase can be obtained through PIM-enhanced systems under technology scaling when the number of LWPs per PIM chip is scaled proportionally with DRAM capacity. This potentially augurs well for solving problems of increasing size that fit in the growing memory capacity.

E. Selecting Suitable PIM Chip Configurations

Given our cost/performance models, we now consider the question of which PIM chip configurations are suitable. Because we have observed that multithreaded PIM chips are only cost-beneficial for applications with sufficiently high degrees of multithreading, we investigate the selection of suitable PIM chip configurations for applications with relatively high degrees of multithreading. We consider rmt values of 0.7, 0.8, and 0.95. All other application and architectural parameters assume their default values from Tables VI and VII. Using our cost/performance models, we find the number of LWPs per PIM chip that maximizes the performance improvement per unit silicon cost increase. We consider LWP configurations both with and without floating point SIMD hardware. We again consider 4-way floating point SIMD because the force calculation in the N-body problem can benefit from 4-way SIMD.


Fig. 15. Speedup, silicon cost increase, and speedup to silicon cost increase ratio for a fixed ratio (16) of capacity of DRAM to number of LWPs per PIM chip, for years 2005 through 2010.

We carry out this exercise for two years, 2005 and 2010; Figures 16(a) and 16(b) show plots of performance improvement per unit silicon cost increase versus the number of LWPs per PIM chip for the years 2005 and 2010, respectively.

In general, the number of LWPs per PIM chip that maximizes the performance improvement per unit silicon cost increase is around 1-4 for 2005 and around 2-8 for 2010. The number of LWPs that can be embedded cost-effectively is greater in 2010 than in 2005 because of technology scaling and the increasing capacity of the DRAM die. As the number of LWPs increases beyond the optimal value, the performance improvement per unit silicon cost increase deteriorates because of diminishing returns. For our set of default parameters, LWP configurations with 4-way floating point SIMD hardware offer the best performance improvement per unit silicon cost increase. However, LWPs with SIMD hardware also suffer more from diminishing returns at higher-than-optimal numbers of LWPs than LWPs without SIMD hardware.
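The selection procedure itself is simple once the models are in hand. The sketch below captures its structure; speedup() and cost_increase() are illustrative stand-ins for our analytic performance and cost models, whose equations are not reproduced here.

    # Hedged sketch of the selection step: sweep the number of LWPs per
    # PIM chip and keep the count that maximizes speedup per unit silicon
    # cost increase. speedup() and cost_increase() are placeholders for
    # the analytic performance and cost models.
    def best_lwp_count(speedup, cost_increase, lwp_counts=(1, 2, 4, 8, 16)):
        return max(lwp_counts, key=lambda n: speedup(n) / cost_increase(n))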

VI. CONCLUSIONS

In this paper, we have developed analytic models to calculate the speedup and silicon cost increase from replacing the plain DRAM chips of a conventional workstation with multithreaded PIM chips. We used these models to examine the sensitivity of speedup to different architectural and application parameters, and to examine which PIM chip configurations would provide cost-effective performance if produced at high volumes.

Our cost/performance analysis indicates that there is a region of opportunity for multithreaded PIM-enhanced systems. Multithreaded PIM-enhanced systems have the potential to be cost-beneficial for applications with sufficiently high degrees of multithreading. For applications with relatively high degrees of multithreading, our results indicate that the number of LWPs per PIM chip that produces the best performance improvement per unit silicon cost increase is around 1-4 in the year 2005 and around 2-8 in the year 2010.

Fig. 16. Performance improvement per unit silicon cost increase with and without FP SIMD hardware for rmt = 0.7, 0.8, and 0.95, in (a) year 2005 and (b) year 2010.


Future work in this area involves the development of more detailed analysis and simulation environments that overcome the limitations of our current performance models and project performance with tighter bounds. Cost modeling that considers the impact of packaging and testing also needs to be developed. Another avenue that needs to be explored is the concept of defect-tolerant PIM chips. Defect-tolerant PIM chips that tolerate the presence of one or more defective LWPs can potentially lower the cost of a PIM-enhanced system and make it even more cost-effective.

ACKNOWLEDGMENT

We would like to thank Cray Inc. and DARPA, who supported this research through the Cascade project under the High Productivity Computing Systems program.

REFERENCES

[1] P. M. Kogge. EXECUBE - A new architecture for scalable MPPs. In Dharma P. Agrawal, editor, Proceedings of the 23rd International Conference on Parallel Processing. Volume 1: Architecture, pages 77-84, Boca Raton, FL, USA, August 1994. CRC Press.


[2] Jay B. Brockman, Peter M. Kogge, Vincent W. Freeh, Shannon K. Kuntz, and Thomas L. Sterling. Microservers: A new memory semantics for massively parallel computing. In Conference Proceedings of the 1999 International Conference on Supercomputing, pages 454-463, Rhodes, Greece, June 20-25, 1999. ACM SIGARCH.

[3] Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, Krste Asanovic, Neal Cardwell, Richard Fromm, Jason Golbus, Benjamin Gribstad, Kimberly Keeton, Randi Thomas, Noah Treuhaft, and Katherine Yelick. Scalable processors in the billion-transistor era: IRAM. IEEE Computer, 30(9):75-78, September 1997.

[4] Mary Hall, Peter Kogge, Jeff Koller, Pedro Diniz, Jacqueline Chame, Jeff Draper, Jeff LaCoss, John Granacki, Apoorv Srivastava, William Athas, Jay Brockman, Vincent Freeh, Joonseok Park, and Jaewook Shin. Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In Supercomputing (SC'99), Portland, Oregon, November 1999. ACM Press and IEEE Computer Society Press.

[5] M. Oskin, F. Chong, and T. Sherwood. Active pages: A computation model for intelligent memory. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), volume 26,3 of ACM Computer Architecture News, pages 192-203, New York, June 27-July 1, 1998. ACM Press.

[6] Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas. FlexRAM: Toward an advanced intelligent memory system. In International Conference on Computer Design (ICCD), pages 192-201, 1999.

[7] Kenneth Mai, Timothy Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, and Mark Horowitz. Smart Memories: A modular reconfigurable architecture. In 27th Annual International Symposium on Computer Architecture (27th ISCA-2000), Vancouver, British Columbia, Canada, June 2000. ACM SIGARCH / IEEE.

[8] George Almasi, Calin Cascaval, Jose G. Castanos, Monty Denneau, Derek Lieber, Jose E. Moreira, and Henry S. Warren Jr. Dissecting Cyclops: a detailed analysis of a multithreaded architecture. SIGARCH Comput. Archit. News, 31(1):26-38, 2003.

[9] Semiconductor Industries Association. International Technology Roadmap for Semiconductors. Technical report, 2003.

[10] Rishiyur S. Nikhil and Arvind. Can dataflow subsume von Neumann computing? In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 262-272, June 1989.

[11] Gregory M. Papadopoulos and David E. Culler. Monsoon: An explicit token-store architecture. In 17th International Symposium on Computer Architecture, number 18(2) in ACM SIGARCH Computer Architecture News, pages 82-91, Seattle, Washington, May 28-31, 1990.

[12] David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and Thorsten von Eicken. TAM - A compiler controlled Threaded Abstract Machine. Journal of Parallel and Distributed Computing, 18(3):347-370, July 1993.

[13] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: a mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256-266, Gold Coast, Australia, May 1992.

[14] W. J. Dally, J. A. S. Fiske, J. S. Keen, R. A. Lethin, M. D. Noakes, P. R. Nuth, R. E. Davison, and G. A. Fyler. The message-driven processor. IEEE Micro, pages 23-39, April 1992.

[15] M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-machine multicomputer: An architectural evaluation. In Lubomir Bic, editor, Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 224-236, San Diego, CA, May 1993. IEEE Computer Society Press.

[16] Jay B. Brockman, Shyamkumar Thoziyoor, Shannon K. Kuntz, and Peter M. Kogge. A low-cost, multithreaded processing-in-memory system. In 3rd Workshop on Memory Performance Issues (WMPI-2004), held in conjunction with the 31st International Symposium on Computer Architecture (ISCA), Munich, Germany, June 20, 2004. ACM SIGARCH.

[17] Shyamkumar Thoziyoor, Jay B. Brockman, and Daniel Rinzler. PIM Lite: A multithreaded processor-in-memory prototype. In Great Lakes Symposium on VLSI, Chicago, IL, April 17, 2005. ACM/IEEE.

[18] Josh Barnes and Piet Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446-449, December 1986.

[19] IBM. Embedded Memory Selection Guide. http://www-3.ibm.com/chips/products/asics/products/ememory.html, March 2003.

[20] W. J. Dally, A. Chien, J. A. S. Fiske, G. Fyler, W. Horwat, J. S. Keen, R. A. Lethin, M. Noakes, P. R. Nuth, and D. S. Wills. The message driven processor: An integrated multicomputer processing element. In International Conference on Computer Design, VLSI in Computers and Processors, pages 416-419, Los Alamitos, CA, October 1992. IEEE Computer Society Press.

[21] Thomas Sterling and Larry Bergman. A design analysis of a hybrid technology multithreaded architecture for petaflops scale computation. In Conference Proceedings of the 1999 International Conference on Supercomputing, pages 286-293, Rhodes, Greece, June 20-25, 1999. ACM SIGARCH.

[22] J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca. The architecture of the DIVA processing-in-memory chip. In ACM International Conference on Supercomputing (ICS'02), June 2002.

[23] T. Sterling and H. Zima. Gilgamesh: A multithreaded processor-in-memory architecture for petaflops computing. In Supercomputing: High-Performance Networking and Computing, November 2002.

[24] Arun Rodrigues, Richard Murphy, Peter Kogge, and Keith Underwood. Characterizing a new class of threads in scientific applications for high end supercomputers. In International Conference on Supercomputing, Saint Malo, France, June 26, 2004. ACM.

[25] Ken Iobst, Maya Gokhale, and Bill Holmes. Processing in memory: The Terasys massively parallel PIM array. IEEE Computer, 28(4), April 1995.

[26] G. Lipovski and C. Yu. The dynamic associative access memory chip and its application to SIMD processing and full-text database retrieval. In IEEE International Workshop on Memory Technology, Design and Testing, pages 24-33, San Jose, CA, August 1999. IEEE Computer Society.

[27] Graham Kirsch. Active memory device delivers massive parallelism. In Microprocessor Forum, San Jose, CA, October 2002.

[28] MIPS. MIPS64 5K family. http://www.mips.com/content/products/cores/64-bitcores, 2003.

[29] P. Shivakumar and N. P. Jouppi. CACTI 3.0: An Integrated Cache Timing, Power, and Area Model. Technical Report WRL-2001-2, Hewlett Packard Laboratories, 2001.

[30] J. M. Rabaey, A. Chandrakasan, and B. Nikolic. Digital Integrated Circuits. Prentice Hall, 2nd edition, 2003.

[31] R. F. Cmelik and D. Keppel. Shade: A fast instruction-set simulator for execution profiling. Technical Report TR-93-06-06, University of Washington, Department of Computer Science and Engineering, June 1993.

[32] Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk. Missing the memory wall: The case for processor/memory integration. In 23rd Annual International Symposium on Computer Architecture (23rd ISCA'96), Computer Architecture News, pages 90-101. ACM SIGARCH, May 1996.

[33] Micron Technology Inc. DDR2 Datasheet, 2004.

[34] Dean Klein. JEDEX keynote address: Memory Trends, Drivers, and Solutions, April 2004.