From Near-Data Processing to In-Data Processing-in-Storage
Roman Kaplan
Technion, Israel Institute of Technology
Abstract
Near-data in-storage processing research has been gaining
momentum in recent years. A typical processing-in-storage
architecture places one or several processing cores inside the
storage and allows data to be processed without transferring it to the
host CPU. Since this approach replicates the von Neumann architecture
inside storage, it is exposed to the problems faced by von Neumann
architectures, especially the bandwidth wall.
The goal of this work is to design a storage device that does not
suffer from the von Neumann bandwidth bottleneck and can
provide high performance on massively parallel big data
workloads. To achieve this, a novel processing-in-storage system
based on Resistive Content Addressable Memory (ReCAM) is
presented. ReCAM functions simultaneously as storage and as a
parallel associative processor. ReCAM processing-in-storage
resolves the bandwidth wall by keeping computation inside the
storage array, thus implementing in-data, rather than near-data,
processing.
This work shows that a ReCAM-based processing-in-storage
architecture may outperform existing near-data and accelerator-based
designs. The ReCAM processing-in-storage implementation of
Smith-Waterman DNA sequence alignment reaches a speedup of
almost 5× over a GPU cluster. For K-means, ReCAM processing-in-storage
achieves speedups of 4.6×–68× relative to CPU, GPU and
FPGA implementations, and for K-Nearest Neighbors, ReCAM
processing-in-storage achieves speedups of 17.9×–17,470×.¹ Finally, an
implementation of in-storage inline data deduplication is presented
and shown to achieve orders of magnitude higher throughput than
traditional CPU- and DRAM-based systems.
Keywords Content addressable memory, associative processing,
memristors, in-storage processing
1. Problem and Motivation
Until the breakdown of Dennard scaling, designers focused on
improving single-core performance by increasing the clock
rate. Since 2005, when Dennard scaling diminished but Moore’s
law continued, the focus has shifted to increasing the number of
cores in multicore processors [12]. However, in the era of big data
[14], improved processing elements do not necessarily translate into
increased performance. One definition characterizes Big Data as
“the amount of data just beyond technology’s capability to store,
manage and process efficiently” [21]. Large contributors to the
increase in the volume of data are social media, Internet of Things
(IoT) and multimedia. All have significantly increased the rate of
produced data, whether structured or unstructured. Moreover, big
data has the potential to transform a plethora of fields, including
science, engineering, healthcare, finance and many others.
Big data datasets do not fit in the DRAM of a single machine or
even a cluster of machines. Therefore, data is typically fetched to
the CPUs and their memory hierarchies from non-volatile storage
such as hard disks or Flash SSDs. Consequently, bandwidth and
access time to storage pose a major concern for Big Data
applications. The throughput at which data can be fetched from
storage to the processing units (CPU or an accelerator) has become
a main determinant of system performance. The problem worsens
in a datacenter cloud environment, where a dataset can be
distributed among multiple nodes across the datacenter. In this case,
communicating the stored data adds latency and reduces bandwidth
even further, placing a low bound on maximal performance.

¹ The work on the K-means and KNN algorithms is new since the PACT 2016
SRC submission.
This challenge has motivated re-emerging Near-Data
Processing (NDP) [2]. The data-centric NDP shifts computing
closer to data. NDP seeks to minimize data movement by
computing at the most appropriate location in the hierarchy, which
can be cache, main memory or persistent storage. With NDP, less
data needs to be transferred through levels of hierarchy, thus
alleviating the limited bandwidth problem. Since storage
bandwidth constitutes the key bottleneck in big data processing,
NDP proposals that place computing resources at the cache level
[42] or in main memory [18][1][36] (also known as Processing-in-
Memory or PiM) do not address the bandwidth bottleneck problem
facing big data workloads. In-storage NDP targets the source of the
problem, but offers only a partial solution, as reviewed and
discussed in Section 2.3.
Resistive CAM (ReCAM), a storage device based on emerging
resistive materials in the bitcell with a novel non-von Neumann
Processing-in-Storage (PRinS) compute paradigm, is proposed in
order to mitigate the storage bandwidth bottleneck of big data
processing. Section 2 provides background on the basic concepts of
ReCAM and PRinS and covers related work. Section 3 presents the
ReCAM architecture, explains how processing is performed within
ReCAM and establishes its scalability. PRinS implementations for
three key algorithms are presented in Section 4 and compared to
other approaches: Smith-Waterman DNA sequence alignment, K-
means clustering and K-nearest neighbors. In addition, previous
work on in-storage data deduplication with ReCAM is concisely
presented.
2. Background and Related Work
Three basic concepts underlie the proposed ReCAM: content
addressable memories, associative processing and resistive
materials. The following two subsections introduce each concept.
The third subsection covers related work on NDP and highlights
their limitations at addressing the storage bandwidth challenge of
big data processing.
2.1 Content Addressable Memory and Associative
Processing
Content addressable memory (CAM), also called associative
memory, allows the comparison of all data words to a key in
parallel, tagging the matching words, and possibly reading some or
all of the tagged words, one by one. Standard memory read and
write operations of a single word at a time can also take place.
In addition to storing information, a CAM array can be modified
to function as an associative processor [13][41]. In associative
processing, the parallel compare and parallel write operations
supported by CAM are used to implement an ‘if condition then
value’ expression. Thus, complex Boolean expressions are
evaluated in parallel on all data words (CAM rows) by sequential
execution of truth table if-then lines. Each (multi-bit) argument of
a truth table line is matched with the contents of the appropriate
field in the entire CAM array: the matching rows are tagged and the
corresponding result values from the truth table line are written into
the designated fields of all tagged rows.
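To make the mechanism concrete, the compare-tag-write step described above can be modeled in software. The following NumPy sketch is illustrative only; the function name and the bit-matrix layout are our own stand-ins, not part of any ReCAM specification.

```python
import numpy as np

def associative_step(array, mask, key, write_val):
    """One associative step: compare all rows to `key` on the columns
    selected by `mask`, tag the matching rows, then write `write_val`
    into the masked bits of every tagged row, all in parallel.
    array: (rows, bits) 0/1 matrix; mask/key/write_val: bit vectors."""
    active = mask.astype(bool)
    # Parallel compare: a row matches if it equals the key on all masked bits.
    tag = (array[:, active] == key[active]).all(axis=1)
    # Parallel write into the masked bits of all tagged rows.
    array[np.ix_(tag, active)] = write_val[active]
    return array, tag
```

One such step corresponds to one truth-table line: the compare implements the "if condition", the tagged write implements the "then value", over the whole array at once.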
For an m-bit argument x, any Boolean function f(x) has 2^m possible
input values; therefore, the associative computing operation incurs
O(2^m) cycles, regardless of the dataset size. More efficiently,
arithmetic operations can be performed on ReCAM in a word-parallel,
bit-serial manner, reducing compute time from O(2^m) to
O(m). The massive parallelism of each operation compensates in
performance for the relatively large number of (parallel execution)
cycles of each arithmetic operation.
More complex computations (more than Boolean functions) are
decomposed into series of Boolean expressions, as usual [41][43].
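As an example of such a decomposition, word-parallel, bit-serial addition can be modeled as m passes of a full-adder truth table, each pass operating on one bit column of every row at once. The sketch below is a hypothetical software model (names and layout are ours); note that the loop count depends only on the word width m, never on the number of rows.

```python
import numpy as np

def bitserial_add(A, B):
    """Compute B = A + B (mod 2**m) on every row simultaneously.
    A, B: (rows, m) bit matrices, LSB in column 0. One full-adder
    truth-table pass per bit column: O(m) passes, any number of rows."""
    m = A.shape[1]
    carry = np.zeros(A.shape[0], dtype=A.dtype)
    for i in range(m):                       # bit-serial sweep
        s = A[:, i] ^ B[:, i] ^ carry        # sum bit, parallel over rows
        carry = (A[:, i] & B[:, i]) | (carry & (A[:, i] ^ B[:, i]))
        B[:, i] = s
    return B

def to_bits(x, m=32):
    """Little-endian bit vector of an integer (helper for the demo)."""
    return np.array([(x >> i) & 1 for i in range(m)], dtype=np.uint8)

def from_bits(b):
    return int(sum(int(v) << i for i, v in enumerate(b)))
```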
2.2 Resistive Memories
Resistive random access memory (ReRAM) stores information by
modulating the resistance of nanoscale storage elements. These
elements are nonvolatile, free of leakage power, and are emerging as
a long-term alternative to charge-based memories, including NAND
flash [3]. In particular, the storage elements can be placed as a layer
above a normal CMOS logic circuit, enabling integrated
processing-in-storage (PRinS). A 32GB ReRAM with a CMOS
logic layer underneath [29] demonstrated a high density functional
chip.
2.3 Related Work
The concept of NDP has been considered for several decades [25].
One approach to implementing NDP employs 3D stacked
memories (e.g., [1][16]). DRAM memory is placed on top of a logic
die allowing for a high-speed vertical inner interface between the
memory and logic. The Hybrid Memory Cube [31] is another
memory design (and a product) that exploits the increased
bandwidth between DRAM and near-memory logic, and may be
exploited for NDP [4].
Another approach for NDP in memory employs custom memory
chips with resistive bitcells. In [18] and [19], a resistive associative
memory is used as the main memory. Resistive bitcells are also
used in analog computing [8][37], in PiM structures that
allow the memory array to be used as RAM, scratchpad and analog
dot-product engine. Both trends are still limited to memory-level NDP.
In in-storage NDP, an SSD comprising an array of flash chips is
augmented with an embedded CPU [22] or GPU [9] to perform some
processing tasks in the SSD rather than in the host CPU, as
illustrated in Figure 1a. Often called Intelligent (or Smart) SSD,
the main idea is increasing effective bandwidth to the in-SSD
processors by connecting these processors directly to the flash
banks inside the SSD. However, the storage bandwidth inside the
SSD is still limited by the read/write ports of the flash devices.
The goal of this work is to design a storage device that does not
suffer from the von Neumann bandwidth bottleneck and can
provide high performance on big data workloads. To achieve
this, processing is pushed into every bit of storage.
3. Approach and Uniqueness
The approach of this work is to design a device for storing big
datasets and processing them efficiently. The key properties of this
design are scalability and massively parallel processing, possible
due to the non-von Neumann architecture. Parallelism is achieved
by in-situ processing of the data, in contrast with NDP approaches.
In this work, Resistive CAM (ReCAM), a non-volatile and scalable
storage device with resistive bitcells and a novel Processing-in-
Storage (PRinS) paradigm, is presented. The concept is
demonstrated in Figure 1b.
3.1 ReCAM Crossbar Array
Figure 2a shows a single integrated circuit (IC) of the resistive
CAM crossbar. A bitcell, shown in Figure 2b, consists of two
transistors and two resistive elements (2T2R). The KEY register
contains a data word to be written or compared against. The MASK
register defines the active columns for write and read operations,
enabling bit selectivity. The TAG register (Figure 2c) marks the
rows that are matched by the compare operation and may be
affected by a parallel write. The TAG register enables chaining
multiple ReCAM ICs.
In a conventional CAM, a compare operation is typically
followed by a read of the matched data word. When in-storage
processing involves arithmetic operations, a compare is usually
followed by a parallel write into the unmasked bits of all tagged
rows, and additional capabilities, such as read and reduction
operations, are included [42].
Any computational expression can be efficiently implemented
in ReCAM storage using line-by-line execution of the truth table of
the expression [41].

Figure 1: (a) In-storage Near-Data Processing. (b) In-Data
Processing in ReCAM-based storage.

Figure 2: (a) Single ReCAM crossbar IC. (b) 2T2R ReCAM
bitcell. (c) TAG logic.

Arithmetic operations are typically performed
bit-serially. Table 1 lists several operations supported by ReCAM
and the number of cycles required by each operation. Addition (in-
place or not) is performed in a bit-serial manner using the truth-
table approach (32 bits times 8 truth-table rows times 2 for compare
and write amounts to 512 cycles). Shifting down a sequence of 32
bits in a consecutive block of rows by one line requires three cycles
per bit. First, compare to ‘1’ copies the source bit column of all
rows into the TAG. Second, shift moves the TAG vector down by
setting the shift-select line (Figure 2c). Third, write-‘1’ copies the
shifted TAG to the same bit-column. Row-wise maximum
compares two 32-bit numbers in each row in parallel. Max Scalar
tags all rows that contain the maximal value in the selected 32 bits.
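The three-cycle shift of a single bit column can be modeled as follows (an illustrative software stand-in, with `np.roll` playing the role of the TAG shift-select line). Shifting a 32-bit field therefore costs 32 × 3 = 96 cycles, matching Table 1.

```python
import numpy as np

def shift_column_down(col):
    """Shift one bit column down by one row, in the three cycles
    described in the text. col: 1-D bit vector (one CAM bit column)."""
    tag = (col == 1)                        # cycle 1: compare-to-'1' fills TAG
    tag = np.roll(tag, 1)                   # cycle 2: TAG shifts down one row
    tag[0] = False                          #   (top row receives no value here)
    return tag.astype(col.dtype)            # cycle 3: write-'1' copies TAG back
```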
3.2 System Architecture
Conceptually, the ReCAM comprises hundreds of millions of rows,
each serving as a computational unit. Due to power and die-size restrictions,
the entire array may be divided into multiple smaller ICs, as in
Figure 3a. A row is fully contained within an IC. All ICs are daisy
chained for Shift and Max Scalar operations.
The ReCAM storage system uses a microcontroller (Figure 3b)
similar to [19]. It issues instructions, sets the KEY and MASK
registers, handles control sequences and executes read requests. In
addition, the microcontroller may also perform some baseline
processing, such as normalization of the reduction tree results.
ReCAM-based storage is scalable due to its inherent
parallelism: adding more ICs increases storage capacity at no
performance cost, since compute capability scales linearly with the
number of ICs. Therefore, in-storage processing of large datasets
requires no external communication, in contrast to datacenter-scale
distributed storage.
4. Results and Contribution
A cycle-accurate simulator of the ReCAM storage was constructed.
Assumed operational frequency is 500MHz. An in-house power
simulator was used to evaluate the power consumption of ReCAM.
The latency and energy figures used by both the timing and power
simulations are obtained using SPICE simulation and are detailed
in [41].
4.1 Smith-Waterman DNA Sequence Alignment
Searching for similarities in pairs of protein and DNA sequences
(also called Pairwise Alignment) has become a routine procedure
in Molecular Biology and it is a crucial operation in many
bioinformatic tools. The Smith-Waterman algorithm (S-W) [38]
provides an optimal solution for the pairwise sequence alignment
problem, but requires a number of operations proportional to the
product of the lengths of the two sequences.
S-W identifies the optimal alignment of two sequences by
computing a two-dimensional scoring matrix. Matching base-pairs
score positively, while mismatches result in a negative score. The
optimal alignment score of two sequences is the highest score in the
matrix. S-W has two steps: scoring (to find the maximal
alignment score) and trace-back to construct the alignment. The
first step is the most computationally demanding and is the focus
of this work.
Figure 4a shows a snapshot of the scoring matrix during the
algorithm execution. In a parallel implementation, the matrix is
filled along the main diagonal and the entire anti-diagonal scores
are calculated in parallel. Figure 4b shows the ReCAM memory
map of two consecutive ICs at the beginning of an iteration. 𝐴 and
𝐵 contain the sequences, where each base-pair takes 2 bits and
resides in a separate row. 𝐸 and 𝐹 are partial score results of the
affine gap model [17]. 𝐴𝐷[0], 𝐴𝐷[1] and 𝐴𝐷[2] contain scoring
matrix anti-diagonals. Scores are represented by 32-bit integers.
Shift operations in the PRinS implementation move data between
rows inside an IC and between daisy-chained ICs. Figure 4c shows
the ReCAM memory map at the end of an iteration and the mapping
between ReCAM and the scoring matrix. A complete description
of the S-W PRinS implementation appears in [23].
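The anti-diagonal progression can be sketched in a few lines of NumPy. This is an illustrative wavefront model, not the ReCAM implementation itself: it uses a linear gap penalty for brevity, whereas the ReCAM design uses the affine gap model [17] with the E and F fields, and the function name and parameters are ours.

```python
import numpy as np

def sw_score(a, b, match=2, mismatch=-1, gap=-1):
    """S-W scoring step: the matrix is filled one anti-diagonal at a
    time, and all cells of an anti-diagonal are updated in parallel."""
    n, m = len(a), len(b)
    A = np.frombuffer(a.encode(), dtype=np.uint8)
    B = np.frombuffer(b.encode(), dtype=np.uint8)
    H = np.zeros((n + 1, m + 1), dtype=np.int64)
    for d in range(2, n + m + 1):                  # sweep anti-diagonals
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i                                  # cells (i, j) with i+j = d
        sub = np.where(A[i - 1] == B[j - 1], match, mismatch)
        # Each cell depends only on the two previous anti-diagonals.
        H[i, j] = np.maximum.reduce([np.zeros_like(i),
                                     H[i - 1, j - 1] + sub,
                                     H[i - 1, j] + gap,
                                     H[i, j - 1] + gap])
    return int(H.max())                            # optimal local score
```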
We simulate the ReCAM with the cycle-accurate simulator. The
CUPS metric (Cell Updates per Second) is used to measure S-W
performance. Results are compared to other works in Table 2. The
in-storage implementation is compared with implementations on
different platforms: a 384-GPU cluster [36], the 128-FPGA
RIVYERA platform [39] and a four-Xeon-Phi implementation [30].
On ReCAM with a total of 8GB in 32 separate ICs, each of 256MB
and 8M rows, 53 TCUPS are demonstrated, computing a total of
57.2×10¹² scores, 4.7× faster than the best competing implementation.
The table also shows computed GCUPS/Watt ratios; ReCAM is
nearly twice as power-efficient as the FPGA solution and 80× better
than the GPU system.
4.2 K-Means Clustering
K-means is an unsupervised learning algorithm for clustering
unclassified samples. It aims to partition 𝑁 samples into K clusters,
where each sample belongs to the cluster with the nearest center
(mean).
An in-storage version of the algorithm was designed and
implemented in the cycle-accurate simulator. The algorithm
contains two K-iteration loops, assignment and update. The
assignment loop first associatively calculates (in parallel) the
Euclidean distance between all samples and each mean. Next, the
loop assigns each sample to its closest mean, in parallel over all
samples. The update loop recalculates the new mean coordinates.
Table 1: Operations included in ReCAM

  Integer Instruction with 32-bit Operand    Cycles
  B ← A + B                                  256
  C ← A + B                                  512
  Shift down by one row                      96
  Row-wise Max (A, B)                        64
  Max Scalar (A)                             64

Figure 3: (a) ReCAM-based storage system, composed of multiple
daisy-chained ICs connected to the microcontroller through a
reduction tree network. (b) Microcontroller.

These two loops may be repeated until the mean coordinates
converge.
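The two loops can be sketched in NumPy as follows (a hypothetical software stand-in for the associative operations; the function name is ours). The assignment loop makes K passes, each computing distances for all N samples in parallel, mirroring the row-parallel ReCAM operations.

```python
import numpy as np

def kmeans_iteration(samples, means):
    """One K-means pass. samples: (N, d); means: (K, d)."""
    K = means.shape[0]
    # Assignment loop: K passes, each a parallel distance computation
    # over ALL samples (squared Euclidean distance suffices for argmin).
    dist = np.stack([((samples - means[k]) ** 2).sum(axis=1)
                     for k in range(K)])            # (K, N)
    labels = dist.argmin(axis=0)                    # closest mean per sample
    # Update loop: K passes, each recomputing one mean coordinate-wise.
    new_means = np.stack([samples[labels == k].mean(axis=0)
                          if (labels == k).any() else means[k]
                          for k in range(K)])
    return labels, new_means
```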
Several evaluations are performed. The ReCAM K-means
PRinS implementation is compared with Altera Stratix V [34] and
Xilinx ZC706 [28] FPGAs, an Intel i7-3770K quad-core CPU [11],
an NVIDIA K20M GPU [6], and a ten-GPU NVIDIA K20M cluster [35]
implementations. Figure 5 presents the evaluated dataset sizes, the
average runtime per iteration (log scale) of each architecture and
the relative speedup of ReCAM.
Overall, ReCAM achieved speedups ranging between 4.6× and
67.8×. One notable comparison is to Rossbach et al. [35], who used
a ten-GPU cluster with a large dataset of 1 billion samples,
occupying roughly 150GB. Their average iteration time is 30.6
seconds, compared with 0.65 seconds on ReCAM, yielding a
speedup of 47×. This comparison highlights the benefit of PRinS
insensitivity to dataset size, in contrast to the GPU cluster, which is
limited by the inter- and intra-node communication. A complete
description of the K-means PRinS implementation and
performance comparisons appears in [43].
4.3 K-Nearest Neighbors (KNN)
K-nearest neighbors (KNN) is a method frequently used for
classification. It computes the distances between an (unclassified)
input query vector and a dataset of classified samples. Query vector
classification is usually determined by the majority vote of 𝐾
nearest database samples, hence the name 𝐾-nearest neighbors.
In a von Neumann machine, the required computational effort
is proportional to the dataset size. In contrast, the runtime of the
in-data implementation of KNN does not grow with dataset size, and it
can therefore provide high performance on very large datasets.
Our in-storage implementation has two steps. The first
computes the Euclidean distance between the query vector and all
dataset samples in parallel. The second step iteratively finds the K
dataset samples that are closest to the query vector (the nearest
neighbors), one by one.
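These two steps can be sketched as follows (an illustrative software model, with all names ours): step 1 is one parallel pass over all rows, and step 2 serially extracts the K minima, standing in for ReCAM's scalar min/max tagging.

```python
import numpy as np

def knn_classify(samples, classes, query, K=3):
    """Classify `query` by majority vote of its K nearest samples."""
    # Step 1: distances from the query to ALL samples in one parallel pass.
    dist = ((samples - query) ** 2).sum(axis=1)
    # Step 2: extract the K nearest neighbors one by one.
    d = dist.astype(float)
    neighbors = []
    for _ in range(K):
        i = int(d.argmin())         # tag the current global minimum
        neighbors.append(i)
        d[i] = np.inf               # exclude it from the next iteration
    votes = classes[np.array(neighbors)]
    return int(np.bincount(votes).argmax())   # majority vote
```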
The ReCAM processing-in-storage implementation is
compared with Stratix IV FPGA [32], NVIDIA K20M [20] and
NVIDIA Titan X GPUs [27], and Nearest neighbor Content
Addressable Memory (NCAM) [27] KNN implementations. The
latter is an NDP device with 3D-stacked DRAM on top of custom
logic. Figure 6 plots runtime (log scale) results and relative speedup
of ReCAM, in addition to the datasets used in the evaluations.
Overall, ReCAM achieved speedups ranging between 17.9× and
17,470×. A complete description of the KNN PRinS implementation
and the performance comparisons appears in [43].
Most notable is the comparison to NCAM. While NCAM
improves runtime over the GPU by an order of magnitude (17× and
5.2× on SIFT and ImageNet, respectively) and shows a significant
benefit of NDP, the ReCAM processing-in-storage architecture
demonstrates an additional one to two orders of magnitude speedup
(17.9× and 67.9×, respectively) relative to NCAM, thanks to in-data
rather than near-data processing.
Figure 4: (a) A snapshot of the dynamic programming matrix,
showing the direction of progress for the parallel algorithm. (b)
and (c) show an example of the organization of data in the
ReCAM crossbar array at the beginning (b) and end (c) of an
iteration. The AD[2] contents in (b) are replaced with the new
result in (c). Bottom rows in a crossbar IC are daisy-chained to
the next IC in a shift instruction. The cell marked with X
contains the global maximum score.
Table 2: Summary of State-of-the-Art Performance for S-W
Scoring Step in Previous Works and in ReCAM

  Accelerator      Xeon Phi   FPGA    GPU     ReCAM
  Perf. (TCUPS)    0.23       6.0     11.1    53
  Number of ICs    4          128     384     32
  Power (kWatt)    0.8        1.3     100.0   6.6
  GCUPS/Watt       0.3        4.7     0.1     8.0
  Reference        [30]       [39]    [36]    this work
Figure 5: Comparison of the average runtime of a single K-
means iteration between ReCAM and different platforms.
Figure 6: KNN runtime comparison: ReCAM vs. different
platforms.
4.4 In-Storage Data Deduplication
Deduplication is a data compression technique for eliminating
redundant copies of repeated data, designed to improve storage
utilization. Files are split into multiple data blocks. Only unique
blocks are meant to be stored.
The ReCAM native compare operation is used to find duplicate
data blocks in a fixed number of cycles. In [24], an inline in-storage
implementation of deduplication is presented and its performance
is compared to a CPU and DRAM implementation. A 256GB
ReCAM may achieve on average 100× higher throughput than a
typical mass storage appliance with CPU and DRAM based
deduplication, while consuming similar energy. The complete in-
storage deduplication implementation on ReCAM, including the
performance comparisons, appears in [24].
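For illustration, the inline lookup-then-store flow can be modeled with a content-addressed map standing in for the ReCAM compare (the class and method names here are hypothetical, not from [24]):

```python
class DedupStore:
    """Toy model of inline deduplication: each incoming block is
    compared against all stored blocks in a single associative lookup
    (a dict stands in for the constant-time ReCAM compare); only
    unique blocks are stored."""

    def __init__(self):
        self.blocks = {}   # block content -> stored block id
        self.refs = []     # per-write reference to a stored block id

    def write(self, block: bytes) -> int:
        if block in self.blocks:          # associative compare: hit
            bid = self.blocks[block]
        else:                             # miss: store the unique block
            bid = len(self.blocks)
            self.blocks[block] = bid
        self.refs.append(bid)
        return bid
```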
5. Conclusions
This paper introduces ReCAM, a Processing-in-Storage
architecture. It combines storage with parallel processing of data
within the storage. Each ReCAM IC device may contain hundreds
of millions of data rows, each row serving as an associative
processing unit. Many ReCAM ICs are chained together
constituting a large storage device, suitable for big data
applications. ReCAM applies parallel associative processing to the
data. This paper presents Smith-Waterman DNA sequence
alignment, K-means and KNN algorithms, as well as in-storage inline
data deduplication on ReCAM, and shows significant speedup over
alternative non-in-storage implementations.
REFERENCES
[1] Ahn, J. et al. “A Scalable Processing-in-Memory Accelerator for Parallel Graph
Processing.” In ISCA, pp. 105-117, 2015.
[2] Ahn, J. et al. “PIM-Enabled Instructions: A Low-Overhead, Locality-Aware
Processing-in-Memory Architecture.” In ISCA, pp. 336-348, 2015.
[3] Akinaga, H. and Hisashi S. "Resistive Random Access Memory (ReRAM) Based
on Metal Oxides." Proceedings of the IEEE, vol. 98, no. 12, pp. 2237-2251, 2010.
[4] Azarkhish, E. et al. “Logic-Base Interconnect Design for Near Memory Computing
in the Smart Memory Cube.” IEEE Transactions on VLSI Systems, vol. 25, no. 1, pp.
210-223, 2017.
[5] Balasubramonian, R. et al. “Near-data processing: Insights from a MICRO-46
workshop.” IEEE Micro, vol. 34, no. 4, pp. 36–42, 2014.
[6] Bhimani, J., Leeser, M. and Mi, N. "Accelerating K-Means Clustering with Parallel
Implementations and GPU Computing." In HPEC, pp. 1-6, 2015.
[7] Boboila, S. et al. “Active flash: Out-of-core data analytics on flash storage.” In MSST,
pp. 1-12, 2012.
[8] Chi, P. et al. “PRIME: A Novel Processing-In-Memory Architecture for Neural
Network Computation in ReRAM-based Main Memory.” In ISCA, pp. 27-39, 2016.
[9] Cho, B.Y. et al. “XSD: Accelerating MapReduce by Harnessing the GPU Inside an
SSD.” In 1st Workshop on Near-Data Processing, 2013.
[10] Cho, S. et al. “Active Disk Meets Flash: A Case for Intelligent SSDs.” In ICS, pp. 31-
102, 2013.
[11] Ding, Y. et al. “Yinyang K-Means: A Drop-In Replacement of the Classic K-Means
with Consistent Speedup.” In ICML, pp. 579–587, 2015.
[12] Esmaeilzadeh, H. et al. “Dark silicon and the end of multicore scaling.” In ISCA, pp.
365-376, 2011.
[13] Foster, C. C. “Content addressable parallel processors.” John Wiley & Sons, Inc.,
1976.
[14] Gandomi, A. and Haider, M. “Beyond the Hype: Big Data Concepts, Methods, and
Analytics.” Int. J. of Information Management, vol. 35, no. 2, pp. 137–144, 2015.
[15] Gao, M. et al. “Practical Near-Data Processing for In-Memory Analytics
Frameworks.” In PACT, pp. 113-124, 2015.
[16] Giridhar, B. et al. “Exploring DRAM Organizations for Energy-Efficient and
Resilient Exascale Memories.” In SC, p. 23, 2013.
[17] Gotoh, O. "An Improved Algorithm for Matching Biological Sequences." Journal of
Molecular Biology, vol. 162, no. 3, pp. 705-708, 1982.
[18] Guo, Q. et al. "A Resistive TCAM Accelerator for Data-Intensive Computing." In
MICRO, pp. 339-350, 2011.
[19] Guo, Q. et al. “AC-DIMM: Associative Computing with STT-MRAM.” In ACM
SIGARCH Computer Architecture News, vol. 41, no. 3, pp. 189-200, 2013.
[20] Gutiérrez, P.D. et al. “GPU-SME-kNN: Scalable and memory efficient kNN and
lazy learning using GPUs.” Information Sciences, vol. 373, pp. 165-182, 2016.
[21] Kaisler, S. et al. “Big Data: Issues and Challenges Moving Forward.” In HICSS, pp.
995–1004, 2013.
[22] Kang, Y. et al., “Enabling Cost-Effective Data Processing with Smart SSD.” In
MSST, pp. 1–12, 2013.
[23] Kaplan, R. et al. "A Resistive CAM Processing-in-Storage Architecture for DNA
Sequence Alignment." To be published, IEEE Micro, Jul./Aug. 2017.
(arXiv:1701.04723)
[24] Kaplan, R. et al. "Deduplication in Resistive Content Addressable Memory Based
Solid State Drive." In PATMOS, pp. 100-106, 2016.
[25] Kogge, P.M. “A Short History of PIM at Notre Dame.” July 1999;
www.cse.nd.edu/~pim/projects.html
[26] Kogge, P.M., Brockman, J.B. and Freeh, V.W. et al. "PIM architectures to support
petaflops level computation in the HTMT machine." Int. Workshop on Innovative
Architecture for Future Generation Processors and Systems, pp. 35-44, 2000.
[27] Lee, V.T. et al. “NCAM: Near-Data Processing for Nearest Neighbor Search.” arXiv
preprint arXiv:1606.03742, 2016.
[28] Li, Z., Jifang, J. and Lingli, W. "High-Performance K-means Implementation based
on a Coarse-grained Map-Reduce Architecture." arXiv preprint arXiv:1610.05601,
2016.
[29] Liu, T.Y. et al. "A 130.7-mm2 2-Layer 32-Gb ReRAM Memory Device in 24-nm
Technology." IEEE Journal of Solid-State Circuits, vol. 49, no. 1, pp. 140-153, 2014.
[30] Liu, Y. and Schmidt, B. “SWAPHI: Smith-Waterman Protein Database Search on
Xeon Phi Coprocessors.” In IEEE ASAP, pp. 184–185, 2014.
[31] Pawlowski, J.T. “Hybrid Memory Cube: Breakthrough DRAM Performance with
a Fundamentally Re-Architected DRAM Subsystem.” In 23rd Hot Chips, 2011.
[32] Pu, Y. et al. "An Efficient KNN Algorithm Implemented on FPGA Based
Heterogeneous Computing System Using OpenCL." In FCCM, pp. 167-170, 2015.
[33] Pugsley, S. et al. “NDC: Analyzing the Impact of 3D-Stacked Memory+Logic
Devices on MapReduce Workloads.” In ISPASS, pp. 190-200, 2014.
[34] Ramanathan, N. et al. "A Case for Work-stealing on FPGAs with OpenCL
Atomics." In FPGA, pp. 48-53, 2016.
[35] Rossbach, C.J. et al. "Dandelion: a compiler and runtime for heterogeneous
systems." In SIGOPS, pp. 49-68, 2013.
[36] Sandes, E.F.D.O. et al. "CUDAlign 4.0: Incremental Speculative Traceback for Exact
Chromosome-Wide Alignment in GPU Clusters." IEEE TPDS, vol. 27, no. 10, pp.
2838-2850, 2016.
[37] Shafiee, A. et al. “ISAAC: A Convolutional Neural Network Accelerator with in-
Situ analog Arithmetic in Crossbars.” In ISCA, pp. 14-26, 2016.
[38] Smith, T.F. and Waterman, M.S. "Identification of Common Molecular
Subsequences." Journal of Molecular Biology, vol. 147, no. 1, pp. 195-197, 1981.
[39] Wienbrandt, L. “The FPGA-based High-Performance Computer RIVYERA for
Applications in Bioinformatics.” In CiE, pp. 383-392, 2014.
[40] X-Brick tech spec. https://www.emc.com/collateral/data-sheet/h12451-xtremio-4-
system-specifications-ss.pdf
[41] Yavits, L. et al. “Resistive Associative Processor.” IEEE Computer Architecture
Letters, vol. 14, no. 2, pp. 148 – 151, 2015.
[42] Yavits, L., Morad, A. and Ginosar, R. “Computer Architecture with Associative
Processor Replacing Last Level Cache and SIMD Accelerator.” IEEE Transactions
on Computers, vol. 64, no. 2, pp. 368-381, 2014.
[43] “In-Data vs. Near-Data Processing: The Case for Processing in Resistive CAM,”
submitted, http://webee.technion.ac.il/~ran/papers/InDataReCAM2017.pdf