ac-dimm: associative computing with stt-mram
TRANSCRIPT
AC-DIMM: Associative Computing with STT-MRAM
Qing Guo, Xiaochen Guo, Ravi Patel
Engin Ipek, Eby G. Friedman
University of Rochester
Published In: ISCA-2013
Mustafa Shihab: 02/28/2014
Prevalent Trends in Modern Computing: 1. Technology Scaling > Creates Power and Bandwidth Challenges - Transistor density doubles every two years, but power efficiency does not scale proportionally - Number of pins grows approximately at 16% / year only 2. Data-Intensive Work Load
Resultant Bottlenecks: - On-Chip Power Dissipation - Off-Chip Memory Bandwidth
One Promising Solution: Associative Computing Using Content-Addressable Memories (CAM)
Motivation
Mustafa Shihab: 02/28/2014
Content Addressable Memory
• Simultaneously compares all stored keys against a search key
• Energy- and bandwidth-efficient on an important subset of data-intensive applications
CAM Array
Stored keys
Search Key
Matchlines
Searchlines
Number of matches
Highest priority match
Mustafa Shihab: 02/28/2014
Current Challenges With CAMs
• Commercial uses of CAMs are limited – Highly associative caches, TLBs
– Microarchitectural queues
– Networking routers
CMOS-based CAMs are large, costly, and power-hungry
[Goel and Gupta, 2010]
Mustafa Shihab: 02/28/2014
Resistive CAMs • Resistive memories (e.g., PCM and STT-MRAM) offer high density
and very low leakage power
• Previously proposed PCM-based TCAM accelerator [MICRO’11]
– A gigabyte, DDR3-compatible DIMM
– TCAM caters to a wide range of search-intensive applications
Processor
Search key Number of matches
Highest priority match
+ Optimized for density
and search throughput
− limited functionality
Mustafa Shihab: 02/28/2014
Associative Computing Paradigm • Broadens the use of CAMs to a more general programming
framework
• Data organized by key-value pairs
– Linked list, array, stack, queue
– Matrix, tree, graph
L. Potter, “ASC: an associative-computing paradigm”, 1994
a b
c d
Mustafa Shihab: 02/28/2014
AC-DIMM • AC-DIMM combines associative lookup and processing in
memory
Processor
Microcontroller
runs user-defined kernels
STT-MRAM Array
enhanced endurance
lower write energy
shorter access latency
Key-value co-location
<keys, ops>
results
Mustafa Shihab: 02/28/2014
System Interface • AC-DIMM is a DDR3 compatible module
Processor
AC-DIMM
DDR3 bus Chips
DDR3 bus DRAM DIMMs
Mustafa Shihab: 02/28/2014
Programming Model • Program accesses AC-DIMM via a user-level library
Array
μCode
μCtrl
Key
Search Search
Search Search
Processor
Mustafa Shihab: 02/28/2014
Programming Model • Program accesses AC-DIMM via a user-level library
Array
μCode
μCtrl
Result Processor
Mustafa Shihab: 02/28/2014
Array Organization • Memory row can be searched, read, and written
• Co-locate key-value pairs in the same row
Search
Re
ad
/ W
rite
Mustafa Shihab: 02/28/2014
Bit-Serial Search • Progressively searches column-by-column across the array
• Improves power efficiency and simplifies cell structure
Value Key
Example: search for 011
1
1
1
Mustafa Shihab: 02/28/2014
Bit-Serial Search • Progressively searches column-by-column across the array
• Improves power efficiency and simplifies cell structure
Value Key
Example: search for 011
0
1
0
0
Mustafa Shihab: 02/28/2014
Bit-Serial Search • Progressively searches column-by-column across the array
• Improves power efficiency and simplifies cell structure
Value Key
Example: search for 011
0
1
0
1
Mustafa Shihab: 02/28/2014
Bit-Serial Search • Progressively searches column-by-column across the array
• Improves power efficiency and simplifies cell structure
Value Key
Example: search for 011
0
1
0
1
Match
Mustafa Shihab: 02/28/2014
Microcontroller • Microcontroller runs user-defined kernel on the matching rows
4 arrays share a μController
A total of 64 μControllers on a
256Mb chip (4% area)
Reduction tree
Mustafa Shihab: 02/28/2014
AC-DIMM Cell Structure -- 2T1R CAM Cell Using STT-MRAM
• Data is stored in a magnetic tunnel junction (MTJ)
MTJ
Bitlines act as read and write ports
Matchline acts as a search port Mustafa Shihab: 02/28/2014
Reading • Stored data is read by bitline sense amps
VDD
IREAD
Bitline sense amplifier
Mustafa Shihab: 02/28/2014
Writing • Programming an MTJ requires a bi-directional write current
ISET
IRESET
Mustafa Shihab: 02/28/2014
Writing
• Resetting an AC-DIMM cell
• Setting an AC-DIMM cell
VDD VDD
VDD
Mustafa Shihab: 02/28/2014
Searching • Accomplished by reading a column of bits, and
comparing against the search key
• Outputs a 1 on a match, a 0 otherwise
VDD
ISEARCH Matchline sense amplifier
XNOR
Mustafa Shihab: 02/28/2014
Experimental Setup • System configuration
– Processor: 8 cores, 4GHz
– Memory bus: DDR3-1066
• Simulation tools – Cadence (Spectre), Encounter RTL Compiler with FreePDK
– SESC simulator
• Applications – NuMineBench
– MiBench
– Phoenix
– SPEC INT 2000
Mustafa Shihab: 02/28/2014
System Performance
• AC-DIMM outperforms the previous TCAM-DIMM when search key is short (<32 bits)
AC-DIMM
TCAM-DIMM
Mustafa Shihab: 02/28/2014
System Performance AC-DIMM only
• AC-DIMM caters to a broader range of applications
Mustafa Shihab: 02/28/2014
System Energy AC-DIMM only
• Dynamic energy saved by eliminating data movement • Leakage energy saved by reducing execution time
AC-DIMM
TCAM-DIMM
Mustafa Shihab: 02/28/2014
Summary
• AC-DIMM is an STT-MRAM based compute engine
– DDR3 compatible module
– Applicable to other RAM-based technologies
– Integrates programmable microcontrollers
– Co-locates key-value pairs
• Improves energy and bandwidth efficiency
– Eliminates unnecessary data movement
– Reduces instruction and address processing overheads
Mustafa Shihab: 02/28/2014