Class 09: Content Addressable Memories
Cell Design and Peripheral Circuits
Semiconductor Memory Classification
Read-write memory (RWM)
  Random access: SRAM, DRAM
  Non-random access: FIFO, LIFO, shift register, CAM
Non-volatile read-write memory (NVRWM): EPROM, E2PROM, FLASH
Read-only memory (ROM): mask-programmed, programmable (PROM)
FIFO: First-in-first-out
LIFO: Last-in-first-out (stack)
CAM: Content addressable memory
Memory Architecture: Decoders
[Figure: N words (Word 0 ... Word N-1) of M bits each, one storage cell per bit, with M-bit input-output. Left: each word has its own select signal S0 ... SN-1. Right: a decoder driven by address bits A0 ... AK-1 generates the select signals.]
N words => N select signals: too many select signals.
A decoder reduces the number of select signals: K = log2 N.
Figure annotations: pitch matched; line too long.
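The relation K = log2 N can be checked with a short sketch. The function names `address_bits` and `decode` are illustrative, not from the slides, and the decoder is modeled as a simple one-hot list.

```python
def address_bits(n_words: int) -> int:
    """Number of address bits K needed to select one of n_words words."""
    if n_words < 1:
        raise ValueError("n_words must be positive")
    # (N - 1).bit_length() equals ceil(log2 N) for N >= 1.
    return (n_words - 1).bit_length()


def decode(address: int, n_words: int) -> list[int]:
    """One-hot select signals S0 ... S(N-1) for the given address."""
    if not 0 <= address < n_words:
        raise ValueError("address out of range")
    return [1 if i == address else 0 for i in range(n_words)]
```

For example, a 1024-word memory needs 10 address lines instead of 1024 select lines.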
2D Memory Architecture
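[Figure: array of storage (RAM) cells at the crossings of word lines and bit lines. The address bits A0 ... Aj-1 and Aj ... Ak-1 are split into a row address for the row decoder and a column address for the column decoder. Array dimensions: 2^(k-j) by m·2^j. Sense amplifiers and read/write circuits sit between the array and the column decoder; input/output is m bits.]
Sense amplifiers: amplify the bit line swing.
Column decoder: selects the appropriate word from the memory row.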
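A minimal sketch of the row/column address split follows. It assumes that the low j bits form the column address and the remaining k-j bits form the row address, which is consistent with an array of 2^(k-j) rows by m·2^j columns. The function names are illustrative.

```python
def split_address(address: int, k: int, j: int) -> tuple[int, int]:
    """Split a k-bit address into (row, column).

    Assumption: the low j bits select the column, the high k-j bits the row.
    """
    if not 0 <= address < (1 << k):
        raise ValueError("address does not fit in k bits")
    column = address & ((1 << j) - 1)
    row = address >> j
    return row, column


def array_shape(k: int, j: int, m: int) -> tuple[int, int]:
    """(rows, columns) of the cell array for m-bit words."""
    return 1 << (k - j), m * (1 << j)
```

With k = 10, j = 3, and m = 8, the array is 128 rows by 64 columns, much closer to square than 1024 rows by 8 columns.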
3D Memory Architecture
[Figure: memory divided into blocks, addressed by row address, column address, and block address; input/output (m bits).]
Advantages:
1. Shorter word and/or bit lines
2. Block address activates only 1 block, saving power
Hierarchical Memory Architecture
[Figure: memory blocks addressed by row address, column address, and block address; block selector; global data bus; global amplifier/driver; I/O; control circuitry.]
Advantages:
• Shorter wires within blocks
• Block address activates only 1 block: power management
Read-Write Memories (RAM)
Static (SRAM)
• Data stored as long as supply is applied
• Large (6 transistors per cell)
• Fast
• Differential signal (more reliable)
Dynamic (DRAM)
• Periodic refresh required
• Small (1-3 transistors per cell) but slower
• Single ended (unless a dummy cell is used to generate differential signals)
Associative Memory
What is CAM?
• Content Addressable Memory is a special kind of memory.
• Read operation in traditional memory: the input is the address of the content we are interested in; the output is the content stored at that address.
• In a CAM it is the reverse: the input is associated with something stored in the memory; the output is the location where the associated content is stored.
[Figure: the same four-entry table used as a content addressable memory and as a traditional memory.]
Address  Stored word
0 0      1 0 1 X X
0 1      0 1 1 0 X
1 0      0 1 1 X X
1 1      1 0 0 1 1
Content Addressable Memory: input 0 1 1 0 1, output 0 1.
Traditional Memory: input 0 1, output 0 1 1 0 X.
Types of CAMs
• Binary CAM (BCAM) stores only 0s and 1s.
– Applications: MAC table lookup; Layer 2 security-related VPN segregation.
• Ternary CAM (TCAM) stores 0s, 1s, and don’t cares.
– Applications: cases that need wildcards, such as Layer 3 and 4 classification for QoS and CoS purposes, and IP routing (longest prefix matching).
• Available sizes: 1 Mb, 2 Mb, 4.7 Mb, 9.4 Mb, and 18.8 Mb.
• CAM entries are structured as multiples of 36 bits rather than 32 bits.
CAM: Introduction
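• CAM vs. RAM
[Figure: a RAM read and a CAM search on six 8-bit entries.]
RAM: Address In = 4, Data Out = 10001101.
Address  Data
5        00110111
4        10001101
3        10111101
2        11001011
1        00001101
0        01010101
CAM: Data In = 10001101, Address Out = 3.
Address  Data
5        11000111
4        00011101
3        10001101
2        11001011
1        00001101
0        01010101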
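The difference between the two access styles can be sketched in a few lines, using the values from the CAM vs. RAM figure. The names `ram_read` and `cam_search` are illustrative. The loop in `cam_search` stands in for what the hardware does in parallel across all entries.

```python
# Table contents taken from the CAM vs. RAM figure (address -> stored word).
RAM_TABLE = {5: "00110111", 4: "10001101", 3: "10111101",
             2: "11001011", 1: "00001101", 0: "01010101"}
CAM_TABLE = {5: "11000111", 4: "00011101", 3: "10001101",
             2: "11001011", 1: "00001101", 0: "01010101"}


def ram_read(memory: dict[int, str], address: int) -> str:
    """RAM: address in, data out."""
    return memory[address]


def cam_search(memory: dict[int, str], data: str) -> list[int]:
    """CAM: data in, matching address(es) out (empty list if no match)."""
    return sorted(addr for addr, word in memory.items() if word == data)
```

Here `ram_read(RAM_TABLE, 4)` returns the word stored at address 4, while `cam_search(CAM_TABLE, "10001101")` returns the address that holds that word.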
Memory Hierarchy
The overall goal of using a memory hierarchy is to obtain the highest-possible average access speed while minimizing the total cost of the entire memory system.
Multiprogramming: refers to the existence of many programs in different parts of main memory at the same time.
Main memory
ROM Chip
Memory Address Map
Memory Configuration (case study):
Required: 512 bytes ROM + 512 bytes RAM
Available: 512-byte ROM chips and 128-byte RAM chips
The designer of a computer system must calculate the amount of memory required for the particular application and assign it to either RAM or ROM.
The interconnection between memory and processor is then established from knowledge of the size of memory needed and the type of RAM and ROM chips available.
The addressing of memory can be established by means of a table that specifies the memory address assigned to each chip.
The table, called a memory address map, is a pictorial representation of assigned address space for each chip in the system.
Memory Address Map
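As an illustration of the case study, the sketch below builds an address map for 512 bytes of RAM made from four 128-byte chips plus one 512-byte ROM chip. The layout (RAM first, starting at address 0, then ROM) and the chip names are assumptions chosen for the example; the slides do not reproduce the actual table.

```python
RAM_CHIP_SIZE = 128  # bytes per available RAM chip
ROM_CHIP_SIZE = 512  # bytes per available ROM chip


def build_address_map(ram_bytes: int = 512, rom_bytes: int = 512):
    """Return a list of (chip name, first address, last address).

    Assumed layout: RAM chips first, from address 0, followed by ROM.
    """
    entries = []
    base = 0
    for i in range(ram_bytes // RAM_CHIP_SIZE):
        entries.append((f"RAM {i + 1}", base, base + RAM_CHIP_SIZE - 1))
        base += RAM_CHIP_SIZE
    for i in range(rom_bytes // ROM_CHIP_SIZE):
        entries.append((f"ROM {i + 1}", base, base + ROM_CHIP_SIZE - 1))
        base += ROM_CHIP_SIZE
    return entries


def select_chip(address: int, address_map):
    """Return (chip name, offset inside the chip) for an address."""
    for name, first, last in address_map:
        if first <= address <= last:
            return name, address - first
    raise ValueError("address not mapped")
```

Under this layout the four RAM chips occupy 0x000 to 0x1FF and the ROM occupies 0x200 to 0x3FF, so 10 address bits cover the whole map.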
Associative Memory
The time required to find an item stored in memory can be reduced considerably if stored data can be identified for access by the content of the data itself rather than by an address.
A memory unit accessed by content is called an associative memory or Content Addressable Memory (CAM). This type of memory is accessed simultaneously and in parallel on the basis of data content rather than a specific address or location.
When a word is written in an associative memory, no address is given. The memory is capable of finding an empty unused location to store the word. When a word is to be read from an associative memory, the content of the word or part of the word is specified.
The associative memory is uniquely suited to do parallel searches by data association. Moreover, searches can be done on an entire word or on a specific field within a word. Associative memories are used in applications where the search time is very critical and must be very short.
Hardware Organization
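[Figure: block diagram of an associative memory. The argument register (A) and key register (K) feed the associative memory array and logic (m words, n bits per word); the match register (M) holds the result. Signals: Input, Read, Write, Output.]
Associative memory of m words, n cells per word
[Figure: array of cells C_ij for word i (Word 1 ... Word m) and bit j (Bit 1 ... Bit n). Columns are driven by A_1 ... A_n and K_1 ... K_n; each row sets a match bit M_1 ... M_m.]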
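One Cell of Associative Memory
[Figure: one cell built from an R-S flip-flop F_ij with Input, Read, Write, and Output, plus match logic driven by A_j and K_j whose result goes to M_i.]
Match Logic Circuit
[Figure: match logic for word i. Inputs F_i1, F'_i1, A_1, K_1 through F_in, F'_in, A_n, K_n; output M_i.]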
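The match logic can be sketched in software using the usual textbook formulation: bit j of word i agrees when x_j = A_j F_ij + A_j' F_ij' is 1, and M_i is the AND over all j of (x_j + K_j'), so bits with K_j = 0 are ignored. The function names and the example values below are illustrative and do not come from the slides.

```python
def word_matches(argument: list[int], key: list[int], word: list[int]) -> bool:
    """M_i = AND over j of (x_j OR NOT K_j), where x_j is 1 when A_j == F_ij."""
    return all((a == f) or not k for a, k, f in zip(argument, key, word))


def match_register(argument, key, words) -> list[int]:
    """Match register M: one bit per stored word."""
    return [int(word_matches(argument, key, w)) for w in words]


# Illustrative values: only the three leftmost bits are compared (K = 1 there).
A = [1, 0, 1, 1, 1, 1, 1, 0, 0]
K = [1, 1, 1, 0, 0, 0, 0, 0, 0]
WORDS = [
    [1, 0, 0, 1, 1, 1, 1, 0, 0],  # differs from A in a compared bit
    [1, 0, 1, 0, 0, 0, 0, 0, 1],  # agrees with A in all compared bits
]
```

Setting every key bit to 1 turns the comparison into an exact match on the whole word.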
CAM: Introduction
• Binary CAM Cell
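[Figure: binary CAM cell schematic. Signals: bit lines BL1 and BL1c, word line WL, search lines SL1 and SL1c, match line ML; storage nodes BL1_cell and BL1c_cell; transistors P1, P2, and N1 to N8.]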
CAM: Introduction
• Ternary CAM (TCAM)
[Figure: two TCAM search examples.]
First example. Input keyword: XXX01101.
Address  Stored word
5        00X00111
4        01001101   Match
3        00011101
2        110010X1
1        10101101   Match
0        010X0101
Second example. Input keyword: 10001101.
Address  Stored word
5        XXXXX111
4        XXXX1101   Match
3        XXX11101
2        XX001011
1        X0001101   Match
0        01010101
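The two searches in the TCAM figure can be reproduced with a small sketch. It treats an X on either side (input keyword or stored word) as a don't care; the function names are illustrative, and the loop again stands in for a parallel hardware compare.

```python
def ternary_match(keyword: str, stored: str) -> bool:
    """True when every position agrees or holds an 'X' on either side."""
    if len(keyword) != len(stored):
        raise ValueError("keyword and stored word must have equal length")
    return all(k == s or k == "X" or s == "X" for k, s in zip(keyword, stored))


def tcam_search(table: dict[int, str], keyword: str) -> list[int]:
    """Addresses of all matching entries, highest address first."""
    hits = (addr for addr, word in table.items() if ternary_match(keyword, word))
    return sorted(hits, reverse=True)


# Tables from the figure (address -> stored word).
TABLE_1 = {5: "00X00111", 4: "01001101", 3: "00011101",
           2: "110010X1", 1: "10101101", 0: "010X0101"}
TABLE_2 = {5: "XXXXX111", 4: "XXXX1101", 3: "XXX11101",
           2: "XX001011", 1: "X0001101", 0: "01010101"}
```

Both searches report addresses 4 and 1, in agreement with the Match markers in the figure.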
CAM: Introduction
• TCAM Cell
– Global masking: SLs
– Local masking: BLs
BL1  BL2  Logic
0    1    0
1    0    1
1    1    X
0    0    N.A.
[Figure: TCAM cell made of two RAM cells and comparison logic. Signals: BL1, BL2, BL1c, BL2c, WL, SL1, SL2, ML.]
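The two-bit encoding in the table above can be written as a lookup, shown here as a small sketch. The helper names are illustrative; the (BL1, BL2) pairs follow the table, and the pair (0, 0) is rejected because the table marks it N.A.

```python
# (BL1, BL2) pair for each ternary symbol, as listed in the table.
ENCODE = {"0": (0, 1), "1": (1, 0), "X": (1, 1)}
DECODE = {bits: symbol for symbol, bits in ENCODE.items()}


def encode_word(word: str) -> list[tuple[int, int]]:
    """Convert a ternary string such as '10X' into (BL1, BL2) pairs."""
    return [ENCODE[symbol] for symbol in word]


def decode_word(cells: list[tuple[int, int]]) -> str:
    """Convert (BL1, BL2) pairs back into a ternary string."""
    symbols = []
    for pair in cells:
        if pair not in DECODE:
            raise ValueError(f"{pair} is not a valid TCAM cell state")
        symbols.append(DECODE[pair])
    return "".join(symbols)
```

Two stored bits per ternary symbol is why a TCAM cell needs two RAM cells.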
CAM: Introduction
• DRAM based TCAM Cell
– Higher bit density
– Slower table update
– Expensive process
– Refreshing circuitry
– Scaling issues (leakage)
[Figure: DRAM based TCAM cell schematic. Signals: BL1, BL2, WL, SL1, SL2, ML; storage nodes BL1_cell and BL2_cell; transistors N3 to N8.]
CAM: Introduction
• SRAM based TCAM Cell
– Standard CMOS process
– Fast table update
– Large area (16T)
[Figure: SRAM based TCAM cell schematic. Signals: BL1, BL1c, BL2, BL2c, WL, SL1, SL2, ML; storage nodes BL1c_cell and BL2c_cell.]
CAM: Introduction
• Block diagram of a 256 x 144 TCAM
[Figure: 256 rows of 144 CAM cells, CAM Cell (0) to CAM Cell (143), with bit lines BL1c and BL2c per column. SL drivers drive the search lines (SLs) SL1(0), SL2(0) ... SL1(143), SL2(143). Each match line (ML) ML0 ... ML255 ends in an ML sense amplifier (MLSA) with output MLSO(0) ... MLSO(255).]
CAM: Introduction
• Why low-power TCAMs?
– Parallel search => very high power
– Larger word size, larger number of entries => high power
– Embedded applications (SoC)
CAM: Design Techniques
• Cell Design: 12T Static TCAM cell*
– ‘0’ is retained by leakage (VWL ~ 200 mV)
– High density
– Leakage (3 orders)
– Noise margin
– Soft errors (node S)
– Unsuitable for READ
CAM: Design Techniques
• Cell Design: NAND vs. NOR Type CAM
– NAND-type: low power, charge-sharing, slow
[Figure: NAND-type CAM, with CAM Cell (N) ... CAM Cell (1), CAM Cell (0) in series on ML_NAND and a sense amplifier (SA); NOR-type CAM, with CAM Cell (N) ... CAM Cell (0) in parallel on ML_NOR and a sense amplifier (SA). Cell signals: BL1, BL1c, WL, SL1, SL1c, VDD.]
CAM: Design Techniques
• MLSA Design: Conventional
– Pre-charge ML to VDD
– Match => VML = VDD
– Mismatch => VML = 0
[Figure: conventional match line sense amplifier. Signals: VDD, PRE, ML, MLSO.]
CAM: Design Techniques
• Low Power: Dual-ML TCAM
– Same speed, 50% less energy (ideally!)
– Parasitic interconnects degrade both speed and energy
– Additional ML increases coupling capacitance
CAM: Design Techniques
• Static Power Reduction
– 16T TCAM: Leakage Paths*
[Figure: 16T TCAM cell with leakage paths marked for the stored ‘0’ and ‘1’ values. Signals: WL, BL1, BL1c, BL2, BL2c, SL1, SL2, ML; storage nodes BL1c_cell and BL2c_cell; transistors P1 to P4 and N1 to N12.]
* N. Mohan, M. Sachdev, Proc. IEEE CCECE, pp. 711-714, May 2-5, 2004
CAM: Design Techniques
• Static Power Reduction
– Side Effects of VDD Reduction in TCAM Cells
– Speed: No change
– Dynamic power: No change
– Robustness – VDD Volt. Margin (Current-race sensing)
[Figure: waveforms of ML [0], ML [1], and MLSO [0], annotated with the voltage margin.]
CAM for Routing Table Implementation
• CAM can be used as a search engine.
• We want to find matching contents in a database or table.
• Example: routing table
Source: http://pagiamtzis.com/cam/camintro.html
Simplified CAM Block Diagram
• The input to the system is the search word.
• The search word is broadcast on the search lines.
• Each match line indicates whether the search word matches the stored word.
• An encoder specifies the match location.
• If there are multiple matches, a priority encoder selects the first match.
• A hit signal flags the case in which there is no match.
• The search word is long, ranging from 36 to 144 bits.
• Table size ranges from a few hundred entries to 32K.
• Address space: 7 to 15 bits.
Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006
CAM Memory Size
• Largest available around 18 Mbit (single chip).
• Rule of thumb: the largest CAM chip is about half the size of the largest available SRAM chip. A typical CAM cell consists of two SRAM cells.
• Exponential growth rate in size.
Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006
CAM Basics
• The search-data word is loaded into the search-data register.
• All match lines are pre-charged high (temporary match state).
• Search line drivers broadcast the search word onto the differential search lines.
• Each CAM core cell compares its stored bit against the bit on the corresponding search lines.
• Match lines of words that have at least one mismatching bit discharge to ground.
Source: K. Pagiamtzis, A. Sheikholeslami, “Content-Addressable Memory (CAM) Circuits and Architectures: A Tutorial and Survey,” IEEE J. of Solid-state circuits. March 2006
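The search sequence above can be modeled behaviorally. This is only a software sketch of a binary CAM: the nested loops replace what the hardware does on all rows and bits at once, and the name `search` is illustrative.

```python
def search(stored_words: list[str], search_word: str) -> list[int]:
    """Return the match line values (1 = match) after one search."""
    # All match lines start pre-charged high (temporary match state).
    match_lines = [1] * len(stored_words)
    for row, word in enumerate(stored_words):
        # Each cell compares its stored bit with the broadcast search bit.
        for stored_bit, search_bit in zip(word, search_word):
            if stored_bit != search_bit:
                # A single mismatching bit discharges the whole match line.
                match_lines[row] = 0
                break
    return match_lines
```

Only rows whose every bit agrees with the search word keep their match line high.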
CAM Advantages
• They associate the input (comparand) with their memory contents in one clock cycle.
• They are configurable in multiple formats of width and depth of search data, which allows searches to be conducted in parallel.
• CAMs can be cascaded to increase the size of the lookup tables they can store.
• New entries can be added to their tables, so they can learn what they did not know before.
• They are one of the appropriate solutions for higher speeds.
CAM Disadvantages
• They cost several hundred dollars per CAM, even in large quantities.
• They occupy a relatively large footprint on a card.
• They consume excessive power.
• Generic system engineering problems:
– Interface with the network processor.
– Simultaneous table update and lookup requests.
CAM structure
• The comparand bus is 72 bits wide and bidirectional.
• The result bus is an output.
• The command bus enables instructions to be loaded into the CAM.
• It has 8 configurable banks of memory.
• The NPU issues a command to the CAM.
• The CAM then performs an exact match or uses wildcard characters to extract relevant information.
• There are two sets of mask registers inside the CAM.
[Figure: CAM block diagram. Blocks: CAM control; global mask registers; 72 bits x 131072 CAM (72 bits x 16K x 8 structures), mixable with 72 bits x 16384, 144 bits x 8192, 288 bits x 4096, 576 bits x 2048; empty bit; priority encoder; flag control; output port control; control & status registers; I/O port control; decoder; pipeline execution control (command bus).]
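The bank configurations listed in the diagram all describe the same number of storage bits per bank, which a short calculation confirms. The constant and function names are illustrative.

```python
BANK_BITS = 72 * 16384          # storage bits in one bank
BANKS = 8                       # configurable banks in the device
WIDTHS = (72, 144, 288, 576)    # entry widths listed in the diagram


def entries_per_bank(width_bits: int) -> int:
    """Entries that fit in one bank for a given entry width."""
    if width_bits not in WIDTHS:
        raise ValueError("unsupported entry width")
    return BANK_BITS // width_bits
```

Doubling the entry width halves the depth of a bank, and eight banks configured at 72 bits give the 131072 entries shown in the diagram.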
CAM structure
There are global mask registers, which can remove specific bits, and a mask register that is present in each memory location.
The search result can be a single output (highest priority) or a burst of successive results.
The output port is 24 bytes wide.
Flag and control signals specify status of the banks of the memory.
They also enable us to cascade multiple chips.
CAM Features
• CAM Cascading:
– Up to 8 devices can be cascaded without incurring a performance penalty in search time (72 bits x 512K).
– Up to 32 devices can be cascaded with performance degradation (72 bits x 2M).
• Terminology:
– Initializing the CAM: writing the table into the memory.
– Learning: updating specific table entries.
– Writing a search key to the CAM: search operation.
• Handling wider keys:
– Most CAMs support 72-bit keys.
– They can support wider keys in native hardware.
• Shorter keys: can be handled at the system level more efficiently.
CAM Latency
• Clock rate is between 66 and 133 MHz.
• The clock speed determines maximum search capacity.
• Factors affecting the search performance:
– Key size
– Table size
• For the system designer, the total latency to retrieve data from the SRAM connected to the CAM is important.
• Pipelining and multi-threading techniques for resource allocation can ease the CAM speed requirements.
Source: IDT
Management of Tables Inside a CAM
• It is important to squeeze as much information as we can into a CAM.
• Example from Netlogic application notes:
– We want to store 4 tables of 32-bit-wide IP destination addresses.
– The CAM is 128 bits wide.
– If we store one address directly in every slot, 96 bits are wasted.
• We can arrange the 32-bit-wide tables next to each other.
– Every 128-bit slot is partitioned into four 32-bit slots.
– These are the 3rd, 2nd, 1st, and 0th tables going from left to right.
– We use the global mask register to access only one of the tables.
MASK 3  00000000 FFFFFFFF FFFFFFFF FFFFFFFF
MASK 2  FFFFFFFF 00000000 FFFFFFFF FFFFFFFF
MASK 1  FFFFFFFF FFFFFFFF 00000000 FFFFFFFF
MASK 0  FFFFFFFF FFFFFFFF FFFFFFFF 00000000
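The table-packing scheme can be sketched with plain integers. The sketch assumes that a 1 bit in the global mask means "ignore this bit", which is inferred from the mask values above (zeros sit over the selected table's field). The names `pack`, `global_mask`, and `search`, and the sample addresses in the usage note, are illustrative.

```python
TABLE_BITS = 32
SLOT_BITS = 128
FULL = (1 << SLOT_BITS) - 1


def pack(t3: int, t2: int, t1: int, t0: int) -> int:
    """Pack four 32-bit entries into one 128-bit slot, table 3 leftmost."""
    return (t3 << 96) | (t2 << 64) | (t1 << 32) | t0


def global_mask(table: int) -> int:
    """Mask with 0s over the selected table's field and 1s elsewhere.

    Assumption: a 1 bit in the mask means the bit is ignored in the compare.
    """
    field = ((1 << TABLE_BITS) - 1) << (table * TABLE_BITS)
    return FULL & ~field


def search(slots: list[int], table: int, address: int) -> list[int]:
    """Indices of slots whose entry in the given table equals address."""
    care = FULL & ~global_mask(table)       # bits that take part in the compare
    comparand = address << (table * TABLE_BITS)
    return [i for i, slot in enumerate(slots) if (slot ^ comparand) & care == 0]
```

For instance, with `slots = [pack(0x0A000001, 0xC0A80001, 0, 0x08080808)]`, calling `search(slots, 2, 0xC0A80001)` finds slot 0, while the same address searched in table 3 finds nothing.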
Example Continued
• We can still use the mask register (not the global mask register) to do maximum prefix length matching.
[Figure: 128-bit CAM entries (bit positions 127 ... 0) compared against the comparand register under the global mask register; the matching entry is marked MATCH FOUND.]