IN-MEMORY ASSOCIATIVE COMPUTING
TRANSCRIPT
![Page 1: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/1.jpg)
IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI [email protected]
![Page 2: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/2.jpg)
AGENDA
• The AI computational challenge
• Introduction to associative computing
• Examples
• An NLP use case
• What's next?
![Page 3: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/3.jpg)
THE CHALLENGE IN AI COMPUTING
| AI Requirement | Use Case Example |
| --- | --- |
| 32-bit FP | Neural network learning |
| Multi-precision | Neural network inference, data mining, etc. |
| Scaling | Data center |
| Sort-search | Top-K, recommendation, speech, image/video classification |
| Heavy computation | Non-linearity, Softmax, exponent, normalize |
| Bandwidth | Required for speed and power |
![Page 4: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/4.jpg)
CURRENT SOLUTION
Diagram: a general-purpose CPU (tens of cores) and a GPU (thousands of cores) attached to DRAM over a very wide bus; the question goes to the processors and the answer comes back.

• Bottleneck when register file data needs to be replaced on a regular basis
  ➢ Limits performance
  ➢ Increases power consumption
• Does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that require functions like Top-K and Softmax.
![Page 5: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/5.jpg)
GPU VS CPU VS FPGA
![Page 6: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/6.jpg)
GPU VS CPU VS FPGA VS APU
![Page 7: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/7.jpg)
GSI’S SOLUTION APU—ASSOCIATIVE PROCESSING UNIT
• Computes in-place directly in the memory array—removes the I/O bottleneck
• Significantly increases performance
• Reduces power

Diagram: a simple CPU connected over a simple, narrow bus to the APU (associative memory) with millions of processors; the question goes in and the answer comes out.
![Page 8: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/8.jpg)
IN-MEMORY COMPUTING CONCEPT
![Page 9: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/9.jpg)
THE COMPUTING MODEL FOR THE PAST 80 YEARS
Diagram: an address (100011001001) feeds the address decoder of a memory array; sense amps/IO drivers move data to a separate ALU, which reads the operands (0101, 1000), computes, and writes the result (0010) back. Every operation crosses the memory-to-ALU path.
![Page 10: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/10.jpg)
THE CHANGE—IN-MEMORY COMPUTING
• Patented in-memory logic using only Read/Write operations
• Any logic/arithmetic function can be generated internally
Diagram: a simple controller issues Read/Read/Write sequences on the array. Two reads drive a NOR on the bit lines, and the write stores the result back in place: reading 0101 and 1000 NOR-writes 0010, with no data leaving the memory.
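Because NOR is functionally complete, the Read/Read/Write-NOR primitive above suffices to build any logic or arithmetic function. A minimal Python model of that idea (illustrative only, not the hardware interface):

```python
def nor(a, b):
    """Bitwise NOR over 4-bit words: the only primitive the array needs."""
    return ~(a | b) & 0xF

# Every other gate composed from NOR alone:
def not_(a):    return nor(a, a)
def or_(a, b):  return not_(nor(a, b))
def and_(a, b): return nor(not_(a), not_(b))
def xor_(a, b): return and_(or_(a, b), not_(and_(a, b)))

# The slide's example: reading 0101 and 1000, NOR-writing the result
print(format(nor(0b0101, 0b1000), '04b'))  # -> 0010
```

In the APU the same composition happens as a sequence of read cycles (which NOR the selected rows onto the bit line) and write cycles, so gate depth translates directly into cycle count.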
![Page 11: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/11.jpg)
CAM/ASSOCIATIVE SEARCH

Diagram: records 1101, 1000, 1001, 1001, 0110, 0010, 0111, 0110 stored as values; KEY: search 0110 (1 = match).
Duplicate the values with their inverse data. Duplicate the key with its inverse, and place the original key next to the inverse data. A 1 in the combined key goes to the read enable (RE) of its row, so a bit line reads a match flag only for records equal to the key.
![Page 12: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/12.jpg)
TCAM SEARCH WITH STANDARD MEMORY CELLS

Diagram: ternary patterns stored in standard memory cells, e.g. 0110, 010x, 11xx, 0110 (x = don't care), matched against a search key 0110.
![Page 13: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/13.jpg)
TCAM SEARCH WITH STANDARD MEMORY CELLS

Diagram: records 0101, 0000, 1001, 1001, 0110, 0010, 0110 stored as values; KEY: search 0110 (1 = match).
Duplicate the data, inverting only the bits that are not don't-cares, and insert zero in place of each don't-care. Duplicate the key with its inverse, and place the original key next to the inverse data. A 1 in the combined key goes to the read enable (RE), so a don't-care position can never raise a mismatch.
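The duplicated-inverse scheme can be modeled in a few lines of Python: store each record as data plus inverse data (with don't-cares zeroed in both halves), enable cells with the combined key, and flag a mismatch if any enabled cell reads 1. This is a software sketch of the bit-line behavior, not GSI code:

```python
def make_row(pattern):
    """Encode a pattern over '0'/'1'/'x' as (data, inverse-data) halves.
    A don't-care ('x') stores 0 in BOTH halves, so it never mismatches."""
    data = [1 if c == '1' else 0 for c in pattern]
    inv = [1 if c == '0' else 0 for c in pattern]   # 'x' -> 0 here too
    return data, inv

def tcam_match(pattern, key):
    """Bit-line model: the original key enables the inverse-data cells and
    the inverted key enables the data cells; any enabled 1 flags a mismatch."""
    data, inv = make_row(pattern)
    kbits = [int(c) for c in key]
    mismatch = any((k and i) or ((1 - k) and d)
                   for k, d, i in zip(kbits, data, inv))
    return not mismatch

records = ["0101", "0000", "1001", "0110", "01x0", "0x10"]
print([r for r in records if tcam_match(r, "0110")])  # -> ['0110', '01x0', '0x10']
```

Exact binary CAM search is the special case with no 'x' positions; the same enable-and-read trick handles both with standard memory cells.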
![Page 14: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/14.jpg)
COMPUTING IN THE BIT LINES
Vector A
Vector B
Each bit line becomes a processor and storage; millions of bit lines = millions of processors
C=f(A,B)
![Page 15: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/15.jpg)
NEIGHBORHOOD COMPUTING
Parallel shift of bit lines between sections @ 1 cycle
Enables neighborhood operations such as convolutions
C=f(A,SL(B,1))
Shift vector
![Page 16: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/16.jpg)
SEARCH & COUNT
Diagram: searching for 20 across bit lines holding values such as 5, 3, 20, 54, 20, 8, 20, 17; the three matching lines each raise a flag (1 1 1), giving Count = 3.

• Search (binary or ternary) all bit lines in 1 cycle: 128M bit lines => 128 Peta-searches/sec
• Key applications of search and count for predictive analytics:
  • Recommender systems
  • K-nearest neighbors (using cosine similarity search)
  • Random forest
  • Image histogram
  • Regular expressions
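The search-and-count step can be sketched with numpy as a stand-in for the parallel bit-line compare (the stored values follow the slide's example; their exact arrangement is assumed):

```python
import numpy as np

# Values held one-per-bit-line; the whole comparison is a single
# parallel step in hardware, modeled here by a vectorized compare.
store = np.array([5, 3, 20, 54, 20, 8, 20, 17])
match = (store == 20)            # per-line match flags
count = int(match.sum())         # responders: Count = 3
print(match.astype(int).tolist(), count)  # -> [0, 0, 1, 0, 1, 0, 1, 0] 3
```

In the APU the count comes from a responder-count circuit over the match lines rather than a software reduction, but the data flow is the same.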
![Page 17: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/17.jpg)
DATABASE SEARCH AND UPDATE
Content-based search: a record can be placed anywhere
Update, modify, insert, delete are immediate

Exact Match: CAM/TCAM
Similarity Match: In-Place Aggregate
![Page 18: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/18.jpg)
TRADITIONAL STORAGE CAN DO MUCH MORE
A standard memory cell stores 1 bit. With in-memory logic, the same standard cells can instead provide: 2 bits, a 2-input NOR, or 1 TCAM cell; 3 bits, a 3-input NOR, a 2-input NOR + 1 output, or a 4-state CAM; and so on.
![Page 19: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/19.jpg)
CPU/GPGPU VS APU
![Page 20: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/20.jpg)
ARCHITECTURE
![Page 21: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/21.jpg)
SECTION COMPUTING TO IMPROVE PERFORMANCE
Diagram: the memory array is divided into MLB sections of 24 rows each (MLB section 0, connecting mux, MLB section 1, connecting mux, …), driven by a control unit with an instruction buffer.
![Page 22: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/22.jpg)
COMMUNICATION BETWEEN SECTIONS

Store, compute, search and move data anywhere.
Shifts between sections enable neighborhood operations (filters, CNNs, etc.).
![Page 23: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/23.jpg)
APU CHIP LAYOUT
2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 Peta-OPS peak performance
![Page 24: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/24.jpg)
EVALUATION BOARD PERFORMANCE
• Precision: unlimited, from 1 bit to 160 bits or more
• 6.4 TOPS (FP); 8 Peta-OPS for one-bit computing or 16-bit exact search
• Similarity search, Top-K, min, max, Softmax: O(1) complexity in μs for any size of K, compared to ms with current solutions
• In-memory IO of 2 Petabit/sec: > 100X GPGPU/CPU/FPGA
• Sparse matrix multiplication: > 100X GPGPU/CPU/FPGA
![Page 25: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/25.jpg)
APU SERVER
• 64 APU chips, 256-512 GByte DDR
• From 100 TFLOPS up to 128 Peta-OPS, with peak performance of 128 TOPS/W
• O(1) Top-K, min, max
• 32 Petabits/sec internal IO
• < 1K Watts
• > 1000X GPGPUs on average
• Linearly scalable
• Currently a 28nm process, scalable to 7nm or less
• Well suited to advanced memory technology such as non-volatile ReRAM and more
![Page 26: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/26.jpg)
EXAMPLE APPLICATIONS
![Page 27: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/27.jpg)
K-NEAREST NEIGHBORS (K-NN)
Simple example: N = 36, 3 groups, 2 dimensions (D = 2) for X and Y
K = 4; group Green selected as the majority
For actual applications: N = billions, D = tens, K = tens of thousands
![Page 28: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/28.jpg)
K-NN USE CASE IN AN APU
Diagram: items 1..N are stored column-wise (features of item 1, features of item 2, …, features of item N); the query Q is broadcast to all columns, the computing area holds the distances, and an in-place ranking (e.g., 4 1 3 2) feeds the majority calculation.

• Distribute data: 2 ns (to all)
• Compute cosine distances for all N in parallel (≤ 10 μs, assuming D = 50 features)
• K mins at O(1) complexity (≤ 3 μs)

With the database in an APU, computation for all N items is done in ≤ 0.05 ms, independent of K.
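The per-query flow (broadcast Q, cosine similarity for all items in parallel, then select the top K) can be sketched in numpy; `knn_cosine` is an illustrative stand-in for the in-place APU pipeline, where the ranking happens without moving the features:

```python
import numpy as np

def knn_cosine(items, q, k):
    """Cosine-similarity K-NN over all items at once (numpy stands in for
    the bit-line parallelism). items: (N, D) matrix; q: (D,) query."""
    sims = items @ q / (np.linalg.norm(items, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k]          # indices of the k most similar items

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 50))       # N = 1000 items, D = 50 features
q = rng.normal(size=50)
neighbors = knn_cosine(items, q, 4)       # 4 nearest item indices
```

The majority vote over the returned neighbors' labels then yields the classification, as in the N = 36, K = 4 example above.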
![Page 29: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/29.jpg)
LARGE DATABASE EXAMPLE USING APU SERVERS
• Number of items: billions
• Features per item: tens to hundreds
• Latency: ≤ 1 msec
• Throughput: Scales to 1M similarity searches/sec
• k-NN: Top 100,000 nearest neighbors
![Page 30: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/30.jpg)
EXAMPLE K-NN FOR RECOGNITION
Diagram: a feature extractor (neural network) feeds a K-NN classifier running in associative memory. For text, the features come from BOW/word embeddings; for images, from convolution layers.
![Page 31: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/31.jpg)
K-MINS: O(1) ALGORITHM
KMINS(int K, vector C) {
    M := 1; V := 0;
    FOR b = msb to b = lsb:
        D := not(C[b]);
        N := M & D;
        cnt = COUNT(N|V);
        IF cnt > K:
            M := N;
        ELIF cnt < K:
            V := N|V;
        ELSE:        // cnt == K
            V := N|V;
            EXIT;
        ENDIF
    ENDFOR
}

Diagram: candidate bit columns C0, C1, C2, …, processed from MSB to LSB over the N candidates.
![Page 32: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/32.jpg)
Step C[0] (MSB). Values: 110101, 010100, 000101, 000111, 000001, 111011, 000101, 010110, 001100, 011100, 111100, 101101, 000101, 111101, 000101, 000101.
M = 1111111111111111, V = 0000000000000000.
D = not(C[0]) = 0111101111001011
N = M & D = 0111101111001011
N|V = 0111101111001011, cnt = 11 > K, so M := N.
![Page 33: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/33.jpg)
Step C[1]. M = 0111101111001011, V = 0000000000000000.
D = not(C[1]) = 0011101010011011
N = M & D = 0011101010001011
N|V = 0011101010001011, cnt = 8 > K, so M := N.
![Page 34: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/34.jpg)
K-MINS: THE ALGORITHM

Final step: after iterating down to the LSB, V is the final output, marking each of the K minimum values with a 1.
O(1) complexity: the number of iterations depends only on the bit width, not on the number of candidates.
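The pseudocode can be checked against a plain Python model (a software simulation, not APU code), where M marks still-live candidates and V marks rows already confirmed to be among the K smallest:

```python
def k_mins(values, k, bits):
    """Bit-serial K-mins following the slide's pseudocode: scan bit
    positions MSB -> LSB, narrowing candidates by their bit values."""
    M = [1] * len(values)                          # candidate markers
    V = [0] * len(values)                          # confirmed minima
    for b in range(bits - 1, -1, -1):
        D = [1 - ((v >> b) & 1) for v in values]   # D := not(C[b])
        N = [m & d for m, d in zip(M, D)]          # N := M & D
        NV = [n | v for n, v in zip(N, V)]
        cnt = sum(NV)                              # cnt := COUNT(N|V)
        if cnt > k:
            M = N        # too many: keep only candidates with a 0 at bit b
        elif cnt < k:
            V = NV       # candidates with a 0 at bit b are confirmed minima
        else:
            return NV    # exactly k marked: done
    return [m | v for m, v in zip(M, V)]  # ties may leave more than k marked

vals = [0b0101, 0b0011, 0b1001, 0b0001, 0b0111, 0b0010]
print(k_mins(vals, 3, 4))  # -> [0, 1, 0, 1, 0, 1], marking 0011, 0001, 0010
```

Every step (the NOT, AND, OR, and the responder COUNT) is a single parallel operation across all bit lines, which is why the run time depends on the word width alone.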
![Page 35: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/35.jpg)
DENSE (1×N) VECTOR BY SPARSE N×M MATRIX

Input vector: (4, -2, 3, -1).

Sparse matrix:
row 1: 0 0 0 0
row 2: 3 0 0 0
row 3: 0 5 0 0
row 4: 0 9 0 17

APU representation and computing: the nonzeros are stored as triplets with values (3, 5, 9, 17), row (2, 3, 4, 4), column (1, 2, 2, 4).
• Search all columns for row = 2; distribute -2: 2 Cy
• Search all columns for row = 3; distribute 3: 2 Cy
• Search all columns for row = 4; distribute -1: 2 Cy
• Multiply in parallel: 100 Cy, giving products (-6, 15, -9, -17)
• Shift and add all products belonging to the same column, giving the output vector (-6, 6, 0, -17)

Complexity including IO: O(N + log β), where β is the number of nonzero elements in the sparse matrix. N << M in general for recommender systems.
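The triplet flow above maps directly to a few numpy operations. This is a software model of the distribute/multiply/shift-and-add steps, reproducing the slide's numbers:

```python
import numpy as np

# Triplet (COO-style) storage from the slide: one nonzero per "column"
vals = np.array([3, 5, 9, 17])
rows = np.array([2, 3, 4, 4])     # 1-based row index into the input vector
cols = np.array([1, 2, 2, 4])     # 1-based column index of the output
x = np.array([4, -2, 3, -1])      # dense 1xN input vector

mult = x[rows - 1]                # "distribute" input entries to the nonzeros
prod = vals * mult                # multiply all nonzeros in parallel
y = np.zeros(4, dtype=int)
np.add.at(y, cols - 1, prod)      # "shift and add" products sharing a column
print(prod.tolist(), y.tolist())  # -> [-6, 15, -9, -17] [-6, 6, 0, -17]
```

In the APU, each distribute step is an associative search on the row index followed by a broadcast, so only the N input entries and the nonzeros are ever touched.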
![Page 36: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/36.jpg)
SPARSE MATRIX MULTIPLICATION PERFORMANCE ANALYSIS
• G3 circuit matrix: 1.5M × 1.5M sparse matrix
• Roughly 8M nonzero elements
• 20–100 GFLOPS with GPGPU solution
• APU solution provides 64 TFLOPS using the same amount of power as the GPGPU solution above
• > 500x improvement with APU solution
![Page 37: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/37.jpg)
ASSOCIATIVE MEMORY FOR NATURAL LANGUAGE PROCESSING (NLP)
Q&A, dialog, language translation, speech recognition, etc.
Requires learning things from the past, which needs memory. More memory, more accuracy.

e.g., "Dan put the book in his car, … long story here … Mike took Dan's car … long story here … he drove to SF."
Q: Where is the book now? A: Car, SF
![Page 38: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/38.jpg)
END-TO-END MEMORY NETWORKS
End-To-End Memory Networks (Weston et al., NIPS 2015). (a): single hop; (b): 3 hops
![Page 39: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/39.jpg)
Q&A : END TO END NETWORK
![Page 40: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/40.jpg)
REQUIREMENTS FOR AUGMENTED MEMORY
• Cosine similarity search + Top-K
• Compute softmax on the selected columns
• Vertical multiplication of the selected columns and horizontal sum to the output
• Embed input features to the next column, to any other location, or to a location based on content
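Taken together, the requirements above describe one memory-network read hop. A numpy sketch under assumed shapes (`memory_read` is an illustrative name, not a GSI API):

```python
import numpy as np

def memory_read(M, q, k=4):
    """One augmented-memory read hop, per the requirements above:
    cosine similarity + Top-K selection, softmax over the selected
    columns, then a weighted (horizontal) sum of their vectors."""
    sims = M @ q / (np.linalg.norm(M, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]               # Top-K selection
    w = np.exp(sims[top] - sims[top].max())   # softmax on selected only
    w /= w.sum()
    return w @ M[top]                         # weighted sum -> output vector

rng = np.random.default_rng(1)
M = rng.normal(size=(100, 16))   # 100 memory columns, 16-dim embeddings
q = rng.normal(size=16)          # query embedding
out = memory_read(M, q)          # shape (16,)
```

Restricting the softmax and multiplication to the selected columns is what makes the hop cheap in the APU: the unselected columns are never read.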
![Page 41: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/41.jpg)
APU MEMORY FOR NLP

Diagram: a control unit over memory columns 0, 1, 2, …, N-1; column i holds memory vector mi, with input I and output paths.
• Broadcast input I to selected columns
• Compute any function at selected columns
• Generate output
• Generate tags for selection
![Page 42: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/42.jpg)
GSI SOLUTION FOR END TO END
Constant time of 3 µsec per iteration, any memory size.
![Page 43: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/43.jpg)
PROGRAMMING MODEL
![Page 44: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/44.jpg)
PROGRAMMING MODEL
HOST: Application (C++, Python) → Framework (TensorFlow)
Device: Graph execution and task scheduling → APU (Associative Processing Unit hardware)
![Page 45: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/45.jpg)
A TF EXAMPLE: MATMUL
a = tf.placeholder(tf.int32, shape=[3,4])
b = tf.placeholder(tf.int32, shape=[4,6])
c = tf.matmul(a, b) # shape = [3,6]
Graph: a, b -> matmul -> c
![Page 46: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/46.jpg)
A TF EXAMPLE: MATMUL GRAPH PREPARATION
apu device (tf + eigen):

a = tf.placeholder(tf.int32, shape=[3, 4])
b = tf.placeholder(tf.int32, shape=[4, 6])
c = tf.matmul(a, b)

gnl_create_array(a)
gnl_create_array(b)
gnl_create_array(c)

Diagram: a, b, and c are allocated in APU device space, spanning the host, the apuc's L4, L1, and the MMB.
A TF EXAMPLE: MATMUL GVML_SET, GVML_MUL

with tf.Session() as sess:
    result = sess.run(c, feed_dict={
        a: [[ 2, -4,  15, 3],
            [11,  1, -30, 4],
            [-8, 23,  -9, 7]],
        b: [[ 27, -8, -14,  2,  9, -32],
            [ -7, 52,  -6, 21,  0,   4],
            [-81,  1,  20,  6, 19,  -3],
            [-38, 90,   5,  2, 13,  77]]})

Diagram: the host copies a and b into APU device space (apuc L4); the controller DMA-copies L4 to L1 (gnlpd_dma_16b_start(GNLPD_SYS_2_VMR, …)) and runs gnlpd_mat_mul(c, a, b), which issues gvml_set_16(…) to broadcast an element of a (e.g., 2) and gvml_mul_s16(…) to multiply it against a row of b (27 -8 -14 2 9 -32 x 2 = 54 -16 -28 4 18 -64), accumulating into c. The DMA keeps loading data to the apuc while the matmul is being computed in the apuc.
![Page 48: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/48.jpg)
TENSORFLOW ENHANCEMENT: FUSED OPERATIONS
a = tf.placeholder(tf.int32, shape=[3,4])
b = tf.placeholder(tf.int32, shape=[4,6])
c = tf.matmul(a, b) # shape = [3,6]
d = tf.nn.top_k(c, k=2) # shape = [3,2]

Graph: a, b -> matmul -> c -> top_k -> d, collapsed into a single fused(matmul, top_k) node.

• The two operations are computed inside the apuc
• Data stays in L1
• No IO operations between them
• Saves valuable data transfer time and power
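The fused behavior can be sketched with a numpy stand-in (`fused_matmul_top_k` is illustrative, not the device kernel API), using the a and b matrices from the earlier matmul slide:

```python
import numpy as np

def fused_matmul_top_k(a, b, k):
    """Compute c = a @ b and the per-row top-k in one call, so the
    intermediate c never leaves this function ("stays in L1")."""
    c = a @ b                                   # matmul, shape (3, 6)
    idx = np.argsort(-c, axis=1)[:, :k]         # per-row top-k indices
    val = np.take_along_axis(c, idx, axis=1)    # per-row top-k values
    return val, idx

a = np.array([[2, -4, 15, 3], [11, 1, -30, 4], [-8, 23, -9, 7]])
b = np.array([[27, -8, -14, 2, 9, -32],
              [-7, 52, -6, 21, 0, 4],
              [-81, 1, 20, 6, 19, -3],
              [-38, 90, 5, 2, 13, 77]])
vals, idxs = fused_matmul_top_k(a, b, 2)        # shapes (3, 2)
```

On the APU the two stages share the same L1-resident data, whereas an unfused graph would round-trip the full c matrix through IO between the ops.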
![Page 49: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/49.jpg)
A TF EXAMPLE: MATMUL CODE EXAMPLE

Diagram: on the APU device (MMB, L1), the row 27 -8 -14 2 9 -32 is multiplied by the broadcast 2 to give 54 -16 -28 4 18 -64, accumulated into c.

APL_FRAG add_u16(RN_REG x, RN_REG y, RN_REG t_xory, RN_REG t_ci0)
{
    SM_0XFFFF: RL = SB[x];                        // RL[0-15] = x
    SM_0XFFFF: RL |= SB[y];                       // RL[0-15] = x|y
    {
        SM_0XFFFF: SB[t_xory] = RL;               // t_xory[1-15] = x|y
        SM_0XFFFF: RL = SB[x, y];                 // RL[0-15] = x&y
    }
    // Add init state:
    //   0:     RL = co[0]
    //   1..15: RL = x&y
    {
        (SM_0X1111 << 1): SB[t_ci0] = NRL;        // t_ci0[5,9,13] = x&y
        (SM_0X1111 << 1): RL |= SB[t_xory] & NRL; // RL[1] = Cout[1] = x&y | ci(x|y)
                                                  // 5,9,13: RL = Cout0[5,9,13] = x&y | ci(x|y)
        (SM_0X1111 << 4): RL |= SB[t_xory];       // RL[4,8,12] = Cout1[4,8,12] = x&y | 1&(x|y)
    }
    {
        (SM_0X1111 << 2): SB[t_ci0] = NRL;        // Propagate Cin0
        (SM_0X1111 << 2): RL |= SB[t_xory] & NRL; // Propagate Cout0
        (SM_0X1111 << 5): RL |= SB[t_xory] & NRL; // Propagate Cout1
    }
    {
        (SM_0X1111 << 3): SB[t_ci0] = NRL;        // Propagate Cin0
        (SM_0X1111 << 3): RL |= SB[t_xory] & NRL; // Propagate Cout0
        (SM_0X1111 << 6): RL |= SB[t_xory] & NRL; // Propagate Cout1
        (SM_0X0001 << 15): GL = RL;
    }
    {
        (SM_0X1111 << 4): SB[t_ci0] = NRL;        // t_ci0[8,12,16] = Cout0[7,11,15]
        SM_0X0001: SB[t_ci0] = GL;
        (SM_0X1111 << 7): RL |= SB[t_xory] & NRL; // Propagate Cout1
        (SM_0X0001 << 15): GL = RL;
    }
    . . .
![Page 50: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/50.jpg)
FUTURE APPROACH – NON VOLATILE CONCEPT
![Page 51: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/51.jpg)
SOLUTIONS FOR FUTURE DATA CENTERS

Diagram: the memory hierarchy (CPU register file, L1/L2/L3, DRAM, Flash, HDD) alongside the associative options:
• Standard SRAM based (volatile), high endurance: full computing (floating point etc.) that requires read & write
• STT-RAM based (non-volatile), high endurance
• PC-RAM based (non-volatile), mid endurance: machine learning, malware detection, etc., with much more read and much less write
• ReRAM based (non-volatile), low endurance: data search engines (read most of the time)
![Page 52: IN-MEMORY ASSOCIATIVE COMPUTING](https://reader034.vdocuments.us/reader034/viewer/2022042620/62646c66dee0ee3d620a285c/html5/thumbnails/52.jpg)
THANK YOU