First-place MEMOCODE'14 design contest entry
DESCRIPTION
This is what I presented at the 2014 MEMOCODE conference on Iowa State's winning design contest entry. I led the team.
TRANSCRIPT
A High Performance Systolic Architecture for k-NN Classification
Kevin Townsend, Philip Jones, Joseph Zambreno
Reconfigurable Computing Laboratory
Iowa State University
MEMOCODE’14
Townsend (RCL@ISU) k-NN on FPGA MEMOCODE’14 1 / 11
Outline
1 The Competition
2 Our Approach
3 Hardware Design: Platform, Systolic Array, Processing Element, Dot Product, Sort
4 Results
The Competition
Problem Statement
k Nearest Neighbors
32-dimensional space (vectors of 32 elements)
1,000 (M) test vectors
10,000,000 (N) training vectors
Values are 12 bits
Mahalanobis distance $\sqrt{(x - y)^T S^{-1} (x - y)}$ vs. Euclidean distance $\sqrt{(x - y)^T (x - y)}$, where x is a training vector and y is a testing vector.
Better results for some problems
1024 multiplications vs. 32
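The two distance measures in the problem statement can be compared with a small sketch. This is only an illustration in pure Python: the function names, the 2-dimensional toy data, and the matrix `s_inv` are assumptions made for the example and are not taken from the contest.

```python
import math

def mahalanobis(x, y, s_inv):
    """sqrt((x - y)^T S^-1 (x - y)) for lists x, y and a square matrix s_inv."""
    d = [a - b for a, b in zip(x, y)]
    # S^-1 d costs len(d)**2 multiplications (1024 when len(d) == 32).
    t = [sum(s_inv[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(di * ti for di, ti in zip(d, t)))

def euclidean(x, y):
    """sqrt((x - y)^T (x - y)); only len(x) multiplications."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn(train, y, k, dist):
    """Indices of the k training vectors closest to test vector y."""
    order = sorted(range(len(train)), key=lambda i: dist(train[i], y))
    return order[:k]

# Toy data (hypothetical): three training vectors, one test vector.
train = [[3, 0], [0, 2], [6, 6]]
y = [0, 0]
s_inv = [[1.0, 0.0], [0.0, 4.0]]  # hypothetical S^-1 that weights the second axis more

nearest_euclidean = knn(train, y, 1, euclidean)
nearest_mahalanobis = knn(train, y, 1, lambda a, b: mahalanobis(a, b, s_inv))
```

With this toy data the two measures pick different nearest neighbors, which is the kind of difference the "better results for some problems" remark refers to.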
Our Approach
Optimizations
We choose a brute-force solution: all 10,000,000,000 (M × N) products.
$(x - y)^T S^{-1} (x - y)$ is used without the square root because $\sqrt{\cdot}$ is an increasing function, so the ordering of the distances is unchanged.
$(x - y)^T (S^{-1}x - S^{-1}y)$ reduces the computation from 1024 multiplications to 32 multiplications.
$S^{-1}x$ and $S^{-1}y$ can be calculated ahead of time (only 10,001,000 matrix-vector multiplications).
This results in approximately 1.3 trillion integer operations.
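The rewrite of the squared distance can be checked numerically. The sketch below is an assumption-laden illustration: the helper names and the 2-dimensional integer data are made up, and only the identity itself comes from the slide.

```python
def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def direct(x, y, s_inv):
    """(x - y)^T S^-1 (x - y), with the matrix product done per pair."""
    d = [a - b for a, b in zip(x, y)]
    return sum(di * ti for di, ti in zip(d, matvec(s_inv, d)))

def precomputed(x, sx, y, sy):
    """(x - y)^T (S^-1 x - S^-1 y), with sx = S^-1 x and sy = S^-1 y given."""
    return sum((a - b) * (p - q) for a, b, p, q in zip(x, y, sx, sy))

# Hypothetical toy values.
s_inv = [[2, 1], [1, 3]]
x = [5, 7]
y = [1, 2]

# Done once per vector, not once per (x, y) pair.
sx = matvec(s_inv, x)
sy = matvec(s_inv, y)

d_direct = direct(x, y, s_inv)
d_fast = precomputed(x, sx, y, sy)
```

Both forms give the same value because the matrix product is linear: $S^{-1}(x - y) = S^{-1}x - S^{-1}y$. The per-pair work drops to one multiplication per dimension.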
Our Approach
High level approach
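[Figure: data flow between Host and Coprocessor. trainA, trainB, testA, testB, and ret appear on both sides. Blocks: Mahalanobis Product (shown twice) and k-NN. Sizes shown: 0.6GB, 1.3GB, 64KB, 128KB, 256KB. The timed region is marked by "start time" and "end time".]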
Hardware Design Platform
The Convey Platform
[Figure: 64 kNN PE blocks and Memory Controllers 1 to 8.]
Design a k-NN processing element (PE) with one floating-point multiply-accumulator (MAC).
Duplicate the PE block as many times as possible.
Give each PE access to memory.
Hardware Design Systolic Array
Systolic Arrays
[Figure: k-NN PEs connected in a chain. The signals testA, testB, trainA, trainB, and ret pass from one PE to the next.]
Solves routing problem
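The idea of passing data from neighbor to neighbor, instead of wiring every PE to a shared bus, can be sketched in software. This is a generic illustration of systolic forwarding, not the contest design: the function name and the one-hop-per-cycle timing are assumptions.

```python
def forward_through_chain(stream, num_pes):
    """Shift items along a chain of PEs, one hop per cycle.

    Returns, for each PE, the list of items it observed, in arrival order.
    """
    regs = [None] * num_pes             # one pipeline register per PE
    seen = [[] for _ in range(num_pes)]
    # Extra empty cycles at the end let the last items drain through the chain.
    for item in list(stream) + [None] * num_pes:
        regs = [item] + regs[:-1]       # each PE hands its item to its neighbor
        for pe, held in enumerate(regs):
            if held is not None:
                seen[pe].append(held)
    return seen

observed = forward_through_chain(["t0", "t1", "t2"], num_pes=2)
```

Every PE eventually sees every item, delayed by its position in the chain, while each connection only spans two adjacent PEs.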
Hardware Design Processing Element
Single Processing Element
[Figure: block diagram of one kNN PE.]
Ports: Data in, Opcode in, Index in, Data out, Opcode out, Index out (data path marked /192)
Buffer: ≈ 1536 registers
Test Cache: 660 registers, 560 LUTs
Train Buffer: ≈ 1536 registers, ≈ 768 LUTs
Product: 8704 registers, 6806 LUTs, 20 DSPs
Sort: 316 registers, 388 LUTs, 7 block RAMs
Hardware Design Dot Product
Dot Product Pipeline
31, 12-bit subtracters
31, 24-bit subtracters
32, 13x25-bit multipliers
31, 45-bit adder tree
≈ 128 integer operators
150 MHz, 128 processing elements
2.4 trillion operations per second (128 operators × 128 processing elements × 150 MHz)
[Figure: pipeline with inputs testA, testB, trainA, trainB. Stages: Vector Subtracter, Vector Subtracter, Vector Multiplier, Adder Tree. Output: product.]
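The stages named on this slide can be modeled in software as follows. This is a functional sketch only (no clocking or bit-width enforcement). It assumes that the A inputs are the raw 12-bit vectors and the B inputs are the precomputed $S^{-1}$ products, which the slides suggest but do not state; the function names are illustrative.

```python
def vector_subtract(a, b):
    """Element-wise a - b."""
    return [p - q for p, q in zip(a, b)]

def vector_multiply(a, b):
    """Element-wise a * b."""
    return [p * q for p, q in zip(a, b)]

def adder_tree(values):
    """Pairwise reduction of a non-empty list, as a balanced adder tree would do it.

    Returns (total, number_of_adders_used).
    """
    level = list(values)
    adders = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])       # odd element passes through unchanged
        level = nxt
    return level[0], adders

def product(test_a, test_b, train_a, train_b):
    diff_a = vector_subtract(train_a, test_a)   # 12-bit inputs give 13-bit differences
    diff_b = vector_subtract(train_b, test_b)   # 24-bit inputs give 25-bit differences
    total, _ = adder_tree(vector_multiply(diff_a, diff_b))
    return total

# Hypothetical 2-element example.
result = product(test_a=[1, 2], test_b=[4, 7], train_a=[5, 7], train_b=[17, 26])
total_32, adders_32 = adder_tree(range(32))
```

For 32 inputs the tree uses 16 + 8 + 4 + 2 + 1 = 31 adders, and the 13-bit and 25-bit differences explain the 13x25-bit multiplier width.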
Hardware Design Sort
Sort
[Figure (animation): a Counter, a Bouncer holding bounds B0 to B3, an Inserter, and a RAM holding values V0 to V3, with an output "out". Initially the RAM holds 7, 19, 42, 68 and B1 = 100. A new product, 13, enters and is inserted after 7; 19, 42, and 68 each shift down one position. At the end the RAM holds 7, 13, 19, 42 and B1 = 68.]
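The behavior shown in the animation can be mimicked with a short software model. It is a sketch under one assumption that the slides do not spell out: the bound is the largest value still kept, and the RAM holds the remaining smaller values in sorted order. The function name `offer` is made up for the example.

```python
import bisect

def offer(ram, bound, value):
    """Offer one new distance to the candidate list.

    ram   : sorted list of kept values, excluding the largest kept one
    bound : the largest kept value; anything not below it is rejected
    Returns the new (ram, bound).
    """
    if value >= bound:
        return ram, bound               # rejected, nothing changes
    updated = list(ram)
    bisect.insort(updated, value)       # sorted insert, later entries shift down
    new_bound = updated.pop()           # the largest entry becomes the new bound
    return updated, new_bound

ram, bound = offer([7, 19, 42, 68], 100, 13)
```

Starting from 7, 19, 42, 68 with a bound of 100, offering 13 gives 7, 13, 19, 42 with a bound of 68, matching the final frame. A value at or above the bound leaves the state untouched, so most of the stream never reaches the insert step.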
Results
Results
1.3 trillion integer operations / 2.4 trillion integer operations per second = 0.54 seconds.
Actual runtime is 0.54 seconds.
Paper at: http://www.rcl.ece.iastate.edu/sites/default/files/papers/TowJon14A.pdf
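The runtime estimate can be reproduced from the figures given in the slides: M × N products, about 128 integer operators per product, and 128 processing elements at 150 MHz. The sketch assumes each processing element completes one full product per clock cycle, which is an idealization; the variable names are illustrative.

```python
M = 1_000                  # test vectors
N = 10_000_000             # training vectors
OPS_PER_PRODUCT = 128      # approximate integer operators in one dot product pipeline
NUM_PES = 128              # processing elements
CLOCK_HZ = 150_000_000     # 150 MHz

total_ops = M * N * OPS_PER_PRODUCT                  # about 1.28e12
ops_per_second = OPS_PER_PRODUCT * NUM_PES * CLOCK_HZ  # about 2.46e12
estimated_seconds = total_ops / ops_per_second

# The same estimate using the rounded figures quoted on the slide.
rounded_seconds = 1.3 / 2.4
```

With unrounded inputs the estimate is about 0.52 seconds; the rounded figures give about 0.54 seconds.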