First-place MEMOCODE'14 design contest entry

A High Performance Systolic Architecture for k-NN Classification

Kevin Townsend, Philip Jones, Joseph Zambreno

Reconfigurable Computing Laboratory, Iowa State University

MEMOCODE’14

Outline

1 The Competition

2 Our Approach

3 Hardware Design: Platform, Systolic Array, Processing Element, Dot Product, Sort

4 Results

The Competition

Problem Statement

k Nearest Neighbors

32 Dimensional Space or 32 element length vectors

1,000 (M) test vectors

10,000,000 (N) train vectors

Values are 12 bits

Mahalanobis distance $\sqrt{(x - y)^T S^{-1} (x - y)}$ vs. Euclidean distance $\sqrt{(x - y)^T (x - y)}$, where $x$ is a training vector and $y$ is a testing vector.

Better results for some problems

1024 multiplications vs. 32
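As a small illustration of the two distances, the sketch below evaluates both on a toy 2-dimensional example (the contest uses 32 dimensions). The function names and the toy matrix `s_inv` are assumptions made for illustration, not part of the contest code. The matrix-vector product inside `mahalanobis` is where the roughly d × d multiplications come from.

```python
import math

def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mahalanobis(x, y, s_inv):
    """sqrt((x - y)^T S^-1 (x - y)); the matrix-vector product costs d*d multiplications."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(dot(d, mat_vec(s_inv, d)))

def euclidean(x, y):
    """sqrt((x - y)^T (x - y)); only d multiplications."""
    d = [a - b for a, b in zip(x, y)]
    return math.sqrt(dot(d, d))

# Toy data (hypothetical): a diagonal S^-1 that weights the two axes differently.
s_inv = [[2.0, 0.0], [0.0, 0.5]]
x = [3.0, 4.0]
y = [1.0, 2.0]

print(euclidean(x, y))           # square root of 8
print(mahalanobis(x, y, s_inv))  # square root of 2*4 + 0.5*4 = 10
```

With the identity matrix as `s_inv`, the two functions return the same value.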

Our Approach

Optimizations

We chose a brute-force solution: compute all 10,000,000,000 (M × N) products.

$(x - y)^T S^{-1} (x - y)$ is used instead of its square root because $\sqrt{\cdot}$ is an increasing function, so the ordering of neighbors is unchanged.

Rewriting it as $(x - y)^T (S^{-1}x - S^{-1}y)$ reduces the computation from 1024 multiplications to 32 multiplications per product.

$S^{-1}x$ and $S^{-1}y$ can be calculated ahead of time (only 10,001,000 matrix-vector multiplications).

This results in approximately 1.3 trillion integer operations.
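The factoring step can be checked numerically. The following is a minimal sketch, assuming small integer toy data and hypothetical helper names (`direct_form`, `factored_form`); it confirms that the factored expression gives the same scores as the direct one while needing only one multiplication per dimension once the transformed vectors are precomputed.

```python
def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def direct_form(x, y, s_inv):
    """(x - y)^T S^-1 (x - y), with a d*d matrix-vector product per pair."""
    d = [a - b for a, b in zip(x, y)]
    sd = mat_vec(s_inv, d)
    return sum(a * b for a, b in zip(d, sd))

def factored_form(x, sx, y, sy):
    """(x - y)^T (S^-1 x - S^-1 y), with sx and sy precomputed: d multiplications."""
    return sum((a - b) * (c - e) for a, b, c, e in zip(x, y, sx, sy))

# Toy data (hypothetical): 2 dimensions instead of 32, 4 train vectors, 1 test vector.
s_inv = [[2, 1], [1, 3]]
train = [[0, 0], [5, 5], [1, 2], [9, 1]]
y = [1, 1]

# Precompute once per vector, not once per pair.
s_train = [mat_vec(s_inv, x) for x in train]
s_y = mat_vec(s_inv, y)

scores = [factored_form(x, sx, y, s_y) for x, sx in zip(train, s_train)]
k = 2
nearest = sorted(range(len(train)), key=lambda i: scores[i])[:k]

print(scores)   # [7, 112, 3, 128]
print(nearest)  # [2, 0]
```

Because no square root is taken, the scores stay integers, and sorting by them yields the same neighbors as sorting by the true distance.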

Our Approach

High level approach

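[Figure: host and coprocessor timelines. The buffers trainA, trainB, testA, and testB appear on both the host and the coprocessor; the coprocessor computes the Mahalanobis product, and ret appears on both sides. The timed interval is marked from start time to end time.]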

Hardware Design Platform

The Convey Platform

[Figure: eight memory controllers, Memory Controller 1 through Memory Controller 8]

Design a k-NN processing element (PE) with one floating-point multiply-accumulator (MAC).

Duplicate the PE block as many times as possible.

Give each PE access to memory.

Hardware Design Systolic Array

Systolic Arrays

[Figure: systolic array of k-NN PEs with signals testA, testB, trainA, trainB, and ret]

Solves the routing problem
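To illustrate why a systolic arrangement keeps wiring local, here is a minimal software sketch of a chain of PEs. It rests on several assumptions that the slides do not state: each PE caches one test vector, train vectors enter at the first PE and move one PE per clock tick, and plain squared Euclidean distance stands in for the Mahalanobis product. The name `run_chain` is hypothetical.

```python
def squared_distance(x, y):
    """Stand-in score; the real design computes the Mahalanobis product."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def run_chain(test_vecs, train_vecs):
    """Stream train vectors through a chain of PEs, one hop per tick."""
    n = len(test_vecs)
    regs = [None] * n    # one pipeline register per PE
    best = [None] * n    # (score, train index) of the nearest vector seen by each PE
    # Trailing bubbles (None) flush the last train vectors through the chain.
    stream = list(enumerate(train_vecs)) + [None] * n
    for item in stream:
        # Shift: every PE takes its left neighbor's previous value.
        regs = [item] + regs[:-1]
        for pe, held in enumerate(regs):
            if held is None:
                continue
            idx, vec = held
            score = (squared_distance(vec, test_vecs[pe]), idx)
            if best[pe] is None or score < best[pe]:
                best[pe] = score
    return [b[1] for b in best]

# Toy data (hypothetical): 2 PEs, 3 train vectors.
test_vecs = [[0, 0], [10, 10]]
train_vecs = [[1, 1], [9, 9], [5, 4]]
result = run_chain(test_vecs, train_vecs)
print(result)  # [0, 1]
```

Each PE reads only from its left neighbor, so no signal has to fan out from the memory interface to every PE at once.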

Hardware Design Processing Element

Single Processing Element

[Figure: block diagram of one kNN PE, built up over several slides. Ports: Data in, Opcode in, Index in, Opcode out, Index out, and a 192-bit Data bus. Internal blocks: Buffer, TrainBuffer, TestCache, Product.]

Resource annotations shown as each block is added:

Buffer: ≈ 1536 registers

TestCache: 660 registers, 560 LUTs

TrainBuffer: ≈ 1536 registers, ≈ 768 LUTs

Product: 8704 registers, 6806 LUTs, 20 DSPs

Final build: 7 BlockRAMs

Hardware Design Dot Product

Dot Product Pipeline

31, 12-bit subtracters

32, 13x25-bit multipliers

31, 45-bit adder tree

≈ 128 integer operators

150 MHz, 128 processing elements

2.4 trillion operations per second

[Figure: dot product pipeline with inputs trainA and trainB]
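The pipeline stages listed above can be modeled functionally as subtract, multiply, then a pairwise adder tree. This is a sketch under assumptions: the values are taken to be unsigned 12-bit, the second operand pair (`sx`, `sy`) is taken to be the precomputed transformed vectors, and all function names are hypothetical. It is a behavioral model, not a description of the actual hardware.

```python
def subtract_stage(a, b):
    """Element-wise subtraction, one subtracter per lane."""
    return [p - q for p, q in zip(a, b)]

def multiply_stage(u, v):
    """Element-wise multiplication, one multiplier per lane."""
    return [p * q for p, q in zip(u, v)]

def adder_tree(values):
    """Pairwise reduction; returns the sum and the number of adders used."""
    level = list(values)
    adders = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        adders += len(nxt)
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through to the next level
        level = nxt
    return level[0], adders

def pipeline(x, sx, y, sy):
    diff = subtract_stage(x, y)
    sdiff = subtract_stage(sx, sy)
    return adder_tree(multiply_stage(diff, sdiff))

# The difference of two unsigned 12-bit values fits in 13 signed bits,
# which is consistent with a 13-bit multiplier input (assumption: unsigned inputs).
assert all(-(1 << 12) <= e < (1 << 12) for e in (0 - 4095, 4095 - 0))

# Toy 32-lane input (hypothetical values).
x = list(range(32))
y = [0] * 32
sx = [2] * 32
sy = [1] * 32
total, adders = pipeline(x, sx, y, sy)
print(total, adders)  # 496 31
```

A 32-input tree needs 16 + 8 + 4 + 2 + 1 = 31 adders, matching the adder count on the slide.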

Hardware Design Sort

[Figure: animation of the sort stage, showing a Counter, an incoming product, and a Bouncer labeled B1=100]

Results

1.3 trillion integer operations / 2.4 trillion integer operations per second = 0.54 seconds.

Actual runtime is 0.54 seconds.
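The estimate can be reproduced from the figures quoted in the slides. The sketch below assumes that each PE completes one product (about 128 integer operations) per clock cycle; the constant names are hypothetical.

```python
M = 1_000                  # test vectors
N = 10_000_000             # train vectors
OPS_PER_PRODUCT = 128      # approximate integer operators per dot product pipeline
CLOCK_HZ = 150_000_000     # 150 MHz
NUM_PES = 128              # processing elements

total_ops = M * N * OPS_PER_PRODUCT                    # about 1.3 trillion
ops_per_second = CLOCK_HZ * NUM_PES * OPS_PER_PRODUCT  # about 2.4 trillion
estimate = total_ops / ops_per_second

print(total_ops)       # 1280000000000
print(ops_per_second)  # 2457600000000
print(round(estimate, 2))
```

With unrounded inputs the estimate comes to roughly 0.52 seconds; dividing the rounded figures 1.3 by 2.4 gives the 0.54 quoted on the slide.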

Paper at: http://www.rcl.ece.iastate.edu/sites/default/files/papers/TowJon14A.pdf
