Unconventional Architectures with Reconfigurable Computing

Michaela Blott, Principal Engineer, Xilinx Research

© Copyright 2015 Xilinx.

Agenda

Introductions

Industry landscape

Platform characterization & performance estimation

Some unconventional examples

Summary & future plans


Introductions – FPGAs, Xilinx, Xilinx research and the labs in Dublin

What are FPGAs?

Customizable, Programmable Hardware Architectures

Great vehicle for implementation of unconventional architectures


Customized hardware

Customized Hardware Architectures

Task: X = (a + b) * 256;

General Purpose Processor: 6 cycles @ 2GHz

R0 = Load a;
R1 = Load b;
R2 = Load 256;
R1 = Add(R1,R0);
R2 = Mult(R1,R2);
Mem[0x0003] = Store(R2);

Customized hardware: 1 cycle @ 400MHz

+ << 8

[Figure labels: data parallel, task parallel, 10 cores]

FPGAs promise very high throughput, low latency, and power savings
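The comparison on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this example, and it assumes one instruction per cycle on the processor, as the slide's cycle counts imply. It shows that multiplying by 256 is the same as a left shift by 8 bits, and converts both cycle counts into nanoseconds.

```python
def gpp_result(a, b):
    """Follow the six-step processor sequence: three loads, add, multiply, store."""
    r0 = a            # R0 = Load a
    r1 = b            # R1 = Load b
    r2 = 256          # R2 = Load 256
    r1 = r1 + r0      # R1 = Add(R1, R0)
    r2 = r1 * r2      # R2 = Mult(R1, R2)
    return r2         # value that would be stored to memory


def custom_hw_result(a, b):
    """Adder followed by a fixed shift: multiplying by 256 is a shift by 8 bits."""
    return (a + b) << 8


def latency_ns(cycles, clock_hz):
    """Time in nanoseconds for a number of cycles at a given clock frequency."""
    return cycles / clock_hz * 1e9


if __name__ == "__main__":
    print(gpp_result(3, 4), custom_hw_result(3, 4))
    print(f"processor: {latency_ns(6, 2e9):.2f} ns")
    print(f"custom hardware: {latency_ns(1, 400e6):.2f} ns")
```

Under these assumptions the processor needs about 3 ns and the custom datapath about 2.5 ns for one result, despite the five times lower clock frequency.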


Customizable Interfaces & Memory Architectures

Flexibility to interface to any other device and customize memory architectures

[Figure: FPGA connected to DDRx, QDR SRAM, and Flash devices, with labels "caches" and "QoS"]


Introduction to Xilinx

Fabless semiconductor company

Founded in Silicon Valley in 1984

Today: Approximately 3,500 employees and $2.25B revenue

20,000 Customers

30 years

1st FPGA in 1985: XC2064
– 64 FF
– 128 3-input LUTs
– 58 IOs
– 2um

UltraScale+: VU13P
– 3.4M FF
– 1.7M 6-input LUTs
– 6.3Tbps IO
– 16nm
– + DSPs, ARM, 2.5D…


Xilinx is Diversified Across Multiple Markets


Xilinx Research - Ireland

Applications & Architectures

Through application-driven technology development with customers, partners, and engineering & marketing


Moore’s Law & The Technology Pipeline

Scaling becomes increasingly esoteric


Transistor Cost Trend

Calculation of Cost Per Transistor by Node

Source: IBS

[Figure: bar chart of cost per million gates by node. X axis: 90nm, 65nm, 40nm, 28nm, 20nm, 14nm. Y axis: .0000 to .0700. Data labels as extracted: .0636, .0521, .0362, .0278, .0275, .0267. Annotation: "Wrong Trend"]

Economics become questionable


End of Dennard Scaling

Source: Intel

Power dissipation becomes problematic


Computing: Increasingly Heterogeneous and Integrated

Applications require
– Increasing compute (machine learning, data analytics)
– Increasing storage capacity (photos, videos)
– Lower power (OPEX = 2x CAPEX)

Heterogeneous compute is required to provide further performance scaling and reduce power consumption

Accelerator integration transitions from loosely coupled IO devices, through coherent accelerators (CAPI, QPI, CCIX), to on-chip integration with processors and memory

New decade of application-driven architectures

Diversification with increasingly heterogeneous devices


New Generation of Design Environments (FPGAs) make it easier

Legacy
• ISE, RTL-based design entry with IP library

Embedded CPU integration
• MicroBlaze, SDK, EDK

Raised abstraction for accelerators
• Vivado HLS
• SDNet (DSL PX)
• Block stitching and manual integration in platform in RTL

Simplified host integration & automated infrastructure creation
• SDSoC, SDNet, SDAccel
• Predefined methods for data transfer & automated implementation

Monitoring & profiling infrastructure, Runtime OS, Dynamic and static workload partitioning, Cloud integration

[Figure axes: Time, Abstraction]


The Question

For a given application, which architecture should I build?

For a given architecture, which applications are suitable?

=> Characterization & benchmarking


UC Berkeley’s Rooflines for Hardware Platforms

Peak performance as a function of operational intensity
– PT = min{OI * BW, P}
– Takes into account maximum compute performance and memory bandwidth

[Figure: roofline plot. X axis: operational intensity of an implementation, OPS:Byte (log). Y axis: achievable performance, GOPS/sec (log). The roof rises through the memory-bound region and is flat at maximum performance in the compute-bound region]

Example
– Hardware: P = 100 GOPS/s, BW = 1 GB/s
– Implementation I: OI = 1 OPS/Byte
– Estimated peak performance for I: 1 GOPS/s

Very crude but useful for performance estimates and platform comparison
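The roofline bound is a one-line formula, so a short sketch may help when comparing platforms. The function and parameter names below are illustrative choices for this example, not part of any tool; the numbers are the ones from the slide's example.

```python
def roofline(oi_ops_per_byte, bw_gbytes_per_s, peak_gops):
    """Attainable performance in GOPS/s: min(OI * BW, P)."""
    return min(oi_ops_per_byte * bw_gbytes_per_s, peak_gops)


def ridge_point(bw_gbytes_per_s, peak_gops):
    """Operational intensity (OPS/Byte) at which the memory and compute bounds meet."""
    return peak_gops / bw_gbytes_per_s


if __name__ == "__main__":
    # Slide example: P = 100 GOPS/s, BW = 1 GB/s, OI = 1 OPS/Byte.
    print(roofline(1, 1, 100))        # memory bound
    print(ridge_point(1, 100))        # OI needed to become compute bound
    print(roofline(1000, 1, 100))     # compute bound
```

An implementation to the left of the ridge point is memory bound, so raising its operational intensity (for example by caching or reducing precision) moves it up the slope; to the right of the ridge point only more compute helps.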


Performance Estimation & Tracking

Allows performance estimates & tracking of optimizations

[Figure: roofline plot. X axis: OPS:Byte (log). Y axis: achievable performance, GOPS/sec (log). Marked points: estimated peak performance for an implementation of Application A, and measurements]

Current project: refining rooflines
– Figure labels: SP mult, 8b add
– Average cost for a mix of operations


Platform Characterization

FPGAs
– Highest performance for non-float (fixed point, characters, bit) with operational intensity > 16
– Float/power
– Lowest absolute power

GPU
– Absolute float performance for highly data parallel applications with little control flow

CPU
– Best balance, all-round average performance for all applications, in particular with large memory requirements

[Figure: compute performance [GOPS/sec], non-float vs. float, with application labels CNNs, Genomics, and Video]


Applications under Investigation

Key value stores

Machine learning

Vision Processing

Genomics

Networking

Stencils

Fintech

Synthetic workloads


Key-Value Stores

Common middleware application to alleviate access bottlenecks on databases
– Most popular and most recent database contents are cached in main memory of a tier of server platforms
– Provides the abstraction of an associative memory

OI = [3.65, 300]

Past: Scaling performance using custom dataflow architectures
– Demonstrated 35x performance/power with dataflow architectures on FPGA

[Figure: web servers access the database through a tier of Memcached key-value stores. A standard server (motherboard with x86, DRAM, and a network adapter on PCIe) is shown next to one with an FPGA network adapter on PCIe]


Dataflow architectures to scale performance

[Figure: FPGA pipeline 10G → Request Parser → Hash Table → Value Store → Response Formatter → 10G. Hash table and value store reside in DRAM behind a DRAM controller]

10Gbps demonstrated with a 64b data path @ 156MHz using 3% of FPGA resources

80Gbps can be achieved by using a 512b @ 156MHz pipeline for example
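The two throughput figures follow from data path width times clock frequency. The sketch below is a back-of-the-envelope check with an illustrative function name; it assumes one word is transferred every cycle, and it uses 156.25 MHz, the clock commonly used for 64-bit 10G Ethernet data paths, alongside the rounded 156 MHz from the slide.

```python
def datapath_throughput_gbps(width_bits, clock_mhz):
    """Raw throughput of a data path that moves one word per clock cycle."""
    return width_bits * clock_mhz / 1000.0


if __name__ == "__main__":
    for width in (64, 512):
        for clock in (156.0, 156.25):
            gbps = datapath_throughput_gbps(width, clock)
            print(f"{width}b @ {clock} MHz: {gbps:.3f} Gbps")
```

Widening the pipeline from 64b to 512b scales throughput by 8x at the same clock, which is how the design moves from 10Gbps to 80Gbps without a faster clock.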


Streaming architecture:

Flow-controlled series of processing stages which manipulate and pass through packets and their associated state

Numerous requests are processed back to back exploiting task level parallelism

Source: [4] Blott et al: Achieving 10Gbps line-rate key-value stores with FPGAs; HotCloud 2013
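The streaming structure described above can be mimicked in software with chained generators, where each stage consumes requests from the previous stage and passes them on with added state. This is a toy model only: the request format ("set key value", "get key"), the stage functions, and the response strings are invented for illustration and are neither the memcached protocol nor the FPGA implementation.

```python
def request_parser(packets):
    """Stage 1: turn raw text packets into request records."""
    for packet in packets:
        parts = packet.split(" ", 2)
        value = parts[2] if len(parts) > 2 else None
        yield {"op": parts[0], "key": parts[1], "value": value}


def hash_table(requests, num_buckets=1024):
    """Stage 2: attach the bucket index derived from the key."""
    for req in requests:
        req["bucket"] = hash(req["key"]) % num_buckets
        yield req


def value_store(requests, store):
    """Stage 3: read or write the value in the addressed bucket."""
    for req in requests:
        bucket = store.setdefault(req["bucket"], {})
        if req["op"] == "set":
            bucket[req["key"]] = req["value"]
            req["status"] = "STORED"
        else:
            req["value"] = bucket.get(req["key"])
            req["status"] = "HIT" if req["value"] is not None else "MISS"
        yield req


def response_formatter(requests):
    """Stage 4: produce the response for each request."""
    for req in requests:
        if req["op"] == "get" and req["status"] == "HIT":
            yield f"VALUE {req['key']} {req['value']}"
        else:
            yield req["status"]


def pipeline(packets):
    """Chain the four stages; requests stream through in order."""
    store = {}
    stages = value_store(hash_table(request_parser(packets)), store)
    return response_formatter(stages)


if __name__ == "__main__":
    for response in pipeline(["set a 1", "get a", "get b"]):
        print(response)
```

In hardware all stages work on different requests at the same time; the generator chain only models the data flow and ordering, not that concurrency.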


Last Year: Scaling Capacity

FPGAs enable custom memory architectures in which each storage medium is used to its advantage

Example:
– SSDs combined with DDRx channels can be used to build high capacity & high performance key value stores
– Concepts and early prototype to scale to 40TB & 80Gbps key value stores

[Figure label: Host memory (via CAPI)]

Source: HotStorage 2015, Scaling out to a Single-Node 80Gbps Memcached Server with 40Terabytes of Memory


Object distribution on the basis of size

Advantages:
– Larger objects require larger storage
– Larger objects suit the page-size access granularity of flash

Concerns:
– Large access latency on flash
– Variations in access bandwidth and latency between DRAM and flash

Fraction of values per value size (columns are grouped in the figure into "Stored in DRAM" and "Stored in Flash"):

Value Size (B)   128    256    512    768    1K     4K     8K     32K    1M
Facebook         0.55   0.075  0.275  0      0      0      0      0      0.1
Twitter          0      0      0      0.1    0.85   0.05   0      0      0
Wiki             0      0      0.2    0.1    0.4    0.29   0.008  0.001  0.001
Flickr           0      0      0      0      0      0.9    0.05   0.03   0.02

Source: [3] Atikoglu et al: Workload analysis of a large-scale key-value store; SIGMETRICS 2012

[13] Lim et al: Thin servers with smart pipes: designing SoC accelerators for memcached; ISCA 2013
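To see how a size-based split would divide these workloads, the fractions can be summed on either side of a threshold. The sketch below makes several assumptions that are not stated on the slide: the four rows are taken to belong to Facebook, Twitter, Wiki, and Flickr in that order, 1K/4K/1M are read as powers of two, and the 4096-byte threshold is a hypothetical choice matching a typical flash page, not the boundary used in the prototype.

```python
# Value sizes in bytes, in the column order of the slide.
SIZES = [128, 256, 512, 768, 1024, 4096, 8192, 32768, 1048576]

# Fraction of values per size; row-to-workload mapping is an assumption.
DISTRIBUTION = {
    "Facebook": [0.55, 0.075, 0.275, 0, 0, 0, 0, 0, 0.1],
    "Twitter": [0, 0, 0, 0.1, 0.85, 0.05, 0, 0, 0],
    "Wiki": [0, 0, 0.2, 0.1, 0.4, 0.29, 0.008, 0.001, 0.001],
    "Flickr": [0, 0, 0, 0, 0, 0.9, 0.05, 0.03, 0.02],
}


def fraction_in_flash(fractions, threshold_bytes=4096):
    """Share of values at or above the (hypothetical) flash threshold."""
    return sum(f for size, f in zip(SIZES, fractions) if size >= threshold_bytes)


if __name__ == "__main__":
    for name, fractions in DISTRIBUTION.items():
        share = fraction_in_flash(fractions)
        print(f"{name}: {share:.1%} of values would go to flash")
```

With this threshold, photo-heavy workloads land almost entirely in flash while workloads dominated by small values stay mostly in DRAM, which is the motivation for a hybrid store.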


Dataflow architectures can accommodate high latency accesses without sacrificing throughput

• In dataflow architectures: no limit to number of outstanding requests
• Flash can be serviced at maximum speed

[Figure: timing diagram of overlapping "Read SSD" commands (Cmd), each with 100usec latency, and their responses (Rsp) returning back to back over time. The FPGA pipeline 10G → Request Parser → Hash Table → Value Store → Response Formatter → 10G uses a request buffer in front of a hybrid storage subsystem with flash value stores]
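Why outstanding requests matter can be estimated with Little's law: requests in flight = request rate × latency. The sketch below uses the 100usec SSD read latency from the slide; the target rate of one million reads per second is a made-up figure for illustration, and the function names are not from any library.

```python
def outstanding_requests(rate_per_s, latency_us):
    """Requests that must be in flight to sustain a rate (Little's law), rounded up."""
    return -(-rate_per_s * latency_us // 1_000_000)


def blocking_rate_per_s(latency_us):
    """Rate achieved when only one request is outstanding at a time."""
    return 1_000_000 // latency_us


if __name__ == "__main__":
    latency_us = 100              # SSD read latency from the slide
    target_rate = 1_000_000       # hypothetical target: 1M reads per second
    print(outstanding_requests(target_rate, latency_us), "requests in flight needed")
    print(blocking_rate_per_s(latency_us), "reads per second with one outstanding request")
```

A design that waits for each read to return is limited to 10,000 reads per second at this latency, whereas a pipeline that keeps about a hundred reads in flight reaches the assumed target rate with the same device latency.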


Custom memory controllers with out of order processing

[Figure: Hybrid Memory Controller. A splitter distributes requests to the DRAM value store (via DRAM controller) and the SSD value store (via SATA HBA); a merger combines the responses]



Today: Networked Object Storage Board with MPSoC

50Gbps key value store with 2TB, 25W

– PCIe x16 (256Gb/s)
– Dual SFP+: 2x 10/25 Gbps
– Dual M.2 SSD: 2x 512 GB
– Dual DDR4 SODIMM: 16GB x72 ECC DR, 273 Gb/s @ 2133 Mb/s
– 16nm MPSoC: Quad A53 CPU, Embedded FPGA

Unconventional memory architecture to achieve high capacity while maintaining performance
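The interface figures on this board can be cross-checked with simple arithmetic. The sketch below assumes that the 273 Gb/s DDR4 figure counts only the 64 data bits of each x72 ECC SODIMM and sums both SODIMMs; that reading is an inference from the numbers, not something the slide states. Function names are illustrative.

```python
def ddr_bandwidth_gbps(modules, data_bits, rate_mbps_per_pin):
    """Aggregate DRAM bandwidth: modules x data pins x per-pin data rate."""
    return modules * data_bits * rate_mbps_per_pin / 1000.0


def network_bandwidth_gbps(ports, rate_gbps):
    """Aggregate network bandwidth over all ports."""
    return ports * rate_gbps


if __name__ == "__main__":
    # Two SODIMMs, 64 data bits each (the remaining 8 of 72 bits carry ECC).
    print(f"DDR4: {ddr_bandwidth_gbps(2, 64, 2133):.1f} Gb/s")
    # Two SFP+ ports at the 25 Gbps setting.
    print(f"Network: {network_bandwidth_gbps(2, 25)} Gb/s")
```

Under these assumptions the DRAM bandwidth is roughly five times the 50Gbps network rate, leaving headroom for hash table and value accesses per request.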


Machine Learning: Top-5 accuracy image classification

ImageNet Large-Scale Visual Recognition Challenge (ILSVRC*)

[Figure: top-5 classification accuracy, with reference line "Human @95%"]

CNNs deliver super-human accuracy

* http://image-net.org/challenges/LSVRC/

** http://www.slideshare.net/NVIDIA/nvidia-ces-2016-press-conference, pg 10

*** Russakovsky, et al 2014, http://arxiv.org/pdf/1409.0575.pdf


Compute and Memory Requirements

CNN for ImageNet datasets      Memory (SP) [MB]   Operations [MOPS]   Operational Intensity [OPS:B]
AlexNet – convolutions only    9.3                1332                143
AlexNet – complete             244                1456                5.97
VGG-16                         552                30823               55.84
GoogleNet                      27.2               1502                55.24

CNNs are highly compute and highly memory intensive

GPUs deliver highest performance for AlexNet with 4000+ fps
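The operational intensity column is the ratio of the other two columns: operations divided by bytes of memory. The sketch below recomputes it from the table values; the data structure and function name are illustrative, and small differences from the slide are rounding.

```python
# (memory in MB, operations in MOPS, operational intensity from the slide)
NETWORKS = {
    "AlexNet (convolutions only)": (9.3, 1332, 143),
    "AlexNet (complete)": (244, 1456, 5.97),
    "VGG-16": (552, 30823, 55.84),
    "GoogleNet": (27.2, 1502, 55.24),
}


def operational_intensity(mops, mbytes):
    """Operations per byte; the mega prefixes cancel."""
    return mops / mbytes


if __name__ == "__main__":
    for name, (mbytes, mops, slide_oi) in NETWORKS.items():
        oi = operational_intensity(mops, mbytes)
        print(f"{name}: computed {oi:.2f} OPS/B, slide {slide_oi}")
```

The drop from about 143 to about 6 OPS/B between the convolution-only and complete AlexNet reflects the memory needed by the non-convolution layers, which pushes the complete network toward the memory-bound side of a roofline.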


Emerging: Low-Precision Networks

Reduced precision has been shown to work down to 6b
– 50x reduction in model size (no external memory needed) [1]

Reducing to the extreme: binary neural networks (BNNs)

[Figure: Bipolar NN vs. SP float NN, from [2] (fully connected network layers for phoneme recognition)]

[1] Iandola et al., "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size" (2016).

[2] Sung et al., "Resiliency of Deep Neural Networks Under Quantization", ICLR’16
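One reason binary networks map well to FPGA logic is that a dot product of bipolar (+1/-1) vectors reduces to XNOR and popcount on packed bits. The sketch below shows that general technique in plain Python; it is not the implementation used in the work presented here, and the function names and bit encoding (+1 as bit 1, -1 as bit 0) are choices made for this example.

```python
def pack_bipolar(values):
    """Pack a list of +1/-1 values into an integer, one bit per value (+1 -> 1)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two packed bipolar vectors of length n.

    XNOR marks positions where the signs agree (product +1); every other
    position contributes -1, so the result is 2 * matches - n.
    """
    mask = (1 << n) - 1
    matches = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * matches - n


if __name__ == "__main__":
    a = [1, -1, 1, 1, -1, -1, 1, -1]
    b = [1, 1, -1, 1, -1, 1, 1, -1]
    reference = sum(x * y for x, y in zip(a, b))
    fast = xnor_popcount_dot(pack_bipolar(a), pack_bipolar(b), len(a))
    print(reference, fast)
```

Each multiply-accumulate becomes a single-bit XNOR plus a share of a popcount tree, which needs only a few LUTs and no DSP blocks or floating-point units.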


Binary and Almost Binary Networks

Accuracy (published & reproduced results)

Dataset                     FP32        BNN         Source
MNIST                       99%         99%         [1]
CIFAR-10                    92%         90%         [1]
ImageNet (GoogleNet arch)   90% top-5   86% top-5   [2] binary weights
ImageNet (DoReFaNet)        56% top-1   50% top-1   [4] 2-bit activations

[1] Courbariaux, Matthieu, and Yoshua Bengio. "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).

[2] Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." arXiv preprint arXiv:1603.05279 (2016).

[3] Xundong Wu: "High Performance Binarized Neural Networks trained on the ImageNet Classification Task", arXiv:1604.03058

[4] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, Y. Zou: "DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients", http://arxiv.org/abs/1606.06160


Roofline model for KU115 (ADM-PCIE-8K5) & CNNs

[Figure: Xilinx FPGA Rooflines. X axis: Ops:Byte, 0.125 to 32768 (log). Y axis: GOps/s, 8 to 131072 (log). Rooflines: KU115-SP, KU115-16bint, KU115-8bint, KU115-1b. Workloads: AlexNet (complete), AlexNet (Conv+MaxPool), LeNet-5, VGG-16, GoogleNet, SqueezeNet/FireCaffe. Annotations: "2 Tops peak, 16b", "BNN (avoid mem bw)", "BNN – 2.5 LUTs/OP"]


Lab setup using the MNIST dataset, Zynq chip

First prototype in Xilinx labs Dublin* shows, in hardware: 12Mfps for MNIST, 2usec latency, ~7.4Watt

*Yaman Umuroglu (NTNU, Xilinx); Nicholas Fraser (University of Sydney, Xilinx); Giulio Gambardella (Xilinx Research); Michaela Blott (Xilinx Research)
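The three prototype figures can be combined into per-frame quantities. The sketch below is plain arithmetic on the numbers from the slide, with illustrative function names; it assumes the 12Mfps throughput is sustained and the ~7.4 Watt figure applies at that rate.

```python
def energy_per_frame_uj(power_w, frames_per_s):
    """Energy spent per classified frame, in microjoules."""
    return power_w / frames_per_s * 1e6


def frame_period_ns(frames_per_s):
    """Time between consecutive results, in nanoseconds."""
    return 1e9 / frames_per_s


if __name__ == "__main__":
    fps = 12e6          # 12 Mfps
    latency_ns = 2000   # 2 usec
    power_w = 7.4       # ~7.4 Watt
    period = frame_period_ns(fps)
    print(f"energy per frame: {energy_per_frame_uj(power_w, fps):.2f} uJ")
    print(f"frame period: {period:.1f} ns")
    print(f"frames overlapping in the pipeline: {latency_ns / period:.0f}")
```

A new result arrives about every 83 ns even though each frame takes 2 usec end to end, so on the order of two dozen frames are in the pipeline at once, at well under a microjoule each.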


Summary & Future Plans

Trend towards unconventional architectures
– A diversification of increasingly heterogeneous systems

Characterization leveraging Berkeley roofline models
– Visualizes application suitability for different accelerators and supports performance estimation

In collaboration with leading customers, partners and universities

Future:
– Facilitate ease of use for reconfigurable computing
– Bring clarity & understanding on applications


Thank you

[email protected]

Any questions?
