Custom Computing - Imperial College London

Post on 21-Aug-2020


wl 2020 1.1

Custom Computing

• theory and practice of customising designs

– one of the fastest growing technologies

– impact on ASIC, CPU, many-core, GPU, multi-scale dataflow

• wide range of architectures and applications

– data-centre/supercomputers with user-customisable accelerators

– message routers, mobile robots, LCD TVs, car audio systems

– invent processors with your own instruction set!

• based mainly on customisable implementation technology

– e.g. Field-Programmable Gate Arrays (FPGAs)

– also called reconfigurable computing, FPGA-based computing

• we focus on concepts, abstractions, design methods

• requirement: willing to learn new ideas, languages, tools

– not afraid of C/Java/functional programs, maths, hardware

wl 2020 1.2

Course coverage

• topics

– custom computing technology overview

– design parametrisation and optimisation

– system-on-chip architecture and design

• 18 lectures, 8 tutorials (flexible), 1 assessed exercise

• course material

– https://www.doc.ic.ac.uk/~wl/teachlocal/cuscomp

– EEE students: may need access via EEE machines

• preparation for projects and research

– many received project prizes or distinctions

– summer projects for non-MSc students

wl 2020 1.3

Why custom computing?

• FPGAs: customisable hardware resources

– data centres for cloud computing

– mobile handsets, Internet of Things (IoT), edge computing

• acceleration of demanding workloads

– big data, finance, genomics, weather/climate modelling

– integrated solution: often with interface to memory, sensors…

– target multiple platforms: need to promote design re-use

• design approach: generalisation + customisation

– often start with design instance: f0

– generalise f0 to become a template f(x), such that f(x0) = f0, where x is a parameter and x0 is a specific value for x

– customise f with values for x to support tradeoff in speed, size…

[Figure: f0 is generalised to the template f(x); customising f(x) with x = x0, x1, x2, x3 yields design instances including f1, f2 and f3]
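
The generalise-then-customise approach can be sketched in software terms. The example below is a minimal illustration and is not taken from the course: it assumes the design instance f0 is an 8-bit wrapping adder, and `make_adder` and its `width` parameter are hypothetical names standing in for the template f(x) and the parameter x.

```python
from typing import Callable


def adder_8bit(a: int, b: int) -> int:
    """Design instance f0: an adder fixed at 8 bits (assumed example)."""
    return (a + b) & 0xFF


def make_adder(width: int) -> Callable[[int, int], int]:
    """Template f(x): generalises f0 by turning the width into a parameter x."""
    mask = (1 << width) - 1

    def adder(a: int, b: int) -> int:
        return (a + b) & mask  # wrap around like a width-bit hardware adder

    return adder


# Customise: choose values for x to trade size against range.
f0 = make_adder(8)    # x0 = 8 reproduces the original instance
f1 = make_adder(4)    # smaller design, narrower range
f2 = make_adder(16)   # larger design, wider range

print(f0(200, 100), adder_8bit(200, 100))  # both wrap modulo 256
print(f1(9, 9), f2(40000, 30000))
```

Here f(x0) = f0 holds by construction: `make_adder(8)` behaves exactly like `adder_8bit`, while other widths give instances with different size and range tradeoffs.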

wl 2020 1.4

Benefits of customisation

• improvements in

– accuracy: as needed, not necessarily 8, 32, 64, 128 bits

– throughput: rate of producing results

– latency: time between first input and first output

– reconfiguration time: speed of adapting to changes

– size: area, volume, weight

– energy and power consumption: mobile and remote applications

– development time: design and validation

– cost: minimise fabrication, post-delivery fixes, enhancements

• need to prioritise design objectives

– e.g. smallest design at a given speed consuming given energy

• opportunities for customisation

– application-oriented, e.g. run-time conditions

– implementation-oriented, e.g. technology used
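
To illustrate the accuracy point (word lengths chosen as needed rather than fixed at 8, 32, 64 or 128 bits), here is a minimal sketch of fixed-point quantisation with an arbitrary number of fractional bits. The function names `quantise` and `max_error` and the round-to-nearest format are assumptions made for illustration.

```python
def quantise(value: float, frac_bits: int) -> float:
    """Round value to the nearest multiple of 2**-frac_bits (a fixed-point grid)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def max_error(frac_bits: int) -> float:
    """Worst-case rounding error: half of the least significant bit."""
    return 2.0 ** -(frac_bits + 1)


x = 0.7
for bits in (3, 5, 11):  # word lengths need not be powers of two
    q = quantise(x, bits)
    print(bits, q, abs(q - x) <= max_error(bits))
```

Each extra fractional bit halves the worst-case error, so a design can carry just enough bits to meet its accuracy target and spend no more area or energy than that.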

wl 2020 1.5

Implementation technology

• application-specific integrated circuit (ASIC)

– high performance, low part cost: cheap if producing large volume

– high risk, high development cost, slow time-to-market

– costly (Moore’s Second Law) to develop, build and test, inflexible

• Field-Programmable Gate Array (FPGA)

– low risk, fast time-to-market, low development cost, high part cost

– post-delivery improvement: fix bugs, update functions

– customisable at run time: adapt to environment changes

– prototype for ASIC

– enable internet routing

• custom computing systems

– stand-alone

– PCIe / Infiniband

– system-on-chip: instruction processor + FPGA

wl 2020 1.6

Technology comparison

[Figure: flexibility against efficiency and performance for General-Purpose Processors, Digital Signal Processors, Special-Purpose Processors, FPGAs and ASICs]

(adapted from K. Fan, HPCA’09)

wl 2020 1.7

Where are FPGAs? Consumer applications

Digital Camera & Editing

LCD Projectors

PDP & HDTV

STB, DVR & VTR

Automotive

Handheld

Automotive

Diagnostics

Home Computing

Home Networking

(source: Xilinx Inc.)

wl 2020 1.8

New: accelerators for data centre servers

• Smart NIC (Network Interface Controller)

– compute accelerator: local / remote

– infrastructure accelerator: network / storage

– flexibility of Software Defined Network + speed of hardware

Source: Microsoft

wl 2020 1.9

Accelerate clouds: Microsoft + Amazon

aws.amazon.com/ec2/instance-types/f1/

www.top500.org/news/microsoft-goes-all-in-for-fpgas-to-build-out-cloud-based-ai/

wl 2020 1.10

Why Intel bought Altera

Source: Intel

IP: Intellectual Property

wl 2020 1.11

Source: Intel

Drones + IoT + …

Aerotenna:

Octagonal Pilot on Chip

ASSP: Application-Specific Standard Part

SAM: Serviceable Available Market

wl 2020 1.12

Particle Physics: Large Hadron Collider

(source: Xilinx Inc.)

[Figure labels: optical ribbon cable input; Opto-RX, 12 way; 3 x Delay FPGA (ADC clk timing); Virtex II, 2M gate FPGA performs signal processing. Stages: opto-to-electrical conversion, digitise & sync data, find hit clusters]

• real-time analysis of particle collision

• combine data from various detectors

(source: G. Hall)

wl 2020 1.13

Customisation: pre-fab and post-fab

• fabrication: manufacturing the chip

– Xilinx UltraScale FPGA: 16nm, Intel i7-i770T: 22nm

– costly: very small geometry, ultra-clean room

• application-specific integrated circuit (ASIC)

– greatest customisation at pre-fabrication, but could be inflexible

– high performance, low part cost: cheap if producing large volume

– high risk, high development cost, slow time-to-market

– costly (in money and time) to develop and test: Moore’s Law

• field-programmable gate array (FPGA)

– post-fabrication, post-delivery, even run-time customisation

– hardware speed, software flexibility

– most basic, fine-grained unit of programmability

– need larger function blocks for efficiency

wl 2020 1.14

Design metrics

• NRE (non-recurring engineering) cost

– one-time cost of designing system

• total cost: total cost = NRE cost + unit cost * number of units

• size, performance, power

• flexibility

– make changes to the hardware with low NRE cost

• time-to-prototype, time-to-market

• maintainability

• correctness, safety, robustness

Source: J. Wong

wl 2020 1.15

FPGA/ASIC crossover points

[Figure: cost against production volume, with regions marked FPGA Cost Advantage and ASIC Cost Advantage]

Source: S.S.S.P. Rao
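
A crossover point follows from the total-cost formula (total cost = NRE cost + unit cost * number of units): it is the production volume at which the FPGA and ASIC totals are equal. The sketch below uses made-up cost figures purely for illustration; they are not from the slides or from any vendor, and the function names are hypothetical.

```python
def total_cost(nre: float, unit_cost: float, units: int) -> float:
    """total cost = NRE cost + unit cost * number of units"""
    return nre + unit_cost * units


def crossover_volume(nre_a: float, unit_a: float, nre_b: float, unit_b: float) -> float:
    """Volume at which option A and option B have equal total cost."""
    if unit_a == unit_b:
        raise ValueError("equal unit costs: the totals never cross")
    return (nre_b - nre_a) / (unit_a - unit_b)


# Hypothetical figures: the FPGA has low NRE and high unit cost, the ASIC the reverse.
FPGA_NRE, FPGA_UNIT = 50_000.0, 40.0
ASIC_NRE, ASIC_UNIT = 2_000_000.0, 5.0

v = crossover_volume(FPGA_NRE, FPGA_UNIT, ASIC_NRE, ASIC_UNIT)
print(f"crossover at about {v:.0f} units")
for units in (10_000, 100_000):
    fpga = total_cost(FPGA_NRE, FPGA_UNIT, units)
    asic = total_cost(ASIC_NRE, ASIC_UNIT, units)
    print(units, "FPGA cheaper" if fpga < asic else "ASIC cheaper")
```

Below the crossover volume the lower NRE cost dominates and the FPGA is cheaper; above it the lower unit cost dominates and the ASIC wins.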

wl 2020 1.16

FPGA vs ASIC

FPGA

• faster time-to-market

– no layout, masks or other manufacturing steps are needed

• no upfront NRE costs

• simpler design cycle

– software tools for routing, placement, and timing

• more predictable project cycle

• field re-programmability

ASIC

• full custom capability

– for design since device is manufactured to design specs

• lower unit costs

– for very high volume

• smaller form factor

– device is made to design specs

• higher raw internal clock speeds

Source: J. Wong

wl 2020 1.17

Design flows

HDL: Hardware Description Language

DFT: Design For Test

Source: J. Wong

wl 2020 1.18

Early FPGA architecture

[Figure labels: Logic Block, Connection Block, Switch Block, Routing Track (Horizontal), Routing Channel (Vertical), Tile]

Source: S. Wilton

wl 2020 1.19

Basic logic gate: lookup table

Function of each lookup table can be configured by shifting in bit-stream.

[Figure: reconfigurable logic; a lookup table with its inputs and the bit-stream that configures it]

Source: S. Wilton
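
A lookup table can be modelled as a small memory addressed by its inputs, with contents loaded serially. The sketch below is a simplified software model and not a description of any particular device: the class name `LUT`, the shift order and the input-to-address mapping are all assumptions.

```python
from typing import Iterable


class LUT:
    """k-input lookup table whose 2**k configuration bits are shifted in serially."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.bits = [0] * (1 << k)  # configuration memory, initially all zero

    def shift_in(self, bit: int) -> None:
        """Shift one configuration bit in; the oldest bit falls off the end."""
        self.bits = [bit & 1] + self.bits[:-1]

    def configure(self, bitstream: Iterable[int]) -> None:
        for bit in bitstream:
            self.shift_in(bit)

    def evaluate(self, *inputs: int) -> int:
        """Use the inputs as an address (first input is the most significant bit)."""
        assert len(inputs) == self.k
        address = 0
        for value in inputs:
            address = (address << 1) | (value & 1)
        return self.bits[address]


lut = LUT(2)
# After shifting in 4 bits, the first bit shifted in sits at the highest address.
lut.configure([1, 0, 0, 0])   # contents become [0, 0, 0, 1]: a 2-input AND
print([lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)])
lut.configure([0, 1, 1, 0])   # reconfigure the same table as XOR
print([lut.evaluate(a, b) for a in (0, 1) for b in (0, 1)])
```

The same table implements AND or XOR depending only on the bits shifted in, which is the sense in which the logic is reconfigurable.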

wl 2020 1.20

Basic logic gate: lookup table

Function of each lookup table can be configured by shifting in bit-stream. By-passable register at output.

[Figure: reconfigurable logic; a lookup table driven by the inputs, followed by a D flip-flop (D, Q) that can be bypassed]

Source: S. Wilton

wl 2020 1.21

Reconfigurable logic

• Connect logic blocks using fixed metal tracks and programmable switches

Source: S. Wilton

wl 2020 1.22

Reconfigurable logic

• Connect logic blocks using fixed metal tracks and programmable switches

Everything can be built using fine-grained logic; why need anything else?

Source: S. Wilton

wl 2020 1.23

Implementing systems in an FPGA

FPGA vendors embed fixed blocks to improve speed and density:

Embedded Memories (blocks of 2K-18K)

But every user must pay for them, whether used or not…

Source: S. Wilton

wl 2020 1.24

Implementing systems in an FPGA

FPGA vendors embed fixed blocks to improve speed and density:

Embedded Memories (blocks of 2K-18K)

Hard Blocks, eg multiplier

But every user must pay for them, whether used or not…

Source: S. Wilton

wl 2020 1.25

Implementing systems in an FPGA

FPGA vendors embed fixed blocks to improve speed and density:

Embedded Memories (blocks of 2K-18K)

Hard Blocks, eg multiplier

High-Speed I/Os

But every user must pay for them, whether used or not…

Source: S. Wilton

wl 2020 1.26

Example: Xilinx Virtex CLB tile

• CLB tile is composed of:

– switch matrix

– Configurable Logic Block and associated general routing resources

– IMUX and OMUX

• all CLB inputs have access to interconnect on all 4 sides

• fast local feedback within CLB and direct connects to east and west CLBs: support wide functions of up to 19 inputs within a single CLB

[Figure: CLB tile with a switch matrix, a CLB containing two slices with carry chains and local feedback, direct connects to neighbouring CLBs, tristate busses, and SINGLE, HEX and LONG routing lines on all four sides]

Source: Xilinx Inc.

wl 2020 1.27

Simplified CLB structure

[Figure: CLB containing Slice 0 and Slice 1; each slice has two LUTs with carry logic, each followed by a D flip-flop (D, Q) with CE, PRE and CLR inputs]

• two slices in each CLB

– two BUFTs associated with each CLB, accessible by all 8 CLB outputs

– carry logic runs vertically upwards, to speed up carry propagation

Source: Xilinx Inc.
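
The role of dedicated carry logic can be illustrated with a ripple-carry adder: each bit position computes a propagate signal (a job suited to a LUT), and a fast carry path passes the carry up to the next position. This is a generic textbook model under an assumed name (`ripple_add`), not the exact circuit of the Xilinx slice.

```python
from typing import Tuple


def ripple_add(a: int, b: int, width: int) -> Tuple[int, int]:
    """Add two width-bit numbers bit by bit; return (sum, carry_out)."""
    carry = 0
    total = 0
    for i in range(width):          # bit 0 first: the carry ripples upwards
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        propagate = ai ^ bi         # per-bit function, a natural fit for a LUT
        sum_bit = propagate ^ carry
        # carry multiplexer: pass the incoming carry if propagating, else use ai
        carry = carry if propagate else ai
        total |= sum_bit << i
    return total, carry


print(ripple_add(0b1011, 0b0110, 4))
```

The carry path is the critical one, since the top bit cannot settle until the carry has travelled through every lower position; this is why a dedicated fast path for it pays off.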

wl 2020 1.28

Look-Up Tables

[Figure: combinatorial logic with inputs A, B, C, D and output Z]

A B C D Z
0 0 0 0 0
0 0 0 1 0
0 0 1 0 0
0 0 1 1 1
0 1 0 0 1
0 1 0 1 1
. . .
1 1 0 0 0
1 1 0 1 0
1 1 1 0 0
1 1 1 1 1

• combinatorial logic is stored in Look-Up Tables (LUTs) in a CLB

• capacity is limited by number of inputs, not complexity

• delay through CLB is constant
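
The rows listed in the truth table are consistent with Z = (A xor B) or (C and D). The elided rows are not shown, so this function is an assumption used only for illustration. The sketch below builds the 16-entry table for that function and shows why capacity depends on the number of inputs: any function of 4 inputs, however complex, occupies exactly 16 stored bits.

```python
from itertools import product


def z(a: int, b: int, c: int, d: int) -> int:
    """Assumed function, chosen to match the rows listed in the truth table."""
    return (a ^ b) | (c & d)


# LUT contents: one stored bit per input combination, in truth-table order.
table = {inputs: z(*inputs) for inputs in product((0, 1), repeat=4)}

# Rows that are legible in the truth table.
listed_rows = {
    (0, 0, 0, 0): 0, (0, 0, 0, 1): 0, (0, 0, 1, 0): 0, (0, 0, 1, 1): 1,
    (0, 1, 0, 0): 1, (0, 1, 0, 1): 1,
    (1, 1, 0, 0): 0, (1, 1, 0, 1): 0, (1, 1, 1, 0): 0, (1, 1, 1, 1): 1,
}
print(all(table[row] == out for row, out in listed_rows.items()))

k = 4
print(len(table))        # storage bits needed: 2**k
print(2 ** (2 ** k))     # number of distinct k-input functions one LUT can hold
```

Because evaluation is a single memory read whatever the stored function, the delay is the same for a simple AND as for an intricate 4-input expression.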

wl 2020 1.29

Stratix IV GX 230: mid-size device

Adaptive Logic Modules (fine grain)

RAM Blocks (M9K & M144K)

DSP Blocks (coarse grain)

High Speed Serial Interfaces: eg connect multiple FPGAs

(source: V. Betz)

wl 2020 1.30

Stratix IV Overview

Feature                  Stratix III (65 nm)                  Stratix IV (40 nm)
Logic Elements           340k                                 680k
RAM bits                 16 Mb + 4 Mb                         33 Mb + 8.5 Mb
18x18 multipliers        768                                  1360
General I/O              1104                                 1104
High-speed serial links  0                                    48 transmit + 48 receive @ 11.3 Gb/s
Hard PCIe blocks         0                                    4
Clock generation         12 PLL(x10)                          12 PLL(x10) + 32 serial recovered + 24 serial transmit
Clock distribution       16 Global + 88 Quadrant + 132 PCLK   16 Global + 88 Quadrant + 132 PCLK

(from V. Betz)

wl 2020 1.31

Current and future: System-on-Chip

[Figure: system-on-chip with I/O ring and interface circuitry surrounding an embedded processor, on-chip memory, fixed IP blocks and reconfigurable logic]

Fixed Intellectual Property Block

– functionality fixed at design time

– little post-fab flexibility

Processor eg ARM

– functionality specified using software

Programmable Logic

– circuit can be specified / modified after fabrication, possibly at run time

– maybe slower than fixed IP block

Source: S. Wilton

wl 2020 1.32

Summary

• custom computing: theory and practice of customisation

– from data centres/cloud computing to mobile appliances

• customisable off-the-shelf implementation technology

– e.g. FPGAs, coarse-grained/hybrid processors, custom instructions

• factors favouring field-programmability

– rise in FPGA capability: many exciting applications

– rise in integrated circuit fabrication cost: zero for FPGA users!

– customisation: facilitate product evolution and prototyping

• custom computing tools + applications at Imperial College

– financial analysis/trading, multimedia processing, medical imaging

– network firewall, data compression/encryption, mobile robots

– bio-informatics, machine learning, bio-inspired/self-aware systems

see: http://cc.doc.ic.ac.uk
