
FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud

Sagar Karandikar, Howard Mao, Donggyu Kim, David Biancolin, Alon Amid, Dayeol Lee, Nathan Pemberton, Emmanuel Amaro, Colin Schmidt, Aditya Chopra, Qijing Huang, Kyle Kovacs, Borivoje Nikolic, Randy Katz, Jonathan Bachrach, Krste Asanović

https://fires.im · @firesimproject · sagark@eecs.berkeley.edu

The new datacenter hardware environment
• Deeper memory/storage hierarchies (e.g. 3D XPoint, HBM)
• The end of Moore's Law
• Custom silicon in the cloud
• Faster networks (e.g. silicon photonics)
• New datacenter architectures (e.g. disaggregation [1])

Disaggregated Datacenters

[Diagram from Gao et al., OSDI '16]

…and custom HW is changing faster than ever
• FPGAs
• Agile HW design for ASICs [2]

What does our simulator need to do?
• Model hardware at scale:
  • CPUs down to the microarchitecture
  • Fast networks and switches
  • Novel accelerators
• Run real software:
  • A real OS and networking stack (Linux)
  • Real frameworks/applications (not microbenchmarks)
• Be productive/usable:
  • Run on a commodity platform
  • Encourage collaboration between the systems and architecture communities: real HW/SW co-design

Comparing existing HW "simulation" systems:
(1) Build the hardware
(2) Build a software simulator
(3) Build a hardware-accelerated simulator

A HW-accelerated DC simulator: DIABLO
• DIABLO, ASPLOS '15 [4]:
  • Simulated 3072 servers and 96 ToRs at ~2.7 MHz
  • Booted Linux, ran apps like Memcached
  • Part of the RAMP collaboration [8]
• Requires hand-written abstract RTL models:
  • Harder than writing "tapeout-ready" RTL
  • Must be validated against real HW
• Tied to an expensive, custom-built host platform ($100k+)

[Photo: the DIABLO prototype]

Comparing existing HW "simulation" systems
• Taping out excels at:
  • Modeling reality: a "single source of truth"
  • Scalability
• Hardware-accelerated simulators excel at:
  • Simulation rate
  • The ability to run real workloads (as a function of sim rate)
• Software-based simulators excel at:
  • Ease of use
  • Ease of rebuild (time-to-first-cycle)
  • Commodity host platform
  • Cost
  • Introspection

Useful trends throughout the architect's stack:
• FPGAs in the cloud
• A high-productivity hardware design language with an IR
• Open, silicon-proven SoC implementations
• An open ISA

FireSim at a high level
• Server simulations:
  • Inherent parallelism: lots of gates
  • We have tapeout-proven RTL, so automatically FAME-1 transform it (see the sketch after this list)
  • Put the RTL-derived simulations on the FPGAs
• Network simulation:
  • Little parallelism in switch models (e.g. a thread per port)
  • Need to coordinate all of our distributed server simulations
  • So use CPUs + the host network

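To make the FAME-1 transform concrete, here is a minimal Python sketch of the idea (illustrative only: FireSim's actual transform rewrites the RTL itself, and every name below is hypothetical). A FAME-1 model advances exactly one target cycle per host "fire", and it fires only when a token is available on every input and there is buffer space on the output:

from collections import deque

class Fame1Model:
    """Host-decoupled wrapper around a one-target-cycle step function."""
    def __init__(self, step_fn, num_inputs, out_capacity=4):
        self.step_fn = step_fn                 # simulates one target cycle
        self.inputs = [deque() for _ in range(num_inputs)]
        self.output = deque()
        self.out_capacity = out_capacity
        self.target_cycle = 0

    def host_tick(self):
        """One host cycle: advance target time only if we can fire."""
        can_fire = all(self.inputs) and len(self.output) < self.out_capacity
        if can_fire:
            in_tokens = [q.popleft() for q in self.inputs]
            self.output.append(self.step_fn(self.target_cycle, in_tokens))
            self.target_cycle += 1             # exactly one target cycle per fire
        return can_fire

Because target time only advances on a fire, the simulation stays cycle-exact no matter how often the host stalls waiting for tokens.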

[Diagram: one f1.16xlarge host. Its 8 FPGAs each run server simulations, connected over host PCIe to the host CPU, which runs the switch model; hosts are linked by host Ethernet (the EC2 network).]

Now, let's build a datacenter-scale FireSim simulation!


Step 1: Server SoC in RTL

[Diagram: the simulated datacenter topology. A root switch sits above aggregation pods; each pod has an aggregation switch and racks; each rack maps to FPGAs (4 sims each) plus a host instance CPU running the ToR switch model. Each FPGA holds four server blade simulations alongside a DRAM model, a NIC sim endpoint, other peripheral sim endpoints, and PCIe to the host; each simulated server blade has four Rocket cores with L1I/L1D caches, a shared L2, a NIC, and other peripherals.]

Modeled System:
- 4x RISC-V Rocket cores @ 3.2 GHz
- 16K I/D L1$
- 256K shared L2$
- 200 Gb/s Ethernet NIC

Resource Util.:
- < ¼ of an FPGA

Sim Rate:
- N/A


Step 2: FPGA Simulation of one server blade

Modeled System:
- 4x RISC-V Rocket cores @ 3.2 GHz
- 16K I/D L1$
- 256K shared L2$
- 200 Gb/s Ethernet NIC
- 16 GB DDR3

Resource Util.:
- < ¼ of an FPGA
- ¼ of the memory channels

Sim Rate:
- ~150 MHz
- ~40 MHz (networked)


Step 3: FPGA Simulation of 4 server blades

Modeled System:
- 4 server blades
- 16 cores
- 64 GB DDR3

Resource Util.:
- < 1 FPGA
- 4/4 memory channels

Sim Rate:
- ~14.3 MHz (networked)

Cost: $0.49 per hour (spot), $1.65 per hour (on-demand)


Step 4: Simulating a 32 node rack

Modeled System:
- 32 server blades
- 128 cores
- 512 GB DDR3
- 32-port ToR switch
- 200 Gb/s, 2 µs links

Resource Util.:
- 8 FPGAs = 1x f1.16xlarge

Sim Rate:
- ~10.7 MHz (networked)

Cost: $2.60 per hour (spot), $13.20 per hour (on-demand)
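To put the networked simulation rates in perspective, the effective slowdown relative to the 3.2 GHz target clock is

\[ \text{slowdown} = \frac{f_{\text{target}}}{f_{\text{sim}}} = \frac{3.2\ \text{GHz}}{10.7\ \text{MHz}} \approx 300\times, \]

so one second of target time for the 32 node rack takes roughly 300 seconds (about five minutes) of wall-clock time.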

Cycle-accurate Network Modeling
• For global cycle-accuracy, send a token on each link for each cycle, in each direction
• Each direction of a link has (link latency in cycles) tokens in flight
  • e.g. 6400 tokens in flight for a 2 µs link latency @ 3.2 GHz
• Each token is (desired bandwidth / clock frequency) bits wide
  • e.g. 200 Gb/s / 3.2 GHz ≈ one 64-bit token sent per cycle
• Target-transport agnostic (we provide Ethernet switch models)
• Host-transport agnostic (shared memory, sockets, PCIe)
• Can "downgrade" to a zero-perf-impact functional network model (150+ MHz); a sketch of the link model follows below
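Here is a minimal Python sketch of one direction of such a link (illustrative only, not FireSim's implementation; the class and constant names are made up). Pre-seeding the queue with latency-many idle tokens means anything injected at target cycle t is delivered at cycle t + latency:

from collections import deque

LINK_LATENCY_CYCLES = 6400   # 2 us of latency at a 3.2 GHz target clock
TOKEN_BITS = 64              # 200 Gb/s / 3.2 GHz = 62.5, rounded up to 64

class LinkDirection:
    """One direction of a simulated link, e.g. NIC -> switch port."""
    def __init__(self, latency=LINK_LATENCY_CYCLES):
        self.pipe = deque([None] * latency)   # None = idle (empty) token

    def exchange(self, tx_token):
        """Advance one target cycle: accept the sender's token for this
        cycle and deliver the token injected `latency` cycles ago."""
        self.pipe.append(tx_token)
        return self.pipe.popleft()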


[Diagram: a link model between the NIC's top-level I/O on the FPGA and a switch port, exchanging 64-bit tokens in both directions with 6400 tokens in flight.]



Step 5: Simulating a 256 node "aggregation pod"

Modeled System:
- 256 server blades
- 1024 cores
- 4 TB DDR3
- 8 ToRs, 1 aggregation switch
- 200 Gb/s, 2 µs links

Resource Util.:
- 64 FPGAs = 8x f1.16xlarge + 1x m4.16xlarge

Sim Rate:
- ~9 MHz (networked)


Step 6: Simulating a 1024 node datacenter

Modeled System:
- 1024 servers
- 4096 cores
- 16 TB DDR3
- 32 ToRs, 4 aggregation switches, 1 root switch
- 200 Gb/s, 2 µs links

Resource Util.:
- 256 FPGAs = 32x f1.16xlarge + 5x m4.16xlarge

Sim Rate:
- ~6.6 MHz (networked)

Experimenting on a 1024 Node Datacenter


[Diagram: the full 1024-node simulated datacenter, with 512 Memcached servers and 512 Mutilate clients spread across it.]

Measured 50th %-ile latency (µs):
- Cross-ToR: 79.3
- Cross-aggregation: 87.1


Reproducing tail latency effects from deployed clusters
• Leverich and Kozyrakis show the effects of thread imbalance in memcached in EuroSys '14 [3]

[Diagram: a quad-core Rocket SoC (four Rocket cores with L1I/L1D caches and accelerators on the TileLink2 on-chip interconnect) running memcached threads 1-4, one per core: no thread imbalance.]

[Diagram: the same SoC with a fifth memcached thread added to the four cores: thread imbalance.]

From [3], under thread imbalance, we expect:
1) Median latency to be unchanged
2) Tail latency to increase drastically

Let's run a similar experiment on an 8 node cluster in FireSim:

[Charts: measured 50th and 95th percentile latencies.]

Open-source: Not just datacenter simulation
• An "easy" button for fast, FPGA-accelerated full-system simulation
• One-click: parallel FPGA builds, simulation run/result collection, building target software
• Scales to a variety of use cases:
  • Networked (performance depends on scale)
  • Non-networked (150+ MHz), limited by your budget
• The firesim command line program:
  • Like docker or vagrant, but for FPGA sims (see the usage sketch below)
  • The user doesn't need to care about the distributed magic happening behind the scenes

[Screenshot: the FireSim developer environment]
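As a sketch of the workflow (the command names follow the FireSim manager documentation at https://docs.fires.im, but exact subcommands, flags, and config files may differ across versions and setups):

$ firesim launchrunfarm     # provision the EC2 instances for the simulation
$ firesim infrasetup        # build and deploy simulation infrastructure to them
$ firesim runworkload       # boot the target software and collect results
$ firesim terminaterunfarm  # tear the run farm back down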

Open-source: Not just datacenter simulation
• Scripts can call firesim to fully automate distributed FPGA simulation
• Reproducibility: included scripts reproduce the ISCA 2018 results
  • e.g. scripts automatically run the SPECint2017 reference inputs in ≈1 day
  • Many others
• 91+ pages of documentation: https://docs.fires.im
• AWS provides grants for researchers: https://aws.amazon.com/grants/

$ cd fsim/deploy/workloads
$ ./run-all.sh

Wrapping Up
• We can prototype thousand-node datacenters built on arbitrary RTL
  • Plus mix in software models when desired
• Simulations are automatically built and deployed
• Real workloads are automatically deployed and results collected
• Open-source, runs on Amazon EC2 F1: no capex

[Diagram: SoC RTL, other RTL, SW models, a network topology, and a full workload flow into an automatically deployed, high-performance, distributed simulation.]


Now open-sourced!
https://fires.im
https://github.com/firesim
@firesimproject
sagark@eecs.berkeley.edu

The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849, DARPA Award Number HR0011-12-2-0016, RISE Lab sponsor Amazon Web Services, and ADEPT/ASPIRE Lab industrial sponsors and affiliates Intel, HP, Huawei, NVIDIA, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or of the industrial sponsors.

References

[1] Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network Requirements for Resource Disaggregation. OSDI '16.
[2] Y. Lee et al. An Agile Approach to Building RISC-V Microprocessors. IEEE Micro, vol. 36, no. 2, pp. 8-20, Mar.-Apr. 2016.
[3] Jacob Leverich and Christos Kozyrakis. Reconciling High Server Utilization and Sub-millisecond Quality-of-Service. EuroSys '14.
[4] Zhangxi Tan, Zhenghao Qian, Xi Chen, Krste Asanović, and David Patterson. DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs. ASPLOS '15.
[5] Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanović, and David Patterson. A Case for FAME: FPGA Architecture Model Execution. ISCA '10.
[6] Donggyu Kim, Christopher Celio, David Biancolin, Jonathan Bachrach, and Krste Asanović. Evaluation of RISC-V RTL Designs with FPGA Simulation. CARRV '17.
[7] Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović. Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL. ISCA '16.
[8] http://ramp.eecs.berkeley.edu/


Backup Slides/Common Questions


New! Beta support for BOOM, an OoO Core
• A superscalar out-of-order RV64G core
• Highly parameterizable
• Beta support for BOOM is now available on the rc-bump-may branch
• On the FPGA, it currently boots Linux to userspace, then hangs
  • This is a target-RTL issue: it can be reproduced in VCS without any FireSim simulation shims
  • Working with the BOOM devs to solve it


FPGA Utilization
• "Supernode" config: 4 quad-core nodes per FPGA
• Available on the FPGA: 1,182,000 LUTs
• Top-level consumed (with shell): 803,462 LUTs ≈ 68%
• firesim_top consumed: 569,921 LUTs ≈ 48%

Comparing cost of Cloud FPGAs: SPEC17 Run
• f1.2xlarge spot market: 49¢ per hour per FPGA
• SPEC17 intrate reference inputs: 10 workloads, 177.6 FPGA-simulation machine-hours
  • Longest individual benchmark: omnetpp @ 27.3 hours
• $87.024 for a SPEC17 intrate run with reference inputs, finishing in 27.3 hours
  • Roughly 1 day
  • Roughly 1/100 to 1/500 the cost of purchasing the FPGA
• Versus purchasing one FPGA:
  • Pay $10k-$50k per FPGA (+ ops/cooling)
  • Each run still takes 177.6 FPGA-simulation hours: roughly 1 week on one FPGA
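The headline cost is just the spot price multiplied by the simulation hours:

\[ 177.6\ \text{FPGA-hours} \times \$0.49/\text{hour} \approx \$87 \]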


Or: compare against purchasing one FPGA
• Say $10k for one of the FPGAs (conservative)
• 49¢ per hour on F1
• Break even only if you run an F1 instance continuously for 2.33 years
• What if a grad student works 16 hours a day, 7 days a week? 3.5 years
• What about 16 hours a day, 5 days a week? 4.9 years
• And that ignores all the other benefits of cloud FPGAs…
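The break-even arithmetic behind those numbers, as a small Python sketch (the $10k price and the usage patterns are the slide's assumptions):

fpga_price = 10_000                     # USD, conservative purchase price
f1_rate = 0.49                          # USD per FPGA-hour (f1.2xlarge spot)
breakeven_hours = fpga_price / f1_rate  # ~20,408 FPGA-hours

print(breakeven_hours / (24 * 365))     # ~2.33 years of continuous use
print(breakeven_hours / (16 * 365))     # ~3.5 years at 16 h/day, 7 d/wk
print(breakeven_hours / (16 * 5 * 52))  # ~4.9 years at 16 h/day, 5 d/wk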

