Page 1: Constructing Virtual Architectures on Tiled Processorswentzlaf/documents/CGO_Wentzlaff_slides.pdfConstructing Virtual Architectures on Tiled Processors David Wentzlaff. Anant Agarwal

1

Constructing Virtual Architectures

on Tiled Processors

David Wentzlaff

Anant Agarwal

MIT

2

Emulators and JITs

for Multi-Core

David Wentzlaff

Anant Agarwal

MIT

5

Why Multi-Core?

Future architectures will be on-chip parallel machines

Moore’s Law provides more parallel silicon resources

Diminishing sequential returns

Growth applications are parallel

Future architectures will be optimized for parallel applications

Hardware compatibility will be broken

7

Why Emulators and JITs on Multi-Core?

Future architectures will be on-chip parallel machines

Moore’s Law provides more parallel silicon resources

Diminishing sequential returns

Growth applications are parallel

Future architectures will be optimized for parallel applications

Hardware compatibility will be broken

Future architectures will need to run legacy applications

Market forces will require future chips to run 1983 “Frogger” for DOS

Software re-verification on new architectures too costly

11

Are Emulators and JITs for Multi-Core Different?

Yes

Bountiful parallel resources

Example: Code optimization cost is reduced

“hot spot” analysis may miss sequential performance

Parallelize client application

Parameters for Multi-Core are different

Core-to-Core latencies reduced

19

Road Map

Exploit on-chip parallel resources to accelerate emulation

Focused on performance

Acceleration Mechanisms

1. Pipelining Virtual Architectures

2. Speculative Parallel Translation

3. Static & Dynamic Architecture Reconfiguration

Proof of concept system

All-software parallel translator: x86 on Raw

20

Background: Translation

[Figure: translation flow. Labels: Disk, x86 Binary, Data RAM, Translator, Code Cache, Tags, Runtime -- Execution]

Translator stages:

x86 Parser & High Level Translator

High Level Optimization

Low Level Code Generation

Low Level Optimization and Scheduling
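The translate-on-miss flow in the figure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `translate` function, the `CodeCache` class, and the toy "instructions" are all hypothetical stand-ins, and the four translator stages are reduced to one-line placeholders.

```python
from typing import Callable, Dict, List

# Hypothetical types: a block is a list of instruction strings keyed by guest PC.
GuestBlock = List[str]
HostBlock = List[str]


def translate(guest_block: GuestBlock) -> HostBlock:
    """Toy stand-in for the translator stages named on the slide."""
    parsed = [insn.strip().lower() for insn in guest_block]  # x86 parser & high-level translator
    optimized = [insn for insn in parsed if insn != "nop"]   # high-level optimization (toy: drop nops)
    lowered = [f"host:{insn}" for insn in optimized]         # low-level code generation
    return lowered                                           # low-level optimization/scheduling omitted


class CodeCache:
    """Maps a guest PC (the tag) to already-translated host code."""

    def __init__(self, fetch: Callable[[int], GuestBlock]) -> None:
        self._fetch = fetch
        self._blocks: Dict[int, HostBlock] = {}
        self.misses = 0

    def lookup(self, guest_pc: int) -> HostBlock:
        if guest_pc not in self._blocks:  # tag miss: translate on demand
            self.misses += 1
            self._blocks[guest_pc] = translate(self._fetch(guest_pc))
        return self._blocks[guest_pc]


# Hypothetical one-block guest binary.
binary = {0x1000: ["MOV eax, 1", "NOP", "ADD eax, 2"]}
cache = CodeCache(binary.__getitem__)
first = cache.lookup(0x1000)   # miss: runs the translator
second = cache.lookup(0x1000)  # hit: returns the cached translation
```

Only the first lookup pays the translation cost; later executions of the same block are served from the cache.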

21

Background: “Old” Parallel Translation

25

1. Pipelining Virtual Architectures

Utilize a Tiled Processor as fabric to construct virtual processor

Coarse grain pipelining to exploit parallelism

[Figure: five tiles labeled Translator, Code Cache, Execution, MMU, Data Cache]
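Coarse-grain pipelining can be sketched in software with one worker per stage and FIFO queues between them. This is only an analogy under stated assumptions: threads stand in for tiles, `queue.Queue` stands in for the on-chip network, and the three toy stages are hypothetical placeholders rather than the real translator, code cache, or execution logic.

```python
import queue
import threading

STOP = object()  # sentinel that tells a stage to shut down


def stage(work, inbox, outbox):
    """Run one pipeline stage: apply `work` to each item and forward the result."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown downstream
            return
        outbox.put(work(item))


def run_pipeline(stages, items):
    """Connect the stages with FIFO queues, one thread per stage."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)

    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Toy stages: lowercase the text, split it into tokens, count the tokens.
toy_stages = [str.lower, str.split, len]
results = run_pipeline(toy_stages, ["NOP", "MOV EAX, 1"])
```

Because each stage is a single worker reading from a FIFO, results come out in input order while different items occupy different stages at the same time.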

26

Sequential Translation

[Figure: timeline along a Time axis with segments labeled Execution Time, Translation Time, Optimization Time]

27

2. Speculative Parallel Translation

[Figure: timeline along a Time axis with segments labeled Execution Time, Translation Time, Optimization Time]
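The slides do not spell out the speculation policy, so the following is a sketch under an assumed one: whenever a block is fetched for execution, translation of all its possible successors is started on spare workers. The `PROGRAM` table, `translate`, and `SpeculativeTranslator` are hypothetical names, and a thread pool stands in for translation tiles.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Dict

# Hypothetical guest program: each block lists the PCs it may branch to.
PROGRAM: Dict[int, dict] = {
    0x10: {"code": "cmp", "succ": [0x20, 0x30]},
    0x20: {"code": "add", "succ": [0x40]},
    0x30: {"code": "sub", "succ": [0x40]},
    0x40: {"code": "ret", "succ": []},
}


def translate(pc: int) -> str:
    """Toy translation of one block."""
    return f"host:{PROGRAM[pc]['code']}"


class SpeculativeTranslator:
    def __init__(self, workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._pending: Dict[int, Future] = {}

    def _request(self, pc: int) -> Future:
        if pc not in self._pending:  # translate each block at most once
            self._pending[pc] = self._pool.submit(translate, pc)
        return self._pending[pc]

    def fetch(self, pc: int) -> str:
        code = self._request(pc).result()  # waits only if translation is unfinished
        for succ in PROGRAM[pc]["succ"]:   # speculate on every possible successor
            self._request(succ)
        return code

    def translated_pcs(self):
        return sorted(self._pending)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


translator = SpeculativeTranslator()
executed = [translator.fetch(pc) for pc in (0x10, 0x20, 0x40)]
speculated = translator.translated_pcs()
translator.close()
```

Block `0x30` is translated even though the executed path never reaches it. That wasted work is the price of speculation, which is cheap when idle parallel resources are plentiful.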

29

3. Reconfiguration

Different programs have different characteristics

Processor architect uses benchmarks to choose a “compromise” processor

Static Reconfiguration

Choose a different virtual machine configuration based on the application

Dynamic Reconfiguration

Detect the dynamic needs of a program or program phase and reconfigure at runtime

Reconfiguration has a cost
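A reconfiguration decision that weighs benefit against cost might look like the sketch below. The two tile allocations (6 translation slaves with 4 L2 data banks, and 9 translation slaves with 1 L2 data bank) are taken from the Static Reconfiguration diagrams later in the deck; the miss-rate policy, the function names, and the cycle estimates are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    translation_slaves: int
    l2_data_banks: int


# The two allocations that appear in the Static Reconfiguration diagrams.
MEMORY_HEAVY = Config(translation_slaves=6, l2_data_banks=4)
TRANSLATION_HEAVY = Config(translation_slaves=9, l2_data_banks=1)


def choose_config(code_miss_rate: float, data_miss_rate: float) -> Config:
    """Hypothetical policy: give tiles to whichever subsystem misses more."""
    if code_miss_rate > data_miss_rate:
        return TRANSLATION_HEAVY
    return MEMORY_HEAVY


def reconfigure(current: Config, code_miss_rate: float, data_miss_rate: float,
                cost_cycles: int, expected_gain_cycles: int) -> Config:
    """Switch only when the expected gain outweighs the one-time switching cost."""
    target = choose_config(code_miss_rate, data_miss_rate)
    if target != current and expected_gain_cycles > cost_cycles:
        return target
    return current


# Made-up inputs: a translation-bound phase, evaluated with two gain estimates.
stay = reconfigure(MEMORY_HEAVY, 0.20, 0.05, cost_cycles=1000, expected_gain_cycles=500)
switch = reconfigure(MEMORY_HEAVY, 0.20, 0.05, cost_cycles=1000, expected_gain_cycles=5000)
```

Static reconfiguration corresponds to calling `choose_config` once per application; dynamic reconfiguration corresponds to calling `reconfigure` repeatedly as phases change.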

30

Reconfiguration

[Photo courtesy Intel Corp.]

31

Prototype System and Evaluation

32

Background: Architectures

                  x86 (Pentium III) *                  Raw

ISA               CISC instruction set                 RISC instruction set
                  Hardware Virtual Memory (VM)         No VM
                  Hardware Memory Protection           No Memory Protection
                  Condition Codes used for branching   No Condition Codes

Implementation    Hardware instruction cache           Software managed instruction memory
                  1 superscalar processor core         16 processors arranged in 4x4 mesh
                  3-way parallelism                    4 low latency networks
                  Out-of-order processor               In-order processors

* Photo courtesy Intel Corp.

34

System Design

[Figure: block diagram with the following labels]

Manager

L2 Code Cache

Translation Slave (x6)

Banked L1.5 Code Cache

Runtime -- Execution (L1 Code Cache / L1 I-$, L1 Data Cache / L1 D-$)

MMU, TLB

System Functionality & Loader

L2 Data $ Banks 0-3

35

Methodology

Cycle comparison of Raw vs. Pentium III

All results collected on real hardware

No hardware added

Same binaries (unmodified)

Metric: Slowdown

Raw executing x86 code compared by cycle against Pentium III
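The slowdown metric is a plain ratio of cycle counts for the same unmodified binary. A minimal sketch follows; the function name is hypothetical and the cycle counts are made-up illustration values, not measurements from the talk.

```python
def slowdown(raw_cycles: int, pentium3_cycles: int) -> float:
    """Cycles taken by Raw running the x86 binary, per Pentium III cycle."""
    if pentium3_cycles <= 0:
        raise ValueError("reference cycle count must be positive")
    return raw_cycles / pentium3_cycles


# Made-up cycle counts, for illustration only.
example = slowdown(raw_cycles=30, pentium3_cycles=10)
```

A value of 1.0 would mean the emulated run matches the Pentium III cycle for cycle; larger values mean the emulation needs proportionally more cycles.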

36

Baseline Performance

38

2. Speculative Parallel Translation

[Figure: L2 Code Cache Miss Rate; axis label: Code Cache]

40

3. Static Reconfiguration

[Figure: block diagram with the following labels]

Manager

L2 Code Cache

Translation Slave (x6)

Banked L1.5 Code Cache

Runtime-Exec (L1 I-$, L1 D-$)

MMU, TLB

System Functionality & Loader

L2 Data $ Banks 0-3

41

3. Static Reconfiguration

[Figure: block diagram with the following labels]

Manager

L2 Code Cache

Translation Slave (x9)

Banked L1.5 Code Cache

Runtime-Exec (L1 I-$, L1 D-$)

MMU, TLB

System Functionality & Loader

L2 Data $ Bank 0

43

3. Dynamic Reconfiguration

[Figure: block diagram with the following labels]

Manager

L2 Code Cache

Translation Slave (x6)

Banked L1.5 Code Cache

Runtime-Exec (L1 I-$, L1 D-$)

MMU, TLB

System Functionality & Loader

L2 Data $ Banks 0-3

Translation Slave (x3, in addition to the six above)

44

3. Reconfiguration Zoom

46

Baseline Performance Analysis

49

Baseline Performance Analysis

Intrinsic         Raw Emulator             Pentium III
                  latency   occupancy      latency   occupancy
L1 Cache Hit      6         4              3         1
L2 Cache Hit      87        87             7         1
L2 Cache Miss     151       87             79        1

CPI = memory_access_rate * ((1 - L1_miss_rate) * L1_hit_occupancy + L1_miss_rate * ((1 - L2_miss_rate) * L2_hit_occupancy + L2_miss_rate * L2_miss_occupancy)) + (1 - memory_access_rate) * non_memory_CPI

Memory CPI: 3.9 (Raw Emulator), 1 (Pentium III)
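The CPI model can be written as a small function. This sketch assumes the L2 term is the usual weighted average of L2-hit and L2-miss occupancy, which the formula as transcribed appears to truncate. The occupancies in the usage lines come from the Raw Emulator column of the table; the access and miss rates are hypothetical because the slides do not give them.

```python
def cpi(memory_access_rate: float, l1_miss_rate: float, l2_miss_rate: float,
        l1_hit_occupancy: float, l2_hit_occupancy: float,
        l2_miss_occupancy: float, non_memory_cpi: float) -> float:
    """Average cycles per instruction under the occupancy model on the slide."""
    # Cost of an access that misses in L1 (assumed hit/miss weighted average).
    l2_term = ((1 - l2_miss_rate) * l2_hit_occupancy
               + l2_miss_rate * l2_miss_occupancy)
    # Cost of an average memory instruction.
    memory_cpi = (1 - l1_miss_rate) * l1_hit_occupancy + l1_miss_rate * l2_term
    # Blend memory and non-memory instructions by their frequency.
    return (memory_access_rate * memory_cpi
            + (1 - memory_access_rate) * non_memory_cpi)


# Raw Emulator occupancies from the table: L1 hit 4, L2 hit 87, L2 miss 87.
all_l1_hits = cpi(1.0, 0.0, 0.0, 4, 87, 87, 1.0)      # every access hits in L1
no_memory_ops = cpi(0.0, 0.5, 0.5, 4, 87, 87, 1.0)    # no memory instructions
half_l1_misses = cpi(1.0, 0.5, 0.5, 4, 87, 87, 1.0)   # hypothetical miss rates
```

Because the Raw Emulator's L2 hit and L2 miss occupancies are both 87, the L2 miss rate has no effect on its result under this model; only the L1 miss rate and the memory access rate matter.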

52

Baseline Performance Analysis

Memory System                3.9x slowdown
Lack of ILP                  1.3x slowdown
Condition Codes (Flags)    x 1.1x slowdown
                           ---------------
                             5.5x slowdown

Code Cache Misses            1 – 20x slowdown
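The combined figure is the product of the three individual factors, since each one scales the cycle count independently. A short check of that arithmetic follows; the dictionary keys simply repeat the slide's labels.

```python
import math

# Per-cause slowdown factors as listed on the slide.
factors = {
    "Memory System": 3.9,
    "Lack of ILP": 1.3,
    "Condition Codes (Flags)": 1.1,
}

# Independent slowdowns compose multiplicatively.
combined = math.prod(factors.values())
```

The product of the displayed values is about 5.58, close to the 5.5x reported on the slide; the small gap is presumably due to rounding of the individual factors.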

53

Baseline Performance Analysis

54

Future Work

Hardware additions to facilitate parallel emulation

x86 Server farm on a chip

Inter-Virtual Processor dynamic load balancing

Differing forms of Dynamic Reconfiguration

Vary number of functional units

55

Questions ?

56

Extras