capo - university of illinois urbana-champaign

29
Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay Department of Computer Science University of Illinois at Urbana-Champaign Pablo Montesinos, Matthew Hicks, Samuel T. King and Josep Torrellas

Upload: others

Post on 03-Jan-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Capo - University of Illinois Urbana-Champaign

Capo: A Software-Hardware Interface for Practical

Deterministic Multiprocessor Replay

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

Pablo Montesinos, Matthew Hicks, Samuel T. King and Josep Torrellas

Page 2: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Motivation: Time Travel

Allows us to visit and recreate past states and events in computer

Wide range of uses:

Debugging

Security

Enabled by using Deterministic Replay of Execution

2

Page 3: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Phase I: Initial Execution (a.k.a Recording)

Execute and record certain non-deterministic events into log

Sources of non-determinism: interrupts, memory access interleaving ...

Phase II: Replay

Restore to a previous checkpoint

Re-execute and use log to force software down the same execution path

How Deterministic Replay Works3

Page 4: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

SW-Based Deterministic Replay

Flexible, integrate well with rest of SW stack

Very slow or non-applicable to multiprocessor execution:

Software is slow at capturing memory access interleaving

4

Library

Compiler

Virtual Machine

Operating System

Virtual Machine Monitor

SW-based schemes

Page 5: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

HW-Based Deterministic Replay of Multiprocessors

HW can record interleaving of shared-memory accesses effectively:

Small Memory Access Interleaving Log

Little overhead

Limitation: integration with SW stack is poor

5

Library

Compiler

Virtual Machine

Operating System

Virtual Machine Monitor

Hardware-based schemes

SW-based schemes

Page 6: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Limitations of HW-Based Replay of Multiprocessors

Past proposals focused only on HW primitive for recording and replaying

How does it integrate with the SW stack?

Cannot separate SW being recorded/replayed from the rest

Paradox: where does the SW that manages the logs go?

Require complex VMM or simulator to replay execution

Can’t mix recording, replay and normal execution simultaneously in the machine

6

We must adapt HW-based replay systems and carefully integrate them with SW in order to make

HW-based replay practical

Page 7: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Capo Contributions

SW-HW interface for practical HW-assisted deterministic replay

Works with any HW-based replay system

Replay Sphere: new abstraction

Isolates SW that is being recorded (replayed) from the rest

Separates the responsibilities of the HW and the SW components

CapoOne: Linux-based prototype

7

Page 8: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Replay Sphere: Set of threads recorded andreplayed as a unit and their address space

Only user-mode threads run inside spheres

Threads inside a sphere: R-threads

Replay spheres and processes:

R-threads that share memory must run within same sphere

Many processes can run within the same sphere

Replay Sphere 2

Replaying

Replay Sphere: Isolating Processes8

Replay Sphere 1

Recording

OS

Replay Sphere 1

Recording

Thread103

Thread128

Thread26

R-thread1

R-thread2

R-thread 1

Thread39

R-thread3

CPU1

CPU2

CPU3

CPU4

Replay HW

Replay Sphere Manager

CPU1

CPU3

CPU2

CPU4

Page 9: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Replay Sphere: Separating Responsibilities

HW:

Records memory access interleaving of R-threads running within same sphere

Produces per-sphere Memory Access Interleaving Log

Enforces same memory access interleaving during replay

SW (Replay Sphere Manager):

Logs the other sources of non-determinism that affect the sphere

Produces per-sphere Input Log

Includes system call return values, signals, data copied into the sphere...

Injects data from log into sphere during replay

9

Page 10: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Other Replay Sphere Manager Responsibilities

Assign the same virtual memory addresses during recording/replay

Assign the same IDs to R-threads during recording/replay

Manage Memory Access Interleaving Log and Input Log

Manage replay HW resources

10

Page 11: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Capo’s HW Interface

Works with any HW-based replay system

Per-processor R-Thread Control Block:

Sphere ID register

R-Thread ID register

Per-sphere Replay Sphere Control Block:

Mode register: specifies whether the sphere is recording or replaying

Log pointers: insert to / remove from Memory Access Interleaving Log

11

Page 12: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Virtualizing the Replay HW

Replay sphere manager schedules spheres into hardware contexts

12

Replay Sphere Manager

Sphere 1 Recording

Sphere 2Replaying

Sphere 3Recording

(ready)

Replay Sphere Control Block

Replay SphereControl Block

HW

SWLog 1 Log 2

Log 3

Replay SphereControl Block

Replay Sphere Control Block

ModeLog Pointers

ModeLog Pointers

(running) (running)

Page 13: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Three Key Challenges

Ensuring deterministic interleaving when OS copies data into a sphere

Using fewer processors during replay than were used during recording

Emulating vs. re-executing system calls

13

3

2

1

Page 14: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

OS Copies Data into Spheres

Problem: interleaving between OS copies and R-threads not recorded

Solution: insert copy_to_user into sphere:

HW can log memory access interleaving

copy_to_user exits sphere once copy is over

14

OS

Replay Sphere 1 - Recording

B

Y

E

\0

Replay Sphere Manager

R-thread 1 R-thread 2

copy_to_user

H

I

!

\0

X = buf[2]buf[3] = Y

read(&buf)

Log 1Log 1

buf

Page 15: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Replaying with a Lower Processor Count

Problem: R-thread that should replay next log entry not scheduled in CPU

Solution 1: HW detects problem and raises interrupt

Efficient, but it requires additional HW and SW support

Solution 2: SW inspects Interleaving Log and tries to prevent problem

Not trivial, requires changes to OS scheduler

Solution 3: Do nothing, simply wait for OS to schedule R-thread

Simple, but can hurt performance

15

Page 16: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

DeLorean HW Replay System

Ubuntu Linux

CapoOne: First Capo Implementation

Simulated replay HW:

DeLorean HW system [Montesinos ISCA’08]

Augmented with Capo’s HW interface

Modified 2.6.24 Linux kernel

Supports replay spheres, R-threads

New, deterministic copy_to_user

Split Replay Sphere Manager:

User-level component based on ptrace

Kernel-level component schedules spheres and R-threads

16

Replay Sphere 1

Recording

R-thread1

R-thread2

Replay Sphere Manager

Log 1

ptrace

new_copy_to_user

ReplaySphere

Manager

Replay Sphere Control Blocks R-thread Control Blocks

Page 17: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Also in the Paper

CapoOne implementation details

Lessons learned during CapoOne’s development

Emulating vs. Re-Executing System calls

Using Capo with different HW-Based replay systems

17

Page 18: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

CapoOne Evaluation Setup

Two HW configurations

Simulated DeLorean HW replay system (SIMICS): 4 x86 processors

Real hardware: 4-Core x86 Intel processor without DeLorean HW

SW: Ubuntu 7.10 with Replay Sphere Manager

Modified 2.6.24 Kernel

Benchmarks:

Scientific Benchmarks: SPLASH-2

System benchmarks: Apache, Compilation

18

Page 19: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Overall Log Size

Memory Access Interleaving Log takes most of the space

Small overall log: 3.17 bits/kilo-instruction

19

0

1

2

3

4

SPLASH2-avg Apache-avg Compilation-avg

Log

Siz

e (b

its/k

ilo-in

stru

ctio

n)

Memory Access Interleaving LogInput Log

Page 20: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Recording Performance

Moderate overhead: 21% for SPLASH2 and 41% average for system apps

Minimal timing distortion for debugging concurrency defects

20

0

1

2

SPLASH2-avg Apache-avg Compilation-avg

Nor

mal

ized

Exe

cutio

n Ti

me

Standard executionptrace’s interposition overheadRest of Replay Sphere Manager

Page 21: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Replay Performance: SPLASH-2

Emulating system calls reduces cycles during replay

Replay takes only 80% more cycles

R-Threads must wait for their turn to commit

21

0

1

2

Record Replay

Nor

mal

ized

Cyc

les Execution

Stall

Page 22: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Conclusions

Capo enables practical replay of execution for systems with replay HW

The Replay Sphere is a powerful abstraction

Enable mixing recording, replay and standard execution

CapoOne: first Capo prototype

Working system

Good performance (21-41% recording overhead, 80% replay overhead)

Good for debugging concurrency defects

22

Page 23: Capo - University of Illinois Urbana-Champaign

Capo: A Software-Hardware Interface for Practical

Deterministic Multiprocessor Replay

Department of Computer ScienceUniversity of Illinois at Urbana-Champaign

Pablo Montesinos, Matthew Hicks, Samuel T. King and Josep Torrellas

Page 24: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

CapoOne: Basic HW Operation24

Proc 0 Commit Request Commit Request

7 3

Ok Ok

CS LogA Size R-threadID = 7 R-threadID = 3

Chunk A

Proc 1

CS LogB Size

Chunk BArbiter

Two new per-processor registers: RSID, R-threadID

Arbiter now supports concurrent spheres

Manages an Interleaving Log for each of them

RSIDR-threadID

0

7

RSID

R-threadID

1

3

InterleavingLogs

Page 25: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Today’s Agenda: Towards the Perfect Replay System

DeLorean: New hardware replay engine

Very efficient multiprocessor support

Vastly improved log requirements

Capo: New SW-HW interface for replay

Makes HW-based replay systems practical

Evaluation

Future work

25

Page 26: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

DMA

DMA Log

Network

Overall DeLorean System

Proc 0

PI Log

CS Log

Chunk A Arbiter

Interrupt Log

I/O Log

Proc 1

CS Log

Chunk B

Interrupt Log

I/O Log

26

Interrupt, I/O and DMA logs are common to other HW-based schemes

Page 27: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

CapoOne: HW Implementation

No need for DeLorean’s Interrupt Log, DMA Log nor I/O Log

PI Log becomes the per-sphere Interleaving Log

CS Log becomes a per-R-thread Log

Chunks only have instructions from one application (or the kernel)

27

DMA

NetworkProc 0

PI Log

CS Log

Chunk A Arbiter

Proc 1

CS Log

Chunk B

DMA Log

Interrupt Log

I/O Log

Interrupt Log

I/O Log

Interleaving Log

Page 28: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Replay Sphere 1

Replaying

Replay Sphere 1

Replaying

Emulating vs Re-Executing System Calls

During replay, the RSM emulates most system calls:

RSM injects return values from Sphere Input Log, squashes outputs

Some have to be re-executed

Thread management (clone)

Address space modification (mprotect)

28

R-thread1

R-thread2

read()Log 1 fork()

OS

RSM

sys_fork

thread674

R-thread3

Page 29: Capo - University of Illinois Urbana-Champaign

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Implicit Dependencies

R-thread changes mapping or protection of address space, and another R-thread uses this changed address space

RSM can express these dependencies to hardware so these interactions can be recorded

29

Page Table

Sphere 1

R-thread 1

mprotect X

R-thread 2

while(1){ *x = *x+1}

CPU 1 CPU 2