ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry

Upload: miette

Post on 25-Feb-2016


TRANSCRIPT

Page 1: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Computer Architecture Lab at

Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry

ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Page 2: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 2

Software Errors & Analysis Tools

• Errors abundant in parallel software
– Program crashes/vulnerabilities, limited performance

• Three main categories of analysis tools
– Checking before, during or after program execution

• Instruction-grain Lifeguards
– Online detailed analysis, but with high overhead
– Several tools available, but mostly support for single-threaded code

© Evangelos Vlachos

ParaLog: a framework for efficient analysis of parallel applications

Page 3: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Lifeguards and Parallel Applications

[Figure: application threads over time under three monitoring approaches]

• Timesliced Execution & Analysis: DBI tools available today
– (-) high overhead due to serialization

• Parallel Execution & Analysis with Butterfly Analysis (previous talk): windows of uncertainty
– (-) some false positives
– (+) software-based

• Parallel Execution & Analysis with ParaLog (this talk): precise application order
– (-) new hardware required
– (+) no false positives
– (+) even better performance

Page 4: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

4

Low-Overhead Instruction-level Analysis

© Evangelos Vlachos ASPLOS '10 - ParaLog

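[Diagram: online monitoring platform [Chen et al., ISCA '08]. On the application core, the application thread's instructions go through event capturing into an event stream. On the lifeguard core, event delivery hands each event to the lifeguard thread, which maintains the metadata. Accelerators: IT, IF, MTLB.]

Example: the application instruction

add r1 r2, r4

is delivered as the event "add, r1, r2, r4" and handled by:

add_handler() {
    i = load_state(r2);
    j = load_state(r4);
    if (check(i, j))
        upd_state(r1);
    else
        error();
}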
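The handler shape on this slide can be sketched in Python as follows. This is a minimal sketch and not the Log-Based Architectures implementation: the `Event` record, the `Lifeguard` class, the dictionary used as metadata, and the policy given to `check` (both source registers must be marked valid) are assumptions made for illustration. Only the names `add_handler`, `load_state`, `check`, `upd_state`, and `error` come from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One record from the event stream (hypothetical layout)."""
    opcode: str
    dest: str
    srcs: tuple


class Lifeguard:
    def __init__(self):
        self.state = {}    # metadata: register name -> valid flag (assumed)
        self.errors = []   # events that failed their check
        self.handlers = {"add": self.add_handler}

    def load_state(self, reg):
        return self.state.get(reg, False)

    def upd_state(self, reg):
        self.state[reg] = True

    def check(self, i, j):
        # Assumed policy: the result is valid only if both sources are.
        return i and j

    def error(self, event):
        self.errors.append(event)

    def add_handler(self, event):
        i = self.load_state(event.srcs[0])
        j = self.load_state(event.srcs[1])
        if self.check(i, j):
            self.upd_state(event.dest)
        else:
            self.error(event)

    def deliver(self, event):
        # Event delivery: dispatch on the opcode of the captured instruction.
        self.handlers[event.opcode](event)
```

A lifeguard with `r2` and `r4` marked valid accepts `Event("add", "r1", ("r2", "r4"))` and marks `r1` valid; an event with an unmarked source is recorded in `errors` instead.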

Page 5: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 5

Challenges in Parallel Monitoring

[Diagram: online parallel monitoring platform (ParaLog). Each application thread 1..k feeds its own lifeguard thread 1..k through event capturing, an event stream, and event delivery. Each lifeguard thread has its own accelerators (IT, IF, MTLB), and all lifeguard threads share global metadata.]

Page 6: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 6

Addressing the Challenges

1. Application event ordering

2. Ensuring metadata access atomicity efficiently

3. Parallelizing hardware accelerators

[Diagram: online parallel monitoring platform (ParaLog). For each of the application threads 1..k, event capturing is extended with application-only order capturing, and event delivery to the matching lifeguard thread is extended with order enforcing. Dependence arcs connect the event streams. Each lifeguard thread has its own accelerators (IT, IF, MTLB), and all lifeguard threads share global metadata.]

Page 7: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 7

Outline

• Introduction

• Addressing the Challenges of Parallel Monitoring
1. Capturing & enforcing application event ordering
2. Ensuring metadata access atomicity
3. Parallelizing hardware accelerators

• Evaluation

• Conclusions

© Evangelos Vlachos

Page 8: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 8

Event Ordering: the Problem

• Case Study: Information flow analysis (e.g., TaintCheck)

[Diagram: in application time, thread j executes store(A) before thread k executes load(A). In lifeguard time, nothing orders lifeguard thread j's st_handler(A) before lifeguard thread k's ld_handler(A), so the handlers may run in the opposite order.]

Expose happens-before information to lifeguards

Page 9: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 9

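Event Ordering: the Solution (1/2)

• Coherence-based ordering of application events
– Similar to FDR, but online, focusing on application-only events

[Diagram: application thread j executes store(A) at time t_j (its timeline shows t_j - 1, t_j, t_j + 1); application thread k then executes load(A) at time t_k (its timeline shows t_k - 1, t_k, t_k + 1). The load carries the dependence arc {thread j, t_j}. Each lifeguard thread publishes its progress: progress_j advances through t_j - 2, t_j - 1, t_j, and progress_k through t_k - 2, t_k - 1, t_k. Lifeguard thread j runs st_handler(A); lifeguard thread k runs ld_handler(A) only after it.]

Lifeguard thread k: wait while progress_j < t_j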
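The "wait while progress_j < t_j" rule can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not ParaLog's hardware mechanism: the function name `run_lifeguards`, the log layout `(timestamp, handler_name, arc)`, the round-robin scheduling, and the convention that `progress[t]` holds the timestamp of the last event lifeguard `t` finished are all hypothetical. Timestamps are assumed to start at 1 so that a progress value of 0 means "nothing processed yet".

```python
def run_lifeguards(logs):
    """Process per-thread event logs while honoring dependence arcs.

    logs maps a thread name to a list of (timestamp, handler_name, arc),
    where arc is None or (source_thread, source_timestamp).
    Returns the global order in which handlers ran.
    """
    progress = {t: 0 for t in logs}   # last timestamp each lifeguard finished
    pos = {t: 0 for t in logs}        # next log index per lifeguard
    order = []
    while any(pos[t] < len(logs[t]) for t in logs):
        advanced = False
        for t in logs:                # simple round-robin over lifeguards
            if pos[t] == len(logs[t]):
                continue
            ts, name, arc = logs[t][pos[t]]
            if arc is not None and progress[arc[0]] < arc[1]:
                continue              # wait while progress_j < t_j
            order.append((t, name))
            progress[t] = ts
            pos[t] += 1
            advanced = True
        if not advanced:
            raise RuntimeError("dependence arcs form a cycle")
    return order
```

With `logs = {"k": [(1, "ld_handler(A)", ("j", 2))], "j": [(1, "nop", None), (2, "st_handler(A)", None)]}`, lifeguard k stalls until lifeguard j has finished its event with timestamp 2, so `st_handler(A)` always runs before `ld_handler(A)`.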

Page 10: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 10

Event Ordering: the Solution (2/2)

Is monitoring coherence enough?

• Previous work has not solved the problem of Logical Races
• Both logical races and system calls resolved with Conflict Alert messages

[Diagram: in application time, thread j calls free(A) while thread k executes load(A). In lifeguard time, lifeguard thread j's handling of free(A) (from "free(A) start" to "free(A) end") updates Metadata(A), which lifeguard thread k's ld_handler(A) reads: a logical race. A Conflict Alert message supplies the dependence between the two.]

Page 11: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 11

Metadata Atomicity

• Frequent use of locking too expensive
– # of instructions added & synchronization cost

• Dependence arcs handle the majority of the cases
– Sufficient conditions:
1. One-to-one data-to-metadata mapping
2. Application reads don’t become metadata writes
– Enforcing dependence arcs then gives race-free operation

• Rest of the cases handled by acquiring a lock
– Lock used only in the load_handler(); other handlers safe

© Evangelos Vlachos

(more details in the paper)
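The lock-only-in-the-load-handler idea can be sketched as follows. This is an illustrative assumption, not a lifeguard from the paper: `CountingLifeguard` is a hypothetical lifeguard that violates the second condition (an application read becomes a metadata write), so its `load_handler` takes a lock, while its `store_handler` relies on dependence arcs having already ordered conflicting accesses.

```python
import threading


class CountingLifeguard:
    """Hypothetical lifeguard whose load handler WRITES metadata."""

    def __init__(self):
        self.read_count = {}            # metadata: address -> reads seen
        self.lock = threading.Lock()

    def load_handler(self, addr):
        # Slow path: concurrent application reads carry no dependence arc,
        # so the read-modify-write of the metadata is guarded by a lock.
        with self.lock:
            self.read_count[addr] = self.read_count.get(addr, 0) + 1

    def store_handler(self, addr):
        # Assumed safe without a lock: conflicting application accesses are
        # ordered by dependence arcs before this handler runs.
        self.read_count[addr] = 0


def run_concurrent_reads(lifeguard, addr, n_threads, reads_per_thread):
    """Drive load_handler from several threads and return the final count."""
    def worker():
        for _ in range(reads_per_thread):
            lifeguard.load_handler(addr)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lifeguard.read_count[addr]
```

A lifeguard that satisfies both sufficient conditions would not need the lock at all, which is the common case the slide describes.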

Page 12: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

12

Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
– Metadata-TLB; fast metadata address calculation
– Idempotent Filters; filter out redundant checking
– Inheritance Tracking; fast tracking of dataflow paths

• Accelerators have only local view of the analysis
– Cache locally analysis information (e.g., frequent events)
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators’ state

© Evangelos Vlachos ASPLOS '10 - ParaLog

Page 13: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 13

Outline

• Introduction

• Addressing the Challenges of Parallel Monitoring
– Capturing & enforcing application event ordering
– Ensuring metadata access atomicity
– Parallelizing hardware accelerators

• Evaluation

• Conclusions

© Evangelos Vlachos

Page 14: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 14

Experimental Framework

© Evangelos Vlachos

• Log-Based Architectures framework
– Simics full-system simulation
– CMP system with {2, 4, 8, 16} cores
– {1, 2, 4, 8} application threads, each paired with a lifeguard thread
– Sequentially Consistent memory model

• Benchmarks and multithreaded Lifeguards used
– SPLASH-2 and PARSEC
– TaintCheck: Information flow tracking; accelerated by M-TLB, IT
– AddrCheck: Memory access checking; accelerated by M-TLB, IF

• Comparison with Timesliced Monitoring

Page 15: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 15

Performance Results: AddrCheck

[Figure: AddrCheck, Normalized Execution Time (y-axis 0.0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog.]

8 app/lifeguard threads, 16 cores total

Normalized to sequential, unmonitored

Page 16: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 16

Performance Results: AddrCheck

[Figure: AddrCheck, Normalized Execution Time (y-axis 0.0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog.]

© Evangelos Vlachos

Page 17: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

17

Performance Results: AddrCheck

[Figure: AddrCheck, Normalized Execution Time (y-axis 0.0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog. Value labels on bars: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4.]

• Timesliced Monitoring is not scalable
• On average 15x slowdown over No Monitoring (8 threads)

Page 18: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 18

Performance Results: AddrCheck

[Figure: AddrCheck, Normalized Execution Time (y-axis 0.0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog.]

• Highest overhead with 8 threads: SWAPTIONS 6x
• Lowest overhead with 8 threads: < 5%
• Average overhead with 8 threads: 26%

Page 19: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 19

Performance Results: TaintCheck

© Evangelos Vlachos

[Figure: TaintCheck, Normalized Execution Time (y-axis 0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog.]

Page 20: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 20

Performance Results: TaintCheck

© Evangelos Vlachos

[Figure: TaintCheck, Normalized Execution Time (y-axis 0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog. Value labels on bars: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7.]

• Timesliced Monitoring is not scalable
• On average 23x slowdown over No Monitoring (8 threads)

Page 21: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 21

Performance Results: TaintCheck

© Evangelos Vlachos

[Figure: TaintCheck, Normalized Execution Time (y-axis 0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog.]

• Highest overhead with 8 threads: BARNES 2.6x
• Lowest overhead with 8 threads: LU 5%
• Average overhead with 8 threads: 48%

Page 22: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 22

Other Results in the Paper

• Order capturing and order enforcing under TSO

• Performance Impact of Lifeguard Accelerators
– AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]

• A less expensive order capturing mechanism gets similar performance results
– 1 timestamp per core vs. 1 timestamp per cache block

© Evangelos Vlachos

Page 23: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

23

Conclusions

• ParaLog: Fast and precise parallel monitoring

• Components of event ordering
– Normal memory accesses: monitor coherence activity
– Logical Races: use of Conflict Alert messages

• Metadata Atomicity
– Enforcing dependence arcs ensures atomicity (most cases)

• Parallel Hardware Accelerators
– Flush local state on remote events (Conflict Alert)

• Average overhead is relatively low
– AddrCheck: 26% and TaintCheck: 48% (8 threads)

© Evangelos Vlachos ASPLOS '10 - ParaLog

Page 24: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 24

Questions ?

© Evangelos Vlachos

Page 25: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 25

Backup Slides

© Evangelos Vlachos

Page 26: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 26

Metadata Atomicity

• Synchronization-free fast path vs. slow path
– Concurrent application reads; no ordering available!
• Concurrent metadata reads: follow the fast path
• Concurrent metadata writes: follow the slow path, acquiring a lock
• Concurrent metadata read and write: read may get either value
– In any other case dependence arcs are available

© Evangelos Vlachos

Application Event Lifeguard ActionR R WW R W

AddrCheckTaintCheckMemCheck

LockSet

Page 27: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 27

Parallel Hardware Accelerators

• Accelerators have only local view of the analysis
– Important events have system-wide effects
– Case study: Idempotent Filters and AddrCheck

[Diagram: lifeguards LG 0 and LG 1 each sit behind their own Idempotent Filter (IF). Each stream of read events such as R(A), R(B), R(C) is filtered: ✔ marks an event delivered to the lifeguard, ✖ marks a redundant event that is discarded. When free(A) occurs, the local and remote IF filters are flushed, so the next R(A) is delivered again.]

• Flushing local and remote IF filters builds on Remote Conflict Messages

• Details for parallel M-TLB and IT can be found in the paper
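The Idempotent Filter case study can be sketched as follows. This is a minimal software model under stated assumptions, not the hardware design: the class names `IdempotentFilter` and `Platform`, the unbounded `set` used as filter storage, and the choice to flush every filter on every `free` are hypothetical simplifications.

```python
class IdempotentFilter:
    """Drops events that were already delivered since the last flush."""

    def __init__(self):
        self.seen = set()   # assumed unbounded; real filters are small caches

    def admit(self, event):
        if event in self.seen:
            return False    # redundant; discarded
        self.seen.add(event)
        return True         # delivered to the lifeguard

    def flush(self):
        self.seen.clear()


class Platform:
    """One filter and one delivered-event list per lifeguard."""

    def __init__(self, n_lifeguards):
        self.filters = [IdempotentFilter() for _ in range(n_lifeguards)]
        self.delivered = [[] for _ in range(n_lifeguards)]

    def read(self, lg, addr):
        event = ("R", addr)
        if self.filters[lg].admit(event):
            self.delivered[lg].append(event)
            return True
        return False

    def free(self, lg, addr):
        # free() has application-wide effects: flush local and remote
        # filters (modeling the effect of a Conflict Alert), then deliver.
        for f in self.filters:
            f.flush()
        self.delivered[lg].append(("free", addr))
```

Without the remote flush, a later `R(A)` on the other lifeguard would be filtered as redundant and AddrCheck would never see the access to freed memory.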

Page 28: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

28

Performance Impact of Lifeguard Accelerators

© Evangelos Vlachos ASPLOS '10 - ParaLog

[Figure: TaintCheck slowdown with 8 threads, Not Accelerated vs. Accelerated (y-axis 0 to 6; taller bars carry value labels)]

Benchmark     Not Accelerated   Accelerated
BARNES        5.7               2.6
LU            9.4               1.0
OCEAN         6.8               1.1
BLACKSCH.     4.2               1.3
FLUIDANIM.    4.3               1.4
SWAPTIONS     5.4               2.2
FMM           7.3               1.4
RADIOSITY     11.3              1.5

• Accelerators provide a major speedup [2x – 9x]

Page 29: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 29

Performance Impact of Lifeguard Accelerators

© Evangelos Vlachos

• Accelerators provide a major speedup [1.13x – 3.4x]

[Figure: AddrCheck slowdown with 8 threads, Not Accelerated vs. Accelerated, for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY (y-axis 0 to 9). Bar labels in extraction order: 3.9, 3.2, 1.0, 1.4, 1.1, 8.4, 1.0, 1.4, 1.1, 1.0, 1.0, 1.0, 1.0, 6.0, 1.0, 1.0.]

Page 30: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 30

Transitive Reduction Sensitivity Study

© Evangelos Vlachos

[Figure: TaintCheck slowdown with 8 threads, Limited (1 timestamp / cache) vs. Ideal (1 timestamp / cache block) (y-axis 0 to 3.5)]

Benchmark     Limited   Ideal
BARNES        2.9       2.6
LU            1.1       1.0
OCEAN         1.2       1.1
BLACKSCH.     1.3       1.3
FLUIDANIM.    1.5       1.4
SWAPTIONS     3.1       2.2
FMM           1.5       1.4
RADIOSITY     1.6       1.5

• Limited transitive reduction
– No major performance impact; savings in chip area

Page 31: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

31

Supporting Total Store Order (TSO)

• Cycle of dependencies in relaxed memory models
– TSO relaxes the RAW ordering
– Previous work (RTR): maintain versions of data
– Identify SC offending instructions; save loaded value

• This paper: maintain versions of metadata

[Diagram: two threads shown in commit order 0, 1, 2]

Thread 0: Wr(A), then Rd(B)
Thread 1: Wr(B), then Rd(A)

Memory Order: P(v1, A), C(v0, B), P(v0, B), C(v1, A)

Log 0: Wr(A); Rd(B, v0)
Log 1: Wr(B); Rd(A, v1)

Lifeguard 0:
produce_version(v1, A)
store_handler(A)
wait_until_available(v0, B)
load_handler(B, v0)
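One way to read the lifeguard steps on this slide is sketched below. It rests on an interpretation that the slide does not spell out: a "version" is assumed to be a snapshot of an address's metadata taken just before a store handler overwrites it, so that a load which saw the old value can be checked against the old metadata. The class `VersionedMetadata`, the `is_available` helper, and the use of an exception in place of blocking in `wait_until_available` are hypothetical.

```python
class VersionedMetadata:
    """Current metadata plus named snapshots of older metadata."""

    def __init__(self, initial):
        self.current = dict(initial)   # address -> metadata value
        self.versions = {}             # (version, address) -> snapshot

    def produce_version(self, version, addr):
        # Snapshot the metadata BEFORE the store handler overwrites it.
        self.versions[(version, addr)] = self.current[addr]

    def store_handler(self, addr, value):
        self.current[addr] = value

    def is_available(self, version, addr):
        return (version, addr) in self.versions

    def load_handler(self, addr, version=None):
        if version is None:
            return self.current[addr]
        if not self.is_available(version, addr):
            # A real lifeguard would stall here (wait_until_available).
            raise LookupError("version not produced yet")
        return self.versions[(version, addr)]
```

Replaying the slide's example, lifeguard 0 calls `produce_version("v1", "A")` and then `store_handler("A", ...)`, lifeguard 1 does the same for `"v0"` and `"B"`, and each load handler then reads the snapshot named in its log entry rather than the overwritten metadata.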

Page 32: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 32

Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
– Fast metadata address calculation – Metadata-TLB
– Fast tracking of data-flow paths – Inheritance Tracking
– Filter out redundant checking – Idempotent Filters

• Per-instruction checking gives the same result; cache event

• Accelerators have only local view of the analysis
– Important events have system-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush state and deliver pending events

© Evangelos Vlachos

Page 33: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 33

Experimental Framework

Benchmarks      Input
barnes          16K bodies
ocean           Grid: 258 x 258
lu              Matrix: 1024 x 1024
fmm             32768 particles
radiosity       Base problem
blackscholes    Simlarge
fluidanimate    Simlarge
swaptions       Simlarge

Simulation Parameters
Cores                  {2, 4, 8, 16}, 1 GHz, In-Order scalar x86
L1I & L1D (private)    64KB, 64B line, 4-way assoc.
L2 (shared)            {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
Memory                 90-cycle latency
Log Buffer             64KB per thread

Multithreaded Lifeguards
TaintCheck: Information flow tracking; accelerated by M-TLB and IT
AddrCheck: Memory access checking; accelerated by M-TLB and IF

© Evangelos Vlachos

Page 34: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 34

Relative Slowdown - TaintCheck

[Figure: TaintCheck, Slowdown (y-axis 0.0 to 3.0) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY. Each bar is broken down into Useful Work, Waiting for Dependence, and Waiting for Application.]

© Evangelos Vlachos

Page 35: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 35

Relative Slowdown - AddrCheck

[Figure: AddrCheck, Slowdown (y-axis 0.0 to 1.4) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY. Each bar is broken down into Useful Work, Waiting for Dependence, and Waiting for Application. Value labels on bars: 3.0, 6.0.]

© Evangelos Vlachos

Page 36: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 36

Performance Results - AddrCheck

[Figure: AddrCheck, Normalized Execution Time (y-axis 0.0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog. Value labels on bars: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4.]

Page 37: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

ASPLOS '10 - ParaLog 37

Performance Results - TaintCheck

© Evangelos Vlachos

[Figure: TaintCheck, Normalized Execution Time (y-axis 0 to 1.6) with 1, 2, 4, and 8 threads for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM, and RADIOSITY, plus GEO. MEAN. Series: No Monitoring, Timesliced Monitoring, ParaLog. Value labels on bars: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7.]

Page 38: ParaLog : Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

38

Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
– Metadata-TLB & Inheritance Tracking (discussed in the paper)
– Idempotent Filters; identify and filter out redundant checking
  • Per-instruction checking gives the same result
  • Cache incoming event and local state to identify redundancy

• Accelerators have only local view of the analysis
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators’ state

© Evangelos Vlachos ASPLOS '10 - ParaLog