ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Post on 15-Jan-2016



Computer Architecture Lab at

Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry

ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications

Software Errors & Analysis Tools

• Errors abundant in parallel software
– Program crashes/vulnerabilities, limited performance

• Three main categories of analysis tools
– Checking before, during or after program execution

• Instruction-grain Lifeguards
– Online detailed analysis, but with high overhead
– Several tools available, but mostly support for single-threaded code

2© Evangelos Vlachos ASPLOS '10 - ParaLog

ParaLog: a framework for efficient analysis of parallel applications

Lifeguards and Parallel Applications

Figure: application threads over time under three monitoring approaches.

• Timesliced Execution & Analysis (DBI tools available today)
- high overhead due to serialization

• Parallel Execution & Analysis: Butterfly Analysis (previous talk)
- windows of uncertainty
- some false positives
+ software-based

• Parallel Execution & Analysis: ParaLog (this talk)
- new hardware required
+ precise application order
+ no false positives
+ even better performance

Low-Overhead Instruction-level Analysis

Figure: online monitoring platform [Chen et al., ISCA’08]. On the application core, event capturing turns the application thread's instructions into an event stream. Event delivery hands each event to the lifeguard thread on the lifeguard core, which maintains the metadata. Accelerators: IT, IF, MTLB.

Example: the application instruction "add r1 r2, r4" is captured as the event record "add, r1, r2, r4" and delivered to the matching lifeguard handler:

add_handler() {
    i = load_state(r2);
    j = load_state(r4);
    if (check(i, j))
        upd_state(r1);
    else
        error();
}
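The add handler on this slide can be mimicked in a few lines of Python. This is only a minimal sketch under stated assumptions: the class name `Lifeguard`, the `deliver` method, the one-boolean-per-register metadata, and the rule used for the check (both sources must be marked as defined) are illustrative choices, not the Log-Based Architectures implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Lifeguard:
    """Toy lifeguard: one boolean of metadata per register (an assumption)."""
    state: dict = field(default_factory=dict)    # register name -> "defined" flag
    errors: list = field(default_factory=list)   # events that failed the check

    def load_state(self, reg):
        return self.state.get(reg, False)

    def add_handler(self, dst, src1, src2):
        # Mirrors the slide: read source metadata, check it, update the destination.
        i = self.load_state(src1)
        j = self.load_state(src2)
        if i and j:                               # hypothetical check(i, j)
            self.state[dst] = True                # upd_state(dst)
        else:
            self.errors.append(("add", dst, src1, src2))  # error()

    def deliver(self, event):
        # An event record is (opcode, operands...), e.g. ("add", "r1", "r2", "r4").
        opcode, *operands = event
        getattr(self, opcode + "_handler")(*operands)
```

Delivering `("add", "r1", "r2", "r4")` to a lifeguard whose metadata marks `r2` and `r4` as defined marks `r1` as defined; if either source is not marked, the event is recorded as an error instead.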

Challenges in Parallel Monitoring

Figure: online parallel monitoring platform [ParaLog]. Each application thread 1..k has its own event stream (event capturing, event delivery) feeding lifeguard thread 1..k. The lifeguard threads share global metadata, and each has its own accelerators (IT, IF, MTLB).

Addressing the Challenges

1. Application event ordering

2. Ensuring metadata access atomicity efficiently

3. Parallelizing hardware accelerators

Figure: the online parallel monitoring platform [ParaLog], with application-only order capturing added on the application side and order enforcing added on the lifeguard side of each of the k event streams. Dependence arcs relate the streams, and lifeguard threads 1..k share global metadata.

Outline

• Introduction

• Addressing the Challenges of Parallel Monitoring
1. Capturing & enforcing application event ordering
2. Ensuring metadata access atomicity
3. Parallelizing hardware accelerators

• Evaluation

• Conclusions

Event Ordering: the Problem

• Case Study: Information flow analysis (i.e., TaintCheck)

Figure: in application time, thread j executes store(A) and thread k then executes load(A). In lifeguard time, st_handler(A) runs on lifeguard thread j and ld_handler(A) on lifeguard thread k. The load event is tagged {thread j, tj}, and each lifeguard thread keeps a progress counter (progressj, progressk).

• Expose happens-before information to lifeguards

Event Ordering: the solution (1/2)

• Coherence-based ordering of application events
– Similar to FDR, but online, focusing on application-only events

Figure: application thread j executes store(A) at time tj (timeline tj - 1, tj, tj + 1) and application thread k executes load(A) at time tk (timeline tk - 1, tk, tk + 1). Lifeguard thread j runs st_handler(A); lifeguard thread k runs ld_handler(A) only after "wait while progressj < tj" completes.
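The "wait while progressj < tj" rule can be sketched with a shared table of progress counters. The sketch below is an assumption-laden illustration: the `Progress` class, its method names, and the convention that a counter holds the timestamp of the last event a lifeguard has finished are hypothetical, and Python threads stand in for lifeguard cores.

```python
import threading


class Progress:
    """Per-lifeguard progress counters shared by all lifeguard threads."""

    def __init__(self, n_threads):
        self._done = [0] * n_threads      # assumed: timestamp of last finished event
        self._cond = threading.Condition()

    def advance(self, tid, timestamp):
        with self._cond:
            self._done[tid] = timestamp
            self._cond.notify_all()

    def wait_for(self, tid, timestamp):
        # Implements "wait while progress[tid] < timestamp".
        with self._cond:
            self._cond.wait_for(lambda: self._done[tid] >= timestamp)


def run_example():
    progress = Progress(2)
    order = []                            # order in which handlers actually ran
    J, K = 0, 1

    def lifeguard_j():
        order.append("st_handler(A)")     # event with timestamp t_j = 1 on thread j
        progress.advance(J, 1)

    def lifeguard_k():
        progress.wait_for(J, 1)           # dependence arc {thread j, t_j = 1}
        order.append("ld_handler(A)")
        progress.advance(K, 1)

    tk = threading.Thread(target=lifeguard_k)
    tj = threading.Thread(target=lifeguard_j)
    tk.start()                            # start the dependent lifeguard first on purpose
    tj.start()
    tk.join()
    tj.join()
    return order
```

Even though lifeguard k is started first, `run_example()` always reports the store handler before the load handler, because the load handler blocks until lifeguard j's counter reaches the timestamp named in the arc.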

Is monitoring coherence enough?

Event Ordering: the Solution (2/2)

• Previous work has not solved the problem of Logical Races

• Both logical races and system calls resolved with Conflict Alert messages

Figure: application thread j executes free(A) and application thread k executes load(A), a logical race. Lifeguard thread j updates Metadata(A) between "free(A) start" and "free(A) end"; lifeguard thread k runs ld_handler(A). A Conflict Alert message creates the dependence between the two.
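One way to picture a Conflict Alert is as a broadcast that plants a dependence arc in every other thread's event log. The sketch below is a simplified model, not the ParaLog hardware mechanism: the log format, the function names, the single-threaded round-robin `replay`, and the assumption that the load follows the free in application time are all illustrative.

```python
def broadcast_conflict_alert(logs, source, timestamp):
    """Append an arc on (source, timestamp) to every other thread's log."""
    for tid, log in logs.items():
        if tid != source:
            log.append(("arc", source, timestamp))


def replay(logs):
    """Round-robin replay of lifeguard logs that honours dependence arcs."""
    progress = {tid: 0 for tid in logs}   # handler events finished per lifeguard
    cursor = {tid: 0 for tid in logs}
    order = []
    while any(cursor[t] < len(logs[t]) for t in logs):
        advanced = False
        for tid in sorted(logs, reverse=True):   # deliberately favour later threads
            if cursor[tid] == len(logs[tid]):
                continue
            entry = logs[tid][cursor[tid]]
            if entry[0] == "arc":
                _, src, ts = entry
                if progress[src] < ts:
                    continue                      # stall until the source catches up
            else:
                order.append((tid, entry[0], entry[1]))
                progress[tid] += 1
            cursor[tid] += 1
            advanced = True
        if not advanced:
            raise RuntimeError("unsatisfiable dependence arcs")
    return order
```

With logs `{"j": [("free", "A")], "k": [("load", "A")]}` and no alert, this replay runs the load handler first. Calling `broadcast_conflict_alert(logs, "j", 1)` before the load is logged forces the free handler to run first.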

Metadata Atomicity

• Frequent use of locking too expensive
– # of instructions added & synchronization cost

• Dependence arcs handle the majority of the cases
– Sufficient conditions:
1. One-to-one data-to-metadata mapping
2. Application reads don’t become metadata writes
– Enforcing dependence arcs gives race-free operation

• Rest of the cases handled by acquiring a lock
– Lock used only in the load_handler(); other handlers safe

(more details in the paper)
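The split between lock-free handlers and a locked load handler can be illustrated with a toy lifeguard whose load handler writes metadata, which is the case where the second sufficient condition fails. The class `ReaderSetLifeguard`, its reader-set metadata, and the helper `run_readers` are hypothetical names chosen for this sketch; the paper's lifeguards are not reproduced here.

```python
import threading


class ReaderSetLifeguard:
    """Toy lifeguard whose load handler writes metadata (a set of reader ids)."""

    def __init__(self):
        self.readers = {}                  # address -> set of thread ids (metadata)
        self._lock = threading.Lock()      # slow path, used only by load_handler
        self.lock_acquisitions = 0

    def load_handler(self, addr, tid):
        # Concurrent application reads have no dependence arc between them,
        # so this metadata write is protected explicitly.
        with self._lock:
            self.lock_acquisitions += 1
            self.readers.setdefault(addr, set()).add(tid)

    def store_handler(self, addr):
        # Assumed to be ordered against conflicting accesses by dependence
        # arcs already, so no lock is taken here.
        self.readers[addr] = set()


def run_readers(lifeguard, addr, n_threads, reads_each):
    """Run n_threads lifeguard threads that each handle reads_each loads."""
    def worker(tid):
        for _ in range(reads_each):
            lifeguard.load_handler(addr, tid)

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Only the load handler pays for synchronization; the store handler relies on ordering that is assumed to have been enforced before it runs.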

Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
– Metadata-TLB; fast metadata address calculation
– Idempotent Filters; filter out redundant checking
– Inheritance Tracking; fast tracking of dataflow paths

• Accelerators have only local view of the analysis
– Cache locally analysis information (e.g., frequent events)
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators’ state
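The interaction between an Idempotent Filter and a Conflict Alert can be sketched as a per-lifeguard set of already-seen events that is cleared whenever an alert arrives. This is a software caricature of a hardware structure: `IdempotentFilter`, `admit`, `flush`, and `conflict_alert` are hypothetical names, and flushing every filter on every alert is an assumption made for simplicity.

```python
class IdempotentFilter:
    """Toy per-lifeguard filter that drops events seen since the last flush."""

    def __init__(self):
        self._seen = set()

    def admit(self, event):
        """Return True if the event must be delivered to the lifeguard."""
        if event in self._seen:
            return False                   # redundant; discarded
        self._seen.add(event)
        return True

    def flush(self):
        self._seen.clear()


def conflict_alert(filters):
    """Assumed effect of a Conflict Alert: flush local and remote filters."""
    for f in filters:
        f.flush()
```

For a stream R(A), R(B), R(A), R(A), the filter delivers the first two events and discards the repeats. After a free(A) triggers `conflict_alert`, the next R(A) is delivered again, so the lifeguard rechecks the access against the updated metadata.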

Outline

• Introduction

• Addressing the Challenges of Parallel Monitoring
– Capturing & enforcing application event ordering
– Ensuring metadata access atomicity
– Parallelizing hardware accelerators

• Evaluation

• Conclusions

Experimental Framework

• Log-Based Architectures framework
– Simics full-system simulation
– CMP system with {2, 4, 8, 16} cores
– {1, 2, 4, 8} application threads and as many lifeguard threads
– Sequentially Consistent memory model

• Benchmarks and multithreaded Lifeguards used
– SPLASH-2 and PARSEC
– TaintCheck: Information flow tracking; accelerated by M-TLB, IT
– AddrCheck: Memory access checking; accelerated by M-TLB, IF

• Comparison with Timesliced Monitoring

Performance Results: AddrCheck

Figure: AddrCheck results, normalized to sequential, unmonitored execution (8 app/lifeguard threads, 16 cores total). Bar labels recovered from the chart: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4.

• Timesliced Monitoring is not scalable
– On average 15x slowdown over No Monitoring (8 threads)

• ParaLog with 8 threads
– Highest overhead: SWAPTIONS 6x
– Lowest overhead: < 5%
– Average overhead: 26%

Performance Results: TaintCheck

Figure: TaintCheck results. Bar labels recovered from the chart: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7.

• Timesliced Monitoring is not scalable
– On average 23x slowdown over No Monitoring (8 threads)

• ParaLog with 8 threads
– Highest overhead: BARNES 2.6x
– Lowest overhead: LU 5%
– Average overhead: 48%

Other Results in the Paper

• Order capturing and order enforcing under TSO

• Performance Impact of Lifeguard Accelerators
– AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]

• A less expensive order capturing mechanism gets similar performance results
– 1 timestamp per core vs. 1 timestamp per cache block

Conclusions

• ParaLog: Fast and precise parallel monitoring

• Components of event ordering
– Normal memory accesses: monitor coherence activity
– Logical Races: use of Conflict Alert messages

• Metadata Atomicity
– Enforcing dependence arcs ensures atomicity (most cases)

• Parallel Hardware Accelerators
– Flush local state on remote events (Conflict Alert)

• Average overhead is relatively low
– AddrCheck: 26% and TaintCheck: 48% (8 threads)

Questions?

Backup Slides

Metadata Atomicity

• Synchronization-free fast path vs. slow path
– Concurrent application reads; no ordering available!

• Concurrent metadata reads: follow the fast-path

• Concurrent metadata writes: follow slow-path acquiring a lock

• Concurrent metadata read and write: read may get either value

– In any other case dependence arcs are available

Table (partially recovered): application event (R or W) against lifeguard metadata action (R or W). Lifeguards named in the table: AddrCheck, TaintCheck, MemCheck, LockSet.

Parallel Hardware Accelerators

• Accelerators have only local view of the analysis
– Important events have system-wide effects
– Case study: Idempotent Filters and AddrCheck

Figure: event streams of lifeguards LG 0 and LG 1 passing through their Idempotent Filters (IF). The streams contain reads such as R(A), R(B) and R(C); each event is marked either ✔ (delivered to lifeguard) or ✖ (redundant; discarded). A free(A) flushes the local and remote IF filters, and an R(A) follows the flush. This builds on remote conflict messages.

• Details for parallel M-TLB and IT can be found in the paper

Performance Impact of Lifeguard Accelerators

Figure: performance impact of lifeguard accelerators (two charts). Bar labels recovered from the first chart: 9.4, 6.8, 7.3, 11.3.

• TaintCheck: accelerators provide a major speedup [2x – 9x]

• AddrCheck: accelerators provide a major speedup [1.13x – 3.4x]

Transitive Reduction Sensitivity Study

• Limited transitive reduction
– No major performance impact; savings in chip area
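The idea behind transitive reduction of dependence arcs can be shown with a simplified per-source rule: an arc is redundant if the receiving thread has already recorded an arc from the same source thread with an equal or later timestamp. The function below is a sketch of that simplified rule only; `reduce_arcs` is a hypothetical name, and the exact reduction studied in the paper may differ.

```python
def reduce_arcs(arcs):
    """Drop arcs implied by an earlier arc from the same source thread.

    `arcs` is the sequence of (source_thread, timestamp) pairs observed by
    one thread, in that thread's program order.
    """
    latest = {}                            # source thread -> newest timestamp kept
    kept = []
    for source, ts in arcs:
        if latest.get(source, -1) >= ts:
            continue                       # already ordered after a later event
        latest[source] = ts
        kept.append((source, ts))
    return kept
```

For example, once a thread has waited for event 3 of thread j, a later arc to event 2 of thread j adds no new ordering, because thread j's own program order already places event 2 before event 3.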

Supporting Total Store Order (TSO)

• Cycle of dependencies in relaxed memory models
– TSO relaxes the RAW ordering
– Previous work (RTR): maintain versions of data
– Identify SC offending instructions; save loaded value

• This paper: maintain versions of metadata

Figure: example under TSO, showing commit order (0, 1, 2) and memory order.
Thread 0: Wr(A), Rd(B)
Thread 1: Wr(B), Rd(A)
Version annotations: P(v1, A) and C(v0, B) for thread 0; P(v0, B) and C(v1, A) for thread 1
Log 0: Wr(A), Rd(B, v0)
Log 1: Wr(B), Rd(A, v1)
Lifeguard 0: produce_version(v1, A); store_handler(A); wait_until_available(v0, B); load_handler(B, v0)
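The lifeguard actions named on this slide suggest a metadata store that can snapshot a location before a store handler overwrites it. The sketch below is one possible reading, not the paper's design: it assumes that a produced version is the metadata saved just before the store handler runs, and the class `VersionedMetadata`, the string-valued metadata, and `run_tso_example` are hypothetical.

```python
import threading


class VersionedMetadata:
    """Toy metadata store with explicitly produced versions (snapshots)."""

    def __init__(self, initial):
        self.current = dict(initial)       # address -> current metadata
        self._versions = {}                # (version, address) -> saved metadata
        self._cond = threading.Condition()

    def produce_version(self, version, addr):
        # Assumption: snapshot the metadata before the store handler updates it.
        with self._cond:
            self._versions[(version, addr)] = self.current[addr]
            self._cond.notify_all()

    def wait_until_available(self, version, addr):
        with self._cond:
            self._cond.wait_for(lambda: (version, addr) in self._versions)
            return self._versions[(version, addr)]


def run_tso_example():
    md = VersionedMetadata({"A": "clean", "B": "clean"})
    seen = {}

    def lifeguard(name, out_version, wr_addr, in_version, rd_addr):
        md.produce_version(out_version, wr_addr)
        md.current[wr_addr] = "tainted"    # stands in for store_handler(wr_addr)
        # The load handler receives the requested version, not the current value.
        seen[name] = md.wait_until_available(in_version, rd_addr)

    t0 = threading.Thread(target=lifeguard, args=("LG0", "v1", "A", "v0", "B"))
    t1 = threading.Thread(target=lifeguard, args=("LG1", "v0", "B", "v1", "A"))
    t0.start()
    t1.start()
    t0.join()
    t1.join()
    return seen, md.current
```

Because each lifeguard produces its version before waiting for the other's, neither blocks forever, and both load handlers observe the pre-store metadata regardless of how the two threads interleave.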

Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
– Fast metadata address calculation: Metadata-TLB
– Fast tracking of data-flow paths: Inheritance Tracking
– Filter out redundant checking: Idempotent Filters
• Per-instruction checking gives the same result; cache event

• Accelerators have only local view of the analysis
– Important events have system-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush state and deliver pending events

Experimental Framework

Benchmarks      Input
barnes          16K bodies
ocean           Grid: 258 x 258
lu              Matrix: 1024 x 1024
fmm             32768 particles
radiosity       Base problem
blackscholes    Simlarge
fluidanimate    Simlarge
swaptions       Simlarge

Simulation Parameters
Cores                  {2, 4, 8, 16}, 1 GHz, in-order scalar x86
L1I & L1D (private)    64KB, 64B line, 4-way assoc.
L2 (shared)            {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
Memory                 90-cycle latency
Log Buffer             64KB per thread

Multithreaded Lifeguards
TaintCheck: Information flow tracking; accelerated by M-TLB and IT
AddrCheck: Memory access checking; accelerated by M-TLB and IF

Relative Slowdown - TaintCheck

Figure: TaintCheck stacked-bar chart. Y-axis: slowdown, 0.0 to 3.0. X-axis: 1, 2, 4 and 8 threads for each of BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM and RADIOSITY. Bar components: Waiting for Application, Waiting for Dependence, Useful Work.

Relative Slowdown - AddrCheck

Figure: AddrCheck stacked-bar chart. Y-axis: slowdown, 0.0 to 1.4, with two off-scale bars labeled 3.0 and 6.0. X-axis: 1, 2, 4 and 8 threads for each of BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM and RADIOSITY. Bar components: Waiting for Application, Waiting for Dependence, Useful Work.


Parallel Hardware Accelerators

• Speed-up frequent lifeguard actions
– Metadata-TLB & Inheritance Tracking (discussed in the paper)
– Idempotent Filters; identify and filter out redundant checking
• Per-instruction checking gives the same result
• Cache incoming event and local state to identify redundancy

• Accelerators have only local view of the analysis
– Important events have application-wide effects (e.g., free())
– Coherence-like issues with accelerators’ local state

• Important events accompanied by Conflict Alerts
– Use Conflict Alerts to flush accelerators’ state
