ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications
Computer Architecture Lab at
Evangelos Vlachos, Michelle L. Goodstein, Michael A. Kozuch, Shimin Chen, Phillip B. Gibbons, Babak Falsafi and Todd C. Mowry
ParaLog: Enabling and Accelerating Online Parallel Monitoring of Multithreaded Applications
ASPLOS '10 - ParaLog 2
Software Errors & Analysis Tools
• Errors abundant in parallel software
  – Program crashes/vulnerabilities, limited performance
• Three main categories of analysis tools
  – Checking before, during or after program execution
• Instruction-grain Lifeguards
  – Online detailed analysis, but with high overhead
  – Several tools available, but mostly support for single-threaded code
© Evangelos Vlachos
ParaLog: a framework for efficient analysis of parallel applications
Lifeguards and Parallel Applications
[Figure: application threads over time under three monitoring approaches]
• Timesliced Execution & Analysis: DBI tools available today
  - high overhead due to serialization
• Parallel Execution & Analysis, Butterfly Analysis (previous talk): windows of uncertainty
  - some false positives
  + software-based
• Parallel Execution & Analysis, ParaLog (this talk): precise application order
  - new hardware required
  + no false positives
  + even better performance
Low-Overhead Instruction-level Analysis
[Figure: online monitoring platform. Application core: application thread, event capturing. Event stream. Lifeguard core: event delivery, lifeguard thread, metadata. Accelerators: IT, IF, MTLB [Chen et al., ISCA'08].]
Application instruction: add r1 r2, r4
Event record: add, r1, r2, r4
Lifeguard handler:
add_handler() {
    i = load_state(r2);
    j = load_state(r4);
    if (check(i, j))
        upd_state(r1);
    else
        error();
}
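The handler on this slide can be made runnable with a few assumptions. The following is a minimal Python sketch, not the Log-Based Architectures API: it assumes the per-register state is a "defined value" flag and that check() requires both source operands to be defined. The `state` table, `LifeguardError` and `dispatch` are hypothetical names introduced only for illustration.

```python
# Hypothetical lifeguard metadata: register name -> True if it holds a defined value.
state = {"r1": False, "r2": True, "r4": True}


class LifeguardError(Exception):
    """Stands in for error() on the slide."""


def load_state(reg):
    return state.get(reg, False)


def upd_state(reg):
    state[reg] = True


def check(i, j):
    # Assumed rule: both source operands must be defined.
    return i and j


def add_handler(dst, src1, src2):
    i = load_state(src1)
    j = load_state(src2)
    if check(i, j):
        upd_state(dst)
    else:
        raise LifeguardError(f"add {dst}: undefined source operand")


HANDLERS = {"add": add_handler}


def dispatch(event):
    """Deliver one event record, e.g. ("add", "r1", "r2", "r4"), to its handler."""
    opcode, *operands = event
    HANDLERS[opcode](*operands)


dispatch(("add", "r1", "r2", "r4"))  # r2 and r4 are defined, so r1 becomes defined
```

A table keyed by opcode keeps the delivery loop independent of any particular lifeguard; a different lifeguard would register different handlers over the same event records.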
Challenges in Parallel Monitoring
[Figure: online parallel monitoring platform [ParaLog]. Each application thread 1..k has its own event capturing, event stream and event delivery to a matching lifeguard thread 1..k. The lifeguard threads share global metadata. Each lifeguard has its own accelerators: IT, IF, MTLB.]
Addressing the Challenges
1. Application event ordering
2. Ensuring metadata access atomicity efficiently
3. Parallelizing hardware accelerators
[Figure: the online parallel monitoring platform [ParaLog] with the additions that address the challenges. Each application thread 1..k gains application-only order capturing, each lifeguard thread 1..k gains order enforcing, and dependence arcs connect the per-thread event streams. The lifeguard threads share global metadata.]
Outline
• Introduction
• Addressing the Challenges of Parallel Monitoring
  1. Capturing & enforcing application event ordering
  2. Ensuring metadata access atomicity
  3. Parallelizing hardware accelerators
• Evaluation
• Conclusions
Event Ordering: the Problem
• Case Study: Information flow analysis (i.e., TaintCheck)
[Figure: in application time, thread j executes store(A) and thread k then executes load(A). In lifeguard time, lifeguard thread j runs st_handler(A) and lifeguard thread k runs ld_handler(A), with nothing shown to order the two handlers.]
Expose happens-before information to lifeguards
Event Ordering: the Solution (1/2)
• Coherence-based ordering of application events
  – Similar to FDR, but online, focusing on application-only events
[Figure: application threads j and k timestamp their events (tj - 1, tj, tj + 1 and tk - 1, tk, tk + 1). Thread j's store(A) at time tj is followed by thread k's load(A), and the pair {thread j, tj} travels with the load. Each lifeguard thread keeps a progress counter (progressj, progressk) that advances as it handles events. Lifeguard thread k runs ld_handler(A) only after lifeguard thread j has run st_handler(A): it waits while progressj < tj.]
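The "wait while progressj < tj" rule can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the hardware mechanism from the talk: progress counters live in a hypothetical `ProgressTable`, a counter value t is taken to mean "all events up to and including timestamp t have been handled", and each log entry carries an optional dependence arc `(source thread, source timestamp)`. The log format and the taint dictionary are illustrative.

```python
import threading


class ProgressTable:
    """Hypothetical per-lifeguard progress counters shared by all lifeguard threads."""

    def __init__(self, n_threads):
        self._progress = [0] * n_threads
        self._cond = threading.Condition()

    def advance(self, tid, timestamp):
        with self._cond:
            self._progress[tid] = timestamp
            self._cond.notify_all()

    def wait_for(self, tid, timestamp):
        # Blocks while progress[tid] < timestamp.
        with self._cond:
            self._cond.wait_for(lambda: self._progress[tid] >= timestamp)


def run_lifeguard(tid, log, table, taint, observed):
    for ts, kind, addr, dep in log:
        if dep is not None:
            table.wait_for(*dep)          # enforce the dependence arc
        if kind == "store":
            taint[addr] = True            # st_handler: mark the location tainted
        elif kind == "load":
            observed.append((tid, addr, taint.get(addr, False)))  # ld_handler
        table.advance(tid, ts)


def demo():
    table, taint, observed = ProgressTable(2), {}, []
    log_j = [(1, "nop", None, None), (2, "store", "A", None)]   # thread j = 0
    log_k = [(1, "load", "A", (0, 2))]                           # arc {thread j, tj = 2}
    tk = threading.Thread(target=run_lifeguard, args=(1, log_k, table, taint, observed))
    tj = threading.Thread(target=run_lifeguard, args=(0, log_j, table, taint, observed))
    tk.start()   # start the consumer first; it must still wait for the producer
    tj.start()
    tj.join()
    tk.join()
    return observed
```

Even though lifeguard k starts first, its load handler observes the taint written by lifeguard j, because the arc forces it to wait until progress for thread j reaches 2.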
Is monitoring coherence enough?
Event Ordering: the Solution (2/2)
• Previous work has not solved the problem of Logical Races
• Both logical races and system calls resolved with Conflict Alert messages
[Figure: application thread j calls free(A) and application thread k executes load(A), a logical race. Lifeguard thread j updates Metadata(A) between free(A) start and free(A) end. A Conflict Alert message creates the dependence that orders this update against lifeguard thread k's ld_handler(A).]
Metadata Atomicity
• Frequent use of locking too expensive
  – # of instructions added & synchronization cost
• Dependence arcs handle the majority of the cases
  – Sufficient conditions:
    1. One-to-one data-to-metadata mapping
    2. Application reads don't become metadata writes
  – Enforcing dependence arcs ensures race-free operation
• Rest of the cases handled by acquiring a lock
  – Lock used only in the load_handler(); other handlers safe
(more details in the paper)
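The "lock only in load_handler()" case can be illustrated with a small example. This is a minimal Python sketch, assuming a hypothetical lifeguard that violates the second condition: its load handler writes metadata (a per-address read count). The class name and the counter are invented for illustration and are not one of the lifeguards evaluated in the talk.

```python
import threading


class ReadCountLifeguard:
    """Hypothetical lifeguard whose load handler writes metadata."""

    def __init__(self):
        self.read_count = {}
        self._lock = threading.Lock()

    def store_handler(self, addr):
        # Conflicting application writes are assumed to be ordered by
        # dependence arcs, so this handler runs without a lock.
        self.read_count[addr] = 0

    def load_handler(self, addr):
        # Concurrent application reads are not ordered against each other,
        # and here a read becomes a metadata write, so take the lock.
        with self._lock:
            self.read_count[addr] = self.read_count.get(addr, 0) + 1


def demo(n_threads=4, loads_per_thread=1000):
    lifeguard = ReadCountLifeguard()
    lifeguard.store_handler("A")

    def worker():
        for _ in range(loads_per_thread):
            lifeguard.load_handler("A")

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lifeguard.read_count["A"]
```

With the lock in place every increment is counted, so `demo()` returns the total number of loads issued across all threads.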
Parallel Hardware Accelerators
• Speed-up frequent lifeguard actions
  – Metadata-TLB; fast metadata address calculation
  – Idempotent Filters; filter out redundant checking
  – Inheritance Tracking; fast tracking of dataflow paths
• Accelerators have only local view of the analysis
  – Cache locally analysis information (e.g., frequent events)
  – Important events have application-wide effects (e.g., free())
  – Coherence-like issues with accelerators' local state
• Important events accompanied by Conflict Alerts
  – Use Conflict Alerts to flush accelerators' state
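The flush-on-Conflict-Alert idea can be sketched for Idempotent Filters. This is a minimal Python sketch under simplifying assumptions: the filter is an unbounded set of already-checked events (a real filter would have limited capacity), and `conflict_alert` is a hypothetical helper that flushes every filter it is given.

```python
class IdempotentFilter:
    """Per-lifeguard filter that drops events whose check would be redundant."""

    def __init__(self):
        self._seen = set()

    def admit(self, event):
        """Return True if the event must be delivered to the lifeguard."""
        if event in self._seen:
            return False          # redundant; discarded
        self._seen.add(event)
        return True               # delivered to lifeguard

    def flush(self):
        self._seen.clear()


def conflict_alert(filters):
    # An important event such as free(A) invalidates cached check results,
    # so local and remote filters are flushed.
    for f in filters:
        f.flush()


def demo():
    lg0, lg1 = IdempotentFilter(), IdempotentFilter()
    delivered = [lg0.admit(ev) for ev in [("R", "A"), ("R", "B"), ("R", "A")]]
    conflict_alert([lg0, lg1])    # another thread executes free(A)
    delivered.append(lg0.admit(("R", "A")))
    return delivered
```

The second R(A) is filtered out, but the R(A) that follows the flush is delivered again, which is what lets the lifeguard notice an access to freed memory.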
Outline
• Introduction
• Addressing the Challenges of Parallel Monitoring
  – Capturing & enforcing application event ordering
  – Ensuring metadata access atomicity
  – Parallelizing hardware accelerators
• Evaluation
• Conclusions
Experimental Framework
• Log-Based Architectures framework
  – Simics full-system simulation
  – CMP system with {2, 4, 8, 16} cores
  – {1, 2, 4, 8} application threads and as many lifeguard threads
  – Sequentially Consistent memory model
• Benchmarks and multithreaded Lifeguards used
  – SPLASH-2 and PARSEC
  – TaintCheck: Information flow tracking; accelerated by M-TLB, IT
  – AddrCheck: Memory access checking; accelerated by M-TLB, IF
• Comparison with Timesliced Monitoring
Performance Results: AddrCheck
[Figure: AddrCheck normalized execution time for 1, 2, 4 and 8 threads on BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM and RADIOSITY, plus the geometric mean. Series: No Monitoring, Timesliced Monitoring, ParaLog. Normalized to sequential, unmonitored execution. 8 app/lifeguard threads use 16 cores in total. Off-scale bar labels as extracted: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4.]
• Timesliced Monitoring is not scalable
• On average 15x slowdown over No Monitoring (8 threads)
• Highest overhead with 8 threads: SWAPTIONS 6x
• Lowest overhead with 8 threads: < 5%
• Average overhead with 8 threads: 26%
Performance Results: TaintCheck
[Figure: TaintCheck normalized execution time for 1, 2, 4 and 8 threads on BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM and RADIOSITY, plus the geometric mean. Series: No Monitoring, Timesliced Monitoring, ParaLog. Off-scale bar labels as extracted: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7.]
• Timesliced Monitoring is not scalable
• On average 23x slowdown over No Monitoring (8 threads)
• Highest overhead with 8 threads: BARNES 2.6x
• Lowest overhead with 8 threads: LU 5%
• Average overhead with 8 threads: 48%
Other Results in the Paper
• Order capturing and order enforcing under TSO
• Performance Impact of Lifeguard Accelerators
  – AddrCheck: [1.13x – 3.4x], TaintCheck: [2x – 9x]
• A less expensive order capturing mechanism gets similar performance results
  – 1 timestamp per core vs. 1 timestamp per cache block
Conclusions
• ParaLog: Fast and precise parallel monitoring
• Components of event ordering
  – Normal memory accesses: monitor coherence activity
  – Logical Races: use of Conflict Alert messages
• Metadata Atomicity
  – Enforcing dependence arcs ensures atomicity (most cases)
• Parallel Hardware Accelerators
  – Flush local state on remote events (Conflict Alert)
• Average overhead is relatively low
  – AddrCheck: 26% and TaintCheck: 48% (8 threads)
Questions?
Backup Slides
Metadata Atomicity
• Synchronization-free fast path vs. slow path
  – Concurrent application reads; no ordering available!
    • Concurrent metadata reads: follow the fast-path
    • Concurrent metadata writes: follow slow-path acquiring a lock
    • Concurrent metadata read and write: read may get either value
  – In any other case dependence arcs are available

Application Event   Lifeguard Action
R                   R
W                   W
R                   W

Lifeguards listed next to the table: AddrCheck, TaintCheck, MemCheck; LockSet
Parallel Hardware Accelerators
• Accelerators have only local view of the analysis
  – Important events have system-wide effects
  – Case study: Idempotent Filters and AddrCheck
[Figure: lifeguards LG 0 and LG 1 each sit behind an Idempotent Filter (IF) that receives a stream of reads such as R(A), R(B), R(A), R(A), R(C). Each event is marked ✔ (delivered to lifeguard) or ✖ (redundant; discarded). When a thread executes free(A), the local and remote IF filters are flushed. Builds on Remote Conflict Messages.]
• Details for parallel M-TLB and IT can be found in the paper
Performance Impact of Lifeguard Accelerators
TaintCheck slowdown (8 threads), bar labels in benchmark order:

Benchmark     Not Accelerated   Accelerated
BARNES        5.7               2.6
LU            9.4               1.0
OCEAN         6.8               1.1
BLACKSCH.     4.2               1.3
FLUIDANIM.    4.3               1.4
SWAPTIONS     5.4               2.2
FMM           7.3               1.4
RADIOSITY     11.3              1.5

• Accelerators provide a major speedup [2x – 9x]
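The [2x – 9x] range can be checked against the TaintCheck bar labels with a little arithmetic. This is a small Python sketch that assumes the labels extracted from the chart appear in benchmark order (BARNES first, RADIOSITY last); that mapping is an assumption about the chart, not something the slide states.

```python
# Slowdown at 8 threads, read off the chart's bar labels (assumed benchmark order).
benchmarks = ["BARNES", "LU", "OCEAN", "BLACKSCH.", "FLUIDANIM.",
              "SWAPTIONS", "FMM", "RADIOSITY"]
not_accelerated = [5.7, 9.4, 6.8, 4.2, 4.3, 5.4, 7.3, 11.3]
accelerated = [2.6, 1.0, 1.1, 1.3, 1.4, 2.2, 1.4, 1.5]

# Speedup from the accelerators = unaccelerated slowdown / accelerated slowdown.
speedup = {
    name: slow / fast
    for name, slow, fast in zip(benchmarks, not_accelerated, accelerated)
}

smallest = min(speedup, key=speedup.get)
largest = max(speedup, key=speedup.get)
print(f"smallest speedup: {smallest} {speedup[smallest]:.1f}x")
print(f"largest speedup:  {largest} {speedup[largest]:.1f}x")
```

Under that mapping the smallest ratio falls just above 2x and the largest just above 9x, consistent with the range quoted on the slide.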
Performance Impact of Lifeguard Accelerators
• Accelerators provide a major speedup [1.13x – 3.4x]
[Figure: AddrCheck slowdown (8 threads), Not Accelerated vs. Accelerated, for BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM and RADIOSITY. Bar labels as extracted: 3.9, 3.2, 1.0, 1.4, 1.1, 8.4, 1.0, 1.4, 1.1, 1.0, 1.0, 1.0, 1.0, 6.0, 1.0, 1.0.]
Transitive Reduction Sensitivity Study
TaintCheck slowdown (8 threads), bar labels in benchmark order:

Benchmark     Limited (1 timestamp / cache)   Ideal (1 timestamp / cache block)
BARNES        2.9                             2.6
LU            1.1                             1.0
OCEAN         1.2                             1.1
BLACKSCH.     1.3                             1.3
FLUIDANIM.    1.5                             1.4
SWAPTIONS     3.1                             2.2
FMM           1.5                             1.4
RADIOSITY     1.6                             1.5

• Limited transitive reduction
  – No major performance impact; savings in chip area
Supporting Total Store Order (TSO)
• Cycle of dependencies in relaxed memory models
  – TSO relaxes the RAW ordering
  – Previous work (RTR): maintain versions of data
  – Identify SC offending instructions; save loaded value
• This paper: maintain versions of metadata
[Figure: commit order. Thread 0 commits Wr(A) then Rd(B); Thread 1 commits Wr(B) then Rd(A). Version annotations in memory order: P(v1, A), C(v0, B), P(v0, B), C(v1, A). Log 0: Wr(A), Rd(B, v0). Log 1: Wr(B), Rd(A, v1). Lifeguard 0 runs produce_version(v1, A), store_handler(A), wait_until_available(v0, B), load_handler(B, v0).]
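The lifeguard sequence on this slide can be sketched with versioned metadata. This is a minimal Python sketch built on an assumed reading of the figure: producing a version snapshots the current metadata of an address just before the store handler overwrites it, and a load handler consumes that snapshot instead of the live metadata. Only the names produce_version, store_handler, wait_until_available and load_handler come from the slide; the class, the "clean"/"tainted" values and the demo are illustrative.

```python
import threading


class VersionedMetadata:
    def __init__(self, initial):
        self.current = dict(initial)   # addr -> live metadata
        self._versions = {}            # (version, addr) -> saved snapshot
        self._cond = threading.Condition()

    def produce_version(self, version, addr):
        # Save the metadata as it was before the upcoming store.
        with self._cond:
            self._versions[(version, addr)] = self.current[addr]
            self._cond.notify_all()

    def store_handler(self, addr, value):
        with self._cond:
            self.current[addr] = value

    def wait_until_available(self, version, addr):
        with self._cond:
            self._cond.wait_for(lambda: (version, addr) in self._versions)
            return self._versions[(version, addr)]


def demo():
    md = VersionedMetadata({"A": "clean", "B": "clean"})
    seen = {}

    def lifeguard(tid, own_addr, own_ver, other_addr, other_ver):
        md.produce_version(own_ver, own_addr)
        md.store_handler(own_addr, "tainted")
        # load_handler(other_addr, other_ver): use the versioned metadata.
        seen[tid] = md.wait_until_available(other_ver, other_addr)

    t0 = threading.Thread(target=lifeguard, args=(0, "A", "v1", "B", "v0"))
    t1 = threading.Thread(target=lifeguard, args=(1, "B", "v0", "A", "v1"))
    t0.start()
    t1.start()
    t0.join()
    t1.join()
    return seen
```

Because each lifeguard publishes its version before it waits for the other's, neither can block forever, and both loads observe the pre-store metadata regardless of how the two threads interleave.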
Parallel Hardware Accelerators
• Speed-up frequent lifeguard actions
  – Fast metadata address calculation: Metadata-TLB
  – Fast tracking of data-flow paths: Inheritance Tracking
  – Filter out redundant checking: Idempotent Filters
    • Per-instruction checking gives the same result; cache event
• Accelerators have only local view of the analysis
  – Important events have system-wide effects (e.g., free())
  – Coherence-like issues with accelerators' local state
• Important events accompanied by Conflict Alerts
  – Use Conflict Alerts to flush state and deliver pending events
Experimental Framework

Benchmarks      Input
barnes          16K bodies
ocean           Grid: 258 x 258
lu              Matrix: 1024 x 1024
fmm             32768 particles
radiosity       Base problem
blackscholes    Simlarge
fluidanimate    Simlarge
swaptions       Simlarge

Simulation Parameters
Cores                  {2, 4, 8, 16}, 1 GHz, In-Order scalar x86
L1I & L1D (private)    64KB, 64B line, 4-way assoc.
L2 (shared)            {1, 2, 4, 8}MB, 64B line, 8-way assoc., 6-cycle latency
Memory                 90-cycle latency
Log Buffer             64KB per thread

Multithreaded Lifeguards
TaintCheck: Information flow tracking; accelerated by M-TLB and IT
AddrCheck: Memory access checking; accelerated by M-TLB and IF
Relative Slowdown - TaintCheck
[Figure: stacked bars of TaintCheck slowdown for 1, 2, 4 and 8 threads on BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM and RADIOSITY, broken down into Waiting for Application, Waiting for Dependence and Useful Work.]
Relative Slowdown - AddrCheck
[Figure: stacked bars of AddrCheck slowdown for 1, 2, 4 and 8 threads on BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM and RADIOSITY, broken down into Waiting for Application, Waiting for Dependence and Useful Work. Off-scale bar labels: 3.0, 6.0.]
Performance Results - AddrCheck
[Figure: AddrCheck normalized execution time for 1, 2, 4 and 8 threads on BARNES, LU, OCEAN, BLACKSCH., FLUIDANIM., SWAPTIONS, FMM and RADIOSITY, plus the geometric mean. Series: No Monitoring, Timesliced Monitoring, ParaLog. Off-scale bar labels as extracted: 2.3, 6.1, 6.7, 1.7, 1.9, 2.9, 9.5, 15.4, 2.1, 6.2, 1.9, 2.4.]
Performance Results - TaintCheck
[Figure: TaintCheck normalized execution time for 1, 2, 4 and 8 threads on the same benchmarks, plus the geometric mean. Series: No Monitoring, Timesliced Monitoring, ParaLog. Off-scale bar labels as extracted: 2.1, 11.5, 12.9, 1.9, 10, 1.7, 1.9, 2.9, 6.6, 4.6, 15.7, 2.4, 2.8, 1.7.]
Parallel Hardware Accelerators
• Speed-up frequent lifeguard actions
  – Metadata-TLB & Inheritance Tracking (discussed in the paper)
  – Idempotent Filters; identify and filter out redundant checking
    • Per-instruction checking gives the same result
    • Cache incoming event and local state to identify redundancy
• Accelerators have only local view of the analysis
  – Important events have application-wide effects (e.g., free())
  – Coherence-like issues with accelerators' local state
• Important events accompanied by Conflict Alerts
  – Use Conflict Alerts to flush accelerators' state