Automatic Performance Analysis and Tuning
Anna Morajko, Oleg Morajko, Tomas Margalef, Emilio Luque Universitat Autónoma de Barcelona
Dyninst/Paradyn Week, February 2003
Content
1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
3. Dynamic tuning
4. Conclusions and future work
Our goals
Primary objectives
• Create a tool that automatically analyzes the performance of parallel applications post-mortem, based on trace files
• Create a tool that automatically analyzes the performance of parallel applications on the fly, detecting bottlenecks and explaining their causes
• Create a tool that automatically improves the performance of parallel applications during their execution, without recompiling or rerunning them
Our goals
Static automatic analysis – static trace file analysis supported by source code examination

[Diagram: the trace file and source code feed a Pattern Matching Engine (driven by problem specs) and a Source Analysis step; an Evaluator Engine (driven by problem-analysis specs and hints specs) turns the list of detected patterns into problem causes and hints to users]
Our goals
Dynamic automatic analysis – based on a declarative knowledge specification

[Diagram: a Performance Monitor traces and profiles the Application; performance data and evaluated properties flow to the Run-Time Analysis (Property Evaluation, Search Refinement, Problem Ranking), which sends measurement requests back to the monitor and reports bottlenecks and problems to the user]
Our goals
Dynamic automatic tuning
[Diagram: during execution the tool traces events from the running application, performs performance analysis, maps problems to solutions, and applies modifications via instrumentation; suggestions can also be returned to the user and fed back into application development of the source]
Content
1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
   • Performance data collection
   • Performance problem catalog (properties and ASL)
   • Performance analysis
3. Dynamic tuning
4. Conclusions and future work
Objectives
• Analyze the performance of parallel applications during their execution
• Automatically detect bottlenecks
• Provide a clear explanation of the identified problems to developers
• Correlate problems with the source code
• Provide recommendations on possible solutions
Key Challenges
– What is wrong in the application? Problem identification – bottlenecks
– Where is the critical resource? Problem location (hardware, OS, source code)
– When does it happen? How to organize the problem search in time?
– How important is it? How to compare the importance of different problems?
– Why does it happen? How to explain the reasons for the problem to the user?
Design Assumptions
• Dynamic on-the-fly analysis
• Knowledge specification based on ASL
• Bottleneck detection (what axis) based on inductive reasoning (bottom-up approach)
• Problem location identification (where axis) based on call-graph search
• Tool primarily targeted at MPI-based parallel programs
Performance Analysis Concept
[Diagram (as in "Our goals"): the Performance Monitor traces and profiles the Application; performance data and properties flow to the Run-Time Analysis (Property Evaluation, Search Refinement, Problem Ranking), which sends measurement requests back and reports bottlenecks and problems to the user]
Performance Data Collection
• Performance analysis is based on measurements of performance data
• There are various techniques for providing this data – event tracing, sampling, instrumentation
• Static data – structural data (modules, functions), call graphs, etc.
• Dynamic data – metrics, execution profiles, communication patterns, events, etc.
Performance Data Collection
Application model

[Diagram: the application model with its call graph – an Application contains Processes, a Process contains Modules, a Module contains Functions; each object is a Region]

• We represent the application using an object model
• The model is generated dynamically at application startup (by parsing the executables)
• Each object represents a location context called a Region
• This information can be used to guide the problem location search
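To make the model concrete, here is a minimal C++ sketch of such a region hierarchy; the class and member names are hypothetical, not taken from the actual tool:

    #include <string>
    #include <vector>

    // Hedged sketch: one node per location context ("Region").
    // An Application contains Processes, a Process contains Modules,
    // and a Module contains Functions; call-graph edges link functions.
    struct Region {
        enum Kind { App, Process, Module, Function };
        Kind kind;
        std::string name;               // executable, module, or function name
        std::vector<Region*> children;  // containment hierarchy
        std::vector<Region*> callees;   // call-graph edges (Function nodes)
    };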
Performance Data Collection
Execution profile

[Diagram: execution profiles mirror the region hierarchy – an App Profile, Process Profiles, Module Profiles, and Function Profiles, each a RegionProfile]

• Each region can have an execution profile – a RegionProfile
• Each profile contains a set of predefined metrics
• Each metric can be measured on demand for a specific region
• The key issue is deciding which metrics should be measured for which regions
Performance Data Collection
The predefined metrics in each RegionProfile include: cpu_time, io_time, comm_time, sync_time, idle_time, num_exec.
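A minimal sketch of what such a profile might look like in C++, assuming exactly the metric set above; the type and its members are hypothetical:

    #include <map>
    #include <string>

    // Hedged sketch: an execution profile attached to one Region.
    // Metrics are measured on demand, so a metric is absent until the
    // analyzer has "ordered" it for this region.
    struct RegionProfile {
        std::map<std::string, double> metrics;  // "cpu_time", "io_time",
                                                // "comm_time", "sync_time",
                                                // "idle_time", "num_exec"
        bool has(const std::string& m) const { return metrics.count(m) != 0; }
        double get(const std::string& m) const {
            auto it = metrics.find(m);
            return it == metrics.end() ? 0.0 : it->second;
        }
    };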
Performance Properties
• Properties describe specific types of performance behavior in a program
• Properties are higher-level abstractions used to represent common performance problems
• They are based on conditions over certain performance metrics
• We can express these abstractions using ASL (the APART Specification Language)
APART Specification Language
• ASL is a declarative language (like SQL)

    property small_io_requests (Region r, Experiment e, Region basis) {
      let
        float cost     = profile(r, e).io_time;
        int num_reads  = profile(r, e).num_reads;
        int bytes_read = profile(r, e).bytes_read;
      in
        condition : cost > 0 and bytes_read/num_reads < SMALL_IO_THRESHOLD;
        confidence: 1;
        severity  : cost/duration(basis, e);
    }
Property Evaluation
Consider small_io_requests applied to the region of function foo() (file foo.cpp, lines 17-35):

    int foo () {
      open ("sim.dat", O_RDONLY);
      for (...) {
        read (block, ...);
        ...
        add (block);
      }
      ...
    }

Region profile of foo(): io_time = 181 sec, num_reads = 8921, bytes_read = 95201
Evaluation result: condition = true (95201/8921 ≈ 10.7 bytes per read), confidence = 1, severity = 0.35
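The same evaluation transcribed into C++, as a hedged illustration of how a property boils down to plain arithmetic over profile metrics (the threshold value is chosen arbitrarily here):

    // Hedged transcription of the small_io_requests property.
    const double SMALL_IO_THRESHOLD = 1024.0;  // illustrative value only

    struct Eval { bool condition; double confidence; double severity; };

    Eval small_io_requests(double io_time, int num_reads, int bytes_read,
                           double basis_duration) {
        Eval e;
        e.condition  = io_time > 0 &&
                       (double)bytes_read / num_reads < SMALL_IO_THRESHOLD;
        e.confidence = 1.0;
        e.severity   = io_time / basis_duration;  // fraction of total time
        return e;
    }
    // With io_time = 181 s, num_reads = 8921, bytes_read = 95201 and a basis
    // duration of about 517 s, this gives condition = true, severity ≈ 0.35.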
Hierarchical Properties
• At run time, not all properties can be evaluated at once (measuring metrics has a cost!)
• But we want to find the problem while keeping that cost under control
• So which properties should be evaluated?
• Properties have natural dependencies
• Let's express these dependencies explicitly in ASL
• This can guide the automatic search and limit the number of properties to be evaluated
Example
    property communication_cost (Region r, Experiment e, Region basis) {
      let
        float cost = profile(r, e).sums.comm_time;
      in
        condition : cost > 0;
        confidence: 1;
        severity  : cost/duration(basis, e);
    }

    property late_sender (Region r, Experiment e, Region basis) {
      let
        float idleT = profile(r, e).sums.idle_time;
      in
        condition : r.type == Receive and idleT > 0;
        confidence: 1;
        severity  : idleT/duration(basis, e);
    }

• The analyzer should evaluate the late_sender property only if the communication_cost property holds
• Properties form natural hierarchies
Example: Hierarchy of properties
Cost
 – Comm Cost: Late Sender, Late Receiver, Small Msgs, ...
 – I/O Cost: Slow Read, Small Requests, Large Requests, ...
 – Sync Cost: Barrier, Lock Contention, ...
Search Process
• Top-down approach
• Start with the top-level property, i.e. Cost = io_time, comm_time, sync_time
• Start with search location = process
• Perform measurements
• Evaluate the property – severity = cost/duration (e.g. 40%)
Search Process
• Continue with the set of sub-properties: evaluate io_cost, comm_cost, sync_cost
• E.g. the highest severity is io_cost (e.g. 25%)
• Continue the top-down search until reaching the most specific property, i.e. small_io_requests
• Next, look for a more precise location of the problem:
 – use call-graph region search
 – start from main
 – step down until reaching the foo() function
Generic Analysis Algorithm
1. Select property to evaluate
2. Select search context (Region)
3. Determine required metrics from the property
4. "Order" metrics for the given context
5. Wait for data; update the execution profile
6. Evaluate the property
7. Rank the property
8. Change the search context or select the next property
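A self-contained C++ sketch of this loop (steps 1-8 above), with every type and policy simplified to stubs; none of these names come from the actual tool:

    #include <iostream>
    #include <string>
    #include <vector>

    // Hedged sketch of the generic analysis loop.
    struct Region { std::string name; std::vector<Region*> callees; };
    struct Eval   { bool condition; double severity; };

    struct Property {
        std::string name;
        std::vector<Property*> children;  // "what"-axis refinements
        // A real property evaluates an ASL condition over metrics measured
        // on demand (steps 3-6); a stub stands in for that here.
        Eval evaluate(Region*) const { return Eval{true, 0.4}; }
    };

    void search(const Property& prop, Region* region) {
        Eval e = prop.evaluate(region);
        if (!e.condition) return;                    // prune this subtree
        std::cout << prop.name << " at " << region->name
                  << ", severity " << e.severity << "\n";  // step 7: rank
        for (const Property* child : prop.children)  // refine the "what" axis
            search(*child, region);
        for (Region* sub : region->callees)          // refine the "where" axis
            search(prop, sub);
    }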
Content
1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
3. Dynamic tuning
   • Objectives
   • MATE (Monitoring, Analysis and Tuning Environment)
   • Tuning techniques and experiments
4. Conclusions and future work
Objectives
Improve the execution of a parallel application by dynamically adapting it to the environment

Key issues:
• Dynamic performance tuning approach
• Automatic improvement of any application without recompiling and rerunning it
• What can be tuned in an unknown application?
 – Library usage
 – Operating system usage

MATE – Monitoring, Analysis and Tuning Environment
• Prototype implementation in C++
• For PVM-based applications
• Sun Solaris 2.x / SPARC
MATE
[Diagram: MATE architecture – on each machine (1, 2) a Monitor with an integrated Tuner runs alongside pvmd and the application tasks (Task1, Task2); Monitors instrument the tasks and forward events to the Analyzer on machine 3, which sends "apply solution" requests back to the Tuners]
MATE: Monitoring
Monitors control the execution of application tasks and allow for dynamic event tracing

Key services:
• Distributed application control
• Instrumentation management
 – AddEventTrace(id, func, place, args)
 – RemoveEventTrace(id)
• Transmission of requested event records to the analyzer

[Diagram: on machine 1, the Monitor instruments Task1 and Task2 via DynInst, loading the tracing library into each; the Analyzer on machine 2 issues add event/remove event requests and receives events]
MATE: Monitoring
Instrumentation management:
• Based on the DynInst API
• Dynamically loads the tracing library
• Inserts snippets into the requested points
• A snippet calls a library function
• The function prepares an event record and transmits it to the Analyzer

Event record:
• What – event type (id, place)
• When – global timestamp
• Where – task identifier
• Requested attributes – function call parameters, source code line number, etc.

[Diagram: the Monitor on machine 1 loads the tracing library into Task1 and instruments pvm_send() at entry and exit; the inserted snippets call LogEvent(), which sends event records such as "1 0 64884 524247 262149 1 23" to the Analyzer over TCP/IP]
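A hedged sketch of this instrumentation step using the DynInst mutator API. It is written against the recent BPatch_process interface (the 2003 prototype used the older BPatch_thread variant), and names like libtracer.so and LogEvent are illustrative:

    #include "BPatch.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include "BPatch_process.h"
    #include <vector>

    // Attach to a task and trace every entry of pvm_send() via LogEvent().
    void instrumentSend(BPatch& bpatch, const char* path, int pid) {
        BPatch_process* proc = bpatch.processAttach(path, pid);
        BPatch_image* image = proc->getImage();

        proc->loadLibrary("libtracer.so");        // load the tracing library

        std::vector<BPatch_function*> target, logger;
        image->findFunction("pvm_send", target);  // function to instrument
        image->findFunction("LogEvent", logger);  // tracing-library callee

        // Snippet: LogEvent(1) - the event id requested by the analyzer.
        std::vector<BPatch_snippet*> args;
        args.push_back(new BPatch_constExpr(1));
        BPatch_funcCallExpr logCall(*logger[0], args);

        std::vector<BPatch_point*>* entries = target[0]->findPoint(BPatch_entry);
        proc->insertSnippet(logCall, *entries);
        proc->continueExecution();
    }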
MATE: Analysis
The Analyzer is responsible for the automatic performance analysis on the fly
• It uses a set of predefined tuning techniques
• Each technique is specified as:
 – measure points – what events are needed
 – performance model and activating conditions
 – solution – tuning actions/points/synchronization: what to change, where, and when

Analysis loop:
1. Request events
2. Collect events
3. Calculate metrics
4. Evaluate performance model
5. Refine monitoring
6. Perform tuning
MATE: Analysis
[Diagram: inside the Analyzer, an Event Collector receives events from the tracing library via TCP/IP and feeds an EventProcessor and a MetricRepository; each tuning technique runs in its own thread; the InstrManager sends instrumentation requests to the monitors and the Tuning Manager sends tuning requests to the tuners, both via TCP/IP]
MATE: Knowledge
Measure point example:
    <instrumentation_request taskId=all>
      <function name="pvm_send">
        <eventId>1</eventId>
        <place>entry</place>
        <param idx=1>int</param>
        <param idx=2>int</param>
      </function>
    </instrumentation_request>

Meaning: insert instrumentation into ALL tasks at the entry of function pvm_send(), as event id 1, recording parameter 1 as int and parameter 2 as int.
MATE: Knowledge
Performance model example:
    <performance_model>
      <value name=CurrentSize>
        <calc>
          <type>function_result</type>
          <function name="pvm_getopt">
            <param>PvmFragSize</param>
          </function>
        </calc>
      </value>
      ...
    </performance_model>

Meaning:
CurrentSize = result of pvm_getopt(PvmFragSize)
OptimalSize = Average(MsgSize) + Stddev(MsgSize)
Condition: CurrentSize – OptimalSize > threshold1
CommunicationCost = xxx
Condition: CommunicationCost > threshold2
MATE: Knowledge
Solution example:
    <tuning_request taskId=524247>
      <action>
        <one_time_call>
          <function name="pvm_setopt">
            <param idx=0>PvmFragSize</param>
            <param idx=1>16384</param>
          </function>
          <synchronize>
            <breakpoint>
              <function name="pvm_send">
                <place>entry</place>
              </function>
            </breakpoint>
          </synchronize>
        </one_time_call>
      </action>
    </tuning_request>

Meaning: in the task with tid=524247, execute the function pvm_setopt(PvmFragSize, 16384) one time, breaking at the entry of function pvm_send().
MATE: Tuning
Tuners apply solutions by executing tuning actions at specified tuning points
• The tuner module is integrated with the monitor process
• It receives a tuning request from the analyzer
• It prepares the modifications (snippets)
• It applies the modifications via DynInst

Analyzer side:

    TuningReq () {
      send_req (tuner, taskId, tuningReq);
    }

Tuner/Monitor side:

    recv_req (taskId, TuningReq) {
      Task task = taskList.Find (taskId);
      snippet = PrepareSnippet (TuningReq);
      task.thread.insertSnippet (snippet);
    }
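As a hedged sketch of how PrepareSnippet/insertSnippet might realize the one-time call from the solution example, a DynInst guard flag can make the call fire exactly once at the entry of pvm_send(); this uses the recent BPatch_process API, and PvmFragSize comes from pvm3.h:

    #include "BPatch.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include "BPatch_process.h"
    #include <pvm3.h>
    #include <vector>

    // One-time tuning action: pvm_setopt(PvmFragSize, 16384) at the first
    // entry of pvm_send(); a flag in the mutatee guards against re-firing.
    void applyFragSizeTuning(BPatch_process* proc) {
        BPatch_image* image = proc->getImage();

        std::vector<BPatch_function*> setopt, send;
        image->findFunction("pvm_setopt", setopt);
        image->findFunction("pvm_send", send);

        std::vector<BPatch_snippet*> args;
        args.push_back(new BPatch_constExpr(PvmFragSize));
        args.push_back(new BPatch_constExpr(16384));
        BPatch_funcCallExpr call(*setopt[0], args);

        BPatch_variableExpr* done = proc->malloc(*image->findType("int"));
        int zero = 0;
        done->writeValue(&zero);

        // if (done == 0) { pvm_setopt(PvmFragSize, 16384); done = 1; }
        BPatch_arithExpr setDone(BPatch_assign, *done, BPatch_constExpr(1));
        std::vector<BPatch_snippet*> body;
        body.push_back(&call);
        body.push_back(&setDone);
        BPatch_sequence seq(body);
        BPatch_ifExpr once(
            BPatch_boolExpr(BPatch_eq, *done, BPatch_constExpr(0)), seq);

        std::vector<BPatch_point*>* entries = send[0]->findPoint(BPatch_entry);
        proc->insertSnippet(once, *entries);
    }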
MATE: Tuning
Tuning request:
• Tuner machine
• Task id
• Tuning action:
 – one-time function call
 – function parameter change
 – function call
 – function replacement
 – variable change
• Tuning points as pairs (object, value):
 – function – name, param – name, param – value
• Synchronization – when to perform the tuning action

Example: in the task with tid=524247, call the function pvm_setopt(PvmFragSize, 16384) one time, breaking at the entry of function pvm_send().
Tuning techniques
What can be tuned?
• Library usage
 – Tuning of PVM library usage
 – Investigating PVM bottlenecks
 – PVM communication bottlenecks:
  • communication mode
  • data encoding mode
  • message fragment size
Tuning techniques
Communication mode
• 2 modes:
 – indirect – task to daemon to daemon to task
 – direct – task to task
• Indirect is slow, but it is the default
• Direct is faster, but consumes socket resources (limited number of connections)

Benefits when changing the communication mode to direct in a round-trip application:

Msg size [B]   Benefit
10             50%
100            46%
1000           33%
10000          20%
100000         19%
1000000        18%
Tuning techniques
Communication mode

Measure points:        current communication mode – pvm_getopt(PvmRoute); new task creation – pvm_spawn()
Activating conditions: non-shared-memory architecture; indirect mode; number of PVM tasks smaller than the system limit
Tuning action:         one-time function call
Tuning points:         pvm_setopt(PvmRoute, PvmRouteDirect)
Synchronization:       break at entry of pvm_send()
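For comparison, the same switch made cooperatively inside the application itself; a minimal PVM sketch (message traffic elided):

    #include <pvm3.h>

    // Minimal sketch: switch from the default indirect (daemon-routed) mode
    // to direct task-to-task TCP connections before heavy communication.
    int main() {
        if (pvm_getopt(PvmRoute) != PvmRouteDirect)
            pvm_setopt(PvmRoute, PvmRouteDirect);  // a hint; PVM falls back
                                                   // to indirect if refused
        /* ... pvm_send()/pvm_recv() traffic ... */
        pvm_exit();
        return 0;
    }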
Tuning techniques
Data encoding mode
• 2 modes:
 – XDR – allows transparent transfer between heterogeneous machines (the big-endian/little-endian problem), but is the slower mode
 – DataRaw – the encoding phase is skipped; possible when the virtual machine contains only homogeneous machines
• XDR is the default; it transfers more data and needs extra time to encode/decode it
• DataRaw is more effective, especially for integer data

Benefits when changing the encoding mode to DataRaw in a round-trip application:

Msg size [B]   Benefit
10             2%
100            4%
1000           16%
10000          61%
100000         72%
1000000        73%
Tuning techniques
Data encoding mode

Measure points:        hosts' architecture – pvm_config(), pvm_addhosts(), pvm_delhosts()
Activating conditions: all machines in the PVM virtual machine have the same architecture
Tuning action:         function parameter change
Tuning points:         pvm_initsend(PvmDataDefault) -> pvm_initsend(PvmDataRaw)
Synchronization:       none
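A minimal sketch of the tuned call sequence inside the application; pvm_initsend() picks the encoding for the whole message:

    #include <pvm3.h>

    // Minimal sketch: skip XDR when every host has the same architecture.
    // The receiver unpacks with matching pvm_upkint() calls.
    void send_block(int dest_tid, int* data, int n) {
        pvm_initsend(PvmDataRaw);        // instead of PvmDataDefault (XDR)
        pvm_pkint(data, n, 1);           // pack n ints, stride 1
        pvm_send(dest_tid, 42 /* msgtag */);
    }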
Tuning techniques
Message fragment size
• A message is divided into fixed-size fragments; the default is 4KB
• A larger message means more fragments; a bigger fragment means more data sent per fragment
• The optimal fragment size depends on the size of the exchanged data
• Effective when the direct communication mode is used
• Drawback: increased memory usage

Benefits when changing the 4KB message fragment size (to 16KB-512KB) in a round-trip application:

Msg size [B]   Benefit (fragment size 16KB-512KB)
4              43%-44%
8              46%-48%
64             38%-52%
256            41%-53%
512            42%-54%
1024           43%-55%
Tuning techniques
Message fragment size

Measure points:        current fragment size – pvm_getopt(PvmFragSize); message size – pvm_send()
Activating conditions: high frequency of messages with size > 4KB; OptimalFragSize = Average(MsgSize) + Stddev(MsgSize); CurrentFragSize – OptimalFragSize > threshold1; communication cost > threshold2
Tuning action:         one-time function call
Tuning points:         pvm_setopt(PvmFragSize, OptimalFragSize)
Synchronization:       break at entry of pvm_send()
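The OptimalFragSize formula only needs running statistics of the observed message sizes; a small C++ sketch of that arithmetic (the type and member names are illustrative):

    #include <cmath>
    #include <cstddef>

    // Hedged sketch of the activating-condition arithmetic:
    // OptimalFragSize = Average(MsgSize) + Stddev(MsgSize).
    struct MsgSizeStats {
        std::size_t n = 0;
        double sum = 0.0, sumSq = 0.0;

        void record(double bytes) { ++n; sum += bytes; sumSq += bytes * bytes; }

        double optimalFragSize() const {
            double mean = sum / n;
            double var  = sumSq / n - mean * mean;  // population variance
            return mean + std::sqrt(var > 0.0 ? var : 0.0);
        }
    };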
Tuning techniques: Example application
• Integer Sort kernel benchmark from NAS
• High communication cost (50%)
• Default settings: indirect communication mode, DataRaw encoding, message fragment size 4KB

No.  Tuning technique                     Execution time [sec]  Tuning benefit [sec]  Intrusion [sec]
1    PVM (no tuning)                      732                   -                     -
2    PVM + communication mode tuning     604                   127 (17.5%)           21 (~3.5%)
3    PVM + data encoding mode tuning     761                   -29 (-3.9%)           21 (~2.8%)
4    PVM + message fragment size tuning  769                   -37 (-5.1%)           27 (~3.5%)
5    PVM + all scenarios                  529                   202 (27.7%)           28 (~5.3%)
Other tuning techniques
Other:
• TCP/IP
 – send/receive buffer size
 – sending without delay (Nagle's algorithm, TCP_NODELAY); see the sketch after this list
• I/O
 – read/write operations
  • using prefetch for small requests
  • using asynchronous read/write instead of synchronous
 – I/O buffer size
• Memory allocation
 – plugging in specialized strategies (pool allocator)
Content
1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
3. Dynamic tuning
4. Conclusions and future work
Conclusions
• Automatic performance analysis
• Dynamic tuning
• Designs
• Experiments
Future work
Automatic analysis
• Discuss and finalize the detailed ASL language specification
• Complete the property evaluator
• Connect the analyzer with a performance measurement tool
• Investigate the "why-axis" analysis (evaluate the causal property chains)
Future work
Dynamic tuning
• Approaches:
 – Black box – tuning of ANY application
  • more tuning techniques
 – Cooperative – tuning of a prepared application
  • supported by a program specification
  • application developed using a framework
  • knowledge about tuning techniques provided by the application framework
Automatic Performance Analysis and Tuning
Universitat Autónoma de Barcelona
Dyninst/Paradyn Week, February 2003
Thank you very much