Automatic Performance Analysis and Tuning
Anna Morajko, Oleg Morajko, Tomas Margalef, Emilio Luque Universitat Autónoma de Barcelona
Dyninst/Paradyn Week, February 2003
Content
1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
3. Dynamic tuning
4. Conclusions and future work
Our goals
Primary objectives
• Create a tool that automatically analyzes the performance of parallel applications post-mortem, based on trace files
• Create a tool that automatically analyzes the performance of parallel applications on the fly, detecting bottlenecks and explaining their causes
• Create a tool that automatically improves the performance of parallel applications during their execution, without recompiling or rerunning them
Our goals
Static automatic analysis – static trace file analysis supported by source code examination

[Diagram: the trace file and source code feed a Pattern Matching Engine (driven by problem specs) and a Source Analysis step; an Evaluator Engine (driven by problem-analysis specs and hints specs) turns the list of detected patterns into problem causes and hints to users]
Our goals
Dynamic automatic analysis – based on a declarative knowledge specification

[Diagram: a Performance Monitor traces and profiles the Application; performance data and evaluated properties flow to the Run-Time Analysis (Property Evaluation, Search Refinement, Problem Ranking), which sends measurement requests back to the monitor and reports bottlenecks and problems to the user]
Our goals
Dynamic automatic tuning
[Diagram: during execution the tool traces events from the running application, performs performance analysis, maps problems to solutions, and applies modifications via instrumentation; suggestions can also be returned to the user and fed back into application development of the source]
Content
1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
   • Performance data collection
   • Performance problem catalog (properties and ASL)
   • Performance analysis
3. Dynamic tuning
4. Conclusions and future work
Objectives
• Analyze the performance of parallel applications during their execution
• Automatically detect bottlenecks
• Provide a clear explanation of the identified problems to developers
• Correlate problems with the source code
• Provide recommendations on possible solutions
Key Challenges
– What is wrong in the application? Problem identification – bottlenecks
– Where is the critical resource? Problem location (hardware, OS, source code)
– When does it happen? How to organize the problem search in time?
– How important is it? How to compare the importance of different problems?
– Why does it happen? How to explain the reasons for the problem to the user?
Design Assumptions
• Dynamic on-the-fly analysis
• Knowledge specification based on ASL
• Bottleneck detection (what axis) based on inductive reasoning (bottom-up approach)
• Problem location identification (where axis) based on call-graph search
• Tool primarily targeted at MPI-based parallel programs
Performance Analysis Concept
[Diagram (as in "Our goals"): the Performance Monitor traces and profiles the Application; performance data and properties flow to the Run-Time Analysis (Property Evaluation, Search Refinement, Problem Ranking), which sends measurement requests back and reports bottlenecks and problems to the user]
Performance Data Collection
• Performance analysis is based on measurements of performance data
• There are various techniques for providing this data – event tracing, sampling, instrumentation
• Static data – structural data (modules, functions), call graphs, etc.
• Dynamic data – metrics, execution profiles, communication patterns, events, etc.
Performance Data Collection
Application model

[Diagram: the application model with its call graph – an Application contains Processes, a Process contains Modules, a Module contains Functions; each object is a Region]

• We represent the application using an object model
• The model is generated dynamically at application startup (by parsing the executables)
• Each object represents a location context called a Region
• This information can be used to guide the problem location search
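To make the model concrete, here is a minimal C++ sketch of such a region hierarchy; the class and member names are hypothetical, not taken from the actual tool:

    #include <string>
    #include <vector>

    // Hedged sketch: one node per location context ("Region").
    // An Application contains Processes, a Process contains Modules,
    // and a Module contains Functions; call-graph edges link functions.
    struct Region {
        enum Kind { App, Process, Module, Function };
        Kind kind;
        std::string name;               // executable, module, or function name
        std::vector<Region*> children;  // containment hierarchy
        std::vector<Region*> callees;   // call-graph edges (Function nodes)
    };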
Performance Data Collection
Execution profile

[Diagram: execution profiles mirror the region hierarchy – an App Profile, Process Profiles, Module Profiles, and Function Profiles, each a RegionProfile]

• Each region can have an execution profile – a RegionProfile
• Each profile contains a set of predefined metrics
• Each metric can be measured on demand for a specific region
• The key issue is deciding which metrics should be measured for which regions
Performance Data Collection
The predefined metrics in each RegionProfile include: cpu_time, io_time, comm_time, sync_time, idle_time, num_exec.
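A minimal sketch of what such a profile might look like in C++, assuming exactly the metric set above; the type and its members are hypothetical:

    #include <map>
    #include <string>

    // Hedged sketch: an execution profile attached to one Region.
    // Metrics are measured on demand, so a metric is absent until the
    // analyzer has "ordered" it for this region.
    struct RegionProfile {
        std::map<std::string, double> metrics;  // "cpu_time", "io_time",
                                                // "comm_time", "sync_time",
                                                // "idle_time", "num_exec"
        bool has(const std::string& m) const { return metrics.count(m) != 0; }
        double get(const std::string& m) const {
            auto it = metrics.find(m);
            return it == metrics.end() ? 0.0 : it->second;
        }
    };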
Performance Properties
• Properties describe specific types of performance behavior in a program
• Properties are higher-level abstractions used to represent common performance problems
• They are based on conditions over certain performance metrics
• We can express these abstractions using ASL (the APART Specification Language)
APART Specification Language
• ASL is a declarative language (like SQL)

    property small_io_requests (Region r, Experiment e, Region basis) {
      let
        float cost     = profile(r, e).io_time;
        int num_reads  = profile(r, e).num_reads;
        int bytes_read = profile(r, e).bytes_read;
      in
        condition : cost > 0 and bytes_read/num_reads < SMALL_IO_THRESHOLD;
        confidence: 1;
        severity  : cost/duration(basis, e);
    }
Property Evaluation
Consider small_io_requests applied to the region of function foo() (file foo.cpp, lines 17-35):

    int foo () {
      open ("sim.dat", O_RDONLY);
      for (...) {
        read (block, ...);
        ...
        add (block);
      }
      ...
    }

Region profile of foo(): io_time = 181 sec, num_reads = 8921, bytes_read = 95201
Evaluation result: condition = true (95201/8921 ≈ 10.7 bytes per read), confidence = 1, severity = 0.35
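The same evaluation transcribed into C++, as a hedged illustration of how a property boils down to plain arithmetic over profile metrics (the threshold value is chosen arbitrarily here):

    // Hedged transcription of the small_io_requests property.
    const double SMALL_IO_THRESHOLD = 1024.0;  // illustrative value only

    struct Eval { bool condition; double confidence; double severity; };

    Eval small_io_requests(double io_time, int num_reads, int bytes_read,
                           double basis_duration) {
        Eval e;
        e.condition  = io_time > 0 &&
                       (double)bytes_read / num_reads < SMALL_IO_THRESHOLD;
        e.confidence = 1.0;
        e.severity   = io_time / basis_duration;  // fraction of total time
        return e;
    }
    // With io_time = 181 s, num_reads = 8921, bytes_read = 95201 and a basis
    // duration of about 517 s, this gives condition = true, severity ≈ 0.35.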
Hierarchical Properties
• At run time, not all properties can be evaluated at once (measuring metrics has a cost!)
• But we want to find the problem while keeping that cost under control
• So which properties should be evaluated?
• Properties have natural dependencies
• Let's express these dependencies explicitly in ASL
• This can guide the automatic search and limit the number of properties to be evaluated
Example
    property communication_cost (Region r, Experiment e, Region basis) {
      let
        float cost = profile(r, e).sums.comm_time;
      in
        condition : cost > 0;
        confidence: 1;
        severity  : cost/duration(basis, e);
    }

    property late_sender (Region r, Experiment e, Region basis) {
      let
        float idleT = profile(r, e).sums.idle_time;
      in
        condition : r.type == Receive and idleT > 0;
        confidence: 1;
        severity  : idleT/duration(basis, e);
    }

• The analyzer should evaluate the late_sender property only if the communication_cost property holds
• Properties form natural hierarchies
Example: Hierarchy of properties
Cost
 – Comm Cost: Late Sender, Late Receiver, Small Msgs, ...
 – I/O Cost: Slow Read, Small Requests, Large Requests, ...
 – Sync Cost: Barrier, Lock Contention, ...
Search Process
• Top-down approach
• Start with the top-level property, i.e. Cost = io_time, comm_time, sync_time
• Start with search location = process
• Perform measurements
• Evaluate the property – severity = cost/duration (e.g. 40%)
Search Process
• Continue with the set of sub-properties: evaluate io_cost, comm_cost, sync_cost
• E.g. the highest severity is io_cost (e.g. 25%)
• Continue the top-down search until reaching the most specific property, i.e. small_io_requests
• Next, look for a more precise location of the problem:
 – use call-graph region search
 – start from main
 – step down until reaching the foo() function
Generic Analysis Algorithm
1. Select property to evaluate
2. Select search context (Region)
3. Determine required metrics from the property
4. "Order" metrics for the given context
5. Wait for data; update the execution profile
6. Evaluate the property
7. Rank the property
8. Change the search context or select the next property
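A self-contained C++ sketch of this loop (steps 1-8 above), with every type and policy simplified to stubs; none of these names come from the actual tool:

    #include <iostream>
    #include <string>
    #include <vector>

    // Hedged sketch of the generic analysis loop.
    struct Region { std::string name; std::vector<Region*> callees; };
    struct Eval   { bool condition; double severity; };

    struct Property {
        std::string name;
        std::vector<Property*> children;  // "what"-axis refinements
        // A real property evaluates an ASL condition over metrics measured
        // on demand (steps 3-6); a stub stands in for that here.
        Eval evaluate(Region*) const { return Eval{true, 0.4}; }
    };

    void search(const Property& prop, Region* region) {
        Eval e = prop.evaluate(region);
        if (!e.condition) return;                    // prune this subtree
        std::cout << prop.name << " at " << region->name
                  << ", severity " << e.severity << "\n";  // step 7: rank
        for (const Property* child : prop.children)  // refine the "what" axis
            search(*child, region);
        for (Region* sub : region->callees)          // refine the "where" axis
            search(prop, sub);
    }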
Content
1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
3. Dynamic tuning
   • Objectives
   • MATE (Monitoring, Analysis and Tuning Environment)
   • Tuning techniques and experiments
4. Conclusions and future work
Objectives
Improve the execution of a parallel application by dynamically adapting it to the environment

Key issues:
• Dynamic performance tuning approach
• Automatic improvement of any application without recompiling and rerunning it
• What can be tuned in an unknown application?
 – Library usage
 – Operating system usage

MATE – Monitoring, Analysis and Tuning Environment
• Prototype implementation in C++
• For PVM-based applications
• Sun Solaris 2.x / SPARC
MATE
[Diagram: MATE architecture – on each machine (1, 2) a Monitor with an integrated Tuner runs alongside pvmd and the application tasks (Task1, Task2); Monitors instrument the tasks and forward events to the Analyzer on machine 3, which sends "apply solution" requests back to the Tuners]
MATE: Monitoring
Monitors control the execution of application tasks and allow for dynamic event tracing

Key services:
• Distributed application control
• Instrumentation management
 – AddEventTrace(id, func, place, args)
 – RemoveEventTrace(id)
• Transmission of requested event records to the analyzer

[Diagram: on machine 1, the Monitor instruments Task1 and Task2 via DynInst, loading the tracing library into each; the Analyzer on machine 2 issues add event/remove event requests and receives events]
MATE: Monitoring
Instrumentation management:
• Based on the DynInst API
• Dynamically loads the tracing library
• Inserts snippets into the requested points
• A snippet calls a library function
• The function prepares an event record and transmits it to the Analyzer

Event record:
• What – event type (id, place)
• When – global timestamp
• Where – task identifier
• Requested attributes – function call parameters, source code line number, etc.

[Diagram: the Monitor on machine 1 loads the tracing library into Task1 and instruments pvm_send() at entry and exit; the inserted snippets call LogEvent(), which sends event records such as "1 0 64884 524247 262149 1 23" to the Analyzer over TCP/IP]
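A hedged sketch of this instrumentation step using the DynInst mutator API. It is written against the recent BPatch_process interface (the 2003 prototype used the older BPatch_thread variant), and names like libtracer.so and LogEvent are illustrative:

    #include "BPatch.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include "BPatch_process.h"
    #include <vector>

    // Attach to a task and trace every entry of pvm_send() via LogEvent().
    void instrumentSend(BPatch& bpatch, const char* path, int pid) {
        BPatch_process* proc = bpatch.processAttach(path, pid);
        BPatch_image* image = proc->getImage();

        proc->loadLibrary("libtracer.so");        // load the tracing library

        std::vector<BPatch_function*> target, logger;
        image->findFunction("pvm_send", target);  // function to instrument
        image->findFunction("LogEvent", logger);  // tracing-library callee

        // Snippet: LogEvent(1) - the event id requested by the analyzer.
        std::vector<BPatch_snippet*> args;
        args.push_back(new BPatch_constExpr(1));
        BPatch_funcCallExpr logCall(*logger[0], args);

        std::vector<BPatch_point*>* entries = target[0]->findPoint(BPatch_entry);
        proc->insertSnippet(logCall, *entries);
        proc->continueExecution();
    }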
MATE: Analysis
The Analyzer is responsible for the automatic performance analysis on the fly
• It uses a set of predefined tuning techniques
• Each technique is specified as:
 – measure points – what events are needed
 – performance model and activating conditions
 – solution – tuning actions/points/synchronization: what to change, where, and when

Analysis loop:
1. Request events
2. Collect events
3. Calculate metrics
4. Evaluate performance model
5. Refine monitoring
6. Perform tuning
MATE: Analysis
[Diagram: inside the Analyzer, an Event Collector receives events from the tracing library via TCP/IP and feeds an EventProcessor and a MetricRepository; each tuning technique runs in its own thread; the InstrManager sends instrumentation requests to the monitors and the Tuning Manager sends tuning requests to the tuners, both via TCP/IP]
MATE: Knowledge
Measure point example:
    <instrumentation_request taskId=all>
      <function name="pvm_send">
        <eventId>1</eventId>
        <place>entry</place>
        <param idx=1>int</param>
        <param idx=2>int</param>
      </function>
    </instrumentation_request>

Meaning: insert instrumentation into ALL tasks at the entry of function pvm_send(), as event id 1, recording parameter 1 as int and parameter 2 as int.
MATE: Knowledge
Performance model example:
    <performance_model>
      <value name=CurrentSize>
        <calc>
          <type>function_result</type>
          <function name="pvm_getopt">
            <param>PvmFragSize</param>
          </function>
        </calc>
      </value>
      ...
    </performance_model>

Meaning:
CurrentSize = result of pvm_getopt(PvmFragSize)
OptimalSize = Average(MsgSize) + Stddev(MsgSize)
Condition: CurrentSize – OptimalSize > threshold1
CommunicationCost = xxx
Condition: CommunicationCost > threshold2
MATE: Knowledge
Solution example:
    <tuning_request taskId=524247>
      <action>
        <one_time_call>
          <function name="pvm_setopt">
            <param idx=0>PvmFragSize</param>
            <param idx=1>16384</param>
          </function>
          <synchronize>
            <breakpoint>
              <function name="pvm_send">
                <place>entry</place>
              </function>
            </breakpoint>
          </synchronize>
        </one_time_call>
      </action>
    </tuning_request>

Meaning: in the task with tid=524247, execute the function pvm_setopt(PvmFragSize, 16384) one time, breaking at the entry of function pvm_send().
MATE: Tuning
Tuners apply solutions by executing tuning actions at specified tuning points
• The tuner module is integrated with the monitor process
• It receives a tuning request from the analyzer
• It prepares the modifications (snippets)
• It applies the modifications via DynInst

Analyzer side:

    TuningReq () {
      send_req (tuner, taskId, tuningReq);
    }

Tuner/Monitor side:

    recv_req (taskId, TuningReq) {
      Task task = taskList.Find (taskId);
      snippet = PrepareSnippet (TuningReq);
      task.thread.insertSnippet (snippet);
    }
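As a hedged sketch of how PrepareSnippet/insertSnippet might realize the one-time call from the solution example, a DynInst guard flag can make the call fire exactly once at the entry of pvm_send(); this uses the recent BPatch_process API, and PvmFragSize comes from pvm3.h:

    #include "BPatch.h"
    #include "BPatch_function.h"
    #include "BPatch_point.h"
    #include "BPatch_process.h"
    #include <pvm3.h>
    #include <vector>

    // One-time tuning action: pvm_setopt(PvmFragSize, 16384) at the first
    // entry of pvm_send(); a flag in the mutatee guards against re-firing.
    void applyFragSizeTuning(BPatch_process* proc) {
        BPatch_image* image = proc->getImage();

        std::vector<BPatch_function*> setopt, send;
        image->findFunction("pvm_setopt", setopt);
        image->findFunction("pvm_send", send);

        std::vector<BPatch_snippet*> args;
        args.push_back(new BPatch_constExpr(PvmFragSize));
        args.push_back(new BPatch_constExpr(16384));
        BPatch_funcCallExpr call(*setopt[0], args);

        BPatch_variableExpr* done = proc->malloc(*image->findType("int"));
        int zero = 0;
        done->writeValue(&zero);

        // if (done == 0) { pvm_setopt(PvmFragSize, 16384); done = 1; }
        BPatch_arithExpr setDone(BPatch_assign, *done, BPatch_constExpr(1));
        std::vector<BPatch_snippet*> body;
        body.push_back(&call);
        body.push_back(&setDone);
        BPatch_sequence seq(body);
        BPatch_ifExpr once(
            BPatch_boolExpr(BPatch_eq, *done, BPatch_constExpr(0)), seq);

        std::vector<BPatch_point*>* entries = send[0]->findPoint(BPatch_entry);
        proc->insertSnippet(once, *entries);
    }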
MATE: Tuning
Tuning request:
• Tuner machine
• Task id
• Tuning action:
 – one-time function call
 – function parameter change
 – function call
 – function replacement
 – variable change
• Tuning points as pairs (object, value):
 – function – name, param – name, param – value
• Synchronization – when to perform the tuning action

Example: in the task with tid=524247, call the function pvm_setopt(PvmFragSize, 16384) one time, breaking at the entry of function pvm_send().
Tuning techniques
What can be tuned?
• Library usage
 – Tuning of PVM library usage
 – Investigating PVM bottlenecks
 – PVM communication bottlenecks:
  • communication mode
  • data encoding mode
  • message fragment size
Tuning techniques
Communication mode
• 2 modes:
 – indirect – task to daemon to daemon to task
 – direct – task to task
• Indirect is slow, but it is the default
• Direct is faster, but consumes socket resources (limited number of connections)

Benefits when changing the communication mode to direct in a round-trip application:

Msg size [B]   Benefit
10             50%
100            46%
1000           33%
10000          20%
100000         19%
1000000        18%
Tuning techniques
Communication mode

Measure points:        current communication mode – pvm_getopt(PvmRoute); new task creation – pvm_spawn()
Activating conditions: non-shared-memory architecture; indirect mode; number of PVM tasks smaller than the system limit
Tuning action:         one-time function call
Tuning points:         pvm_setopt(PvmRoute, PvmRouteDirect)
Synchronization:       break at entry of pvm_send()
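For comparison, the same switch made cooperatively inside the application itself; a minimal PVM sketch (message traffic elided):

    #include <pvm3.h>

    // Minimal sketch: switch from the default indirect (daemon-routed) mode
    // to direct task-to-task TCP connections before heavy communication.
    int main() {
        if (pvm_getopt(PvmRoute) != PvmRouteDirect)
            pvm_setopt(PvmRoute, PvmRouteDirect);  // a hint; PVM falls back
                                                   // to indirect if refused
        /* ... pvm_send()/pvm_recv() traffic ... */
        pvm_exit();
        return 0;
    }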
Tuning techniques
Data encoding mode
• 2 modes:
 – XDR – allows transparent transfer between heterogeneous machines (the big-endian/little-endian problem), but is the slower mode
 – DataRaw – the encoding phase is skipped; possible when the virtual machine contains only homogeneous machines
• XDR is the default; it transfers more data and needs extra time to encode/decode it
• DataRaw is more effective, especially for integer data

Benefits when changing the encoding mode to DataRaw in a round-trip application:

Msg size [B]   Benefit
10             2%
100            4%
1000           16%
10000          61%
100000         72%
1000000        73%
Tuning techniques
Data encoding mode

Measure points:        hosts' architecture – pvm_config(), pvm_addhosts(), pvm_delhosts()
Activating conditions: all machines in the PVM virtual machine have the same architecture
Tuning action:         function parameter change
Tuning points:         pvm_initsend(PvmDataDefault) -> pvm_initsend(PvmDataRaw)
Synchronization:       none
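A minimal sketch of the tuned call sequence inside the application; pvm_initsend() picks the encoding for the whole message:

    #include <pvm3.h>

    // Minimal sketch: skip XDR when every host has the same architecture.
    // The receiver unpacks with matching pvm_upkint() calls.
    void send_block(int dest_tid, int* data, int n) {
        pvm_initsend(PvmDataRaw);        // instead of PvmDataDefault (XDR)
        pvm_pkint(data, n, 1);           // pack n ints, stride 1
        pvm_send(dest_tid, 42 /* msgtag */);
    }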
Tuning techniques
Message fragment size
• A message is divided into fixed-size fragments; the default is 4KB
• A larger message means more fragments; a bigger fragment means more data sent per fragment
• The optimal fragment size depends on the size of the exchanged data
• Effective when the direct communication mode is used
• Drawback: increased memory usage

Benefits when changing the 4KB message fragment size (to 16KB-512KB) in a round-trip application:

Msg size [B]   Benefit (fragment size 16KB-512KB)
4              43%-44%
8              46%-48%
64             38%-52%
256            41%-53%
512            42%-54%
1024           43%-55%
Tuning techniques
Message fragment size

Measure points:        current fragment size – pvm_getopt(PvmFragSize); message size – pvm_send()
Activating conditions: high frequency of messages with size > 4KB; OptimalFragSize = Average(MsgSize) + Stddev(MsgSize); CurrentFragSize – OptimalFragSize > threshold1; communication cost > threshold2
Tuning action:         one-time function call
Tuning points:         pvm_setopt(PvmFragSize, OptimalFragSize)
Synchronization:       break at entry of pvm_send()
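The OptimalFragSize formula only needs running statistics of the observed message sizes; a small C++ sketch of that arithmetic (the type and member names are illustrative):

    #include <cmath>
    #include <cstddef>

    // Hedged sketch of the activating-condition arithmetic:
    // OptimalFragSize = Average(MsgSize) + Stddev(MsgSize).
    struct MsgSizeStats {
        std::size_t n = 0;
        double sum = 0.0, sumSq = 0.0;

        void record(double bytes) { ++n; sum += bytes; sumSq += bytes * bytes; }

        double optimalFragSize() const {
            double mean = sum / n;
            double var  = sumSq / n - mean * mean;  // population variance
            return mean + std::sqrt(var > 0.0 ? var : 0.0);
        }
    };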
Tuning techniques: Example application
• Integer Sort kernel benchmark from NAS
• High communication cost (50%)
• Default settings: indirect communication mode, DataRaw encoding, message fragment size 4KB

No.  Tuning technique                     Execution time [sec]  Tuning benefit [sec]  Intrusion [sec]
1    PVM (no tuning)                      732                   -                     -
2    PVM + communication mode tuning     604                   127 (17.5%)           21 (~3.5%)
3    PVM + data encoding mode tuning     761                   -29 (-3.9%)           21 (~2.8%)
4    PVM + message fragment size tuning  769                   -37 (-5.1%)           27 (~3.5%)
5    PVM + all scenarios                  529                   202 (27.7%)           28 (~5.3%)
Other tuning techniques
Other:
• TCP/IP
 – send/receive buffer size
 – sending without delay (Nagle's algorithm, TCP_NODELAY); see the sketch after this list
• I/O
 – read/write operations
  • using prefetch for small requests
  • using asynchronous read/write instead of synchronous
 – I/O buffer size
• Memory allocation
 – plugging in specialized strategies (pool allocator)
Content
1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
3. Dynamic tuning
4. Conclusions and future work
Conclusions
• Automatic performance analysis
• Dynamic tuning
• Designs
• Experiments
Future work
Automatic analysis
• Discuss and finalize the detailed ASL language specification
• Complete the property evaluator
• Connect the analyzer with a performance measurement tool
• Investigate the "why-axis" analysis (evaluate the causal property chains)
Future work
Dynamic tuning
• Approaches:
 – Black box – tuning of ANY application
  • more tuning techniques
 – Cooperative – tuning of a prepared application
  • supported by a program specification
  • application developed using a framework
  • knowledge about tuning techniques provided by the application framework
Automatic Performance Analysis and Tuning
Universitat Autónoma de Barcelona
Dyninst/Paradyn Week, February 2003
Thank you very much