
Page 1: Automatic Performance Analysis and Tuning


Automatic Performance Analysis and Tuning

Anna Morajko, Oleg Morajko, Tomas Margalef, Emilio Luque Universitat Autónoma de Barcelona

Dyninst/Paradyn Week, February 2003

Page 2: Automatic Performance Analysis and Tuning


Content

1. Our goals – automatic analysis and tuning

2. Automatic analysis based on ASL

3. Dynamic tuning

4. Conclusions and future work

Page 3: Automatic Performance Analysis and Tuning

Our goals

Primary objectives

• Create a tool that automatically analyzes the performance of parallel applications post-mortem, based on trace files

• Create a tool that automatically analyzes the performance of parallel applications on the fly, detects bottlenecks and explains their causes

• Create a tool that automatically improves the performance of parallel applications during their execution, without recompiling or rerunning them

Page 4: Automatic Performance Analysis and Tuning

Our goals

Static automatic analysis – static trace file analysis supported by source code examination

[Diagram: the Trace and problem specs feed a Pattern Matching Engine that produces a List of Detected Patterns; an Evaluator Engine combines these with a Source Analysis of the Source Code and with problem analysis specs to derive problem causes and, via hints specs, hints to users.]

Page 5: Automatic Performance Analysis and Tuning

Our goals

Dynamic automatic analysis – based on a declarative knowledge specification

[Diagram: a Performance Monitor traces and profiles the Application, serving measurement requests and delivering performance data; the Run-Time Analysis (Property Evaluation, Search Refinement, Problem Ranking) evaluates properties, identifies bottlenecks and reports the ranked problems as user output.]

Page 6: Automatic Performance Analysis and Tuning

Our goals

Dynamic automatic tuning

[Diagram: during application development the Source produces the Application; during execution the Tool performs tracing (events), performance analysis (problem / solution) and tuning (modifications applied through instrumentation), and may also pass suggestions back to the user.]

Page 7: Automatic Performance Analysis and Tuning


Content

1. Our goals – automatic analysis and tuning

2. Automatic analysis based on ASL
   • Performance data collection
   • Performance problem catalog (properties and ASL)
   • Performance analysis

3. Dynamic tuning

4. Conclusions and future work

Page 8: Automatic Performance Analysis and Tuning

Objectives

• Analyze the performance of parallel applications during their execution

• Automatically detect bottlenecks

• Provide a clear explanation of the identified problems to developers

• Correlate problems with the source code

• Provide recommendations on possible solutions

Page 9: Automatic Performance Analysis and Tuning

Key Challenges

– What is wrong in the application? Problem identification – bottlenecks

– Where is the critical resource? Problem location (hardware, OS, source code)

– When does it happen? How to organize the problem search in time?

– How important is it? How to compare the importance of different problems?

– Why does it happen? How to explain the reasons for the problem to the user?

Page 10: Automatic Performance Analysis and Tuning

Design Assumptions

• Dynamic on-the-fly analysis
• Knowledge specification based on ASL
• Bottleneck detection (what axis) based on inductive reasoning (bottom-up approach)
• Problem location identification (where axis) based on call-graph search
• Tool primarily targeted to MPI-based parallel programs

Page 11: Automatic Performance Analysis and Tuning

Performance Analysis Concept

[Diagram (same as on the "Our goals" slide): a Performance Monitor traces and profiles the Application, serving measurement requests and delivering performance data; the Run-Time Analysis (Property Evaluation, Search Refinement, Problem Ranking) evaluates properties, identifies bottlenecks and reports the ranked problems as user output.]

Page 12: Automatic Performance Analysis and Tuning


Page 13: Automatic Performance Analysis and Tuning

Performance Data Collection

• Performance analysis is based on measurements of performance data

• There are various techniques for providing this data – event tracing, sampling, instrumentation

• Static data – structural data (modules, functions), call graphs, etc.

• Dynamic data – metrics, execution profiles, communication patterns, events, etc.

Page 14: Automatic Performance Analysis and Tuning

Performance Data Collection

Application model

[Diagram: the application is modeled as a hierarchy of Regions – Application -> Process (N) -> Module -> Function (N) – together with the call graph.]

• We represent the application using an object model

• The model is generated dynamically at application startup (by parsing the executables)

• Each object represents a location context called a Region

• This information can be used to guide the problem location search

Page 15: Automatic Performance Analysis and Tuning

Performance Data Collection

Execution profile

[Diagram: each Region in the hierarchy (App, Process, Module, Function) has an associated RegionProfile – App Profile, Process Profile, Module Profile, Function Profile.]

• Each region can have an execution profile – a RegionProfile

• Each profile contains a set of predefined metrics

• Each metric can be measured on demand for a specific region

• The key issue is deciding which metrics should be measured for which regions

Page 16: Automatic Performance Analysis and Tuning

Performance Data Collection

Execution profile

[Diagram: as on the previous slide, each Region has an associated RegionProfile; here the profiles are shown with their predefined metrics.]

• Each region can have an execution profile – a RegionProfile

• Each profile contains a set of predefined metrics: cpu_time, io_time, comm_time, sync_time, idle_time, num_exec

• Each metric can be measured on demand for a specific region

• The key issue is deciding which metrics should be measured for which regions (a sketch of this object model follows below)
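The Region hierarchy and on-demand RegionProfile described on the last three slides could be captured roughly as follows. This is a minimal C++ sketch under assumptions: Region, RegionProfile and the metric names come from the slides, while the Kind enum, the accessors and everything else are invented for illustration and are not the analyzer's real classes.

#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Predefined metrics a profile can hold (names from the slide).
enum class Metric { cpu_time, io_time, comm_time, sync_time, idle_time, num_exec };

// Execution profile attached to a region; metric values are filled in on demand.
class RegionProfile {
public:
    bool   has(Metric m) const     { return values_.count(m) != 0; }
    double get(Metric m) const     { return values_.at(m); }
    void   set(Metric m, double v) { values_[m] = v; }
private:
    std::map<Metric, double> values_;
};

// A location context: application, process, module or function.
class Region {
public:
    enum class Kind { Application, Process, Module, Function };

    Region(Kind kind, std::string name) : kind_(kind), name_(std::move(name)) {}

    Region* addChild(Kind kind, std::string name) {
        children_.push_back(std::make_unique<Region>(kind, std::move(name)));
        return children_.back().get();
    }

    RegionProfile&     profile()    { return profile_; }
    const std::string& name() const { return name_; }
    Kind               kind() const { return kind_; }
    const std::vector<std::unique_ptr<Region>>& children() const { return children_; }

private:
    Kind kind_;
    std::string name_;
    RegionProfile profile_;
    std::vector<std::unique_ptr<Region>> children_;
};

An analyzer built on such a model can request, say, io_time for a single function region only when the search reaches that region, instead of measuring every metric for every region.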

Page 17: Automatic Performance Analysis and Tuning

Performance Properties

• Properties describe specific types of performance behavior in a program

• Properties are higher-level abstractions used to represent common performance problems

• They are based on conditions dependent on certain performance metrics

• We can express these abstractions using ASL (APART Specification Language)

Page 18: Automatic Performance Analysis and Tuning

APART Specification Language

property small_io_requests (Region r, Experiment e, Region basis) {
  let
    float cost     = profile(r, e).io_time;
    int num_reads  = profile(r, e).num_reads;
    int bytes_read = profile(r, e).bytes_read;
  in
    condition : cost > 0 and bytes_read / num_reads < SMALL_IO_THRESHOLD;
    confidence: 1;
    severity  : cost / duration(basis, e);
}

• ASL is a declarative language (like SQL)
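Purely for illustration, the condition, confidence and severity of the small_io_requests property above could be computed from a region profile like this. It is a hedged C++ sketch: the IoProfile struct, PropertyResult and the SMALL_IO_THRESHOLD value are placeholders, not the tool's real interface.

// Placeholder profile data for one region (values the monitor would measure).
struct IoProfile {
    double io_time    = 0.0;  // seconds spent in I/O within the region
    long   num_reads  = 0;    // number of read() calls
    long   bytes_read = 0;    // total bytes read
};

struct PropertyResult {
    bool   holds;       // ASL "condition"
    double confidence;  // ASL "confidence"
    double severity;    // ASL "severity"
};

constexpr double SMALL_IO_THRESHOLD = 1024.0;  // assumed threshold, bytes per read

// Mirrors: condition: cost > 0 and bytes_read/num_reads < SMALL_IO_THRESHOLD
//          severity : cost / duration(basis)
PropertyResult small_io_requests(const IoProfile& p, double basis_duration) {
    const double cost = p.io_time;
    const bool condition = cost > 0 && p.num_reads > 0 &&
        static_cast<double>(p.bytes_read) / p.num_reads < SMALL_IO_THRESHOLD;
    const double severity = basis_duration > 0 ? cost / basis_duration : 0.0;
    return {condition, 1.0, severity};
}

With the foo() profile shown on the next slide (io_time 181 s, 8921 reads, 95201 bytes read) the condition holds, since the reads average roughly 10.7 bytes each.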

Page 19: Automatic Performance Analysis and Tuning

Property Evaluation

property small_io_requests (Region r, Experiment e, Region basis) {
  let
    float cost     = profile(r, e).io_time;
    int num_reads  = profile(r, e).num_reads;
    int bytes_read = profile(r, e).bytes_read;
  in
    condition : cost > 0 and bytes_read / num_reads < SMALL_IO_THRESHOLD;
    confidence: 1;
    severity  : cost / duration(basis, e);
}

Example region: function foo() (file foo.cpp, lines 17-35)

int foo() {
  open("sim.dat", O_RDONLY);
  for (...) {
    read(block, ...);
    ...
    add(block);
  }
  ...
}

Region profile of foo(): io_time = 181 sec, num_reads = 8921, bytes_read = 95201

Property evaluation result: condition: true, confidence: 1, severity: 0.35

Page 20: Automatic Performance Analysis and Tuning

Hierarchical Properties

• At run time, not all properties can be evaluated at once (measuring metrics has a cost!)

• But we want to find the problem while keeping that cost under control

• So which properties should be evaluated?

• Properties have natural dependencies

• Let's express those dependencies explicitly in ASL

• This can guide the automatic search and limit the number of properties to be evaluated

Page 21: Automatic Performance Analysis and Tuning

Example

property communication_cost (Region r, Experiment e, Region basis) {
  let
    float cost = profile(r, e).sums.comm_time;
  in
    condition : cost > 0;
    confidence: 1;
    severity  : cost / duration(basis, e);
}

property late_sender (Region r, Experiment e, Region basis) {
  let
    float idleT = profile(r, e).sums.idle_time;
  in
    condition : r.type == Receive and idleT > 0;
    confidence: 1;
    severity  : idleT / duration(basis, e);
}

• The analyzer should evaluate the late_sender property only if the communication_cost property holds

• Properties form natural hierarchies (a search sketch follows below)
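One way to encode that dependency is a property tree whose children are only visited when the parent holds. The sketch below is illustrative only; the names and structure are assumptions, not the analyzer's actual design.

#include <functional>
#include <string>
#include <vector>

// A node in the property hierarchy: children (more specific properties) are
// evaluated only if the parent property holds, e.g. late_sender under
// communication_cost. All names here are illustrative assumptions.
struct PropertyNode {
    std::string name;
    std::function<bool()> holds;          // evaluates the ASL condition (triggering measurements)
    std::vector<PropertyNode> children;   // more specific sub-properties
};

// Depth-first search that prunes whole sub-trees whose parent does not hold,
// which keeps the number of evaluated properties (and measured metrics) small.
void search(const PropertyNode& node, std::vector<std::string>& detected) {
    if (!node.holds()) return;            // parent fails: skip all children
    detected.push_back(node.name);
    for (const auto& child : node.children)
        search(child, detected);
}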

Page 22: Automatic Performance Analysis and Tuning

Example: Hierarchy of properties

Cost
  – Comm Cost: Late Sender, Late Receiver, Small Msgs, ...
  – I/O Cost: Slow Read, Small Requests, Large Requests, ...
  – Sync Cost: Barrier, Lock Contention, ...

Page 23: Automatic Performance Analysis and Tuning

Search Process

• Top-down approach

• Start with the top-level property, i.e. Cost = io_time, comm_time, sync_time

• Start with search location = process

• Perform measurements

• Evaluate the property – severity = cost/duration (e.g. 40%)

Page 24: Automatic Performance Analysis and Tuning

Search Process

• Continue with the set of sub-properties: evaluate io_cost, comm_cost, sync_cost

• E.g. the highest severity is io_cost (e.g. 25%)

• Continue the top-down search until reaching the most specific property, i.e. small_io_requests

• Next, try to find a more precise location of the problem – use call-graph region search: start from main and step down until reaching the foo() function

Page 25: Automatic Performance Analysis and Tuning

Generic Analysis Algorithm

1. Select the property to evaluate
2. Select the search context (Region)
3. Determine the required metrics from the property
4. "Order" the metrics for the given context
5. Wait for data; update the execution profile
6. Evaluate the property
7. Rank the property
8. Change the search context or select the next property
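Read as a loop, the algorithm above might look roughly like the following self-contained C++ sketch; every type and function name in it (Region, Property, measure, analysisLoop) is an assumption made for illustration.

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Region { std::string name; double duration = 100.0; };

struct Property {
    std::string name;
    std::vector<std::string> metrics;   // metrics this property needs
    // Returns severity (> 0 means the condition holds), computed from the
    // measured metric values and the region used as the basis.
    std::function<double(const std::map<std::string, double>&, const Region&)> severity;
};

// Stand-in for "order metrics / wait for data": a real monitor would insert
// instrumentation and wait for measurements; here we fake constant values.
std::map<std::string, double> measure(const Region&, const std::vector<std::string>& metrics) {
    std::map<std::string, double> profile;
    for (const auto& m : metrics) profile[m] = 10.0;
    return profile;
}

void analysisLoop(const std::vector<Property>& properties, const std::vector<Region>& contexts) {
    for (const auto& prop : properties) {                    // select property to evaluate
        for (const auto& region : contexts) {                // select search context (Region)
            auto profile = measure(region, prop.metrics);    // "order" metrics, wait, update profile
            double sev = prop.severity(profile, region);     // evaluate the property
            if (sev > 0)                                      // rank / report the property
                std::cout << prop.name << " in " << region.name
                          << ": severity " << sev << '\n';
        }
    }                                                         // then change context or property
}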

Page 26: Automatic Performance Analysis and Tuning


Content

1. Our goals – automatic analysis and tuning

2. Automatic analysis based on ASL

3. Dynamic tuning
   • Objectives
   • MATE (Monitoring, Analysis and Tuning Environment)
   • Tuning techniques and experiments

4. Conclusions and future work

Page 27: Automatic Performance Analysis and Tuning

Objectives

Improve the execution of a parallel application by dynamically adapting it to the environment.

Key issues:
• Dynamic performance tuning approach
• Automatic improvement of any application without recompiling or rerunning it
• What can be tuned in an unknown application?
  – Library usage
  – Operating system usage

MATE – Monitoring, Analysis and Tuning Environment
• prototype implementation in C++
• for PVM-based applications
• Sun Solaris 2.x / SPARC

Page 28: Automatic Performance Analysis and Tuning

MATE

[Diagram: Machines 1 and 3 run PVM tasks (Task1, Task2) under pvmd, each with a local Monitor and Tuner; the Monitors instrument the tasks and send events to the Analyzer on Machine 2, which requests instrumentation and tells the Tuners which solutions to apply.]

Page 29: Automatic Performance Analysis and Tuning

MATE: Monitoring

Monitors control the execution of application tasks and allow for dynamic event tracing.

Key services:
• Distributed application control
• Instrumentation management
  – AddEventTrace(id, func, place, args)
  – RemoveEventTrace(id)
• Transmission of requested event records to the analyzer

[Diagram: on Machine 1 the Monitor instruments Task1 and Task2 (each with the tracing library loaded) via DynInst; the Analyzer on Machine 2 sends add event / remove event requests and receives the resulting event streams.]

Page 30: Automatic Performance Analysis and Tuning

MATE: Monitoring

Instrumentation management:
• Based on the DynInst API
• Dynamically loads the tracing library
• Inserts snippets at the requested points
• A snippet calls a library function
• The function prepares an event record and transmits it to the Analyzer (a DynInst sketch follows after the diagram below)

Event record:
• What – event type (id, place)
• When – global timestamp
• Where – task identifier
• Requested attributes – function call parameters, source code line number, etc.

[Diagram: the Monitor on Machine 1 loads the tracing library into Task1 and instruments the entry and exit of pvm_send(params); the inserted snippets call LogEvent(params), which builds an event record (e.g. "1 0 64884 524247 262149 1 23") and sends it to the Analyzer over TCP/IP.]
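The instrumentation step could be expressed with the DynInst BPatch API roughly as below. This is a hedged sketch: LogEvent and libtrace.so are taken from the diagram, the rest is assumed, and the exact BPatch calls and headers may differ between DynInst versions.

#include "BPatch.h"
#include "BPatch_function.h"
#include "BPatch_image.h"
#include "BPatch_point.h"
#include "BPatch_process.h"
#include "BPatch_snippet.h"

#include <vector>

// Attach to a running task and trace pvm_send() entries by calling LogEvent()
// from a dynamically loaded tracing library ("libtrace.so" is an assumed name).
void instrumentPvmSend(const char* path, int pid) {
    BPatch bpatch;
    BPatch_process* proc  = bpatch.processAttach(path, pid);
    BPatch_image*   image = proc->getImage();

    proc->loadLibrary("libtrace.so");                 // tracing library from the diagram

    std::vector<BPatch_function*> target, logger;
    image->findFunction("pvm_send", target);          // function to instrument
    image->findFunction("LogEvent", logger);          // event-logging routine in the library

    // Build the call LogEvent(eventId, firstParam) and insert it at pvm_send() entry.
    std::vector<BPatch_snippet*> args;
    args.push_back(new BPatch_constExpr(1));          // eventId = 1
    args.push_back(new BPatch_paramExpr(0));          // first parameter of pvm_send
    BPatch_funcCallExpr logCall(*logger[0], args);

    std::vector<BPatch_point*>* entries = target[0]->findPoint(BPatch_entry);
    proc->insertSnippet(logCall, *entries);
    proc->continueExecution();
}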

Page 31: Automatic Performance Analysis and Tuning

MATE: Analysis

The Analyzer is responsible for automatic performance analysis on the fly.

• It uses a set of predefined tuning techniques
• Each technique is specified as:
  – measure points – what events are needed
  – performance model and activating conditions
  – solution – tuning actions/points/synchronization: what to change, where and when
  (an interface sketch for such a technique follows below)

Analysis cycle:
1. Request events
2. Collect events
3. Calculate metrics
4. Evaluate the performance model
5. Refine monitoring
6. Perform tuning
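A tuning technique as specified above (measure points, performance model and activating conditions, solution) could be expressed as an interface along these lines; this is a sketch only and does not claim to be MATE's actual class design.

#include <map>
#include <string>
#include <vector>

// Illustrative interface for a MATE-style tuning technique; every name below
// is an assumption made for this sketch.
struct MeasurePoint { std::string function; std::string place; int eventId; };
struct TuningAction { std::string description; };  // e.g. "one-time call pvm_setopt(PvmFragSize, N)"

class TuningTechnique {
public:
    virtual ~TuningTechnique() = default;

    // Measure points: which events (function, place, id) are needed.
    virtual std::vector<MeasurePoint> measurePoints() const = 0;

    // Performance model + activating conditions, evaluated over collected metrics.
    virtual bool shouldTune(const std::map<std::string, double>& metrics) const = 0;

    // Solution: what to change, where and when.
    virtual TuningAction solution(const std::map<std::string, double>& metrics) const = 0;
};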

Page 32: Automatic Performance Analysis and Tuning

MATE: Analysis

[Diagram: inside the Analyzer, an Event Collector receives events from the tracing library via TCP/IP and feeds an Event Processor that maintains a Metric Repository; the tuning techniques (each running in its own thread) consume those metrics, while the Instr Manager sends instrumentation requests to the monitors and the Tuning Manager sends tuning requests to the tuners, all via TCP/IP.]

Page 33: Automatic Performance Analysis and Tuning

MATE: Knowledge

Measure point example:

<intrumentation_request taskId=all>
  <function name="pvm_send">
    <eventId>1</eventId>
    <place>entry</place>
    <param idx=1>int</param>
    <param idx=2>int</param>
  </function>
</intrumentation_request>

Meaning: insert instrumentation into ALL tasks at the entry of function pvm_send() as eventId 1, recording parameter 1 as int and parameter 2 as int.

Page 34: Automatic Performance Analysis and Tuning

MATE: Knowledge

Performance model example:

<performance_model>
  <value name=CurrentSize>
    <calc>
      <type>function_result</type>
      <function name="pvm_getopt">
        <param>PvmFragSize</param>
      </function>
    </calc>
  </value>
  ...
</performance_model>

Meaning:
CurrentSize = result of pvm_getopt(PvmFragSize)
OptimalSize = Average(MsgSize) + Stddev(MsgSize)
Condition: CurrentSize – OptimalSize > threshold1
CommunicationCost = xxx
Condition: CommunicationCost > threshold2
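The OptimalSize rule in this model (mean plus one standard deviation of the observed message sizes) is straightforward to compute; a small C++ sketch, with names invented for illustration:

#include <cmath>
#include <vector>

// OptimalFragSize = Average(MsgSize) + Stddev(MsgSize), as in the model above.
double optimalFragSize(const std::vector<double>& msgSizes) {
    if (msgSizes.empty()) return 0.0;

    double mean = 0.0;
    for (double s : msgSizes) mean += s;
    mean /= msgSizes.size();

    double var = 0.0;
    for (double s : msgSizes) var += (s - mean) * (s - mean);
    var /= msgSizes.size();

    return mean + std::sqrt(var);
}

The technique would then fire only when CurrentSize differs from this value by more than threshold1 and the measured communication cost exceeds threshold2.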

Page 35: Automatic Performance Analysis and Tuning


MATE: Knowledge

Solution example:

<tuning_request taskId=524247>
  <action>
    <one_time_call>
      <function name="pvm_setopt">
        <param idx=0>PvmFragSize</param>
        <param idx=1>16384</param>
      </function>
      <synchronize>
        <breakpoint>
          <function name="pvm_send">
            <place>entry</place>
          </function>
        </breakpoint>
      </synchronize>
    </one_time_call>
  </action>
</tuning_request>

Meaning: in the task with tid=524247, execute the function pvm_setopt(PvmFragSize, 16384) one time, breaking at the entry of function pvm_send().

Page 36: Automatic Performance Analysis and Tuning

MATE: Tuning

Tuners apply solutions by executing tuning actions at specified tuning points.

• The tuner module is integrated with the monitor process
• It receives a tuning request from the analyzer
• It prepares the modifications (snippets)
• It applies the modifications via DynInst

Analyzer:

TuningReq() {
  send_req(tuner, taskId, tuningReq);
}

Tuner/Monitor:

recv_req(taskId, TuningReq) {
  Task task = taskList.Find(taskId);
  snippet = PrepareSnippet(TuningReq);
  task.thread.insertSnippet(snippet);
}

Page 37: Automatic Performance Analysis and Tuning

MATE: Tuning

Tuning request:
• Tuner machine
• Task id
• Tuning action
  – one-time function call
  – function parameter change
  – function call
  – function replacement
  – variable change
• Tuning points as (object, value) pairs – function name, parameter name, parameter value
• Synchronization – when to perform the tuning action

Example: in the task with tid=524247, call the function pvm_setopt(PvmFragSize, 16384) one time, breaking at the entry of function pvm_send() (a DynInst sketch of this one-time call follows below).
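A one-time call like the example above could be applied through DynInst roughly as follows. This is a hedged sketch: MATE synchronizes the call with a breakpoint at pvm_send() entry, whereas the sketch simply stops the process first, and the exact BPatch calls may differ between DynInst versions.

#include "BPatch.h"
#include "BPatch_function.h"
#include "BPatch_image.h"
#include "BPatch_process.h"
#include "BPatch_snippet.h"

#include <pvm3.h>      // for the PvmFragSize option id
#include <vector>

// Hedged sketch of a tuner applying the one-time call
// pvm_setopt(PvmFragSize, 16384) inside a running task via DynInst.
void applyFragSizeTuning(BPatch_process* proc) {
    BPatch_image* image = proc->getImage();

    std::vector<BPatch_function*> setopt;
    image->findFunction("pvm_setopt", setopt);          // tuning point: pvm_setopt in the task

    std::vector<BPatch_snippet*> args;
    args.push_back(new BPatch_constExpr(PvmFragSize));  // option id from <pvm3.h>
    args.push_back(new BPatch_constExpr(16384));        // new fragment size in bytes
    BPatch_funcCallExpr call(*setopt[0], args);

    proc->stopExecution();
    proc->oneTimeCode(call);                             // execute the call once in the task
    proc->continueExecution();
}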

Page 38: Automatic Performance Analysis and Tuning

Tuning techniques

What can be tuned?
• Library usage
  – Tuning of PVM library usage
  – Investigating PVM bottlenecks
  – PVM communication bottlenecks:
    • Communication mode
    • Data encoding mode
    • Message fragment size

Page 39: Automatic Performance Analysis and Tuning

Tuning techniques

Communication mode
• 2 modes:
  – indirect – task to daemon to daemon to task
  – direct – task to task
• Indirect is slow, but the default
• Direct is faster, but consumes socket resources (limited number of connections)

Benefits when changing the communication mode to direct in a round-trip application:

Msg size [B]   Benefit
10             50%
100            46%
1000           33%
10000          20%
100000         19%
1000000        18%

Page 40: Automatic Performance Analysis and Tuning

Tuning techniques

Communication mode

Measure points: current communication mode – pvm_getopt(PvmRoute); new task creation – pvm_spawn()
Activating conditions: non shared-memory architecture; indirect mode; number of PVM tasks smaller than the system limit
Tuning action: one-time function call
Tuning points: pvm_setopt(PvmRoute, PvmRouteDirect)
Synchronization: break at entry of pvm_send()

Page 41: Automatic Performance Analysis and Tuning

Tuning techniques

Data encoding mode
• 2 modes:
  – XDR – allows transparent transfer between heterogeneous machines; the slower mode
  – DataRaw – the encoding phase is skipped; possible when the virtual machine contains only homogeneous machines
• XDR (the default) – more data to be transferred and more time required to encode/decode it
• Big endian / little endian problem
• DataRaw – more effective, e.g. for integer data

Benefits when changing the encoding mode to DataRaw in a round-trip application:

Msg size [B]   Benefit
10             2%
100            4%
1000           16%
10000          61%
100000         72%
1000000        73%

Page 42: Automatic Performance Analysis and Tuning

Tuning techniques

Data encoding mode

Measure points: hosts' architecture – pvm_config(), pvm_addhosts(), pvm_delhost()
Activating conditions: all machines in the PVM virtual machine have the same architecture
Tuning action: function parameter change
Tuning points: pvm_initsend(PvmDataDefault) -> pvm_initsend(PvmDataRaw)
Synchronization: none

Page 43: Automatic Performance Analysis and Tuning

Tuning techniques

Message fragment size
• A message is divided into fixed-size fragments (default 4 KB)
• A larger message means more fragments; a bigger fragment size means more data sent per fragment
• The optimal fragment size depends on the size of the exchanged data
• Effective when the direct communication mode is used
• Drawback: increased memory usage

Benefits when changing the 4KB message fragment size in a round-trip application:

Msg size [B]   Benefit (fragment size 16KB-512KB)
4              43%-44%
8              46%-48%
64             38%-52%
256            41%-53%
512            42%-54%
1024           43%-55%

Page 44: Automatic Performance Analysis and Tuning

Tuning techniques

Message fragment size

Measure points: current fragment size – pvm_getopt(PvmFragSize); message size – pvm_send()
Activating conditions: high frequency of messages with size > 4KB; OptimalFragSize = Average(MsgSize) + Stddev(MsgSize); CurrentFragSize – OptimalFragSize > threshold1; communication cost > threshold2
Tuning action: one-time function call
Tuning points: pvm_setopt(PvmFragSize, OptimalFragSize)
Synchronization: break at entry of pvm_send()
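For reference, the settings these three techniques change at run time correspond to ordinary PVM calls that an application could also make itself; an illustrative snippet (the 16 KB value is arbitrary):

#include <pvm3.h>

// The same settings the tuner changes dynamically, written as the ordinary
// PVM calls an application could make itself (values are illustrative).
void configure_pvm_for_large_messages(void) {
    pvm_setopt(PvmRoute, PvmRouteDirect);   // direct task-to-task communication
    pvm_setopt(PvmFragSize, 16 * 1024);     // larger message fragments, in bytes

    // Encoding is chosen per message: skip XDR when all hosts share an architecture.
    pvm_initsend(PvmDataRaw);
}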

Page 45: Automatic Performance Analysis and Tuning

Tuning techniques: Example application

• Integer Sort kernel benchmark from NAS
• High communication cost (50%)
• Default settings: indirect communication mode, DataRaw encoding, message fragment size 4KB

No.  Tuning technique                       Execution time [sec]   Tuning benefit [sec]   Intrusion [sec]
1.   PVM (no tuning)                        732                    -                      -
2.   PVM + communication mode tuning        604                    127 (17.5%)            21 (~3.5%)
3.   PVM + data encoding mode tuning        761                    -29 (-3.9%)            21 (~2.8%)
4.   PVM + message fragment size tuning     769                    -37 (-5.1%)            27 (~3.5%)
5.   PVM + all scenarios                    529                    202 (27.7%)            28 (~5.3%)

Page 46: Automatic Performance Analysis and Tuning

Other tuning techniques

Other:
• TCP/IP
  – send/receive buffer size
  – sending without delay (Nagle's algorithm, TCP_NODELAY)
• I/O
  – read/write operations
    • using prefetch for small requests
    • using asynchronous read/write instead of synchronous
  – I/O buffer size
• Memory allocation
  – plugging in specialized strategies (e.g. a pool allocator)

Page 47: Automatic Performance Analysis and Tuning


Content

1. Our goals – automatic analysis and tuning
2. Automatic analysis based on ASL
3. Dynamic tuning
4. Conclusions and future work

Page 48: Automatic Performance Analysis and Tuning

Conclusions

• Automatic performance analysis

• Dynamic tuning

• Designs

• Experiments

Page 49: Automatic Performance Analysis and Tuning

Future work

Automatic analysis

• Discuss and close the detailed ASL language specification

• Complete the property evaluator

• Connect the analyzer with the performance measurement tool

• Investigate the "why-axis" analysis (evaluate the causal property chains)

Page 50: Automatic Performance Analysis and Tuning

Future work

Dynamic tuning
• Approaches:
  – Black box – tuning of ANY application
    • more tuning techniques
  – Cooperative – tuning of a prepared application
    • supported by a program specification
    • application developed using a framework
    • knowledge about tuning techniques provided by the application framework

Page 51: Automatic Performance Analysis and Tuning


Automatic Performance Analysis and Tuning

Universitat Autónoma de Barcelona

Dyninst/Paradyn Week, February 2003

Thank you very much