Intel Trace Collector and Trace Analyzer Evaluation Report
Hans Sherburne, Adam Leko
UPC Group, HCS Research Laboratory, University of Florida

Color encoding key:
- Blue: Information
- Red: Negative note
- Green: Positive note

Basic Information
- Name: Intel Trace Collector, Intel Trace Analyzer
- Developer: Intel
- Current versions: Intel Trace Collector 5.0.1.0; Intel Trace Analyzer 4.0.3.1
- Website: http://www.intel.com/software/products/cluster
- Contact: http://premier.intel.com

Intel Cluster Tools Overview
- A toolkit for creating high-performance applications on Intel architectures (x86, IA64)
- Intel MPI Library: Intel's implementation of MPI
- Intel Cluster Math Kernel Library: contains several Intel-optimized math routines, and also a version of ScaLAPACK
- Intel Trace Collector & Trace Analyzer: the performance analysis portion of Intel Cluster Tools
  - The two are used in conjunction to analyze the performance of parallel applications (mostly MPI)
  - Trace Collector: provides a method for instrumenting programs and recording performance data
  - Trace Analyzer: provides a graphical representation of trace data from an STF trace file
  - Formerly known as Vampirtrace & Vampir

Trace Collector Overview
What can be traced:
- MPI applications can be traced automatically by linking against the profiling library (see the sketch after this list)
  - Records MPI routine calls
  - Data describing communication (point-to-point and collective)
  - Hardware counter data, if available
  - Statistics: function calls, sent messages, collective operations (count, duration, bytes)
- User-level code can be traced through manual instrumentation using the ITC API
  - User-defined states
  - User-defined counters
- Non-MPI (distributed) applications can be traced, using the same API calls as when instrumenting user code in MPI apps
- Binaries can be instrumented without recompilation
  - Use the itcinstrument tool
  - Must use MPI, or must explicitly initialize/finalize Trace Collector
- Java programs
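
To make the relink-only workflow concrete, here is a minimal sketch: a plain MPI program with no ITC-specific code, traced purely by relinking. The build lines in the comment are assumptions for illustration; exact library names, paths, and flags depend on the local installation.

    /* trace_demo.c -- a plain MPI program; automatic tracing requires no
     * source changes, only relinking against the ITC profiling library.
     *
     * Hypothetical build lines (check flags against the local install):
     *   mpicc -c trace_demo.c
     *   mpicc -o trace_demo trace_demo.o -lVT      # full tracing
     *   mpicc -o trace_demo trace_demo.o -lVTnull  # dummy library, no tracing
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);   /* wrapped by libVT: tracing starts here */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* A point-to-point exchange that would appear in the Timeline view */
        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();           /* wrapped by libVT: STF trace is written */
        return 0;
    }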

Trace Collector Libraries
ITC offers four different libraries for creating trace files, each with different operating characteristics:
- libVT
  - Contains wrapper functions for automatic logging of MPI calls
  - Offers extended functionality through an API for logging user-defined data
- libVTnull
  - Contains dummy versions of the API calls
- libVTfs
  - Same functionality as libVT, but trace file writing is done via TCP sockets
  - In case of failure, trace data is not lost
- libVTcs
  - Similar to libVTfs in that it uses TCP sockets to write trace files
  - Does not automatically log MPI calls
  - Requires that a process be explicitly designated as the server that coordinates trace file creation

Structured Trace File Format (STF)
- The Structured Trace File format is the default format for traces
- Data is divided into logical frames, which helps partition data for large-scale programs with large traces (possibly GBs):
  - Time axis
  - Location axis
  - Type of data (state, collective operations, point-to-point messages, counter values, MPI-IO)
- Indexing allows quick random access
- Uses multiple files
  - File division does not necessarily reflect frame division
  - Allows for parallelism in reading and writing
  - Documentation does not detail the inner workings
- Can be converted to single-file STF for ease of file handling and transmission
- No documentation is provided on how to construct STF trace files without using Trace Collector

STF Utilities
STF files can be manipulated using stftool and xstftool to:
- Extract various data
- Manipulate frames and groupings
- Convert STF files into AVT or XVT
  - AVT: format used by previous versions of Vampir; should be understood by Trace Analyzer; created by other existing tools
  - XVT: similar to AVT in syntax, but replaces integer descriptors with more easily understood titles and combines all data in one file
- These alternative formats are human- and script-readable, but no means is provided to facilitate importing the data into another tool

Trace Collector API
Intel Trace Collector offers an API to:
- Trace user code in detail
- Trace non-MPI distributed apps
Functions are defined to (see the sketch below):
- Record user-defined states in the trace
- Record user-defined communication events in the trace
- Record source code locations for correlation in Intel Trace Analyzer
- Record user-defined counters in the trace
- Define process groupings used in Trace Analyzer
- Define frames (using config options instead is recommended)
- Turn tracing on and off during execution
- Enable tracing of multithreaded applications
- Initialize and finalize Intel Trace Collector (needed for non-MPI applications)
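
Here is a hedged sketch of how these calls fit together for manual instrumentation. The function names (VT_classdef, VT_funcdef, VT_begin, VT_end, VT_traceoff, VT_traceon) follow our reading of the ITC documentation; exact signatures and return codes should be verified against the installed VT.h.

    /* Sketch of manual instrumentation with the ITC API (VT.h).
     * Assumes the documented VT interface; verify signatures locally. */
    #include <mpi.h>
    #include <VT.h>

    static int app_class;    /* handle for a user-defined activity class */
    static int solve_state;  /* handle for the user-defined "solve" state */

    static void solve(void)
    {
        VT_begin(solve_state);   /* enter the user-defined state */
        /* ... computational kernel to be timed ... */
        VT_end(solve_state);     /* leave the user-defined state */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);  /* libVT starts tracing here */

        /* Register a class and a state so that "solve" shows up as its
         * own activity in the Trace Analyzer views. */
        VT_classdef("Application", &app_class);
        VT_funcdef("solve", app_class, &solve_state);

        VT_traceoff();           /* leave warm-up work out of the trace */
        solve();
        VT_traceon();            /* record the measured iteration */
        solve();

        MPI_Finalize();          /* trace file is written at finalization */
        return 0;
    }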

Trace Collector Overhead
- All programs executed correctly when instrumented
- Benchmarks marked with a star had high variability in execution time; those readings are probably not accurate
- In most cases overhead was less than 8%
- Was not able to test the overhead of hardware counter instrumentation
- However, trace file writing for class B LU with 32 processes took almost 20 minutes!

Profiling overhead by benchmark (overhead = instrumented/uninstrumented run time):

  Benchmark               Overhead
  CAMEL                   5%
  NAS LU (8p, W)          3%
  NAS LU (32p, B)         2%
  PP: Big message         3%
  PP: Diffuse procedure   1%
  PP: Hot procedure       1%
  PP: Intensive server    0%
  PP: Ping pong           7%
  PP: Random barrier      5%
  PP: Small messages      0%
  PP: System time         1%
  PP: Wrong way           0%

Trace Analyzer
- Intel Trace Analyzer (ITA) is a visualization program
  - Reads STF trace files; trace files from previous versions should also work
- ITA can display:
  - Event-based data (including messages)
  - Statistical data
  - Counter data, if it is contained in the trace
- Displays may represent a view of:
  - Multiple processes: individual processes or groups of processes (depending on the selected filtering options)
  - A single process
- It is possible to configure the views in various ways:
  - Activities / symbols
  - Absolute time / scaled time (percentage of total)
  - Number of processes displayed at once
  - Colors used for activities

Trace Analyzer (2)
- Data from a large trace file can be viewed in increments
  - Select the appropriate frames from the STF file
- Views may be linked to the visible portion of a zoomed timeline
- Pre-computed statistical data can be viewed without loading trace data
General notes on the ITA interface:
- Uses X-windows
- Is quite stable
- Provides good interface responsiveness
- The interface is intuitive (for the most part)
- ITA is not capable of automatic analysis of trace data

Trace Analyzer Views
- Summary Chart: allows the user to see how much work is spent in MPI calls
- Timeline Display: zoomable, scrollable timeline representation of program execution
[Screenshots: Summary Chart; Timeline Display]

Trace Analyzer Views (2)
- Summary Timeline: timeline/histogram representation showing the number of processes in each activity per time bin
- Counter Timeline: value-over-time representation (behavior depends on the counter definition in the trace)
[Screenshots: Summary Timeline; Counter Timeline]

Trace Analyzer Views (3)
- Message Statistics Display: message data to/from each process (count, length, rate, duration)
- Process Profile Display: per-process data regarding activities
[Screenshots: Message Statistics; Process Profile Display]

Trace Analyzer Views (4)
- Statistics Display: various statistics regarding activities, in histogram, table, or text format
- Call Tree Display
[Screenshots: Statistics Display; Call Tree Display]

Trace Analyzer Views (5)
- Source View: source code correlation with events in the Timeline
- Activity Chart: per-process histograms of Application and MPI activity
[Screenshots: Source View; Activity Chart]

Trace Analyzer Views (6)
- Process Timeline: activity timeline and counter timeline for a single process
- Process Activity Chart: same type of information as the global Summary Chart
- Process Call Tree: same type of information as the global Call Tree
[Screenshots: Process Timeline; Process Activity Chart & Call Tree]

Bottleneck Identification Test Suite
Testing metric: what did trace visualization tell us (automatic instrumentation)?
- CAMEL: PASSED
  - Identified a large number of small messages at the beginning of program execution
  - Easy to see that MPI calls take up a small portion of run time (<3%)
- NAS LU: PASSED
  - Showed communication bottlenecks very clearly: a large(!) number of small messages
  - Shows sensitivity to latency for processors waiting on data from other processors
  - "W" class: 18 MB trace file; loads quickly
  - "B" class: 240 MB trace file; loads slowly (2-3 min.) and the program's responsiveness is diminished
    - However, it can be loaded in small pieces that load much faster
    - Some information is available without loading any frames
  - Took nearly 20 minutes to write the trace after program completion!

Bottleneck Identification Test Suite (2)
- Big message: PASSED
  - Traces illustrated a large amount of time spent in send and receive
- Diffuse procedure: PASSED
  - Traces illustrated a lot of synchronization, with each process executing user code in an exclusive, alternating manner
- Hot procedure: TOSS-UP
  - Assuming hardware counters work, it would be easy to see the extra CPU utilization
  - Manually instrumenting the code would improve the accuracy of source code correlation
- Intensive server: PASSED
  - Trace clearly shows that all processes communicate with a single process whose response time is delayed by user code
- Ping pong: PASSED
  - Traces illustrated that most time is spent in MPI code sending and receiving messages, with little time spent in user code
- Random barrier: PASSED
  - Traces show that there are many barriers, each one held up by a random processor in user code
- Small messages: PASSED
  - Traces illustrated a large number of messages being sent to node 0
- System time: TOSS-UP
  - The hardware counter timeline might be able to indicate the bottleneck if the counters were working
- Wrong way: PASSED
  - Trace shows that the first receive takes a long time, but the rest of the messages sent during this time period are received quickly

General Comments
- Intel Trace Collector/Analyzer are very popular and effective tools for creating and displaying trace files
- These tools are proprietary and closed source
- Analyzing the performance of MPI applications is the primary intended use
- Support for analyzing non-MPI applications is provided via an API and a special library (libVTcs, which allows for coordination of trace file creation without MPI)
- Performance analysis requires the user to have a good understanding of the types of problems likely to affect performance
  - No automatic detection of bottlenecks

Evaluation (1)
- Available metrics: 4.5/5
  - Can use PAPI
  - Many metrics (event-based and counter-based) are available, but it is not possible to create custom metrics as in Paraver
- Cost: 3/5
  - A single-user license costs ~$500
  - Multiple-user licenses are for a single cluster only
  - A 20-user license costs ~$5,000; a 100-user license costs ~$15,000; an unlimited-user license costs ~$30,000
- Documentation quality: 4/5
  - Documentation covers most of the features in a clear and consistent fashion
  - Trace Analyzer documentation includes a section that walks a user through the process of analyzing a trace file for bottlenecks via a sample scenario
  - However, some parts of the documentation are confusing if the document is not read in its entirety
  - Does not describe the inner workings of trace collection/display
Note: we evaluated the IA-32 MPICH Linux version.

Evaluation (2)
- Extensibility: 0/5
  - Commercial (no source); the trace file format is not documented
  - However, one could possibly use the distributed application tracing features to create traces
- Filtering and aggregation: 4/5
  - Much of what is recorded in trace files can be controlled through a configuration file (or command-line arguments)
  - Some post-mortem filtering and aggregation can be controlled from within Trace Analyzer, but it is not as customizable as other tools
- Hardware support: 1/5
  - Supports only systems using Intel IA-32, Itanium 2, or Intel Extended Memory 64 Technology
- Heterogeneity support: 5/5
  - Through the use of libVTcs, one may manually instrument the code of distributed applications across heterogeneous platforms
  - No automatic event capturing for heterogeneous applications, however

Evaluation (3)
- Installation: 4.5/5
  - Install was very simple and worked immediately
  - However, I was never able to get hardware counters to function, due to incompatibilities with the installed PAPI and getrusage
- Interoperability: 1/5
  - Trace Analyzer is capable of reading the older Vampirtrace trace file format, which can be output by some other tools
  - Trace files can be output in (or converted to) the older ASCII-based Vampirtrace trace file format
- Learning curve: 4.5/5
  - The most important and useful views and features are intuitive and easy to understand
  - Some features seem a bit redundant or oddly named
- Manual overhead: 3/5
  - MPI call tracing is done automatically by linking against the profiling library
  - Can also instrument all functions, or a handful of functions, using binary instrumentation
  - More detailed tracing information requires manually inserting API function calls
  - A null library is included so that binaries utilizing API function calls need not be altered

Evaluation (4)
- Measurement accuracy: 4/5
  - CAMEL overhead ~5%; tracing overhead is negligible
  - However, Trace Analyzer sometimes finds reversed messages that should not be there
- Multiple executions: 1/5
  - Multiple instances of Trace Analyzer can be opened at once, but comparing views must be done manually
  - Some support is offered for comparing statistics between two different trace files, but it is greatly limited (difference or quotient of histograms between two runs)
- Multiple analyses & views: 4/5
  - A number of common, useful views are available
  - However, the displayed values are not as customizable as in other tools
  - No automatic analysis is offered
  - Analysis can be performed by examining timelines, histograms, or textual representations
- Performance bottleneck identification: 4.5/5
  - No automatic detection
  - The views provided should allow for manual detection of the most common bottlenecks

Evaluation (5)
- Profiling/tracing support: 5/5
  - Both tracing (recording events and messages) and profiling (recording statistics) are supported, and they can be used independently of each other
- Response time: 2/5
  - No data at all until after the run has completed and the trace file has been opened
  - Some information is available without fully loading the trace file
  - Large trace files can take a long time to write out and read back in
- Searching: 0/5 (not supported)
- Software support: 4.5/5
  - The MPI profiling interface should permit use with many MPI implementations (support for Intel, LAM, and MPICH is explicitly offered)
  - Full support is available for C/C++ and Fortran, with some support for Java and OpenMP

Evaluation (6)
- Source code correlation: 4/5
  - Clicking any MPI call on the timeline brings up source code correlation
  - User code correlation requires more manual effort
- System stability: 4.5/5
  - Trace Analyzer crashed (segmentation fault) only once throughout the evaluation
  - Trace Collector never caused an application to fail
- Technical support: 4/5
  - Quick initial response through the support webpage (a few hours)
  - Subsequent responses required a few days

References
- Intel Trace Analyzer 4.0 User's Guide, version 4.0.3.0
- Intel Trace Collector User's Guide, IA32-LIN-MPICH PRODUCT 5.0.1.0