



From Trace Generation to Visualization: A Performance Framework for Distributed Parallel Systems1

C. Eric Wu, Anthony Bolmarcich, Marc Snir
IBM T.J. Watson Research Center

P.O. Box 218, Yorktown Heights, NY 10598

David Wootton, Farid Parpia
IBM RS6000 Division

2455 South Road, Poughkeepsie, NY 12601

Anthony Chan2, Ewing Lusk3, William Gropp3

Mathematics and Computer Science Division
Argonne National Laboratory, Argonne, IL 60439

Abstract

In this paper we describe a trace analysis framework, from trace generation to visualization. It includes a unified tracing facility on IBM SP systems, a self-defining interval file format, an API for framework extensions, utilities for merging and statistics generation, and a visualization tool with preview and multiple time-space diagrams. The trace environment is extremely scalable, and combines MPI events with system activities in the same set of trace files, one for each SMP node. Since the amount of trace data may be very large, utilities are developed to convert and merge individual trace files into a self-defining interval trace file with multiple frame directories. The interval format allows the development of multiple time-space diagrams, such as thread-activity view, processor-activity view, etc., from the same interval file. A visualization tool, Jumpshot, is modified to visualize these views. A statistics utility is developed using the API, along with its graphics viewer.

Keyword Phrases: distributed parallel systems, SMP clusters, trace generation, interval file format, multiple time-space diagrams, trace visualization

1. This work describes a research prototype. IBM may not offer any product or services discussed herein, and no inference should be drawn about any IBM product from the contents of this paper.

2. Supported by the ASCI/Alliances Center for Astrophysical Thermonuclear Flashes at the University of Chicago under DOE subcontract B341495.

3. This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract W-31-109-Eng-38.


1.0 Introduction

Performance data collection and analysis is critical for understanding the behavior of programs in distributed parallel systems. One common way of collecting performance data is to generate trace events while executing the program. Trace data collected during the execution can then be used for other purposes such as debugging and program visualization. Such a performance framework for distributed parallel systems needs the ability to collect thread dispatch events, a globally (or virtually) synchronized clock, and the ability to record both user and system events in a common file.

Tracing overhead should be as small as possible, and the user should be able to trace system events as well as user events. Most trace-driven performance tools capture only MPI events (the beginning and end of MPI calls) and user events. This is not sufficient for multi-threaded implementations of MPI, especially on SMP nodes. Threads may be scheduled and de-scheduled during MPI calls, multiple threads may participate in the call, and threads may migrate from one processor to another. To understand performance issues it is absolutely necessary to trace thread scheduling events.

1.1 Clock Synchronization Problem

One of the most serious problems for analyzing trace data in a distributed parallel system is clock synchronization [2]. In such a system, multiple nodes produce separate trace streams independently. Since local clocks are used to generate local timestamps and there exist discrepancies among local clocks, the logical order of events cannot be guaranteed.

FIGURE 1. Accumulated timestamp discrepancies among 4 local clocks

Figure 1 shows the accumulated timestamp discrepancies among 4 local clocks over a period of roughly 140 seconds. It shows the general clock synchronization problem, regardless of systems or platforms. The elapsed time of a reference clock is used as the x axis. It can be seen that the accumulated discrepancies increase as the elapsed time increases, regardless of the reference clock.

Previous systems have adopted purely software-based approaches of varying complexity and accuracy (see [3, 4, 5, 6, 7, 8]) to create a virtual global time out of the individual local clocks available at each node of a distributed system. Some of these approaches, such as the Network Time Protocol [4, 5], are extremely successful in particular environments (such as the Internet) but not accurate enough to solve the clock synchronization problem in a distributed parallel system. In Section 2 we describe our approach to the clock synchronization problem using the switch adapter clock on IBM SP systems.

1.2 Motivation for Intervals

The execution of an MPI (Message Passing Interface) program in a trace-based performance environment generates event records. A trace record, called an “event record”, has a timestamp to indicate the point in time when the event is generated. Event records are generated at the start and end of each MPI call using the standard PMPI interface.
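For illustration, a PMPI-based wrapper for MPI_Send could cut the two event records like this minimal sketch; trace_event() and the event-type codes are hypothetical stand-ins for the tracing library's record-cutting call, not its actual interface:

#include <mpi.h>

/* Hypothetical stand-ins for the tracing library. */
extern void trace_event(int event_type, int d1, int d2, int d3);

#define EV_MPI_SEND_BEGIN 100   /* assumed event-type codes */
#define EV_MPI_SEND_END   101

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int rc;
    trace_event(EV_MPI_SEND_BEGIN, count, dest, tag);      /* start-of-call event */
    rc = PMPI_Send(buf, count, datatype, dest, tag, comm); /* the real MPI call   */
    trace_event(EV_MPI_SEND_END, count, dest, tag);        /* end-of-call event   */
    return rc;
}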

It is difficult to see an event in a time-space diagram, since a point in time can hardly be seen no matter how far one zooms in. An “interval record”, on the other hand, has an additional duration field to indicate how long the interval lasts. Intervals are therefore considered visualization-friendly and are much easier to visualize than events.

Utilities are required to convert event trace files into interval trace files for easy visualization. An interval record has four variants: begin, continuation, end, and complete. In the simple case where an MPI call is executed by one thread on one processor without interruption, an interval record of type “complete” is generated for the duration of the call. However, if the execution of the call was not continuous (for example, the thread was de-scheduled), then multiple interval pieces are generated, one for each continuous time interval. The first such record has type “begin”, the last has type “end”, and any records in between have type “continuation”. This type information allows us to properly count MPI calls and associate call fragments that pertain to the same call. After the event-to-interval conversion, multiple interval files, one from each node, are merged into one single interval file for analysis, statistics generation, and visualization.
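A sketch of how a conversion utility might emit these variants for one call; the enum values and write_interval() are illustrative names, not the file format's encoding:

/* The four interval variants encoded by the "bebits" (illustrative names). */
enum iv_type { IV_COMPLETE, IV_BEGIN, IV_CONTINUATION, IV_END };

extern void write_interval(enum iv_type type, double start, double duration);

/* Emit the interval pieces for one call whose execution was split into
   n_pieces continuous time ranges (n_pieces >= 1). */
void emit_pieces(int n_pieces, const double *start, const double *duration)
{
    int i;
    if (n_pieces == 1) {                      /* uninterrupted call */
        write_interval(IV_COMPLETE, start[0], duration[0]);
        return;
    }
    for (i = 0; i < n_pieces; i++) {          /* interrupted call */
        enum iv_type t = (i == 0) ? IV_BEGIN
                       : (i == n_pieces - 1) ? IV_END
                       : IV_CONTINUATION;
        write_interval(t, start[i], duration[i]);
    }
}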

Time-space diagrams are desirable for a user to see intervals along multiple timelines. In a time-space diagram, time is used as one axis while the other axis is organized according to a significant discriminator, such as threads, processors, or record types. If each interval record contains a node ID, a processor ID, a thread ID, and a record type ID, multiple time-space diagrams may be derived from the same interval file, providing various views from different viewpoints. The possible time-space diagrams include:

• Thread-activity view: A thread-activity view displays activities (using various color graphic objects to represent various records, such as MPI_Send, MPI_Recv, etc.) along multiple timelines, one for each thread. A thread-activity view could be a view of interval pieces with no nested states, or a view with connected and nested states.


• Processor-activity view: A processor-activity view displays activities along multiple timelines, one for each processor. This time-space diagram must be a view of interval pieces, since threads may jump among processors in an SMP node and connecting interval pieces in different threads makes no sense.

• Thread-processor view: A thread-processor view displays processors (using various color graphic objects to represent various processors, such as CPU_0 in node 0, CPU_1 in node 0, etc.) along multiple timelines, one for each thread. This time-space diagram shows how threads jump among the processors.

• Processor-thread view: A processor-thread view displays threads along multiple timelines, one for each processor. This time-space diagram shows processor allocation among threads.

Other possible views may use record type as the significant discriminator along the y-axis. Although all these views can be derived from the same interval file, the available views depend on the visualization tool.

In Section 2 we describe the unified approach for trace generation, including both system and user events and global clock records, a self-defining interval format, and a simple API for reading the interval files. Utilities and a statistics viewer are discussed in Section 3 along with some design considerations, and trace visualization using the modified Jumpshot is discussed in Section 4. A summary is given in Section 5.

2.0 A Unified Approach

The goal of the project is to provide a unified and easily expandable trace environment for MPI and other software layers on IBM SP systems. A user is allowed to trace both user and system activities. To achieve the goal, we chose the native trace facility in the IBM SP systems to generate MPI events along with system events. The AIX trace facility, as part of the IBM AIX operating system, is capable of capturing a sequential flow of time-stamped events to provide a fine or coarse level of detail on system and user activities in a single stream. Figure 2 illustrates the control flow of the performance framework, from trace generation to visualization.

The left half of Figure 2 illustrates the trace generation step, in which a user program is linked with the tracing library so that its execution creates multiple raw trace files, one on each node. Both system and user events, including process dispatch and MPI events, are generated in the same set of raw trace files. The right half of Figure 2 illustrates the trace processing and analysis steps, including merging, format conversion, and visualization, which are discussed in later sections.

2.1 Collecting Trace Records

To trace program execution, a mechanism is provided to specify a set of trace options, such as the name prefix of the trace files, the trace buffer size, and the events to be traced. By default, tracing starts at the start of program execution.


FIGURE 2. Trace generation and processing in the unified tracing approach. [Diagram flow: trace library source code → compile and link → program → execute → raw trace files → convert and merge → interval file → statistics generation / format conversion → SLOG file → visualization.]

The user can also delay trace generation until a later point, to trace only a portion of the code and substantially reduce the amount of trace data. The tracing library also adds a unique sequence number to each point-to-point message-passing event record so that utilities can match sends with corresponding receives.

The cost of cutting an ordinary trace record has three parts. The first is the cost of testing whether the event is enabled and then calling the trace buffer insertion routine. The second is the cost of the trace buffer insertion routine itself. The third is the cost of the wrapper routines in the tracing library, which varies depending on the individual MPI wrapper. If a typical trace record has three words of data in addition to a one-word record header (called a hookword, which identifies the event type and record length) and a one-word timestamp, the average cost of cutting a trace record is fairly small (a small fraction of one microsecond) for the first two parts on a modern PowerPC machine. Overall, trace generation is very efficient [9, 10].
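For concreteness, a typical five-word record of this shape could be declared as follows; the struct and the split of the hookword into type and length halves are illustrative assumptions, not the AIX trace format:

/* Illustrative layout of a typical 5-word trace record: a hookword
   encoding event type and record length, a timestamp, and three
   words of event data. */
typedef unsigned int word_t;                /* one 32-bit word */

typedef struct {
    word_t hookword;                        /* event type + record length */
    word_t timestamp;                       /* local-clock timestamp      */
    word_t data[3];                         /* event-specific data        */
} trace_record_t;

#define HOOK_EVENT(h)  ((h) >> 16)          /* assumed: type in high half   */
#define HOOK_LENGTH(h) ((h) & 0xffffu)      /* assumed: length in low half  */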

2.2 Global Clock Records

In order to preserve the logical order of events, we take advantage of the SP switch hardware. The IBM SP switch adapter, which connects each SP node to the high-performance switch network, provides a globally synchronized clock. The clock synchronization problem could be completely avoided if all events utilized the global clock instead of local clocks. However, accessing the global clock is much more expensive than accessing a local clock. In addition, it is not feasible to change the AIX trace facility to use the global clock for system events.



Thus, we chose to access the global clock register periodically on each node to collect global clock records, each of which contains a global timestamp and a local timestamp, and to adjust local timestamps after the trace files are created.

It can be seen in Figure 1 that the measured local clock drift rate (the slope) for a given node is roughly constant over a reasonable period of time. This is because the frequency of a clock crystal is directly related to its temperature, and it remains more or less constant unless its temperature changes dramatically.

During the merge process the first global clock records in the individual trace files are used to determine the starting point in time for records in each trace file. Subsequent global clock records are used to calculate the ratio of global to local clock timestamps. The ratio is then used to adjust local timestamps with respect to the global clock. Given a sequence of timestamp pairs (Gi, Li), i = 0...n, where Gi and Li represent the global and local timestamps, respectively, the global to local clock ratio R is calculated as the root mean square of the slope segments constructed by adjacent pairs of timestamp points. That is,

$$R = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{G_i - G_{i-1}}{L_i - L_{i-1}}\right)^2}$$

The root mean square of the slope segments is a reasonable choice for the global to local clock ratio, since slope segments with bigger slopes are compensated by slope segments with smaller slopes. The equation is also better than the root mean square of all slopes, which uses (G0, L0) instead of (Gi-1, Li-1) in the calculation and gives too much weight to the first point in the sequence. Thus, an interval generated on a node with a local timestamp S and duration D can be adjusted to a global timestamp R*S and duration R*D.

An alternative is to use the slope of the last timestamp pair as the ratio, if the elapsed time of the trace is reasonably long. Since we collect a sequence of timestamp pairs, it is also possible to adjust local timestamps using slopes of individual slope segments. This approach effectively partitions the total elapsed time into n segments, each of which has its own global to local clock ratio.
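The ratio computation and the timestamp adjustment can be sketched in C as follows, assuming the global clock pairs (Gi, Li) have already been extracted from the trace into arrays; clock_ratio() implements the RMS formula above:

#include <math.h>

/* Root mean square of the adjacent-pair slopes (G[i]-G[i-1])/(L[i]-L[i-1]),
   computed over the pairs (G[0],L[0])..(G[n],L[n]); requires n >= 1. */
double clock_ratio(const double *G, const double *L, int n)
{
    double sum = 0.0;
    int i;
    for (i = 1; i <= n; i++) {
        double slope = (G[i] - G[i-1]) / (L[i] - L[i-1]);
        sum += slope * slope;
    }
    return sqrt(sum / n);
}

/* Adjust a local interval (start S, duration D) to the global clock. */
void adjust_interval(double R, double S, double D, double *gS, double *gD)
{
    *gS = R * S;   /* global start time */
    *gD = R * D;   /* global duration   */
}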

2.3 A Self-Defining Interval Format

Trace files are often organized in a specific format for analysis and visualization. Existing formats include ALOG [11] and CLOG [8] from Argonne National Laboratory, both with predefined record formats. In general, predefined record formats are less flexible than the so-called “self-defining data format”, also known as “data meta-format”. Self-defining formats use meta-format syntax to describe what a valid data record looks like, and a valid file is one that follows the meta-format syntax. Existing self-defining data formats include HDF [12] and netCDF [13], which are designed for a small number of large records containing arrays of floating point data, and the Pablo Self-Defining Data Format (SDDF) [14]. The Pablo SDDF is a self-defining format to describe arbitrary trace records. Although intervals are obviously visualization-friendly, none of the above is designed with intervals in mind. In addition, since trace files tend to be very large, we would like to partition trace records into multiple frames and frame directories so that utilities and visualization tools can easily jump to the starting point of any given frame without reading through records ahead of the frame. Unfortunately, none of the above formats, including SDDF, supports frames and frame directories. Thus, an interval format with frames and directories is designed to keep the utilities and tools manageable as the size of the trace file increases.

In the unified tracing approach, interval records and their specifications are stored in separate files. The interval records are stored in an interval file, and the interval record specifications are stored in a description profile file. An interval file includes the version ID of the profile used to create it. Utilities and programs that read interval files check that they are using the correct profile by comparing the version ID stored in the profile file to the version ID stored in the interval files.

The profile file, interval records, interval file structure, and a simple API are described in the following separate subsections.

2.3.1 Profile File

A description profile file contains a header followed by interval record specifications. The header includes a version ID, the number of interval record types, and arrays of strings for record and field names. There is an interval record specification for each existing interval type. An interval type consists of the event type and two bits called “bebits”. A typical event type is MPI_Send. The “bebits” (begin and end bits) indicate whether an interval record is for a complete interval or for a begin, continuation, or end type of the interval.

Figure 3 shows the structure of a record specification in the profile. An interval record specification includes the interval type, the number of fields, the index of the record's name in the name array in the profile header, and a field specification for each field. Each field in a record is described through the use of one field description word, which includes a vector bit, a counter length, a data type, an element length, a field selection attribute, and a field name index. The field selection attribute in a field description word is used along with a field selection mask in the header of a given interval file to determine whether the field exists for the record type. This design accommodates the case in which a given record type has a different number of fields in individual and merged interval files.

The description profile provides the needed flexibility to describe almost any record, including interval records, in the same or a separate file. Once a utility reads the profile, it knows all field names and record names, along with field sizes, data types, etc.
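As an illustration, decoding such a field description word might look like this; the bit positions are invented for the sketch, since the real layout is defined by the profile format:

/* One 4-byte field description word: a vector bit, a counter length,
   a data type, an element length, a field selection attribute, and a
   field name index.  Bit positions here are illustrative only. */
typedef unsigned int fdw_t;

#define FDW_IS_VECTOR(w)   (((w) >> 31) & 0x1)    /* vector field?         */
#define FDW_CTR_LEN(w)     (((w) >> 28) & 0x7)    /* vector counter bytes  */
#define FDW_DATA_TYPE(w)   (((w) >> 24) & 0xf)    /* element data type     */
#define FDW_ELEM_LEN(w)    (((w) >> 16) & 0xff)   /* element size in bytes */
#define FDW_SELECT_ATTR(w) (((w) >> 12) & 0xf)    /* field selection attr  */
#define FDW_NAME_INDEX(w)  ((w) & 0xfff)          /* index into name array */

/* A field is present in a given interval file if its selection attribute
   is enabled in that file's header selection mask (assumed semantics). */
int field_present(fdw_t w, unsigned int selection_mask)
{
    return (selection_mask >> FDW_SELECT_ATTR(w)) & 0x1;
}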


2.3.2 Interval Records

An interval record has a number of fields, each of a fixed data type, as specified in the description profile. A field can be a single data element of a certain size, or a vector field with a vector counter followed by the data elements of the same type and size.

Each interval record is associated with a one-byte record length. A zero length indicates a record with more than 255 bytes. In such a case, the actual record length is stored in the next two bytes. Thus, a program reader can always find the next interval record without examining the current record in detail.
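A reader can therefore skip from record to record with logic like the following sketch; the byte order of the two-byte length is an assumption here:

#include <stdio.h>

/* Read the length of the next interval record: a one-byte length, with 0
   meaning a long record whose real length follows in the next two bytes.
   Returns the length, or -1 on end of file. */
int read_record_length(FILE *fp)
{
    int b = fgetc(fp);
    if (b == EOF)
        return -1;
    if (b != 0)
        return b;                        /* short record: 1..255 bytes  */
    int hi = fgetc(fp);                  /* long record: 2-byte length  */
    int lo = fgetc(fp);
    if (hi == EOF || lo == EOF)
        return -1;
    return (hi << 8) | lo;               /* big-endian order assumed    */
}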

An interval record includes a number of common fields: record type, start time, duration, processor ID, node ID, and logical thread ID (starts from 0 for each node). Currently there could be up to 512 relevant threads per node. Combined with the node ID field, it supports more than 2 million threads in a trace file. Related thread information, such as its process ID, system thread ID, and MPI task ID, is stored in the thread table of the interval trace file, ahead of all interval records.

MPI intervals may have an instruction address field suitable for a source code browser. A user marker interval may have up to two such fields, one for the begin marker and the other for the end marker. In addition to these common fields, the arguments of each MPI routine may add additional fields.

The tracing library also provides a number of routines to create user markers for various loops, code segments, subroutines, and functions. User markers are defined through the use of marker creation routines before creating user marker events.

FIGURE 3. Structure of a record specification in the description profile. [Layout, with item sizes in bytes in parentheses: record type index (4), record name index (2), number of fields (1), reserved (1), followed by a 4-byte field specification repeated for each field in the record.]


2.3.3 Interval File Structure

A valid interval file contains a header, a thread table, and interval records partitioned into multiple frames and frame directories. Figure 4 shows the structure of an interval file. The header of an interval file includes a profile version number, a header version number, the number of thread entries in the thread table, and the field selection mask.

The thread table consists of a number of thread entries. Each thread entry contains the MPI task ID, process ID, system thread ID, node ID, logical thread ID, and a thread type. Each interval record carries only a logical thread ID to identify the associated thread, which helps reduce the size of the interval file. Threads in a thread table are partitioned into three categories: MPI threads, user-defined threads, and system threads. This provides a way to choose specific threads for merging.

As described in Section 2.3, an interval file has multiple frame directories so that utilities and tools can jump into a specific frame without reading or processing any record ahead of the frame. The header of a frame directory contains the size of the frame directory, the number of frames in the frame directory, and the starting offsets of the previous and next frame directories. A frame directory has a number of frame entries. Each entry contains a frame pointer indicating the starting offset of the frame, the size of the frame, the number of records in the frame, and the start time and end time of the frame. Following the frame entries are the interval records. The last interval record of the last frame in the current frame directory is followed by the next frame directory, which again contains a frame directory header and a number of frame entries. The resulting file structure is simple yet powerful enough to enable fast access to interval records for utilities and tools.
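As a sketch, the directory header and frame entries just described might map onto structures like these; the field widths and types are our assumptions, not the on-disk format:

/* Illustrative in-memory mirror of a frame directory header. */
typedef struct {
    long long size;          /* size of this frame directory           */
    int       num_frames;    /* number of frame entries that follow    */
    long long prev_dir;      /* file offset of the previous directory  */
    long long next_dir;      /* file offset of the next directory      */
} frame_dir_header_t;

/* Illustrative in-memory mirror of one frame entry. */
typedef struct {
    long long offset;        /* starting file offset of the frame      */
    long long size;          /* size of the frame in bytes             */
    int       num_records;   /* number of interval records it holds    */
    double    start_time;    /* earliest start time in the frame       */
    double    end_time;      /* latest end time in the frame           */
} frame_entry_t;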

FIGURE 4. Structure of an interval file. [Layout: a header, a thread table, and interval records organized as a sequence of frame directories and their frames (Dir, Frame, Frame, Frame, Dir, Frame, Frame, ...); each directory links to its frames and to the next directory.]


2.4 A Simple API

In order to simplify the access to interval files, a utility library was developed to provide a simple API for handling interval files. Figure 5 shows a simple code segment to calculate the total number of bytes in the fields whose field name is “msgSizeSent”.

In Figure 5 the readHeader() routine reads the header of an interval file and returns the file pointer. Although an interval file may have many frame directories, a user need not read any frame directory except the first one, using the readFrameDir() routine, which provides the starting point of the doubly linked list used to access any frame in the file. The readProfile() routine reads the profile, using the field selection mask in the header to select the needed fields. During sequential access of an interval file, the getInterval() routine hides all subsequent frames and frame directories from the user. The getItemByName() routine returns the size of a scalar item, specified by its field name, and stores the value of the item in the place holder ilong; it returns -1 otherwise.

#include "event_define.h"
#include "uteheader.h"

long long ilong, totalSize = 0;
table_format table;
interval_header header;
frame_directory framedir;
FILE *infp;
char buffer[4096];                   /* assumed: record buffer and size */
int bufSize = sizeof(buffer), length, nbits;
...
if ((infp = readHeader("input_file", &header)) == NULL)
    exit(-1);
if (readFrameDir(infp, &framedir) <= 0)
    exit(-1);
if (readProfile("profile.ute", &table, header.masks) < 0)
    exit(-1);
while ((length = getInterval(infp, &framedir, buffer, bufSize)) > 0) {
    if ((nbits = getItemByName(&table, buffer, length, "msgSizeSent", &ilong)) > 0)
        totalSize += ilong;
}
printf("total bytes sent = %lld\n", totalSize);

FIGURE 5. Code segment to compute the total bytes sent in messages

There are many other routines in the utility library: to retrieve a field from a record, to determine whether a field is a vector field, to get a vector field such as a character string, to retrieve an interval at a specific location, to allocate space and store marker string/identifier pairs, and to retrieve a marker string using a marker identifier. Other frame and frame directory related routines are used to retrieve or aggregate information from frame directory structures, such as the total elapsed time and the total number of records in the trace file.

3.0 Utilities and Other Design Considerations

Utilities were developed to convert event trace files into interval files, to merge individual interval files into one single interval file, and to generate statistics from interval files using the API.


3.1 Convert and Merge Utilities

A convert utility was developed to convert a set of event trace files into a set of interval files. Matching events is the first step in the conversion process. A begin event is matched with its end event to create an interval, provided that there are no other events in between. If there are other events, such as user marker events and thread dispatch events, the interval is divided into multiple interval pieces.

A user can define and then create user markers in an MPI application. When a user marker is defined with a user-specified string, an identifier is returned so that the user can create user marker records with that identifier. To minimize overhead, the tracing library assigns an identifier to the string without any cross-task communication. Since the MPI application is executed with multiple tasks and the calling sequence of marker creation calls may differ among tasks, there is no guarantee that the same identifier is returned for the same marker string. Thus, the identifier for a marker with the string, say, “Initial Phase”, may be different in different tasks. The convert utility re-assigns a unique identifier to each user-defined marker string in the trace files. This ensures that the same identifier is used for the same marker string in all subsequent performance analysis.
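A minimal sketch of this re-assignment, mapping each distinct marker string to one global identifier; the table, capacity, and routine name are ours, and a linear search keeps the sketch short where the real utility might hash:

#include <string.h>

#define MAX_MARKERS 1024        /* assumed capacity for the sketch */

static char *marker_strings[MAX_MARKERS];
static int   num_markers = 0;

/* Return the global identifier for a marker string, assigning the next
   free identifier the first time the string is seen. */
int global_marker_id(const char *str)
{
    int i;
    for (i = 0; i < num_markers; i++)
        if (strcmp(marker_strings[i], str) == 0)
            return i;                        /* seen before: reuse its ID */
    if (num_markers == MAX_MARKERS)
        return -1;                           /* table full                */
    marker_strings[num_markers] = strdup(str);
    return num_markers++;
}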

A merge utility was also developed to merge multiple interval files, one from each node, into a single merged interval file. In addition to the interval file format, a format conversion function was added to create output files in SLOG format, which is understood by the Jumpshot visualization tool described in Section 4.0.

The key functions of the merge utility include aligning the starting points of individual interval files by the first global clock records in them, and adjusting local timestamps to account for local clock drifts, as described in Section 2.2. The merge utility uses a balanced tree in which each tree node holds a pointer to the next interval in the corresponding interval file. Tree nodes are sorted by end time (i.e., start time plus duration). After an interval is copied into the merged file, the next interval is fetched from the same file and its tree node is repositioned in the tree. Records in an interval file are in ascending order based on their end time.
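The merge loop can be sketched with a binary min-heap keyed on end time, which plays the role of the balanced tree (O(log k) per record for k input files); next_interval() and copy_interval() stand in for the real file-access routines:

#define MAX_FILES 64            /* assumed upper bound on input files */

/* Stand-ins: next_interval(f, &t) fetches file f's next record and
   returns > 0 with its end time in t, or 0 at end of file;
   copy_interval(f) appends the fetched record to the merged file. */
extern int  next_interval(int file_index, double *end_time);
extern void copy_interval(int file_index);

typedef struct {
    double end_time;            /* sort key: start time plus duration */
    int    file_index;          /* which per-node interval file       */
} heap_node_t;

static heap_node_t heap[MAX_FILES];
static int heap_size = 0;

/* Restore the min-heap property downward from slot i. */
static void sift_down(int i)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < heap_size && heap[l].end_time < heap[m].end_time) m = l;
        if (r < heap_size && heap[r].end_time < heap[m].end_time) m = r;
        if (m == i) return;
        heap_node_t tmp = heap[i]; heap[i] = heap[m]; heap[m] = tmp;
        i = m;
    }
}

void merge_files(int k)
{
    int i;
    double t;
    for (i = 0; i < k; i++)                  /* prime one record per file */
        if (next_interval(i, &t) > 0) {
            heap[heap_size].end_time = t;
            heap[heap_size].file_index = i;
            heap_size++;
        }
    for (i = heap_size / 2 - 1; i >= 0; i--) /* heapify */
        sift_down(i);

    while (heap_size > 0) {
        int f = heap[0].file_index;          /* earliest-ending record */
        copy_interval(f);
        if (next_interval(f, &t) > 0)
            heap[0].end_time = t;            /* refill from same file  */
        else
            heap[0] = heap[--heap_size];     /* that file is drained   */
        sift_down(0);
    }
}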

Table 1 shows the speed of the convert and slogmerge utilities for trace files created by a test program with 4 MPI tasks, each of which has 4 threads. The purpose of the experiment is to show that the time spent processing an event scales well with the number of events. The test program was executed several times with different problem sizes and parameters, so that the numbers of raw events differ. The utilities were executed on an IBM PowerPC machine. It can be seen that the average speeds of the utilities remain roughly unchanged while the number of raw events increases. Note that the slogmerge utility also converts the file format to SLOG, described in Section 4.0.

TABLE 1. Utility speed

# raw events             40282      128378     254225     641354     4613568    11216936
sec/event in convert     .0000890   .0000846   .0000841   .0000834   .0000821   .0000826
sec/event in slogmerge   .000258    .000232    .000231    .000239    .000189    .000241


3.2 Statistics Generation Utility and Viewer

A statistics utility was developed using the API to generate statistics from interval files. It reads one or more interval files and generates tables specified by a program written in a declarative language. An example program is:

table name=sample
      condition=(start < 2)
      x=("node", node)
      x=("processor", cpu)
      y=("avg(duration)", dura, avg)

The table generated by this program contains the average of the duration of each interval that started during the first 2 seconds of the run, for each (node, cpu) pair. The generated table is a tab-separated-value text file.

The field names used in the program, such as start, node, cpu, and dura, are defined in the profile file. Intervals to be included are selected using condition expressions. The x expressions specify the free variables of the table. The y expressions specify the dependent values of the table and how these values are aggregated. The free variables and the dependent values in the generated table are labeled with the strings given as the first element of the x and y expressions. The statistics program generates a set of pre-defined tables when it is not given user-defined table specifications. A statistics viewer was developed to visualize these pre-defined tables.

Figure 6 shows the visualization of one of the pre-defined tables. The table contains the sum of the duration of interesting intervals per node and per 50 equally sized time bins of the execution of the program. Here, an interesting interval is one for a state other than the default state of Running. With the example program that was traced, the interesting intervals were for MPI routines.

This table and its visualization indicate the time ranges of a time-space diagram that are likely to be interesting. The figure shows that the program is doing something interesting during the time ranges from the start of the program to 948 seconds, between 1117 and 1422 seconds, and from 1658 seconds to the end of the program.

3.3 Unification of Interval Pieces

An interval defines a time span or region for a running thread. Typical time spans include MPI routines, user marker regions, and a Running state if a thread is running but not inside any MPI routine or user-marked code segments. For a thread-activity view, interval pieces of the same state could be connected to show nested states, or disconnected as they are in the interval file.

Visualization with connected interval pieces shows nested states. For example, assume that user marker 2 is nested inside user marker 1, and MPI calls are issued inside the state defined by user marker 2. In this case the state defined by user marker 1 may have two interval pieces: its begin interval and end interval.


FIGURE 6. Statistics visualization for pre-defined statistics tables

Now consider the case in which these two interval pieces are far apart from each other. If a user looks at a window between these two far-apart interval pieces, there is no information available for them. On the other hand, the state defined by user marker 2 is divided into multiple interval pieces by MPI intervals. Again, its begin and end interval pieces may be far apart. When a user looks at a window between these two far-apart interval pieces, the visualization tool should be able to display the state defined by user marker 2, as long as the window contains a continuation interval piece for user marker 2.

It becomes clear from the example that if one jumps into the middle of an interval file and starts visualizing the intervals with interval pieces connected in a thread-activity view, the outer states may be missing if there is no additional information in the frames. Therefore, the merge utility provides additional zero-duration continuation intervals at the beginning of each frame. These zero-duration continuation intervals represent the nested outer states at the beginning of the frame.
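A sketch of what the merge utility might emit at a frame boundary; the bebits constant, open_state_types[], and emit_interval() are illustrative stand-ins for its actual bookkeeping:

#define BEBITS_CONTINUATION 0x1     /* encoding assumed for the sketch */
extern void emit_interval(int state_type, int bebits,
                          double start_time, double duration);

/* At a frame boundary, emit one zero-duration continuation interval for
   every state still open on some thread (e.g. an enclosing user marker
   region), so a viewer entering at this frame can rebuild the nesting
   without reading earlier frames. */
void emit_frame_prologue(double frame_start_time,
                         const int *open_state_types, int n_open)
{
    int i;
    for (i = 0; i < n_open; i++)
        emit_interval(open_state_types[i],   /* state (record) type     */
                      BEBITS_CONTINUATION,   /* not a begin, not an end */
                      frame_start_time,      /* placed at frame start   */
                      0.0);                  /* zero duration           */
}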

4.0 Visualization with Preview and Multiple Views

As described in Section 1.2, multiple time-space diagrams and performance-analysis applications may be derived from the same interval trace file. The Argonne visualization tool Jumpshot [8] was modified and used as one viewer. An API to write SLOG (for scalable log file) files is used together with the API described in Section 2.4 to read the individual interval files, merge them, and write a file in SLOG format, which is understood by the Jumpshot tool. In this section we briefly describe our approach and give examples of the use of the combined system.

FIGURE 7. Jumpshot visualization with preview for the FLASH code

Two challenges face a detailed trace-data visualization program such as Jumpshot when it is dealing with the large files of events that may result from a long run on a large parallel machine. The first is rapid access to a time interval that might be located far into the run (and far into the file). The second is that accurate portrayal of that time interval may involve data that was logged at times outside that interval. Examples, besides intervals that occur near the edges of a time interval, are long user-marker regions or arrows representing messages that are sent long before they are received. The SLOG file format addresses the first problem by dividing the time of the run into frames as in Figure 4 and employing a frame index based on time, so that given a time, it is easy to locate the frame containing that point in time. To address the second challenge, additional data is added to frames (in the form of what we call “pseudo-interval” records) that supplies whatever data is needed from other frames to complete the visualization of the current frame.
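Given frame entries carrying start and end times (as in the illustrative frame_entry_t sketched in Section 2.3.3), locating the frame for a chosen time reduces to a binary search; a minimal sketch:

/* Locate the frame containing time t by binary search over the frame
   entries' [start_time, end_time] ranges (entries are in time order).
   Returns the entry index, or -1 if t falls outside every frame. */
int find_frame(const frame_entry_t *entries, int n, double t)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (t < entries[mid].start_time)
            hi = mid - 1;
        else if (t > entries[mid].end_time)
            lo = mid + 1;
        else
            return mid;     /* start_time <= t <= end_time */
    }
    return -1;
}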

When it is first invoked, Jumpshot provides a summary preview of the whole file that allows the user to select a time. The desired frame can then be quickly located with the help of the frame index and accurately displayed with the data in the pseudo-interval records. The frame size is chosen so that the display of a single frame is quick. A preview and frame display for an adaptive mesh astrophysical application is shown in Figure 7. The smaller window in Figure 7 shows a graphical representation of the entire run. State counters accumulated during construction of the SLOG file and proportional allocation of event durations to a fixed number of time bins allow quick display of the entire run. Although a number of controls allow manipulation of this display, even the default presentation shown here allows one to identify the initialization and termination phases of this run, and the “typical” iteration phase in the middle. The user has selected a time instant in this middle section, which causes the display of the data in the frame containing this instant. Scalability in the time it takes to display this frame (independence from the size of the SLOG file) comes from the combination of this preview and the frame index that allows rapid access to the chosen frame.

FIGURE 8. A thread-activity view of the ASCI sPPM benchmark using Jumpshot


FIGURE 9. A processor-activity view of the ASCI sPPM benchmark using Jumpshot

Figure 8 shows a thread-activity view of the ASCI sPPM benchmark, which solves a 3D gas dynamics problem on a uniform Cartesian mesh using a simplified version of the piecewise parabolic method. The benchmark was executed on 4 nodes, each of which is an 8-way SMP. There were four threads per MPI process, one of which made MPI calls. One can see system activity on the non-MPI threads, and observe that one thread is idle during this part of the computation.

Figure 9 shows the processor-activity view of the same ASCI sPPM execution using Jumpshot. Since each node has eight processors, there may be up to eight timelines for each node. Here one can see that the CPUs are mostly idle (each horizontal line represents a CPU), and that the MPI threads for processes 0 and 1 jump from one CPU to another on the same node during this section of the run. More threads (and/or processes) are needed to take advantage of the extra CPUs. Additional views may be developed in the future.

Jumpshot can be used on logfiles produced by libraries other than the AIX-based tool described here. In particular, Jumpshot comes with an MPI profiling library that can produce SLOG files from any MPI implementation. On non-AIX platforms it is less efficient and captures less of the detail associated with thread dispatching than the AIX-based tool.

5.0 Summary

In this paper we describe a performance framework, from trace generation to visualization. It includes the unified tracing facility on IBM RS/6000 SP systems, a self-defining interval format, its API, utilities for merging, format conversion, and statistics generation, and a visualization tool for managing large amounts of trace data. Since trace files are generated independently during program execution on individual nodes, trace generation is extremely scalable.

The interval file format allows the development of multiple time-space diagrams, and has a description profile to describe record contents and to provide the flexibility for changes and extensions. In addition, the interval file format contains multiple frame directories for fast access to frames in the file. Thus, trace data remain manageable even for extremely large trace files.

The merge utility merges multiple interval files using both local and global timestamps, and the statistics utility generates statistics summary tables. A statistics viewer was also developed. The Jumpshot visualization tool provides a way to visualize multiple time-space diagrams from a single interval trace file.

Since global clock records are collected by a thread on each node, there is a remote chance that a significant discrepancy between the global and local clocks may be recorded due to, say, thread de-scheduling right after accessing the global clock. Although such a discrepancy can easily be filtered out by the utilities, an atomic operation would totally eliminate the possibility.

In summary, the framework uses an innovative approach to handle the clock synchronization problem. The ability to derive multiple views from a given interval trace file results from the nature of interval records. The native trace facility provides thread dispatch information, which can be analyzed and visualized in the framework. Future extensions with additional system activities, such as I/O, page misses, etc., may result in even better tools.

Trademarks and Disclaimer

AIX and IBM are registered trademarks of International Business Machines Corporation. PowerPC and Scalable POWERparallel are trademarks of International Business Machines Corporation. All other registered trademarks and trademarks are the properties of their respective companies.

IBM may have patents or pending patent applications covering subject matter in this paper. The furnishing of this paper does not grant any license to these patents. License inquiries should be sent, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785, USA.

References

1. C. Wu, H. Franke, and Y.-H. Liu, “A Unified Trace Environment for IBM SP Systems,” IEEE Parallel and Distributed Technology, vol. 4, no. 2, pp. 89 - 93, 1996.

2. L. Lamport, “Time, Clocks, and the Ordering of Events in a Distributed System,” Communications of the ACM, vol. 21, no. 7, pp. 558 - 565, July 1978.

3. G. Geist, M. Heath, B. Peyton, and P. Worley, “A User’s Guide to PICL: A Portable Instrumented Communication Library,” Technical Report ORNL/TM-11616, Oak Ridge National Laboratory, October 1990.


4. D. Mills, “Internet Time Synchronization: the Network Time Protocol,” IEEE Trans. on Communications, October 1991.

5. D. Mills, “Improved Algorithms for Synchronizing Computer Network Clocks,” IEEE/ACM Trans. on Networking, vol. 3, no. 3, June 1995.

6. F. Cristian, “Probabilistic Clock Synchronization,” Distributed Computing, vol. 3, pp. 146 - 158, 1989.

7. C. Liao et al., “Experience with an Adaptive Globally-Synchronizing Clock Algorithm,” Proc. of the 11th ACM Symposium on Parallel Algorithms and Architectures, June 1999.

8. O. Zaki, W. Gropp, E. Lusk, and D. Swider, “Scalable Performance Visualization with Jumpshot,” International Journal of Supercomputer Applications and High Performance Computing, vol. 13, no. 3, pp. 277 - 288, 1999.

9. IBM, “AIX Performance Tuning Guide,” IBM Manual SC23-2365.

10. IBM, “General Programming Concepts: Writing and Debugging Programs,” IBM Manual SC23-2533.

11. E. Karrels and E. Lusk, “Performance Analysis of MPI Programs,” Proc. of the Workshop on Environments and Tools for Parallel Scientific Computing, 1994.

12. NCSA, “NCSA HDF Version 2.0,” University of Illinois at Urbana-Champaign, National Center for Supercomputing Applications, 1989.

13. R. Rew, “netCDF User’s Guide Version 1.0,” Unidata Program Center, University Corporation for Atmospheric Research, 1989.

14. R. Aydt, “The Pablo Self-Defining Data Format,” Technical Report, Department of Computer Science, University of Illinois, March 1992, Revised April 1995.