
Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers

Gang Ren, Eric Tune, Tipp Moseley, Yixin Shi, Silvius Rus, and Robert Hundt, Google

Google-Wide Profiling (GWP), a continuous profiling infrastructure for data centers, provides performance insights for cloud applications. With negligible overhead, GWP provides stable, accurate profiles and a datacenter-scale tool for traditional performance analyses. Furthermore, GWP introduces novel applications of its profiles, such as application-platform affinity measurements and identification of platform-specific, microarchitectural peculiarities.

As cloud-based computing grows in pervasiveness and scale, understanding datacenter applications' performance and utilization characteristics is critically important, because even minor performance improvements translate into huge cost savings. Traditional performance analysis, which typically needs to isolate benchmarks, can be too complicated or even impossible with modern datacenter applications. It's easier and more representative to monitor datacenter applications running on live traffic. However, application owners won't tolerate latency degradations of more than a few percent, so these tools must be nonintrusive and have minimal overhead. As with all profiling tools, observer distortion must be minimized to enable meaningful analysis. (For additional information on related techniques, see the "Profiling: From single systems to data centers" sidebar.)

Sampling-based tools can bring overhead and distortion to acceptable levels, so they're uniquely qualified for performance monitoring in the data center. Traditionally, sampling tools run on a single machine, monitoring specific processes or the system as a whole. Profiling can begin on demand to analyze a performance problem, or it can run continuously [1].

Google-Wide Profiling can be theoretically viewed as an extension of the Digital Continuous Profiling Infrastructure (DCPI) [1] to data centers. GWP is a continuous profiling infrastructure; it samples across machines in multiple data centers and collects various events (such as stack traces, hardware events, lock contention profiles, heap profiles, and kernel events), allowing cross-correlation with job scheduling data, application-specific data, and other information from the data centers.

GWP collects daily profiles from several thousand applications running on thousands of servers, and the compressed profile database grows by several Gbytes every day. Profiling at this scale presents significant challenges that don't exist for a single machine. Verifying that the profiles are correct is important and challenging because the workloads are dynamic. Managing profiling overhead becomes far more important as well, as any unnecessary profiling overhead can cost millions of dollars in additional resources. Finally, making the profile data universally accessible is an additional challenge. GWP is also a cloud application, with its own scalability and performance issues.

With this volume of data, we can answer typical performance questions about datacenter applications, including the following:

- What are the hottest processes, routines, or code regions?
- How does performance differ across software versions?
- Which locks are most contended?
- Which processes are memory hogs?
- Does a particular memory allocation scheme benefit a particular class of applications?
- What is the cycles per instruction (CPI) for applications across platforms?

Additionally, we can derive higher-level answers to more complex but interesting questions, such as which compilers were used for applications in the fleet, whether there are more 32-bit or 64-bit applications running, and how much utilization is being lost by suboptimal job scheduling.

Infrastructure

Figure 1 provides an overview of the entire GWP system.

Collector

GWP samples in two dimensions. At any moment, profiling occurs only on a small subset of all machines in the fleet, and event-based sampling is used at the machine level.


Profiling: From single systems to data centers

From gprof [1] to Intel VTune (http://software.intel.com/en-us/intel-vtune), profiling has long been standard in the development process. Most profiling tools focus on a single execution of a single program. As computing systems have evolved, understanding the bigger picture across multiple machines has become increasingly important.

Continuous, always-on profiling became more important with Morph and DCPI for Digital UNIX. Morph uses low-overhead, system-wide sampling (approximately 0.3 percent) and binary rewriting to continuously evolve programs to adapt to their host architectures [2]. DCPI gathers much more robust profiles, such as precise stall times and causes, and focuses on reporting information to users [3].

OProfile, a DCPI-inspired tool, collects and reports data much in the same way for a plethora of architectures, though it offers less sophisticated analysis. GWP uses OProfile (http://oprofile.sourceforge.net) as a profile source. High-performance computing (HPC) shares many profiling challenges with cloud computing, because profilers must gather representative samples with minimal overhead over thousands of nodes. Cloud computing has additional challenges: vastly disparate applications, workloads, and machine configurations. As a result, different profiling strategies exist for HPC and cloud computing. HPCToolkit uses lightweight sampling to collect call stacks from optimized programs and compares profiles to identify scaling issues as parallelism increases [4]. Open|SpeedShop (www.openspeedshop.org) provides similar functionality. Similarly, Upshot [5] and Jumpshot [6] can analyze traces (such as MPI calls) from parallel programs but aren't suitable for continuous profiling.

References

1. S.L. Graham, P.B. Kessler, and M.K. McKusick, "Gprof: A Call Graph Execution Profiler," Proc. Symp. Compiler Construction (CC 82), ACM Press, 1982, pp. 120-126.
2. X. Zhang et al., "System Support for Automatic Profiling and Optimization," Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 15-26.
3. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
4. N.R. Tallent et al., "Diagnosing Performance Bottlenecks in Emerging Petascale Applications," Proc. Conf. High Performance Computing Networking, Storage and Analysis (SC 09), ACM Press, 2009, no. 51.
5. V. Herrarte and E. Lusk, Studying Parallel Program Behavior with Upshot, tech. report, Argonne National Laboratory, 1991.
6. O. Zaki et al., "Toward Scalable Performance Visualization with Jumpshot," Int'l J. High Performance Computing Applications, vol. 13, no. 3, 1999, pp. 277-288.


Sampling in only one dimension would be unsatisfactory: if event-based profiling were active on every machine all the time at a normal event-sampling rate, we would be using too many resources across the fleet. Alternatively, if the event-sampling rate is too low, profiles become too sparse to drill down to the individual machine level. For each event type, we choose a sampling rate high enough to provide meaningful machine-level data while still minimizing the distortion caused by the profiling on critical applications. The system has been actively profiling nearly all machines at Google for several years with only rare complaints of system interference.

A central machine database manages all machines in the fleet and lists every machine's name and basic hardware characteristics. The GWP profile collector periodically gets a list of all machines from that database and selects a random sample of machines from that pool. The collector then remotely activates profiling on the selected machines and retrieves the results. It retrieves different types of sampled profiles sequentially or concurrently, depending on the machine and event type. For example, the collector might gather hardware performance counters for several seconds each, then move on to profiling for lock contention or memory allocation. It takes a few minutes to gather profiles for a specific machine.

For robustness, the GWP collector is a distributed service. It helps improve availability and reduce additional variation from the collector itself. To minimize distortion on the machines and the services running on them, the collector monitors error conditions and ceases profiling if the failure rate reaches a predefined threshold. Aside from the collector, we monitor all other GWP components to ensure an always-on service to users.

On top of the two-dimensional sampling approach, we apply several techniques to further reduce the overhead. First, we measure the event-based profiling overhead on a set of benchmark applications and then conservatively set the maximum rates to ensure the overhead is always less than a few percent. Second, we don't collect whole call stacks for the machine-wide profiles, to avoid the high overhead associated with unwinding (but we collect call stacks for most server profiles at lower sampling frequencies). Finally, we save the profile and metadata in their raw format and perform symbolization on a separate set of machines. As a result, the aggregated profiling overhead is negligible: less than 0.01 percent. At the same time, the derived profiles are still meaningful, as we show in the "Reliability analysis" section.
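A minimal sketch of the collector's outer loop under this two-dimensional scheme might look as follows. The Machine API, event list, and sampling fraction below are illustrative assumptions, not GWP's actual interfaces:

```python
import random
from dataclasses import dataclass, field

# Hypothetical event descriptors: (event name, seconds to profile it).
EVENTS = [("cycles", 10), ("instructions", 10), ("lock-contention", 5)]
MACHINE_FRACTION = 0.01  # fraction of the fleet profiled at any moment (assumed)

@dataclass
class Machine:
    name: str
    def profile(self, event: str, duration_s: int) -> bytes:
        # Stand-in for remote profiling activation and result retrieval.
        return f"{self.name}:{event}:{duration_s}s".encode()

@dataclass
class Collector:
    raw_profiles: list = field(default_factory=list)

    def collect_once(self, fleet: list[Machine]) -> None:
        """One round: randomly sample machines, then profile each event in turn."""
        n = max(1, int(len(fleet) * MACHINE_FRACTION))
        for machine in random.sample(fleet, n):
            for event, duration in EVENTS:
                self.raw_profiles.append(
                    (machine.name, event, machine.profile(event, duration)))

fleet = [Machine(f"machine{i}") for i in range(1000)]
Collector().collect_once(fleet)
```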

Profiles and profiling interfaces

GWP collects two categories of profiles: whole-machine and per-process. Whole-machine profiles capture all activities happening on the machine, including user applications, the kernel, kernel modules, daemons, and other background jobs. The whole-machine profiles include hardware performance monitoring (HPM) event profiles, kernel event traces, and power measurements.

Users without root access cannot directly invoke most of the whole-machine profiling systems, so we deploy lightweight daemons on every machine to let remote users (such as GWP collectors) access those profiles. The daemons act as gatekeepers to control access, enforce sampling-rate limits, and collect system variables that must be synchronized with the profiles.


Figure 1. An overview of the Google-Wide Profiling (GWP) infrastructure. The whole system consists of collector, symbolizer, profile database, Web server, and other components. (The diagram shows collectors and per-machine daemons in the data center, applications with profiling interfaces, a binary collector feeding a binary repository, raw profiles feeding a MapReduce-based symbolizer, processed profiles stored in an OLAP profile database, and a Web server on top, driven by a machine database.)


We use OProfile (http://oprofile.sourceforge.net) to collect HPM event profiles. OProfile is a system-wide profiler that uses HPM to generate event-based samples for all running binaries at low overhead. To hide the heterogeneity of events between architectures, we define some generic HPM events on top of the platform-specific events, using an approach similar to PAPI [2]. The most commonly used generic events are CPU cycles, retired instructions, L1 and L2 cache misses, and branch mispredictions. We also provide access to some architecture-specific events. Although the aggregated profiles for those events are biased to specific architectures, they provide useful information for machine-specific scenarios.

In addition to whole-machine profiles, we collect various types of profiles from most applications running on a machine using the Google Performance Tools (http://code.google.com/p/google-perftools). Most applications include a common library that enables process-wide, stacktrace-attributed profiling mechanisms for heap allocation, lock contention, wall time and CPU time, and other performance metrics. The common library includes a simple HTTP server linked with handlers for each type of profiler. A handler accepts requests from remote users, activates profiling (if it's not already active), and then sends the profile data back.

The GWP collector learns from the cluster-wide job management system which applications are running on a machine and on which port each can be contacted for remote profiling invocation. Some programs on a machine lack remote profiling support, so per-process profiling doesn't capture them. However, these are few; comparison with system-wide profiles shows that remote per-process profiling captures the vast majority of Google's programs. Most programs' users have found profiling useful or unobtrusive enough to leave it enabled.
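The per-process handlers follow the google-perftools convention of serving profiles over HTTP. As a rough sketch of how a remote collector might pull one, assuming a pprof-style /pprof/... path, host, and port (all placeholders, not GWP's actual endpoints):

```python
import urllib.request

def fetch_profile(host: str, port: int, kind: str = "profile",
                  seconds: int = 30) -> bytes:
    """Pull a profile from a process's in-process HTTP profiling handler.

    The /pprof/... paths follow the google-perftools convention; the host,
    port, and parameters here are placeholders for illustration.
    """
    url = f"http://{host}:{port}/pprof/{kind}?seconds={seconds}"
    with urllib.request.urlopen(url, timeout=seconds + 10) as resp:
        return resp.read()

# Example (hypothetical task address): a 30-second CPU profile, then a heap
# snapshot, from one running server process.
# cpu = fetch_profile("some-task.example.com", 8080, "profile", 30)
# heap = fetch_profile("some-task.example.com", 8080, "heap")
```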

Together with profiles, GWP collects other information about the target machine and applications. Some of the extra information is needed to postprocess the collected profiles, such as a unique identifier for each running binary that can be correlated across machines with unstripped versions for offline symbolization. The rest is mainly used to tag the profiles so that we can later correlate the profiles with job, machine, or datacenter attributes.

Symbolization and binary storage

After collection, the Google File System (GFS) stores the profiles [3]. To provide meaningful information, the profiles must correlate to source code. However, to save network bandwidth and disk space, applications are usually deployed into data centers without any debug or symbolic information, which can make source correlation impossible. Furthermore, several applications, such as Java and QEMU, dynamically generate and execute code. The code is not available offline and can therefore no longer be symbolized. The symbolizer must also symbolize operating system kernels and kernel loadable modules.

Therefore, the symbolization process becomes surprisingly complicated, although it's usually trivial for single-machine profiling. Various strategies exist to obtain binaries with debug information. For example, we could try to recompile all sampled applications at specific source milestones. However, that's too resource-intensive and sometimes impossible for applications whose source isn't readily available. An alternative is to persistently store binaries that contain debug information before they're stripped.

Currently, GWP stores unstripped binaries in a global repository, which other services use to symbolize stack traces for automated failure reporting. Since the binaries are quite large and many unique binaries exist, symbolization for a single day of profiles would take weeks if run sequentially. To reduce the result latency, we distribute symbolization across a few hundred machines using MapReduce [4].

Profile storage

Over the past years, GWP has amassed several terabytes of historical performance data. GFS archives the entire performance logs and corresponding binaries. To make the data useful and accessible, we load the samples into a read-only dimensional database that is distributed across hundreds of machines. That service is accessible to all users, for ad hoc queries and to systems for automated analyses.


The database supports a subset of SQL-like semantics. Although the dimensional database is well suited to perform queries that aggregate over the large data set, some individual queries can take tens of seconds to complete. Fortunately, most queries are seen frequently, so the profile server uses aggressive caching to hide the database latency.

User interfaces

For most users, GWP deploys a web server to provide a user interface on top of the profile database. This makes it easy to access profile data and construct ad hoc queries for the traditional use of application profiles (with additional freedom to filter, group, and aggregate profiles differently).

Query view. Several visual interfaces retrieve information from the profile database, and all are navigable from a web browser. The primary interface (see Figure 2) displays the result entries, such as functions or executables, that match the desired query parameters. This page supplies links that let users refine the query to more specific data. For example, the user can restrict the query to only report samples for a specific executable collected within a desired time period. Additionally, the user can modify or refine any of the parameters to the current query to create a custom profile view. The GWP homepage has links to display the top results, Google-wide, for each performance metric.

Call graph view. For most server profile samples, the profilers collect full call stacks with each sample. Call stacks are aggregated to produce complete dynamic call graphs for a given profile. Figure 3 shows an example call graph. Each node displays the function name and its percentage of samples, and the nodes are shaded based on this percentage. The call graph is also displayed through the web browser, via a Graphviz plug-in.

Source annotation. The query and call graph views are useful in directing users to specific functions of interest. From there, GWP provides a source annotation view that presents the original source file with a header describing overall profile information about the file and a histogram bar showing the relative hotness of each source line. Because different versions of each file can exist in source repositories and branches, we retrieve a hash signature from the repository for each source file and aggregate samples on files with identical signatures.

Profile data API. In addition to the web server, we offer a data-access API to read profiles directly from the database. It's more suitable for automated analyses that must process a large amount of profile data (such as reliability studies) offline. We store both raw profiles and symbolized profiles in ProtocolBuffer formats (http://code.google.com/apis/protocolbuffers). Advanced users can access and reprocess them using their preferred programming language.

Figure 2. An example query view: an application-level profile (a) and a function-level profile (b).

Application-specific profiling

Although the default sampling rate is high enough to derive top-level profiles with high confidence, GWP might not collect enough samples for applications that consume relatively few cycles Google-wide. Increasing the overall sampling rate to cover those profiles is too expensive because they're usually sparse.

Therefore, we provide an extension to GWP for application-specific profiling on the cloud. The machine pool for application-specific profiling is usually much smaller than GWP's, so we can achieve a high sampling rate on those machines for the specific application. Several application teams at Google use application-specific profiling to continuously monitor their applications running on the fleet.

Application-specific profiling is generic and can target any specific set of machines. For example, we can use it to profile a set of machines deployed with the newest kernel version. We can also limit the profiling duration to a small time period, such as the application's running time. It's useful for batch jobs running on data centers, such as MapReduce, because it facilitates collecting, aggregating, and exploring profiles collected from hundreds or thousands of their workers.

Reliability analysis

To conduct continuous profiling on datacenter machines serving real traffic, extremely low overhead is paramount, so we sample in both the time and machine dimensions. Sampling introduces variation, so we must measure and understand how sampling affects the profiles' quality. But the nature of datacenter workloads makes this difficult; their behavior is continually changing. There's no direct way to measure the datacenter applications' profiles' representativeness. Instead, we use two indirect methods to evaluate their soundness.

First, we study the stability of the aggregated profiles themselves using several different metrics. Second, we correlate profiles with performance data from other sources to cross-validate both.

Stability of profiles

We use a single metric, entropy, to measure a given profile's variation. In short, entropy is a measure of the uncertainty associated with a random variable, which in this case is profile samples. The entropy H of a profile is defined as

H(W) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)

where n is the total number of entries in the profile and p(x_i) is the fraction of profile samples on the entry x_i [5].

In general, a high entropy implies a flat profile with many samples. A low entropy usually results from a small number of samples or from most samples being concentrated in a few entries. We're not concerned with entropy itself. Because entropy is like a signature of the profile, measuring its inherent variation, it should be stable between representative profiles.

Figure 3. An example dynamic call graph. Function names are intentionally blurred.

Entropy doesn't account for differences between entry names. For example, a function profile with x percent on foo and y percent on bar has the same entropy as a profile with y percent on foo and x percent on bar. So, when we need to identify changes on the same entries between profiles, we calculate the Manhattan distance of two profiles by adding the absolute percentage differences between the top k entries, defined as

M(X, Y) = \sum_{i=1}^{k} | p_x(x_i) - p_y(x_i) |

where X and Y are two profiles, k is the number of top entries to count, and p_y(x_i) is 0 when x_i is not in Y. Essentially, the Manhattan distance is a simplified version of the relative entropy between two profiles.

Profiles' entropy. First, we compare the entropies of application-level profiles, where samples are broken down by individual applications. Figure 2a shows an example of such an application-level profile.

Figure 4 shows daily application-level profiles' entropies for a series of dates, together with the total number of profile samples collected for each date. Unless specified, we used CPU cycles in the study, and our conclusions also apply to the other types. As the graph shows, the entropy of daily application-level profiles is stable between dates, and it usually falls into a small interval. The correlation between the number of samples and the profile's entropy is loose. Once the number of samples reaches some threshold, more samples don't necessarily lead to a lower entropy, partly because GWP sometimes samples more machines than necessary for daily application-level profiles. This is because users frequently must drill down to specific profiles with additional filters on certain tags, such as application names, which select a small fraction of all profiles collected.

We can conduct a similar analysis on an application's function-level profile (for example, the application in Figure 2b). The result, shown in Figure 5a, is from an application whose workload is fairly stable when aggregating from many clients. Its entropy is actually more stable. It's interesting to analyze how entropy changes among machines for an application's function-level profiles. Unlike the aggregated profiles across machines, an application's per-machine profiles can vary greatly in terms of number of samples.

Figure 4. The number of samples and the entropy of daily application-level profiles. The primary y-axis (bars) is the total number of profile samples. The secondary y-axis (line) is the entropy of the daily application-level profile.

Figure 5. Function-level profiles. The number of samples and the entropy for a single application (a). The correlation between the number of samples and the entropy for all per-machine profiles (b).

Figure 5b plots the relationship between the number of samples per machine and the entropy of function-level profiles. As expected, when the total number of samples is small, the profile's entropy is also small (limited by the maximum possible uncertainty). But once the threshold is reached at the maximum number of samples, the entropy becomes stable. We can observe two clusters in the graph; some entropies are concentrated between 5.5 and 6, and the others fall between 4.5 and 5. The application's two behavioral states can explain the two clusters. We've seen various clustering patterns in different applications.

The Manhattan distance between profiles. We use the Manhattan distance to study the variation between profiles considering changes in entry names, where a smaller distance implies less variation. Figure 6a illustrates the Manhattan distance between the daily application-level profiles for a series of dates. The results from the Manhattan distances for both application-level and function-level profiles are similar to the results with entropy.

In Figure 6b, we plot the Manhattan distance for several profile types, leading to two observations:

- In general, memory and thread profiles have smaller distances, and their variations appear less correlated with the other profiles.
- Server CPU time profiles correlate with HPM profiles of cycles and instructions in terms of variations, which could imply that those variations resulted naturally from external causes, such as workload changes.

To further understand the correlation between the Manhattan distance and the number of samples, we randomly pick a subset of machines from a specific machine set and then compute the Manhattan distance of the selected subset's profile against the whole set's profile. We can use a power function's trend line to capture the change in the Manhattan distance over the number of samples. The trend line roughly approximates a square-root relationship between the distance and the number of samples,

M(X) = C / \sqrt{N(X)}

where N(X) is the total number of samples in a profile and C is a constant that depends on the profile type.

Derived metrics. We can also indirectly evaluate the profiles' stability by computing some derived metrics from multiple profiles. For example, we can derive CPI from HPM profiles containing cycles and retired instructions. Figure 7 shows that the derived CPI is stable across dates. Not surprisingly, the daily aggregated profiles' CPI falls into a small interval, between 1.7 and 1.8, for those days.

Comparing with other sources

Beyond measuring profiles' stability across dates, we also cross-validate the profiles with performance and utilization data from other Google sources. One example is the utilization data that the data center's monitoring system collects. Unlike GWP, the monitoring system collects data from all machines in the data center but at a coarser granularity, such as overall CPU utilization. Its CPU utilization data, in terms of core-seconds, matches the measurement from GWP's CPU cycles profile with the following formula:

CoreSeconds = Cycles * SamplingRate_machine * SamplingPeriod / CPUFrequency_average
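Transcribed directly into code, the conversion is a one-liner; the paper doesn't spell out how the two sampling factors are normalized, so treat this as a sketch of the unit arithmetic only, with parameter names of our choosing:

```python
def core_seconds(cycle_samples: float, sampling_rate_machine: float,
                 sampling_period: float, cpu_frequency_avg_hz: float) -> float:
    """CoreSeconds = Cycles * SamplingRate_machine * SamplingPeriod
                     / CPUFrequency_average.

    A literal transcription of the formula above; the exact units of the two
    sampling factors are an assumption, not specified by the paper.
    """
    return (cycle_samples * sampling_rate_machine * sampling_period
            / cpu_frequency_avg_hz)
```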

Profile uses

In each profile, GWP records the samples of interesting events and a vector of associated information. GWP collects roughly a dozen events, such as CPU cycles, retired instructions, L1 and L2 cache misses, branch mispredictions, heap memory allocations, and lock contention time. The sample definition varies depending on the event type: it can be CPU cycles or cache misses, bytes allocated, or the sampled thread's locking time. Note that the sample must be numeric and capable of aggregation. The associated vector contains information such as application name, function name, platform, compiler version, image name, data center, kernel information, build revision, and builder's name. Assuming that the vector contains m elements, we can represent a record GWP collected as a tuple <event, sample counter, m-dimensional vector>.


Figure 6. The Manhattan distance between daily application-level profiles for various profile types (a). The correlation between the number of samples and the Manhattan distance of profiles (b).


When aggregating, GWP lets users choose k keys from the m dimensions and groups the samples by those keys. Basically, it filters the samples by imposing one or more restrictions on the remaining (m - k) dimensions and then projects the samples onto the k key dimensions. GWP finally displays the sorted results to users, delivering answers to various performance queries with high confidence. Although not every query makes sense in practice, even a small subset of them is demonstrably informative in identifying performance issues and providing insights into computing resources in the cloud.
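To make the filter-and-project step concrete, here is a minimal sketch; the record layout (tags stored as a dict) and the field names are assumptions for illustration, not GWP's storage format:

```python
from collections import defaultdict

# A record: (event, sample_count, tags), where tags plays the role of the
# m-dimensional vector (application, function, platform, ...) - assumed layout.
records = [
    ("cycles", 120, {"app": "websearch", "function": "Compress", "platform": "P1"}),
    ("cycles",  80, {"app": "websearch", "function": "Parse",    "platform": "P2"}),
    ("cycles",  50, {"app": "ads",       "function": "Compress", "platform": "P1"}),
]

def aggregate(records, event, keys, **filters):
    """Restrict on any tag dimensions, then project samples onto the chosen keys."""
    grouped = defaultdict(int)
    for ev, count, tags in records:
        if ev != event:
            continue
        if any(tags.get(dim) != want for dim, want in filters.items()):
            continue
        grouped[tuple(tags[k] for k in keys)] += count
    return sorted(grouped.items(), key=lambda kv: kv[1], reverse=True)

# Hottest functions on platform P1, across all applications:
print(aggregate(records, "cycles", keys=("function",), platform="P1"))
# -> [(('Compress',), 170)]
```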

Cloud applications' performance

GWP profiles provide performance insights for cloud applications. Users can see how cloud applications are actually consuming machine resources and how the picture evolves over time. For example, Figure 2 shows cycles distributed over the top executables and functions, which is useful for many aspects of designing, building, maintaining, and operating data centers. Infrastructure teams can see the big picture of how their software stacks are being used, aggregated across every application. This helps identify performance-critical components, create representative benchmark suites, and prioritize performance efforts.

At the same time, an application team can use GWP as the first stop for the application's profiles. As an always-on profiling service, GWP collects a representative sample of the application's running instances over time. Application developers are often surprised by their application's profiles when browsing GWP results. For example, Google's speech-recognition team quickly optimized a previously unknown hot function that they couldn't have easily located without the aggregated GWP results. Application teams also use GWP profiles to design, evaluate, and calibrate their load tests.

Finding the hottest shared code. Shared code is remarkably abundant. Profiling each program independently might not identify hot shared code if it's not hot in any single application, but GWP can identify routines that don't account for a significant portion of any single application yet consume the most cycles overall.

Figure 7. The correlation between the number of samples and derived cycles per instruction (CPI).

For example, GWP profiles revealed that the zlib library (www.zlib.net) accounted for nearly 5 percent of all CPU cycles consumed. That motivated an effort to optimize zlib routines and to evaluate compression alternatives. In fact, some users have used GWP numbers to calculate the estimated savings for performance-tuning efforts on shared functions. Not surprisingly, given the Google fleet's scale, a single-percent improvement on a core routine could potentially save significant money per year. As a result, an informal metric, "dollar amount per performance change," has become popular among Google engineers. We're considering providing a new metric, "dollars per source line," in annotated source views.

At the same time, some users have used GWP profiles as a source of coverage information, to assess the feasibility of deprecation and to uncover users of library functions that are to be deprecated. Due to the profiles' dynamic nature, this approach might miss less common clients, but the biggest, most important callers are easy to find.

Evaluating hardware features. The low-level information GWP provides about how CPU cycles (and other machine resources) are spent is also used for early evaluation of new hardware features that datacenter operators might want to introduce. One interesting example has been to evaluate whether it would be beneficial to use a special coprocessor to accelerate floating-point computation, by looking at its percentage of all Google's computations. As another example, GWP profiles can identify the applications running on old hardware configurations and evaluate whether those configurations should be retired for efficiency.

Optimizing for application affinities

Some applications run better on a particular hardware platform due to sensitivity to architectural details, such as processor microarchitecture or cache size. It's generally very hard or impossible to predict which application will fare best on which platform. Instead, we measure an efficiency metric, CPI, for each application and platform combination. We can then improve job scheduling so that applications are scheduled on platforms where they do best, subject to availability. The example in Table 1 shows how the total number of cycles needed to run a fixed number of instructions on a fixed machine capacity drops from 500 to 400 using preferential scheduling. Specifically, although the application NumCrunch runs just as well on Platform1 as on Platform2, application MemBench does poorly on Platform2 because of the smaller L2 cache. Thus, the scheduler should give MemBench preference for Platform1.

The overall optimization process has several steps. First, we derive cycle and instruction samples from GWP profiles. Then, we compute an improved assignment table by moving instructions away from application-platform combinations with the worst relative efficiency. We use cycle and instruction samples over a fixed period of time, aggregated per job and platform. We then compute CPI and normalize it by clock rate. We can formulate finding the optimal assignment as a linear-programming problem. The one unknown is Load_ij, the number of instruction samples of application j on platform i.

The constants are:

- CPI_ij, the measured CPI of application j on platform i;
- TotalLoad_j, the total measured number of instruction samples of application j; and
- Capacity_i, the total capacity for platform i, measured as the total number of cycle samples for platform i.


Table 1. Platform affinity example.

Random assignment of instructions (CPI in parentheses):

               Platform 1    Platform 2
NumCrunch      100 (1)       100 (1)
MemBench       100 (1)       100 (2)
Total cycles   200           300

Optimal assignment of instructions:

               Platform 1    Platform 2
NumCrunch      0 (1)         200 (1)
MemBench       200 (1)       0 (2)
Total cycles   200           200


The linear program is:

Minimize \sum_{i,j} CPI_{ij} * Load_{ij}

subject to

\sum_{i} Load_{ij} = TotalLoad_{j} for each application j, and

\sum_{j} CPI_{ij} * Load_{ij} <= Capacity_{i} for each platform i.

We use a simulated annealing solver that approximates the optimal solution in seconds for workloads of around 100 jobs running on thousands of machines of four different platforms over one month. Although application developers already mapped major applications to their best platform through manual assignment, we've measured 10 to 15 percent potential improvement in most cases where many jobs run on multiple platforms. Similarly, users can use GWP data to identify how to colocate multiple applications on a single machine to achieve the best throughput [6].
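The paper's solver is simulated annealing; purely as an illustration, the same small instance from Table 1 can be handed to an off-the-shelf LP solver. This sketch uses SciPy's linprog, and all the numbers are the toy data from Table 1, not fleet measurements:

```python
import numpy as np
from scipy.optimize import linprog

# Toy data mirroring Table 1: rows are applications j, columns are platforms i.
cpi = np.array([[1.0, 1.0],    # NumCrunch on Platform1, Platform2
                [1.0, 2.0]])   # MemBench  on Platform1, Platform2
total_load = np.array([200.0, 200.0])  # instruction samples per application
capacity = np.array([200.0, 200.0])    # cycle samples available per platform

n_apps, n_plats = cpi.shape
# Variables: Load[j][i], flattened row-major; minimize sum CPI_ij * Load_ij.
c = cpi.flatten()

# Equality: each application's load is fully assigned across platforms.
A_eq = np.zeros((n_apps, n_apps * n_plats))
for j in range(n_apps):
    A_eq[j, j * n_plats:(j + 1) * n_plats] = 1.0

# Inequality: cycles consumed on each platform must fit its capacity.
A_ub = np.zeros((n_plats, n_apps * n_plats))
for i in range(n_plats):
    for j in range(n_apps):
        A_ub[i, j * n_plats + i] = cpi[j, i]

res = linprog(c, A_ub=A_ub, b_ub=capacity, A_eq=A_eq, b_eq=total_load,
              bounds=(0, None))
print(res.x.reshape(n_apps, n_plats))  # NumCrunch -> Platform2, MemBench -> Platform1
print(res.fun)                         # 400 total cycles, matching Table 1
```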

Datacenter performance monitoring

GWP users can also use GWP queries with computing-resource-related keys, such as data center, platform, compiler, or builder, for auditing purposes. For example, when rolling out a new compiler, users inspect the versions that running applications are actually compiled with. Users can easily measure such transitions from time-based profiles. Similarly, a user can measure how soon a new hardware platform becomes active or how quickly old ones are retired. This also applies to new versions of applications being rolled out. Grouping by data center, GWP displays how the applications, CPU cycles, and machine types are distributed in different locations.

Beyond providing profile snapshots, GWP can monitor the changes between chosen profiles from two queries. The two profiles should be similar in that they must have identical keys and events, but different in one or more other dimensions. For applications, GWP usually focuses on monitoring function profiles. This is useful for two reasons:

- performance optimization normally starts at the function level, and
- function samples can reconstruct the entire call graph, showing users how the CPU cycles (and other events) are distributed in the program.

A dramatic change to hot functions in the daily profiles could trigger a finer-grained comparison, which eventually assigns blame to a source-code revision, compiler, or data center.

Feedback-directed optimization

Sampled profiles can also be used for feedback-directed compiler optimization (FDO), outlined in work by Chen et al. [7]. GWP collects such sampled profiles and offers a mechanism to extract profiles for a particular binary in a format that the compiler understands. This profile will be of higher quality than any profile derived from test inputs because it was derived from running on live data. Furthermore, as long as developers make no changes in critical program sections, we can use an aged profile to improve a freshly released binary's performance. Many Web companies have release cycles of two weeks or less, so this approach works well in practice.

Similar to load-test calibration, users also use GWP for quality assurance of profiles generated from static benchmarks. Benchmarks can represent some codes well, but not others, and users use GWP to identify which codes are ill-suited for FDO compilation.

Finally, users use GWP to estimate the quality of HPM events and microarchitectural features, such as cache latencies of various incarnations of processors with the same instruction set architecture (ISA). For example, if the same application runs on two platforms, we can compare the HPM counters and identify the relevant differences.

Besides adding more types of performance events to collect, we're now exploring more directions for using GWP profiles. These include not only user-interface enhancements but also advanced data-mining techniques to detect interesting patterns in profiles. It's also interesting to mash up GWP profiles with performance data from other sources to address complex performance problems in datacenter applications.


References

1. J.M. Anderson et al., "Continuous Profiling: Where Have All the Cycles Gone?" Proc. 16th ACM Symp. Operating Systems Principles (SOSP 97), ACM Press, 1997, pp. 357-390.
2. J. Dongarra et al., "Using PAPI for Hardware Performance Monitoring on Linux Systems," Conf. Linux Clusters: The HPC Revolution, 2001.
3. S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," Proc. 19th ACM Symp. Operating Systems Principles (SOSP 03), ACM Press, 2003, pp. 29-43.
4. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. 6th Conf. Symp. Operating System Design and Implementation (OSDI 04), Usenix Assoc., 2004, pp. 137-150.
5. S. Savari and C. Young, "Comparing and Combining Profiles," Proc. Workshop on Feedback-Directed Optimization (FDO 02), ACM Press, 2000, pp. 50-62.
6. J. Mars and R. Hundt, "Scenario Based Optimization: A Framework for Statically Enabling Online Optimizations," Proc. 2009 Int'l Symp. Code Generation and Optimization (CGO 09), IEEE CS Press, 2009, pp. 169-179.
7. D. Chen, N. Vachharajani, and R. Hundt, "Taming Hardware Event Samples for FDO Compilation," Proc. 8th Ann. IEEE/ACM Int'l Symp. Code Generation and Optimization (CGO 10), ACM Press, 2010, pp. 42-52.

Gang Ren is a senior software engineer at Google, where he's working on datacenter application performance analysis and optimization. His research interests include application performance tuning in data centers and building tools for datacenter-scale performance analysis and optimization. Ren has a PhD in computer science from the University of Illinois at Urbana-Champaign.

Eric Tune is a senior software engineer at Google, where he's working on a system that allocates computational resources and provides isolation between jobs that share machines. Tune has a PhD in computer engineering from the University of California, San Diego.

Tipp Moseley is a software engineer at Google, where he's working on datacenter-scale performance analysis. His research interests include program analysis, profiling, and optimization. Moseley has a PhD in computer science from the University of Colorado, Boulder. He is a member of ACM.

Yixin Shi is a software engineer at Google, where he's working on performance analysis tools and large-volume imagery data processing. His research interests include architectural support for securing program execution, cache design for wide-issue processors, computer architecture simulators, and large-scale data processing. Shi has a PhD in computer engineering from the University of Illinois at Chicago. He is a member of ACM.

Silvius Rus is a senior software engineer at Google, where he's working on datacenter application performance optimization through compiler and library transformations based on memory allocation and reference analysis. Rus has a PhD in computer science from Texas A&M University. He is a member of ACM.

Robert Hundt is a tech lead at Google, where he's working on compiler optimization and datacenter performance. Hundt has a Diplom Univ. in computer science from the Technical University in Munich. He is a member of IEEE and SIGPLAN.

Direct questions and comments about this article to Gang Ren, Google, 1600 Amphitheatre Pkwy., Mountain View, CA 94043; [email protected].
