
Expose Web performance problems with the RRDtool

Measure and display data easily to pinpoint the source of problems

Skill Level: Intermediate

Sean Walberg ([email protected]), Senior Network Engineer, P.Eng

21 Mar 2006

Examine how to determine the root cause of Web performance problems. Without proper measurement, how do you know whether your Web application is performing well? By using open source tools such as the RRDtool, you can graph the key performance measurements of any Web application, use these graphs to determine the impact of changes in the environment, or point to changes that need addressing.

Section 1. Before you start

Learn what to expect from this tutorial and how to get the most out of it.

About this tutorial

The purpose of this tutorial is twofold. First, it demonstrates how to use the RRDtool to collect and display data. Second, it shows how to measure the performance of a Web-based application. These two concepts are separate, but learning them together helps you understand them individually.

Objectives

In this tutorial, you learn how to store data in Round Robin Databases (RRDs) and how to display the data in the form of graphs. Additionally, you learn how to measure the performance of Web-based applications and how to spot the source of performance problems.

© Copyright IBM Corporation 1994, 2007. All rights reserved.

http://www.ibm.com/legal/copytrade.shtml

Prerequisites

This tutorial is written for users who have a basic understanding of the UNIX® command line, some basic shell scripting knowledge, and a basic understanding of statistics (averages, minimum, maximum).

System requirements

To follow this tutorial, you need a computer running UNIX and a Web server. (The two can reside on the same computer.) In addition, you must have the following tools installed:

• RRDtool 1.2.12 (http://rrdtool.org)

• cURL 7.15.1 (http://curl.haxx.se/download.html)

Both of these programs follow the standard compilation and installation procedure:

1. From the UNIX command line, run tar -xzf filename.tar.gz.

2. The previous command creates a directory as part of the extraction process. Switch to this directory with cd dirname.

3. Run ./configure to create the build instructions.

4. Run the make command to compile the source.

5. Run the make install command to install the software.

You might also find binaries for these programs provided by your vendor or third parties. You don't need to have the absolute latest versions of the software for this tutorial.

Section 2. Explaining Web performance

Measurement -- and the appropriate display of those measurements -- is crucial to the operation of any system. Only with objective measurements can you quantify the impact of an improvement or discover the nature of a reported performance problem. In the case of Web applications, several factors dictate how quickly a user's request is fulfilled, and you must measure them carefully to avoid tainting the results.



The anatomy of a request

When measuring anything, it is helpful to understand what is really happening, which leads to determining what must be measured. In the case of a Web request, the following process occurs:

1. The user types a Uniform Resource Locator (URL) into a browser (client), clicks a link, or submits a form.

2. The client queries the global Domain Name System (DNS) to map a server name to an Internet Protocol (IP) address. The mapping is cached for later requests, but it might expire at some time during the user's session and have to be obtained again.

3. The client makes a Transmission Control Protocol (TCP) connection to the remote server and completes the three-way handshake.

4. The client sends commands to the remote server to request a particular Web page, including the page requested and any form data.

5. The server processes this information and prepares a response. This response could return the contents of a file, or run a program that makes database calls or connections to other systems.

6. The server sends the data to the client over the Internet.

Determine what to measure

Because the purpose of measuring Web performance is to determine how responsive the application is to a client's requests, it stands to reason that the measurements should be taken from the perspective of a client. The easiest way to do this is to simulate a client's requests and measure the results. The alternative is to passively watch client requests and take measurements that way. While this latter method does show the real client experience, differences in the clients or their network connections would make the measurements less objective.

From the perspective of the clients, the following measurements are useful:

1. The time it takes to resolve the DNS name of the server.

2. The time it takes to connect to the server.

3. The time it takes to send the request.

4. The time it takes to receive the first byte from the server.

5. The time it takes to receive the last byte from the server.




As you'll see later, each one of these metrics is influenced by different things. For instance, the time difference between items four and five is more a function of network speed, whereas the difference between items three and four is often the Web application or back-end database.

Take measurements

You can use the cURL tool for UNIX to retrieve Web pages from scripts or the command line. cURL has a feature that can report on the metrics discussed above in a format that a script can easily read. The cURL tool can also take form data on the command line, which makes testing dynamic sites easier to manage.

Using cURL to produce the needed statistics is simple, as shown in Listing 1.

Listing 1. Use cURL to show the responsiveness of the developerWorks site

$ curl -m 60 -w %{time_total}:%{time_namelookup}:\
%{time_connect}:%{time_pretransfer}:%{time_starttransfer} \
-o /dev/null -s http://www-128.ibm.com/developerworks
1.067:0.388:0.447:0.447:0.691$

In this example, a long command was broken into shorter lines by ending each line with a backslash. The various numbers are separated by colons and do not terminate with a new line -- hence, the final dollar sign ($), which is the shell prompt.

The command-line options to cURL change the behavior to suit the needs of this application. The -m 60 option limits the execution of the command to 60 seconds, which prevents the script from hanging. The -w %{time_total}:... option specifies the message printed after the page is downloaded. Items of the form %{keyword} are expanded to show the metrics above. Together, -o /dev/null -s ensures that no output is sent to the terminal, leaving only the data specified by the -w option. Finally, a list of URLs appears at the end of the command.

The output of Listing 1 shows that the total request took 1.067 seconds. All the other numbers are relative to the start of the request. That is, the DNS name resolution took 0.388 seconds, the connection took 0.447-0.388 = 0.059 seconds, and the first byte arrived 0.691-0.447 = 0.244 seconds after the request was sent. The connect and pretransfer values are the same, because the time it took to send the headers wasn't measurable. This behavior is common for GET requests, because the headers are so small.
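Because every value except the total is cumulative, a script usually wants the per-phase durations instead. The following is a minimal sketch of that subtraction, assuming the field order used in Listing 1. The function name phase_times and the labels send, wait, and transfer are illustrative choices, not part of cURL.

```shell
# Split curl's cumulative timings into per-phase durations.
# Input order (as in Listing 1): total:dns:connect:pretransfer:starttransfer
phase_times () {
    echo "$1" | awk -F: '{
        printf "dns=%.3f connect=%.3f send=%.3f wait=%.3f transfer=%.3f\n",
            $2, $3 - $2, $4 - $3, $5 - $4, $1 - $5
    }'
}

# The sample output from Listing 1
phase_times "1.067:0.388:0.447:0.447:0.691"
```

For the sample above, the connect and wait values match the 0.059 and 0.244 seconds worked out by hand.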

Section 3. RRDtool

To keep track of the data and to produce graphs, use a package called the RRDtool. RRDtool grew out of the Multi Router Traffic Grapher (MRTG) package, which is a popular program for graphing the traffic patterns of a network.

How data is stored

RRDtool uses RRDs that store time-series data in a so-called round-robin format. That is, older data is averaged and finally moved out as newer data is inserted. The theory behind this behavior is that granular data (such as data taken at five-minute intervals) is helpful when looking at recent events, but when looking at longer history (such as data taken over six months), taking two-hour averages is more than adequate. Another advantage of RRDs is that old data is continually being removed, so the size of the RRD doesn't change. Each measurement in the RRD has a timestamp associated with it. You can't remove old items yourself, and you can't enter items out of sequence.

You need to use the rrdtool create filename -s seconds command to create these RRDs. When creating the RRD, you must describe the data to store in the database, which involves data sources and archives. You must also provide the name of the output file and the number of seconds between successive measurements. This information is used later to calculate the consolidation of old data as well as to know when data is missing.

Data sources

Each RRD consists of one or more data sources that describe a particular type of data to be stored. For example, DNS resolution time is a good data source. The data source should not be specific to the particular instance, which means that DNS resolution for IBM.com is a poor choice for a data source. Each RRD file describes the instance, so each site being tested has its own RRD. It is difficult to add new data sources after the fact, so ask yourself, "If I add or remove data sources from my list of devices or sites being measured, would I have to adjust my RRD?" If the answer to that is "Yes," chances are your data source is instance-specific and should be abstracted. New items being measured should be in different RRDs -- not new data sources in the same RRD.

The first decision you make when defining the data source is the name, which can be 1 to 19 characters long, alphanumeric, and underscores only. Again, the name should describe the measurement itself, not the instance.

Perhaps the most important decision is the type of data source. Some preprocessing might be done on the values each time you add a measurement to the RRD. Take, for instance, a counter on a router that measures the number of bytes transferred. A value of 600 has little meaning by itself without knowing that five minutes ago the value might have been 300. With that in mind, a more useful piece of data to store is the rate of change, which is (600-300)/(5*60) = 1 byte per second. The COUNTER type does just this and is used for data sources that continually count upward.
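The rate calculation above can be sketched in a few lines of shell. This is only an illustration of the arithmetic, not how RRDtool implements COUNTER. The function name counter_rate is hypothetical, and the sketch ignores counter wraparound.

```shell
# Rate of change between two readings of an ever-increasing counter.
# Arguments: previous value, current value, seconds between the readings.
# Assumption: the counter did not wrap or reset between the two readings.
counter_rate () {
    awk -v prev="$1" -v cur="$2" -v secs="$3" \
        'BEGIN { printf "%.2f\n", (cur - prev) / secs }'
}

# 300 bytes five minutes ago, 600 bytes now
counter_rate 300 600 300
```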

Another data source type is GAUGE, which you use when you just want to store the value and not the rate. Examples of this type include the current temperature or, in the case of this tutorial, the amount of time a particular action took. Several other types of data sources are available, but GAUGEs and COUNTERs handle most of the cases.

The syntax for defining the data source is DS:name:type:heartbeat:min:max. The heartbeat defines how many seconds can separate two measurements before the intermediate data is filled with a special value of UNKNOWN. For most uses, two times the measurement interval works well. You use min and max for bounds checking of input data. If data is outside this interval, it is stored as UNKNOWN. You can specify U for either of these two, if you don't need a limit.
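As a sketch of how these rules combine, the helper below builds a DS definition for a GAUGE that cannot be negative. The function name gauge_ds is hypothetical, and the choices of a minimum of 0, no maximum, and a heartbeat of twice the interval are assumptions that suit timing data.

```shell
# Build a DS definition for a GAUGE that cannot be negative.
# Arguments: data source name, measurement interval in seconds.
gauge_ds () {
    # Names are limited to 1-19 letters, digits, and underscores.
    case "$1" in
        ""|*[!a-zA-Z0-9_]*) echo "invalid name: $1" >&2; return 1 ;;
    esac
    [ "${#1}" -le 19 ] || { echo "name too long: $1" >&2; return 1; }
    # Heartbeat of twice the interval, minimum 0, no maximum (U).
    echo "DS:$1:GAUGE:$(( $2 * 2 )):0:U"
}

gauge_ds dns 300
```

With a 300-second interval, this prints a definition with a 600-second heartbeat.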

Round Robin Archives

Earlier, I mentioned that RRDs average time-series data. For example, viewing a month's worth of data shows two-hour averages. The Round Robin Archives (RRAs) define how the various data sources are averaged and stored. Multiple RRAs can be defined per RRD, and they are not data source specific. That is, if you define an RRA to store a month of two-hour averages, one RRA is created for every data source. When displaying graphs, RRDtool figures out which RRAs must be consulted based on the time frame provided. Again, you must make some decisions when building the RRAs:

• When aggregating data, do you want to store the average, minimum, maximum, or last value? Using separate RRAs, you can do more than one of these consolidation functions (CFs).

• How many UNKNOWN values are you willing to tolerate within the series before you declare the aggregate to be UNKNOWN? This value is expressed as a ratio of UNKNOWN values to total values. In rough terms, 0.1 means that one measurement per hour can be missing. This value is called the XFiles Factor, or xff for short.

• How much do you want to consolidate your data? This consolidation is expressed as a multiple of the measurement interval. With a five-minute interval, 1 stores the raw data, 12 stores hourly data, and so forth. This value is called the step size.

• How many steps do you want to keep (rows)? This value is expressed as the number of steps defined in the previous item. That is, if the step size is 12 (hourly data), 24 rows means one day of history.

While it's possible -- and certainly tempting -- to store a year's worth of data in five-minute intervals, it defeats the purpose of RRDs and provides far more information than most people need. In the example of Web performance, one would likely be interested in keeping granular data for a day, and then averaging it over an hour for the course of a week, progressively averaging the data as time goes on. In addition to averages, the best and worst measurement can be kept to give more meaning to the averages.

The syntax for defining an RRA is RRA:cf:xff:steps:rows, where cf is the consolidation function (AVERAGE, MIN, MAX, or LAST), and xff, steps, and rows are as described above.
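The amount of history an RRA holds is the measurement interval multiplied by the step size and the number of rows. The sketch below checks a few step and row combinations for a 300-second interval. The function name rra_span and the sample values are illustrative assumptions.

```shell
# History covered by one RRA, in seconds: interval x step size x rows.
rra_span () {
    echo $(( $1 * $2 * $3 ))
}

# A few step:rows combinations for a 300-second interval, reported in days.
for rra in 1:288 6:336 24:372 144:730; do
    steps=${rra%%:*}
    rows=${rra##*:}
    echo "steps=$steps rows=$rows days=$(( $(rra_span 300 "$steps" "$rows") / 86400 ))"
done
```

Working backward from the history you want to a row count this way helps avoid archives that are accidentally too short.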




All this might seem confusing, so Figure 1 illustrates the flow of data through the RRD.

Figure 1. Data flow through an RRD

The top row of boxes represents the data within the data source. Data points are shifted to the right to make room for the incoming data point. Enough data points are kept to satisfy the needs of all the archives in the RRD. The RRA data points follow a similar scheme, except that the leftmost entry is calculated by applying the chosen CF to the data points within the data source. The number of data points -- and the frequency with which data is consolidated from the primary data points to the archive -- is defined by the step size. The number of boxes within the RRA corresponds to the rows parameter. As items in either the RRA or the data source are shifted off the right side, they are removed forever.

Populate the RRD

After creating the RRD, populating the data is quite simple by comparison. The command rrdtool update filename timestamp:datapoints adds data points to the RRD. The timestamp parameter can be a standard UNIX time specified in seconds since the epoch, or the value N, which means the current time. The data points are a colon-separated list of data, one point for each defined data source in the order you defined them. Optionally, use the -t option to specify a colon-separated list of data source names that map to the data points you're adding.
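Because the number of data points must match the number of data sources, it can be worth checking a sample before handing it to rrdtool update. The following is a minimal sketch that assumes five data sources. The function name valid_sample and the file name site.rrd are hypothetical, and the update command is only printed, not run.

```shell
# Succeed only if the sample has exactly five colon-separated,
# non-negative decimal numbers (one per data source).
valid_sample () {
    echo "$1" | awk -F: '
        NF != 5 { exit 1 }
        { for (i = 1; i <= NF; i++) if ($i !~ /^[0-9]+(\.[0-9]+)?$/) exit 1 }
    '
}

sample="1.067:0.388:0.447:0.447:0.691"
if valid_sample "$sample"; then
    # Printed rather than executed, so the sketch runs without RRDtool.
    echo "rrdtool update site.rrd N:$sample"
fi
```

This check only catches malformed input. It does not tell you whether the request itself succeeded.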

A complete measurement script

You'll create a script to automate the measurements. Before the measurements can be made, however, you must determine the structure of the RRDs. Table 1 shows the data sources to be collected.

Table 1. Data source definitions for Web performance collection

Name            Type   Heartbeat  Min/Max
Total           GAUGE  600        0/U
DNS             GAUGE  600        0/U
Connect         GAUGE  600        0/U
Pretransfer     GAUGE  600        0/U
Start transfer  GAUGE  600        0/U

Table 2 shows the RRAs.

Table 2. RRA definitions for Web performance collection

Type     XFiles Factor  Step size (time)  Rows (time)
AVERAGE  0.5            1 (5 minutes)     288 (1 day)
LAST     0.5            1 (5 minutes)     288 (1 day)
MIN      0.5            1 (5 minutes)     288 (1 day)
MAX      0.5            1 (5 minutes)     288 (1 day)
AVERAGE  0.5            6 (30 minutes)    336 (1 week)
MIN      0.5            6 (30 minutes)    336 (1 week)
MAX      0.5            6 (30 minutes)    336 (1 week)
AVERAGE  0.5            24 (2 hours)      372 (31 days)
MIN      0.5            24 (2 hours)      372 (31 days)
MAX      0.5            24 (2 hours)      372 (31 days)
AVERAGE  0.5            144 (12 hours)    730 (365 days)
MIN      0.5            144 (12 hours)    730 (365 days)
MAX      0.5            144 (12 hours)    730 (365 days)

Listing 2 shows shell code for taking the measurements.

Listing 2. Measurement code

#!/bin/bash
# Paths... adjust as necessary
BASE="/home/sean"
CONFIG="config"
CURL="/usr/bin/curl"
RRDTOOL="/usr/bin/rrdtool"

# Given a URL, return a file name
makefile () {
    # ensure we have the needed parameters
    [ -z "$1" ] && exit 2
    # Strip out any non alphas, replace with underscores,
    # and add a host name
    FILE=`echo "$1" | sed 's/[^a-zA-Z0-9_]/_/g'`
    FILE="`/bin/hostname`-${FILE}.rrd"
    # Can only return numeric results, so global $FILE
    # has the result
}

# Called to create the RRD
create_rrd () {
    # ensure we have the needed parameters
    [ -z "$1" ] && exit 2

    # Convert the URL to a name and build the RRD
    makefile "$1"
    $RRDTOOL create "$BASE/$FILE" -s 300 \
        DS:total:GAUGE:600:0:U DS:dns:GAUGE:600:0:U DS:connect:GAUGE:600:0:U \
        DS:pretransfer:GAUGE:600:0:U DS:starttransfer:GAUGE:600:0:U \
        RRA:AVERAGE:0.5:1:288 RRA:AVERAGE:0.5:6:336 RRA:AVERAGE:0.5:24:372 \
        RRA:AVERAGE:0.5:144:730 RRA:MIN:0.5:1:288 RRA:MIN:0.5:6:336 \
        RRA:MIN:0.5:24:372 RRA:MIN:0.5:144:730 RRA:MAX:0.5:1:288 \
        RRA:MAX:0.5:6:336 RRA:MAX:0.5:24:372 RRA:MAX:0.5:144:730 \
        RRA:LAST:0.5:1:288
}

# Loop through config file
cat "$BASE/$CONFIG" | while read LINE; do
    # For later expansion, the URL is the last field
    URL=`echo $LINE | awk '{print $NF}'`

    # If the RRD doesn't exist, create it
    makefile "$URL"
    if [ ! -f "$BASE/$FILE" ]; then
        create_rrd "$URL"
    fi
    # take the measurement and dump to RRD with current timestamp
    # ($LINE is left unquoted so any extra fields reach curl as options)
    OUT=`$CURL -m 60 -w %{time_total}:%{time_namelookup}:%{time_connect}:\
%{time_pretransfer}:%{time_starttransfer} -o /dev/null -s $LINE`
    $RRDTOOL update \
        "$BASE/$FILE" -t "total:dns:connect:pretransfer:starttransfer" N:$OUT
done

When the config file is populated with the URLs to be monitored, you can run this script from cron every five minutes to start capturing data.

Section 4. RRDtool graph

Now that the RRDs are filling with data, it's time to look at producing some graphs. The rrdtool graph outputfile command generates graphs based on input RRDs and the arguments you pass on the command line.

Specify data sources

First, reference the RRD files to be graphed and which data sources are used. You do this with the DEF keyword. When defining the data to be used, the name of the RRD, the data source within that RRD, and CF (AVERAGE, MIN, MAX, or LAST) are required. The format of the definition is DEF:defname=rrdfile:datasource:ConsolidationFunction. This command pulls out a particular set of data from the RRD and assigns it to the name you specify with the defname parameter. This name can be the same as the name of the data source, but as you'll see later, you can use multiple files, which could lead to a naming collision. For a single Web site, Listing 3 shows a simple command that imports the average total time for a request.

Listing 3. An RRDtool graph command that has one DEF

rrdtool graph sample1.png \
    DEF:total=bob.ertw.com-http___ertw_com.rrd:total:AVERAGE




Running the command returns 0x0 to the terminal. The output of the graph command is the size of the image produced. While the command defined the data, it didn't say what to do with it. As a result, no graph is made.

Draw lines

You draw lines with the LINE command, which takes a line width, a data source, optionally a color for the line, and a label to be printed at the bottom of the graph. Listing 4 augments the previous example by graphing the total request time.

Listing 4. A graph with a single data source

rrdtool graph sample2.png \
    DEF:total=bob.ertw.com-http___ertw_com.rrd:total:AVERAGE \
    LINE1:total#FF0000:"Total Request time"

Figure 2 shows the graph that this command produced.

Figure 2. A simple line graph

The line width is typically 1, 2, or 3. These numbers represent the thickness of the line in pixels. The color is a six-digit hex value consisting of two digits each for the levels of red, green, and blue. Note the quoting of the label, which prevents everything after the first space from being interpreted as a separate command.

Similar to the LINE command is the AREA command. This command creates a solid shape rather than just a line. After the first AREA has been drawn, the STACK command draws the next data series on top of the data defined by the AREA. For example, if the AREA at a particular point in time had a value of 4, a solid block with a height of 4 along the Y axis would be drawn. Then, if a STACK were drawn after that with a value of 2 at the same point in time as before, a solid block from 4 to 6 along the Y axis would be drawn.

Graph options and sizes

The previous example doesn't really tell the viewer anything other than the "Total Request Time" at a given point was around 2.5. But 2.5 what? Minutes, seconds, hours? And what's being measured? This is where titles and legends come into play. The rrdtool graph command can accept several command-line options that alter the output. Some of the more popular options are:

• -t: Gives the graph a title at the top.

• -v: Specifies the name of the Y axis, which is displayed vertically on the left side of the graph.

• -w and -h: Specify the width and height in pixels of the graph part itself, respectively. (The resulting image will be larger because of printing the labels.)

• -u, -l, and -r: By default, RRDtool attempts to determine how to scale the graph. The -u and -l options specify the upper and lower boundaries (in units of whatever you're measuring) to use by default. If the data exceeds these values, RRDtool ignores your input unless you specify the -r option.

Listing 5 augments Listing 4 by specifying titles and changing the default size.

Listing 5. A simple graph with labels

rrdtool graph sample3.png \
    -w 450 -h 120 \
    -t "Performance of http://ertw.com from inside" -v "Seconds" \
    DEF:total=bob.ertw.com-http___ertw_com.rrd:total:AVERAGE \
    LINE1:total#FF0000:"Total Request Time"

Figure 3 shows the graph that this command produced.

Figure 3. A simple line graph with labels

Specify time ranges

By default, graphs show the last 24 hours of data. By using the -s and -e parameters, which specify start and end times, you can show different time periods. The values used for these parameters are either in the traditional UNIX "seconds since epoch (midnight on January 1, 1970)" or in at-style syntax. The following examples illustrate their use:

• -s end-1week: Because the default ending time is the current time, this command sets the start time as one week ago.

• -s end-1week -e "January 31, 2006": This command shows the last week of January 2006.

• -s "February 7, 2006" -e "February 8, 2006 23:59": This command shows both February 7 and 8. Without a time, midnight is assumed. Had the 23:59 been omitted, only data for February 7 would have been shown, because the assumption would have been from midnight of the first day to midnight of the next day, which is 24 hours.

In addition to week, you can also specify month, day, hour, minute, or second. The plural version also works. You can specify arbitrary times in the seconds since epoch, such as 1139378400 for midnight of February 8, 2006. You can mix this format with the at-style commands, such as -e 1139378400 -s end-2weeks for the two weeks ending February 8. Time is rarely expressed as an integer, but it is somewhat convenient for scripting purposes.
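When a script needs both ends of the range as integers, the start time can be computed directly. The sketch below uses fixed-length units only. The function name epoch_minus is hypothetical, months are left out because their length varies, and daylight saving changes are ignored, so the result can differ by an hour from what the at-style syntax would give.

```shell
# Subtract a number of fixed-length units from an epoch timestamp.
# Arguments: end time (epoch seconds), count, unit.
epoch_minus () {
    case "$3" in
        second|seconds) len=1 ;;
        minute|minutes) len=60 ;;
        hour|hours)     len=3600 ;;
        day|days)       len=86400 ;;
        week|weeks)     len=604800 ;;
        *) echo "unsupported unit: $3" >&2; return 1 ;;
    esac
    echo $(( $1 - $2 * len ))
}

# The two weeks ending at the timestamp used in the text
end=1139378400
start=$(epoch_minus "$end" 2 weeks)
echo "-s $start -e $end"
```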

Put text on the graph

In addition to lines, you can put text on the graph to help the reader. The text can include data from the data sources, or it can be a static string. Listing 6 and Figure 4 clearly illustrate the use of labels.

Listing 6. Displaying data in the form of labels

rrdtool graph sample5.JPG \
    -s "February 7, 2006" -e "February 8, 2006 23:59" \
    -t "Performance of http://ertw.com from inside" -v "Seconds" \
    DEF:total=bob.ertw.com-http___ertw_com.rrd:total:AVERAGE \
    LINE1:total#FF0000:"Total Request Time" \
    COMMENT:"\n" \
    GPRINT:total:AVERAGE:"Average request time %0.2lf %Ss" \
    COMMENT:"\n" \
    GPRINT:total:LAST:"Current request time %0.2lf %Ss"

Figure 4. Graph with additional data in the labels




Think of the text area below the graph as being like a typewriter. Where the last piece of text left off, the next one starts. In this case, the first piece of text is Total Request Time. The COMMENT:text command puts static text at the current spot. Here, \n means start a new line. In contrast, the GPRINT command uses data from a data source previously defined with the DEF command.

The GPRINT command takes a value data source definition, which is a data source that returns a single value instead of a time series. You can define this source with the VDEF command, but it can also take the form of datasource:consolidationfunction. In this case, the first GPRINT command returns the average of the total data source, and the second command takes the last element from the series. Like the data source definitions, you can use MAX and MIN here, too.

The format string, which is the last part of the GPRINT command, uses formatter strings. Like the C printf() function, these strings that start with the percent character are replaced with the value of a variable as the expression is evaluated. The %0.2lf value displays a long floating-point number (lf) with no specified field width and two decimal places (0.2). The %S is replaced by a magnitude unit for the previous value, such as K for Kilo or m for milli, but the capital S means that all magnitudes are the same (which makes the graph easier to interpret). For printing numbers, %lf and %S handle most cases. The rrdgraph_graph man page outlines all the options.

CDEFs: Calculated data sources

So far, you've focused on displaying data exactly as it's measured. In reality, some further data manipulation is often required. For example, network devices usually return values in bytes, but for planning purposes, network engineers multiply that number by eight to calculate bits per second. The CDEF command creates a new data source that is the result of calculations on other data sources.

The CDEF command uses Reverse Polish Notation (RPN) instead of the infix notation you're used to. RPN uses a stack. Numbers are pushed on and operators pull the required number of numbers off the stack before pushing on the result. (For more information, see the sidebar, RPN.)

When capturing the output of curl, all the times are relative to the start. A CDEF command could easily calculate the time for one of the components, such as connection time. Because the TCP connection starts as soon as the DNS request has returned, the time for the connection itself is the connection time minus the DNS time (because both are reported relative to the start of the whole process). In RPN, this format is simply connect,dns,-, as shown in Listing 7 and Figure 5.

Listing 7. Basic graph demonstrating a CDEF

rrdtool graph sample6.JPG \
    DEF:dns=bob.ertw.com-http___canoe_ca.rrd:dns:AVERAGE \
    DEF:connect=bob.ertw.com-http___canoe_ca.rrd:connect:AVERAGE \
    CDEF:connect_only=connect,dns,- \
    LINE1:connect_only#FF0000:"Connection Time"

Figure 5. Graph using a CDEF command

The CDEF command gives the user the power to manipulate the source data, independent of how it was collected.

RRDtool graph summary

This section has introduced many of RRDtool's graphing capabilities. Obviously, many things are possible, which brings up a great piece of advice: Just because you can, doesn't mean you should. This means that slick features that don't add anything to the graph should probably be left out (like a background, in most cases).

RPN

RPN takes a little while to get used to, but when you understand it, it's incredibly powerful. Not only is it easier to implement on the software side, but pushing and pulling values from a stack lets the operator break an equation into many smaller equations, reducing the number of keystrokes. The equation is more efficient, because it doesn't require parentheses.

Where the conventional formula to change bytes into bits would be value*8, RPN would express this as value,8,*. (RRDtool separates the items with a comma. If you were using an RPN calculator, you would press ENTER, instead.) As each number is entered, it goes on the stack. After value and 8 are pushed to the stack, the * operator retrieves the two most recent values, multiplies them, and pushes back the result. In this case, after * is applied, the stack contains only the bits value.

When writing expressions in RRDtool, RPN helps by making you think about the data first, and then what you want to do with it. You don't contend with order of operations: You just push your data to the stack and add the operations as you need them. Because you're not limited to traditional arithmetic operators, you have access to other functions, such as comparison operators, that are more difficult to express in infix notation. The rrdgraph_rpn man page explains all the operators, along with a brief explanation of RPN.
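To see the stack at work outside RRDtool, the following sketch evaluates comma-separated RPN expressions made of numbers and the four arithmetic operators. It is a toy evaluator written for illustration, not RRDtool's implementation. The function name rpn is hypothetical, and it does no error checking.

```shell
# Evaluate a comma-separated RPN expression with + - * / operators.
rpn () {
    echo "$1" | awk -F, '{
        sp = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "+" || $i == "-" || $i == "*" || $i == "/") {
                b = stack[sp--]; a = stack[sp--]   # pop two operands
                if ($i == "+") r = a + b
                else if ($i == "-") r = a - b
                else if ($i == "*") r = a * b
                else r = a / b
                stack[++sp] = r                    # push the result
            } else {
                stack[++sp] = $i + 0               # push a number
            }
        }
        print stack[sp]
    }'
}

rpn "600,8,*"          # bytes to bits
rpn "0.447,0.388,-"    # connect time minus DNS time
```

Note that the operands are popped in reverse order, which matters for subtraction and division.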

Follow these tips to make good graphs:

• Every number should have a unit of measurement, such as seconds, connections per minute, or units. Use magnitudes to keep the legends along the axis short, and when displaying numbers with GPRINT, use %S instead of %s to tell RRDtool to use the same magnitude in all printed lines.

• When dealing with data sets of different magnitudes, either separate them into a different graph or create a CDEF command to change the magnitude to something that is comparable to the other set. For example, if you are measuring connections per second using thousands and memory allocation failures per second that might measure less than one, consider scaling down the first set by a factor of 1000. Otherwise, the detail in the memory allocation failures will be dwarfed. When scaling, ensure that all labels are updated to indicate the change.

• Graphs are good for showing changes over time, or showing a correlation between data sets. As such, the data sources on your graph should be related somehow to tell one story. Graphing the data rate on your router in Memphis, Tennessee, and memory usage on a server in London is not helpful unless you're trying to find or show a relationship between the two.

• Don't put too many lines on one graph. Four or five lines should be the most you need.

• Be careful with color. With lines, the colors you use should be easily identifiable on the graph. Six lines, all with different shades of red, is not a good idea. In contrast, when using stacked areas (such as to show a single action with many components, such as a Web request), the colors should be different enough to be discernible, but not cause an eyesore. In this situation, find complementary colors and progress from dark at the bottom to light at the top.

• If accuracy is important for a single value, make sure that the number is printed on the bottom. For example, you can draw a horizontal line to represent the average of a particular data series. Because the purpose of this line is likely to compare data against the average, printing the average value at the bottom of the graph makes the comparison clearer.

• Always use titles and labels for your data series.




Section 5. Putting it all together

Learn how to enable data collection, graph a single Web site, and compare information from multiple RRDs.

Enable data collection

Listing 2 showed a complete script to take the performance measurements for any number of sites. Run this code from cron, editing your crontab with the crontab -e command. Change the value of BASE at the beginning of the script to reflect the directory that holds the configuration file and RRD files. Listing 8 shows the crontab entry to enable five-minute data samples, assuming that the script is run from /home/sean.

Listing 8. Crontab entry to enable periodic data collection

0,5,10,15,20,25,30,35,40,45,50,55 * * * * /home/sean/measure.sh

Because the measurement code inserts the name of the collecting machine into the name of the RRD file, multiple machines can measure performance on the same site. This functionality evens out variations in local bandwidth or DNS resolution speed.

Graph a single site

The first and easiest graph to present shows the connection components (DNS request, connection, first byte, total) together and displays some basic statistics (see Listing 9). Note that I've left out the pretransfer attribute, because it's only significant when I'm transferring data to the server to prepare it for query. All the requests I'm looking at are simple GET requests, hence the omission.

Listing 9. Area graph showing connection components

#!/bin/sh

MEASUREMENT="bob.ertw.com"
SITE="http://canoe.ca"

makefile () {
  # ensure we have the needed parameters
  [ -z "$1" ] && exit 2
  # Strip out any non alphas, replace with underscores
  # (the host name is prepended where the RRD file name is used)
  FILE=`echo "$1" | sed 's/[^a-zA-Z0-9_]/_/g'`
  # Can only return numeric results, so global $FILE
  # has the result
}

makefile "$SITE"

DEFS=" DEF:total=$MEASUREMENT-$FILE.rrd:total:AVERAGE \
  DEF:dns=$MEASUREMENT-$FILE.rrd:dns:AVERAGE \
  DEF:connect=$MEASUREMENT-$FILE.rrd:connect:AVERAGE \
  DEF:pretransfer=$MEASUREMENT-$FILE.rrd:pretransfer:AVERAGE \
  DEF:starttransfer=$MEASUREMENT-$FILE.rrd:starttransfer:AVERAGE"

# Daily graph (the default time span is the past 24 hours)
rrdtool graph $FILE-daily.png \
  -u 3 -r \
  -t "Performance of $SITE from $MEASUREMENT" -v "Seconds" \
  $DEFS \
  AREA:total#DADAAA:"Data Transfer" \
  AREA:starttransfer#CAAA00:"First Byte" \
  AREA:connect#555555:"Connection" \
  AREA:dns#222222:"DNS Lookup" \
  COMMENT:"\n" COMMENT:"\n" \
  GPRINT:total:AVERAGE:"Average request time %0.2lf %Ss" \
  GPRINT:total:MIN:"Min/Max %0.2lf %Ss/" \
  GPRINT:total:MAX:"%0.2lf %Ss" \
  COMMENT:"\n" \
  GPRINT:total:LAST:"Current request time %0.2lf %Ss"

# Monthly graph; the Current request time line is left out
rrdtool graph $FILE-monthly.png \
  -u 3 -r -s end-1month \
  -t "Performance of $SITE from $MEASUREMENT" -v "Seconds" \
  $DEFS \
  AREA:total#DADAAA:"Data Transfer" \
  AREA:starttransfer#CAAA00:"First Byte" \
  AREA:connect#555555:"Connection" \
  AREA:dns#222222:"DNS Lookup" \
  COMMENT:"\n" COMMENT:"\n" \
  GPRINT:total:AVERAGE:"Average request time %0.2lf %Ss" \
  GPRINT:total:MIN:"Min/Max %0.2lf %Ss/" \
  GPRINT:total:MAX:"%0.2lf %Ss"

This code produces two graphs: one showing the past 24 hours and the other showing the past month. Figure 6 shows the daily graph.

Figure 6. Basic connection graph

When displaying the same data for different time periods, you might notice some differences. I've chosen to leave out the Current request time in the monthly graph, because it's averaged from 24 different data points to make a two-hour consolidated data point. Similarly, the average value reported at the bottom of the graph is likely to be lower, because outliers are averaged out.
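To see why a consolidated point hides short spikes, consider a sketch that averages 24 five-minute samples into one two-hour value, which is the arithmetic an AVERAGE consolidation performs. The sample values are invented for illustration, and consolidate and samples are hypothetical helpers.

```shell
#!/bin/sh
# consolidate: read one sample per line, print the average and the maximum.
consolidate () {
  awk '{ sum += $1; if (NR == 1 || $1 > max) max = $1 }
       END { if (NR > 0) printf "avg=%.2f max=%.2f\n", sum / NR, max }'
}

# samples: 23 requests that took 1 second and a single 13-second outlier
samples () {
  i=0
  while [ "$i" -lt 23 ]; do
    echo 1.0
    i=$((i + 1))
  done
  echo 13.0
}

samples | consolidate
```

The 13-second spike is obvious in the five-minute data, but it adds only half a second to the two-hour point, so a "current" value read from the monthly data says little about the most recent request.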




Compare data from multiple RRDs

There might be some cases in which you want to compare the performance of multiple sites, or to compare the measurements of the same site from several different monitoring points. This ability would be helpful when attempting to correlate events (such as when the hosts run on the same server), to find local performance problems (using multiple measurement points), or to perform longer-term trending of multiple sites at one time.

In any case, the solution involves multiple DEF commands that refer to different RRD files. In Listing 10, three different Web sites are displayed along with an average.

Listing 10. Comparing three Web sites and displaying the average

rrdtool graph multiple-year.png \
  -u 5 -r -s end-1year \
  -t "Web performance trending" -v "Seconds" \
  DEF:total1=bob.ertw.com-http___canoe_ca.rrd:total:AVERAGE \
  DEF:total2=bob.ertw.com-http___cnn_com.rrd:total:AVERAGE \
  DEF:total3=bob.ertw.com-http___ertw_com.rrd:total:AVERAGE \
  CDEF:avg=total1,total2,total3,+,+,3,/ \
  LINE1:total1#FF0000:"Canoe.ca" \
  LINE1:total2#00FF00:"CNN.com" \
  LINE1:total3#0000FF:"ERTW.COM" \
  LINE1:avg#000000:"Average" \
  GPRINT:avg:AVERAGE:"Average request time %0.2lf %Ss"
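The CDEF expression in Listing 10 is written in RPN: the three values are pushed onto a stack, two + operators add them, then 3 is pushed and / divides the sum by it. The following sketch is a small evaluator that shows this order of evaluation. It assumes only the +, -, *, and / operators (RRDtool's RPN supports many more), and the function name rpn and the sample timings are illustrative.

```shell
#!/bin/sh
# rpn: evaluate a comma-separated RPN expression such as "1.2,0.9,0.3,+,+,3,/".
# Only the four basic arithmetic operators are handled in this sketch.
rpn () {
  printf '%s\n' "$1" | awk -F, '{
    sp = 0
    for (i = 1; i <= NF; i++) {
      if ($i == "+" || $i == "-" || $i == "*" || $i == "/") {
        # An operator pops two values and pushes the result
        b = stack[sp--]; a = stack[sp--]
        if ($i == "+") r = a + b
        else if ($i == "-") r = a - b
        else if ($i == "*") r = a * b
        else r = a / b
        stack[++sp] = r
      } else {
        # Anything else is treated as a number and pushed
        stack[++sp] = $i + 0
      }
    }
    printf "%.2f\n", stack[sp]
  }'
}

# Same shape as CDEF:avg=total1,total2,total3,+,+,3,/
rpn "1.2,0.9,0.3,+,+,3,/"
```

With request times of 1.2, 0.9, and 0.3 seconds, the expression evaluates to an average of 0.80 seconds.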

Section 6. Interpreting the results

So far in this tutorial, you've seen tools for measuring and displaying data. With these tools, you can draw graphs that point out problem areas. For instance, if a particular query is supposed to execute in less than three seconds but often takes five seconds, where does one start to look? Alternatively, if you receive complaints that performance is slow at noon, you must still determine the cause before fixing the problem. It might also be that there is no problem, and the complaints are subjective.

Going back to the anatomy of a request, several key areas are being measured:

• The time it takes to resolve the DNS name of the server.

• The time it takes to connect to the server.

• The time it takes to send the request.

• The time it takes to receive the first byte from the server.

• The time it takes to receive the last byte from the server.

Each of these events occurs consecutively, and the time each takes to execute depends on a different part of the system.
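Because cURL reports each timer as the elapsed time since the start of the request, the duration of an individual phase is the difference between neighboring timers. A minimal sketch, assuming the five values are supplied in the order dns, connect, pretransfer, starttransfer, total; the function name phases and the sample numbers are hypothetical.

```shell
#!/bin/sh
# phases: turn cumulative timers into per-phase durations.
# Arguments: dns connect pretransfer starttransfer total (all in seconds)
phases () {
  awk -v dns="$1" -v con="$2" -v pre="$3" -v first="$4" -v total="$5" 'BEGIN {
    printf "dns=%.3f connect=%.3f request=%.3f firstbyte=%.3f transfer=%.3f\n",
      dns, con - dns, pre - con, first - pre, total - first
  }'
}

# Sample values of the kind produced by:
#   curl -s -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_pretransfer} %{time_starttransfer} %{time_total}\n' URL
phases 0.050 0.120 0.121 0.480 0.900
```

In this made-up sample, most of the 0.9 seconds is spent waiting for the first byte and transferring the data, which points away from DNS and connection setup.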

DNS resolution

DNS resolution is the process of converting a name, such as www.ibm.com, into an IP address, such as 1.1.1.1. In most cases, the client making the request asks its name server, which in turn walks a chain of name servers before finally returning the result. Any slowness is the result of the client's name server or the final name server in the chain, which typically belongs to the company or hosting provider.

The goal of measuring DNS response time is both to spot problems in DNS servers and to normalize the data when comparing multiple measurement sites. For example, if it takes 0.5 seconds to resolve a name from ISP A but one second from ISP B, measurements from ISP A look half a second better than those from ISP B. DNS might be operating correctly, or it might just be a problem with ISP B that is outside the control of the company, but it is important to understand that the measurements differ even though the Web application and network might be operating correctly.
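One way to normalize the data, sketched here under the assumption that each RRD holds the dns and total data sources used in Listing 9, is to subtract the DNS time with a CDEF before plotting. The helper nodns_defs and the second host name, alice.example.net, are hypothetical.

```shell
#!/bin/sh
# nodns_defs: print the DEF/CDEF arguments that yield "total minus DNS"
# for one RRD file. $1 = variable-name suffix, $2 = RRD file name.
nodns_defs () {
  printf 'DEF:total%s=%s:total:AVERAGE ' "$1" "$2"
  printf 'DEF:dns%s=%s:dns:AVERAGE ' "$1" "$2"
  # RPN: push total, push dns, subtract
  printf 'CDEF:net%s=total%s,dns%s,-\n' "$1" "$1" "$1"
}

nodns_defs A bob.ertw.com-http___canoe_ca.rrd
nodns_defs B alice.example.net-http___canoe_ca.rrd
```

The output can be passed to rrdtool graph in the same way as $DEFS in Listing 9, followed by LINE1 elements for netA and netB, to compare the two measurement points with the DNS time removed.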

Connection

The time it takes to establish a TCP connection depends in part on the speed of the intermediate network but, in larger part, on the health of the Web server itself. Stepping back, a TCP connection is formed by the TCP three-way handshake:

1. The client sends a packet with the SYN bit set.

2. The server responds with SYN and ACK.

3. The client confirms with ACK.

Implicit in this process is that the server has sufficient resources to accept the connection. If there is a problem in this regard, the connection time suffers.

Most Web servers operate by having multiple worker threads or processes handling requests. The master process accepts connections and dispatches them to a worker. The UNIX kernel queues incoming connection requests up to a certain point, but the master process must properly handle those requests and dispatch them to the worker to leave the queue. In high-load situations, the server might be slow in spawning new workers or dispatching the work, which leads to high connection times.

Other causes of high connection times include improperly configured load balancers or other network equipment. While it is possible that the network between the client and server is so slow that the handshake takes longer than normal, this latency also repeats in the last byte time.

Figure 7 shows a Web site that on two occasions has displayed connection slowness at midnight.




Figure 7. Connection slowness at midnight

While this graph is not indicative of a long-term problem, it does point to something that is of interest to the person running the site. Given the time the problem occurs, it might be coinciding with scheduled backups and other maintenance. If this is the case, spreading the jobs over a longer time frame often fixes the problem. If this latency occurred over longer periods of time during the day, one might look toward server load as a cause.

In most cases, the request itself is small. If a file is being uploaded as part of the request, the request time might become longer. More often, however, the time it takes to send the request is inconsequential.

First byte

The first byte is the time it takes to return data after the request has been issued. In the case of static sites, the latency is caused by any rules the Web server has to process and any disk activity that must occur.

Far more interesting than static sites are dynamic sites. A request might require that a database be queried or other computers be consulted. Consider a search engine that must parse the query and look in an index to determine the best match, or an online database that must pull records and perhaps perform a transaction. The time it takes from the request to the first byte returned speaks heavily to the performance of the Web server and, more importantly, to the Web application.

The causes of Web application degradation are numerous. Looking at the data over a period of time helps determine the cause. If the problem is persistent, it could be because of an inefficiently written program or a badly tuned database. If one brought in other data, such as concurrent connections, it would be possible to look for a relationship between poor application performance and system load.
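As a rough sketch of such a comparison, assuming you have paired samples of concurrent connections and first-byte time (one pair per line; the numbers below are invented), the Pearson correlation coefficient can be computed with awk. The function name correlate is hypothetical, and a value near 1 only suggests, rather than proves, that load is behind the slowdown.

```shell
#!/bin/sh
# correlate: read "x y" pairs, one per line, and print the Pearson
# correlation coefficient (or "undefined" if either column is constant).
correlate () {
  awk '{ n++; sx += $1; sy += $2
         sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
       END {
         den = sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
         if (den == 0) print "undefined"
         else printf "%.2f\n", (n * sxy - sx * sy) / den
       }'
}

# Columns: concurrent connections, first-byte time in seconds
printf '%s\n' "10 0.5" "20 1.0" "30 1.5" "40 2.0" | correlate
```

In this invented data set, the first-byte time rises in step with the number of connections, so the coefficient is 1.00; real measurements will be noisier.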

Last byte

The difference between first and last byte is the time it takes to send the HTML over the Internet to the destination. In some cases, the Web application might stream the data to the Web server, so this metric might also include components of Web application performance. For the most part, however, last byte is a function of network health.

The network health might be a function of raw bandwidth (such as a maxed-out Internet connection somewhere along the path) or latency between the client and the server. In both cases, measuring from different parts of the Internet helps locate the cause.

Section 7. Summary

Determining the root cause of Web performance problems requires understanding how the request is made and obtaining relevant data with which to make a judgment. More importantly, you must know that the problem exists. You can determine both of these things with proper monitoring and display of data.

The graphs and techniques presented in this tutorial form the basis for diagnosing problems and proving that they've been solved after remediation has been undertaken. It is likely that you will have to create custom graphs using the tools shown here to truly understand the problem. Measuring data with RRDtool, including other components of your Web application, is quite easy. The round-robin feature of the tool, along with the incredible ease of graphing, ensures that you spend your time effectively interpreting data rather than building one-off tools that are later thrown away.




Resources

Learn

• The RRDtool documentation is vital to working with the product.

• Although I provided a brief outline of RPN earlier, Steve Rader provides an in-depth look at RPN specifically geared for the RRDtool. His tutorial also covers comparison functions and the IF statement.

• When you have multiple Web servers handling a load, how do you distribute requests among them? The easiest way is with Round-Robin DNS. After that, you start to look at hardware load balancers. The hardware devices use complex network address translation (NAT) that makes multiple servers appear as one. Tony Bourke describes this process in his book, Server Load Balancing (O'Reilly, August 2001).

• Part of scaling a Web application has to do with the architecture of the Web and database servers. If developers understand this concept, they are more apt to write the applications with scaling in mind. This article at http://blog.flickr.com/flickrblog/2005/10/lamp.html discusses the evolution of database-driven Web applications. Though it's written from the perspective of Linux®, Apache, MySQL, and PHP (LAMP) installations, it is applicable to almost all Web applications.

• The most popular UNIX Web server is Apache. The Apache performance tuning guidelines give advice for configuring both Apache and the underlying operating system for maximum performance. This is one component in the first byte returned metric and indirectly affects the connection delay.

• The O'Reilly WEB DEVCENTER provides 10 tips on Web performance tuning. One thing that is not measured by cURL is the time it takes to render the page in the user's browser. Some of the 10 tips help in speeding up the rendering process.

• Stay current with developerWorks technical events and Webcasts.

Get products and technologies

• Download RRDtool from RRDtool.org. Even if your system provides it, you might want to ensure that you have the latest version. RRDtool 1.2 has support for anti-aliased graphs and introduces more advanced statistical functions.

• Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

• Participate in developerWorks blogs and get involved in the developerWorks community.

http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/doc/index.en.html

http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/tut/rpntutorial.en.html

http://www.vergenet.net/linux/mail_farm/html/node10.html

http://www.oreilly.com/catalog/serverload/chapter/ch07.html

http://blog.flickr.com/flickrblog/2005/10/lamp.html

http://httpd.apache.org/docs/2.0/misc/perf-tuning.html

http://www.oreillynet.com/pub/a/javascript/2002/06/27/web_tuning.html

http://www.ibm.com/developerworks/offers/techbriefings/?S_TACT=105AGX12&S_CMP=art

http://rrdtool.org

http://www.ibm.com/developerworks/downloads/?S_TACT=105AGX12&S_CMP=art

http://www.ibm.com/developerworks/blogs/

About the author

Sean Walberg has been working with Linux and UNIX systems since 1994 in academic, corporate, and Internet service provider environments. He has written extensively about systems administration over the past several years. You can contact him at [email protected].


