
Dissertation

ON

BENEFITS & DRAWBACKS

Of

HTTP DATA COMPRESSION

IN PARTIAL FULFILMENT

OF

MASTER’S DEGREE IN COMPUTER APPLICATIONS (M.C.A.)

GUJARAT UNIVERSITY

[GUIDE: Mr. Vrutik Shah]

INDUS INSTITUTE OF COMPUTER TECHNOLOGY & ENGINEERING

SUBMITTED BY:

Chintan Parikh

Nihar Dave


ACKNOWLEDGEMENT

This dissertation gives us a feeling of fulfillment and deep gratitude towards all our teachers, friends and colleagues. As the final step towards the achievement of our master's degree, the process of preparing the dissertation has bridged the gap between academics and research work for us. It has prepared us to apply ourselves better and to become good computer professionals. Since we were beginners, it required a lot of support from different people. We acknowledge all the help we have received from so many people in accomplishing this project and wish to thank them.

We take this opportunity to thank Mr. H. K. Desai, director of I.I.T.E., for taking a personal interest in the project and for guidance that often resulted in valuable tips. We also thank our guide, Mr. Vrutik Shah, for his regular guidance and encouragement.

Our sincere thanks to our batch mates, who provided us with innumerable discussions on many technicalities and friendly tips; without their cordial support this activity would have been much tougher.


Table of Contents

1. Overview of HTTP Compression.................................................................................................7

1.1 Abstract.................................................................................................................................7

1.2 Introduction...........................................................................................................................8

1.3 Benefits of HTTP Compression..........................................................................................10

2.0 Process Flow.............................................................................................................................12

2.1 Negotiation Process..................................................................................................................12

2.2 Client sends a request to the server..........................................................................13

2.3 Server sends a response to the client.......................................................................13

3.0 Compression Techniques...........................................................................................14

3.1 Popular Compression Techniques............................................................................14

3.2 Modem Compression................................................................................................................15

3.3 GZIP..........................................................................................................................................17

3.3.1 Introduction............................................................................................................................17

HTTP Request and Response (Uncompressed)...................................................................17

So what’s the problem?...............................................................................................................17

3.3.2 Purpose...................................................................................................................................18

3.4 HTTP Compression..................................................................................................................19

3.5 Static Compression...................................................................................................................20

3.6 Content and Transfer encoding.................................................................................................21

4.0 HTTP’s Support for Compression............................................................................................23

4.1 HTTP/1.0..................................................................................................................................23

4.2 HTTP/1.1..................................................................................................................................23

4.3 Content Coding Values.............................................................................................................24

5.0 Approaches to HTTP Compression..........................................................................................26

5.1 HTTP Compression on Servers................................................................................................26

5.2 HTTP Compression on Software-based Load Balancers.........................................................26

5.3 HTTP Compression on ASIC-based Load Balancers...............................................................27

5.4 HTTP Compression on a Purpose-built HTTP Compression Device.......................................28

6.0 Browser Support for Compression...........................................................................................30

7.0 Client Side Compression Issues................................................................................................32

8.0 Web Server Support for Compression......................................................................................34

8.1 IIS..............................................................................................................................................34

8.2 Apache......................................................................................................................................35


9.0 Proxy Support for Compression...............................................................................................37

10.0 Related Work..........................................................................................................................38

11.0 Experiments............................................................................................................................40

11.1 Compression Ratio Measurements.........................................................................................40

11.2 Web Server Performance Test................................................................................................44

11.2.1 Apache Performance Benchmark........................................................................................45

11.2.2 IIS Performance Benchmark................................................................................................51

12.0 Summary / Suggestion............................................................................................................57

References.......................................................................................................................................59


Table of Figures

3.1 Http Request & Response (uncompressed)……………………………………………..25

3.2 Http Request & Response (compressed)………………………………………………..25

11.1 The total page size (including HTML and embedded resources) for the top ten web

sites……………………………………………………………………………………53

11.2 Benchmarking results for the retrieval of the Google HTML file from the Apache

Server…………………………………………………………………………………58

11.3 Benchmarking results for the retrieval of the Yahoo HTML file from the Apache

Server…………………………………………………………………………………59

11.4 Benchmarking results for the retrieval of the AOL HTML file from the Apache

Server…………………………………………………………………………………60

11.5 Benchmarking results for the retrieval of the eBay HTML file from the Apache

Server…………………………………………………………………………………61

11.6 Benchmarking results for the retrieval of the Google HTML file from the IIS

Server…………………………………………………………………………………64

11.7 Benchmarking results for the retrieval of the Yahoo HTML file from the IIS

Server…………………………………………………………………………………65

11.8 Benchmarking results for the retrieval of the AOL HTML file from the IIS

Server…………............................................................................................................66

11.9 Benchmarking results for the retrieval of the eBay HTML file from the IIS

Server……………………………………………………………………………….67

11.10 HTTP response header from an IIS Server after a request for an uncompressed .asp

resource…………………………………………………………………………….68

11.11 HTTP response header from an IIS Server after a request for a compressed .asp

resource…………………………………………………………………………….69


Table of Tables

11.1 Comparison of the total compression ratios of level 1 and level 9 gzip encoding for the indicated

URLs……………………………………………………………………………………………54

11.2 Estimated saturation points for the Apache web server based on repeated client requests for the

indicated document…………………………………………………………………...62

11.3 Average response time (in milliseconds) for the Apache server to respond to requests for compressed

and uncompressed static documents………………………………………………………………….62


Chapter 1

1. Overview of HTTP Compression

1.1 Abstract

HTTP compression addresses some of the performance problems of the

Web by attempting to reduce the size of resources transferred between a server and client

thereby conserving bandwidth and reducing user perceived latency.

Currently, most modern browsers and web servers support some form of

content compression. Additionally, a number of browsers are able to perform streaming

decompression of gzipped content. Despite this existing support for HTTP compression,

it remains an underutilized feature of the Web today. This can perhaps be explained, in

part, by the fact that there currently exists little proxy support for the Vary header, which

is necessary for a proxy cache to correctly store and handle compressed content.

To demonstrate some of the quantitative benefits of compression, we

conducted a test to determine the potential byte savings for a number of popular web

sites. Our results show that, on average, 27% byte reductions are possible for these sites.

Finally, we provide an in-depth look at the compression features that exist

in Microsoft’s Internet Information Services (IIS) 5.0 and the Apache 1.3.22 web server

and perform benchmarking tests on both applications.


1.2 Introduction

User perceived latency is one of the main performance problems plaguing

the World Wide Web today. At one point or another every Internet user has experienced

just how painfully slow the “World Wide Wait” can be. As a result, there has been a

great deal of research and development focused on improving Web performance.

Currently there exist a number of techniques designed to bring content

closer to the end user in the hopes of conserving bandwidth and reducing user perceived

latency, among other things. Such techniques include prefetching, caching and content

delivery networks. However, one area that seems to have drawn only a modest amount

of attention involves HTTP compression.

Many Web resources, such as HTML, JavaScript, CSS and XML

documents, are simply ASCII text files. Given the fact that such files often contain many

repeated sequences of identical information, they are ideal candidates for compression.

Other resources, such as JPEG and GIF images and streaming audio and video files, are

precompressed and hence would not benefit from further compression. As such, when

dealing with HTTP compression, focus is typically limited to text resources, which stand

to gain the most byte savings from compression.

Encoding schemes for such text resources must provide lossless data

compression. As the name implies, a lossless data compression algorithm is one that can

recreate the original data, bit-for-bit, from a compressed file. One can easily imagine how

the loss or alteration of a single bit in an HTML file could affect its meaning.

The goal of HTTP compression is to reduce the size of certain resources

that are transferred between a server and client. By reducing the size of web resources,

compression can make more efficient use of network bandwidth. Compressed content can

also provide monetary savings for those individuals who pay a fee based on the amount

of bandwidth they consume. More importantly, though, since fewer bytes are transmitted,

clients would typically receive the resource in less time than if it had been sent

uncompressed. This is especially true for narrowband clients (that is, those users who are

connected to the Internet via a modem). Modems typically present what is referred to as


the weakest link or longest mile in a data transfer; hence methods to reduce download

times are especially pertinent to these users.

Furthermore, compression can potentially alleviate some of the burden

imposed by the TCP slow start phase. The TCP slow start phase is a means of controlling

the amount of congestion on a network. It works by forcing a small initial congestion

window on each new TCP connection thereby limiting the number of maximum-size

packets that can initially be transmitted by the sender. Upon the reception of an ACK

packet, the sender’s congestion window is increased. This continues until a packet is lost,

at which point the size of the congestion window is decreased. This process of increasing

and decreasing the congestion window continues throughout the connection in order to

constantly maintain an appropriate transmission rate. In this way, a new TCP connection

avoids overburdening a network with large bursts of data. Due to this slow start phase,

the first few packets that are transferred on a connection are relatively more expensive

than subsequent ones. Also, one can imagine that for the transfer of small files, a

connection may not reach its maximum transfer rate because the transfer may reach

completion before it has the chance to get out of the TCP slow start phase. So, by

compressing a resource, more data effectively fits into each packet. This in turn results

in fewer packets being transferred thereby lessening the effects of slow start (reducing the

number of server stalls)[3].
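As a rough, hypothetical illustration (assuming 1460-byte segment payloads, an initial congestion window of one segment, a window that doubles each round trip, and ignoring delayed ACKs): a 40 kilobyte page occupies about 28 segments and needs roughly five round trips to arrive under slow start (1 + 2 + 4 + 8 + 13 segments), whereas the same page compressed 4-to-1 fits in 7 segments and completes in about three round trips (1 + 2 + 4).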

In the case where an HTML document is sent in a compressed format, it is

probable that the first few packets of data will contain more HTML code and hence a

greater number of inline image references than if the same document had been sent

uncompressed. As a result, the client can subsequently issue requests for these embedded

resources more quickly, hence easing some of the slow start burden. Also, inline objects are

likely to be on the same server as this HTML document. Therefore an HTTP/1.1

compliant browser may be able to pipeline these requests onto the same TCP connection

[4]. Thus, not only does the client receive the HTML file in less time, but he/she is also able

to expedite the process of requesting embedded resources [3].


Currently, most modern browsers and web servers support some form of

content compression. Additionally, a number of browsers are able to perform streaming

decompression of gzipped content. This means that, for instance, such a browser could

decompress and parse a gzipped HTML file as each successive packet of data arrives

rather than having to wait for the entire file to be retrieved before decompressing. Despite

all of the aforementioned benefits and the existing support for HTTP compression, it

remains an underutilized feature of the Web today. This can perhaps be explained, in

part, by the fact that there currently exists little proxy support for the Vary header, as we

shall see later, which is necessary for a proxy cache to correctly store and handle

compressed content.

1.3 Benefits of HTTP Compression

Many Internet users are concerned about speed and complain that the

Internet is too slow. They want Web pages to load faster. Modem users want the Web to

be faster, but aren't yet ready or able to upgrade to broadband service. Broadband users

enjoy the speed they have, but as new bandwidth-intensive content becomes more

available, they're looking for still more performance. For site operators, the answer to

these customer demands may be HTTP compression, which can offer modem and

broadband users two to four times their current page-load performance.

Most Web sites would benefit from serving compressed HTTP data. In fact,

most site images in GIF and JPEG formats are already compressed and do not compress

much more using HTTP compression. But what about the HTML code? Currently, the

base page is compressible text, as are JavaScript, cascading style sheets, XML, and

more. Using HTTP compression, static or on-the-fly-generated HTML content can be

reduced. A typical 30 KB HTML home page can be compressed to 6 KB, with no loss of

fidelity. This lossless compression provides for the exact same content to be rendered on

a user’s browser, but the content is represented with fewer bits. This results in less time

being required to transfer fewer bits.


Using HTTP compression in production environments, outbound

bandwidth can typically be reduced by 40 percent to 60 percent. This compression level

is achieved by leveraging browsers’ tendencies to cache images, especially those with

expiration dates set in the HTTP headers. However, browsers do not typically cache the

base content on a page. When a user frequents a site, the images are usually pre-cached

by the browser, but the base-page content must be requested again, as it changes rapidly.

Pages tend to be much more dynamic and change far more often than images.

With a reduction in outbound data comes a corresponding cost

decrease. For example, if the reduction is 50 percent, then instead of peaking at 50 Mb/sec,

the peak is at 25 Mb/sec, freeing up bandwidth.

Clearly, HTTP compression offers sites and users tremendous advantages,

which is part of the reason that HTTP compression has been part of the HTTP/1.1 standard since 1999.

Browsers also have supported compression for years. In fact, most browsers, from IE 4.0

and Netscape 4.7 to the latest AOL, Mozilla, IE, and Netscape browsers, all automatically

support compression in one form or another. The standards are set and the user base is in

place to take further advantage of HTTP compression, but the challenges are many.


Chapter 2

2.0 Process Flow

2.1 Negotiation Process

In order for a client to receive a compressed resource from a server via

content coding, a negotiation process similar to the following occurs:

First, the client (a web browser) includes with every HTTP request a

header field indicating its ability to accept compressed content. The header may look

similar to the following: “Accept-Encoding: gzip, deflate”. In this example the client is

indicating that it can handle resources compressed in either the gzip or deflate format.

The server, upon receiving the HTTP request, examines the Accept-Encoding field. The

server may not support any of the indicated encoding formats and would simply send the

requested resource uncompressed. However, in the case where the server can handle such

encoding it then decides if it should compress the requested resource based on its

Content-Type or filename extension. In most cases it is up to the site administrator to

specify which resources the server should compress.

Usually one would only want to compress text resources, such as HTML

and CGI output, as opposed to images, as explained above. For this example we will

assume that the server supports gzip encoding and successfully determines that it should

compress the resource being requested. The server would include a field similar to

“Content-Encoding: gzip” in the reply header and then include the gzipped resource in

the body of the reply message. The client would receive the reply from the server,

analyze the headers, see that the resource is gzipped and then perform the necessary

decoding in order to produce the original, uncompressed resource [9].
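The following is a minimal sketch of this server-side decision in Java, using the JDK's built-in com.sun.net.httpserver package and the java.util.zip classes discussed later. The port, document body and class names are our own illustrative choices, not part of any server tested in this dissertation:

import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipNegotiationServer {
    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", GzipNegotiationServer::handle);
        server.start();
    }

    static void handle(HttpExchange exchange) throws IOException {
        byte[] body = "<html><body>Hello, compressed world!</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // Step 1: examine the Accept-Encoding header sent with the request.
        String accept = exchange.getRequestHeaders().getFirst("Accept-Encoding");
        boolean gzipOk = accept != null && accept.contains("gzip");
        exchange.getResponseHeaders().set("Content-Type", "text/html");
        if (gzipOk) {
            // Step 2: the client can handle gzip, so compress the entity body
            // and label it with a Content-Encoding header.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(body);
            }
            body = buf.toByteArray();
            exchange.getResponseHeaders().set("Content-Encoding", "gzip");
        }
        // Step 3: otherwise fall through and send the resource uncompressed.
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }
}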


2.2 Client sends a request to the server

When a Web browser loads a Web page, it opens a connection to the Web server

and sends an HTTP request to the Web server. A typical HTTP request looks like this:

GET /index.html HTTP/1.1

Host: www.http-compression.com

Accept-Encoding: gzip

User-Agent: Firefox/3.6

With this request, the Web browser asks for the object "/index.html" on

host "www.http-compression.com". The browser identifies itself as "Firefox/3.6" and

claims that it can understand HTTP responses in gzip format.
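For illustration, the equivalent request can be issued from Java with HttpURLConnection; this is a hypothetical sketch of what a gzip-capable client does, not the actual implementation of any browser:

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class GzipClient {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.http-compression.com/index.html");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Advertise that we can decode gzipped response bodies.
        conn.setRequestProperty("Accept-Encoding", "gzip");

        InputStream in = conn.getInputStream();
        // If the server chose to compress, wrap the stream in a decoder.
        if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) {
            in = new GZIPInputStream(in);
        }
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in))) {
            for (String line; (line = reader.readLine()) != null; ) {
                System.out.println(line);
            }
        }
    }
}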

2.3 Server sends a response to the client

After parsing and processing the client's request, the Web server may send

the HTTP response in compressed format. Then a typical HTTP response looks like this:

HTTP/1.1 200 OK

Server: Apache

Content-Type: text/html

Content-Encoding: gzip

Content-Length: 26395

With this response, the Web server tells the browser with status code 200

that it could fulfil the request. On the next line, the Web server identifies itself as Apache.

The line "Content-Type" says that it's an HTML document. The response header

"Content-Encoding" informs the browser that the following data is compressed with gzip.

Finally, the length of the compressed data is stated.


Chapter 3

3.0 Compression Techniques

3.1 Popular Compression Techniques

Though there exist many different lossless compression algorithms

today, most are variations of two popular schemes: Huffman encoding and the Lempel-

Ziv algorithm.

Huffman encoding works by assigning a binary code to each of the

symbols (characters) in an input stream (file). This is accomplished by first building a

binary tree of symbols based on their frequency of occurrence in a file. The assignment

of binary codes to symbols is done in such a way that the most frequently occurring

symbols are assigned the shortest binary codes and the least frequently

occurring symbols are assigned the longest codes. This in turn creates a smaller

compressed file.
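As a small illustrative example (not taken from this dissertation): for an input of nine symbols in which A occurs 5 times, B twice, and C and D once each, Huffman coding repeatedly merges the two least frequent nodes and could assign the codes A = 0, B = 10, C = 110 and D = 111. The encoded input then occupies 5(1) + 2(2) + 1(3) + 1(3) = 15 bits, versus 18 bits under a fixed 2-bit code.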

The Lempel–Ziv algorithm, also known as LZ-77, exploits the redundant

nature of data to provide compression. The algorithm utilizes what is referred to as a

sliding window to keep track of the last n bytes of data seen. Each time a phrase is

encountered that exists in the sliding window buffer, it is replaced with a pointer to the

starting position of the previously occurring phrase in the sliding window along with the

length of the phrase.
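As a toy example: when compressing the string "blah blah blah", an LZ-77 encoder can emit the six literal characters "blah b" and replace the remaining eight characters with a single (distance = 5, length = 8) pointer, since copying eight bytes starting five positions back in the window (the copy may overlap itself) reproduces "lah blah".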

The main metric for data compression algorithms is the compression ratio,

which refers to the ratio of the size of the original data to the size of the compressed data.

For example, if we had a 100 kilobyte file and were able to compress it down to only 20

kilobytes, we would say the compression ratio is 5-to-1, an 80% reduction in size. The contents of a file,

particularly the redundancy and orderliness of the data, can strongly affect the

compression ratio.
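This metric is easy to measure with the gzip format discussed in the following sections. The sketch below (the input path is a placeholder) gzips a file in memory and reports the resulting ratio:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class CompressionRatio {
    public static void main(String[] args) throws IOException {
        // Placeholder path: any text resource such as an HTML file.
        byte[] original = Files.readAllBytes(Paths.get("index.html"));

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(original);
        }
        int compressed = buf.size();

        // Ratio of original size to compressed size, e.g. 5.0 means 5-to-1.
        double ratio = (double) original.length / compressed;
        double savings = 100.0 * (1.0 - (double) compressed / original.length);
        System.out.printf("%d -> %d bytes (%.1f-to-1, %.0f%% savings)%n",
                original.length, compressed, ratio, savings);
    }
}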


3.2 Modem Compression

Users with low bandwidth and/or high latency connections to the Internet,

such as dialup modem users, are the most likely to perceive the benefits of data

compression. Modems currently implement compression on their own; however, it is neither

optimal nor as effective as HTTP compression. Most modems implement

compression at the hardware level, though it can be built into software that interfaces

with the modem. The V.44 protocol, which is the current modem compression standard,

uses the Lempel-Ziv-Jeff-Heath (LZJH) algorithm, a variant of LZ-77, to perform

compression.

The LZJH algorithm works by constructing a dictionary, represented by a

tree, which contains frequently repeated groups of characters and strings. For each

occurrence of a string that appears in the dictionary one need only output the dictionary

index, referred to as the codeword, rather than each individual character in the string.

When data transfer begins, the encoder (sender) and decoder (receiver) begin building

identical dictionary trees. Thus, whenever the decoder receives a codeword sent by the

encoder, it can use this as the index into the dictionary tree in order to rebuild the

original string.

Modem compression only works over small blocks of fixed-length data,

called frames, rather than on a large chunk of a file as is the case with high-level

algorithms such as gzip. An algorithm constantly monitors the compressibility of these

frames to determine whether it should send the compressed or uncompressed version of

the frame. Modems can quickly switch between compressed and transparent mode

through use of a special escape code.

Furthermore, modems only offer point-to-point compression. It is only

when the data is transferred on the analog line, usually between an end user and an

Internet Service Provider, that the data is compressed. HTTP compression would thus

generally be considered superior since it provides an end-to-end solution between the

origin server and the end user.

Mogul et al. carried out an experiment to compare the performance of a

high level compression algorithm versus that of modem compression. To do this, several


plaintext and gzipped HTML files were transferred over a 28.8 Kbps modem that

supported the V.42bis compression protocol, which is a predecessor to the V.44 protocol.

The algorithm used in the V.42bis protocol is similar to that of the V.44 protocol

described above; however, it is less efficient in terms of speed and memory requirements

and usually achieves a lower compression ratio for an identical resource compared to

V.44. The size of the plaintext documents in the experiment ranged from approximately

six to 365 kilobytes. Seven trials were run for each HTML file and the average transfer

time in every case was less for the gzipped file versus the plaintext file. So, the authors

concluded that, while the tests showed that modem compression does work, it is not

nearly as effective as a high level compression algorithm for reducing transfer time.

In [3], a 42 kilobyte HTML file, consisting of data combined from the

Microsoft and Netscape home pages, was transferred in uncompressed and compressed

form over a 28.8 kbps modem. The results showed that using high level compression

rather than standard modem compression resulted in a 68% reduction in the total number

of packets transferred and a 64% reduction in total transfer time.

An experiment was conducted to compare the performance of modems

that support the V.42bis and V.44 compression protocols versus that of PKZIP, a high-

level compression algorithm. The experiment involved measuring the throughput

achieved by several modems in transferring files of varying type, such as a text file, an

uncompressed graphics file, and an executable file, via FTP. Each of these files was also

compressed using the PKZIP program. The results showed the high-level compression

program, PKZIP, to be significantly more effective than either of the modem

compression algorithms. Specifically, the performance gain of the V.44 modem over the

V.42bis modem was approximately 29%. However, the performance gain of PKZIP over

the V.44 modem was approximately 94%. Clearly, there was a more drastic performance

difference between PKZIP and the V.44 modem compared to that between the V.44 and

V.42bis modems. Thus, the author concluded that while the V.44 modems provided

better performance than the V.42bis modems, the high-level compression algorithm,

PKZIP, was significantly more effective than modem compression.


3.3 GZIP

3.3.1 Introduction

This specification defines a lossless compressed data format that is

compatible with the widely used GZIP utility. The format includes a cyclic redundancy

check value for detecting data corruption. The format presently uses the DEFLATE

method of compression but can be easily extended to use other compression methods.

The format can be implemented readily in a manner not covered by patents.

Before we start I should explain what content encoding is. When you

request a file like http://www.yahoo.com/index.html, your browser talks to a web server.

The conversation goes a little like this:

HTTP Request and Response (Uncompressed)

So what’s the problem?

Well, the system works, but it’s not that efficient. 100KB is a lot of text,

and frankly, HTML is redundant. Every <html>, <table> and <div> tag has a closing tag

that’s almost the same. Words are repeated throughout the document. Any way you slice

it, HTML (and its beefy cousin, XML) is not lean.

And what’s the plan when a file’s too big? Zip it![5]

If we could send a .zip file to the browser (index.html.zip) instead of plain

old index.html, we’d save on bandwidth and download time. The browser could

download the zipped file, extract it, and then show it to the user, who's in a good mood


because the page loaded quickly[5]. The browser-server conversation might look like

this:

Compressed HTTP Request and Response

3.3.2 Purpose

The purpose of this specification is to define a lossless compressed data format that:

Is independent of CPU type, operating system, file system, and character set, and

hence can be used for interchange;

Can compress or decompress a data stream (as opposed to a randomly accessible

file) to produce another data stream, using only an a priori bounded amount of

intermediate storage, and hence can be used in data communications or similar

structures such as Unix filters;

Compresses data with efficiency comparable to the best currently available

general-purpose compression methods, and in particular considerably better than

the “compress” program;

Can be implemented readily in a manner not covered by patents, and hence can be

practiced freely;

Is compatible with the file format produced by the current widely used gzip

utility, in that conforming decompressors will be able to read data produced by

the existing gzip compressor.

The data format defined by this specification does not attempt to:


Provide random access to compressed data;

Compress specialized data (e.g., raster graphics) as well as the best currently

available specialized algorithms.

GZIP is a freely available compressor, included in the JRE and the SDK as

java.util.zip.GZIPInputStream and java.util.zip.GZIPOutputStream.

Command-line versions are available with most Unix operating systems and Windows

Unix toolkits (Cygwin and MKS), or they are downloadable for a plethora of operating

systems at http://www.gzip.org/.

One can get the highest degree of compression by using gzip to compress an uncompressed

jar file rather than a compressed jar file; the downside is that the file may be stored

uncompressed on the target systems.

Here is an example:

Compressing using gzip on a jar file containing individual deflated entries.

Notepad.jar       46.25 kb

Notepad.jar.gz   43.00 kb

Compressing using gzip on a jar file containing "stored" entries

Notepad.jar      987.47 kb

Notepad.jar.gz   32.47 kb

As you can see, the download size can be reduced by 14% using an uncompressed jar, versus

3% using a compressed jar file.
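For completeness, here is a minimal sketch of driving the two java.util.zip classes named above from code; the file names follow the example, and InputStream.transferTo requires Java 9 or later:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipFiles {
    // Compress src into dst (dst conventionally named src + ".gz").
    static void gzip(String src, String dst) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(src));
             OutputStream out =
                 new GZIPOutputStream(Files.newOutputStream(Paths.get(dst)))) {
            in.transferTo(out);   // Java 9+
        }
    }

    // Decompress src back into dst.
    static void gunzip(String src, String dst) throws IOException {
        try (InputStream in =
                 new GZIPInputStream(Files.newInputStream(Paths.get(src)));
             OutputStream out = Files.newOutputStream(Paths.get(dst))) {
            in.transferTo(out);
        }
    }

    public static void main(String[] args) throws IOException {
        gzip("Notepad.jar", "Notepad.jar.gz");
        gunzip("Notepad.jar.gz", "Notepad-restored.jar");
    }
}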

3.4 HTTP Compression

HTTP compression is the technology used to compress contents from a Web

server (also known as an HTTP server). The Web server content may be in the form of

any of the many available MIME types: HTML, plain text, images formats, PDF files,

and more. HTML and image formats are the most widely used MIME formats in a Web

application.

Most images used in Web applications (for example, GIF and JPG) are

already in compressed format and do not compress much further; certainly no discernible


performance is gained by another incremental compression of these files. However, static

or on-the-fly created HTML content contains only plain text and is ideal for compression.

The focus of HTTP compression is to enable the Web site to serve fewer bytes of

data. For this to work effectively, a couple of things are required:

The Web server should compress the data

The browser should decompress the data and display the pages in the usual

manner

This is obvious. Of course, the process of compression and decompression should

not consume a significant amount of time or resources.

So what's the hold-up in this seemingly simple process? The recommendations for

HTTP compression were stipulated by the IETF (Internet Engineering Task Force) while

specifying the HTTP 1.1 protocol. The publicly available gzip

compression format was intended to be the compression algorithm. Popular browsers

have already implemented the decompression feature and were ready to receive the

encoded data (as per the HTTP 1.1 protocol specifications), but HTTP compression on

the Web server side was not implemented as quickly nor in a serious manner.

3.5 Static Compression

If the Web content is pre-generated and requires no server-side dynamic

interaction with other systems, the content can be pre-compressed and placed in the Web

server, with these compressed pages being delivered to the user. Publicly available

compression tools (gzip, Unix compress) can be used to compress the static files.

Static compression, though, is not useful when the content has to be generated

dynamically, such as on e-commerce sites or on sites which are driven by applications

and databases. The better solution is to compress the data on the fly.

3.6 Content and Transfer encoding

The IETF's standard for compressing HTTP contents includes two levels of

encoding: content encoding and transfer encoding. Content encoding applies to methods


of encoding and compression that have been already applied to documents before the

Web user requests them. This is also known as pre-compressing pages or static

compression. This concept never really caught on because of the complex file-maintenance

burden it represents, and few Internet sites use pre-compressed pages.

On the other hand, transfer encoding applies to methods of encoding during the

actual transmission of the data.

In modern practice the difference between content and transfer encoding is

blurred since the pages requested do not exist until after they are requested (they are

created in real time). Therefore the encoding always has to happen in real time as well.

The browsers, taking the cue from IETF recommendations, implemented the

Accept-Encoding feature by 1998-99. This allows browsers to receive and decompress

files compressed using the public algorithms. In this case, the HTTP request header fields

sent from the browser indicate that the browser is capable of receiving encoded

information. When the Web server receives this request, it can

1. Send pre-compressed files as requested. If they are not available, then it can:

2. Compress the requested static files, send the compressed data, and keep the

compressed file in a temporary directory for further requests; or

3. If transfer encoding is implemented, compress the Web server output on the fly.

As I mentioned, pre-compressing files, as well as real-time compression of static

files by the Web server (the first two points above) never caught on because of the

complexities of file maintenance, though some Web servers supported these functions to

an extent.

The feature of compressing Web server dynamic output on the fly wasn't

seriously considered until recently, since its importance is only now being realized. So,

sending dynamically compressed HTTP data over the network has remained a dream

even though many browsers were ready to receive the compressed formats.


Chapter 4

4.0 HTTP’s Support for Compression

HTTP compression has been around since the days of HTTP/1.0. However,

it was not until the last two years or so that support for this feature was added to most

major web browsers and servers.

4.1 HTTP/1.0

In HTTP/1.0 the manner in which the client and server negotiate which

compression format, if any, to use when transferring a resource is not well defined.

Support for compression exists in HTTP/1.0 by way of content coding values in the

Content-Encoding and, to an extent, the Accept-Encoding header fields.

The Content-Encoding entity header was included in the HTTP/1.0

specification as a way for the message sender to indicate what transformation, if any, had

been applied to an entity and hence what decoding must be performed by the receiver in

order to obtain the original resource. Content-Encoding values apply only to end-to-end

encoding.

The Accept-Encoding header was included as part of the Additional

Header Field Definitions in the appendix of HTTP/1.0. Its purpose is presumably to

restrict the content coding that the server can apply to a client’s response; however, this

header field is explained in a single sentence and there is no specification as to how the

server should handle this header.

4.2 HTTP/1.1

HTTP/1.1 extended support for compression by expanding on the content-

coding values and adding transfer-coding values. Much like in HTTP/1.0, content coding

values are specified within the Accept-Encoding and Content-Encoding headers in

HTTP/1.1[6].


The Accept-Encoding header is more thoroughly defined in HTTP/1.1 and

provides a way for the client to indicate to the server which encoding formats it supports.

HTTP/1.1 also specifies how the server should handle the Accept-Encoding field if it is

present in a request [6].

Much like content coding values, transfer coding values, as defined in

HTTP/1.1, provide a way for communicating parties to indicate the type of encoding

transformation that can be, or has been, applied to a resource. The difference is that

transfer coding values are a property of the message and not of the entity. That is,

transfer-coding values can be used to indicate the hop-by-hop encoding that has been

applied to a message [6].

Transfer coding values are specified in HTTP/1.1 by the TE and Transfer-

Encoding header fields. The TE request header field provides a way for the client to

specify which encoding formats it supports. The Transfer-Encoding general-header field

indicates what transformation, if any, has been applied to the message body [6].

To summarize, HTTP/1.1 allows for compression to occur on either an

end-to-end basis, via content-coding values, or on a hop-by-hop basis, via transfer coding

values [6]. In either case, compression is transparent to the end user.

4.3 Content Coding Values

The Internet Assigned Numbers Authority (IANA) was designated in

HTTP/1.1 as a registry for content-coding value tokens. The HTTP/1.1 specification

defined the initial tokens to be: deflate, compress, gzip and identity [6].

The deflate algorithm is based on both Huffman coding and LZ-77

compression [7].

“Compress” was a popular UNIX file compression program. It uses the

LZW algorithm, which is covered by patents held by UNISYS and IBM [6].


Gzip is an open source, patent free compression utility that was designed

to replace “compress”. Gzip currently uses a variation of the LZ-77 algorithm by default,

though it was designed to handle several compression algorithms [7]. Gzip works by

replacing duplicated patterns with a (distance, length) pointer pair. The distance pointer is

limited to the previous 32 kilobytes while the length is limited to 258 bytes [7].

The identity token indicates the use of default encoding, that is, no

transformation at all.
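To make the negotiation over these tokens concrete, the following illustrative sketch picks the best mutually supported content coding from an Accept-Encoding header value, honouring the optional q-values defined in HTTP/1.1 (the parsing is deliberately simplified):

import java.util.List;

public class AcceptEncodingChooser {
    // Pick the supported coding with the highest q-value; q=0 means "never use".
    static String choose(String acceptEncoding, List<String> supported) {
        String best = "identity";   // identity: no transformation at all
        double bestQ = 0.0;
        for (String part : acceptEncoding.split(",")) {
            String[] fields = part.trim().split(";");
            String coding = fields[0].trim();
            double q = 1.0;         // q defaults to 1 when absent
            for (int i = 1; i < fields.length; i++) {
                String param = fields[i].trim();
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2));
                }
            }
            if (q > bestQ && supported.contains(coding)) {
                best = coding;
                bestQ = q;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(choose("gzip, deflate, compress;q=0.9",
                                  List.of("gzip", "deflate")));   // prints "gzip"
    }
}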


Chapter 5

5.0 Approaches to HTTP Compression

5.1 HTTP Compression on Servers

At first glance, the server level seems like a logical place to perform

HTTP compression. After all, servers have already implemented the HTTP protocol and

know how to communicate with clients and handle retransmits, chunking, HTTP/1.0,

HTTP/1.1, and more. But a Web server is not the best-performing piece of equipment in

the network.

Typically, each server operates rather slowly and at a relatively low

capacity. HTTP compression is hard, computationally intensive work, and far too much

work for most servers. Tasking servers to perform HTTP

compression results in even longer response times and still lower throughput,

demonstrating that HTTP compression on the server is not the optimal solution.

5.2 HTTP Compression on Software-based Load Balancers

In software-based load balancers, the TCP/IP stack is optimized for store-and-forward

operations. The device gets a packet in, inspects it as little as possible (often

just the destination address information), and then forwards the unmodified packet to an

origin server. The origin server then communicates with the client over the HTTP

protocol, while the load balancer is simply shuttling packets back and forth between the

origin server and client, without regard to the complex elements of the HTTP protocol.

This store-and-forward design contrasts to a server or cache device, which

must know the complete HTTP protocol and be able to conduct an HTTP dialog with the

client. While a software-based server load balancer may have impressive specifications

for storing and forwarding packets, it does not know how to communicate to clients,

handle re-transmits, chunking, HTTP/1.0, HTTP/1.1, and other functions. Software load

balancer vendors have acknowledged this fundamental incongruity between the two


operations of load balancing and doing full HTTP protocol work. In response, they have

offered standalone caching devices rather than attempting to integrate caching

functionality into the load balancer.

HTTP compression is even more demanding than HTTP caching. In

caching, a device merely needs to store the exact contents delivered to it by the origin

server. Some caches even store the whole packets provided by the server. In contrast,

HTTP compression demands that new data be generated for individual user devices. This

is necessary because not every user device supports compression and not all user devices

that support compression generally support compression in every particular instance.

5.3 HTTP Compression on ASIC-based Load Balancers

ASIC-based load balancers were developed to extend the capacity and

performance of store-and-forward load balancing. In practice, even though these devices

are ASIC-based, they rely on general-purpose CPUs to perform the Layer 7 functionality.

Capacity and performance when performing Layer 7 functions is greatly decreased

compared to Layer 4 load balancing.

ASIC-based load balancers use general-purpose CPUs for Layer 7

functionality because the HTTP/1.1 protocol is more complex than the TCP/IP protocol

and would require an unrealistic number of gates to execute on an ASIC. The ever-

changing demands of browsers and servers, and the potential security vulnerabilities that

require regular modifications, make it impractical to implement HTTP/1.1 protocol in an

ASIC design.

Onboard memory limitations further restrict the amount of Layer 7

information an ASIC-based load balancer can gather and process. For example, some load

balancers support processing only 256 bytes of a cookie, but a cookie can be up to 4,000

bytes long. The fundamental architecture of these devices is designed to process only

header information while ignoring the actual data portion of the packet.


Even if a packet-level compression engine were to be developed, this

design would work with a maximum of 1460 data bytes, which could be compressed at a

compression ratio of 2:1, rather than the 5:1 or up to 10:1 ratios that can be achieved by

processing at the application layer. Furthermore, any compression at the packet level

would simply lead to smaller packets — but not fewer packets. Given TCP’s reliance on

packet acknowledgements (ACKs) and dependence on round-trip times (RTTs), by not

reducing the packet count, RTTs and ACK requirements remain constant. Therefore, the

total download time would be virtually identical whether the packets were filled at 100

percent or at 50 percent. The minimal Layer 7 functionality and intentionally limited

architectural design of ASIC-based load balancers does not provide the platform

necessary for HTTP compression technology.

5.4 HTTP Compression on a Purpose-built HTTP Compression Device

A dedicated HTTP compression device is the best place to perform HTTP

compression. The device must be a full proxy server, able to communicate the HTTP/1.0

and HTTP/1.1 protocol, and have enough resources available to handle the

computationally intensive compression work. The device should also own the connection

to the end user, taking responsibility for resending dropped packets so that the origin

server is completely offloaded.

Some first-generation approaches use caching techniques and attempt to

cache the content to overcome the challenge of compressing content “on the fly.” Other

early efforts cannot overcome the computational latency involved in compressing data,

and thus do not compress for high-speed users. An unfortunate side effect of this

technique is that users coming through proxy servers, such as AOL users, appear to be

high-speed users and, as such, the device will not perform any compression despite the

reality that the end user is connected via a slow link. Thus, in first generation approaches,

slow users do not experience a faster download and sites do not realize bandwidth

savings.

Despite the limitations of these initial attempts, a dedicated solution is still

the best approach. The right solution consists of a core platform that can:


Handle tens or hundreds of thousands of simultaneous, persistent connections,

completely owning both the TCP and HTTP interaction with the client, including

re-transmits

Manage client TCP connections in such a way as to eliminate unnecessary control

packets and further delays caused by TCP’s slow start mechanisms

Deliver equal benefits for static as well as personalized data by avoiding the use

of caching technology

Deliver equal benefits for slow users as well as high-speed users by eliminating

the computational latency of HTTP compression

Request content from multiple origin servers and guarantee even workload across

all servers

Employ high-speed SSL capabilities to provide benefits for all secure transactions

Guarantee user stickiness in a total end-to-end SSL environment

With this platform, HTTP compression can be conducted at wire speed

and origin Web servers can get data out to the HTTP compression device faster, resulting

in a net decrease in latency for the entire transaction. In this way, adding another box in

the network actually decreases every step of the transaction time, starting with the very

first byte of data received by the client. It also becomes possible for an HTTP

compression device to deliver SSL content faster than a Web server can deliver clear text

to both modem and broadband users.


Chapter 6

6.0 Browser Support for Compression

Support for content coding is inherent in most of today’s popular web

browsers. Support for content coding has existed in the following browsers since the

indicated version: Netscape 4.5+ (including Netscape 6), all versions of Mozilla, Internet

Explorer 4.0+, Opera 4.0+ and Lynx 2.8+. However, there have been reports that Internet

Explorer for the Macintosh cannot correctly handle gzip-encoded content. Unfortunately,

support for transfer coding is lacking in most browsers.

In order to verify some of these claims regarding browser support for

content coding a test was conducted using a Java program called BrowserSpy. When

executed, this program binds itself to an open port on the user’s machine. The user can

then issue HTTP requests from a web browser to this program. The program then returns

an HTML file to the browser indicating the HTTP request headers it had received from

the browser.
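We do not have the BrowserSpy source, but the idea can be sketched in a few lines of Java with a raw socket: accept a connection, read the request headers up to the blank line that terminates them, and echo them back in an HTML body:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class HeaderSpy {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                     PrintWriter out =
                         new PrintWriter(client.getOutputStream(), true)) {
                    StringBuilder headers = new StringBuilder();
                    // Request headers end at the first empty line.
                    for (String line; (line = in.readLine()) != null
                            && !line.isEmpty(); ) {
                        headers.append(line).append("<br>\n");
                    }
                    String body = "<html><body>" + headers + "</body></html>";
                    out.print("HTTP/1.0 200 OK\r\n");
                    out.print("Content-Type: text/html\r\n");
                    // length in chars equals bytes here because the body is ASCII
                    out.print("Content-Length: " + body.length() + "\r\n\r\n");
                    out.print(body);
                    out.flush();
                }
            }
        }
    }
}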

For this test we used all of the web browsers that were accessible to us in

order to see exactly what HTTP request headers these browsers were sending. The header

fields and values that is of greatest interest include the HTTP version number and the

Accept-Encoding and TE field.

First, Lynx version 2.8.4dev.16 running under Linux sends:

GET / HTTP/1.0

Accept-Encoding: gzip, compress


Next, Opera, Netscape and Internet Explorer were tested under various versions

of Windows. The Opera 5.11 request header contained the following relevant fields:

GET / HTTP/1.1

Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0

Connection: Keep-Alive, TE

TE: deflate, gzip, chunked, identity, trailers

Internet Explorer 5.01, 5.5 SP2 and 6.0 all issued the same relevant

request information:

GET / HTTP/1.1

Accept-Encoding: gzip, deflate

Finally, the Netscape 4.78 request header contained the following relevant

fields:

GET / HTTP/1.0

Accept-Encoding: gzip

Whereas Netscape 6.2.1 issued:

GET / HTTP/1.1

Accept-Encoding: gzip, deflate, compress;q=0.9

Notice that gzip is the most common content coding value and, in fact, the

only value to appear in the request header of every browser tested. Also, observe that

Lynx and Netscape 4.x only support the HTTP/1.0 protocol. For reasons that will be

explained later, some compression-enabled servers may choose not to send compressed

content to such clients. Finally, note that Opera appears to be the only browser that

supports transfer coding as it is the only one that includes the TE header with each HTTP

request.


Chapter 7

7.0 Client Side Compression Issues

When receiving compressed data in either the gzip or deflate format most

web browsers are capable of performing streaming decompression. That is,

decompression can be performed as each successive packet arrives at the client end rather

than having to wait for the entire resource to be retrieved before performing the

decompression [8].

This is possible in the case of gzip because, as was mentioned above, gzip

performs compression by replacing previously encountered phrases with pointers. Thus,

each successive packet the client receives could only contain pointers to phrases in the

current or previous packets, not future packets.

The CPU time necessary for the client to uncompress data is minimal and

usually takes only a fraction of the time it does to compress the data. Thus the

decompression process adds only a small amount of latency on the client’s end.
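In code terms, streaming decompression is exactly what java.util.zip.GZIPInputStream provides: each read returns whatever the bytes received so far can decode. A hypothetical sketch follows (the URL is a placeholder, and the code assumes the server actually replies with Content-Encoding: gzip):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class StreamingDecode {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
            new URL("http://example.com/").openConnection();   // placeholder URL
        conn.setRequestProperty("Accept-Encoding", "gzip");

        try (InputStream gz = new GZIPInputStream(conn.getInputStream())) {
            byte[] chunk = new byte[4096];
            // Each read decodes as much as the bytes received so far allow;
            // a browser can parse this output while the transfer continues.
            for (int n; (n = gz.read(chunk)) != -1; ) {
                System.out.write(chunk, 0, n);
            }
            System.out.flush();
        }
    }
}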

A test was conducted to verify that some of the major web browsers do in

fact perform streaming decompression. To begin, a single web page consisting of an

extremely large amount of text was constructed. Along with the text, four embedded

images were included. Three of the images were placed near the top of the web page and

the fourth at the bottom. The web page was designed to be as simple as possible, consisting

only of a few basic HTML tags and image references. The page, which was uploaded to a

compression enabled web server, ended up containing over 550 kilobytes of text.

Three popular web browsers, Opera 5.11, Netscape 4.78 and Internet

Explorer 5.01, all running in Windows 2000, were each used to issue an HTTP request

for the web page. Using Ethereal, a packet-sniffing program, an analysis of the packet log

for each of the three browsers was performed. In all three cases, the browsers, after

receiving a few dozen packets containing the compressed HTML file, issued requests for


the first three embedded image files. Clearly, these requests were issued well before the

entire compressed HTML file had been transferred, hence proving that the major

browsers, running in Windows, do indeed perform streaming decompression for content

encoding using gzip.


Chapter 8

8.0 Web Server Support for Compression

The analyses in this report focus exclusively on the two most popular web

servers, namely Apache and IIS. Based on a survey of over 38 million web sites, Netcraft

reported that Apache and IIS comprise over 85% of the hosts on the World Wide Web as

of March 2002. Two alternate web servers, Flash and TUX, were also initially considered

for inclusion in this report, however, based on the information provided in their

respective specification documents, Flash does not appear to support HTTP compression in

any form and TUX can only serve static resources that exist as a precompressed file on

the web server. As a result, neither program was selected for further analysis.

8.1 IIS

Microsoft’s Internet Information Services (IIS) 5.0 web server provides

built-in support for compression; however, it is not enabled by default. The process of

enabling compression is straightforward and involves changing a single configuration

setting in IIS. An IIS server can be configured to compress both static and dynamic

documents. However, compression must be applied to the entire server and cannot be

activated on a directory-by directory basis. IIS has built in support for the gzip and

deflate compression standards and includes a mechanism through which customized

compression filters can be added. Also, only the Server families of the Windows

operating system have compression capability, as the Professional family is intended for

use as a development platform and not as a production server...

In IIS, static content is compressed in two passes. The first time a static

document is requested it is sent uncompressed to the client and also added to a

compression queue where a background process will compress the file and store it in a

temporary directory. Then, on subsequent accesses, the compressed file can be retrieved

from this temporary directory and sent to the client.

Dynamic content, on the other hand, is always compressed on-demand

since IIS does not cache such content.

Thus in IIS the cost of compressing a static document is negligible, as this

cost is incurred only once for each document. On the other hand, since dynamic content

is compressed on demand it imposes a slightly greater overhead on the server.

By default, IIS does not send compressed content to HTTP/1.0 clients.

Additionally, all compressed content is sent by IIS with an Expires header dated January 1, 1997. By always setting the expiration date in the past, proxies will be

forced to validate the object on every request before it can be served to the client.

However, IIS does allow the site administrator to set the Max-Age header for dynamic,

compressed documents. The HTTP/1.1 specification stipulates that the Max-Age header

overrides the Expires header, even if the Expires header is more restrictive [6]. As we

will see in Chapter 9, this scheme may have been designed intentionally so as to allow

HTTP/1.1 clients to get a cached version of a compressed object and at the same time to

prevent HTTP/1.0 clients from getting stale or bad content.

8.2 Apache

Apache does not provide a built-in mechanism for HTTP compression.

However, there is a fairly popular and extensively tested open source module, called

mod_gzip, which provides HTTP compression for Apache.

Enabling compression is fairly straightforward, as mod_gzip can either be

loaded as an external Apache module or compiled directly into the Apache web server.

Since mod_gzip is a standard Apache module, it runs on any platform supported by the

Apache server. Much like IIS, mod_gzip uses the existing content-coding standards described in HTTP/1.1. Unlike IIS, mod_gzip allows compression to be activated on a per-directory basis, thus giving the website administrator greater control over the compression functionality for his/her site(s). Mod_gzip, like IIS, can compress both static and dynamic documents.

In the case of static documents, mod_gzip can first check to see if a

precompressed version of the file exists and, if it does, send this version. Otherwise

mod_gzip will compress the document on-the-fly. In this way, mod_gzip differs from IIS

because mod_gzip can compress a static document on its first access. Also, mod_gzip can

be configured to save the compressed files to a temporary directory. However, if such a

directory is not specified the static document will simply be compressed on every access.

Mod_gzip is purported to support nearly every type of CGI output, including Perl, PHP, ColdFusion, compiled C code, etc. Both IIS and mod_gzip allow the administrator to specify which static and dynamic content should and should not be compressed based on the MIME type or file name extension of the resource. Also, unlike IIS, mod_gzip by default sends compressed content to clients regardless of the HTTP version they support. This can easily be changed so that only HTTP/1.1-compliant clients receive compressed content.

Chapter 9

9.0 Proxy Support for Compression

Currently one of the main problems with HTTP compression is the lack of

proxy cache support. Many proxies cannot handle the Content-Encoding header and

hence simply forward the response to the client without caching the resource. As was

mentioned above, IIS attempts to ensure compressed documents are not served stale by

setting the Expires time in the past.

Caching was handled in HTTP/1.0 by storing and retrieving resources

based on the URI. This, of course, proves inadequate when multiple versions of the same

resource exist - in this case, a compressed and uncompressed representation.

This problem was addressed in HTTP/1.1 with the inclusion of the Vary

response header. A cache could then store both a compressed and uncompressed version

of the same object and use the Vary header to distinguish between the two. The Vary

header is used to indicate which response headers should be analyzed in order to

determine the appropriate variant of the cached resource to return to the client.
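As an illustration of the idea, a cache might build its lookup key along these lines (a simplified sketch, not the code of any particular proxy; header names are normalized to lower case):

def variant_cache_key(method, uri, request_headers, vary):
    # 'vary' is the Vary value stored with the cached response, e.g. "Accept-Encoding"
    key = [method, uri]
    for name in (vary or "").split(","):
        name = name.strip().lower()
        if name:
            key.append(name + "=" + request_headers.get(name, ""))
    return "|".join(key)

With Vary: Accept-Encoding, a request carrying “Accept-Encoding: gzip” and one carrying no such header map to different keys, so the compressed and uncompressed variants are stored and served separately.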

Looking ahead to Figure 11.11, we can see that IIS already sends the Vary header with its compressed content. The most recent version of mod_gzip does not set the Vary header. However, a patch was posted on a mod_gzip discussion board in April 2001 that incorporates support for the Vary header into mod_gzip. A message posted to another discussion board in September 2001 stated that mod_gzip had been ported to Apache 2.0 and includes support for the Vary header.

Based on a quick search of the Web, the latest stable release of Squid, version 2.4, was the only proxy cache that appeared to support the Vary header. Since such support was only recently added, it is likely that many proxy caches are not yet running this version and hence cannot cache negotiated content. The full potential of HTTP compression therefore cannot be realized until more proxies are able to cache compressed content.

Chapter 10

10.0 Related Work

Mogul et al. quantified the potential benefits of delta encoding and data compression for HTTP by analyzing live traces from Digital Equipment Corporation (DEC) and an AT&T Research Lab. The traces were filtered in an attempt to remove

requests for precompressed content; for example, references to GIF, JPEG and MPEG

files. The authors then estimated the time and byte savings that could have been achieved

had the HTTP responses to the clients been delta encoded and/or compressed. The

authors determined that in the case of the DEC trace, of the 2465 MB of data analyzed,

965 MB, or approximately 39%, could have been saved had the content been gzip

compressed. For the AT&T trace, 1054 MB, or approximately 17%, of the total 6216 MB

of data could have been saved. Furthermore, retrieval times could have been reduced by 22% and 14% in the DEC and AT&T traces, respectively. The authors remarked that they

felt their results demonstrated a significant potential improvement in response size and

response delay as a result of delta encoding and compression.

In [8], the authors attempted to determine the performance benefits of

HTTP compression by simulating a realistic workload environment. This was done by

setting up a web server and replicating the CNN site on this machine. The authors then

accessed the replicated CNN main page and ten subsections within this page (i.e. World

News, Weather, etc), emptying the cache before each test. Analysis of the total time to

load all of these pages showed that when accessing the site on a 28.8 kbps modem, gzip

content coding resulted in 30% faster page loads. They also experienced 35% faster page

loads when using a 14.4 kbps modem.

Finally, in [4] the authors attempted to determine the performance effects

of HTTP/1.1. Their tests included an analysis of the benefits of HTTP compression via

the deflate content coding. The authors created a test web site that combined data from the

Netscape and Microsoft home pages into a page called “MicroScape”. The HTML for

this new page totaled 42 KB with 42 inline GIF images totaling 125 KB. Three different

network environments were used to perform the test: a Local Area Network (high

bandwidth, low latency), a Wide Area Network (high bandwidth, high latency) and a 28.8

kbps modem (low bandwidth, high latency). The test involved measuring the time

required for the client to retrieve the MicroScape web page from the server, parse it and, if

necessary, decompress the HTML file on-the-fly and retrieve the 42 inline images. The

results showed significant improvements for those clients on low bandwidth and/or high

latency connections. In fact, looking at the results from all of the test environments,

compression reduced the total number of packets transferred by 16% and the download

time for the first-time retrieval of the page by 12%.

Chapter 11

11.0 Experiments

We will now analyze the results from a number of tests that were

performed in order to determine the potential benefits and drawbacks of HTTP

compression.

11.1 Compression Ratio Measurements

The first test that was conducted was designed to provide a basic idea of

the compression ratio that could be achieved by compressing some of the more popular

sites on the Web. The objective was to determine how many fewer bytes would need to

be transferred across the Internet if web pages were sent to the client in a compressed

form. To determine this we first found a web page that ranks the Top 99 sites on the Web

based on the number of unique visitors. Although the rankings had not been updated since March 2001, most of the indicated sites are still fairly popular. Besides, the intent

was not to find a definitive list of the most popular sites but rather to get a general idea of

some of the more highly visited ones. A freely available program called wget was used

to retrieve pages from the Web and Perl scripts were written to parse these files and

extract relevant information.

The steps involved in carrying out this test consisted of first fetching the

web page containing the list of Top 99 web sites. This HTML file was then parsed in

order to extract all of the URLs for the Top 99 sites. A pre-existing CGI program on the

Web that allows a user to submit a URL for analysis was then utilized. The program

determines whether or not the indicated site utilizes gzip compression and, if not, how many bytes could have been saved had the site implemented compression. These byte savings are calculated for all 10 levels of gzip encoding. Level 0 corresponds to no gzip

encoding. Level 1 encoding uses the least aggressive form of phrase matching but is also

the fastest, as it uses the least amount of CPU time when compared to the other levels,

excluding level 0. Alternatively, level 9 encoding performs the most aggressive form of

pattern matching but also takes the longest, utilizing the most CPU resources.
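The effect of the level setting is easy to reproduce with Python’s zlib, which implements the same deflate algorithm; page.html here stands in for a local copy of a site’s HTML and is an assumed input, not part of the original test harness:

import zlib

def savings(data: bytes, level: int) -> float:
    # fraction of bytes saved by deflate at the given level (0-9)
    return 1 - len(zlib.compress(data, level)) / len(data)

html = open("page.html", "rb").read()
for level in (1, 9):
    print("level", level, "->", round(100 * savings(html, level), 1), "% saved")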

A Perl script was employed to parse the HTML file returned by this CGI

program, with all of the relevant information being dumped to a file that could be easily

imported into a spreadsheet. Unfortunately, the CGI program can only determine the byte

savings for the HTML file. While this information is useful, it does not give the user an

idea of the compression ratio for the entire page - including the images and other

embedded resources. Therefore, we set our web browser to go through a proxy cache and

subsequently retrieved each of the top 99 web pages. We then used the trace log from the

proxy to determine the total size of all of the web pages. After filtering out the web sites that could not be handled by the CGI program, wget or the Perl scripts, we were left with 77

URLs. One of the problems encountered by the CGI program and wget involved the

handling of server redirection replies. Also, a number of the URLs referenced sites that

either no longer existed or were inaccessible at the time the tests were run.

The results of this experiment were encouraging. First, if we consider the

savings for the HTML document alone, the average compression ratio for level 1 gzip

encoding turns out to be 74% and for level 9 this figure is 78%. This clearly shows that

HTML files are prime candidates for compression. Next, we factor into the equation the size of all of the embedded resources for each web page. We will refer to this as the total compression ratio and define it as the fraction by which encoding shrinks the whole page: one minus the ratio of the size of the encoded page (the embedded resources plus the compressed HTML) to the size of the original page (the embedded resources plus the uncompressed HTML).
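In code form, the definition reads as follows (a sketch with illustrative sizes, not measured data):

def total_compression_ratio(html_size, compressed_html_size, embedded_size):
    # savings over the whole page, counting the incompressible embedded resources
    original = html_size + embedded_size
    encoded = compressed_html_size + embedded_size
    return 1 - encoded / original

# e.g. a 50 KB HTML file compressed to 11 KB alongside 100 KB of images
print(total_compression_ratio(50000, 11000, 100000))  # -> 0.26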

The results show that the average total compression ratio comes to about

27% for level 1 encoding and 29% for level 9 encoding. This still represents a significant

amount of savings, especially in the case where the content is being served to a modem

user.

Figure 11.1 – The total page size (including HTML and embedded resources) for

the top ten web sites.

Figure 11.1 shows the difference in the total number of bytes transferred for an uncompressed web page versus those transferred with level 1 and level 9 gzip compression. Note the small difference in compression ratios between level 1 and level 9 encoding.

Table 11.1 shows a comparison of the total compression ratios for the top

ten web sites. You can see that there is only a slight difference in the total compression ratios for levels 1 and 9 of gzip encoding. Thus, if a site administrator were to decide to

enable gzip compression on a web server but wanted to devote the least amount of CPU

cycles as possible to the compression process, he/she could set the encoding to level 1

and still maintain favorable byte savings.

Ultimately, what these results show is that, on average, a compression-enabled server could send approximately 27% fewer bytes yet still transmit the exact same web page to supporting clients. Despite this potential savings, out of all of the URLs examined, www.excite.com was the only site that supported gzip content coding. We believe this is indicative of HTTP compression’s current popularity, or lack thereof.

URL                  Level 1    Level 9
www.yahoo.com        36.353     38.222
www.aol.com          26.436     25.697
www.msn.com          35.465     37.624
www.microsoft.com    38.850     40.189
www.passport.com     25.193     26.544
www.geocities.com    43.316     45.129
www.ebay.com         29.030     30.446
www.lycos.com        40.170     42.058
www.amazon.com       31.334     32.755
www.angelfire.com    34.537     36.427

Table 11.1 – Comparison of the total compression ratios (in percent) of level 1 and level 9 gzip encoding for the indicated URLs.

11.2 Web Server Performance Test

The next set of tests that were conducted involved gathering performance

statistics from both the IIS and Apache web servers.

The test environment consisted of a client and server PC both with 10

Mbps Ethernet Network Interface Cards connected together directly via a crossover

cable. The first computer, a Pentium II 266 MHz PC with 128 MB of RAM, was set up as a dual-boot server running Windows 2000 Server with IIS 5.0 and Red Hat Linux 7.1 (kernel 2.4.2-2) with Apache 1.3.22 and mod_gzip 1.3.19.1a. Compression was enabled for static and dynamic content in both IIS and Apache. All of the compression caching options were disabled for both web servers. The client machine was a Pentium II 266 MHz PC with 190 MB of RAM running Red Hat Linux 7.1 (kernel 2.4.2-2).

Two programs, httperf and Autobench, were used to perform the

benchmarking tests. Httperf actually generates the HTTP workloads and measures server

performance. Autobench is simply a Perl script that acts as a wrapper around httperf, automating its execution and extracting and formatting its output in such a way that it can easily be imported into a spreadsheet program. Httperf was one of the few benchmarking programs that fully support the HTTP/1.1 protocol. Httperf was helpful for these tests as it allows the insertion of additional HTTP request header fields via a command line option. Thus, an httperf user who wanted to receive compressed content could easily append the “Accept-Encoding: gzip” header to each HTTP request.
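For example, a single run against the test server might look something like the following (a sketch; consult the httperf manual for the exact flags, and note that httperf requires the trailing \n in the header value):

httperf --server 192.168.0.106 --uri /google.html \
    --num-conns 1000 --num-calls 1 \
    --add-header 'Accept-Encoding: gzip\n'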

The HTML code from the main page of four major web sites was downloaded for use in the next set of tests. The file sizes for these sites (Google, Yahoo, AOL and eBay) were 2528, 20508, 29898 and 50187 bytes, respectively. The test files were real-world documents and were not padded with repeated strings in order to improve compression ratios. Also note that these tests only involved retrieval of the HTML file for each page and not any of the embedded images. Since images are not further compressible they would provide no indication of the performance effects of compression on a web server.

The tests were designed to determine the maximum throughput of the two

servers by issuing a series of requests for compressed and uncompressed documents.

Using Autobench we were able to start by issuing a low rate of requests per second to the server and then increase this rate by a specified step until a specified maximum rate of attempted requests per second was reached. An example of the command line options used to run some of the tests is as follows:

./autobench_gzip_on --low_rate 10 --high_rate 150 --rate_step 10 \
    --single_host --host1 192.168.0.106 --num_conn 1000 --num_call 1 \
    --output_fmt csv --quiet --timeout 10 --uri1 /google.html \
    --file google_compr.csv

These command line options indicate that initially requests for the

google.html file will be issued at a rate of 10 requests per second. Requests will continue

at this rate until 1000 connections have been made. For these tests each connection makes

only one call; in other words, no persistent connections were used. The rate of requests is

then increased by the rate step, which is 10. So, now 20 requests will be attempted per

second until 1000 connections have been made. This will continue until a rate of 150

requests per second is attempted.

Keep in mind when looking at the results that the client may not be

capable of issuing 150 requests per second to the server. Thus a distinction is made

between the desired and actual number of requests per second.

11.2.1 Apache Performance Benchmark

We will first take a look at the results of the tests when run against the

Apache server. Figures 11.2 through 11.5 present graphs of some of the results from the respective test cases. Referring to the graphs, we can see that for each test case a saturation point was reached. This saturation point reflects the maximum number of requests the server could handle for the given resource. Looking at the graphs, the saturation point can be recognized as the point at which the server’s average response time increases significantly, often jumping from a few milliseconds up to hundreds

or thousands of milliseconds. The response time corresponds to the time between when

the client sends the first byte of the request and receives the first byte of the reply.
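One simple way to locate this point programmatically from the Autobench output is to scan for the first large jump in mean response time. This is a rough heuristic of our own, assuming parallel lists of demanded request rates and measured mean response times:

def saturation_point(rates, mean_resp_ms, jump=10.0):
    # return the last demanded rate before mean response time jumps sharply
    for i in range(1, len(rates)):
        if mean_resp_ms[i] > jump * mean_resp_ms[i - 1]:
            return rates[i - 1]
    return rates[-1]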

So, if we were to look at Yahoo (Figure 11.3), for instance, we would notice that the server reaches its saturation point at about the time when the client issues 36 requests per second for uncompressed content. This figure falls slightly, to about 33 requests per second, when compressed content is requested.

Refer to Table 11.2 for a comparison of the estimated saturation points for

each test case. These estimates were obtained by calculating the average number of

connections per second handled by the server using data available from the benchmarking

results. One interesting thing to note from the graphs is that, aside from the Google page,

the server maintained almost the same average reply rate for a page up until the saturation

point, regardless of whether the content was being served compressed or uncompressed.

Figure 11.2 – Benchmarking results for the retrieval of the Google HTML file from the Apache Server.

Figure 11.3 – Benchmarking results for the retrieval of the Yahoo HTML file from the Apache Server.

Figure 11.4 – Benchmarking results for the retrieval of the AOL HTML file from the Apache

Server.

Figure 11.5 – Benchmarking results for the retrieval of the eBay HTML file from the Apache

Server.

After the saturation point the numbers diverge slightly, as is noticeable in

the graphs. What this means is that the server was able to serve almost the same number

of requests per second for both compressed and uncompressed documents.

The Google test case shows it is beneficial to impose a limit on the minimum file size necessary to compress a document: the Google file is so small that the per-request cost of compressing it roughly halved the server’s saturation point while saving comparatively few bytes. Both mod_gzip and IIS allow the site administrator to set a lower and upper bound on the size of compressible resources. Thus, if the size of a resource falls outside of these bounds it will be sent uncompressed.

For these tests all such bounds were disabled, which caused all resources

to be compressed regardless of their size.

When calculating results we will only look at those cases where the

demanded number of requests per second is less than or equal to the saturation point. Not

surprisingly, compression greatly reduced the network bandwidth required for server

replies. The factor by which network bandwidth was reduced roughly corresponds to the

compression ratio of the document.

Web Site    Uncompressed    Compressed
Google      215             105
Yahoo       36              33
AOL         27              25
eBay        16              15

Table 11.2 – Estimated saturation points (in requests per second) for the Apache web server based on repeated client requests for the indicated document.

Next we will examine the performance cost that on-the-fly compression imposed on the server. To do so we will compare the server’s average response time in serving the compressed and uncompressed documents. The findings are summarized in Table 11.3. The results are not particularly surprising. We can see that the size of a static

document does not affect response time when it is requested in an uncompressed form. In

the case of compression, however, we can see that as the file size of the resource increases, so too does the average response time. We would certainly expect such results because larger documents take longer to compress. Keep in

mind that the time to compress a document will likely be far smaller for a faster, more

powerful computer. The machine running as the web server for these tests has a modest

amount of computing power, especially when compared to the speed of today’s average

web server.

Web Site    Uncompressed    Compressed
Google      3.2             10.2
Yahoo       3.3             27.5
AOL         3.4             34.7
eBay        3.4             51.4

Table 11.3 – Average response time (in milliseconds) for the Apache server to respond to requests for compressed and uncompressed static documents.

11.2.2 IIS Performance Benchmark

We attempted to repeat the same tests as above using the IIS server.

However, our test had to be slightly altered since, unlike mod_gzip and Apache, IIS does

not compress static documents on the fly. So, to overcome this we simply changed the

extension of all of the .html files to .asp. Active Server Pages (ASP) are generated dynamically, and hence IIS will not cache them.

Figures 11.6 through 11.9 present graphs of some of the results from each of the respective test cases. Note from the graphs that, as with Apache, the actual request rate and the average reply rate are identical for the compressed and uncompressed formats up until the saturation point. The average response times, however, are particularly interesting. Recall that with Apache the average response time was significantly greater when serving compressed content. In IIS, on the other hand, the average response time when serving compressed dynamic content is the same as, and sometimes even a few milliseconds less than, when serving uncompressed content. At

first we thought this was happening because the content was somehow being cached, even though the IIS documentation states that dynamic content is never cached. We spent hours searching for every possible compression and cache setting, but to no avail. Finally,

we stumbled across what we believe to be the answer, which is the use of chunked

transfer coding. By default IIS does not chunk the response for an uncompressed

document that is dynamically generated via ASP. Figure 11.10 shows the HTTP response

header that was generated from the IIS server based on a client request for an

uncompressed ASP resource. In this case, IIS generates the entire page dynamically

before returning it to the client.

Figure 11.6 – Benchmarking results for the retrieval of the Google HTML file from the IIS

Server.

Figure 11.7 – Benchmarking results for the retrieval of the Yahoo HTML file from the IIS Server.

Figure 11.8 – Benchmarking results for the retrieval of the AOL HTML file from the IIS Server.

Figure 11.9 – Benchmarking results for the retrieval of the eBay HTML file from the IIS Server.

However, as we can see in Figure 11.11, when a client requests an ASP file

in compressed form, IIS appears to send the compressed content to the client in chunks.

Thus, the server can immediately compress and send chunks of data as they are dynamically generated, without having to wait for the entire document to be produced first. This approach appears to significantly reduce response time latency, and it is how IIS achieves average response times that are identical to

the response times for uncompressed content.
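The behavior IIS appears to exhibit can be sketched in a few lines of Python; this is our reading of the mechanism, not IIS source code. Z_SYNC_FLUSH forces out all pending compressed bytes so every chunk can be decompressed as soon as it arrives:

import zlib

def gzip_chunked(fragments):
    comp = zlib.compressobj(6, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # gzip wrapper
    for frag in fragments:
        piece = comp.compress(frag) + comp.flush(zlib.Z_SYNC_FLUSH)
        if piece:
            yield piece  # ship immediately as one HTTP chunk
    yield comp.flush()  # emit the gzip trailer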

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Tue, 11 Dec 2001 18:01:09 GMT
Connection: close
Content-Length: 13506
Content-Type: text/html
Set-Cookie: ASPSESSIONIDGGQQQJHK=EJKABPBDEDGHBPLPLMDPEAHA; path=/
Cache-control: private

Figure 11.10 – HTTP response header from an IIS server after a request for an uncompressed .asp resource.

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Tue, 11 Dec 2001 18:01:01 GMT
Connection: close
Content-Type: text/html
Set-Cookie: ASPSESSIONIDGGQQQJHK=DJKABPBDPEAMPJHEBNIHCBHB; path=/
Content-Encoding: gzip
Transfer-Encoding: chunked
Expires: Wed, 01 Jan 1997 12:00:00 GMT
Cache-Control: private, max-age=86400
Vary: Accept-Encoding

Figure 11.11 – HTTP response header from an IIS server after a request for a compressed .asp resource.

The graphs show that the lines for the average response times follow nearly identical paths for the Google, Yahoo! and AOL test cases, though there is a slight divergence in the AOL test as the saturation point is neared. In the eBay test, however, the server achieves much quicker average response times. Again, this can be explained by

the server’s use of chunked transfers. Looking at the graphs we can see that for the

Google, Yahoo! and AOL test cases, the saturation point is roughly equivalent regardless

of whether the content is served in a compressed form or not.

Based on these results we conclude that the use of chunked transfer coding

for compressing dynamic content provides significant performance benefits.

Chapter 12

12.0 Summary / Suggestion

So, should a web server use HTTP compression? Well, that’s not such an

easy question to answer. There are a number of things that must first be considered. For

instance, if the server generates a large amount of dynamic content, one must consider whether the server can handle the additional processing costs of on-the-fly compression

while still maintaining acceptable performance. Thus it must be determined whether the

price of a few extra CPU cycles per request is an acceptable trade-off for reduced

network bandwidth. Also, compression currently comes at the price of cacheability.

Much Internet content is already compressed, such as GIF and JPEG

images and streaming audio and video. However, a large portion of the Internet is text-based and is currently being transferred uncompressed. As we have seen, HTTP compression is an underutilized feature on the web today, despite the fact that support for compression is built into most modern web browsers and servers.

Furthermore, the fact that most browsers running in the Windows environment perform

streaming decompression of gzipped content is beneficial because a client receiving a

compressed HTML file can decompress the file as new packets of data arrive rather than

having to wait for the entire object to be retrieved. Our tests indicated that byte reductions of roughly 27% are possible for the average web site, demonstrating the practicality of HTTP compression. However, in order for HTTP compression to gain popularity a few things

need to occur.

First, the design of a new patent-free algorithm that is tailored specifically towards compressing web documents, such as HTML and CSS, could be helpful. After all, gzip and deflate are simply general-purpose compression schemes and do not take into account the content type of the input stream. Therefore, an algorithm that, for instance, has a predefined library of common HTML tags could provide a much higher compression ratio than gzip or deflate.
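The idea can be approximated today with deflate’s preset-dictionary feature, as the sketch below shows. The dictionary contents here are hypothetical, and both endpoints must agree on them in advance, which is precisely what a standardized web-document coding would provide:

import zlib

# hypothetical dictionary seeded with common HTML boilerplate
HTML_DICT = b'<!DOCTYPE html><html><head><title></title></head><body><div class="<a href="</div></body></html>'

def compress_html(html: bytes) -> bytes:
    comp = zlib.compressobj(level=9, zdict=HTML_DICT)
    return comp.compress(html) + comp.flush()

def decompress_html(blob: bytes) -> bytes:
    decomp = zlib.decompressobj(zdict=HTML_DICT)
    return decomp.decompress(blob) + decomp.flush()

Tags drawn from the dictionary compress well even on their first occurrence in a document, which is where a general-purpose scheme like gzip gains nothing.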

Secondly, we believe expanded support for compressed transfer coding is

essential. Currently, support for this feature is scarce in most browsers, proxies and

servers. Based on the browser test in Section 6 we can see that as of Netscape 6.2 and

Internet Explorer 6.0, neither browser fully supports transfer coding. Opera 5 was the

only browser tested that indicated its ability to handle transfer coding by issuing a TE

header field along with each HTTP request. As far as proxies are concerned, Squid

appears only to support compressed content coding, but not transfer coding, in its current

version. According to the Squid development project web site a beta version of a patch

was developed to extend Squid to handle transfer coding. However, the patch has not

been updated recently and the status of this particular project is listed as idle and in need

of developers. Also, in our research we found no evidence of support for compressed

transfer coding in either Apache or IIS.

The most important issue regarding HTTP compression, in our opinion,

is the need for expanded proxy support. As of now compression comes at the price of

uncacheability in most instances. As we saw, outside of the latest version of Squid, little

proxy support exists for the Vary header. So, even though a given resource may be

compressible by a large factor, this effectiveness is negated if the server has to constantly

retransmit this compressed document to clients who should have otherwise been served

by a proxy cache.

Until these issues can be resolved HTTP compression will likely continue

to be overlooked as a way to reduce user perceived latency and improve Web

performance.

References

[1] http://www.http-compression.com
[2] http://www.google.com
[3] H.F. Nielsen. The Effect of HTML Compression on a LAN. http://www.w3.org/Protocols/HTTP/Performance/Compression/LAN.html
[4] http://www.w3.org/Protocols/HTTP/Performance/Pipeline.html
[5] P. Peterlin. Why Compression? Why Gzip? http://sizif.mf.uni-lj.si/gzip.html
[6] http://www.ietf.org/rfc/rfc2616.txt
[7] http://www.gzip.org/zlib/feldspar.html
[8] Mozilla. performance: HTTP Compression. http://www.mozilla.org/projects/apache/gzip
