

Characterizing Files in the Modern Gnutella Network:

A Measurement Study

Shanyu Zhao, Daniel Stutzbach, Reza Rejaie

University of Oregon

SPIE Multimedia Computing and Networking 2006 (MMCN’06), 18-19 January 2006, San Jose, California, USA



Measurement study of modern Gnutella system

Conduct static, topological and dynamic analysis

Help to improve design and evaluations of P2P file-sharing applications


Previous studies

Focus on a small population
Are more than three years old
Do not examine the dynamics of file characteristics over time, or the correlation between the overlay topology and file distribution


Why Gnutella

One of the top three P2P systems (eDonkey2K, FastTrack, Gnutella)
Gnutella has a Browse-Host extension to extract the list of shared files from peers
One of the most studied P2P systems, so results can be compared and contrasted with previous studies


Original Gnutella

A new node (Node A) joins the system
Node A connects to some node (Node B) found through a pre-existing list, a particular website, IRC, etc.
Node B sends its list of working nodes to Node A
Node A connects to the provided nodes until a certain threshold is reached
During a search, Node A sends requests to its connected nodes, which in turn forward the requests


Original Gnutella

Nodes reply to the request directly or indirectly, depending on whether a firewall is present
Node A downloads file pieces from one or more nodes that responded positively
Unlike Napster, Gnutella is decentralized and uses flood-based searches
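
The flood-based search described on these two slides can be sketched as a breadth-first traversal of the overlay. This is only an illustration: the overlay dictionary, the `holdings` map, the function name and the time-to-live (TTL) values are all assumptions made for the example, not details taken from the slides.

```python
from collections import deque

def flood_search(overlay, holdings, start, wanted, ttl=3):
    """Flood a query from `start`; return the set of peers that hold `wanted`."""
    hits = set()
    seen = {start}                      # peers that already received the query
    frontier = deque([(start, ttl)])
    while frontier:
        peer, hops_left = frontier.popleft()
        if peer != start and wanted in holdings.get(peer, set()):
            hits.add(peer)
        if hops_left == 0:
            continue                    # TTL exhausted: do not forward further
        for neighbor in overlay.get(peer, []):
            if neighbor not in seen:    # drop duplicate copies of the query
                seen.add(neighbor)
                frontier.append((neighbor, hops_left - 1))
    return hits

# Hypothetical five-node overlay and file placement.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
holdings = {"D": {"song.mp3"}, "E": {"song.mp3"}}
```

With this toy overlay, a query from Node A reaches D only when the TTL is at least 2, and reaches E only when it is at least 3, which shows how the TTL bounds the cost of flooding.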


Modern Gnutella

In contrast to the original unstructured overlay topology, most modern Gnutella clients adopt a two-tier overlay structure
Ultrapeers and leaf peers (the majority)
Legacy peers (do not implement ultrapeer features)



Measurement methodology

Problems of general crawlers
  Slow; snapshots are distorted; population is inflated
Previous studies
  Partial snapshots, or periodic probes of a fixed group
  Significance is doubtful
Goal of this work
  Capture the entire population (?)
  Short period


Measurement methodology

Topology crawl
  List of neighboring nodes
Content crawl
  List of available files of each node
  Need more



Parallel P2P crawler
  Orders of magnitude faster than previous crawlers (?)
  Master-slave architecture
  Each slave crawls hundreds of peers and the master coordinates multiple slaves
  Increases the degree of concurrency
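
The master-slave division of work can be sketched as follows. This is a minimal single-machine sketch, not the crawler used in the study: threads stand in for slave processes, and `FAKE_OVERLAY`, `slave_crawl`, `master_crawl`, `num_slaves` and `batch_size` are hypothetical names introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical overlay standing in for live peers; a real slave would open
# network connections instead of reading this dict.
FAKE_OVERLAY = {
    "p1": ["p2", "p3"],
    "p2": ["p1", "p4"],
    "p3": ["p1"],
    "p4": ["p2", "p5"],
    "p5": ["p4"],
}

def slave_crawl(batch):
    """Slave: contact every peer in its batch and report the neighbor lists."""
    return {peer: FAKE_OVERLAY.get(peer, []) for peer in batch}

def master_crawl(seeds, num_slaves=3, batch_size=2):
    """Master: hand out batches of unvisited peers until none remain."""
    topology = {}
    seen = set(seeds)
    pending = list(seeds)
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        while pending:
            batches = [pending[i:i + batch_size]
                       for i in range(0, len(pending), batch_size)]
            pending = []
            # Batches are crawled concurrently; results come back in order.
            for result in pool.map(slave_crawl, batches):
                for peer, neighbors in result.items():
                    topology[peer] = neighbors
                    for neighbor in neighbors:
                        if neighbor not in seen:
                            seen.add(neighbor)
                            pending.append(neighbor)
    return topology
```

Only the master tracks which peers have been seen, so slaves stay stateless and more of them can be added to raise concurrency.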



Using six off-the-shelf 1 GHz GNU/Linux boxes, a crawl takes 15 min + 5.5 hr + 15 min ~ 6 hours

Each content crawl produces a 10 GB log file containing file names and content hashes



Three measurement periods; within each period, take snapshots every day

6/8/2005-6/18/2005, 8/23/2005-9/9/2005 and 10/11/2005-10/21/2005

Examine both short and long timescales




Sources of unreachable nodes

Firewall
Severe network congestion
Peer departed
Peer does not support the Browse Host protocol
Ultrapeers: departure
Leaf peers: departure and firewall
Contact 20% of peers (~half a million)



Low-bandwidth TCP connections
  Some crawls do not complete before the timeout threshold, because the data is sent at an extremely low rate
File identity
  File name is not a reliable file identifier, so this work uses the content hash
Post-processing
  More than 100 million distinct files
  Divide into 7 segments randomly, trim files with fewer than 10 copies in a segment, then combine the trimmed segments back into one
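
The post-processing step can be sketched as below. The slide does not say what unit is assigned to a segment, so this sketch assumes that each (peer, content hash) record is placed in a random segment; `trim_rare_files` and its parameters are hypothetical names for illustration.

```python
import random
from collections import Counter

def trim_rare_files(records, num_segments=7, min_copies=10, seed=0):
    """records: iterable of (peer_id, content_hash) pairs, one per shared copy."""
    rng = random.Random(seed)
    segments = [[] for _ in range(num_segments)]
    for record in records:
        # Assumption: records are assigned to segments uniformly at random.
        segments[rng.randrange(num_segments)].append(record)

    kept = []
    for segment in segments:
        copies = Counter(content_hash for _, content_hash in segment)
        # Drop every copy of a file that has too few copies in this segment.
        kept.extend(r for r in segment if copies[r[1]] >= min_copies)
    return kept

# Hypothetical data: one widely shared file and one rare file.
records = [(f"peer{i}", "popular") for i in range(700)] + [("peerX", "rare")]
kept = trim_rare_files(records)
```

Processing one segment at a time keeps the per-pass counting table small, at the cost of also discarding a moderately popular file whose copies happen to fall below the threshold in some segment.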


Static analysis

Ratio of free riders
Degree of resource sharing among cooperative peers
File popularity distribution
File type analysis


Ratio of free riders

Free riders drop; ratio for ultrapeers is lower; long-lived peers slightly higher; # files not strongly correlated
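
A free-rider ratio can be computed from a content-crawl snapshot as sketched below. The sketch assumes the common definition of a free rider as a peer that shares no files; the function name and the sample snapshot are made up for illustration.

```python
def free_rider_ratio(files_per_peer):
    """Fraction of crawled peers that share no files at all."""
    if not files_per_peer:
        return 0.0
    free_riders = sum(1 for count in files_per_peer.values() if count == 0)
    return free_riders / len(files_per_peer)

# Hypothetical snapshot: peer id -> number of shared files.
snapshot = {"p1": 0, "p2": 12, "p3": 0, "p4": 340, "p5": 1}
ratio = free_rider_ratio(snapshot)
```

Running the same function separately on ultrapeers, leaf peers or long-lived peers would give the per-group ratios that the slide compares.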


Degree of resource sharing among cooperative peers

Distribution of # peers sharing x files – power-law distribution
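
One rough way to check a power-law shape is to fit a straight line to the histogram on log-log axes. The sketch below uses ordinary least squares, which is a crude estimator chosen here for brevity and not necessarily what the study used; the function name and the synthetic data are assumptions.

```python
import math
from collections import Counter

def loglog_slope(files_per_peer):
    """Least-squares slope of log(#peers) against log(#files shared).

    Needs at least two distinct positive file counts in the input.
    """
    histogram = Counter(n for n in files_per_peer if n > 0)
    xs = [math.log(n) for n in histogram]
    ys = [math.log(histogram[n]) for n in histogram]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic data: the number of peers sharing x files is 10000 / x**2.
sample = []
for x, peers in [(1, 10000), (2, 2500), (5, 400), (10, 100)]:
    sample.extend([x] * peers)
slope = loglog_slope(sample)
```

For the synthetic data the fitted slope is close to -2, the exponent used to generate it; on real crawl data the fit would only be approximate.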


Degree of resource sharing among cooperative peers

Distribution of contributed disk space – power-law distribution


Degree of resource sharing among cooperative peers

Correlation is not as strong as in previous studies
Discernible line with slope 3.7 MB/file, which is the typical size of an MP3 audio file


File popularity distribution


File type analysis


File type analysis

         Previous studies             Current studies
Music    67.2% files, 79.2% bytes     67% files, 40% bytes
Video    2.1% files, 19.1% bytes      6% files, 52.5% bytes


Topological analysis

Per-file perspective – figure a & b
Per-peer perspective – figure c


Topological analysis

Churn (dynamics of peer participation) is the dominant factor
  Peers depart
  Peers join
  Leaf peers become ultrapeers
Rapid change in the overlay topology prevents formation of topological clustering
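
A per-peer check for topological clustering can be sketched by comparing the file sets of peers that are a given number of hops apart. The slides do not name the similarity measure, so Jaccard similarity is an assumption here, as are all function names and the toy data.

```python
from collections import deque

def hop_distances(overlay, source, max_hops):
    """Breadth-first search returning {peer: hops} within max_hops of source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        peer = queue.popleft()
        if dist[peer] == max_hops:
            continue
        for neighbor in overlay.get(peer, []):
            if neighbor not in dist:
                dist[neighbor] = dist[peer] + 1
                queue.append(neighbor)
    return dist

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mean_similarity_by_hops(overlay, files, max_hops=3):
    """Average file-set similarity of peer pairs, grouped by hop distance."""
    totals = {h: [0.0, 0] for h in range(1, max_hops + 1)}
    for source in overlay:
        for peer, hops in hop_distances(overlay, source, max_hops).items():
            if hops >= 1:
                totals[hops][0] += jaccard(files.get(source, set()),
                                           files.get(peer, set()))
                totals[hops][1] += 1
    return {h: s / n for h, (s, n) in totals.items() if n}

# Hypothetical chain overlay a-b-c-d with small file sets.
overlay = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
files = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4}, "d": {1, 2}}
by_hops = mean_similarity_by_hops(overlay, files)
```

If files were clustered in the overlay, the average for one hop would be clearly higher than for two or three hops; similar values at every distance are consistent with a random distribution.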


Dynamics analysis

Variations in shared files by individual peers
Variations in popularity of individual files
Trends in popularity variations


Variations in shared files by individual peers


Variations in popularity of individual files

Focus on top 100 and top 1000 files
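
One simple way to look at popularity variation is to ask how much of the top-N list survives from one snapshot to the next. This measure is an assumption chosen for illustration, not necessarily the one plotted in the study, and the function names and sample snapshots are hypothetical.

```python
from collections import Counter

def top_n(snapshot, n):
    """snapshot: iterable of content hashes, one entry per shared copy."""
    return [h for h, _ in Counter(snapshot).most_common(n)]

def top_n_persistence(earlier, later, n):
    """Fraction of the earlier top-n files that are still in the later top-n."""
    before = set(top_n(earlier, n))
    after = set(top_n(later, n))
    return len(before & after) / len(before) if before else 0.0

# Hypothetical snapshots taken on two different days.
day1 = ["a"] * 5 + ["b"] * 4 + ["c"] * 3 + ["d"] * 1
day2 = ["a"] * 6 + ["c"] * 5 + ["d"] * 4 + ["b"] * 1
persistence = top_n_persistence(day1, day2, 3)
```

In this toy data two of the three most popular files on the first day are still in the top three on the second day; the same function works with n set to 100 or 1000.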


Trends in popularity variations

Track top 10 files across several days (fig a & b)
Over several months (fig c)



Use a parallel crawler to obtain snapshots of peer connectivity and available files
Conduct three types of analysis
Understand the distribution, correlation and dynamics of available files


Summary of findings

Free riding has dropped significantly
# shared files and contributed storage space by individual peers follow a power-law distribution
  Most peers contribute little disk space (<100MB) while a small # of peers contribute very large space (50-100GB)
Popularity of individual files follows a Zipf distribution
  A small # of files are extremely popular but the majority of files are very unpopular


Summary of findings

The most popular file type is MP3 (2/3 of all files, 1/3 of all bytes)

Popularity of video files and the space they occupy have tripled over the past few years

# video files < 1/10 of audio files but occupy 25% more bytes

Multimedia files account for 93% of bytes and 73% of files


Summary of findings

Files are randomly distributed; no strong correlation between the available files at peers that are one, two or three hops apart in overlay topology

Files shared by individual peers change slowly over a timescale of days; more popular files experience larger variations in popularity
