Post on 21-Dec-2015
1
Characterizing Files in the Modern Gnutella Network:
A Measurement Study
Shanyu Zhao, Daniel Stutzbach, Reza Rejaie
University of Oregon
SPIE Multimedia Computing and Networking 2006 (MMCN'06), 18-19 January 2006, San Jose, California, USA
2
Outline
Measurement study of modern Gnutella system
Conduct static, topological and dynamic analysis
Help improve the design and evaluation of P2P file-sharing applications
3
Previous studies
Focus on a small population
Are more than three years old
Do not examine the dynamics of file characteristics over time or the correlation between the overlay topology and file distribution
4
Why Gnutella
Among the top three (eDonkey2K, FastTrack, Gnutella)
Gnutella has a Browse-Host extension to extract the list of shared files from peers
One of the most studied P2P systems; compare and contrast with previous studies
5
Original Gnutella
A new node joins the system (Node A)
Node A connects to some node (Node B) found via a pre-existing list, a particular website, IRC, etc.
Node B sends its working nodes to Node A
Node A connects to the provided nodes up to a certain threshold
During search, Node A sends requests to its connected nodes, which in turn forward the requests
6
Original Gnutella
Nodes reply to the request directly or indirectly, depending on whether a firewall is present
Node A downloads file pieces from one or more nodes that replied positively
Unlike Napster, Gnutella is decentralized and uses flood-based searches
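The flood-based search described above can be sketched roughly as follows. This is a minimal in-memory simulation, not the Gnutella wire protocol: the `overlay` and `files` dictionaries, the `flood_search` function, the hop limit `ttl`, and the duplicate suppression via `seen` are all assumptions made for illustration and are not taken from the slides.

```python
from collections import deque

def flood_search(overlay, files, start, wanted, ttl=3):
    """Return the nodes holding `wanted` that lie within `ttl` hops of `start`.

    overlay: dict mapping node -> list of neighboring nodes (hypothetical).
    files:   dict mapping node -> set of file names it shares (hypothetical).
    """
    hits = set()
    seen = {start}                      # nodes that already received the query
    queue = deque([(start, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if node != start and wanted in files.get(node, ()):
            hits.add(node)              # this node would send a reply
        if remaining == 0:
            continue                    # hop limit reached, do not forward
        for neighbor in overlay.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return hits

overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"],
           "D": ["B", "E"], "E": ["D"]}
files = {"D": {"song.mp3"}, "E": {"song.mp3"}}
print(flood_search(overlay, files, "A", "song.mp3", ttl=2))
```

With a hop limit of 2 the query reaches D but not E; raising the limit to 3 reaches both, which illustrates why flooding cost grows quickly with the search radius.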
7
Modern Gnutella
In contrast to the original unstructured overlay topology, most modern Gnutella clients adopt a two-tier overlay structure
Ultrapeers and leaf peers (majority)
Legacy peers (do not implement the ultrapeer feature)
8
Measurement methodology
Problems of general crawlers
  Slow, distorted, inflate population
Previous studies
  Partial snapshot, periodic probe of a fixed group
  Significance is doubted
Goal of this work
  Capture entire population (?)
  Short period
9
Measurement methodology
Topology crawl
  List of neighboring nodes
Content crawl
  List of available files of each node
  Need more
10
Cruiser
Parallel P2P crawler
Orders of magnitude faster than previous crawlers (?)
Master-slave architecture
  Slave crawls hundreds of peers and master coordinates multiple slaves
  Increase degree of concurrency
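A rough sketch of the coordinator/worker idea, assuming a single process with a thread pool rather than Cruiser's actual multi-machine design. The `overlay` dictionary, `fetch_neighbors`, and `crawl` are hypothetical names; a real crawler would replace `fetch_neighbors` with network I/O against each peer.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def fetch_neighbors(overlay, peer):
    """Stand-in for contacting one peer and asking for its neighbor list."""
    return overlay.get(peer, [])

def crawl(overlay, seeds, workers=4):
    """Coordinator: hand newly discovered peers to a pool of workers."""
    discovered = set(seeds)
    edges = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(fetch_neighbors, overlay, p): p for p in seeds}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                peer = pending.pop(future)
                neighbors = future.result()
                edges[peer] = list(neighbors)
                for n in neighbors:
                    if n not in discovered:   # schedule each peer only once
                        discovered.add(n)
                        pending[pool.submit(fetch_neighbors, overlay, n)] = n
    return edges

overlay = {"u1": ["u2", "l1"], "u2": ["u1", "l2"], "l1": ["u1"], "l2": ["u2"]}
snapshot = crawl(overlay, ["u1"])
print(sorted(snapshot))
```

Many peers are contacted at once instead of one after another, so the snapshot covers a shorter time window and is less distorted by peers joining and leaving during the crawl.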
11
Cruiser
Using 6 off-the-shelf 1GHz GNU/Linux boxes, a crawl takes 15min + 5.5hr + 15min ≈ 6 hours
Each content crawl produces a 10GB log file containing file names and content hashes
12
Dataset
Three measurement periods; within each period, take a snapshot every day
6/8/2005-6/18/2005, 8/23/2005-9/9/2005 and 10/11/2005-10/21/2005
Examine both short and long timescales
14
Sources of unreachable nodes
Firewall
Severe network congestion
Peer departed
No support for the Browse Host protocol
Ultrapeers: departure
Leaf peers: departure and firewall
Contact 20% of peers (~half a million)
15
Problems
Low-bandwidth TCP connection
  Some crawls do not complete within the timeout threshold, as the data is sent at an extremely low rate
File identity
  File name is not a reliable file identifier, so this work uses the content hash
Post-processing
  More than 100 million distinct files
  Divide into 7 segments randomly, trim files with fewer than 10 copies in a segment, combine the trimmed segments back into one
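The post-processing step can be read as the sketch below. This is one possible interpretation, not the authors' code: it assumes the log is a sequence of `(peer_id, content_hash)` records, that records (rather than peers or hashes) are assigned to segments at random, and that trimming keeps only hashes with at least `min_copies` copies inside a segment. The function name `trim_rare_files` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter

def trim_rare_files(records, num_segments=7, min_copies=10, seed=0):
    """Split records into random segments, drop rare files, merge the rest.

    records: iterable of (peer_id, content_hash) pairs (assumed log format).
    Returns a Counter mapping content_hash -> number of surviving copies.
    """
    rng = random.Random(seed)           # seeded so a run can be repeated
    segments = [[] for _ in range(num_segments)]
    for record in records:
        segments[rng.randrange(num_segments)].append(record)

    combined = Counter()
    for segment in segments:
        # Files are identified by content hash, not by file name.
        counts = Counter(content_hash for _, content_hash in segment)
        for content_hash, copies in counts.items():
            if copies >= min_copies:    # trim files that are rare in this segment
                combined[content_hash] += copies
    return combined

records = [("p1", "h1"), ("p2", "h1"), ("p3", "h2")]
print(trim_rare_files(records, num_segments=1, min_copies=2))
```

Processing one segment at a time keeps the per-segment counter small, which is presumably the motivation for segmenting a log with more than 100 million distinct files.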
16
Static analysis
Ratio of free riders
Degree of resource sharing among cooperative peers
File popularity distribution
File type analysis
17
Ratio of free riders
Ratio of free riders has dropped
Ratio for ultrapeers is lower
Ratio for long-lived peers is slightly higher
# files is not strongly correlated
18
Degree of resource sharing among cooperative peers
Distribution of # peers sharing x files – power-law distribution
19
Degree of resource sharing among cooperative peers
Distribution of contributed disk space – power-law distribution
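One common way to check for a power law is to fit a straight line to the distribution on log-log axes. The sketch below does this with an ordinary least-squares fit; the fitting method and the synthetic data are assumptions for illustration and are not the procedure or measurements from the study. A least-squares fit on log-log data is only a rough estimator of the exponent.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x).

    For y = c * x**(-a) the slope is -a, so a straight line on
    log-log axes is the usual visual signature of a power law.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Synthetic example: number of peers sharing x files, following x**-2.
xs = [1, 10, 100, 1000]
ys = [1000 * x ** -2 for x in xs]
print(round(loglog_slope(xs, ys), 3))
```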
20
Degree of resource sharing among cooperative peers
Correlation is not as strong as in previous studies
Discernible line with slope 3.7MB/file, which is the typical size of an MP3 audio file
23
File type analysis
          Previous studies    Current study
Music     67.2% files         67% files
          79.2% bytes         40% bytes
Video     2.1% files          6% files
          19.1% bytes         52.5% bytes
25
Topological analysis
Churn (dynamics of peer participation) is the dominant factor
  Depart
  Join
  Leaf peers become ultrapeers
Rapid change in overlay topology prevents formation of topological clustering
26
Dynamics analysis
Variations in shared files by individual peers
Variations in popularity of individual files
Trends in popularity variations
29
Trends in popularity variations
Track top 10 files across several days (fig a & b)
Over several months (fig c)
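Tracking the most popular files across snapshots could look like the sketch below. It assumes each snapshot is a mapping from content hash to number of copies and that popularity rank is defined by copy count; the names `top_n` and `rank_history` and the sample data are hypothetical.

```python
from collections import Counter

def top_n(snapshot, n=10):
    """Content hashes of the n files with the most copies in one snapshot."""
    return [h for h, _ in Counter(snapshot).most_common(n)]

def rank_history(snapshots, n=10):
    """Follow the first snapshot's top-n files through all later snapshots.

    Returns a dict mapping content hash -> list of ranks (1 = most popular),
    with None where the file does not appear in a snapshot.
    """
    tracked = top_n(snapshots[0], n)
    history = {h: [] for h in tracked}
    for snapshot in snapshots:
        ordered = [h for h, _ in Counter(snapshot).most_common()]
        position = {h: i + 1 for i, h in enumerate(ordered)}
        for h in tracked:
            history[h].append(position.get(h))
    return history

day1 = {"a": 50, "b": 30, "c": 10}
day2 = {"a": 20, "b": 40, "d": 30}
print(rank_history([day1, day2], n=2))
```

Plotting each list in the result against the snapshot date gives one line per file, similar in spirit to the figures referenced on this slide.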
30
Conclusion
Use parallel crawl to obtain snapshots of peer connectivity and available files
Conduct three types of analysis
Understand the distribution, correlation and dynamics of available files
31
Summary of findings
Free riding has dropped significantly
# shared files and contributed storage space by individual peers follow a power-law distribution
  most peers contribute little disk space (<100MB) while a small # of peers contribute very large space (50-100GB)
Popularity of individual files follows a Zipf distribution
  a small # of files are extremely popular but the majority of files are very unpopular
32
Summary of findings
Most popular file type is MP3 (2/3 of all files, 1/3 of all bytes)
Popularity of and space occupied by video files have tripled over the past few years
# video files < 1/10 of audio files but occupy 25% more bytes
93% of bytes and 73% of files are multimedia files