Unveiling Core Network-Wide Communication Patterns through Application Traffic Activity Graph Decomposition
Yu Jin, Esam Sharafuddin, Zhi-Li Zhang
SIGMETRICS / Performance 2009
2009. 7. 8
Seokshin Son
Methodology for network graph analysis
• Decompose the graph (extracting community structures or dense subgraphs)
• Interpretation and persistence of the subgraphs (identifying non-random substructures)
• Understanding the formation of network graphs and associated applications
[Figure: decomposition of a graph into subgraphs (interpretable? persistent?), with three steps: 1. Characterizing the graph, 2. Explaining the graph formation, 3. Applications]
Outline
• Application traffic activity graph (TAG)
• Decomposing application traffic activity graphs
• Interpretation of TAG subgraphs
• Temporal properties of TAG subgraphs
• Applications
• Summary and General remarks
Application traffic activity graphs (TAGs)
• Host-to-host communication involves various types of traffic.
• We study snapshots of network traffic to capture the temporal correlations.
• We detrend the data based on ports (applications) and create TAGs.
• Inside hosts are (likely) service requesters and outside hosts are service providers.
• TAGs are bipartite and the associated adjacency matrices are binary.
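A minimal sketch of how a TAG for one application could be built from flow records, assuming each record is an (inside host, outside host, outside port) triple. The record layout, the sample addresses, and the function name `build_tag` are hypothetical; only the port numbers come from the slide.

```python
# Hypothetical flow records: (inside_host, outside_host, outside_port).
flows = [
    ("10.0.0.1", "198.51.100.7", 80),
    ("10.0.0.2", "198.51.100.7", 443),
    ("10.0.0.1", "203.0.113.9", 80),
    ("10.0.0.3", "192.0.2.25", 25),
]

# Port-to-application mapping, using the ports named on the slide.
APP_PORTS = {"HTTP": {80, 443}, "Email": {25, 993}, "Gnutella": {6346, 6348}}


def build_tag(flows, ports):
    """Return (inside_hosts, outside_hosts, A) for one application.

    A is a binary adjacency matrix (list of lists): rows are inside
    hosts, columns are outside hosts.
    """
    edges = {(src, dst) for src, dst, port in flows if port in ports}
    inside = sorted({s for s, _ in edges})
    outside = sorted({d for _, d in edges})
    row = {h: i for i, h in enumerate(inside)}
    col = {h: j for j, h in enumerate(outside)}
    A = [[0] * len(outside) for _ in inside]
    for s, d in edges:
        A[row[s]][col[d]] = 1  # binary: repeated flows collapse to one edge
    return inside, outside, A


inside, outside, A = build_tag(flows, APP_PORTS["HTTP"])
```

Because inside and outside hosts index different axes, the graph is bipartite by construction.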
[Figure: UMN (inside) hosts and Internet (outside) hosts, labeled T; traffic split by port: HTTP 80/443, Email 25/993, Gnutella 6346/6348]
Application traffic activity graphs (TAGs) and evolution
[Figure panels: HTTP, DNS, AOL IM, Email; each labeled "1K to 3K"]
Characteristics of TAGs
• We observe differences in basic statistics, such as graph density and average in/out degree.
• All TAGs contain a giant connected component (GCC), which accounts for more than 85% of all the edges.
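The GCC edge share can be computed with a union-find pass over the edge list. The sketch below is illustrative only: the function name and the toy edge list are made up, and "largest" is taken here to mean the component holding the most edges.

```python
def gcc_edge_share(edges):
    """Fraction of edges that fall inside the largest connected component."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Tag each endpoint so inside and outside host names never collide.
    for u, v in edges:
        ru, rv = find(("in", u)), find(("out", v))
        if ru != rv:
            parent[ru] = rv

    # Count edges per component, keyed by the component root.
    counts = {}
    for u, _ in edges:
        root = find(("in", u))
        counts[root] = counts.get(root, 0) + 1
    return max(counts.values()) / len(edges) if edges else 0.0


# Toy TAG: one component with three edges, one with a single edge.
edges = [("a", "x"), ("a", "y"), ("b", "y"), ("c", "z")]
share = gcc_edge_share(edges)
```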
[Figure: GCC size (y-axis, 80% to 100%) for HTTP, Email, AOL Msgr, BitTorrent, DNS, eMule, Gnutella, MSN Msgr]
Dense subgraphs in TAGs
• Block structures in the adjacency matrices indicate dense subgraphs in TAGs
• Rotating (reordering) the rows and columns of the corresponding adjacency matrices reveals these blocks, for example:
1 1 1 0 0 1 0
1 1 1 0 0 0 0
1 1 1 0 0 0 0
0 0 0 1 1 1 1
0 0 0 1 0 1 1
[Figure panels: HTTP, Email, AOL IM, BitTorrent, DNS]
Extracting dense subgraphs
• Extracting dense subgraphs can be formulated as a co-clustering problem: cluster hosts into inside host groups and outside host groups, then extract the pairs of groups connected by more edges (higher density).
• This co-clustering problem can be solved by the tri-nonnegative matrix factorization (tNMF) algorithm, which minimizes the squared approximation error ||A - R H C||_F^2 over nonnegative R, H and C.
• We use a hard clustering setting, assigning each host to only one inside/outside group.
Tri-nonnegative matrix factorization
A ≈ R × H × C
• A: adjacency matrix associated with the TAG
• R: row group membership indicator matrix, e.g.
1 0 0 … 0
1 0 0 … 0
0 1 0 … 0
0 1 0 … 0
…
0 0 0 … 1
• H: matrix proportional to the subgraph density
• C: column group membership indicator matrix, e.g.
1 1 1 0 0 … 0
0 0 0 1 1 … 0
. . .
0 0 0 0 0 … 1
• We identify dense subgraphs based on the large entries in H.
• R is m-by-k, H is k-by-r and C is r-by-n; hence the product is a low-rank approximation of A, with rank at most min(k, r).
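To make the factorization concrete, here is a rough sketch that fits A ≈ R × H × C with standard multiplicative updates for the squared-error objective. This is not necessarily the authors' algorithm: the update rules, the random initialization, and the argmax-based hard assignment are assumptions chosen for brevity, and the function name `tnmf` is hypothetical.

```python
import numpy as np


def tnmf(A, k, r, iters=300, seed=0, eps=1e-9):
    """Approximate A (m x n) by R (m x k) @ H (k x r) @ C (r x n), all >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    R = rng.random((m, k)) + eps
    H = rng.random((k, r)) + eps
    C = rng.random((r, n)) + eps
    for _ in range(iters):
        # Each step is a multiplicative update with the other factors fixed.
        HC = H @ C
        R *= (A @ HC.T) / (R @ HC @ HC.T + eps)
        RH = R @ H
        C *= (RH.T @ A) / (RH.T @ RH @ C + eps)
        H *= (R.T @ A @ C.T) / (R.T @ R @ H @ C @ C.T + eps)
    return R, H, C


# Toy 5 x 7 binary adjacency matrix with two dense blocks.
A = np.array([
    [1, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 1],
], dtype=float)

R, H, C = tnmf(A, k=2, r=2)
row_group = R.argmax(axis=1)  # hard assignment of inside hosts (simplified)
col_group = C.argmax(axis=0)  # hard assignment of outside hosts (simplified)
err = np.linalg.norm(A - R @ H @ C) ** 2
```

The argmax step is a simplification: the scale of each column of R (or row of C) can be absorbed into H, so a more careful implementation would normalize the factors or constrain them before assigning groups.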
Subgraph prototypes
• Recall inside (UMN) hosts are (likely) service requesters and outside hosts are service providers.
• Based on the number of inside/outside hosts in each subgraph, we propose three prototypes.
[Figure panels: In-star, Bi-mesh, Out-star]
• One inside client accesses multiple outside servers
• Multiple inside clients access one outside server
• Multiple inside clients interact with many outside servers
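Since the prototypes are defined by the number of inside and outside hosts in a subgraph, the labeling can be sketched as a simple rule. The function name and the descriptive return strings are hypothetical; they follow the three descriptions above rather than binding the slide's in-star/out-star names to a side.

```python
def classify_subgraph(n_inside, n_outside):
    """Label a dense subgraph by its inside/outside host counts."""
    if n_inside < 1 or n_outside < 1:
        raise ValueError("a subgraph needs at least one host on each side")
    if n_inside == 1 and n_outside > 1:
        return "star centered on one inside client"
    if n_outside == 1 and n_inside > 1:
        return "star centered on one outside server"
    if n_inside > 1 and n_outside > 1:
        return "bi-mesh"
    return "single edge"  # one host on each side


labels = [classify_subgraph(1, 5), classify_subgraph(8, 1), classify_subgraph(4, 6)]
```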
Characterizing TAGs with subgraph prototypes
• Different application TAGs contain different types of subgraphs
• We can distinguish and characterize applications based on the subgraph components
• What do these subgraphs mean?
[Figure panels: HTTP, Email, AOL IM, BitTorrent, DNS]
Interpreting HTTP bi-mesh structures
• Most star structures are due to popular servers or active clients
• We can explain more than 80% of the HTTP bi-meshes identified in one day
• Server correlation driven
  – Server farms
    • Lycos, Yahoo, Google
  – Correlated service providers
    • CDN: LLNW, Akamai, SAVVIS, Level3
    • Advertising providers: DoubleClick, etc.
• User interests driven
  – News: WashingtonPost, New York Times, Cnet
  – Media: ImageShack, casalemedia, tl4s2
  – Online shopping: Ebay, Costco, Walmart
  – Social network: Facebook, MySpace
How are the dense subgraphs connected?
(A) Randomly connected stars
(B) Tree: client/server dual role
(C) Pool
(D) Correlated pool
[Figure labels: AS1, AS2]
Evolution of HTTP TAGs
• The extracted dense subgraphs are assumed to be non-random (persistent)
• However, each subgraph may evolve over time with hosts leaving/joining the subgraph
• A “best effort” similarity metric:
[Figure: two subgraphs compared (=?) at three levels of granularity: IP, domain name, AS]
• Two subgraphs are considered similar if the percentage of similar hosts (at a certain level) is greater than η
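One way to read this metric is sketched below. The host records, the field names ("ip", "domain", "as"), and the choice to measure overlap relative to the smaller subgraph are all assumptions; the slides only state that the percentage of similar hosts at a given level is compared against η.

```python
def similar(hosts_a, hosts_b, level, eta=0.5):
    """True if the share of hosts in common at `level` exceeds eta."""
    keys_a = {h[level] for h in hosts_a}
    keys_b = {h[level] for h in hosts_b}
    if not keys_a or not keys_b:
        return False
    # Assumed normalization: overlap relative to the smaller key set.
    overlap = len(keys_a & keys_b) / min(len(keys_a), len(keys_b))
    return overlap > eta


# Hypothetical outside hosts of the same subgraph at two points in time.
g1 = [
    {"ip": "198.51.100.1", "domain": "example.com", "as": "AS64500"},
    {"ip": "198.51.100.2", "domain": "example.com", "as": "AS64500"},
]
g2 = [
    {"ip": "198.51.100.3", "domain": "example.com", "as": "AS64500"},
    {"ip": "203.0.113.4", "domain": "example.net", "as": "AS64501"},
]

by_level = {level: similar(g1, g2, level) for level in ("ip", "domain", "as")}
```

In this toy case the two snapshots share no IP addresses but do share a domain and an AS, so they match only at the coarser levels.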
Evolution of HTTP TAGs (2)
• TAGs are temporally stable at the domain/AS level
• TAGs are transient at the host level
Evolution of HTTP TAGs (3)
• Subgraphs last from a few hours to a whole day
• Subgraphs become more similar during the daytime
Application: Identifying unknown traffic
• Similarity of UDP port 4000 TAG subgraphs with subgraphs from messenger (chat) traffic
[Figure panels: AOL Messenger, Yahoo! Messenger, UDP port 4000; annotation: Best match]
Application: Storm worm botnet analysis
[Figure panels: Storm, Typical P2P]
The Storm worm botnet graph contains many bi-mesh structures and differs significantly from those of typical P2P applications
[Figure labels: Bots query for supernodes; Bots acquire commands from supernodes; Spam campaign]
Summary and General Remarks
• Many network research questions can be formulated as “graph analysis” problems.
• These graphs are mostly not random and have latent structures.
• We propose a co-clustering (tri-nonnegative matrix factorization) based method for decomposing such graphs and revealing their community structures (dense subgraphs).
• The obtained subgraphs are meaningful and persistent, which helps explain the formation of complicated communication graphs.
• We demonstrate various applications based on the decomposition results.
Appendix
Characteristics of app. TAGs
• These statistics show differences between the various application TAGs
• They do not explain the formation of TAGs
tNMF algorithm details
• Iterative optimization algorithm
• Group density matrix derivation
Practical issues of tNMF
• Selection of rank and density
  – Linear search of appropriate ranks
  – Edge coverage converges after properly chosen rank and density
• With the above rank selection method, we achieve stable subgraph decomposition results
Practical issues of tNMF (2)
• Low convergence rate and local minima
  – We use SVD as an initialization approach for tNMF.
  – Multiple runs are applied when necessary and the result with the lowest relative square error (RSE) is selected.
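The two points above can be sketched as follows. The slides do not spell out the RSE formula or how SVD output is turned into nonnegative factors, so both are assumptions here: RSE is taken as ||A - R H C||_F^2 / ||A||_F^2, and the initialization uses absolute values of the leading singular vectors. The function names are hypothetical.

```python
import numpy as np


def relative_square_error(A, R, H, C):
    """Assumed definition: ||A - R H C||_F^2 / ||A||_F^2."""
    return np.linalg.norm(A - R @ H @ C) ** 2 / np.linalg.norm(A) ** 2


def svd_init(A, k, r, floor=1e-3):
    """Nonnegative starting factors from the top singular vectors (heuristic)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    R0 = np.abs(U[:, :k])       # m x k
    C0 = np.abs(Vt[:r, :])      # r x n
    H0 = np.full((k, r), floor)  # small positive floor so updates can move it
    d = min(k, r)
    H0[:d, :d] += np.diag(s[:d])  # leading singular values on the diagonal
    return R0, H0, C0


A = np.array([
    [1, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 1],
], dtype=float)

# Candidate factorizations, standing in for the results of several runs.
candidates = [svd_init(A, 2, 2), (np.eye(5), A, np.eye(7))]
best = min(candidates, key=lambda f: relative_square_error(A, *f))
```

Selecting with `min` over the RSE mirrors the "lowest RSE wins" rule; in practice each candidate would be the output of a full tNMF run rather than a bare initialization.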
Number of subgraphs over time