Data Mining with MapReduce: Graph and Tensor Algorithms with Applications
Charalampos (Babis) E. Tsourakakis
Modern Data Mining Algorithms 1
Data Analysis Project 20 Apr. 2010
Outline:
• Introduction
• PART I: Graphs (Triangles, Diameter)
• PART II: Tensors (2 Heads method, MACH)
• Conclusion / Research Directions
Leonhard Euler (1707-1783)
Seven Bridges of Königsberg Eulerian Paths
Internet Map [lumeta.com]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Friendship Network [Moody ’01]
Examples of data matrices: m customers x n products (Market Basket Analysis); m documents x n words (Documents-Terms, with terms such as "freedom", "dance", "prison").
[Figure: four time series from the Intel Berkeley lab, value vs. time (min): Temperature, Light, Voltage, Humidity]
Data is modeled as a tensor, i.e., a multidimensional matrix, of size T x (#sensors) x (#types of measurements).
Any multi-dimensional time series can be modeled in this way.
Functional Magnetic Resonance Imaging (fMRI): voxels x subjects x trials x task conditions x timeticks
PART I: Triangles
Applications:
• Spam detection
• Exponential random graphs
• Clustering coefficients & transitivity ratio
• Uncovering the hidden thematic structure of the Web
• Link recommendation
Friends of friends tend to become friends themselves
Theorem 1
Δ(G) = (1/6) · Σi λi³, where Δ(G) is the number of triangles in the graph G(V,E) and λi are the eigenvalues of the adjacency matrix AG.
Theorem 2
Δ(i) = (1/2) · Σj λj³ ui,j², where Δ(i) is the number of Δs vertex i participates in, uj is the j-th eigenvector of AG, and ui,j is the i-th entry of uj.
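Both theorems can be checked directly on a toy graph. A minimal numpy sketch (the complete graph K4 is a hypothetical illustrative input, not from the slides):

```python
import numpy as np

# Toy input: adjacency matrix of K4, the complete graph on 4 vertices.
# K4 has C(4,3) = 4 triangles, and every vertex lies in 3 of them.
A = np.ones((4, 4)) - np.eye(4)

lam, U = np.linalg.eigh(A)          # eigenvalues and eigenvectors of A

# Theorem 1: total triangles = (1/6) * sum of cubed eigenvalues.
total = np.sum(lam ** 3) / 6.0

# Theorem 2: triangles at node i = (1/2) * sum_j lam_j^3 * u_{i,j}^2.
per_node = 0.5 * (U ** 2) @ (lam ** 3)
```

For K4 the spectrum is {3, -1, -1, -1}, so the sum of cubes is 27 - 3 = 24 and total = 4, as expected.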
Very important for us because: Few eigenvalues contribute a lot! Cubes amplify this even more! Lanczos converges fast due to large spectral gaps!
[Figure: eigenvalue spectrum of the Political Blogs network. It is almost symmetric around 0, so the sum of cubes of the small eigenvalues almost cancels out: omit them and keep only the top 3!]
Nodes    Edges    Description
~75K     ~405K    Epinions network (social network)
~404K    ~2.1M    Flickr (social network)
~27K     ~341K    Arxiv Hep-Th (co-authorship network)
~1K      ~17K     Political blogs (information network)
~13K     ~148K    Reuters news (information network)
~3M      35M      Wikipedia 2006-Sep-05 (web graph)
~3.15M   ~37M     Wikipedia 2006-Nov-04 (web graph)
~13.5K   ~37.5K   AS Oregon (internet graph)
~23.5K   ~47.5K   CAIDA AS 2004 to 2008, means over 151 timestamps (internet graph)
[Figure: scatter plot of triangles node i participates in vs. triangles node i participates in according to our estimation]
Kronecker graphs are a model for generating graphs that mimic properties of real-world networks. The basic operation is the Kronecker product ([Leskovec et al.]).
Initiator graph, adjacency matrix A[0]:
0 1 1
1 0 1
1 1 0
Taking the Kronecker product with the initiator repeatedly gives adjacency matrices A[1], A[2], and, after k repetitions, A[k].
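The construction above is a one-liner with numpy (a sketch; the initiator K3 is the triangle from the slide, and the triangle count via trace(A³)/6 is the standard identity used throughout this part):

```python
import numpy as np

# Initiator graph: adjacency matrix A[0] (the triangle K3, 1 triangle).
A0 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)

def kron_iterate(A, k):
    """A[k]: apply the Kronecker product with the initiator k times."""
    B = A.copy()
    for _ in range(k):
        B = np.kron(B, A)
    return B

A1 = kron_iterate(A0, 1)                 # 9 x 9 adjacency matrix
tri_A0 = np.trace(A0 @ A0 @ A0) / 6      # trace(A^3)/6 = number of triangles
tri_A1 = np.trace(A1 @ A1 @ A1) / 6      # grows multiplicatively under kron
```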
Theorem [KroneckerTRC]: Let B = A[k] be the k-th Kronecker product and Δ(GA), Δ(GB) the total number of triangles in GA, GB. Then the following equality holds: Δ(GB) = 6^k · Δ(GA)^(k+1), since trace(B³) = trace(A³)^(k+1).
Observation 1: When the matrix is symmetric, the eigendecomposition coincides with the SVD: eigenvectors = left singular vectors, and λi = σi sgn(ui·vi), where λi, σi are the i-th eigenvalue and singular value, and ui, vi the left and right singular vectors, respectively.
Observation 2: We care about a low rank approximation of A
Frieze, Kannan, Vempala
Idea: sample c columns to obtain Ã, and find Ãk instead of the optimal rank-k approximation Ak. Recover the signs from the left and right singular vectors. Then use EigenTriangle!
Results: with c = 100 and k = 6 on Flickr, EigenTriangle achieves 95.6% accuracy and the approximation 95.46%.
(1) Pick column i with probability proportional to its squared length.
(2) Use the sampled matrix to obtain a good low rank approximation to the original one.
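A sketch of the sampling step (numpy; the rescaling of each kept column by 1/sqrt(c·pj) is the standard Frieze-Kannan-Vempala normalization, an assumption not spelled out on the slide, and the Gaussian test matrix is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_columns(A, c):
    """Pick c columns with probability proportional to squared length,
    rescaling survivors so the sampled matrix is unbiased in expectation."""
    p = (A ** 2).sum(axis=0)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=c, p=p)
    return A[:, idx] / np.sqrt(c * p[idx])

A = rng.standard_normal((50, 200))
S = sample_columns(A, c=40)
```

With these probabilities every rescaled column has squared norm ||A||_F²/c, so the Frobenius norm of the sample equals that of A exactly; only the spectrum is approximated.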
Approximate a given graph G by a sparse graph H, such that H is close to G under a certain notion of closeness.
Examples: cut-preserving sparsifiers (Benczur-Karger), spectral sparsifiers (Spielman-Teng).
What about Triangle Sparsifiers?
For each edge (k, m) of G(V,E), toss a coin: on TAILS the edge (k, m) "dies", otherwise it survives in G'(V,E') (each edge is kept with probability p). Now count the triangles T in G' and let T/p³ be the estimate of t, the true number of Δs.
Main theoretical result: under mild conditions on the triangle density (at least a nearly linear number of triangles), our estimate is strongly concentrated around the true number of triangles!
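A pure-Python sketch of the sparsify-and-count estimator (K20 is a hypothetical dense toy input, and the estimate is averaged over a few independent runs here only to make the demo stable):

```python
import random
from itertools import combinations

random.seed(0)

def count_triangles(edges):
    """Count triangles by intersecting the neighborhoods of each edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    t = sum(len(adj[u] & adj[v]) for u, v in edges)
    return t // 3                        # each triangle is seen from 3 edges

def sparsify_estimate(edges, p):
    """Keep each edge with probability p; return T / p^3 as the estimate."""
    kept = [e for e in edges if random.random() < p]
    return count_triangles(kept) / p ** 3

# Toy input: the complete graph K20 has C(20,3) = 1140 triangles.
edges = list(combinations(range(20), 2))
true_t = count_triangles(edges)
est = sum(sparsify_estimate(edges, p=0.5) for _ in range(20)) / 20
```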
PART I: Diameter
Milgram 1967
The "small world experiment":
• Pick 300 people at random.
• Ask them to get a letter to a stockbroker in Boston by passing it through friends.
How many steps does it take? Only 6! Typically the diameter of a real-world network is surprisingly small!
Does the same observation hold on the Yahoo Web Graph (2002), where #nodes=1.4B and #edges=6.83B?
Assume we have a multiset M = {x1, ..., xm} and we want to count its number of distinct elements n. How can we do this using a small amount of space?
Flajolet & G. Nigel Martin
Hash function h: U → [0, ..., 2^L − 1]. Write y = Σk bit(y,k)·2^k and let ρ(y) = the minimum k such that bit(y,k) = 1 (ρ(y) = L if y = 0). Keep a BITMASK[0..L]. Hash every x in M and find ρ(h(x)); if BITMASK[ρ(h(x))] is 0, flip it!
How will the bitmask look at the end?
• Positions i << log(n): all 1s (1111111111111).
• Positions i >> log(n): all 0s (0000000000).
• Positions i ≈ log(n): mixed (010110...). This fringe region gives us the information.
Flajolet and Martin prove that for the random variable R = leftmost 0 in the bitmask, E(R) = log₂(0.77351 · n).
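A minimal pure-Python sketch of the whole scheme (md5 stands in for the idealized uniform hash, and the multiset with 1000 distinct values is a hypothetical input; a single hash function gives only an order-of-magnitude estimate):

```python
import hashlib

L = 32   # hash output length in bits

def h(x):
    """Hash x roughly uniformly into [0, 2^L - 1] (md5 as a stand-in)."""
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % (1 << L)

def rho(y):
    """Index of the least-significant 1 bit of y, or L if y == 0."""
    return (y & -y).bit_length() - 1 if y else L

def fm_estimate(multiset):
    bitmask = [0] * (L + 1)
    for x in multiset:
        bitmask[rho(h(x))] = 1           # flip the bit if it is still 0
    R = bitmask.index(0)                 # leftmost 0: E(R) ~ log2(0.77351 n)
    return (1 << R) / 0.77351

data = [i % 1000 for i in range(100_000)]    # 1000 distinct elements
est = fm_estimate(data)
```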
For every h = 1, 2, ..., estimate the cardinality of the set N(h), i.e., the number of pairs of nodes reachable within h steps. When the cardinality stabilizes, output the number of steps needed to reach it as the diameter.
Scalability: O(diam(G) · m), m = #edges. Efficient access to the file (very important). Parallelizable (also very important).
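An exact small-scale sketch of this hop-counting loop (plain BFS sets in place of the Flajolet-Martin sketches that make it scale; the 5-node path graph is a hypothetical toy input):

```python
from collections import deque

def neighborhood_function(adj, max_h):
    """N[h] = number of ordered node pairs (u, v) with dist(u, v) <= h.
    At web scale the per-node reachability sets are replaced by FM sketches."""
    N = [0] * (max_h + 1)
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:                          # BFS from s
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for d in dist.values():
            for h in range(d, max_h + 1):
                N[h] += 1
    return N

# Path graph 0-1-2-3-4: N(h) stabilizes exactly at h = diameter = 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
N = neighborhood_function(adj, max_h=6)
diam = next(h for h in range(len(N)) if N[h] == len(adj) ** 2)
```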
PART II: Tensors, the 2 Heads method
[Figure: SVD of the document-to-term matrix = (documents to document HCs) x (strength of each concept) x (term to term HCs). Terms "data", "graph", "java" load on a CS concept; "brain", "lung" on an MD concept.]
Tucker is an SVD-like decomposition of a tensor: one projection matrix per mode and a core tensor giving the correlation among the projection matrices.
In: D. Out: D' = [G; U0, U1, U2].
1. Spatial compression: Tucker decomposition.
2. Temporal compression: wavelet transform.
3. Sparsify the core tensor G. Relative error: e² = 1 − ||G||²/||D||².
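The spatial-compression step can be sketched with a truncated HOSVD, a simple non-iterative Tucker variant (a numpy-only sketch; the rank-1 test tensor is a hypothetical input chosen so the truncation is exact):

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: one projection matrix per mode plus a core tensor."""
    U = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        U.append(u[:, :r])
    G = T
    for mode, u in enumerate(U):
        G = mode_multiply(G, u.T, mode)   # project each mode onto its basis
    return G, U

# Rank-1 test tensor: outer product of three vectors.
a, b, c = np.array([1., 2., 3.]), np.array([1., 0., 1., 0.]), np.array([2., 1.])
T = np.einsum('i,j,k->ijk', a, b, c)

G, U = hosvd(T, ranks=(1, 1, 1))
recon = G
for mode, u in enumerate(U):              # multiply back to reconstruct
    recon = mode_multiply(recon, u, mode)
```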
[Figure: the Tucker-2 step factors the data tensor D (location x modality x time) into a core G times U1 and U2ᵀ, and G is then sparsified into G'. The temporal step multiplies the time mode by a fixed wavelet transform matrix U0, so the core holds wavelet coefficients.]
In: sensor measurements
Out: Projection matrices U1 and U2 Core G’ (wavelet coefficients)
Mining guide: U1 and U2 reveal the patterns on location and modality, respectively
G’ provides the patterns on time
[Figure: D (location x modality x time) and its decomposition into G' times U1 and U2ᵀ]
[Figure: Intel Berkeley lab time series, value vs. time (min): Temperature, Light, Humidity, Voltage]
1st HC : dominant trend, e.g. daily periodicity. 2nd HC: Exceptions
[Figure: the two leading columns of U1 across sensors 1..54. 1st hidden concept: daily periodicity; 2nd hidden concept: exceptions.]
• 1st HC indicates the main sensor modality correlations: temperature and light are positively correlated, while humidity is anti-correlated with the rest.
• 2nd HC indicates an abnormal pattern which is due to battery outage for some sensors
[Figure: the two leading columns of U2 across modalities (volt, humid, temp, light): 1st and 2nd hidden concepts.]
• 1st scalogram indicates daily periodicity • 2nd scalogram gives abnormal flat trend due to battery outage
PART II: MACH
Most real-world processes result in sparse tensors; however, there exist important processes which result in dense tensors:
Physical Process                                              Percentage of non-zero entries
Sensor network (sensor x measurement type x timeticks)        85%
Computer network (machine x measurement type x timeticks)     81%
Performing a Tucker decomposition on a dense tensor can be either very slow or impossible due to memory constraints.
Can we trade a little bit of accuracy for efficiency?
MACH extends the work of Achlioptas-McSherry for fast low rank approximations to the multilinear setting.
Toss a coin for each non-zero entry: with probability p it "survives" and is reweighted by 1/p; otherwise, make it zero! Then perform Tucker on the sparsified tensor! For the theoretical results, see Tsourakakis, SDM 2010.
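The sparsification step is a few lines of numpy (a sketch; the tensor shape and p = 0.1 are illustrative, not from the experiments):

```python
import numpy as np

rng = np.random.default_rng(42)

def mach_sparsify(T, p):
    """Keep each entry with probability p and reweight survivors by 1/p,
    so the sparsified tensor equals the original in expectation."""
    survives = rng.random(T.shape) < p
    return np.where(survives, T / p, 0.0)

T = rng.standard_normal((20, 20, 20))     # a small dense tensor
S = mach_sparsify(T, p=0.1)
# Tucker (e.g., HOSVD/HOOI) is then run on the sparse S instead of T.
```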
Intemon (Carnegie Mellon University Self-‐Monitoring system)
Tensor X, 100 machines x 12 types of measurement x 10080 timeticks
Jimeng Sun showed in his thesis that Tucker decompositions can be used to efficiently monitor the system.
For p = 0.1 we obtain a Pearson correlation coefficient of 0.99 (ideal: ρ = 1).
[Figure: Exact vs. MACH components. Find the differences!] The qualitative analysis, which is important for our goals, remains the same!
Berkeley Lab
Tensor: 54 sensors x 4 types of measurement x 5385 timeticks
The qualitative analysis which is important for our goals remains the same!
The spatial principal mode is also preserved, and Pearson’s correlation coefficient is again almost 1!
Remarks: 1) Daily periodicity is apparent. 2) Pearson's correlation coefficient with the exact component is 0.99.
Conclusion / Research Directions
• More applications of probabilistic combinatorics in large-scale graph mining: randomized algorithms work very well (e.g., sublinear time algorithms) but are typically hard to analyze.
• The smallest p* for tensor sparsification under the (messy) HOOI algorithm.
• Better sparsification (edge (1,2) is important; weighted graphs!).
• Property testing: is a graph triangle-free? Does Boolean matrix multiplication have a truly subcubic algorithm?
Acknowledgments: Faloutsos, Miller, Schwartz, Frieze, Kolountzakis, Koutis, Drineas, Kang, Leskovec
[Backup slide: estimates vs. sampling probability p. Concentration appears, then becomes stronger as p grows. Pick p = 1/… and keep doubling until concentration appears.]
How EigenTriangle works:
1. "I want to compute the number of triangles! Use Lanczos to compute the first two eigenvalues, please!"
2. Is the cube of the second one significantly smaller than the cube of the first? NO: iterate then!
3. After some iterations (hopefully few!), compute the k-th eigenvalue λk. Is λk³ much smaller than the sum of cubes so far? YES! The algorithm terminates: the estimated number of Δs is the sum of cubes of the λi's divided by 6!
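The loop above can be sketched as follows (numpy; a dense eigensolver and a sort by magnitude stand in for the incremental Lanczos iteration, and K10 is a hypothetical toy input):

```python
import numpy as np

def eigen_triangle(A, tol=1e-3):
    """Accumulate cubed eigenvalues in order of decreasing magnitude and stop
    once the next cube is a negligible fraction of the running sum.
    (The real algorithm obtains each eigenvalue incrementally via Lanczos.)"""
    lam = np.linalg.eigvalsh(A)
    lam = lam[np.argsort(-np.abs(lam))]     # decreasing |lambda|
    total = 0.0
    for i, l in enumerate(lam):
        total += l ** 3
        if i > 0 and abs(l) ** 3 < tol * abs(total):
            break                           # remaining cubes are negligible
    return total / 6.0

# K10: C(10,3) = 120 triangles; spectrum is {9, -1 (x9)}.
A = np.ones((10, 10)) - np.eye(10)
est = eigen_triangle(A)
```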