Data Mining with MapReduce: Graph and Tensor Algorithms with Applications
Charalampos (Babis) E. Tsourakakis
Modern Data Mining Algorithms 1
Data Analysis Project 20 Apr. 2010
Outline:
• Introduction
• PART I: Graphs (Triangles, Diameter)
• PART II: Tensors (2 Heads method, MACH)
• Conclusion / Research Directions
Leonhard Euler (1707-1783)
Seven Bridges of Königsberg Eulerian Paths
Internet Map [lumeta.com]
Food Web [Martinez ’91]
Protein Interactions [genomebiology.com]
Friendship Network [Moody ’01]
Examples of data matrices: m customers x n products (Market Basket Analysis); m documents x n words (Documents-Terms, with terms such as "freedom", "dance", "prison").
[Figure: four time series from the Intel Berkeley lab, value vs. time (min): Temperature, Light, Voltage, Humidity]
Data is modeled as a tensor, i.e., a multidimensional matrix, of size T x (#sensors) x (#types of measurements).
Any multi-dimensional time series can be modeled in this way.
Functional Magnetic Resonance Imaging (fMRI): voxels x subjects x trials x task conditions x timeticks
PART I: Triangles
Applications:
• Spam detection
• Exponential random graphs
• Clustering coefficients & transitivity ratio
• Uncovering the hidden thematic structure of the Web
• Link recommendation
Friends of friends tend to become friends themselves
Theorem 1
Δ(G) = (1/6) · Σi λi³, where Δ(G) is the number of triangles in the graph G(V,E) and λi are the eigenvalues of the adjacency matrix AG.
Theorem 2
Δ(i) = (1/2) · Σj λj³ ui,j², where Δ(i) is the number of Δs vertex i participates in, uj is the j-th eigenvector of AG, and ui,j is the i-th entry of uj.
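Both theorems can be checked directly on a toy graph. A minimal numpy sketch (the complete graph K4 is a hypothetical illustrative input, not from the slides):

```python
import numpy as np

# Toy input: adjacency matrix of K4, the complete graph on 4 vertices.
# K4 has C(4,3) = 4 triangles, and every vertex lies in 3 of them.
A = np.ones((4, 4)) - np.eye(4)

lam, U = np.linalg.eigh(A)          # eigenvalues and eigenvectors of A

# Theorem 1: total triangles = (1/6) * sum of cubed eigenvalues.
total = np.sum(lam ** 3) / 6.0

# Theorem 2: triangles at node i = (1/2) * sum_j lam_j^3 * u_{i,j}^2.
per_node = 0.5 * (U ** 2) @ (lam ** 3)
```

For K4 the spectrum is {3, -1, -1, -1}, so the sum of cubes is 27 - 3 = 24 and total = 4, as expected.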
Very important for us because: Few eigenvalues contribute a lot! Cubes amplify this even more! Lanczos converges fast due to large spectral gaps!
[Figure: eigenvalue spectrum of the Political Blogs network. It is almost symmetric around 0, so the sum of cubes of the small eigenvalues almost cancels out: omit them and keep only the top 3!]
Nodes    Edges    Description
~75K     ~405K    Epinions network (social network)
~404K    ~2.1M    Flickr (social network)
~27K     ~341K    Arxiv Hep-Th (co-authorship network)
~1K      ~17K     Political blogs (information network)
~13K     ~148K    Reuters news (information network)
~3M      35M      Wikipedia 2006-Sep-05 (web graph)
~3.15M   ~37M     Wikipedia 2006-Nov-04 (web graph)
~13.5K   ~37.5K   AS Oregon (internet graph)
~23.5K   ~47.5K   CAIDA AS 2004 to 2008, means over 151 timestamps (internet graph)
[Figure: scatter plot of triangles node i participates in vs. triangles node i participates in according to our estimation]
Kronecker graphs are a model for generating graphs that mimic properties of real-world networks. The basic operation is the Kronecker product ([Leskovec et al.]).
Initiator graph, adjacency matrix A[0]:
0 1 1
1 0 1
1 1 0
Taking the Kronecker product with the initiator repeatedly gives adjacency matrices A[1], A[2], and, after k repetitions, A[k].
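The construction above is a one-liner with numpy (a sketch; the initiator K3 is the triangle from the slide, and the triangle count via trace(A³)/6 is the standard identity used throughout this part):

```python
import numpy as np

# Initiator graph: adjacency matrix A[0] (the triangle K3, 1 triangle).
A0 = np.array([[0, 1, 1],
               [1, 0, 1],
               [1, 1, 0]], dtype=float)

def kron_iterate(A, k):
    """A[k]: apply the Kronecker product with the initiator k times."""
    B = A.copy()
    for _ in range(k):
        B = np.kron(B, A)
    return B

A1 = kron_iterate(A0, 1)                 # 9 x 9 adjacency matrix
tri_A0 = np.trace(A0 @ A0 @ A0) / 6      # trace(A^3)/6 = number of triangles
tri_A1 = np.trace(A1 @ A1 @ A1) / 6      # grows multiplicatively under kron
```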
Theorem [KroneckerTRC]: Let B = A[k] be the k-th Kronecker product and Δ(GA), Δ(GB) the total number of triangles in GA, GB. Then the following equality holds: Δ(GB) = 6^k · Δ(GA)^(k+1), since trace(B³) = trace(A³)^(k+1).
Observation 1: When the matrix is symmetric, the eigendecomposition coincides with the SVD: eigenvectors = left singular vectors, and λi = σi sgn(ui·vi), where λi, σi are the i-th eigenvalue and singular value, and ui, vi the left and right singular vectors, respectively.
Observation 2: We care about a low rank approximation of A
Frieze, Kannan, Vempala
Idea: sample c columns to obtain Ã, and find Ãk instead of the optimal rank-k approximation Ak. Recover the signs from the left and right singular vectors. Then use EigenTriangle!
Results: with c = 100 and k = 6 on Flickr, EigenTriangle achieves 95.6% accuracy and the approximation 95.46%.
(1) Pick column i with probability proportional to its squared length.
(2) Use the sampled matrix to obtain a good low rank approximation to the original one.
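A sketch of the sampling step (numpy; the rescaling of each kept column by 1/sqrt(c·pj) is the standard Frieze-Kannan-Vempala normalization, an assumption not spelled out on the slide, and the Gaussian test matrix is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_columns(A, c):
    """Pick c columns with probability proportional to squared length,
    rescaling survivors so the sampled matrix is unbiased in expectation."""
    p = (A ** 2).sum(axis=0)
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=c, p=p)
    return A[:, idx] / np.sqrt(c * p[idx])

A = rng.standard_normal((50, 200))
S = sample_columns(A, c=40)
```

With these probabilities every rescaled column has squared norm ||A||_F²/c, so the Frobenius norm of the sample equals that of A exactly; only the spectrum is approximated.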
Approximate a given graph G by a sparse graph H, such that H is close to G under a certain notion of closeness.
Examples: cut-preserving sparsifiers (Benczur-Karger), spectral sparsifiers (Spielman-Teng).
What about Triangle Sparsifiers?
For each edge (k, m) of G(V,E), toss a coin: on TAILS the edge (k, m) "dies", otherwise it survives in G'(V,E') (each edge is kept with probability p). Now count the triangles T in G' and let T/p³ be the estimate of t, the true number of Δs.
Main theoretical result: under mild conditions on the triangle density (at least a nearly linear number of triangles), our estimate is strongly concentrated around the true number of triangles!
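A pure-Python sketch of the sparsify-and-count estimator (K20 is a hypothetical dense toy input, and the estimate is averaged over a few independent runs here only to make the demo stable):

```python
import random
from itertools import combinations

random.seed(0)

def count_triangles(edges):
    """Count triangles by intersecting the neighborhoods of each edge."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    t = sum(len(adj[u] & adj[v]) for u, v in edges)
    return t // 3                        # each triangle is seen from 3 edges

def sparsify_estimate(edges, p):
    """Keep each edge with probability p; return T / p^3 as the estimate."""
    kept = [e for e in edges if random.random() < p]
    return count_triangles(kept) / p ** 3

# Toy input: the complete graph K20 has C(20,3) = 1140 triangles.
edges = list(combinations(range(20), 2))
true_t = count_triangles(edges)
est = sum(sparsify_estimate(edges, p=0.5) for _ in range(20)) / 20
```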
PART I: Diameter
Milgram 1967
The "small world experiment":
• Pick 300 people at random.
• Ask them to get a letter to a stockbroker in Boston by passing it through friends.
How many steps does it take? Only 6! Typically the diameter of a real-world network is surprisingly small!
Does the same observation hold on the Yahoo Web Graph (2002), where #nodes=1.4B and #edges=6.83B?
Assume we have a multiset M = {x1, ..., xm} and we want to count its number of distinct elements n. How can we do this using a small amount of space?
Flajolet & G. Nigel Martin
Hash function h: U → [0, ..., 2^L − 1]. Write y = Σk bit(y,k)·2^k and let ρ(y) = the minimum k such that bit(y,k) = 1 (ρ(y) = L if y = 0). Keep a BITMASK[0..L]. Hash every x in M and find ρ(h(x)); if BITMASK[ρ(h(x))] is 0, flip it!
How will the bitmask look at the end?
• Positions i << log(n): all 1s (1111111111111).
• Positions i >> log(n): all 0s (0000000000).
• Positions i ≈ log(n): mixed (010110...). This fringe region gives us the information.
Flajolet and Martin prove that for the random variable R = leftmost 0 in the bitmask, E(R) = log₂(0.77351 · n).
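A minimal pure-Python sketch of the whole scheme (md5 stands in for the idealized uniform hash, and the multiset with 1000 distinct values is a hypothetical input; a single hash function gives only an order-of-magnitude estimate):

```python
import hashlib

L = 32   # hash output length in bits

def h(x):
    """Hash x roughly uniformly into [0, 2^L - 1] (md5 as a stand-in)."""
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % (1 << L)

def rho(y):
    """Index of the least-significant 1 bit of y, or L if y == 0."""
    return (y & -y).bit_length() - 1 if y else L

def fm_estimate(multiset):
    bitmask = [0] * (L + 1)
    for x in multiset:
        bitmask[rho(h(x))] = 1           # flip the bit if it is still 0
    R = bitmask.index(0)                 # leftmost 0: E(R) ~ log2(0.77351 n)
    return (1 << R) / 0.77351

data = [i % 1000 for i in range(100_000)]    # 1000 distinct elements
est = fm_estimate(data)
```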
For every h = 1, 2, ..., estimate the cardinality of the set N(h), i.e., the number of pairs of nodes reachable within h steps. When the cardinality stabilizes, output the number of steps needed to reach it as the diameter.
Scalability: O(diam(G) · m), m = #edges. Efficient access to the file (very important). Parallelizable (also very important).
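An exact small-scale sketch of this hop-counting loop (plain BFS sets in place of the Flajolet-Martin sketches that make it scale; the 5-node path graph is a hypothetical toy input):

```python
from collections import deque

def neighborhood_function(adj, max_h):
    """N[h] = number of ordered node pairs (u, v) with dist(u, v) <= h.
    At web scale the per-node reachability sets are replaced by FM sketches."""
    N = [0] * (max_h + 1)
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:                          # BFS from s
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for d in dist.values():
            for h in range(d, max_h + 1):
                N[h] += 1
    return N

# Path graph 0-1-2-3-4: N(h) stabilizes exactly at h = diameter = 4.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
N = neighborhood_function(adj, max_h=6)
diam = next(h for h in range(len(N)) if N[h] == len(adj) ** 2)
```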
PART II: Tensors, the 2 Heads method
[Figure: SVD of the document-to-term matrix = (documents to document HCs) x (strength of each concept) x (term to term HCs). Terms "data", "graph", "java" load on a CS concept; "brain", "lung" on an MD concept.]
Tucker is an SVD-like decomposition of a tensor: one projection matrix per mode and a core tensor giving the correlation among the projection matrices.
In: D. Out: D' = [G; U0, U1, U2].
1. Spatial compression: Tucker decomposition.
2. Temporal compression: wavelet transform.
3. Sparsify the core tensor G. Relative error: e² = 1 − ||G||²/||D||².
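The spatial-compression step can be sketched with a truncated HOSVD, a simple non-iterative Tucker variant (a numpy-only sketch; the rank-1 test tensor is a hypothetical input chosen so the truncation is exact):

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along the given mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: one projection matrix per mode plus a core tensor."""
    U = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        U.append(u[:, :r])
    G = T
    for mode, u in enumerate(U):
        G = mode_multiply(G, u.T, mode)   # project each mode onto its basis
    return G, U

# Rank-1 test tensor: outer product of three vectors.
a, b, c = np.array([1., 2., 3.]), np.array([1., 0., 1., 0.]), np.array([2., 1.])
T = np.einsum('i,j,k->ijk', a, b, c)

G, U = hosvd(T, ranks=(1, 1, 1))
recon = G
for mode, u in enumerate(U):              # multiply back to reconstruct
    recon = mode_multiply(recon, u, mode)
```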
[Figure: the Tucker-2 step factors the data tensor D (location x modality x time) into a core G times U1 and U2ᵀ, and G is then sparsified into G'. The temporal step multiplies the time mode by a fixed wavelet transform matrix U0, so the core holds wavelet coefficients.]
In: sensor measurements
Out: Projection matrices U1 and U2 Core G’ (wavelet coefficients)
Mining guide: U1 and U2 reveal the patterns on location and modality, respectively
G’ provides the patterns on time
[Figure: D (location x modality x time) and its decomposition into G' times U1 and U2ᵀ]
[Figure: Intel Berkeley lab time series, value vs. time (min): Temperature, Light, Humidity, Voltage]
1st HC : dominant trend, e.g. daily periodicity. 2nd HC: Exceptions
[Figure: the two leading columns of U1 across sensors 1..54. 1st hidden concept: daily periodicity; 2nd hidden concept: exceptions.]
• 1st HC indicates the main sensor modality correlations: temperature and light are positively correlated, while humidity is anti-correlated with the rest.
• 2nd HC indicates an abnormal pattern which is due to battery outage for some sensors
[Figure: the two leading columns of U2 across modalities (volt, humid, temp, light): 1st and 2nd hidden concepts.]
• 1st scalogram indicates daily periodicity • 2nd scalogram gives abnormal flat trend due to battery outage
PART II: MACH
Most real-world processes result in sparse tensors; however, there exist important processes which result in dense tensors:
Physical Process                                              Percentage of non-zero entries
Sensor network (sensor x measurement type x timeticks)        85%
Computer network (machine x measurement type x timeticks)     81%
Performing a Tucker decomposition on a dense tensor can be either very slow or impossible due to memory constraints.
Can we trade a little bit of accuracy for efficiency?
MACH extends the work of Achlioptas-McSherry for fast low rank approximations to the multilinear setting.
Toss a coin for each non-zero entry: with probability p it "survives" and is reweighted by 1/p; otherwise, make it zero! Then perform Tucker on the sparsified tensor! For the theoretical results, see Tsourakakis, SDM 2010.
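The sparsification step is a few lines of numpy (a sketch; the tensor shape and p = 0.1 are illustrative, not from the experiments):

```python
import numpy as np

rng = np.random.default_rng(42)

def mach_sparsify(T, p):
    """Keep each entry with probability p and reweight survivors by 1/p,
    so the sparsified tensor equals the original in expectation."""
    survives = rng.random(T.shape) < p
    return np.where(survives, T / p, 0.0)

T = rng.standard_normal((20, 20, 20))     # a small dense tensor
S = mach_sparsify(T, p=0.1)
# Tucker (e.g., HOSVD/HOOI) is then run on the sparse S instead of T.
```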
Intemon (Carnegie Mellon University Self-‐Monitoring system)
Tensor X, 100 machines x 12 types of measurement x 10080 timeticks
Jimeng Sun showed in his thesis that Tucker decompositions can be used to efficiently monitor the system.
For p = 0.1 we obtain a Pearson correlation coefficient of 0.99 (ideal: ρ = 1).
[Figure: Exact vs. MACH components. Find the differences!] The qualitative analysis, which is important for our goals, remains the same!
Berkeley Lab
Tensor: 54 sensors x 4 types of measurement x 5385 timeticks
The qualitative analysis which is important for our goals remains the same!
The spatial principal mode is also preserved, and Pearson’s correlation coefficient is again almost 1!
Remarks: 1) Daily periodicity is apparent. 2) Pearson's correlation coefficient with the exact component is 0.99.
Conclusion / Research Directions
• More applications of probabilistic combinatorics in large-scale graph mining: randomized algorithms work very well (e.g., sublinear time algorithms) but are typically hard to analyze.
• The smallest p* for tensor sparsification under the (messy) HOOI algorithm.
• Better sparsification (edge (1,2) is important; weighted graphs!).
• Property testing: is a graph triangle-free? Does Boolean matrix multiplication have a truly subcubic algorithm?
Acknowledgments: Faloutsos, Miller, Schwartz, Frieze, Kolountzakis, Koutis, Drineas, Kang, Leskovec
[Backup slide: estimates vs. sampling probability p. Concentration appears, then becomes stronger as p grows. Pick p = 1/… and keep doubling until concentration appears.]
How EigenTriangle works:
1. "I want to compute the number of triangles! Use Lanczos to compute the first two eigenvalues, please!"
2. Is the cube of the second one significantly smaller than the cube of the first? NO: iterate then!
3. After some iterations (hopefully few!), compute the k-th eigenvalue λk. Is λk³ much smaller than the sum of cubes so far? YES! The algorithm terminates: the estimated number of Δs is the sum of cubes of the λi's divided by 6!
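The loop above can be sketched as follows (numpy; a dense eigensolver and a sort by magnitude stand in for the incremental Lanczos iteration, and K10 is a hypothetical toy input):

```python
import numpy as np

def eigen_triangle(A, tol=1e-3):
    """Accumulate cubed eigenvalues in order of decreasing magnitude and stop
    once the next cube is a negligible fraction of the running sum.
    (The real algorithm obtains each eigenvalue incrementally via Lanczos.)"""
    lam = np.linalg.eigvalsh(A)
    lam = lam[np.argsort(-np.abs(lam))]     # decreasing |lambda|
    total = 0.0
    for i, l in enumerate(lam):
        total += l ** 3
        if i > 0 and abs(l) ** 3 < tol * abs(total):
            break                           # remaining cubes are negligible
    return total / 6.0

# K10: C(10,3) = 120 triangles; spectrum is {9, -1 (x9)}.
A = np.ones((10, 10)) - np.eye(10)
est = eigen_triangle(A)
```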