TRANSCRIPT
-
Graph Summarisation
Jilles Vreeken
10 July 2015
-
The Case of The Lost Pen
– or –
The Case of the Found Pen
Service Announcement #0
-
Next week, a guest lecture
Mining Data that Changes
by dr. Pauli Miettinen (MPI-INF)
Service Announcement #1
-
Exam.
Oral.
3rd and 4th of August.
Timeslots to be decided.
Mail me if you want to participate, let me know if you have a preferred time/day.
Service Announcement #2
-
Service Announcement #3
Introduction
Patterns
Correlation and Causation
Graphs
Wrap-up +
(Subjective) Interestingness
-
Service Announcement #2
Introduction
Patterns
Correlation and Causation
Graphs
Wrap-up +
(Subjective) Interestingness
?
Yes! Prepare questions on anything* you’ve always wanted to ask me.
Mail them to me in advance, or have me answer them on the spot
* preferably related to TADA, data mining, machine learning, science, the world, etc.
-
Question of the day
How can we summarise
the main structure of a graph in easily understandable terms?
-
Graphs
Graphs are everywhere
↔
Everything* can be represented as a graph
* almost
-
Graphs, formally
We consider graphs 𝐺 = (𝑉, 𝐸) with 𝑉 the set of 𝑛 nodes,
and 𝐸 the set of 𝑚 edges between nodes.
In general, nodes can have labels, and
edges can have labels and weights, and can be directed.
-
Real world graphs
road networks
social networks
biological networks
cellular networks
relational databases
-
Real world graphs
the internet
-
Graphs, formally
Today we consider unlabeled unweighted undirected graphs.
The adjacency matrix 𝐴 then is an 𝑛 × 𝑛 matrix 𝐴 ∈ {0,1}^(𝑛×𝑛), where
a cell 𝑎_{𝑖,𝑗} = 1 iff (𝑖, 𝑗) ∈ 𝐸, and 0 otherwise.
We call the number of edges 𝑑_𝑖 of a node 𝑖 its degree.
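As a quick illustration of these definitions, here is a minimal sketch (plain NumPy; the helper name `adjacency_matrix` is mine, not from the lecture) that builds 𝐴 for a small undirected, unlabeled, unweighted graph and reads off the degrees 𝑑_𝑖:

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Build the n-by-n 0/1 matrix A with a_ij = 1 iff (i, j) in E."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1  # undirected, so A is symmetric
    return A

# A 4-node path graph: 0 - 1 - 2 - 3
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
degrees = A.sum(axis=1)  # d_i = number of edges of node i
print(degrees.tolist())  # → [1, 2, 2, 1]
```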
-
Why summarisation?
Visualization
Guiding attention
-
Staring at an Adjacency Matrix
-
Nodes: wiki editors
Edges: co-edited
I don’t see
anything!
Staring at a Hairball
-
Stars:
admins,
bots,
heavy users
Bipartite cores: edit wars
Nodes: wiki editors
Edges: co-edited
Kiev vs. Kyiv vandals
Example: Wikipedia Controversy
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Average degree. Not very insightful.
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Degree plots
-
Powerlaws
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Cluster coefficient (global)How clustered are the nodes in the graph?
𝐶 = (number of closed triangles) / (number of connected triplets of vertices)
Counting triangles requires matrix multiplication, which takes 𝑂(𝑛^𝜔) time, where 𝜔 < 2.376, but 𝑂(𝑛²) space.
(but fast estimators exist)
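The triangle-counting route can be sketched in a few lines of NumPy; `global_clustering` is an illustrative name of mine, and for real graphs you would use the fast estimators mentioned above rather than dense matrix powers:

```python
import numpy as np

def global_clustering(A):
    """C = 3 * closed triangles / connected triplets, via trace(A^3)."""
    d = A.sum(axis=1)
    triangles = np.trace(np.linalg.matrix_power(A, 3)) / 6  # each counted 6x
    triplets = (d * (d - 1) / 2).sum()  # length-2 paths, per centre node
    return 3 * triangles / triplets

# A triangle 0-1-2 with a pendant node 3 attached to node 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(global_clustering(A))  # → 0.6
```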
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Cluster coefficient (local)
How close is the neighborhood of
node 𝑖 to being a clique?
𝐶_𝑖 = 2·|{(𝑗, 𝑘) ∈ 𝐸 : 𝑗, 𝑘 ∈ 𝑁_𝑖}| / (𝑑_𝑖(𝑑_𝑖 − 1))
which takes 𝑂(𝑑_𝑖²) time at 𝑂(𝑛²) space
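A corresponding sketch for the local coefficient (again an illustrative helper of my own, assuming a dense adjacency matrix):

```python
import numpy as np

def local_clustering(A, i):
    """C_i: fraction of pairs of neighbours of i that are connected."""
    nbrs = np.flatnonzero(A[i])
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = A[np.ix_(nbrs, nbrs)].sum() / 2  # edges among the neighbours
    return 2 * links / (d * (d - 1))

# Same triangle-with-pendant graph: node 2 has neighbours {0, 1, 3}
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(local_clustering(A, 2))  # only the pair (0, 1) is connected: C_2 = 1/3
```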
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Diameter
The longest shortest path between two nodes.
Requires calculating all shortest paths.
Calculating shortest path takes 𝑂(𝑛2).
So, no.
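Exact diameters are indeed out of reach for large graphs, but a common cheap trick is a BFS ‘double sweep’, which gives a lower bound in two BFS passes. A sketch (function names are mine, not from the slides):

```python
from collections import deque

def bfs_farthest(adj, src):
    """BFS from src; return the farthest node and its distance."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    far = max(dist, key=dist.get)
    return far, dist[far]

def diameter_lower_bound(adj):
    """Double sweep: BFS twice; the second eccentricity bounds the diameter."""
    u, _ = bfs_farthest(adj, next(iter(adj)))
    _, d = bfs_farthest(adj, u)
    return d

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # path graph, diameter 3
print(diameter_lower_bound(adj))  # → 3 (exact here; a lower bound in general)
```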
-
Scalability
Many real world graphs are big,
with 𝑛 in the order of millions.
𝑂(𝑛2) is very scary for a graph miner.
Current-day graph mining algorithms
need to be linear in the number of edges,
or else your paper will almost surely be rejected.
What are the implications?
-
Summarising a Graph
Given: a graph
-
Summarising a Graph
Given: a graph
Find: a succinct summary
with possibly
overlapping subgraphs
-
Summarising a Graph
Given: a graph
Find: a succinct summary
with possibly overlapping subgraphs
≈ important graph structures.
-
Community Detection
Adjacency Matrix – Assumed graph
-
Community Detection
Adjacency Matrix – Real graph
-
Summarising a Graph
Fully Automatic Cross Associations
is a nice MDL based algorithm to summarise a matrix.
1) REASSIGN: Given a grid, assign rows and columns
s.t. entropy within the grid is minimal.
(Chakrabarti et al. 2004)
-
Summarising a Graph
Fully Automatic Cross Associations
is a nice MDL based algorithm to summarise a matrix.
1) REASSIGN: Given a grid, assign rows and columns
s.t. entropy within the grid is minimal.
2) CROSSASSOC: Find cluster with highest entropy, split it, run REASSIGN.
Stop when no split reduces the MDL score.
(Chakrabarti et al. 2004)
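To make the objective concrete, here is a toy version of the cost REASSIGN drives down: with rows and columns grouped into a grid, each block can be encoded at roughly its binary entropy. This is a simplified sketch of my own, not the paper's exact MDL score:

```python
import numpy as np

def block_entropy_bits(A, row_groups, col_groups):
    """Bits to encode A given a grid: each block at its binary entropy."""
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            block = A[np.ix_(row_groups == r, col_groups == c)]
            p = block.mean() if block.size else 0.0
            if 0.0 < p < 1.0:  # pure all-0 / all-1 blocks cost ~0 bits
                total += block.size * (-p * np.log2(p)
                                       - (1 - p) * np.log2(1 - p))
    return total

A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
rows = np.array([0, 0, 1])     # row group per row
cols = np.array([0, 0, 1, 1])  # column group per column
print(block_entropy_bits(A, rows, cols))  # → 0.0: every block is pure
```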
-
Beyond Cave-men Communities
Traditional community detection
algorithms assume that you interact
only with people in your ‘cave’.
You are assumed not to interact
with others, except if you are one
of few ‘messengers’ between ‘caves’.
That is not very realistic.
(Kang & Faloutsos, ICDM 2011)
-
Slash’n’Burn
Slash’n’Burn finds the node 𝑖 with highest 𝑑_𝑖, removes its edges to 𝑁_𝑖, and recurses.
SLASHBURN:
1. Slash top-𝑘 hubs, burn edges
2. Repeat on the remaining GCC
(Before)
(Kang & Faloutsos, ICDM 2011)
-
Slash’n’Burn
Slash’n’Burn finds the node 𝑖 with highest 𝑑_𝑖, removes its edges to 𝑁_𝑖, and recurses.
SLASHBURN:
1. Slash top-𝑘 hubs, burn edges
2. Repeat on the remaining GCC
(After)
(Kang & Faloutsos, ICDM 2011)
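A compact sketch of the SLASHBURN loop (my own simplification: plain-Python adjacency sets, and slashed hubs plus finished components are simply appended to one ordering, whereas the paper places spokes at the back of the ordering):

```python
from collections import deque

def components(adj, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

def slashburn(adj, k=1):
    """Order nodes hubs-first: slash top-k hubs, recurse on the GCC."""
    nodes = set(adj)
    order = []
    while nodes:
        # 1. slash the k highest-degree hubs (degree within the remainder)
        hubs = sorted(nodes, key=lambda u: -len(adj[u] & nodes))[:k]
        order.extend(hubs)
        nodes -= set(hubs)
        # 2. set aside every non-giant component; repeat on the GCC
        comps = sorted(components(adj, nodes), key=len)
        for comp in comps[:-1]:
            order.extend(sorted(comp))
            nodes -= comp
    return order

adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}  # 5-node star
order = slashburn(adj, k=1)
print(order[0])  # → 0: the hub is slashed first
```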
-
Beyond Cave-men Communities
Slash’n’Burn applied on the
AS-Oregon graphs shows that
real graphs indeed have structure
beyond cave-men communities!
– but also include those!
A nice side-result is that the
Slash’n’Burned ordered matrix
has lots of ‘empty space’ and
can hence be stored efficiently.
(Kang & Faloutsos, ICDM 2011)
-
Carnegie Mellon University
Korea Advanced Institute of Science and Technology
VoG: Summarizing and Understanding Large Graphs
Danai Koutra
Jilles Vreeken
U Kang
Christos Faloutsos
SDM, 25 April 2014, Philadelphia, USA
-
Main Idea
1) Use a graph vocabulary:
2) Best graph summary
optimal compression (MDL)
-
Main Idea
1) Use a graph vocabulary:
2) Shortest lossless description
optimal compression (MDL)
-
Given a set of models ℳ,
the best model 𝑀 ∈ ℳ is
𝑀 = argmin_{𝑀 ∈ ℳ} 𝐿(𝑀) + 𝐿(𝐷 ∣ 𝑀)
where 𝐿(𝑀) is the number of bits for 𝑀,
and 𝐿(𝐷 ∣ 𝑀) the number of bits for the data using 𝑀.
Minimum Description Length
-
A line 𝑎₁𝑥 + 𝑎₀, or a degree-10 polynomial 𝑎₁₀𝑥¹⁰ + 𝑎₉𝑥⁹ + … + 𝑎₀?
𝐿(𝑀) + 𝐿(𝐷 ∣ 𝑀) trades the cost of the model against the cost of its errors.
MDL example
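The polynomial example can be made concrete with a crude two-part score of my own (fixed bits per coefficient plus bits for the residuals at some precision; not a formal MDL code):

```python
import numpy as np

def two_part_score(x, y, degree, bits_per_param=32.0, precision=0.01):
    """Crude two-part score: model bits + bits to encode the residuals."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    model_bits = bits_per_param * (degree + 1)
    data_bits = np.sum(np.log2(1.0 + np.abs(resid) / precision))
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 0.3 + rng.normal(scale=0.05, size=50)  # truly linear data
scores = {d: two_part_score(x, y, d) for d in (1, 5, 10)}
best = min(scores, key=scores.get)
print(best)  # the line wins: extra coefficients cost more than they save
```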
-
Given: a graph 𝐺 with adjacency matrix 𝐴, and a vocabulary Ω
Find: the model 𝑀 minimising 𝐿(𝐺, 𝑀) = 𝐿(𝑀) + 𝐿(𝐸)
Minimum Graph Description
Adjacency 𝐴 ≈ Model 𝑀 + Error 𝐸
-
VoG: Overview
-
VoG: Overview
some criterion
-
Summary
VoG: Overview
-
We need candidate structures…
… How can we get them?
-
Step 1: Graph Decomposition
We can use:
Any decomposition method
We did use/adapt:
SLASHBURN
-
Slash top-k hubs, burn edges
Before
SnB Graph Decomposition
-
Slash top-k hubs, burn edges
candidate structures
After
SnB Graph Decomposition
Notice that the structures can overlap!
-
Slash top-k hubs, burn edges
Repeat on the remaining GCC
GCC
SnB Graph Decomposition
-
We got candidate structures.
Now, how can we ‘label’ them?
-
Step 2: Graph Labeling
-
hub? “best” node split?
“best” node ordering?
missing edges?
Graph Representations
-
hub
Hub: top-degree node; Spokes: the rest
𝐿(star) = 𝐿_N(|𝑠𝑡| − 1) + log 𝑛 + log binom(𝑛 − 1, |𝑠𝑡| − 1) + 𝐿(𝐸⁺) + 𝐿(𝐸⁻)
(# of spokes · hub ID · spoke IDs · extra/missing edges, i.e. the errors)
Star structure; example with 𝑛 = 7 and 6 spokes
DETAILS
Graph Representations
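The structure part of this cost is easy to compute; a sketch with Rissanen's universal integer code 𝐿_N (function names are mine; the error terms 𝐿(𝐸⁺) and 𝐿(𝐸⁻) are omitted):

```python
from math import comb, log2

def L_N(z):
    """Rissanen's universal code length (bits) for an integer z >= 1."""
    bits = log2(2.865064)  # normalising constant c0
    l = log2(z)
    while l > 0:
        bits += l
        l = log2(l)
    return bits

def star_bits(n, num_spokes):
    """Structure bits for a star on n nodes (error terms omitted)."""
    return (L_N(num_spokes)                   # number of spokes, |st| - 1
            + log2(n)                         # which node is the hub
            + log2(comb(n - 1, num_spokes)))  # which nodes are the spokes

# The slide's example: n = 7, a hub with 6 spokes
print(star_bits(7, 6))
```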
-
Max bipartite graph: NP-hard
Heuristic: Belief Propagation with heterophily for node classification (blue/red)
𝐿(bipartite core) = 𝐿_N(# of blue nodes) + 𝐿_N(# of red nodes) + log binom-terms for their IDs + 𝐿(𝐸⁺) + 𝐿(𝐸⁻)
(extra/missing edges, i.e. the errors)
Bipartite graph structure
DETAILS
Graph Representations
-
Longest path: NP-hard
Heuristic: BFS + local search
Chain structure: the node IDs along the path, plus extra/missing edges as errors
Graph Representations
-
Step 2: Graph Labeling
-
-
Step 3: Summary Assembly
-
Step 3: Summary Assembly
Summary
-
Concepts
compression gain (savings) = # bits encoded as noise − # bits encoded as structure
DETAILS
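A toy version of an assembly step built on this notion of gain (my own simplification: VoG's actual heuristics re-evaluate structures against the summary chosen so far, while this sketch just ranks the candidates once):

```python
def assemble(candidates):
    """candidates: (name, bits_as_noise, bits_as_structure) triples."""
    ranked = sorted(candidates, key=lambda s: s[1] - s[2], reverse=True)
    return [name for name, noise, struct in ranked if noise > struct]

candidates = [("star-A", 120.0, 40.0),   # large saving: keep
              ("clique-B", 55.0, 50.0),  # small saving: keep
              ("chain-C", 30.0, 45.0)]   # costs more than it saves: drop
print(assemble(candidates))  # → ['star-A', 'clique-B']
```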
-
-
Concepts
Summary encoding cost
𝐿(𝑀) = 𝐿_N(|𝑀| + 1) + log binom(|𝑀| + |Ω| − 1, |Ω| − 1) + Σ_{𝑠 ∈ 𝑀} ( −log 𝑃(𝑥(𝑠) ∣ 𝑀) + 𝐿(𝑠) )
(# of structures · # of structures per type · for each structure: its type, its connectivity, its encoding length)
-
Step 3: Summary Assembly
(plot: 𝐿(𝐷, 𝑀) vs. number of structures selected)
DETAILS
-
(bar chart: bits needed and unexplained edges for the Plain, Top-10, Top-100, and G&F encodings)
4292729 bits as noise
Real graphs have structure!
(we can save bits by encoding with structures!)
Quantitative Analysis
-
(log-scale chart: number of structures per type for Plain, Top-10, Top-100, and G&F)
Main structure types: Star, Near-Bipartite, Full clique, Full Bipartite, Chain
Quantitative Analysis
-
Quantitative Analysis
(log-scale chart: number of structures per type for Plain, Top-10, Top-100, and G&F; legend: Star, Near-Bipartite, Full clique, Full Bipartite)
Main structure types:
Stars, near- and full-bipartite cores.
Top-3 stars: klay
Top-1 near-bipartite core: ski excursion
Qualitative Analysis: Enron
-
VoG is near-linear in the number of edges of the input graph.
Runtime
-
Future Work
For those of you interested in an MSc or RIL project…
Our current vocabulary is stars, full cliques, full and near bipartite cores, and chains.
But many other structures make sense, for example the “jellyfish” (Tauro, 2001).
-
Future Work
For those of you who might be interested in an MSc or RIL project…
It would be great if we could mine summaries directly from the data,
without pre-mining all candidate structures.
Real graphs show power-law-ish degree distributions;
it would be great if VoG could take that into account.
-
Conclusions
Graphs need summaries: graphs are powerful but difficult to interpret, and far too few (efficient) summary methods are available.
Cross-Associations: a powerful technique to find bi-clusters; heuristic, and improvements exist.
Slash’n’Burn: reorders the nodes of a graph; finds sub-graphs ‘beyond’ cave-men communities.
VoG: summarises graphs with a graph-theoretic vocabulary; first of its kind, but a big stack of heuristics; fast, with good results.
-
Thank you!