TRANSCRIPT
-
Graph Summarisation
Jilles Vreeken
10 July 2015
-
The Case of The Lost Pen
– or –
The Case of the Found Pen
Service Announcement #0
-
Next week, a guest lecture
Mining Data that Changes
by dr. Pauli Miettinen (MPI-INF)
Service Announcement #1
-
Exam.
Oral.
3rd and 4th of August.
Timeslots to be decided.
Mail me if you want to participate, let me know if you have a preferred time/day.
Service Announcement #2
-
Service Announcement #3
Introduction
Patterns
Correlation and Causation
Graphs
Wrap-up +
(Subjective) Interestingness
-
Service Announcement #2
Introduction
Patterns
Correlation and Causation
Graphs
Wrap-up +
(Subjective) Interestingness
?
Yes! Prepare questions on anything* you’ve always wanted to ask me.
Mail them to me in advance, or have me answer them on the spot
* preferably related to TADA, data mining, machine learning, science, the world, etc.
-
Question of the day
How can we summarise
the main structure of a graph in easily understandable terms?
-
Graphs
Graphs are everywhere
↔
Everything* can be represented as a graph
* almost
-
Graphs, formally
We consider graphs 𝐺 = (𝑉, 𝐸) with 𝑉 the set of 𝑛 nodes,
and 𝐸 the set of 𝑚 edges between nodes.
In general, nodes can have labels, and
edges can have labels and weights, and can be directed.
-
Real world graphs
road networks
social networks
biological networks
cellular networks
relational databases
-
Real world graphs
the internet
-
Graphs, formally
Today we consider unlabeled unweighted undirected graphs.
The adjacency matrix 𝐴 then is an 𝑛 × 𝑛 matrix 𝐴 ∈ {0,1}^(𝑛×𝑛), where
a cell 𝑎_{𝑖,𝑗} = 1 iff (𝑖, 𝑗) ∈ 𝐸, and 0 otherwise.
We call the number of edges 𝑑_𝑖 of a node 𝑖 its degree.
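As a quick illustration of these definitions, here is a minimal sketch (plain NumPy; the helper name `adjacency_matrix` is mine, not from the lecture) that builds 𝐴 for a small undirected, unlabeled, unweighted graph and reads off the degrees 𝑑_𝑖:

```python
import numpy as np

def adjacency_matrix(n, edges):
    """Build the n-by-n 0/1 matrix A with a_ij = 1 iff (i, j) in E."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i, j] = A[j, i] = 1  # undirected, so A is symmetric
    return A

# A 4-node path graph: 0 - 1 - 2 - 3
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
degrees = A.sum(axis=1)  # d_i = number of edges of node i
print(degrees.tolist())  # → [1, 2, 2, 1]
```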
-
Why summarisation?
Visualization
Guiding attention
-
Staring at an Adjacency Matrix
-
Nodes: wiki editors
Edges: co-edited
I don’t see
anything!
Staring at a Hairball
-
Stars:
admins,
bots,
heavy users
Bipartite cores: edit wars
Nodes: wiki editors
Edges: co-edited
Kiev vs. Kyiv vandals
Example: Wikipedia Controversy
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Average degree. Not very insightful.
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Degree plots
-
Powerlaws
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Cluster coefficient (global)How clustered are the nodes in the graph?
𝐶 = (number of closed triangles) / (number of connected triplets of vertices)
Counting triangles requires matrix multiplication, which takes 𝑂(𝑛^𝜔) time, where 𝜔 < 2.376, but 𝑂(𝑛²) space.
(but fast estimators exist)
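The triangle-counting route can be sketched in a few lines of NumPy; `global_clustering` is an illustrative name of mine, and for real graphs you would use the fast estimators mentioned above rather than dense matrix powers:

```python
import numpy as np

def global_clustering(A):
    """C = 3 * closed triangles / connected triplets, via trace(A^3)."""
    d = A.sum(axis=1)
    triangles = np.trace(np.linalg.matrix_power(A, 3)) / 6  # each counted 6x
    triplets = (d * (d - 1) / 2).sum()  # length-2 paths, per centre node
    return 3 * triangles / triplets

# A triangle 0-1-2 with a pendant node 3 attached to node 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(global_clustering(A))  # → 0.6
```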
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Cluster coefficient (local)
How close is the neighborhood of
node 𝑖 to being a clique?
𝐶_𝑖 = 2·|{(𝑗, 𝑘) ∈ 𝐸 : 𝑗, 𝑘 ∈ 𝑁_𝑖}| / (𝑑_𝑖(𝑑_𝑖 − 1))
which takes 𝑂(𝑑_𝑖²) time at 𝑂(𝑛²) space
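A corresponding sketch for the local coefficient (again an illustrative helper of my own, assuming a dense adjacency matrix):

```python
import numpy as np

def local_clustering(A, i):
    """C_i: fraction of pairs of neighbours of i that are connected."""
    nbrs = np.flatnonzero(A[i])
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = A[np.ix_(nbrs, nbrs)].sum() / 2  # edges among the neighbours
    return 2 * links / (d * (d - 1))

# Same triangle-with-pendant graph: node 2 has neighbours {0, 1, 3}
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(local_clustering(A, 2))  # only the pair (0, 1) is connected: C_2 = 1/3
```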
-
Summary Statistics
For ‘normal’ data, we can get insight by taking an average.
What kind of summary statistics do we have for graphs?
Diameter
The longest shortest path between two nodes.
Requires calculating all shortest paths.
Calculating shortest path takes 𝑂(𝑛2).
So, no.
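Exact diameters are indeed out of reach for large graphs, but a common cheap trick is a BFS ‘double sweep’, which gives a lower bound in two BFS passes. A sketch (function names are mine, not from the slides):

```python
from collections import deque

def bfs_farthest(adj, src):
    """BFS from src; return the farthest node and its distance."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    far = max(dist, key=dist.get)
    return far, dist[far]

def diameter_lower_bound(adj):
    """Double sweep: BFS twice; the second eccentricity bounds the diameter."""
    u, _ = bfs_farthest(adj, next(iter(adj)))
    _, d = bfs_farthest(adj, u)
    return d

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}  # path graph, diameter 3
print(diameter_lower_bound(adj))  # → 3 (exact here; a lower bound in general)
```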
-
Scalability
Many real world graphs are big,
with 𝑛 in the order of millions.
𝑂(𝑛2) is very scary for a graph miner.
Current-day graph mining algorithms
need to be linear in the number of edges,
or else your paper will almost surely be rejected.
What are the implications?
-
Summarising a Graph
Given: a graph
-
Summarising a Graph
Given: a graph
Find: a succinct summary
with possibly
overlapping subgraphs
-
Summarising a Graph
Given: a graph
Find: a succinct summary
with possibly overlapping subgraphs
≈ important graph structures.
-
Community Detection
Adjacency Matrix – Assumed graph
-
Community Detection
Adjacency Matrix – Real graph
-
Summarising a Graph
Fully Automatic Cross Associations
is a nice MDL based algorithm to summarise a matrix.
1) REASSIGN: Given a grid, assign rows and columns
s.t. entropy within the grid is minimal.
(Chakrabarti et al. 2004)
-
Summarising a Graph
Fully Automatic Cross Associations
is a nice MDL based algorithm to summarise a matrix.
1) REASSIGN: Given a grid, assign rows and columns
s.t. entropy within the grid is minimal.
2) CROSSASSOC: Find cluster with highest entropy, split it, run REASSIGN.
Stop when no split reduces the MDL score.
(Chakrabarti et al. 2004)
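To make the objective concrete, here is a toy version of the cost REASSIGN drives down: with rows and columns grouped into a grid, each block can be encoded at roughly its binary entropy. This is a simplified sketch of my own, not the paper's exact MDL score:

```python
import numpy as np

def block_entropy_bits(A, row_groups, col_groups):
    """Bits to encode A given a grid: each block at its binary entropy."""
    total = 0.0
    for r in np.unique(row_groups):
        for c in np.unique(col_groups):
            block = A[np.ix_(row_groups == r, col_groups == c)]
            p = block.mean() if block.size else 0.0
            if 0.0 < p < 1.0:  # pure all-0 / all-1 blocks cost ~0 bits
                total += block.size * (-p * np.log2(p)
                                       - (1 - p) * np.log2(1 - p))
    return total

A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])
rows = np.array([0, 0, 1])     # row group per row
cols = np.array([0, 0, 1, 1])  # column group per column
print(block_entropy_bits(A, rows, cols))  # → 0.0: every block is pure
```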
-
Beyond Cave-men Communities
Traditional community detection
algorithms assume that you interact
only with people in your ‘cave’.
You are assumed not to interact
with others, except if you are one
of few ‘messengers’ between ‘caves’.
That is not very realistic.
(Kang & Faloutsos, ICDM 2011)
-
Slash’n’Burn
Slash’n’Burn finds the node 𝑖 with highest 𝑑_𝑖, removes its edges to 𝑁_𝑖, and recurses.
SLASHBURN:
1. Slash top-𝑘 hubs, burn edges
2. Repeat on the remaining GCC
(Before)
(Kang & Faloutsos, ICDM 2011)
-
Slash’n’Burn
Slash’n’Burn finds the node 𝑖 with highest 𝑑_𝑖, removes its edges to 𝑁_𝑖, and recurses.
SLASHBURN:
1. Slash top-𝑘 hubs, burn edges
2. Repeat on the remaining GCC
(After)
(Kang & Faloutsos, ICDM 2011)
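A compact sketch of the SLASHBURN loop (my own simplification: plain-Python adjacency sets, and slashed hubs plus finished components are simply appended to one ordering, whereas the paper places spokes at the back of the ordering):

```python
from collections import deque

def components(adj, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps

def slashburn(adj, k=1):
    """Order nodes hubs-first: slash top-k hubs, recurse on the GCC."""
    nodes = set(adj)
    order = []
    while nodes:
        # 1. slash the k highest-degree hubs (degree within the remainder)
        hubs = sorted(nodes, key=lambda u: -len(adj[u] & nodes))[:k]
        order.extend(hubs)
        nodes -= set(hubs)
        # 2. set aside every non-giant component; repeat on the GCC
        comps = sorted(components(adj, nodes), key=len)
        for comp in comps[:-1]:
            order.extend(sorted(comp))
            nodes -= comp
    return order

adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}  # 5-node star
order = slashburn(adj, k=1)
print(order[0])  # → 0: the hub is slashed first
```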
-
Beyond Cave-men Communities
Slash’n’Burn applied on the
AS-Oregon graphs shows that
real graphs indeed have structure
beyond cave-men communities!
– but also include those!
A nice side-result is that the
Slash’n’Burned ordered matrix
has lots of ‘empty space’ and
can hence be stored efficiently.
(Kang & Faloutsos, ICDM 2011)
-
Carnegie Mellon University
Korea Advanced Institute of Science and Technology
VoG: Summarizing and Understanding Large Graphs
Danai Koutra
Jilles Vreeken
U Kang
Christos Faloutsos
SDM, 25 April 2014, Philadelphia, USA
-
Main Idea
1) Use a graph vocabulary:
2) Best graph summary
optimal compression (MDL)
-
Main Idea
1) Use a graph vocabulary:
2) Shortest lossless description
optimal compression (MDL)
-
Given a set of models ℳ,
the best model 𝑀 ∈ ℳ is
𝑀 = argmin_{𝑀 ∈ ℳ} 𝐿(𝑀) + 𝐿(𝐷 ∣ 𝑀)
where 𝐿(𝑀) is the number of bits for 𝑀,
and 𝐿(𝐷 ∣ 𝑀) the number of bits for the data using 𝑀.
Minimum Description Length
-
A line 𝑎₁𝑥 + 𝑎₀, or a degree-10 polynomial 𝑎₁₀𝑥¹⁰ + 𝑎₉𝑥⁹ + … + 𝑎₀?
𝐿(𝑀) + 𝐿(𝐷 ∣ 𝑀) trades the cost of the model against the cost of its errors.
MDL example
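The polynomial example can be made concrete with a crude two-part score of my own (fixed bits per coefficient plus bits for the residuals at some precision; not a formal MDL code):

```python
import numpy as np

def two_part_score(x, y, degree, bits_per_param=32.0, precision=0.01):
    """Crude two-part score: model bits + bits to encode the residuals."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    model_bits = bits_per_param * (degree + 1)
    data_bits = np.sum(np.log2(1.0 + np.abs(resid) / precision))
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 2.0 * x + 0.3 + rng.normal(scale=0.05, size=50)  # truly linear data
scores = {d: two_part_score(x, y, d) for d in (1, 5, 10)}
best = min(scores, key=scores.get)
print(best)  # the line wins: extra coefficients cost more than they save
```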
-
Given: a graph 𝐺 with adjacency matrix 𝐴, and a vocabulary Ω
Find: the model 𝑀 minimising 𝐿(𝐺, 𝑀) = 𝐿(𝑀) + 𝐿(𝐸)
Minimum Graph Description
Adjacency 𝐴 ≈ Model 𝑀 + Error 𝐸
-
VoG: Overview
-
VoG: Overview
some criterion
-
Summary
VoG: Overview
-
We need candidate structures…
… How can we get them?
-
Step 1: Graph Decomposition
We can use:
Any decomposition method
We did use/adapt:
SLASHBURN
-
Slash top-k hubs, burn edges
Before
SnB Graph Decomposition
-
Slash top-k hubs, burn edges
candidate structures
After
SnB Graph Decomposition
Notice that the structures can overlap!
-
Slash top-k hubs, burn edges
Repeat on the remaining GCC
GCC
SnB Graph Decomposition
-
We got candidate structures.
Now, how can we ‘label’ them?
-
Step 2: Graph Labeling
-
hub? “best” node split?
“best” node ordering?
missing edges?
Graph Representations
-
hub
Hub: top-degree node; Spokes: the rest
𝐿(star) = 𝐿_N(|𝑠𝑡| − 1) + log 𝑛 + log binom(𝑛 − 1, |𝑠𝑡| − 1) + 𝐿(𝐸⁺) + 𝐿(𝐸⁻)
(# of spokes · hub ID · spoke IDs · extra/missing edges, i.e. the errors)
Star structure; example with 𝑛 = 7 and 6 spokes
DETAILS
Graph Representations
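The structure part of this cost is easy to compute; a sketch with Rissanen's universal integer code 𝐿_N (function names are mine; the error terms 𝐿(𝐸⁺) and 𝐿(𝐸⁻) are omitted):

```python
from math import comb, log2

def L_N(z):
    """Rissanen's universal code length (bits) for an integer z >= 1."""
    bits = log2(2.865064)  # normalising constant c0
    l = log2(z)
    while l > 0:
        bits += l
        l = log2(l)
    return bits

def star_bits(n, num_spokes):
    """Structure bits for a star on n nodes (error terms omitted)."""
    return (L_N(num_spokes)                   # number of spokes, |st| - 1
            + log2(n)                         # which node is the hub
            + log2(comb(n - 1, num_spokes)))  # which nodes are the spokes

# The slide's example: n = 7, a hub with 6 spokes
print(star_bits(7, 6))
```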
-
Max bipartite graph: NP-hard
Heuristic: Belief Propagation with heterophily for node classification (blue/red)
𝐿(bipartite core) = 𝐿_N(# of blue nodes) + 𝐿_N(# of red nodes) + log binom-terms for their IDs + 𝐿(𝐸⁺) + 𝐿(𝐸⁻)
(extra/missing edges, i.e. the errors)
Bipartite graph structure
DETAILS
Graph Representations
-
Longest path: NP-hard
Heuristic: BFS + local search
Chain structure: the node IDs along the path, plus extra/missing edges as errors
Graph Representations
-
Step 2: Graph Labeling
-
-
Step 3: Summary Assembly
-
Step 3: Summary Assembly
Summary
-
Concepts
compression gain (savings) = # bits encoded as noise − # bits encoded as structure
DETAILS
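A toy version of an assembly step built on this notion of gain (my own simplification: VoG's actual heuristics re-evaluate structures against the summary chosen so far, while this sketch just ranks the candidates once):

```python
def assemble(candidates):
    """candidates: (name, bits_as_noise, bits_as_structure) triples."""
    ranked = sorted(candidates, key=lambda s: s[1] - s[2], reverse=True)
    return [name for name, noise, struct in ranked if noise > struct]

candidates = [("star-A", 120.0, 40.0),   # large saving: keep
              ("clique-B", 55.0, 50.0),  # small saving: keep
              ("chain-C", 30.0, 45.0)]   # costs more than it saves: drop
print(assemble(candidates))  # → ['star-A', 'clique-B']
```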
-
-
Concepts
Summary encoding cost
𝐿(𝑀) = 𝐿_N(|𝑀| + 1) + log binom(|𝑀| + |Ω| − 1, |Ω| − 1) + Σ_{𝑠 ∈ 𝑀} ( −log 𝑃(𝑥(𝑠) ∣ 𝑀) + 𝐿(𝑠) )
(# of structures · # of structures per type · for each structure: its type, its connectivity, its encoding length)
-
Step 3: Summary Assembly
(plot: 𝐿(𝐷, 𝑀) vs. number of structures selected)
DETAILS
-
(bar chart: bits needed and unexplained edges for the Plain, Top-10, Top-100, and G&F encodings)
4292729 bits as noise
Real graphs have structure!
(we can save bits by encoding with structures!)
Quantitative Analysis
-
(log-scale chart: number of structures per type for Plain, Top-10, Top-100, and G&F)
Main structure types: Star, Near-Bipartite, Full clique, Full Bipartite, Chain
Quantitative Analysis
-
Quantitative Analysis
(log-scale chart: number of structures per type for Plain, Top-10, Top-100, and G&F; legend: Star, Near-Bipartite, Full clique, Full Bipartite)
Main structure types:
Stars, near- and full-bipartite cores.
Top-3 stars: klay
Top-1 near-bipartite core: ski excursion
Qualitative Analysis: Enron
-
VoG is near-linear in the number of edges of the input graph.
Runtime
-
Future Work
For those of you interested in an MSc or RIL project…
Our current vocabulary is stars, full cliques, full and near bipartite cores, and chains.
But many other structures make sense, for example the “jellyfish” (Tauro, 2001).
-
Future Work
For those of you who might be interested in an MSc or RIL project…
It would be great if we could mine summaries directly from the data,
without pre-mining all candidate structures.
Real graphs show power-law-ish degree distributions;
it would be great if VoG could take that into account.
-
Conclusions
Graphs need summaries: graphs are powerful but difficult to interpret, and far too few (efficient) summary methods are available.
Cross-Associations: a powerful technique to find bi-clusters; heuristic, and improvements exist.
Slash’n’Burn: reorders the nodes of a graph; finds sub-graphs ‘beyond’ cave-men communities.
VoG: summarises graphs with a graph-theoretic vocabulary; first of its kind, but a big stack of heuristics; fast, with good results.
-
Thank you!