TRANSCRIPT
HUMAN-CENTRIC
DATA EXPLORATION
PART 1/5: MOTIVATION, BACKGROUND & OUTLINE
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
Based on joint work with many others (see references and final slide of this lecture)
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
SUBJECTIVITY AND VISUALIZATION
MOTIVATION
SUBJECTIVITY = KEY
Three motivating examples:
1. Frequent itemset mining
‒ Individually frequent items = probably frequent together
2. Visualizing high-dimensional data
‒ Outliers = high variance, so why maximize it in PCA?
‒ Interaction = key
3. Graph embedding
‒ High degree nodes = probably embedded centrally
ASSOCIATION ANALYSIS / ITEMSET MINING
KDD abstracts dataset: a binary documents × words matrix.

Subjective interestingness ranking              Support × size (area) ranking
(prior info on row & column sums)
Itemset                                   #docs   Itemset              #docs
svm, support, machin, vector                 25   data, paper            389
state, art                                   39   algorithm, propose     246
unlabelled, labelled, supervised, learn      10   data, mine             312
associ, rule, mine                           36   base, method           202
gene, express                                25   result, show           196
frequent, itemset                            28   problem                373
large, social, network, graph                15   data, set              279
column, row                                  13   approach               330
algorithm, order, magnitud, faster           12   model                  301
paper, propos, algorithm, real,
synthetic, data                              27   present                296
KDD abstracts dataset: a binary documents × words matrix.

Subjective interestingness ranking              Subjective interestingness ranking
(prior info on row & column sums)               (additionally, prior info on keyword tiles)
Itemset                                   #docs   Itemset                      #docs
svm, support, machin, vector                 25   art, state                      39
state, art                                   39   row, column, algorithm          12
unlabelled, labelled, supervised, learn      10   unlabelled, labelled, data      14
associ, rule, mine                           36   answer, question                18
gene, express                                25   precis, recal                   14
VISUALIZING HIGH-DIMENSIONAL DATA
CONDITIONAL NETWORK EMBEDDINGS
EXPLORING DATA
The search for interesting patterns in data:
• Association analysis: frequency, lift, confidence, leverage, coverage, ...
• Dimensionality reduction: PCA, ICA, projection pursuit, Laplacian Eigenmaps, t-SNE, LLE, ...
• Graph embedding: Node2Vec, Path2Vec, MetaPath2Vec, ...
• Clustering: k-means clustering, hierarchical clustering, mixture of Gaussians, spectral clustering, ...
• Community detection: stochastic block modelling, modularity, k-cores, quasi-cliques, dense subgraphs, ...
• Privacy-preserving data publishing: discernibility, generalization height, average group size, ...
• ...
Zillions of 'interestingness measures', a.k.a. objective functions, quality functions, utility functions, cost functions, ...
THE CHALLENGE
Zillions of interestingness measures = good & bad
‒ Good: more options!
‒ Bad: hard to see the forest for the trees...
Challenge:
‒ Formalise true interestingness!
‒ With minimal user interaction
‒ Without requiring user expertise
MOTIVATING EXAMPLE
Community detection: what makes for an interesting community?
‒ Densely connected?
‒ Large?
‒ Few neighbours outside the community?
‒ Unrelated to certain known 'affiliations'?
‒ ...
THE FORSIED APPROACH: SUBJECTIVITY!
Traditional view: the data mining researcher defines Interestingness(pattern) as a function of the data alone.
FORSIED view: the data analyst matters too: Interestingness(pattern, analyst).
Interestingness = subjective
MOTIVATING EXAMPLE
Community detection:
The user states expectations / beliefs
‒ Formalized as a 'background distribution'
Any 'pattern' that contrasts with this and is easy to describe = subjectively interesting
OUTLINE
Part 1: Introduction and motivation (10 mins)
Part 2: The FORSIED framework (40 mins)
Part 3: Binary matrices, graphs, and relational data (40 mins)
BREAK
Part 4: Numeric and mixed data (including high-dimensional data visualization) (50 mins)
Part 5: Advanced topics, outlook & conclusions (25 mins)
Q&A (15 mins)

Feel free to interrupt for questions anytime.
HUMAN-CENTRIC DATA EXPLORATION
PART 2/5: THE FORSIED FRAMEWORK
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
GENERIC FRAMEWORK
De Bie, KDD 2011
De Bie, DAMI 2011
FORSIED(*) framework: from data to patterns, mediated by a model of the user's beliefs (the "background distribution" $P$).
(*) Formalizing Subjective Interestingness in Exploratory Data mining

‒ The data is a point $x$ in a data space $\Omega$.
‒ A pattern is the claim that $x \in \Omega'$ for some $\Omega' \subseteq \Omega$.
‒ $P$ evolves! As patterns are shown to the user, $\Omega$ is narrowed down to $\Omega'$, $\Omega''$, ..., and $P$ is updated to $P'$, $P''$, ...
‒ Interestingness is subjective: it depends on both the pattern and the user.

Example instantiation:
• Data: the adjacency matrix of a graph under study
• Patterns: the claim that a specified set of nodes are densely connected
• Prior beliefs: the degrees of the nodes, a known block structure, ...
• Interestingness: subjective information density
• Overlapping communities!

$$\mathrm{Interestingness}(\Omega', P) = \frac{\mathrm{InformationContent}(\Omega', P)}{\mathrm{DescriptionLength}(\Omega')}, \qquad \mathrm{InformationContent}(\Omega', P) = -\log P(\Omega')$$

In short: SI = IC / DL.
THE FINE PRINT
Initial background distribution $P$?
‒ The maximum entropy distribution subject to the prior belief constraints:
$$\max_P \; E_{X\sim P}\big[-\log P(X)\big] \quad \text{s.t.} \quad E_{X\sim P}\big[f(X)\big] = c_f \;\;(\forall f)$$
Updated background distribution $P'$ given a pattern $x\in\Omega'$?
‒ $P$ conditioned onto the event $x\in\Omega'$:
$$P'(\Omega'') = \frac{P(\Omega''\cap\Omega')}{P(\Omega')} \quad\Rightarrow\quad \underbrace{-\log P'(x)}_{\text{IC in data after pattern}} = \underbrace{-\log P(x)}_{\text{IC in data before pattern}} + \underbrace{\log P(\Omega')}_{\text{minus IC of pattern under }P}$$
Description length?
‒ Smaller if the pattern is a better explanation
‒ Essentially problem-dependent
WHY MAXIMUM ENTROPY / CONDITIONING?
Most unbiased estimate
‒ Informal... no bias other than the constraints
Assume a cautious / pessimistic user
‒ A user who expects to be very surprised
Leads to the most robust estimate of the true subjective information content
‒ Information content estimated with the maxent $P$ will never differ much from the information content w.r.t. the true prior belief of the user
A FIRST INSTANTIATION: COMMUNITY DETECTION
van Leeuwen, De Bie, Spyropoulou, Mesnage, MLj, 2016
COMMUNITY DETECTION IN NETWORKS
Data: a graph, given by its adjacency matrix $\mathbf{A}$ with edge indicator variables $a_{ij}$
Prior beliefs: 1. overall density, or 2. vertex degrees
MaxEnt distribution:
$$P(\mathbf{A}) = \prod_{i>j} P_{i,j}(a_{ij}), \qquad P_{i,j}(a_{ij}) = \frac{\exp\!\big(a_{ij}\,(\lambda_i+\lambda_j)\big)}{1+\exp(\lambda_i+\lambda_j)}$$
[Figure: adjacency matrix, with regions where $P_{i,j}(a_{ij})$ is small vs. large]
COMMUNITY DETECTION IN NETWORKS
Data: graph; prior beliefs: 1. overall density, or 2. vertex degrees
Pattern: dense subgraphs, i.e. claims of the form
$$\sum_{i,j\in\text{subgraph}} a_{ij} \ge k$$
COMMUNITY DETECTION IN NETWORKS
Data: graph; prior beliefs: 1. overall density, or 2. vertex degrees
Pattern: dense subgraphs
Interestingness:
$$\frac{-\log P(\text{pattern})}{\mathrm{DescriptionLength}(\text{pattern})}$$
COMMUNITY DETECTION IN NETWORKS
Data: graph; prior beliefs: 1. overall density, or 2. vertex degrees
Pattern: dense subgraphs
Interestingness: density vs. size; under prior beliefs 2., preferably low-degree nodes
[Figures: the most interesting community given prior 1. vs. given prior 2.]
Hill-climbing for search; update $P$ after each pattern
[Figure: communities found in a music network, labelled Rock, Trance, Indie, Bhangra, Gospel, Country, Hip hop / grime, Afro pop, UK garage, Hip hop]
TAKE-AWAYS
1. What is the data?
2. Determine a suitable pattern syntax
3. What are the prior beliefs? (= what is irrelevant to the user?) Compute the background distribution $P$ using maximum entropy
4. Formulate the subjective interestingness:
$$\mathrm{Interestingness}(\Omega', P) = \frac{\mathrm{InformationContent}(\Omega', P)}{\mathrm{DescriptionLength}(\Omega')}, \qquad \mathrm{InformationContent}(\Omega', P) = -\log P(\Omega')$$
5. Design an algorithm to optimize it
6. Find out how to condition the background distribution on a pattern
THE BACKGROUND DISTRIBUTION: MAXENT
MAXENT MODEL S.T. DEGREE BELIEFS
($\mathbf{A}$ = adjacency matrix with $a_{ij}$ in row $i$ and column $j$)
$$\max_P \;\; -\sum_{\mathbf{A}} P(\mathbf{A})\log P(\mathbf{A}) \quad \text{(entropy)}$$
$$\text{s.t.} \quad \sum_{\mathbf{A}} P(\mathbf{A})\sum_{j=1}^{n} a_{ij} = d_i \;\;(\forall i=1{:}n) \;\;\text{(expected degree constraints)}, \qquad \sum_{\mathbf{A}} P(\mathbf{A}) = 1 \;\;\text{(normalization)}$$
Convex!
Lagrangian, with Lagrange multipliers $\lambda_i$ and $\mu$:
$$L(P,\boldsymbol{\lambda},\mu) = -\sum_{\mathbf{A}} P(\mathbf{A})\log P(\mathbf{A}) + \sum_{i=1}^{n}\lambda_i\Big(\sum_{\mathbf{A}} P(\mathbf{A})\sum_{j=1}^{n} a_{ij} - d_i\Big) + \mu\Big(\sum_{\mathbf{A}} P(\mathbf{A}) - 1\Big)$$
Optimality condition:
$$\frac{\partial}{\partial P(\mathbf{A})} L(P,\boldsymbol{\lambda},\mu) = -\log P(\mathbf{A}) - 1 + \sum_{i,j=1}^{n}\lambda_i a_{ij} + \mu = 0$$
MAXENT MODEL S.T. DEGREE BELIEFS
So:
$$P(\mathbf{A}) = \exp(\mu-1)\cdot\exp\Big(\sum_{i,j=1}^{n}\lambda_i a_{ij}\Big) = \frac{1}{Z(\boldsymbol{\lambda})}\prod_{i>j}\exp\!\big((\lambda_i+\lambda_j)\,a_{ij}\big) = \prod_{i>j}\frac{\exp\!\big((\lambda_i+\lambda_j)\,a_{ij}\big)}{1+\exp(\lambda_i+\lambda_j)} = \prod_{i>j} P_{i,j}(a_{ij})$$
A product of independent Bernoulli distributions! Thanks to the fact that the prior belief constraint is on a (weighted) sum of the $a_{ij}$.
MAXENT MODEL S.T. DEGREE BELIEFS
To find the optimal values of the Lagrange multipliers, solve the dual $\min_{\boldsymbol{\lambda}} L(P,\boldsymbol{\lambda})$, where $P$ is given as above. After some calculations:
$$\min_{\boldsymbol{\lambda}} \;\sum_{i>j}\log\!\big(1+\exp(\lambda_i+\lambda_j)\big) - \sum_{i=1}^{n}\lambda_i d_i$$
MAXENT MODEL S.T. DEGREE BELIEFS
Can be solved using gradient descent:
$$\frac{\partial}{\partial\lambda_k}\Big(\sum_{i>j}\log\!\big(1+\exp(\lambda_i+\lambda_j)\big) - \sum_{i=1}^{n}\lambda_i d_i\Big) = \underbrace{\sum_{i\ne k}\frac{\exp(\lambda_i+\lambda_k)}{1+\exp(\lambda_i+\lambda_k)}}_{\text{expected degree of node }k} - \underbrace{d_k}_{\text{required expected degree of node }k}$$
Lots of computational speed-ups possible... (a minimal fitting sketch follows below)
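As an illustration, here is a minimal NumPy sketch of this dual gradient descent (not the authors' optimized code; the learning rate, tolerance and toy degrees are arbitrary choices):

```python
import numpy as np

def fit_degree_maxent(d, lr=0.1, tol=1e-6, max_iter=10000):
    """Fit the Lagrange multipliers lambda so that the expected degrees
    under P_ij(1) = sigmoid(lambda_i + lambda_j) match the target degrees d."""
    n = len(d)
    lam = np.zeros(n)
    for _ in range(max_iter):
        # edge probabilities P_ij(1); diagonal zeroed out (no self-loops)
        P = 1.0 / (1.0 + np.exp(-(lam[:, None] + lam[None, :])))
        np.fill_diagonal(P, 0.0)
        grad = P.sum(axis=1) - d  # expected degree minus required degree
        lam -= lr * grad          # gradient descent on the dual
        if np.abs(grad).max() < tol:
            break
    return lam, P

# toy example: 4 nodes with target expected degrees 1, 2, 2, 1
lam, P = fit_degree_maxent(np.array([1.0, 2.0, 2.0, 1.0]))
print(P.sum(axis=1))  # approximately [1, 2, 2, 1]
```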
TAKE-AWAYS
Constraints on the expected value of weighted sums of the $a_{ij}$:
$$E_{\mathbf{A}\sim P}\Big[\sum_{(i,j)\in I} f_{ij}\,a_{ij}\Big] = c,$$
where $f_{ij}$ and $c$ are constants and $I$ is a set of index pairs, lead to convenient product distributions.
Other examples for graphs:
‒ Overall density (trivial)
‒ Densities of particular blocks (e.g. the block of nodes with the same affiliation)
‒ Assortativity (approximately)
‒ ...
THE INTERESTINGNESS
INFORMATION CONTENT
Information content: $\mathrm{InformationContent}(\text{pattern}, P) = -\log P(\text{pattern})$
Pattern: "the number of edges between a given set of nodes $W\subseteq V$ is larger than or equal to a specified $k_W$". A bit tricky...
Cliques as a special case: "the set of nodes $W\subseteq V$ forms a clique". Then:
$$P(\text{pattern}) = \prod_{i>j\in W} P_{i,j}(1), \qquad \mathrm{InformationContent}(\text{pattern}, P) = -\sum_{i>j\in W}\log P_{i,j}(1)$$
Larger if $|W|$ is larger and if the $P_{i,j}(1)$ for $i,j\in W$ are smaller.
INFORMATION CONTENT
Pattern (general case): "the number of edges between a given set of nodes $W\subseteq V$ is larger than or equal to a specified $k_W$"
Probability of at least $k_W$ successes in $n_W = \binom{|W|}{2}$ Bernoulli trials? Approximated by:
$$P(\text{pattern}) \approx \exp\Big(-n_W\,\mathrm{KL}\big(\tfrac{k_W}{n_W}\,\big\|\,p_W\big)\Big),$$
where $p_W$ is the average probability $P_{i,j}(1)$ over the pairs $i,j\in W$. And thus:
$$\mathrm{InformationContent}(\text{pattern}, P) = -\log P(\text{pattern}) \approx n_W\,\mathrm{KL}\big(\tfrac{k_W}{n_W}\,\big\|\,p_W\big)$$
Larger if $|W|$ (and thus $n_W$) is larger, $p_W$ is smaller, and $k_W$ is larger.
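This approximation is easy to evaluate; a minimal sketch (assuming $0 < k_W/n_W < 1$):

```python
import numpy as np

def bernoulli_kl(q, p):
    """KL(q || p) between two Bernoulli distributions, for 0 < q < 1."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def ic_dense_subgraph(k_W, n_W, p_W):
    """Approximate IC of 'at least k_W edges among n_W node pairs', using
    P(pattern) ~ exp(-n_W * KL(k_W/n_W || p_W)) as on the slide."""
    return n_W * bernoulli_kl(k_W / n_W, p_W)

# e.g. 10 nodes (45 pairs), at least 40 edges, background edge probability 0.2
print(ic_dense_subgraph(k_W=40, n_W=45, p_W=0.2))
```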
DESCRIPTION LENGTH
For cliques: describe the set $W$ with a constant (to describe $|W|$) plus a term linear in $|W|$ (to describe its elements):
$$\mathrm{DescriptionLength}(\text{pattern}) = \alpha|W| + \beta$$
For dense subgraphs: the constant $\beta$ also describes $k_W$.
INTERESTINGNESS
Putting things together:
$$\mathrm{Interestingness}(\text{pattern}, P) = \frac{-\sum_{i>j\in W}\log P_{i,j}(1)}{\alpha|W|+\beta}$$
A bit more complex for general dense subgraphs:
$$\mathrm{Interestingness}(\text{pattern}, P) \approx \frac{n_W\,\mathrm{KL}\big(\tfrac{k_W}{n_W}\,\big\|\,p_W\big)}{\alpha|W|+\beta}$$
Hard to optimize!
‒ Exact search for small graphs
‒ An effective hill climber for large graphs
TAKE-AWAYS
No compromises w.r.t. interestingness
Often leads to hard search problems
Question: is this intrinsic to genuine subjective interestingness?
UPDATING THE BACKGROUND DISTRIBUTION
UPDATING THE BACKGROUND DISTRIBUTION
Given a pattern, update the background distribution by conditioning on the pattern.
Easy to do for cliques $W$:
‒ Set $P'_{i,j}(a_{ij}=1) = 1$ for $i,j\in W$
Fast to approximate for (non-clique) dense subgraphs $W$:
‒ Set $P'_{i,j}(a_{ij}) \propto P_{i,j}(a_{ij})\cdot\exp(\lambda_W\,a_{ij})$ for $i,j\in W$, with $\lambda_W$ chosen such that the expected density of $W$ is $k_W$
‒ Remains a product of Bernoullis (a sketch follows below)
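A minimal sketch of this dense-subgraph update (the exponential-tilting form follows the slide; finding $\lambda_W$ by bisection, and the bracket [-50, 50], are ad-hoc choices):

```python
import numpy as np

def tilt(p, lam):
    """Exponentially tilted Bernoulli probabilities: P'(1) proportional to p*exp(lam)."""
    return p * np.exp(lam) / (1 - p + p * np.exp(lam))

def update_dense_subgraph(p_W, k_W, lo=-50.0, hi=50.0, iters=100):
    """Given background edge probabilities p_W for the pairs inside W, find
    lam_W by bisection so that the expected number of edges equals k_W,
    and return the updated probabilities P'_ij(1)."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if tilt(p_W, mid).sum() < k_W:
            lo = mid  # expected density too low: tilt further upward
        else:
            hi = mid
    return tilt(p_W, (lo + hi) / 2.0)

# e.g. 6 node pairs with background probability 0.2, at least 5 observed edges
print(update_dense_subgraph(np.full(6, 0.2), k_W=5).sum())  # ~5
```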
TAKE-AWAYS
Updating can be trivial
Otherwise, it is often easy to do approximately
HUMAN-CENTRIC DATA EXPLORATION
PART 3/5: BINARY MATRICES, GRAPHS, RELATIONAL DATA
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
OUTLINE OF THIS PART
(Community detection)
Itemsets
Relational patterns
Connecting trees
Network embedding
ITEMSETS
ITEMSETS
Data: a binary matrix $\mathbf{X}\in\{0,1\}^{m\times n}$
[Toy example: a transaction matrix of six customers (Alice, Bob, Charlie, Denise, Eve, Frankie) over four items (Beer, Diapers, Lipstick, Carrier), with row sums 3, 3, 2, 2, 2, 2 and column sums 4, 3, 2, 5]
Prior beliefs: row and column sums
Background distribution $P$:
‒ Nonnegative and properly normalized
‒ Has the correct marginals
‒ Many solutions!? The 'unbiased' choice: the Maximum Entropy distribution
De Bie, DMKD 2011
ITEMSETS
Data: a binary matrix $\mathbf{X}\in\{0,1\}^{m\times n}$; prior beliefs: row and column sums
Background distribution $P$, the 'unbiased' Maximum Entropy distribution (a convex optimization problem):
$$P(\mathbf{X}) = \prod_{i,j} P_{i,j}(x_{ij}), \qquad P_{i,j}(x_{ij}) = \frac{\exp\!\big(x_{ij}\,(\mu_i+\lambda_j)\big)}{1+\exp(\mu_i+\lambda_j)}$$
De Bie, DMKD 2011
ITEMSETS
Data: a binary matrix $\mathbf{X}\in\{0,1\}^{m\times n}$
Prior beliefs: uniform at the observed density (14 ones in 24 cells in the toy example):
$$P_{i,j}(1) = \tfrac{14}{24} \;\;\forall i,j, \qquad P_{i,j}(x_{ij}) = \frac{\exp(x_{ij}\,\lambda)}{1+\exp(\lambda)}, \qquad \lambda = 0.3365$$
De Bie, DMKD 2011
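To see where $\lambda = 0.3365$ comes from: with a uniform prior at density $p = 14/24$, the Bernoulli parametrization above gives
$$P_{i,j}(1) = \frac{e^{\lambda}}{1+e^{\lambda}} = p \quad\Rightarrow\quad \lambda = \log\frac{p}{1-p} = \log\frac{14/24}{10/24} = \log 1.4 \approx 0.3365.$$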
ITEMSETS
Data: a binary matrix $\mathbf{X}\in\{0,1\}^{m\times n}$; prior beliefs: row and column sums
Patterns: tiles (a set of rows and a set of columns whose intersection is full of ones)
A large support may not be interesting!
De Bie, DMKD 2011
ITEMSETS
Data: a binary matrix; prior beliefs: row and column sums; patterns: tiles
A large surface may not be interesting either!
De Bie, DMKD 2011
ITEMSETS
Data: a binary matrix; prior beliefs: row and column sums; patterns: tiles
Less expected given smaller row and column margins: Bingo!
De Bie, DMKD 2011
ITEMSETS
Data: a binary matrix; prior beliefs: row and column sums; patterns: tiles
As stated, SI = IC / DL:
$$\mathrm{IC}(Z) = -\log\Pr(Z) = \sum_{(i,j)\in Z} -\log p_{i,j}, \qquad \mathrm{DL}(Z) = a\,(\#\text{rows} + \#\text{columns}) + b$$
De Bie, DMKD 2011
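A minimal sketch of scoring a tile under this measure (the background probabilities $p_{ij}$ are taken as given, e.g. from a margin-based maxent fit; the constants a and b are user-set, as on the slide):

```python
import numpy as np

def tile_si(P, rows, cols, a=1.0, b=1.0):
    """Subjective interestingness SI = IC / DL of a tile of ones.
    P: matrix of background probabilities p_ij; rows, cols: tile indices."""
    p = P[np.ix_(rows, cols)]
    ic = -np.log(p).sum()                  # IC(Z): sum over the tile of -log p_ij
    dl = a * (len(rows) + len(cols)) + b   # DL(Z) = a(#rows + #columns) + b
    return ic / dl

# uniform background at the observed density 14/24 of the toy example
P = np.full((6, 4), 14 / 24)
print(tile_si(P, rows=[0, 1], cols=[0, 1, 3]))
```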
ITEMSETS
Data: a binary matrix; prior beliefs: row and column sums; patterns: tiles
Iterate: showing a tile specifies that there are ones there, so condition $P$ on this and mine again.
De Bie, DMKD 2011
ITEMSETS
KDD abstracts dataset (a binary documents × words matrix): see the ranking tables in Part 1, comparing subjective interestingness (prior info on row & column sums, optionally also on keyword tiles) against support × size.
De Bie, DMKD 2011
ITEMSETS
Data: a binary matrix $\mathbf{X}\in\{0,1\}^{m\times n}$; prior beliefs: row and column sums; background distribution $P$
Patterns: tiles of ones and zeros
Extension: noisy tiles
‒ IC is straightforward
‒ DL depends on the skew (entropy) of the distribution of ones and zeros within the tile
Kontonasios & De Bie, SDM 2010
ITEMSETS
Algorithmic approach: ?
‒ Not studied extensively, but a special case of relational patterns (see the next instantiation)
Interesting result: if we can mine the best pattern at every iteration, then this greedy iteration approximates the best achievable total IC for a set of tiles with that (cumulative) DL to within a factor $1-\frac{1}{e}\;(\approx 0.63)$.
Kontonasios & De Bie, SDM 2010
RELATIONAL PATTERNS
RELATIONAL PATTERN MINING
Data: a relational database (e.g. Customers, Items, Attributes)
Pattern: connected complete subgraphs
Prior beliefs: the degree of each node in each relationship
Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)
Guns, Aknin, Lijffijt, De Bie (ICDM 2016)
RELATIONAL PATTERN MINING
Data: a relational database (e.g. Users, Films, Genres, Actors)
Pattern: connected complete subgraphs
Prior beliefs: the degree of each node in each relationship
‒ The prior factorizes over relationships: equivalent to the itemset case
Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)
Guns, Aknin, Lijffijt, De Bie (ICDM 2016)
RELATIONAL PATTERN MINING
RMiner
RMINER
Algorithmic approach: enumerate + rank, based on fixpoint enumeration (Boley et al. 2010)
[Figure: toy database with entities A1-A3, B1-B3, C1-C2 and their relationships]
Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
RMINER
Algorithmic approach: enumerate + rank
1) Branch on any entity
2) Compute the closure:
‒ Add entities that are in all supersets
‒ If an invalid entity is added, backtrack
‒ Repeat until there are no valid candidates
‒ If there are invalid candidates: stop
‒ Else: output the pattern; backtrack and declare the entity invalid
Etc. etc. (a minimal single-relation sketch follows below)
Spyropoulou, De Bie, Boley (DMKD 2014, DS 2013)
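For intuition, here is a minimal sketch of closure-based (fixpoint) enumeration in the single-relation special case, i.e. closed itemset mining; RMiner generalizes this idea to multiple relationships (an illustrative reimplementation, not the RMiner code):

```python
import numpy as np

def closed_itemsets(X):
    """Enumerate all closed itemsets of a binary matrix X (rows x items)
    by branching on an item and computing the closure."""
    m, n = X.shape
    results = []

    def closure(mask):
        rows = np.all(X[:, mask], axis=1)   # rows supporting the itemset
        if not rows.any():
            return None                     # no support: prune this branch
        return np.all(X[rows], axis=0)      # items common to all those rows

    def recurse(closed, start):
        results.append(tuple(np.flatnonzero(closed)))
        for i in range(start, n):
            if closed[i]:
                continue
            cand = closed.copy()
            cand[i] = True
            c = closure(cand)
            # skip empty support; skip duplicates (closure added an item < i)
            if c is not None and not (c[:i] & ~closed[:i]).any():
                recurse(c, i + 1)

    recurse(closure(np.zeros(n, dtype=bool)), 0)
    return results

X = np.array([[1, 1, 0, 1],
              [1, 1, 0, 1],
              [1, 0, 1, 1]])
print(closed_itemsets(X))  # [(0, 3), (0, 1, 3), (0, 2, 3)]
```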
RELATIONAL PATTERN MINING
[Figure: N-RMiner extends RMiner to n-ary relations; P-N-RMiner additionally handles structured attribute values such as numeric, circadian, and taxonomy elements]
P-N-RMINER: FISHER DATA
Lijffijt, Spyropoulou, Kang, De Bie (DSAA 2015, IJDSA 2016)
CP-RMINER
Algorithmic approach: branch & bound in CP (top 1)
Guns, Aknin, Lijffijt, De Bie (ICDM 2016)
CONNECTING TREES
CONNECTING SUBTREES
Data: a graph $G=(V,E)$, with $V$ known and $E\subseteq V\times V$ unknown
[Figure: example graph with nodes A-I]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)
CONNECTING SUBTREES
Data: a graph $G=(V,E)$, with $V$ known and $E\subseteq V\times V$ unknown
Prior beliefs: the degrees of the vertices, (batch) time order
Patterns: a subtree connecting the query vertices $Q$
[Figure: example graph with nodes A-I and a connecting subtree]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017)
CONNECTING SUBTREES
Which is more interesting?
[Figure: three candidate subtrees connecting the query vertices: the first is unsurprising; the second offers no information compression; the third is the most interesting, since E and C have small in-degree]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)
TIME DIFFERENCE PRIOR
Consider a citation network:
‒ Papers arrive in batches (per year)
‒ Earlier papers cannot / rarely cite newer papers
Only a limited increase in computational cost to fit the background distribution
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)
EXAMPLE ON CITATION DATA
3 recent best papers from ACM SIGKDD
[Figures: connecting trees found under a uniform prior vs. under a time and degree prior]
Adriaens, Lijffijt, De Bie (ECMLPKDD 2017, DAMI 2019)
CONNECTING SUBTREES
Algorithmic approach: greedily construct a tree with maximum depth $k$
‒ Not so straightforward!
‒ We investigated various heuristics
CONNECTING SUBTREES
Empirically: the best strategy depends on the size of the query set $Q$
CONNECTING SUBTREES
[Figure: trees connecting NIPS/PODS authors, under a uniform prior vs. a degree prior; repeated from Akoglu et al. (SDM 2013)]
NETWORK EMBEDDING
CONDITIONAL NETWORK EMBEDDINGS
Data: a graph $G$ with adjacency matrix $\mathbf{A}$
Pattern: a metric embedding $\mathbf{X}$
‒ Probabilistic info about the graph
‒ $P(\lVert \boldsymbol{x}_i - \boldsymbol{x}_j \rVert \mid a_{ij})$ = half-normal
Prior beliefs: $P_{i,j}(a_{ij})$
‒ overall density
‒ degrees
‒ block structure
‒ assortativity
‒ ...
Find the maximum-likelihood embedding: $\max_{\mathbf{X}} P(G \mid \mathbf{X})$
Kang, Lijffijt, De Bie (ICLR 2019)
EXAMPLE ON STUDENTDB
[Figures: conditional network embeddings of the studentDB graph]
CONDITIONAL NETWORK EMBEDDINGS
Algorithmic approach: gradient descent with an estimated gradient (positive and negative sampling)
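CNE combines the prior $P_{i,j}$ with the half-normal distance likelihoods via Bayes' rule; a minimal sketch of the resulting link probability (the spread parameters s1 < s2 for linked vs. unlinked pairs are arbitrary illustrative values, not those of the paper):

```python
import numpy as np

def cne_link_posterior(dist, p_prior, s1=1.0, s2=2.0):
    """P(a_ij = 1 | embedding distance), combining the prior P_ij with
    half-normal likelihoods of the distance given a link (spread s1)
    vs. given no link (spread s2 > s1)."""
    def half_normal(d, s):
        # half-normal density on d >= 0 with scale s
        return np.sqrt(2 / np.pi) / s * np.exp(-d**2 / (2 * s**2))
    num = p_prior * half_normal(dist, s1)
    return num / (num + (1 - p_prior) * half_normal(dist, s2))

# nearby points are probably linked, distant points probably not
print(cne_link_posterior(0.5, p_prior=0.2), cne_link_posterior(3.0, p_prior=0.2))
```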
SUMMARY PART 3
SUMMARY
An informative prior can be useful in many settings
For binary data (all previous examples), fitting and updating the background model is computationally easy
Mining SI patterns is challenging; (for now) tailor-made algorithms are necessary
HUMAN-CENTRIC DATA EXPLORATION
PART 4/5: NUMERIC AND MIXED DATA
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
OUTLINE OF THIS PART
Attributed subgraphs
Subgroup discovery in real-valued (target) data
Dimensionality reduction
Time series
ATTRIBUTED SUBGRAPHS
COHESIVE SUBGRAPHS
Data: an attributed graph
Prior beliefs: global attribute statistics
[Figure: example graph with vertices A-I]

Vertex  Shops  Events  Pubs
A       1      5       3
B       5      0       2
C       2      3       10
D       1      6       4
E       4      0       2
F       3      1       5
G       6      2       9
H       2      1       3
I       0      1       3

Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and available online at the MLG @ KDD 2018 workshop]
COHESIVE SUBGRAPHS
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
Example pattern: "these locations (A, C, D) have many events"
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
COHESIVE SUBGRAPHS
Example pattern: "these locations (F, G, H) have many pubs & shops"
A pattern is easy to interpret if it is local. How to quantify this?
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
COHESIVE SUBGRAPHS
Example pattern: "the vertices around G have many pubs & shops": easy to describe
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
COHESIVE SUBGRAPHS
Example pattern: "the vertices around C have many events": easy to describe
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
COHESIVE SUBGRAPHS
Patterns: cohesive subgraphs with exceptional attributes (CSEA)
‒ Like subgroup discovery
‒ Describe the vertex set with a rule: an intersection of neighbourhoods, minus exceptions
‒ Attributes below / above a threshold, as compared to expectation
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
COHESIVE SUBGRAPHS
Data: an attributed graph; prior beliefs: global attribute statistics
Background distribution: geometric per cell, using the row/column margins
[Table: attribute values alongside their interestingness under the background distribution; e.g. for Shops: A 0.5, B 0.04, C 0.2, D 0.8, E 0.06, F 0.13, G 0.07, H 0.1, I 1.0 (values for illustration only)]
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
COHESIVE SUBGRAPHS
Patterns found (example):
‒ P1: +food
‒ P2: +professional, +nightlife, +outdoors, +college
‒ P3: +nightlife, +food, -college
Bendimerad, Mel, Lijffijt, Plantevit, Robardet, De Bie (forthcoming)
[Presented and available online at the MLG @ KDD 2018 workshop]
COHESIVE SUBGRAPHS
Algorithmic approach: we tried
‒ Option 1: enumerate with P-N-RMiner, then rank
‒ Option 2: a dedicated branch-and-bound algorithm
SUBGROUP DISCOVERY (EXCEPTIONAL MODEL MINING)
LOCATION & SPREAD PATTERNS
Data: $\mathbf{X}\in\mathbb{R}^{m\times n}$
‒ Meta-data: any type
‒ Target data: a real-valued matrix
Prior beliefs: mean and variance statistics
‒ Typically overall, but can be for subsets
Patterns: a description together with
‒ a mean vector, or
‒ a projection and the magnitude of variance
[Figure: 2-D scatter plot (attribute 1 vs. attribute 2), highlighting objects with property x, objects with property y, and objects with properties z1 and z2]
Lijffijt, Kang, Duivesteijn, Puolamäki, Oikarinen, De Bie (ICDE 2018)
SINGLE TARGET DIMENSION: CRIME IN THE US
UCI Crime data: violent crime rate (per 1k pop)
Description: areas with a high incidence of unmarried mothers (coded by the CBS as 'percentage illegitimate')
Target: a high average crime rate
ECOLOGY: PRESENT SPECIES AS TARGETS
Description of pattern (a): "mean temperature in March ≤ −1.68 °C"
Description of pattern (b): "average monthly rainfall in August ≤ 47.62 mm"
Description of pattern (c): "average monthly rainfall in October ≤ 45.25 mm and mean temperature of wettest quarter ≥ 16.32 °C"
Mean of the target attributes for (a): −wood mouse; +mountain hare, moose, red-backed vole, wood lemming
GERMAN POLITICS VS. DEMOGRAPHICS
Description of pattern (a): "few children". Target: LEFT is popular.
Description of pattern (b): "large mid-aged population". Target: GREEN is relatively popular.
Description of pattern (c): "many children". Target: LEFT is unpopular.
Moreover, the target of pattern (a): "SPD and CDU negatively correlated". Due to the popularity of LEFT, SPD and CDU are in tougher competition here.
SUBGROUP DISCOVERY
For reference only: [slide with the prior and the IC/DL formulas]
SUBGROUP DISCOVERY
Algorithmic approach:
Find descriptions using beam search (see the generic sketch below)
‒ Fairly standard in SD; implementation from Cortana (Meeng & Knobbe, BeneLearn 2011)
‒ Using the SI objective directly
For projections:
‒ A manifold learning problem (ManOpt toolbox)
Interesting results on speeding up the update of the background distribution
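For intuition, a minimal generic beam-search loop of the kind used for subgroup descriptions (the refinement operator, quality function, beam width and search depth are all placeholders supplied by the caller; the actual implementation used is Cortana's):

```python
def beam_search(refine, quality, width=10, depth=3):
    """Generic beam search over descriptions.
    refine(desc) yields one-condition extensions of desc;
    quality(desc) scores a description (here: the SI objective)."""
    beam = [()]                    # start from the empty description
    best = ((), quality(()))
    for _ in range(depth):
        candidates = [d for desc in beam for d in refine(desc)]
        candidates.sort(key=quality, reverse=True)
        beam = candidates[:width]  # keep only the top-width refinements
        if beam and quality(beam[0]) > best[1]:
            best = (beam[0], quality(beam[0]))
    return best
```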
DIMENSIONALITY REDUCTION
PROJECTION PATTERNS
Data: a real-valued matrix $\mathbf{X}\in\mathbb{R}^{m\times n}$
Prior beliefs: global mean and (co-)variance structure
Patterns: projections
DATA PROJECTIONS
Finding informative projections:
‒ De Bie, Lijffijt, Santos-Rodriguez, Kang (ESANN 2016)
‒ Kang, Lijffijt, Santos-Rodriguez, De Bie (KDD 2016, DMKD 2018)
Accounting for user feedback:
‒ Puolamäki, Kang, Lijffijt, De Bie (ECMLPKDD 2016)
‒ Kang, Puolamäki, Lijffijt, De Bie (ECMLPKDD 2016)
‒ Puolamäki, Oikarinen, Kang, Lijffijt, De Bie (ICDE 2018)
SI COMPONENT ANALYSIS (SICA)
The problem is parametrized by a resolution parameter
‒ Go from density to probability for projections
De Bie, Lijffijt, Santos-Rodriguez, Kang (ESANN 2016)
Kang, Lijffijt, Santos-Rodriguez, De Bie (KDD 2016, DMKD 2018)
SICA
Effect of the prior beliefs:
‒ Expectation on mean/variance: recovers the PCA objective
‒ Expectation on the magnitude of variance: a more robust variant of PCA (which we call t-PCA)
‒ Graph of point similarities: next slides
SICA GRAPH PRIOR
German voting percentages per district, accounting for the east-west divide
[Figures: projections with and without the graph prior]
SI DATA EXPLORER (SIDE)
https://users.ugent.be/~bkang/software/side_dev/entry.html
[Screenshots of the interactive tool]
SICA/SIDE
Algorithmic approach:
‒ SICA (uniform/graph prior) leads to eigenvalue problems
‒ SICA t-PCA and SIDE use a manifold learning toolbox
C-T-SNE
Non-linear dimensionality reduction based on t-SNE
Prior beliefs: known clusters
Result: the c-t-SNE visualization may make new clusters salient
Bo Kang, Dario Garcia Garcia, Jefrey Lijffijt, Raul Santos Rodriguez, Tijl De Bie (arXiv, 2019)
TIME SERIES
TIME SERIES MOTIFS
Data: a time series $\mathbf{X}\in\mathbb{R}^{1\times n}$
Prior beliefs: mean, variance, covariance (of the first-order difference)
Patterns: a motif template
Deng, Lijffijt, Kang, De Bie (Entropy, 2019)
TIME SERIES MOTIFS
IC quantifies the reduction in uncertainty:
‒ The likelihood of the data increases by inserting the template into the background distribution at the matched locations
‒ Due to the covariance, expectations change globally
‒ Updating is computationally (somewhat) costly
TIME SERIES MOTIFS
Algorithmic approach:
‒ Construct a template from 3 or 4 subsequences using constraint programming and a relaxed objective
‒ Greedily add subsequences using the exact objective
‒ Prune dissimilar subsequences from the search after branching and selection of the initial set
AND MORE
AND MORE
Past:
‒ Data clustering
‒ Biclustering
‒ Exceptional model mining / subgroup discovery
‒ Time series segments
Ongoing / future:
‒ Backbone of a network
‒ Insightful summaries of an attributed network
‒ Network embeddings
‒ ...
With all past and current members of the FORSIED team and Jilles Vreeken, Antonis Matakos, Dario Garcia-Garcia, Siegfried Nijssen, ...
HUMAN-CENTRIC DATA EXPLORATION
PART 5/5: ADVANCED TOPICS, OUTLOOK & CONCLUSIONS
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
RELATED WORK
RELATED WORK
MDL / compression
MAP estimation
Hypothesis testing / randomization techniques
'Subjective interestingness' research by Tuzhilin, Silberschatz, Padmanabhan (1990s)
LIMITATIONS
Not all new information is interesting!
‒ The FORSIED framework does not directly address this
‒ Hence the importance of feedback in combination with FORSIED (see the work on dimensionality reduction)
Description length:
‒ How to determine it?
‒ Is shorter really always better?
FORSIED'S ORIGINS, AND OUTLOOK
[Timeline figures: Remodiscovery (2006) and DISTILLER (2009); MINI (ECML-PKDD 2007), TKDD 2007, KDD 2009, DAMI 2014]
Since 2017 also Jefrey's Pegasus2 fellowship, and a large and a small FWO grant
OUTLOOK
Improve theoretical understanding:
‒ Estimating the background distribution (information geometry)
‒ Cognitive aspects (cognitive science)
‒ User interface (human-computer interaction)
‒ Visualization (visual analytics)
‒ Algorithmic aspects (optimisation theory)
‒ Safeguarding sensitive information & fairness
More instantiations:
‒ Data types (linked data / knowledge graphs!) / pattern types / prior belief types
Applications:
‒ Bioinformatics
‒ Web and social media mining
DATA MINING WITHOUT SPILLING THE BEANS
PRIVACY-PRESERVING DATA PUBLISHING
Anonymization is insufficient to protect sensitive attributes (linkage attack). Generalization!

Anonymized patient database (ZIP, D.O.B. and Sex are quasi-identifiers):
ZIP    D.O.B.      Sex  Diagnosis
94701  01/02/1968  F    Healthy
94701  06/03/1990  F    Obesity
94702  11/08/1991  M    Healthy
94703  03/09/1979  M    Prostate cancer
94703  07/10/1951  F    Healthy
94704  10/02/1973  M    Obesity
94705  20/12/2001  F    Obesity

Voting records database:
ZIP    D.O.B.      Sex  Full name
94701  01/02/1968  F    Mary Smith
94701  06/03/1990  F    Patricia Johnson
94702  11/08/1991  M    James Jones
94703  03/09/1979  M    John Brown
94703  07/10/1951  F    Linda Davis
94704  10/02/1973  M    Robert Miller
94705  20/12/2001  F    Barbara Wilson

After generalizing the quasi-identifiers:
ZIP      D.O.B.   Sex  Diagnosis
94701    '51-'01  F    Healthy
94701    '51-'01  F    Obesity
94702-5  '51-'01  M    Healthy
94702-5  '51-'01  M    Prostate cancer
94702-5  '51-'01  F    Healthy
94702-5  '51-'01  M    Obesity
94702-5  '51-'01  F    Obesity
PRIVACY-PRESERVING DATA PUBLISHING
k-anonymity: the minimum equivalence class size is ≥ k
‒ Homogeneity attack
‒ Background knowledge attack
l-diversity: ≥ l sensitive attribute values are well represented in each equivalence class
‒ Hard to achieve for imbalanced data
‒ Skewness attack
‒ Similarity attack
t-closeness: the sensitive attribute value distribution in each equivalence class is the same as in the overall data
Source: https://www.linkedin.com/pulse/dont-throw-baby-out-bathwater-didi-gurfinkel/
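As a quick illustration of the k-anonymity and (distinct) l-diversity definitions above, a minimal check over (quasi-identifier tuple, sensitive value) records (a hypothetical helper, not from the slides):

```python
from collections import defaultdict

def k_and_l(records):
    """records: list of (quasi_identifier_tuple, sensitive_value) pairs.
    Returns (k, l): the minimum equivalence-class size, and the minimum
    number of distinct sensitive values within any equivalence class."""
    classes = defaultdict(list)
    for qi, sensitive in records:
        classes[qi].append(sensitive)
    k = min(len(vals) for vals in classes.values())
    l = min(len(set(vals)) for vals in classes.values())
    return k, l

# the generalized toy table above: three equivalence classes
records = [
    (("94701", "'51-'01", "F"), "Healthy"),
    (("94701", "'51-'01", "F"), "Obesity"),
    (("94702-5", "'51-'01", "M"), "Healthy"),
    (("94702-5", "'51-'01", "M"), "Prostate cancer"),
    (("94702-5", "'51-'01", "M"), "Obesity"),
    (("94702-5", "'51-'01", "F"), "Healthy"),
    (("94702-5", "'51-'01", "F"), "Obesity"),
]
print(k_and_l(records))  # (2, 2): the table is 2-anonymous and 2-diverse
```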
OTHER KINDS OF SENSITIVE INFORMATION
‒ Existence of a tight community in a network
‒ Existence of a cluster in data
‒ Frequency of particular items / size of particular transactions in a database of purchases
Preserve this while:
‒ publishing a generalized version of the database,
‒ identifying dense subgraphs,
‒ finding clusters,
‒ mining frequent itemsets, etc.
GENERAL STRATEGY
Data: $\boldsymbol{x}$. Data mining goal: reveal as much as possible about $\boldsymbol{x}$.
Sensitive aspects: $f(\boldsymbol{x})\in\Phi$, e.g.
‒ the sensitive attributes' values
‒ the density of a specified subgraph
‒ the existence of a tight cluster
‒ the frequencies of all items
Goal: reveal as little as possible about $f(\boldsymbol{x})$.
Updating $P\to P'$ results in updating the induced distribution $P_f\to P'_f$.
‒ More complex than conditioning! $P_f(f(\boldsymbol{x}))$ can be larger or smaller than $P'_f(f(\boldsymbol{x}))$.
Considering the user's prior beliefs: 'subjective' measures.
[Figure: the data space $\Omega$ with pattern $\Omega'$, mapped by $f$ to $\Phi$ with $\Phi' = f(\Omega')$]
TRADING OFF TWO THINGS
1. The subjective information content of a pattern
2. A criterion on the background distribution about the sensitive aspects:
‒ Information content left in the sensitive aspects (surprise in the actual value of the sensitive attributes): $-\log P'_f(f(\boldsymbol{x}))$
‒ Entropy (uncertainty about the sensitive attributes): $-E_{\boldsymbol{x}\sim P_f}\big[\log P'_f(f(\boldsymbol{x}))\big]$
‒ Knowledge gained about the actual value of the sensitive aspects: $-\log\dfrac{P_f(f(\boldsymbol{x}))}{P'_f(f(\boldsymbol{x}))}$
‒ Degree of belief that the sensitive aspects are within a specified set $\Phi^*\subseteq\Phi$: $P'_f(\Phi^*)$
‒ ...
EXAMPLES
PRIVACY-PRESERVING DATA PUBLISHING
• Random synthetic dataset:
‒ 5 real-valued quasi-identifiers (QI, e.g. zip code, DOB); generalization through intervals
‒ 1 sensitive attribute (SA, e.g. sexual orientation, ethnicity) with 3 possible values
‒ 1 other attribute (OA, e.g. sense of well-being, productivity) with 3 possible values
‒ 100 data records
• Trade-off:
‒ information about the data (other & sensitive attributes)
‒ knowledge gained about the sensitive attribute
• Generalize the quasi-identifiers into 5 equivalence classes
• Ensure the maximum information content about any sensitive attribute value is small
[Figures: conditional distributions within the 5 equivalence classes over the 3 sensitive attribute values, over the 3 other attribute values, and their joint conditional distribution]
DENSE SUBGRAPHS WITHOUT SPILLING BEANS
• Random network:
‒ 2 non-overlapping communities
‒ a 3rd community overlapping both
‒ the 3rd is sensitive: the analyst should remain surprised by its presence
• Task: identify (non-)dense subgraphs without spilling the beans on the 3rd community
• Approaches (resulting from the general strategy): deceive, or conceal
[Figures: the background distribution initially, after both community patterns, after both community patterns and a deception pattern, and with both community patterns partially concealed]
TAKE-AWAYS
FORSIED ideas can be used for quantifying sensitive information disclosure
Key point: sensitive information disclosure is subjective
More work is needed to understand how to make this practical
CONCLUSIONS
OVERALL CONCLUSIONS
A generic approach for designing methods for exploring data
‒ Several successes
‒ Sometimes more challenging, mostly due to algorithmic issues
Key take-away:
‒ Model what's not interesting (= prior beliefs), show what's complementary (= subjectively interesting), using information theory
New horizons?
‒ Privacy and sensitive information
www.forsied.net
“Data Mining without Spilling the Beans: Preserving more than Privacy alone” (research project funded by the FWO): Tijl De Bie, Jefrey Lijffijt
“Exploring Data: Theoretical Foundations and Applications to Web, Multimedia, and Omics Data” (Odysseus project funded by the FWO): Tijl De Bie
“Formalizing Subjective Interestingness in Data mining” (ERC project FORSIED): Tijl De Bie
“Personalised, interactive, and visual exploratory mining of patterns in complex data” (FWO [Pegasus]2 Marie Skłodowska-Curie Fellowship): Jefrey Lijffijt
Acknowledgements go to: Jefrey Lijffijt, Bo Kang, Wouter Duivesteijn, Achille Aknin, Holly Silk, Raul Santos-Rodriguez, Eirini Spyropoulou, Akis Kontonasios, Paolo Simeone, Robin Vandaele, Florian Adriaens, Tijl De Bie, Xi Chen, Junning (Lemon) Deng, Ahmad Mel, Alexandru Mara, Maryam Fanaeepour, Maarten Buyl, + lots of collaborators...
We are recruiting! www.forsied.net / aida.ugent.be
THANKS!
TIME FOR Q&A
MINING SUBJECTIVELY INTERESTING PATTERNS IN DATA
SUPPLEMENTARY MATERIAL
Tijl De Bie – slides in collaboration with Jefrey Lijffijt, Ghent University
DEPARTMENT ELECTRONICS AND INFORMATION SYSTEMS (ELIS)
RESEARCH GROUP IDLAB
www.forsied.net
REFERENCES
Adriaens, Lijffijt, De Bie: Subjectively Interesting Connecting Trees. ECML/PKDD (2) 2017: 53-69
De Bie, Lijffijt, Santos-Rodriguez, Kang: Informative Data Projections: A Framework and Two Examples. ESANN 2016: 635-640
De Bie: Subjective Interestingness in Exploratory Data Mining. IDA 2013: 19-31
De Bie: Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min. Knowl. Discov. 23(3): 407-446 (2011)
De Bie: An information theoretic framework for data mining. KDD 2011: 564-572
De Bie: Subjectively Interesting Alternative Clusters. MultiClust@ECML/PKDD 2011: 43-54
Deng, Lijffijt, Kang, De Bie: Subjectively Interesting Motifs in Time Series. AALTD@ECML/PKDD 2018
Guns, Aknin, Lijffijt, De Bie: Direct Mining of Subjectively Interesting Relational Patterns. ICDM 2016: 913-918
Kang, Lijffijt, De Bie: Conditional Network Embeddings. ICLR 2019
Kang, Lijffijt, Santos-Rodríguez, De Bie: SICA: subjectively interesting component analysis. Data Min. Knowl. Discov. 32(4): 949-987 (2018)
Kang, Lijffijt, Santos-Rodriguez, De Bie: Subjectively Interesting Component Analysis: Data Projections that Contrast with Prior Expectations. KDD 2016: 1615-1624
Kang, Puolamäki, Lijffijt, De Bie: A Tool for Subjective and Interactive Visual Data Exploration. ECML/PKDD (3) 2016: 3-7
Kontonasios, De Bie: Subjectively interesting alternative clusterings. Machine Learning 98(1-2): 31-56 (2015)
Kontonasios, De Bie: An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases. SDM 2010: 153-164
Kontonasios, Vreeken, De Bie: Maximum Entropy Models for Iteratively Identifying Subjectively Interesting Structure in Real-Valued Data. ECML/PKDD (2) 2013: 256-271
Kontonasios, Vreeken, De Bie: Maximum Entropy Modelling for Assessing Results on Real-Valued Data. ICDM 2011: 350-359
van Leeuwen, De Bie, Spyropoulou, Mesnage: Subjective interestingness of subgraph patterns. Machine Learning 105(1): 41-75 (2016)
Lijffijt, Kang, Duivesteijn, Puolamäki, Oikarinen, De Bie: Subjectively Interesting Subgroup Discovery on Real-valued Targets. IEEE ICDE 2018: to appear
Lijffijt, Spyropoulou, Kang, De Bie: P-N-RMiner: a generic framework for mining interesting structured relational patterns. I. J. Data Science and Analytics 1(1): 61-76 (2016)
Lijffijt, Spyropoulou, Kang, De Bie: P-N-RMiner: A generic framework for mining interesting structured relational patterns. DSAA 2015: 1-10
Puolamäki, Kang, Lijffijt, De Bie: Interactive Visual Data Exploration with Subjective Feedback. ECML/PKDD (2) 2016: 214-229
Puolamäki, Oikarinen, Kang, Lijffijt, De Bie: Interactive Visual Data Exploration with Subjective Feedback: An Information-Theoretic Approach. IEEE ICDE 2018: to appear
Spyropoulou, De Bie, Boley: Interesting pattern mining in multi-relational data. Data Min. Knowl. Discov. 28(3): 808-849 (2014)
Spyropoulou, De Bie, Boley: Mining Interesting Patterns in Multi-relational Data with N-ary Relationships. Discovery Science 2013: 217-232
LINKS TO SOFTWARE
R-MINER(S)
Original (fastest for full enumeration): https://bitbucket.org/BristolDataScience/rminer/
N-RMiner (supports n-ary relations): https://bitbucket.org/BristolDataScience/n-rminer/
P-N-RMiner (supports structured attributes): https://bitbucket.org/BristolDataScience/p-n-rminer/
CP-RMiner (top-1 RMiner pattern, iteratively, fast): https://bitbucket.org/ghentdatascience/cp/
CONNECTING TREES
https://bitbucket.org/ghentdatascience/interestingtreespublic/
DENSE SUBGRAPHS (COMMUNITIES)
http://patternsthatmatter.org/software.php#ssgminer
NETWORK EMBEDDING
https://bitbucket.org/ghentdatascience/cne-public/
SUBGROUP DISCOVERY
https://bitbucket.org/ghentdatascience/sisd-public/
DIMENSIONALITY REDUCTION
SICA: http://users.ugent.be/~bkang/software/sica/sica.zip
SIDE (online tool): http://users.ugent.be/~bkang/software/side_dev/index.html
SIDE (MaxEnt R version): http://kaip.iki.fi/sider.html
CLIPPR: https://bitbucket.org/ghentdatascience/clippr/
ANYTHING MISSING?
Not all (source) code has been published; please ask if you are interested in something that is missing!