On Node Classification in Dynamic Content-based Networks
Charu Aggarwal, IBM T. J. Watson Research Center
Nan Li, University of California, Santa Barbara
Presented by Nan Li
Motivation
[Figure: co-authorship network, year 2001. Author nodes: Ke Wang, Jiawei Han, Jian Pei, Kenneth A. Ross. Keyword annotations shown on the nodes:
• “Data Mining”, “Databases”, “Clustering”, “Sequential Pattern”, …
• “Algorithms”, …
• “Sequential Pattern”, “Data Mining”, “Systems”, “Rules”, …
• “Mining”, “Efficient”, “Association Rules”, …]
Motivation
[Figure: co-authorship network, year 2002. Author nodes: Ke Wang, Jiawei Han, Jian Pei, Marianne Winslett, Xifeng Yan, Philip S. Yu. Keyword annotations shown on the nodes:
• “Data Mining”, “Web”, “Sequential Pattern”, …
• “Pattern”, “Data Mining”, “Stream”, “Semantics”, …
• “Association Rules”, “Data Mining”, “Ranking”, “Web”, …
• “Parallel”, “Automated”, “Data”, …
• “Pattern Mining”, …
• “Clustering”, “Distributed”, “Databases”, “Mining”, …]
Motivation
[Figure: co-authorship network, year 2003. Author nodes: Ke Wang, Jiawei Han, Jian Pei, Charu Aggarwal, Xifeng Yan, Philip S. Yu. Keyword annotations shown on the nodes:
• “Mining”, “Databases”, “Clustering”, “Sequential Pattern”, …
• “Sequential Pattern”, “Mining”, “Systems”, “Rules”, …
• “Mining”, “Efficient”, “Association”, …
• “Graph”, “Databases”, “Sequential Mining”, …
• “Algorithms”, “Association Rules”, “Clustering”, “Wireless”, “Web”, …
• “Clustering”, “Indexing”, “Knowledge”, “XML”, …]
Motivation
Networks annotated with an increasing amount of text
• Citation networks, co-authorship networks, product databases with large amounts of text content, etc.
• Highly dynamic
Node classification problem
• Often arises in network scenarios in which the underlying nodes are associated with content.
• A subset of the nodes in the network may be labeled.
• Can we use these labeled nodes, in conjunction with the structure, to classify the nodes that are not currently labeled?
Applications
Challenges
Information networks are very large
• Scalable and efficient
Many such networks are dynamic
• Updatable in real time
• Self-adaptable and robust
Such networks are often noisy
• Intelligent and selective
Heterogeneous correlations in such networks
[Figure: three small example graphs, each over nodes A, B and C]
Outline
Related Works
DYCOS: DYnamic Classification algorithm with cOntent and Structure
• Semi-bipartite content-structure transformation
• Classification using a series of text and link-based random walks
• Accuracy analysis
Experiments
• NetKit-SRL
Conclusion
Related Works
Link-based classification (Bhagat et al., WebKDD 2007)
• Local iterative
• Global nearest neighbor
Content-only classification (Nigam et al., Machine Learning 2000)
• Each object’s own attributes only
Relational classification (Sen et al., Technical Report 2004)
• Each object’s own attributes
• Attributes and known labels of the neighbors
Collective classification (Macskassy & Provost, JMLR 2007; Sen et al., Technical Report 2004; Chakrabarti, SIGMOD 1998)
• Local classification
  • Flexible: ranging from a decision tree to an SVM
• Approximate inference algorithms
  • Iterative classification
  • Gibbs sampling
  • Loopy belief propagation
  • Relaxation labeling
DYCOS in a Nutshell
Node classification in a dynamic environment
Dynamic network: the entire network at time t is denoted by G_t = (N_t, A_t, T_t), where N_t is the set of nodes, A_t the set of edges, and T_t the set of labeled nodes.
Problem statement: classify the unlabeled nodes (N_t \ T_t) using both the content and the structure of the network, for all time stamps, in an efficient and accurate manner.
Both the structure and the content of the network change over time!
[Figure: snapshots of the network at times t, t+1 and t+2]
Semi-bipartite Transformation
Text-augmented representation
• Leveraged for a random walk-based classification model that uses both links and text
• Two partitions: structural nodes and word nodes
• Semi-bipartite: one partition (the structural nodes) is allowed to have edges both within the set and to nodes in the other partition, whereas word nodes connect only to structural nodes.
Efficient updates upon dynamic changes
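The transformation above can be sketched as a small data structure. This is a minimal illustration and not the authors' implementation: the class name `SemiBipartiteGraph`, its method names, and the whitespace tokenizer are all assumptions made for the example. It keeps the two inverted lists (words per node, nodes per word) so that a newly arriving node or edge only touches the entries it affects.

```python
from collections import defaultdict


class SemiBipartiteGraph:
    """Hypothetical sketch: structural nodes plus word nodes.

    Structural nodes may link to each other and to word nodes;
    word nodes link only to structural nodes.
    """

    def __init__(self, keywords):
        self.keywords = set(keywords)        # word-node partition
        self.neighbors = defaultdict(set)    # structural node -> structural neighbors
        self.words_of = defaultdict(set)     # structural node -> its word nodes
        self.nodes_of = defaultdict(set)     # word node -> structural nodes containing it

    def add_node(self, node, text):
        """Register a structural node and connect it to its keyword nodes."""
        self.neighbors[node]  # touching the defaultdict registers the node
        for word in text.lower().split():    # naive tokenizer (assumption)
            if word in self.keywords:
                self.words_of[node].add(word)
                self.nodes_of[word].add(node)

    def add_edge(self, u, v):
        """Add an undirected structural edge; only two adjacency sets change."""
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)


g = SemiBipartiteGraph({"mining", "graph"})
g.add_node("a", "Graph Mining methods")
g.add_node("b", "Sequential pattern mining")
g.add_edge("a", "b")
```

Because each update modifies only the adjacency sets and inverted-list entries of the nodes involved, the cost of an insertion does not depend on the size of the rest of the network.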
Random Walk-Based Classification
Random walks over augmented structure
• Starting node: the unlabeled node to be classified.
• Structural hop
  • A random jump from a structural node to one of its neighbors
• Content-based multi-hop
  • A jump from a structural node to another through implicit common word nodes
• Structural parameter: p_s
Classification
• Classify the starting node with the most frequently encountered class label during the random walks
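The walk described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the DYCOS implementation: the function names are hypothetical, p_s is assumed to be the probability of taking a structural hop, and the content hop picks a shared word uniformly at random instead of ranking the top q content-hop neighbors. The parameters l (number of walks) and h (walk length) follow the deck's notation.

```python
import random
from collections import Counter


def content_hop(node, words_of, nodes_of, rng):
    """Jump to another structural node through a shared word node, or None."""
    words = sorted(words_of.get(node, ()))
    if not words:
        return None
    word = rng.choice(words)
    candidates = sorted(n for n in nodes_of.get(word, ()) if n != node)
    return rng.choice(candidates) if candidates else None


def classify(start, neighbors, words_of, nodes_of, labels,
             l=50, h=5, p_s=0.7, seed=0):
    """Return the class label seen most often over l walks of h hops each."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(l):
        node = start
        for _ in range(h):
            if rng.random() < p_s and neighbors.get(node):
                nxt = rng.choice(sorted(neighbors[node]))      # structural hop
            else:
                nxt = content_hop(node, words_of, nodes_of, rng)
                if nxt is None and neighbors.get(node):        # fall back to a link
                    nxt = rng.choice(sorted(neighbors[node]))
            if nxt is None:
                break                                          # dead end
            node = nxt
            if node in labels:
                votes[labels[node]] += 1
    return votes.most_common(1)[0][0] if votes else None


# Node "x" has no links but shares the word "mining" with labeled node "b".
label = classify(
    "x",
    neighbors={},
    words_of={"x": {"mining"}, "b": {"mining"}},
    nodes_of={"mining": {"x", "b"}},
    labels={"b": "DM"},
)
```

In the usage example the only reachable labeled node carries the label "DM", so the walk can only collect votes for that class, which shows how content hops let a node with no links still be classified.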
Gini-Index & Inverted Lists
Discriminative keywords
• A set M_t of the top m words with the highest discriminative power is used to construct the word node partition.
• Gini-index
  $$p_i(w) = \frac{n_i(w)}{\sum_{j=1}^{k} n_j(w)}$$
  $$G(w) = \sum_{j=1}^{k} p_j(w)^2$$
  • The value of G(w) lies in the range (0, 1]; it reaches 1 when w occurs in only one class.
  • Words with a higher value of gini-index are more discriminative for classification purposes.
Inverted lists
• Inverted list of keywords for each node
• Inverted list of nodes for each keyword
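As a worked illustration of the two formulas above, the sketch below computes G(w) from per-class counts and selects the top m words. It assumes that n_i(w) counts the labeled documents of class i that contain w; the deck does not define n_i(w), so this reading and the function names are assumptions.

```python
from collections import Counter, defaultdict


def gini_index(class_counts):
    """G(w) = sum_j p_j(w)^2, with p_j(w) = n_j(w) / sum_i n_i(w)."""
    total = sum(class_counts.values())
    return sum((n / total) ** 2 for n in class_counts.values())


def top_discriminative_words(docs, m):
    """docs: iterable of (words, label). Return the m words with highest G(w)."""
    counts = defaultdict(Counter)            # word -> Counter of class labels
    for words, label in docs:
        for w in set(words):                 # count each word once per document
            counts[w][label] += 1
    # Sort by descending gini-index, breaking ties alphabetically.
    ranked = sorted(counts, key=lambda w: (-gini_index(counts[w]), w))
    return ranked[:m]


docs = [
    (["graph", "mining"], "A"),
    (["mining"], "B"),
    (["graph"], "A"),
]
best = top_discriminative_words(docs, 1)
```

Here "graph" appears only in class A, so G = 1, while "mining" is split evenly between two classes, so G = 0.5. Note that a word seen in a single document also scores 1, so a practical selection would probably combine the index with a minimum-frequency filter.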
Analysis
Why do we care?
• DYCOS is essentially using Monte-Carlo sampling to sample various paths from each unlabeled node.
  • Advantage: fast approach
  • Disadvantage: loss of accuracy
• Can we present analysis on how accurate DYCOS sampling is?
Probabilistic bound: bi-class classification
• Two classes C_1 and C_2
• E[Pr[C_1]] = f_1, E[Pr[C_2]] = f_2, f_1 - f_2 = b ≥ 0
• $\Pr[\text{misclassification}] \le e^{-l b^2 / 2}$
Probabilistic bound: multi-class classification
• k classes {C_1, C_2, …, C_k}
• b-accurate
• $\Pr[b\text{-accurate}] \ge 1 - (k-1)\, e^{-l b^2 / 2}$
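To make the bounds concrete, the sketch below evaluates them and inverts the multi-class bound to estimate how many random walks l are enough for a target failure probability. The function names and the tolerance symbol `delta` are assumptions introduced for the example; solving (k-1) e^{-l b^2/2} ≤ delta for l gives l ≥ 2 ln((k-1)/delta) / b^2.

```python
import math


def misclassification_bound(l, b):
    """Bi-class upper bound exp(-l * b^2 / 2) on the misclassification probability."""
    return math.exp(-l * b * b / 2)


def b_accuracy_lower_bound(l, b, k):
    """Multi-class lower bound 1 - (k - 1) * exp(-l * b^2 / 2)."""
    return 1 - (k - 1) * math.exp(-l * b * b / 2)


def walks_needed(b, k, delta):
    """Smallest integer l with (k - 1) * exp(-l * b^2 / 2) <= delta (requires b > 0)."""
    if b <= 0 or not 0 < delta < 1 or k < 2:
        raise ValueError("need b > 0, 0 < delta < 1 and k >= 2")
    return max(0, math.ceil(2 * math.log((k - 1) / delta) / (b * b)))


# Two classes, gap b = 0.5, failure probability at most 5%.
l_min = walks_needed(b=0.5, k=2, delta=0.05)
```

For this example 2 ln(20) / 0.25 is about 23.97, so 24 walks suffice, and the bound at l = 24 is e^{-3}, roughly 0.0498. The required number of walks grows only logarithmically in k but quadratically in 1/b, so a small gap between classes is the expensive case.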
Experimental Results
Data sets
• CORA: a set of research papers and the citation relations among them.
  • Each node is a paper and each edge is a citation relation.
  • A total of 12,313 English words are extracted from the paper titles.
  • We segment the data into 10 synthetic time periods.
• DBLP: a set of authors and their collaborations.
  • Each node is an author and each edge is a collaboration.
  • A total of 194 English words in the domain of computer science are used.
  • We segment the data into 36 annual graphs from year 1975 to year 2010.
Experimental Results
NetKit-SRL toolkit
• An open-source and publicly available toolkit for statistical relational learning in networked data (Macskassy and Provost, 2007).
• Instantiations of previous relational and collective classification algorithms
• Configuration
  • Local classifier: domain-specific class prior
  • Relational classifier: network-only multinomial Bayes classifier
  • Collective inference: relaxation labeling
Parameters
• m: the number of most discriminative words
• a: the size constraint of the inverted list for each keyword
• q: the number of top content-hop neighbors
• l: the number of random walks
• h: the length of each random walk
• p_s: the structure parameter
The results demonstrate that DYCOS improves the classification accuracy over NetKit by 7.18% to 17.44%, while reducing the runtime to only 14.60% to 18.95% of that of NetKit.
Experimental Results
[Figure: DYCOS vs. NetKit on CORA. Panels: classification accuracy comparison; classification time comparison]
Experimental Results
[Figure: Parameter sensitivity of DYCOS, on CORA data and DBLP data. Panels: sensitivity to m, l and h (a = 30, p_s = 70%); sensitivity to a, m and p_s (l = 3, h = 5)]
Experimental Results
Dynamic Updating Time: CORA (10 synthetic time periods)
Time Period          1      2      3      4      5      6      7      8      9      10
Update Time (Sec.)   0.019  0.013  0.015  0.013  0.023  0.015  0.014  0.014  0.013  0.011

Dynamic Updating Time: DBLP (annual graphs, 1975 to 2010)
Time Period          1975-1989  1990-1999  2000-2010
Update Time (Sec.)   0.03107    0.22671    1.00154
Conclusion
We propose an efficient, dynamic and scalable method for node classification in dynamic networks.
We provide analysis on how accurate the proposed method will be in practice.
We present experimental results on real data sets, and show that our algorithms are more effective and efficient than competing algorithms.
Analysis
Bi-class misclassification bound: $e^{-l b^2 / 2}$
Multi-class b-accuracy bound: $1 - (k-1)\, e^{-l b^2 / 2}$
Experimental Results
[Figure: DBLP. Panels: classification accuracy comparison; classification time comparison]
Experimental Results
[Figure: Parameter sensitivity. Panels: sensitivity to a, l and h; sensitivity to m, l and h; sensitivity to m, a and p_s; sensitivity to a, m and p_s]