mining heterogeneous information networks
TRANSCRIPT
-
7/31/2019 Mining Heterogeneous Information Networks
1/176
1
MiningHeterogeneousMiningHeterogeneous
InformationNetworksInformation
Networks
JiaweiHan
YizhouSun
XifengYan PhilipS.Yu
UniversityofIllinoisatUrbanaChampaign
Universityof
California
at
Santa
Barbara
UniversityofIllinoisatChicago
Acknowledgements:NSF,ARL,NASA,AFOSR(MURI),Microsoft,IBM,Yahoo!,Google,HPLab&Boeing
July12,2010
ACMSIGKDDConferenceTutorial,Washington,D.C.,July25,2010
-
7/31/2019 Mining Heterogeneous Information Networks
2/176
2
OutlineOutline
Motivation: WhyMiningHeterogeneousInformationNetworks?
PartI: Clustering,RankingandClassification
Clusteringand
Ranking
in
Information
Networks
ClassificationofInformationNetworks
PartII: DataQualityandSearchinInformationNetworks
DataCleaning
and
Data
Validation
by
InfoNet
Analysis
SimilaritySearchinInformationNetworks
PartIII:AdvancedTopicsonInformationNetworkAnalysis
RoleDiscoveryandOLAPinInformationNetworks
MiningEvolutionandDynamicsofInformationNetworks
Conclusions
-
7/31/2019 Mining Heterogeneous Information Networks
3/176
3
OutlineOutline
Motivation: WhyMiningHeterogeneousInformationNetworks?
PartI: Clustering,RankingandClassification
Clusteringand
Ranking
in
Information
Networks
ClassificationofInformationNetworks
PartII: DataQualityandSearchinInformationNetworks
DataCleaning
and
Data
Validation
by
InfoNet
Analysis
SimilaritySearchinInformationNetworks
PartIII:AdvancedTopicsonInformationNetworkAnalysis
RoleDiscoveryandOLAPinInformationNetworks
MiningEvolutionandDynamicsofInformationNetworks
Conclusions
-
7/31/2019 Mining Heterogeneous Information Networks
4/176
4
WhatAreInformationNetworks?WhatAreInformationNetworks? Information
network:
A
network
where
each
node
represents
an
entity
(e.g.,
actorinasocialnetwork)andeachlink(e.g.,tie)arelationshipbetween
entities
Eachnode/link
may
have
attributes,
labels,
and
weights
Linkmaycarryrichsemanticinformation
Homogeneousvs.heterogeneousnetworks
Homogeneousnetworks
Singleobjecttypeandsinglelinktype
Singlemodelsocialnetworks(e.g.,friends)
WWW:
a
collection
of
linked
Web
pages Heterogeneous,multitypednetworks
Multipleobjectandlinktypes
Medical
network:
patients,
doctors,
disease,
contacts,
treatments Bibliographicnetwork:publications,authors,venues
-
7/31/2019 Mining Heterogeneous Information Networks
5/176
5
UbiquitousInformationNetworksUbiquitousInformationNetworks
Graphsand
substructures
Chemicalcompounds,computervisionobjects,circuits,XML
Biologicalnetworks
Bibliographicnetworks:DBLP,ArXiv,PubMed,
Socialnetworks:Facebook>100millionactiveusers
WorldWide
Web
(WWW):
>3billion
nodes,
>50
billion
arcs
Cyberphysicalnetworks
Yeast prot eininteract ion net w ork
An I n t e r n et W eb Co-aut hor netw ork Social netw ork sit es
-
7/31/2019 Mining Heterogeneous Information Networks
6/1766
Homogeneousvs.HeterogeneousNetworksHomogeneousvs.HeterogeneousNetworks
Conference-Author Net w orkCo-aut hor Netw ork
-
7/31/2019 Mining Heterogeneous Information Networks
7/1767
DBLP:AnInterestingandFamiliarNetworkDBLP:AnInterestingandFamiliarNetwork
DBLP:
Acomputer
science
publication
bibliographic
database
1.4Mrecords(papers),0.7Mauthors,5Kconferences,
Willthisdatabasediscloseinterestingknowledgeabout
computerscience
research?
Whatarethepopularresearchfields/subfieldsinCS?
WhoaretheleadingresearchersonDBorXQueries?
Howdo
the
authors
in
this
subfield
collaborate
and
evolve?
HowmanyWeiWangsinDBLP,whichpaperdonebywhich?
WhoisSergy Brins supervisorandwhen?
WhoareverysimilartoChristosFaloutsos?
Allthesekindsofquestions,andpotentiallymuchmore,canbe
nicely
answered
by
the
DBLP
InfoNet How? Exploringthepoweroflinksininformationnetworks!
-
7/31/2019 Mining Heterogeneous Information Networks
8/1768
Homo.vs.Hetero.:DifferencesinDBHomo.vs.Hetero.:DifferencesinDBInfoNetInfoNet MiningMining Homogeneous
networks
can
often
be
derived
from
their
original
heterogeneousnetworks
Coauthornetworkscanbederivedfromauthorpaper
conferencenetworks
by
projection
on
authors
only
Papercitationnetworkscanbederivedfromacomplete
bibliographicnetworkwithpapersandcitationsprojected
HeterogeneousDB
InfoNet carries
richer
information
than
its
correspondingprojectedhomogeneousnetworks
TypedheterogeneousInfoNet vs.nontypedhetero.InfoNet (i.e.,
notdistinguishing
different
types
of
nodes)
TypednodesandlinksimplyamorestructuredInfoNet,and
thusoftenleadtomoreinformativediscovery
Ouremphasis:
Mining
structured information
networks!
-
7/31/2019 Mining Heterogeneous Information Networks
9/1769
WhyMiningHeterogeneousInformationNetworks?WhyMiningHeterogeneousInformationNetworks?
Mostdatasets
can
be
organized or
transformed into
a
structured heterogeneousinformationnetwork!
Examples:DBLP,IMDB,Flickr,GoogleNews,Wikipedia,
Structurescanbeprogressivelyextractedfromlessorganized
datasetsbyinformationnetworkanalysis
Informationrich,interrelated,organizeddatasetsformone
oraset
of
gigantic,
interconnected,
multi
typed
heterogeneousinformationnetworks
Surprisinglyrichknowledgecanbederivedfromsuch
structuredheterogeneous
information
networks
Ourgoal:Uncoverknowledgehiddenfromorganized data
Exploringthepowerofmultityped,heterogeneouslinks
Miningstructured heterogeneousinformationnetworks!
-
7/31/2019 Mining Heterogeneous Information Networks
10/17610
OutlineOutline
Motivation: WhyMining
Heterogeneous
Information
Networks?
PartI: Clustering,RankingandClassification
Clusteringand
Ranking
in
Information
Networks
ClassificationofInformationNetworks
PartII: DataQualityandSearchinInformationNetworks
DataCleaning
and
Data
Validation
by
InfoNet Analysis
SimilaritySearchinInformationNetworks
PartIII:AdvancedTopicsonInformationNetworkAnalysis
RoleDiscoveryandOLAPinInformationNetworks
MiningEvolutionandDynamicsofInformationNetworks
Conclusions
l d k fCl i d R ki i I f i
-
7/31/2019 Mining Heterogeneous Information Networks
11/17611
Clusteringand
Ranking
in
Information
Clustering
and
Ranking
in
Information
NetworksNetworks Integrated
Clustering
and
Ranking
of
Heterogeneous
InformationNetworks
Clusteringof
Homogeneous
Information
Networks
LinkClus:Clusteringwithlinkbasedsimilaritymeasure
SCAN:
Density
based
clustering
of
networks Others
Spectralclustering
Modularitybasedclustering
Probabilisticmodelbasedclustering
UserGuided
Clustering
of
Information
Networks
-
7/31/2019 Mining Heterogeneous Information Networks
12/17612
ClusteringandRankinginHeterogeneousClusteringandRankinginHeterogeneousInformationNetworksInformationNetworks
Ranking&
clustering
each
provides
anew
view
over
anetwork
Rankinggloballywithoutconsideringclusters dumb
RankingDBandArchitectureconfs.together?
Clusteringauthorsinonehugeclusterwithoutdistinction?
Dulltoviewthousandsofobjects(thisiswhyPageRank!)
RankClus:
Integratesclustering
with
ranking
Conditionalrankingrelativetoclusters
Useshighlyrankedobjectstoimproveclusters
Qualitiesof
clustering
and
ranking
are
mutually
enhanced
Y.Sun,J.Han,etal.,RankClus:IntegratingClusteringwith
RankingforHeterogeneousInformationNetworkAnalysis,
EDBT'09.
-
7/31/2019 Mining Heterogeneous Information Networks
13/17613
GlobalRankingvs.ClusterGlobalRankingvs.ClusterBasedRankingBasedRanking
AToy
Example:
Two
areas
with
10
conferences
and
100
authors
ineacharea
-
7/31/2019 Mining Heterogeneous Information Networks
14/17614
RankClusRankClus:ANewFramework:ANewFramework
Sub-NetworkRanking
Clustering
-
7/31/2019 Mining Heterogeneous Information Networks
15/17615
TheTheRankClusRankClus PhilosophyPhilosophy Why
integrated
Ranking
and
Clustering?
Rankingandclusteringcanbemutuallyimproved
Ranking:
Once
a
cluster
becomes
more
accurate,
ranking
will
bemorereasonableforsuchaclusterandwillbethe
distinguishedfeatureofthecluster
Clustering:Once
ranking
is
more
distinguished
from
each
other,theclusterscanbeadjustedandgetmoreaccurate
results
Noteveryobjectshouldbetreatedequallyinclustering!
Objectspreservesimilarityundernewmeasurespace
E.g.,VLDB
vs.
SIGMOD
-
7/31/2019 Mining Heterogeneous Information Networks
16/17616
RankClusRankClus:AlgorithmFramework:AlgorithmFramework
Step0. Initialization
RandomlypartitiontargetobjectsintoKclusters
Step1.
Ranking
Rankingforeachsubnetworkinducedfromeachcluster,
whichservesasfeatureforeachcluster
Step2. Generatingnewmeasurespace
Estimatemixturemodelcoefficientsforeachtargetobject
Step3.
Adjusting
cluster
Step4. RepeatingSteps13untilstable
-
7/31/2019 Mining Heterogeneous Information Networks
17/17617
FocusonaBiFocusonaBiTypedNetworkCaseTypedNetworkCase
Conferenceauthor
network,
links
can
exist
between
Conference(X)andauthor(Y)
Author
(Y)
and
author
(Y)
UseWtodenotethelinksandthereweights
W=
-
7/31/2019 Mining Heterogeneous Information Networks
18/17618
Step1:Ranking:FeatureExtractionStep1:Ranking:FeatureExtraction
Tworanking
strategies:
Simple
ranking
vs.
authority
ranking
SimpleRanking
Proportionaltodegreecountingforobjects,e.g.,#of
publicationsofanauthor
Considersonlyimmediateneighborhoodinthenetwork
Authority
Ranking:
Extension
to
HITS
in
weighted
bi
type
network
Rule1:Highlyrankedauthorspublishmanypapersinhighly
rankedconferences
Rule2:Highlyrankedconferencesattractmanypapersfrom
manyhighlyrankedauthors
Rule
3:
The
rank
of
an
author
is
enhanced
if
he
or
she
co
authorswithmanyauthorsormanyhighlyrankedauthors
-
7/31/2019 Mining Heterogeneous Information Networks
19/17619
EncodingRulesinAuthorityRankingEncodingRulesinAuthorityRanking
Rule1:Highlyrankedauthorspublishmanypapersinhighlyrankedconferences
Rule2:Highlyrankedconferencesattractmanypapersfrom
many
highly
ranked
authors
Rule3:
The
rank
of
an
author
is
enhanced
if
he
or
she
co
authorswithmanyauthorsormanyhighlyrankedauthors
E l A h i R ki i h 2E l A th it R ki i th 2 AA
-
7/31/2019 Mining Heterogeneous Information Networks
20/17620
Example:AuthorityRankinginthe2Example:AuthorityRankinginthe2AreaAreaConferenceConferenceAuthorNetworkAuthorNetwork
Therankings
of
authors
are
quite
distinct
from
each
other
in
the
twoclusters
Step 2 Generate Ne Meas re SpaceStep 2 Generate Ne Meas re Space
-
7/31/2019 Mining Heterogeneous Information Networks
21/17621
Step2:
Generate
New
Measure
SpaceStep
2:
Generate
New
Measure
Space:
:
AMixtureModelMethodAMixtureModelMethod Consider
each
target
objects
links
are
generated
under
amixture
distributionofrankingfromeachcluster
Consider
ranking
as
a
distribution:
r(Y)
p(Y)
Eachtargetobject xi ismappedintoaKvector(i,k)
Parametersare
estimated
using
the
EM
algorithm
Maximizetheloglikelihoodgivenalltheobservationsof
links
Example: 2Example: 2 D Coefficients in the 2D Coefficients in the 2 AreaArea
-
7/31/2019 Mining Heterogeneous Information Networks
22/17622
Example:2Example:2
D
Coefficients
in
the
2D
Coefficients
in
the
2
Area
Area
ConferenceConferenceAuthorNetworkAuthorNetwork
Theconferencesarewellseparatedinthenewmeasurespace
Scatterplots
of
two
conferences
and
component
coefficients
St 3 Cl t Adj t t i NSt 3 Cl t Adj t t i N
-
7/31/2019 Mining Heterogeneous Information Networks
23/17623
Step3:ClusterAdjustmentinNewStep3:ClusterAdjustmentinNewMeasureSpaceMeasureSpace
Clustercenterinnewmeasurespace
Vectormeanofobjectsinthecluster(Kdimensional)
Cluster
adjustment Distancemeasure:1Cosinesimilarity
Assigntotheclusterwiththenearestcenter
WhyBetterRankingFunctionDerivesBetterClustering?
Considerthemeasurespacegenerationprocess
Highlyranked
objects
in
acluster
play
amore
important
role
to decide
atargetobjectsnewmeasure
Intuitively,ifwecanfindthehighlyrankedobjectsina
cluster,equivalently,
we
get
the
right
cluster
-
7/31/2019 Mining Heterogeneous Information Networks
24/176
24
StepStepbybyStepRunningCaseIllustrationStepRunningCaseIllustration
I nit ial ly, rankingdist ribut ions are
mixed t oget her
Tw o clust ers ofobject s mixed
t oget her, butpreserve similarity
somehow
I mproved a l i t t le
Tw o clusters are
almost w ellseparated
Improvedsignificantly
Stable
Well separat ed
-
7/31/2019 Mining Heterogeneous Information Networks
25/176
25
TimeComplexity:Linearto#ofLinksTimeComplexity:Linearto#ofLinks
Ateach
iteration,
|E|:
edges
in
network,
m:
number
of
target
objects,K:numberofclusters
Rankingforsparsenetwork
~O(|E|)
Mixturemodelestimation
~O(K|E|+mK) Clusteradjustment
~O(mK^2)
Inall,
linear
to
|E|
~O(K|E|)
Note:SimRank willbeatleastquadraticateachiterationsinceit
evaluatesdistancebetweeneverypairinthenetwork
-
7/31/2019 Mining Heterogeneous Information Networks
26/176
26
CaseStudy:Dataset:DBLPCaseStudy:Dataset:DBLP
Allthe2676conferencesand20,000authorswithmost
publications,fromthetimeperiodofyear1998toyear2007
Bothconference
author
relationships
and
co
author
relationshipsareused
K=15(selectonly5clustershere)
NetClusNetClus : Rank ing & Clust er ing w i t h St ar: Rank ing & Clust er ing w i t h St ar
-
7/31/2019 Mining Heterogeneous Information Networks
27/176
27
NetClusNetClus : Rank ing & Clust er ing w i t h St ar: Rank ing & Clust er ing w i t h St ar
Ne tw ork Sc hemaNet w ork Sc hema
Beyondbi
typed
information
network:
A
Star
Network
Schema
Splitanetworkintodifferentlayers,eachrepresentingbyanet
cluster
27
-
7/31/2019 Mining Heterogeneous Information Networks
28/176
28
StarNetStarNet:Schema&Net:Schema&NetClusterCluster
StarNetwork
Schema
Centertype: Targettype
E.g.,apaper,amovie,ataggingevent
Acenter
object
is
aco
occurrence of
abag
of
different
types
of
objects,whichstandsforamultirelationamongdifferenttypesof
objects
Surroundingtypes: Attribute(property)types
NetCluster
GivenainformationnetworkG,anetclusterCcontainstwopiecesof
information:
Nodeset
and
link
set
as
asub
network
of
G
Membershipindicatorforeachnodex:P(x inC)
GivenainformationnetworkG,clusternumberK,aclusteringforGisa
setofnetclustersandforeachnodex,thesumofxs probability
distributionin
all
Knet
clusters
should
be
1
-
7/31/2019 Mining Heterogeneous Information Networks
29/176
29
ResearchPaper
Term
AuthorVenue
Publish Write
Contain
DBLP 29
-
7/31/2019 Mining Heterogeneous Information Networks
30/176
30
StarNetStarNet
ofof
Delicious.comDelicious.com
TaggingEvent
Tag
UserWebSite
Contain
Delicious.com 30
Actor/A
-
7/31/2019 Mining Heterogeneous Information Networks
31/176
31
StartNetStartNet forIMDBforIMDBMovie
Title/
Plot
DirectorActor/Actress
Star in Direct
Contain
IMDB
31
-
7/31/2019 Mining Heterogeneous Information Networks
32/176
32
RankingFunctionsRankingFunctions
RankinganobjectxoftypeTx inanetworkG,denotedasp(x|Tx,G)
Giveascoretoeachobject,accordingtoitsimportance
Differentrulesdefineddifferentrankingfunctions:
SimpleRanking
Rankingscoreisassignedaccordingtothedegreeofanobject
AuthorityRanking
Rankingscoreisassignedaccordingtothemutualenhancementby
propagationsofscorethroughlinks
highlyrankedconferencesacceptmanygoodpaperspublishedby
manyhighlyrankedauthors,andhighlyrankedauthorspublishmany
goodpapersinhighlyrankedconferences:
-
7/31/2019 Mining Heterogeneous Information Networks
33/176
33
RankingFunction(Cont.)RankingFunction(Cont.)
Priorscanbeadded:
PP(X|Tx,Gk)=(1 P)P(X|Tx,Gk)+PP0(X|Tx,Gk)
P0(X|Tx,
Gk)
is
the
prior
knowledge,
usually
given
as
a
distribution,denotedbyonlyseveralwords
Pistheparameterthatwebelieveinpriordistribution
Rankingdistribution
Normalizerankingscoresto1,giventhemaprobabilistic
meaning
Similarto
the
idea
of
PageRank
-
7/31/2019 Mining Heterogeneous Information Networks
34/176
34
NetClusNetClus:AlgorithmFramework:AlgorithmFramework
Mapeachtargetobjectintoanewlowdimensionalfeature
spaceaccordingtocurrentnetclustering,andadjustthe
clusteringfurtherinthenewmeasurespace
Step0: Generateinitialrandomclusters
Step1: Generaterankingbasedgenerativemodelfor
targetobjectsforeachnetcluster
Step2: Calculateposteriorprobabilitiesfortargetobjects,
whichservesasthenewmeasure,andassigntargetobjects
tothenearestclusteraccordingly
Step3: Repeatsteps1and2,untilclustersdonotchangesignificantly
Step4: Calculateposteriorprobabilitiesforattribute
objectsin
each
net
cluster
Generat ive Model for Target Objec t sGenerat ive Model for Target Objec t s
-
7/31/2019 Mining Heterogeneous Information Networks
35/176
35
Generat ive Model fo r Target Objec t sGenerat ive Model fo r Target Objec t s
Given a Ne tGiven a Ne t --c lus te rc lus te r
Eachtargetobjectstandsforancooccurrenceofabagofattributeobjects
Definetheprobabilityofatargetobjectdefinetheprobabilityofthe
cooccurrenceofalltheassociatedattributeobjects
GenerativeprobabilityP(d|Gk)fortargetobjectdinclusterCk :
whereP(x
|Tx
,Gk
)
isrankingfunction,P(Tx
|Gk
)istypeprobability
Twoassumptionsofindependence
Theprobabilities
to
visit
objects
of
different
types
are
independent
to
eachother
Theprobabilitiestovisittwoobjectswithinthesametypeare
independentto
each
other
-
7/31/2019 Mining Heterogeneous Information Networks
36/176
39
ClusterAdjustmentClusterAdjustment
Usingposteriorprobabilitiesoftargetobjectsasnewfeature
space
Eachtarget
object
=>
Kdimension
vector
Eachnetclustercenter=>Kdimensionvector
Averageontheobjectsinthecluster
Assigneach
target
object
into
nearest
cluster
center
(e.g.,
cosinesimilarity)
Asubnetworkcorrespondingtoanewnetclusteristhenbuilt
byextracting
all
the
target
objects
in
that
cluster
and
all
linkedattributeobjects
-
7/31/2019 Mining Heterogeneous Information Networks
37/176
40
Experiments:DBLPandBeyondExperiments:DBLPandBeyond
DataSet:
DBLP
all
area data
set
Allconferences+Top 50Kauthors
DBLPfourarea dataset
20conferencesfromDB,DM,ML,IR
Allauthorsfromtheseconferences
Allpapers
published
in
these
conferences
Runningcaseillustration
-
7/31/2019 Mining Heterogeneous Information Networks
38/176
41
AccuracyStudy:ExperimentsAccuracyStudy:Experiments
Accuracy,compared
with
PLSA,
apure
text
model,
no
other
typesofobjectsandlinksareused,usethesamepriorasin
NetClus
Accuracy,comparedwithRankClus,abitypedclusteringmethod
ononlyonetype
AccuracyofPaperClusteringResults
AccuracyofConference ClusteringResults
N ClN Cl Di i i hi C fDi i i hi C f
-
7/31/2019 Mining Heterogeneous Information Networks
39/176
42
NetClusNetClus:DistinguishingConferences:DistinguishingConferences
AAAI 0.00226670.00899168
0.934024 0.0300042
0.0247133
CIKM0.1500530.3101720.007238070.4445240.0880127
CVPR0.0001638120.007630720.931496 0.02813420.032575
ECIR3.47023e050.007126950.006574020.978391 0.00787288
ECML0.000774770.1109220.814362 0.05794260.015999
EDBT0.573362 0.316033
0.00101442
0.0245591
0.0850319
ICDE0.529522 0.3765420.002391520.01511130.0764334
ICDM0.0004550280.778452 0.05664570.1131840.0512633
ICML0.0003096240.0500780.878757 0.06223350.00862134
IJCAI0.00329816
0.0046758
0.94288 0.0303745
0.0187718
KDD0.005742230.797633 0.06173510.0676810.0672086
PAKDD0.001112460.813473 0.04031050.05747550.0876289
PKDD5.39434e050.760374 0.1196080.0529260.0670379
PODS
0.78935 0.113751
0.013939
0.00277417
0.0801858
SDM0.0001729530.841087 0.0583160.05270810.0477156
SIGIR0.006003990.002800130.002752370.977783 0.0106604
SIGMOD0.689348 0.2231220.00177030.008254550.0775055
VLDB0.701899 0.2074280.001000120.01169660.0779764
WSDM0.00751654
0.269259
0.0260291
0.683646 0.0135497
WWW0.07711860.2706350.029307 0.451857 0.171082
-
7/31/2019 Mining Heterogeneous Information Networks
40/176
43
NetClusNetClus:DatabaseSystemCluster:DatabaseSystemCluster
database 0.0995511databases 0.0708818
system 0.0678563data 0.0214893query 0.0133316
systems 0.0110413queries 0.0090603
management 0.00850744object 0.00837766
relational 0.0081175processing 0.00745875
based 0.00736599distributed 0.0068367
xml 0.00664958oriented 0.00589557design 0.00527672web 0.00509167
information 0.0050518
model 0.00499396efficient 0.00465707
Surajit Chaudhuri 0.00678065Michael Stonebraker 0.00616469
Michael J. Carey 0.00545769C. Mohan 0.00528346
David J. DeWitt 0.00491615Hector Garcia-Molina 0.00453497
H. V. Jagadish 0.00434289David B. Lomet 0.00397865
Raghu Ramakrishnan 0.0039278
Philip A. Bernstein 0.00376314Joseph M. Hellerstein 0.00372064Jeffrey F. Naughton 0.00363698Yannis E. Ioannidis 0.00359853
Jennifer Widom 0.00351929
Per-Ake Larson 0.00334911Rakesh Agrawal 0.00328274
Dan Suciu 0.00309047Michael J. Franklin 0.00304099Umeshwar Dayal 0.00290143
Abraham Silberschatz 0.00278185
VLDB 0.318495SIGMOD Conf. 0.313903
ICDE 0.188746
PODS 0.107943EDBT 0.0436849
Ranking authors in XML
-
7/31/2019 Mining Heterogeneous Information Networks
41/176
44
NetClusNetClus::StarNetStarNetBasedRankingandClusteringBasedRankingandClustering
Ageneral
framework
in
which
ranking
and
clustering
are
successfullycombinedtoanalyzeinfornets
Rankingandclusteringcanmutuallyreinforceeachotherin
informationnetwork
analysis
NetClus,anextensiontoRankClus thatintegratesrankingand
clusteringandgeneratenetclustersinastarnetworkwith
arbitrarynumber
of
types
Netcluster,heterogeneous
informationsubnetworks
comprisedof
multiple
typesofobjects
GowellbeyondDBLP,and
structuredrelational
DBs
Flickr:queryRaleigh
derivesmultipleclusters
iNextCubeiNextCube: Information Network: Information Network Enhanced TexEnhanced Text
-
7/31/2019 Mining Heterogeneous Information Networks
42/176
45
iNextCubeiNextCube:Information
Network:
Information
Network
Enhanced
TexEnhanced
Text
Cube(VLDBCube(VLDB09Demo)09Demo)
ArchitectureofiNextCube
DimensionhierarchiesgeneratedbyNetClusDimensionhierarchiesgeneratedbyNetClus
Author/confere
termrankingfo
researcharea.
T
researchareasc
atdifferentleve
All
DBand
IS Theory Architecture
DB DM IR
XML DistributedDB
NetclusterHierarchy
Demo: iNextCube.cs.uiuc.edu
Clustering and Ranking in InformationClustering and Ranking in Information
-
7/31/2019 Mining Heterogeneous Information Networks
43/176
46
Clusteringand
Ranking
in
Information
Clustering
and
Ranking
in
Information
NetworksNetworks Integrated
Clustering
and
Ranking
of
Heterogeneous
InformationNetworks
Clusteringof
Homogeneous
Information
Networks
LinkClus:Clusteringwithlinkbasedsimilaritymeasure
SCAN:
Density
based
clustering
of
networks Others
Spectralclustering
Modularitybased
clustering
Probabilisticmodelbasedclustering
UserGuided
Clustering
of
Information
Networks
-
7/31/2019 Mining Heterogeneous Information Networks
44/176
47
LinkLinkBasedClustering:WhyUseful?BasedClustering:WhyUseful?
Questions:
Q1:Howtoclustereachtypeofobjects?
Q2:Howtodefinesimilaritybetweeneachtypeofobjects?
Tom sigmod03
Mike
Cathy
John
sigmod04sigmod05
vldb03
vldb04
vldb05
sigmod
vldb
Mary
aaai04
aaai05 aaai
Authors Proceedings Conferences
-
7/31/2019 Mining Heterogeneous Information Networks
45/176
48
SimRankSimRank:Link:LinkBasedSimilaritiesBasedSimilarities
Twoobjects
are
similar
iflinked
with
the
same
or
similar
objects
Tom sigmod03sigmod04
sigmod05
sigmod
Tom
Mike
Cathy
John
sigmod03
sigmod04
sigmod05
vldb03
vldb04
vldb05
sigmod
vldb
Jeh
&Widom,2002
SimRank
Similarity
between
two
objects
a and
b,
S(a,b)=theaveragesimilaritybetween
objectslinkedwitha
andthosewithb:
whereI(v)isthesetofinneighborsof
thevertexv.
But:Itisexpensivetocompute:
ForadatasetofN
objectsandM links,
ittakesO(N2)spaceandO(M2)timeto
computeall
similarities.
Mary
-
7/31/2019 Mining Heterogeneous Information Networks
46/176
49
Observation1:HierarchicalStructuresObservation1:HierarchicalStructures
Hierarchicalstructures
often
exist
naturally
among
objects
(e.g.,
taxonomyofanimals)
All
electronicsgrocery apparel
DVD cameraTV
A
hierarchical
structure
of
productsinWalmart
A
rticles
Words
Relationshipsbetweenarticlesand
words(Chakrabarti,
Papadimitriou,
Modha,Faloutsos,2004)
-
7/31/2019 Mining Heterogeneous Information Networks
47/176
50
Observation2:DistributionofSimilarityObservation2:DistributionofSimilarity
Powerlawdistributionexistsinsimilarities
56%of
similarity
entries
are
in
[0.005,
0.015]
1.4%ofsimilarityentriesarelargerthan0.1
Ourgoal:Designadatastructurethatstoresthesignificantsimilaritiesand
compressesinsignificantones
0
0.1
0.2
0.3
0.4
0
0.0
2
0.0
4
0.0
6
0.0
8
0.1
0.1
2
0.1
4
0.1
6
0.1
8
0.2
0.2
2
0.2
4
similarity value
portionofentries Distribution ofSimRanksimilarities
among DBLP authors
-
7/31/2019 Mining Heterogeneous Information Networks
48/176
51
OurDataStructure:OurDataStructure:SimTreeSimTree
simp(n7,n8) = s(n7,n4) x s(n4,n5) x s(n5,n8)
Path-based node similarity
Similarity between two nodes is the average similarity between objectslinked with them in other SimTrees
Adjustment ratio for x =
n1
n2
n4 n5 n6
n3
0.9 1.0
0.90.8
0.2
n7 n9
0.3
n8
0.8
0.9
Similarity between twosibling nodes n1
and n2
Adjustment ratiofor node n7
Average similarity between xand all other nodes
Average similarity betweenxs parent and allother nodes
Each leaf noderepresents an object
Each non-leaf noderepresents a group ofsimilar lower-level node
Li kClLi kCl Si TSi T B d Hi hi l Cl t iB d Hi hi l Cl t i
-
7/31/2019 Mining Heterogeneous Information Networks
49/176
52
LinkClusLinkClus::SimTreeSimTreeBasedHierarchicalClusteringBasedHierarchicalClustering
InitializeaSimTree forobjectsofeachtype
Repeat
ForeachSimTree,updatethesimilaritiesbetweenitsnodes
usingsimilaritiesinotherSimTrees
Similaritybetween
two
nodes
a and
bis
the
average
similaritybetweenobjectslinkedwiththem
AdjustthestructureofeachSimTree
Assigneachnodetotheparentnodethatitismost
similarto
I iti li ti fInitialization of Si TSimTrees
-
7/31/2019 Mining Heterogeneous Information Networks
50/176
53
InitializationofInitializationofSimTreesSimTrees
Findingtightgroups Frequentpatternmining
Initializingatree:
Startfromleafnodes(level0)
Ateachlevell,findnonoverlappinggroupsofsimilarnodes
withfrequentpatternmining
Reduced to
g1
g2
{n
1}{n1, n2}
{n2}{n1, n2}
{n1, n2}{n2, n3, n4}
{n4}
{n3, n4}
{n3, n4}
Transactions
n1 123
4
56
7
8
9
n2
n3
n4
Thetightnessofagroupofnodesisthesupportofa
frequentpattern
C l itComplexity: Li kClLinkClus vs Si R kSimRank
-
7/31/2019 Mining Heterogeneous Information Networks
51/176
54
Complexity:Complexity:LinkClusLinkClus vs.vs.SimRankSimRank Afterinitialization,iteratively(1)foreachSimTree updatethe
similaritiesbetweenitsnodesusingsimilaritiesinotherSimTrees,
and(2)AdjustthestructureofeachSimTree
Computationalcomplexity:
Fortwotypesofobjects,N ineach,andM linkagesbetween
them
Time Space
Updating similarities O(M(logN)2) O(M+N)
Adjusting tree structures O(N) O(N)
LinkClus O (M(logN)2) O(M+N)
SimRank O (M2) O(N2)
P f C i E i t S tPerformance Comparison: Experiment Setup
-
7/31/2019 Mining Heterogeneous Information Networks
52/176
55
PerformanceComparison:ExperimentSetupPerformanceComparison:ExperimentSetup
DBLP dataset: 4170 most productive authors, and 154 well-knownconferences with most proceedings
Manually labeled research areas of 400 most productive authorsaccording to their home pages (or publications)
Manually labeled areas of 154 conferences according to their call forpapers
Approaches Compared:
SimRank (Jeh & Widom, KDD 2002) Computing pair-wise similarities
SimRank with FingerPrints (F-SimRank)
Fogaras & Racz, WWW 2005 pre-computes a large sample of random paths from each object
and uses samples of two objects to estimate SimRank similarity
ReCom (Wang et al. SIGIR 2003)
Iteratively clustering objects using cluster labels of linked objects
DBLP Data Set Accuracy and Computation TimeDBLP Data Set: Accuracy and Computation Time
-
7/31/2019 Mining Heterogeneous Information Networks
53/176
56
0.8
0.85
0.9
0.95
1
1 3 5 7 9 11 13 15 17 19
#iteration
accuracy
LinkClus
SimRank
ReCom
F-SimRank
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
1 3 5 7 9 11 13 15 17 19
#iteration
accuracy
LinkClus
SimRank
ReCom
F-SimRank
DBLPDataSet:AccuracyandComputationTimeDBLPDataSet:AccuracyandComputationTime
Approaches AccrAuthor Accr
Conf average
time
LinkClus 0.957 0.723 76.7
SimRank 0.958 0.760 1020
ReCom 0.907 0.457 43.1
FSimRank 0.908 0.583 83.6
Authors Conferences
l dil A d i
-
7/31/2019 Mining Heterogeneous Information Networks
54/176
57
EmailDataset:AccuracyandTimeEmailDataset:AccuracyandTime
F.Nielsen.
Email
dataset.
http://www.imm.dtu.dk/rem/data/Email1431.zip
370
emails
on
conferences,
272
on
jobs,
and
789
spam
emails WhyisLinkClus evenbetterthanSimRank inaccuracy?
Noisefilteringduetofrequentpatternbasedpreprocessing
Approach Accuracy Totaltime(sec)
LinkClus 0.8026 1579.6
SimRank 0.7965 39160ReCom 0.5711 74.6
FSimRank 0.3688 479.7
CLARANS
0.4768 8.55
Clustering and Ranking in InformationClustering and Ranking in Information
-
7/31/2019 Mining Heterogeneous Information Networks
55/176
58
Clusteringand
Ranking
in
Information
Clustering
and
Ranking
in
Information
NetworksNetworks Integrated
Clustering
and
Ranking
of
Heterogeneous
InformationNetworks
Clusteringof
Homogeneous
Information
Networks
LinkClus:Clusteringwithlinkbasedsimilaritymeasure
SCAN:Densitybasedclusteringofnetworks
Others
Spectralclustering
Modularitybased
clustering
Probabilisticmodelbasedclustering
UserGuided
Clustering
of
Information
Networks
SCAN: DensitySCAN: Density Based Network ClusteringBased Network Clustering
-
7/31/2019 Mining Heterogeneous Information Networks
56/176
59
SCAN:DensitySCAN:DensityBasedNetworkClusteringBasedNetworkClustering
Networksmade
up
of
the
mutual
relationships
of
data
elements
usually
haveanunderlyingstructure. Clustering: Astructurediscovery
problem
Givensimply
information
of
who
associates
with
whom,
could
one
identifyclustersofindividualswithcommoninterestsorspecial
relationships(families,cliques,terroristcells)?
Questionsto
be
answered:
How
many
clusters?
What
size
should
they
be?Whatisthebestpartitioning?Shouldsomepointsbesegregated?
Scan:Aninterestingdensitybasedalgorithm
X.Xu,N.Yuruk,Z.Feng,andT.A.J.Schweiger,SCAN:AStructural
ClusteringAlgorithmforNetworks,Proc.2007ACMSIGKDDInt.
Conf.KnowledgeDiscoveryinDatabases(KDD'07),SanJose,CA,
Aug.2007
Social Network and Its Clustering ProblemSocial Network and Its Clustering Problem
-
7/31/2019 Mining Heterogeneous Information Networks
57/176
60
SocialNetworkandItsClusteringProblemSocialNetworkandItsClusteringProblem
Individualsin
atight
social
group,
or
clique,knowmanyofthesamepeople,
regardlessofthesizeofthegroup.
Individualswhoarehubs knowmany
peopleindifferentgroupsbutbelongto
nosinglegroup. Politicians,forexample
bridgemultiplegroups.
Individualswhoareoutliers resideat
the
margins
of
society.
Hermits,
for
example,knowfewpeopleandbelong
tonogroup.
Structure SimilarityStructure Similarity
-
7/31/2019 Mining Heterogeneous Information Networks
58/176
61
StructureSimilarityStructureSimilarity
Define(v)
as
immediate
neighbor
of
avertex
v.
Thedesiredfeaturestendtobecapturedbyameasure(u,v)
as
Structural
Similarity
Structuralsimilarityislargeformembersofacliqueandsmall for
hubsandoutliers.
|)(||)(|
|)()(|
),( wv
wv
wv
=
I
v
Structural Connectivity [1]Structural Connectivity [1]
-
7/31/2019 Mining Heterogeneous Information Networks
59/176
62
StructuralConnectivity[1]StructuralConnectivity[1]
Neighborhood:
Core:
Directstructure
reachable:
Structurereachable:transitiveclosureofdirectstructure
reachability
Structureconnected:
}),(|)({)( = wvvwvN
|)(|)(, vNvCORE
)()(),( ,, vNwvCOREwvDirRECH
),(),(:),( ,,, wuRECHvuRECHVuwvCONNECT
[1] M. Ester, H. P. Kriegel, J. Sander, & X. Xu (KDD'96) A Density-
Based Algorithm for Discovering Clusters in Large Spatial Databases
StructureStructure Connected ClustersConnected Clusters
-
7/31/2019 Mining Heterogeneous Information Networks
60/176
63
StructureStructureConnectedClustersConnectedClusters
Structureconnected
cluster
C
Connectivity:
Maximality:
Hubs:
Notbelongtoanycluster
Bridgetomanyclusters
Outliers:
Notbelong
to
any
cluster
Connecttolessclusters
),(:, , wvCONNECTCwv
CwwvREACHCvVwv ),(:, ,
hub
outlier
AlgorithmAlgorithm
-
7/31/2019 Mining Heterogeneous Information Networks
61/176
64
13
9
10
11
7
8
12
6
4
0
15
2
3
AlgorithmAlgorithm
= 2= 0.7
0.75
0.67
0.82
AlgorithmAlgorithm
-
7/31/2019 Mining Heterogeneous Information Networks
62/176
65
13
9
10
11
7
8
12
6
4
0
15
2
3
AlgorithmAlgorithm
= 2= 0.7 0.51
0.51
0.68
Running TimeRunning Time
-
7/31/2019 Mining Heterogeneous Information Networks
63/176
66
RunningTimeRunningTime
Runningtime:
O(|E|)
Forsparsenetworks:O(|V|)
[2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E70, 066111 (2004).
Clustering and Ranking in InformationClustering and Ranking in Information
-
7/31/2019 Mining Heterogeneous Information Networks
64/176
67
g
g
g
g
NetworksNetworks Integrated
Clustering
and
Ranking
of
Heterogeneous
InformationNetworks
Clusteringof
Homogeneous
Information
Networks
LinkClus:Clusteringwithlinkbasedsimilaritymeasure
SCAN:Densitybasedclusteringofnetworks
Others
Spectralclustering
Modularitybased
clustering
Probabilisticmodelbasedclustering
UserGuided
Clustering
of
Information
Networks
Spectral ClusteringSpectral Clustering
-
7/31/2019 Mining Heterogeneous Information Networks
65/176
68
SpectralClusteringSpectralClustering
Spectralclustering:
Find
the
best
cut
that
partitions
the
network
Differentcriteriatodecidebest
Mincut,ratiocut,normalizedcut,MinMaxcut
UsingMincutasanexample[Wuetal.1993]
Assigneach
node
ian
indicator
variable
Represent
the
cut
size
using
indicator
vector
and
adjacency
matrix Cutsize =
Minimizetheobjectivefunctionthroughsolvingeigenvalue system
Relaxthediscretevalueofqtocontinuousvalue
Mapcontinuousvalueofqintodiscreteonestogetclusterlabels
Usesecondsmallesteigenvectorforq
ModularityModularityBased ClusteringBased Clustering
-
7/31/2019 Mining Heterogeneous Information Networks
66/176
69
ModularityModularityBasedClusteringBasedClustering
Modularitybased
clustering
Findthebestclusteringthatmaximizesthemodularityfunction
Qfunction[Newmanetal.,2004]
Leteij be
ahalf
of
the
fraction
of
edges
between
group
iand
group
j
eii isthefractionofedgeswithingroupi
Letai bethefractionofallendsofedgesattachedtoverticesingroupI
Qisthendefinedassumofdifferencebetweenwithingroupedgesand
expectedwithingroupedges
MinimizeQ
Onepossiblesolution:hierarchicallymergeclustersresultingingreatest
increaseinQfunction[Newmanetal.,2004]
Probabilistic ModelProbabilistic ModelBased ClusteringBased Clustering
-
7/31/2019 Mining Heterogeneous Information Networks
67/176
70
ProbabilisticModelProbabilisticModelBasedClusteringBasedClustering
Probabilisticmodel
based
clustering
Buildgenerativemodelsforlinksbasedonhiddenclusterlabels
Maximizetheloglikelihoodofallthelinkstoderivethehiddencluster
membership
Anexample:Airoldi etal.,Mixedmembershipstochasticblockmodels,2008
DefineagroupinteractionprobabilitymatrixB(KXK)
B(g,h)denotestheprobabilityoflinkgenerationbetweengroupg
andgroup
h
Generativemodelforalink
Foreachnode,drawamembershipprobabilityvectorforma
Dirichlet prior Foreachpaperofnodes,drawclusterlabelsaccordingtotheir
membershipprobability(supposinggandh),thendecidewhetherto
havealinkaccordingtoprobabilityB(g,h)
Derivethe
hidden
cluster
label
by
maximize
the
likelihood
given B
and
theprior
Clustering and Ranking in InformationClustering and Ranking in Informationk
-
7/31/2019 Mining Heterogeneous Information Networks
68/176
71
g
g
g
g
NetworksNetworks Integrated
Clustering
and
Ranking
of
Heterogeneous
InformationNetworks
Clustering
of
Homogeneous
Information
Networks LinkClus:Clusteringwithlinkbasedsimilaritymeasure
SCAN:Densitybasedclusteringofnetworks
Others
Spectralclustering
Modularitybased
clustering
Probabilisticmodelbasedclustering
UserGuided
Clustering
of
Information
Networks
UserUserGuided Clustering in DBGuided Clustering in DB InfoNetInfoNet
-
7/31/2019 Mining Heterogeneous Information Networks
69/176
72
UserUser GuidedClusteringinDBGuidedClusteringinDBInfoNetInfoNet
Course
name
office
position
Professor
course-id
name
area
course
semester
instructor
officeposition
Student
name
student
course
semester
unit
Register
grade
professor
student
degree
Advise
name
Group
person
group
Work-In
area
year
conf
Publication
title
title
Publish
author
Target ofclustering
User hint
Open-course
Userusuallyhasagoalofclustering,e.g.,clusteringstudents byresearcharea
Userspecifies
his
clustering
goal
to
aDB
InfoNet cluster:
CrossClus
Classification vs UserClassification vs UserGuided ClusteringGuided Clustering
-
7/31/2019 Mining Heterogeneous Information Networks
70/176
73
Classificationvs.UserClassificationvs.UserGuidedClusteringGuidedClustering
Userspecifiedfeature(inthe
formofattribute)isusedasa
hint,
not
class
labels Theattributemaycontaintoo
manyortoofewdistinct
values
E.g.,ausermaywantto
clusterstudentsinto20
clustersinsteadof3
Additionalfeaturesneedtobe
includedin
cluster
analysis
All tuples for clustering
User hint
UserUserGuided Clustering vs. SemiGuided Clustering vs. Semisupervised Clusterinsupervised Clusterin
-
7/31/2019 Mining Heterogeneous Information Networks
71/176
74
UserUser GuidedClusteringvs.SemiGuidedClusteringvs.Semi supervisedClusterinsupervisedClusterin
Semisupervised
clustering
[Wagstaff,
et
al 01,
Xing,
et
al.02]
Userprovidesatrainingsetconsistingofsimilar anddissimilar
pairsofobjects
Userguidedclustering
Userspecifiesanattributeasahint,andmorerelevantfeaturesare
foundforclustering
All tuples for clustering
Semi-supervised clustering
All tuples for clustering
User-guided clustering
x
Why Not Typical SemiWhy Not Typical SemiSupervised Clustering?Supervised Clustering?
-
7/31/2019 Mining Heterogeneous Information Networks
72/176
75
WhyNotTypicalSemiWhyNotTypicalSemi SupervisedClustering?SupervisedClustering?
Whynot
do
typical
semi
supervise
clustering?
Muchinformation(inmultiplerelations)isneededtojudgewhethertwo
tuples aresimilar
Ausermaynotbeabletoprovideagoodtrainingset
Itismucheasierforausertospecifyanattributeasahint,suchasastudents
researcharea
Tom Smith SC1211 TA
Jane Chang BI205 RA
Tuples to be compared
User hint
CrossClusCrossClus: An Overview: An Overview
-
7/31/2019 Mining Heterogeneous Information Networks
73/176
76
CrossClusCrossClus:AnOverview:AnOverview
CrossClus:Framework
Searchforgoodmultirelationalfeaturesforclustering
Measuresimilaritybetweenfeaturesbasedonhowthey
clusterobjectsintogroups
Userguidance+heuristic searchforfindingpertinent
features Clusteringbasedonakmedoidsbased algorithm
CrossClus:Majoradvantages
Userguidance,
even
in
avery
simple
form,
plays
an
importantroleinmultirelationalclustering
CrossClus findspertinentfeatures bycomputingsimilarities
betweenfeatures
Selection of MultiSelection of MultiRelational FeaturesRelational Features
-
7/31/2019 Mining Heterogeneous Information Networks
74/176
77
SelectionofMultiSelectionofMultiRelationalFeaturesRelationalFeatures
Amulti
relational
feature
is
defined
by:
Ajoinpath.E.g.,Student RegisterOpenCourse Course
Anattribute.E.g.,Course.area
(Fornumericalfeature)anaggregationoperator.E.g.,sumoraverage
CategoricalFeaturef= [StudentRegisterOpenCourse Course,Course.area,null]
Tuple Areas of coursesDB AI TH
t1 5 5 0
t2 0 3 7
t3 1 5 4
t4 5 0 5
t5 3 3 4
areas of courses of each student
Tuple Feature fDB AI TH
t1 0.5 0.5 0
t2 0 0.3 0.7
t3 0.1 0.5 0.4
t4 0.5 0 0.5
t5 0.3 0.3 0.4
Values of featureff(t1
)
f(t2
)
f(t3
)
f(t4
)
f(t5
)
DB
AI
TH
Numerical Feature, e.g., average grades of students
h = [StudentRegister, Register.grade, average]
E.g. h(t1) = 3.5
Similarity Between FeaturesSimilarity Between Features
-
7/31/2019 Mining Heterogeneous Information Networks
75/176
78
SimilarityBetweenFeaturesSimilarityBetweenFeatures
Featuref(course) Feature g (group)
DB AI TH Info sys Cog sci Theory
t1 0.5 0.5 0 1 0 0
t2 0 0.3 0.7 0 0 1
t3 0.1 0.5 0.4 0 0.5 0.5
t4 0.5 0 0.5 0.5 0 0.5
t5 0.3 0.3 0.4 0.5 0.5 0
Values of Featurefand g
1
2
3
4
5
S1
S2
S3
S4
S5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.5-0.6
0.4-0.5
0.3-0.4
0.2-0.3
0.1-0.2
0-0.1
1
2
3
4
5
S1
S2
S3
S4
S5
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0.5-0.6
0.4-0.5
0.3-0.4
0.2-0.3
0.1-0.2
0-0.1
Similarity between two features
cosine similarity of two vectors
Vf
Vg
( )gf
gf
VV
VVgfSim
=,
SimilaritybetweenCategorical&NumericalFeaturesSimilaritybetweenCategorical&NumericalFeatures
-
7/31/2019 Mining Heterogeneous Information Networks
76/176
79
y gy g
( ) ( ) = 0)
f
shouldbe
similar
to
the
given
ground
truth
G et e et odo ogy gy
GNetMineGNetMine:Graph:GraphBasedRegularizationBasedRegularization
-
7/31/2019 Mining Heterogeneous Information Networks
89/176
92
( ) ( )
1
( ) ( ) 2,
, 1 1 1 , ,
( ) ( ) ( ) ( )
1
( ,..., )
1 1( )
( ) ( )
ji
k k
m
nnm
k kij ij pq ip jq
i j p q ij pp ji qq
m
k k T k k i i i i i
i
J
R f fD D
= = =
=
=
+
y y
f f
f f
pp gg
Minimizethe
objective
function
Smoothnessconstraints:objectslinkedtogethershouldsharesimilar
estimations
of
confidence
belonging
to
class
k
Normalizationtermappliedtoeachtypeoflinkseparately:reducetheimpactofpopularityofnodes
Confidence
estimation
on
labeled
data
and
their
pre
given
labelsshouldbesimilar
Userpreference:howmuchdoyouvaluethisrelationship/groundtruth?
-
7/31/2019 Mining Heterogeneous Information Networks
90/176
ClassificationAccuracy:
Labeling
aVery
Small
Portion
Classification
Accuracy:
Labeling
aVery
Small
Portion
of Authors and Papersof Authors and Papers
-
7/31/2019 Mining Heterogeneous Information Networks
91/176
94
ofAuthorsandPapersofAuthorsandPapers
(a%,p%)
nLB wvRN LLGC GNetMine
AA A
CPT A
A A
CPT A
A A
CP
T A
CPT
(0.1%,0.1%) 25.4 26.0 40.8 34.1 41.4 61.3 82.9
(0.2%,0.2%) 28.3 26.0 46.0 41.2 44.7 62.2 83.4
(0.3%,0.3%) 28.4 27.4 48.6 42.5 48.8 65.7 86.7
(0.4%,0.4%) 30.7 26.7 46.3 45.6 48.7 66.0 87.2
(0.5%,0.5%) 29.8 27.3 49.0 51.4 50.6 68.9 87.5
Comparisonofclassificationaccuracyonauthors(%)
(a%,p%)nLB wvRN LLGC GNetMine
PP ACPT PP ACPT PP ACPT ACPT
(0.1%,0.1%) 49.8 31.5 62.0 42.0 67.2 62.7 79.2
(0.2%,0.2%) 73.1 40.3 71.7 49.7 72.8 65.5 83.5
(0.3%,0.3%) 77.9 35.4 77.9 54.3 76.8 66.6 83.2
(0.4%,0.4%) 79.1 38.6 78.1 54.4 77.9 70.5 83.7
(0.5%,0.5%) 80.7 39.3 77.9 53.5 79.0 73.5 84.1
Comparisonofclassificationaccuracyonpapers(%)
(a%,p%)nLB wvRN LLGC GNetMine
ACPT ACPT ACPT ACPT
(0.1%,0.1%) 25.5 43.5 79.0 81.0
(0.2%,0.2%) 22.5 56.0 83.5 85.0
(0.3%,0.3%) 25.0 59.0 87.0 87.0
(0.4%,0.4%) 25.0 57.0 86.5 89.5
(0.5%,0.5%) 25.0 68.0 90.0 94.0
Comparisonofclassificationaccuracyonconferences(%)
KnowledgePropagation:
List
Objects
with
the
Knowledge
Propagation:
List
Objects
with
the
Highest Confidence Measure Belonging to Each ClassHighest Confidence Measure Belonging to Each Class
-
7/31/2019 Mining Heterogeneous Information Networks
92/176
95
HighestConfidenceMeasureBelongingtoEachClassHighestConfidenceMeasureBelongingtoEachClass
No. Database DataMining Artificial
Intelligence Information
Retrieval
1 data mining learning retrieval
2 database data knowledge information
3 query clustering Reinforcement web
4 system learning reasoning search
5 xml classification model document
Top5termsrelatedtoeacharea
No. Database DataMining ArtificialIntelligence InformationRetrieval
1 Surajit
Chaudhuri Jiawei
Han SridharMahadevan W.BruceCroft
2 H.V.
Jagadish Philip
S.
Yu Takeo
Kanade Iadh
Ounis
3 MichaelJ.Carey ChristosFaloutsos AndrewW.Moore MarkSanderson
4 MichaelStonebraker WeiWang Satinder
P.Singh ChengXiang
Zhai
5 C.Mohan Shusaku
Tsumoto ThomasS.Huang GerardSalton
Top5authorsconcentratedineacharea
No. Database DataMining Artificial
Intelligence Information
Retrieval
1 VLDB KDD IJCAI SIGIR
2 SIGMOD SDM AAAI ECIR
3 PODS PAKDD CVPR WWW
4 ICDE ICDM ICML WSDM
5 EDBT PKDD ECML CIKMTop5conferencesconcentratedineacharea
ClassificationofInformationNetworksClassificationofInformationNetworks
-
7/31/2019 Mining Heterogeneous Information Networks
93/176
96
ClassificationofHeterogeneousInformationNetworks:
Graph
regularization
Based
Method
(GNetMine)
MultiRelationalMiningBasedMethod(CrossMine)
StatisticalRelational
Learning
Based
Method
(SRL)
ClassificationofHomogeneousInformationNetworks
MultiMultiRelation to Flat Relation Mining?RelationtoFlatRelationMining?
-
7/31/2019 Mining Heterogeneous Information Networks
94/176
97
Multi RelationtoFlatRelationMining?g
Foldingmultiple
relations
into
asingle
flat one
for
mining?
Cannotbeasolutionduetoproblems:
Loseinformation
of
linkages
and
relationships,
no
semantics
preservation
Cannotutilizeinformationofdatabasestructuresorschemas
(e.g.,ER
modeling)
Patientflatten
ContactDoctor
OneApproach:InductiveLogicProgramming(ILP)OneApproach:InductiveLogicProgramming(ILP)
-
7/31/2019 Mining Heterogeneous Information Networks
95/176
98
Findahypothesis
that
is
consistent
with
background
knowledge(trainingdata)
FOIL,Golem,Progol,TILDE,
Backgroundknowledge
Relations(predicates),Tuples (groundfacts)
InductiveLogicProgramming(ILP)
Hypothesis:
Thehypothesis
is
usually
aset
of
rules,
which
canpredictcertainattributesincertainrelations
Daughter(X,Y) female(X),parent(Y,X)
Daughter(mary, ann) +
Daughter(eve, tom) +
Daughter(tom, ann)
Daughter(eve, ann)
Training examples
Parent(ann, mary)
Parent(ann, tom)
Parent(tom, eve)
Parent(tom, ian)
Background knowledge
Female(ann)
Female(mary)
Female(eve)
InductiveLogic
Programming
Approach
to
MultiInductive
Logic
Programming
Approach
to
Multi
Relation ClassificationRelation Classification
-
7/31/2019 Mining Heterogeneous Information Networks
96/176
99
RelationClassificationRelationClassification ILP
Approached
to
Multi
Relation
Classification
TopdownApproaches(e.g.,FOIL)
while(enough
examplesleft)
generatearule
removeexamplessatisfyingthisrule
BottomupApproaches(e.g.,Golem)
Useeach
example
as
arule
Generalizerulesbymergingrules
DecisionTreeApproaches(e.g.,TILDE)
ILPApproach:ProsandCons
Advantages:Expressiveandpowerful,andrulesareunderstandable
Disadvantages:Inefficientfordatabaseswithcomplexschemas,and
inappropriateforcontinuousattributes
FOIL:FirstFOIL:FirstOrderInductiveLearner(RuleGeneration)OrderInductiveLearner(RuleGeneration)
-
7/31/2019 Mining Heterogeneous Information Networks
97/176
100
Findaset
of
rules
consistent
with
training
data
Atopdown,sequentialcoveringlearner
Buildeachrulebyheuristics
Foilgain
a
special
type
of
information
gain
Examplescovered
byRule1
Allexamples
Examplescovered
byRule2Examples
covered
byRule3
Positive
examples
Negative
examples
A3=1
A3=1&&A1=2A3=1&&A1=2
&&A8=5
Togenerate
arule
while(true)
findthebestpredicatep
if
foilgain(p)>thresholdthen
addp
tocurrentrule
else
break
FindtheBestPredicate:PredicateEvaluationFindtheBestPredicate:PredicateEvaluation
-
7/31/2019 Mining Heterogeneous Information Networks
98/176
101
Allpredicates
in
arelation
can
be
evaluated
based
on
propagatedIDs
Usefoilgain toevaluatepredicates
Supposecurrent
rule
is
r.
For
apredicate
p,
foilgain(p)=
CategoricalAttributes
Computefoil
gain
directly
NumericalAttributes
Discretize witheverypossiblevalue
( )( )
( ) ( )
( )
( ) ( )
+++
++
++
prNprP
prP
rNrP
rPprP loglog
-
7/31/2019 Mining Heterogeneous Information Networks
99/176
CrossMineCrossMine:AnEffectiveMulti:AnEffectiveMultirelationalClassifierrelationalClassifier
-
7/31/2019 Mining Heterogeneous Information Networks
100/176
103
Methodology
TupleIDpropagation:anefficientandflexible
methodfor
virtually
joining
relations
Confinetherulesearchprocessinpromising
directions
Lookoneahead:amorepowerfulsearchstrategy
Negativetuple sampling:improveefficiencywhile
maintainingaccuracy
TupleTuple IDPropagationIDPropagation
-
7/31/2019 Mining Heterogeneous Information Networks
101/176
104
Loan ID Account ID Amount Duration Decision
1 124 1000 12 Yes
2 124 4000 12 Yes
3 108 10000 24 No
4 45 12000 36 No
0+, 0
0+, 1
0+, 1
2+, 0
Labels
1, 202/27/93monthly124
Null01/01/97weekly67
412/09/96monthly45
309/23/97weekly108
Propagated IDOpen dateFrequencyAccount ID
Applicant #1
Applicant #2
Applicant #3
Applicant #4
Propagatetuple IDsoftargetrelationtonontargetrelations
Virtuallyjoin
relations
to
avoid
the
high
cost
of
physical
joins
Possiblepredicates:
Frequency=monthly:2+,
1
Opendate
-
7/31/2019 Mining Heterogeneous Information Networks
102/176
105
TargetRelation R1 R2
R3
Efficient Onlypropagatethetuple IDs
Timeandspaceusageislow
Flexible
Canpropagate
IDs
among
non
target
relations
ManysetsofIDscanbekeptononerelation,whicharepropagatedfromdifferentjoinpaths
RuleGeneration:ExampleRuleGeneration:Example
-
7/31/2019 Mining Heterogeneous Information Networks
103/176
106
districtid
frequency
date
Account
accountid
accountid
date
amount
duration
Loan
loanid
payment
accountid
bankto
accountto
amount
Order
orderid
type
dispid
type
issuedate
Card
cardid
accountid
clientid
Disposition
dispid
birthdate
gender
districtid
Client
clientid
distname
region
#people
#lt500
District
districtid
#lt2000
#lt10000
#gt10000
#city
ratiourban
avgsalary
unemploy95
unemploy96
denenter
#crime95
#crime96
accountid
date
type
operation
Transaction
transid
amount
balance
symbol
Targetrelation
Firstpredicate
Second
predicate
RuleGeneration
Startatthetargetrelation
Repeat
Searchinallactiverelations
Searchin
all
relations
joinabletoactiverelations
Addthebestpredicatetothecurrentrule
Setthe
involved
relation
toactive
Until
Thebestpredicatedoesnothaveenoughgain
Currentruleistoolong
LookLookoneoneaheadinRuleGenerationaheadinRuleGeneration
-
7/31/2019 Mining Heterogeneous Information Networks
104/176
107
Twotypesofrelations:EntityandRelationship
Oftencannotfindusefulpredicatesonrelationsofrelationship
Solutionof
CrossMine:
WhenpropagatingIDstoarelationofrelationship,
propagateonemoresteptonextrelationofentity
Target
Relation
Nogoodpredicate
NegativeNegativeTupleTuple SamplingSampling
-
7/31/2019 Mining Heterogeneous Information Networks
105/176
108
Each
time
arule
is
generated,
covered
positive
examples
are
removed
Aftergeneratingmanyrules,therearemuchlesspositiveexamplesthannegativeones
Cannotbuildgoodrules(lowsupport)
Stilltimeconsuming(largenumberofnegativeexamples)
Solution:Samplingonnegativeexamples
Improveefficiencywithoutaffectingrulequality
+
++
++++ ++
+
+
+
++
+
+ +
+ +
++
RealDatasetRealDataset
-
7/31/2019 Mining Heterogeneous Information Networks
106/176
109
PKDDCup99dataset LoanApplication
Accuracy Time (per fold)
FOIL 74.0% 3338 sec
TILDE 81.3% 2429 sec
CrossMine 90.7% 15.3 sec
Accuracy Time (per fold)FOIL 79.7% 1.65 sec
TILDE 89.4% 25.6 sec
CrossMine 87.7% 0.83 sec
Mutagenesisdataset(4relations):Only4relations,soTILDE
doesagoodjob,thoughslow
ClassificationofInformationNetworksClassificationofInformationNetworks
-
7/31/2019 Mining Heterogeneous Information Networks
107/176
110
ClassificationofHeterogeneousInformationNetworks:
GraphregularizationBasedMethod(GNetMine)
MultiRelationalMiningBasedMethod(CrossMine)
StatisticalRelational
Learning
Based
Method
(SRL)
ClassificationofHomogeneousInformationNetworks
ProbabilisticRelational
Models
in
Probabilistic
Relational
Models
in
Statistical Relational LearningStatisticalRelationalLearning
-
7/31/2019 Mining Heterogeneous Information Networks
108/176
111
StatisticalRelationalLearningg
Goal:model
distribution
of
data
in
relational
databases
Treatbothentitiesandrelationsasclasses
Intuition:objectsarenolongerindependentwitheachother
Buildstatistical
networks
according
to
the
dependency
relationshipsbetweenattributesofdifferentclasses
AProbabilisticRelationalModels (PRM)consistsof
Relational
schema
(from
databases) Dependencystructure(betweenattributes)
Localprobabilitymodel(conditionalprobabilitydistribution)
Three
major
methods
of
probabilistic
relational
models RelationalBayesianNetwork(RBN,Lise Getoor etal.)
RelationalMarkovNetworks(RMN,BenTaskar etal.)
RelationalDependency
Networks
(RDN,
Jennifer
Neville
et
al.)
RelationalBayesianNetworks(RBN)RelationalBayesianNetworks(RBN)
-
7/31/2019 Mining Heterogeneous Information Networks
109/176
112
ExtendBayesian
network
to
consider
entities,
properties
and
relationshipsinaDBscenario
Threedifferentuncertainties
Attributeuncertainty
Modelconditionalprobabilityforanattributegivenits
parentvariables
Structuraluncertainty
Modelconditionalprobabilityforareferenceorlink
existencegivenitsparentvariables
Classuncertainty
Refinetheconditionalprobabilitybyconsideringsub
classesorhierarchyofclasses
-
7/31/2019 Mining Heterogeneous Information Networks
110/176
ClassificationofInformationNetworksClassificationofInformationNetworks
-
7/31/2019 Mining Heterogeneous Information Networks
111/176
114
ClassificationofHeterogeneousInformationNetworks:
GraphregularizationBasedMethod(GNetMine)
MultiRelationalMiningBasedMethod(CrossMine)
StatisticalRelational
Learning
Based
Method
(SRL)
ClassificationofHomogeneousInformationNetworks
TransductiveTransductive LearningintheGraphLearningintheGraph
-
7/31/2019 Mining Heterogeneous Information Networks
112/176
115
Problem:
for
aset
of
nodes
in
the
graph,
the
class
labels
are
given
for
partial
of
thenodes,thetaskistolearnthelabelsoftheunlabelednodes
Methods
Labelpropagation
algorithm
[Zhu
et
al.
2002,
Zhou
et
al.
2004,
Szummer et
al.2001]
Iterativelypropagatelabelstoitsneighbors,accordingtothe
transitionprobabilitydefinedbythenetwork
Graphregularizationbasedalgorithm[Zhouetal.2004]
Intuition:tradeoffbetween(1)consistencywiththelabelingdataand
(2)consistencybetweenlinkedobjects
Anquadratic
optimization
problem
OutlineOutline
-
7/31/2019 Mining Heterogeneous Information Networks
113/176
116
Motivation: WhyMining
Heterogeneous
Information
Networks?
PartI: Clustering,RankingandClassification
ClusteringandRankinginInformationNetworks
ClassificationofInformationNetworks
PartII: DataQualityandSearchinInformationNetworks
DataCleaning and
Data
Validation
by
InfoNet Analysis
SimilaritySearchinInformationNetworks
PartIII:AdvancedTopicsonInformationNetworkAnalysis
RoleDiscovery
and
OLAP
in
Information
Networks
MiningEvolutionandDynamicsofInformationNetworks
Conclusions
DataCleaningbyLinkAnalysisDataCleaningbyLinkAnalysis
-
7/31/2019 Mining Heterogeneous Information Networks
114/176
117
Objectreconciliation
vs.
object
distinction
as
data
cleaning
tasks
Linkanalysismaytakeadvantagesofredundancyandmake
facilitateentitycrosscheckingandvalidation
Objectdistinction:
Different
people/objects
do
share
names
InAllMusic.com,72songsand3albumsnamedForgotten or
TheForgotten
InDBLP,
141
papers
are
written
by
at
least
14
Wei
Wang
Newchallengesofobjectdistinction:
Textualsimilaritycannotbeused
Distinct:Objectdistinctionbyinformationnetworkanalysis
X.Yin,J.Han,andP.S.Yu,ObjectDistinction:Distinguishing
Objects
with
Identical
Names
by
Link
Analysis,
ICDE'07
-
7/31/2019 Mining Heterogeneous Information Networks
115/176
-
7/31/2019 Mining Heterogeneous Information Networks
116/176
TrainingwiththeTrainingwiththeSameSame DataSetDataSet
-
7/31/2019 Mining Heterogeneous Information Networks
117/176
120
Build
atraining
set
automatically
Selectdistinctnames,e.g.,JohannesGehrke
Thecollaborationbehaviorwithinthesamecommunityshare
somesimilarity
Trainingparametersusingatypicalandlargesetof
unambiguous examples
UseSVMtolearnamodelforcombiningdifferentjoinpaths
Eachjoinpathisusedastwoattributes(withlinkbased
similarityand
neighborhood
similarity)
Themodelisaweightedsumofallattributes
-
7/31/2019 Mining Heterogeneous Information Networks
118/176
RealCases:DBLPPopularNamesRealCases:DBLPPopularNames
-
7/31/2019 Mining Heterogeneous Information Networks
119/176
122
Nam e Num _aut hor s Num _r ef s accur acy pr ecision r ecall f - m easur e
Hui
Fang 3 9 1.0 1.0 1.0 1.0
Ajay Gupta 4 16 1.0 1.0 1.0 1.0
Joseph Hellerstein 2 151 0.81 1.0 0.81 0.895
Rakesh
Kumar 2 36 1.0 1.0 1.0 1.0
Michael Wagner 5 29 0.395 1.0 0.395 0.566
Bing Liu 6 89 0.825 1.0 0.825 0.904
Jim Smith 3 19 0.829 0.888 0.926 0.906
Lei Wang 13 55 0.863 0.92 0.932 0.926
Wei Wang 14 141 0.716 0.855 0.814 0.834
Bin Yu 5 44 0.658 1.0 0.658 0.794
Average 0.81 0.966 0.836 0.883
DistinguishingDifferentDistinguishingDifferentWeiWeiWangWangss
-
7/31/2019 Mining Heterogeneous Information Networks
120/176
123
UNC-CH(57)
Fudan U, China(31)
UNSW, Australia(19)
SUNYBuffalo
(5)
BeijingPolytech
(3)
NUSingapore
(5)
Harbin UChina
(5)
Zhejiang UChina
(3)
Najing NormalChina
(3)
Ningbo TechChina
(2)
Purdue(2)
Beijing U ComChina
(2)
Chongqing UChina
(2)
SUNYBinghamton
(2)
5
6
2
-
7/31/2019 Mining Heterogeneous Information Networks
121/176
TruthValidationbyInfo.NetworkAnalysisTruthValidationbyInfo.NetworkAnalysis
-
7/31/2019 Mining Heterogeneous Information Networks
122/176
125
Thetrustworthinessproblemoftheweb(accordingtoasurvey):
54%ofInternetuserstrustnewswebsitesmostoftime
26%forwebsitesthatsellproducts
12%for
blogs
TruthFinder:TruthdiscoveryontheWebbylinkanalysis
Amongmultipleconflictresults,canweautomaticallyidentifywhichone
islikely
the
true
fact?
Veracity(conformitytotruth):
Givenalargeamountofconflictinginformationaboutmanyobjects,
providedby
multiple
web
sites
(or
other
information
providers), how
to
discoverthetruefactabouteachobject?
Ourwork: Xiaoxin Yin,Jiawei Han,PhilipS.Yu,TruthDiscoverywithMultiple
ConflictingInformationProvidersontheWeb,TKDE08
-
7/31/2019 Mining Heterogeneous Information Networks
123/176
-
7/31/2019 Mining Heterogeneous Information Networks
124/176
-
7/31/2019 Mining Heterogeneous Information Networks
125/176
OverviewoftheOverviewoftheTruthFinderTruthFinder MethodMethod
-
7/31/2019 Mining Heterogeneous Information Networks
126/176
129
Confidenceoffacts Trustworthinessofwebsites
Afacthashighconfidence ifitisprovidedby(many)
trustworthyweb
sites
Awebsiteistrustworthyifitprovidesmanyfactswithhigh
confidence
TheTruthFinder mechanism,anoverview:
Initially,eachwebsiteisequallytrustworthy
Basedon
the
above
four
heuristics,
infer
fact
confidence
from
websitetrustworthiness,andthenbackwards
Repeat
until
achieving
stable
state
AnalogytoAuthorityAnalogytoAuthorityHubAnalysisHubAnalysis
-
7/31/2019 Mining Heterogeneous Information Networks
127/176
130
Facts Authorities, Websites Hubs
Differencefromauthorityhubanalysis
Linearsummationcannotbeused
A
web
site
is
trustable
if
it
provides
accurate
facts,
instead
of
many
facts
Confidenceistheprobabilityofbeingtrue
Different
facts
of
the
same
object
influence
each
other
w1 f1
Web sites Facts
Hubs Authorities
Hightrustworthiness
Highconfidence
-
7/31/2019 Mining Heterogeneous Information Networks
128/176
-
7/31/2019 Mining Heterogeneous Information Networks
129/176
Experiments:FindingTruthofFactsExperiments:FindingTruthofFacts
-
7/31/2019 Mining Heterogeneous Information Networks
130/176
133
Determiningauthorsofbooks
Datasetcontains1265bookslistedonabebooks.com
We
analyze
100
random
books
(using
book
images)
Case Voting TruthFinder Barnes & Noble
Correct 71 85 64
Miss author(s) 12 2 4
Incomplete names 18 5 6
Wrong first/middle names 1 1 3
Has redundant names 0 2 23
Add incorrect names 1 5 5
No information 0 0 2
-
7/31/2019 Mining Heterogeneous Information Networks
131/176
OutlineOutline
-
7/31/2019 Mining Heterogeneous Information Networks
132/176
135
Motivation: WhyMining
Heterogeneous
Information
Networks?
PartI: Clustering,RankingandClassification
ClusteringandRankinginInformationNetworks
ClassificationofInformationNetworks
PartII: DataQualityandSearchinInformationNetworks
DataCleaning
and
Data
Validation
by
InfoNet Analysis
SimilaritySearchinInformationNetworks
PartIII:AdvancedTopicsonInformationNetworkAnalysis
RoleDiscovery
and
OLAP
in
Information
Networks
MiningEvolutionandDynamicsofInformationNetworks
Conclusions
-
7/31/2019 Mining Heterogeneous Information Networks
133/176
-
7/31/2019 Mining Heterogeneous Information Networks
134/176
SemanticsSemanticsBasedSimilaritySearchinBasedSimilaritySearchinInfoNetInfoNet
-
7/31/2019 Mining Heterogeneous Information Networks
135/176
138
Searchtop
ksimilar
objects
of
the
same
type
for
aquery
FindresearchersmostsimilarwithChristosFaloutsos
Twocriticalconceptstodefineasimilarity
Featurespace
Traditionaldata:attributesdenotedasnumerical
value/vector,set,etc.
Networkdata:
arelation
sequence
called
path
schema
Existinghomogeneousnetworkbasedsimilaritydoes
notdealwiththisproblem
Measuredefined
on
the
feature
space
Cosine,Euclideandistance,Jaccard coefficient,etc.
PathSim
PathSchemaforDBLPQueriesPathSchemaforDBLPQueries
-
7/31/2019 Mining Heterogeneous Information Networks
136/176
139
Pathschema:
A
path
of
InfoNet schema,
e.g.,
APC,
APA
WhoaremostsimilartoChristosFaloutsos?
FlickrFlickr:WhichPicturesAreMostSimilar?:WhichPicturesAreMostSimilar?
-
7/31/2019 Mining Heterogeneous Information Networks
137/176
140
Somepath
schema
leads
to
similarity
closer
to
human
intuition
Butsomeothersarenot
NotAllSimilarityMeasuresAreGoodNotAllSimilarityMeasuresAreGood
-
7/31/2019 Mining Heterogeneous Information Networks
138/176
141
Favor highly
visibleobjects
Notreasonable
-
7/31/2019 Mining Heterogeneous Information Networks
139/176
PathSimPathSim:Definition&Properties:Definition&Properties
-
7/31/2019 Mining Heterogeneous Information Networks
140/176
143
Commutingmatrix
corresponding
to
apath
schema
MP
Productofadjacencymatrixofrelationsinthepathschema
TheelementMP(i,j)denotesthestrengthbetweenobjecti
andobject
jin
the
semantic
of
path
schema
P
Iftheweightofadjacencymatrixisunweighted,itwill
denotethenumberofpathinstancesfollowingpath
schemaP
PathSim
s(i,j)=2MP(i,j)/(MP(i,i)+MP(j,j))
Properties:
CoCoClusteringClusteringBasedPruningAlgorithmBasedPruningAlgorithm
-
7/31/2019 Mining Heterogeneous Information Networks
141/176
144
Storecommuting
matrices
for
short
path
schemas
&
compute
top
kqueries
on
line
Framework
Generatecoclustersformaterializedcommutingmatrices,forfeature
objectsand
target
objects
Deriveupperboundforsimilaritybetweenobjectandtargetcluster,and
betweenobjectandobject:Safelypruningtargetclustersandobjectsif
theupperboundsimilarityislowerthancurrentthreshold
Dynamicallyupdatetopkthreshold:
Performance:Baselinevs.pruning
-
7/31/2019 Mining Heterogeneous Information Networks
142/176
-
7/31/2019 Mining Heterogeneous Information Networks
143/176
RoleDiscovery:
Extraction
Semantic
Role
Discovery:
Extraction
Semantic
InformationfromLinksInformationfromLinks
-
7/31/2019 Mining Heterogeneous Information Networks
144/176
147
Objective:Extract
semantic
meaning
from
plain
links
to
finely
modelandbetterorganizeinformationnetworks
Challenges
Latentsemantic
knowledge
Interdependency
Scalability
Opportunity
Humanintuition
Realistic
constraint
Crosscheckwithcollectiveintelligence
Methodology:propagatesimpleintuitiverulesandconstraints
over
the
whole
network
Discoveryof
AdvisorDiscovery
of
Advisor
Advisee
Advisee
RelationshipsinDBLPNetworkRelationshipsinDBLPNetwork
-
7/31/2019 Mining Heterogeneous Information Networks
145/176
148
Input:DBLP
research
publication
network
Output:Potentialadvisingrelationshipanditsranking (r,[st,ed])
Ref. C.Wang,J.Han,etal., MiningAdvisorAdvisee
Relationshipsfrom
Research
Publication
Networks,
SIGKDD
2010
Smith
2000
2000
2001
2002
2003
1999
Ada Bob
Jerry
Ying
Input: Temporal
collaboration networkOutput: Relationship analysis
(0.8, [1999,2000])
(0.7,
[2000, 2001])
(0.65, [2002, 2004])
2004
Ada
Bob
Ying
Smith
(0.2,
[2001, 2003])
(0.5, [/, 2000])
(0.9, [/, 1998])
(0.4,
[/, 1998])
(0.49,
[/, 1999])
Visualized chorological hierarchies
Jerry
OverallFrameworkOverall
Framework
-
7/31/2019 Mining Heterogeneous Information Networks
146/176
149
ai:authori
pj:paperj
py:paper
year
pn:paper#
sti,yi:startingtime
edi,yi:endingtime
ri,yi:rankingscore
-
7/31/2019 Mining Heterogeneous Information Networks
147/176
ExperimentResultsExperimentResults
-
7/31/2019 Mining Heterogeneous Information Networks
148/176
151
DBLPdata:
654,
628
authors,
1076,946
publications,
years
provided
Labeleddata:MathGealogy Project;AIGealogy Project;
Homepage
Datasets RULE SVM IndMAX TPFG
TEST1 69.9% 73.4% 75.2% 78.9% 80.2% 84.4%
TEST2 69.8% 74.6% 74.6% 79.0% 81.5% 84.3%
TEST3 80.6% 86.7% 83.1% 90.9% 88.8% 91.3%
Empirical
parameter
optimized
parameter
heuristics Supervised
learning
-
7/31/2019 Mining Heterogeneous Information Networks
149/176
Graph/NetworkSummarization:
Graph
Graph/Network
Summarization:
Graph
CompressionCompression
-
7/31/2019 Mining Heterogeneous Information Networks
150/176
153
Extractcommon
subgraphs and
simplify
graphs
by
condensing
thesesubgraphs intonodes
OLAPon
Information
NetworksOLAP
on
Information
Networks
-
7/31/2019 Mining Heterogeneous Information Networks
151/176
154
WhyOLAP
information
networks?
AdvantagesofOLAP:Interactiveexplorationofmultidimensional
andmultilevelspaceinadatacubeInfonet
Multidimensional:
Different
perspectives
Multilevel:Differentgranularities
InfoNet OLAP:Rollup/drilldownandslice/diceoninformation
networkdata
TraditionalOLAPcannothandlethis,becausetheyignorelinks
amongdataobjects
Handlingtwo
kinds
of
InfoNet OLAP
InformationalOLAP
TopologicalOLAP
InformationalOLAPInformational
OLAP
-
7/31/2019 Mining Heterogeneous Information Networks
152/176
155
Inthe
DBLP
network,
study
the
collaborationpatternsamongresearchers
Dimensionscomefrominformational
attributesattached
at
the
whole
snapshot
level,socalled InfoDims
IOLAPCharacteristics:
Overlaymultiple
pieces
of
information
Nochangeontheobjectswhose
interactionsarebeingexamined
Inthe
underlying
snapshots,
each
nodeisaresearcher
Inthesummarizedview,eachnodeis
stillaresearcher
TopologicalOLAPTopological
OLAP
-
7/31/2019 Mining Heterogeneous Information Networks
153/176
156
Dimensionscome
from
the
node/edgeattributesinsideindividual
networks,socalledTopoDims
T
OLAP
Characteristics Zoomin/Zoomout
Networktopologychanged:
generalized nodesand
generalized edges
Intheunderlyingnetwork,
each
node
is
a
researcher Inthesummarizedview,each
nodebecomesaninstitutethat
comprises
multiple
researchers
-
7/31/2019 Mining Heterogeneous Information Networks
154/176
OutlineOutline
-
7/31/2019 Mining Heterogeneous Information Networks
155/176
158
Motivation: WhyMining
Heterogeneous
Information
Networks?
PartI: Clustering,RankingandClassification
ClusteringandRankinginInformationNetworks
Classificationof
Information
Networks
PartII: DataQualityandSearchinInformationNetworks
DataCleaning
and
Data
Validation
by
InfoNet Analysis
SimilaritySearchinInformationNetworks
PartIII:AdvancedTopicsonInformationNetworkAnalysis
RoleDiscovery
and
OLAP
in
Information
Networks
MiningEvolutionandDynamicsofInformationNetworks
Conclusions
MiningEvolutionandDynamicsofMiningEvolutionandDynamicsofInfoNetInfoNet
-
7/31/2019 Mining Heterogeneous Information Networks
156/176
159
Manynetworks
are
with
time
information
E.g.,accordingtopaperpublicationyear,DBLPnetworkscan
formnetworksequences
Motivation:Model
evolution
of
communities
in
heterogeneous
network
Automaticallydetectthebestnumberofcommunitiesineach
timestamp Modelthesmoothnessbetweencommunitiesofadjacent
timestamps
Modelthe
evolution
structure
explicitly
Birth,death,split
-
7/31/2019 Mining Heterogeneous Information Networks
157/176
-
7/31/2019 Mining Heterogeneous Information Networks
158/176
-
7/31/2019 Mining Heterogeneous Information Networks
159/176
AccuracyStudyAccuracy
Study
-
7/31/2019 Mining Heterogeneous Information Networks
160/176
163
Themore
types
of
objects
used,
the
better
accuracy
Historicalpriorresultsinbetteraccuracy
-
7/31/2019 Mining Heterogeneous Information Networks
161/176
-
7/31/2019 Mining Heterogeneous Information Networks
162/176
OutlineOutline
-
7/31/2019 Mining Heterogeneous Information Networks
163/176
Motivation: WhyMining
Heterogeneous
Information
Networks?
PartI: Clustering,RankingandClassification
ClusteringandRankinginInformationNetworks
Classificationof
Information
Networks
PartII: DataQualityandSearchinInformationNetworks
Data
Cleaning
and
Data
Validation
by
InfoNet Analysis SimilaritySearchinInformationNetworks
PartIII:AdvancedTopicsonInformationNetworkAnalysis
RoleDiscovery
and