mining heterogeneous information networks

7/31/2019 Mining Heterogeneous Information Networks

1/176

1

MiningHeterogeneousMiningHeterogeneous

InformationNetworksInformation

Networks

JiaweiHan

YizhouSun

XifengYan PhilipS.Yu

UniversityofIllinoisatUrbanaChampaign

Universityof

California

at

Santa

Barbara

UniversityofIllinoisatChicago

Acknowledgements:NSF,ARL,NASA,AFOSR(MURI),Microsoft,IBM,Yahoo!,Google,HPLab&Boeing

July12,2010

ACMSIGKDDConferenceTutorial,Washington,D.C.,July25,2010


2/176

2

OutlineOutline

Motivation: WhyMiningHeterogeneousInformationNetworks?

PartI: Clustering,RankingandClassification

Clusteringand

Ranking

in

Information

Networks

ClassificationofInformationNetworks

PartII: DataQualityandSearchinInformationNetworks

DataCleaning

and

Data

Validation

by

InfoNet

Analysis

SimilaritySearchinInformationNetworks

PartIII:AdvancedTopicsonInformationNetworkAnalysis

RoleDiscoveryandOLAPinInformationNetworks

MiningEvolutionandDynamicsofInformationNetworks

Conclusions


3/176

3

OutlineOutline

Motivation: WhyMiningHeterogeneousInformationNetworks?


Clusteringand

Ranking

in

Information

Networks



DataCleaning

and

Data

Validation

by

InfoNet

Analysis





Conclusions


4/176

4

WhatAreInformationNetworks?WhatAreInformationNetworks? Information

network:

A

network

where

each

node

represents

an

entity

(e.g.,

actorinasocialnetwork)andeachlink(e.g.,tie)arelationshipbetween

entities

Eachnode/link

may

have

attributes,

labels,

and

weights

Linkmaycarryrichsemanticinformation

Homogeneousvs.heterogeneousnetworks

Homogeneousnetworks

Singleobjecttypeandsinglelinktype

Singlemodelsocialnetworks(e.g.,friends)

WWW:

a

collection

of

linked

Web

pages Heterogeneous,multitypednetworks

Multipleobjectandlinktypes

Medical

network:

patients,

doctors,

disease,

contacts,

treatments Bibliographicnetwork:publications,authors,venues


5/176

5

UbiquitousInformationNetworksUbiquitousInformationNetworks

Graphsand

substructures

Chemicalcompounds,computervisionobjects,circuits,XML

Biologicalnetworks

Bibliographicnetworks:DBLP,ArXiv,PubMed,

Socialnetworks:Facebook>100millionactiveusers

WorldWide

Web

(WWW):

>3billion

nodes,

>50

billion

arcs

Cyberphysicalnetworks

Yeast prot eininteract ion net w ork

An I n t e r n et W eb Co-aut hor netw ork Social netw ork sit es


6/1766

Homogeneousvs.HeterogeneousNetworksHomogeneousvs.HeterogeneousNetworks

Conference-Author Net w orkCo-aut hor Netw ork


7/1767

DBLP:AnInterestingandFamiliarNetworkDBLP:AnInterestingandFamiliarNetwork

DBLP:

Acomputer

science

publication

bibliographic

database

1.4Mrecords(papers),0.7Mauthors,5Kconferences,

Willthisdatabasediscloseinterestingknowledgeabout

computerscience

research?

Whatarethepopularresearchfields/subfieldsinCS?

WhoaretheleadingresearchersonDBorXQueries?

Howdo

the

authors

in

this

subfield

collaborate

and

evolve?

HowmanyWeiWangsinDBLP,whichpaperdonebywhich?

WhoisSergy Brins supervisorandwhen?

WhoareverysimilartoChristosFaloutsos?

Allthesekindsofquestions,andpotentiallymuchmore,canbe

nicely

answered

by

the

DBLP

InfoNet How? Exploringthepoweroflinksininformationnetworks!


8/1768

Homo.vs.Hetero.:DifferencesinDBHomo.vs.Hetero.:DifferencesinDBInfoNetInfoNet MiningMining Homogeneous

networks

can

often

be

derived

from

their

original

heterogeneousnetworks

Coauthornetworkscanbederivedfromauthorpaper

conferencenetworks

by

projection

on

authors

only

Papercitationnetworkscanbederivedfromacomplete

bibliographicnetworkwithpapersandcitationsprojected

HeterogeneousDB

InfoNet carries

richer

information

than

its

correspondingprojectedhomogeneousnetworks

TypedheterogeneousInfoNet vs.nontypedhetero.InfoNet (i.e.,

notdistinguishing

different

types

of

nodes)

TypednodesandlinksimplyamorestructuredInfoNet,and

thusoftenleadtomoreinformativediscovery

Ouremphasis:

Mining

structured information

networks!


9/1769

WhyMiningHeterogeneousInformationNetworks?WhyMiningHeterogeneousInformationNetworks?

Mostdatasets

can

be

organized or

transformed into

a

structured heterogeneousinformationnetwork!

Examples:DBLP,IMDB,Flickr,GoogleNews,Wikipedia,

Structurescanbeprogressivelyextractedfromlessorganized

datasetsbyinformationnetworkanalysis

Informationrich,interrelated,organizeddatasetsformone

oraset

of

gigantic,

interconnected,

multi

typed

heterogeneousinformationnetworks

Surprisinglyrichknowledgecanbederivedfromsuch

structuredheterogeneous

information

networks

Ourgoal:Uncoverknowledgehiddenfromorganized data

Exploringthepowerofmultityped,heterogeneouslinks

Miningstructured heterogeneousinformationnetworks!


10/17610

OutlineOutline

Motivation: WhyMining

Heterogeneous

Information

Networks?


Clusteringand

Ranking

in

Information

Networks



DataCleaning

and

Data

Validation

by

InfoNet Analysis





Conclusions

l d k fCl i d R ki i I f i


11/17611

Clusteringand

Ranking

in

Information

Clustering

and

Ranking

in

Information

NetworksNetworks Integrated

Clustering

and

Ranking

of

Heterogeneous

InformationNetworks

Clusteringof

Homogeneous

Information

Networks

LinkClus:Clusteringwithlinkbasedsimilaritymeasure

SCAN:

Density

based

clustering

of

networks Others

Spectralclustering

Modularitybasedclustering

Probabilisticmodelbasedclustering

UserGuided

Clustering

of

Information

Networks


12/17612

ClusteringandRankinginHeterogeneousClusteringandRankinginHeterogeneousInformationNetworksInformationNetworks

Ranking&

clustering

each

provides

anew

view

over

anetwork

Rankinggloballywithoutconsideringclusters dumb

RankingDBandArchitectureconfs.together?

Clusteringauthorsinonehugeclusterwithoutdistinction?

Dulltoviewthousandsofobjects(thisiswhyPageRank!)

RankClus:

Integratesclustering

with

ranking

Conditionalrankingrelativetoclusters

Useshighlyrankedobjectstoimproveclusters

Qualitiesof

clustering

and

ranking

are

mutually

enhanced

Y.Sun,J.Han,etal.,RankClus:IntegratingClusteringwith

RankingforHeterogeneousInformationNetworkAnalysis,

EDBT'09.


13/17613

GlobalRankingvs.ClusterGlobalRankingvs.ClusterBasedRankingBasedRanking

AToy

Example:

Two

areas

with

10

conferences

and

100

authors

ineacharea


14/17614

RankClusRankClus:ANewFramework:ANewFramework

Sub-NetworkRanking

Clustering


15/17615

TheTheRankClusRankClus PhilosophyPhilosophy Why

integrated

Ranking

and

Clustering?

Rankingandclusteringcanbemutuallyimproved

Ranking:

Once

a

cluster

becomes

more

accurate,

ranking

will

bemorereasonableforsuchaclusterandwillbethe

distinguishedfeatureofthecluster

Clustering:Once

ranking

is

more

distinguished

from

each

other,theclusterscanbeadjustedandgetmoreaccurate

results

Noteveryobjectshouldbetreatedequallyinclustering!

Objectspreservesimilarityundernewmeasurespace

E.g.,VLDB

vs.

SIGMOD


16/17616

RankClusRankClus:AlgorithmFramework:AlgorithmFramework

Step0. Initialization

RandomlypartitiontargetobjectsintoKclusters

Step1.

Ranking

Rankingforeachsubnetworkinducedfromeachcluster,

whichservesasfeatureforeachcluster

Step2. Generatingnewmeasurespace

Estimatemixturemodelcoefficientsforeachtargetobject

Step3.

Adjusting

cluster

Step4. RepeatingSteps13untilstable


17/17617

FocusonaBiFocusonaBiTypedNetworkCaseTypedNetworkCase

Conferenceauthor

network,

links

can

exist

between

Conference(X)andauthor(Y)

Author

(Y)

and

author

(Y)

UseWtodenotethelinksandthereweights

W=


18/17618

Step1:Ranking:FeatureExtractionStep1:Ranking:FeatureExtraction

Tworanking

strategies:

Simple

ranking

vs.

authority

ranking

SimpleRanking

Proportionaltodegreecountingforobjects,e.g.,#of

publicationsofanauthor

Considersonlyimmediateneighborhoodinthenetwork

Authority

Ranking:

Extension

to

HITS

in

weighted

bi

type

network

Rule1:Highlyrankedauthorspublishmanypapersinhighly

rankedconferences

Rule2:Highlyrankedconferencesattractmanypapersfrom

manyhighlyrankedauthors

Rule

3:

The

rank

of

an

author

is

enhanced

if

he

or

she

co

authorswithmanyauthorsormanyhighlyrankedauthors


19/17619

EncodingRulesinAuthorityRankingEncodingRulesinAuthorityRanking

Rule1:Highlyrankedauthorspublishmanypapersinhighlyrankedconferences

Rule2:Highlyrankedconferencesattractmanypapersfrom

many

highly

ranked

authors

Rule3:

The

rank

of

an

author

is

enhanced

if

he

or

she

co

authorswithmanyauthorsormanyhighlyrankedauthors

E l A h i R ki i h 2E l A th it R ki i th 2 AA


20/17620

Example:AuthorityRankinginthe2Example:AuthorityRankinginthe2AreaAreaConferenceConferenceAuthorNetworkAuthorNetwork

Therankings

of

authors

are

quite

distinct

from

each

other

in

the

twoclusters

Step 2 Generate Ne Meas re SpaceStep 2 Generate Ne Meas re Space


21/17621

Step2:

Generate

New

Measure

SpaceStep

2:

Generate

New

Measure

Space:

:

AMixtureModelMethodAMixtureModelMethod Consider

each

target

objects

links

are

generated

under

amixture

distributionofrankingfromeachcluster

Consider

ranking

as

a

distribution:

r(Y)

p(Y)

Eachtargetobject xi ismappedintoaKvector(i,k)

Parametersare

estimated

using

the

EM

algorithm

Maximizetheloglikelihoodgivenalltheobservationsof

links

Example: 2Example: 2 D Coefficients in the 2D Coefficients in the 2 AreaArea


22/17622

Example:2Example:2

D

Coefficients

in

the

2D

Coefficients

in

the

2

Area

Area

ConferenceConferenceAuthorNetworkAuthorNetwork

Theconferencesarewellseparatedinthenewmeasurespace

Scatterplots

of

two

conferences

and

component

coefficients

St 3 Cl t Adj t t i NSt 3 Cl t Adj t t i N


23/17623

Step3:ClusterAdjustmentinNewStep3:ClusterAdjustmentinNewMeasureSpaceMeasureSpace

Clustercenterinnewmeasurespace

Vectormeanofobjectsinthecluster(Kdimensional)

Cluster

adjustment Distancemeasure:1Cosinesimilarity

Assigntotheclusterwiththenearestcenter

WhyBetterRankingFunctionDerivesBetterClustering?

Considerthemeasurespacegenerationprocess

Highlyranked

objects

in

acluster

play

amore

important

role

to decide

atargetobjectsnewmeasure

Intuitively,ifwecanfindthehighlyrankedobjectsina

cluster,equivalently,

we

get

the

right

cluster


24/176

24

StepStepbybyStepRunningCaseIllustrationStepRunningCaseIllustration

I nit ial ly, rankingdist ribut ions are

mixed t oget her

Tw o clust ers ofobject s mixed

t oget her, butpreserve similarity

somehow

I mproved a l i t t le

Tw o clusters are

almost w ellseparated

Improvedsignificantly

Stable

Well separat ed


25/176

25

TimeComplexity:Linearto#ofLinksTimeComplexity:Linearto#ofLinks

Ateach

iteration,

|E|:

edges

in

network,

m:

number

of

target

objects,K:numberofclusters

Rankingforsparsenetwork

~O(|E|)

Mixturemodelestimation

~O(K|E|+mK) Clusteradjustment

~O(mK^2)

Inall,

linear

to

|E|

~O(K|E|)

Note:SimRank willbeatleastquadraticateachiterationsinceit

evaluatesdistancebetweeneverypairinthenetwork


26/176

26

CaseStudy:Dataset:DBLPCaseStudy:Dataset:DBLP

Allthe2676conferencesand20,000authorswithmost

publications,fromthetimeperiodofyear1998toyear2007

Bothconference

author

relationships

and

co

author

relationshipsareused

K=15(selectonly5clustershere)

NetClusNetClus : Rank ing & Clust er ing w i t h St ar: Rank ing & Clust er ing w i t h St ar


27/176

27

NetClusNetClus : Rank ing & Clust er ing w i t h St ar: Rank ing & Clust er ing w i t h St ar

Ne tw ork Sc hemaNet w ork Sc hema

Beyondbi

typed

information

network:

A

Star

Network

Schema

Splitanetworkintodifferentlayers,eachrepresentingbyanet

cluster

27


28/176

28

StarNetStarNet:Schema&Net:Schema&NetClusterCluster

StarNetwork

Schema

Centertype: Targettype

E.g.,apaper,amovie,ataggingevent

Acenter

object

is

aco

occurrence of

abag

of

different

types

of

objects,whichstandsforamultirelationamongdifferenttypesof

objects

Surroundingtypes: Attribute(property)types

NetCluster

GivenainformationnetworkG,anetclusterCcontainstwopiecesof

information:

Nodeset

and

link

set

as

asub

network

of

G

Membershipindicatorforeachnodex:P(x inC)

GivenainformationnetworkG,clusternumberK,aclusteringforGisa

setofnetclustersandforeachnodex,thesumofxs probability

distributionin

all

Knet

clusters

should

be

1


29/176

29

ResearchPaper

Term

AuthorVenue

Publish Write

Contain

DBLP 29


30/176

30

StarNetStarNet

ofof

Delicious.comDelicious.com

TaggingEvent

Tag

UserWebSite

Contain

Delicious.com 30

Actor/A


31/176

31

StartNetStartNet forIMDBforIMDBMovie

Title/

Plot

DirectorActor/Actress

Star in Direct

Contain

IMDB

31


32/176

32

RankingFunctionsRankingFunctions

RankinganobjectxoftypeTx inanetworkG,denotedasp(x|Tx,G)

Giveascoretoeachobject,accordingtoitsimportance

Differentrulesdefineddifferentrankingfunctions:

SimpleRanking

Rankingscoreisassignedaccordingtothedegreeofanobject

AuthorityRanking

Rankingscoreisassignedaccordingtothemutualenhancementby

propagationsofscorethroughlinks

highlyrankedconferencesacceptmanygoodpaperspublishedby

manyhighlyrankedauthors,andhighlyrankedauthorspublishmany

goodpapersinhighlyrankedconferences:


33/176

33

RankingFunction(Cont.)RankingFunction(Cont.)

Priorscanbeadded:

PP(X|Tx,Gk)=(1 P)P(X|Tx,Gk)+PP0(X|Tx,Gk)

P0(X|Tx,

Gk)

is

the

prior

knowledge,

usually

given

as

a

distribution,denotedbyonlyseveralwords

Pistheparameterthatwebelieveinpriordistribution

Rankingdistribution

Normalizerankingscoresto1,giventhemaprobabilistic

meaning

Similarto

the

idea

of

PageRank


34/176

34

NetClusNetClus:AlgorithmFramework:AlgorithmFramework

Mapeachtargetobjectintoanewlowdimensionalfeature

spaceaccordingtocurrentnetclustering,andadjustthe

clusteringfurtherinthenewmeasurespace

Step0: Generateinitialrandomclusters

Step1: Generaterankingbasedgenerativemodelfor

targetobjectsforeachnetcluster

Step2: Calculateposteriorprobabilitiesfortargetobjects,

whichservesasthenewmeasure,andassigntargetobjects

tothenearestclusteraccordingly

Step3: Repeatsteps1and2,untilclustersdonotchangesignificantly

Step4: Calculateposteriorprobabilitiesforattribute

objectsin

each

net

cluster

Generat ive Model for Target Objec t sGenerat ive Model for Target Objec t s


35/176

35

Generat ive Model fo r Target Objec t sGenerat ive Model fo r Target Objec t s

Given a Ne tGiven a Ne t --c lus te rc lus te r

Eachtargetobjectstandsforancooccurrenceofabagofattributeobjects

Definetheprobabilityofatargetobjectdefinetheprobabilityofthe

cooccurrenceofalltheassociatedattributeobjects

GenerativeprobabilityP(d|Gk)fortargetobjectdinclusterCk :

whereP(x

|Tx

,Gk

)

isrankingfunction,P(Tx

|Gk

)istypeprobability

Twoassumptionsofindependence

Theprobabilities

to

visit

objects

of

different

types

are

independent

to

eachother

Theprobabilitiestovisittwoobjectswithinthesametypeare

independentto

each

other


36/176

39

ClusterAdjustmentClusterAdjustment

Usingposteriorprobabilitiesoftargetobjectsasnewfeature

space

Eachtarget

object

=>

Kdimension

vector

Eachnetclustercenter=>Kdimensionvector

Averageontheobjectsinthecluster

Assigneach

target

object

into

nearest

cluster

center

(e.g.,

cosinesimilarity)

Asubnetworkcorrespondingtoanewnetclusteristhenbuilt

byextracting

all

the

target

objects

in

that

cluster

and

all

linkedattributeobjects


37/176

40

Experiments:DBLPandBeyondExperiments:DBLPandBeyond

DataSet:

DBLP

all

area data

set

Allconferences+Top 50Kauthors

DBLPfourarea dataset

20conferencesfromDB,DM,ML,IR

Allauthorsfromtheseconferences

Allpapers

published

in

these

conferences

Runningcaseillustration


38/176

41

AccuracyStudy:ExperimentsAccuracyStudy:Experiments

Accuracy,compared

with

PLSA,

apure

text

model,

no

other

typesofobjectsandlinksareused,usethesamepriorasin

NetClus

Accuracy,comparedwithRankClus,abitypedclusteringmethod

ononlyonetype

AccuracyofPaperClusteringResults

AccuracyofConference ClusteringResults

N ClN Cl Di i i hi C fDi i i hi C f


39/176

42

NetClusNetClus:DistinguishingConferences:DistinguishingConferences

AAAI 0.00226670.00899168

0.934024 0.0300042

0.0247133

CIKM0.1500530.3101720.007238070.4445240.0880127

CVPR0.0001638120.007630720.931496 0.02813420.032575

ECIR3.47023e050.007126950.006574020.978391 0.00787288

ECML0.000774770.1109220.814362 0.05794260.015999

EDBT0.573362 0.316033

0.00101442

0.0245591

0.0850319

ICDE0.529522 0.3765420.002391520.01511130.0764334

ICDM0.0004550280.778452 0.05664570.1131840.0512633

ICML0.0003096240.0500780.878757 0.06223350.00862134

IJCAI0.00329816

0.0046758

0.94288 0.0303745

0.0187718

KDD0.005742230.797633 0.06173510.0676810.0672086

PAKDD0.001112460.813473 0.04031050.05747550.0876289

PKDD5.39434e050.760374 0.1196080.0529260.0670379

PODS

0.78935 0.113751

0.013939

0.00277417

0.0801858

SDM0.0001729530.841087 0.0583160.05270810.0477156

SIGIR0.006003990.002800130.002752370.977783 0.0106604

SIGMOD0.689348 0.2231220.00177030.008254550.0775055

VLDB0.701899 0.2074280.001000120.01169660.0779764

WSDM0.00751654

0.269259

0.0260291

0.683646 0.0135497

WWW0.07711860.2706350.029307 0.451857 0.171082


40/176

43

NetClusNetClus:DatabaseSystemCluster:DatabaseSystemCluster

database 0.0995511databases 0.0708818

system 0.0678563data 0.0214893query 0.0133316

systems 0.0110413queries 0.0090603

management 0.00850744object 0.00837766

relational 0.0081175processing 0.00745875

based 0.00736599distributed 0.0068367

xml 0.00664958oriented 0.00589557design 0.00527672web 0.00509167

information 0.0050518

model 0.00499396efficient 0.00465707

Surajit Chaudhuri 0.00678065Michael Stonebraker 0.00616469

Michael J. Carey 0.00545769C. Mohan 0.00528346

David J. DeWitt 0.00491615Hector Garcia-Molina 0.00453497

H. V. Jagadish 0.00434289David B. Lomet 0.00397865

Raghu Ramakrishnan 0.0039278

Philip A. Bernstein 0.00376314Joseph M. Hellerstein 0.00372064Jeffrey F. Naughton 0.00363698Yannis E. Ioannidis 0.00359853

Jennifer Widom 0.00351929

Per-Ake Larson 0.00334911Rakesh Agrawal 0.00328274

Dan Suciu 0.00309047Michael J. Franklin 0.00304099Umeshwar Dayal 0.00290143

Abraham Silberschatz 0.00278185

VLDB 0.318495SIGMOD Conf. 0.313903

ICDE 0.188746

PODS 0.107943EDBT 0.0436849

Ranking authors in XML


41/176

44

NetClusNetClus::StarNetStarNetBasedRankingandClusteringBasedRankingandClustering

Ageneral

framework

in

which

ranking

and

clustering

are

successfullycombinedtoanalyzeinfornets

Rankingandclusteringcanmutuallyreinforceeachotherin

informationnetwork

analysis

NetClus,anextensiontoRankClus thatintegratesrankingand

clusteringandgeneratenetclustersinastarnetworkwith

arbitrarynumber

of

types

Netcluster,heterogeneous

informationsubnetworks

comprisedof

multiple

typesofobjects

GowellbeyondDBLP,and

structuredrelational

DBs

Flickr:queryRaleigh

derivesmultipleclusters

iNextCubeiNextCube: Information Network: Information Network Enhanced TexEnhanced Text


42/176

45

iNextCubeiNextCube:Information

Network:

Information

Network

Enhanced

TexEnhanced

Text

Cube(VLDBCube(VLDB09Demo)09Demo)

ArchitectureofiNextCube

DimensionhierarchiesgeneratedbyNetClusDimensionhierarchiesgeneratedbyNetClus

Author/confere

termrankingfo

researcharea.

T

researchareasc

atdifferentleve

All

DBand

IS Theory Architecture

DB DM IR

XML DistributedDB

NetclusterHierarchy

Demo: iNextCube.cs.uiuc.edu

Clustering and Ranking in InformationClustering and Ranking in Information


43/176

46

Clusteringand

Ranking

in

Information

Clustering

and

Ranking

in

Information


Clustering

and

Ranking

of

Heterogeneous

InformationNetworks

Clusteringof

Homogeneous

Information

Networks


SCAN:

Density

based

clustering

of

networks Others

Spectralclustering

Modularitybased

clustering


UserGuided

Clustering

of

Information

Networks


44/176

47

LinkLinkBasedClustering:WhyUseful?BasedClustering:WhyUseful?

Questions:

Q1:Howtoclustereachtypeofobjects?

Q2:Howtodefinesimilaritybetweeneachtypeofobjects?

Tom sigmod03

Mike

Cathy

John

sigmod04sigmod05

vldb03

vldb04

vldb05

sigmod

vldb

Mary

aaai04

aaai05 aaai

Authors Proceedings Conferences


45/176

48

SimRankSimRank:Link:LinkBasedSimilaritiesBasedSimilarities

Twoobjects

are

similar

iflinked

with

the

same

or

similar

objects

Tom sigmod03sigmod04

sigmod05

sigmod

Tom

Mike

Cathy

John

sigmod03

sigmod04

sigmod05

vldb03

vldb04

vldb05

sigmod

vldb

Jeh

&Widom,2002

SimRank

Similarity

between

two

objects

a and

b,

S(a,b)=theaveragesimilaritybetween

objectslinkedwitha

andthosewithb:

whereI(v)isthesetofinneighborsof

thevertexv.

But:Itisexpensivetocompute:

ForadatasetofN

objectsandM links,

ittakesO(N2)spaceandO(M2)timeto

computeall

similarities.

Mary


46/176

49

Observation1:HierarchicalStructuresObservation1:HierarchicalStructures

Hierarchicalstructures

often

exist

naturally

among

objects

(e.g.,

taxonomyofanimals)

All

electronicsgrocery apparel

DVD cameraTV

A

hierarchical

structure

of

productsinWalmart

A

rticles

Words

Relationshipsbetweenarticlesand

words(Chakrabarti,

Papadimitriou,

Modha,Faloutsos,2004)


47/176

50

Observation2:DistributionofSimilarityObservation2:DistributionofSimilarity

Powerlawdistributionexistsinsimilarities

56%of

similarity

entries

are

in

[0.005,

0.015]

1.4%ofsimilarityentriesarelargerthan0.1

Ourgoal:Designadatastructurethatstoresthesignificantsimilaritiesand

compressesinsignificantones

0

0.1

0.2

0.3

0.4

0

0.0

2

0.0

4

0.0

6

0.0

8

0.1

0.1

2

0.1

4

0.1

6

0.1

8

0.2

0.2

2

0.2

4

similarity value

portionofentries Distribution ofSimRanksimilarities

among DBLP authors


48/176

51

OurDataStructure:OurDataStructure:SimTreeSimTree

simp(n7,n8) = s(n7,n4) x s(n4,n5) x s(n5,n8)

Path-based node similarity

Similarity between two nodes is the average similarity between objectslinked with them in other SimTrees

Adjustment ratio for x =

n1

n2

n4 n5 n6

n3

0.9 1.0

0.90.8

0.2

n7 n9

0.3

n8

0.8

0.9

Similarity between twosibling nodes n1

and n2

Adjustment ratiofor node n7

Average similarity between xand all other nodes

Average similarity betweenxs parent and allother nodes

Each leaf noderepresents an object

Each non-leaf noderepresents a group ofsimilar lower-level node

Li kClLi kCl Si TSi T B d Hi hi l Cl t iB d Hi hi l Cl t i


49/176

52

LinkClusLinkClus::SimTreeSimTreeBasedHierarchicalClusteringBasedHierarchicalClustering

InitializeaSimTree forobjectsofeachtype

Repeat

ForeachSimTree,updatethesimilaritiesbetweenitsnodes

usingsimilaritiesinotherSimTrees

Similaritybetween

two

nodes

a and

bis

the

average

similaritybetweenobjectslinkedwiththem

AdjustthestructureofeachSimTree

Assigneachnodetotheparentnodethatitismost

similarto

I iti li ti fInitialization of Si TSimTrees


50/176

53

InitializationofInitializationofSimTreesSimTrees

Findingtightgroups Frequentpatternmining

Initializingatree:

Startfromleafnodes(level0)

Ateachlevell,findnonoverlappinggroupsofsimilarnodes

withfrequentpatternmining

Reduced to

g1

g2

{n

1}{n1, n2}

{n2}{n1, n2}

{n1, n2}{n2, n3, n4}

{n4}

{n3, n4}

{n3, n4}

Transactions

n1 123

4

56

7

8

9

n2

n3

n4

Thetightnessofagroupofnodesisthesupportofa

frequentpattern

C l itComplexity: Li kClLinkClus vs Si R kSimRank


51/176

54

Complexity:Complexity:LinkClusLinkClus vs.vs.SimRankSimRank Afterinitialization,iteratively(1)foreachSimTree updatethe

similaritiesbetweenitsnodesusingsimilaritiesinotherSimTrees,

and(2)AdjustthestructureofeachSimTree

Computationalcomplexity:

Fortwotypesofobjects,N ineach,andM linkagesbetween

them

Time Space

Updating similarities O(M(logN)2) O(M+N)

Adjusting tree structures O(N) O(N)

LinkClus O (M(logN)2) O(M+N)

SimRank O (M2) O(N2)

P f C i E i t S tPerformance Comparison: Experiment Setup


52/176

55

PerformanceComparison:ExperimentSetupPerformanceComparison:ExperimentSetup

DBLP dataset: 4170 most productive authors, and 154 well-knownconferences with most proceedings

Manually labeled research areas of 400 most productive authorsaccording to their home pages (or publications)

Manually labeled areas of 154 conferences according to their call forpapers

Approaches Compared:

SimRank (Jeh & Widom, KDD 2002) Computing pair-wise similarities

SimRank with FingerPrints (F-SimRank)

Fogaras & Racz, WWW 2005 pre-computes a large sample of random paths from each object

and uses samples of two objects to estimate SimRank similarity

ReCom (Wang et al. SIGIR 2003)

Iteratively clustering objects using cluster labels of linked objects

DBLP Data Set Accuracy and Computation TimeDBLP Data Set: Accuracy and Computation Time


53/176

56

0.8

0.85

0.9

0.95

1

1 3 5 7 9 11 13 15 17 19

#iteration

accuracy

LinkClus

SimRank

ReCom

F-SimRank

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

1 3 5 7 9 11 13 15 17 19

#iteration

accuracy

LinkClus

SimRank

ReCom

F-SimRank

DBLPDataSet:AccuracyandComputationTimeDBLPDataSet:AccuracyandComputationTime

Approaches AccrAuthor Accr

Conf average

time

LinkClus 0.957 0.723 76.7

SimRank 0.958 0.760 1020

ReCom 0.907 0.457 43.1

FSimRank 0.908 0.583 83.6

Authors Conferences

l dil A d i


54/176

57

EmailDataset:AccuracyandTimeEmailDataset:AccuracyandTime

F.Nielsen.

Email

dataset.

http://www.imm.dtu.dk/rem/data/Email1431.zip

370

emails

on

conferences,

272

on

jobs,

and

789

spam

emails WhyisLinkClus evenbetterthanSimRank inaccuracy?

Noisefilteringduetofrequentpatternbasedpreprocessing

Approach Accuracy Totaltime(sec)

LinkClus 0.8026 1579.6

SimRank 0.7965 39160ReCom 0.5711 74.6

FSimRank 0.3688 479.7

CLARANS

0.4768 8.55



55/176

58

Clusteringand

Ranking

in

Information

Clustering

and

Ranking

in

Information


Clustering

and

Ranking

of

Heterogeneous

InformationNetworks

Clusteringof

Homogeneous

Information

Networks


SCAN:Densitybasedclusteringofnetworks

Others

Spectralclustering

Modularitybased

clustering


UserGuided

Clustering

of

Information

Networks

SCAN: DensitySCAN: Density Based Network ClusteringBased Network Clustering


56/176

59

SCAN:DensitySCAN:DensityBasedNetworkClusteringBasedNetworkClustering

Networksmade

up

of

the

mutual

relationships

of

data

elements

usually

haveanunderlyingstructure. Clustering: Astructurediscovery

problem

Givensimply

information

of

who

associates

with

whom,

could

one

identifyclustersofindividualswithcommoninterestsorspecial

relationships(families,cliques,terroristcells)?

Questionsto

be

answered:

How

many

clusters?

What

size

should

they

be?Whatisthebestpartitioning?Shouldsomepointsbesegregated?

Scan:Aninterestingdensitybasedalgorithm

X.Xu,N.Yuruk,Z.Feng,andT.A.J.Schweiger,SCAN:AStructural

ClusteringAlgorithmforNetworks,Proc.2007ACMSIGKDDInt.

Conf.KnowledgeDiscoveryinDatabases(KDD'07),SanJose,CA,

Aug.2007

Social Network and Its Clustering ProblemSocial Network and Its Clustering Problem


57/176

60

SocialNetworkandItsClusteringProblemSocialNetworkandItsClusteringProblem

Individualsin

atight

social

group,

or

clique,knowmanyofthesamepeople,

regardlessofthesizeofthegroup.

Individualswhoarehubs knowmany

peopleindifferentgroupsbutbelongto

nosinglegroup. Politicians,forexample

bridgemultiplegroups.

Individualswhoareoutliers resideat

the

margins

of

society.

Hermits,

for

example,knowfewpeopleandbelong

tonogroup.

Structure SimilarityStructure Similarity


58/176

61

StructureSimilarityStructureSimilarity

Define(v)

as

immediate

neighbor

of

avertex

v.

Thedesiredfeaturestendtobecapturedbyameasure(u,v)

as

Structural

Similarity

Structuralsimilarityislargeformembersofacliqueandsmall for

hubsandoutliers.

|)(||)(|

|)()(|

),( wv

wv

wv

=

I

v

Structural Connectivity [1]Structural Connectivity [1]


59/176

62

StructuralConnectivity[1]StructuralConnectivity[1]

Neighborhood:

Core:

Directstructure

reachable:

Structurereachable:transitiveclosureofdirectstructure

reachability

Structureconnected:

}),(|)({)( = wvvwvN

|)(|)(, vNvCORE

)()(),( ,, vNwvCOREwvDirRECH

),(),(:),( ,,, wuRECHvuRECHVuwvCONNECT

[1] M. Ester, H. P. Kriegel, J. Sander, & X. Xu (KDD'96) A Density-

Based Algorithm for Discovering Clusters in Large Spatial Databases

StructureStructure Connected ClustersConnected Clusters


60/176

63

StructureStructureConnectedClustersConnectedClusters

Structureconnected

cluster

C

Connectivity:

Maximality:

Hubs:

Notbelongtoanycluster

Bridgetomanyclusters

Outliers:

Notbelong

to

any

cluster

Connecttolessclusters

),(:, , wvCONNECTCwv

CwwvREACHCvVwv ),(:, ,

hub

outlier

AlgorithmAlgorithm


61/176

64

13

9

10

11

7

8

12

6

4

0

15

2

3

AlgorithmAlgorithm

= 2= 0.7

0.75

0.67

0.82

AlgorithmAlgorithm


62/176

65

13

9

10

11

7

8

12

6

4

0

15

2

3

AlgorithmAlgorithm

= 2= 0.7 0.51

0.51

0.68

Running TimeRunning Time


63/176

66

RunningTimeRunningTime

Runningtime:

O(|E|)

Forsparsenetworks:O(|V|)

[2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E70, 066111 (2004).



64/176

67

g

g

g

g


Clustering

and

Ranking

of

Heterogeneous

InformationNetworks

Clusteringof

Homogeneous

Information

Networks



Others

Spectralclustering

Modularitybased

clustering


UserGuided

Clustering

of

Information

Networks

Spectral ClusteringSpectral Clustering


65/176

68

SpectralClusteringSpectralClustering

Spectralclustering:

Find

the

best

cut

that

partitions

the

network

Differentcriteriatodecidebest

Mincut,ratiocut,normalizedcut,MinMaxcut

UsingMincutasanexample[Wuetal.1993]

Assigneach

node

ian

indicator

variable

Represent

the

cut

size

using

indicator

vector

and

adjacency

matrix Cutsize =

Minimizetheobjectivefunctionthroughsolvingeigenvalue system

Relaxthediscretevalueofqtocontinuousvalue

Mapcontinuousvalueofqintodiscreteonestogetclusterlabels

Usesecondsmallesteigenvectorforq

ModularityModularityBased ClusteringBased Clustering


66/176

69

ModularityModularityBasedClusteringBasedClustering

Modularitybased

clustering

Findthebestclusteringthatmaximizesthemodularityfunction

Qfunction[Newmanetal.,2004]

Leteij be

ahalf

of

the

fraction

of

edges

between

group

iand

group

j

eii isthefractionofedgeswithingroupi

Letai bethefractionofallendsofedgesattachedtoverticesingroupI

Qisthendefinedassumofdifferencebetweenwithingroupedgesand

expectedwithingroupedges

MinimizeQ

Onepossiblesolution:hierarchicallymergeclustersresultingingreatest

increaseinQfunction[Newmanetal.,2004]

Probabilistic ModelProbabilistic ModelBased ClusteringBased Clustering


67/176

70

ProbabilisticModelProbabilisticModelBasedClusteringBasedClustering

Probabilisticmodel

based

clustering

Buildgenerativemodelsforlinksbasedonhiddenclusterlabels

Maximizetheloglikelihoodofallthelinkstoderivethehiddencluster

membership

Anexample:Airoldi etal.,Mixedmembershipstochasticblockmodels,2008

DefineagroupinteractionprobabilitymatrixB(KXK)

B(g,h)denotestheprobabilityoflinkgenerationbetweengroupg

andgroup

h

Generativemodelforalink

Foreachnode,drawamembershipprobabilityvectorforma

Dirichlet prior Foreachpaperofnodes,drawclusterlabelsaccordingtotheir

membershipprobability(supposinggandh),thendecidewhetherto

havealinkaccordingtoprobabilityB(g,h)

Derivethe

hidden

cluster

label

by

maximize

the

likelihood

given B

and

theprior

Clustering and Ranking in InformationClustering and Ranking in Informationk


68/176

71

g

g

g

g


Clustering

and

Ranking

of

Heterogeneous

InformationNetworks

Clustering

of

Homogeneous

Information

Networks LinkClus:Clusteringwithlinkbasedsimilaritymeasure


Others

Spectralclustering

Modularitybased

clustering


UserGuided

Clustering

of

Information

Networks

UserUserGuided Clustering in DBGuided Clustering in DB InfoNetInfoNet


69/176

72

UserUser GuidedClusteringinDBGuidedClusteringinDBInfoNetInfoNet

Course

name

office

position

Professor

course-id

name

area

course

semester

instructor

officeposition

Student

name

student

course

semester

unit

Register

grade

professor

student

degree

Advise

name

Group

person

group

Work-In

area

year

conf

Publication

title

title

Publish

author

Target ofclustering

User hint

Open-course

Userusuallyhasagoalofclustering,e.g.,clusteringstudents byresearcharea

Userspecifies

his

clustering

goal

to

aDB

InfoNet cluster:

CrossClus

Classification vs UserClassification vs UserGuided ClusteringGuided Clustering


70/176

73

Classificationvs.UserClassificationvs.UserGuidedClusteringGuidedClustering

Userspecifiedfeature(inthe

formofattribute)isusedasa

hint,

not

class

labels Theattributemaycontaintoo

manyortoofewdistinct

values

E.g.,ausermaywantto

clusterstudentsinto20

clustersinsteadof3

Additionalfeaturesneedtobe

includedin

cluster

analysis

All tuples for clustering

User hint

UserUserGuided Clustering vs. SemiGuided Clustering vs. Semisupervised Clusterinsupervised Clusterin


71/176

74

UserUser GuidedClusteringvs.SemiGuidedClusteringvs.Semi supervisedClusterinsupervisedClusterin

Semisupervised

clustering

[Wagstaff,

et

al 01,

Xing,

et

al.02]

Userprovidesatrainingsetconsistingofsimilar anddissimilar

pairsofobjects

Userguidedclustering

Userspecifiesanattributeasahint,andmorerelevantfeaturesare

foundforclustering


Semi-supervised clustering


User-guided clustering

x

Why Not Typical SemiWhy Not Typical SemiSupervised Clustering?Supervised Clustering?


72/176

75

WhyNotTypicalSemiWhyNotTypicalSemi SupervisedClustering?SupervisedClustering?

Whynot

do

typical

semi

supervise

clustering?

Muchinformation(inmultiplerelations)isneededtojudgewhethertwo

tuples aresimilar

Ausermaynotbeabletoprovideagoodtrainingset

Itismucheasierforausertospecifyanattributeasahint,suchasastudents

researcharea

Tom Smith SC1211 TA

Jane Chang BI205 RA

Tuples to be compared

User hint

CrossClusCrossClus: An Overview: An Overview


73/176

76

CrossClusCrossClus:AnOverview:AnOverview

CrossClus:Framework

Searchforgoodmultirelationalfeaturesforclustering

Measuresimilaritybetweenfeaturesbasedonhowthey

clusterobjectsintogroups

Userguidance+heuristic searchforfindingpertinent

features Clusteringbasedonakmedoidsbased algorithm

CrossClus:Majoradvantages

Userguidance,

even

in

avery

simple

form,

plays

an

importantroleinmultirelationalclustering

CrossClus findspertinentfeatures bycomputingsimilarities

betweenfeatures

Selection of MultiSelection of MultiRelational FeaturesRelational Features


74/176

77

SelectionofMultiSelectionofMultiRelationalFeaturesRelationalFeatures

Amulti

relational

feature

is

defined

by:

Ajoinpath.E.g.,Student RegisterOpenCourse Course

Anattribute.E.g.,Course.area

(Fornumericalfeature)anaggregationoperator.E.g.,sumoraverage

CategoricalFeaturef= [StudentRegisterOpenCourse Course,Course.area,null]

Tuple Areas of coursesDB AI TH

t1 5 5 0

t2 0 3 7

t3 1 5 4

t4 5 0 5

t5 3 3 4

areas of courses of each student

Tuple Feature fDB AI TH

t1 0.5 0.5 0

t2 0 0.3 0.7

t3 0.1 0.5 0.4

t4 0.5 0 0.5

t5 0.3 0.3 0.4

Values of featureff(t1

)

f(t2

)

f(t3

)

f(t4

)

f(t5

)

DB

AI

TH

Numerical Feature, e.g., average grades of students

h = [StudentRegister, Register.grade, average]

E.g. h(t1) = 3.5

Similarity Between FeaturesSimilarity Between Features


75/176

78

SimilarityBetweenFeaturesSimilarityBetweenFeatures

Featuref(course) Feature g (group)

DB AI TH Info sys Cog sci Theory

t1 0.5 0.5 0 1 0 0

t2 0 0.3 0.7 0 0 1

t3 0.1 0.5 0.4 0 0.5 0.5

t4 0.5 0 0.5 0.5 0 0.5

t5 0.3 0.3 0.4 0.5 0.5 0

Values of Featurefand g

1

2

3

4

5

S1

S2

S3

S4

S5

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.5-0.6

0.4-0.5

0.3-0.4

0.2-0.3

0.1-0.2

0-0.1

1

2

3

4

5

S1

S2

S3

S4

S5

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0.5-0.6

0.4-0.5

0.3-0.4

0.2-0.3

0.1-0.2

0-0.1

Similarity between two features

cosine similarity of two vectors

Vf

Vg

( )gf

gf

VV

VVgfSim

=,

SimilaritybetweenCategorical&NumericalFeaturesSimilaritybetweenCategorical&NumericalFeatures


76/176

79

y gy g

( ) ( ) = 0)

f

shouldbe

similar

to

the

given

ground

truth

G et e et odo ogy gy

GNetMineGNetMine:Graph:GraphBasedRegularizationBasedRegularization


89/176

92

( ) ( )

1

( ) ( ) 2,

, 1 1 1 , ,

( ) ( ) ( ) ( )

1

( ,..., )

1 1( )

( ) ( )

ji

k k

m

nnm

k kij ij pq ip jq

i j p q ij pp ji qq

m

k k T k k i i i i i

i

J

R f fD D

= = =

=

=

+

y y

f f

f f

pp gg

Minimizethe

objective

function

Smoothnessconstraints:objectslinkedtogethershouldsharesimilar

estimations

of

confidence

belonging

to

class

k

Normalizationtermappliedtoeachtypeoflinkseparately:reducetheimpactofpopularityofnodes

Confidence

estimation

on

labeled

data

and

their

pre

given

labelsshouldbesimilar

Userpreference:howmuchdoyouvaluethisrelationship/groundtruth?


90/176

ClassificationAccuracy:

Labeling

aVery

Small

Portion

Classification

Accuracy:

Labeling

aVery

Small

Portion

of Authors and Papersof Authors and Papers


91/176

94

ofAuthorsandPapersofAuthorsandPapers

(a%,p%)

nLB wvRN LLGC GNetMine

AA A

CPT A

A A

CPT A

A A

CP

T A

CPT

(0.1%,0.1%) 25.4 26.0 40.8 34.1 41.4 61.3 82.9

(0.2%,0.2%) 28.3 26.0 46.0 41.2 44.7 62.2 83.4

(0.3%,0.3%) 28.4 27.4 48.6 42.5 48.8 65.7 86.7

(0.4%,0.4%) 30.7 26.7 46.3 45.6 48.7 66.0 87.2

(0.5%,0.5%) 29.8 27.3 49.0 51.4 50.6 68.9 87.5

Comparisonofclassificationaccuracyonauthors(%)

(a%,p%)nLB wvRN LLGC GNetMine

PP ACPT PP ACPT PP ACPT ACPT

(0.1%,0.1%) 49.8 31.5 62.0 42.0 67.2 62.7 79.2

(0.2%,0.2%) 73.1 40.3 71.7 49.7 72.8 65.5 83.5

(0.3%,0.3%) 77.9 35.4 77.9 54.3 76.8 66.6 83.2

(0.4%,0.4%) 79.1 38.6 78.1 54.4 77.9 70.5 83.7

(0.5%,0.5%) 80.7 39.3 77.9 53.5 79.0 73.5 84.1

Comparisonofclassificationaccuracyonpapers(%)

(a%,p%)nLB wvRN LLGC GNetMine

ACPT ACPT ACPT ACPT

(0.1%,0.1%) 25.5 43.5 79.0 81.0

(0.2%,0.2%) 22.5 56.0 83.5 85.0

(0.3%,0.3%) 25.0 59.0 87.0 87.0

(0.4%,0.4%) 25.0 57.0 86.5 89.5

(0.5%,0.5%) 25.0 68.0 90.0 94.0

Comparisonofclassificationaccuracyonconferences(%)

KnowledgePropagation:

List

Objects

with

the

Knowledge

Propagation:

List

Objects

with

the

Highest Confidence Measure Belonging to Each ClassHighest Confidence Measure Belonging to Each Class


92/176

95

HighestConfidenceMeasureBelongingtoEachClassHighestConfidenceMeasureBelongingtoEachClass

No. Database DataMining Artificial

Intelligence Information

Retrieval

1 data mining learning retrieval

2 database data knowledge information

3 query clustering Reinforcement web

4 system learning reasoning search

5 xml classification model document

Top5termsrelatedtoeacharea

No. Database DataMining ArtificialIntelligence InformationRetrieval

1 Surajit

Chaudhuri Jiawei

Han SridharMahadevan W.BruceCroft

2 H.V.

Jagadish Philip

S.

Yu Takeo

Kanade Iadh

Ounis

3 MichaelJ.Carey ChristosFaloutsos AndrewW.Moore MarkSanderson

4 MichaelStonebraker WeiWang Satinder

P.Singh ChengXiang

Zhai

5 C.Mohan Shusaku

Tsumoto ThomasS.Huang GerardSalton

Top5authorsconcentratedineacharea

No. Database DataMining Artificial

Intelligence Information

Retrieval

1 VLDB KDD IJCAI SIGIR

2 SIGMOD SDM AAAI ECIR

3 PODS PAKDD CVPR WWW

4 ICDE ICDM ICML WSDM

5 EDBT PKDD ECML CIKMTop5conferencesconcentratedineacharea

ClassificationofInformationNetworksClassificationofInformationNetworks


93/176

96

ClassificationofHeterogeneousInformationNetworks:

Graph

regularization

Based

Method

(GNetMine)

MultiRelationalMiningBasedMethod(CrossMine)

StatisticalRelational

Learning

Based

Method

(SRL)

ClassificationofHomogeneousInformationNetworks

MultiMultiRelation to Flat Relation Mining?RelationtoFlatRelationMining?


94/176

97

Multi RelationtoFlatRelationMining?g

Foldingmultiple

relations

into

asingle

flat one

for

mining?

Cannotbeasolutionduetoproblems:

Loseinformation

of

linkages

and

relationships,

no

semantics

preservation

Cannotutilizeinformationofdatabasestructuresorschemas

(e.g.,ER

modeling)

Patientflatten

ContactDoctor

OneApproach:InductiveLogicProgramming(ILP)OneApproach:InductiveLogicProgramming(ILP)


95/176

98

Findahypothesis

that

is

consistent

with

background

knowledge(trainingdata)

FOIL,Golem,Progol,TILDE,

Backgroundknowledge

Relations(predicates),Tuples (groundfacts)

InductiveLogicProgramming(ILP)

Hypothesis:

Thehypothesis

is

usually

aset

of

rules,

which

canpredictcertainattributesincertainrelations

Daughter(X,Y) female(X),parent(Y,X)

Daughter(mary, ann) +

Daughter(eve, tom) +

Daughter(tom, ann)

Daughter(eve, ann)

Training examples

Parent(ann, mary)

Parent(ann, tom)

Parent(tom, eve)

Parent(tom, ian)

Background knowledge

Female(ann)

Female(mary)

Female(eve)

InductiveLogic

Programming

Approach

to

MultiInductive

Logic

Programming

Approach

to

Multi

Relation ClassificationRelation Classification


96/176

99

RelationClassificationRelationClassification ILP

Approached

to

Multi

Relation

Classification

TopdownApproaches(e.g.,FOIL)

while(enough

examplesleft)

generatearule

removeexamplessatisfyingthisrule

BottomupApproaches(e.g.,Golem)

Useeach

example

as

arule

Generalizerulesbymergingrules

DecisionTreeApproaches(e.g.,TILDE)

ILPApproach:ProsandCons

Advantages:Expressiveandpowerful,andrulesareunderstandable

Disadvantages:Inefficientfordatabaseswithcomplexschemas,and

inappropriateforcontinuousattributes

FOIL:FirstFOIL:FirstOrderInductiveLearner(RuleGeneration)OrderInductiveLearner(RuleGeneration)


97/176

100

Findaset

of

rules

consistent

with

training

data

Atopdown,sequentialcoveringlearner

Buildeachrulebyheuristics

Foilgain

a

special

type

of

information

gain

Examplescovered

byRule1

Allexamples

Examplescovered

byRule2Examples

covered

byRule3

Positive

examples

Negative

examples

A3=1

A3=1&&A1=2A3=1&&A1=2

&&A8=5

Togenerate

arule

while(true)

findthebestpredicatep

if

foilgain(p)>thresholdthen

addp

tocurrentrule

else

break

FindtheBestPredicate:PredicateEvaluationFindtheBestPredicate:PredicateEvaluation


98/176

101

Allpredicates

in

arelation

can

be

evaluated

based

on

propagatedIDs

Usefoilgain toevaluatepredicates

Supposecurrent

rule

is

r.

For

apredicate

p,

foilgain(p)=

CategoricalAttributes

Computefoil

gain

directly

NumericalAttributes

Discretize witheverypossiblevalue

( )( )

( ) ( )

( )

( ) ( )

+++

++

++

prNprP

prP

rNrP

rPprP loglog


99/176

CrossMineCrossMine:AnEffectiveMulti:AnEffectiveMultirelationalClassifierrelationalClassifier


100/176

103

Methodology

TupleIDpropagation:anefficientandflexible

methodfor

virtually

joining

relations

Confinetherulesearchprocessinpromising

directions

Lookoneahead:amorepowerfulsearchstrategy

Negativetuple sampling:improveefficiencywhile

maintainingaccuracy

TupleTuple IDPropagationIDPropagation


101/176

104

Loan ID Account ID Amount Duration Decision

1 124 1000 12 Yes

2 124 4000 12 Yes

3 108 10000 24 No

4 45 12000 36 No

0+, 0

0+, 1

0+, 1

2+, 0

Labels

1, 202/27/93monthly124

Null01/01/97weekly67

412/09/96monthly45

309/23/97weekly108

Propagated IDOpen dateFrequencyAccount ID

Applicant #1

Applicant #2

Applicant #3

Applicant #4

Propagatetuple IDsoftargetrelationtonontargetrelations

Virtuallyjoin

relations

to

avoid

the

high

cost

of

physical

joins

Possiblepredicates:

Frequency=monthly:2+,

1

Opendate


102/176

105

TargetRelation R1 R2

R3

Efficient Onlypropagatethetuple IDs

Timeandspaceusageislow

Flexible

Canpropagate

IDs

among

non

target

relations

ManysetsofIDscanbekeptononerelation,whicharepropagatedfromdifferentjoinpaths

RuleGeneration:ExampleRuleGeneration:Example


103/176

106

districtid

frequency

date

Account

accountid

accountid

date

amount

duration

Loan

loanid

payment

accountid

bankto

accountto

amount

Order

orderid

type

dispid

type

issuedate

Card

cardid

accountid

clientid

Disposition

dispid

birthdate

gender

districtid

Client

clientid

distname

region

#people

#lt500

District

districtid

#lt2000

#lt10000

#gt10000

#city

ratiourban

avgsalary

unemploy95

unemploy96

denenter

#crime95

#crime96

accountid

date

type

operation

Transaction

transid

amount

balance

symbol

Targetrelation

Firstpredicate

Second

predicate

RuleGeneration

Startatthetargetrelation

Repeat

Searchinallactiverelations

Searchin

all

relations

joinabletoactiverelations

Addthebestpredicatetothecurrentrule

Setthe

involved

relation

toactive

Until

Thebestpredicatedoesnothaveenoughgain

Currentruleistoolong

LookLookoneoneaheadinRuleGenerationaheadinRuleGeneration


104/176

107

Twotypesofrelations:EntityandRelationship

Oftencannotfindusefulpredicatesonrelationsofrelationship

Solutionof

CrossMine:

WhenpropagatingIDstoarelationofrelationship,

propagateonemoresteptonextrelationofentity

Target

Relation

Nogoodpredicate

NegativeNegativeTupleTuple SamplingSampling


105/176

108

Each

time

arule

is

generated,

covered

positive

examples

are

removed

Aftergeneratingmanyrules,therearemuchlesspositiveexamplesthannegativeones

Cannotbuildgoodrules(lowsupport)

Stilltimeconsuming(largenumberofnegativeexamples)

Solution:Samplingonnegativeexamples

Improveefficiencywithoutaffectingrulequality

+

++

++++ ++

+

+

+

++

+

+ +

+ +

++

RealDatasetRealDataset


106/176

109

PKDDCup99dataset LoanApplication

Accuracy Time (per fold)

FOIL 74.0% 3338 sec

TILDE 81.3% 2429 sec

CrossMine 90.7% 15.3 sec

Accuracy Time (per fold)FOIL 79.7% 1.65 sec

TILDE 89.4% 25.6 sec

CrossMine 87.7% 0.83 sec

Mutagenesisdataset(4relations):Only4relations,soTILDE

doesagoodjob,thoughslow



107/176

110


GraphregularizationBasedMethod(GNetMine)



Learning

Based

Method

(SRL)


ProbabilisticRelational

Models

in

Probabilistic

Relational

Models

in

Statistical Relational LearningStatisticalRelationalLearning


108/176

111

StatisticalRelationalLearningg

Goal:model

distribution

of

data

in

relational

databases

Treatbothentitiesandrelationsasclasses

Intuition:objectsarenolongerindependentwitheachother

Buildstatistical

networks

according

to

the

dependency

relationshipsbetweenattributesofdifferentclasses

AProbabilisticRelationalModels (PRM)consistsof

Relational

schema

(from

databases) Dependencystructure(betweenattributes)

Localprobabilitymodel(conditionalprobabilitydistribution)

Three

major

methods

of

probabilistic

relational

models RelationalBayesianNetwork(RBN,Lise Getoor etal.)

RelationalMarkovNetworks(RMN,BenTaskar etal.)

RelationalDependency

Networks

(RDN,

Jennifer

Neville

et

al.)

RelationalBayesianNetworks(RBN)RelationalBayesianNetworks(RBN)


109/176

112

ExtendBayesian

network

to

consider

entities,

properties

and

relationshipsinaDBscenario

Threedifferentuncertainties

Attributeuncertainty

Modelconditionalprobabilityforanattributegivenits

parentvariables

Structuraluncertainty

Modelconditionalprobabilityforareferenceorlink

existencegivenitsparentvariables

Classuncertainty

Refinetheconditionalprobabilitybyconsideringsub

classesorhierarchyofclasses


110/176



111/176

114


GraphregularizationBasedMethod(GNetMine)



Learning

Based

Method

(SRL)


TransductiveTransductive LearningintheGraphLearningintheGraph


112/176

115

Problem:

for

aset

of

nodes

in

the

graph,

the

class

labels

are

given

for

partial

of

thenodes,thetaskistolearnthelabelsoftheunlabelednodes

Methods

Labelpropagation

algorithm

[Zhu

et

al.

2002,

Zhou

et

al.

2004,

Szummer et

al.2001]

Iterativelypropagatelabelstoitsneighbors,accordingtothe

transitionprobabilitydefinedbythenetwork

Graphregularizationbasedalgorithm[Zhouetal.2004]

Intuition:tradeoffbetween(1)consistencywiththelabelingdataand

(2)consistencybetweenlinkedobjects

Anquadratic

optimization

problem

OutlineOutline


113/176

116


Heterogeneous

Information

Networks?


ClusteringandRankinginInformationNetworks



DataCleaning and

Data

Validation

by

InfoNet Analysis



RoleDiscovery

and

OLAP

in

Information

Networks


Conclusions

DataCleaningbyLinkAnalysisDataCleaningbyLinkAnalysis


114/176

117

Objectreconciliation

vs.

object

distinction

as

data

cleaning

tasks

Linkanalysismaytakeadvantagesofredundancyandmake

facilitateentitycrosscheckingandvalidation

Objectdistinction:

Different

people/objects

do

share

names

InAllMusic.com,72songsand3albumsnamedForgotten or

TheForgotten

InDBLP,

141

papers

are

written

by

at

least

14

Wei

Wang

Newchallengesofobjectdistinction:

Textualsimilaritycannotbeused

Distinct:Objectdistinctionbyinformationnetworkanalysis

X.Yin,J.Han,andP.S.Yu,ObjectDistinction:Distinguishing

Objects

with

Identical

Names

by

Link

Analysis,

ICDE'07


115/176


116/176

TrainingwiththeTrainingwiththeSameSame DataSetDataSet


117/176

120

Build

atraining

set

automatically

Selectdistinctnames,e.g.,JohannesGehrke

Thecollaborationbehaviorwithinthesamecommunityshare

somesimilarity

Trainingparametersusingatypicalandlargesetof

unambiguous examples

UseSVMtolearnamodelforcombiningdifferentjoinpaths

Eachjoinpathisusedastwoattributes(withlinkbased

similarityand

neighborhood

similarity)

Themodelisaweightedsumofallattributes


118/176

RealCases:DBLPPopularNamesRealCases:DBLPPopularNames


119/176

122

Nam e Num _aut hor s Num _r ef s accur acy pr ecision r ecall f - m easur e

Hui

Fang 3 9 1.0 1.0 1.0 1.0

Ajay Gupta 4 16 1.0 1.0 1.0 1.0

Joseph Hellerstein 2 151 0.81 1.0 0.81 0.895

Rakesh

Kumar 2 36 1.0 1.0 1.0 1.0

Michael Wagner 5 29 0.395 1.0 0.395 0.566

Bing Liu 6 89 0.825 1.0 0.825 0.904

Jim Smith 3 19 0.829 0.888 0.926 0.906

Lei Wang 13 55 0.863 0.92 0.932 0.926

Wei Wang 14 141 0.716 0.855 0.814 0.834

Bin Yu 5 44 0.658 1.0 0.658 0.794

Average 0.81 0.966 0.836 0.883

DistinguishingDifferentDistinguishingDifferentWeiWeiWangWangss


120/176

123

UNC-CH(57)

Fudan U, China(31)

UNSW, Australia(19)

SUNYBuffalo

(5)

BeijingPolytech

(3)

NUSingapore

(5)

Harbin UChina

(5)

Zhejiang UChina

(3)

Najing NormalChina

(3)

Ningbo TechChina

(2)

Purdue(2)

Beijing U ComChina

(2)

Chongqing UChina

(2)

SUNYBinghamton

(2)

5

6

2


121/176

TruthValidationbyInfo.NetworkAnalysisTruthValidationbyInfo.NetworkAnalysis


122/176

125

Thetrustworthinessproblemoftheweb(accordingtoasurvey):

54%ofInternetuserstrustnewswebsitesmostoftime

26%forwebsitesthatsellproducts

12%for

blogs

TruthFinder:TruthdiscoveryontheWebbylinkanalysis

Amongmultipleconflictresults,canweautomaticallyidentifywhichone

islikely

the

true

fact?

Veracity(conformitytotruth):

Givenalargeamountofconflictinginformationaboutmanyobjects,

providedby

multiple

web

sites

(or

other

information

providers), how

to

discoverthetruefactabouteachobject?

Ourwork: Xiaoxin Yin,Jiawei Han,PhilipS.Yu,TruthDiscoverywithMultiple

ConflictingInformationProvidersontheWeb,TKDE08


123/176


124/176


125/176

OverviewoftheOverviewoftheTruthFinderTruthFinder MethodMethod


126/176

129

Confidenceoffacts Trustworthinessofwebsites

Afacthashighconfidence ifitisprovidedby(many)

trustworthyweb

sites

Awebsiteistrustworthyifitprovidesmanyfactswithhigh

confidence

TheTruthFinder mechanism,anoverview:

Initially,eachwebsiteisequallytrustworthy

Basedon

the

above

four

heuristics,

infer

fact

confidence

from

websitetrustworthiness,andthenbackwards

Repeat

until

achieving

stable

state

AnalogytoAuthorityAnalogytoAuthorityHubAnalysisHubAnalysis


127/176

130

Facts Authorities, Websites Hubs

Differencefromauthorityhubanalysis

Linearsummationcannotbeused

A

web

site

is

trustable

if

it

provides

accurate

facts,

instead

of

many

facts

Confidenceistheprobabilityofbeingtrue

Different

facts

of

the

same

object

influence

each

other

w1 f1

Web sites Facts

Hubs Authorities

Hightrustworthiness

Highconfidence


128/176


129/176

Experiments:FindingTruthofFactsExperiments:FindingTruthofFacts


130/176

133

Determiningauthorsofbooks

Datasetcontains1265bookslistedonabebooks.com

We

analyze

100

random

books

(using

book

images)

Case Voting TruthFinder Barnes & Noble

Correct 71 85 64

Miss author(s) 12 2 4

Incomplete names 18 5 6

Wrong first/middle names 1 1 3

Has redundant names 0 2 23

Add incorrect names 1 5 5

No information 0 0 2


131/176

OutlineOutline


132/176

135


Heterogeneous

Information

Networks?





DataCleaning

and

Data

Validation

by

InfoNet Analysis



RoleDiscovery

and

OLAP

in

Information

Networks


Conclusions


133/176


134/176

SemanticsSemanticsBasedSimilaritySearchinBasedSimilaritySearchinInfoNetInfoNet


135/176

138

Searchtop

ksimilar

objects

of

the

same

type

for

aquery

FindresearchersmostsimilarwithChristosFaloutsos

Twocriticalconceptstodefineasimilarity

Featurespace

Traditionaldata:attributesdenotedasnumerical

value/vector,set,etc.

Networkdata:

arelation

sequence

called

path

schema

Existinghomogeneousnetworkbasedsimilaritydoes

notdealwiththisproblem

Measuredefined

on

the

feature

space

Cosine,Euclideandistance,Jaccard coefficient,etc.

PathSim

PathSchemaforDBLPQueriesPathSchemaforDBLPQueries


136/176

139

Pathschema:

A

path

of

InfoNet schema,

e.g.,

APC,

APA

WhoaremostsimilartoChristosFaloutsos?

FlickrFlickr:WhichPicturesAreMostSimilar?:WhichPicturesAreMostSimilar?


137/176

140

Somepath

schema

leads

to

similarity

closer

to

human

intuition

Butsomeothersarenot

NotAllSimilarityMeasuresAreGoodNotAllSimilarityMeasuresAreGood


138/176

141

Favor highly

visibleobjects

Notreasonable


139/176

PathSimPathSim:Definition&Properties:Definition&Properties


140/176

143

Commutingmatrix

corresponding

to

apath

schema

MP

Productofadjacencymatrixofrelationsinthepathschema

TheelementMP(i,j)denotesthestrengthbetweenobjecti

andobject

jin

the

semantic

of

path

schema

P

Iftheweightofadjacencymatrixisunweighted,itwill

denotethenumberofpathinstancesfollowingpath

schemaP

PathSim

s(i,j)=2MP(i,j)/(MP(i,i)+MP(j,j))

Properties:

CoCoClusteringClusteringBasedPruningAlgorithmBasedPruningAlgorithm


141/176

144

Storecommuting

matrices

for

short

path

schemas

&

compute

top

kqueries

on

line

Framework

Generatecoclustersformaterializedcommutingmatrices,forfeature

objectsand

target

objects

Deriveupperboundforsimilaritybetweenobjectandtargetcluster,and

betweenobjectandobject:Safelypruningtargetclustersandobjectsif

theupperboundsimilarityislowerthancurrentthreshold

Dynamicallyupdatetopkthreshold:

Performance:Baselinevs.pruning


142/176


143/176

RoleDiscovery:

Extraction

Semantic

Role

Discovery:

Extraction

Semantic

InformationfromLinksInformationfromLinks


144/176

147

Objective:Extract

semantic

meaning

from

plain

links

to

finely

modelandbetterorganizeinformationnetworks

Challenges

Latentsemantic

knowledge

Interdependency

Scalability

Opportunity

Humanintuition

Realistic

constraint

Crosscheckwithcollectiveintelligence

Methodology:propagatesimpleintuitiverulesandconstraints

over

the

whole

network

Discoveryof

AdvisorDiscovery

of

Advisor

Advisee

Advisee

RelationshipsinDBLPNetworkRelationshipsinDBLPNetwork


145/176

148

Input:DBLP

research

publication

network

Output:Potentialadvisingrelationshipanditsranking (r,[st,ed])

Ref. C.Wang,J.Han,etal., MiningAdvisorAdvisee

Relationshipsfrom

Research

Publication

Networks,

SIGKDD

2010

Smith

2000

2000

2001

2002

2003

1999

Ada Bob

Jerry

Ying

Input: Temporal

collaboration networkOutput: Relationship analysis

(0.8, [1999,2000])

(0.7,

[2000, 2001])

(0.65, [2002, 2004])

2004

Ada

Bob

Ying

Smith

(0.2,

[2001, 2003])

(0.5, [/, 2000])

(0.9, [/, 1998])

(0.4,

[/, 1998])

(0.49,

[/, 1999])

Visualized chorological hierarchies

Jerry

OverallFrameworkOverall

Framework


146/176

149

ai:authori

pj:paperj

py:paper

year

pn:paper#

sti,yi:startingtime

edi,yi:endingtime

ri,yi:rankingscore


147/176

ExperimentResultsExperimentResults


148/176

151

DBLPdata:

654,

628

authors,

1076,946

publications,

years

provided

Labeleddata:MathGealogy Project;AIGealogy Project;

Homepage

Datasets RULE SVM IndMAX TPFG

TEST1 69.9% 73.4% 75.2% 78.9% 80.2% 84.4%

TEST2 69.8% 74.6% 74.6% 79.0% 81.5% 84.3%

TEST3 80.6% 86.7% 83.1% 90.9% 88.8% 91.3%

Empirical

parameter

optimized

parameter

heuristics Supervised

learning


149/176

Graph/NetworkSummarization:

Graph

Graph/Network

Summarization:

Graph

CompressionCompression


150/176

153

Extractcommon

subgraphs and

simplify

graphs

by

condensing

thesesubgraphs intonodes

OLAPon

Information

NetworksOLAP

on

Information

Networks


151/176

154

WhyOLAP

information

networks?

AdvantagesofOLAP:Interactiveexplorationofmultidimensional

andmultilevelspaceinadatacubeInfonet

Multidimensional:

Different

perspectives

Multilevel:Differentgranularities

InfoNet OLAP:Rollup/drilldownandslice/diceoninformation

networkdata

TraditionalOLAPcannothandlethis,becausetheyignorelinks

amongdataobjects

Handlingtwo

kinds

of

InfoNet OLAP

InformationalOLAP

TopologicalOLAP

InformationalOLAPInformational

OLAP


152/176

155

Inthe

DBLP

network,

study

the

collaborationpatternsamongresearchers

Dimensionscomefrominformational

attributesattached

at

the

whole

snapshot

level,socalled InfoDims

IOLAPCharacteristics:

Overlaymultiple

pieces

of

information

Nochangeontheobjectswhose

interactionsarebeingexamined

Inthe

underlying

snapshots,

each

nodeisaresearcher

Inthesummarizedview,eachnodeis

stillaresearcher

TopologicalOLAPTopological

OLAP


153/176

156

Dimensionscome

from

the

node/edgeattributesinsideindividual

networks,socalledTopoDims

T

OLAP

Characteristics Zoomin/Zoomout

Networktopologychanged:

generalized nodesand

generalized edges

Intheunderlyingnetwork,

each

node

is

a

researcher Inthesummarizedview,each

nodebecomesaninstitutethat

comprises

multiple

researchers


154/176

OutlineOutline


155/176

158


Heterogeneous

Information

Networks?



Classificationof

Information

Networks


DataCleaning

and

Data

Validation

by

InfoNet Analysis



RoleDiscovery

and

OLAP

in

Information

Networks


Conclusions

MiningEvolutionandDynamicsofMiningEvolutionandDynamicsofInfoNetInfoNet


156/176

159

Manynetworks

are

with

time

information

E.g.,accordingtopaperpublicationyear,DBLPnetworkscan

formnetworksequences

Motivation:Model

evolution

of

communities

in

heterogeneous

network

Automaticallydetectthebestnumberofcommunitiesineach

timestamp Modelthesmoothnessbetweencommunitiesofadjacent

timestamps

Modelthe

evolution

structure

explicitly

Birth,death,split


157/176


158/176


159/176

AccuracyStudyAccuracy

Study


160/176

163

Themore

types

of

objects

used,

the

better

accuracy

Historicalpriorresultsinbetteraccuracy


161/176


162/176

OutlineOutline


163/176


Heterogeneous

Information

Networks?



Classificationof

Information

Networks


Data

Cleaning

and

Data

Validation

by

InfoNet Analysis SimilaritySearchinInformationNetworks


RoleDiscovery

and

mining heterogeneous information networks

Documents