mining heterogeneous information networks

Upload: seadragonwin

Post on 05-Apr-2018


  • 7/31/2019 Mining Heterogeneous Information Networks

    Mining Heterogeneous Information Networks

    Jiawei Han, Yizhou Sun, Xifeng Yan, Philip S. Yu

    University of Illinois at Urbana-Champaign

    University of California at Santa Barbara

    University of Illinois at Chicago

    Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), Microsoft, IBM, Yahoo!, Google, HP Lab & Boeing

    July 12, 2010

    ACM SIGKDD Conference Tutorial, Washington, D.C., July 25, 2010

  • Outline

    Motivation: Why Mining Heterogeneous Information Networks?

    Part I: Clustering, Ranking and Classification

        Clustering and Ranking in Information Networks

        Classification of Information Networks

    Part II: Data Quality and Search in Information Networks

        Data Cleaning and Data Validation by InfoNet Analysis

        Similarity Search in Information Networks

    Part III: Advanced Topics on Information Network Analysis

        Role Discovery and OLAP in Information Networks

        Mining Evolution and Dynamics of Information Networks

    Conclusions


  • What Are Information Networks?

    Information network: a network where each node represents an entity (e.g., an actor in a social network) and each link (e.g., a tie) represents a relationship between entities

        Each node/link may have attributes, labels, and weights

        A link may carry rich semantic information

    Homogeneous vs. heterogeneous networks

        Homogeneous networks

            Single object type and single link type

            Single-model social networks (e.g., friends)

            WWW: a collection of linked Web pages

        Heterogeneous, multi-typed networks

            Multiple object and link types

            Medical network: patients, doctors, diseases, contacts, treatments

            Bibliographic network: publications, authors, venues

  • Ubiquitous Information Networks

    Graphs and substructures

        Chemical compounds, computer vision objects, circuits, XML

    Biological networks

    Bibliographic networks: DBLP, ArXiv, PubMed, ...

    Social networks: Facebook, >100 million active users

    World Wide Web (WWW): >3 billion nodes, >50 billion arcs

    Cyber-physical networks

    [Figures: yeast protein interaction network; an Internet Web; co-author network; social network sites]

  • Homogeneous vs. Heterogeneous Networks

    [Figures: conference-author network; co-author network]

  • DBLP: An Interesting and Familiar Network

    DBLP: a computer science publication bibliographic database

        1.4M records (papers), 0.7M authors, 5K conferences, ...

    Will this database disclose interesting knowledge about computer science research?

        What are the popular research fields/subfields in CS?

        Who are the leading researchers on DB or XQueries?

        How do the authors in this subfield collaborate and evolve?

        How many Wei Wangs are in DBLP, and which paper was written by which one?

        Who was Sergey Brin's supervisor, and when?

        Who are very similar to Christos Faloutsos?

    All these kinds of questions, and potentially many more, can be nicely answered by the DBLP InfoNet

        How? By exploring the power of links in information networks!

  • Homo. vs. Hetero.: Differences in DB-InfoNet Mining

    Homogeneous networks can often be derived from their original heterogeneous networks

        Co-author networks can be derived from author-paper-conference networks by projection on authors only

        Paper citation networks can be derived from a complete bibliographic network with papers and citations projected

    A heterogeneous DB-InfoNet carries richer information than its corresponding projected homogeneous networks

    Typed heterogeneous InfoNet vs. non-typed hetero. InfoNet (i.e., not distinguishing different types of nodes)

        Typed nodes and links imply a more structured InfoNet, and thus often lead to more informative discovery

    Our emphasis: mining structured information networks!
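    The projection from an author-paper network to a co-author network can be sketched in a few lines of Python. This is only an illustration: the function name `project_coauthors` and the toy papers are assumptions made for the example, not data or code from the tutorial.

```python
from collections import defaultdict
from itertools import combinations

def project_coauthors(paper_authors):
    """Project an author-paper network onto authors only.

    paper_authors maps a paper id to the list of its authors.
    Returns {(author_a, author_b): number of co-authored papers},
    with each pair stored in sorted order.
    """
    weights = defaultdict(int)
    for authors in paper_authors.values():
        # Every unordered pair of distinct authors on a paper gets one link.
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical toy data, not taken from DBLP.
papers = {
    "p1": ["Tom", "Mike"],
    "p2": ["Tom", "Mike", "Cathy"],
    "p3": ["John"],
}
coauthor = project_coauthors(papers)
```

    The projected network no longer records which paper (or venue) produced each link, which illustrates why the heterogeneous network carries richer information than its projection.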

  • Why Mining Heterogeneous Information Networks?

    Most datasets can be organized or transformed into a structured heterogeneous information network!

        Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, ...

        Structures can be progressively extracted from less organized datasets by information network analysis

        Information-rich, inter-related, organized datasets form one or a set of gigantic, interconnected, multi-typed heterogeneous information networks

        Surprisingly rich knowledge can be derived from such structured heterogeneous information networks

    Our goal: uncover knowledge hidden from organized data

        Exploring the power of multi-typed, heterogeneous links

        Mining structured heterogeneous information networks!


  • Clustering and Ranking in Information Networks

    Integrated Clustering and Ranking of Heterogeneous Information Networks

    Clustering of Homogeneous Information Networks

        LinkClus: Clustering with link-based similarity measure

        SCAN: Density-based clustering of networks

        Others

            Spectral clustering

            Modularity-based clustering

            Probabilistic model-based clustering

    User-Guided Clustering of Information Networks

  • Clustering and Ranking in Heterogeneous Information Networks

    Ranking and clustering each provide a new view over a network

    Ranking globally without considering clusters: dumb

        Ranking DB and Architecture confs. together?

    Clustering authors in one huge cluster without distinction?

        Dull to view thousands of objects (this is why PageRank!)

    RankClus: integrates clustering with ranking

        Conditional ranking relative to clusters

        Uses highly ranked objects to improve clusters

        Qualities of clustering and ranking are mutually enhanced

    Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09.

  • Global Ranking vs. Cluster-Based Ranking

    A toy example: two areas with 10 conferences and 100 authors in each area

  • RankClus: A New Framework

    [Figure: the RankClus framework; labels: sub-network, ranking, clustering]

  • The RankClus Philosophy

    Why integrate ranking and clustering?

        Ranking and clustering can be mutually improved

            Ranking: once a cluster becomes more accurate, ranking will be more reasonable for such a cluster and will be the distinguished feature of the cluster

            Clustering: once the rankings are more distinguished from each other, the clusters can be adjusted to get more accurate results

        Not every object should be treated equally in clustering!

    Objects preserve similarity under the new measure space

        E.g., VLDB vs. SIGMOD

  • RankClus: Algorithm Framework

    Step 0. Initialization

        Randomly partition target objects into K clusters

    Step 1. Ranking

        Ranking for each sub-network induced from each cluster, which serves as the feature for each cluster

    Step 2. Generating new measure space

        Estimate mixture model coefficients for each target object

    Step 3. Adjusting clusters

    Step 4. Repeating Steps 1-3 until stable

  • Focus on a Bi-Typed Network Case

    Conference-author network; links can exist between

        Conference (X) and author (Y)

        Author (Y) and author (Y)

    Use W to denote the links and their weights

        $W = \begin{pmatrix} W_{XX} & W_{XY} \\ W_{YX} & W_{YY} \end{pmatrix}$

  • Step 1: Ranking: Feature Extraction

    Two ranking strategies: simple ranking vs. authority ranking

    Simple ranking

        Proportional to degree counting for objects, e.g., # of publications of an author

        Considers only the immediate neighborhood in the network

    Authority ranking: extension of HITS to weighted bi-type networks

        Rule 1: Highly ranked authors publish many papers in highly ranked conferences

        Rule 2: Highly ranked conferences attract many papers from many highly ranked authors

        Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors

  • Encoding Rules in Authority Ranking

    Rule 1: Highly ranked authors publish many papers in highly ranked conferences

    Rule 2: Highly ranked conferences attract many papers from many highly ranked authors

    Rule 3: The rank of an author is enhanced if he or she co-authors with many authors or many highly ranked authors
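    One plausible way to encode the three rules is as an iterative score propagation, sketched below in plain Python. The sketch assumes that conference scores are proportional to W_XY r_Y and that author scores mix the score received from conferences with the score received from co-authors; the mixing weight `alpha`, the function names, and the toy matrices are illustrative assumptions, not values from the tutorial.

```python
def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores] if total else list(scores)

def authority_ranking(w_xy, w_yy, alpha=0.95, iters=50):
    """Iterative authority ranking on a bi-typed network (a sketch).

    w_xy[i][j]: number of papers author j published in conference i.
    w_yy[j][l]: co-authorship weight between authors j and l.
    alpha: assumed weight of conference links vs. co-author links.
    """
    n_conf, n_auth = len(w_xy), len(w_xy[0])
    r_x = [1.0 / n_conf] * n_conf
    r_y = [1.0 / n_auth] * n_auth
    for _ in range(iters):
        # Rule 2: conferences collect score from the authors publishing there.
        r_x = normalize([sum(w_xy[i][j] * r_y[j] for j in range(n_auth))
                         for i in range(n_conf)])
        # Rule 1: authors collect score from the conferences they publish in.
        from_conf = [sum(w_xy[i][j] * r_x[i] for i in range(n_conf))
                     for j in range(n_auth)]
        # Rule 3: authors also collect score from their co-authors.
        from_coauth = [sum(w_yy[j][l] * r_y[l] for l in range(n_auth))
                       for j in range(n_auth)]
        r_y = normalize([alpha * c + (1 - alpha) * a
                         for c, a in zip(from_conf, from_coauth)])
    return r_x, r_y

# Hypothetical toy network: 2 conferences, 3 authors.
w_xy = [[5, 1, 0],
        [1, 0, 1]]
w_yy = [[0, 1, 0],
        [1, 0, 0],
        [0, 0, 0]]
conf_rank, author_rank = authority_ranking(w_xy, w_yy)
```

    Normalizing after every pass keeps both score vectors interpretable as distributions, which is what the later mixture-model step expects.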

  • Example: Authority Ranking in the 2-Area Conference-Author Network

    The rankings of authors are quite distinct from each other in the two clusters

  • Step 2: Generate New Measure Space: A Mixture Model Method

    Consider each target object's links as generated under a mixture distribution of the rankings from each cluster

        Consider ranking as a distribution: r(Y) → p(Y)

        Each target object x_i is mapped into a K-vector (π_{i,k})

        Parameters are estimated using the EM algorithm

            Maximize the log-likelihood given all the observations of links
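    The EM estimation can be illustrated with a simplified, per-object version of the mixture model. The sketch below assumes each link of one target object is drawn from a mixture of the K cluster ranking distributions with object-specific weights pi_k; this is a simplification for illustration rather than the exact RankClus model, and the function name and toy rankings are assumptions.

```python
def mixture_coefficients(link_weights, rankings, iters=100):
    """Estimate mixture coefficients pi_k for one target object via EM.

    link_weights[j]: weight of the link from the target object to
                     attribute object j (e.g., papers by author j).
    rankings[k][j]:  conditional rank p_k(y_j) of attribute object j
                     in cluster k (each row sums to 1).
    """
    k_clusters = len(rankings)
    pi = [1.0 / k_clusters] * k_clusters
    for _ in range(iters):
        new_pi = [0.0] * k_clusters
        for j, w in enumerate(link_weights):
            if w == 0:
                continue
            # E-step: responsibility of each cluster for this link.
            joint = [pi[k] * rankings[k][j] for k in range(k_clusters)]
            norm = sum(joint)
            if norm == 0:
                continue
            for k in range(k_clusters):
                new_pi[k] += w * joint[k] / norm
        total = sum(new_pi)
        if total == 0:
            break
        # M-step: coefficients are the normalized expected link counts.
        pi = [v / total for v in new_pi]
    return pi

# Hypothetical rankings of 4 authors in 2 clusters.
rankings = [[0.5, 0.4, 0.1, 0.0],
            [0.0, 0.1, 0.4, 0.5]]
focused = mixture_coefficients([3, 2, 0, 0], rankings)  # links into cluster 0
mixed = mixture_coefficients([1, 1, 1, 1], rankings)    # links into both
```

    A conference whose links concentrate on authors ranked highly in one cluster ends up with a coefficient vector close to that cluster's axis, which is what separates the conferences in the new measure space.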

  • Example: 2-D Coefficients in the 2-Area Conference-Author Network

    The conferences are well separated in the new measure space

    Scatter plots of two conferences and component coefficients

  • Step 3: Cluster Adjustment in New Measure Space

    Cluster center in new measure space

        Vector mean of objects in the cluster (K-dimensional)

    Cluster adjustment

        Distance measure: 1 - cosine similarity

        Assign to the cluster with the nearest center

    Why does a better ranking function derive better clustering?

        Consider the measure space generation process

        Highly ranked objects in a cluster play a more important role in deciding a target object's new measure

        Intuitively, if we can find the highly ranked objects in a cluster, then, equivalently, we get the right cluster
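    The adjustment step can be sketched as follows: compute each cluster center as the mean of its members' K-dimensional coefficient vectors, then reassign every object to the center with the smallest distance 1 - cosine similarity. The function names and toy vectors are assumptions for illustration; an empty cluster is simply given a zero center here, which is one of several possible choices.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def adjust_clusters(coeffs, labels, k_clusters):
    """Reassign each object to the nearest cluster center.

    coeffs[i]: K-dimensional coefficient vector of target object i.
    labels[i]: current cluster index of target object i.
    """
    dim = len(coeffs[0])
    centers = []
    for k in range(k_clusters):
        members = [c for c, lab in zip(coeffs, labels) if lab == k]
        if members:
            # Center = component-wise mean of the member vectors.
            centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centers.append([0.0] * dim)
    return [min(range(k_clusters), key=lambda k: cosine_distance(x, centers[k]))
            for x in coeffs]

# Hypothetical coefficients; object 1 starts in the wrong cluster.
coeffs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
new_labels = adjust_clusters(coeffs, [0, 1, 1, 1], 2)
```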

  • Step-by-Step Running Case Illustration

    Initially, ranking distributions are mixed together

    Two clusters of objects mixed together, but preserve similarity somehow

    Improved a little

    Two clusters are almost well separated

    Improved significantly

    Stable

    Well separated

  • Time Complexity: Linear to # of Links

    At each iteration, |E|: edges in network, m: number of target objects, K: number of clusters

        Ranking for sparse network: ~O(|E|)

        Mixture model estimation: ~O(K|E| + mK)

        Cluster adjustment: ~O(mK^2)

    In all, linear to |E|: ~O(K|E|)

    Note: SimRank will be at least quadratic at each iteration since it evaluates distance between every pair in the network

  • Case Study: Dataset: DBLP

    All the 2676 conferences and the 20,000 authors with the most publications, from the time period of year 1998 to year 2007

    Both conference-author relationships and co-author relationships are used

    K = 15 (select only 5 clusters here)

  • NetClus: Ranking & Clustering with Star Network Schema

    Beyond bi-typed information network: a star network schema

    Split a network into different layers, each represented by a net-cluster

  • StarNet: Schema & Net-Cluster

    Star network schema

        Center type: target type

            E.g., a paper, a movie, a tagging event

            A center object is a co-occurrence of a bag of different types of objects, which stands for a multi-relation among different types of objects

        Surrounding types: attribute (property) types

    Net-cluster

        Given an information network G, a net-cluster C contains two pieces of information:

            Node set and link set as a sub-network of G

            Membership indicator for each node x: P(x ∈ C)

        Given an information network G and cluster number K, a clustering for G is a set of net-clusters such that, for each node x, the sum of x's probability distribution over all K net-clusters is 1

  • [Figure: star network schema of DBLP; center type: Research Paper; surrounding types: Term, Author, Venue; link types: Publish, Write, Contain]

  • StarNet of Delicious.com

    [Figure: star network schema of Delicious.com; center type: Tagging Event; surrounding types: Tag, User, Web Site; link type shown: Contain]

  • StarNet for IMDB

    [Figure: star network schema of IMDB; center type: Movie; surrounding types: Title/Plot, Director, Actor/Actress; link types: Star in, Direct, Contain]

  • Ranking Functions

    Ranking an object x of type T_x in a network G, denoted as p(x | T_x, G)

        Give a score to each object, according to its importance

        Different rules define different ranking functions:

            Simple ranking

                Ranking score is assigned according to the degree of an object

            Authority ranking

                Ranking score is assigned according to the mutual enhancement by propagation of scores through links

                "Highly ranked conferences accept many good papers published by many highly ranked authors, and highly ranked authors publish many good papers in highly ranked conferences."

  • Ranking Function (Cont.)

    Priors can be added:

        $P_P(X \mid T_x, G_k) = (1 - \lambda_P)\, P(X \mid T_x, G_k) + \lambda_P\, P_0(X \mid T_x, G_k)$

        $P_0(X \mid T_x, G_k)$ is the prior knowledge, usually given as a distribution, denoted by only several words

        $\lambda_P$ is the parameter expressing how much we believe in the prior distribution

    Ranking distribution

        Normalize ranking scores to 1, giving them a probabilistic meaning

        Similar to the idea of PageRank
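    A small sketch of how a prior might be blended into a normalized ranking distribution, under the assumption that the blend is a convex combination controlled by a belief parameter (called `lam` here). The function names and the toy term scores are hypothetical.

```python
def normalize(scores):
    """Turn raw ranking scores into a distribution that sums to 1."""
    total = sum(scores.values())
    return {x: s / total for x, s in scores.items()}

def smooth_with_prior(ranking, prior, lam):
    """Blend a ranking distribution with a prior distribution.

    Returns (1 - lam) * ranking + lam * prior, object by object;
    objects missing from one distribution count as probability 0 there.
    """
    objects = set(ranking) | set(prior)
    return {x: (1 - lam) * ranking.get(x, 0.0) + lam * prior.get(x, 0.0)
            for x in objects}

# Hypothetical term scores in one cluster and a two-word prior.
ranking = normalize({"database": 3, "query": 1, "mining": 1})
prior = {"database": 0.5, "query": 0.5}
blended = smooth_with_prior(ranking, prior, lam=0.2)
```

    Because both inputs sum to 1, the blended scores also sum to 1, so the result can still be read as a ranking distribution.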

  • NetClus: Algorithm Framework

    Map each target object into a new low-dimensional feature space according to the current net-clustering, and adjust the clustering further in the new measure space

        Step 0: Generate initial random clusters

        Step 1: Generate a ranking-based generative model for target objects for each net-cluster

        Step 2: Calculate posterior probabilities for target objects, which serve as the new measure, and assign target objects to the nearest cluster accordingly

        Step 3: Repeat steps 1 and 2 until clusters do not change significantly

        Step 4: Calculate posterior probabilities for attribute objects in each net-cluster
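    The posterior calculation in Step 2 can be sketched with Bayes' rule, assuming the posterior of cluster k for a target object d is proportional to P(d | G_k) times a cluster prior P(k). Working with log-likelihoods avoids underflow when a target object has many links. The function name and the toy numbers are illustrative assumptions.

```python
import math

def posterior(log_likelihoods, cluster_priors):
    """Posterior P(k | d) from per-cluster log-likelihoods and priors.

    log_likelihoods[k]: log P(d | G_k) for target object d.
    cluster_priors[k]:  prior probability P(k) of net-cluster k.
    """
    scores = [ll + math.log(p) for ll, p in zip(log_likelihoods, cluster_priors)]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical likelihoods of one paper under two net-clusters.
post = posterior([math.log(0.03), math.log(0.01)], [0.5, 0.5])
```

    The resulting K-dimensional vector sums to 1 and is the kind of vector used as the new measure for cluster adjustment.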

  • Generative Model for Target Objects Given a Net-Cluster

    Each target object stands for a co-occurrence of a bag of attribute objects

    Define the probability of a target object as the probability of the co-occurrence of all the associated attribute objects

    Generative probability P(d | G_k) for target object d in cluster C_k:

        $P(d \mid G_k) = \prod_{x \in N_G(d)} P(x \mid T_x, G_k)^{W_{d,x}} \, P(T_x \mid G_k)^{W_{d,x}}$

        where N_G(d) is the set of attribute objects linked to d, W_{d,x} is the weight of the link between d and x, P(x | T_x, G_k) is the ranking function, and P(T_x | G_k) is the type probability

    Two assumptions of independence

        The probabilities to visit objects of different types are independent of each other

        The probabilities to visit two objects within the same type are independent of each other
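    Under the two independence assumptions, the generative probability of a target object factorizes into a product over its linked attribute objects, which is convenient to compute in log space. The sketch below assumes each linked attribute object x contributes P(x | T_x, G_k) * P(T_x | G_k) raised to the link weight, and that all ranking scores are strictly positive (e.g., after smoothing). The function name, type names, and numbers are hypothetical.

```python
import math

def log_generative_prob(links, rank, type_prob):
    """Log-probability of a target object d under one net-cluster G_k.

    links:     list of (object, type, weight) for attribute objects linked to d.
    rank:      rank[type][object] = P(x | T_x, G_k), assumed > 0.
    type_prob: type_prob[type] = P(T_x | G_k), assumed > 0.
    """
    return sum(w * (math.log(rank[t][x]) + math.log(type_prob[t]))
               for x, t, w in links)

# Hypothetical cluster model with two attribute types.
rank = {
    "author": {"a1": 0.5, "a2": 0.5},
    "term": {"database": 0.25, "query": 0.75},
}
type_prob = {"author": 0.5, "term": 0.5}
# A paper written by a1 whose title contains "query" twice.
paper_links = [("a1", "author", 1), ("query", "term", 2)]
log_p = log_generative_prob(paper_links, rank, type_prob)
```

    Here exp(log_p) equals (0.5 * 0.5) * (0.75 * 0.5)^2, i.e. the product of the per-link probabilities.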

  • Cluster Adjustment

    Using posterior probabilities of target objects as new feature space

        Each target object => K-dimensional vector

        Each net-cluster center => K-dimensional vector

            Average on the objects in the cluster

        Assign each target object to the nearest cluster center (e.g., by cosine similarity)

    A sub-network corresponding to a new net-cluster is then built by extracting all the target objects in that cluster and all linked attribute objects

  • Experiments: DBLP and Beyond

    Data set: DBLP "all-area" data set

        All conferences + top 50K authors

    DBLP "four-area" data set

        20 conferences from DB, DM, ML, IR

        All authors from these conferences

        All papers published in these conferences

    Running case illustration

  • Accuracy Study: Experiments

    Accuracy, compared with PLSA, a pure text model: no other types of objects and links are used; the same prior as in NetClus is used

    Accuracy, compared with RankClus, a bi-typed clustering method on only one type

    [Tables: accuracy of paper clustering results; accuracy of conference clustering results]

  • NetClus: Distinguishing Conferences

        Conference  Cluster 1    Cluster 2   Cluster 3   Cluster 4   Cluster 5
        AAAI        0.0022667    0.00899168  0.934024    0.0300042   0.0247133
        CIKM        0.150053     0.310172    0.00723807  0.444524    0.0880127
        CVPR        0.000163812  0.00763072  0.931496    0.0281342   0.032575
        ECIR        3.47023e-05  0.00712695  0.00657402  0.978391    0.00787288
        ECML        0.00077477   0.110922    0.814362    0.0579426   0.015999
        EDBT        0.573362     0.316033    0.00101442  0.0245591   0.0850319
        ICDE        0.529522     0.376542    0.00239152  0.0151113   0.0764334
        ICDM        0.000455028  0.778452    0.0566457   0.113184    0.0512633
        ICML        0.000309624  0.050078    0.878757    0.0622335   0.00862134
        IJCAI       0.00329816   0.0046758   0.94288     0.0303745   0.0187718
        KDD         0.00574223   0.797633    0.0617351   0.067681    0.0672086
        PAKDD       0.00111246   0.813473    0.0403105   0.0574755   0.0876289
        PKDD        5.39434e-05  0.760374    0.119608    0.052926    0.0670379
        PODS        0.78935      0.113751    0.013939    0.00277417  0.0801858
        SDM         0.000172953  0.841087    0.058316    0.0527081   0.0477156
        SIGIR       0.00600399   0.00280013  0.00275237  0.977783    0.0106604
        SIGMOD      0.689348     0.223122    0.0017703   0.00825455  0.0775055
        VLDB        0.701899     0.207428    0.00100012  0.0116966   0.0779764
        WSDM        0.00751654   0.269259    0.0260291   0.683646    0.0135497
        WWW         0.0771186    0.270635    0.029307    0.451857    0.171082

  • NetClus: Database System Cluster

    Terms

        database 0.0995511
        databases 0.0708818
        system 0.0678563
        data 0.0214893
        query 0.0133316
        systems 0.0110413
        queries 0.0090603
        management 0.00850744
        object 0.00837766
        relational 0.0081175
        processing 0.00745875
        based 0.00736599
        distributed 0.0068367
        xml 0.00664958
        oriented 0.00589557
        design 0.00527672
        web 0.00509167
        information 0.0050518
        model 0.00499396
        efficient 0.00465707

    Authors

        Surajit Chaudhuri 0.00678065
        Michael Stonebraker 0.00616469
        Michael J. Carey 0.00545769
        C. Mohan 0.00528346
        David J. DeWitt 0.00491615
        Hector Garcia-Molina 0.00453497
        H. V. Jagadish 0.00434289
        David B. Lomet 0.00397865
        Raghu Ramakrishnan 0.0039278
        Philip A. Bernstein 0.00376314
        Joseph M. Hellerstein 0.00372064
        Jeffrey F. Naughton 0.00363698
        Yannis E. Ioannidis 0.00359853
        Jennifer Widom 0.00351929
        Per-Ake Larson 0.00334911
        Rakesh Agrawal 0.00328274
        Dan Suciu 0.00309047
        Michael J. Franklin 0.00304099
        Umeshwar Dayal 0.00290143
        Abraham Silberschatz 0.00278185

    Venues

        VLDB 0.318495
        SIGMOD Conf. 0.313903
        ICDE 0.188746
        PODS 0.107943
        EDBT 0.0436849

    Ranking authors in XML

  • NetClus: StarNet-Based Ranking and Clustering

    A general framework in which ranking and clustering are successfully combined to analyze infonets

    Ranking and clustering can mutually reinforce each other in information network analysis

    NetClus: an extension to RankClus that integrates ranking and clustering and generates net-clusters in a star network with an arbitrary number of types

    Net-cluster: heterogeneous information sub-networks comprised of multiple types of objects

    Go well beyond DBLP and structured relational DBs

    Flickr: query "Raleigh" derives multiple clusters

  • iNextCube: Information Network-Enhanced Text Cube (VLDB'09 Demo)

    Architecture of iNextCube

    Dimension hierarchies generated by NetClus

    [Figure text, truncated in the source: "Author/confere... term ranking fo... research area. T... research areas c... at different leve..."]

    Net-cluster hierarchy

        All

            DB and IS, Theory, Architecture

                DB, DM, IR

                    XML, Distributed DB

    Demo: iNextCube.cs.uiuc.edu


  • Link-Based Clustering: Why Useful?

    [Figure: linked objects of three types. Authors: Tom, Mike, Cathy, John, Mary. Proceedings: sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05, aaai04, aaai05. Conferences: sigmod, vldb, aaai]

    Questions:

        Q1: How to cluster each type of objects?

        Q2: How to define similarity between each type of objects?

  • SimRank: Link-Based Similarities

    Two objects are similar if linked with the same or similar objects

    [Figure: authors Tom, Mike, Cathy, John, Mary linked to proceedings sigmod03, sigmod04, sigmod05, vldb03, vldb04, vldb05, which are linked to conferences sigmod and vldb]

    SimRank (Jeh & Widom, 2002)

        Similarity between two objects a and b, S(a, b) = the average similarity between objects linked with a and those linked with b:

            $S(a, b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} S(I_i(a), I_j(b))$

        where I(v) is the set of in-neighbors of the vertex v and C is a decay factor between 0 and 1.

    But: it is expensive to compute:

        For a dataset of N objects and M links, it takes O(N^2) space and O(M^2) time to compute all similarities.
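    A naive SimRank iteration, written directly from the definition of Jeh & Widom, shows where the quadratic cost comes from: every pair of nodes is updated from every pair of their in-neighbors. The decay factor value, the function name, and the toy graph below are assumptions for illustration.

```python
def simrank(in_neighbors, decay=0.8, iters=10):
    """Naive SimRank: all-pairs similarity from in-neighbor similarity.

    in_neighbors[v]: list of nodes with a link pointing to v.
    decay: decay factor C in (0, 1); 0.8 is an assumed value.
    """
    nodes = list(in_neighbors)
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iters):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not in_neighbors[a] or not in_neighbors[b]:
                    # Nodes without in-neighbors have similarity 0 to others.
                    new[a][b] = 0.0
                else:
                    total = sum(sim[u][v]
                                for u in in_neighbors[a]
                                for v in in_neighbors[b])
                    new[a][b] = decay * total / (
                        len(in_neighbors[a]) * len(in_neighbors[b]))
        sim = new
    return sim

# Hypothetical author -> proceedings links (in-neighbors of proceedings).
graph = {
    "Tom": [], "Mike": [], "Cathy": [],
    "sigmod03": ["Tom", "Mike"],
    "sigmod04": ["Mike"],
    "vldb03": ["Cathy"],
}
sim = simrank(graph)
```

    In this toy graph sigmod03 and sigmod04 share the author Mike, so their similarity is 0.8 * (0 + 1) / (2 * 1) = 0.4, while sigmod03 and vldb03 share no authors and stay at 0.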

  • Observation 1: Hierarchical Structures

    Hierarchical structures often exist naturally among objects (e.g., taxonomy of animals)

    [Figure: a hierarchical structure of products in Walmart; levels: All; electronics, grocery, apparel; DVD, camera, TV]

    [Figure: relationships between articles and words (Chakrabarti, Papadimitriou, Modha, Faloutsos, 2004)]

  • Observation 2: Distribution of Similarity

    Power-law distribution exists in similarities

        56% of similarity entries are in [0.005, 0.015]

        1.4% of similarity entries are larger than 0.1

    Our goal: design a data structure that stores the significant similarities and compresses insignificant ones

    [Figure: distribution of SimRank similarities among DBLP authors; x-axis: similarity value (0 to 0.24); y-axis: portion of entries (0 to 0.4)]

  • Our Data Structure: SimTree

    Each leaf node represents an object

    Each non-leaf node represents a group of similar lower-level nodes

    [Figure: a SimTree with nodes n1 to n9 and edge labels 0.9, 1.0, 0.9, 0.8, 0.2, 0.3, 0.8, 0.9; annotations: similarity between two sibling nodes n1 and n2; adjustment ratio for node n7]

    Path-based node similarity

        $sim_p(n_7, n_8) = s(n_7, n_4) \times s(n_4, n_5) \times s(n_5, n_8)$

    Similarity between two nodes is the average similarity between objects linked with them in other SimTrees

    Adjustment ratio for x = (average similarity between x and all other nodes) / (average similarity between x's parent and all other nodes)
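    The path-based similarity can be sketched as a walk up the tree: multiply the adjustment ratios on the child-parent edges of both nodes until their ancestors become siblings, then multiply by the stored sibling similarity. The function name, the dictionary layout, and all numeric values below are hypothetical; the edge values in the original figure could not be recovered with certainty.

```python
def path_similarity(a, b, parent, ratio, sibling_sim):
    """Path-based similarity of two distinct same-level nodes of a SimTree.

    parent[n]:      parent of node n.
    ratio[n]:       adjustment ratio on the edge between n and parent[n].
    sibling_sim[p]: similarity of a sibling pair, keyed by frozenset({u, v}).
    """
    sim = 1.0
    # Climb until the two current nodes share a parent (are siblings).
    while parent[a] != parent[b]:
        sim *= ratio[a] * ratio[b]
        a, b = parent[a], parent[b]
    return sim * sibling_sim[frozenset((a, b))]

# Hypothetical fragment of a SimTree: n7 under n4, n8 under n5,
# with n4 and n5 both under n2.
parent = {"n7": "n4", "n8": "n5", "n4": "n2", "n5": "n2"}
ratio = {"n7": 0.8, "n8": 0.9}
sibling_sim = {frozenset(("n4", "n5")): 0.2}
s78 = path_similarity("n7", "n8", parent, ratio, sibling_sim)
```

    With these assumed values the result is 0.8 * 0.2 * 0.9 = 0.144, matching the pattern s(n7, n4) x s(n4, n5) x s(n5, n8). Only sibling similarities and per-node ratios are stored, which is what keeps the structure compact.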

  • LinkClus: SimTree-Based Hierarchical Clustering

    Initialize a SimTree for objects of each type

    Repeat

        For each SimTree, update the similarities between its nodes using similarities in other SimTrees

            Similarity between two nodes a and b is the average similarity between objects linked with them

        Adjust the structure of each SimTree

            Assign each node to the parent node that it is most similar to

  • Initialization of SimTrees

    Finding tight groups → frequent pattern mining

    Initializing a tree:

        Start from leaf nodes (level 0)

        At each level l, find non-overlapping groups of similar nodes with frequent pattern mining

    [Figure: nodes n1 to n4 linked to objects 1 to 9 are reduced to the transactions {n1}, {n1, n2}, {n2}, {n1, n2}, {n1, n2}, {n2, n3, n4}, {n4}, {n3, n4}, {n3, n4}, yielding groups g1 and g2]

    The tightness of a group of nodes is the support of a frequent pattern
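    The tightness measure can be illustrated by counting support over the transactions shown in the figure: each linked object yields one transaction listing the nodes it is linked to, and the support of a candidate group is the number of transactions that contain the whole group. The function name `support` is an assumption for the example.

```python
def support(group, transactions):
    """Number of transactions that contain every node of the group."""
    group = set(group)
    return sum(1 for t in transactions if group <= t)

# Transactions read from the slide's example (one per linked object 1..9).
transactions = [
    {"n1"}, {"n1", "n2"}, {"n2"}, {"n1", "n2"}, {"n1", "n2"},
    {"n2", "n3", "n4"}, {"n4"}, {"n3", "n4"}, {"n3", "n4"},
]
tight_g1 = support({"n1", "n2"}, transactions)
tight_g2 = support({"n3", "n4"}, transactions)
loose = support({"n2", "n3"}, transactions)
```

    The groups {n1, n2} and {n3, n4} each have support 3, while the mixed pair {n2, n3} has support 1, so the first two are the tighter candidate groups.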

  • Complexity: LinkClus vs. SimRank

    After initialization, iteratively (1) for each SimTree, update the similarities between its nodes using similarities in other SimTrees, and (2) adjust the structure of each SimTree

    Computational complexity:

        For two types of objects, N in each, and M linkages between them

                                     Time             Space
        Updating similarities        O(M (log N)^2)   O(M + N)
        Adjusting tree structures    O(N)             O(N)
        LinkClus                     O(M (log N)^2)   O(M + N)
        SimRank                      O(M^2)           O(N^2)

    Performance Comparison: Experiment Setup

    DBLP dataset: 4170 most productive authors, and 154 well-known conferences with most proceedings
    Manually labeled research areas of 400 most productive authors according to their home pages (or publications)
    Manually labeled areas of 154 conferences according to their calls for papers
    Approaches compared:
      SimRank (Jeh & Widom, KDD 2002)
        Computes pair-wise similarities
      SimRank with FingerPrints (F-SimRank)
        Fogaras & Racz, WWW 2005
        Pre-computes a large sample of random paths from each object and uses samples of two objects to estimate SimRank similarity
      ReCom (Wang et al., SIGIR 2003)
        Iteratively clusters objects using cluster labels of linked objects

    DBLP Data Set: Accuracy and Computation Time

    [Figure: two panels (Authors, Conferences) plotting accuracy against #iteration (1 to 19) for LinkClus, SimRank, ReCom, and F-SimRank]

    Approach     Accuracy (authors)   Accuracy (conferences)   Average time
    LinkClus     0.957                0.723                    76.7
    SimRank      0.958                0.760                    1020
    ReCom        0.907                0.457                    43.1
    F-SimRank    0.908                0.583                    83.6

    Email Dataset: Accuracy and Time

    F. Nielsen. Email dataset. http://www.imm.dtu.dk/rem/data/Email1431.zip
    370 emails on conferences, 272 on jobs, and 789 spam emails
    Why is LinkClus even better than SimRank in accuracy?
      Noise filtering due to frequent-pattern-based preprocessing

    Approach     Accuracy   Total time (sec)
    LinkClus     0.8026     1579.6
    SimRank      0.7965     39160
    ReCom        0.5711     74.6
    F-SimRank    0.3688     479.7
    CLARANS      0.4768     8.55

    Clustering and Ranking in Information Networks

    Integrated Clustering and Ranking of Heterogeneous Information Networks
    Clustering of Homogeneous Information Networks
      LinkClus: Clustering with link-based similarity measure
      SCAN: Density-based clustering of networks
      Others
        Spectral clustering
        Modularity-based clustering
        Probabilistic model-based clustering
    User-Guided Clustering of Information Networks

    SCAN: Density-Based Network Clustering

    Networks made up of the mutual relationships of data elements usually have an underlying structure. Clustering: a structure discovery problem
    Given simply information of who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
    Questions to be answered: How many clusters? What size should they be? What is the best partitioning? Should some points be segregated?
    SCAN: an interesting density-based algorithm
      X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger, "SCAN: A Structural Clustering Algorithm for Networks", Proc. 2007 ACM SIGKDD Int. Conf. Knowledge Discovery in Databases (KDD'07), San Jose, CA, Aug. 2007

    Social Network and Its Clustering Problem

    Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group.
    Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups.
    Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group.

    Structure Similarity

    Define $\Gamma(v)$ as the set of immediate neighbors of a vertex v.
    The desired features tend to be captured by a measure $\sigma(v, w)$, the structural similarity:

    $$ \sigma(v, w) = \frac{|\Gamma(v) \cap \Gamma(w)|}{\sqrt{|\Gamma(v)|\,|\Gamma(w)|}} $$

    Structural similarity is large for members of a clique and small for hubs and outliers.
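A minimal sketch of the structural similarity measure follows. It assumes, as in the SCAN paper, that the neighborhood Γ(v) contains v itself; the adjacency-dict representation, the function names, and the toy graph (a triangle a, b, c with a pendant vertex d attached to c) are illustrative assumptions.

```python
from math import sqrt


def closed_neighborhood(adj, v):
    """Gamma(v): the neighbors of v plus v itself (assumed convention)."""
    return set(adj[v]) | {v}


def structural_similarity(adj, v, w):
    """sigma(v, w) = |Gamma(v) & Gamma(w)| / sqrt(|Gamma(v)| * |Gamma(w)|)."""
    gv = closed_neighborhood(adj, v)
    gw = closed_neighborhood(adj, w)
    return len(gv & gw) / sqrt(len(gv) * len(gw))


# Hypothetical toy graph: triangle a-b-c, plus d hanging off c.
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

sim_inside_clique = structural_similarity(adj, "a", "b")
sim_to_pendant = structural_similarity(adj, "c", "d")
```

Vertices a and b share their whole neighborhood, so their similarity is maximal, while the edge to the pendant vertex scores lower.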

    Structural Connectivity [1]

    ε-Neighborhood: $N_\varepsilon(v) = \{ w \in \Gamma(v) \mid \sigma(v, w) \ge \varepsilon \}$
    Core: $CORE_{\varepsilon,\mu}(v) \Leftrightarrow |N_\varepsilon(v)| \ge \mu$
    Direct structure reachable: $DirREACH_{\varepsilon,\mu}(v, w) \Leftrightarrow CORE_{\varepsilon,\mu}(v) \wedge w \in N_\varepsilon(v)$
    Structure reachable: transitive closure of direct structure reachability
    Structure connected: $CONNECT_{\varepsilon,\mu}(v, w) \Leftrightarrow \exists u \in V : REACH_{\varepsilon,\mu}(u, v) \wedge REACH_{\varepsilon,\mu}(u, w)$

    [1] M. Ester, H. P. Kriegel, J. Sander, & X. Xu (KDD'96), "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases"

    Structure-Connected Clusters

    Structure-connected cluster C
      Connectivity: $\forall v, w \in C : CONNECT_{\varepsilon,\mu}(v, w)$
      Maximality: $\forall v, w \in V : v \in C \wedge REACH_{\varepsilon,\mu}(v, w) \Rightarrow w \in C$
    Hubs:
      Do not belong to any cluster
      Bridge to many clusters
    Outliers:
      Do not belong to any cluster
      Connect to fewer clusters

    [Figure: example network with a hub and an outlier marked]
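The definitions above can be put together in a compact sketch of the clustering step. This is an illustration under stated assumptions, not the authors' implementation: Γ(v) is taken to include v, a non-member vertex is called a hub when its neighbors fall into two or more clusters (the criterion used in the SCAN paper) and an outlier otherwise, and the toy graph and the parameter values ε = 0.7, μ = 3 are hypothetical.

```python
from collections import deque
from math import sqrt


def gamma(adj, v):
    return set(adj[v]) | {v}


def sigma(adj, v, w):
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / sqrt(len(gv) * len(gw))


def eps_neighborhood(adj, v, eps):
    return {w for w in gamma(adj, v) if sigma(adj, v, w) >= eps}


def scan(adj, eps, mu):
    """Return (cluster id per member vertex, hubs, outliers)."""
    cluster_of = {}
    next_id = 0
    for v in adj:
        if v in cluster_of or len(eps_neighborhood(adj, v, eps)) < mu:
            continue  # already clustered, or not a core
        # v is an unassigned core: grow a new cluster from it.
        cluster_of[v] = next_id
        queue = deque([v])
        while queue:
            u = queue.popleft()
            neighborhood = eps_neighborhood(adj, u, eps)
            if len(neighborhood) < mu:
                continue  # border vertex: a member, but it does not expand
            for w in neighborhood:
                if w not in cluster_of:
                    cluster_of[w] = next_id
                    queue.append(w)
        next_id += 1
    hubs, outliers = set(), set()
    for v in adj:
        if v in cluster_of:
            continue
        touched = {cluster_of[w] for w in adj[v] if w in cluster_of}
        (hubs if len(touched) >= 2 else outliers).add(v)
    return cluster_of, hubs, outliers


# Hypothetical graph: two triangles bridged by h; o hangs off h.
adj = {
    "a": ["b", "c", "h"], "b": ["a", "c"], "c": ["a", "b"],
    "d": ["e", "f", "h"], "e": ["d", "f"], "f": ["d", "e"],
    "h": ["a", "d", "o"], "o": ["h"],
}
clusters, hubs, outliers = scan(adj, eps=0.7, mu=3)
```

On this graph the two triangles form the clusters, h bridges both without joining either, and o attaches to no cluster at all.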

    Algorithm

    [Figure: SCAN walkthrough on an example network with μ = 2 and ε = 0.7; edges are annotated with structural similarities 0.75, 0.67, and 0.82]

    Algorithm

    [Figure: SCAN walkthrough continued on the same network with μ = 2 and ε = 0.7; edges are annotated with structural similarities 0.51, 0.51, and 0.68]

    Running Time

    Running time: O(|E|)
    For sparse networks: O(|V|)

    [2] A. Clauset, M. E. J. Newman, & C. Moore, Phys. Rev. E 70, 066111 (2004).


    Spectral Clustering

    Spectral clustering: find the best cut that partitions the network
      Different criteria to decide "best": min cut, ratio cut, normalized cut, min-max cut
    Using min cut as an example [Wu et al. 1993]
      Assign each node i an indicator variable q_i
      Represent the cut size using the indicator vector and the adjacency matrix
        Cut size =
      Minimize the objective function by solving an eigenvalue system
        Relax the discrete value of q to a continuous value
        Map the continuous value of q into discrete ones to get cluster labels
        Use the eigenvector of the second smallest eigenvalue for q
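The slide's cut-size formula did not survive extraction, so the sketch below uses the standard formulation as an assumption: with indicator values q_i in {+1, -1} and graph Laplacian L = D - A, the cut size equals q^T L q / 4. The code only checks this identity against a direct count of cut edges; the eigenvector step is not implemented, and the function names and toy graph are illustrative.

```python
def cut_size_direct(edges, q):
    """Count edges whose endpoints carry different indicator values."""
    return sum(1 for i, j in edges if q[i] != q[j])


def cut_size_quadratic(edges, q):
    """Evaluate q^T L q / 4, where L = D - A is the graph Laplacian."""
    n = len(q)
    laplacian = [[0] * n for _ in range(n)]
    for i, j in edges:
        laplacian[i][i] += 1
        laplacian[j][j] += 1
        laplacian[i][j] -= 1
        laplacian[j][i] -= 1
    quad = sum(q[i] * laplacian[i][j] * q[j]
               for i in range(n) for j in range(n))
    return quad / 4


# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by edge (2,3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
good_split = [1, 1, 1, -1, -1, -1]
bad_split = [1, -1, 1, -1, 1, -1]

good_cut = cut_size_quadratic(edges, good_split)
bad_cut = cut_size_quadratic(edges, bad_split)
```

Because the quadratic form agrees with the edge count for every ±1 vector, relaxing q to real values turns the minimization into an eigenvalue problem on L, which is the step the slide describes.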

    Modularity-Based Clustering

    Modularity-based clustering
      Find the best clustering that maximizes the modularity function
      Q-function [Newman et al., 2004]
        Let e_ij be half of the fraction of edges between group i and group j
          e_ii is the fraction of edges within group i
        Let a_i be the fraction of all ends of edges attached to vertices in group i
        Q is then defined as the sum of differences between within-group edges and expected within-group edges
        Maximize Q
          One possible solution: hierarchically merge the clusters resulting in the greatest increase in the Q-function [Newman et al., 2004]
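With the quantities defined above, Newman's modularity is Q = Σ_i (e_ii - a_i²). The sketch below evaluates Q for a given assignment of vertices to groups; it does not perform the hierarchical merging, and the function name and toy graph are assumptions made for illustration.

```python
def modularity(edges, group_of):
    """Q = sum over groups of (e_ii - a_i^2) for an undirected edge list."""
    m = len(edges)
    groups = set(group_of.values())
    within = {g: 0 for g in groups}  # edges with both ends in group g
    ends = {g: 0 for g in groups}    # edge ends attached to group g
    for u, v in edges:
        gu, gv = group_of[u], group_of[v]
        ends[gu] += 1
        ends[gv] += 1
        if gu == gv:
            within[gu] += 1
    return sum(within[g] / m - (ends[g] / (2 * m)) ** 2 for g in groups)


# Hypothetical graph: two triangles joined by a single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {v: "A" for v in range(6)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
```

Splitting the graph along the bridge gives a positive Q, while putting every vertex in one group gives Q = 0, which is why a greedy merge would stop at the two-triangle partition.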

    Probabilistic Model-Based Clustering

    Probabilistic model-based clustering
      Build generative models for links based on hidden cluster labels
      Maximize the log-likelihood of all the links to derive the hidden cluster membership
    An example: Airoldi et al., Mixed membership stochastic block models, 2008
      Define a group interaction probability matrix B (K x K)
        B(g, h) denotes the probability of link generation between group g and group h
      Generative model for a link
        For each node, draw a membership probability vector from a Dirichlet prior
        For each pair of nodes, draw cluster labels according to their membership probabilities (supposing g and h), then decide whether to have a link according to probability B(g, h)
      Derive the hidden cluster labels by maximizing the likelihood given B and the prior
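The generative story above can be sketched with the standard library alone. This covers only the sampling direction, under assumed values for the Dirichlet parameters and for B; inference of the hidden labels is not shown, and all function names are illustrative.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_category(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_links(n_nodes, alpha, B, seed=0):
    """Sample memberships, then one directed link decision per ordered pair."""
    rng = random.Random(seed)
    memberships = [sample_dirichlet(alpha, rng) for _ in range(n_nodes)]
    links = set()
    for p in range(n_nodes):
        for q in range(n_nodes):
            if p == q:
                continue
            g = sample_category(memberships[p], rng)  # label p takes toward q
            h = sample_category(memberships[q], rng)  # label q takes toward p
            if rng.random() < B[g][h]:
                links.add((p, q))
    return memberships, links


# Hypothetical setting: K = 2 groups that mostly link within themselves.
B = [[0.9, 0.05], [0.05, 0.9]]
memberships, links = generate_links(n_nodes=6, alpha=[0.5, 0.5], B=B, seed=1)
```

A node may act as a member of different groups in different pairs, which is what distinguishes mixed membership from a hard partition.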


    User-Guided Clustering in DB InfoNet

    [Figure: relational schema, with callouts marking the target of clustering and the user hint]
      Professor(name, office, position)
      Open-course(course, semester, instructor)
      Course(course-id, name, area)
      Student(name, office, position)
      Register(student, course, semester, unit, grade)
      Advise(professor, student, degree)
      Group(name, area)
      Work-In(person, group)
      Publication(title, year, conf)
      Publish(author, title)

    A user usually has a goal of clustering, e.g., clustering students by research area
    The user specifies the clustering goal to a DB InfoNet cluster: CrossClus

    Classification vs. User-Guided Clustering

    The user-specified feature (in the form of an attribute) is used as a hint, not as class labels
      The attribute may contain too many or too few distinct values
        E.g., a user may want to cluster students into 20 clusters instead of 3
      Additional features need to be included in cluster analysis

    [Figure: all tuples for clustering, with the user hint marked]

    User-Guided Clustering vs. Semi-supervised Clustering

    Semi-supervised clustering [Wagstaff et al. '01, Xing et al. '02]
      The user provides a training set consisting of "similar" and "dissimilar" pairs of objects
    User-guided clustering
      The user specifies an attribute as a hint, and more relevant features are found for clustering

    [Figure: two panels over all tuples for clustering, one for semi-supervised clustering and one for user-guided clustering]

    Why Not Typical Semi-Supervised Clustering?

    Why not do typical semi-supervised clustering?
      Much information (in multiple relations) is needed to judge whether two tuples are similar
      A user may not be able to provide a good training set
      It is much easier for a user to specify an attribute as a hint, such as a student's research area

    [Figure: tuples to be compared (Tom Smith, SC1211, TA; Jane Chang, BI205, RA) and the user hint]

    CrossClus: An Overview

    CrossClus: Framework
      Search for good multi-relational features for clustering
      Measure similarity between features based on how they cluster objects into groups
      User guidance + heuristic search for finding pertinent features
      Clustering based on a k-medoids-based algorithm
    CrossClus: Major advantages
      User guidance, even in a very simple form, plays an important role in multi-relational clustering
      CrossClus finds pertinent features by computing similarities between features

    Selection of Multi-Relational Features

    A multi-relational feature is defined by:
      A join path, e.g., Student → Register → OpenCourse → Course
      An attribute, e.g., Course.area
      (For a numerical feature) an aggregation operator, e.g., sum or average
    Categorical feature f = [Student → Register → OpenCourse → Course, Course.area, null]

    Areas of courses of each student
    Tuple   DB    AI    TH
    t1      5     5     0
    t2      0     3     7
    t3      1     5     4
    t4      5     0     5
    t5      3     3     4

    Values of feature f
    Tuple   DB    AI    TH
    t1      0.5   0.5   0
    t2      0     0.3   0.7
    t3      0.1   0.5   0.4
    t4      0.5   0     0.5
    t5      0.3   0.3   0.4

    [Figure: bar chart of f(t1) to f(t5) over the areas DB, AI, and TH]

    Numerical feature, e.g., average grades of students
      h = [Student → Register, Register.grade, average]
      E.g., h(t1) = 3.5
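The two tables on this slide are related by a simple normalization: each student's course counts per area are divided by that student's total. The sketch below reproduces that step and the aggregation for a numerical feature; the function names are assumptions, and the individual grades for t1 are hypothetical values chosen only so that their average matches the slide's h(t1) = 3.5.

```python
def categorical_feature(counts):
    """Turn per-value counts into the proportions used as feature values."""
    total = sum(counts.values())
    if total == 0:
        return {value: 0.0 for value in counts}
    return {value: count / total for value, count in counts.items()}


def numerical_feature(values, aggregate="average"):
    """Aggregate the joined numerical values with sum or average."""
    if aggregate == "sum":
        return sum(values)
    return sum(values) / len(values)


# Course counts per area, taken from the slide.
course_areas = {
    "t1": {"DB": 5, "AI": 5, "TH": 0},
    "t2": {"DB": 0, "AI": 3, "TH": 7},
    "t3": {"DB": 1, "AI": 5, "TH": 4},
    "t4": {"DB": 5, "AI": 0, "TH": 5},
    "t5": {"DB": 3, "AI": 3, "TH": 4},
}
f = {t: categorical_feature(counts) for t, counts in course_areas.items()}

# Hypothetical grades for t1; only their average (3.5) comes from the slide.
h_t1 = numerical_feature([3.0, 4.0], aggregate="average")
```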

    Similarity Between Features

    Values of features f and g
             Feature f (course)    Feature g (group)
    Tuple    DB    AI    TH        Info sys   Cog sci   Theory
    t1       0.5   0.5   0         1          0         0
    t2       0     0.3   0.7       0          0         1
    t3       0.1   0.5   0.4       0          0.5       0.5
    t4       0.5   0     0.5       0.5        0         0.5
    t5       0.3   0.3   0.4       0.5        0.5       0

    [Figure: two surface plots over pairs of tuples (axes 1 to 5 and S1 to S5, values from 0 to 1), showing the vectors V^f and V^g]

    Similarity between two features: cosine similarity of two vectors

    $$ \mathrm{Sim}(f, g) = \frac{V^f \cdot V^g}{|V^f|\,|V^g|} $$
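A sketch of this computation on the slide's table follows. It assumes, following the CrossClus paper, that V^f holds one entry per pair of tuples and that each entry is the dot product of the two tuples' value distributions under f; the slide text itself does not spell out that definition, so it is an assumption here, as are the function names.

```python
from math import sqrt

# Feature values from the slide (f: course areas, g: research groups).
f = {"t1": [0.5, 0.5, 0.0], "t2": [0.0, 0.3, 0.7], "t3": [0.1, 0.5, 0.4],
     "t4": [0.5, 0.0, 0.5], "t5": [0.3, 0.3, 0.4]}
g = {"t1": [1.0, 0.0, 0.0], "t2": [0.0, 0.0, 1.0], "t3": [0.0, 0.5, 0.5],
     "t4": [0.5, 0.0, 0.5], "t5": [0.5, 0.5, 0.0]}


def tuple_similarity(feature, a, b):
    """Assumed definition: dot product of the two value distributions."""
    return sum(x * y for x, y in zip(feature[a], feature[b]))


def pair_vector(feature):
    """V^feature: similarity of every ordered pair of tuples, in fixed order."""
    tuples = sorted(feature)
    return [tuple_similarity(feature, a, b) for a in tuples for b in tuples]


def feature_similarity(f1, f2):
    """Cosine similarity between the two pair vectors."""
    v1, v2 = pair_vector(f1), pair_vector(f2)
    dot = sum(x * y for x, y in zip(v1, v2))
    norm1 = sqrt(sum(x * x for x in v1))
    norm2 = sqrt(sum(y * y for y in v2))
    return dot / (norm1 * norm2)


sim_fg = feature_similarity(f, g)
sim_ff = feature_similarity(f, f)
```

Two features score high when they tend to group the same pairs of tuples together, regardless of how their individual values are named.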

    Similarity between Categorical & Numerical Features

    f should be similar to the given ground truth

    GNetMine: Graph-Based Regularization

    Minimize the objective function

    $$ J(\mathbf{f}_1^{(k)}, \ldots, \mathbf{f}_m^{(k)}) = \sum_{i,j=1}^{m} \lambda_{ij} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} R_{ij,pq} \left( \frac{1}{\sqrt{D_{ij,pp}}} f_{ip}^{(k)} - \frac{1}{\sqrt{D_{ji,qq}}} f_{jq}^{(k)} \right)^2 + \sum_{i=1}^{m} \alpha_i \left( \mathbf{f}_i^{(k)} - \mathbf{y}_i^{(k)} \right)^T \left( \mathbf{f}_i^{(k)} - \mathbf{y}_i^{(k)} \right) $$

    Smoothness constraints (first term): objects linked together should share similar estimations of confidence belonging to class k
    Normalization term applied to each type of link separately: reduce the impact of popularity of nodes
    Second term: confidence estimation on labeled data and their pre-given labels should be similar
    User preference ($\lambda_{ij}$, $\alpha_i$): how much do you value this relationship / ground truth?

    Classification Accuracy: Labeling a Very Small Portion of Authors and Papers

    Comparison of classification accuracy on authors (%)
    (a%, p%)        nLB              wvRN             LLGC             GNetMine
                    A-A    A-C-P-T   A-A    A-C-P-T   A-A    A-C-P-T   A-C-P-T
    (0.1%, 0.1%)    25.4   26.0      40.8   34.1      41.4   61.3      82.9
    (0.2%, 0.2%)    28.3   26.0      46.0   41.2      44.7   62.2      83.4
    (0.3%, 0.3%)    28.4   27.4      48.6   42.5      48.8   65.7      86.7
    (0.4%, 0.4%)    30.7   26.7      46.3   45.6      48.7   66.0      87.2
    (0.5%, 0.5%)    29.8   27.3      49.0   51.4      50.6   68.9      87.5

    Comparison of classification accuracy on papers (%)
    (a%, p%)        nLB              wvRN             LLGC             GNetMine
                    P-P    A-C-P-T   P-P    A-C-P-T   P-P    A-C-P-T   A-C-P-T
    (0.1%, 0.1%)    49.8   31.5      62.0   42.0      67.2   62.7      79.2
    (0.2%, 0.2%)    73.1   40.3      71.7   49.7      72.8   65.5      83.5
    (0.3%, 0.3%)    77.9   35.4      77.9   54.3      76.8   66.6      83.2
    (0.4%, 0.4%)    79.1   38.6      78.1   54.4      77.9   70.5      83.7
    (0.5%, 0.5%)    80.7   39.3      77.9   53.5      79.0   73.5      84.1

    Comparison of classification accuracy on conferences (%)
    (a%, p%)        nLB       wvRN      LLGC      GNetMine
                    A-C-P-T   A-C-P-T   A-C-P-T   A-C-P-T
    (0.1%, 0.1%)    25.5      43.5      79.0      81.0
    (0.2%, 0.2%)    22.5      56.0      83.5      85.0
    (0.3%, 0.3%)    25.0      59.0      87.0      87.0
    (0.4%, 0.4%)    25.0      57.0      86.5      89.5
    (0.5%, 0.5%)    25.0      68.0      90.0      94.0

    Knowledge Propagation: List Objects with the Highest Confidence Measure Belonging to Each Class

    Top-5 terms related to each area
    No.   Database   Data Mining      Artificial Intelligence   Information Retrieval
    1     data       mining           learning                  retrieval
    2     database   data             knowledge                 information
    3     query      clustering       Reinforcement             web
    4     system     learning         reasoning                 search
    5     xml        classification   model                     document

    Top-5 authors concentrated in each area
    No.   Database              Data Mining          Artificial Intelligence   Information Retrieval
    1     Surajit Chaudhuri     Jiawei Han           Sridhar Mahadevan         W. Bruce Croft
    2     H. V. Jagadish        Philip S. Yu         Takeo Kanade              Iadh Ounis
    3     Michael J. Carey      Christos Faloutsos   Andrew W. Moore           Mark Sanderson
    4     Michael Stonebraker   Wei Wang             Satinder P. Singh         ChengXiang Zhai
    5     C. Mohan              Shusaku Tsumoto      Thomas S. Huang           Gerard Salton

    Top-5 conferences concentrated in each area
    No.   Database   Data Mining   Artificial Intelligence   Information Retrieval
    1     VLDB       KDD           IJCAI                     SIGIR
    2     SIGMOD     SDM           AAAI                      ECIR
    3     PODS       PAKDD         CVPR                      WWW
    4     ICDE       ICDM          ICML                      WSDM
    5     EDBT       PKDD          ECML                      CIKM

    Classification of Information Networks

    Classification of Heterogeneous Information Networks:
      Graph-regularization-Based Method (GNetMine)
      Multi-Relational-Mining-Based Method (CrossMine)
      Statistical Relational Learning Based Method (SRL)
    Classification of Homogeneous Information Networks

    Multi-Relation to Flat Relation Mining?

    Folding multiple relations into a single "flat" one for mining?
    Cannot be a solution due to problems:
      Loses information of linkages and relationships, no semantics preservation
      Cannot utilize information of database structures or schemas (e.g., E-R modeling)

    [Figure: relations Patient, Contact, and Doctor being flattened into one relation]

    One Approach: Inductive Logic Programming (ILP)

    Find a hypothesis that is consistent with background knowledge (training data)
      FOIL, Golem, Progol, TILDE, ...
    Background knowledge
      Relations (predicates), tuples (ground facts)
    Inductive Logic Programming (ILP)
      Hypothesis: the hypothesis is usually a set of rules, which can predict certain attributes in certain relations
      Daughter(X, Y) ← female(X), parent(Y, X)

    Training examples
      Daughter(mary, ann)  +
      Daughter(eve, tom)   +
      Daughter(tom, ann)   -
      Daughter(eve, ann)   -

    Background knowledge
      Parent(ann, mary)    Female(ann)
      Parent(ann, tom)     Female(mary)
      Parent(tom, eve)     Female(eve)
      Parent(tom, ian)

    Inductive Logic Programming Approach to Multi-Relation Classification

    ILP approaches to multi-relation classification
      Top-down approaches (e.g., FOIL)
        while (enough examples left)
          generate a rule
          remove examples satisfying this rule
      Bottom-up approaches (e.g., Golem)
        Use each example as a rule
        Generalize rules by merging rules
      Decision tree approaches (e.g., TILDE)
    ILP approach: pros and cons
      Advantages: expressive and powerful, and rules are understandable
      Disadvantages: inefficient for databases with complex schemas, and inappropriate for continuous attributes

    FOIL: First-Order Inductive Learner (Rule Generation)

    Find a set of rules consistent with training data
    A top-down, sequential covering learner
    Build each rule by heuristics
      Foil-gain: a special type of information gain

    [Figure: positive and negative examples among all examples, with the regions covered by Rule 1, Rule 2, and Rule 3; a rule is refined from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5]

    To generate a rule
      while (true)
        find the best predicate p
        if foil-gain(p) > threshold then
          add p to current rule
        else
          break

    Find the Best Predicate: Predicate Evaluation

    All predicates in a relation can be evaluated based on propagated IDs
    Use foil-gain to evaluate predicates
      Suppose the current rule is r. For a predicate p,

      $$ \text{foil-gain}(p) = P(r+p) \left[ -\log \frac{P(r)}{P(r) + N(r)} + \log \frac{P(r+p)}{P(r+p) + N(r+p)} \right] $$

    Categorical attributes
      Compute foil-gain directly
    Numerical attributes
      Discretize with every possible value
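A worked sketch of foil-gain follows, reading P(·) and N(·) as the numbers of positive and negative examples satisfying a rule. The base-2 logarithm is an assumption (the slide writes only "log"), and the counts in the example are illustrative: a rule covering 2 positive and 2 negative examples, refined by a predicate that keeps 2 positive and 1 negative.

```python
from math import log2


def info(pos, neg):
    """I(r) = -log2(P(r) / (P(r) + N(r))); requires pos > 0."""
    return -log2(pos / (pos + neg))


def foil_gain(pos_r, neg_r, pos_rp, neg_rp):
    """foil-gain(p) = P(r+p) * (I(r) - I(r+p))."""
    if pos_rp == 0:
        return 0.0  # a predicate covering no positive example gains nothing
    return pos_rp * (info(pos_r, neg_r) - info(pos_rp, neg_rp))


# Illustrative counts: rule r covers 2+/2-, rule r+p covers 2+/1-.
gain = foil_gain(pos_r=2, neg_r=2, pos_rp=2, neg_rp=1)

# A predicate that changes nothing yields zero gain.
no_gain = foil_gain(pos_r=2, neg_r=2, pos_rp=2, neg_rp=2)
```

The gain is positive only when the refined rule covers a purer set of examples, and it is scaled by how many positive examples survive the refinement.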

    CrossMine: An Effective Multi-relational Classifier

    Methodology
      Tuple-ID propagation: an efficient and flexible method for virtually joining relations
      Confine the rule search process in promising directions
      Look-one-ahead: a more powerful search strategy
      Negative tuple sampling: improve efficiency while maintaining accuracy

    Tuple ID Propagation

    Loan (rows labeled Applicant #1 to Applicant #4)
    Loan ID   Account ID   Amount   Duration   Decision
    1         124          1000     12         Yes
    2         124          4000     12         Yes
    3         108          10000    24         No
    4         45           12000    36         No

    Account
    Account ID   Frequency   Open date   Propagated ID   Labels
    124          monthly     02/27/93    1, 2            2+, 0-
    108          weekly      09/23/97    3               0+, 1-
    45           monthly     12/09/96    4               0+, 1-
    67           weekly      01/01/97    Null            0+, 0-

    Propagate tuple IDs of the target relation to non-target relations
    Virtually join relations to avoid the high cost of physical joins
    Possible predicates:
      Frequency = 'monthly': 2+, 1-
      Open date
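The propagation on this slide can be sketched as follows, using the Loan and Account rows shown above. Loan IDs are attached to the accounts they reference, so a predicate on Account can be scored without materializing the join. The function names and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# (loan_id, account_id, decision) from the slide's Loan table.
loans = [(1, 124, "Yes"), (2, 124, "Yes"), (3, 108, "No"), (4, 45, "No")]
# (account_id, frequency, open_date) from the slide's Account table.
accounts = [
    (124, "monthly", "02/27/93"),
    (108, "weekly", "09/23/97"),
    (45, "monthly", "12/09/96"),
    (67, "weekly", "01/01/97"),
]


def propagate_ids(loans):
    """Attach to each account the IDs of the target tuples that join to it."""
    ids = defaultdict(set)
    for loan_id, account_id, _ in loans:
        ids[account_id].add(loan_id)
    return ids


def evaluate_predicate(accounts, propagated, decisions, predicate):
    """Return (#positive, #negative) target tuples covered by the predicate."""
    covered = set()
    for account in accounts:
        if predicate(account):
            covered |= propagated.get(account[0], set())
    positives = sum(1 for loan_id in covered if decisions[loan_id] == "Yes")
    return positives, len(covered) - positives


propagated = propagate_ids(loans)
decisions = {loan_id: decision for loan_id, _, decision in loans}
monthly = evaluate_predicate(accounts, propagated, decisions,
                             lambda account: account[1] == "monthly")
weekly = evaluate_predicate(accounts, propagated, decisions,
                            lambda account: account[1] == "weekly")
```

The counts for `Frequency = 'monthly'` match the slide (2 positive, 1 negative), and these are exactly the inputs a foil-gain computation needs.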

    [Figure: tuple IDs propagated from the target relation to relations R1, R2, and R3]

    Efficient
      Only propagate the tuple IDs
      Time and space usage is low
    Flexible
      Can propagate IDs among non-target relations
      Many sets of IDs can be kept on one relation, which are propagated from different join paths

    Rule Generation: Example

    [Figure: database schema with callouts marking the target relation, the first predicate, and the second predicate]
      Account(accountid, districtid, frequency, date)
      Loan(loanid, accountid, date, amount, duration, payment)
      Order(orderid, accountid, bankto, accountto, amount, type)
      Card(cardid, dispid, type, issuedate)
      Disposition(dispid, accountid, clientid)
      Client(clientid, birthdate, gender, districtid)
      District(districtid, distname, region, #people, #lt500, #lt2000, #lt10000, #gt10000, #city, ratiourban, avgsalary, unemploy95, unemploy96, denenter, #crime95, #crime96)
      Transaction(transid, accountid, date, type, operation, amount, balance, symbol)

    Rule generation
      Start at the target relation
      Repeat
        Search in all active relations
        Search in all relations joinable to active relations
        Add the best predicate to the current rule
        Set the involved relation to active
      Until
        The best predicate does not have enough gain
        The current rule is too long

    Look-one-ahead in Rule Generation

    Two types of relations: entity and relationship
    Often cannot find useful predicates on relations of relationship
    Solution of CrossMine:
      When propagating IDs to a relation of relationship, propagate one more step to the next relation of entity

    [Figure: target relation joined through a relationship relation that offers no good predicate]

    Negative Tuple Sampling

    Each time a rule is generated, covered positive examples are removed
    After generating many rules, there are far fewer positive examples than negative ones
      Cannot build good rules (low support)
      Still time consuming (large number of negative examples)
    Solution: sampling on negative examples
      Improve efficiency without affecting rule quality

    [Figure: scatter of positive and negative examples before and after sampling]

    Real Dataset

    PKDD Cup 99 dataset: Loan Application
                 Accuracy   Time (per fold)
    FOIL         74.0%      3338 sec
    TILDE        81.3%      2429 sec
    CrossMine    90.7%      15.3 sec

    Mutagenesis dataset (4 relations)
                 Accuracy   Time (per fold)
    FOIL         79.7%      1.65 sec
    TILDE        89.4%      25.6 sec
    CrossMine    87.7%      0.83 sec

    Only 4 relations, so TILDE does a good job, though slow


    Probabilistic Relational Models in Statistical Relational Learning

    Goal: model the distribution of data in relational databases
      Treat both entities and relations as classes
      Intuition: objects are no longer independent of each other
      Build statistical networks according to the dependency relationships between attributes of different classes
    A Probabilistic Relational Model (PRM) consists of
      Relational schema (from databases)
      Dependency structure (between attributes)
      Local probability model (conditional probability distribution)
    Three major methods of probabilistic relational models
      Relational Bayesian Networks (RBN, Lise Getoor et al.)
      Relational Markov Networks (RMN, Ben Taskar et al.)
      Relational Dependency Networks (RDN, Jennifer Neville et al.)

    Relational Bayesian Networks (RBN)

    Extend the Bayesian network to consider entities, properties and relationships in a DB scenario
    Three different uncertainties
      Attribute uncertainty
        Model the conditional probability for an attribute given its parent variables
      Structural uncertainty
        Model the conditional probability for a reference or link existence given its parent variables
      Class uncertainty
        Refine the conditional probability by considering sub-classes or a hierarchy of classes

    Classification of Information Networks

    Classification of Heterogeneous Information Networks:
        Graph-regularization-Based Method (GNetMine)
        Multi-Relational-Mining-Based Method (CrossMine)
        Statistical Relational Learning Based Method (SRL)
    Classification of Homogeneous Information Networks

    Transductive Learning in the Graph

    Problem: for a set of nodes in the graph, class labels are given for only part of the nodes; the task is to learn the labels of the unlabeled nodes
    Methods
        Label propagation algorithm [Zhu et al. 2002, Zhou et al. 2004, Szummer et al. 2001]
            Iteratively propagate each node's labels to its neighbors, according to the transition probability defined by the network
        Graph-regularization-based algorithm [Zhou et al. 2004]
            Intuition: trade-off between (1) consistency with the labeled data and (2) consistency between linked objects
            A quadratic optimization problem
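The propagation idea can be illustrated with a minimal sketch. It assumes the commonly used iteration F <- alpha * S * F + (1 - alpha) * Y with the symmetrically normalized affinity matrix S = D^(-1/2) W D^(-1/2); a row-normalized transition matrix would be another option. The function name, the value of `alpha`, and the toy graph are illustrative assumptions, not taken from the slides.

```python
import math

def propagate_labels(weights, labels, num_classes, alpha=0.9, iterations=200):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y and return one class per node.

    weights: symmetric adjacency matrix as a list of lists.
    labels:  class index for labeled nodes, None for unlabeled nodes.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Symmetrically normalized affinity matrix S = D^-1/2 W D^-1/2.
    s = [[weights[i][j] / math.sqrt(degree[i] * degree[j])
          if weights[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    # Y holds the known labels as one-hot rows; unlabeled rows are all zero.
    y = [[1.0 if labels[i] == c else 0.0 for c in range(num_classes)]
         for i in range(n)]
    f = [row[:] for row in y]
    for _ in range(iterations):
        # alpha trades off smoothness over links against fitting the given labels.
        f = [[alpha * sum(s[i][k] * f[k][c] for k in range(n))
              + (1 - alpha) * y[i][c]
              for c in range(num_classes)] for i in range(n)]
    # Each node takes the class with the largest score.
    return [max(range(num_classes), key=lambda c: f[i][c]) for i in range(n)]

# Hypothetical toy graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
weights = [[0, 1, 1, 0, 0, 0],
           [1, 0, 1, 0, 0, 0],
           [1, 1, 0, 1, 0, 0],
           [0, 0, 1, 0, 1, 1],
           [0, 0, 0, 1, 0, 1],
           [0, 0, 0, 1, 1, 0]]
labels = [0, None, None, None, None, 1]  # only nodes 0 and 5 are labeled
predicted = propagate_labels(weights, labels, num_classes=2)
```

In this toy graph each unlabeled node ends up with the label of the seed node in its own triangle, which reflects the second consistency criterion (linked objects tend to share labels).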

    Outline

    Motivation: Why Mining Heterogeneous Information Networks?
    Part I: Clustering, Ranking and Classification
        Clustering and Ranking in Information Networks
        Classification of Information Networks
    Part II: Data Quality and Search in Information Networks
        Data Cleaning and Data Validation by InfoNet Analysis
        Similarity Search in Information Networks
    Part III: Advanced Topics on Information Network Analysis
        Role Discovery and OLAP in Information Networks
        Mining Evolution and Dynamics of Information Networks
    Conclusions

    Data Cleaning by Link Analysis

    Object reconciliation vs. object distinction as data cleaning tasks
    Link analysis may take advantage of redundancy and facilitate entity cross-checking and validation
    Object distinction: different people/objects do share names
        In AllMusic.com, 72 songs and 3 albums are named "Forgotten" or "The Forgotten"
        In DBLP, 141 papers are written by at least 14 different authors named "Wei Wang"
    New challenges of object distinction:
        Textual similarity cannot be used
    Distinct: object distinction by information network analysis
        X. Yin, J. Han, and P. S. Yu, "Object Distinction: Distinguishing Objects with Identical Names by Link Analysis", ICDE'07

    Training with the Same Data Set

    Build a training set automatically
        Select distinct names, e.g., Johannes Gehrke
        Collaboration behavior within the same community shares some similarity
        Train parameters using a typical and large set of unambiguous examples
    Use SVM to learn a model for combining different join paths
        Each join path is used as two attributes (with link-based similarity and neighborhood similarity)
        The model is a weighted sum of all attributes

    Real Cases: DBLP Popular Names

    Name                 Num_authors   Num_refs   accuracy   precision   recall   f-measure
    Hui Fang             3             9          1.0        1.0         1.0      1.0
    Ajay Gupta           4             16         1.0        1.0         1.0      1.0
    Joseph Hellerstein   2             151        0.81       1.0         0.81     0.895
    Rakesh Kumar         2             36         1.0        1.0         1.0      1.0
    Michael Wagner       5             29         0.395      1.0         0.395    0.566
    Bing Liu             6             89         0.825      1.0         0.825    0.904
    Jim Smith            3             19         0.829      0.888       0.926    0.906
    Lei Wang             13            55         0.863      0.92        0.932    0.926
    Wei Wang             14            141        0.716      0.855       0.814    0.834
    Bin Yu               5             44         0.658      1.0         0.658    0.794
    Average                                       0.81       0.966       0.836    0.883
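The f-measure column appears consistent with the harmonic mean of precision and recall. The short check below recomputes it for three rows of the table; the function name is an illustrative assumption, and the inputs are the rounded values shown on the slide.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the table above.
rows = {
    "Joseph Hellerstein": (1.0, 0.81),
    "Michael Wagner": (1.0, 0.395),
    "Wei Wang": (0.855, 0.814),
}
recomputed = {name: round(f_measure(p, r), 3) for name, (p, r) in rows.items()}
for name, value in recomputed.items():
    print(name, value)
```

With precision at 1.0, as for Joseph Hellerstein and Michael Wagner, the f-measure is limited only by recall: references are never merged wrongly, but some references of the same person are left in separate groups.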

    Distinguishing Different "Wei Wang"s

    [Figure: DBLP references to "Wei Wang" grouped by affiliation, with the number of references in each group]
    UNC-CH (57)
    Fudan U, China (31)
    UNSW, Australia (19)
    SUNY Buffalo (5)
    Beijing Polytech (3)
    NU Singapore (5)
    Harbin U, China (5)
    Zhejiang U, China (3)
    Nanjing Normal, China (3)
    Ningbo Tech, China (2)
    Purdue (2)
    Beijing U Com, China (2)
    Chongqing U, China (2)
    SUNY Binghamton (2)

    Truth Validation by Info. Network Analysis

    The trustworthiness problem of the web (according to a survey):
        54% of Internet users trust news web sites most of the time
        26% for web sites that sell products
        12% for blogs
    TruthFinder: truth discovery on the Web by link analysis
        Among multiple conflicting results, can we automatically identify which one is likely the true fact?
    Veracity (conformity to truth):
        Given a large amount of conflicting information about many objects, provided by multiple web sites (or other information providers), how to discover the true fact about each object?
    Our work: Xiaoxin Yin, Jiawei Han, Philip S. Yu, "Truth Discovery with Multiple Conflicting Information Providers on the Web", TKDE'08

    Overview of the TruthFinder Method

    Confidence of facts <-> Trustworthiness of web sites
        A fact has high confidence if it is provided by (many) trustworthy web sites
        A web site is trustworthy if it provides many facts with high confidence
    The TruthFinder mechanism, an overview:
        Initially, each web site is equally trustworthy
        Based on the above four heuristics, infer fact confidence from web site trustworthiness, and then vice versa
        Repeat until reaching a stable state
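The alternating computation described above can be sketched as follows. This is a simplified illustration, not the published algorithm: the log-based trust score, the rival penalty weight `rho`, the dampening factor `gamma`, the initial trust of 0.9, and the toy claims are all assumptions made for the example.

```python
import math

def truth_finder_sketch(claims, rho=0.5, gamma=0.3, initial_trust=0.9,
                        max_rounds=100, tol=1e-6):
    """Alternate between fact confidence and web site trustworthiness.

    claims maps each web site to a non-empty dict {object: claimed value}.
    rho and gamma are illustrative assumptions, not values from the slides.
    """
    trust = {site: initial_trust for site in claims}
    facts = sorted({(obj, val) for stated in claims.values()
                    for obj, val in stated.items()})
    confidence = {}
    for _ in range(max_rounds):
        # tau = -ln(1 - t): adding tau over providers multiplies their error
        # probabilities, so trust is not combined by linear summation.
        score = {fact: 0.0 for fact in facts}
        for site, stated in claims.items():
            tau = -math.log(1.0 - min(trust[site], 1.0 - 1e-12))
            for obj, val in stated.items():
                score[(obj, val)] += tau
        for fact in facts:
            # Conflicting facts about the same object lower each other's score.
            rivals = sum(score[other] for other in facts
                         if other[0] == fact[0] and other != fact)
            adjusted = score[fact] - rho * rivals
            # Logistic squashing maps the adjusted score to a value in (0, 1).
            confidence[fact] = 1.0 / (1.0 + math.exp(-gamma * adjusted))
        # A site's trustworthiness is the average confidence of its facts.
        new_trust = {site: sum(confidence[(obj, val)]
                               for obj, val in stated.items()) / len(stated)
                     for site, stated in claims.items()}
        change = max(abs(new_trust[site] - trust[site]) for site in claims)
        trust = new_trust
        if change < tol:
            break
    return trust, confidence

# Hypothetical claims: three sites state the author of two books.
claims = {
    "site_a": {"book1": "Smith", "book2": "Lee"},
    "site_b": {"book1": "Smith", "book2": "Lee"},
    "site_c": {"book1": "Jones", "book2": "Lee"},
}
trust, confidence = truth_finder_sketch(claims)
```

In this toy input the claim backed by two sites ends with higher confidence than the conflicting claim backed by one, and the site making the minority claim ends with lower trustworthiness than the other two.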

    Analogy to Authority-Hub Analysis

    Facts <-> Authorities, Web sites <-> Hubs
    Difference from authority-hub analysis
        Linear summation cannot be used
        A web site is trustable if it provides accurate facts, instead of many facts
        Confidence is the probability of being true
        Different facts of the same object influence each other
    [Figure: web site w1 (hub, high trustworthiness) linked to fact f1 (authority, high confidence)]

    Experiments: Finding Truth of Facts

    Determining authors of books
        Dataset contains 1265 books listed on abebooks.com
        We analyze 100 random books (using book images)

    Case                       Voting   TruthFinder   Barnes & Noble
    Correct                    71       85            64
    Miss author(s)             12       2             4
    Incomplete names           18       5             6
    Wrong first/middle names   1        1             3
    Has redundant names        0        2             23
    Add incorrect names        1        5             5
    No information             0        0             2

    Outline

    Motivation: Why Mining Heterogeneous Information Networks?
    Part I: Clustering, Ranking and Classification
        Clustering and Ranking in Information Networks
        Classification of Information Networks
    Part II: Data Quality and Search in Information Networks
        Data Cleaning and Data Validation by InfoNet Analysis
        Similarity Search in Information Networks
    Part III: Advanced Topics on Information Network Analysis
        Role Discovery and OLAP in Information Networks
        Mining Evolution and Dynamics of Information Networks
    Conclusions

    Semantics-Based Similarity Search in InfoNet

    Search top-k similar objects of the same type for a query
        Find researchers most similar to Christos Faloutsos
    Two critical concepts to define a similarity
        Feature space
            Traditional data: attributes denoted as numerical value/vector, set, etc.
            Network data: a relation sequence called path schema
            Existing homogeneous-network-based similarity does not deal with this problem
        Measure defined on the feature space
            Cosine, Euclidean distance, Jaccard coefficient, etc.
            PathSim

    Path Schema for DBLP Queries

    Path schema: a path of the InfoNet schema, e.g., APC, APA
    Who are most similar to Christos Faloutsos?

    Flickr: Which Pictures Are Most Similar?

    Some path schemas lead to similarity closer to human intuition
    But some others do not

    Not All Similarity Measures Are Good

    [Figure annotations: "Favor highly visible objects"; "Not reasonable"]

    PathSim: Definition & Properties

    Commuting matrix M_P corresponding to a path schema P
        Product of the adjacency matrices of the relations in the path schema
        The element M_P(i,j) denotes the strength between object i and object j under the semantics of path schema P
        If the adjacency matrices are unweighted, M_P(i,j) is the number of path instances between i and j following path schema P
    PathSim
        s(i,j) = 2 M_P(i,j) / (M_P(i,i) + M_P(j,j))
    Properties:
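The definition can be tried on a toy network. The sketch below builds commuting matrices as products of adjacency matrices and applies the PathSim formula; two properties that follow directly from the formula for a symmetric commuting matrix, s(i,j) = s(j,i) and s(i,i) = 1, can then be checked numerically. The helper names and the toy data are illustrative assumptions, and the round-trip schema APCPA is used here as one possible extension of the APC schema.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(column) for column in zip(*m)]

def pathsim(commuting):
    """s(i,j) = 2 * M(i,j) / (M(i,i) + M(j,j)) for a square commuting matrix."""
    n = len(commuting)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            denominator = commuting[i][i] + commuting[j][j]
            if denominator:
                scores[i][j] = 2 * commuting[i][j] / denominator
    return scores

# Hypothetical toy data: 3 authors x 4 papers, and 4 papers x 2 venues.
author_paper = [[1, 1, 0, 0],
                [0, 1, 1, 0],
                [0, 0, 0, 1]]
paper_venue = [[1, 0], [1, 0], [1, 0], [0, 1]]

# APA: author-paper-author (co-authorship).
m_apa = matmul(author_paper, transpose(author_paper))
# APCPA: authors publishing in the same venues (APC followed by its reverse).
author_venue = matmul(author_paper, paper_venue)
m_apcpa = matmul(author_venue, transpose(author_venue))

s_apa = pathsim(m_apa)
s_apcpa = pathsim(m_apcpa)
```

Authors 0 and 1 share one of their two papers each, so their APA similarity is 0.5; under APCPA they publish only in the same venue and with the same counts, so their similarity is 1.0. This shows how the choice of path schema changes what "similar" means.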

    Co-Clustering-Based Pruning Algorithm

    Store commuting matrices for short path schemas & compute top-k queries online
    Framework
        Generate co-clusters for materialized commuting matrices, for feature objects and target objects
        Derive upper bounds for the similarity between object and target cluster, and between object and object: safely prune target clusters and objects if the upper-bound similarity is lower than the current threshold
        Dynamically update the top-k threshold
    Performance: baseline vs. pruning

    Role Discovery: Extracting Semantic Information from Links

    Objective: extract semantic meaning from plain links to finely model and better organize information networks
    Challenges
        Latent semantic knowledge
        Interdependency
        Scalability
    Opportunity
        Human intuition
        Realistic constraint
        Cross-check with collective intelligence
    Methodology: propagate simple intuitive rules and constraints over the whole network

    Discovery of Advisor-Advisee Relationships in DBLP Network

    Input: DBLP research publication network
    Output: potential advising relationships and their ranking (r, [st, ed])
    Ref.: C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", SIGKDD 2010
    [Figure: input temporal collaboration network among Ada, Bob, Jerry, Ying and Smith, with edges labeled by years 1999-2004; output relationship analysis in which each candidate advising relation carries a score and period (r, [st, ed]), e.g., (0.8, [1999, 2000]), (0.7, [2000, 2001]), (0.65, [2002, 2004]); visualized chronological hierarchies]

    Overall Framework

    [Figure: overall framework; legend follows]
    a_i: author i
    p_j: paper j
    py: paper year
    pn: paper #
    st_{i,y_i}: starting time
    ed_{i,y_i}: ending time
    r_{i,y_i}: ranking score

    Experiment Results

    DBLP data: 654,628 authors, 1,076,946 publications, years provided
    Labeled data: Math Genealogy Project; AI Genealogy Project; Homepage

    Datasets   RULE     SVM      IndMAX                  TPFG
                                 empirical   optimized   empirical   optimized
    TEST1      69.9%    73.4%    75.2%       78.9%       80.2%       84.4%
    TEST2      69.8%    74.6%    74.6%       79.0%       81.5%       84.3%
    TEST3      80.6%    86.7%    83.1%       90.9%       88.8%       91.3%

    RULE: heuristics; SVM: supervised learning; empirical / optimized: empirical parameter vs. optimized parameter

    Graph/Network Summarization: Graph Compression

    Extract common subgraphs and simplify graphs by condensing these subgraphs into nodes

    OLAP on Information Networks

    Why OLAP information networks?
    Advantages of OLAP: interactive exploration of multi-dimensional and multi-level space in a data cube
        Multi-dimensional: different perspectives
        Multi-level: different granularities
    InfoNet OLAP: roll-up/drill-down and slice/dice on information network data
        Traditional OLAP cannot handle this, because it ignores links among data objects
    Handling two kinds of InfoNet OLAP
        Informational OLAP
        Topological OLAP

    Informational OLAP

    In the DBLP network, study the collaboration patterns among researchers
    Dimensions come from informational attributes attached at the whole snapshot level, so-called Info-Dims
    I-OLAP Characteristics:
        Overlay multiple pieces of information
        No change on the objects whose interactions are being examined
            In the underlying snapshots, each node is a researcher
            In the summarized view, each node is still a researcher
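A minimal sketch of an informational roll-up, under the assumption that each snapshot is a weighted co-author edge list tagged with (venue, year) and that overlaying snapshots means summing edge weights. The dimension names, the data, and the choice of summation are illustrative and not taken from the slides.

```python
from collections import Counter

def roll_up(snapshots, group_key):
    """Overlay all snapshots that map to the same group by summing edge weights."""
    grouped = {}
    for dims, edges in snapshots.items():
        grouped.setdefault(group_key(dims), Counter()).update(edges)
    return {key: dict(edges) for key, edges in grouped.items()}

# Hypothetical snapshots keyed by the Info-Dims (venue, year).
# Nodes are researchers; edge weights count co-authored papers.
snapshots = {
    ("SIGMOD", 2004): {("ann", "bob"): 2},
    ("SIGMOD", 2005): {("ann", "bob"): 1, ("bob", "cat"): 1},
    ("VLDB", 2004): {("ann", "cat"): 3},
}

# Roll up the year dimension: one overlaid network per venue.
by_venue = roll_up(snapshots, lambda dims: dims[0])
# Roll up both dimensions: a single overlaid network.
overall = roll_up(snapshots, lambda dims: "all")
```

The nodes are researchers before and after the roll-up; only the edge weights are aggregated, which matches the characteristic that the objects themselves do not change.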

    Topological OLAP

    Dimensions come from the node/edge attributes inside individual networks, so-called Topo-Dims
    T-OLAP Characteristics
        Zoom in/zoom out
        Network topology changed: generalized nodes and generalized edges
            In the underlying network, each node is a researcher
            In the summarized view, each node becomes an institute that comprises multiple researchers
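A minimal sketch of a topological roll-up from researchers to institutes. It assumes an undirected weighted edge list and a researcher-to-institute mapping; summing weights between institutes and dropping edges inside one institute are design choices made for the example, and all names are hypothetical.

```python
from collections import Counter

def topological_roll_up(edges, node_group):
    """Merge nodes into generalized nodes and sum the edges between them.

    edges:      {(u, v): weight} for an undirected network.
    node_group: maps each node to its generalized node (e.g., institute).
    Edges whose endpoints fall in the same group are dropped.
    """
    generalized = Counter()
    for (u, v), weight in edges.items():
        gu, gv = node_group[u], node_group[v]
        if gu != gv:
            # Sort the pair so that (a, b) and (b, a) share one undirected edge.
            generalized[tuple(sorted((gu, gv)))] += weight
    return dict(generalized)

# Hypothetical co-author network and affiliations.
coauthor_edges = {
    ("ann", "bob"): 2,
    ("ann", "cat"): 1,
    ("bob", "cat"): 3,
    ("bob", "dan"): 1,
}
institute = {"ann": "UIUC", "bob": "UIUC", "cat": "UCSB", "dan": "UIC"}

institute_edges = topological_roll_up(coauthor_edges, institute)
```

Unlike the informational roll-up, the node set itself changes here: four researcher nodes become three institute nodes, and the two researcher-level edges between UIUC and UCSB merge into one generalized edge.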

    Outline

    Motivation: Why Mining Heterogeneous Information Networks?
    Part I: Clustering, Ranking and Classification
        Clustering and Ranking in Information Networks
        Classification of Information Networks
    Part II: Data Quality and Search in Information Networks
        Data Cleaning and Data Validation by InfoNet Analysis
        Similarity Search in Information Networks
    Part III: Advanced Topics on Information Network Analysis
        Role Discovery and OLAP in Information Networks
        Mining Evolution and Dynamics of Information Networks
    Conclusions

    Mining Evolution and Dynamics of InfoNet

    Many networks carry time information
        E.g., according to paper publication year, DBLP networks can form network sequences
    Motivation: model the evolution of communities in a heterogeneous network
        Automatically detect the best number of communities in each timestamp
        Model the smoothness between communities of adjacent timestamps
        Model the evolution structure explicitly
            Birth, death, split

    Accuracy Study

    The more types of objects used, the better the accuracy
    Historical prior results in better accuracy

    Outline

    Motivation: Why Mining Heterogeneous Information Networks?
    Part I: Clustering, Ranking and Classification
        Clustering and Ranking in Information Networks
        Classification of Information Networks
    Part II: Data Quality and Search in Information Networks
        Data Cleaning and Data Validation by InfoNet Analysis
        Similarity Search in Information Networks
    Part III: Advanced Topics on Information Network Analysis

    RoleDiscovery

    and