joint enhancement of topic modeling and information network mining

March 24, 2011 Joint Enhancement of Topic Modeling and Information Network Mining Mid-Year PI Report focusing on I3.2 Heng Ji City University of New York NSCTA/INARC


Page 1: Joint Enhancement of Topic Modeling and Information Network Mining

March 24, 2011

Joint Enhancement of Topic Modeling and Information Network Mining

Mid-Year PI Report focusing on I3.2

Heng Ji

City University of New York

NSCTA/INARC

Page 2: Joint Enhancement of Topic Modeling and Information Network Mining

INARC Project Major Contributions

I3.2-Subtask 1:
– Disambiguate objects with rich semantic structures extracted from interconnected texts (ACL2011)
– A new Collaborative Network Ranking Theory for Coreference Resolution (EMNLP2011-sub)
– Markov Logic Networks and Learning-to-Rank to Enhance Open Domain Role Discovery (TAC2010, LNCS, SIGIR2011-sub, EMNLP2011-sub)
– 16.4% improvement over state-of-the-art entity linking and 13%-22% improvement over link discovery

I3.2-Subtask 2 (with H. Deng (UIUC) and J. Han (UIUC); Focus of this Talk):
– Novel topic modeling: Multi-typed objects are treated differently along with their inherent textual information and the rich semantics of the heterogeneous information network (KDD2011-sub, IEEE Journal invited-sub)
– Exploit the power of extended topic modeling for event network partitioning and refinement through active learning and topic cluster driven inferences (ACL2011-sub, IEEE Journal invited-sub)
– Model the dynamics of information networks through a new temporal event network representation theory, evaluation metric and corresponding kernel methods (ACL2011-sub, EMNLP2011-sub)

I3.2-Subtask 3 (with H. Deng (UIUC) and J. Han (UIUC)):
– Self-Boosting Terrorism Network Search and Browsing (Springer Book Chapter, SIGIR2011-sub)

*I3.1: Uncovering Hierarchical Relationships among Linked Objects (with C. Wang (UIUC) and J. Han (UIUC), KDD'11 sub, presented by J. Han)

CUNY Students and Post-docs: Q. Li, X. Li, W. Lin, Z. Chen, S. Tamang, S. Anzaroot, J. Artiles

Page 3: Joint Enhancement of Topic Modeling and Information Network Mining

Mining and Modeling Interconnected Information Networks

Text-rich heterogeneous information network
– Textual documents (news, blogs, Twitter, papers, reports) are getting richer
– Approximately 80% of all data in information networks is held in an unstructured format
– Thousands of "attack" events and hundreds of "arrest" events can be mined from one week's unstructured textual data

Identify topics and events from documents using topic models
– Interconnect with users and other objects

How do topics propagate from documents to objects?

Page 4: Joint Enhancement of Topic Modeling and Information Network Mining

A Starting Point: ‘Isolated’ Information Network

Residence: Tahrir (Feb 18th, 2001-present)

Website: We are all Khaled Said

Link Types in Open-domain Information Network (by node pair in the InfoNet, over PER, GPE and ORG nodes)

Static
– Spouse, Parents, Children, Siblings
– Member
– Birth-Place, Death-Place, Nationality, Origin
– Subsidiaries, Parents
– Location, Headquarter, Political-Affiliation
– Located-Country, Capital

Dynamic
– Contact-Meet, Contact-Phone_Write, Justice, Sport
– Leader, Schools-Attended, Employee, Founder, Shareholder, Justice
– Resides-Place, Leader, Conflict-Attack, Conflict-Demonstrate, Justice, Movement-Transport, Injure
– Business-Merge, Sport, Transaction
– Conflict-Attack

Page 5: Joint Enhancement of Topic Modeling and Information Network Mining

Fundamental Theory: InforNet construction and knowledge discovery capability can be mutually enhanced by network analysis on text and interconnected data

Q1: How to discover latent topics and identify clusters of multi-typed objects simultaneously?
A1: Probabilistic Topic Modeling with Biased Propagation to take advantage of inter-connectivity in InforNets

Q2: How can text data and heterogeneous InforNet mutually enhance each other in topic modeling and other text mining tasks?
A2: Incorporate topic clusters to partition and refine InforNets, yielding a new representation, evaluation metric and modeling theory

Topic model / Biased propagation

Joint Enhancement of Topic Modeling and Heterogeneous Information Network Mining

Page 6: Joint Enhancement of Topic Modeling and Information Network Mining

Preliminaries

 


– Maximize the log likelihood of a collection of docs
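The objective itself did not survive in the transcript. Assuming the preliminaries refer to PLSA (the baseline the experiments later compare against), the log likelihood of a document collection usually takes the form below. The symbols are standard PLSA notation, not necessarily the slide's own: n(d, w) is the count of word w in document d, V is the vocabulary, and z ranges over K latent topics.

```latex
\mathcal{L} \;=\; \sum_{d \in D} \sum_{w \in V} n(d, w)\, \log \sum_{z=1}^{K} P(w \mid z)\, P(z \mid d)
```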

Page 7: Joint Enhancement of Topic Modeling and Information Network Mining

Probabilistic Topic Models with Biased Propagation

Intuition: InforNet provides valuable information
– Different objects have their own inherent information (e.g., D with rich text and U without explicit text)
– Treat documents with rich text and other objects without explicit text differently
– Topic(D) ← inherent text + connected U
– Topic(U) ← connected D

Basic Idea (Biased Topic Propagation): Propagate the topic probabilities obtained by topic models from documents to other objects through the heterogeneous InforNet
– A simple, unbiased topic propagation does not make much sense

Page 8: Joint Enhancement of Topic Modeling and Information Network Mining

Biased Random Walk

Basic criterion
– The topic of an object without explicit text depends on the topics of the documents it connects to. E.g., the research topic of an author could be characterized by his/her published papers
– The topic of a document is correlated with its objects to some extent, but should be principally determined by the inherent content of its text

The topic distribution of an object is determined by the average topic distribution of its connected documents

Equation terms: inherent topic distributions of docs; propagated topic distribution

ξ: controls the balance between the inherent topic distribution and the propagated topic distribution
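The update equations are missing from the transcript, so the following is only a minimal sketch of the idea described on this slide, not the authors' exact formulation. It assumes a simple linear mix: an object takes the average distribution of its connected documents, and a document keeps weight `xi` on its inherent distribution and puts `1 - xi` on the average of its connected objects. The function names, the dictionary layout and the default `xi=0.7` are illustrative assumptions.

```python
def average(rows, k):
    """Element-wise mean of topic vectors; all zeros if rows is empty."""
    if not rows:
        return [0.0] * k
    return [sum(r[z] for r in rows) / len(rows) for z in range(k)]


def biased_propagation(inherent, links, xi=0.7, n_iter=20):
    """Sketch of biased topic propagation (assumed linear form).

    inherent: {doc: [P(z|doc), ...]} obtained from a topic model.
    links:    {doc: [object, ...]} document-object edges.
    xi:       weight kept on the inherent distribution (assumed default).
    """
    k = len(next(iter(inherent.values())))
    # Invert the edges so each object knows its connected documents.
    docs_of = {}
    for doc, objs in links.items():
        for obj in objs:
            docs_of.setdefault(obj, []).append(doc)

    def object_topics(doc_topics):
        # An object takes the average distribution of its documents.
        return {o: average([doc_topics[d] for d in ds], k)
                for o, ds in docs_of.items()}

    doc_topics = {d: list(v) for d, v in inherent.items()}
    for _ in range(n_iter):
        obj_topics = object_topics(doc_topics)
        updated = {}
        for d, own in inherent.items():
            objs = links.get(d, [])
            if not objs:
                updated[d] = list(own)  # isolated document: text only
                continue
            prop = average([obj_topics[o] for o in objs], k)
            # Biased mix: the document's own text dominates when xi > 0.5.
            updated[d] = [xi * own[z] + (1 - xi) * prop[z] for z in range(k)]
        doc_topics = updated
    return doc_topics, object_topics(doc_topics)
```

With `xi = 1.0` the documents keep their text-only distributions and objects simply average them; lowering `xi` lets network structure pull connected documents toward each other.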

Page 9: Joint Enhancement of Topic Modeling and Information Network Mining

Biased Regularization: Put All Together

 


 

 


Page 11: Joint Enhancement of Topic Modeling and Information Network Mining

TMBP for InforNet Partitioning

[Figure: documents (Doc 1 ... Doc N), topic word clouds, event type schemas and an example event network.]

Event types shown in the figure:
– "Contact": triggers talk, meet, etc.; arguments "Entity", "Instrument", "Place", "Time-Within"
– "Business": triggers form, dissolve; arguments "Org", "Place", "Time-Within", "Agent"
– "Attack": triggers blew, attack; arguments "Attacker", "Target", "Place", "Time-Within"
– "Transaction": triggers borrow, launch; arguments "Giver", "Recipient", "Money", "Seller", "Artifact", "Buyer"
– "Justice": triggers arrest, jail; arguments "Defendant", "Time-Within", "Adjudicator", "Place"

Example network fragment: the leaders of Germany and France, US President George W. Bush, former Chinese president Jiang Zemin and Russian President Vladimir Putin, with Saint Petersburg, Evian (France) and March 2004; link and event labels include Contact/Meet, PHYS/Located, PER-SOC/Business, Personnel/Elect and Movement/Transport.

Top words per topic cluster:
– Cluster 1: Palestinian, Israel, police, Israeli, people, bank, Monday, killed, west, security, attack
– Cluster 2: Iraq, war, United, States, Bush, Nations, Iraqi, minister, council, resolution, country
– Cluster 3: north, nuclear, Korea, weapons, Korean, talks, officials, Washington, Putin, south, China
– Cluster 4: court, dollars, year, appeal, million, years, government, convicted, billion, sentence, AFP

Page 12: Joint Enhancement of Topic Modeling and Information Network Mining

TMBP for InforNet Refinement

– Across a heterogeneous information network, a particular object can sometimes be an event trigger and sometimes not, and can represent different event types

– Within a cluster of topically-related documents, the distribution is much more convergent

– E.g., in the overall information network only 7% of the occurrences of “fire” indicate “End-Position” events, while all occurrences of “fire” within one topic cluster are “End-Position” events

– Topic modeling can enhance information network construction by grouping similar objects, event types and roles together
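The “fire” observation can be illustrated by comparing a trigger's event-type distribution over the whole network with its distribution inside each topic cluster. The sketch below uses hypothetical toy mentions (cluster names, counts and labels are invented for illustration and are not the slide's data); `None` marks an occurrence that is not an event trigger.

```python
from collections import Counter, defaultdict


def sense_distribution(mentions, trigger):
    """Return {event_type: fraction} for one trigger word.

    mentions: iterable of (word, event_type_or_None) pairs.
    """
    counts = Counter(etype for word, etype in mentions if word == trigger)
    total = sum(counts.values())
    return {etype: n / total for etype, n in counts.items()} if total else {}


def per_cluster(mentions_with_cluster, trigger):
    """Group mentions by topic cluster, then compute each distribution."""
    grouped = defaultdict(list)
    for cluster, word, etype in mentions_with_cluster:
        grouped[cluster].append((word, etype))
    return {c: sense_distribution(m, trigger) for c, m in grouped.items()}


# Hypothetical toy mentions: (topic cluster, trigger word, event type or None).
toy = [
    ("layoffs", "fire", "End-Position"),
    ("layoffs", "fire", "End-Position"),
    ("conflict", "fire", "Attack"),
    ("conflict", "fire", "Attack"),
    ("wildfire", "fire", None),
    ("wildfire", "fire", None),
]
overall = sense_distribution([(w, e) for _, w, e in toy], "fire")
by_cluster = per_cluster(toy, "fire")
```

In this toy data the overall distribution is split three ways, while each cluster maps “fire” to a single sense, which is the kind of convergence the slide describes.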

Page 13: Joint Enhancement of Topic Modeling and Information Network Mining

Bombing Threats Tracking and Dynamic Terrorism Networks Construction

– Most information obtained from text-rich InforNet construction so far is viewed as static, ignoring the temporal dimension of many links in the networks

– It is not enough to rely on information reporting time (publication years, blog post dates, news release time, narrative order, etc.) for open-domain real-world scenarios: only 3.71% correlation with gold standards

– Temporal information in individual documents can be sparse, incomplete and inaccurate. About 50% of events don’t include explicit time arguments

Open-domain Progressive Information Network Analysis with TMBP

[Figure: temporal information network around Ali Larijani. Other nodes: Iran, Supreme National Security Council, Islamic Republic of Iran Broadcasting, Farideh Motahari, Tehran University, Hassan Rowhani. Link types: school-attended, spouse, employee. Time spans shown on links: 2005–2007, 2005–2007, 1978-, 1989-2005, 1982–1987. Link confidence values: 0.9, 0.3, 0.8, 0.4, 0.6.]

Page 14: Joint Enhancement of Topic Modeling and Information Network Mining

TMBP based Information Aggregation

Toward deep analysis and global aggregation across information networks
– Partition InforNet based on topic modeling
– Within a topic cluster, we can recover temporal information by gleaning knowledge across networks and reach a global estimation of time boundaries

Research Methods
– Novel representation of complex temporal information
– Meaningful comparison of approaches through InforNet-specific metrics
– Design novel dependency path based kernel methods to capture long contexts
– Global inference and aggregation over text-rich InforNet in order to reduce vagueness and over-constraining, resolve contradiction, and improve information quality

Page 15: Joint Enhancement of Topic Modeling and Information Network Mining

New Representation Theory and Evaluation Metric

4-tuple representation
– T1 = earliest possible start / T2 = latest possible start / T3 = earliest possible end / T4 = latest possible end
– Can represent punctual start/end points (T1 = T2, T3 = T4)
– Captures uncertainty when necessary (T1 < T2, T3 < T4)
– Consistency restrictions: T1 <= T2, T3 <= T4, T1 <= T3, T2 <= T4

A new quality of information metric based on formal constraints:
– Detect cases of non-informative nodes and links in information networks
– Allow independent parameterization of vagueness and over-constraining errors
– Error penalization can be tuned for coarser or finer-grained penalization
– t_i: automatic output; g_i: gold-standard

c = \begin{cases} c_{overconstraining}, & \text{if } (i \in \{1,3\} \wedge t_i > g_i) \vee (i \in \{2,4\} \wedge t_i < g_i) \\ c_{vagueness}, & \text{otherwise} \end{cases}

[Figure panels: Over-constraining model; Vague model]
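A minimal sketch of the 4-tuple representation and a quality score built on it. The consistency check follows the restrictions listed on the slide. The scoring form, the average of c / (c + |t_i - g_i|) over the four positions, is an assumption, since the full formula did not survive in the transcript. The sketch also assumes that an error is over-constraining when a lower bound (T1, T3) is set later than gold or an upper bound (T2, T4) is set earlier than gold. Dates are plain numbers (e.g. years), and all function and parameter names are illustrative.

```python
def is_consistent(t):
    """Check T1 <= T2, T3 <= T4, T1 <= T3 and T2 <= T4 for a 4-tuple."""
    t1, t2, t3, t4 = t
    return t1 <= t2 and t3 <= t4 and t1 <= t3 and t2 <= t4


def quality(system, gold, c_vag=1.0, c_cons=1.0):
    """Assumed metric: mean of c / (c + |t_i - g_i|) over i = 1..4.

    c_cons applies to over-constraining errors, c_vag to vague ones,
    so the two error kinds can be penalized independently.
    """
    total = 0.0
    for i, (t, g) in enumerate(zip(system, gold), start=1):
        lower_bound = i in (1, 3)  # T1 and T3 are lower bounds
        over = (lower_bound and t > g) or (not lower_bound and t < g)
        c = c_cons if over else c_vag
        total += c / (c + abs(t - g))
    return total / 4.0
```

For example, `quality((2004, 2005, 2007, 2007), (2005, 2005, 2007, 2007))` scores a vague T1 (one year too early), while moving T1 to 2006 instead makes the error over-constraining and subject to `c_cons`.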

Page 16: Joint Enhancement of Topic Modeling and Information Network Mining

Dependency Paths based Kernel Method and Information Aggregation with CCMs

– Dependency paths based kernel method for local network prediction

– Maximize global network quality by aggregating temporal information across documents over the entire information network, using Constrained Conditional Models (CCMs) for optimization (collaboration with Dan Roth (UIUC))

Aggregation of two 4-tuples:

T \wedge T^{(i)} = \langle \max(T_1, T_1^{(i)}),\ \min(T_2, T_2^{(i)}),\ \max(T_3, T_3^{(i)}),\ \min(T_4, T_4^{(i)}) \rangle

Optimization:

\max \sum_{i,k} \ln(p_{i,k}) \, x_{i,k} \quad \text{s.t.}

– for all i \neq j such that T^{(i)} conflicts with T^{(j)}: the corresponding indicator variables sum to at most 1
– \forall i, k: \ x_{i,k} \in \{0,1\}, \quad \sum_{k}^{K} x_{i,k} = 1
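The max/min aggregation of 4-tuples can be sketched directly. The greedy loop that skips conflicting tuples is a simple stand-in chosen for illustration; it is not the constrained optimization the slide describes. Unknown bounds are represented with infinities, which is also an assumption of this sketch.

```python
INF = float("inf")


def aggregate(t, u):
    """Combine two 4-tuples by tightening every bound."""
    return (max(t[0], u[0]), min(t[1], u[1]),
            max(t[2], u[2]), min(t[3], u[3]))


def is_consistent(t):
    """Check T1 <= T2, T3 <= T4, T1 <= T3 and T2 <= T4."""
    t1, t2, t3, t4 = t
    return t1 <= t2 and t3 <= t4 and t1 <= t3 and t2 <= t4


def aggregate_all(tuples, start=(-INF, INF, -INF, INF)):
    """Greedy stand-in for global inference: fold tuples into `start`,
    skipping any tuple whose aggregation would break consistency."""
    current = start
    for t in tuples:
        candidate = aggregate(current, t)
        if is_consistent(candidate):
            current = candidate
    return current


# Two documents say the relation held in 2005 and in 2007; a third
# claims it started in 2006 or later, which conflicts and is skipped.
held_2005 = (-INF, 2005, 2005, INF)
held_2007 = (-INF, 2007, 2007, INF)
starts_2006 = (2006, INF, 2006, INF)
result = aggregate_all([held_2005, held_2007, starts_2006])
```

Here `result` tightens to a start no later than 2005 and an end no earlier than 2007. A global optimizer could instead weigh the conflicting tuple by its confidence rather than dropping it in arrival order.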

Page 17: Joint Enhancement of Topic Modeling and Information Network Mining

Topic Modeling Experiments Compared to State-of-the-art

Data Collection
– DBLP
– NSF-Awards

Metrics
– Accuracy (AC)
– Normalized mutual information (NMI)

Results: 20%-40% improvement over Probabilistic Latent Semantic Analysis (PLSA)
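For reference, NMI between a gold clustering and a predicted clustering can be computed as below. Several normalizations are in common use; this sketch assumes division by the larger of the two entropies, which may differ from the variant used in the reported experiments.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information divided by max(H(truth), H(pred)) (assumed variant)."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    count_t, count_p = Counter(truth), Counter(pred)
    mi = sum((c / n) * math.log((c * n) / (count_t[a] * count_p[b]))
             for (a, b), c in joint.items())
    denom = max(entropy(truth), entropy(pred))
    # Both labelings trivial (a single cluster each): treat as identical.
    return mi / denom if denom > 0 else 1.0
```

NMI is invariant to how clusters are named, so `nmi([0, 0, 1, 1], [1, 1, 0, 0])` is 1, whereas a prediction that is independent of the gold labels scores 0.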

Page 18: Joint Enhancement of Topic Modeling and Information Network Mining


Topic Modeling based Active Learning for Event and Role Mining (Enhance Portability)

Data: open-domain news with gold-standard information annotation

Learning algorithm: combining pattern matching and Maximum Entropy based classification of triggers, arguments and roles

Automatically select topically-related documents for event training data annotation

Using topic modeling, with only 1/4 of the training data we can achieve performance comparable to passive learning

Page 19: Joint Enhancement of Topic Modeling and Information Network Mining

Topic Modeling based MLN Inference (Enhance Quality)

Topic-cluster wide cross-document inference based on Markov Logic Networks (MLN) to enhance event and role mining
– One trigger sense per topic cluster / One argument role per topic cluster
– Remove events and roles with low local and cluster-wide confidence
– Adjust event and role labeling to achieve cluster-wide consistency

Results: Precision (P), Recall (R), F-Measure (F)

Approach | Event Discovery (%): P | R | F | Role Discovery (%): P | R | F
Baseline | 74.1 | 49.6 | 59.4 | 50.4 | 28.7 | 36.6
State-of-the-art (Information Retrieval based Clustering) | 66.5 | 67.4 | 66.9 | 60.8 | 32.2 | 42.1
Topic Modeling | 73.3 | 66.3 | 69.6 | 59.4 | 36.5 | 45.2
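As a quick check on the results above, the F columns follow from P and R through the balanced F-measure, F = 2PR / (P + R). The snippet recomputes them from the reported precision and recall; the dictionary keys are just labels chosen for this sketch.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the results table, in percent.
rows = {
    "Baseline / event": (74.1, 49.6, 59.4),
    "Baseline / role": (50.4, 28.7, 36.6),
    "State-of-the-art / event": (66.5, 67.4, 66.9),
    "State-of-the-art / role": (60.8, 32.2, 42.1),
    "Topic Modeling / event": (73.3, 66.3, 69.6),
    "Topic Modeling / role": (59.4, 36.5, 45.2),
}
recomputed = {name: round(f_measure(p, r), 1)
              for name, (p, r, _) in rows.items()}
```

The recomputed values agree with the reported F scores to one decimal place, and they make the trade-off visible: topic modeling keeps most of the baseline's precision while matching the recall of the clustering approach.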

Page 20: Joint Enhancement of Topic Modeling and Information Network Mining

Progressive Temporal Infornet Mining Results

Data
– 1.3 million newswire documents and 0.4 million web blogs/forum documents

Overall Comparison with State-of-the-Art

Approach | Exploit InforNet Structures? | Accuracy | Quality
1-gram kernel | No | 54.9 | 0.66
2-gram kernel | No | 56.8 | 0.67
3-gram kernel | No | 56.5 | 0.66
Our Approach | Yes | 61.5 | 0.76

Impact of Information Aggregation

[Figure legend: No InforNet; Exploit InforNet with aggregation over 2 tuples and aggregation over 10 tuples]

Page 21: Joint Enhancement of Topic Modeling and Information Network Mining

What’s New in Network Science?

Previous Approaches vs. Our Approaches

– Previous: only considered the textual information while ignoring the network structures, or could merely integrate with homogeneous networks
  Ours: declaratively model the inter-connectivity in information networks using probabilistic topic modeling with biased propagation; multi-typed objects are treated differently along with their inherent textual information and the rich semantics of the heterogeneous information network

– Previous: analyzed text documents and information networks separately
  Ours: text data and the heterogeneous information network mutually enhance each other in topic modeling and event/role discovery, based on information network partitioning and refinement

– Previous: focused on the analysis of one or a small set of documents
  Ours: leverage information redundancy and semantic links across documents in information networks through cross-document aggregation and reasoning; reach global quality optimization in multi-dimensional space (topic, entity, event, time, place)

– Previous: treated static and dynamic information discovered from ambiguous and uncertain information networks equally
  Ours: develop a new temporal event network representation theory and evaluation metric with formal constraints that can account for uncertain temporal ranges, and a new kernel method based on dependency paths to capture long contexts

Page 22: Joint Enhancement of Topic Modeling and Information Network Mining

Potential Army Impact and Technology Transition

– Enrich and enhance the quality of information gathered from daily events and trends, and detect terrorism or other potential threats by exploring information networks that integrate unstructured text messages, blogs, tweets, news and reports

– Improved information quality has the potential to point soldiers and military data analysts to more relevant information, going beyond keyword-based Information Retrieval approaches

– Multi-facet object search can provide methods for finding groups of soldiers with certain expertise and finding characteristics of enemies that may pose an imminent threat (an example: Web-scale Terrorism Network Search and Browsing)
  – Developed methods to efficiently trace membership relations, attack/arrest/die activities and information clusters involving any specific entities
  – Improve the quality of information by the interconnected network itself (self-boosting information networks)

Page 23: Joint Enhancement of Topic Modeling and Information Network Mining

Collaborations

Within Task:
– With J. Han on subtasks 2 and 3: >2 teleconferences every week, frequent teleconferences/emails among students/post-docs, submitted 2 joint research papers (1 SIGIR2011 submission and 1 ACL2011 submission), preparing 3 new joint research papers
– With D. Roth: collaboration on Constrained Conditional Models (I1.1) for information aggregation, entity coreference resolution and event extraction

Cross-Task:
– With J. Han on I3.1: weekly teleconferences, regular emails, submitted 1 joint research paper to KDD2011
– With T. Huang on I1.1: multi-media InforNet construction and utilization; published 2 joint research papers, submitted a joint NSF proposal

Cross-Center:
– With S. Parsons (SCNARC and T1.4): using text-rich information networks for trust prediction and dynamic social network analysis; co-advising a PhD student

Page 24: Joint Enhancement of Topic Modeling and Information Network Mining

Research Plans for Next Six Months

Continue research conducted in the current I3.2 APP
– Explore topic correlation and social correlation from neighbors for improving topic modeling (with Hongbo Deng, Jiawei Han and collaboration with SCNARC)
– Introduce more constraints in cross-link inferences (with D. Roth)
– Exploit new graph alignment algorithms for text mining (with X. Yan)
– Exploit implicit links for InforNet analysis, such as the response structures in Twitter data
– Technology Transition: apply all of the successful approaches to military applications, e.g. work closely with ARL (e.g. Dr. Robert Cole) to make the terrorism network search engine deliverable; with ARL (Dr. Robert Winkler) on entity coreference resolution; with A. Leung on military data topic and event analysis

Collaborations with researchers in other tasks and networks
– I3.1 APP: Continue collaborations with Jiawei Han (UIUC) to extend the work of uncovering hierarchical relationships to more general relation types, data genres and domains
– Work with Thomas Huang (UIUC, I1.1) on cross-media transfer learning
– Work with Jiawei Han (UIUC, E2.3) on evolution of information networks
– Work with Simon Parsons (T1.4) on automatic social network analysis, and exploit logic reasoning to enhance entity disambiguation and information aggregation

Page 25: Joint Enhancement of Topic Modeling and Information Network Mining

A Research Path Ahead to 2012

Research planned for next year, if funded:
– Effective theories and methods for mining text-rich heterogeneous networks involving social and communication networks
– Leverage topic modeling for improving expert finding (expertise ranking problem) on heterogeneous information networks
– Continue to exploit network structures to enhance knowledge discovery and population
– Multi-dimensional, hierarchical abstractive summarization based on information network analysis
– Explore collaborations with information fusion tasks in I1
– Explore collaborations with social network and trust projects on automatic social network construction and mining
– Application of effective theories and methods in military applications

Page 26: Joint Enhancement of Topic Modeling and Information Network Mining

Research Papers

I3.1 (UIUC+CUNY)
– C. Wang, J. Han, X. Li, Q. Li, W. Lin, A. Lee, H. Li and H. Ji. 2011. Uncovering Hierarchical Relationships among Linked Objects: A Probabilistic Modeling Approach. Submitted to KDD2011.

I3.2 Accepted/Published:
– Z. Chen, S. Tamang, A. Lee, X. Li, W. Lin, J. Artiles, M. Snover, M. Passantino and H. Ji. CUNY-BLENDER TAC-KBP2010 Entity Linking and Slot Filling System Description. Proc. TAC2010.
– H. Li, X. Li, H. Ji and Y. Marton. Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation. Proc. PACLIC 2010.
– H. Ji, R. Grishman. Knowledge Base Population: Successful Approaches and Challenges. Proc. ACL-HLT2011.
– H. Ji, Adam Lee and Wen-Pin Lin. Information Network Construction and Alignment from Automatically Acquired Comparable Corpora. Invited book chapter for Building and Using Comparable Corpora. Springer.
– H. Ji, B. Favre, W. Lin, D. Gillick, D. Hakkani-Tur and R. Grishman. Open-domain Multi-document Summarization via Information Extraction: Challenges and Prospects. Invited book chapter for Multi-source, Multilingual Information Extraction and Summarisation. Springer.

I3.2 Submitted:
– (CUNY + UIUC) H. Ji and J. Han. 2011. Web-Scale Knowledge Discovery and Information Extraction. Invited Paper for IEEE Special Issue on Web-Scale Multimedia Processing and Applications.
– (CUNY + UIUC) H. Li, H. Ji, H. Deng and J. Han. 2011. Topically Related Data is Better Data: Topic Modeling for Event Extraction. ACL-HLT2011.
– (CUNY + UIUC) S. Anzaroot, J. Artiles, H. Ji, H. Deng and J. Han. 2011. Search and Browsing Self-Boosting Information Networks. SIGIR2011.
– J. Artiles, Q. Li, E. Amigo and H. Ji. 2011. Leveraging Cross-document Redundancy for Temporal Information Extraction. EMNLP2011.
– J. Artiles, E. Amigo, Q. Li and H. Ji. 2011. Evaluating Temporal Information Extraction. ACL-HLT2011.
– Z. Chen and H. Ji. 2011. Collaborative Ranking: A Case Study in Entity Linking. EMNLP2011.
– Q. Li, J. Artiles and H. Ji. 2011. Dependency Paths Kernel for Temporal Relation Classification. ACL-HLT2011.
– S. Tamang and H. Ji. 2011. Learning-to-Rank for Slot Filling System Combination and Assessment. EMNLP2011.
– Z. Chen, S. Tamang, A. Lee and H. Ji. 2011. A Toolkit for Knowledge Base Population. SIGIR2011.
– X. Li and H. Ji. 2011. Comment-guided Learning for Automatic Assessment. EMNLP2011.

Page 27: Joint Enhancement of Topic Modeling and Information Network Mining

Awards and Keynote Speech
– Heng Ji. CUNY Chancellor's "Salute to Scholar" Award, November 2010.
– Heng Ji. National Science Foundation Research Experiences for Undergraduates, March 2011.
– Heng Ji. Web-Scale Knowledge Discovery and Population from Unstructured Data, Keynote Speech at ACLCLP 2010 Information Retrieval Conference, December 2010.
– Heng Ji. Overview of the TAC2010 Knowledge Base Population Track, Keynote Speech at Web People Search (WePS-3) Conference, September 2010.
– Five students received university-wide awards

Page 28: Joint Enhancement of Topic Modeling and Information Network Mining

Brief Summary of My Team’s Other Research Work in I3.1 and I3.2


Page 29: Joint Enhancement of Topic Modeling and Information Network Mining

Leverage Semantic Information Network to Enhance Entity Coreference Resolution / Entity Identification

Disambiguation

Name Variant Clustering

Apply graph-cutting based algorithms on semantic information networks

9.4% absolute improvement in micro-averaged accuracy

Page 30: Joint Enhancement of Topic Modeling and Information Network Mining

Micro and Macro Collaborative Networks Ranking for Entity and Event Coreference Resolution

[Figure: a query q with collaborators cq1–cq7, weighted links (0.7, 0.4, 0.3, 0.6), ranked outputs labeled A and B, and the correct rank.]

– Previous methods only focused on the target node and one learning theory itself
– Propose a new collaborative network ranking theory which imitates human collaborative learning
– Leverage inter-connections among collaborative entities in information networks
  – Automatic profiling for each node
  – Construct a collaborative network for each entity based on graph-based clustering
  – Rank multiple decisions from collaborative entities (micro) and algorithms (macro) based on global prediction
– 7% absolute improvement in micro-averaged accuracy
– On-going CUNY+UIUC work: using topic modeling for entity clustering

Page 31: Joint Enhancement of Topic Modeling and Information Network Mining

Markov Logic Networks and Learning-to-Rank to Enhance Open Domain Role Discovery

[Figure: 9/11 suspect terrorist network and terrorist information network. Nodes include Wail Al-Shehri, Waleed Al-Shehri, Abdul Aziz Al-Omari, Abdul Rahman Al-Omari, Mohamed Atta, Al-Qaeda, Khamis Mushait, Boston and Saudi Arabian Airlines (further nodes labeled V3–V16). Link types include origin, member, sibling, residence and pilot. Sources include news pages, web blogs, Twitter and forums.]

– Discovered 26 roles for persons, 16 roles for organizations and 13 roles for locations
– Markov Logic Networks for cross-slot and cross-query reasoning based on InfoNet and textual linkages, to resolve conflicts and predict missing links
– Example rules (weights as listed on the slide: 15, 100):
  – \forall x, y, z: Ambiguous(X, Y), TextualLinkage(Y, Z), Pilot(X), Pilot(Z), Remove(X) (logical connectives lost in extraction)
  – \forall x, y, z: Sibling(X, Y) \wedge Origin(Y, Z) \Rightarrow Origin(X, Z)
– Maximum Entropy based learning-to-rank model to re-rank candidate answers
– 13%-22% absolute F-measure improvement

(CUNY) Chen et al. "CUNY-BLENDER TAC-KBP2010 Entity Linking and Slot Filling System Description". Proc. TAC2010 and Lecture Notes in Computer Science, 2010
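The sibling/origin rule on this slide (if X and Y are siblings and Y's origin is Z, then X's origin is Z) can be illustrated with plain forward chaining. This is a deliberately simplified stand-in: it treats the rule as a hard constraint and ignores the weights, so it is not Markov Logic inference. It also assumes the sibling relation is symmetric. The entity names are hypothetical placeholders.

```python
def infer_origin(siblings, origin):
    """Predict missing Origin links via Sibling(X, Y) and Origin(Y, Z).

    siblings: iterable of (x, y) pairs, treated as symmetric (assumption).
    origin:   {person: place} for the links already known.
    Returns a new dict containing the known and the inferred links.
    """
    pairs = set(siblings) | {(y, x) for x, y in siblings}
    inferred = dict(origin)
    changed = True
    while changed:  # repeat until no rule application adds a new link
        changed = False
        for x, y in pairs:
            if y in inferred and x not in inferred:
                inferred[x] = inferred[y]
                changed = True
    return inferred


# Hypothetical placeholders, not facts taken from the slide.
known_origin = {"person_a": "place_x"}
sibling_pairs = [("person_a", "person_b"), ("person_b", "person_c")]
completed = infer_origin(sibling_pairs, known_origin)
```

In a weighted setting such a rule would only raise the probability of the inferred link rather than assert it, which is what allows conflicting evidence to override it.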

Page 32: Joint Enhancement of Topic Modeling and Information Network Mining

Uncovering Hierarchical Relationships among Linked Objects

– Relationship types: parent-child, manager-subordinate, organizational, initiator-follower
– DAG, underlying tree
– Data: nodes, links, labeled trees
– Jointly learn the importance of features and rules (challenge: joint learning)
– Infer the tree structures of unlabeled data (challenge: model & feature design)
– Develop a general model & summarize typical features w/ uncertain importance
  – Local feature (singleton potential)
  – Dependency rule (pairwise potential)
– Test on two tasks
  – Uncover family tree structure
  – Uncover online discussion structure

[Figure: a candidate DAG over nodes v1–v4 (with labels p1–p4) and two possible resulting trees.]

Inference performance in diff. measures
– Our model > state-of-the-art text mining (2-3X)
– Joint model > two-stage model (5% - 381%)

Practical usefulness and generality
– Does not require many labels for training
– Good adaptability for generalization

[Figure: examples of features and rules]

(UIUC + CUNY) Chi Wang, Jiawei Han, Xiang Li, Qi Li, Wen-Pin Lin, Adam Lee, Hao Li, Heng Ji, "Uncovering Hierarchical Relationships among Linked Objects: A Probabilistic Modeling Approach", KDD'11 (sub)

Page 33: Joint Enhancement of Topic Modeling and Information Network Mining

Uncovering Hierarchical Relationships among Linked Objects

Using a novel discriminative model, CRF-Hier
– Optimized for joint modeling of tree structure learning and reasoning
– 10%-12% higher performance than state-of-the-art

[Figure: uncovered family tree with nodes Mohammed bin Awad bin Laden, Salem bin Laden, Bakr bin Laden, Abdullah Osama bin Laden, Osama bin Laden, Saad bin Laden and Omar Osama bin Laden.]

(UIUC + CUNY) Chi Wang, Jiawei Han, Xiang Li, Qi Li, Wen-Pin Lin, Adam Lee, Hao Li, Heng Ji, "Uncovering Hierarchical Relationships among Linked Objects: A Probabilistic Modeling Approach", KDD'11 (sub)

Page 34: Joint Enhancement of Topic Modeling and Information Network Mining

Potential Transition Example: Terrorism Networks Search and Browsing Engine

• In many scenarios, a user may only know information about limited portions of objects or dimensions of links in information networks and thus have difficulty creating informative queries

• For example, a military data analyst may have a list of well-known terrorist organizations without knowing the names of their individual members, but still wish to track the activities of these members

Page 35: Joint Enhancement of Topic Modeling and Information Network Mining

Multi-Facet Search in Self-Boosting Information Networks (Example: Terrorism Network Search and Browsing)

Demo Video: http://nlp.cs.qc.cuny.edu/terrorism.m4v

(CUNY + UIUC) Sam Anzaroot, Javier Artiles, Heng Ji, Hongbo Deng and Jiawei Han. 2011. Search and Browsing Self-Boosting Information Networks. SIGIR2011 [SUB]

• Help a military analyst with expert finding and with terrorist information search, gathering, control and analysis for any given query

• Entity-topic analyzer for self-expansion and self-boosting: terrorism organization → members → status of members (die, arrest, ...) and information networks associated with each member