Effective Extraction of Thematically Grouped Key Terms From Text

Maria Grineva, Ph.D., research scientist at the Institute for System Programming of RAS

Outline

1. Key terms extraction: traditional approaches and applications
2. Using Wikipedia as a knowledge base for Natural Language Processing
3. Main techniques of our approach:
• Wikipedia-based semantic relatedness
• Network analysis algorithm to detect community structure in networks
4. Our method
5. Experimental evaluation

Key Terms Extraction

• Basic step for various NLP tasks:
– document classification
– document clustering
– text summarization
– inferring a more general topic of a text document
• Core task of Internet content-based advertising systems, such as Google AdSense and Yahoo! Contextual Match.

Approaches to Key Terms Extraction

• Based on statistical learning:
– use, for example, a frequency criterion (TFxIDF model), keyphrase-frequency, and distance between terms normalized by the number of words in the document (KEA)
– compute statistical features over the Wikipedia corpus (Wikify!)
– require a training set

• Based on analyzing syntactic or semantic term relatedness within a document:
– compute semantic relatedness between terms (using, for example, Wikipedia)
– model the document as a semantic graph of terms and apply graph analysis techniques to it (TextRank)
– no training set required
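To make the frequency criterion concrete, here is a minimal sketch of one common TFxIDF variant (relative term frequency times the logarithm of inverse document frequency). The function name `tf_idf` and the toy corpus are illustrative assumptions; systems such as KEA and Wikify! combine this kind of score with further features and a trained model.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each term of doc_tokens by TF x IDF over a list of tokenized documents."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many corpus documents contain the term.
        # Fall back to 1 so an unseen term does not divide by zero.
        df = sum(1 for d in corpus if term in d) or 1
        scores[term] = (count / len(doc_tokens)) * math.log(n_docs / df)
    return scores

# Toy corpus (made up for illustration).
corpus = [
    ["apple", "itunes", "blind", "software"],
    ["apple", "fruit", "juice"],
    ["software", "release", "notes"],
]
scores = tf_idf(corpus[0], corpus)
top_terms = sorted(scores, key=scores.get, reverse=True)
```

Terms that occur in fewer documents of the corpus ("itunes", "blind") score higher than terms spread over several documents ("apple", "software").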

Using Wikipedia as a Knowledge Base for Natural Language Processing

• Wikipedia (www.wikipedia.org) is a free, open encyclopedia:
– Today Wikipedia is the biggest encyclopedia (more than 2.7 million articles in the English Wikipedia)
– It is always up-to-date thanks to millions of editors around the world
– It has a huge network of cross-references between articles, a large number of categories, redirect pages, and disambiguation pages => a rich resource for bootstrapping NLP and IR tasks

Basic Techniques of Our Method: Semantic Relatedness of Terms

• Semantic relatedness assigns a score to a pair of terms that represents the strength of relatedness between the terms

• Can be computed over a dictionary or thesaurus. We use Wikipedia

• Wikipedia-based semantic relatedness for two terms can be computed using:
– the links found within their corresponding Wikipedia articles
– the Wikipedia categories structure
– the articles' textual content

• We use the Dice measure for Wikipedia-based semantic relatedness
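As an illustration, the standard Dice coefficient of two sets A and B is 2|A∩B| / (|A| + |B|). The sketch below applies it to the link sets of two Wikipedia articles. The link sets are made-up toy data and the function name `dice_relatedness` is an assumption; the cited metric may weight links rather than treat them all equally.

```python
def dice_relatedness(links_a, links_b):
    """Plain (unweighted) Dice coefficient of two sets of article links."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical link sets, not real Wikipedia data.
links = {
    "iTunes": {"Apple Inc.", "iPod", "Software", "Music"},
    "iPod": {"Apple Inc.", "iTunes", "Music", "Portable media player"},
    "Braille": {"Blindness", "Writing system"},
}

related = dice_relatedness(links["iTunes"], links["iPod"])       # two shared links out of 4 + 4
unrelated = dice_relatedness(links["iTunes"], links["Braille"])  # no shared links
```

The score lies between 0 (no shared links) and 1 (identical link sets).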

Basic Techniques of Our Method: Semantic Relatedness of Terms

Denis Turdakov, Pavel Velikhov. “Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation”. SYRCoDIS, 2008

Basic Techniques of Our Method: Detecting Community Structure in Networks

• Community: a densely interconnected group of nodes in a network

• Girvan-Newman algorithm for detecting community structure in networks:
– betweenness: how much an edge is “in between” different communities
– modularity: a partition is a good one if there are many edges within communities and only a few between them
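The modularity criterion can be made concrete with the usual Newman-Girvan formula: for each community, take the fraction of edge weight that falls inside it and subtract the square of the fraction of edge ends attached to it. The sketch below is a plain-Python illustration under that assumption; the graph (two triangles joined by one bridge edge) and the function name are invented for the example.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity Q of a partition of an undirected weighted graph.

    edges: dict mapping (u, v) -> weight, each undirected edge listed once.
    communities: list of disjoint node sets covering all nodes.
    """
    total = sum(edges.values())
    membership = {node: i for i, c in enumerate(communities) for node in c}
    inner = [0.0] * len(communities)   # edge weight inside each community
    degree = [0.0] * len(communities)  # sum of weighted degrees per community
    for (u, v), w in edges.items():
        cu, cv = membership[u], membership[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            inner[cu] += w
    return sum(inner[i] / total - (degree[i] / (2 * total)) ** 2
               for i in range(len(communities)))

# Two triangles connected by a single bridge edge (c, d).
edges = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 1,
         ("d", "e"): 1, ("e", "f"): 1, ("d", "f"): 1,
         ("c", "d"): 1}
q_split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_single = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting the graph at the bridge gives Q = 5/14 (about 0.36), while putting every node in one community gives Q = 0. The Girvan-Newman algorithm repeatedly removes the edge with the highest betweenness (here the bridge) and keeps the partition with the highest modularity.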

Our Method

1. Candidate terms extraction
2. Word sense disambiguation
3. Building semantic graph
4. Discovering community structure of the semantic graph
5. Selecting valuable communities

Our Method: Candidate Terms Extraction

• Goal: extract all terms from the document and for each term prepare a set of Wikipedia articles that can describe its meaning

• Parse the input document and extract all possible n-grams

• For each n-gram (+ its morphological variations) provide a set of Wikipedia article titles
– “drinks”, “drinking”, “drink” => [Wikipedia:] Drink; Drinking
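The candidate extraction step can be sketched as follows. The lookup table `TITLE_INDEX`, which maps surface forms (including morphological variations) to Wikipedia article titles, is a hypothetical stand-in for an index built from Wikipedia; whitespace tokenization and the `max_n` limit are likewise simplifying assumptions.

```python
def ngrams(tokens, max_n=3):
    """Yield every n-gram of the token list for n = 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Hypothetical index: surface form -> candidate Wikipedia article titles.
TITLE_INDEX = {
    "drink": ["Drink", "Drinking"],
    "drinks": ["Drink", "Drinking"],
    "drinking": ["Drink", "Drinking"],
    "soft drink": ["Soft drink"],
}

def candidate_terms(text, index=TITLE_INDEX, max_n=3):
    """Map each n-gram found in the index to its candidate article titles."""
    tokens = text.lower().split()
    return {g: index[g] for g in ngrams(tokens, max_n) if g in index}

candidates = candidate_terms("She drinks a soft drink")
```

Here `candidates` holds entries for "drinks", "drink", and "soft drink"; n-grams with no matching title are dropped, and terms with several candidate titles are left for the disambiguation step.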

Our Method: Word Sense Disambiguation

• Goal: choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step

• Use Wikipedia disambiguation and redirect pages to obtain candidate meanings of ambiguous terms

Denis Turdakov, Pavel Velikhov. “Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation”. SYRCoDIS, 2008
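One simple way to picture the disambiguation step is to pick, for an ambiguous term, the candidate article that is most related to the unambiguous concepts already found in the document. The sketch below does this with an unweighted Dice score over toy link sets; it is a simplified stand-in under those assumptions, not the procedure of the cited paper, and all names and data are invented for illustration.

```python
def dice(a, b):
    """Unweighted Dice coefficient of two sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def disambiguate(candidates, context, links):
    """Return the candidate article with the highest total relatedness to the context."""
    def score(article):
        return sum(dice(links[article], links[c]) for c in context)
    return max(candidates, key=score)

# Hypothetical link sets, not real Wikipedia data.
links = {
    "Apple Inc.": {"iTunes", "iPod", "Steve Jobs", "Software"},
    "Apple": {"Fruit", "Tree", "Orchard"},
    "iTunes": {"Apple Inc.", "iPod", "Software", "Music"},
    "Screen reader": {"Software", "Blindness", "Accessibility"},
}

# The term "apple" is ambiguous; "iTunes" and "Screen reader" are the context concepts.
chosen = disambiguate(["Apple", "Apple Inc."], ["iTunes", "Screen reader"], links)
```

With this toy data the company article wins, because its links overlap with those of the context concepts while the fruit article shares none.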

Our Method: Building Semantic Graph

• Goal: build the document's semantic graph using semantic relatedness between terms

Semantic graph built from a news article "Apple to Make ITunes More Accessible For the Blind"
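A minimal sketch of the graph construction, assuming that vertices are the disambiguated concepts and that two concepts are connected by an edge weighted with their relatedness whenever that relatedness exceeds a threshold. The threshold, the toy link sets, and the function names are assumptions made for the example.

```python
from itertools import combinations

def dice(a, b):
    """Unweighted Dice coefficient of two sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def build_semantic_graph(concepts, relatedness, threshold=0.0):
    """Return weighted edges {(u, v): weight} for pairs related above the threshold."""
    edges = {}
    for u, v in combinations(sorted(concepts), 2):
        w = relatedness(u, v)
        if w > threshold:
            edges[(u, v)] = w
    return edges

# Hypothetical link sets, not real Wikipedia data.
links = {
    "iTunes": {"Apple Inc.", "iPod", "Music"},
    "iPod": {"Apple Inc.", "iTunes", "Music"},
    "Braille": {"Blindness", "Writing system"},
    "Screen reader": {"Blindness", "Software"},
}
graph = build_semantic_graph(links, lambda u, v: dice(links[u], links[v]))
```

The resulting graph has two edges, one linking the Apple-related concepts and one linking the accessibility-related concepts, which is the kind of structure the community detection step then separates into thematic groups.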

Our Method: Detecting Community Structure of the Semantic Graph

• Dense communities represent main topics of the document
• Disambiguation mistakes become isolated vertices
• Modularity for semantic graphs: 0.3~0.5

Our Method: Selecting Valuable Communities

• Goal: rank term communities in a way that:
– the highest ranked communities contain key terms
– the lowest ranked communities contain unimportant terms and possible disambiguation mistakes

• Use:
– density of community: sum of inner edges of the community divided by the number of vertices in this community
– informativeness: sum of the keyphraseness measure (Wikipedia-based TFxIDF analogue) of community terms

• Community rank: density*informativeness
• Take 2-3 communities with highest rank
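The ranking rule above can be sketched as follows. It assumes that "sum of inner edges" means the sum of the weights of edges whose endpoints both lie in the community, and it uses made-up edge weights and keyphraseness values; the function names are illustrative.

```python
def density(community, edges):
    """Sum of inner edge weights divided by the number of vertices."""
    inner = sum(w for (u, v), w in edges.items()
                if u in community and v in community)
    return inner / len(community)

def informativeness(community, keyphraseness):
    """Sum of keyphraseness values of the community's terms."""
    return sum(keyphraseness.get(term, 0.0) for term in community)

def rank_communities(communities, edges, keyphraseness, top_k=3):
    """Return the top_k communities ordered by density * informativeness."""
    ranked = sorted(
        communities,
        key=lambda c: density(c, edges) * informativeness(c, keyphraseness),
        reverse=True,
    )
    return ranked[:top_k]

# Toy data (invented for illustration).
edges = {("iPod", "iTunes"): 0.8, ("Apple Inc.", "iTunes"): 0.7,
         ("Apple Inc.", "iPod"): 0.6, ("Monday", "Week"): 0.2}
keyphraseness = {"iTunes": 0.5, "iPod": 0.4, "Apple Inc.": 0.3,
                 "Monday": 0.01, "Week": 0.01}
communities = [{"Monday", "Week"}, {"Apple Inc.", "iPod", "iTunes"}]
best = rank_communities(communities, edges, keyphraseness, top_k=1)
```

The dense, informative Apple-related community outranks the loosely connected pair of generic terms, so only the former is kept as a source of key terms.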

Advantages of the Method

• No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia

• Thematically grouped key terms. The grouping significantly improves subsequent inference of document topics using, for example, spreading activation over the Wikipedia categories graph

• High accuracy. Evaluation using human judgments (further in this presentation)

Experimental Evaluation: Creating Test Set

• 30 blog posts from technical blogs
• 5 persons took part in the evaluation and were asked to:
– identify from 5 to 10 key terms for each blog post
– each key term must be present in the blog post, and must be identified using Wikipedia article names as the allowed vocabulary
– chosen key terms should cover several main topics of the blog post

• Eventually, a key term was considered valid if at least two of the participants identified the same key term from the blog post
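The agreement rule (a key term is valid if at least two participants chose it) can be sketched as a simple vote count. The annotations below are invented for illustration and the function name is an assumption.

```python
from collections import Counter

def valid_key_terms(annotations, min_votes=2):
    """Return the terms chosen by at least min_votes participants for one blog post."""
    # set(terms) ensures a participant counts at most once per term.
    votes = Counter(term for terms in annotations for term in set(terms))
    return {term for term, n in votes.items() if n >= min_votes}

# Hypothetical key terms from three participants for a single blog post.
annotations = [
    ["iTunes", "Apple Inc.", "Accessibility"],
    ["iTunes", "Screen reader"],
    ["Apple Inc.", "iTunes", "Braille"],
]
valid = valid_key_terms(annotations)
```

Only "iTunes" and "Apple Inc." are kept here, since every other term was chosen by a single participant.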

Experimental Evaluation: Precision and Recall

• 30 blog posts: 180 key terms were extracted manually, 297 key terms were extracted by our method, and 123 of the manually extracted key terms were also extracted by our method

• Recall equals 68%

• Precision equals 41%
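The two figures follow directly from the counts on this slide: recall is the share of manually extracted key terms that the method also found, and precision is the share of automatically extracted key terms that match a manual one. A short check, with variable names chosen for the example:

```python
def precision_recall(n_manual, n_extracted, n_overlap):
    """Return (precision, recall) from the three counts."""
    precision = n_overlap / n_extracted
    recall = n_overlap / n_manual
    return precision, recall

# Counts reported on the slide.
precision, recall = precision_recall(n_manual=180, n_extracted=297, n_overlap=123)
precision_pct = round(precision * 100)  # 123 / 297
recall_pct = round(recall * 100)        # 123 / 180
```

Rounded to whole percentages, this reproduces the reported 41% precision and 68% recall.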

Experimental Evaluation: Revision of Precision and Recall

• Our method typically extracts more related terms in each thematic group than a human (possibly, our method produces better term coverage for a specific topic than an average human) => revisit precision and recall

• Each participant reviewed the key terms extracted automatically for every blog post and, if possible, extended his manually identified key terms with some from the automatically extracted set

• Recall after revision equals 73%
• Precision after revision equals 52%

Your Questions
