2006.04.27 - SLIDE 1 IS 240 – Spring 2006 Prof. Ray Larson University of California, Berkeley School of Information Management & Systems Tuesday and Thursday 10:30 am - 12:00 pm Spring 2006 http://www.sims.berkeley.edu/academics/courses/is240/s06/ Principles of Information Retrieval Lecture 28: CLIR

2006.04.27 - SLIDE 1IS 240 – Spring 2006

Prof. Ray Larson

University of California, Berkeley

School of Information Management & Systems

Tuesday and Thursday 10:30 am - 12:00 pm

Spring 2006

http://www.sims.berkeley.edu/academics/courses/is240/s06/

Principles of Information Retrieval

Lecture 28: CLIR

2006.04.27 - SLIDE 2IS 240 – Spring 2006

Mini-TREC

• Proposed Schedule
– February 14-16 – Database and previous Queries
– March 2 – report on system acquisition and setup
– March 2, New Queries for testing…
– April 20, Results due
– April 25, Results and system rankings (sort of)
– May 9, Group reports and discussion

2006.04.27 - SLIDE 3IS 240 – Spring 2006

Results (with bad runs)

2006.04.27 - SLIDE 4IS 240 – Spring 2006

With new runs…

2006.04.27 - SLIDE 5IS 240 – Spring 2006

Mean Average Precision

Submission File                                     MAP
Group2/stemmedphrases-minitrec.results              0.3391
Group2/stemmed-minitrec.results                     0.3314
Group4/G4BF.txt                                     0.3158
Group2/tokenphrases-minitrec.results                0.3049
Group4/G4P.txt                                      0.2902
Group3/results20060420_ssr_minitrec.txt             0.2549
Group3/results20060426_ssr_minitrec_rerun.txt       0.2548
Group3/results20060426_ssr_minitrec_rf.txt          0.2508
Group1/trec_top_file_1.txt                          0.1873
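For reference, mean average precision can be sketched as follows. This is a minimal illustration of the usual definition (average precision per query, averaged over queries), not the evaluation script used for the Mini-TREC runs; the function names and the toy document IDs are assumptions.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant doc appears,
    divided by the total number of relevant docs for the query."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Average the per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


# Toy example with hypothetical query and document IDs
toy_runs = {"q1": ["d1", "d2", "d3"], "q2": ["x"]}
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"y"}}
toy_map = mean_average_precision(toy_runs, toy_qrels)
```

A query with no retrieved relevant documents contributes zero, which is why one weak query can pull a run's MAP down noticeably.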

2006.04.27 - SLIDE 6IS 240 – Spring 2006

Today

• Review
– NLP for IR
– Text Summarization

• Cross-Language Information Retrieval
– Introduction
– Cross-Language EVIs

Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen

2006.04.27 - SLIDE 8IS 240 – Spring 2006

Natural Language Processing and IR

• The main approach in applying NLP to IR has been to attempt to address

– Phrase usage vs individual terms

– Search expansion using related terms/concepts

– Attempts to automatically exploit or assign controlled vocabularies

2006.04.27 - SLIDE 9IS 240 – Spring 2006

NLP and IR

• Much early research showed that (at least in the restricted test databases used)
– Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically)
– Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

2006.04.27 - SLIDE 10IS 240 – Spring 2006

NLP and IR

• Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods
– E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches

2006.04.27 - SLIDE 11IS 240 – Spring 2006

General Framework of NLP

Morphological and Lexical Processing
John runs.
John run+s. P-N V 3-pre N plu

Syntactic Analysis
[S [NP [P-N John]] [VP [V run]]]

Semantic Analysis
Pred: RUN Agent: John

Context processing Interpretation
John is a student. He runs.

Slide from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

2006.04.27 - SLIDE 12IS 240 – Spring 2006

Using NLP

• Strzalkowski (in Reader)

[Figure: Text → NLP → repres → Dbase search; NLP: TAGGER → PARSER → TERMS]

2006.04.27 - SLIDE 13IS 240 – Spring 2006

Using NLP

INPUT SENTENCE
The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin.

TAGGED SENTENCE
The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

2006.04.27 - SLIDE 14IS 240 – Spring 2006

Using NLP

TAGGED & STEMMED SENTENCE
the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per

2006.04.27 - SLIDE 15IS 240 – Spring 2006

Using NLP

PARSED SENTENCE

[assert

[[perf [have]][[verb[BE]]

[subject [np[n PRESIDENT][t_pos THE]

[adj[FORMER]][adj[SOVIET]]]]

[adv EVER]

[sub_ord[SINCE [[verb[INVADE]]

[subject [np [n TANK][t_pos A]

[adj [RUSSIAN]]]]

[object [np [name [WISCONSIN]]]]]]]]]

2006.04.27 - SLIDE 16IS 240 – Spring 2006

Using NLP

EXTRACTED TERMS & WEIGHTS

President 2.623519 soviet 5.416102

President+soviet 11.556747 president+former 14.594883

Hero 7.896426 hero+local 14.314775

Invade 8.435012 tank 6.848128

Tank+invade 17.402237 tank+russian 16.030809

Russian 7.383342 wisconsin 7.785689
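The head+modifier terms listed above come out of the parser. As a rough illustration only, the adjective-noun pairs among them can be approximated directly from the tagged and stemmed sentence without a parser. The function names here are hypothetical, the tag names (jj, nn) follow the slides, pairs such as tank+invade that depend on the parse are not produced, and no weights are computed.

```python
def parse_tagged(sentence):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in sentence.split()]


def adjective_noun_pairs(tagged):
    """Pair each noun (nn) with the adjectives (jj) immediately preceding it."""
    pairs = []
    pending = []  # adjectives seen since the last non-adjective token
    for word, tag in tagged:
        if tag == "jj":
            pending.append(word)
            continue
        if tag == "nn":
            pairs.extend(f"{word}+{adj}" for adj in pending)
        pending = []
    return pairs


# The tagged and stemmed sentence from the earlier slide
stemmed = ("the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt "
           "local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn "
           "invade/vbd wisconsin/np ./per")
pairs = adjective_noun_pairs(parse_tagged(stemmed))
```

This shortcut recovers president+former, president+soviet, hero+local and tank+russian; the subject-verb pair needs the syntactic analysis.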

2006.04.27 - SLIDE 17IS 240 – Spring 2006

NLP & IR

• Indexing
– Use of NLP methods to identify phrases
  • Test weighting schemes for phrases
– Use of more sophisticated morphological analysis

• Searching
– Use of two-stage retrieval
  • Statistical retrieval
  • Followed by more sophisticated NLP filtering
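The two-stage idea can be sketched as below. This is a toy illustration under stated assumptions: the first stage is a plain term-count ranking standing in for statistical retrieval, and the second stage is an adjacency (phrase) check standing in for the more sophisticated NLP filtering. All names and the sample documents are hypothetical.

```python
def statistical_stage(query_terms, docs, k):
    """Stage 1: rank documents by how often query terms occur; keep the top k."""
    scored = []
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        score = sum(tokens.count(term) for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document ID for determinism
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


def phrase_filter(candidates, docs, phrase):
    """Stage 2 (stand-in for NLP filtering): keep candidates where the
    words of `phrase` occur as adjacent tokens."""
    n = len(phrase)
    kept = []
    for doc_id in candidates:
        tokens = docs[doc_id].lower().split()
        if any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            kept.append(doc_id)
    return kept


docs = {
    "d1": "a russian tank invaded",
    "d2": "tank of russian oil",
    "d3": "nothing here",
}
candidates = statistical_stage(["russian", "tank"], docs, k=10)
filtered = phrase_filter(candidates, docs, ["russian", "tank"])
```

The expensive step only runs over the short candidate list, which is the point of the design.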

2006.04.27 - SLIDE 18IS 240 – Spring 2006

NLP & IR

• New “Question Answering” track at TREC has been exploring these areas
– Usually statistical methods are used to retrieve candidate documents
– NLP techniques are used to extract the likely answers from the text of the documents

2006.04.27 - SLIDE 19IS 240 – Spring 2006

Today

• Review
– NLP for IR
– Text Summarization

• Cross-Language Information Retrieval
– Introduction
– Cross-Language EVIs

Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen

2006.04.27 - SLIDE 20IS 240 – Spring 2006

Introduction to CLIR

• Slides from Doug Oard…

2006.04.27 - SLIDE 21IS 240 – Spring 2006

Cross-Language IR

• Given a query expressed in one language

• Find info that may be expressed in another
– Electronic texts
– Document images
– Recorded speech [101]
– Sign language

[Figure: English Query → Retrieval System → French Documents]

2006.04.27 - SLIDE 22IS 240 – Spring 2006

Why Do Cross-Language IR?

• When users can read several languages
– Eliminates multiple queries
– Query in most fluent language

• Monolingual users can also benefit
– If translations can be provided
– If it suffices to know that a document exists
– If text captions are used to search for images

2006.04.27 - SLIDE 23IS 240 – Spring 2006

What We Know

• Dictionaries are very useful
– Easily get to 50% of monolingual IR effectiveness
– We can get to about 75% using:
  • Part-of-speech tags
  • Pseudo-relevance feedback
  • Phrase indexing

• Multilingual training corpora are also useful
– When the corpus is from the right domain

2006.04.27 - SLIDE 24IS 240 – Spring 2006

Related Issues

• Multiscript text processing [12]
– Character sets, writing system, direction, ...

• Language identification [109]
– Markup, detection

• Language-specific processing [103]
– Stemming, morphological roots, compounds,

• Document translation [51]

2006.04.27 - SLIDE 25IS 240 – Spring 2006

[Figure: taxonomy of Cross-Language Text Retrieval approaches. Node labels: Controlled Vocabulary, Free Text; Query Translation, Document Translation; Text Translation, Vector Translation; Knowledge-based, Corpus-based; Ontology-based, Dictionary-based, Thesaurus-based; Parallel, Comparable; Term-aligned, Sentence-aligned, Document-aligned, Unaligned]

2006.04.27 - SLIDE 26IS 240 – Spring 2006

Free Text Developments

• 1970, 1973 Salton
– Hand coded bilingual dictionaries

• 1990 Latent Semantic Indexing [53]
– French/English using Hansard training corpus

• 1994 European multilingual IR project [84]
– Medium-scale recall/precision evaluation

• 1996 SIGIR Cross-lingual IR workshop
– And over 10 conferences and workshops since!

2006.04.27 - SLIDE 27IS 240 – Spring 2006

How Controlled Vocabulary Works

• Thesaurus design [102]
– Design a knowledge structure for domain
– Assign a unique “descriptor” to each concept
  • Include “scope notes” and “lead-in vocabulary”

• Document indexing
– Read the document, assign appropriate descriptors

• Retrieval
– Select desired descriptors, use exact match retrieval
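The indexing and retrieval steps above amount to an inverted index over descriptors with Boolean matching. A minimal sketch follows, assuming hypothetical descriptor names and document IDs; a real system would also use the thesaurus structure (scope notes, lead-in vocabulary) to help select the descriptors.

```python
from collections import defaultdict


def build_descriptor_index(assignments):
    """Map each controlled-vocabulary descriptor to the set of documents
    an indexer assigned it to."""
    index = defaultdict(set)
    for doc_id, descriptors in assignments.items():
        for descriptor in descriptors:
            index[descriptor].add(doc_id)
    return index


def exact_match(index, wanted):
    """Return documents carrying every requested descriptor (Boolean AND)."""
    doc_sets = [index.get(descriptor, set()) for descriptor in wanted]
    return set.intersection(*doc_sets) if doc_sets else set()


# Hypothetical indexer assignments
assignments = {
    "doc1": ["Automobiles", "Engines"],
    "doc2": ["Automobiles"],
    "doc3": ["Bicycles"],
}
index = build_descriptor_index(assignments)
both = exact_match(index, ["Automobiles", "Engines"])
```

Because matching is on descriptors rather than document words, the same lookup works regardless of the language the document is written in.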

2006.04.27 - SLIDE 28IS 240 – Spring 2006

Multilingual Thesauri

• Adapt the knowledge structure
– Cultural differences influence indexing choices

• Use language-independent descriptors
– Matched to a unique term in each language

• Three construction techniques [46]
– Build it from scratch
– Translate an existing thesaurus
– Merge monolingual thesauri

2006.04.27 - SLIDE 29IS 240 – Spring 2006

Advantages over Free Text

• High-quality concept-based indexing
– Descriptors need not appear in the document

• Knowledge-guided searching
– Good thesauri capture expert domain knowledge

• Excellent cross-language effectiveness
– Up to 100% of monolingual effectiveness

• Understandable retrieval results

• Efficient implementation

2006.04.27 - SLIDE 30IS 240 – Spring 2006

Limitations

• Costly to create
– Design knowledge structure, index each document

• Costly to maintain
– Document indexing, vocabulary and concept change

• Hard to use
– Vocabulary choice, knowledge structure navigation

• Limited scope
– Domain must be chosen at design time

2006.04.27 - SLIDE 31IS 240 – Spring 2006

Query vs. Document Translation

• Query translation
– Very efficient for short queries
  • Not as big an advantage for relevance feedback
– Hard to resolve ambiguous query terms

• Document translation
– May be needed by the selection interface
  • And supports adaptive filtering well
– Slow, but only need to do it once per document
  • Poor scale-up to large numbers of languages

2006.04.27 - SLIDE 32IS 240 – Spring 2006

Document Translation Example

• Approach
– Select a single query language
– Translate every document into that language
– Perform monolingual retrieval

• Long documents provide enough context
– And many translation errors do not hurt retrieval

• Much of the generation effort is wasted
– And choosing a single translation can hurt

2006.04.27 - SLIDE 33IS 240 – Spring 2006

Query Translation Example

• Select controlled vocabulary search terms

• Retrieve documents in desired language

• Form monolingual query from the documents

• Perform a monolingual free text search

[Figure: flow diagram with boxes labeled Information Need, Thesaurus, Controlled Vocabulary Multilingual Text Retrieval System, Alta Vista, French Query Terms, English Abstracts, English Web Pages]

2006.04.27 - SLIDE 34IS 240 – Spring 2006

Machine Readable Dictionaries

• Based on printed bilingual dictionaries
– Becoming widely available

• Used to produce bilingual term lists
– Cross-language term mappings are accessible
  • Sometimes listed in order of most common usage
– Some knowledge structure is also present
  • Hard to extract and represent automatically

• The challenge is to pick the right translation

2006.04.27 - SLIDE 35IS 240 – Spring 2006

Unconstrained Query Translation

• Replace each word with every translation
– Typically 5-10 translations per word

• About 50% of monolingual effectiveness
– Main problem is ambiguity
– Example: Fly (English)
  • 8 word senses (e.g., to fly a flag)
  • 13 Spanish translations (enarbolar, ondear, …)
  • 38 English retranslations (hoist, brandish, lift…)
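Unconstrained query translation is simple enough to sketch in a few lines. The bilingual term list below is a tiny assumed sample: only the two Spanish translations of "fly" named on the slide are included, and the entry for "flag" is an illustrative addition. Words missing from the list are passed through unchanged, which is one common choice and an assumption here.

```python
def translate_query(query, term_list):
    """Replace each query word with every translation in the bilingual
    term list; keep words that have no entry as they are."""
    translated = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    return translated


# Assumed toy English-to-Spanish term list
en_es = {
    "fly": ["enarbolar", "ondear"],
    "flag": ["bandera"],
}
spanish_query = translate_query("fly flag", en_es)
```

With 5-10 translations per word in a real term list, the translated query grows quickly and picks up unrelated senses, which is the ambiguity problem noted above.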

2006.04.27 - SLIDE 36IS 240 – Spring 2006

Phrase Indexing

• Improves retrieval effectiveness two ways
– Phrases are less ambiguous than single words
– Idiomatic phrases translate as a single concept

• Three ways to identify phrases
– Semantic (e.g., appears in a dictionary)
– Syntactic (e.g., parse as a noun phrase)
– Cooccurrence (words found together often)

• Semantic phrase results are impressive

2006.04.27 - SLIDE 37IS 240 – Spring 2006

Types of Bilingual Corpora

• Parallel corpora: translation-equivalent pairs
– Document pairs
– Sentence pairs
– Term pairs

• Comparable corpora
– Content-equivalent document pairs

• Unaligned corpora
– Content from the same domain

2006.04.27 - SLIDE 38IS 240 – Spring 2006

Generating Parallel Corpora

• Parallel corpora are naturally domain-tuned
– Finding one for the right domain may be hard

• Alternative is to build one
– Start with a monolingual corpus
– Automatic machine translation for second language

• Worthwhile when IR technique is faster than MT
– If translation errors don’t hurt the IR technique

• Good results with Latent Semantic Indexing

2006.04.27 - SLIDE 39IS 240 – Spring 2006

Pseudo-Relevance Feedback

[Figure: flow diagram with boxes labeled French Query Terms, French Text Retrieval System, Parallel Corpus, Top ranked French Documents, English Translations, Alta Vista, English Web Pages]

• Enter query terms in French

• Find top French documents in parallel corpus

• Construct a query from English translations

• Perform a monolingual free text search
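The four steps above can be sketched as follows, stopping just before the final monolingual search. This is a toy illustration: the parallel corpus is an assumed list of (French, English) text pairs, ranking is by raw term counts, and no stopword removal is applied, so a real system would pick query terms more carefully.

```python
from collections import Counter


def build_english_query(french_terms, parallel_corpus, top_docs=2, top_terms=3):
    """Rank the French side of a parallel corpus against the query, then
    build an English query from the translations of the top documents."""
    scored = []
    for position, (french, _english) in enumerate(parallel_corpus):
        tokens = french.lower().split()
        score = sum(tokens.count(term) for term in french_terms)
        if score > 0:
            scored.append((score, position))
    # Best-matching French documents first; ties keep corpus order
    scored.sort(key=lambda pair: (-pair[0], pair[1]))

    counts = Counter()
    for _, position in scored[:top_docs]:
        counts.update(parallel_corpus[position][1].lower().split())
    return [term for term, _ in counts.most_common(top_terms)]


# Assumed toy parallel corpus of (French, English) pairs
corpus = [
    ("le chat noir", "the black cat"),
    ("le chien", "the dog"),
    ("chat et chat", "cat and cat"),
]
english_query = build_english_query(["chat"], corpus, top_docs=2, top_terms=1)
```

The resulting English terms would then be submitted to any monolingual free text search engine.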

2006.04.27 - SLIDE 40IS 240 – Spring 2006

Similarity-Based Dictionaries

• Automatically developed from aligned documents
– Reflects language use in a specific domain

• For each term, find most similar in other language
– Retain only the top few (5 or so)

• Performs as well as dictionary-based techniques
– Evaluated on a comparable corpus of news stories [98]
  • Stories were automatically linked based on date and subject
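One way to realize "find most similar in other language" is to describe each term by the aligned document pairs it occurs in and compare those profiles with cosine similarity. The sketch below assumes that representation; the similarity measure used in the cited evaluation may differ, and the function names and toy documents are hypothetical.

```python
import math
from collections import defaultdict


def term_doc_vectors(docs):
    """Represent each term by its counts over document positions."""
    vectors = defaultdict(dict)
    for position, text in enumerate(docs):
        for token in text.lower().split():
            vectors[token][position] = vectors[token].get(position, 0) + 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def similarity_dictionary(source_docs, target_docs, keep=5):
    """For each source term, keep the `keep` most similar target terms.
    source_docs[i] and target_docs[i] are assumed to be an aligned pair."""
    source = term_doc_vectors(source_docs)
    target = term_doc_vectors(target_docs)
    result = {}
    for s_term, s_vec in source.items():
        ranked = sorted(((cosine(s_vec, t_vec), t_term)
                         for t_term, t_vec in target.items()),
                        key=lambda pair: (-pair[0], pair[1]))
        result[s_term] = [t for score, t in ranked[:keep] if score > 0]
    return result


french_docs = ["chat noir", "chien noir", "chat blanc"]
english_docs = ["black cat", "black dog", "white cat"]
sim_dict = similarity_dictionary(french_docs, english_docs, keep=1)
```

Keeping only the top few candidates per term limits the noise that weaker associations would add to a translated query.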

2006.04.27 - SLIDE 41IS 240 – Spring 2006

Latent Semantic Indexing

• Designed for better monolingual effectiveness
– Works well across languages too [27]
  • Cross-language is just a type of term choice variation

• Produces short dense document vectors
– Better than long sparse ones for adaptive filtering
  • Training data needs grow with dimensionality
– Not as good for retrieval efficiency
  • Always 300 multiplications, even for short queries
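For reference, the standard LSI formulation is summarized below. The notation is the usual textbook one and is not taken from the slides: X is the term-by-document matrix, k is the number of retained dimensions (k = 300 would match the "300 multiplications" per comparison mentioned above), and q is a query term vector. In the cross-language setting, X is commonly described as being built from training documents that combine a text with its translation, so that terms from both languages share one reduced space.

```latex
% Truncated SVD of the term-by-document matrix X, keeping k dimensions
X \approx U_k \Sigma_k V_k^{\top}

% Fold a query (or new document) term vector q into the k-dimensional space
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Rank documents by cosine similarity in the reduced space
\mathrm{sim}(\hat{q}, \hat{d}) =
  \frac{\hat{q} \cdot \hat{d}}{\lVert \hat{q} \rVert \, \lVert \hat{d} \rVert}
```

Since every reduced vector is dense, each comparison costs k multiplications regardless of how few terms the original query had.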

2006.04.27 - SLIDE 42IS 240 – Spring 2006

Cooccurrence-Based Dictionaries

• Align terms using cooccurrence statistics
– How often does a term pair occur in sentence pairs?
  • Weighted by relative position in the sentences
– Retain term pairs that occur unusually often

• Useful for query translation
– Excellent results when the domain is the same

• Also practical for document translation
– Term use variations to reinforce good translations
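A minimal sketch of cooccurrence-based term alignment over sentence pairs follows. The Dice coefficient is used here as one possible way to decide that a pair occurs "unusually often"; that choice, the threshold, and the toy sentences are assumptions, and the position weighting mentioned above is left out.

```python
from collections import Counter
from itertools import product


def cooccurrence_pairs(sentence_pairs, threshold=0.8):
    """Count how often each (source, target) term pair shares an aligned
    sentence pair, and keep pairs whose Dice score reaches the threshold."""
    source_count, target_count, pair_count = Counter(), Counter(), Counter()
    for source, target in sentence_pairs:
        s_terms = set(source.lower().split())
        t_terms = set(target.lower().split())
        source_count.update(s_terms)
        target_count.update(t_terms)
        pair_count.update(product(s_terms, t_terms))

    kept = {}
    for (s, t), together in pair_count.items():
        dice = 2 * together / (source_count[s] + target_count[t])
        if dice >= threshold:
            kept[(s, t)] = dice
    return kept


aligned = [
    ("chat noir", "black cat"),
    ("chien noir", "black dog"),
    ("chat blanc", "white cat"),
]
term_pairs = cooccurrence_pairs(aligned)
```

On this toy data the accidental pairs such as (chat, black) score 0.5 or 0.67 and fall below the threshold, while the true translations score 1.0.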

2006.04.27 - SLIDE 43IS 240 – Spring 2006

Language Identification

• Can be specified using metadata
– Included in HTTP and HTML

• Determined using word-scale features
– Which dictionary gets the most hits?

• Determined using subword features
– Letter n-grams in electronic and printed text
– Phoneme n-grams in speech
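The letter n-gram approach can be illustrated with a very small sketch: build a trigram profile per language from sample text and pick the language whose profile overlaps most with the input. The training sentences, the overlap score, and the function names are assumptions chosen for brevity; practical identifiers train on far more text and use better-calibrated scoring.

```python
from collections import Counter


def ngrams(text, n=3):
    """Count letter n-grams, padding with spaces to capture word edges."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def identify_language(text, profiles):
    """Return the language whose n-gram profile overlaps most with the text."""
    target = ngrams(text)

    def overlap(profile):
        # Counter returns 0 for n-grams the profile has never seen
        return sum(min(count, profile[gram]) for gram, count in target.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))


# Tiny assumed training samples, one per language
profiles = {
    "en": ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "fr": ngrams("le renard brun saute par dessus le chien paresseux et le chat"),
}
guess_en = identify_language("the dog", profiles)
guess_fr = identify_language("le chien", profiles)
```

The same scheme carries over to speech by replacing letters with phonemes, as the slide notes.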

2006.04.27 - SLIDE 44IS 240 – Spring 2006

Research Directions

• User needs assessment

• Evaluation

• Corpus construction

• Word sense disambiguation

• System integration

• Probabilistic models

• Adaptive filtering

2006.04.27 - SLIDE 45IS 240 – Spring 2006

Evaluation

• Most critical need is for side by side tests
– TREC did this for French/German/Italian

• Domain shift metric
– Domain shift hurts corpus-based techniques
– Need a way to measure severity of the shift

• Test collections for adaptive filtering
– From cross-language recall/precision evaluation

2006.04.27 - SLIDE 46IS 240 – Spring 2006

Corpus Construction

• Corpus-based techniques have great potential

• Parallel corpora are rare and expensive
– Find it, reverse engineer the links, clean it up

• Unlinked corpora are of limited value
– Context linking research could change that [77]

• Comparable corpora offer middle ground
– Need to develop automatic linking techniques
– Also need a metric for degree of comparability

2006.04.27 - SLIDE 47IS 240 – Spring 2006

Find and Interpret Information Vital to National Security

The Tamil National leader, Mr. V. Pirapaharan delivered a speech on 13 May 1998, the anniversary of the launch of Sri Lanka's biggest and longest assault on the Tamil homelands, describing how the LTTE defended against Sri Lanka's latest military ambitions. Here’s what he said:

62 Million people in South India and Sri Lanka can read this

• Find and retrieve information in unfamiliar languages

• Translate it into English

• Extract and correlate its content against other materials

TIDES

2006.04.27 - SLIDE 48IS 240 – Spring 2006

The Challenges

Tamil document

Translation (manual)
Today is a significant day in the history of our national liberation struggle, it marks the end of a year during which we have resisted and fought against the biggest ever offensive operation launched by the Sri Lankan armed forces code named "Jayasikuru"...

Topic Detection (experimental)
• Liberation Tigers of Tamil Eelam (LTTE)
• Sri Lanka
• Velupillai Pirapaharan
• Rebellion

Summarization (key sentences)
The objective of the Sinhala chauvinists was to utilize maximum man power and fire power to destroy the military capability of the LTTE and to bring an end to the Tamil freedom movement. Before the launching of the operation "Jayasikuru" the Sri Lankan political and military high command miscalculated the military strength and determination of the LTTE.

Extraction (special-purpose)
Org Leader HQ Losses
Sinhala Kumaratunga 3000
LTTE Pirapaharan Wanni 1300

Tamil document analysis

2006.04.27 - SLIDE 49IS 240 – Spring 2006

Cross-Language IR on the Web

• http://www.clis.umd.edu/dlrg/clir/
– Most workshop proceedings
– Lots of papers and project descriptions
– Links to working systems
  • Including 2 web search engines
– Useful linguistic resources
– BibTeX for the attached bibliography

2006.04.27 - SLIDE 50IS 240 – Spring 2006

Today

• Review
– NLP for IR
– Text Summarization

• Cross-Language Information Retrieval
– Introduction
– Cross-Language EVIs

Credit for some of the material in this lecture goes to Doug Oard (University of Maryland) and to Fredric Gey and Aitao Chen

2006.04.27 - SLIDE 51IS 240 – Spring 2006

The “Entry Vocabulary Index” Approach to Multilingual Search

Ray R. Larson, Fredric Gey, Aitao Chen, Michael Buckland

University of California, Berkeley

School of Information Management and Systems and UC Data

Harvesting Translingual Vocabulary Mappings for

Multilingual Digital Libraries

Note: This talk was presented at the 2002 JCDL (Joint Conference on Digital Libraries)

2006.04.27 - SLIDE 52IS 240 – Spring 2006

Overview

• What are Entry Vocabulary Indexes?
– EVI Research at Berkeley
– Notion of an EVI
– How are EVIs Built

• Berkeley Multilingual EVI
– Technology components
– Database
– Examples of operation

• Ongoing research

2006.04.27 - SLIDE 53IS 240 – Spring 2006

Entry Vocabulary Index Research Projects at Berkeley

• DARPA Information Management Program
– “Search Support for Unfamiliar Metadata Vocabularies”

• Institute for Museum and Library Services
– “Seamless Searching of Numeric and Textual Resources”

• DARPA TIDES program
– “Translingual Information Management Using Domain Ontologies”

• NSF/NASA/DARPA: DLI-2 (IDL)
– “Discovery and Use of Textual, Numeric and Spatial Data”

2006.04.27 - SLIDE 54IS 240 – Spring 2006

The IMLS project: To demonstrate improved access to written material and numerical data on the same topic when searching two very different databases:

--- books, articles, and their bibliographic records;

--- numerical data in socio-economic databases.

PHASE I: A library gateway providing search support for searching both text and socio-economic numeric databases. The gateway would accept a query in the library users’ own terms and would suggest which terms to use in the specialized categorization of the resource to be searched.

PHASE II: Demonstration of a library gateway supporting searches between text and numeric databases. If you found something interesting in a socio-economic database, the gateway would help you to find documents on the same topic in a text database, and vice versa.

2006.04.27 - SLIDE 55IS 240 – Spring 2006

TIDES Project

• Translingual Information Detection, Extraction and Summarization
– Building EVIs to map across languages

• Using same notion with training data in different languages

• Using Library of Congress Subject Headings from the CDL MELVYL database

2006.04.27 - SLIDE 56IS 240 – Spring 2006

What is an Entry Vocabulary Index?

• EVIs are a means of mapping from user’s vocabulary to the controlled vocabulary of a collection of documents…

2006.04.27 - SLIDE 57IS 240 – Spring 2006

Start with a collection of documents.

2006.04.27 - SLIDE 58IS 240 – Spring 2006

Classify and index with controlled vocabulary.

Index

Ideally, use a database already indexed

2006.04.27 - SLIDE 59IS 240 – Spring 2006

Problem: Controlled Vocabularies can be difficult for people to use.

“pass mtr veh spark ign eng”

Index

2006.04.27 - SLIDE 60IS 240 – Spring 2006

Solution: Entry Level Vocabulary Indexes.

Index

EVI

“pass mtr veh spark ign eng” = “Automobile”

2006.04.27 - SLIDE 61IS 240 – Spring 2006

EVI example

User Query: “Automobile”

EVI 1
Index term: “pass mtr veh spark ign eng”

EVI 2
Index term: “automobiles” OR “internal combustion engines”
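The mapping in this example can be sketched as a lookup table learned from records that are already indexed. The sketch below is an assumption-laden toy: it uses raw cooccurrence counts between title words and index terms, whereas the Berkeley EVIs use a statistical association measure, and the training records and function names are hypothetical. Only the index term string is taken from the slide.

```python
from collections import Counter, defaultdict


def build_evi(records):
    """Learn word-to-index-term associations from already-indexed records.
    Each record is (title text, list of controlled-vocabulary terms)."""
    evi = defaultdict(Counter)
    for title, index_terms in records:
        for word in set(title.lower().split()):
            evi[word].update(index_terms)
    return evi


def suggest(evi, query, top=3):
    """Rank controlled-vocabulary terms for a query in the user's own words."""
    scores = Counter()
    for word in query.lower().split():
        scores.update(evi.get(word, {}))
    return [term for term, _ in scores.most_common(top)]


# Hypothetical training records
records = [
    ("automobile repair manual", ["pass mtr veh spark ign eng"]),
    ("used automobile prices", ["pass mtr veh spark ign eng"]),
    ("bicycle repair", ["bicycles"]),
]
evi = build_evi(records)
suggestion = suggest(evi, "Automobile", top=1)
```

Building one such table per collection gives the separate EVI 1 and EVI 2 mappings shown above for the same user query.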

2006.04.27 - SLIDE 62IS 240 – Spring 2006

But why stop there?

Index

EVI

2006.04.27 - SLIDE 63IS 240 – Spring 2006

“Which EVI do I use?”

[Figure: several Index and EVI boxes]

2006.04.27 - SLIDE 64IS 240 – Spring 2006

EVI to EVIs

[Figure: several Index and EVI boxes, plus EVI2]

2006.04.27 - SLIDE 65 IS 240 – Spring 2006

Find: Plutonium

In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Why not treat language the same way?

2006.04.27 - SLIDE 66 IS 240 – Spring 2006

Find: Plutonium

In: Arabic, Chinese, Greek, Japanese, Korean, Russian, Tamil

Statistical association: $W(C,t) = 2[\log L(p_1, a, a+b) + \ldots]$

Digital library resources

2006.04.27 - SLIDE 67 IS 240 – Spring 2006

Background on Online Library Catalogs

• Library catalogs have been automated at a furious pace worldwide since the late ’70s

• Library objects (books, maps, pictures) in 400+ languages

• Bibliographic descriptions contain one or more sentences from a particular language (transliterated)

• Objects have been classified by subject by librarians
  – Library of Congress Subject Heading (Islamic Fundamentalism)
  – Library of Congress Classification (BP60, BP63, KF27)
  – Dewey Decimal Classification (297.2, 306.6, 320.5)

• International standard (MARC) for catalog metadata

• Huge number of remotely searchable catalogs worldwide accessible using the international search/retrieve protocol Z39.50

2006.04.27 - SLIDE 68 IS 240 – Spring 2006

What can libraries and their catalogs provide?

• Millions of sentences in multiple languages

• Sentences with topical content identified from 150,000 Library of Congress Subject Headings

• Transfer point (interlingua) between English topics and words in other languages

• Can be used to create:
  – Bilingual dictionaries
  – Query expansion in cross-language information retrieval

2006.04.27 - SLIDE 69 IS 240 – Spring 2006

Search: SUBJECT “Islamic Fundamentalism” and LANGUAGE “Arabic”

Yield: 119 Arabic language samples on topic “Islamic Fundamentalism”

2006.04.27 - SLIDE 70 IS 240 – Spring 2006

Our Training Set and Prototype

• University of California/CDL MELVYL catalog

• Private copy, 10 million+ records (5 million non-English)

• Records in over 100 languages

• Obtained in MARC database standard format

• Foreign language titles use Library of Congress transliteration (Romanization) standard

• Prototype search software maps from/to English and
  – Arabic, Chinese, French, German
  – Italian, Japanese, Russian, Spanish

2006.04.27 - SLIDE 71 IS 240 – Spring 2006

Technical Details

Building an Entry Vocabulary Module (EVI)

[Flow diagram; box labels:]

Internet DB indexed with a controlled vocabulary.

Download a set of training data.

Part of speech tagging (for noun phrases).

Extract terms (words and noun phrases) from titles and abstracts.

Build associations between extracted terms & controlled vocabularies.

2006.04.27 - SLIDE 72 IS 240 – Spring 2006

Association Measure

        C     ¬C
t       a     b
¬t      c     d

Where t is the occurrence of a term and C is the occurrence of a classification in the training set
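As a rough illustration of where the four cell counts come from, the sketch below tallies a, b, c and d for one (term, classification) pair over a training set. Everything named here is an assumption for illustration: `TOY_RECORDS` is invented data, and each record is simply modelled as a pair of sets (extracted terms, assigned classifications).

```python
# Invented toy training set: (terms extracted from the title,
# classifications assigned by the indexer) for each record.
TOY_RECORDS = [
    ({"automobile", "engine"}, {"pass mtr veh spark ign eng"}),
    ({"automobile"}, {"pass mtr veh spark ign eng"}),
    ({"automobile"}, {"other heading"}),
    ({"bicycle"}, {"pass mtr veh spark ign eng"}),
    ({"bicycle"}, {"other heading"}),
]


def contingency_counts(records, term, classification):
    """Return (a, b, c, d) for the 2x2 table of term t vs. classification C.

    a: t and C      b: t and not C
    c: not t and C  d: not t and not C
    """
    a = b = c = d = 0
    for terms, classifications in records:
        has_t = term in terms
        has_c = classification in classifications
        if has_t and has_c:
            a += 1
        elif has_t:
            b += 1
        elif has_c:
            c += 1
        else:
            d += 1
    return a, b, c, d


if __name__ == "__main__":
    print(contingency_counts(TOY_RECORDS, "automobile",
                             "pass mtr veh spark ign eng"))
```

In practice one pass over the records would accumulate counts for all pairs at once; the single-pair version is kept here only because it mirrors the table directly.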

2006.04.27 - SLIDE 73 IS 240 – Spring 2006

Association Measure

• Maximum Likelihood ratio

$$W(C,t) = 2\left[\log L(p_1, a, a+b) + \log L(p_2, c, c+d) - \log L(p, a, a+b) - \log L(p, c, c+d)\right]$$

where

$$\log L(p, k, n) = k \log p + (n - k)\log(1 - p)$$

and

$$p_1 = \frac{a}{a+b}, \qquad p_2 = \frac{c}{c+d}, \qquad p = \frac{a+c}{a+b+c+d}$$

Vis. Dunning
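A minimal sketch of the likelihood-ratio weight, assuming the cell counts a, b, c, d of the 2x2 table are already known. The function names are illustrative, not from the lecture. `log_l(p, k, n)` takes k occurrences out of n trials, which is the order in which the formula calls it (for example logL(p1, a, a+b)), and terms of the form 0·log 0 are treated as 0 so that empty cells do not raise an error.

```python
import math


def _xlogy(x, y):
    """x * log(y), using the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def log_l(p, k, n):
    """Binomial log-likelihood (without the constant binomial coefficient)
    of k occurrences in n trials with success probability p."""
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def association_weight(a, b, c, d):
    """W(C, t) from the 2x2 counts; assumes a + b > 0 and c + d > 0."""
    p1 = a / (a + b)                # P(C | t)
    p2 = c / (c + d)                # P(C | not t)
    p = (a + c) / (a + b + c + d)   # P(C) overall
    return 2.0 * (log_l(p1, a, a + b) + log_l(p2, c, c + d)
                  - log_l(p, a, a + b) - log_l(p, c, c + d))


if __name__ == "__main__":
    # Hypothetical counts: the term co-occurs strongly with the classification.
    print(association_weight(50, 5, 5, 940))
    # Hypothetical counts: term and classification are independent.
    print(association_weight(10, 20, 30, 60))
```

When the term and the classification are independent, p1, p2 and p coincide and the weight is zero; the stronger the co-occurrence, the larger the weight, so candidate terms for a classification can be ranked by it.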

2006.04.27 - SLIDE 74 IS 240 – Spring 2006

Example: Library of Congress Subject Heading “Islamic Fundamentalism” yields most closely associated words in multiple languages

2006.04.27 - SLIDE 75 IS 240 – Spring 2006

Non-English words can be mapped to English subject headings

2006.04.27 - SLIDE 76 IS 240 – Spring 2006

Examples

2006.04.27 - SLIDE 77 IS 240 – Spring 2006

Catalog Languages vs. FBIS Languages (University of California online catalog, 10 million records)

Approx. language distribution (Berkeley: # sentences; FBIS: est. # lines of source)

Language      Berkeley   FBIS        Language         Berkeley   FBIS
German         840,032    49,872     Danish             41,517   18,688
Spanish        614,025   388,772     Hebrew             41,468    3,500
French         609,089     2,871     Czech              35,432    3,647
Russian        341,050    15,415     Urdu               30,206
Italian        266,424       254     Turkish            30,015
Portuguese     149,389    24,930     Bulgarian          27,850
Chinese        127,636   246,549     Norwegian          26,478   13,596
Japanese       110,956               Korean             25,979   68,607
Arabic          96,124   (8263)*     Rumanian           25,874
Dutch           90,170               Finnish            25,027    8,187
Latin           88,818               Thai               24,693
Polish          81,698               Serbo-Croatian     24,601   36,139
Indonesian      59,445               Greek              23,926
Swedish         53,854    16,652     Bengali            23,430
Hungarian       46,330     6,631     Catalan            20,392
Hindi           42,886               Tamil              20,232

*English only, no source text

106 languages with > 500 records

2006.04.27 - SLIDE 78 IS 240 – Spring 2006

Future Research

• Add content from other online library catalogs
  – RLIN (>30M records, >900K Chinese, >250K Arabic)
  – COPAC [UK] (9M records, 40K Arabic)

• Transliteration and back-transliteration for scripted languages

• Phrase mapping (POS tagging for English, bigram-trigram identification for target languages using mutual information)

• Further evaluation (TREC, CLEF, NTCIR and local analysis)