TRANSCRIPT
NLP and IR: Coming Closer or Moving Apart
Pushpak Bhattacharyya
Computer Science and Engineering Department
IIT Bombay
www.cse.iitb.ac.in/~pb
Acknowledgement: Manoj Chinnakotla, Arjun Atreya, Mitesh Khapra and many others
Classical Information Retrieval (Simplified)
[Diagram: a query goes into the retrieval model (a.k.a. ranking algorithm), which uses a document representation to return relevant documents.]
40+ years of work (late 1960s to 2010) in designing better models:
• Vector space models
• Binary independence models
• Network models
• Logistic regression models
• Bayesian inference models
• Hyperlink retrieval models
Simple analysis (parsing, tokenization, applying synonyms, …) on queries and documents; complex and intricate ranking algorithms do all the heavy lifting.
(Courtesy: Dr. Sriram Raghvan, IBM India Research Lab)
The elusive user satisfaction
[Diagram: user satisfaction depends on many components – ranking, correctness of query processing, coverage, NER, stemming, MWE handling, crawling, and indexing.]
How NLP has used IR
• Web looked upon as a huge repository of evidence for language phenomena.
• PMI measure used for co-occurrence and collocation, leading to interesting NLP problems like malapropism detection (Bolshakov and Gelbukh, 2003) and evaluation of synset quality (outlier detection)
• PageRank used for WSD
• Recent important problem: getting a dictionary from comparable corpora, which abound on the web – (Bo Li & Eric Gaussier: Improving corpus comparability for bilingual lexicon extraction from comparable corpora. COLING 2010)
How IR has used NLP
• Query disambiguation
– WSD: very necessary, but does it make a difference?
– Different technique than in NLP; very small context
• Morphology: better than a statistical stemmer? (McNamee et al., SIGIR 2009)
• Capture term relationships
– Random walk (Kevyn Collins-Thompson and Jamie Callan, CIKM 2005)
– Lexical Semantic Association/Indexing (Deerwester et al., 1990)
• Search quality management: understanding queries (Fagin et al., PODS 2010)
Road map
1. A perspective on the relationship between NLP and IR
a) How NLP has used IR
b) How IR has used NLP
2. NLP using IR
Into the Heartland of NLP: Malapropism
3. IR using NLP
MultiPRF: a way of disambiguation in IR leveraging multilinguality
4. Conclusions and future directions
Into the heartland of NLP: Malapropism detection (Bolshakov and Gelbukh, 2003)
• Unintended replacement of one content word by another existing content word similar in sound but semantically incompatible
• Immortalized by Mrs. Malaprop in Sheridan's The Rivals
– "Why, murder's the matter! slaughter's the matter! killing's the matter! But he can tell you the perpendiculars."
– "He is the very pineapple of politeness."
Different from… • Spelling error
– They travel around the workd (l and k are adjacent on the KBD)
Needs detection and correction: solved problem
Different from… • Eggcorn (idiosyncratic substitution, but plausible)
– ex-patriot instead of expatriate
– on the spurt of the moment instead of on the spur of the moment
Needs detection and correction, but not critical: needs to be solved
Different from… • Spoonerism (error in speech or deliberate play on words in which
corresponding consonants, vowels, or morphemes are switched)
– "The Lord is a shoving leopard." (a loving shepherd)
– "A blushing crow." (crushing blow)
– "You have hissed all my mystery lectures." (missed all my history lectures)
Needs detection and correction: needs to be solved
Different from… • Pun
– Is life worth living? That depends a lot on the liver (both meanings plausible)
– We're not getting anywhere in geometry class. It feels like we're going in circles.
Should NOT be corrected
Motivation for Malapropism Detection
• Interactive (manual) editing
• Detect and correct errors
• Spell checking and correction is practically a solved problem
• Grammar checking and correction still needs vast improvement
• Semantic incoherence: very difficult to detect and correct
Solution proposed using Google Search and Mutual Information
A pair (V, W) is a collocation if its pointwise mutual information is positive:
MI(V, W) = log( N(V,W) · Nmax / (N(V) · N(W)) ) > 0
where
– N(V,W) is the number of web pages where V and W co-occur
– N(V) and N(W) are the numbers of web pages containing V and W separately
– Nmax is the total number of web pages managed by Google for the given language
As a bottom approximation to Nmax, the number N(MFW) of pages containing the most frequent word MFW can be taken. For English, MFW is 'the', and N('the') is evaluated at 2.5 billion pages.
This inequality rejects a potential collocation if its components co-occur in a statistically insignificant quantity.
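As a toy illustration, the criterion can be computed directly from page counts. This is only a sketch: the function name, the zero threshold, and the page counts below are illustrative, not from the talk.

```python
import math

def is_collocation(n_vw, n_v, n_w, n_max, threshold=0.0):
    """Web-based collocation test via pointwise mutual information:
    accept (V, W) when log(N(V,W) * Nmax / (N(V) * N(W))) > threshold.

    n_vw  -- pages where V and W co-occur
    n_v   -- pages containing V; n_w -- pages containing W
    n_max -- total pages for the language, approximated by N(MFW)
    """
    if n_vw == 0:
        return False  # no co-occurrence at all: reject outright
    pmi = math.log((n_vw * n_max) / (n_v * n_w))
    return pmi > threshold

# Made-up page counts; Nmax approximated by N('the') = 2.5 billion.
print(is_collocation(n_vw=340_000, n_v=10_000_000, n_w=5_000_000,
                     n_max=2_500_000_000))  # True: frequent co-occurrence
```

A malapropos variant typically has a near-zero co-occurrence count, so the same call returns False for it.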
Experimental Results (page counts for the correct vs. the malapropos version)

#      Possible collocation      Correct version   Malapropos version
1      travel around the word    55400             20
2      swim to spoon             23                0
3      take for granite          340000            15
4      bowels are pronounced     767               0
5      loose vowels              2750              1320
6(a)   wear turbines             3640              30
6(b)   turbines on the heads     25                0
7      ingenuous machine         805               6
8      affordable germs          1840              9
9      dielectric materialism    1080              4
10(a)  equal excess              457000            990
10(b)  excess to school          19100             4
11     ironic columns            5560              28
12     activated by irritation   22                10
13     histerical center         90000             7
14     scientific hypotenuse     7050              0
Cross Lingual Search: Motivation
• English is still the most dominant language on the web: it contributes 72% of the content
• The number of non-English users is steadily rising all over the world
• English penetration in India: estimated to be around 3-4%, mostly the urban educated class
• Need to enable access to the above information through local languages
[Diagram: a Hindi query "तिरुपति यात्रा" (Tirupati Yatra, "Tirupati pilgrimage") goes to the CLIR engine, which uses language resources and a target-language index in English over crawled and indexed web pages to find the target information in English, returning a ranked list of results with result snippets in Hindi, e.g. (translated): "Many means are available to reach the holy town of Tirupati. If you are travelling from Mumbai, you can take the Mumbai–Chennai Express."]
[Diagram: CLIA system architecture. Input query processing: query translation with translation disambiguation, machine transliteration, NER identification, and language analysis (stemming and stopword removal), producing a translated and disambiguated query. Crawling and indexing: a focused crawler over the Web (WWW), font transcoder, CMLifier, language identifier, and NE/MWE identification feed an MLIR index. Search and ranking uses a multilingual dictionary; the CLIA GUI presents results with summary and snippet via snippet generation and translation, template-based information extraction, and document summarization.]
IR using Language Typology and NLP:
Multilingual Pseudo Relevance Feedback
(Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual Relevance Feedback: One Language Can Help Another, Conference of Association of Computational Linguistics (ACL 2010), Uppsala, Sweden, July 2010.)
User Information Need
• Expressed as a short query (average length 2.5 words)
• Needs query expansion
• Lexical-resource-based expansion did not deliver (Voorhees 1994)
– Paradigmatic association (synonyms, antonyms, hyponyms and hypernyms)
– Introduces severe topic drift through unrelated senses of expansion terms
– Also through irrelevant senses of query terms
Illustration
Query word: "Madrid bomb blast case"
WordNet senses: {case, container} and {case, suit, lawsuit}; expanding "suit" further gives {suit, apparel}
Drifted topic due to the expanded term, and due to the inapplicable sense!
Query Expansion: Current Dominant Practice
• Syntagmatic expansion
• Through Pseudo-Relevance Feedback
• We show
– Multilingual PRF helps
– A familially related language helps still more
– Result of insight from linguistics and NLP
– Disambiguation by leveraging multilinguality
Language Modeling Approach to IR
• Offers a principled approach to IR
• Each document is modeled as a probability distribution – the Document Language Model
• The user information need is modeled as a probability distribution – the Query Model

Ranking Function – KL Divergence:
Score(D) = Σ_w P(w|Θ_R) · log P(w|D)
where P(w|Θ_R) is the importance of term w in the query and P(w|D) its importance in the document.

Problem of Retrieval ↔ Problem of Estimating P(w|Θ_R) and P(w|D)
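A minimal sketch of the KL-divergence ranking function, with made-up query and document models (the function name and all probabilities are illustrative):

```python
import math

def kl_score(query_model, doc_model, eps=1e-12):
    """Rank-equivalent KL-divergence score:
    Score(D) = sum_w P(w|theta_R) * log P(w|D).
    This equals -KL(theta_R || D) up to a constant that depends
    only on the query, so it induces the same ranking."""
    return sum(p * math.log(doc_model.get(w, eps))
               for w, p in query_model.items())

# Toy query model and two document language models (made-up values):
query = {"asthma": 0.6, "bronchial": 0.4}
doc_a = {"asthma": 0.05, "bronchial": 0.02, "the": 0.93}
doc_b = {"asthma": 0.001, "cooking": 0.999}
print(kl_score(query, doc_a) > kl_score(query, doc_b))  # doc_a ranks higher
```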
The Challenge – Estimating the Query Model Θ_R
• Average length of a query: 2.5 words
• Relevance feedback to the rescue: the user marks some documents from the initial ranked list as "relevant"
• Such judgments are usually difficult to obtain
Pseudo-Relevance Feedback (PRF)
[Diagram: the query Q is run against the document collection by the IR engine, producing initial scored results (d1 2.4, d2 2.1, d3 1.8, …, dm 0.01). The top 'k' documents are assumed relevant, a feedback model is learnt from them, the query relevance model is updated, and the corpus is re-ranked with the updated model to give the final results (d2 2.3, d1 2.2, d3 1.8, …, dm 0.01).]
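The PRF loop described above can be sketched in a few lines. This is an assumption-laden toy (function name, parameter values, and the documents are all invented), not the paper's implementation:

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, k=3, n_terms=5, alpha=0.6):
    """One round of pseudo-relevance feedback (sketch): assume the
    top-k documents are relevant, build a feedback distribution from
    their most frequent terms, and interpolate it with the original
    query model. Parameter values here are illustrative."""
    counts = Counter()
    for doc in ranked_docs[:k]:            # top-k assumed relevant
        counts.update(doc)
    top = counts.most_common(n_terms)      # strongest feedback terms
    total = sum(c for _, c in top)
    feedback = {w: c / total for w, c in top}
    query_model = {w: 1 / len(query_terms) for w in query_terms}
    # Updated model: alpha * query + (1 - alpha) * feedback
    vocab = set(query_model) | set(feedback)
    return {w: alpha * query_model.get(w, 0.0)
               + (1 - alpha) * feedback.get(w, 0.0) for w in vocab}

docs = [["europe", "union", "member", "state"],
        ["accession", "europe", "member"],
        ["union", "member", "country"],
        ["russia", "president"]]
model = prf_expand(["accession", "europe", "union"], docs)
print(model["member"] > 0)  # feedback pulled in "member"
```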
Limitations of PRF: Lexical and Semantic Non-Inclusion
Example: stemmed query "access europe union" (from "Accession to European Union"). The feedback model from the initially retrieved documents contains terms like europ, union, access, nation, russia, presid, getti, year, state. Relevant documents with terms like "Membership", "Member", "Country" are not ranked high enough by the final expanded query.
Previous attempts:
• Voorhees et al. used WordNet – negative results
• Random-walk models on WordNet, morphological variants, co-occurrence terms (Zhai et al.; Collins-Thompson and Callan)
Limitations of PRF: Lack of Robustness
Example: stemmed query "oliv oil mediterranean" (from "Olive Oil Production in Mediterranean"). The initially retrieved documents are about cooking, so the feedback model contains terms like oil, oliv, mediterranean, produc, cook, salt, pepper, serv, cup – the final expanded query drifts towards cooking.
Previous attempts:
• Refining the top document set
• Refining initial terms obtained through PRF
• Selective query expansion
• TREC Robustness Track – improving robustness
Can both Semantic Non-Inclusion and Lack of Robustness be solved?
• MultiPRF harnesses "multilinguality": take the help of a collection in a different language, called the "assisting language"
• Expectation of increased robustness, since we search in two collections
• An attractive proposition for languages that have poor monolingual performance due to
– Resource constraints like inadequate coverage
– Morphological complexity
Related Work
• Gao et al. (2009) use English to improve Chinese language ranking– Demonstrate only on a subset of queries– Experimentation on a small dataset– Uses cross-document similarity
Multilingual PRF: System Flow
[Diagram: the query in L1 is run for initial retrieval against the L1 index, and its translation into L2 against the L2 index. From the top 'k' results on each side, feedback models θ_L1 and θ_L2 are estimated. θ_L2 is translated back into L1 (θ_L1^Trans), the models are interpolated, and ranking is done with the final model.]
Feedback Model Translation
• The feedback model estimated in L2 (Θ_F^L2) is translated back into L1 (Θ_Trans^L1)
– using a probabilistic bilingual dictionary from L2 to L1
– learnt from parallel sentence-aligned corpora
• This is the back-translation step
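A minimal sketch of this back-translation and interpolation, with a toy bilingual dictionary (all names, probabilities, and the interpolation weight are invented for illustration):

```python
def translate_feedback(theta_l2, dictionary):
    """Back-translate a feedback model from the assisting language L2
    into L1: P(f|theta_trans) = sum_e P(f|e) * P(e|theta_L2).
    `dictionary` maps an L2 term to {L1 term: probability}."""
    theta_trans = {}
    for e, p_e in theta_l2.items():
        for f, p_fe in dictionary.get(e, {}).items():
            theta_trans[f] = theta_trans.get(f, 0.0) + p_fe * p_e
    return theta_trans

def interpolate(theta_l1, theta_trans, beta=0.5):
    """Final MultiPRF model: beta * theta_L1 + (1 - beta) * theta_trans."""
    vocab = set(theta_l1) | set(theta_trans)
    return {w: beta * theta_l1.get(w, 0.0)
               + (1 - beta) * theta_trans.get(w, 0.0) for w in vocab}

# English feedback model mapped into French with a toy dictionary:
theta_en = {"nation": 0.7, "state": 0.3}
dico = {"nation": {"nation": 0.6, "pays": 0.4},
        "state": {"état": 0.8, "pays": 0.2}}
theta_fr = translate_feedback(theta_en, dico)
```

Because each dictionary row sums to 1, the translated model is again a proper distribution, and interpolation keeps it one.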
Semantically Related Terms through Feedback Model Translation
Example (English–French word alignments): translating the feedback term "Nation" through the dictionary brings in Nation, Country, State, UN, United.
Example (German–English word alignments): "Flugzeug" brings in Aircraft, Plane, Aeroplane, Air, Flight.
SIGIR Findings: English Lends a Helping Hand!
• English used as assisting language– Good monolingual performance– Ease of processing
• MultiPRF consistently and significantly outperforms monolingual PRF baseline
(Manoj Chinnakotla, Karthik Raman and Pushpak Bhattacharyya, Multilingual PRF: English Lends a Helping Hand, SIGIR 2010, Geneva, Switzerland, July, 2010.)
Performance Study of Assisting Languages
• Do the results hold for languages other than English?
• What are the characteristics of a good assisting language?
• Can any language be used to improve the PRF performance of another language?
• Can this be extended to multiple assisting languages?
Experimental Setup
• European languages chosen
• Europarl corpora
• CLEF dataset
– Six languages from different language families
– French, Spanish (Romance)
– German, English, Dutch (West Germanic)
– Finnish (Baltic-Finnic)
– On more than 600 topics
• Google Translate used for query translation
Example: German query "Bronchial asthma", translated into Spanish as "El asma bronquial".
German PRF feedback model (θ_L1): chronisch (chronic), pet, athlet (athlete), erkrank (ill), gesund (healthy), tuberkulos (tuberculosis), patient, reis (rice), person.
Spanish feedback model (θ_L2): asma, bronquial, contamin, ozon, cient, enfermed, alerg, alergi, air.
After translate & interpolate, the MultiPRF model (θ_L1^Multi) contains: asthma, allergi, krankheit (disease), allerg (allergenic), chronisch, hauterkrank (skin disease), arzt (doctor), erkrank (ill).
MAP improves from 0.062 to 0.636!
Example: French query "Ingénierie Génétique" (genetic engineering), translated into Dutch as "Genetische Manipulatie".
French PRF feedback model (θ_L1): développ (developed), évolu (evolved), product, produit (product), moléculair (molecular).
Dutch feedback model (θ_L2): genetisch, manipulatie, exxon, dier (animal), visser (fisherman), gen.
After translate & interpolate, the MultiPRF model (θ_L1^Multi) contains: génet, ingénier, manipul, animal, pêcheur (fisherman), développ (developed), gen.
MAP improves from 0.145 to 0.357!
More than one assisting language
• Tried parallel composition for two assisting languages
• Uniform interpolation weights used
• Exhaustively tried all 60 combinations
• Improvements reported over best performing PRF of L1 or L2
Conclusions
• NLP and IR seem to need each other
– NLP needs IR for large-scale evidence of language phenomena
– IR needs NLP for high-quality, sophisticated search
• Very likely useful in extended-text query situations: Question Answering, Topic-Description-Narration
• MultiPRF uses another language to improve robustness and performance
• It can be looked upon as a way of disambiguation
• Robust, large-scale, language-independent WSD methods are needed in the face of resource-constrained situations (not-so-large amounts of sense-marked corpora)
URLs
• For resources
www.cfilt.iitb.ac.in• For publications
www.cse.iitb.ac.in/~pb
Tracing the Development of IR Models from
Term Presence-Absence to Query Meaning
Ranking: can NLP help IR
Description AND, OR and NOT operators
Features Based on term presence/absence. 1/0 decision
Query Collection of Logical Expressions
Relevance True or False
Term weighting NA
Benefits Easy to implement, high efficiency and clarity
Limitations Unnatural query formation. Too severe (All or nothing).
Standard Boolean Model
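The standard Boolean model above reduces to set operations over an inverted index. A toy sketch (the index contents are invented):

```python
def boolean_and(index, *terms):
    """Standard Boolean AND: intersect posting sets.
    Relevance is a 1/0 decision over whole documents; no ranking."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Toy inverted index: term -> set of document ids (illustrative data).
index = {"olive": {1, 2}, "oil": {1, 2, 3}, "engine": {3}}
print(sorted(boolean_and(index, "olive", "oil")))  # [1, 2]
```

The "all or nothing" limitation is visible directly: a document missing any one term drops out entirely.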
Description Query is in natural language. Converted to a Boolean query automatically.
Features Less strict than standard Boolean. Ranking of output.
Query Collection of logical expressions
Parameters Boolean operators and term frequency
Relevance True or False, with relevance ranking
Term weighting Term frequency
Benefits No need of Boolean operators. Assists in query reformulation.
Limitations Formulating extended Boolean queries needs domain and query expertise.
Extended Boolean Model
Description Ranking based on similarity between query vector and document vector
Features Simplifies query formulation. Provides ability to control output.
Query Collection of independent terms
Parameters Term frequency, inverse document frequency
Relevance Cosine similarity
Term weighting tf * idf
Benefits Term weights are not binary. Allows ranking of documents according to similarity.
Limitations Less clarity. Assumes terms are independent. False positives and false negatives.
Vector Space Model
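A toy sketch of vector-space ranking with tf·idf weights and cosine similarity (the idf values and documents are invented):

```python
import math
from collections import Counter

def tfidf_vector(terms, idf):
    """tf * idf weights for one bag of words (idf values assumed given)."""
    tf = Counter(terms)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(wt * v.get(w, 0.0) for w, wt in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

idf = {"olive": 2.0, "oil": 1.5, "press": 1.8, "engine": 2.2}
q  = tfidf_vector(["olive", "oil"], idf)
d1 = tfidf_vector(["olive", "oil", "oil", "press"], idf)
d2 = tfidf_vector(["engine", "oil"], idf)
print(cosine(q, d1) > cosine(q, d2))  # d1 is more similar to the query
```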
Description Ranking based on conditional probability of relevance
Features Addresses uncertainty in query representation.
Query Collection of independent terms
Parameters Probabilities of a term appearing in relevant and non-relevant document
Relevance Model based relevance
Term weighting Relevance feedback
Benefits Multiple models with different tuning of parameters.
Limitations Estimation of needed probabilities is difficult. Assumes terms to be independent.
Probabilistic Model (e.g., OKAPI BM25)
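Okapi BM25, the best-known instance of this family, can be sketched as follows. The formula is the standard one; the documents and query are invented toy data:

```python
import math

def bm25(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 with the usual parameter defaults k1=1.5, b=0.75.
    idf uses the common log(1 + (N - df + 0.5) / (df + 0.5)) form."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N   # average document length
    score = 0.0
    for w in set(query):
        df = sum(1 for d in docs if w in d)  # document frequency
        if df == 0:
            continue  # unseen terms contribute nothing
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = doc.count(w)  # term frequency in this document
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["olive", "oil", "press"], ["engine", "oil"], ["flower"]]
print(bm25(["olive", "oil"], docs[0], docs) >
      bm25(["olive", "oil"], docs[1], docs))  # True
```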
Description Every term is associated with a concept. Term dependencies handled.
Features Based on term document matrix
Query Considers some amount of term dependencies
Parameters Term matrix T, document matrix D, singular value matrix S
Relevance Based on each entry of matrix A
Term weighting Based on document category
Benefits Association among terms and documents. Terms generalized to concepts.
Limitations Computationally expensive. Finding the optimal dimension of A_k is difficult.
A_k = T_k S_k D_k^T
Latent Semantic Indexing
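The rank-k decomposition A_k = T_k S_k D_k^T can be shown on a toy term-document matrix (the vocabulary and counts are invented):

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
# d0 and d1 are about cars, d2 about flowers; data is illustrative.
terms = ["car", "auto", "engine", "flower", "petal"]
A = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 1.],
              [0., 0., 1.]])

T, s, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]   # A_k = T_k S_k D_k^T

# Document vectors in the k-dimensional concept space:
docs_k = (np.diag(s[:k]) @ Dt[:k, :]).T

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# d0 and d1 overlap only in "engine", yet LSI maps both onto one
# "car" concept, so they end up far more similar than d0 and d2:
print(cos(docs_k[0], docs_k[1]) > cos(docs_k[0], docs_k[2]))  # True
```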
Description Query meaning used. Retrieval at concept level
Features Each document corresponds to a language model.
Query Collection of dependent words
Parameters Probability of query under the language model.
Relevance
Term weighting
Benefits Clusters dependent terms
Limitations Difficult to exploit feedback to improve ranking. High computational complexity.
Query likelihood Model
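A common way to instantiate the query likelihood model is Dirichlet smoothing of the document language model. A sketch with invented toy data (the smoothing value μ is chosen for this example; it assumes every query word occurs somewhere in the collection):

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, mu=10):
    """log P(Q|D) under a Dirichlet-smoothed document language model:
    P(w|D) = (c(w, D) + mu * P(w|C)) / (|D| + mu)
    where P(w|C) is the collection (background) model."""
    c_doc, c_col = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_c = c_col[w] / len(collection)
        p_wd = (c_doc[w] + mu * p_c) / (len(doc) + mu)
        score += math.log(p_wd)
    return score

doc1 = ["olive", "oil", "press"]
doc2 = ["engine", "oil", "parts"]
collection = doc1 + doc2
print(query_log_likelihood(["olive", "oil"], doc1, collection) >
      query_log_likelihood(["olive", "oil"], doc2, collection))  # True
```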
Summary: Tracing IR Models
• Set-theoretic, algebraic and probabilistic models
• The underlying current: an attempt to capture "query meaning"
• Started with Karen Spärck Jones' Cambridge thesis "Synonymy and Semantic Classification" in the 1960s
• The effort continues
WSD Definition
• Obtain the sense of
– a set of target words, or of
– all words (all-word WSD, more difficult)
• Against a
– sense repository (like the wordnet), or
– a thesaurus (not the same as a wordnet; does not have semantic relations)
Elaboration (example word: operation)
• Operation, surgery, surgical operation, surgical procedure, surgical process -- (a medical procedure involving an incision with instruments; performed to repair damage or arrest disease in a living body; "they will schedule the operation as soon as an operating room is available"; "he died while undergoing surgery") TOPIC->(noun) surgery#1
• Operation, military operation -- (activity by a military or naval force (as a maneuver or campaign); "it was a joint operation of the navy and air force") TOPIC->(noun) military#1, armed forces#1, armed services#1, military machine#1, war machine#1
• Operation -- ((computer science) data processing in which the result is completely specified by a rule (especially the processing that results from a single instruction); "it can perform millions of operations per second") TOPIC->(noun) computer science#1, computing#1
• mathematical process, mathematical operation, operation -- ((mathematics) calculation by mathematical methods; "the problems at the end of the chapter demonstrated the mathematical processes involved in the derivation"; "they were learning the basic operations of arithmetic") TOPIC->(noun) mathematics#1, math#1, maths#1
To do or not to do WSD?
• Persuasive arguments against WSD (e.g., Voorhees, 1999)
• Investigations still continue
• One such compelling investigation: WSD in the context of query expansion in IR
• While good-quality expansion terms can increase recall, bad expansion terms introduce topic drift and drastically bring down precision
To do or not to do WSD? (contd)
• Given query words, expanding them into accurate paradigmatically related words (synonyms, hypernyms, meronyms, troponyms, derived forms, etc.) depends on WSD
• Syntagmatic expansion terms come from Pseudo-Relevance Feedback (Buckley et al., 1994; Xu and Croft, 2000; Mitra et al., 1998)
Our direct experience of need for WSD in CLIR
• Recurring experience: incorrect mapping of query words into English reduces the MAP score by at least 0.2 points
• Another need: pages retrieved by the search engine are preferably presented in the language of the query
• The usefulness of WSD is well established in Machine Translation (Chang et al., 2007 and Carpuat and Wu, 2007)
[Bar chart: our observations on error percentages due to various factors (transliteration, translation disambiguation, stemmer, dictionary, ranking) for Hindi-English and Marathi-English CLIR, CLEF 2007. The values shown include 43.75, 43.75, 31.25, 25, 18.75, 12.5, 6.25 and 0 percent.]
Shared Tasks/Competition
• CLEF initiated competitions to determine quantitatively the performance gain (or otherwise) of WSD in CLIR, since 2007
• SemEval-2007 Task 01: Evaluating WSD on Cross-Language Information Retrieval (Agirre et al., 2007):
– Earliest attempt at evaluation of WSD for IR
– Spanish→English
– The best MAP value was obtained when the first sense of a word from WordNet 1.6 was picked and all the words in the synset were translated
– Performance was badly hit when all synsets and all translations were chosen
Shared Tasks/Competition (contd.)
• Robust WSD task @ CLEF 2008: Spanish→English
• Query expansion with synonyms from the first sense helped
• The effect of WSD was most pronounced for words which are highly polysemous and have high IDF values, reported in
– http://www.clef-campaign.org/2008/working_notes/CLEF2008WN-Contents.html
• Clough and Stevenson (2004) observed that WSD (manual or automatic) introduced in CLIR can approach monolingual retrieval performance (EuroWordNet used)
Robust WSD task @ CLEF 2009
• English and Spanish monolingual IR
• Spanish→English CLIR
• Documents:
– LA Times 1994 (with word-sense-disambiguated data); ca. 113,000 documents; 425 MB without WSD, 1,448 MB (UBC) or 2,151 MB (NUS) with WSD
– Glasgow Herald 1995 (with word-sense-disambiguated data); ca. 56,500 documents; 154 MB without WSD, 626 MB (UBC) or 904 MB (NUS) with WSD
Robust WSD task@CLEF2009 (contd)
• CLEF Topics from years 2001, 2002 and 2004 were used as training topics (relevance assessments were offered to participants)
• Topics from years 2003, 2005 and 2006 were used for the test
• All topics were offered both with and without WSD• Relevance data
– CLEF 2003: 23,674 documents, 1,006 relevant
– CLEF 2005: 19,790 documents, 2,063 relevant
– CLEF 2006: 21,247 documents, 1,258 relevant
Robust WSD task@CLEF2009 (Observations)
CLEF 2009 working notes on the robust track: WSD (Agirre et al.)
Observations from the task
• Inconclusive!• Exact techniques used for WSD were not revealed• Teams consisted mostly of IR people• Little NLP awareness• WSD used as blackbox
NLP using IR:
Web based validation of Wordnet Hypernyms
(Raghuvar Nadig, J. Ramanand and Pushpak Bhattacharyya, ICON 2008)
Anatomy of a wordnet
A graph of “synsets”
Synset: a collection of synonyms
Relations:
synonymy { icon – picture – image - ikon }
antonymy { … opposite of … }
hypernymy { icon is a type of representation }
holonymy { … part of … }
Anatomy of a synset
Relational semantics
Words in a synset disambiguate each other
icon (three meanings: computer term, image, religion)
icon, ikon (two meanings: image, religion)
picture, icon, ikon, (we have a winner!)
Are the hypernym-hyponym pairs in the wordnet correct?
“is-a” relation between noun synsets
Given 2 synsets, check if they share a hypernym-hyponym relationship
Hypernym Discovery
Hearst: syntactic patterns
Snow: hypernymy dependency trees
Sumida: Wikipedia-based extraction
Hypernym validation - 3 Steps
[Diagram: given synsets X and Y, try 1. common head, then 2. Hearst patterns against the WWW, then 3. coordinate terms from Wikipedia; each step can mark the pair valid, otherwise the pair could not be validated.]
Step 1: common head
If a term of synset X is a proper suffix of a term in synset Y
X is a hypernym of Y
Task 1:
Synset A: {racing}
Synset B: {auto racing, car racing}
Task 2:
Synset A: {work}
Synset B: {paper work}
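Step 1 can be sketched as a small check. Matching the whole final word is a simplification of the proper-suffix test; the function name is mine:

```python
def common_head_valid(synset_x, synset_y):
    """Step 1 (sketch): validate X as a hypernym of Y when some term
    of X is the head (final word) of a multiword term of Y.
    Whole-word matching simplifies the proper-suffix test and avoids
    false hits like 'work' inside 'network'."""
    for x in synset_x:
        for y in synset_y:
            words = y.split()
            if len(words) > 1 and words[-1] == x:
                return True
    return False

print(common_head_valid({"racing"}, {"auto racing", "car racing"}))  # True
print(common_head_valid({"work"}, {"paper work"}))                   # True
```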
Step 2: Web evidence using Hearst patterns
If words from synsets X and Y in the form of a Hearst pattern show sufficient number of search results
X is a hypernym of Y
Hearst Patterns
9 types, e.g.:
“W_x such as W_y” (“Games such as cricket”)
“W_y and other W_x” (“Cricket and other games”)
Test
Query WWW for all Hearst patterns for pairs of words from synsets X & Y
If at least two patterns show non-zero results, validated
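The web-evidence test of Step 2 can be sketched with a stub in place of the search engine. The pattern wordings below are a sample of the Hearst pattern family, not the paper's exact list, and the hit counts are invented:

```python
from itertools import product

# A few of the nine Hearst pattern types; exact wording is illustrative.
HEARST_PATTERNS = [
    "{x} such as {y}",
    "{y} and other {x}",
    "such {x} as {y}",
    "{x} including {y}",
]

def hearst_validate(synset_x, synset_y, hit_count, min_patterns=2):
    """Step 2 (sketch): X is validated as a hypernym of Y if at least
    `min_patterns` Hearst patterns return non-zero web hits for some
    word pair (x, y). `hit_count` stands in for the search engine."""
    supported = 0
    for pattern in HEARST_PATTERNS:
        if any(hit_count(pattern.format(x=x, y=y)) > 0
               for x, y in product(synset_x, synset_y)):
            supported += 1
    return supported >= min_patterns

# Stub "search engine" with two canned phrases:
fake_hits = {"games such as cricket": 120, "cricket and other games": 45}
print(hearst_validate({"games"}, {"cricket"},
                      lambda q: fake_hits.get(q, 0)))  # True
```

Step 3 reuses the same test after substituting coordinate terms found in Wikipedia for the words of Y.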
Step 3: Coordinate terms from Wikipedia
Find coordinate terms for words in Synset Y
Use Wikipedia
Repeat Step 2 (web evidence using Hearst Patterns) using coordinate terms
Example
Synset A: {medical science}
Synset B: {radiology}
Assume Step 2 does not succeed
Search Wikipedia for patterns
w_1, w_2, …, w_n-1 and w_n
w_1, w_2, …, w_n-1 or w_n
where w_k = “radiology”
Step 3: Coordinate terms from Wikipedia
Example
Synset A: {medicine, medical specialty}
Synset B: {radiology}
Wikipedia results
“categories such as radiology, pathology, or anaesthesia” (src: http://en.wikipedia.org/wiki/Medicine#Specialties)
Step 2 with Hearst Pattern search (medicine, radiology)
Hypernym validation – Results
Tested on Princeton WordNet 2.1
Noun synsets: 79,297 hypernym-hyponym pairs
Search queries: Microsoft Live Search
Coordinate terms from Wikipedia: indexed content for coordinate term retrieval
Hypernym validation – Results

Rule                               % of validated pairs
1 – Suffix match                   21.35%
2 – Search using Hearst patterns   46.84%
3 – Use of coordinate terms         2.68%
Observations
71% of all synsets validated
Validation is a strong indicator of hypernymy correctness
Treat failure to validate as a flag for human inspection
Resource Constrained Languages
• Resource-constrained languages have poor crawl coverage; they may not have much online content
• Hawking and Robertson (2003) investigated the impact of coverage on ranking
– Coverage affects precision at higher ranks
– They report 28% higher P@20 values for the whole corpus than for a collection whose size is 10% of it
Impact of Coverage on PRF Performance
• PRF performance depends on the precision of documents at top ranks
• If coverage is less, precision at higher ranks decreases
• Experimented on CLEF collections in French: the same set of queries run on different collection sizes