![Page 1: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/1.jpg)
Cross-Language Retrieval
LBSC 708A/CMSC 838L
Philip Resnik and Douglas W. Oard
Session 9, November 13, 2001
![Page 2: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/2.jpg)
Agenda
• Questions
• Overview
– The information
– The users
• Cross-Language Search
• User Interaction
![Page 3: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/3.jpg)
The Grand Plan
• Phase 1: What makes up an IR system?
– perspectives on the elephant
• Phase 2: Representations
– words, ratings
• Phase 3: Beyond English text
– ideas applied in many settings
![Page 4: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/4.jpg)
A Driving Example
• Visual History Foundation
– Interviews with Holocaust survivors
– 39 years’ worth of audio/video
– 32 languages; accented, emotional speech
– 30 people, 2 years: $12 million
• Joint project: MALACH
– VHF, IBM, JHU, UMD
– http://www.clsp.jhu.edu/research/malach
![Page 5: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/5.jpg)
[Diagram labels: Translingual Search; Translingual Browsing; Translation; Query; Document; Select; Examine; Information Access; Information Use]
![Page 6: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/6.jpg)
A Little (Confusing) Vocabulary
• Multilingual document
– Document containing more than one language
• Multilingual collection
– Collection of documents in different languages
• Multilingual system
– Can retrieve from a multilingual collection
• Cross-language system
– Query in one language finds document in another
• Translingual system
– Queries can find documents in any language
![Page 7: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/7.jpg)
Who needs Cross-Language Search?
• When users can read several languages
– Eliminate multiple queries
– Query in most fluent language
• Monolingual users can also benefit
– If translations can be provided
– If it suffices to know that a document exists
– If text captions are used to search for images
![Page 8: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/8.jpg)
Motivations
• Commerce
• Security
• Social
![Page 9: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/9.jpg)
Global Internet Hosts
[Bar chart: Internet hosts (millions; axis ticks 0.1, 1.0, 10.0, 100.0) by language (estimated by domain): English, Japanese, German, French, Dutch, Finnish, Spanish, Chinese, Swedish]
Source: Network Wizards Jan 99 Internet Domain Survey
![Page 10: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/10.jpg)
Global Web Page Languages
[Pie chart. Slice labels, in extracted order: 72%, 7%, 5%, 2%, 2%, 1%, 1%, 1%, 1%, 1%, 7%. Legend, in extracted order: English, Japanese, German, French, Chinese, Spanish, Italian, Swedish, Malay, Korean, Portuguese, Dutch, Danish, Czech, Finnish, Russian, Polish, Hungarian, Norwegian, Estonian, Greek, Bulgarian, Croatian, Basque, Thai, Turkish, Arabic, Albanian, Others & Unknown]
Source: Jack Xu, Excite@Home, 1999
![Page 11: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/11.jpg)
European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
• Native Tongue Only: 42%
• Bilingual: 25%
• English Only/UK & Ireland: 14%
• English Only: 11%
• Other: 8%
![Page 12: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/12.jpg)
European Web Size Projection
[Chart: billions of words (axis ticks 0.1, 1.0, 10.0, 100.0, 1,000.0, 10,000.0); series: English, Other European]
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
![Page 13: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/13.jpg)
Global Internet Audio
source: www.real.com, Feb 2000
[Chart: station counts 529 and 1367; series labels: English, Other Languages]
Almost 2000 Internet-accessible Radio and Television Stations
![Page 14: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/14.jpg)
13 Months Later
source: www.real.com, Mar 2001
[Chart: station counts 1062 and 1438; series labels: English, Other Languages]
About 2500 Internet-accessible Radio and Television Stations
![Page 15: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/15.jpg)
User Needs Assessment
• Who are the potential users?
• What goals do we seek to support?
• What language skills must we accommodate?
![Page 16: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/16.jpg)
Global Languages
[Bar chart: speakers (millions; axis ticks 0 to 800) by language: Chinese, English, Hindi-Urdu, Spanish, Portuguese, Bengali, Russian, Arabic, Japanese]
Source: http://www.g11n.com/faq.html
![Page 17: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/17.jpg)
Global Trade
[Bar chart: exports and imports in billions of US dollars (1999; axis ticks 0 to 1000) for USA, Germany, Japan, China, France, UK, Canada, Italy, Netherlands, Belgium, Korea, Mexico, Taiwan, Singapore, Spain]
Source: World Trade Organization 2000 Annual Report
![Page 18: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/18.jpg)
Global Internet User Population
[Charts for 2000 and 2005; labeled segments include English and Chinese]
Source: Global Reach
![Page 19: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/19.jpg)
Agenda
• Questions
• Overview
• Cross-Language Search
• User Interaction
![Page 20: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/20.jpg)
The Search Process
[Diagram labels: Author; Document; Query; Monolingual Searcher; Cross-Language Searcher; Infer Concepts; Select Document-Language Terms; Choose Document-Language Terms; Choose Query-Language Terms; Query-Document Matching]
![Page 21: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/21.jpg)
Cross-Language Retrieval, Indexing Languages, Machine-Assisted Indexing
Information Retrieval
Multilingual Metadata
Digital Libraries
International Information Flow, Diffusion of Innovation
Information Use
Automatic Abstracting
Information Science
Machine Translation, Information Extraction, Text Summarization
Natural Language Processing
Multilingual Ontologies
Ontological Engineering
Textual Data Mining
Knowledge Discovery
Machine Learning
Artificial Intelligence
Localization, Information Visualization
Human-Computer Interaction
Web Internationalization
World-Wide Web
Topic Detection and Tracking
Speech Processing
Multilingual OCR
Document Image Understanding
Other Fields
Multilingual Information Access
![Page 22: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/22.jpg)
Some history: from controlled vocabulary to free text
• 1964 International Road Research
– Multilingual thesauri
• 1970 SMART
– Dictionary-based free-text cross-language retrieval
• 1978 ISO Standard 5964 (revised 1985)
– Guidelines for developing multilingual thesauri
• 1990 Latent Semantic Indexing
– Corpus-based free-text translingual retrieval
![Page 23: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/23.jpg)
Multilingual Thesauri
• Build a cross-cultural knowledge structure
– Cultural differences influence indexing choices
• Use language-independent descriptors
– Matched to language-specific lead-in vocabulary
• Three construction techniques
– Build it from scratch
– Translate an existing thesaurus
– Merge monolingual thesauri
![Page 24: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/24.jpg)
Free Text CLIR
• What to translate?
– Queries or documents
• Where to get translation knowledge?
– Dictionary or corpus
• How to use it?
![Page 25: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/25.jpg)
Translingual Retrieval Architecture
[Diagram labels: Chinese Query; Chinese Term Selection; Language Identification; English Term Selection; Chinese Term Selection; Cross-Language Retrieval; Monolingual Chinese Retrieval; ranked results "1: 0.72, 2: 0.48" and "3: 0.91, 4: 0.57, 5: 0.36"]
![Page 26: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/26.jpg)
Evidence for Language Identification
• Metadata
– Included in HTTP and HTML
• Word-scale features
– Which dictionary gets the most hits?
• Subword features
– Character n-gram statistics
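The subword evidence can be illustrated with a minimal sketch: build a character trigram profile per language and pick the profile closest to the input. The training sentences, function names, and the cosine scoring below are assumptions chosen for illustration, not the method used in the lecture; a usable profile would need far more text.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding the text with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training text (hypothetical); real profiles need much larger samples.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "french": "le renard brun saute par dessus le chien paresseux et le chat",
}
PROFILES = {lang: char_ngrams(text) for lang, text in TRAINING.items()}

def identify_language(text):
    """Return the language whose trigram profile best matches the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))

print(identify_language("the dog and the fox"))
print(identify_language("le chien et le chat"))
```

Trigrams need no dictionary or word segmentation, which is why they remain usable on short or noisy text.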
![Page 27: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/27.jpg)
Query-Language Retrieval
[Diagram labels: English Document Terms; Document Translation; Chinese Query Terms; Monolingual Chinese Retrieval; ranked results "3: 0.91, 4: 0.57, 5: 0.36"]
![Page 28: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/28.jpg)
Example: Modular use of MT
• Select a single query language
• Translate every document into that language
• Perform monolingual retrieval
![Page 29: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/29.jpg)
Is Machine Translation Enough?
[Chart: TDT-3 Mandarin Broadcast News; series: Systran, Balanced 2-best translation]
![Page 30: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/30.jpg)
Document-Language Retrieval
[Diagram labels: Chinese Query Terms; Query Translation; English Document Terms; Monolingual English Retrieval; ranked results "3: 0.91, 4: 0.57, 5: 0.36"]
![Page 31: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/31.jpg)
Query vs. Document Translation
• Query translation
– Efficient for short queries (not relevance feedback)
– Limited context for ambiguous query terms
• Document translation
– Rapid support for interactive selection
– Need only be done once (if query language is same)
• Merged query and document translation
– Can produce better effectiveness than either alone
![Page 32: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/32.jpg)
The Short Query Challenge
[Bar chart: number of terms per query (axis ticks 0 to 3) for 1997, 1998, and 1999; series: English, Other European Languages (German, French, Italian, Dutch, Swedish)]
Source: Jack Xu, Excite@Home, 1999
![Page 33: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/33.jpg)
Interlingual Retrieval
[Diagram labels: Chinese Query Terms; Query Translation; English Document Terms; Document Translation; Interlingual Retrieval; ranked results "3: 0.91, 4: 0.57, 5: 0.36"]
![Page 34: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/34.jpg)
Key Challenges in CLIR
[Figure: example translations of query terms: "probe / survey / take samples"; "cymbidium goeringii"; "restrain / oil / petroleum"]
[Callouts: Wrong segmentation; Which translation?; No translation?]
![Page 35: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/35.jpg)
Sources of Evidence for Translation
• Corpus statistics
• Lexical resources
• Algorithms
• The user
![Page 36: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/36.jpg)
Hieroglyphic
Egyptian Demotic
Greek
![Page 37: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/37.jpg)
Types of Bilingual Corpora
• Parallel corpora: translation-equivalent pairs
– Document pairs
– Sentence pairs
– Term pairs
• Comparable corpora: topically related
– Collection pairs
– Document pairs
![Page 38: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/38.jpg)
Exploiting Parallel Corpora
• Automatic acquisition of translation lexicons
• Statistical machine translation
• Corpus-guided translation selection
• Document-linked techniques
![Page 39: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/39.jpg)
Lexicon acquisition from the WWW
Word alignment (GIZA)
STRAND
Chunk-level alignment
Association stats
3378 document pairs
63K chunks
500K words
170K entries
Frequency-based thresholding
… cannot understand crew commands…
ne comprenez pas les instructions de l’ equip…
![Page 40: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/40.jpg)
Corpus-Guided Translation Selection
• Rank translation alternatives for each term
– Pick English word e that maximizes Pr(e)
– Pick English word e that maximizes Pr(e|c)
– Pick English words e1…en maximizing Pr(e1…en|c1…cm) = statistical machine translation!
• Unigram language models are easy to build
– Can use the collection being searched
– Limits uncommon translation and spelling error effects
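The simplest option, picking the candidate e that maximizes Pr(e), can be sketched as follows. The unigram model is estimated from the collection being searched; the toy collection and function names are assumptions, and the candidate list reuses the "restrain / oil / petroleum" example from the challenges slide.

```python
from collections import Counter

def unigram_model(documents):
    """Estimate Pr(e) as the relative frequency of each word in the collection."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def select_translation(candidates, model):
    """Pick the candidate e that maximizes Pr(e); unseen words score 0."""
    return max(candidates, key=lambda e: model.get(e, 0.0))

# Hypothetical collection being searched.
collection = [
    "oil prices rose as petroleum exports fell",
    "oil producers met to discuss oil output",
]
model = unigram_model(collection)

# Candidate translations for one source term, as a dictionary might list them.
best = select_translation(["restrain", "oil", "petroleum"], model)
print(best)
```

Because candidates that never occur in the collection score zero, rare translations and misspelled dictionary entries are filtered out, which is the effect the last bullet describes.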
![Page 41: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/41.jpg)
Corpus-Based CLIR Example
[Diagram labels: French Query Terms; French IR System; Top ranked French Documents; Parallel Corpus; Top ranked English Documents; English Translations; English IR System]
![Page 42: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/42.jpg)
Exploiting Comparable Corpora
• Blind relevance feedback
– Existing CLIR technique + collection-linked corpus
• Lexicon enrichment
– Existing lexicon + collection-linked corpus
• Dual-space techniques
– Document-linked corpus
![Page 43: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/43.jpg)
Blind Relevance Feedback
• Augment a representation with related terms
– Find related documents, extract distinguishing terms
• Multiple opportunities:
– Before doc translation: Enrich the vocabulary
– After doc translation: Mitigate translation errors
– Before query translation: Improve the query
– After query translation: Mitigate translation errors
• Short queries get the most dramatic improvement
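A minimal sketch of the expansion step, assuming an initial ranking is already available: terms from the top-ranked documents are scored by their frequency there times a log inverse document frequency, and the best ones are appended to the query. The scoring formula, parameter names, and toy documents are assumptions, not the specific technique evaluated in the lecture.

```python
import math
from collections import Counter

def expand_query(query_terms, ranked_docs, top_k=2, n_new=2):
    """Append the most distinguishing terms from the top_k ranked documents."""
    docs = [doc.lower().split() for doc in ranked_docs]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    top_freq = Counter(term for doc in docs[:top_k] for term in doc)

    def score(term):
        # Frequent in the top documents, rare in the collection overall.
        return top_freq[term] * math.log(len(docs) / doc_freq[term])

    candidates = [t for t in top_freq if t not in query_terms]
    candidates.sort(key=lambda t: (-score(t), t))  # ties broken alphabetically
    return list(query_terms) + candidates[:n_new]

# Hypothetical collection, already ranked by an initial retrieval pass.
ranked = [
    "el nino warming linked to malaria outbreaks",
    "malaria and cholera spread during el nino warming",
    "the stock market fell during trading",
    "the market linked to oil",
    "oil prices rose",
]
expanded = expand_query(["el", "nino"], ranked)
print(expanded)
```

The same function applies at any of the four points listed above; only the language of the query and of the feedback collection changes.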
![Page 44: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/44.jpg)
Example: Post-Translation “Document Expansion”
[Diagram labels: Mandarin Chinese Documents; Single Document; Automatic Segmentation; Term-to-Term Translation; IR System; English Corpus; Top 5; Term Selection; Document to be Indexed; IR System; English Query; Results]
![Page 45: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/45.jpg)
Mandarin Newswire Text
Post-Translation Document Expansion
![Page 46: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/46.jpg)
Why Document Expansion Works
• Story-length objects provide useful context
• Ranked retrieval finds signal amid the noise
• Selective terms discriminate among documents
– Enrich index with low DF terms from top documents
• Similar strategies work well in other applications
– CLIR query translation
– Monolingual spoken document retrieval
![Page 47: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/47.jpg)
Lexicon Enrichment
… Cross-Language Evaluation Forum …
… Solto Extunifoc Tanixul Knadu …
?
Similar techniques can guide translation selection
![Page 48: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/48.jpg)
Lexicon Enrichment
• Use a bilingual lexicon to align “context regions”
– Regions with high coincidence of known translations
• Pair unknown terms with unmatched terms
– Unknown: language A, not in the lexicon
– Unmatched: language B, not covered by translation
• Treat the most surprising pairs as new translations
• Not yet tested in a CLIR application
![Page 49: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/49.jpg)
Learning From Document Pairs
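[Table: term counts for five linked document pairs (Doc 1 to Doc 5). Columns: English Terms E1, E2, E3, E4, E5 and Spanish Terms S1, S2, S3, S4. Nonzero counts as extracted, one line per row, column positions not recoverable: 4 2 2 1; 8 4 4 2; 2 2 2 1; 2 1 2 1; 4 1 2 1]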
![Page 50: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/50.jpg)
Similarity “Thesauri”
• For each term, find most similar in other language
– Terms E1 & S1 (or E3 & S4) are used in similar ways
• Treat top related terms as candidate translations
– Applying dictionary-based techniques
• Performed well on comparable news corpus
– Automatically linked based on date and subject codes
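As a rough sketch of the idea, each term can be described by its counts across the linked document pairs, and the most similar term in the other language found by cosine similarity. The counts below are hypothetical (they are not the values from the preceding slide), and the helper names are assumptions.

```python
import math

# Hypothetical term counts over five linked document pairs (one value per pair).
ENGLISH = {"E1": [4, 8, 0, 0, 0], "E3": [0, 0, 2, 2, 4]}
SPANISH = {"S1": [2, 4, 0, 0, 0], "S4": [0, 0, 1, 1, 1]}

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(term):
    """Return the Spanish term whose usage pattern best matches an English term."""
    return max(SPANISH, key=lambda s: cosine(ENGLISH[term], SPANISH[s]))

print(most_similar("E1"), most_similar("E3"))
```

Terms that occur in the same linked pairs get high similarity even though they never share a document, so the ranked neighbors can be fed to dictionary-based techniques as candidate translations.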
![Page 51: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/51.jpg)
Generalized Vector Space Model
• “Term space” of each language is different
– Document links define a common “document space”
• Describe documents based on the corpus
– Vector of similarities to each corpus document
• Compute cosine similarity in document space
• Very effective in a within-domain evaluation
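A minimal sketch of the document-space mapping, under the assumption of a tiny document-linked training corpus: each text is described by its cosine similarity to every training document in its own language, and texts in different languages are then compared through those vectors. The training pairs and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical document-linked training corpus: (English, Spanish) pairs.
PAIRS = [
    ("oil prices rise", "suben los precios del petroleo"),
    ("heavy rain floods city", "fuertes lluvias inundan la ciudad"),
    ("election results announced", "anuncian resultados de las elecciones"),
]

def bag(text):
    return Counter(text.lower().split())

def cos_sparse(a, b):
    """Cosine similarity between two bags of words (term space)."""
    dot = sum(v * b[k] for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cos_dense(u, v):
    """Cosine similarity between two dense vectors (document space)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def to_document_space(text, side):
    """Describe a text by its similarity to each training document on one side."""
    return [cos_sparse(bag(text), bag(pair[side])) for pair in PAIRS]

def gvsm_similarity(english_text, spanish_text):
    return cos_dense(to_document_space(english_text, 0),
                     to_document_space(spanish_text, 1))

print(gvsm_similarity("oil prices", "precios del petroleo"))
print(gvsm_similarity("oil prices", "lluvias en la ciudad"))
```

No translation lexicon is involved: the only cross-language knowledge is which training documents are linked.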
![Page 52: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/52.jpg)
Latent Semantic Indexing
• Cosine similarity captures noise with signal
– Term choice variation and word sense ambiguity
• Signal-preserving dimensionality reduction
– Conflates terms with similar usage patterns
• Reduces term choice effect, even across languages
• Computationally expensive
![Page 53: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/53.jpg)
Cognate Matching
• Dictionary coverage is inherently limited
– Translation of proper names
– Translation of newly coined terms
– Translation of unfamiliar technical terms
• Strategy: model derivational translation
– Orthography-based
– Pronunciation-based
![Page 54: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/54.jpg)
Matching Orthographic Cognates
• Retain untranslatable words unchanged
– Often works well between European languages
• Rule-based systems
– Even off-the-shelf spelling correction can help!
• Character-level statistical MT
– Trained using a set of representative cognates
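In the spirit of the spelling-correction remark, here is a small sketch that matches an untranslatable word against the target vocabulary by string similarity. It uses `difflib.get_close_matches` from the Python standard library as a stand-in; it is neither a rule-based system nor character-level statistical MT, and the vocabulary and cutoff are assumptions.

```python
import difflib

# Hypothetical English vocabulary drawn from the document collection.
ENGLISH_VOCABULARY = ["government", "governor", "garden", "infection"]

def match_cognate(word, vocabulary=ENGLISH_VOCABULARY, cutoff=0.7):
    """Return the closest-spelled vocabulary word, or None if nothing is close."""
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(match_cognate("gouvernement"))  # French word missing from the dictionary
print(match_cognate("xyzzy"))
```

A higher cutoff reduces false cognates at the cost of missing more distant spellings.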
![Page 55: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/55.jpg)
Matching Phonetic Cognates
• Forward transliteration
– Generate all potential transliterations
• Reverse transliteration
– Guess source string(s) that produced a transliteration
• Match in phonetic space
![Page 56: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/56.jpg)
Sources of Evidence for Translation
• Corpus statistics
• Lexical resources
• Algorithms
• The user
![Page 57: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/57.jpg)
Types of Lexical Resources
• Ontology
• Thesaurus
• Lexicon
• Dictionary
• Bilingual term list
![Page 58: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/58.jpg)
Dictionary-Based Query Translation
Original query: El Nino and infectious diseases
Term selection: “El Nino” infectious diseases
Term translation: (Dictionary coverage: “El Nino” is not found)
Translation selection:
Query formulation:
Structure:
![Page 59: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/59.jpg)
The Unbalanced Translation Problem
• Common query terms may have many translations
• Some of the translations may be rare
• IR systems give rare translations higher weights
• The wrong documents get highly ranked
![Page 60: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/60.jpg)
Solution 1: Balanced Translation
• Replace each term with plausible translations
– Common terms have many translations
– Specific terms are more useful for retrieval
• Balance the contribution of each translation
– Modular: duplicate translations
– Integrated: average the contributions
![Page 61: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/61.jpg)
Solution 2: “Structured Queries”
• Weight of term a in a document i depends on:
– TF(a,i): Frequency of term a in document i
– DF(a): How many documents term a occurs in
• Build pseudo-terms from alternate translations
– TF(syn(a,b),i) = TF(a,i) + TF(b,i)
– DF(syn(a,b)) = |{docs with a} ∪ {docs with b}|
• Downweights terms with any common translation
– Particularly effective for long queries
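The pseudo-term statistics can be computed directly from their definitions. The sketch below assumes a hypothetical tokenized collection and generalizes syn(a,b) to any list of alternate translations; the function names are illustrative.

```python
def syn_tf(translations, doc_tokens):
    """TF of the pseudo-term: occurrences of any translation in one document."""
    return sum(doc_tokens.count(t) for t in translations)

def syn_df(translations, collection):
    """DF of the pseudo-term: documents containing at least one translation."""
    return sum(1 for doc in collection if any(t in doc for t in translations))

# Hypothetical tokenized collection and translation set.
collection = [
    ["oil", "price", "oil"],
    ["petroleum", "oil"],
    ["rain"],
    ["petroleum"],
]
translations = ["oil", "petroleum"]

tf_doc0 = syn_tf(translations, collection[0])
df_syn = syn_df(translations, collection)
print(tf_doc0, df_syn)
```

Here each translation alone has DF 2, but the union covers 3 documents, not 4, because one document contains both. A single common translation is enough to push the pseudo-term's DF up and its weight down.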
![Page 62: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/62.jpg)
Computing Weights
(Query Terms: 1: 2: 3: )
• Unbalanced:
– Overweights query terms that have many translations

$$\frac{1}{3}\left[\frac{TF_1}{DF_1}+\frac{TF_2}{DF_2}+\frac{TF_3}{DF_3}\right]$$

• Balanced (#sum):
– Sensitive to rare translations

$$\frac{1}{2}\left[\frac{1}{2}\left(\frac{TF_1}{DF_1}+\frac{TF_2}{DF_2}\right)+\frac{TF_3}{DF_3}\right]$$

• Pirkola (#syn):
– Deemphasizes query terms with any common translation

$$\frac{1}{2}\left[\frac{TF_1+TF_2}{DF_1+DF_2}+\frac{TF_3}{DF_3}\right]$$
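The three weighting schemes can be compared on a small example. This sketch assumes that translations 1 and 2 are alternatives for one query term and translation 3 belongs to a second query term, and that the Pirkola denominator is the sum DF1 + DF2 (an upper bound on the union count). The (TF, DF) values and function names are hypothetical.

```python
def unbalanced(groups):
    """Average TF/DF over every translation, ignoring which query term it came from."""
    pairs = [pair for group in groups for pair in group]
    return sum(tf / df for tf, df in pairs) / len(pairs)

def balanced(groups):
    """Average TF/DF within each query term, then average across query terms."""
    per_term = [sum(tf / df for tf, df in group) / len(group) for group in groups]
    return sum(per_term) / len(per_term)

def pirkola(groups):
    """Pool TF and DF within each query term, then average across query terms."""
    per_term = [sum(tf for tf, _ in group) / sum(df for _, df in group)
                for group in groups]
    return sum(per_term) / len(per_term)

# Query term A has a common translation (DF 1000) and a rare one (DF 10);
# query term B has a single translation (DF 100). Each occurs once in the document.
groups = [[(1, 1000), (1, 10)], [(1, 100)]]

print(unbalanced(groups), balanced(groups), pirkola(groups))
```

With these numbers the rare translation dominates the unbalanced and balanced scores, while the pooled Pirkola score is held down by the common translation.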
![Page 63: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/63.jpg)
Relative Effectiveness
[Bar chart: Mean Average Precision (axis ticks 0 to 0.4) for Pirkola and Balanced; series: All Trans, 3 Trans]
NTCIR-2 ECIR Collection, CETA+LDC Dictionary, Inquery 3.1p1
![Page 64: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/64.jpg)
Exploiting Part-of-Speech (POS)
• Constrain translations by part-of-speech
– Requires POS tagger and POS-tagged lexicon
• Works well when queries are full sentences
– Short queries provide little basis for tagging
• Constrained matching can hurt monolingual IR
– Nouns in queries often match verbs in documents
![Page 65: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/65.jpg)
Evaluation Collections
• TREC
– TREC-6/7/8: English, French, German, Italian text
– TREC-9: Chinese text
– TREC-10: Arabic text
• CLEF
– CLEF-1: English, French, German, Italian text
– CLEF-2: Above, plus Spanish and Dutch
• TDT
– TDT-2/3: Chinese and English, text and speech
• NTCIR
– NTCIR-1: Japanese and English text
– NTCIR-2: Japanese, Chinese and English text
![Page 66: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/66.jpg)
Percent of Monolingual Performance
• Create document collection in language F
• Create queries and relevance judgments
– Queries in language F, relevance judgments
– Translated queries in language E
• Monolingual run uses F queries, F docs
– Upper bound on performance
• Cross-language run uses E queries, F docs
– Measure as percentage of the upper bound
• Caveat: “query expansion” effect!
![Page 67: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/67.jpg)
A real example (HLT’2001)
• Combining two translation lexicons
– Contained in downloaded dictionary
– Extracted from bilingual WWW documents
• Interaction between stemming and translation
– 4-stage backoff, balanced translation
• Evidence combination
– Baseline: dictionary with four-stage backoff
– Merging: reranking informed by WWW tralex
– Backoff: from baseline to WWW tralex
• Experiment and results
![Page 68: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/68.jpg)
Four-stage Backoff
• Tralex might contain stems, surface forms, or some combination of the two.
Stages, as (Document form, Translation Lexicon form):
1. surface form, surface form
2. stem, surface form
3. surface form, stem
4. stem, stem
Example terms on the slide: mangez, mange, mangent (French); eat, eats (English)
French stemmer: Oard, Levow, and Cabezas (2001); English: Inquery’s kstem
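A minimal sketch of the four stages, tried in order until one finds an entry. The suffix-stripping stemmer here is a toy stand-in, not the French stemmer cited above, and the lexicon contents and function names are hypothetical.

```python
def toy_stem(word):
    """Toy suffix stripper for illustration only; not a real French stemmer."""
    for suffix in ("ent", "ez", "es", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def backoff_translate(term, lexicon, stem=toy_stem):
    """Return (stage, translations) from the first stage that finds a match."""
    stemmed_lexicon = {}
    for entry, translations in lexicon.items():
        stemmed_lexicon.setdefault(stem(entry), []).extend(translations)
    stages = [
        (term, lexicon),                # 1: surface form vs. surface form
        (stem(term), lexicon),          # 2: stem vs. surface form
        (term, stemmed_lexicon),        # 3: surface form vs. stem
        (stem(term), stemmed_lexicon),  # 4: stem vs. stem
    ]
    for stage, (key, table) in enumerate(stages, start=1):
        if key in table:
            return stage, table[key]
    return None, []

print(backoff_translate("mangez", {"mangez": ["eat"]}))
print(backoff_translate("mangez", {"mangent": ["eat"]}))
print(backoff_translate("mangez", {"chien": ["dog"]}))
```

Trying exact surface forms first keeps precise matches when they exist, and stemming is applied only when needed to recover coverage.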
![Page 69: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/69.jpg)
Tralex Merging by Voting
Ranked dictionary: F → e1 (100), e2 (99), e3 (98)
Corpus tralex: F → e2 (100), e3 (99), e4, e5
Ranked merge: F → e2 (199), e3 (197), e1 (100), e4, e5
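The voting scheme can be sketched as rank-based scoring: each list gives its first translation 100 points, its second 99, and so on, and scores are summed across lists. The slide shows no scores for e4 and e5; giving them 98 and 97 from their corpus rank is an assumption of this sketch, as are the function and parameter names.

```python
def merge_by_voting(dictionary_ranked, corpus_ranked, top_score=100):
    """Sum rank-based scores from both lists and sort by total, highest first."""
    scores = {}
    for ranked in (dictionary_ranked, corpus_ranked):
        for position, translation in enumerate(ranked):
            scores[translation] = scores.get(translation, 0) + top_score - position
    return sorted(scores.items(), key=lambda item: -item[1])

merged = merge_by_voting(["e1", "e2", "e3"], ["e2", "e3", "e4", "e5"])
print(merged)
```

Translations supported by both sources rise to the top, while a translation found in only one source keeps its single-list score.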
![Page 70: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/70.jpg)
Tralex Backoff
• Attempt 4-stage backoff using dictionary
• If not found, use corpus tralex as 5th stage
Dictionary-based 4-stage backoff (e.g., mangez / mangez - eat: surface form to surface form)
If not found: lookup in WWW-based tralex
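Treating the corpus tralex as a fifth stage amounts to chaining lookups and stopping at the first one that returns anything. A minimal sketch, assuming each resource is wrapped as a function from a term to a list of translations; the plain dictionary lookup here stands in for the full four-stage procedure, and the sample entries are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Lookup = Callable[[str], List[str]]


def first_hit(term: str, lookups: List[Tuple[str, Lookup]]) -> Tuple[Optional[str], List[str]]:
    """Return (resource name, translations) from the first lookup that succeeds."""
    for name, lookup in lookups:
        translations = lookup(term)
        if translations:
            return name, translations
    return None, []


# Hypothetical toy resources for illustration.
dictionary = {"mangez": ["eat"]}
www_tralex = {"courriel": ["email"]}
chain = [
    ("dictionary", lambda t: dictionary.get(t, [])),  # stands in for stages 1-4
    ("www", lambda t: www_tralex.get(t, [])),         # 5th stage
]
print(first_hit("courriel", chain))
```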
![Page 71: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/71.jpg)
Experiment
• CLEF-2000 French document collection– ~21M words, Le Monde, document translation
• English queries (34 topics with relevant docs)
• Inquery (using #sum operator)
• Conditions (balanced top-2 translation of docs):– Baseline: downloaded dictionary alone– Corpus (STRAND) tralex, thresholds N=1,2,3– Tralex merging by voting– Backoff from dictionary tralex to corpus tralex
![Page 72: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/72.jpg)
Results
Condition                                   Mean Average Precision
STRAND corpus tralex (N=1)                  0.2320
STRAND corpus tralex (N=2)                  0.2440
STRAND corpus tralex (N=3)                  0.2499
Merging by voting                           0.2892
Baseline: downloaded dictionary             0.2919
Backoff from dictionary to corpus tralex    0.3282
Backoff vs. baseline: +12% relative (p < .01)
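As a quick check, the relative gain can be recomputed from the two reported values (baseline 0.2919, backoff 0.3282), assuming the +12% figure compares those two conditions:

```latex
\[
\frac{0.3282 - 0.2919}{0.2919} = \frac{0.0363}{0.2919} \approx 0.124
\]
```

That is a relative improvement of about 12%, consistent with the figure on the slide.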
![Page 73: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/73.jpg)
Results Detail
mAP
![Page 74: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/74.jpg)
Topic Detection and Tracking
• English and Chinese news stories– Newswire, radio, and television sources
• Query-by-example, mixed language/source– Merged multilingual result set
• Set-based retrieval measures– Focus on utility
![Page 75: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/75.jpg)
TDT: Evaluating CL Speech Retrieval
Timeline: Jan 98, Mar 98, Jun 98, Oct 98, Dec 98
Development Collection: TDT-2
– 2265 manually segmented stories
– 17 topics, variable number of exemplars
Evaluation Collection: TDT-3
– 3371 manually segmented stories
– 56 topics, variable number of exemplars
English text topic exemplars: Associated Press, New York Times
Mandarin audio broadcast news: Voice of America
Exhaustive relevance assessment based on event overlap
![Page 76: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/76.jpg)
Query by Example
English Newswire Exemplars:
President Bill Clinton and Chinese President Jiang Zemin engaged in a spirited, televised debate Saturday over human rights and the Tiananmen Square crackdown, and announced a string of agreements on arms control, energy and environmental matters. There were no announced breakthroughs on American human rights concerns, including Tibet, but both leaders accentuated the positive …
Mandarin Audio Stories:
美国总统克林顿的助手赞扬中国官员允许电视现场直播克林顿和江泽民在首脑会晤后举行的联合记者招待会。。特别是一九八九镇压民主运动的决定。他表示镇压天安门民主运动是错误的 , 他还批评了中国对西藏精神领袖达 国家安全事务助理伯格表示 , 这次直播让中国人第一次在种公开的论坛上听到围绕敏感的人权问题的讨论。在记者招待会上 …
![Page 77: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/77.jpg)
Known Item Retrieval
• Design queries to retrieve a single document
• Measure the rank of that document in the list
• Average the inverse of the rank across queries
• Use sign test for statistical significance
• Useful first pass evaluation strategy– Avoids the cost of relevance judgments
– Poor mean inverse rank implies poor average precision
– Does not distinguish well among fairly close systems
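The measure and the significance test described above can be sketched as follows. This is an illustration under stated assumptions: ranks are 1-based, a query whose known item is not retrieved is recorded as `None` and contributes 0, and the sign test is the standard two-sided binomial version with ties dropped. The rank lists in the usage lines are hypothetical.

```python
from math import comb, inf
from typing import List, Optional

Ranks = List[Optional[int]]


def mean_reciprocal_rank(ranks: Ranks) -> float:
    """Average of 1/rank over queries; unretrieved items (None) count as 0."""
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)


def sign_test_p(ranks_a: Ranks, ranks_b: Ranks) -> float:
    """Two-sided sign test on paired per-query ranks (lower rank is better)."""
    a_wins = b_wins = 0
    for ra, rb in zip(ranks_a, ranks_b):
        ra = inf if ra is None else ra
        rb = inf if rb is None else rb
        if ra < rb:
            a_wins += 1
        elif rb < ra:
            b_wins += 1  # ties are dropped
    n = a_wins + b_wins
    if n == 0:
        return 1.0
    k = min(a_wins, b_wins)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical ranks of the known item for six queries under two systems.
system_a = [1, 2, 1, 3, 1, 2]
system_b = [2, 4, 3, 5, 2, 6]
print(mean_reciprocal_rank(system_a), mean_reciprocal_rank(system_b))
print(sign_test_p(system_a, system_b))
```

The sign test uses only which system ranked the item higher on each query, so it makes no assumption about how rank differences are distributed.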
![Page 78: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/78.jpg)
Evaluating Corpus-Based Techniques
• Within-domain evaluation (upper bound)– Partition a bilingual corpus into training and test– Use the training part to tune the system– Generate relevance judgments for evaluation part
• Cross-domain evaluation (fair)– Use existing corpora and evaluation collections– No good metric for degree of domain shift
![Page 79: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/79.jpg)
Evaluating Tralex Coverage
• Lexicon size
• Vocabulary coverage of the collection
• Term instance coverage of the collection
• Term weight coverage of the collection
• Term weight coverage on representative queries
• Retrieval performance on a test collection
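Two of the measures listed above differ only in whether each distinct term counts once or once per occurrence. A minimal sketch, assuming the collection is already tokenized and the lexicon is a set of source-language terms; the function name and toy data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def coverage(tokens: Iterable[str], lexicon_terms: Set[str]) -> Tuple[float, float]:
    """Return (vocabulary coverage, term instance coverage) of a collection."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("collection is empty")
    covered = [term for term in counts if term in lexicon_terms]
    # Vocabulary coverage: fraction of distinct terms found in the lexicon.
    vocabulary = len(covered) / len(counts)
    # Term instance coverage: fraction of term occurrences found in the lexicon.
    instances = sum(counts[term] for term in covered) / sum(counts.values())
    return vocabulary, instances


tokens = ["le", "chat", "mange", "le", "poisson"]
print(coverage(tokens, {"le", "mange"}))
```

Instance coverage is usually the higher of the two, because frequent terms are the ones most likely to appear in a lexicon.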
![Page 80: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/80.jpg)
Outline
• Questions
• Overview
• Search
• Browsing
![Page 81: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/81.jpg)
Interactive CLIR
• Important: Strong support for interactive relevance judgments can make up for less accurate nominations– [Hersh et al., SIGIR 2000]
• Practical: Interactive relevance judgments based on imperfect translations can beat fully automatic nominations alone– [Oard and Resnik, IP&M 1997]
![Page 82: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/82.jpg)
User-Assisted Query Translation
![Page 83: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/83.jpg)
Reverse Translation
Query in English: Swiss bank
Click on a box to remove a possible translation:
bank:
– Bankgebäude ( )
– bankverbindung (bank account, correspondent)
– bank (bench, settle)
– damm (causeway, dam, embankment)
– ufer (shore, strand, waterside)
– wall (parapet, rampart)
Buttons: Search, Continue
![Page 84: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/84.jpg)
Selection
• Goal: Provide information to support decisions
• May not require very good translations– e.g., Term-by-term title translation
• People can “read past” some ambiguity– May help to display a few alternative translations
![Page 85: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/85.jpg)
Language-Specific Selection
Query in English: Swiss bank (Search)
German query: (Swiss)(Bankgebäude, bankverbindung, bank)
English:
1 (0.72) Swiss Bankers Criticized  AP / June 14, 1997
2 (0.48) Bank Director Resigns  AP / July 24, 1997
German:
1 (0.91) U.S. Senator Warpathing  NZZ / June 14, 1997
2 (0.57) [Bankensecret] Law Change  SDA / August 22, 1997
3 (0.36) Banks Pressure Existent  NZZ / May 3, 1997
![Page 86: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/86.jpg)
Translingual Selection
Query in English: Swiss bank (Search)
German Query: (Swiss)(Bankgebäude, bankverbindung, bank)
1 (0.91) U.S. Senator Warpathing  NZZ  June 14, 1997
2 (0.57) [Bankensecret] Law Change  SDA  August 22, 1997
3 (0.52) Swiss Bankers Criticized  AP  June 14, 1997
4 (0.36) Banks Pressure Existent  NZZ  May 3, 1997
5 (0.28) Bank Director Resigns  AP  July 24, 1997
![Page 87: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/87.jpg)
Merging Ranked Lists
• Types of Evidence– Rank– Score
• Evidence Combination– Weighted round robin– Score combination
• Parameter tuning– Condition-based– Query-based
Rank   List 1          List 2          Merged list
1      voa4062 .22     voa4062 .52     voa4062
2      voa3052 .21     voa2156 .37     voa3052
3      voa4091 .17     voa3052 .31     voa2156
…      …               …               …
1000   voa4221 .04     voa2159 .02     voa4201
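The two evidence-combination strategies named on this slide can be sketched as below. Both functions are illustrative assumptions rather than the method actually used: the round-robin variant takes a fixed quota of new documents from each list per round, and the score variant sums weighted scores. The document IDs and scores are the top three entries of the two example lists; the weights are hypothetical.

```python
from typing import Dict, List


def weighted_round_robin(ranked_lists: List[List[str]], quotas: List[int]) -> List[str]:
    """Rank-based merge: take quotas[i] unseen documents from list i each round."""
    if any(q < 1 for q in quotas):
        raise ValueError("each quota must be at least 1")
    merged: List[str] = []
    seen = set()
    cursors = [0] * len(ranked_lists)
    while any(c < len(r) for c, r in zip(cursors, ranked_lists)):
        for i, (ranked, quota) in enumerate(zip(ranked_lists, quotas)):
            taken = 0
            while taken < quota and cursors[i] < len(ranked):
                doc = ranked[cursors[i]]
                cursors[i] += 1
                if doc not in seen:  # skip documents already merged
                    seen.add(doc)
                    merged.append(doc)
                    taken += 1
    return merged


def combine_scores(score_lists: List[Dict[str, float]], weights: List[float]) -> List[str]:
    """Score-based merge: rank documents by the weighted sum of their scores."""
    totals: Dict[str, float] = {}
    for scores, weight in zip(score_lists, weights):
        for doc, score in scores.items():
            totals[doc] = totals.get(doc, 0.0) + weight * score
    return sorted(totals, key=totals.get, reverse=True)


list_1 = {"voa4062": 0.22, "voa3052": 0.21, "voa4091": 0.17}
list_2 = {"voa4062": 0.52, "voa2156": 0.37, "voa3052": 0.31}
print(weighted_round_robin([list(list_1), list(list_2)], [1, 1]))
print(combine_scores([list_1, list_2], [1.0, 1.0]))
```

Round robin needs only ranks, so it still works when scores from different systems are not comparable; score combination can reward documents found by both systems but requires scores on a compatible scale.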
![Page 88: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/88.jpg)
Examination Interface
• Two goals– Refine document delivery decisions– Support vocabulary discovery for query refinement
• Rapid translation is essential– Document translation retrieval strategies are a good fit– Focused on-the-fly translation may be a viable alternative
![Page 89: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/89.jpg)
Uh oh…
![Page 90: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/90.jpg)
State-of-the-Art Machine Translation
![Page 91: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/91.jpg)
Term-by-Term Gloss Translation
![Page 92: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/92.jpg)
Experiment Design
Participant   Task Order
1             Topic11, Topic17 | Topic13, Topic29
2             Topic11, Topic17 | Topic13, Topic29
3             Topic17, Topic11 | Topic29, Topic13
4             Topic17, Topic11 | Topic29, Topic13
Topic Key (Narrow / Broad): 11, 13 / 17, 29
System Key: System A, System B
![Page 93: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/93.jpg)
An Experiment Session
• Task and system familiarization (30 minutes)
– Gain experience with both systems
• 4 searches (20 minutes x 4):
– Read topic description (in a language you know)
– Examine translations (into that same language)
– Select one of 5 relevance judgments (two clusters)
  • Relevant
  • Somewhat relevant, Not relevant, Unsure, Not judged
– Instructed to seek high precision
• 8 questionnaires
– Initial (1), each topic (4), each system (2), Final (1)
![Page 94: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/94.jpg)
Measure of Effectiveness
• Unbalanced F-Measure:
$$F_\alpha = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$$
– P = precision
– R = recall
– α = 0.8
• Favors precision over recall
• This models an application in which:
– Fluent translation is expensive
– Missing some relevant documents would be okay
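A short sketch of the measure, assuming the usual weighted harmonic mean form F = 1 / (α/P + (1 − α)/R) with α = 0.8; the function name and the precision and recall values in the usage lines are hypothetical.

```python
def f_alpha(precision: float, recall: float, alpha: float = 0.8) -> float:
    """Unbalanced F-measure; alpha > 0.5 weights precision more than recall."""
    if precision == 0 or recall == 0:
        return 0.0  # convention: the measure is 0 if either component is 0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


# Halving precision costs more than halving recall when alpha = 0.8.
print(round(f_alpha(0.5, 1.0), 3))
print(round(f_alpha(1.0, 0.5), 3))
```

With α = 0.8, a searcher who selects a few documents, all relevant, scores better than one who finds everything relevant but selects many non-relevant documents along the way.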
![Page 95: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/95.jpg)
French Results Overview
[Figure: French results chart; labels: CLEF, AUTO]
![Page 96: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/96.jpg)
English Results Overview
[Figure: English results chart; labels: CLEF, AUTO]
![Page 97: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/97.jpg)
Maryland Experiments
• MT is almost always better– Significant overall and for narrow topics alone (one-tailed t-test, p<0.05)
• F measure is less insightful for narrow topics– Always near 0 or 1
[Figure: bar chart of average F_0.8 on two topics for participants umd01 to umd04, MT vs. GLOSS, grouped into broad topics and narrow topics; vertical axis from 0 to 1.2]
![Page 98: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/98.jpg)
Some Observations
• Low agreement with CLEF assessments!– Time pressure, precision bias, strict judgments
• Systran was fairly consistent across sites– Only when the language pair was the same
• Monolingual > Systran > Gloss– In both recall and precision
• UNED’s phrase translations improve recall– With no adverse effect on precision
![Page 99: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/99.jpg)
Delivery
• Use may require high-quality translation– Machine translation quality is often rough
• Route to best translator based on:– Acceptable delay– Required quality (language and technical skills)– Cost
![Page 100: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/100.jpg)
Summary
• Controlled vocabulary– Mature, efficient, easily explained
• Dictionary-based– Simple, broad coverage
• Collection-aligned corpus-based– Generally helpful
• Document- and Term-aligned corpus-based– Effective in the same domain
• User interface– Only very preliminary results available
![Page 101: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/101.jpg)
Research Opportunities
[Figure: Perceived Opportunities vs. Past Investment, by research area]
• User Interaction
• Translation Selection
• Transliteration
• Comparable Corpora
• Parallel Corpora
• Dictionaries
• Term Selection
• Segmentation & Phrase Indexing
• Lexical Coverage
![Page 102: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/102.jpg)
No “muddiest point” question…
… it’s all CLIR!
![Page 103: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/103.jpg)
Concerns about Social Filtering
• Subjectivity: is good for me good for you? (6)
• Sparse recommendations (4)
• Privacy and trust are in tension (3)
• Insufficient perceived self interest (3)
• Difficulty of making inferences from behavior
• Changing information needs
• How to observe user behavior?
• Non-cooperative users
• Scalable ratings servers
• Adoption of new technology
![Page 104: Cross-Language Retrieval LBSC 708A/CMSC 838L Philip Resnik and Douglas W. Oard Session 9, November 13, 2001](https://reader038.vdocuments.us/reader038/viewer/2022110102/56649f315503460f94c4d111/html5/thumbnails/104.jpg)
Muddiest Points
• Architectures for using implicit feedback (4)
• Rocchio (4)
– Is it commonly used?
• Machine learning techniques (2)
– Especially hill climbing and rule induction
• How are links related to social filtering?
• The math behind Google
• How to reduce the marginal cost of ratings?
• Relationship between filtering and routing