5/24/09
Cross-Language IR
CISC489/689-010, Lecture #23, Monday, May 11th
Ben Carterette
Cross-Language IR
• User submits a query in one language, gets results in a different language
• Documents are semi-structured and heterogeneous (as almost all data in IR), and also in multiple languages
• Information may only be available in documents written in one of the languages
• Highly useful to intelligence community
Approaches to CLIR
• Translate the documents into the users’ language, and let the users submit queries in their own language
• Translate the users’ queries into target language(s) and use the translated query for retrieval
• Translate both queries and documents to an “intermediate” language
Automatic Translation
• What are some approaches to automatic translation?
– Language-to-language dictionaries
• Languages do not translate precisely
– One word with several meanings in one language might translate to several different words in the other
– Many words with the same meaning might all translate to a single word
– A word in one language might only be expressible as a phrase in another (or vice-versa)
– etc…
Example
• English queries to retrieve Spanish documents
• System works by translating query to Spanish
• Query: “bank fraud”
• Translations of “bank”:
– Orilla (river bank)
– Terraplen (bank of earth)
– Banco (bank of clouds)
– Bateria (bank of lights)
– Banco (financial institution)
– Banca (casino bank)
• Translations of “fraud”:
– Impostor (fraudulent person)
– Fraude (deception)
• How would a dictionary-based system know which pair of translations to use?
• Possibly correct translation:
– Fraude bancario
Statistical Approach
• Instead of trying to translate directly, apply statistical methods
• Learn “translation probabilities” P(f|e): the probability of translating string e in language E to string f in language F
• E.g.:
– P(orilla fraude | bank fraud), P(orilla impostor | bank fraud), P(banco fraude | bank fraud), …
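To make the ambiguity concrete, here is a minimal Python sketch that enumerates the word-by-word candidates a dictionary-based system would have to choose among. The word lists are copied from the example; the function name `candidate_translations` is hypothetical, and collapsing the two senses of "banco" into one entry is an assumption made for illustration.

```python
from itertools import product

# Dictionary entries from the example (the two senses of "banco" are merged).
BANK = ["orilla", "terraplen", "banco", "bateria", "banca"]
FRAUD = ["impostor", "fraude"]

def candidate_translations(*word_options):
    """Return every combination that picks one translation per query word."""
    return [" ".join(combo) for combo in product(*word_options)]

candidates = candidate_translations(BANK, FRAUD)
for phrase in candidates:
    print(phrase)
```

Note that "fraude bancario" is not among the generated candidates: a word-by-word dictionary lookup neither reorders the words nor produces the adjective form, which is part of the motivation for learning P(f|e) over whole strings.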
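Cross-Language Language Model
• Recall query-likelihood language model:
$$P(Q \mid D) = \prod_{q \in Q} P(q \mid D) = \prod_{q \in Q} \left( (1 - \lambda_D) \frac{tf_{q,D}}{|D|} + \lambda_D \frac{ctf_q}{|C|} \right)$$
• Let’s adapt this to cross-language retrieval using statistical translation
$$\begin{aligned}
P(Q_f \mid D_e) &= \prod_{q_f \in Q_f} P(q_f \mid D_e) \\
&= \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e)\, P(t_e \mid D_e) \\
&= \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e) \left( (1 - \lambda_{D_e}) \frac{tf_{t_e,D_e}}{|D_e|} + \lambda_{D_e} \frac{ctf_{t_e}}{|C_e|} \right)
\end{aligned}$$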
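The cross-language query likelihood can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used in any particular system: the smoothing weight is a fixed constant `lam` rather than a document-dependent parameter, the translation table `trans` maps each query word q_f to a dictionary of {t_e: P(q_f | t_e)}, and all names and toy numbers are hypothetical.

```python
import math
from collections import Counter

def smoothed_prob(term, doc_tf, doc_len, coll_tf, coll_len, lam):
    """P(t_e | D_e): interpolate document and collection term frequencies."""
    return (1 - lam) * doc_tf[term] / doc_len + lam * coll_tf[term] / coll_len

def cross_language_log_likelihood(query_f, doc_e, collection_e, trans, lam=0.5):
    """log P(Q_f | D_e) = sum over q_f of log sum over t_e of P(q_f|t_e) P(t_e|D_e)."""
    doc_tf, coll_tf = Counter(doc_e), Counter(collection_e)
    log_score = 0.0
    for q_f in query_f:
        # Terms t_e with zero translation probability contribute nothing,
        # so only the entries present in the table are summed.
        p_q = sum(
            p_translate * smoothed_prob(t_e, doc_tf, len(doc_e),
                                        coll_tf, len(collection_e), lam)
            for t_e, p_translate in trans.get(q_f, {}).items()
        )
        if p_q == 0.0:
            return float("-inf")
        log_score += math.log(p_q)
    return log_score

# Toy data (assumed values, for illustration only).
doc1 = ["bank", "fraud", "case"]
doc2 = ["river", "bank"]
collection = doc1 + doc2
trans = {"banco": {"bank": 0.5}, "fraude": {"fraud": 1.0}}
query = ["banco", "fraude"]

score_doc1 = cross_language_log_likelihood(query, doc1, collection, trans)
score_doc2 = cross_language_log_likelihood(query, doc2, collection, trans)
print(score_doc1 > score_doc2)
```

Working in log space avoids underflow when the product runs over many query terms; the ranking is unchanged because the logarithm is monotonic.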
Translation Model
• What is P(q_f | t_e)?
• The translation model: probability of translating word t_e in language E to word q_f in language F
• Where does it come from?
– Maybe a dictionary approach: every possible translation of t_e has equal probability
– e.g. P(orilla | bank) = P(banco | bank) = P(banca | bank) = …
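A minimal sketch of the dictionary approach: each distinct translation of a word receives the same probability. The function name `uniform_translation_model` is hypothetical, and treating repeated dictionary entries (such as the two senses of "banco") as a single translation is an assumption.

```python
def uniform_translation_model(dictionary):
    """Map each source word t_e to {q_f: P(q_f | t_e)} with equal probabilities."""
    model = {}
    for t_e, translations in dictionary.items():
        distinct = sorted(set(translations))  # assumption: merge duplicate entries
        model[t_e] = {q_f: 1.0 / len(distinct) for q_f in distinct}
    return model

model = uniform_translation_model({
    "bank": ["orilla", "terraplen", "banco", "bateria", "banco", "banca"],
    "fraud": ["impostor", "fraude"],
})
print(model["bank"])
print(model["fraud"])
```

A uniform model ignores how often each translation is actually used, which is the gap that estimating probabilities from data is meant to fill.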
Statistical Translation Model
• An alternative approach: parallel corpora
Statistical Translation with Parallel Corpora
• Parallel corpora consist of documents in two or more languages that are known to be translations of one another
• The parallel corpora are aligned: string e and string f are marked as translations of each other
• We can use these alignments to estimate a translation model
Translation Model
• To estimate P(q_f | t_e), count the number of aligned string pairs (e, f) such that t_e is a word in e and q_f is a word in f
• Divide by the total number of strings in language E that contain t_e
$$P(q_f \mid t_e) = \frac{|\{(e, f) \mid t_e \in e \text{ and } q_f \in f\}|}{|\{e \mid t_e \in e\}|}$$
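The counting estimate can be sketched directly from the formula. This is an illustrative sketch: the function name `estimate_translation_model`, the lowercase whitespace tokenization, and the three toy sentence pairs are all assumptions.

```python
from collections import defaultdict

def estimate_translation_model(aligned_pairs):
    """Estimate P(q_f | t_e) by co-occurrence counts over aligned string pairs."""
    pair_counts = defaultdict(lambda: defaultdict(int))  # t_e -> q_f -> count
    source_counts = defaultdict(int)                     # t_e -> number of strings e
    for e_string, f_string in aligned_pairs:
        e_words = set(e_string.lower().split())  # sets: count each string once
        f_words = set(f_string.lower().split())
        for t_e in e_words:
            source_counts[t_e] += 1
            for q_f in f_words:
                pair_counts[t_e][q_f] += 1
    return {
        t_e: {q_f: count / source_counts[t_e] for q_f, count in targets.items()}
        for t_e, targets in pair_counts.items()
    }

pairs = [
    ("bank fraud", "fraude bancario"),
    ("the bank", "el banco"),
    ("fraud case", "caso de fraude"),
]
model = estimate_translation_model(pairs)
print(model["fraud"]["fraude"])  # "fraude" appears in both strings containing "fraud"
print(model["bank"]["banco"])
```

As defined, these values need not sum to one over q_f for a fixed t_e, since every word in f is credited for every word in e; they behave as co-occurrence scores that grow sharper as more aligned pairs are observed.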
Simple Alignment Example
• English sentence: “The objective was clear: arrest and extradite to Mexico the woman against whom they had charged for fraud to a recognized banking institution.”
• Spanish sentence: “El objetivo era claro: detener a la mujer y enviarla de regreso a México pues habían cargos en su contra por fraude a una reconocida institución bancaria.”
• Every pair of words in these two sentences will have some translation probability
• Over many sentences, the highest probabilities will be the pairs of words that are most closely related
Alignments
• Alignments can be much more detailed
Images from Brown et al., “The Mathematics of Statistical Machine Translation”
Parallel Corpora
• Where do we get parallel corpora?
– Find documents that we know to be translations
– Canadian Hansard: transcripts of Canadian parliamentary debates in both English and French
– European Union law in 22 languages
• Anything that’s not law-related?
– Wikipedia articles in different languages (not necessarily translations, though)
CLIR Experiments
• CLIR track ran at TREC from 1998 through 2002
• Languages used include English, German, French, Italian, Chinese, and Arabic
• Other issues in CLIR:
– Segmentation, stemming, stopping, and phrases require different approaches in different languages
– I am going to focus on the high-level problem
CLIR Experiments
• In 2001 and 2002, the main CLIR task was English queries to retrieve Arabic documents
• Documents: 383,872 news articles from Agence France Presse from 1994-2000
• Information needs: 25 queries, descriptions, and narratives in English by native Arabic speakers
– Translated into Arabic and French as well
• Participating sites could do CLIR (English to Arabic or French to Arabic) or normal IR (Arabic to Arabic)
Example Topic
<num> Number: AR26
<title> مجلس المقاومة الوطني الكردستاني
<desc> Description: كيف ينظر مجلس المقاومة الوطنية الى الإستقلال المحتمل للاكراد؟
<narr> Narrative: الموضوع يتضمن نصوص متعلقة بتحركات مجلس المقاومة الوطنية، مقالات تتحدث عن قيادة اوجلان ضمن جهود الاكراد للاستقلال.
<num> Number: AR26
<title> Kurdistan Independence
<desc> Description: How does the National Council of Resistance relate to the potential independence of Kurdistan?
<narr> Narrative: Articles reporting activities of the National Council of Resistance are considered on topic. Articles discussing Ocalan's leadership within the context of the Kurdish efforts toward independence are also considered on topic.
Example Document
Results
• BBN, UMass, IBM used statistical models
• UMass performance on cross-language is roughly equal to performance on monolingual!
[Plots: Monolingual (Arabic to Arabic); Cross-lingual (English/French to Arabic)]
Plots from Oard & Gey, “The TREC-2002 Arabic/English CLIR Track”
Analysis
• The translation model is imperfect
– It assigns probabilities to almost every pair of words
– There are many errors in translation
• So how could cross-lingual be almost as good as monolingual?
• Hypotheses:
– Translation process disambiguates some terms
– Translation process smooths query models
IR as Statistical Translation
• What if we view IR as a translation process?
– User inputs query in English, system does “cross-language” retrieval from user-English to system-English
– This may account for users not using the right keywords in their queries
• There is no natural translation model, so one must be simulated
• Berger & Lafferty, SIGIR 1999
IR Translation Model
• Generate a translation model by aligning simulated queries to relevant documents
Results
Translation models compared to tf-idf
LM coincides with Model 0
Conclusion: statistical translation works at least as well as tf-idf or LM
Translation for Multimedia Retrieval
• English-Arabic CLIR works
• English-English CLIR works
• What about English-multimedia CLIR?
• “Translate” an image into words to enable retrieval of images by text queries
• Translation model: P(w|I) is the probability of “translating” image I to word w
Image Translation Model
• Estimating P(w|I) requires two things:
– A feature-based representation of the image
– A set of words that “align” with the image
• Use image segmentation and clustering to form a representation of images
• Use image captions to align words to image
Image Representation: “Blobs”
From Jeon et al., “Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models”
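Cross-Media Relevance Model
• Retrieval is by query-likelihood P(Q|I)
$$\begin{aligned}
P(Q \mid I) &= \prod_{q \in Q} P(q \mid I) \\
&\approx \prod_{q \in Q} P(q \mid b_1, \ldots, b_m) \\
&\propto \prod_{q \in Q} \sum_{J \in C} P(q \mid J)\, P(J) \prod_{i=1}^{m} P(b_i \mid J)
\end{aligned}$$
C is the collection of images, J is an image in C, and b_1 … b_m are “blobs”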
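The scoring formula can be sketched as a pair of nested loops. This is a simplified illustration, not the procedure from Jeon et al.: the per-image probabilities P(q|J), P(b|J) and the prior P(J) are assumed to be given (how they are estimated and smoothed is left out), and the data layout, the name `cmrm_score`, and the toy numbers are all hypothetical.

```python
def cmrm_score(query, blobs, collection):
    """Score an image with blobs b_1..b_m for a text query:
    product over q of sum over J of P(q|J) * P(J) * product over i of P(b_i|J)."""
    score = 1.0
    for q in query:
        total = 0.0
        for J in collection:
            p_blobs = 1.0
            for b in blobs:
                p_blobs *= J["p_blob"].get(b, 0.0)
            total += J["p_word"].get(q, 0.0) * J["prior"] * p_blobs
        score *= total
    return score

# Two captioned training images with assumed probabilities; blobs are cluster ids.
collection = [
    {"prior": 0.5, "p_word": {"tiger": 0.6, "grass": 0.4}, "p_blob": {1: 0.7, 2: 0.3}},
    {"prior": 0.5, "p_word": {"sky": 0.8, "grass": 0.2}, "p_blob": {2: 0.5, 3: 0.5}},
]
image_blobs = [1, 2]

tiger_score = cmrm_score(["tiger"], image_blobs, collection)
sky_score = cmrm_score(["sky"], image_blobs, collection)
print(tiger_score, sky_score)
```

Without smoothing, any training image that lacks one of the query image's blobs contributes zero, which is why practical versions interpolate these estimates with collection-wide statistics.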
Example Results
From Jeon et al., “Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models”
Machine Translation
• Machine translation (MT) is a problem in NLP/computational linguistics
• The goal is to automatically translate text in one language to another
• Different from CLIR with query translation model in that the CLIR model does not require a “coherent” translation of the query
– CLIR essentially uses every possible translation
• Machine translation should provide a single “good” translation that is human-readable
Statistical MT
• Though MT and CLIR are different problems, the statistical approaches are very similar
• IBM developed several statistical models for MT
– “A statistical approach to machine translation”, Brown et al. 1990
– CLIR models based on IBM’s models
IBM Models
• Basic idea: to translate a sentence f in language F to a sentence e in language E, estimate P(e|f) using Bayes’ Rule
• The “right” translation is the one with highest probability
$$P(e \mid f) = \frac{P(f \mid e)\, P(e)}{P(f)}$$
$$\hat{e} = \arg\max_{e} P(f \mid e)\, P(e)$$
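The denominator P(f) is the same for every candidate e, so it can be dropped from the argmax. A toy Python sketch of the selection step follows; the candidate sentences and all probability values are invented for illustration, and `best_translation` is a hypothetical name. A real system searches a vastly larger candidate space.

```python
def best_translation(candidates, p_f_given_e, p_e):
    """Pick the candidate e that maximizes P(f|e) * P(e) for a fixed source f."""
    return max(candidates, key=lambda e: p_f_given_e.get(e, 0.0) * p_e.get(e, 0.0))

# Source sentence f = "fraude bancario"; every number below is an assumption.
candidates = ["bank fraud", "fraud bank", "shore fraud"]
p_f_given_e = {"bank fraud": 0.4, "fraud bank": 0.4, "shore fraud": 0.1}  # translation model
p_e = {"bank fraud": 0.05, "fraud bank": 0.001, "shore fraud": 0.0001}    # language model

best = best_translation(candidates, p_f_given_e, p_e)
print(best)
```

In this toy setup the translation model cannot separate "bank fraud" from "fraud bank"; the language model P(e) breaks the tie in favor of the fluent word order, which illustrates why the decomposition is useful for producing a single readable translation.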
IBM Models
• The key is estimating P(f|e)
• Brown et al. presented five different models
– Increasingly complicated, require a lot of training data in the form of parallel aligned corpora
• Google machine translation is based on alignment and IBM models, but also based on very large amounts of unaligned data
Google Machine Translation
Google’s translation of the Spanish Wikipedia page for Spain (http://es.wikipedia.org/wiki/Espana)