Cross-Language IR · ir.cis.udel.edu/~carteret/CISC689/slides/lecture23.pdf



5/24/09


Cross-Language IR

CISC489/689-010, Lecture #23
Monday, May 11th
Ben Carterette

Cross-Language IR

•  User submits a query in one language, gets results in a different language
•  Documents are semi-structured and heterogeneous (as almost all data in IR), and also in multiple languages
•  Information may only be available in documents written in one of the languages
•  Highly useful to the intelligence community


Approaches to CLIR

•  Translate the documents into the users’ language, and let the users submit queries in their own language
•  Translate the users’ queries into the target language(s) and use the translated query for retrieval
•  Translate both queries and documents into an “intermediate” language

Automatic Translation

•  What are some approaches to automatic translation?
   –  Language-to-language dictionaries
•  Languages do not translate precisely
   –  One word with several meanings in one language might translate to several different words in the other
   –  Many words with the same meaning might all translate to a single word
   –  A word in one language might only be expressible as a phrase in another (or vice versa)
   –  etc.


Example

•  English queries to retrieve Spanish documents
•  System works by translating the query to Spanish
•  Query: “bank fraud”
•  Translations of “bank”:
   –  Orilla (river bank)
   –  Terraplén (bank of earth)
   –  Banco (bank of clouds)
   –  Batería (bank of lights)
   –  Banco (financial institution)
   –  Banca (casino bank)
•  Translations of “fraud”:
   –  Impostor (fraudulent person)
   –  Fraude (deception)
•  How would a dictionary-based system know which pair of translations to use?
•  Possibly correct translation:
   –  Fraude bancario
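To make the ambiguity concrete, here is a minimal Python sketch that enumerates every word-by-word translation of “bank fraud” from a toy dictionary. The dictionary contents come from the example entries above; the function name `candidate_translations` and the whitespace tokenization are assumptions made for illustration only.

```python
from itertools import product

# Toy bilingual dictionary using the example entries (distinct Spanish words only).
DICTIONARY = {
    "bank": ["orilla", "terraplén", "banco", "batería", "banca"],
    "fraud": ["impostor", "fraude"],
}

def candidate_translations(query):
    """Return every word-by-word translation of a whitespace-separated query."""
    # Words missing from the dictionary are passed through unchanged.
    options = [DICTIONARY.get(word, [word]) for word in query.split()]
    return [" ".join(choice) for choice in product(*options)]

candidates = candidate_translations("bank fraud")
print(len(candidates))  # number of candidate pairs the system must choose among
print(candidates)
```

Under these assumptions the dictionary alone yields ten candidates and no way to rank them. A pure word-by-word lookup also never produces “fraude bancario”, because that phrase changes both the word order and the word form.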

Statistical Approach

•  Instead of trying to translate directly, apply statistical methods
•  Learn “translation probabilities” P(f|e): the probability of translating string e in language E to string f in language F
•  E.g.:
   –  P(orilla fraude | bank fraud), P(orilla impostor | bank fraud), P(banco fraude | bank fraud), …


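Cross-Language Language Model

•  Recall the query-likelihood language model:

$$P(Q \mid D) = \prod_{q \in Q} P(q \mid D) = \prod_{q \in Q} \left[ (1 - \alpha_D)\,\frac{\mathit{tf}_{q,D}}{|D|} + \alpha_D\,\frac{\mathit{ctf}_q}{|C|} \right]$$

•  Let’s adapt this to cross-language retrieval using statistical translation:

$$\begin{aligned}
P(Q_f \mid D_e) &= \prod_{q_f \in Q_f} P(q_f \mid D_e) \\
&= \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e)\, P(t_e \mid D_e) \\
&= \prod_{q_f \in Q_f} \sum_{t_e \in E} P(q_f \mid t_e) \left[ (1 - \alpha_{D_e})\,\frac{\mathit{tf}_{t_e,D_e}}{|D_e|} + \alpha_{D_e}\,\frac{\mathit{ctf}_{t_e}}{|C_e|} \right]
\end{aligned}$$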
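The cross-language query likelihood can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's implementation: the function name, the `trans_prob` table, the toy documents, and the constant smoothing weight `alpha` (used here in place of a document-dependent weight) are all assumptions.

```python
import math
from collections import Counter

def cross_language_log_likelihood(query_f, doc_e, collection_e, trans_prob, alpha=0.5):
    """Return log P(Q_f | D_e) for a query in language F and a document in language E.

    trans_prob[(q_f, t_e)] holds P(q_f | t_e); missing pairs count as 0.
    alpha is a fixed smoothing weight (an assumption for this sketch).
    """
    doc_tf = Counter(doc_e)
    coll_tf = Counter(collection_e)
    log_p = 0.0
    for q_f in query_f:
        p_q = 0.0
        # Sum over every term t_e in the language-E vocabulary.
        for t_e in coll_tf:
            p_t = ((1 - alpha) * doc_tf[t_e] / len(doc_e)
                   + alpha * coll_tf[t_e] / len(collection_e))
            p_q += trans_prob.get((q_f, t_e), 0.0) * p_t
        if p_q == 0.0:
            return float("-inf")  # query word has no translation in the vocabulary
        log_p += math.log(p_q)
    return log_p

# Toy data (hypothetical): English documents, Spanish query.
doc_1 = "bank fraud case".split()
doc_2 = "river bank walk".split()
collection = doc_1 + doc_2
trans_prob = {("banco", "bank"): 0.5, ("orilla", "bank"): 0.5, ("fraude", "fraud"): 0.8}
query = ["banco", "fraude"]

score_1 = cross_language_log_likelihood(query, doc_1, collection, trans_prob)
score_2 = cross_language_log_likelihood(query, doc_2, collection, trans_prob)
print(score_1 > score_2)  # the document that mentions fraud should rank higher
```

Note that the query is never translated into a single string: every translation contributes to the score in proportion to its probability.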

Translation Model

•  What is P(q_f|t_e)?
•  The translation model: probability of translating word t_e in language E to word q_f in language F
•  Where does it come from?
   –  Maybe a dictionary approach: every possible translation of t_e has equal probability
   –  e.g. P(orilla|bank) = P(banco|bank) = P(banca|bank) = …
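As a rough illustration of the dictionary approach, the following Python sketch turns a bilingual dictionary into a uniform translation model. The dictionary entries and the function name `uniform_translation_model` are assumptions for illustration; duplicate entries for the same target word are collapsed before the probability mass is split.

```python
def uniform_translation_model(dictionary):
    """Map each (target_word, source_word) pair to 1 / (number of distinct translations)."""
    model = {}
    for source_word, translations in dictionary.items():
        distinct = sorted(set(translations))
        for target_word in distinct:
            model[(target_word, source_word)] = 1.0 / len(distinct)
    return model

# Hypothetical dictionary; "banco" is listed twice because it covers two senses of "bank".
toy_dictionary = {
    "bank": ["orilla", "terraplén", "banco", "batería", "banco", "banca"],
    "fraud": ["impostor", "fraude"],
}
model = uniform_translation_model(toy_dictionary)
print(model[("orilla", "bank")], model[("fraude", "fraud")])
```

A uniform model like this gives the rare sense (“orilla”) the same weight as the common one (“banco”), which is the weakness that estimation from parallel corpora tries to address.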


Statistical Translation Model

•  An alternative approach: parallel corpora

Statistical Translation with Parallel Corpora

•  Parallel corpora consist of documents in two or more languages that are known to be translations of one another
•  The parallel corpora are aligned: string e and string f are marked as translations of each other
•  We can use these alignments to estimate a translation model


Translation Model

•  To estimate P(q_f|t_e), count the number of aligned string pairs (e, f) such that t_e is a word in e and q_f is a word in f
•  Divide by the total number of strings in language E that contain t_e

$$P(q_f \mid t_e) = \frac{|\{(e, f) \mid t_e \in e \text{ and } q_f \in f\}|}{|\{e \mid t_e \in e\}|}$$
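The counting estimate above can be sketched directly in Python. The three aligned pairs below are invented toy data, the function name is hypothetical, and the sketch assumes each language-E string appears in exactly one aligned pair, so that counting pairs is the same as counting strings.

```python
from collections import Counter

def estimate_translation_model(aligned_pairs):
    """Estimate P(q_f | t_e) by co-occurrence counts over aligned (e, f) string pairs."""
    pair_counts = Counter()    # |{(e, f) : t_e in e and q_f in f}|
    source_counts = Counter()  # |{e : t_e in e}|
    for e, f in aligned_pairs:
        e_words, f_words = set(e.split()), set(f.split())
        for t_e in e_words:
            source_counts[t_e] += 1
            for q_f in f_words:
                pair_counts[(q_f, t_e)] += 1
    return {(q_f, t_e): count / source_counts[t_e]
            for (q_f, t_e), count in pair_counts.items()}

# Toy aligned corpus (hypothetical).
aligned = [
    ("the bank", "el banco"),
    ("the fraud", "el fraude"),
    ("bank fraud", "fraude bancario"),
]
model = estimate_translation_model(aligned)
print(model[("banco", "bank")], model[("fraude", "fraud")], model[("fraude", "bank")])
```

With so little data, unrelated pairs such as (fraude, bank) still get substantial probability. The estimates only sharpen as more aligned sentences are counted.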

Simple Alignment Example

•  English sentence: “The objective was clear: arrest and extradite to Mexico the woman against whom they had charged for fraud to a recognized banking institution.”
•  Spanish sentence: “El objetivo era claro: detener a la mujer y enviarla de regreso a México pues habían cargos en su contra por fraude a una reconocida institución bancaria.”
•  Every pair of words in these two sentences will have some translation probability
•  Over many sentences, the highest probabilities will go to the pairs of words that are most closely related


Alignments

•  Alignments can be much more detailed

Images from Brown et al., “The Mathematics of Statistical Machine Translation”

Parallel Corpora

•  Where do we get parallel corpora?
   –  Find documents that we know to be translations
   –  Canadian Hansard: transcripts of Canadian parliamentary debates in both English and French
   –  European Union law in 22 languages
•  Anything that’s not law-related?
   –  Wikipedia articles in different languages. Not necessarily translations, though


CLIR Experiments

•  CLIR track ran at TREC from 1998 through 2002
•  Languages used include English, German, French, Italian, Chinese, and Arabic
•  Other issues in CLIR:
   –  Segmentation, stemming, stopping, phrases require different approaches in different languages
   –  I am going to focus on the high-level problem

CLIR Experiments

•  In 2001 and 2002, the main CLIR task was English queries to retrieve Arabic documents
•  Documents: 383,872 news articles from Agence France Presse from 1994-2000
•  Information needs: 25 queries, descriptions, and narratives in English by native Arabic speakers
   –  Translated into Arabic and French as well
•  Participating sites could do CLIR (English to Arabic or French to Arabic) or normal IR (Arabic to Arabic)


Example Topic

<num> Number: AR26
<title> مجلس المقاومة الوطني الكردستاني
<desc> Description: كيف ينظر مجلس المقاومة الوطنية الى الإستقلال المحتمل للاكراد؟
<narr> Narrative: الموضوع يتضمن نصوص متعلقة بتحركات مجلس المقاومة الوطنية، مقالات تتحدث عن قيادة اوجلان ضمن جهود الاكراد للاستقلال.

<num> Number: AR26
<title> Kurdistan Independence
<desc> Description: How does the National Council of Resistance relate to the potential independence of Kurdistan?
<narr> Narrative: Articles reporting activities of the National Council of Resistance are considered on topic. Articles discussing Ocalan's leadership within the context of the Kurdish efforts toward independence are also considered on topic.

Example Document


Results

•  BBN, UMass, IBM used statistical models
•  UMass performance on cross-language is roughly equal to performance on monolingual!

[Plots: Monolingual (Arabic to Arabic) and Cross-lingual (English/French to Arabic). Plots from Oard & Gey, “The TREC-2002 Arabic/English CLIR Track”]

Analysis

•  The translation model is imperfect
   –  It assigns probabilities to almost every pair of words
   –  There are many errors in translation
•  So how could cross-lingual be almost as good as monolingual?
•  Hypotheses:
   –  Translation process disambiguates some terms
   –  Translation process smooths query models


IR as Statistical Translation

•  What if we view IR as a translation process?
   –  User inputs query in English, system does “cross-language” retrieval from user-English to system-English
   –  This may account for users not using the right keywords in their queries
•  There is no natural translation model, so one must be simulated
•  Berger & Lafferty, SIGIR 1999

IR Translation Model

•  Generate a translation model by aligning simulated queries to relevant documents


Results

Translation models compared to tf-idf

LM coincides with Model 0

Conclusion: statistical translation works at least as well as tf-idf or LM

Translation for Multimedia Retrieval

•  English-Arabic CLIR works
•  English-English CLIR works
•  What about English-multimedia CLIR?
•  “Translate” an image into words to enable retrieval of images by text queries
•  Translation model: P(w|I) is the probability of “translating” image I to word w


Image Translation Model

•  Estimating P(w|I) requires two things:
   –  A feature-based representation of the image
   –  A set of words that “align” with the image
•  Use image segmentation and clustering to form a representation of images
•  Use image captions to align words to image

Image Representation: “Blobs”

From Jeon et al., “Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models”


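Cross-Media Relevance Model

•  Retrieval is by query-likelihood P(Q|I)

$$\begin{aligned}
P(Q \mid I) &= \prod_{q \in Q} P(q \mid I) \\
&\approx \prod_{q \in Q} P(q \mid b_1, \ldots, b_m) \\
&\propto \prod_{q \in Q} \sum_{J \in C} P(q \mid J)\, P(J) \prod_{i=1}^{m} P(b_i \mid J)
\end{aligned}$$

C is the collection of images, J is an image in C, and b_1 … b_m are “blobs”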
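A small Python sketch may help show how the sum over collection images is computed. Everything here is an illustrative assumption rather than the method of Jeon et al.: the uniform prior P(J), the linear smoothing of P(q|J) and P(b|J) with weights `alpha` and `beta`, the function names, and the two toy annotated images.

```python
from collections import Counter

def cmrm_word_score(word, query_blobs, collection, alpha=0.1, beta=0.1):
    """Score sum_J P(word|J) P(J) prod_i P(b_i|J) over annotated collection images.

    collection is a list of (caption_words, blobs) pairs, one per image J.
    """
    all_words = Counter(w for words, _ in collection for w in words)
    all_blobs = Counter(b for _, blobs in collection for b in blobs)
    n_words, n_blobs = sum(all_words.values()), sum(all_blobs.values())
    prior = 1.0 / len(collection)  # uniform P(J), an assumption
    score = 0.0
    for words, blobs in collection:
        p_word = ((1 - alpha) * words.count(word) / len(words)
                  + alpha * all_words[word] / n_words)
        p_blobs = 1.0
        for b in query_blobs:
            p_blobs *= ((1 - beta) * blobs.count(b) / len(blobs)
                        + beta * all_blobs[b] / n_blobs)
        score += p_word * prior * p_blobs
    return score

def cmrm_query_score(query, image_blobs, collection):
    """Multiply the per-word scores for every word of a text query."""
    result = 1.0
    for q in query:
        result *= cmrm_word_score(q, image_blobs, collection)
    return result

# Toy annotated collection (hypothetical captions and blob identifiers).
collection = [
    (["tiger", "grass"], ["b1", "b2"]),
    (["sky", "plane"], ["b3", "b4"]),
]
unlabeled_image = ["b1", "b2"]  # blobs of an image with no caption
print(cmrm_word_score("tiger", unlabeled_image, collection)
      > cmrm_word_score("plane", unlabeled_image, collection))
```

The unlabeled image shares its blobs with the captioned tiger image, so “tiger” receives a higher score than “plane” even though the image itself has no text.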

Example Results

From Jeon et al., “Automatic Image Annotation and Retrieval Using Cross-Media Relevance Models”


Machine Translation

•  Machine translation (MT) is a problem in NLP/computational linguistics
•  The goal is to automatically translate text in one language to another
•  Different from CLIR with a query translation model in that the CLIR model does not require a “coherent” translation of the query
   –  CLIR essentially uses every possible translation
•  Machine translation should provide a single “good” translation that is human-readable

Statistical MT

•  Though MT and CLIR are different problems, the statistical approaches are very similar
•  IBM developed several statistical models for MT
   –  “A statistical approach to machine translation”, Brown et al. 1990
   –  CLIR models based on IBM’s models


IBM Models

•  Basic idea: to translate a sentence f in language F to a sentence e in language E, estimate P(e|f) using Bayes’ Rule
•  The “right” translation is the one with highest probability

$$P(e \mid f) = \frac{P(f \mid e)\, P(e)}{P(f)}$$

$$\hat{e} = \arg\max_e P(f \mid e)\, P(e)$$
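The argmax can be illustrated with a tiny Python sketch. P(f) is dropped because it is the same for every candidate e and so cannot change which candidate wins. The candidate list, both probability tables, and the function name `best_translation` are made-up assumptions; a real system would search a very large space of candidates rather than score a fixed list.

```python
import math

def best_translation(f, candidates, translation_model, language_model):
    """Return the candidate e that maximizes log P(f|e) + log P(e)."""
    def score(e):
        return math.log(translation_model[(f, e)]) + math.log(language_model[e])
    return max(candidates, key=score)

# Hypothetical probabilities for two candidate English translations.
f = "fraude bancario"
candidates = ["bank fraud", "fraud bank"]
translation_model = {(f, "bank fraud"): 0.4, (f, "fraud bank"): 0.4}  # P(f|e)
language_model = {"bank fraud": 0.01, "fraud bank": 0.0001}           # P(e)

print(best_translation(f, candidates, translation_model, language_model))
```

In this toy setup the translation model cannot separate the two candidates, so the language model P(e) decides in favor of the more fluent English word order.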

IBM Models

•  The key is estimating P(f|e)
•  Brown et al. presented five different models
   –  Increasingly complicated, require a lot of training data in the form of parallel aligned corpora
•  Google machine translation is based on alignment and IBM models, but also based on very large amounts of unaligned data


Google Machine Translation

Google’s translation of the Spanish Wikipedia page for Spain (http://es.wikipedia.org/wiki/Espana)