Transcript
Page 1: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

Structural Text Features

CISC489/689-010, Lecture #13
Monday, April 6th

Ben Carterette

Structural Features

•  So far we have mainly focused on “vanilla” features of terms in documents
    –  Term frequency, document frequency
    –  “Bag of words” models
•  Some documents have structure that we could leverage for improved retrieval
    –  Natural language has structure as well
•  We can derive features from this structure, especially from the placement of terms within structure or placement of terms with respect to each other

Example: HTML

•  “HyperText Markup Language”
•  Provides document structure using tags enclosing text
    –  <title>: enclosed text displayed at top of browser
    –  <body>: enclosed text displayed in browser
    –  <h1>: enclosed text displayed in large font
    –  <b>: enclosed text displayed in bold
    –  <a>: enclosed text can be clicked to go to another page
•  The text enclosed in fields is often unstructured or structured with more HTML

Example: HTML

Example: HTML

•  HTML pages organize into trees.

<HTML>
    <HEAD>
        <TITLE> Tropical fish
        <META>
    <BODY>
        <H1> Tropical fish
        <P>
            <B> Tropical fish
            <A> fish
            <A> tropical
            include found in environments around the world

Nodes contain blocks of text.

Example: Email

•  Header fields provide some structure

Structure in Natural Language

•  One example: parse trees

(from http://www.lancs.ac.uk/fss/courses/ling/corpus/Corpus2/2PARSE.HTM)

Hyper-Structure

•  The documents themselves may occur within some structure
    –  The web: documents link to each other, creating a graph structure
    –  Email: threaded conversations
    –  Sentences form paragraphs, paragraphs form sections, sections form chapters, chapters form books, …
•  This structure may provide useful features

Using Structural Features in Retrieval

•  Steps:
    –  Derive features: document processing
    –  Index features: using inverted lists
    –  Retrieval using features: retrieval models, scoring functions, query languages

Specific Features

•  Phrases:
    –  Sequences of words in order
    –  Users want to query phrases, e.g. “tropical fish”
•  Fields and tags:
    –  Markup enclosing parts of documents
    –  We want to emphasize some parts, de-emphasize others. E.g. titles important, sidebars not
•  Web hyper-structure:
    –  Links between pages
    –  We want pages that are frequently linked using the same text to score higher for queries that contain that text
•  What are the features, how do we derive them, how do we store them, and how do we model them in retrieval?

Deriving and Indexing Features

•  Derivation considerations:
    –  Computational time and space requirements
    –  Errors in processing
    –  Use in queries
•  Indexing considerations:
    –  Fast query processing
    –  Flexibility (index once with all info for calculating anything you can imagine vs. re-index every time you come up with a new idea)
    –  Storage

Phrases

•  Many queries are 2-3 word phrases
•  Phrases are
    –  More precise than single words
        •  e.g., documents containing “black sea” vs. two words “black” and “sea”
    –  Less ambiguous
        •  e.g., “big apple” vs. “apple”
•  Can be difficult for ranking
    •  e.g., given query “fishing supplies”, how do we score documents with
        –  exact phrase many times, exact phrase just once, individual words in same sentence, same paragraph, whole document, variations on words?

Phrases

•  Text processing issue: how are phrases recognized?
•  Three possible approaches:
    –  Identify syntactic phrases using a part-of-speech (POS) tagger
    –  Use word n-grams
    –  Store word positions in indexes and use proximity operators in queries

POS Tagging

•  POS taggers use statistical models of text to predict syntactic tags of words
    –  Example tags:
        •  NN (singular noun), NNS (plural noun), VB (verb), VBD (verb, past tense), VBN (verb, past participle), IN (preposition), JJ (adjective), CC (conjunction, e.g., “and”, “or”), PRP (pronoun), and MD (modal auxiliary, e.g., “can”, “will”).
•  Phrases can then be defined as simple noun groups, for example
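
A minimal sketch of extracting noun groups from already-tagged text. It assumes a "simple noun group" is a maximal run of JJ/NN/NNS tokens that ends in a noun and has at least two words; that definition, the function name `noun_phrases`, and the hand-assigned tags in the sample are illustrative assumptions, and no tagger is run here.

```python
def noun_phrases(tagged):
    """Return maximal runs of JJ/NN/NNS tokens that end in a noun (2+ words)."""
    noun_tags = {"NN", "NNS"}
    phrases, run = [], []
    # A sentinel token with a non-matching tag flushes the final run.
    for word, tag in tagged + [("", "END")]:
        if tag in noun_tags or tag == "JJ":
            run.append((word, tag))
            continue
        # Trim trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] not in noun_tags:
            run.pop()
        if len(run) >= 2:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


# Hand-tagged sample (tags assigned by hand for illustration).
tagged = [("tropical", "JJ"), ("fish", "NN"), ("include", "VB"),
          ("fish", "NN"), ("found", "VBN"), ("in", "IN"),
          ("tropical", "JJ"), ("environments", "NNS")]
print(noun_phrases(tagged))  # ['tropical fish', 'tropical environments']
```

Each extracted phrase could then receive its own inverted list, alongside the lists for its individual words.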

POS Tagging Example

Example Noun Phrases

Noun Phrase Inverted Lists

Q = “united states”: retrieve inverted list for phrase “united states” and process
Q = united states: retrieve inverted lists for terms “united”, “states” and process

Word N-Grams

•  POS tagging too slow for large collections
•  Simpler definition: phrase is any sequence of n words, known as n-grams
    –  bigram: 2 word sequence, trigram: 3 word sequence, unigram: single words
    –  N-grams also used at character level for applications such as OCR
•  N-grams typically formed from overlapping sequences of words
    –  i.e. move n-word “window” one word at a time in document
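
The sliding window can be sketched in a few lines. The function name `word_ngrams` and the whitespace tokenization are assumptions made for illustration.

```python
def word_ngrams(words, n):
    """Slide an n-word window over the token list, one word at a time."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "tropical fish include fish found in tropical environments".split()
bigrams = word_ngrams(words, 2)
print(bigrams[:3])
# [('tropical', 'fish'), ('fish', 'include'), ('include', 'fish')]
```

A document shorter than n words yields an empty list, because the range is empty.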

Word Bigrams

Tropical fish
fish include
include fish
fish found
found in
in tropical
tropical environments
environments around
around the
the world
…

Bigram Inverted Lists

Though many unusual phrases are included, term statistics help ensure that they do not hurt retrieval

N-Grams

•  Frequent n-grams are more likely to be meaningful phrases
•  N-grams form a Zipf distribution
    –  Better fit than words alone
•  Could index all n-grams up to specified length
    –  Much faster than POS tagging
    –  Uses a lot of storage
        •  e.g., document containing 1,000 words would contain 3,990 instances of word n-grams of length 2 ≤ n ≤ 5
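
The 3,990 figure follows from the overlapping-window construction: a document of N words contains N - n + 1 n-grams of length n. A small check, with `ngram_instances` as an illustrative helper name:

```python
def ngram_instances(num_words, min_n, max_n):
    """Count overlapping word n-grams for every length min_n <= n <= max_n."""
    return sum(max(num_words - n + 1, 0) for n in range(min_n, max_n + 1))


# 999 bigrams + 998 trigrams + 997 4-grams + 996 5-grams
print(ngram_instances(1000, 2, 5))  # 3990
```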

Google N-Grams

•  Web search engines index n-grams
•  Google sample:
•  Most frequent trigram in English is “all rights reserved”
    –  In Chinese, “limited liability corporation”

Use Term Positions

•  Rather than store phrases in index directly, store term positions and locate phrases at query time
•  Match phrases or words within a window
    –  e.g., "tropical fish", or “find tropical within 5 words of fish”
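
A minimal sketch of query-time matching over a positional index, under assumed conventions: positions are 1-based word offsets, tokens are split on whitespace, and the names `build_positional_index` and `within_window` and the three sample documents are invented for illustration. An ordered window of width 1 behaves as an exact two-word phrase match; an unordered window of width 5 behaves as the proximity query.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using 1-based word positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def within_window(index, first, second, width, ordered=False):
    """Doc ids where `second` occurs within `width` words of `first`."""
    hits = []
    for doc_id, first_pos in index.get(first, {}).items():
        second_pos = index.get(second, {}).get(doc_id, [])
        for p in first_pos:
            if ordered:
                found = any(0 < q - p <= width for q in second_pos)
            else:
                found = any(0 < abs(q - p) <= width for q in second_pos)
            if found:
                hits.append(doc_id)
                break
    return hits


docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish from tropical waters",
    3: "tropical storms rarely bother deep sea fish",
}
index = build_positional_index(docs)
print(within_window(index, "tropical", "fish", 1, ordered=True))  # [1]
print(within_window(index, "tropical", "fish", 5))                # [1, 2]
```

The nested scan is quadratic in the number of positions per document; a production index would merge the sorted position lists instead.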

Phrase Method Tradeoffs

•  POS tagging:
    –  Very long index time, possible errors, medium storage requirement, not very flexible
    –  Fast phrase-query processing
•  N-Grams:
    –  High storage requirement
    –  More flexible, fast phrase-query processing
•  Term positions:
    –  Medium-low storage requirement, very flexible
    –  Possibly slower query processing due to needing to calculate collection statistics

Parsing

•  Basic parsing: identify which parts of documents to index, which to ignore
•  Full parsing: identify and label parts of documents, maintain structure, decide which parts are relatively more important

HTML Parsing

•  An HTML parser produces a DOM tree
•  We want to store basic term information (tf, idf) as well as information about the nodes the term appears in

<HTML>
    <HEAD>
        <TITLE> Tropical fish
        <META>
    <BODY>
        <H1> Tropical fish
        <P>
            <B> Tropical fish
            <A> fish
            <A> tropical
            include found in environments around the world
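
One way to record the nodes a term appears in is sketched below with Python's standard `html.parser` module. It does not build a full DOM; it keeps a stack of open tags and stores that path for each term occurrence. The class name `FieldTermCollector`, the `VOID_TAGS` set, and the sample markup (reconstructed from the tree above) are assumptions for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never receive a closing tag (partial list, enough for the sample).
VOID_TAGS = {"meta", "br", "img", "hr", "link", "input"}


class FieldTermCollector(HTMLParser):
    """Record, for each term occurrence, the path of open tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []                   # stack of currently open elements
        self.term_paths = defaultdict(list)   # term -> list of tag paths

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        for term in data.lower().split():
            self.term_paths[term].append(tuple(self.open_tags))


page = ("<html><head><title>Tropical fish</title><meta name='x'></head>"
        "<body><h1>Tropical fish</h1><p><b>Tropical fish</b> include "
        "<a href='#'>fish</a> found in <a href='#'>tropical</a> "
        "environments around the world</p></body></html>")
collector = FieldTermCollector()
collector.feed(page)
collector.close()
for path in collector.term_paths["tropical"]:
    print(" > ".join(path))
```

The number of recorded paths for a term is its term frequency in the page, and the last element of each path names the innermost field.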

Indexing Fields

•  After parsing we have:
    –  <title>: tropical fish
    –  <body>: tropical fish tropical fish include fish found in tropical environments around the world …
    –  <h1>: tropical fish
    –  <b>: tropical fish
    –  <a>: fish
    –  <a>: tropical
•  Ideas for indexing:
    –  Store field information in inverted list.
    –  Add new inverted lists for fields.
    –  Use extents to keep track of fields in documents.

Field Information in Inverted Lists

•  Creating the term inverted list:
    –  For each document the term appears in,
        •  For each field the term appears in in that document,
            –  Store the term frequency within the field
•  Also store the “field frequency”
    –  i.e. total number of times the term appears in each field through the collection
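
The construction above can be sketched as follows, assuming parsing has already produced a mapping from document id to field text. The dictionary layout (`postings` and `field_freq` keys), the function name `build_field_index`, and the two sample documents are illustrative choices, not a prescribed on-disk format.

```python
from collections import Counter, defaultdict


def build_field_index(parsed_docs):
    """parsed_docs: {doc_id: {field: text}} -> one list per term with field counts."""
    index = defaultdict(lambda: {"field_freq": Counter(), "postings": {}})
    for doc_id, fields in parsed_docs.items():
        for field, text in fields.items():
            for term, tf in Counter(text.lower().split()).items():
                entry = index[term]
                # Term frequency within this field of this document.
                entry["postings"].setdefault(doc_id, {})[field] = tf
                # Running total for this field across the collection.
                entry["field_freq"][field] += tf
    return index


parsed_docs = {
    1: {"title": "tropical fish", "h1": "tropical fish",
        "body": "tropical fish tropical fish include fish found in tropical environments"},
    2: {"title": "fish tanks", "body": "tanks for tropical fish"},
}
index = build_field_index(parsed_docs)
print(index["tropical"]["postings"][1])      # {'title': 1, 'h1': 1, 'body': 3}
print(dict(index["tropical"]["field_freq"]))  # {'title': 1, 'h1': 1, 'body': 4}
```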

Field Information in Inverted List

Example

[Figure: inverted list for one term. Labels: document freq; <title> freq; <body> freq; <h1> freq; tf in doc 1; tf in <title> in doc 1; tf in <body> in doc 1; tf in <h1> in doc 1]

Add New Inverted Lists

•  Instead of storing all field information in one list, create a new list for each field the term appears in
•  Adds K new inverted lists, where K = the total number of fields the term appears in.
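
A minimal sketch of the per-field alternative, keying each list by a (field, term) pair. The key shape, the function name `build_per_field_lists`, and the sample documents are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_per_field_lists(parsed_docs):
    """Return {(field, term): [(doc_id, tf), ...]}, one list per field/term pair."""
    lists = defaultdict(list)
    for doc_id in sorted(parsed_docs):  # keep each list ordered by document id
        for field, text in parsed_docs[doc_id].items():
            for term, tf in Counter(text.lower().split()).items():
                lists[(field, term)].append((doc_id, tf))
    return lists


parsed_docs = {
    1: {"title": "tropical fish", "h1": "tropical fish",
        "body": "tropical fish tropical fish include fish found in tropical environments"},
    2: {"title": "fish tanks", "body": "tanks for tropical fish"},
}
lists = build_per_field_lists(parsed_docs)
print(lists[("body", "tropical")])  # [(1, 3), (2, 1)]

# K for "tropical": the number of fields it appears in anywhere in the collection.
k = len([key for key in lists if key[1] == "tropical"])
print(k)  # 3
```

A query restricted to one field only has to read that field's list, which is where the faster processing comes from, at the cost of storing K lists per term.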

Extents

•  An extent is a contiguous region in a document
•  Defined by a starting term position and an ending term position
    –  Extent from position 8 through position 36

Using Extents to Store Fields

•  Store term positions in term inverted lists
•  Define an extent inverted list for each field
•  Include the document number and range of positions the extent includes
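
A minimal sketch of combining a term's position list with a field's extent list at query time. It assumes extents are inclusive (start, end) pairs of term positions; the function name `positions_in_field` and the sample data are made up for illustration.

```python
def positions_in_field(term_positions, field_extents):
    """Return {doc_id: [positions]} for term occurrences that fall inside the field.

    term_positions: {doc_id: [position, ...]}
    field_extents:  {doc_id: [(start, end), ...]} with inclusive bounds
    """
    matches = {}
    for doc_id, positions in term_positions.items():
        extents = field_extents.get(doc_id, [])
        inside = [p for p in positions
                  if any(start <= p <= end for start, end in extents)]
        if inside:
            matches[doc_id] = inside
    return matches


fish_positions = {1: [2, 4, 9], 2: [15]}   # term inverted list for "fish"
title_extents = {1: [(1, 2)], 2: [(1, 3)]}  # extent inverted list for <title>
print(positions_in_field(fish_positions, title_extents))  # {1: [2]}
```

Because fields are resolved only when a query asks for them, a new field can be supported by adding one extent list without rebuilding the term lists.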

Field Storage Tradeoffs

•  Include field info in inverted lists:
    –  Storage efficient, fairly inflexible, fairly slow processing
•  New lists for terms in fields:
    –  Storage inefficient, more flexible, faster processing
•  Field extents:
    –  Storage efficient, very flexible, fairly fast processing

Anchor Text

•  Anchor text is text on another page used to link to a document
•  Can indicate what other people think the document is about
•  Can be taken as a short summary of the document's contents

Anchor Text Example

Indexing Anchor Text

•  Simple solution:
    –  Include anchor text as part of document text
    –  “Tropical” term frequency = # of times it appears in the document + # of times it appears in anchor text in documents linking to it
•  Slightly more complex solution:
    –  Include anchor text in fields, e.g. <anchor>
    –  One field for each link to the document
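
The simple solution amounts to adding two counts. A minimal sketch, with the function name `term_frequencies_with_anchors` and the sample anchor strings invented for illustration:

```python
from collections import Counter


def term_frequencies_with_anchors(doc_text, incoming_anchor_texts):
    """Count terms in the document plus terms in anchor text of links pointing to it."""
    counts = Counter(doc_text.lower().split())
    for anchor in incoming_anchor_texts:
        counts.update(anchor.lower().split())
    return counts


doc_text = "Tropical fish include fish found in tropical environments"
anchors = ["tropical fish", "fish of the tropics"]  # text of two incoming links
tf = term_frequencies_with_anchors(doc_text, anchors)
print(tf["tropical"])  # 2 in the document + 1 in anchor text = 3
```

The field-based solution would instead keep each anchor string separate, for example as its own <anchor> entry in a per-field index, so that anchor matches can be weighted differently from body matches.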

Inverted Lists at Google

•  As of 1998, Google stored the following:
    –  Whether a term occurrence is “plain” or “fancy”
        •  “Fancy” = occurs in URL, title, anchor text, or meta tag.
        •  “Plain” = everything else
    –  If plain, store:
        •  Whether capitalized, font size information, and position information (in 1 bit, 3 bits, and 12 bits respectively)
    –  If fancy, store:
        •  Whether capitalized, maximum font size, type of hit, and position information (in 1 bit, 3 bits, 4 bits, and 8 bits respectively)
        •  And if type = anchor, split 8 position bits into 4 docID bits and 4 position bits
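
Both layouts add up to 16 bits per hit, which can be illustrated with simple bit packing. This sketch makes several assumptions that the slide does not state: the field order within the 16 bits, the numeric hit-type codes, and the reading of "maximum font size" as the 3-bit font field being set to its largest value (7) so that fancy hits can be told apart from plain ones.

```python
FANCY_FONT = 0b111  # assumed marker: largest 3-bit font value signals a fancy hit

# Hypothetical hit-type codes; the real numbering is not given in the slides.
TYPE_TITLE, TYPE_ANCHOR = 0, 1


def pack_plain(capitalized, font_size, position):
    """16 bits, assumed order: [cap:1][font:3][position:12]."""
    assert 0 <= font_size < FANCY_FONT and 0 <= position < (1 << 12)
    return (int(capitalized) << 15) | (font_size << 12) | position


def pack_fancy(capitalized, hit_type, position):
    """16 bits, assumed order: [cap:1][font:3 = 7][type:4][position:8]."""
    assert 0 <= hit_type < (1 << 4) and 0 <= position < (1 << 8)
    return (int(capitalized) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def pack_anchor(capitalized, doc_id_bits, position):
    """Anchor hit: the 8 position bits are split into [docID:4][position:4]."""
    assert 0 <= doc_id_bits < (1 << 4) and 0 <= position < (1 << 4)
    return pack_fancy(capitalized, TYPE_ANCHOR, (doc_id_bits << 4) | position)


def is_fancy(hit):
    """A hit is fancy when its 3 font bits hold the reserved maximum value."""
    return (hit >> 12) & 0b111 == FANCY_FONT


plain = pack_plain(False, 3, 108)       # lower-case body occurrence at position 108
title = pack_fancy(True, TYPE_TITLE, 1)  # capitalized title occurrence at position 1
print(f"{plain:016b}")  # 0011000001101100
print(f"{title:016b}")  # 1111000000000001
```

Reserving one font value as a marker is what lets plain and fancy hits share a single fixed-width slot without a separate flag bit.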

Inverted Lists at Google

•  Example: “tropical” occurs 3 times in document
    –  Once capitalized in title at position 1
    –  Once capitalized in a header at position 4
    –  Once in lower-case in body text at position 108
•  Also occurs in 2 other linking documents
•  Google inverted list might look like this:

[Figure: Fancy hit 1 (title); Fancy hit 2 (header); Plain hit; Anchor hit 1; Anchor hit 2]

