structural text features - university of...

20
4/7/09 1 Structural Text Features CISC489/689‐010, Lecture #13 Monday, April 6 th Ben CartereGe Structural Features So far we have mainly focused on “vanilla” features of terms in documents Term frequency, document frequency “Bag of words” models Some documents have structure that we could leverage for improved retrieval Natural language has structure as well We can derive features from this structure, especially from the placement of terms within structure or placement of terms with respect to each other

Upload: others

Post on 20-Jan-2021

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

1

StructuralTextFeatures

CISC489/689‐010,Lecture#13Monday,April6th

BenCartereGe

StructuralFeatures

•  Sofarwehavemainlyfocusedon“vanilla”featuresoftermsindocuments–  Termfrequency,documentfrequency–  “Bagofwords”models

•  Somedocumentshavestructurethatwecouldleverageforimprovedretrieval– Naturallanguagehasstructureaswell

•  Wecanderivefeaturesfromthisstructure,especiallyfromtheplacementoftermswithinstructureorplacementoftermswithrespecttoeachother

Page 2: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

2

Example:HTML

•  “HyperTextMarkupLanguage”•  Providesdocumentstructureusingtagsenclosingtext–  <Ytle>:enclosedtextdisplayedattopofbrowser–  <body>:enclosedtextdisplayedinbrowser–  <h1>:enclosedtextdisplayedinlargefont–  <b>:enclosedtextdisplayedinbold–  <a>:enclosedtextcanbeclickedtogotoanotherpage

•  Thetextenclosedinfieldsiso]enunstructuredorstructuredwithmoreHTML

Example:HTML

Page 3: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

3

Example:HTML

•  HTMLpagesorganizeintotrees.

<HTML>

<HEAD>

<TITLE> Tropicalfish

<META>

<BODY>

<H1> Tropicalfish

<P>

<B> Tropicalfish

<A> fish

<A> tropical

includefoundinenvironmentsaroundtheworld

Nodescontainblocksoftext.

Example:Email

•  Headerfieldsprovidesomestructure

Page 4: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

4

StructureinNaturalLanguage

•  Oneexample:parsetrees

(fromhGp://www.lancs.ac.uk/fss/courses/ling/corpus/Corpus2/2PARSE.HTM)

Hyper‐Structure

•  Thedocumentsthemselvesmayoccurwithinsomestructure– Theweb:documentslinktoeachother,creaYngagraphstructure

– Email:threadedconversaYons– Sentencesformparagraphs,paragraphsformsecYons,secYonsformchapters,chaptersformbooks,…

•  Thisstructuremayprovideusefulfeatures

Page 5: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

5

UsingStructuralFeaturesinRetrieval

•  Steps:– Derivefeatures–documentprocessing

–  Indexfeatures–usinginvertedlists– Retrievalusingfeatures–retrievalmodels,scoringfuncYons,querylanguages

SpecificFeatures

•  Phrases:–  Sequencesofwordsinorder–  Userswanttoqueryphrases,e.g.“tropicalfish”

•  Fieldsandtags:–  Markupenclosingpartsofdocuments–  Wewanttoemphasizesomeparts,de‐emphasizeothers.E.g.

Ytlesimportant,sidebarsnot•  Webhyper‐structure:

–  Linksbetweenpages–  Wewantpagesthatarefrequentlylinkedusingthesametextto

scorehigherforqueriesthatcontainthattext•  Whatarethefeatures,howdowederivethem,howdowe

storethem,andhowdowemodeltheminretrieval?

Page 6: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

6

DerivingandIndexingFeatures

•  DerivaYonconsideraYons:–  ComputaYonalYmeandspacerequirements–  Errorsinprocessing– Useinqueries

•  IndexingconsideraYons:–  Fastqueryprocessing–  Flexibility(indexoncewithallinfoforcalculaYnganythingyoucanimaginevs.re‐indexeveryYmeyoucomeupwithanewidea)

–  Storage

Phrases•  Manyqueriesare2‐3wordphrases•  Phrasesare– Moreprecisethansinglewords

•  e.g.,documentscontaining“blacksea”vs.twowords“black”and“sea”

–  Lessambiguous•  e.g.,“bigapple”vs.“apple”

•  Canbedifficultforranking•  e.g.,Givenquery“fishingsupplies”,howdowescoredocumentswith–  exactphrasemanyYmes,exactphrasejustonce,individualwordsinsamesentence,sameparagraph,wholedocument,variaYonsonwords?

Page 7: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

7

Phrases

•  Textprocessingissue–howarephrasesrecognized?

•  Threepossibleapproaches:–  IdenYfysyntacYcphrasesusingapart‐of‐speech(POS)tagger

– Usewordn‐grams – StorewordposiYonsinindexesanduseproximityoperatorsinqueries

POSTagging

•  POStaggersusestaYsYcalmodelsoftexttopredictsyntacYctagsofwords– Exampletags:•  NN(singularnoun),NNS(pluralnoun),VB(verb),VBD(verb,pasttense),VBN(verb,pastparYciple),IN(preposiYon),JJ(adjecYve),CC(conjuncYon,e.g.,“and”,“or”),PRP(pronoun),andMD(modalauxiliary,e.g.,“can”,“will”).

•  Phrasescanthenbedefinedassimplenoungroups,forexample

Page 8: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

8

PosTaggingExample

ExampleNounPhrases

Page 9: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

9

NounPhraseInvertedLists

Q=“unitedstates”:retrieveinvertedlistforphrase“unitedstates”andprocessQ=unitedstates:retrieveinvertedlistsforterms“united”,“states”andprocess

WordN‐Grams

•  POStaggingtooslowforlargecollecYons•  SimplerdefiniYon–phraseisanysequenceofnwords–knownasn‐grams –  bigram:2wordsequence,trigram:3wordsequence,unigram:singlewords

– N‐gramsalsousedatcharacterlevelforapplicaYonssuchasOCR

•  N‐gramstypicallyformedfromoverlappingsequencesofwords–  i.e.moven‐word“window”onewordataYmeindocument

Page 10: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

10

WordBigrams

Tropicalfishfishincludeincludefishfishfoundfoundinintropicaltropicalenvironmentsenvironmentsaroundaroundthetheworld…

BigramInvertedLists

Thoughmanyunusualphrasesareincluded,termstaYsYcshelpensurethattheydonothurtretrieval

Page 11: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

11

N‐Grams

•  Frequentn‐gramsaremorelikelytobemeaningfulphrases

•  N‐gramsformaZipfdistribuYon– BeGerfitthanwordsalone

•  Couldindexalln‐gramsuptospecifiedlength– MuchfasterthanPOStagging

– Usesalotofstorage•  e.g.,documentcontaining1,000wordswouldcontain3,990instancesofwordn‐gramsoflength2≤ n ≤ 5

GoogleN‐Grams

•  Websearchenginesindexn‐grams•  Googlesample:

•  MostfrequenttrigraminEnglishis“allrightsreserved”–  InChinese,“limitedliabilitycorporaYon”

Page 12: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

12

UseTermPosiYons

•  Ratherthanstorephrasesinindexdirectly,storetermposiYonsandlocatephrasesatqueryYme

•  Matchphrasesorwordswithinawindow– e.g.,"tropical fish",or“findtropicalwithin5wordsoffish”

PhraseMethodTradeoffs

•  POStagging:–  VerylongindexYme,possibleerrors,mediumstoragerequirement,notveryflexible

–  Fastphrase‐queryprocessing•  N‐Grams:– Highstoragerequirement– Moreflexible,fastphrase‐queryprocessing

•  TermposiYons:– Medium‐lowstoragerequirement,veryflexible–  PossiblyslowerqueryprocessingduetoneedingtocalculatecollecYonstaYsYcs

Page 13: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

13

Parsing

•  Basicparsing:idenYfywhichpartsofdocumentstoindex,whichtoignore

•  Fullparsing:idenYfyandlabelpartsofdocuments,maintainstructure,decidewhichpartsarerelaYvelymoreimportant

HTMLParsing

•  AnHTMLparserproducesaDOMtree

•  WewanttostorebasicterminformaYon(v,idf)aswellasinformaYonaboutthenodesthetermappersin

<HTML>

<HEAD>

<TITLE> Tropicalfish

<META>

<BODY>

<H1> Tropicalfish

<P>

<B> Tropicalfish

<A> fish

<A> tropical

includefoundinenvironmentsaroundtheworld

Page 14: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

14

IndexingFields

•  A]erparsingwehave:–  <Ytle>:tropicalfish–  <body>:tropicalfishtropicalfishincludefishfoundintropicalenvironmentsaroundtheworld…

–  <h1>:tropicalfish–  <b>:tropicalfish–  <a>:fish–  <a>:topical

•  Ideasforindexing:–  StorefieldinformaYonininvertedlist.–  Addnewinvertedlistsforfields.–  Useextentstokeeptrackoffieldsindocuments.

FieldInformaYoninInvertedLists

•  CreaYngtheterminvertedlist:– Foreachdocumentthetermappearsin,•  Foreachfieldthetermappearsininthatdocument,

–  Storethetermfrequencywithinthefield

•  Alsostorethe“fieldfrequency”–  i.e.totalnumberofYmesthetermappearsineachfieldthroughthecollecYon

Page 15: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

15

FieldInformaYoninInvertedList

Example

Documentfreq

<Ytle>freq

<body>freq

<h1>freq

vindoc1

vin<Ytle>indoc1

vin<body>indoc1

vin<h1>indoc1

Page 16: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

16

AddNewInvertedLists

•  InsteadofstoringallfieldinformaYoninonelist,createanewlistforeachfieldthetermappearsin

•  AddsKnewinvertedlists,whereK=thetotalnumberoffieldsthetermappearsin.

Example

Page 17: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

17

Extents

•  AnextentisaconYguousregioninadocument

•  DefinedbyastarYngtermposiYonandanendingtermposiYon– \ ExtentfromposiYon8

throughposiYon36

UsingExtentstoStoreFields

•  StoretermposiYonsinterminvertedlists•  Defineanextentinvertedlistforeachfield•  IncludethedocumentnumberandrangeofposiYonstheextentincludes

Page 18: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

18

FieldStorageTradeoffs

•  Includefieldinfoininvertedlists:– Storageefficient,fairlyinflexible,fairlyslowprocessing

•  Newlistsfortermsinfields:– Storageinefficient,moreflexible,fasterprocessing

•  Fieldextents:– Storageefficient,veryflexible,fairlyfastprocessing

AnchorText

•  Anchor textistextonanotherpageusedtolinktoadocument

•  Canindicatewhatotherpeoplethinkthedocumentisabout

•  Canbetakenasashortsummaryofthedocumentscontents

Page 19: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

19

AnchorTextExample

IndexingAnchorText

•  SimplesoluYon:–  Includeanchortextaspartofdocumenttext

– “Tropical”termfrequency=#ofYmesitappearsinthedocument+#ofYmesitappearsinanchortextindocumentslinkingtoit

•  SlightlymorecomplexsoluYon:–  Includeanchortextinfields,e.g.<anchor>– Onefieldforeachlinktothedocument

Page 20: Structural Text Features - University of Delawareir.cis.udel.edu/~carteret/CISC689/slides/lecture12.pdf · 2009. 4. 7. · 4/7/09 1 Structural Text Features CISC489/689‐010, Lecture

4/7/09

20

InvertedListsatGoogle

•  Asof1998,Googlestoredthefollowing:– Whetheratermoccurrenceis“plain”or“fancy”

•  “Fancy”=occursinURL,Ytle,anchortext,ormetatag.•  “Plain”=everythingelse

–  Ifplain,store:•  Whethercapitalized,fontsizeinformaYon,andposiYoninformaYon(in1bit,3bits,and12bitsrespecYvely)

–  Iffancy,store:•  Whethercapitalized,maximumfontsize,typeofhit,andposiYoninformaYon(in1bit,3bits,4bits,and8bitsrespecYvely)

•  Andiftype=anchor,split8posiYonbitsinto4docIDbitsand4posiYonbits

InvertedListsatGoogle

•  Example:“tropical”occurs3Ymesindocument– OncecapitalizedinYtleatposiYon1– OncecapitalizedinaheaderatposiYon4– Onceinlower‐caseinbodytextatposiYon108

•  Alsooccursin2otherlinkingdocuments•  Googleinvertedlistmightlooklikethis:

• 

Fancyhit1(Ytle)

Fancyhit2(header)

PlainhitAnchorhit1

Anchorhit2