old indexing preprocess - northeastern university

1many slides courtesy James Allanumass

indexing

2

bull File organizations or indexes are used to increaseperformance of system

ndash Will talk about how to store indexes later

bull Text indexing is the process of deciding what will beused to represent a given documentbull These index terms are then used to build indexes forthe documentsbull The retrieval model described how the indexed termsare incorporated into a model

ndash Relationship between retrieval model and indexing model

3

Manual vs Automatic Indexingbull Manual or human indexing

ndash Indexers decide which keywords to assign to document based on controlled vocabularybull eg MEDLINE MeSH LC subject headings Yahoondash Significant cost

bull Automatic indexingndash Indexing program decides which words phrases or other features to use from text of documentndash Indexing speeds range widely

bull Indri (CIIR research system) indexes approximately 10GBhour

4

bull Index languagendash Language used to describe documents and queries

bull Exhaustivityndash Number of different topics indexed completeness

bull Specificityndash Level of accuracy of indexing

bull Pre-coordinate indexingndash Combinations of index terms (eg phrases) used as indexing labelndash Eg author lists key phrases of a paper

bull Post-coordinate indexingndash Combinations generated at search timendash Most common and the focus of this course

5

A -- GENERAL WORKS B -- PHILOSOPHY PSYCHOLOGY RELIGION C -- AUXILIARY SCIENCES OF HISTORY D -- HISTORY GENERAL AND OLD WORLD E -- HISTORY AMERICA F -- HISTORY AMERICA G -- GEOGRAPHY ANTHROPOLOGY RECREATION H -- SOCIAL SCIENCES J -- POLITICAL SCIENCE K -- LAW L -- EDUCATION M -- MUSIC AND BOOKS ON MUSIC N -- FINE ARTS P -- LANGUAGE AND LITERATURE Q -- SCIENCE R -- MEDICINE S -- AGRICULTURE T -- TECHNOLOGY U -- MILITARY SCIENCE V -- NAVAL SCIENCE Z -- BIBLIOGRAPHY LIBRARY SCIENCE INFORMATION RESOURCES (GENERAL)

6

7

8

bull Experimental evidence is that retrieval effectivenessusing automatic indexing can be at least as effectiveas manual indexing with controlled vocabularies

ndash original results were from the Cranfield experiments in the 60sndash considered counter-intuitivendash other results since then have supported this conclusionndash broadly accepted at this point

bull Experiments have also shown that using both manualand automatic indexing improves performance

ndash ldquocombination of evidencerdquo

9

bull Parse documents to recognize structurendash eg title date other fieldsndash clear advantage to XML

bull Scan for word tokensndash numbers special characters hyphenation capitalization etcndash languages like Chinese need segmentationndash record positional information for proximity operators

bull Stopword removalndash based on short list of common words such as ldquotherdquo ldquoandrdquo ldquoorrdquondash saves storage overhead of very long indexesndash can be dangerous (eg ldquoThe Whordquo ldquoand-or gatesrdquo ldquovitamin ardquo)

10

bull Stem wordsndash morphological processing to group word variants such as pluralsndash better than string matching (eg comput)ndash can make mistakes but generally preferredndash not done by most Web search engines (why)

bull Weight wordsndash want more ldquoimportantrdquo words to have higher weightndash using frequency in documents and databasendash frequency data independent of retrieval model

bull Optionalndash phrase indexingndash thesaurus classes (probably will not discuss)ndash others

11

bull Parse and tokenizebull Remove stop wordsbull Stemmingbull Weight terms

12

bull Simple indexing is based on words or word stemsndash More complex indexing could include phrases or thesaurus classesndash Index term is general name for word phrase or feature used for indexing

bull Concept-based retrieval often used to imply somethingbeyond word indexingbull In virtually all systems a concept is a name given to a setof recognition criteria or rules

ndash similar to a thesaurus class

bull Words phrases synonyms linguistic relations can all beevidence used to infer presence of the conceptbull eg the concept ldquoinformation retrievalrdquo can be inferredbased on the presence of the words ldquoinformationrdquoldquoretrievalrdquo the phrase ldquoinformation retrievalrdquo and maybethe phrase ldquotext retrievalrdquo

13

bull Both statistical and syntactic methods have been usedto identify ldquogoodrdquo phrasesbull Proven techniques include finding all word pairs thatoccur more than n times in the corpus or using a partof-speech tagger to identify simple noun phrases

ndash 1100000 phrases extracted from all TREC data (more than1000000 WSJ AP SJMS FT Ziff CNN documents)ndash 3700000 phrases extracted from PTO 1996 data

bull Phrases can have an impact on both effectiveness andefficiency

ndash phrase indexing will speed up phrase queriesndash finding documents containing ldquoBlack Seardquo better than finding documents containing both wordsndash effectiveness not straightforward and depends on retrieval model

bull eg for ldquoinformation retrievalrdquo how much do individual words count

14

15

16

bull Special recognizers for specific conceptsndash people organizations places dates monetary amounts products hellip

bull ldquoMetardquo terms such as COMPANY PERSON canbe added to indexingbull eg a query could include a restriction like ldquohellipthedocument must specify the location of the companiesinvolvedhelliprdquobull Could potentially customize indexing by adding morerecognizers

ndash difficult to buildndash problems with accuracyndash adds considerable overhead

bull Key component of question answering systemsndash To find concepts of the right type (eg people for ldquowhordquo questions)

17

18

bull Remove non-content-bearing wordsndash Function words that do not convey much meaning

bull Can be as few as one wordndash What might that be

bull Can be several hundredsndash Surprising() examples from Inquery at UMass (of 418)ndash Halves exclude exception everywhere sang saw see smote slew year cos ff double down

bull Need to be careful of words in phrasesndash Library of Congress Smoky the Bear

bull Primarily an efficiency device though sometimeshelps with spurious matches

19

Word Occurrences Percentage the 8543794 68of 3893790 31to 3364653 27and 3320687 26in 2311785 18is 1559147 12for 1313561 10that 1066503 08said 1027713 08Frequencies from 336310 documents in the 1GB TREC Volume 3 Corpus125720891 total word occurrences 508209 unique words

20

a about above according across after afterwards again against albeit all almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anywhere apart are around as at av be became because become becomes becoming been before beforehand behind being below beside besides between beyond both but by can cannot canst certain cf choose contrariwise cos could cu day do does doesnt doing dost doth double down dual during each either else elsewhere enough et etc even ever every everybody everyone everything everywhere except excepted excepting exception exclude excluding exclusive far farther farthest few ff first for formerly forth forward from front further furthermore furthest get go had halves hardly has hast hath have he hence henceforth her here hereabouts hereafter hereby herein hereto hereupon hers herself him himself hindmost his hither hitherto how however howsoever i ie if in inasmuch inc include included including indeed indoors inside insomuch instead into inward inwards is it its itself just kind kg km last latter latterly less lest let like little ltd many may maybe me meantime meanwhile might moreover most mostly more mr mrs ms much must my myself namely need neither never nevertheless next no nobody none nonetheless noone nope nor not nothing notwithstanding now nowadays nowhere of off often ok on once one only onto or other others otherwise ought our ours ourselves out outside over own per perhaps plenty provide quite rather really round said sake same sang save saw see seeing seem seemed seeming seems seen seldom selves sent several shalt she should shown sideways since slept slew slung slunk smote so some somebody somehow someone something sometime sometimes somewhat somewhere spake spat spoke spoken sprang sprung stave staves still such supposing than that the thee their them themselves then thence thenceforth there thereabout therabouts thereafter thereby therefore therein thereof thereon thereto thereupon these they this those thou though thrice through throughout thru thus thy thyself till to together too toward towards ugh unable under underneath unless unlike until up upon upward upwards us use used using very via vs want was we week well were what whatever whatsoever when whence whenever whensoever where whereabouts whereafter whereas whereat whereby wherefore wherefrom wherein whereinto whereof whereon wheresoever whereto whereunto whereupon wherever wherewith whether whew which whichever whichsoever while whilst whither who whoa whoever whole whom whomever whomsoever whose whosoever why will wilt with within without worse worst would wow ye yet year yippee you your yours yourself yourselves

21

bull Stemming is commonly used in IR to conflatemorphological variantsbull Typical stemmer consists of collection of rulesandor dictionaries

ndash simplest stemmer is ldquosuffix srdquondash Porter stemmer is a collection of rulesndash KSTEM [Krovetz] uses lists of words plus rules for inflectionaland derivational morphologyndash similar approach can be used in many languagesndash some languages are difficult eg Arabic

bull Small improvements in effectiveness andsignificant usability benefits

ndash With huge document set such as the Web less valuable

22

servomanipulator | servomanipulators servomanipulator logic | logical logic logically logics logicals logicial logicially login | login logins microwire | microwires microwire overpressurize | overpressurization overpressurized overpressurizations overpressurizing overpressurize vidrio | vidrio sakhuja | sakhuja rockel | rockel pantopon | pantopon knead | kneaded kneads knead kneader kneading kneaders linxi | linxi rocket | rockets rocket rocketed rocketing rocketings rocketeer hydroxytoluene | hydroxytoluene ripup | ripup

23

bull Based on a measure of vowel-consonant sequences ndash measure m for a stem is [C](VC)m[V] where C is a sequence of consonants and V is a sequence of vowels (inc y) [] = optional ndash m=0 (tree by) m=1 (troubleoats trees ivy) m=2 (troubles private)

bull Algorithm is based on a set of condition action rules ndash old suffix new suffix ndash rules are divided into steps and are examined in sequence

bull Longest match in a step is the one used ndash eg Step 1a sses ss (caresses caress) ies i (ponies 10485761048577104857810485791048580104858110485821048583104858410485851048586104858710485881048589104859010485911048592104859310485941048595104859610485971048598104859910486001048601104860210486031048604104860510486061048607104860810486091048610104861110486121048613104861410486151048616104861710486181048619104862010486211048622104862310486241048625104862610486271048628104862910486301048631104863210486331048634104863510486361048637104863810486391048640104864110486421048643104864410486451048646104864710486481048649104865010486511048652104865310486541048655104865610486571048658104865910486601048661104866210486631048664104866510486661048667104866810486691048670104867110486721048673104867410486751048676104867710486781048679104868010486811048682104868310486841048685104868610486871048688104868910486901048691104869210486931048694104869510486961048697104869810486991048700104870110487021048703poni) s NULL (cats 10485761048577104857810485791048580104858110485821048583104858410485851048586104858710485881048589104859010485911048592104859310485941048595104859610485971048598104859910486001048601104860210486031048604104860510486061048607104860810486091048610104861110486121048613104861410486151048616104861710486181048619104862010486211048622104862310486241048625104862610486271048628104862910486301048631104863210486331048634104863510486361048637104863810486391048640104864110486421048643104864410486451048646104864710486481048649104865010486511048652104865310486541048655104865610486571048658104865910486601048661104866210486631048664104866510486661048667104866810486691048670104867110486721048673104867410486751048676104867710486781048679104868010486811048682104868310486841048685104868610486871048688104868910486901048691104869210486931048694104869510486961048697104869810486991048700104870110487021048703cat) ndash eg Step 1b if mgt0 eed ee (agreed 10485761048577104857810485791048580104858110485821048583104858410485851048586104858710485881048589104859010485911048592104859310485941048595104859610485971048598104859910486001048601104860210486031048604104860510486061048607104860810486091048610104861110486121048613104861410486151048616104861710486181048619104862010486211048622104862310486241048625104862610486271048628104862910486301048631104863210486331048634104863510486361048637104863810486391048640104864110486421048643104864410486451048646104864710486481048649104865010486511048652104865310486541048655104865610486571048658104865910486601048661104866210486631048664104866510486661048667104866810486691048670104867110486721048673104867410486751048676104867710486781048679104868010486811048682104868310486841048685104868610486871048688104868910486901048691104869210486931048694104869510486961048697104869810486991048700104870110487021048703agree) if ved NULL (plastered plaster but bled 10485761048577104857810485791048580104858110485821048583104858410485851048586104858710485881048589104859010485911048592104859310485941048595104859610485971048598104859910486001048601104860210486031048604104860510486061048607104860810486091048610104861110486121048613104861410486151048616104861710486181048619104862010486211048622104862310486241048625104862610486271048628104862910486301048631104863210486331048634104863510486361048637104863810486391048640104864110486421048643104864410486451048646104864710486481048649104865010486511048652104865310486541048655104865610486571048658104865910486601048661104866210486631048664104866510486661048667104866810486691048670104867110486721048673104867410486751048676104867710486781048679104868010486811048682104868310486841048685104868610486871048688104868910486901048691104869210486931048694104869510486961048697104869810486991048700104870110487021048703bled) then at ate (conflat(ed) conflate)

bull Many implementations available ndash httpwwwtartarusorg~martinPorterStemmer

bull Good average recall and precision

24

bull Original text Document will describe marketing strategies carried out by US companies for their agricultural chemicals report predictions for market share of such chemicals or report market statistics for agrochemicals pesticide herbicide fungicide insecticide fertilizer predicted sales market share stimulate demand price cut volume of sales bull Porter Stemmer market strateg carr compan agricultur chemic report predict market share chemic report market statist agrochem pesticid herbicid fungicid insecticid fertil predict sale stimul demand price cut volum sale bull KSTEM marketing strategy carry company agriculture chemical report prediction market share chemical report market statistic agrochemic pesticide herbicide fungicide insecticide fertilizer predict sale stimulate demand price cut volume sale

25

bull Lack of domain-specificity and context can lead to occasional serious retrieval failures bull Stemmers are often difficult to understand and modify bull Sometimes too aggressive in conflation

ndash eg ldquopolicyrdquoldquopolicerdquo ldquoexecuterdquoldquoexecutiverdquo ldquouniversityrdquoldquouniverserdquo ldquoorganizationrdquoldquoorganrdquo are conflated by Porter

bull Miss good conflations ndash eg ldquoEuropeanrdquoldquoEuroperdquo ldquomatricesrdquoldquomatrixrdquo ldquomachinerdquoldquomachineryrdquo are not conflated by Porter

bull Produce stems that are not words and are often difficult for a user to interpret

ndash eg with Porter ldquoiterationrdquo produces ldquoiterrdquo and ldquogeneralrdquo produces ldquogenerrdquo

bull Corpus analysis can be used to improve a stemmer or replace it

26

bull Hypothesis Word variants that should be conflated will co-occur in documents (text windows) in the corpus bull Modify equivalence classes generated by a stemmer or other ldquoaggressiverdquo techniques such as initial n-grams

ndash more aggressive classes mean less conflations missed

bull New equivalence classes are clusters formed using (modified) EMIM scores between pairs of word variants bull Can be used for other languages

27

Some Porter Classes for a WSJ Database abandon abandoned abandoning abandonment abandonments abandons abate abated abatement abatements abates abating abrasion abrasions abrasive abrasively abrasiveness abrasives absorb absorbable absorbables absorbed absorbencies absorbency absorbent absorbents absorber absorbers absorbing absorbs abusable abuse abused abuser abusers abuses abusing abusive abusively access accessed accessibility accessible accessing accession

Classes refined through corpus analysis (singleton classes omitted)

abandonment abandonments abated abatements abatement abrasive abrasives absorbable absorbables absorbencies absorbency absorbent absorber absorbers abuse abusing abuses abusive abusers abuser abused accessibility accessible

28

29

bull Clustering technique used has an impactbull Both Porter and KSTEM stemmers are improvedslightly by this technique (max of 4 avg precisionon WSJ)bull N-gram stemmer gives same performance asimproved ldquolinguisticrdquo stemmersbull N-gram stemmer gives same performance asbaseline Spanish linguistic stemmerbull Suggests advantage to this technique for

ndash building new stemmersndash building stemmers for new languages

30

bull Basic Issue Which terms should be used to index(describe) a documentbull Different focus than retrieval model but relatedbull Sometimes seen as term weightingbull Some approaches

ndash TFmiddotIDFndash Term Discrimination modelndash 2-Poisson modelndash Clumping modelndash Language models

31

bull What makes a term good for indexingndash Trying to represent ldquokeyrdquo concepts in a document

bull What makes an index term good for a query

32

bull Standard weighting approach for many IR systemsndash many different variations of exactly how it is calculatedbull TF component - the more often a term occurs in adocument the more important it is in describing that document

ndash normalized term frequencyndash normalization can be based on maximum term frequency or could include a document length componentndash often includes some correction for estimation using small samplesndash some bias towards numbers between 04-10 to represent fact that a singleoccurrence of a term is importantndash logarithms used to smooth numbers for large collectionsndash eg where c is a constant such as 04 tf is the term frequency in thedocument and max_tf is the maximum term frequency in any document

33

34

35

36

37

bull Proposed by Salton in 1975bull Based on vector space model

ndash documents and queries are vectors in an n-dimensionalspace for n terms

bull Compute discrimination value of a term

ndash degree to which use of the term will help to distinguishdocuments

bull Compare average similarity of documents bothwith and without an index term

38

bull Compute average similarity or ldquodensityrdquo of documentspace

ndash AVGSIM is the densityndash where K is a normalizing constant (eg 1n(n-1))ndash similar() is a similarity function such as cosine correlation

bull Can be computed more efficiently using an averagedocument or centroid

ndash frequencies in the centroid vector are average of frequencies in document vectors

39

bull Let (AVGSIM)k be density with term k removed from documentsbull Discrimination value for term k is DISCVALUEk = (AVGSIM)k - AVGSIMbull Good discriminators have positive DISCVALUEk

ndash introduction of term decreases the density (moves some docs away)ndash tend to be medium frequency

bull Indifferent discriminators have DISCVALUE near zerondash introduction of term has no effectndash tend to be low frequency

bull Poor discriminators have negative DISCVALUEndash introduction of term increases the density (moves all docs closer)ndash tend to be high frequency

bull Obvious criticism is that discrimination of relevant and nonrelevant documents is the important factor

40

41

42

bull Index model identifies how to represent documents ndash Manual ndash Automatic

bull Typically consider content-based indexing ndash Using features that occur within the document

bull Identifying features used to represent documents

ndash Words phrases concepts hellip bull Normalizing them if needed

ndash Stopping stemming hellip bull Assigning a weight (significance) to them

ndash TFmiddotIDF discrimination value bull Some decisions determined by retrieval model

ndash Eg language modeling incorporates ldquoweightingrdquo directly

2

bull File organizations or indexes are used to increaseperformance of system

ndash Will talk about how to store indexes later

bull Text indexing is the process of deciding what will beused to represent a given documentbull These index terms are then used to build indexes forthe documentsbull The retrieval model described how the indexed termsare incorporated into a model

ndash Relationship between retrieval model and indexing model

3





4






5


6

7

8





9




10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








3





4






5


6

7

8





9




10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








4






5


6

7

8





9




10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








5


6

7

8





9




10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








6

7

8





9




10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








7

8





9




10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








8





9




10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








9




10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








10




11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








11


12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








12





13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








13






14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








14

15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








15

16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








16





17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








17

18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








18






19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








19


20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








20


21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








21





22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








22


23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








23






24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








24


25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








25







26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








26




27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








27




28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








28

29



30



31



32



33

34

35

36

37






38





39






40

41

42








29



30



31



32



33

34

35

36

37






38





39






40

41

42








30



31



32



33

34

35

36

37






38





39






40

41

42








31



32



33

34

35

36

37






38





39






40

41

42








32



33

34

35

36

37






38





39






40

41

42








33

34

35

36

37






38





39






40

41

42








34

35

36

37






38





39






40

41

42








35

36

37






38





39






40

41

42








36

37






38





39






40

41

42








37






38





39






40

41

42








38





39






40

41

42








39






40

41

42








40

41

42








41

42








42








old indexing preprocess - northeastern university

Documents