text processing - ir.cis.udel.edu

3/17/09

Text Processing

CISC489/689-010, Lecture #3
Monday, Feb. 16

Ben Carterette

Indexing

•  An index is a list of things (keys) with pointers to other things (items).
   – Keywords → catalog numbers (shelves).
   – Concepts → page numbers.
   – Terms → documents.

•  Need for indexes:
   – Ease of use.
   – Speed.
   – Scalability.

Manual vs. Automatic Indexing

•  Manual:
   – An “expert” assigns keys to each item.
   – Example: card catalog.

•  Automatic:
   – Keys automatically identified and assigned.
   – Example: Google.

•  Automatic as good as manual for most purposes.

Text Processing

•  First step in automatic indexing.
•  Converting documents into index terms.

•  Terms are not just words.
   – Not all words are of equal value in a search.
   – Sometimes not clear where words begin and end.
      •  Especially when not space-separated, e.g. Chinese, Korean.
   – Matching the exact words typed by the user doesn’t work very well in terms of effectiveness.

Text Processing Steps

•  For each document:
   – Parse it to locate the parts that are important.
   – Segment and tokenize the text in the important parts to get words.
   – Remove stop words.
   – Stem words to common roots.

•  Advanced processing may include phrases, entity tagging, link-graph features, and more.

Parsing

•  Some parts of a document are more important than others.

•  Document parser recognizes structure using markup such as HTML tags.
   – Headers, anchor text, bolded text are likely to be important.
   – JavaScript, style information, navigation links less likely to be important.
   – Metadata can also be important.

Example Wikipedia Page

Wikipedia Markup

<title>Tropical fish</title>
<text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} '''Tropical fish''' include [[fish]] found in [[Tropics|tropical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ''tropical fish'' to refer only those requiring fresh water, with saltwater tropical fish referred to as ''[[list of marine aquarium fish species|marine fish]]''.

Wikipedia HTML

Document Parsing

•  HTML pages organize into trees.

<HTML>
   <HEAD>
      <TITLE> Tropical fish
      <META>
   <BODY>
      <H1> Tropical fish
      <P>
         <B> Tropical fish
         include
         <A> fish
         found in
         <A> tropical
         environments around the world

Nodes contain blocks of text.

End Result of Parsing

•  Blocks of text from important parts of page.
   – Tropical fish include fish found in tropical environments around the world, including both freshwater and salt water species. Fishkeepers often use the term “tropical fish” to refer only those requiring fresh water, with saltwater tropical fish referred to as “marine fish”.

•  Next step: segmenting and tokenizing.
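The parsing step for the Wikipedia markup example can be sketched as below. This is only a rough illustration, assuming templates are not nested and that only `{{...}}`, `[[...]]`, and apostrophe emphasis need handling; the class name `WikiMarkupCleaner` and its method are hypothetical and not part of the lecture.

```java
import java.util.regex.Pattern;

// Hypothetical helper: strips a small subset of Wikipedia markup.
class WikiMarkupCleaner {
    // {{...}} templates without nesting, e.g. {{Unreferenced|date=July 2008}}
    private static final Pattern TEMPLATE = Pattern.compile("\\{\\{[^{}]*\\}\\}");
    // [[target|label]] or [[target]]; group 1 is the text shown to the reader
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(?:[^\\]|]*\\|)?([^\\]]*)\\]\\]");
    // Runs of two or more apostrophes mark italic or bold text
    private static final Pattern EMPHASIS = Pattern.compile("'{2,}");

    static String clean(String markup) {
        String text = TEMPLATE.matcher(markup).replaceAll("");
        text = LINK.matcher(text).replaceAll("$1");
        text = EMPHASIS.matcher(text).replaceAll("");
        // Collapse whitespace left behind by the removed markup
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String markup = "{{Unreferenced|date=July 2008}} '''Tropical fish''' include "
                + "[[fish]] found in [[Tropics|tropical]] environments";
        System.out.println(clean(markup));
    }
}
```

A real parser would also need to keep track of which field (title, link text, body) each block came from, since the slides treat those parts as differently important.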

Tokenizing

•  Forming words from sequence of characters in blocks of text.

•  Surprisingly complex in English, can be harder in other languages.

•  Early IR systems:
   – Any sequence of alphanumeric characters of length 3 or more.
   – Terminated by a space or other special character.
   – Upper-case changed to lower-case.

Tokenizing

•  Example:
   – “Bigcorp's 2007 bi-annual report showed profits rose 10%.” becomes
   – “bigcorp 2007 annual report showed profits rose”

•  Too simple for search applications or even large-scale experiments

•  Why? Too much information lost
   – Small decisions in tokenizing can have major impact on effectiveness of some queries
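The early-IR rule (alphanumeric runs of length 3 or more, lower-cased) is small enough to sketch directly. The class and method names here are illustrative assumptions, not something given in the lecture.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch of the "early IR systems" tokenizer.
class EarlyTokenizer {
    static List<String> tokenize(String text, int minLength) {
        List<String> tokens = new ArrayList<>();
        // Any non-alphanumeric character terminates a token
        for (String piece : text.split("[^A-Za-z0-9]+")) {
            if (piece.length() >= minLength) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "Bigcorp's 2007 bi-annual report showed profits rose 10%.";
        // prints [bigcorp, 2007, annual, report, showed, profits, rose]
        System.out.println(tokenize(sentence, 3));
    }
}
```

The dropped pieces ("s", "bi", "10") show where information is lost: the possessive, half of the hyphenated word, and the number all disappear.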

Tokenizing Problems

•  Small words can be important in some queries, usually in combinations
   •  xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II

•  Both hyphenated and non-hyphenated forms of many words are common
   – Sometimes hyphen is not needed
      •  e-bay, wal-mart, active-x, cd-rom, t-shirts
   – At other times, hyphens should be considered either as part of the word or a word separator
      •  winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking

Tokenizing Problems

•  Special characters are an important part of tags, URLs, code in documents

•  Capitalized words can have different meaning from lower case words
   – Bush, Apple

•  Apostrophes can be a part of a word, a part of a possessive, or just a mistake
   – rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

Tokenizing Problems

•  Numbers can be important, including decimals
   – nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358

•  Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations
   – I.B.M., Ph.D., cis.udel.edu

•  Note: tokenizing steps for queries must be identical to steps for documents

Tokenizing Process

•  Assume we have used the parser to find blocks of important text.

•  A word may be any sequence of alphanumeric characters terminated by a space or special character.
   – everything converted to lower case.
   – everything indexed.

•  Defer complex decisions to other components
   – example: 92.3 → 92 3 but search finds documents with 92 and 3 adjacent
   – incorporate some rules to reduce dependence on query transformation components
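The "92.3 → 92 3" example depends on applying the same tokenizer to queries and documents and then requiring the query tokens to be adjacent. A minimal sketch, assuming a plain in-memory token list rather than a real positional index (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Sketch: one tokenizer shared by documents and queries, plus an adjacency check.
class ConsistentTokenizing {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Lower-case everything and index every alphanumeric run, however short
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // True when the query's tokens occur next to each other, in order, in the document
    static boolean containsAdjacent(String document, String query) {
        return Collections.indexOfSubList(tokenize(document), tokenize(query)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(containsAdjacent("Listen to 92.3 The Beat", "92.3"));
        System.out.println(containsAdjacent("92 songs in 3 days", "92.3"));
    }
}
```

Because both sides go through `tokenize`, the period in "92.3" never has to be interpreted; the decision is deferred to the matching step.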

End Result of Tokenization

•  List of words in blocks of text.
   – tropical fish include fish found in tropical environments around the world including both freshwater and salt water species fishkeepers often use the term tropical fish to refer only those requiring fresh water with saltwater tropical fish referred to as marine fish

•  Next step: stopping.
•  But first: text statistics.

Text Statistics

•  Huge variety of words used in text but
•  Many statistical characteristics of word occurrences are predictable
   – e.g., distribution of word counts

•  Retrieval models and ranking algorithms depend heavily on statistical properties of words
   – e.g., important words occur often in documents but are not high frequency in collection

Zipf’s Law

•  Distribution of word frequencies is very skewed
   – a few words occur very often, many words hardly ever occur
   – e.g., two most common words (“the”, “of”) make up about 10% of all word occurrences in text documents

•  Zipf’s “law”:
   – observation that rank ($r$) of a word times its frequency ($f$) is approximately a constant ($k$)
      •  assuming words are ranked in order of decreasing frequency
   – i.e., $r \cdot f \approx k$ or $r \cdot P_r \approx c$, where $P_r$ is probability of word occurrence and $c \approx 0.1$ for English

Zipf’s Law

Wikipedia Statistics (wiki000 subset)

Total documents                      5,001
Total word occurrences          22,545,922
Vocabulary size                    348,436
Words occurring > 1000 times         2,751
Words occurring once               163,404

Word          Freq    r          Pr (%)       r·Pr
politician    5096    510        0.023        0.116
contractor    100     14,852     4.4·10^-4    0.066
kickboxer     10      56,125     4.4·10^-5    0.025
comedian      1       185,035    4.4·10^-6    0.008
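The r·Pr column can be recomputed from the frequency, rank, and total word occurrences listed above. The sketch below is an illustration only; the class name is an assumption, and because the slide's percentages are rounded, the computed values come out close to the table rather than exactly equal.

```java
// Sketch: rank times probability of occurrence for a single word.
class ZipfCheck {
    static double rankTimesProbability(long rank, long frequency, long totalOccurrences) {
        double probability = (double) frequency / totalOccurrences;  // P_r
        return rank * probability;                                   // r * P_r
    }

    public static void main(String[] args) {
        long total = 22_545_922L;  // total word occurrences in the wiki000 subset
        String[] words = {"politician", "contractor", "kickboxer", "comedian"};
        long[] freqs = {5096, 100, 10, 1};
        long[] ranks = {510, 14_852, 56_125, 185_035};
        for (int i = 0; i < words.length; i++) {
            System.out.printf("%-12s r*Pr = %.3f%n", words[i],
                    rankTimesProbability(ranks[i], freqs[i], total));
        }
    }
}
```

The product drifts from roughly 0.1 down toward 0.01 as rank grows, which is the usual sign that Zipf's law fits the middle of the distribution better than the tail.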

Top 50 Words from wiki000 Subset

Zipf’s Law for wiki000 Subset

[Figure: probability of occurrence plotted against rank]

Zipf’s Law

•  What is the proportion of words with a given frequency?
   – Word that occurs $n$ times has rank $r_n = k/n$
   – Number of words with frequency $n$ is
      •  $r_n - r_{n+1} = k/n - k/(n+1) = k/(n(n+1))$
   – Proportion found by dividing by total number of words = highest rank = $k$
   – So, proportion with frequency $n$ is $1/(n(n+1))$

Zipf’s Law

•  Example word frequency ranking

Rank    Word         Freq
4999    objective    494
5000    albany       494
5001    defend       494
5002    appeals      493
5003    125          493
5004    lasting      493
5005    png          493

•  To compute number of words with frequency 493
   – rank of “png” minus the rank of “defend”
   – 5005 − 5001 = 4

Example

•  Proportions of words occurring $n$ times in 5,001 Wikipedia documents

•  Vocabulary size is 348,436.

Num. occurrences (n)    Predicted proportion (1/(n(n+1)))    Actual proportion    Actual number of words
1                       .500                                 .469                 163,404
2                       .167                                 .151                 52,672
3                       .083                                 .070                 24,272
4                       .050                                 .045                 15,685
5                       .033                                 .030                 10,437
6                       .024                                 .022                 7,832
7                       .018                                 .017                 5,962
8                       .014                                 .014                 4,890
9                       .011                                 .011                 3,886
10                      .009                                 .009                 3,291
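The predicted and actual columns can be recomputed with a few lines of code. This is a sketch under the slide's own assumptions (the prediction is 1/(n(n+1)), the actual proportion is the word count divided by the vocabulary size); the class name is made up for illustration.

```java
// Sketch: predicted vs. actual proportion of words occurring n times.
class ZipfProportions {
    // Zipf-based prediction: 1 / (n (n + 1))
    static double predicted(int n) {
        return 1.0 / ((double) n * (n + 1));
    }

    // Observed proportion: words with this frequency over vocabulary size
    static double actual(long wordsWithFrequency, long vocabularySize) {
        return (double) wordsWithFrequency / vocabularySize;
    }

    public static void main(String[] args) {
        long vocabulary = 348_436L;
        long[] counts = {163_404, 52_672, 24_272, 15_685, 10_437,
                         7_832, 5_962, 4_890, 3_886, 3_291};
        for (int n = 1; n <= counts.length; n++) {
            System.out.printf("n=%2d predicted=%.3f actual=%.3f%n",
                    n, predicted(n), actual(counts[n - 1], vocabulary));
        }
    }
}
```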

Vocabulary Growth

•  As corpus grows, so does vocabulary size
   – Fewer new words when corpus is already large

•  Observed relationship (Heaps’ Law):

   $v = k \cdot n^{\beta}$

   where $v$ is vocabulary size (number of unique words), $n$ is the number of words in corpus, and $k$, $\beta$ are parameters that vary for each corpus (typical values given are $10 \le k \le 100$ and $\beta \approx 0.5$)

wiki000 Subset Example

[Figure: vocabulary size plotted against words in collection, with fitted curve $v \approx 18.61 \cdot n^{0.5819}$]

Heaps’ Law Predictions

•  Predictions for TREC collections are accurate for large numbers of words
   – e.g., first 22,545,922 words of wiki000 scanned
   – prediction is 353,587 unique words
   – actual number is 348,436

•  Predictions for small numbers of words (i.e. < 1000) are much worse
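The prediction above can be checked by plugging the fitted parameters for the wiki000 subset (k = 18.61, β = 0.5819) into Heaps' Law. The sketch below assumes those rounded parameters, so its result should land near the slide's 353,587 rather than match it to the last digit; the class name is illustrative.

```java
// Sketch: Heaps' Law vocabulary prediction, v = k * n^beta.
class HeapsLaw {
    static double predictVocabulary(double k, double beta, long wordsInCorpus) {
        return k * Math.pow(wordsInCorpus, beta);
    }

    public static void main(String[] args) {
        long n = 22_545_922L;  // words scanned in the wiki000 subset
        double predicted = predictVocabulary(18.61, 0.5819, n);
        long actual = 348_436L;
        System.out.printf("predicted=%.0f actual=%d relative error=%.2f%%%n",
                predicted, actual, 100.0 * (predicted - actual) / actual);
    }
}
```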

Heaps’ Law Predictions

•  Heaps’ Law works with very large corpora
   – new words occurring even after seeing 30 million!

•  New words come from a variety of sources
   •  spelling errors, invented words (e.g. product, company names), code, other languages, email addresses, etc.

•  Search engines must deal with these large and growing vocabularies

Stopping

•  Function words (determiners, prepositions) have little meaning on their own

•  High occurrence frequencies
   – Top 6 words: the, of, and, in, to, a

•  Treated as stop words (i.e. removed)
   – reduce index space, improve response time, improve effectiveness

•  Can be important in combinations
   – e.g., “to be or not to be”

Stopping

•  Keep track of all very common words in a stop words list.

•  During text processing, ignore any word on the list.

•  Stopword list can be created from high-frequency words or based on a standard list

•  Lists are customized for applications, domains, and even parts of documents
   – e.g., “click” is a good stopword for anchor text
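Both ways of obtaining a stop list (taking the highest-frequency words, or using a fixed list) lead to the same filtering step. A minimal sketch, with hypothetical class and method names, assuming term counts are already available in a map:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: build a stop list from term counts and filter tokens against it.
class Stopper {
    // The k terms with the highest counts; ties are broken arbitrarily
    static Set<String> mostFrequent(Map<String, Integer> counts, int k) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    // Keep only the tokens that are not on the stop list, preserving order
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("the", "of", "and", "in", "to", "a");
        List<String> tokens = List.of("fish", "found", "in", "the", "tropics");
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

Passing the stop list in as a parameter makes it straightforward to use a different list for a different field, such as adding "click" when processing anchor text.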

Stopping

•  When storage space is not a concern, it can be better to not stop.
   – Queries are less restricted.
   – Remove stop words at query time unless user says to include them.

•  Google does not stop.
   – “to be or not to be” returns results.
   – +the returns results (over 14 billion).

End Result of Stopping

•  List of words minus those on the stop list.
   – tropical fish include fish found tropical environments around world including both freshwater salt water species fishkeepers often use term tropical fish refer only those requiring fresh water saltwater tropical fish referred marine fish

•  Next step: stemming.

Stemming

•  Many morphological variations of words
   – inflectional (plurals, tenses)
   – derivational (making verbs nouns etc.)

•  In most cases, these have the same or very similar meanings

•  Stemmers attempt to reduce morphological variations of words to a common stem
   – usually involves removing suffixes

•  Can be done at indexing time or as part of query processing (like stopwords)

Stemming

•  Generally a small but significant effectiveness improvement
   – can be crucial for some languages
   – e.g., 5-10% improvement for English, up to 50% in Arabic

Words with the Arabic root ktb

Stemming

•  Two basic types
   – Dictionary-based: uses lists of related words
   – Algorithmic: uses program to determine related words

•  Algorithmic stemmers
   – suffix-s: remove ‘s’ endings assuming plural
      •  e.g., cats → cat, lakes → lake
      •  Many false negatives: supplies → supplie
      •  Some false positives: ups → up
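The suffix-s stemmer is simple enough to write out in full, and doing so reproduces the error cases listed above. The class name is an illustrative assumption.

```java
// Sketch of the suffix-s stemmer: drop a single trailing 's'.
class SuffixSStemmer {
    static String stem(String word) {
        // Leave one-letter words alone so "s" does not become the empty string
        if (word.length() > 1 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"cats", "lakes", "supplies", "ups"}) {
            System.out.println(word + " -> " + stem(word));
        }
    }
}
```

"supplies" becomes "supplie", which does not match "supply", and "ups" becomes "up", which merges two unrelated words.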

Porter Stemmer

•  Algorithmic stemmer used in IR experiments since the 70s

•  Consists of a series of rules designed to strip the longest possible suffix at each step

•  Provably effective
•  Produces stems not words

•  Makes a number of errors and difficult to modify

Porter Stemmer

•  Example step (1 of 5)

Porter Stemmer

•  Porter2 stemmer addresses some of these issues

•  Approach has been used with other languages

Krovetz Stemmer

•  Hybrid algorithmic-dictionary
   – Word checked in dictionary
      •  If present, either left alone or replaced with “exception”
      •  If not present, word is checked for suffixes that could be removed
      •  After removal, dictionary is checked again

•  Produces words not stems
•  Comparable effectiveness
•  Lower false positive rate, somewhat higher false negative
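The dictionary-then-suffix control flow can be illustrated with a toy example. This is not the actual Krovetz stemmer: the suffix rules and the dictionary below are invented for illustration, and the "exception" replacement step is left out.

```java
import java.util.Set;

// Toy hybrid stemmer: dictionary lookup, then suffix removal, then lookup again.
class HybridStemmerSketch {
    // Hypothetical {suffix, replacement} rules, tried in order
    private static final String[][] SUFFIX_RULES = {
        {"ies", "y"}, {"es", ""}, {"s", ""}, {"ing", ""}, {"ed", ""}
    };

    private final Set<String> dictionary;

    HybridStemmerSketch(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    String stem(String word) {
        if (dictionary.contains(word)) {
            return word;  // already a known word: leave it alone
        }
        for (String[] rule : SUFFIX_RULES) {
            String suffix = rule[0];
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                String candidate =
                        word.substring(0, word.length() - suffix.length()) + rule[1];
                if (dictionary.contains(candidate)) {
                    return candidate;  // accept only results found in the dictionary
                }
            }
        }
        return word;  // no rule produced a dictionary word
    }

    public static void main(String[] args) {
        HybridStemmerSketch stemmer =
                new HybridStemmerSketch(Set.of("supply", "lake", "walk", "fish"));
        for (String word : new String[] {"supplies", "lakes", "walking", "fish"}) {
            System.out.println(word + " -> " + stemmer.stem(word));
        }
    }
}
```

Requiring every result to be in the dictionary is what makes the output real words, and it is also why unfamiliar variants go unstemmed, which fits the higher false negative rate noted above.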

Stemmer Comparison

End Result of Stemming

•  List of stemmed terms:
   – tropic fish include fish found tropic environ around world include both freshwat salt water speci fishkeep often use term tropic fish refer onli those requir fresh water saltwat tropic fish refer marin fish
   – (from Porter2 stemmer)

•  Next step: advanced processing, or indexing.

Martin Hall, 49, head of public policy and external affairs at the London Stock Exchange, is to leave at the end of June.

… The departure of Hall, who had been in the running to be head of corporate affairs at the BBC, appears to have been prompted by the decision of the new chief executive, Michael Lawrence, to split Hall’s job in two and take the public policy element under his own wing.

<person id=pe1>Martin Hall</person>, 49, <sense num=2>head</sense> of <ow1>public policy</ow1> and external affairs at the <corp id=co1>London Stock Exchange</corp>, is to <syn grp=1>leave</syn> at the end of June.

… The <syn grp=1>departure</syn> of <person id=pe1>Hall</person>, <ref to=pe1>who</ref> had been in the running to be head of corporate affairs at the <corp id=co2>BBC</corp>, appears to have been prompted by the decision of the new chief executive, <person id=pe2>Michael Lawrence</person>, to split <person id=pe1>Hall’s</person> job in two and take the public policy element under <ref to=pe1>his</ref> own wing.

Advanced Text Processing

•  Part-of-speech tagging.
•  Sense disambiguation.
•  Synonym classification.
•  Named entity tagging.
•  Phrase identification.
•  Referent resolution.
•  Sentence segmentation.
•  Translation.
•  Speech recognition.

Text Processing Errors

•  All text processing is errorful.
   – Design decisions produce segmentation errors, stopping errors, stemming errors.
   – False positives and false negatives.
   – More advanced methods → more difficult processing → more errors.

•  Does the benefit outweigh the cost?
   – Segmentation & stemming: definitely.
   – POS tagging, NE tagging: depends on domain.
   – Synonym classes: maybe not.

End Result of Text Processing

<title>Tropical fish</title>
<text>{{Unreferenced|date=July 2008}} {{Original research|date=July 2008}} '''Tropical fish''' include [[fish]] found in [[Tropics|tropical]] environments around the world, including both [[fresh water|freshwater]] and [[sea water|salt water]] species. [[Fishkeeping|Fishkeepers]] often use the term ''tropical fish'' to refer only those requiring fresh water, with saltwater tropical fish referred to as ''[[list of marine aquarium fish species|marine fish]]''.

•  Metadata:
   – Title: Tropical fish

•  Important fields:
   – Links: fish tropic freshwat salt water fishkeep marin fish

•  Body:
   – tropic fish include fish found tropic environ around world include both freshwat salt water speci fishkeep often use term tropic fish refer onli those requir fresh water saltwat tropic fish refer marin fish

Course Project

•  Phase I, worksheet 1.
   – Write a text processing module.
   – Parse Wikipedia pages, tokenize, stop, and stem.
   – Answer questions about Wikipedia data: how big is vocabulary, how many word occurrences are there, etc.

•  Due next Wednesday.
   – Please start ASAP!

Expectations

•  Read Wikipedia pages off disk.
•  Identify parts of them that do not need to be indexed.
•  Convert the rest into a list of words.
•  Drop stop words, stem remaining words to terms.
•  Keep track of the number of times each term appears, how many documents it appears in.

Pseudo-Java

import java.io.*;
import java.util.*;

…

// Maps each term to the number of times it has been seen so far.
// (Java generics need Integer here, not the primitive int.)
Map<String, Integer> termCounts = new HashMap<>();

File doc = new File(filename);
Scanner docScanner = new Scanner(doc);  // may throw FileNotFoundException
while (docScanner.hasNextLine()) {
    List<String> terms = processLine(docScanner.nextLine());
    for (String currentTerm : terms) {
        // getOrDefault returns 0 for a term that has not been seen before
        int termCount = termCounts.getOrDefault(currentTerm, 0);
        termCounts.put(currentTerm, termCount + 1);
    }
}

docScanner.close();

public List<String> processLine(String line) {
    List<String> terms = new ArrayList<>();

    Scanner lineScanner = new Scanner(line);
    // Split on runs of whitespace ("\\s*" can match the empty string)
    lineScanner.useDelimiter("\\s+");
    while (lineScanner.hasNext()) {
        String word = lineScanner.next();

        /* check if word is appropriate for indexing or if it marks
           the start of a block to ignore */
        if (word.indexOf("{{") >= 0) {
            /* ignore words until closing the block with a }} */
            …
        }
        … /* other conditions */

        /* strip non-alphanumeric characters and lower-case */
        word = word.replaceAll("[^a-zA-Z0-9]", "");
        word = word.toLowerCase();

        /* skip words that are now empty; check if word is in the stop list */
        if (!word.isEmpty() && !isStopWord(word)) {
            word = stemmer.stem(word);  /* stem word */
            terms.add(word);            /* append; List has no set-at-end */
        }
    }
    lineScanner.close();
    return terms;
}
