What? Investigating what a corpus is about Max Kemman University of Luxembourg October 25, 2015 Doing Digital History: Introduction to Tools and Technology


Page 1: What? Investigating what a corpus is about - Max Kemman. What.pdf · 1. What - Investigating what a corpus is about 2. Where - Investigating the spatial entities in a corpus ... ,

What? Investigating what a corpus is about

Max Kemman, University of Luxembourg

October 25, 2015

Doing Digital History: Introduction to Tools and Technology

Recap from last time

What is distant reading?

What is an n-gram?

What do the Y-axis and X-axis show?

Recap - Assignment

How did the assignment go?

What did you think of the tools used?

Could this be useful for your research?

One more thing on HTML: special characters

http://www.ascii.cl/htmlcodes.htm

Find the symbol and the HTML number

é & ü -> � & �

&#233; & &#252; -> é & ü

In your HTML, write longue dur&#233;e to display longue durée
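
Looking up every HTML number by hand gets tedious, so the conversion can also be scripted. The following is a minimal Python sketch, not part of the original slides; the function name to_html_numbers is made up for illustration.

```python
import html

def to_html_numbers(text: str) -> str:
    """Replace every non-ASCII character with its numeric HTML reference."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

encoded = to_html_numbers("longue durée")
print(encoded)                 # longue dur&#233;e
print(html.unescape(encoded))  # longue durée
```

The second print uses the standard library's html.unescape to check that a browser-style decoding gives back the original text.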

One more thing: what is an algorithm?

A set of rules to follow to solve a problem

Pretty much like a cooking recipe

a = 0
while (a < 10) {
  a = a + 1
}

Today

• The W's of research

• What a corpus is about

• The entities in a corpus

• Another look at our emails

• Voyant Tools

• Next time

• Assignment

The W's of research

Thus far:

1. Abundance of sources

2. Writing for the Web

3. Digitisation and Digital Libraries

4. Big Data

5. Distant Reading

Now: we have a digital corpus, what to do with it?

Research the corpus

Now come the W's of research:

1. What - Investigating what a corpus is about

2. Where - Investigating the spatial entities in a corpus

3. When - Investigating the temporal entities in a corpus

4. Who - Investigating the social entities in a corpus

What?

The first W of interest: what is this corpus actually about?

Different methods are possible

• Find a description of the corpus to read

• Select a sample of documents to read

• Visualize the used words

What a corpus is about

What is this conference about?

Word clouds

Advantages of word clouds

• Very easy to create

• Visually pleasing

• Gives a quick overview

What does a word cloud do?

Put very simply, a word cloud does the following:

1. Count the number of occurrences per word

2. Size each word by its frequency

3. Lay out the words to form a shape

4. Optional: colorize words for distinguishing and better readability
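
Steps 1 and 2 above can be sketched in a few lines of Python. This is an illustration only and not how any particular word cloud tool is implemented: the linear scaling and the point sizes (min_pt, max_pt) are assumptions, and the layout and colouring steps are left out.

```python
from collections import Counter

def word_sizes(words, min_pt=10, max_pt=48):
    """Map each word to a font size scaled linearly by its frequency."""
    counts = Counter(words)
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid division by zero when all counts are equal
    return {
        word: min_pt + (count - lo) * (max_pt - min_pt) / span
        for word, count in counts.items()
    }

words = "history digital history archive digital history".split()
for word, size in sorted(word_sizes(words).items(), key=lambda kv: -kv[1]):
    print(f"{word}: {size:.0f}pt")
```

The most frequent word gets the largest size and the rarest the smallest; everything in between is placed proportionally.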

Layout

Unlike the Ngram viewer: no X or Y axes

The position of each word is meaningless

The meaning is in the size of the words

Counting

Word clouds visualize the frequency of words

But how to count words that vary in spelling?

• E.g. "Digital" and "digital" and "digitally", "digitize" and "digitization"

Normalization:

• Lowercase

• Tokenize

• Stemming or lemmatizing

• Stopwords

Lowercase

We were on vacation in France in August 2015

we were on vacation in france in august 2015

Tokenize

we were on vacation, in france, in august 2015

we|were|on|vacation|in|france|in|august|2015
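
Lowercasing and tokenizing can be combined in one small function. A minimal Python sketch, assuming that a token is simply a run of letters or digits (real tokenizers make finer decisions about hyphens, apostrophes and abbreviations):

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and return its runs of letters and digits."""
    return re.findall(r"\w+", text.lower())

tokens = tokenize("We were on vacation, in France, in August 2015")
print("|".join(tokens))  # we|were|on|vacation|in|france|in|august|2015
```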

Stemming or lemmatizing

digitized|digital|digitization|digitizing

Stemming: digit

Lemmatizing: digitiz|digital

Could be very useful, especially with Latin texts
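
To make the stemming line concrete, here is a toy suffix stripper in Python. It is only a sketch: the suffix list is hypothetical and chosen to fit this one example, whereas real stemmers (such as the Porter stemmer) use much larger rule sets, and lemmatizers rely on dictionaries.

```python
SUFFIXES = ("ization", "izing", "ized", "al")  # hypothetical list, longest first

def toy_stem(word: str) -> str:
    """Strip the first matching suffix; a toy rule set, not a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["digitized", "digital", "digitization", "digitizing"]
print({w: toy_stem(w) for w in words})  # every word maps to "digit"
```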

Stopwords

Most common words in the language: and, or, the

Sometimes: remove numbers

Not of interest (usually)

we|were|on|vacation|in|france|in|august|2015

we|were|vacation|france|august|
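
The filtering shown above can be sketched in Python as follows. The stop word list here is a tiny hypothetical one, picked so that the output matches the slide; real lists contain hundreds of words per language.

```python
STOPWORDS = {"on", "in", "and", "or", "the"}  # hypothetical list for illustration

def remove_stopwords(tokens, drop_numbers=True):
    """Drop stop words and, optionally, tokens made up only of digits."""
    return [
        t for t in tokens
        if t not in STOPWORDS and not (drop_numbers and t.isdigit())
    ]

tokens = "we|were|on|vacation|in|france|in|august|2015".split("|")
print("|".join(remove_stopwords(tokens)))  # we|were|vacation|france|august
```

Whether numbers should go depends on the research question: for a "When?" analysis the year is exactly what one would want to keep.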

What are these grants about? (normalized)

Comparing between different parts of the corpus

Sources separated by their citation behaviour

Representing a model of the text

What if we do not know how to separate sources?

Or if we want to know what other words are related to our keywords?

Topic modelling

Documents and words can be directly observed, but topics are latent

How to represent the topics in a corpus?

• Statistics to find topics represented by groups of words

• Document is a mix of topics

• Topic is a mix of words

(Slides on topic modelling from Pim Huijnen and Marijn Koolen)

Topic modelling

Assumption: two documents with the same topics will have overlap in words

For a given corpus, the modelling process does:

1. Create word probability distribution for topics

2. Create topic probability distribution for documents
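
How the two distributions fit together can be shown with a small worked example. The Python sketch below uses made-up numbers for two topics and one document; it only illustrates the "document is a mix of topics, topic is a mix of words" idea and does not fit a model to a corpus.

```python
# Hypothetical word distributions for two topics (each sums to 1).
topic_words = {
    "war":     {"army": 0.5, "treaty": 0.3, "trade": 0.2},
    "economy": {"army": 0.1, "treaty": 0.2, "trade": 0.7},
}
# Hypothetical topic distribution for one document (sums to 1).
doc_topics = {"war": 0.75, "economy": 0.25}

def word_probability(word: str) -> float:
    """P(word | document) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(
        weight * topic_words[topic].get(word, 0.0)
        for topic, weight in doc_topics.items()
    )

for word in ("army", "treaty", "trade"):
    print(word, round(word_probability(word), 3))
```

A topic model works in the opposite direction: it starts from the observed words per document and estimates both tables. Latent Dirichlet allocation is one common method for this; the slides do not name the method used, so that is mentioned only as an example.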

Topic modelling

In short: a corpus is represented by statistical topics

This allows us to:

• Separate sources by topics

• Find related keywords

Comparing different parts of the corpus

Mendeley Research Maps

Comparing the topical similarity

Assigned documents to disciplines to map disciplines by topics

Which form of machine learning would this be?

What is the corpus about?

We can now represent the words or the topics of a corpus

But, remember: World War I ≠ "World War I"

The entities in a corpus

Thus far we know the frequencies of all the words

But what are we interested in?

What do we need for the other W's?

• Where - places

• When - dates

• Who - people

People in the corpus

Ter Braake & Fokkens - Fairly easy to discover famous people (with biographical dictionaries and Ngram viewers)

Ngrams help top-down: when you know who to search for

But how to discover who did not become famous, while prominent in their own time?

Need to find all people bottom-up by identifying all the names

Bottom-up process

Ter Braake & Fokkens

1. Identify all names in the corpus

2. Give all names an identifier

3. Disambiguate names referring to the same person

4. Compare results with a non-digital corpus

5. Visualize the results

6. Interpret!

Identifying names

Combinations of words that start with a capital

This won't work for German, which capitalizes all nouns

Their algorithm allows for two sequential lowercase words: Johan van der Capellen

Note: built for recall, not precision
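
The rule described above can be approximated with a regular expression. This Python sketch is an assumption about what such a rule might look like, not Ter Braake & Fokkens' actual algorithm; it only handles unaccented letters, and the name candidate_names is invented here.

```python
import re

# One capitalised word, then one or more further capitalised words,
# each optionally preceded by up to two lowercase words ("van der").
NAME = re.compile(r"[A-Z][a-z]+(?:(?:\s[a-z]+){0,2}\s[A-Z][a-z]+)+")

def candidate_names(text: str) -> list[str]:
    """Return capitalised word sequences that might be personal names."""
    return NAME.findall(text)

text = "Yesterday evening the regent Johan van der Capellen spoke at length."
print(candidate_names(text))  # ['Johan van der Capellen']
```

The pattern is deliberately greedy, which shows why such a rule favours recall over precision: in a sentence like "Johan van der Capellen wrote to Pieter Paulus." the two lowercase words "wrote to" are accepted as name particles, and both persons are merged into a single candidate.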

Recall & Precision

Recall: retrieve all relevant entities

Precision: do not retrieve irrelevant entities

For algorithms, there is usually a choice of which one to optimize

Recalling people referred to by a single name (Erasmus, Rembrandt) would lead to too much noise = lower precision
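
Both measures are simple ratios and can be computed from two sets: what the algorithm retrieved and what is actually relevant. A short Python sketch with hypothetical example data:

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision = correct / retrieved; recall = correct / relevant."""
    correct = len(retrieved & relevant)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical output of a name finder versus a hand-made list of real persons.
retrieved = {"Johan van der Capellen", "Rembrandt", "Holland", "Amsterdam"}
relevant = {"Johan van der Capellen", "Rembrandt", "Erasmus"}
print(precision_recall(retrieved, relevant))  # precision 0.5, recall about 0.67
```

Here two of the four retrieved items are persons (precision 2/4), and two of the three persons were found (recall 2/3).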

Difficulties

Spelling of names (especially before the 19th century)

People with the same name

Nicknames and changing names

People with the same title

Context matters!

Named Entity Recognition

We want to identify the entities

We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought ice cream, which cost €2.

Named Entities

Or we want to see:

We were on vacation in France in August 2015. I went to shop at the Intermarche. The area around Apt is really nice. Max also bought ice cream, which cost €2.

• People: Max

• Places: France, Apt

• Organizations: Intermarche

• Dates: August 2015

• Currencies: €2
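
To get a feeling for what a recogniser has to produce, the example can be reproduced with a toy tagger based on lookup lists and two patterns. This Python sketch is an assumption made for illustration: the lists are hypothetical and hand-filled, and real NER tools use statistical models trained on annotated text rather than fixed lists.

```python
import re

# Hypothetical lookup lists; a real recogniser learns from annotated text.
GAZETTEER = {
    "People": ["Max"],
    "Places": ["France", "Apt"],
    "Organizations": ["Intermarche"],
}
MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"
PATTERNS = {
    "Dates": rf"\b(?:{MONTHS}) \d{{4}}\b",
    "Currencies": r"€\d+(?:\.\d+)?",
}

def tag_entities(text: str) -> dict[str, list[str]]:
    """Find entities by whole-word list lookup and two simple patterns."""
    found = {
        label: [n for n in names if re.search(rf"\b{re.escape(n)}\b", text)]
        for label, names in GAZETTEER.items()
    }
    for label, pattern in PATTERNS.items():
        found[label] = re.findall(pattern, text)
    return found

text = ("We were on vacation in France in August 2015. I went to shop at the "
        "Intermarche. The area around Apt is really nice. Max also bought ice "
        "cream, which cost €2.")
print(tag_entities(text))
```

A list-based approach like this only finds what is already known, which is exactly the top-down limitation discussed earlier for Ngram viewers.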

Another look at our emails

For all 30k emails, we performed text normalisation and named entity recognition

Let's take a look at https://www.wikileaks.org/clinton-emails/emailid/8

Exercise 1: try to normalise the text

Exercise 2: try to discover the named entities: People, Places, Organisations

Normalised

See Email8-normalised.txt in Moodle under "Emails"

unclassife,us,department,state,case,f--,doc,date,release,full,hrod,clintonemailcom,sent,friday,july,pm,sullivanjj,stategov,subject,re,pakistan,bomb,ok,go,original,message,sullivan,jacob,sullivanjj,stategov,sent,fri,jul,subject,pakistan,bomb,fyi,put,follow,statement,statement,secretary,clinton,bomb,shrine,sy,ali,hujviri,lahore,shock,sadden,yesterday,attack,one,pakistan,popular,place,worship,shrine,sy,ali,hujviri,data,ganjbakhsh,lahore,claime,live,many,innocent,pakistane,extremist,shown,respect,neither,human,dignity,fundamental,religious,value,pakistani,society,violact,sanctity,rever,shrine,particularly,sinister,attempt,destabilize,pakistan,intimidate,people,attacker,will,succeed,pakistani,public,refuse,cow,violence,condemn,brutal,crime,reaffirm,commitment,support,pakistani,people,effort,defend,democracy,violent,

extremist,seek,destroy,thought,prayer,family,victim,people,pakistan

Named Entities

Try to do it by hand

NER tool: http://nlp.stanford.edu:8080/ner/

People: Sullivan, Jacob, CLINTON, Ali Hujviri

Places: Pakistan, Pakistan, Lahore, Pakistan, Lahore, Pakistan, Pakistan

Organisations: U.S. Department of State Case No, Shrine of Syed Ali Hujviri

Visualise the email

Go to http://tagcrowd.com/

Compare with and without stopwords

Compare normal and normalised text

What?

So, what's the email about? Do we get different perspectives?

Voyant Tools

Go to www.voyant-tools.org/

Use Mozilla Firefox, it doesn't work in Chrome (that's what went wrong during lecture)

From Moodle: download the files for emails 6000-6019: f6-20-raw.txt and f6-20-normalised.txt

You can paste in text, or upload the file

Continue by hitting Reveal

Saving the Voyant session

It might be a good idea to copy the URL early on, as this will allow you to refresh the page if the tool crashes, or to open the tool again later on using the data and stopwords you already had

Share the session: hover the mouse over the top blue bar and click the third icon in the top right (see image). You can then choose to share the URL: this will open a new browser window from which you can copy the address

Voyant windows

Look at all the windows in Voyant and see if you understand them

1. Cirrus (word cloud)

2. Reader

3. Summary

4. Trends

5. Contexts

Voyant Word Clouds

• In the Cirrus, hover the mouse over the title bar and click the 3rd icon

• Select the stopword list you need

• Or Edit List to add more words: 1 word per line, click Save

• Check "apply globally" to activate in all windows

• Use the word cloud to detect common words we're not interested in: unclassified, department, subject, etc.

• Hit Confirm

• When editing again, the stopwords are ordered alphabetically, so you might not see them at the end anymore

Voyant Summary

What is the longest email?

What are distinctive words?

Distinctive words calculated by TF-IDF: what was that again?

Update: the distinctive words feature doesn't work now that we combined all the emails in a single text file
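
As a reminder of what TF-IDF does, here is a small Python sketch of one common variant (term frequency times the logarithm of the inverse document frequency). Voyant's exact formula is not given in these slides, so this is an assumption for illustration, and the two miniature "emails" are made up.

```python
import math
from collections import Counter

def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Score each word per document as term frequency * log(N / document frequency)."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

emails = [
    ["pakistan", "statement", "subject"],
    ["haiti", "statement", "subject"],
]
print(tf_idf(emails)[0])                   # only "pakistan" scores above zero
print(tf_idf([emails[0] + emails[1]])[0])  # one combined text: every score is 0.0
```

The second call suggests one plausible reason for the update above: with a single combined file there is only one document, every word occurs in "all" documents, and no word can stand out as distinctive.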

Searching specific words

In the Cirrus window, you can click Terms in the top bar to get the list of words ordered by count

You can see immediately per word how it develops over time in the emails

From this list you can select a word by checking the box to the left of it

Alternatively, you can search for words per window. For example, in the Contexts window (lower-right), at the bottom is a search box where you can search for words

Interpreting with Voyant

What are the biggest words?

How do they develop throughout the emails?

Does this tell you what the emails are about and how the discussion develops?

If not: what is different?

Sharing the Voyant

You can either

• Take screenshots of what you want to show

• Share the session: hover the mouse over the top blue bar and click the third icon in the top right (see image). You can then choose to share the URL: this will open a new browser window from which you can copy the address

• The HTML snippet will give HTML code that you can embed in your report.

• Share specific windows: for example, in the top bar of Trends, click the first icon (see image), and select to export a URL, an HTML snippet for embedding, or a PNG for including in your report

Next time

1 November: No class

8 November

When? Temporal entities and timelines

Assignment

Perform Voyant analysis of HC emails

Compare (see next slide for all the available files):

• f6-100-raw.txt vs f6-100-normalised.txt to see how text normalisation gives a different perspective

• For further comparisons, choose either the raw or the normalised text:

• f6-1000-*.txt vs f7-1000-*.txt to see how the emails are different

• If Voyant or your computer has difficulty with 1000 emails, compare f6-100-*.txt vs f7-100-*.txt

Do comparisons in separate Voyant windows

Download files from Moodle:

Emails Raw Normalised

6000-6099 f6-100-raw.txt f6-100-normalised.txt

7000-7099 f7-100-raw.txt f7-100-normalised.txt

6000-6999 f6-1000-raw.txt f6-1000-normalised.txt

7000-7999 f7-1000-raw.txt f7-1000-normalised.txt

Assignment

Work in groups of two or three

Use the tools discussed today to try and find something you find interesting. Document your steps and choices and discuss why a finding is of interest, and whether you can be certain of this finding.

Hand in the assignment in HTML, include your name and a decent profile photo

500-1000 words, in English

Possible questions you might ask of your corpora

• What are these emails about?

• Do we need to further clean the data?

• How are these corpora different?

• Does text normalisation lead to different results?

Grading

Do note: the finding itself is not the most important part

[email protected]

• 1 pt for free

• 3 pts for HTML

• 3 pts for documentation of your process

• 3 pts for critical reflection on your finding