HG2052: Language Technology and the Internet
Review and Conclusions
Francis Bond, Division of Linguistics and Multilingual Studies
http://www3.ntu.edu.sg/home/fcbond/ bond@ieee.org
Lecture 12
HG2052 (2020)
Introduction
Review and Conclusions 1
What have we learned?

- How technology affects our use of language
- How language is used on the internet
  - Some interesting things we can now do that we couldn't before
- Collaboration on the Web
- The web as a source of linguistic data: Direct Query and Sample
- The Semantic Web: meaning for non-humans
- Citation and reputation
Reflection

- What was the most surprising thing in this class?
- What do you think is most likely wrong?
- What do you think is the coolest result?
- What do you think you're most likely to remember?
- How do you think this course will influence you as a linguist?
- What (if anything) did you hope to learn that you didn't?
Goals

- Gain an understanding of how technology affects language use
- Develop familiarity with markup and meta-information in texts
- Get a feel for what research is all about, especially relating to web mining and online frequency counting

Upon successful completion, students will:

- have an understanding of how technology shapes language use
- be able to test linguistic hypotheses against web data
- know how to edit Wikipedia
Themes

- Language and Technology
  - Writing and Speech Technology
- Language and the Internet
  - Email, Chat, Virtual Worlds, WWW, IM, Blogs, Facebook, Wikis, Twitter
- The Web as Corpus
- The Web beyond Language
  - Semantic Web and Networks
Revision
Language and Technology

- The two things that separate humans from animals (Sproat 2010, §1.1):
  - Language
    * large vocabulary (10,000+)
    * complicated syntax (no upper length: recursion, embedding)
  - Technology
    * widespread tool use
    * widespread tool manufacture
- Speech: the start of language
- Writing: the first great intersection
Language and the Internet

- New forms of communication
  - Neither speech nor text
  - Massively interactive
- Extremely rapid change
- A first-hand narrative (I was online before the internet :-)
  - but I am probably behind you all now
New forms

- Email (from PC, phone, other)
- Chat, Usenet
- Virtual Worlds
- WWW
- Blogs (overlap)
- Facebook, LinkedIn
- Wikis
- Twitter
Representing Language

- Writing Systems
- Encodings
- Speech
- Bandwidth
What is represented?

- Phonemes: maɪ dɔg laɪks ævəkɑdoz (~45)
  - Not a simple correspondence between a writing system and the sounds
  - Some logographs (3, $)
  - Even when a new alphabet is designed, pronunciation changes
- Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)
- Morphemes: maɪ (me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)
- Words: my dog likes avocados (200,000+)
- Concepts: speaker-poss canine fond avocado+PL (400,000+)
Three Major Writing Systems

- Alphabetic (Latin)
  - one symbol for each consonant or vowel
  - typically 20-30 base symbols (1 byte)
- Syllabic (Hiragana)
  - one symbol for each syllable (consonant+vowel)
  - typically 50-100 base symbols (1-2 bytes)
- Logographic (Hanzi)
  - pictographs, ideographs, sound-meaning combinations
  - typically 100,000+ symbols (2-3 bytes)
Computational Encoding

- Need to map characters to bits
- More characters require more space
  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits); Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)
- Moving towards Unicode for everything
- If you get the encoding wrong, it is gibberish
  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns
    * But an encoding can be shared by many languages
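The gibberish problem above is easy to demonstrate: the same character becomes different byte sequences under different encodings, and decoding with the wrong one either garbles the text or fails outright. A minimal sketch in Python:

```python
# Encoding the same string two ways, then decoding with the wrong encoding.
text = "café"

utf8_bytes = text.encode("utf-8")         # b'caf\xc3\xa9' (é takes 2 bytes)
latin1_bytes = text.encode("iso-8859-1")  # b'caf\xe9'     (é takes 1 byte)

# Wrong decoding yields mojibake:
print(utf8_bytes.decode("iso-8859-1"))  # cafÃ©

# ...or an outright error, since b'\xe9' alone is not valid UTF-8:
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("not valid UTF-8:", e.reason)
```

This is also why statistical encoding detection works: byte patterns like `\xc3\xa9` are characteristic of UTF-8, while a lone `\xe9` suggests a single-byte encoding.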
Speech and Language Technology

- The need for speech representation
- Storing sound
- Transforming Speech
  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): texts to sounds
- Speech technology: the Telephone
How good are the systems?

Task                     Vocab    WER (%)   WER (%) adapted
Digits                   11       0.4       0.2
Dialogue (travel)        21,000   10.9      -
Dictation (WSJ)          5,000    3.9       3.0
Dictation (WSJ)          20,000   10.0      8.6
Dialogue (noisy, army)   3,000    42.2      31.0
Phone Conversations      4,000    41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)
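Word error rate (WER), the measure in the table above, is the edit distance between the recognizer's output and a reference transcript, divided by the reference length. A minimal sketch (the example sentence is made up):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / |reference|,
    computed with the standard edit-distance dynamic programme over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("my dog likes avocados", "my dog licks avocado"))  # 0.5
```

Note WER can exceed 100% when the hypothesis contains many insertions, which is why noisy tasks in the table approach such high numbers.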
Why is it so difficult?

- Pronunciation depends on context
  - The same word will be pronounced differently in different sentences
- Speaker variability
  - Gender
  - Dialect / Foreign Accent
  - Individual differences: physical differences, language differences (idiolect)
- Many, many rare events
  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database
Two steps in a TTS system

1. Linguistic Analysis
   - Sentence Segmentation
   - Abbreviations: Dr Smith lives on Nanyang Dr. He is …
   - Word Segmentation
     * 森山前日銀総裁 Moriyama zen Nichigin sousai
     * ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai
2. Speech Synthesis
   - Find the pronunciation
   - Generate sounds
   - Add intonation
Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation
- Formant Synthesis: bypass modeling of human articulation and model acoustics directly
- Concatenative Synthesis: synthesize from stored units of actual speech
Prosody of Emotion

- Excitement: fast, very high pitch, loud
- Hot anger: fast, high pitch, strong falling accent, loud
- Fear: jitter
- Sarcasm: prolonged accent, late peak
- Sad: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality' but natural-sounding prosody (intonation and duration)." (Richard Sproat)
Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading >> Speaking, Hearing >> Typing
⇒ Speech for input
⇒ Text for output
The Telephone

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
New Modalities
Email, Usenet, Chat and Blogs

- All share some characteristics of speech and text
- Usage norms not fixed
- Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …
- Large-scale discourse studies still to be done
- Some genuinely new things
  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship
Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Usenet (asynchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Chat (synchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Blogs

Speech-like                      Text-like
time-bound                       space-bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Netspeak

- Inspired by shared background
  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits
- Need to be in-group

cow orker: Coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)
- Gricean Principles
  - Posting (top/middle/bottom quoting and trimming)
- Inspired by medium limitations
  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)
Collaboration and Wikis

- Version Control Systems
- Wikipedia
- Licensing and Ownership
Version Control Systems

- Versioning file systems
  - every time a file is opened, a new copy is stored
- CVS, Subversion, Git
  - changes to a collection of files are tracked
  - simultaneous changes are merged
- Revision Tracking
  - revisions are stored within a file
- Authorship in shared writing: explicit responsibility for changes
Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
- Wikipedia makes it easy to share your knowledge, and people like to do this
- Most edits are done by insiders
- Most content is added by outsiders
- Content comparable to Britannica
The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
   - Content policies: NPOV, Verifiability and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules

Wikipedia:Five pillars
Licenses and Ownership

- Copyright
- Copyleft
- Creative Commons
What is a good article?

1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images

Wikipedia:Good article criteria
The World Wide Web and HTML
The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging)
- The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWIM
- The structure of the Web: hypertext, pages linked to pages
- The future of the Web
- Linguistic features of the web: un-edited, large volume, editable, multi-media
Map of online communities
http://xkcd.com/802
The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form
Unlinked content: pages which are not linked to by other pages (no backlinks or inlinks)
Private Web: sites that require registration and login (e.g. Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses, or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Non-HTML/text content: textual content encoded in multimedia (image or video) files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find.
Visual Markup vs Logical Markup

- Visual Markup (Presentational)
  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like
- Logical Markup (Structural)
  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)
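The contrast above can be seen in two HTML renderings of the same sentence (a made-up illustration; the class name is arbitrary):

```html
<!-- Visual (presentational) markup: says only what it looks like -->
<p><font size="5"><b>Warning</b></font> Do not <i>ever</i> do this.</p>

<!-- Logical (structural) markup: says what the text *is*;
     a stylesheet decides how a "warning" should look -->
<p><strong class="warning">Warning</strong> Do not <em>ever</em> do this.</p>
```

The logical version can be restyled, read aloud sensibly by a screen reader, or mined for all "warning" passages; the visual version only records one appearance.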
The Web as Corpus
Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionalities that linguists and lexicographers would like to have
- Web Sample: use a search engine to download data from the net and build a corpus from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data
Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)
- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers
Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  - it is document frequency, not term frequency
- whit or "web hit" is the more general term
- Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
- GPB, "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
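The arithmetic behind these units is simple; a sketch using the slide's own counts (the index size below is a made-up placeholder, since Google does not publish it):

```python
# Comparing two constructions by ratio of hit counts rather than raw counts,
# using the figures from the slide (accessed 2012-04-04).
hits = {
    "different from": 251_000_000,
    "different than": 71_500_000,
}

ratio = hits["different from"] / hits["different than"]
print(f"ratio: {ratio:.1f}:1")  # ratio: 3.5:1

# If the index size were known, counts could be normalised to
# "Ghits per billion documents" (GPB) for a more stable figure:
index_size = 30_000_000_000  # hypothetical -- Google no longer releases this
for phrase, n in hits.items():
    print(f"{phrase}: {n / index_size * 1e9:,.0f} GPB")
```

The ratio survives index growth; the raw ghit counts do not, which is why the slide insists on ratios and access dates.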
Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  - do this for several languages and make the corpora available
Building Internet Corpora: Outline

1. Select seed words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html
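Steps 2 and 3 above can be sketched in a few lines. This is a toy illustration, not Sharoff's actual pipeline: the seed words are invented, and `search` stands in for a real search-engine API call.

```python
import random

def build_query_set(seed_words, words_per_query=4, n_queries=10):
    """Step 2: combine seed words into distinct conjunctive queries."""
    queries = set()
    while len(queries) < n_queries:
        queries.add(" ".join(sorted(random.sample(seed_words, words_per_query))))
    return sorted(queries)

def collect_urls(queries, search):
    """Step 3: `search` is a stand-in for a search-engine API returning
    a list of URLs for a query; collect the union of all results."""
    urls = set()
    for q in queries:
        urls.update(search(q))
    return sorted(urls)

# Tiny demonstration with a fake, deterministic search function:
random.seed(0)
seeds = ["rain", "language", "market", "island", "music", "garden"]
queries = build_query_set(seeds, words_per_query=2, n_queries=5)
fake_search = lambda q: ["http://example.com/" + q.replace(" ", "_")]
print(collect_urls(queries, fake_search))
```

Steps 4 and 5 (download, encoding cleanup, tagging, parsing, filtering too-short and too-long documents) are where most of the real engineering effort goes.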
Internet Corpora: Summary

- The web can be used as a corpus
  - Direct access
    * fast and convenient
    * huge amounts of data
    ⊗ unreliable counts
  - Web sample
    * control over the sample
    * some setup costs (semi-automated)
    ⊗ less data
- Richer data than a compiled corpus
  ⊗ less balanced, less markup
Text and Meta-text

- Explicit Meta-data
  - Keywords and Categories
  - Rankings
  - Structural Markup
- Implicit Meta-data
  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations
Explicit Metadata

- You can get information from metadata within documents
  - When they are accurate, they are very good
  - They are often deceitful
- HTML, PDF, Word, … metadata
- Keywords and Tags
- Rankings
- Links and Citations
- Structural Markup
Implicit Metadata

- You can get clues from metadata within documents
  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful
- HTML tags as constituent boundaries
- Tables as semantic relations
- File names (content type and language)
- Translations: bracketed glosses, cross-lingual disambiguation
- Query data
- Wikipedia redirections and cross-wiki links
Language Identification and Normalization
Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words
  - Similarity-based categorisation and classification
    * Character n-gram rankings
  - Machine-learning based methods
  - Context (under .jp or .kr)
- Hard to do for: short text snippets, similar languages, mixed text
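The character n-gram idea above can be sketched in a few lines. This toy version compares raw n-gram overlap rather than the rank-order statistic of real systems (such as Cavnar & Trenkle's), and the two training sentences are invented:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram profile of a text, as a Counter."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def guess_language(text, profiles):
    """Pick the training profile whose n-grams overlap most with the text."""
    target = char_ngrams(text)
    def overlap(profile):
        return sum(min(count, profile[gram]) for gram, count in target.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# Toy training data -- real systems train on much larger samples.
profiles = {
    "en": char_ngrams("the cat sat on the mat with the other cats"),
    "de": char_ngrams("die katze sitzt mit den anderen katzen auf der matte"),
}
print(guess_language("the dog sat on the chair", profiles))  # en
```

The same approach struggles exactly where the slide says it does: a few words of text give too few n-grams, and closely related languages share most of their profiles.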
Normalization

- Extracting text from various documents
- Segmenting continuous text
- Number normalization: $700K, $700,000, 0.7 million dollars, …
- Date normalization: 2000 AD, 1421 AH, Heisei 12, …
- Stripping stop words (the, a, of, …)
- Lemmatization: produces → produce
- Stemming: producer → produc, produces → produc
- Decompounding: zonnecel → zon cel
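A toy normalizer in the spirit of the steps above: lowercase, expand one number pattern, strip stop words. Real systems use far richer rule sets; the stop-word list here is a tiny sample.

```python
import re

STOP = {"the", "a", "an", "of"}

def normalize(text):
    text = text.lower()
    # Number normalization, one pattern only: "$700k" -> "$700000"
    text = re.sub(r"\$(\d+)k\b", lambda m: f"${int(m.group(1)) * 1000}", text)
    # Tokenize (keeping a leading $) and strip stop words
    return [t for t in re.findall(r"\$?\w+", text) if t not in STOP]

print(normalize("The price of the house is $700K"))
# ['price', 'house', 'is', '$700000']
```

Once normalized, "$700K" and "$700,000" can be counted as the same event, which matters for the frequency-based methods elsewhere in the course.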
The Semantic Web
Goals of the Semantic Web

- Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions
- Increase the utility of information by connecting it to definitions and context
- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Semantic Web Architecture
XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically
  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation
- Can be validated for syntactic well-formedness
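Well-formedness checking is built into standard tooling: a well-formed document parses, and mismatched tags raise an error. A minimal sketch with Python's standard library (the document content is made up):

```python
import xml.etree.ElementTree as ET

good = "<course><code>HG2052</code><topic>the Internet</topic></course>"
bad = "<course><code>HG2052</topic></course>"  # mismatched tags

# A well-formed document parses, and we can navigate its structure:
print(ET.fromstring(good).find("code").text)  # HG2052

# An ill-formed one is rejected outright:
try:
    ET.fromstring(bad)
except ET.ParseError as e:
    print("not well-formed:", e)
```

Note that well-formedness (matching tags, proper nesting) is weaker than validity against a schema or DTD, which additionally constrains which tags may appear where.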
Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers (URIs)
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
- Identify relations with the Resource Description Framework (RDF)
  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything
- You can build relations in ontologies (OWL)
  - Then reason over them, search them, …
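The triple model above is simple enough to sketch directly. The `ex:` identifiers below are illustrative stand-ins for full URIs, not a real vocabulary, and real stores (e.g. SPARQL endpoints) offer much richer querying:

```python
# A minimal triple store: RDF reduces every statement to a
# <subject, predicate, object> triple of identifiers.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:about", "ex:LanguageTechnology"),
    ("ex:FrancisBond", "ex:memberOf", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything we have said about HG2052:
for s, p, o in sorted(query(s="ex:HG2052")):
    print(s, p, o)
```

Because every element is an identifier, triples from independent sources can be merged into one graph, which is exactly the "web of data" goal stated earlier.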
Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm
Semantic Web and NLP

- The Semantic Web is about structuring data
- Text mining is about unstructured data
- There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP
Citation, Reputation and PageRank
Citation Networks

- How can we tell what is a good scientific paper?
  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked
  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite
Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
- Some scores are:
  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total number of (citations / number of authors)
  - Average (citation impact factor / number of authors)
- Problems:
  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones
- Weight citations by quality of the paper
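The first three scores above are easy to compute once the citation graph is in hand. A sketch over a three-paper toy dataset (the papers and authors are invented):

```python
# Each paper: its authors and the papers it cites.
papers = {
    "p1": {"authors": ["ann", "bob"], "cites": ["p2", "p3"]},
    "p2": {"authors": ["ann"],        "cites": ["p3"]},
    "p3": {"authors": ["cho"],        "cites": []},
}

def citation_scores(author):
    """Total citations, citations minus self-citations, and
    citations shared among co-authors, for one author."""
    total = minus_self = shared = 0
    for pid, paper in papers.items():
        if author not in paper["authors"]:
            continue
        for citing in papers.values():
            if pid in citing["cites"]:
                total += 1
                shared += 1 / len(paper["authors"])
                if author not in citing["authors"]:
                    minus_self += 1
    return {"total": total, "minus self-citations": minus_self,
            "shared by authors": shared}

print(citation_scores("cho"))
# {'total': 2, 'minus self-citations': 2, 'shared by authors': 2.0}
```

Note how the scores diverge: ann's one citation (p1 citing p2) vanishes once self-citations are excluded, while cho's do not, illustrating why the choice of score matters.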
Gaming Citations

- Least/Minimum Publishable Unit
  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information
- Self-citation, in-group citation
- Write-only proceedings (some journals are not often read)
- Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve.
Characteristics of the Web: Bow Tie

SCC, Strongly Connected Core: can travel from any page to any page
Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path.to.the.page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:
  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
PageRank as Citation Analysis

- Citation frequency can be used to measure the impact of an article
  - Simplest measure: each article gets one vote (not very accurate)
- On the web, citation frequency = inlink count
  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam
- Better measure: weighted citation frequency, or citation rank
  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated
PageRank as Random Walk

- Imagine a web surfer doing a random walk on the web
  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably
- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
- This long-term visit rate is the page's PageRank
- PageRank = long-term visit rate = steady state probability
Teleporting: to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
- With the remaining probability (90%), go out on a random hyperlink
  - For example, if the page has 4 outgoing links, randomly choose one with probability (1-0.10)/4 = 0.225
- 10% is a parameter, the teleportation rate
- Note: "jumping" from a dead end is independent of the teleportation rate
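The random walk with teleportation converges under power iteration: repeatedly redistribute each page's rank along its outlinks (90%) and to everyone (10%) until the values settle. A sketch on a made-up four-page graph:

```python
# Power iteration for PageRank with a 10% teleportation rate.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],   # nobody links to D; D links to C
}
pages = sorted(links)
N = len(pages)
teleport = 0.10

rank = {p: 1 / N for p in pages}          # start from a uniform distribution
for _ in range(100):
    new = {p: teleport / N for p in pages}  # everyone receives teleport mass
    for p, outs in links.items():
        if not outs:                        # dead end: jump to a random page
            for q in pages:
                new[q] += rank[p] / N
        else:                               # otherwise follow a random outlink
            for q in outs:
                new[q] += (1 - teleport) * rank[p] / len(outs)
    rank = new

for p in pages:
    print(p, round(rank[p], 3))
```

Page D, with no inlinks, ends up with exactly the teleportation floor (0.1/4 = 0.025), while C, cited by three pages, gets the highest rank: inlinks act as weighted votes, just as the citation-analysis view says.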
Example Graph

Each inbound link is a positive vote.
http://www.ams.org/samplings/feature-column/fcarc-pagerank
Example Graph, Weighted

Pages with higher PageRanks are lighter.
Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
- Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
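In markup, the nofollow countermeasure is a one-attribute change (the URL below is a placeholder):

```html
<!-- An ordinary link: search engines may treat it as an endorsement -->
<a href="http://example.com/">a normal link</a>

<!-- rel="nofollow": tells search engines this link should not
     influence the target's ranking (typical for user-added content) -->
<a href="http://example.com/" rel="nofollow">a link in a blog comment</a>
```

Wikis and blog platforms commonly add `rel="nofollow"` to all user-submitted links automatically, which removes the PageRank incentive for comment spam.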
Current Status

- There is a continuous battle between
  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read
- All metrics get gamed
Digital Object Identifier

- DOI: a string used to uniquely identify an electronic document or object
  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like ISBN)
- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Conclusions
Revolutions in Language Technology

- Speech (language itself)
- Writing (invented 3-5 times)
- Printing (made writing common)
- Digital text (made writing transferable)
- Hyperlinking (taking writing beyond language)
The Internet

- The internet is useful as a tool
  - Passively, for information access
  - Actively, for collaboration
- The internet is interesting in and of itself
  - As a source of data about existing language
  - As a source of innovation in language
Recommended Texts

- Wikipedia
- Crystal, D. (2001) Language and the Internet. Cambridge University Press
- Sproat, R. (2010) Language Technology and Society. Oxford University Press
- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
- HG3051 Corpus Linguistics: (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Introduction
Review and Conclusions 1
What have we learned
atilde How technology affects our use of language
atilde How language is used on the internet
acirc Some interesting things we can now do that we couldnrsquot before
atilde Collaboration on the Web
atilde The web as a source of linguistic data Direct Query and Sample
atilde The Semantic Web meaning for non-humans
atilde Citation and reputation
Review and Conclusions 2
Reflection
atilde What was the most surprising thing in this class
atilde What do you think is most likely wrong
atilde What do you think is the coolest result
atilde What do you think yoursquore most likely to remember
atilde How do you think this course will influence you as a linguist
atilde What (if anything) did you hope to learn that you didnrsquot
Review and Conclusions 3
Goals
atilde Gain an understanding of how technology affects language use
atilde Develop familiarity with markup and meta information in texts
atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting
Upon successful completion students will
atilde have an understanding of how technology shapes language use
atilde be able to test linguistic hypotheses against web data
atilde know how to edit Wikipedia
Review and Conclusions 4
Themes
atilde Language and Technology
acirc Writing and Speech Technology
atilde Language and the Internet
acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter
atilde The Web as Corpus
atilde The Web beyond Language
acirc Semantic Web and Networks
Review and Conclusions 5
Revision
Review and Conclusions 6
Language and Technology
atilde The two things that separate humans from animals (Sproat 2010 sect11)
acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)
acirc Technologylowast Widespread tool uselowast Widespread tool manufacture
atilde Speech mdash the start of language
atilde Writing mdash the first great intersection
Review and Conclusions 7
Language and the Internet
atilde New forms of communication
acirc Neither speech nor textacirc Massively interactive
atilde Extremely rapid change
atilde A first hand narrative (I was online before the internet -)
acirc but I am probably behind you all now
Review and Conclusions 8
New forms
atilde Email (from PC phone other)
atilde Chat Usenet
atilde Virtual Worlds
atilde WWW
atilde Blogs (overlap)
atilde Facebook LinkedIn
atilde Wikis
atilde Twitter
Review and Conclusions 9
Representing Languageatilde Writing Systems
atilde Encodings
atilde Speech
atilde Bandwidth
Review and Conclusions 10
What is represented
atilde Phonemes ma d laks aeligvkadoz (45)
acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes
atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)
atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)
atilde Words my dog likes avocados (200000+)
atilde Concepts speaker poss canine fond avacado+PL (400000+)
Review and Conclusions 11
Three Major Writing Systems
atilde Alphabetic (Latin)
acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)
atilde Syllabic (Hiragana)
acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)
atilde Logographic (Hanzi)
acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)
Review and Conclusions 12
Computational Encoding
• Need to map characters to bits
• More characters require more space
  ◦ English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits); Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)
• Moving towards Unicode for everything
• If you get the encoding wrong, it is gibberish
  ◦ Web pages should state their encoding
  ◦ Sometimes they are wrong
  ◦ Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
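The variable width of UTF-8, and the way a wrong guess surfaces as gibberish or as a decode error, can be sketched in a few lines of Python (a toy illustration, not a production detector):

```python
# UTF-8 is variable width: ASCII stays at 1 byte, other scripts need more.
assert len("a".encode("utf-8")) == 1    # Latin letter
assert len("é".encode("utf-8")) == 2    # accented Latin
assert len("あ".encode("utf-8")) == 3   # Hiragana
assert len("😀".encode("utf-8")) == 4   # outside the Basic Multilingual Plane

raw = "日本語".encode("utf-8")
print(raw.decode("latin-1"))  # decodes without error, but prints mojibake

# A crude detector: try strict decodings and keep the ones that don't fail.
def plausible_encodings(data: bytes, candidates=("ascii", "utf-8", "latin-1")):
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok

print(plausible_encodings(raw))  # -> ['utf-8', 'latin-1'] (latin-1 never fails)
```

Because single-byte encodings like latin-1 accept any byte sequence, real detectors must also weigh byte-pattern statistics, as the slide notes.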
Speech and Language Technology
• The need for speech representation
• Storing sound
• Transforming Speech
  ◦ Automatic Speech Recognition (ASR): sounds to text
  ◦ Text-to-Speech Synthesis (TTS): text to sounds
• Speech technology – the Telephone
How good are the systems?

Task                     Vocab    WER (%)   WER (%) adapted
Digits                      11      0.4        0.2
Dialogue (travel)       21,000     10.9        –
Dictation (WSJ)          5,000      3.9        3.0
Dictation (WSJ)         20,000     10.0        8.6
Dialogue (noisy, army)   3,000     42.2       31.0
Phone Conversations      4,000     41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)
Why is it so difficult?
• Pronunciation depends on context
  ◦ The same word will be pronounced differently in different sentences
• Speaker variability
  ◦ Gender
  ◦ Dialect / Foreign Accent
  ◦ Individual Differences: physical differences, language differences (idiolect)
• Many, many rare events
  ◦ 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database
Two steps in a TTS system
1. Linguistic Analysis
  • Sentence Segmentation
  • Abbreviations: Dr Smith lives on Nanyang Dr. He is …
  • Word Segmentation
    ◦ 森前日銀総裁 → Mori zen Nichigin sousai (Mori, the former Bank of Japan governor)
    ⊗ 森前日銀総裁 → Mori zennichi gin sousai (wrong segmentation)
2. Speech Synthesis
  • Find the pronunciation
  • Generate sounds
  • Add intonation
Speech Synthesis
• Articulatory Synthesis: attempt to model human articulation
• Formant Synthesis: bypass modeling of human articulation and model acoustics directly
• Concatenative Synthesis: synthesize from stored units of actual speech
Prosody of Emotion
• Excitement: fast, very high pitch, loud
• Hot anger: fast, high pitch, strong falling accent, loud
• Fear: jitter
• Sarcasm: prolonged accent, late peak
• Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).
– Richard Sproat
Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading >> Speaking; Hearing >> Typing
⇒ Speech for input
⇒ Text for output
The Telephone

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
New Modalities

Email, Usenet, Chat and Blogs
• All share some characteristics of speech and text
• Usage norms not fixed
• Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …
• Large-scale discourse studies still to be done
• Some genuinely new things
  ◦ time-lagged multi-person conversation
  ◦ raw un-edited text
  ◦ extreme multi-authorship
Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Usenet (asynchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Chat (synchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Blogs

Speech-like                      Text-like
time-bound                       space-bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Netspeak
• Inspired by shared background
  ◦ <rant>I can't stand this</rant>
  ◦ I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  ◦ lusers (users, as seen by Systems Administrators)
  ◦ suits
• Need to be in-group
  cow orker: Coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." – Edward Flaherty (talk.bizarre)
• Gricean Principles
  ◦ Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
  ◦ GREAT, *great*, _great_
  ◦ :-) (^_^) (o)
  ◦ brb, RTFM, IMHO
  ◦ lower case
  ◦ lack of punctuation (hard on phone keyboards)
Collaboration and Wikis
• Version Control Systems
• Wikipedia
• Licensing and Ownership
Version Control Systems
• Versioning file systems
  ◦ every time a file is saved, a new copy is stored
• CVS, Subversion, Git
  ◦ changes to a collection of files are tracked
  ◦ simultaneous changes are merged
• Revision Tracking
  ◦ Revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge; people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
  • Content policies: NPOV, Verifiability and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules

Wikipedia:Five_pillars
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
What is a good article?
1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria
The World Wide Web and HTML

The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging)
• The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWYM)
• The structure of the Web – hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Map of online communities
http://xkcd.com/802
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like
• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)
The Web as Corpus

Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and WWW as corpus (Objection: results are not reliable)
  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions over the included text types
  ◦ (limited) control over the content
  ⊗ sparser data
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:
  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  ◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
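The ratio arithmetic from the example above, as a quick sanity check:

```python
# Raw ghit counts are unstable, but their ratio is a usable unitless number.
different_from = 251_000_000   # ghits for "different from" (counts quoted above)
different_than = 71_500_000    # ghits for "different than"

ratio = different_from / different_than
print(f"{ratio:.1f}:1")        # -> 3.5:1
```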
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with the available tools (here: taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  ◦ do this for several languages and make the corpora available
Building Internet Corpora: Outline
1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Post-process the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Bologna.
http://corpus.leeds.ac.uk/internet.html
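Steps 1-2 can be sketched as below. The seed list is a toy stand-in for the hundreds of medium-frequency words such pipelines use; the actual querying, downloading and cleanup steps need a search-engine API and network access, so they are omitted:

```python
import itertools
import random

# Step 1: seed words (a tiny, invented stand-in for a real ~500-word list)
seeds = ["picture", "holiday", "government", "music", "water",
         "history", "letter", "morning", "question", "market"]

# Step 2: combine seeds into multi-word queries; using several words per
# query makes the hits more likely to be pages of running text.
random.seed(1)                       # reproducible sample
all_pairs = list(itertools.combinations(seeds, 2))
queries = random.sample(all_pairs, 20)
print(queries[0])

# Steps 3-5 (send the queries to a search engine, download the returned
# URLs, then normalize encodings, clean boilerplate, tag and parse)
# require network access and external tools, and are not shown here.
```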
Internet Corpora: Summary
• The web can be used as a corpus
  ◦ Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts
  ◦ Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
  ⊗ Less balanced, less markup
Text and Meta-text
• Explicit Meta-data
  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup
• Implicit Meta-data
  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations
Explicit Metadata
• You can get information from metadata within documents
  ◦ When they are accurate, they are very good
  ◦ They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
Implicit Metadata
• You can get clues from metadata within documents
  ◦ as they are not intended as metadata, they tend to be noisy
  ◦ but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations – bracketed glosses, cross-lingual disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Language Identification and Normalization

Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine Learning based methods
  ◦ Context (e.g. under the .jp or .kr domains)
• Hard to do for short text snippets, similar languages, mixed text
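The character n-gram ranking method (in the style of Cavnar and Trenkle's out-of-place measure) can be sketched as follows; the training snippets are toy data, far smaller than any real system would use:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in a text."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank differences; n-grams missing from the language
    profile get a maximum penalty."""
    return sum(abs(i - profile.index(g)) if g in profile else len(profile)
               for i, g in enumerate(doc_profile))

# Toy training snippets (a real identifier trains on much more text)
english = ngram_profile("the cat sat on the mat and the dog ate the food in the end")
german  = ngram_profile("der hund und die katze sitzen auf der matte und schlafen dort")

doc = ngram_profile("the dog and the cat")
guess = min([("en", english), ("de", german)],
            key=lambda lp: out_of_place(lp[1], doc))[0]
print(guess)  # -> en
```

As the slide notes, this kind of method degrades on very short snippets, because too few n-grams are observed to build a stable profile.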
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon cel
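A naive suffix-stripping stemmer illustrates why stemming yields truncated forms like produc (the suffix list here is a toy assumption; real systems use the Porter stemmer or a dictionary-based lemmatizer):

```python
# Longest-match suffix stripping; keeps at least a 3-character stem.
SUFFIXES = ["ers", "er", "es", "s", "ing", "ed"]

def stem(word):
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

for w in ["producer", "produces", "produced", "producing"]:
    print(w, "->", stem(w))   # all four map to "produc"
```

Unlike lemmatization, which returns the dictionary form produce, stemming only conflates related forms to a shared (often non-word) stem.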
The Semantic Web

Goals of the Semantic Web
• Web of data
  ◦ provides a common data representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>
Semantic Web Architecture

XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  ◦ is a markup language, much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  ◦ Uniform Resource Name: urn:isbn:1575864606
  ◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework (RDF)
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ You can say anything about anything
• You can build relations in ontologies (OWL)
  ◦ Then reason over them, search them, …
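The triple model can be sketched with plain tuples; all the predicate and object URIs below are invented example.org names, not real vocabularies:

```python
# A tiny triple store: a set of <subject, predicate, object> triples,
# each element a URI (represented as a string here).
FCB = "http://www3.ntu.edu.sg/home/fcbond"
triples = {
    (FCB, "http://example.org/rel/teaches", "http://example.org/course/HG2052"),
    (FCB, "http://example.org/rel/memberOf", "http://example.org/org/NTU"),
    ("http://example.org/course/HG2052",
     "http://example.org/rel/about",
     "http://example.org/topic/LanguageTechnology"),
}

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# What does the fcbond URI teach?
print(query(s=FCB, p="http://example.org/rel/teaches"))
```

Real RDF stores work the same way conceptually, but with standard serializations and a pattern language (SPARQL) instead of a hand-rolled filter.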
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata:
1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible – know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
Citation, Reputation and PageRank

Citation Networks
• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ◦ Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  ◦ Total Number of Citations (Pretty Useful)
  ◦ Total Number of Citations minus Self-citations
  ◦ Total Number of (Citations / Number of Authors)
  ◦ Average (Citation Impact Factor / Number of Authors)
• Problems
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight Citations by Quality of the citing paper
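The first three scores above amount to simple arithmetic over a publication list; the papers and counts below are invented purely for illustration:

```python
# One row per paper: (citations, self_citations, number_of_authors)
papers = [
    (120, 10, 2),
    (45,   5, 4),
    (8,    2, 1),
]

total      = sum(c for c, _, _ in papers)          # total citations
minus_self = sum(c - s for c, s, _ in papers)      # minus self-citations
per_author = sum(c / n for c, _, n in papers)      # share credit per author

print(total, minus_self, round(per_author, 2))     # -> 173 156 79.25
```

The per-author split already changes the ranking of the three papers, which is one reason none of these raw scores is used alone.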
Gaming Citations
• Least/Minimum Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self citation, in-group citation
• Write only proceedings papers (some journals are not often read)
• Submitting only to High Impact Factor journals

You improve what gets measured, not necessarily what you want to improve.
Characteristics of the Web: Bow Tie
SCC: Strongly Connected Core – you can travel from any page to any other page
Anchor Text
• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="…">HG803 Course Page</a>

• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
PageRank as Citation Analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote – not very accurate
• On the web: citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
PageRank as Random Walk
• Imagine a web surfer doing a random walk on the web
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Teleporting – to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, jump to a random web page with probability 10% (to each page with probability 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
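The teleporting random walk above can be computed directly by power iteration; the 4-page graph below is invented for illustration:

```python
# PageRank by power iteration with a 10% teleportation rate.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
pages = list(links)
N = len(pages)
t = 0.10                                   # teleportation rate

rank = {p: 1 / N for p in pages}           # start uniform
for _ in range(100):
    new = {p: 0.0 for p in pages}
    for p in pages:
        if not links[p]:                   # dead end: jump anywhere, 1/N each
            for q in pages:
                new[q] += rank[p] / N
        else:
            for q in pages:                # teleport with probability t
                new[q] += t * rank[p] / N
            for q in links[p]:             # else follow a random outlink
                new[q] += (1 - t) * rank[p] / len(links[p])
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

After convergence the ranks sum to 1, and pages with many inlinks (like C) end up with a higher long-term visit rate than the unlinked dead end D.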
Example Graph
Each inbound link is a positive vote.
http://www.ams.org/samplings/feature-column/fcarc-pagerank

Example Graph: Weighted
Pages with higher PageRanks are lighter.
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
• Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, instructing some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …
Current Status
• There is a continuous battle between
  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read
• All metrics get gamed
Digital Object Identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
Conclusions

Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
The Internet
• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Complementary Courses
• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics – (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
What have we learned
atilde How technology affects our use of language
atilde How language is used on the internet
acirc Some interesting things we can now do that we couldnrsquot before
atilde Collaboration on the Web
atilde The web as a source of linguistic data Direct Query and Sample
atilde The Semantic Web meaning for non-humans
atilde Citation and reputation
Review and Conclusions 2
Reflection
atilde What was the most surprising thing in this class
atilde What do you think is most likely wrong
atilde What do you think is the coolest result
atilde What do you think yoursquore most likely to remember
atilde How do you think this course will influence you as a linguist
atilde What (if anything) did you hope to learn that you didnrsquot
Review and Conclusions 3
Goals
atilde Gain an understanding of how technology affects language use
atilde Develop familiarity with markup and meta information in texts
atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting
Upon successful completion students will
atilde have an understanding of how technology shapes language use
atilde be able to test linguistic hypotheses against web data
atilde know how to edit Wikipedia
Review and Conclusions 4
Themes
atilde Language and Technology
acirc Writing and Speech Technology
atilde Language and the Internet
acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter
atilde The Web as Corpus
atilde The Web beyond Language
acirc Semantic Web and Networks
Review and Conclusions 5
Revision
Review and Conclusions 6
Language and Technology
atilde The two things that separate humans from animals (Sproat 2010 sect11)
acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)
acirc Technologylowast Widespread tool uselowast Widespread tool manufacture
atilde Speech mdash the start of language
atilde Writing mdash the first great intersection
Review and Conclusions 7
Language and the Internet
atilde New forms of communication
acirc Neither speech nor textacirc Massively interactive
atilde Extremely rapid change
atilde A first hand narrative (I was online before the internet -)
acirc but I am probably behind you all now
Review and Conclusions 8
New forms
atilde Email (from PC phone other)
atilde Chat Usenet
atilde Virtual Worlds
atilde WWW
atilde Blogs (overlap)
atilde Facebook LinkedIn
atilde Wikis
atilde Twitter
Review and Conclusions 9
Representing Languageatilde Writing Systems
atilde Encodings
atilde Speech
atilde Bandwidth
Review and Conclusions 10
What is represented
atilde Phonemes ma d laks aeligvkadoz (45)
acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes
atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)
atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)
atilde Words my dog likes avocados (200000+)
atilde Concepts speaker poss canine fond avacado+PL (400000+)
Review and Conclusions 11
Three Major Writing Systems
atilde Alphabetic (Latin)
acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)
atilde Syllabic (Hiragana)
acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)
atilde Logographic (Hanzi)
acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)
Review and Conclusions 12
Computational Encoding
atilde Need to map characters to bits
atilde More characters require more space
acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)
atilde Moving towards unicode for everything
atilde If you get the encoding wrong it is gibberish
acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns
lowast But an encoding can be shared by many languages
Review and Conclusions 13
Speech and Language Technologyatilde The need for speech representation
atilde Storing sound
atilde Transforming Speech
acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds
atilde Speech technology mdash the Telephone
Review and Conclusions 14
How good are the systems
Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310
Results of various DARPA competitions (from Richard Sproatrsquos slides)
Review and Conclusions 15
Why is it so difficult
atilde Pronunciation depends on context
acirc The same word will be pronounced differently in different sentences
atilde Speaker variability
acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)
atilde Many many rare events
acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1 Linguistic Analysis
atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation
acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai
2 Speech Synthesis
atilde Find the pronunciationatilde Generate soundsatilde Add intonation
Review and Conclusions 17
Speech Synthesis
atilde Articulatory Synthesis Attempt to model human articulation
atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly
atilde Concatenative Synthesis Synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
atilde Excitement Fast very high pitch loud
atilde Hot anger Fast high pitch strong falling accent loud
atilde Fear Jitter
atilde Sarcasm Prolonged accent late peak
atilde Sad Slow low pitch
The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)(English computer science students various studies)
Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)
atilde Reading gtgt SpeakingHearing gtgt Typing
rArr Speech for inputrArr Text for output
Review and Conclusions 20
The Telephone
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
atilde All share some characteristics of speech and text
atilde Usage norms not fixed
atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip
atilde Large scale discourse studies still to be done
atilde Some genuinely new things
acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship
Review and Conclusions 23
Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background:
  ◦ <rant>I can't stand this</rant>
  ◦ I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  ◦ lusers (users, as seen by Systems Administrators)
  ◦ suits
• Need to be in-group:
  cow orker = coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance" Edward Flaherty (talk.bizarre)
Review and Conclusions 28
• Gricean Principles
  ◦ Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations:
  ◦ GREAT great great
  ◦ :-) (^_^) (o)
  ◦ brb, RTFM, IMHO
  ◦ lower case
  ◦ lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis
• Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
• Versioning file systems
  ◦ every time a file is saved, a new copy is stored
• CVS, Subversion, Git
  ◦ changes to a collection of files are tracked
  ◦ simultaneous changes are merged
• Revision Tracking
  ◦ revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge; people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
  • Content policies: NPOV, Verifiability, and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
Wikipedia:Five_pillars 33
Licenses and Ownership
atilde Copyright
atilde Copyleft
atilde Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of markup: Visual vs. Logical (WYSIWYG, WYSIAYG, WYSIWIM)
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form.
Unlinked content: pages which are not linked to by other pages (no inbound links point to them).
Private Web: sites that require registration and login (e.g. Edventure).
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses, or previous navigation sequence).
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard).
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions.
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines.
These pages all include data that search engines cannot find.
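The Robots Exclusion Standard mentioned above is machine-readable, and Python's standard library can parse a robots.txt and report which pages a compliant crawler (and hence a search engine) may fetch. The site and paths below are invented for illustration.

```python
# Parse a robots.txt and check which URLs a compliant crawler may fetch.
from urllib.robotparser import RobotFileParser

robots_txt = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(robots_txt)   # parse() takes the file as a list of lines

print(rp.can_fetch("*", "http://example.org/public/page.html"))    # True
print(rp.can_fetch("*", "http://example.org/private/page.html"))   # False
```

Anything under /private/ is off-limits to compliant crawlers, so its content stays in the deep web even though the pages exist.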
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like
• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  ◦ known size and exact hit counts → statistics possible
  ◦ conclusions can be drawn about the included text types
  ◦ (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:
  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case distinction, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  ◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than): 251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
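The comparison above is simple arithmetic; a minimal sketch, using the 2012 counts quoted on this slide. The index size used for GPB is an invented illustrative figure, since Google no longer publishes one.

```python
# Compare two constructions by their web hit counts (ghits).
# The counts are the 2012 figures from the slide; the index size
# for GPB is a made-up illustrative value.

def ratio(hits_a, hits_b):
    """Unitless ratio of two document-frequency counts."""
    return hits_a / hits_b

def gpb(hits, index_size):
    """Ghits per billion indexed documents (Liberman's GPB)."""
    return hits / index_size * 1_000_000_000

different_from = 251_000_000
different_than = 71_500_000

r = ratio(different_from, different_than)        # about 3.5 : 1
print(f"different from : different than = {r:.1f} : 1")

assumed_index = 40_000_000_000                   # hypothetical index size
print(f"GPB(different from) = {gpb(different_from, assumed_index):.0f}")
```

The ratio is stable across index growth in a way the raw counts are not, which is the point of the slide's advice.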
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with the available tools (here: taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  ◦ do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6,000)
3 Query a search engine and retrieve the URLs (50,000)
4 Download the files from the URLs (100,000,000 words)
5 Postprocess the data (encoding, cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Bologna, 2006.
http://corpus.leeds.ac.uk/internet.html 47
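The first two steps of the pipeline above can be sketched offline: combine seed words into search-engine queries. The seed words here are invented toys, and the search, download and postprocessing steps are only noted, since they need network access and external tools (taggers, lemmatizers).

```python
# Step 1-2 of a Sharoff/BootCaT-style pipeline: turn seed words
# into search-engine queries by combining them.
from itertools import combinations

def make_queries(seeds, words_per_query=4):
    """Every combination of `words_per_query` seed words, as query strings."""
    return [" ".join(c) for c in combinations(seeds, words_per_query)]

seeds = ["time", "people", "water", "government", "school", "music"]
queries = make_queries(seeds)

print(len(queries), "queries, e.g.:", queries[0])
# With 500 seeds and 4 words per query there are vastly more than the
# ~6,000 queries on the slide, so in practice a random sample is taken
# before querying, downloading and postprocessing.
```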
Internet Corpora Summary
• The web can be used as a corpus
  ◦ Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts
  ◦ Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text
• Explicit Meta-data
  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup
• Implicit Meta-data
  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  ◦ When they are accurate, they are very good
  ◦ They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  ◦ as they are unintended, they tend to be noisy
  ◦ but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations — Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine Learning based methods
  ◦ Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
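The similarity-based approach above, comparing character n-gram rank profiles in the style of Cavnar and Trenkle, fits in a few lines. The two training texts are invented toys, so this only illustrates the idea; a real identifier trains on far more data per language.

```python
# Minimal character-trigram language guesser (rank-order method).
from collections import Counter

def profile(text, n=3, top=50):
    """The `top` most frequent character n-grams, most frequent first."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(doc_prof, lang_prof):
    """Sum of rank differences; n-grams unseen in the language profile
    get a maximal out-of-place penalty."""
    ranks = {g: r for r, g in enumerate(lang_prof)}
    out_of_place = len(lang_prof)
    return sum(abs(r - ranks.get(g, out_of_place))
               for r, g in enumerate(doc_prof))

training = {   # toy training texts
    "en": "the cat sat on the mat and the dog ate the food in the house",
    "de": "der hund und die katze sitzen in dem haus und essen das essen",
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(text):
    doc = profile(text)
    return min(profiles, key=lambda lang: distance(doc, profiles[lang]))

print(identify("the cat and the dog"))   # en
```

With snippets this short the margin is small, which is exactly the "hard for short text" point on the slide.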
Review and Conclusions 53
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
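Two of the steps above can be sketched minimally: number normalization for money expressions, and stemming. The money patterns and the suffix list below are invented toy rules, not a real stemmer such as Porter's.

```python
# Toy normalizer: canonicalize money amounts and strip suffixes.
import re

def normalize_money(text):
    """$700K -> 700000, $0.7M -> 700000 (very small coverage)."""
    mult = {"K": 1_000, "M": 1_000_000}
    m = re.fullmatch(r"\$([\d.]+)([KM])", text)
    return round(float(m.group(1)) * mult[m.group(2)]) if m else None

def stem(word):
    """Naive stemmer: strip a few common English suffixes."""
    for suf in ("ers", "er", "es", "s", "ing"):
        if word.endswith(suf) and len(word) - len(suf) >= 4:
            return word[:-len(suf)]
    return word

print(normalize_money("$700K"))            # 700000
print(stem("produces"), stem("producer"))  # produc produc
```

Both forms of "produce(r/s)" collapse to the same stem, so they count as one term, which is what stemming buys you in indexing and frequency counting.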
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  ◦ provides a common data representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  ◦ it is a markup language, much like HTML
  ◦ it was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ it is designed to be self-descriptive
  ◦ XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
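Well-formedness checking is mechanical: the standard library's XML parser accepts any properly nested, properly quoted document, whatever tags we invent (the lexicon tags below are made up).

```python
# Check XML well-formedness with the standard library parser.
import xml.etree.ElementTree as ET

def well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<lexicon><entry pos='noun'>avocado</entry></lexicon>"
bad = "<lexicon><entry>avocado</lexicon></entry>"   # wrongly nested

print(well_formed(good), well_formed(bad))   # True False
```

Note this is only well-formedness; validity against a schema (are these the *right* tags, in the right places?) is a separate, stronger check.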
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
  ◦ Uniform Resource Name: urn:isbn:1575864606
  ◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ You can say anything about anything
• You can build relations into ontologies (OWL)
  ◦ Then reason over them, search them, …
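Since RDF data reduces to <subject, predicate, object> triples, a triple store is just a set of tuples; the sketch below adds one inference rule, transitivity of rdfs:subClassOf, to show what "reasoning over an ontology" means in the simplest case. The ex: URIs are invented.

```python
# A tiny in-memory triple store with one RDFS inference rule.
SUB = "rdfs:subClassOf"

triples = {
    ("ex:Avocado", SUB, "ex:Fruit"),
    ("ex:Fruit", SUB, "ex:Food"),
    ("ex:Avocado", "ex:color", "ex:Green"),
}

def infer_subclass(triples):
    """Add all triples implied by transitivity of subClassOf."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        sub = [(s, o) for s, p, o in closed if p == SUB]
        for s, o in sub:
            for s2, o2 in sub:
                if o == s2 and (s, SUB, o2) not in closed:
                    closed.add((s, SUB, o2))
                    changed = True
    return closed

closed = infer_subclass(triples)
print(("ex:Avocado", SUB, "ex:Food") in closed)   # True: inferred, not stated
```

The machine concludes that an avocado is food even though no one ever said so directly, which is the "draw new conclusions" goal from the previous slides.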
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1 People lie
2 People are lazy
3 People are stupid
4 Mission: Impossible (know thyself)
5 Schemas aren't neutral
6 Metrics influence results
7 There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ◦ Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  ◦ Total Number of Citations (pretty useful)
  ◦ Total Number of Citations minus Self-citations
  ◦ Total Number of (Citations / Number of Authors)
  ◦ Average (Citation Impact Factor / Number of Authors)
• Problems:
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
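The first three scores above, computed over an invented toy publication record (each paper lists its citation count, how many of those are self-citations, and its number of authors):

```python
# Simple citation scores over a toy record; the numbers are invented.
papers = [
    # (citations, self_citations, n_authors)
    (100, 10, 2),
    (40, 0, 4),
    (10, 5, 1),
]

total = sum(c for c, s, a in papers)
minus_self = sum(c - s for c, s, a in papers)
fractional = sum(c / a for c, s, a in papers)   # citations / n_authors

print(total, minus_self, fractional)   # 150 135 70.0
```

Even on three papers the scores rank quite differently (the single-author paper gains relative weight under the fractional count), which is why no single metric settles the question.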
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high impact factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803: Language Technology and the Internet</a>

For more information about Language Technology and the Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:
1 The hyperlink from A to B represents an endorsement of page B by the creator of page A
2 The (extended) anchor text pointing to page B is a good description of page B
  (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter: the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
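The random walk with teleportation above can be computed directly by power iteration; this is a minimal sketch over a three-page invented graph, with the 10% teleport rate from the slide and uniform jumps from dead ends.

```python
# Power-iteration PageRank with a 10% teleportation rate.
def pagerank(links, n, teleport=0.10, iters=100):
    """links: {page: [pages it links to]}; pages are 0..n-1."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        # every page sheds `teleport` of its mass uniformly
        new = [teleport / n * sum(rank)] * n
        for page in range(n):
            out = links.get(page, [])
            if not out:
                # dead end: the remaining mass also jumps uniformly,
                # so the total jump is the full rank[page] / n per page
                for q in range(n):
                    new[q] += rank[page] / n * (1 - teleport)
            else:
                share = (1 - teleport) * rank[page] / len(out)
                for q in out:
                    new[q] += share
        rank = new
    return rank

links = {0: [1, 2], 1: [2], 2: [0]}   # invented toy web
r = pagerank(links, 3)
print([round(x, 3) for x in r])
# Page 2, with two inbound links, gets the highest rank.
```

The ranks sum to 1, as steady-state visit rates must, and page 2 narrowly beats page 0 even though page 0 receives page 2's entire outgoing vote: weighted votes, not raw inlink counts.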
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.
The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …
74
Current Status
• There is a continuous battle between:
  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
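Because DOIs resolve through the central proxy at doi.org, turning a DOI name into a stable link is pure string handling, and the basic "10.registrant/suffix" shape can be checked with a regex. The shape check below is a rough sketch, not the full DOI syntax.

```python
# Build a resolver URL for a DOI and sanity-check its shape.
import re

def doi_url(doi):
    """Stable link via the central doi.org proxy."""
    return "https://doi.org/" + doi

def looks_like_doi(s):
    """Rough check for the 10.<registrant>/<suffix> shape."""
    return re.fullmatch(r"10\.\d{4,9}/\S+", s) is not None

doi = "10.1007/s10579-008-9062-z"
print(looks_like_doi(doi), doi_url(doi))
```

The resolver link stays valid even if the publisher moves the paper off springerlink.com, which is exactly the persistence the slide describes.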
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press.
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press.
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Reflection
atilde What was the most surprising thing in this class
atilde What do you think is most likely wrong
atilde What do you think is the coolest result
atilde What do you think yoursquore most likely to remember
atilde How do you think this course will influence you as a linguist
atilde What (if anything) did you hope to learn that you didnrsquot
Review and Conclusions 3
Goals
atilde Gain an understanding of how technology affects language use
atilde Develop familiarity with markup and meta information in texts
atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting
Upon successful completion students will
atilde have an understanding of how technology shapes language use
atilde be able to test linguistic hypotheses against web data
atilde know how to edit Wikipedia
Review and Conclusions 4
Themes
atilde Language and Technology
acirc Writing and Speech Technology
atilde Language and the Internet
acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter
atilde The Web as Corpus
atilde The Web beyond Language
acirc Semantic Web and Networks
Review and Conclusions 5
Revision
Review and Conclusions 6
Language and Technology
atilde The two things that separate humans from animals (Sproat 2010 sect11)
acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)
acirc Technologylowast Widespread tool uselowast Widespread tool manufacture
atilde Speech mdash the start of language
atilde Writing mdash the first great intersection
Review and Conclusions 7
Language and the Internet
atilde New forms of communication
acirc Neither speech nor textacirc Massively interactive
atilde Extremely rapid change
atilde A first hand narrative (I was online before the internet -)
acirc but I am probably behind you all now
Review and Conclusions 8
New forms
atilde Email (from PC phone other)
atilde Chat Usenet
atilde Virtual Worlds
atilde WWW
atilde Blogs (overlap)
atilde Facebook LinkedIn
atilde Wikis
atilde Twitter
Review and Conclusions 9
Representing Languageatilde Writing Systems
atilde Encodings
atilde Speech
atilde Bandwidth
Review and Conclusions 10
What is represented
atilde Phonemes ma d laks aeligvkadoz (45)
acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes
atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)
atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)
atilde Words my dog likes avocados (200000+)
atilde Concepts speaker poss canine fond avacado+PL (400000+)
Review and Conclusions 11
Three Major Writing Systems
atilde Alphabetic (Latin)
acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)
atilde Syllabic (Hiragana)
acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)
atilde Logographic (Hanzi)
acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)
Review and Conclusions 12
Computational Encoding
atilde Need to map characters to bits
atilde More characters require more space
acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)
atilde Moving towards unicode for everything
atilde If you get the encoding wrong it is gibberish
acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns
lowast But an encoding can be shared by many languages
Review and Conclusions 13
Speech and Language Technologyatilde The need for speech representation
atilde Storing sound
atilde Transforming Speech
acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds
atilde Speech technology mdash the Telephone
Review and Conclusions 14
How good are the systems
Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310
Results of various DARPA competitions (from Richard Sproatrsquos slides)
Review and Conclusions 15
Why is it so difficult
atilde Pronunciation depends on context
acirc The same word will be pronounced differently in different sentences
atilde Speaker variability
acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)
atilde Many many rare events
acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1 Linguistic Analysis
atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation
acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai
2 Speech Synthesis
atilde Find the pronunciationatilde Generate soundsatilde Add intonation
Review and Conclusions 17
Speech Synthesis
atilde Articulatory Synthesis Attempt to model human articulation
atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly
atilde Concatenative Synthesis Synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
atilde Excitement Fast very high pitch loud
atilde Hot anger Fast high pitch strong falling accent loud
atilde Fear Jitter
atilde Sarcasm Prolonged accent late peak
atilde Sad Slow low pitch
The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)(English computer science students various studies)
Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)
atilde Reading gtgt SpeakingHearing gtgt Typing
rArr Speech for inputrArr Text for output
Review and Conclusions 20
The Telephone
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
atilde All share some characteristics of speech and text
atilde Usage norms not fixed
atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip
atilde Large scale discourse studies still to be done
atilde Some genuinely new things
acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship
Review and Conclusions 23
Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
atilde Inspired by shared background
acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits
atilde Need to be in-group
cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)
Review and Conclusions 28
atilde Gricean Principals
acirc Posting (topmiddlebottom quoting and trimming)
atilde Inspired by medium limitations
acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikisatilde Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)
atilde Wikipedia makes it easy to share your knowledgepeople like to do this
atilde Most edits are done by insiders
atilde Most content is added by outsiders
atilde Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
atilde Content policies NPOV Verifiability and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
WikipediaFivepillars 33
Licenses and Ownership
atilde Copyright
atilde Copyleft
atilde Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
WikipediaGood_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  – it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than):
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
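The ratio comparison above as a quick Python sketch; the counts are the slide's 2012 figures, and the GPB helper assumes a known index size, which engines no longer publish:

```python
# Compare two constructions by web hit counts (ghits).
# Figures are those from the slide, accessed 2012-04-04.
ghits = {"different from": 251_000_000, "different than": 71_500_000}

ratio = ghits["different from"] / ghits["different than"]
print(f"{ratio:.1f}:1")  # about 3.5:1

def gpb(hits, index_size):
    """Ghits per billion indexed documents (Liberman's more stable unit)."""
    return hits / index_size * 1_000_000_000
```

Reporting the ratio (and the access date) rather than the raw counts makes the comparison meaningful even as the index grows.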
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  – do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select seed words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Post-process the data (encoding, cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html 47
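Steps 1-2 of the outline above can be sketched as follows; the ten seed words stand in for the ~500 real ones, and the actual querying, downloading and cleaning (steps 3-5) are omitted:

```python
import random

# A tiny stand-in for the ~500 general-purpose seed words.
seeds = ["picture", "shown", "considered", "useful", "amount",
         "become", "perhaps", "certain", "common", "system"]

def make_queries(seeds, words_per_query=4, n_queries=20, seed=0):
    """Randomly combine seed words into distinct search-engine queries."""
    rng = random.Random(seed)   # fixed seed so the sample is reproducible
    queries = set()
    while len(queries) < n_queries:
        queries.add(" ".join(rng.sample(seeds, words_per_query)))
    return sorted(queries)

queries = make_queries(seeds)
```

Each query is then sent to a search engine API, and the returned URLs are downloaded and post-processed into the corpus.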
Internet Corpora Summary
• The web can be used as a corpus
  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts
  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text
• Explicit Meta-data
  – Keywords and Categories
  – Rankings
  – Structural Markup
• Implicit Meta-data
  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  – When they are accurate, they are very good
  – They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
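Explicit metadata in an HTML page head looks like this (the keyword and description values are invented for illustration):

```html
<head>
  <meta name="keywords" content="language technology, internet, corpora">
  <meta name="description" content="Lecture notes for HG2052">
  <meta name="author" content="Francis Bond">
</head>
```

Nothing forces these fields to describe the page truthfully, which is why search engines learned to distrust them.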
50
Implicit Metadata
• You can get clues from metadata within documents
  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations — bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
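A minimal version of the character-n-gram approach; the two "training" sentences are toy stand-ins for the large amounts of text a real identifier is trained on:

```python
from collections import Counter

def profile(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = " ".join(text.lower().split())   # normalize whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy training data (invented sentences), one per language.
training = {
    "en": "the cat sat on the mat and the dog was in the house with them",
    "de": "der hund und die katze sind in dem haus und der garten ist alt",
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(snippet):
    """Pick the language whose n-gram profile best overlaps the snippet's."""
    target = profile(snippet)
    return max(profiles,
               key=lambda lang: sum(min(c, profiles[lang][g])
                                    for g, c in target.items()))
```

With only a sentence of training data this already separates the two toy languages, but it illustrates why short snippets and closely related languages remain hard: the profiles overlap too much.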
Review and Conclusions 53
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
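The stemming examples above, as a deliberately crude suffix-stripper; real systems use the Porter stemmer or a dictionary-based lemmatizer:

```python
# Strip a small set of suffixes, longest first; purely illustrative.
SUFFIXES = ("es", "er", "s")

def stem(word):
    for suffix in SUFFIXES:
        # keep a minimal stem so very short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word
```

Note that a stemmer only chops strings ("producer" and "produces" both become "produc", which is not a word), whereas a lemmatizer maps to a real dictionary form.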
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically
  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – is a W3C Recommendation
• Can be validated for syntactic well-formedness
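Well-formedness can be checked with Python's standard library; the <note> tags are made up for the example, since XML lets you define your own:

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """True if doc parses as syntactically well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

good = "<note><to>Tove</to><body>Remember the meeting</body></note>"
bad = "<note><to>Tove</note>"   # <to> is opened but never closed
```

Well-formedness is purely syntactic (every tag closed, properly nested); validity against a schema is a separate, stronger check.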
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework (RDF)
  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything
• You can build relations into ontologies (OWL)
  – Then reason over them, search them, …
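A toy illustration of RDF-style triples in Python; the ex: names are invented, and real RDF uses full URIs and a library such as rdflib:

```python
# Each fact is a <subject, predicate, object> triple of identifiers.
triples = [
    ("ex:Bond",   "ex:teaches",  "ex:HG2052"),
    ("ex:HG2052", "ex:taughtAt", "ex:NTU"),
    ("ex:Bond",   "ex:worksAt",  "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None is a wildcard."""
    return [t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]
```

Pattern matching over triples is the core of SPARQL, the Semantic Web's query language; ontologies add rules for inferring new triples from existing ones.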
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: citation analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total number of (citations / number of authors)
  – Average (citation impact factor / number of authors)
• Problems:
  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
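The simpler scores above, computed over an invented four-paper citation graph:

```python
# cites[p] lists the papers that p cites; authors[p] lists p's authors.
# All ids and names are invented for illustration.
cites = {
    "P1": ["P2", "P1"],          # P1 cites P2 and (oddly) itself
    "P2": [],
    "P3": ["P1", "P2"],
    "P4": ["P1"],
}
authors = {"P1": ["A", "B"], "P2": ["A"], "P3": ["C"], "P4": ["D"]}

def citations(p):
    """Total number of papers citing p (self-citations included)."""
    return sum(p in refs for refs in cites.values())

def citations_minus_self(p):
    return sum(p in refs for q, refs in cites.items() if q != p)

def citations_per_author(p):
    return citations(p) / len(authors[p])
```

All three scores treat every citing paper as equally important; weighting by the citing paper's own quality leads directly to PageRank, below.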
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: each article gets one vote — not very accurate
• On the web, citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web:
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate
(what proportion of the time someone will be there)
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting — to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
(N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
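The walk with teleporting can be computed by power iteration; the four-page link graph below is invented for illustration:

```python
TELEPORT = 0.10   # the 10% teleportation rate from the slide

def pagerank(links, iters=100):
    """Long-term visit rates of a random walk with teleporting."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}          # start anywhere, equiprobably
    for _ in range(iters):
        new = dict.fromkeys(pages, 0.0)
        for p in pages:
            if not links[p]:                  # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:               # teleport with probability 0.1
                    new[q] += TELEPORT * rank[p] / n
                for q in links[p]:            # otherwise follow a random link
                    new[q] += (1 - TELEPORT) * rank[p] / len(links[p])
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(links)
```

The ranks always sum to 1 (they are a probability distribution), and pages with many good inlinks, like C here, end up with the highest visit rate.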
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links
The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  – Google does not index the target of a link marked nofollow
  – Yahoo! does not include the link in its ranking
  – …
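In markup, nofollow is a single attribute value on the link (the URL is invented):

```html
<!-- this link should not pass ranking credit to its target -->
Great post! Visit <a href="http://example.com/pills" rel="nofollow">my site</a>.
```

Wikis and blog platforms add this attribute automatically to user-submitted links, which removes the ranking incentive for comment spam.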
74
Current Status
• There is a continuous battle between
  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a persistent identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – Passively, for information access
  – Actively, for collaboration
• The internet is interesting in and of itself
  – As a source of data about existing language
  – As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  – is a markup language much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – is a W3C Recommendation
• Can be validated for syntactic well-formedness
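Well-formedness checking is easy to demonstrate with Python's standard xml.etree parser. The tags in the snippet (note, to, body) are freely invented, exactly as XML allows:

```python
import xml.etree.ElementTree as ET

# Our own tags: XML does not predefine them.
good = "<note><to>Bond</to><body>Lecture 12 review</body></note>"
bad = "<note><to>Bond</body></note>"   # mismatched closing tag

def well_formed(doc):
    """Check syntactic well-formedness by attempting to parse the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed(good))  # True
print(well_formed(bad))   # False
```

Well-formedness (every open tag closed, properly nested) is purely syntactic; validity against a schema is a separate, stronger check.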
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  – triples of <subject, predicate, object>
  – each element is a URI
  – RDF is written in well-defined XML
  – you can say anything about anything
• You can build relations in ontologies (OWL)
  – then reason over them, search them, …
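A minimal sketch of the triple model in Python: facts stored as <subject, predicate, object> tuples, with pattern matching over them. The "ex:" identifiers are made up for illustration; a real store would use full URIs and a query language such as SPARQL.

```python
# A toy triple store: each fact is a <subject, predicate, object> triple.
# The ex: URIs below are illustrative, not real identifiers.
triples = [
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:topic", "ex:SemanticWeb"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything asserted about HG2052:
print(query(s="ex:HG2052"))
# Who works at NTU?
print([t[0] for t in query(p="ex:worksAt", o="ex:NTU")])
```

Because every position is a uniform identifier, triples from different sources can be merged into one graph and queried together, which is the "web of data" idea in miniature.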
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked
  – Context-based: citation analysis
    * See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group
• Some scores are:
  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total of (citations / number of authors)
  – Average of (citation impact factor / number of authors)
• Problems:
  – Not all citations are equal: citations by ‘good’ papers are better
  – Newer publications suffer in relation to older ones
• Weight citations by quality of the paper
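With hypothetical publication records, the first three scores above reduce to one-line sums. The records and numbers below are invented purely to show the arithmetic:

```python
# Hypothetical records: citation count, author list, self-citation count.
papers = [
    {"cites": 120, "authors": ["Bond", "Smith"], "self_cites": 10},
    {"cites": 40, "authors": ["Bond"], "self_cites": 5},
    {"cites": 8, "authors": ["Bond", "A", "B", "C"], "self_cites": 2},
]

# Total number of citations
total = sum(p["cites"] for p in papers)
# Total minus self-citations
total_minus_self = sum(p["cites"] - p["self_cites"] for p in papers)
# Total of (citations / number of authors)
per_author = sum(p["cites"] / len(p["authors"]) for p in papers)

print(total)                  # 168
print(total_minus_self)       # 151
print(round(per_author, 1))   # 102.0
```

Note how the per-author score sharply discounts the heavily co-authored paper, illustrating why the choice of metric changes the ranking.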
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high-impact-factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: the Bow Tie
SCC (Strongly Connected Core): can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>
For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>
• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: each article gets one vote (not very accurate)
• On the web, citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting: to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
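The random walk with teleportation can be computed directly by power iteration, repeatedly redistributing probability mass until it settles. This is a minimal sketch: the 10% teleportation rate matches the slide, but the three-page web is invented for illustration.

```python
def pagerank(links, n_iter=100, teleport=0.10):
    """Power iteration for PageRank with teleportation.
    links maps each page to the list of pages it links to;
    mass at a dead end is spread uniformly over all pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(n_iter):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                          # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                  # teleport with prob 0.10
                    new[q] += teleport * rank[p] / n
                for q in out:                    # follow a link with prob 0.90
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# A tiny web: B and C both link to A, so A should rank highest.
web = {"A": ["B"], "B": ["A"], "C": ["A"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # A
```

The ranks always sum to 1 (it is a probability distribution), and page C, which nothing links to, keeps only the small teleportation mass.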
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.
The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  – Google does not index the target of a link marked nofollow
  – Yahoo! does not include the link in its ranking
  – …
74
Current Status
• There is a continuous battle between:
  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u/
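In practice a DOI name is resolved through the central doi.org resolver, so turning the identifier into a persistent link is just string concatenation; the resolver then redirects to whatever URL is in the current metadata. The DOI below is the one from the slide.

```python
def doi_to_url(doi):
    """Build the resolver URL for a DOI name.
    The resolver redirects to the document's current location,
    so the link survives even if the publisher moves the page."""
    return f"https://doi.org/{doi}"

print(doi_to_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```

This is why citing a DOI is more robust than citing a publisher URL: the publisher-side location may change, but the DOI and its resolver entry persist.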
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – Passively, for information access
  – Actively, for collaboration
• The internet is interesting in and of itself
  – As a source of data about existing language
  – As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics (pre-requisite HG2051, waived): an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
➣ Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
➣ Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
➣ Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission
73
➣ Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links
➣ The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
➢ Google does not index the target of a link marked nofollow
➢ Yahoo! does not include the link in its ranking
➢ …
74
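As a concrete illustration of the rel attribute described above (the URL and anchor text here are invented):

```html
<!-- Hypothetical example: rel="nofollow" asks search engines
     not to pass ranking credit to the link target. -->
<a href="http://example.com/some-page" rel="nofollow">a link I do not vouch for</a>
```

This is why spamming a nofollow-marked comment section gains the spammer little PageRank.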
Current Status
➣ There is a continuous battle between
➢ Search companies, who want to get the most useful page to the user
➢ Page writers, who want to get their page read
➣ All metrics get gamed
Review and Conclusions 75
Digital object identifier
➣ DOI: a string used to uniquely identify an electronic document or object
➢ Metadata about the object is stored with the DOI name
➢ The metadata includes a location, such as a URL
➢ The DOI for a document is permanent; the metadata may change
➢ Gives a Persistent Identifier (like an ISBN)
➣ The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
➣ By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
➢ DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
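A minimal sketch of how a DOI name is used in practice: the name splits at the first slash into a registrant prefix and a suffix, and the whole name resolves through the central proxy at https://doi.org/ (formerly http://dx.doi.org/). The helper function names below are my own, not part of any standard library.

```python
# Illustrative DOI helpers (no network access).
# A DOI name is "<prefix>/<suffix>"; the prefix (e.g. 10.1007)
# identifies the registrant that assigned the name.

def split_doi(doi: str) -> tuple:
    """Split a DOI name into (registrant prefix, suffix)."""
    prefix, _, suffix = doi.partition("/")
    return prefix, suffix

def doi_url(doi: str) -> str:
    """Build the central resolver URL for a DOI name."""
    return "https://doi.org/" + doi

print(split_doi("10.1007/s10579-008-9062-z"))  # ('10.1007', 's10579-008-9062-z')
print(doi_url("10.1007/s10579-008-9062-z"))    # https://doi.org/10.1007/s10579-008-9062-z
```

Because only the metadata behind the name changes, the resolver URL stays valid even when the publisher moves the document.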
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
➣ Speech (Language itself)
➣ Writing (invented 3–5 times)
➣ Printing (made writing common)
➣ Digital Text (made writing transferable)
➣ Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
➣ The internet is useful as a tool
➢ Passively, for information access
➢ Actively, for collaboration
➣ The internet is interesting in and of itself
➢ As a source of data about existing language
➢ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
➣ Wikipedia
➣ Crystal, D. (2001) Language and the Internet. Cambridge University Press
➣ Sproat, R. (2010) Language, Technology, and Society. Oxford University Press
➣ Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
➣ Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
➣ Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
➣ HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
➣ HG3051 Corpus Linguistics — (Pre-req: HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Language and Technology
atilde The two things that separate humans from animals (Sproat 2010 sect11)
acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)
acirc Technologylowast Widespread tool uselowast Widespread tool manufacture
atilde Speech mdash the start of language
atilde Writing mdash the first great intersection
Review and Conclusions 7
Language and the Internet
atilde New forms of communication
acirc Neither speech nor textacirc Massively interactive
atilde Extremely rapid change
atilde A first hand narrative (I was online before the internet -)
acirc but I am probably behind you all now
Review and Conclusions 8
New forms
atilde Email (from PC phone other)
atilde Chat Usenet
atilde Virtual Worlds
atilde WWW
atilde Blogs (overlap)
atilde Facebook LinkedIn
atilde Wikis
atilde Twitter
Review and Conclusions 9
Representing Languageatilde Writing Systems
atilde Encodings
atilde Speech
atilde Bandwidth
Review and Conclusions 10
What is represented
atilde Phonemes ma d laks aeligvkadoz (45)
acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes
atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)
atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)
atilde Words my dog likes avocados (200000+)
atilde Concepts speaker poss canine fond avacado+PL (400000+)
Review and Conclusions 11
Three Major Writing Systems
atilde Alphabetic (Latin)
acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)
atilde Syllabic (Hiragana)
acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)
atilde Logographic (Hanzi)
acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)
Review and Conclusions 12
Computational Encoding
atilde Need to map characters to bits
atilde More characters require more space
acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)
atilde Moving towards unicode for everything
atilde If you get the encoding wrong it is gibberish
acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns
lowast But an encoding can be shared by many languages
Review and Conclusions 13
Speech and Language Technology

• The need for speech representation
• Storing sound
• Transforming Speech
  ◦ Automatic Speech Recognition (ASR): sounds to text
  ◦ Text-to-Speech Synthesis (TTS): text to sounds
• Speech technology — the Telephone
How good are the systems?

Task                     Vocab    WER (%)   WER (%) adapted
Digits                      11      0.4        0.2
Dialogue (travel)        21000     10.9        —
Dictation (WSJ)           5000      3.9        3.0
Dictation (WSJ)          20000     10.0        8.6
Dialogue (noisy, army)    3000     42.2       31.0
Phone Conversations       4000     41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)
Why is it so difficult?

• Pronunciation depends on context
  ◦ The same word will be pronounced differently in different sentences
• Speaker variability
  ◦ Gender
  ◦ Dialect / Foreign Accent
  ◦ Individual Differences: physical differences, language differences (idiolect)
• Many, many rare events
  ◦ 300 out of 2000 diphones in the core set for the AT&T NextGen system occur
    only once in a 2-hour speech database
Two steps in a TTS system

1. Linguistic Analysis
   • Sentence Segmentation
   • Abbreviations: Dr Smith lives on Nanyang Dr He is …
   • Word Segmentation
     ◦ 森前日銀総裁 Moriyama zen Nichigin Sousai
     ⊗ 森前日銀総裁 Moriyama zennichi gin Sousai
2. Speech Synthesis
   • Find the pronunciation
   • Generate sounds
   • Add intonation
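The "Dr" example above can be sketched in Python (with a small, hypothetical abbreviation list): a naive splitter breaks after every period, while an abbreviation list merges too much, because the same token is a title in one place and sentence-final "Drive" in the other — exactly the ambiguity linguistic analysis must resolve:

```python
import re

ABBREV = {"Dr", "Mr", "Mrs", "Prof"}   # toy list for illustration

def naive_split(text):
    # break after every '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

def abbrev_split(text):
    # merge a chunk back when the previous chunk ends in a known abbreviation
    out = []
    for chunk in naive_split(text):
        if out and out[-1].split()[-1].rstrip(".") in ABBREV:
            out[-1] += " " + chunk
        else:
            out.append(chunk)
    return out

text = "Dr. Smith lives on Nanyang Dr. He is a linguist."
print(naive_split(text))    # over-splits after the title "Dr."
print(abbrev_split(text))   # under-splits: "Nanyang Dr." also matches the list
```

Neither heuristic is right; disambiguating requires context, which is why sentence segmentation is a genuine analysis step.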
Speech Synthesis

• Articulatory Synthesis: attempt to model human articulation
• Formant Synthesis: bypass modeling of human articulation and model acoustics
  directly
• Concatenative Synthesis: synthesize from stored units of actual speech

Prosody of Emotion

• Excitement: fast, very high pitch, loud
• Hot anger: fast, high pitch, strong falling accent, loud
• Fear: jitter
• Sarcasm: prolonged accent, late peak
• Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality"
but natural-sounding prosody (intonation and duration).
— Richard Sproat
Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading   300    200 (proofreading)
Writing    31     21 (composing)
Speaking  150
Hearing   150    210 (speeded up)
Typing     33     19 (composing)

• Reading >> Speaking, Hearing >> Typing
⇒ Speech for input
⇒ Text for output
The Telephone

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

New Modalities
Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text
• Usage norms not fixed
• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …
• Large-scale discourse studies still to be done
• Some genuinely new things:
  ◦ time-lagged multi-person conversation
  ◦ raw, un-edited text
  ◦ extreme multi-authorship

Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Usenet (asynchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Chat (synchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Blogs

Speech-like                      Text-like
time-bound                       space-bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Netspeak

• Inspired by shared background
  ◦ <rant>I can't stand this</rant>
  ◦ I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  ◦ lusers (users, as seen by Systems Administrators)
  ◦ suits
• Need to be in-group
  cow orker: coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny
  clues if you smeared your body with clue musk and did the clue mating dance."
  — Edward Flaherty (talk.bizarre)
• Gricean Principles
  ◦ Posting (top/middle/bottom, quoting and trimming)
• Inspired by medium limitations
  ◦ GREAT, *great*, _great_
  ◦ :-) (^_^) (o)
  ◦ brb, RTFM, IMHO
  ◦ lower case
  ◦ lack of punctuation (hard on phone keyboards)
Collaboration and Wikis

• Version Control Systems
• Wikipedia
• Licensing and Ownership
Version Control Systems

• Versioning file systems
  ◦ every time a file is saved, a new copy is stored
• CVS, Subversion, Git
  ◦ changes to a collection of files are tracked
  ◦ simultaneous changes are merged
• Revision Tracking
  ◦ Revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every
  single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge — people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
   • Content policies: NPOV, Verifiability and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules

Wikipedia:Five pillars
Licenses and Ownership

• Copyright
• Copyleft
• Creative Commons
What is a good article?

1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria
The World Wide Web and HTML
The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging)
• The structure of Markup: Visual vs Logical
  WYSIWYG, WYSIAYG, WYSIWIM
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web:
  un-edited, large volume, editable, multi-media
Map of online communities

http://xkcd.com/802
The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
  or accessed only through a form
Unlinked content: pages which are not linked to by other pages (but clicking links
  them)
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g.
  ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way
  (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript,
  as well as content dynamically downloaded from Web servers via Flash or Ajax
  solutions
Non-HTML/text content: textual content encoded in multimedia (image or video)
  files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.
Visual Markup vs Logical Markup

• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like
• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)
The Web as Corpus
Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like
    to have
• Web Sample: use a search engine to download data from the net and build a corpus
  from it
  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions over the included text types
  ◦ (limited) control over the content
  ⊗ sparser data
Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies
  (Keller 2003), so search engines can help — but:
  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers
Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the
  early 2000s)
  ◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB for "Ghits per billion documents" is good if you want a more stable number
  (suggested by Mark Liberman) — but Google no longer releases the index size
• So always say when you counted, and try to use ratios
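The ratio in the example above is just the two raw counts divided, which is why it stays meaningful even as the index grows:

```python
# The slide's example: "different from" vs "different than" ghit counts.
# Raw counts are unstable; the unitless ratio is the comparable figure.
different_from, different_than = 251_000_000, 71_500_000
ratio = different_from / different_than
print(round(ratio, 1))   # 3.5, i.e. roughly 3.5:1
```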
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora
  (Kilgarriff 2006)
  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with available tools (here
    taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short or too long,
    word lists)
  ◦ do this for several languages and make the corpora available
Building Internet Corpora: Outline

1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding, cleanup, tagging and parsing)
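Step 2 above can be sketched with standard-library tools; a toy seed list stands in for Sharoff's ~500 seeds, and a real pipeline would then send each query to a search-engine API:

```python
from itertools import combinations

# Combine seed words into multi-word queries (here 4-word queries).
seeds = ["language", "internet", "corpus", "data", "web", "text"]
queries = [" ".join(combo) for combo in combinations(seeds, 4)]

print(len(queries))    # C(6,4) = 15 distinct 4-word queries
print(queries[0])      # language internet corpus data
```

With 500 seeds the number of possible combinations is astronomical, so real pipelines sample a few thousand queries rather than enumerating them all.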
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working papers on the Web as
Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html
Internet Corpora Summary

• The web can be used as a corpus
  ◦ Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts
  ◦ Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
  ⊗ Less balanced, less markup
Text and Meta-text

• Explicit Meta-data
  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup
• Implicit Meta-data
  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations
Explicit Metadata

• You can get information from metadata within documents
  ◦ When they are accurate, they are very good
  ◦ They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
Implicit Metadata

• You can get clues from metadata within documents
  ◦ as they are non-intended, they tend to be noisy
  ◦ but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations — bracketed glosses, cross-lingual disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Language Identification and Normalization
Language Identification

• Need to identify the encoding and language of a page in order to process it
  (meta-data may be unreliable)
  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine Learning based methods
  ◦ Context (under .jp or .kr)
• Hard to do for short text snippets, similar languages, mixed text
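The character n-gram ranking approach above can be sketched in a few lines, in the style of Cavnar & Trenkle's out-of-place measure; the two training sentences are made-up toy samples, so this is an illustration, not a usable identifier:

```python
from collections import Counter

def profile(text, n=3, top=100):
    # ranked character n-gram profile: most frequent n-grams first
    padded = f" {text.lower()} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return [g for g, _ in counts.most_common(top)]

def out_of_place(lang_profile, doc_profile):
    # sum of rank differences; missing n-grams get the maximum penalty
    return sum(abs(r - lang_profile.index(g)) if g in lang_profile
               else len(lang_profile)
               for r, g in enumerate(doc_profile))

training = {  # tiny toy samples; real profiles need far more text
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {lang: profile(t) for lang, t in training.items()}

def identify(snippet):
    doc = profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

print(identify("the dog and the fox"))
```

Even this toy version shows why short snippets are hard: a three-word snippet yields only a handful of n-grams to rank.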
Normalization

• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon cel
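The stemming example above can be reproduced with a toy suffix-stripper (the real Porter stemmer has many more rules and conditions; this is only a sketch of the idea):

```python
def crude_stem(word):
    # strip the first matching suffix, but keep at least 4 characters
    for suffix in ("es", "er", "ing", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:len(word) - len(suffix)]
    return word

print(crude_stem("produces"))   # produc
print(crude_stem("producer"))   # produc
print(crude_stem("produce"))    # produc
```

Note that the output "produc" is not a word: stemming conflates related forms into a shared index key, whereas lemmatization maps to the dictionary form "produce".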
The Semantic Web
Goals of the Semantic Web

• Web of data
  ◦ provides a common data representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6/color>
Semantic Web Architecture
XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically
  ◦ is a markup language, much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
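Well-formedness checking is exactly what an XML parser does: it accepts a document if and only if every tag is properly closed and nested. A minimal check with Python's standard library:

```python
import xml.etree.ElementTree as ET

def well_formed(doc: str) -> bool:
    # a parser accepts a document iff it is syntactically well-formed
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))   # True
print(well_formed("<course><code>HG2052</course>"))          # False
```

Validation against a schema (does the document use the *right* tags?) is a separate, stronger check than well-formedness.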
Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)
  ◦ Uniform Resource Name: urn:isbn:1575864606
  ◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ You can say anything about anything
• You can build relations in ontologies (OWL)
  ◦ Then reason over them, search them, …
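The triple model above can be sketched as a tiny in-memory store; the "ex:" URIs are made-up examples, and None plays the role of a variable in a SPARQL-style query:

```python
# Facts as (subject, predicate, object) triples of example URIs.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:topic", "ex:LanguageTechnology"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    # None acts as a wildcard, matching any value in that position
    return {t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])}

print(query(s="ex:HG2052"))                  # everything about HG2052
print(query(p="ex:worksAt", o="ex:NTU"))     # who works at NTU?
```

Real triple stores add inference (e.g. if taughtBy is declared the inverse of teaches, the reasoner derives the missing triple), which is where OWL ontologies come in.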
Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible — know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm
Semantic Web and NLP

• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
Citation, Reputation and PageRank
Citation Networks

• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ◦ Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of
  the published work of a scientist, scholar or research group
• Some scores are:
  ◦ Total Number of Citations (pretty useful)
  ◦ Total Number of Citations minus Self-citations
  ◦ Total Number of (Citations / Number of Authors)
  ◦ Average (Citation Impact Factor / Number of Authors)
• Problems:
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight Citations by Quality of the paper
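The first three scores above are simple sums; a toy illustration with made-up numbers (each paper is a tuple of citations, self-citations, and number of authors):

```python
# Hypothetical papers: (citations, self_citations, n_authors).
papers = [(120, 10, 3), (45, 5, 1), (200, 40, 8)]

total = sum(c for c, _, _ in papers)            # raw citation count
minus_self = sum(c - s for c, s, _ in papers)   # discount self-citations
per_author = sum(c / a for c, _, a in papers)   # share credit among authors

print(total)        # 365
print(minus_self)   # 310
print(per_author)   # 110.0
```

Note how the per-author score reverses the ranking: the 8-author paper with 200 citations contributes less (25) than the single-author paper with 45.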
Gaming Citations

• Least/Minimum Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high Impact Factor journals

You improve what gets measured — not necessarily what you want to improve.
Characteristics of the Web: Bow Tie

SCC: Strongly Connected Component — can travel from any page to any page
Anchor Text

• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:
  1. The hyperlink from A to B represents an endorsement of page B by the creator
     of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a
     pointer from every page to a page containing a copyright notice.)
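Extracting (link target, anchor text) pairs — the raw material for intuition 2 — takes only the standard-library HTML parser; the page snippet and URL below are made up for illustration:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self._href, self.pairs = None, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
    def handle_data(self, data):
        if self._href is not None:       # only text inside an <a>...</a>
            self.pairs.append((self._href, data.strip()))
    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

p = AnchorCollector()
p.feed('See the <a href="http://example.com/hg2052">HG2052 Course Page</a>.')
print(p.pairs)   # [('http://example.com/hg2052', 'HG2052 Course Page')]
```

A search engine aggregates these pairs across the whole web, so a page is described by what *other* pages call it.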
PageRank as Citation Analysis

• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote — not very accurate
• On the web, citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
PageRank as Random Walk

• Imagine a web surfer doing a random walk on the web
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page,
    equiprobably
• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with
  a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with
    probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
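The random walk with teleportation can be computed by power iteration; a sketch on a made-up four-page graph, using the 10% teleportation rate and the dead-end rule described above:

```python
# Toy web graph as adjacency lists; D is a dead end (no outlinks).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
pages, t = list(links), 0.10          # t = teleportation rate
N = len(pages)

rank = {p: 1.0 / N for p in pages}    # start uniform
for _ in range(100):                  # iterate towards the steady state
    new = dict.fromkeys(pages, 0.0)
    for p in pages:
        if not links[p]:              # dead end: jump anywhere, 1/N each
            for q in pages:
                new[q] += rank[p] / N
        else:
            for q in pages:           # teleport with probability t
                new[q] += t * rank[p] / N
            for q in links[p]:        # else follow a random outlink
                new[q] += (1 - t) * rank[p] / len(links[p])
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

The ranks always sum to 1 (they are a probability distribution), and C, which is linked from both A and B, ends up above the dead end D, which receives only teleport mass.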
Example Graph

Each inbound link is a positive vote.

http://www.ams.org/samplings/feature-column/fcarc-pagerank
Example Graph: Weighted

Pages with higher PageRanks are lighter.
Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam
  takes advantage of link-based ranking algorithms, which give websites higher
  rankings the more other highly ranked websites link to them. Examples include
  adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other,
  also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content
  and create "content" for a website. The specific presentation of content on these
  sites is unique, but is merely an amalgamation of content taken from other
  sources, often without permission.
• Comment spam is a form of link spam in web pages that allow dynamic user editing,
  such as wikis, blogs and guestbooks. Agents can be written that automatically
  randomly select a user-edited web page, such as a Wikipedia article, and add
  spamming links.

  The nofollow link: a value that can be assigned to the rel attribute of an HTML
  hyperlink to instruct some search engines that the hyperlink should not influence
  the link target's ranking in the search engine's index
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …
Current Status

• There is a continuous battle between:
  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read
• All metrics get gamed
Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like ISBN)
• The DOI system is implemented through a federation of registration agencies
  coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000
  organizations
  ◦ DOI: 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u/
Conclusions
Revolutions in Language Technology

• Speech (Language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
The Internet

• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Recommended Texts

• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An
  Introduction to Natural Language Processing, Computational Linguistics and Speech
  Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language
  Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information
  Retrieval. Cambridge University Press
Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python;
  introduces both programming and linguistics
• HG3051 Corpus Linguistics — (Pre-req: HG2051, waived) This course is an
  introduction to the fast-growing field of corpus linguistics
Language and the Internet
atilde New forms of communication
acirc Neither speech nor textacirc Massively interactive
atilde Extremely rapid change
atilde A first hand narrative (I was online before the internet -)
acirc but I am probably behind you all now
Review and Conclusions 8
New forms
atilde Email (from PC phone other)
atilde Chat Usenet
atilde Virtual Worlds
atilde WWW
atilde Blogs (overlap)
atilde Facebook LinkedIn
atilde Wikis
atilde Twitter
Review and Conclusions 9
Representing Languageatilde Writing Systems
atilde Encodings
atilde Speech
atilde Bandwidth
Review and Conclusions 10
What is represented
atilde Phonemes ma d laks aeligvkadoz (45)
acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes
atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)
atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)
atilde Words my dog likes avocados (200000+)
atilde Concepts speaker poss canine fond avacado+PL (400000+)
Review and Conclusions 11
Three Major Writing Systems
atilde Alphabetic (Latin)
acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)
atilde Syllabic (Hiragana)
acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)
atilde Logographic (Hanzi)
acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)
Review and Conclusions 12
Computational Encoding
atilde Need to map characters to bits
atilde More characters require more space
acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)
atilde Moving towards unicode for everything
atilde If you get the encoding wrong it is gibberish
acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns
lowast But an encoding can be shared by many languages
Review and Conclusions 13
Speech and Language Technologyatilde The need for speech representation
atilde Storing sound
atilde Transforming Speech
acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds
atilde Speech technology mdash the Telephone
Review and Conclusions 14
How good are the systems
Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310
Results of various DARPA competitions (from Richard Sproatrsquos slides)
Review and Conclusions 15
Why is it so difficult
atilde Pronunciation depends on context
acirc The same word will be pronounced differently in different sentences
atilde Speaker variability
acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)
atilde Many many rare events
acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1 Linguistic Analysis
atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation
acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai
2 Speech Synthesis
atilde Find the pronunciationatilde Generate soundsatilde Add intonation
Review and Conclusions 17
Speech Synthesis
atilde Articulatory Synthesis Attempt to model human articulation
atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly
atilde Concatenative Synthesis Synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
atilde Excitement Fast very high pitch loud
atilde Hot anger Fast high pitch strong falling accent loud
atilde Fear Jitter
atilde Sarcasm Prolonged accent late peak
atilde Sad Slow low pitch
The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)(English computer science students various studies)
Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)
atilde Reading gtgt SpeakingHearing gtgt Typing
rArr Speech for inputrArr Text for output
Review and Conclusions 20
The Telephone
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
atilde All share some characteristics of speech and text
atilde Usage norms not fixed
atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip
atilde Large scale discourse studies still to be done
atilde Some genuinely new things
acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship
Review and Conclusions 23
Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
atilde Inspired by shared background
acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits
atilde Need to be in-group
cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)
Review and Conclusions 28
atilde Gricean Principals
acirc Posting (topmiddlebottom quoting and trimming)
atilde Inspired by medium limitations
acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikisatilde Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)
atilde Wikipedia makes it easy to share your knowledgepeople like to do this
atilde Most edits are done by insiders
atilde Most content is added by outsiders
atilde Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
   • Content policies: NPOV, Verifiability, and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules
Wikipedia:Five pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWYM
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages (no inbound links)
Private Web: sites that require registration and login (e.g. edveNTUre)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like
• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: Search Engine as Query tool and WWW as corpus (Objection: results are not reliable)
  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow one to draw conclusions on the data
  ⊗ Google is missing functionalities that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions over the included text types
  ◦ (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  ◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
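The ratio arithmetic above is simple enough to sketch; the hit counts are the slide's 2012 figures, and the index size passed to `gpb` is a purely illustrative assumption, since Google no longer publishes one:

```python
def ratio(hits_a, hits_b):
    """Unitless comparison of two web hit counts (ghits or whits)."""
    return hits_a / hits_b

def gpb(hits, index_size):
    """Ghits per billion documents; index_size is an assumed figure."""
    return hits / index_size * 1_000_000_000

# different from vs different than (figures from the slide, accessed 2012-04-04)
print(f"{ratio(251_000_000, 71_500_000):.1f}:1")   # 3.5:1
print(gpb(251_000_000, 40_000_000_000))            # per-billion rate, under an assumed 40e9-page index
```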
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  ◦ do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding, cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html 47
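Steps 1–2, and the URL bookkeeping of step 3, can be sketched as pure functions (the actual search-engine querying and downloading are left out, since they need network access and an API; all names here are illustrative):

```python
import random

def make_queries(seed_words, words_per_query=4, n_queries=12):
    """Step 2: combine random seed words into search-engine queries."""
    rng = random.Random(0)  # fixed seed for reproducibility
    return [" ".join(rng.sample(seed_words, words_per_query))
            for _ in range(n_queries)]

def dedupe_urls(result_lists):
    """Step 3 bookkeeping: merge result lists, dropping duplicate URLs."""
    seen, urls = set(), []
    for results in result_lists:
        for url in results:
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

seeds = ["house", "water", "friend", "music", "road", "letter", "garden", "night"]
queries = make_queries(seeds)
print(len(queries))                                               # 12
print(dedupe_urls([["a.html", "b.html"], ["b.html", "c.html"]]))  # ['a.html', 'b.html', 'c.html']
```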
Internet Corpora Summary
• The web can be used as a corpus
  ◦ Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts
  ◦ Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
  ⊗ Less balanced, less markup
48
Text and Meta-text
• Explicit Meta-data
  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup
• Implicit Meta-data
  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  ◦ When they are accurate, they are very good
  ◦ They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  ◦ as they are non-intended, they tend to be noisy
  ◦ but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations — Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine Learning based methods
  ◦ Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
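A minimal sketch of the similarity-based approach, in the style of Cavnar and Trenkle's rank-order n-gram profiles (the reference texts here are tiny toy samples, far smaller than a real training set):

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, ref):
    """Distance: sum of rank displacements; missing n-grams pay a maximum penalty."""
    return sum(abs(i - ref.index(g)) if g in ref else len(ref)
               for i, g in enumerate(profile))

def identify(text, references):
    """Pick the reference language whose profile is nearest to the text's."""
    profile = ngram_profile(text)
    return min(references, key=lambda lang: out_of_place(profile, references[lang]))

refs = {
    "en": ngram_profile("this is the english text and the point of the text "
                        "is that the most common trigrams are english"),
    "de": ngram_profile("dies ist der deutsche text und der zweck des textes "
                        "ist dass die haeufigsten trigramme deutsch sind"),
}
print(identify("where is the nearest station", refs))  # en
```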
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon cel
Review and Conclusions 54
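Number normalization of the kind shown above can be sketched with regular expressions; the canonical form "$700,000" and the two patterns covered are assumptions for illustration:

```python
import re

def normalize_dollars(text):
    """Toy normalizer: map '$700K' and '0.7 million dollars' to one canonical form."""
    text = re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: "${:,.0f}".format(float(m.group(1)) * 1_000), text)
    text = re.sub(r"(\d+(?:\.\d+)?) million dollars",
                  lambda m: "${:,.0f}".format(float(m.group(1)) * 1_000_000), text)
    return text

print(normalize_dollars("raised $700K"))                # raised $700,000
print(normalize_dollars("raised 0.7 million dollars"))  # raised $700,000
```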
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  ◦ provides a common data representation framework
  ◦ makes possible integrating multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  ◦ is a markup language much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
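Well-formedness checking is easy to demonstrate with Python's standard XML parser (a sketch; real validation against a DTD or schema needs more machinery):

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<note><to>Tove</to></note>"))  # True
print(well_formed("<note><to>Tove</note>"))       # False (mismatched tag)
```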
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
  ◦ Universal Resource Name: urn:isbn:1575864606
  ◦ Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDFs are written in well defined XML
  ◦ You can say anything about anything
• You can build relations in ontologies (OWL)
  ◦ Then reason over them, search them, …
Review and Conclusions 59
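The triple model can be illustrated with plain tuples and a wildcard query, in the spirit of a SPARQL triple pattern; the `ex:` identifiers are shortened, hypothetical URIs:

```python
# A minimal sketch of RDF-style triples as Python tuples.
triples = {
    ("ex:bond",   "ex:teaches",  "ex:HG2052"),
    ("ex:HG2052", "ex:title",    "Language Technology and the Internet"),
    ("ex:HG2052", "ex:taughtAt", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Pattern match with None as a wildcard, like a basic triple pattern."""
    return [(s2, p2, o2) for (s2, p2, o2) in triples
            if s in (None, s2) and p in (None, p2) and o in (None, o2)]

print(query(s="ex:HG2052", p="ex:title"))  # the course title triple
```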
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ◦ Context-based: Citation Analysis
    ∗ See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  ◦ Total Number of Citations (Pretty Useful)
  ◦ Total Number of Citations minus Self-citations
  ◦ Total Number of (Citations / Number of Authors)
  ◦ Average (Citation Impact Factor / Number of Authors)
• Problems
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight Citations by Quality of the paper
Review and Conclusions 64
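The simple scores listed above can be computed directly; the paper records and their field names are hypothetical:

```python
def citation_scores(papers):
    """Three of the simple metrics above: total citations, total minus
    self-citations, and total citations divided by author count."""
    total = sum(p["citations"] for p in papers)
    minus_self = total - sum(p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return total, minus_self, per_author

papers = [{"citations": 40, "self_citations": 4, "n_authors": 2},
          {"citations": 10, "self_citations": 1, "n_authors": 5}]
print(citation_scores(papers))  # (50, 45, 22.0)
```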
Gaming Citations
• Least/Minimum Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to High Impact factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC (Strongly Connected Core): you can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>
For more information about Language Technology and the Internet see the <a href="…">HG803 Course Page</a>
• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
Review and Conclusions 67
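Extracting (href, anchor text) pairs, the raw material for this kind of link analysis, can be sketched with Python's standard HTML parser (the example URL is made up):

```python
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:       # only collect text inside an <a>
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorText()
p.feed('See the <a href="http://example.com/hg803">HG803 Course Page</a> for details.')
print(p.links)  # [('http://example.com/hg803', 'HG803 Course Page')]
```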
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote (not very accurate)
• On the web, citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
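A power-iteration sketch of PageRank with exactly the teleportation scheme described above (10% teleport, dead ends jumping uniformly); the four-page graph is a toy example:

```python
def pagerank(links, n_iter=100, teleport=0.10):
    """Compute PageRank by power iteration.
    links: dict mapping page -> list of outbound pages."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(n_iter):
        new = {p: 0.0 for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if not outs:                       # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                # teleport with probability 10%
                    new[q] += teleport * rank[p] / n
                for q in outs:                 # else follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(outs)
        rank = new
    return rank

r = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []})
print({p: round(v, 3) for p, v in r.items()})  # the ranks sum to 1
```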
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …
74
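A crawler that respects nofollow might filter links like this (a sketch with Python's standard HTML parser; real engines differ in how they treat such links, as noted above):

```python
from html.parser import HTMLParser

class FollowFilter(HTMLParser):
    """Keep only hrefs whose rel attribute does not contain 'nofollow'."""
    def __init__(self):
        super().__init__()
        self.followed = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            a = dict(attrs)
            if "nofollow" not in (a.get("rel") or "").split():
                self.followed.append(a.get("href"))

f = FollowFilter()
f.feed('<a href="http://a.example/">ok</a> '
       '<a rel="nofollow" href="http://spam.example/">spam</a>')
print(f.followed)  # ['http://a.example/']
```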
Current Status
• There is a continuous battle between
  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (Pre-req HG2051, waived); an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
New forms
• Email (from PC, phone, other)
• Chat, Usenet
• Virtual Worlds
• WWW
• Blogs (overlap)
• Facebook, LinkedIn
• Wikis
• Twitter
Review and Conclusions 9
Representing Language
• Writing Systems
• Encodings
• Speech
• Bandwidth
Review and Conclusions 10
What is represented
• Phonemes: maɪ dɔɡ laɪks ævəkɑdoz (45)
  ◦ Not a simple correspondence between a writing system and the sounds
  ◦ Some logographs (3, $)
  ◦ Even when a new alphabet is designed, pronunciation changes
• Syllables: (maɪ) (dɔɡ) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)
• Morphemes: maɪ (me + 's), dɔɡ, laɪk + s, ævəkɑdo + z (100,000+)
• Words: my dog likes avocados (200,000+)
• Concepts: speaker poss canine fond avocado+PL (400,000+)
Review and Conclusions 11
Three Major Writing Systems
• Alphabetic (Latin)
  ◦ one symbol for a consonant or vowel
  ◦ Typically 20-30 base symbols (1 byte)
• Syllabic (Hiragana)
  ◦ one symbol for each syllable (consonant+vowel)
  ◦ Typically 50-100 base symbols (1-2 bytes)
• Logographic (Hanzi)
  ◦ pictographs, ideographs, sound-meaning combinations
  ◦ Typically 100,000+ symbols (2-3 bytes)
Review and Conclusions 12
Computational Encoding
• Need to map characters to bits
• More characters require more space
  ◦ English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits); Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)
• Moving towards unicode for everything
• If you get the encoding wrong, it is gibberish
  ◦ Web pages should state their encoding
  ◦ Sometimes they are wrong
  ◦ Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
Review and Conclusions 13
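The byte-count differences and the gibberish problem can both be seen directly in Python (euc-jp and latin-1 are chosen just for illustration):

```python
text = "日本語"  # three characters of Japanese

# The same three characters take different numbers of bytes in each encoding.
print(len(text.encode("utf-8")))    # 9 bytes (3 per character)
print(len(text.encode("euc-jp")))   # 6 bytes (2 per character)

# Decoding with the wrong encoding produces gibberish (mojibake):
print(text.encode("euc-jp").decode("latin-1"))
```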
Speech and Language Technology
• The need for speech representation
• Storing sound
• Transforming Speech
  ◦ Automatic Speech Recognition (ASR): sounds to text
  ◦ Text-to-Speech Synthesis (TTS): text to sounds
• Speech technology — the Telephone
Review and Conclusions 14
How good are the systems?

Task                    Vocab    WER (%)   WER (%) adapted
Digits                     11      0.4       0.2
Dialogue (travel)      21,000     10.9       —
Dictation (WSJ)         5,000      3.9       3.0
Dictation (WSJ)        20,000     10.0       8.6
Dialogue (noisy, army)  3,000     42.2      31.0
Phone Conversations     4,000     41.9      31.0
Results of various DARPA competitions (from Richard Sproatrsquos slides)
Review and Conclusions 15
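WER itself is just word-level edit distance divided by reference length; a standard sketch, not tied to any particular evaluation toolkit:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference
    length, computed via edit distance over words."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(r)][len(h)] / len(r)

print(wer("my dog likes avocados", "my dog lacks avocados"))  # 0.25
```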
Why is it so difficult?
• Pronunciation depends on context
  ◦ The same word will be pronounced differently in different sentences
• Speaker variability
  ◦ Gender
  ◦ Dialect / Foreign Accent
  ◦ Individual Differences: physical differences, language differences (idiolect)
• Many, many rare events
  ◦ 300 out of 2000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1. Linguistic Analysis
   • Sentence Segmentation
   • Abbreviations: Dr Smith lives on Nanyang Dr. He is …
   • Word Segmentation
     ◦ 森山前日銀総裁 Moriyama zen Nichigin Sousai
     ⊗ 森山前日銀総裁 Moriyama zennichi gin Sousai
2. Speech Synthesis
   • Find the pronunciation
   • Generate sounds
   • Add intonation
Review and Conclusions 17
Speech Synthesis
• Articulatory Synthesis: attempt to model human articulation
• Formant Synthesis: bypass modeling of human articulation and model acoustics directly
• Concatenative Synthesis: synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
• Excitement: fast, very high pitch, loud
• Hot anger: fast, high pitch, strong falling accent, loud
• Fear: jitter
• Sarcasm: prolonged accent, late peak
• Sad: slow, low pitch
The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300   200 (proof reading)
Writing     31    21 (composing)
Speaking   150
Hearing    150   210 (speeded up)
Typing      33    19 (composing)

• Reading >> Speaking; Hearing >> Typing
⇒ Speech for input
⇒ Text for output
Review and Conclusions 20
The Telephone
Speech-like                Text-like
time bound                 space bound
spontaneous                contrived
face-to-face               visually decontextualized
loosely structured         elaborately structured
socially interactive       factually communicative
immediately revisable      repeatedly revisable
prosodically rich          graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
• All share some characteristics of speech and text
• Usage norms not fixed
• Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …
• Large scale discourse studies still to be done
• Some genuinely new things
  ◦ time-lagged multi-person conversation
  ◦ raw un-edited text
  ◦ extreme multi-authorship
Review and Conclusions 23
Email

Speech-like                Text-like
time bound                 space bound (deletable)
spontaneous                contrived
face-to-face               visually decontextualized
loosely structured         elaborately structured
socially interactive       factually communicative
immediately revisable      repeatedly revisable
prosodically rich          graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like                Text-like
time bound                 space bound
spontaneous                contrived
face-to-face               visually decontextualized
loosely structured         elaborately structured
socially interactive       factually communicative
immediately revisable      repeatedly revisable
prosodically rich          graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like                Text-like
time bound                 space bound
spontaneous                contrived
face-to-face               visually decontextualized
loosely structured         elaborately structured
socially interactive       factually communicative
immediately revisable      repeatedly revisable
prosodically rich          graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
atilde Inspired by shared background
acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits
atilde Need to be in-group
cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)
Review and Conclusions 28
atilde Gricean Principals
acirc Posting (topmiddlebottom quoting and trimming)
atilde Inspired by medium limitations
acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikisatilde Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)
atilde Wikipedia makes it easy to share your knowledgepeople like to do this
atilde Most edits are done by insiders
atilde Most content is added by outsiders
atilde Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
atilde Content policies NPOV Verifiability and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
WikipediaFivepillars 33
Licenses and Ownership
atilde Copyright
atilde Copyleft
atilde Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
WikipediaGood_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine Learning based methods
  – Context (under .jp or .kr domains)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
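The character n-gram ranking approach can be sketched in a few lines of Python; the training snippets and the profile size here are toy values, purely illustrative of the rank-comparison idea:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Ranked character n-gram profile for a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, other):
    """Sum of rank differences; unseen n-grams get a maximum penalty."""
    ranks = {g: r for r, g in enumerate(other)}
    return sum(abs(r - ranks.get(g, len(other))) for r, g in enumerate(profile))

# Toy training data -- a real system would train on much larger samples.
training = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune fuchs springt über den faulen hund und schläft",
}
profiles = {lang: ngram_profile(text) for lang, text in training.items()}

def identify(snippet):
    """Pick the language whose profile is closest in rank order."""
    p = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(p, profiles[lang]))

print(identify("the dog jumps"))
```

Note how a short snippet already works against such tiny profiles, but, as the slide says, similar languages and mixed text need far more training data.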
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, ...
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, ...
• Stripping stop words (the, a, of, ...)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
Review and Conclusions 54
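Two of the steps above can be sketched directly; the number pattern and the crude suffix-stripping stemmer are illustrative only (this is not the Porter algorithm):

```python
import re

def normalize_numbers(text):
    """Expand $<n>K / $<n>M abbreviations to plain dollar amounts."""
    def expand(m):
        value = float(m.group(1)) * {"K": 1_000, "M": 1_000_000}[m.group(2)]
        return "${:,.0f}".format(value)
    return re.sub(r"\$(\d+(?:\.\d+)?)([KM])", expand, text)

def stem(word):
    """Crude suffix-stripping stemmer, just to show the idea."""
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(normalize_numbers("They paid $700K"))   # They paid $700,000
print(stem("produces"), stem("producer"))     # produc produc
```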
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
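Well-formedness checking needs no schema; Python's standard xml.etree parser is enough, as this minimal sketch shows:

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<note><to>Tove</to></note>"))   # properly nested: True
print(well_formed("<note><to>Tove</note></to>"))   # crossed tags: False
```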
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything
• You can build relations into ontologies (OWL)
  – Then reason over them, search them, ...
Review and Conclusions 59
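The triple model is easy to emulate: here triples are plain Python tuples with made-up ex: URIs (illustrative only), and a wildcard query plays the role of a graph-pattern match:

```python
# Triples of (subject, predicate, object); the ex: URIs are invented.
triples = [
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:topic", "ex:LanguageTechnology"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(query(s="ex:HG2052"))    # everything said about HG2052
print(query(p="ex:worksAt"))   # who works where
```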
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata:
1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible - know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  – Total Number of Citations (Pretty Useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)
• Problems:
  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones
⇒ Weight citations by the quality of the citing paper
Review and Conclusions 64
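The scores above are simple arithmetic over a citation database; a sketch with invented numbers shows how the different measures can rank the same two papers differently:

```python
# Toy citation records -- the counts are invented for illustration.
papers = {
    "paper A": {"citations": 120, "self": 20, "authors": 4},
    "paper B": {"citations": 60, "self": 2, "authors": 1},
}

def scores(name):
    """Compute three of the citation scores from the slide."""
    d = papers[name]
    return {
        "total": d["citations"],
        "minus_self": d["citations"] - d["self"],
        "per_author": d["citations"] / d["authors"],
    }

print(scores("paper A"))  # more citations in total ...
print(scores("paper B"))  # ... but more citations per author
```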
Gaming Citations
• Least/Minimum Publishable Unit
  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high Impact Factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie
SCC: Strongly Connected Component (core) - can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>
For more information about Language Technology and the
Internet see the <a href="...">HG803 Course Page</a>
• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
Review and Conclusions 67
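Collecting (link, anchor text) pairs is a small exercise with the standard html.parser module; the URL in the example is a placeholder:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:      # only collect text inside <a>...</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorCollector()
parser.feed('See the <a href="http://example.com/HG803">HG803 Course Page</a> for more.')
print(parser.links)  # [('http://example.com/HG803', 'HG803 Course Page')]
```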
PageRank as Citation Analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: each article gets one vote - not very accurate
• On the web: citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality ...
  – ... mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random Walk
• Imagine a web surfer doing a random walk on the web
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting - to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
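The random walk with teleportation can be computed by power iteration; a minimal sketch on a toy three-page graph, using the 10% teleportation rate from the slide:

```python
def pagerank(links, teleport=0.10, iters=50):
    """Power iteration with teleportation; dead ends jump uniformly."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                       # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:               # teleport share
                    new[q] += teleport * rank[p] / n
                share = (1 - teleport) * rank[p] / len(out)
                for q in out:                 # follow-a-link share
                    new[q] += share
        rank = new
    return rank

# Toy graph: A and B both link to C, C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # C collects the most votes
```

B gets almost nothing: nobody links to it, so only teleportation ever brings the surfer there.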
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission
73
• Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links
The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – ...
74
Current Status
• There is a continuous battle between:
  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital Object Identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
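A DOI splits into a registrant prefix and a suffix, and resolves through the central resolver at doi.org; a one-line sketch of each:

```python
def doi_parts(doi):
    """Split a DOI into its registrant prefix and suffix."""
    prefix, _, suffix = doi.partition("/")
    return prefix, suffix

def doi_url(doi):
    """DOIs resolve through the central resolver at doi.org."""
    return "https://doi.org/" + doi

print(doi_parts("10.1007/s10579-008-9062-z"))  # ('10.1007', 's10579-008-9062-z')
print(doi_url("10.1007/s10579-008-9062-z"))    # https://doi.org/10.1007/s10579-008-9062-z
```

The resolver looks up the current location in the metadata, so the DOI stays stable even when the URL changes.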
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – Passively, for information access
  – Actively, for collaboration
• The internet is interesting in and of itself
  – As a source of data about existing language
  – As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer - solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics - (Pre-req: HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Representing Language
• Writing Systems
• Encodings
• Speech
• Bandwidth
Review and Conclusions 10
What is represented?
• Phonemes: maɪ dɔg laɪks ævəkɑdoz (~45)
  – Not a simple correspondence between a writing system and the sounds
  – Some logographs (3, $)
  – Even when a new alphabet is designed, pronunciation changes
• Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)
• Morphemes: maɪ (= me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)
• Words: my dog likes avocados (200,000+)
• Concepts: speaker-poss canine fond avocado+PL (400,000+)
Review and Conclusions 11
Three Major Writing Systems
• Alphabetic (Latin)
  – one symbol for each consonant or vowel
  – Typically 20-30 base symbols (1 byte)
• Syllabic (Hiragana)
  – one symbol for each syllable (consonant+vowel)
  – Typically 50-100 base symbols (1-2 bytes)
• Logographic (Hanzi)
  – pictographs, ideographs, sound-meaning combinations
  – Typically 100,000+ symbols (2-3 bytes)
Review and Conclusions 12
Computational Encoding
• Need to map characters to bits
• More characters require more space
  – English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits); Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)
• Moving towards Unicode for everything
• If you get the encoding wrong, it is gibberish
  – Web pages should state their encoding
  – Sometimes they are wrong
  – Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
Review and Conclusions 13
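The gibberish effect is easy to reproduce: encode a string under one encoding and decode it under another:

```python
text = "café"

utf8 = text.encode("utf-8")       # 5 bytes: 'é' takes two bytes
latin1 = text.encode("latin-1")   # 4 bytes: one byte per character

# Decoding with the wrong encoding produces mojibake:
print(utf8.decode("latin-1"))                     # cafÃ©
print(latin1.decode("utf-8", errors="replace"))   # caf�
```

The same bytes are perfectly valid in both encodings, which is why detection has to fall back on statistics over byte patterns.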
Speech and Language Technology
• The need for speech representation
• Storing sound
• Transforming Speech
  – Automatic Speech Recognition (ASR): sounds to text
  – Text-to-Speech Synthesis (TTS): text to sounds
• Speech technology - the Telephone
Review and Conclusions 14
How good are the systems?

Task                    Vocab    WER (%)   WER (%) adapted
Digits                  11       0.4       0.2
Dialogue (travel)       21,000   10.9      -
Dictation (WSJ)         5,000    3.9       3.0
Dictation (WSJ)         20,000   10.0      8.6
Dialogue (noisy, army)  3,000    42.2      31.0
Phone Conversations     4,000    41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)
Review and Conclusions 15
Why is it so difficult?
• Pronunciation depends on context
  – The same word will be pronounced differently in different sentences
• Speaker variability
  – Gender
  – Dialect / Foreign Accent
  – Individual Differences: physical differences, language differences (idiolect)
• Many, many rare events
  – 300 out of 2000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1. Linguistic Analysis
• Sentence Segmentation
• Abbreviations: Dr. Smith lives on Nanyang Dr. He is ...
• Word Segmentation
  – 森山前日銀総裁 → Moriyama zen Nichigin sousai
  ⊗ 森山前日銀総裁 → Moriyama zennichi gin sousai
2. Speech Synthesis
• Find the pronunciation
• Generate sounds
• Add intonation
Review and Conclusions 17
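The abbreviation problem is easy to demonstrate: a naive splitter that breaks at every full stop followed by whitespace gets "Dr. Smith" wrong, while "Nanyang Dr." really does end a sentence (sentence adapted from the slide):

```python
import re

def naive_split(text):
    """Split at whitespace that follows a full stop -- too naive."""
    return re.split(r"(?<=\.)\s+", text)

text = "Dr. Smith lives on Nanyang Dr. He is a lecturer."
print(naive_split(text))
# ['Dr.', 'Smith lives on Nanyang Dr.', 'He is a lecturer.']
# The second boundary is right, the first is wrong: the same token 'Dr.'
# is an abbreviation in one place and sentence-final in the other, so no
# fixed abbreviation list can decide on its own.
```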
Speech Synthesis
• Articulatory Synthesis: attempt to model human articulation
• Formant Synthesis: bypass modeling of human articulation and model the acoustics directly
• Concatenative Synthesis: synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
• Excitement: fast, very high pitch, loud
• Hot anger: fast, high pitch, strong falling accent, loud
• Fear: jitter
• Sarcasm: prolonged accent, late peak
• Sad: slow, low pitch
The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading >> Speaking; Hearing >> Typing
⇒ Speech for input
⇒ Text for output
Review and Conclusions 20
The Telephone

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email, Usenet, Chat and Blogs
• All share some characteristics of speech and text
• Usage norms not fixed
• Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, ...
• Large-scale discourse studies still to be done
• Some genuinely new things:
  – time-lagged multi-person conversation
  – raw un-edited text
  – extreme multi-authorship
Review and Conclusions 23
Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 24
Usenet (asynchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 25
Chat (synchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 26
Blogs

Speech-like                     Text-like
time-bound                      space-bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background
  – <rant>I can't stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  – lusers (users, as seen by Systems Administrators)
  – suits
• Need to be in-group
cow orker: Coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
Edward Flaherty (talk.bizarre)
Review and Conclusions 28
• Gricean Principles
  – Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
  – GREAT, *great*, _great_
  – :-) (^_^) (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis
• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems
• Versioning file systems
  – every time a file is opened, a new copy is stored
• CVS, Subversion, Git
  – changes to a collection of files are tracked
  – simultaneous changes are merged
• Revision Tracking
  – Revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
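The change tracking that CVS, Subversion and Git do can be sketched with Python's difflib on two toy revisions of a text:

```python
import difflib

# Two revisions of a short document (toy example).
r1 = ["The cat sat on the mat.", "It was happy."]
r2 = ["The cat sat on the mat.", "It was very happy.", "The end."]

# A unified diff records exactly what changed between revisions,
# which is what makes explicit responsibility for changes possible.
diff = list(difflib.unified_diff(r1, r2, fromfile="r1", tofile="r2", lineterm=""))
print("\n".join(diff))
```

Lines starting with "-" were removed, lines starting with "+" were added; unchanged lines appear once as context.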
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge; people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
• Content policies: NPOV, Verifiability and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules
Wikipedia:Five_pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, ...)
• The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWYM)
• The structure of the Web - hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form
Unlinked content: pages which are not linked to by other pages, so crawlers cannot find them
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files, or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  – What you see is what you get (WYSIWYG)
  – Equivalent of printers' markup
  – Shows what things look like
• Logical Markup (Structural)
  – Shows the structure and meaning
  – Can be mapped to visual markup
  – Less flexible than visual markup
  – More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and WWW as corpus
(Objection: results are not reliable)
  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not allow one to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  – known size and exact hit counts → statistics possible
  – people can draw conclusions over the included text types
  – (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help - but:
  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  – it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than)
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
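The ratio on the slide is just arithmetic over the two document counts:

```python
# Hit counts from the slide (accessed 2012-04-04).
different_from = 251_000_000   # ghits for "different from"
different_than = 71_500_000    # ghits for "different than"

ratio = different_from / different_than
print(f"{ratio:.1f}:1")   # 3.5:1
```

The raw counts drift as the index changes, but the ratio between two phrasings is far more stable, which is why the slide recommends reporting it.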
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short or too long; word lists)
  – do this for several languages and make the corpora available
46
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
What is represented
• Phonemes: maɪ dɔg laɪks ævəkɑdoz (45)
  – not a simple correspondence between a writing system and the sounds
  – some logographs (3, $)
  – even when a new alphabet is designed, pronunciation changes
• Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)
• Morphemes: ma (me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)
• Words: my dog likes avocados (200,000+)
• Concepts: speaker-poss canine fond avocado+PL (400,000+)
Review and Conclusions 11
Three Major Writing Systems
• Alphabetic (Latin)
  – one symbol for each consonant or vowel
  – typically 20–30 base symbols (1 byte)
• Syllabic (Hiragana)
  – one symbol for each syllable (consonant + vowel)
  – typically 50–100 base symbols (1–2 bytes)
• Logographic (Hanzi)
  – pictographs, ideographs, sound-meaning combinations
  – typically 100,000+ symbols (2–3 bytes)
Review and Conclusions 12
Computational Encoding
• Need to map characters to bits
• More characters require more space
  – English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); everything: UTF-8 (8–32 bits)
• Moving towards Unicode for everything
• If you get the encoding wrong, it is gibberish
  – web pages should state their encoding
  – sometimes they are wrong
  – encoding detection usually involves statistical analysis of byte patterns
    ∗ but an encoding can be shared by many languages
Review and Conclusions 13
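A minimal sketch of the detection idea above: try candidate encodings in order and keep the first that decodes cleanly. Real detectors (such as chardet) score statistical byte patterns instead; the candidate list and examples here are only illustrative.

```python
# Crude encoding "detection": try candidates in order and keep the first
# that decodes without error. ISO-8859-1 accepts any byte sequence, so it
# must come last; this ordering trick is part of why real detectors use
# statistics over byte patterns instead.

CANDIDATES = ["ascii", "utf-8", "euc-jp", "iso-8859-1"]

def guess_encoding(raw: bytes) -> str:
    for enc in CANDIDATES:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

print(guess_encoding("café".encode("utf-8")))    # not valid ASCII, so utf-8
print(guess_encoding("日本語".encode("euc-jp")))  # invalid UTF-8, so euc-jp
```

Note that the same bytes may be valid in several encodings (the "shared by many languages" point above), so a try-and-decode loop can only ever be a rough heuristic.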
Speech and Language Technology
• The need for speech representation
• Storing sound
• Transforming speech
  – Automatic Speech Recognition (ASR): sounds to text
  – Text-to-Speech Synthesis (TTS): text to sounds
• Speech technology — the Telephone
Review and Conclusions 14
How good are the systems?

Task                    Vocab     WER (%)   WER (%) adapted
Digits                      11       0.4       0.2
Dialogue (travel)       21,000      10.9       —
Dictation (WSJ)          5,000       3.9       3.0
Dictation (WSJ)         20,000      10.0       8.6
Dialogue (noisy, army)   3,000      42.2      31.0
Phone conversations      4,000      41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)
Review and Conclusions 15
Why is it so difficult?
• Pronunciation depends on context
  – the same word will be pronounced differently in different sentences
• Speaker variability
  – gender
  – dialect/foreign accent
  – individual differences: physical differences, language differences (idiolect)
• Many, many rare events
  – 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1 Linguistic Analysis
• Sentence segmentation
• Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
• Word segmentation
  – 森山|前|日銀|総裁 Moriyama zen Nichigin Sousai (Moriyama, former Bank of Japan governor)
  ⊗ 森山|前日|銀|総裁 Moriyama zennichi gin Sousai (wrong segmentation)
2 Speech Synthesis
• Find the pronunciation
• Generate sounds
• Add intonation
Review and Conclusions 17
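The segmentation ambiguity above can be reproduced with greedy longest-match ("maximum matching") segmentation, a classic baseline for text written without spaces. The toy dictionary below is invented for illustration; notably, the greedy strategy picks exactly the wrong reading (前日|銀), which is why real segmenters need more context.

```python
# Greedy longest-match ("maximum matching") word segmentation.
# The dictionary is a toy example invented for this illustration.

DICT = {"森山", "前", "前日", "日銀", "銀", "総裁"}
MAXLEN = max(len(w) for w in DICT)

def segment(text: str):
    out, i = [], 0
    while i < len(text):
        # try the longest possible dictionary word starting at i
        for j in range(min(len(text), i + MAXLEN), i, -1):
            if text[i:j] in DICT:
                out.append(text[i:j])
                i = j
                break
        else:  # no dictionary word matched: emit a single character
            out.append(text[i])
            i += 1
    return out

print(segment("森山前日銀総裁"))  # the greedy (wrong) reading: 森山|前日|銀|総裁
```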
Speech Synthesis
• Articulatory synthesis: attempt to model human articulation
• Formant synthesis: bypass modeling of human articulation and model acoustics directly
• Concatenative synthesis: synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
• Excitement: fast, very high pitch, loud
• Hot anger: fast, high pitch, strong falling accent, loud
• Fear: jitter
• Sarcasm: prolonged accent, late peak
• Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading ≫ Speaking; Hearing ≫ Typing
⇒ Speech for input
⇒ Text for output
Review and Conclusions 20
The Telephone
Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email, Usenet, Chat, and Blogs
• All share some characteristics of speech and text
• Usage norms not fixed
• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …
• Large-scale discourse studies still to be done
• Some genuinely new things:
  – time-lagged multi-person conversation
  – raw, un-edited text
  – extreme multi-authorship
Review and Conclusions 23
Email (asynchronous)

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 26
Blogs
Speech-like                      Text-like
time-bound                       space-bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background
  – <rant>I can't stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  – lusers (users as seen by systems administrators)
  – suits
• Need to be in-group
  cow orker (co-worker)
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
  Edward Flaherty (talk.bizarre)
Review and Conclusions 28
• Gricean Principles
  – posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
  – GREAT, *great*, _great_
  – :-) (^_^) (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis
• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems
• Versioning file systems
  – every time a file is opened, a new copy is stored
• CVS, Subversion, Git
  – changes to a collection of files are tracked
  – simultaneous changes are merged
• Revision tracking
  – revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge; people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
• Content policies: NPOV, Verifiability, and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
Wikipedia:Five pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
Wikipedia:Good article criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of markup: visual vs. logical
  WYSIWYG, WYSIAYG, WYSIWYM
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages, so crawlers cannot find them by following links
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files, or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual markup (presentational)
  – What You See Is What You Get (WYSIWYG)
  – equivalent of printers' markup
  – shows what things look like
• Logical markup (structural)
  – shows the structure and meaning
  – can be mapped to visual markup
  – less flexible than visual markup
  – more adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  – population and exact hit counts are unknown → no statistics possible
  – indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web sample: use a search engine to download data from the net and build a corpus from it
  – known size and exact hit counts → statistics possible
  – people can draw conclusions over the included text types
  – (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  – it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
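The ratio comparison above is simple arithmetic; a sketch using the 2012 counts from the slide:

```python
# Comparing two ghit counts as a unitless ratio, as suggested above.
# The counts are the 2012 figures for "different from" vs "different than".

def ratio(hits_a: int, hits_b: int) -> float:
    """Return hits of phenomenon A per one hit of phenomenon B."""
    return hits_a / hits_b

r = ratio(251_000_000, 71_500_000)
print(f"{r:.1f}:1")   # roughly 3.5:1
```

Because the absolute counts drift as the index changes, the ratio (plus the access date) is the reportable number, not the raw ghit counts.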
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  – do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select seed words (500)
2 Combine to form multiple queries (6,000)
3 Query a search engine and retrieve the URLs (50,000)
4 Download the files from the URLs (100,000,000 words)
5 Postprocess the data (encoding cleanup, tagging, and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html 47
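Step 2 of the pipeline above (combining seed words into queries) can be sketched as follows. The seed words and the sizes are toy values for illustration, not Sharoff's actual lists.

```python
# Combine seed words into multi-word search-engine queries (step 2 above).
# Toy seed list; the real pipeline uses ~500 seeds and ~6,000 queries.
import itertools
import random

seeds = ["time", "people", "water", "house", "music"]

def make_queries(seed_words, words_per_query=2, n_queries=6, rng=None):
    rng = rng or random.Random(0)          # fixed seed for reproducibility
    combos = list(itertools.combinations(seed_words, words_per_query))
    return [" ".join(q) for q in rng.sample(combos, n_queries)]

for q in make_queries(seeds):
    print(q)
```

Each query is then sent to a search engine (step 3) and the returned URLs are downloaded and cleaned (steps 4–5).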
Internet Corpora Summary
• The web can be used as a corpus
  – Direct access
    ∗ fast and convenient
    ∗ huge amounts of data
    ⊗ unreliable counts
  – Web sample
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ⊗ less data
• Richer data than a compiled corpus
  ⊗ less balanced, less markup
48
Text and Meta-text
• Explicit meta-data
  – keywords and categories
  – rankings
  – structural markup
• Implicit meta-data
  – links and citations
  – tags
  – tables
  – file names
  – translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  – when they are accurate, they are very good
  – they are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and tags
• Rankings
• Links and citations
• Structural markup
50
Implicit Metadata
• You can get clues from metadata within documents
  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations — bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  – linguistically-grounded methods
    ∗ diacritics
    ∗ character n-grams
    ∗ stop words
  – similarity-based categorisation and classification
    ∗ character n-gram rankings
  – machine-learning-based methods
  – context (under jp or ko)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
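A miniature character-n-gram identifier in the spirit of the methods above. The per-language sample texts are far too small for real use and are invented purely to show the idea.

```python
# Character-n-gram language identification in miniature: score a snippet
# against per-language trigram profiles built from tiny sample texts.
from collections import Counter

SAMPLES = {  # toy training data, one short sentence per language
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}

def profile(text: str, n: int = 3) -> Counter:
    """Count all overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

PROFILES = {lang: profile(t) for lang, t in SAMPLES.items()}

def identify(snippet: str) -> str:
    p = profile(snippet)
    def overlap(lang):
        return sum((p & PROFILES[lang]).values())  # shared trigram mass
    return max(PROFILES, key=overlap)

print(identify("the dog and the cat"))      # en
print(identify("der hund und die katze"))   # de
```

Real systems rank many more n-grams over much larger training text, which is what makes them robust to the short-snippet and similar-language problems noted above.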
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
Review and Conclusions 54
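Toy versions of two of the steps above, stop-word stripping and suffix stemming; real stemmers such as Porter's use many more rules, and the stop list here is deliberately tiny.

```python
# Two normalization steps in miniature: stop-word stripping and a crude
# suffix stemmer. The suffix list and length guard are illustrative only.

STOP = {"the", "a", "of", "to"}

def strip_stops(tokens):
    return [t for t in tokens if t.lower() not in STOP]

def stem(word: str) -> str:
    for suffix in ("ers", "er", "es", "s"):   # order matters: longest first
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = "the producer produces avocados".split()
print([stem(t) for t in strip_stops(tokens)])  # ['produc', 'produc', 'avocado']
```

This reproduces the slide's examples: producer → produc and produces → produc conflate to the same stem, which is exactly the point of stemming for retrieval.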
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  – provides a common data representation framework
  – makes possible integrating multiple sources
  – so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
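Well-formedness checking, as mentioned above, is directly available in Python's standard library: the parser rejects documents with mismatched tags.

```python
# Checking XML syntactic well-formedness with the standard library parser.
import xml.etree.ElementTree as ET

def well_formed(doc: str) -> bool:
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:       # raised on mismatched or unclosed tags
        return False

print(well_formed("<note><to>Tove</to></note>"))   # True
print(well_formed("<note><to>Tove</note></to>"))   # False: tags cross
```

Note this checks only well-formedness (matching tags, one root); validity against a schema or DTD is a separate, stronger check.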
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework
  – triples of <subject, predicate, object>
  – each element is a URI
  – RDF is written in well-defined XML
  – you can say anything about anything
• You can build relations in ontologies (OWL)
  – then reason over them, search them, …
Review and Conclusions 59
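The triple model can be sketched with plain tuples. The color URI is the example identifier from the slide above; the shirt URIs and the wildcard-matching helper are invented for illustration (real RDF stores such as rdflib offer much richer querying).

```python
# Subject-predicate-object triples as tuples, with a simple pattern query
# where None acts as a wildcard. All example.org URIs are made up.

COLOR = "http://pantone.tm.example.com/2002/std6#color"  # URI from the slide

triples = [
    ("http://example.org/shirt1", COLOR, "http://example.org/colors/red"),
    ("http://example.org/shirt2", COLOR, "http://example.org/colors/blue"),
]

def match(s=None, p=None, o=None):
    """Return all triples matching the given pattern (None = any)."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(len(match(p=COLOR)))                             # 2: both shirts have a color
print(match(o="http://example.org/colors/red"))        # just shirt1's triple
```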
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible: know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation, and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – content-based
    ∗ read it and see if it is interesting (hard for a computer)
    ∗ compare it to other things you have read and liked
  – context-based: citation analysis
    ∗ see who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group
• Some scores are:
  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total number of (citations / number of authors)
  – Average (citation impact factor / number of authors)
• Problems:
  – not all citations are equal: citations by 'good' papers are better
  – newer publications suffer in relation to older ones
• Weight citations by the quality of the paper
Review and Conclusions 64
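The first three scores listed above, computed for a fictional article; the author names and citing records are made up for illustration.

```python
# Simple citation scores for one (fictional) article. Each citing record
# carries the citing paper's author set, so self-citations can be detected
# as citations sharing an author with the article itself.

article_authors = {"Ho", "Tan"}
citations = [
    {"authors": {"Lim"}},
    {"authors": {"Ho", "Ng"}},   # self-citation: shares author "Ho"
    {"authors": {"Sato"}},
]

total = len(citations)                                     # total citations
non_self = sum(1 for c in citations
               if not (c["authors"] & article_authors))    # minus self-citations
per_author = total / len(article_authors)                  # citations / authors

print(total, non_self, per_author)   # 3 2 1.5
```

Even this tiny example shows why the scores diverge: a single self-citation changes one metric but not the others.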
Gaming Citations
• Least/Minimum Publishable Unit
  – break research into small chunks to increase the number of citations
  – sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A
2 The (extended) anchor text pointing to page B is a good description of page B
  This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
Review and Conclusions 67
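Extracting links and their anchor text, the raw material for this link analysis, needs only the standard-library HTML parser; the page snippet and URL below are made up.

```python
# Collect (href, anchor text) pairs from HTML using html.parser.
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []       # list of (href, anchor text) pairs
        self._href = None     # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorExtractor()
p.feed('See the <a href="http://example.org/hg2052">HG2052 Course Page</a>.')
print(p.links)   # [('http://example.org/hg2052', 'HG2052 Course Page')]
```

A crawler would aggregate these pairs per target URL, giving both the inlink counts and the (extended) anchor-text descriptions used by the ranking methods that follow.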
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  – simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  – a high inlink count does not necessarily mean high quality …
  – … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – an article's vote is weighted according to its citation impact
  – this can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  – start at a random page
  – at each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – for example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter: the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
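The teleporting random walk above can be computed by power iteration. The four-page graph is made up for illustration; dead-end page D jumps uniformly to every page, exactly as the dead-end rule describes.

```python
# PageRank by power iteration on a tiny made-up graph, with a 10%
# teleportation rate. A page with outlinks spreads 0.9 of its rank over
# its links and 0.1 uniformly; a dead end spreads all its rank uniformly.

links = {                 # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],              # dead end
}

def pagerank(links, teleport=0.10, iterations=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}         # start uniform
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                         # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                 # teleport share
                    new[q] += rank[p] * teleport / n
                for q in out:                   # follow a link
                    new[q] += rank[p] * (1 - teleport) / len(out)
        rank = new
    return rank

ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Page C, linked to by both A and B, ends up with the highest long-term visit rate, while dead-end D keeps only the small teleportation share.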
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.
• Link farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  – Google does not index the target of a link marked nofollow
  – Yahoo! does not include the link in its ranking
  – …
74
Current Status
• There is a continuous battle between:
  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – metadata about the object is stored with the DOI name
  – the metadata includes a location, such as a URL
  – the DOI for a document is permanent; the metadata may change
  – gives a persistent identifier (like ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – passively, for information access
  – actively, for collaboration
• The internet is interesting in and of itself
  – as a source of data about existing language
  – as a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (pre-req HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Three Major Writing Systems
atilde Alphabetic (Latin)
acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)
atilde Syllabic (Hiragana)
acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)
atilde Logographic (Hanzi)
acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)
Review and Conclusions 12
Computational Encoding
atilde Need to map characters to bits
atilde More characters require more space
acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)
atilde Moving towards unicode for everything
atilde If you get the encoding wrong it is gibberish
acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns
lowast But an encoding can be shared by many languages
Review and Conclusions 13
Speech and Language Technologyatilde The need for speech representation
atilde Storing sound
atilde Transforming Speech
acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds
atilde Speech technology mdash the Telephone
Review and Conclusions 14
How good are the systems
Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310
Results of various DARPA competitions (from Richard Sproatrsquos slides)
Review and Conclusions 15
Why is it so difficult
atilde Pronunciation depends on context
acirc The same word will be pronounced differently in different sentences
atilde Speaker variability
acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)
atilde Many many rare events
acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1 Linguistic Analysis
atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation
acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai
2 Speech Synthesis
atilde Find the pronunciationatilde Generate soundsatilde Add intonation
Review and Conclusions 17
Speech Synthesis
atilde Articulatory Synthesis Attempt to model human articulation
atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly
atilde Concatenative Synthesis Synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
atilde Excitement Fast very high pitch loud
atilde Hot anger Fast high pitch strong falling accent loud
atilde Fear Jitter
atilde Sarcasm Prolonged accent late peak
atilde Sad Slow low pitch
The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word = 6 characters)
(English computer science students; various studies)

Reading    300    200 (proofreading)
Writing    31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing     33     19 (composing)

• Reading >> Speaking/Hearing >> Typing
⇒ Speech for input
⇒ Text for output
Review and Conclusions 20
The Telephone
Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
• All share some characteristics of speech and text
• Usage norms not fixed
• Communication methods may disappear before the norms are fixed
  (Usenet, Bulletin Boards, Gopher, Archie, …)
• Large-scale discourse studies still to be done
• Some genuinely new things:
  ◦ time-lagged multi-person conversation
  ◦ raw, un-edited text
  ◦ extreme multi-authorship
Review and Conclusions 23
Email

Speech-like               Text-like
time bound                space bound (deletable)
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich
Review and Conclusions 26
Blogs
Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background
  ◦ <rant>I can't stand this</rant>
  ◦ I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  ◦ lusers (users, as seen by Systems Administrators)
  ◦ suits
• Need to be in-group
  cow orker: co-worker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
                                        — Edward Flaherty (talk.bizarre)
Review and Conclusions 28
• Gricean Principles
  ◦ Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
  ◦ GREAT, *great*, _great_
  ◦ :-)  (^_^)  (o)
  ◦ brb, RTFM, IMHO
  ◦ lower case
  ◦ lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis

• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems
• Versioning file systems
  ◦ every time a file is saved, a new copy is stored
• CVS, Subversion, Git
  ◦ changes to a collection of files are tracked
  ◦ simultaneous changes are merged
• Revision Tracking
  ◦ revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
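The tracking idea behind CVS/Subversion/Git can be sketched with Python's standard difflib: a unified diff records exactly which lines changed between two versions, which is what makes responsibility for changes explicit. The two text versions are invented examples:

```python
import difflib

original = ["Wikis allow shared editing.", "Anyone may contribute."]
revised = ["Wikis allow shared editing.", "Anyone may contribute,",
           "but changes are tracked."]

# A unified diff is the basic currency of version control:
# '-' lines were removed, '+' lines were added, '@@' marks the hunk.
diff = list(difflib.unified_diff(original, revised, lineterm=""))
print("\n".join(diff))
```

A version control system stores (or reconstructs) such diffs for every commit, together with the author, so "who changed what" is always answerable.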
Review and Conclusions 31
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge, and people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
   • Content policies: NPOV, Verifiability, and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
Wikipedia:Five pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
Wikipedia:Good article criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of Markup: Visual vs Logical
  (WYSIWYG, WYSIAYG, WYSIWYM)
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages (but clicking links them)
Private Web: sites that require registration and login (edveNTUre)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like
• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow one to draw conclusions about the data
  ⊗ Google is missing functionality that linguists/lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions over the included text types
  ◦ (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help - but:
  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  ◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman)
  but Google no longer releases the index size
• So always say when you counted, and try to use ratios
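The ratio computation is simple enough to script; a sketch using the counts quoted above:

```python
def ghit_ratio(hits_a, hits_b):
    """Unitless ratio of two document-frequency counts ('ghits')."""
    return hits_a / hits_b

# "different from" vs "different than", counts as reported on 2012-04-04
ratio = ghit_ratio(251_000_000, 71_500_000)
print(f"{ratio:.1f}:1")
```

Because the absolute counts drift as the index changes, only the ratio (here about 3.5:1) is worth reporting, together with the access date.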
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  ◦ do this for several languages and make the corpora available
46
Building Internet Corpora: Outline

1. Select seed words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding, cleanup, tagging, and parsing)
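Steps 1-2 can be sketched as follows; the six seed words are invented stand-ins for the 500 used in practice:

```python
import itertools

# Step 1: seed words (a real run uses ~500 common content words)
seeds = ["time", "people", "water", "house", "money", "school"]

# Step 2: every pair of seeds becomes one search-engine query
queries = [" ".join(pair) for pair in itertools.combinations(seeds, 2)]
print(len(queries), "queries, e.g.", queries[0])
```

With 500 seeds the same pairing scheme yields well over the 6,000 queries needed; steps 3-5 then feed the returned URLs into a downloader and a cleanup/tagging pipeline.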
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
Internet Corpora Summary
• The web can be used as a corpus
  ◦ Direct access
    ∗ fast and convenient
    ∗ huge amounts of data
    ⊗ unreliable counts
  ◦ Web sample
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ⊗ less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text

• Explicit Meta-data
  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup
• Implicit Meta-data
  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  ◦ when they are accurate, they are very good
  ◦ they are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  ◦ as they are unintended, they tend to be noisy
  ◦ but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations — bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  ◦ Linguistically-grounded methods
    ∗ diacritics
    ∗ character n-grams
    ∗ stop words
  ◦ Similarity-based categorisation and classification
    ∗ character n-gram rankings
  ◦ Machine-learning-based methods
  ◦ Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
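A minimal sketch of character-n-gram language identification in the spirit of the similarity-based methods above. The two training sentences and the overlap score are toy assumptions; real systems train on large texts and use more careful statistics (e.g. rank-order distance):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character trigram counts, with padding to capture word boundaries."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(profile, other):
    """Overlap of trigram counts: a crude stand-in for rank-based scores."""
    return sum(min(c, other[g]) for g, c in profile.items())

# Tiny made-up training profiles; real profiles need far more text.
profiles = {
    "en": char_ngrams("the cat sat on the mat with the hat"),
    "de": char_ngrams("die katze sitzt mit dem hut auf der matte"),
}

def identify(snippet):
    grams = char_ngrams(snippet)
    return max(profiles, key=lambda lang: similarity(grams, profiles[lang]))

print(identify("the hat"))
```

Even this toy version shows why short snippets and closely related languages are hard: the shorter the snippet, the fewer n-grams there are to separate the profiles.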
Review and Conclusions 53
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
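The stemming step can be sketched with a crude suffix stripper; the suffix list here is an invented toy, far simpler than a real stemmer such as Porter's (and unlike lemmatization, it needs no dictionary, which is why it can output non-words like "produc"):

```python
def stem(word):
    """Strip one common suffix, keeping a stem of at least 3 letters."""
    for suffix in ("es", "er", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ("produces", "producer", "produce"):
    print(w, "->", stem(w))
```

Both "produces" and "producer" collapse to the same stem "produc", which is exactly what a search engine wants; a lemmatizer would instead map "produces" to the dictionary form "produce".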
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  ◦ provides a common data representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

  E.g. not just "color" but a concept denoted by a Web identifier:
  <http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  ◦ it is a markup language, much like HTML
  ◦ it was designed to carry data, not to display data
  ◦ its tags are not predefined: you must define your own tags
  ◦ it is designed to be self-descriptive
  ◦ XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
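Checking syntactic well-formedness is mechanical; a sketch with the standard library parser (the two documents are invented examples):

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Syntactic well-formedness check: does the XML parse at all?"""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<note><to>Kim</to></note>"))   # tags nest properly
print(well_formed("<note><to>Kim</note></to>"))   # tags cross: not well-formed
```

Well-formedness (every tag closed, properly nested) is weaker than validity against a schema, but it is the precondition for any further processing.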
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  ◦ Uniform Resource Name: urn:isbn:1575864606
  ◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  ◦ triples of <subject, predicate, object>
  ◦ each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ you can say anything about anything
• You can build relations in ontologies (OWL)
  ◦ then reason over them, search them, …
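RDF's triple model can be sketched as a set of tuples with wildcard matching. The URIs and the query helper are invented for illustration; real systems use RDF stores and the SPARQL query language:

```python
# Hypothetical URIs; an RDF graph is just a set of
# <subject, predicate, object> triples.
EX = "http://example.org/"
triples = {
    (EX + "bond", EX + "teaches", EX + "HG2052"),
    (EX + "HG2052", EX + "topic", EX + "internet"),
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(query(p=EX + "teaches"))
```

Because every element is a URI, triples from different sources can be pooled into one set and queried together, which is the "web of data" idea in miniature.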
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ read it and see if it is interesting (hard for a computer)
    ∗ compare it to other things you have read and liked
  ◦ Context-based: Citation Analysis
    ∗ see who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group
• Some scores are:
  ◦ Total number of citations (pretty useful)
  ◦ Total number of citations minus self-citations
  ◦ Total number of (citations / number of authors)
  ◦ Average (citation impact factor / number of authors)
• Problems:
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
  ◦ break research into small chunks to increase the number of citations
  ◦ sometimes there is very little new information
• Self-citation, in-group citation
• Write only proceedings papers (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

  <a href="http://path/to/there/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
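Extracting link targets and their anchor text, the raw material for both intuitions, can be sketched with the standard library HTML parser (the URL and sentence are invented examples):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:       # only text inside an <a> element
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorCollector()
p.feed('See the <a href="http://example.org/hg2052">HG2052 Course Page</a>.')
print(p.links)
```

A search engine indexes the collected anchor text under the *target* page, so a page is partly described by what everyone else calls it.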
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web:
  ◦ start at a random page
  ◦ at each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate
  (what proportion of the time someone will be there)
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
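The teleporting walk can be sketched as power iteration on a toy four-page web; the page names and link structure are invented for illustration:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Steady-state visit rates of the teleporting random walk."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump anywhere, 1/N each
                for q in pages:
                    nxt[q] += rank[p] / n
            else:
                for q in pages:              # teleport: 0.1/N to every page
                    nxt[q] += teleport * rank[p] / n
                for q in out:                # follow a random outlink: 0.9/|out|
                    nxt[q] += (1 - teleport) * rank[p] / len(out)
        rank = nxt
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(links)
print(ranks)
```

The total rank mass stays at 1.0 on every iteration, and for a page with four outlinks each link would receive (1 − 0.10)/4 = 0.225 of its rank, matching the example above.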
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly-ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

  The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo! does not include the link in its ranking
  ◦ …
74
Current Status
• There is a continuous battle between:
  ◦ search companies, who want to get the most useful page to the user
  ◦ page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ metadata about the object is stored with the DOI name
  ◦ the metadata includes a location, such as a URL
  ◦ the DOI for a document is permanent; the metadata may change
  ◦ gives a Persistent Identifier (like ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  ◦ passively, for information access
  ◦ actively, for collaboration
• The internet is interesting in and of itself
  ◦ as a source of data about existing language
  ◦ as a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Computational Encoding
• Need to map characters to bits
• More characters require more space
  ◦ English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8–32 bits)
• Moving towards Unicode for everything
• If you get the encoding wrong, it is gibberish
  ◦ web pages should state their encoding
  ◦ sometimes they are wrong
  ◦ encoding detection usually involves statistical analysis of byte patterns
    ∗ but an encoding can be shared by many languages
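The gibberish point is easy to demonstrate: decode UTF-8 bytes with the wrong encoding and you get mojibake.

```python
text = "café"

utf8_bytes = text.encode("utf-8")      # 'é' becomes the two bytes 0xC3 0xA9
print(utf8_bytes.decode("utf-8"))      # right encoding: café
print(utf8_bytes.decode("latin-1"))    # wrong encoding: each byte read alone
```

Reading each UTF-8 byte as a separate Latin-1 character turns "café" into "cafÃ©", the classic two-characters-for-one symptom of a misdeclared encoding.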
Review and Conclusions 13
Speech and Language Technology

• The need for speech representation
• Storing sound
• Transforming speech
acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds
atilde Speech technology mdash the Telephone
Review and Conclusions 14
How good are the systems
Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310
Results of various DARPA competitions (from Richard Sproatrsquos slides)
Review and Conclusions 15
Why is it so difficult
atilde Pronunciation depends on context
acirc The same word will be pronounced differently in different sentences
atilde Speaker variability
acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)
atilde Many many rare events
acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1 Linguistic Analysis
atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation
acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai
2 Speech Synthesis
atilde Find the pronunciationatilde Generate soundsatilde Add intonation
Review and Conclusions 17
Speech Synthesis
atilde Articulatory Synthesis Attempt to model human articulation
atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly
atilde Concatenative Synthesis Synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
atilde Excitement Fast very high pitch loud
atilde Hot anger Fast high pitch strong falling accent loud
atilde Fear Jitter
atilde Sarcasm Prolonged accent late peak
atilde Sad Slow low pitch
The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)(English computer science students various studies)
Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)
atilde Reading gtgt SpeakingHearing gtgt Typing
rArr Speech for inputrArr Text for output
Review and Conclusions 20
The Telephone
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
atilde All share some characteristics of speech and text
atilde Usage norms not fixed
atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip
atilde Large scale discourse studies still to be done
atilde Some genuinely new things
acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship
Review and Conclusions 23
Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
atilde Inspired by shared background
acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits
atilde Need to be in-group
cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)
Review and Conclusions 28
atilde Gricean Principals
acirc Posting (topmiddlebottom quoting and trimming)
atilde Inspired by medium limitations
acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikisatilde Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)
atilde Wikipedia makes it easy to share your knowledgepeople like to do this
atilde Most edits are done by insiders
atilde Most content is added by outsiders
atilde Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
atilde Content policies NPOV Verifiability and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
WikipediaFivepillars 33
Licenses and Ownership
atilde Copyright
atilde Copyleft
atilde Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
WikipediaGood_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman)
  but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
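Web-count comparisons like the one above reduce to a simple unitless ratio; a minimal sketch (the figures are the slide's 2012 counts, so any real study should re-query and record the date):

```python
def hit_ratio(ghits_a, ghits_b):
    """Unitless ratio of two document-frequency (ghit) counts."""
    return ghits_a / ghits_b

# The slide's 2012-04-04 counts for "different from" vs "different than".
ratio = hit_ratio(251_000_000, 71_500_000)
print(f"different from : different than = {ratio:.1f}:1")  # prints 3.5:1
```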
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, or too long word lists)
  – do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
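Step 2 of the pipeline above (combining seed words into queries) can be sketched as follows; the seed list and counts are toy stand-ins for Sharoff's 500 seeds and 6,000 queries:

```python
import random

def make_queries(seeds, words_per_query=3, n_queries=10, rng_seed=0):
    """Combine seed words into multi-word search-engine queries
    (step 2 of the corpus-building outline)."""
    rng = random.Random(rng_seed)
    return [" ".join(rng.sample(seeds, words_per_query))
            for _ in range(n_queries)]

seeds = ["time", "people", "house", "water", "school", "night", "music"]
queries = make_queries(seeds, words_per_query=3, n_queries=5)
```

Each query would then be sent to a search engine API (step 3) and the returned URLs downloaded and cleaned.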
Internet Corpora Summary
• The web can be used as a corpus

  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus
  ⊗ Less balanced, less markup
48
Text and Meta-text

• Explicit Meta-data
  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data
  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

… and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  – When they are accurate they are very good
  – They are often deceitful

• HTML, PDF, Word … Metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  – as they are non-intended they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations - Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine Learning based methods
  – Context (under .jp or .kr)

• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
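Character n-gram language identification (in the spirit of Cavnar and Trenkle's rank-based method, simplified here to profile overlap; the tiny training sentences are illustrative, real systems train on much more data) can be sketched as:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram profile of a text, with padding spaces."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def overlap(profile_a, profile_b):
    """Shared n-gram mass between two profiles (a crude stand-in
    for the classic out-of-place rank distance)."""
    return sum(min(profile_a[g], profile_b[g])
               for g in profile_a.keys() & profile_b.keys())

# Toy training "corpora", one short sentence per language.
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund"),
}

def identify(snippet):
    """Guess the language whose profile shares the most n-grams."""
    grams = char_ngrams(snippet)
    return max(profiles, key=lambda lang: overlap(grams, profiles[lang]))
```

As the slide notes, this degrades quickly on short snippets and closely related languages.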
Normalization
• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
Review and Conclusions 54
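A toy normalizer for the money examples above (the three patterns are illustrative only, not a full normalization grammar):

```python
import re

def normalize_money(text):
    """Map a few surface forms for amounts of money onto one integer."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)([KM])", text)
    if m:  # $700K, $1.2M
        return int(float(m.group(1)) * {"K": 1_000, "M": 1_000_000}[m.group(2)])
    m = re.fullmatch(r"\$([\d,]+)", text)
    if m:  # $700,000
        return int(m.group(1).replace(",", ""))
    m = re.fullmatch(r"(\d+(?:\.\d+)?) million dollars", text)
    if m:  # 0.7 million dollars
        return int(float(m.group(1)) * 1_000_000)
    return None  # not recognized as a money expression
```

All three of the slide's spellings then normalize to the same value, 700000.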
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6/color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  – is a markup language much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation

• Can be validated for syntactic well-formedness
Review and Conclusions 58
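Checking well-formedness is exactly what an XML parser does; a minimal sketch with the standard library:

```python
import xml.etree.ElementTree as ET

def well_formed(document):
    """True iff the string parses as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False
```

A mismatched closing tag, for example, makes the document ill-formed and the parser rejects it.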
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
  – Universal Resource Name: urn:isbn:1575864606
  – Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework
  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations in ontologies (OWL)
  – Then reason over them, search them, …
Review and Conclusions 59
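RDF's triple model can be mimicked with plain tuples; the namespace and triples below are invented for illustration (real work would use rdflib and standard vocabularies):

```python
EX = "http://example.org/"  # hypothetical namespace

# A tiny triple store: <subject, predicate, object>, each element a URI.
triples = {
    (EX + "HG2052", EX + "taughtBy", EX + "bond"),
    (EX + "HG2052", EX + "covers",   EX + "semantic_web"),
    (EX + "bond",   EX + "worksAt",  EX + "NTU"),
}

def objects(subject, predicate):
    """All objects o such that <subject, predicate, o> is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```

Querying `objects(EX + "HG2052", EX + "taughtBy")` then retrieves the asserted teacher, which is the kind of machine-readable lookup the Semantic Web aims at.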
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:
  – Total Number of Citations (Pretty Useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)

• Problems:
  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight Citations by Quality of the paper
Review and Conclusions 64
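The first three scores above, computed over a toy publication record (the data structure is invented for illustration):

```python
# Each paper: number of authors plus the ids of papers citing it;
# the marker "self" stands for a self-citation.
papers = [
    {"authors": 2, "cited_by": ["p1", "p2", "self", "p3"]},
    {"authors": 1, "cited_by": ["p4", "self"]},
]

total_citations = sum(len(p["cited_by"]) for p in papers)
minus_self = sum(c != "self" for p in papers for c in p["cited_by"])
per_author = sum(len(p["cited_by"]) / p["authors"] for p in papers)
```

For this record: 6 total citations, 4 after removing self-citations, and 4.0 citations when each count is divided by the number of authors.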
Gaming Citations
• Least/Minimum Publishable Unit
  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to High Impact factor journals

You improve what gets measured,
not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC: Strongly Connected Component - can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: Each article gets one vote - not very accurate

• On the web: citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank
  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate
  (what proportion of the time someone will be there)

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady state probability
Review and Conclusions 69
Teleporting - to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links: randomly choose one with probability (1 - 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
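The random walk with teleportation can be computed by power iteration; a minimal sketch in which dead ends always jump and non-dead-ends teleport with the 10% rate, as described above:

```python
def pagerank(links, n, tau=0.10, iters=100):
    """links: page -> list of outlink targets; returns steady-state visit rates."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [0.0] * n
        for page, r in enumerate(rank):
            out = links.get(page, [])
            if not out:                      # dead end: jump anywhere
                for q in range(n):
                    new[q] += r / n
            else:
                for q in range(n):           # teleport with probability tau
                    new[q] += tau * r / n
                for q in out:                # otherwise follow a random outlink
                    new[q] += (1 - tau) * r / len(out)
        rank = new
    return rank
```

On a symmetric 3-page cycle every page ends up with rank 1/3; adding a dead end shifts mass toward the pages it feeds.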
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

  The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …
74
Current Status
• There is a continuous battle between
  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read

• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself
  – As a source of data about existing language
  – As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language Technology and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer - solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics - (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming Speech
  – Automatic Speech Recognition (ASR): sounds to text
  – Text-to-Speech Synthesis (TTS): text to sounds

• Speech technology - the Telephone
Review and Conclusions 14
How good are the systems?

Task                      Vocab     WER (%)   WER (%) adapted
Digits                       11        0.4         0.2
Dialogue (travel)        21,000       10.9         -
Dictation (WSJ)           5,000        3.9         3.0
Dictation (WSJ)          20,000       10.0         8.6
Dialogue (noisy, army)    3,000       42.2        31.0
Phone Conversations       4,000       41.9        31.0

Results of various DARPA competitions (from Richard Sproat's slides)
Review and Conclusions 15
Why is it so difficult
• Pronunciation depends on context
  – The same word will be pronounced differently in different sentences

• Speaker variability
  – Gender
  – Dialect / Foreign Accent
  – Individual Differences: Physical differences, Language differences (idiolect)

• Many, many rare events
  – 300 out of 2000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1. Linguistic Analysis

   • Sentence Segmentation
   • Abbreviations: Dr Smith lives on Nanyang Dr He is …
   • Word Segmentation
     – 森山前日銀総裁 Moriyama zen Nichigin Sousai
     ⊗ 森山前日銀総裁 Moriyama zennichi gin Sousai

2. Speech Synthesis

   • Find the pronunciation
   • Generate sounds
   • Add intonation
Review and Conclusions 17
Speech Synthesis
• Articulatory Synthesis: Attempt to model human articulation

• Formant Synthesis: Bypass modeling of human articulation and model acoustics directly

• Concatenative Synthesis: Synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
• Excitement: Fast, very high pitch, loud

• Hot anger: Fast, high pitch, strong falling accent, loud

• Fear: Jitter

• Sarcasm: Prolonged accent, late peak

• Sad: Slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration)
                                                  - Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word = 6 characters)
(English computer science students, various studies)

Reading    300     200 (proof reading)
Writing     31      21 (composing)
Speaking   150
Hearing    150     210 (speeded up)
Typing      33      19 (composing)

• Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output
Review and Conclusions 20
The Telephone
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed
  (Usenet, Bulletin Boards, Gopher, Archie, …)

• Large-scale discourse studies still to be done

• Some genuinely new things
  – time-lagged multi-person conversation
  – raw un-edited text
  – extreme multi-authorship
Review and Conclusions 23
Email (asynchronous)

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 26
Blogs
Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background
  – <rant>I can't stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  – lusers (users as seen by Systems Administrators)
  – suits

• Need to be in-group

  cow orker: Coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
                                  - Edward Flaherty (talk.bizarre)
Review and Conclusions 28
atilde Gricean Principals
• Gricean Principles
  – Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations
  – GREAT, *great*, _great_
  – :-) (^_^) (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership
30
Version Control Systems
• Versioning file systems
  – every time a file is opened, a new copy is stored

• CVS, Subversion, Git
  – changes to a collection of files are tracked
  – simultaneous changes are merged

• Revision Tracking
  – Revisions are stored within a file

• Authorship in shared writing: Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge
  (people like to do this)

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view
   • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33
Licenses and Ownership
• Copyright

• Copyleft

• Creative Commons
Review and Conclusions 34
What is a good article
1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet
  (more than just the web: email, VoIP, FTP, Streaming, Messaging)

• The structure of Markup: Visual vs Logical
  (WYSIWYG, WYSIAYG, WYSIWYM)

• The structure of the Web - hypertext
  (pages linked to pages)

• The future of the Web

• Linguistic features of the web
  (un-edited, large volume, editable, multi-media)
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (so crawlers cannot find them by following links)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
- Citation frequency can be used to measure the impact of an article
  - Simplest measure: each article gets one vote (not very accurate)
- On the web, citation frequency = inlink count
  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam
- Better measure: weighted citation frequency, or citation rank
  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
- Imagine a web surfer doing a random walk on the web:
  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably
- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
- This long-term visit rate is the page's PageRank
- PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting: to get us out of dead ends
- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
- With the remaining probability (90%), go out on a random hyperlink
  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
- 10% is a parameter, the teleportation rate
- Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
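The random walk with teleporting can be sketched as a short power iteration (a toy Python sketch over a hypothetical four-page web; the link structure is made up, only the 10% rate comes from the slide):

```python
# Toy PageRank by power iteration over a hypothetical 4-page web:
# page 0 links to 1 and 2; 1 links to 2; 2 links to 0; 3 is a dead end.
N = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
teleport = 0.10  # the 10% teleportation rate from the slide

rank = [1.0 / N] * N
for _ in range(100):  # iterate towards the steady state
    new = [0.0] * N
    for page, outs in links.items():
        if not outs:  # dead end: jump to a random page with probability 1/N
            for q in range(N):
                new[q] += rank[page] / N
        else:
            for q in range(N):  # teleport: spread 10% of the mass over all pages
                new[q] += teleport * rank[page] / N
            for q in outs:  # otherwise follow one of the outlinks equiprobably
                new[q] += (1 - teleport) * rank[page] / len(outs)
    rank = new
# rank now holds each page's long-term visit rate (its PageRank); it sums to 1.
```

Pages with more (and better-ranked) inlinks end up with higher steady-state probability, which is exactly the weighted-citation idea of the previous slides.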
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
- Link spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
- Link farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
- Scraper sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
73
- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.
- The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
74
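In markup, nofollow is just an attribute value on the link (the URL here is a placeholder):

```html
<!-- rel="nofollow" asks engines not to pass ranking credit to the target -->
<a href="http://example.com/" rel="nofollow">a link that passes no PageRank</a>
```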
Current Status
- There is a continuous battle between:
  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read
- All metrics get gamed
Review and Conclusions 75
Digital object identifier
- DOI: a string used to uniquely identify an electronic document or object
  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a persistent identifier (like an ISBN)
- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
- Speech (language itself)
- Writing (invented 3-5 times)
- Printing (made writing common)
- Digital text (made writing transferable)
- Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
- The internet is useful as a tool:
  - Passively, for information access
  - Actively, for collaboration
- The internet is interesting in and of itself:
  - As a source of data about existing language
  - As a source of innovation in language
Review and Conclusions 79
Recommended Texts
- Wikipedia
- Crystal, D. (2001) Language and the Internet. Cambridge University Press
- Sproat, R. (2010) Language Technology and Society. Oxford University Press
- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
- HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
- HG3051 Corpus Linguistics (prerequisite HG2051, waived): an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
How good are the systems
Task                     Vocab    WER (%)   WER (%) adapted
Digits                      11       0.4       0.2
Dialogue (travel)       21,000      10.9      n/a
Dictation (WSJ)          5,000       3.9       3.0
Dictation (WSJ)         20,000      10.0       8.6
Dialogue (noisy, army)   3,000      42.2      31.0
Phone conversations      4,000      41.9      31.0
Results of various DARPA competitions (from Richard Sproat's slides)
Review and Conclusions 15
Why is it so difficult
- Pronunciation depends on context
  - The same word will be pronounced differently in different sentences
- Speaker variability
  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)
- Many, many rare events
  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1. Linguistic analysis

- Sentence segmentation
- Abbreviations: "Dr. Smith lives on Nanyang Dr. He is …"
- Word segmentation
  - 森山 前 日銀 総裁 Moriyama zen Nichigin Sousai
  ⊗ 森山 前日 銀 総裁 Moriyama zennichi gin Sousai

2. Speech synthesis

- Find the pronunciation
- Generate sounds
- Add intonation
Review and Conclusions 17
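The "Nanyang Dr." example shows why sentence segmentation needs more than punctuation; a naive splitter (Python sketch) breaks in the wrong place:

```python
import re

# Naive rule: a sentence ends at a period followed by whitespace.
text = "Dr. Smith lives on Nanyang Dr. He is away."
naive = re.split(r"(?<=\.)\s+", text)
# This wrongly splits after the title "Dr.", and only by luck treats the
# abbreviation for "Drive" as a real sentence end; getting it right needs
# abbreviation lists and context.
```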
Speech Synthesis
- Articulatory synthesis: attempt to model human articulation
- Formant synthesis: bypass modeling of human articulation and model the acoustics directly
- Concatenative synthesis: synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
- Excitement: fast, very high pitch, loud
- Hot anger: fast, high pitch, strong falling accent, loud
- Fear: jitter
- Sarcasm: prolonged accent, late peak
- Sadness: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality' but natural-sounding prosody (intonation and duration)."
(Richard Sproat)
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Modality   wpm   wpm (condition)
Reading    300   200 (proofreading)
Writing     31    21 (composing)
Speaking   150
Hearing    150   210 (speeded up)
Typing      33    19 (composing)

- Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output
Review and Conclusions 20
The Telephone
Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
- All share some characteristics of speech and of text
- Usage norms not fixed
- Communication methods may disappear before the norms are fixed (Usenet, bulletin boards, Gopher, Archie, …)
- Large-scale discourse studies still to be done
- Some genuinely new things:
  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship
Review and Conclusions 23
Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 26
Blogs
Speech-like                      Text-like
time-bound                       space-bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Review and Conclusions 27
Netspeak
- Inspired by shared background:
  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by systems administrators)
  - suits
- Need to be in-group:

  cow orker: coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)
Review and Conclusions 28
- Gricean Principles
  - Posting (top/middle/bottom quoting and trimming)
- Inspired by medium limitations:
  - GREAT, great, great
  - :-)  (^_^)  (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis

- Version Control Systems
- Wikipedia
- Licensing and Ownership
30
Version Control Systems
- Versioning file systems
  - every time a file is opened, a new copy is stored
- CVS, Subversion, Git:
  - changes to a collection of files are tracked
  - simultaneous changes are merged
- Revision tracking:
  - revisions are stored within a file
- Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
Wikipedia
- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
- Wikipedia makes it easy to share your knowledge, and people like to do this
- Most edits are done by insiders
- Most content is added by outsiders
- Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
  - Content policies: NPOV, Verifiability, and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
Wikipedia:Five_pillars 33
Licenses and Ownership
- Copyright
- Copyleft
- Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
- The structure of markup: visual vs logical (WYSIWYG, WYSIAYG, WYSIWYM)
- The structure of the Web: hypertext, pages linked to pages
- The future of the Web
- Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802/ 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form

Unlinked content: pages which are not linked to by other pages, so crawlers cannot find them

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses, or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find.
Review and Conclusions 40
Visual Markup vs Logical Markup
- Visual markup (presentational):
  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like
- Logical markup (structural):
  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)
Review and Conclusions 41
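The contrast shows up in two fragments that render much the same but mean different things (illustrative HTML):

```html
<!-- visual markup: prescribes the appearance directly -->
<b>Warning:</b> the server is down.
<!-- logical markup: names the role; a stylesheet decides how it looks -->
<strong>Warning:</strong> the server is down.
```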
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
- Direct query: search engine as query tool, WWW as corpus
  (Objection: results are not reliable)
  - population and exact hit counts are unknown → no statistics possible
  - indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
- Web sample: use a search engine to download data from the net and build a corpus from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions about the included text types
  - (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
- Accessible through search engines (Google API, Yahoo API, scripts)
- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  - it is document frequency, not term frequency
- whit or "web hit" is the more general term
- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than:
  251,000,000 ghits vs 71,500,000, or 3.5:1, accessed 2012-04-04)
- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
- So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
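The ratio arithmetic, spelled out (Python; the counts are the slide's 2012 figures and would differ today):

```python
# ghit counts for the two variants (accessed 2012-04-04; document frequencies)
different_from = 251_000_000
different_than = 71_500_000

ratio = different_from / different_than  # unitless, so more stable than raw counts
print(f"{ratio:.1f}:1")                  # roughly 3.5:1
```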
Web Sample
- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  - do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select seed words (500)
2. Combine them to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
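The five steps can be sketched end-to-end with a stubbed search engine (Python; `fake_search` is a made-up stand-in for a real search API, and the numbers are tiny):

```python
import itertools

seeds = ["language", "internet", "corpus", "query"]   # 1: seed words
queries = list(itertools.combinations(seeds, 2))      # 2: combine into queries

def fake_search(query):
    # 3: a stub returning made-up URLs; a real pipeline would call a search API
    return [f"http://example.com/{'-'.join(query)}/{i}" for i in range(3)]

urls = sorted({u for q in queries for u in fake_search(q)})  # deduplicated URLs
# 4-5: a real pipeline would now download each URL, fix the encoding,
# strip boilerplate, then tag and parse the text.
```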
Internet Corpora Summary
- The web can be used as a corpus:
  - Direct access:
    * fast and convenient
    * huge amounts of data
    ⊗ unreliable counts
  - Web sample:
    * control over the sample
    * some setup costs (semi-automated)
    ⊗ less data
- Richer data than a compiled corpus
  ⊗ less balanced, less markup
48
Text and Meta-text

- Explicit meta-data:
  - Keywords and categories
  - Rankings
  - Structural markup
- Implicit meta-data:
  - Links and citations
  - Tags
  - Tables
  - File names
  - Translations
and Zombies 49
Explicit Metadata
- You can get information from metadata within documents
  - When they are accurate, they are very good
  - They are often deceitful
- HTML, PDF, Word, … metadata
- Keywords and tags
- Rankings
- Links and citations
- Structural markup
50
Implicit Metadata
- You can get clues from metadata within documents
  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful
- HTML tags as constituent boundaries
- Tables as semantic relations
- File names (content type and language)
- Translations: bracketed glosses, cross-lingual disambiguation
- Query data
- Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  - Linguistically-grounded methods:
    * diacritics
    * character n-grams
    * stop words
  - Similarity-based categorisation and classification:
    * character n-gram rankings
  - Machine-learning based methods
  - Context (e.g. under .jp or .kr)
- Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
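A toy character-trigram identifier in the spirit of the n-gram methods above (the two "training texts" are obviously far too small to be realistic):

```python
from collections import Counter

def trigrams(text):
    text = f"  {text.lower()}  "  # pad so word edges become trigrams too
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Tiny per-language profiles; real systems train on much more text.
profiles = {
    "en": trigrams("the cat sat on the mat with the hat"),
    "de": trigrams("die katze sitzt auf der matte mit dem hut"),
}

def identify(snippet):
    grams = trigrams(snippet)
    # pick the language whose profile overlaps most with the snippet's trigrams
    return max(profiles,
               key=lambda lang: sum(min(c, profiles[lang][g])
                                    for g, c in grams.items()))
```

As the slide warns, this degrades quickly on short snippets and on closely related languages.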
Normalization
- Extracting text from various documents
- Segmenting continuous text
- Number normalization: $700K, $700,000, 0.7 million dollars, …
- Date normalization: 2000 AD, 1421 AH, Heisei 12, …
- Stripping stop words (the, a, of, …)
- Lemmatization: produces → produce
- Stemming: producer → produc, produces → produc
- Decompounding: zonnecel → zon + cel
Review and Conclusions 54
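Toy versions of the lemmatization and stemming steps (hypothetical mini-rules covering only the slide's examples, not a real stemmer):

```python
def lemmatize(word):
    # mini-lemmatizer for the slide's example: produces -> produce
    return word[:-1] if word.endswith("s") else word

def stem(word):
    # crude truncation in the Porter spirit: chop a few common endings
    for suffix in ("er", "es", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word
```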
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
- Web of data:
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions
- Increase the utility of information by connecting it to definitions and context
- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6/color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
- XML is a set of rules for encoding documents electronically:
  - it is a markup language, much like HTML
  - it was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - it is designed to be self-descriptive
  - XML is a W3C Recommendation
- XML can be validated for syntactic well-formedness
Review and Conclusions 58
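Checking well-formedness needs nothing beyond a standard XML parser (Python sketch using the standard library):

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    # A document is well-formed if the parser accepts it: tags nest and
    # close properly, attributes are quoted, and so on.
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False
```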
Semantic Web Architecture (details)
- Identify things with Uniform Resource Identifiers (URIs):
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
- Identify relations with the Resource Description Framework (RDF):
  - triples of <subject, predicate, object>
  - each element is a URI
  - RDF is written in well-defined XML
  - you can say anything about anything
- You can build relations into ontologies (OWL):
  - then reason over them, search them, …
Review and Conclusions 59
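The triple model can be mimicked with plain tuples (a minimal in-memory sketch; the example.org URIs are made up, while the Dublin Core predicates are real ones):

```python
# Each triple is <subject, predicate, object>, every element a URI.
triples = {
    ("http://example.org/page", "http://purl.org/dc/terms/creator",
     "http://example.org/bond"),
    ("http://example.org/page", "http://purl.org/dc/terms/language",
     "http://example.org/en"),
}

def query(s=None, p=None, o=None):
    # None acts as a wildcard, like a variable in a triple pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]
```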
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
- The Semantic Web is about structuring data
- Text mining is about unstructured data
- There is much more unstructured than structured data:
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
- How can we tell what is a good scientific paper?
  - Content-based:
    * read it and see if it is interesting (hard for a computer)
    * compare it to other things you have read and liked
  - Context-based: citation analysis
    * see who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Why is it so difficult
atilde Pronunciation depends on context
acirc The same word will be pronounced differently in different sentences
atilde Speaker variability
acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)
atilde Many many rare events
acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database
Review and Conclusions 16
Two steps in a TTS system
1 Linguistic Analysis
atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation
acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai
2 Speech Synthesis
atilde Find the pronunciationatilde Generate soundsatilde Add intonation
Review and Conclusions 17
Speech Synthesis
atilde Articulatory Synthesis Attempt to model human articulation
atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly
atilde Concatenative Synthesis Synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
atilde Excitement Fast very high pitch loud
atilde Hot anger Fast high pitch strong falling accent loud
atilde Fear Jitter
atilde Sarcasm Prolonged accent late peak
atilde Sad Slow low pitch
The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)(English computer science students various studies)
Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)
atilde Reading gtgt SpeakingHearing gtgt Typing
rArr Speech for inputrArr Text for output
Review and Conclusions 20
The Telephone
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
atilde All share some characteristics of speech and text
atilde Usage norms not fixed
atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip
atilde Large scale discourse studies still to be done
atilde Some genuinely new things
acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship
Review and Conclusions 23
Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
  Speech-like                      Text-like
  time-bound                       space-bound
  spontaneous                      contrived
  face-to-face                     visually decontextualized
  loosely structured               elaborately structured
  socially interactive (comments)  factually communicative
  immediately revisable            repeatedly revisable
  prosodically rich                graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background
  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  - lusers (users, as seen by systems administrators)
  - suits
• Need to be in-group
  cow orker = co-worker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny
  clues if you smeared your body with clue musk and did the clue mating dance."
  Edward Flaherty (talk.bizarre)
Review and Conclusions 28
• Gricean Principles
  - Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
  - GREAT, *great*, _great_
  - :-)  (^_^)  …
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis

• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems

• Versioning file systems
  - every time a file is saved, a new copy is stored
• CVS, Subversion, Git
  - changes to a collection of files are tracked
  - simultaneous changes are merged
• Revision Tracking
  - revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
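At its core, revision tracking is diffing: a minimal sketch using only Python's standard difflib (the file contents are invented), showing the kind of change record that systems like CVS, Subversion, and Git compute:

```python
import difflib

# Two hypothetical revisions of the same file.
rev1 = ["Wikipedia is an encyclopedia.\n", "It has five pillars.\n"]
rev2 = ["Wikipedia is a free online encyclopedia.\n", "It has five pillars.\n"]

# A unified diff records exactly what changed between revisions,
# which makes responsibility for each change explicit.
diff = list(difflib.unified_diff(rev1, rev2,
                                 fromfile="article@r1", tofile="article@r2"))
for line in diff:
    print(line, end="")
```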
Review and Conclusions 31
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every
  single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge, and people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
   • Content policies: NPOV, Verifiability, and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?

1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet is more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of markup: Visual vs. Logical
  (WYSIWYG, WYSIAYG, WYSIWYM)
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query
  or accessed only through a form
Unlinked content: pages which are not linked to by other pages (no inbound links
  for crawlers to follow)
Private Web: sites that require registration and login (e.g. edveNTUre)
Contextual Web: pages with content varying for different access contexts (e.g.
  ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g.
  using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript,
  as well as content dynamically downloaded from Web servers via Flash or Ajax
  solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video)
  files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs. Logical Markup

• Visual Markup (Presentational)
  - what you see is what you get (WYSIWYG)
  - equivalent of printers' markup
  - shows what things look like
• Logical Markup (Structural)
  - shows the structure and meaning
  - can be mapped to visual markup
  - less flexible than visual markup
  - more adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  - population and exact hit counts are unknown → no statistics possible
  - indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus
  from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions about the included text types
  - (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller
  2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the
  early 2000s)
  - it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than:
  251,000,000 ghits vs. 71,500,000, or 3.5:1; accessed 2012-04-04)
• GPB, "Ghits per billion documents", is good if you want a more stable number
  (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
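The slide's own counts can be turned into a ratio (and an illustrative GPB; the index size below is a made-up placeholder, since Google no longer publishes one):

```python
# "different from" vs "different than" ghit counts (accessed 2012-04-04).
ghits_from = 251_000_000
ghits_than = 71_500_000

# A unitless ratio is more stable than either raw count.
ratio = ghits_from / ghits_than
print(f"{ratio:.1f}:1")  # prints 3.5:1

# GPB needs the total index size, which is unknown; this figure is
# purely illustrative.
assumed_index_size = 40_000_000_000  # documents (invented)
gpb_from = ghits_from / (assumed_index_size / 1_000_000_000)
```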
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora
  (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with the available tools (here:
    taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long,
    word lists)
  - do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select seed words (500)
2. Combine them to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Post-process the data (encoding cleanup, tagging, and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web
as Corpus. Bologna, 2006
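The first two steps of the pipeline can be sketched as follows (the seed words and sample sizes are toy placeholders; querying, downloading, and post-processing would need a search API and cleanup tools, so they are left as a loop stub):

```python
from itertools import combinations
import random

# Step 1: seed words (a tiny stand-in for the ~500 used in practice).
seeds = ["language", "internet", "corpus", "speech", "text", "web"]

# Step 2: combine seeds into multi-word queries; 4-tuples drawn from
# 500 seeds easily yield the thousands of queries the pipeline needs.
rng = random.Random(0)  # fixed seed so the sample is reproducible
all_tuples = list(combinations(seeds, 4))
queries = [" ".join(q) for q in rng.sample(all_tuples, 6)]

# Steps 3-5 (search, download, post-process) are stubbed here.
for q in queries:
    print(q)
```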
http://corpus.leeds.ac.uk/internet.html 47
Internet Corpora Summary
• The web can be used as a corpus
  - Direct access
    ∗ fast and convenient
    ∗ huge amounts of data
    ⊗ unreliable counts
  - Web sample
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ⊗ less data
• Richer data than a compiled corpus
  ⊗ less balanced, less markup
48
Text and Meta-text

• Explicit Meta-data
  - Keywords and Categories
  - Rankings
  - Structural Markup
• Implicit Meta-data
  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  - when they are accurate, they are very good
  - they are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations — Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it
  (meta-data may be unreliable)
  - Linguistically-grounded methods
    ∗ diacritics
    ∗ character n-grams
    ∗ stop words
  - Similarity-based categorisation and classification
    ∗ character n-gram rankings
  - Machine-learning-based methods
  - Context (under .jp or .kr)
• Hard to do for short text snippets, similar languages, mixed text
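The character-n-gram-ranking idea can be sketched as a tiny rank-order classifier in the style of Cavnar and Trenkle (the two "training" texts are invented toy data; real systems need far larger samples per language, which is part of why short snippets and similar languages are hard):

```python
from collections import Counter

def profile(text, n=3, top=300):
    """Ranked character n-gram profile of a text."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Rank-order ('out of place') distance between two profiles."""
    ranks = {g: r for r, g in enumerate(lang_profile)}
    penalty = len(lang_profile)  # cost for n-grams unseen in the model
    return sum(ranks.get(g, penalty) for g in doc_profile)

def identify(text, models):
    return min(models, key=lambda lang: out_of_place(profile(text), models[lang]))

# Toy 'training data' for two languages.
models = {
    "en": profile("the cat sat on the mat and the dog ate the food in the house"),
    "nl": profile("de kat zat op de mat en de hond at het eten in het huis"),
}
print(identify("the dog sat in the house", models))
```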
Review and Conclusions 53
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
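Two of these steps, number normalization and stemming, can be sketched with crude rules built only for the slide's examples (the regex patterns and suffix list are invented; a real system would use a proper normalizer and a stemmer such as Porter's):

```python
import re

def normalize_number(text):
    """Map a few money formats onto a plain integer (illustrative only)."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)([KkMm]?)", text)
    if m:
        scale = {"": 1, "k": 1_000, "m": 1_000_000}[m.group(2).lower()]
        return round(float(m.group(1)) * scale)
    m = re.fullmatch(r"(\d+(?:\.\d+)?) million dollars", text)
    if m:
        return round(float(m.group(1)) * 1_000_000)
    raise ValueError(text)

def stem(word):
    """Crude suffix stripping in the spirit of the slide's examples."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(normalize_number("$700K"), normalize_number("0.7 million dollars"))
print(stem("produces"), stem("producer"))
```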
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically
  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - is a W3C Recommendation
• Can be validated for syntactic well-formedness
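Well-formedness needs no schema: any XML parser can check that every tag is closed and properly nested. A minimal sketch with Python's standard library and self-defined (invented) tags:

```python
import xml.etree.ElementTree as ET

good = "<lexeme><lemma>produce</lemma><pos>verb</pos></lexeme>"
bad = "<lexeme><lemma>produce</pos></lexeme>"  # mismatched closing tag

def well_formed(doc):
    try:
        ET.fromstring(doc)  # parsing succeeds only on well-formed XML
        return True
    except ET.ParseError:
        return False

print(well_formed(good), well_formed(bad))  # prints True False
```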
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  - triples of <subject, predicate, object>
  - each element is a URI
  - RDF is written in well-defined XML
  - you can say anything about anything
• You can build relations into ontologies (OWL)
  - then reason over them, search them, …
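Stripped of its XML serialization, RDF data is just a set of triples that can be pattern-matched. A toy sketch with invented example URIs (a real system would use an RDF library and SPARQL):

```python
EX = "http://example.org/"  # invented namespace for the example

triples = {
    (EX + "bond", EX + "teaches", EX + "HG2052"),
    (EX + "HG2052", EX + "about", EX + "language-technology"),
}

def query(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None is a wildcard."""
    return {t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])}

# What does bond teach?
for _, _, obj in query(triples, s=EX + "bond", p=EX + "teaches"):
    print(obj)  # prints http://example.org/HG2052
```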
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible (know thyself)
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP
61
Citation, Reputation, and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  - Content-based
    ∗ read it and see if it is interesting (hard for a computer)
    ∗ compare it to other things you have read and liked
  - Context-based: Citation Analysis
    ∗ see who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of
  the published work of a scientist, scholar, or research group
• Some scores are:
  - total number of citations (pretty useful)
  - total number of citations minus self-citations
  - total of (citations / number of authors)
  - average of (citation impact factor / number of authors)
• Problems:
  - not all citations are equal: citations by 'good' papers are better
  - newer publications suffer in relation to older ones
• Solution: weight citations by the quality of the citing paper
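The first three scores are simple sums; computed here over invented toy data (two hypothetical papers):

```python
# Each paper: total citation count, number of authors, self-citations.
papers = [
    {"citations": 40, "authors": 2, "self_citations": 5},
    {"citations": 10, "authors": 5, "self_citations": 1},
]

total = sum(p["citations"] for p in papers)
total_minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
per_author = sum(p["citations"] / p["authors"] for p in papers)

print(total, total_minus_self, per_author)  # prints 50 44 22.0
```

Note how the per-author score penalizes the five-author paper, one of the corrections the slide lists.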
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
  - break research into small chunks to increase the number of publications
  - sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator
     of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from
     every page to a page containing a copyright notice.)
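Harvesting (target, anchor text) pairs is the basic operation behind anchor-text analysis; a crude regex sketch over an invented page fragment (a real crawler would use a proper HTML parser):

```python
import re

html = ('For more information about Language Technology and the Internet '
        'see the <a href="http://example.org/HG803">HG803 Course Page</a>')

# Pull out (href target, anchor text) pairs from the markup.
links = re.findall(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', html, re.S)
print(links)
```

Here the anchor text "HG803 Course Page" would be indexed as a description of the target page, exactly the second intuition above.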
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  - simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  - a high inlink count does not necessarily mean high quality …
  - … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  - an article's vote is weighted according to its citation impact
  - this can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  - start at a random page
  - at each step, go out of the current page along one of the links on that page,
    equiprobably
• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each
  with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  - for example, if the page has 4 outgoing links, randomly choose one with
    probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
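The random walk with teleportation can be sketched as power iteration over a tiny invented graph (4 pages, one dead end, 10% teleportation as above):

```python
N = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}  # page 3 is a dead end
teleport = 0.10

rank = [1.0 / N] * N  # start from the uniform distribution
for _ in range(100):
    new = [0.0] * N
    for page, outs in links.items():
        if outs:
            # With 90% probability follow one of the outlinks equiprobably,
            # with 10% probability teleport to any page.
            for q in outs:
                new[q] += (1 - teleport) * rank[page] / len(outs)
            for q in range(N):
                new[q] += teleport * rank[page] / N
        else:
            # Dead end: jump to a random page (probability 1/N each),
            # independently of the teleportation rate.
            for q in range(N):
                new[q] += rank[page] / N
    rank = new

print([round(r, 3) for r in rank])
```

The ranks always sum to 1 (they are long-term visit rates), and the dead-end page keeps a small but non-zero rank thanks to teleportation.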
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph, Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam
  takes advantage of link-based ranking algorithms, which give websites higher
  rankings the more other highly-ranked websites link to them. Examples include
  adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other,
  also known humorously as mutual admiration societies.
• Scraper Sites: "scrape" search-engine results pages or other sources of content to
  create "content" for a website. The specific presentation of content on these sites
  is unique, but is merely an amalgamation of content taken from other sources,
  often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user
  editing, such as wikis, blogs, and guestbooks. Agents can be written that
  automatically and randomly select a user-edited web page, such as a Wikipedia
  article, and add spamming links.

• The nofollow link: a value that can be assigned to the rel attribute of an HTML
  hyperlink to instruct some search engines that the hyperlink should not influence
  the link target's ranking in the search engine's index.
  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
74
Current Status
• There is a continuous battle between:
  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  - metadata about the object is stored with the DOI name
  - the metadata includes a location, such as a URL
  - the DOI for a document is permanent; the metadata may change
  - gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies
  coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some
  4,000 organizations
  - DOI: 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u/
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  - passively, for information access
  - actively, for collaboration
• The internet is interesting in and of itself
  - as a source of data about existing language
  - as a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An
  Introduction to Natural Language Processing, Computational Linguistics and
  Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural
  Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information
  Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python;
  introduces both programming and linguistics
• HG3051 Corpus Linguistics — (pre-req HG2051, waived) an introduction to the
  fast-growing field of corpus linguistics
Review and Conclusions 81
Two steps in a TTS system
1. Linguistic Analysis
   • Sentence segmentation
   • Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
   • Word segmentation
     - 森山前日銀総裁 → Moriyama zen-Nichigin sousai
       (former Bank of Japan governor Moriyama)
     ⊗ 森山前日銀総裁 → Moriyama zennichi gin sousai
       (wrong segmentation)
2. Speech Synthesis
   • Find the pronunciation
   • Generate sounds
   • Add intonation
Review and Conclusions 17
Speech Synthesis
• Articulatory Synthesis: attempt to model human articulation
• Formant Synthesis: bypass modeling of human articulation and model the
  acoustics directly
• Concatenative Synthesis: synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
• Excitement: fast, very high pitch, loud
• Hot anger: fast, high pitch, strong falling accent, loud
• Fear: jitter
• Sarcasm: prolonged accent, late peak
• Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality"
but natural-sounding prosody (intonation and duration).
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)(English computer science students various studies)
Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)
atilde Reading gtgt SpeakingHearing gtgt Typing
rArr Speech for inputrArr Text for output
Review and Conclusions 20
The Telephone
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
atilde All share some characteristics of speech and text
atilde Usage norms not fixed
atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip
atilde Large scale discourse studies still to be done
atilde Some genuinely new things
acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship
Review and Conclusions 23
Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
atilde Inspired by shared background
acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits
atilde Need to be in-group
cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)
Review and Conclusions 28
atilde Gricean Principals
acirc Posting (topmiddlebottom quoting and trimming)
atilde Inspired by medium limitations
acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikisatilde Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)
atilde Wikipedia makes it easy to share your knowledgepeople like to do this
atilde Most edits are done by insiders
atilde Most content is added by outsiders
atilde Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
atilde Content policies NPOV Verifiability and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
WikipediaFivepillars 33
Licenses and Ownership
atilde Copyright
atilde Copyleft
atilde Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
WikipediaGood_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
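Several of these steps (segmentation, stop-word stripping, crude stemming) can be chained into a toy pipeline; the stop-word list and suffix rules below are illustrative stand-ins for a real lemmatizer or a Porter-style stemmer:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # tiny illustrative list

def normalize(text):
    # segment into lowercase tokens
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    # strip stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # crude suffix-stripping "stemmer"
    stemmed = []
    for t in tokens:
        for suffix in ("es", "er", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(normalize("The producer produces the products"))
# ['produc', 'produc', 'product']
```

The output reproduces the slide's stemming example: producer and produces both collapse to produc, which is not a word; lemmatization would instead map produces to produce.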
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
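Well-formedness checking of the kind mentioned above is available in Python's standard library; the sample documents are invented:

```python
from xml.etree import ElementTree

def is_well_formed(xml_text):
    """Return True if the string parses as well-formed XML."""
    try:
        ElementTree.fromstring(xml_text)
        return True
    except ElementTree.ParseError:
        return False

print(is_well_formed("<talk><title>Review</title></talk>"))  # True
print(is_well_formed("<talk><title>Review</talk>"))          # False: mismatched tags
```

Well-formedness (every tag closed and properly nested) is weaker than validity against a schema, but it is the minimum a parser needs.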
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything
• You can build relations into ontologies (OWL)
  - Then reason over them, search them, …
Review and Conclusions 59
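The triple model can be sketched with plain Python tuples; the URIs below are invented for illustration, and a real application would use an RDF library and published vocabularies:

```python
# Each statement is a <subject, predicate, object> triple.
triples = [
    ("http://example.org/bond", "http://example.org/teaches",
     "http://example.org/HG2052"),
    ("http://example.org/HG2052", "http://example.org/title",
     "Language Technology and the Internet"),
    ("http://example.org/bond", "http://example.org/worksAt",
     "http://example.org/NTU"),
]

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Everything we have said about "bond":
for t in query(s="http://example.org/bond"):
    print(t)
```

Because every statement has the same shape, triples from different sources can simply be concatenated and queried together, which is the "integrating multiple sources" point above.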
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1 People lie
2 People are lazy
3 People are stupid
4 Mission: Impossible (know thyself)
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked
  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)
• Problems:
  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
Review and Conclusions 64
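The first three scores listed above can be computed directly once citation counts are known; the paper records below are invented for illustration:

```python
# One record per paper by the researcher being scored (invented data).
papers = [
    {"citations": 120, "self_citations": 10, "authors": 3},
    {"citations": 45,  "self_citations": 5,  "authors": 2},
    {"citations": 8,   "self_citations": 2,  "authors": 1},
]

total = sum(p["citations"] for p in papers)
minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
per_author = sum(p["citations"] / p["authors"] for p in papers)

print(total)                  # 173
print(minus_self)             # 156
print(round(per_author, 1))   # 40 + 22.5 + 8 = 70.5
```

Note how the per-author score changes the ranking of the three-author paper relative to the single-author one, which is exactly why the choice of score matters.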
Gaming Citations
• Least/Minimum Publishable Unit
  - break research into small chunks to increase the number of citations
  - sometimes there is very little new information
• Self citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high Impact Factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie
SCC, the Strongly Connected Core: you can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>
For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>
• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
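Extracting (href, anchor text) pairs, the raw material for anchor-text analysis, can be sketched with Python's standard html.parser; the sample link is invented:

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:   # only collect text inside an <a>
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorExtractor()
parser.feed('See the <a href="http://example.com/HG803">HG803 Course Page</a>.')
print(parser.links)  # [('http://example.com/HG803', 'HG803 Course Page')]
```

A crawler would run this over page A and attach the anchor text to page B's index entry, implementing intuition 2 above.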
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  - simplest measure: each article gets one vote (not very accurate)
• On the web, citation frequency = inlink count
  - a high inlink count does not necessarily mean high quality …
  - … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  - an article's vote is weighted according to its citation impact
  - this can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  - start at a random page
  - at each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting: to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  - for example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
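The random walk with teleporting can be implemented as power iteration over a small link graph; the three-page graph below is invented, and 10% is the teleportation rate from the slide:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration for PageRank with teleporting.
    links: dict mapping each page to the list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if out:
                # teleport with probability 10%, to each page with prob 0.1/N
                for q in pages:
                    new[q] += rank[p] * teleport / n
                # otherwise follow one of the outlinks equiprobably
                for q in out:
                    new[q] += rank[p] * (1 - teleport) / len(out)
            else:
                # dead end: always jump uniformly at random
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(ranks)
```

After enough iterations the ranks stop changing: that fixed point is the steady-state visit rate of the slide, and the values sum to 1 because probability mass is only moved around, never created.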
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.
The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
74
Current Status
• There is a continuous battle between:
  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital Object Identifier
• DOI: a string used to uniquely identify an electronic document or object
  - metadata about the object is stored with the DOI name
  - the metadata includes a location, such as a URL
  - the DOI for a document is permanent; the metadata may change
  - gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u/
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  - passively, for information access
  - actively, for collaboration
• The internet is interesting in and of itself
  - as a source of data about existing language
  - as a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics (pre-requisite HG2051, waived): an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Speech Synthesis
• Articulatory Synthesis: attempt to model human articulation
• Formant Synthesis: bypass modeling of human articulation and model the acoustics directly
• Concatenative Synthesis: synthesize from stored units of actual speech
Review and Conclusions 18
Prosody of Emotion
• Excitement: fast, very high pitch, loud
• Hot anger: fast, high pitch, strong falling accent, loud
• Fear: jitter
• Sarcasm: prolonged accent, late peak
• Sad: slow, low pitch
The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration)
Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word = 6 characters)
(English computer science students, various studies)

Reading   300       200 (proof reading)
Writing    31        21 (composing)
Speaking  150
Hearing   150       210 (speeded up)
Typing     33        19 (composing)

• Reading >> Speaking; Hearing >> Typing
⇒ Speech for input
⇒ Text for output
Review and Conclusions 20
The Telephone
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
• All share some characteristics of speech and text
• Usage norms not fixed
• Communication methods may disappear before the norms are fixed (Usenet, Bulletin Boards, Gopher, Archie, …)
• Large-scale discourse studies still to be done
• Some genuinely new things:
  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship
Review and Conclusions 23
Email (asynchronous)

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 26
Blogs
Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background
  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users as seen by Systems Administrators)
  - suits
• Need to be in-group
cow orker: co-worker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance" Edward Flaherty (talk.bizarre)
Review and Conclusions 28
• Gricean Principles
  - Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
  - GREAT, *great*, _great_
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis
• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems
• Versioning file systems
  - every time a file is saved, a new copy is stored
• CVS, Subversion, Git
  - changes to a collection of files are tracked
  - simultaneous changes are merged
• Revision Tracking
  - revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge; people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
• Content policies: NPOV, Verifiability and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
Wikipedia:Five_pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)
• The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWYM
• The structure of the Web: hypertext, pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages (but clicking links them)
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  - what you see is what you get (WYSIWYG)
  - equivalent of printers' markup
  - shows what things look like
• Logical Markup (Structural)
  - shows the structure and meaning
  - can be mapped to visual markup
  - less flexible than visual markup
  - more adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: the search engine as query tool and the WWW as corpus (Objection: results are not reliable)
  - population and exact hit counts are unknown → no statistics possible
  - indexing does not allow one to draw conclusions about the data
  ✗ Google is missing functionalities that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ✗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  - it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
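The ratio computation is simple arithmetic on the counts from the slide; the index size used for the GPB figure is a made-up stand-in, since Google no longer publishes it:

```python
# Counts from the slide (accessed 2012-04-04):
different_from = 251_000_000  # ghits for "different from"
different_than = 71_500_000   # ghits for "different than"

ratio = different_from / different_than
print(f"{ratio:.1f}:1")  # 3.5:1

# Ghits per billion documents needs the index size; 40 billion pages here
# is a hypothetical figure purely for illustration.
index_size = 40_000_000_000
gpb = different_from / (index_size / 1_000_000_000)
print(f"{gpb:,.0f} GPB")
```

The ratio is the number worth reporting: both raw counts drift as the index changes, but their quotient is far more stable.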
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6,000)
3 Query a search engine and retrieve the URLs (50,000)
4 Download the files from the URLs (100,000,000 words)
5 Postprocess the data (encoding cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html 47
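Steps 1-2 of the pipeline can be sketched as follows; the seed words and sample sizes are illustrative, not Sharoff's actual lists:

```python
import itertools
import random

# Step 1: a handful of medium-frequency seed words (invented examples;
# the real pipeline uses ~500 per language).
seeds = ["language", "internet", "people", "often", "because", "different"]

# Step 2: combine seeds into multi-word queries and sample a subset.
random.seed(0)
pairs = list(itertools.combinations(seeds, 2))  # all candidate 2-word queries
queries = random.sample(pairs, 5)               # sample the queries to send
for q in queries:
    print(" ".join(q))  # each line would go to a search engine API (step 3)
```

Combining seeds (rather than querying them singly) pushes the search engine toward running text in the target language instead of word lists, which is exactly what the later filtering step also tries to exclude.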
Internet Corpora Summary
• The web can be used as a corpus
  - Direct access
    * fast and convenient
    * huge amounts of data
    ✗ unreliable counts
  - Web sample
    * control over the sample
    * some setup costs (semi-automated)
    ✗ less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
Prosody of Emotion
• Excitement: fast, very high pitch, loud
• Hot anger: fast, high pitch, strong falling accent, loud
• Fear: jitter
• Sarcasm: prolonged accent, late peak
• Sad: slow, low pitch

“The main determinant of ‘naturalness’ in speech synthesis is not ‘voice quality’ but natural-sounding prosody (intonation and duration)”
– Richard Sproat
Review and Conclusions 19
Speed is different for different modalities
Speed in words per minute (one word = 6 characters; English computer-science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading ≫ Speaking; Hearing ≫ Typing

⇒ Speech for input
⇒ Text for output
Review and Conclusions 20
The Telephone
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
• All share some characteristics of speech and of text
• Usage norms are not fixed
• Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …
• Large-scale discourse studies still to be done
• Some genuinely new things:
– time-lagged multi-person conversation
– raw, un-edited text
– extreme multi-authorship
Review and Conclusions 23
Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 26
Blogs
Speech-like                  Text-like
time bound                   space bound
spontaneous                  contrived
face-to-face                 visually decontextualized
loosely structured           elaborately structured
socially interactive (comments)  factually communicative
immediately revisable        repeatedly revisable
prosodically rich            graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background
– <rant>I can’t stand this</rant>
– I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
– lusers (users, as seen by systems administrators)
– suits
• Need to be in-group:
cow orker: coworker
clue: “You couldn’t get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance”
– Edward Flaherty (talk.bizarre)
Review and Conclusions 28
• Gricean principles
– Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
– GREAT, great, great
– :-) (^_^) (o)
– brb, RTFM, IMHO
– lower case
– lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis

• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems
• Versioning file systems
– every time a file is opened, a new copy is stored
• CVS, Subversion, Git
– changes to a collection of files are tracked
– simultaneous changes are merged
• Revision Tracking
– revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge, and people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
• Content policies: NPOV, Verifiability, and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?

1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of markup: Visual vs. Logical (WYSIWYG, WYSIAYG, WYSIWIM)
• The structure of the Web: hypertext, pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages (but clicking links them)
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
– What you see is what you get (WYSIWYG)
– Equivalent of printers’ markup
– Shows what things look like
• Logical Markup (Structural)
– Shows the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and WWW as corpus (Objection: results are not reliable)
– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
– known size and exact hit counts → statistics possible
– people can draw conclusions about the included text types
– (limited) control over the content
⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help, but:
– lots of repetitions of the same text (not representative)
– very limited query precision (no upper/lower case, no punctuation)
– only estimated counts, often hard to reproduce exactly
– different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or “google hit” is the most common unit used to count web snippets (in the early 2000s)
– it is document frequency, not term frequency
• whit or “web hit” is the more general term
• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs. different than): 251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, for “Ghits per billion documents”, is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
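The ratio comparison above can be sketched in a couple of lines of Python, using the slide’s own example figures for different from vs. different than:

```python
# Comparing two web counts as a unitless ratio, using the ghit
# figures quoted above (accessed 2012-04-04).
different_from = 251_000_000
different_than = 71_500_000

ratio = different_from / different_than
print(f"{ratio:.1f}:1")  # 3.5:1
```

Reporting the ratio (plus the access date) rather than the raw counts keeps the comparison meaningful even as the index grows.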
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
– gather documents for different topics (balance)
– exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
– exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
– do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select seed words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
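Steps 1–2 of the pipeline above can be sketched as follows. This is a toy illustration: the seed words are invented, the list is far smaller than the ~500 seeds used in practice, and a real crawl would then send each query to a search engine API to retrieve URLs (step 3):

```python
import itertools
import random

# Toy seed words; a real pipeline would use ~500 of them.
seeds = ["language", "internet", "corpus", "data", "web", "speech"]

def make_queries(seeds, words_per_query=3, n_queries=10, rng_seed=0):
    """Randomly sample combinations of seed words to use as queries."""
    rng = random.Random(rng_seed)
    combos = list(itertools.combinations(seeds, words_per_query))
    rng.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

queries = make_queries(seeds)
print(len(queries), "queries, e.g.:", queries[0])
```

Combining several seed words per query biases results toward running text (rather than word lists), which is exactly the filtering concern mentioned on the previous slide.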
Internet Corpora Summary
• The web can be used as a corpus
– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts
– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text

• Explicit Meta-data
– Keywords and Categories
– Rankings
– Structural Markup
• Implicit Meta-data
– Links and Citations
– Tags
– Tables
– File Names
– Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
– When they are accurate, they are very good
– They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
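Explicit metadata of the kind listed above can be pulled out of an HTML page with the standard library alone. A minimal sketch (the sample HTML and its keyword/author values are invented for illustration):

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.meta[a["name"]] = a["content"]

html = """<html><head>
<meta name="keywords" content="language, internet">
<meta name="author" content="A. Writer">
</head><body>Hello</body></html>"""

p = MetaExtractor()
p.feed(html)
print(p.meta)
```

Of course, as the slide warns, nothing guarantees that what the page author wrote here is accurate.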
Implicit Metadata
• You can get clues from metadata within documents
– as they are unintended, they tend to be noisy
– but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations: Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
– Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words
– Similarity-based categorisation and classification
∗ Character n-gram rankings
– Machine-learning-based methods
– Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
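The character-n-gram idea above can be sketched as a toy classifier. The training strings below are tiny invented samples (a real identifier would train n-gram profiles on much more data, and would use ranked profiles rather than raw overlap):

```python
from collections import Counter

def profile(text, n=2):
    """Frequency profile of character n-grams in a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def score(text, prof):
    """Overlap between a text's n-gram counts and a language profile."""
    return sum(min(c, prof[g]) for g, c in profile(text).items())

training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}
profiles = {lang: profile(t) for lang, t in training.items()}

def identify(text):
    """Pick the language whose profile overlaps the text most."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))

print(identify("the dog and the fox"))
print(identify("der hund und der fuchs"))
```

Even this toy version shows why short snippets and closely related languages are hard: with little text, the n-gram overlap between candidate languages is small and noisy.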
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon cel
Review and Conclusions 54
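The stemming examples above (producer → produc, produces → produc) can be mimicked by a toy suffix-stripping stemmer. This is a sketch only: real systems use the Porter stemmer or a full lemmatizer, and the suffix list here is an invented minimum:

```python
# Strip a few common English suffixes, longest-match first, but only
# if a reasonably long stem remains.
SUFFIXES = ["ing", "es", "er", "ed", "s"]

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(stem("produces"))  # produc
print(stem("producer"))  # produc
print(stem("producing")) # produc
```

Note that the output need not be a real word; stemming only conflates related forms, whereas lemmatization maps them to a dictionary form (produce).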
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
– provides a common data representation framework
– makes it possible to integrate multiple sources
– so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
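Checking syntactic well-formedness, as mentioned above, needs nothing beyond the standard library. A small sketch (the sample documents are invented):

```python
import xml.etree.ElementTree as ET

def well_formed(xml_string):
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course>"))         # False: mismatched tag
```

Well-formedness (every tag closed, properly nested) is distinct from validity against a schema or DTD, which additionally constrains which tags may appear where.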
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
– Uniform Resource Name: urn:isbn:1575864606
– Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework (RDF)
– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything
• You can build relations into ontologies (OWL)
– Then reason over them, search them, …
Review and Conclusions 59
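The triple-plus-reasoning idea can be sketched with plain tuples. The identifiers below are invented for illustration (a real system would use full URIs, an RDF library, and an OWL reasoner); the reasoning step is simple transitive closure over subClassOf:

```python
# RDF-style triples as (subject, predicate, object) tuples.
triples = {
    ("ex:Poodle", "rdfs:subClassOf", "ex:Dog"),
    ("ex:Dog", "rdfs:subClassOf", "ex:Animal"),
    ("ex:fido", "rdf:type", "ex:Poodle"),
}

def classes_of(entity, triples):
    """All classes an entity belongs to, following subClassOf links."""
    classes = {o for s, p, o in triples if s == entity and p == "rdf:type"}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p == "rdfs:subClassOf" and s in classes and o not in classes:
                classes.add(o)
                changed = True
    return classes

print(classes_of("ex:fido", triples))
# contains ex:Poodle, ex:Dog and ex:Animal
```

The point of the slide is exactly this: once knowledge is in triples, new facts (fido is an Animal) fall out mechanically, without anyone having stated them.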
Criticism of the Semantic Web
Doctorow’s seven insurmountable obstacles to reliable metadata:

1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren’t neutral
6. Metrics influence results
7. There’s more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
– Total number of citations (pretty useful)
– Total number of citations minus self-citations
– Total number of (citations / number of authors)
– Average (citation impact factor / number of authors)
• Problems:
– Not all citations are equal: citations by ‘good’ papers are better
– Newer publications suffer in relation to older ones
• Weight citations by the quality of the paper
Review and Conclusions 64
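The first few scores in the list above are easy to compute once the citation network is in hand. A sketch on an invented three-paper network (papers, authors and citation links are all made up for illustration):

```python
# Each paper records its authors and the papers it cites.
papers = {
    "p1": {"authors": ["ann", "bob"], "cites": ["p2", "p3"]},
    "p2": {"authors": ["ann"], "cites": ["p3"]},
    "p3": {"authors": ["cho"], "cites": []},
}

def citations(pid):
    """Total number of citations a paper receives."""
    return sum(pid in p["cites"] for p in papers.values())

def citations_minus_self(pid):
    """Citations, excluding those from papers sharing an author."""
    my_authors = set(papers[pid]["authors"])
    return sum(
        pid in p["cites"] and not my_authors & set(p["authors"])
        for p in papers.values()
    )

def per_author(pid):
    """Citations divided by the number of authors."""
    return citations(pid) / len(papers[pid]["authors"])

print(citations("p3"), citations_minus_self("p3"), per_author("p3"))
```

Here p2’s single citation comes from a co-author (ann), so its self-citation-corrected score drops to zero, illustrating why the corrected scores can differ sharply from the raw counts.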
Gaming Citations
• Least/Minimum Publishable Unit
– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information
• Self-citation, in-group citation
• Write only proceedings (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC (Strongly Connected Core): you can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
– Simplest measure: each article gets one vote (not very accurate)
• On the web, citation frequency = inlink count
– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
– An article’s vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page’s PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
– For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: “jumping” from a dead end is independent of the teleportation rate
Review and Conclusions 70
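The random walk with teleportation described above can be computed by power iteration: repeatedly redistribute each page’s rank according to the walk’s transition probabilities until the values settle. A minimal sketch on an invented four-page link graph:

```python
# Invented link graph; "D" is a dead end.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],
}

def pagerank(links, teleport=0.10, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:
                # Dead end: jump to a random page, probability 1/N each.
                for q in pages:
                    new[q] += rank[p] / n
            else:
                # Teleport with probability 0.10, to each page 0.10/N.
                for q in pages:
                    new[q] += teleport * rank[p] / n
                # Otherwise follow one of the outlinks equiprobably.
                for q in out:
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
```

Because probability mass is only redistributed, the ranks always sum to 1, matching the interpretation of PageRank as a steady-state visit probability.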
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index
– Google does not index the target of a link marked nofollow
– Yahoo does not include the link in its ranking
– …
74
Current Status
• There is a continuous battle between:
– search companies, who want to get the most useful page to the user
– page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
– DOI 10.1007/s10579-008-9062-z:
http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
– Passively, for information access
– Actively, for collaboration
• The internet is interesting in and of itself
– As a source of data about existing language
– As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics: (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Speed is different for different modalities
Speed in words per minute (one word is 6 characters)(English computer science students various studies)
Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)
atilde Reading gtgt SpeakingHearing gtgt Typing
rArr Speech for inputrArr Text for output
Review and Conclusions 20
The Telephone
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
atilde All share some characteristics of speech and text
atilde Usage norms not fixed
atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip
atilde Large scale discourse studies still to be done
atilde Some genuinely new things
acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship
Review and Conclusions 23
Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
atilde Inspired by shared background
acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits
atilde Need to be in-group
cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)
Review and Conclusions 28
atilde Gricean Principals
acirc Posting (topmiddlebottom quoting and trimming)
atilde Inspired by medium limitations
acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikisatilde Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)
atilde Wikipedia makes it easy to share your knowledgepeople like to do this
atilde Most edits are done by insiders
atilde Most content is added by outsiders
atilde Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
atilde Content policies NPOV Verifiability and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
WikipediaFivepillars 33
Licenses and Ownership
atilde Copyright
atilde Copyleft
atilde Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
WikipediaGood_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
➤ You can get information from metadata within documents
➢ When they are accurate, they are very good
➢ They are often deceitful
➤ HTML, PDF, Word, … metadata
➤ Keywords and Tags
➤ Rankings
➤ Links and Citations
➤ Structural Markup
50
Implicit Metadata
➤ You can get clues from metadata within documents
➢ as they are non-intended, they tend to be noisy
➢ but they are rarely deceitful
➤ HTML tags as constituent boundaries
➤ Tables as Semantic Relations
➤ File Names (content type and language)
➤ Translations — Bracketed Glosses, Cross-lingual Disambiguation
➤ Query Data
➤ Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
➤ Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
➢ Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words
➢ Similarity-based categorisation and classification
∗ Character n-gram rankings
➢ Machine Learning based methods
➢ Context (under .jp or .ko)
➤ Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
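A toy sketch of the character n-gram idea: build a trigram frequency profile per language from a small sample, then pick the language whose profile best overlaps the snippet's. Real identifiers train on far more text and use more careful similarity measures; the two "training" sentences here are illustrative only:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram frequency profile, with word-edge padding."""
    text = f"_{text.lower()}_"
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(profile_a, profile_b):
    """Overlap between two n-gram profiles (sum of shared minimum counts)."""
    shared = set(profile_a) & set(profile_b)
    return sum(min(profile_a[g], profile_b[g]) for g in shared)

# Tiny toy 'training' texts; real systems use much larger samples.
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog the end"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund"),
}

def identify(snippet):
    """Return the language whose profile best matches the snippet."""
    p = char_ngrams(snippet)
    return max(profiles, key=lambda lang: similarity(profiles[lang], p))

print(identify("the dog jumps"))  # prints: en
```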
Normalization
➤ Extracting text from various documents
➤ Segmenting continuous text
➤ Number Normalization: $700K, $700,000, 0.7 million dollars, …
➤ Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
➤ Stripping stop words (the, a, of, …)
➤ Lemmatization: produces → produce
➤ Stemming: producer → produc, produces → produc
➤ Decompounding: zonnecel → zon + cel
Review and Conclusions 54
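Two of these steps can be sketched in a few lines of standard-library Python; the regex and the toy suffix list are illustrative simplifications (a real stemmer, such as Porter's, has many more rules):

```python
import re

def normalize_number(s):
    """Map notations like '$700K' and '$700,000' to a plain integer."""
    m = re.fullmatch(r"\$?([\d,.]+)\s*([KkMm]?)", s.strip())
    if not m:
        raise ValueError(f"unrecognised number: {s}")
    value = float(m.group(1).replace(",", ""))
    scale = {"": 1, "k": 1_000, "m": 1_000_000}[m.group(2).lower()]
    return int(value * scale)

def naive_stem(word):
    """Toy suffix-stripping stemmer: cuts a few common English suffixes."""
    for suffix in ("ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(normalize_number("$700K"))     # prints 700000
print(normalize_number("$700,000"))  # prints 700000
print(naive_stem("producer"), naive_stem("produces"))  # produc produc
```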
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
➤ Web of data
➢ provides a common data representation framework
➢ makes it possible to integrate multiple sources
➢ so you can draw new conclusions
➤ Increase the utility of information by connecting it to definitions and context
➤ More efficient information access and analysis
E.g. not just “color” but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
➤ XML is a set of rules for encoding documents electronically
➢ is a markup language much like HTML
➢ was designed to carry data, not to display data
➢ tags are not predefined: you must define your own tags
➢ is designed to be self-descriptive
➢ XML is a W3C Recommendation
➤ Can be validated for syntactic well-formedness
Review and Conclusions 58
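Checking well-formedness needs no schema, only a parser; a minimal sketch with Python's standard library (the <note> tags below are our own invention, as the slide says they may be):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses as XML: matching tags, a single root, etc."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Tags are not predefined: we define <note>, <to> and <body> ourselves.
good = "<note><to>Tove</to><body>Remember me</body></note>"
bad = "<note><to>Tove</body></note>"  # mismatched tags
print(is_well_formed(good), is_well_formed(bad))  # prints: True False
```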
Semantic Web Architecture (details)
➤ Identify things with Uniform Resource Identifiers
➢ Uniform Resource Name: urn:isbn:1575864606
➢ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
➤ Identify relations with the Resource Description Framework
➢ Triples of <subject, predicate, object>
➢ Each element is a URI
➢ RDFs are written in well-defined XML
➢ You can say anything about anything
➤ You can build relations in ontologies (OWL)
➢ Then reason over them, search them, …
Review and Conclusions 59
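The triple model is easy to sketch: store (subject, predicate, object) tuples and match patterns against them. The ex: names below are made-up stand-ins for full URIs; real RDF stores are queried the same way (e.g. via SPARQL), just at scale:

```python
# Triples of (subject, predicate, object); each element is a URI in real RDF.
triples = [
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:partOf", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything said about HG2052:
print(query(s="ex:HG2052"))
```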
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible (know thyself)
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
➤ The Semantic Web is about structuring data
➤ Text Mining is about unstructured data
➤ There is much more unstructured than structured data
➢ NLP can infer structure
➢ NLP makes the Semantic Web feasible
➢ the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
➤ How can we tell what is a good scientific paper?
➢ Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
➢ Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
➤ One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
➤ Some scores are:
➢ Total Number of Citations (Pretty Useful)
➢ Total Number of Citations minus Self-citations
➢ Total Number of (Citations / Number of Authors)
➢ Average (Citation Impact Factor / Number of Authors)
➤ Problems:
➢ Not all citations are equal: citations by ‘good’ papers are better
➢ Newer publications suffer in relation to older ones
➤ Weight Citations by Quality of the paper
Review and Conclusions 64
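The scores above are simple arithmetic over per-paper counts; a sketch with invented toy data:

```python
# Toy data: citation counts, author counts and self-citations per paper.
papers = [
    {"cites": 120, "authors": 3, "self_cites": 10},
    {"cites": 45,  "authors": 1, "self_cites": 5},
]

def total_citations(papers):
    """Total Number of Citations."""
    return sum(p["cites"] for p in papers)

def total_minus_self(papers):
    """Total Number of Citations minus Self-citations."""
    return sum(p["cites"] - p["self_cites"] for p in papers)

def total_per_author(papers):
    """Total of (Citations / Number of Authors) over all papers."""
    return sum(p["cites"] / p["authors"] for p in papers)

print(total_citations(papers),      # 165
      total_minus_self(papers),     # 150
      total_per_author(papers))     # 85.0
```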
Gaming Citations
➤ Least/Minimum Publishable Unit
➢ Break research into small chunks to increase the number of citations
➢ Sometimes there is very little new information
➤ Self-citation, in-group citation
➤ Write-only proceedings (some journals are not often read)
➤ Submitting only to High Impact factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
➤ Recall how hyperlinks are written:
<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>
For more information about Language Technology and the Internet see the <a href="http://...">HG803 Course Page</a>
➤ Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B (this is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
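Collecting (href, anchor text) pairs is a small job for Python's standard-library HTML parser; the example page below is invented:

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # only collect text inside an <a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

html = '<p>See the <a href="http://example.com/HG803">HG803 Course Page</a>.</p>'
parser = AnchorTextParser()
parser.feed(html)
print(parser.links)  # [('http://example.com/HG803', 'HG803 Course Page')]
```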
PageRank as Citation analysis
➤ Citation frequency can be used to measure the impact of an article
➢ Simplest measure: each article gets one vote – not very accurate
➤ On the web, citation frequency = inlink count
➢ A high inlink count does not necessarily mean high quality …
➢ … mainly because of link spam
➤ Better measure: weighted citation frequency, or citation rank
➢ An article's vote is weighted according to its citation impact
➢ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
➤ Imagine a web surfer doing a random walk on the web
➢ Start at a random page
➢ At each step, go out of the current page along one of the links on that page, equiprobably
➤ In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
➤ This long-term visit rate is the page's PageRank
➤ PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends
➤ At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
➤ At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
➤ With remaining probability (90%), go out on a random hyperlink
➢ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
➤ 10% is a parameter: the teleportation rate
➤ Note: “jumping” from a dead end is independent of the teleportation rate
Review and Conclusions 70
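The random walk with teleportation can be computed by power iteration; a small sketch on a made-up four-page graph, with D as a dead end:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration for PageRank with teleportation.
    Dead ends jump to every page with probability 1/N, as on the slide."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                    # dead end: jump anywhere (1/N each)
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:            # teleport with probability 10%
                    new[q] += teleport * rank[p] / n
                for q in out:              # otherwise follow a random link
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # toy graph
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Note how the total rank stays 1 at every step: each page redistributes exactly its own mass, split between teleportation and outgoing links.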
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
➤ Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly-ranked websites link to them. Examples include adding links within blogs.
➤ Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
➤ Scraper Sites: “scrape” search-engine results pages or other sources of content and create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
73
➤ Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.
➤ The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
➢ Google does not index the target of a link marked nofollow
➢ Yahoo! does not include the link in its ranking
➢ …
74
Current Status
➤ There is a continuous battle between:
➢ Search companies, who want to get the most useful page to the user
➢ Page writers, who want to get their page read
➤ All metrics get gamed
Review and Conclusions 75
Digital object identifier
➤ DOI: a string used to uniquely identify an electronic document or object
➢ Metadata about the object is stored with the DOI name
➢ The metadata includes a location, such as a URL
➢ The DOI for a document is permanent; the metadata may change
➢ Gives a Persistent Identifier (like ISBN)
➤ The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
➤ By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
➢ DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u/
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
➤ Speech (Language itself)
➤ Writing (invented 3–5 times)
➤ Printing (made writing common)
➤ Digital Text (made writing transferable)
➤ Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
➤ The internet is useful as a tool
➢ Passively, for information access
➢ Actively, for collaboration
➤ The internet is interesting in and of itself
➢ As a source of data about existing language
➢ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
➤ Wikipedia
➤ Crystal, D. (2001) Language and the Internet. Cambridge University Press
➤ Sproat, R. (2010) Language Technology and Society. Oxford University Press
➤ Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
➤ Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
➤ Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
➤ HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
➤ HG3051 Corpus Linguistics — (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
The Telephone
Speech-like            | Text-like
time bound             | space bound
spontaneous            | contrived
face-to-face           | visually decontextualized
loosely structured     | elaborately structured
socially interactive   | factually communicative
immediately revisable  | repeatedly revisable
prosodically rich      | graphically rich
Review and Conclusions 21
New Modalities
Review and Conclusions 22
Email Usenet Chat and Blogs
➤ All share some characteristics of speech and text
➤ Usage norms not fixed
➤ Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …
➤ Large-scale discourse studies still to be done
➤ Some genuinely new things:
➢ time-lagged multi-person conversation
➢ raw, un-edited text
➢ extreme multi-authorship
Review and Conclusions 23
Email
Speech-like            | Text-like
time bound             | space bound (deletable)
spontaneous            | contrived
face-to-face           | visually decontextualized
loosely structured     | elaborately structured
socially interactive   | factually communicative
immediately revisable  | repeatedly revisable
prosodically rich      | graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like            | Text-like
time bound             | space bound
spontaneous            | contrived
face-to-face           | visually decontextualized
loosely structured     | elaborately structured
socially interactive   | factually communicative
immediately revisable  | repeatedly revisable
prosodically rich      | graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like            | Text-like
time bound             | space bound
spontaneous            | contrived
face-to-face           | visually decontextualized
loosely structured     | elaborately structured
socially interactive   | factually communicative
immediately revisable  | repeatedly revisable
prosodically rich      | graphically rich
Review and Conclusions 26
Blogs
Speech-like                     | Text-like
time bound                      | space bound
spontaneous                     | contrived
face-to-face                    | visually decontextualized
loosely structured              | elaborately structured
socially interactive (comments) | factually communicative
immediately revisable           | repeatedly revisable
prosodically rich               | graphically rich
Review and Conclusions 27
Netspeak
➤ Inspired by shared background
➢ <rant>I can't stand this</rant>
➢ I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
➢ lusers (users as seen by Systems Administrators)
➢ suits
➤ Need to be in-group
cow orker: Coworker
clue: “You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance” – Edward Flaherty (talk.bizarre)
Review and Conclusions 28
➤ Gricean Principles
➢ Posting (top/middle/bottom quoting and trimming)
➤ Inspired by medium limitations
➢ GREAT, *great*, _great_
➢ :-) (^_^) (o)
➢ brb, RTFM, IMHO
➢ lower case
➢ lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis
➤ Version Control Systems
➤ Wikipedia
➤ Licensing and Ownership
30
Version Control Systems
➤ Versioning file systems
➢ every time a file is opened, a new copy is stored
➤ CVS, Subversion, Git
➢ changes to a collection of files are tracked
➢ simultaneous changes are merged
➤ Revision Tracking
➢ Revisions are stored within a file
➤ Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
Wikipedia
➤ The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
➤ Wikipedia makes it easy to share your knowledge: people like to do this
➤ Most edits are done by insiders
➤ Most content is added by outsiders
➤ Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
➤ Content policies: NPOV, Verifiability and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules
Wikipedia:Five pillars 33
Licenses and Ownership
➤ Copyright
➤ Copyleft
➤ Creative Commons
Review and Conclusions 34
What is a good article?
1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images
Wikipedia:Good article criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
➤ The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
➤ The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWYM)
➤ The structure of the Web — hypertext: pages linked to pages
➤ The future of the Web
➤ Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form
Unlinked content: pages which are not linked to by other pages (but clicking links them)
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files, or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
➤ Visual Markup (Presentational)
➢ What you see is what you get (WYSIWYG)
➢ Equivalent of printers' markup
➢ Shows what things look like
➤ Logical Markup (Structural)
➢ Show the structure and meaning
➢ Can be mapped to visual markup
➢ Less flexible than visual markup
➢ More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
➤ Direct Query: Search Engine as Query tool and WWW as corpus (Objection: Results are not reliable)
➢ Population and exact hit counts are unknown → no statistics possible
➢ Indexing does not allow drawing conclusions on the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have
➤ Web Sample: Use a search engine to download data from the net and build a corpus from it
➢ known size and exact hit counts → statistics possible
➢ people can draw conclusions over the included text types
➢ (limited) control over the content
⊗ sparser data
Review and Conclusions 43
Direct Query
➤ Accessible through search engines (Google API, Yahoo API, scripts)
➤ Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help — but:
➢ lots of repetitions of the same text (not representative)
➢ very limited query precision (no upper/lower case, no punctuation)
➢ only estimated counts, often hard to reproduce exactly
➢ different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
New Modalities
Review and Conclusions 22
Email, Usenet, Chat and Blogs
• All share some characteristics of speech and text
• Usage norms not fixed
• Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …
• Large-scale discourse studies still to be done
• Some genuinely new things:
  – time-lagged multi-person conversation
  – raw, un-edited text
  – extreme multi-authorship
Review and Conclusions 23
Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 24
Usenet (asynchronous)
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich
Review and Conclusions 26
Blogs
Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich
Review and Conclusions 27
Netspeak
• Inspired by shared background
  – <rant>I can’t stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  – lusers (users, as seen by Systems Administrators)
  – suits
• Need to be in-group
  cow orker: Coworker
  clue: “You couldn’t get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance.” — Edward Flaherty (talk.bizarre)
Review and Conclusions 28
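The ^H convention above is concrete enough to execute: each literal ^H deletes the character “typed” before it, so the joke reading only emerges once the backspaces are applied. A small sketch:

```python
def apply_backspaces(s: str) -> str:
    """Interpret each literal '^H' as a backspace deleting the previous character."""
    out = []
    i = 0
    while i < len(s):
        if s.startswith("^H", i):
            if out:
                out.pop()        # delete the previously "typed" character
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(apply_backspaces("I hate^H^H^H^Hlove that idea"))
# I love that idea
```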
• Gricean Principles
  – Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
  – GREAT great great
  – :-) (^_^) (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis
• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems
• Versioning file systems
  – every time a file is opened, a new copy is stored
• CVS, Subversion, Git
  – changes to a collection of files are tracked
  – simultaneous changes are merged
• Revision Tracking
  – Revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
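What a version-control system tracks is, at bottom, a delta between revisions; the standard library can display one. A sketch, with invented revision texts:

```python
import difflib

# Two revisions of the same short document (invented example).
rev1 = ["Wikipedia is an encyclopedia.\n",
        "Anyone can edit it.\n"]
rev2 = ["Wikipedia is a free online encyclopedia.\n",
        "Anyone can edit it.\n"]

# A unified diff is the kind of line-level change record a VCS
# stores, merges, and attributes to an author.
delta = list(difflib.unified_diff(rev1, rev2, fromfile="rev1", tofile="rev2"))
print("".join(delta))
```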
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge; people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
   • Content policies: NPOV, Verifiability and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules
Wikipedia:Five pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)
• The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWYM)
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages (but clicking links them)
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
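The Robots Exclusion Standard mentioned above is just a plain-text policy file that compliant crawlers obey, which is one way content stays out of search indexes. A sketch with the standard library; the rules below are invented for illustration:

```python
from urllib import robotparser

# An invented robots.txt: crawlers are asked to stay out of /private/.
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "http://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))         # True
```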
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  – What you see is what you get (WYSIWYG)
  – Equivalent of printers’ markup
  – Shows what things look like
• Logical Markup (Structural)
  – Shows the structure and meaning
  – Can be mapped to visual markup
  – Less flexible than visual markup
  – More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and WWW as corpus (Objection: results are not reliable)
  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionalities that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  – known size and exact hit counts → statistics possible
  – people can draw conclusions over the included text types
  – (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help — but:
  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or “google hit” is the most common unit used to count web snippets (in the early 2000s)
  – it is document frequency, not term frequency
• whit or “web hit” is the more general term
• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, for “Ghits per billion documents”, is good if you want a more stable number (suggested by Mark Liberman) — but Google no longer releases the index size
• So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
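The unitless-ratio recipe above in code, using the slide’s 2012 counts for “different from” vs “different than” (ghits, i.e. document frequencies):

```python
# Compare two phenomena as a ratio rather than quoting raw, unstable counts.
counts = {"different from": 251_000_000,
          "different than": 71_500_000}

ratio = counts["different from"] / counts["different than"]
print(f"{ratio:.1f}:1")   # 3.5:1
```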
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  – do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Post-process the data (encoding, cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html 47
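Steps 1–2 of the outline can be sketched without network access; steps 3–5 need a search API and a downloader, so only the query-generation stage is shown. The five seed words are a toy stand-in for the 500:

```python
import itertools

# Step 1: seed words (a toy list; the real pipeline uses ~500).
seeds = ["time", "people", "water", "money", "house"]

# Step 2: combine seed words into multi-word queries
# (here, every 3-word combination).
queries = [" ".join(combo) for combo in itertools.combinations(seeds, 3)]

print(len(queries), "queries, e.g.", queries[0])
```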
Internet Corpora Summary
• The web can be used as a corpus
  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts
  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text
• Explicit Meta-data
  – Keywords and Categories
  – Rankings
  – Structural Markup
• Implicit Meta-data
  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  – When they are accurate, they are very good
  – They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  – as they are non-intended, they tend to be noisy
  – but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations — Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine Learning based methods
  – Context (under jp or ko)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
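The similarity-based approach above can be sketched in a few lines: rank character trigrams in each training text, then pick the language whose ranking is closest to the input’s (an “out-of-place” measure in the spirit of Cavnar & Trenkle 1994). The two training sentences are invented and far too small for real use:

```python
from collections import Counter

def profile(text, n=3, top=50):
    """Map each of the `top` most frequent character n-grams to its rank."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

# Toy "training data": one short sentence per language.
train = {"en": "the quick brown fox jumps over the lazy dog the end",
         "de": "der schnelle braune fuchs springt ueber den faulen hund"}
profiles = {lang: profile(t) for lang, t in train.items()}

def identify(text):
    p = profile(text)
    def distance(q):
        # n-grams unseen in training get the maximum penalty
        return sum(abs(p[g] - q.get(g, len(q))) for g in p)
    return min(profiles, key=lambda lang: distance(profiles[lang]))

print(identify("the dog jumps"))
# en
```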
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
Review and Conclusions 54
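Two of the steps above, as toy one-pattern helpers. Real pipelines use proper tokenizers, stemmers (e.g. Porter) and lemmatizers; these sketches only cover the slide’s own examples:

```python
import re

def normalize_number(s: str) -> str:
    """Rewrite '$700K'-style amounts as '$700,000' (this one pattern only)."""
    m = re.fullmatch(r"\$(\d+)K", s)
    return f"${int(m.group(1)) * 1000:,}" if m else s

def crude_stem(word: str) -> str:
    """Strip a few English suffixes: produces/producer -> produc."""
    return re.sub(r"(es|er|e|s)$", "", word)

print(normalize_number("$700K"))                       # $700,000
print(crude_stem("produces"), crude_stem("producer"))  # produc produc
```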
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just “color” but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6/color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
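Both points above can be shown with the standard library: you define your own tags, and the parser rejects anything that is not well-formed. The `<lexeme>` vocabulary below is invented for illustration:

```python
import xml.etree.ElementTree as ET

good = "<lexeme><lemma>produce</lemma><pos>verb</pos></lexeme>"
bad  = "<lexeme><lemma>produce</pos></lexeme>"      # mismatched tags

# Well-formed XML with user-defined tags parses fine.
root = ET.fromstring(good)
print(root.find("lemma").text, root.find("pos").text)   # produce verb

# Mismatched tags fail the well-formedness check.
try:
    ET.fromstring(bad)
except ET.ParseError as err:
    print("not well-formed:", err)
```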
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework
  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything
• You can build relations in ontologies (OWL)
  – Then reason over them, search them, …
Review and Conclusions 59
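The triple model above is simple enough to sketch directly: a set of ⟨subject, predicate, object⟩ identifiers is already a queryable graph. All the `ex:` names are invented; real work would use full URIs and a library such as rdflib:

```python
# A tiny in-memory RDF-style graph: a set of <subject, predicate, object>
# triples. Identifiers are invented examples, not real URIs.
triples = {
    ("ex:book1", "ex:author",    "ex:Bond"),
    ("ex:Bond",  "ex:worksAt",   "ex:NTU"),
    ("ex:NTU",   "ex:locatedIn", "ex:Singapore"),
}

def objects(subject, predicate):
    """All o such that <subject, predicate, o> is in the graph."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("ex:Bond", "ex:worksAt"))   # {'ex:NTU'}
```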
Criticism of the Semantic Web
Doctorow’s seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren’t neutral
6. Metrics influence results
7. There’s more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  – Total Number of Citations (pretty useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)
• Problems:
  – Not all citations are equal: citations by ‘good’ papers are better
  – Newer publications suffer in relation to older ones
• Weight Citations by the Quality of the paper
Review and Conclusions 64
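The first two scores above, computed over a toy citation network. Papers, authors and citation links are all invented:

```python
# Each paper: its authors and the papers it cites.
papers = {
    "P1": {"authors": ["ann", "bob"], "cites": ["P2"]},
    "P2": {"authors": ["bob"],        "cites": []},
    "P3": {"authors": ["eve"],        "cites": ["P1", "P2"]},
}

def citations(paper):
    """Papers that cite `paper` (score 1: total citations)."""
    return [p for p, d in papers.items() if paper in d["cites"]]

def non_self_citations(paper):
    """Score 2: drop citing papers that share an author with `paper`."""
    mine = set(papers[paper]["authors"])
    return [p for p in citations(paper)
            if not mine & set(papers[p]["authors"])]

print(len(citations("P2")))            # 2: cited by P1 and P3
print(len(non_self_citations("P2")))   # 1: P1 shares an author (bob)
```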
Gaming Citations
• Least/Minimum Publishable Unit
  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information
• Self citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high Impact Factor journals
You improve what gets measured — not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>
  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>
• Link analysis builds on two intuitions:
  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2. The (extended) anchor text pointing to page B is a good description of page B. (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
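Collecting (href, anchor text) pairs, the raw material for both intuitions above, takes only the standard library. The page snippet and URL are invented:

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:       # only collect text inside <a>...</a>
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = '<p>See the <a href="http://example.com/hg2052">HG2052 course page</a>.</p>'
c = AnchorCollector()
c.feed(page)
print(c.links)   # [('http://example.com/hg2052', 'HG2052 course page')]
```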
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: each article gets one vote — not very accurate
• On the web, citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – An article’s vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page’s PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out along a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: “jumping” from a dead end is independent of the teleportation rate
Review and Conclusions 70
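The random walk with teleportation can be computed by power iteration: repeatedly redistribute each page’s rank until the visit rates stop changing. A sketch on an invented 4-page graph, with a 10% teleportation rate and one dead end:

```python
# Power-iteration sketch of PageRank, following the description above.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}   # D is a dead end
pages = sorted(links)
N = len(pages)
teleport = 0.10

rank = {p: 1 / N for p in pages}           # start from a uniform distribution
for _ in range(100):
    new = {p: 0.0 for p in pages}
    for p in pages:
        out = links[p]
        if not out:                         # dead end: jump anywhere (1/N each)
            for q in pages:
                new[q] += rank[p] / N
        else:
            for q in pages:                 # teleport with probability 0.10
                new[q] += teleport * rank[p] / N
            for q in out:                   # otherwise follow a random outlink
                new[q] += (1 - teleport) * rank[p] / len(out)
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

Rank mass is conserved at every step, so the result stays a probability distribution, and the dead-end page D ends up with the lowest long-term visit rate.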
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index
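A ranking crawler can honour nofollow by simply skipping hyperlinks whose rel attribute contains it. A sketch with the standard library; the snippet is invented comment spam:

```python
from html.parser import HTMLParser

class FollowableLinks(HTMLParser):
    """Collect hrefs of links that do NOT carry rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        if "nofollow" not in (a.get("rel") or "").split():
            self.links.append(a.get("href"))

crawler = FollowableLinks()
crawler.feed('<a rel="nofollow" href="http://spam.example">buy!</a> '
             '<a href="http://good.example">a real link</a>')
print(crawler.links)   # ['http://good.example']
```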
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Speech-like               Text-like
time bound                space bound (deletable)
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 24
Usenet (asynchronous)

Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 25
Chat (synchronous)

Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 26
Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27
Netspeak

• Inspired by shared background

  – <rant>I can't stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  – lusers (users as seen by systems administrators)
  – suits

• Need to be in-group

  cow orker = coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny
  clues if you smeared your body with clue musk and did the clue mating dance."
  Edward Flaherty (talk.bizarre)

Review and Conclusions 28
• Gricean Principles

  – Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  – GREAT, *great*, _great_
  – :-)  (^_^)  (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)

Review and Conclusions 29
Collaboration and Wikis

• Version Control Systems
• Wikipedia
• Licensing and Ownership

30
Version Control Systems

• Versioning file systems

  – every time a file is saved a new copy is stored

• CVS, Subversion, Git

  – changes to a collection of files are tracked
  – simultaneous changes are merged

• Revision Tracking

  – revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31
Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge, and people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32
The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

  • Content policies: NPOV, Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Fivepillars 33
Licenses and Ownership

• Copyright
• Copyleft
• Creative Commons

Review and Conclusions 34
What is a good article?

1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35
The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of markup: visual vs logical (WYSIWYG, WYSIAYG, WYSIWYM)

• The structure of the Web: hypertext, pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37
Map of online communities

http://xkcd.com/802 38
The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (so crawlers cannot find them by following links)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40
Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  – What you see is what you get (WYSIWYG)
  – Equivalent of printers' markup
  – Shows what things look like

• Logical Markup (Structural)

  – Shows the structure and meaning
  – Can be mapped to visual markup
  – Less flexible than visual markup
  – More adaptable (and reusable)

Review and Conclusions 41
The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

  – known size and exact hit counts → statistics possible
  – people can draw conclusions over the included text types
  – (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43
Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44
Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
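The ratio comparison above is simple enough to script. A minimal sketch, using the 2012 counts quoted on the slide (the numbers are the slide's, not freshly collected):

```python
def ghit_ratio(hits_a: int, hits_b: int) -> float:
    """Compare two phenomena by the unitless ratio of their web counts."""
    return hits_a / hits_b

# Document frequencies for "different from" vs "different than"
# (ghits as quoted on the slide, accessed 2012-04-04).
different_from = 251_000_000
different_than = 71_500_000

ratio = ghit_ratio(different_from, different_than)
print(f"{ratio:.1f}:1")  # prints 3.5:1
```

Reporting the ratio rather than the raw counts is exactly the slide's advice: the counts drift as the index changes, but the ratio is far more stable.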
Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  – do this for several languages and make the corpora available

46
Building Internet Corpora: Outline

1 Select seed words (500)
2 Combine to form multiple queries (6,000)
3 Query a search engine and retrieve the URLs (50,000)
4 Download the files from the URLs (100,000,000 words)
5 Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
Internet Corpora: Summary

• The web can be used as a corpus

  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48
Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

… and Zombies 49
Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50
Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations: bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51
Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
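The character n-gram ranking approach above can be sketched with the classic out-of-place measure (Cavnar and Trenkle's method): rank the most frequent n-grams of the document and of each language profile, then sum the rank differences. The training texts here are toy pangrams for illustration; a real identifier trains on substantial corpora per language.

```python
from collections import Counter

def ngram_profile(text: str, n: int = 3, top: int = 100) -> list:
    """Most frequent character n-grams of a text, in rank order."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile: list, lang_profile: list) -> int:
    """Sum of rank differences; unseen n-grams get the maximum penalty."""
    max_penalty = len(lang_profile)
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(abs(i - rank.get(g, max_penalty))
               for i, g in enumerate(doc_profile))

# Toy training data (assumption: real profiles need far more text).
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 20),
    "nl": ngram_profile("de snelle bruine vos springt over de luie hond " * 20),
}

snippet = "the dog jumps"
best = min(profiles,
           key=lambda lang: out_of_place(ngram_profile(snippet), profiles[lang]))
print(best)  # prints en
```

As the slide notes, this gets hard for short snippets and closely related languages: with only a few n-grams, the distances between candidate profiles become unreliable.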
Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel

Review and Conclusions 54
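Two of the steps above, number normalization and stemming, can be sketched with a couple of toy rules. These are deliberately crude illustrations of the idea (real systems such as the Porter stemmer use many more rules):

```python
import re

def normalize_number(text: str) -> str:
    """Map a few money formats onto one canonical form (toy rules)."""
    # $700K -> $700000
    text = re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: f"${float(m.group(1)) * 1000:.0f}", text)
    # 0.7 million dollars -> $700000
    text = re.sub(r"(\d+(?:\.\d+)?) million dollars",
                  lambda m: f"${float(m.group(1)) * 1_000_000:.0f}", text)
    return text

def stem(word: str) -> str:
    """Crude suffix stripping: producer/produces -> produc."""
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(normalize_number("$700K"))                # prints $700000
print(normalize_number("0.7 million dollars"))  # prints $700000
print(stem("producer"), stem("produces"))       # prints produc produc
```

After normalization, the three surface forms on the slide count as the same token, which is the point: frequency counts over raw web text are misleading until variants are collapsed.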
The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56
Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58
Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations in ontologies (OWL)

  – Then reason over them, search them, …

Review and Conclusions 59
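The triple model above is small enough to sketch without any RDF library: a store is just a set of (subject, predicate, object) tuples of URIs, and a query is a match over them. All the URIs below are hypothetical, made up for illustration:

```python
# Hypothetical namespace for the example URIs.
EX = "http://example.org/"

# A tiny triple store: each fact is a (subject, predicate, object) tuple.
triples = {
    (EX + "bond", EX + "teaches", EX + "HG2052"),
    (EX + "HG2052", EX + "about", EX + "LanguageTechnology"),
}

def objects(subject: str, predicate: str) -> set:
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects(EX + "bond", EX + "teaches"))  # prints {'http://example.org/HG2052'}
```

Real RDF toolkits add serialization (RDF/XML, Turtle), graph merging and inference on top, but the underlying data model is exactly this: a set of URI triples, where anyone can say anything about anything.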
Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible: know thyself
5 Schemas aren't neutral
6 Metrics influence results
7 There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61
Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63
Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total Number of Citations (pretty useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)

• Problems:

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper

Review and Conclusions 64
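The simple scores in the list above are just arithmetic over per-paper counts. A minimal sketch with invented numbers (the function and its inputs are illustrative, not a standard bibliometric API):

```python
def citation_scores(total_citations: int, self_citations: int,
                    n_authors: int) -> dict:
    """The simple per-paper citation measures listed on the slide."""
    return {
        "total": total_citations,                       # raw citation count
        "minus_self": total_citations - self_citations, # discount self-citation
        "per_author": total_citations / n_authors,      # share credit equally
    }

# Hypothetical paper: 40 citations, 4 of them self-citations, 2 authors.
print(citation_scores(total_citations=40, self_citations=4, n_authors=2))
# prints {'total': 40, 'minus_self': 36, 'per_author': 20.0}
```

None of these fix the problems the slide raises: every citation counts the same regardless of who cites, and older papers have had longer to accumulate votes. Weighting by the quality of the citing paper is what leads to PageRank-style scores below.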
Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65
Characteristics of the Web: Bow Tie

SCC (Strongly Connected Component): you can travel from any page to any other page

Review and Conclusions 66
Anchor Text

• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2 The (extended) anchor text pointing to page B is a good description of page B
    This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice

Review and Conclusions 67
PageRank as Citation Analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote (not very accurate)

• On the web, citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68
PageRank as Random Walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady state probability

Review and Conclusions 69
Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter: the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
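The random walk with teleportation above can be computed by power iteration: start from a uniform distribution and repeatedly redistribute each page's rank according to the rules on the slide. A small sketch on an invented three-page graph (C is a dead end):

```python
def pagerank(links: dict, teleport: float = 0.10, iters: int = 100) -> dict:
    """Power-iteration PageRank following the slide's rules:
    from a dead end jump uniformly at random; otherwise teleport
    with probability `teleport`, else follow a random outlink."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}  # start uniform
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:  # dead end: jump anywhere with probability 1/N
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:  # teleport with probability `teleport`
                    new[q] += rank[p] * teleport / n
                for q in out:    # else follow a random outlink
                    new[q] += rank[p] * (1 - teleport) / len(out)
        rank = new
    return rank

# Toy graph: A links to B and C, B links to C, C is a dead end.
graph = {"A": ["B", "C"], "B": ["C"], "C": []}
rank = pagerank(graph)
print(sorted(rank, key=rank.get, reverse=True))  # prints ['C', 'B', 'A']
```

The long-term visit rates sum to 1, as a steady-state distribution must, and C comes out highest because everything eventually flows into it before teleporting away.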
Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph: Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72
Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74
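From a crawler's point of view, honouring nofollow just means skipping such links when collecting endorsements. A regex-based sketch (real crawlers use a proper HTML parser; the example URLs are made up):

```python
import re

# Matches an <a ...> tag, capturing attributes before and after href.
HREF = re.compile(r'<a\s+([^>]*?)href="([^"]+)"([^>]*)>', re.IGNORECASE)

def endorsed_links(html: str) -> list:
    """Hyperlinks a ranking crawler would count: skip rel="nofollow"."""
    links = []
    for m in HREF.finditer(html):
        attrs = m.group(1) + m.group(3)  # attributes other than href
        if "nofollow" not in attrs.lower():
            links.append(m.group(2))
    return links

html = ('<a rel="nofollow" href="http://spam.example/">comment spam</a> '
        '<a href="http://good.example/">a real endorsement</a>')
print(endorsed_links(html))  # prints ['http://good.example/']
```

This is why blog platforms add rel="nofollow" to comment links by default: the spam link may still be clicked by readers, but it no longer casts a PageRank vote.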
Current Status

• There is a continuous battle between

  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75
Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009 approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u/

Review and Conclusions 76
Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78
The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79
Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language Technology and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80
Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics: (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81
Usenet (asynchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 25
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
atilde Inspired by shared background
acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits
atilde Need to be in-group
cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)
Review and Conclusions 28
• Gricean Principles
◦ Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
◦ GREAT, *great*, _great_
◦ :-) (^_^) (o)
◦ brb, RTFM, IMHO
◦ lower case
◦ lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis
• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems
• Versioning file systems
◦ every time a file is saved, a new copy is stored
• CVS, Subversion, Git
◦ changes to a collection of files are tracked
◦ simultaneous changes are merged
• Revision Tracking
◦ Revisions are stored within a file
• Authorship in shared writing: explicit responsibility for changes
Review and Conclusions 31
Wikipedia
• "The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet" (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge; people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
• Content policies: NPOV, Verifiability and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules
Wikipedia:Five pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images
Wikipedia:Good article criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of Markup: Visual vs. Logical (WYSIWYG, WYSIAYG, WYSIWYM)
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802/ 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages (but clicking links them)
Private Web: sites that require registration and login (e.g. edveNTUre)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
◦ What you see is what you get (WYSIWYG)
◦ Equivalent of printers' markup
◦ Shows what things look like
• Logical Markup (Structural)
◦ Shows the structure and meaning
◦ Can be mapped to visual markup
◦ Less flexible than visual markup
◦ More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: Search Engine as Query tool and WWW as corpus (Objection: results are not reliable)
◦ Population and exact hit counts are unknown → no statistics possible
◦ Indexing does not allow one to draw conclusions about the data
⊗ Google is missing functionalities that linguists/lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
◦ known size and exact hit counts → statistics possible
◦ people can draw conclusions over the included text types
◦ (limited) control over the content
⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
◦ lots of repetitions of the same text (not representative)
◦ very limited query precision (no upper/lower case, no punctuation)
◦ only estimated counts, often hard to reproduce exactly
◦ different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs. different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
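The unitless-ratio advice above is a one-liner; this sketch uses the slide's 2012 figures (live ghit counts will differ and drift):

```python
# Document counts as reported by a search engine (illustrative 2012 figures).
counts = {"different from": 251_000_000, "different than": 71_500_000}

# Compare two phenomena as a unitless ratio rather than quoting raw ghits.
ratio = counts["different from"] / counts["different than"]
print(f"{ratio:.1f}:1")   # -> 3.5:1
```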
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
◦ gather documents for different topics (balance)
◦ exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
◦ exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
◦ do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding, cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html 47
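Steps 1–2 of the recipe can be sketched in a few lines (the seed words below are invented; a real BootCaT-style pipeline would then send each query to a search engine and download the hits, steps 3–5):

```python
import itertools

# Step 1: toy seed words (the slide's real pipeline uses ~500 per language).
seeds = ["picture", "extent", "raised", "based", "direct"]

# Step 2: combine seeds into multi-word queries to send to a search engine.
queries = [" ".join(combo) for combo in itertools.combinations(seeds, 3)]
print(len(queries), "queries, e.g.:", queries[0])
```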
Internet Corpora Summary
• The web can be used as a corpus
◦ Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts
◦ Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text
• Explicit Meta-data
◦ Keywords and Categories
◦ Rankings
◦ Structural Markup
• Implicit Meta-data
◦ Links and Citations
◦ Tags
◦ Tables
◦ File Names
◦ Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
◦ When they are accurate, they are very good
◦ They are often deceitful
• HTML, PDF, Word … Metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
◦ as they are non-intended, they tend to be noisy
◦ but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations — Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
◦ Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words
◦ Similarity-based categorisation and classification
∗ Character n-gram rankings
◦ Machine Learning based methods
◦ Context (under .jp or .kr)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
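A minimal character-n-gram identifier in the spirit of the similarity-based methods above (the two "training" sentences and the scoring rule are toy choices of mine; a real identifier needs far more text per language):

```python
from collections import Counter

def char_ngrams(text, n=3):
    text = "  " + text.lower() + "  "   # pad so word edges count too
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy profiles; real systems train on large samples per language.
profiles = {
    "en": char_ngrams("the cat sat on the mat with the hat"),
    "nl": char_ngrams("de kat zat op de mat met de hoed"),
}

def identify(snippet):
    grams = char_ngrams(snippet)
    # score = size of the multiset intersection with each language profile
    return max(profiles, key=lambda lg: sum((grams & profiles[lg]).values()))

print(identify("the hat"))   # -> en
```

Even this toy version illustrates the slide's caveat: the shorter the snippet, the fewer n-grams there are to separate similar languages.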
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
Review and Conclusions 54
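A toy number normalizer in the spirit of the slide (the patterns and the canonical integer form are mine; a full system handles many more surface forms):

```python
import re

def normalize_number(text):
    """Map a few surface forms of an amount to one canonical integer."""
    text = text.strip().lower()
    m = re.match(r"\$?([\d.]+)\s*k$", text)             # $700K
    if m:
        return int(float(m.group(1)) * 1_000)
    m = re.match(r"\$?([\d,]+)$", text)                 # $700,000
    if m:
        return int(m.group(1).replace(",", ""))
    m = re.match(r"([\d.]+) million dollars$", text)    # 0.7 million dollars
    if m:
        return int(float(m.group(1)) * 1_000_000)
    return None

# All three surface forms normalize to the same value.
assert normalize_number("$700K") == normalize_number("$700,000") \
    == normalize_number("0.7 million dollars") == 700_000
```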
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
◦ provides a common data representation framework
◦ makes possible integrating multiple sources
◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
◦ is a markup language, much like HTML
◦ was designed to carry data, not to display data
◦ tags are not predefined: you must define your own tags
◦ is designed to be self-descriptive
◦ XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
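Checking syntactic well-formedness (as distinct from validity against a schema) takes only Python's standard library:

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Syntactic well-formedness only -- no schema or DTD validation."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<note><to>Tove</to></note>"))   # -> True
print(well_formed("<note><to>Tove</note></to>"))   # -> False (mis-nested)
```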
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
◦ Uniform Resource Name: urn:isbn:1575864606
◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework
◦ Triples of <subject, predicate, object>
◦ Each element is a URI
◦ RDF is written in well-defined XML
◦ You can say anything about anything
• You can build relations in ontologies (OWL)
◦ Then reason over them, search them, …
Review and Conclusions 59
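The triple model can be sketched without any RDF library (the example.org URIs below are invented for illustration; real deployments serialize triples as RDF/XML or Turtle and query them with SPARQL):

```python
# Triples of (subject, predicate, object); in real RDF every element
# is a URI (or, for objects, possibly a literal).
EX = "http://example.org/"
triples = {
    (EX + "bond", EX + "teaches", EX + "HG2052"),
    (EX + "HG2052", EX + "topic", EX + "internet"),
}

def objects(subject, predicate):
    """The simplest possible query: all objects matching (s, p, ?)."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects(EX + "bond", EX + "teaches"))   # -> {'http://example.org/HG2052'}
```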
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible — know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
◦ NLP can infer structure
◦ NLP makes the Semantic Web feasible
◦ the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
◦ Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
◦ Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
◦ Total Number of Citations (Pretty Useful)
◦ Total Number of Citations minus Self-citations
◦ Total Number of Citations / Number of Authors
◦ Average Citation Impact Factor / Number of Authors
• Problems:
◦ Not all citations are equal: citations by 'good' papers are better
◦ Newer publications suffer in relation to older ones
• Weight Citations by Quality of the paper
Review and Conclusions 64
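The first three scores above are straightforward once you have per-paper counts; a minimal sketch (the paper tuples and numbers are invented):

```python
def citation_scores(papers):
    """papers: list of (citations, n_authors, self_citations) tuples."""
    total = sum(c for c, a, s in papers)               # total citations
    minus_self = sum(c - s for c, a, s in papers)      # minus self-citations
    per_author = sum(c / a for c, a, s in papers)      # citations / authors
    return total, minus_self, per_author

papers = [(100, 2, 10), (40, 4, 0)]    # invented data
print(citation_scores(papers))         # -> (140, 130, 60.0)
```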
Gaming Citations
• Least/Minimum Publishable Unit
◦ Break research into small chunks to increase the number of citations
◦ Sometimes there is very little new information
• Self citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to High Impact factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie
SCC: Strongly Connected Component — within it you can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>
For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>
• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
◦ Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
◦ A high inlink count does not necessarily mean high quality …
◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
◦ An article's vote is weighted according to its citation impact
◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
◦ Start at a random page
◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)
• With remaining probability (90%), go out on a random hyperlink
◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter: the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
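The random walk with teleporting runs directly as power iteration; the 10% teleport rate and the dead-end jump follow the slides, while the five-page graph is invented:

```python
def pagerank(links, teleport=0.1, iters=100):
    """Power iteration over {page: [outlinks]}."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links[p]
            if not out:                       # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:               # teleport with probability 0.1
                    new[q] += teleport * rank[p] / n
                for q in out:                 # else follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# Invented toy graph; E is a dead end, C collects the most inlinks.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"], "E": []}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # -> C
```

Note how the total rank mass stays 1 on every iteration, which is what makes the result interpretable as a steady-state probability.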
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
◦ Google does not index the target of a link marked nofollow
◦ Yahoo does not include the link in its ranking
◦ …
74
Current Status
• There is a continuous battle between:
◦ Search companies, who want to get the most useful page to the user
◦ Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
◦ Metadata about the object is stored with the DOI name
◦ The metadata includes a location, such as a URL
◦ The DOI for a document is permanent; the metadata may change
◦ Gives a Persistent Identifier (like ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
◦ DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u/
Review and Conclusions 76
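A DOI name resolves through the central doi.org proxy, which is what makes it persistent even when the publisher's URL changes; a trivial sketch:

```python
def doi_url(doi):
    """Build the resolver URL for a DOI name; doi.org redirects to
    whatever location is currently stored in the DOI's metadata."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# -> https://doi.org/10.1007/s10579-008-9062-z
```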
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
◦ Passively, for information access
◦ Actively, for collaboration
• The internet is interesting in and of itself
◦ As a source of data about existing language
◦ As a source of innovation in language
Review and Conclusions 79
Chat (synchronous)
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 26
Blogs
Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich
Review and Conclusions 27
Netspeak
atilde Inspired by shared background
acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits
atilde Need to be in-group
cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)
Review and Conclusions 28
atilde Gricean Principals
acirc Posting (topmiddlebottom quoting and trimming)
atilde Inspired by medium limitations
acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikisatilde Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)
atilde Wikipedia makes it easy to share your knowledgepeople like to do this
atilde Most edits are done by insiders
atilde Most content is added by outsiders
atilde Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
atilde Content policies NPOV Verifiability and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
WikipediaFivepillars 33
Licenses and Ownership
atilde Copyright
atilde Copyleft
atilde Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
WikipediaGood_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
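The anchor text a link carries can be pulled out mechanically. A minimal sketch using Python's standard `html.parser` (the URL and page text below are made up for illustration):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside
        self._text = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorCollector()
p.feed('See the <a href="http://example.com/HG803">HG803 Course Page</a>.')
print(p.links)  # [('http://example.com/HG803', 'HG803 Course Page')]
```

A search engine can then index page B under the anchor text collected from all the pages that link to it.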
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web:
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
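The walk-plus-teleporting scheme above can be simulated by power iteration: repeatedly redistribute each page's current visit rate along its links until the rates stop changing. A minimal sketch (the three-page graph is invented; real systems use sparse matrix arithmetic over billions of pages):

```python
def pagerank(links, n_iter=100, teleport=0.10):
    """Power iteration for the random walk with teleporting.

    links maps each page to the list of pages it links to; a page with
    no out-links is a dead end and spreads its mass uniformly, as above.
    """
    pages = sorted(set(links) | {q for out in links.values() for q in out})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start: uniform distribution
    for _ in range(n_iter):
        new = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links.get(p, [])
            if not out:                          # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                  # teleport with probability 0.10
                    new[q] += teleport * rank[p] / n
                for q in out:                    # otherwise follow a random out-link
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# A toy web (invented): A links to B and C, B links to C, C links back to A.
r = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# The visit rates form a distribution (they sum to 1); C, with two
# inlinks, ends up ranked highest, and B, with only one, lowest.
```

Because every page always keeps a small teleport probability, the iteration converges to a unique steady state regardless of the starting distribution.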
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …
74
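A crawler that honours this convention simply separates out the links whose rel attribute contains nofollow before counting inlinks. A sketch, reusing Python's standard `html.parser` (the URLs are invented):

```python
from html.parser import HTMLParser

class LinkFilter(HTMLParser):
    """Split a page's links into followed and nofollow-skipped ones."""
    def __init__(self):
        super().__init__()
        self.followed, self.skipped = [], []
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        href = a.get("href")
        if href is None:
            return
        # rel can hold several space-separated tokens, e.g. "external nofollow"
        if "nofollow" in a.get("rel", "").split():
            self.skipped.append(href)
        else:
            self.followed.append(href)

f = LinkFilter()
f.feed('<a href="http://spam.example/" rel="nofollow">buy now</a>'
       '<a href="http://good.example/">useful page</a>')
print(f.followed)  # ['http://good.example/']
print(f.skipped)   # ['http://spam.example/']
```

Only the followed links would then contribute votes in a link-based ranking.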
Current Status
• There is a continuous battle between:
  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
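The persistence comes from indirection: readers hand the DOI to a resolver service rather than using a location directly, and the resolver redirects to wherever the document currently lives. doi.org is the standard public resolver; a minimal sketch:

```python
def doi_url(doi: str) -> str:
    """Build a resolver URL for a DOI; the resolver redirects to the
    document's current location, even if that location changes later."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```

If the publisher moves the paper, only the metadata behind the DOI is updated; every printed citation of the DOI keeps working.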
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool:
  – Passively, for information access
  – Actively, for collaboration
• The internet is interesting in and of itself:
  – As a source of data about existing language
  – As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics – (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
• Gricean Principles
– Posting (top/middle/bottom quoting and trimming)
• Inspired by medium limitations
– GREAT, great, great
– :-) (^_^) (o)
– brb, RTFM, IMHO
– lower case
– lack of punctuation (hard on phone keyboards)
Review and Conclusions 29
Collaboration and Wikis

• Version Control Systems
• Wikipedia
• Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
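The change-tracking idea behind these systems can be illustrated with Python's standard difflib, which computes the same kind of line-level diff that CVS, Subversion, and Git store and merge; the two revisions below are invented examples.

```python
import difflib

# Two revisions of the same "file", as lists of lines.
old = ["Wikipedia is an encyclopedia.", "It has five pillars."]
new = ["Wikipedia is a free online encyclopedia.", "It has five pillars."]

# A unified diff records exactly which lines changed between revisions;
# this is the unit a version control system stores and attributes to an author.
diff = list(difflib.unified_diff(old, new, lineterm=""))
changed = [l for l in diff
           if l.startswith(("+", "-")) and not l.startswith(("+++", "---"))]
```

Storing only such diffs (rather than full copies) is what makes tracking, merging, and explicit responsibility for changes cheap.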
Review and Conclusions 31
Wikipedia
• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge, and people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
• Content policies: NPOV, Verifiability, and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
Wikipedia:Five pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)
• The structure of markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWIM
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
– What you see is what you get (WYSIWYG)
– Equivalent of printers' markup
– Shows what things look like
• Logical Markup (Structural)
– Shows the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: Search Engine as Query tool and WWW as corpus (Objection: results are not reliable)
– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
– known size and exact hit counts → statistics possible
– people can draw conclusions about the included text types
– (limited) control over the content
⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
– lots of repetitions of the same text (not representative)
– very limited query precision (no upper/lower case, no punctuation)
– only estimated counts, often hard to reproduce exactly
– different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
– it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
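The ratio and GPB calculations can be sketched in a few lines of Python; the hit counts are the ones quoted above, while the index size is a made-up stand-in, since Google no longer publishes one.

```python
# Hit counts for "different from" vs "different than" (as quoted above,
# accessed 2012-04-04).
ghits_from = 251_000_000
ghits_than = 71_500_000

# Unitless ratio: more stable over time than raw counts.
ratio = ghits_from / ghits_than  # about 3.5 : 1

# GPB ("Ghits per billion documents"), using a hypothetical index size
# of 40 billion documents for illustration.
index_size = 40_000_000_000
gpb_from = ghits_from / index_size * 1_000_000_000
```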
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
– gather documents for different topics (balance)
– exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
– exclude documents which seem irrelevant for a corpus (too short, or too long word lists)
– do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html 47
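Step 2 of this pipeline (combining seed words into queries) can be sketched as below; build_queries is a hypothetical helper, and the later steps (querying a search engine, downloading, postprocessing) are only indicated in comments.

```python
import random

def build_queries(seed_words, per_query=4, n_queries=10):
    """Combine seed words into random multi-word queries (step 2)."""
    random.seed(0)  # deterministic for illustration
    return [" ".join(random.sample(seed_words, per_query))
            for _ in range(n_queries)]

# A toy seed list; Sharoff's pipeline uses ~500 frequent words per language.
seeds = ["time", "people", "water", "house", "small", "great"]
queries = build_queries(seeds)

# Steps 3-5 (not implemented here): send each query to a search API,
# download the returned URLs, then clean encodings, tag, and parse.
```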
Internet Corpora Summary
• The web can be used as a corpus
– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts
– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text

• Explicit Meta-data
– Keywords and Categories
– Rankings
– Structural Markup
• Implicit Meta-data
– Links and Citations
– Tags
– Tables
– File Names
– Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
– When they are accurate, they are very good
– They are often deceitful
• HTML, PDF, Word … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
– as they are unintended, they tend to be noisy
– but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations — Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
– Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words
– Similarity-based categorisation and classification
∗ Character n-gram rankings
– Machine Learning based methods
– Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
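A minimal character n-gram identifier in the rank-order (out-of-place) style of Cavnar & Trenkle can be sketched as follows; the toy English and Dutch training strings are stand-ins for real language profiles, which need far more data.

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, sample):
    """Out-of-place distance: sum of rank differences (lower = closer)."""
    penalty = len(profile)  # cost for n-grams missing from the profile
    return sum(profile.index(g) if g in profile else penalty
               for g in sample)

# Toy "training" data, repeated so frequencies are meaningful.
en = ngram_profile("the cat sat on the mat with the other cats " * 20)
nl = ngram_profile("de kat zat op de mat met de andere katten " * 20)

test = ngram_profile("the dog sat on the other mat")
guess = "en" if out_of_place(en, test) < out_of_place(nl, test) else "nl"
```

This also shows why short snippets are hard: a few characters yield too few n-grams to separate similar languages.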
Review and Conclusions 53
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon cel
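A few of these steps can be sketched together in Python; the stop-word list and the $700K rewrite rule below are simplified illustrations, not a full normalizer.

```python
import re

STOP = {"the", "a", "of", "an"}  # tiny illustrative stop-word list

def normalize(text):
    """Lower-case, expand $NNNK amounts, and strip stop words."""
    text = text.lower()
    # Number normalization: "$700k" -> "$700000"
    text = re.sub(r"\$(\d+)k\b",
                  lambda m: "$" + str(int(m.group(1)) * 1000), text)
    # Stop-word stripping on whitespace tokens
    return [t for t in text.split() if t not in STOP]

toks = normalize("The price of a house is $700K")
```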
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
– provides a common data representation framework
– makes it possible to integrate multiple sources
– so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
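Well-formedness checking is exactly what an XML parser gives you for free; a sketch with Python's standard xml.etree, using invented, self-defined tags (as XML allows):

```python
import xml.etree.ElementTree as ET

# Self-defined tags carrying data, not presentation.
doc = ("<course><code>HG2052</code>"
       "<title>Language Technology and the Internet</title></course>")

root = ET.fromstring(doc)          # parsing fails if not well-formed
title = root.find("title").text

try:
    ET.fromstring("<a><b></a>")    # mismatched tags: not well-formed
    well_formed = True
except ET.ParseError:
    well_formed = False
```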
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
– Uniform Resource Name: urn:isbn:1575864606
– Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything
• You can build relations in ontologies (OWL)
– Then reason over them, search them, …
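RDF triples can be modelled minimally as Python tuples, with None as a wildcard when querying; the URIs below are invented examples, not real identifiers.

```python
# A tiny triple store: each triple is <subject, predicate, object>.
triples = [
    ("http://example.com/bond", "http://example.com/teaches",
     "http://example.com/HG2052"),
    ("http://example.com/HG2052", "http://example.com/title",
     "Language Technology and the Internet"),
]

def query(store, s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

hits = query(triples, p="http://example.com/teaches")
```

Real systems express these triples in RDF/XML (or Turtle) and query them with SPARQL, but the data model is just this.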
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible: know thyself
5 Schemas aren't neutral
6 Metrics influence results
7 There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structured data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group
• Some scores are:
– Total Number of Citations (Pretty Useful)
– Total Number of Citations minus Self-citations
– Total Number of (Citations / Number of Authors)
– Average (Citation Impact Factor / Number of Authors)
• Problems:
– Not all citations are equal: citations by 'good' papers are better
– Newer publications suffer in relation to older ones
• Weight Citations by Quality of the citing paper
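The simpler scores above are straightforward to compute; a sketch over invented publication records of the form (citations, self-citations, number of authors):

```python
# Hypothetical records for one author: (citations, self_citations, n_authors)
papers = [(120, 10, 2), (45, 5, 1), (300, 30, 6)]

total = sum(c for c, _, _ in papers)                 # total citations
total_minus_self = sum(c - s for c, s, _ in papers)  # minus self-citations
per_author = sum(c / a for c, _, a in papers)        # citations / n_authors
```

The weighted variants (quality of the citing paper) need the whole citation graph, which is exactly where PageRank-style methods come in.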
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information
• Self citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to High Impact Factor journals

You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
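Extracting (href, anchor text) pairs, the raw material for both intuitions, can be sketched with Python's standard html.parser; the page snippet and URL below are invented.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # only text inside an open <a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorExtractor()
p.feed('See the <a href="http://example.com/HG2052">'
       'HG2052 Course Page</a> for details.')
```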
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
– Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
– An article's vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
– For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
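The steady-state visit rates of this walk can be computed directly by power iteration; a sketch on an invented four-page graph, using the 10% teleportation rate from above:

```python
# Tiny hypothetical web: adjacency lists of outgoing links.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
N = len(pages)
teleport = 0.10

rank = {p: 1 / N for p in pages}          # start from a uniform distribution
for _ in range(100):                      # power iteration to steady state
    new = {p: teleport / N for p in pages}
    for p in pages:
        out = links[p]
        if out:                           # spread rank along outlinks
            for q in out:
                new[q] += (1 - teleport) * rank[p] / len(out)
        else:                             # dead end: jump anywhere
            for q in pages:
                new[q] += (1 - teleport) * rank[p] / N
    rank = new
```

Page C, with the most inlinks, ends up with the highest PageRank, and the ranks always sum to 1 (they are a probability distribution).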
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

– Google does not index the target of a link marked nofollow
– Yahoo does not include the link in its ranking
– …
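A crawler respecting nofollow might count only unmarked links as endorsements; a regex-based sketch on an invented snippet (regexes over HTML are for illustration only — a real crawler would use a proper parser):

```python
import re

html = ('<a href="http://example.com/good">good</a> '
        '<a href="http://example.com/spam" rel="nofollow">spam</a>')

# Keep an href only if "nofollow" does not appear later in the same <a> tag.
endorsements = [m.group(1)
                for m in re.finditer(r'<a href="([^"]+)"(?![^>]*nofollow)',
                                     html)]
```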
74
Current Status
• There is a continuous battle between
– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
– DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
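Since DOIs are resolved through the doi.org proxy, turning a DOI name into a clickable location is a one-liner; the example uses the DOI quoted above.

```python
def doi_url(doi):
    """A DOI resolves through the doi.org proxy to its current location."""
    return "https://doi.org/" + doi

url = doi_url("10.1007/s10579-008-9062-z")
```

The proxy then redirects to whatever URL the current metadata records, which is what makes the identifier persistent even when the location changes.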
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
– Passively, for information access
– Actively, for collaboration
• The internet is interesting in and of itself
– As a source of data about existing language
– As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Collaboration and Wikisatilde Version Control Systems
atilde Wikipedia
atilde Licensing and Ownership
30
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)
- Wikipedia makes it easy to share your knowledge; people like to do this
- Most edits are done by insiders
- Most content is added by outsiders
- Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
   - Content policies: NPOV, Verifiability, and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules

Wikipedia: Five pillars 33
Licenses and Ownership
- Copyright
- Copyleft
- Creative Commons
Review and Conclusions 34
What is a good article?
1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images

Wikipedia: Good article criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)
- The structure of markup: visual vs logical (WYSIWYG, WYSIAYG, WYSIWYM)
- The structure of the Web: hypertext, pages linked to pages
- The future of the Web
- Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form
Unlinked content: pages which no other pages link to, so web crawlers cannot reach them
Private Web: sites that require registration and login (e.g. Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses, or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find.
Review and Conclusions 40
Visual Markup vs Logical Markup
- Visual Markup (Presentational)
  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like
- Logical Markup (Structural)
  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
- Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow drawing conclusions about the data
  ✗ Google is missing functionality that linguists and lexicographers would like to have
- Web Sample: use a search engine to download data from the net and build a corpus from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ✗ sparser data
Review and Conclusions 43
Direct Query
- Accessible through search engines (Google API, Yahoo API, scripts)
- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  - it is document frequency, not term frequency
- whit or "web hit" is the more general term
- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
- GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
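As a quick sketch, the ratio comparison above can be computed directly; the counts are the slide's 2012 figures, not freshly queried numbers:

```python
# Comparing two web-count phenomena as a unitless ratio, using the
# slide's "different from" vs "different than" ghit counts
# (accessed 2012-04-04).

def ghit_ratio(count_a: int, count_b: int) -> float:
    """Return the ratio of two (estimated) document-frequency counts."""
    return count_a / count_b

different_from = 251_000_000   # ghits for "different from"
different_than = 71_500_000    # ghits for "different than"

ratio = ghit_ratio(different_from, different_than)
print(f"{ratio:.1f}:1")   # roughly 3.5:1, matching the slide
```

Because the absolute counts drift as the index changes, only the ratio (and the access date) is worth reporting.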
Web Sample
- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select seed words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Post-process the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47
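Step 2 of the pipeline above (combining seed words into queries) can be sketched as follows; the seed words and sizes here are illustrative toys, not Sharoff's actual lists:

```python
# Minimal sketch: turn a list of seed words into multi-word
# search-engine queries by sampling word combinations.
import itertools
import random

def make_queries(seeds, words_per_query=4, n_queries=10, rng=None):
    """Randomly sample word combinations to use as search-engine queries."""
    rng = rng or random.Random(0)          # fixed seed for reproducibility
    combos = list(itertools.combinations(seeds, words_per_query))
    rng.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

seeds = ["time", "people", "water", "house", "family", "night"]
for q in make_queries(seeds, words_per_query=3, n_queries=5):
    print(q)
```

The real pipeline would then send each query to a search-engine API, collect the returned URLs, and download and clean the pages.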
Internet Corpora Summary
- The web can be used as a corpus
  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ✗ unreliable counts
  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ✗ Less data
- Richer data than a compiled corpus
  ✗ Less balanced, less markup
48
Text and Meta-text
- Explicit Meta-data
  - Keywords and Categories
  - Rankings
  - Structural Markup
- Implicit Meta-data
  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations
and Zombies 49
Explicit Metadata
- You can get information from metadata within documents
  - When they are accurate, they are very good
  - They are often deceitful
- HTML, PDF, Word, … metadata
- Keywords and Tags
- Rankings
- Links and Citations
- Structural Markup
50
Implicit Metadata
- You can get clues from metadata within documents
  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful
- HTML tags as constituent boundaries
- Tables as Semantic Relations
- File Names (content type and language)
- Translations: bracketed glosses, cross-lingual disambiguation
- Query Data
- Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words
  - Similarity-based categorisation and classification
    * Character n-gram rankings
  - Machine Learning based methods
  - Context (under .jp or .ko)
- Hard to do for short text snippets, similar languages, mixed text
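The similarity-based approach above can be sketched with character n-gram rank profiles, in the spirit of the out-of-place measure; the training snippets are toy examples, far too small for real use:

```python
# Sketch of language ID by comparing character n-gram rankings:
# rank the document's n-grams, rank each language profile's n-grams,
# and pick the language with the smallest total rank displacement.
from collections import Counter

def ngram_ranks(text, n=3, top=50):
    """Map each of the `top` most frequent character n-grams to its rank."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(doc_ranks, lang_ranks, penalty=1000):
    """Sum of rank differences; missing n-grams get a large penalty."""
    return sum(abs(r - lang_ranks.get(g, penalty)) for g, r in doc_ranks.items())

profiles = {  # toy training data: real profiles need far more text
    "en": ngram_ranks("the quick brown fox jumps over the lazy dog " * 5),
    "de": ngram_ranks("der schnelle braune fuchs springt ueber den hund " * 5),
}

def identify(snippet):
    doc = ngram_ranks(snippet)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

print(identify("the dog jumps"))   # → en (on this toy data)
```

Even this toy version illustrates the slide's caveat: very short snippets share few n-grams with any profile, so the decision becomes unstable.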
Review and Conclusions 53
Normalization
- Extracting text from various documents
- Segmenting continuous text
- Number Normalization: $700K, $700,000, 0.7 million dollars, …
- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
- Stripping stop words (the, a, of, …)
- Lemmatization: produces → produce
- Stemming: producer → produc, produces → produc
- Decompounding: zonnecel → zon + cel
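Two of the steps above, number normalization and stemming, can be sketched with a regular expression and naive suffix stripping; this is a stand-in for a real stemmer such as Porter's, and the patterns handle only the formats shown:

```python
# Minimal normalization sketch: rewrite "$700K"-style amounts to plain
# numbers, and crudely stem a few English suffixes.
import re

def normalize_amount(text):
    """$700K -> $700000 (only the K/M shorthand; other formats untouched)."""
    def repl(m):
        value = float(m.group(1)) * {"K": 1e3, "M": 1e6}[m.group(2)]
        return f"${value:,.0f}".replace(",", "")
    return re.sub(r"\$(\d+(?:\.\d+)?)([KM])", repl, text)

def crude_stem(word):
    """Strip one common suffix, keeping a minimum stem length."""
    for suffix in ("ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(normalize_amount("It cost $700K last year"))     # It cost $700000 last year
print(crude_stem("producer"), crude_stem("produces"))  # produc produc
```

Note how stemming conflates producer and produces into the same string, exactly as in the slide's example, at the cost of producing non-words.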
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
- Web of data
  - provides a common data representation framework
  - makes possible integrating multiple sources
  - so you can draw new conclusions
- Increase the utility of information by connecting it to definitions and context
- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
- XML is a set of rules for encoding documents electronically
  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation
- Can be validated for syntactic well-formedness
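Well-formedness checking can be illustrated with the Python standard library: a well-formed document parses, while a mismatched tag raises an error.

```python
# Check XML syntactic well-formedness with the standard library parser.
import xml.etree.ElementTree as ET

def well_formed(xml_text: str) -> bool:
    try:
        ET.fromstring(xml_text)     # parsing succeeds only if well-formed
        return True
    except ET.ParseError:           # e.g. mismatched or unclosed tags
        return False

print(well_formed("<note><to>Tove</to></note>"))   # True
print(well_formed("<note><to>Tove</note></to>"))   # False: tags mismatched
```

Well-formedness is purely syntactic; checking a document against a schema (validity) is a separate, stronger test.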
Review and Conclusions 58
Semantic Web Architecture (details)
- Identify things with Uniform Resource Identifiers (URIs)
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
- Identify relations with the Resource Description Framework (RDF)
  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything
- You can build relations in ontologies (OWL)
  - Then reason over them, search them, …
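The triple model above can be illustrated with plain tuples; the URIs and predicate names here are made-up placeholders (real RDF work would use a library such as rdflib and established vocabularies):

```python
# Sketch of RDF's data model: each statement is a
# <subject, predicate, object> triple of identifiers (or literals).
triples = [
    ("urn:isbn:0000000000", "http://example.org/terms#author", "A. Author"),
    ("urn:isbn:0000000000", "http://example.org/terms#title", "An Example Title"),
]

def objects(graph, subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects(triples, "urn:isbn:0000000000", "http://example.org/terms#title"))
```

Because every statement has the same three-part shape, graphs from different sources can simply be concatenated and queried together, which is the "web of data" integration the slides describe.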
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
- The Semantic Web is about structuring data
- Text Mining is about unstructured data
- There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
- How can we tell what is a good scientific paper?
  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked
  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
- Some scores are:
  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)
- Problems:
  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones
- Weight citations by the quality of the paper
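The first three scores above can be sketched over a toy citation record; the paper data here are invented for illustration:

```python
# Toy citation record: each paper has a citation count, a count of
# self-citations, and a number of authors.
papers = [
    {"citations": 40, "self": 5, "authors": 2},
    {"citations": 10, "self": 1, "authors": 4},
]

total = sum(p["citations"] for p in papers)
total_minus_self = sum(p["citations"] - p["self"] for p in papers)
per_author = sum(p["citations"] / p["authors"] for p in papers)

print(total, total_minus_self, round(per_author, 1))  # 50 44 22.5
```

Note how dividing by author count changes the picture for multi-author papers, which is exactly why the different scores can rank the same researchers differently.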
Review and Conclusions 64
Gaming Citations
- Least/Minimum Publishable Unit
  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information
- Self citation, in-group citation
- Write only proceedings (some journals are not often read)
- Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC (Strongly Connected Core): can travel from any page to any page
Review and Conclusions 66
Anchor Text
- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="...">HG803 Course Page</a>

- Link analysis builds on two intuitions:
  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
Review and Conclusions 67
PageRank as Citation analysis
- Citation frequency can be used to measure the impact of an article
  - Simplest measure: each article gets one vote (not very accurate)
- On the web, citation frequency = inlink count
  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam
- Better measure: weighted citation frequency, or citation rank
  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
- Imagine a web surfer doing a random walk on the web
  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably
- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
- This long-term visit rate is the page's PageRank
- PageRank = long-term visit rate = steady state probability
Review and Conclusions 69
Teleporting: to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
- With the remaining probability (90%), go out on a random hyperlink
  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
- 10% is a parameter, the teleportation rate
- Note: "jumping" from a dead end is independent of the teleportation rate
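The teleporting random walk above can be sketched as power iteration over a toy graph; the four-page graph is invented, and 0.10 is the slide's teleportation rate (so a page with 4 outlinks passes each neighbour (1 - 0.10)/4 = 0.225 of its rank):

```python
# Power-iteration PageRank with a 10% teleportation rate.
TELEPORT = 0.10

def pagerank(links, iters=100):
    """links: dict page -> list of outlinked pages (dead ends allowed)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                          # dead end: jump anywhere
                for q in pages:
                    nxt[q] += rank[p] / n
            else:
                for q in pages:                  # teleport with prob 0.10
                    nxt[q] += TELEPORT * rank[p] / n
                for q in out:                    # else follow a random link
                    nxt[q] += (1 - TELEPORT) * rank[p] / len(out)
        rank = nxt
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

After enough iterations the ranks converge to the steady-state visit rates: they sum to 1, and a page with no inlinks (d) ends up with only the small teleportation share.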
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote.

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
- Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
73
- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink (e.g. <a rel="nofollow" href="...">spam</a>) to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
74
Current Status
- There is a continuous battle between
  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read
- All metrics get gamed
Review and Conclusions 75
Digital object identifier
- DOI: a string used to uniquely identify an electronic document or object
  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like ISBN)
- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u/
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Version Control Systems
atilde Versioning file systems
acirc every time a file is opened a new copy is stored
atilde CVS Subversion Git
acirc changes to a collection of files are trackedacirc simultaneous changes are merged
atilde Revision Tracking
acirc Revisions are stored within a file
atilde Authorship in shared writing Explicit responsibility for changes
Review and Conclusions 31
Wikipedia
atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)
atilde Wikipedia makes it easy to share your knowledgepeople like to do this
atilde Most edits are done by insiders
atilde Most content is added by outsiders
atilde Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1 Wikipedia is an online encyclopedia
2 Wikipedia has a neutral point of view
atilde Content policies NPOV Verifiability and No original research
3 Wikipedia is free content
4 Wikipedians should interact in a respectful and civil manner
5 Wikipedia does not have firm rules
WikipediaFivepillars 33
Licenses and Ownership
atilde Copyright
atilde Copyleft
atilde Creative Commons
Review and Conclusions 34
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
WikipediaGood_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as "mutual admiration societies".
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …
74
Current Status
• There is a continuous battle between:
  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a persistent identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – passively, for information access
  – actively, for collaboration
• The internet is interesting in and of itself
  – as a source of data about existing language
  – as a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press.
• Sproat, R. (2010) Language Technology and Society. Oxford University Press.
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition.
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (pre-req HG251 waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Wikipedia
• The core aim of the Wikimedia Foundation is "to get a free encyclopedia to every single person on the planet" (Jimmy Wales)
• Wikipedia makes it easy to share your knowledge; people like to do this
• Most edits are done by insiders
• Most content is added by outsiders
• Content comparable to Britannica
Review and Conclusions 32
The five pillars of Wikipedia
1. Wikipedia is an online encyclopedia
2. Wikipedia has a neutral point of view
   • Content policies: NPOV, Verifiability, and No original research
3. Wikipedia is free content
4. Wikipedians should interact in a respectful and civil manner
5. Wikipedia does not have firm rules
Wikipedia:Five_pillars 33
Licenses and Ownership
• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article?
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of markup: visual vs. logical; WYSIWYG, WYSIAYG, WYSIWYM
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form
Unlinked content: pages which are not linked to by other pages (so crawlers cannot reach them by following links)
Private Web: sites that require registration and login (e.g. Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses, or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual markup (presentational)
  – What you see is what you get (WYSIWYG)
  – Equivalent of printers' markup
  – Shows what things look like
• Logical markup (structural)
  – Shows the structure and meaning
  – Can be mapped to visual markup
  – Less flexible than visual markup
  – More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and the WWW as corpus
  (Objection: results are not reliable)
  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  – known size and exact hit counts → statistics possible
  – people can draw conclusions about the included text types
  – (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit, or "Google hit", is the most common unit used to count web snippets (in the early 2000s)
  – it is document frequency, not term frequency
• whit, or "web hit", is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than):
  251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, for "ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
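The ratio comparison above amounts to one division. The counts below are the slide's 2012 figures for "different from" vs. "different than"; today's numbers would differ, which is exactly why the unitless ratio is the thing to report.

```python
# Document-frequency counts (ghits) for two competing phrases,
# as quoted on the slide (accessed 2012-04-04).
ghits = {"different from": 251_000_000, "different than": 71_500_000}

# Absolute counts drift as the index grows; the ratio is more stable.
ratio = ghits["different from"] / ghits["different than"]
```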
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006):
  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  – do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select seed words (500)
2. Combine them to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding cleanup, tagging, and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006.
http://corpus.leeds.ac.uk/internet.html 47
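Step 2 of the outline (combining seed words into queries) might look like the sketch below. The seed words, query length, and counts are made-up stand-ins for the 500-word / 6,000-query scale quoted above.

```python
import random

def make_queries(seed_words, per_query=4, n_queries=10, seed=0):
    """Randomly combine seed words into distinct search-engine queries."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    queries = set()
    while len(queries) < n_queries:
        # Sample without replacement; sort for a canonical, de-duplicated form.
        queries.add(" ".join(sorted(rng.sample(seed_words, per_query))))
    return sorted(queries)

# Tiny illustrative seed list (a real run would use ~500 frequent words).
seeds = ["time", "people", "water", "work", "school", "music", "food", "city"]
queries = make_queries(seeds)
```

Each query would then be sent to a search-engine API (step 3), and the returned URLs downloaded and postprocessed.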
Internet Corpora Summary
• The web can be used as a corpus
  – Direct access:
    ∗ fast and convenient
    ∗ huge amounts of data
    ⊗ unreliable counts
  – Web sample:
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ⊗ less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text

• Explicit meta-data
  – Keywords and categories
  – Rankings
  – Structural markup
• Implicit meta-data
  – Links and citations
  – Tags
  – Tables
  – File names
  – Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  – When they are accurate, they are very good
  – They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and tags
• Rankings
• Links and citations
• Structural markup
50
Implicit Metadata
• You can get clues from metadata within documents
  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations — bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  – Linguistically-grounded methods:
    ∗ diacritics
    ∗ character n-grams
    ∗ stop words
  – Similarity-based categorisation and classification:
    ∗ character n-gram rankings
  – Machine-learning-based methods
  – Context (under jp or ko)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
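A minimal character-n-gram identifier, in the spirit of the linguistically-grounded methods listed above. The two tiny "profiles" are illustrative training snippets; a real system would train on much more text and handle far more languages.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character trigram counts, with padding so word edges count too."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def guess_language(text, profiles):
    """Pick the profile whose trigram distribution overlaps the text most."""
    grams = char_ngrams(text)
    def overlap(profile):
        return sum(min(count, profile[g]) for g, count in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# Toy profiles built from single sentences (illustrative only).
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and then some"),
    "de": char_ngrams("der schnelle braune fuchs springt ueber den faulen hund und dann"),
}
lang = guess_language("the dog jumps over the fox", profiles)
```

This also shows why the method degrades on short snippets: with only a handful of trigrams, the overlap scores for similar languages become hard to separate.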
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
Review and Conclusions 54
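A toy sketch of three of the steps above (tokenization, stop-word stripping, crude suffix stemming). The stop list and the single stemming rule are illustrative; a real system would use a proper stemmer such as Porter's.

```python
import re

STOP = {"the", "a", "an", "of", "to"}   # tiny illustrative stop list

def normalize(text):
    """Tokenize, drop stop words, then crudely strip -s/-es endings."""
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    tokens = [t for t in tokens if t not in STOP]
    # Only stem longer tokens, so short words like "its" are left alone.
    return [re.sub(r"(es|s)$", "", t) if len(t) > 3 else t for t in tokens]

norm = normalize("The producer produces a range of products")
```

Note how the crude rule conflates "produces" and "products" toward shared stems while missing "producer": exactly the kind of over- and under-stemming that motivates better algorithms.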
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation
Review and Conclusions 58
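The syntactic well-formedness mentioned above can be checked with any XML parser. A sketch using Python's standard library; the sample documents are made up.

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc):
    """True if every tag is closed, properly nested, under a single root."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

ok = is_well_formed("<course><code>HG2052</code><year>2020</year></course>")
bad = is_well_formed("<course><code>HG2052</course></code>")   # crossed tags
```

Well-formedness is purely syntactic; checking that a document follows a particular schema (validity) is a separate, stronger test.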
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework (RDF)
  – triples of <subject, predicate, object>
  – each element is a URI
  – RDF is written in well-defined XML
  – you can say anything about anything
• You can build relations into ontologies (OWL)
  – then reason over them, search them, …
Review and Conclusions 59
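The triple model can be illustrated without any RDF tooling: below is a tiny <subject, predicate, object> store with wildcard pattern queries. Apart from the ISBN URN from the slide, the identifiers are invented for illustration.

```python
# Every statement is one <subject, predicate, object> triple.
triples = {
    ("urn:isbn:1575864606", "ex:hasAuthor", "ex:Bond"),
    ("ex:Bond", "ex:worksAt", "ex:NTU"),
    ("ex:NTU", "ex:locatedIn", "ex:Singapore"),
}

def query(s=None, p=None, o=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

about_bond = query(s="ex:Bond")      # everything asserted about ex:Bond
```

Real RDF stores add exactly this kind of pattern matching (SPARQL), plus inference over ontologies, on top of the same triple structure.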
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based:
    ∗ read it and see if it is interesting (hard for a computer)
    ∗ compare it to other things you have read and liked
  – Context-based: citation analysis:
    ∗ see who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group
• Some scores are:
  – total number of citations (pretty useful)
  – total number of citations minus self-citations
  – total number of (citations / number of authors)
  – average (citation impact factor / number of authors)
• Problems:
  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
Review and Conclusions 64
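The simple scores listed above can be computed directly from a citation graph. The paper ids, author lists, and citation links below are fabricated for illustration; self-citation here means the citing and cited papers share at least one author.

```python
def citation_scores(papers):
    """papers maps a paper id to (author list, list of citing-paper ids)."""
    scores = {}
    for pid, (authors, cited_by) in papers.items():
        total = len(cited_by)
        # Drop citations from papers sharing an author (self-citations).
        non_self = len([c for c in cited_by
                        if not (set(authors) & set(papers[c][0]))])
        scores[pid] = {"citations": total,
                       "minus_self": non_self,
                       "per_author": total / len(authors)}
    return scores

papers = {
    "p1": (["bond"], ["p2", "p3"]),          # cited by p2 (self) and p3
    "p2": (["bond", "smith"], ["p3"]),
    "p3": (["lee"], []),
}
scores = citation_scores(papers)
```

Weighting each citation by the citing paper's own score, instead of counting one vote each, is precisely the step that leads from these measures to PageRank on the following slides.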
Gaming Citations
• Least/Minimum Publishable Unit:
  – break research into small chunks to increase the number of publications
  – sometimes there is very little new information
• Self-citation, in-group citation
• Writing only proceedings papers (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A.
2. The (extended) anchor text pointing to page B is a good description of page B.
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
Review and Conclusions 67
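The raw material for both intuitions is the set of (href, anchor text) pairs on a page. A sketch using Python's standard html.parser; the page snippet and URL are made up (the slide's own HG803 URL is truncated in the source).

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None     # href of the <a> we are currently inside
        self._text = []       # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = AnchorCollector()
collector.feed('See the <a href="http://example.com/hg2052">HG2052 Course Page</a>.')
```

A link-analysis system would aggregate the anchor texts of all inlinks to a page and index them as an external description of that page.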
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Licenses and Ownership

• Copyright
• Copyleft
• Creative Commons
Review and Conclusions 34
What is a good article
1. Well-written
2. Factually accurate and verifiable
3. Broad in its coverage
4. Neutral
5. Stable
6. Illustrated, if possible, by images
Wikipedia:Good_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)
• The structure of markup: visual vs logical; WYSIWYG, WYSIAYG, WYSIWYM
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802/ 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages (no backlinks, so crawlers never reach them)
Private Web: sites that require registration and login (e.g. Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or in specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup

• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like

• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow drawing conclusions about the data
  ⊗ Google lacks functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it
  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions about the included text types
  ◦ (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  ◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than: 25,100,000 ghits vs 7,150,000, or 3.5:1; accessed 2012-04-04)
• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
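As a worked example of the ratio arithmetic, using the 2012 counts from this slide:

```python
def ghit_ratio(count_a, count_b):
    """Reduce two hit counts to a unitless ratio, rounded to one decimal place."""
    return round(count_a / count_b, 1)

# "different from" vs "different than", ghits as of 2012-04-04
print(ghit_ratio(25_100_000, 7_150_000))  # 3.5, i.e. a 3.5:1 ratio
```

Reporting the ratio rather than the raw counts stays meaningful even as the index grows.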
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short, or too long word lists)
  ◦ do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select seed words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Bologna, 2006.
http://corpus.leeds.ac.uk/internet.html 47
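The query-generation and URL-collection steps of the outline can be sketched in a few lines. This is a toy illustration, not Sharoff's actual tooling: `make_queries`, `collect_urls` and the stub search function are all invented here.

```python
from itertools import combinations

def make_queries(seed_words, per_query=2):
    """Step 2: combine seed words into conjunctive queries."""
    return [" ".join(pair) for pair in combinations(seed_words, per_query)]

def collect_urls(queries, search):
    """Step 3: run each query through a search function, keeping each URL once."""
    seen, urls = set(), []
    for q in queries:
        for url in search(q):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

seeds = ["language", "internet", "corpus", "web"]
queries = make_queries(seeds)
print(len(queries))  # 6 pairwise queries from 4 seed words

# A stub search engine for demonstration; a real one would return live results.
fake = lambda q: ["http://example.org/%s" % q.replace(" ", "-")]
print(len(collect_urls(queries, fake)))  # 6
```

Scaling the seed list to 500 words gives the thousands of queries the outline mentions; downloading and postprocessing then follow as separate steps.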
Internet Corpora Summary
• The web can be used as a corpus
  ◦ Direct access
    ∗ fast and convenient
    ∗ huge amounts of data
    ⊗ unreliable counts
  ◦ Web sample
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ⊗ less data
• Richer data than a compiled corpus
  ⊗ Less balanced, less markup
48
Text and Meta-text

• Explicit Meta-data
  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup

• Implicit Meta-data
  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  ◦ When they are accurate, they are very good
  ◦ They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  ◦ as they are unintended, they tend to be noisy
  ◦ but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations — Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine-learning-based methods
  ◦ Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
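Character n-gram ranking can be sketched in a few lines, in the style of Cavnar & Trenkle's out-of-place measure. This is a minimal illustration, not a production identifier: the two training sentences are invented and far too small for real use.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank character n-grams of a text by frequency."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unseen in training get a maximum penalty."""
    penalty = len(lang_profile)
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(abs(i - rank.get(g, penalty)) for i, g in enumerate(doc_profile))

def identify(text, training):
    """Pick the language whose training profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(training, key=lambda lang: out_of_place(doc, training[lang]))

# Toy training data, just to show the mechanics
training = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund"),
}
print(identify("the dog and the fox", training))  # en
```

On short snippets or closely related languages the profiles overlap heavily, which is exactly the difficulty noted above.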
Review and Conclusions 53
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
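The number-normalization step can be sketched with two regular expressions. The patterns and their coverage are invented for illustration; a real normalizer handles far more formats.

```python
import re

def normalize_number(s):
    """Map money expressions like '$700K' or '0.7 million dollars' to plain dollars."""
    # Form 1: $700, $700000, $700K, $0.7M, ...
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)([KMB]?)", s)
    if m:
        scale = {"": 1, "K": 1e3, "M": 1e6, "B": 1e9}[m.group(2)]
        return int(float(m.group(1)) * scale)
    # Form 2: 0.7 million dollars, 2 billion dollars, ...
    m = re.fullmatch(r"\$?(\d+(?:\.\d+)?) (thousand|million|billion) dollars?", s)
    if m:
        scale = {"thousand": 1e3, "million": 1e6, "billion": 1e9}[m.group(2)]
        return int(float(m.group(1)) * scale)
    raise ValueError("unrecognised number expression: %r" % s)

print(normalize_number("$700K"))                # 700000
print(normalize_number("0.7 million dollars"))  # 700000
```

Mapping all variants to one canonical value is what lets a search or counting system treat "$700K" and "0.7 million dollars" as the same thing.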
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  ◦ provides a common data representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically
  ◦ is a markup language, much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
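Well-formedness checking can be demonstrated with Python's standard library; the `<note>` vocabulary below is self-defined, as XML allows.

```python
import xml.etree.ElementTree as ET

# A tiny document using tags we defined ourselves.
doc = "<note><to>students</to><body>Revise lecture 12!</body></note>"

def well_formed(xml_string):
    """Syntactic well-formedness: every tag closed and properly nested."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(well_formed(doc))                          # True
print(well_formed("<note><to>students</note>"))  # False: <to> is never closed
```

Note that well-formedness is purely syntactic; checking a document against a schema (validity) is a separate, stricter test.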
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
  ◦ Universal Resource Name: urn:isbn:1575864606
  ◦ Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ You can say anything about anything
• You can build relations into ontologies (OWL)
  ◦ Then reason over them, search them, …
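A triple store can be mocked in a few lines to show the data model. The course URI below is invented; the predicates are Dublin Core terms, and a real application would use an RDF library rather than plain tuples.

```python
# Triples of (subject, predicate, object); each element is a URI
# except literal values like the title string.
triples = [
    ("http://example.org/hg2052", "http://purl.org/dc/terms/creator",
     "http://www3.ntu.edu.sg/home/fcbond/"),
    ("http://example.org/hg2052", "http://purl.org/dc/terms/title",
     "Language Technology and the Internet"),
]

def objects(triples, subject, predicate):
    """A minimal lookup: all objects asserted for a subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(triples, "http://example.org/hg2052",
              "http://purl.org/dc/terms/title"))
# ['Language Technology and the Internet']
```

Because every element is a URI, triples from different sources about the same subject can be merged mechanically, which is the point of the "web of data".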
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ◦ Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  ◦ Total Number of Citations (pretty useful)
  ◦ Total Number of Citations minus Self-citations
  ◦ Total of (Citations / Number of Authors)
  ◦ Average (Citation Impact Factor / Number of Authors)
• Problems:
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC, the Strongly Connected Core — within it, you can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter: the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
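The teleporting random walk above can be computed by power iteration. This is a toy sketch of the standard algorithm, not Google's implementation; the three-page graph is invented.

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration for PageRank with teleportation.
    links: {page: [pages it links to]}; dead ends jump uniformly at random."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = {p: teleport / n for p in pages}  # teleport mass, spread evenly
        for p in pages:
            out = links[p]
            if out:  # spread the remaining 90% over the outlinks
                share = (1 - teleport) * rank[p] / len(out)
                for q in out:
                    nxt[q] += share
            else:    # dead end: jump to every page uniformly
                for q in pages:
                    nxt[q] += (1 - teleport) * rank[p] / n
        rank = nxt
    return rank

# Toy graph: C is linked to by both A and B, so it earns the highest rank.
ranks = pagerank({"A": ["C"], "B": ["A", "C"], "C": []})
print(max(ranks, key=ranks.get))  # C
```

The ranks always sum to 1, matching their reading as long-term visit rates.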
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo! does not include the link in its ranking
  ◦ …
74
Current Status
• There is a continuous battle between
  ◦ search companies, who want to get the most useful page to the user
  ◦ page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u/
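Because the DOI is permanent while publisher URLs change, the stable way to link a document is through the doi.org resolver, which redirects to the current location stored in the DOI's metadata:

```python
def doi_url(doi):
    """Build the canonical resolver URL for a DOI (the doi.org proxy)."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```

Citing this resolver URL keeps a reference usable even if the publisher's site is reorganized.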
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press.
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press.
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (prerequisite HG2051, can be waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
What is a good article
1 Well-written
2 Factually accurate and verifiable
3 Broad in its coverage
4 Neutral
5 Stable
6 Illustrated if possible by images
WikipediaGood_article_criteria 35
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– is a W3C Recommendation
• Can be validated for syntactic well-formedness
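Checking syntactic well-formedness is mechanical: a parser either accepts the document or it does not. A minimal sketch with Python's standard library (the `<course>` tags are an invented example, not a real schema):

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True iff the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course code='HG2052'><title>Language Technology</title></course>"))  # True
print(well_formed("<course><title>unclosed</course>"))  # False: mismatched tags
```

Note this checks only well-formedness (every open tag closed, proper nesting), not validity against a DTD or schema.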
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
– Uniform Resource Name: urn:isbn:1575864606
– Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework
– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything
• You can build relations in ontologies (OWL)
– Then reason over them, search them, …
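The triple model above can be sketched with plain Python tuples and a wildcard pattern query; the `ex:` URIs are invented placeholders, not real vocabulary terms, and a real system would use an RDF store.

```python
# A tiny invented triple store: <subject, predicate, object>.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:topic", "ex:LanguageTechnology"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None is a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(query(s="ex:HG2052"))   # everything said about HG2052
print(query(p="ex:worksAt"))  # who works where
```

Because every statement has the same shape, "you can say anything about anything" and queries over merged sources stay uniform.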
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible (know thyself)
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
– Total Number of Citations (pretty useful)
– Total Number of Citations minus Self-citations
– Total Number of (Citations / Number of Authors)
– Average (Citation Impact Factor / Number of Authors)
• Problems:
– Not all citations are equal: citations by 'good' papers are better
– Newer publications suffer in relation to older ones
• Weight citations by the quality of the paper
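The simple scores above can be sketched over a toy dataset; the papers, author counts and citation lists are invented for illustration.

```python
# Invented toy data: who cites whom, how many authors, self-citations.
papers = {
    "A": {"authors": 2, "cited_by": ["B", "C", "D"], "self_citations": 1},
    "B": {"authors": 1, "cited_by": ["C"], "self_citations": 0},
}

def total(p):
    """Total number of citations."""
    return len(papers[p]["cited_by"])

def minus_self(p):
    """Total citations minus self-citations."""
    return total(p) - papers[p]["self_citations"]

def per_author(p):
    """Citations divided by number of authors."""
    return total(p) / papers[p]["authors"]

print(total("A"), minus_self("A"), per_author("A"))  # → 3 2 1.5
```

Each score corrects for one distortion (self-promotion, team size) but none addresses the core problem that citations from good papers should count for more; that motivates PageRank-style weighting below.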
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high impact factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
<a href="http://path/to/there/page/HG803">HG803: Language Technology and the Internet</a>
For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>
• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
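Extracting (href, anchor text) pairs, the raw material for both intuitions, can be sketched with the standard-library HTML parser; the example URL is an invented placeholder.

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:          # only collect text inside <a>…</a>
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorExtractor()
p.feed('See the <a href="http://example.com/HG803">HG803 Course Page</a>.')
print(p.links)  # [('http://example.com/HG803', 'HG803 Course Page')]
```

A search engine aggregates such anchor texts from many pages to build its description of the link target.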
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
– Simplest measure: each article gets one vote (not very accurate)
• On the web, citation frequency = inlink count
– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
– An article's vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting — to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
– For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
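The random walk with teleportation can be sketched as power iteration on a small invented four-page graph (D is a dead end); real PageRank runs the same update over billions of pages.

```python
# Invented toy web graph: page -> outgoing links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
TELEPORT = 0.10
N = len(graph)

rank = {p: 1 / N for p in graph}          # start uniform
for _ in range(100):                      # power iteration to steady state
    new = {p: 0.0 for p in graph}
    for p, links in graph.items():
        if not links:                     # dead end: jump anywhere, prob 1/N
            for q in graph:
                new[q] += rank[p] / N
        else:
            for q in graph:               # teleport with prob 0.10
                new[q] += TELEPORT * rank[p] / N
            for q in links:               # follow a link with prob 0.90
                new[q] += (1 - TELEPORT) * rank[p] / len(links)
    rank = new

print({p: round(r, 3) for p, r in sorted(rank.items())})
```

The ranks always sum to 1 (they are a probability distribution), and D, which nothing links to, ends up with the lowest long-term visit rate.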
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
– Google does not index the target of a link marked nofollow
– Yahoo does not include the link in its ranking
– …
74
Current Status
• There is a continuous battle between
– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
– DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
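Resolution works by prefixing the DOI with the standard doi.org proxy, so the identifier stays stable even when the landing page moves; a minimal sketch:

```python
def doi_url(doi):
    """Build the resolver URL for a DOI via the standard doi.org proxy."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# → https://doi.org/10.1007/s10579-008-9062-z
```

Requesting that URL asks the registration agency's metadata for the current location, which is how the publisher can change its site (as SpringerLink since has) without breaking citations.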
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
– Passively, for information access
– Actively, for collaboration
• The internet is interesting in and of itself
– As a source of data about existing language
– As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
The World Wide Web and HTML
Review and Conclusions 36
The InterWeb
• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)
• The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWIM)
• The structure of the Web — hypertext: pages linked to pages
• The future of the Web
• Linguistic features of the web: un-edited, large volume, editable, multi-media
Review and Conclusions 37
Map of online communities
http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form
Unlinked content: pages which are not linked to by other pages (but clicking links them)
Private Web: sites that require registration and login (Edventure)
Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)
Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)
Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
– What you see is what you get (WYSIWYG)
– Equivalent of printers' markup
– Shows what things look like
• Logical Markup (Structural)
– Shows the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)
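The point that logical markup "can be mapped to visual markup" (but not reliably the other way round) can be sketched as a one-way rendering table; the tag choices below are an invented example, not a real stylesheet language.

```python
# Invented mapping from logical tags to visual (presentational) tags.
logical_to_visual = {
    "em": ("<i>", "</i>"),           # emphasis happens to render as italics
    "title": ("<b><u>", "</u></b>"), # a title happens to render bold+underlined
}

def render(tag, text):
    """Map one logical element to its visual presentation."""
    open_, close = logical_to_visual[tag]
    return f"{open_}{text}{close}"

print(render("em", "really"))  # → <i>really</i>
```

Going forward (logical → visual) is a table lookup; going backward is ambiguous, since `<i>` might mean emphasis, a book title, or a foreign word — which is why logical markup is the more reusable form.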
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and WWW as corpus (Objection: results are not reliable)
– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
– known size and exact hit counts → statistics possible
– people can draw conclusions over the included text types
– (limited) control over the content
⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:
– lots of repetitions of the same text (not representative)
– very limited query precision (no upper/lower case, no punctuation)
– only estimated counts, often hard to reproduce exactly
– different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
– it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
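The ratio comparison above is simple arithmetic; a sketch using the slide's 2012 figures (counts at other times will differ, which is exactly why the ratio, not the raw count, is reported):

```python
def ratio(hits_a, hits_b):
    """Unitless ratio between two hit counts."""
    return hits_a / hits_b

different_from = 251_000_000   # ghits for "different from" (2012-04-04)
different_than = 71_500_000    # ghits for "different than" (2012-04-04)
print(f"{ratio(different_from, different_than):.1f}:1")  # → 3.5:1
```

The ratio stays meaningful even as the index grows, whereas the raw ghit counts do not.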
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
– gather documents for different topics (balance)
– exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
– exclude documents which seem irrelevant for a corpus (too short, or too-long word lists)
– do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html 47
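The query-generation step of the outline above (step 2) can be sketched by combining seed words into multi-word queries; the five seeds here are an invented stand-in for the ~500 used in practice.

```python
import itertools
import random

# Tiny invented seed list; the real pipeline uses ~500 mid-frequency words.
seeds = ["time", "people", "water", "history", "music"]

random.seed(0)                                   # reproducible sample
pairs = list(itertools.combinations(seeds, 2))   # all 2-word combinations
queries = random.sample(pairs, 6)                # step 2: pick 6 queries
for q in queries:
    print(" ".join(q))   # each line would be sent to a search engine (step 3)
```

Mid-frequency seed words are used so that the returned pages are running text on varied topics rather than word lists or spam, which is what gives the sample its (rough) balance.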
Internet Corpora Summary
• The web can be used as a corpus
– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts
– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data
• Richer data than a compiled corpus
⊗ Less balanced, less markup
48
Text and Meta-text
• Explicit Meta-data
– Keywords and Categories
– Rankings
– Structural Markup
• Implicit Meta-data
– Links and Citations
– Tags
– Tables
– File Names
– Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
– When they are accurate, they are very good
– They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
The InterWeb
atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging
atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM
atilde The structure of the Web mdash hypertextpages linked to pages
atilde The future of the Web
atilde Linguistic features of the webun-edited large volume editable multi-media
Review and Conclusions 37
Map of online communities
httpxkcdcom802 38
The Deep Web (the Invisible Web)
Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form
Unlinked content pages which are not linked to by other pages (but clicking linksthem)
Private Web sites that require registration and login (Edventure)
Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)
Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)
Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions
Review and Conclusions 39
Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata

• You can get clues from implicit metadata within documents
  ◦ as it is unintended, it tends to be noisy
  ◦ but it is rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations: bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification

• Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine-learning-based methods
  ◦ Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
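A minimal character n-gram identifier, in the spirit of the similarity-based methods above. The two "profiles" below are trained on single toy sentences, so this is a sketch of the idea rather than a usable classifier; real systems build profiles from large per-language corpora.

```python
from collections import Counter

def ngrams(text, n=3):
    """Character n-gram counts, padded so word edges count too."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy profiles: one training sentence per language.
PROFILES = {
    "en": ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": ngrams("der schnelle braune fuchs springt über den faulen hund"),
}

def identify(snippet):
    """Pick the language whose profile shares the most n-gram mass."""
    probe = ngrams(snippet)
    return max(PROFILES,
               key=lambda lang: sum(min(c, PROFILES[lang][g])
                                    for g, c in probe.items()))
```

Even this toy version illustrates why short snippets and similar languages are hard: with little text, few n-grams overlap with any profile.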
Normalization

• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
Review and Conclusions 54
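A few of these steps can be sketched directly. The regular expression and the suffix list below are deliberately crude stand-ins for real number normalizers and stemmers (the suffix stripper only gestures at what Porter-style stemming does).

```python
import re

def normalize_number(text):
    """Rewrite '$700K'-style amounts as plain figures (toy rule, no commas)."""
    return re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: "$" + str(int(float(m.group(1)) * 1000)),
                  text)

STOP_WORDS = {"the", "a", "an", "of"}

def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Strip stop words, then stem what remains.
tokens = [w for w in "the producer produces a product".split()
          if w not in STOP_WORDS]
stems = [stem(w) for w in tokens]
```

Note how the stemmer conflates *producer* and *produces* into the same string *produc*, exactly the behaviour shown on the slide.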
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web

• Web of data
  ◦ provides a common data representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically
  ◦ is a markup language, much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
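Checking syntactic well-formedness is exactly what any standard XML parser does; a minimal sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

def well_formed(xml_text):
    """True if the string parses as well-formed XML (a syntactic check only)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Note that well-formedness says nothing about whether the tags are meaningful; validating against a schema is a separate, stronger check.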
Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)
  ◦ Uniform Resource Name: urn:isbn:1575864606
  ◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework (RDF)
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ You can say anything about anything
• You can build relations into ontologies (OWL)
  ◦ Then reason over them, search them, …
Review and Conclusions 59
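The triple model can be mimicked in a few lines of plain Python. The example.org URIs below are invented for illustration (they are not a real published vocabulary), and a real system would use an RDF library rather than a set of tuples.

```python
# A minimal in-memory triple store: each RDF statement is a
# <subject, predicate, object> triple, with every element a URI.
triples = {
    ("http://example.org/HG2052", "http://example.org/taughtBy",
     "http://example.org/bond"),
    ("http://example.org/HG2052", "http://example.org/topic",
     "http://example.org/SemanticWeb"),
    ("http://example.org/bond", "http://example.org/memberOf",
     "http://example.org/NTU"),
}

def objects(subject, predicate):
    """All objects asserted for a given subject-predicate pair."""
    return {o for s, p, o in triples if s == subject and p == predicate}
```

Because every element is an identifier, triples from different sources about the same URI can simply be unioned together, which is the integration story of the "web of data".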
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata:

1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible – know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP

• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks

• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ◦ Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  ◦ Total number of citations (pretty useful)
  ◦ Total number of citations minus self-citations
  ◦ Total of (citations / number of authors)
  ◦ Average of (citation impact factor / number of authors)
• Problems:
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
Review and Conclusions 64
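The simpler scores above are just arithmetic over a publication list. The numbers below are invented for illustration; each paper is a (citations, number_of_authors, self_citations) tuple for one hypothetical author.

```python
# Invented publication record: (citations, number_of_authors, self_citations)
papers = [(30, 2, 4), (12, 3, 0), (50, 5, 10)]

# Total number of citations
total_citations = sum(c for c, _, _ in papers)

# Total citations minus self-citations
minus_self = sum(c - s for c, _, s in papers)

# Total of (citations / number of authors): split credit among co-authors
author_share = sum(c / a for c, a, _ in papers)
```

The three scores rank the same author quite differently, which is part of why no single citation metric is trusted on its own.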
Gaming Citations

• Least/Minimum Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page
Review and Conclusions 66
Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
PageRank as Citation Analysis

• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote – not very accurate
• On the web: citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random Walk

• Imagine a web surfer doing a random walk on the web
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter: the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
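The random walk with teleportation can be computed as a short power iteration. This is a sketch following the slides' conventions (10% teleport rate, uniform jumps from dead ends), not Google's production algorithm; the three-page graph at the bottom is a made-up example.

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration for PageRank with teleportation.

    `links` maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # start uniform
    for _ in range(iterations):
        new = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links[p]
            if out:
                for q in pages:                   # teleport with prob. 0.10
                    new[q] += teleport * rank[p] / n
                share = (1.0 - teleport) * rank[p] / len(out)
                for q in out:                     # else follow a random link
                    new[q] += share
            else:
                for q in pages:                   # dead end: uniform jump,
                    new[q] += rank[p] / n         # independent of teleport rate
        rank = new
    return rank

# Toy graph: a links to b; b links to a and c; c is a dead end.
rank = pagerank({"a": ["b"], "b": ["a", "c"], "c": []})
```

After enough iterations the ranks stop changing: that fixed point is the steady-state visit rate from the previous slide.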
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission
73
• Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo! does not include the link in its ranking
  ◦ …
74
Current Status

• There is a continuous battle between:
  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a persistent identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics – (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Map of Online Communities

http://xkcd.com/802 38
The Deep Web (the Invisible Web)
Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions
Review and Conclusions 39
Non-HTML/text content: textual content encoded in multimedia (image or video) files or in specific file formats not handled by search engines

These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like
• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to Using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions about the included text types
  ◦ (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:
  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  ◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, i.e. 3.5:1 (accessed 2012-04-04)
• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
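The ratio on this slide is simple arithmetic; the point is to report the unitless ratio and the access date rather than the raw counts, which drift as the index changes. Using the counts as reported on 2012-04-04:

```python
# ghit counts for the two variants, as reported on 2012-04-04.
ghits = {"different from": 251_000_000, "different than": 71_500_000}

# The unitless ratio is what is worth reporting, together with the date.
ratio = ghits["different from"] / ghits["different than"]  # roughly 3.5:1
```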
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Non-HTML text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines
These pages all include data that search engines cannot find
Review and Conclusions 40
Visual Markup vs Logical Markup
• Visual Markup (Presentational)
  ◦ What you see is what you get (WYSIWYG)
  ◦ The equivalent of printers' markup
  ◦ Shows what things look like
• Logical Markup (Structural)
  ◦ Shows the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have
• Web Sample: use a search engine to download data from the net and build a corpus from it
  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions about the included text types
  ◦ (limited) control over the content
  ⊗ sparser data
Review and Conclusions 43
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  ◦ lots of repetition of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  ◦ it is document frequency, not term frequency
• whit or "web hit" is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)
• GPB, "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
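The ratio comparison above can be sketched in a few lines; the counts are the ones quoted on this slide, and `gpb` is an illustrative helper, not a standard function:

```python
# Hypothetical ghit counts for two variants, as read off a search-engine
# results page on a given date (figures from the slide, accessed 2012-04-04).
counts = {"different from": 251_000_000, "different than": 71_500_000}

ratio = counts["different from"] / counts["different than"]
print(f"ratio: {ratio:.1f}:1")  # about 3.5:1

# If the engine's index size (in billions of documents) were known,
# ghits-per-billion would give a number more comparable across time:
def gpb(ghits, index_size_billions):
    """Ghits per billion indexed documents."""
    return ghits / index_size_billions
```

The point of both helpers is the same: raw counts are unstable, so report ratios (or counts per fixed index size) together with the date of access.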
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with the available tools (here, taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  ◦ do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select Seed Words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding, cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006
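Step 2 of the outline above can be sketched as follows; `make_queries` is an illustrative name (not from Sharoff's actual tooling), and the search, download and postprocessing steps are only indicated in comments:

```python
# Sketch of the query-generation step of a Sharoff-style pipeline:
# random tuples of seed words become search-engine queries.
import random

def make_queries(seed_words, tuple_size=4, n_queries=10):
    """Combine random tuples of seed words into distinct queries."""
    queries = set()
    while len(queries) < n_queries:
        queries.add(" ".join(random.sample(seed_words, tuple_size)))
    return sorted(queries)

seeds = ["house", "water", "friend", "morning", "letter", "music"]
for q in make_queries(seeds, tuple_size=3, n_queries=5):
    print(q)
# The full pipeline would then send each query to a search-engine API,
# download the returned URLs, and clean, tag and parse the resulting text.
```

Using mid-frequency content words as seeds is what keeps the downloaded sample general-purpose rather than topic-skewed.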
http://corpus.leeds.ac.uk/internet.html 47
Internet Corpora Summary
• The web can be used as a corpus
  ◦ Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts
  ◦ Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
  ⊗ Less balanced, less markup
48
Text and Meta-text
• Explicit Meta-data
  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup
• Implicit Meta-data
  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  ◦ When it is accurate, it is very good
  ◦ It is often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  ◦ as they are unintended, they tend to be noisy
  ◦ but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as Semantic Relations
• File Names (content type and language)
• Translations: Bracketed Glosses, Cross-lingual Disambiguation
• Query Data
• Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine-learning-based methods
  ◦ Context (e.g. under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
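The character-n-gram ranking idea can be sketched as a toy classifier in the spirit of Cavnar and Trenkle's out-of-place measure; the training texts here are invented and far too small for real use:

```python
# Toy character-n-gram language identifier: compare ranked n-gram
# profiles, penalising n-grams that are missing or far out of place.
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank differences; maximum penalty for absent n-grams."""
    max_pen = len(profile)
    return sum(abs(i - profile.index(g)) if g in profile else max_pen
               for i, g in enumerate(doc_profile))

training = {  # invented miniature "corpora"
    "en": "the quick brown fox jumps over the lazy dog and then the cat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und dann",
}
profiles = {lang: ngram_profile(t) for lang, t in training.items()}

def identify(snippet):
    doc = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

print(identify("the dog and the cat"))
print(identify("der hund und der fuchs"))
```

Even this toy version shows why short snippets and closely related languages are hard: with few n-grams, the rank distances between candidate languages shrink toward the noise level.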
Review and Conclusions 53
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number Normalization: $700K, $700,000, 0.7 million dollars, …
• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
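Two of the steps above, number normalization and stemming, can be sketched as follows; the rules are deliberately crude, just enough for the slide's own examples:

```python
# Minimal sketch of two normalization steps: expanding "$700K"-style
# amounts, and naive suffix-stripping stemming.
import re

def normalize_amount(text):
    """Rewrite '$700K' as '$700,000' (illustrative; K and M only)."""
    scale = {"K": 1_000, "M": 1_000_000}
    return re.sub(r"\$(\d+)([KM])",
                  lambda m: f"${int(m.group(1)) * scale[m.group(2)]:,}",
                  text)

def stem(word):
    """Crude suffix stripping, enough for the slide's example."""
    for suffix in ("es", "er", "e"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(normalize_amount("They raised $700K last year"))
print(stem("produces"), stem("producer"))  # produc produc
```

A real system would use a proper stemmer (e.g. Porter) and locale-aware number and date grammars; the point here is only that normalization maps surface variants onto a single countable form.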
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  ◦ provides a common data representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increases the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g. not just "color", but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6/color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  ◦ it is a markup language, much like HTML
  ◦ it was designed to carry data, not to display data
  ◦ its tags are not predefined: you must define your own tags
  ◦ it is designed to be self-descriptive
  ◦ XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
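Well-formedness checking (as opposed to full validation against a schema) can be sketched with the standard library; the sample documents are invented:

```python
# Check whether a string is well-formed XML: every opening tag must
# have a matching close, properly nested.
import xml.etree.ElementTree as ET

good = "<course code='HG2052'><topic>web as corpus</topic></course>"
bad = "<course><topic>unclosed</course>"  # <topic> never closed

def well_formed(doc):
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed(good))  # True
print(well_formed(bad))   # False
```

Well-formedness is purely syntactic; checking that the tags are the *right* ones requires validating against a DTD or XML Schema, which needs extra tooling.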
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  ◦ Uniform Resource Name: urn:isbn:1575864606
  ◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ You can say anything about anything
• You can build relations in ontologies (OWL)
  ◦ Then reason over them, search them, …
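A minimal sketch of triples as data, with shortened readable names standing in for full URIs (a real application would use an RDF library such as rdflib, and the facts below are invented for illustration):

```python
# Triples as Python tuples: a toy store you can pattern-match against.
triples = {
    ("HG2052", "taughtBy", "FrancisBond"),
    ("HG2052", "partOf", "LMS"),
    ("FrancisBond", "memberOf", "LMS"),
}

def query(s=None, p=None, o=None):
    """Return triples matching the given pattern (None = wildcard)."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(query(p="memberOf"))   # who is a member of what?
print(query(s="HG2052"))     # everything said about HG2052
```

This wildcard matching is exactly the query pattern SPARQL generalises; reasoning (e.g. with OWL) amounts to deriving new triples from the stored ones.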
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ◦ Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  ◦ Total Number of Citations (pretty useful)
  ◦ Total Number of Citations minus Self-citations
  ◦ Total Number of (Citations / Number of Authors)
  ◦ Average (Citation Impact Factor / Number of Authors)
• Problems:
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• So: weight citations by the quality of the citing paper
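The first three scores above can be sketched over a toy citation table; the papers, author counts and citation lists are invented:

```python
# Toy citation data: who cites whom, and how many authors each paper has.
papers = {
    "A": {"authors": 2, "cited_by": ["B", "C", "A"]},  # one self-citation
    "B": {"authors": 1, "cited_by": ["C"]},
    "C": {"authors": 3, "cited_by": []},
}

def total_citations(p):
    return len(papers[p]["cited_by"])

def citations_minus_self(p):
    return sum(1 for c in papers[p]["cited_by"] if c != p)

def per_author(p):
    """Citations divided among the paper's authors."""
    return total_citations(p) / papers[p]["authors"]

print(total_citations("A"), citations_minus_self("A"), per_author("A"))
```

Note how the three scores already disagree about paper A; none of them weights citations by the quality of the citing paper, which is the gap PageRank-style measures fill.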
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self-citation, in-group citation
• Write only proceedings papers (some journals are not often read)
• Submit only to high-impact-factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC, the Strongly Connected Core: within it, you can travel from any page to any other page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>
For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>
• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote (not very accurate)
• On the web, citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its own citation impact
  ◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
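The random walk with 10% teleportation can be sketched as power iteration over a tiny invented four-page graph, with the dead end handled by a uniform jump as described above:

```python
# Power-iteration PageRank with teleportation on a toy 4-page graph.
N = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}  # page 3 is a dead end
teleport = 0.10

rank = [1.0 / N] * N  # start uniform
for _ in range(100):
    new = [teleport / N] * N  # teleportation mass, spread uniformly
    for page, outs in links.items():
        share = rank[page] * (1 - teleport)
        if outs:
            for o in outs:          # follow a random outlink
                new[o] += share / len(outs)
        else:
            for o in range(N):      # dead end: jump uniformly
                new[o] += share / N
    rank = new

print([round(r, 3) for r in rank])  # long-term visit rates, sum to 1
```

After enough iterations the vector stops changing: that fixed point is the steady-state visit-rate distribution, i.e. the PageRank of each page.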
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo! does not include the link in its ranking
  ◦ …
74
Current Status
• There is a continuous battle between
  ◦ search companies, who want to get the most useful page to the user
  ◦ page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
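Resolution works by prefixing the DOI with the central resolver's address, which redirects to the publisher's current location for the object; `doi_to_url` is an illustrative helper, not part of any DOI library:

```python
# Build the resolver URL for a DOI: the doi.org proxy redirects the
# request to wherever the registered metadata currently points.
def doi_to_url(doi, resolver="https://doi.org/"):
    """Prefix a DOI name with a resolver address."""
    return resolver + doi

print(doi_to_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```

This indirection is what makes DOIs persistent: when a publisher reorganises its site, only the metadata behind the DOI changes, not the identifier that papers cite.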
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (Language itself)
• Writing (invented 3-5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics (pre-req HG251, waived): an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Visual Markup vs Logical Markup
atilde Visual Markup (Presentational)
acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like
atilde Logical Markup (Structural)
acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)
Review and Conclusions 41
The Web as Corpus
Review and Conclusions 42
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata

➣ You can get clues from metadata within documents
  ➢ as they are unintended, they tend to be noisy
  ➢ but they are rarely deceitful
➣ HTML tags as constituent boundaries
➣ Tables as Semantic Relations
➣ File Names (content type and language)
➣ Translations — Bracketed Glosses, Cross-lingual Disambiguation
➣ Query Data
➣ Wikipedia Redirections and Cross-wiki Links
Language Identification and Normalization

Language Identification

➣ Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
  ➢ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ➢ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ➢ Machine Learning based methods
  ➢ Context (under .jp or .ko)
➣ Hard to do for short text snippets, similar languages, mixed text
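The similarity-based approach (character n-gram rank profiles, in the style of Cavnar and Trenkle) can be sketched minimally; the toy training sentences here are stand-ins for the much larger samples a real identifier needs:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank character n-grams by frequency, most frequent first."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, reference):
    """Out-of-place distance: sum of rank differences; n-grams missing
    from the reference get the maximum penalty."""
    ref_rank = {g: i for i, g in enumerate(reference)}
    return sum(abs(i - ref_rank.get(g, len(reference)))
               for i, g in enumerate(profile))

def identify(text, references):
    """Pick the reference language whose profile is closest to the text's."""
    profile = ngram_profile(text)
    return min(references, key=lambda lang: out_of_place(profile, references[lang]))

# Toy training data; real systems train on much larger samples
refs = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 20),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund " * 20),
}
print(identify("the dog jumps over the lazy fox", refs))
```

As the slide notes, this breaks down for short snippets: with only a handful of n-grams, closely related languages score almost identically.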
Normalization

➣ Extracting text from various documents
➣ Segmenting continuous text
➣ Number normalization: $700K = $700,000 = 0.7 million dollars, …
➣ Date normalization: 2000 AD = 1421 AH = Heisei 12, …
➣ Stripping stop words (the, a, of, …)
➣ Lemmatization: produces → produce
➣ Stemming: producer → produc, produces → produc
➣ Decompounding: zonnecel → zon + cel
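Two of the steps above (number normalization and stop-word stripping) can be sketched as a toy function; the stop-word list and the money pattern are illustrative only:

```python
import re

STOP_WORDS = {"the", "a", "of", "an", "in"}

def normalize(text):
    """Lowercase, expand $NNNK money expressions, and strip stop words."""
    text = text.lower()
    # $700k -> $700000  (one of the normalizations on the slide)
    text = re.sub(r"\$(\d+)k\b", lambda m: "$" + m.group(1) + "000", text)
    return [w for w in text.split() if w not in STOP_WORDS]

print(normalize("The price of $700K in the city"))
```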
The Semantic Web

Goals of the Semantic Web

➣ Web of data
  ➢ provides a common data representation framework
  ➢ makes it possible to integrate multiple sources
  ➢ so you can draw new conclusions
➣ Increase the utility of information by connecting it to definitions and context
➣ More efficient information access and analysis

E.g. not just “color” but a concept denoted by the Web identifier
<http://pantone.tm.example.com/2002/std6#color>
Semantic Web Architecture
XML: eXtensible Markup Language

➣ XML is a set of rules for encoding documents electronically
  ➢ it is a markup language, much like HTML
  ➢ it was designed to carry data, not to display data
  ➢ tags are not predefined: you must define your own tags
  ➢ it is designed to be self-descriptive
  ➢ XML is a W3C Recommendation
➣ Can be validated for syntactic well-formedness
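A minimal illustration: a document with self-defined tags, checked for well-formedness with Python's standard-library parser (the tag names are invented for the example):

```python
import xml.etree.ElementTree as ET

# A small self-descriptive XML document: the tags <note>, <to>, <from>
# are not predefined anywhere -- we chose them ourselves
doc = """<note>
  <to>Students</to>
  <from>HG2052</from>
  <body>XML carries data; displaying it is left to other layers.</body>
</note>"""

root = ET.fromstring(doc)   # raises ParseError if the XML is not well-formed
print(root.tag, root.find("to").text)
```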
Semantic Web Architecture (details)

➣ Identify things with Uniform Resource Identifiers (URIs)
  ➢ Uniform Resource Name: urn:isbn:1575864606
  ➢ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
➣ Identify relations with the Resource Description Framework (RDF)
  ➢ Triples of <subject, predicate, object>
  ➢ Each element is a URI
  ➢ RDF is written in well-defined XML
  ➢ You can say anything about anything
➣ You can build relations in ontologies (OWL)
  ➢ Then reason over them, search them, …
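Stripped to its essentials, an RDF graph is just a set of URI triples; a plain-Python sketch (all URIs here are hypothetical, and a real application would use an RDF library rather than bare tuples):

```python
# A toy triple store: each triple is <subject, predicate, object>,
# and each element is a URI (invented example.com URIs for illustration)
triples = [
    ("http://example.com/person#bond", "http://example.com/rel#teaches",
     "http://example.com/course#HG2052"),
    ("http://example.com/course#HG2052", "http://example.com/rel#about",
     "http://example.com/topic#internet"),
]

def objects(subject, predicate, graph):
    """Query the graph: all objects standing in `predicate` to `subject`."""
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects("http://example.com/person#bond",
              "http://example.com/rel#teaches", triples))
```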
Criticism of the Semantic Web

Doctorow’s seven insurmountable obstacles to reliable metadata:

1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible — know thyself
5. Schemas aren’t neutral
6. Metrics influence results
7. There’s more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm
Semantic Web and NLP

➣ The Semantic Web is about structuring data
➣ Text Mining is about unstructured data
➣ There is much more unstructured than structured data
  ➢ NLP can infer structure
  ➢ NLP makes the Semantic Web feasible
  ➢ the Semantic Web can be a resource for NLP
Citation, Reputation and PageRank

Citation Networks

➣ How can we tell what is a good scientific paper?
  ➢ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ➢ Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Reputation and Citation Analysis

➣ One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
➣ Some scores are:
  ➢ Total Number of Citations (pretty useful)
  ➢ Total Number of Citations minus Self-citations
  ➢ Total of (Citations / Number of Authors)
  ➢ Average (Citation Impact Factor / Number of Authors)
➣ Problems:
  ➢ Not all citations are equal: citations by ‘good’ papers are better
  ➢ Newer publications suffer in relation to older ones
➣ ⇒ Weight citations by the quality of the citing paper
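The first three scores can be computed from a toy record of who cites a paper (the dictionary layout is invented purely for illustration):

```python
def citation_scores(paper):
    """Return (total citations, citations minus self-citations,
    citations per author) for one paper.

    `paper` is a hypothetical record: "authors" is the paper's own author
    set, "citations" is a list of author sets, one per citing paper."""
    citing = paper["citations"]
    own = paper["authors"]
    total = len(citing)
    # a self-citation is any citing paper sharing an author with this one
    non_self = sum(1 for c in citing if not (c & own))
    per_author = total / len(own)
    return total, non_self, per_author

paper = {"authors": {"bond"},
         "citations": [{"bond"}, {"smith"}, {"lee", "kim"}]}
print(citation_scores(paper))
```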
Gaming Citations

➣ Least/Minimum Publishable Unit
  ➢ Break research into small chunks to increase the number of citations
  ➢ Sometimes there is very little new information
➣ Self citation, in-group citation
➣ Write-only proceedings (some journals are not often read)
➣ Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve
Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page
Anchor Text

➣ Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet, see the <a href="…">HG803 Course Page</a>

➣ Link analysis builds on two intuitions:
  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
PageRank as Citation Analysis

➣ Citation frequency can be used to measure the impact of an article
  ➢ Simplest measure: each article gets one vote — not very accurate
➣ On the web, citation frequency = inlink count
  ➢ A high inlink count does not necessarily mean high quality …
  ➢ … mainly because of link spam
➣ Better measure: weighted citation frequency, or citation rank
  ➢ An article’s vote is weighted according to its citation impact
  ➢ This can be formalized in a well-defined way and calculated
PageRank as a Random Walk

➣ Imagine a web surfer doing a random walk on the web:
  ➢ Start at a random page
  ➢ At each step, go out of the current page along one of the links on that page, equiprobably
➣ In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there
➣ This long-term visit rate is the page’s PageRank
➣ PageRank = long-term visit rate = steady-state probability
Teleporting — to get us out of dead ends

➣ At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
➣ At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
➣ With the remaining probability (90%), go out on a random hyperlink
  ➢ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
➣ 10% is a parameter, the teleportation rate
➣ Note: “jumping” from a dead end is independent of the teleportation rate
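The random walk with teleportation can be sketched as power iteration over a small toy graph (the link structure is invented for the example; dead ends distribute their mass uniformly, independent of the teleportation rate, as described above):

```python
def pagerank(links, teleport=0.10, iters=50):
    """Power iteration: `links` maps each page to its list of outlinks."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links[p]
            if out:
                # teleport with probability 10%, follow a random link otherwise
                for q in pages:
                    new[q] += teleport * rank[p] / n
                share = (1 - teleport) * rank[p] / len(out)
                for q in out:
                    new[q] += share
            else:
                # dead end: jump to a random page with probability 1/n each
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # C, which collects the most inlink "votes"
```

Note that D, with no inlinks at all, still ends up with a small nonzero rank: teleportation guarantees every page is visited occasionally.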
Example Graph

(figure: each inbound link is a positive vote)
http://www.ams.org/samplings/feature-column/fcarc-pagerank

Example Graph, Weighted

(figure: pages with higher PageRanks are shown lighter)
Gaming PageRank

➣ Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
➣ Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
➣ Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
➣ Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index.
  ➢ Google does not index the target of a link marked nofollow
  ➢ Yahoo does not include the link in its ranking
  ➢ …
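For instance, a user-contributed comment link tagged so that it passes no ranking credit might look like this (URL hypothetical):

```html
<a href="http://example.com/spam-page" rel="nofollow">great site!!</a>
```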
Current Status

➣ There is a continuous battle between:
  ➢ Search companies, who want to get the most useful page to the user
  ➢ Page writers, who want to get their page read
➣ All metrics get gamed
Digital Object Identifier

➣ DOI: a string used to uniquely identify an electronic document or object
  ➢ Metadata about the object is stored with the DOI name
  ➢ The metadata includes a location, such as a URL
  ➢ The DOI for a document is permanent; the metadata may change
  ➢ Gives a Persistent Identifier (like an ISBN)
➣ The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
➣ By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ➢ DOI 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u
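Any DOI can be turned into a persistent link through the central resolver, which then redirects to the current location stored in the metadata:

```python
def doi_to_url(doi):
    """A DOI resolves through the central doi.org proxy, giving a
    persistent link even if the document's actual URL changes."""
    return "https://doi.org/" + doi

print(doi_to_url("10.1007/s10579-008-9062-z"))
```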
Conclusions

Revolutions in Language Technology

➣ Speech (language itself)
➣ Writing (invented 3–5 times)
➣ Printing (made writing common)
➣ Digital Text (made writing transferable)
➣ Hyperlinking (taking writing beyond language)
The Internet

➣ The internet is useful as a tool
  ➢ Passively, for information access
  ➢ Actively, for collaboration
➣ The internet is interesting in and of itself
  ➢ As a source of data about existing language
  ➢ As a source of innovation in language
Recommended Texts

➣ Wikipedia
➣ Crystal, D. (2001) Language and the Internet. Cambridge University Press
➣ Sproat, R. (2010) Language Technology and Society. Oxford University Press
➣ Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
➣ Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
➣ Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Complementary Courses

➣ HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
➣ HG3051 Corpus Linguistics — an introduction to the fast-growing field of corpus linguistics (pre-requisite HG2051 can be waived)
Two Approaches to using the Web as a Corpus
atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)
acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have
atilde Web Sample Use search engine to download data from the net and build a corpusfrom it
acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data
Review and Conclusions 43
Direct Query
atilde Accessible through search engines (Google API Yahoo API Scripts)
atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but
acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers
Review and Conclusions 44
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework (RDF)
  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything
• You can build relations in ontologies (OWL)
  – Then reason over them, search them, …
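Conceptually, an RDF store is just a set of <subject, predicate, object> triples that can be queried by pattern. A minimal sketch with plain Python tuples (real RDF is serialized in XML and queried with SPARQL; the example.org URIs and facts below are invented):

```python
EX = "http://example.org/"  # hypothetical namespace for illustration

triples = {
    (EX + "bond", EX + "teaches", EX + "HG2052"),
    (EX + "HG2052", EX + "title", "Language Technology and the Internet"),
    (EX + "bond", EX + "worksAt", EX + "NTU"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}
```

"You can say anything about anything": any URI can appear as subject or object, and new sources can be merged by simply taking the union of their triple sets.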
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata:
1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
Citation, Reputation and PageRank
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  – Total Number of Citations (pretty useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)
• Problems:
  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
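The first three scores above can be computed directly from a citation network. A sketch over an invented mini-network (paper ids, author names and the metadata layout are all made up for illustration):

```python
def citation_scores(paper, papers):
    """Toy citation metrics: total, minus self-citations, per author.

    `papers` maps an id to {'authors': [...], 'cites': {ids cited}}."""
    citers = [p for p, meta in papers.items() if paper in meta["cites"]]
    authors = set(papers[paper]["authors"])
    # a self-citation here = any shared author between citing and cited paper
    self_cites = sum(1 for p in citers if authors & set(papers[p]["authors"]))
    return {
        "citations": len(citers),
        "minus_self": len(citers) - self_cites,
        "per_author": len(citers) / len(authors),
    }

# Invented mini citation network: B and C both cite A.
papers = {
    "A": {"authors": ["smith", "jones"], "cites": set()},
    "B": {"authors": ["smith"], "cites": {"A"}},  # shares an author with A
    "C": {"authors": ["lee"], "cites": {"A"}},
}
```

The fourth score (weighting by the citing paper's own impact) needs a recursive notion of quality, which is exactly where PageRank-style ranking comes in below.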
Gaming Citations
• Least/Minimum Publishable Unit
  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information
• Self citation, in-group citation
• Write only proceedings papers (some journals are not often read)
• Submitting only to high impact factor journals
You improve what gets measured, not necessarily what you want to improve
Characteristics of the Web: Bow Tie
SCC: Strongly Connected Core – can travel from any page to any page
Anchor Text
• Recall how hyperlinks are written:
  <a href="http://path/to/there/page/HG803">HG803: Language Technology and the Internet</a>
  For more information about Language Technology and the Internet, see the <a href="http://…">HG803 Course Page</a>
• Link analysis builds on two intuitions:
  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
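Extracting (link target, anchor text) pairs is the first step of any anchor-text analysis; a minimal sketch with the standard-library HTML parser (the example URL is invented):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # only collect text inside an anchor
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
```

A search engine indexes the anchor text under the *target* page, which is why "HG803 Course Page" helps retrieve the course page even if those words never appear on it.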
PageRank as Citation Analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: each article gets one vote – not very accurate
• On the web: citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
PageRank as Random Walk
• Imagine a web surfer doing a random walk on the web
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady state probability
Teleporting – to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
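The random walk with teleportation can be computed by power iteration; a small sketch (the three-page graph and iteration count are invented, and a real implementation would use sparse matrices):

```python
def pagerank(links, teleport=0.10, iters=50):
    """Power iteration for PageRank with teleportation.

    `links` maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links[p]
            if out:
                for q in pages:                       # teleport: prob. 10%
                    new[q] += teleport * rank[p] / n
                share = (1.0 - teleport) * rank[p] / len(out)
                for q in out:                         # follow a random outlink
                    new[q] += share
            else:
                for q in pages:                       # dead end: jump anywhere
                    new[q] += rank[p] / n
        rank = new
    return rank
```

Each iteration redistributes exactly all the probability mass, so the ranks always sum to 1 and converge to the steady-state visit rates; note the dead-end case spreads its full mass uniformly, independent of the teleportation rate, just as the slide says.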
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank
Example Graph: Weighted
Pages with higher PageRanks are lighter
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission
• Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links
The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …
Current Status
• There is a continuous battle between:
  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read
• All metrics get gamed
Digital Object Identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Conclusions
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
The Internet
• The internet is useful as a tool
  – Passively, for information access
  – Actively, for collaboration
• The internet is interesting in and of itself
  – As a source of data about existing language
  – As a source of innovation in language
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Complementary Courses
• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics – (pre-req HG251 waived) an introduction to the fast-growing field of corpus linguistics
Direct Query
• Accessible through search engines (Google API, Yahoo API, scripts)
• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:
  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers
Web Count Units
• ghit, or "google hit", is the most common unit used to count web snippets (in the early 2000s)
  – it is document frequency, not term frequency
• whit, or "web hit", is the more general term
• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than:
  251,000,000 ghits vs 71,500,000, or 3.5:1; accessed 2012-04-04)
• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size
• So always say when you counted, and try to use ratios
http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
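Turning two raw counts into the kind of unitless ratio quoted above is a one-liner worth making explicit (the formatting choice of one decimal place is mine, not from the lecture):

```python
def whit_ratio(count_a, count_b):
    """Reduce two raw web counts to a unitless ratio string like '3.5:1'."""
    if count_a >= count_b:
        return f"{count_a / count_b:.1f}:1"
    return f"1:{count_b / count_a:.1f}"
```

Because both counts come from the same index on the same day, the index size cancels out, which is why ratios are more stable than raw ghit counts.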
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  – do this for several languages and make the corpora available
Building Internet Corpora: Outline
1. Select seed words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Postprocess the data (encoding cleanup, tagging and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Bologna, 2006
http://corpus.leeds.ac.uk/internet.html
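Step 2 of the outline can be sketched as follows: combine seed words into multi-word queries and cap the total. The four-words-per-query choice and the tiny seed list are illustrative assumptions; Sharoff's actual recipe differs in detail.

```python
from itertools import combinations

def build_queries(seeds, words_per_query=4, limit=6000):
    """Combine seed words into search-engine queries (step 2 of the outline)."""
    queries = []
    for combo in combinations(seeds, words_per_query):
        queries.append(" ".join(combo))
        if len(queries) == limit:   # 500 seeds give far more than 6,000 combos
            break
    return queries
```

With 500 seeds the number of possible four-word combinations is astronomically larger than 6,000, so any real pipeline samples rather than enumerates; the cutoff here just stands in for that sampling step.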
Internet Corpora: Summary
• The web can be used as a corpus
  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts
  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
  ⊗ Less balanced, less markup
Text and Meta-text
• Explicit meta-data
  – Keywords and Categories
  – Rankings
  – Structural Markup
• Implicit meta-data
  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations
  – … and Zombies
Explicit Metadata
• You can get information from metadata within documents
  – When they are accurate, they are very good
  – They are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
Implicit Metadata
• You can get clues from metadata within documents
  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations: bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Web Count Units
atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)
acirc it is document frequency not term frequency
atilde whit or ldquoweb hitrdquo is the more general term
atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)
atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size
atilde So always say when you counted and try to use ratios
httpitrecisupennedu myllanguagelogarchives000953html 45
Web Sample
atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)
acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-
gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word
lists)acirc do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as a Random Walk

• Imagine a web surfer doing a random walk on the web
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
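The random walk with teleporting can be computed by power iteration. A sketch on a toy four-page graph (the graph itself is invented for illustration):

```python
# Power-iteration PageRank with a 10% teleport rate. D is a dead end:
# its mass is spread uniformly, independently of the teleport rate.
graph = {                 # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],              # dead end
}
pages = list(graph)
N = len(pages)
teleport = 0.10

rank = {p: 1.0 / N for p in pages}
for _ in range(100):
    new = dict.fromkeys(pages, 0.0)
    for p in pages:
        if graph[p]:
            for q in pages:            # teleport: 10% spread over all pages
                new[q] += teleport * rank[p] / N
            for q in graph[p]:         # 90%: follow an outlink equiprobably
                new[q] += (1 - teleport) * rank[p] / len(graph[p])
        else:
            for q in pages:            # dead end: jump anywhere
                new[q] += rank[p] / N
    rank = new

print(round(sum(rank.values()), 6))    # 1.0 (a probability distribution)
print(max(rank, key=rank.get))         # C (it collects the most inlinks)
```

Note how C ends up with the highest rank even though A has the same number of inlinks: C's inlinks come from pages that themselves have high rank.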
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph: Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …
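A crawler that respects nofollow simply drops such links when building its link graph. A sketch (the URLs are invented; note that rel is a space-separated list of values):

```python
# Keep only hyperlinks whose rel attribute does not contain "nofollow".
from html.parser import HTMLParser

class FollowableLinks(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").split()   # rel is space-separated
        if "nofollow" not in rel and "href" in a:
            self.hrefs.append(a["href"])

p = FollowableLinks()
p.feed('<a href="http://good.example/">ok</a> '
       '<a rel="nofollow" href="http://spam.example/">buy now</a>')
print(p.hrefs)   # ['http://good.example/']
```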
74
Current Status
• There is a continuous battle between
  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a persistent identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
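Because the location lives in the metadata, dereferencing a DOI is just a matter of prepending a resolver; doi.org is the resolver operated by the DOI Foundation:

```python
# Turn a DOI name into a resolvable URL via the doi.org proxy.
def doi_url(doi):
    return "https://doi.org/" + doi

# The example DOI from the slide:
print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```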
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – Passively, for information access
  – Actively, for collaboration
• The internet is interesting in and of itself
  – As a source of data about existing language
  – As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics (pre-requisite HG2051, waived): an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Web Sample
• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with the available tools (here, taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  – do this for several languages and make the corpora available
46
Building Internet Corpora Outline
1. Select seed words (500)
2. Combine to form multiple queries (6,000)
3. Query a search engine and retrieve the URLs (50,000)
4. Download the files from the URLs (100,000,000 words)
5. Post-process the data (encoding cleanup, tagging, and parsing)
Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006
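Steps 1–2 of the pipeline can be sketched as combining seed words into distinct search queries (the seed list here is illustrative; Sharoff used about 500 seeds and thousands of 4-word queries):

```python
# Pair seed words into multi-word search queries.
import random

seeds = ["language", "internet", "corpus", "speech", "writing", "data"]

def make_queries(seeds, n_queries, words_per_query=4):
    """Build n_queries distinct queries of words_per_query seed words."""
    queries = set()
    while len(queries) < n_queries:
        queries.add(" ".join(sorted(random.sample(seeds, words_per_query))))
    return sorted(queries)

random.seed(0)                 # reproducible for the example
for q in make_queries(seeds, 5):
    print(q)
```

Each query is then sent to a search engine, and the returned URLs are downloaded and cleaned (steps 3–5).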
http://corpus.leeds.ac.uk/internet.html 47
Internet Corpora Summary
• The web can be used as a corpus
  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts
  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data
• Richer data than a compiled corpus
  ⊗ Less balanced, less markup
48
Text and Meta-text

• Explicit Meta-data
  – Keywords and Categories
  – Rankings
  – Structural Markup
• Implicit Meta-data
  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations
and Zombies 49
Explicit Metadata
• You can get information from metadata within documents
  – When they are accurate, they are very good
  – But they are often deceitful
• HTML, PDF, Word, … metadata
• Keywords and Tags
• Rankings
• Links and Citations
• Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  – As they are unintended, they tend to be noisy
  – But they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations: bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (under .jp or .kr)
• Hard to do for short text snippets, similar languages, mixed text
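The similarity-based approach can be sketched with character-bigram rank profiles. A toy version (the two training texts are invented; real systems train on large corpora and compare the top few hundred n-grams):

```python
# Cavnar & Trenkle-style language ID: rank the most frequent character
# n-grams per language, then score a snippet by rank ("out-of-place")
# distance to each language profile.
from collections import Counter

def profile(text, n=2, top=30):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_prof, lang_prof):
    penalty = len(lang_prof)          # unseen grams get the max penalty
    return sum(abs(i - lang_prof.index(g)) if g in lang_prof else penalty
               for i, g in enumerate(doc_prof))

training = {
    "en": "the cat sat on the mat and the dog ate the hat",
    "nl": "de kat zat op de mat en de hond at de hoed",
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(snippet):
    doc = profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

print(identify("the dog and the cat"))   # en
print(identify("de hond en de kat"))     # nl
```

With snippets this short the method only works because the test text closely matches the training text, which is exactly the "hard for short snippets" point above.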
Review and Conclusions 53
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
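Two of these steps in miniature; both are deliberately toy versions (real systems use full taggers, lemmatizers, and stemmers such as Porter's):

```python
# Toy number normalization and suffix stemming.
import re

def normalize_numbers(text):
    # $700K -> $700,000 ; handles only this one illustrative pattern
    return re.sub(r"\$(\d+)K\b",
                  lambda m: "${:,}".format(int(m.group(1)) * 1000), text)

def stem(word):
    # crude suffix stripping: produces / producer -> produc
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(normalize_numbers("raised $700K last year"))  # raised $700,000 last year
print(stem("produces"), stem("producer"))           # produc produc
```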
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

E.g., not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically
  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
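Well-formedness can be checked mechanically; a sketch using the standard-library XML parser (the document content is invented):

```python
# Check XML well-formedness by attempting to parse it.
import xml.etree.ElementTree as ET

def well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(well_formed("<note><to>Tove</to></note>"))   # True
print(well_formed("<note><to>Tove</note>"))        # False (mismatched tag)
```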
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework (RDF)
  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything
• You can build relations into ontologies (OWL)
  – Then reason over them, search them, …
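At its core, RDF data reduces to a set of triples that can be pattern-matched. A toy in-memory store (the `ex:` URIs are invented for illustration; real stores speak SPARQL):

```python
# A minimal triple store with wildcard queries (None matches anything).
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:topic", "ex:LanguageTechnology"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    return [(s2, p2, o2) for (s2, p2, o2) in triples
            if s in (None, s2) and p in (None, p2) and o in (None, o2)]

print(query(s="ex:HG2052", p="ex:taughtBy"))
# [('ex:HG2052', 'ex:taughtBy', 'ex:FrancisBond')]
```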
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible (know thyself)
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation, and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Building Internet Corpora Outline
1 Select Seed Words (500)
2 Combine to form multiple queries (6000)
3 Query a search engine and retrieve the URLs (50000)
4 Download the files from the URLS (100000000 words)
5 Postprocess the data (encoding cleanup tagging and parsing)
Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006
httpcorpusleedsacukinternethtml 47
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
atilde You can get clues from metadata within documents
acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful
atilde HTML tags as constituent boundaries
atilde Tables as Semantic Relations
atilde File Names (content type and language)
atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation
atilde Query Data
atilde Wikipedia Redirections and Cross-wiki Links
Review and Conclusions 51
Language Identification andNormalization
Review and Conclusions 52
Language Identification
atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)
acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words
acirc Similarity-based categorisation and classificationlowast Character n-gram rankings
acirc Machine Learning based methodsacirc Context (under jp or ko)
atilde Hard to do for short test snippets similar languages mixed text
Review and Conclusions 53
Normalization
atilde Extracting text from various documents
atilde Segmenting continuous text
atilde Number Normalization $700K $700000 07 million dollars hellip
atilde Date Normalization 2000AD 1421AH Heisei 12 hellip
atilde Stripping stop words (the a of hellip)
atilde Lemmatization produces rarr produce
atilde Stemming producer rarr produc produces rarr produc
atilde Decompounding zonnecel rarr zon cel
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Internet Corpora Summary
atilde The web can be used as a corpus
acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts
acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data
atilde Richer data than a compiled corpus
otimes Less balanced less markup
48
Text and Meta-textatilde Explicit Meta-data
acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup
atilde Implicit Meta-data
acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations
and Zombies 49
Explicit Metadata
atilde You can get information from metadata within documents
acirc When they are accurate they are very goodacirc They are often deceitful
atilde HTML PDF Word hellipMetadata
atilde Keywords and Tags
atilde Rankings
atilde Links and Citations
atilde Structural Markup
50
Implicit Metadata
• You can get clues from metadata within documents
  ◦ As they are unintended, they tend to be noisy
  ◦ But they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations: bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Language Identification and Normalization
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  ◦ Linguistically grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine-learning-based methods
  ◦ Context (e.g. under the .jp or .kr domains)
• Hard to do for short text snippets, similar languages, mixed text
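The character n-gram ranking idea above can be sketched in a few lines, using an out-of-place measure in the style of Cavnar and Trenkle; the two training sentences are invented toy data, not real corpora, so this only illustrates the shape of the method:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(grams.most_common(top))}

def out_of_place(doc_profile, lang_profile, penalty=300):
    """Sum of rank differences; n-grams missing from the language profile pay a fixed penalty."""
    return sum(abs(lang_profile.get(g, penalty) - r) for g, r in doc_profile.items())

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy 'corpora' -- real systems train on much more text per language.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 20),
    "nl": ngram_profile("de snelle bruine vos springt over de luie hond " * 20),
}
print(identify("the dog jumps over the fox", profiles))  # en
```

As the slide notes, the method degrades on short snippets and on closely related languages, where the n-gram rankings overlap heavily.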
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
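Two of these steps can be sketched with the standard library alone; `crude_stem` is a deliberately naive suffix stripper (an illustration, not the Porter stemmer), and `normalize_dollars` handles only the dollar patterns shown on the slide:

```python
import re

def crude_stem(word):
    """Crude suffix stripping for illustration -- not Porter's algorithm."""
    for suffix in ("ing", "ers", "er", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

def normalize_dollars(text):
    """Map surface forms like '$700K' and '$700,000' to one canonical integer string."""
    m = re.fullmatch(r"\$(\d+(?:,\d{3})*)([KM])?", text)
    if not m:
        return text
    value = int(m.group(1).replace(",", ""))
    value *= {"K": 1_000, "M": 1_000_000}.get(m.group(2), 1)
    return str(value)

print(crude_stem("producer"), crude_stem("produces"))        # produc produc
print(normalize_dollars("$700K"), normalize_dollars("$700,000"))  # 700000 700000
```

Note that both stemmed forms collapse to the same string, which is the point of stemming: recall improves even though "produc" is not a real word, which is exactly where lemmatization differs.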
The Semantic Web
Goals of the Semantic Web
• Web of data
  ◦ provides a common data-representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6/color>
Semantic Web Architecture
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  ◦ is a markup language, much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ is a W3C Recommendation
• Can be validated for syntactic well-formedness
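Well-formedness checking is what any standard XML parser performs; a small sketch using Python's built-in parser, where the `<course>`, `<code>` and `<title>` tags are our own invented vocabulary (the point of XML being exactly that tags are not predefined):

```python
import xml.etree.ElementTree as ET

doc = ("<course><code>HG2052</code>"
       "<title>Language Technology and the Internet</title></course>")
root = ET.fromstring(doc)        # parses only because the XML is well-formed
print(root.find("code").text)    # HG2052

try:
    ET.fromstring("<course><code>HG2052</course>")  # mismatched tags
except ET.ParseError:
    print("not well-formed")
```

Well-formedness (every open tag closed, properly nested) is checked by the parser itself; validity against a particular schema is a separate, stronger check.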
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  ◦ Uniform Resource Name: urn:isbn:1575864606
  ◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ You can say anything about anything
• You can build relations into ontologies (OWL)
  ◦ Then reason over them, search them, …
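The triple model can be illustrated without any RDF tooling: a set of (subject, predicate, object) tuples plus pattern matching with wildcards. The abbreviated `ex:` URIs below are invented examples, not real identifiers:

```python
# Triples of (subject, predicate, object); the ex: URIs are invented examples.
triples = {
    ("ex:HG2052", "rdf:type", "ex:Course"),
    ("ex:HG2052", "ex:taughtIn", "ex:NTU"),
    ("ex:NTU", "rdf:type", "ex:University"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2]))

print(match(p="rdf:type"))
```

This wildcard query is the essence of what SPARQL does over real triple stores, where each element would be a full URI rather than an abbreviation.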
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
Citation, Reputation and PageRank
Citation Networks
• How can we tell what is a good scientific paper?
  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  ◦ Context-based: citation analysis
    ∗ See who else read it and thought it interesting enough to cite
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
• Some scores are:
  ◦ total number of citations (pretty useful)
  ◦ total number of citations minus self-citations
  ◦ total of (citations / number of authors)
  ◦ average (citation impact factor / number of authors)
• Problems:
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
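The first three scores are easy to compute once the citation network is in hand; a toy sketch over an invented three-paper network (real bibliometrics would of course use far richer data):

```python
# Toy citation network: paper id -> (authors, papers it cites). All data invented.
papers = {
    "p1": (["ann", "bob"], []),
    "p2": (["ann"], ["p1"]),
    "p3": (["cat"], ["p1", "p2"]),
}

def citation_scores(author):
    """Total citations, citations minus self-citations, and author-share-weighted citations."""
    mine = {pid for pid, (authors, _) in papers.items() if author in authors}
    total = self_cites = shared = 0.0
    for authors, refs in papers.values():
        for ref in refs:
            if ref in mine:
                total += 1
                if author in authors:              # citing paper shares an author
                    self_cites += 1
                shared += 1 / len(papers[ref][0])  # citations / number of authors
    return total, total - self_cites, shared

print(citation_scores("ann"))  # (3.0, 2.0, 2.0)
```

Even on this tiny example the scores diverge: ann's raw count is inflated by a self-citation, and her share of the co-authored paper p1 is split with bob.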
Gaming Citations
• Least/Minimum Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve
Characteristics of the Web Bow Tie
SCC: Strongly Connected Component (the core) – within it you can travel from any page to any other page
Anchor Text
• Recall how hyperlinks are written:

  <a href="http://path/to/there/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:
  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
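Collecting (link target, anchor text) pairs is a standard step when indexing a crawled page; a sketch with Python's built-in HTML parser, using a made-up URL:

```python
from html.parser import HTMLParser

class AnchorTexts(HTMLParser):
    """Collect (href, anchor text) pairs: the text is a cheap description of the target."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
    def handle_data(self, data):
        if self._href is not None:       # only record text inside an <a> element
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href, self._text = None, []

p = AnchorTexts()
p.feed('See the <a href="http://example.com/hg2052">HG2052 Course Page</a>.')
print(p.links)  # [('http://example.com/hg2052', 'HG2052 Course Page')]
```

A search engine would index the anchor text under the *target* URL, which is why pages can rank for words they never contain.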
PageRank as Citation Analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
PageRank as Random Walk
• Imagine a web surfer doing a random walk on the web:
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: the proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Teleporting – to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
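The random walk with teleportation can be computed by power iteration; a sketch on an invented four-page graph, following the convention above that a dead end jumps uniformly regardless of the teleport rate:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration: teleport with probability 10%, follow an outlink otherwise.
    A dead end spreads its whole score uniformly (net 1/N per page)."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: teleport / n for p in pages}   # teleport share from every page
        for p in pages:
            share = (1 - teleport) * rank[p]
            if links[p]:                          # follow one outlink equiprobably
                for q in links[p]:
                    new[q] += share / len(links[p])
            else:                                 # dead end: remaining mass spread uniformly
                for q in pages:
                    new[q] += share / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
rank = pagerank(links)
print(max(rank, key=rank.get))  # C
```

Here C wins because it collects all of B's vote plus half of A's, while the unlinked page D keeps only its teleport share; the ranks sum to 1, as a steady-state distribution must.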
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank
Example Graph, Weighted
Pages with higher PageRanks are lighter
Gaming PageRank
• Link spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.
• Link farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.
• Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …
Current Status
• There is a continuous battle between:
  ◦ search companies, who want to get the most useful page to the user
  ◦ page writers, who want to get their page read
• All metrics get gamed
Digital Object Identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a persistent identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
Conclusions
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
The Internet
• The internet is useful as a tool
  ◦ passively, for information access
  ◦ actively, for collaboration
• The internet is interesting in and of itself
  ◦ as a source of data about existing language
  ◦ as a source of innovation in language
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Complementary Courses
• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics – (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Implicit Metadata
• You can get clues from metadata within documents
  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful
• HTML tags as constituent boundaries
• Tables as semantic relations
• File names (content type and language)
• Translations: bracketed glosses, cross-lingual disambiguation
• Query data
• Wikipedia redirections and cross-wiki links
Review and Conclusions 51
Language Identification and Normalization
Review and Conclusions 52
Language Identification
• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (under .jp or .ko)
• Hard to do for short text snippets, similar languages, mixed text
Review and Conclusions 53
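The character n-gram approach can be sketched in a few lines of Python. The training snippets and the two language labels below are invented toy data; a real identifier would be trained on much larger samples and cover many more languages:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(profile_a, profile_b):
    """Cosine similarity between two n-gram profiles."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[g] * profile_b[g] for g in shared)
    norm = lambda p: sum(v * v for v in p.values()) ** 0.5
    return dot / (norm(profile_a) * norm(profile_b) or 1)

# Toy training data: one short sample per language.
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat",
}
profiles = {lang: char_ngrams(t) for lang, t in training.items()}

def identify(snippet):
    """Guess the language whose profile best matches the snippet."""
    probe = char_ngrams(snippet)
    return max(profiles, key=lambda lang: similarity(probe, profiles[lang]))
```

Even on this toy data, `identify("over de luie hond")` picks Dutch, because the snippet shares far more trigrams (" de", "hon", "ond", …) with the Dutch profile; but, as the slide notes, very short or mixed-language snippets remain hard.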
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
Review and Conclusions 54
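Two of these steps, stop-word stripping and stemming, can be combined into a toy pipeline. The stop-word list and the suffix rules below are illustrative only, not those of any real stemmer such as Porter's:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}

def normalize(text):
    """Lowercase, tokenize, strip stop words, and crudely stem -es/-er/-s."""
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in kept:
        for suffix in ("es", "er", "s"):   # longest suffixes first
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed
```

On the slide's own example, `normalize("The producer produces")` maps both content words to the same stem, `produc`, which is exactly why stemming helps match related forms in retrieval.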
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  – provides a common data-representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
E.g., not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML: eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
Review and Conclusions 58
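Checking well-formedness is exactly what any standard XML parser does; a minimal sketch using Python's built-in xml.etree:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True iff the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Self-defined tags are fine; mismatched tags are not.
good = "<course code='HG2052'><title>Language Technology</title></course>"
bad = "<course><title>Language Technology</course>"
```

Note this only checks syntax (matching tags, quoted attributes); checking that a document follows a particular schema is a separate, stronger kind of validation.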
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework
  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything
• You can build relations into ontologies (OWL)
  – Then reason over them, search them, …
Review and Conclusions 59
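The triple model can be illustrated with a toy in-memory store. The `ex:` identifiers below are invented examples standing in for full URIs, and a real system would use an RDF library (e.g. rdflib) rather than this sketch:

```python
class TripleStore:
    """A minimal in-memory triple store; URIs are plain strings here."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern (None = wildcard)."""
        return [
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        ]

store = TripleStore()
store.add("ex:HG2052", "ex:taughtBy", "ex:FrancisBond")
store.add("ex:HG2052", "ex:topic", "ex:SemanticWeb")
```

Because every statement is just a triple, "saying anything about anything" reduces to adding more triples, and querying is pattern-matching over them.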
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text Mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group
• Some scores are:
  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total of (citations / number of authors)
  – Average (citation impact factor / number of authors)
• Problems:
  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
Review and Conclusions 64
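The first few scores can be computed directly from a citation list. The paper IDs and author names below are invented, and self-citation is crudely approximated here as "the citing and cited papers share at least one author":

```python
def citation_scores(papers, citations):
    """papers: {paper_id: set of author names};
    citations: list of (citing_id, cited_id) pairs.
    Returns raw counts, counts minus self-citations,
    and counts divided by the number of authors."""
    raw = {p: 0 for p in papers}
    non_self = {p: 0 for p in papers}
    for citing, cited in citations:
        raw[cited] += 1
        if not (papers[citing] & papers[cited]):  # no shared author
            non_self[cited] += 1
    per_author = {p: raw[p] / len(papers[p]) for p in papers}
    return raw, non_self, per_author

papers = {"p1": {"bond"}, "p2": {"bond", "smith"}, "p3": {"lee"}}
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2")]
raw, non_self, per_author = citation_scores(papers, citations)
```

Note the slide's two problems are visible even here: all citations count equally, and a brand-new paper necessarily scores zero. Weighting citations by the quality of the citing paper is exactly the move PageRank makes for web pages.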
Gaming Citations
• Least/Minimum Publishable Unit
  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web: the Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
• 10% is a parameter: the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
Review and Conclusions 70
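The random walk with teleporting can be computed directly by power iteration over the link graph. This is a minimal sketch for a tiny invented graph, not the production algorithm (which works on sparse matrices over billions of pages):

```python
def pagerank(links, teleport=0.10, iterations=100):
    """links: {page: [outgoing pages]}. Power iteration with teleporting:
    dead ends jump uniformly to any page; otherwise we teleport with
    probability `teleport` and follow a random outlink with the rest."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                          # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                  # teleport share
                    new[q] += teleport * rank[p] / n
                for q in out:                    # follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# Tiny invented graph: a <-> b, b -> c, c is a dead end.
ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": []})
```

On this graph, b ends up with the highest rank: it receives all of a's link mass while a and c each receive only half of b's. The ranks always sum to 1, since they form a probability distribution over pages.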
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …
74
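A crawler that respects nofollow can simply drop such links when building its link graph. A sketch using Python's standard html.parser, applied to an invented page snippet:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags, skipping rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").split()   # rel may hold several tokens
        if "nofollow" not in rel and "href" in attrs:
            self.links.append(attrs["href"])

page = ('<p><a href="/course">Course</a> '
        '<a rel="nofollow" href="http://spam.example.com">buy now</a></p>')
parser = LinkExtractor()
parser.feed(page)
# parser.links == ["/course"]
```

The spam link is still rendered for human readers; it just contributes no edge to the link graph, which is exactly the incentive nofollow removes for comment spammers.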
Current Status
• There is a continuous battle between:
  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a persistent identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – Passively, for information access
  – Actively, for collaboration
• The internet is interesting in and of itself
  – As a source of data about existing language
  – As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics – (Pre-req: HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Normalization
• Extracting text from various documents
• Segmenting continuous text
• Number normalization: $700K, $700,000, 0.7 million dollars, …
• Date normalization: 2000 AD, 1421 AH, Heisei 12, …
• Stripping stop words (the, a, of, …)
• Lemmatization: produces → produce
• Stemming: producer → produc, produces → produc
• Decompounding: zonnecel → zon + cel
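Stemming of the kind shown above can be sketched with a toy suffix-stripper. This is a simplified illustration, not the real Porter algorithm (a real system would use something like NLTK's PorterStemmer); the suffix list here is made up for the example.

```python
# A toy suffix-stripping stemmer in the spirit of the Porter stemmer.
# The suffix list is illustrative only, not the full rule set.
SUFFIXES = ["ations", "ation", "ers", "er", "es", "s", "ing", "ed"]

def toy_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a stem of >= 4 letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ["producer", "produces", "producing"]:
    print(w, "->", toy_stem(w))  # all stem to "produc"
```

Note that the output is a stem, not a word: "produc" is fine for indexing and counting, but unlike lemmatization it need not be a dictionary form.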
Review and Conclusions 54
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
• Web of data
  ◦ provides a common data representation framework
  ◦ makes it possible to integrate multiple sources
  ◦ so you can draw new conclusions
• Increase the utility of information by connecting it to definitions and context
• More efficient information access and analysis
  E.g., not just "color" but a concept denoted by a Web identifier:
  <http://pantone.tm.example.com/2002/std6/color>
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  ◦ is a markup language, much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ is a W3C Recommendation
• Can be validated for syntactic well-formedness
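Checking well-formedness is mechanical: every open tag must have a matching close. A minimal sketch using Python's standard-library XML parser (this checks well-formedness only, not validity against a DTD or schema):

```python
# Well-formedness check with the standard-library XML parser.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course>"))         # False: mismatched tags
```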
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers (URIs)
  ◦ Uniform Resource Name: urn:isbn:1575864606
  ◦ Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond
• Identify relations with the Resource Description Framework (RDF)
  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDF is written in well-defined XML
  ◦ You can say anything about anything
• You can build relations into ontologies (OWL)
  ◦ Then reason over them, search them, …
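The triple model above can be sketched with a minimal in-memory store. The URIs below are made-up examples (a real system would use rdflib and established vocabularies such as Dublin Core):

```python
# A minimal in-memory store of RDF-style <subject, predicate, object>
# triples. All URIs here are invented for illustration.
triples = {
    ("http://example.com/HG2052", "http://example.com/taughtBy",
     "http://example.com/FrancisBond"),
    ("http://example.com/HG2052", "http://example.com/topic",
     "http://example.com/LanguageTechnology"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Everything we know about HG2052:
for t in sorted(query(subject="http://example.com/HG2052")):
    print(t)
```

Because every statement has the same shape, stores like this can be merged from multiple sources and then queried or reasoned over uniformly, which is the point of the "web of data".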
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1. People lie
2. People are lazy
3. People are stupid
4. Mission: Impossible (know thyself)
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP
61
Citation, Reputation, and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  ◦ Content-based
    – Read it and see if it is interesting (hard for a computer)
    – Compare it to other things you have read and liked
  ◦ Context-based: citation analysis
    – See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group
• Some scores are:
  ◦ Total number of citations (pretty useful)
  ◦ Total number of citations minus self-citations
  ◦ Total of (citations / number of authors)
  ◦ Average of (citation impact factor / number of authors)
• Problems:
  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
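The first two scores above can be sketched over a toy dataset. The papers, authors, and citation pairs below are invented for the example:

```python
# Toy citation data: paper id -> author list, and (citing, cited) pairs.
papers = {"A": ["bond"], "B": ["bond", "smith"], "C": ["lee"]}
cites = [("B", "A"), ("C", "A"), ("A", "A"), ("C", "B")]

def total_citations(paper):
    """Score 1: each citing paper gets one vote."""
    return sum(1 for _, cited in cites if cited == paper)

def citations_minus_self(paper):
    """Score 2: drop citations from papers sharing an author."""
    authors = set(papers[paper])
    return sum(
        1 for citing, cited in cites
        if cited == paper and not authors & set(papers[citing])
    )

print(total_citations("A"))       # 3
print(citations_minus_self("A"))  # 1 (B shares an author; A cites itself)
```

Even this small example shows why the simple count overstates impact: two of A's three citations come from its own author.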
Review and Conclusions 64
Gaming Citations
• Least (Minimum) Publishable Unit
  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high impact factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:
<a href="http://path/to/there/page/HG803">HG803: Language Technology and the Internet</a>
For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>
• Link analysis builds on two intuitions:
1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
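Collecting anchor text of the kind described above can be sketched with the standard-library HTML parser; the URL in the sample HTML is a made-up example:

```python
# Extract (link target, anchor text) pairs, as an indexer would
# when gathering anchor text that describes the target page.
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:      # only collect text inside <a>…</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

html = 'See the <a href="http://example.com/HG803">HG803 Course Page</a>.'
parser = AnchorExtractor()
parser.feed(html)
print(parser.links)  # [('http://example.com/HG803', 'HG803 Course Page')]
```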
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote (not very accurate)
• On the web, citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web:
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
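The random walk with teleporting can be computed by power iteration: repeatedly redistribute each page's rank along its outlinks (and teleports) until the visit rates stop changing. A minimal sketch on a made-up four-page graph, with the 10% teleportation rate from the slide:

```python
# Power-iteration PageRank with teleporting. The graph is invented;
# D is a dead end, so its whole rank teleports.
TELEPORT = 0.10

def pagerank(graph, iterations=100):
    """graph: {page: [pages it links to]}. Returns steady-state visit rates."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = graph[p]
            if not out:                      # dead end: jump anywhere, prob 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport share: 0.1/N each
                    new[q] += TELEPORT * rank[p] / n
                for q in out:                # follow a random outlink: 0.9/len(out)
                    new[q] += (1 - TELEPORT) * rank[p] / len(out)
        rank = new
    return rank

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []})
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

The ranks always sum to 1 (they are a probability distribution), and D, which nothing links to, ends up with only the teleport traffic.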
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …
74
Current Status
• There is a continuous battle between:
  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
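Because the mapping from DOI to current location lives in the registry, resolving a DOI is just a matter of handing it to the central resolver; the persistent name stays fixed even if the publisher's URL changes. A sketch, using the DOI from the slide:

```python
# A DOI is resolved through the doi.org proxy: the resolver URL is
# simply the proxy prefix plus the DOI name. Fetching it (not done
# here) would redirect to the publisher's current location.
doi = "10.1007/s10579-008-9062-z"
resolver_url = "https://doi.org/" + doi
print(resolver_url)  # https://doi.org/10.1007/s10579-008-9062-z
```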
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press.
• Sproat, R. (2010) Language Technology and Society. Oxford University Press.
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
The Semantic Web
Review and Conclusions 55
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Goals of the Semantic Web
atilde Web of data
acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions
atilde Increase the utility of information by connecting it to definitions and context
atilde More efficient information access and analysis
EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt
Review and Conclusions 56
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  – DOI 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  – Passively, for information access
  – Actively, for collaboration
• The internet is interesting in and of itself
  – As a source of data about existing language
  – As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press.
• Sproat, R. (2010) Language Technology and Society. Oxford University Press.
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition.
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Semantic Web Architecture
Review and Conclusions 57
XML eXtensible Markup Language
• XML is a set of rules for encoding documents electronically
  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation
• Can be validated for syntactic well-formedness
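A quick sketch of these points with Python's standard library: the tags below are self-defined (there is nothing special about `<note>` or `<to>`), and well-formedness can be checked mechanically:

```python
import xml.etree.ElementTree as ET

# A tiny document with self-defined tags carrying data.
doc = "<note><to>Students</to><body>Revise lecture 12</body></note>"
root = ET.fromstring(doc)
print(root.find("to").text)

# Well-formedness is checkable: an unclosed tag raises ParseError.
try:
    ET.fromstring("<note><to>Students</note>")
except ET.ParseError:
    print("not well-formed")
```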
Review and Conclusions 58
Semantic Web Architecture (details)
• Identify things with Uniform Resource Identifiers
  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
• Identify relations with the Resource Description Framework (RDF)
  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything
• You can build relations into ontologies (OWL)
  – Then reason over them, search them, …
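A minimal sketch of triples and pattern matching, using short `ex:` names in place of full URIs; all identifiers here are invented for illustration:

```python
# RDF-style triples: <subject, predicate, object>.
triples = [
    ("ex:bond",   "ex:teaches",  "ex:HG2052"),
    ("ex:HG2052", "ex:title",    "Language Technology and the Internet"),
    ("ex:bond",   "ex:memberOf", "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(query(s="ex:bond"))  # everything said about ex:bond
```

Real triple stores add reasoning over ontologies on top of exactly this kind of pattern query.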
Review and Conclusions 59
Criticism of the Semantic Web
Doctorow's seven insurmountable obstacles to reliable metadata are:
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible: know thyself
5 Schemas aren't neutral
6 Metrics influence results
7 There's more than one way to describe something
http://www.well.com/~doctorow/metacrap.htm 60
Semantic Web and NLP
• The Semantic Web is about structuring data
• Text mining is about unstructured data
• There is much more unstructured than structured data
  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP
61
Citation, Reputation and PageRank
Review and Conclusions 62
Citation Networks
• How can we tell what is a good scientific paper?
  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group
• Some scores are:
  – Total Number of Citations (pretty useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)
• Problems:
  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones
• Weight citations by the quality of the citing paper
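The first scores above can be sketched directly; the paper, its author set, and the citing-author sets below are invented for illustration:

```python
# One paper's authors, and the author sets of the papers that cite it.
paper_authors = {"bond", "wang"}
citing_author_sets = [
    {"smith"},           # independent citation
    {"bond", "lee"},     # self-citation: shares an author with the paper
    {"jones", "kim"},
]

total = len(citing_author_sets)                                   # total citations
non_self = sum(1 for c in citing_author_sets if not c & paper_authors)
per_author = total / len(paper_authors)                           # citations / number of authors

print(total, non_self, per_author)
```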
Review and Conclusions 64
Gaming Citations
• Least/Minimum Publishable Unit
  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information
• Self-citation, in-group citation
• Write-only proceedings (some journals are not often read)
• Submitting only to high impact factor journals
You improve what gets measured, not necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC: Strongly Connected Core — can travel from any page to any page
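The core can be sketched as mutual reachability on a toy link graph; the four pages and their links here are invented for illustration:

```python
from collections import deque

# A: ["B"] means page A links to page B.  A, B, C link to one another;
# D only receives a link, so it sits outside the strongly connected core.
links = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": []}

def reachable(graph, start):
    """Breadth-first search: all pages you can travel to from start."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Pages mutually reachable with A form A's strongly connected component.
scc = {n for n in links if "A" in reachable(links, n) and n in reachable(links, "A")}
print(scc)
```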
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet, see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:
  1 The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2 The (extended) anchor text pointing to page B is a good description of page B
    (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
PageRank as Citation analysis
• Citation frequency can be used to measure the impact of an article
  – Simplest measure: each article gets one vote – not very accurate
• On the web, citation frequency = inlink count
  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web
  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.1)/4 = 0.225
• 10% is a parameter, the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
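A minimal sketch of the random walk as power iteration, using the 10% teleportation rate and the dead-end rule from this slide; the four-page graph is invented for illustration:

```python
# A: ["B", "C"] means page A links out to pages B and C; D is a dead end.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
pages = list(links)
N = len(pages)
t = 0.10  # teleportation rate

rank = {p: 1.0 / N for p in pages}          # start from a uniform distribution
for _ in range(100):                        # iterate towards the steady state
    new = {p: 0.0 for p in pages}
    for p in pages:
        if links[p]:
            for q in pages:                 # teleport with probability t
                new[q] += rank[p] * t / N
            for q in links[p]:              # otherwise follow a random outlink
                new[q] += rank[p] * (1 - t) / len(links[p])
        else:                               # dead end: jump to a random page, 1/N each
            for q in pages:
                new[q] += rank[p] / N
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

The ranks form a probability distribution (they sum to one), and page D, which nothing links to, keeps only the teleportation mass.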
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
XML eXtensible Markup Language
atilde XML is a set of rules for encoding documents electronically
acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation
atilde Can be validated for syntactic well-formedness
Review and Conclusions 58
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Semantic Web Architecture (details)
atilde Identify things with Uniform Resource Identifiers
acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond
atilde Identify relations with Resource Description Framework
acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything
atilde You can build relations in ontologies (OWL)
acirc Then reason over them search them hellip
Review and Conclusions 59
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Criticism of the Semantic Web
Doctorowrsquos seven insurmountable obstacles to reliable metadata are
1 People lie
2 People are lazy
3 People are stupid
4 Mission Impossible know thyself
5 Schemas arenrsquot neutral
6 Metrics influence results
7 Therersquos more than one way to describe something
httpwwwwellcom~doctorowmetacraphtm 60
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:
  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A
  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
Review and Conclusions 67
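Collecting (URL, anchor text) pairs is the raw material for this kind of link analysis. A minimal sketch with Python's standard html.parser; the page snippet and URL are made up:

```python
from html.parser import HTMLParser

# Collect (href, anchor text) pairs from a page, as anchor-text analysis needs.
class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []    # (href, anchor text) pairs seen so far
        self._href = None  # href of the <a> we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # only collect text inside an <a>…</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorCollector()
parser.feed('See the <a href="http://example.org/hg803">HG803 Course Page</a>.')
print(parser.links)  # [('http://example.org/hg803', 'HG803 Course Page')]
```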
PageRank as Citation Analysis
• Citation frequency can be used to measure the impact of an article
  ◦ Simplest measure: each article gets one vote – not very accurate
• On the web: citation frequency = inlink count
  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam
• Better measure: weighted citation frequency, or citation rank
  ◦ An article’s vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as a Random Walk
• Imagine a web surfer doing a random walk on the web:
  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
• This long-term visit rate is the page’s PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter: the teleportation rate
• Note: “jumping” from a dead end is independent of the teleportation rate
Review and Conclusions 70
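The random walk with teleportation can be computed by power iteration; a minimal sketch on an invented four-page graph, using the 10% teleportation rate and the dead-end rule above:

```python
# Power-iteration sketch of PageRank with a 10% teleportation rate.
# The four-page graph is invented for illustration; D is a dead end.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
pages = sorted(links)
N = len(pages)
t = 0.10  # teleportation rate

rank = {p: 1.0 / N for p in pages}      # start from the uniform distribution
for _ in range(100):                    # iterate toward the steady state
    new = dict.fromkeys(pages, 0.0)
    for p in pages:
        out = links[p]
        if not out:                     # dead end: jump anywhere (rate-independent)
            for q in pages:
                new[q] += rank[p] / N
        else:
            for q in pages:             # teleport with probability t
                new[q] += t * rank[p] / N
            for q in out:               # otherwise follow a random outlink
                new[q] += (1 - t) * rank[p] / len(out)
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

The ranks always sum to 1 (they are a probability distribution), and D, which receives only teleportation mass, ends up with the lowest rank.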
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph: Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as “mutual admiration societies”
• Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.
• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index
  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …
74
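For example, a blog engine might emit user-submitted comment links marked like this (the URL and text are invented), so the link passes no ranking credit to its target:

```html
<!-- a comment link that should not influence the target's ranking -->
<p>Great post! Visit <a href="http://example.org/spam" rel="nofollow">my site</a>!</p>
```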
Current Status
• There is a continuous battle between:
  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital Object Identifier
• DOI: a string used to uniquely identify an electronic document or object
  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  ◦ DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
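Because DOI names resolve through the doi.org proxy, turning one into a persistent URL is trivial; a one-line sketch:

```python
# Any DOI name resolves through the doi.org proxy, giving a persistent URL.
def doi_url(doi):
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```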
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital Text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
  ◦ Passively, for information access
  ◦ Actively, for collaboration
• The internet is interesting in and of itself
  ◦ As a source of data about existing language
  ◦ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics: an introduction to the fast-growing field of corpus linguistics (pre-requisite HG2051, waived)
Review and Conclusions 81
Semantic Web and NLP
atilde The Semantic Web is about structuring data
atilde Text Mining is about unstructured data
atilde There is much more unstructured than structured data
acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP
61
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Citation Reputation andPageRank
Review and Conclusions 62
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Citation Networks
atilde How can we tell what is a good scientific paper
acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked
acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite
Review and Conclusions 63
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Reputation and Citation Analysis
atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group
atilde Some scores are
acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)
atilde Problems
acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones
atilde Weight Citations by Quality of the paper
Review and Conclusions 64
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
• Imagine a web surfer doing a random walk on the web:
– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably
• In the steady state each page has a long-term visit rate: the proportion of the time someone will be there
• This long-term visit rate is the page's PageRank
• PageRank = long-term visit rate = steady-state probability
Review and Conclusions 69
Teleporting – to get us out of dead ends
• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
• With the remaining probability (90%), go out on a random hyperlink
– For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225
• 10% is a parameter: the teleportation rate
• Note: "jumping" from a dead end is independent of the teleportation rate
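The random surfer with teleportation is easy to simulate. The sketch below (plain Python; the three-page graph is made up for illustration) uses the 10% teleportation rate from this slide, always teleports from a dead end, and estimates PageRank as the observed long-term visit rate.

```python
import random

def pagerank(graph, steps=200_000, teleport=0.10, seed=0):
    """Estimate PageRank by simulating the random surfer.

    graph: dict mapping each page to its list of outgoing links.
    With probability `teleport` (or always, at a dead end) the surfer
    jumps to a uniformly random page; otherwise it follows a random
    outlink of the current page.
    """
    rng = random.Random(seed)
    pages = list(graph)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        links = graph[page]
        if not links or rng.random() < teleport:
            page = rng.choice(pages)    # teleport (always from a dead end)
        else:
            page = rng.choice(links)    # follow a random hyperlink
    return {p: visits[p] / steps for p in pages}

# A is a dead end; B and C link to each other and to A.
ranks = pagerank({"A": [], "B": ["A", "C"], "C": ["A", "B"]})
```

With this graph, page A (a dead end that both other pages link to) ends up with the highest visit rate, roughly 0.42, while B and C share the rest by symmetry.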
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
http://www.ams.org/samplings/feature-column/fcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs
• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies
• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission
73
• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links
The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index
– Google does not index the target of a link marked nofollow
– Yahoo! does not include the link in its ranking
– …
74
Current Status
• There is a continuous battle between:
– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read
• All metrics get gamed
Review and Conclusions 75
Digital object identifier
• DOI: a string used to uniquely identify an electronic document or object
– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a persistent identifier (like an ISBN)
• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
– DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
• Speech (language itself)
• Writing (invented 3–5 times)
• Printing (made writing common)
• Digital text (made writing transferable)
• Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
• The internet is useful as a tool
– Passively, for information access
– Actively, for collaboration
• The internet is interesting in and of itself
– As a source of data about existing language
– As a source of innovation in language
Review and Conclusions 79
Recommended Texts
• Wikipedia
• Crystal, D. (2001) Language and the Internet. Cambridge University Press
• Sproat, R. (2010) Language Technology and Society. Oxford University Press
• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
• Manning, C. D., Raghavan, P., and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics
• HG3051 Corpus Linguistics — (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81
Gaming Citations
atilde LeastMinimum Publishable Unit
acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information
atilde Self citation in-group citation
atilde Write only proceedings (some journals are not often read)
atilde Submitting only to High Impact factor journals
You improve what gets measurednot necessarily what you want to improve
Review and Conclusions 65
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Characteristics of the Web Bow Tie
SCC Strongly Connected Core mdash can travel from any page to any page
Review and Conclusions 66
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
Anchor Text
atilde Recall how hyperlinks are written
lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt
For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt
atilde Link analysis builds on two intuitions
1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A
2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice
Review and Conclusions 67
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
PageRank as Citation analysis
atilde Citation frequency can be used to measure the impact of an article
acirc Simplest measure Each article gets one vote ndash not very accurate
atilde On the web citation frequency = inlink count
acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam
atilde Better measure weighted citation frequency or citation rank
acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated
Review and Conclusions 68
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks
The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index
acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip
74
Current Status
atilde There is a continuous battle between
acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read
atilde All metrics get gamed
Review and Conclusions 75
Digital object identifier
atilde DOI a string used to uniquely identify an electronic document or object
acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)
atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation
atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations
acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
atilde Speech (Language itself)
atilde Writing (invented 3-5 times)
atilde Printing (made writing common)
atilde Digital Text (made writing transferable)
atilde Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
atilde The internet is useful as a tool
acirc Passively for information accessacirc Actively for collaboration
atilde The internet is interesting in and of itself
acirc As a source of data about existing languageacirc As a source of innovation in language
Review and Conclusions 79
Recommended Texts
atilde Wikipedia
atilde Crystal D (2001) Language and the Internet Cambridge University Press
atilde Sproat R (2010) Language Technology and Society Oxford University Press
atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition
atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press
atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press
Review and Conclusions 80
Complementary Courses
atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics
atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics
Review and Conclusions 81
PageRank as Random walk
atilde Imagine a web surfer doing a random walk on the web
acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page
equiprobably
atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there
atilde This long-term visit rate is the pagersquos PageRank
atilde PageRank = long-term visit rate = steady state robability
Review and Conclusions 69
Teleporting ndash to get us out of dead ends
atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)
atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)
atilde With remaining probability (90) go out on a random hyperlink
acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225
atilde 10 is a parameter the teleportation rate
atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate
Review and Conclusions 70
Example Graph
Each inbound link is a positive vote
httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71
Example Graph Weighted
Pages with higher PageRanks are lighter
Review and Conclusions 72
Gaming PageRank
atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs
atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies
atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission
73
➤ Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links
➤ The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index
➢ Google does not index the target of a link marked nofollow
➢ Yahoo does not include the link in its ranking
➢ …
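To make the rel attribute concrete: a minimal sketch using Python’s stdlib `html.parser` to separate nofollow links from ordinary ones, the way a crawler might. The `LinkFilter` class and the sample HTML fragment are illustrative, not a real crawler:

```python
from html.parser import HTMLParser

class LinkFilter(HTMLParser):
    """Collect hrefs, splitting them by presence of rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        # rel can hold several space-separated values, e.g. "external nofollow"
        rel = (attrs.get("rel") or "").split()
        (self.nofollowed if "nofollow" in rel else self.followed).append(href)

html = ('<a href="http://example.com/a">ok</a>'
        '<a rel="nofollow" href="http://spam.example/b">spam</a>')
p = LinkFilter()
p.feed(html)
# p.followed   == ["http://example.com/a"]
# p.nofollowed == ["http://spam.example/b"]
```

A link-based ranker would then count only the links in `followed` as votes, which is why comment spam gains nothing on sites that mark user-contributed links nofollow.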
74
Current Status
➤ There is a continuous battle between
➢ Search companies, who want to get the most useful page to the user
➢ Page writers, who want to get their page read
➤ All metrics get gamed
Review and Conclusions 75
Digital object identifier
➤ DOI: a string used to uniquely identify an electronic document or object
➢ Metadata about the object is stored with the DOI name
➢ The metadata includes a location, such as a URL
➢ The DOI for a document is permanent; the metadata may change
➢ Gives a Persistent Identifier (like an ISBN)
➤ The DOI system is implemented through a federation of registration agencies, coordinated by the International DOI Foundation
➤ By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
➢ DOI 10.1007/s10579-008-9062-z
http://www.springerlink.com/content/v7q114033401th5u
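Because the DOI is permanent while the metadata (including the location) may change, the usual way to follow one is through the doi.org resolver rather than any publisher URL. A minimal sketch; `doi_to_url` is an illustrative helper, and no network call is made here:

```python
def doi_to_url(doi):
    """Turn a DOI name into a resolvable address via the doi.org resolver."""
    return "https://doi.org/" + doi

print(doi_to_url("10.1007/s10579-008-9062-z"))
```

Requesting that address redirects to wherever the registered metadata currently points, so citations built this way keep working even if the publisher moves the document.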
Review and Conclusions 76
Conclusions
Review and Conclusions 77
Revolutions in Language Technology
➤ Speech (Language itself)
➤ Writing (invented 3–5 times)
➤ Printing (made writing common)
➤ Digital Text (made writing transferable)
➤ Hyperlinking (taking writing beyond language)
Review and Conclusions 78
The Internet
➤ The internet is useful as a tool
➢ Passively, for information access
➢ Actively, for collaboration
➤ The internet is interesting in and of itself
➢ As a source of data about existing language
➢ As a source of innovation in language
Review and Conclusions 79
Recommended Texts
➤ Wikipedia
➤ Crystal, D. (2001) Language and the Internet. Cambridge University Press
➤ Sproat, R. (2010) Language Technology and Society. Oxford University Press
➤ Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition
➤ Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press
➤ Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press
Review and Conclusions 80
Complementary Courses
➤ HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics
➤ HG3051 Corpus Linguistics – (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics
Review and Conclusions 81