HG2052 Language, Technology and the Internet
Review and Conclusions

Francis Bond
Division of Linguistics and Multilingual Studies
http://www3.ntu.edu.sg/home/fcbond/
[email protected]

Lecture 12
HG2052 (2020)



Introduction

Review and Conclusions 1

What have we learned?

• How technology affects our use of language

• How language is used on the internet

– Some interesting things we can now do that we couldn't before

• Collaboration on the Web

• The web as a source of linguistic data: Direct Query and Sample

• The Semantic Web: meaning for non-humans

• Citation and reputation

Review and Conclusions 2

Reflection

• What was the most surprising thing in this class?

• What do you think is most likely wrong?

• What do you think is the coolest result?

• What do you think you're most likely to remember?

• How do you think this course will influence you as a linguist?

• What (if anything) did you hope to learn that you didn't?

Review and Conclusions 3

Goals

• Gain an understanding of how technology affects language use

• Develop familiarity with markup and meta information in texts

• Get a feel for what research is all about, especially relating to web mining and online frequency counting

Upon successful completion students will:

• have an understanding of how technology shapes language use

• be able to test linguistic hypotheses against web data

• know how to edit Wikipedia

Review and Conclusions 4

Themes

• Language and Technology

– Writing and Speech Technology

• Language and the Internet

– Email, Chat, Virtual Worlds, WWW, IM, Blogs, Facebook, Wikis, Twitter

• The Web as Corpus

• The Web beyond Language

– Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

• The two things that separate humans from animals (Sproat 2010, §1.1)

– Language
∗ large vocabulary (10,000+)
∗ complicated syntax (no upper limit on length: recursion, embedding)

– Technology
∗ Widespread tool use
∗ Widespread tool manufacture

• Speech — the start of language

• Writing — the first great intersection

Review and Conclusions 7

Language and the Internet

• New forms of communication

– Neither speech nor text
– Massively interactive

• Extremely rapid change

• A first-hand narrative (I was online before the internet :-)

– but I am probably behind you all now

Review and Conclusions 8

New forms

• Email (from PC, phone, other)

• Chat, Usenet

• Virtual Worlds

• WWW

• Blogs (overlap)

• Facebook, LinkedIn

• Wikis

• Twitter

Review and Conclusions 9

Representing Language

• Writing Systems

• Encodings

• Speech

• Bandwidth

Review and Conclusions 10

What is represented?

• Phonemes: maɪ dɒg laɪks ævəkɑdoz (45)

– Not a simple correspondence between a writing system and the sounds
– Some logographs (3, $)
– Even when a new alphabet is designed, pronunciation changes

• Syllables: (maɪ) (dɒg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

• Morphemes: maɪ ⟨me + 's⟩, dɒg, laɪk + s, ævəkɑdo + z (100,000+)

• Words: my dog likes avocados (200,000+)

• Concepts: speaker poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

• Alphabetic (Latin)

– one symbol for each consonant or vowel
– Typically 20-30 base symbols (1 byte)

• Syllabic (Hiragana)

– one symbol for each syllable (consonant+vowel)
– Typically 50-100 base symbols (1-2 bytes)

• Logographic (Hanzi)

– pictographs, ideographs, sound-meaning combinations
– Typically 100,000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

• Need to map characters to bits

• More characters require more space

– English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)

• Moving towards Unicode for everything

• If you get the encoding wrong, it is gibberish

– Web pages should state their encoding
– Sometimes they are wrong
– Encoding detection usually involves statistical analysis of byte patterns
∗ But an encoding can be shared by many languages
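A minimal sketch of the detection idea in Python (the candidate list and its ordering are illustrative; real detectors such as the chardet library score statistical models of byte patterns rather than just trying decodings):

```python
# Toy encoding detector: try candidate encodings in order, from most
# restrictive to least, and return the first that decodes cleanly.
# Note: ISO-8859-1 decodes *any* byte sequence, so it acts as a
# catch-all and must come last -- one reason real detectors need
# byte-pattern statistics, not just trial decoding.
CANDIDATES = ["ascii", "utf-8", "iso-8859-1"]

def guess_encoding(data: bytes) -> str:
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

print(guess_encoding("café".encode("utf-8")))       # utf-8
print(guess_encoding("café".encode("iso-8859-1")))  # iso-8859-1
```

Note that the same byte string can be a valid decoding under several encodings, which is the slide's point: the statistics can tell you the encoding, but many languages can share one encoding.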

Review and Conclusions 13

Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming Speech

– Automatic Speech Recognition (ASR): sounds to text
– Text-to-Speech Synthesis (TTS): texts to sounds

• Speech technology — the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab    WER (%)   WER (%) adapted
Digits                       11      0.4       0.2
Dialogue (travel)        21,000     10.9       —
Dictation (WSJ)           5,000      3.9       3.0
Dictation (WSJ)          20,000     10.0       8.6
Dialogue (noisy, army)    3,000     42.2      31.0
Phone Conversations       4,000     41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

• Pronunciation depends on context

– The same word will be pronounced differently in different sentences

• Speaker variability

– Gender
– Dialect / Foreign Accent
– Individual Differences: physical differences, language differences (idiolect)

• Many, many rare events

– 300 out of 2000 diphones in the core set for the AT&T NextGen system occur
only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

• Sentence Segmentation
• Abbreviations: Dr Smith lives on Nanyang Dr He is …
• Word Segmentation

– 森前日銀総裁 Moriyama zen Nichigin Sousai
⊗ 森前日銀総裁 Moriyama zennichi gin Sousai

2. Speech Synthesis

• Find the pronunciation
• Generate sounds
• Add intonation

Review and Conclusions 17

Speech Synthesis

• Articulatory Synthesis: Attempt to model human articulation

• Formant Synthesis: Bypass modeling of human articulation and model acoustics directly

• Concatenative Synthesis: Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: Fast, very high pitch, loud

• Hot anger: Fast, high pitch, strong falling accent, loud

• Fear: Jitter

• Sarcasm: Prolonged accent, late peak

• Sad: Slow, low pitch

The main determinant of “naturalness” in speech synthesis is not “voice quality”
but natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
Usenet, Bulletin Boards, Gopher, Archie, …

• Large scale discourse studies still to be done

• Some genuinely new things

– time-lagged multi-person conversation
– raw un-edited text
– extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

– <rant>I can't stand this</rant>
– I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
– lusers (users as seen by Systems Administrators)
– suits

• Need to be in-group

cow orker: Coworker
clue: “You couldn't get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dance”
Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

– Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

– GREAT, *great*, _great_
– :-) (^_^) (o)
– brb, RTFM, IMHO
– lower case
– lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

– every time a file is saved, a new copy is stored

• CVS, Subversion, Git

– changes to a collection of files are tracked
– simultaneous changes are merged

• Revision Tracking

– Revisions are stored within a file

• Authorship in shared writing: Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every
single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge; people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

• Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia: Five pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia: Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

• The structure of Markup: Visual vs Logical
WYSIWYG, WYSIAYG, WYSIWYM

• The structure of the Web — hypertext: pages linked to pages

• The future of the Web

• Linguistic features of the web:
un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
or accessed only through a form

Unlinked content: pages which are not linked to by other pages, so crawlers cannot
find them by following links

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g.
ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g.
using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript,
as well as content dynamically downloaded from Web servers via Flash or Ajax
solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video)
files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

– What you see is what you get (WYSIWYG)
– Equivalent of printers' markup
– Shows what things look like

• Logical Markup (Structural)

– Shows the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: Search Engine as Query tool and WWW as corpus
(Objection: Results are not reliable)

– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow conclusions to be drawn about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

• Web Sample: Use a search engine to download data from the net and build a corpus
from it

– known size and exact hit counts → statistics possible
– people can draw conclusions over the included text types
– (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, Scripts)

• Document counts are shown to correlate directly with “real” frequencies (Keller
2003), so search engines can help — but:

– lots of repetitions of the same text (not representative)
– very limited query precision (no upper/lower case, no punctuation)
– only estimated counts, often hard to reproduce exactly
– different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or “google hit” is the most common unit used to count web snippets (in the
early 2000s)

– it is document frequency, not term frequency

• whit or “web hit” is the more general term

• Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than):
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for “Ghits per billion documents” is good if you want a more stable number
(suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
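The ratio comparison takes only a couple of lines; the counts below are the ones quoted on the slide (as accessed in 2012), not fresh measurements:

```python
# Compare two competing forms by web hit counts and report a unitless
# ratio, since raw ghit counts are unstable over time.
def hit_ratio(hits_a: int, hits_b: int) -> float:
    return hits_a / hits_b

# "different from" vs "different than" (counts accessed 2012-04-04)
r = hit_ratio(251_000_000, 71_500_000)
print(f"{r:.1f}:1")  # 3.5:1
```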

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora
(Kilgarriff 2006)

– gather documents for different topics (balance)
– exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
– exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
– do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)
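The first two steps can be sketched as follows (the seed words and the pairwise combination scheme are illustrative toys; Sharoff used much larger seed lists and search-engine APIs):

```python
# Sketch of steps 1-2: combine seed words into multi-word queries.
# A few hundred seeds combined in tuples yields thousands of queries.
from itertools import combinations

seeds = ["language", "internet", "corpus", "speech", "writing"]  # step 1 (toy list)

queries = [" ".join(pair) for pair in combinations(seeds, 2)]    # step 2
print(len(queries))  # 10 pairs from 5 seeds; C(500, 2) would give 124,750
```

Each query would then be sent to a search engine (step 3) and the returned URLs downloaded and cleaned (steps 4-5).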

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working papers on the Web as
Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

– Keywords and Categories
– Rankings
– Structural Markup

• Implicit Meta-data

– Links and Citations
– Tags
– Tables
– File Names
– Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

– When they are accurate, they are very good
– They are often deceitful

• HTML, PDF, Word, … Metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

– as they are unintended, they tend to be noisy
– but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations — Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

– Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

– Similarity-based categorisation and classification
∗ Character n-gram rankings

– Machine Learning based methods
– Context (under jp or ko)

• Hard to do for short text snippets, similar languages, mixed text
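The character-n-gram approach can be sketched with a toy identifier; the two training samples below are tiny invented strings, whereas real systems learn profiles from large corpora and cover many languages:

```python
# Toy character-bigram language identifier.
from collections import Counter

def profile(text: str, n: int = 2) -> Counter:
    """Character n-gram frequency profile of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

SAMPLES = {  # toy training data, one short sample per language
    "en": "the quick brown fox jumps over the lazy dog the end",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}
PROFILES = {lang: profile(t) for lang, t in SAMPLES.items()}

def identify(snippet: str) -> str:
    """Pick the language whose profile overlaps most with the snippet's."""
    p = profile(snippet)
    def overlap(lang: str) -> int:
        return sum(min(c, PROFILES[lang][g]) for g, c in p.items())
    return max(PROFILES, key=overlap)

print(identify("the dog"))   # en
print(identify("der hund"))  # de
```

The slide's caveat shows up immediately: with snippets this short, closely related languages share many bigrams and the decision becomes unreliable.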

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
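A few of these steps can be sketched together; the stop list and suffix rules below are illustrative toys, not a real stemmer such as Porter's:

```python
# Toy normalization pipeline: tokenize, strip stop words, crude stemming.
import re

STOP = {"the", "a", "of"}

def normalize(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    tokens = [t for t in tokens if t not in STOP]
    stemmed = []
    for t in tokens:
        # crude stemming: strip a common suffix (cf. producer/produces -> produc)
        for suf in ("es", "er", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 3:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(normalize("The producer produces avocados"))  # ['produc', 'produc', 'avocado']
```

Note how stemming conflates "producer" and "produces" into one index term, which is exactly the trade-off: better recall for search, at the cost of losing the distinction.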

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

– provides a common data representation framework
– makes possible integrating multiple sources
– so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

– Universal Resource Name: urn:isbn:1575864606
– Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything

• You can build relations in ontologies (OWL)

– Then reason over them, search them, …
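A minimal sketch of the triple model, with triples as Python tuples; the `ex:` prefix and the facts are invented for illustration, and a real application would use a library such as rdflib:

```python
# RDF-style triples as <subject, predicate, object> tuples.
triples = [
    ("ex:fcbond", "ex:teaches", "ex:HG2052"),
    ("ex:HG2052", "ex:title", "Language, Technology and the Internet"),
    ("ex:HG2052", "ex:offeredBy", "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(query(s="ex:HG2052"))  # everything asserted about the course
```

Pattern matching over triples like this is the core of SPARQL queries; ontologies add inference rules on top of the raw triples.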

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible — know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the
published work of a scientist, scholar or research group

• Some scores are:

– Total Number of Citations (Pretty Useful)
– Total Number of Citations minus Self-citations
– Total Number of (Citations / Number of Authors)
– Average (Citation Impact Factor / Number of Authors)

• Problems

– Not all citations are equal: citations by ‘good’ papers are better
– Newer publications suffer in relation to older ones

• Weight Citations by Quality of the paper
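The first three scores can be computed directly; the paper records below are invented for illustration:

```python
# Toy citation scores for one author's papers.
papers = [
    {"cites": 40, "authors": 2, "self_cites": 5},
    {"cites": 10, "authors": 4, "self_cites": 1},
]

total = sum(p["cites"] for p in papers)
total_minus_self = sum(p["cites"] - p["self_cites"] for p in papers)
per_author = sum(p["cites"] / p["authors"] for p in papers)  # split credit

print(total, total_minus_self, per_author)  # 50 44 22.5
```

Even on this toy data the scores disagree about how much credit the author gets, which is the slide's point: the choice of metric matters, and each one can be gamed.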

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — you can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator
of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to
a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

– Simplest measure: Each article gets one vote — not very accurate

• On the web, citation frequency = inlink count

– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

– An article's vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

– Start at a random page
– At each step, go out of the current page along one of the links on that page,
equiprobably

• In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting — to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with
a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

– For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: “jumping” from a dead end is independent of the teleportation rate
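The walk above can be computed by power iteration; the four-page link graph below is invented for illustration, with the 10% teleportation rate from the slide:

```python
# PageRank by power iteration with teleportation (toy 4-page web).
teleport = 0.10

links = {  # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
N = len(pages)

rank = {p: 1 / N for p in pages}          # start uniformly
for _ in range(100):
    new = {p: teleport / N for p in pages}  # teleportation mass
    for p, outs in links.items():
        if outs:  # spread (1 - teleport) of p's rank over its outlinks
            for q in outs:
                new[q] += (1 - teleport) * rank[p] / len(outs)
        else:     # dead end: jump to a random page (independent of teleport)
            for q in pages:
                new[q] += rank[p] / N
    rank = new

print(max(rank, key=rank.get))  # C: it collects the most incoming weight
```

The ranks converge to the steady-state visit rates; here C wins because A, B and D all send weight to it, while D, with no inlinks at all, keeps only the teleportation minimum.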

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes
advantage of link-based ranking algorithms, which give websites higher rankings the
more other highly ranked websites link to them. Examples include adding links within
blogs

• Link Farms: creating tightly-knit communities of pages referencing each other, also
known humorously as mutual admiration societies

• Scraper Sites: “scrape” search-engine results pages or other sources of content and
create “content” for a website. The specific presentation of content on these sites is
unique, but is merely an amalgamation of content taken from other sources, often
without permission

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing,
such as wikis, blogs and guestbooks. Agents can be written that automatically and
randomly select a user-edited web page, such as a Wikipedia article, and add spamming
links

The nofollow link: a value that can be assigned to the rel attribute of an HTML
hyperlink to instruct some search engines that the hyperlink should not influence the
link target's ranking in the search engine's index

– Google does not index the target of a link marked nofollow
– Yahoo does not include the link in its ranking
– …

74

Current Status

• There is a continuous battle between

– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object

– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

– DOI 10.1007/s10579-008-9062-z
→ http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

– Passively, for information access
– Actively, for collaboration

• The internet is interesting in and of itself

– As a source of data about existing language
– As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python;
introduces both programming and linguistics

• HG3051 Corpus Linguistics — (pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 2: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Introduction

Review and Conclusions 1

What have we learned

atilde How technology affects our use of language

atilde How language is used on the internet

acirc Some interesting things we can now do that we couldnrsquot before

atilde Collaboration on the Web

atilde The web as a source of linguistic data Direct Query and Sample

atilde The Semantic Web meaning for non-humans

atilde Citation and reputation

Review and Conclusions 2

Reflection

atilde What was the most surprising thing in this class

atilde What do you think is most likely wrong

atilde What do you think is the coolest result

atilde What do you think yoursquore most likely to remember

atilde How do you think this course will influence you as a linguist

atilde What (if anything) did you hope to learn that you didnrsquot

Review and Conclusions 3

Goals

atilde Gain an understanding of how technology affects language use

atilde Develop familiarity with markup and meta information in texts

atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting

Upon successful completion students will

atilde have an understanding of how technology shapes language use

atilde be able to test linguistic hypotheses against web data

atilde know how to edit Wikipedia

Review and Conclusions 4

Themes

atilde Language and Technology

acirc Writing and Speech Technology

atilde Language and the Internet

acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter

atilde The Web as Corpus

atilde The Web beyond Language

acirc Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

atilde The two things that separate humans from animals (Sproat 2010 sect11)

acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)

acirc Technologylowast Widespread tool uselowast Widespread tool manufacture

atilde Speech mdash the start of language

atilde Writing mdash the first great intersection

Review and Conclusions 7

Language and the Internet

atilde New forms of communication

acirc Neither speech nor textacirc Massively interactive

atilde Extremely rapid change

atilde A first hand narrative (I was online before the internet -)

acirc but I am probably behind you all now

Review and Conclusions 8

New forms

atilde Email (from PC phone other)

atilde Chat Usenet

atilde Virtual Worlds

atilde WWW

atilde Blogs (overlap)

atilde Facebook LinkedIn

atilde Wikis

atilde Twitter

Review and Conclusions 9

Representing Languageatilde Writing Systems

atilde Encodings

atilde Speech

atilde Bandwidth

Review and Conclusions 10

What is represented

• Phonemes: maɪ dɒɡ laɪks ævəkɑdoʊz (45)

  – Not a simple correspondence between a writing system and the sounds
  – Some logographs (3, $)
  – Even when a new alphabet is designed, pronunciation changes

• Syllables: (maɪ) (dɒɡ) (laɪks) (æ)(və)(kɑ)(doʊz) (10,000+)

• Morphemes: maɪ = me + 's, dɒɡ, laɪk + s, ævəkɑdo + z (100,000+)

• Words: my dog likes avocados (200,000+)

• Concepts: speaker, poss, canine, fond, avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

• Alphabetic (Latin)

  – one symbol for each consonant or vowel
  – typically 20–30 base symbols (1 byte)

• Syllabic (Hiragana)

  – one symbol for each syllable (consonant+vowel)
  – typically 50–100 base symbols (1–2 bytes)

• Logographic (Hanzi)

  – pictographs, ideographs, sound-meaning combinations
  – typically 100,000+ symbols (2–3 bytes)

Review and Conclusions 12

Computational Encoding

• Need to map characters to bits

• More characters require more space

  – English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8–32 bits)

• Moving towards Unicode for everything

• If you get the encoding wrong, it is gibberish

  – Web pages should state their encoding
  – Sometimes they are wrong
  – Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
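The two points above, wrong-encoding gibberish and byte-pattern detection, can be sketched in a few lines of Python. The detection heuristic here is a deliberately minimal stand-in for real detectors (which compare byte-pattern statistics against many encodings):

```python
# "café" encoded as UTF-8 but decoded as Latin-1 comes out as mojibake.
data = "café".encode("utf-8")        # b'caf\xc3\xa9'
wrong = data.decode("iso-8859-1")    # 'cafÃ©' -- gibberish
right = data.decode("utf-8")         # 'café'

def guess_encoding(raw: bytes) -> str:
    """Minimal byte-pattern detection: UTF-8 multi-byte sequences are
    self-validating, so text that decodes cleanly is probably UTF-8."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"          # permissive 8-bit fallback
```

Note the caveat from the slide: even a correctly detected encoding does not identify the language, since many languages share one encoding.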

Review and Conclusions 13

Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming speech

  – Automatic Speech Recognition (ASR): sounds to text
  – Text-to-Speech Synthesis (TTS): text to sounds

• Speech technology – the telephone

Review and Conclusions 14

How good are the systems

Task                     Vocab    WER (%)   WER (%) adapted
Digits                      11      0.4       0.2
Dialogue (travel)       21,000     10.9       –
Dictation (WSJ)          5,000      3.9       3.0
Dictation (WSJ)         20,000     10.0       8.6
Dialogue (noisy, army)   3,000     42.2      31.0
Phone conversations      4,000     41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult

• Pronunciation depends on context

  – The same word will be pronounced differently in different sentences

• Speaker variability

  – Gender
  – Dialect / foreign accent
  – Individual differences: physical differences, language differences (idiolect)

• Many, many rare events

  – 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

  • Sentence segmentation
  • Abbreviations: Dr Smith lives on Nanyang Dr. He is …
  • Word segmentation
    ✓ 森山前日銀総裁 Moriyama zen Nichigin sousai (Moriyama, former Bank of Japan governor)
    ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai (mis-segmented)

2. Speech Synthesis

  • Find the pronunciation
  • Generate sounds
  • Add intonation
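The abbreviation problem in step 1 is easy to demonstrate with a naive splitter; a small Python sketch (the splitting rule and example sentence are illustrative, not from the lecture):

```python
import re

# The same token 'Dr.' is a title in one place and 'Drive' in the other.
text = "Dr. Smith lives on Nanyang Dr. He is a linguist."

# Naive rule: a full stop followed by whitespace ends a sentence.
naive = re.split(r"(?<=\.)\s+", text)
print(naive)
# ['Dr.', 'Smith lives on Nanyang Dr.', 'He is a linguist.']
# Wrong: it splits after the title 'Dr.', yet a blanket "never split
# after Dr." rule would instead miss the real boundary after 'Nanyang Dr.'
```

Disambiguating the two uses requires context (e.g. a capitalized word follows a title, a street abbreviation follows a name), which is why segmentation is part of linguistic analysis rather than simple string processing.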

Review and Conclusions 17

Speech Synthesis

• Articulatory synthesis: attempt to model human articulation

• Formant synthesis: bypass modeling of human articulation and model the acoustics directly

• Concatenative synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sadness: slow, low pitch

  "The main determinant of 'naturalness' in speech synthesis is not 'voice quality' but natural-sounding prosody (intonation and duration)."

    – Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

  Reading    300    200 (proofreading)
  Writing     31     21 (composing)
  Speaking   150
  Hearing    150    210 (speeded up)
  Typing      33     19 (composing)

• Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

  Speech-like            Text-like
  time bound             space bound
  spontaneous            contrived
  face-to-face           visually decontextualized
  loosely structured     elaborately structured
  socially interactive   factually communicative
  immediately revisable  repeatedly revisable
  prosodically rich      graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things:

  – time-lagged multi-person conversation
  – raw, un-edited text
  – extreme multi-authorship

Review and Conclusions 23

Email

  Speech-like            Text-like
  time bound             space bound (deletable)
  spontaneous            contrived
  face-to-face           visually decontextualized
  loosely structured     elaborately structured
  socially interactive   factually communicative
  immediately revisable  repeatedly revisable
  prosodically rich      graphically rich

Review and Conclusions 24

Usenet (asynchronous)

  Speech-like            Text-like
  time bound             space bound
  spontaneous            contrived
  face-to-face           visually decontextualized
  loosely structured     elaborately structured
  socially interactive   factually communicative
  immediately revisable  repeatedly revisable
  prosodically rich      graphically rich

Review and Conclusions 25

Chat (synchronous)

  Speech-like            Text-like
  time bound             space bound
  spontaneous            contrived
  face-to-face           visually decontextualized
  loosely structured     elaborately structured
  socially interactive   factually communicative
  immediately revisable  repeatedly revisable
  prosodically rich      graphically rich

Review and Conclusions 26

Blogs

  Speech-like                      Text-like
  time bound                       space bound
  spontaneous                      contrived
  face-to-face                     visually decontextualized
  loosely structured               elaborately structured
  socially interactive (comments)  factually communicative
  immediately revisable            repeatedly revisable
  prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

  – <rant>I can't stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  – lusers (users, as seen by systems administrators)
  – suits

• Need to be in-group

  cow orker = co-worker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
    – Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

  – Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  – GREAT, *great*, _great_ (emphasis without formatting)
  – :-) (^_^) (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

  – every time a file is opened, a new copy is stored

• CVS, Subversion, Git

  – changes to a collection of files are tracked
  – simultaneous changes are merged

• Revision tracking

  – revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes
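The core of revision tracking, storing versions and making each author's change explicit, can be sketched with Python's difflib (a toy stand-in for CVS/Subversion/Git; the revision records and author names are invented):

```python
import difflib

# Each stored revision records who made the change and the resulting text.
revisions = [
    ("alice", ["Language and the Internet", "Draft notes"]),
    ("bob",   ["Language, Technology and the Internet", "Draft notes"]),
]

old_author, old = revisions[0]
new_author, new = revisions[1]

# A unified diff makes bob's change (and his responsibility for it) explicit.
diff = list(difflib.unified_diff(old, new, lineterm=""))
changed = [line for line in diff
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
print(new_author, changed)
```

Real systems add merging of simultaneous edits on top of exactly this kind of line-level diff.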

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge; people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

• The structure of markup: visual vs logical
  WYSIWYG, WYSIAYG, WYSIWYM

• The structure of the Web – hypertext:
  pages linked to pages

• The future of the Web

• Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual markup (presentational)

  – What you see is what you get (WYSIWYG)
  – Equivalent of printers' markup
  – Shows what things look like

• Logical markup (structural)

  – Shows the structure and meaning
  – Can be mapped to visual markup
  – Less flexible than visual markup
  – More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

  – Known size and exact hit counts → statistics possible
  – People can draw conclusions about the included text types
  – (Limited) control over the content
  ⊗ Sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
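The ratio computation is trivial but worth making explicit; a sketch using the counts from the slide:

```python
# ghit counts for "different from" vs "different than" (accessed 2012-04-04).
different_from = 251_000_000
different_than = 71_500_000

# A unitless ratio is more stable across time than raw, estimated counts.
ratio = different_from / different_than
print(f"{ratio:.1f}:1")   # 3.5:1
```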

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with the available tools (here: taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  – do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding, cleanup, tagging and parsing)
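Step 2, turning a seed lexicon into many queries, might look like this in Python (the seed words and tuple size are illustrative; Sharoff combined seeds into thousands of multi-word queries):

```python
import itertools

seeds = ["language", "internet", "corpus", "speech", "technology"]  # toy seed set

# Every 2-word combination of seeds becomes one search-engine query.
queries = [" ".join(pair) for pair in itertools.combinations(seeds, 2)]
print(len(queries), queries[0])   # 10 queries from 5 seeds
```

With 500 seeds and larger tuples, the same idea yields the thousands of queries used in the real pipeline.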

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts
  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit meta-data

  – Keywords and categories
  – Rankings
  – Structural markup

• Implicit meta-data

  – Links and citations
  – Tags
  – Tables
  – File names
  – Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate, they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and tags

• Rankings

• Links and citations

• Structural markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations – bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning based methods
  – Context (under .jp or .kr)

• Hard to do for short text snippets, similar languages, mixed text
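A character n-gram ranking identifier (in the style of Cavnar & Trenkle's out-of-place measure) fits in a few lines of Python. The two-language training texts here are toy stand-ins; real systems train on much more data:

```python
from collections import Counter

def profile(text, n=3, k=50):
    """Top-k character n-grams, most frequent first."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(k)]

def out_of_place(doc, ref):
    """Sum of rank differences between profiles; smaller = more similar."""
    return sum(abs(i - ref.index(g)) if g in ref else len(ref)
               for i, g in enumerate(doc))

# Toy training text per language (hypothetical; far too small for real use).
train = {"en": "the quick brown fox jumps over the lazy dog " * 20,
         "nl": "de snelle bruine vos springt over de luie hond " * 20}
refs = {lang: profile(t) for lang, t in train.items()}

def identify(text):
    doc = profile(text)
    return min(refs, key=lambda lang: out_of_place(doc, refs[lang]))

print(identify("the dog jumps"))   # en
```

Even this toy version illustrates the slide's caveat: the shorter the snippet, and the more similar the languages, the less reliable the ranking becomes.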

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel
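Two of the steps above, number normalization and stemming, can be sketched minimally in Python (the regex and suffix list are illustrative; a real stemmer would use Porter's algorithm):

```python
import re

def normalize_number(s):
    """Map '$700K'-style strings to a canonical integer."""
    m = re.fullmatch(r"\$?(\d+(?:\.\d+)?)([KMB]?)", s.strip())
    if not m:
        return None
    value, suffix = float(m.group(1)), m.group(2)
    scale = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    return int(value * scale[suffix])

def stem(word):
    """Toy stemmer: strip a few common English suffixes."""
    for suf in ("es", "er", "s"):
        if word.endswith(suf):
            return word[:-len(suf)]
    return word

print(normalize_number("$700K"))            # 700000
print(stem("produces"), stem("producer"))   # produc produc
```

The point of both steps is the same: collapse surface variants so that counts and comparisons operate on one canonical form.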

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations into ontologies (OWL)

  – Then reason over them, search them, …
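The triple model is simple enough to sketch without any RDF machinery; the URIs below are hypothetical, and a real system would use an RDF library rather than plain tuples:

```python
# Triples of <subject, predicate, object>, each element a URI.
EX = "http://example.org/"
triples = {
    (EX + "fcbond", EX + "teaches", EX + "HG2052"),
    (EX + "HG2052", EX + "about", EX + "language_and_internet"),
}

def objects(subject, predicate):
    """All objects o such that <subject, predicate, o> is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# A one-step inference by chaining triples: what topics does fcbond teach about?
topics = {t for course in objects(EX + "fcbond", EX + "teaches")
            for t in objects(course, EX + "about")}
print(topics)
```

Chaining triples like this is the seed of what OWL reasoners do at scale: new conclusions drawn by combining statements from multiple sources.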

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: citation analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total number of (citations / number of authors)
  – Average (citation impact factor / number of authors)

• Problems:

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
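The simpler scores above are easy to compute once citation records exist; a toy sketch with invented numbers:

```python
# Hypothetical records: (citations, self_citations, n_authors) per paper.
papers = [(120, 10, 2), (45, 5, 1), (8, 3, 4)]

total            = sum(c for c, _, _ in papers)        # raw count: pretty useful
minus_self       = sum(c - s for c, s, _ in papers)    # discount self-citation
fractional_count = sum(c / a for c, _, a in papers)    # share credit among authors

print(total, minus_self, fractional_count)   # 173 155 107.0
```

Note how the three scores already disagree about relative impact, which is exactly why weighting by the quality of the citing paper (as PageRank does for links) is the next step.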

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote – not very accurate

• On the web, citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: 'jumping' from a dead end is independent of the teleportation rate
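The teleporting random walk above can be computed directly by power iteration; a minimal sketch on a hypothetical 4-page graph (the graph itself is invented for illustration):

```python
# Graph: page 0 -> 1, 2;  page 1 -> 2;  page 2 -> 0;  page 3 is a dead end.
N = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
teleport = 0.10                          # the teleportation rate

rank = [1.0 / N] * N                     # start from a uniform distribution
for _ in range(100):                     # iterate to the steady state
    new = [0.0] * N
    for page, out in links.items():
        if not out:                      # dead end: jump anywhere, prob 1/N each
            for q in range(N):
                new[q] += rank[page] / N
        else:
            for q in range(N):           # teleport with prob 0.1, spread over N
                new[q] += teleport * rank[page] / N
            for q in out:                # else follow one outlink equiprobably
                new[q] += (1 - teleport) * rank[page] / len(out)
    rank = new

print([round(r, 3) for r in rank])       # long-term visit rates; they sum to 1
```

The resulting vector is the PageRank: the dead-end page gets only its teleport share, while the pages inside the cycle accumulate most of the visit rate.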

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

• Link farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

• Scraper sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74

Current Status

• There is a continuous battle between

  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a persistent identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press.

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press.

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics – (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 3: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

What have we learned

atilde How technology affects our use of language

atilde How language is used on the internet

acirc Some interesting things we can now do that we couldnrsquot before

atilde Collaboration on the Web

atilde The web as a source of linguistic data Direct Query and Sample

atilde The Semantic Web meaning for non-humans

atilde Citation and reputation

Review and Conclusions 2

Reflection

atilde What was the most surprising thing in this class

atilde What do you think is most likely wrong

atilde What do you think is the coolest result

atilde What do you think yoursquore most likely to remember

atilde How do you think this course will influence you as a linguist

atilde What (if anything) did you hope to learn that you didnrsquot

Review and Conclusions 3

Goals

atilde Gain an understanding of how technology affects language use

atilde Develop familiarity with markup and meta information in texts

atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting

Upon successful completion students will

atilde have an understanding of how technology shapes language use

atilde be able to test linguistic hypotheses against web data

atilde know how to edit Wikipedia

Review and Conclusions 4

Themes

atilde Language and Technology

acirc Writing and Speech Technology

atilde Language and the Internet

acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter

atilde The Web as Corpus

atilde The Web beyond Language

acirc Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

atilde The two things that separate humans from animals (Sproat 2010 sect11)

acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)

acirc Technologylowast Widespread tool uselowast Widespread tool manufacture

atilde Speech mdash the start of language

atilde Writing mdash the first great intersection

Review and Conclusions 7

Language and the Internet

atilde New forms of communication

acirc Neither speech nor textacirc Massively interactive

atilde Extremely rapid change

atilde A first hand narrative (I was online before the internet -)

acirc but I am probably behind you all now

Review and Conclusions 8

New forms

atilde Email (from PC phone other)

atilde Chat Usenet

atilde Virtual Worlds

atilde WWW

atilde Blogs (overlap)

atilde Facebook LinkedIn

atilde Wikis

atilde Twitter

Review and Conclusions 9

Representing Languageatilde Writing Systems

atilde Encodings

atilde Speech

atilde Bandwidth

Review and Conclusions 10

What is represented

atilde Phonemes ma d laks aeligvkadoz (45)

acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes

atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)

atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)

atilde Words my dog likes avocados (200000+)

atilde Concepts speaker poss canine fond avacado+PL (400000+)

Review and Conclusions 11

Three Major Writing Systems

atilde Alphabetic (Latin)

acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)

atilde Syllabic (Hiragana)

acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)

atilde Logographic (Hanzi)

acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech like             Text like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large scale discourse studies still to be done

• Some genuinely new things:

  ◦ time-lagged multi-person conversation
  ◦ raw un-edited text
  ◦ extreme multi-authorship

Review and Conclusions 23

Email

Speech like             Text like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like             Text like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like             Text like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech like                      Text like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

  ◦ <rant>I can't stand this</rant>
  ◦ I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  ◦ lusers (users as seen by Systems Administrators)
  ◦ suits

• Need to be in-group

cow orker: Coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
    – Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

  ◦ Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  ◦ GREAT, *great*, _great_
  ◦ :-)  (^_^)  (o)
  ◦ brb, RTFM, IMHO
  ◦ lower case
  ◦ lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

  ◦ every time a file is opened a new copy is stored

• CVS, Subversion, Git

  ◦ changes to a collection of files are tracked
  ◦ simultaneous changes are merged

• Revision Tracking

  ◦ Revisions are stored within a file

• Authorship in shared writing: Explicit responsibility for changes

Review and Conclusions 31
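The tracked-changes idea above can be illustrated with Python's stdlib difflib. This shows what a delta between two revisions looks like, not how any particular VCS stores it (Git, for instance, stores snapshots and computes deltas on demand):

```python
import difflib

# Two revisions of a shared text; a VCS tracks deltas like this and
# attributes each change to its author.
rev1 = "Wikipedia is an encyclopedia.\nAnyone can edit it.\n".splitlines(keepends=True)
rev2 = "Wikipedia is a free encyclopedia.\nAnyone can edit it.\n".splitlines(keepends=True)

diff = list(difflib.unified_diff(rev1, rev2, fromfile="rev1", tofile="rev2"))
print("".join(diff), end="")
```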

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge: people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWYM

• The structure of the Web — hypertext: pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like

• Logical Markup (Structural)

  ◦ Show the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: Search Engine as Query tool and WWW as corpus
  (Objection: results are not reliable)

  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow one to draw conclusions on the data
  ⊗ Google is missing functionality that linguists/lexicographers would like to have

• Web Sample: Use a search engine to download data from the net and build a corpus from it

  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions over the included text types
  ◦ (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, Scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  ◦ it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman)
  but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
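The advice above (report ratios and, where possible, size-normalised counts) can be made concrete; function names are my own, and the counts are the slide's 2012 figures:

```python
def ratio(ghits_a, ghits_b):
    """Unitless comparison of two web counts: report a ratio, since
    raw hit counts are unstable and hard to reproduce."""
    return ghits_a / ghits_b

def gpb(ghits, index_size):
    """Ghits per billion documents (Mark Liberman's unit); requires the
    index size, which Google no longer releases."""
    return ghits * 1_000_000_000 / index_size

# 'different from' vs 'different than' (counts as accessed 2012-04-04):
print(round(ratio(251_000_000, 71_500_000), 1))  # 3.5, i.e. roughly 3.5:1
```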

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  ◦ do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47
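Steps 1–2 of the outline can be sketched without any network access, BootCaT-style: combine seed words into random tuples to use as queries. The seed list and tuple size here are illustrative, not from the paper:

```python
import random

def make_queries(seeds, tuple_size=3, n_queries=5, rng_seed=0):
    """Combine seed words into random tuples to send to a search
    engine (deterministic here via a fixed RNG seed)."""
    rng = random.Random(rng_seed)
    queries = set()
    while len(queries) < n_queries:
        queries.add(" ".join(sorted(rng.sample(seeds, tuple_size))))
    return sorted(queries)

# Toy seed words; a real run uses ~500 seeds and ~6,000 queries.
seeds = ["language", "internet", "corpus", "technology", "web", "data"]
for q in make_queries(seeds, n_queries=3):
    print(q)
```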

Internet Corpora Summary

• The web can be used as a corpus

  ◦ Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  ◦ Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup

• Implicit Meta-data

  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  ◦ When they are accurate they are very good
  ◦ They are often deceitful

• HTML, PDF, Word, … Metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  ◦ as they are unintended, they tend to be noisy
  ◦ but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations — Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine Learning based methods
  ◦ Context (under jp or ko)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
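The character n-gram ranking approach above can be sketched as a Cavnar–Trenkle-style classifier. The training texts here are toy pangrams, far too small for real use, and the function names are mine:

```python
from collections import Counter

def profile(text, n=3, top=50):
    """Character n-gram frequency profile: the most common n-grams, in rank order."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(p1, p2):
    """Rank-order distance: how far each n-gram has moved between profiles."""
    return sum(abs(i - p2.index(g)) if g in p2 else len(p2)
               for i, g in enumerate(p1))

def identify(text, training):
    """Pick the training language whose profile is closest to the text's."""
    p = profile(text)
    return min(training, key=lambda lang: out_of_place(p, training[lang]))

# Tiny illustrative training texts; real systems train on much more data.
training = {
    "en": profile("the quick brown fox jumps over the lazy dog " * 20),
    "de": profile("der schnelle braune fuchs springt über den faulen hund " * 20),
}
print(identify("the dog jumps over the fox", training))  # en
```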

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel

Review and Conclusions 54
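The number-normalization examples above all denote the same value. A toy normalizer covering just those three surface patterns (a real system needs many more rules and context; the function name is mine):

```python
import re

def normalize_number(s):
    """Map a few monetary surface forms to an integer number of dollars."""
    s = s.strip()
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)K", s)          # $700K
    if m:
        return int(float(m.group(1)) * 1_000)
    m = re.fullmatch(r"\$?([\d,]+)", s)                  # $700,000
    if m:
        return int(m.group(1).replace(",", ""))
    m = re.fullmatch(r"(\d+(?:\.\d+)?) million dollars", s)  # 0.7 million dollars
    if m:
        return int(float(m.group(1)) * 1_000_000)
    raise ValueError(s)

# All three surface forms map to the same value:
print({normalize_number(s) for s in ["$700K", "$700,000", "0.7 million dollars"]})  # {700000}
```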

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  ◦ provides a common data representation framework
  ◦ makes possible integrating multiple sources
  ◦ so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  ◦ is a markup language, much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58
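Syntactic well-formedness, the check mentioned above, is simply "does it parse": every open tag must have a matching close in the right place. A minimal sketch using the standard library (validity against a DTD or schema is a separate, stronger check):

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True iff the string parses as XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course></code>"))  # False: tags cross
```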

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  ◦ Universal Resource Name: urn:isbn:1575864606
  ◦ Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond

• Identify relations with the Resource Description Framework

  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDFs are written in well defined XML
  ◦ You can say anything about anything

• You can build relations in ontologies (OWL)

  ◦ Then reason over them, search them, …

Review and Conclusions 59
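The triple model can be sketched as a toy in-memory store with wildcard matching. The URIs follow the slide's examples plus the (real) Dublin Core `creator` predicate; the stored facts are illustrative, and real stores are queried with SPARQL rather than a Python function:

```python
# Toy triple store: each fact is a <subject, predicate, object> tuple.
triples = {
    ("http://www3.ntu.edu.sg/home/fcbond/", "http://purl.org/dc/terms/creator", "Francis Bond"),
    ("urn:isbn:1575864606", "http://purl.org/dc/terms/format", "book"),
}

def query(triples, s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

for _, _, creator in query(triples, p="http://purl.org/dc/terms/creator"):
    print(creator)  # Francis Bond
```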

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  ◦ Context based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  ◦ Total Number of Citations (Pretty Useful)
  ◦ Total Number of Citations minus Self-citations
  ◦ Total Number of (Citations / Number of Authors)
  ◦ Average (Citation Impact Factor / Number of Authors)

• Problems:

  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones

• Weight Citations by Quality of the paper

Review and Conclusions 64
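Two of the raw scores above can be computed directly from a small citation graph. The data and author names are invented for illustration, and this treats any shared author as a self-citation:

```python
def citation_scores(authors, citations):
    """Total citations, and total minus self-citations.
    `citations` maps citing paper -> list of cited papers;
    `authors` maps paper -> set of author names."""
    totals, non_self = {}, {}
    for citing, cited_papers in citations.items():
        for cited in cited_papers:
            totals[cited] = totals.get(cited, 0) + 1
            if authors[citing].isdisjoint(authors[cited]):  # no shared author
                non_self[cited] = non_self.get(cited, 0) + 1
    return totals, non_self

# Invented data: B shares an author with A, so B citing A is a self-citation.
authors = {"A": {"bond"}, "B": {"bond", "smith"}, "C": {"lee"}}
citations = {"B": ["A"], "C": ["A", "B"]}
totals, non_self = citation_scores(authors, citations)
print(totals, non_self)  # {'A': 2, 'B': 1} {'A': 1, 'B': 1}
```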

Gaming Citations

• Least/Minimum Publishable Unit

  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information

• Self citation, in-group citation

• Write only proceedings (some journals are not often read)

• Submitting only to High Impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

For more information about Language Technology and the Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  ◦ Simplest measure: Each article gets one vote – not very accurate

• On the web, citation frequency = inlink count

  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With remaining probability (90%), go out on a random hyperlink

  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
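The random walk with teleporting can be implemented directly as power iteration. A sketch: the toy graph is mine, and the 10% teleport rate and dead-end handling follow the slide:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration for the teleporting random walk: with probability
    `teleport` jump to a random page, otherwise follow a random outlink;
    dead ends always jump uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if links[p]:
                for q in links[p]:          # 90%: follow one of the outlinks
                    new[q] += (1 - teleport) * rank[p] / len(links[p])
                for q in pages:             # 10%: teleport anywhere
                    new[q] += teleport * rank[p] / n
            else:
                for q in pages:             # dead end: jump anywhere
                    new[q] += rank[p] / n
        rank = new
    return rank

# Toy graph; D is a dead end. A page with 4 outlinks would send
# (1 - 0.10)/4 = 0.225 of its rank along each link, as in the slide.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C has the highest long-term visit rate
```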

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …

74

Current Status

• There is a continuous battle between:

  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  ◦ DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  ◦ Passively, for information access
  ◦ Actively, for collaboration

• The internet is interesting in and of itself

  ◦ As a source of data about existing language
  ◦ As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics — (pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 4: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Reflection

atilde What was the most surprising thing in this class

atilde What do you think is most likely wrong

atilde What do you think is the coolest result

atilde What do you think yoursquore most likely to remember

atilde How do you think this course will influence you as a linguist

atilde What (if anything) did you hope to learn that you didnrsquot

Review and Conclusions 3

Goals

atilde Gain an understanding of how technology affects language use

atilde Develop familiarity with markup and meta information in texts

atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting

Upon successful completion students will

atilde have an understanding of how technology shapes language use

atilde be able to test linguistic hypotheses against web data

atilde know how to edit Wikipedia

Review and Conclusions 4

Themes

atilde Language and Technology

acirc Writing and Speech Technology

atilde Language and the Internet

acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter

atilde The Web as Corpus

atilde The Web beyond Language

acirc Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

atilde The two things that separate humans from animals (Sproat 2010 sect11)

acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)

acirc Technologylowast Widespread tool uselowast Widespread tool manufacture

atilde Speech mdash the start of language

atilde Writing mdash the first great intersection

Review and Conclusions 7

Language and the Internet

atilde New forms of communication

acirc Neither speech nor textacirc Massively interactive

atilde Extremely rapid change

atilde A first hand narrative (I was online before the internet -)

acirc but I am probably behind you all now

Review and Conclusions 8

New forms

atilde Email (from PC phone other)

atilde Chat Usenet

atilde Virtual Worlds

atilde WWW

atilde Blogs (overlap)

atilde Facebook LinkedIn

atilde Wikis

atilde Twitter

Review and Conclusions 9

Representing Languageatilde Writing Systems

atilde Encodings

atilde Speech

atilde Bandwidth

Review and Conclusions 10

What is represented

atilde Phonemes ma d laks aeligvkadoz (45)

acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes

atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)

atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)

atilde Words my dog likes avocados (200000+)

atilde Concepts speaker poss canine fond avacado+PL (400000+)

Review and Conclusions 11

Three Major Writing Systems

atilde Alphabetic (Latin)

acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)

atilde Syllabic (Hiragana)

acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)

atilde Logographic (Hanzi)

acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)(English computer science students various studies)

Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)

atilde Reading gtgt SpeakingHearing gtgt Typing

rArr Speech for inputrArr Text for output

Review and Conclusions 20

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

  • Content policies: NPOV, Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

• The structure of markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWYM)

• The structure of the Web: hypertext, pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links
them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g.
ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g.
using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript,
as well as content dynamically downloaded from Web servers via Flash or Ajax
solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video)
files or in specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  – What you see is what you get (WYSIWYG)
  – Equivalent of printers' markup
  – Shows what things look like

• Logical Markup (Structural)

  – Shows the structure and meaning
  – Can be mapped to visual markup
  – Less flexible than visual markup
  – More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus
  from it

  – known size and exact hit counts → statistics possible
  – people can draw conclusions over the included text types
  – (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller
  2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the
  early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number
  (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
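As a quick arithmetic check, the ratio quoted above can be computed directly; a small Python sketch using the slide's 2012 figures:

```python
# Compare web counts for two competing phrases as a unitless ratio.
# The counts are the ghit figures quoted on the slide.
def ghit_ratio(hits_a, hits_b):
    """Return the ratio of two web counts, rounded to one decimal place."""
    return round(hits_a / hits_b, 1)

# "different from" vs "different than"
ratio = ghit_ratio(251_000_000, 71_500_000)
print(f"{ratio}:1")  # 3.5:1
```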

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora
  (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here, taggers
    and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, or too long: word
    lists)
  – do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6,000)

3 Query a search engine and retrieve the URLs (50,000)

4 Download the files from the URLs (100,000,000 words)

5 Postprocess the data (encoding, cleanup, tagging and parsing)
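Steps 1 and 2 of the pipeline can be sketched in a few lines of Python; the seed words here are invented stand-ins for a real 500-word list:

```python
# A sketch of steps 1-2: combine seed words into search queries.
# The seed list is a tiny invented stand-in for a real 500-word list.
import itertools
import random

seeds = ["time", "people", "water", "school", "music"]  # step 1 (sample)
random.seed(0)

# Step 2: form queries from random pairs of seed words.
pairs = list(itertools.combinations(seeds, 2))
queries = [" ".join(pair) for pair in random.sample(pairs, 6)]
print(queries)
```

Each query would then be sent to a search engine API (step 3), which is omitted here.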

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as
Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate, they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations: Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it
  (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words

  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings

  – Machine-learning-based methods
  – Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text
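A toy illustration of the character n-gram approach, in the spirit of rank-order profile matching; the two training snippets are invented and far too small for real identification:

```python
# Toy character-trigram language identifier (rank-order profile matching).
# The English and Dutch training snippets are invented toy data.
from collections import Counter

def profile(text, n=3, top=50):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get a maximum penalty."""
    return sum(
        abs(i - lang_profile.index(g)) if g in lang_profile else len(lang_profile)
        for i, g in enumerate(doc_profile)
    )

training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat",
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(snippet):
    doc = profile(snippet)
    return min(profiles, key=lambda lang: distance(doc, profiles[lang]))

print(identify("the dog and the fox"))  # en
```

With realistic training data the same rank-order idea scales to many languages, but it still struggles on the short, similar, or mixed inputs noted above.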

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
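Two of the steps above, stop-word stripping and stemming, can be sketched naively; the suffix list is an invented simplification, not a real stemmer such as Porter's:

```python
# A naive sketch of stop-word stripping and suffix stemming.
# The suffix list is invented for illustration; real systems use a
# proper stemmer (e.g. Porter) and a curated stop-word list.
STOP_WORDS = {"the", "a", "of"}

def stem(word):
    """Strip a few common English suffixes (crude, for illustration)."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def normalize(text):
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

print(normalize("The producer produces avocados"))
# ['produc', 'produc', 'avocado']
```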

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – is a W3C Recommendation

• Can be validated for syntactic well-formedness
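Well-formedness checking can be demonstrated with the Python standard library; the sample documents are invented:

```python
# Checking XML well-formedness with the standard library parser.
# The two sample documents are invented for illustration.
import xml.etree.ElementTree as ET

good = "<course><code>HG2052</code></course>"
bad = "<course><code>HG2052</course>"  # <code> is never closed

def well_formed(doc):
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed(good), well_formed(bad))  # True False
```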

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations in ontologies (OWL)

  – Then reason over them, search them, …
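The triple model can be illustrated with plain tuples; the `ex:` URIs and the facts are made up for the example:

```python
# RDF statements are subject-predicate-object triples; here they are
# modelled as plain tuples, with invented ex: URIs as the three parts.
triples = [
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Everything we know about HG2052:
print(query(s="ex:HG2052"))
```

Real RDF stores add URIs with full namespaces, serialization formats, and inference over OWL ontologies, but the underlying data model is exactly this pattern-matching over triples.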

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible: know thyself

5 Schemas aren't neutral

6 Metrics influence results

7 There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of
  the published work of a scientist, scholar or research group

• Some scores are:

  – Total Number of Citations (Pretty Useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)

• Problems

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
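The simpler scores above amount to basic arithmetic over a publication list; a sketch with invented papers and counts:

```python
# Citation scores for one researcher's publication list.
# The papers, citation counts and author counts are invented.
papers = [
    {"citations": 120, "self_citations": 20, "authors": 4},
    {"citations": 45, "self_citations": 5, "authors": 1},
]

total = sum(p["citations"] for p in papers)
total_minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
author_normalized = sum(p["citations"] / p["authors"] for p in papers)

print(total, total_minus_self, author_normalized)  # 165 140 75.0
```

Weighting each citation by the quality of the citing paper is exactly what PageRank-style algorithms formalize, as the following slides show.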

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC, Strongly Connected Core: can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator
  of page A

2 The (extended) anchor text pointing to page B is a good description of page B
  (This is not always the case; for instance, most corporate websites have a pointer
  from every page to a page containing a copyright notice)
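Extracting (URL, anchor-text) pairs, as a link-analysis system would, can be sketched with the standard library's HTML parser; the page snippet is invented:

```python
# Collect (href, anchor text) pairs from an HTML page.
# The page snippet and URL are invented for illustration.
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((self._href, data.strip()))
            self._href = None

page = '<p>See the <a href="http://example.com/hg2052">HG2052 Course Page</a></p>'
parser = AnchorExtractor()
parser.feed(page)
print(parser.links)  # [('http://example.com/hg2052', 'HG2052 Course Page')]
```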

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote (not very accurate)

• On the web, citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page,
    equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with
  a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with
    probability (1 - 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
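The random walk with a 10% teleportation rate can be simulated by power iteration; the four-page graph (with dead end "d") is invented:

```python
# PageRank by power iteration with a 10% teleportation rate.
# The four-page link graph is invented; page "d" is a dead end.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
pages = list(links)
N = len(pages)
teleport = 0.10

rank = {p: 1 / N for p in pages}
for _ in range(100):
    new = {p: 0.0 for p in pages}
    for p in pages:
        if not links[p]:  # dead end: jump anywhere with probability 1/N
            for q in pages:
                new[q] += rank[p] / N
        else:
            for q in pages:  # teleport with probability 0.10
                new[q] += rank[p] * teleport / N
            for q in links[p]:  # otherwise follow a random outlink
                new[q] += rank[p] * (1 - teleport) / len(links[p])
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

The ranks always sum to 1 (they are a probability distribution), and the dead-end page, which is reached only by teleporting, ends up with the lowest rank.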

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam
  takes advantage of link-based ranking algorithms, which give websites higher
  rankings the more other highly ranked websites link to them. Examples include
  adding links within blogs

• Link Farms: creating tightly-knit communities of pages referencing each other, also
  known humorously as "mutual admiration societies"

• Scraper Sites: "scrape" search-engine results pages or other sources of content and
  create "content" for a website. The specific presentation of content on these sites is
  unique, but it is merely an amalgamation of content taken from other sources, often
  without permission

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing,
  such as wikis, blogs and guestbooks. Agents can be written that automatically and
  randomly select a user-edited web page, such as a Wikipedia article, and add
  spamming links

• The nofollow link: a value that can be assigned to the rel attribute of an HTML
  hyperlink to instruct some search engines that the hyperlink should not influence
  the link target's ranking in the search engine's index

  – Google does not follow a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74

Current Status

• There is a continuous battle between

  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies
  coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000
  organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An
  Introduction to Natural Language Processing, Computational Linguistics and Speech
  Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language
  Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information
  Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python;
  introduces both programming and linguistics

• HG3051 Corpus Linguistics (Pre-req HG2051, waived): an introduction to the
  fast-growing field of corpus linguistics

Review and Conclusions 81

Goals

• Gain an understanding of how technology affects language use

• Develop familiarity with markup and meta information in texts

• Get a feel for what research is all about, especially relating to web mining and
  online frequency counting

Upon successful completion students will:

• have an understanding of how technology shapes language use

• be able to test linguistic hypotheses against web data

• know how to edit Wikipedia

Review and Conclusions 4

Themes

• Language and Technology

  – Writing and Speech Technology

• Language and the Internet

  – Email, Chat, Virtual Worlds, WWW, IM, Blogs, Facebook, Wikis, Twitter

• The Web as Corpus

• The Web beyond Language

  – Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

• The two things that separate humans from animals (Sproat 2010, §1.1)

  – Language
    ∗ large vocabulary (10,000+)
    ∗ complicated syntax (no upper length: recursion, embedding)

  – Technology
    ∗ Widespread tool use
    ∗ Widespread tool manufacture

• Speech: the start of language

• Writing: the first great intersection

Review and Conclusions 7

Language and the Internet

• New forms of communication

  – Neither speech nor text
  – Massively interactive

• Extremely rapid change

• A first-hand narrative (I was online before the internet :-)

  – but I am probably behind you all now

Review and Conclusions 8

New forms

• Email (from PC, phone, other)

• Chat, Usenet

• Virtual Worlds

• WWW

• Blogs (overlap)

• Facebook, LinkedIn

• Wikis

• Twitter

Review and Conclusions 9

Representing Language

• Writing Systems

• Encodings

• Speech

• Bandwidth

Review and Conclusions 10

What is represented

• Phonemes: maɪ dɔg laɪks ævəkɑdoz (45)

  – Not a simple correspondence between a writing system and the sounds
  – Some logographs (3, $)
  – Even when a new alphabet is designed, pronunciation changes

• Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

• Morphemes: me + 's dɔg laɪk + s ævəkɑdo + z (100,000+)

• Words: my dog likes avocados (200,000+)

• Concepts: speaker+poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

• Alphabetic (Latin)

  – one symbol for each consonant or vowel
  – Typically 20-30 base symbols (1 byte)

• Syllabic (Hiragana)

  – one symbol for each syllable (consonant + vowel)
  – Typically 50-100 base symbols (1-2 bytes)

• Logographic (Hanzi)

  – pictographs, ideographs, sound-meaning combinations
  – Typically 100,000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

• Need to map characters to bits

• More characters require more space

  – English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)

• Moving towards Unicode for everything

• If you get the encoding wrong, it is gibberish

  – Web pages should state their encoding
  – Sometimes they are wrong
  – Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
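Both points, the byte cost of different scripts and the gibberish from a wrong decoding, can be demonstrated directly in Python:

```python
# UTF-8 byte costs for different scripts, and mojibake from decoding
# with the wrong encoding. The example words are arbitrary.
latin = "dog".encode("utf-8")   # Latin letters: 1 byte each
kana = "いぬ".encode("utf-8")    # Hiragana: 3 bytes each
hanzi = "犬".encode("utf-8")     # Hanzi: 3 bytes

print(len(latin), len(kana), len(hanzi))  # 3 6 3

# Decoding UTF-8 bytes as ISO-8859-1 produces gibberish, not 犬:
print(hanzi.decode("iso-8859-1"))
```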

Review and Conclusions 13

Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming Speech

  – Automatic Speech Recognition (ASR): sounds to text
  – Text-to-Speech Synthesis (TTS): text to sounds

• Speech technology: the Telephone

Review and Conclusions 14

How good are the systems

Task                     Vocab     WER (%)   WER (%) adapted
Digits                   11        0.4       0.2
Dialogue (travel)        21,000    10.9      —
Dictation (WSJ)          5,000     3.9       3.0
Dictation (WSJ)          20,000    10.0      8.6
Dialogue (noisy, army)   3,000     42.2      31.0
Phone Conversations      4,000     41.9      31.0

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

• Pronunciation depends on context

  – The same word will be pronounced differently in different sentences

• Speaker variability

  – Gender
  – Dialect / Foreign Accent
  – Individual differences: physical differences, language differences (idiolect)

• Many, many rare events

  – 300 out of 2000 diphones in the core set for the AT&T NextGen system occur
    only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

  • Sentence Segmentation
  • Abbreviations: Dr Smith lives on Nanyang Dr. He is …
  • Word Segmentation

    – 森山前日銀総裁 Moriyama zen Nichigin sousai
    ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai

2 Speech Synthesis

  • Find the pronunciation
  • Generate sounds
  • Add intonation

Review and Conclusions 17

Speech Synthesis

• Articulatory Synthesis: attempt to model human articulation

• Formant Synthesis: bypass modeling of human articulation and model acoustics
  directly

• Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sad: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality'
but natural-sounding prosody (intonation and duration)"

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing    31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing     33     19 (composing)

• Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things

  – time-lagged multi-person conversation
  – raw un-edited text
  – extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47
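Step 2 of the outline (combining seed words into queries) can be sketched as follows; the seed list is a made-up toy example, and the actual querying and downloading steps are omitted:

```python
import random

def build_query_list(seed_words, words_per_query=4, n_queries=10):
    """Combine seed words into search queries (step 2 of the outline)."""
    queries = []
    for _ in range(n_queries):
        # sample without replacement so a query has no repeated words
        queries.append(" ".join(random.sample(seed_words, words_per_query)))
    return queries

# Toy seed list; Sharoff's pipeline uses ~500 mid-frequency words.
seeds = ["language", "internet", "corpus", "technology", "writing", "speech"]
random.seed(0)
for q in build_query_list(seeds, n_queries=3):
    print(q)
```

Each query is then sent to a search engine, the returned URLs are downloaded, and the pages are cleaned and tagged (steps 3 to 5).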

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ⊗ unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ⊗ Less data

- Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations: Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine-learning-based methods
  - Context (under .jp or .kr)

- Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
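The character n-gram ranking approach (in the style of Cavnar and Trenkle's out-of-place measure) can be sketched as follows; the two training snippets are toy examples, far smaller than real training data:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, candidate):
    """Distance between profiles: sum of rank displacements.

    N-grams missing from the candidate get a fixed maximum penalty.
    """
    penalty = len(candidate)
    return sum(candidate.index(g) if g in candidate else penalty
               for g in profile)

# Toy training texts; real systems train on much larger samples.
models = {
    "en": ngram_profile("the cat sat on the mat and the dog ran after the cat"),
    "de": ngram_profile("der Hund und die Katze sitzen auf der Matte im Haus"),
}

snippet = "the dog and the cat"
best = min(models,
           key=lambda lang: out_of_place(ngram_profile(snippet), models[lang]))
print(best)  # "en": the snippet shares many trigrams with the English model
```

As the slide notes, exactly this kind of method degrades on short snippets and closely related languages, where the profiles barely differ.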

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel

Review and Conclusions 54
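A few of these normalization steps can be sketched with toy rules; the suffix list and the $…K expansion rule below are illustrative assumptions, not a real stemmer such as Porter's:

```python
import re

def normalize_number(text):
    """Very rough number normalization: $700K -> $700,000 (assumed rule)."""
    def expand(m):
        return "${:,}".format(int(float(m.group(1)) * 1000))
    return re.sub(r"\$(\d+(?:\.\d+)?)K", expand, text)

def naive_stem(word):
    """Toy suffix-stripping stemmer in the spirit of the slide's examples."""
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[:-len(suffix)]
    return word

print(normalize_number("It cost $700K last year"))  # It cost $700,000 last year
print(naive_stem("produces"), naive_stem("producer"))  # produc produc
```

Note how stemming conflates "produces" and "producer" to the same (non-word) stem, whereas lemmatization would map "produces" to the dictionary form "produce".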

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color", but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - it is a markup language, much like HTML
  - it was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - it is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness

Review and Conclusions 58
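Well-formedness checking (as opposed to validation against a DTD or schema) can be sketched with Python's standard library:

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc):
    """Check syntactic well-formedness (not validity against a schema)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<note><to>Tove</to></note>"))  # True
print(is_well_formed("<note><to>Tove</note></to>"))  # False (mis-nested tags)
```

Well-formedness is purely syntactic: every tag must close and nest properly. Whether the tags are the *right* ones is a separate question, answered by schema validation.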

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers (URIs)

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations into ontologies (OWL)

  - Then reason over them, search them, …

Review and Conclusions 59
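A minimal sketch of RDF-style triples and pattern matching, using plain tuples rather than a real RDF store; the example.org URIs and the facts themselves are made up for illustration:

```python
# Triples as (subject, predicate, object); real RDF uses full URIs
# and serializations such as RDF/XML or Turtle.
EX = "http://example.org/"   # hypothetical namespace

triples = [
    (EX + "HG2052", EX + "taughtBy", EX + "FrancisBond"),
    (EX + "HG2052", EX + "topic",    EX + "SemanticWeb"),
    (EX + "FrancisBond", EX + "worksAt", EX + "NTU"),
]

def query(store, s=None, p=None, o=None):
    """Return all triples matching a pattern; None is a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

for t in query(triples, s=EX + "HG2052"):
    print(t)
```

Because each element is a URI, triples from different sources can be merged into one graph and queried together, which is the "draw new conclusions" payoff above.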

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

⇒ Weight citations by the quality of the citing paper

Review and Conclusions 64
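Two of the scores listed above (total citations, and citations divided by the number of authors) can be sketched as follows; the input format is a made-up toy structure:

```python
def citation_scores(papers):
    """Compute two of the slide's simple per-author scores.

    papers: list of dicts with 'authors' and 'citations' keys;
    this field layout is illustrative, not a standard format.
    """
    totals, shared = {}, {}
    for p in papers:
        per_author = p["citations"] / len(p["authors"])
        for a in p["authors"]:
            totals[a] = totals.get(a, 0) + p["citations"]
            shared[a] = shared.get(a, 0) + per_author
    return totals, shared

papers = [
    {"authors": ["A", "B"], "citations": 100},
    {"authors": ["A"],      "citations": 10},
]
totals, shared = citation_scores(papers)
print(totals)  # {'A': 110, 'B': 100}
print(shared)  # {'A': 60.0, 'B': 50.0}
```

Dividing by the number of authors already changes the ranking incentives, which is exactly the gaming problem discussed on the next slide.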

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self-citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core – you can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B.
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote – not very accurate

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web:

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
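The random walk with teleportation can be sketched as a power iteration; the four-page graph (with D as a dead end) is a toy example:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power-iteration PageRank with teleportation (the slide's 10% rate).

    links: dict mapping each page to a list of outgoing links.
    Dead ends jump to a random page with probability 1/N each.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links[p]
            if out:  # non-dead end: teleport 10%, follow links with the rest
                for q in pages:
                    new[q] += rank[p] * teleport / n
                for q in out:
                    new[q] += rank[p] * (1 - teleport) / len(out)
            else:    # dead end: jump to a random page, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(graph)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

The iteration converges to the steady-state visit rates described above; note that D, which nothing links to, still gets a small rank purely from teleportation.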

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74

Current Status

- There is a continuous battle between:

  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3–5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics – (prerequisite HG2051, can be waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Themes

- Language and Technology

  - Writing and Speech Technology

- Language and the Internet

  - Email, Chat, Virtual Worlds, WWW, IM, Blogs, Facebook, Wikis, Twitter

- The Web as Corpus

- The Web beyond Language

  - Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

- The two things that separate humans from animals (Sproat 2010, §1.1):

  - Language
    * large vocabulary (10,000+)
    * complicated syntax (no upper length: recursion, embedding)

  - Technology
    * widespread tool use
    * widespread tool manufacture

- Speech – the start of language

- Writing – the first great intersection

Review and Conclusions 7

Language and the Internet

- New forms of communication

  - Neither speech nor text
  - Massively interactive

- Extremely rapid change

- A first-hand narrative (I was online before the internet :-)

  - but I am probably behind you all now

Review and Conclusions 8

New forms

- Email (from PC, phone, other)

- Chat, Usenet

- Virtual Worlds

- WWW

- Blogs (overlap)

- Facebook, LinkedIn

- Wikis

- Twitter

Review and Conclusions 9

Representing Language

- Writing Systems

- Encodings

- Speech

- Bandwidth

Review and Conclusions 10

What is represented

- Phonemes: maɪ dɔg laɪks ævəkɑdoz (45)

  - Not a simple correspondence between a writing system and the sounds
  - Some logographs (3, $)
  - Even when a new alphabet is designed, pronunciation changes

- Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

- Morphemes: maɪ (= me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)

- Words: my dog likes avocados (200,000+)

- Concepts: speaker-poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

- Alphabetic (Latin)

  - one symbol per consonant or vowel
  - Typically 20–30 base symbols (1 byte)

- Syllabic (Hiragana)

  - one symbol for each syllable (consonant + vowel)
  - Typically 50–100 base symbols (1–2 bytes)

- Logographic (Hanzi)

  - pictographs, ideographs, sound–meaning combinations
  - Typically 100,000+ symbols (2–3 bytes)

Review and Conclusions 12

Computational Encoding

- Need to map characters to bits

- More characters require more space

  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); everything: UTF-8 (8–32 bits)

- Moving towards Unicode for everything

- If you get the encoding wrong, it is gibberish

  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns
    * But an encoding can be shared by many languages

Review and Conclusions 13
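A crude sketch of byte-pattern encoding detection: try strict decoders in order and keep the first that succeeds. The candidate list is an assumption for illustration; real detectors use statistical models of byte patterns, and note that latin-1 accepts any byte sequence, so it can only serve as a last resort:

```python
def guess_encoding(raw):
    """Try candidate encodings in order; return the first that decodes.

    utf-8 is tried first because invalid sequences make it fail fast;
    latin-1 never fails, so anything unrecognised falls through to it.
    """
    for encoding in ("utf-8", "euc-jp", "latin-1"):
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None, None

raw = "日本語".encode("euc-jp")
enc, text = guess_encoding(raw)
print(enc, text)  # euc-jp 日本語
```

This also shows why pages that lie about their encoding turn to gibberish: the same bytes decode to different characters under different tables.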

Speech and Language Technology

- The need for speech representation

- Storing sound

- Transforming Speech

  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): text to sounds

- Speech technology – the Telephone

Review and Conclusions 14

How good are the systems?

Task                    Vocab    WER (%)   WER (%), adapted
Digits                     11       0.4       0.2
Dialogue (travel)      21,000      10.9       –
Dictation (WSJ)         5,000       3.9       3.0
Dictation (WSJ)        20,000      10.0       8.6
Dialogue (noisy, army)  3,000      42.2      31.0
Phone conversations     4,000      41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)

- Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

   - Sentence Segmentation
   - Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
   - Word Segmentation

     - 森山前日銀総裁 → Moriyama zen Nichigin sousai
     ⊗ 森山前日銀総裁 → Moriyama zennichi gin sousai

2. Speech Synthesis

   - Find the pronunciation
   - Generate sounds
   - Add intonation

Review and Conclusions 17
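The abbreviation problem in sentence segmentation can be seen with a naive splitter; the rule below wrongly splits after the title "Dr.":

```python
import re

def naive_split(text):
    """Split after a full stop followed by whitespace and a capital letter.

    A deliberately simplistic rule, to show why abbreviations make
    sentence segmentation hard.
    """
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)

text = "Dr. Smith lives on Nanyang Dr. He is a lecturer."
for s in naive_split(text):
    print(s)
# Dr.
# Smith lives on Nanyang Dr.
# He is a lecturer.
```

The split after "Nanyang Dr." happens to be right, but the identical pattern also fires after the title "Dr.", producing a spurious first "sentence"; real segmenters need abbreviation lists or statistical models to tell the two apart.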

Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation

- Formant Synthesis: bypass modeling of human articulation and model acoustics directly

- Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

- Excitement: fast, very high pitch, loud

- Hot anger: fast, high pitch, strong falling accent, loud

- Fear: jitter

- Sarcasm: prolonged accent, late peak

- Sadness: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

- All share some characteristics of speech and text

- Usage norms are not fixed

- Communication methods may disappear before the norms are fixed
  (Usenet, Bulletin Boards, Gopher, Archie, …)

- Large-scale discourse studies are still to be done

- Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time-bound                      space-bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  - lusers (users, as seen by systems administrators)
  - suits

- Need to be in-group

  cow orker: coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
      – Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-)  (^_^)  (^o^)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge;
  people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

- The structure of markup: visual vs logical
  (WYSIWYG, WYSIAYG, WYSIWIM)

- The structure of the Web – hypertext:
  pages linked to pages

- The future of the Web

- Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (no backlinks)

Private Web: sites that require registration and login (edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited-access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    - Read it and see if it is interesting (hard for a computer)
    - Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    - See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations ÷ Number of Authors)
  - Average (Citation Impact Factor ÷ Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper
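The first few scores are plain arithmetic; a sketch with invented figures (the field names and the toy numbers are illustrative only, not from the slides):

```python
def citation_scores(papers):
    """papers: dicts with 'citations', 'self_citations', 'n_authors'."""
    total      = sum(p["citations"] for p in papers)
    minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return total, minus_self, per_author

# A toy record of two papers by the same author.
scores = citation_scores([
    {"citations": 40, "self_citations": 4, "n_authors": 2},
    {"citations": 10, "self_citations": 1, "n_authors": 1},
])
```

Note that none of these scores address the two problems above; weighting by the quality of the citing paper is exactly what PageRank-style measures add.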

Review and Conclusions 64

Gaming Citations

- Least (Minimum) Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language, Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://...">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
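Extracting (href, anchor text) pairs is the first step of such link analysis; a sketch using Python's standard html.parser (the example URL is invented):

```python
from html.parser import HTMLParser

class AnchorTextExtractor(HTMLParser):
    """Collect (href, anchor-text) pairs: the raw material of link analysis."""
    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorTextExtractor()
parser.feed('See the <a href="http://example.org/hg2052">HG2052 Course Page</a>.')
```

After feeding a page, `parser.links` holds each link target with the text that describes it.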

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote — not very accurate

- On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web:

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting — to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

- 10% is a parameter: the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
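The random walk with teleporting can be computed by power iteration; a minimal sketch of the random-surfer model (the four-page graph is invented; 10% teleportation rate as on the slide):

```python
def pagerank(links, n_iter=100, teleport=0.10):
    """PageRank over {page: [outlinks]} via the random-surfer model."""
    pages = sorted(links)
    N = len(pages)
    rank = {p: 1.0 / N for p in pages}
    for _ in range(n_iter):
        nxt = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump anywhere, 1/N each
                for q in pages:
                    nxt[q] += rank[p] / N
            else:
                for q in pages:              # teleport with probability 0.10
                    nxt[q] += rank[p] * teleport / N
                for q in out:                # otherwise follow a random outlink
                    nxt[q] += rank[p] * (1 - teleport) / len(out)
        rank = nxt
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

In this toy graph "c", with three inlinks, ends up with the highest rank, while "d", with none, gets only the teleport share (0.1/4 = 0.025 here).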

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

- Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u/
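Resolution itself is just a redirect service: appending the DOI to the central resolver gives the persistent link (https://doi.org/ is the current resolver host; the article's actual location behind it may change):

```python
def doi_url(doi: str) -> str:
    """Persistent link for a DOI: the resolver redirects to the current
    location recorded in the DOI metadata, which may change over time."""
    return "https://doi.org/" + doi

url = doi_url("10.1007/s10579-008-9062-z")
```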

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3–5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics — (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Revision

Review and Conclusions 6

Language and Technology

- The two things that separate humans from animals (Sproat 2010, §1.1):

  - Language
    - large vocabulary (10,000+)
    - complicated syntax (no upper length: recursion, embedding)

  - Technology
    - Widespread tool use
    - Widespread tool manufacture

- Speech — the start of language

- Writing — the first great intersection

Review and Conclusions 7

Language and the Internet

- New forms of communication

  - Neither speech nor text
  - Massively interactive

- Extremely rapid change

- A first-hand narrative (I was online before the internet :-)

  - but I am probably behind you all now

Review and Conclusions 8

New forms

- Email (from PC, phone, other)

- Chat, Usenet

- Virtual Worlds

- WWW

- Blogs (overlap)

- Facebook, LinkedIn

- Wikis

- Twitter

Review and Conclusions 9

Representing Language

- Writing Systems

- Encodings

- Speech

- Bandwidth

Review and Conclusions 10

What is represented?

- Phonemes: /maɪ dɔg laɪks ævəkɑdoz/ (45)

  - Not a simple correspondence between a writing system and the sounds
  - Some logographs (3, $)
  - Even when a new alphabet is designed, pronunciation changes

- Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

- Morphemes: maɪ = me + 's, dɔg, laɪk + s, ævəkɑdo + z (100,000+)

- Words: my dog likes avocados (200,000+)

- Concepts: speaker+poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

- Alphabetic (Latin)

  - one symbol for a consonant or vowel
  - Typically 20–30 base symbols (1 byte)

- Syllabic (Hiragana)

  - one symbol for each syllable (consonant+vowel)
  - Typically 50–100 base symbols (1–2 bytes)

- Logographic (Hanzi)

  - pictographs, ideographs, sound–meaning combinations
  - Typically 100,000+ symbols (2–3 bytes)

Review and Conclusions 12

Computational Encoding

- Need to map characters to bits

- More characters require more space

  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8–32 bits)

- Moving towards Unicode for everything

- If you get the encoding wrong, it is gibberish

  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns
    - But an encoding can be shared by many languages
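What the gibberish looks like in practice, plus a crude fallback chain (Python assumed; real detectors use statistical analysis of byte patterns, as noted above):

```python
text = "café"
raw = text.encode("utf-8")                  # b'caf\xc3\xa9'

assert raw.decode("utf-8") == "café"        # right encoding
assert raw.decode("latin-1") == "cafÃ©"     # wrong encoding: mojibake

def guess_decode(raw: bytes) -> str:
    """Try strict encodings in order; crude compared to statistical detection."""
    for enc in ("ascii", "utf-8", "latin-1"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")
```

The fallback order matters: latin-1 accepts any byte sequence, so it must come last.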

Review and Conclusions 13

Speech and Language Technology

- The need for speech representation

- Storing sound

- Transforming Speech

  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): text to sounds

- Speech technology — the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab    WER (%)   WER (%), adapted
Digits                      11       0.4        0.2
Dialogue (travel)       21,000      10.9         —
Dictation (WSJ)          5,000       3.9        3.0
Dictation (WSJ)         20,000      10.0        8.6
Dialogue (noisy, army)   3,000      42.2       31.0
Phone Conversations      4,000      41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect / Foreign Accent
  - Individual Differences: physical differences, language differences (idiolect)

- Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

- Sentence Segmentation
- Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
- Word Segmentation

  - 森山前日銀総裁 → Moriyama zen Nichigin sousai ("Moriyama, the former Bank of Japan governor")
    ✗ 森山前日銀総裁 → Moriyama zennichi gin sousai

2. Speech Synthesis

- Find the pronunciation
- Generate sounds
- Add intonation

Review and Conclusions 17

Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation

- Formant Synthesis: bypass modeling of human articulation and model the acoustics directly

- Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

- Excitement: fast, very high pitch, loud

- Hot anger: fast, high pitch, strong falling accent, loud

- Fear: jitter

- Sarcasm: prolonged accent, late peak

- Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading ≫ Speaking; Hearing ≫ Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

- Large-scale discourse studies still to be done

- Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                       Text-like
time bound                        space bound
spontaneous                       contrived
face-to-face                      visually decontextualized
loosely structured                elaborately structured
socially interactive (comments)   factually communicative
immediately revisable             repeatedly revisable
prosodically rich                 graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  - lusers (users, as seen by Systems Administrators)
  - suits

- Need to be in-group

cow orker = coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting, and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes
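What such systems track is the difference between revisions; a sketch with Python's difflib (the file name and revision labels are invented):

```python
import difflib

v1 = ["Language, Technology and the Internet\n",
      "Lecture 12\n"]
v2 = ["Language, Technology and the Internet\n",
      "Review and Conclusions\n",        # the change between revisions
      "Lecture 12\n"]

diff = list(difflib.unified_diff(v1, v2,
                                 fromfile="slides.txt@r1",
                                 tofile="slides.txt@r2"))
```

Storing diffs like this is what lets a version control system attribute each changed line to the revision (and author) that introduced it.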

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge;
  people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

- The structure of markup: visual vs. logical
  WYSIWYG, WYSIAYG, WYSIWYM

- The structure of the Web — hypertext:
  pages linked to pages

- The future of the Web

- Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages, which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow drawing conclusions about the data
  ✗ Google is missing functionality that linguists and lexicographers would like to have

- Web Sample: use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ✗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than):
  251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios
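The ratio above is the whole calculation (figures are the slide's 2012-04-04 counts):

```python
different_from = 251_000_000   # ghits for "different from"
different_than = 71_500_000    # ghits for "different than"

ratio = different_from / different_than   # ≈ 3.51, reported as 3.5:1
```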

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine them to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
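Steps 1–2 amount to sampling word combinations as queries; a sketch (the seed words and counts here are illustrative, not the pipeline's actual lists):

```python
import itertools
import random

seeds = ["language", "internet", "corpus", "speech", "writing", "technology"]

def make_queries(seeds, words_per_query=3, n_queries=5, rng_seed=0):
    """Sample distinct multi-word combinations to send to a search engine."""
    rng = random.Random(rng_seed)             # fixed seed: reproducible sample
    combos = list(itertools.combinations(seeds, words_per_query))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

queries = make_queries(seeds)
```

At full scale (500 seeds, 6,000 queries) the same idea yields enough distinct queries to cover varied topics.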

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    - Fast and convenient
    - Huge amounts of data
    ✗ unreliable counts

  - Web sample
    - Control over the sample
    - Some setup costs (semi-automated)
    ✗ Less data

- Richer data than a compiled corpus

✗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations — Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    - Diacritics
    - Character n-grams
    - Stop words

  - Similarity-based categorisation and classification
    - Character n-gram rankings

  - Machine Learning based methods

  - Context (e.g. under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text
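A crude character n-gram identifier as a sketch (the training sentences are toy examples; production systems use rank-order statistics over large profiles, per the methods above):

```python
from collections import Counter

def char_ngrams(text, n=3):
    text = f"  {text.lower()}  "              # pad so word edges form n-grams
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap(a, b):
    """Shared n-gram mass; a stand-in for fancier profile distances."""
    return sum(min(a[g], b[g]) for g in a)

profiles = {
    "en": char_ngrams("the cat sat on the mat with the other cats"),
    "de": char_ngrams("die katze sitzt mit den anderen katzen auf der matte"),
}

def identify(snippet):
    grams = char_ngrams(snippet)
    return max(profiles, key=lambda lang: overlap(grams, profiles[lang]))
```

With realistic profiles the same scheme degrades exactly where the slide says: short snippets, closely related languages, and mixed-language text.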

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel
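One of these steps as a sketch: mapping the money variants above to a single integer (the regex and suffix table are invented for the example; real normalizers cover far more patterns):

```python
import re

SUFFIX = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

def normalize_amount(text):
    """Map '$700K' / '$700,000' / '0.7m' style strings to a plain integer."""
    m = re.fullmatch(r"\$?([\d.,]+)\s*([kmb])?", text.strip().lower())
    if not m:
        return None                       # not a recognized amount
    value = float(m.group(1).replace(",", "")) * SUFFIX.get(m.group(2), 1)
    return int(round(value))
```

After normalization, the three variant spellings count as the same value in frequency counts.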

Review and Conclusions 54


atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 8: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Language and Technology

ã The two things that separate humans from animals (Sproat 2010, §1.1)

  â Language
    ∗ large vocabulary (10,000+)
    ∗ complicated syntax (no upper length limit, recursion, embedding)

  â Technology
    ∗ widespread tool use
    ∗ widespread tool manufacture

ã Speech – the start of language

ã Writing – the first great intersection

Review and Conclusions 7

Language and the Internet

ã New forms of communication

  â Neither speech nor text
  â Massively interactive

ã Extremely rapid change

ã A first-hand narrative (I was online before the internet :-)

  â … but I am probably behind you all now

Review and Conclusions 8

New forms

ã Email (from PC, phone, other)

ã Chat, Usenet

ã Virtual Worlds

ã WWW

ã Blogs (overlap)

ã Facebook, LinkedIn

ã Wikis

ã Twitter

Review and Conclusions 9

Representing Language

ã Writing Systems

ã Encodings

ã Speech

ã Bandwidth

Review and Conclusions 10

What is represented

ã Phonemes: maɪ dɔg laɪks ævəkɑdoz (45)

  â Not a simple correspondence between a writing system and the sounds
  â Some logographs (3, $)
  â Even when a new alphabet is designed, pronunciation changes

ã Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

ã Morphemes: maɪ (me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)

ã Words: my dog likes avocados (200,000+)

ã Concepts: speaker poss, canine, fond, avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

ã Alphabetic (Latin)

  â one symbol for a consonant or vowel
  â typically 20–30 base symbols (1 byte)

ã Syllabic (Hiragana)

  â one symbol for each syllable (consonant+vowel)
  â typically 50–100 base symbols (1–2 bytes)

ã Logographic (Hanzi)

  â pictographs, ideographs, sound–meaning combinations
  â typically 100,000+ symbols (2–3 bytes)

Review and Conclusions 12

Computational Encoding

ã Need to map characters to bits

ã More characters require more space

  â English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); everything: UTF-8 (8–32 bits)

ã Moving towards Unicode for everything

ã If you get the encoding wrong, it is gibberish

  â Web pages should state their encoding
  â Sometimes they are wrong
  â Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
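The last sub-point can be made concrete with a minimal sketch (a real detector such as chardet scores byte-pattern statistics across many encodings; this toy version only separates valid UTF-8 from Latin-1):

```python
def guess_encoding(data: bytes) -> str:
    # Toy heuristic: UTF-8 has strict byte-sequence rules, so a successful
    # strict decode is strong evidence; otherwise fall back to Latin-1,
    # which accepts any byte sequence at all.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"

print(guess_encoding("café".encode("utf-8")))       # utf-8
print(guess_encoding("café".encode("iso-8859-1")))  # iso-8859-1
```

Note that a correct guess of the encoding still says nothing about the language: the Latin-1 bytes above could equally be French, German, or Spanish.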

Review and Conclusions 13

Speech and Language Technology

ã The need for speech representation

ã Storing sound

ã Transforming Speech

  â Automatic Speech Recognition (ASR): sounds to text
  â Text-to-Speech Synthesis (TTS): text to sounds

ã Speech technology – the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab   WER (%)   WER (%), adapted
Digits                      11     0.4        0.2
Dialogue (travel)       21,000    10.9        –
Dictation (WSJ)          5,000     3.9        3.0
Dictation (WSJ)         20,000    10.0        8.6
Dialogue (noisy, army)   3,000    42.2       31.0
Phone Conversations      4,000    41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

ã Pronunciation depends on context

  â The same word will be pronounced differently in different sentences

ã Speaker variability

  â Gender
  â Dialect / foreign accent
  â Individual differences: physical differences, language differences (idiolect)

ã Many, many rare events

  â 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

  ã Sentence segmentation
  ã Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
  ã Word segmentation

    â 森山前日銀総裁 Moriyama zen Nichigin sousai
    ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai

2. Speech Synthesis

  ã Find the pronunciation
  ã Generate sounds
  ã Add intonation
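A minimal, hypothetical splitter shows why the "Nanyang Dr." example is hard: an abbreviation list alone cannot tell Dr. = Doctor from Dr. = Drive:

```python
import re

ABBREV = {"dr", "mr", "mrs", "prof"}  # hypothetical abbreviation list

def split_sentences(text: str) -> list:
    out, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        words = text[start:m.start()].split()
        last_word = words[-1].rstrip(".").lower() if words else ""
        if last_word in ABBREV:
            continue  # assume the period marks an abbreviation, not a boundary
        out.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        out.append(text[start:].strip())
    return out

print(split_sentences("I came. I saw. I conquered."))
# The ambiguous case: "Nanyang Dr." is a road name, but the list says
# abbreviation, so the two sentences are wrongly joined into one.
print(split_sentences("Dr. Smith lives on Nanyang Dr. He is here."))
```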

Review and Conclusions 17

Speech Synthesis

ã Articulatory Synthesis: attempt to model human articulation

ã Formant Synthesis: bypass modeling of human articulation and model acoustics directly

ã Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

ã Excitement: fast, very high pitch, loud

ã Hot anger: fast, high pitch, strong falling accent, loud

ã Fear: jitter

ã Sarcasm: prolonged accent, late peak

ã Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).

– Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

ã Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

ã All share some characteristics of speech and text

ã Usage norms not fixed

ã Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

ã Large-scale discourse studies still to be done

ã Some genuinely new things

  â time-lagged multi-person conversation
  â raw, un-edited text
  â extreme multi-authorship

Review and Conclusions 23

Email

Speech-like               Text-like
time bound                space bound (deletable)
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

ã Inspired by shared background

  â <rant>I can't stand this</rant>
  â I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  â lusers (users, as seen by Systems Administrators)
  â suits

ã Need to be in-group

  cow orker: Coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny
  clues if you smeared your body with clue musk and did the clue mating dance."
  – Edward Flaherty (talk.bizarre)

Review and Conclusions 28

ã Gricean Principles

  â Posting (top/middle/bottom quoting and trimming)

ã Inspired by medium limitations

  â GREAT, great, great
  â :-) (^_^) (o)
  â brb, RTFM, IMHO
  â lower case
  â lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

ã Version Control Systems

ã Wikipedia

ã Licensing and Ownership

30

Version Control Systems

ã Versioning file systems

  â every time a file is opened, a new copy is stored

ã CVS, Subversion, Git

  â changes to a collection of files are tracked
  â simultaneous changes are merged

ã Revision Tracking

  â Revisions are stored within a file

ã Authorship in shared writing: explicit responsibility for changes
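Revision tracking can be sketched with the standard library: a unified diff is the kind of change record a version control system stores, merges, and attributes to an author (the filenames here are placeholders):

```python
import difflib

v1 = "The cat sat.\nIt was happy.\n".splitlines(keepends=True)
v2 = "The cat sat down.\nIt was happy.\n".splitlines(keepends=True)

# A unified diff records only what changed between two revisions,
# so the full history can be reconstructed from the deltas.
diff = list(difflib.unified_diff(v1, v2, fromfile="v1", tofile="v2"))
print("".join(diff))
```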

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge; people like to do this

ã Most edits are done by insiders

ã Most content is added by outsiders

ã Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  ã Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

ã The structure of markup: visual vs logical (WYSIWYG, WYSIAYG, WYSIWYM)

ã The structure of the Web – hypertext: pages linked to pages

ã The future of the Web

ã Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

  â What you see is what you get (WYSIWYG)
  â Equivalent of printers' markup
  â Shows what things look like

ã Logical Markup (Structural)

  â Shows the structure and meaning
  â Can be mapped to visual markup
  â Less flexible than visual markup
  â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  â Population and exact hit counts are unknown → no statistics possible
  â Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

ã Web Sample: use a search engine to download data from the net and build a corpus from it

  â known size and exact hit counts → statistics possible
  â people can draw conclusions over the included text types
  â (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:

  â lots of repetitions of the same text (not representative)
  â very limited query precision (no upper/lower case, no punctuation)
  â only estimated counts, often hard to reproduce exactly
  â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios
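The different from vs different than comparison works out as:

```python
# Ghit counts from the slide (accessed 2012-04-04); the absolute numbers
# drift over time, but the ratio is the stable, reportable quantity.
different_from = 251_000_000
different_than = 71_500_000
ratio = different_from / different_than
print(f"{ratio:.1f}:1")  # 3.5:1
```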

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  â gather documents for different topics (balance)
  â exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
  â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.), WaCky! Working papers on the Web as Corpus. Bologna, 2006
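Step 2 of the outline can be sketched as follows; the six seed words here are hypothetical stand-ins for the 500 used in the real pipeline:

```python
import itertools

seeds = ["language", "internet", "corpus", "speech", "writing", "technology"]

# Combine seed words into 3-word search-engine queries (step 2 of the
# outline); with 500 real seeds, sampling such tuples easily yields the
# ~6,000 queries the pipeline needs.
queries = [" ".join(combo) for combo in itertools.combinations(seeds, 3)]
print(len(queries), "queries, e.g.", repr(queries[0]))
```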

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

ã The web can be used as a corpus

  â Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  â Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

  â Keywords and Categories
  â Rankings
  â Structural Markup

ã Implicit Meta-data

  â Links and Citations
  â Tags
  â Tables
  â File Names
  â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

  â When they are accurate, they are very good
  â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

  â as they are unintended, they tend to be noisy
  â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations – bracketed glosses, cross-lingual disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  â Linguistically grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words

  â Similarity-based categorisation and classification
    ∗ Character n-gram rankings

  â Machine-learning based methods

  â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
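A character-bigram sketch of language identification; the two training strings are tiny hypothetical stand-ins for real corpora, which is exactly why short snippets and closely related languages are hard:

```python
from collections import Counter

def profile(text, n=2):
    # Pad with spaces so word-initial and word-final characters count too.
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical training samples; real systems train on large corpora.
train = {"en": "the cat sat on the mat with the hat",
         "de": "der hund und die katze und das haus"}
models = {lang: profile(t) for lang, t in train.items()}

def identify(snippet):
    p = profile(snippet)
    # Score = shared bigram mass between snippet and each language model.
    return max(models, key=lambda l: sum(min(p[g], models[l][g]) for g in p))

print(identify("the rat"))      # en
print(identify("der die das"))  # de
```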

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number normalization: $700K, $700,000, 0.7 million dollars, …

ã Date normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel
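A toy normalizer in this spirit (the suffix list is a hypothetical stand-in for a real stemmer such as Porter's):

```python
import re

def normalize(token: str) -> str:
    token = token.lower()
    token = re.sub(r"[$,]", "", token)       # crude number/currency cleanup
    for suffix in ("ers", "er", "es", "s"):  # toy suffix stripping, not Porter
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Stemming conflates "produces" and "producer" to the same (non-word) stem;
# "produce" itself is left alone.
print([normalize(t) for t in ["produces", "producer", "Produce"]])
```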

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

  â provides a common data representation framework
  â makes it possible to integrate multiple sources
  â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6/color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

  â is a markup language, much like HTML
  â was designed to carry data, not to display data
  â tags are not predefined: you must define your own tags
  â is designed to be self-descriptive
  â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

  â Uniform Resource Name: urn:isbn:1575864606
  â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework (RDF)

  â Triples of <subject, predicate, object>
  â Each element is a URI
  â RDF is written in well-defined XML
  â You can say anything about anything

ã You can build relations in ontologies (OWL)

  â Then reason over them, search them, …
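The triple model can be sketched with plain tuples and hypothetical example URIs (real RDF would serialize these in XML or Turtle, and a triple store would index them):

```python
# RDF-style <subject, predicate, object> triples; all URIs here are
# hypothetical examples, not real identifiers.
triples = [
    ("http://example.org/fcbond", "http://xmlns.com/foaf/0.1/name",
     "Francis Bond"),
    ("http://example.org/hg2052", "http://purl.org/dc/terms/creator",
     "http://example.org/fcbond"),
]

def objects(s, p):
    """All objects o such that the triple (s, p, o) is asserted."""
    return [o for (s2, p2, o) in triples if s2 == s and p2 == p]

# Chaining two lookups: who created hg2052, and what is their name?
creator = objects("http://example.org/hg2052",
                  "http://purl.org/dc/terms/creator")[0]
name = objects(creator, "http://xmlns.com/foaf/0.1/name")[0]
print(name)  # Francis Bond
```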

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

  â NLP can infer structure
  â NLP makes the Semantic Web feasible
  â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

  â Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  â Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

  â Total number of citations (pretty useful)
  â Total number of citations minus self-citations
  â Total number of (citations / number of authors)
  â Average (citation impact factor / number of authors)

ã Problems:

  â Not all citations are equal: citations by 'good' papers are better
  â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the paper
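The first three scores can be computed over hypothetical toy data (author names and citation lists are invented for illustration):

```python
# Each paper: its author set, and the author set of each paper citing it.
papers = [
    {"authors": {"bond"}, "cited_by": [{"smith"}, {"bond"}, {"lee", "tan"}]},
    {"authors": {"bond", "tan"}, "cited_by": [{"smith"}]},
]

# Total number of citations.
total = sum(len(p["cited_by"]) for p in papers)
# Citations minus self-citations (any overlap of citing and cited authors).
minus_self = sum(1 for p in papers for c in p["cited_by"]
                 if not (c & p["authors"]))
# Citations divided by number of authors, summed over papers.
per_author = sum(len(p["cited_by"]) / len(p["authors"]) for p in papers)

print(total, minus_self, per_author)  # 4 3 3.5
```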

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

  â Break research into small chunks to increase the number of citations
  â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

  â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

  â A high inlink count does not necessarily mean high quality …
  â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

  â An article's vote is weighted according to its citation impact
  â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

  â Start at a random page
  â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

  â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
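The teleporting random walk above can be sketched as power iteration over a hypothetical four-page graph (page 3 is a dead end):

```python
N = 4
out = {0: [1, 2], 1: [2], 2: [0], 3: []}  # page 3 is a dead end
t = 0.10                                  # teleportation rate

rank = [1.0 / N] * N                      # start from a uniform distribution
for _ in range(100):
    new = [0.0] * N
    for page, r in enumerate(rank):
        if not out[page]:                 # dead end: jump anywhere, 1/N each
            for q in range(N):
                new[q] += r / N
        else:                             # teleport with prob t, else follow a link
            for q in range(N):
                new[q] += r * t / N
            for q in out[page]:
                new[q] += r * (1 - t) / len(out[page])
    rank = new

print([round(r, 3) for r in rank])        # long-term visit rates; they sum to 1
```

The dead-end page ends up with the lowest rank: it receives only teleport mass, never an endorsement via a hyperlink.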

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  â Google does not index the target of a link marked nofollow
  â Yahoo does not include the link in its ranking
  â …

74

Current Status

ã There is a continuous battle between:

  â Search companies, who want to get the most useful page to the user
  â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

  â Metadata about the object is stored with the DOI name
  â The metadata includes a location, such as a URL
  â The DOI for a document is permanent; the metadata may change
  â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  â DOI: 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

  â Passively, for information access
  â Actively, for collaboration

ã The internet is interesting in and of itself

  â As a source of data about existing language
  â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics – (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 9: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Language and the Internet

atilde New forms of communication

acirc Neither speech nor textacirc Massively interactive

atilde Extremely rapid change

atilde A first hand narrative (I was online before the internet -)

acirc but I am probably behind you all now

Review and Conclusions 8

New forms

atilde Email (from PC phone other)

atilde Chat Usenet

atilde Virtual Worlds

atilde WWW

atilde Blogs (overlap)

atilde Facebook LinkedIn

atilde Wikis

atilde Twitter

Review and Conclusions 9

Representing Languageatilde Writing Systems

atilde Encodings

atilde Speech

atilde Bandwidth

Review and Conclusions 10

What is represented

atilde Phonemes ma d laks aeligvkadoz (45)

acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes

atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)

atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)

atilde Words my dog likes avocados (200000+)

atilde Concepts speaker poss canine fond avacado+PL (400000+)

Review and Conclusions 11

Three Major Writing Systems

atilde Alphabetic (Latin)

acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)

atilde Syllabic (Hiragana)

acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)

atilde Logographic (Hanzi)

acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab   WER (%)   WER (%) adapted
Digits                      11     0.4        0.2
Dialogue (travel)       21,000    10.9        n/a
Dictation (WSJ)          5,000     3.9        3.0
Dictation (WSJ)         20,000    10.0        8.6
Dialogue (noisy, army)   3,000    42.2       31.0
Phone Conversations      4,000    41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

• Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

• Speaker variability

  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)

• Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur
    only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

   • Sentence segmentation
   • Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
   • Word segmentation

     - 森山前日銀総裁 Moriyama, zen Nichigin sousai
       (Moriyama, the former Bank of Japan governor)
       ⊗ mis-segmented: Moriyama zennichi gin sousai

2. Speech Synthesis

   • Find the pronunciation
   • Generate sounds
   • Add intonation

Review and Conclusions 17

Speech Synthesis

• Articulatory Synthesis: attempt to model human articulation

• Formant Synthesis: bypass modeling of human articulation and model acoustics
  directly

• Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sad: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality'
but natural-sounding prosody (intonation and duration)."

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English, computer science students, various studies)

Reading   300    200 (proof reading)
Writing    31     21 (composing)
Speaking  150
Hearing   150    210 (speeded up)
Typing     33     19 (composing)

• Reading >> Speaking, Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits

• Need to be in-group

  cow orker: coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny
  clues if you smeared your body with clue musk and did the clue mating dance."
  Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

  - every time a file is opened, a new copy is stored

• CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

• Revision tracking

  - revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every
  single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge; people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

• The structure of markup: visual vs logical
  (WYSIWYG, WYSIAYG, WYSIWYM)

• The structure of the Web: hypertext, pages linked to pages

• The future of the Web

• Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
or accessed only through a form

Unlinked content: pages which are not linked to by other pages, so link-following
crawlers cannot reach them

Private Web: sites that require registration and login (e.g. edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g.
ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g.
using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript,
as well as content dynamically downloaded from Web servers via Flash or Ajax
solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video)
files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

• Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow one to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like
    to have

• Web Sample: use a search engine to download data from the net and build a
  corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions about the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies
  (Keller 2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in
  the early 2000s)

  - it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, "Ghits per billion documents", is good if you want a more stable number
  (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
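The ratio-of-two-phenomena recipe is simple enough to sketch; the counts below are the slide's 2012 figures, and ghits_per_billion assumes you know the index size, which Google no longer publishes:

```python
def hit_ratio(hits_a: int, hits_b: int) -> float:
    """Unitless ratio of two web counts (e.g. competing phrases)."""
    return hits_a / hits_b

def ghits_per_billion(hits: int, index_size: int) -> float:
    """Liberman-style GPB: hits normalised per billion indexed documents."""
    return hits / index_size * 1_000_000_000

# "different from" vs "different than", counts accessed 2012-04-04
ratio = hit_ratio(251_000_000, 71_500_000)   # about 3.5:1
```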

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated
  corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here:
    taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long,
    word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web
as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
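Steps 1-2 of the outline above can be sketched as follows; the seed words and query size are toy stand-ins (Sharoff used on the order of 500 seeds):

```python
import itertools
import random

# Toy seed list; a real run would use ~500 frequent words per language.
seeds = ["house", "water", "small", "world", "night", "music"]
random.seed(0)  # reproducible toy example

def make_queries(seeds, words_per_query=4, n_queries=10):
    """Form search-engine queries from random seed-word combinations."""
    combos = list(itertools.combinations(seeds, words_per_query))
    random.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

queries = make_queries(seeds)
```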

Internet Corpora Summary

• The web can be used as a corpus

  - Direct access
    ∗ fast and convenient
    ∗ huge amounts of data
    ⊗ unreliable counts

  - Web sample
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ⊗ less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit meta-data

  - Keywords and categories
  - Rankings
  - Structural markup

• Implicit meta-data

  - Links and citations
  - Tags
  - Tables
  - File names
  - Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and tags

• Rankings

• Links and citations

• Structural markup

50

Implicit Metadata

• You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations: bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it
  (metadata may be unreliable)

  - Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  - Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  - Machine-learning-based methods
  - Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text
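The character-n-gram-ranking idea can be sketched in a few lines (Cavnar & Trenkle-style rank profiles); the two "training corpora" here are single made-up sentences, so this only illustrates the mechanics:

```python
from collections import Counter

def profile(text, n=2, top=50):
    """Ranked list of the most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(p1, p2):
    """Sum of rank differences; unseen n-grams get a maximum penalty."""
    return sum(abs(i - p2.index(g)) if g in p2 else len(p2)
               for i, g in enumerate(p1))

training = {  # toy single-sentence "corpora"
    "en": "the quick brown fox jumps over the lazy dog and then the cat",
    "nl": "de snelle bruine vos springt over de luie hond en dan de kat",
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(text):
    """Pick the language whose profile is closest to the text's profile."""
    return min(profiles, key=lambda lang: distance(profile(text), profiles[lang]))
```

With tiny training data this fails exactly where the slide says: short snippets, similar languages, and mixed text.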

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel
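A couple of the steps above, sketched with toy rules (this is not the Porter stemmer, just a suffix-stripping stand-in, and the number pattern only handles $NK amounts):

```python
import re

def toy_stem(word: str) -> str:
    """Strip a few English suffixes, Porter-style in spirit only."""
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize_number(text: str) -> str:
    """Rewrite $700K-style amounts as plain dollar figures."""
    return re.sub(r"\$(\d+)K\b", lambda m: "$" + m.group(1) + ",000", text)
```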

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

• You can build relations into ontologies (OWL)

  - Then reason over them, search them, …
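The triple model is easy to sketch in code; the subject and object URIs below are invented examples (only the Dublin Core predicate names are real):

```python
# RDF-style triples as <subject, predicate, object> tuples of URIs/values,
# plus a naive pattern query over them.
triples = {
    ("http://example.org/HG2052",
     "http://purl.org/dc/terms/creator",
     "http://example.org/FrancisBond"),
    ("http://example.org/HG2052",
     "http://purl.org/dc/terms/title",
     "Language, Technology and the Internet"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given (possibly None) pattern."""
    return [(s, p, o) for (s, p, o) in triples
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)]
```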

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  - Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  - Context-based: citation analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact
  of the published work of a scientist, scholar or research group

• Some scores are:

  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total number of (citations / number of authors)
  - Average (citation impact factor / number of authors)

• Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
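The first few scores above, computed over made-up paper records:

```python
# Toy records: (number of authors, citations received, self-citations).
papers = [
    {"authors": 2, "citations": 40, "self_citations": 5},
    {"authors": 4, "citations": 12, "self_citations": 2},
]

def total_citations(papers):
    return sum(p["citations"] for p in papers)

def total_minus_self(papers):
    return sum(p["citations"] - p["self_citations"] for p in papers)

def author_normalized(papers):
    """Each paper's citations shared out among its authors."""
    return sum(p["citations"] / p["authors"] for p in papers)
```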

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve.

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC (Strongly Connected Core): you can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

  <a href="http://path.to.there.page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the
     creator of page A.

  2. The (extended) anchor text pointing to page B is a good description of
     page B. This is not always the case; for instance, most corporate websites
     have a pointer from every page to a page containing a copyright notice.
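Harvesting anchor text is straightforward with the standard library; a minimal sketch (the URL is the slide's made-up example):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorCollector()
parser.feed('<a href="http://path.to.there.page/HG803">HG803 Course Page</a>')
```

In link analysis, the collected text is attributed to the link's *target*, giving each page a description written by other people.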

Review and Conclusions 67

PageRank as Citation Analysis

• Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote (not very accurate)

• On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random Walk

• Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page,
    equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each
  with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with
    probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
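The teleporting walk above converges to the PageRank vector; here is a sketch by power iteration over an invented four-page graph, with the 10% teleportation rate and the uniform jump from dead ends:

```python
links = {  # page -> pages it links to (toy graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],        # dead end
}

def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration: repeatedly redistribute probability mass."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump uniformly
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport share
                    new[q] += rank[p] * teleport / n
                for q in out:                # follow a random link
                    new[q] += rank[p] * (1 - teleport) / len(out)
        rank = new
    return rank

ranks = pagerank(links)
```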

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link spam: adding links between pages for reasons other than merit. Link spam
  takes advantage of link-based ranking algorithms, which give websites higher
  rankings the more other highly-ranked websites link to them. Examples include
  adding links within blogs.

• Link farms: creating tightly-knit communities of pages referencing each other,
  also known humorously as mutual admiration societies.

• Scraper sites: "scrape" search-engine results pages or other sources of content
  and create "content" for a website. The specific presentation of content on
  these sites is unique, but is merely an amalgamation of content taken from
  other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user
  editing, such as wikis, blogs and guestbooks. Agents can be written that
  automatically and randomly select a user-edited web page, such as a Wikipedia
  article, and add spamming links.

  The nofollow link: a value that can be assigned to the rel attribute of an
  HTML hyperlink, to instruct some search engines that the hyperlink should not
  influence the link target's ranking in the search engine's index.

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74

Current Status

• There is a continuous battle between:

  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a persistent identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies,
  coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some
  4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
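Resolution itself is just URL prefixing: the public resolver at doi.org forwards a DOI to whatever location is currently stored in its metadata (a sketch):

```python
def doi_to_url(doi: str) -> str:
    """Build a resolver URL; doi.org redirects to the current location."""
    return "https://doi.org/" + doi

url = doi_to_url("10.1007/s10579-008-9062-z")
```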

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

• The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language Technology and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An
  Introduction to Natural Language Processing, Computational Linguistics and
  Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural
  Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to
  Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python;
  introduces both programming and linguistics

• HG3051 Corpus Linguistics (pre-requisite HG2051, waived): an introduction
  to the fast-growing field of corpus linguistics

Review and Conclusions 81


New forms

• Email (from PC, phone, other)

• Chat, Usenet

• Virtual worlds

• WWW

• Blogs (overlap)

• Facebook, LinkedIn

• Wikis

• Twitter

Review and Conclusions 9

Representing Language

• Writing systems

• Encodings

• Speech

• Bandwidth

Review and Conclusions 10

What is represented

atilde Phonemes ma d laks aeligvkadoz (45)

acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes

atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)

atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)

atilde Words my dog likes avocados (200000+)

atilde Concepts speaker poss canine fond avacado+PL (400000+)

Review and Conclusions 11

Three Major Writing Systems

atilde Alphabetic (Latin)

acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)

atilde Syllabic (Hiragana)

acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)

atilde Logographic (Hanzi)

acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)(English computer science students various studies)

Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)

atilde Reading gtgt SpeakingHearing gtgt Typing

rArr Speech for inputrArr Text for output

Review and Conclusions 20

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia: Five pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia: Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWYM)

• The structure of the Web — hypertext: pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

• Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: Search Engine as Query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
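The arithmetic behind these units is simple enough to sketch directly. The counts below are the ones quoted above for different from vs different than; the index size used for GPB is an invented illustration, since Google no longer publishes one:

```python
# Comparing two web counts as a unitless ratio, and as GPB.
# The 40-billion-document index size is made up for illustration.

def ratio(count_a, count_b):
    """Unitless ratio between two web counts (a : 1)."""
    return count_a / count_b

def gpb(count, index_size):
    """Ghits per billion documents, given an (estimated) index size."""
    return count * 1_000_000_000 / index_size

different_from = 251_000_000  # ghits, accessed 2012-04-04
different_than = 71_500_000   # ghits, accessed 2012-04-04

print(f"{ratio(different_from, different_than):.1f} : 1")   # 3.5 : 1
print(gpb(different_from, 40_000_000_000))
```

The ratio is stable across index growth in a way the raw counts are not, which is exactly why the slide recommends it.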

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)
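Step 2 of this pipeline can be sketched as follows. The seed list, query length, and query count are toy stand-ins for the 500 seeds and 6,000 queries above:

```python
# Sketch: combining seed words into distinct search-engine queries.
# Real pipelines (e.g. Sharoff 2006) use hundreds of seeds per language.
import random

def make_queries(seeds, words_per_query=4, n_queries=10, seed=0):
    """Randomly combine seed words into distinct queries."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    queries = set()
    while len(queries) < n_queries:
        words = rng.sample(seeds, words_per_query)
        queries.add(" ".join(sorted(words)))
    return sorted(queries)

seeds = ["house", "water", "friend", "morning",
         "letter", "music", "road", "window"]
for q in make_queries(seeds, n_queries=5):
    print(q)
```

Mid-frequency content words are usually chosen as seeds, so that each query pulls back ordinary running text rather than lists or spam.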

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

  - Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts

  - Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

• Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations — bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words

  - Similarity-based categorisation and classification
    ∗ Character n-gram rankings

  - Machine-learning based methods

  - Context (under .jp or .kr)

• Hard to do for short text snippets, similar languages, mixed text
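The similarity-based approach can be sketched with character n-gram frequency profiles, in the style of Cavnar and Trenkle's rank-order method. The two one-sentence "training texts" here are far too small for real use and exist only to make the example self-contained:

```python
# Minimal sketch of character n-gram language identification
# (rank-order profiles, after Cavnar & Trenkle 1994).
from collections import Counter

def profile(text, n=3, size=300):
    """Ranked list of the most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    """Distance: sum of rank differences; missing n-grams get max penalty."""
    return sum(
        abs(i - lang_profile.index(g)) if g in lang_profile else len(lang_profile)
        for i, g in enumerate(doc_profile))

def identify(text, lang_profiles):
    doc = profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang]))

langs = {
    "en": profile("the cat sat on the mat and the dog slept by the door"),
    "nl": profile("de kat zat op de mat en de hond sliep bij de deur"),
}
print(identify("the dog and the cat", langs))
```

With realistic training data (thousands of words per language), this simple method is accurate for whole pages but, as the slide notes, degrades on short snippets and closely related languages.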

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
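A few of these steps can be combined into a toy pipeline. The stop list, the single number-rewriting rule, and the three-suffix "stemmer" below are deliberately minimal illustrations, not a real stemmer such as Porter's:

```python
# Toy normalization pipeline: number rewriting, stop-word stripping,
# and crude suffix stemming. All rules here are illustrative only.
import re

STOP_WORDS = {"the", "a", "an", "of"}

def normalize_number(token):
    """Rewrite e.g. '$700K' as '$700000'."""
    m = re.fullmatch(r"\$(\d+)K", token)
    return f"${int(m.group(1)) * 1000}" if m else token

def stem(token):
    """Crude stemming: strip a few common suffixes (producer -> produc)."""
    for suffix in ("es", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def normalize(text):
    tokens = [normalize_number(t) for t in text.split()]
    return [stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]

print(normalize("The producer produces $700K of avocados"))
```

Note how stemming conflates producer and produces to the same non-word produc: good for recall in retrieval, bad if you need real lemmas.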

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

• Can be validated for syntactic well-formedness
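Well-formedness checking needs nothing beyond the standard library: a document either parses or it raises an error. The sample documents and tag names are made up for illustration:

```python
# Checking XML syntactic well-formedness with the standard library.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses as XML; False on any syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<note><to>Kim</to><body>Hello</body></note>"
bad = "<note><to>Kim</body></note>"   # mismatched tags

print(is_well_formed(good), is_well_formed(bad))   # True False
```

Well-formedness (balanced, properly nested tags) is a purely syntactic check; validity against a schema or DTD is a separate, stronger test.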

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

• You can build relations into ontologies (OWL)

  - Then reason over them, search them, …
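The triple model is easy to sketch in a few lines. The URIs below use an invented ex: prefix purely for illustration; real RDF stores are queried with SPARQL rather than Python pattern matching, but the idea is the same:

```python
# Sketch of RDF-style triples and a wildcard pattern query over them.
# All URIs here are invented examples, not real vocabularies.
triples = {
    ("ex:fcbond", "ex:teaches", "ex:HG2052"),
    ("ex:HG2052", "ex:title", "Language, Technology and the Internet"),
    ("ex:fcbond", "ex:memberOf", "ex:NTU"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return {
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }

# Everything we know about ex:fcbond:
for t in sorted(query(subject="ex:fcbond")):
    print(t)
```

Because every statement is a uniform triple, new sources can be merged by simple set union, which is the "web of data" integration story in miniature.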

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  - Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total number of (citations / number of authors)
  - Average (citation impact factor / number of authors)

• Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
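The first three scores are straightforward to compute once you have a citation graph. The papers and authors below are made up for illustration:

```python
# Sketch of simple citation scores over a toy citation graph.
# Each (invented) paper records its authors and the papers it cites.
papers = {
    "p1": {"authors": ["ann", "bob"], "cites": ["p2", "p3"]},
    "p2": {"authors": ["ann"],        "cites": ["p3"]},
    "p3": {"authors": ["cho"],        "cites": []},
}

def citations(paper):
    """Papers that cite the given paper."""
    return [p for p, d in papers.items() if paper in d["cites"]]

def total_citations(paper):
    return len(citations(paper))

def citations_minus_self(paper):
    """Drop citations from papers sharing an author with the cited paper."""
    authors = set(papers[paper]["authors"])
    return sum(1 for p in citations(paper)
               if not authors & set(papers[p]["authors"]))

def per_author(paper):
    return total_citations(paper) / len(papers[paper]["authors"])

print(total_citations("p3"), citations_minus_self("p2"), per_author("p3"))
```

Note how p2's only citation vanishes once self-citations are excluded (it comes from a co-author), illustrating why the different scores can disagree.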

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve.

Review and Conclusions 65

Characteristics of the Web: the Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

  <a href="http://path.to.the.page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B

  This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.
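Collecting anchor text per link target needs only the standard-library HTML parser. The page snippet and URL are invented for illustration:

```python
# Sketch: collect anchor text per link target with html.parser.
from html.parser import HTMLParser
from collections import defaultdict

class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)   # target URL -> anchor texts
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors[self._href].append("".join(self._text).strip())
            self._href = None

page = '<p>See the <a href="/hg803">HG803 Course Page</a> for details.</p>'
collector = AnchorCollector()
collector.feed(page)
print(dict(collector.anchors))
```

A search engine aggregates such anchor texts from many pages, so page B ends up indexed under words B itself may never contain.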

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote — not very accurate

• On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
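The random walk with teleportation can be computed by power iteration: repeatedly redistribute each page's probability mass according to the rules above until the rates settle. The four-page link graph and the 10% teleport rate below are illustrative:

```python
# Power-iteration sketch of PageRank with teleportation, on a toy web.
def pagerank(links, teleport=0.10, iterations=100):
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}          # start uniformly
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                        # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                # teleport share
                    new[q] += teleport * rank[p] / n
                for q in out:                  # follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

The visit rates always sum to 1, page c (with two inlinks) ends up ranked above b, and the unlinked dead end d survives only on teleport traffic.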

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph: Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly-ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

• Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically, randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

  - Google does not index the target of a link marked nofollow
  - Yahoo! does not include the link in its ranking
  - …

74

Current Status

• There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

• The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics — (pre-requisite HG2051 waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Representing Language

• Writing Systems

• Encodings

• Speech

• Bandwidth

Review and Conclusions 10

What is represented

• Phonemes: maɪ dɔg laɪks ævəkɑdoz (≈45)

  - Not a simple correspondence between a writing system and the sounds
  - Some logographs (3, $)
  - Even when a new alphabet is designed, pronunciation changes

• Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

• Morphemes: maɪ (= me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)

• Words: my dog likes avocados (200,000+)

• Concepts: speaker-poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

• Alphabetic (Latin)

  - one symbol for a consonant or vowel
  - typically 20–30 base symbols (1 byte)

• Syllabic (Hiragana)

  - one symbol for each syllable (consonant+vowel)
  - typically 50–100 base symbols (1–2 bytes)

• Logographic (Hanzi)

  - pictographs, ideographs, sound–meaning combinations
  - typically 100,000+ symbols (2–3 bytes)
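The byte counts above refer to the dedicated legacy encodings for each script; in Unicode's UTF-8 the same spread is visible directly, though the exact figures differ (kana take three bytes rather than one or two):

```python
# How many bytes per base symbol each writing-system type needs in UTF-8.
for char in ("a", "か", "漢"):   # Latin letter, hiragana syllable, hanzi
    print(char, len(char.encode("utf-8")), "bytes")
```

This is why a Japanese or Chinese text is routinely two to three times larger, byte for byte, than an English text of the same character count.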

Review and Conclusions 12

Computational Encoding

• Need to map characters to bits

• More characters require more space

  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8–32 bits)

• Moving towards Unicode for everything

• If you get the encoding wrong, it is gibberish

  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
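The "gibberish" failure mode (mojibake) is easy to reproduce: decode the same bytes under the wrong encoding and every multi-byte character falls apart:

```python
# Decoding bytes under the wrong encoding produces mojibake.
text = "café"
data = text.encode("utf-8")     # b'caf\xc3\xa9'

print(data.decode("utf-8"))     # café      -- correct
print(data.decode("latin-1"))   # cafÃ©     -- wrong encoding, garbled
```

Each two-byte UTF-8 sequence becomes two spurious Latin-1 characters, which is exactly the pattern that statistical encoding detectors look for.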

Review and Conclusions 13

Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming Speech

  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): text to sounds

• Speech technology — the Telephone

Review and Conclusions 14

How good are the systems

Task                     Vocab    WER (%)   WER (%) adapted
Digits                      11       0.4        0.2
Dialogue (travel)       21,000      10.9        —
Dictation (WSJ)          5,000       3.9        3.0
Dictation (WSJ)         20,000      10.0        8.6
Dialogue (noisy, army)   3,000      42.2       31.0
Phone Conversations      4,000      41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)
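The word error rate (WER) reported in such tables is the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the reference length. The example sentences are invented:

```python
# Word error rate: Levenshtein distance over words / reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("my dog likes avocados", "my dog licks avocados"))   # 0.25
```

One substitution in four reference words gives 25% WER; note that because insertions also count, WER can exceed 100%.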

Review and Conclusions 15

Why is it so difficult

• Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

• Speaker variability

  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)

• Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

   • Sentence Segmentation
   • Abbreviations: Dr Smith lives on Nanyang Dr. He is …
   • Word Segmentation

     - 森山前日銀総裁 Moriyama zen Nichigin sousai (Moriyama, the former Bank of Japan governor)
     ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai (mis-segmented)

2. Speech Synthesis

   • Find the pronunciation
   • Generate sounds
   • Add intonation
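The abbreviation problem in step 1 is easy to demonstrate: a naive splitter that breaks on every full stop cannot tell "Dr." (Doctor) from "Dr." (Drive). The splitting rule below is deliberately naive, to show the failure:

```python
# A naive sentence splitter: break after every full stop.
# It mis-handles abbreviations such as "Dr".
import re

def naive_split(text):
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

text = "Dr. Smith lives on Nanyang Dr. He is a linguist."
for sentence in naive_split(text):
    print(sentence)
# Produces three "sentences", wrongly cutting after the first "Dr." --
# only a model of abbreviations and context can tell the two uses apart.
```

Real TTS front ends use abbreviation lexicons and context (a capitalized word after "Dr." is ambiguous; a street name before it is not) to decide where sentences actually end.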

Review and Conclusions 17

Speech Synthesis

• Articulatory Synthesis: attempt to model human articulation

• Formant Synthesis: bypass modeling of human articulation and model acoustics directly

• Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sad: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality' but natural-sounding prosody (intonation and duration)."

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time-bound               space-bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are unintended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
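A minimal sketch of the character n-gram ranking approach (in the style of Cavnar and Trenkle's out-of-place measure); the training snippets below are toy data, far smaller than a real system would use:

```python
from collections import Counter

def ngram_profile(text, n=3, top=100):
    """Ranked character n-gram profile of a text (Cavnar & Trenkle style)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, reference):
    """Sum of rank displacements of profile's n-grams in the reference."""
    rank = {g: i for i, g in enumerate(reference)}
    return sum(abs(i - rank.get(g, len(reference)))
               for i, g in enumerate(profile))

def identify(text, training):
    """Pick the training language whose profile is closest to the text's."""
    profiles = {lang: ngram_profile(t) for lang, t in training.items()}
    target = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(target, profiles[lang]))

training = {  # toy snippets; real profiles are trained on much more text
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
```

Exactly as the slide warns, this breaks down on very short snippets and closely related languages, where the profiles barely differ.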

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K = $700,000 = 0.7 million dollars, …

ã Date Normalization: 2000 AD = 1421 AH = Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel

Review and Conclusions 54
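The number-normalization step can be sketched as a small rewriting function; the pattern and suffix table are illustrative assumptions, not a full normalizer:

```python
import re

# Illustrative suffix table; a real normalizer would cover far more formats.
SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def normalize_amount(text):
    """Map strings like '$700K', '$700000' or '$0.7M' to an integer amount."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)([KMB]?)", text)
    if not m:
        raise ValueError(f"unrecognized amount: {text}")
    value, suffix = m.groups()
    return round(float(value) * SUFFIX.get(suffix, 1))
```

Once normalized, "$700K" and "$700,000" count as the same token, which matters for frequency counts and retrieval.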

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes possible integrating multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness

Review and Conclusions 58
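Well-formedness can be checked mechanically with any XML parser; a sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Syntactic validation: does the text parse as XML at all?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

A document can be well-formed (tags properly nested and closed) while still being invalid against a particular schema; this check only covers the former.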

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDFs are written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …

Review and Conclusions 59
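The triple model is easy to play with directly; here the "ex:" names are made-up CURIEs for illustration, standing in for full URIs:

```python
# Triples of (subject, predicate, object); in real RDF each element is a URI.
# The "ex:" names below are invented for this example.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:covers", "ex:SemanticWeb"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])}
```

Pattern matching over triples is the core of SPARQL-style querying; an ontology adds rules that let you infer triples you never stored.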

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible: know thyself

5 Schemas aren't neutral

6 Metrics influence results

7 There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper

Review and Conclusions 64
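The first few scores on the slide can be computed directly from per-paper records; the field names here are assumptions for the sketch:

```python
def citation_scores(papers):
    """Compute simple reputation scores from per-paper citation records.

    papers: list of dicts with 'citations', 'self_citations', 'n_authors'
    (assumed field names for this sketch).
    """
    total = sum(p["citations"] for p in papers)
    minus_self = total - sum(p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return {"total": total, "minus_self": minus_self, "per_author": per_author}

papers = [
    {"citations": 10, "self_citations": 2, "n_authors": 2},
    {"citations": 4, "self_citations": 0, "n_authors": 1},
]
scores = citation_scores(papers)
```

Weighting citations by the quality of the citing paper needs a recursive definition, which is exactly what PageRank-style scores provide.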

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

For more information about Language Technology and the Internet see the <a href="…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation Analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random Walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate
(what proportion of the time someone will be there)

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With remaining probability (90%) go out on a random hyperlink

â For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
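The random walk with teleporting can be sketched as a power iteration over a link graph; the three-page graph and parameter values below are toy assumptions:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration for the random walk with teleporting.

    links: {page: [outlinks]}; a page with no outlinks is a dead end.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p, r in rank.items():
            out = links[p]
            if not out:  # dead end: jump to every page with probability 1/N
                for q in pages:
                    new[q] += r / n
            else:  # teleport with probability 10%, else follow a random link
                for q in pages:
                    new[q] += r * teleport / n
                for q in out:
                    new[q] += r * (1 - teleport) / len(out)
        rank = new
    return rank

# Toy three-page graph: A and B both link to C, C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

The iteration converges to the steady-state visit rates, so C (with two inlinks) ends up ranked above B (with none), matching the "weighted citation" intuition above.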

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph: Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically, randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital Object Identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z
→ http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language Technology and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-Req HG251, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81




The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 13: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Three Major Writing Systems

- Alphabetic (Latin)

  - one symbol for each consonant or vowel
  - Typically 20-30 base symbols (1 byte)

- Syllabic (Hiragana)

  - one symbol for each syllable (consonant+vowel)
  - Typically 50-100 base symbols (1-2 bytes)

- Logographic (Hanzi)

  - pictographs, ideographs, sound-meaning combinations
  - Typically 100,000+ symbols (2-3 bytes)
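The byte counts above can be checked directly; this sketch uses UTF-8 (byte lengths differ in legacy encodings such as EUC-JP, noted in the comments):

```python
# UTF-8 uses more bytes for characters from larger symbol inventories.
for ch in ("a",    # Latin letter: 1 byte
           "か",   # Hiragana syllable: 3 bytes in UTF-8 (2 in EUC-JP)
           "漢"):  # Hanzi/Kanji: 3 bytes in UTF-8 (2 in EUC-JP)
    print(ch, len(ch.encode("utf-8")), "bytes")
```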

Review and Conclusions 12

Computational Encoding

- Need to map characters to bits

- More characters require more space

  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits); Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)

- Moving towards Unicode for everything

- If you get the encoding wrong, it is gibberish

  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns

    * But an encoding can be shared by many languages
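A minimal illustration of the "gibberish" problem: the same bytes decoded with the wrong encoding produce mojibake.

```python
# UTF-8 bytes for "café" misread as ISO-8859-1 (Latin-1):
data = "café".encode("utf-8")
print(data.decode("utf-8"))    # café   (correct encoding)
print(data.decode("latin-1"))  # cafÃ©  (wrong encoding: gibberish)
```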

Review and Conclusions 13

Speech and Language Technology

- The need for speech representation

- Storing sound

- Transforming Speech

  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): text to sounds

- Speech technology — the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab    WER (%)   WER (%) adapted
Digits                       11       0.4       0.2
Dialogue (travel)        21,000      10.9       —
Dictation (WSJ)           5,000       3.9       3.0
Dictation (WSJ)          20,000      10.0       8.6
Dialogue (noisy, army)    3,000      42.2      31.0
Phone Conversations       4,000      41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect/Foreign Accent
  - Individual Differences: Physical differences, Language differences (idiolect)

- Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

  - Sentence Segmentation
  - Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
  - Word Segmentation

    - 森前銀総裁: Moriyama zen Nichigin Sousai
    ✗ 森前銀総裁: Moriyama zennichi gin Sousai

2. Speech Synthesis

  - Find the pronunciation
  - Generate sounds
  - Add intonation
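The abbreviation problem above can be reproduced with a naive sentence splitter (a sketch; real systems use abbreviation lists and trained models):

```python
import re

text = "Dr. Smith lives on Nanyang Dr. He is a linguist."
# Naively treat every ". " as a sentence boundary:
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# ['Dr.', 'Smith lives on Nanyang Dr.', 'He is a linguist.']
# The first "Dr." (the title) is wrongly split off, while the second
# "Dr." (Drive) really is sentence-final: the string alone cannot decide.
```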

Review and Conclusions 17

Speech Synthesis

- Articulatory Synthesis: Attempt to model human articulation

- Formant Synthesis: Bypass modeling of human articulation and model acoustics directly

- Concatenative Synthesis: Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

- Excitement: Fast, very high pitch, loud

- Hot anger: Fast, high pitch, strong falling accent, loud

- Fear: Jitter

- Sarcasm: Prolonged accent, late peak

- Sad: Slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading >> Speaking, Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

- Large scale discourse studies still to be done

- Some genuinely new things

  - time-lagged multi-person conversation
  - raw un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users as seen by Systems Administrators)
  - suits

- Need to be in-group

  cow orker: Coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance" — Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is saved, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge: people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia: Five pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia: Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

- The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWIM)

- The structure of the Web — hypertext: pages linked to pages

- The future of the Web

- Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking a direct link reaches them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: Search Engine as Query tool and WWW as corpus
  (Objection: Results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow drawing conclusions about the data
  ✗ Google is missing functionality that linguists and lexicographers would like to have

- Web Sample: Use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ✗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, Scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios
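Both points can be shown in a few lines (toy documents; the large counts are the ones quoted above):

```python
# Document frequency vs term frequency:
docs = ["the cat saw the cat", "the dog slept"]
tf = sum(d.split().count("cat") for d in docs)  # term frequency
df = sum("cat" in d.split() for d in docs)      # document frequency
print(tf, df)  # 2 1  -- a ghit counts like df, not tf

# Comparing two phenomena as a unitless ratio (counts from the slide):
ratio = 251_000_000 / 71_500_000
print(f"{ratio:.1f}:1")  # 3.5:1
```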

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)
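Step 2 can be sketched by pairing seed words into queries (toy numbers; the pipeline above pairs ~500 seeds into ~6,000 queries):

```python
from itertools import combinations

seeds = ["language", "internet", "corpus", "technology"]
# Every unordered pair of seed words becomes one search query:
queries = [" ".join(pair) for pair in combinations(seeds, 2)]
print(len(queries), queries[0])  # 6 'language internet'
```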

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ✗ unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ✗ Less data

- Richer data than a compiled corpus

✗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations — Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine Learning based methods

  - Context (under .jp or .ko)

- Hard to do for: short text snippets, similar languages, mixed text
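A toy version of identification by character n-gram rankings (Cavnar and Trenkle's "out-of-place" measure, on tiny made-up samples; real systems train on much more text):

```python
from collections import Counter

def profile(text, n=3, k=30):
    """Top-k character n-grams of a text, most frequent first."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return [g for g, _ in Counter(grams).most_common(k)]

def out_of_place(doc_prof, ref_prof):
    """Sum of rank differences; n-grams missing from the reference get the maximum penalty."""
    return sum(abs(i - ref_prof.index(g)) if g in ref_prof else len(ref_prof)
               for i, g in enumerate(doc_prof))

en = profile("the cat sat on the mat and the man saw the dog and the rat")
de = profile("der hund und die katze und der mann sind in dem haus und dem garten")
doc = profile("the man and the dog")
print("en" if out_of_place(doc, en) < out_of_place(doc, de) else "de")
```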

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon cel
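Two of these steps as a sketch: a hypothetical money normalizer and a crude suffix stemmer (real systems use e.g. the Porter stemmer):

```python
import re

def normalize_money(s):
    """Map '$700K'-style strings to a plain number (a toy normalizer)."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)([KMB]?)", s)
    value = float(m.group(1)) * {"": 1, "K": 10**3, "M": 10**6, "B": 10**9}[m.group(2)]
    return int(value)

def stem(word):
    """Strip a few common suffixes: producer/produces -> produc."""
    for suffix in ("er", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(normalize_money("$700K"))            # 700000
print(stem("producer"), stem("produces"))  # produc produc
```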

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes possible integrating multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
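Well-formedness can be checked with any XML parser; a sketch using Python's standard library (the sample markup is invented):

```python
import xml.etree.ElementTree as ET

def well_formed(xml_string):
    """True if the string parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course code='HG2052'><title>Review</title></course>"))  # True
print(well_formed("<course><title>Review</course></title>"))                # False
```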

Review and Conclusions 58

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Universal Resource Name: urn:isbn:1575864606
  - Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, …
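RDF's triple model can be sketched as a tiny in-memory store with wildcard queries (toy facts, not a real RDF library):

```python
# Each fact is a <subject, predicate, object> triple.
triples = {
    ("HG2052", "taughtBy", "FrancisBond"),
    ("HG2052", "isA", "Course"),
    ("FrancisBond", "isA", "Person"),
}

def query(s=None, p=None, o=None):
    """None acts as a wildcard, like a variable in a SPARQL pattern."""
    return [t for t in triples
            if all(q is None or q == v for q, v in zip((s, p, o), t))]

print(query(s="HG2052"))           # both facts about HG2052
print(query(p="isA", o="Person"))  # [('FrancisBond', 'isA', 'Person')]
```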

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible (know thyself)

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight Citations by the Quality of the paper
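The first three scores, computed for a toy author (all names and counts hypothetical):

```python
author = "Bond"
n_coauthors = 2  # authors on the cited paper
citing_papers = [
    {"authors": {"Bond", "Wang"}},  # a self-citation
    {"authors": {"Smith"}},
    {"authors": {"Lee", "Tan"}},
]

total = len(citing_papers)                                    # total citations
non_self = sum(author not in p["authors"] for p in citing_papers)
fractional = total / n_coauthors  # share credit among co-authors

print(total, non_self, fractional)  # 3 2 1.5
```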

Review and Conclusions 64

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to High Impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
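Extracting links and their anchor text, as link analysis requires, can be sketched with the standard-library HTML parser (the URL is made up):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from a page."""
    def __init__(self):
        super().__init__()
        self.in_a, self.href, self.text, self.anchors = False, None, [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a, self.href, self.text = True, dict(attrs).get("href"), []

    def handle_data(self, data):
        if self.in_a:          # only collect text inside <a>...</a>
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self.anchors.append((self.href, "".join(self.text).strip()))
            self.in_a = False

p = AnchorCollector()
p.feed('See the <a href="http://example.org/HG803">HG803 Course Page</a>.')
print(p.anchors)  # [('http://example.org/HG803', 'HG803 Course Page')]
```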

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote – not very accurate

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
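The random walk with teleporting can be run directly as power iteration (a sketch on a toy four-page graph; D is a dead end):

```python
def pagerank(links, teleport=0.10, n_iter=100):
    """links: page -> list of outgoing links; returns steady-state visit rates."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1 / N for p in pages}
    for _ in range(n_iter):
        new = {p: 0.0 for p in pages}
        for p in pages:
            outs = links[p]
            if outs:  # teleport with prob 0.10, else follow a random link
                for q in pages:
                    new[q] += teleport * pr[p] / N
                for q in outs:
                    new[q] += (1 - teleport) * pr[p] / len(outs)
            else:     # dead end: jump uniformly (independent of teleport rate)
                for q in pages:
                    new[q] += pr[p] / N
        pr = new
    return pr

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []})
print({p: round(r, 3) for p, r in ranks.items()})  # C, with two inlinks, ranks highest
```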

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

- The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo! does not include the link in its ranking
  - …
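Whether a link passes ranking credit can be read off its rel attribute (a sketch; as the slide notes, engines interpret nofollow differently):

```python
def passes_rank(attrs):
    """attrs: list of (name, value) pairs, as produced by an HTML parser."""
    rel = dict(attrs).get("rel", "")
    return "nofollow" not in rel.split()

print(passes_rank([("href", "http://example.org/")]))                       # True
print(passes_rank([("href", "http://example.org/"), ("rel", "nofollow")]))  # False
```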

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u
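A DOI is resolved by prefixing it with a resolver's address; the central doi.org resolver then redirects to the document's current location (sketch only, no network call):

```python
def doi_to_url(doi):
    # The central resolver redirects to the location stored in the DOI metadata,
    # e.g. the SpringerLink page above.
    return "https://doi.org/" + doi

print(doi_to_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```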

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (Language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics — (Pre-req: HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 14: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

ã Sentence Segmentation
ã Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
ã Word Segmentation

â 森山前日銀総裁 Moriyama zen Nichigin Sousai
⊗ 森山前日銀総裁 Moriyama zennichi gin Sousai

2. Speech Synthesis

ã Find the pronunciation
ã Generate sounds
ã Add intonation
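The "Nanyang Dr." example shows why linguistic analysis must precede synthesis. A toy disambiguator for just this one abbreviation (a deliberate simplification, not a real TTS front end):

```python
def expand_dr(tokens):
    """Expand 'Dr' to 'Doctor' before a capitalised word, else to 'Drive'."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == "Dr":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            out.append("Doctor" if nxt[:1].isupper() else "Drive")
        else:
            out.append(tok)
    return out

print(expand_dr("Dr Smith lives on Nanyang Dr".split()))
# ['Doctor', 'Smith', 'lives', 'on', 'Nanyang', 'Drive']

# The slide's second sentence defeats the rule: "He" is capitalised,
# so "Nanyang Dr He is ..." wrongly yields "Doctor".
print(expand_dr("Nanyang Dr He is".split()))
```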

Review and Conclusions 17

Speech Synthesis

ã Articulatory Synthesis: Attempt to model human articulation

ã Formant Synthesis: Bypass modeling of human articulation and model acoustics directly

ã Concatenative Synthesis: Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

ã Excitement: Fast, very high pitch, loud

ã Hot anger: Fast, high pitch, strong falling accent, loud

ã Fear: Jitter

ã Sarcasm: Prolonged accent, late peak

ã Sad: Slow, low pitch

The main determinant of “naturalness” in speech synthesis is not “voice quality” but natural-sounding prosody (intonation and duration).

— Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

ã Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

ã All share some characteristics of speech and text

ã Usage norms not fixed

ã Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …

ã Large-scale discourse studies still to be done

ã Some genuinely new things:

â time-lagged multi-person conversation
â raw un-edited text
â extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive/comments   factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

ã Inspired by shared background

â <rant>I can’t stand this</rant>
â I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
â lusers (users as seen by Systems Administrators)
â suits

ã Need to be in-group

cow orker: Coworker
clue: “You couldn’t get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance” — Edward Flaherty (talk.bizarre)

Review and Conclusions 28

ã Gricean Principles

â Posting (top/middle/bottom quoting and trimming)

ã Inspired by medium limitations

â GREAT, *great*, _great_
â :-) (^_^) (o)
â brb, RTFM, IMHO
â lower case
â lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

ã Version Control Systems

ã Wikipedia

ã Licensing and Ownership

30

Version Control Systems

ã Versioning file systems

â every time a file is opened, a new copy is stored

ã CVS, Subversion, Git

â changes to a collection of files are tracked
â simultaneous changes are merged

ã Revision Tracking

â Revisions are stored within a file

ã Authorship in shared writing: Explicit responsibility for changes
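What a revision-tracking system stores can be sketched with the standard library: deltas between successive versions rather than full copies (the file contents here are invented):

```python
import difflib

# Two revisions of a small "file"
v1 = "Colorless green ideas\nsleep furiously\n".splitlines(keepends=True)
v2 = "Colorless green ideas\nsleep curiously\n".splitlines(keepends=True)

# The stored delta between the revisions
delta = difflib.unified_diff(v1, v2, fromfile="v1", tofile="v2")
print("".join(delta))

# An ndiff delta is invertible: either revision can be recovered from it
nd = list(difflib.ndiff(v1, v2))
assert "".join(difflib.restore(nd, 1)) == "".join(v1)
assert "".join(difflib.restore(nd, 2)) == "".join(v2)
```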

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge; people like to do this

ã Most edits are done by insiders

ã Most content is added by outsiders

ã Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

ã The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWIM

ã The structure of the Web — hypertext: pages linked to pages

ã The future of the Web

ã Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (no backlinks)

Private Web: sites that require registration and login (edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers’ markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: Search Engine as Query tool and WWW as corpus (Objection: Results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow us to draw conclusions on the data
⊗ Google is missing functionalities that linguists/lexicographers would like to have

ã Web Sample: Use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, Scripts)

ã Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help, but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or “google hit” is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or “web hit” is the more general term

ã Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB, for “Ghits per billion documents”, is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios
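The ratio arithmetic above as a sketch (the counts are the slide's 2012 figures; the index size is a made-up assumption, since Google does not publish one):

```python
different_from = 251_000_000   # ghits, accessed 2012-04-04
different_than = 71_500_000

ratio = different_from / different_than
print(f"{ratio:.1f}:1")        # 3.5:1

# GPB needs the (unpublished) index size; 8e9 documents is assumed here
ASSUMED_INDEX = 8_000_000_000
gpb = different_from * 1_000_000_000 / ASSUMED_INDEX
print(f"{gpb:,.0f} ghits per billion documents")
```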

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
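Steps 1–2 can be sketched as follows (the seed words are illustrative placeholders, not Sharoff's actual list, and the networked steps 3–5 are omitted):

```python
import itertools
import random

random.seed(0)
seeds = ["language", "technology", "internet", "corpus", "query"]

def make_queries(seed_words, words_per_query=3, n_queries=5):
    """Sample word tuples to use as conjunctive search-engine queries."""
    combos = list(itertools.combinations(seed_words, words_per_query))
    return [" ".join(c) for c in random.sample(combos, n_queries)]

for q in make_queries(seeds):
    print(q)
```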

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are non-intended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
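A heavily simplified character-n-gram identifier in the Cavnar and Trenkle style (the two training snippets are toy data; real systems train on large text and compare rank-order statistics rather than plain set overlap):

```python
from collections import Counter

def profile(text, n=3, top=100):
    """The most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, _ in grams.most_common(top)}

TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and then the other dog",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann über den anderen hund",
}
PROFILES = {lang: profile(txt) for lang, txt in TRAIN.items()}

def identify(snippet):
    """Pick the language whose profile overlaps most with the snippet's."""
    scores = {lang: len(profile(snippet) & p) for lang, p in PROFILES.items()}
    return max(scores, key=scores.get)

print(identify("the dog jumps"))       # en
print(identify("der hund springt"))    # de
```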

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel
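Two of these steps as a toy sketch (the regex and the suffix list are illustrative assumptions; a real system would use a proper stemmer such as Porter's):

```python
import re

def normalize_money(text):
    """Rewrite '$700K' as '$700000'."""
    return re.sub(r"\$(\d+)K\b", lambda m: f"${int(m.group(1)) * 1000}", text)

def stem(word):
    """Crude suffix-stripping stemmer: produces/producer -> produc."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(normalize_money("They paid $700K for it"))   # They paid $700000 for it
print(stem("produces"), stem("producer"))          # produc produc
```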

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes possible integrating multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
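Well-formedness (though not schema validity) can be checked with Python's standard-library parser:

```python
import xml.etree.ElementTree as ET

def well_formed(doc: str) -> bool:
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course>"))         # False (unclosed tag)
```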

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDFs are written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
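Triples can be sketched without a full RDF stack as plain tuples with a wildcard query (the ex: URIs are invented for illustration, not dereferenceable identifiers):

```python
# A tiny in-memory triple store: <subject, predicate, object> tuples
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:FrancisBond", "ex:memberOf", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Pattern-match triples; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(query(p="ex:taughtBy"))
# [('ex:HG2052', 'ex:taughtBy', 'ex:FrancisBond')]
```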

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow’s seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible — know thyself

5. Schemas aren’t neutral

6. Metrics influence results

7. There’s more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by ‘good’ papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper
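The first three scores, computed on invented data (three papers by one author, made-up counts):

```python
# Invented citation data: each paper's citation count, self-citations,
# and number of co-authors
papers = [
    {"citations": 120, "self": 10, "authors": 3},
    {"citations": 40,  "self": 5,  "authors": 1},
    {"citations": 10,  "self": 4,  "authors": 2},
]

total = sum(p["citations"] for p in papers)
minus_self = sum(p["citations"] - p["self"] for p in papers)
per_author = sum(p["citations"] / p["authors"] for p in papers)

print(total, minus_self, per_author)   # 170 151 85.0
```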

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B (this is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: Each article gets one vote — not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article’s vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web:

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

ã This long-term visit rate is the page’s PageRank

ã PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: “jumping” from a dead end is independent of the teleportation rate
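The walk above converges to the PageRank vector. A power-iteration sketch on a made-up 4-page graph, with a 10% teleportation rate and dead ends teleporting with probability 1/N, exactly as described:

```python
TELEPORT = 0.10
links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # page 3 is a dead end
N = len(links)

rank = [1.0 / N] * N
for _ in range(100):
    new = [0.0] * N
    for page, outs in links.items():
        if not outs:
            # dead end: jump to each page with probability 1/N
            for j in range(N):
                new[j] += rank[page] / N
        else:
            for j in range(N):
                new[j] += rank[page] * TELEPORT / N   # teleport
            for j in outs:
                new[j] += rank[page] * (1 - TELEPORT) / len(outs)  # follow link
    rank = new

print([round(r, 3) for r in rank])   # visit rates sum to 1
```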

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: “scrape” search-engine results pages or other sources of content and create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index.

â Google does not index the target of a link marked nofollow
â Yahoo! does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req HG251, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 15: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)(English computer science students various studies)

Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)

atilde Reading gtgt SpeakingHearing gtgt Typing

rArr Speech for inputrArr Text for output

Review and Conclusions 20

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow us to draw conclusions about the data
  ✗ Google is missing functionalities that linguists and lexicographers would like to have

- Web Sample: use a search engine to download data from the net and build a corpus from it

  - Known size and exact hit counts → statistics possible
  - People can draw conclusions over the included text types
  - (Limited) control over the content
  ✗ Sparser data

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than:
  251,000,000 ghits vs 71,500,000, or 3.5:1; accessed 2012-04-04)

- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
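The ratio advice above is easy to mechanize; a minimal sketch using the slide's 2012 counts (the 50-billion-document index size is an invented placeholder, since Google no longer publishes one):

```python
def count_ratio(hits_a, hits_b):
    """Unitless ratio of two web hit counts, rounded to one decimal."""
    return round(hits_a / hits_b, 1)

def ghits_per_billion(hits, index_size):
    """Mark Liberman's GPB: ghits normalised per billion indexed documents."""
    return hits / (index_size / 1_000_000_000)

# "different from" vs "different than" (ghits, accessed 2012-04-04)
ratio = count_ratio(251_000_000, 71_500_000)          # the 3.5:1 above
gpb = ghits_per_billion(251_000_000, 50_000_000_000)  # assuming a 50e9-doc index
```

Both counts must of course come from the same engine on the same day for the ratio to mean anything.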

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

Building Internet Corpora Outline

1 Select seed words (500)

2 Combine to form multiple queries (6,000)

3 Query a search engine and retrieve the URLs (50,000)

4 Download the files from the URLs (100,000,000 words)

5 Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html
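Steps 1-2 of the outline can be sketched in a few lines; the seed list here is a small hypothetical stand-in for the 500 mid-frequency words a real run would use:

```python
import random

def make_queries(seed_words, tuple_size=4, n_queries=10, seed=0):
    """Combine seed words into search-engine queries: each query is a
    random tuple of distinct seed words (the BootCaT-style recipe)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [" ".join(rng.sample(seed_words, tuple_size))
            for _ in range(n_queries)]

seeds = ["picture", "shown", "considered", "water", "example",
         "history", "common", "second", "result", "report"]
queries = make_queries(seeds)
```

Each query would then be sent to a search engine (step 3), and the returned URLs downloaded and post-processed.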

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ✗ Unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ✗ Less data

- Richer data than a compiled corpus

✗ Less balanced, less markup

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

... and Zombies

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, ... metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

Implicit Metadata

- You can get clues from metadata within documents

  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as semantic relations

- File names (content type and language)

- Translations: bracketed glosses, cross-lingual disambiguation

- Query data

- Wikipedia redirections and cross-wiki links

Language Identification and Normalization

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine-learning based methods

  - Context (under .jp or .kr)

- Hard to do for short text snippets, similar languages, mixed text
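A character n-gram identifier of the kind listed above can be sketched in pure Python; the two one-sentence "profiles" are toy stand-ins for the large training texts a real system would use:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram profile: a Counter of the text's n-grams."""
    text = f" {text.lower()} "          # pad so word edges form n-grams
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Pick the language whose profile shares the most n-grams with the
    text (overlap = sum of minimum counts).  Real systems use rank-order
    statistics (Cavnar & Trenkle 1994) or machine learning instead."""
    target = char_ngrams(text)
    def overlap(lang):
        return sum(min(c, profiles[lang][g]) for g, c in target.items())
    return max(profiles, key=overlap)

profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog again"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund"),
}
```

On very short or mixed-language snippets the overlap scores become unreliable, which is exactly the difficulty the last bullet points to.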

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number normalization: $700K, $700,000, 0.7 million dollars, ...

- Date normalization: 2000 AD, 1421 AH, Heisei 12, ...

- Stripping stop words (the, a, of, ...)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel
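As a sketch of number normalization, mapping the slide's three money expressions onto one canonical value (a toy; real normalizers cover far more patterns):

```python
import re

def normalize_money(text):
    """Map a few money expressions to a canonical integer dollar amount."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)K", text)
    if m:                                   # $700K
        return round(float(m.group(1)) * 1_000)
    m = re.fullmatch(r"\$([\d,]+)", text)
    if m:                                   # $700,000
        return int(m.group(1).replace(",", ""))
    m = re.fullmatch(r"(\d+(?:\.\d+)?) million dollars", text)
    if m:                                   # 0.7 million dollars
        return round(float(m.group(1)) * 1_000_000)
    raise ValueError(f"unrecognized amount: {text}")
```

Once normalized, all three surface forms count as the same token, which is the point of the step.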

The Semantic Web


Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color", but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Semantic Web Architecture


XML eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
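The well-formedness check in the last bullet is exactly what an XML parser does; a minimal sketch with the standard library:

```python
import xml.etree.ElementTree as ET

def well_formed(xml_text):
    """True if every tag is closed and properly nested (syntactic
    well-formedness).  Validity against a DTD or schema is a separate,
    stricter check that this does not perform."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```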

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, ...
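The triple model reduces naturally to tuples; a minimal sketch of a triple store with wildcard queries (the ex: URIs are invented for illustration, and a real store would use RDF syntax and full URIs):

```python
# RDF's <subject, predicate, object> model as a set of Python tuples.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:partOf", "ex:LMS"),
    ("ex:FrancisBond", "ex:memberOf", "ex:LMS"),
}

def query(pattern, store=triples):
    """All triples matching an (s, p, o) pattern; None is a wildcard."""
    return {t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))}
```

Chaining such queries is how new conclusions are drawn from merged data sources.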

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata:

1 People lie

2 People are lazy

3 People are stupid

4 Mission: Impossible (know thyself)

5 Schemas aren't neutral

6 Metrics influence results

7 There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

Citation, Reputation and PageRank

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total of (citations / number of authors)
  - Average of (citation impact factor / number of authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper
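The simple scores above are straightforward to compute once citation counts are known; a sketch over a hypothetical publication record (the record format is invented for illustration):

```python
def citation_scores(papers):
    """Three of the slide's scores for one researcher.  `papers` is a
    list of records with citation count, self-citation count and
    number of authors."""
    total = sum(p["cites"] for p in papers)
    minus_self = total - sum(p["self_cites"] for p in papers)
    per_author = sum(p["cites"] / p["authors"] for p in papers)
    return {"total": total, "minus_self": minus_self,
            "per_author": per_author}

papers = [
    {"cites": 30, "self_cites": 5, "authors": 2},
    {"cites": 10, "self_cites": 2, "authors": 1},
]
scores = citation_scores(papers)
```

Weighting each citation by the quality of the citing paper turns this into a recursive computation, which is what PageRank formalizes.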

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self-citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve.

Characteristics of the Web Bow Tie

SCC (Strongly Connected Core): one can travel from any page to any page

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://...">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
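Collecting (href, anchor text) pairs is the first step of link analysis; a minimal sketch with the standard-library HTML parser (the URL is a placeholder):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from HTML."""
    def __init__(self):
        super().__init__()
        self.links = []      # completed (href, text) pairs
        self._href = None    # href of the <a> we are inside, if any
        self._text = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorCollector()
parser.feed('See the <a href="http://example.com/HG803">HG803 Course Page</a>.')
```

A search engine indexes the collected anchor text under the *target* page, which is why it can describe pages it has never crawled.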

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote (not very accurate)

- On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality ...
  - ... mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Teleporting ndash to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
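The random walk with teleporting can be computed directly by power iteration; a minimal sketch on an invented four-page graph (D is a dead end):

```python
def pagerank(links, teleport=0.10, iters=50):
    """Long-term visit rate of the teleporting random surfer.
    `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                        # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                # teleport with probability 0.10
                    new[q] += teleport * rank[p] / n
                for q in out:                  # else follow a link equiprobably;
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new                             # e.g. 4 outlinks -> 0.90/4 = 0.225 each
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
rank = pagerank(links)
```

The ranks always sum to 1, and pages with no inlinks (like D) keep only the small teleport share.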

Example Graph

Each inbound link is a positive vote.

http://www.ams.org/samplings/feature-column/fcarc-pagerank

Example Graph, Weighted

Pages with higher PageRanks are lighter.

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

- Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - ...

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Conclusions


Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital text (made writing transferable)

- Hyperlinking (taking writing beyond language)

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics: (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics


How good are the systems

Task                    Vocab    WER (%)   WER (%) adapted
Digits                      11       0.4       0.2
Dialogue (travel)       21,000      10.9       -
Dictation (WSJ)          5,000       3.9       3.0
Dictation (WSJ)         20,000      10.0       8.6
Dialogue (noisy, army)   3,000      42.2      31.0
Phone Conversations      4,000      41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)
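WER in the table is the edit distance between the recognizer's output and a reference transcript, divided by the reference length; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / |reference|,
    via standard dynamic-programming edit distance over words."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(r)][len(h)] / len(r)
```

Note that WER can exceed 100% when the hypothesis has many insertions.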

Why is it so difficult

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)

- Many, many rare events

  - 300 out of 2000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Two steps in a TTS system

1 Linguistic Analysis

  - Sentence segmentation
  - Abbreviations: Dr Smith lives on Nanyang Dr He is ...
  - Word segmentation

    - 森山前日銀総裁 → Moriyama zen Nichigin sousai (Moriyama, former Bank of Japan governor)
    ✗ 森山前日銀総裁 → Moriyama zennichi gin sousai

2 Speech Synthesis

  - Find the pronunciation
  - Generate sounds
  - Add intonation
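A naive splitter shows why the abbreviation example is hard: treating "Dr." as an abbreviation fixes "Dr. Smith" but then wrongly merges the two sentences around "Nanyang Dr." (a sketch, with periods added to the slide's example):

```python
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Prof."}

def split_sentences(text):
    """Split after any token ending in . ? or ! unless the token is a
    known abbreviation.  This cannot tell the title 'Dr.' from the
    street 'Dr.', which is exactly the slide's ambiguity."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

The second test below documents the failure: the two sentences come back as one.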

Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation

- Formant Synthesis: bypass modeling of human articulation and model acoustics directly

- Concatenative Synthesis: synthesize from stored units of actual speech

Prosody of Emotion

- Excitement: fast, very high pitch, loud

- Hot anger: fast, high pitch, strong falling accent, loud

- Fear: jitter

- Sarcasm: prolonged accent, late peak

- Sadness: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality' but natural-sounding prosody (intonation and duration)."

Richard Sproat

Speed is different for different modalities

Speed in words per minute (one word = 6 characters)
(English computer science students, various studies)

Reading   300    200 (proof reading)
Writing    31     21 (composing)
Speaking  150
Hearing   150    210 (speeded up)
Typing     33     19 (composing)

- Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

New Modalities


Email Usenet Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed
  (Usenet, bulletin boards, Gopher, Archie, ...)

- Large-scale discourse studies still to be done

- Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by systems administrators)
  - suits

- Need to be in-group

  cow orker: co-worker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
  Edward Flaherty (talk.bizarre)

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) ...
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision tracking

  - revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes
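The change-tracking at the heart of these systems is a diff between revisions; a minimal sketch with Python's difflib:

```python
import difflib

def diff_revisions(old, new):
    """Unified diff between two revisions of a text, as a list of lines --
    the record a version control system stores and attributes to an author."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="rev1", tofile="rev2", lineterm=""))

delta = diff_revisions("Hello world\nSecond line", "Hello there\nSecond line")
```

Merging simultaneous edits is essentially applying two such deltas to the same base revision and flagging the lines where they collide.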

Wikipedia

- "The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet." (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge, and people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

  - Content policies: NPOV, Verifiability, and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia: Five pillars

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated, if possible, by images

Wikipedia: Good_article_criteria

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper

Review and Conclusions 64
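The first three scores above are simple arithmetic; a toy sketch (the papers and counts are invented for illustration):

```python
# Toy computation of the citation scores listed above.
# name: (citations, self_citations, n_authors) -- invented data.
papers = {
    "paper_a": (40, 10, 2),
    "paper_b": (12, 0, 4),
}

def total_citations(p):
    return papers[p][0]

def citations_minus_self(p):
    c, self_c, _ = papers[p]
    return c - self_c

def citations_per_author(p):
    c, _, n = papers[p]
    return c / n

print(total_citations("paper_a"))
print(citations_minus_self("paper_a"))
print(citations_per_author("paper_a"))
```

Weighting by the quality of the citing paper needs a recursive definition, which is exactly what PageRank (below in these slides) provides.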

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core - can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/there/page/HG803">HG803:
  Language, Technology and the Internet</a>

  For more information about Language, Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote - not very accurate

- On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web:

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting - to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
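The walk-plus-teleporting recipe above is enough to compute PageRank by power iteration; a small sketch on an invented four-page graph:

```python
# Power-iteration PageRank with a 10% teleportation rate, following the
# recipe above: dead ends jump uniformly anywhere; other pages teleport
# with probability 0.10 and otherwise follow an outgoing link equiprobably.
# The four-page graph is invented for illustration.

links = {0: [1, 2], 1: [2], 2: [0], 3: []}  # page 3 is a dead end
N = len(links)
teleport = 0.10

rank = [1.0 / N] * N
for _ in range(100):
    new = [0.0] * N
    for page, out in links.items():
        if not out:                      # dead end: jump anywhere, 1/N each
            for q in range(N):
                new[q] += rank[page] / N
        else:
            for q in range(N):           # teleport share, 0.1/N each
                new[q] += teleport * rank[page] / N
            for q in out:                # follow one of the links equiprobably
                new[q] += (1 - teleport) * rank[page] / len(out)
    rank = new

print([round(r, 3) for r in rank])  # long-term visit rates; they sum to 1
```

Page 3, which nothing links to, ends up with only the teleport traffic; the pages inside the cycle share the rest.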

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

- The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74
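A nofollow link is written like this in HTML (the URL is a placeholder):

```html
<!-- rel="nofollow" tells search engines not to pass ranking credit
     to the link target; the URL here is a placeholder. -->
<a href="http://example.com/some-page" rel="nofollow">some page</a>
```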

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer - solving NLP problems with Python;
  introduces both programming and linguistics

- HG3051 Corpus Linguistics - (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Why is it so difficult?

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect / Foreign Accent
  - Individual differences: physical differences, language differences (idiolect)

- Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

   - Sentence Segmentation
   - Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
   - Word Segmentation

     - 森山 前 日銀 総裁 Moriyama zen Nichigin sousai (Moriyama, former Bank of Japan Governor)
     ⊗ 森山 前日 銀 総裁 Moriyama zennichi gin sousai

2. Speech Synthesis

   - Find the pronunciation
   - Generate sounds
   - Add intonation

Review and Conclusions 17

Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation

- Formant Synthesis: bypass modeling of human articulation and model the acoustics directly

- Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

- Excitement: fast, very high pitch, loud

- Hot anger: fast, high pitch, strong falling accent, loud

- Fear: jitter

- Sarcasm: prolonged accent, late peak

- Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

- Large-scale discourse studies still to be done

- Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time bound               space bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits

- Need to be in-group

  - cow orker (co-worker)
  - clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance"
    Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, *great*, _great_
  - :-) (^_^) (^o^)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is saved, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31
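A minimal illustration of what such systems track - the difference between two revisions of a text - here with Python's standard difflib (the revisions are invented):

```python
# Show what changed between two revisions of a file, in the unified
# diff format that CVS, Subversion and Git all use.
import difflib

old = ["Colourless green ideas sleep furiously.\n"]
new = ["Colorless green ideas sleep furiously.\n",
       "Revolutionary new ideas appear infrequently.\n"]

diff = difflib.unified_diff(old, new, fromfile="v1.txt", tofile="v2.txt")
print("".join(diff))  # lines prefixed with - were removed, + were added
```

A version control system stores (or reconstructs) exactly such deltas for every revision, which is what makes authorship of each change explicit.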

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge:
  people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

- The structure of Markup: Visual vs Logical
  (WYSIWYG, WYSIAYG, WYSIWYM)

- The structure of the Web - hypertext:
  pages linked to pages

- The future of the Web

- Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (no backlinks)

Private Web: sites that require registration and login (edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41
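The contrast can be seen in HTML itself (a minimal, hypothetical fragment):

```html
<!-- Visual markup: says only what the text should look like -->
<font size="5"><b>Normalization</b></font>

<!-- Logical markup: says what the text *is*;
     a stylesheet decides how it looks -->
<h2>Normalization</h2>
<p>Stemming maps <em>produces</em> to <em>produc</em>.</p>
```

The logical version can be restyled, read aloud, or indexed without any change to the document, which is why it is more adaptable.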

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow one to draw conclusions about the data
  ⊗ Google is missing functionalities that linguists and lexicographers would like to have

- Web Sample: use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help - but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman),
  but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
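With the counts quoted above, the ratio is one line of arithmetic (the raw counts drift as the index changes, so report the access date and the unitless ratio rather than the counts):

```python
# Comparing two constructions by the ratio of their web counts,
# using the ghit figures quoted above (accessed 2012-04-04).
different_from = 251_000_000  # ghits for "different from"
different_than = 71_500_000   # ghits for "different than"

ratio = different_from / different_than
print(f"different from : different than = {ratio:.1f} : 1")
```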

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
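Step 2 of the outline can be sketched as follows (the seed words and counts here are illustrative and far smaller than Sharoff's; the downloading and postprocessing steps are left out):

```python
# Combine seed words into search-engine queries by sampling random
# word combinations, as in step 2 of the outline above.
import itertools
import random

seeds = ["picture", "water", "small", "world", "house", "music"]

def make_queries(seeds, words_per_query=4, n_queries=5, rng_seed=0):
    """Sample random seed-word combinations to use as queries."""
    rng = random.Random(rng_seed)
    combos = list(itertools.combinations(seeds, words_per_query))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

for q in make_queries(seeds):
    print(q)
```

Each query is then sent to a search engine, and the returned URLs are downloaded and cleaned to build the corpus.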

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ⊗ Unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ⊗ Less data

- Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations - bracketed glosses, cross-lingual disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine Learning based methods

  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
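A heavily simplified character-n-gram identifier in the spirit of the n-gram ranking methods above (the two training snippets are invented and far too small for real use, where profiles are trained on large amounts of text):

```python
# A toy character-trigram language identifier: build an n-gram profile
# per language, then pick the language whose profile best overlaps the
# profile of the unknown snippet.  Training data is tiny and invented.
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """The most frequent character n-grams of a text, most common first."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

training = {
    "en": "the cat sat on the mat and the dog saw the cat",
    "de": "die katze sitzt auf der matte und der hund sieht die katze",
}
profiles = {lang: ngram_profile(t) for lang, t in training.items()}

def identify(snippet):
    """Pick the language whose profile shares most n-grams with the snippet."""
    target = set(ngram_profile(snippet))
    return max(profiles, key=lambda lang: len(target & set(profiles[lang])))

print(identify("the dog and the cat"))     # en
print(identify("der hund und die katze"))  # de
```

As the slide notes, exactly this kind of method degrades on short snippets, closely related languages, and mixed-language text, because the profiles overlap too much.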

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel

Review and Conclusions 54
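Toy versions of two of these steps, stop-word stripping and suffix stemming (the suffix list is illustrative; a real system would use a proper stemmer such as Porter's):

```python
# Two normalization steps from the list above, in minimal form.
STOP_WORDS = {"the", "a", "of"}
SUFFIXES = ["es", "er", "s"]  # illustrative, not a real stemmer

def strip_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Chop off the first matching suffix, keeping a minimal stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the producer of a film produces the music".split()
print(strip_stop_words(tokens))                   # stop words removed
print([stem(t) for t in strip_stop_words(tokens)])
```

Note how "producer" and "produces" both stem to "produc", exactly the conflation shown on the slide.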

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own
  - is designed to be self-descriptive
  - is a W3C Recommendation

- Can be validated for syntactic well-formedness

Review and Conclusions 58
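Well-formedness checking is available in Python's standard library (the sample documents are invented):

```python
# Check XML well-formedness: every opening tag must have a matching
# closing tag, properly nested.
from xml.etree import ElementTree

def is_well_formed(document):
    try:
        ElementTree.fromstring(document)
        return True
    except ElementTree.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course>"))         # False
```

Well-formedness is purely syntactic; checking that a document follows a particular schema (validity) needs a DTD or XML Schema on top of this.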


Page 18: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word = 6 characters; English computer science students, various studies):

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (sped up)
Typing      33     19 (composing)

- Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …

- Large-scale discourse studies still to be done

- Some genuinely new things:
  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                       Text-like
time bound                        space bound
spontaneous                       contrived
face-to-face                      visually decontextualized
loosely structured                elaborately structured
socially interactive (comments)   factually communicative
immediately revisable             repeatedly revisable
prosodically rich                 graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background
  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  - lusers (users, as seen by systems administrators)
  - suits

- Need to be in-group

cow orker: coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean principles
  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations
  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems
  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git
  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision tracking
  - revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge; people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view
   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

- The structure of markup: visual vs. logical (WYSIWYG, WYSIAYG, WYSIWYM)

- The structure of the Web: hypertext, pages linked to pages

- The future of the Web

- Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses, or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (presentational)
  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (structural)
  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus (objection: results are not reliable)
  - population and exact hit counts are unknown → no statistics possible
  - indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

- Web Sample: use a search engine to download data from the net and build a corpus from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
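Turning raw hit counts into the unitless ratio recommended above is a one-liner; the counts here are the ones quoted on the slide (accessed 2012-04-04):

```python
# ghits for two competing phenomena, as quoted on the slide.
different_from = 251_000_000   # "different from"
different_than = 71_500_000    # "different than"

ratio = different_from / different_than
print(f"{ratio:.1f}:1")  # about 3.5:1
```

Reporting the ratio (with the access date) rather than the raw counts is what makes the numbers comparable over time.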

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna.

http://corpus.leeds.ac.uk/internet.html 47
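Steps 1-2 of the outline above can be sketched as follows. The seed words here are placeholders (Sharoff used around 500 per language), and the query shape (four words per query, random combinations) is one simple choice among many:

```python
import itertools
import random

# Placeholder seed words; a real run uses ~500 frequent content words.
seeds = ["time", "people", "water", "system", "group", "world"]

def make_queries(seed_words, words_per_query=4, n_queries=10, rng=None):
    """Randomly sample word combinations to send as search-engine queries."""
    rng = rng or random.Random(0)  # fixed seed, so the sample is reproducible
    combos = list(itertools.combinations(seed_words, words_per_query))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

queries = make_queries(seeds)
```

Each query string would then be sent to a search engine (step 3) and the returned URLs downloaded (step 4).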

Internet Corpora Summary

- The web can be used as a corpus
  - Direct access
    * fast and convenient
    * huge amounts of data
    ⊗ unreliable counts
  - Web sample
    * control over the sample
    * some setup costs (semi-automated)
    ⊗ less data

- Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit meta-data
  - Keywords and categories
  - Rankings
  - Structural markup

- Implicit meta-data
  - Links and citations
  - Tags
  - Tables
  - File names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents
  - when it is accurate, it is very good
  - it is often deceitful

- HTML, PDF, Word, … metadata

- Keywords and tags

- Rankings

- Links and citations

- Structural markup

50

Implicit Metadata

- You can get clues from metadata within documents
  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as semantic relations

- File names (content type and language)

- Translations: bracketed glosses, cross-lingual disambiguation

- Query data

- Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  - Linguistically-grounded methods
    * diacritics
    * character n-grams
    * stop words
  - Similarity-based categorisation and classification
    * character n-gram rankings
  - Machine-learning-based methods
  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
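The "character n-gram rankings" approach above can be sketched in a few lines, in the style of Cavnar and Trenkle's out-of-place measure. The training texts here are tiny toys (real profiles are built from much larger samples), so this is an illustration of the ranking idea, not a usable identifier:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank differences between two profiles; smaller = more similar."""
    score = 0
    for rank, gram in enumerate(doc_profile):
        score += abs(rank - profile.index(gram)) if gram in profile else len(profile)
    return score

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

# Toy training texts; real systems train on large monolingual corpora.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 5),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund " * 5),
}
print(identify("the dog jumps", profiles))
```

The slide's caveat applies directly: with snippets this short, or with closely related languages, the rank distances become unreliable.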

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number normalization: $700K, $700,000, 0.7 million dollars, …

- Date normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel

Review and Conclusions 54
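The stemming step above can be sketched with a minimal suffix stripper. This is in the spirit of Porter's algorithm but far simpler: the suffix list and the minimum-stem-length condition are toy choices for illustration only:

```python
# Minimal suffix-stripping stemmer; real stemmers (e.g. Porter) have many
# more rules and conditions. Order matters: try longer suffixes first.
SUFFIXES = ["ers", "es", "er", "s"]

def stem(word):
    for suf in SUFFIXES:
        # only strip if a stem of at least 3 characters remains
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(stem("produces"), stem("producer"))  # produc produc
```

Note the difference from lemmatization: the stem "produc" is not a dictionary word, it is just a shared index key.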

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically
  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness

Review and Conclusions 58
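The well-formedness check mentioned above is exactly what an XML parser does: it accepts a document or raises an error. A small sketch using Python's standard-library parser:

```python
from xml.etree import ElementTree

def is_well_formed(xml_text):
    """True if the text parses as syntactically well-formed XML."""
    try:
        ElementTree.fromstring(xml_text)
        return True
    except ElementTree.ParseError:
        return False

print(is_well_formed("<note><to>Tove</to></note>"))   # True
print(is_well_formed("<note><to>Tove</note></to>"))   # False: mis-nested tags
```

Well-formedness (matching, properly nested tags) is purely syntactic; validity against a schema or DTD is a further, separate check.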

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework
  - triples of <subject, predicate, object>
  - each element is a URI
  - RDF is written in well-defined XML
  - you can say anything about anything

- You can build relations in ontologies (OWL)
  - then reason over them, search them, …

Review and Conclusions 59
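The triple model above can be sketched as a tiny in-memory store with pattern matching. The URIs here are shortened, invented examples (the "ex:" vocabulary is not a real namespace); real RDF stores work the same way at much larger scale:

```python
# Triples of <subject, predicate, object>; the "ex:" names are invented
# placeholders standing in for full URIs.
triples = [
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
]

def query(store, s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything asserted about HG2052:
print(query(triples, s="ex:HG2052"))
```

Because every statement has the same shape, data from multiple sources can be merged by simply concatenating their triples, which is the "web of data" integration story in miniature.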

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text mining is about unstructured data

- There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?
  - Content-based
    * read it and see if it is interesting (hard for a computer)
    * compare it to other things you have read and liked
  - Context-based: citation analysis
    * see who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:
  - total number of citations (pretty useful)
  - total number of citations minus self-citations
  - total of (citations / number of authors)
  - average of (citation impact factor / number of authors)

- Problems:
  - not all citations are equal: citations by 'good' papers are better
  - newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper

Review and Conclusions 64
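The first two scores above can be sketched on toy data. The papers, authors and citation links here are all invented for illustration; "self-citation" is taken to mean the citing and cited papers share at least one author:

```python
# Toy citation network: each paper lists its authors and the papers it cites.
papers = {
    "A": {"authors": ["Lee", "Tan"], "cites": ["B"]},
    "B": {"authors": ["Tan"],        "cites": []},
    "C": {"authors": ["Ng"],         "cites": ["B", "A"]},
}

def citing_papers(store, target):
    return [p for p, d in store.items() if target in d["cites"]]

def total_citations(store, target):
    return len(citing_papers(store, target))

def citations_minus_self(store, target):
    """Drop citing papers that share an author with the cited paper."""
    own = set(store[target]["authors"])
    return sum(1 for p in citing_papers(store, target)
               if not own & set(store[p]["authors"]))

print(total_citations(papers, "B"), citations_minus_self(papers, "B"))
```

Here paper B is cited twice, but one citation (from A) shares the author Tan and is discounted, illustrating why the two scores diverge.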

Gaming Citations

- Least/Minimum Publishable Unit
  - break research into small chunks to increase the number of citations
  - sometimes there is very little new information

- Self-citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve.

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC, the Strongly Connected Core: you can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A.

  2. The (extended) anchor text pointing to page B is a good description of page B.

  This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article
  - simplest measure: each article gets one vote (not very accurate)

- On the web, citation frequency = inlink count
  - a high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank
  - an article's vote is weighted according to its citation impact
  - this can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web
  - start at a random page
  - at each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink
  - for example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
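The random walk with teleportation described above can be computed by power iteration: repeatedly redistribute each page's rank mass according to the walk's transition probabilities until the values settle. A minimal sketch, using the slide's 10% teleportation rate and its dead-end rule (the four-page graph is a made-up example):

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power-iteration PageRank; links maps each page to its outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:
                # dead end: jump to any page with probability 1/N
                for q in pages:
                    new[q] += rank[p] / n
            else:
                # teleport with probability 10%, else follow a random link
                for q in pages:
                    new[q] += rank[p] * teleport / n
                for q in out:
                    new[q] += rank[p] * (1 - teleport) / len(out)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Because every step redistributes the full rank mass, the values always sum to 1, and the iteration converges to the steady-state visit rates of the walk.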

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74
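A crawler honouring nofollow, as described above, simply skips such links when collecting endorsements. A sketch with Python's standard-library HTML parser (the URLs are invented examples):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hrefs from <a> tags, skipping links marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            # rel can hold several space-separated values, e.g. "external nofollow"
            if "nofollow" not in (d.get("rel") or "").split():
                self.links.append(d.get("href"))

html = ('<p><a href="http://example.com/a">ok</a> '
        '<a rel="nofollow" href="http://example.com/spam">spam</a></p>')
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # only the first link survives
```

From the ranking algorithm's point of view, the nofollow link simply never enters the link graph, so it casts no "vote".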

Current Status

- There is a continuous battle between:
  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object
  - metadata about the object is stored with the DOI name
  - the metadata includes a location, such as a URL
  - the DOI for a document is permanent; the metadata may change
  - gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool
  - passively, for information access
  - actively, for collaboration

- The internet is interesting in and of itself
  - as a source of data about existing language
  - as a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press.

- Sproat, R. (2010) Language, Technology and Society. Oxford University Press.

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 19: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)(English computer science students various studies)

Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)

atilde Reading gtgt SpeakingHearing gtgt Typing

rArr Speech for inputrArr Text for output

Review and Conclusions 20

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

• The web can be used as a corpus

– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ Unreliable counts

– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

– Keywords and Categories
– Rankings
– Structural Markup

• Implicit Meta-data

– Links and Citations
– Tags
– Tables
– File Names
– Translations

…and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

– When they are accurate, they are very good
– They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

– as they are unintended, they tend to be noisy
– but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations: bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

– Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

– Similarity-based categorisation and classification
∗ Character n-gram rankings

– Machine-learning-based methods
– Context (under .jp or .kr)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
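The rank-ordered character n-gram approach mentioned above (in the style of Cavnar & Trenkle 1994) can be sketched as follows; the two training strings are tiny toy stand-ins for real training corpora.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Ranked list of the most frequent character n-grams in a text."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank differences; n-grams unseen in training get a max penalty."""
    penalty = len(profile)
    return sum(abs(i - profile.index(g)) if g in profile else penalty
               for i, g in enumerate(doc_profile))

# Toy training data -- real systems train on much larger samples per language.
train = {"en": "the cat sat on the mat with the other cats",
         "de": "die katze sitzt auf der matte mit den anderen katzen"}
profiles = {lang: ngram_profile(t) for lang, t in train.items()}

def identify(snippet):
    """Pick the language whose profile is closest to the snippet's profile."""
    doc = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

print(identify("the cat sat"))
```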

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel

Review and Conclusions 54
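Number normalization, for instance, can be sketched as a regular-expression rewrite; this toy version only knows the surface forms in the slide's example and maps them all to one canonical integer.

```python
import re

def normalize_amount(s):
    """Map surface forms like '$700K' / '0.7 million dollars' to an integer."""
    s = s.lower().replace(",", "")
    m = re.search(r"(\d+(?:\.\d+)?)\s*(k|thousand|million|m)?", s)
    if not m:
        return None
    value = float(m.group(1))
    scale = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6}.get(m.group(2), 1)
    return round(value * scale)

for form in ["$700K", "$700,000", "0.7 million dollars"]:
    print(form, "->", normalize_amount(form))   # all normalize to 700000
```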

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

– provides a common data representation framework
– makes it possible to integrate multiple sources
– so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58
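Well-formedness checking is mechanical; a minimal sketch using Python's standard-library XML parser (this checks syntax only, not validity against a DTD or schema):

```python
import xml.etree.ElementTree as ET

def well_formed(xml_string):
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(well_formed("<lesson><topic>XML</topic></lesson>"))  # True
print(well_formed("<lesson><topic>XML</lesson>"))          # False: bad nesting
```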

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

– Uniform Resource Name: urn:isbn:1575864606
– Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything

• You can build relations in ontologies (OWL)

– Then reason over them, search them, …

Review and Conclusions 59
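The triple model can be illustrated with plain Python tuples; this is a sketch, not a real RDF store, and the `ex:` predicates and objects below are invented example names, not a real vocabulary.

```python
# RDF-style triples as (subject, predicate, object) tuples of URI strings.
triples = [
    ("urn:isbn:1575864606", "ex:hasAuthor", "ex:SomeAuthor"),
    ("urn:isbn:1575864606", "ex:hasFormat", "ex:Hardcover"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None matches anything."""
    return [(s, p, o) for s, p, o in triples
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)]

print(query(triples, predicate="ex:hasAuthor"))
```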

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible (know thyself)

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data

– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

– Total number of citations (pretty useful)
– Total number of citations minus self-citations
– Total number of (citations / number of authors)
– Average (citation impact factor / number of authors)

• Problems:

– Not all citations are equal: citations by ‘good’ papers are better
– Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper

Review and Conclusions 64
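The first few scores above are simple to compute once citation data is available; the paper data below is invented purely for illustration.

```python
# Toy citation data: who cites each paper, and how many authors it has.
papers = {
    "A": {"authors": 2, "cited_by": ["B", "C", "D", "A"]},   # one self-citation
    "B": {"authors": 1, "cited_by": ["C"]},
}

def total_citations(p):
    return len(papers[p]["cited_by"])

def citations_minus_self(p):
    return sum(1 for citer in papers[p]["cited_by"] if citer != p)

def citations_per_author(p):
    return total_citations(p) / papers[p]["authors"]

print(total_citations("A"), citations_minus_self("A"), citations_per_author("A"))
# 4 3 2.0
```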

Gaming Citations

• Least/Minimum Publishable Unit

– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC, the Strongly Connected Component (the core): you can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)

Review and Conclusions 67
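Extracting (href, anchor text) pairs, the raw material for both intuitions, can be sketched with Python's standard-library HTML parser; the URL in the example is a placeholder.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML string."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:       # only collect text inside <a>...</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorCollector()
p.feed('See the <a href="http://example.com/hg803">HG803 Course Page</a>.')
print(p.links)   # [('http://example.com/hg803', 'HG803 Course Page')]
```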

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

– Simplest measure: each article gets one vote (not very accurate)

• On the web: citation frequency = inlink count

– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

– An article's vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web:

– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

– For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: “jumping” from a dead end is independent of the teleportation rate

Review and Conclusions 70
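The random walk with teleportation can be sketched as power iteration on a tiny invented graph; dead ends jump uniformly, as described above.

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power-iteration PageRank with teleportation on a dict-of-lists graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if links[p]:                  # follow a link, or teleport (10%)
                for q in links[p]:
                    new[q] += (1 - teleport) * rank[p] / len(links[p])
                for q in pages:
                    new[q] += teleport * rank[p] / n
            else:                         # dead end: jump uniformly
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": []}   # C is a dead end
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})   # ranks sum to 1
```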

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: “scrape” search-engine results pages or other sources of content and create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

– Google does not index the target of a link marked nofollow
– Yahoo! does not include the link in its ranking
– …

74

Current Status

• There is a continuous battle between:

– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a persistent identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

– DOI: 10.1007/s10579-008-9062-z
http://www.springerlink.com/content/v7q114033401th5u/

Review and Conclusions 76
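Since DOIs resolve through the central doi.org proxy, turning a DOI into a clickable URL is a one-liner:

```python
def doi_url(doi):
    """Build a resolvable URL for a DOI via the central doi.org proxy."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```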

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

– Passively, for information access
– Actively, for collaboration

• The internet is interesting in and of itself

– As a source of data about existing language
– As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics (pre-requisite HG2051 waived): an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sad: slow, low pitch

The main determinant of “naturalness” in speech synthesis is not “voice quality” but natural-sounding prosody (intonation and duration).

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students; various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading ≫ Speaking; Hearing ≫ Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things:

– time-lagged multi-person conversation
– raw, un-edited text
– extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time-bound                       space-bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

– <rant>I can't stand this</rant>
– I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
– lusers (users, as seen by Systems Administrators)
– suits

• Need to be in-group

cow orker = coworker
clue: “You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance.” Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

– Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

– GREAT, *great*, _great_
– :-) (^_^) (o)
– brb, RTFM, IMHO
– lower case
– lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

– every time a file is opened, a new copy is stored

• CVS, Subversion, Git

– changes to a collection of files are tracked
– simultaneous changes are merged

• Revision tracking

– Revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge;
people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

• Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of markup: visual vs logical
WYSIWYG, WYSIAYG, WYSIWIM

• The structure of the Web: hypertext
pages linked to pages

• The future of the Web

• Linguistic features of the web:
un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (presentational)

– What you see is what you get (WYSIWYG)
– Equivalent of printers' markup
– Shows what things look like

• Logical Markup (structural)

– Shows the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
(Objection: results are not reliable)

– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow us to draw conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

– known size and exact hit counts → statistics possible
– people can draw conclusions over the included text types
– (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …
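In HTML markup the hint looks like this (the URL is an invented placeholder):

```html
<!-- An ordinary link: counts as an endorsement for ranking -->
<a href="http://example.com/page">a normal link</a>

<!-- A nofollow link: search engines are told not to let it
     influence the target page's ranking -->
<a href="http://example.com/page" rel="nofollow">a spam-proofed link</a>
```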

74

Current Status

atilde There is a continuous battle between

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z
  http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press.

ã Sproat, R. (2010) Language, Technology and Society. Oxford University Press.

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition.

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

ã Reading ≫ Speaking, Hearing ≫ Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

ã Communication methods may disappear before the norms are fixed:
Usenet, Bulletin Boards, Gopher, Archie, …

ã Large-scale discourse studies still to be done

ã Some genuinely new things

â time-lagged multi-person conversation
â raw, un-edited text
â extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time-bound               space-bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                       Text-like
time-bound                        space-bound
spontaneous                       contrived
face-to-face                      visually decontextualized
loosely structured                elaborately structured
socially interactive (comments)   factually communicative
immediately revisable             repeatedly revisable
prosodically rich                 graphically rich

Review and Conclusions 27

Netspeak

ã Inspired by shared background

â <rant>I can't stand this</rant>
â I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
â lusers (users as seen by Systems Administrators)
â suits

ã Need to be in-group

cow orker: Coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)

Review and Conclusions 28

ã Gricean Principles

â Posting (top/middle/bottom quoting, and trimming)

ã Inspired by medium limitations

â GREAT, great, *great*
â :-) (^_^) (o)
â brb, RTFM, IMHO
â lower case
â lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

ã Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

ã Versioning file systems

â every time a file is saved, a new copy is stored

ã CVS, Subversion, Git

â changes to a collection of files are tracked
â simultaneous changes are merged

ã Revision Tracking

â Revisions are stored within a file

ã Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge; people like to do this

ã Most edits are done by insiders

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability, and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

ã The structure of Markup: Visual vs Logical
WYSIWYG, WYSIAYG, WYSIWIM

ã The structure of the Web — hypertext: pages linked to pages

ã The future of the Web

ã Linguistic features of the web:
un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (e.g. Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers' markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: Search Engine as Query tool and WWW as corpus
(Objection: Results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow one to draw conclusions about the data
⊗ Google is missing functionality that linguists and lexicographers would like to have

ã Web Sample: Use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, Scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than):
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
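The ratio comparison above is simple arithmetic; a quick sketch using the counts from the slide:

```python
# Hit counts for "different from" vs "different than" (accessed 2012-04-04)
different_from = 251_000_000
different_than = 71_500_000

# Unitless ratio: more stable across time than the raw counts
ratio = different_from / different_than
print(f"{ratio:.1f}:1")
```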

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6,000)

3 Query a search engine and retrieve the URLs (50,000)

4 Download the files from the URLs (100,000,000 words)

5 Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47
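Steps 1–2 of the pipeline (seed words and queries) can be sketched offline. The seed list, tuple size, and `make_queries` helper below are invented for illustration; BootCaT-style systems sample thousands of such tuples and then send them to a search engine.

```python
import itertools
import random

def make_queries(seeds, tuple_size=3, n_queries=10, rng=None):
    """Combine seed words into search-engine queries (step 2 above)."""
    rng = rng or random.Random(0)  # fixed seed: reproducible sample
    combos = list(itertools.combinations(seeds, tuple_size))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

# Toy seed list; a real run would use ~500 seed words
seeds = ["language", "corpus", "internet", "technology", "speech",
         "writing", "markup", "hypertext"]
for q in make_queries(seeds, n_queries=5):
    print(q)
```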

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are non-intended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
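The character n-gram approach can be sketched in a few lines. The `identify` helper and the two toy training snippets below are invented, and far too small for real use:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character trigrams, padded with spaces at the edges."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Pick the language whose n-gram profile overlaps most with the text."""
    grams = char_ngrams(text)
    def overlap(profile):
        return sum(min(c, profile[g]) for g, c in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# Toy training data: real systems train on large corpora per language
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "nl": char_ngrams("de snelle bruine vos springt over de luie hond en de kat"),
}
print(identify("the dog and the fox", profiles))
```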

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
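The producer/produces → produc behaviour above can be illustrated with a toy suffix-stripping stemmer. Real systems use something like the Porter stemmer (e.g. NLTK's `PorterStemmer`); `toy_stem` is an invented simplification handling only the suffixes shown on the slide:

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("es", "er", "ing", "s"):
        # keep at least a three-letter stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("produces", "producer", "producing", "produce"):
    print(w, "->", toy_stem(w))
```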

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes possible integrating multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
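Checking well-formedness needs only the standard library; the sample documents and the `well_formed` helper are invented for illustration:

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<note><to>Tove</to><from>Jani</from></note>"))  # True
print(well_formed("<note><to>Tove</note>"))  # False: mismatched tags
```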

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
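The triple model itself can be illustrated without a full RDF library (rdflib would be the usual Python choice). All the URIs below are invented examples:

```python
# A tiny in-memory triple store: every statement is <subject, predicate, object>
triples = {
    ("http://example.com/HG2052",
     "http://example.com/rel/taughtBy",
     "http://example.com/person/bond"),
    ("http://example.com/HG2052",
     "http://example.com/rel/topic",
     "http://example.com/concept/internet"),
}

def objects(subject, predicate):
    """Query: all objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("http://example.com/HG2052", "http://example.com/rel/topic"))
```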

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible: know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the citing paper
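The simpler scores above are straightforward arithmetic over per-paper counts. A sketch with invented data; `scores` is a hypothetical helper, not a standard bibliometric implementation:

```python
def scores(papers):
    """Compute three of the simple metrics above for one author's papers.

    papers: list of dicts with 'citations', 'self_citations', 'n_authors'.
    """
    total = sum(p["citations"] for p in papers)
    minus_self = total - sum(p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return total, minus_self, per_author

# Invented example record for an author with two papers
papers = [
    {"citations": 40, "self_citations": 5, "n_authors": 2},
    {"citations": 10, "self_citations": 2, "n_authors": 1},
]
print(scores(papers))
```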

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path.to.the.page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 22: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than):
  251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
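The ratio computation is trivial but worth making concrete; a minimal sketch using the slide's own counts (which, as the slide warns, are snapshots from 2012-04-04 and would differ today):

```python
# Hit counts for "different from" vs "different than", as reported
# by a search engine on 2012-04-04 (figures taken from the slide).
hits_from = 251_000_000   # ghits for "different from"
hits_than = 71_500_000    # ghits for "different than"

# A unitless ratio is more stable than the raw counts themselves.
ratio = hits_from / hits_than
print(f"different from : different than = {ratio:.1f}:1")  # 3.5:1
```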

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with the available tools (here, taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  – do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
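Steps 1–2 of the outline can be sketched in a few lines. This is a toy illustration of BootCaT-style tuple queries with made-up seed words; the real pipeline would send each query to a search engine and then download and post-process the hits:

```python
import itertools
import random

def make_queries(seed_words, tuple_size=3, n_queries=10):
    """Step 2: combine seed words into random query tuples.
    Fixed seed for reproducibility; a real run would randomize."""
    random.seed(0)
    pool = list(itertools.combinations(seed_words, tuple_size))
    return [" ".join(q) for q in random.sample(pool, min(n_queries, len(pool)))]

# Step 1: a (tiny, invented) seed word list; real ones have ~500 words
seeds = ["language", "technology", "internet", "corpus", "web"]
queries = make_queries(seeds)
print(len(queries), "queries, e.g.:", queries[0])
```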

Internet Corpora Summary

• The web can be used as a corpus

  – Direct access
    ∗ fast and convenient
    ∗ huge amounts of data
    ✗ unreliable counts

  – Web sample
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ✗ less data

• Richer data than a compiled corpus

✗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – when they are accurate, they are very good
  – they are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations – Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
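The character n-gram ranking approach can be sketched as a toy version of Cavnar and Trenkle's out-of-place measure. The "training" profiles below are single invented sentences, so this only illustrates the mechanics; real systems build profiles from large samples per language:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character trigram counts, with padding so word edges count."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def rank_distance(profile_a, profile_b, top=50):
    """Out-of-place distance between two n-gram rank profiles:
    sum of rank differences, with a fixed penalty for missing n-grams."""
    rank_b = {g: r for r, (g, _) in enumerate(profile_b.most_common(top))}
    dist = 0
    for r_a, (g, _) in enumerate(profile_a.most_common(top)):
        dist += abs(r_a - rank_b[g]) if g in rank_b else top
    return dist

# Tiny invented training profiles (real ones use much more text)
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
test = char_ngrams("the dog and the fox")
guess = min(profiles, key=lambda lang: rank_distance(test, profiles[lang]))
print(guess)
```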

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel

Review and Conclusions 54
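Number normalization is easy to make concrete. A minimal sketch handling just the two dollar formats from the slide (real normalizers cover many more patterns, currencies and locales):

```python
import re

def normalize_money(text):
    """Toy normalizer: rewrite '$700K' and '$0.7 million' as plain dollars."""
    text = re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: f"${round(float(m.group(1)) * 1_000):,}", text)
    text = re.sub(r"\$(\d+(?:\.\d+)?) million",
                  lambda m: f"${round(float(m.group(1)) * 1_000_000):,}", text)
    return text

print(normalize_money("It cost $700K."))          # It cost $700,000.
print(normalize_money("It cost $0.7 million."))   # It cost $700,000.
```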

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – it is a markup language, much like HTML
  – it was designed to carry data, not to display data
  – its tags are not predefined: you must define your own tags
  – it is designed to be self-descriptive
  – XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58
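Checking syntactic well-formedness (as opposed to validity against a schema) is a one-liner with the standard library; the XML snippets here are invented examples:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses as XML: every open tag has a matching
    close, properly nested. Says nothing about schema validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<note><to>Kim</to></note>"))   # True
print(is_well_formed("<note><to>Kim</note></to>"))   # False: tags cross
```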

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations in ontologies (OWL)

  – Then reason over them, search them, …

Review and Conclusions 59
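The triple model itself is simple enough to sketch without any RDF library. The `ex:` names below are hypothetical URIs invented for illustration; real RDF would use full URIs and a library such as rdflib:

```python
# Toy triple store: RDF-style <subject, predicate, object> statements.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:title", "Language, Technology and the Internet"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [(s2, p2, o2) for (s2, p2, o2) in triples
            if s in (None, s2) and p in (None, p2) and o in (None, o2)]

print(query(s="ex:HG2052", p="ex:taughtBy"))
# [('ex:HG2052', 'ex:taughtBy', 'ex:FrancisBond')]
```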

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission: Impossible – know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total of (citations / number of authors)
  – Average of (citation impact factor / number of authors)

• Problems:

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the paper

Review and Conclusions 64
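The first three scores above can be made concrete in a few lines. The citation figures here are invented for illustration:

```python
# Hypothetical citation records for one author's three papers.
papers = [
    {"citations": 120, "self_citations": 10, "authors": 3},
    {"citations": 45,  "self_citations": 5,  "authors": 2},
    {"citations": 8,   "self_citations": 2,  "authors": 1},
]

total = sum(p["citations"] for p in papers)
total_minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
# Split each paper's citations equally among its co-authors.
share = sum(p["citations"] / p["authors"] for p in papers)

print(total, total_minus_self, round(share, 1))  # 173 156 70.5
```

Note how the three scores rank the same record differently: dividing by the number of authors penalises heavily co-authored papers, which is exactly the kind of design choice that invites gaming.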

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core – you can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote – not very accurate

• On the web: citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
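The random walk with teleportation can be computed by simple power iteration. The four-page link structure below is made up for illustration:

```python
def pagerank(links, n, teleport=0.10, iters=100):
    """Power iteration for PageRank with a 10% teleportation rate.
    links maps each page to its list of outlink targets."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [teleport / n] * n          # teleportation mass, spread evenly
        for page, outs in links.items():
            if outs:                      # spread 90% of rank along outlinks
                share = (1 - teleport) * rank[page] / len(outs)
                for q in outs:
                    new[q] += share
            else:                         # dead end: jump uniformly (all mass)
                for q in range(n):
                    new[q] += rank[page] / n
        rank = new
    return rank

# Toy graph: everyone links (directly or not) to page 2; nobody links to page 3
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks = pagerank(links, 4)
print([round(r, 3) for r in ranks])
```

Page 3 has no inlinks, so it ends up with only the teleportation mass (0.1/4 = 0.025), while page 2, which everyone points at, gets the highest rank.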

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, instructing some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74

Current Status

• There is a continuous battle between:

  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – passively, for information access
  – actively, for collaboration

• The internet is interesting in and of itself

  – as a source of data about existing language
  – as a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics – (pre-req HG251, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

• All share some characteristics of speech and text

• Usage norms are not fixed

• Communication methods may disappear before the norms are fixed
  (Usenet, Bulletin Boards, Gopher, Archie, …)

• Large-scale discourse studies are still to be done

• Some genuinely new things:

  – time-lagged multi-person conversation
  – raw, un-edited text
  – extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time-bound               space-bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                       Text-like
time-bound                        space-bound
spontaneous                       contrived
face-to-face                      visually decontextualized
loosely structured                elaborately structured
socially interactive (comments)   factually communicative
immediately revisable             repeatedly revisable
prosodically rich                 graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

  – <rant>I can't stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  – lusers (users, as seen by Systems Administrators)
  – suits

• Need to be in-group

cow orker: Coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

  – Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  – GREAT, *great*, _great_
  – :-) (^_^) (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

  – every time a file is opened, a new copy is stored

• CVS, Subversion, Git

  – changes to a collection of files are tracked
  – simultaneous changes are merged

• Revision Tracking

  – revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge;
  people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

• Content policies: NPOV, Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article?

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of markup: Visual vs. Logical
  (WYSIWYG, WYSIAYG, WYSIWIM)

• The structure of the Web – hypertext:
  pages linked to pages

• The future of the Web

• Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38


atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Email, Usenet, Chat and Blogs

ã All share some characteristics of speech and text

ã Usage norms not fixed

ã Communication methods may disappear before the norms are fixed
   (Usenet, Bulletin Boards, Gopher, Archie, …)

ã Large scale discourse studies still to be done

ã Some genuinely new things:

â time-lagged multi-person conversation
â raw, un-edited text
â extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time bound               space bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

ã Inspired by shared background

â <rant>I can’t stand this</rant>
â I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
â lusers (users, as seen by Systems Administrators)
â suits

ã Need to be in-group

cow orker: Coworker
clue: “You couldn’t get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dance.”
— Edward Flaherty (talk.bizarre)

Review and Conclusions 28

ã Gricean Principles

â Posting (top/middle/bottom quoting and trimming)

ã Inspired by medium limitations

â GREAT, great, great
â :-) (^_^) (o)
â brb, RTFM, IMHO
â lower case
â lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

ã Version Control Systems

ã Wikipedia

ã Licensing and Ownership

30

Version Control Systems

ã Versioning file systems

â every time a file is opened, a new copy is stored

ã CVS, Subversion, Git

â changes to a collection of files are tracked
â simultaneous changes are merged

ã Revision Tracking

â Revisions are stored within a file

ã Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge, and people like to do this

ã Most edits are done by insiders

ã Most content is added by outsiders

ã Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

Review and Conclusions 34

What is a good article?

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

ã The structure of Markup: Visual vs Logical
   (WYSIWYG, WYSIAYG, WYSIWIM)

ã The structure of the Web — hypertext:
   pages linked to pages

ã The future of the Web

ã Linguistic features of the web:
   un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or in specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers’ markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: Search Engine as Query tool and WWW as corpus
   (Objection: results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow us to draw conclusions about the data
⊗ Google is missing functionality that linguists and lexicographers would like to have

ã Web Sample: use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help – but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or “google hit” is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or “web hit” is the more general term

ã Normally you compare two phenomena to get a unitless ratio
   (e.g. different from vs different than):
   251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB, for “Ghits per billion documents”, is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
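The ratio above can be reproduced in a couple of lines (the counts are the ones on this slide):

```python
# Reproducing the unitless ratio above (counts from this slide).
from_hits = 251_000_000   # "different from"
than_hits = 71_500_000    # "different than"
ratio = from_hits / than_hits
print(f"{ratio:.1f}:1")   # prints 3.5:1
```

Reporting the ratio rather than the raw counts is what makes the comparison survive changes in index size.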

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
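Steps 1–2 above can be sketched as follows (the seed words here are invented placeholders, and the sampling scheme is just one simple choice):

```python
import itertools
import random

# A sketch of steps 1-2: combine seed words into multi-word
# search-engine queries. The seed words are invented placeholders;
# real runs use hundreds of mid-frequency words per language.
seeds = ["language", "internet", "technology", "corpus", "web"]

def make_queries(seed_words, per_query=3, n_queries=6, rng=None):
    """Sample word combinations and join them into query strings."""
    rng = rng or random.Random(0)
    combos = list(itertools.combinations(seed_words, per_query))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

queries = make_queries(seeds)
```

Each query is then sent to a search engine (step 3), and the returned URLs are downloaded and postprocessed.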

Internet Corpora Summary

atilde The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are non-intended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
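The character n-gram ranking approach above can be sketched in a few lines, in the spirit of Cavnar and Trenkle's rank-order ("out-of-place") method; the training texts below are tiny invented samples, so this only illustrates the mechanics:

```python
from collections import Counter

# Toy character-n-gram language identification: build a ranked
# n-gram profile per language, then pick the language whose profile
# is closest to the document's profile.
def profile(text, n=3, top=50):
    """Most frequent character n-grams, ordered by frequency."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank displacements; absent n-grams get a max penalty."""
    dist = 0
    for rank, g in enumerate(doc_profile):
        if g in lang_profile:
            dist += abs(rank - lang_profile.index(g))
        else:
            dist += len(lang_profile)
    return dist

langs = {
    "en": profile("the cat sat on the mat with the other cats"),
    "de": profile("die katze sass auf der matte mit den anderen katzen"),
}
doc = profile("the dog and the cat")
guess = min(langs, key=lambda lang: out_of_place(doc, langs[lang]))
```

With realistic training data (thousands of characters per language) the same scheme is surprisingly accurate, but it degrades on exactly the hard cases listed above: short snippets, similar languages, and mixed text.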

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel

Review and Conclusions 54
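Several of these steps can be chained into a toy pipeline (the stop-word and suffix lists below are illustrative only; real systems use resources such as the Porter stemmer):

```python
import re

# A toy normalization pipeline: lowercase, strip stop words, and
# apply a naive suffix stemmer. The STOP and SUFFIXES lists are
# invented for illustration.
STOP = {"the", "a", "an", "of"}
SUFFIXES = ("ers", "er", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping a minimal stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def normalize(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP]

out = normalize("The producer produces the products")
```

Note how stemming conflates "producer" and "produces" to the same non-word stem "produc", whereas lemmatization would map each to a real dictionary form.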

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness

Review and Conclusions 58
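Well-formedness checking is mechanical; a sketch with Python's standard xml.etree parser:

```python
import xml.etree.ElementTree as ET

# A parse failure means the document is not well-formed XML.
def well_formed(doc: str) -> bool:
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

ok = well_formed("<course><code>HG2052</code></course>")
bad = well_formed("<course><code>HG2052</course>")  # mismatched tag
```

Well-formedness (every tag properly nested and closed) is distinct from validity, which additionally checks the document against a schema or DTD.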

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …

Review and Conclusions 59
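The triple model can be illustrated with plain tuples; the "ex:" names below are invented stand-ins for full URIs, and match() is a toy stand-in for a real query language such as SPARQL:

```python
# RDF data is a set of <subject, predicate, object> triples.
# The "ex:" names are invented placeholders for full URIs.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:topic", "ex:Internet"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

about_course = match(s="ex:HG2052")
```

Because every statement has the same shape, triples from multiple sources can simply be unioned into one set and queried together, which is the "web of data" idea above.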

Criticism of the Semantic Web

Doctorow’s seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible: know thyself

5 Schemas aren’t neutral

6 Metrics influence results

7 There’s more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structured data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by ‘good’ papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper

Review and Conclusions 64
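The first three scores above can be computed directly once the citations are listed; a toy sketch with invented data for one paper:

```python
# Toy citation scores for one invented paper with two authors.
# Each citing paper is marked for whether its author list overlaps
# the cited paper's (i.e. whether it is a self-citation).
citations = [
    {"self_citation": True},
    {"self_citation": False},
    {"self_citation": False},
]
n_authors = 2

total = len(citations)                          # one vote per citation
minus_self = sum(1 for c in citations
                 if not c["self_citation"])     # self-citations removed
per_author = total / n_authors                  # author-normalized count
```

The weighted variant, where each citing paper's vote is scaled by its own quality, is exactly the recursion that PageRank formalizes on the next slides.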

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language, Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article’s vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
   what proportion of the time someone will be there

ã This long-term visit rate is the page’s PageRank

ã PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (each with probability 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: “jumping” from a dead end is independent of the teleportation rate

Review and Conclusions 70
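The teleporting random walk above can be sketched as a small power iteration (assuming numpy; the three-page graph is invented, and dead ends jump uniformly, as described):

```python
import numpy as np

def pagerank(links, teleport=0.10, iters=100):
    """Power iteration for PageRank with teleporting.

    links maps each page to its outgoing links; a dead end
    (no outlinks) jumps to every page uniformly.
    """
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    P = np.zeros((n, n))          # P[i, j] = prob. of moving i -> j
    for p, outs in links.items():
        i = idx[p]
        if not outs:              # dead end: jump anywhere
            P[i, :] = 1.0 / n
        else:
            P[i, :] = teleport / n                  # teleport share
            for q in outs:                          # follow-a-link share
                P[i, idx[q]] += (1.0 - teleport) / len(outs)
    r = np.full(n, 1.0 / n)       # start from a random page
    for _ in range(iters):        # iterate towards the steady state
        r = r @ P
    return dict(zip(pages, r))

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": []})
```

The returned values are the long-term visit rates: they sum to one, and pages with more (and better-ranked) inlinks, like C here, end up with higher PageRank.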

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index

â Google does not index the target of a link marked nofollow
â Yahoo! does not include the link in its ranking
â …

74
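A crawler-side sketch of how nofollow links can be separated out when harvesting anchors, using Python's standard html.parser (the URLs below are invented):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hyperlinks, separating out rel="nofollow" ones."""

    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        href = d.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens,
        # e.g. rel="external nofollow"
        rel = (d.get("rel") or "").split()
        (self.nofollowed if "nofollow" in rel
         else self.followed).append(href)

page = ('<a href="http://example.com/a">good</a> '
        '<a rel="nofollow" href="http://example.com/spam">spam</a>')
parser = LinkCollector()
parser.feed(page)
```

A link-analysis system can then simply skip the nofollowed list when propagating endorsement, which is the point of the attribute.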

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z
   → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76
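Because resolution goes through the central resolver at doi.org, turning a DOI name into a persistent link is just string concatenation (a sketch; the DOI is the one cited above):

```python
# The central DOI resolver forwards https://doi.org/<name> to the
# current location stored in the DOI metadata, so the link stays
# valid even when the target URL changes.
def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI name."""
    return "https://doi.org/" + doi

link = doi_url("10.1007/s10579-008-9062-z")
```

This is why citing a DOI is more robust than citing a publisher URL: the publisher URL above may break, while the resolver link keeps working as long as the metadata is maintained.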

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python;
   introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req HG251, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 25: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
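The arithmetic behind these units is trivial but worth making explicit. A minimal sketch; the ratio uses the counts from the slide, while the hit count and 40-billion-document index size in the GPB call are invented illustrations:

```python
def ratio(hits_a, hits_b):
    """Unitless ratio between two web counts."""
    return hits_a / hits_b

def gpb(hits, index_size):
    """Ghits per billion documents, given an (estimated) index size."""
    return hits / (index_size / 1_000_000_000)

# "different from" vs "different than", counts as accessed 2012-04-04
print(round(ratio(251_000_000, 71_500_000), 1))  # -> 3.5

# 1,000 hits against a hypothetical 40-billion-document index
print(gpb(1_000, 40_000_000_000))  # -> 25.0
```

Because both counts come from the same index at the same time, the ratio stays meaningful even as the index grows.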

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
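Steps 1 and 2 above can be sketched as follows, in the style of BootCaT-like tools; the seed list, tuple size and counts are toy values, not the ones Sharoff used:

```python
import itertools
import random

def make_queries(seeds, tuple_size=3, n_queries=10, rng_seed=0):
    """Combine seed words into random query tuples for a search engine."""
    rng = random.Random(rng_seed)                      # deterministic for the demo
    combos = list(itertools.combinations(seeds, tuple_size))
    rng.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

seeds = ["language", "technology", "internet", "corpus", "markup", "citation"]
queries = make_queries(seeds, n_queries=5)
print(len(queries))  # -> 5
```

Each query string would then be sent to a search-engine API (step 3) and the returned URLs downloaded (step 4).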

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  - Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

- Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are not intended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations: Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words

  - Similarity-based categorisation and classification
    ∗ Character n-gram rankings

  - Machine Learning based methods
  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
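The character n-gram ranking approach can be sketched with a simplified variant of Cavnar and Trenkle's out-of-place measure; the reference sentences are toy stand-ins for real training data:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in a text."""
    text = " " + text.lower().replace("\n", " ") + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(profile, reference):
    """Sum each n-gram's position in the reference ranking, with a maximum
    penalty for unseen n-grams (a simplification of out-of-place distance)."""
    penalty = len(reference)
    return sum(reference.index(g) if g in reference else penalty
               for g in profile)

def identify(text, references):
    """Pick the reference language whose profile is closest to the text's."""
    profile = ngram_profile(text)
    return min(references, key=lambda lang: distance(profile, references[lang]))

refs = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 20),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund " * 20),
}
print(identify("the dog jumps over the fox", refs))  # -> en
```

Note how badly this degrades on short snippets: with only a handful of n-grams, the distances to different reference profiles converge, which is exactly the difficulty the slide mentions.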

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel

Review and Conclusions 54
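Number normalization can be sketched with a few regular expressions; this toy version covers only the three money expressions on the slide, where a real normalizer would need far more patterns:

```python
import re

def normalize_number(s):
    """Map a few money expressions onto a plain dollar amount (toy coverage)."""
    s = s.lower().replace(",", "").strip()
    if m := re.fullmatch(r"\$(\d+(?:\.\d+)?)k", s):
        return round(float(m.group(1)) * 1_000)
    if m := re.fullmatch(r"\$(\d+)", s):
        return int(m.group(1))
    if m := re.fullmatch(r"(\d+(?:\.\d+)?) million dollars", s):
        return round(float(m.group(1)) * 1_000_000)
    raise ValueError("unrecognised: " + s)

print(normalize_number("$700K"), normalize_number("$700,000"),
      normalize_number("0.7 million dollars"))  # -> 700000 700000 700000
```

Once the three surface forms map to the same value, a search or count over the corpus treats them as the same expression.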

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness

Review and Conclusions 58
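Checking syntactic well-formedness needs nothing more than an XML parser; a minimal sketch using Python's standard library (the toy tags are invented):

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # -> True
print(well_formed("<course><code>HG2052</course>"))         # -> False (mismatched tag)
```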

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, …

Review and Conclusions 59
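The triple model can be sketched with plain Python tuples rather than a real RDF library; the ex: identifiers are invented stand-ins for full URIs:

```python
# Each statement is a <subject, predicate, object> triple of identifiers.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [(s2, p2, o2) for (s2, p2, o2) in triples
            if s in (None, s2) and p in (None, p2) and o in (None, o2)]

print(len(query(p="ex:worksAt")))  # -> 1
```

Reasoning then amounts to chaining such queries: anything that is taught by someone who works at ex:NTU can be inferred to be connected to ex:NTU, without that triple being stated explicitly.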

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible - know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight Citations by Quality of the paper

Review and Conclusions 64
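The first few scores above can be sketched as simple aggregations over a list of papers; the figures are invented:

```python
# Each paper: (citations, authors, self_citations) - invented figures.
papers = [(120, 2, 10), (45, 3, 5), (8, 1, 0)]

total = sum(c for c, _, _ in papers)                 # Total Number of Citations
total_minus_self = sum(c - s for c, _, s in papers)  # ... minus Self-citations
per_author = sum(c / a for c, a, _ in papers)        # Total of (Citations / Authors)

print(total, total_minus_self, per_author)  # -> 173 158 83.0
```

The per-author variant already changes the ranking of the three papers' contributions, which is one reason the choice of score matters.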

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Component - within it, one can travel from any page to any other page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote - not very accurate

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting - to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
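The random walk with teleportation can be sketched as a power iteration over a tiny invented three-page graph, using the 10% teleportation rate from the slide:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration for PageRank. links[i] lists the pages page i points to."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [0.0] * n
        for i, outs in enumerate(links):
            if outs:
                for j in range(n):            # teleport: 10% spread over all pages
                    new[j] += teleport * rank[i] / n
                for j in outs:                # otherwise follow a random outlink
                    new[j] += (1 - teleport) * rank[i] / len(outs)
            else:
                for j in range(n):            # dead end: always jump uniformly
                    new[j] += rank[i] / n
        rank = new
    return rank

# Page 0 links to 1; page 1 links to 0 and 2; page 2 links to 0.
r = pagerank([[1], [0, 2], [0]])
print(round(sum(r), 6))  # -> 1.0 (the ranks form a probability distribution)
print(r.index(max(r)))   # -> 0 (page 0 collects the most inlink mass)
```

Note the two cases match the slide exactly: dead ends redistribute all their mass uniformly regardless of the teleportation rate, while other pages teleport with probability 10%.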

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links

- The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo! does not include the link in its ranking
  - …

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z resolves to
    http://www.springerlink.com/content/v7q114033401th5u/

Review and Conclusions 76
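Resolution itself is just a matter of handing the DOI name to the central resolver, which looks up the current location in the metadata; a minimal sketch:

```python
def doi_url(doi):
    """Build the resolver URL for a DOI name via the central proxy (doi.org)."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# -> https://doi.org/10.1007/s10579-008-9062-z
```

Because only the metadata behind the DOI changes when a publisher reorganises its site, this URL keeps working even after the springerlink.com location above goes stale.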

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (Language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer - solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics - (Pre-req HG251, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                        Text-like
time bound                         space bound
spontaneous                        contrived
face-to-face                       visually decontextualized
loosely structured                 elaborately structured
socially interactive (comments)    factually communicative
immediately revisable              repeatedly revisable
prosodically rich                  graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits

- Need to be in-group

cow orker (coworker)
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT / great / great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is saved, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31
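The change tracking these systems perform can be sketched with Python's standard difflib; the two revisions here are invented:

```python
import difflib

old = ["Wikipedia is an encyclopedia.", "Anyone can edit it."]
new = ["Wikipedia is a free encyclopedia.", "Anyone can edit it."]

# unified_diff yields header lines, hunk markers, then -/+ change lines
diff = list(difflib.unified_diff(old, new, lineterm=""))
changes = [line for line in diff
           if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
print(changes)
# -> ['-Wikipedia is an encyclopedia.', '+Wikipedia is a free encyclopedia.']
```

A version control system stores exactly such deltas per revision, together with the author, which is what makes responsibility for each change explicit.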

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge;
  people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

- The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWYM

- The structure of the Web: hypertext, pages linked to pages

- The future of the Web

- Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802/ 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)


atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 27: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Chat (synchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users as seen by Systems Administrators)
  - suits

- Need to be in-group

  cow orker: Coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of
  horny clues if you smeared your body with clue musk and did the clue mating
  dance" Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: Explicit responsibility for changes
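The change tracking described above can be illustrated with the standard library's `difflib`; this is only a sketch of what a revision diff looks like, not how CVS or Git store history internally, and the revision texts are invented:

```python
import difflib

def diff_revisions(old, new, old_name="r1", new_name="r2"):
    """Return a unified diff between two revisions of a text."""
    return list(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_name, tofile=new_name))

r1 = "The internet is useful.\nIt has cats.\n"
r2 = "The internet is useful.\nIt has cats and corpora.\n"

for line in diff_revisions(r1, r2):
    print(line, end="")
```

The `-`/`+` lines make authorship of each change explicit, which is exactly what revision tracking in shared writing provides.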

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every
  single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge; people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

- The structure of markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWIM)

- The structure of the Web: hypertext, pages linked to pages

- The future of the Web

- Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links
them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g.
ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way
(e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by
JavaScript, as well as content dynamically downloaded from Web servers via Flash
or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video)
files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Show the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: Search Engine as Query tool and WWW as corpus
  (Objection: Results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow to draw conclusions on the data
  × Google is missing functionalities that linguists / lexicographers would like to have

- Web Sample: Use a search engine to download data from the net and build a corpus
  from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  × sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, Scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller
  2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the
  early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB for "Ghits per billion documents" is good if you want a more stable number
  (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios
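The unitless ratio recommended above is simple arithmetic; a minimal sketch, using the ghit counts quoted for "different from" vs "different than" (the helper name `whit_ratio` is made up):

```python
def whit_ratio(count_a, count_b):
    """Express two web-hit counts as a unitless ratio a:b, with b normalised to 1."""
    return count_a / count_b

# ghits for "different from" vs "different than", as reported on 2012-04-04
ratio = whit_ratio(251_000_000, 71_500_000)
print(f"{ratio:.1f}:1")  # -> 3.5:1
```

Because both counts come from the same index on the same day, the index size cancels out, which is why the ratio is more stable than either raw count.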

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora
  (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here:
    taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long,
    word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as
Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
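Step 2 of the outline above (combining seed words into queries) can be sketched offline; the seed list is a toy stand-in for the 500 real seeds, and `build_queries` is an illustrative name, not Sharoff's actual code:

```python
from itertools import combinations

def build_queries(seed_words, words_per_query=2):
    """Combine seed words into multi-word search-engine queries."""
    return [" ".join(combo) for combo in combinations(seed_words, words_per_query)]

seeds = ["language", "internet", "corpus", "technology"]  # toy seed list
queries = build_queries(seeds)
print(len(queries))   # C(4,2) = 6 queries
print(queries[0])
```

With 500 seeds, pairs alone give far more than the 6,000 queries needed, so a real pipeline samples from the combinations rather than using them all.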

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    × unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    × Less data

- Richer data than a compiled corpus

  × Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word … Metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations: Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it
  (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words
  - Similarity-based categorisation and classification
    * Character n-gram rankings
  - Machine Learning based methods
  - Context (under jp or ko)

- Hard to do for short text snippets, similar languages, mixed text
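The "character n-gram rankings" method above can be sketched in the style of Cavnar and Trenkle's out-of-place measure; the two training sentences are toy stand-ins (real profiles are trained on far more text), and the function names are invented:

```python
from collections import Counter

def ngram_profile(text, n=3, top=200):
    """Ranked list of the most frequent character n-grams in a text."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank displacements; unseen n-grams get a maximum penalty."""
    penalty = len(profile)
    return sum(profile.index(g) if g in profile else penalty
               for g in doc_profile)

def identify(text, profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

# toy training data for two language profiles
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog "
                        "this is a test of the emergency broadcast system"),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund "
                        "dies ist ein test des systems"),
}
print(identify("this is the test", profiles))  # -> en
```

Even this toy version shows why short snippets and similar languages are hard: with few n-grams, the rank displacements between candidate languages become too close to call.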

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000AD, 1421AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon cel
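Several of the steps above (lowercasing, stop-word stripping, stemming) can be chained in a few lines; the suffix list is a deliberately naive toy, not the Porter stemmer:

```python
def normalize(tokens, stop_words=frozenset({"the", "a", "of"})):
    """Lowercase, strip stop words, and apply a naive suffix-stripping stemmer."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stop_words:
            continue               # stop-word stripping
        for suffix in ("er", "es", "s"):   # toy stemmer, not Porter
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[:-len(suffix)]
                break
        out.append(tok)
    return out

print(normalize(["The", "producer", "produces", "goods"]))
# -> ['produc', 'produc', 'good']
```

Note how stemming conflates producer and produces to the non-word stem produc, exactly as in the slide, whereas lemmatization would map produces to the dictionary form produce.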

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes possible integrating multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
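The well-formedness check mentioned above is a one-liner with the standard library's XML parser; a minimal sketch (the example documents are invented):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the XML parses, i.e. every open tag has a matching close."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course>"))         # False (unclosed <code>)
```

Well-formedness is purely syntactic; checking that a document also follows a particular schema (validity) needs a DTD or XML Schema on top of this.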

Review and Conclusions 58

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Universal Resource Name: urn:isbn:1575864606
  - Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDFs are written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, …

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible (know thyself)

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of
  the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight Citations by Quality of the paper
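The first three scores above can be computed from a citation graph directly; a minimal sketch over invented data (the representation, papers mapping an id to its author count and the set of citing papers, is made up for illustration):

```python
def citation_scores(papers):
    """Toy citation metrics; papers maps id -> (n_authors, set of citing ids)."""
    scores = {}
    for pid, (n_authors, cited_by) in papers.items():
        total = len(cited_by)
        non_self = len({c for c in cited_by if c != pid})  # drop self-citations
        scores[pid] = {
            "citations": total,
            "minus_self": non_self,
            "per_author": total / n_authors,
        }
    return scores

papers = {                      # invented data
    "A": (2, {"B", "C", "A"}),  # A cites itself once
    "B": (1, {"C"}),
}
s = citation_scores(papers)
print(s["A"])  # {'citations': 3, 'minus_self': 2, 'per_author': 1.5}
```

The "not all citations are equal" problem is what these raw counts cannot capture; fixing it by weighting each vote by the citing paper's own score leads directly to PageRank-style recursion.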

Review and Conclusions 64

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to High Impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core (can travel from any page to any page)

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path.to.there/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator
     of page A

  2. The (extended) anchor text pointing to page B is a good description of page B

     This is not always the case; for instance, most corporate websites have a
     pointer from every page to a page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote (not very accurate)

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page,
    equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total
  number of web pages)

- At a non-dead end, with probability 10%, jump to a random web page (to each with
  a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with
    probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
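The teleporting walk above can be computed by power iteration; a minimal sketch, where the four-page graph is invented and `teleport=0.10` matches the 10% rate in the slide (note the per-link probability is (1 - teleport)/len(out), the slide's 0.225 for 4 links):

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration for PageRank with teleporting.

    links maps each page to the list of pages it links to;
    a dead end spreads its rank 1/N to every page.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = dict.fromkeys(pages, 0.0)
        for p, out in links.items():
            if not out:                          # dead end: jump anywhere
                for q in pages:
                    nxt[q] += rank[p] / n
            else:                                # teleport, or follow a link
                for q in pages:
                    nxt[q] += teleport * rank[p] / n
                for q in out:
                    nxt[q] += (1 - teleport) * rank[p] / len(out)
        rank = nxt
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
ranks = pagerank(web)
print(round(sum(ranks.values()), 6))   # stays 1.0: a probability distribution
print(max(ranks, key=ranks.get))       # the page with the highest steady-state rate
```

After enough iterations the vector stops changing: that fixed point is the steady-state visit rate of the previous slide.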

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam
  takes advantage of link-based ranking algorithms, which give websites higher
  rankings the more other highly ranked websites link to them. Examples include
  adding links within blogs

- Link Farms: creating tightly-knit communities of pages referencing each other,
  also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content
  and create "content" for a website. The specific presentation of content on these
  sites is unique, but is merely an amalgamation of content taken from other
  sources, often without permission

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing,
  such as wikis, blogs and guestbooks. Agents can be written that automatically
  randomly select a user-edited web page, such as a Wikipedia article, and add
  spamming links

  The nofollow link: a value that can be assigned to the rel attribute of an HTML
  hyperlink to instruct some search engines that the hyperlink should not influence
  the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
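A crawler honouring nofollow checks the rel attribute before counting a link as a vote (so a link like `<a href="..." rel="nofollow">...</a>` carries no endorsement); a minimal sketch with an invented helper name:

```python
def counts_for_ranking(rel_attr):
    """True if a link with this rel attribute should count as a ranking vote."""
    # rel can hold several space-separated values, e.g. "external nofollow"
    return "nofollow" not in (rel_attr or "").split()

print(counts_for_ranking(None))                 # True: ordinary link
print(counts_for_ranking("nofollow"))           # False: discounted
print(counts_for_ranking("external nofollow"))  # False: discounted
```

As the slide notes, different engines discount such links in different ways; this sketch only shows the detection, not any particular engine's policy.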

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies
  coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000
  organizations

  - DOI 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u
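Because the location lives in the DOI metadata, resolving one is just prefixing the central doi.org resolver; the resolver then redirects to the current URL, so links keep working even when the publisher moves the page:

```python
def doi_url(doi):
    """Build the resolver URL for a DOI; doi.org redirects to the current location."""
    return f"https://doi.org/{doi}"

print(doi_url("10.1007/s10579-008-9062-z"))
# -> https://doi.org/10.1007/s10579-008-9062-z
```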

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (Language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An
  Introduction to Natural Language Processing, Computational Linguistics and Speech
  Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language
  Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information
  Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python;
  introduces both programming and linguistics

- HG3051 Corpus Linguistics: (pre-req HG2051, waived) This course is an
  introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 28: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine Learning based methods
  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text
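The "character n-gram rankings" approach can be sketched with the classic out-of-place measure of Cavnar and Trenkle: rank each language's most frequent character n-grams, then pick the language whose ranking is closest to the document's. The one-sentence training texts below are toy stand-ins; real profiles are trained on far more data.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams (space-padded)."""
    text = " " + text.lower() + " "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in counts.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank differences; n-grams missing from the reference
    profile get a fixed maximum penalty."""
    ranks = {g: r for r, g in enumerate(profile)}
    penalty = len(profile)
    return sum(abs(ranks.get(g, penalty) - r) for r, g in enumerate(doc_profile))

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

# Toy training data (real systems train on much larger samples).
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "nl": ngram_profile("de snelle bruine vos springt over de luie hond en de kat"),
}
print(identify("the dog and the fox", profiles))   # -> en
```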

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel
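Number normalization can be sketched as mapping the surface variants above onto a single value. The multiplier table and regular expression below are illustrative only; a production normalizer handles many more patterns (currencies, ranges, ordinals, …).

```python
import re

# Hypothetical multiplier table for suffixes seen in the examples above.
MULT = {"k": 1_000, "m": 1_000_000, "million": 1_000_000, "billion": 1_000_000_000}

def normalize_amount(s):
    """Map variants like '$700K', '$700,000', '0.7 million dollars'
    to one numeric value."""
    s = s.lower().replace(",", "").replace("$", "").replace("dollars", "").strip()
    m = re.fullmatch(r"([\d.]+)\s*([a-z]*)", s)
    number, suffix = float(m.group(1)), m.group(2)
    return int(number * MULT.get(suffix, 1))

for variant in ["$700K", "$700,000", "0.7 million dollars"]:
    print(variant, "->", normalize_amount(variant))   # all -> 700000
```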

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
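Checking syntactic well-formedness (every open tag properly closed and nested) is mechanical; with Python's standard xml.etree parser it is one try/except. The HG2052 strings are just example input.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    """Syntactic check only: tags nest and close properly.
    (Validity against a DTD or schema is a separate, stronger check.)"""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course></code>"))  # False
```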

Review and Conclusions 58

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Universal Resource Name: urn:isbn:1575864606
  - Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, …
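The triple model can be sketched as a set of <subject, predicate, object> statements queried by pattern matching (real RDF stores use SPARQL for this). The "ex:" identifiers below are invented stand-ins for full URIs.

```python
# A toy triple store: a set of <subject, predicate, object> statements.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(s="ex:HG2052"))     # everything asserted about HG2052
print(match(p="ex:worksAt"))    # every <who, worksAt, where> statement
```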

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible — know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight Citations by Quality of the paper
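The first few scores are straightforward arithmetic once citations are counted; a sketch with invented numbers for a hypothetical two-author paper:

```python
def citation_scores(n_authors, citations, self_citations):
    """The simple bibliometric scores listed above, for one paper."""
    return {
        "total": citations,
        "minus_self": citations - self_citations,
        "per_author": citations / n_authors,
    }

# A hypothetical two-author paper cited 40 times, 4 of them self-citations.
print(citation_scores(n_authors=2, citations=40, self_citations=4))
# {'total': 40, 'minus_self': 36, 'per_author': 20.0}
```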

Review and Conclusions 64

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
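Collecting the anchor text that points at each URL is a small parsing job; a sketch using Python's standard html.parser (the example.com URL and page text are invented):

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Gather the anchor text pointing at each URL; in link analysis
    that text serves as a description of the *target* page."""
    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:      # only collect text inside <a>…</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors[self._href].append("".join(self._text).strip())
            self._href = None

c = AnchorCollector()
c.feed('<p>See the <a href="http://example.com/hg2052">HG2052 Course Page</a>.</p>')
print(dict(c.anchors))   # {'http://example.com/hg2052': ['HG2052 Course Page']}
```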

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: Each article gets one vote – not very accurate

- On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
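The random walk with teleportation can be computed by power iteration. The toy implementation below follows the numbers above (10% teleport rate; dead ends jump uniformly with probability 1); it is a didactic sketch, not a production ranker, and the three-page graph is invented.

```python
def pagerank(links, n_iter=100, teleport=0.10):
    """Power iteration for PageRank with teleportation.
    links maps each page to its list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(n_iter):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if not links[p]:                 # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport: 0.1/N to every page
                    new[q] += teleport * rank[p] / n
                for q in links[p]:           # otherwise follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(links[p])
        rank = new
    return rank

# A and B both link only to C (a dead end), so C collects the most rank.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
print(max(ranks, key=ranks.get))   # -> C
```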

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
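A crawler that honours nofollow simply skips such links when counting votes; a sketch with Python's standard html.parser (the two domains are invented examples):

```python
from html.parser import HTMLParser

class NofollowFilter(HTMLParser):
    """Collect only the links a ranking crawler would count,
    skipping those marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.counted = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").split()   # rel can list several values
        if "nofollow" not in rel and "href" in a:
            self.counted.append(a["href"])

f = NofollowFilter()
f.feed('<a href="http://good.example/">ok</a> '
       '<a rel="nofollow" href="http://spam.example/">spam</a>')
print(f.counted)   # ['http://good.example/']
```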

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
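Because resolution goes through a central proxy (currently doi.org), turning a DOI into a stable link is trivial, and the link keeps working even when the publisher moves the document:

```python
def doi_url(doi):
    """Build a resolver link for a DOI; the doi.org proxy redirects
    to wherever the publisher currently hosts the document."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```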

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (Language itself)

- Writing (invented 3–5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer — solving NLP problems with Python;
introduces both programming and linguistics

- HG3051 Corpus Linguistics — (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits

- Need to be in-group

cow orker (Coworker)
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, *great*, _great_
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge;
people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

- The structure of Markup: Visual vs Logical
  WYSIWYG, WYSIAYG, WYSIWYM

- The structure of the Web — hypertext:
pages linked to pages

- The future of the Web

- Linguistic features of the web:
un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: Search Engine as Query tool and WWW as corpus
(Objection: Results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionalities that linguists and lexicographers would like to have

- Web Sample: Use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, Scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than):
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman),
but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
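Reducing two raw counts to a unitless ratio is simple arithmetic, shown here with the slide's own figures (the function name is an illustrative invention):

```python
def whit_ratio(hits_a, hits_b):
    """Reduce two raw web counts to a unitless ratio; absolute ghit
    counts drift with index size, but ratios are more comparable."""
    return hits_a / hits_b

# The slide's counts for "different from" vs "different than" (2012-04-04):
r = whit_ratio(251_000_000, 71_500_000)
print(f"{r:.1f} : 1")   # 3.5 : 1
```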

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46


Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 30: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

• Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

• The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWYM

• The structure of the Web — hypertext: pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

– What you see is what you get (WYSIWYG)
– Equivalent of printers' markup
– Shows what things look like

• Logical Markup (Structural)

– Show the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: Search Engine as Query tool and WWW as corpus (Objection: results are not reliable)

– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

– known size and exact hit counts → statistics possible
– people can draw conclusions over the included text types
– (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, Scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

– lots of repetitions of the same text (not representative)
– very limited query precision (no upper/lower case, no punctuation)
– only estimated counts, often hard to reproduce exactly
– different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

– it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
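The ratio and GPB calculations described above are simple enough to script; a minimal sketch (the counts are the slide's own 2012 figures, and `ghit_ratio`/`gpb` are invented helper names):

```python
def ghit_ratio(count_a: int, count_b: int) -> float:
    """Unitless ratio of two web counts (document frequencies, not term frequencies)."""
    return count_a / count_b

def gpb(ghits: int, index_size: int) -> float:
    """Ghits per billion indexed documents (needs an index-size estimate)."""
    return ghits / (index_size / 1_000_000_000)

# "different from" vs "different than", ghits accessed 2012-04-04
print(round(ghit_ratio(251_000_000, 71_500_000), 1))  # 3.5, i.e. roughly 3.5:1
```

Since the index size is no longer published, the ratio is the only number here that stays meaningful over time.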

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

– gather documents for different topics (balance)
– exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
– exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
– do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
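Step 2 of this pipeline (combining seed words into queries) can be sketched in a few lines; the seed list and counts below are toy stand-ins for the 500 seeds and 6,000 queries, and `make_queries` is an invented name:

```python
import random

def make_queries(seeds, words_per_query=4, n_queries=10, seed=0):
    """Randomly combine seed words into search-engine queries."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    return [" ".join(rng.sample(seeds, words_per_query)) for _ in range(n_queries)]

seeds = ["time", "people", "water", "language", "school", "music", "city", "work"]
queries = make_queries(seeds, words_per_query=3, n_queries=5)
for q in queries:
    print(q)
```

Each query is then sent to a search engine (step 3) and the returned URLs are downloaded and postprocessed.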

Internet Corpora Summary

• The web can be used as a corpus

– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

– Keywords and Categories
– Rankings
– Structural Markup

• Implicit Meta-data

– Links and Citations
– Tags
– Tables
– File Names
– Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

– When they are accurate, they are very good
– They are often deceitful

• HTML, PDF, Word … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

– as they are unintended, they tend to be noisy
– but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations — Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

– Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words
– Similarity-based categorisation and classification
∗ Character n-gram rankings
– Machine Learning based methods
– Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
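The character n-gram ranking approach mentioned above can be sketched as follows. This is a toy version of the out-of-place distance: each language gets a frequency-ranked n-gram profile, and a document is assigned to the language whose profile its own n-grams fit best. The training strings are tiny invented samples; real profiles come from much larger corpora.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Character n-grams of the text, ranked by frequency."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(lang_profile, doc_profile):
    """Sum each document n-gram's rank in the language profile;
    n-grams absent from the profile get a fixed penalty."""
    penalty = max(len(lang_profile), 1)
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(rank.get(g, penalty) for g in doc_profile)

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

# Toy training data (invented); real systems train on megabytes per language.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund"),
}
print(identify("the fox and the dog", profiles))
```

Even this toy version shows why short snippets and closely related languages are hard: with few n-grams, the distances to the candidate profiles are close together.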

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel

Review and Conclusions 54
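Two of these steps can be sketched concretely; this is a toy illustration using only the slide's own examples (`normalize_number` and `stem` are invented helper names, and real normalizers and stemmers use far richer rule sets):

```python
import re

def normalize_number(text):
    """Map money expressions like '$700K' and '$700,000' to one canonical form."""
    text = re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: f"${int(float(m.group(1)) * 1000)}", text)
    return text.replace(",", "")

def stem(word):
    """Crude suffix stripping, in the spirit of 'produces -> produc'."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

print(normalize_number("$700K"), normalize_number("$700,000"))  # $700000 $700000
print(stem("produces"), stem("producer"))                       # produc produc
```

Note how stemming conflates producer and produces into one index term, which helps recall but loses the lemma distinction that lemmatization preserves.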

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

– provides a common data representation framework
– makes it possible to integrate multiple sources
– so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

– Universal Resource Name: urn:isbn:1575864606
– Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything

• You can build relations into ontologies (OWL)

– Then reason over them, search them, …

Review and Conclusions 59
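The triple model above needs no special machinery to illustrate: RDF proper serializes to XML, but the underlying data model is just a set of <subject, predicate, object> identifiers. A minimal sketch (the rdf:type URI is the real W3C one; the example.com and pantone URIs follow the slide's example and are illustrative, not real endpoints):

```python
# A toy triple store: a set of (subject, predicate, object) tuples.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = {
    ("http://example.com/mypage#heading", RDF_TYPE,
     "http://pantone.tm.example.com/2002/std6#color"),
    ("http://example.com/mypage#heading",
     "http://example.com/schema#label", "Sea Green"),
}

def objects(store, subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in store if s == subject and p == predicate}

print(objects(triples, "http://example.com/mypage#heading", RDF_TYPE))
```

Because every element is a URI, triples from different sources can be merged into one store and queried together, which is the "web of data" idea in miniature.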

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

– Total Number of Citations (Pretty Useful)
– Total Number of Citations minus Self-citations
– Total Number of (Citations / Number of Authors)
– Average (Citation Impact Factor / Number of Authors)

• Problems:

– Not all citations are equal: citations by 'good' papers are better
– Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper

Review and Conclusions 64
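The simple scores listed above are easy to compute once the counts are known; a minimal sketch (the function name and the numbers are invented for illustration):

```python
def citation_scores(citations, self_citations, n_authors):
    """The basic per-paper scores from the list above."""
    return {
        "total": citations,
        "minus_self": citations - self_citations,
        "per_author": citations / n_authors,
    }

# Invented example: 120 citations, 20 of them self-citations, 4 authors.
print(citation_scores(120, 20, 4))
```

The two problems noted above are exactly what these formulas ignore: every citation counts the same, and older papers have had longer to accumulate them.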

Gaming Citations

• Least/Minimum Publishable Unit

– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high Impact Factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

– Simplest measure: each article gets one vote – not very accurate

• On the web, citation frequency = inlink count

– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

– An article's vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

– For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
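The random walk with teleportation can be computed by power iteration: repeatedly redistribute each page's current rank according to the walk's transition rules until the rates settle. A minimal sketch over an invented three-page graph:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration for the teleporting random walk.

    links maps each page to its list of outlinks; a page with no
    outlinks is a dead end and jumps uniformly at random.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = dict.fromkeys(pages, 0.0)
        for p in pages:
            outs = links.get(p, [])
            if not outs:  # dead end: jump anywhere, 1/N each
                for q in pages:
                    nxt[q] += rank[p] / n
            else:
                for q in pages:  # teleport with probability 0.10
                    nxt[q] += teleport * rank[p] / n
                for q in outs:  # otherwise follow an outlink, equiprobably
                    nxt[q] += (1 - teleport) * rank[p] / len(outs)
        rank = nxt
    return rank

# Invented toy graph: A links to B and C, B links to C, C links back to A.
pr = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(pr)
```

In this graph C ends up with the highest rank: it is the only page with two inlinks, which is the "weighted citation" intuition from the previous slide made concrete.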

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

– Google does not index the target of a link marked nofollow
– Yahoo! does not include the link in its ranking
– …

74

Current Status

• There is a continuous battle between:

– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

– DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

– Passively, for information access
– Actively, for collaboration

• The internet is interesting in and of itself

– As a source of data about existing language
– As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics — (Pre-req: HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 31: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
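The random walk with teleportation above can be sketched as a short power-iteration loop. This is a toy illustration on a made-up four-page graph (the page names and link structure are invented), not Google's actual implementation:

```python
# Minimal PageRank by power iteration, with a 10% teleportation rate.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],          # dead end: all of its mass jumps to a random page
}

def pagerank(links, teleport=0.10, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start uniform
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                            # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                    # teleport share, 0.1/n each
                    new[q] += teleport * rank[p] / n
                for q in out:                      # follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # pages from highest to lowest rank
```

Note how C, with two inlinks, ends up above the dead end D, which is reached only by teleporting.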

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value (rel="nofollow") that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u/
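As a concrete illustration, a DOI name can be turned into a resolvable URL through the central proxy at doi.org (formerly dx.doi.org), which redirects to the location stored in the DOI metadata:

```python
def doi_url(doi: str) -> str:
    """Build a resolver URL for a DOI name via the doi.org proxy."""
    return "https://doi.org/" + doi

# the example DOI from the slide resolves to the publisher's page
print(doi_url("10.1007/s10579-008-9062-z"))
```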

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Version Control Systems

ã Versioning file systems

â every time a file is opened, a new copy is stored

ã CVS, Subversion, Git

â changes to a collection of files are tracked
â simultaneous changes are merged

ã Revision Tracking

â Revisions are stored within a file

ã Authorship in shared writing: explicit responsibility for changes
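The kind of change these systems record can be illustrated with Python's standard difflib, which computes the line-level diff a version-control system would store between two revisions (the revision contents here are invented):

```python
import difflib

# two revisions of a shared document
rev1 = ["Wikipedia is an encyclopedia.\n", "Anyone can edit it.\n"]
rev2 = ["Wikipedia is a free encyclopedia.\n", "Anyone can edit it.\n"]

# unified_diff records exactly what changed between rev1 and rev2,
# which is what makes responsibility for each change explicit
diff = list(difflib.unified_diff(rev1, rev2, fromfile="rev1", tofile="rev2"))
print("".join(diff))
```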

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge: people like to do this

ã Most edits are done by insiders

ã Most content is added by outsiders

ã Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability, and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

ã The structure of markup: visual vs logical (WYSIWYG, WYSIAYG, WYSIWIM)

ã The structure of the Web — hypertext: pages linked to pages

ã The future of the Web

ã Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers' markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: search engine as query tool and WWW as corpus
(Objection: results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

ã Web Sample: use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than)
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios
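The different from vs different than comparison above works out as follows (counts as reported on the slide):

```python
# ghit counts for the two variants (accessed 2012-04-04)
different_from = 251_000_000
different_than = 71_500_000

# a unitless ratio is more comparable over time than the raw counts
ratio = different_from / different_than
print(f"{ratio:.1f}:1")  # 3.5:1
```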

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding cleanup, tagging, and parsing)
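Steps 1–3 of the pipeline above can be sketched as follows. The search-engine call is faked with a stand-in function (real APIs need keys and quotas), and the seed words are placeholders for the ~500 mid-frequency words a real pipeline would use:

```python
import itertools

def fake_search(query):
    # stand-in for a real search-engine API call: returns pretend result URLs
    return [f"http://example.com/{query.replace(' ', '-')}/{i}" for i in range(3)]

seeds = ["house", "water", "people", "small"]            # step 1 (toy scale)
queries = [" ".join(pair)                                 # step 2: word-pair queries
           for pair in itertools.combinations(seeds, 2)]
urls = sorted({url for q in queries for url in fake_search(q)})  # step 3

print(len(queries), "queries,", len(urls), "URLs")
```

Steps 4–5 (downloading and post-processing) would then run over the collected URL list.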

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access:
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample:
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are non-intended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods:
∗ Diacritics
∗ Character n-grams
∗ Stop words
â Similarity-based categorisation and classification:
∗ Character n-gram rankings
â Machine-learning-based methods
â Context (under .jp or .ko)

ã Hard to do for: short text snippets, similar languages, mixed text
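The character n-gram idea can be sketched as a toy trigram profile matcher. The two profiles below are built from single invented sentences, so this illustrates the method rather than giving a usable identifier:

```python
from collections import Counter

def trigrams(text):
    # pad with spaces so word boundaries count as characters too
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# one-sentence "profiles" (toy training data)
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog"),
    "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def guess(text):
    grams = trigrams(text)
    # score: number of trigram occurrences shared with each profile
    scores = {lang: sum((grams & prof).values())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

print(guess("the dog"), guess("der hund"))
```

Note how quickly this degrades on short snippets: with only a handful of trigrams, one shared sequence can flip the decision, which is exactly the difficulty noted above.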

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel
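Stemming as crude suffix stripping can be sketched like this (a toy rule set in the spirit of the Porter stemmer; real stemmers have many more rules and conditions):

```python
def stem(word):
    # strip one common suffix, but only if a stem of 4+ letters remains
    for suffix in ("es", "er", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

print(stem("produces"), stem("producer"))  # produc produc
```

Unlike lemmatization, the result need not be a real word: both forms collapse to the non-word stem produc.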

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
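Well-formedness checking in this sense is exactly what any XML parser does; with Python's standard library (the tag names here are invented):

```python
import xml.etree.ElementTree as ET

good = "<course><code>HG2052</code></course>"
bad = "<course><code>HG2052</course>"        # <code> is never closed

ET.fromstring(good)          # parses: the document is well-formed
try:
    ET.fromstring(bad)       # raises ParseError: mismatched tags
except ET.ParseError as err:
    print("not well-formed:", err)
```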

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers (URIs)

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework (RDF)

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations into ontologies (OWL)

â Then reason over them, search them, …
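The triple model can be sketched with plain tuples. The example subject and object URIs are made up; the predicates use the real Dublin Core terms namespace:

```python
# <subject, predicate, object> statements about a course, as plain tuples;
# a real RDF store would serialize these in XML or Turtle
triples = [
    ("http://example.org/HG2052",
     "http://purl.org/dc/terms/title",
     "Language, Technology and the Internet"),
    ("http://example.org/HG2052",
     "http://purl.org/dc/terms/creator",
     "http://example.org/FrancisBond"),
]

# "you can say anything about anything": query who created HG2052
creators = [o for s, p, o in triples if p.endswith("/creator")]
print(creators)
```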

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation, and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based:
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
â Context-based: citation analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group

ã Some scores are:

â Total number of citations (pretty useful)
â Total number of citations minus self-citations
â Total of (citations / number of authors)
â Average of (citation impact factor / number of authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the citing paper
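The first few scores above, computed for a toy paper record (the numbers are invented):

```python
# a toy citation record for one paper
paper = {"citations": 30, "self_citations": 5, "authors": 3}

scores = {
    "total":      paper["citations"],
    "minus_self": paper["citations"] - paper["self_citations"],
    "per_author": paper["citations"] / paper["authors"],
}
print(scores)  # {'total': 30, 'minus_self': 25, 'per_author': 10.0}
```

Weighting by the quality of the citing paper is the step that turns this into a PageRank-style calculation.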

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write only proceedings papers (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 33: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
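Well-formedness checking is built into Python's standard library: a parse that succeeds means the document is well-formed. The tag names are invented examples.

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """True iff the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course>"))         # False
```

Note this checks only syntax; validity against a schema (DTD, XML Schema) is a separate, stricter test.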

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDFs are written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
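The triple model can be sketched without any RDF library: triples as tuples, with a pattern matcher standing in for a (much richer) SPARQL query. All the ex: URIs here are invented examples.

```python
# Each triple is <subject, predicate, object>; the URIs are invented examples.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(p="ex:worksAt"))  # [('ex:FrancisBond', 'ex:worksAt', 'ex:NTU')]
```

Real stores add serialization (RDF/XML, Turtle), namespaces, and inference over OWL ontologies on top of this basic pattern.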

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission: Impossible (know thyself)

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

Citation, Reputation and PageRank

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read and thought it interesting enough to cite

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper
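The first three scores above are simple arithmetic over a citation record. The paper records below are invented for illustration.

```python
# Invented records: each paper lists its citation count, number of
# authors, and how many of the citations are self-citations.
papers = [
    {"cites": 120, "authors": 3, "self": 10},
    {"cites": 45,  "authors": 2, "self": 5},
]

total = sum(p["cites"] for p in papers)                       # total citations
total_minus_self = sum(p["cites"] - p["self"] for p in papers)
per_author = sum(p["cites"] / p["authors"] for p in papers)   # author-normalized

print(total, total_minus_self, per_author)  # 165 150 62.5
```

None of these fixes the problems on the slide: weighting by the quality of the citing paper needs a recursive score like PageRank, sketched below under the PageRank slides.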

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web: citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web:

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady state probability

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
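The random walk with teleporting can be computed by power iteration. A minimal sketch; the three-page graph is invented, and treating a dead end as linking to every page gives exactly the 1/N jump the slide describes.

```python
def pagerank(links, teleport=0.10, iterations=50):
    """Power iteration over a link graph {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: teleport / n for p in pages}  # teleportation mass
        for p in pages:
            # A dead end behaves as if it linked to every page (jump prob 1/N).
            out = links[p] or pages
            for q in out:
                new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# Toy graph, invented for illustration: every page points (directly or
# via B) at C, and C is a dead end.
ranks = pagerank({"A": ["C"], "B": ["A", "C"], "C": []})
print(max(ranks, key=ranks.get))  # C
```

The ranks sum to 1 at every iteration, matching the interpretation of PageRank as a steady-state visit probability.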

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank

Example Graph Weighted

Pages with higher PageRanks are lighter


Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

ã Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …
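A crawler deciding which links should pass ranking credit can check the rel attribute while parsing. A sketch with Python's standard html.parser; the URLs are invented examples.

```python
from html.parser import HTMLParser

class LinkAudit(HTMLParser):
    """Separate links that pass ranking credit from rel="nofollow" links."""
    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        href = a.get("href")
        if href is None:
            return
        # rel is a space-separated list of values, one of which may be nofollow.
        if "nofollow" in (a.get("rel") or "").split():
            self.nofollowed.append(href)
        else:
            self.followed.append(href)

audit = LinkAudit()
audit.feed('<a href="http://example.com/a">ok</a>'
           '<a rel="nofollow" href="http://example.com/spam">spam</a>')
print(audit.followed, audit.nofollowed)
```

How each engine actually treats such links varies, as the bullets above note; the parsing side is the only part shown here.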

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z
http://www.springerlink.com/content/v7q114033401th5u
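Because the DOI name is permanent while the location may change, you resolve a DOI through the central resolver rather than storing the publisher URL. Building the resolver URL is a one-liner (no network access in this sketch):

```python
def doi_url(doi):
    """A DOI name resolves via the central resolver at doi.org."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```

Following that URL redirects to whatever location is currently registered in the DOI's metadata, which is the point of the indirection.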

Conclusions


Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python;
introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics


The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability, and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five_pillars

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

What is a good article?

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good_article_criteria

The World Wide Web and HTML


The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

ã The structure of Markup: Visual vs Logical
(WYSIWYG, WYSIAYG, WYSIWIM)

ã The structure of the Web — hypertext:
pages linked to pages

ã The future of the Web

ã Linguistic features of the web:
un-edited, large volume, editable, multi-media

Map of online communities

http://xkcd.com/802

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers' markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

The Web as Corpus


Two Approaches to using the Web as a Corpus

ã Direct Query: Search Engine as Query tool and WWW as corpus
(Objection: Results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow one to draw conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

ã Web Sample: Use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than)
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman),
but Google no longer releases the index size

ã So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
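The ratio comparison above is just a division, using the counts from the slide (as accessed 2012-04-04):

```python
def whit_ratio(hits_a, hits_b):
    """Compare two web counts as a unitless ratio rather than raw numbers."""
    return hits_a / hits_b

# 'different from' vs 'different than' (ghit counts from the slide)
print(round(whit_ratio(251_000_000, 71_500_000), 1))  # 3.5
```

The ratio is far more stable over time than either raw count, which is why the slide recommends reporting it together with the access date.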

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
â do this for several languages and make the corpora available

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6,000)

3 Query a search engine and retrieve the URLs (50,000)

4 Download the files from the URLs (100,000,000 words)

5 Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html
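Step 2 above, turning seed words into queries, can be sketched by taking word tuples (in the BootCaT style). The seed list and tuple size are invented for illustration; real pipelines use hundreds of seeds and randomized sampling.

```python
from itertools import combinations

seeds = ["language", "technology", "internet", "corpus", "linguistics"]

def make_queries(seeds, words_per_query=3, limit=6):
    """Form search queries from tuples of seed words (BootCaT-style)."""
    combos = combinations(seeds, words_per_query)
    return [" ".join(c) for _, c in zip(range(limit), combos)]

for q in make_queries(seeds):
    print(q)
```

Each query is then sent to a search engine API and the returned URLs are pooled for downloading (steps 3–4).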

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

… and Zombies

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … Metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup



Page 35: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
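The query-generation and filtering steps of the outline can be sketched as follows. Actual fetching is omitted, and the seed words, function names and length thresholds are illustrative assumptions, not Sharoff's actual settings.

```python
import random

# Toy sketch of the pipeline above: combine seed words into random
# multi-word queries, then filter out downloaded documents that look
# irrelevant for a corpus (too short, or word-list-length pages).

def make_queries(seeds, k=4, n=10, rng=None):
    """Combine seed words into n random k-word queries."""
    rng = rng or random.Random(0)
    return [" ".join(rng.sample(seeds, k)) for _ in range(n)]

def keep_document(text, min_words=50, max_words=5000):
    """Exclude documents that are too short or too long to be useful text."""
    n = len(text.split())
    return min_words <= n <= max_words

seeds = ["language", "internet", "corpus", "data", "web", "search", "text", "query"]
queries = make_queries(seeds)
print(len(queries), "queries, e.g.", queries[0])
```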

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are unintended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — bracketed glosses, cross-lingual disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine-learning based methods
â Context (under .jp or .kr)

ã Hard to do for: short text snippets, similar languages, mixed text
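A minimal sketch of the character-n-gram approach, assuming two tiny invented "training" samples; real identifiers use much larger profiles, and still struggle with the short snippets and similar languages noted above.

```python
from collections import Counter

# Character-n-gram language identification: build a frequency profile
# of n-grams for each language sample, then pick the language whose
# profile overlaps most with the profile of the input snippet.

def ngram_profile(text, n=3):
    """Frequency profile of character n-grams (with edge padding)."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(profile_a, profile_b):
    """Shared n-gram mass between two profiles."""
    return sum(min(profile_a[g], profile_b[g]) for g in profile_a)

samples = {
    "en": "the cat sat on the mat and the dog ran to the man",
    "de": "die Katze sitzt auf der Matte und der Hund lief zu dem Mann",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

def identify(snippet):
    p = ngram_profile(snippet)
    return max(profiles, key=lambda lang: similarity(p, profiles[lang]))

print(identify("the dog and the cat"))  # en
```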

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number normalization: $700K, $700,000, 0.7 million dollars, …

ã Date normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
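A few of these steps sketched with deliberately naive rules; the suffix list and stop-word set are invented for illustration, and real systems use proper stemmers and lemmatizers (the decompounding step is omitted).

```python
import re

# Naive normalization: expand $700K-style amounts, strip stop words,
# and crudely stem by suffix stripping.

STOP_WORDS = {"the", "a", "an", "of", "to", "in"}

def normalize_number(text):
    """Expand $700K-style amounts into plain digits."""
    return re.sub(r"\$(\d+)K\b", lambda m: "$" + str(int(m.group(1)) * 1000), text)

def strip_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping: producer -> produc, produces -> produc."""
    for suffix in ("es", "er", "s"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

text = "the price of $700K"
tokens = strip_stop_words(normalize_number(text).split())
print([stem(t) for t in tokens])  # ['price', '$700000']
```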

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
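Well-formedness checking can be sketched with Python's standard-library XML parser; the `<course>` tags are self-defined, as XML allows.

```python
import xml.etree.ElementTree as ET

# A document is well-formed if the parser accepts it: every opening tag
# has a matching close, attributes are quoted, and so on.

def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<course code='HG2052'><title>Language, Technology and the Internet</title></course>"
bad = "<course><title>unclosed</course>"

print(is_well_formed(good), is_well_formed(bad))  # True False
```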

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF triples are written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
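The triple model can be sketched without any RDF library by storing <subject, predicate, object> tuples and matching patterns against them. The example.org URIs are invented for illustration, though the Dublin Core predicates are real.

```python
# A tiny in-memory triple store with wildcard pattern matching
# (None in a position matches anything), in the spirit of RDF.

triples = {
    ("http://example.org/HG2052", "http://purl.org/dc/terms/creator",
     "http://example.org/FrancisBond"),
    ("http://example.org/HG2052", "http://purl.org/dc/terms/title",
     "Language, Technology and the Internet"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

for s, p, o in query(predicate="http://purl.org/dc/terms/title"):
    print(o)  # Language, Technology and the Internet
```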

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total number of citations (pretty useful)
â Total number of citations minus self-citations
â Total number of (citations / number of authors)
â Average (citation impact factor / number of authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the citing paper
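The first three scores above can be computed for a toy bibliography; the papers, authors and citation links are invented for illustration.

```python
# Each paper records its authors and the papers it cites.

papers = {
    "A": {"authors": ["Lee", "Tan"], "cites": ["B"]},
    "B": {"authors": ["Tan"], "cites": []},
    "C": {"authors": ["Ng"], "cites": ["B", "A"]},
}

def citations(paper):
    """Total number of citations received."""
    return sum(paper in p["cites"] for p in papers.values())

def citations_minus_self(paper):
    """Citations excluding those from papers sharing an author."""
    own = set(papers[paper]["authors"])
    return sum(
        paper in p["cites"] and not own & set(p["authors"])
        for p in papers.values()
    )

def per_author(paper):
    """Citations divided by the number of authors."""
    return citations(paper) / len(papers[paper]["authors"])

# B is cited by A and C, but A shares the author Tan with B.
print(citations("B"), citations_minus_self("B"), per_author("B"))  # 2 1 2.0
```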

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web:

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
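The random walk with teleportation can be computed by simple power iteration; the four-page graph and the function below are an illustrative sketch, not Google's implementation. `alpha` is the 10% teleportation rate from the slide.

```python
# PageRank as the steady state of the teleporting random walk.
# Each iteration redistributes rank: linked pages pass on (1 - alpha)
# of their rank along outlinks and teleport the rest; dead ends
# always teleport (independent of alpha, as noted above).

def pagerank(links, alpha=0.10, iterations=100):
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if links[p]:                   # follow a link with prob 1 - alpha
                share = (1 - alpha) * rank[p] / len(links[p])
                for q in links[p]:
                    new[q] += share
                leak = alpha * rank[p]     # teleport with prob alpha
            else:                          # dead end: always teleport
                leak = rank[p]
            for q in pages:
                new[q] += leak / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

The ranks sum to 1, and the dead-end page D, which receives no links, ends up with only its share of teleport traffic.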

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z =
http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req: HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 36: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81


Page 38: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics – (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers' markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: search engine as query tool and WWW as corpus
(Objection: results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

ã Web Sample: use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than)
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios
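Since raw ghit counts drift as the index changes, the comparison above is best expressed as a ratio computed from counts taken on the same day. A small sketch:

```python
def ghit_ratio(hits_a: int, hits_b: int) -> float:
    """Unitless ratio between two web counts taken on the same date."""
    return hits_a / hits_b

# "different from" vs "different than" (counts as accessed 2012-04-04)
ratio = ghit_ratio(251_000_000, 71_500_000)  # roughly 3.5 : 1
```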

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
â do this for several languages and make the corpora available
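The exclusion steps can be sketched as a per-document filter; the thresholds here (minimum and maximum length, maximum share for any one token) are illustrative choices, not Kilgarriff's actual settings:

```python
def keep_document(tokens, min_len=50, max_len=50_000, max_share=0.25):
    """Drop documents that are too short, too long, or dominated by a
    single repeated token (a symptom of word lists, not running text)."""
    n = len(tokens)
    if n < min_len or n > max_len:
        return False
    top_count = max(tokens.count(t) for t in set(tokens))
    return top_count / n <= max_share

wordlist = ["widget"] * 200            # repetitive: rejected
prose = [f"w{i}" for i in range(200)]  # varied: kept
```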

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are unintended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations – Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words
â Similarity-based categorisation and classification
∗ Character n-gram rankings
â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
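The character n-gram ranking approach (Cavnar and Trenkle's "out-of-place" measure) is easy to sketch: rank the most frequent n-grams of the mystery text and of each training sample, then sum the rank differences. The tiny training strings here are stand-ins for real per-language corpora:

```python
from collections import Counter

def profile(text, n=3, top=300):
    """Most frequent character n-grams of a text, in rank order."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc, lang):
    """Sum of rank differences; n-grams absent from lang get a max penalty."""
    rank = {g: i for i, g in enumerate(lang)}
    penalty = len(lang)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc))

def identify(text, profiles):
    """Pick the language whose training profile is closest."""
    doc = profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

profiles = {
    "en": profile("the quick brown fox jumps over the lazy dog"),
    "de": profile("der schnelle braune fuchs springt ueber den faulen hund"),
}
guess = identify("the dog jumps over the fox", profiles)
```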

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
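The number normalization step, for instance, can be sketched with two regular expressions mapping the variant dollar forms above onto one canonical spelling:

```python
import re

def normalize_dollars(text):
    """Map '$700K' and '$700,000' onto a canonical '$700000' form
    (a toy sketch of the number normalization step)."""
    # $700K -> $700000
    text = re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: "$" + str(int(float(m.group(1)) * 1000)), text)
    # $700,000 -> $700000
    text = re.sub(r"\$(\d{1,3}(?:,\d{3})+)",
                  lambda m: "$" + m.group(1).replace(",", ""), text)
    return text
```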

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes possible integrating multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
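Checking well-formedness needs no schema, just a parser; a sketch with Python's standard library:

```python
import xml.etree.ElementTree as ET

def well_formed(doc: str) -> bool:
    """Check syntactic well-formedness (not validity against a schema)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

ok = well_formed("<course><code>HG2052</code></course>")
bad = well_formed("<course><code>HG2052</course>")  # mismatched tags
```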

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
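A triple store and a query over it can be sketched in a few lines; the subject and object URIs here are invented for illustration (the predicates borrow real Dublin Core term names):

```python
# Triples of (subject, predicate, object); each slot is a URI or a literal.
triples = {
    ("http://example.com/HG2052", "http://purl.org/dc/terms/title",
     "Language, Technology and the Internet"),
    ("http://example.com/HG2052", "http://purl.org/dc/terms/creator",
     "http://example.com/FrancisBond"),
}

def objects(subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

title = objects("http://example.com/HG2052", "http://purl.org/dc/terms/title")
```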

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper
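The first three scores are straightforward to compute once you have per-paper counts; a sketch with made-up numbers for a two-paper author:

```python
def citation_scores(citations, self_citations, n_authors):
    """Per-author scores from parallel per-paper lists: total citations,
    total minus self-citations, and citations split among co-authors."""
    total = sum(citations)
    non_self = total - sum(self_citations)
    per_author = sum(c / a for c, a in zip(citations, n_authors))
    return total, non_self, per_author

# Two papers: 40 citations (4 self, 2 authors) and 10 (1 self, sole author).
scores = citation_scores([40, 10], [4, 1], [2, 1])
```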

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
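Harvesting anchor text is a small parsing job: remember the href of the open <a> tag and collect the text until it closes. A sketch with html.parser (the URL is a placeholder):

```python
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    """Pair each hyperlink target with the anchor text pointing at it."""
    def __init__(self):
        super().__init__()
        self.links, self._href = [], None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:          # only text inside an <a>
            self.links.append((self._href, data))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

p = AnchorText()
p.feed('See the <a href="http://example.com/hg2052">HG2052 Course Page</a>.')
```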

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N
(N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
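Putting the random walk and teleporting together gives the usual power iteration; a small sketch (the three-page graph is made up):

```python
def pagerank(out_links, teleport=0.10, iterations=100):
    """Power iteration for PageRank: with probability `teleport` jump to a
    random page; dead ends always jump randomly; otherwise follow a link."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = dict.fromkeys(pages, 0.0)
        for p in pages:
            links = out_links[p]
            if not links:                         # dead end: jump anywhere
                for q in pages:
                    nxt[q] += rank[p] / n
            else:
                for q in pages:                   # teleport share
                    nxt[q] += teleport * rank[p] / n
                share = (1 - teleport) * rank[p] / len(links)
                for q in links:                   # follow a link equiprobably
                    nxt[q] += share
        rank = nxt
    return rank

# Every page links to C, so C should end up with the highest PageRank.
r = pagerank({"A": ["C"], "B": ["C"], "C": ["A", "B"]})
```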

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73


atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 41: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Non-HTML text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)
  - What you see is what you get (WYSIWYG)
  - The equivalent of printers' markup
  - Shows what things look like

• Logical Markup (Structural)
  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

The Web as Corpus

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow us to draw conclusions about the data
  ✗ Google lacks functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions about the included text types
  - (limited) control over the content
  ✗ sparser data

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  - it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
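The ratio comparison above can be sketched in a few lines of Python; the counts are the 2012 figures quoted on the slide, and any live counts you fetch today will differ:

```python
# Compare web counts for two variant phrases as a unitless ratio.
# Remember: ghits are document frequencies, not term frequencies.
counts = {
    "different from": 251_000_000,
    "different than": 71_500_000,
}

ratio = counts["different from"] / counts["different than"]
print(f"{ratio:.1f}:1")  # roughly 3.5:1, as of 2012-04-04
```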

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with the available tools (here, taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  - do this for several languages and make the corpora available

Building Internet Corpora: Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html
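Step 2 of the outline, turning seed words into query tuples, can be sketched as below. The seed list and tuple size are illustrative, not Sharoff's actual settings:

```python
import random

# Illustrative seed words; a real run would use ~500 frequent words.
seeds = ["water", "house", "music", "people", "school", "market"]

def make_queries(seeds, tuple_size=3, n_queries=10, rng=None):
    """Combine seed words into random, distinct query tuples."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    queries = set()
    while len(queries) < n_queries:
        # Sort so the same combination is not generated twice in
        # different orders.
        queries.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    return [" ".join(q) for q in queries]

for q in make_queries(seeds):
    print(q)  # each line is one query to send to a search engine
```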

Internet Corpora Summary

• The web can be used as a corpus
  - Direct access
    * fast and convenient
    * huge amounts of data
    ✗ unreliable counts
  - Web sample
    * control over the sample
    * some setup costs (semi-automated)
    ✗ less data

• Richer data than a compiled corpus
  ✗ less balanced, less markup

Text and Meta-text

• Explicit meta-data
  - Keywords and categories
  - Rankings
  - Structural markup

• Implicit meta-data
  - Links and citations
  - Tags
  - Tables
  - File names
  - Translations

Explicit Metadata

• You can get information from metadata within documents
  - When it is accurate, it is very good
  - It is often deceitful

• HTML, PDF, Word, … metadata

• Keywords and tags

• Rankings

• Links and citations

• Structural markup

Implicit Metadata

• You can get clues from metadata within documents
  - as it is unintended, it tends to be noisy
  - but it is rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations: bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Language Identification and Normalization

Language Identification

• We need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words
  - Similarity-based categorisation and classification
    * Character n-gram rankings
  - Machine-learning-based methods
  - Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text
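The character n-gram ranking approach can be sketched as follows: a toy version of the out-of-place measure, with two invented one-sentence "training corpora" standing in for real profile data:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams of a text."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; smaller means more similar."""
    penalty = len(lang_profile)  # cost for n-grams missing entirely
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(abs(i - rank.get(g, penalty))
               for i, g in enumerate(doc_profile))

# Tiny illustrative training texts; real profiles need much more data.
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund",
}
profiles = {lang: ngram_profile(t) for lang, t in training.items()}

def identify(snippet):
    doc = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

print(identify("the dog and the fox"))
```

As the slide notes, this gets much harder for short snippets and for closely related languages, where the top-ranked n-grams overlap heavily.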

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel
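The stemming examples above can be reproduced with a toy suffix-stripping stemmer; this is a drastically simplified sketch in the spirit of such algorithms, not a real stemmer:

```python
# Toy suffix-stripping stemmer: removes one suffix from a fixed list.
# Real stemmers use stem-measure conditions and several passes.
SUFFIXES = ["er", "es", "ed", "ing", "s", "e"]

def stem(word):
    # Try longer suffixes first, and keep a stem of at least 3 letters.
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: len(word) - len(suf)]
    return word

for w in ["producer", "produces", "produce", "producing"]:
    print(w, "->", stem(w))  # all map to "produc"
```

Note how stemming conflates producer and produces into one string that is not itself a word, whereas lemmatization would map produces to the lemma produce.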

The Semantic Web

Goals of the Semantic Web

• Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Semantic Web Architecture

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically
  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

• XML can be validated for syntactic well-formedness
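Checking well-formedness can be sketched with Python's standard library; the two sample documents are made up for illustration:

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course>"))  # False: tags not properly nested
```

Well-formedness (every tag closed, properly nested) is distinct from validity against a schema, which additionally constrains which tags may appear where.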

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers
  - Universal Resource Name: urn:isbn:1575864606
  - Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework
  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

• You can build relations in ontologies (OWL)
  - Then reason over them, search them, …
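The triple model can be sketched with plain Python tuples; the URIs below are invented for illustration, not a real vocabulary:

```python
# RDF reduces all assertions to <subject, predicate, object> triples.
# These URIs are made up for illustration.
EX = "http://example.org/"

triples = {
    (EX + "HG2052", EX + "taughtBy", EX + "FrancisBond"),
    (EX + "HG2052", EX + "topic", EX + "LanguageAndTheInternet"),
    (EX + "FrancisBond", EX + "memberOf", EX + "NTU"),
}

def objects(subject, predicate):
    """Query: all objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects(EX + "HG2052", EX + "taughtBy"))
```

Because every element is a URI, triples from independent sources can be merged into one graph and queried together, which is the "web of data" idea above.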

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible (know thyself)

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

Citation, Reputation and PageRank

Citation Networks

• How can we tell what is a good scientific paper?
  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked
  - Context-based: citation analysis
    * See who else read it and thought it interesting enough to cite

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:
  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total of (citations / number of authors)
  - Average of (citation impact factor / number of authors)

• Problems:
  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

• So: weight citations by the quality of the citing paper
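The first three scores can be sketched directly; the citation records below are invented for illustration:

```python
# Each paper records its authors and the papers that cite it.
# All data is invented for illustration.
papers = {
    "p1": {"authors": ["bond", "sharoff"], "cited_by": ["p2", "p3"]},
    "p2": {"authors": ["bond"], "cited_by": ["p3"]},
    "p3": {"authors": ["smith"], "cited_by": []},
}

def total_citations(p):
    return len(papers[p]["cited_by"])

def citations_minus_self(p):
    """Exclude citations from papers that share an author."""
    own = set(papers[p]["authors"])
    return sum(1 for c in papers[p]["cited_by"]
               if not own & set(papers[c]["authors"]))

def citations_per_author(p):
    return total_citations(p) / len(papers[p]["authors"])

print(total_citations("p1"))       # 2
print(citations_minus_self("p1"))  # 1: p2 shares an author with p1
print(citations_per_author("p1"))  # 1.0
```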

Gaming Citations

• Least/Minimum Publishable Unit
  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Characteristics of the Web: Bow Tie

SCC (Strongly Connected Core): you can travel from any page to any page

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803/">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice

PageRank as Citation Analysis

• Citation frequency can be used to measure the impact of an article
  - Simplest measure: each article gets one vote (not very accurate)

• On the web, citation frequency = inlink count
  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank
  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

PageRank as Random Walk

• Imagine a web surfer doing a random walk on the web
  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: the proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink
  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
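The teleporting random walk can be sketched as a power iteration over a small made-up link graph; the 10% teleport rate follows the slide, while the graph and convergence threshold are arbitrary:

```python
def pagerank(links, teleport=0.10, tol=1e-10):
    """Power iteration for the teleporting random walk.

    links: dict mapping each page to its list of outgoing links.
    Returns the steady-state visit rate (PageRank) of each page.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    while True:
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:
                # Dead end: jump to each page with probability 1/N.
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:
                    new[q] += rank[p] * teleport / n  # teleport step
                for q in out:
                    new[q] += rank[p] * (1 - teleport) / len(out)
        if max(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new

# Tiny made-up graph: C is a dead end, linked to by both A and B.
graph = {"A": ["B", "C"], "B": ["C"], "C": []}
ranks = pagerank(graph)
print({p: round(r, 3) for p, r in ranks.items()})
```

The ranks sum to 1 (they form a probability distribution), and C, which collects the most inlinks, ends up with the highest long-term visit rate.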

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank

Example Graph, Weighted

Pages with higher PageRanks are lighter

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

• Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, instructing some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

Current Status

• There is a continuous battle between:
  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

• All metrics get gamed

Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object
  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a persistent identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u/

Conclusions

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

The Internet

• The internet is useful as a tool
  - Passively, for information access
  - Actively, for collaboration

• The internet is interesting in and of itself
  - As a source of data about existing language
  - As a source of innovation in language

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics: (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics


atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 43: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not let us draw conclusions about the data
  ⊗ Google lacks functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

  – Known size and exact hit counts → statistics possible
  – Conclusions can be drawn about the included text types
  – (Limited) control over the content
  ⊗ Sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than):
  251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
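The ratio comparison above is simple arithmetic; a throwaway sketch (the hit counts are the slide's figures as accessed on 2012-04-04, and `hit_ratio` is just an illustrative helper):

```python
# Comparing two phenomena by their web hit counts gives a unitless ratio,
# which is more stable over time than the raw (estimated) counts.
def hit_ratio(ghits_a, ghits_b):
    """Return the ratio of two hit counts, rounded to one decimal place."""
    return round(ghits_a / ghits_b, 1)

# "different from" vs "different than" (ghits, accessed 2012-04-04)
print(hit_ratio(251_000_000, 71_500_000))  # 3.5, i.e. roughly 3.5:1
```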

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with the available tools (here, taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  – do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine them to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding cleanup, tagging and parsing)
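The query-combination step of this pipeline could be sketched as below; the function name, parameters and seed words are illustrative, not Sharoff's actual code:

```python
import itertools
import random

def make_queries(seed_words, words_per_query=2, n_queries=10, seed=0):
    """Combine seed words into multi-word search-engine queries."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    combos = list(itertools.combinations(seed_words, words_per_query))
    rng.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

seeds = ["picture", "shown", "considered", "understand", "important"]
for q in make_queries(seeds, n_queries=3):
    print(q)
```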

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) Wacky! Working Papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts

  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

  ⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate, they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations – Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (e.g. under a .jp or .kr domain)

• Hard to do for short text snippets, similar languages, mixed text
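A character n-gram identifier in the rank-comparison (out-of-place) style can be sketched as follows; the training snippets are toy data, far smaller than a real language profile:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams, with padding spaces at the edges."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams."""
    return [g for g, _ in char_ngrams(text, n).most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Rank distance between profiles: lower means more similar."""
    max_pen = len(lang_profile)  # penalty for grams missing from the profile
    return sum(
        abs(i - lang_profile.index(g)) if g in lang_profile else max_pen
        for i, g in enumerate(doc_profile)
    )

langs = {
    "en": profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": profile("der schnelle braune fuchs springt ueber den faulen hund"),
}
doc = profile("the dog and the fox")
print(min(langs, key=lambda l: out_of_place(doc, langs[l])))  # "en"
```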

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
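Two of these steps can be sketched with naive rules; the regex and the suffix list are illustrative and far cruder than a real normalizer or stemmer:

```python
import re

def normalize_amount(text):
    """Map '$700K' / '$700,000' / '0.7M' style amounts to a plain integer."""
    m = re.fullmatch(r"\$?([\d,.]+)\s*([KkMm]?)", text.strip())
    if not m:
        raise ValueError(f"unrecognized amount: {text!r}")
    value = float(m.group(1).replace(",", ""))
    scale = {"": 1, "k": 1_000, "m": 1_000_000}[m.group(2).lower()]
    return int(value * scale)

def stem(word):
    """Crude suffix-stripping stemmer: produces/producer -> produc."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(normalize_amount("$700K"))           # 700000
print(normalize_amount("$700,000"))        # 700000
print(stem("produces"), stem("producer"))  # produc produc
```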

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation

• Can be validated for syntactic well-formedness
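Checking well-formedness is exactly what an XML parser does; a small sketch using Python's standard-library parser (the sample documents are invented):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the XML parses, i.e. is syntactically well formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course id='HG2052'><title>Language, Technology "
                     "and the Internet</title></course>"))  # True
print(is_well_formed("<course><title>unclosed</course>"))   # False
```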

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations in ontologies (OWL)

  – Then reason over them, search them, …

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total Number of Citations (pretty useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)

• Problems:

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
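The simple scores listed above are easy to compute from toy data; the impact factors and counts here are invented for illustration:

```python
def citation_scores(impact_factors, n_authors, n_self_citations):
    """Simple citation-count scores for one paper.

    impact_factors: impact factor of each citing publication.
    """
    total = len(impact_factors)
    return {
        "citations": total,
        "citations_minus_self": total - n_self_citations,
        "citations_per_author": total / n_authors,
        "avg_impact_per_author": (sum(impact_factors) / total) / n_authors,
    }

scores = citation_scores([2.0, 1.0, 3.0, 2.0], n_authors=2,
                         n_self_citations=1)
print(scores)
```

Note that none of these scores addresses the two problems above: a citation from a 'good' paper counts no more than any other, which is the gap PageRank-style weighting fills.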

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote – not very accurate

• On the web: citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
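The random walk with teleporting can be run to its steady state by power iteration; a minimal sketch, where the four-page graph and the function itself are illustrative:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power-iteration PageRank with teleporting.

    links: dict mapping each page to the list of pages it links to.
    Dead ends jump to a random page regardless of the teleport rate.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport share, 0.1/N each
                    new[q] += teleport * rank[p] / n
                for q in out:                # follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# A and B link to each other; C links to A; D is a dead end.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # A: it has the most inbound weight
```

With teleport=0.10 and a page with 4 outgoing links, each link is followed with probability (1 − 0.10)/4 = 0.225, matching the worked example above.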

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

• Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink (<a href="…" rel="nofollow">) to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74

Current Status

• There is a continuous battle between:

  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press.

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press.

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics – (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 44: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or “google hit” is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or “web hit” is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB, for “Ghits per billion documents”, is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios
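As a quick arithmetic sketch, the ratio and GPB calculations above work out as follows (the counts are the example figures from the slide; the index size is a made-up assumption, since Google no longer publishes it):

```python
# Unitless comparison of two web counts ("different from" vs "different than"),
# using the example ghit figures above (accessed 2012-04-04).
from_hits = 251_000_000
than_hits = 71_500_000

ratio = from_hits / than_hits
print(f"ratio: {ratio:.1f}:1")  # about 3.5:1

# GPB ("Ghits per billion documents") needs an index-size estimate;
# 40 billion documents here is purely a hypothetical assumption.
index_size = 40_000_000_000
gpb = from_hits / (index_size / 1_000_000_000)
print(f"GPB: {gpb:,.0f}")
```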

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)
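A toy sketch of steps 1–2 above, assuming a handful of invented seed words (real pipelines use hundreds of seeds and send the resulting queries to a search-engine API):

```python
from itertools import combinations

# Step 1: a tiny stand-in for the ~500 seed words (invented examples).
seeds = ["house", "water", "music", "garden", "letter"]

# Step 2: combine seed words to form multiple queries
# (pairs here; real pipelines use longer tuples to reach rarer pages).
queries = [" ".join(pair) for pair in combinations(seeds, 2)]
print(len(queries), "queries, e.g.", queries[0])
```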

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora: Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ⊗ Unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ⊗ Less data

- Richer data than a compiled corpus

  ⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations: Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine-learning-based methods

  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text
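A toy sketch of the character n-gram ranking approach mentioned above (in the style of Cavnar and Trenkle's out-of-place measure); the two training samples are invented and far too small for real use:

```python
from collections import Counter

def profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in a text."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(doc_profile, lang_profile):
    """'Out-of-place' distance: sum of rank differences, with a fixed
    penalty for n-grams missing from the language profile."""
    penalty = len(lang_profile)
    return sum(lang_profile.index(g) if g in lang_profile else penalty
               for g in doc_profile)

# Toy training data: real systems train on much larger samples per language.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}
profiles = {lang: profile(text) for lang, text in samples.items()}

def identify(text):
    """Pick the language whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(profiles, key=lambda lang: distance(doc, profiles[lang]))

print(identify("the dog and the cat"))
```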

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel
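A minimal sketch of two of the steps above, number normalization and crude suffix-stripping stemming; the patterns are illustrative only, not a full normalizer:

```python
import re

def normalize_number(text):
    """Map '$700K' / '$700,000' style amounts to a plain integer of dollars.
    Only handles the $-prefixed forms; spelled-out forms need more rules."""
    m = re.fullmatch(r"\$([\d,.]+)([KMB]?)", text)
    if not m:
        raise ValueError(text)
    value = float(m.group(1).replace(",", ""))
    scale = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}[m.group(2)]
    return int(value * scale)

def stem(word):
    """Crude suffix stripping (Porter-like in spirit only):
    produces -> produc, producer -> produc."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```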

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
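A small sketch showing self-defined tags and a well-formedness check, using Python's standard xml.etree module (the course document itself is invented):

```python
import xml.etree.ElementTree as ET

# XML with self-defined tags, carrying data rather than presentation.
doc = """<course id="HG2052">
  <title>Language, Technology and the Internet</title>
  <lecture number="12">Review and Conclusions</lecture>
</course>"""

root = ET.fromstring(doc)      # parses: the document is well formed
print(root.find("title").text)

try:
    ET.fromstring("<course><title>oops</course>")  # mismatched tags
except ET.ParseError as err:
    print("not well formed:", err)
```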

Review and Conclusions 58

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers (URIs)

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations into ontologies (OWL)

  - Then reason over them, search them, …
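A toy in-memory triple store sketching the <subject, predicate, object> model; the ex: URIs are invented placeholders, and a real system would use RDF tooling rather than plain tuples:

```python
# RDF-style statements as <subject, predicate, object> triples.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:title", "Language, Technology and the Internet"),
    ("ex:FrancisBond", "ex:memberOf", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Everything we know about HG2052:
for t in query(s="ex:HG2052"):
    print(t)
```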

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper
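The first three scores above can be sketched as follows (the papers and counts are invented):

```python
# Toy citation scores; each paper records raw citations, self-citations,
# and the number of authors.
papers = {
    "paper_a": {"citations": 40, "self_citations": 5, "authors": 2},
    "paper_b": {"citations": 12, "self_citations": 1, "authors": 4},
}

def scores(p):
    """Total citations, citations minus self-citations, per-author share."""
    return {
        "total": p["citations"],
        "minus_self": p["citations"] - p["self_citations"],
        "per_author": p["citations"] / p["authors"],
    }

for name, p in papers.items():
    print(name, scores(p))
```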

Review and Conclusions 64

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self-citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC, the Strongly Connected Core: you can travel from any page to any other page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote – not very accurate

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web:

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: “jumping” from a dead end is independent of the teleportation rate
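The random walk with teleportation can be sketched by power iteration; the four-page graph below is invented, and page "d" is a dead end:

```python
# PageRank by power iteration with a 10% teleportation rate,
# following the random-walk description above.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": [],   # dead end: always "jumps" to a random page
}

def pagerank(links, teleport=0.10, iterations=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if links[p]:
                # follow a random outlink with probability 1 - teleport
                for q in links[p]:
                    new[q] += (1 - teleport) * rank[p] / len(links[p])
                # teleport with probability teleport, uniformly
                for q in pages:
                    new[q] += teleport * rank[p] / n
            else:
                # dead end: jump uniformly, regardless of the teleport rate
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Each iteration redistributes all the probability mass, so the ranks stay a proper distribution; the dead-end page ends up with only the teleport mass.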

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph: Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, e.g. <a href="http://example.com/" rel="nofollow">link text</a>, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo! does not include the link in its ranking
  - …

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u
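In practice a DOI name is resolved through the central doi.org proxy, which looks up the current location stored in the DOI metadata; a minimal sketch:

```python
# A DOI name becomes a resolvable URL via the doi.org proxy; the resolver
# redirects to whatever location the DOI metadata currently records.
def doi_url(doi):
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
```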

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics: (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 46: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 47: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

…and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate, they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations — Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text
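The similarity-based approach can be illustrated with a toy character n-gram rank classifier in the style of Cavnar and Trenkle's out-of-place measure. The two training snippets below are invented and far too small for real use:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in `text`."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, candidate, penalty=50):
    """Rank distance between a snippet profile and a language profile."""
    pos = {g: i for i, g in enumerate(candidate)}
    return sum(abs(i - pos.get(g, penalty)) for i, g in enumerate(profile))

# Toy training data; real systems train on much larger samples.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {lang: ngram_profile(t) for lang, t in TRAIN.items()}

def identify(snippet):
    """Pick the language whose profile is closest to the snippet's."""
    p = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(p, profiles[lang]))

print(identify("the dog and the cat"))  # "en" on this toy data
```

Note how short snippets and closely related languages shrink the rank-distance margin, which is why they appear above as hard cases.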

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel
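A few of these steps can be sketched in Python; the rules below are deliberately naive stand-ins for real normalizers, stop lists and stemmers:

```python
import re

def normalize_number(text):
    """Rewrite $700K-style amounts as plain dollar figures (toy rule)."""
    return re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: "$" + str(int(float(m.group(1)) * 1000)), text)

STOP = {"the", "a", "of"}

def strip_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP]

def stem(token):
    """Crude suffix stripping in the spirit of Porter: producer -> produc."""
    for suffix in ("es", "er", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 3:
            return token[: -len(suffix)]
    return token

print(normalize_number("$700K"))                     # $700000
print(strip_stop_words("the price of milk".split())) # ['price', 'milk']
print([stem(t) for t in ("producer", "produces")])   # ['produc', 'produc']
```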

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation

• Can be validated for syntactic well-formedness
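Checking syntactic well-formedness can be sketched with Python's standard library parser. This checks well-formedness only, not validity against a DTD or schema:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the markup parses as XML (tags nest and close properly)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course code='HG2052'><title>LTI</title></course>"))  # True
print(is_well_formed("<course><title>LTI</course>"))                        # False
```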

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations into ontologies (OWL)

  – Then reason over them, search them, …
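The triple model can be illustrated in a few lines of Python. The `ex:` URIs below are invented for illustration; a real system would use full RDF IRIs and a proper triple store:

```python
# Minimal triple-store sketch: a set of <subject, predicate, object> triples.
TRIPLES = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:FrancisBond", "ex:memberOf", "ex:NTU"),
    ("ex:HG2052", "ex:topic", "ex:SemanticWeb"),
}

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in TRIPLES
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(query(p="ex:taughtBy"))  # the one matching triple
```

Pattern matching over triples like this is the core of SPARQL-style querying; OWL reasoning adds inference on top of it.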

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow’s seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren’t neutral

6. Metrics influence results

7. There’s more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total of (citations / number of authors)
  – Average of (citation impact factor / number of authors)

• Problems:

  – Not all citations are equal: citations by ‘good’ papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
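A toy computation of some of these scores; the citation list and quality weights are invented, and `quality_weighted` corresponds to weighting each citation by the citing paper:

```python
def citation_scores(citing, n_authors):
    """citing: one (is_self_citation, quality_weight) pair per citation received."""
    total = len(citing)
    return {
        "total": total,
        "minus_self": sum(1 for self_cite, _ in citing if not self_cite),
        "per_author": total / n_authors,
        "quality_weighted": sum(w for _, w in citing),  # 'good' papers count more
    }

# Four citations, one of them a self-citation, with made-up quality weights.
citing = [(False, 1.0), (False, 2.5), (True, 1.0), (False, 0.5)]
print(citation_scores(citing, n_authors=2))
```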

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path.to.there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote — not very accurate

• On the web, citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article’s vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page’s PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting — to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: “jumping” from a dead end is independent of the teleportation rate
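The random walk with teleportation can be sketched as a power iteration in Python; the four-page graph is a toy example, with the parameter values from the slide:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration over the teleporting random walk.
    Dead ends jump uniformly; other pages teleport with prob. `teleport`."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: 1/N to every page
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport: 0.1/N to every page
                    new[q] += teleport * rank[p] / n
                for q in out:                # else follow a link equiprobably
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# Toy graph: D is a dead end.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
```

The ranks always sum to 1, since each step redistributes every page's rank in full; this is the steady-state visit rate described above.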

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

• Scraper Sites: “scrape” search-engine results pages or other sources of content and create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index.

  – Google does not index the target of a link marked nofollow
  – Yahoo! does not include the link in its ranking
  – …
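A small sketch using Python's standard library HTML parser to separate nofollow links from ordinary ones, much as a crawler might; the example links are invented:

```python
from html.parser import HTMLParser

class NofollowCounter(HTMLParser):
    """Collect link targets, splitting off those marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href", "")
        if "nofollow" in (attrs.get("rel") or "").split():
            self.nofollowed.append(href)
        else:
            self.followed.append(href)

parser = NofollowCounter()
parser.feed('<a href="/spam" rel="nofollow">buy now</a> <a href="/page">ok</a>')
print(parser.followed, parser.nofollowed)  # ['/page'] ['/spam']
```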

74

Current Status

• There is a continuous battle between

  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u/

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics — (prerequisite HG2051 waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Page 49: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
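The random walk with teleportation can be sketched as power iteration over a toy graph; the four pages and their links below are invented, and the 10% teleport rate follows the slide.

```python
import numpy as np

# Toy web graph: page -> list of pages it links to (names are made up).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],          # dead end: no outgoing links
}

def pagerank(links, teleport=0.10, iters=100):
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    # Transition matrix: row i = probabilities of moving from page i to each page.
    P = np.zeros((n, n))
    for p, outs in links.items():
        if outs:
            # With prob (1 - teleport) follow a random outlink, equiprobably;
            # with prob teleport jump to any of the n pages.
            for q in outs:
                P[idx[p], idx[q]] += (1 - teleport) / len(outs)
            P[idx[p]] += teleport / n
        else:
            # Dead end: jump to a random page (independent of the teleport rate).
            P[idx[p]] = 1.0 / n
    # Power iteration: start uniform, repeatedly take one random-walk step.
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = r @ P
    return dict(zip(pages, r))

ranks = pagerank(links)
print(ranks)  # long-term visit rate (PageRank) per page; values sum to 1
```

After enough iterations the vector stops changing: that fixed point is the steady-state visit rate the slide describes, so a page like D with no real inlinks ends up below a well-linked page like C.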

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z
→ http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3-5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics – (pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … Metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are unintended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations – Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
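A character-n-gram identifier of the kind listed above can be sketched in a few lines; the two training snippets and the threshold-free overlap score here are invented for illustration, not a production method.

```python
from collections import Counter

# Toy training texts, one per language (invented examples).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "nl": "de zonnecel en de kat zitten in de zon bij het huis",
}

def ngrams(text, n=3):
    # Character trigrams, with padding so word edges count too.
    text = f" {text} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

profiles = {lang: ngrams(t) for lang, t in samples.items()}

def identify(snippet):
    grams = ngrams(snippet)
    # Score each language by n-gram overlap with its profile.
    scores = {lang: sum(min(c, prof[g]) for g, c in grams.items())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

print(identify("the dog and the fox"))  # "en"
```

With snippets this short the scores are small and easily tied, which is exactly the "hard for short snippets and similar languages" problem the slide notes.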

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
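A few of these normalization steps can be sketched with toy rules; the function names and the crude suffix list below are illustrative stand-ins, not a real stemmer such as Porter's.

```python
import re

def normalize_number(text):
    # "$700K" -> "$700000" (one illustrative rule, not full number normalization)
    return re.sub(r"\$(\d+)K", lambda m: f"${int(m.group(1)) * 1000}", text)

STOP_WORDS = {"the", "a", "of"}

def strip_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(word):
    # Crude suffix stripping in the spirit of stemming: produces -> produc
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(normalize_number("$700K"))                      # $700000
print(strip_stop_words("the price of milk".split()))  # ['price', 'milk']
print(stem("produces"), stem("producer"))             # produc produc
```

Note how stemming collapses "produces" and "producer" to the same non-word "produc", whereas lemmatization would map "produces" to the real word "produce".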

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6/color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
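Well-formedness checking can be sketched with Python's standard-library XML parser; the <course> tags below are made-up example markup.

```python
import xml.etree.ElementTree as ET

good = ("<course><code>HG2052</code>"
        "<title>Language, Technology and the Internet</title></course>")
bad = "<course><code>HG2052</course>"  # mismatched tags: not well-formed

def well_formed(xml_text):
    # The parser raises ParseError on any violation of XML's syntax rules.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(well_formed(good))  # True
print(well_formed(bad))   # False
```

This checks only syntax (every open tag closed, proper nesting); checking that a document uses the *right* tags additionally needs a schema or DTD.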

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers (URIs)

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework (RDF)

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
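The triple model can be sketched as plain tuples with wildcard matching; the example.com URIs are invented, and a real system would use an RDF library and SPARQL rather than this toy query function.

```python
# RDF-style <subject, predicate, object> triples, each element a URI string.
triples = {
    ("http://example.com/HG2052", "http://example.com/taughtBy",
     "http://example.com/FrancisBond"),
    ("http://example.com/HG2052", "http://example.com/offeredBy",
     "http://example.com/NTU"),
    ("http://example.com/FrancisBond", "http://example.com/worksAt",
     "http://example.com/NTU"),
}

def query(s=None, p=None, o=None):
    # None acts as a wildcard, like a variable in a SPARQL pattern.
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Who teaches HG2052?
for _, _, obj in query(s="http://example.com/HG2052",
                       p="http://example.com/taughtBy"):
    print(obj)  # http://example.com/FrancisBond
```

Because every element is a URI, triples from different sources can be merged into one graph and queried together, which is the "web of data" goal above.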

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Page 52: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions
- Increase the utility of information by connecting it to definitions and context
- More efficient information access and analysis

E.g., not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically:
  - it is a markup language, much like HTML
  - it was designed to carry data, not to display data
  - its tags are not predefined: you must define your own tags
  - it is designed to be self-descriptive
  - XML is a W3C Recommendation
- Can be validated for syntactic well-formedness

Review and Conclusions 58
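Checking well-formedness, as mentioned above, needs no schema; Python's standard XML parser is enough for a sketch:

```python
# Check XML strings for syntactic well-formedness (no schema needed).
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)   # raises ParseError on bad XML
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course>"))         # False: mismatched tags
```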

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers (URIs)
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
- Identify relations with the Resource Description Framework (RDF)
  - triples of <subject, predicate, object>
  - each element is a URI
  - RDF is written in well-defined XML
  - you can say anything about anything
- You can build relations into ontologies (OWL)
  - then reason over them, search them, …

Review and Conclusions 59
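The triple model above can be sketched with plain tuples; the URIs below are made-up examples, not real identifiers:

```python
# RDF-style triples as (subject, predicate, object) tuples of URI strings.
# All URIs are made-up examples.
triples = [
    ("http://example.org/HG2052", "http://example.org/taughtBy", "http://example.org/FrancisBond"),
    ("http://example.org/HG2052", "http://example.org/offeredAt", "http://example.org/NTU"),
    ("http://example.org/FrancisBond", "http://example.org/worksAt", "http://example.org/NTU"),
]

def objects_of(subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("http://example.org/HG2052", "http://example.org/taughtBy"))
# ['http://example.org/FrancisBond']
```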

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data
- Text mining is about unstructured data
- There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

Review and Conclusions 61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?
  - Content-based:
    - read it and see if it is interesting (hard for a computer)
    - compare it to other things you have read and liked
  - Context-based: citation analysis
    - see who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
- Some scores are:
  - total number of citations (pretty useful)
  - total number of citations minus self-citations
  - total of (citations / number of authors)
  - average of (citation impact factor / number of authors)
- Problems:
  - not all citations are equal: citations by 'good' papers are better
  - newer publications suffer in relation to older ones
- Weight citations by the quality of the paper

Review and Conclusions 64
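The simpler score variants listed above are straightforward to compute from a paper's citation records; a sketch with made-up data:

```python
# Simple citation scores for one paper (made-up data).
n_authors = 3                     # co-authors of the cited paper
citations = [
    {"self": False},
    {"self": True},               # a self-citation
    {"self": False},
    {"self": False},
]

total = len(citations)                               # total citations
minus_self = sum(1 for c in citations if not c["self"])  # minus self-citations
fractional = total / n_authors                       # split credit among co-authors

print(total, minus_self, round(fractional, 2))  # 4 3 1.33
```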

Gaming Citations

- Least/Minimum Publishable Unit
  - break research into small chunks to increase the number of citations
  - sometimes there is very little new information
- Self-citation, in-group citation
- Write-only proceedings (some journals are not often read)
- Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve.

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC (Strongly Connected Core): you can travel from any page to any page within it

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A.
  2. The (extended) anchor text pointing to page B is a good description of page B.

  This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.

Review and Conclusions 67
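Collecting the (link, anchor-text) pairs that link analysis builds on takes only the standard-library HTML parser; a sketch (the URL is a made-up example):

```python
# Collect (href, anchor text) pairs from an HTML fragment.
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:     # only collect text inside <a>…</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

html = '<p>See the <a href="http://example.org/HG803">HG803 Course Page</a>.</p>'
collector = AnchorCollector()
collector.feed(html)
print(collector.pairs)  # [('http://example.org/HG803', 'HG803 Course Page')]
```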

PageRank as Citation Analysis

- Citation frequency can be used to measure the impact of an article
  - simplest measure: each article gets one vote (not very accurate)
- On the web, citation frequency = inlink count
  - a high inlink count does not necessarily mean high quality …
  - … mainly because of link spam
- Better measure: weighted citation frequency, or citation rank
  - an article's vote is weighted according to its citation impact
  - this can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random Walk

- Imagine a web surfer doing a random walk on the web:
  - start at a random page
  - at each step, go out of the current page along one of its links, equiprobably
- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
- This long-term visit rate is the page's PageRank
- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
- With the remaining probability (90%), go out on a random hyperlink
  - for example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
- 10% is a parameter: the teleportation rate
- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
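The random walk with teleportation described above converges to PageRank; it can be computed by power iteration. A sketch on a tiny made-up four-page web:

```python
# PageRank by power iteration, with teleportation, on a made-up graph.
# links[i] = pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # page 3 is a dead end
N = len(links)
teleport = 0.10                              # teleportation rate

rank = [1.0 / N] * N                         # start uniform
for _ in range(100):
    new = [0.0] * N
    for page, outs in links.items():
        if not outs:                         # dead end: jump uniformly (independent of rate)
            for q in range(N):
                new[q] += rank[page] / N
        else:
            for q in range(N):               # teleport with probability 0.10
                new[q] += teleport * rank[page] / N
            for q in outs:                   # otherwise follow a random outlink
                new[q] += (1 - teleport) * rank[page] / len(outs)
    rank = new

print([round(r, 3) for r in rank])  # sums to 1; page 2 has the highest rank
```

Each step redistributes every page's full rank (teleport mass plus link mass), so the vector stays a probability distribution, and the dead-end page 3 ends up with the lowest rank because only teleports reach it.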

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
- Link farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
- Scraper sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

Review and Conclusions 73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

  The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

Review and Conclusions 74

Current Status

- There is a continuous battle between:
  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read
- All metrics get gamed

Review and Conclusions 75

Digital Object Identifier

- DOI: a string used to uniquely identify an electronic document or object
  - metadata about the object is stored with the DOI name
  - the metadata includes a location, such as a URL
  - the DOI for a document is permanent; the metadata may change
  - gives a persistent identifier (like an ISBN)
- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - e.g., DOI 10.1007/s10579-008-9062-z resolves to http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)
- Writing (invented 3-5 times)
- Printing (made writing common)
- Digital text (made writing transferable)
- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool:
  - passively, for information access
  - actively, for collaboration
- The internet is interesting in and of itself:
  - as a source of data about existing language
  - as a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia
- Crystal, D. (2001) Language and the Internet. Cambridge University Press.
- Sproat, R. (2010) Language Technology and Society. Oxford University Press.
- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition.
- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.
- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
- HG3051 Corpus Linguistics (prerequisite HG2051, waived): an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 53: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 54: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier


Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
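These steps can be sketched in a few lines of Python (a toy illustration: the stop-word list and suffix-trimming rule below are simplified stand-ins, not a real tokenizer or stemmer):

```python
import re

STOP_WORDS = {"the", "a", "of"}

def normalize(text):
    """Lower-case, tokenize, strip stop words, and crudely 'stem'
    by trimming common suffixes (a toy stand-in for a real stemmer)."""
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [re.sub(r"(er|es|s)$", "", t) for t in tokens]

print(normalize("The producer produces a record"))
```

Note how stemming conflates producer and produces into the same string, produc, which need not be a real word.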

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
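Checking well-formedness needs no schema, only a parser; a minimal sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML parses: every tag closed, properly nested."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<note><to>Kim</to></note>"))  # properly nested
print(is_well_formed("<note><to>Kim</note>"))       # mismatched tags
```

The first document is well-formed; the second fails because `<to>` is never closed.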

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
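The triple model can be mimicked with plain tuples; a toy sketch (the example.org URIs are invented for illustration, and a real system would use an RDF library and store):

```python
# Each triple is <subject, predicate, object>, with URIs as plain strings.
triples = [
    ("http://example.org/bond", "http://example.org/teaches",
     "http://example.org/HG2052"),
    ("http://example.org/HG2052", "http://example.org/title",
     "Language, Technology and the Internet"),
]

def objects_of(subject, predicate):
    """Query the triple store: all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("http://example.org/bond", "http://example.org/teaches"))
```

Because every statement has the same shape, triples from different sources can simply be concatenated and queried together, which is the point of the "web of data".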

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the paper
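The simple scores above are just arithmetic once the counts are known; a sketch (the sample numbers are invented, and the per-author score here divides the total by a single author count rather than per paper):

```python
def citation_scores(citations, n_authors, self_citations=0):
    """Compute simple citation scores for one author.
    `citations` is a list of per-paper citation counts."""
    total = sum(citations)
    return {
        "total": total,                          # all citations
        "minus_self": total - self_citations,    # discount self-citation
        "per_author": total / n_authors,         # share credit among authors
    }

scores = citation_scores(citations=[10, 4, 1], n_authors=2, self_citations=3)
print(scores)
```

Even this toy version shows why the scores disagree: the same record gives 15, 12, or 7.5 depending on which correction you apply.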

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Component — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
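Anchor text can be pulled out of a page with the standard-library HTML parser; a minimal sketch (the example link is invented):

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from the links in a page."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

parser = AnchorExtractor()
parser.feed('See the <a href="http://example.org/HG803">HG803 Course Page</a>.')
print(parser.links)
```

A search engine would index these (href, text) pairs so that the anchor text of incoming links describes the target page.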

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
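The random walk with teleportation can be computed directly by power iteration; a small sketch (the three-page graph is invented for illustration):

```python
def pagerank(links, teleport=0.10, iters=100):
    """links: dict page -> list of outgoing links.
    Returns each page's long-term visit rate (its PageRank)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport with probability 10%
                    new[q] += teleport * rank[p] / n
                for q in out:                # else follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

r = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(sorted(r, key=r.get, reverse=True))  # C, with two inbound links, ranks highest
```

The ranks always sum to one, since each step only redistributes the probability mass; iterating the update is exactly simulating the steady state of the walk.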

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, e.g. <a href="http://…" rel="nofollow">, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z
http://www.springerlink.com/content/v7q114033401th5u
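Resolving a DOI just means prepending the doi.org resolver, which redirects to whatever location is currently stored in the metadata; a one-function sketch:

```python
def doi_to_url(doi):
    """Build the resolver URL for a DOI. The resolver redirects to the
    document's current location, so links survive when the location moves."""
    return "https://doi.org/" + doi

print(doi_to_url("10.1007/s10579-008-9062-z"))
```

This is why citing a DOI is more robust than citing a publisher URL: only the metadata behind the DOI needs updating when a page moves.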

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 56: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 57: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 58: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers (URIs)

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework (RDF)

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations into ontologies (OWL)

â Then reason over them, search them, …
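The triple model above can be sketched directly; the example URIs below are made up for illustration, and real systems would use an RDF library rather than plain tuples:

```python
# Each statement is a <subject, predicate, object> triple of URIs (or a literal object).
triples = [
    ("http://example.org/fcbond", "http://example.org/teaches",
     "http://example.org/HG2052"),
    ("http://example.org/HG2052", "http://example.org/title",
     "Language, Technology and the Internet"),
]

def query(store, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None matches anything."""
    return [t for t in store
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Who teaches what?
print(query(triples, predicate="http://example.org/teaches"))
```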

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â The Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (pretty useful)
â Total Number of Citations minus Self-citations
â Total of (Citations / Number of Authors)
â Average of (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Solution: weight citations by the quality of the citing paper
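The simple scores above can be sketched in a few lines; the citation counts here are invented for illustration:

```python
def citation_scores(papers):
    """Compute three of the simple scores for an author's papers.

    papers: list of dicts with keys 'citations', 'self_citations', 'n_authors'.
    Returns (total citations, total minus self-citations, citations per author).
    """
    total = sum(p["citations"] for p in papers)
    minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return total, minus_self, per_author

papers = [
    {"citations": 10, "self_citations": 2, "n_authors": 2},
    {"citations": 4, "self_citations": 0, "n_authors": 1},
]
print(citation_scores(papers))  # (14, 12, 9.0)
```

Note that none of these address the two problems above; that is what quality-weighted schemes like PageRank are for.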

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC (Strongly Connected Core): one can travel from any page to any other page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote (not very accurate)

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
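The random walk with teleportation can be computed by power iteration. This is a minimal sketch on an invented four-page graph, using the 10% teleportation rate from the slides:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Compute PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    A page with no outlinks (a dead end) jumps to every page with
    probability 1/N, independent of the teleportation rate.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if out:
                # Follow one of the outlinks equiprobably with prob (1 - teleport);
                # with 4 outlinks each gets (1 - 0.10)/4 = 0.225 of rank[p].
                for q in out:
                    new[q] += (1 - teleport) * rank[p] / len(out)
                # Teleport uniformly with the remaining probability.
                for q in pages:
                    new[q] += teleport * rank[p] / n
            else:
                # Dead end: jump uniformly to any page.
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

# Hypothetical graph: A -> B, B -> A and C, C is a dead end, D -> C.
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": [], "D": ["C"]}))
```

The ranks form a probability distribution (they sum to 1), and pages with more inlinks from well-ranked pages end up with higher long-term visit rates.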

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

ã Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

ã Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

â Google does not index the target of a link marked nofollow
â Yahoo! does not include the link in its ranking
â …

74

Current Status

atilde There is a continuous battle between

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z resolves to http://www.springerlink.com/content/v7q114033401th5u
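Because the DOI is permanent while the location metadata may change, in practice one links to the DOI resolver rather than to the current URL. A trivial sketch (the DOI is the one from the slide; the doi.org resolver then redirects to wherever the publisher currently hosts the document):

```python
def doi_url(doi):
    """Build the standard resolver URL for a DOI name."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```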

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (language itself)

ã Writing (invented 3-5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics (Pre-req HG2051, waived): an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 62: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

  â NLP can infer structure
  â NLP makes the Semantic Web feasible
  â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

  â Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  â Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

  â Total Number of Citations (Pretty Useful)
  â Total Number of Citations minus Self-citations
  â Total Number of (Citations / Number of Authors)
  â Average (Citation Impact Factor / Number of Authors)

ã Problems:

  â Not all citations are equal: citations by ‘good’ papers are better
  â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper
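The simple scores above can be sketched in a few lines of Python. The toy citation graph and its field names are made up for illustration; only the scoring logic follows the slide.

```python
# Toy citation graph: for each paper, its authors and the papers citing it.
papers = {
    "A": {"authors": {"bond", "smith"}, "cited_by": ["B", "C", "D"]},
    "B": {"authors": {"bond"},          "cited_by": ["C"]},
    "C": {"authors": {"lee"},           "cited_by": []},
    "D": {"authors": {"tan", "lee"},    "cited_by": ["A"]},
}

def total_citations(pid):
    """Simplest score: every citation is one vote."""
    return len(papers[pid]["cited_by"])

def citations_minus_self(pid):
    """Drop citations from papers that share an author with pid."""
    own = papers[pid]["authors"]
    return sum(1 for c in papers[pid]["cited_by"]
               if not (papers[c]["authors"] & own))

def citations_per_author(pid):
    """Citations divided by the number of authors."""
    return total_citations(pid) / len(papers[pid]["authors"])

print(total_citations("A"))       # 3
print(citations_minus_self("A"))  # 2 (B shares the author "bond")
print(citations_per_author("A"))  # 1.5
```

Weighting each vote by the quality of the citing paper is exactly the refinement PageRank makes on the web.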

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

  â Break research into small chunks to increase the number of citations
  â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language, Technology and the Internet</a>

For more information about Language, Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
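Collecting (href, anchor text) pairs is the first step of any anchor-text analysis. A minimal sketch with only the Python standard library (the example URL is hypothetical):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.pairs = []     # finished (href, text) pairs
        self._href = None   # href of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # only text inside an <a> counts
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

html = 'See the <a href="http://example.org/HG803">HG803 Course Page</a> for details.'
p = AnchorCollector()
p.feed(html)
print(p.pairs)  # [('http://example.org/HG803', 'HG803 Course Page')]
```

A search engine indexes these pairs under the *target* page, so the link text describes the page it points to, not the page it sits on.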

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

  â Simplest measure: each article gets one vote – not very accurate

ã On the web: citation frequency = inlink count

  â A high inlink count does not necessarily mean high quality …
  â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

  â An article's vote is weighted according to its citation impact
  â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web:

  â Start at a random page
  â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

  â For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: “jumping” from a dead end is independent of the teleportation rate
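The walk above can be simulated directly by power iteration: repeatedly redistribute each page's rank mass according to the teleporting rules until it stops changing. The 10% teleportation rate follows the slides; the four-page link structure is a made-up toy example.

```python
def pagerank(links, teleport=0.10, iters=100):
    """links: dict mapping each node to its list of outlinked nodes."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}      # start uniform
    for _ in range(iters):
        new = {v: 0.0 for v in nodes}
        for v in nodes:
            out = links[v]
            if not out:                     # dead end: jump uniformly
                for w in nodes:
                    new[w] += rank[v] / n
            else:
                for w in nodes:             # teleport with probability 10%
                    new[w] += teleport * rank[v] / n
                for w in out:               # else follow a random outlink
                    new[w] += (1 - teleport) * rank[v] / len(out)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
r = pagerank(links)
print(round(sum(r.values()), 6))  # rank mass stays a probability distribution
print(max(r, key=r.get))          # "C": it collects the most inlinks here
```

Because every page keeps a small teleport probability, the chain has a unique steady state, which is exactly the PageRank vector.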

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph, Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically, randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  â Google does not index the target of a link marked nofollow
  â Yahoo does not include the link in its ranking
  â …

74

Current Status

ã There is a continuous battle between:

  â Search companies, who want to get the most useful page to the user
  â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

  â Metadata about the object is stored with the DOI name
  â The metadata includes a location, such as a URL
  â The DOI for a document is permanent; the metadata may change
  â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  â DOI 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

  â Passively, for information access
  â Actively, for collaboration

ã The internet is interesting in and of itself

  â As a source of data about existing language
  â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 63: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 64: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 65: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random Walk

ã Imagine a web surfer doing a random walk on the web:

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: the proportion of the time a surfer will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
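The random walk with teleportation can be sketched as power iteration over a transition matrix. This is an illustrative toy (the graph in the usage example is invented), not a production implementation:

```python
import numpy as np

def pagerank(links, teleport=0.10, iters=100):
    """Toy PageRank by power iteration.

    links: dict mapping each page to its list of outgoing links.
    A dead end jumps to a random page, independent of the
    teleportation rate; otherwise we teleport with probability
    `teleport` and follow an outlink equiprobably otherwise.
    """
    pages = sorted(links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    P = np.zeros((n, n))                       # row-stochastic transitions
    for p, outs in links.items():
        i = index[p]
        if not outs:                           # dead end: jump anywhere
            P[i, :] = 1.0 / n
        else:
            for q in outs:                     # follow a link: (1 - teleport)/outdegree
                P[i, index[q]] += (1.0 - teleport) / len(outs)
            P[i, :] += teleport / n            # teleport: teleport/N to each page
    r = np.full(n, 1.0 / n)                    # start from the uniform distribution
    for _ in range(iters):
        r = r @ P                              # one step of the walk
    return dict(zip(pages, r))
```

On a small made-up graph such as `{'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}`, the ranks sum to 1 and the page with the most inlinks (C) ends up with the highest steady-state probability.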

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph, Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

ã Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

â Google does not index the target of a link marked nofollow
â Yahoo! does not include the link in its ranking
â …
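A crawler honouring nofollow only needs to check the rel attribute before keeping a link; a minimal sketch with html.parser (class name and sample markup are our own):

```python
from html.parser import HTMLParser

class FollowedLinkParser(HTMLParser):
    """Collect hrefs, skipping links marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        a = dict(attrs)
        # rel can hold several space-separated tokens, e.g. "external nofollow"
        rel = (a.get('rel') or '').lower().split()
        if 'nofollow' not in rel and 'href' in a:
            self.followed.append(a['href'])
```

Feeding `'<a href="/spam" rel="nofollow">buy</a> <a href="/good">ok</a>'` keeps only `/good`.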

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z resolves to http://www.springerlink.com/content/v7q114033401th5u
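Resolution works by prepending the central resolver to the DOI name; the resolver then redirects to whatever location is currently stored in the DOI's metadata, which is what makes the identifier persistent. A one-line sketch:

```python
def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI name.

    The resolver (https://doi.org/) redirects to the location
    currently recorded in the DOI's metadata, so the link keeps
    working even if the document moves.
    """
    return 'https://doi.org/' + doi
```

For example, `doi_url('10.1007/s10579-008-9062-z')` gives `https://doi.org/10.1007/s10579-008-9062-z`.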

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (language itself)

ã Writing (invented 3-5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics: (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 66: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 67: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 68: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 69: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 70: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
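The teleporting rule can be folded into a power-iteration computation of the steady state; a sketch using the slide's 10% teleportation rate on an illustrative four-page graph with one dead end (graph and names are made up):

```python
# Power iteration: with probability 0.1 teleport to a uniformly random page,
# otherwise follow a random outgoing link; dead ends always jump uniformly.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end

def pagerank(links, teleport=0.10, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p, r in rank.items():
            out = links[p]
            if not out:                      # dead end: jump with prob 1/N each
                for q in pages:
                    nxt[q] += r / n
            else:
                for q in pages:              # teleport share: 0.1/N each
                    nxt[q] += teleport * r / n
                for q in out:                # link-following share: 0.9/len(out)
                    nxt[q] += (1 - teleport) * r / len(out)
        rank = nxt
    return rank

pr = pagerank(links)
```

The total probability mass is conserved at each step, so the result stays a distribution; pages with no inbound links (like D here) end up with only the small teleport share.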

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

Review and Conclusions 73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

Review and Conclusions 74
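For instance, a hyperlink can be marked so that it passes no ranking credit to its target (the URL here is purely illustrative):

```html
<!-- The rel="nofollow" value tells participating search engines not to
     let this link influence the target's ranking. -->
<a href="http://example.com/some-page" rel="nofollow">some page</a>
```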

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76
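A DOI name is turned into a clickable, persistent link through the public doi.org resolver, which looks up the current location from the metadata; a small sketch using the DOI from the slide:

```python
def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI name via the public doi.org proxy."""
    return "https://doi.org/" + doi

# The resolver redirects to wherever the publisher currently hosts the object,
# so the link keeps working even if the underlying URL changes.
print(doi_url("10.1007/s10579-008-9062-z"))
# → https://doi.org/10.1007/s10579-008-9062-z
```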

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78


atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 81: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 82: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81