HG2052 Language, Technology and the Internet
Review and Conclusions

Francis Bond
Division of Linguistics and Multilingual Studies
http://www3.ntu.edu.sg/home/fcbond/
[email protected]

Lecture 12
HG2052 (2020)



Introduction

Review and Conclusions 1

What have we learned?

• How technology affects our use of language

• How language is used on the internet

– Some interesting things we can now do that we couldn't before

• Collaboration on the Web

• The web as a source of linguistic data: Direct Query and Sample

• The Semantic Web: meaning for non-humans

• Citation and reputation

Review and Conclusions 2

Reflection

• What was the most surprising thing in this class?

• What do you think is most likely wrong?

• What do you think is the coolest result?

• What do you think you're most likely to remember?

• How do you think this course will influence you as a linguist?

• What (if anything) did you hope to learn that you didn't?

Review and Conclusions 3

Goals

• Gain an understanding of how technology affects language use

• Develop familiarity with markup and meta information in texts

• Get a feel for what research is all about, especially relating to web mining and online frequency counting

Upon successful completion students will:

• have an understanding of how technology shapes language use

• be able to test linguistic hypotheses against web data

• know how to edit Wikipedia

Review and Conclusions 4

Themes

• Language and Technology

– Writing and Speech Technology

• Language and the Internet

– Email, Chat, Virtual Worlds, WWW, IM, Blogs, Facebook, Wikis, Twitter

• The Web as Corpus

• The Web beyond Language

– Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

• The two things that separate humans from animals (Sproat 2010, §1.1)

– Language
∗ large vocabulary (10,000+)
∗ complicated syntax (no upper limit on length: recursion, embedding)

– Technology
∗ Widespread tool use
∗ Widespread tool manufacture

• Speech — the start of language

• Writing — the first great intersection

Review and Conclusions 7

Language and the Internet

• New forms of communication

– Neither speech nor text
– Massively interactive

• Extremely rapid change

• A first-hand narrative (I was online before the internet :-)

– but I am probably behind you all now

Review and Conclusions 8

New forms

• Email (from PC, phone, other)

• Chat, Usenet

• Virtual Worlds

• WWW

• Blogs (overlap)

• Facebook, LinkedIn

• Wikis

• Twitter

Review and Conclusions 9

Representing Language

• Writing Systems

• Encodings

• Speech

• Bandwidth

Review and Conclusions 10

What is represented?

• Phonemes: maɪ dɒg laɪks ævəkɑdoz (45)

– Not a simple correspondence between a writing system and the sounds
– Some logographs (3, $)
– Even when a new alphabet is designed, pronunciation changes

• Syllables: (maɪ) (dɒg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

• Morphemes: maɪ ⟨me + 's⟩, dɒg, laɪk + s, ævəkɑdo + z (100,000+)

• Words: my dog likes avocados (200,000+)

• Concepts: speaker poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

• Alphabetic (Latin)

– one symbol for each consonant or vowel
– Typically 20-30 base symbols (1 byte)

• Syllabic (Hiragana)

– one symbol for each syllable (consonant+vowel)
– Typically 50-100 base symbols (1-2 bytes)

• Logographic (Hanzi)

– pictographs, ideographs, sound-meaning combinations
– Typically 100,000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

• Need to map characters to bits

• More characters require more space

– English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)

• Moving towards Unicode for everything

• If you get the encoding wrong, it is gibberish

– Web pages should state their encoding
– Sometimes they are wrong
– Encoding detection usually involves statistical analysis of byte patterns
∗ But an encoding can be shared by many languages
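A minimal sketch of the detection idea in Python (the candidate list and its ordering are illustrative; real detectors such as the chardet library score statistical models of byte patterns rather than just trying decodings):

```python
# Toy encoding detector: try candidate encodings in order, from most
# restrictive to least, and return the first that decodes cleanly.
# Note: ISO-8859-1 decodes *any* byte sequence, so it acts as a
# catch-all and must come last -- one reason real detectors need
# byte-pattern statistics, not just trial decoding.
CANDIDATES = ["ascii", "utf-8", "iso-8859-1"]

def guess_encoding(data: bytes) -> str:
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"

print(guess_encoding("café".encode("utf-8")))       # utf-8
print(guess_encoding("café".encode("iso-8859-1")))  # iso-8859-1
```

Note that the same byte string can be a valid decoding under several encodings, which is the slide's point: the statistics can tell you the encoding, but many languages can share one encoding.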

Review and Conclusions 13

Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming Speech

– Automatic Speech Recognition (ASR): sounds to text
– Text-to-Speech Synthesis (TTS): texts to sounds

• Speech technology — the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab    WER (%)   WER (%) adapted
Digits                       11      0.4       0.2
Dialogue (travel)        21,000     10.9       —
Dictation (WSJ)           5,000      3.9       3.0
Dictation (WSJ)          20,000     10.0       8.6
Dialogue (noisy, army)    3,000     42.2      31.0
Phone Conversations       4,000     41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

• Pronunciation depends on context

– The same word will be pronounced differently in different sentences

• Speaker variability

– Gender
– Dialect / Foreign Accent
– Individual Differences: physical differences, language differences (idiolect)

• Many, many rare events

– 300 out of 2000 diphones in the core set for the AT&T NextGen system occur
only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

• Sentence Segmentation
• Abbreviations: Dr Smith lives on Nanyang Dr He is …
• Word Segmentation

– 森前日銀総裁 Moriyama zen Nichigin Sousai
⊗ 森前日銀総裁 Moriyama zennichi gin Sousai

2. Speech Synthesis

• Find the pronunciation
• Generate sounds
• Add intonation

Review and Conclusions 17

Speech Synthesis

• Articulatory Synthesis: Attempt to model human articulation

• Formant Synthesis: Bypass modeling of human articulation and model acoustics directly

• Concatenative Synthesis: Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: Fast, very high pitch, loud

• Hot anger: Fast, high pitch, strong falling accent, loud

• Fear: Jitter

• Sarcasm: Prolonged accent, late peak

• Sad: Slow, low pitch

The main determinant of “naturalness” in speech synthesis is not “voice quality”
but natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
Usenet, Bulletin Boards, Gopher, Archie, …

• Large scale discourse studies still to be done

• Some genuinely new things

– time-lagged multi-person conversation
– raw un-edited text
– extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

– <rant>I can't stand this</rant>
– I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
– lusers (users as seen by Systems Administrators)
– suits

• Need to be in-group

cow orker: Coworker
clue: “You couldn't get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dance”
Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

– Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

– GREAT, *great*, _great_
– :-) (^_^) (o)
– brb, RTFM, IMHO
– lower case
– lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

– every time a file is saved, a new copy is stored

• CVS, Subversion, Git

– changes to a collection of files are tracked
– simultaneous changes are merged

• Revision Tracking

– Revisions are stored within a file

• Authorship in shared writing: Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every
single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge; people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

• Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia: Five pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia: Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

• The structure of Markup: Visual vs Logical
WYSIWYG, WYSIAYG, WYSIWYM

• The structure of the Web — hypertext: pages linked to pages

• The future of the Web

• Linguistic features of the web:
un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
or accessed only through a form

Unlinked content: pages which are not linked to by other pages, so crawlers cannot
find them by following links

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g.
ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g.
using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript,
as well as content dynamically downloaded from Web servers via Flash or Ajax
solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video)
files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

– What you see is what you get (WYSIWYG)
– Equivalent of printers' markup
– Shows what things look like

• Logical Markup (Structural)

– Shows the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: Search Engine as Query tool and WWW as corpus
(Objection: Results are not reliable)

– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow conclusions to be drawn about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

• Web Sample: Use a search engine to download data from the net and build a corpus
from it

– known size and exact hit counts → statistics possible
– people can draw conclusions over the included text types
– (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, Scripts)

• Document counts are shown to correlate directly with “real” frequencies (Keller
2003), so search engines can help — but:

– lots of repetitions of the same text (not representative)
– very limited query precision (no upper/lower case, no punctuation)
– only estimated counts, often hard to reproduce exactly
– different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or “google hit” is the most common unit used to count web snippets (in the
early 2000s)

– it is document frequency, not term frequency

• whit or “web hit” is the more general term

• Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than):
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for “Ghits per billion documents” is good if you want a more stable number
(suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
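The ratio comparison takes only a couple of lines; the counts below are the ones quoted on the slide (as accessed in 2012), not fresh measurements:

```python
# Compare two competing forms by web hit counts and report a unitless
# ratio, since raw ghit counts are unstable over time.
def hit_ratio(hits_a: int, hits_b: int) -> float:
    return hits_a / hits_b

# "different from" vs "different than" (counts accessed 2012-04-04)
r = hit_ratio(251_000_000, 71_500_000)
print(f"{r:.1f}:1")  # 3.5:1
```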

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora
(Kilgarriff 2006)

– gather documents for different topics (balance)
– exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
– exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
– do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)
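The first two steps can be sketched as follows (the seed words and the pairwise combination scheme are illustrative toys; Sharoff used much larger seed lists and search-engine APIs):

```python
# Sketch of steps 1-2: combine seed words into multi-word queries.
# A few hundred seeds combined in tuples yields thousands of queries.
from itertools import combinations

seeds = ["language", "internet", "corpus", "speech", "writing"]  # step 1 (toy list)

queries = [" ".join(pair) for pair in combinations(seeds, 2)]    # step 2
print(len(queries))  # 10 pairs from 5 seeds; C(500, 2) would give 124,750
```

Each query would then be sent to a search engine (step 3) and the returned URLs downloaded and cleaned (steps 4-5).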

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working papers on the Web as
Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

– Keywords and Categories
– Rankings
– Structural Markup

• Implicit Meta-data

– Links and Citations
– Tags
– Tables
– File Names
– Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

– When they are accurate, they are very good
– They are often deceitful

• HTML, PDF, Word, … Metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

– as they are unintended, they tend to be noisy
– but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations — Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

– Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

– Similarity-based categorisation and classification
∗ Character n-gram rankings

– Machine Learning based methods
– Context (under jp or ko)

• Hard to do for short text snippets, similar languages, mixed text
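The character-n-gram approach can be sketched with a toy identifier; the two training samples below are tiny invented strings, whereas real systems learn profiles from large corpora and cover many languages:

```python
# Toy character-bigram language identifier.
from collections import Counter

def profile(text: str, n: int = 2) -> Counter:
    """Character n-gram frequency profile of a text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

SAMPLES = {  # toy training data, one short sample per language
    "en": "the quick brown fox jumps over the lazy dog the end",
    "de": "der schnelle braune fuchs springt ueber den faulen hund",
}
PROFILES = {lang: profile(t) for lang, t in SAMPLES.items()}

def identify(snippet: str) -> str:
    """Pick the language whose profile overlaps most with the snippet's."""
    p = profile(snippet)
    def overlap(lang: str) -> int:
        return sum(min(c, PROFILES[lang][g]) for g, c in p.items())
    return max(PROFILES, key=overlap)

print(identify("the dog"))   # en
print(identify("der hund"))  # de
```

The slide's caveat shows up immediately: with snippets this short, closely related languages share many bigrams and the decision becomes unreliable.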

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
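A few of these steps can be sketched together; the stop list and suffix rules below are illustrative toys, not a real stemmer such as Porter's:

```python
# Toy normalization pipeline: tokenize, strip stop words, crude stemming.
import re

STOP = {"the", "a", "of"}

def normalize(text: str) -> list[str]:
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    tokens = [t for t in tokens if t not in STOP]
    stemmed = []
    for t in tokens:
        # crude stemming: strip a common suffix (cf. producer/produces -> produc)
        for suf in ("es", "er", "s"):
            if t.endswith(suf) and len(t) > len(suf) + 3:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(normalize("The producer produces avocados"))  # ['produc', 'produc', 'avocado']
```

Note how stemming conflates "producer" and "produces" into one index term, which is exactly the trade-off: better recall for search, at the cost of losing the distinction.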

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

– provides a common data representation framework
– makes possible integrating multiple sources
– so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

– Universal Resource Name: urn:isbn:1575864606
– Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything

• You can build relations in ontologies (OWL)

– Then reason over them, search them, …
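A minimal sketch of the triple model, with triples as Python tuples; the `ex:` prefix and the facts are invented for illustration, and a real application would use a library such as rdflib:

```python
# RDF-style triples as <subject, predicate, object> tuples.
triples = [
    ("ex:fcbond", "ex:teaches", "ex:HG2052"),
    ("ex:HG2052", "ex:title", "Language, Technology and the Internet"),
    ("ex:HG2052", "ex:offeredBy", "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(query(s="ex:HG2052"))  # everything asserted about the course
```

Pattern matching over triples like this is the core of SPARQL queries; ontologies add inference rules on top of the raw triples.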

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible — know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the
published work of a scientist, scholar or research group

• Some scores are:

– Total Number of Citations (Pretty Useful)
– Total Number of Citations minus Self-citations
– Total Number of (Citations / Number of Authors)
– Average (Citation Impact Factor / Number of Authors)

• Problems

– Not all citations are equal: citations by ‘good’ papers are better
– Newer publications suffer in relation to older ones

• Weight Citations by Quality of the paper
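The first three scores can be computed directly; the paper records below are invented for illustration:

```python
# Toy citation scores for one author's papers.
papers = [
    {"cites": 40, "authors": 2, "self_cites": 5},
    {"cites": 10, "authors": 4, "self_cites": 1},
]

total = sum(p["cites"] for p in papers)
total_minus_self = sum(p["cites"] - p["self_cites"] for p in papers)
per_author = sum(p["cites"] / p["authors"] for p in papers)  # split credit

print(total, total_minus_self, per_author)  # 50 44 22.5
```

Even on this toy data the scores disagree about how much credit the author gets, which is the slide's point: the choice of metric matters, and each one can be gamed.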

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — you can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator
of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to
a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

– Simplest measure: Each article gets one vote — not very accurate

• On the web, citation frequency = inlink count

– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

– An article's vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

– Start at a random page
– At each step, go out of the current page along one of the links on that page,
equiprobably

• In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting — to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with
a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

– For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: “jumping” from a dead end is independent of the teleportation rate
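The walk above can be computed by power iteration; the four-page link graph below is invented for illustration, with the 10% teleportation rate from the slide:

```python
# PageRank by power iteration with teleportation (toy 4-page web).
teleport = 0.10

links = {  # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
N = len(pages)

rank = {p: 1 / N for p in pages}          # start uniformly
for _ in range(100):
    new = {p: teleport / N for p in pages}  # teleportation mass
    for p, outs in links.items():
        if outs:  # spread (1 - teleport) of p's rank over its outlinks
            for q in outs:
                new[q] += (1 - teleport) * rank[p] / len(outs)
        else:     # dead end: jump to a random page (independent of teleport)
            for q in pages:
                new[q] += rank[p] / N
    rank = new

print(max(rank, key=rank.get))  # C: it collects the most incoming weight
```

The ranks converge to the steady-state visit rates; here C wins because A, B and D all send weight to it, while D, with no inlinks at all, keeps only the teleportation minimum.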

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes
advantage of link-based ranking algorithms, which give websites higher rankings the
more other highly ranked websites link to them. Examples include adding links within
blogs

• Link Farms: creating tightly-knit communities of pages referencing each other, also
known humorously as mutual admiration societies

• Scraper Sites: “scrape” search-engine results pages or other sources of content and
create “content” for a website. The specific presentation of content on these sites is
unique, but is merely an amalgamation of content taken from other sources, often
without permission

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing,
such as wikis, blogs and guestbooks. Agents can be written that automatically and
randomly select a user-edited web page, such as a Wikipedia article, and add spamming
links

The nofollow link: a value that can be assigned to the rel attribute of an HTML
hyperlink to instruct some search engines that the hyperlink should not influence the
link target's ranking in the search engine's index

– Google does not index the target of a link marked nofollow
– Yahoo does not include the link in its ranking
– …

74

Current Status

• There is a continuous battle between

– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object

– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

– DOI 10.1007/s10579-008-9062-z
→ http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

– Passively, for information access
– Actively, for collaboration

• The internet is interesting in and of itself

– As a source of data about existing language
– As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python;
introduces both programming and linguistics

• HG3051 Corpus Linguistics — (pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 2: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Introduction

Review and Conclusions 1

What have we learned

atilde How technology affects our use of language

atilde How language is used on the internet

acirc Some interesting things we can now do that we couldnrsquot before

atilde Collaboration on the Web

atilde The web as a source of linguistic data Direct Query and Sample

atilde The Semantic Web meaning for non-humans

atilde Citation and reputation

Review and Conclusions 2

Reflection

atilde What was the most surprising thing in this class

atilde What do you think is most likely wrong

atilde What do you think is the coolest result

atilde What do you think yoursquore most likely to remember

atilde How do you think this course will influence you as a linguist

atilde What (if anything) did you hope to learn that you didnrsquot

Review and Conclusions 3

Goals

atilde Gain an understanding of how technology affects language use

atilde Develop familiarity with markup and meta information in texts

atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting

Upon successful completion students will

atilde have an understanding of how technology shapes language use

atilde be able to test linguistic hypotheses against web data

atilde know how to edit Wikipedia

Review and Conclusions 4

Themes

atilde Language and Technology

acirc Writing and Speech Technology

atilde Language and the Internet

acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter

atilde The Web as Corpus

atilde The Web beyond Language

acirc Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

atilde The two things that separate humans from animals (Sproat 2010 sect11)

acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)

acirc Technologylowast Widespread tool uselowast Widespread tool manufacture

atilde Speech mdash the start of language

atilde Writing mdash the first great intersection

Review and Conclusions 7

Language and the Internet

atilde New forms of communication

acirc Neither speech nor textacirc Massively interactive

atilde Extremely rapid change

atilde A first hand narrative (I was online before the internet -)

acirc but I am probably behind you all now

Review and Conclusions 8

New forms

atilde Email (from PC phone other)

atilde Chat Usenet

atilde Virtual Worlds

atilde WWW

atilde Blogs (overlap)

atilde Facebook LinkedIn

atilde Wikis

atilde Twitter

Review and Conclusions 9

Representing Languageatilde Writing Systems

atilde Encodings

atilde Speech

atilde Bandwidth

Review and Conclusions 10

What is represented

• Phonemes: maɪ dɒɡ laɪks ævəkɑdoʊz (45)

  – Not a simple correspondence between a writing system and the sounds
  – Some logographs (3, $)
  – Even when a new alphabet is designed, pronunciation changes

• Syllables: (maɪ) (dɒɡ) (laɪks) (æ)(və)(kɑ)(doʊz) (10,000+)

• Morphemes: maɪ = me + 's, dɒɡ, laɪk + s, ævəkɑdo + z (100,000+)

• Words: my dog likes avocados (200,000+)

• Concepts: speaker, poss, canine, fond, avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

• Alphabetic (Latin)

  – one symbol for each consonant or vowel
  – typically 20–30 base symbols (1 byte)

• Syllabic (Hiragana)

  – one symbol for each syllable (consonant+vowel)
  – typically 50–100 base symbols (1–2 bytes)

• Logographic (Hanzi)

  – pictographs, ideographs, sound-meaning combinations
  – typically 100,000+ symbols (2–3 bytes)

Review and Conclusions 12

Computational Encoding

• Need to map characters to bits

• More characters require more space

  – English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8–32 bits)

• Moving towards Unicode for everything

• If you get the encoding wrong, it is gibberish

  – Web pages should state their encoding
  – Sometimes they are wrong
  – Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
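The two points above, wrong-encoding gibberish and byte-pattern detection, can be sketched in a few lines of Python. The detection heuristic here is a deliberately minimal stand-in for real detectors (which compare byte-pattern statistics against many encodings):

```python
# "café" encoded as UTF-8 but decoded as Latin-1 comes out as mojibake.
data = "café".encode("utf-8")        # b'caf\xc3\xa9'
wrong = data.decode("iso-8859-1")    # 'cafÃ©' -- gibberish
right = data.decode("utf-8")         # 'café'

def guess_encoding(raw: bytes) -> str:
    """Minimal byte-pattern detection: UTF-8 multi-byte sequences are
    self-validating, so text that decodes cleanly is probably UTF-8."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"          # permissive 8-bit fallback
```

Note the caveat from the slide: even a correctly detected encoding does not identify the language, since many languages share one encoding.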

Review and Conclusions 13

Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming speech

  – Automatic Speech Recognition (ASR): sounds to text
  – Text-to-Speech Synthesis (TTS): text to sounds

• Speech technology – the telephone

Review and Conclusions 14

How good are the systems

Task                     Vocab    WER (%)   WER (%) adapted
Digits                      11      0.4       0.2
Dialogue (travel)       21,000     10.9       –
Dictation (WSJ)          5,000      3.9       3.0
Dictation (WSJ)         20,000     10.0       8.6
Dialogue (noisy, army)   3,000     42.2      31.0
Phone conversations      4,000     41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult

• Pronunciation depends on context

  – The same word will be pronounced differently in different sentences

• Speaker variability

  – Gender
  – Dialect / foreign accent
  – Individual differences: physical differences, language differences (idiolect)

• Many, many rare events

  – 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

  • Sentence segmentation
  • Abbreviations: Dr Smith lives on Nanyang Dr. He is …
  • Word segmentation
    ✓ 森山前日銀総裁 Moriyama zen Nichigin sousai (Moriyama, former Bank of Japan governor)
    ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai (mis-segmented)

2. Speech Synthesis

  • Find the pronunciation
  • Generate sounds
  • Add intonation
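The abbreviation problem in step 1 is easy to demonstrate with a naive splitter; a small Python sketch (the splitting rule and example sentence are illustrative, not from the lecture):

```python
import re

# The same token 'Dr.' is a title in one place and 'Drive' in the other.
text = "Dr. Smith lives on Nanyang Dr. He is a linguist."

# Naive rule: a full stop followed by whitespace ends a sentence.
naive = re.split(r"(?<=\.)\s+", text)
print(naive)
# ['Dr.', 'Smith lives on Nanyang Dr.', 'He is a linguist.']
# Wrong: it splits after the title 'Dr.', yet a blanket "never split
# after Dr." rule would instead miss the real boundary after 'Nanyang Dr.'
```

Disambiguating the two uses requires context (e.g. a capitalized word follows a title, a street abbreviation follows a name), which is why segmentation is part of linguistic analysis rather than simple string processing.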

Review and Conclusions 17

Speech Synthesis

• Articulatory synthesis: attempt to model human articulation

• Formant synthesis: bypass modeling of human articulation and model the acoustics directly

• Concatenative synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sadness: slow, low pitch

  "The main determinant of 'naturalness' in speech synthesis is not 'voice quality' but natural-sounding prosody (intonation and duration)."

    – Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

  Reading    300    200 (proofreading)
  Writing     31     21 (composing)
  Speaking   150
  Hearing    150    210 (speeded up)
  Typing      33     19 (composing)

• Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

  Speech-like            Text-like
  time bound             space bound
  spontaneous            contrived
  face-to-face           visually decontextualized
  loosely structured     elaborately structured
  socially interactive   factually communicative
  immediately revisable  repeatedly revisable
  prosodically rich      graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things:

  – time-lagged multi-person conversation
  – raw, un-edited text
  – extreme multi-authorship

Review and Conclusions 23

Email

  Speech-like            Text-like
  time bound             space bound (deletable)
  spontaneous            contrived
  face-to-face           visually decontextualized
  loosely structured     elaborately structured
  socially interactive   factually communicative
  immediately revisable  repeatedly revisable
  prosodically rich      graphically rich

Review and Conclusions 24

Usenet (asynchronous)

  Speech-like            Text-like
  time bound             space bound
  spontaneous            contrived
  face-to-face           visually decontextualized
  loosely structured     elaborately structured
  socially interactive   factually communicative
  immediately revisable  repeatedly revisable
  prosodically rich      graphically rich

Review and Conclusions 25

Chat (synchronous)

  Speech-like            Text-like
  time bound             space bound
  spontaneous            contrived
  face-to-face           visually decontextualized
  loosely structured     elaborately structured
  socially interactive   factually communicative
  immediately revisable  repeatedly revisable
  prosodically rich      graphically rich

Review and Conclusions 26

Blogs

  Speech-like                      Text-like
  time bound                       space bound
  spontaneous                      contrived
  face-to-face                     visually decontextualized
  loosely structured               elaborately structured
  socially interactive (comments)  factually communicative
  immediately revisable            repeatedly revisable
  prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

  – <rant>I can't stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  – lusers (users, as seen by systems administrators)
  – suits

• Need to be in-group

  cow orker = co-worker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
    – Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

  – Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  – GREAT, *great*, _great_ (emphasis without formatting)
  – :-) (^_^) (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

  – every time a file is opened, a new copy is stored

• CVS, Subversion, Git

  – changes to a collection of files are tracked
  – simultaneous changes are merged

• Revision tracking

  – revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes
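The core of revision tracking, storing versions and making each author's change explicit, can be sketched with Python's difflib (a toy stand-in for CVS/Subversion/Git; the revision records and author names are invented):

```python
import difflib

# Each stored revision records who made the change and the resulting text.
revisions = [
    ("alice", ["Language and the Internet", "Draft notes"]),
    ("bob",   ["Language, Technology and the Internet", "Draft notes"]),
]

old_author, old = revisions[0]
new_author, new = revisions[1]

# A unified diff makes bob's change (and his responsibility for it) explicit.
diff = list(difflib.unified_diff(old, new, lineterm=""))
changed = [line for line in diff
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
print(new_author, changed)
```

Real systems add merging of simultaneous edits on top of exactly this kind of line-level diff.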

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge; people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

• The structure of markup: visual vs logical
  WYSIWYG, WYSIAYG, WYSIWYM

• The structure of the Web – hypertext:
  pages linked to pages

• The future of the Web

• Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual markup (presentational)

  – What you see is what you get (WYSIWYG)
  – Equivalent of printers' markup
  – Shows what things look like

• Logical markup (structural)

  – Shows the structure and meaning
  – Can be mapped to visual markup
  – Less flexible than visual markup
  – More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

  – Known size and exact hit counts → statistics possible
  – People can draw conclusions about the included text types
  – (Limited) control over the content
  ⊗ Sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
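The ratio computation is trivial but worth making explicit; a sketch using the counts from the slide:

```python
# ghit counts for "different from" vs "different than" (accessed 2012-04-04).
different_from = 251_000_000
different_than = 71_500_000

# A unitless ratio is more stable across time than raw, estimated counts.
ratio = different_from / different_than
print(f"{ratio:.1f}:1")   # 3.5:1
```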

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with the available tools (here: taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  – do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding, cleanup, tagging and parsing)
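Step 2, turning a seed lexicon into many queries, might look like this in Python (the seed words and tuple size are illustrative; Sharoff combined seeds into thousands of multi-word queries):

```python
import itertools

seeds = ["language", "internet", "corpus", "speech", "technology"]  # toy seed set

# Every 2-word combination of seeds becomes one search-engine query.
queries = [" ".join(pair) for pair in itertools.combinations(seeds, 2)]
print(len(queries), queries[0])   # 10 queries from 5 seeds
```

With 500 seeds and larger tuples, the same idea yields the thousands of queries used in the real pipeline.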

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts
  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit meta-data

  – Keywords and categories
  – Rankings
  – Structural markup

• Implicit meta-data

  – Links and citations
  – Tags
  – Tables
  – File names
  – Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate, they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and tags

• Rankings

• Links and citations

• Structural markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations – bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning based methods
  – Context (under .jp or .kr)

• Hard to do for short text snippets, similar languages, mixed text
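A character n-gram ranking identifier (in the style of Cavnar & Trenkle's out-of-place measure) fits in a few lines of Python. The two-language training texts here are toy stand-ins; real systems train on much more data:

```python
from collections import Counter

def profile(text, n=3, k=50):
    """Top-k character n-grams, most frequent first."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(k)]

def out_of_place(doc, ref):
    """Sum of rank differences between profiles; smaller = more similar."""
    return sum(abs(i - ref.index(g)) if g in ref else len(ref)
               for i, g in enumerate(doc))

# Toy training text per language (hypothetical; far too small for real use).
train = {"en": "the quick brown fox jumps over the lazy dog " * 20,
         "nl": "de snelle bruine vos springt over de luie hond " * 20}
refs = {lang: profile(t) for lang, t in train.items()}

def identify(text):
    doc = profile(text)
    return min(refs, key=lambda lang: out_of_place(doc, refs[lang]))

print(identify("the dog jumps"))   # en
```

Even this toy version illustrates the slide's caveat: the shorter the snippet, and the more similar the languages, the less reliable the ranking becomes.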

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel
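Two of the steps above, number normalization and stemming, can be sketched minimally in Python (the regex and suffix list are illustrative; a real stemmer would use Porter's algorithm):

```python
import re

def normalize_number(s):
    """Map '$700K'-style strings to a canonical integer."""
    m = re.fullmatch(r"\$?(\d+(?:\.\d+)?)([KMB]?)", s.strip())
    if not m:
        return None
    value, suffix = float(m.group(1)), m.group(2)
    scale = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    return int(value * scale[suffix])

def stem(word):
    """Toy stemmer: strip a few common English suffixes."""
    for suf in ("es", "er", "s"):
        if word.endswith(suf):
            return word[:-len(suf)]
    return word

print(normalize_number("$700K"))            # 700000
print(stem("produces"), stem("producer"))   # produc produc
```

The point of both steps is the same: collapse surface variants so that counts and comparisons operate on one canonical form.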

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations into ontologies (OWL)

  – Then reason over them, search them, …
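The triple model is simple enough to sketch without any RDF machinery; the URIs below are hypothetical, and a real system would use an RDF library rather than plain tuples:

```python
# Triples of <subject, predicate, object>, each element a URI.
EX = "http://example.org/"
triples = {
    (EX + "fcbond", EX + "teaches", EX + "HG2052"),
    (EX + "HG2052", EX + "about", EX + "language_and_internet"),
}

def objects(subject, predicate):
    """All objects o such that <subject, predicate, o> is asserted."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# A one-step inference by chaining triples: what topics does fcbond teach about?
topics = {t for course in objects(EX + "fcbond", EX + "teaches")
            for t in objects(course, EX + "about")}
print(topics)
```

Chaining triples like this is the seed of what OWL reasoners do at scale: new conclusions drawn by combining statements from multiple sources.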

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: citation analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total number of (citations / number of authors)
  – Average (citation impact factor / number of authors)

• Problems:

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
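The simpler scores above are easy to compute once citation records exist; a toy sketch with invented numbers:

```python
# Hypothetical records: (citations, self_citations, n_authors) per paper.
papers = [(120, 10, 2), (45, 5, 1), (8, 3, 4)]

total            = sum(c for c, _, _ in papers)        # raw count: pretty useful
minus_self       = sum(c - s for c, s, _ in papers)    # discount self-citation
fractional_count = sum(c / a for c, _, a in papers)    # share credit among authors

print(total, minus_self, fractional_count)   # 173 155 107.0
```

Note how the three scores already disagree about relative impact, which is exactly why weighting by the quality of the citing paper (as PageRank does for links) is the next step.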

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote – not very accurate

• On the web, citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: 'jumping' from a dead end is independent of the teleportation rate
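The teleporting random walk above can be computed directly by power iteration; a minimal sketch on a hypothetical 4-page graph (the graph itself is invented for illustration):

```python
# Graph: page 0 -> 1, 2;  page 1 -> 2;  page 2 -> 0;  page 3 is a dead end.
N = 4
links = {0: [1, 2], 1: [2], 2: [0], 3: []}
teleport = 0.10                          # the teleportation rate

rank = [1.0 / N] * N                     # start from a uniform distribution
for _ in range(100):                     # iterate to the steady state
    new = [0.0] * N
    for page, out in links.items():
        if not out:                      # dead end: jump anywhere, prob 1/N each
            for q in range(N):
                new[q] += rank[page] / N
        else:
            for q in range(N):           # teleport with prob 0.1, spread over N
                new[q] += teleport * rank[page] / N
            for q in out:                # else follow one outlink equiprobably
                new[q] += (1 - teleport) * rank[page] / len(out)
    rank = new

print([round(r, 3) for r in rank])       # long-term visit rates; they sum to 1
```

The resulting vector is the PageRank: the dead-end page gets only its teleport share, while the pages inside the cycle accumulate most of the visit rate.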

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

• Link farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

• Scraper sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74

Current Status

• There is a continuous battle between

  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a persistent identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press.

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press.

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics – (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 3: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

What have we learned

atilde How technology affects our use of language

atilde How language is used on the internet

acirc Some interesting things we can now do that we couldnrsquot before

atilde Collaboration on the Web

atilde The web as a source of linguistic data Direct Query and Sample

atilde The Semantic Web meaning for non-humans

atilde Citation and reputation

Review and Conclusions 2

Reflection

atilde What was the most surprising thing in this class

atilde What do you think is most likely wrong

atilde What do you think is the coolest result

atilde What do you think yoursquore most likely to remember

atilde How do you think this course will influence you as a linguist

atilde What (if anything) did you hope to learn that you didnrsquot

Review and Conclusions 3

Goals

atilde Gain an understanding of how technology affects language use

atilde Develop familiarity with markup and meta information in texts

atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting

Upon successful completion students will

atilde have an understanding of how technology shapes language use

atilde be able to test linguistic hypotheses against web data

atilde know how to edit Wikipedia

Review and Conclusions 4

Themes

atilde Language and Technology

acirc Writing and Speech Technology

atilde Language and the Internet

acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter

atilde The Web as Corpus

atilde The Web beyond Language

acirc Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

atilde The two things that separate humans from animals (Sproat 2010 sect11)

acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)

acirc Technologylowast Widespread tool uselowast Widespread tool manufacture

atilde Speech mdash the start of language

atilde Writing mdash the first great intersection

Review and Conclusions 7

Language and the Internet

atilde New forms of communication

acirc Neither speech nor textacirc Massively interactive

atilde Extremely rapid change

atilde A first hand narrative (I was online before the internet -)

acirc but I am probably behind you all now

Review and Conclusions 8

New forms

atilde Email (from PC phone other)

atilde Chat Usenet

atilde Virtual Worlds

atilde WWW

atilde Blogs (overlap)

atilde Facebook LinkedIn

atilde Wikis

atilde Twitter

Review and Conclusions 9

Representing Languageatilde Writing Systems

atilde Encodings

atilde Speech

atilde Bandwidth

Review and Conclusions 10

What is represented

atilde Phonemes ma d laks aeligvkadoz (45)

acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes

atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)

atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)

atilde Words my dog likes avocados (200000+)

atilde Concepts speaker poss canine fond avacado+PL (400000+)

Review and Conclusions 11

Three Major Writing Systems

atilde Alphabetic (Latin)

acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)

atilde Syllabic (Hiragana)

acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)

atilde Logographic (Hanzi)

acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech like             Text like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large scale discourse studies still to be done

• Some genuinely new things:

  ◦ time-lagged multi-person conversation
  ◦ raw un-edited text
  ◦ extreme multi-authorship

Review and Conclusions 23

Email

Speech like             Text like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like             Text like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like             Text like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech like                      Text like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

  ◦ <rant>I can't stand this</rant>
  ◦ I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  ◦ lusers (users as seen by Systems Administrators)
  ◦ suits

• Need to be in-group

cow orker: Coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
    – Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

  ◦ Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  ◦ GREAT, *great*, _great_
  ◦ :-)  (^_^)  (o)
  ◦ brb, RTFM, IMHO
  ◦ lower case
  ◦ lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

  ◦ every time a file is opened a new copy is stored

• CVS, Subversion, Git

  ◦ changes to a collection of files are tracked
  ◦ simultaneous changes are merged

• Revision Tracking

  ◦ Revisions are stored within a file

• Authorship in shared writing: Explicit responsibility for changes

Review and Conclusions 31
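The tracked-changes idea above can be illustrated with Python's stdlib difflib. This shows what a delta between two revisions looks like, not how any particular VCS stores it (Git, for instance, stores snapshots and computes deltas on demand):

```python
import difflib

# Two revisions of a shared text; a VCS tracks deltas like this and
# attributes each change to its author.
rev1 = "Wikipedia is an encyclopedia.\nAnyone can edit it.\n".splitlines(keepends=True)
rev2 = "Wikipedia is a free encyclopedia.\nAnyone can edit it.\n".splitlines(keepends=True)

diff = list(difflib.unified_diff(rev1, rev2, fromfile="rev1", tofile="rev2"))
print("".join(diff), end="")
```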

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge: people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWYM

• The structure of the Web — hypertext: pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  ◦ What you see is what you get (WYSIWYG)
  ◦ Equivalent of printers' markup
  ◦ Shows what things look like

• Logical Markup (Structural)

  ◦ Show the structure and meaning
  ◦ Can be mapped to visual markup
  ◦ Less flexible than visual markup
  ◦ More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: Search Engine as Query tool and WWW as corpus
  (Objection: results are not reliable)

  ◦ Population and exact hit counts are unknown → no statistics possible
  ◦ Indexing does not allow one to draw conclusions on the data
  ⊗ Google is missing functionality that linguists/lexicographers would like to have

• Web Sample: Use a search engine to download data from the net and build a corpus from it

  ◦ known size and exact hit counts → statistics possible
  ◦ people can draw conclusions over the included text types
  ◦ (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, Scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  ◦ lots of repetitions of the same text (not representative)
  ◦ very limited query precision (no upper/lower case, no punctuation)
  ◦ only estimated counts, often hard to reproduce exactly
  ◦ different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  ◦ it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman)
  but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
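The advice above (report ratios and, where possible, size-normalised counts) can be made concrete; function names are my own, and the counts are the slide's 2012 figures:

```python
def ratio(ghits_a, ghits_b):
    """Unitless comparison of two web counts: report a ratio, since
    raw hit counts are unstable and hard to reproduce."""
    return ghits_a / ghits_b

def gpb(ghits, index_size):
    """Ghits per billion documents (Mark Liberman's unit); requires the
    index size, which Google no longer releases."""
    return ghits * 1_000_000_000 / index_size

# 'different from' vs 'different than' (counts as accessed 2012-04-04):
print(round(ratio(251_000_000, 71_500_000), 1))  # 3.5, i.e. roughly 3.5:1
```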

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  ◦ gather documents for different topics (balance)
  ◦ exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  ◦ exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  ◦ do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47
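Steps 1–2 of the outline can be sketched without any network access, BootCaT-style: combine seed words into random tuples to use as queries. The seed list and tuple size here are illustrative, not from the paper:

```python
import random

def make_queries(seeds, tuple_size=3, n_queries=5, rng_seed=0):
    """Combine seed words into random tuples to send to a search
    engine (deterministic here via a fixed RNG seed)."""
    rng = random.Random(rng_seed)
    queries = set()
    while len(queries) < n_queries:
        queries.add(" ".join(sorted(rng.sample(seeds, tuple_size))))
    return sorted(queries)

# Toy seed words; a real run uses ~500 seeds and ~6,000 queries.
seeds = ["language", "internet", "corpus", "technology", "web", "data"]
for q in make_queries(seeds, n_queries=3):
    print(q)
```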

Internet Corpora Summary

• The web can be used as a corpus

  ◦ Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  ◦ Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  ◦ Keywords and Categories
  ◦ Rankings
  ◦ Structural Markup

• Implicit Meta-data

  ◦ Links and Citations
  ◦ Tags
  ◦ Tables
  ◦ File Names
  ◦ Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  ◦ When they are accurate they are very good
  ◦ They are often deceitful

• HTML, PDF, Word, … Metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  ◦ as they are unintended, they tend to be noisy
  ◦ but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations — Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  ◦ Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  ◦ Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  ◦ Machine Learning based methods
  ◦ Context (under jp or ko)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
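The character n-gram ranking approach above can be sketched as a Cavnar–Trenkle-style classifier. The training texts here are toy pangrams, far too small for real use, and the function names are mine:

```python
from collections import Counter

def profile(text, n=3, top=50):
    """Character n-gram frequency profile: the most common n-grams, in rank order."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(p1, p2):
    """Rank-order distance: how far each n-gram has moved between profiles."""
    return sum(abs(i - p2.index(g)) if g in p2 else len(p2)
               for i, g in enumerate(p1))

def identify(text, training):
    """Pick the training language whose profile is closest to the text's."""
    p = profile(text)
    return min(training, key=lambda lang: out_of_place(p, training[lang]))

# Tiny illustrative training texts; real systems train on much more data.
training = {
    "en": profile("the quick brown fox jumps over the lazy dog " * 20),
    "de": profile("der schnelle braune fuchs springt über den faulen hund " * 20),
}
print(identify("the dog jumps over the fox", training))  # en
```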

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel

Review and Conclusions 54
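The number-normalization examples above all denote the same value. A toy normalizer covering just those three surface patterns (a real system needs many more rules and context; the function name is mine):

```python
import re

def normalize_number(s):
    """Map a few monetary surface forms to an integer number of dollars."""
    s = s.strip()
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)K", s)          # $700K
    if m:
        return int(float(m.group(1)) * 1_000)
    m = re.fullmatch(r"\$?([\d,]+)", s)                  # $700,000
    if m:
        return int(m.group(1).replace(",", ""))
    m = re.fullmatch(r"(\d+(?:\.\d+)?) million dollars", s)  # 0.7 million dollars
    if m:
        return int(float(m.group(1)) * 1_000_000)
    raise ValueError(s)

# All three surface forms map to the same value:
print({normalize_number(s) for s in ["$700K", "$700,000", "0.7 million dollars"]})  # {700000}
```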

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  ◦ provides a common data representation framework
  ◦ makes possible integrating multiple sources
  ◦ so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  ◦ is a markup language, much like HTML
  ◦ was designed to carry data, not to display data
  ◦ tags are not predefined: you must define your own tags
  ◦ is designed to be self-descriptive
  ◦ XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58
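Syntactic well-formedness, the check mentioned above, is simply "does it parse": every open tag must have a matching close in the right place. A minimal sketch using the standard library (validity against a DTD or schema is a separate, stronger check):

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True iff the string parses as XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course></code>"))  # False: tags cross
```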

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  ◦ Universal Resource Name: urn:isbn:1575864606
  ◦ Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond

• Identify relations with the Resource Description Framework

  ◦ Triples of <subject, predicate, object>
  ◦ Each element is a URI
  ◦ RDFs are written in well defined XML
  ◦ You can say anything about anything

• You can build relations in ontologies (OWL)

  ◦ Then reason over them, search them, …

Review and Conclusions 59
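The triple model can be sketched as a toy in-memory store with wildcard matching. The URIs follow the slide's examples plus the (real) Dublin Core `creator` predicate; the stored facts are illustrative, and real stores are queried with SPARQL rather than a Python function:

```python
# Toy triple store: each fact is a <subject, predicate, object> tuple.
triples = {
    ("http://www3.ntu.edu.sg/home/fcbond/", "http://purl.org/dc/terms/creator", "Francis Bond"),
    ("urn:isbn:1575864606", "http://purl.org/dc/terms/format", "book"),
}

def query(triples, s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

for _, _, creator in query(triples, p="http://purl.org/dc/terms/creator"):
    print(creator)  # Francis Bond
```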

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  ◦ NLP can infer structure
  ◦ NLP makes the Semantic Web feasible
  ◦ the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  ◦ Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  ◦ Context based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  ◦ Total Number of Citations (Pretty Useful)
  ◦ Total Number of Citations minus Self-citations
  ◦ Total Number of (Citations / Number of Authors)
  ◦ Average (Citation Impact Factor / Number of Authors)

• Problems:

  ◦ Not all citations are equal: citations by 'good' papers are better
  ◦ Newer publications suffer in relation to older ones

• Weight Citations by Quality of the paper

Review and Conclusions 64
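Two of the raw scores above can be computed directly from a small citation graph. The data and author names are invented for illustration, and this treats any shared author as a self-citation:

```python
def citation_scores(authors, citations):
    """Total citations, and total minus self-citations.
    `citations` maps citing paper -> list of cited papers;
    `authors` maps paper -> set of author names."""
    totals, non_self = {}, {}
    for citing, cited_papers in citations.items():
        for cited in cited_papers:
            totals[cited] = totals.get(cited, 0) + 1
            if authors[citing].isdisjoint(authors[cited]):  # no shared author
                non_self[cited] = non_self.get(cited, 0) + 1
    return totals, non_self

# Invented data: B shares an author with A, so B citing A is a self-citation.
authors = {"A": {"bond"}, "B": {"bond", "smith"}, "C": {"lee"}}
citations = {"B": ["A"], "C": ["A", "B"]}
totals, non_self = citation_scores(authors, citations)
print(totals, non_self)  # {'A': 2, 'B': 1} {'A': 1, 'B': 1}
```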

Gaming Citations

• Least/Minimum Publishable Unit

  ◦ Break research into small chunks to increase the number of citations
  ◦ Sometimes there is very little new information

• Self citation, in-group citation

• Write only proceedings (some journals are not often read)

• Submitting only to High Impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

For more information about Language Technology and the Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  ◦ Simplest measure: Each article gets one vote – not very accurate

• On the web, citation frequency = inlink count

  ◦ A high inlink count does not necessarily mean high quality …
  ◦ … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  ◦ An article's vote is weighted according to its citation impact
  ◦ This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  ◦ Start at a random page
  ◦ At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With remaining probability (90%), go out on a random hyperlink

  ◦ For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
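The random walk with teleporting can be implemented directly as power iteration. A sketch: the toy graph is mine, and the 10% teleport rate and dead-end handling follow the slide:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration for the teleporting random walk: with probability
    `teleport` jump to a random page, otherwise follow a random outlink;
    dead ends always jump uniformly."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if links[p]:
                for q in links[p]:          # 90%: follow one of the outlinks
                    new[q] += (1 - teleport) * rank[p] / len(links[p])
                for q in pages:             # 10%: teleport anywhere
                    new[q] += teleport * rank[p] / n
            else:
                for q in pages:             # dead end: jump anywhere
                    new[q] += rank[p] / n
        rank = new
    return rank

# Toy graph; D is a dead end. A page with 4 outlinks would send
# (1 - 0.10)/4 = 0.225 of its rank along each link, as in the slide.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C has the highest long-term visit rate
```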

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  ◦ Google does not index the target of a link marked nofollow
  ◦ Yahoo does not include the link in its ranking
  ◦ …

74

Current Status

• There is a continuous battle between:

  ◦ Search companies, who want to get the most useful page to the user
  ◦ Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  ◦ Metadata about the object is stored with the DOI name
  ◦ The metadata includes a location, such as a URL
  ◦ The DOI for a document is permanent; the metadata may change
  ◦ Gives a Persistent Identifier (like ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  ◦ DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  ◦ Passively, for information access
  ◦ Actively, for collaboration

• The internet is interesting in and of itself

  ◦ As a source of data about existing language
  ◦ As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics — (pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 4: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Reflection

atilde What was the most surprising thing in this class

atilde What do you think is most likely wrong

atilde What do you think is the coolest result

atilde What do you think yoursquore most likely to remember

atilde How do you think this course will influence you as a linguist

atilde What (if anything) did you hope to learn that you didnrsquot

Review and Conclusions 3

Goals

atilde Gain an understanding of how technology affects language use

atilde Develop familiarity with markup and meta information in texts

atilde Get a feel for what research is all about especially relating to web mining and onlinefrequency counting

Upon successful completion students will

atilde have an understanding of how technology shapes language use

atilde be able to test linguistic hypotheses against web data

atilde know how to edit Wikipedia

Review and Conclusions 4

Themes

atilde Language and Technology

acirc Writing and Speech Technology

atilde Language and the Internet

acirc Email Chat Virtual Worlds WWW IM Blogs Facebook Wikis Twitter

atilde The Web as Corpus

atilde The Web beyond Language

acirc Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

atilde The two things that separate humans from animals (Sproat 2010 sect11)

acirc Languagelowast large vocabulary (10000+)lowast complicated syntax (no upper length recursion embedding)

acirc Technologylowast Widespread tool uselowast Widespread tool manufacture

atilde Speech mdash the start of language

atilde Writing mdash the first great intersection

Review and Conclusions 7

Language and the Internet

atilde New forms of communication

acirc Neither speech nor textacirc Massively interactive

atilde Extremely rapid change

atilde A first hand narrative (I was online before the internet -)

acirc but I am probably behind you all now

Review and Conclusions 8

New forms

atilde Email (from PC phone other)

atilde Chat Usenet

atilde Virtual Worlds

atilde WWW

atilde Blogs (overlap)

atilde Facebook LinkedIn

atilde Wikis

atilde Twitter

Review and Conclusions 9

Representing Languageatilde Writing Systems

atilde Encodings

atilde Speech

atilde Bandwidth

Review and Conclusions 10

What is represented

atilde Phonemes ma d laks aeligvkadoz (45)

acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes

atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)

atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)

atilde Words my dog likes avocados (200000+)

atilde Concepts speaker poss canine fond avacado+PL (400000+)

Review and Conclusions 11

Three Major Writing Systems

atilde Alphabetic (Latin)

acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)

atilde Syllabic (Hiragana)

acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)

atilde Logographic (Hanzi)

acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)(English computer science students various studies)

Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)

atilde Reading gtgt SpeakingHearing gtgt Typing

rArr Speech for inputrArr Text for output

Review and Conclusions 20

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

  • Content policies: NPOV, Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

• The structure of markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWYM)

• The structure of the Web: hypertext, pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links
them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g.
ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g.
using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript,
as well as content dynamically downloaded from Web servers via Flash or Ajax
solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video)
files or in specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  – What you see is what you get (WYSIWYG)
  – Equivalent of printers' markup
  – Shows what things look like

• Logical Markup (Structural)

  – Shows the structure and meaning
  – Can be mapped to visual markup
  – Less flexible than visual markup
  – More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus
  from it

  – known size and exact hit counts → statistics possible
  – people can draw conclusions over the included text types
  – (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller
  2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the
  early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number
  (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
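As a quick arithmetic check, the ratio quoted above can be computed directly; a small Python sketch using the slide's 2012 figures:

```python
# Compare web counts for two competing phrases as a unitless ratio.
# The counts are the ghit figures quoted on the slide.
def ghit_ratio(hits_a, hits_b):
    """Return the ratio of two web counts, rounded to one decimal place."""
    return round(hits_a / hits_b, 1)

# "different from" vs "different than"
ratio = ghit_ratio(251_000_000, 71_500_000)
print(f"{ratio}:1")  # 3.5:1
```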

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora
  (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with available tools (here, taggers
    and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, or too long: word
    lists)
  – do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6,000)

3 Query a search engine and retrieve the URLs (50,000)

4 Download the files from the URLs (100,000,000 words)

5 Postprocess the data (encoding, cleanup, tagging and parsing)
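Steps 1 and 2 of the pipeline can be sketched in a few lines of Python; the seed words here are invented stand-ins for a real 500-word list:

```python
# A sketch of steps 1-2: combine seed words into search queries.
# The seed list is a tiny invented stand-in for a real 500-word list.
import itertools
import random

seeds = ["time", "people", "water", "school", "music"]  # step 1 (sample)
random.seed(0)

# Step 2: form queries from random pairs of seed words.
pairs = list(itertools.combinations(seeds, 2))
queries = [" ".join(pair) for pair in random.sample(pairs, 6)]
print(queries)
```

Each query would then be sent to a search engine API (step 3), which is omitted here.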

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as
Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate, they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations: Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it
  (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words

  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings

  – Machine-learning-based methods
  – Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text
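A toy illustration of the character n-gram approach, in the spirit of rank-order profile matching; the two training snippets are invented and far too small for real identification:

```python
# Toy character-trigram language identifier (rank-order profile matching).
# The English and Dutch training snippets are invented toy data.
from collections import Counter

def profile(text, n=3, top=50):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get a maximum penalty."""
    return sum(
        abs(i - lang_profile.index(g)) if g in lang_profile else len(lang_profile)
        for i, g in enumerate(doc_profile)
    )

training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat",
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(snippet):
    doc = profile(snippet)
    return min(profiles, key=lambda lang: distance(doc, profiles[lang]))

print(identify("the dog and the fox"))  # en
```

With realistic training data the same rank-order idea scales to many languages, but it still struggles on the short, similar, or mixed inputs noted above.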

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
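Two of the steps above, stop-word stripping and stemming, can be sketched naively; the suffix list is an invented simplification, not a real stemmer such as Porter's:

```python
# A naive sketch of stop-word stripping and suffix stemming.
# The suffix list is invented for illustration; real systems use a
# proper stemmer (e.g. Porter) and a curated stop-word list.
STOP_WORDS = {"the", "a", "of"}

def stem(word):
    """Strip a few common English suffixes (crude, for illustration)."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def normalize(text):
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [stem(t) for t in tokens]

print(normalize("The producer produces avocados"))
# ['produc', 'produc', 'avocado']
```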

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – is a W3C Recommendation

• Can be validated for syntactic well-formedness
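Well-formedness checking can be demonstrated with the Python standard library; the sample documents are invented:

```python
# Checking XML well-formedness with the standard library parser.
# The two sample documents are invented for illustration.
import xml.etree.ElementTree as ET

good = "<course><code>HG2052</code></course>"
bad = "<course><code>HG2052</course>"  # <code> is never closed

def well_formed(doc):
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed(good), well_formed(bad))  # True False
```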

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations in ontologies (OWL)

  – Then reason over them, search them, …
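The triple model can be illustrated with plain tuples; the `ex:` URIs and the facts are made up for the example:

```python
# RDF statements are subject-predicate-object triples; here they are
# modelled as plain tuples, with invented ex: URIs as the three parts.
triples = [
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
]

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# Everything we know about HG2052:
print(query(s="ex:HG2052"))
```

Real RDF stores add URIs with full namespaces, serialization formats, and inference over OWL ontologies, but the underlying data model is exactly this pattern-matching over triples.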

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible: know thyself

5 Schemas aren't neutral

6 Metrics influence results

7 There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of
  the published work of a scientist, scholar or research group

• Some scores are:

  – Total Number of Citations (Pretty Useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)

• Problems

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
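The simpler scores above amount to basic arithmetic over a publication list; a sketch with invented papers and counts:

```python
# Citation scores for one researcher's publication list.
# The papers, citation counts and author counts are invented.
papers = [
    {"citations": 120, "self_citations": 20, "authors": 4},
    {"citations": 45, "self_citations": 5, "authors": 1},
]

total = sum(p["citations"] for p in papers)
total_minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
author_normalized = sum(p["citations"] / p["authors"] for p in papers)

print(total, total_minus_self, author_normalized)  # 165 140 75.0
```

Weighting each citation by the quality of the citing paper is exactly what PageRank-style algorithms formalize, as the following slides show.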

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC, Strongly Connected Core: can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator
  of page A

2 The (extended) anchor text pointing to page B is a good description of page B
  (This is not always the case; for instance, most corporate websites have a pointer
  from every page to a page containing a copyright notice)
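Extracting (URL, anchor-text) pairs, as a link-analysis system would, can be sketched with the standard library's HTML parser; the page snippet is invented:

```python
# Collect (href, anchor text) pairs from an HTML page.
# The page snippet and URL are invented for illustration.
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((self._href, data.strip()))
            self._href = None

page = '<p>See the <a href="http://example.com/hg2052">HG2052 Course Page</a></p>'
parser = AnchorExtractor()
parser.feed(page)
print(parser.links)  # [('http://example.com/hg2052', 'HG2052 Course Page')]
```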

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote (not very accurate)

• On the web, citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page,
    equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with
  a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with
    probability (1 - 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
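The random walk with a 10% teleportation rate can be simulated by power iteration; the four-page graph (with dead end "d") is invented:

```python
# PageRank by power iteration with a 10% teleportation rate.
# The four-page link graph is invented; page "d" is a dead end.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
pages = list(links)
N = len(pages)
teleport = 0.10

rank = {p: 1 / N for p in pages}
for _ in range(100):
    new = {p: 0.0 for p in pages}
    for p in pages:
        if not links[p]:  # dead end: jump anywhere with probability 1/N
            for q in pages:
                new[q] += rank[p] / N
        else:
            for q in pages:  # teleport with probability 0.10
                new[q] += rank[p] * teleport / N
            for q in links[p]:  # otherwise follow a random outlink
                new[q] += rank[p] * (1 - teleport) / len(links[p])
    rank = new

print({p: round(r, 3) for p, r in rank.items()})
```

The ranks always sum to 1 (they are a probability distribution), and the dead-end page, which is reached only by teleporting, ends up with the lowest rank.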

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam
  takes advantage of link-based ranking algorithms, which give websites higher
  rankings the more other highly ranked websites link to them. Examples include
  adding links within blogs

• Link Farms: creating tightly-knit communities of pages referencing each other, also
  known humorously as "mutual admiration societies"

• Scraper Sites: "scrape" search-engine results pages or other sources of content and
  create "content" for a website. The specific presentation of content on these sites is
  unique, but it is merely an amalgamation of content taken from other sources, often
  without permission

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing,
  such as wikis, blogs and guestbooks. Agents can be written that automatically and
  randomly select a user-edited web page, such as a Wikipedia article, and add
  spamming links

• The nofollow link: a value that can be assigned to the rel attribute of an HTML
  hyperlink to instruct some search engines that the hyperlink should not influence
  the link target's ranking in the search engine's index

  – Google does not follow a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74

Current Status

• There is a continuous battle between

  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies
  coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000
  organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An
  Introduction to Natural Language Processing, Computational Linguistics and Speech
  Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language
  Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information
  Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python;
  introduces both programming and linguistics

• HG3051 Corpus Linguistics (Pre-req HG2051, waived): an introduction to the
  fast-growing field of corpus linguistics

Review and Conclusions 81

Goals

• Gain an understanding of how technology affects language use

• Develop familiarity with markup and meta information in texts

• Get a feel for what research is all about, especially relating to web mining and
  online frequency counting

Upon successful completion students will:

• have an understanding of how technology shapes language use

• be able to test linguistic hypotheses against web data

• know how to edit Wikipedia

Review and Conclusions 4

Themes

• Language and Technology

  – Writing and Speech Technology

• Language and the Internet

  – Email, Chat, Virtual Worlds, WWW, IM, Blogs, Facebook, Wikis, Twitter

• The Web as Corpus

• The Web beyond Language

  – Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

• The two things that separate humans from animals (Sproat 2010, §1.1)

  – Language
    ∗ large vocabulary (10,000+)
    ∗ complicated syntax (no upper length: recursion, embedding)

  – Technology
    ∗ Widespread tool use
    ∗ Widespread tool manufacture

• Speech: the start of language

• Writing: the first great intersection

Review and Conclusions 7

Language and the Internet

• New forms of communication

  – Neither speech nor text
  – Massively interactive

• Extremely rapid change

• A first-hand narrative (I was online before the internet :-)

  – but I am probably behind you all now

Review and Conclusions 8

New forms

• Email (from PC, phone, other)

• Chat, Usenet

• Virtual Worlds

• WWW

• Blogs (overlap)

• Facebook, LinkedIn

• Wikis

• Twitter

Review and Conclusions 9

Representing Language

• Writing Systems

• Encodings

• Speech

• Bandwidth

Review and Conclusions 10

What is represented

• Phonemes: maɪ dɔg laɪks ævəkɑdoz (45)

  – Not a simple correspondence between a writing system and the sounds
  – Some logographs (3, $)
  – Even when a new alphabet is designed, pronunciation changes

• Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

• Morphemes: me + 's dɔg laɪk + s ævəkɑdo + z (100,000+)

• Words: my dog likes avocados (200,000+)

• Concepts: speaker+poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

• Alphabetic (Latin)

  – one symbol for each consonant or vowel
  – Typically 20-30 base symbols (1 byte)

• Syllabic (Hiragana)

  – one symbol for each syllable (consonant + vowel)
  – Typically 50-100 base symbols (1-2 bytes)

• Logographic (Hanzi)

  – pictographs, ideographs, sound-meaning combinations
  – Typically 100,000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

• Need to map characters to bits

• More characters require more space

  – English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)

• Moving towards Unicode for everything

• If you get the encoding wrong, it is gibberish

  – Web pages should state their encoding
  – Sometimes they are wrong
  – Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
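Both points, the byte cost of different scripts and the gibberish from a wrong decoding, can be demonstrated directly in Python:

```python
# UTF-8 byte costs for different scripts, and mojibake from decoding
# with the wrong encoding. The example words are arbitrary.
latin = "dog".encode("utf-8")   # Latin letters: 1 byte each
kana = "いぬ".encode("utf-8")    # Hiragana: 3 bytes each
hanzi = "犬".encode("utf-8")     # Hanzi: 3 bytes

print(len(latin), len(kana), len(hanzi))  # 3 6 3

# Decoding UTF-8 bytes as ISO-8859-1 produces gibberish, not 犬:
print(hanzi.decode("iso-8859-1"))
```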

Review and Conclusions 13

Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming Speech

  – Automatic Speech Recognition (ASR): sounds to text
  – Text-to-Speech Synthesis (TTS): text to sounds

• Speech technology: the Telephone

Review and Conclusions 14

How good are the systems

Task                     Vocab     WER (%)   WER (%) adapted
Digits                   11        0.4       0.2
Dialogue (travel)        21,000    10.9      —
Dictation (WSJ)          5,000     3.9       3.0
Dictation (WSJ)          20,000    10.0      8.6
Dialogue (noisy, army)   3,000     42.2      31.0
Phone Conversations      4,000     41.9      31.0

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

• Pronunciation depends on context

  – The same word will be pronounced differently in different sentences

• Speaker variability

  – Gender
  – Dialect / Foreign Accent
  – Individual differences: physical differences, language differences (idiolect)

• Many, many rare events

  – 300 out of 2000 diphones in the core set for the AT&T NextGen system occur
    only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

  • Sentence Segmentation
  • Abbreviations: Dr Smith lives on Nanyang Dr. He is …
  • Word Segmentation

    – 森山前日銀総裁 Moriyama zen Nichigin sousai
    ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai

2 Speech Synthesis

  • Find the pronunciation
  • Generate sounds
  • Add intonation

Review and Conclusions 17

Speech Synthesis

• Articulatory Synthesis: attempt to model human articulation

• Formant Synthesis: bypass modeling of human articulation and model acoustics
  directly

• Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sad: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality'
but natural-sounding prosody (intonation and duration)"

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing    31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing     33     19 (composing)

• Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things

  – time-lagged multi-person conversation
  – raw un-edited text
  – extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47
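Step 2 of the outline (combining seed words into queries) can be sketched as follows; the seed list is a made-up toy example, and the actual querying and downloading steps are omitted:

```python
import random

def build_query_list(seed_words, words_per_query=4, n_queries=10):
    """Combine seed words into search queries (step 2 of the outline)."""
    queries = []
    for _ in range(n_queries):
        # sample without replacement so a query has no repeated words
        queries.append(" ".join(random.sample(seed_words, words_per_query)))
    return queries

# Toy seed list; Sharoff's pipeline uses ~500 mid-frequency words.
seeds = ["language", "internet", "corpus", "technology", "writing", "speech"]
random.seed(0)
for q in build_query_list(seeds, n_queries=3):
    print(q)
```

Each query is then sent to a search engine, the returned URLs are downloaded, and the pages are cleaned and tagged (steps 3 to 5).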

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ⊗ unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ⊗ Less data

- Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations: Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine-learning-based methods
  - Context (under .jp or .kr)

- Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
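The character n-gram ranking approach (in the style of Cavnar and Trenkle's out-of-place measure) can be sketched as follows; the two training snippets are toy examples, far smaller than real training data:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Rank the most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, candidate):
    """Distance between profiles: sum of rank displacements.

    N-grams missing from the candidate get a fixed maximum penalty.
    """
    penalty = len(candidate)
    return sum(candidate.index(g) if g in candidate else penalty
               for g in profile)

# Toy training texts; real systems train on much larger samples.
models = {
    "en": ngram_profile("the cat sat on the mat and the dog ran after the cat"),
    "de": ngram_profile("der Hund und die Katze sitzen auf der Matte im Haus"),
}

snippet = "the dog and the cat"
best = min(models,
           key=lambda lang: out_of_place(ngram_profile(snippet), models[lang]))
print(best)  # "en": the snippet shares many trigrams with the English model
```

As the slide notes, exactly this kind of method degrades on short snippets and closely related languages, where the profiles barely differ.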

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel

Review and Conclusions 54
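A few of these normalization steps can be sketched with toy rules; the suffix list and the $…K expansion rule below are illustrative assumptions, not a real stemmer such as Porter's:

```python
import re

def normalize_number(text):
    """Very rough number normalization: $700K -> $700,000 (assumed rule)."""
    def expand(m):
        return "${:,}".format(int(float(m.group(1)) * 1000))
    return re.sub(r"\$(\d+(?:\.\d+)?)K", expand, text)

def naive_stem(word):
    """Toy suffix-stripping stemmer in the spirit of the slide's examples."""
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[:-len(suffix)]
    return word

print(normalize_number("It cost $700K last year"))  # It cost $700,000 last year
print(naive_stem("produces"), naive_stem("producer"))  # produc produc
```

Note how stemming conflates "produces" and "producer" to the same (non-word) stem, whereas lemmatization would map "produces" to the dictionary form "produce".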

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color", but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - it is a markup language, much like HTML
  - it was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - it is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness

Review and Conclusions 58
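Well-formedness checking (as opposed to validation against a DTD or schema) can be sketched with Python's standard library:

```python
import xml.etree.ElementTree as ET

def is_well_formed(doc):
    """Check syntactic well-formedness (not validity against a schema)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<note><to>Tove</to></note>"))  # True
print(is_well_formed("<note><to>Tove</note></to>"))  # False (mis-nested tags)
```

Well-formedness is purely syntactic: every tag must close and nest properly. Whether the tags are the *right* ones is a separate question, answered by schema validation.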

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers (URIs)

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations into ontologies (OWL)

  - Then reason over them, search them, …

Review and Conclusions 59
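A minimal sketch of RDF-style triples and pattern matching, using plain tuples rather than a real RDF store; the example.org URIs and the facts themselves are made up for illustration:

```python
# Triples as (subject, predicate, object); real RDF uses full URIs
# and serializations such as RDF/XML or Turtle.
EX = "http://example.org/"   # hypothetical namespace

triples = [
    (EX + "HG2052", EX + "taughtBy", EX + "FrancisBond"),
    (EX + "HG2052", EX + "topic",    EX + "SemanticWeb"),
    (EX + "FrancisBond", EX + "worksAt", EX + "NTU"),
]

def query(store, s=None, p=None, o=None):
    """Return all triples matching a pattern; None is a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

for t in query(triples, s=EX + "HG2052"):
    print(t)
```

Because each element is a URI, triples from different sources can be merged into one graph and queried together, which is the "draw new conclusions" payoff above.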

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

⇒ Weight citations by the quality of the citing paper

Review and Conclusions 64
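Two of the scores listed above (total citations, and citations divided by the number of authors) can be sketched as follows; the input format is a made-up toy structure:

```python
def citation_scores(papers):
    """Compute two of the slide's simple per-author scores.

    papers: list of dicts with 'authors' and 'citations' keys;
    this field layout is illustrative, not a standard format.
    """
    totals, shared = {}, {}
    for p in papers:
        per_author = p["citations"] / len(p["authors"])
        for a in p["authors"]:
            totals[a] = totals.get(a, 0) + p["citations"]
            shared[a] = shared.get(a, 0) + per_author
    return totals, shared

papers = [
    {"authors": ["A", "B"], "citations": 100},
    {"authors": ["A"],      "citations": 10},
]
totals, shared = citation_scores(papers)
print(totals)  # {'A': 110, 'B': 100}
print(shared)  # {'A': 60.0, 'B': 50.0}
```

Dividing by the number of authors already changes the ranking incentives, which is exactly the gaming problem discussed on the next slide.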

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self-citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core – you can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B.
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote – not very accurate

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web:

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
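The random walk with teleportation can be sketched as a power iteration; the four-page graph (with D as a dead end) is a toy example:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power-iteration PageRank with teleportation (the slide's 10% rate).

    links: dict mapping each page to a list of outgoing links.
    Dead ends jump to a random page with probability 1/N each.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links[p]
            if out:  # non-dead end: teleport 10%, follow links with the rest
                for q in pages:
                    new[q] += rank[p] * teleport / n
                for q in out:
                    new[q] += rank[p] * (1 - teleport) / len(out)
            else:    # dead end: jump to a random page, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(graph)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

The iteration converges to the steady-state visit rates described above; note that D, which nothing links to, still gets a small rank purely from teleportation.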

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74

Current Status

- There is a continuous battle between:

  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3–5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics – (prerequisite HG2051, can be waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Themes

- Language and Technology

  - Writing and Speech Technology

- Language and the Internet

  - Email, Chat, Virtual Worlds, WWW, IM, Blogs, Facebook, Wikis, Twitter

- The Web as Corpus

- The Web beyond Language

  - Semantic Web and Networks

Review and Conclusions 5

Revision

Review and Conclusions 6

Language and Technology

- The two things that separate humans from animals (Sproat 2010, §1.1):

  - Language
    * large vocabulary (10,000+)
    * complicated syntax (no upper length: recursion, embedding)

  - Technology
    * widespread tool use
    * widespread tool manufacture

- Speech – the start of language

- Writing – the first great intersection

Review and Conclusions 7

Language and the Internet

- New forms of communication

  - Neither speech nor text
  - Massively interactive

- Extremely rapid change

- A first-hand narrative (I was online before the internet :-)

  - but I am probably behind you all now

Review and Conclusions 8

New forms

- Email (from PC, phone, other)

- Chat, Usenet

- Virtual Worlds

- WWW

- Blogs (overlap)

- Facebook, LinkedIn

- Wikis

- Twitter

Review and Conclusions 9

Representing Language

- Writing Systems

- Encodings

- Speech

- Bandwidth

Review and Conclusions 10

What is represented

- Phonemes: maɪ dɔg laɪks ævəkɑdoz (45)

  - Not a simple correspondence between a writing system and the sounds
  - Some logographs (3, $)
  - Even when a new alphabet is designed, pronunciation changes

- Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

- Morphemes: maɪ (= me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)

- Words: my dog likes avocados (200,000+)

- Concepts: speaker-poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

- Alphabetic (Latin)

  - one symbol per consonant or vowel
  - Typically 20–30 base symbols (1 byte)

- Syllabic (Hiragana)

  - one symbol for each syllable (consonant + vowel)
  - Typically 50–100 base symbols (1–2 bytes)

- Logographic (Hanzi)

  - pictographs, ideographs, sound–meaning combinations
  - Typically 100,000+ symbols (2–3 bytes)

Review and Conclusions 12

Computational Encoding

- Need to map characters to bits

- More characters require more space

  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); everything: UTF-8 (8–32 bits)

- Moving towards Unicode for everything

- If you get the encoding wrong, it is gibberish

  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns
    * But an encoding can be shared by many languages

Review and Conclusions 13
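A crude sketch of byte-pattern encoding detection: try strict decoders in order and keep the first that succeeds. The candidate list is an assumption for illustration; real detectors use statistical models of byte patterns, and note that latin-1 accepts any byte sequence, so it can only serve as a last resort:

```python
def guess_encoding(raw):
    """Try candidate encodings in order; return the first that decodes.

    utf-8 is tried first because invalid sequences make it fail fast;
    latin-1 never fails, so anything unrecognised falls through to it.
    """
    for encoding in ("utf-8", "euc-jp", "latin-1"):
        try:
            return encoding, raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None, None

raw = "日本語".encode("euc-jp")
enc, text = guess_encoding(raw)
print(enc, text)  # euc-jp 日本語
```

This also shows why pages that lie about their encoding turn to gibberish: the same bytes decode to different characters under different tables.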

Speech and Language Technology

- The need for speech representation

- Storing sound

- Transforming Speech

  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): text to sounds

- Speech technology – the Telephone

Review and Conclusions 14

How good are the systems?

Task                    Vocab    WER (%)   WER (%), adapted
Digits                     11       0.4       0.2
Dialogue (travel)      21,000      10.9       –
Dictation (WSJ)         5,000       3.9       3.0
Dictation (WSJ)        20,000      10.0       8.6
Dialogue (noisy, army)  3,000      42.2      31.0
Phone conversations     4,000      41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)

- Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

   - Sentence Segmentation
   - Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
   - Word Segmentation

     - 森山前日銀総裁 → Moriyama zen Nichigin sousai
     ⊗ 森山前日銀総裁 → Moriyama zennichi gin sousai

2. Speech Synthesis

   - Find the pronunciation
   - Generate sounds
   - Add intonation

Review and Conclusions 17
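The abbreviation problem in sentence segmentation can be seen with a naive splitter; the rule below wrongly splits after the title "Dr.":

```python
import re

def naive_split(text):
    """Split after a full stop followed by whitespace and a capital letter.

    A deliberately simplistic rule, to show why abbreviations make
    sentence segmentation hard.
    """
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)

text = "Dr. Smith lives on Nanyang Dr. He is a lecturer."
for s in naive_split(text):
    print(s)
# Dr.
# Smith lives on Nanyang Dr.
# He is a lecturer.
```

The split after "Nanyang Dr." happens to be right, but the identical pattern also fires after the title "Dr.", producing a spurious first "sentence"; real segmenters need abbreviation lists or statistical models to tell the two apart.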

Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation

- Formant Synthesis: bypass modeling of human articulation and model acoustics directly

- Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

- Excitement: fast, very high pitch, loud

- Hot anger: fast, high pitch, strong falling accent, loud

- Fear: jitter

- Sarcasm: prolonged accent, late peak

- Sadness: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

- All share some characteristics of speech and text

- Usage norms are not fixed

- Communication methods may disappear before the norms are fixed
  (Usenet, Bulletin Boards, Gopher, Archie, …)

- Large-scale discourse studies are still to be done

- Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time-bound                      space-bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  - lusers (users, as seen by systems administrators)
  - suits

- Need to be in-group

  cow orker: coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
      – Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-)  (^_^)  (^o^)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge;
  people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

- The structure of markup: visual vs logical
  (WYSIWYG, WYSIAYG, WYSIWIM)

- The structure of the Web – hypertext:
  pages linked to pages

- The future of the Web

- Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (no backlinks)

Private Web: sites that require registration and login (edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited-access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    - Read it and see if it is interesting (hard for a computer)
    - Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    - See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations ÷ Number of Authors)
  - Average (Citation Impact Factor ÷ Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper
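The first few scores are plain arithmetic; a sketch with invented figures (the field names and the toy numbers are illustrative only, not from the slides):

```python
def citation_scores(papers):
    """papers: dicts with 'citations', 'self_citations', 'n_authors'."""
    total      = sum(p["citations"] for p in papers)
    minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return total, minus_self, per_author

# A toy record of two papers by the same author.
scores = citation_scores([
    {"citations": 40, "self_citations": 4, "n_authors": 2},
    {"citations": 10, "self_citations": 1, "n_authors": 1},
])
```

Note that none of these scores address the two problems above; weighting by the quality of the citing paper is exactly what PageRank-style measures add.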

Review and Conclusions 64

Gaming Citations

- Least (Minimum) Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language, Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://...">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
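Extracting (href, anchor text) pairs is the first step of such link analysis; a sketch using Python's standard html.parser (the example URL is invented):

```python
from html.parser import HTMLParser

class AnchorTextExtractor(HTMLParser):
    """Collect (href, anchor-text) pairs: the raw material of link analysis."""
    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorTextExtractor()
parser.feed('See the <a href="http://example.org/hg2052">HG2052 Course Page</a>.')
```

After feeding a page, `parser.links` holds each link target with the text that describes it.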

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote — not very accurate

- On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web:

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting — to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

- 10% is a parameter: the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
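The random walk with teleporting can be computed by power iteration; a minimal sketch of the random-surfer model (the four-page graph is invented; 10% teleportation rate as on the slide):

```python
def pagerank(links, n_iter=100, teleport=0.10):
    """PageRank over {page: [outlinks]} via the random-surfer model."""
    pages = sorted(links)
    N = len(pages)
    rank = {p: 1.0 / N for p in pages}
    for _ in range(n_iter):
        nxt = dict.fromkeys(pages, 0.0)
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump anywhere, 1/N each
                for q in pages:
                    nxt[q] += rank[p] / N
            else:
                for q in pages:              # teleport with probability 0.10
                    nxt[q] += rank[p] * teleport / N
                for q in out:                # otherwise follow a random outlink
                    nxt[q] += rank[p] * (1 - teleport) / len(out)
        rank = nxt
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

In this toy graph "c", with three inlinks, ends up with the highest rank, while "d", with none, gets only the teleport share (0.1/4 = 0.025 here).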

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

- Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u/
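Resolution itself is just a redirect service: appending the DOI to the central resolver gives the persistent link (https://doi.org/ is the current resolver host; the article's actual location behind it may change):

```python
def doi_url(doi: str) -> str:
    """Persistent link for a DOI: the resolver redirects to the current
    location recorded in the DOI metadata, which may change over time."""
    return "https://doi.org/" + doi

url = doi_url("10.1007/s10579-008-9062-z")
```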

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3–5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics — (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Revision

Review and Conclusions 6

Language and Technology

- The two things that separate humans from animals (Sproat 2010, §1.1):

  - Language
    - large vocabulary (10,000+)
    - complicated syntax (no upper length: recursion, embedding)

  - Technology
    - Widespread tool use
    - Widespread tool manufacture

- Speech — the start of language

- Writing — the first great intersection

Review and Conclusions 7

Language and the Internet

- New forms of communication

  - Neither speech nor text
  - Massively interactive

- Extremely rapid change

- A first-hand narrative (I was online before the internet :-)

  - but I am probably behind you all now

Review and Conclusions 8

New forms

- Email (from PC, phone, other)

- Chat, Usenet

- Virtual Worlds

- WWW

- Blogs (overlap)

- Facebook, LinkedIn

- Wikis

- Twitter

Review and Conclusions 9

Representing Language

- Writing Systems

- Encodings

- Speech

- Bandwidth

Review and Conclusions 10

What is represented?

- Phonemes: /maɪ dɔg laɪks ævəkɑdoz/ (45)

  - Not a simple correspondence between a writing system and the sounds
  - Some logographs (3, $)
  - Even when a new alphabet is designed, pronunciation changes

- Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

- Morphemes: maɪ = me + 's, dɔg, laɪk + s, ævəkɑdo + z (100,000+)

- Words: my dog likes avocados (200,000+)

- Concepts: speaker+poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

- Alphabetic (Latin)

  - one symbol for a consonant or vowel
  - Typically 20–30 base symbols (1 byte)

- Syllabic (Hiragana)

  - one symbol for each syllable (consonant+vowel)
  - Typically 50–100 base symbols (1–2 bytes)

- Logographic (Hanzi)

  - pictographs, ideographs, sound–meaning combinations
  - Typically 100,000+ symbols (2–3 bytes)

Review and Conclusions 12

Computational Encoding

- Need to map characters to bits

- More characters require more space

  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8–32 bits)

- Moving towards Unicode for everything

- If you get the encoding wrong, it is gibberish

  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns
    - But an encoding can be shared by many languages
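What the gibberish looks like in practice, plus a crude fallback chain (Python assumed; real detectors use statistical analysis of byte patterns, as noted above):

```python
text = "café"
raw = text.encode("utf-8")                  # b'caf\xc3\xa9'

assert raw.decode("utf-8") == "café"        # right encoding
assert raw.decode("latin-1") == "cafÃ©"     # wrong encoding: mojibake

def guess_decode(raw: bytes) -> str:
    """Try strict encodings in order; crude compared to statistical detection."""
    for enc in ("ascii", "utf-8", "latin-1"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")
```

The fallback order matters: latin-1 accepts any byte sequence, so it must come last.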

Review and Conclusions 13

Speech and Language Technology

- The need for speech representation

- Storing sound

- Transforming Speech

  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): text to sounds

- Speech technology — the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab    WER (%)   WER (%), adapted
Digits                      11       0.4        0.2
Dialogue (travel)       21,000      10.9         —
Dictation (WSJ)          5,000       3.9        3.0
Dictation (WSJ)         20,000      10.0        8.6
Dialogue (noisy, army)   3,000      42.2       31.0
Phone Conversations      4,000      41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect / Foreign Accent
  - Individual Differences: physical differences, language differences (idiolect)

- Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

- Sentence Segmentation
- Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
- Word Segmentation

  - 森山前日銀総裁 → Moriyama zen Nichigin sousai ("Moriyama, the former Bank of Japan governor")
    ✗ 森山前日銀総裁 → Moriyama zennichi gin sousai

2. Speech Synthesis

- Find the pronunciation
- Generate sounds
- Add intonation

Review and Conclusions 17

Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation

- Formant Synthesis: bypass modeling of human articulation and model the acoustics directly

- Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

- Excitement: fast, very high pitch, loud

- Hot anger: fast, high pitch, strong falling accent, loud

- Fear: jitter

- Sarcasm: prolonged accent, late peak

- Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading ≫ Speaking; Hearing ≫ Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

- Large-scale discourse studies still to be done

- Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                       Text-like
time bound                        space bound
spontaneous                       contrived
face-to-face                      visually decontextualized
loosely structured                elaborately structured
socially interactive (comments)   factually communicative
immediately revisable             repeatedly revisable
prosodically rich                 graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  - lusers (users, as seen by Systems Administrators)
  - suits

- Need to be in-group

cow orker = coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting, and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes
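What such systems track is the difference between revisions; a sketch with Python's difflib (the file name and revision labels are invented):

```python
import difflib

v1 = ["Language, Technology and the Internet\n",
      "Lecture 12\n"]
v2 = ["Language, Technology and the Internet\n",
      "Review and Conclusions\n",        # the change between revisions
      "Lecture 12\n"]

diff = list(difflib.unified_diff(v1, v2,
                                 fromfile="slides.txt@r1",
                                 tofile="slides.txt@r2"))
```

Storing diffs like this is what lets a version control system attribute each changed line to the revision (and author) that introduced it.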

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge;
  people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

- The structure of markup: visual vs. logical
  WYSIWYG, WYSIAYG, WYSIWYM

- The structure of the Web — hypertext:
  pages linked to pages

- The future of the Web

- Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages, which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow drawing conclusions about the data
  ✗ Google is missing functionality that linguists and lexicographers would like to have

- Web Sample: use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ✗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than):
  251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios
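The ratio above is the whole calculation (figures are the slide's 2012-04-04 counts):

```python
different_from = 251_000_000   # ghits for "different from"
different_than = 71_500_000    # ghits for "different than"

ratio = different_from / different_than   # ≈ 3.51, reported as 3.5:1
```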

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine them to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
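Steps 1–2 amount to sampling word combinations as queries; a sketch (the seed words and counts here are illustrative, not the pipeline's actual lists):

```python
import itertools
import random

seeds = ["language", "internet", "corpus", "speech", "writing", "technology"]

def make_queries(seeds, words_per_query=3, n_queries=5, rng_seed=0):
    """Sample distinct multi-word combinations to send to a search engine."""
    rng = random.Random(rng_seed)             # fixed seed: reproducible sample
    combos = list(itertools.combinations(seeds, words_per_query))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

queries = make_queries(seeds)
```

At full scale (500 seeds, 6,000 queries) the same idea yields enough distinct queries to cover varied topics.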

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    - Fast and convenient
    - Huge amounts of data
    ✗ unreliable counts

  - Web sample
    - Control over the sample
    - Some setup costs (semi-automated)
    ✗ Less data

- Richer data than a compiled corpus

✗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations — Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    - Diacritics
    - Character n-grams
    - Stop words

  - Similarity-based categorisation and classification
    - Character n-gram rankings

  - Machine Learning based methods

  - Context (e.g. under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text
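A crude character n-gram identifier as a sketch (the training sentences are toy examples; production systems use rank-order statistics over large profiles, per the methods above):

```python
from collections import Counter

def char_ngrams(text, n=3):
    text = f"  {text.lower()}  "              # pad so word edges form n-grams
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap(a, b):
    """Shared n-gram mass; a stand-in for fancier profile distances."""
    return sum(min(a[g], b[g]) for g in a)

profiles = {
    "en": char_ngrams("the cat sat on the mat with the other cats"),
    "de": char_ngrams("die katze sitzt mit den anderen katzen auf der matte"),
}

def identify(snippet):
    grams = char_ngrams(snippet)
    return max(profiles, key=lambda lang: overlap(grams, profiles[lang]))
```

With realistic profiles the same scheme degrades exactly where the slide says: short snippets, closely related languages, and mixed-language text.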

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel
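One of these steps as a sketch: mapping the money variants above to a single integer (the regex and suffix table are invented for the example; real normalizers cover far more patterns):

```python
import re

SUFFIX = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

def normalize_amount(text):
    """Map '$700K' / '$700,000' / '0.7m' style strings to a plain integer."""
    m = re.fullmatch(r"\$?([\d.,]+)\s*([kmb])?", text.strip().lower())
    if not m:
        return None                       # not a recognized amount
    value = float(m.group(1).replace(",", "")) * SUFFIX.get(m.group(2), 1)
    return int(round(value))
```

After normalization, the three variant spellings count as the same value in frequency counts.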

Review and Conclusions 54


atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 8: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Language and Technology

ã The two things that separate humans from animals (Sproat 2010, §1.1)

  â Language
    ∗ large vocabulary (10,000+)
    ∗ complicated syntax (no upper length limit, recursion, embedding)

  â Technology
    ∗ widespread tool use
    ∗ widespread tool manufacture

ã Speech – the start of language

ã Writing – the first great intersection

Review and Conclusions 7

Language and the Internet

ã New forms of communication

  â Neither speech nor text
  â Massively interactive

ã Extremely rapid change

ã A first-hand narrative (I was online before the internet :-)

  â … but I am probably behind you all now

Review and Conclusions 8

New forms

ã Email (from PC, phone, other)

ã Chat, Usenet

ã Virtual Worlds

ã WWW

ã Blogs (overlap)

ã Facebook, LinkedIn

ã Wikis

ã Twitter

Review and Conclusions 9

Representing Language

ã Writing Systems

ã Encodings

ã Speech

ã Bandwidth

Review and Conclusions 10

What is represented

ã Phonemes: maɪ dɔg laɪks ævəkɑdoz (45)

  â Not a simple correspondence between a writing system and the sounds
  â Some logographs (3, $)
  â Even when a new alphabet is designed, pronunciation changes

ã Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

ã Morphemes: maɪ (me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)

ã Words: my dog likes avocados (200,000+)

ã Concepts: speaker poss, canine, fond, avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

ã Alphabetic (Latin)

  â one symbol for a consonant or vowel
  â typically 20–30 base symbols (1 byte)

ã Syllabic (Hiragana)

  â one symbol for each syllable (consonant+vowel)
  â typically 50–100 base symbols (1–2 bytes)

ã Logographic (Hanzi)

  â pictographs, ideographs, sound–meaning combinations
  â typically 100,000+ symbols (2–3 bytes)

Review and Conclusions 12

Computational Encoding

ã Need to map characters to bits

ã More characters require more space

  â English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); everything: UTF-8 (8–32 bits)

ã Moving towards Unicode for everything

ã If you get the encoding wrong, it is gibberish

  â Web pages should state their encoding
  â Sometimes they are wrong
  â Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
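The last sub-point can be made concrete with a minimal sketch (a real detector such as chardet scores byte-pattern statistics across many encodings; this toy version only separates valid UTF-8 from Latin-1):

```python
def guess_encoding(data: bytes) -> str:
    # Toy heuristic: UTF-8 has strict byte-sequence rules, so a successful
    # strict decode is strong evidence; otherwise fall back to Latin-1,
    # which accepts any byte sequence at all.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "iso-8859-1"

print(guess_encoding("café".encode("utf-8")))       # utf-8
print(guess_encoding("café".encode("iso-8859-1")))  # iso-8859-1
```

Note that a correct guess of the encoding still says nothing about the language: the Latin-1 bytes above could equally be French, German, or Spanish.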

Review and Conclusions 13

Speech and Language Technology

ã The need for speech representation

ã Storing sound

ã Transforming Speech

  â Automatic Speech Recognition (ASR): sounds to text
  â Text-to-Speech Synthesis (TTS): text to sounds

ã Speech technology – the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab   WER (%)   WER (%), adapted
Digits                      11     0.4        0.2
Dialogue (travel)       21,000    10.9        –
Dictation (WSJ)          5,000     3.9        3.0
Dictation (WSJ)         20,000    10.0        8.6
Dialogue (noisy, army)   3,000    42.2       31.0
Phone Conversations      4,000    41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

ã Pronunciation depends on context

  â The same word will be pronounced differently in different sentences

ã Speaker variability

  â Gender
  â Dialect / foreign accent
  â Individual differences: physical differences, language differences (idiolect)

ã Many, many rare events

  â 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

  ã Sentence segmentation
  ã Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
  ã Word segmentation

    â 森山前日銀総裁 Moriyama zen Nichigin sousai
    ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai

2. Speech Synthesis

  ã Find the pronunciation
  ã Generate sounds
  ã Add intonation
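A minimal, hypothetical splitter shows why the "Nanyang Dr." example is hard: an abbreviation list alone cannot tell Dr. = Doctor from Dr. = Drive:

```python
import re

ABBREV = {"dr", "mr", "mrs", "prof"}  # hypothetical abbreviation list

def split_sentences(text: str) -> list:
    out, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        words = text[start:m.start()].split()
        last_word = words[-1].rstrip(".").lower() if words else ""
        if last_word in ABBREV:
            continue  # assume the period marks an abbreviation, not a boundary
        out.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        out.append(text[start:].strip())
    return out

print(split_sentences("I came. I saw. I conquered."))
# The ambiguous case: "Nanyang Dr." is a road name, but the list says
# abbreviation, so the two sentences are wrongly joined into one.
print(split_sentences("Dr. Smith lives on Nanyang Dr. He is here."))
```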

Review and Conclusions 17

Speech Synthesis

ã Articulatory Synthesis: attempt to model human articulation

ã Formant Synthesis: bypass modeling of human articulation and model acoustics directly

ã Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

ã Excitement: fast, very high pitch, loud

ã Hot anger: fast, high pitch, strong falling accent, loud

ã Fear: jitter

ã Sarcasm: prolonged accent, late peak

ã Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).

– Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

ã Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

ã All share some characteristics of speech and text

ã Usage norms not fixed

ã Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

ã Large-scale discourse studies still to be done

ã Some genuinely new things

  â time-lagged multi-person conversation
  â raw, un-edited text
  â extreme multi-authorship

Review and Conclusions 23

Email

Speech-like               Text-like
time bound                space bound (deletable)
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like               Text-like
time bound                space bound
spontaneous               contrived
face-to-face              visually decontextualized
loosely structured        elaborately structured
socially interactive      factually communicative
immediately revisable     repeatedly revisable
prosodically rich         graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

ã Inspired by shared background

  â <rant>I can't stand this</rant>
  â I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  â lusers (users, as seen by Systems Administrators)
  â suits

ã Need to be in-group

  cow orker: Coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny
  clues if you smeared your body with clue musk and did the clue mating dance."
  – Edward Flaherty (talk.bizarre)

Review and Conclusions 28

ã Gricean Principles

  â Posting (top/middle/bottom quoting and trimming)

ã Inspired by medium limitations

  â GREAT, great, great
  â :-) (^_^) (o)
  â brb, RTFM, IMHO
  â lower case
  â lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

ã Version Control Systems

ã Wikipedia

ã Licensing and Ownership

30

Version Control Systems

ã Versioning file systems

  â every time a file is opened, a new copy is stored

ã CVS, Subversion, Git

  â changes to a collection of files are tracked
  â simultaneous changes are merged

ã Revision Tracking

  â Revisions are stored within a file

ã Authorship in shared writing: explicit responsibility for changes
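Revision tracking can be sketched with the standard library: a unified diff is the kind of change record a version control system stores, merges, and attributes to an author (the filenames here are placeholders):

```python
import difflib

v1 = "The cat sat.\nIt was happy.\n".splitlines(keepends=True)
v2 = "The cat sat down.\nIt was happy.\n".splitlines(keepends=True)

# A unified diff records only what changed between two revisions,
# so the full history can be reconstructed from the deltas.
diff = list(difflib.unified_diff(v1, v2, fromfile="v1", tofile="v2"))
print("".join(diff))
```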

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge; people like to do this

ã Most edits are done by insiders

ã Most content is added by outsiders

ã Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  ã Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

ã The structure of markup: visual vs logical (WYSIWYG, WYSIAYG, WYSIWYM)

ã The structure of the Web – hypertext: pages linked to pages

ã The future of the Web

ã Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

  â What you see is what you get (WYSIWYG)
  â Equivalent of printers' markup
  â Shows what things look like

ã Logical Markup (Structural)

  â Shows the structure and meaning
  â Can be mapped to visual markup
  â Less flexible than visual markup
  â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  â Population and exact hit counts are unknown → no statistics possible
  â Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

ã Web Sample: use a search engine to download data from the net and build a corpus from it

  â known size and exact hit counts → statistics possible
  â people can draw conclusions over the included text types
  â (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:

  â lots of repetitions of the same text (not representative)
  â very limited query precision (no upper/lower case, no punctuation)
  â only estimated counts, often hard to reproduce exactly
  â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios
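The different from vs different than comparison works out as:

```python
# Ghit counts from the slide (accessed 2012-04-04); the absolute numbers
# drift over time, but the ratio is the stable, reportable quantity.
different_from = 251_000_000
different_than = 71_500_000
ratio = different_from / different_than
print(f"{ratio:.1f}:1")  # 3.5:1
```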

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  â gather documents for different topics (balance)
  â exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
  â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.), WaCky! Working papers on the Web as Corpus. Bologna, 2006
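Step 2 of the outline can be sketched as follows; the six seed words here are hypothetical stand-ins for the 500 used in the real pipeline:

```python
import itertools

seeds = ["language", "internet", "corpus", "speech", "writing", "technology"]

# Combine seed words into 3-word search-engine queries (step 2 of the
# outline); with 500 real seeds, sampling such tuples easily yields the
# ~6,000 queries the pipeline needs.
queries = [" ".join(combo) for combo in itertools.combinations(seeds, 3)]
print(len(queries), "queries, e.g.", repr(queries[0]))
```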

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

ã The web can be used as a corpus

  â Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  â Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

  â Keywords and Categories
  â Rankings
  â Structural Markup

ã Implicit Meta-data

  â Links and Citations
  â Tags
  â Tables
  â File Names
  â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

  â When they are accurate, they are very good
  â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

  â as they are unintended, they tend to be noisy
  â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations – bracketed glosses, cross-lingual disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  â Linguistically grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words

  â Similarity-based categorisation and classification
    ∗ Character n-gram rankings

  â Machine-learning based methods

  â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
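A character-bigram sketch of language identification; the two training strings are tiny hypothetical stand-ins for real corpora, which is exactly why short snippets and closely related languages are hard:

```python
from collections import Counter

def profile(text, n=2):
    # Pad with spaces so word-initial and word-final characters count too.
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Hypothetical training samples; real systems train on large corpora.
train = {"en": "the cat sat on the mat with the hat",
         "de": "der hund und die katze und das haus"}
models = {lang: profile(t) for lang, t in train.items()}

def identify(snippet):
    p = profile(snippet)
    # Score = shared bigram mass between snippet and each language model.
    return max(models, key=lambda l: sum(min(p[g], models[l][g]) for g in p))

print(identify("the rat"))      # en
print(identify("der die das"))  # de
```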

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number normalization: $700K, $700,000, 0.7 million dollars, …

ã Date normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel
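A toy normalizer in this spirit (the suffix list is a hypothetical stand-in for a real stemmer such as Porter's):

```python
import re

def normalize(token: str) -> str:
    token = token.lower()
    token = re.sub(r"[$,]", "", token)       # crude number/currency cleanup
    for suffix in ("ers", "er", "es", "s"):  # toy suffix stripping, not Porter
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Stemming conflates "produces" and "producer" to the same (non-word) stem;
# "produce" itself is left alone.
print([normalize(t) for t in ["produces", "producer", "Produce"]])
```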

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

  â provides a common data representation framework
  â makes it possible to integrate multiple sources
  â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6/color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

  â is a markup language, much like HTML
  â was designed to carry data, not to display data
  â tags are not predefined: you must define your own tags
  â is designed to be self-descriptive
  â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

  â Uniform Resource Name: urn:isbn:1575864606
  â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework (RDF)

  â Triples of <subject, predicate, object>
  â Each element is a URI
  â RDF is written in well-defined XML
  â You can say anything about anything

ã You can build relations in ontologies (OWL)

  â Then reason over them, search them, …
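The triple model can be sketched with plain tuples and hypothetical example URIs (real RDF would serialize these in XML or Turtle, and a triple store would index them):

```python
# RDF-style <subject, predicate, object> triples; all URIs here are
# hypothetical examples, not real identifiers.
triples = [
    ("http://example.org/fcbond", "http://xmlns.com/foaf/0.1/name",
     "Francis Bond"),
    ("http://example.org/hg2052", "http://purl.org/dc/terms/creator",
     "http://example.org/fcbond"),
]

def objects(s, p):
    """All objects o such that the triple (s, p, o) is asserted."""
    return [o for (s2, p2, o) in triples if s2 == s and p2 == p]

# Chaining two lookups: who created hg2052, and what is their name?
creator = objects("http://example.org/hg2052",
                  "http://purl.org/dc/terms/creator")[0]
name = objects(creator, "http://xmlns.com/foaf/0.1/name")[0]
print(name)  # Francis Bond
```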

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

  â NLP can infer structure
  â NLP makes the Semantic Web feasible
  â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

  â Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  â Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

  â Total number of citations (pretty useful)
  â Total number of citations minus self-citations
  â Total number of (citations / number of authors)
  â Average (citation impact factor / number of authors)

ã Problems:

  â Not all citations are equal: citations by 'good' papers are better
  â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the paper
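The first three scores can be computed over hypothetical toy data (author names and citation lists are invented for illustration):

```python
# Each paper: its author set, and the author set of each paper citing it.
papers = [
    {"authors": {"bond"}, "cited_by": [{"smith"}, {"bond"}, {"lee", "tan"}]},
    {"authors": {"bond", "tan"}, "cited_by": [{"smith"}]},
]

# Total number of citations.
total = sum(len(p["cited_by"]) for p in papers)
# Citations minus self-citations (any overlap of citing and cited authors).
minus_self = sum(1 for p in papers for c in p["cited_by"]
                 if not (c & p["authors"]))
# Citations divided by number of authors, summed over papers.
per_author = sum(len(p["cited_by"]) / len(p["authors"]) for p in papers)

print(total, minus_self, per_author)  # 4 3 3.5
```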

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

  â Break research into small chunks to increase the number of citations
  â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

  â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

  â A high inlink count does not necessarily mean high quality …
  â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

  â An article's vote is weighted according to its citation impact
  â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

  â Start at a random page
  â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

  â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
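The teleporting random walk above can be sketched as power iteration over a hypothetical four-page graph (page 3 is a dead end):

```python
N = 4
out = {0: [1, 2], 1: [2], 2: [0], 3: []}  # page 3 is a dead end
t = 0.10                                  # teleportation rate

rank = [1.0 / N] * N                      # start from a uniform distribution
for _ in range(100):
    new = [0.0] * N
    for page, r in enumerate(rank):
        if not out[page]:                 # dead end: jump anywhere, 1/N each
            for q in range(N):
                new[q] += r / N
        else:                             # teleport with prob t, else follow a link
            for q in range(N):
                new[q] += r * t / N
            for q in out[page]:
                new[q] += r * (1 - t) / len(out[page])
    rank = new

print([round(r, 3) for r in rank])        # long-term visit rates; they sum to 1
```

The dead-end page ends up with the lowest rank: it receives only teleport mass, never an endorsement via a hyperlink.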

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  â Google does not index the target of a link marked nofollow
  â Yahoo does not include the link in its ranking
  â …

74

Current Status

ã There is a continuous battle between:

  â Search companies, who want to get the most useful page to the user
  â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

  â Metadata about the object is stored with the DOI name
  â The metadata includes a location, such as a URL
  â The DOI for a document is permanent; the metadata may change
  â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  â DOI: 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

  â Passively, for information access
  â Actively, for collaboration

ã The internet is interesting in and of itself

  â As a source of data about existing language
  â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics – (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 9: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Language and the Internet

atilde New forms of communication

acirc Neither speech nor textacirc Massively interactive

atilde Extremely rapid change

atilde A first hand narrative (I was online before the internet -)

acirc but I am probably behind you all now

Review and Conclusions 8

New forms

atilde Email (from PC phone other)

atilde Chat Usenet

atilde Virtual Worlds

atilde WWW

atilde Blogs (overlap)

atilde Facebook LinkedIn

atilde Wikis

atilde Twitter

Review and Conclusions 9

Representing Languageatilde Writing Systems

atilde Encodings

atilde Speech

atilde Bandwidth

Review and Conclusions 10

What is represented

atilde Phonemes ma d laks aeligvkadoz (45)

acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes

atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)

atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)

atilde Words my dog likes avocados (200000+)

atilde Concepts speaker poss canine fond avacado+PL (400000+)

Review and Conclusions 11

Three Major Writing Systems

atilde Alphabetic (Latin)

acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)

atilde Syllabic (Hiragana)

acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)

atilde Logographic (Hanzi)

acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab   WER (%)   WER (%) adapted
Digits                      11     0.4        0.2
Dialogue (travel)       21,000    10.9        n/a
Dictation (WSJ)          5,000     3.9        3.0
Dictation (WSJ)         20,000    10.0        8.6
Dialogue (noisy, army)   3,000    42.2       31.0
Phone Conversations      4,000    41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

• Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

• Speaker variability

  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)

• Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur
    only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

   • Sentence segmentation
   • Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
   • Word segmentation

     - 森山前日銀総裁 Moriyama, zen Nichigin sousai
       (Moriyama, the former Bank of Japan governor)
       ⊗ mis-segmented: Moriyama zennichi gin sousai

2. Speech Synthesis

   • Find the pronunciation
   • Generate sounds
   • Add intonation

Review and Conclusions 17

Speech Synthesis

• Articulatory Synthesis: attempt to model human articulation

• Formant Synthesis: bypass modeling of human articulation and model acoustics
  directly

• Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sad: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality'
but natural-sounding prosody (intonation and duration)."

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English, computer science students, various studies)

Reading   300    200 (proof reading)
Writing    31     21 (composing)
Speaking  150
Hearing   150    210 (speeded up)
Typing     33     19 (composing)

• Reading >> Speaking, Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits

• Need to be in-group

  cow orker: coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny
  clues if you smeared your body with clue musk and did the clue mating dance."
  Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

  - every time a file is opened, a new copy is stored

• CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

• Revision tracking

  - revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every
  single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge; people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

• The structure of markup: visual vs logical
  (WYSIWYG, WYSIAYG, WYSIWYM)

• The structure of the Web: hypertext, pages linked to pages

• The future of the Web

• Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
or accessed only through a form

Unlinked content: pages which are not linked to by other pages, so link-following
crawlers cannot reach them

Private Web: sites that require registration and login (e.g. edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g.
ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g.
using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript,
as well as content dynamically downloaded from Web servers via Flash or Ajax
solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video)
files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

• Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow one to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like
    to have

• Web Sample: use a search engine to download data from the net and build a
  corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions about the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies
  (Keller 2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in
  the early 2000s)

  - it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, "Ghits per billion documents", is good if you want a more stable number
  (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
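The ratio-of-two-phenomena recipe is simple enough to sketch; the counts below are the slide's 2012 figures, and ghits_per_billion assumes you know the index size, which Google no longer publishes:

```python
def hit_ratio(hits_a: int, hits_b: int) -> float:
    """Unitless ratio of two web counts (e.g. competing phrases)."""
    return hits_a / hits_b

def ghits_per_billion(hits: int, index_size: int) -> float:
    """Liberman-style GPB: hits normalised per billion indexed documents."""
    return hits / index_size * 1_000_000_000

# "different from" vs "different than", counts accessed 2012-04-04
ratio = hit_ratio(251_000_000, 71_500_000)   # about 3.5:1
```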

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated
  corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here:
    taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long,
    word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web
as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
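Steps 1-2 of the outline above can be sketched as follows; the seed words and query size are toy stand-ins (Sharoff used on the order of 500 seeds):

```python
import itertools
import random

# Toy seed list; a real run would use ~500 frequent words per language.
seeds = ["house", "water", "small", "world", "night", "music"]
random.seed(0)  # reproducible toy example

def make_queries(seeds, words_per_query=4, n_queries=10):
    """Form search-engine queries from random seed-word combinations."""
    combos = list(itertools.combinations(seeds, words_per_query))
    random.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

queries = make_queries(seeds)
```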

Internet Corpora Summary

• The web can be used as a corpus

  - Direct access
    ∗ fast and convenient
    ∗ huge amounts of data
    ⊗ unreliable counts

  - Web sample
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ⊗ less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit meta-data

  - Keywords and categories
  - Rankings
  - Structural markup

• Implicit meta-data

  - Links and citations
  - Tags
  - Tables
  - File names
  - Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and tags

• Rankings

• Links and citations

• Structural markup

50

Implicit Metadata

• You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations: bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it
  (metadata may be unreliable)

  - Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  - Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  - Machine-learning-based methods
  - Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text
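The character-n-gram-ranking idea can be sketched in a few lines (Cavnar & Trenkle-style rank profiles); the two "training corpora" here are single made-up sentences, so this only illustrates the mechanics:

```python
from collections import Counter

def profile(text, n=2, top=50):
    """Ranked list of the most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(p1, p2):
    """Sum of rank differences; unseen n-grams get a maximum penalty."""
    return sum(abs(i - p2.index(g)) if g in p2 else len(p2)
               for i, g in enumerate(p1))

training = {  # toy single-sentence "corpora"
    "en": "the quick brown fox jumps over the lazy dog and then the cat",
    "nl": "de snelle bruine vos springt over de luie hond en dan de kat",
}
profiles = {lang: profile(text) for lang, text in training.items()}

def identify(text):
    """Pick the language whose profile is closest to the text's profile."""
    return min(profiles, key=lambda lang: distance(profile(text), profiles[lang]))
```

With tiny training data this fails exactly where the slide says: short snippets, similar languages, and mixed text.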

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel
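A couple of the steps above, sketched with toy rules (this is not the Porter stemmer, just a suffix-stripping stand-in, and the number pattern only handles $NK amounts):

```python
import re

def toy_stem(word: str) -> str:
    """Strip a few English suffixes, Porter-style in spirit only."""
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize_number(text: str) -> str:
    """Rewrite $700K-style amounts as plain dollar figures."""
    return re.sub(r"\$(\d+)K\b", lambda m: "$" + m.group(1) + ",000", text)
```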

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

• You can build relations into ontologies (OWL)

  - Then reason over them, search them, …
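The triple model is easy to sketch in code; the subject and object URIs below are invented examples (only the Dublin Core predicate names are real):

```python
# RDF-style triples as <subject, predicate, object> tuples of URIs/values,
# plus a naive pattern query over them.
triples = {
    ("http://example.org/HG2052",
     "http://purl.org/dc/terms/creator",
     "http://example.org/FrancisBond"),
    ("http://example.org/HG2052",
     "http://purl.org/dc/terms/title",
     "Language, Technology and the Internet"),
}

def query(subject=None, predicate=None, obj=None):
    """Return triples matching the given (possibly None) pattern."""
    return [(s, p, o) for (s, p, o) in triples
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)]
```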

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  - Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  - Context-based: citation analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact
  of the published work of a scientist, scholar or research group

• Some scores are:

  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total number of (citations / number of authors)
  - Average (citation impact factor / number of authors)

• Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
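The first few scores above, computed over made-up paper records:

```python
# Toy records: (number of authors, citations received, self-citations).
papers = [
    {"authors": 2, "citations": 40, "self_citations": 5},
    {"authors": 4, "citations": 12, "self_citations": 2},
]

def total_citations(papers):
    return sum(p["citations"] for p in papers)

def total_minus_self(papers):
    return sum(p["citations"] - p["self_citations"] for p in papers)

def author_normalized(papers):
    """Each paper's citations shared out among its authors."""
    return sum(p["citations"] / p["authors"] for p in papers)
```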

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve.

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC (Strongly Connected Core): you can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

  <a href="http://path.to.there.page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the
     creator of page A.

  2. The (extended) anchor text pointing to page B is a good description of
     page B. This is not always the case; for instance, most corporate websites
     have a pointer from every page to a page containing a copyright notice.
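Harvesting anchor text is straightforward with the standard library; a minimal sketch (the URL is the slide's made-up example):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorCollector()
parser.feed('<a href="http://path.to.there.page/HG803">HG803 Course Page</a>')
```

In link analysis, the collected text is attributed to the link's *target*, giving each page a description written by other people.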

Review and Conclusions 67

PageRank as Citation Analysis

• Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote (not very accurate)

• On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random Walk

• Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page,
    equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each
  with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with
    probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
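The teleporting walk above converges to the PageRank vector; here is a sketch by power iteration over an invented four-page graph, with the 10% teleportation rate and the uniform jump from dead ends:

```python
links = {  # page -> pages it links to (toy graph)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],        # dead end
}

def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration: repeatedly redistribute probability mass."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump uniformly
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport share
                    new[q] += rank[p] * teleport / n
                for q in out:                # follow a random link
                    new[q] += rank[p] * (1 - teleport) / len(out)
        rank = new
    return rank

ranks = pagerank(links)
```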

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link spam: adding links between pages for reasons other than merit. Link spam
  takes advantage of link-based ranking algorithms, which give websites higher
  rankings the more other highly-ranked websites link to them. Examples include
  adding links within blogs.

• Link farms: creating tightly-knit communities of pages referencing each other,
  also known humorously as mutual admiration societies.

• Scraper sites: "scrape" search-engine results pages or other sources of content
  and create "content" for a website. The specific presentation of content on
  these sites is unique, but is merely an amalgamation of content taken from
  other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user
  editing, such as wikis, blogs and guestbooks. Agents can be written that
  automatically and randomly select a user-edited web page, such as a Wikipedia
  article, and add spamming links.

  The nofollow link: a value that can be assigned to the rel attribute of an
  HTML hyperlink, to instruct some search engines that the hyperlink should not
  influence the link target's ranking in the search engine's index.

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74

Current Status

• There is a continuous battle between:

  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a persistent identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies,
  coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some
  4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
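Resolution itself is just URL prefixing: the public resolver at doi.org forwards a DOI to whatever location is currently stored in its metadata (a sketch):

```python
def doi_to_url(doi: str) -> str:
    """Build a resolver URL; doi.org redirects to the current location."""
    return "https://doi.org/" + doi

url = doi_to_url("10.1007/s10579-008-9062-z")
```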

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

• The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language Technology and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An
  Introduction to Natural Language Processing, Computational Linguistics and
  Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural
  Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to
  Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python;
  introduces both programming and linguistics

• HG3051 Corpus Linguistics (pre-requisite HG2051, waived): an introduction
  to the fast-growing field of corpus linguistics

Review and Conclusions 81


New forms

• Email (from PC, phone, other)

• Chat, Usenet

• Virtual worlds

• WWW

• Blogs (overlap)

• Facebook, LinkedIn

• Wikis

• Twitter

Review and Conclusions 9

Representing Language

• Writing systems

• Encodings

• Speech

• Bandwidth

Review and Conclusions 10

What is represented

atilde Phonemes ma d laks aeligvkadoz (45)

acirc Not a simple correspondence between a writing system and the soundsacirc Some logographs (3 $)acirc Even when a new alphabet is designed pronunciation changes

atilde Syllables (ma) (d) (laks) (aelig)(v)(ka)(doz) (10000+)

atilde Morphemes mame + rsquos d lak + s aeligvkado + z (100000+)

atilde Words my dog likes avocados (200000+)

atilde Concepts speaker poss canine fond avacado+PL (400000+)

Review and Conclusions 11

Three Major Writing Systems

atilde Alphabetic (Latin)

acirc one symbol for consonant or vowelacirc Typically 20-30 base symbols (1 byte)

atilde Syllabic (Hiragana)

acirc one symbol for each syllable (consonant+vowel)acirc Typically 50-100 base symbols (1-2 bytes)

atilde Logographic (Hanzi)

acirc pictographs ideographs sounds-meaning combinationsacirc Typically 100000+ symbols (2-3 bytes)

Review and Conclusions 12

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)(English computer science students various studies)

Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)

atilde Reading gtgt SpeakingHearing gtgt Typing

rArr Speech for inputrArr Text for output

Review and Conclusions 20

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   • Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia: Five pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia: Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWYM)

• The structure of the Web — hypertext: pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

• Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: Search Engine as Query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
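The arithmetic behind these units is simple enough to sketch directly. The counts below are the ones quoted above for different from vs different than; the index size used for GPB is an invented illustration, since Google no longer publishes one:

```python
# Comparing two web counts as a unitless ratio, and as GPB.
# The 40-billion-document index size is made up for illustration.

def ratio(count_a, count_b):
    """Unitless ratio between two web counts (a : 1)."""
    return count_a / count_b

def gpb(count, index_size):
    """Ghits per billion documents, given an (estimated) index size."""
    return count * 1_000_000_000 / index_size

different_from = 251_000_000  # ghits, accessed 2012-04-04
different_than = 71_500_000   # ghits, accessed 2012-04-04

print(f"{ratio(different_from, different_than):.1f} : 1")   # 3.5 : 1
print(gpb(different_from, 40_000_000_000))
```

The ratio is stable across index growth in a way the raw counts are not, which is exactly why the slide recommends it.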

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)
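Step 2 of this pipeline can be sketched as follows. The seed list, query length, and query count are toy stand-ins for the 500 seeds and 6,000 queries above:

```python
# Sketch: combining seed words into distinct search-engine queries.
# Real pipelines (e.g. Sharoff 2006) use hundreds of seeds per language.
import random

def make_queries(seeds, words_per_query=4, n_queries=10, seed=0):
    """Randomly combine seed words into distinct queries."""
    rng = random.Random(seed)      # fixed seed for reproducibility
    queries = set()
    while len(queries) < n_queries:
        words = rng.sample(seeds, words_per_query)
        queries.add(" ".join(sorted(words)))
    return sorted(queries)

seeds = ["house", "water", "friend", "morning",
         "letter", "music", "road", "window"]
for q in make_queries(seeds, n_queries=5):
    print(q)
```

Mid-frequency content words are usually chosen as seeds, so that each query pulls back ordinary running text rather than lists or spam.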

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

  - Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts

  - Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

• Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations — bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words

  - Similarity-based categorisation and classification
    ∗ Character n-gram rankings

  - Machine-learning based methods

  - Context (under .jp or .kr)

• Hard to do for short text snippets, similar languages, mixed text
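The similarity-based approach can be sketched with character n-gram frequency profiles, in the style of Cavnar and Trenkle's rank-order method. The two one-sentence "training texts" here are far too small for real use and exist only to make the example self-contained:

```python
# Minimal sketch of character n-gram language identification
# (rank-order profiles, after Cavnar & Trenkle 1994).
from collections import Counter

def profile(text, n=3, size=300):
    """Ranked list of the most frequent character n-grams."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    """Distance: sum of rank differences; missing n-grams get max penalty."""
    return sum(
        abs(i - lang_profile.index(g)) if g in lang_profile else len(lang_profile)
        for i, g in enumerate(doc_profile))

def identify(text, lang_profiles):
    doc = profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang]))

langs = {
    "en": profile("the cat sat on the mat and the dog slept by the door"),
    "nl": profile("de kat zat op de mat en de hond sliep bij de deur"),
}
print(identify("the dog and the cat", langs))
```

With realistic training data (thousands of words per language), this simple method is accurate for whole pages but, as the slide notes, degrades on short snippets and closely related languages.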

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
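A few of these steps can be combined into a toy pipeline. The stop list, the single number-rewriting rule, and the three-suffix "stemmer" below are deliberately minimal illustrations, not a real stemmer such as Porter's:

```python
# Toy normalization pipeline: number rewriting, stop-word stripping,
# and crude suffix stemming. All rules here are illustrative only.
import re

STOP_WORDS = {"the", "a", "an", "of"}

def normalize_number(token):
    """Rewrite e.g. '$700K' as '$700000'."""
    m = re.fullmatch(r"\$(\d+)K", token)
    return f"${int(m.group(1)) * 1000}" if m else token

def stem(token):
    """Crude stemming: strip a few common suffixes (producer -> produc)."""
    for suffix in ("es", "er", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def normalize(text):
    tokens = [normalize_number(t) for t in text.split()]
    return [stem(t.lower()) for t in tokens if t.lower() not in STOP_WORDS]

print(normalize("The producer produces $700K of avocados"))
```

Note how stemming conflates producer and produces to the same non-word produc: good for recall in retrieval, bad if you need real lemmas.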

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

• Can be validated for syntactic well-formedness
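Well-formedness checking needs nothing beyond the standard library: a document either parses or it raises an error. The sample documents and tag names are made up for illustration:

```python
# Checking XML syntactic well-formedness with the standard library.
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses as XML; False on any syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<note><to>Kim</to><body>Hello</body></note>"
bad = "<note><to>Kim</body></note>"   # mismatched tags

print(is_well_formed(good), is_well_formed(bad))   # True False
```

Well-formedness (balanced, properly nested tags) is a purely syntactic check; validity against a schema or DTD is a separate, stronger test.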

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

• You can build relations into ontologies (OWL)

  - Then reason over them, search them, …
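The triple model is easy to sketch in a few lines. The URIs below use an invented ex: prefix purely for illustration; real RDF stores are queried with SPARQL rather than Python pattern matching, but the idea is the same:

```python
# Sketch of RDF-style triples and a wildcard pattern query over them.
# All URIs here are invented examples, not real vocabularies.
triples = {
    ("ex:fcbond", "ex:teaches", "ex:HG2052"),
    ("ex:HG2052", "ex:title", "Language, Technology and the Internet"),
    ("ex:fcbond", "ex:memberOf", "ex:NTU"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return {
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }

# Everything we know about ex:fcbond:
for t in sorted(query(subject="ex:fcbond")):
    print(t)
```

Because every statement is a uniform triple, new sources can be merged by simple set union, which is the "web of data" integration story in miniature.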

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  - Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total number of (citations / number of authors)
  - Average (citation impact factor / number of authors)

• Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
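The first three scores are straightforward to compute once you have a citation graph. The papers and authors below are made up for illustration:

```python
# Sketch of simple citation scores over a toy citation graph.
# Each (invented) paper records its authors and the papers it cites.
papers = {
    "p1": {"authors": ["ann", "bob"], "cites": ["p2", "p3"]},
    "p2": {"authors": ["ann"],        "cites": ["p3"]},
    "p3": {"authors": ["cho"],        "cites": []},
}

def citations(paper):
    """Papers that cite the given paper."""
    return [p for p, d in papers.items() if paper in d["cites"]]

def total_citations(paper):
    return len(citations(paper))

def citations_minus_self(paper):
    """Drop citations from papers sharing an author with the cited paper."""
    authors = set(papers[paper]["authors"])
    return sum(1 for p in citations(paper)
               if not authors & set(papers[p]["authors"]))

def per_author(paper):
    return total_citations(paper) / len(papers[paper]["authors"])

print(total_citations("p3"), citations_minus_self("p2"), per_author("p3"))
```

Note how p2's only citation vanishes once self-citations are excluded (it comes from a co-author), illustrating why the different scores can disagree.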

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve.

Review and Conclusions 65

Characteristics of the Web: the Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

  <a href="http://path.to.the.page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B

  This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.
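Collecting anchor text per link target needs only the standard-library HTML parser. The page snippet and URL are invented for illustration:

```python
# Sketch: collect anchor text per link target with html.parser.
from html.parser import HTMLParser
from collections import defaultdict

class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)   # target URL -> anchor texts
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors[self._href].append("".join(self._text).strip())
            self._href = None

page = '<p>See the <a href="/hg803">HG803 Course Page</a> for details.</p>'
collector = AnchorCollector()
collector.feed(page)
print(dict(collector.anchors))
```

A search engine aggregates such anchor texts from many pages, so page B ends up indexed under words B itself may never contain.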

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote — not very accurate

• On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
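The random walk with teleportation can be computed by power iteration: repeatedly redistribute each page's probability mass according to the rules above until the rates settle. The four-page link graph and the 10% teleport rate below are illustrative:

```python
# Power-iteration sketch of PageRank with teleportation, on a toy web.
def pagerank(links, teleport=0.10, iterations=100):
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}          # start uniformly
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                        # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                # teleport share
                    new[q] += teleport * rank[p] / n
                for q in out:                  # follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

The visit rates always sum to 1, page c (with two inlinks) ends up ranked above b, and the unlinked dead end d survives only on teleport traffic.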

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph: Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly-ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

• Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically, randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

  - Google does not index the target of a link marked nofollow
  - Yahoo! does not include the link in its ranking
  - …

74

Current Status

• There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

• The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics — (pre-requisite HG2051 waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Representing Language

• Writing Systems

• Encodings

• Speech

• Bandwidth

Review and Conclusions 10

What is represented

• Phonemes: maɪ dɔg laɪks ævəkɑdoz (≈45)

  - Not a simple correspondence between a writing system and the sounds
  - Some logographs (3, $)
  - Even when a new alphabet is designed, pronunciation changes

• Syllables: (maɪ) (dɔg) (laɪks) (æ)(və)(kɑ)(doz) (10,000+)

• Morphemes: maɪ (= me + 's), dɔg, laɪk + s, ævəkɑdo + z (100,000+)

• Words: my dog likes avocados (200,000+)

• Concepts: speaker-poss canine fond avocado+PL (400,000+)

Review and Conclusions 11

Three Major Writing Systems

• Alphabetic (Latin)

  - one symbol for a consonant or vowel
  - typically 20–30 base symbols (1 byte)

• Syllabic (Hiragana)

  - one symbol for each syllable (consonant+vowel)
  - typically 50–100 base symbols (1–2 bytes)

• Logographic (Hanzi)

  - pictographs, ideographs, sound–meaning combinations
  - typically 100,000+ symbols (2–3 bytes)
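The byte counts above refer to the dedicated legacy encodings for each script; in Unicode's UTF-8 the same spread is visible directly, though the exact figures differ (kana take three bytes rather than one or two):

```python
# How many bytes per base symbol each writing-system type needs in UTF-8.
for char in ("a", "か", "漢"):   # Latin letter, hiragana syllable, hanzi
    print(char, len(char.encode("utf-8")), "bytes")
```

This is why a Japanese or Chinese text is routinely two to three times larger, byte for byte, than an English text of the same character count.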

Review and Conclusions 12

Computational Encoding

• Need to map characters to bits

• More characters require more space

  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits);
    Japanese: EUC (16 bits); Everything: UTF-8 (8–32 bits)

• Moving towards Unicode for everything

• If you get the encoding wrong, it is gibberish

  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns
    ∗ But an encoding can be shared by many languages
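The "gibberish" failure mode (mojibake) is easy to reproduce: decode the same bytes under the wrong encoding and every multi-byte character falls apart:

```python
# Decoding bytes under the wrong encoding produces mojibake.
text = "café"
data = text.encode("utf-8")     # b'caf\xc3\xa9'

print(data.decode("utf-8"))     # café      -- correct
print(data.decode("latin-1"))   # cafÃ©     -- wrong encoding, garbled
```

Each two-byte UTF-8 sequence becomes two spurious Latin-1 characters, which is exactly the pattern that statistical encoding detectors look for.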

Review and Conclusions 13

Speech and Language Technology

• The need for speech representation

• Storing sound

• Transforming Speech

  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): text to sounds

• Speech technology — the Telephone

Review and Conclusions 14

How good are the systems

Task                     Vocab    WER (%)   WER (%) adapted
Digits                      11       0.4        0.2
Dialogue (travel)       21,000      10.9        —
Dictation (WSJ)          5,000       3.9        3.0
Dictation (WSJ)         20,000      10.0        8.6
Dialogue (noisy, army)   3,000      42.2       31.0
Phone Conversations      4,000      41.9       31.0

Results of various DARPA competitions (from Richard Sproat's slides)
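The word error rate (WER) reported in such tables is the word-level edit distance between the reference transcript and the recognizer's hypothesis, divided by the reference length. The example sentences are invented:

```python
# Word error rate: Levenshtein distance over words / reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("my dog likes avocados", "my dog licks avocados"))   # 0.25
```

One substitution in four reference words gives 25% WER; note that because insertions also count, WER can exceed 100%.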

Review and Conclusions 15

Why is it so difficult

• Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

• Speaker variability

  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)

• Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

   • Sentence Segmentation
   • Abbreviations: Dr Smith lives on Nanyang Dr. He is …
   • Word Segmentation

     - 森山前日銀総裁 Moriyama zen Nichigin sousai (Moriyama, the former Bank of Japan governor)
     ⊗ 森山前日銀総裁 Moriyama zennichi gin sousai (mis-segmented)

2. Speech Synthesis

   • Find the pronunciation
   • Generate sounds
   • Add intonation
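The abbreviation problem in step 1 is easy to demonstrate: a naive splitter that breaks on every full stop cannot tell "Dr." (Doctor) from "Dr." (Drive). The splitting rule below is deliberately naive, to show the failure:

```python
# A naive sentence splitter: break after every full stop.
# It mis-handles abbreviations such as "Dr".
import re

def naive_split(text):
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]

text = "Dr. Smith lives on Nanyang Dr. He is a linguist."
for sentence in naive_split(text):
    print(sentence)
# Produces three "sentences", wrongly cutting after the first "Dr." --
# only a model of abbreviations and context can tell the two uses apart.
```

Real TTS front ends use abbreviation lexicons and context (a capitalized word after "Dr." is ambiguous; a street name before it is not) to decide where sentences actually end.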

Review and Conclusions 17

Speech Synthesis

• Articulatory Synthesis: attempt to model human articulation

• Formant Synthesis: bypass modeling of human articulation and model acoustics directly

• Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sad: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality' but natural-sounding prosody (intonation and duration)."

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time-bound               space-bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are unintended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
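A minimal sketch of the character n-gram ranking approach (in the style of Cavnar and Trenkle's out-of-place measure); the training snippets below are toy data, far smaller than a real system would use:

```python
from collections import Counter

def ngram_profile(text, n=3, top=100):
    """Ranked character n-gram profile of a text (Cavnar & Trenkle style)."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, reference):
    """Sum of rank displacements of profile's n-grams in the reference."""
    rank = {g: i for i, g in enumerate(reference)}
    return sum(abs(i - rank.get(g, len(reference)))
               for i, g in enumerate(profile))

def identify(text, training):
    """Pick the training language whose profile is closest to the text's."""
    profiles = {lang: ngram_profile(t) for lang, t in training.items()}
    target = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(target, profiles[lang]))

training = {  # toy snippets; real profiles are trained on much more text
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
```

Exactly as the slide warns, this breaks down on very short snippets and closely related languages, where the profiles barely differ.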

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K = $700,000 = 0.7 million dollars, …

ã Date Normalization: 2000 AD = 1421 AH = Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel

Review and Conclusions 54
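The number-normalization step can be sketched as a small rewriting function; the pattern and suffix table are illustrative assumptions, not a full normalizer:

```python
import re

# Illustrative suffix table; a real normalizer would cover far more formats.
SUFFIX = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def normalize_amount(text):
    """Map strings like '$700K', '$700000' or '$0.7M' to an integer amount."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)([KMB]?)", text)
    if not m:
        raise ValueError(f"unrecognized amount: {text}")
    value, suffix = m.groups()
    return round(float(value) * SUFFIX.get(suffix, 1))
```

Once normalized, "$700K" and "$700,000" count as the same token, which matters for frequency counts and retrieval.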

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes possible integrating multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness

Review and Conclusions 58
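Well-formedness can be checked mechanically with any XML parser; a sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Syntactic validation: does the text parse as XML at all?"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

A document can be well-formed (tags properly nested and closed) while still being invalid against a particular schema; this check only covers the former.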

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDFs are written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …

Review and Conclusions 59
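The triple model is easy to play with directly; here the "ex:" names are made-up CURIEs for illustration, standing in for full URIs:

```python
# Triples of (subject, predicate, object); in real RDF each element is a URI.
# The "ex:" names below are invented for this example.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:covers", "ex:SemanticWeb"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return {t for t in triples
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])}
```

Pattern matching over triples is the core of SPARQL-style querying; an ontology adds rules that let you infer triples you never stored.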

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible: know thyself

5 Schemas aren't neutral

6 Metrics influence results

7 There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper

Review and Conclusions 64
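The first few scores on the slide can be computed directly from per-paper records; the field names here are assumptions for the sketch:

```python
def citation_scores(papers):
    """Compute simple reputation scores from per-paper citation records.

    papers: list of dicts with 'citations', 'self_citations', 'n_authors'
    (assumed field names for this sketch).
    """
    total = sum(p["citations"] for p in papers)
    minus_self = total - sum(p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return {"total": total, "minus_self": minus_self, "per_author": per_author}

papers = [
    {"citations": 10, "self_citations": 2, "n_authors": 2},
    {"citations": 4, "self_citations": 0, "n_authors": 1},
]
scores = citation_scores(papers)
```

Weighting citations by the quality of the citing paper needs a recursive definition, which is exactly what PageRank-style scores provide.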

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

For more information about Language Technology and the Internet see the <a href="…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation Analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random Walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate
(what proportion of the time someone will be there)

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With remaining probability (90%) go out on a random hyperlink

â For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
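The random walk with teleporting can be sketched as a power iteration over a link graph; the three-page graph and parameter values below are toy assumptions:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration for the random walk with teleporting.

    links: {page: [outlinks]}; a page with no outlinks is a dead end.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p, r in rank.items():
            out = links[p]
            if not out:  # dead end: jump to every page with probability 1/N
                for q in pages:
                    new[q] += r / n
            else:  # teleport with probability 10%, else follow a random link
                for q in pages:
                    new[q] += r * teleport / n
                for q in out:
                    new[q] += r * (1 - teleport) / len(out)
        rank = new
    return rank

# Toy three-page graph: A and B both link to C, C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

The iteration converges to the steady-state visit rates, so C (with two inlinks) ends up ranked above B (with none), matching the "weighted citation" intuition above.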

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph: Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically, randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital Object Identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z
→ http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language Technology and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-Req HG251, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81




The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 13: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Three Major Writing Systems

- Alphabetic (Latin)

  - one symbol for each consonant or vowel
  - Typically 20-30 base symbols (1 byte)

- Syllabic (Hiragana)

  - one symbol for each syllable (consonant+vowel)
  - Typically 50-100 base symbols (1-2 bytes)

- Logographic (Hanzi)

  - pictographs, ideographs, sound-meaning combinations
  - Typically 100,000+ symbols (2-3 bytes)
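The byte counts above can be checked directly; this sketch uses UTF-8 (byte lengths differ in legacy encodings such as EUC-JP, noted in the comments):

```python
# UTF-8 uses more bytes for characters from larger symbol inventories.
for ch in ("a",    # Latin letter: 1 byte
           "か",   # Hiragana syllable: 3 bytes in UTF-8 (2 in EUC-JP)
           "漢"):  # Hanzi/Kanji: 3 bytes in UTF-8 (2 in EUC-JP)
    print(ch, len(ch.encode("utf-8")), "bytes")
```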

Review and Conclusions 12

Computational Encoding

- Need to map characters to bits

- More characters require more space

  - English: ASCII (7 bits); Western Europe: ISO-8859-1 (8 bits); Japanese: EUC (16 bits); Everything: UTF-8 (8-32 bits)

- Moving towards Unicode for everything

- If you get the encoding wrong, it is gibberish

  - Web pages should state their encoding
  - Sometimes they are wrong
  - Encoding detection usually involves statistical analysis of byte patterns

    * But an encoding can be shared by many languages
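A minimal illustration of the "gibberish" problem: the same bytes decoded with the wrong encoding produce mojibake.

```python
# UTF-8 bytes for "café" misread as ISO-8859-1 (Latin-1):
data = "café".encode("utf-8")
print(data.decode("utf-8"))    # café   (correct encoding)
print(data.decode("latin-1"))  # cafÃ©  (wrong encoding: gibberish)
```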

Review and Conclusions 13

Speech and Language Technology

- The need for speech representation

- Storing sound

- Transforming Speech

  - Automatic Speech Recognition (ASR): sounds to text
  - Text-to-Speech Synthesis (TTS): text to sounds

- Speech technology — the Telephone

Review and Conclusions 14

How good are the systems?

Task                     Vocab    WER (%)   WER (%) adapted
Digits                       11       0.4       0.2
Dialogue (travel)        21,000      10.9       —
Dictation (WSJ)           5,000       3.9       3.0
Dictation (WSJ)          20,000      10.0       8.6
Dialogue (noisy, army)    3,000      42.2      31.0
Phone Conversations       4,000      41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)

Review and Conclusions 15

Why is it so difficult?

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect/Foreign Accent
  - Individual Differences: Physical differences, Language differences (idiolect)

- Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

  - Sentence Segmentation
  - Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
  - Word Segmentation

    - 森前銀総裁: Moriyama zen Nichigin Sousai
    ✗ 森前銀総裁: Moriyama zennichi gin Sousai

2. Speech Synthesis

  - Find the pronunciation
  - Generate sounds
  - Add intonation
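The abbreviation problem above can be reproduced with a naive sentence splitter (a sketch; real systems use abbreviation lists and trained models):

```python
import re

text = "Dr. Smith lives on Nanyang Dr. He is a linguist."
# Naively treat every ". " as a sentence boundary:
naive = re.split(r"(?<=[.!?])\s+", text)
print(naive)
# ['Dr.', 'Smith lives on Nanyang Dr.', 'He is a linguist.']
# The first "Dr." (the title) is wrongly split off, while the second
# "Dr." (Drive) really is sentence-final: the string alone cannot decide.
```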

Review and Conclusions 17

Speech Synthesis

- Articulatory Synthesis: Attempt to model human articulation

- Formant Synthesis: Bypass modeling of human articulation and model acoustics directly

- Concatenative Synthesis: Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

- Excitement: Fast, very high pitch, loud

- Hot anger: Fast, high pitch, strong falling accent, loud

- Fear: Jitter

- Sarcasm: Prolonged accent, late peak

- Sad: Slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading >> Speaking, Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

- Large scale discourse studies still to be done

- Some genuinely new things

  - time-lagged multi-person conversation
  - raw un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time bound                       space bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users as seen by Systems Administrators)
  - suits

- Need to be in-group

  cow orker: Coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance" — Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is saved, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge: people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia: Five pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia: Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

- The structure of Markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWIM)

- The structure of the Web — hypertext: pages linked to pages

- The future of the Web

- Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking a direct link reaches them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: Search Engine as Query tool and WWW as corpus
  (Objection: Results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow drawing conclusions about the data
  ✗ Google is missing functionality that linguists and lexicographers would like to have

- Web Sample: Use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ✗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, Scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios
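Both points can be shown in a few lines (toy documents; the large counts are the ones quoted above):

```python
# Document frequency vs term frequency:
docs = ["the cat saw the cat", "the dog slept"]
tf = sum(d.split().count("cat") for d in docs)  # term frequency
df = sum("cat" in d.split() for d in docs)      # document frequency
print(tf, df)  # 2 1  -- a ghit counts like df, not tf

# Comparing two phenomena as a unitless ratio (counts from the slide):
ratio = 251_000_000 / 71_500_000
print(f"{ratio:.1f}:1")  # 3.5:1
```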

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)
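Step 2 can be sketched by pairing seed words into queries (toy numbers; the pipeline above pairs ~500 seeds into ~6,000 queries):

```python
from itertools import combinations

seeds = ["language", "internet", "corpus", "technology"]
# Every unordered pair of seed words becomes one search query:
queries = [" ".join(pair) for pair in combinations(seeds, 2)]
print(len(queries), queries[0])  # 6 'language internet'
```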

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ✗ unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ✗ Less data

- Richer data than a compiled corpus

✗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations — Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine Learning based methods

  - Context (under .jp or .ko)

- Hard to do for: short text snippets, similar languages, mixed text
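A toy version of identification by character n-gram rankings (Cavnar and Trenkle's "out-of-place" measure, on tiny made-up samples; real systems train on much more text):

```python
from collections import Counter

def profile(text, n=3, k=30):
    """Top-k character n-grams of a text, most frequent first."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return [g for g, _ in Counter(grams).most_common(k)]

def out_of_place(doc_prof, ref_prof):
    """Sum of rank differences; n-grams missing from the reference get the maximum penalty."""
    return sum(abs(i - ref_prof.index(g)) if g in ref_prof else len(ref_prof)
               for i, g in enumerate(doc_prof))

en = profile("the cat sat on the mat and the man saw the dog and the rat")
de = profile("der hund und die katze und der mann sind in dem haus und dem garten")
doc = profile("the man and the dog")
print("en" if out_of_place(doc, en) < out_of_place(doc, de) else "de")
```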

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon cel
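Two of these steps as a sketch: a hypothetical money normalizer and a crude suffix stemmer (real systems use e.g. the Porter stemmer):

```python
import re

def normalize_money(s):
    """Map '$700K'-style strings to a plain number (a toy normalizer)."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)([KMB]?)", s)
    value = float(m.group(1)) * {"": 1, "K": 10**3, "M": 10**6, "B": 10**9}[m.group(2)]
    return int(value)

def stem(word):
    """Strip a few common suffixes: producer/produces -> produc."""
    for suffix in ("er", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(normalize_money("$700K"))            # 700000
print(stem("producer"), stem("produces"))  # produc produc
```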

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes possible integrating multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
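Well-formedness can be checked with any XML parser; a sketch using Python's standard library (the sample markup is invented):

```python
import xml.etree.ElementTree as ET

def well_formed(xml_string):
    """True if the string parses as XML, False on a syntax error."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course code='HG2052'><title>Review</title></course>"))  # True
print(well_formed("<course><title>Review</course></title>"))                # False
```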

Review and Conclusions 58

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Universal Resource Name: urn:isbn:1575864606
  - Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, …
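RDF's triple model can be sketched as a tiny in-memory store with wildcard queries (toy facts, not a real RDF library):

```python
# Each fact is a <subject, predicate, object> triple.
triples = {
    ("HG2052", "taughtBy", "FrancisBond"),
    ("HG2052", "isA", "Course"),
    ("FrancisBond", "isA", "Person"),
}

def query(s=None, p=None, o=None):
    """None acts as a wildcard, like a variable in a SPARQL pattern."""
    return [t for t in triples
            if all(q is None or q == v for q, v in zip((s, p, o), t))]

print(query(s="HG2052"))           # both facts about HG2052
print(query(p="isA", o="Person"))  # [('FrancisBond', 'isA', 'Person')]
```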

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible (know thyself)

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight Citations by the Quality of the paper
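The first three scores, computed for a toy author (all names and counts hypothetical):

```python
author = "Bond"
n_coauthors = 2  # authors on the cited paper
citing_papers = [
    {"authors": {"Bond", "Wang"}},  # a self-citation
    {"authors": {"Smith"}},
    {"authors": {"Lee", "Tan"}},
]

total = len(citing_papers)                                    # total citations
non_self = sum(author not in p["authors"] for p in citing_papers)
fractional = total / n_coauthors  # share credit among co-authors

print(total, non_self, fractional)  # 3 2 1.5
```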

Review and Conclusions 64

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to High Impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
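Extracting links and their anchor text, as link analysis requires, can be sketched with the standard-library HTML parser (the URL is made up):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from a page."""
    def __init__(self):
        super().__init__()
        self.in_a, self.href, self.text, self.anchors = False, None, [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_a, self.href, self.text = True, dict(attrs).get("href"), []

    def handle_data(self, data):
        if self.in_a:          # only collect text inside <a>...</a>
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            self.anchors.append((self.href, "".join(self.text).strip()))
            self.in_a = False

p = AnchorCollector()
p.feed('See the <a href="http://example.org/HG803">HG803 Course Page</a>.')
print(p.anchors)  # [('http://example.org/HG803', 'HG803 Course Page')]
```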

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote – not very accurate

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
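The random walk with teleporting can be run directly as power iteration (a sketch on a toy four-page graph; D is a dead end):

```python
def pagerank(links, teleport=0.10, n_iter=100):
    """links: page -> list of outgoing links; returns steady-state visit rates."""
    pages = list(links)
    N = len(pages)
    pr = {p: 1 / N for p in pages}
    for _ in range(n_iter):
        new = {p: 0.0 for p in pages}
        for p in pages:
            outs = links[p]
            if outs:  # teleport with prob 0.10, else follow a random link
                for q in pages:
                    new[q] += teleport * pr[p] / N
                for q in outs:
                    new[q] += (1 - teleport) * pr[p] / len(outs)
            else:     # dead end: jump uniformly (independent of teleport rate)
                for q in pages:
                    new[q] += pr[p] / N
        pr = new
    return pr

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []})
print({p: round(r, 3) for p, r in ranks.items()})  # C, with two inlinks, ranks highest
```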

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

- The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo! does not include the link in its ranking
  - …
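Whether a link passes ranking credit can be read off its rel attribute (a sketch; as the slide notes, engines interpret nofollow differently):

```python
def passes_rank(attrs):
    """attrs: list of (name, value) pairs, as produced by an HTML parser."""
    rel = dict(attrs).get("rel", "")
    return "nofollow" not in rel.split()

print(passes_rank([("href", "http://example.org/")]))                       # True
print(passes_rank([("href", "http://example.org/"), ("rel", "nofollow")]))  # False
```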

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u
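A DOI is resolved by prefixing it with a resolver's address; the central doi.org resolver then redirects to the document's current location (sketch only, no network call):

```python
def doi_to_url(doi):
    # The central resolver redirects to the location stored in the DOI metadata,
    # e.g. the SpringerLink page above.
    return "https://doi.org/" + doi

print(doi_to_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```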

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (Language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics — (Pre-req: HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 14: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Computational Encoding

atilde Need to map characters to bits

atilde More characters require more space

acirc English ASCII (7 bits) Western Europe ISO-8859-1 (8 bits)Japanese EUC (16 bits) Everything UTF-8 (8-32 bits)

atilde Moving towards unicode for everything

atilde If you get the encoding wrong it is gibberish

acirc Web pages should state their encodingacirc Sometimes they are wrongacirc Encoding Detection usually involves statistical analysis of byte patterns

lowast But an encoding can be shared by many languages

Review and Conclusions 13

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

ã Sentence Segmentation
ã Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
ã Word Segmentation

â 森山前日銀総裁 Moriyama zen Nichigin Sousai
⊗ 森山前日銀総裁 Moriyama zennichi gin Sousai

2. Speech Synthesis

ã Find the pronunciation
ã Generate sounds
ã Add intonation
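The "Nanyang Dr." example shows why linguistic analysis must precede synthesis. A toy disambiguator for just this one abbreviation (a deliberate simplification, not a real TTS front end):

```python
def expand_dr(tokens):
    """Expand 'Dr' to 'Doctor' before a capitalised word, else to 'Drive'."""
    out = []
    for i, tok in enumerate(tokens):
        if tok == "Dr":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            out.append("Doctor" if nxt[:1].isupper() else "Drive")
        else:
            out.append(tok)
    return out

print(expand_dr("Dr Smith lives on Nanyang Dr".split()))
# ['Doctor', 'Smith', 'lives', 'on', 'Nanyang', 'Drive']

# The slide's second sentence defeats the rule: "He" is capitalised,
# so "Nanyang Dr He is ..." wrongly yields "Doctor".
print(expand_dr("Nanyang Dr He is".split()))
```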

Review and Conclusions 17

Speech Synthesis

ã Articulatory Synthesis: Attempt to model human articulation

ã Formant Synthesis: Bypass modeling of human articulation and model acoustics directly

ã Concatenative Synthesis: Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

ã Excitement: Fast, very high pitch, loud

ã Hot anger: Fast, high pitch, strong falling accent, loud

ã Fear: Jitter

ã Sarcasm: Prolonged accent, late peak

ã Sad: Slow, low pitch

The main determinant of “naturalness” in speech synthesis is not “voice quality” but natural-sounding prosody (intonation and duration).

— Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

ã Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

ã All share some characteristics of speech and text

ã Usage norms not fixed

ã Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …

ã Large-scale discourse studies still to be done

ã Some genuinely new things:

â time-lagged multi-person conversation
â raw un-edited text
â extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive/comments   factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

ã Inspired by shared background

â <rant>I can’t stand this</rant>
â I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
â lusers (users as seen by Systems Administrators)
â suits

ã Need to be in-group

cow orker: Coworker
clue: “You couldn’t get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance” — Edward Flaherty (talk.bizarre)

Review and Conclusions 28

ã Gricean Principles

â Posting (top/middle/bottom quoting and trimming)

ã Inspired by medium limitations

â GREAT, *great*, _great_
â :-) (^_^) (o)
â brb, RTFM, IMHO
â lower case
â lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

ã Version Control Systems

ã Wikipedia

ã Licensing and Ownership

30

Version Control Systems

ã Versioning file systems

â every time a file is opened, a new copy is stored

ã CVS, Subversion, Git

â changes to a collection of files are tracked
â simultaneous changes are merged

ã Revision Tracking

â Revisions are stored within a file

ã Authorship in shared writing: Explicit responsibility for changes
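What a revision-tracking system stores can be sketched with the standard library: deltas between successive versions rather than full copies (the file contents here are invented):

```python
import difflib

# Two revisions of a small "file"
v1 = "Colorless green ideas\nsleep furiously\n".splitlines(keepends=True)
v2 = "Colorless green ideas\nsleep curiously\n".splitlines(keepends=True)

# The stored delta between the revisions
delta = difflib.unified_diff(v1, v2, fromfile="v1", tofile="v2")
print("".join(delta))

# An ndiff delta is invertible: either revision can be recovered from it
nd = list(difflib.ndiff(v1, v2))
assert "".join(difflib.restore(nd, 1)) == "".join(v1)
assert "".join(difflib.restore(nd, 2)) == "".join(v2)
```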

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge; people like to do this

ã Most edits are done by insiders

ã Most content is added by outsiders

ã Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

ã The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWIM

ã The structure of the Web — hypertext: pages linked to pages

ã The future of the Web

ã Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (no backlinks)

Private Web: sites that require registration and login (edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers’ markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: Search Engine as Query tool and WWW as corpus (Objection: Results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow us to draw conclusions on the data
⊗ Google is missing functionalities that linguists/lexicographers would like to have

ã Web Sample: Use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, Scripts)

ã Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help, but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or “google hit” is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or “web hit” is the more general term

ã Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB, for “Ghits per billion documents”, is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios
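The ratio arithmetic above as a sketch (the counts are the slide's 2012 figures; the index size is a made-up assumption, since Google does not publish one):

```python
different_from = 251_000_000   # ghits, accessed 2012-04-04
different_than = 71_500_000

ratio = different_from / different_than
print(f"{ratio:.1f}:1")        # 3.5:1

# GPB needs the (unpublished) index size; 8e9 documents is assumed here
ASSUMED_INDEX = 8_000_000_000
gpb = different_from * 1_000_000_000 / ASSUMED_INDEX
print(f"{gpb:,.0f} ghits per billion documents")
```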

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni & S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006
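Steps 1–2 can be sketched as follows (the seed words are illustrative placeholders, not Sharoff's actual list, and the networked steps 3–5 are omitted):

```python
import itertools
import random

random.seed(0)
seeds = ["language", "technology", "internet", "corpus", "query"]

def make_queries(seed_words, words_per_query=3, n_queries=5):
    """Sample word tuples to use as conjunctive search-engine queries."""
    combos = list(itertools.combinations(seed_words, words_per_query))
    return [" ".join(c) for c in random.sample(combos, n_queries)]

for q in make_queries(seeds):
    print(q)
```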

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are non-intended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
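A heavily simplified character-n-gram identifier in the Cavnar and Trenkle style (the two training snippets are toy data; real systems train on large text and compare rank-order statistics rather than plain set overlap):

```python
from collections import Counter

def profile(text, n=3, top=100):
    """The most frequent character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {g for g, _ in grams.most_common(top)}

TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and then the other dog",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann über den anderen hund",
}
PROFILES = {lang: profile(txt) for lang, txt in TRAIN.items()}

def identify(snippet):
    """Pick the language whose profile overlaps most with the snippet's."""
    scores = {lang: len(profile(snippet) & p) for lang, p in PROFILES.items()}
    return max(scores, key=scores.get)

print(identify("the dog jumps"))       # en
print(identify("der hund springt"))    # de
```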

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel
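Two of these steps as a toy sketch (the regex and the suffix list are illustrative assumptions; a real system would use a proper stemmer such as Porter's):

```python
import re

def normalize_money(text):
    """Rewrite '$700K' as '$700000'."""
    return re.sub(r"\$(\d+)K\b", lambda m: f"${int(m.group(1)) * 1000}", text)

def stem(word):
    """Crude suffix-stripping stemmer: produces/producer -> produc."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(normalize_money("They paid $700K for it"))   # They paid $700000 for it
print(stem("produces"), stem("producer"))          # produc produc
```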

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes possible integrating multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
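Well-formedness (though not schema validity) can be checked with Python's standard-library parser:

```python
import xml.etree.ElementTree as ET

def well_formed(doc: str) -> bool:
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course>"))         # False (unclosed tag)
```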

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDFs are written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
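Triples can be sketched without a full RDF stack as plain tuples with a wildcard query (the ex: URIs are invented for illustration, not dereferenceable identifiers):

```python
# A tiny in-memory triple store: <subject, predicate, object> tuples
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:FrancisBond", "ex:memberOf", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Pattern-match triples; None acts as a wildcard."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(query(p="ex:taughtBy"))
# [('ex:HG2052', 'ex:taughtBy', 'ex:FrancisBond')]
```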

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow’s seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible — know thyself

5. Schemas aren’t neutral

6. Metrics influence results

7. There’s more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by ‘good’ papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper
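The first three scores, computed on invented data (three papers by one author, made-up counts):

```python
# Invented citation data: each paper's citation count, self-citations,
# and number of co-authors
papers = [
    {"citations": 120, "self": 10, "authors": 3},
    {"citations": 40,  "self": 5,  "authors": 1},
    {"citations": 10,  "self": 4,  "authors": 2},
]

total = sum(p["citations"] for p in papers)
minus_self = sum(p["citations"] - p["self"] for p in papers)
per_author = sum(p["citations"] / p["authors"] for p in papers)

print(total, minus_self, per_author)   # 170 151 85.0
```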

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B (this is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: Each article gets one vote — not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article’s vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web:

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

ã This long-term visit rate is the page’s PageRank

ã PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: “jumping” from a dead end is independent of the teleportation rate
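The walk above converges to the PageRank vector. A power-iteration sketch on a made-up 4-page graph, with a 10% teleportation rate and dead ends teleporting with probability 1/N, exactly as described:

```python
TELEPORT = 0.10
links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # page 3 is a dead end
N = len(links)

rank = [1.0 / N] * N
for _ in range(100):
    new = [0.0] * N
    for page, outs in links.items():
        if not outs:
            # dead end: jump to each page with probability 1/N
            for j in range(N):
                new[j] += rank[page] / N
        else:
            for j in range(N):
                new[j] += rank[page] * TELEPORT / N   # teleport
            for j in outs:
                new[j] += rank[page] * (1 - TELEPORT) / len(outs)  # follow link
    rank = new

print([round(r, 3) for r in rank])   # visit rates sum to 1
```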

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: “scrape” search-engine results pages or other sources of content and create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index.

â Google does not index the target of a link marked nofollow
â Yahoo! does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req HG251, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 15: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Speech and Language Technologyatilde The need for speech representation

atilde Storing sound

atilde Transforming Speech

acirc Automatic Speech Recognition (ASR) sounds to textacirc Text-to-Speech Synthesis (TTS) texts to sounds

atilde Speech technology mdash the Telephone

Review and Conclusions 14

How good are the systems

Task Vocab WER () WER () adaptedDigits 11 04 02Dialogue (travel) 21000 109 mdashDictation (WSJ) 5000 39 30Dictation (WSJ) 20000 100 86Dialogue (noisy army) 3000 422 310Phone Conversations 4000 419 310

Results of various DARPA competitions (from Richard Sproatrsquos slides)

Review and Conclusions 15

Why is it so difficult

atilde Pronunciation depends on context

acirc The same word will be pronounced differently in different sentences

atilde Speaker variability

acirc Genderacirc DialectForeign Accentacirc Individual Differences Physical differences Language differences (idiolect)

atilde Many many rare events

acirc 300 out of 2000 diphones in the core set for the ATampT NextGen system occuronly once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)(English computer science students various studies)

Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)

atilde Reading gtgt SpeakingHearing gtgt Typing

rArr Speech for inputrArr Text for output

Review and Conclusions 20

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow us to draw conclusions about the data
  ✗ Google is missing functionalities that linguists and lexicographers would like to have

- Web Sample: use a search engine to download data from the net and build a corpus from it

  - Known size and exact hit counts → statistics possible
  - People can draw conclusions over the included text types
  - (Limited) control over the content
  ✗ Sparser data

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than:
  251,000,000 ghits vs 71,500,000, or 3.5:1; accessed 2012-04-04)

- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
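The ratio advice above is easy to mechanize; a minimal sketch using the slide's 2012 counts (the 50-billion-document index size is an invented placeholder, since Google no longer publishes one):

```python
def count_ratio(hits_a, hits_b):
    """Unitless ratio of two web hit counts, rounded to one decimal."""
    return round(hits_a / hits_b, 1)

def ghits_per_billion(hits, index_size):
    """Mark Liberman's GPB: ghits normalised per billion indexed documents."""
    return hits / (index_size / 1_000_000_000)

# "different from" vs "different than" (ghits, accessed 2012-04-04)
ratio = count_ratio(251_000_000, 71_500_000)          # the 3.5:1 above
gpb = ghits_per_billion(251_000_000, 50_000_000_000)  # assuming a 50e9-doc index
```

Both counts must of course come from the same engine on the same day for the ratio to mean anything.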

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

Building Internet Corpora Outline

1 Select seed words (500)

2 Combine to form multiple queries (6,000)

3 Query a search engine and retrieve the URLs (50,000)

4 Download the files from the URLs (100,000,000 words)

5 Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html
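Steps 1-2 of the outline can be sketched in a few lines; the seed list here is a small hypothetical stand-in for the 500 mid-frequency words a real run would use:

```python
import random

def make_queries(seed_words, tuple_size=4, n_queries=10, seed=0):
    """Combine seed words into search-engine queries: each query is a
    random tuple of distinct seed words (the BootCaT-style recipe)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [" ".join(rng.sample(seed_words, tuple_size))
            for _ in range(n_queries)]

seeds = ["picture", "shown", "considered", "water", "example",
         "history", "common", "second", "result", "report"]
queries = make_queries(seeds)
```

Each query would then be sent to a search engine (step 3), and the returned URLs downloaded and post-processed.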

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ✗ Unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ✗ Less data

- Richer data than a compiled corpus

✗ Less balanced, less markup

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

... and Zombies

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, ... metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

Implicit Metadata

- You can get clues from metadata within documents

  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as semantic relations

- File names (content type and language)

- Translations: bracketed glosses, cross-lingual disambiguation

- Query data

- Wikipedia redirections and cross-wiki links

Language Identification and Normalization

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine-learning based methods

  - Context (under .jp or .kr)

- Hard to do for short text snippets, similar languages, mixed text
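A character n-gram identifier of the kind listed above can be sketched in pure Python; the two one-sentence "profiles" are toy stand-ins for the large training texts a real system would use:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram profile: a Counter of the text's n-grams."""
    text = f" {text.lower()} "          # pad so word edges form n-grams
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Pick the language whose profile shares the most n-grams with the
    text (overlap = sum of minimum counts).  Real systems use rank-order
    statistics (Cavnar & Trenkle 1994) or machine learning instead."""
    target = char_ngrams(text)
    def overlap(lang):
        return sum(min(c, profiles[lang][g]) for g, c in target.items())
    return max(profiles, key=overlap)

profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog again"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund"),
}
```

On very short or mixed-language snippets the overlap scores become unreliable, which is exactly the difficulty the last bullet points to.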

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number normalization: $700K, $700,000, 0.7 million dollars, ...

- Date normalization: 2000 AD, 1421 AH, Heisei 12, ...

- Stripping stop words (the, a, of, ...)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel
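As a sketch of number normalization, mapping the slide's three money expressions onto one canonical value (a toy; real normalizers cover far more patterns):

```python
import re

def normalize_money(text):
    """Map a few money expressions to a canonical integer dollar amount."""
    m = re.fullmatch(r"\$(\d+(?:\.\d+)?)K", text)
    if m:                                   # $700K
        return round(float(m.group(1)) * 1_000)
    m = re.fullmatch(r"\$([\d,]+)", text)
    if m:                                   # $700,000
        return int(m.group(1).replace(",", ""))
    m = re.fullmatch(r"(\d+(?:\.\d+)?) million dollars", text)
    if m:                                   # 0.7 million dollars
        return round(float(m.group(1)) * 1_000_000)
    raise ValueError(f"unrecognized amount: {text}")
```

Once normalized, all three surface forms count as the same token, which is the point of the step.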

The Semantic Web


Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color", but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Semantic Web Architecture


XML eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
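The well-formedness check in the last bullet is exactly what an XML parser does; a minimal sketch with the standard library:

```python
import xml.etree.ElementTree as ET

def well_formed(xml_text):
    """True if every tag is closed and properly nested (syntactic
    well-formedness).  Validity against a DTD or schema is a separate,
    stricter check that this does not perform."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```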

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, ...
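The triple model reduces naturally to tuples; a minimal sketch of a triple store with wildcard queries (the ex: URIs are invented for illustration, and a real store would use RDF syntax and full URIs):

```python
# RDF's <subject, predicate, object> model as a set of Python tuples.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:partOf", "ex:LMS"),
    ("ex:FrancisBond", "ex:memberOf", "ex:LMS"),
}

def query(pattern, store=triples):
    """All triples matching an (s, p, o) pattern; None is a wildcard."""
    return {t for t in store
            if all(p is None or p == v for p, v in zip(pattern, t))}
```

Chaining such queries is how new conclusions are drawn from merged data sources.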

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata:

1 People lie

2 People are lazy

3 People are stupid

4 Mission: Impossible (know thyself)

5 Schemas aren't neutral

6 Metrics influence results

7 There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

Citation, Reputation and PageRank

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total of (citations / number of authors)
  - Average of (citation impact factor / number of authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper
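The simple scores above are straightforward to compute once citation counts are known; a sketch over a hypothetical publication record (the record format is invented for illustration):

```python
def citation_scores(papers):
    """Three of the slide's scores for one researcher.  `papers` is a
    list of records with citation count, self-citation count and
    number of authors."""
    total = sum(p["cites"] for p in papers)
    minus_self = total - sum(p["self_cites"] for p in papers)
    per_author = sum(p["cites"] / p["authors"] for p in papers)
    return {"total": total, "minus_self": minus_self,
            "per_author": per_author}

papers = [
    {"cites": 30, "self_cites": 5, "authors": 2},
    {"cites": 10, "self_cites": 2, "authors": 1},
]
scores = citation_scores(papers)
```

Weighting each citation by the quality of the citing paper turns this into a recursive computation, which is what PageRank formalizes.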

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self-citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve.

Characteristics of the Web Bow Tie

SCC (Strongly Connected Core): one can travel from any page to any page

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://...">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
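Collecting (href, anchor text) pairs is the first step of link analysis; a minimal sketch with the standard-library HTML parser (the URL is a placeholder):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from HTML."""
    def __init__(self):
        super().__init__()
        self.links = []      # completed (href, text) pairs
        self._href = None    # href of the <a> we are inside, if any
        self._text = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []
    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

parser = AnchorCollector()
parser.feed('See the <a href="http://example.com/HG803">HG803 Course Page</a>.')
```

A search engine indexes the collected anchor text under the *target* page, which is why it can describe pages it has never crawled.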

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote (not very accurate)

- On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality ...
  - ... mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Teleporting ndash to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
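The random walk with teleporting can be computed directly by power iteration; a minimal sketch on an invented four-page graph (D is a dead end):

```python
def pagerank(links, teleport=0.10, iters=50):
    """Long-term visit rate of the teleporting random surfer.
    `links` maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                        # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                # teleport with probability 0.10
                    new[q] += teleport * rank[p] / n
                for q in out:                  # else follow a link equiprobably;
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new                             # e.g. 4 outlinks -> 0.90/4 = 0.225 each
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
rank = pagerank(links)
```

The ranks always sum to 1, and pages with no inlinks (like D) keep only the small teleport share.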

Example Graph

Each inbound link is a positive vote.

http://www.ams.org/samplings/feature-column/fcarc-pagerank

Example Graph, Weighted

Pages with higher PageRanks are lighter.

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

- Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - ...

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Conclusions


Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital text (made writing transferable)

- Hyperlinking (taking writing beyond language)

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics: (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics


How good are the systems

Task                    Vocab    WER (%)   WER (%) adapted
Digits                      11       0.4       0.2
Dialogue (travel)       21,000      10.9       -
Dictation (WSJ)          5,000       3.9       3.0
Dictation (WSJ)         20,000      10.0       8.6
Dialogue (noisy, army)   3,000      42.2      31.0
Phone Conversations      4,000      41.9      31.0

Results of various DARPA competitions (from Richard Sproat's slides)
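WER in the table is the edit distance between the recognizer's output and a reference transcript, divided by the reference length; a minimal sketch:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / |reference|,
    via standard dynamic-programming edit distance over words."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(r)][len(h)] / len(r)
```

Note that WER can exceed 100% when the hypothesis has many insertions.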

Why is it so difficult

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect / foreign accent
  - Individual differences: physical differences, language differences (idiolect)

- Many, many rare events

  - 300 out of 2000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Two steps in a TTS system

1 Linguistic Analysis

  - Sentence segmentation
  - Abbreviations: Dr Smith lives on Nanyang Dr He is ...
  - Word segmentation

    - 森山前日銀総裁 → Moriyama zen Nichigin sousai (Moriyama, former Bank of Japan governor)
    ✗ 森山前日銀総裁 → Moriyama zennichi gin sousai

2 Speech Synthesis

  - Find the pronunciation
  - Generate sounds
  - Add intonation
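A naive splitter shows why the abbreviation example is hard: treating "Dr." as an abbreviation fixes "Dr. Smith" but then wrongly merges the two sentences around "Nanyang Dr." (a sketch, with periods added to the slide's example):

```python
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Prof."}

def split_sentences(text):
    """Split after any token ending in . ? or ! unless the token is a
    known abbreviation.  This cannot tell the title 'Dr.' from the
    street 'Dr.', which is exactly the slide's ambiguity."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

The second test below documents the failure: the two sentences come back as one.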

Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation

- Formant Synthesis: bypass modeling of human articulation and model acoustics directly

- Concatenative Synthesis: synthesize from stored units of actual speech

Prosody of Emotion

- Excitement: fast, very high pitch, loud

- Hot anger: fast, high pitch, strong falling accent, loud

- Fear: jitter

- Sarcasm: prolonged accent, late peak

- Sadness: slow, low pitch

"The main determinant of 'naturalness' in speech synthesis is not 'voice quality' but natural-sounding prosody (intonation and duration)."

Richard Sproat

Speed is different for different modalities

Speed in words per minute (one word = 6 characters)
(English computer science students, various studies)

Reading   300    200 (proof reading)
Writing    31     21 (composing)
Speaking  150
Hearing   150    210 (speeded up)
Typing     33     19 (composing)

- Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

New Modalities


Email Usenet Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed
  (Usenet, bulletin boards, Gopher, Archie, ...)

- Large-scale discourse studies still to be done

- Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by systems administrators)
  - suits

- Need to be in-group

  cow orker: co-worker
  clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
  Edward Flaherty (talk.bizarre)

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) ...
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision tracking

  - revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes
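The change-tracking at the heart of these systems is a diff between revisions; a minimal sketch with Python's difflib:

```python
import difflib

def diff_revisions(old, new):
    """Unified diff between two revisions of a text, as a list of lines --
    the record a version control system stores and attributes to an author."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="rev1", tofile="rev2", lineterm=""))

delta = diff_revisions("Hello world\nSecond line", "Hello there\nSecond line")
```

Merging simultaneous edits is essentially applying two such deltas to the same base revision and flagging the lines where they collide.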

Wikipedia

- "The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet." (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge, and people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

  - Content policies: NPOV, Verifiability, and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia: Five pillars

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated, if possible, by images

Wikipedia: Good_article_criteria

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper

Review and Conclusions 64
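The first three scores above are simple arithmetic; a toy sketch (the papers and counts are invented for illustration):

```python
# Toy computation of the citation scores listed above.
# name: (citations, self_citations, n_authors) -- invented data.
papers = {
    "paper_a": (40, 10, 2),
    "paper_b": (12, 0, 4),
}

def total_citations(p):
    return papers[p][0]

def citations_minus_self(p):
    c, self_c, _ = papers[p]
    return c - self_c

def citations_per_author(p):
    c, _, n = papers[p]
    return c / n

print(total_citations("paper_a"))
print(citations_minus_self("paper_a"))
print(citations_per_author("paper_a"))
```

Weighting by the quality of the citing paper needs a recursive definition, which is exactly what PageRank (below in these slides) provides.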

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core - can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/there/page/HG803">HG803:
  Language, Technology and the Internet</a>

  For more information about Language, Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote - not very accurate

- On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web:

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting - to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
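The walk-plus-teleporting recipe above is enough to compute PageRank by power iteration; a small sketch on an invented four-page graph:

```python
# Power-iteration PageRank with a 10% teleportation rate, following the
# recipe above: dead ends jump uniformly anywhere; other pages teleport
# with probability 0.10 and otherwise follow an outgoing link equiprobably.
# The four-page graph is invented for illustration.

links = {0: [1, 2], 1: [2], 2: [0], 3: []}  # page 3 is a dead end
N = len(links)
teleport = 0.10

rank = [1.0 / N] * N
for _ in range(100):
    new = [0.0] * N
    for page, out in links.items():
        if not out:                      # dead end: jump anywhere, 1/N each
            for q in range(N):
                new[q] += rank[page] / N
        else:
            for q in range(N):           # teleport share, 0.1/N each
                new[q] += teleport * rank[page] / N
            for q in out:                # follow one of the links equiprobably
                new[q] += (1 - teleport) * rank[page] / len(out)
    rank = new

print([round(r, 3) for r in rank])  # long-term visit rates; they sum to 1
```

Page 3, which nothing links to, ends up with only the teleport traffic; the pages inside the cycle share the rest.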

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

- The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74
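A nofollow link is written like this in HTML (the URL is a placeholder):

```html
<!-- rel="nofollow" tells search engines not to pass ranking credit
     to the link target; the URL here is a placeholder. -->
<a href="http://example.com/some-page" rel="nofollow">some page</a>
```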

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer - solving NLP problems with Python;
  introduces both programming and linguistics

- HG3051 Corpus Linguistics - (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Why is it so difficult?

- Pronunciation depends on context

  - The same word will be pronounced differently in different sentences

- Speaker variability

  - Gender
  - Dialect / Foreign Accent
  - Individual differences: physical differences, language differences (idiolect)

- Many, many rare events

  - 300 out of 2,000 diphones in the core set for the AT&T NextGen system occur only once in a 2-hour speech database

Review and Conclusions 16

Two steps in a TTS system

1. Linguistic Analysis

   - Sentence Segmentation
   - Abbreviations: Dr. Smith lives on Nanyang Dr. He is …
   - Word Segmentation

     - 森山 前 日銀 総裁 Moriyama zen Nichigin sousai (Moriyama, former Bank of Japan Governor)
     ⊗ 森山 前日 銀 総裁 Moriyama zennichi gin sousai

2. Speech Synthesis

   - Find the pronunciation
   - Generate sounds
   - Add intonation

Review and Conclusions 17

Speech Synthesis

- Articulatory Synthesis: attempt to model human articulation

- Formant Synthesis: bypass modeling of human articulation and model the acoustics directly

- Concatenative Synthesis: synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

- Excitement: fast, very high pitch, loud

- Hot anger: fast, high pitch, strong falling accent, loud

- Fear: jitter

- Sarcasm: prolonged accent, late peak

- Sad: slow, low pitch

The main determinant of "naturalness" in speech synthesis is not "voice quality" but natural-sounding prosody (intonation and duration).

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

- Reading >> Speaking/Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email, Usenet, Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed:
  Usenet, Bulletin Boards, Gopher, Archie, …

- Large-scale discourse studies still to be done

- Some genuinely new things:

  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time bound               space bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits

- Need to be in-group

  - cow orker (co-worker)
  - clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance"
    Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, *great*, _great_
  - :-) (^_^) (^o^)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is saved, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31
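A minimal illustration of what such systems track - the difference between two revisions of a text - here with Python's standard difflib (the revisions are invented):

```python
# Show what changed between two revisions of a file, in the unified
# diff format that CVS, Subversion and Git all use.
import difflib

old = ["Colourless green ideas sleep furiously.\n"]
new = ["Colorless green ideas sleep furiously.\n",
       "Revolutionary new ideas appear infrequently.\n"]

diff = difflib.unified_diff(old, new, fromfile="v1.txt", tofile="v2.txt")
print("".join(diff))  # lines prefixed with - were removed, + were added
```

A version control system stores (or reconstructs) exactly such deltas for every revision, which is what makes authorship of each change explicit.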

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge:
  people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

- The structure of Markup: Visual vs Logical
  (WYSIWYG, WYSIAYG, WYSIWYM)

- The structure of the Web - hypertext:
  pages linked to pages

- The future of the Web

- Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (no backlinks)

Private Web: sites that require registration and login (edveNTUre)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41
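The contrast can be seen in HTML itself (a minimal, hypothetical fragment):

```html
<!-- Visual markup: says only what the text should look like -->
<font size="5"><b>Normalization</b></font>

<!-- Logical markup: says what the text *is*;
     a stylesheet decides how it looks -->
<h2>Normalization</h2>
<p>Stemming maps <em>produces</em> to <em>produc</em>.</p>
```

The logical version can be restyled, read aloud, or indexed without any change to the document, which is why it is more adaptable.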

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow one to draw conclusions about the data
  ⊗ Google is missing functionalities that linguists and lexicographers would like to have

- Web Sample: use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help - but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than):
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman),
  but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
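With the counts quoted above, the ratio is one line of arithmetic (the raw counts drift as the index changes, so report the access date and the unitless ratio rather than the counts):

```python
# Comparing two constructions by the ratio of their web counts,
# using the ghit figures quoted above (accessed 2012-04-04).
different_from = 251_000_000  # ghits for "different from"
different_than = 71_500_000   # ghits for "different than"

ratio = different_from / different_than
print(f"different from : different than = {ratio:.1f} : 1")
```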

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
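Step 2 of the outline can be sketched as follows (the seed words and counts here are illustrative and far smaller than Sharoff's; the downloading and postprocessing steps are left out):

```python
# Combine seed words into search-engine queries by sampling random
# word combinations, as in step 2 of the outline above.
import itertools
import random

seeds = ["picture", "water", "small", "world", "house", "music"]

def make_queries(seeds, words_per_query=4, n_queries=5, rng_seed=0):
    """Sample random seed-word combinations to use as queries."""
    rng = random.Random(rng_seed)
    combos = list(itertools.combinations(seeds, words_per_query))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

for q in make_queries(seeds):
    print(q)
```

Each query is then sent to a search engine, and the returned URLs are downloaded and cleaned to build the corpus.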

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ⊗ Unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ⊗ Less data

- Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations - bracketed glosses, cross-lingual disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine Learning based methods

  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
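A heavily simplified character-n-gram identifier in the spirit of the n-gram ranking methods above (the two training snippets are invented and far too small for real use, where profiles are trained on large amounts of text):

```python
# A toy character-trigram language identifier: build an n-gram profile
# per language, then pick the language whose profile best overlaps the
# profile of the unknown snippet.  Training data is tiny and invented.
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """The most frequent character n-grams of a text, most common first."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

training = {
    "en": "the cat sat on the mat and the dog saw the cat",
    "de": "die katze sitzt auf der matte und der hund sieht die katze",
}
profiles = {lang: ngram_profile(t) for lang, t in training.items()}

def identify(snippet):
    """Pick the language whose profile shares most n-grams with the snippet."""
    target = set(ngram_profile(snippet))
    return max(profiles, key=lambda lang: len(target & set(profiles[lang])))

print(identify("the dog and the cat"))     # en
print(identify("der hund und die katze"))  # de
```

As the slide notes, exactly this kind of method degrades on short snippets, closely related languages, and mixed-language text, because the profiles overlap too much.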

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel

Review and Conclusions 54
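Toy versions of two of these steps, stop-word stripping and suffix stemming (the suffix list is illustrative; a real system would use a proper stemmer such as Porter's):

```python
# Two normalization steps from the list above, in minimal form.
STOP_WORDS = {"the", "a", "of"}
SUFFIXES = ["es", "er", "s"]  # illustrative, not a real stemmer

def strip_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Chop off the first matching suffix, keeping a minimal stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = "the producer of a film produces the music".split()
print(strip_stop_words(tokens))                   # stop words removed
print([stem(t) for t in strip_stop_words(tokens)])
```

Note how "producer" and "produces" both stem to "produc", exactly the conflation shown on the slide.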

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own
  - is designed to be self-descriptive
  - is a W3C Recommendation

- Can be validated for syntactic well-formedness

Review and Conclusions 58
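Well-formedness checking is available in Python's standard library (the sample documents are invented):

```python
# Check XML well-formedness: every opening tag must have a matching
# closing tag, properly nested.
from xml.etree import ElementTree

def is_well_formed(document):
    try:
        ElementTree.fromstring(document)
        return True
    except ElementTree.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course>"))         # False
```

Well-formedness is purely syntactic; checking that a document follows a particular schema (validity) needs a DTD or XML Schema on top of this.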


Page 18: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Two steps in a TTS system

1 Linguistic Analysis

atilde Sentence Segmentationatilde Abbreviations Dr Smith lives on Nanyang Dr He is hellipatilde Word Segmentation

acirc 森前銀総裁 Moriyama zen Nichigin Sousaiotimes 森前銀総裁emspMoriyama zennichi gin Sousai

2 Speech Synthesis

atilde Find the pronunciationatilde Generate soundsatilde Add intonation

Review and Conclusions 17

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word = 6 characters; English computer science students, various studies):

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (sped up)
Typing      33     19 (composing)

- Reading >> Speaking; Hearing >> Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

- All share some characteristics of speech and text

- Usage norms not fixed

- Communication methods may disappear before the norms are fixed: Usenet, Bulletin Boards, Gopher, Archie, …

- Large-scale discourse studies still to be done

- Some genuinely new things:
  - time-lagged multi-person conversation
  - raw, un-edited text
  - extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time bound              space bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                       Text-like
time bound                        space bound
spontaneous                       contrived
face-to-face                      visually decontextualized
loosely structured                elaborately structured
socially interactive (comments)   factually communicative
immediately revisable             repeatedly revisable
prosodically rich                 graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background
  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  - lusers (users, as seen by systems administrators)
  - suits

- Need to be in-group

cow orker: coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean principles
  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations
  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems
  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git
  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision tracking
  - revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge; people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view
   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

- The structure of markup: visual vs. logical (WYSIWYG, WYSIAYG, WYSIWYM)

- The structure of the Web: hypertext, pages linked to pages

- The future of the Web

- Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query, or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses, or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find.

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (presentational)
  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (structural)
  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: search engine as query tool and WWW as corpus (objection: results are not reliable)
  - population and exact hit counts are unknown → no statistics possible
  - indexing does not allow drawing conclusions about the data
  ⊗ Google is missing functionality that linguists and lexicographers would like to have

- Web Sample: use a search engine to download data from the net and build a corpus from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
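Turning raw hit counts into the unitless ratio recommended above is a one-liner; the counts here are the ones quoted on the slide (accessed 2012-04-04):

```python
# ghits for two competing phenomena, as quoted on the slide.
different_from = 251_000_000   # "different from"
different_than = 71_500_000    # "different than"

ratio = different_from / different_than
print(f"{ratio:.1f}:1")  # about 3.5:1
```

Reporting the ratio (with the access date) rather than the raw counts is what makes the numbers comparable over time.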

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna.

http://corpus.leeds.ac.uk/internet.html 47
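Steps 1-2 of the outline above can be sketched as follows. The seed words here are placeholders (Sharoff used around 500 per language), and the query shape (four words per query, random combinations) is one simple choice among many:

```python
import itertools
import random

# Placeholder seed words; a real run uses ~500 frequent content words.
seeds = ["time", "people", "water", "system", "group", "world"]

def make_queries(seed_words, words_per_query=4, n_queries=10, rng=None):
    """Randomly sample word combinations to send as search-engine queries."""
    rng = rng or random.Random(0)  # fixed seed, so the sample is reproducible
    combos = list(itertools.combinations(seed_words, words_per_query))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

queries = make_queries(seeds)
```

Each query string would then be sent to a search engine (step 3) and the returned URLs downloaded (step 4).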

Internet Corpora Summary

- The web can be used as a corpus
  - Direct access
    * fast and convenient
    * huge amounts of data
    ⊗ unreliable counts
  - Web sample
    * control over the sample
    * some setup costs (semi-automated)
    ⊗ less data

- Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit meta-data
  - Keywords and categories
  - Rankings
  - Structural markup

- Implicit meta-data
  - Links and citations
  - Tags
  - Tables
  - File names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents
  - when it is accurate, it is very good
  - it is often deceitful

- HTML, PDF, Word, … metadata

- Keywords and tags

- Rankings

- Links and citations

- Structural markup

50

Implicit Metadata

- You can get clues from metadata within documents
  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as semantic relations

- File names (content type and language)

- Translations: bracketed glosses, cross-lingual disambiguation

- Query data

- Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  - Linguistically-grounded methods
    * diacritics
    * character n-grams
    * stop words
  - Similarity-based categorisation and classification
    * character n-gram rankings
  - Machine-learning-based methods
  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
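The "character n-gram rankings" approach above can be sketched in a few lines, in the style of Cavnar and Trenkle's out-of-place measure. The training texts here are tiny toys (real profiles are built from much larger samples), so this is an illustration of the ranking idea, not a usable identifier:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank differences between two profiles; smaller = more similar."""
    score = 0
    for rank, gram in enumerate(doc_profile):
        score += abs(rank - profile.index(gram)) if gram in profile else len(profile)
    return score

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

# Toy training texts; real systems train on large monolingual corpora.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 5),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund " * 5),
}
print(identify("the dog jumps", profiles))
```

The slide's caveat applies directly: with snippets this short, or with closely related languages, the rank distances become unreliable.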

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number normalization: $700K, $700,000, 0.7 million dollars, …

- Date normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel

Review and Conclusions 54
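The stemming step above can be sketched with a minimal suffix stripper. This is in the spirit of Porter's algorithm but far simpler: the suffix list and the minimum-stem-length condition are toy choices for illustration only:

```python
# Minimal suffix-stripping stemmer; real stemmers (e.g. Porter) have many
# more rules and conditions. Order matters: try longer suffixes first.
SUFFIXES = ["ers", "es", "er", "s"]

def stem(word):
    for suf in SUFFIXES:
        # only strip if a stem of at least 3 characters remains
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(stem("produces"), stem("producer"))  # produc produc
```

Note the difference from lemmatization: the stem "produc" is not a dictionary word, it is just a shared index key.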

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically
  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness

Review and Conclusions 58
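The well-formedness check mentioned above is exactly what an XML parser does: it accepts a document or raises an error. A small sketch using Python's standard-library parser:

```python
from xml.etree import ElementTree

def is_well_formed(xml_text):
    """True if the text parses as syntactically well-formed XML."""
    try:
        ElementTree.fromstring(xml_text)
        return True
    except ElementTree.ParseError:
        return False

print(is_well_formed("<note><to>Tove</to></note>"))   # True
print(is_well_formed("<note><to>Tove</note></to>"))   # False: mis-nested tags
```

Well-formedness (matching, properly nested tags) is purely syntactic; validity against a schema or DTD is a further, separate check.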

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework
  - triples of <subject, predicate, object>
  - each element is a URI
  - RDF is written in well-defined XML
  - you can say anything about anything

- You can build relations in ontologies (OWL)
  - then reason over them, search them, …

Review and Conclusions 59
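The triple model above can be sketched as a tiny in-memory store with pattern matching. The URIs here are shortened, invented examples (the "ex:" vocabulary is not a real namespace); real RDF stores work the same way at much larger scale:

```python
# Triples of <subject, predicate, object>; the "ex:" names are invented
# placeholders standing in for full URIs.
triples = [
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
]

def query(store, s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything asserted about HG2052:
print(query(triples, s="ex:HG2052"))
```

Because every statement has the same shape, data from multiple sources can be merged by simply concatenating their triples, which is the "web of data" integration story in miniature.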

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text mining is about unstructured data

- There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?
  - Content-based
    * read it and see if it is interesting (hard for a computer)
    * compare it to other things you have read and liked
  - Context-based: citation analysis
    * see who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:
  - total number of citations (pretty useful)
  - total number of citations minus self-citations
  - total of (citations / number of authors)
  - average of (citation impact factor / number of authors)

- Problems:
  - not all citations are equal: citations by 'good' papers are better
  - newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper

Review and Conclusions 64
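The first two scores above can be sketched on toy data. The papers, authors and citation links here are all invented for illustration; "self-citation" is taken to mean the citing and cited papers share at least one author:

```python
# Toy citation network: each paper lists its authors and the papers it cites.
papers = {
    "A": {"authors": ["Lee", "Tan"], "cites": ["B"]},
    "B": {"authors": ["Tan"],        "cites": []},
    "C": {"authors": ["Ng"],         "cites": ["B", "A"]},
}

def citing_papers(store, target):
    return [p for p, d in store.items() if target in d["cites"]]

def total_citations(store, target):
    return len(citing_papers(store, target))

def citations_minus_self(store, target):
    """Drop citing papers that share an author with the cited paper."""
    own = set(store[target]["authors"])
    return sum(1 for p in citing_papers(store, target)
               if not own & set(store[p]["authors"]))

print(total_citations(papers, "B"), citations_minus_self(papers, "B"))
```

Here paper B is cited twice, but one citation (from A) shares the author Tan and is discounted, illustrating why the two scores diverge.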

Gaming Citations

- Least/Minimum Publishable Unit
  - break research into small chunks to increase the number of citations
  - sometimes there is very little new information

- Self-citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve.

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC, the Strongly Connected Core: you can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A.

  2. The (extended) anchor text pointing to page B is a good description of page B.

  This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article
  - simplest measure: each article gets one vote (not very accurate)

- On the web, citation frequency = inlink count
  - a high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank
  - an article's vote is weighted according to its citation impact
  - this can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web
  - start at a random page
  - at each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink
  - for example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
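The random walk with teleportation described above can be computed by power iteration: repeatedly redistribute each page's rank mass according to the walk's transition probabilities until the values settle. A minimal sketch, using the slide's 10% teleportation rate and its dead-end rule (the four-page graph is a made-up example):

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power-iteration PageRank; links maps each page to its outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:
                # dead end: jump to any page with probability 1/N
                for q in pages:
                    new[q] += rank[p] / n
            else:
                # teleport with probability 10%, else follow a random link
                for q in pages:
                    new[q] += rank[p] * teleport / n
                for q in out:
                    new[q] += rank[p] * (1 - teleport) / len(out)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Because every step redistributes the full rank mass, the values always sum to 1, and the iteration converges to the steady-state visit rates of the walk.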

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.
  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

74
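A crawler honouring nofollow, as described above, simply skips such links when collecting endorsements. A sketch with Python's standard-library HTML parser (the URLs are invented examples):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hrefs from <a> tags, skipping links marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            d = dict(attrs)
            # rel can hold several space-separated values, e.g. "external nofollow"
            if "nofollow" not in (d.get("rel") or "").split():
                self.links.append(d.get("href"))

html = ('<p><a href="http://example.com/a">ok</a> '
        '<a rel="nofollow" href="http://example.com/spam">spam</a></p>')
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # only the first link survives
```

From the ranking algorithm's point of view, the nofollow link simply never enters the link graph, so it casts no "vote".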

Current Status

- There is a continuous battle between:
  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object
  - metadata about the object is stored with the DOI name
  - the metadata includes a location, such as a URL
  - the DOI for a document is permanent; the metadata may change
  - gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool
  - passively, for information access
  - actively, for collaboration

- The internet is interesting in and of itself
  - as a source of data about existing language
  - as a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press.

- Sproat, R. (2010) Language, Technology and Society. Oxford University Press.

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 19: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Speech Synthesis

atilde Articulatory Synthesis Attempt to model human articulation

atilde Formant Synthesis Bypass modeling of human articulation and model acousticsdirectly

atilde Concatenative Synthesis Synthesize from stored units of actual speech

Review and Conclusions 18

Prosody of Emotion

atilde Excitement Fast very high pitch loud

atilde Hot anger Fast high pitch strong falling accent loud

atilde Fear Jitter

atilde Sarcasm Prolonged accent late peak

atilde Sad Slow low pitch

The main determinant of ldquonaturalnessrdquo in speech synthesis is not ldquovoice qualityrdquobut natural-sounding prosody (intonation and duration)

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)(English computer science students various studies)

Reading 300 200 (proof reading)Writing 31 21 (composing)Speaking 150Hearing 150 210 (speeded up)Typing 33 19 (composing)

atilde Reading gtgt SpeakingHearing gtgt Typing

rArr Speech for inputrArr Text for output

Review and Conclusions 20

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

• The web can be used as a corpus

– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ Unreliable counts

– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

– Keywords and Categories
– Rankings
– Structural Markup

• Implicit Meta-data

– Links and Citations
– Tags
– Tables
– File Names
– Translations

…and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

– When they are accurate, they are very good
– They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

– as they are unintended, they tend to be noisy
– but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations: bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

– Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

– Similarity-based categorisation and classification
∗ Character n-gram rankings

– Machine-learning-based methods
– Context (under .jp or .kr)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
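The rank-ordered character n-gram approach mentioned above (in the style of Cavnar & Trenkle 1994) can be sketched as follows; the two training strings are tiny toy stand-ins for real training corpora.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Ranked list of the most frequent character n-grams in a text."""
    text = " ".join(text.lower().split())
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank differences; n-grams unseen in training get a max penalty."""
    penalty = len(profile)
    return sum(abs(i - profile.index(g)) if g in profile else penalty
               for i, g in enumerate(doc_profile))

# Toy training data -- real systems train on much larger samples per language.
train = {"en": "the cat sat on the mat with the other cats",
         "de": "die katze sitzt auf der matte mit den anderen katzen"}
profiles = {lang: ngram_profile(t) for lang, t in train.items()}

def identify(snippet):
    """Pick the language whose profile is closest to the snippet's profile."""
    doc = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

print(identify("the cat sat"))
```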

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel

Review and Conclusions 54
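Number normalization, for instance, can be sketched as a regular-expression rewrite; this toy version only knows the surface forms in the slide's example and maps them all to one canonical integer.

```python
import re

def normalize_amount(s):
    """Map surface forms like '$700K' / '0.7 million dollars' to an integer."""
    s = s.lower().replace(",", "")
    m = re.search(r"(\d+(?:\.\d+)?)\s*(k|thousand|million|m)?", s)
    if not m:
        return None
    value = float(m.group(1))
    scale = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6}.get(m.group(2), 1)
    return round(value * scale)

for form in ["$700K", "$700,000", "0.7 million dollars"]:
    print(form, "->", normalize_amount(form))   # all normalize to 700000
```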

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

– provides a common data representation framework
– makes it possible to integrate multiple sources
– so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58
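Well-formedness checking is mechanical; a minimal sketch using Python's standard-library XML parser (this checks syntax only, not validity against a DTD or schema):

```python
import xml.etree.ElementTree as ET

def well_formed(xml_string):
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(well_formed("<lesson><topic>XML</topic></lesson>"))  # True
print(well_formed("<lesson><topic>XML</lesson>"))          # False: bad nesting
```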

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

– Uniform Resource Name: urn:isbn:1575864606
– Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything

• You can build relations in ontologies (OWL)

– Then reason over them, search them, …

Review and Conclusions 59
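The triple model can be illustrated with plain Python tuples; this is a sketch, not a real RDF store, and the `ex:` predicates and objects below are invented example names, not a real vocabulary.

```python
# RDF-style triples as (subject, predicate, object) tuples of URI strings.
triples = [
    ("urn:isbn:1575864606", "ex:hasAuthor", "ex:SomeAuthor"),
    ("urn:isbn:1575864606", "ex:hasFormat", "ex:Hardcover"),
]

def query(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None matches anything."""
    return [(s, p, o) for s, p, o in triples
            if subject in (None, s) and predicate in (None, p) and obj in (None, o)]

print(query(triples, predicate="ex:hasAuthor"))
```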

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible (know thyself)

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data

– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

– Total number of citations (pretty useful)
– Total number of citations minus self-citations
– Total number of (citations / number of authors)
– Average (citation impact factor / number of authors)

• Problems:

– Not all citations are equal: citations by ‘good’ papers are better
– Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper

Review and Conclusions 64
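The first few scores above are simple to compute once citation data is available; the paper data below is invented purely for illustration.

```python
# Toy citation data: who cites each paper, and how many authors it has.
papers = {
    "A": {"authors": 2, "cited_by": ["B", "C", "D", "A"]},   # one self-citation
    "B": {"authors": 1, "cited_by": ["C"]},
}

def total_citations(p):
    return len(papers[p]["cited_by"])

def citations_minus_self(p):
    return sum(1 for citer in papers[p]["cited_by"] if citer != p)

def citations_per_author(p):
    return total_citations(p) / papers[p]["authors"]

print(total_citations("A"), citations_minus_self("A"), citations_per_author("A"))
# 4 3 2.0
```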

Gaming Citations

• Least/Minimum Publishable Unit

– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC, the Strongly Connected Component (the core): you can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)

Review and Conclusions 67
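Extracting (href, anchor text) pairs, the raw material for both intuitions, can be sketched with Python's standard-library HTML parser; the URL in the example is a placeholder.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML string."""
    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:       # only collect text inside <a>...</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

p = AnchorCollector()
p.feed('See the <a href="http://example.com/hg803">HG803 Course Page</a>.')
print(p.links)   # [('http://example.com/hg803', 'HG803 Course Page')]
```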

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

– Simplest measure: each article gets one vote (not very accurate)

• On the web: citation frequency = inlink count

– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

– An article's vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web:

– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

– For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: “jumping” from a dead end is independent of the teleportation rate

Review and Conclusions 70
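The random walk with teleportation can be sketched as power iteration on a tiny invented graph; dead ends jump uniformly, as described above.

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power-iteration PageRank with teleportation on a dict-of-lists graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if links[p]:                  # follow a link, or teleport (10%)
                for q in links[p]:
                    new[q] += (1 - teleport) * rank[p] / len(links[p])
                for q in pages:
                    new[q] += teleport * rank[p] / n
            else:                         # dead end: jump uniformly
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": []}   # C is a dead end
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})   # ranks sum to 1
```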

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: “scrape” search-engine results pages or other sources of content and create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

– Google does not index the target of a link marked nofollow
– Yahoo! does not include the link in its ranking
– …

74

Current Status

• There is a continuous battle between:

– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a persistent identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

– DOI: 10.1007/s10579-008-9062-z
http://www.springerlink.com/content/v7q114033401th5u/

Review and Conclusions 76
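Since DOIs resolve through the central doi.org proxy, turning a DOI into a clickable URL is a one-liner:

```python
def doi_url(doi):
    """Build a resolvable URL for a DOI via the central doi.org proxy."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```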

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

– Passively, for information access
– Actively, for collaboration

• The internet is interesting in and of itself

– As a source of data about existing language
– As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics (pre-requisite HG2051 waived): an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Prosody of Emotion

• Excitement: fast, very high pitch, loud

• Hot anger: fast, high pitch, strong falling accent, loud

• Fear: jitter

• Sarcasm: prolonged accent, late peak

• Sad: slow, low pitch

The main determinant of “naturalness” in speech synthesis is not “voice quality” but natural-sounding prosody (intonation and duration).

Richard Sproat

Review and Conclusions 19

Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students; various studies)

Reading    300    200 (proofreading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

• Reading ≫ Speaking; Hearing ≫ Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

• All share some characteristics of speech and text

• Usage norms not fixed

• Communication methods may disappear before the norms are fixed:
Usenet, Bulletin Boards, Gopher, Archie, …

• Large-scale discourse studies still to be done

• Some genuinely new things:

– time-lagged multi-person conversation
– raw, un-edited text
– extreme multi-authorship

Review and Conclusions 23

Email

Speech-like             Text-like
time-bound              space-bound (deletable)
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time-bound              space-bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                      Text-like
time-bound                       space-bound
spontaneous                      contrived
face-to-face                     visually decontextualized
loosely structured               elaborately structured
socially interactive (comments)  factually communicative
immediately revisable            repeatedly revisable
prosodically rich                graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

– <rant>I can't stand this</rant>
– I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
– lusers (users, as seen by Systems Administrators)
– suits

• Need to be in-group

cow orker = coworker
clue: “You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance.” Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

– Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

– GREAT, *great*, _great_
– :-) (^_^) (o)
– brb, RTFM, IMHO
– lower case
– lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

– every time a file is opened, a new copy is stored

• CVS, Subversion, Git

– changes to a collection of files are tracked
– simultaneous changes are merged

• Revision tracking

– Revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge;
people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

• Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of markup: visual vs logical
WYSIWYG, WYSIAYG, WYSIWIM

• The structure of the Web: hypertext
pages linked to pages

• The future of the Web

• Linguistic features of the web:
un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (presentational)

– What you see is what you get (WYSIWYG)
– Equivalent of printers' markup
– Shows what things look like

• Logical Markup (structural)

– Shows the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
(Objection: results are not reliable)

– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow us to draw conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

– known size and exact hit counts → statistics possible
– people can draw conclusions over the included text types
– (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …
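In HTML markup the hint looks like this (the URL is an invented placeholder):

```html
<!-- An ordinary link: counts as an endorsement for ranking -->
<a href="http://example.com/page">a normal link</a>

<!-- A nofollow link: search engines are told not to let it
     influence the target page's ranking -->
<a href="http://example.com/page" rel="nofollow">a spam-proofed link</a>
```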

74

Current Status

atilde There is a continuous battle between

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z
  http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press.

ã Sproat, R. (2010) Language, Technology and Society. Oxford University Press.

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition.

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Speed is different for different modalities

Speed in words per minute (one word is 6 characters)
(English computer science students, various studies)

Reading    300    200 (proof reading)
Writing     31     21 (composing)
Speaking   150
Hearing    150    210 (speeded up)
Typing      33     19 (composing)

ã Reading ≫ Speaking, Hearing ≫ Typing

⇒ Speech for input
⇒ Text for output

Review and Conclusions 20

The Telephone

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

ã Communication methods may disappear before the norms are fixed:
Usenet, Bulletin Boards, Gopher, Archie, …

ã Large-scale discourse studies still to be done

ã Some genuinely new things

â time-lagged multi-person conversation
â raw, un-edited text
â extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time-bound               space-bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                       Text-like
time-bound                        space-bound
spontaneous                       contrived
face-to-face                      visually decontextualized
loosely structured                elaborately structured
socially interactive (comments)   factually communicative
immediately revisable             repeatedly revisable
prosodically rich                 graphically rich

Review and Conclusions 27

Netspeak

ã Inspired by shared background

â <rant>I can't stand this</rant>
â I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
â lusers (users as seen by Systems Administrators)
â suits

ã Need to be in-group

cow orker: Coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)

Review and Conclusions 28

ã Gricean Principles

â Posting (top/middle/bottom quoting, and trimming)

ã Inspired by medium limitations

â GREAT, great, *great*
â :-) (^_^) (o)
â brb, RTFM, IMHO
â lower case
â lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

ã Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

ã Versioning file systems

â every time a file is saved, a new copy is stored

ã CVS, Subversion, Git

â changes to a collection of files are tracked
â simultaneous changes are merged

ã Revision Tracking

â Revisions are stored within a file

ã Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge; people like to do this

ã Most edits are done by insiders

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability, and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

ã The structure of Markup: Visual vs Logical
WYSIWYG, WYSIAYG, WYSIWIM

ã The structure of the Web — hypertext: pages linked to pages

ã The future of the Web

ã Linguistic features of the web:
un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (e.g. Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers' markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: Search Engine as Query tool and WWW as corpus
(Objection: Results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow one to draw conclusions about the data
⊗ Google is missing functionality that linguists and lexicographers would like to have

ã Web Sample: Use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, Scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than):
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
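The ratio comparison above is simple arithmetic; a quick sketch using the counts from the slide:

```python
# Hit counts for "different from" vs "different than" (accessed 2012-04-04)
different_from = 251_000_000
different_than = 71_500_000

# Unitless ratio: more stable across time than the raw counts
ratio = different_from / different_than
print(f"{ratio:.1f}:1")
```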

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6,000)

3 Query a search engine and retrieve the URLs (50,000)

4 Download the files from the URLs (100,000,000 words)

5 Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47
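Steps 1–2 of the pipeline (seed words and queries) can be sketched offline. The seed list, tuple size, and `make_queries` helper below are invented for illustration; BootCaT-style systems sample thousands of such tuples and then send them to a search engine.

```python
import itertools
import random

def make_queries(seeds, tuple_size=3, n_queries=10, rng=None):
    """Combine seed words into search-engine queries (step 2 above)."""
    rng = rng or random.Random(0)  # fixed seed: reproducible sample
    combos = list(itertools.combinations(seeds, tuple_size))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

# Toy seed list; a real run would use ~500 seed words
seeds = ["language", "corpus", "internet", "technology", "speech",
         "writing", "markup", "hypertext"]
for q in make_queries(seeds, n_queries=5):
    print(q)
```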

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are non-intended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
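The character n-gram approach can be sketched in a few lines. The `identify` helper and the two toy training snippets below are invented, and far too small for real use:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character trigrams, padded with spaces at the edges."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def identify(text, profiles):
    """Pick the language whose n-gram profile overlaps most with the text."""
    grams = char_ngrams(text)
    def overlap(profile):
        return sum(min(c, profile[g]) for g, c in grams.items())
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# Toy training data: real systems train on large corpora per language
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "nl": char_ngrams("de snelle bruine vos springt over de luie hond en de kat"),
}
print(identify("the dog and the fox", profiles))
```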

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
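The producer/produces → produc behaviour above can be illustrated with a toy suffix-stripping stemmer. Real systems use something like the Porter stemmer (e.g. NLTK's `PorterStemmer`); `toy_stem` is an invented simplification handling only the suffixes shown on the slide:

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative only)."""
    for suffix in ("es", "er", "ing", "s"):
        # keep at least a three-letter stem
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("produces", "producer", "producing", "produce"):
    print(w, "->", toy_stem(w))
```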

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes possible integrating multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
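Checking well-formedness needs only the standard library; the sample documents and the `well_formed` helper are invented for illustration:

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<note><to>Tove</to><from>Jani</from></note>"))  # True
print(well_formed("<note><to>Tove</note>"))  # False: mismatched tags
```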

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
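The triple model itself can be illustrated without a full RDF library (rdflib would be the usual Python choice). All the URIs below are invented examples:

```python
# A tiny in-memory triple store: every statement is <subject, predicate, object>
triples = {
    ("http://example.com/HG2052",
     "http://example.com/rel/taughtBy",
     "http://example.com/person/bond"),
    ("http://example.com/HG2052",
     "http://example.com/rel/topic",
     "http://example.com/concept/internet"),
}

def objects(subject, predicate):
    """Query: all objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects("http://example.com/HG2052", "http://example.com/rel/topic"))
```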

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible: know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the citing paper
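The simpler scores above are straightforward arithmetic over per-paper counts. A sketch with invented data; `scores` is a hypothetical helper, not a standard bibliometric implementation:

```python
def scores(papers):
    """Compute three of the simple metrics above for one author's papers.

    papers: list of dicts with 'citations', 'self_citations', 'n_authors'.
    """
    total = sum(p["citations"] for p in papers)
    minus_self = total - sum(p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return total, minus_self, per_author

# Invented example record for an author with two papers
papers = [
    {"citations": 40, "self_citations": 5, "n_authors": 2},
    {"citations": 10, "self_citations": 2, "n_authors": 1},
]
print(scores(papers))
```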

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path.to.the.page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 22: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

The Telephone

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 21

New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

atilde All share some characteristics of speech and text

atilde Usage norms not fixed

atilde Communication methods may disappear before the norms are fixedUsenet Bulletin Boards Gopher Archie hellip

atilde Large scale discourse studies still to be done

atilde Some genuinely new things

acirc time-lagged multi-person conversationacirc raw un-edited textacirc extreme multi-authorship

Review and Conclusions 23

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than):
  251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
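The ratio computation is trivial but worth making concrete; a minimal sketch using the slide's own counts (which, as the slide warns, are snapshots from 2012-04-04 and would differ today):

```python
# Hit counts for "different from" vs "different than", as reported
# by a search engine on 2012-04-04 (figures taken from the slide).
hits_from = 251_000_000   # ghits for "different from"
hits_than = 71_500_000    # ghits for "different than"

# A unitless ratio is more stable than the raw counts themselves.
ratio = hits_from / hits_than
print(f"different from : different than = {ratio:.1f}:1")  # 3.5:1
```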

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with the available tools (here, taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  – do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
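Steps 1–2 of the outline can be sketched in a few lines. This is a toy illustration of BootCaT-style tuple queries with made-up seed words; the real pipeline would send each query to a search engine and then download and post-process the hits:

```python
import itertools
import random

def make_queries(seed_words, tuple_size=3, n_queries=10):
    """Step 2: combine seed words into random query tuples.
    Fixed seed for reproducibility; a real run would randomize."""
    random.seed(0)
    pool = list(itertools.combinations(seed_words, tuple_size))
    return [" ".join(q) for q in random.sample(pool, min(n_queries, len(pool)))]

# Step 1: a (tiny, invented) seed word list; real ones have ~500 words
seeds = ["language", "technology", "internet", "corpus", "web"]
queries = make_queries(seeds)
print(len(queries), "queries, e.g.:", queries[0])
```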

Internet Corpora Summary

• The web can be used as a corpus

  – Direct access
    ∗ fast and convenient
    ∗ huge amounts of data
    ✗ unreliable counts

  – Web sample
    ∗ control over the sample
    ∗ some setup costs (semi-automated)
    ✗ less data

• Richer data than a compiled corpus

✗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – when they are accurate, they are very good
  – they are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations – Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
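The character n-gram ranking approach can be sketched as a toy version of Cavnar and Trenkle's out-of-place measure. The "training" profiles below are single invented sentences, so this only illustrates the mechanics; real systems build profiles from large samples per language:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character trigram counts, with padding so word edges count."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def rank_distance(profile_a, profile_b, top=50):
    """Out-of-place distance between two n-gram rank profiles:
    sum of rank differences, with a fixed penalty for missing n-grams."""
    rank_b = {g: r for r, (g, _) in enumerate(profile_b.most_common(top))}
    dist = 0
    for r_a, (g, _) in enumerate(profile_a.most_common(top)):
        dist += abs(r_a - rank_b[g]) if g in rank_b else top
    return dist

# Tiny invented training profiles (real ones use much more text)
profiles = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "de": char_ngrams("der schnelle braune fuchs springt über den faulen hund und die katze"),
}
test = char_ngrams("the dog and the fox")
guess = min(profiles, key=lambda lang: rank_distance(test, profiles[lang]))
print(guess)
```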

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel

Review and Conclusions 54
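Number normalization is easy to make concrete. A minimal sketch handling just the two dollar formats from the slide (real normalizers cover many more patterns, currencies and locales):

```python
import re

def normalize_money(text):
    """Toy normalizer: rewrite '$700K' and '$0.7 million' as plain dollars."""
    text = re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: f"${round(float(m.group(1)) * 1_000):,}", text)
    text = re.sub(r"\$(\d+(?:\.\d+)?) million",
                  lambda m: f"${round(float(m.group(1)) * 1_000_000):,}", text)
    return text

print(normalize_money("It cost $700K."))          # It cost $700,000.
print(normalize_money("It cost $0.7 million."))   # It cost $700,000.
```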

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – it is a markup language, much like HTML
  – it was designed to carry data, not to display data
  – its tags are not predefined: you must define your own tags
  – it is designed to be self-descriptive
  – XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58
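Checking syntactic well-formedness (as opposed to validity against a schema) is a one-liner with the standard library; the XML snippets here are invented examples:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the text parses as XML: every open tag has a matching
    close, properly nested. Says nothing about schema validity."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<note><to>Kim</to></note>"))   # True
print(is_well_formed("<note><to>Kim</note></to>"))   # False: tags cross
```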

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations in ontologies (OWL)

  – Then reason over them, search them, …

Review and Conclusions 59
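The triple model itself is simple enough to sketch without any RDF library. The `ex:` names below are hypothetical URIs invented for illustration; real RDF would use full URIs and a library such as rdflib:

```python
# Toy triple store: RDF-style <subject, predicate, object> statements.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:title", "Language, Technology and the Internet"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [(s2, p2, o2) for (s2, p2, o2) in triples
            if s in (None, s2) and p in (None, p2) and o in (None, o2)]

print(query(s="ex:HG2052", p="ex:taughtBy"))
# [('ex:HG2052', 'ex:taughtBy', 'ex:FrancisBond')]
```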

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission: Impossible – know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total of (citations / number of authors)
  – Average of (citation impact factor / number of authors)

• Problems:

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the paper

Review and Conclusions 64
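The first three scores above can be made concrete in a few lines. The citation figures here are invented for illustration:

```python
# Hypothetical citation records for one author's three papers.
papers = [
    {"citations": 120, "self_citations": 10, "authors": 3},
    {"citations": 45,  "self_citations": 5,  "authors": 2},
    {"citations": 8,   "self_citations": 2,  "authors": 1},
]

total = sum(p["citations"] for p in papers)
total_minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
# Split each paper's citations equally among its co-authors.
share = sum(p["citations"] / p["authors"] for p in papers)

print(total, total_minus_self, round(share, 1))  # 173 156 70.5
```

Note how the three scores rank the same record differently: dividing by the number of authors penalises heavily co-authored papers, which is exactly the kind of design choice that invites gaming.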

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core – you can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote – not very accurate

• On the web: citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N
  (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
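The random walk with teleportation can be computed by simple power iteration. The four-page link structure below is made up for illustration:

```python
def pagerank(links, n, teleport=0.10, iters=100):
    """Power iteration for PageRank with a 10% teleportation rate.
    links maps each page to its list of outlink targets."""
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [teleport / n] * n          # teleportation mass, spread evenly
        for page, outs in links.items():
            if outs:                      # spread 90% of rank along outlinks
                share = (1 - teleport) * rank[page] / len(outs)
                for q in outs:
                    new[q] += share
            else:                         # dead end: jump uniformly (all mass)
                for q in range(n):
                    new[q] += rank[page] / n
        rank = new
    return rank

# Toy graph: everyone links (directly or not) to page 2; nobody links to page 3
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks = pagerank(links, 4)
print([round(r, 3) for r in ranks])
```

Page 3 has no inlinks, so it ends up with only the teleportation mass (0.1/4 = 0.025), while page 2, which everyone points at, gets the highest rank.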

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, instructing some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74

Current Status

• There is a continuous battle between:

  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – passively, for information access
  – actively, for collaboration

• The internet is interesting in and of itself

  – as a source of data about existing language
  – as a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics – (pre-req HG251, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


New Modalities

Review and Conclusions 22

Email Usenet Chat and Blogs

• All share some characteristics of speech and text

• Usage norms are not fixed

• Communication methods may disappear before the norms are fixed
  (Usenet, Bulletin Boards, Gopher, Archie, …)

• Large-scale discourse studies are still to be done

• Some genuinely new things:

  – time-lagged multi-person conversation
  – raw, un-edited text
  – extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time-bound               space-bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time-bound               space-bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                       Text-like
time-bound                        space-bound
spontaneous                       contrived
face-to-face                      visually decontextualized
loosely structured                elaborately structured
socially interactive (comments)   factually communicative
immediately revisable             repeatedly revisable
prosodically rich                 graphically rich

Review and Conclusions 27

Netspeak

• Inspired by shared background

  – <rant>I can't stand this</rant>
  – I hate^H^H^H^Hlove that idea (^H is backspace on a VT100)
  – lusers (users, as seen by Systems Administrators)
  – suits

• Need to be in-group

cow orker: Coworker
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)

Review and Conclusions 28

• Gricean Principles

  – Posting (top/middle/bottom quoting and trimming)

• Inspired by medium limitations

  – GREAT, *great*, _great_
  – :-) (^_^) (o)
  – brb, RTFM, IMHO
  – lower case
  – lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

• Version Control Systems

• Wikipedia

• Licensing and Ownership

30

Version Control Systems

• Versioning file systems

  – every time a file is opened, a new copy is stored

• CVS, Subversion, Git

  – changes to a collection of files are tracked
  – simultaneous changes are merged

• Revision Tracking

  – revisions are stored within a file

• Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

• The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

• Wikipedia makes it easy to share your knowledge;
  people like to do this

• Most edits are done by insiders

• Most content is added by outsiders

• Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

• Content policies: NPOV, Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article?

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

• The structure of markup: Visual vs. Logical
  (WYSIWYG, WYSIAYG, WYSIWIM)

• The structure of the Web – hypertext:
  pages linked to pages

• The future of the Web

• Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38


atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Email, Usenet, Chat and Blogs

ã All share some characteristics of speech and text

ã Usage norms not fixed

ã Communication methods may disappear before the norms are fixed
   (Usenet, Bulletin Boards, Gopher, Archie, …)

ã Large scale discourse studies still to be done

ã Some genuinely new things:

â time-lagged multi-person conversation
â raw, un-edited text
â extreme multi-authorship

Review and Conclusions 23

Email

Speech-like              Text-like
time bound               space bound (deletable)
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

ã Inspired by shared background

â <rant>I can’t stand this</rant>
â I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
â lusers (users, as seen by Systems Administrators)
â suits

ã Need to be in-group

cow orker: Coworker
clue: “You couldn’t get a clue during the clue mating season in a field full of horny
clues if you smeared your body with clue musk and did the clue mating dance.”
— Edward Flaherty (talk.bizarre)

Review and Conclusions 28

ã Gricean Principles

â Posting (top/middle/bottom quoting and trimming)

ã Inspired by medium limitations

â GREAT, great, great
â :-) (^_^) (o)
â brb, RTFM, IMHO
â lower case
â lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

ã Version Control Systems

ã Wikipedia

ã Licensing and Ownership

30

Version Control Systems

ã Versioning file systems

â every time a file is opened, a new copy is stored

ã CVS, Subversion, Git

â changes to a collection of files are tracked
â simultaneous changes are merged

ã Revision Tracking

â Revisions are stored within a file

ã Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge, and people like to do this

ã Most edits are done by insiders

ã Most content is added by outsiders

ã Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

Review and Conclusions 34

What is a good article?

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

ã The structure of Markup: Visual vs Logical
   (WYSIWYG, WYSIAYG, WYSIWIM)

ã The structure of the Web — hypertext:
   pages linked to pages

ã The future of the Web

ã Linguistic features of the web:
   un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or in specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers’ markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: Search Engine as Query tool and WWW as corpus
   (Objection: results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow us to draw conclusions about the data
⊗ Google is missing functionality that linguists and lexicographers would like to have

ã Web Sample: use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help – but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or “google hit” is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or “web hit” is the more general term

ã Normally you compare two phenomena to get a unitless ratio
   (e.g. different from vs different than):
   251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB, for “Ghits per billion documents”, is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
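The ratio above can be reproduced in a couple of lines (the counts are the ones on this slide):

```python
# Reproducing the unitless ratio above (counts from this slide).
from_hits = 251_000_000   # "different from"
than_hits = 71_500_000    # "different than"
ratio = from_hits / than_hits
print(f"{ratio:.1f}:1")   # prints 3.5:1
```

Reporting the ratio rather than the raw counts is what makes the comparison survive changes in index size.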

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
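Steps 1–2 above can be sketched as follows (the seed words here are invented placeholders, and the sampling scheme is just one simple choice):

```python
import itertools
import random

# A sketch of steps 1-2: combine seed words into multi-word
# search-engine queries. The seed words are invented placeholders;
# real runs use hundreds of mid-frequency words per language.
seeds = ["language", "internet", "technology", "corpus", "web"]

def make_queries(seed_words, per_query=3, n_queries=6, rng=None):
    """Sample word combinations and join them into query strings."""
    rng = rng or random.Random(0)
    combos = list(itertools.combinations(seed_words, per_query))
    return [" ".join(c) for c in rng.sample(combos, n_queries)]

queries = make_queries(seeds)
```

Each query is then sent to a search engine (step 3), and the returned URLs are downloaded and postprocessed.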

Internet Corpora Summary

atilde The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are non-intended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
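The character n-gram ranking approach above can be sketched in a few lines, in the spirit of Cavnar and Trenkle's rank-order ("out-of-place") method; the training texts below are tiny invented samples, so this only illustrates the mechanics:

```python
from collections import Counter

# Toy character-n-gram language identification: build a ranked
# n-gram profile per language, then pick the language whose profile
# is closest to the document's profile.
def profile(text, n=3, top=50):
    """Most frequent character n-grams, ordered by frequency."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank displacements; absent n-grams get a max penalty."""
    dist = 0
    for rank, g in enumerate(doc_profile):
        if g in lang_profile:
            dist += abs(rank - lang_profile.index(g))
        else:
            dist += len(lang_profile)
    return dist

langs = {
    "en": profile("the cat sat on the mat with the other cats"),
    "de": profile("die katze sass auf der matte mit den anderen katzen"),
}
doc = profile("the dog and the cat")
guess = min(langs, key=lambda lang: out_of_place(doc, langs[lang]))
```

With realistic training data (thousands of characters per language) the same scheme is surprisingly accurate, but it degrades on exactly the hard cases listed above: short snippets, similar languages, and mixed text.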

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel

Review and Conclusions 54
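Several of these steps can be chained into a toy pipeline (the stop-word and suffix lists below are illustrative only; real systems use resources such as the Porter stemmer):

```python
import re

# A toy normalization pipeline: lowercase, strip stop words, and
# apply a naive suffix stemmer. The STOP and SUFFIXES lists are
# invented for illustration.
STOP = {"the", "a", "an", "of"}
SUFFIXES = ("ers", "er", "es", "s")

def stem(word):
    """Strip the first matching suffix, keeping a minimal stem."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[:-len(suf)]
    return word

def normalize(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP]

out = normalize("The producer produces the products")
```

Note how stemming conflates "producer" and "produces" to the same non-word stem "produc", whereas lemmatization would map each to a real dictionary form.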

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness

Review and Conclusions 58
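Well-formedness checking is mechanical; a sketch with Python's standard xml.etree parser:

```python
import xml.etree.ElementTree as ET

# A parse failure means the document is not well-formed XML.
def well_formed(doc: str) -> bool:
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

ok = well_formed("<course><code>HG2052</code></course>")
bad = well_formed("<course><code>HG2052</course>")  # mismatched tag
```

Well-formedness (every tag properly nested and closed) is distinct from validity, which additionally checks the document against a schema or DTD.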

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …

Review and Conclusions 59
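The triple model can be illustrated with plain tuples; the "ex:" names below are invented stand-ins for full URIs, and match() is a toy stand-in for a real query language such as SPARQL:

```python
# RDF data is a set of <subject, predicate, object> triples.
# The "ex:" names are invented placeholders for full URIs.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:topic", "ex:Internet"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching the pattern (None = wildcard)."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}

about_course = match(s="ex:HG2052")
```

Because every statement has the same shape, triples from multiple sources can simply be unioned into one set and queried together, which is the "web of data" idea above.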

Criticism of the Semantic Web

Doctorow’s seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible: know thyself

5 Schemas aren’t neutral

6 Metrics influence results

7 There’s more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structured data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by ‘good’ papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper

Review and Conclusions 64
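The first three scores above can be computed directly once the citations are listed; a toy sketch with invented data for one paper:

```python
# Toy citation scores for one invented paper with two authors.
# Each citing paper is marked for whether its author list overlaps
# the cited paper's (i.e. whether it is a self-citation).
citations = [
    {"self_citation": True},
    {"self_citation": False},
    {"self_citation": False},
]
n_authors = 2

total = len(citations)                          # one vote per citation
minus_self = sum(1 for c in citations
                 if not c["self_citation"])     # self-citations removed
per_author = total / n_authors                  # author-normalized count
```

The weighted variant, where each citing paper's vote is scaled by its own quality, is exactly the recursion that PageRank formalizes on the next slides.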

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language, Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article’s vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
   what proportion of the time someone will be there

ã This long-term visit rate is the page’s PageRank

ã PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (each with probability 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: “jumping” from a dead end is independent of the teleportation rate

Review and Conclusions 70
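The teleporting random walk above can be sketched as a small power iteration (assuming numpy; the three-page graph is invented, and dead ends jump uniformly, as described):

```python
import numpy as np

def pagerank(links, teleport=0.10, iters=100):
    """Power iteration for PageRank with teleporting.

    links maps each page to its outgoing links; a dead end
    (no outlinks) jumps to every page uniformly.
    """
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    P = np.zeros((n, n))          # P[i, j] = prob. of moving i -> j
    for p, outs in links.items():
        i = idx[p]
        if not outs:              # dead end: jump anywhere
            P[i, :] = 1.0 / n
        else:
            P[i, :] = teleport / n                  # teleport share
            for q in outs:                          # follow-a-link share
                P[i, idx[q]] += (1.0 - teleport) / len(outs)
    r = np.full(n, 1.0 / n)       # start from a random page
    for _ in range(iters):        # iterate towards the steady state
        r = r @ P
    return dict(zip(pages, r))

ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": []})
```

The returned values are the long-term visit rates: they sum to one, and pages with more (and better-ranked) inlinks, like C here, end up with higher PageRank.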

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index

â Google does not index the target of a link marked nofollow
â Yahoo! does not include the link in its ranking
â …

74
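A crawler-side sketch of how nofollow links can be separated out when harvesting anchors, using Python's standard html.parser (the URLs below are invented):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect hyperlinks, separating out rel="nofollow" ones."""

    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        href = d.get("href")
        if href is None:
            return
        # rel may hold several space-separated tokens,
        # e.g. rel="external nofollow"
        rel = (d.get("rel") or "").split()
        (self.nofollowed if "nofollow" in rel
         else self.followed).append(href)

page = ('<a href="http://example.com/a">good</a> '
        '<a rel="nofollow" href="http://example.com/spam">spam</a>')
parser = LinkCollector()
parser.feed(page)
```

A link-analysis system can then simply skip the nofollowed list when propagating endorsement, which is the point of the attribute.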

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z
   → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76
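Because resolution goes through the central resolver at doi.org, turning a DOI name into a persistent link is just string concatenation (a sketch; the DOI is the one cited above):

```python
# The central DOI resolver forwards https://doi.org/<name> to the
# current location stored in the DOI metadata, so the link stays
# valid even when the target URL changes.
def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI name."""
    return "https://doi.org/" + doi

link = doi_url("10.1007/s10579-008-9062-z")
```

This is why citing a DOI is more robust than citing a publisher URL: the publisher URL above may break, while the resolver link keeps working as long as the metadata is maintained.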

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python;
   introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req HG251, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 25: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Email

Speech like Text liketime bound space bound (deletable)spontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 24

Usenet (asynchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactive factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 26

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
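The arithmetic behind these units is trivial but worth making explicit. A minimal sketch; the ratio uses the counts from the slide, while the hit count and 40-billion-document index size in the GPB call are invented illustrations:

```python
def ratio(hits_a, hits_b):
    """Unitless ratio between two web counts."""
    return hits_a / hits_b

def gpb(hits, index_size):
    """Ghits per billion documents, given an (estimated) index size."""
    return hits / (index_size / 1_000_000_000)

# "different from" vs "different than", counts as accessed 2012-04-04
print(round(ratio(251_000_000, 71_500_000), 1))  # -> 3.5

# 1,000 hits against a hypothetical 40-billion-document index
print(gpb(1_000, 40_000_000_000))  # -> 25.0
```

Because both counts come from the same index at the same time, the ratio stays meaningful even as the index grows.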

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
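Steps 1 and 2 above can be sketched as follows, in the style of BootCaT-like tools; the seed list, tuple size and counts are toy values, not the ones Sharoff used:

```python
import itertools
import random

def make_queries(seeds, tuple_size=3, n_queries=10, rng_seed=0):
    """Combine seed words into random query tuples for a search engine."""
    rng = random.Random(rng_seed)                      # deterministic for the demo
    combos = list(itertools.combinations(seeds, tuple_size))
    rng.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

seeds = ["language", "technology", "internet", "corpus", "markup", "citation"]
queries = make_queries(seeds, n_queries=5)
print(len(queries))  # -> 5
```

Each query string would then be sent to a search-engine API (step 3) and the returned URLs downloaded (step 4).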

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ unreliable counts

  - Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

- Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are not intended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations: Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words

  - Similarity-based categorisation and classification
    ∗ Character n-gram rankings

  - Machine Learning based methods
  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
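The character n-gram ranking approach can be sketched with a simplified variant of Cavnar and Trenkle's out-of-place measure; the reference sentences are toy stand-ins for real training data:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in a text."""
    text = " " + text.lower().replace("\n", " ") + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(profile, reference):
    """Sum each n-gram's position in the reference ranking, with a maximum
    penalty for unseen n-grams (a simplification of out-of-place distance)."""
    penalty = len(reference)
    return sum(reference.index(g) if g in reference else penalty
               for g in profile)

def identify(text, references):
    """Pick the reference language whose profile is closest to the text's."""
    profile = ngram_profile(text)
    return min(references, key=lambda lang: distance(profile, references[lang]))

refs = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog " * 20),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund " * 20),
}
print(identify("the dog jumps over the fox", refs))  # -> en
```

Note how badly this degrades on short snippets: with only a handful of n-grams, the distances to different reference profiles converge, which is exactly the difficulty the slide mentions.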

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel

Review and Conclusions 54
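Number normalization can be sketched with a few regular expressions; this toy version covers only the three money expressions on the slide, where a real normalizer would need far more patterns:

```python
import re

def normalize_number(s):
    """Map a few money expressions onto a plain dollar amount (toy coverage)."""
    s = s.lower().replace(",", "").strip()
    if m := re.fullmatch(r"\$(\d+(?:\.\d+)?)k", s):
        return round(float(m.group(1)) * 1_000)
    if m := re.fullmatch(r"\$(\d+)", s):
        return int(m.group(1))
    if m := re.fullmatch(r"(\d+(?:\.\d+)?) million dollars", s):
        return round(float(m.group(1)) * 1_000_000)
    raise ValueError("unrecognised: " + s)

print(normalize_number("$700K"), normalize_number("$700,000"),
      normalize_number("0.7 million dollars"))  # -> 700000 700000 700000
```

Once the three surface forms map to the same value, a search or count over the corpus treats them as the same expression.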

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness

Review and Conclusions 58
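Checking syntactic well-formedness needs nothing more than an XML parser; a minimal sketch using Python's standard library (the toy tags are invented):

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # -> True
print(well_formed("<course><code>HG2052</course>"))         # -> False (mismatched tag)
```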

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, …

Review and Conclusions 59
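The triple model can be sketched with plain Python tuples rather than a real RDF library; the ex: identifiers are invented stand-ins for full URIs:

```python
# Each statement is a <subject, predicate, object> triple of identifiers.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [(s2, p2, o2) for (s2, p2, o2) in triples
            if s in (None, s2) and p in (None, p2) and o in (None, o2)]

print(len(query(p="ex:worksAt")))  # -> 1
```

Reasoning then amounts to chaining such queries: anything that is taught by someone who works at ex:NTU can be inferred to be connected to ex:NTU, without that triple being stated explicitly.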

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible - know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight Citations by Quality of the paper

Review and Conclusions 64
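The first few scores above can be sketched as simple aggregations over a list of papers; the figures are invented:

```python
# Each paper: (citations, authors, self_citations) - invented figures.
papers = [(120, 2, 10), (45, 3, 5), (8, 1, 0)]

total = sum(c for c, _, _ in papers)                 # Total Number of Citations
total_minus_self = sum(c - s for c, _, s in papers)  # ... minus Self-citations
per_author = sum(c / a for c, a, _ in papers)        # Total of (Citations / Authors)

print(total, total_minus_self, per_author)  # -> 173 158 83.0
```

The per-author variant already changes the ranking of the three papers' contributions, which is one reason the choice of score matters.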

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Component - within it, one can travel from any page to any other page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote - not very accurate

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting - to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
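The random walk with teleportation can be sketched as a power iteration over a tiny invented three-page graph, using the 10% teleportation rate from the slide:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration for PageRank. links[i] lists the pages page i points to."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [0.0] * n
        for i, outs in enumerate(links):
            if outs:
                for j in range(n):            # teleport: 10% spread over all pages
                    new[j] += teleport * rank[i] / n
                for j in outs:                # otherwise follow a random outlink
                    new[j] += (1 - teleport) * rank[i] / len(outs)
            else:
                for j in range(n):            # dead end: always jump uniformly
                    new[j] += rank[i] / n
        rank = new
    return rank

# Page 0 links to 1; page 1 links to 0 and 2; page 2 links to 0.
r = pagerank([[1], [0, 2], [0]])
print(round(sum(r), 6))  # -> 1.0 (the ranks form a probability distribution)
print(r.index(max(r)))   # -> 0 (page 0 collects the most inlink mass)
```

Note the two cases match the slide exactly: dead ends redistribute all their mass uniformly regardless of the teleportation rate, while other pages teleport with probability 10%.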

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links

- The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo! does not include the link in its ranking
  - …

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z resolves to
    http://www.springerlink.com/content/v7q114033401th5u/

Review and Conclusions 76
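Resolution itself is just a matter of handing the DOI name to the central resolver, which looks up the current location in the metadata; a minimal sketch:

```python
def doi_url(doi):
    """Build the resolver URL for a DOI name via the central proxy (doi.org)."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# -> https://doi.org/10.1007/s10579-008-9062-z
```

Because only the metadata behind the DOI changes when a publisher reorganises its site, this URL keeps working even after the springerlink.com location above goes stale.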

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (Language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer - solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics - (Pre-req HG251, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Usenet (asynchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 25

Chat (synchronous)

Speech-like             Text-like
time bound              space bound
spontaneous             contrived
face-to-face            visually decontextualized
loosely structured      elaborately structured
socially interactive    factually communicative
immediately revisable   repeatedly revisable
prosodically rich       graphically rich

Review and Conclusions 26

Blogs

Speech-like                        Text-like
time bound                         space bound
spontaneous                        contrived
face-to-face                       visually decontextualized
loosely structured                 elaborately structured
socially interactive (comments)    factually communicative
immediately revisable              repeatedly revisable
prosodically rich                  graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits

- Need to be in-group

cow orker (coworker)
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance." Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT / great / great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is saved, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: explicit responsibility for changes

Review and Conclusions 31
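The change tracking these systems perform can be sketched with Python's standard difflib; the two revisions here are invented:

```python
import difflib

old = ["Wikipedia is an encyclopedia.", "Anyone can edit it."]
new = ["Wikipedia is a free encyclopedia.", "Anyone can edit it."]

# unified_diff yields header lines, hunk markers, then -/+ change lines
diff = list(difflib.unified_diff(old, new, lineterm=""))
changes = [line for line in diff
           if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))]
print(changes)
# -> ['-Wikipedia is an encyclopedia.', '+Wikipedia is a free encyclopedia.']
```

A version control system stores exactly such deltas per revision, together with the author, which is what makes responsibility for each change explicit.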

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge;
  people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five_pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

- The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWYM

- The structure of the Web: hypertext, pages linked to pages

- The future of the Web

- Linguistic features of the web:
  un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802/ 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)


atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 27: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Chat (synchronous)

Speech-like              Text-like
time bound               space bound
spontaneous              contrived
face-to-face             visually decontextualized
loosely structured       elaborately structured
socially interactive     factually communicative
immediately revisable    repeatedly revisable
prosodically rich        graphically rich

Review and Conclusions 26

Blogs

Speech-like                     Text-like
time bound                      space bound
spontaneous                     contrived
face-to-face                    visually decontextualized
loosely structured              elaborately structured
socially interactive (comments) factually communicative
immediately revisable           repeatedly revisable
prosodically rich               graphically rich

Review and Conclusions 27

Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users as seen by Systems Administrators)
  - suits

- Need to be in-group

  cow orker: Coworker
  clue: "You couldn't get a clue during the clue mating season in a field full of
  horny clues if you smeared your body with clue musk and did the clue mating
  dance" Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, great, great
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: Explicit responsibility for changes
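The change tracking described above can be illustrated with the standard library's `difflib`; this is only a sketch of what a revision diff looks like, not how CVS or Git store history internally, and the revision texts are invented:

```python
import difflib

def diff_revisions(old, new, old_name="r1", new_name="r2"):
    """Return a unified diff between two revisions of a text."""
    return list(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_name, tofile=new_name))

r1 = "The internet is useful.\nIt has cats.\n"
r2 = "The internet is useful.\nIt has cats and corpora.\n"

for line in diff_revisions(r1, r2):
    print(line, end="")
```

The `-`/`+` lines make authorship of each change explicit, which is exactly what revision tracking in shared writing provides.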

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every
  single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge; people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

   - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

- Copyright

- Copyleft

- Creative Commons

Review and Conclusions 34

What is a good article

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

- The structure of markup: Visual vs Logical (WYSIWYG, WYSIAYG, WYSIWIM)

- The structure of the Web: hypertext, pages linked to pages

- The future of the Web

- Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query
or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links
them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g.
ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way
(e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by
JavaScript, as well as content dynamically downloaded from Web servers via Flash
or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video)
files, or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Show the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: Search Engine as Query tool and WWW as corpus
  (Objection: Results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow to draw conclusions on the data
  × Google is missing functionalities that linguists / lexicographers would like to have

- Web Sample: Use a search engine to download data from the net and build a corpus
  from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  × sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, Scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller
  2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the
  early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB for "Ghits per billion documents" is good if you want a more stable number
  (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios
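The unitless ratio recommended above is simple arithmetic; a minimal sketch, using the ghit counts quoted for "different from" vs "different than" (the helper name `whit_ratio` is made up):

```python
def whit_ratio(count_a, count_b):
    """Express two web-hit counts as a unitless ratio a:b, with b normalised to 1."""
    return count_a / count_b

# ghits for "different from" vs "different than", as reported on 2012-04-04
ratio = whit_ratio(251_000_000, 71_500_000)
print(f"{ratio:.1f}:1")  # -> 3.5:1
```

Because both counts come from the same index on the same day, the index size cancels out, which is why the ratio is more stable than either raw count.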

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora
  (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here:
    taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long,
    word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine
queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as
Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
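Step 2 of the outline above (combining seed words into queries) can be sketched offline; the seed list is a toy stand-in for the 500 real seeds, and `build_queries` is an illustrative name, not Sharoff's actual code:

```python
from itertools import combinations

def build_queries(seed_words, words_per_query=2):
    """Combine seed words into multi-word search-engine queries."""
    return [" ".join(combo) for combo in combinations(seed_words, words_per_query)]

seeds = ["language", "internet", "corpus", "technology"]  # toy seed list
queries = build_queries(seeds)
print(len(queries))   # C(4,2) = 6 queries
print(queries[0])
```

With 500 seeds, pairs alone give far more than the 6,000 queries needed, so a real pipeline samples from the combinations rather than using them all.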

Internet Corpora Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    × unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    × Less data

- Richer data than a compiled corpus

  × Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word … Metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are non-intended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations: Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it
  (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words
  - Similarity-based categorisation and classification
    * Character n-gram rankings
  - Machine Learning based methods
  - Context (under jp or ko)

- Hard to do for short text snippets, similar languages, mixed text
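The "character n-gram rankings" method above can be sketched in the style of Cavnar and Trenkle's out-of-place measure; the two training sentences are toy stand-ins (real profiles are trained on far more text), and the function names are invented:

```python
from collections import Counter

def ngram_profile(text, n=3, top=200):
    """Ranked list of the most frequent character n-grams in a text."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank displacements; unseen n-grams get a maximum penalty."""
    penalty = len(profile)
    return sum(profile.index(g) if g in profile else penalty
               for g in doc_profile)

def identify(text, profiles):
    """Pick the language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

# toy training data for two language profiles
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog "
                        "this is a test of the emergency broadcast system"),
    "de": ngram_profile("der schnelle braune fuchs springt ueber den faulen hund "
                        "dies ist ein test des systems"),
}
print(identify("this is the test", profiles))  # -> en
```

Even this toy version shows why short snippets and similar languages are hard: with few n-grams, the rank displacements between candidate languages become too close to call.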

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000AD, 1421AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon cel
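Several of the steps above (lowercasing, stop-word stripping, stemming) can be chained in a few lines; the suffix list is a deliberately naive toy, not the Porter stemmer:

```python
def normalize(tokens, stop_words=frozenset({"the", "a", "of"})):
    """Lowercase, strip stop words, and apply a naive suffix-stripping stemmer."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in stop_words:
            continue               # stop-word stripping
        for suffix in ("er", "es", "s"):   # toy stemmer, not Porter
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[:-len(suffix)]
                break
        out.append(tok)
    return out

print(normalize(["The", "producer", "produces", "goods"]))
# -> ['produc', 'produc', 'good']
```

Note how stemming conflates producer and produces to the non-word stem produc, exactly as in the slide, whereas lemmatization would map produces to the dictionary form produce.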

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes possible integrating multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
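The well-formedness check mentioned above is a one-liner with the standard library's XML parser; a minimal sketch (the example documents are invented):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the XML parses, i.e. every open tag has a matching close."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course>"))         # False (unclosed <code>)
```

Well-formedness is purely syntactic; checking that a document also follows a particular schema (validity) needs a DTD or XML Schema on top of this.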

Review and Conclusions 58

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Universal Resource Name: urn:isbn:1575864606
  - Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDFs are written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, …

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible (know thyself)

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of
  the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight Citations by Quality of the paper
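The first three scores above can be computed from a citation graph directly; a minimal sketch over invented data (the representation, papers mapping an id to its author count and the set of citing papers, is made up for illustration):

```python
def citation_scores(papers):
    """Toy citation metrics; papers maps id -> (n_authors, set of citing ids)."""
    scores = {}
    for pid, (n_authors, cited_by) in papers.items():
        total = len(cited_by)
        non_self = len({c for c in cited_by if c != pid})  # drop self-citations
        scores[pid] = {
            "citations": total,
            "minus_self": non_self,
            "per_author": total / n_authors,
        }
    return scores

papers = {                      # invented data
    "A": (2, {"B", "C", "A"}),  # A cites itself once
    "B": (1, {"C"}),
}
s = citation_scores(papers)
print(s["A"])  # {'citations': 3, 'minus_self': 2, 'per_author': 1.5}
```

The "not all citations are equal" problem is what these raw counts cannot capture; fixing it by weighting each vote by the citing paper's own score leads directly to PageRank-style recursion.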

Review and Conclusions 64

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to High Impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core (can travel from any page to any page)

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path.to.there/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator
     of page A

  2. The (extended) anchor text pointing to page B is a good description of page B

     This is not always the case; for instance, most corporate websites have a
     pointer from every page to a page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote (not very accurate)

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page,
    equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total
  number of web pages)

- At a non-dead end, with probability 10%, jump to a random web page (to each with
  a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with
    probability (1 - 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
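The teleporting walk above can be computed by power iteration; a minimal sketch, where the four-page graph is invented and `teleport=0.10` matches the 10% rate in the slide (note the per-link probability is (1 - teleport)/len(out), the slide's 0.225 for 4 links):

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Power iteration for PageRank with teleporting.

    links maps each page to the list of pages it links to;
    a dead end spreads its rank 1/N to every page.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = dict.fromkeys(pages, 0.0)
        for p, out in links.items():
            if not out:                          # dead end: jump anywhere
                for q in pages:
                    nxt[q] += rank[p] / n
            else:                                # teleport, or follow a link
                for q in pages:
                    nxt[q] += teleport * rank[p] / n
                for q in out:
                    nxt[q] += (1 - teleport) * rank[p] / len(out)
        rank = nxt
    return rank

web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
ranks = pagerank(web)
print(round(sum(ranks.values()), 6))   # stays 1.0: a probability distribution
print(max(ranks, key=ranks.get))       # the page with the highest steady-state rate
```

After enough iterations the vector stops changing: that fixed point is the steady-state visit rate of the previous slide.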

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam
  takes advantage of link-based ranking algorithms, which give websites higher
  rankings the more other highly ranked websites link to them. Examples include
  adding links within blogs

- Link Farms: creating tightly-knit communities of pages referencing each other,
  also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content
  and create "content" for a website. The specific presentation of content on these
  sites is unique, but is merely an amalgamation of content taken from other
  sources, often without permission

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing,
  such as wikis, blogs and guestbooks. Agents can be written that automatically
  randomly select a user-edited web page, such as a Wikipedia article, and add
  spamming links

  The nofollow link: a value that can be assigned to the rel attribute of an HTML
  hyperlink to instruct some search engines that the hyperlink should not influence
  the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
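A crawler honouring nofollow checks the rel attribute before counting a link as a vote (so a link like `<a href="..." rel="nofollow">...</a>` carries no endorsement); a minimal sketch with an invented helper name:

```python
def counts_for_ranking(rel_attr):
    """True if a link with this rel attribute should count as a ranking vote."""
    # rel can hold several space-separated values, e.g. "external nofollow"
    return "nofollow" not in (rel_attr or "").split()

print(counts_for_ranking(None))                 # True: ordinary link
print(counts_for_ranking("nofollow"))           # False: discounted
print(counts_for_ranking("external nofollow"))  # False: discounted
```

As the slide notes, different engines discount such links in different ways; this sketch only shows the detection, not any particular engine's policy.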

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies
  coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000
  organizations

  - DOI 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u
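Because the location lives in the DOI metadata, resolving one is just prefixing the central doi.org resolver; the resolver then redirects to the current URL, so links keep working even when the publisher moves the page:

```python
def doi_url(doi):
    """Build the resolver URL for a DOI; doi.org redirects to the current location."""
    return f"https://doi.org/{doi}"

print(doi_url("10.1007/s10579-008-9062-z"))
# -> https://doi.org/10.1007/s10579-008-9062-z
```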

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (Language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An
  Introduction to Natural Language Processing, Computational Linguistics and Speech
  Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language
  Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information
  Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python;
  introduces both programming and linguistics

- HG3051 Corpus Linguistics: (pre-req HG2051, waived) This course is an
  introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 28: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Blogs

Speech like Text liketime bound space boundspontaneous contrivedface-to-face visually decontextualizedloosely structured elaborately structuredsocially interactivecomments factually communicativeimmediately revisable repeatedly revisableprosodically rich graphically rich

Review and Conclusions 27

Netspeak

atilde Inspired by shared background

acirc ltrantgtI canrsquot stand thisltrantgtacirc I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)acirc lusers (users as seen by Systems Administrators)acirc suits

atilde Need to be in-group

cow orker Coworkerclue ldquoYou couldnrsquot get a clue during the clue mating season in a field full of horny

clues if you smeared your body with clue musk and did the clue mating dancerdquoEdward Flaherty (talkbizaare)

Review and Conclusions 28

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine Learning based methods
  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text
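The "character n-gram rankings" approach can be sketched with the classic out-of-place measure of Cavnar and Trenkle: rank each language's most frequent character n-grams, then pick the language whose ranking is closest to the document's. The one-sentence training texts below are toy stand-ins; real profiles are trained on far more data.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams (space-padded)."""
    text = " " + text.lower() + " "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in counts.most_common(top)]

def out_of_place(profile, doc_profile):
    """Sum of rank differences; n-grams missing from the reference
    profile get a fixed maximum penalty."""
    ranks = {g: r for r, g in enumerate(profile)}
    penalty = len(profile)
    return sum(abs(ranks.get(g, penalty) - r) for r, g in enumerate(doc_profile))

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

# Toy training data (real systems train on much larger samples).
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "nl": ngram_profile("de snelle bruine vos springt over de luie hond en de kat"),
}
print(identify("the dog and the fox", profiles))   # -> en
```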

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel
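Number normalization can be sketched as mapping the surface variants above onto a single value. The multiplier table and regular expression below are illustrative only; a production normalizer handles many more patterns (currencies, ranges, ordinals, …).

```python
import re

# Hypothetical multiplier table for suffixes seen in the examples above.
MULT = {"k": 1_000, "m": 1_000_000, "million": 1_000_000, "billion": 1_000_000_000}

def normalize_amount(s):
    """Map variants like '$700K', '$700,000', '0.7 million dollars'
    to one numeric value."""
    s = s.lower().replace(",", "").replace("$", "").replace("dollars", "").strip()
    m = re.fullmatch(r"([\d.]+)\s*([a-z]*)", s)
    number, suffix = float(m.group(1)), m.group(2)
    return int(number * MULT.get(suffix, 1))

for variant in ["$700K", "$700,000", "0.7 million dollars"]:
    print(variant, "->", normalize_amount(variant))   # all -> 700000
```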

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
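Checking syntactic well-formedness (every open tag properly closed and nested) is mechanical; with Python's standard xml.etree parser it is one try/except. The HG2052 strings are just example input.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_string):
    """Syntactic check only: tags nest and close properly.
    (Validity against a DTD or schema is a separate, stronger check.)"""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course></code>"))  # False
```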

Review and Conclusions 58

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers

  - Universal Resource Name: urn:isbn:1575864606
  - Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations in ontologies (OWL)

  - Then reason over them, search them, …
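The triple model can be sketched as a set of <subject, predicate, object> statements queried by pattern matching (real RDF stores use SPARQL for this). The "ex:" identifiers below are invented stand-ins for full URIs.

```python
# A toy triple store: a set of <subject, predicate, object> statements.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(s="ex:HG2052"))     # everything asserted about HG2052
print(match(p="ex:worksAt"))    # every <who, worksAt, where> statement
```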

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible — know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (Pretty Useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight Citations by Quality of the paper
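The first few scores are straightforward arithmetic once citations are counted; a sketch with invented numbers for a hypothetical two-author paper:

```python
def citation_scores(n_authors, citations, self_citations):
    """The simple bibliometric scores listed above, for one paper."""
    return {
        "total": citations,
        "minus_self": citations - self_citations,
        "per_author": citations / n_authors,
    }

# A hypothetical two-author paper cited 40 times, 4 of them self-citations.
print(citation_scores(n_authors=2, citations=40, self_citations=4))
# {'total': 40, 'minus_self': 36, 'per_author': 20.0}
```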

Review and Conclusions 64

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)
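Collecting the anchor text that points at each URL is a small parsing job; a sketch using Python's standard html.parser (the example.com URL and page text are invented):

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Gather the anchor text pointing at each URL; in link analysis
    that text serves as a description of the *target* page."""
    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:      # only collect text inside <a>…</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors[self._href].append("".join(self._text).strip())
            self._href = None

c = AnchorCollector()
c.feed('<p>See the <a href="http://example.com/hg2052">HG2052 Course Page</a>.</p>')
print(dict(c.anchors))   # {'http://example.com/hg2052': ['HG2052 Course Page']}
```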

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: Each article gets one vote – not very accurate

- On the web: citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: "jumping" from a dead end is independent of the teleportation rate
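The random walk with teleportation can be computed by power iteration. The toy implementation below follows the numbers above (10% teleport rate; dead ends jump uniformly with probability 1); it is a didactic sketch, not a production ranker, and the three-page graph is invented.

```python
def pagerank(links, n_iter=100, teleport=0.10):
    """Power iteration for PageRank with teleportation.
    links maps each page to its list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(n_iter):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if not links[p]:                 # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport: 0.1/N to every page
                    new[q] += teleport * rank[p] / n
                for q in links[p]:           # otherwise follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(links[p])
        rank = new
    return rank

# A and B both link only to C (a dead end), so C collects the most rank.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
print(max(ranks, key=ranks.get))   # -> C
```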

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …
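A crawler that honours nofollow simply skips such links when counting votes; a sketch with Python's standard html.parser (the two domains are invented examples):

```python
from html.parser import HTMLParser

class NofollowFilter(HTMLParser):
    """Collect only the links a ranking crawler would count,
    skipping those marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.counted = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").split()   # rel can list several values
        if "nofollow" not in rel and "href" in a:
            self.counted.append(a["href"])

f = NofollowFilter()
f.feed('<a href="http://good.example/">ok</a> '
       '<a rel="nofollow" href="http://spam.example/">spam</a>')
print(f.counted)   # ['http://good.example/']
```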

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u
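Because resolution goes through a central proxy (currently doi.org), turning a DOI into a stable link is trivial, and the link keeps working even when the publisher moves the document:

```python
def doi_url(doi):
    """Build a resolver link for a DOI; the doi.org proxy redirects
    to wherever the publisher currently hosts the document."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```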

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (Language itself)

- Writing (invented 3–5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language Technology and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer — solving NLP problems with Python;
introduces both programming and linguistics

- HG3051 Corpus Linguistics — (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Netspeak

- Inspired by shared background

  - <rant>I can't stand this</rant>
  - I hate^H^H^H^Hlove that idea (^H is backspace on a vt100)
  - lusers (users, as seen by Systems Administrators)
  - suits

- Need to be in-group

cow orker (Coworker)
clue: "You couldn't get a clue during the clue mating season in a field full of horny clues if you smeared your body with clue musk and did the clue mating dance."
Edward Flaherty (talk.bizarre)

Review and Conclusions 28

- Gricean Principles

  - Posting (top/middle/bottom quoting and trimming)

- Inspired by medium limitations

  - GREAT, *great*, _great_
  - :-) (^_^) (o)
  - brb, RTFM, IMHO
  - lower case
  - lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikis

- Version Control Systems

- Wikipedia

- Licensing and Ownership

30

Version Control Systems

- Versioning file systems

  - every time a file is opened, a new copy is stored

- CVS, Subversion, Git

  - changes to a collection of files are tracked
  - simultaneous changes are merged

- Revision Tracking

  - Revisions are stored within a file

- Authorship in shared writing: Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

- The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

- Wikipedia makes it easy to share your knowledge;
people like to do this

- Most edits are done by insiders

- Most content is added by outsiders

- Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

  - Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

- The Internet: more than just the web (email, VoIP, FTP, streaming, messaging, …)

- The structure of Markup: Visual vs Logical
  WYSIWYG, WYSIAYG, WYSIWYM

- The structure of the Web — hypertext:
pages linked to pages

- The future of the Web

- Linguistic features of the web:
un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

- Visual Markup (Presentational)

  - What you see is what you get (WYSIWYG)
  - Equivalent of printers' markup
  - Shows what things look like

- Logical Markup (Structural)

  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

- Direct Query: Search Engine as Query tool and WWW as corpus
(Objection: Results are not reliable)

  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow us to draw conclusions about the data
  ⊗ Google is missing functionalities that linguists and lexicographers would like to have

- Web Sample: Use a search engine to download data from the net and build a corpus from it

  - known size and exact hit counts → statistics possible
  - people can draw conclusions over the included text types
  - (limited) control over the content
  ⊗ sparser data

Review and Conclusions 43

Direct Query

- Accessible through search engines (Google API, Yahoo API, Scripts)

- Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help — but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or "web hit" is the more general term

- Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than):
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman),
but Google no longer releases the index size

- So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
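Reducing two raw counts to a unitless ratio is simple arithmetic, shown here with the slide's own figures (the function name is an illustrative invention):

```python
def whit_ratio(hits_a, hits_b):
    """Reduce two raw web counts to a unitless ratio; absolute ghit
    counts drift with index size, but ratios are more comparable."""
    return hits_a / hits_b

# The slide's counts for "different from" vs "different than" (2012-04-04):
r = whit_ratio(251_000_000, 71_500_000)
print(f"{r:.1f} : 1")   # 3.5 : 1
```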

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
  - do this for several languages and make the corpora available

46


Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 30: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

atilde Gricean Principals

acirc Posting (topmiddlebottom quoting and trimming)

atilde Inspired by medium limitations

acirc GREATgreatgreatacirc -) (^_^) (o) acirc brb RTFM IMHOacirc lower caseacirc lack of punctuation (hard on phone keyboards)

Review and Conclusions 29

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

• Content policies: NPOV, Verifiability and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

• Copyright

• Copyleft

• Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good article criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

• The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

• The structure of Markup: Visual vs Logical; WYSIWYG, WYSIAYG, WYSIWYM

• The structure of the Web — hypertext: pages linked to pages

• The future of the Web

• Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)

– What you see is what you get (WYSIWYG)
– Equivalent of printers' markup
– Shows what things look like

• Logical Markup (Structural)

– Show the structure and meaning
– Can be mapped to visual markup
– Less flexible than visual markup
– More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: Search Engine as Query tool and WWW as corpus (Objection: results are not reliable)

– Population and exact hit counts are unknown → no statistics possible
– Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

– known size and exact hit counts → statistics possible
– people can draw conclusions over the included text types
– (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, Scripts)

• Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

– lots of repetitions of the same text (not representative)
– very limited query precision (no upper/lower case, no punctuation)
– only estimated counts, often hard to reproduce exactly
– different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

– it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio (e.g. different from vs different than): 251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45
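The ratio and GPB calculations described above are simple enough to script; a minimal sketch (the counts are the slide's own 2012 figures, and `ghit_ratio`/`gpb` are invented helper names):

```python
def ghit_ratio(count_a: int, count_b: int) -> float:
    """Unitless ratio of two web counts (document frequencies, not term frequencies)."""
    return count_a / count_b

def gpb(ghits: int, index_size: int) -> float:
    """Ghits per billion indexed documents (needs an index-size estimate)."""
    return ghits / (index_size / 1_000_000_000)

# "different from" vs "different than", ghits accessed 2012-04-04
print(round(ghit_ratio(251_000_000, 71_500_000), 1))  # 3.5, i.e. roughly 3.5:1
```

Since the index size is no longer published, the ratio is the only number here that stays meaningful over time.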

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

– gather documents for different topics (balance)
– exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
– exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
– do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni, S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
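Step 2 of this pipeline (combining seed words into queries) can be sketched in a few lines; the seed list and counts below are toy stand-ins for the 500 seeds and 6,000 queries, and `make_queries` is an invented name:

```python
import random

def make_queries(seeds, words_per_query=4, n_queries=10, seed=0):
    """Randomly combine seed words into search-engine queries."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    return [" ".join(rng.sample(seeds, words_per_query)) for _ in range(n_queries)]

seeds = ["time", "people", "water", "language", "school", "music", "city", "work"]
queries = make_queries(seeds, words_per_query=3, n_queries=5)
for q in queries:
    print(q)
```

Each query is then sent to a search engine (step 3) and the returned URLs are downloaded and postprocessed.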

Internet Corpora Summary

• The web can be used as a corpus

– Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

– Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

• Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

– Keywords and Categories
– Rankings
– Structural Markup

• Implicit Meta-data

– Links and Citations
– Tags
– Tables
– File Names
– Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

– When they are accurate, they are very good
– They are often deceitful

• HTML, PDF, Word … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

– as they are unintended, they tend to be noisy
– but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations — Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

– Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words
– Similarity-based categorisation and classification
∗ Character n-gram rankings
– Machine Learning based methods
– Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text

Review and Conclusions 53
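The character n-gram ranking approach mentioned above can be sketched as follows. This is a toy version of the out-of-place distance: each language gets a frequency-ranked n-gram profile, and a document is assigned to the language whose profile its own n-grams fit best. The training strings are tiny invented samples; real profiles come from much larger corpora.

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Character n-grams of the text, ranked by frequency."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(lang_profile, doc_profile):
    """Sum each document n-gram's rank in the language profile;
    n-grams absent from the profile get a fixed penalty."""
    penalty = max(len(lang_profile), 1)
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(rank.get(g, penalty) for g in doc_profile)

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(profiles[lang], doc))

# Toy training data (invented); real systems train on megabytes per language.
profiles = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund"),
}
print(identify("the fox and the dog", profiles))
```

Even this toy version shows why short snippets and closely related languages are hard: with few n-grams, the distances to the candidate profiles are close together.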

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel

Review and Conclusions 54
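Two of these steps can be sketched concretely; this is a toy illustration using only the slide's own examples (`normalize_number` and `stem` are invented helper names, and real normalizers and stemmers use far richer rule sets):

```python
import re

def normalize_number(text):
    """Map money expressions like '$700K' and '$700,000' to one canonical form."""
    text = re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: f"${int(float(m.group(1)) * 1000)}", text)
    return text.replace(",", "")

def stem(word):
    """Crude suffix stripping, in the spirit of 'produces -> produc'."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

print(normalize_number("$700K"), normalize_number("$700,000"))  # $700000 $700000
print(stem("produces"), stem("producer"))                       # produc produc
```

Note how stemming conflates producer and produces into one index term, which helps recall but loses the lemma distinction that lemmatization preserves.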

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

– provides a common data representation framework
– makes it possible to integrate multiple sources
– so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier: <http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

– is a markup language, much like HTML
– was designed to carry data, not to display data
– tags are not predefined: you must define your own tags
– is designed to be self-descriptive
– XML is a W3C Recommendation

• Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

– Universal Resource Name: urn:isbn:1575864606
– Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

– Triples of <subject, predicate, object>
– Each element is a URI
– RDF is written in well-defined XML
– You can say anything about anything

• You can build relations into ontologies (OWL)

– Then reason over them, search them, …

Review and Conclusions 59
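The triple model above needs no special machinery to illustrate: RDF proper serializes to XML, but the underlying data model is just a set of <subject, predicate, object> identifiers. A minimal sketch (the rdf:type URI is the real W3C one; the example.com and pantone URIs follow the slide's example and are illustrative, not real endpoints):

```python
# A toy triple store: a set of (subject, predicate, object) tuples.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = {
    ("http://example.com/mypage#heading", RDF_TYPE,
     "http://pantone.tm.example.com/2002/std6#color"),
    ("http://example.com/mypage#heading",
     "http://example.com/schema#label", "Sea Green"),
}

def objects(store, subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in store if s == subject and p == predicate}

print(objects(triples, "http://example.com/mypage#heading", RDF_TYPE))
```

Because every element is a URI, triples from different sources can be merged into one store and queried together, which is the "web of data" idea in miniature.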

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

– NLP can infer structure
– NLP makes the Semantic Web feasible
– the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

– Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
– Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

– Total Number of Citations (Pretty Useful)
– Total Number of Citations minus Self-citations
– Total Number of (Citations / Number of Authors)
– Average (Citation Impact Factor / Number of Authors)

• Problems:

– Not all citations are equal: citations by 'good' papers are better
– Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper

Review and Conclusions 64
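The simple scores listed above are easy to compute once the counts are known; a minimal sketch (the function name and the numbers are invented for illustration):

```python
def citation_scores(citations, self_citations, n_authors):
    """The basic per-paper scores from the list above."""
    return {
        "total": citations,
        "minus_self": citations - self_citations,
        "per_author": citations / n_authors,
    }

# Invented example: 120 citations, 20 of them self-citations, 4 authors.
print(citation_scores(120, 20, 4))
```

The two problems noted above are exactly what these formulas ignore: every citation counts the same, and older papers have had longer to accumulate them.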

Gaming Citations

• Least/Minimum Publishable Unit

– Break research into small chunks to increase the number of citations
– Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high Impact Factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

– Simplest measure: each article gets one vote – not very accurate

• On the web, citation frequency = inlink count

– A high inlink count does not necessarily mean high quality …
– … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

– An article's vote is weighted according to its citation impact
– This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

– Start at a random page
– At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

– For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
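The random walk with teleportation can be computed by power iteration: repeatedly redistribute each page's current rank according to the walk's transition rules until the rates settle. A minimal sketch over an invented three-page graph:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration for the teleporting random walk.

    links maps each page to its list of outlinks; a page with no
    outlinks is a dead end and jumps uniformly at random.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        nxt = dict.fromkeys(pages, 0.0)
        for p in pages:
            outs = links.get(p, [])
            if not outs:  # dead end: jump anywhere, 1/N each
                for q in pages:
                    nxt[q] += rank[p] / n
            else:
                for q in pages:  # teleport with probability 0.10
                    nxt[q] += teleport * rank[p] / n
                for q in outs:  # otherwise follow an outlink, equiprobably
                    nxt[q] += (1 - teleport) * rank[p] / len(outs)
        rank = nxt
    return rank

# Invented toy graph: A links to B and C, B links to C, C links back to A.
pr = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(pr)
```

In this graph C ends up with the highest rank: it is the only page with two inlinks, which is the "weighted citation" intuition from the previous slide made concrete.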

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

• The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

– Google does not index the target of a link marked nofollow
– Yahoo! does not include the link in its ranking
– …

74

Current Status

• There is a continuous battle between:

– Search companies, who want to get the most useful page to the user
– Page writers, who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

– Metadata about the object is stored with the DOI name
– The metadata includes a location, such as a URL
– The DOI for a document is permanent; the metadata may change
– Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

– DOI: 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

– Passively, for information access
– Actively, for collaboration

• The internet is interesting in and of itself

– As a source of data about existing language
– As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics — (Pre-req: HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 31: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Collaboration and Wikisatilde Version Control Systems

atilde Wikipedia

atilde Licensing and Ownership

30

Version Control Systems

atilde Versioning file systems

acirc every time a file is opened a new copy is stored

atilde CVS Subversion Git

acirc changes to a collection of files are trackedacirc simultaneous changes are merged

atilde Revision Tracking

acirc Revisions are stored within a file

atilde Authorship in shared writing Explicit responsibility for changes

Review and Conclusions 31

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
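The random walk with teleportation above can be sketched as a short power-iteration loop. This is a toy illustration on a made-up four-page graph (the page names and link structure are invented), not Google's actual implementation:

```python
# Minimal PageRank by power iteration, with a 10% teleportation rate.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],          # dead end: all of its mass jumps to a random page
}

def pagerank(links, teleport=0.10, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start uniform
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                            # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:                    # teleport share, 0.1/n each
                    new[q] += teleport * rank[p] / n
                for q in out:                      # follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # pages from highest to lowest rank
```

Note how C, with two inlinks, ends up above the dead end D, which is reached only by teleporting.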

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value (rel="nofollow") that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u/
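As a concrete illustration, a DOI name can be turned into a resolvable URL through the central proxy at doi.org (formerly dx.doi.org), which redirects to the location stored in the DOI metadata:

```python
def doi_url(doi: str) -> str:
    """Build a resolver URL for a DOI name via the doi.org proxy."""
    return "https://doi.org/" + doi

# the example DOI from the slide resolves to the publisher's page
print(doi_url("10.1007/s10579-008-9062-z"))
```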

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Version Control Systems

ã Versioning file systems

â every time a file is opened, a new copy is stored

ã CVS, Subversion, Git

â changes to a collection of files are tracked
â simultaneous changes are merged

ã Revision Tracking

â Revisions are stored within a file

ã Authorship in shared writing: explicit responsibility for changes
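The kind of change these systems record can be illustrated with Python's standard difflib, which computes the line-level diff a version-control system would store between two revisions (the revision contents here are invented):

```python
import difflib

# two revisions of a shared document
rev1 = ["Wikipedia is an encyclopedia.\n", "Anyone can edit it.\n"]
rev2 = ["Wikipedia is a free encyclopedia.\n", "Anyone can edit it.\n"]

# unified_diff records exactly what changed between rev1 and rev2,
# which is what makes responsibility for each change explicit
diff = list(difflib.unified_diff(rev1, rev2, fromfile="rev1", tofile="rev2"))
print("".join(diff))
```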

Review and Conclusions 31

Wikipedia

ã The core aim of the Wikimedia Foundation is to get a free encyclopedia to every single person on the planet (Jimmy Wales)

ã Wikipedia makes it easy to share your knowledge: people like to do this

ã Most edits are done by insiders

ã Most content is added by outsiders

ã Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1. Wikipedia is an online encyclopedia

2. Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability, and No original research

3. Wikipedia is free content

4. Wikipedians should interact in a respectful and civil manner

5. Wikipedia does not have firm rules

Wikipedia:Five pillars 33

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

Review and Conclusions 34

What is a good article?

1. Well-written

2. Factually accurate and verifiable

3. Broad in its coverage

4. Neutral

5. Stable

6. Illustrated, if possible, by images

Wikipedia:Good_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, streaming, messaging)

ã The structure of markup: visual vs logical (WYSIWYG, WYSIAYG, WYSIWIM)

ã The structure of the Web — hypertext: pages linked to pages

ã The future of the Web

ã Linguistic features of the web: un-edited, large volume, editable, multi-media

Review and Conclusions 37

Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers' markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: search engine as query tool and WWW as corpus
(Objection: results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

ã Web Sample: use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than)
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios
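The different from vs different than comparison above works out as follows (counts as reported on the slide):

```python
# ghit counts for the two variants (accessed 2012-04-04)
different_from = 251_000_000
different_than = 71_500_000

# a unitless ratio is more comparable over time than the raw counts
ratio = different_from / different_than
print(f"{ratio:.1f}:1")  # 3.5:1
```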

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here, taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding cleanup, tagging, and parsing)
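Steps 1–3 of the pipeline above can be sketched as follows. The search-engine call is faked with a stand-in function (real APIs need keys and quotas), and the seed words are placeholders for the ~500 mid-frequency words a real pipeline would use:

```python
import itertools

def fake_search(query):
    # stand-in for a real search-engine API call: returns pretend result URLs
    return [f"http://example.com/{query.replace(' ', '-')}/{i}" for i in range(3)]

seeds = ["house", "water", "people", "small"]            # step 1 (toy scale)
queries = [" ".join(pair)                                 # step 2: word-pair queries
           for pair in itertools.combinations(seeds, 2)]
urls = sorted({url for q in queries for url in fake_search(q)})  # step 3

print(len(queries), "queries,", len(urls), "URLs")
```

Steps 4–5 (downloading and post-processing) would then run over the collected URL list.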

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access:
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample:
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are non-intended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods:
∗ Diacritics
∗ Character n-grams
∗ Stop words
â Similarity-based categorisation and classification:
∗ Character n-gram rankings
â Machine-learning-based methods
â Context (under .jp or .ko)

ã Hard to do for: short text snippets, similar languages, mixed text
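The character n-gram idea can be sketched as a toy trigram profile matcher. The two profiles below are built from single invented sentences, so this illustrates the method rather than giving a usable identifier:

```python
from collections import Counter

def trigrams(text):
    # pad with spaces so word boundaries count as characters too
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# one-sentence "profiles" (toy training data)
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog"),
    "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def guess(text):
    grams = trigrams(text)
    # score: number of trigram occurrences shared with each profile
    scores = {lang: sum((grams & prof).values())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

print(guess("the dog"), guess("der hund"))
```

Note how quickly this degrades on short snippets: with only a handful of trigrams, one shared sequence can flip the decision, which is exactly the difficulty noted above.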

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon cel
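Stemming as crude suffix stripping can be sketched like this (a toy rule set in the spirit of the Porter stemmer; real stemmers have many more rules and conditions):

```python
def stem(word):
    # strip one common suffix, but only if a stem of 4+ letters remains
    for suffix in ("es", "er", "ed", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

print(stem("produces"), stem("producer"))  # produc produc
```

Unlike lemmatization, the result need not be a real word: both forms collapse to the non-word stem produc.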

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
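Well-formedness checking in this sense is exactly what any XML parser does; with Python's standard library (the tag names here are invented):

```python
import xml.etree.ElementTree as ET

good = "<course><code>HG2052</code></course>"
bad = "<course><code>HG2052</course>"        # <code> is never closed

ET.fromstring(good)          # parses: the document is well-formed
try:
    ET.fromstring(bad)       # raises ParseError: mismatched tags
except ET.ParseError as err:
    print("not well-formed:", err)
```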

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers (URIs)

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework (RDF)

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations into ontologies (OWL)

â Then reason over them, search them, …
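The triple model can be sketched with plain tuples. The example subject and object URIs are made up; the predicates use the real Dublin Core terms namespace:

```python
# <subject, predicate, object> statements about a course, as plain tuples;
# a real RDF store would serialize these in XML or Turtle
triples = [
    ("http://example.org/HG2052",
     "http://purl.org/dc/terms/title",
     "Language, Technology and the Internet"),
    ("http://example.org/HG2052",
     "http://purl.org/dc/terms/creator",
     "http://example.org/FrancisBond"),
]

# "you can say anything about anything": query who created HG2052
creators = [o for s, p, o in triples if p.endswith("/creator")]
print(creators)
```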

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation, and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based:
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked
â Context-based: citation analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar, or research group

ã Some scores are:

â Total number of citations (pretty useful)
â Total number of citations minus self-citations
â Total of (citations / number of authors)
â Average of (citation impact factor / number of authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the citing paper
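The first few scores above, computed for a toy paper record (the numbers are invented):

```python
# a toy citation record for one paper
paper = {"citations": 30, "self_citations": 5, "authors": 3}

scores = {
    "total":      paper["citations"],
    "minus_self": paper["citations"] - paper["self_citations"],
    "per_author": paper["citations"] / paper["authors"],
}
print(scores)  # {'total': 30, 'minus_self': 25, 'per_author': 10.0}
```

Weighting by the quality of the citing paper is the step that turns this into a PageRank-style calculation.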

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write only proceedings papers (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 33: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Wikipedia

atilde The core aim of the Wikimedia Foundation is to get a free encyclopedia to everysingle person on the planet (Jimmy Wales)

atilde Wikipedia makes it easy to share your knowledgepeople like to do this

atilde Most edits are done by insiders

atilde Most content is added by outsiders

atilde Content comparable to Britannica

Review and Conclusions 32

The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

atilde Content policies NPOV Verifiability and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

WikipediaFivepillars 33

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
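Well-formedness checking is built into Python's standard library: a parse that succeeds means the document is well-formed. The tag names are invented examples.

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """True iff the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course>"))         # False
```

Note this checks only syntax; validity against a schema (DTD, XML Schema) is a separate, stricter test.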

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDFs are written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
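The triple model can be sketched without any RDF library: triples as tuples, with a pattern matcher standing in for a (much richer) SPARQL query. All the ex: URIs here are invented examples.

```python
# Each triple is <subject, predicate, object>; the URIs are invented examples.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:offeredAt", "ex:NTU"),
    ("ex:FrancisBond", "ex:worksAt", "ex:NTU"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None is a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(match(p="ex:worksAt"))  # [('ex:FrancisBond', 'ex:worksAt', 'ex:NTU')]
```

Real stores add serialization (RDF/XML, Turtle), namespaces, and inference over OWL ontologies on top of this basic pattern.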

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1 People lie

2 People are lazy

3 People are stupid

4 Mission: Impossible (know thyself)

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

Citation, Reputation and PageRank

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read and thought it interesting enough to cite

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper
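The first three scores above are simple arithmetic over a citation record. The paper records below are invented for illustration.

```python
# Invented records: each paper lists its citation count, number of
# authors, and how many of the citations are self-citations.
papers = [
    {"cites": 120, "authors": 3, "self": 10},
    {"cites": 45,  "authors": 2, "self": 5},
]

total = sum(p["cites"] for p in papers)                       # total citations
total_minus_self = sum(p["cites"] - p["self"] for p in papers)
per_author = sum(p["cites"] / p["authors"] for p in papers)   # author-normalized

print(total, total_minus_self, per_author)  # 165 150 62.5
```

None of these fixes the problems on the slide: weighting by the quality of the citing paper needs a recursive score like PageRank, sketched below under the PageRank slides.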

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1 The hyperlink from A to B represents an endorsement of page B by the creator of page A

2 The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web: citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web:

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady state probability

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
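The random walk with teleporting can be computed by power iteration. A minimal sketch; the three-page graph is invented, and treating a dead end as linking to every page gives exactly the 1/N jump the slide describes.

```python
def pagerank(links, teleport=0.10, iterations=50):
    """Power iteration over a link graph {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: teleport / n for p in pages}  # teleportation mass
        for p in pages:
            # A dead end behaves as if it linked to every page (jump prob 1/N).
            out = links[p] or pages
            for q in out:
                new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# Toy graph, invented for illustration: every page points (directly or
# via B) at C, and C is a dead end.
ranks = pagerank({"A": ["C"], "B": ["A", "C"], "C": []})
print(max(ranks, key=ranks.get))  # C
```

The ranks sum to 1 at every iteration, matching the interpretation of PageRank as a steady-state visit probability.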

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank

Example Graph Weighted

Pages with higher PageRanks are lighter


Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

ã Comment spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …
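A crawler deciding which links should pass ranking credit can check the rel attribute while parsing. A sketch with Python's standard html.parser; the URLs are invented examples.

```python
from html.parser import HTMLParser

class LinkAudit(HTMLParser):
    """Separate links that pass ranking credit from rel="nofollow" links."""
    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        href = a.get("href")
        if href is None:
            return
        # rel is a space-separated list of values, one of which may be nofollow.
        if "nofollow" in (a.get("rel") or "").split():
            self.nofollowed.append(href)
        else:
            self.followed.append(href)

audit = LinkAudit()
audit.feed('<a href="http://example.com/a">ok</a>'
           '<a rel="nofollow" href="http://example.com/spam">spam</a>')
print(audit.followed, audit.nofollowed)
```

How each engine actually treats such links varies, as the bullets above note; the parsing side is the only part shown here.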

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z
http://www.springerlink.com/content/v7q114033401th5u
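Because the DOI name is permanent while the location may change, you resolve a DOI through the central resolver rather than storing the publisher URL. Building the resolver URL is a one-liner (no network access in this sketch):

```python
def doi_url(doi):
    """A DOI name resolves via the central resolver at doi.org."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```

Following that URL redirects to whatever location is currently registered in the DOI's metadata, which is the point of the indirection.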

Conclusions


Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python;
introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics


The five pillars of Wikipedia

1 Wikipedia is an online encyclopedia

2 Wikipedia has a neutral point of view

ã Content policies: NPOV, Verifiability, and No original research

3 Wikipedia is free content

4 Wikipedians should interact in a respectful and civil manner

5 Wikipedia does not have firm rules

Wikipedia:Five_pillars

Licenses and Ownership

ã Copyright

ã Copyleft

ã Creative Commons

What is a good article?

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

Wikipedia:Good_article_criteria

The World Wide Web and HTML


The InterWeb

ã The Internet: more than just the web (email, VoIP, FTP, Streaming, Messaging, …)

ã The structure of Markup: Visual vs Logical
(WYSIWYG, WYSIAYG, WYSIWIM)

ã The structure of the Web — hypertext:
pages linked to pages

ã The future of the Web

ã Linguistic features of the web:
un-edited, large volume, editable, multi-media

Map of online communities

http://xkcd.com/802

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers' markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

The Web as Corpus


Two Approaches to using the Web as a Corpus

ã Direct Query: Search Engine as Query tool and WWW as corpus
(Objection: Results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow one to draw conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

ã Web Sample: Use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than)
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman),
but Google no longer releases the index size

ã So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
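The ratio comparison above is just a division, using the counts from the slide (as accessed 2012-04-04):

```python
def whit_ratio(hits_a, hits_b):
    """Compare two web counts as a unitless ratio rather than raw numbers."""
    return hits_a / hits_b

# 'different from' vs 'different than' (ghit counts from the slide)
print(round(whit_ratio(251_000_000, 71_500_000), 1))  # 3.5
```

The ratio is far more stable over time than either raw count, which is why the slide recommends reporting it together with the access date.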

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here: taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short or too long, word lists)
â do this for several languages and make the corpora available

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6,000)

3 Query a search engine and retrieve the URLs (50,000)

4 Download the files from the URLs (100,000,000 words)

5 Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html
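Step 2 above, turning seed words into queries, can be sketched by taking word tuples (in the BootCaT style). The seed list and tuple size are invented for illustration; real pipelines use hundreds of seeds and randomized sampling.

```python
from itertools import combinations

seeds = ["language", "technology", "internet", "corpus", "linguistics"]

def make_queries(seeds, words_per_query=3, limit=6):
    """Form search queries from tuples of seed words (BootCaT-style)."""
    combos = combinations(seeds, words_per_query)
    return [" ".join(c) for _, c in zip(range(limit), combos)]

for q in make_queries(seeds):
    print(q)
```

Each query is then sent to a search engine API and the returned URLs are pooled for downloading (steps 3–4).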

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

… and Zombies

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … Metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup



Page 35: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Licenses and Ownership

atilde Copyright

atilde Copyleft

atilde Creative Commons

Review and Conclusions 34

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
â do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47
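The query-generation and filtering steps of the outline can be sketched as follows. Actual fetching is omitted, and the seed words, function names and length thresholds are illustrative assumptions, not Sharoff's actual settings.

```python
import random

# Toy sketch of the pipeline above: combine seed words into random
# multi-word queries, then filter out downloaded documents that look
# irrelevant for a corpus (too short, or word-list-length pages).

def make_queries(seeds, k=4, n=10, rng=None):
    """Combine seed words into n random k-word queries."""
    rng = rng or random.Random(0)
    return [" ".join(rng.sample(seeds, k)) for _ in range(n)]

def keep_document(text, min_words=50, max_words=5000):
    """Exclude documents that are too short or too long to be useful text."""
    n = len(text.split())
    return min_words <= n <= max_words

seeds = ["language", "internet", "corpus", "data", "web", "search", "text", "query"]
queries = make_queries(seeds)
print(len(queries), "queries, e.g.", queries[0])
```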

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are unintended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations — bracketed glosses, cross-lingual disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine-learning based methods
â Context (under .jp or .kr)

ã Hard to do for: short text snippets, similar languages, mixed text
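A minimal sketch of the character-n-gram approach, assuming two tiny invented "training" samples; real identifiers use much larger profiles, and still struggle with the short snippets and similar languages noted above.

```python
from collections import Counter

# Character-n-gram language identification: build a frequency profile
# of n-grams for each language sample, then pick the language whose
# profile overlaps most with the profile of the input snippet.

def ngram_profile(text, n=3):
    """Frequency profile of character n-grams (with edge padding)."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(profile_a, profile_b):
    """Shared n-gram mass between two profiles."""
    return sum(min(profile_a[g], profile_b[g]) for g in profile_a)

samples = {
    "en": "the cat sat on the mat and the dog ran to the man",
    "de": "die Katze sitzt auf der Matte und der Hund lief zu dem Mann",
}
profiles = {lang: ngram_profile(text) for lang, text in samples.items()}

def identify(snippet):
    p = ngram_profile(snippet)
    return max(profiles, key=lambda lang: similarity(p, profiles[lang]))

print(identify("the dog and the cat"))  # en
```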

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number normalization: $700K, $700,000, 0.7 million dollars, …

ã Date normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
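A few of these steps sketched with deliberately naive rules; the suffix list and stop-word set are invented for illustration, and real systems use proper stemmers and lemmatizers (the decompounding step is omitted).

```python
import re

# Naive normalization: expand $700K-style amounts, strip stop words,
# and crudely stem by suffix stripping.

STOP_WORDS = {"the", "a", "an", "of", "to", "in"}

def normalize_number(text):
    """Expand $700K-style amounts into plain digits."""
    return re.sub(r"\$(\d+)K\b", lambda m: "$" + str(int(m.group(1)) * 1000), text)

def strip_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(token):
    """Crude suffix stripping: producer -> produc, produces -> produc."""
    for suffix in ("es", "er", "s"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

text = "the price of $700K"
tokens = strip_stop_words(normalize_number(text).split())
print([stem(t) for t in tokens])  # ['price', '$700000']
```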

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
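Well-formedness checking can be sketched with Python's standard-library XML parser; the `<course>` tags are self-defined, as XML allows.

```python
import xml.etree.ElementTree as ET

# A document is well-formed if the parser accepts it: every opening tag
# has a matching close, attributes are quoted, and so on.

def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<course code='HG2052'><title>Language, Technology and the Internet</title></course>"
bad = "<course><title>unclosed</course>"

print(is_well_formed(good), is_well_formed(bad))  # True False
```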

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF triples are written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
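The triple model can be sketched without any RDF library by storing <subject, predicate, object> tuples and matching patterns against them. The example.org URIs are invented for illustration, though the Dublin Core predicates are real.

```python
# A tiny in-memory triple store with wildcard pattern matching
# (None in a position matches anything), in the spirit of RDF.

triples = {
    ("http://example.org/HG2052", "http://purl.org/dc/terms/creator",
     "http://example.org/FrancisBond"),
    ("http://example.org/HG2052", "http://purl.org/dc/terms/title",
     "Language, Technology and the Internet"),
}

def query(subject=None, predicate=None, obj=None):
    """Return all triples matching the given pattern (None = wildcard)."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

for s, p, o in query(predicate="http://purl.org/dc/terms/title"):
    print(o)  # Language, Technology and the Internet
```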

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total number of citations (pretty useful)
â Total number of citations minus self-citations
â Total number of (citations / number of authors)
â Average (citation impact factor / number of authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the citing paper
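The first three scores above can be computed for a toy bibliography; the papers, authors and citation links are invented for illustration.

```python
# Each paper records its authors and the papers it cites.

papers = {
    "A": {"authors": ["Lee", "Tan"], "cites": ["B"]},
    "B": {"authors": ["Tan"], "cites": []},
    "C": {"authors": ["Ng"], "cites": ["B", "A"]},
}

def citations(paper):
    """Total number of citations received."""
    return sum(paper in p["cites"] for p in papers.values())

def citations_minus_self(paper):
    """Citations excluding those from papers sharing an author."""
    own = set(papers[paper]["authors"])
    return sum(
        paper in p["cites"] and not own & set(p["authors"])
        for p in papers.values()
    )

def per_author(paper):
    """Citations divided by the number of authors."""
    return citations(paper) / len(papers[paper]["authors"])

# B is cited by A and C, but A shares the author Tan with B.
print(citations("B"), citations_minus_self("B"), per_author("B"))  # 2 1 2.0
```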

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web:

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
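The random walk with teleportation can be computed by simple power iteration; the four-page graph and the function below are an illustrative sketch, not Google's implementation. `alpha` is the 10% teleportation rate from the slide.

```python
# PageRank as the steady state of the teleporting random walk.
# Each iteration redistributes rank: linked pages pass on (1 - alpha)
# of their rank along outlinks and teleport the rest; dead ends
# always teleport (independent of alpha, as noted above).

def pagerank(links, alpha=0.10, iterations=100):
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if links[p]:                   # follow a link with prob 1 - alpha
                share = (1 - alpha) * rank[p] / len(links[p])
                for q in links[p]:
                    new[q] += share
                leak = alpha * rank[p]     # teleport with prob alpha
            else:                          # dead end: always teleport
                leak = rank[p]
            for q in pages:
                new[q] += leak / n
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end
ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

The ranks sum to 1, and the dead-end page D, which receives no links, ends up with only its share of teleport traffic.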

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z =
http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (Pre-req: HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 36: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

What is a good article

1 Well-written

2 Factually accurate and verifiable

3 Broad in its coverage

4 Neutral

5 Stable

6 Illustrated if possible by images

WikipediaGood_article_criteria 35

The World Wide Web and HTML

Review and Conclusions 36

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81


Page 38: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

The InterWeb

atilde The Internetmore than just the web (email VoIP FTP Streaming Messaging

atilde The structure of Markup Visual vs LogicalWSISWYG WYSIAYG WYSIWIM

atilde The structure of the Web mdash hypertextpages linked to pages

atilde The future of the Web

atilde Linguistic features of the webun-edited large volume editable multi-media

Review and Conclusions 37

Map of online communities

httpxkcdcom802 38

The Deep Web (the Invisible Web)

Dynamic content dynamic pages which are returned in response to a submitted queryor accessed only through a form

Unlinked content pages which are not linked to by other pages (but clicking linksthem)

Private Web sites that require registration and login (Edventure)

Contextual Web pages with content varying for different access contexts (egranges of client IP addresses or previous navigation sequence)

Limited access content sites that limit access to their pages in a technical way (egusing the Robots Exclusion Standard)

Scripted content pages that are only accessible through links produced by JavaScriptas well as content dynamically downloaded from Web servers via Flash or Ajaxsolutions

Review and Conclusions 39

Non-HTMLtext content textual content encoded in multimedia (image or video)files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

atilde Visual Markup (Presentational)

acirc What you see is what you get (WYSIWYG)acirc Equivalent of printersrsquo markupacirc Shows what things look like

atilde Logical Markup (Structural)

acirc Show the structure and meaningacirc Can be mapped to visual markupacirc Less flexible than visual markupacirc More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics – (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Map of online communities

http://xkcd.com/802 38

The Deep Web (the Invisible Web)

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form

Unlinked content: pages which are not linked to by other pages (but clicking links them)

Private Web: sites that require registration and login (Edventure)

Contextual Web: pages with content varying for different access contexts (e.g. ranges of client IP addresses or previous navigation sequence)

Limited access content: sites that limit access to their pages in a technical way (e.g. using the Robots Exclusion Standard)

Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions

Review and Conclusions 39

Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

ã Visual Markup (Presentational)

â What you see is what you get (WYSIWYG)
â Equivalent of printers' markup
â Shows what things look like

ã Logical Markup (Structural)

â Shows the structure and meaning
â Can be mapped to visual markup
â Less flexible than visual markup
â More adaptable (and reusable)

Review and Conclusions 41

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

ã Direct Query: search engine as query tool and WWW as corpus
(Objection: results are not reliable)

â Population and exact hit counts are unknown → no statistics possible
â Indexing does not allow drawing conclusions about the data
⊗ Google is missing functionalities that linguists and lexicographers would like to have

ã Web Sample: use a search engine to download data from the net and build a corpus from it

â known size and exact hit counts → statistics possible
â people can draw conclusions over the included text types
â (limited) control over the content
⊗ sparser data

Review and Conclusions 43

Direct Query

ã Accessible through search engines (Google API, Yahoo API, scripts)

ã Document counts are shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help – but:

â lots of repetitions of the same text (not representative)
â very limited query precision (no upper/lower case, no punctuation)
â only estimated counts, often hard to reproduce exactly
â different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

ã ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

â it is document frequency, not term frequency

ã whit or "web hit" is the more general term

ã Normally you compare two phenomena to get a unitless ratio
(e.g. different from vs different than)
251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

ã GPB for "Ghits per billion documents" is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

ã So always say when you counted, and try to use ratios
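Since raw ghit counts drift as the index changes, the comparison above is best expressed as a ratio computed from counts taken on the same day. A small sketch:

```python
def ghit_ratio(hits_a: int, hits_b: int) -> float:
    """Unitless ratio between two web counts taken on the same date."""
    return hits_a / hits_b

# "different from" vs "different than" (counts as accessed 2012-04-04)
ratio = ghit_ratio(251_000_000, 71_500_000)  # roughly 3.5 : 1
```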

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

ã Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

â gather documents for different topics (balance)
â exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
â exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
â do this for several languages and make the corpora available
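The exclusion steps can be sketched as a per-document filter; the thresholds here (minimum and maximum length, maximum share for any one token) are illustrative choices, not Kilgarriff's actual settings:

```python
def keep_document(tokens, min_len=50, max_len=50_000, max_share=0.25):
    """Drop documents that are too short, too long, or dominated by a
    single repeated token (a symptom of word lists, not running text)."""
    n = len(tokens)
    if n < min_len or n > max_len:
        return False
    top_count = max(tokens.count(t) for t in set(tokens))
    return top_count / n <= max_share

wordlist = ["widget"] * 200            # repetitive: rejected
prose = [f"w{i}" for i in range(200)]  # varied: kept
```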

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

ã The web can be used as a corpus

â Direct access
∗ Fast and convenient
∗ Huge amounts of data
⊗ unreliable counts

â Web sample
∗ Control over the sample
∗ Some setup costs (semi-automated)
⊗ Less data

ã Richer data than a compiled corpus

⊗ Less balanced, less markup

48

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word … metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are unintended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations – Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words
â Similarity-based categorisation and classification
∗ Character n-gram rankings
â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
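The character n-gram ranking approach (Cavnar and Trenkle's "out-of-place" measure) is easy to sketch: rank the most frequent n-grams of the mystery text and of each training sample, then sum the rank differences. The tiny training strings here are stand-ins for real per-language corpora:

```python
from collections import Counter

def profile(text, n=3, top=300):
    """Most frequent character n-grams of a text, in rank order."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc, lang):
    """Sum of rank differences; n-grams absent from lang get a max penalty."""
    rank = {g: i for i, g in enumerate(lang)}
    penalty = len(lang)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc))

def identify(text, profiles):
    """Pick the language whose training profile is closest."""
    doc = profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

profiles = {
    "en": profile("the quick brown fox jumps over the lazy dog"),
    "de": profile("der schnelle braune fuchs springt ueber den faulen hund"),
}
guess = identify("the dog jumps over the fox", profiles)
```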

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
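The number normalization step, for instance, can be sketched with two regular expressions mapping the variant dollar forms above onto one canonical spelling:

```python
import re

def normalize_dollars(text):
    """Map '$700K' and '$700,000' onto a canonical '$700000' form
    (a toy sketch of the number normalization step)."""
    # $700K -> $700000
    text = re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: "$" + str(int(float(m.group(1)) * 1000)), text)
    # $700,000 -> $700000
    text = re.sub(r"\$(\d{1,3}(?:,\d{3})+)",
                  lambda m: "$" + m.group(1).replace(",", ""), text)
    return text
```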

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes possible integrating multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
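Checking well-formedness needs no schema, just a parser; a sketch with Python's standard library:

```python
import xml.etree.ElementTree as ET

def well_formed(doc: str) -> bool:
    """Check syntactic well-formedness (not validity against a schema)."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

ok = well_formed("<course><code>HG2052</code></course>")
bad = well_formed("<course><code>HG2052</course>")  # mismatched tags
```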

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Universal Resource Name: urn:isbn:1575864606
â Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
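A triple store and a query over it can be sketched in a few lines; the subject and object URIs here are invented for illustration (the predicates borrow real Dublin Core term names):

```python
# Triples of (subject, predicate, object); each slot is a URI or a literal.
triples = {
    ("http://example.com/HG2052", "http://purl.org/dc/terms/title",
     "Language, Technology and the Internet"),
    ("http://example.com/HG2052", "http://purl.org/dc/terms/creator",
     "http://example.com/FrancisBond"),
}

def objects(subject, predicate):
    """All objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

title = objects("http://example.com/HG2052", "http://purl.org/dc/terms/title")
```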

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper
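The first three scores are straightforward to compute once you have per-paper counts; a sketch with made-up numbers for a two-paper author:

```python
def citation_scores(citations, self_citations, n_authors):
    """Per-author scores from parallel per-paper lists: total citations,
    total minus self-citations, and citations split among co-authors."""
    total = sum(citations)
    non_self = total - sum(self_citations)
    per_author = sum(c / a for c, a in zip(citations, n_authors))
    return total, non_self, per_author

# Two papers: 40 citations (4 self, 2 authors) and 10 (1 self, sole author).
scores = citation_scores([40, 10], [4, 1], [2, 1])
```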

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to High Impact Factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case; for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)
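Harvesting anchor text is a small parsing job: remember the href of the open <a> tag and collect the text until it closes. A sketch with html.parser (the URL is a placeholder):

```python
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    """Pair each hyperlink target with the anchor text pointing at it."""
    def __init__(self):
        super().__init__()
        self.links, self._href = [], None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:          # only text inside an <a>
            self.links.append((self._href, data))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

p = AnchorText()
p.feed('See the <a href="http://example.com/hg2052">HG2052 Course Page</a>.')
```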

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N
(N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter, the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
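Putting the random walk and teleporting together gives the usual power iteration; a small sketch (the three-page graph is made up):

```python
def pagerank(out_links, teleport=0.10, iterations=100):
    """Power iteration for PageRank: with probability `teleport` jump to a
    random page; dead ends always jump randomly; otherwise follow a link."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = dict.fromkeys(pages, 0.0)
        for p in pages:
            links = out_links[p]
            if not links:                         # dead end: jump anywhere
                for q in pages:
                    nxt[q] += rank[p] / n
            else:
                for q in pages:                   # teleport share
                    nxt[q] += teleport * rank[p] / n
                share = (1 - teleport) * rank[p] / len(links)
                for q in links:                   # follow a link equiprobably
                    nxt[q] += share
        rank = nxt
    return rank

# Every page links to C, so C should end up with the highest PageRank.
r = pagerank({"A": ["C"], "B": ["C"], "C": ["A", "B"]})
```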

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73


atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 41: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Non-HTML text content: textual content encoded in multimedia (image or video) files, or in specific file formats not handled by search engines

These pages all include data that search engines cannot find

Review and Conclusions 40

Visual Markup vs Logical Markup

• Visual Markup (Presentational)
  - What you see is what you get (WYSIWYG)
  - The equivalent of printers' markup
  - Shows what things look like

• Logical Markup (Structural)
  - Shows the structure and meaning
  - Can be mapped to visual markup
  - Less flexible than visual markup
  - More adaptable (and reusable)

The Web as Corpus

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)
  - Population and exact hit counts are unknown → no statistics possible
  - Indexing does not allow us to draw conclusions about the data
  ✗ Google lacks functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it
  - known size and exact hit counts → statistics possible
  - people can draw conclusions about the included text types
  - (limited) control over the content
  ✗ sparser data

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:
  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)
  - it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html
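The ratio comparison above can be sketched in a few lines of Python; the counts are the 2012 figures quoted on the slide, and any live counts you fetch today will differ:

```python
# Compare web counts for two variant phrases as a unitless ratio.
# Remember: ghits are document frequencies, not term frequencies.
counts = {
    "different from": 251_000_000,
    "different than": 71_500_000,
}

ratio = counts["different from"] / counts["different than"]
print(f"{ratio:.1f}:1")  # roughly 3.5:1, as of 2012-04-04
```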

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)
  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with the available tools (here, taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  - do this for several languages and make the corpora available

Building Internet Corpora: Outline

1. Select seed words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding, cleanup, tagging and parsing)

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html
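Step 2 of the outline, turning seed words into query tuples, can be sketched as below. The seed list and tuple size are illustrative, not Sharoff's actual settings:

```python
import random

# Illustrative seed words; a real run would use ~500 frequent words.
seeds = ["water", "house", "music", "people", "school", "market"]

def make_queries(seeds, tuple_size=3, n_queries=10, rng=None):
    """Combine seed words into random, distinct query tuples."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    queries = set()
    while len(queries) < n_queries:
        # Sort so the same combination is not generated twice in
        # different orders.
        queries.add(tuple(sorted(rng.sample(seeds, tuple_size))))
    return [" ".join(q) for q in queries]

for q in make_queries(seeds):
    print(q)  # each line is one query to send to a search engine
```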

Internet Corpora Summary

• The web can be used as a corpus
  - Direct access
    * fast and convenient
    * huge amounts of data
    ✗ unreliable counts
  - Web sample
    * control over the sample
    * some setup costs (semi-automated)
    ✗ less data

• Richer data than a compiled corpus
  ✗ less balanced, less markup

Text and Meta-text

• Explicit meta-data
  - Keywords and categories
  - Rankings
  - Structural markup

• Implicit meta-data
  - Links and citations
  - Tags
  - Tables
  - File names
  - Translations

Explicit Metadata

• You can get information from metadata within documents
  - When it is accurate, it is very good
  - It is often deceitful

• HTML, PDF, Word, … metadata

• Keywords and tags

• Rankings

• Links and citations

• Structural markup

Implicit Metadata

• You can get clues from metadata within documents
  - as it is unintended, it tends to be noisy
  - but it is rarely deceitful

• HTML tags as constituent boundaries

• Tables as semantic relations

• File names (content type and language)

• Translations: bracketed glosses, cross-lingual disambiguation

• Query data

• Wikipedia redirections and cross-wiki links

Language Identification and Normalization

Language Identification

• We need to identify the encoding and language of a page in order to process it (metadata may be unreliable)
  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words
  - Similarity-based categorisation and classification
    * Character n-gram rankings
  - Machine-learning-based methods
  - Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text
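The character n-gram ranking approach can be sketched as follows: a toy version of the out-of-place measure, with two invented one-sentence "training corpora" standing in for real profile data:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams of a text."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; smaller means more similar."""
    penalty = len(lang_profile)  # cost for n-grams missing entirely
    rank = {g: i for i, g in enumerate(lang_profile)}
    return sum(abs(i - rank.get(g, penalty))
               for i, g in enumerate(doc_profile))

# Tiny illustrative training texts; real profiles need much more data.
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund",
}
profiles = {lang: ngram_profile(t) for lang, t in training.items()}

def identify(snippet):
    doc = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

print(identify("the dog and the fox"))
```

As the slide notes, this gets much harder for short snippets and for closely related languages, where the top-ranked n-grams overlap heavily.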

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number normalization: $700K, $700,000, 0.7 million dollars, …

• Date normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel
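The stemming examples above can be reproduced with a toy suffix-stripping stemmer; this is a drastically simplified sketch in the spirit of such algorithms, not a real stemmer:

```python
# Toy suffix-stripping stemmer: removes one suffix from a fixed list.
# Real stemmers use stem-measure conditions and several passes.
SUFFIXES = ["er", "es", "ed", "ing", "s", "e"]

def stem(word):
    # Try longer suffixes first, and keep a stem of at least 3 letters.
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: len(word) - len(suf)]
    return word

for w in ["producer", "produces", "produce", "producing"]:
    print(w, "->", stem(w))  # all map to "produc"
```

Note how stemming conflates producer and produces into one string that is not itself a word, whereas lemmatization would map produces to the lemma produce.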

The Semantic Web

Goals of the Semantic Web

• Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Semantic Web Architecture

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically
  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

• XML can be validated for syntactic well-formedness
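Checking well-formedness can be sketched with Python's standard library; the two sample documents are made up for illustration:

```python
import xml.etree.ElementTree as ET

def well_formed(doc):
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

print(well_formed("<course><code>HG2052</code></course>"))  # True
print(well_formed("<course><code>HG2052</course>"))  # False: tags not properly nested
```

Well-formedness (every tag closed, properly nested) is distinct from validity against a schema, which additionally constrains which tags may appear where.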

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers
  - Universal Resource Name: urn:isbn:1575864606
  - Universal Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework
  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

• You can build relations in ontologies (OWL)
  - Then reason over them, search them, …
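The triple model can be sketched with plain Python tuples; the URIs below are invented for illustration, not a real vocabulary:

```python
# RDF reduces all assertions to <subject, predicate, object> triples.
# These URIs are made up for illustration.
EX = "http://example.org/"

triples = {
    (EX + "HG2052", EX + "taughtBy", EX + "FrancisBond"),
    (EX + "HG2052", EX + "topic", EX + "LanguageAndTheInternet"),
    (EX + "FrancisBond", EX + "memberOf", EX + "NTU"),
}

def objects(subject, predicate):
    """Query: all objects asserted for a given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}

print(objects(EX + "HG2052", EX + "taughtBy"))
```

Because every element is a URI, triples from independent sources can be merged into one graph and queried together, which is the "web of data" idea above.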

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible (know thyself)

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

Citation, Reputation and PageRank

Citation Networks

• How can we tell what is a good scientific paper?
  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked
  - Context-based: citation analysis
    * See who else read it and thought it interesting enough to cite

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:
  - Total number of citations (pretty useful)
  - Total number of citations minus self-citations
  - Total of (citations / number of authors)
  - Average of (citation impact factor / number of authors)

• Problems:
  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

• So: weight citations by the quality of the citing paper
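The first three scores can be sketched directly; the citation records below are invented for illustration:

```python
# Each paper records its authors and the papers that cite it.
# All data is invented for illustration.
papers = {
    "p1": {"authors": ["bond", "sharoff"], "cited_by": ["p2", "p3"]},
    "p2": {"authors": ["bond"], "cited_by": ["p3"]},
    "p3": {"authors": ["smith"], "cited_by": []},
}

def total_citations(p):
    return len(papers[p]["cited_by"])

def citations_minus_self(p):
    """Exclude citations from papers that share an author."""
    own = set(papers[p]["authors"])
    return sum(1 for c in papers[p]["cited_by"]
               if not own & set(papers[c]["authors"]))

def citations_per_author(p):
    return total_citations(p) / len(papers[p]["authors"])

print(total_citations("p1"))       # 2
print(citations_minus_self("p1"))  # 1: p2 shares an author with p1
print(citations_per_author("p1"))  # 1.0
```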

Gaming Citations

• Least/Minimum Publishable Unit
  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

• Self citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Characteristics of the Web: Bow Tie

SCC (Strongly Connected Core): you can travel from any page to any page

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803/">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice

PageRank as Citation Analysis

• Citation frequency can be used to measure the impact of an article
  - Simplest measure: each article gets one vote (not very accurate)

• On the web, citation frequency = inlink count
  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank
  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

PageRank as Random Walk

• Imagine a web surfer doing a random walk on the web
  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate: the proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Teleporting: to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink
  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
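The teleporting random walk can be sketched as a power iteration over a small made-up link graph; the 10% teleport rate follows the slide, while the graph and convergence threshold are arbitrary:

```python
def pagerank(links, teleport=0.10, tol=1e-10):
    """Power iteration for the teleporting random walk.

    links: dict mapping each page to its list of outgoing links.
    Returns the steady-state visit rate (PageRank) of each page.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    while True:
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:
                # Dead end: jump to each page with probability 1/N.
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:
                    new[q] += rank[p] * teleport / n  # teleport step
                for q in out:
                    new[q] += rank[p] * (1 - teleport) / len(out)
        if max(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new

# Tiny made-up graph: C is a dead end, linked to by both A and B.
graph = {"A": ["B", "C"], "B": ["C"], "C": []}
ranks = pagerank(graph)
print({p: round(r, 3) for p, r in ranks.items()})
```

The ranks sum to 1 (they form a probability distribution), and C, which collects the most inlinks, ends up with the highest long-term visit rate.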

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank

Example Graph, Weighted

Pages with higher PageRanks are lighter

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

• Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a random user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, instructing some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

Current Status

• There is a continuous battle between:
  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

• All metrics get gamed

Digital Object Identifier

• DOI: a string used to uniquely identify an electronic document or object
  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a persistent identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u/

Conclusions

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3-5 times)

• Printing (made writing common)

• Digital text (made writing transferable)

• Hyperlinking (taking writing beyond language)

The Internet

• The internet is useful as a tool
  - Passively, for information access
  - Actively, for collaboration

• The internet is interesting in and of itself
  - As a source of data about existing language
  - As a source of innovation in language

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Complementary Courses

• HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics: (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics


atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 43: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

The Web as Corpus

Review and Conclusions 42

Two Approaches to using the Web as a Corpus

• Direct Query: search engine as query tool and WWW as corpus
  (Objection: results are not reliable)

  – Population and exact hit counts are unknown → no statistics possible
  – Indexing does not let us draw conclusions about the data
  ⊗ Google lacks functionality that linguists and lexicographers would like to have

• Web Sample: use a search engine to download data from the net and build a corpus from it

  – Known size and exact hit counts → statistics possible
  – Conclusions can be drawn about the included text types
  – (Limited) control over the content
  ⊗ Sparser data

Review and Conclusions 43

Direct Query

• Accessible through search engines (Google API, Yahoo API, scripts)

• Document counts have been shown to correlate directly with "real" frequencies (Keller 2003), so search engines can help, but:

  – lots of repetitions of the same text (not representative)
  – very limited query precision (no upper/lower case, no punctuation)
  – only estimated counts, often hard to reproduce exactly
  – different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

• ghit or "google hit" is the most common unit used to count web snippets (in the early 2000s)

  – it is document frequency, not term frequency

• whit or "web hit" is the more general term

• Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs. different than):
  251,000,000 ghits vs. 71,500,000, or 3.5:1 (accessed 2012-04-04)

• GPB, for "Ghits per billion documents", is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

• So always say when you counted, and try to use ratios
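The ratio comparison above is simple arithmetic; a throwaway sketch (the hit counts are the slide's figures as accessed on 2012-04-04, and `hit_ratio` is just an illustrative helper):

```python
# Comparing two phenomena by their web hit counts gives a unitless ratio,
# which is more stable over time than the raw (estimated) counts.
def hit_ratio(ghits_a, ghits_b):
    """Return the ratio of two hit counts, rounded to one decimal place."""
    return round(ghits_a / ghits_b, 1)

# "different from" vs "different than" (ghits, accessed 2012-04-04)
print(hit_ratio(251_000_000, 71_500_000))  # 3.5, i.e. roughly 3.5:1
```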

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

• Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  – gather documents for different topics (balance)
  – exclude documents which cannot be preprocessed with the available tools (here, taggers and lemmatizers)
  – exclude documents which seem irrelevant for a corpus (too short, too long, word lists)
  – do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1. Select seed words (500)

2. Combine them to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Post-process the data (encoding cleanup, tagging and parsing)
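The query-combination step of this pipeline could be sketched as below; the function name, parameters and seed words are illustrative, not Sharoff's actual code:

```python
import itertools
import random

def make_queries(seed_words, words_per_query=2, n_queries=10, seed=0):
    """Combine seed words into multi-word search-engine queries."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    combos = list(itertools.combinations(seed_words, words_per_query))
    rng.shuffle(combos)
    return [" ".join(c) for c in combos[:n_queries]]

seeds = ["picture", "shown", "considered", "understand", "important"]
for q in make_queries(seeds, n_queries=3):
    print(q)
```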

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.) Wacky! Working Papers on the Web as Corpus. Bologna, 2006.

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora Summary

• The web can be used as a corpus

  – Direct access
    ∗ Fast and convenient
    ∗ Huge amounts of data
    ⊗ Unreliable counts

  – Web sample
    ∗ Control over the sample
    ∗ Some setup costs (semi-automated)
    ⊗ Less data

• Richer data than a compiled corpus

  ⊗ Less balanced, less markup

48

Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate, they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations – Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (e.g. under a .jp or .kr domain)

• Hard to do for short text snippets, similar languages, mixed text
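A character n-gram identifier in the rank-comparison (out-of-place) style can be sketched as follows; the training snippets are toy data, far smaller than a real language profile:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams, with padding spaces at the edges."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams."""
    return [g for g, _ in char_ngrams(text, n).most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Rank distance between profiles: lower means more similar."""
    max_pen = len(lang_profile)  # penalty for grams missing from the profile
    return sum(
        abs(i - lang_profile.index(g)) if g in lang_profile else max_pen
        for i, g in enumerate(doc_profile)
    )

langs = {
    "en": profile("the quick brown fox jumps over the lazy dog and the cat"),
    "de": profile("der schnelle braune fuchs springt ueber den faulen hund"),
}
doc = profile("the dog and the fox")
print(min(langs, key=lambda l: out_of_place(doc, langs[l])))  # "en"
```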

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon cel
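Two of these steps can be sketched with naive rules; the regex and the suffix list are illustrative and far cruder than a real normalizer or stemmer:

```python
import re

def normalize_amount(text):
    """Map '$700K' / '$700,000' / '0.7M' style amounts to a plain integer."""
    m = re.fullmatch(r"\$?([\d,.]+)\s*([KkMm]?)", text.strip())
    if not m:
        raise ValueError(f"unrecognized amount: {text!r}")
    value = float(m.group(1).replace(",", ""))
    scale = {"": 1, "k": 1_000, "m": 1_000_000}[m.group(2).lower()]
    return int(value * scale)

def stem(word):
    """Crude suffix-stripping stemmer: produces/producer -> produc."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(normalize_amount("$700K"))           # 700000
print(normalize_amount("$700,000"))        # 700000
print(stem("produces"), stem("producer"))  # produc produc
```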

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation

• Can be validated for syntactic well-formedness
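Checking well-formedness is exactly what an XML parser does; a small sketch using Python's standard-library parser (the sample documents are invented):

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the XML parses, i.e. is syntactically well formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course id='HG2052'><title>Language, Technology "
                     "and the Internet</title></course>"))  # True
print(is_well_formed("<course><title>unclosed</course>"))   # False
```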

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations in ontologies (OWL)

  – Then reason over them, search them, …

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission: Impossible – know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text Mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total Number of Citations (pretty useful)
  – Total Number of Citations minus Self-citations
  – Total Number of (Citations / Number of Authors)
  – Average (Citation Impact Factor / Number of Authors)

• Problems:

  – Not all citations are equal: citations by 'good' papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
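The simple scores listed above are easy to compute from toy data; the impact factors and counts here are invented for illustration:

```python
def citation_scores(impact_factors, n_authors, n_self_citations):
    """Simple citation-count scores for one paper.

    impact_factors: impact factor of each citing publication.
    """
    total = len(impact_factors)
    return {
        "citations": total,
        "citations_minus_self": total - n_self_citations,
        "citations_per_author": total / n_authors,
        "avg_impact_per_author": (sum(impact_factors) / total) / n_authors,
    }

scores = citation_scores([2.0, 1.0, 3.0, 2.0], n_authors=2,
                         n_self_citations=1)
print(scores)
```

Note that none of these scores addresses the two problems above: a citation from a 'good' paper counts no more than any other, which is the gap PageRank-style weighting fills.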

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core – can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803:
  Language Technology and the Internet</a>

  For more information about Language Technology and the
  Internet see the <a href="http://…">HG803 Course Page</a>

• Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

  2. The (extended) anchor text pointing to page B is a good description of page B
     (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote – not very accurate

• On the web: citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article's vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page's PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: "jumping" from a dead end is independent of the teleportation rate
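The random walk with teleporting can be run to its steady state by power iteration; a minimal sketch, where the four-page graph and the function itself are illustrative:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power-iteration PageRank with teleporting.

    links: dict mapping each page to the list of pages it links to.
    Dead ends jump to a random page regardless of the teleport rate.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump anywhere
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport share, 0.1/N each
                    new[q] += teleport * rank[p] / n
                for q in out:                # follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# A and B link to each other; C links to A; D is a dead end.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # A: it has the most inbound weight
```

With teleport=0.10 and a page with 4 outgoing links, each link is followed with probability (1 − 0.10)/4 = 0.225, matching the worked example above.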

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

• Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

• Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink (<a href="…" rel="nofollow">) to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  – Google does not index the target of a link marked nofollow
  – Yahoo does not include the link in its ranking
  – …

74

Current Status

• There is a continuous battle between:

  – Search companies, who want to get the most useful page to the user
  – Page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press.

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press.

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics – (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 44: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Two Approaches to using the Web as a Corpus

atilde Direct Query Search Engine as Query tool and WWW as corpus(Objection Results are not reliable)

acirc Population and exact hit counts are unknown rarr no statistics possibleacirc Indexing does not allow to draw conclusions on the dataotimes Google is missing functionalities that linguists lexicographers would like to have

atilde Web Sample Use search engine to download data from the net and build a corpusfrom it

acirc known size and exact hit counts rarr statistics possibleacirc people can draw conclusions over the included text typesacirc (limited) control over the contentotimes sparser data

Review and Conclusions 43

Direct Query

atilde Accessible through search engines (Google API Yahoo API Scripts)

atilde Document counts are shown to correlate directly with ldquorealrdquo frequencies (Keller2003) so search engines can help - but

acirc lots of repetitions of the same text (not representative)acirc very limited query precision (no upperlower case no punctuation)acirc only estimated counts often hard to reproduce exactlyacirc different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

Direct Query

- Accessible through search engines (Google API, Yahoo API, scripts)

- Document counts are shown to correlate directly with “real” frequencies (Keller 2003), so search engines can help, but:

  - lots of repetitions of the same text (not representative)
  - very limited query precision (no upper/lower case, no punctuation)
  - only estimated counts, often hard to reproduce exactly
  - different queries give wildly different numbers

Review and Conclusions 44

Web Count Units

- ghit or “google hit” is the most common unit used to count web snippets (in the early 2000s)

  - it is document frequency, not term frequency

- whit or “web hit” is the more general term

- Normally you compare two phenomena to get a unitless ratio
  (e.g. different from vs different than)
  251,000,000 ghits vs 71,500,000, or 3.5:1 (accessed 2012-04-04)

- GPB, for “Ghits per billion documents”, is good if you want a more stable number (suggested by Mark Liberman), but Google no longer releases the index size

- So always say when you counted, and try to use ratios
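As a quick arithmetic sketch, the ratio and GPB calculations above work out as follows (the counts are the example figures from the slide; the index size is a made-up assumption, since Google no longer publishes it):

```python
# Unitless comparison of two web counts ("different from" vs "different than"),
# using the example ghit figures above (accessed 2012-04-04).
from_hits = 251_000_000
than_hits = 71_500_000

ratio = from_hits / than_hits
print(f"ratio: {ratio:.1f}:1")  # about 3.5:1

# GPB ("Ghits per billion documents") needs an index-size estimate;
# 40 billion documents here is purely a hypothetical assumption.
index_size = 40_000_000_000
gpb = from_hits / (index_size / 1_000_000_000)
print(f"GPB: {gpb:,.0f}")
```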

http://itre.cis.upenn.edu/~myl/languagelog/archives/000953.html 45

Web Sample

- Extracting and filtering web documents to create linguistically annotated corpora (Kilgarriff 2006)

  - gather documents for different topics (balance)
  - exclude documents which cannot be preprocessed with available tools (here taggers and lemmatizers)
  - exclude documents which seem irrelevant for a corpus (too short, or too long: word lists)
  - do this for several languages and make the corpora available

46

Building Internet Corpora: Outline

1. Select Seed Words (500)

2. Combine to form multiple queries (6,000)

3. Query a search engine and retrieve the URLs (50,000)

4. Download the files from the URLs (100,000,000 words)

5. Postprocess the data (encoding cleanup, tagging and parsing)
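A toy sketch of steps 1–2 above, assuming a handful of invented seed words (real pipelines use hundreds of seeds and send the resulting queries to a search-engine API):

```python
from itertools import combinations

# Step 1: a tiny stand-in for the ~500 seed words (invented examples).
seeds = ["house", "water", "music", "garden", "letter"]

# Step 2: combine seed words to form multiple queries
# (pairs here; real pipelines use longer tuples to reach rarer pages).
queries = [" ".join(pair) for pair in combinations(seeds, 2)]
print(len(queries), "queries, e.g.", queries[0])
```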

Sharoff, S. (2006) Creating general-purpose corpora using automated search engine queries. In M. Baroni and S. Bernardini (eds.), WaCky! Working Papers on the Web as Corpus. Bologna, 2006

http://corpus.leeds.ac.uk/internet.html 47

Internet Corpora: Summary

- The web can be used as a corpus

  - Direct access
    * Fast and convenient
    * Huge amounts of data
    ⊗ Unreliable counts

  - Web sample
    * Control over the sample
    * Some setup costs (semi-automated)
    ⊗ Less data

- Richer data than a compiled corpus

  ⊗ Less balanced, less markup

48

Text and Meta-text

- Explicit Meta-data

  - Keywords and Categories
  - Rankings
  - Structural Markup

- Implicit Meta-data

  - Links and Citations
  - Tags
  - Tables
  - File Names
  - Translations

and Zombies 49

Explicit Metadata

- You can get information from metadata within documents

  - When they are accurate, they are very good
  - They are often deceitful

- HTML, PDF, Word, … metadata

- Keywords and Tags

- Rankings

- Links and Citations

- Structural Markup

50

Implicit Metadata

- You can get clues from metadata within documents

  - as they are unintended, they tend to be noisy
  - but they are rarely deceitful

- HTML tags as constituent boundaries

- Tables as Semantic Relations

- File Names (content type and language)

- Translations: Bracketed Glosses, Cross-lingual Disambiguation

- Query Data

- Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

- Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  - Linguistically-grounded methods
    * Diacritics
    * Character n-grams
    * Stop words

  - Similarity-based categorisation and classification
    * Character n-gram rankings

  - Machine-learning-based methods

  - Context (under .jp or .ko)

- Hard to do for short text snippets, similar languages, mixed text
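A toy sketch of the character n-gram ranking approach mentioned above (in the style of Cavnar and Trenkle's out-of-place measure); the two training samples are invented and far too small for real use:

```python
from collections import Counter

def profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in a text."""
    text = " " + text.lower() + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(doc_profile, lang_profile):
    """'Out-of-place' distance: sum of rank differences, with a fixed
    penalty for n-grams missing from the language profile."""
    penalty = len(lang_profile)
    return sum(lang_profile.index(g) if g in lang_profile else penalty
               for g in doc_profile)

# Toy training data: real systems train on much larger samples per language.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "nl": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}
profiles = {lang: profile(text) for lang, text in samples.items()}

def identify(text):
    """Pick the language whose profile is closest to the text's profile."""
    doc = profile(text)
    return min(profiles, key=lambda lang: distance(doc, profiles[lang]))

print(identify("the dog and the cat"))
```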

Review and Conclusions 53

Normalization

- Extracting text from various documents

- Segmenting continuous text

- Number Normalization: $700K, $700,000, 0.7 million dollars, …

- Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

- Stripping stop words (the, a, of, …)

- Lemmatization: produces → produce

- Stemming: producer → produc, produces → produc

- Decompounding: zonnecel → zon + cel
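A minimal sketch of two of the steps above, number normalization and crude suffix-stripping stemming; the patterns are illustrative only, not a full normalizer:

```python
import re

def normalize_number(text):
    """Map '$700K' / '$700,000' style amounts to a plain integer of dollars.
    Only handles the $-prefixed forms; spelled-out forms need more rules."""
    m = re.fullmatch(r"\$([\d,.]+)([KMB]?)", text)
    if not m:
        raise ValueError(text)
    value = float(m.group(1).replace(",", ""))
    scale = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}[m.group(2)]
    return int(value * scale)

def stem(word):
    """Crude suffix stripping (Porter-like in spirit only):
    produces -> produc, producer -> produc."""
    for suffix in ("es", "er", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```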

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data

  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions

- Increase the utility of information by connecting it to definitions and context

- More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically

  - is a markup language, much like HTML
  - was designed to carry data, not to display data
  - tags are not predefined: you must define your own tags
  - is designed to be self-descriptive
  - XML is a W3C Recommendation

- Can be validated for syntactic well-formedness
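A small sketch showing self-defined tags and a well-formedness check, using Python's standard xml.etree module (the course document itself is invented):

```python
import xml.etree.ElementTree as ET

# XML with self-defined tags, carrying data rather than presentation.
doc = """<course id="HG2052">
  <title>Language, Technology and the Internet</title>
  <lecture number="12">Review and Conclusions</lecture>
</course>"""

root = ET.fromstring(doc)      # parses: the document is well formed
print(root.find("title").text)

try:
    ET.fromstring("<course><title>oops</course>")  # mismatched tags
except ET.ParseError as err:
    print("not well formed:", err)
```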

Review and Conclusions 58

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers (URIs)

  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

- Identify relations with the Resource Description Framework (RDF)

  - Triples of <subject, predicate, object>
  - Each element is a URI
  - RDF is written in well-defined XML
  - You can say anything about anything

- You can build relations into ontologies (OWL)

  - Then reason over them, search them, …
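A toy in-memory triple store sketching the <subject, predicate, object> model; the ex: URIs are invented placeholders, and a real system would use RDF tooling rather than plain tuples:

```python
# RDF-style statements as <subject, predicate, object> triples.
triples = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:HG2052", "ex:title", "Language, Technology and the Internet"),
    ("ex:FrancisBond", "ex:memberOf", "ex:NTU"),
}

def query(s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Everything we know about HG2052:
for t in query(s="ex:HG2052"):
    print(t)
```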

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data

- Text Mining is about unstructured data

- There is much more unstructured than structured data

  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?

  - Content-based
    * Read it and see if it is interesting (hard for a computer)
    * Compare it to other things you have read and liked

  - Context-based: Citation Analysis
    * See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

- Some scores are:

  - Total Number of Citations (pretty useful)
  - Total Number of Citations minus Self-citations
  - Total Number of (Citations / Number of Authors)
  - Average (Citation Impact Factor / Number of Authors)

- Problems:

  - Not all citations are equal: citations by 'good' papers are better
  - Newer publications suffer in relation to older ones

- Weight citations by the quality of the citing paper
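The first three scores above can be sketched as follows (the papers and counts are invented):

```python
# Toy citation scores; each paper records raw citations, self-citations,
# and the number of authors.
papers = {
    "paper_a": {"citations": 40, "self_citations": 5, "authors": 2},
    "paper_b": {"citations": 12, "self_citations": 1, "authors": 4},
}

def scores(p):
    """Total citations, citations minus self-citations, per-author share."""
    return {
        "total": p["citations"],
        "minus_self": p["citations"] - p["self_citations"],
        "per_author": p["citations"] / p["authors"],
    }

for name, p in papers.items():
    print(name, scores(p))
```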

Review and Conclusions 64

Gaming Citations

- Least/Minimum Publishable Unit

  - Break research into small chunks to increase the number of citations
  - Sometimes there is very little new information

- Self-citation, in-group citation

- Write-only proceedings (some journals are not often read)

- Submitting only to high-impact-factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC, the Strongly Connected Core: you can travel from any page to any other page

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

<a href="http://path/to/there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   (This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

- Citation frequency can be used to measure the impact of an article

  - Simplest measure: each article gets one vote – not very accurate

- On the web, citation frequency = inlink count

  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam

- Better measure: weighted citation frequency, or citation rank

  - An article's vote is weighted according to its citation impact
  - This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

- Imagine a web surfer doing a random walk on the web:

  - Start at a random page
  - At each step, go out of the current page along one of the links on that page, equiprobably

- In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

- This long-term visit rate is the page's PageRank

- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

- With the remaining probability (90%), go out on a random hyperlink

  - For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

- 10% is a parameter, the teleportation rate

- Note: “jumping” from a dead end is independent of the teleportation rate
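The random walk with teleportation can be sketched by power iteration; the four-page graph below is invented, and page "d" is a dead end:

```python
# PageRank by power iteration with a 10% teleportation rate,
# following the random-walk description above.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": [],   # dead end: always "jumps" to a random page
}

def pagerank(links, teleport=0.10, iterations=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            if links[p]:
                # follow a random outlink with probability 1 - teleport
                for q in links[p]:
                    new[q] += (1 - teleport) * rank[p] / len(links[p])
                # teleport with probability teleport, uniformly
                for q in pages:
                    new[q] += teleport * rank[p] / n
            else:
                # dead end: jump uniformly, regardless of the teleport rate
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

ranks = pagerank(links)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```

Each iteration redistributes all the probability mass, so the ranks stay a proper distribution; the dead-end page ends up with only the teleport mass.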

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph: Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

- Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

- Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

- Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, e.g. <a href="http://example.com/" rel="nofollow">link text</a>, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  - Google does not index the target of a link marked nofollow
  - Yahoo! does not include the link in its ranking
  - …

74

Current Status

- There is a continuous battle between:

  - Search companies, who want to get the most useful page to the user
  - Page writers, who want to get their page read

- All metrics get gamed

Review and Conclusions 75

Digital object identifier

- DOI: a string used to uniquely identify an electronic document or object

  - Metadata about the object is stored with the DOI name
  - The metadata includes a location, such as a URL
  - The DOI for a document is permanent; the metadata may change
  - Gives a Persistent Identifier (like an ISBN)

- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  - DOI 10.1007/s10579-008-9062-z →
    http://www.springerlink.com/content/v7q114033401th5u
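In practice a DOI name is resolved through the central doi.org proxy, which looks up the current location stored in the DOI metadata; a minimal sketch:

```python
# A DOI name becomes a resolvable URL via the doi.org proxy; the resolver
# redirects to whatever location the DOI metadata currently records.
def doi_url(doi):
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
```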

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)

- Writing (invented 3-5 times)

- Printing (made writing common)

- Digital Text (made writing transferable)

- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool

  - Passively, for information access
  - Actively, for collaboration

- The internet is interesting in and of itself

  - As a source of data about existing language
  - As a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia

- Crystal, D. (2001) Language and the Internet. Cambridge University Press

- Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

- HG3051 Corpus Linguistics: (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 46: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Web Count Units

atilde ghit or ldquogoogle hitrdquo is the most common unit used to count web snippets (in theearly 2000s)

acirc it is document frequency not term frequency

atilde whit or ldquoweb hitrdquo is the more general term

atilde Normally you compare two phenomena to get a unitless ratio(eg different from vs different than)251000000 ghits vs 71500000 or 351 (accessed 2012-04-04)

atilde GPB for ldquoGhits per billion documentsrdquo is good if want a more stable number (sug-gested by Mark Liberman)but Google no longer releases the index size

atilde So always say when you counted and try to use ratios

httpitrecisupennedu myllanguagelogarchives000953html 45

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 47: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Web Sample

atilde Extracting and filtering web documents to create linguistically annotated corpora(Kilgarriff 2006)

acirc gather documents for different topics (balance)acirc exclude documents which cannot be preprocessed with available tools (here tag-

gers and lemmatizers)acirc exclude documents which seem irrelevant for a corpus (too short or too long word

lists)acirc do this for several languages and make the corpora available

46

Building Internet Corpora Outline

1 Select Seed Words (500)

2 Combine to form multiple queries (6000)

3 Query a search engine and retrieve the URLs (50000)

4 Download the files from the URLS (100000000 words)

5 Postprocess the data (encoding cleanup tagging and parsing)

Sharoff S (2006) Creating general-purpose corpora using automated search enginequeries In M Baroni S Bernardini (eds) WaCky Working papers on the Web asCorpus Bologna 2006

httpcorpusleedsacukinternethtml 47

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-text

• Explicit Meta-data

  – Keywords and Categories
  – Rankings
  – Structural Markup

• Implicit Meta-data

  – Links and Citations
  – Tags
  – Tables
  – File Names
  – Translations

…and Zombies 49

Explicit Metadata

• You can get information from metadata within documents

  – When they are accurate, they are very good
  – They are often deceitful

• HTML, PDF, Word, … metadata

• Keywords and Tags

• Rankings

• Links and Citations

• Structural Markup

50

Implicit Metadata

• You can get clues from metadata within documents

  – as they are unintended, they tend to be noisy
  – but they are rarely deceitful

• HTML tags as constituent boundaries

• Tables as Semantic Relations

• File Names (content type and language)

• Translations — Bracketed Glosses, Cross-lingual Disambiguation

• Query Data

• Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

• Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

  – Linguistically-grounded methods
    ∗ Diacritics
    ∗ Character n-grams
    ∗ Stop words
  – Similarity-based categorisation and classification
    ∗ Character n-gram rankings
  – Machine-learning-based methods
  – Context (under .jp or .ko)

• Hard to do for short text snippets, similar languages, mixed text
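The similarity-based approach can be illustrated with a toy character n-gram rank classifier in the style of Cavnar and Trenkle's out-of-place measure. The two training snippets below are invented and far too small for real use:

```python
from collections import Counter

def ngram_profile(text, n=3, top=50):
    """Ranked list of the most frequent character n-grams in `text`."""
    text = f" {text.lower()} "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, candidate, penalty=50):
    """Rank distance between a snippet profile and a language profile."""
    pos = {g: i for i, g in enumerate(candidate)}
    return sum(abs(i - pos.get(g, penalty)) for i, g in enumerate(profile))

# Toy training data; real systems train on much larger samples.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
profiles = {lang: ngram_profile(t) for lang, t in TRAIN.items()}

def identify(snippet):
    """Pick the language whose profile is closest to the snippet's."""
    p = ngram_profile(snippet)
    return min(profiles, key=lambda lang: out_of_place(p, profiles[lang]))

print(identify("the dog and the cat"))  # "en" on this toy data
```

Note how short snippets and closely related languages shrink the rank-distance margin, which is why they appear above as hard cases.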

Review and Conclusions 53

Normalization

• Extracting text from various documents

• Segmenting continuous text

• Number Normalization: $700K, $700,000, 0.7 million dollars, …

• Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

• Stripping stop words (the, a, of, …)

• Lemmatization: produces → produce

• Stemming: producer → produc, produces → produc

• Decompounding: zonnecel → zon + cel
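A few of these steps can be sketched in Python; the rules below are deliberately naive stand-ins for real normalizers, stop lists and stemmers:

```python
import re

def normalize_number(text):
    """Rewrite $700K-style amounts as plain dollar figures (toy rule)."""
    return re.sub(r"\$(\d+(?:\.\d+)?)K\b",
                  lambda m: "$" + str(int(float(m.group(1)) * 1000)), text)

STOP = {"the", "a", "of"}

def strip_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP]

def stem(token):
    """Crude suffix stripping in the spirit of Porter: producer -> produc."""
    for suffix in ("es", "er", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 3:
            return token[: -len(suffix)]
    return token

print(normalize_number("$700K"))                     # $700000
print(strip_stop_words("the price of milk".split())) # ['price', 'milk']
print([stem(t) for t in ("producer", "produces")])   # ['produc', 'produc']
```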

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

• Web of data

  – provides a common data representation framework
  – makes it possible to integrate multiple sources
  – so you can draw new conclusions

• Increase the utility of information by connecting it to definitions and context

• More efficient information access and analysis

E.g. not just “color” but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

• XML is a set of rules for encoding documents electronically

  – is a markup language, much like HTML
  – was designed to carry data, not to display data
  – tags are not predefined: you must define your own tags
  – is designed to be self-descriptive
  – XML is a W3C Recommendation

• Can be validated for syntactic well-formedness
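Checking syntactic well-formedness can be sketched with Python's standard library parser. This checks well-formedness only, not validity against a DTD or schema:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """True if the markup parses as XML (tags nest and close properly)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course code='HG2052'><title>LTI</title></course>"))  # True
print(is_well_formed("<course><title>LTI</course>"))                        # False
```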

Review and Conclusions 58

Semantic Web Architecture (details)

• Identify things with Uniform Resource Identifiers (URIs)

  – Uniform Resource Name: urn:isbn:1575864606
  – Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

• Identify relations with the Resource Description Framework (RDF)

  – Triples of <subject, predicate, object>
  – Each element is a URI
  – RDF is written in well-defined XML
  – You can say anything about anything

• You can build relations into ontologies (OWL)

  – Then reason over them, search them, …
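The triple model can be illustrated in a few lines of Python. The `ex:` URIs below are invented for illustration; a real system would use full RDF IRIs and a proper triple store:

```python
# Minimal triple-store sketch: a set of <subject, predicate, object> triples.
TRIPLES = {
    ("ex:HG2052", "ex:taughtBy", "ex:FrancisBond"),
    ("ex:FrancisBond", "ex:memberOf", "ex:NTU"),
    ("ex:HG2052", "ex:topic", "ex:SemanticWeb"),
}

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in TRIPLES
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

print(query(p="ex:taughtBy"))  # the one matching triple
```

Pattern matching over triples like this is the core of SPARQL-style querying; OWL reasoning adds inference on top of it.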

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow’s seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren’t neutral

6. Metrics influence results

7. There’s more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

• The Semantic Web is about structuring data

• Text mining is about unstructured data

• There is much more unstructured than structured data

  – NLP can infer structure
  – NLP makes the Semantic Web feasible
  – the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

• How can we tell what is a good scientific paper?

  – Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked
  – Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

• One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

• Some scores are:

  – Total number of citations (pretty useful)
  – Total number of citations minus self-citations
  – Total of (citations / number of authors)
  – Average of (citation impact factor / number of authors)

• Problems:

  – Not all citations are equal: citations by ‘good’ papers are better
  – Newer publications suffer in relation to older ones

• Weight citations by the quality of the citing paper
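A toy computation of some of these scores; the citation list and quality weights are invented, and `quality_weighted` corresponds to weighting each citation by the citing paper:

```python
def citation_scores(citing, n_authors):
    """citing: one (is_self_citation, quality_weight) pair per citation received."""
    total = len(citing)
    return {
        "total": total,
        "minus_self": sum(1 for self_cite, _ in citing if not self_cite),
        "per_author": total / n_authors,
        "quality_weighted": sum(w for _, w in citing),  # 'good' papers count more
    }

# Four citations, one of them a self-citation, with made-up quality weights.
citing = [(False, 1.0), (False, 2.5), (True, 1.0), (False, 0.5)]
print(citation_scores(citing, n_authors=2))
```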

Review and Conclusions 64

Gaming Citations

• Least/Minimum Publishable Unit

  – Break research into small chunks to increase the number of citations
  – Sometimes there is very little new information

• Self-citation, in-group citation

• Write-only proceedings (some journals are not often read)

• Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

• Recall how hyperlinks are written:

<a href="http://path.to.there/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet, see the <a href="http://...">HG803 Course Page</a>

• Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice)

Review and Conclusions 67

PageRank as Citation analysis

• Citation frequency can be used to measure the impact of an article

  – Simplest measure: each article gets one vote — not very accurate

• On the web, citation frequency = inlink count

  – A high inlink count does not necessarily mean high quality …
  – … mainly because of link spam

• Better measure: weighted citation frequency, or citation rank

  – An article’s vote is weighted according to its citation impact
  – This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

• Imagine a web surfer doing a random walk on the web

  – Start at a random page
  – At each step, go out of the current page along one of the links on that page, equiprobably

• In the steady state, each page has a long-term visit rate:
  what proportion of the time someone will be there

• This long-term visit rate is the page’s PageRank

• PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting — to get us out of dead ends

• At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

• At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

• With the remaining probability (90%), go out on a random hyperlink

  – For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

• 10% is a parameter, the teleportation rate

• Note: “jumping” from a dead end is independent of the teleportation rate
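The random walk with teleportation can be sketched as a power iteration in Python; the four-page graph is a toy example, with the parameter values from the slide:

```python
def pagerank(links, teleport=0.10, iters=100):
    """Power iteration over the teleporting random walk.
    Dead ends jump uniformly; other pages teleport with prob. `teleport`."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: 1/N to every page
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport: 0.1/N to every page
                    new[q] += teleport * rank[p] / n
                for q in out:                # else follow a link equiprobably
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

# Toy graph: D is a dead end.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
ranks = pagerank(links)
print({p: round(r, 3) for p, r in ranks.items()})
```

The ranks always sum to 1, since each step redistributes every page's rank in full; this is the steady-state visit rate described above.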

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

• Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

• Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

• Scraper Sites: “scrape” search-engine results pages or other sources of content and create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission.

73

• Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target’s ranking in the search engine’s index.

  – Google does not index the target of a link marked nofollow
  – Yahoo! does not include the link in its ranking
  – …
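A small sketch using Python's standard library HTML parser to separate nofollow links from ordinary ones, much as a crawler might; the example links are invented:

```python
from html.parser import HTMLParser

class NofollowCounter(HTMLParser):
    """Collect link targets, splitting off those marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.followed, self.nofollowed = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href", "")
        if "nofollow" in (attrs.get("rel") or "").split():
            self.nofollowed.append(href)
        else:
            self.followed.append(href)

parser = NofollowCounter()
parser.feed('<a href="/spam" rel="nofollow">buy now</a> <a href="/page">ok</a>')
print(parser.followed, parser.nofollowed)  # ['/page'] ['/spam']
```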

74

Current Status

• There is a continuous battle between

  – search companies, who want to get the most useful page to the user
  – page writers, who want to get their page read

• All metrics get gamed

Review and Conclusions 75

Digital object identifier

• DOI: a string used to uniquely identify an electronic document or object

  – Metadata about the object is stored with the DOI name
  – The metadata includes a location, such as a URL
  – The DOI for a document is permanent; the metadata may change
  – Gives a Persistent Identifier (like an ISBN)

• The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

• By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  – DOI: 10.1007/s10579-008-9062-z
    http://www.springerlink.com/content/v7q114033401th5u/

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

• Speech (Language itself)

• Writing (invented 3–5 times)

• Printing (made writing common)

• Digital Text (made writing transferable)

• Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

• The internet is useful as a tool

  – Passively, for information access
  – Actively, for collaboration

• The internet is interesting in and of itself

  – As a source of data about existing language
  – As a source of innovation in language

Review and Conclusions 79

Recommended Texts

• Wikipedia

• Crystal, D. (2001) Language and the Internet. Cambridge University Press

• Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

• Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

• Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

• Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

• HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

• HG3051 Corpus Linguistics — (prerequisite HG2051 waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


Page 49: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Internet Corpora Summary

atilde The web can be used as a corpus

acirc Direct accesslowast Fast and convenientlowast Huge amounts of dataotimes unreliable counts

acirc Web samplelowast Control over the samplelowast Some setup costs (semi-automated)otimes Less data

atilde Richer data than a compiled corpus

otimes Less balanced less markup

48

Text and Meta-textatilde Explicit Meta-data

acirc Keywords and Categoriesacirc Rankingsacirc Structural Markup

atilde Implicit Meta-data

acirc Links and Citationsacirc Tagsacirc Tablesacirc File Namesacirc Translations

and Zombies 49

Explicit Metadata

atilde You can get information from metadata within documents

acirc When they are accurate they are very goodacirc They are often deceitful

atilde HTML PDF Word hellipMetadata

atilde Keywords and Tags

atilde Rankings

atilde Links and Citations

atilde Structural Markup

50

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
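The random walk with teleportation can be sketched as power iteration over a toy graph; the four pages and their links below are invented, and the 10% teleport rate follows the slide.

```python
import numpy as np

# Toy web graph: page -> list of pages it links to (names are made up).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],          # dead end: no outgoing links
}

def pagerank(links, teleport=0.10, iters=100):
    pages = sorted(links)
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    # Transition matrix: row i = probabilities of moving from page i to each page.
    P = np.zeros((n, n))
    for p, outs in links.items():
        if outs:
            # With prob (1 - teleport) follow a random outlink, equiprobably;
            # with prob teleport jump to any of the n pages.
            for q in outs:
                P[idx[p], idx[q]] += (1 - teleport) / len(outs)
            P[idx[p]] += teleport / n
        else:
            # Dead end: jump to a random page (independent of the teleport rate).
            P[idx[p]] = 1.0 / n
    # Power iteration: start uniform, repeatedly take one random-walk step.
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = r @ P
    return dict(zip(pages, r))

ranks = pagerank(links)
print(ranks)  # long-term visit rate (PageRank) per page; values sum to 1
```

After enough iterations the vector stops changing: that fixed point is the steady-state visit rate the slide describes, so a page like D with no real inlinks ends up below a well-linked page like C.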

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z
→ http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3-5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer – solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics – (pre-req HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Text and Meta-text

ã Explicit Meta-data

â Keywords and Categories
â Rankings
â Structural Markup

ã Implicit Meta-data

â Links and Citations
â Tags
â Tables
â File Names
â Translations

and Zombies 49

Explicit Metadata

ã You can get information from metadata within documents

â When they are accurate, they are very good
â They are often deceitful

ã HTML, PDF, Word, … Metadata

ã Keywords and Tags

ã Rankings

ã Links and Citations

ã Structural Markup

50

Implicit Metadata

ã You can get clues from metadata within documents

â as they are unintended, they tend to be noisy
â but they are rarely deceitful

ã HTML tags as constituent boundaries

ã Tables as Semantic Relations

ã File Names (content type and language)

ã Translations – Bracketed Glosses, Cross-lingual Disambiguation

ã Query Data

ã Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification and Normalization

Review and Conclusions 52

Language Identification

ã Need to identify the encoding and language of a page in order to process it (metadata may be unreliable)

â Linguistically-grounded methods
∗ Diacritics
∗ Character n-grams
∗ Stop words

â Similarity-based categorisation and classification
∗ Character n-gram rankings

â Machine Learning based methods
â Context (under .jp or .ko)

ã Hard to do for short text snippets, similar languages, mixed text
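A character-n-gram identifier of the kind listed above can be sketched in a few lines; the two training snippets and the threshold-free overlap score here are invented for illustration, not a production method.

```python
from collections import Counter

# Toy training texts, one per language (invented examples).
samples = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "nl": "de zonnecel en de kat zitten in de zon bij het huis",
}

def ngrams(text, n=3):
    # Character trigrams, with padding so word edges count too.
    text = f" {text} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

profiles = {lang: ngrams(t) for lang, t in samples.items()}

def identify(snippet):
    grams = ngrams(snippet)
    # Score each language by n-gram overlap with its profile.
    scores = {lang: sum(min(c, prof[g]) for g, c in grams.items())
              for lang, prof in profiles.items()}
    return max(scores, key=scores.get)

print(identify("the dog and the fox"))  # "en"
```

With snippets this short the scores are small and easily tied, which is exactly the "hard for short snippets and similar languages" problem the slide notes.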

Review and Conclusions 53

Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
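A few of these normalization steps can be sketched with toy rules; the function names and the crude suffix list below are illustrative stand-ins, not a real stemmer such as Porter's.

```python
import re

def normalize_number(text):
    # "$700K" -> "$700000" (one illustrative rule, not full number normalization)
    return re.sub(r"\$(\d+)K", lambda m: f"${int(m.group(1)) * 1000}", text)

STOP_WORDS = {"the", "a", "of"}

def strip_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def stem(word):
    # Crude suffix stripping in the spirit of stemming: produces -> produc
    for suffix in ("es", "er", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(normalize_number("$700K"))                      # $700000
print(strip_stop_words("the price of milk".split()))  # ['price', 'milk']
print(stem("produces"), stem("producer"))             # produc produc
```

Note how stemming collapses "produces" and "producer" to the same non-word "produc", whereas lemmatization would map "produces" to the real word "produce".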

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6/color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language, much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
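Well-formedness checking can be sketched with Python's standard-library XML parser; the <course> tags below are made-up example markup.

```python
import xml.etree.ElementTree as ET

good = ("<course><code>HG2052</code>"
        "<title>Language, Technology and the Internet</title></course>")
bad = "<course><code>HG2052</course>"  # mismatched tags: not well-formed

def well_formed(xml_text):
    # The parser raises ParseError on any violation of XML's syntax rules.
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(well_formed(good))  # True
print(well_formed(bad))   # False
```

This checks only syntax (every open tag closed, proper nesting); checking that a document uses the *right* tags additionally needs a schema or DTD.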

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers (URIs)

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/

ã Identify relations with the Resource Description Framework (RDF)

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
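The triple model can be sketched as plain tuples with wildcard matching; the example.com URIs are invented, and a real system would use an RDF library and SPARQL rather than this toy query function.

```python
# RDF-style <subject, predicate, object> triples, each element a URI string.
triples = {
    ("http://example.com/HG2052", "http://example.com/taughtBy",
     "http://example.com/FrancisBond"),
    ("http://example.com/HG2052", "http://example.com/offeredBy",
     "http://example.com/NTU"),
    ("http://example.com/FrancisBond", "http://example.com/worksAt",
     "http://example.com/NTU"),
}

def query(s=None, p=None, o=None):
    # None acts as a wildcard, like a variable in a SPARQL pattern.
    return [(ts, tp, to) for (ts, tp, to) in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# Who teaches HG2052?
for _, _, obj in query(s="http://example.com/HG2052",
                       p="http://example.com/taughtBy"):
    print(obj)  # http://example.com/FrancisBond
```

Because every element is a URI, triples from different sources can be merged into one graph and queried together, which is the "web of data" goal above.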

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Page 52: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Implicit Metadata

atilde You can get clues from metadata within documents

acirc as they are non-intended they tend to be noisyacirc but they are rarely deceitful

atilde HTML tags as constituent boundaries

atilde Tables as Semantic Relations

atilde File Names (content type and language)

atilde Translations mdash Bracketed Glosses Cross-lingual Disambiguation

atilde Query Data

atilde Wikipedia Redirections and Cross-wiki Links

Review and Conclusions 51

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

- Web of data
  - provides a common data representation framework
  - makes it possible to integrate multiple sources
  - so you can draw new conclusions
- Increase the utility of information by connecting it to definitions and context
- More efficient information access and analysis

E.g., not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

- XML is a set of rules for encoding documents electronically:
  - it is a markup language, much like HTML
  - it was designed to carry data, not to display data
  - its tags are not predefined: you must define your own tags
  - it is designed to be self-descriptive
  - XML is a W3C Recommendation
- Can be validated for syntactic well-formedness

Review and Conclusions 58
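Checking well-formedness, as mentioned above, needs no schema; Python's standard XML parser is enough for a sketch:

```python
# Check XML strings for syntactic well-formedness (no schema needed).
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    try:
        ET.fromstring(xml_text)   # raises ParseError on bad XML
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<course><code>HG2052</code></course>"))  # True
print(is_well_formed("<course><code>HG2052</course>"))         # False: mismatched tags
```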

Semantic Web Architecture (details)

- Identify things with Uniform Resource Identifiers (URIs)
  - Uniform Resource Name: urn:isbn:1575864606
  - Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond/
- Identify relations with the Resource Description Framework (RDF)
  - triples of <subject, predicate, object>
  - each element is a URI
  - RDF is written in well-defined XML
  - you can say anything about anything
- You can build relations into ontologies (OWL)
  - then reason over them, search them, …

Review and Conclusions 59
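The triple model above can be sketched with plain tuples; the URIs below are made-up examples, not real identifiers:

```python
# RDF-style triples as (subject, predicate, object) tuples of URI strings.
# All URIs are made-up examples.
triples = [
    ("http://example.org/HG2052", "http://example.org/taughtBy", "http://example.org/FrancisBond"),
    ("http://example.org/HG2052", "http://example.org/offeredAt", "http://example.org/NTU"),
    ("http://example.org/FrancisBond", "http://example.org/worksAt", "http://example.org/NTU"),
]

def objects_of(subject, predicate):
    """All objects o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("http://example.org/HG2052", "http://example.org/taughtBy"))
# ['http://example.org/FrancisBond']
```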

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie
2. People are lazy
3. People are stupid
4. Mission Impossible: know thyself
5. Schemas aren't neutral
6. Metrics influence results
7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

- The Semantic Web is about structuring data
- Text mining is about unstructured data
- There is much more unstructured than structured data
  - NLP can infer structure
  - NLP makes the Semantic Web feasible
  - the Semantic Web can be a resource for NLP

Review and Conclusions 61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

- How can we tell what is a good scientific paper?
  - Content-based:
    - read it and see if it is interesting (hard for a computer)
    - compare it to other things you have read and liked
  - Context-based: citation analysis
    - see who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

- One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group
- Some scores are:
  - total number of citations (pretty useful)
  - total number of citations minus self-citations
  - total of (citations / number of authors)
  - average of (citation impact factor / number of authors)
- Problems:
  - not all citations are equal: citations by 'good' papers are better
  - newer publications suffer in relation to older ones
- Weight citations by the quality of the paper

Review and Conclusions 64
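The simpler score variants listed above are straightforward to compute from a paper's citation records; a sketch with made-up data:

```python
# Simple citation scores for one paper (made-up data).
n_authors = 3                     # co-authors of the cited paper
citations = [
    {"self": False},
    {"self": True},               # a self-citation
    {"self": False},
    {"self": False},
]

total = len(citations)                               # total citations
minus_self = sum(1 for c in citations if not c["self"])  # minus self-citations
fractional = total / n_authors                       # split credit among co-authors

print(total, minus_self, round(fractional, 2))  # 4 3 1.33
```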

Gaming Citations

- Least/Minimum Publishable Unit
  - break research into small chunks to increase the number of citations
  - sometimes there is very little new information
- Self-citation, in-group citation
- Write-only proceedings (some journals are not often read)
- Submitting only to high-impact-factor journals

You improve what gets measured, not necessarily what you want to improve.

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC (Strongly Connected Core): you can travel from any page to any page within it

Review and Conclusions 66

Anchor Text

- Recall how hyperlinks are written:

  <a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

  For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

- Link analysis builds on two intuitions:

  1. The hyperlink from A to B represents an endorsement of page B by the creator of page A.
  2. The (extended) anchor text pointing to page B is a good description of page B.

  This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.

Review and Conclusions 67
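Collecting the (link, anchor-text) pairs that link analysis builds on takes only the standard-library HTML parser; a sketch (the URL is a made-up example):

```python
# Collect (href, anchor text) pairs from an HTML fragment.
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:     # only collect text inside <a>…</a>
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

html = '<p>See the <a href="http://example.org/HG803">HG803 Course Page</a>.</p>'
collector = AnchorCollector()
collector.feed(html)
print(collector.pairs)  # [('http://example.org/HG803', 'HG803 Course Page')]
```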

PageRank as Citation Analysis

- Citation frequency can be used to measure the impact of an article
  - simplest measure: each article gets one vote (not very accurate)
- On the web, citation frequency = inlink count
  - a high inlink count does not necessarily mean high quality …
  - … mainly because of link spam
- Better measure: weighted citation frequency, or citation rank
  - an article's vote is weighted according to its citation impact
  - this can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random Walk

- Imagine a web surfer doing a random walk on the web:
  - start at a random page
  - at each step, go out of the current page along one of its links, equiprobably
- In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there
- This long-term visit rate is the page's PageRank
- PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

- At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)
- At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)
- With the remaining probability (90%), go out on a random hyperlink
  - for example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225
- 10% is a parameter: the teleportation rate
- Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
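The random walk with teleportation described above converges to PageRank; it can be computed by power iteration. A sketch on a tiny made-up four-page web:

```python
# PageRank by power iteration, with teleportation, on a made-up graph.
# links[i] = pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: []}   # page 3 is a dead end
N = len(links)
teleport = 0.10                              # teleportation rate

rank = [1.0 / N] * N                         # start uniform
for _ in range(100):
    new = [0.0] * N
    for page, outs in links.items():
        if not outs:                         # dead end: jump uniformly (independent of rate)
            for q in range(N):
                new[q] += rank[page] / N
        else:
            for q in range(N):               # teleport with probability 0.10
                new[q] += teleport * rank[page] / N
            for q in outs:                   # otherwise follow a random outlink
                new[q] += (1 - teleport) * rank[page] / len(outs)
    rank = new

print([round(r, 3) for r in rank])  # sums to 1; page 2 has the highest rank
```

Each step redistributes every page's full rank (teleport mass plus link mass), so the vector stays a probability distribution, and the dead-end page 3 ends up with the lowest rank because only teleports reach it.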

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

- Link spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.
- Link farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.
- Scraper sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

Review and Conclusions 73

- Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

  The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

  - Google does not index the target of a link marked nofollow
  - Yahoo does not include the link in its ranking
  - …

Review and Conclusions 74

Current Status

- There is a continuous battle between:
  - search companies, who want to get the most useful page to the user
  - page writers, who want to get their page read
- All metrics get gamed

Review and Conclusions 75

Digital Object Identifier

- DOI: a string used to uniquely identify an electronic document or object
  - metadata about the object is stored with the DOI name
  - the metadata includes a location, such as a URL
  - the DOI for a document is permanent; the metadata may change
  - gives a persistent identifier (like an ISBN)
- The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation
- By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations
  - e.g., DOI 10.1007/s10579-008-9062-z resolves to http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

- Speech (language itself)
- Writing (invented 3-5 times)
- Printing (made writing common)
- Digital text (made writing transferable)
- Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

- The internet is useful as a tool:
  - passively, for information access
  - actively, for collaboration
- The internet is interesting in and of itself:
  - as a source of data about existing language
  - as a source of innovation in language

Review and Conclusions 79

Recommended Texts

- Wikipedia
- Crystal, D. (2001) Language and the Internet. Cambridge University Press.
- Sproat, R. (2010) Language Technology and Society. Oxford University Press.
- Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition.
- Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press.
- Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press.

Review and Conclusions 80

Complementary Courses

- HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics
- HG3051 Corpus Linguistics (prerequisite HG2051, waived): an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 53: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Language Identification andNormalization

Review and Conclusions 52

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 54: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Language Identification

atilde Need to identify the encoding and language of a page in order to process it (meta-data may be unreliable)

acirc Linguistically-grounded methodslowast Diacriticslowast Character n-gramslowast Stop words

acirc Similarity-based categorisation and classificationlowast Character n-gram rankings

acirc Machine Learning based methodsacirc Context (under jp or ko)

atilde Hard to do for short test snippets similar languages mixed text

Review and Conclusions 53

Normalization

atilde Extracting text from various documents

atilde Segmenting continuous text

atilde Number Normalization $700K $700000 07 million dollars hellip

atilde Date Normalization 2000AD 1421AH Heisei 12 hellip

atilde Stripping stop words (the a of hellip)

atilde Lemmatization produces rarr produce

atilde Stemming producer rarr produc produces rarr produc

atilde Decompounding zonnecel rarr zon cel

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier


Normalization

ã Extracting text from various documents

ã Segmenting continuous text

ã Number Normalization: $700K, $700,000, 0.7 million dollars, …

ã Date Normalization: 2000 AD, 1421 AH, Heisei 12, …

ã Stripping stop words (the, a, of, …)

ã Lemmatization: produces → produce

ã Stemming: producer → produc, produces → produc

ã Decompounding: zonnecel → zon + cel
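These steps can be sketched in a few lines of Python (a toy illustration: the stop-word list and suffix-trimming rule below are simplified stand-ins, not a real tokenizer or stemmer):

```python
import re

STOP_WORDS = {"the", "a", "of"}

def normalize(text):
    """Lower-case, tokenize, strip stop words, and crudely 'stem'
    by trimming common suffixes (a toy stand-in for a real stemmer)."""
    tokens = re.findall(r"[a-z0-9$]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [re.sub(r"(er|es|s)$", "", t) for t in tokens]

print(normalize("The producer produces a record"))
```

Note how stemming conflates producer and produces into the same string, produc, which need not be a real word.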

Review and Conclusions 54

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

ã Web of data

â provides a common data representation framework
â makes it possible to integrate multiple sources
â so you can draw new conclusions

ã Increase the utility of information by connecting it to definitions and context

ã More efficient information access and analysis

E.g. not just "color" but a concept denoted by a Web identifier:
<http://pantone.tm.example.com/2002/std6#color>

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML: eXtensible Markup Language

ã XML is a set of rules for encoding documents electronically

â is a markup language much like HTML
â was designed to carry data, not to display data
â tags are not predefined: you must define your own tags
â is designed to be self-descriptive
â XML is a W3C Recommendation

ã Can be validated for syntactic well-formedness
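Checking well-formedness needs no schema, only a parser; a minimal sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the XML parses: every tag closed, properly nested."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<note><to>Kim</to></note>"))  # properly nested
print(is_well_formed("<note><to>Kim</note>"))       # mismatched tags
```

The first document is well-formed; the second fails because `<to>` is never closed.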

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations in ontologies (OWL)

â Then reason over them, search them, …
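The triple model can be mimicked with plain tuples; a toy sketch (the example.org URIs are invented for illustration, and a real system would use an RDF library and store):

```python
# Each triple is <subject, predicate, object>, with URIs as plain strings.
triples = [
    ("http://example.org/bond", "http://example.org/teaches",
     "http://example.org/HG2052"),
    ("http://example.org/HG2052", "http://example.org/title",
     "Language, Technology and the Internet"),
]

def objects_of(subject, predicate):
    """Query the triple store: all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("http://example.org/bond", "http://example.org/teaches"))
```

Because every statement has the same shape, triples from different sources can simply be concatenated and queried together, which is the point of the "web of data".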

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (Pretty Useful)
â Total Number of Citations minus Self-citations
â Total Number of (Citations / Number of Authors)
â Average (Citation Impact Factor / Number of Authors)

ã Problems

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Weight citations by the quality of the paper
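The simple scores above are just arithmetic once the counts are known; a sketch (the sample numbers are invented, and the per-author score here divides the total by a single author count rather than per paper):

```python
def citation_scores(citations, n_authors, self_citations=0):
    """Compute simple citation scores for one author.
    `citations` is a list of per-paper citation counts."""
    total = sum(citations)
    return {
        "total": total,                          # all citations
        "minus_self": total - self_citations,    # discount self-citation
        "per_author": total / n_authors,         # share credit among authors
    }

scores = citation_scores(citations=[10, 4, 1], n_authors=2, self_citations=3)
print(scores)
```

Even this toy version shows why the scores disagree: the same record gives 15, 12, or 7.5 depending on which correction you apply.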

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured,
not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Component — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language Technology and the Internet</a>

For more information about Language Technology and the
Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
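Anchor text can be pulled out of a page with the standard-library HTML parser; a minimal sketch (the example link is invented):

```python
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from the links in a page."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

parser = AnchorExtractor()
parser.feed('See the <a href="http://example.org/HG803">HG803 Course Page</a>.')
print(parser.links)
```

A search engine would index these (href, text) pairs so that the anchor text of incoming links describes the target page.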

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote – not very accurate

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
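The random walk with teleportation can be computed directly by power iteration; a small sketch (the three-page graph is invented for illustration):

```python
def pagerank(links, teleport=0.10, iters=100):
    """links: dict page -> list of outgoing links.
    Returns each page's long-term visit rate (its PageRank)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if not out:                      # dead end: jump anywhere, 1/N each
                for q in pages:
                    new[q] += rank[p] / n
            else:
                for q in pages:              # teleport with probability 10%
                    new[q] += teleport * rank[p] / n
                for q in out:                # else follow a random outlink
                    new[q] += (1 - teleport) * rank[p] / len(out)
        rank = new
    return rank

r = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(sorted(r, key=r.get, reverse=True))  # C, with two inbound links, ranks highest
```

The ranks always sum to one, since each step only redistributes the probability mass; iterating the update is exactly simulating the steady state of the walk.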

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, e.g. <a href="http://…" rel="nofollow">, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI: 10.1007/s10579-008-9062-z
http://www.springerlink.com/content/v7q114033401th5u
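Resolving a DOI just means prepending the doi.org resolver, which redirects to whatever location is currently stored in the metadata; a one-function sketch:

```python
def doi_to_url(doi):
    """Build the resolver URL for a DOI. The resolver redirects to the
    document's current location, so links survive when the location moves."""
    return "https://doi.org/" + doi

print(doi_to_url("10.1007/s10579-008-9062-z"))
```

This is why citing a DOI is more robust than citing a publisher URL: only the metadata behind the DOI needs updating when a page moves.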

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) This course is an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 56: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

The Semantic Web

Review and Conclusions 55

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 57: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Goals of the Semantic Web

atilde Web of data

acirc provides common data representation frameworkacirc makes possible integrating multiple sourcesacirc so you can draw new conclusions

atilde Increase the utility of information by connecting it to definitions and context

atilde More efficient information access and analysis

EG not just rdquocolorrdquo but a concept denoted by a Web identifierlthttppantonetmexamplecom2002std6colorgt

Review and Conclusions 56

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

atilde Identify things with Uniform Resource Identifiers

acirc Universal Resource Name urnisbn1575864606acirc Universal Resource Locator httpwww3ntuedusghomefcbond

atilde Identify relations with Resource Description Framework

acirc Triples of ltsubject predicate objectgtacirc Each element is a URIacirc RDFs are written in well defined XMLacirc You can say anything about anything

atilde You can build relations in ontologies (OWL)

acirc Then reason over them search them hellip

Review and Conclusions 59

Criticism of the Semantic Web

Doctorowrsquos seven insurmountable obstacles to reliable metadata are

1 People lie

2 People are lazy

3 People are stupid

4 Mission Impossible know thyself

5 Schemas arenrsquot neutral

6 Metrics influence results

7 Therersquos more than one way to describe something

httpwwwwellcom~doctorowmetacraphtm 60

Semantic Web and NLP

atilde The Semantic Web is about structuring data

atilde Text Mining is about unstructured data

atilde There is much more unstructured than structured data

acirc NLP can infer structureacirc NLP makes the Semantic Web feasibleacirc the Semantic Web can be a resource for NLP

61

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 58: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Semantic Web Architecture

Review and Conclusions 57

XML eXtensible Markup Language

atilde XML is a set of rules for encoding documents electronically

acirc is a markup language much like HTMLacirc was designed to carry data not to display dataacirc tags are not predefined You must define your own tagsacirc is designed to be self-descriptiveacirc XML is a W3C Recommendation

atilde Can be validated for syntactic well-formedness

Review and Conclusions 58

Semantic Web Architecture (details)

ã Identify things with Uniform Resource Identifiers (URIs)

â Uniform Resource Name: urn:isbn:1575864606
â Uniform Resource Locator: http://www3.ntu.edu.sg/home/fcbond

ã Identify relations with the Resource Description Framework (RDF)

â Triples of <subject, predicate, object>
â Each element is a URI
â RDF is written in well-defined XML
â You can say anything about anything

ã You can build relations into ontologies (OWL)

â Then reason over them, search them, …
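The triple model above can be sketched directly; the example URIs below are made up for illustration, and real systems would use an RDF library rather than plain tuples:

```python
# Each statement is a <subject, predicate, object> triple of URIs (or a literal object).
triples = [
    ("http://example.org/fcbond", "http://example.org/teaches",
     "http://example.org/HG2052"),
    ("http://example.org/HG2052", "http://example.org/title",
     "Language, Technology and the Internet"),
]

def query(store, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None matches anything."""
    return [t for t in store
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)]

# Who teaches what?
print(query(triples, predicate="http://example.org/teaches"))
```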

Review and Conclusions 59

Criticism of the Semantic Web

Doctorow's seven insurmountable obstacles to reliable metadata are:

1. People lie

2. People are lazy

3. People are stupid

4. Mission Impossible: know thyself

5. Schemas aren't neutral

6. Metrics influence results

7. There's more than one way to describe something

http://www.well.com/~doctorow/metacrap.htm 60

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

â NLP can infer structure
â NLP makes the Semantic Web feasible
â The Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

â Content-based
∗ Read it and see if it is interesting (hard for a computer)
∗ Compare it to other things you have read and liked

â Context-based: Citation Analysis
∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

â Total Number of Citations (pretty useful)
â Total Number of Citations minus Self-citations
â Total of (Citations / Number of Authors)
â Average of (Citation Impact Factor / Number of Authors)

ã Problems:

â Not all citations are equal: citations by 'good' papers are better
â Newer publications suffer in relation to older ones

ã Solution: weight citations by the quality of the citing paper
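The simple scores above can be sketched in a few lines; the citation counts here are invented for illustration:

```python
def citation_scores(papers):
    """Compute three of the simple scores for an author's papers.

    papers: list of dicts with keys 'citations', 'self_citations', 'n_authors'.
    Returns (total citations, total minus self-citations, citations per author).
    """
    total = sum(p["citations"] for p in papers)
    minus_self = sum(p["citations"] - p["self_citations"] for p in papers)
    per_author = sum(p["citations"] / p["n_authors"] for p in papers)
    return total, minus_self, per_author

papers = [
    {"citations": 10, "self_citations": 2, "n_authors": 2},
    {"citations": 4, "self_citations": 0, "n_authors": 1},
]
print(citation_scores(papers))  # (14, 12, 9.0)
```

Note that none of these address the two problems above; that is what quality-weighted schemes like PageRank are for.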

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

â Break research into small chunks to increase the number of citations
â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC (Strongly Connected Core): one can travel from any page to any other page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803: Language Technology and the Internet</a>

For more information about Language Technology and the Internet see the <a href="http://…">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
(This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice.)

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

â Simplest measure: each article gets one vote (not very accurate)

ã On the web, citation frequency = inlink count

â A high inlink count does not necessarily mean high quality …
â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

â An article's vote is weighted according to its citation impact
â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate:
what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
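The random walk with teleportation can be computed by power iteration. This is a minimal sketch on an invented four-page graph, using the 10% teleportation rate from the slides:

```python
def pagerank(links, teleport=0.10, iterations=100):
    """Compute PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    A page with no outlinks (a dead end) jumps to every page with
    probability 1/N, independent of the teleportation rate.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for p in pages:
            out = links[p]
            if out:
                # Follow one of the outlinks equiprobably with prob (1 - teleport);
                # with 4 outlinks each gets (1 - 0.10)/4 = 0.225 of rank[p].
                for q in out:
                    new[q] += (1 - teleport) * rank[p] / len(out)
                # Teleport uniformly with the remaining probability.
                for q in pages:
                    new[q] += teleport * rank[p] / n
            else:
                # Dead end: jump uniformly to any page.
                for q in pages:
                    new[q] += rank[p] / n
        rank = new
    return rank

# Hypothetical graph: A -> B, B -> A and C, C is a dead end, D -> C.
print(pagerank({"A": ["B"], "B": ["A", "C"], "C": [], "D": ["C"]}))
```

The ranks form a probability distribution (they sum to 1), and pages with more inlinks from well-ranked pages end up with higher long-term visit rates.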

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

ã Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

ã Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically select a user-edited web page at random, such as a Wikipedia article, and add spamming links.

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

â Google does not index the target of a link marked nofollow
â Yahoo! does not include the link in its ranking
â …

74

Current Status

atilde There is a continuous battle between

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z resolves to http://www.springerlink.com/content/v7q114033401th5u
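Because the DOI is permanent while the location metadata may change, in practice one links to the DOI resolver rather than to the current URL. A trivial sketch (the DOI is the one from the slide; the doi.org resolver then redirects to wherever the publisher currently hosts the document):

```python
def doi_url(doi):
    """Build the standard resolver URL for a DOI name."""
    return "https://doi.org/" + doi

print(doi_url("10.1007/s10579-008-9062-z"))
# https://doi.org/10.1007/s10579-008-9062-z
```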

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (language itself)

ã Writing (invented 3-5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics (Pre-req HG2051, waived): an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81


The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 62: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Semantic Web and NLP

ã The Semantic Web is about structuring data

ã Text Mining is about unstructured data

ã There is much more unstructured than structured data

  â NLP can infer structure
  â NLP makes the Semantic Web feasible
  â the Semantic Web can be a resource for NLP

61

Citation, Reputation and PageRank

Review and Conclusions 62

Citation Networks

ã How can we tell what is a good scientific paper?

  â Content-based
    ∗ Read it and see if it is interesting (hard for a computer)
    ∗ Compare it to other things you have read and liked

  â Context-based: Citation Analysis
    ∗ See who else read it and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

ã One major use of citation networks is in measuring the productivity and impact of the published work of a scientist, scholar or research group

ã Some scores are:

  â Total Number of Citations (Pretty Useful)
  â Total Number of Citations minus Self-citations
  â Total Number of (Citations / Number of Authors)
  â Average (Citation Impact Factor / Number of Authors)

ã Problems:

  â Not all citations are equal: citations by ‘good’ papers are better
  â Newer publications suffer in relation to older ones

ã Weight Citations by Quality of the paper
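The simple scores above can be sketched in a few lines of Python. The toy citation graph and its field names are made up for illustration; only the scoring logic follows the slide.

```python
# Toy citation graph: for each paper, its authors and the papers citing it.
papers = {
    "A": {"authors": {"bond", "smith"}, "cited_by": ["B", "C", "D"]},
    "B": {"authors": {"bond"},          "cited_by": ["C"]},
    "C": {"authors": {"lee"},           "cited_by": []},
    "D": {"authors": {"tan", "lee"},    "cited_by": ["A"]},
}

def total_citations(pid):
    """Simplest score: every citation is one vote."""
    return len(papers[pid]["cited_by"])

def citations_minus_self(pid):
    """Drop citations from papers that share an author with pid."""
    own = papers[pid]["authors"]
    return sum(1 for c in papers[pid]["cited_by"]
               if not (papers[c]["authors"] & own))

def citations_per_author(pid):
    """Citations divided by the number of authors."""
    return total_citations(pid) / len(papers[pid]["authors"])

print(total_citations("A"))       # 3
print(citations_minus_self("A"))  # 2 (B shares the author "bond")
print(citations_per_author("A"))  # 1.5
```

Weighting each vote by the quality of the citing paper is exactly the refinement PageRank makes on the web.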

Review and Conclusions 64

Gaming Citations

ã Least/Minimum Publishable Unit

  â Break research into small chunks to increase the number of citations
  â Sometimes there is very little new information

ã Self-citation, in-group citation

ã Write-only proceedings (some journals are not often read)

ã Submitting only to high impact factor journals

You improve what gets measured, not necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web: Bow Tie

SCC: Strongly Connected Core — can travel from any page to any page

Review and Conclusions 66

Anchor Text

ã Recall how hyperlinks are written:

<a href="http://path/to/the/page/HG803">HG803:
Language, Technology and the Internet</a>

For more information about Language, Technology and the
Internet see the <a href="http://...">HG803 Course Page</a>

ã Link analysis builds on two intuitions:

1. The hyperlink from A to B represents an endorsement of page B by the creator of page A

2. The (extended) anchor text pointing to page B is a good description of page B
   This is not always the case: for instance, most corporate websites have a pointer from every page to a page containing a copyright notice
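Collecting (href, anchor text) pairs is the first step of any anchor-text analysis. A minimal sketch with only the Python standard library (the example URL is hypothetical):

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.pairs = []     # finished (href, text) pairs
        self._href = None   # href of the <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:   # only text inside an <a> counts
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None

html = 'See the <a href="http://example.org/HG803">HG803 Course Page</a> for details.'
p = AnchorCollector()
p.feed(html)
print(p.pairs)  # [('http://example.org/HG803', 'HG803 Course Page')]
```

A search engine indexes these pairs under the *target* page, so the link text describes the page it points to, not the page it sits on.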

Review and Conclusions 67

PageRank as Citation analysis

ã Citation frequency can be used to measure the impact of an article

  â Simplest measure: each article gets one vote – not very accurate

ã On the web: citation frequency = inlink count

  â A high inlink count does not necessarily mean high quality …
  â … mainly because of link spam

ã Better measure: weighted citation frequency, or citation rank

  â An article's vote is weighted according to its citation impact
  â This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

ã Imagine a web surfer doing a random walk on the web:

  â Start at a random page
  â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: what proportion of the time someone will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

  â For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: “jumping” from a dead end is independent of the teleportation rate
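The walk above can be simulated directly by power iteration: repeatedly redistribute each page's rank mass according to the teleporting rules until it stops changing. The 10% teleportation rate follows the slides; the four-page link structure is a made-up toy example.

```python
def pagerank(links, teleport=0.10, iters=100):
    """links: dict mapping each node to its list of outlinked nodes."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}      # start uniform
    for _ in range(iters):
        new = {v: 0.0 for v in nodes}
        for v in nodes:
            out = links[v]
            if not out:                     # dead end: jump uniformly
                for w in nodes:
                    new[w] += rank[v] / n
            else:
                for w in nodes:             # teleport with probability 10%
                    new[w] += teleport * rank[v] / n
                for w in out:               # else follow a random outlink
                    new[w] += (1 - teleport) * rank[v] / len(out)
        rank = new
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
r = pagerank(links)
print(round(sum(r.values()), 6))  # rank mass stays a probability distribution
print(max(r, key=r.get))          # "C": it collects the most inlinks here
```

Because every page keeps a small teleport probability, the chain has a unique steady state, which is exactly the PageRank vector.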

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph, Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: “scrape” search-engine results pages or other sources of content to create “content” for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

73

ã Comment Spam: a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically, randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

ã The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink, to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

  â Google does not index the target of a link marked nofollow
  â Yahoo does not include the link in its ranking
  â …

74

Current Status

ã There is a continuous battle between:

  â Search companies, who want to get the most useful page to the user
  â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

  â Metadata about the object is stored with the DOI name
  â The metadata includes a location, such as a URL
  â The DOI for a document is permanent; the metadata may change
  â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

  â DOI 10.1007/s10579-008-9062-z
    → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

  â Passively, for information access
  â Actively, for collaboration

ã The internet is interesting in and of itself

  â As a source of data about existing language
  â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer — solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics — (pre-requisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 63: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Citation Reputation andPageRank

Review and Conclusions 62

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 64: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Citation Networks

atilde How can we tell what is a good scientific paper

acirc Content-basedlowast Read it and see if it is interesting (hard for a computer)lowast Compare it to other things you have read and liked

acirc Context based Citation Analysislowast See who else read and thought it interesting enough to cite

Review and Conclusions 63

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 65: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Reputation and Citation Analysis

atilde One major use of citation networks is in measuring productivity and impact of thepublished work of a scientist scholar or research group

atilde Some scores are

acirc Total Number of Citations (Pretty Useful)acirc Total Number of Citations minus Self-citationsacirc Total Number of (Citations Number of Authors)acirc Average (Citation Impact Factor Number of Authors)

atilde Problems

acirc Not all citations are equal citations by lsquogoodrsquo papers are betteracirc Newer publications suffer in relation to older ones

atilde Weight Citations by Quality of the paper

Review and Conclusions 64

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random Walk

ã Imagine a web surfer doing a random walk on the web:

â Start at a random page
â At each step, go out of the current page along one of the links on that page, equiprobably

ã In the steady state, each page has a long-term visit rate: the proportion of the time a surfer will be there

ã This long-term visit rate is the page's PageRank

ã PageRank = long-term visit rate = steady-state probability

Review and Conclusions 69

Teleporting: to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate
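The random walk with teleportation can be sketched as power iteration over a transition matrix. This is an illustrative toy (the graph in the usage example is invented), not a production implementation:

```python
import numpy as np

def pagerank(links, teleport=0.10, iters=100):
    """Toy PageRank by power iteration.

    links: dict mapping each page to its list of outgoing links.
    A dead end jumps to a random page, independent of the
    teleportation rate; otherwise we teleport with probability
    `teleport` and follow an outlink equiprobably otherwise.
    """
    pages = sorted(links)
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    P = np.zeros((n, n))                       # row-stochastic transitions
    for p, outs in links.items():
        i = index[p]
        if not outs:                           # dead end: jump anywhere
            P[i, :] = 1.0 / n
        else:
            for q in outs:                     # follow a link: (1 - teleport)/outdegree
                P[i, index[q]] += (1.0 - teleport) / len(outs)
            P[i, :] += teleport / n            # teleport: teleport/N to each page
    r = np.full(n, 1.0 / n)                    # start from the uniform distribution
    for _ in range(iters):
        r = r @ P                              # one step of the walk
    return dict(zip(pages, r))
```

On a small made-up graph such as `{'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}`, the ranks sum to 1 and the page with the most inlinks (C) ends up with the highest steady-state probability.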

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph, Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give a website a higher ranking the more other highly ranked websites link to it. Examples include adding links within blogs.

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies.

ã Scraper Sites: "scrape" search-engine results pages or other sources of content to create "content" for a website. The specific presentation of content on these sites is unique, but it is merely an amalgamation of content taken from other sources, often without permission.

73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs, and guestbooks. Agents can be written that automatically randomly select a user-edited web page, such as a Wikipedia article, and add spamming links.

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index.

â Google does not index the target of a link marked nofollow
â Yahoo! does not include the link in its ranking
â …
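A crawler honouring nofollow only needs to check the rel attribute before keeping a link; a minimal sketch with html.parser (class name and sample markup are our own):

```python
from html.parser import HTMLParser

class FollowedLinkParser(HTMLParser):
    """Collect hrefs, skipping links marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        a = dict(attrs)
        # rel can hold several space-separated tokens, e.g. "external nofollow"
        rel = (a.get('rel') or '').lower().split()
        if 'nofollow' not in rel and 'href' in a:
            self.followed.append(a['href'])
```

Feeding `'<a href="/spam" rel="nofollow">buy</a> <a href="/good">ok</a>'` keeps only `/good`.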

74

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like an ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z resolves to http://www.springerlink.com/content/v7q114033401th5u
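Resolution works by prepending the central resolver to the DOI name; the resolver then redirects to whatever location is currently stored in the DOI's metadata, which is what makes the identifier persistent. A one-line sketch:

```python
def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI name.

    The resolver (https://doi.org/) redirects to the location
    currently recorded in the DOI's metadata, so the link keeps
    working even if the document moves.
    """
    return 'https://doi.org/' + doi
```

For example, `doi_url('10.1007/s10579-008-9062-z')` gives `https://doi.org/10.1007/s10579-008-9062-z`.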

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (language itself)

ã Writing (invented 3-5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

ã The internet is useful as a tool

â Passively, for information access
â Actively, for collaboration

ã The internet is interesting in and of itself

â As a source of data about existing language
â As a source of innovation in language

Review and Conclusions 79

Recommended Texts

ã Wikipedia

ã Crystal, D. (2001) Language and the Internet. Cambridge University Press

ã Sproat, R. (2010) Language, Technology, and Society. Oxford University Press

ã Jurafsky, D. and Martin, J. H. (2008) Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition

ã Manning, C. D. and Schütze, H. (1999) Foundations of Statistical Natural Language Processing. MIT Press

ã Manning, C. D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. Cambridge University Press

Review and Conclusions 80

Complementary Courses

ã HG2051 Language and the Computer: solving NLP problems with Python; introduces both programming and linguistics

ã HG3051 Corpus Linguistics: (prerequisite HG2051, waived) an introduction to the fast-growing field of corpus linguistics

Review and Conclusions 81

Page 66: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Gaming Citations

atilde LeastMinimum Publishable Unit

acirc Break research into small chunks to increase the number of citationsacirc Sometimes there is very little new information

atilde Self citation in-group citation

atilde Write only proceedings (some journals are not often read)

atilde Submitting only to High Impact factor journals

You improve what gets measurednot necessarily what you want to improve

Review and Conclusions 65

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 67: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Characteristics of the Web Bow Tie

SCC Strongly Connected Core mdash can travel from any page to any page

Review and Conclusions 66

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 68: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Anchor Text

atilde Recall how hyperlinks are written

lta href=httppathtotherepageHG803gtHG803Language Technology and the Internetltagt

For more information about Language Technology and theInternet see the lta href=httpgtHG803 Course Pageltagt

atilde Link analysis builds on two intuitions

1 The hyperlink from A to B represents an endorsement of page B by the creatorof page A

2 The (extended) anchor text pointing to page B is a good description of page BThis is not always the case for instance most corporate websites have a pointer from every page toa page containing a copyright notice

Review and Conclusions 67

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 69: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

PageRank as Citation analysis

atilde Citation frequency can be used to measure the impact of an article

acirc Simplest measure Each article gets one vote ndash not very accurate

atilde On the web citation frequency = inlink count

acirc A high inlink count does not necessarily mean high quality hellipacirc hellipmainly because of link spam

atilde Better measure weighted citation frequency or citation rank

acirc An articlersquos vote is weighted according to its citation impactacirc This can be formalized in a well-defined way and calculated

Review and Conclusions 68

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting ndash to get us out of dead ends

atilde At a dead end jump to a random web page with probability 1N (N is the total number of web pages)

atilde At a non-dead end with probability 10 jump to a random web page (to each witha probability of 01N)

atilde With remaining probability (90) go out on a random hyperlink

acirc For example if the page has 4 outgoing links randomly choose one with proba-bility (1-010)4=0225

atilde 10 is a parameter the teleportation rate

atilde Note ldquojumpingrdquo from a dead end is independent of teleportation rate

Review and Conclusions 70

Example Graph

Each inbound link is a positive vote

httpwwwamsorgsamplingsfeature-columnfcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

atilde Link Spam adding links between pages for reasons other than merit Link spam takesadvantage of link-based ranking algorithms which gives websites higher rankings themore other highly ranked websites link to it Examples include adding links withinblogs

atilde Link Farms creating tightly-knit communities of pages referencing each other alsoknown humorously as mutual admiration societies

atilde Scraper Sites rdquoscraperdquo search-engine results pages or other sources of content andcreate rdquocontentrdquo for a website The specific presentation of content on these sites isunique but is merely an amalgamation of content taken from other sources oftenwithout permission

73

atilde Comment spam is a form of link spam in web pages that allow dynamic user editingsuch as wikis blogs and guestbooks Agents can be written that automatically ran-domly select a user edited web page such as a Wikipedia article and add spamminglinks

The nofollow link a value that can be assigned to the rel attribute of an HTMLhyperlink to instruct some search engines that a hyperlink should not influence thelink targetrsquos ranking in the search enginersquos index

acirc Google does not index the target of a link marked nofollowacirc Yahoo does not include the link in its rankingacirc hellip

74

Current Status

atilde There is a continuous battle between

acirc Search companies who want to get the most useful page to the useracirc Page writers who want to get their page read

atilde All metrics get gamed

Review and Conclusions 75

Digital object identifier

atilde DOI a string used to uniquely identify an electronic document or object

acirc Metadata about the object is stored with the DOI nameacirc The metadata includes a location such as a URLacirc The DOI for a document is permanent the metadata may changeacirc Gives a Persistent Identifier (like ISBN)

atilde The DOI system is implemented through a federation of registration agencies coor-dinated by the International DOI Foundation

atilde By late 2009 approximately 43 million DOI names had been assigned by some 4000organizations

acirc DOI 101007s10579-008-9062-zhttpwwwspringerlinkcomcontentv7q114033401th5u

Review and Conclusions 76

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

atilde Speech (Language itself)

atilde Writing (invented 3-5 times)

atilde Printing (made writing common)

atilde Digital Text (made writing transferable)

atilde Hyperlinking (taking writing beyond language)

Review and Conclusions 78

The Internet

atilde The internet is useful as a tool

acirc Passively for information accessacirc Actively for collaboration

atilde The internet is interesting in and of itself

acirc As a source of data about existing languageacirc As a source of innovation in language

Review and Conclusions 79

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 70: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

PageRank as Random walk

atilde Imagine a web surfer doing a random walk on the web

acirc Start at a random pageacirc At each step go out of the current page along one of the links on that page

equiprobably

atilde In the steady state each page has a long-term visit ratewhat proportion of the time someone will be there

atilde This long-term visit rate is the pagersquos PageRank

atilde PageRank = long-term visit rate = steady state robability

Review and Conclusions 69

Teleporting – to get us out of dead ends

ã At a dead end, jump to a random web page with probability 1/N (N is the total number of web pages)

ã At a non-dead end, with probability 10% jump to a random web page (to each with a probability of 0.1/N)

ã With the remaining probability (90%), go out on a random hyperlink

â For example, if the page has 4 outgoing links, randomly choose one with probability (1 - 0.10)/4 = 0.225

ã 10% is a parameter: the teleportation rate

ã Note: "jumping" from a dead end is independent of the teleportation rate

Review and Conclusions 70
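The teleporting rule can be folded into a power-iteration computation of the steady state; a sketch using the slide's 10% teleportation rate on an illustrative four-page graph with one dead end (graph and names are made up):

```python
# Power iteration: with probability 0.1 teleport to a uniformly random page,
# otherwise follow a random outgoing link; dead ends always jump uniformly.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}  # D is a dead end

def pagerank(links, teleport=0.10, iters=100):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for p, r in rank.items():
            out = links[p]
            if not out:                      # dead end: jump with prob 1/N each
                for q in pages:
                    nxt[q] += r / n
            else:
                for q in pages:              # teleport share: 0.1/N each
                    nxt[q] += teleport * r / n
                for q in out:                # link-following share: 0.9/len(out)
                    nxt[q] += (1 - teleport) * r / len(out)
        rank = nxt
    return rank

pr = pagerank(links)
```

The total probability mass is conserved at each step, so the result stays a distribution; pages with no inbound links (like D here) end up with only the small teleport share.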

Example Graph

Each inbound link is a positive vote

http://www.ams.org/samplings/feature-column/fcarc-pagerank 71

Example Graph Weighted

Pages with higher PageRanks are lighter

Review and Conclusions 72

Gaming PageRank

ã Link Spam: adding links between pages for reasons other than merit. Link spam takes advantage of link-based ranking algorithms, which give websites higher rankings the more other highly ranked websites link to them. Examples include adding links within blogs

ã Link Farms: creating tightly-knit communities of pages referencing each other, also known humorously as mutual admiration societies

ã Scraper Sites: "scrape" search-engine results pages or other sources of content and create "content" for a website. The specific presentation of content on these sites is unique, but is merely an amalgamation of content taken from other sources, often without permission

Review and Conclusions 73

ã Comment spam is a form of link spam in web pages that allow dynamic user editing, such as wikis, blogs and guestbooks. Agents can be written that automatically and randomly select a user-edited web page, such as a Wikipedia article, and add spamming links

The nofollow link: a value that can be assigned to the rel attribute of an HTML hyperlink to instruct some search engines that the hyperlink should not influence the link target's ranking in the search engine's index

â Google does not index the target of a link marked nofollow
â Yahoo does not include the link in its ranking
â …

Review and Conclusions 74
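For instance, a hyperlink can be marked so that it passes no ranking credit to its target (the URL here is purely illustrative):

```html
<!-- The rel="nofollow" value tells participating search engines not to
     let this link influence the target's ranking. -->
<a href="http://example.com/some-page" rel="nofollow">some page</a>
```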

Current Status

ã There is a continuous battle between:

â Search companies, who want to get the most useful page to the user
â Page writers, who want to get their page read

ã All metrics get gamed

Review and Conclusions 75

Digital object identifier

ã DOI: a string used to uniquely identify an electronic document or object

â Metadata about the object is stored with the DOI name
â The metadata includes a location, such as a URL
â The DOI for a document is permanent; the metadata may change
â Gives a Persistent Identifier (like ISBN)

ã The DOI system is implemented through a federation of registration agencies coordinated by the International DOI Foundation

ã By late 2009, approximately 43 million DOI names had been assigned by some 4,000 organizations

â DOI 10.1007/s10579-008-9062-z → http://www.springerlink.com/content/v7q114033401th5u

Review and Conclusions 76
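A DOI name is turned into a clickable, persistent link through the public doi.org resolver, which looks up the current location from the metadata; a small sketch using the DOI from the slide:

```python
def doi_url(doi: str) -> str:
    """Build the resolver URL for a DOI name via the public doi.org proxy."""
    return "https://doi.org/" + doi

# The resolver redirects to wherever the publisher currently hosts the object,
# so the link keeps working even if the underlying URL changes.
print(doi_url("10.1007/s10579-008-9062-z"))
# → https://doi.org/10.1007/s10579-008-9062-z
```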

Conclusions

Review and Conclusions 77

Revolutions in Language Technology

ã Speech (Language itself)

ã Writing (invented 3–5 times)

ã Printing (made writing common)

ã Digital Text (made writing transferable)

ã Hyperlinking (taking writing beyond language)

Review and Conclusions 78


atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 81: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Recommended Texts

atilde Wikipedia

atilde Crystal D (2001) Language and the Internet Cambridge University Press

atilde Sproat R (2010) Language Technology and Society Oxford University Press

atilde Jurafsky D and Martin J H (2008) Speech and Language Processing An In-troduction to Natural Language Processing Computational Linguistics and SpeechRecognition Prentice Hall 2nd edition

atilde Manning C D and Schuumltze H (1999) Foundations of Statistical Natural LanguageProcessing MIT Press

atilde Manning C D Raghavan P and Schuumltze H (2008) Introduction to InformationRetrieval Cambridge University Press

Review and Conclusions 80

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81

Page 82: Lecture 12: Review and Conclusions · Web Count Units ª ghit or “google hit” is the most common unit used to count web snippets (in the early2000s) â itisdocumentfrequencynottermfrequency

Complementary Courses

atilde HG2051 Language and the Computer mdash solving NLP problems with Pythonintroduces both programming and linguistics

atilde HG3051 Corpus Linguistics mdash (Pre Req-HG251 waived) This course is an in-troduction to the fast growing field of corpus linguistics

Review and Conclusions 81