the internet and its text

The Internet and its Text. Mike Scott, School of English, University of Liverpool. Staff/Student Seminar, Open University, 16.3.06. This presentation is at www.lexically.net/downloads/corpus_linguistics/internet.ppt


Page 1: The Internet and its Text

The Internet and its Text

Mike Scott

School of English, University of Liverpool

Staff/Student Seminar, Open University, 16.3.06

This presentation is at www.lexically.net/downloads/corpus_linguistics/internet.ppt

Page 2: The Internet and its Text

Internet…

Home was BAMA, the Sprawl, the Boston-Atlanta Metropolitan Axis. Program a map to display frequency data exchange, every thousand megabytes a single pixel on a very large screen. Manhattan and Atlanta burn solid white. Then they start to pulse, the rate of traffic threatening to overload your simulation. Your map is about to go nova. Cool it down. Up your scale. Each pixel a million megabytes. At a hundred million megabytes per second, you begin to make out certain blocks in midtown Manhattan, outlines of hundred-year-old parks ringing the old core of Atlanta.

(William Gibson, Neuromancer, 1984:57).

… or Google Earth?

Page 3: The Internet and its Text

H.P. Lovecraft, The Call of Cthulhu (1926), opening words

The most merciful thing in the world, I think, is the inability of the human mind to correlate all its contents. We live on a placid island of ignorance in the midst of black seas of infinity, and it was not meant that we should voyage far. The sciences, each straining in its own direction, have hitherto harmed us little; but some day the piecing together of dissociated knowledge will open up such terrifying vistas of reality, and of our frightful position therein, that we shall either go mad from the revelation or flee from the light into the peace and safety of a new dark age. (from García Landa 2005)

Page 4: The Internet and its Text

Issues and Questions

The Internet as a Resource
InterNET
Characteristics of networks
Corpus Linguistics (CL) and Internet text
Patterns of interest to the language learner

Page 5: The Internet and its Text

Internet Map

Page 6: The Internet and its Text

UK Janet network 2001

Page 7: The Internet and its Text

another way of viewing it

Page 8: The Internet and its Text

Networks

Milgram’s experiments (1960s)
160 letters were sent out asking random people in Nebraska & Kansas to forward the letter to a person in Boston, but without the address.
Most of the letters got through, in only about 6 steps.

Page 9: The Internet and its Text

Networks

Graph Theory
You want to link 50 towns with a road network, but don’t want to build all 1,225 possible roads (50 × 49 ÷ 2).
Erdős proved in 1959 that 98 random roads (8%) will ensure the great majority get linked.
In general, for larger networks, you need only a tiny percentage of the possible links to get a network which works (traffic gets through).
For a network of 6 billion people, you need a fraction of only about 0.000000004 of the possible links (0.0000004%), which works out at about 24 links (acquaintances) per person.
Messages will get through from anyone ... to anyone.
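The road-building claim can be tried out with a small simulation. This is only a sketch under stated assumptions: the function name `largest_component_fraction`, the seed and the number of trials are illustrative choices, not anything from the talk or from Erdős's proof. It scatters 98 distinct random roads over 50 towns and reports what share of the towns ends up in the largest linked group.

```python
import random

def largest_component_fraction(n_towns, n_roads, rng):
    """Lay n_roads distinct random roads; return the share of towns in the largest linked group."""
    all_pairs = [(a, b) for a in range(n_towns) for b in range(a + 1, n_towns)]
    roads = rng.sample(all_pairs, n_roads)

    # Union-find: every town starts as its own group
    parent = list(range(n_towns))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # shorten the chain as we walk up
            x = parent[x]
        return x

    for a, b in roads:
        parent[find(a)] = find(b)  # merge the two groups joined by this road

    sizes = {}
    for town in range(n_towns):
        root = find(town)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n_towns

possible_roads = 50 * 49 // 2   # 1225 possible roads between 50 towns
rng = random.Random(2006)       # arbitrary fixed seed so runs are repeatable
trials = [largest_component_fraction(50, 98, rng) for _ in range(200)]
average_share = sum(trials) / len(trials)
print(possible_roads, round(98 / possible_roads, 2), round(average_share, 2))
```

The average share should come out close to 1, which is the sense in which "the great majority get linked"; a few towns are usually still left isolated in any single run.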

Page 10: The Internet and its Text

Power Law

Nodes and connections obey a “power law”: “each time the number of links doubles, the number of nodes with that many links becomes less by about five times”. (Buchanan 2002: 83)

Are words in text anything like these networks?
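The rule quoted from Buchanan can be turned into an exponent. The working below assumes the usual power-law form, where P(k) is the share of nodes with k links; the symbols P, k and γ are introduced here for illustration and do not appear on the slide.

```latex
P(k) \propto k^{-\gamma}
\quad\Rightarrow\quad
\frac{P(2k)}{P(k)} = 2^{-\gamma} \approx \frac{1}{5}
\quad\Rightarrow\quad
\gamma \approx \log_2 5 \approx 2.32
```

On this reading, "less by about five times" for each doubling corresponds to an exponent a little above 2.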

Page 11: The Internet and its Text

Internet a “scale-free” network

“The probability distribution of incoming links to HTML documents… follows a power law, generating a straight line on this logarithmic plot. The outgoing links have a similar distribution. This implies that the WWW is a scale-free network”. (Ball 2004:480)

Page 12: The Internet and its Text

Word Frequency lists

Zipf’s rank-frequency distribution of words (Zipf, 1965: 25)

(A) “The James Joyce data; (B) the Eldridge data; (C) ideal curve with slope of negative unity.” (original caption)
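A rank-frequency list of the kind Zipf plotted can be built in a few lines. The sketch below is illustrative only: the tokenising pattern and the function names `rank_frequency` and `loglog_slope` are assumptions, and the slope is checked against made-up ideal data rather than the Joyce or Eldridge counts.

```python
import math
import re
from collections import Counter

def rank_frequency(text):
    """Return (rank, frequency) pairs, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(rank, freq) for rank, (_, freq) in enumerate(counts.most_common(), start=1)]

def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic "ideal curve": frequency = 1200 / rank, so the slope should be about -1
ideal = [(rank, 1200 / rank) for rank in range(1, 101)]
slope = loglog_slope(ideal)
print(round(slope, 3))
```

Running `loglog_slope(rank_frequency(some_text))` on a real text gives a rough figure to set beside the "slope of negative unity" in the caption; short texts will stray from it.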

Page 13: The Internet and its Text

Word Frequency lists: BNC

[Figure: Zipf plot of word frequencies & ranks, based on whole BNC, nearly 400,000 types; axes: Frequency against Rank] (Scott & Tribble in press)

Page 14: The Internet and its Text

Corpus Linguistics

Uncertain status as a discipline
Innovative in methodology
Focus on “the language”
relatively unfiltered data

Page 15: The Internet and its Text

this?

or this?

as opposed to

Page 16: The Internet and its Text

Internet text

Google

“Google examines more than 8 billion web pages to find the most relevant pages for any query and typically returns those results in less than half a second. No other search engine accesses more of the Internet or delivers more useful information than Google.” (http://www.google.co.uk/corporate/features.html)

Page 17: The Internet and its Text

But there are more sites

islands
sites not found by web-bots
sites not indexed by web-bots
… so not all the Internet can be seen

Page 18: The Internet and its Text

The problem: what verb goes with “battle”? hold? fight? win? take? there + be? struggle? combat? pitch?

Page 19: The Internet and its Text

Dictionaries

OED: “join, give, refuse, accept, offer, do battle”

Oxford Advanced Learner’s 1974: no verbs supplied

Cobuild 1988: examples show “fought” and “do battle”

Page 20: The Internet and its Text

LTP Dictionary of Selected Collocations

Verbs to the left: engage in, fight, force, go into, join in, lose, take part in, win ~
Verbs to the right: ~ continues, dragged on, ended in stalemate, is in progress, raged
Adj: bitter, bloody, crucial, decisive, fierce, final, hopeless, important, last-ditch, long, long-running, major, mock, pitched, real, relentless, running, successful ~
Phrases: fight a losing ~, outcome of ~

Page 21: The Internet and its Text

battle

Page 22: The Internet and its Text

fight battle

Page 23: The Internet and its Text

Webgetter

Settings: English only, minimum 100 words

Page 24: The Internet and its Text

Webgetter

In approx. 600,000 words, “battle” occurs nearly 4,000 times, about once every 150 words.

“An epic battle rages between the Forseti and the Muspell as the oceans rise and land disappears. The Forseti compel you to help protect their remaining land by taking charge of the ultimate war machine – the Battle Engine. Whether in walking or in flying mode, you have access to an array of destructive weapons and you receive constant direction from base command. By commanding a device so powerful and advanced, your battlefield decisions will shape the direction of each engagement and, ultimately, the entire war.”

Page 25: The Internet and its Text

Webgetter results

Collocated verbs in top 100 linked by MI score:
cheats (10 occurrences), as in “Battle engine Aquila cheats” (? is this a verb?)
gaming (9)
fought (43) is number 110, so outside the top 100

Clusters: “battle was fought” (6)
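For readers unfamiliar with MI (mutual information) scores, here is a minimal sketch of one common formulation: log2 of (joint frequency × corpus size) ÷ (node frequency × collocate frequency). It is an assumption that this matches the calculation behind the slide; real tools often also adjust for the size of the collocation span. The function name `mi_scores` and the toy sentence are invented for illustration.

```python
import math
from collections import Counter

def mi_scores(tokens, node, span=4):
    """MI score for every word found within `span` tokens of `node`."""
    total = len(tokens)
    word_freq = Counter(tokens)
    joint = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # Words up to `span` places to the left and right of this hit
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        joint.update(window)
    return {
        w: math.log2(joint[w] * total / (word_freq[node] * word_freq[w]))
        for w in joint
    }

toy = "the battle was fought and the battle was won by the army".split()
scores = mi_scores(toy, "battle", span=1)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

Because the collocate's own frequency sits in the denominator, MI rewards words that occur almost only beside the node, which may be why a rarer item such as "cheats" can outrank the far more frequent "fought".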

Page 26: The Internet and its Text

BNC (written)

In 90 million words, “battle” occurs over 6,000 times, once every 14,000 words.
Collocated verbs in top 100 linked by MI score:
fought (153) / fighting (93)
rages (5) / raged (12)
waged (10) / waging (12)
ensued (8) / ensuing (13)
defeated (39)
losing (68)
won (152)
commence (5)
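The two "once every N words" figures can be set side by side. The sketch below uses the rounded counts from the slides (4,000 hits in 600,000 words; 6,000 hits in 90 million), so its BNC result is slightly higher than the 14,000 quoted, which presumably rests on the exact count of "over 6,000". The function name `words_per_hit` is illustrative.

```python
def words_per_hit(corpus_size, hits):
    """Average number of running words between occurrences."""
    return corpus_size / hits

webgetter = words_per_hit(600_000, 4_000)        # rounded figures from the Webgetter slide
bnc_written = words_per_hit(90_000_000, 6_000)   # rounded figures from the BNC slide
ratio = bnc_written / webgetter                  # how much denser "battle" is in the web sample
print(webgetter, bnc_written, ratio)
```

On these round numbers "battle" is about a hundred times denser in the Webgetter sample, as expected for pages retrieved by searching for that word.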

Page 27: The Internet and its Text

BNC Written clusters

to do battle (54)
fighting a losing (24)
win the battle (22)
won the battle (22)
fighting a losing battle (21)
to fight a (15)
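Clusters like these are repeated word sequences counted across a corpus. The sketch below is a simplification: it only counts sequences that contain the node word itself, whereas the list above also includes sequences such as "to fight a", presumably drawn from the wider concordance context. The function name `clusters` and the toy sentence are invented for illustration.

```python
from collections import Counter

def clusters(tokens, node, size=3, min_count=2):
    """Count each run of `size` consecutive tokens that contains `node`."""
    counts = Counter(
        " ".join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
        if node in tokens[i:i + size]
    )
    # Keep only sequences that recur, most frequent first
    return [(cluster, n) for cluster, n in counts.most_common() if n >= min_count]

toy = "we won the battle and they won the battle too".split()
found = clusters(toy, "battle")
print(found)
```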

Page 28: The Internet and its Text

Internet text 1

Page 29: The Internet and its Text

Internet text 2

Page 30: The Internet and its Text

Internet text 3

Page 31: The Internet and its Text

BNC Written Text 1

“The BNC was designed to characterise the state of contemporary British English in its various social and generic uses.” (Aston & Burnard, 1998: 28)

Imaginative 20%
Arts 8%
Belief & thought 4%
Commerce & finance 8%
Leisure 11%
Natural & pure science 4%
Applied science 11%
Social science 15%
World affairs 14%
Unclassified 2%
(Aston & Burnard, 1998: 29)

Page 32: The Internet and its Text

BNC Written Text 2

Book 46%
Periodical 36%
Miscellaneous published 6%
Miscellaneous unpublished 7%
To-be-spoken 1%
Unclassified 2%
(Aston & Burnard, 1998: 30)

Page 33: The Internet and its Text

Conclusions (1)

The Internet is a powerful linked scale-free network with the capacity of linking nodes efficiently and fast, and is relatively robust
Connections within the Internet have characteristics of a power law
Word frequency lists share these characteristics …
… suggesting that grammar words are like Google, Yahoo, Microsoft web-sites, extremely often visited…
…but not in themselves informative
and other sites we visit are like lexical words…
…less visited but more informative

Page 34: The Internet and its Text

Conclusions (2)

The learner wants to know how words collocate

Collocation dictionaries, but not other dictionaries, give useful information, but no examples or not enough
Internet text is often strangely structured
after all, the Internet is merely a noticeboard
New and often strange text-types or uses of familiar words

Page 35: The Internet and its Text

Conclusions (3)

The concordance + BNC gives a better view for the language learner, through
concordance lines
collocates
clusters
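A concordance line shows each hit of the search word with its immediate context (key word in context, KWIC). The sketch below is a bare-bones illustration, not the concordancer used for the talk; the function name `concordance`, the context width and the example sentence are assumptions.

```python
def concordance(tokens, node, width=4):
    """Return KWIC lines: left context, the hit, right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the hits line up in a column
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines

sentence = "They were fighting a losing battle against the weather".split()
kwic = concordance(sentence, "battle", width=3)
for line in kwic:
    print(line)
```

Reading down the aligned column of hits is what lets the learner see recurring neighbours such as "fighting a losing" at a glance.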

Page 36: The Internet and its Text

References:

Aston, Guy & Lou Burnard, 1998. The BNC Handbook. Edinburgh: Edinburgh University Press.

Ball, Philip, 2004. Critical Mass. London: Arrow.

Barabási, Albert-László, 2002. Linked: the new science of networks. Cambridge, Mass.: Perseus.

Buchanan, Mark, 2002. Small World: uncovering nature’s networks. London: Weidenfeld & Nicolson.

Faloutsos, Michalis, Petros Faloutsos & Christos Faloutsos, 1999. “On Power-Law Relationships of the Internet Topology” in Applications, Technologies, Architectures, and Protocols for Computer Communication. Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication. Cambridge, Mass.: ACM Press. pp. 251-62.

García Landa, José Angel, 2005. “Linkterature: from Word to Web”. http://www.unizar.es/departamentos/filologia_inglesa/garciala/publicaciones/linkterature.htm

Gibson, William, 1984. Neuromancer. London: Voyager.

Hill, J. & Lewis, M., 1997. LTP Dictionary of Selected Collocations. Hove: Language Teaching Productions.

Nation, I.S.P., 2001. Learning Vocabulary in Another Language. Cambridge: Cambridge University Press. P53.9.N27

Scott, Mike & Chris Tribble, 2006. Working with Texts: keyword and corpus analysis in language education. Amsterdam: Benjamins.

Zipf, G. K., 1965. Human Behavior and the Principle of Least Effort. New York: Hafner. (facsimile of 1949 edition).

http://www.cybergeography.org/atlas/topology.html