The web, laws and ethics

Tony McEnery

Introduction

• Practical issues – some are familiar

• Some are rarely touched upon

• Others are almost never discussed at any length

• This talk surveys legal issues in web-based corpus collection and ethical issues in general.

• I speak from a Western standpoint and from my own experiences with research projects

Law and the Web

• The fundamental issue – do you have the right to gather and distribute?

• The World Wide Web – an opportunity

• The World Wide Web – a challenge

• BootCaT (Baroni and Bernardini, 2004) – a system for collecting web corpora

• ‘Web as Corpus’

• Chat room data (Claridge 2007, Thelwall 2008, King 2009)

Diverging Views

• The web as ‘a fabulous linguist’s playground’ (Kilgarriff and Grefenstette 2003: 333).

• The web ‘can in no way be considered a representative sample of language use in general’ (Leech 2007: 145).

• Although the web can be useful, the ‘more sophisticated needs of the working linguist may be better fulfilled by means of traditional corpora’ (Lew 2009: 298).

An Issue - Genre

• How does one determine the genre of a web document?

• Baroni and Bernardini (2004: 1315) suggest that one in three of the webpages recovered may not be in the desired genre.

An Issue - Law

• Copyright laws apply to documents available on the web exactly as they do to print documents

• Financial loss as a measure

• Corpus data needs to be widely available if corpus linguistics is to be replicable

• This is an ethical imperative for the researcher.

Ways of Addressing Copyright

• Treat text from the web the same as any other text (Baker et al. 2004)

• Collect data only from sites which explicitly allow the re-use and redistribution of text (e.g. Wikipedia)

• Collect data without any regard to seeking permission and make it available to other researchers through a tool that does not allow copyright to be breached (Davies, 2010)

• Fair use?

• Redistribute a list of the web addresses from which the corpus has been collected, not the corpus itself
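
The last option, redistributing addresses rather than text, can be sketched roughly as follows. This is a minimal illustration only: the function name `build_manifest`, the column names and the example URLs are all hypothetical, and whether such a list is sufficient is a legal question that code cannot settle. The sketch stores a hash of each text, not the text itself, so that someone re-collecting the pages can check whether they have changed.

```python
import csv
import hashlib
import io


def build_manifest(pages):
    """Return CSV text listing each source URL, its retrieval date and a
    SHA-256 hash of the collected text. The text itself is never written out.

    `pages` is an iterable of (url, retrieved_on, text) tuples
    (a hypothetical input format chosen for this sketch).
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "retrieved_on", "sha256"])
    for url, retrieved_on, text in pages:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        writer.writerow([url, retrieved_on, digest])
    return buffer.getvalue()


# Placeholder pages, not real corpus data.
pages = [
    ("https://example.org/a", "2012-01-15", "Some page text."),
    ("https://example.org/b", "2012-01-16", "Another page."),
]
manifest = build_manifest(pages)
```

A limitation of this approach is that pages change or disappear, so a corpus rebuilt from the list may differ from the original; the hash only makes that difference detectable.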

Discussion 1

• What experiences of getting copyright clearance have you had? Have you used web material? Has it always been easy to liaise with a copyright holder where you have tried to do so?

Ethics

• Somewhat neglected in corpus linguistics

• Exceptions: Hasund (1998), Sampson (2000) and Rock (2001)

• A problem solved by others, who produce guidelines we can use?

• http://www.baal.org.uk/about_goodpractice_full.pdf

Ethics and Respondents

• Example: spoken BNC

• Issue: the sacrifice of privacy

• Issue: securing sensitive information

• The BNC did address such issues, but …

• What about the tapes? More specifically, what about those who were recorded coincidentally?

• Issue: What about those being talked about? Sampson (2000: section 4.1):

... comment that one of their schoolmates, identified by Christian name, behaves like a whore. This person is entitled to anonymity as much as the speakers, and arguably more so: she signed no release form for the corpus compilers. When well-known public figures or institutions are mentioned, the BNC compilers seem to have felt that there was no need to anonymise the references at all. Clearly, if someone announces that he has just bought the latest album by a named pop singer, there is no point in concealing the singer’s name. But it depends on what is said. One of the CHRISTINE texts contains a series of quite damaging remarks about the management of a secondary school, named in the BNC file. In another case, speakers comment adversely on the sexual morality of a named American actress. Even American actresses, surely, are entitled to have their honour guarded by corpus linguists.

• Yes, but how do we help?

• Issue: Anonymization. File F86 of the BNC (utterances 264 and 265)

During nineteen ninety one the Board has been delighted to open new areas of work in Inverness where our first designated place and associated hostel was opened on a most happened-- happy day by Sir Russell <gap desc="name" reason="anonymization"> .

In Elderslie near Paisley <pause> where Lady <gap desc="name" reason="anonymization"> the wife of last year's Lord High Commissioner opened our fourth senile dementia unit.
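
The kind of markup seen in these two utterances could be produced by something like the sketch below. It is an illustration only, not the procedure the BNC compilers used: the title list, the regular expression and the function name `mask_names` are assumptions, and the names in the usage lines are invented. It replaces the last capitalised word after a title with the gap tag and leaves everything else in place.

```python
import re

GAP = '<gap desc="name" reason="anonymization">'

# A title, optionally some capitalised words (e.g. a first name), and then
# the final capitalised word, which is the one that gets masked.
# The title list is hypothetical and far from complete.
NAME_AFTER_TITLE = re.compile(
    r"\b(Sir|Lady|Mr|Mrs|Ms|Dr)((?:\s+[A-Z][a-z]+)*)\s+([A-Z][a-z]+)"
)


def mask_names(text):
    """Replace the last name following a title with the gap tag."""
    return NAME_AFTER_TITLE.sub(
        lambda m: m.group(1) + m.group(2) + " " + GAP, text
    )


# Invented names, used only to show the behaviour.
first = mask_names("opened on a most happy day by Sir Alan Example .")
second = mask_names("where Lady Sample the wife of the Commissioner")
```

As the BNC extract itself suggests, masking a surname does little when the title, the office held and the date remain in the text; a pattern-based pass like this one cannot judge whether a person is still identifiable from context.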

• The data described by Leech and Weisser (2003) contained credit card details that had to be anonymised

• The Lancaster Corpus of Children’s Writing (Smith et al. 1998) contained the personal details of young children

• File HE7 in the BNC

A 275 Well don't you think that it's really rather improper for you to be doing this?

A 276 After all people are entitled to some secrecy <event: "running down stairs">aren't they, about their <unclear> <event: "breaking furniture"> You don't feel that there's any need at all to give any explanation of your behaviour?

A 277 <event: "noise - traffic">You don't think that <unclear> an explanation is due here?

A 278 This information after all should have received confidential and does belong to other people, doesn't it?

B 279 What I thinks embarrassing <-|-> is that <unclear> <-|->

A 280 <-|-> And you're just stealing it <-|-> you're just stealing it so that you can make money aren't you?

A 281 <event: "noise - traffic"><voice quality: shouting>Do people have a right to have their health records <unclear> confidential do they not? <end of voice quality>

A 282 <pause> Have you got nothing to say what so ever?

B 283 'Fraid not, no. <event: "footsteps"> <pause>

A 284 Robert <gap desc="name" reason="anonymization"> is not alone in selling personal information from data banks.

• Issue: Video and Audio

• ‘[a]nonymisation is extremely difficult, if not impossible, because the data is so rich’ (Reiter 2004: 2).

• Hasund (1998: 16-17):

In the invitation to take part in the research project … the following promise was given to the COLT recruits: ‘You and the people you have recorded are guaranteed full anonymity’. There were lengthy discussions among the researchers working on the corpus of what was implied by the term ‘full anonymity’, resulting in an agreement to delete all surnames and addresses in the transcription, but leave all first names unchanged. Considering that the recordings were made in a huge city like London, and the recruits were pupils and not public persons connected to specific positions at specific universities or companies, this level of anonymization was considered sufficient for the protection of personal identities.

Ethics and Corpus Builders

• EMILLE corpus (Baker et al. 2004)

• Opportunistic corpus

• A religious organization saw the opportunity to contribute to our corpus as a way of distributing their material and thus gaining converts

• Who it is alright to hate

• A question of research ethics – ethics relating to the conduct of scientific experiments

• The underlying problem is embedded in one of the great strengths of the corpus approach: corpora are multifunctional

Discussion 2

• Should a corpus be censored? For example, the BNC has been used to explore swearing in English. This is possible because a choice was made when building the corpus not to censor the data. How defensible is that decision? Are there circumstances in which you would consider censoring corpus data? If so, on what grounds would you do so?

Ethics and Corpus Distributors

• Issue: collecting data from outlawed groups

• Issue: who is funding you and why?

• Issue: may the corpus become illegal in certain jurisdictions if certain decisions are made?

• Issue: keeping data intact and available.

Ethics and Corpus Users

• Issue: what may your corpus analysis cause to happen?

• Issue: making your analyses available to future researchers

• Issue: making your tools available to future researchers

• Automatic and manual analyses have differing issues, but the imperative to preserve the analysis is constant

• Issue: How will others interpret your results? What may be the impact of an interpretation you did not expect?

Ethically Problematic Research – Some Case Studies

• Corpus builders do not routinely produce corpora that are overtly unethical.

• Nonetheless, it is possible to find instances of poor practice.

• The BNC spoken corpus has a somewhat haphazard approach to anonymisation.

• The Survey of English Usage – surreptitious recordings

• Lost data and methods – Phillips (1989)

• The most problematic ethical choices are firmly in the past

• Though corpus linguistics has made mistakes in the past, it has learnt from them

• Prediction: the interaction between corpus linguistics and legislation will intensify

Discussion 3

• Imagine you are building a spoken corpus, and you are collecting data by audio-recording spontaneous conversation. Given that you are following standard ethics procedures to get informed consent from all speakers, what steps could you take, in designing the data collection and transcription procedure, to minimise the observer effect that will inevitably result from following these procedures?

Homework!

• The next time you find yourself in conversation with a group of friends, imagine that you are secretly recording the conversation. Make a mental note (and, later, a written note) of the ethical issues that might arise if you were intending to transcribe and then publish the data without the participants’ knowledge or consent. Repeat this ‘experiment’ in a number of other contexts. What common issues emerge? Do any ethical issues seem bound to certain types of interaction?

Further Reading

• McEnery & Hardie (2012)

• Hundt et al. (2007)

• Leech (2007)

• Baroni et al. (2008)

• Hoffmann (2007a, 2007b)

• http://aixtal.blogspot.com/2005/02/web-googles-missing-pages-mystery.html

• McEnery et al. (2006: 77-9)

• Rock (2001)

• Hasund (1998).

Bibliography

• Baker, P., Bontcheva, K., Cunningham, H., Gaizauskas, R., Hamza, O., Hardie, A., Jayaram, B.D., Leisher, M., McEnery, T., Maynard, D., Tablan, V., Ursu, C. and Xiao, R.Z. 2004. ‘Corpus linguistics and South Asian languages: corpus creation and tool development’, Literary and Linguistic Computing 19 (4): 509–24.

• Baroni, M. and Bernardini, S. 2004. ‘BootCaT: Bootstrapping corpora and terms from the web’, in Proceedings of LREC 2004, pp. 1313–16. Paris: European Language Resources Association (ELRA).

• Baroni, M., Chantree, F., Kilgarriff, A. and Sharoff, S. 2008. ‘CleanEval: a competition for cleaning webpages’, in Proceedings of LREC 2008, pp. 638–43. Paris: European Language Resources Association (ELRA).

• Claridge, C. 2007. ‘Constructing a corpus from the web: message boards’, in M. Hundt, N. Nesselhauf and C. Biewer (eds) Corpus Linguistics and the Web, pp. 87–108. Amsterdam: Rodopi.

• Davies, M. 2010. ‘More than a peephole: Using large and diverse online corpora’, International Journal of Corpus Linguistics 15 (3): 412–8.

• Hasund, K. 1998. ‘Protecting the innocent: the issue of informants’ anonymity in the COLT corpus’, in A. Renouf (ed.) Explorations in Corpus Linguistics, pp. 13–28. Amsterdam: Rodopi.

• Hoffmann, S. 2007a. ‘From web page to mega-corpus: the CNN transcripts’, in M. Hundt, N. Nesselhauf and C. Biewer (eds) Corpus Linguistics and the Web, pp. 69–85. Amsterdam: Rodopi.

• Hoffmann, S. 2007b. ‘Processing Internet-Derived Text - Creating a Corpus of Usenet Messages’, Literary and Linguistic Computing 22 (2): 151–65.

• Hundt, M., Nesselhauf, N. and Biewer, C. 2007. Corpus Linguistics and the Web. Amsterdam: Rodopi.

• Kilgarriff, A. and Grefenstette, G. 2003. ‘Introduction to the Special Issue on the Web as Corpus’, Computational Linguistics 29 (3): 333–47.

• King, B. 2009. ‘Building and analysing corpora of computer mediated communication’, in P. Baker (ed.) Contemporary Corpus Linguistics, pp. 301–20. London: Continuum.

• Leech, G. 2007. ‘New resources, or just better old ones?’, in M. Hundt, N. Nesselhauf and C. Biewer (eds) Corpus Linguistics and the Web, pp. 134–49. Amsterdam: Rodopi.

• Leech, G. and Weisser, M. 2003. ‘Generic speech act annotation for task-oriented dialogues’, in D. Archer, P. Rayson, A. Wilson and T. McEnery (eds) Proceedings of the Corpus Linguistics 2003 conference. University Centre for Computer Corpus Research on Language, Technical Papers 16 (1): 441–6.

• Lew, R. 2009. ‘The web as corpus versus traditional corpora’, in P. Baker (ed.) Contemporary Corpus Linguistics, pp. 289–300. London: Continuum.

• McEnery, T. and Hardie, A. 2012. Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press.

• McEnery, T., Xiao, R.Z. and Tono, Y. 2006. Corpus-based Language Studies: An Advanced Resource Book. London: Routledge.

• Phillips, M. 1989. Lexical Structure of Text. Birmingham: University of Birmingham.

• Reiter, E. 2004. ‘Creating a corpus for memories for life’, Grand Challenges for Memories for Life. Available online at: http://www.memoriesforlife.org/resources.php

• Rock, F. 2001. ‘Policy and practice in the anonymization of linguistic data’, International Journal of Corpus Linguistics 6 (1): 1–26.

• Sampson, G.R. 2000. CHRISTINE Corpus, Stage I: Documentation. Available online at: www.grsampson.net/ChrisDoc.html

• Smith, N.I., McEnery, T. and Ivanic, R. 1998. ‘Issues in transcribing a corpus of children’s handwritten projects’, Literary and Linguistic Computing 13 (4): 312–29.

• Thelwall, M. 2008. ‘Fk yea I swear: cursing and gender in MySpace’, Corpora 3 (1): 83–107.