Developing a research library position statement on Text and Data Mining in the UK

OSC Office of Scholarly Communication Developing a research library position statement on Text and Data Mining in the UK RLUK 2017 Dr Danny Kingsley, Dr Debbie Hansen - University of Cambridge Anna Vernon - Jisc British Library - 9th March 2017

Upload: danny-kingsley

Post on 12-Apr-2017


Category:

Education



TRANSCRIPT

Page 1: Developing a research Library position statement on Text and Data Mining in the UK

OSC

Office of Scholarly Communication

Developing a research library position statement on Text and Data Mining in the UK

RLUK 2017

Dr Danny Kingsley, Dr Debbie Hansen - University of Cambridge

Anna Vernon - Jisc

British Library - 9th March 2017

Page 2: Developing a research Library position statement on Text and Data Mining in the UK


Who are we?

Page 3: Developing a research Library position statement on Text and Data Mining in the UK

Page 4: Developing a research Library position statement on Text and Data Mining in the UK

Page 5: Developing a research Library position statement on Text and Data Mining in the UK

Page 6: Developing a research Library position statement on Text and Data Mining in the UK

Page 7: Developing a research Library position statement on Text and Data Mining in the UK

What is TDM?

“the use of large online text collections to discover new facts and trends about the world itself” (Hearst, 1999†)

†Hearst, M. A., Untangling Text Data Mining, Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper)

Page 8: Developing a research Library position statement on Text and Data Mining in the UK

Why do you do it?

• Fast literature review
• Extract new facts
• Answer research questions
• Access wide range of sources for a topic

• Saves time
• More bang for same cost
• Not achievable through manual searches

Research Innovation

Page 9: Developing a research Library position statement on Text and Data Mining in the UK

Mining of scholarly publications

Page 10: Developing a research Library position statement on Text and Data Mining in the UK

TDM basics - how

Page 11: Developing a research Library position statement on Text and Data Mining in the UK

What is the legal situation?

• Hargreaves exception
• Licensing

Page 12: Developing a research Library position statement on Text and Data Mining in the UK

Hargreaves Review

• Independent review of UK Intellectual Property system, focus UK Copyright law
• Recommendations from Hargreaves Review:
– introduce a copyright exception
– allow TDM for non-commercial purposes
– prohibit exclusion of TDM through contract
• Government introduction of exception June 2014:
– if user has lawful access, works can be copied for TDM for non-commercial research

Page 13: Developing a research Library position statement on Text and Data Mining in the UK


Current practicalities

What do publishers say about TDM?

Page 14: Developing a research Library position statement on Text and Data Mining in the UK

Comment from OASPA

• Open Access Scholarly Publishers Association
• Expressed their support of TDM efforts
– By signing Hague Declaration in 2016
• Support the adoption of best practice and behaviours regarding TDM
• Statement:
– ‘reasonable best practice for those engaged in TDM to inform publishers … that such content mining is planned’

OASPA, December 2016, http://oaspa.org/oaspa-comment-text-data-mining-proposed-eu-copyright-reform/

Page 15: Developing a research Library position statement on Text and Data Mining in the UK

University College London study

£500K per annum

• Estimate for TDM activity compliance
– Before TDM exception
• Figure from OA staffing costs
– OA staffing costs to monitor TDM activity
– TDM clearance fees with rightsholders

Page 16: Developing a research Library position statement on Text and Data Mining in the UK

Do the publishers have statements?

Oxford University Press

● No permission needed for non-commercial TDM
○ But can contact for consultation on TDM (e.g. to avoid technical safeguard triggers)
● Can contact to request TDM for commercial purposes

https://academic.oup.com/journals/pages/help/third_party_data_mining

Royal Society

● Support use of computers to extract from scholarly publications
● Members of subscribing institutions have permission to mine
○ For non-commercial and commercial purposes
○ Respect copyright and cite where possible
● Let them know when you intend to do TDM
○ To prevent automatic lock-out

https://royalsociety.org/journals/ethics-policies/data-sharing-mining/

Now also: Cambridge University Press, International Union of Crystallography

Page 17: Developing a research Library position statement on Text and Data Mining in the UK

Do the publishers have statements?

Elsevier

Different licenses, different rules, e.g.:
● CC BY - yes to TDM
● CC BY-NC-SA - yes to TDM for non-commercial purposes
● CC BY-NC-ND - no to TDM
● Open Archives (content made available after an embargo)
○ Yes to TDM for non-commercial purposes and cite authors and source

https://www.elsevier.com/connect/what-changes-when-publishing-open-access-understanding-the-fine-print
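A library wanting to answer "can I mine this?" quickly could record the license rules listed on this slide as a small lookup table. The following is only a sketch: the dictionary name, the rule labels and the `tdm_permitted` helper are illustrative assumptions, not anything published by Elsevier, and the attribution condition (cite authors and source) is not modelled.

```python
# Rules transcribed from the slide; the labels and helper name are hypothetical.
ELSEVIER_TDM_RULES = {
    "CC BY": "allowed",
    "CC BY-NC-SA": "non-commercial only",
    "CC BY-NC-ND": "not allowed",
    "Open Archives": "non-commercial only",
}


def tdm_permitted(license_name: str, commercial: bool) -> bool:
    """Return True if the slide's rules allow TDM for this license and purpose."""
    rule = ELSEVIER_TDM_RULES.get(license_name)
    if rule is None:
        # Unknown license: refuse to guess, ask the publisher instead.
        raise KeyError(f"No TDM rule recorded for license: {license_name}")
    if rule == "allowed":
        return True
    if rule == "non-commercial only":
        return not commercial
    return False


if __name__ == "__main__":
    for name in ELSEVIER_TDM_RULES:
        print(name, "| non-commercial TDM permitted:", tdm_permitted(name, commercial=False))
```

Raising an error for an unlisted license is a deliberate choice here: a missing rule should prompt a question to the publisher, not a silent yes or no.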

Page 18: Developing a research Library position statement on Text and Data Mining in the UK

Hindawi

• Hindawi facilitate the use of their content for data mining purposes - https://www.hindawi.com/corpus/
• Full XML content available for download
– as single .zip file
– .zip file updated daily
– (XML files adhere to the US National Library of Medicine Document Type Definition)
• Not advertised widely
– over last 12 months, 1,770 unique visits
– between 60 and 90 downloads per month
– roughly 720-1080 downloads for the year

Page 19: Developing a research Library position statement on Text and Data Mining in the UK

Example situation

Negotiations between a publisher and Cambridge University in May 2015 over TDM.

– Original contract would have been binding for whole University
– Data only available on a hard drive and not downloadable onto a server
– Charge of £1,100 for the cost of the hard drive
– Substantial number of limitations and restrictions

Page 20: Developing a research Library position statement on Text and Data Mining in the UK

Group discussion - about your experiences

Talk about any experiences you have had with TDM. Feedback into the group:
* Challenges encountered?
* Concerns?
* Successes?

Page 21: Developing a research Library position statement on Text and Data Mining in the UK

Feedback from discussion

Situation:
• Hard drive provided.
• Do not know what is being asked for - who is responsible?
• What is the IT responsibility here?
• Copyright and compliance officer needed to do a lot of work.

Solutions:
• Clearer understanding of the licensing situation.
• Mechanism of where to go for advice.
• Procedures of what to do with it - policy issue

Page 22: Developing a research Library position statement on Text and Data Mining in the UK

Feedback from discussion 2

• Issue:
– Researcher behaviour - academics not concerned by copyright
• Library implications:
– Librarians are not always aware of TDM taking place.
– It would help to have a better understanding.
– New legislation, so we are currently reactive to it
– Change of role of the library - traditionally to preserve access to items.
– TDM could threaten access, so internal disquiet
– Would like to be enabling this activity rather than saying no you can’t
• Solutions?
– Help if publishers deliver material in different ways - not a hard drive. Could this be part of a platform?
– Good if material was produced in a format that allowed TDM (at no extra cost)

Page 23: Developing a research Library position statement on Text and Data Mining in the UK


International activity in this area

There are several large initiatives looking at Text and Data Mining

Page 24: Developing a research Library position statement on Text and Data Mining in the UK

Work in this area - FutureTDM

• Background: America and Asia lead activity in TDM
• FutureTDM seek to increase TDM activity in EU
• Engagement with stakeholders (e.g. researchers, developers, publishers)
– Why is uptake lower in EU?
– Raise awareness
– Develop solutions

http://project.futuretdm.eu/

Page 25: Developing a research Library position statement on Text and Data Mining in the UK

Work in this area - European Commission

Proposal: New copyright exception for research organisations carrying out research in public interest
– to carry out TDM of copyright protected content
– if they have lawful access (e.g. subscription)
– without prior authorisation

European Commission MEMO (MEMO/16/3011)

Page 26: Developing a research Library position statement on Text and Data Mining in the UK

You can have your say on the EU reform

• Sign the Hague Declaration and ask your researchers to sign it: http://thehaguedeclaration.com
– (not just about copyright reform but about advancing research more generally)
• Ask your institutions to support this joint letter for LIBER, LERU etc.: http://libereurope.eu/blog/2017/01/10/eu-copyright-reform-liber-joins-leading-research-groups-call-change/
• Write to your local MEP saying why you support a European exception on TDM. Mary Honeyball and Catherine Stihler are two key UK MEPs.
• Collect examples of TDM projects, problems, solutions, share and promote them. Make the UK Intellectual Property Office aware of issues that you have with the UK legislation.
• Once the report goes through European Parliament it will go to the European Council (EU heads of state) so contacting your national representatives (ministers for research etc., IP Office) will be key at this point.

European Commission MEMO (MEMO/16/3011)

Page 27: Developing a research Library position statement on Text and Data Mining in the UK

UK-based TDM activity

• British Library
• ContentMine - Wikimedia project
• ChemDataExtractor
• NaCTeM

Page 28: Developing a research Library position statement on Text and Data Mining in the UK

British Library EThOS

• E-Thesis On-line Service
• British Library opportunity for PhD student placement*
• TDM on 150,000 theses held in EThOS
– Extract new metadata information
– E.g. Names of supervisors from Acknowledgements, funding information
– Outputs feed into future initiatives

British Library, 2017, https://www.bl.uk/news/2016/november/british-library-phd-placements-call-for-applications

*Applications closed 20 February 2017
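To make the EThOS idea concrete, the sketch below shows one very simple way supervisor names might be pulled from an Acknowledgements section. It is an assumption for illustration only, not the method used in the British Library placement: the regular expression, the `find_supervisors` helper and the sample sentence (with its made-up name) are all hypothetical, and real acknowledgements would need far more robust handling.

```python
import re
from typing import List

# Hypothetical pattern: the word "supervisor(s)" followed by a titled name,
# e.g. "my supervisor, Dr Jane Smith". Real theses vary far more than this.
SUPERVISOR_PATTERN = re.compile(
    r"supervisors?,?\s+"
    r"((?:Dr|Prof|Professor)\.?\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"
)


def find_supervisors(acknowledgements: str) -> List[str]:
    """Return the titled names that directly follow the word 'supervisor'."""
    return SUPERVISOR_PATTERN.findall(acknowledgements)


if __name__ == "__main__":
    # Made-up sample text, not taken from any real thesis.
    sample = "I would like to thank my supervisor, Dr Jane Smith, for her guidance."
    print(find_supervisors(sample))
```

A pattern this narrow will miss many phrasings, so it only illustrates the kind of metadata extraction the slide describes.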

Page 29: Developing a research Library position statement on Text and Data Mining in the UK

ContentMine and WikiFactMine

ContentMine supplies open source TDM software to access and analyse documents

Project grant to develop WikiFactMine
– ContentMine partnering with Wikimedia Foundation
– Project aims to make scientific data available to editors of Wikidata and Wikipedia

http://contentmine.org/

Page 30: Developing a research Library position statement on Text and Data Mining in the UK

ChemDataExtractor

• Molecular Engineering Group, University of Cambridge
• Chemical information from scientific documentation (e.g. text, tables)
• Open source software package
• Extracted data for onward analysis

Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904 10.1021/acs.jcim.6b00207

Page 31: Developing a research Library position statement on Text and Data Mining in the UK

Biomedical Text Mining

• Manchester Institute of Biotechnology - National Centre for Text Mining (NaCTeM)
• Text mining tools and services in the biomedical field

http://www.nactem.ac.uk/index.php

Page 32: Developing a research Library position statement on Text and Data Mining in the UK

The problem we are trying to solve

Libraries are worried about getting cut off from their subscription by publishers due to large downloads of papers through TDM activity

Page 33: Developing a research Library position statement on Text and Data Mining in the UK

Being cut off - how it works

• Publishers' systems pre-programmed to react to suspicious activity
• TDM may invoke automated investigation, may cause access block
• Universities need to maintain a support mechanism to ensure continuity of access
– Require workflows for swift resolution, fast communication, team of communicators
• Also requires education of researchers about potential issues
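The kind of automated reaction described here can be pictured as a sliding-window request counter. The sketch below is an assumption about how such a check might look, not a description of any publisher's actual system: the `DownloadMonitor` class, the threshold of 3 requests and the 60-second window are hypothetical values chosen to keep the example small.

```python
from collections import defaultdict, deque


class DownloadMonitor:
    """Flags a client once it exceeds max_requests within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each client IP to the timestamps of its recent requests.
        self._history = defaultdict(deque)

    def record(self, client_ip: str, timestamp: float) -> bool:
        """Record one request; return True if the client should be flagged."""
        times = self._history[client_ip]
        times.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while times and timestamp - times[0] > self.window_seconds:
            times.popleft()
        return len(times) > self.max_requests


if __name__ == "__main__":
    # Hypothetical threshold: more than 3 downloads in 60 seconds is flagged.
    monitor = DownloadMonitor(max_requests=3, window_seconds=60)
    for second in (0, 1, 2, 3):
        print(second, monitor.record("192.0.2.10", second))
```

A TDM crawl that fetches thousands of papers in quick succession would cross almost any such threshold, which is why the slides stress telling publishers in advance.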

Page 34: Developing a research Library position statement on Text and Data Mining in the UK

Discussion

Write on three separate post-it notes your top three reasons why your organisation is not actively supporting TDM (Yellow post-its)

If your organisation is supporting TDM write the top three challenges you face (Green post-its)

Page 35: Developing a research Library position statement on Text and Data Mining in the UK

Discussion feedback: Why not supporting?

● Practical
○ Challenges of handling physical media
○ Risk of lockout
● Lack of demand
○ We are not getting enquiries. Perhaps not coming to the Library. Someone in IT supporting research computing may not even pass on the queries. Internal discussion needed.
○ Not much call
● Who is responsible?
○ No institutional view on TDM because the issues are not raised at academic level. - POLICY NEEDED
○ How can a library provide a service - responding to individual queries, how do we scale it up?
○ Not joined up - assumption in the discussion that the Library is at the centre of all this and we are not joined up as organisations

Page 36: Developing a research Library position statement on Text and Data Mining in the UK

Discussion feedback: Challenges?

● When making research within a specific environment it should be relatively straightforward if it remains within the environment.
● Complicated
○ In order to provide access to the data, there are requirements at the content owner level - everyone needs to understand the need.
○ Intrusive on the researcher process.
○ Need to ensure it is not commercial use, and ensuring people know their responsibilities
● Time
○ A contract with a particular publisher to allow our researchers to TDM took two years to finalise.

Page 37: Developing a research Library position statement on Text and Data Mining in the UK

Proposal

To draft a statement for a Service Level Agreement for publishers to assure us that if the activity is legal we will be reinstated within 1 hour (or something like that).

Discuss - What are the issues if we did this?

Page 38: Developing a research Library position statement on Text and Data Mining in the UK

Discussion feedback 1

Expectation of publishers?
• Publishers contact the library to give a grace period to investigate rather than being cut off
• Way publisher platforms operate -
– LOCKSS crawls publisher software without getting trapped.
– This could work in the same way with a bank of IP addresses that is secured for this purpose.
– Avoid some of the manual work. Third party IP registry.
• Basis for the conversation over the SLA
– The law is on the subscriber’s side if everyone is doing it legally.
– We need an understanding of the extent of infringing activity going on with University networks (understanding that people can ‘mask’ themselves).
– Useful for thinking of thresholds.
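The "bank of IP addresses" idea from this discussion could, in principle, be checked on the publisher side with a simple allowlist lookup. The sketch below uses Python's standard `ipaddress` module; the registry contents (documentation-only ranges from RFC 5737) and the `is_registered_tdm_client` helper are hypothetical, and no such third-party registry is known to exist.

```python
import ipaddress

# Hypothetical registry of address ranges reserved for TDM crawling.
# These are documentation-only ranges (RFC 5737), not real university networks.
REGISTERED_TDM_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/28"),
]


def is_registered_tdm_client(client_ip: str) -> bool:
    """Return True if the address falls inside any registered TDM range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in REGISTERED_TDM_RANGES)


if __name__ == "__main__":
    for ip in ("192.0.2.55", "198.51.100.20", "203.0.113.5"):
        print(ip, is_registered_tdm_client(ip))
```

A publisher platform could consult such a list before triggering a block, so that heavy downloading from a registered address starts a conversation instead of a lock-out.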

Page 39: Developing a research Library position statement on Text and Data Mining in the UK

Discussion feedback 2

Expectation of libraries?
• Would not like to keep a register.
• Range of IP addresses to be part of the license agreement
• Create a safe space for TDM? Or is this a barrier?
• Trying to design something which is bolted onto a different use of content. Large scale computational reading is something totally different.
• Two issues
– How do we manage the licenses we are currently signed up to?
– How do we manage licensing into the future so we separate the different uses?

Page 40: Developing a research Library position statement on Text and Data Mining in the UK

Discussion feedback 3

Time frames?
– Being cut off for a week or two weeks with no redress is unusual at best!

Page 41: Developing a research Library position statement on Text and Data Mining in the UK

Agreed Expectations:

1. Don’t cut us off! Have a conversation first (and if you want to cut us off - prove there are all these activities happening in the UK)
2. If you do cut us off and it turns out to be legitimate then we expect compensation for the time we were cut off
3. Mechanisms for TDM where certain behaviours are expected - built into separate licensing agreements for TDM

Page 42: Developing a research Library position statement on Text and Data Mining in the UK


Page 43: Developing a research Library position statement on Text and Data Mining in the UK

Next steps

Getting the statement endorsed by RLUK, funding councils etc., then taking it to publishers.