OSC
Office of Scholarly Communication
Developing a research library position statement on Text and Data Mining in the UK
RLUK 2017
Dr Danny Kingsley, Dr Debbie Hansen - University of Cambridge
Anna Vernon - Jisc
British Library - 9th March 2017
Who are we?
What is TDM?
“the use of large online text collections to discover new facts and trends about the world itself” (Hearst, 1999†)
†Hearst, M. A., Untangling Text Data Mining, Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper)
Why do you do it?
• Fast literature review
• Extract new facts
• Answer research questions
• Access wide range of sources for a topic
• Saves time
• More bang for the same cost
• Not achievable through manual searches
Research Innovation
Mining of scholarly publications
TDM basics - how
What is the legal situation?
Hargreaves exception
Licensing
Hargreaves Review
• Independent review of UK Intellectual Property system, focus UK Copyright law
• Recommendations from Hargreaves Review:
– introduce a copyright exception
– allow TDM for non-commercial purposes
– prohibit exclusion of TDM through contract
• Government introduction of exception June 2014:
– if user has lawful access, works can be copied for TDM for non-commercial research
Current practicalities
What do publishers say about TDM?
Comment from OASPA
• Open Access Scholarly Publishers Association
• Expressed their support of TDM efforts
– By signing Hague Declaration in 2016
• Support the adoption of best practice and behaviours regarding TDM
• Statement:
– ‘reasonable best practice for those engaged in TDM to inform publishers … that such content mining is planned’
OASPA, December 2016, http://oaspa.org/oaspa-comment-text-data-mining-proposed-eu-copyright-reform/
University College London study
£500K per annum
• Estimate for TDM activity compliance
– Before TDM exception
• Figure from OA staffing costs
– OA staffing costs to monitor TDM activity
– TDM clearance fees with rightsholders
Do the publishers have statements?
Oxford University Press
● No permission needed for non-commercial TDM
○ But can contact for consultation on TDM (e.g. to avoid technical safeguard triggers)
● Can contact to request TDM for commercial purposes
https://academic.oup.com/journals/pages/help/third_party_data_mining
Royal Society
● Support use of computers to extract from scholarly publications
● Members of subscribing institutions have permission to mine
○ For non-commercial and commercial purposes
○ Respect copyright and cite where possible
● Let them know when you intend to do TDM
○ To prevent automatic lock-out
https://royalsociety.org/journals/ethics-policies/data-sharing-mining/
Now also:
Cambridge University Press
International Union of Crystallography
Do the publishers have statements?
Elsevier
Different licenses, different rules, e.g.:
● CC BY - yes to TDM
● CC BY-NC-SA - yes to TDM for non-commercial purposes
● CC BY-NC-ND - no to TDM
● Open Archives (content made available after an embargo)
○ Yes to TDM for non-commercial purposes and cite authors and source
https://www.elsevier.com/connect/what-changes-when-publishing-open-access-understanding-the-fine-print
Hindawi
• Hindawi facilitate the use of their content for data mining purposes - https://www.hindawi.com/corpus/
• Full XML content available for download
– as single .zip file
– .zip file updated daily
– (XML files adhere to the US National Library of Medicine Document Type Definition)
• Not advertised widely
– over last 12 months, 1,770 unique visits
– between 60 and 90 downloads per month
– roughly 720-1080 downloads for the year
Example situation
Negotiations between a publisher and Cambridge University in May 2015 over TDM.
– Original contract would have been binding for whole University
– Data only available on a hard drive and not downloadable onto a server
– Charge of £1,100 for the cost of the hard drive
– Substantial number of limitations and restrictions
Group discussion - about your experiences
Talk about any experiences you have had with TDM. Feedback into the group:
* Challenges encountered?
* Concerns?
* Successes?
Feedback from discussion
Situation:
• Hard drive provided.
• Do not know what is being asked for - who is responsible?
• What is the IT responsibility here?
• Copyright and compliance officer needed to do a lot of work.
Solutions:
• Clearer understanding of the licensing situation.
• Mechanism of where to go for advice.
• Procedures of what to do with it - policy issue
Feedback from discussion 2
• Issue:
– Researcher behaviour - academics not concerned by copyright
• Library implications:
– Librarians are not always aware of TDM taking place.
– Help if have better understanding.
– New legislation, so we are currently reactive to it
– Change of role of the library - traditionally to preserve access to items.
– TDM could threaten access, so internal disquiet
– Would like to be enabling this activity rather than saying no you can’t
• Solutions?
– Help if publishers deliver material in different ways - not a hard drive. Could this be part of a platform?
– Good if material was produced in a format that allowed TDM (at no extra cost)
International activity in this area
There are several large initiatives looking at Text and Data Mining
Work in this area - FutureTDM
• Background: America and Asia lead activity in TDM
• FutureTDM seeks to increase TDM activity in EU
• Engagement with stakeholders (e.g. researchers, developers, publishers)
– Why is uptake lower in EU?
– Raise awareness
– Develop solutions
http://project.futuretdm.eu/
Work in this area - European Commission
Proposal:
New copyright exception for research organisations carrying out research in public interest
– to carry out TDM of copyright protected content
– if they have lawful access (e.g. subscription)
– without prior authorisation
European Commission MEMO (MEMO/16/3011)
You can have your say on the EU reform
• Sign the Hague Declaration and ask your researchers to sign it http://thehaguedeclaration.com
– (not just about copyright reform but about advancing research more generally)
• Ask your institutions to support this joint letter for LIBER, LERU etc. http://libereurope.eu/blog/2017/01/10/eu-copyright-reform-liber-joins-leading-research-groups-call-change/
• Write to your local MEP saying why you support a European exception on TDM. Mary Honeyball and Catherine Stihler are 2 key UK MEPs.
• Collect examples of TDM projects, problems, solutions, share and promote them. Make the UK Intellectual Property Office aware of issues that you have with the UK legislation.
• Once the report goes through European Parliament it will go to the European Council (EU heads of state), so contacting your national representatives (ministers for research etc., IP Office) will be key at this point.
European Commission MEMO (MEMO/16/3011)
UK-based TDM activity
British Library
Content Mine - Wikimedia project
ChemDataExtractor
NaCTeM
British Library EThOS
• E-Thesis On-line Service
• British Library opportunity for PhD student placement*
• TDM on 150,000 theses held in EThOS
– Extract new metadata information
– E.g. Names of supervisors from Acknowledgements, funding information
– Outputs feed into future initiatives
British Library, 2017, https://www.bl.uk/news/2016/november/british-library-phd-placements-call-for-applications
*Applications closed 20 February 2017
ContentMine and WikiFactMine
ContentMine supplies open source TDM software to access and analyse documents
Project grant to develop WikiFactMine
– ContentMine partnering with Wikimedia Foundation
– Project aims to make scientific data available to editors of Wikidata and Wikipedia
http://contentmine.org/
ChemDataExtractor
• Molecular Engineering Group, University of Cambridge
• Chemical information from scientific documentation (e.g. text, tables)
• Open source software package
• Extracted data for onward analysis
Swain, M. C., & Cole, J. M. "ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature", J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904, doi:10.1021/acs.jcim.6b00207
Biomedical Text Mining
• Manchester Institute of Biotechnology - National Centre for Text Mining (NaCTeM)
• Text mining tools and services in the biomedical field
http://www.nactem.ac.uk/index.php
The problem we are trying to solve
Libraries are worried about being cut off from their subscriptions by publishers because of large downloads of papers through TDM activity
Being cut off - how it works
• Publishers' systems are pre-programmed to react to suspicious activity
• TDM may invoke automated investigation, may cause access block
• Universities need to maintain a support mechanism to ensure continuity of access
– Require workflows for swift resolution, fast communication, team of communicators
• Also requires education of researchers about potential issues
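As an illustration of the researcher-education point, one common way to lower the risk of tripping automated blocks is to pace requests. The following is a minimal Python sketch under stated assumptions, not a publisher-approved method: the names `Throttle`, `mine` and `fetch` are hypothetical, and any interval chosen (for example 2 seconds) is illustrative only and does not come from any publisher's policy.

```python
import time


class Throttle:
    """Enforce a minimum gap between successive requests (hypothetical helper)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s  # assumed value, agree it with the publisher
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first

    def wait(self):
        """Block until at least min_interval_s has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def mine(urls, fetch, throttle):
    """Fetch each URL in turn, pausing between requests.

    `fetch` is supplied by the caller (e.g. a function wrapping an HTTP
    client); it is a placeholder here, not a real library call.
    """
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

Pacing alone does not replace telling the publisher that mining is planned, which several of the publisher statements above ask for.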
Discussion
Write on three separate post-it notes your top three reasons why your organisation is not actively supporting TDM (Yellow post-its)
If your organisation is supporting TDM write the top three challenges you face (Green post-its)
Discussion feedback: Why not supporting?
● Practical
○ Challenges of handling physical media
○ Risk of lockout
● Lack of demand
○ We are not getting enquiries. Perhaps not coming to the Library. Someone in IT supporting research computing may not even pass on the queries. Internal discussion needed.
○ Not much call
● Who is responsible?
○ No institutional view on TDM because the issues are not raised at academic level. - POLICY NEEDED
○ How can a library provide a service - responding to individual queries, how do we scale it up?
○ Not joined up - assumption in the discussion that the Library is at the centre of all this and we are not joined up as organisations
Discussion feedback: Challenges?
● When doing research within a specific environment, it should be relatively straightforward if it remains within the environment.
● Complicated
○ In order to provide access to the data, there are requirements at the content owner level - everyone needs to understand the need.
○ Intrusive on the researcher process.
○ Need to ensure it is not commercial use, and ensure people know their responsibilities
● Time
○ A contract with a particular publisher to allow our researchers to TDM took two years to finalise.
Proposal
To draft a statement for a Service Level Agreement for publishers to assure us that if the activity is legal we will be reinstated within 1 hour (or something like that).
Discuss - What are the issues if we did this?
Discussion feedback 1
Expectation of publishers?
• Publishers contact the library to give a grace period to investigate rather than being cut off
• Way publisher platforms operate -
– LOCKSS crawls publisher software without getting trapped.
– This could work in the same way with a bank of IP addresses that is secured for this purpose.
– Avoid some of the manual work. Third party IP registry.
• Basis for the conversation over the SLA
– The law is on the subscriber’s side if everyone is doing it legally.
– We need an understanding of the extent of infringing activity going on with University networks (understanding that people can ‘mask’ themselves).
– Useful for thinking of thresholds.
Discussion feedback 2
Expectation of libraries?
• Would not like to do a register.
• Range of IP addresses to be part of the license agreement
• Create a safe space for TDM? Or is this a barrier?
• Trying to design something which is bolted onto a different use of content. Large scale computational reading is something totally different.
• Two issues
– How do we manage the licenses we are currently signed up to?
– How do we manage licensing into the future so we separate the different uses?
Discussion feedback 3
Time frames?
– Being cut off for a week or two weeks with no redress is unusual at best!
Agreed Expectations:
1. Don’t cut us off! Have a conversation first (and if you want to cut us off - prove there are all these activities happening in the UK)
2. If you do cut us off and it turns out to be legitimate then we expect compensation for the time we were cut off
3. Mechanisms for TDM where certain behaviours are expected - built into separate licensing agreements for TDM
Next steps
Getting the statement endorsed by RLUK, funding councils etc., then taking it to publishers.