enabling tdm tool development · −collaborative r&d address industrial, societal and...

26
Enabling TDM Tool Development Dr Rob Firth Senior AI Research Scientist, STFC Hartree Centre Text and Data Mining: what is real and applicable? Panel @ STM Innovations 3/12/2019

Upload: others

Post on 10-Dec-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Enabling TDM Tool Development

Dr Rob Firth

Senior AI Research Scientist, STFC Hartree Centre

Text and Data Mining: what is real and applicable?

Panel @ STM Innovations 3/12/2019

Page 2: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Our track record The Hartree Centre

Located at the STFC Daresbury Laboratory on the

Sci-Tech Daresbury campus.

Mission: To transform UK industry by accelerating

the adoption of high performance computing, data-

centric computing and AI.

Staff based in the building include:

• STFC

• IBM Research

• University of Liverpool Virtual Engineering Centre

Page 3: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

RAL Space Central Laser Facility (CLF)

… and more

Medical Research Council (MRC)

Natural Environment

Research Council (NERC)

… and more

Met Office UK Space Agency … and more

Department for International

Trade

Department for Education

… and more

Where does the

Hartree Centre fit ?

Page 4: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

− Collaborative R&D Address industrial, societal and scientific challenges.

Bespoke small teams built around challenge-led industry projects.

− Platform as a service Give your own experts pay-as-you-go access to our compute power

− Creating digital assets License the new industry-led software applications we create with IBM Research

− Training and skills Drop in on our comprehensive programme of specialist training courses and events or

design a bespoke course for your team.

What we do

• 60+ computational scientists

and technologists

• Insights into new & evolving

technologies e.g. HPC, AI,

Quantum Computing.

Page 5: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Government &

public sector Local business

networks Technology

partners

Our Network

Academia

Universities

International

research

communities

Page 6: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Government &

public sector Local business

networks Technology

partners

Our Network

Academia

Universities

International

research

communities

Page 7: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Collaborators

Chris Morris Jo McEntyre

Francesco

Talo

Aravind

Venkatesan Abhik

Mukhopadhyay

Sameer

Velankar

Rob Firth

Page 8: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Collaborators

Chris Morris Jo McEntyre

Francesco

Talo

Aravind

Venkatesan Abhik

Mukhopadhyay

Sameer

Velankar

Rob Firth

Page 9: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

TDM - Real and Applicable?

Page 10: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Protein Residue Annotator Case Study

Page 11: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Background: Proteins, Structural Biology

Credit: Thomas Shafee, see: Protein Structure, Wikipedia

• Proteins can be large, with only a small site of interest for a researcher/experiment

• Small primary-structural changes can cause large effects on macro-structure and interaction behavior

• Ever mounting literature to search (just like everywhere else!)

• Speed up search

• Augment the corpus

Credit: R. Firth

Page 12: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Background: Proteins, Structural Biology Fortunately:

References to residues in text are (usually) referred to in a well constrained way:

Residue Name followed by Position Number.

Tyr33, His-455, Lys(382), Methionine 120

Serine at position 91, leucine residue at position 16, F123 residue

Glu30-Tyr33-Trp71, Ala29-33

Trp71/77, Cys99/Pro101, Arg(96)/(102), Ser-91/Gly-97

Cys residues at positions 73 and 152,

Arginine at position 452, 459 and 466

Page 13: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Background: Proteins, Structural Biology

Also Fortunately:

• Relevant bioinformatics resources are well established and maintained

• Recent platform upgrades allow good API access and linking across

resources

• Recent development on Natural Language Processing (NLP) pipelines

allow at-scale performance off-the-shelf

• Common Structural data format:

PDBx/mmCIF

Fortunately:

References to residues in text are (usually) referred to in a well constrained way:

Residue Name followed by Position Number.

Tyr33, His-455, Lys(382), Methionine 120

Serine at position 91, leucine residue at position 16, F123 residue

Glu30-Tyr33-Trp71, Ala29-33

Trp71/77, Cys99/Pro101, Arg(96)/(102), Ser-91/Gly-97

Cys residues at positions 73 and 152,

Arginine at position 452, 459 and 466

Source:

Text,Tables, Metadata

BioInformatics APIs

NLP Pipelines

Page 14: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Automatic Annotation of Protein Residues in Published Papers

Firth, R., Talo, F., Venkatesan, A., Mukhopadhyay, A., McEntyre, J.,

Velankar, S. & Morris, C. (2019). Acta Cryst. F75, 665–672.

URI, URI

PDBx/mmCIF from

, or user provided

1. Ingest source text

2. Identify Residues and Proteins

3. Associate Residues with Proteins

4. Output Annotations

SciLite, JSON

Page 15: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

• Annotations produced by Pyresid

appear alongside a suite of other

‘BioEntities’: – e.g. diseases, chemicals, protein

interactions

• Platform is ePMC “SciLite”

(Venkatesan et al. 2017;

https://europepmc.org/Annotations)

• Highlighted and searchable

• Links to Uniprot entries on click

• Currently between Pyresid

versions Screenshot of SciLite Annotations on europePMC; Firth et al. 2019

Page 16: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Example of excellent practice for TDM developers and end users

• Submission via Web-Portal or API

– Validation tool available

• Scope to host tools on EMBASSY Cloud

– possible to run daily annotation jobs

• Annotations themselves are available under API

• Option of Bulk download

• Fulltext searchable and well formatted when retrieved as JSON

• Big infrastructure – ELIXIR Node

– Not neccesarily unique

Revamped platform imminent!

Page 17: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

• Pyresid now runs on ingestion to Acta

Crys. F

• Text Source independent of ePMC

• Implemented by IUCr Developer

Simon Westrip

• Existing Acta Crys F. archive has been

annotated

• Linked with 3Dmol.js to annotate 3D

Structure

Figure: van Raaij & Newman; Taking biological structure

communications into the third dimension. Acta Crystallogr

F Struct Biol Commun.

Page 18: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and
Page 19: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Thoughts

• What is made available elsewhere? Will your corpus be disregarded?

– Don’t reinvent the metadata wheel e.g. W3C standards: Web Annotation Data Model etc.

• Table Schemas in Metadata

– Semi-structured data is good target, but scaled extraction can be challenging without

enforcing standards

• Closing the loop

– Make available other mining outputs

– Version these mining outputs – is it up to date?

• Capture Preferred Vocabularies and Ontologies

• Corrections can be powerful

• Validation and Evaluation

– Valid format?

– Valid annotations?

Page 20: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Enabling Annotation

• A well annotated corpus lends itself to further exploitation

• More “Classic” TDM: Named Entity Recognition, Information Extraction

• Other, broader tasks like Question Answering and Summarisation

• Latest technology in Natural Language Processing, ‘Transformers’ (Neural Net

Architecture)

– “Imagenet for Text” – BERT/GPT-2 pretrained models

– Transfer learning in this area is already being done: “BioBert: a pre-trained

biomedical language representation model for biomedical text mining” Lee et al.

2019

Page 21: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Evaluation

• Peer review gives unparalleled access to expert knowledge

• Most expensive part of building ML/AI tools is often expert

Humans, not CPU/GPU hours

• Authors have expectation of time spent during submission

– This is unusual!

• Evaluation of Text Mined Terms and relationships

– Put back the results to submitters

– Refine the model, provide a better offering

• More in the next session!

Page 23: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Additional Slides

Page 24: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Citations

Firth R, Talo F, Venkatesan F, Mukhopadhyay A, McEntyre J, Velankar S, Morris C, Acta Crystallogr F Struct Biol

Commun. 2019 Nov 1; 75(Pt 11): 665–672. Published online 2019 Nov 5. doi: 10.1107/S2053230X1901210X,

PMCID: PMC6839820

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang, BioBERT: a pre-

trained biomedical language representation model for biomedical text mining, Bioinformatics, 09-2019,

btz682, https://doi.org/10.1093/bioinformatics/btz682

van Raaij MJ, Newman J. Taking biological structure communications into the third dimension. Acta Crystallogr F Struct

Biol Commun. 2019;75(Pt 11):663–664. doi:10.1107/S2053230X19014754

Venkatesan A, Kim JH, Talo F, et al. SciLite: a platform for displaying text-mined annotations as a means to link research

articles with biological data. Wellcome Open Res. 2017;1:25. Published 2017 Jul 10.

doi:10.12688/wellcomeopenres.10210.2

Rego, N. & Koes, D. (2015). Bioinformatics, 31, 1322–1324.

Page 25: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Additional Examples - AstroCats Of the 55,430 supernovae that have been

discovered, only:

• 18,172 have publicly available light curves in

an easily downloadable format

• 8,600 have spectra

Page 26: Enabling TDM Tool Development · −Collaborative R&D Address industrial, societal and scientific challenges. Bespoke small teams built around challenge-led industry projects. and

Annotations

+PMID/PMCID

+Annotations

 - prefix

 - exact (residue)

 - postfix

 - section

 - URI (protein)

start FullTextXML 

NER 

parser 

tagger 

tokenize 

BeautifulSoup FullText

Doc

 SourceClass 

Doc 

FullText 

PMC ID 

TextDict

TextDict Section 

Names 

 Section

 Whitelist

FuzzyWuzzy

 MatchClass

Exact String

start index 

end index 

end Token 

start Token 

Section 

Matches 

 SourceClass 

PMC ID 

Doc 

Residue 

Matches 

FullText 

  Add

  Context

UniProt URI

Exact String

start index 

end index 

     Residue 

 MatchClass 

Token 

Sent 

Prefix 

Postfix 

Data Product

Context 

Addition 

Output 

to 

Json 

pyresid

RegEx

PDB URI

PDB

 MatchClass

start index 

end index 

Exact String

Protein Data 

Bank(PDB)API 

 ValidatePDB Entry 

List 

UniProt URI

RegEx

 MatchClass

Exact String

start index 

end index 

Residues

Decompose 

Composite 

Matches 

Exact String

start index 

end index 

     Residue 

 MatchClass 

RegEx

  Dictionary

 Lookup

  Validate

+

   Associate

 Query

PDB API 

mmCIF/PDB 

Structure 

 Download

 Query

UniProt API 

UniProt URI

Identification and Location

Exact String

start index 

end index 

     Residue 

 MatchClass