
Page 1: Using text mining for discovery and data integration

Literature resources at the EBI and the UKPMC project.

Funded by the Wellcome Trust, BBSRC, MRC, British Heart Foundation, Cancer Research, Arthritis Research Campaign, NIHR. The EBI is an Outstation of the European Molecular Biology Laboratory.

Page 2

UKPMC

• Over 1.5 million full text articles

• Based on PubMed Central (PMC) at the NIH: PMCi

• Mandated deposition of articles by UK Funders

• 2006-2008: Basic set-up and grant reporting

• 2008-2011: Adding value to search and retrieval

Funded by the Wellcome Trust, BBSRC, MRC, British Heart Foundation,

Cancer Research, Arthritis Research Campaign, NIHR

Page 3

The Collaboration

The University of Manchester
• MIMAS: hosts PMCi and grant reporting
• NaCTeM: text mining

The British Library
• Content addition
• Interface development

The European Bioinformatics Institute
• Web services
• Metadata and full text indexing
• Text mining

Page 4

Full text

+ abstracts (CiteXplore)

Page 5

~9% of the size of PubMed

“free access”

Page 6

CiteXplore: overview

• More than 22 million abstracts
    • PubMed: 19 million; patents: 1.88 million

• Website and web services
    • URL below, SOAP

• Basic search, some advanced search features
    • Lucene

• Added value: citations, database links, text mining
    • Citations: over 9 million PubMed articles cited from our UKPMC and CrossRef dataset

http://www.ebi.ac.uk/citexplore/

Page 7

Text mining in CiteXplore via Whatizit

Genes

Species

GO Terms

Page 8

Added value: Anatomy of a CiteXplore record

Full text links

Database links

Text mining

Citation info

Page 9

Same page in UKPMC

Page 10

Addition of content to CiteXplore

PubMed: 19 million

~0.5 million

PubMed Central: 1.8 million

PMC will be a true subset at CiteXplore and UKPMC

Page 11

Text mining component of UKPMC?

Named Entity Recognition
• Organisms, GO Terms, Genes/Proteins, Acc. Numbers, Diseases, Drugs, Chemicals

Fact Extraction (to 2011)
• Protein-protein interactions, gene-phenotype relationships, drugs-proteins

Applications
• Different uses for different applications, e.g. “human” vs. “ABCA1” vs. “inhibits”
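The named entity recognition step above can be pictured as a dictionary lookup over the text. The following is a minimal sketch only, not the UKPMC or Whatizit implementation: the dictionary entries, the type labels, the `tag_entities` helper, and the example sentence are all assumptions made up for illustration.

```python
import re

# Toy dictionary: lower-cased surface form -> entity type.
# These entries are illustrative assumptions, not the UKPMC vocabularies.
DICTIONARY = {
    "human": "Organism",
    "mouse": "Organism",
    "abca1": "Gene/Protein",
    "tnf": "Gene/Protein",
    "inhibits": "Interaction",
}

TOKEN = re.compile(r"[A-Za-z0-9]+")


def tag_entities(text):
    """Return (surface form, entity type, start offset) for each dictionary hit."""
    hits = []
    for match in TOKEN.finditer(text):
        entity_type = DICTIONARY.get(match.group().lower())
        if entity_type is not None:
            hits.append((match.group(), entity_type, match.start()))
    return hits


# Invented sentence, used only to exercise the tagger.
sentence = "TNF inhibits ABCA1 expression in human macrophages."
for surface, entity_type, offset in tag_entities(sentence):
    print(f"{offset:>3}  {surface:<10} {entity_type}")
```

The three kinds of hit correspond to the “human” vs. “ABCA1” vs. “inhibits” distinction on the slide: an organism, a gene or protein, and a verb that could signal a fact worth extracting.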

Page 12

Progress so far

1,551,533 documents: OCR, PDF, XML

Genes/Proteins: 5,110,489, Species: 3,892,466, GO terms: 4,493,691

Top 10 Genes/Proteins   Top 10 Organisms    Top 10 GO Terms
CD4                     Human               Binding
IFN                     Mouse, mice         Membrane(s)
Insulin                 Escherichia coli    Development
Actin                   Animals             Transcription
P53                     Rat(s)              Death
TNF                     Bacteria            Host
GST                     Yeast               Chromosome
GFP                     Rabbit              Phosphorylation
LacZ                    HIV                 Intracellular
EGF                     viruses             Transport

Many improvements that could be made
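A "top 10" list of this kind can be derived by counting annotations per entity type across the corpus. The sketch below assumes each document has already been reduced to a list of (surface form, entity type) pairs; the `top_entities` helper and the three toy documents are hypothetical and are not the data behind the slide.

```python
from collections import Counter


def top_entities(annotated_docs, entity_type, n=10):
    """Count mentions of one entity type across all documents; return the n most frequent."""
    counts = Counter()
    for doc in annotated_docs:
        for surface, etype in doc:
            if etype == entity_type:
                counts[surface] += 1
    return counts.most_common(n)


# Made-up annotations standing in for text-mined documents.
docs = [
    [("CD4", "Gene/Protein"), ("Human", "Organism"), ("CD4", "Gene/Protein")],
    [("TNF", "Gene/Protein"), ("Mouse", "Organism")],
    [("CD4", "Gene/Protein"), ("Human", "Organism")],
]

print(top_entities(docs, "Gene/Protein", n=2))
print(top_entities(docs, "Organism", n=2))
```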

Page 13

Coming soon to a screen near you

Page 14

Challenges

• Precision and recall

• OCR (see example)

• Experience in mining full text content; section identification

• Integrating text mining utility into regular workflows

• Provision of functions in a production environment

• Business Rules (see next slide), open and free access

• Content growth (compliance; new funding agencies)

• Biologists are unforgiving of TM “errors”

Political/Social

Technical
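For the precision and recall challenge, a minimal sketch of how the two measures could be computed for entity annotations, assuming each annotation is a (start, end, type) tuple and that only exact matches count as correct. The `precision_recall` helper and the example annotations are hypothetical.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over two collections of annotations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up annotations: (start offset, end offset, entity type).
gold = {(0, 3, "Gene/Protein"), (13, 18, "Gene/Protein"), (33, 38, "Organism")}
predicted = {
    (0, 3, "Gene/Protein"),    # correct
    (33, 38, "Organism"),      # correct
    (19, 29, "Gene/Protein"),  # false positive
    (39, 50, "Organism"),      # false positive
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four predictions are right (precision 0.5) and two of the three gold annotations are found (recall about 0.67). False positives are the visible kind of error that readers tend to notice first.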

Page 15

Hazards of OCR

13-LACTAMASE
/3-lactamase
83-lactamase
,8-lactamase
8)-lactamase
beta-lactamase
,B-Lactamase
f8-lactamase
fi-lactamase
,f-lactamase
f-lactamase
f,-lactamase
f)-lactamase
fl-lactamase
/,-lactamase
,-lactamase

Thanks to CJ Rupp, NaCTeM

“there's more where that came from”
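One way to cope with OCR variants like these is to map known misreadings onto a canonical term before entity recognition. The sketch below is an assumption about how that could be done, not the approach used in UKPMC: the prefix list is read off the slide (the exact split between variants is uncertain in the transcript), and `normalize_lactamase` is a hypothetical helper.

```python
import re

# OCR misreadings of the "beta-" prefix, as read off the slide (assumed split).
OCR_PREFIXES = [
    "13-", "/3-", "83-", ",8-", "8)-", ",B-", "B-", "f8-",
    "fi-", ",f-", "f-", "f,-", "f)-", "fl-", "/,-", ",-",
]

# Longest prefixes first so that "f,-" is tried before "f-".
_alternatives = "|".join(
    re.escape(prefix) for prefix in sorted(OCR_PREFIXES, key=len, reverse=True)
)

# The lookbehind stops a prefix from matching in the middle of a longer word.
PATTERN = re.compile(rf"(?<![A-Za-z0-9])(?:{_alternatives})lactamase", re.IGNORECASE)


def normalize_lactamase(text):
    """Replace known OCR variants with the canonical spelling 'beta-lactamase'."""
    return PATTERN.sub("beta-lactamase", text)


sample = "Resistance was mediated by fl-lactamase and /3-lactamase; 13-LACTAMASE activity rose."
print(normalize_lactamase(sample))
```

A fixed list only covers variants already seen, which is the point of the quote on the slide: new misreadings keep turning up, and an aggressive rule such as "B-" will also rewrite legitimate text in some contexts.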

Page 16

Reflect http://reflect.ws/

Page 17

Application Goals of Text mining in UKPMC

High quality, integrated functions

Integrated browse functions

• Highlighting, article summaries
• e.g. articles with similar named entity profiles to this one
• Integration with underlying databases, e.g. UniProt, ArrayExpress, PDB

Integrated search functions

• e.g. search for gene symbol & find co-occurring diseases
• Facts are the ultimate in co-occurrence

Many of these kinds of functions are demonstrated in existing stand-alone apps
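Two of the functions named above, finding diseases that co-occur with a gene symbol and finding articles with a similar named entity profile, can be sketched over per-article annotation sets. This is an illustrative sketch under stated assumptions: the article identifiers, the annotations, and both helper functions are hypothetical, and Jaccard overlap is just one possible similarity measure.

```python
from collections import Counter

# Made-up annotations per article; identifiers and contents are hypothetical.
ARTICLES = {
    "doc1": {"genes": {"ABCA1"}, "diseases": {"Tangier disease", "atherosclerosis"}},
    "doc2": {"genes": {"ABCA1", "TNF"}, "diseases": {"atherosclerosis"}},
    "doc3": {"genes": {"TNF"}, "diseases": {"rheumatoid arthritis"}},
}


def cooccurring_diseases(gene, articles):
    """Count diseases annotated in the same articles as the given gene symbol."""
    counts = Counter()
    for annotations in articles.values():
        if gene in annotations["genes"]:
            counts.update(annotations["diseases"])
    return counts.most_common()


def profile_similarity(a, b):
    """Jaccard overlap between the named entity sets of two articles."""
    entities_a = a["genes"] | a["diseases"]
    entities_b = b["genes"] | b["diseases"]
    union = entities_a | entities_b
    return len(entities_a & entities_b) / len(union) if union else 0.0


print(cooccurring_diseases("ABCA1", ARTICLES))
print(profile_similarity(ARTICLES["doc1"], ARTICLES["doc2"]))
```

Co-occurrence only says that two entities appear in the same article; an extracted fact would additionally state how they are related.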

Page 18

Who did the work?

EBI Literature Services: Peter Stoehr, Sharmila Pilia, Alan Horne, Mark Rijnbeek

EBI Text mining: Dietrich Rebholz-Schuhmann, Ian Lewin, Jee-Hyub Kim

Collaborators: UKPMC

NaCTeM: Sophia Ananiadou, CJ Rupp, Chikashi Nobata

MIMAS: Dave Chapman, Vic Lyte, Ross McIntyre

British Library: Ernie Ong, Phil Vaughan, Sandy Chevuru, Rob Rowbotham, Paul Davey, Heather Rosie