EBI is an Outstation of the European Molecular Biology Laboratory.
Using text mining for discovery and data integration: literature resources at the EBI and the UKPMC project
UKPMC
• Over 1.5 million full-text articles
• Based on PubMed Central (PMC) at the NIH: PMCi
• Mandated deposition of articles by UK Funders
• 2006-2008: Basic set-up and grant reporting
• 2008-2011: Adding value to search and retrieval
Funded by the Wellcome Trust, BBSRC, MRC, British Heart Foundation,
Cancer Research, Arthritis Research Campaign, NIHR
The Collaboration
The University of Manchester
• MIMAS: host PMCi and Grant Reporting
• NaCTeM: text mining
The British Library
• Content addition
• Interface development
The European Bioinformatics Institute
• Web services
• Metadata and full text indexing
• Text mining
Full text + abstracts (CiteXplore)
~9% of the size of PubMed
“free access”
CiteXplore: overview
• More than 22 million abstracts
  • PubMed: 19 million; patents: 1.88 million
• Website and web services
  • URL below, SOAP
• Basic search, some advanced search features
  • Lucene
• Added value: citations, database links, text mining
  • Citations: over 9 million PubMed articles cited from our UKPMC and CrossRef dataset
http://www.ebi.ac.uk/citexplore/
Text mining in CiteXplore via Whatizit
• Genes
• Species
• GO Terms
Added value: Anatomy of a CiteXplore record
Full text links
Database links
Text mining
Citation info
Same page in UKPMC
Addition of content to CiteXplore
PubMed: 19 million
~0.5 million
PubMed Central: 1.8 million
PMC will be a true subset at CiteXplore and UKPMC
Text mining component of UKPMC?
Named Entity Recognition
• Organisms, GO Terms, Genes/Proteins, Accession Numbers, Diseases, Drugs, Chemicals
Fact Extraction (to 2011)
• Protein-protein interactions, gene-phenotype relationships, drugs-proteins
Applications
• Different uses for different applications, e.g. “human” vs. “ABCA1” vs. “inhibits”
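As a rough illustration of the named entity recognition step, the sketch below tags tokens by dictionary lookup. It is a minimal example under stated assumptions, not the UKPMC or Whatizit implementation: the `DICTIONARY` contents, the type labels, and the `tag_entities` function are all hypothetical, chosen only to mirror the “human” vs. “ABCA1” vs. “inhibits” example.

```python
import re
from typing import List, NamedTuple


class Mention(NamedTuple):
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    text: str         # surface form as it appears in the document
    entity_type: str  # label assigned by the dictionary lookup


# Toy dictionary (an assumption for illustration): lower-cased term -> type.
DICTIONARY = {
    "human": "organism",
    "abca1": "gene/protein",
    "inhibits": "relation trigger",
}


def tag_entities(text: str) -> List[Mention]:
    """Return every alphanumeric token found in DICTIONARY, with offsets."""
    mentions = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        entity_type = DICTIONARY.get(match.group().lower())
        if entity_type is not None:
            mentions.append(
                Mention(match.start(), match.end(), match.group(), entity_type)
            )
    return mentions


if __name__ == "__main__":
    for mention in tag_entities("Compound X inhibits ABCA1 in human cells"):
        print(mention)
```

The three labels hint at why the uses differ: an organism mention is mostly useful for filtering, a gene mention for linking to a database record, and a trigger verb for fact extraction.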
Progress so far
1,551,533 documents: OCR, PDF, XML
Genes/Proteins: 5,110,489; Species: 3,892,466; GO terms: 4,493,691

| Top 10 Genes/Proteins | Top 10 Organisms | Top 10 GO Terms |
|---|---|---|
| CD4 | Human | Binding |
| IFN | Mouse, mice | Membrane(s) |
| Insulin | Escherichia coli | Development |
| Actin | Animals | Transcription |
| P53 | Rat(s) | Death |
| TNF | Bacteria | Host |
| GST | Yeast | Chromosome |
| GFP | Rabbit | Phosphorylation |
| LacZ | HIV | Intracellular |
| EGF | viruses | Transport |

Many improvements that could be made
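Top-10 lists like the ones on this slide can be derived by counting mentions per entity type. The sketch below shows one way to do that, assuming the annotations are available as `(entity_type, term)` pairs; the function name `top_by_type` and the input format are hypothetical, and the slide does not say how the original counts were produced.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def top_by_type(
    mentions: Iterable[Tuple[str, str]], n: int = 10
) -> Dict[str, List[Tuple[str, int]]]:
    """Count mentions per entity type and return the n most frequent terms."""
    counters: Dict[str, Counter] = defaultdict(Counter)
    for entity_type, term in mentions:
        # Case-fold so that "CD4" and "cd4" are counted together.
        counters[entity_type][term.lower()] += 1
    return {etype: counter.most_common(n) for etype, counter in counters.items()}


if __name__ == "__main__":
    sample = [
        ("gene", "CD4"), ("gene", "cd4"), ("gene", "TNF"),
        ("organism", "Human"), ("organism", "human"), ("organism", "Mouse"),
    ]
    print(top_by_type(sample, n=2))
```

Case-folding is a deliberate simplification here; a real pipeline would more likely map variants such as “Mouse, mice” to a shared identifier before counting.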
Coming soon to a screen near you
Challenges
• Precision and recall
• OCR (see example)
• Experience in mining full text content; section identification
• Integrating text mining utility into regular workflows
• Provision of functions in a production environment
• Business Rules (see next slide), open and free access
• Content growth (compliance; new funding agencies)
• Biologists are unforgiving of TM “errors”
Political/Social
Technical
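For the first challenge, precision and recall, a small worked example may help. Precision is the fraction of extracted annotations that are correct, and recall is the fraction of correct annotations that were extracted. The sketch below assumes annotations can be compared as plain set members; the function name and the sample data are illustrative only.

```python
from typing import Hashable, Set, Tuple


def precision_recall(
    predicted: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float]:
    """Compare extracted annotations against a hand-curated gold standard."""
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    predicted = {"CD4", "TNF", "GFP", "host"}        # what the tagger found
    gold = {"CD4", "TNF", "GFP", "IFN", "EGF"}       # what a curator marked
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case three of four predictions are correct (precision 0.75) and three of five gold annotations are found (recall 0.6).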
Hazards of OCR
13-LACTAMASE
/3-lactamase
83-lactamase
,8-lactamase
8)-lactamase
beta-lactamase
,B-Lactamase
f8-lactamase
fi-lactamase
,f-lactamase
f-lactamase
f,-lactamase
f)-lactamase
fl-lactamase
/,-lactamase
,-lactamase
Thanks to CJ Rupp, NaCTeM
“there's more where that came from”
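The variants on this slide all appear to be OCR misreadings of a Greek beta in front of "-lactamase". One possible mitigation is to normalise such variants before dictionary lookup. The sketch below is an assumption-laden example, not the approach used in UKPMC: the pattern `OCR_BETA_LACTAMASE` and the function name are hypothetical, and the rule of "one to three stray characters at the start of a token" is a heuristic that can also rewrite genuine prefixes.

```python
import re

# Heuristic (an assumption): a token that starts with "beta" or with one to
# three arbitrary non-space characters, followed by "-lactamase", is treated
# as a misread beta-lactamase. The lookbehind keeps the match at the start of
# a whitespace-delimited token.
OCR_BETA_LACTAMASE = re.compile(
    r"(?<!\S)(?:beta|\S{1,3})-lactamase", re.IGNORECASE
)


def normalise_beta_lactamase(text: str) -> str:
    """Rewrite OCR variants of beta-lactamase to one canonical spelling."""
    return OCR_BETA_LACTAMASE.sub("beta-lactamase", text)


if __name__ == "__main__":
    print(normalise_beta_lactamase("f8-lactamase and ,B-Lactamase genes"))
```

A rule like this trades recall for precision: it would also rewrite an unrelated short prefix such as "non-lactamase", so in practice it would need a whitelist or manual review of the variants it catches.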
Reflect http://reflect.ws/
Application Goals of Text mining in UKPMC
High quality, integrated functions
Integrated browse functions
• Highlighting, article summaries
• e.g. articles with similar named entity profiles to this one
• Integration with underlying databases, e.g. UniProt, ArrayExpress, PDB
Integrated search functions
• e.g. search for gene symbol & find co-occurring diseases
• Facts are the ultimate in co-occurrence
Many of these kinds of functions are demonstrated in existing stand-alone apps
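The search example, "search for gene symbol & find co-occurring diseases", can be sketched as a document-level co-occurrence count. This is a minimal illustration under assumptions: each document is represented as a list of `(entity_type, term)` annotations, and the function name, the input format, and the placeholder terms are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Annotation = Tuple[str, str]  # (entity_type, term)


def cooccurring_terms(
    documents: Sequence[Sequence[Annotation]], query_term: str, target_type: str
) -> List[Tuple[str, int]]:
    """Rank terms of target_type by how many documents also mention query_term."""
    counts: Counter = Counter()
    for annotations in documents:
        terms_in_doc = {term for _, term in annotations}
        if query_term in terms_in_doc:
            # Use a set so each term is counted at most once per document.
            counts.update(
                {term for etype, term in annotations if etype == target_type}
            )
    return counts.most_common()


if __name__ == "__main__":
    docs = [
        [("gene", "GENE1"), ("disease", "disease A"), ("disease", "disease B")],
        [("gene", "GENE1"), ("disease", "disease A")],
        [("gene", "GENE2"), ("disease", "disease B")],
    ]
    print(cooccurring_terms(docs, "GENE1", "disease"))
```

Document-level co-occurrence is the loosest form of association; extracted facts, as the slide notes, are the tighter version of the same idea.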
Who did the work?
EBI Literature Services: Peter Stoehr, Sharmila Pilia, Alan Horne, Mark Rijnbeek
EBI Text mining: Dietrich Rebholz-Schuhmann, Ian Lewin, Jee-Hyub Kim
Collaborators: UKPMC
NaCTeM: Sophia Ananiadou, CJ Rupp, Chikashi Nobata
MIMAS: Dave Chapman, Vic Lyte, Ross McIntyre
British Library: Ernie Ong, Phil Vaughan, Sandy Chevuru, Rob Rowbotham, Paul Davey, Heather Rosie