ii-sdv 2016 michael iarrobino - improving text mining results with access to full-text scientific...

Post on 17-Feb-2017

Improving Text Mining Results with Access to Full-Text Scientific Articles

Mike Iarrobino, Product Manager, CCC

Introduction

Mike Iarrobino

Product Manager, RightFind™ XML for Mining

Copyright Clearance Center

Making Copyright Work – CCC and RightsDirect

Rightsholders Content Users

• Licensing Solutions

• Rights Management

• Content Delivery

• Copyright Education

950+ million rights from:

• Publishers

• Authors

• Agents

• Creators

• 35,000 companies

• Workers worldwide

• 1,200 colleges and universities

• Publishers and Authors

CCC and Text Mining

Rightsholders Content Users

Servicing many text mining license and content requests

Managing text mining feeds

Negotiating text mining rights with multiple publishers

“Text mining” is the process of deriving high-quality information from text materials using software.

Text Mining Non-Patent Literature

• Mining limited to abstracts

• High cost to obtain formatted full-text content and permission from multiple publishers

• Multiple formats

• Researchers can’t mine content to which they are not subscribed

What is the Benefit of Full Text?

Volume Timeliness Quality

Catherine Blake. “Beyond genes, proteins, and abstracts: Identifying scientific claims from full-text biomedical articles.” Journal of Biomedical Informatics Volume 43, Issue 2, April 2010, Pages 173–189

Elsevier (2015). “Harnessing the Power of Content - Extracting value from scientific literature: the power of mining full-text articles for pathway analysis.” Available at www.elsevier.com/__data/assets/pdf_file/0016/83005/R_D-Solutions_Harnessing-Power-of-Content_DIGITAL.pdf

Enrique Bernal-Delgado and Elliot S Fisher. “Abstracts in high profile journals often fail to report harm.” BMC Medical Research Methodology (2008); 8:14

Volume and Recall

December 2015

(Abstract: "tau hyperphosphorylation" AND Abstract: kinase OR (GSK3β OR (CDK5 OR (MAPK1 OR (MARK1 OR (MARK2 OR (MARK3 OR MARK4))))))) AND (Abstract: alzheimer OR alzheimer's)

content:"tau hyperphosphorylation kinase"~25 OR "tau hyperphosphorylation GSK3β"~25 OR "tau hyperphosphorylation CDK5"~25 OR "tau hyperphosphorylation MAPK1"~25 OR "tau hyperphosphorylation MARK1"~25 OR "tau hyperphosphorylation MARK2"~25 OR "tau hyperphosphorylation MARK3"~25 OR "tau hyperphosphorylation MARK4"~25
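The full-text query above repeats one proximity clause per kinase, which is tedious to maintain by hand. A minimal sketch of generating it, assuming a Lucene-style syntax in which "phrase"~N means the words occur within N positions of each other. The function name `build_proximity_query` and its parameters are illustrative and not part of any CCC tool; the `content` field name is taken from the slide.

```python
def build_proximity_query(anchor, terms, slop=25, field="content"):
    """Build one proximity clause per term and join the clauses with OR.

    Mirrors the shape of the slide's query: the field prefix appears once,
    at the start. Whether that prefix applies to every clause depends on
    the search engine, so treat this as a sketch.
    """
    clauses = [f'"{anchor} {term}"~{slop}' for term in terms]
    return f"{field}:" + " OR ".join(clauses)


# Kinase terms as they appear in the slide's query.
kinases = ["kinase", "GSK3β", "CDK5", "MAPK1", "MARK1", "MARK2", "MARK3", "MARK4"]
query = build_proximity_query("tau hyperphosphorylation", kinases)
print(query)
```

Adding another kinase then means appending to the list rather than editing a long query string.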

Volume and Recall - Results

[Bar chart: number of articles retrieved from abstracts vs. full text for the BTK and tau hyperphosphorylation queries; vertical axis from 0 to 800]

Text Mining Today – Example Workflow

Search → Get permission → Download PDFs → Convert PDFs → Import into text mining software

• Perform search

• Obtain permission from publishers to mine full text for commercial use

• Requires automated tool or custom software to download in bulk

• Requires text mining permission from multiple publishers

• Requires content storage and feed management

• PDF is converted to a “blob of text”

• No tags

• Loss of metadata

• Low fidelity of content

• References induce noise

• Requires structuring text into XML

• Article text does not have “fields”

• Combining content from multiple sources takes time to normalize the metadata
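To illustrate why tagged XML matters compared with a "blob of text", here is a minimal sketch that reads an article into separate fields and leaves the reference list out of the body text. The sample document and its tag names (`front`, `body`, `back`, `ref-list`) are hypothetical and only loosely modeled on common article XML; real publisher schemas differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified article XML (an assumption, not a real feed).
SAMPLE = """<article>
  <front>
    <title>Example title</title>
    <abstract>Example abstract.</abstract>
  </front>
  <body>
    <sec><title>Results</title><p>Example result sentence.</p></sec>
  </body>
  <back>
    <ref-list><ref>Example reference entry.</ref></ref-list>
  </back>
</article>"""


def extract_fields(xml_text):
    """Return title, abstract, and body text as separate fields."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    paragraphs = body.iter("p") if body is not None else []
    return {
        "title": root.findtext("front/title", default=""),
        "abstract": root.findtext("front/abstract", default=""),
        # Only paragraphs under <body> are kept, so references add no noise.
        "body": " ".join("".join(p.itertext()).strip() for p in paragraphs),
    }


fields = extract_fields(SAMPLE)
print(fields)
```

With fields like these, a query can target the abstract or the body specifically, which is not possible once a PDF has been flattened to untagged text.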

MANUAL WORK (typically takes 4-8 weeks): Search → Get permission → Download PDFs → Convert PDFs → Import into text mining software

TEXT MINING TOOLS: Run queries → View results

CCC’s RightFind™ XML for Mining Service

Build a corpus of full-text articles in XML format for mining

Text Mining Software

CCC’s Text Mining Service

XML for Mining

• Rapid inventory growth

• MEDLINE abstract corpus

• Purchase unsubscribed articles through a cost-optimization process

• MeSH article tagging and flat synonym list
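As a rough illustration of how a flat synonym list can be used at query time, the sketch below expands a term into an OR group of its variants. The `SYNONYMS` entries are hand-written examples for illustration only; they are not taken from MeSH or from CCC's service, and `expand_term` is a hypothetical helper.

```python
# Illustrative flat synonym list (an assumption, not actual MeSH data).
SYNONYMS = {
    "GSK3β": ["GSK-3 beta", "glycogen synthase kinase 3 beta"],
    "CDK5": ["cyclin-dependent kinase 5"],
}


def expand_term(term, synonyms=SYNONYMS):
    """Return the term and its known synonyms as a quoted OR group."""
    variants = [term] + synonyms.get(term, [])
    return "(" + " OR ".join(f'"{v}"' for v in variants) + ")"


print(expand_term("CDK5"))
print(expand_term("MARK4"))  # no synonyms listed, so only the term itself
```

A term with no entry in the list passes through unchanged, so the expansion can be applied to every term in a query.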

Market Observations and Future Vision

ACCESS

AUTOMATION

Thank you!

Mike Iarrobino

Product Manager, CCC

+1.978.646.2633

miarrobino@copyright.com
