Page 1: Flexible Text Mining using Interactive Information Extraction David Milward david.milward@linguamatics.com

Flexible Text Mining using Interactive Information Extraction

David Milward david.milward@linguamatics.com

Text mining vs. Data Mining

• Text mining

– getting nuggets of information from text

– extracting relationships

– structured results to feed into data mining, visualisation or databases

company    activity    company
Sanofi     bid         Aventis
Roche      partner     Antisoma

• Data mining

– getting new knowledge from databases

– suggesting new relationships, trends, patterns

Text Data Mining

• Emphasizes finding new knowledge from text

• Typically knowledge that is implicit within multiple documents

What is the relationship to IR?

• IR finds the most relevant documents

• Text mining finds information from within documents, or across documents

– What drugs are used for psoriasis treatment?

– Who is associated, directly or indirectly, with the Board of Exxon?

• There is overlap …

– we often search to answer a question, not to find a document

Traditional Information Extraction

• Uses natural language processing to distinguish

– Sanofi bid for Aventis

– Aventis bid for Sanofi

• Provides structured results for easy review and analysis

• Uses normalised terminology to allow integration with databases, e.g.

– Preferred term: Sanofi

– Synonyms: Sanofi Pasteur, Sanofi Synthelabo, Sanofi Synthélabo …

• But:

– typically limited to patterns within a single sentence

– constructing, testing and running queries can take days

• Appropriate if you always have the same question, e.g. you want to run it over a newsfeed every night

company    activity    company
Sanofi     bid         Aventis
Roche      partner     Antisoma
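
The two ideas on this slide, keeping the direction of a relationship and mapping synonyms to a preferred term, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SYNONYMS` table, the `BID_PATTERN` regular expression and the `extract_bids` helper are hypothetical names, and a real information extraction system would rely on linguistic analysis rather than a single regular expression.

```python
import re

# Hypothetical synonym table: surface form (lowercased) -> preferred term.
SYNONYMS = {
    "sanofi": "Sanofi",
    "sanofi pasteur": "Sanofi",
    "sanofi synthelabo": "Sanofi",
    "sanofi synthélabo": "Sanofi",
    "aventis": "Aventis",
}

# Toy pattern: "<company> bid for <company>". Word order is preserved,
# so the bidder and the target are never swapped.
BID_PATTERN = re.compile(r"(?P<bidder>[\w\- ]+?) bid for (?P<target>[\w\- ]+)")


def normalise(name):
    """Map a surface name to its preferred term, or return it unchanged."""
    cleaned = name.strip()
    return SYNONYMS.get(cleaned.lower(), cleaned)


def extract_bids(sentence):
    """Return (bidder, 'bid', target) rows found in one sentence."""
    rows = []
    for match in BID_PATTERN.finditer(sentence):
        rows.append((normalise(match["bidder"]), "bid", normalise(match["target"])))
    return rows
```

Calling `extract_bids("Sanofi Synthelabo bid for Aventis")` and `extract_bids("Aventis bid for Sanofi Pasteur")` yields rows with the same two normalised company names but in opposite roles, which is the distinction a plain keyword search cannot make.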

I2E: Interactive Information Extraction

• A new concept

• Encompasses

– keywords → documents

– patterns → relationships (structured output)

• Queries ranging from:

– General Motors

– General Motors & acquisition in the same document

– Automotive companies & acquisitions in the same sentence

– What companies is General Motors associated with?

• Not limited to patterns within sentences, e.g.

– Merger and acquisition activity in documents mentioning Japan

• Fast, scalable, versatile

[Diagram: I2E, with boxes labelled Information Extraction, NLP, Taxonomies/Ontologies, Text Search and Structured Output]
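
The range of queries listed on this slide, from a plain keyword through document-level co-occurrence to sentence-level co-occurrence, can be illustrated with simple string matching. The sketch below is not I2E's query language: the function names, the naive sentence splitter and the three example documents are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def contains(text, term):
    """Case-insensitive substring test."""
    return term.lower() in text.lower()


def keyword_hits(docs, term):
    """Keyword query: ids of documents mentioning the term."""
    return [doc_id for doc_id, text in docs.items() if contains(text, term)]


def document_cooccurrence(docs, term_a, term_b):
    """Ids of documents mentioning both terms anywhere in the text."""
    return [doc_id for doc_id, text in docs.items()
            if contains(text, term_a) and contains(text, term_b)]


def sentence_cooccurrence(docs, term_a, term_b):
    """(doc id, sentence) pairs where both terms share one sentence."""
    return [(doc_id, sentence)
            for doc_id, text in docs.items()
            for sentence in split_sentences(text)
            if contains(sentence, term_a) and contains(sentence, term_b)]


# Made-up example documents.
DOCS = {
    "d1": "General Motors announced an acquisition of a parts supplier.",
    "d2": "General Motors reported earnings. A rival completed an acquisition.",
    "d3": "Japan saw little merger activity.",
}
```

With these documents, the document-level query for "General Motors" and "acquisition" returns both `d1` and `d2`, while the sentence-level query keeps only `d1`. Tightening the scope trades recall for precision, which is why being able to move between the two matters.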

Linguistic Processing

Example sentences:

We find that p42mapk phosphorylates c-Myb on serine and threonine.

Purified recombinant p42 MAPK was found to phosphorylate Wee1.

• Groups words into meaningful units

• Morphology allows search for different forms of words

[Figure annotations on the example sentences: morphology (different forms); noun phrases (match entities); verb groups (match actions)]
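
As a rough illustration of how morphology lets one query match "phosphorylates" and "phosphorylate" alike, the Python sketch below reduces a few verb endings to a shared base form before comparing. The suffix rules and the `base_form` and `sentences_with_verb` helpers are assumptions chosen to fit these example sentences; a real system would use a proper morphological analyser.

```python
import re


def base_form(word):
    """Very crude normaliser for a few English verb endings (illustrative only)."""
    w = word.lower()
    for suffix, replacement in (("ated", "ate"), ("ating", "ate"), ("ates", "ate")):
        if w.endswith(suffix):
            return w[: -len(suffix)] + replacement
    return w


def sentences_with_verb(sentences, verb):
    """Return the sentences that contain any recognised form of the verb."""
    target = base_form(verb)
    hits = []
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z0-9\-]+", sentence)
        if any(base_form(tok) == target for tok in tokens):
            hits.append(sentence)
    return hits


SENTENCES = [
    "We find that p42mapk phosphorylates c-Myb on serine and threonine.",
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1.",
    "Wee1 was purified from yeast.",
]
```

Searching `SENTENCES` for "phosphorylated" returns the first two sentences even though neither contains that exact word form.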

Monitoring Merger and Acquisition Activity

Company Positions

Using I2E in the Life Sciences

• Good resources

– Scientific abstracts are readily available in XML

– Large number of existing taxonomies/terminologies

• Very large scale

– 16 million abstracts relevant to life sciences. Growing ???? a year

– Large numbers of internal reports and full-text articles

– Internal documents often > 1000 pages, may be PDF images

– Taxonomies/terminologies are large, often deeply structured, e.g. 350K nodes, ??? synonyms

– Still need to augment terminology for specific areas

• Relatively large scale

– 17 million abstracts

– Large numbers of internal reports and full-text articles

– Internal documents can be > 1000 pages, may be PDF images

– Taxonomies/terminologies are large, often deeply structured: > 100K concepts, > 400K synonyms

– Still need to augment terminology for specific areas

Examples of Pharma Questions

• R&D

– Which proteins interact with metabolite X?

– What are the reaction kinetics for canonical pathway Y?

– What attributes are common to sets of biomarker genes?

– What are the known associations between expressed genes and environmental factors?

– What dosages of compound B cause adverse reactions?

• Competitive Intelligence

– Which companies are working on technology C?

– What compounds are available for in-licensing in a disease area?

– Which research groups are my competitors collaborating with?

Linking Drugs to Adverse Events

Measurements

• Extraction of numerical parameters, e.g. amounts, dosages, concentrations
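
A minimal sketch of what extracting numerical parameters can look like, assuming a number followed by a unit from a small fixed list. The `UNITS` list, the `MEASUREMENT` pattern and the `extract_measurements` helper are hypothetical and deliberately simple; real dosage and concentration expressions (ranges, "per day", scientific notation) need considerably more than this.

```python
import re

# Assumed, deliberately short unit list; a real terminology would be far larger.
UNITS = ["mg/kg", "mg", "g", "ml", "nM", "uM", "mM"]

# Try longer units first so that "mg/kg" is not cut short to "mg".
_unit_alternatives = "|".join(
    re.escape(u) for u in sorted(UNITS, key=len, reverse=True)
)
MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>" + _unit_alternatives + r")\b"
)


def extract_measurements(text):
    """Return (value, unit) pairs, with the value converted to float."""
    return [(float(m["value"]), m["unit"]) for m in MEASUREMENT.finditer(text)]
```

Each pair can then be stored as a structured row alongside the entity it describes, so that values can be filtered or compared numerically rather than read one sentence at a time.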

Benefits of Flexible Text Mining

• The ideal final query may use

– co-occurrence of terms within a document or sentence

– a precise linguistic pattern

– a mixture of both

• It depends on

– the nature of the task

– the availability of terminologies

– the kind of documents (news vs. science, abstract vs. full text)

– the time available to check results

• Flexibility to mix different techniques is also critical for fast development of queries

– e.g. start with broad queries to explore the “results space”, then home in

I2E: Better Results, Faster

• Fast query creation

• Fast return of results

• Fast review and analysis

[Chart: relation types (activate, bind, block, co-express, inactivate, induce, inhibit, interact, mediate, phosphorylate, regulate, suppress) plotted for the genes BCL2, CDKN1A, DMPK, EPHB2, INS, MAP2K1, MAPK1, MAPK3, MAPK7, RB1, STK3 and VIM; axis scale 0 to 10; legend "[c] Reln"]

Impact of I2E

• Significant reduction in time spent searching/reading the literature

– weeks reduced to days or hours

• Structure the unstructured to

– provide systematic and comprehensive review of information content

– enable integration with traditional structured data

– allow complex analysis of literature-derived information

– generate hypotheses, gain insight