two and a half approaches to natural language processing

Two and a half approaches to natural language processing in computational biology Kevin Bretonnel Cohen Biomedical Text Mining Group Lead Center for Computational Pharmacology UCHSC, Fitzsimons Campus [email protected]


Post on 01-Oct-2021


TRANSCRIPT

Page 1: Two and a half approaches to natural language processing

Two and a half approaches to natural language processing in computational biology

Kevin Bretonnel Cohen
Biomedical Text Mining Group Lead
Center for Computational Pharmacology
UCHSC, Fitzsimons Campus
[email protected]

Page 2: Two and a half approaches to natural language processing

There’s natural language processing in computational biology??

What’s it doing there?

Page 3: Two and a half approaches to natural language processing

(One lab’s) funding for NLP in computational biology

• INIA (Neuroinformatics of Alcoholism) $5M, 5 years

• Wyeth Genomics Institute ($200K, 2 years)

• National Library of Medicine ($4.2M, 3 years)

• …

Page 4: Two and a half approaches to natural language processing

Why biologists care

• High-throughput data interpretation
• Literature search
• Annotation
• Database construction

Page 5: Two and a half approaches to natural language processing

But, I’m a computer scientist (mathematician, engineer…)

• Hard, but might be possible
• Might be harder in molbio than in “General English”
• Might be more possible in molbio than in “General English”

Page 6: Two and a half approaches to natural language processing

Resources
The big drawing point for “bionlp”

• Data
    UMLS, Gene Ontology
    PubMed
    Labelled training data from “bake-offs”
• Tools
    lvg (“Lexical Variant Generator”), entity identification systems, …

Page 7: Two and a half approaches to natural language processing

$$$

Page 8: Two and a half approaches to natural language processing

Job market

• Academia: great
    US, Europe

• Industry: even better, although not molbio-specific

Page 9: Two and a half approaches to natural language processing

Overview

• Some definitions
• Specific tasks
• General issues
• Two (and a half) approaches

Page 10: Two and a half approaches to natural language processing

Natural language vs. artificial language

• 2 + 3 * 4
    == 14
    != 20

Page 11: Two and a half approaches to natural language processing

Natural language vs. artificial language

• 2 + 3 * 4
    == 14
    != 20

• fatty acids and cholesterol
    (fatty acids) and cholesterol
    fatty (acids and cholesterol)
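The contrast on this slide can be sketched in a few lines of Python (an illustration added here, not part of the original deck). The arithmetic expression has exactly one reading, fixed by the language's precedence rules; the English phrase has two grammatical bracketings, and nothing in the grammar picks between them.

```python
# An artificial language fixes one reading by rule: * binds tighter than +.
value = 2 + 3 * 4
print(value == 14)  # True
print(value == 20)  # False: (2 + 3) * 4 is not how the expression is read

# English has no such rule. Both bracketings are grammatical, so a program
# has to carry both readings until something else disambiguates.
readings = [
    (("fatty", "acids"), "and", "cholesterol"),  # "fatty" modifies only "acids"
    ("fatty", ("acids", "and", "cholesterol")),  # "fatty" modifies both nouns
]
print(len(readings))  # 2
```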

Page 12: Two and a half approaches to natural language processing

The basic applications (biologist’s-eye view)

• High-throughput data interpretation
• Literature search
• Annotation
• Database construction

Page 13: Two and a half approaches to natural language processing

The basic tasks (computer-scientist’s-eye view)

Information extraction
Entity identification
Indexing

Page 14: Two and a half approaches to natural language processing

What processing means

• High-level
    Entity identification
    Information extraction
    Information retrieval
• Lower-level
    Tokenization
    Part-of-speech tagging
    Syntactic analysis

These are prerequisites to (most of) the others…

Page 15: Two and a half approaches to natural language processing

Information extraction

Given the template:

BINDING_EVENT
    Binder:
    Bound:

Page 16: Two and a half approaches to natural language processing

Information extraction

…and the input:

Met28 binds to DNA.

Page 17: Two and a half approaches to natural language processing

Information extraction

…return:

BINDING_EVENT

    Binder: Met28
    Bound: DNA
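A minimal sketch of this template-filling step, in Python. The `BindingEvent` class, the `BINDS_TO` pattern and the function name are all hypothetical, introduced only for illustration; the pattern covers just the active form "X binds to Y" and nothing else.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    """The BINDING_EVENT template with its two slots."""
    binder: str
    bound: str


# Hypothetical pattern: handles only the active form "<X> binds to <Y>".
BINDS_TO = re.compile(r"(?P<binder>\w[\w-]*) binds to (?P<bound>\w[\w-]*)")


def extract_binding_event(sentence: str) -> Optional[BindingEvent]:
    """Fill the template from one sentence, or return None if nothing matches."""
    match = BINDS_TO.search(sentence)
    if match is None:
        return None
    return BindingEvent(match.group("binder"), match.group("bound"))


print(extract_binding_event("Met28 binds to DNA."))
# BindingEvent(binder='Met28', bound='DNA')
print(extract_binding_event("binding of Met28 to DNA"))
# None
```

The second call returns None: a single surface pattern misses even a close paraphrase of the same fact.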

Page 18: Two and a half approaches to natural language processing

Second thing that makes it hard: morphosyntactic variability

Met28 binds to DNA
…binding of Met28 to DNA…
…Met28 and DNA bind…
…binding between Met28 and DNA…
…Met28 is sufficient to bind DNA…
…DNA bound by Met28…

Page 19: Two and a half approaches to natural language processing

Second thing that makes it hard: morphosyntactic variability

…binding of Met28 to DNA…
…binding under unspecified conditions of Met28 to DNA…
…binding of this translational variant of Met28 to DNA…
…binding of Met28 to upstream regions of DNA…

Page 20: Two and a half approaches to natural language processing

Second thing that makes it hard: morphosyntactic variability

…binding under unspecified conditions of this translational variant of Met28 to upstream regions of DNA…

Page 21: Two and a half approaches to natural language processing

Even before that, though…

HSP60
Hsp-60
heat shock protein 60
Cerberus
wingless
Ken and Barbie
the

Page 22: Two and a half approaches to natural language processing

Even before that, though…

HSP60
Hsp-60
heat shock protein 60
Cerberus
wingless
Ken and Barbie
the

Entity identification

Page 23: Two and a half approaches to natural language processing

How they connect

• Entity identification

• Information extraction

• Indexing

• Literature search

• Annotation

• Database construction

Page 24: Two and a half approaches to natural language processing

…and none of those are what we REALLY want…

• “text data mining:” finding knowledge that isn’t explicitly stated

    Question-answering
    “Missing links”
    Summarization
    Inference

Page 25: Two and a half approaches to natural language processing

Why is NLP hard?

Page 26: Two and a half approaches to natural language processing

Why is NLP hard?

• 2 + 3 * 4
    == 14
    != 20

• fatty acids and cholesterol
    (fatty acids) and cholesterol
    fatty (acids and cholesterol)

Page 27: Two and a half approaches to natural language processing

Types of ambiguity

• Part of speech: Is this a noun, a verb, an adjective…

• Lexical: which word is this?
• Structural: what’s the relationship between these words/phrases?
• Semantic, pragmatic, discourse-level…

Page 28: Two and a half approaches to natural language processing

Part-of-speech ambiguity

Page 29: Two and a half approaches to natural language processing

• adjective • noun

Page 30: Two and a half approaches to natural language processing

• heat shock protein 60
    heat/VERB shock/VERB protein/NOUN
    heat/NOUN shock/NOUN protein/NOUN

Page 31: Two and a half approaches to natural language processing

Lexical ambiguity

Page 32: Two and a half approaches to natural language processing

• Flying mammal • Sporting equipment

Page 33: Two and a half approaches to natural language processing

Lexical ambiguity

Page 34: Two and a half approaches to natural language processing

Lexical ambiguity
a verb with 14 meanings

• 7 a : to assemble and set alight the materials for (a fire) b : to set in order <make beds>

• 3 a : to bring into being by forming, shaping, or altering material : FASHION <make a dress>

Page 35: Two and a half approaches to natural language processing

• hunk
    human natural killer (cell type)
    HUN kinase (gene/protein)
    radiological/orthopedic classification scheme
    piece of something

Page 36: Two and a half approaches to natural language processing

A reminder about what syntactic structure is

• Relations between phrases (groups of words)

I dislike stupid people like you I find them boring

Page 37: Two and a half approaches to natural language processing

A reminder about what syntactic structure is

• Relations between phrases (groups of words)

[I dislike stupid people][like you I find them boring]

Page 38: Two and a half approaches to natural language processing

A reminder about what syntactic structure is

• Relations between phrases (groups of words)

[I dislike stupid people like you][I find them boring]

Page 39: Two and a half approaches to natural language processing

A reminder about what syntactic structure is

• Relations between phrases (groups of words)

I saw the man with the binoculars

Page 40: Two and a half approaches to natural language processing

A reminder about what syntactic structure is

• Relations between phrases (groups of words)

I saw the man with the binoculars

Paraphrase: using the binoculars, I saw the man

Page 41: Two and a half approaches to natural language processing

A reminder about what syntactic structure is

• Relations between phrases (groups of words)

I saw the man with the binoculars

Paraphrase: using the binoculars, I saw the man

Page 42: Two and a half approaches to natural language processing

A reminder about what syntactic structure is

• Relations between phrases (groups of words)

I saw the man with the binoculars

Paraphrase: the man that I saw had some binoculars

Page 43: Two and a half approaches to natural language processing

A reminder about what syntactic structure is

• Relations between phrases (groups of words)

I saw the man with the binoculars

Paraphrase: the man that I saw had some binoculars

Page 44: Two and a half approaches to natural language processing

Structural ambiguity

Page 45: Two and a half approaches to natural language processing

Structural ambiguity

• Clerk interprets:
    Verb (blouse) (in window)
• She means:
    Verb (blouse (in window))

Page 46: Two and a half approaches to natural language processing

Structural ambiguity

Page 47: Two and a half approaches to natural language processing

Structural ambiguity

• Sly dog means:
    Verb (in (house (on bed)))
• She interprets:
    Verb (in house) (on bed)

Page 48: Two and a half approaches to natural language processing

• NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates.

Page 49: Two and a half approaches to natural language processing

• NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates.

• (liver), (testis) and (brain in rat)
• liver, (testis and brain in rat)
• (liver, testis and brain in rat)

Page 50: Two and a half approaches to natural language processing

• NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates.

• shows preference for (citrate over dicarboxylates)

• shows preference (for citrate) (over dicarboxylates)

Page 51: Two and a half approaches to natural language processing

Other syntactic ambiguity (ellipsis)

Page 52: Two and a half approaches to natural language processing

Two approaches to NLP

Rule-based Statistical/machine learning

Page 53: Two and a half approaches to natural language processing

First approach to NLP

• Rule-based
• AI, linguistics
• Patterns (regular, context-free…)
• Procedures

Page 54: Two and a half approaches to natural language processing

Rule-based: regex

• Patterns (regular, context-free, …)

• Procedures

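# Letters, an optional hyphen, then one or more digits, e.g. Met28 or Hsp-60
$geneName = "[A-Za-z]+-?[0-9]+";

# Only read $1 and $2 if the pattern actually matched
if ($input =~ /interaction of ($geneName) with ($geneName)/) {
    $interactionAssertion->setGene1($1);
    $interactionAssertion->setGene2($2);
}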

Page 55: Two and a half approaches to natural language processing

Rule-based: CFGs

• Patterns (regular, context-free, …)

• Procedures

NounPhrase -> NounPhrase+ Conjunction NounPhrase

NounPhrase -> Predeterminer Determiner+ Adjective+ Noun

Page 56: Two and a half approaches to natural language processing

Rule-based: procedural

• Patterns (regular, context-free, …)

• Procedures

if (currentWordEndsWith-ing) {

if (previousWordIsThe) {

if (nextWordIsOf) {

Page 57: Two and a half approaches to natural language processing

Rule-based approaches
Why they work

• Patterns are real
    Psychologically
    Formally adequate (mostly)
• Intuition works
• No need for training data

Page 58: Two and a half approaches to natural language processing

Rule-based approaches
Why they’re hard

• Knowledge takes time to get
• Process of developing large rule sets can be slow
    Consider English syntax…

Page 59: Two and a half approaches to natural language processing

Second approach to NLP

• Mosteller & Wallace
• Bayesian
• Other machine learning techniques

Page 60: Two and a half approaches to natural language processing

Statistical approaches

Bayesian statistics: Why this is funny

Page 61: Two and a half approaches to natural language processing

Statistical/ML approaches

• Frame the NLP task as a series of classification problems

    Which POS is this?
    Which word meaning?
    Which phrasal grouping?

Page 62: Two and a half approaches to natural language processing

Specifics of statistical approaches

• Training data: 100 examples of “bat”
• 95 times, meaning is “sporting equipment”
• 5 times, meaning is “mammal”

Page 63: Two and a half approaches to natural language processing

Specifics of statistical approaches

• You see “bat:” what does it probably mean?

Page 64: Two and a half approaches to natural language processing

Specifics of statistical approaches

• 95 examples mean “sporting implement”

• 95 times, word “ball” also appears

Page 65: Two and a half approaches to natural language processing

Specifics of statistical approaches

• 5 examples mean “mammal”
• Each time, word “wizard” also appears

Page 66: Two and a half approaches to natural language processing

Specifics of statistical approaches

• You see the word “bat,” and you also see the word “wizard…”
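The “bat” example can be sketched as plain counting in Python. This is a toy built only from the numbers on the slides (95 “sporting equipment” examples with “ball”, 5 “mammal” examples with “wizard”); the function and variable names are hypothetical, and this is simple voting over co-occurrence counts rather than a full Bayesian classifier.

```python
from collections import Counter, defaultdict

# Toy training set using the counts from the slides.
training = [("sporting equipment", {"ball"})] * 95 + [("mammal", {"wizard"})] * 5

# How often each sense occurs overall (the prior).
sense_counts = Counter(sense for sense, _ in training)

# How often each sense occurs together with each context word.
context_counts = defaultdict(Counter)
for sense, context in training:
    for word in context:
        context_counts[word][sense] += 1


def most_likely_sense(context_words):
    """Pick the sense seen most often with the context words; with no
    informative context, fall back to the most frequent sense overall."""
    votes = Counter()
    for word in context_words:
        votes.update(context_counts.get(word, {}))
    source = votes if votes else sense_counts
    return source.most_common(1)[0][0]


print(most_likely_sense(set()))       # sporting equipment
print(most_likely_sense({"wizard"}))  # mammal
```

With no context the prior wins (95 to 5); a single informative neighbour such as “wizard” overturns it.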

Page 67: Two and a half approaches to natural language processing

Statistical approaches
Why they work

• Statistics can be proxy for knowledge
• Some interesting stuff is frequent enough to be tractable

Page 68: Two and a half approaches to natural language processing

Statistical approaches
Why they’re hard

• Problem: sparse data

[Figure: plot with axes Frequency and Rank]

Page 69: Two and a half approaches to natural language processing

Statistical approaches
Why they’re hard

• Solutions: smoothing, back-off
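One simple smoothing method, add-one (Laplace) smoothing, can be sketched as follows. The slide does not say which smoothing method is meant, so this is an assumed example; back-off is not shown. The vocabulary and the unseen word “cave” are hypothetical.

```python
from collections import Counter


def add_one_probability(word, counts, vocabulary_size):
    """Laplace (add-one) estimate: every vocabulary word, seen or unseen,
    is treated as if it had been observed one extra time."""
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocabulary_size)


# Observed context-word counts; "cave" was never seen in training.
counts = Counter({"ball": 95, "wizard": 5})
vocabulary_size = 3  # assumed vocabulary: ball, wizard, cave

for word in ("ball", "wizard", "cave"):
    print(word, add_one_probability(word, counts, vocabulary_size))
```

The unseen word gets a small non-zero probability (1/103) instead of zero, at the cost of shaving a little mass from the words that were seen.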

Page 70: Two and a half approaches to natural language processing

Statistical approaches
Why they’re hard

• Problem: labelled training data is expensive

Page 71: Two and a half approaches to natural language processing

Statistical approaches
Why they’re hard

• Solutions:
    spend money
    figure out how to use other people’s “weakly labelled” data

Page 72: Two and a half approaches to natural language processing

Rule-based or statistical: what to do??

Page 73: Two and a half approaches to natural language processing

Rule-based vs. statistical approaches

• Picking one:
    Is it cheaper to label more training data, or to put time into developing patterns?

Page 74: Two and a half approaches to natural language processing

Rule-based vs. statistical approaches

• Combine them:
    Do both together/iteratively
    Statistical solution first, then rule-based post-processing

the 2.5th approach

Page 75: Two and a half approaches to natural language processing

Rule-based vs. statistical approaches

• Researcher’s answer:
    Use one as the baseline for the other

Page 76: Two and a half approaches to natural language processing

Rule-based vs. statistical approaches

Domain specificity makes both of them more tractable

Page 77: Two and a half approaches to natural language processing

Rule-based vs. statistical approaches

Sublanguage model

Page 78: Two and a half approaches to natural language processing

Two extended examples

Page 79: Two and a half approaches to natural language processing

POS tagging: why you need it

• All syntax is built on it
• Overcome sparseness problem by abstracting away from specific words
• Potential basis for entity identification

Page 80: Two and a half approaches to natural language processing

What “POS tagging” is

• POS: part of speech
• School: 8 (noun, verb, adjective, interjection…)
• Real life: 40 or more

Page 81: Two and a half approaches to natural language processing

How do you get from 8 to 80?

• Noun
    • NN (noun, singular or mass)
    • NNS (plural noun)
    • NNP (proper noun)
    • NNPS (plural proper noun)

Page 82: Two and a half approaches to natural language processing

How do you get from 8 to 80?

• Verb
    • VB (base form)
    • VBD (past tense)
    • VBG (gerund)
    • VBN (past participle)
    • VBP (singular present-tense non-3rd-person)
    • VBZ (3rd-person singular present tense)

Page 83: Two and a half approaches to natural language processing

How do you define noun, verb, etc.?

• Semantic:
    “A noun is a person, place, or thing…”
    “A verb is…”

• Distributional characteristics:
    “A noun can take the plural and genitive morphemes”
    “A noun can appear in the environment All of my twelve hairy ___ left before noon”

Page 84: Two and a half approaches to natural language processing

POS tagging defined (or, why’s it interesting?)

• Given:
    Time flies like an arrow, but fruit flies like a banana.

• Do:
    Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.

Page 85: Two and a half approaches to natural language processing

A statistical approach: TnT

• Second-order Markov model
• Smoothing by linear interpolation of ngrams
• λ estimated by deleted interpolation
• Tag probabilities learned for word endings; used for unknown words
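Smoothing by linear interpolation can be sketched as below: the trigram tag probability is a weighted mix of unigram, bigram and trigram estimates, with weights that sum to 1. The probability tables and the λ values here are hypothetical and set by hand; in TnT the weights are estimated by deleted interpolation, which this sketch does not implement.

```python
def interpolated_trigram_probability(t1, t2, t3, unigram, bigram, trigram, lambdas):
    """P(t3 | t1, t2) as a weighted mix of unigram, bigram and trigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    return (l1 * unigram.get(t3, 0.0)
            + l2 * bigram.get((t2, t3), 0.0)
            + l3 * trigram.get((t1, t2, t3), 0.0))


# Hypothetical maximum-likelihood estimates, for illustration only.
unigram = {"NOUN": 0.3}           # P(NOUN)
bigram = {("DET", "NOUN"): 0.6}   # P(NOUN | DET)
trigram = {}                      # (PREPOSITION, DET, NOUN) never seen in training

p = interpolated_trigram_probability(
    "PREPOSITION", "DET", "NOUN", unigram, bigram, trigram, (0.2, 0.3, 0.5)
)
print(p)
```

Even though the trigram was never observed, the estimate is non-zero (0.2 × 0.3 + 0.3 × 0.6, about 0.24) because the lower-order estimates still contribute.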

Page 86: Two and a half approaches to natural language processing

TnT

• Ngram: an n-tag or n-word sequence
• N = 1
    DET
    NOUN
    role
• Bigrams
    DET NOUN
    NOUN PREPOSITION
    a role
• Trigrams

Page 87: Two and a half approaches to natural language processing

The Brill Tagger

Page 88: Two and a half approaches to natural language processing

The Brill tagger

• Iterative error reduction
    1. Assign most common tags, then
    2. Evaluate performance, then

Page 89: Two and a half approaches to natural language processing

The Brill tagger

• Iterative error reduction
    1. Assign most common tags, then
    2. Evaluate performance, then
    3. Propose rules to fix errors
    4. Evaluate performance, then
    5. If you’ve improved, GOTO 3, else END

Page 90: Two and a half approaches to natural language processing

The Brill tagger

• Change Determiner Verb “of”
• …to…
• Determiner Noun “of”

The/Determiner running/Verb of/IN

The/Determiner running/Noun of/IN
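Applying one such transformation rule can be sketched as follows. The function name and rule representation are hypothetical; this shows only rule application, not the learning loop that proposes and scores rules.

```python
def apply_rule(tagged, from_tag, to_tag, previous_tag, next_word):
    """Retag from_tag as to_tag wherever the previous token's tag and the
    next token's word both match. Conditions are checked on the input tags."""
    result = list(tagged)
    for i in range(1, len(tagged) - 1):
        word, tag = tagged[i]
        if (tag == from_tag
                and tagged[i - 1][1] == previous_tag
                and tagged[i + 1][0] == next_word):
            result[i] = (word, to_tag)
    return result


before = [("The", "Determiner"), ("running", "Verb"), ("of", "IN")]
after = apply_rule(before, "Verb", "Noun", "Determiner", "of")
print(after)
# [('The', 'Determiner'), ('running', 'Noun'), ('of', 'IN')]
```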

Page 91: Two and a half approaches to natural language processing

EI as POS tagging

• Retinoic acid downmodulates erythroid differentiation and GATA1 expression in purified adult-progenitor culture. (Labbaye et al. 1994)

Page 92: Two and a half approaches to natural language processing

• Retinoic/JJ acid/NN downmodulates/VBZ erythroid/JJ differentiation/NN and/CC GATA1/NN expression/NN in/IN purified/VBN adult-progenitor/JJ culture/NN ./. (Labbaye et al. 1994)

Page 93: Two and a half approaches to natural language processing

• Retinoic/JJ acid/NN downmodulates/VBZ erythroid/JJ differentiation/NN and/CC GATA1/GENE expression/NN in/IN purified/VBN adult-progenitor/JJ culture/NN ./. (Labbaye et al. 1994)

Page 94: Two and a half approaches to natural language processing

Rule-based post-processing

• 14 simple patterns for adjusting boundaries

if current_word == (gene OR mutant etc.) && previous_word_tag == GENE
then current_word_tag = GENE

if current_word =~ /digit|Roman numeral|Greek letter/ && previous_word_tag == GENE
then current_word_tag = GENE
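The two boundary-adjustment patterns shown on this slide can be sketched in Python. Only “gene” and “mutant” are named on the slide, so the keyword set is limited to those; the `SUFFIX` pattern is an assumed stand-in for “digit, Roman numeral, Greek letter”, and all names are hypothetical.

```python
import re

# The slide says "gene OR mutant etc."; only these two keywords are listed.
GENE_KEYWORDS = {"gene", "mutant"}

# Assumed stand-in for "digit | Roman numeral | Greek letter".
SUFFIX = re.compile(r"^(\d+|[IVX]+|alpha|beta|gamma|delta|kappa)$")


def extend_gene_boundaries(tagged):
    """Extend a GENE tag rightward over keywords and numeric/Greek suffixes.
    Works left to right, so an extension can cascade over several tokens."""
    result = []
    for word, tag in tagged:
        previous_tag = result[-1][1] if result else None
        if previous_tag == "GENE" and (word.lower() in GENE_KEYWORDS or SUFFIX.match(word)):
            tag = "GENE"
        result.append((word, tag))
    return result


print(extend_gene_boundaries([("GATA", "GENE"), ("1", "CD"), ("gene", "NN"), ("expression", "NN")]))
# [('GATA', 'GENE'), ('1', 'GENE'), ('gene', 'GENE'), ('expression', 'NN')]
```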

Page 95: Two and a half approaches to natural language processing

Results (BioCreative)

[Figure: “Precision and Recall: Official Score”. Precision (y-axis) against Recall (x-axis), both from 0.0 to 1.0. Legend: No post-processing, Closed Division, Open Division, High, Median, Low.]

Page 96: Two and a half approaches to natural language processing

How to get started

• Certificate program project
• BIOI 7713 project
• “Bake-offs”
    BioCreative
    TREC
• BIOI 7791, Spring quarter, HSC/Fitz
• NLP meetings: 2 p.m. Wednesdays, Room 6106, South Tower

Page 97: Two and a half approaches to natural language processing

Coursework

• CSCI 5832, Natural language processing
    Jim Martin
    Boulder campus, Spring semester
    http://www.cs.colorado.edu/~martin/csci5832.html
• Independent Study
• BIOI 7791
• BIOI 7713

Page 98: Two and a half approaches to natural language processing

Miscellaneous extra slides for answering questions, etc.

Page 99: Two and a half approaches to natural language processing

Books

Page 100: Two and a half approaches to natural language processing

How to read a paper in “BioNLP”

• Evaluation set
    Exclusions

Page 101: Two and a half approaches to natural language processing

Evaluation

• P/R/F-measure
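For reference, precision, recall and the balanced F-measure can be computed from raw counts as in this sketch. The counts in the example are made up for illustration and are not results from the talk.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Return (precision, recall, balanced F-measure) from raw counts.
    Any ratio with a zero denominator is reported as 0.0."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 80 correct gene mentions, 20 spurious, 40 missed.
p, r, f = precision_recall_f(80, 20, 40)
print(round(p, 3), round(r, 3), round(f, 3))
# 0.8 0.667 0.727
```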

Page 102: Two and a half approaches to natural language processing

The other application of NLP in computational biology

• Sequence data
    How do you know a gene/promoter/whatever when you see one?
    Grammar-based approaches (David Searls)
    HMM’s

Page 103: Two and a half approaches to natural language processing
Page 104: Two and a half approaches to natural language processing

<scanned picture of business card>

Page 105: Two and a half approaches to natural language processing

<happy-face photo>

Page 106: Two and a half approaches to natural language processing

One year later…

Page 107: Two and a half approaches to natural language processing

A sad story: physicians don’t buy a lot of NLP software

Page 108: Two and a half approaches to natural language processing

Another sad story: trying to sell “gisting” to physicians

Page 109: Two and a half approaches to natural language processing

Sold for $400K: 14.5 or 2.9¢ on the dollar…

Page 110: Two and a half approaches to natural language processing

Salesperson’s thought process

You have a problem

I can solve it for you

Page 111: Two and a half approaches to natural language processing

Physician’s thought process

Have I been sued over how I do this?

I have a problem

I don’t have a problem

Yes / No

Can you solve it for me?

Page 112: Two and a half approaches to natural language processing

Cooccurrence

Page 113: Two and a half approaches to natural language processing

Cooccurrence

Page 114: Two and a half approaches to natural language processing

Pleonastic vs. referential ambiguity

Page 115: Two and a half approaches to natural language processing

Edit distance
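The slide gives only the title, so the following is an assumed illustration: the standard Levenshtein edit distance (insertions, deletions and substitutions each cost 1), applied to two gene-name variants that appear earlier in the deck.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(edit_distance("HSP60", "Hsp-60"))                  # 3
print(edit_distance("HSP60".lower(), "Hsp-60".lower()))  # 1
```

Case-folding before comparison drops the distance from 3 to 1, leaving only the hyphen.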