two and a half approaches to natural language processing
![Page 1: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/1.jpg)
Two and a half approaches to natural language
processing in computational biology
Kevin Bretonnel Cohen
Biomedical Text Mining Group Lead
Center for Computational Pharmacology
UCHSC, Fitzsimons
[email protected]
![Page 2: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/2.jpg)
There’s natural language processing in computational
biology??
What’s it doing there?
![Page 3: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/3.jpg)
(One lab’s) funding for NLP in computational biology
• INIA (Neuroinformatics of Alcoholism) ($5M, 5 years)
• Wyeth Genomics Institute ($200K, 2 years)
• National Library of Medicine ($4.2M, 3 years)
• …
![Page 4: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/4.jpg)
Why biologists care
• High-throughput data interpretation
• Literature search
• Annotation
• Database construction
![Page 5: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/5.jpg)
But, I’m a computer scientist (mathematician, engineer…)
• Hard, but might be possible
• Might be harder in molbio than in “General English”
• Might be more possible in molbio than in “General English”
![Page 6: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/6.jpg)
Resources: the big drawing point for “bionlp”
• Data
  UMLS, Gene Ontology
  PubMed
  Labelled training data from “bake-offs”
• Tools
  lvg (“Lexical Variant Generator”), entity identification systems, …
![Page 7: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/7.jpg)
$$$
![Page 8: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/8.jpg)
Job market
• Academia: great
  US, Europe
• Industry: even better, although not molbio-specific
![Page 9: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/9.jpg)
Overview
• Some definitions
• Specific tasks
• General issues
• Two (and a half) approaches
![Page 10: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/10.jpg)
Natural language vs. artificial language
• 2 + 3 * 4
  == 14
  != 20
![Page 11: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/11.jpg)
Natural language vs. artificial language
• 2 + 3 * 4
  == 14
  != 20
• fatty acids and cholesterol
  (fatty acids) and cholesterol
  fatty (acids and cholesterol)
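A minimal Python sketch of the contrast on this slide. The arithmetic expression has one value because the language fixes operator precedence; the noun phrase has two plausible groupings, and the grouping decides what "fatty" modifies. The tuple encoding of the two readings and the helper names are assumptions made for illustration only.

```python
# Artificial language: precedence rules leave exactly one reading.
assert 2 + 3 * 4 == 14
assert 2 + 3 * 4 != 20

# Natural language: two hand-built groupings of the same three content words.
# Each tuple is (head, child, child, ...); this encoding is illustrative.
READING_A = ("and", ("fatty", "acids"), "cholesterol")   # (fatty acids) and cholesterol
READING_B = ("fatty", ("and", "acids", "cholesterol"))   # fatty (acids and cholesterol)


def leaves(nodes):
    """Yield the plain words under a list of children, skipping tuple heads."""
    for node in nodes:
        if isinstance(node, tuple):
            yield from leaves(node[1:])
        else:
            yield node


def nouns_in_scope_of_fatty(tree):
    """Return the set of nouns that 'fatty' modifies under one grouping."""
    head, *children = tree
    if head == "fatty":
        return set(leaves(children))
    found = set()
    for child in children:
        if isinstance(child, tuple):
            found |= nouns_in_scope_of_fatty(child)
    return found
```

Under the first reading only the acids are fatty; under the second, the cholesterol is too. Nothing in the word string itself says which reading is intended.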
![Page 12: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/12.jpg)
The basic applications (biologist’s-eye view)
• High-throughput data interpretation
• Literature search
• Annotation
• Database construction
![Page 13: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/13.jpg)
The basic tasks (computer-scientist’s-eye view)
Information extraction
Entity identification
Indexing
![Page 14: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/14.jpg)
What processing means
• High-level
  Entity identification
  Information extraction
  Information retrieval
• Lower-level
  Tokenization
  Part-of-speech tagging
  Syntactic analysis
These are prerequisites to (most of) the others…
![Page 15: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/15.jpg)
Information extraction
Given the template:
BINDING_EVENT
Binder:
Bound:
![Page 16: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/16.jpg)
Information extraction
…and the input:
Met28 binds to DNA.
![Page 17: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/17.jpg)
Information extraction
…return:
BINDING_EVENT
Binder: Met28
Bound: DNA
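The template-filling step can be sketched in Python as follows. The `BindingEvent` class mirrors the slide's template; the regular expression and the function name are hypothetical and cover only the single phrasing "X binds to Y".

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    """The BINDING_EVENT template: one slot per participant."""
    binder: str
    bound: str


# Hypothetical pattern: handles only the literal phrasing "X binds to Y".
BINDS_TO = re.compile(r"(?P<binder>\w+) binds to (?P<bound>\w+)")


def extract_binding_event(sentence: str) -> Optional[BindingEvent]:
    """Fill the template from a sentence, or return None if nothing matches."""
    match = BINDS_TO.search(sentence)
    if match is None:
        return None
    return BindingEvent(binder=match.group("binder"), bound=match.group("bound"))


event = extract_binding_event("Met28 binds to DNA.")
```

Any sentence that states the same fact in different words returns `None`, which is the problem the next slides take up.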
![Page 18: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/18.jpg)
Second thing that makes it hard: morphosyntactic variability
Met28 binds to DNA…
…binding of Met28 to DNA…
…Met28 and DNA bind…
…binding between Met28 and DNA…
…Met28 is sufficient to bind DNA…
…DNA bound by Met28…
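One way to see the cost of this variability is to write one pattern per phrasing. The sketch below does that in Python for the six variants listed; the pattern list, the `ENTITY` expression, and the function name are all assumptions for illustration, and for symmetric phrasings such as "X and Y bind" the assignment of roles is arbitrary.

```python
import re

ENTITY = r"[A-Za-z0-9]+"  # crude stand-in for a real entity recognizer

# Each entry: (pattern, group number of the binder, group number of the bound).
VARIANT_PATTERNS = [
    (re.compile(rf"({ENTITY}) binds to ({ENTITY})"), 1, 2),
    (re.compile(rf"binding of ({ENTITY}) to ({ENTITY})"), 1, 2),
    (re.compile(rf"({ENTITY}) and ({ENTITY}) bind\b"), 1, 2),
    (re.compile(rf"binding between ({ENTITY}) and ({ENTITY})"), 1, 2),
    (re.compile(rf"({ENTITY}) is sufficient to bind ({ENTITY})"), 1, 2),
    (re.compile(rf"({ENTITY}) bound by ({ENTITY})"), 2, 1),  # passive: roles swap
]


def extract_binding(text):
    """Return (binder, bound) from the first matching pattern, else None."""
    for pattern, binder_group, bound_group in VARIANT_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(binder_group), match.group(bound_group)
    return None
```

Six phrasings already need six patterns, and any material inserted between the key words (for example "binding under unspecified conditions of ...") defeats all of them.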
![Page 19: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/19.jpg)
Second thing that makes it hard: morphosyntactic variability
…binding of Met28 to DNA…
…binding under unspecified conditions of Met28 to DNA…
…binding of this translational variant of Met28 to DNA…
…binding of Met28 to upstream regions of DNA…
![Page 20: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/20.jpg)
Second thing that makes it hard: morphosyntactic variability
…binding under unspecified conditions of this translational variant of Met28 to upstream regions of DNA…
![Page 21: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/21.jpg)
Even before that, though…
HSP60
Hsp-60
heat shock protein 60
Cerberus
wingless
Ken and Barbie
the
![Page 22: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/22.jpg)
Even before that, though…
HSP60
Hsp-60
heat shock protein 60
Cerberus
wingless
Ken and Barbie
the
Entity identification
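A small Python sketch of one piece of the problem: collapsing surface variants of the same name to a single lookup key. The synonym table and the function name are hypothetical; a real system would draw on a curated lexicon rather than a one-entry dictionary.

```python
import re

# Hypothetical long-form table; only the example from the slide is included.
LONG_FORMS = {"heat shock protein": "hsp"}


def normalize_gene_name(mention: str) -> str:
    """Map surface variants of a gene name to one key (illustrative heuristics)."""
    key = mention.lower()
    for long_form, symbol in LONG_FORMS.items():
        key = key.replace(long_form, symbol)
    # Drop hyphens and whitespace so "Hsp-60" and "hsp 60" collapse together.
    return re.sub(r"[-\s]+", "", key)
```

Normalization of this kind handles HSP60, Hsp-60, and heat shock protein 60. It does nothing for names such as wingless or the, which look like ordinary English words and need context to be recognized at all.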
![Page 23: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/23.jpg)
How they connect
• Entity identification
• Information extraction
• Indexing
• Literature search
• Annotation
• Database construction
![Page 24: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/24.jpg)
…and none of those are what we REALLY want…
• “text data mining:” finding knowledge that isn’t explicitly stated
Question-answering
“Missing links”
Summarization
Inference
![Page 25: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/25.jpg)
Why is NLP hard?
![Page 26: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/26.jpg)
Why is NLP hard?
• 2 + 3 * 4
  == 14
  != 20
• fatty acids and cholesterol
  (fatty acids) and cholesterol
  fatty (acids and cholesterol)
![Page 27: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/27.jpg)
Types of ambiguity
• Part of speech: Is this a noun, a verb, an adjective…
• Lexical: which word is this?
• Structural: what’s the relationship between these words/phrases?
• Semantic, pragmatic, discourse-level…
![Page 28: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/28.jpg)
Part-of-speech ambiguity
![Page 29: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/29.jpg)
• adjective • noun
![Page 30: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/30.jpg)
• heat shock protein 60
  heat/VERB shock/VERB protein/NOUN
  heat/NOUN shock/NOUN protein/NOUN
![Page 31: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/31.jpg)
Lexical ambiguity
![Page 32: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/32.jpg)
• Flying mammal • Sporting equipment
![Page 33: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/33.jpg)
Lexical ambiguity
![Page 34: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/34.jpg)
Lexical ambiguity: a verb with 14 meanings
• 7 a : to assemble and set alight the materials for (a fire) b : to set in order <make beds>
• 3 a : to bring into being by forming, shaping, or altering material : FASHION <make a dress>
![Page 35: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/35.jpg)
• hunk
  human natural killer (cell type)
  HUN kinase (gene/protein)
  radiological/orthopedic classification scheme
  piece of something
![Page 36: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/36.jpg)
A reminder about what syntactic structure is
• Relations between phrases (groups of words)
I dislike stupid people like you I find them boring
![Page 37: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/37.jpg)
A reminder about what syntactic structure is
• Relations between phrases (groups of words)
[I dislike stupid people][like you I find them boring]
![Page 38: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/38.jpg)
A reminder about what syntactic structure is
• Relations between phrases (groups of words)
[I dislike stupid people like you][I find them boring]
![Page 39: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/39.jpg)
A reminder about what syntactic structure is
• Relations between phrases (groups of words)
I saw the man with the binoculars
![Page 40: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/40.jpg)
A reminder about what syntactic structure is
• Relations between phrases (groups of words)
I saw the man with the binoculars
Paraphrase: using the binoculars, I saw the man
![Page 42: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/42.jpg)
A reminder about what syntactic structure is
• Relations between phrases (groups of words)
I saw the man with the binoculars
Paraphrase: the man that I saw had some binoculars
![Page 44: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/44.jpg)
Structural ambiguity
![Page 45: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/45.jpg)
Structural ambiguity
• Clerk interprets:
  Verb (blouse) (in window)
• She means:
  Verb (blouse (in window))
![Page 46: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/46.jpg)
Structural ambiguity
![Page 47: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/47.jpg)
Structural ambiguity
• Sly dog means:
  Verb (in (house (on bed)))
• She interprets:
  Verb (in house) (on bed)
![Page 48: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/48.jpg)
• NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates.
![Page 49: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/49.jpg)
• NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates.
• (liver), (testis) and (brain in rat)
• liver, (testis and brain in rat)
• (liver, testis and brain in rat)
![Page 50: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/50.jpg)
• NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates.
• shows preference for (citrate over dicarboxylates)
• shows preference (for citrate) (over dicarboxylates)
![Page 51: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/51.jpg)
Other syntactic ambiguity (ellipsis)
![Page 52: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/52.jpg)
Two approaches to NLP
Rule-based Statistical/machine learning
![Page 53: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/53.jpg)
First approach to NLP
• Rule-based
• AI, linguistics
• Patterns (regular, context-free…)
• Procedures
![Page 54: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/54.jpg)
Rule-based: regex
• Patterns (regular, context-free, …)
• Procedures
my $geneName = "[A-Za-z]+-?[0-9]+";   # letters, optional hyphen, one or more digits
if ($input =~ /interaction of ($geneName) with ($geneName)/) {
    $interactionAssertion->setGene1($1);
    $interactionAssertion->setGene2($2);
}
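For readers who want to run the pattern, here is a rough Python rendering of the same rule. It assumes a gene name is letters, an optional hyphen, and one or more digits (so that names such as Met28 match in full); the function name and the tuple return value are illustrative choices, not part of the slide.

```python
import re

# Assumed gene-name shape: letters, optional hyphen, one or more digits.
GENE_NAME = r"[A-Za-z]+-?[0-9]+"
INTERACTION = re.compile(rf"interaction of ({GENE_NAME}) with ({GENE_NAME})")


def find_interaction(text):
    """Return the two gene names as a tuple, or None when the rule does not fire."""
    match = INTERACTION.search(text)
    return match.groups() if match else None


pair = find_interaction("We studied the interaction of Met28 with Hsp-60 in yeast.")
```

The rule is precise when it fires but silent on any name that does not end in digits, such as wingless.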
![Page 55: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/55.jpg)
Rule-based: CFGs
• Patterns (regular, context-free, …)
• Procedures
NounPhrase -> NounPhrase+ Conjunction NounPhrase
NounPhrase -> Predeterminer Determiner+ Adjective+ Noun
![Page 56: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/56.jpg)
Rule-based: procedural
• Patterns (regular, context-free, …)
• Procedures
if (currentWordEndsWith-ing) {
if (previousWordIsThe) {
if (nextWordIsOf) {
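The nested tests on this slide can be written as a runnable Python check. The slide does not show what happens once all three conditions hold; the sketch assumes the goal is to flag an -ing word that is being used as a noun (as in "the binding of"), and the function name is hypothetical.

```python
def looks_like_nominalization(tokens, i):
    """True when token i ends in -ing, follows 'the', and precedes 'of'."""
    if not tokens[i].lower().endswith("ing"):
        return False
    if i == 0 or tokens[i - 1].lower() != "the":
        return False
    if i + 1 >= len(tokens) or tokens[i + 1].lower() != "of":
        return False
    return True


tokens = "The binding of Met28 to DNA".split()
flagged = looks_like_nominalization(tokens, 1)
```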
![Page 57: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/57.jpg)
Rule-based approaches: why they work
• Patterns are real
  Psychologically
  Formally adequate (mostly)
• Intuition works
• No need for training data
![Page 58: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/58.jpg)
Rule-based approaches: why they’re hard
• Knowledge takes time to get
• Process of developing large rule sets can be slow
  Consider English syntax…
![Page 59: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/59.jpg)
Second approach to NLP
• Mosteller & Wallace
• Bayesian
• Other machine learning techniques
![Page 60: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/60.jpg)
Statistical approaches
Bayesian statistics: Why this is funny
![Page 61: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/61.jpg)
Statistical/ML approaches
• Frame the NLP task as a series of classification problems
Which POS is this?
Which word meaning?
Which phrasal grouping?
![Page 62: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/62.jpg)
Specifics of statistical approaches
• Training data: 100 examples of “bat”
• 95 times, meaning is “sporting equipment”
• 5 times, meaning is “mammal”
![Page 63: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/63.jpg)
Specifics of statistical approaches
• You see “bat:” what does it probably mean?
![Page 64: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/64.jpg)
Specifics of statistical approaches
• 95 examples mean “sporting equipment”
• 95 times, word “ball” also appears
![Page 65: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/65.jpg)
Specifics of statistical approaches
• 5 examples mean “mammal”
• Each time, word “wizard” also appears
![Page 66: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/66.jpg)
Specifics of statistical approaches
• You see the word “bat,” and you also see the word “wizard…”
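The "bat" example can be worked through with a small naive-Bayes-style calculation in Python. The counts of 95 and 5 come from the slides. Two things are assumptions: that "wizard" never co-occurs with the sporting sense and "ball" never with the mammal sense, and the use of add-one smoothing so that unseen pairs do not get probability zero.

```python
# Counts from the slides: 100 training examples of "bat".
SENSE_COUNTS = {"sporting equipment": 95, "mammal": 5}

# Examples of each sense in which the context word also appears.
# The zero entries are assumptions; the slides do not state them.
CONTEXT_COUNTS = {
    "sporting equipment": {"ball": 95, "wizard": 0},
    "mammal": {"ball": 0, "wizard": 5},
}


def sense_posteriors(context_word):
    """P(sense | context word), from prior times smoothed likelihood."""
    total = sum(SENSE_COUNTS.values())
    scores = {}
    for sense, n in SENSE_COUNTS.items():
        prior = n / total
        # Add-one smoothing over a present/absent feature (an assumption).
        likelihood = (CONTEXT_COUNTS[sense].get(context_word, 0) + 1) / (n + 2)
        scores[sense] = prior * likelihood
    norm = sum(scores.values())
    return {sense: score / norm for sense, score in scores.items()}


def most_likely_sense(context_word=None):
    """With no context, fall back on the most frequent sense."""
    if context_word is None:
        return max(SENSE_COUNTS, key=SENSE_COUNTS.get)
    posteriors = sense_posteriors(context_word)
    return max(posteriors, key=posteriors.get)
```

With no context the prior wins and "bat" is read as sporting equipment. Seeing "wizard" alongside it outweighs the 95-to-5 prior and the mammal sense comes out ahead.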
![Page 67: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/67.jpg)
Statistical approaches: why they work
• Statistics can be proxy for knowledge
• Some interesting stuff is frequent enough to be tractable
![Page 68: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/68.jpg)
Statistical approaches: why they’re hard
• Problem: sparse data
[Plot: frequency (y-axis) against rank (x-axis)]
![Page 69: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/69.jpg)
Statistical approaches: why they’re hard
• Solutions: smoothing, back-off
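A minimal Python sketch of the two ideas named on this slide. Add-one (Laplace) smoothing is only one of several smoothing schemes, and the back-off shown here is deliberately simplified and not renormalized; the function names and the count structures are assumptions for illustration.

```python
from collections import Counter


def add_one_probability(counts: Counter, event, vocabulary_size: int) -> float:
    """Laplace estimate: every event, seen or not, gets a nonzero probability."""
    total = sum(counts.values())
    return (counts[event] + 1) / (total + vocabulary_size)


def backoff_probability(bigram_counts, unigram_counts, prev, word, vocabulary_size):
    """Use the bigram estimate when the bigram was observed; otherwise back
    off to the smoothed unigram estimate (simplified, not renormalized)."""
    if bigram_counts[(prev, word)] > 0:
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    return add_one_probability(unigram_counts, word, vocabulary_size)


unigrams = Counter({"binds": 3, "to": 3, "DNA": 2})
bigrams = Counter({("binds", "to"): 3})
```

Here `backoff_probability(bigrams, unigrams, "binds", "DNA", 4)` has no bigram evidence and falls back to the unigram estimate, while a word never seen at all still receives a small share of the probability mass.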
![Page 70: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/70.jpg)
Statistical approaches: why they’re hard
• Problem: labelled training data is expensive
![Page 71: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/71.jpg)
Statistical approaches: why they’re hard
• Solutions:
  spend money
  figure out how to use other people’s “weakly labelled” data
![Page 72: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/72.jpg)
Rule-based or statistical: what to do??
![Page 73: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/73.jpg)
Rule-based vs. statistical approaches
• Picking one:
  Is it cheaper to label more training data, or to put time into developing patterns?
![Page 74: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/74.jpg)
Rule-based vs. statistical approaches
• Combine them:
  Do both together/iteratively
  Statistical solution first, then rule-based post-processing
(the 2.5th approach)
![Page 75: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/75.jpg)
Rule-based vs. statistical approaches
• Researcher’s answer:
  Use one as the baseline for the other
![Page 76: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/76.jpg)
Rule-based vs. statistical approaches
Domain specificity makes both of them more tractable
![Page 77: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/77.jpg)
Rule-based vs. statistical approaches
Sublanguage model
![Page 78: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/78.jpg)
Two extended examples
![Page 79: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/79.jpg)
POS tagging: why you need it
• All syntax is built on it
• Overcome sparseness problem by abstracting away from specific words
• Potential basis for entity identification
![Page 80: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/80.jpg)
What “POS tagging” is
• POS: part of speech
• School: 8 (noun, verb, adjective, interjection…)
• Real life: 40 or more
![Page 81: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/81.jpg)
How do you get from 8 to 80?
• Noun
  • NN (noun, singular or mass)
  • NNS (plural noun)
  • NNP (proper noun)
  • NNPS (plural proper noun)
![Page 82: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/82.jpg)
How do you get from 8 to 80?
• Verb
  • VB (base form)
  • VBD (past tense)
  • VBG (gerund)
  • VBN (past participle)
  • VBP (singular present-tense non-3rd-person)
  • VBZ (3rd-person singular present tense)
![Page 83: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/83.jpg)
How do you define noun, verb, etc.?
• Semantic:
  “A noun is a person, place, or thing…”
  “A verb is…”
• Distributional characteristics:
  “A noun can take the plural and genitive morphemes”
  “A noun can appear in the environment All of my twelve hairy ___ left before noon”
![Page 84: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/84.jpg)
POS tagging defined (or, why’s it interesting?)
• Given:
  Time flies like an arrow, but fruit flies like a banana.
• Do:
  Time flies/VBZ like/IN an arrow, but fruit flies/NNS like/VBP a banana.
![Page 85: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/85.jpg)
A statistical approach: TnT
• Second-order Markov model
• Smoothing by linear interpolation of ngrams
• λ estimated by deleted interpolation
• Tag probabilities learned for word endings; used for unknown words
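Linear interpolation, as named above, mixes the unigram, bigram, and trigram estimates of a tag's probability using weights λ1, λ2, λ3 that sum to one. A minimal Python sketch follows; the function name is hypothetical, and the weights in the example are made-up values, not weights estimated by deleted interpolation.

```python
def interpolated_tag_probability(p_unigram, p_bigram, p_trigram, lambdas):
    """Combine P(t3), P(t3 | t2), and P(t3 | t1, t2) with weights summing to 1."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# The trigram was never seen (estimate 0.0), yet the result stays above zero.
p = interpolated_tag_probability(0.2, 0.5, 0.0, (0.1, 0.3, 0.6))
```

This is how interpolation addresses sparse data: an unseen trigram no longer forces the whole probability to zero, because the lower-order estimates still contribute.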
![Page 86: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/86.jpg)
TnT
• Ngram: an n-tag or n-word sequence
• N = 1
  DET
  NOUN
  role
• Bigrams
  DET NOUN
  NOUN PREPOSITION
  a role
• Trigrams
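Extracting n-grams from a tag or word sequence takes only a few lines of Python. The helper name `ngrams` is an illustrative choice, and the tag sequence reuses the tags from the slide.

```python
def ngrams(sequence, n):
    """Return every contiguous run of n items as a tuple, in order."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


tags = ["DET", "NOUN", "PREPOSITION"]
unigrams = ngrams(tags, 1)
bigrams = ngrams(tags, 2)
trigrams = ngrams(tags, 3)
```

A second-order Markov model conditions each tag on the two before it, so the trigram counts are the ones it depends on most.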
![Page 87: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/87.jpg)
The Brill Tagger
![Page 88: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/88.jpg)
The Brill tagger
• Iterative error reduction
  1. Assign most common tags, then
  2. Evaluate performance, then
![Page 89: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/89.jpg)
The Brill tagger
• Iterative error reduction
  1. Assign most common tags, then
  2. Evaluate performance, then
  3. Propose rules to fix errors
  4. Evaluate performance, then
  5. If you’ve improved, GOTO 3, else END
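The five steps can be sketched as a greedy loop in Python. This is a toy under stated assumptions, not Brill's implementation: it handles a single sentence, uses one rule template ("change tag A to tag B when the previous tag is C"), and defaults unknown words to "NN". All function names are hypothetical.

```python
def apply_rule(rule, tags):
    """Rule (from_tag, to_tag, prev_tag): retag from_tag as to_tag after prev_tag."""
    from_tag, to_tag, prev_tag = rule
    retagged = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            retagged[i] = to_tag
    return retagged


def count_errors(tags, gold):
    """Number of positions where the tagging disagrees with the gold standard."""
    return sum(tag != answer for tag, answer in zip(tags, gold))


def learn_rules(words, gold, most_common_tag):
    # Step 1: assign each word its most common tag ("NN" default is an assumption).
    tags = [most_common_tag.get(word, "NN") for word in words]
    learned = []
    while True:
        # Steps 2 and 4: evaluate the current tagging against the gold standard.
        errors = count_errors(tags, gold)
        # Step 3: propose one candidate rule per remaining error.
        candidates = sorted(
            {(tags[i], gold[i], tags[i - 1])
             for i in range(1, len(tags)) if tags[i] != gold[i]}
        )
        scored = [(count_errors(apply_rule(rule, tags), gold), rule)
                  for rule in candidates]
        # Step 5: keep the best rule only if it improves; otherwise stop.
        if not scored or min(scored)[0] >= errors:
            return learned, tags
        _, best_rule = min(scored)
        tags = apply_rule(best_rule, tags)
        learned.append(best_rule)
```

Because a rule is kept only when it strictly lowers the error count, the loop is guaranteed to stop.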
![Page 90: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/90.jpg)
The Brill tagger
• Change Determiner Verb “of”
• …to…
• Determiner Noun “of”
The/Determiner running/Verb of/IN
The/Determiner running/Noun of/IN
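Applying a single transformation of this kind is straightforward. The Python sketch below encodes the rule from the slide; the function name and its parameter layout are assumptions for illustration.

```python
def apply_transformation(tagged, from_tag, to_tag, prev_tag, next_word):
    """Retag from_tag as to_tag when preceded by prev_tag and followed by next_word."""
    result = list(tagged)
    for i in range(1, len(tagged) - 1):
        word, tag = tagged[i]
        if (tag == from_tag
                and tagged[i - 1][1] == prev_tag
                and tagged[i + 1][0].lower() == next_word):
            result[i] = (word, to_tag)
    return result


before = [("The", "Determiner"), ("running", "Verb"), ("of", "IN")]
after = apply_transformation(before, "Verb", "Noun", "Determiner", "of")
```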
![Page 91: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/91.jpg)
EI as POS tagging
• Retinoic acid downmodulates erythroid differentiation and GATA1 expression in purified adult-progenitor culture. (Labbaye et al. 1994)
![Page 92: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/92.jpg)
• Retinoic/JJ acid/NN downmodulates/VBZ erythroid/JJ differentiation/NN and/CC GATA1/NN expression/NN in/IN purified/VBN adult-progenitor/JJ culture/NN ./. (Labbaye et al. 1994)
![Page 93: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/93.jpg)
• Retinoic/JJ acid/NN downmodulates/VBZ erythroid/JJ differentiation/NN and/CC GATA1/GENE expression/NN in/IN purified/VBN adult-progenitor/JJ culture/NN ./. (Labbaye et al. 1994)
![Page 94: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/94.jpg)
Rule-based post-processing
• 14 simple patterns for adjusting boundaries
if current_word == (gene OR mutant etc.) && previous_word_tag == GENE
then current_word_tag = GENE
if current_word =~ /digit|Roman numeral|Greek letter/ && previous_word_tag == GENE
then current_word_tag = GENE
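The two patterns shown can be sketched in Python as below. Only these two of the 14 patterns appear on the slide, so nothing else is implemented. The keyword set stops at the words the slide names (its "etc." is left unexpanded), and the Greek-letter list and Roman-numeral expression are small illustrative stand-ins.

```python
import re

GENE_KEYWORDS = {"gene", "mutant"}                      # the slide's "etc." is not expanded
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}    # illustrative subset
NUMERIC_SUFFIX = re.compile(r"^(\d+|[IVX]+)$")          # digits or simple Roman numerals


def extend_gene_boundaries(tagged):
    """Extend a GENE tag rightward over keywords, numerals, and Greek letters."""
    result = []
    for word, tag in tagged:
        previous_tag = result[-1][1] if result else None
        if previous_tag == "GENE":
            if word.lower() in GENE_KEYWORDS:
                tag = "GENE"
            elif NUMERIC_SUFFIX.match(word) or word.lower() in GREEK:
                tag = "GENE"
        result.append((word, tag))
    return result
```

Because each word is checked against the already-updated tag of the word before it, a run such as "Hsp 60 gene" is absorbed in a single pass.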
![Page 95: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/95.jpg)
Results (BioCreative)
[Plot: “Precision and Recall: Official Score”; precision (y-axis) against recall (x-axis), both from 0.0 to 1.0; legend: No post-processing, Closed Division, Open Division, High, Median, Low]
![Page 96: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/96.jpg)
How to get started
• Certificate program project
• BIOI 7713 project
• “Bake-offs”
  BioCreative
  TREC
• BIOI 7791, Spring quarter, HSC/Fitz
• NLP meetings: 2 p.m. Wednesdays, Room 6106, South Tower
![Page 97: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/97.jpg)
Coursework
• CSCI 5832, Natural language processing
  Jim Martin
  Boulder campus, Spring semester
  http://www.cs.colorado.edu/~martin/csci5832.html
• Independent Study
• BIOI 7791
• BIOI 7713
![Page 98: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/98.jpg)
Miscellaneous extra slides for answering questions, etc.
![Page 99: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/99.jpg)
Books
![Page 100: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/100.jpg)
How to read a paper in “BioNLP”
• Evaluation set
  Exclusions
![Page 101: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/101.jpg)
Evaluation
• P/R/F-measure
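For reference, precision, recall, and the balanced F-measure can be computed from raw counts as in this Python sketch. The function name and the choice to return 0.0 when a denominator is zero are assumptions for illustration.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from counts of system output vs. gold standard."""
    predicted = true_positives + false_positives   # everything the system returned
    actual = true_positives + false_negatives      # everything it should have returned
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# A system returns 10 gene names, 8 of them correct, and misses 8 others.
p, r, f = precision_recall_f1(8, 2, 8)
```

The F-measure is the harmonic mean of the two, so a system cannot score well by maximizing one at the expense of the other.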
![Page 102: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/102.jpg)
The other application of NLP in computational biology
• Sequence data
  How do you know a gene/promoter/whatever when you see one?
  Grammar-based approaches (David Searls)
  HMMs
![Page 103: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/103.jpg)
![Page 104: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/104.jpg)
<scanned picture of business card>
![Page 105: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/105.jpg)
<happy-face photo>
![Page 106: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/106.jpg)
One year later…
![Page 107: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/107.jpg)
A sad story: physicians don’t buy a lot of NLP software
![Page 108: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/108.jpg)
Another sad story: trying to sell “gisting” to physicians
![Page 109: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/109.jpg)
Sold for $400K: 14.5 or 2.9¢ on the dollar…
![Page 110: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/110.jpg)
Salesperson’s thought process
You have a problem
I can solve it for you
![Page 111: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/111.jpg)
Physician’s thought process
Have I been sued over how I do this?
Yes: I have a problem. Can you solve it for me?
No: I don’t have a problem.
![Page 112: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/112.jpg)
Cooccurrence
![Page 113: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/113.jpg)
Cooccurrence
![Page 114: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/114.jpg)
Pleonastic vs. referential ambiguity
![Page 115: Two and a half approaches to natural language processing](https://reader036.vdocuments.us/reader036/viewer/2022071613/615709f6a097e25c76506d9e/html5/thumbnails/115.jpg)
Edit distance
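The slide gives only the title, so the sketch below assumes the standard Levenshtein definition: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The function name is an illustrative choice.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance by dynamic programming, keeping one row at a time."""
    previous = list(range(len(target) + 1))   # distance from "" to each target prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]                          # distance from source prefix to ""
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


distance = edit_distance("HSP60", "Hsp-60")
```

A small distance between two strings is one signal that they may be variants of the same gene name, as with HSP60 and Hsp-60.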