Domain Adaptation for Biomedical Information Extraction
Jing Jiang
BeeSpace Seminar, Oct 17, 2007
Outline
- Why do we need domain adaptation?
- Solutions:
  - Intelligent learning methods
  - Knowledge bases
  - Expert supervision
- Connections with BeeSpace V4
Why do we need domain adaptation?
- Many biomedical information extraction problems are solved by supervised machine learning methods such as support vector machines (SVMs).
  - Entity recognition
  - Relation extraction
  - Sentence categorization
- In supervised machine learning, it is assumed that the training data and the test data have the same distribution.
Why do we need domain adaptation?
- Existing labeled training data is often limited to certain domains.
  - GENIA corpus: human, blood cells, transcription factors
  - PennBioIE: genetic variation in malignancy, Cytochrome P450 inhibition
  - Training data for sentence categorization in gene summarizer: fly
- Even when the training data is diverse (containing multiple domains), it would still be nice to customize the classifier for the particular target domain that we are working on.
Why do we need domain adaptation?
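| NER task | Train → Test | F1 |
|---|---|---|
| Find PER, LOC, ORG in news text | NYT → NYT | 0.855 |
| Find PER, LOC, ORG in news text | Reuters → NYT | 0.641 |
| Find gene/protein in biomedical literature | mouse → mouse | 0.541 |
| Find gene/protein in biomedical literature | fly → mouse | 0.281 |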
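To make the size of the cross-domain loss concrete, here is a small sketch that computes the relative F1 drop from the four scores on this slide. The helper name `relative_drop` is introduced only for illustration.

```python
# F1 scores copied from the slide: (train domain, test domain) -> F1.
f1_scores = {
    ("NYT", "NYT"): 0.855,
    ("Reuters", "NYT"): 0.641,
    ("mouse", "mouse"): 0.541,
    ("fly", "mouse"): 0.281,
}

def relative_drop(in_domain: float, cross_domain: float) -> float:
    """Fraction of the in-domain F1 that is lost when training on another domain."""
    return (in_domain - cross_domain) / in_domain

news_drop = relative_drop(f1_scores[("NYT", "NYT")], f1_scores[("Reuters", "NYT")])
bio_drop = relative_drop(f1_scores[("mouse", "mouse")], f1_scores[("fly", "mouse")])

print(f"news NER: {news_drop:.1%} relative drop")          # news NER: 25.0% relative drop
print(f"gene/protein NER: {bio_drop:.1%} relative drop")   # gene/protein NER: 48.1% relative drop
```

On these numbers the biomedical task loses roughly twice as large a share of its F1 as the news task when the training domain changes.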
Solutions to domain adaptation
- Intelligent learning methods
  - Instance weighting
  - Feature selection
- Knowledge bases
- Expert supervision
thesis research
future work
discussion
Domain adaptive learning methods
- Two-stage approach
- Two frameworks
  - Instance weighting
  - Feature selection
- Use of unlabeled data
Intuition
[Diagram: source domain and target domain]
Goal
[Diagram: source domain and target domain]
Start from the source domain
[Diagram: source domain and target domain]
Focus on the common part
[Diagram: source domain and target domain]
Pick up some part from the target domain
[Diagram: source domain and target domain]
Formal formulation?
[Diagram: source domain and target domain]
How to formally formulate these ideas?
Instance weighting
[Diagram: source domain and target domain in the instance space; each point represents an example]
Idea: assign different weights to different instances in the objective function.
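As a minimal sketch of what weighting instances in the objective can look like, the code below fits a binary logistic model by gradient ascent on a weighted log-likelihood. The model choice, the toy data, the weights, and the function names are all assumptions for illustration; the slides do not specify them.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def weighted_log_likelihood(theta, examples, weights):
    """Sum over instances of w_i * log p(y_i | x_i; theta) for a binary logistic model."""
    total = 0.0
    for (x, y), w in zip(examples, weights):
        p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += w * math.log(p1 if y == 1 else 1.0 - p1)
    return total

def fit(examples, weights, steps=500, lr=0.5):
    """Plain gradient ascent on the weighted log-likelihood."""
    theta = [0.0] * len(examples[0][0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for (x, y), w in zip(examples, weights):
            p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for d, xi in enumerate(x):
                grad[d] += w * (y - p1) * xi
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# Toy data (hypothetical): the same feature vector is labeled 0 by a
# source-style example and 1 by a target-style example.
examples = [([1.0], 0), ([1.0], 1)]

theta_uniform = fit(examples, weights=[1.0, 1.0])   # both instances count equally
theta_weighted = fit(examples, weights=[0.2, 1.0])  # down-weight the source-style instance

p_uniform = sigmoid(theta_uniform[0])
p_weighted = sigmoid(theta_weighted[0])
```

With equal weights the model stays undecided (probability 0.5); down-weighting the conflicting source instance moves the prediction toward the target-style label.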
Instance weighting
Observation
[Figure: source domain vs. target domain]
Instance weighting
Analysis of domain difference
$p(x, y) = p(x)\, p(y \mid x)$
- $p_s(y \mid x) \neq p_t(y \mid x)$: labeling difference → labeling adaptation
- $p_s(x) \neq p_t(x)$: instance difference → instance adaptation?
Instance weighting
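Three sets of instances: $D_s$, $D_{t,l}$, $D_{t,u}$
$$\theta_t^* = \arg\max_{\theta} \int_{\mathcal{X}} p_t(x) \sum_{y \in \mathcal{Y}} p_t(y \mid x) \log p(y \mid x; \theta)\, dx$$
Approximate $\mathcal{X}$ with $D_s + D_{t,l} + D_{t,u}$?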
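One way to see how source-domain instances could stand in for the target distribution is the standard change-of-measure identity below. It is offered as a sketch and not as the derivation used in the talk. It assumes $p_s(x, y) > 0$ wherever $p_t(x, y) > 0$, writes $\theta$ for the model parameters, and writes $(x_i^s, y_i^s)$, $i = 1, \dots, N_s$, for the labeled source instances.

```latex
\begin{aligned}
\int_{\mathcal{X}} p_t(x) \sum_{y \in \mathcal{Y}} p_t(y \mid x) \log p(y \mid x; \theta)\, dx
&= \int_{\mathcal{X}} p_s(x) \sum_{y \in \mathcal{Y}} p_s(y \mid x)\,
   \frac{p_t(x)}{p_s(x)} \cdot \frac{p_t(y \mid x)}{p_s(y \mid x)}\,
   \log p(y \mid x; \theta)\, dx \\
&\approx \frac{1}{N_s} \sum_{i=1}^{N_s}
   \frac{p_t(x_i^s)}{p_s(x_i^s)} \cdot \frac{p_t(y_i^s \mid x_i^s)}{p_s(y_i^s \mid x_i^s)}\,
   \log p(y_i^s \mid x_i^s; \theta)
\end{aligned}
```

The two ratios line up with the two kinds of domain difference: the first corrects for the instance difference in $p(x)$, the second for the labeling difference in $p(y \mid x)$. Neither ratio is known in practice, so each has to be estimated or approximated.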
Instance weighting
Framework
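$$
\begin{aligned}
\hat{\theta} = \arg\max_{\theta} \Big[\; & \lambda_s \cdot \frac{1}{C_s} \sum_{i=1}^{N_s} \alpha_i \beta_i \log p(y_i^s \mid x_i^s; \theta) \\
 +\; & \lambda_{t,l} \cdot \frac{1}{C_{t,l}} \sum_{j=1}^{N_{t,l}} \log p(y_j^{t,l} \mid x_j^{t,l}; \theta) \\
 +\; & \lambda_{t,u} \cdot \frac{1}{C_{t,u}} \sum_{k=1}^{N_{t,u}} \sum_{y \in \mathcal{Y}} \gamma_k(y) \log p(y \mid x_k^{t,u}; \theta) \\
 +\; & \log p(\theta) \;\Big]
\end{aligned}
$$
with $\lambda_s + \lambda_{t,l} + \lambda_{t,u} = 1$.
The three sums run over the labeled source data, the labeled target data, and the unlabeled target data, respectively.
A flexible setup covering both standard methods and new domain adaptive methods.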
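The sketch below shows how an objective of this shape can be evaluated: one weighted log-likelihood term for each of the three data sets (labeled source, labeled target, unlabeled target), mixed by coefficients that sum to 1, plus a log-prior. The names `alpha`, `beta`, `gamma`, and `lambdas`, the choice to normalize each term by its total weight, and the toy probabilities are assumptions made for illustration.

```python
import math

def weighted_mean(pairs):
    """Weighted average of values given (weight, value) pairs; 0.0 if there is no weight."""
    total = sum(w for w, _ in pairs)
    return sum(w * v for w, v in pairs) / total if total > 0 else 0.0

def framework_objective(source, target_labeled, target_unlabeled, lambdas, log_prior=0.0):
    """Combine the three data sets into one objective value.

    source:           list of (alpha_i, beta_i, log p(y_i | x_i))
    target_labeled:   list of log p(y_j | x_j)
    target_unlabeled: list of dicts {label: (gamma_k(label), log p(label | x_k))}
    lambdas:          (lambda_s, lambda_tl, lambda_tu), required to sum to 1
    """
    lam_s, lam_tl, lam_tu = lambdas
    if abs(lam_s + lam_tl + lam_tu - 1.0) > 1e-9:
        raise ValueError("lambdas must sum to 1")
    # Assumption: each term is normalized by the total weight of its instances.
    s_term = weighted_mean([(a * b, lp) for a, b, lp in source])
    tl_term = weighted_mean([(1.0, lp) for lp in target_labeled])
    tu_term = weighted_mean(
        [(g, lp) for dist in target_unlabeled for g, lp in dist.values()]
    )
    return lam_s * s_term + lam_tl * tl_term + lam_tu * tu_term + log_prior

# Toy model probabilities (hypothetical).
source = [(1.0, 1.0, math.log(0.9)), (1.0, 1.0, math.log(0.6))]
target_labeled = [math.log(0.7)]
target_unlabeled = [{"gene": (0.8, math.log(0.75)), "other": (0.2, math.log(0.25))}]

# lambdas = (1, 0, 0) with all alpha_i = beta_i = 1 reduces to ordinary
# supervised learning on the source data.
supervised_only = framework_objective(source, [], [], (1.0, 0.0, 0.0))
adapted = framework_objective(source, target_labeled, target_unlabeled, (0.4, 0.4, 0.2))
```

Setting the mixing coefficients and weights differently recovers different methods, which is one reading of "a flexible setup covering both standard methods and new domain adaptive methods".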
Feature selection
[Diagram: source domain and target domain in the feature space; each point represents a feature]
Idea: identify features that behave similarly across domains.
Feature selection
Observation: domain-specific features
- Fly genes: wingless, daughterless, eyeless, apexless, …
- “suffix -less” is weighted high in the model trained from fly data.
- Useful for other organisms? In general, NO! It may cause generalizable features to be downweighted.
Feature selection
Observation: generalizable features generalize well in all domains.
- fly: “…decapentaplegic and wingless are expressed in analogous patterns in each…”
- mouse: “…that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”
The feature “$w_{i+2}$ = expressed” (the word two positions after the candidate token is “expressed”) is generalizable.
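To illustrate the contrast between a domain-specific suffix feature and a generalizable context feature, here is a minimal feature extractor applied to the fly and mouse excerpts. The feature naming scheme, the suffix lengths, and the window size are assumptions chosen for the example.

```python
def token_features(tokens, i):
    """Binary features for the token at position i: suffixes and nearby words."""
    feats = set()
    word = tokens[i]
    for n in (3, 4):                      # character suffixes of length 3 and 4
        if len(word) > n:
            feats.add(f"suffix=-{word[-n:]}")
    for offset in (-2, -1, 1, 2):         # context words within a window of 2
        j = i + offset
        if 0 <= j < len(tokens):
            feats.add(f"w[i{offset:+d}]={tokens[j].lower()}")
    return feats

fly = "decapentaplegic and wingless are expressed in analogous patterns".split()
mouse = "that CD38 is expressed by both neurons".split()

fly_feats = token_features(fly, fly.index("wingless"))
mouse_feats = token_features(mouse, mouse.index("CD38"))

shared = fly_feats & mouse_feats
print(sorted(shared))  # ['w[i+2]=expressed']
```

The suffix feature `suffix=-less` fires only for the fly gene, while the context feature fires for both gene mentions.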
Feature selection
Intuition for identification of generalizable features
[Figure: ranked feature lists (positions 1 to 8) for the source domains fly, mouse, D3, …, DK, marking where “-less” and “expressed” fall in each list]
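One possible way to turn this intuition into a procedure, sketched here under assumptions: rank the features within each source domain by their learned weight, then prefer features whose worst rank across domains is still good. The scoring rule, the function names, and the toy weights are hypothetical and not taken from the slides.

```python
def rank_features(weights):
    """Map each feature to its rank within one domain (1 = largest weight)."""
    ordered = sorted(weights, key=lambda f: weights[f], reverse=True)
    return {f: r for r, f in enumerate(ordered, start=1)}

def generalizable_features(domain_weights, top_n):
    """Order features by their worst rank across domains and keep the best top_n."""
    ranks = [rank_features(w) for w in domain_weights.values()]
    shared = set.intersection(*(set(r) for r in ranks))
    worst = {f: max(r[f] for r in ranks) for f in shared}
    return sorted(shared, key=lambda f: (worst[f], f))[:top_n]

# Toy per-domain model weights (hypothetical values).
domain_weights = {
    "fly":   {"suffix=-less": 2.1, "w[i+2]=expressed": 1.4, "w[i-1]=the": 0.2},
    "mouse": {"suffix=-less": 0.1, "w[i+2]=expressed": 1.6, "w[i-1]=the": 0.3},
}

selected = generalizable_features(domain_weights, top_n=1)
print(selected)  # ['w[i+2]=expressed']
```

Here `suffix=-less` ranks first in fly but last in mouse, so it loses to the context feature that ranks well in both domains.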
Feature selection
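Framework
Matrix $A$ is for feature selection.
$$
\left(\hat{v}, \{\hat{u}^k\}\right) = \arg\min_{v, \{u^k\}} \left[ -\frac{1}{K} \sum_{k=1}^{K} \frac{1}{N_k} \sum_{i=1}^{N_k} \log p\left(y_i^k \mid x_i^k; A^T v + u^k\right) + \lambda_s \|v\|^2 + \sum_{k=1}^{K} \lambda^k \|u^k\|^2 \right]
$$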
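A small sketch of how a feature selection matrix can be used, under two stated assumptions: that $A$ is a 0/1 matrix with one row per selected feature, and that the weight vector for domain $k$ has the form $A^T v + u^k$ (shared weights on the selected features plus a domain-specific part). The function names and numbers are hypothetical.

```python
def select_features(A, x):
    """z = A x: keep only the features picked out by the rows of A."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def domain_weights(A, v, u_k):
    """w^k = A^T v + u^k: shared weights mapped back to the full space, plus a domain part."""
    d = len(u_k)
    shared = [sum(A[r][c] * v[r] for r in range(len(A))) for c in range(d)]
    return [s + u for s, u in zip(shared, u_k)]

# Hypothetical 3-feature space: [suffix=-less, w[i+2]=expressed, w[i-1]=the]
A = [[0, 1, 0]]          # one row: select only "w[i+2]=expressed"
v = [1.5]                # shared weight for the selected feature
u_fly = [2.0, 0.0, 0.0]  # fly-specific weight on "suffix=-less"

x = [1, 1, 0]            # a token with the suffix and the context word
z = select_features(A, x)
w_fly = domain_weights(A, v, u_fly)
print(z, w_fly)  # [1] [2.0, 1.5, 0.0]
```

For a new target domain, the shared part `v` carries over unchanged, while the domain-specific part would have to be learned or left at zero.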
Feature selection results on gene/protein recognition
New directions to explore
- Knowledge bases
- Expert supervision
Knowledge bases – entity recognition
- Well-documented nomenclatures
  - Fly, Mouse, Rat
  - Help filter out false positives?
  - Help select features?
- Dictionaries of entities
  - “Dictionary features”
- Automatic summarization of nomenclatures?
- Automatic identification of good features?
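A “dictionary feature” can be sketched as a binary indicator of whether a token falls inside any entry of an entity dictionary. The dictionary contents, the maximum entry length, and the function name below are assumptions for illustration.

```python
def in_dictionary(tokens, i, dictionary, max_len=3):
    """True if token i is covered by an n-gram (n <= max_len) found in the dictionary."""
    lowered = [t.lower() for t in tokens]
    for n in range(1, max_len + 1):
        # Try every n-gram window that contains position i.
        for start in range(max(0, i - n + 1), min(i, len(tokens) - n) + 1):
            if " ".join(lowered[start:start + n]) in dictionary:
                return True
    return False

# Hypothetical dictionary of lowercased entity names.
gene_dictionary = {"wingless", "decapentaplegic", "cytochrome p450"}

tokens = "Cytochrome P450 inhibition".split()
dict_feature = [in_dictionary(tokens, i, gene_dictionary) for i in range(len(tokens))]
print(dict_feature)  # [True, True, False]
```

The resulting indicator can be added to the other features of a statistical tagger, so that the model learns how much to trust the dictionary in each domain.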
Knowledge bases – sentence categorization in gene summarizer
- For fly, the training sentences are automatically extracted from FlyBase.
- For other organisms, do we have similar resources?
Expert supervision – entity recognition
- Computer system selects ambiguous examples for human experts to judge.
- Computer system asks human experts other questions.
  - Similar organisms?
  - Typical surface features? (e.g. cis-regulatory elements, “-RE”)
- Computer system summarizes possible features from pseudo-labeled data, and asks human experts for confirmation.
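Selecting ambiguous examples for experts to judge can be sketched with uncertainty sampling: pick the examples whose predicted probability is closest to 0.5. This is one common selection rule, named here as an assumption; the probabilities and the budget are toy values.

```python
def select_ambiguous(probabilities, budget):
    """Indices of the `budget` examples whose P(entity) is closest to 0.5."""
    by_uncertainty = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return by_uncertainty[:budget]

# Hypothetical model outputs P(token is a gene name) for five candidates.
probabilities = [0.97, 0.52, 0.10, 0.45, 0.80]

to_annotate = select_ambiguous(probabilities, budget=2)
print(to_annotate)  # [1, 3]
```

The confident predictions (0.97 and 0.10) are left alone, so expert time goes to the cases where a label is most informative.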
Connections to BeeSpace V4
A major challenge in BeeSpace V4 is extraction of new types of entities and relations.
Exploiting knowledge bases and expert supervision is especially important.
For new types, no labeled data is available even from other domains. Use of bootstrapping methods should be explored.
New entity types
Recognition of many new types will be dictionary based: organism, anatomy, biological process, etc.
Recognition of some new types will need some NER techniques: chemical, regulatory element
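Dictionary-based recognition can be sketched as greedy longest-match lookup over the token sequence. The dictionary entries, the type labels attached to them, and the maximum phrase length are hypothetical examples.

```python
def tag_entities(tokens, dictionary, max_len=3):
    """Greedy longest-match lookup; returns (start, end, type) spans, end exclusive."""
    lowered = [t.lower() for t in tokens]
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(lowered[i:i + n])
            if phrase in dictionary:
                spans.append((i, i + n, dictionary[phrase]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on
    return spans

# Hypothetical dictionary mapping lowercased phrases to entity types.
type_dictionary = {
    "honey bee": "organism",
    "mushroom body": "anatomy",
    "foraging": "biological process",
}

tokens = "Foraging behavior in the honey bee mushroom body".split()
spans = tag_entities(tokens, type_dictionary)
print(spans)
# [(0, 1, 'biological process'), (4, 6, 'organism'), (6, 8, 'anatomy')]
```

Types such as chemicals and regulatory elements have open-ended names that a fixed dictionary cannot enumerate, which fits the slide's point that they need NER techniques.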
New relation types
- Bootstrapping (?)
  - Seed patterns from knowledge bases or human experts
  - Human inspection of newly discovered patterns?
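The bootstrapping loop can be sketched as follows: seed patterns harvest entity pairs, the pairs suggest new patterns, and a human check decides which new patterns to keep. The data layout, the `approve` callback standing in for human inspection, and the gene names are all hypothetical.

```python
def bootstrap(sentences, seed_patterns, approve, rounds=2):
    """sentences: list of (entity1, connecting_text, entity2) triples."""
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(rounds):
        # 1. Harvest entity pairs connected by a known pattern.
        pairs |= {(a, b) for a, mid, b in sentences if mid in patterns}
        # 2. Propose new patterns that connect already-harvested pairs.
        candidates = {mid for a, mid, b in sentences if (a, b) in pairs} - patterns
        # 3. Keep only the candidates that pass human inspection.
        patterns |= {p for p in candidates if approve(p)}
    return patterns, pairs

sentences = [
    ("geneA", "regulates", "geneB"),
    ("geneA", "controls expression of", "geneB"),
    ("geneC", "controls expression of", "geneD"),
    ("geneA", "is unlike", "geneB"),
]

# Stand-in for an expert who rejects the non-regulatory pattern.
def approve(pattern):
    return pattern != "is unlike"

patterns, pairs = bootstrap(sentences, {"regulates"}, approve)
print(sorted(patterns))  # ['controls expression of', 'regulates']
print(sorted(pairs))     # [('geneA', 'geneB'), ('geneC', 'geneD')]
```

Without the inspection step, the pattern "is unlike" would be accepted in the first round, which is the kind of drift that human review is meant to catch.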
The end