Automatic Metadata Extraction (Darwin Core) From Museum
Specimen Labels
P. Bryan Heidorn, Qin Wei
University of Illinois at Urbana-Champaign
…<co> Curtis, </co><hdlc> North American Pl
</hdlc><cnl> No.</cnl><cn> 503*</cn>
<gn> Polygala</gn><sp> ambigua,</sp><sa> Nutt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.
</lc><col> Legit</col><co> A. H. Curtiss.</co><dt>February</dt>…
24.09.2008 DCMA 2008 2
The problem
• >1 billion natural history specimens
• Collected over 250 years, in many languages
• No publishing standards
• Near-infinite classes
  – Your high school teacher lied
• 6 min/label × 1B labels = 100M hours
• Saving 1 min per label = 16.7 million hours
• At $10/hr = $167,000,000
• 1/4790 of the U.S. deregulation financial bailout
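The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch using the slide's own figures; the constant names are illustrative, and it assumes one label per specimen.

```python
# Back-of-the-envelope estimate using the figures quoted on the slide.
LABELS = 1_000_000_000      # >1 billion specimens, assuming one label each
MINUTES_PER_LABEL = 6       # manual transcription time per label
HOURLY_RATE_USD = 10        # labor rate used on the slide

# Total manual effort, converted from minutes to hours.
total_hours = LABELS * MINUTES_PER_LABEL / 60

# Effect of shaving one minute off every label.
hours_saved = LABELS * 1 / 60
dollars_saved = hours_saved * HOURLY_RATE_USD

print(f"Total effort:  {total_hours:,.0f} hours")
print(f"Hours saved:   {hours_saved:,.0f}")
print(f"Dollars saved: ${dollars_saved:,.0f}")
```

The slide rounds the last two figures to 16.7 million hours and $167,000,000.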
Why care
• Historic distribution of species
• Ecological niche modeling (invasiveness, crop hardiness, pest potential)
• Projections of the impact of climate change
• Where did the explorer go? (for error detection)
• Will I see a Kirtland's Warbler here?
• Do tamaracks grow in sand?
• When did linden trees bloom before the industrial revolution?
A real-life example: Baronia brevicornis and its single food plant, Acacia cochliacantha (Soberon)
B. brevicornis Abiotic Niche using BS GARP
II: Estimating the “Area of Accessibility” (Soberon)
• From where? What is the initial condition?
• At what scale? In relation to what vagility parameters?
• At certain scales, one can assume that biogeography is a good surrogate for the accessibility area. That is, we assume that if a species is present in a given biogeographical region, it can reach all of it.
Natural History Specimens
Sample records
Sample OCR Output
Yale University Herbarium
~r-^""" r-n-------
YU.001300
Curtisb, North American Pl
C^o.nr r^-n
ANTS,
No. 503* "^
Polygala ambigna, Nntt., var.
Coral soil, Cudjoe Key, South Florida.
Legit A. H. Curtiss.
Label Labels
• bc - barcode
• bt - barcode text
• cm - common/colloquial name
• cn - collection number
• co - collector
• cd - collection date
• fm - family name
• ft - footer info
Label Labels
• gn - genus name
• hd - header info
• in - infra name
• ina - infra name author
• lc - location
• pd - plant description
• sa - scientific name author
• sp - species name
Example Training Record
<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://www3.isrl.uiuc.edu/~TeleNature/Herbis/semanticrelax.rng" type="xml"?>
<labeldata>
<bt>Yale University Herbarium
</bt><ns> ~r-^""" r-n------</ns><bc> YU.001300
</bc><co cc="Curtiss"> Curtisb, </co><hdlc cc="North American Plants"> North American Pl
</hdlc><ns>C^o.nr r^-n
ANTS,</ns>
<cnl> No.</cnl><cn> 503*</cn><ns> "^</ns>
<gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.
</lc><col> Legit</col><co> A. H. Curtiss.</co>
</labeldata>
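A record in this shape can be read with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a trimmed copy of the record (the `ns` noise elements and the two processing instructions are left out for brevity). It assumes that the `cc` attribute carries the human-corrected text for an element; the slide does not say so explicitly. The function name `read_fields` is illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the training record shown on the slide.
RECORD = """<labeldata>
<bt>Yale University Herbarium
</bt><bc> YU.001300
</bc><co cc="Curtiss"> Curtisb, </co><hdlc cc="North American Plants"> North American Pl
</hdlc><cnl> No.</cnl><cn> 503*</cn>
<gn> Polygala</gn><sp> ambigna,</sp><sa> Nntt.,</sa><val> var.</val>
<hb> Coral soil,</hb><lc> Cudjoe Key, South Florida.
</lc><col> Legit</col><co> A. H. Curtiss.</co>
</labeldata>"""


def read_fields(xml_text):
    """Return (tag, OCR text, corrected text or None) for each element."""
    root = ET.fromstring(xml_text)
    fields = []
    for element in root:
        # Collapse the line breaks and padding that the OCR markup preserves.
        ocr_text = " ".join((element.text or "").split())
        # Assumption: "cc" holds the corrected form of the OCR text.
        fields.append((element.tag, ocr_text, element.get("cc")))
    return fields


fields = read_fields(RECORD)
for tag, ocr_text, corrected in fields:
    print(tag, repr(ocr_text), corrected)
```

Each tuple pairs an element code from the "Label Labels" slides with the raw OCR text, which is the form a token classifier would train on.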
Supervised Learning Framework
[Diagram. Training Phase: Gold Classified Labels, Machine Learner, Trained Model. Application Phase: Unclassified Labels, Segmentation, Segmented Text, Machine Classifier, Silver Classified Labels, Human Editing.]
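The training and application phases can be illustrated with a very small Naive Bayes token classifier (NB is one of the two learners compared later in the deck). This is a sketch under stated assumptions, not the Herbis implementation: the `features` function and its three features are hypothetical, and the toy gold data reuses only the tokens from the title-slide excerpt.

```python
import math
from collections import Counter, defaultdict


def features(token):
    """Hypothetical token features: normalized word plus two shape cues."""
    return [
        "word=" + token.lower().strip(".,"),
        "capitalized=" + str(token[:1].isupper()),
        "has_digit=" + str(any(ch.isdigit() for ch in token)),
    ]


def train(gold):
    """Training phase: count features per label from (token, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for token, label in gold:
        label_counts[label] += 1
        for feat in features(token):
            feature_counts[label][feat] += 1
            vocabulary.add(feat)
    return label_counts, feature_counts, vocabulary


def classify(model, token):
    """Application phase: pick the label with the highest log posterior."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        # Add-one smoothing so unseen features do not zero out a label.
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for feat in features(token):
            score += math.log((feature_counts[label][feat] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy gold data taken from the tagged excerpt on the title slide.
GOLD = [
    ("Polygala", "gn"), ("ambigua,", "sp"), ("Nutt.,", "sa"),
    ("No.", "cnl"), ("503*", "cn"), ("Legit", "col"), ("Curtiss.", "co"),
]
model = train(GOLD)
print(classify(model, "Polygala"))
print(classify(model, "77"))
```

In the framework on the slide, the output of `classify` corresponds to the silver classified labels, which a human editor would then correct.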
Herbis Experimental Data
• 295 marked-up records
• 74 label states
• 5-fold cross-validation
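For reference, a 5-fold split of 295 records can be sketched as follows. The slides do not say how the folds were drawn, so this assumes contiguous folds of near-equal size; in practice the records would usually be shuffled first. The function name is illustrative.

```python
def k_fold_indices(n_records, k=5):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    # Spread any remainder over the first few folds.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, test
        start += size


folds = list(k_fold_indices(295, k=5))
for train, test in folds:
    print(len(train), len(test))
```

With 295 records, each fold tests on 59 records and trains on the remaining 236.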
Performances of NB and HMM
[Bar chart: F-score (0% to 100%) per element for NB and HMM. Elements: bc, bt, cd, cdl, cm, cml, cn, cnl, co, col, ct, dtl, fm, fml, gn, hb, hbl, hdlc, in, latlon, lc, lcl, pd, sa, snl, sp.]
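A per-element F-score of the kind plotted here can be computed as below. This sketch assumes the balanced F-score (F1) evaluated token by token; the slides do not state the exact variant or unit of evaluation, and the example labels are made up for illustration.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


def per_element_f_scores(gold, predicted):
    """gold, predicted: equal-length lists of element codes, one per token."""
    scores = {}
    for element in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == element and p == element for g, p in pairs)
        fp = sum(g != element and p == element for g, p in pairs)
        fn = sum(g == element and p != element for g, p in pairs)
        scores[element] = f_score(tp, fp, fn)
    return scores


# Illustrative labels only, not data from the experiment.
gold = ["gn", "sp", "sa", "lc", "lc", "co"]
predicted = ["gn", "sp", "sp", "lc", "co", "co"]
scores = per_element_f_scores(gold, predicted)
for element, score in scores.items():
    print(f"{element}: {score:.2f}")
```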
Improved Performance With Field Element Identifiers
[Bar chart: Improved Performance With FEI Encoding. F-score difference (-30% to 70%) per element.]
Learning w/ pre-categorization
[Diagram. Training: Gold Labels, Categorization into Class 1 Labels, Class 2 Labels, ..., Class n Labels; one Machine Learner per class, producing Model 1, Model 2, ..., Model n. Application: Unclassified Labels, Categorization into Class 1 Labels, Class 2 Labels, ..., Class n Labels; Machine Classification per class, producing Classified Labels.]
FIG. 5. Improved Performance of Specialist Model
Specialist 100 Curtiss vs. 100 General
Future Work
• Community Learning Models
• Label records might be processed in different orders to maximize learning and minimize error rate.
• OCR correction might be improved using context-dependent information. Context-dependent correction means performing the correction after the word's class is known. For example, the word “Ourtiss” should be corrected to “Curtiss”. If the system has already identified “Ourtiss” as a collector, it can use the smaller collector dictionary instead of a much larger general dictionary to do the correction.
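The context-dependent correction idea can be sketched with Python's standard `difflib` module. The per-class dictionaries below are hypothetical placeholders (a real system would load far larger lists), and the similarity cutoff of 0.7 is an arbitrary choice, not a value from the slides.

```python
import difflib

# Hypothetical per-class dictionaries, keyed by element code.
CLASS_DICTIONARIES = {
    "co": ["Curtiss", "Nuttall", "Chapman"],   # collectors
    "gn": ["Polygala", "Quercus", "Carex"],    # genus names
}


def correct_word(word, element, cutoff=0.7):
    """Return the closest entry in the element's dictionary, else the word."""
    candidates = CLASS_DICTIONARIES.get(element, [])
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else word


print(correct_word("Ourtiss", "co"))
print(correct_word("Ourtiss", "gn"))
```

Once the token is known to be a collector, the search is restricted to the collector list, so “Ourtiss” resolves to “Curtiss”. Looked up against the genus list instead, nothing is close enough and the word is left unchanged.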
Community Learning Models
[Diagram. Training Phase: Gold Classified Labels, Machine Learner, Trained Model. Application Phase: Unclassified Labels, Segmentation, Segmented Text, Machine Classifier, Silver Classified Labels, Human Editing.]
Many thanks to Qin Wei
Ask Questions
Human Learning
Listen