![Page 1: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/1.jpg)
OCR implementation in The Caribbean Plants Digitization Project
A project to image and catalog over 150,000 Caribbean specimens at the New York Botanical Garden
*Legend: estimated number of specimens per country
The New York Botanical Garden
Presented by: Stephen Gottschalk
![Page 2: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/2.jpg)
NYBG’s Caribbean Collections:
• More than 100 expeditions sponsored by the Garden since 1895.
• Notable and prolific collections by current and former Garden staff, including the Garden’s founder, Nathaniel Lord Britton.
• Approximately 75% of the specimen data could be digitized from field books at NYBG and other institutions, or from published itineraries that provide the same information.
![Page 3: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/3.jpg)
Caribbean Project workflow summary:
• Curation and rapid barcoding of specimens
• Specimen imaging
• Optical Character Recognition (OCR) and data parsing
• Field book entries
• Manual keying of specimen data
• Specimen Catalog Record
![Page 4: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/4.jpg)
Sample ideal fieldbook:
• Plant family
• Plant description
• Collection locality
• No. of duplicates
• Determination
• Collection no.
• Collection date
• Habitat
![Page 5: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/5.jpg)
Sample fieldbook - the product:
![Page 6: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/6.jpg)
Sample Caribbean fieldbooks, less than ideal:
Vol. 132, J. A. Shafer, 1909
Vol. 69, Van Hermann, 1904
![Page 7: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/7.jpg)
OCR assists in attaching fieldbook records:
• OCR-derived fields
• Fieldbook entries
• User input
• IRN
![Page 8: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/8.jpg)
Using OCR to populate fields:
• Query raw OCR to extract records of a given label type
• Python script finds line of query term
• User detects pattern to update fields
Example:
SELECT * FROM OCR_all WHERE label LIKE "*New*Yor*Bot*Gar*Exp*Cub*";
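The wildcard query can also be mimicked in Python, which may help when the raw OCR lives outside a database. This is a minimal sketch, not the project's code: the `wildcard_filter` helper, the `barcode` and `label` keys, and the sample rows are all hypothetical. It assumes the short word fragments in the query are meant to tolerate OCR errors in the remaining letters.

```python
import fnmatch
import re


def wildcard_filter(records, pattern):
    """Return records whose 'label' text matches a *-wildcard pattern, ignoring case."""
    # fnmatch.translate converts a shell-style pattern into a regular expression.
    regex = re.compile(fnmatch.translate(pattern), re.IGNORECASE)
    return [rec for rec in records if regex.match(rec["label"])]


# Hypothetical rows standing in for the OCR_all table; note the OCR errors.
ocr_all = [
    {"barcode": "A1", "label": "NEW YORX BOTANlCAL GARDEN\nExploration of Cuba\nCol. J. A. Shafer"},
    {"barcode": "A2", "label": "Plantae Cubenses Wrightianae"},
]

hits = wildcard_filter(ocr_all, "*New*Yor*Bot*Gar*Exp*Cub*")
print([rec["barcode"] for rec in hits])
```

Only the first row matches, despite the misread letters, because every fragment still appears in order.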
![Page 9: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/9.jpg)
Using OCR to populate fields:
• Query raw OCR to extract records of a given label type
• Python script finds line of query term
• User detects pattern to update fields
Example: Return line containing “Col”
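The line-finding step might look roughly like the sketch below. The slides do not show the actual script, so the function name `lines_containing` and the sample label text are assumptions made for illustration.

```python
def lines_containing(ocr_text, term):
    """Return the lines of one label's OCR text that contain `term`, ignoring case."""
    needle = term.lower()
    return [line.strip() for line in ocr_text.splitlines() if needle in line.lower()]


# Hypothetical OCR text for a single label.
label_text = (
    "NEW YORK BOTANICAL GARDEN\n"
    "Exploration of Cuba\n"
    "Col. J. A. Shafer   No. 1234\n"
    "April 1909"
)

print(lines_containing(label_text, "Col"))
```

Returning a list rather than a single line keeps labels with more than one matching line visible to the person reviewing the output.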
![Page 10: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/10.jpg)
Using OCR to populate fields:
• Query raw OCR to extract records of a given label type
• Python script finds line of query term
• User detects pattern to update fields
Example:
• Length of string
• Find position of “j. a.”
• Find “sha”
• Find “afer”
J. A. Shafer collections!
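The checks listed in this example (string length, position of “j. a.”, presence of “sha” or “afer”) can be expressed with plain string searches. The sketch below is one possible reading, not the method used in the project: the rule that combines the checks (initials plus at least one surname fragment) and both function names are assumptions.

```python
def fragment_positions(line, fragments=("j. a.", "sha", "afer")):
    """Return the string length and the position of each fragment (-1 if absent)."""
    text = line.lower()
    positions = {frag: text.find(frag) for frag in fragments}
    return len(text), positions


def looks_like_shafer(line):
    """Assumed rule: initials present plus at least one surname fragment."""
    length, pos = fragment_positions(line)
    if length == 0:
        return False
    return pos["j. a."] >= 0 and (pos["sha"] >= 0 or pos["afer"] >= 0)


# Hypothetical collector lines, two of them with OCR errors in the surname.
for line in ["Col. J. A. Shafer", "Col. J. A. Shaler", "Col. J. A. 5hafer", "Col. F. S. Earle"]:
    print(line, fragment_positions(line), looks_like_shafer(line))
```

Testing two overlapping surname fragments means a single misread character in “Shafer” does not lose the record.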
![Page 11: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/11.jpg)
Avoid false positives:
F. S. Earle – no!
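The slide does not say why the F. S. Earle label matched. One hypothetical cause is a Shafer mention elsewhere on the label, for example in a determination line. Under that assumption, a simple guard is to test only the line that contains the collector term. The sketch below illustrates the idea; the function names and sample labels are invented.

```python
def matches_shafer(text):
    """Loose test: initials plus a surname fragment anywhere in the given text."""
    text = text.lower()
    return "j. a." in text and ("sha" in text or "afer" in text)


def is_shafer_collection(ocr_text, line_term="col"):
    """Stricter test: apply the match only to lines containing the collector term."""
    return any(
        matches_shafer(line)
        for line in ocr_text.splitlines()
        if line_term in line.lower()
    )


# Hypothetical labels: Earle collected the first, Shafer only determined it.
earle_label = "Exploration of Cuba\nCol. F. S. Earle\nDet. J. A. Shafer"
shafer_label = "Exploration of Cuba\nCol. J. A. Shafer"

print(matches_shafer(earle_label))         # whole-label test
print(is_shafer_collection(earle_label))   # collector-line test
print(is_shafer_collection(shafer_label))
```

The whole-label test accepts the Earle label, while the collector-line test rejects it and still accepts the Shafer label.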
![Page 12: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/12.jpg)
Consider pattern training and a second OCR pass:
Wright Labels, 162 total, generally low quality:
[Chart: percentage correctly OCR’d (0% to 90%) against OCR pattern training used, for the strings "Plantæ", "Cubenses", "Wright-ianæ", and the full string]
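A “percentage correctly OCR’d” figure of this kind can be computed by counting how many labels contain a target string after each OCR pass. The sketch below assumes an exact, case-sensitive substring match, which may differ from how the chart was scored. The function name and the sample OCR output are hypothetical.

```python
def percent_containing(ocr_texts, target):
    """Percentage of OCR texts that contain `target` exactly (case-sensitive)."""
    if not ocr_texts:
        return 0.0
    hits = sum(1 for text in ocr_texts if target in text)
    return 100.0 * hits / len(ocr_texts)


# Hypothetical OCR output for four labels from one OCR pass.
wright_ocr = [
    "Plantæ Cubenses Wrightianæ",
    "Plantae Cubenses Wrightianæ",
    "P1antæ Cubenses Wrightianæ",
    "Plantæ Gubenses Wrightianæ",
]

for target in ["Plantæ", "Cubenses", "Plantæ Cubenses Wrightianæ"]:
    print(target, percent_containing(wright_ocr, target))
```

Running the same function on the output of each training configuration gives one bar per configuration and target string.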
![Page 13: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/13.jpg)
Consider pattern training and a second OCR pass:
Zanoni Labels, 114 total, generally typed:
[Chart: percentage correctly OCR’d (0% to 100%) against OCR pattern training used (built-in, trained once, trained mult, trained other, trained both), for the following strings]
• "Moscoso"
• "Rafael"
• "Zanoni"
• Full heading: Jardin Botan-ico Nacional "Dr. Rafael M. Moscoso"
• Heading with “ " . ” punctuation stripped: Jardin Botan-ico Nacional Dr Rafael M Moscoso
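Stripping the quotation marks and periods before comparing, as done for the heading here, makes the match tolerant of punctuation that OCR tends to drop or misread. The sketch below is an assumption about how that could be done in Python; the helper name and the sample OCR line are hypothetical.

```python
def strip_punctuation(text, chars='".'):
    """Remove every double quote and period from the text."""
    return text.translate(str.maketrans("", "", chars))


heading = 'Jardin Botan-ico Nacional "Dr. Rafael M. Moscoso"'

# Hypothetical OCR line that lost one period and the closing quote.
ocr_line = 'Jardin Botan-ico Nacional "Dr Rafael M. Moscoso'

print(strip_punctuation(heading))
print(heading in ocr_line)                                        # raw comparison
print(strip_punctuation(heading) in strip_punctuation(ocr_line))  # stripped comparison
```

The raw comparison fails on the damaged line, while the stripped comparison succeeds.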
![Page 14: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/14.jpg)
Closing thoughts:
• OCR plus human parsing works well with very little programming.
• Works well for large, self-contained data sets, but maybe not for partial or changing data sets; automation would help address this.
• Allows for creation of “digital” fieldbooks (i.e., ordered by collector, collection number, and place).
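A “digital” fieldbook ordering can be sketched as a sort over parsed records. The record layout, the key function, and the sample values below are hypothetical. The sketch assumes collection numbers may carry non-digit characters and should sort numerically, so that 99 comes before 1234.

```python
def fieldbook_key(record):
    """Sort key: collector, then collection number as an integer, then place."""
    digits = "".join(ch for ch in record["number"] if ch.isdigit())
    return (record["collector"], int(digits) if digits else 0, record["place"])


# Hypothetical parsed records; numbers and localities are invented.
records = [
    {"collector": "J. A. Shafer", "number": "1234", "place": "Locality B"},
    {"collector": "F. S. Earle", "number": "77", "place": "Locality A"},
    {"collector": "J. A. Shafer", "number": "No. 99", "place": "Locality C"},
]

digital_fieldbook = sorted(records, key=fieldbook_key)
for rec in digital_fieldbook:
    print(rec["collector"], rec["number"], rec["place"])
```

Sorting on the integer value rather than the raw string avoids the text ordering in which "1234" would come before "99".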
![Page 15: OCR implementation in The Caribbean Plants Digitization Project](https://reader035.vdocuments.us/reader035/viewer/2022062811/56815f60550346895dce46c9/html5/thumbnails/15.jpg)
Acknowledgements
National Science Foundation
Barbara Thiers, Jacquelyn Kallunki, Michael Bevans, Anthony Kirchgessner, Melissa Tulig, Benito Santos, Nicole Tarnowsky, Tom Zanoni, Benjamin Saracco, Stephen Sinon, Vinson Doyle, Jessica Allen, Sarah Dutton, Lane Gibbons, Elizabeth Kiernan, Brandy Watts, Charles Zimmerman
Visit the Virtual Herbarium: http://sciweb.nybg.org/science2/vii2.asp