The use of OCR in the digitisation of herbarium specimens
![Page 1: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/1.jpg)
The use of OCR in the digitisation of herbarium specimens
Robyn E Drinkwater, Robert Cubey & Elspeth Haston
![Page 2: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/2.jpg)
What is happening in digitisation?
![Page 3: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/3.jpg)
![Page 4: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/4.jpg)
• … and these minimal data records are going to need data added to them.
![Page 5: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/5.jpg)
What are the options when using optical character recognition (OCR)?
• Parse OCR text directly into the database fields
• Use OCR data to prepare the specimens for manual / semi-automated data entry
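As a rough illustration of the first option, parsing OCR text straight into fields might look like the sketch below. The field names, the label layout and the regular expressions are hypothetical assumptions made for this example; they do not describe the RBGE database or the format of any ABBYY output.

```python
import re

# Hypothetical patterns for a typed label; real labels vary widely
# and handwritten ones may not yield usable OCR text at all.
FIELD_PATTERNS = {
    "collector": re.compile(
        r"(?:Coll\.|Leg\.|Collector)[:\s]+(?P<value>[^\n,;]+)", re.IGNORECASE
    ),
    "collection_date": re.compile(
        r"(?P<value>\b\d{1,2}[./ ]\d{1,2}[./ ]\d{4}\b)"
    ),
}


def parse_label(ocr_text):
    """Return a dict of field name -> extracted value, or None if not found."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(ocr_text)
        record[field] = match.group("value").strip() if match else None
    return record


# Made-up label text, for illustration only.
sample = "Flora of Somewhere\nColl.: A. Example\n12.05.1957"
print(parse_label(sample))
```

Patterns like these break as soon as the label layout changes, which is one reason the second option, using the OCR text only to prepare specimens for data entry, can be easier to apply.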
![Page 6: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/6.jpg)
• We have been running a project at RBGE to digitise all the specimens from SW Asia and the Middle East.
• Minimal data had been captured originally*
  – Filing name
  – Geographical filing region
  – Barcode
• We have been routinely processing all our specimen images through ABBYY OCR software.
* E Haston, R Cubey, DJ Harris (2011). Data concepts and their relevance for data capture in large scale digitisation of biological collections. International Journal of Humanities and Arts Computing 6 (1-2), 111-119.
![Page 7: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/7.jpg)
Exploring the data…
![Page 8: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/8.jpg)
Step One
• We used the OCR output text to pull out over 7,000 specimen images and associated data records
• These were then prepared into batches:
  – some random
  – some sorted by collector and / or country
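The filtering and batching in Step One can be sketched as a simple search over the OCR text of each specimen. This is a minimal sketch under stated assumptions: the dictionary of OCR text keyed by barcode, the term lists and the function names are all hypothetical, and plain substring matching stands in for whatever search method was actually used.

```python
from collections import defaultdict


def find_term(ocr_text, terms):
    """Return the first term found in the OCR text (case-insensitive), else None."""
    lowered = ocr_text.lower()
    for term in terms:
        if term.lower() in lowered:
            return term
    return None


def batch_specimens(ocr_by_barcode, collectors, countries):
    """Group barcodes into batches keyed by (collector, country).

    Specimens where neither term is found end up in the (None, None) batch.
    """
    batches = defaultdict(list)
    for barcode, text in ocr_by_barcode.items():
        key = (find_term(text, collectors), find_term(text, countries))
        batches[key].append(barcode)
    return dict(batches)


# Made-up barcodes and label text, for illustration only.
ocr_by_barcode = {
    "E001": "Leg. A. Example, Turkey",
    "E002": "unreadable",
    "E003": "TURKEY coll. A. Example",
}
print(batch_specimens(ocr_by_barcode, ["A. Example"], ["Turkey", "Iran"]))
```

Each batch can then be handed to a digitiser as a set of specimens sharing a collector, a country, or both.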
![Page 9: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/9.jpg)
Step Two
• A team of six digitisers at RBGE completed a series of trials
• They used two different protocols for data entry:
  – complete records
  – partial records (including collector and geographical information but not habitat and description)
• In total 7,200 specimens were processed
![Page 10: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/10.jpg)
Results…
• Compared to unsorted, random specimens, those which were sorted based on data from the OCR output were quicker to digitise
• Of the methods tested here, the most efficient used a protocol based on partial data entry, working with specimens which had been filtered by Collector and Country
![Page 11: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/11.jpg)
The human factor…
Thinking about the ease of entering the data for each test, rate them on their relative ease of use
[Chart: percentage of responses (0% to 100%) at each rating, from 1 (easiest) to 5 (hardest), for each test: Random 1, Collector, Country, Collector & Country, Collector & Country (OCR), Random 2]
![Page 12: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/12.jpg)
The human factor…
• Digitisation staff preferred working with sorted specimens
• They also preferred working with physical specimens rather than images
![Page 13: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/13.jpg)
Some more thoughts…
• This work is more easily applied than parsing data from the OCR output
• It can be used in conjunction with other tools later in the digitisation process, since these other processes will almost certainly be more efficient with sorted batches of specimens
• Other tasks can also be built on top of this, e.g. condition assessment, QC, etc.
![Page 14: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/14.jpg)
• It’s surprising what can be used to help filter specimens – the black art of search terms!
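One reason search terms can feel like a black art is that OCR output often contains misread characters, so an exact search may miss a label it should find. The sketch below shows one possible way to tolerate such errors, using approximate matching from Python's standard `difflib` module. The function name, the cutoff value and the sample text are assumptions for illustration, not the approach used in the trials.

```python
import difflib
import re


def fuzzy_hits(ocr_text, search_terms, cutoff=0.8):
    """Map each search term to the closest word in the OCR text, if any.

    Only single-word terms are handled; a term is reported when some word
    in the text has a similarity ratio of at least `cutoff`.
    """
    words = re.findall(r"[A-Za-z]+", ocr_text.lower())
    hits = {}
    for term in search_terms:
        close = difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff)
        if close:
            hits[term] = close[0]
    return hits


# Made-up OCR text with an "i" misread as "l", for illustration only.
print(fuzzy_hits("Afghanlstan, Kabul prov.", ["Afghanistan", "Turkey"]))
```

A lower cutoff finds more misread labels but also more false matches, so the value would need tuning against a sample of real OCR output.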
![Page 15: The use of OCR in the digitisation of herbarium specimens](https://reader035.vdocuments.us/reader035/viewer/2022062316/56816932550346895de082cb/html5/thumbnails/15.jpg)
Acknowledgments
• The digitisation team at RBGE: Nicky Sharp, David Braidwood, Muhammad Ghazali, Lorna Glancy, Dorota Jaworska, Esther Nieto.
• The Andrew W Mellon Foundation
• Dr Antje Ahrends (RBGE) & Dr Chris Glasbey (BioSS) for statistical advice