![Page 1: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/1.jpg)
Computer Science Research for Family History and Genealogy
David W. Embley
Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan
William A. Barrett
Computer Graphics, Vision, & Image Processing Laboratory
Neural Networks and Machine Learning Laboratory
Data Extraction and Integration Laboratory
Laboratory for Information, Collaboration, & Interaction Environments
Performance Evaluation Laboratory
Data and Software Engineering Laboratory
www.cs.byu.edu/familyhistory
![Page 2: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/2.jpg)
The Problem
• 2.5 million rolls of microfilm
• Assuming 1000 images per roll
• 2.5 billion images
Is there a way to automatically extract this information?
![Page 3: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/3.jpg)
A (Possible) Solution
• Input: Images of Microfilmed Records
  – Table Recognition (Heath Nielson)
  – Old-Text Recognition (Mike Rimer)
  – Handwriting Recognition (Luke Hutchison)
  – Record Extraction & Organization (Ken Tubbs)
  – Just-in-Time Browsing (Doug Kennard)
  – Visualization (Tom Finnigan)
• Output: Organized Genealogical Information
Let a computer do the extraction work.
![Page 4: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/4.jpg)
Zoning: General Overview
• Find the lines in the document using the horizontal and vertical profiles of the image.
• Apply a matched filter to the profiles to identify the line signatures.
• Recursively divide the document into separate pieces, analyzing each piece for lines.
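The three zoning steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a binary image stored as nested lists (1 = ink), ruled lines that are one pixel thick, and a simple (-1, 2, -1) kernel standing in for the matched filter. All function names and the threshold are hypothetical.

```python
def profiles(region):
    """Return (row_profile, col_profile): ink counts per row and per column."""
    row_profile = [sum(row) for row in region]
    col_profile = [sum(col) for col in zip(*region)]
    return row_profile, col_profile


def matched_filter(profile, kernel):
    """Correlate a profile with a kernel (zero padded, same-length output)."""
    half = len(kernel) // 2
    response = []
    for i in range(len(profile)):
        acc = 0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):
                acc += weight * profile[j]
        response.append(acc)
    return response


def find_lines(profile, span, kernel=(-1, 2, -1), min_fraction=0.8):
    """Indices whose filter response looks like a thin line covering most of `span`."""
    threshold = 2 * min_fraction * span  # assumed threshold, tuned for the kernel above
    return [i for i, r in enumerate(matched_filter(profile, kernel)) if r >= threshold]


def zone(image, top, left, bottom, right, min_size=3):
    """Recursively split a region at detected lines; return leaf zones
    as (top, left, bottom, right) with exclusive bottom/right."""
    height, width = bottom - top, right - left
    if height < min_size or width < min_size:
        return [(top, left, bottom, right)]
    region = [row[left:right] for row in image[top:bottom]]
    row_profile, col_profile = profiles(region)
    h_lines = find_lines(row_profile, width)
    if h_lines:
        cuts = [-1] + h_lines + [height]
        pieces = [(top + a + 1, left, top + b, right) for a, b in zip(cuts, cuts[1:])]
    else:
        v_lines = find_lines(col_profile, height)
        if not v_lines:
            return [(top, left, bottom, right)]  # no lines left: this is a leaf zone
        cuts = [-1] + v_lines + [width]
        pieces = [(top, left + a + 1, bottom, left + b) for a, b in zip(cuts, cuts[1:])]
    zones = []
    for t, l, b, r in pieces:
        if b > t and r > l:  # skip empty strips between adjacent lines
            zones.extend(zone(image, t, l, b, r, min_size))
    return zones
```

For a 7 x 7 image with one full-width horizontal line and one full-height vertical line crossing in the middle, `zone(image, 0, 0, 7, 7)` returns the four 3 x 3 quadrants. Real microfilm images would need thicker-line kernels and noise tolerance.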
![Page 5: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/5.jpg)
![Page 6: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/6.jpg)
![Page 7: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/7.jpg)
Zone Classification: Machine vs. Handwriting
• Machine printed text is consistent/regular.
• Handwriting is irregular.
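One way to turn the regular/irregular distinction into a number is sketched below. It is an illustrative assumption, not the classifier used in this work: it takes character heights and inter-character gaps (already measured for a zone) and compares their coefficient of variation against a hypothetical threshold.

```python
from statistics import mean, pstdev


def regularity(values):
    """Coefficient of variation: lower values mean more regular measurements."""
    m = mean(values)
    return pstdev(values) / m if m else float("inf")


def classify_zone(char_heights, char_gaps, threshold=0.25):
    """Label a zone 'machine' when heights and gaps are both consistent.

    The 0.25 threshold is an assumed value for illustration only.
    """
    score = max(regularity(char_heights), regularity(char_gaps))
    return "machine" if score < threshold else "handwriting"


print(classify_zone([10, 10, 11, 10], [3, 3, 3, 3]))
print(classify_zone([8, 14, 10, 19], [2, 7, 3, 11]))
```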
![Page 8: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/8.jpg)
Document Templates
• Images are not ideal.
  – Results in incorrect zoning and classification.
• Form layout is the same across documents.
  – Features missed in one image are found in another.
• Build a template of the document’s form by using several documents.
  – Provides robustness and increases accuracy.
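The idea of pooling several documents into one template can be sketched as a simple vote over detected line positions. This is an assumed formulation for illustration: it presumes the images are already aligned, and the function name, tolerance, and support fraction are hypothetical.

```python
def build_template(detections, tolerance=2, min_support=0.5):
    """Merge per-document line positions into a template.

    detections: one list of detected line positions per document.
    A position is kept when documents making up at least `min_support`
    of the set report a line within `tolerance` pixels of its neighbors.
    """
    tagged = sorted((pos, doc) for doc, positions in enumerate(detections)
                    for pos in positions)
    clusters = []
    for pos, doc in tagged:
        if clusters and pos - clusters[-1][-1][0] <= tolerance:
            clusters[-1].append((pos, doc))   # close to the previous detection
        else:
            clusters.append([(pos, doc)])     # start a new candidate line
    template = []
    for cluster in clusters:
        supporting_docs = {doc for _, doc in cluster}
        if len(supporting_docs) / len(detections) >= min_support:
            template.append(round(sum(pos for pos, _ in cluster) / len(cluster)))
    return template


# A line missed in the second document (near 90) is recovered from the others,
# while a spurious line seen only once (130) is dropped.
print(build_template([[10, 50, 90], [11, 49], [10, 51, 90, 130]]))
```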
![Page 9: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/9.jpg)
Document Templates
![Page 10: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/10.jpg)
Zoned Image
![Page 11: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/11.jpg)
Automated Text Recognition
![Page 12: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/12.jpg)
Word Segmentation
![Page 13: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/13.jpg)
Letter Segmentation
![Page 14: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/14.jpg)
Optical Character Recognition
![Page 15: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/15.jpg)
Handwriting Recognition
![Page 16: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/16.jpg)
Handwriting Recognition
• The Task
  – Online handwriting recognition
    • The writer's pen movements are captured
    • Velocity, acceleration, stroke order are available
  – Offline handwriting recognition
    • Page was previously written and scanned
    • Only pixel color information available
• Genealogical records are all offline
• Offline is harder, but doable
Mary
![Page 17: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/17.jpg)
Handwriting Recognition
• Can we just convert offline data into (simulated) online data?
  – Yes, although difficult to do reliably:
    • What order were the strokes written in?
    • Doubled-up line segments? Ink blobs? Spurious joins between letters? Missing joins?
  – Inferring online data (e.g. stroke ordering) could be crucial to success
  – Demonstrated to be solvable with reasonable reliability
![Page 18: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/18.jpg)
Handwriting Recognition
• An example of some steps in the analysis process:
– Contour extraction
– Midline determination
– Stroke ordering
![Page 19: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/19.jpg)
Handwriting Recognition
• An example of some steps in the recognition process:
– Handwriting style clustering
– Letter recognition
– Approximate string matching
[Figure labels: letter candidates "nr?" and "m?"; candidate names: Smith, Smythe]
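The approximate string matching step can be illustrated with a standard Levenshtein edit distance against a list of known names. This is a generic sketch, not necessarily the matcher used in this work; the lexicon and the misread input "Srnith" (an "rn" read where "m" was written) are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def closest_name(recognized, lexicon):
    """Return the lexicon entry with the smallest edit distance to the input."""
    return min(lexicon, key=lambda name: edit_distance(recognized, name))


print(edit_distance("Smith", "Smythe"))  # 2: substitute i->y, insert e
print(closest_name("Srnith", ["Smith", "Smythe", "Jones"]))
```

A production matcher would likely weight confusable letter shapes (such as "rn" and "m") more cheaply than arbitrary substitutions.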
![Page 20: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/20.jpg)
Automatic Record Extraction
![Page 21: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/21.jpg)
Extraction Algorithm
1. Identify the Geometric Structure
2. Identify the Type of Information
3. Identify the Attribute-Value pairs
4. Identify the Record Boundaries
![Page 22: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/22.jpg)
Column-Row Recognition
![Page 23: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/23.jpg)
Genealogical Ontology
![Page 24: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/24.jpg)
Match Labels
ROAD, STREET, &c., And No. or NAME of HOUSE
Location
![Page 25: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/25.jpg)
Match Labels
NAME and Surname of each Person
Full Name
Location
![Page 26: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/26.jpg)
Match Labels
RELATION to Head of Family
Relationship
Location    Full Name
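The label-matching slides pair each column header with a concept from the genealogical ontology. A minimal sketch of that pairing is shown below, assuming a keyword-overlap score. The keyword sets are invented for illustration; the actual ontology and matching rules are not given in the slides.

```python
import re

# Hypothetical keyword sets for three ontology concepts named in the slides.
ONTOLOGY_KEYWORDS = {
    "Location": {"road", "street", "house"},
    "Full Name": {"name", "surname", "person"},
    "Relationship": {"relation", "head", "family"},
}


def match_label(header):
    """Return the concept whose keywords overlap the header most, or None."""
    words = set(re.findall(r"[a-z]+", header.lower()))
    scores = {concept: len(words & keywords)
              for concept, keywords in ONTOLOGY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


for header in ("ROAD, STREET, &c., And No. or NAME of HOUSE",
               "NAME and Surname of each Person",
               "RELATION to Head of Family"):
    print(header, "->", match_label(header))
```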
![Page 27: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/27.jpg)
Extract Records
Location    Full Name    Relationship
Collafer
![Page 28: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/28.jpg)
Extract Records
Location    Full Name     Relationship
Collafer    John Eyres    Head
![Page 29: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/29.jpg)
Extract Records
Location    Full Name      Relationship
Collafer    Annie Eyres    Wife
![Page 30: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/30.jpg)
Extract Records
Location    Full Name         Relationship
Collafer    Lehailes Eyres    Son
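The record-extraction slides show one location value shared by several people in the same household. A rough sketch of that step follows, assuming the table has already been recognized into rows of cell text and that the location cell is blank on continuation rows. The row layout and the `carry_down` parameter are assumptions made for illustration.

```python
def extract_records(rows, columns, carry_down=("Location",)):
    """Turn table rows into one record per person.

    rows: lists of cell strings, in the same order as `columns`.
    Concepts listed in `carry_down` inherit the last non-empty value above.
    """
    records = []
    last_seen = {}
    for row in rows:
        record = {}
        for concept, cell in zip(columns, row):
            value = cell.strip()
            if value:
                last_seen[concept] = value
            elif concept in carry_down:
                value = last_seen.get(concept, "")
            record[concept] = value
        records.append(record)
    return records


columns = ["Location", "Full Name", "Relationship"]
rows = [
    ["Collafer", "John Eyres", "Head"],
    ["", "Annie Eyres", "Wife"],
    ["", "Lehailes Eyres", "Son"],
]
for record in extract_records(rows, columns):
    print(record)
```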
![Page 31: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/31.jpg)
Web Query
John
Eyres
![Page 32: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/32.jpg)
Search Results
![Page 33: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/33.jpg)
Online Digital Microfilm: Problem
Many of the images we are interested in are quite large.
6048 x 4287 pixels
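To give a sense of scale, the short calculation below estimates the uncompressed size of one such image and how long it would take to download. The 8 bits per pixel and the 56 kbit/s link speed are assumptions chosen for illustration; neither figure comes from the slides.

```python
def raw_size_bytes(width, height, bits_per_pixel=8):
    """Uncompressed size of a raster image in bytes (rounded up)."""
    return (width * height * bits_per_pixel + 7) // 8


def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and compression."""
    return size_bytes * 8 / link_bits_per_second


size = raw_size_bytes(6048, 4287)  # 25,927,776 bytes at an assumed 8 bits per pixel
print(f"{size / 1e6:.1f} MB uncompressed")
print(f"{transfer_seconds(size, 56_000) / 60:.0f} minutes at an assumed 56 kbit/s")
```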
![Page 34: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/34.jpg)
What is Just-In-Time Browsing?
A method of quickly browsing digital images over the Internet which capitalizes on:
• Progressive Image Transmission:
  • Hierarchical Spatial Resolution
  • Progressive Bitplane Encoding
• JBIG Compressed Bitplanes
• Prioritized Regions of Interest
• User Interaction
![Page 35: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/35.jpg)
Sequential Transmission
Hierarchical PIT (Progressive Image Transmission)
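Hierarchical transmission sends a coarse version of the whole image first and refines it, instead of sending rows top to bottom. A minimal sketch of building such a resolution hierarchy is below, assuming a grayscale image as nested lists with even dimensions at each level and 2 x 2 averaging. The function names are hypothetical, and a real system would likely send only the refinement data at each level rather than whole images.

```python
def downsample(image):
    """Halve the resolution by averaging each 2x2 block (even dimensions assumed)."""
    return [
        [(image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) // 4
         for c in range(0, len(image[0]), 2)]
        for r in range(0, len(image), 2)
    ]


def pyramid(image, levels):
    """Return images ordered for transmission: coarsest first, full resolution last."""
    stack = [image]
    for _ in range(levels):
        stack.append(downsample(stack[-1]))
    return stack[::-1]


image = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 12, 12],
    [4, 4, 12, 12],
]
for level in pyramid(image, 2):
    print(len(level), "x", len(level[0]))
```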
![Page 36: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/36.jpg)
PIT Using Bitplane Method
1 Bitplane (2 levels of gray)
2 Bitplanes (4 levels of gray)
3 Bitplanes (8 levels of gray)
4 Bitplanes (16 levels of gray)
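With k bitplanes received, the viewer can show 2^k levels of gray, which is the progression listed above. The sketch below splits an 8-bit grayscale image into bitplanes (most significant first) and rebuilds an approximation from the first few. It is an illustration only: the function names are hypothetical, and the JBIG compression of each plane mentioned in the slides is not shown.

```python
def bitplanes(image, depth=8):
    """Split an image into `depth` binary planes, most significant bit first."""
    return [[[(pixel >> bit) & 1 for pixel in row] for row in image]
            for bit in range(depth - 1, -1, -1)]


def reconstruct(planes, depth=8):
    """Rebuild an approximation from the first len(planes) planes.

    Bits that have not arrived yet are treated as zero.
    """
    height, width = len(planes[0]), len(planes[0][0])
    out = [[0] * width for _ in range(height)]
    for index, plane in enumerate(planes):
        shift = depth - 1 - index
        for r in range(height):
            for c in range(width):
                out[r][c] |= plane[r][c] << shift
    return out


image = [[0, 37, 200, 255]]
planes = bitplanes(image)
for k in (1, 2, 3, 4):
    print(k, "bitplane(s),", 2 ** k, "gray levels:", reconstruct(planes[:k]))
```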
![Page 37: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/37.jpg)
Digital Microfilm Browser
![Page 38: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/38.jpg)
PAF – 5 Generation Pedigree
![Page 39: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/39.jpg)
PAF – 5 Generation Pedigree
![Page 40: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/40.jpg)
Gena: A 3D Genealogy Visualizer
![Page 41: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/41.jpg)
Concluding Remarks
Workshop: April 4, 2002 at BYU
www.cs.byu.edu/familyhistory
![Page 42: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/42.jpg)
Appendix
Categorized List of BYU Faculty Interests in Computer Science Research Topics that Support Technology for Family History and Genealogy
![Page 43: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/43.jpg)
Extraction from Digitized Images
• Scanning (Flanagan)
• Segmentation & Table Recognition (Barrett, Martinez)
• OCR for Old Type-Set Text (Martinez)
• Element Classification & Record Construction (Embley, Barrett, Martinez)
• Handwriting Recognition (Sederberg)
• Recognition of Hand-printed Text (Olsen, Barrett, Martinez)
![Page 44: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/44.jpg)
Extraction from Digital Data Sources
• Automatic Extraction from Semi-structured and Unstructured Sources (Embley, Martinez)
• Mappings from Heterogeneous Structured Source Views to Target Views (Embley)
• Individualized Source Views (Woodfield)
![Page 45: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/45.jpg)
Information Integration
• Definition of Ontological Expectations (Embley, Woodfield)
• Value Normalization (Woodfield)
• Object Identity & Data Merging (Embley, Sederberg)
• Managing Uncertainty (Embley, Woodfield, Martinez)
![Page 46: Computer Science Research for Family History and Genealogy](https://reader035.vdocuments.us/reader035/viewer/2022062323/56815c91550346895dca9f6a/html5/thumbnails/46.jpg)
Systems for Family History and Genealogy
• Storage of Large Volumes of Data (Flanagan)
• Distributed Storage (Woodfield)
• Indexing Original Documents (Martinez, Embley)
• Human-Computer Interaction (Olsen)
• Just-in-Time Browsing (Barrett, Olsen)
• Workflow for Directing Genealogical Work (Woodfield, Martinez, Embley)
• Notification Systems (Woodfield)
• Visualization (Sederberg)