Computer Science Research for Family History and Genealogy
David W. Embley
Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan
William A. Barrett
Computer Graphics, Vision, & Image Processing Laboratory
Neural Networks and Machine Learning Laboratory
Data Extraction and Integration Laboratory
Laboratory for Information, Collaboration, & Interaction Environments
Performance Evaluation Laboratory
Data and Software Engineering Laboratory
www.cs.byu.edu/familyhistory
The Problem
• 2.5 million rolls of microfilm
• Assuming 1000 images per roll
• 2.5 billion images
Is there a way to automatically extract this information?
A (Possible) Solution
• Input: Images of Microfilmed Records
– Table Recognition (Heath Nielson)
– Old-Text Recognition (Mike Rimer)
– Handwriting Recognition (Luke Hutchison)
– Record Extraction & Organization (Ken Tubbs)
– Just-in-Time Browsing (Doug Kennard)
– Visualization (Tom Finnigan)
• Output: Organized Genealogical Information
Let a computer do the extraction work.
Zoning: General Overview
• Find the lines in the document using the horizontal and vertical profiles of the image.
• Apply a matched filter to the profiles to identify the line signatures.
• Recursively divide the document into separate pieces, analyzing each piece for lines.
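The three zoning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the image is a binary grid (1 = ink), ruled lines are one pixel thick, and the function names, the peak-shaped kernel, and the `min_fill` and `min_span` thresholds are all hypothetical.

```python
def profiles(img, top, bottom, left, right):
    """Ink counts per row and per column of the region [top, bottom) x [left, right)."""
    rows = [sum(img[y][left:right]) for y in range(top, bottom)]
    cols = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]
    return rows, cols


def matched_filter(profile, kernel=(-0.5, 1.0, -0.5)):
    """Correlate the profile with a narrow-peak kernel, zero-padding the ends."""
    half = len(kernel) // 2
    padded = [0] * half + list(profile) + [0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(profile))]


def find_lines(profile, span, min_fill=0.8, min_span=3):
    """Positions whose filter response looks like a line crossing most of the span."""
    if span < min_span:
        return []
    return [i for i, r in enumerate(matched_filter(profile)) if r >= min_fill * span]


def split(start, end, cuts):
    """Intervals of [start, end) that remain after removing the cut positions."""
    pieces, lo = [], start
    for c in sorted(cuts):
        if c > lo:
            pieces.append((lo, c))
        lo = c + 1
    if end > lo:
        pieces.append((lo, end))
    return pieces


def zone(img, top, bottom, left, right):
    """Recursively divide a region at detected lines; return the leaf zones."""
    rows, cols = profiles(img, top, bottom, left, right)
    h_cuts = [top + i for i in find_lines(rows, right - left)]
    if h_cuts:
        return [z for a, b in split(top, bottom, h_cuts)
                for z in zone(img, a, b, left, right)]
    v_cuts = [left + i for i in find_lines(cols, bottom - top)]
    if v_cuts:
        return [z for a, b in split(left, right, v_cuts)
                for z in zone(img, top, bottom, a, b)]
    return [(top, bottom, left, right)]


# A 7 x 7 page with one horizontal and one vertical ruled line through the middle.
page = [[1 if (y == 3 or x == 3) else 0 for x in range(7)] for y in range(7)]
zones = zone(page, 0, 7, 0, 7)  # four cells, each given as (top, bottom, left, right)
```

The filter rewards a sharp peak flanked by low values, so a dense row of text, whose neighbouring rows are similarly dense, scores near zero while a ruled line scores close to its full length.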
Zone Classification: Machine vs. Handwriting
• Machine printed text is consistent/regular.
• Handwriting is irregular.
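One way to turn "regular vs. irregular" into a number is to measure how much the character heights in a zone vary. The sketch below is an assumption for illustration only: the slides do not say which features are used, and the function names, the height feature, and the 0.15 threshold are hypothetical.

```python
from statistics import mean, pstdev


def regularity(heights):
    """Coefficient of variation of component heights (lower means more regular)."""
    return pstdev(heights) / mean(heights)


def classify_zone(heights, threshold=0.15):
    """Label a zone from the heights (in pixels) of its connected components."""
    return "machine" if regularity(heights) < threshold else "handwriting"


printed = [20, 21, 20, 20, 19, 20]      # nearly uniform glyph heights
handwritten = [14, 25, 18, 31, 12, 22]  # widely varying glyph heights
labels = (classify_zone(printed), classify_zone(handwritten))
```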
Document templates
• Images are not ideal.
– Results in incorrect zoning and classification.
• Form layout is the same across documents.
– Features missed in one image are found in another.
• Build a template of the document’s form by using several documents.
– Provides robustness and increases accuracy.
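The idea of pooling several documents can be sketched as a simple vote: a line position enters the template only if enough documents agree on it. This is a minimal sketch under assumptions, not the method used in the project; `build_template`, the pixel `tolerance`, and the `min_support` fraction are hypothetical.

```python
def build_template(detections, tolerance=2, min_support=0.5):
    """Merge per-document line positions into a template by voting.

    detections: one list of detected line positions (pixels) per document.
    A position is kept when at least min_support of the documents contain
    a detection within `tolerance` pixels of it.
    """
    clusters = []
    for doc, positions in enumerate(detections):
        for p in positions:
            for c in clusters:
                if abs(p - c["center"]) <= tolerance:
                    c["positions"].append(p)
                    c["docs"].add(doc)
                    c["center"] = sum(c["positions"]) / len(c["positions"])
                    break
            else:
                clusters.append({"center": p, "positions": [p], "docs": {doc}})
    needed = min_support * len(detections)
    return sorted(round(c["center"]) for c in clusters if len(c["docs"]) >= needed)


# Four scans of the same form: some lines are missed, one scan has a spurious line at 70.
scans = [[10, 50, 90], [11, 90], [10, 49, 91, 70], [50, 89]]
template = build_template(scans)
```

The line near 50 is missing from the second scan and the line near 10 from the fourth, yet both survive the vote, while the spurious detection at 70 appears in only one scan and is dropped.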
Document Templates
Zoned Image
Automated Text Recognition
Word Segmentation
Letter Segmentation
Optical Character Recognition
Handwriting Recognition
Handwriting Recognition
• The Task
– Online handwriting recognition
• The writer's pen movements are captured
• Velocity, acceleration, stroke order are available
– Offline handwriting recognition
• Page was previously written and scanned
• Only pixel color information available
• Genealogical records are all offline
• Offline is harder, but doable
Mary
Handwriting Recognition
• Can we just convert offline data into (simulated) online data?
– Yes, although difficult to do reliably:
• What order were the strokes written in?
• Doubled-up line segments? Ink blobs? Spurious joins between letters? Missing joins?
– Inferring online data (e.g. stroke ordering) could be crucial to success
– Demonstrated to be solvable with reasonable reliability
Handwriting Recognition
• An example of some steps in the analysis process:
– Contour extraction
– Midline determination
– Stroke ordering
Handwriting Recognition
• An example of some steps in the recognition process:
– Handwriting style clustering
– Letter recognition
– Approximate string matching
nr? m?
Smith, Smythe
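The approximate string matching step can be illustrated with edit distance against a list of known surnames. This is a sketch under assumptions, not the recognizer described in the slides: Levenshtein distance is swapped in as one common choice, and `closest_names`, the lexicon, and `max_distance` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def closest_names(recognized, lexicon, max_distance=2):
    """Lexicon entries within max_distance edits, nearest first."""
    scored = sorted((edit_distance(recognized.lower(), name.lower()), name)
                    for name in lexicon)
    return [name for dist, name in scored if dist <= max_distance]


surnames = ["Smith", "Smythe", "Jones"]
candidates = closest_names("Smyth", surnames)   # ambiguous letter-level reading
misread = closest_names("Srnith", surnames)     # "m" recognized as "rn"
```

An uncertain letter-level reading such as "Smyth" stays compatible with both "Smith" and "Smythe", so the matcher returns candidates rather than a single answer.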
Automatic Record Extraction
Extraction Algorithm
1. Identify the Geometric Structure
2. Identify the Type of Information
3. Identify the Attribute-Value pairs
4. Identify the Record Boundaries
Column-Row Recognition
Genealogical Ontology
Match Labels
• ROAD, STREET, &c., And No. or NAME of HOUSE: Location
• NAME and Surname of each Person: Full Name
• RELATION to Head of Family: Relationship
Extract Records
Location | Full Name | Relationship
Collafer | John Eyres | Head
Collafer | Annie Eyres | Wife
Collafer | Lehailes Eyres | Son
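The label-matching and record-extraction steps shown above can be sketched as follows. This is an illustration under assumptions, not the extraction system itself: the keyword table standing in for the genealogical ontology, the function names, and the rule that a blank Location cell inherits the value from the row above are all hypothetical.

```python
# Hypothetical keyword table standing in for the genealogical ontology.
ONTOLOGY_KEYWORDS = {
    "Location": ("road", "street", "house"),
    "Full Name": ("name and surname",),
    "Relationship": ("relation",),
}
CARRY_DOWN = {"Location"}  # assumed: blank cells repeat the value above


def match_label(header):
    """Map a printed column heading to an ontology concept, or None."""
    text = header.lower()
    for concept, keywords in ONTOLOGY_KEYWORDS.items():
        if any(k in text for k in keywords):
            return concept
    return None


def extract_records(headers, rows):
    """Build one attribute-value record per table row that names a person."""
    concepts = [match_label(h) for h in headers]
    records, carried = [], {}
    for row in rows:
        record = {}
        for concept, cell in zip(concepts, row):
            if concept is None:
                continue
            value = cell.strip()
            if not value and concept in CARRY_DOWN:
                value = carried.get(concept, "")
            record[concept] = value
            carried[concept] = value
        if record.get("Full Name"):
            records.append(record)
    return records


headers = ["ROAD, STREET, &c., And No. or NAME of HOUSE",
           "NAME and Surname of each Person",
           "RELATION to Head of Family"]
rows = [["Collafer", "John Eyres", "Head"],
        ["", "Annie Eyres", "Wife"],
        ["", "Lehailes Eyres", "Son"]]
records = extract_records(headers, rows)
```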
Web Query
John Eyres
Search Results
Online Digital Microfilm: Problem
Many of the images we are interested in are quite large.
6048 x 4287 pixels
What is Just-In-Time Browsing?
A method of quickly browsing digital images over the Internet which capitalizes on:
• Progressive Image Transmission:
– Hierarchical Spatial Resolution
– Progressive Bitplane Encoding
• JBIG Compressed Bitplanes
• Prioritized Regions of Interest
• User Interaction
Hierarchical PIT (Progressive Image Transmission)
Sequential Transmission
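Hierarchical spatial resolution can be pictured as an image pyramid sent coarsest level first, so the viewer sees a rough whole-page image early and sharper detail as more data arrives. The sketch below is an assumption for illustration: 2 x 2 block averaging is one simple way to build the levels, and the function names are hypothetical.

```python
def downsample(img):
    """Halve the resolution by averaging each 2 x 2 block (integer mean)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) // 4
             for x in range(w)]
            for y in range(h)]


def pyramid(img, levels):
    """Return the images in transmission order: coarsest first, original last."""
    stack = [img]
    for _ in range(levels):
        stack.append(downsample(stack[-1]))
    return stack[::-1]


image = [[0, 0, 8, 8],
         [0, 0, 8, 8],
         [4, 4, 12, 12],
         [4, 4, 12, 12]]
levels = pyramid(image, 2)  # 1 x 1, then 2 x 2, then the full 4 x 4 image
```

Sequential transmission, by contrast, sends the full-resolution rows top to bottom, so nothing useful is visible for the lower part of the page until most of the file has arrived.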
PIT Using Bitplane Method
1 Bitplane (2 levels of gray)
2 Bitplanes (4 levels of gray)
3 Bitplanes (8 levels of gray)
4 Bitplanes (16 levels of gray)
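The relationship between bitplanes and gray levels (k planes give 2^k levels) can be checked with a short sketch. It assumes 8-bit grayscale pixels and most-significant-plane-first ordering, neither of which is stated in the slides, and the function names are hypothetical.

```python
def bitplanes(pixels, depth=8):
    """Split pixel values into bitplanes, most significant plane first."""
    return [[(p >> bit) & 1 for p in pixels] for bit in range(depth - 1, -1, -1)]


def reconstruct(planes, depth=8):
    """Approximate the pixels from the bitplanes received so far."""
    values = [0] * len(planes[0])
    for k, plane in enumerate(planes):
        shift = depth - 1 - k
        values = [v | (b << shift) for v, b in zip(values, plane)]
    return values


pixels = [0, 37, 128, 200, 255]
planes = bitplanes(pixels)
after_1 = reconstruct(planes[:1])  # 2 levels of gray
after_2 = reconstruct(planes[:2])  # 4 levels of gray
after_4 = reconstruct(planes[:4])  # 16 levels of gray

# Uncompressed size of a single bitplane for a 6048 x 4287 image, in bytes.
bytes_per_plane = 6048 * 4287 // 8
```

Each additional plane doubles the number of representable gray levels, so a readable image is available after only a fraction of the data has been received. For the 6048 x 4287 image mentioned earlier, one uncompressed bitplane is about 3.2 MB.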
Digital Microfilm Browser
PAF – 5 Generation Pedigree
Gena: A 3D Genealogy Visualizer
Concluding Remarks
Workshop: April 4, 2002 at BYU
www.cs.byu.edu/familyhistory
Appendix
Categorized List of BYU Faculty Interests in Computer Science Research Topics that Support Technology for Family History and Genealogy
Extraction from Digitized Images
• Scanning (Flanagan)
• Segmentation & Table Recognition (Barrett, Martinez)
• OCR for Old Type-Set Text (Martinez)
• Element Classification & Record Construction (Embley, Barrett, Martinez)
• Handwriting Recognition (Sederberg)
• Recognition of Hand-printed Text (Olson, Barrett, Martinez)
Extraction from Digital Data Sources
• Automatic Extraction from Semi-structured and Unstructured Sources (Embley, Martinez)
• Mappings from Heterogeneous Structured Source Views to Target Views (Embley)
• Individualized Source Views (Woodfield)
Information Integration
• Definition of Ontological Expectations (Embley, Woodfield)
• Value Normalization (Woodfield)
• Object Identity & Data Merging (Embley, Sederberg)
• Managing Uncertainty (Embley, Woodfield, Martinez)
Systems for Family History and Genealogy
• Storage of Large Volumes of Data (Flanagan)
• Distributed Storage (Woodfield)
• Indexing Original Documents (Martinez, Embley)
• Human-Computer Interaction (Olsen)
• Just-in-Time Browsing (Barrett, Olsen)
• Workflow for Directing Genealogical Work (Woodfield, Martinez, Embley)
• Notification Systems (Woodfield)
• Visualization (Sederberg)