Computer Science Research for Family History and Genealogy


Post on 03-Jan-2016


Page 1: Computer Science Research for Family History and Genealogy

Computer Science Research for Family History and Genealogy

David W. Embley
Heath Nielson, Mike Rimer, Luke Hutchison, Ken Tubbs, Doug Kennard, Tom Finnigan
William A. Barrett

Computer Graphics, Vision, & Image Processing Laboratory
Neural Networks and Machine Learning Laboratory
Data Extraction and Integration Laboratory
Laboratory for Information, Collaboration, & Interaction Environments
Performance Evaluation Laboratory
Data and Software Engineering Laboratory

www.cs.byu.edu/familyhistory

Page 2: Computer Science Research for Family History and Genealogy

The Problem

• 2.5 million rolls of microfilm

• Assuming 1000 images per roll

• 2.5 billion images

Is there a way to automatically extract this information?

Page 3: Computer Science Research for Family History and Genealogy

A (Possible) Solution

• Input: Images of Microfilmed Records
  – Table Recognition (Heath Nielson)
  – Old-Text Recognition (Mike Rimer)
  – Handwriting Recognition (Luke Hutchison)
  – Record Extraction & Organization (Ken Tubbs)
  – Just-in-Time Browsing (Doug Kennard)
  – Visualization (Tom Finnigan)

• Output: Organized Genealogical Information

Let a computer do the extraction work.

Page 4: Computer Science Research for Family History and Genealogy

Zoning: General Overview

• Find the lines in the document using the horizontal and vertical profiles of the image.

• Apply a matched filter to the profiles to identify the line signatures.

• Recursively divide the document into separate pieces, analyzing each piece for lines.
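
The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the image is assumed to be a binarized grid (1 = ink), the kernel `[-1, 2, -1]` is a hypothetical stand-in for a real line signature, and the names `profiles`, `matched_filter`, and `find_lines` are invented for this example. The recursive subdivision step is left out.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed binarized image: 1 = ink, 0 = background


def profiles(img: Image) -> Tuple[List[int], List[int]]:
    """Return (horizontal, vertical) profiles: ink counts per row and per column."""
    horizontal = [sum(row) for row in img]
    vertical = [sum(col) for col in zip(*img)]
    return horizontal, vertical


def matched_filter(profile: List[int], kernel: List[int]) -> List[int]:
    """Correlate a profile with a kernel shaped like the expected line signature."""
    half = len(kernel) // 2
    response = []
    for i in range(len(profile)):
        acc = 0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(profile):  # samples outside the profile count as zero
                acc += weight * profile[j]
        response.append(acc)
    return response


def find_lines(profile: List[int], kernel: List[int], threshold: int) -> List[int]:
    """Return indices where the filter response is a local peak above the threshold."""
    response = matched_filter(profile, kernel)
    peaks = []
    for i, r in enumerate(response):
        left = response[i - 1] if i > 0 else float("-inf")
        right = response[i + 1] if i + 1 < len(response) else float("-inf")
        if r >= threshold and r > left and r >= right:
            peaks.append(i)
    return peaks


if __name__ == "__main__":
    # A 7x7 page with one horizontal rule (row 3) and one vertical rule (column 2).
    page = [[0] * 7 for _ in range(7)]
    page[3] = [1] * 7
    for r in range(7):
        page[r][2] = 1
    h, v = profiles(page)
    print(find_lines(h, [-1, 2, -1], threshold=6))
    print(find_lines(v, [-1, 2, -1], threshold=6))
```

A full zoner would split the page at each detected line and repeat the same analysis on every piece, as the third bullet describes.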

Page 5: Computer Science Research for Family History and Genealogy
Page 6: Computer Science Research for Family History and Genealogy
Page 7: Computer Science Research for Family History and Genealogy

Zone Classification: Machine vs. Handwriting

• Machine printed text is consistent/regular.

• Handwriting is irregular.

Page 8: Computer Science Research for Family History and Genealogy

Document Templates

• Images are not ideal.
  – Results in incorrect zoning and classification.
• Form layout is the same across documents.
  – Features missed in one image are found in another.
• Build a template of the document's form by using several documents.
  – Provides robustness and increases accuracy.
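
One way to picture template building is as a vote across documents. The sketch below is an assumption-laden illustration rather than the method used in the project: it takes the line positions detected on several scans of the same form, groups positions that fall within a hypothetical `tolerance` of each other, and keeps a line only if enough documents agree on it (`min_support`, also hypothetical).

```python
from typing import List


def build_template(detections: List[List[int]],
                   tolerance: int = 3,
                   min_support: float = 0.5) -> List[int]:
    """Merge per-document line positions into one template.

    Positions within `tolerance` pixels of their neighbor are treated as the
    same line. A line is kept if it appears in at least `min_support` of the
    documents, so a line missed in one scan can still be recovered and a
    spurious detection seen in only one scan is dropped.
    """
    if not detections:
        return []
    tagged = sorted((pos, doc) for doc, lines in enumerate(detections) for pos in lines)
    clusters: List[List[tuple]] = []
    for pos, doc in tagged:
        if clusters and pos - clusters[-1][-1][0] <= tolerance:
            clusters[-1].append((pos, doc))
        else:
            clusters.append([(pos, doc)])
    template = []
    for cluster in clusters:
        supporting_docs = {doc for _, doc in cluster}
        if len(supporting_docs) / len(detections) >= min_support:
            positions = sorted(pos for pos, _ in cluster)
            template.append(positions[len(positions) // 2])  # median position
    return template


if __name__ == "__main__":
    # Three scans of the same form: scan 1 misses the line near 50,
    # scan 2 reports a spurious line at 70.
    scans = [[10, 50, 90], [11, 91], [9, 49, 90, 70]]
    print(build_template(scans))
```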

Page 9: Computer Science Research for Family History and Genealogy

Document Templates

Page 10: Computer Science Research for Family History and Genealogy

Zoned Image

Page 11: Computer Science Research for Family History and Genealogy

Automated Text Recognition

Page 12: Computer Science Research for Family History and Genealogy

Word Segmentation

Page 13: Computer Science Research for Family History and Genealogy

Letter Segmentation

Page 14: Computer Science Research for Family History and Genealogy

Optical Character Recognition

Page 15: Computer Science Research for Family History and Genealogy

Handwriting Recognition

Page 16: Computer Science Research for Family History and Genealogy

Handwriting Recognition

• The Task
  – Online handwriting recognition
    • The writer's pen movements are captured
    • Velocity, acceleration, stroke order are available
  – Offline handwriting recognition
    • Page was previously written and scanned
    • Only pixel color information available
• Genealogical records are all offline
• Offline is harder, but doable

Mary

Page 17: Computer Science Research for Family History and Genealogy

Handwriting Recognition

• Can we just convert offline data into (simulated) online data?
  – Yes, although difficult to do reliably:
    • What order were the strokes written in?
    • Doubled-up line segments? Ink blobs? Spurious joins between letters? Missing joins?
  – Inferring online data (e.g. stroke ordering) could be crucial to success
  – Demonstrated to be solvable with reasonable reliability

Page 18: Computer Science Research for Family History and Genealogy

Handwriting Recognition

• An example of some steps in the analysis process:

– Contour extraction

– Midline determination

– Stroke ordering

Page 19: Computer Science Research for Family History and Genealogy

Handwriting Recognition

• An example of some steps in the recognition process:

– Handwriting style clustering

– Letter recognition

– Approximate string matching

nr? m?

Smith / Smythe
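
The approximate string matching step can be illustrated with plain edit distance. This is a minimal sketch, assuming Levenshtein distance as the similarity measure and a small hypothetical surname list; the slides do not say which matching method was actually used.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_names(recognized: str, lexicon: List[str], max_distance: int = 2) -> List[str]:
    """Return lexicon entries within max_distance of the recognized string, nearest first."""
    scored = [(edit_distance(recognized.lower(), name.lower()), name) for name in lexicon]
    return [name for dist, name in sorted(scored) if dist <= max_distance]


if __name__ == "__main__":
    surnames = ["Smith", "Smythe", "Jones"]  # hypothetical lexicon
    print(edit_distance("Smith", "Smythe"))
    print(closest_names("Smyth", surnames))
```

Matching against a lexicon lets an uncertain letter-level reading still resolve to a plausible surname, or to a short list of candidates when several are equally close.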

Page 20: Computer Science Research for Family History and Genealogy

Automatic Record Extraction

Page 21: Computer Science Research for Family History and Genealogy

Extraction Algorithm

1. Identify the Geometric Structure

2. Identify the Type of Information

3. Identify the Attribute-Value pairs

4. Identify the Record Boundaries

Page 22: Computer Science Research for Family History and Genealogy

Column-Row Recognition

Page 23: Computer Science Research for Family History and Genealogy

Genealogical Ontology

Page 24: Computer Science Research for Family History and Genealogy

Match Labels

ROAD, STREET, &c., and No. or NAME of HOUSE → Location

Page 25: Computer Science Research for Family History and Genealogy

Match Labels

NAME and Surname of each Person → Full Name

Location    Full Name

Page 26: Computer Science Research for Family History and Genealogy

Match Labels

RELATION to Head of Family → Relationship

Location    Full Name
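
The label-matching slides can be read as mapping each printed column header to a concept in the genealogical ontology. The sketch below illustrates that idea with simple keyword overlap. The keyword lists in `ONTOLOGY_KEYWORDS` and the function `match_label` are hypothetical stand-ins; the real ontology and matching rules are not given in the slides.

```python
import re
from typing import Dict, List, Optional

# Hypothetical keyword lists standing in for the genealogical ontology.
ONTOLOGY_KEYWORDS: Dict[str, List[str]] = {
    "Location": ["road", "street", "house"],
    "Full Name": ["name", "surname"],
    "Relationship": ["relation"],
}


def match_label(header: str) -> Optional[str]:
    """Return the ontology concept whose keywords best overlap the column header."""
    words = set(re.findall(r"[a-z]+", header.lower()))
    best, best_hits = None, 0
    for concept, keywords in ONTOLOGY_KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in words)
        if hits > best_hits:
            best, best_hits = concept, hits
    return best  # None when no keyword matches


if __name__ == "__main__":
    for header in ["ROAD, STREET, &c., and No. or NAME of HOUSE",
                   "NAME and Surname of each Person",
                   "RELATION to Head of Family"]:
        print(header, "->", match_label(header))
```

Counting hits rather than taking the first match matters for the first header, which contains "NAME" but is about a location.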

Page 27: Computer Science Research for Family History and Genealogy

Extract Records

Location    Full Name    Relationship
Collafer

Page 28: Computer Science Research for Family History and Genealogy

Extract Records

Location    Full Name     Relationship
Collafer    John Eyres    Head

Page 29: Computer Science Research for Family History and Genealogy

Extract Records

Location    Full Name      Relationship
Collafer    Annie Eyres    Wife

Page 30: Computer Science Research for Family History and Genealogy

Extract Records

Location    Full Name         Relationship
Collafer    Lehailes Eyres    Son
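
The record-extraction slides show the same location attached to each person in the household. A rough sketch of that step follows. It assumes the table cells have already been recognized as text and that a blank Location cell inherits the value above it; both the inheritance rule and the function name `extract_records` are assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def extract_records(labels: Sequence[str],
                    rows: Sequence[Sequence[str]],
                    carry: Sequence[str] = ("Location",)) -> List[Dict[str, str]]:
    """Turn table rows into attribute-value records.

    Cells under a label listed in `carry` inherit the most recent non-empty
    value when left blank (assumed convention: an address is written once
    per household).
    """
    records = []
    last_seen: Dict[str, str] = {}
    for row in rows:
        record = {}
        for label, cell in zip(labels, row):
            value = cell.strip()
            if label in carry:
                if value:
                    last_seen[label] = value
                else:
                    value = last_seen.get(label, "")
            record[label] = value
        records.append(record)
    return records


if __name__ == "__main__":
    labels = ["Location", "Full Name", "Relationship"]
    rows = [
        ["Collafer", "John Eyres", "Head"],
        ["", "Annie Eyres", "Wife"],
        ["", "Lehailes Eyres", "Son"],
    ]
    for record in extract_records(labels, rows):
        print(record)
```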

Page 31: Computer Science Research for Family History and Genealogy

Web Query

John Eyres

Page 32: Computer Science Research for Family History and Genealogy

Search Results

Page 33: Computer Science Research for Family History and Genealogy

Online Digital Microfilm: Problem

Many of the images we are interested in are quite large.

6048 x 4287 pixels

Page 34: Computer Science Research for Family History and Genealogy

What is Just-In-Time Browsing?

A method of quickly browsing digital images over the Internet which capitalizes on:

• Progressive Image Transmission:
  – Hierarchical Spatial Resolution
  – Progressive Bitplane Encoding
• JBIG Compressed Bitplanes
• Prioritized Regions of Interest
• User Interaction

Page 35: Computer Science Research for Family History and Genealogy

Hierarchical PIT (Progressive Image Transmission) vs. Sequential Transmission
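
To make the contrast concrete: sequential transmission sends the image row by row at full resolution, while a hierarchical scheme sends a small, coarse version first and refines it. The sketch below builds such a coarse-to-fine sequence by repeated 2x subsampling. It is a simplification under stated assumptions: each level is a complete image rather than only the pixels missing from the previous level, and the function names are invented for this example.

```python
from typing import List, Tuple

Image = List[List[int]]


def build_pyramid(img: Image, levels: int) -> List[Image]:
    """Return `levels` versions of the image, coarsest first, by 2x subsampling."""
    pyramid = [img]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        pyramid.append([row[::2] for row in prev[::2]])  # keep every other row and column
    return pyramid[::-1]


def level_sizes(width: int, height: int, levels: int) -> List[Tuple[int, int]]:
    """Sizes (width, height) of each pyramid level, coarsest first."""
    sizes = [(width, height)]
    for _ in range(levels - 1):
        w, h = sizes[-1]
        sizes.append(((w + 1) // 2, (h + 1) // 2))  # subsampling rounds up
    return sizes[::-1]


if __name__ == "__main__":
    # Sizes for the 6048 x 4287 image mentioned earlier, using four levels.
    for w, h in level_sizes(6048, 4287, 4):
        print(f"{w} x {h} = {w * h} pixels")
```

With four levels, the first image the viewer receives has roughly 1/64 of the pixels of the full scan, which is why a usable preview can appear long before the complete image arrives.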

Page 36: Computer Science Research for Family History and Genealogy

PIT Using Bitplane Method

1 BitPlane (2 levels of gray)

2 BitPlanes (4 levels of gray)

3 BitPlanes (8 levels of gray)

4 BitPlanes (16 levels of gray)
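
The relationship between bitplanes and gray levels is that keeping the k most significant bits of each pixel leaves 2^k distinct values. A minimal sketch, assuming 8-bit grayscale pixels (the slides do not state the source bit depth) and a hypothetical helper `top_bitplanes`:

```python
from typing import List

Image = List[List[int]]


def top_bitplanes(img: Image, k: int, depth: int = 8) -> Image:
    """Keep only the k most significant bitplanes of each pixel (assumed `depth`-bit)."""
    shift = depth - k
    return [[(pixel >> shift) << shift for pixel in row] for row in img]


def gray_levels(k: int) -> int:
    """Number of distinct gray levels representable with k bitplanes."""
    return 2 ** k


if __name__ == "__main__":
    row = [[0, 100, 200, 255]]
    for k in (1, 2, 3, 4):
        print(k, gray_levels(k), top_bitplanes(row, k))
```

Sending the most significant plane first gives a readable black-and-white image immediately; each further plane doubles the number of gray levels.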

Page 37: Computer Science Research for Family History and Genealogy

Digital Microfilm Browser

Page 38: Computer Science Research for Family History and Genealogy

PAF – 5 Generation Pedigree

Page 39: Computer Science Research for Family History and Genealogy

PAF – 5 Generation Pedigree

Page 40: Computer Science Research for Family History and Genealogy

Gena: A 3D Genealogy Visualizer

Page 41: Computer Science Research for Family History and Genealogy

Concluding Remarks

Workshop: April 4, 2002 at BYU

www.cs.byu.edu/familyhistory

Page 42: Computer Science Research for Family History and Genealogy

Appendix

Categorized List of BYU Faculty Interests in Computer Science Research Topics that Support Technology for Family History and Genealogy

Page 43: Computer Science Research for Family History and Genealogy

Extraction from Digitized Images

• Scanning (Flanagan)
• Segmentation & Table Recognition (Barrett, Martinez)
• OCR for Old Type-Set Text (Martinez)
• Element Classification & Record Construction (Embley, Barrett, Martinez)
• Handwriting Recognition (Sederberg)
• Recognition of Hand-printed Text (Olson, Barrett, Martinez)

Page 44: Computer Science Research for Family History and Genealogy

Extraction from Digital Data Sources

• Automatic Extraction from Semi-structured and Unstructured Sources (Embley, Martinez)

• Mappings from Heterogeneous Structured Source Views to Target Views (Embley)

• Individualized Source Views (Woodfield)

Page 45: Computer Science Research for Family History and Genealogy

Information Integration

• Definition of Ontological Expectations (Embley, Woodfield)

• Value Normalization (Woodfield)

• Object Identity & Data Merging (Embley, Sederberg)

• Managing Uncertainty (Embley, Woodfield, Martinez)

Page 46: Computer Science Research for Family History and Genealogy

Systems for Family History and Genealogy

• Storage of Large Volumes of Data (Flanagan)
• Distributed Storage (Woodfield)
• Indexing Original Documents (Martinez, Embley)
• Human-Computer Interaction (Olsen)
• Just-in-Time Browsing (Barrett, Olsen)
• Workflow for Directing Genealogical Work (Woodfield, Martinez, Embley)
• Notification Systems (Woodfield)
• Visualization (Sederberg)