![Page 1: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/1.jpg)
Enabling Search for Facts and Implied Facts in Historical Documents
David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Spencer Machado, Thomas Packer, Joseph Park, Nathan Tate, Andrew Zitzelberger
Brigham Young University
BYU Data Extraction Research Group
![Page 2: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/2.jpg)
WoK-HD (A Web of Knowledge Superimposed over Historical Documents)
… …
… …
![Page 3: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/3.jpg)
WoK-HD (A Web of Knowledge Superimposed over Historical Documents)
… …
grandchildren of Mary Ely
… …
![Page 4: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/4.jpg)
WoK-HD (A Web of Knowledge Superimposed over Historical Documents)
… …
… …
grandchildren of Mary Ely
![Page 5: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/5.jpg)
WoK-HD (A Web of Knowledge Superimposed over Historical Documents)
… …
grandchildren of Mary Ely
… …
![Page 6: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/6.jpg)
grandchildren of Mary Ely
WoK-HD (A Web of Knowledge Superimposed over Historical Documents)
… …
… …
![Page 7: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/7.jpg)
WoK-HD Input
![Page 8: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/8.jpg)
Querying for Facts & Implied Facts
![Page 9: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/9.jpg)
Querying for Facts & Implied Facts
Animation of
1. Extraction query, results, highlighting
2. Reasoned query, results, reasoning chain, highlighting
![Page 10: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/10.jpg)
Extraction Ontologies
![Page 11: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/11.jpg)
Extraction Ontologies
![Page 12: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/12.jpg)
Fact Extraction
![Page 13: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/13.jpg)
Fact Extraction
![Page 14: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/14.jpg)
Fact Extraction
![Page 15: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/15.jpg)
Reasoning for Implied Facts
![Page 16: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/16.jpg)
Reasoning for Implied Facts
![Page 17: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/17.jpg)
Reasoning for Implied Facts
![Page 18: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/18.jpg)
Reasoning for Implied Facts
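The slide content for this step is not preserved in the transcript, so the following is only a minimal sketch of the general idea: an implied grandchild fact can be derived by chaining two extracted child-has-parent facts. The function name, the pair representation, and the sample people are all hypothetical; this is not the reasoning-rule format WoK-HD actually uses.

```python
from collections import defaultdict


def infer_grandchildren(child_parent_facts):
    """Derive (grandchild, grandparent) pairs from (child, parent) pairs.

    Assumed rule: if X is a child of Y and Y is a child of Z,
    then X is a grandchild of Z.
    """
    parents_of = defaultdict(set)
    for child, parent in child_parent_facts:
        parents_of[child].add(parent)

    implied = set()
    for child, parents in parents_of.items():
        for parent in parents:
            # .get() avoids inserting new keys while we iterate
            for grandparent in parents_of.get(parent, ()):
                implied.add((child, grandparent))
    return implied


# Hypothetical facts for illustration only (not taken from the Ely Ancestry)
facts = [
    ("Person C", "Person B"),
    ("Person D", "Person B"),
    ("Person B", "Person A"),
]
grandchildren = infer_grandchildren(facts)
```

Keeping the derived pairs separate from the extracted pairs makes it possible to show a reasoning chain (child to parent to grandparent) alongside each implied fact.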
![Page 19: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/19.jpg)
Query Interpretation
“Mary Ely” grandchild
![Page 20: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/20.jpg)
Query Interpretation
“Mary Ely” grandchild
![Page 21: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/21.jpg)
Query Interpretation
“Mary Ely” grandchild
![Page 22: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/22.jpg)
Generated SPARQL Query
![Page 23: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/23.jpg)
Generated SPARQL Query
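The generated query itself is not preserved in the transcript. As a rough illustration of how a keyword query such as `"Mary Ely" grandchild` might be turned into SPARQL, the sketch below splits the query into a quoted name and a relationship keyword and fills a fixed template. The namespace, the `ex:name` and `ex:grandchildOf` predicates, and the function name are assumptions made for this example, not the vocabulary WoK-HD generates.

```python
import shlex

# Hypothetical namespace; the real WoK-HD vocabulary is not given in the slides.
PREFIX = "PREFIX ex: <http://example.org/wokhd#>"


def keyword_query_to_sparql(query):
    """Build a SPARQL string from a '"<name>" <relation>' keyword query."""
    tokens = shlex.split(query)  # keeps the quoted name together as one token
    if len(tokens) != 2 or not tokens[1].isalpha():
        raise ValueError('expected a query of the form: "<name>" <relation>')
    name, relation = tokens
    # Escape characters that would break the SPARQL string literal
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        PREFIX,
        "SELECT ?person ?personName WHERE {",
        f'  ?target ex:name "{escaped}" .',
        f"  ?person ex:{relation}Of ?target .",  # hypothetical predicate naming
        "  ?person ex:name ?personName .",
        "}",
    ])


sparql = keyword_query_to_sparql('"Mary Ely" grandchild')
```

A real interpreter would match keywords against the extraction ontology rather than a single template; the template here only shows the shape of the mapping.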
![Page 24: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/24.jpg)
Query Results
![Page 25: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/25.jpg)
Results of Processing the Ely Ancestry (all 830 Pages)
• Number of facts extracted: 22,251
  – 8,740 Person-Birthdate facts
  – 3,803 Person-Deathdate facts
  – 9,708 children facts, including
    • 5,020 Child-has-parent-Person facts
    • 2,394 Son-of-Person facts
    • 2,294 Daughter-of-Person facts
• Number of implied grandchild facts inferred: 5,277
• Processing time:
  – ~18 seconds per page
  – CPU time: ~4 hours
  – Processing 10 in parallel: ~24 minutes
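The processing-time figures can be checked with a little arithmetic. The sketch below uses only the numbers reported on the slide and assumes the 10 parallel jobs split the pages evenly with no overhead, which is an idealization.

```python
PAGES = 830             # pages in the Ely Ancestry, from the slide
SECONDS_PER_PAGE = 18   # approximate per-page time, from the slide
PARALLEL_JOBS = 10      # degree of parallelism, from the slide

cpu_seconds = PAGES * SECONDS_PER_PAGE               # 14,940 seconds in total
cpu_hours = cpu_seconds / 3600                       # about 4.15 hours
parallel_minutes = cpu_seconds / PARALLEL_JOBS / 60  # about 24.9 minutes

print(f"CPU time: {cpu_hours:.2f} hours")
print(f"10 in parallel: {parallel_minutes:.1f} minutes")
```

Both results agree with the rounded values reported above (~4 hours and ~24 minutes).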
![Page 26: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/26.jpg)
Results of Processing the Ely Ancestry (all 830 Pages)
• Precision: .52 (by randomly selecting & checking 100 of the 22,251 facts)
• Recall: .33 & Precision: .40 (by randomly selecting and checking 2 fact-filled family pages)
• Errors:
  – Name recognizer
  – Text pattern expectations
  – OCR
• Varying accuracy (for pages checked)
  – Recall: .11, Precision: .11 (bad combination of all problems)
  – Recall: .50, Precision: .68 (some problems, but closer to expectations)
  – Recall: .59, Precision: .71 (10 pages, mostly as expected)
  – Recall: .91, Precision: .94 (tuned, no problems except a few OCR errors)
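For readers less familiar with these measures, here is a minimal sketch of the standard definitions. The function and parameter names are illustrative. The only numbers used are the 100-fact sample and the reported precision of .52, which together imply that 52 of the checked facts were judged correct.

```python
def precision(correct_extracted, total_extracted):
    """Fraction of extracted facts that are correct."""
    return correct_extracted / total_extracted


def recall(correct_extracted, total_in_source):
    """Fraction of the facts present in the source text that were extracted correctly."""
    return correct_extracted / total_in_source


# 100 randomly sampled facts with precision .52 implies 52 were correct.
sample_precision = precision(52, 100)
```

Recall cannot be estimated from a sample of extracted facts alone, because it needs a count of the facts actually present in the text. That is presumably why the recall figures above come from manually checked pages.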
![Page 27: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/27.jpg)
Current and Future Work
• Implementation Status:
  – Full line works (but is fragile & needs finishing touches)
  – HyKSS integrated (but not all features)
• Scalability:
  – Handcrafted extraction ontologies & reasoning rules (worth the work for certain applications)
  – ListReader (plus bootstrapping for lists and general extraction)
  – Optimization (especially for query processing)
• Integration:
  – Mapping extraction ontologies to domain ontologies
  – Object identity for people and places
![Page 28: Enabling Search for Facts and Implied Facts in Historical Documents](https://reader030.vdocuments.us/reader030/viewer/2022032606/56812e05550346895d936d41/html5/thumbnails/28.jpg)
Summary and Conclusion
• WoK-HD
  – Superimposes a web of knowledge over a collection of historical documents
  – Works as a proof-of-concept prototype
• To build and deploy the WoK-HD successfully:
  – Efficient implementation
  – Better, more cost-effective extraction
  – Integration and record linkage
www.deg.byu.edu