A Robust Open-source GEDCOM Parser
DESCRIPTION
A Robust Open-source GEDCOM Parser, presented by Dallan Quass and Ryan Knight at RootsTech 2012. Parses GEDCOM files into a "de facto" object model; includes round-tripping for the vast majority of GEDCOM files.
TRANSCRIPT
What's a GEDCOM?
0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.2.18.0
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, UT 84150
4 CONT USA
1 DEST Other
1 DATE 9 Aug 2006
2 TIME 19:57:47
1 FILE temp-paf.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG English
1 SUBM @SUB1@
0 @SUB1@ SUBM
1 NAME Dallan Quass
0 @I1@ INDI
1 NAME Dallan /Quass/
2 SURN Quass
2 GIVN Dallan
If this looks unfamiliar to you, you may not get a lot out of this talk.
On the other hand, the purpose of this project is to handle this for you, so you can develop cool projects in genealogy and let this stay unfamiliar to you!
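As a rough illustration of the line format shown above, here is a minimal Java sketch that splits one GEDCOM line into its level, optional cross-reference id (such as `@I1@`), tag, and optional value. The class name `GedcomLine` and its fields are hypothetical and are not taken from the project; callers are assumed to strip line endings first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical holder for one GEDCOM line; not the project's own class. */
final class GedcomLine {
    // level, optional @ID@, tag, optional value after a single space
    private static final Pattern LINE =
        Pattern.compile("^\\s*(\\d+)\\s+(?:(@[^@]+@)\\s+)?(\\S+)(?: (.*))?$");

    final int level;
    final String id;    // null when the line has no cross-reference id
    final String tag;
    final String value; // null when the line has no value

    private GedcomLine(int level, String id, String tag, String value) {
        this.level = level;
        this.id = id;
        this.tag = tag;
        this.value = value;
    }

    /** Parses a single line (without its line terminator). */
    static GedcomLine parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a GEDCOM line: " + line);
        }
        return new GedcomLine(Integer.parseInt(m.group(1)),
                              m.group(2), m.group(3), m.group(4));
    }
}
```

For example, `GedcomLine.parse("0 @I1@ INDI")` yields level 0, id `@I1@`, tag `INDI`, and no value, while `GedcomLine.parse("1 SUBM @SUB1@")` treats `@SUB1@` as the value because it follows the tag.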
Why is parsing GEDCOMs so hard?
Challenge #1 – Character set detection
0 HEAD
1 SOUR PAF
2 NAME Personal Ancestral File
2 VERS 5.2.18.0
2 CORP The Church of Jesus Christ of Latter-day Saints
3 ADDR 50 East North Temple Street
4 CONT Salt Lake City, UT 84150
4 CONT USA
1 DEST Other
1 DATE 9 Aug 2006
2 TIME 19:57:47
1 FILE temp-paf.ged
1 GEDC
2 VERS 5.5
2 FORM LINEAGE-LINKED
1 CHAR UTF-8
1 LANG English
1 SUBM @SUB1@
0 @SUB1@ SUBM
1 NAME Dallan Quass
0 @I1@ INDI
1 NAME Dallan /Quass/
2 SURN Quass
2 GIVN Dallan
Should be easy, except...
Challenge #1 – Character set detection
GeneWeb ASCII → ANSI
Geni.com ANSEL → UTF8
Geni.com UNICODE → UTF8
GENJ UNICODE → UTF8
All others UNICODE → UTF16
ASCII/MacOS Roman → x-MacRoman
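One way to read the list above is as a set of overrides: the encoding declared in the header's CHAR line is trusted unless the generating program (the header's SOUR value) is known to mislabel it. The Java sketch below encodes that reading. The class and method names are hypothetical, the exact generator strings are assumptions, the last row is assumed to be a declared value of "ASCII/MacOS Roman", and the returned labels are simply the ones on the slide rather than guaranteed JDK charset names.

```java
import java.util.Locale;

/** Hypothetical lookup mirroring the slide's list; not the project's own code. */
final class CharsetGuess {
    /**
     * @param generator value of the header's SOUR line, e.g. "Geni.com" (assumed spelling)
     * @param declared  value of the header's CHAR line, e.g. "UNICODE"
     * @return the encoding label to actually use when reading the file
     */
    static String actualEncoding(String generator, String declared) {
        String gen = generator == null ? "" : generator.trim().toUpperCase(Locale.ROOT);
        String chr = declared == null ? "" : declared.trim().toUpperCase(Locale.ROOT);

        // Generator-specific overrides come first.
        if (gen.equals("GENEWEB") && chr.equals("ASCII")) return "ANSI";
        if (gen.equals("GENI.COM") && (chr.equals("ANSEL") || chr.equals("UNICODE"))) return "UTF8";
        if (gen.equals("GENJ") && chr.equals("UNICODE")) return "UTF8";

        // Overrides that apply regardless of generator.
        if (chr.equals("UNICODE")) return "UTF16";
        if (chr.equals("ASCII/MACOS ROMAN")) return "x-MacRoman";

        // No known quirk: trust the header as written.
        return declared;
    }
}
```

Checking the generator-specific rows before the generic ones matters here, because "UNICODE" means UTF8 for two generators and UTF16 for everyone else.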
Challenge #1 – Character set detection
ANSEL
Challenge #2 – Custom tags
The GEDCOM specification hasn't been updated in a LONG time
Challenge #3 – Misused tags
Shout out
Tim Forsythe
VGed - GEDCOM validator
http://ancestorsnow.blogspot.com/2011/07/vged.html
ALIA
1 SEX M
1 ALIA /Ted/
1 BIRT
SOUR
0 @N6@ NOTE
1 CONT adopted surname Termaat
2 SOUR @S9@
DATA
2 SOUR @S2149874917@
3 DATA
4 DATE 11 Sep 1924
3 NOTE ...
3 DATA
4 TEXT ...

2 SOUR @S99@
3 DATA
4 TEXT William Donald ...
4 DATE 1 Sep 1997

2 SOUR @S28@
3 PAGE Indian Prarie...
3 QUAY 3
3 DATE 28 Feb 2005
Challenge #4 – Unused tags
Event Phone
Event Agency
Source Citation Event Type
Challenge #5 – Names
GEDCOM Standard?
The code is more what you'd call
"guidelines" than actual rules.
Two goals
Goal #1 – Parse GEDCOMs into a de facto object model
De Facto:
In fact or in practice; in actual use or existence, regardless of official or legal status. – Wiktionary.org
Model should be straightforward, easy to use and understand
Goal #2 – Round-trip
From GEDCOM
To Object Model
Back to GEDCOM without information loss
Nirvana
There is no Nirvana
But we can get pretty close
94%
How is it done?
???
Object model
People
Extensions
GedML
Originally by Michael Kay
http://users.breathe.com/mhkay/gedml/
Enhanced by Lynn Monson
http://lmonson.com/blog/?page_id=64
Further enhanced by Nathan Powell & Dallan Quass
part of this project
GEDCOM → SAX events
ANSEL reader & writer
Parser
Written in Java
~1500 LoC for parser + ~4000 LoC for POJOs
Handles SAX events emitted by GedML
Separate functions called to handle each tag
Maintains a stack of model objects
Attach unexpected tags to model objects as extensions
Fast
Easily extendible
Tree parser also available
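The stack-and-handlers approach listed above can be sketched roughly as follows. This is a simplified illustration, not the project's code: `start` and `end` stand in for SAX `startElement` and `endElement`, and every class name here (`StackParser`, `Person`, `PersonName`, `ModelObject`, `ExtensionTag`) is hypothetical. Only a few tags are recognized; anything else is kept as an extension on the nearest enclosing object.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** An unrecognized tag, kept with its children so nothing is thrown away. */
class ExtensionTag {
    final String tag, value;
    final List<ExtensionTag> children = new ArrayList<>();
    ExtensionTag(String tag, String value) { this.tag = tag; this.value = value; }
}

/** Every model object can carry unexpected tags as extensions. */
class ModelObject { final List<ExtensionTag> extensions = new ArrayList<>(); }
class PersonName extends ModelObject { String value, surname, given; }
class Person extends ModelObject {
    String sex;
    final List<PersonName> names = new ArrayList<>();
}

final class StackParser {
    private static final Object LEAF = new Object(); // marker for simple fields
    private final Deque<Object> stack = new ArrayDeque<>();
    final List<Person> people = new ArrayList<>();
    final List<ExtensionTag> orphans = new ArrayList<>(); // no model parent

    /** Stand-in for SAX startElement: push whatever this tag produced. */
    void start(String tag, String value) { stack.push(dispatch(stack.peek(), tag, value)); }

    /** Stand-in for SAX endElement. */
    void end() { stack.pop(); }

    // Routes each tag to its own handler based on the object on top of the stack.
    private Object dispatch(Object parent, String tag, String value) {
        if (parent == null && tag.equals("INDI")) return handlePerson();
        if (parent instanceof Person) {
            Person p = (Person) parent;
            if (tag.equals("NAME")) return handleName(p, value);
            if (tag.equals("SEX")) { p.sex = value; return LEAF; }
        }
        if (parent instanceof PersonName) {
            PersonName n = (PersonName) parent;
            if (tag.equals("SURN")) { n.surname = value; return LEAF; }
            if (tag.equals("GIVN")) { n.given = value; return LEAF; }
        }
        return handleUnexpected(parent, tag, value);
    }

    private Person handlePerson() {
        Person p = new Person();
        people.add(p);
        return p;
    }

    private PersonName handleName(Person p, String value) {
        PersonName n = new PersonName();
        n.value = value;
        p.names.add(n);
        return n;
    }

    private ExtensionTag handleUnexpected(Object parent, String tag, String value) {
        ExtensionTag ext = new ExtensionTag(tag, value);
        if (parent instanceof ModelObject) ((ModelObject) parent).extensions.add(ext);
        else if (parent instanceof ExtensionTag) ((ExtensionTag) parent).children.add(ext);
        else orphans.add(ext);
        return ext;
    }
}
```

Pushing the extension itself onto the stack means that children of an unexpected tag attach to it rather than to the model object, so a whole unknown subtree survives intact for export.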
GEDCOM Export
Visitor pattern
600 LoC
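A minimal sketch of how a visitor can drive GEDCOM export: the model objects walk themselves and their children, and the writer only decides what text to emit and at which level. All names here (`GedcomVisitor`, `Individual`, `IndividualName`, `GedcomWriter`) are hypothetical and the model is reduced to two classes; the project's real visitor and model are larger.

```java
import java.util.ArrayList;
import java.util.List;

interface GedcomVisitor {
    void visit(Individual person);
    void visit(IndividualName name);
    void endVisit(); // called once after an object and its children are done
}

final class IndividualName {
    final String value, surname, given;
    IndividualName(String value, String surname, String given) {
        this.value = value; this.surname = surname; this.given = given;
    }
    void accept(GedcomVisitor v) { v.visit(this); v.endVisit(); }
}

final class Individual {
    final String id, sex;
    final List<IndividualName> names = new ArrayList<>();
    Individual(String id, String sex) { this.id = id; this.sex = sex; }
    void accept(GedcomVisitor v) {
        v.visit(this);
        for (IndividualName n : names) n.accept(v);
        v.endVisit();
    }
}

/** Writes GEDCOM lines, tracking the current level as it descends. */
final class GedcomWriter implements GedcomVisitor {
    private final StringBuilder out = new StringBuilder();
    private int level = 0;

    private void write(int lvl, String text) {
        out.append(lvl).append(' ').append(text).append('\n');
    }

    @Override public void visit(Individual p) {
        write(level, p.id + " INDI");
        level++; // children of this record are one level deeper
        if (p.sex != null) write(level, "SEX " + p.sex);
    }

    @Override public void visit(IndividualName n) {
        write(level, "NAME " + n.value);
        level++;
        if (n.surname != null) write(level, "SURN " + n.surname);
        if (n.given != null) write(level, "GIVN " + n.given);
    }

    @Override public void endVisit() { level--; }

    String result() { return out.toString(); }
}
```

Keeping the traversal in `accept` and the formatting in the visitor means another output format could reuse the same walk with a different visitor.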
JSON
GEDCOM → POJO → JSON → POJO → GEDCOM
Simple model persistence using Google GSON
Further thoughts
Do we need a radically-different data-exchange model for genealogy?
I don't know
A newly proposed object model could use this project to migrate existing GEDCOMs to the de facto model,
then translate the de facto model objects to the new model
Do we need GEDCOM validation tools?
Definitely!
A list of “standard” custom tags would also be pretty helpful
We live in the real world
Purpose of this project
Demonstration of Gedcom Server
Demonstrates GEDCOM -> model -> JSON -> model -> GEDCOM
Built with Play 1.2.4 - A Java Web framework
Allows for rapid development of web applications with a fully integrated stack
Deployed to Heroku – Cloud Application Platform
Heroku allows one-step deployment with git
Demonstration of Gedcom Server
Demonstration of Gedcom Server
Conclusion
Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license
Parsing GEDCOMs is hard
• it's like parsing HTML in the 1990s
But getting it right is pretty important
especially if you want to retain existing information
Open source algorithm is now freely available
http://github.com/DallanQ/Gedcom
simple object model with extensions, 94% round-trip
Hopefully others will benefit from this effort