2012 eHumanities Amsterdam - Descartes text conversion: lessons learned
DESCRIPTION
The arduous process of producing a digital text of Descartes' letters, including mathematical formulas, as a subtask of the CKCC project at the Huygens Institute, and the lessons learned. With Erik-Jan Bos, Utrecht.
TRANSCRIPT
Letters from Descartes in digital format
An exercise in conversion
Dirk Roorda @ eHumanities, 2012-01-26
the task
the method
the lessons
the result
◦ demo
overview
The Task: converting from ... JapAM
Descartes Correspondence
ca. 700 letters
69,237 lines
600 formulas
4.2 MB (without the 311 pictures)
The task: converting to ... CKCC corpus Descartes
XML: Text Encoding Initiative (TEI)
~ 35,000 elements, of which
7,200 metadata
7,700 paragraphs
6,200 formulas
6,000 text-formattings
4,200 structure
2,900 page-breaks
538 images
The (re)Sources
EJB (Erik-Jan Bos)
Metadata
Google Books
EJB's head
observation
non-algorithmic changes
consolidation
proofs
The method
use digital equipment:
- your text editor
- your scripting language
- your regular expressions
Observation
observation: italic scopes
replace: =(.*?)$
by: <italic>match1</italic>
???
Aargh!#@\€]
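The naive replacement above can be sketched as follows. The sketch is in Python, not the Perl of the original conversion program, and it assumes (as the pattern suggests) that `=` opens an italic scope that runs to the end of the line. The function name and the sample lines are hypothetical. The second call hints at why such a rule can lead to frustration: a second `=` on the same line is swallowed into the first scope.

```python
import re

# Naive rule from the slide: "=" opens an italic scope that ends at end of line.
ITALIC = re.compile(r"=(.*?)$", re.MULTILINE)

def naive_italics(text: str) -> str:
    """Wrap everything from "=" up to the end of the line in <italic> tags."""
    return ITALIC.sub(r"<italic>\1</italic>", text)

print(naive_italics("Hello =world"))  # one scope per line: works as intended
print(naive_italics("a =b =c"))       # the second "=" ends up inside the first scope
```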
observation: greek
non-algorithmic changes
closers: hints
consolidating: metadata
... formulas meta closers ...
conversion process
canonical
initial
corrected
improved
checked metadata combining
merging meta
proofs: formulas
proofs: formulas in gif
quick formula checking
The anatomy of conversion
convert.pl
100 KB of program code text = 25 densely typed pages = 3427 lines
of which
2175 real code lines
Code/Input = 1/32
1/3 of the tasks need 2/3 of the code
formulas: (2) 37 %
headers, openers, closers: (3) 16 %
meta and images: (3) 11 %
run time of same tasks
formulas: (2) 29 %
headers, openers, closers: (3) 6 %
meta and images: (3) 10 %
total run time (25) 40 sec
Statistics
1. Unicode is your friend
2. Split into many subtasks
3. task = configuration + workflow
4. Count and check
5. Performance matters
6. Do not give up automation
The tricks of conversion
1. Unicode is your friend
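One way to read this lesson, sketched in Python and not in the Perl of the original program: once every special character is a real Unicode character, later subtasks can treat Greek and accented text like any other text. The placeholder codes below are purely hypothetical; they are not the encoding used in the Descartes source.

```python
import unicodedata

# Hypothetical ASCII placeholder codes for Greek letters (an assumption,
# not the encoding actually found in the source text).
GREEK_CODES = {
    "{alpha}": "GREEK SMALL LETTER ALPHA",
    "{beta}": "GREEK SMALL LETTER BETA",
    "{pi}": "GREEK SMALL LETTER PI",
}

def to_unicode(text: str) -> str:
    """Replace each placeholder code by the Unicode character it names."""
    for code, name in GREEK_CODES.items():
        text = text.replace(code, unicodedata.lookup(name))
    # Normalise so that composed and decomposed accents compare equal.
    return unicodedata.normalize("NFC", text)
```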
2. Split into many subtasks
(2a) that can be run separately
(2b) that can be reordered easily
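A minimal sketch of what "run separately" and "reordered easily" could look like, in Python instead of the original Perl. It assumes each subtask takes the whole text and returns the whole text; the two subtasks, the `[pb]` marker and all names are hypothetical.

```python
from typing import Callable, Dict, List

Subtask = Callable[[str], str]

# Hypothetical subtasks: each takes the whole text and returns the whole text.
def strip_trailing_spaces(text: str) -> str:
    return "\n".join(line.rstrip() for line in text.split("\n"))

def mark_page_breaks(text: str) -> str:
    return text.replace("[pb]", "<pb/>")

SUBTASKS: Dict[str, Subtask] = {
    "spaces": strip_trailing_spaces,
    "pagebreaks": mark_page_breaks,
}

def run(text: str, order: List[str]) -> str:
    """Run the named subtasks in the given order."""
    for name in order:
        text = SUBTASKS[name](text)
    return text
```

Because the order is just a list of names, a single subtask can be run alone with `run(text, ["pagebreaks"])`, and reordering means editing that list.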
3. task = config + workflow
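The split between configuration and workflow might be sketched like this, in Python instead of the original Perl. The rule table is the configuration and a single generic loop is the workflow. The rules shown are hypothetical examples, not rules from convert.pl.

```python
import re
from typing import List, Tuple

# Configuration: plain data, one (pattern, replacement) pair per rule.
# These rules are hypothetical examples.
RULES: List[Tuple[str, str]] = [
    (r"\[pb\]", "<pb/>"),
    (r"_([A-Za-z]+)_", r"<hi>\1</hi>"),
]

def apply_rules(text: str, rules: List[Tuple[str, str]]) -> str:
    """Workflow: the same loop for every subtask, driven by its rule table."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text
```

A new subtask then costs a new rule table, while the loop stays untouched.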
4. Count and check (ad nauseam)
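A sketch of one possible count-and-check step, in Python instead of the original Perl: count the markers going in, count the elements coming out, and look for leftovers. The `[pb]` marker and the function name are hypothetical.

```python
import re

def count_and_check(source: str, result: str) -> dict:
    """Compare how many markers went in with how many elements came out."""
    counts = {
        "markers_in": len(re.findall(r"\[pb\]", source)),
        "elements_out": result.count("<pb/>"),
        "leftovers": len(re.findall(r"\[pb\]", result)),
    }
    # The conversion passes only if every marker became an element.
    counts["ok"] = (
        counts["markers_in"] == counts["elements_out"]
        and counts["leftovers"] == 0
    )
    return counts
```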
5. Performance matters!
was 30+ seconds
is now 2.07 seconds
many new subtasks based on same template
(gain = 15 * 30 sec = 7.5 min per run)
many, many runs before everything is OK
(gain = 100 * 7.5 min = 12.5 hours CPU-time)
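The slides do not say how the speed-up was achieved. As a sketch under that caveat, timing each subtask separately is one way to find out where the seconds go. The code is Python, not the original Perl, and `timed_run` is a hypothetical helper. The last two lines redo the gain estimate from the slide with units made explicit.

```python
import time
from typing import Callable, Dict, Tuple

def timed_run(
    text: str, subtasks: Dict[str, Callable[[str], str]]
) -> Tuple[str, Dict[str, float]]:
    """Run each subtask in order and record how long it took, in seconds."""
    timings: Dict[str, float] = {}
    for name, task in subtasks.items():
        start = time.perf_counter()
        text = task(text)
        timings[name] = time.perf_counter() - start
    return text, timings

# Gain estimate from the slide: 15 subtasks * 30 sec, then 100 runs.
gain_per_run_min = 15 * 30 / 60                 # minutes saved per run
gain_total_hours = 100 * gain_per_run_min / 60  # hours saved over 100 runs
```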
6. Do not give up automation
we used a lot of expert knowledge
which has all been transferred to
- the source
- consolidated extra inputs
so the conversion is still repeatable and modifiable
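One way to keep expert corrections inside the automated run is to store them as data and re-apply them on every run. The following is a sketch in Python, not the original Perl, and the correction format is an assumption, not the format used in the project.

```python
from typing import List, Tuple

# Hypothetical correction format: (line number, wrong text, right text).
# In practice such entries would live in a file next to the source.
Correction = Tuple[int, str, str]

def apply_corrections(
    lines: List[str], corrections: List[Correction]
) -> List[str]:
    """Apply each correction to its line; fail loudly if the text is not there."""
    fixed = list(lines)
    for lineno, wrong, right in corrections:
        if wrong not in fixed[lineno - 1]:
            raise ValueError(f"line {lineno}: {wrong!r} not found")
        fixed[lineno - 1] = fixed[lineno - 1].replace(wrong, right, 1)
    return fixed
```

Failing on a correction that no longer matches keeps the run repeatable: a changed source cannot silently invalidate the expert input.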
[Diagram: conversion program. Inputs: source (corrections), formulas (hints), meta (hints), closers (hints). Output: results (CKCC).]
Thank You