Processing XML: a rewriting system approach
DESCRIPTION
Yet another method to parse XML: rewrite it!
TRANSCRIPT
Processing XML: A rewriting system approach
Alberto Simões
Portuguese Perl Workshop – 2010
Alberto Simões Processing XML: a rewriting system approach
Motivation and Goals
XML is usually generated from structured information:
- databases, spreadsheets, forms, etc.
But it can also be generated from unstructured (or poorly structured) data:
- textual documents, domain-specific languages.
The question arises: how do we produce XML documents from textual documents?
- write a parser (natural language, domain-specific, etc.);
- or produce XML by rewriting the textual document!
How does textual rewriting work?
Write rewriting rules:
    rule ≅ pattern × restriction × action
- pattern: a regular (or irregular) expression that should be textually matched;
- restriction: conditional code that checks whether the rule should be applied;
- action: a piece of code (or simply a string) that produces the text that replaces the originally matched text.
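To make the pattern × restriction × action triple concrete, here is a minimal sketch in plain Perl. The hash layout and the apply_rule name are assumptions made for illustration; they are not the Text::RewriteRules API.

```perl
use strict;
use warnings;

# One rule as a (pattern, restriction, action) triple.
# The key names are hypothetical, chosen only for this sketch.
my %rule = (
    pattern     => qr/\b(\d+)\b/,                  # what to match
    restriction => sub { $_[0] > 100 },            # should the rule fire?
    action      => sub { "<big>$_[0]</big>" },     # text replacing the match
);

# Apply the rule to every match; when the restriction fails,
# the matched text is put back unchanged.
sub apply_rule {
    my ($rule, $text) = @_;
    my ($pattern, $restriction, $action) =
        @{$rule}{qw(pattern restriction action)};
    $text =~ s/$pattern/ $restriction->($1) ? $action->($1) : $& /ge;
    return $text;
}

print apply_rule(\%rule, "7 apples, 250 oranges"), "\n";
```

Only the number that passes the restriction is rewritten; the other match is left as it was.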
Are there text rewriting tools?
For this work we used Text::RewriteRules:
- written in Perl:
    - the power of the Perl regular expression engine;
    - a reflexive language (code can be generated on the fly);
- supports different rewriting approaches:
    - fixed-point rewriting approach;
    - sliding-cursor rewriting approach;
    - lexical analyzer approach;
- home-developed.
Fixed-point rewriting approach
Algorithm:
- easy to understand;
- a sequence of rules that are applied in order;
- the first rule is applied, and each following rule is applied only if no earlier rule can be applied;
- a rule may change the document in such a way that an earlier rule applies again;
- the process ends when no rule can be applied (or when a specific rule forces the system to end).
Code example: anonymization of emails
    RULES anonymize
    \w+(\.\w+)*@\w+\.\w+(\.\w+)*==>[[hidden email]]
    ENDRULES
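The fixed-point loop described above can be sketched in plain Perl without the module. This is an illustration of the algorithm only, not how Text::RewriteRules is implemented; the @rules layout, the second (whitespace) rule, and the fixed_point name are assumptions.

```perl
use strict;
use warnings;

# Ordered rules as [pattern, action] pairs (hypothetical layout).
my @rules = (
    [ qr/\w+(?:\.\w+)*\@\w+\.\w+(?:\.\w+)*/, sub { '[[hidden email]]' } ],
    [ qr/[ ]{2,}/,                           sub { ' ' } ],
);

# Apply the first rule that matches, then start again from the first rule.
# Stop when a full pass over the rules changes nothing (the fixed point).
sub fixed_point {
    my ($text) = @_;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        for my $rule (@rules) {
            my ($pattern, $action) = @$rule;
            if ($text =~ s/$pattern/$action->()/e) {
                $changed = 1;
                last;    # an earlier rule may apply again
            }
        }
    }
    return $text;
}

print fixed_point("mail  ana\@example.org or rui.s\@example.com"), "\n";
```

The loop terminates here because neither replacement can be matched again by any rule; a rule set whose output keeps matching would loop forever, which is the usual caveat of fixed-point rewriting.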
Sliding-cursor rewriting approach
Algorithm:
- the cursor is placed at the beginning of the string;
- patterns are matched only if they occur right after the cursor;
- if a rule is applied, the cursor is placed after the rewritten region;
- if no rule matches, the cursor moves ahead one character;
- the process ends when the cursor reaches the end of the string;
- it never rewrites text that was already rewritten.
Code example: brute-force translation
    RULES/m translate
    (\w+)=e=> $translation{$1} !! exists($translation{$1})
    ENDRULES
Example (_ marks the cursor):
    _ latest train
    último _ train
    último combóio _
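The cursor movement can be emulated in plain Perl with the \G anchor and pos(). The sketch below is an assumption-laden illustration of the algorithm, not the module's implementation; the two-entry dictionary and the translate function are made up for the example.

```perl
use strict;
use warnings;

# Toy dictionary, for illustration only.
my %translation = (latest => 'último', train => 'combóio');

# Sliding-cursor rewrite: try the rule exactly at the cursor (\G).
# If it applies, emit the replacement and jump past the match;
# otherwise copy one character and move the cursor ahead by one.
sub translate {
    my ($text) = @_;
    my $out    = '';
    my $cursor = 0;
    while ($cursor < length $text) {
        pos($text) = $cursor;
        if ($text =~ /\G(\w+)/g && exists $translation{$1}) {
            $out   .= $translation{$1};
            $cursor = pos($text);          # cursor now sits after the match
        }
        else {
            $out .= substr($text, $cursor, 1);
            $cursor++;
        }
    }
    return $out;
}

print translate("latest train"), "\n";
```

Because output is appended to a separate string and the cursor only moves forward, text produced by a rule is never matched again.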
Valid Rewriting Rules
Different approaches allow different kinds of rules, but the most relevant are:
- ==>      simple pattern substitution: the left-hand side is a Perl regular expression and the right-hand side is the string that replaces the match;
- =e=>     like the previous one, but the right-hand side is Perl code to be evaluated; its result replaces the match;
- =begin=> has no left-hand side; the right-hand side code is executed before the rewrite starts;
- =end=>   has no right-hand side; when the left-hand side pattern matches, the rewrite system quits.
Rules can include a restriction block (!!) to the right of the action.
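As a rough mental model, the first two rule kinds and the restriction block correspond to ordinary Perl substitutions. The sketch below shows that analogy on a made-up string; it is not what the module generates, and =begin=> and =end=> are left out because they control the rewriting loop rather than a single substitution.

```perl
use strict;
use warnings;

my $text = "total: 3 items, 12 items";

# ==>  : the right-hand side is a plain replacement string.
(my $plain = $text) =~ s/items/units/g;

# =e=> : the right-hand side is Perl code whose result replaces the match.
(my $evaluated = $text) =~ s/(\d+)/$1 * 2/ge;

# =e=> with a !! restriction: rewrite only when the guard holds,
# otherwise keep the matched text.
(my $guarded = $text) =~ s/(\d+)/$1 > 10 ? $1 * 2 : $1/ge;

print "$plain\n$evaluated\n$guarded\n";
```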
Rewriting Text into XML
How to produce XML from weakly structured data?
- write a parser;
- or rewrite the data step by step into XML!
Two case studies:
- rewriting a dictionary in textual format into TEI;
- rewriting an XML DSL authoring tool into XML.
Rewriting Text into TEI
Rewrite this...
    *Cachimbo*,
    _m._
    Apparelho de fumador, composto d..
    Peça de ferro, em que entra o es..
    Buraco, em que se encaixa a vela..
    * _Bras. de Pernambuco._
    Bebida, preparada com aguardente..
    * _Pl. Gír._
    Pés.
    (Do químb. _quixima_)
...into this!
    <entry id="cachimbo">
    <form><orth>Cachimbo</orth></form>
    <sense>
    <gramGrp>m.</gramGrp>
    <def>
    Apparelho de fumador, composto d..
    Peça de ferro, em que entra o es..
    Buraco, em que se encaixa a vela..
    </def>
    </sense>
    <sense ast="1">
    <usg type="geo">Bras. de Pernamb..
    <def>
    Bebida, preparada com aguardente..
    </def>
    </sense>
    <sense ast="1">
    <gramGrp>Pl.</gra..
    <usg type="style">Gír.</usg>
    <def>
    Pés.
    </def>
    </sense>
    <etym ori="químb">(Do químb. _qu..
    </entry>
Rewriting Text into TEI
This rewrite was all based on:
- a few tables (grammatical and usage strings);
    entry genres:  Gír Fam Pop Des Fig Vulg Ant Chul Euph
    entry domains: Agr Anat Anthrop Apicult Arith Artilh Archit
- rewriting the little existing mark-up into a better XML structure;
    ((\* )?_([^_]|_[^_]{1,5}_)+_( *)?)\n=e=>$a=$1;end_def.end_sense.start_sense.gramGrp($a)."\n".start_def
- rewriting the new XML structure to detect and annotate a more complex structure;
    <gramGrp>([^<]*)\s*\*\s*([^<]*)</gramGrp>=e=>$a="$1 $2"; "ast=\"1\"".gramGrp($a)
- detecting and correcting wrong XML elements.
    </form></sense>==></form>
    </form></def>\n</sense>==></form>
Rewriting Text into TEI
Case study conclusions:
- flexible tool;
- works on big files:
    - the text file is 13 MB;
    - the output XML is 30 MB;
    - the process takes about nine minutes!
- we even rewrote XML into XML.
Hey!! XML is text!! How can we rewrite it!?
Rewriting XML
- different from the usual DOM- or SAX-oriented approaches;
- looks at XML as text, as unstructured data;
- the rewrite can be done:
    - as in any other text rewriting system;
    - taking advantage of irregular expressions.
Irregular expressions? Are you kidding?
Not so regular expressions
Perl has a powerful regular expression engine:
- regular expressions can define capture zones: small pieces of the match that can be used later;
- regular expressions can define look-ahead or look-behind: checks on the context of the matching zone;
- since Perl 5.10, regular expressions can be recursive: a regular expression can depend on itself.
    my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
For XML, we defined two classes:
- [[:XML:]] matches any well-formed XML fragment;
- [[:XML(tag):]] matches an XML fragment with a specific root element.
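The recursion feature can be tried directly. The sketch below uses the $parens pattern from the slide and then a deliberately simplified stand-in for an [[:XML:]]-like class. The $element pattern is an assumption for illustration, not the module's definition: it ignores self-closing tags, comments, CDATA and processing instructions, and it does not check that opening and closing tag names agree.

```perl
use strict;
use warnings;

# Balanced parentheses: (?-1) recurses into the group it sits in.
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;

# Simplified nested-element pattern using a named recursive group.
my $element = qr{
    (?<element>
        < [\w:.-]+ [^<>]* >             # opening tag, attributes skipped over
        (?: [^<]++ | (?&element) )*+    # text or nested elements
        </ [\w:.-]+ >                   # closing tag, name not verified
    )
}x;

# Return the first balanced parenthesised group, or undef.
sub first_parens  { my ($s) = @_; return $s =~ $parens ? $1 : undef }

# Return the first complete element, or undef.
sub first_element { my ($s) = @_; return $s =~ /($element)/ ? $1 : undef }

print first_parens("f(a, g(b, c)) + h(d)"), "\n";
print first_element("see <a><b>x</b><c>y</c></a> here"), "\n";
```

A pattern restricted to one root element, in the spirit of [[:XML(tag):]], could be obtained by fixing the tag name in the outermost opening and closing tags.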
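Rewriting XML
As a simple example, we can remove duplicate translation units in a translation memory file:
Code example
    RULES/m duplicates
    ([[:XML(tu):]])==>!!duplicate($1)
    ENDRULES

    # md5 is assumed to come from Digest::MD5; dtstring and $c from XML::DT.
    sub duplicate {
        my $tu = shift;
        # Fingerprint the textual content of the translation unit.
        my $tumd5 = md5(dtstring($tu, -default => sub { $c }));
        return 1 if exists $visited{$tumd5};   # already seen: remove it
        $visited{$tumd5}++;                    # first time: remember it
        return 0;
    }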
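The same idea can be tried without Text::RewriteRules or XML::DT. The following sketch uses only core Perl and Digest::MD5; it assumes <tu> elements are never nested, so a non-greedy regular expression can stand in for [[:XML(tu):]], and it strips tags with a crude substitution instead of dtstring. The remove_duplicates name and the two-argument duplicate signature are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# True when a unit with the same text content was already seen.
sub duplicate {
    my ($tu, $visited) = @_;
    (my $content = $tu) =~ s/<[^>]*>//g;    # drop the tags, keep the text
    $content =~ s/\s+/ /g;                  # normalize whitespace
    return $visited->{ md5_hex($content) }++ ? 1 : 0;
}

# Delete every <tu>...</tu> whose content duplicates an earlier unit.
sub remove_duplicates {
    my ($tmx) = @_;
    my %visited;
    $tmx =~ s{(<tu\b[^>]*>.*?</tu>)}{
        my $unit = $1;
        duplicate($unit, \%visited) ? '' : $unit
    }gse;
    return $tmx;
}

my $tmx = '<tu><tuv>cat</tuv></tu><tu><tuv>dog</tuv></tu><tu><tuv>cat</tuv></tu>';
print remove_duplicates($tmx), "\n";
```

Hashing the content keeps the lookup table small even when translation units are long, which matters for large translation memories.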
Conclusions
The rewriting approach is:
- flexible;
- powerful;
- easy to learn;
- but it grows quickly;
- and big systems can be difficult to maintain.
The Perl regular expression engine:
- makes it easy to match anything;
- almost supports full grammars;
- makes it possible to define block structures.
So, it can be applied to XML easily!
Thank you!
Alberto Simões