
Everything You Know About Regexes Is Wrong

Damian Conway

Everything You Know About Regexes Is Wrong Copyright © Thoughtstream Pty Ltd, 2014-2015 http://damian.conway.org

1

What are regular expressions really?

The “problem” of regular expressions

• There’s a well-known and widely quoted observation about regexes¹ which goes:

Some developers, when confronted with a problem say: “I know! I’ll use regular expressions!” Now they have two problems.

• And there’s a significant amount of truth mixed in with the snark.

• In deciding to use regular expressions, most developers do indeed find themselves with a second problem…but not the second problem that most people think they have.

• The problem with using regexes is not that regexes are broken, or difficult, or unreadable, or error-prone.

• The problem with using regexes is that most developers don’t understand what regexes really are, how they actually work, or how to efficiently construct them.

• It’s no different from a Java or a C or a Python developer, on being confronted with a problem for which their familiar language is ill-equipped, saying: “I know! I’ll use Haskell!”

• Now they do indeed have a second problem: not the problem that Haskell is broken or difficult or unreadable or error-prone; but the problem that they simply don’t know or understand Haskell.

• Or (far worse!) that they think they know, but actually fundamentally misunderstand, Haskell.

How not to think about regexes

• And that’s the main problem with regular expressions: developers think they know what they are, how they work, and how to create them.

• And they’re mostly wrong.

1 …which is usually attributed to Jamie Zawinski, who was the principal initial vector of the meme, though not its originator.


2

• The most common mistake is to think of a regular expression as a “pattern”.

• That is: to think of a regex as a shorthand description, or formal declarative blueprint, of the general structure of a set of conforming character strings.

• But regular expressions are not declarative.

• And they do not describe strings.

• And they are not generalizations of structure.

• And they don’t parse strings by “seeing” if the text “conforms” to a “pattern”.

• As long as developers persist in thinking of regexes in those incorrect ways, regexes will continue to be a problem for them.

What regular expressions really are

• So, if regexes aren’t shorthand descriptions of string structure, what are they?

• They are actually something far more familiar to developers.

• Regular expressions are subroutines.

• They are imperative specifications of block-structured sequences of instructions, which are used to execute tasks on a highly specialized virtual machine (a.k.a. “the regex engine”).

• Each regex consists of one or more code statements that form a series of commands, loops, flow control, and assertions, all of which are intended to perform a particular subtask.

• Matching a regex is equivalent to calling a subroutine, passing it arguments, and then either receiving back one or more return values, or else producing some side-effect.

• Most often, the principal argument to be processed is a text string.

• Most often, the configuration options are flags that either alter the meaning of specific commands within the regex, or else modify the overall behaviour of the virtual machine on which the regex runs.


3

• Most often, the return value is a simple Boolean (i.e. did the subroutine succeed or not?), plus some extra results that isolate and label important components of the input data.

• Most often, any side-effect is a modification of the original input string, either removing, adding to, or rearranging its contents by substitution.

• Once you start to think about (and actually read) regexes as subroutines, three important things happen.

• First, you begin to be able to correlate the syntax of a regex with what it actually does.

• Second, you begin to be able to apply all your existing software development skills to developing regexes as well.

• Third, you no longer have two problems; only your original problem…plus a much better and more usable tool for solving it.

Regex dialects

• If regexes are subroutines, the next obvious question is: what language are those subroutines written in?

• Unfortunately, the answer to that is: in no single language.

• Since they were first invented back in the 1950s² and first implemented in software tools in the 1960s³, the syntax and semantics of regular expressions have been repeatedly extended and modified, evolving independently within different languages and utilities.

• This process has produced a wide variety of mutually incompatible regex dialects.

• These dialects almost all bear strong familial resemblances to each other (much in the way that C, C++, Objective-C, and C# do).

• But they also all have significant and unique differences in syntax, semantics, feature-sets, performance, and convenience (much in the way that C, C++, Objective-C, and C# do).

2 …by mathematician Stephen Kleene.

3 …by computer scientist Ken Thompson.


4

• Unfortunately, most of the obvious differences between regex dialects are at the syntactic level, making it extremely difficult to understand a given regex if it isn’t written in the dialect with which you are most familiar.

• Moreover, the syntactic differences between dialects are of the very worst kind: the same syntax in two different dialects means exactly the opposite instruction, and the same instruction in two different dialects is represented with exactly the opposite syntax.

A taxonomy of regex dialects

• To overcome this “confusion of tongues”, every regex in this discussion will be shown in the syntax of every major dialect of regular expressions, and you should simply focus on the version that matches the dialect used by your preferred programming language and/or command-line utility.

• There are six major dialects of regular expression syntax we will cover:

Designator   Full name                                              Dialect used in…
BRE          POSIX 1003.2 (section 2.8) basic regular expressions   ed, sed, grep
ERE          GNU extended regular expressions                       egrep, gawk, Notepad++, vile, Tcl (with extra extensions)
EMACS        Emacs/Elisp regular expressions                        emacs
VIM          Vim/Vimscript regular expressions                      vim
PCRE         Perl-compatible regular expressions                    The PCRE library, the .NET runtime, Apache, BBedit, C#, Delphi, Java (subset), JavaScript (subset), PHP, Perl 5 (with extra extensions), PowerShell, Python, R, Ruby, SAS, TextMate, Ultraedit, VB.NET
PSIX         Perl 6 Regular Expressions                             Perl 6


5

• Note that many languages and tools offer regexes using PCRE syntax, but most actually provide only a subset of the full specification⁴.

• The PSIX dialect is currently used only in the Perl 6 programming language, but is significant because it is the first major regex dialect to break away from the deep family resemblances shared by all other major dialects.

• PSIX was designed to make better use of the limited number of available metacharacters (i.e. punctuation and symbols).

• The new dialect attempts to use metacharacters more consistently and predictably, to distribute them more “fairly” (i.e. shorter metasyntaxes for more frequently used constructs), and thereby to enhance the overall readability of its regexes.

• For example, a regex that looks like this in other dialects (with only trivial variations in metasyntactic backslashing):

/(?<!\s)(?:\d{3,7})(?:[^\W\d]\w*+\s++)+;(?<MSG>.*)/

• …looks like this in PSIX:

/ <!after \s> [\d**3..7] <ident>+%<ws> ';' $<MSG>=(.*)/

4 This is inevitable. The Perl 5 programming language is still being actively developed and enhanced, and new regular expression features are still frequently added. Even the official PCRE library itself currently does not offer the full set of regex features provided in Perl 5. Typically, new languages adopt PCRE syntax and semantics at the time they are originally being designed, then more-or-less “freeze” their regex feature set at that point. So, for example, Java and JavaScript regexes are both subsets of PCRE, but JavaScript regexes are the more complete subset, having been specified later (i.e. when Perl 5 regexes were more fully developed).


6

How do regular expressions actually work?

• Regular expressions are subroutines that run on a highly specialized virtual machine.

• In particular, they run on a type of virtual machine known as a finite state automaton⁵.

• Notionally, it’s possible to convert any valid regex into a directed graph of nodes and links, where each link represents an instruction to be executed, and each node represents the current state of the engine as well as any assertions to be tested in that state.

• The instruction that each link represents is almost always a command to test whether a particular character (or one of a set of particular characters) is present in the string at the current point of matching.

• For example, consider the following regex:

ERE EMACS VIM PCRE PSIX

/abc/

• This would be converted to the graph:

START --a--> ○ --b--> ○ --c--> MATCH

5 Or, more precisely, regular expressions were originally designed to run on finite-state machines, but the gradual introduction of various “non-regular” features, such as backreferences and regex subroutines, has meant that most modern regex engines are now implemented using normal stack-based virtual machines. In other words, many programming languages really do treat a regex as the source code for a text-matching subroutine, and compile that source down to a format that is directly executable on their own underlying virtual machine. However, in recent years there has been a push back towards purely automata-based regex engines for performance reasons, although returning to such implementations necessarily restricts the features a regex dialect can provide.


7

• This is equivalent to code something like this:

for 0..str.length-1 -> startpos {
    matchpos = startpos;
    try {
        str[matchpos] == 'a' or throw Backtracking;
        matchpos++;
        str[matchpos] == 'b' or throw Backtracking;
        matchpos++;
        str[matchpos] == 'c' or throw Backtracking;
        matchpos++;
        return TRUE;
    }
    catch Backtracking {
        next startpos;    // try again, one character further down the string
    }
}
return FALSE;

• …which is why we invented regexes in the first place!
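The pseudocode above can be transcribed almost line for line into a runnable language. The following is a minimal Python sketch, not a description of how any real regex engine is implemented. The names `Backtracking`, `expect`, and `match_abc` are illustrative only, and an explicit bounds check stands in for the pseudocode's implicit failure when the match position runs off the end of the string.

```python
class Backtracking(Exception):
    """Raised when the current match attempt cannot continue."""


def expect(s: str, pos: int, ch: str) -> int:
    """Match one literal character at pos and return the next position."""
    if pos >= len(s) or s[pos] != ch:
        raise Backtracking
    return pos + 1


def match_abc(s: str) -> bool:
    """Hand-coded equivalent of the regex /abc/."""
    for startpos in range(len(s)):
        matchpos = startpos
        try:
            matchpos = expect(s, matchpos, "a")
            matchpos = expect(s, matchpos, "b")
            matchpos = expect(s, matchpos, "c")
            return True
        except Backtracking:
            continue  # retry one character further down the string
    return False
```

Calling `match_abc("12ababc")` succeeds, while `match_abc("12abab")` exhausts every start position and reports failure.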

The basic matching process

• Let’s use the /abc/ graph to match against the string "12ababc".

• Initially, the engine is at the node at the very start of the graph and its current match position is at the very first character of the string:

• The only outgoing arrow from the current node requires the string to match an "a" at the current match position.

• But there’s a "1" at the current match position, so no transition along the link is possible.

[Figure: the /abc/ graph with the engine at START and the match position at the "1" of "12ababc"]


8

• Therefore, the engine tries again at the next position down the string:

• There is no outgoing link from the initial node that tells the regex engine to match a "2", so again there’s no way to progress through the graph.

• So again the engine looks further down the string:

• Here it is possible for the engine to transition to the next node, which it does, while marking the point in the string where it started:

• Now the only available outgoing link specifies that a "b" is required, which is (happily) what comes next in the string.

[Figures: the same graph as the match position moves to the "2", then to the first "a", and then past that "a" to the first "b" of "12ababc"]


9

• So the engine can transition to the next node, marking the string to record that it has accepted the "b" as well:

• Now a "c" is required for the next link transition, but the string offers an "a" at the match position instead.

• The engine cannot proceed, so it returns to the START node and at the same time resets its match position back to the position in the string where it started this failed match attempt.

• This simultaneous rewinding of match position in the string and current node in the graph is known as “backtracking”.

• Having backtracked, the regex engine then moves the match position one character further down the string, and tries again:

• Once again, however, it is immediately stymied because the instruction represented by the link is match-a, but the regex engine is confronted with a "b".

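[Figures: the graph after "ab" has been accepted, with the match position at the second "a"; then the graph after backtracking, with the engine at START and the match position at the first "b"]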


10

• So once again it looks further down the string:

• Now it can proceed along the "a" link, so it marks its start position and moves along both the string and the graph:

• Having reached the MATCH node, the graph has been fully traversed, so the regex engine has successfully executed the subroutine /abc/ to match the string "12ababc".
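The walk-through above can be replayed mechanically. This is a small Python sketch under the assumption that the regex is a plain sequence of self-matching characters; `trace_literal` is a hypothetical helper that records, for each start position, how many links of the graph were traversed before the attempt either failed or reached MATCH.

```python
def trace_literal(pattern: str, s: str):
    """Return (match_start, attempts); attempts lists (startpos, links_traversed)."""
    attempts = []
    for startpos in range(len(s)):
        accepted = 0
        # Walk the graph: one link per character of the pattern.
        while (accepted < len(pattern)
               and startpos + accepted < len(s)
               and s[startpos + accepted] == pattern[accepted]):
            accepted += 1
        attempts.append((startpos, accepted))
        if accepted == len(pattern):
            return startpos, attempts  # reached the MATCH node
        # Otherwise: backtrack to START and move one character along.
    return None, attempts


start, attempts = trace_literal("abc", "12ababc")
for startpos, accepted in attempts:
    print(f"start at {startpos}: traversed {accepted} link(s)")
print("matched at", start)
```

The third attempt (start position 2) traverses two links before failing, which is the backtracking step described above; the fifth attempt traverses all three.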

[Figures: the engine at START with the match position at the second "a"; then the successive transitions along the a, b, and c links, ending at the MATCH node having consumed the final "abc"]


11

Deterministic and non-deterministic automata

• The kinds of automata that regex engines implement fall into two general categories: deterministic and non-deterministic.

• Deterministic regex engines (DFAs) compile each regex down to a series of unidirectional links, each of which is a command to match a particular character within the input string.

• A deterministic engine always walks forward through links in the resulting graph, and is only ever presented with (at most) one viable transition at each node.

• If there is no viable transition at a particular node, a DFA engine immediately stops matching and reports failure.

• Non-deterministic regex engines (NFAs) compile each regex to a series of directed links, each of which is a command to attempt to match a particular next character in the string.

• An NFA normally transitions forward through links in the resulting graph, but (unlike a DFA) may find two or more equally viable alternative outgoing links at each node.

• If there is no viable outgoing link at a particular node, an NFA engine backtracks one or more of the links it has already traversed, looking for other viable execution paths through the graph.

• It only stops matching and reports failure when every possible alternative sequence of commands has been attempted.

• In other words, DFAs perform a single path traversal through their graph, whereas NFAs perform a full recursive graph traversal algorithm.

• Or, more precisely: DFA-based regex engines compile their regexes to graphs that can only ever be traversed along a single execution path for a given input string, whereas NFA-based engines compile their regexes to graphs in which full recursive graph traversal is (sometimes) necessary.

• Most regex engines (whether DFA or NFA) also add an extra level of backtracking to their virtual machine: when a match fails, the engine backs up to the beginning of its graph, moves one character down the string being matched, and tries again from the new string position.


12

• For example, consider the following regex:

EMACS VIM /abc\|abx/

ERE PCRE PSIX

/abc|abx/

• A DFA regex engine would convert that to the following graph:

• An NFA regex engine would convert the same regex to:

• Clearly, the DFA version is likely to be faster, as it requires a maximum of three transitions (and no backtracking), no matter what the input string contains.

• In contrast, the NFA will require up to six transitions (and two backtracking steps) on most input strings.

• In general, the cost of matching a string against a DFA implementation is always linear (proportional to the length of the string), whereas matching against an NFA can be exponential in the worst case.

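[DFA graph: START --a--> ○ --b--> ○, from which two links leave: --c--> MATCH and --x--> MATCH]

[NFA graph: two separate paths leaving START: START --a--> ○ --b--> ○ --c--> MATCH (upper) and START --a--> ○ --b--> ○ --x--> MATCH (lower)]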
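To make the DFA side of the comparison concrete, here is a minimal Python sketch of the deterministic graph for /abc|abx/ as a transition table. The state numbers are an assumption made for illustration (0 is START, 3 is MATCH), and the two MATCH nodes of the figure are merged into one accepting state for brevity.

```python
# Hypothetical state numbering: 0 is START, 3 is the accepting (MATCH) state.
DFA = {
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "c"): 3,
    (2, "x"): 3,
}
ACCEPT = 3


def dfa_match_at(s: str, startpos: int) -> bool:
    """Walk the DFA from startpos; at most one transition is viable per state."""
    state = 0
    for ch in s[startpos:]:
        state = DFA.get((state, ch))
        if state is None:
            return False  # no viable transition: fail immediately
        if state == ACCEPT:
            return True
    return False  # ran out of input before reaching MATCH


def dfa_search(s: str) -> bool:
    """Retry from each successive start position in the string."""
    return any(dfa_match_at(s, i) for i in range(len(s)))
```

Because no two entries share the same (state, character) key, each attempt follows a single path of at most three transitions and never needs to revisit an earlier choice.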


13

• So DFAs would always be the preferred implementation approach for regex engines…except for three major problems.

• First, the translation from regex to DFA is more computationally expensive than the conversion to NFA (and sometimes exponentially more expensive), which can cancel out the performance advantages of the DFA approach.

• Second, the translation from regex to a DFA graph can produce–in the worst case–an exponential number of nodes, whereas in an NFA representation the growth in number of nodes is always linear (proportional to the length of the regex).

• Third, several of the most useful advanced features of modern regular expressions are extremely difficult, or simply impossible, to implement under a DFA representation.

• For these reasons, DFA-based regex engines tend to provide fewer useful features, but better performance; whereas NFA-based regex engines provide more features, but need to provide special backtracking-control constructs to allow regex designers to optimize performance.

The full regex execution process

• Backtracking becomes much more important when it’s possible to take more than one execution path through the commands in a regex graph, without having to first move down the string.

• Consider the graph from the previous example:

• The | or \| in the original regular expression indicates an alternative execution path, so now an NFA regex engine has two choices when encountering a leading "a" in a string.

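[Graph: two separate paths leaving START: an upper path START --a--> ○ --b--> ○ --c--> MATCH and a lower path START --a--> ○ --b--> ○ --x--> MATCH]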


14

• Normally, it chooses the path that was specified leftmost in the regular expression (the upper execution path in this representation).

• If traversal of the upper path should fail at some link and the engine is forced to backtrack to the START state, the engine will remember to try the lower execution path before giving up and moving the search down the string.

• For example, when matching the string "pabx", the engine would first take the upper path, getting as far as the second-last node before failing.

• Unable to proceed, it would backtrack to the START and try the lower execution path instead, the instructions of which would then enable it to successfully reach the MATCH state:

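[Figures: the NFA graph for /abc|abx/ matching "pabx". In the first, the upper path accepts "a" and "b" but fails at its c link; in the second, after backtracking to START, the lower path accepts "a", "b", and "x" and reaches MATCH.]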


15

Consequences of the full matching algorithm

• The regex engine tries to traverse the regex’s graph by every available execution path (trying leftmost alternatives in the regex first), backtracking as necessary.

• If it fails to find any sequence of commands that match the string, it steps one character down the string and repeats its traversal.

• If it manages to traverse to a MATCH state at any point, it immediately stops and reports success.

• If it fails to ever reach a MATCH state from every position in the string by every possible execution path, it stops and reports failure.

• Note that this is equally true for both DFA and NFA implementations; the only difference is that DFA implementations construct a graph in which no two links leaving a node represent the same command.

• That means a DFA never needs to backtrack, as all previous alternatives along the current execution path are mutually exclusive, and hence cannot match what was actually encountered earlier in the string.
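For the NFA case, the general algorithm described above can be sketched in a few lines of Python. This is a simplification under stated assumptions: each alternative is a plain literal string, a whole execution path is tested in one step with `str.startswith` rather than link by link, and `nfa_search` is a hypothetical name.

```python
def nfa_search(alternatives, s):
    """Try each alternative, leftmost first, at each start position.

    Returns (startpos, alternative, log); the first two are None on failure.
    """
    log = []
    for startpos in range(len(s)):
        for alt in alternatives:  # upper path first, then lower
            if s.startswith(alt, startpos):
                log.append(f"pos {startpos}: path {alt!r} reached MATCH")
                return startpos, alt, log
            log.append(f"pos {startpos}: path {alt!r} failed, backtracking")
    return None, None, log


start, alt, log = nfa_search(["abc", "abx"], "pabx")
print("\n".join(log))
```

For "pabx" the log shows both paths failing at position 0, the upper path failing at position 1, and the lower path then reaching MATCH.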

• There are several important consequences of this general algorithm.

• First, a regex engine tries all possible execution paths at the current starting position in the string, before it moves down the string to try again.

• This means that a regex will always match as close to the start of a string as possible, even if there are “better” candidates for a match later in the string.

• For example, consider the regex:

EMACS VIM /anteater\|antelope\|ant/

ERE PCRE PSIX

/anteater|antelope|ant/

• …matching against the string: "An ant encountered an anteater"


16

• The regex will match the standalone word "ant" in "An ant encountered an anteater", rather than the "anteater" at the end of the string, despite the fact that the longer “word” appeared earlier in the regex.⁶

• The second important consequence arises from the fact that an NFA regex engine always tries the leftmost alternative first (i.e. it starts with the topmost execution path) at any given point in a string.

• This means that a common mistake under NFA implementations is to write something like:

ERE PCRE PSIX

/ant|antelope|anteater/

EMACS VIM /ant\|antelope\|anteater/

• If the string being matched is: "An anteater encountered an ant", then the first alternative in the regex will successfully match just the leading "ant" of "anteater".

• At which point the graph traversal process will terminate immediately, without even considering the third alternative.

• Together with the previous observation, this means that a regex matches the leftmost viable alternative at the earliest possible position in the string, regardless of whether there is a better alternative later in the regex, or a better match later in the string.

6 This also illustrates the fundamental problem with thinking of regexes as “patterns”, instead of as what they really are: subroutines. The regex engine isn’t trying to abstractly “identify with” one of three “words”; it’s merely executing a sequence of low-level commands, each of which matches a single character, and which are structured into three alternative execution paths to be tried in strict sequence, from the start of the string.
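Both consequences are easy to observe in a backtracking engine. The following sketch uses Python's standard `re` module (one of the PCRE-style dialects listed earlier), which tries alternatives from left to right; the variable names are illustrative.

```python
import re

# Earliest position wins, even though a longer alternative could match later.
first = re.search(r"anteater|antelope|ant", "An ant encountered an anteater")

# Leftmost alternative wins, even though a longer alternative also fits here.
second = re.search(r"ant|antelope|anteater", "An anteater encountered an ant")

print(first.group(), first.span())    # ant (3, 6)
print(second.group(), second.span())  # ant (3, 6)
```

In the second case, reordering the alternatives so that the longer ones come first (as in the first regex) is what allows "anteater" to be matched in full.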


17

How are regular expressions constructed?

• Once you assimilate the idea that regexes are simply subroutine definitions, or executable graph specifications, all that remains is to work out what each standard component of a regex is equivalent to in code, or how it translates to a subgraph.

• Once you understand what each “command” in your favourite dialect means, you can start to apply all your skills and experience in software design to the design of regexes too.

• In this section we’ll look at the standard commands and other control components of various regex languages and discover their equivalents in both pseudocode and graph notation.

Simple self-matching commands

• The vast majority of commands in any regex dialect are simple instructions to match a particular character at the current position in the string.

• For example, in all major regular expression dialects a plain alphabetic or numeric character is an instruction to match that particular character (often called: “self-matching commands”).

• For example, for any alphanumeric X, the regex /X/ is equivalent to the pseudocode:

str[matchpos] == 'X' or throw Backtracking; matchpos++;

• …and to the subgraph:

X


18

The whitespace commands

• Whitespace characters such as space, tab, and newline are also self-matching in most dialects.

• For example:

BRE VIM ERE EMACS PCRE

/H 2 O/

• …matches (only) the string "H 2 O", because it’s equivalent to the graph:

• …with each whitespace character in the regex representing an instruction to match precisely that same character in the string.

• However, in the PSIX dialect, whitespace characters within a regex are ignored (like they are in the source code of most programming languages), so the same regex in PSIX would be equivalent to the graph:

• …and hence would match (only) "H2O" instead.

• Tcl and the various PCRE-based and PCRE-like languages such as Perl 5, Python, C#, etc. all provide a modifier (either within the regex itself or passed to the call that performs the match) that causes the regex engine to ignore embedded whitespace⁷.

7 See page 28 for details.

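[Graph for BRE, VIM, ERE, EMACS, and PCRE: START --H--> ○ --' '--> ○ --2--> ○ --' '--> ○ --O--> MATCH]

[Graph for PSIX: START --H--> ○ --2--> ○ --O--> MATCH]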
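As one illustration of such a modifier, Python's `re` module accepts the `re.VERBOSE` flag as an argument to the call that compiles the regex. The sketch below shows the same regex text behaving in both ways; the variable names are illustrative.

```python
import re

spaced = re.compile(r"H 2 O")               # whitespace is self-matching
ignored = re.compile(r"H 2 O", re.VERBOSE)  # whitespace in the regex is skipped

print(bool(spaced.fullmatch("H 2 O")), bool(spaced.fullmatch("H2O")))    # True False
print(bool(ignored.fullmatch("H 2 O")), bool(ignored.fullmatch("H2O")))  # False True
```

With the flag set, the regex behaves like the PSIX version above and matches only "H2O".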


19

Metasyntactic commands

• In most regex dialects the majority of non-alphanumeric/non-whitespace characters are also self-matching commands.

• Characters that don’t match themselves are known as metacharacters.

• They are used to specify behaviour modifiers, non-literal matching, backtracking control, embedded code, and context assertions.

• The list of which non-alphanumerics act as metacharacters (or which introduce multi-character metasyntax) varies slightly from dialect to dialect:

BRE . \ ^ $ * [

VIM . \ ^ $ * [ ~

ERE EMACS PCRE

. \ ^ $ * [ ( ) | + ? {

PSIX Every non-alphanumeric character except _

• Because each of the above characters has a special meaning in its own regex dialect, if you want to match a literal dot or backslash or caret or any other metasyntactic character, you have to “quote” or “escape” the special meaning, by prefixing the character with a backslash.

• For example, to match the string "****$999.99", you would need:

BRE ERE EMACS VIM PCRE PSIX

/\*\*\*\*\$999\.99/


20

• This “plague of backslashes” is a common readability issue with regular expressions, so some dialects provide a delimited escape mechanism⁸:

ERE PCRE /\Q****$\E999\Q.\E99/

PSIX / '****$' 999 '.' 99/
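Some languages move the escaping out of the regex syntax and into a library call instead. As a hedged illustration, Python's `re` module has no `\Q…\E` construct, but its `re.escape` function adds the backslashes for you; the variable names below are illustrative.

```python
import re

target = "****$999.99"

manual = r"\*\*\*\*\$999\.99"  # each metacharacter escaped by hand
generated = re.escape(target)  # let the library add the backslashes

print(generated)
print(bool(re.fullmatch(manual, target)), bool(re.fullmatch(generated, target)))
```

Both regexes match the literal string, and neither matches a string in which the dot has been replaced by some other character.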

Unbounded loops

• Like almost all other programming languages, regexes provide several types of loop construct.

• However, whereas most languages offer conditional loops (while) and counted loops (for), regexes combine the two approaches: their loops are all conditional (“loop while looped commands execute successfully…”) but with optional counted limits as well (“…for at least M times and no more than N”).

• Regex loops are further complicated by the need to backtrack on failure, which effectively allows a loop to “unwind” one or more iterations in order to accommodate other matching commands that are attempted after the loop.

• Syntactically, regex loops are also quite different from regular programming language loops: they are not specified by a keyword-condition-commands sequence, but by a postfix operator applied to a single matching command.

• The simplest kind of loop is a zero-or-more loop:

BRE ERE EMACS VIM PCRE PSIX

/X*/

• It’s equivalent to an unbounded loop whose exit condition is the failure of the matching command to which it is applied:

8 Though whether these mechanisms actually improve overall readability is open to debate.


21

loop {
    str[matchpos] == 'X' or exit loop;
    matchpos++;
}

• More strictly speaking, it’s equivalent to a pair of nested loops with a try block to handle any backtracking the loop may need to do if the rest of the regex fails to match:

maxreps = ∞;
loopstart = matchpos;
loop {
    matchpos = loopstart;     // restart the count from where the loop began
    repcount = 0;
    for 1..maxreps {
        str[matchpos] == 'X' or exit for;
        matchpos++;
        repcount++;
    }
    try {
        // the rest of regex’s commands here
    }
    catch Backtracking {
        maxreps = repcount - 1;
        maxreps >= 0 or throw Backtracking;
        next loop;
    }
    else {
        exit loop;
    }
}

• …which (once again) is why we use regexes instead of hand-coded parsers.
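The nested-loop pseudocode can also be expressed as runnable code. The following Python sketch is one possible rendering, not the way any particular engine does it: “the rest of the regex” is modelled as a continuation function that returns the end position of a match or `None` to signal backtracking, and the names `Continuation`, `literal`, and `match_star_maximal` are all hypothetical.

```python
from typing import Callable, Optional

# A continuation tries to match "the rest of the regex" at a position and
# returns the end position of the match, or None to signal backtracking.
Continuation = Callable[[str, int], Optional[int]]


def literal(text: str) -> Continuation:
    """Continuation that matches a fixed string (an illustrative helper)."""
    def run(s: str, pos: int) -> Optional[int]:
        return pos + len(text) if s.startswith(text, pos) else None
    return run


def match_star_maximal(ch: str, rest: Continuation, s: str, pos: int) -> Optional[int]:
    """Match ch* followed by rest, preferring as many repetitions as possible."""
    reps = 0
    while pos + reps < len(s) and s[pos + reps] == ch:
        reps += 1                  # greedy phase: loop while ch keeps matching
    while reps >= 0:
        end = rest(s, pos + reps)  # try the rest of the regex
        if end is not None:
            return end
        reps -= 1                  # backtrack: unwind one iteration
    return None                    # even zero repetitions failed


# /X*XY/ against "XXXY": the loop first takes all three X's, then gives one back.
print(match_star_maximal("X", literal("XY"), "XXXY", 0))  # 4
```

The second `while` loop plays the role of the `catch Backtracking` block: each pass retries the remainder of the regex with one fewer repetition.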

• Despite the complexity of the above pseudocode, the graph representation of a maximal zero-or-more loop is surprisingly simple:

• In fact, this is exactly the same “fork” pattern as for an OR operator, except that the upper link has been curved back into its own originating node.

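[Graph: a single node with two outgoing links: an upper link labelled X that curves back into the same node, and a lower, unlabelled broken-line link that leads on to the rest of the graph]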


22

• Having traversed such a link, the regex engine is then back in a state where it is able to match the same character again…and again…etc.

• Note that the lower branch will be tried only when the regex engine backtracks after eventually failing to match an 'X'.

• Note too that, unlike every graph link we have seen so far, the lower branch is drawn as a broken line and is not labelled with a character that must be matched in order to traverse it.

• It is a “free transition”, which can be traversed without requiring any movement along the string.

• That is, if the regex engine is currently at the left node and cannot successfully traverse the upper branch (because there are no more X’s in the string at that point), then the engine is allowed to traverse the lower branch “for free”, without being required to match anything in the string.

• As mentioned earlier, all regex loops apply to a single regex instruction (such as the match-X in the previous example).

• So if you need to repeatedly match a sequence of any kind, you usually need some form of bracketing, to make the postfix loop operator apply to the entire preceding sequence.

• For example:

               Zero or more   Zero or more   Zero or more   "" or "nom" or "nomnom"
               underscores    digits         vowels         or "nomnomnom" etc.
EMACS VIM      /_*/           /\d*/          /[aeiou]*/     /\(nom\)*/
BRE ERE PCRE   /_*/           /\d*/          /[aeiou]*/     /(nom)*/
PSIX           /_*/           /\d*/          /<[aeiou]>*/   /(nom)*/


23

• The graph for the final (nom)* example would be:

Greedy vs parsimonious loops

• The behaviour of the loop in the previous example also explains why the standard regex loop operators (star, plus, question, range) are often referred to as being “greedy” (though we’ll always refer to such loops as being maximal).

• Every time the regex engine followed the looped link and landed back at the same node, it was following its normal rules: to try the uppermost outgoing link first.

• But, in the case of the loop structure, the uppermost link happened to be the same looped link it had previously followed.

• So as long as a looped link successfully matches another character in the string, the loop keeps looping.

• Only when it encounters a character that doesn’t match what the looped link requires does it stop the cycle and attempt to match along the lower link.

• In other words, the loop link always pre-empts any other alternative traversal and “greedily” consumes every possible matching character in the string, with no regard to whether it may have consumed too many characters, and has thereby prevented the rest of the graph from being successfully traversed to reach a MATCH node.

• Just like the old military dictum, it simply “kills ’em all, and lets the backtracking sort ’em out!”.

• Surprisingly, that’s often an efficient approach, provided that the part of the string being matched by the loop is longer than the part of the string being matched by the rest of the regex.

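[Graph: links labelled n, o, and m in sequence, the last curving back to the node where the n link starts, plus a free transition out of that node]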


24

• But there are also plenty of circumstances in which unconstrained greed is not the best strategy…or even a correct one.

• For example, a common approach to matching a quote-delimited substring (e.g. a character string literal from source code, or a line of speech from a manuscript) is to use a regex like this:

ERE EMACS VIM PCRE

/'.*'/

PSIX / \' .* \' /

• But that doesn’t work well if the string being matched contains more than a single quote-delimited fragment.

• For example, consider the string:

"You say 'tomayto', but I say 'tomahto'."

• The regex will try to execute its initial “match a single-quote” command at the first eight positions in the string…unsuccessfully:


• Then it will try to match the leading single-quote at the ninth position, succeed, and move on to its “match any character zero-or-more-times” loop (i.e. it will try to execute the .*).

• It will iterate that loop as long as the next character in the string matches.

• But every character in the string matches “any character”, so the greedy loop will match the entire remainder of the string, right up until it runs out of characters:

"You say 'tomayto', but I say 'tomahto'."

• At that point, the regex engine will start backtracking, unwinding the .* loop and trying to execute the final “match a single-quote” command instead.

• Specifically, it will backtrack twice around the loop before it winds back to a single-quote that allows the final command to match.



25

• The result will be that executing the regex produces the match:

'tomayto', but I say 'tomahto'


• …which is almost certainly not what we wanted.⁹

• In situations like this, greedy repetitions aren’t the right solution; in fact, we need exactly the opposite: loops that prefer to match as few characters as possible before trying to allow the rest of the regex to match.

• Such loops are sometimes referred to as being “parsimonious”, or “polite”, or “lazy”, or “stingy”, but we’ll just refer to them as being minimal.

• Most regex dialects now provide such loops, using the following syntaxes:

           Zero or more X’s   One or more X’s   Zero or one X’s   M-to-N X’s
           (minimally)        (minimally)       (minimally)       (minimally)
BRE        Not supported
EMACS      /X*?/              /X+?/             /X??/
VIM        /X\{-}/            /X\{-1,}/         /X\{-0,1}/        /X\{-3,7}/
ERE PCRE   /X*?/              /X+?/             /X??/             /X{3,7}?/
PSIX       / X*? /            / X+? /           / X?? /           /X **? 3..7/

• In other words, in VIM we use a range with a minus sign at the start, and in the other dialects we use the maximal loop syntax but with an extra question-mark after the loop specifier.

• The graph specification of a minimal loop is very interesting, and not nearly as complex as you might expect.

9 Although, in fairness, it is exactly what we actually asked for.



• Consider the version of the previous quote-delimited-string regex with a minimal loop instead:

VIM /'.\{-}'/

ERE EMACS PCRE /'.*?'/

PSIX / \' .*? \' /

• The graph for that version is:

• In other words, a minimal loop is simply a maximal loop flipped vertically.

• The effect of that flip is profound, however: now when the engine is at the second node, the first (uppermost) alternative it sees is not the looped “match any character” link, but the final “match the final single-quote” link.

• So it tries that link first, only backtracking to try the lower loop if it can’t immediately find the closing delimiter.

• In other words, instead of greedily racing along the string consuming everything, the regex “peeks ahead” before each loop iteration, hoping to avoid iterating and to terminate the loop by matching the remainder of the regex instead.

• This means that every iteration of a minimal loop is a two-step operation: first try not to iterate the loop (i.e. try to proceed with the rest of the regex instead) or else iterate once…but only if absolutely necessary.

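[Figure: graph of the minimal-loop regex. START leads via a “match '” link to a second node, whose upper outgoing link is “match '” (leading to MATCH) and whose lower link is the {any} loop back to the same node.]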



• So executing the minimal version of the quote-delimited string regex would produce the match: 'tomayto'

• That is: as soon as the loop has consumed the final “o” of tomayto, the upper outward link (match-') will be successfully traversed, and the match will be complete…and correct.
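• The minimal version can be checked the same way. This is a sketch in Python's re module (PCRE-style syntax); the variable names are illustrative assumptions and the sample string is again taken to end at the full stop:

```python
import re

text = "You say 'tomayto', but I say 'tomahto'."

# Minimal loop: before each iteration of .*? the engine first tries the
# closing quote, so the match stops at the FIRST quote that completes it.
minimal_match = re.search(r"'.*?'", text).group()

# Because each match now ends at its own closing quote, repeated matching
# finds every quoted word separately.
all_quoted = re.findall(r"'.*?'", text)

print(minimal_match)  # 'tomayto'
print(all_quoted)     # ["'tomayto'", "'tomahto'"]
```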

• This cautious two-step “perpetual lookahead” approach would seem to imply that a minimal loop may be up to twice as expensive as a maximal loop, but the reality is (as usual with regexes) somewhat more complicated.

• A maximal loop certainly iterates through a sequence of acceptable characters twice as fast, but if it “overmatches” (as in the “tomayto/tomahto” example) it then has to backtrack and retry the trailing commands of the regex at every earlier position until the remainder of the regex is also satisfied.

• If the maximal loop has significantly overmatched, it may have to do so much backtracking that the total cost is greater than the equivalent minimal matching would have incurred.

• This can be especially expensive if the loop occurs early in the regex, as the number of commands following it will then be proportionally higher, and hence the trailing sequence will be proportionally more expensive to retry at every position.

• Note, however, that you do not always have a choice (as in the “tomayto/tomahto” example), so make sure you first select the kind of loop you really need…and only then consider whether it could be optimized for your particular data by replacing it with the other kind of loop.
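• The trade-off can be made concrete with a toy cost model. The sketch below is not how any real engine accounts for its work: it simply counts one "step" per attempted link traversal while executing the loop and the closing quote, starting just after the opening quote. The function names greedy_steps and minimal_steps are hypothetical:

```python
from typing import Optional, Tuple

def greedy_steps(text: str, pos: int) -> Tuple[Optional[int], int]:
    """Cost of executing  .*'  from pos: returns (end of match or None, steps)."""
    steps = 0
    end = pos
    while end < len(text):          # "any character" always succeeds...
        end += 1
        steps += 1                  # ...one step per character consumed
    while end >= pos:
        steps += 1                  # try the closing quote at this position
        if end < len(text) and text[end] == "'":
            return end + 1, steps
        end -= 1                    # backtrack: give back one character
    return None, steps

def minimal_steps(text: str, pos: int) -> Tuple[Optional[int], int]:
    """Cost of executing  .*?'  from pos: returns (end of match or None, steps)."""
    steps = 0
    end = pos
    while True:
        steps += 1                  # first try the closing quote
        if end < len(text) and text[end] == "'":
            return end + 1, steps
        if end >= len(text):
            return None, steps
        steps += 1                  # only then iterate the loop once
        end += 1

# The closing quote is the last character: no overmatching.
tight = "'" + "x" * 100 + "'"
# The closing quote is followed by a long tail: heavy overmatching.
loose = "'a'" + "x" * 100

for label, s in (("tight", tight), ("loose", loose)):
    print(label, "greedy:", greedy_steps(s, 1), "minimal:", minimal_steps(s, 1))
```

• Both strings contain only one closing quote, so both loops find the same match. On the "tight" string the maximal loop needs roughly half the steps of the minimal one; on the "loose" string the maximal loop overmatches the whole tail and has to unwind it, while the minimal loop finishes almost immediately.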

How to create correct regular expressions

• Understanding the individual commands that make up regexes is important, but it’s not sufficient if we want to be able to build correct and reliable regular expressions from scratch.

• More importantly, the real challenge of using regexes correctly often isn’t creating them, but maintaining them.

• Half a year after you originally wrote your regex, when you return to enhance or refine or extend or modify it, the whole thing can often just seem like a shimmering blur of meaningless line-noise.



Extended formatting

• Imagine if you weren’t allowed to use any (non-significant) whitespace or any kind of comment in your regular programming language.

• That is: imagine you had to write all your Lisp code like this:

(defun e(z) (ceiling(sqrt z))) (defun r(b e) (if(> b e)nil(cons b(r(1+ b)e)))) (defun p(b)(dotimes(i 20)( terpri))(dotimes(o(length b))(if(zerop(mod o(e(length b)))) (terpri)) (princ(if(zerop(h o b)) " " "*"))))( defun a(c c) (cond((or(= c 3) (and(not(zerop c))(= c 2)))1)(0)))(defun c(p x y z b)(h(mod(+ p y(* x(e z)) z z)z)b))(defun n(t z b)(a(h t b)(+(c t -1 -1 z b)(c t -1 0 z b)(c t -1 1 z b)(c t 0 -1 z b)(c t 0 1 z b) (c t 1 -1 z b)(c t 1 0 z b)(c t 1 1 z b))))(defun e( b o)(p b) (sleep 0.1) (if(not(equal b o)) (e(mapcar( lambda(p) (n p(length b) b))(r 0(1-(length b))))b)))

• …or all your C# applications like this:

using System; namespace L{class P{static int L(string s,string t) {int[,]d=new int[s.Length+1, t.Length+1]; for(int i=0;i<=s.Length;i++) d[i,0]=i;for(int j=0;j<= t.Length; j++) d[0,j]=j; for(int j=1;j<=t.Length;j++) for(int i=1;i<=s.Length;i++) if(s[i-1]==t[j-1])d[i,j] =d[i-1,j-1];else d[i,j]=Math.Min( Math.Min(d[i-1,j]+1 ,d[i,j-1]+1),d[i-1,j-1]+1);return d[s.Length,t.Length ];} static void Main( string [] a) { if (a.Length==2) Console.WriteLine("{0}->{1}={2}",a[0],a[1],L(a[0],a[1 ]));else Console.WriteLine("Usage:\n\nL<s1><s2>");}}}

• …or all your Perl 5 programs like so10:

sub'x{local$_=pop;sub'_{$_>=$_[0 ]?$_[1]:$"}_(1,'*')._(5,'-')._(4 ,'*').$/._(6,'|').($_>9?'X':$_>8 ?'/':$")._(8,'|').$/._(2,'*')._( 7,'-')._(3,'*').$/}print$/x($=). x(10)x(++$x/10).x($x%10)while<>;

• No matter how elegant or otherwise readable your language may be, any non-trivial program instantly becomes utterly unreadable if you’re not allowed to lay it out logically or annotate it meaningfully.

10 Okay, so maybe you feel that Perl 5 programs do in fact all look like that…but it’s still not actually a requirement of the language! By the way, the Lisp code is an implementation of Conway’s Game of Life, the C# application computes Levenshtein distances between strings, and the Perl program dot-tallies like a lumberjack.



• And that’s a fundamental problem in most regular expression dialects: because every character in a regex–including each whitespace character–is potentially a separate single instruction, there’s no way to lay out those instructions logically or legibly.

• So regexes typically become unmanageably visually complicated long before they become unmanageably semantically complicated.

• For example, here’s the regex to efficiently match a decimal number:

EMACS /\([+-]\)?\(\d+\.?\d*\|\.\d+\)\([eE][+-]?\d+\)?/

VIM /\([+-]\)\?\(\d\+\.\?\d*\|\.\d\+\)\([eE][+-]\?\d\+\)\?/

ERE PCRE /([+-])?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?/

• That’s not a particularly complex regex, but it’s already close to unreadable, which makes it a maintenance nightmare.

• Of the six major dialects we’ve been considering, only PSIX treats whitespace (and comments!) as non-meaningful by default, which allows you to rewrite the above regex like so in that dialect:

PSIX

    /
        (              # Group and capture the following...
            <[+-]>     # Match a plus or a minus character
        )?             # End of group, which is optional
        (              # Then group and capture the following...
            \d+        # Match one-or-more digits
            '.'?       # Then match an optional literal dot
            \d*        # Then match zero-or-more digits
          |            # or...
            '.'        # Match a literal dot
            \d+        # Then match one-or-more digits
        )              # End of group
        (              # Then group and capture the following...
            <[eE]>     # Match the letter 'e' (either case)
            <[+-]>?    # Then match an optional plus or minus
            \d+        # Then match one-or-more digits
        )?             # End of group, all of which is optional
    /



• Immediately, the regex becomes possible to comprehend, even if it didn’t have the comments to tell you what’s going on.

• And, in six months’ time, when you came back to update that regex, the comments and the block structure would help you quickly refamiliarize yourself with the various components.

• Fortunately, there are ways to allow whitespace layout and comments in a regex in other dialects too.

• For example, PCRE and the Tcl implementation of ERE both let you place a special modifier at the start of the regex (or, indeed, at the start of any nested group) to switch off the meaning of whitespace and enable comments:

PCRE Tcl

    /(?x)              # Remainder of regex in "extended" format
        (              # Group and capture the following...
            [+-]       # Match a plus or a minus character
        )?             # End of group, which is optional
        (              # Then group and capture the following...
            \d+        # Match one-or-more digits
            \.?        # Then match an optional literal dot
            \d*        # Then match zero-or-more digits
          |            # or...
            \.         # Match a literal dot
            \d+        # Then match one-or-more digits
        )              # End of group
        (              # Then group and capture the following...
            [eE]       # Match the letter 'e' (either case)
            [+-]?      # Then match an optional plus or minus
            \d+        # Then match one-or-more digits
        )?             # End of group, all of which is optional
    /
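• Python's re module offers the same facility through its re.VERBOSE flag (the flag form of (?x)). The following is a sketch of the decimal-number regex in that style; the names number_regex and parse_number are assumptions introduced here for illustration:

```python
import re

number_regex = re.compile(r"""
    ( [+-] )?             # Optional sign
    (                     # Mantissa...
        \d+ \.? \d*       #     digits, optional dot, optional digits
      |                   #   or...
        \. \d+            #     dot, then digits
    )
    ( [eE] [+-]? \d+ )?   # Optional exponent
""", re.VERBOSE)

def parse_number(s):
    """Return (sign, mantissa, exponent) if s is entirely a decimal number, else None."""
    m = number_regex.fullmatch(s)
    return m.groups() if m else None

print(parse_number("-3.14e10"))   # ('-', '3.14', 'e10')
print(parse_number(".5"))         # (None, '.5', None)
print(parse_number("abc"))        # None
```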

• However, even in regex dialects that don’t support this kind of extended formatting, you can sometimes achieve the same effect by constructing the regex programmatically.

• For example, the restricted subset of PCRE provided in JavaScript11 excludes the (?x) flag, but you can still build the regex up from visually separated components, by string concatenation:

11 …and assuming we can’t just use the vastly superior XRegExp library.



JavaScript

    var numberRegex = new RegExp(
        "("     + // Group and capture the following...
        "[+-]"  + // Match a plus or a minus character
        ")?"    + // End of group, which is optional
        "("     + // Then group and capture the following...
        "\\d+"  + // Match one-or-more digits
        "\\.?"  + // Then match an optional literal dot
        "\\d*"  + // Then match zero-or-more digits
        "|"     + // or...
        "\\."   + // Match a literal dot
        "\\d+"  + // Then match one-or-more digits
        ")"     + // End of group
        "("     + // Then group and capture the following...
        "[eE]"  + // Match the letter 'e' (either case)
        "[+-]?" + // Then match an optional plus or minus
        "\\d+"  + // Then match one-or-more digits
        ")?"      // End of group, all of which is optional
    );

    var matched = numberRegex.test(str);

Regex (de)composition

• There’s another way of improving the comprehensibility (and therefore the correctness) of regexes…by separating regex components, and naming them.

• Recall the earlier regex for matching a decimal number:

EMACS / \( [+-] \)? \( \d+ \.? \d* \| \. \d+ \) \( [eE] [+-]? \d+ \)? /

VIM / \( [+-] \)\? \( \d\+ \.\? \d* \| \. \d\+ \) \( [eE] [+-]\? \d\+ \)\? /



PCRE /(?x) ( [+-] )? ( \d+ \.? \d* | \. \d+ ) ( [eE] [+-]? \d+ )? /

PSIX / ( <[+-]> )? ( \d+ '.'? \d* | '.' \d+ ) ( <[eE]> <[+-]>? \d+ )? /

• Even with extended formatting, it isn’t exactly easy to read, or to verify.

• Nor was it easy to create.

• There’s another programming analogy that helps explain why even moderately complex regexes can be so hard to get right: you have to write them entirely in their own “main”.

• In other forms of coding, we decompose an intractably large problem into small, independent subtasks, which we then implement as functions or subroutines, or as classes with methods, or sometimes as macros.

• And then, instead of hard-coding the full sequence of commands to match the mantissa of a decimal number in a single huge block of code, we simply call a function to do it: match_mantissa(regex, str, matchpos)

• Or we invoke a method: regex.matchMantissa(str, matchPos).

• Or we insert a macro: MANTISSA(regex,str) which silently inlines the messy code for us.

• In other words, the secret of constructing complex software is to decompose it into named components, then implement the overall algorithm by naming those components in the appropriate sequence.

• This kind of decomposition helps developers in three distinct ways.

• First, it breaks monolithic code down into chunks that are small enough for humans to hold in their short-term memory…and hence small enough to be comprehended.

• Second, it increases the abstraction of the code, by replacing long low-level specifications of how (expressed literally, command by command), with much shorter high-level specifications of what (expressed symbolically, using names).



• Third, because most developers are already skilled and well-practised at building software by composing it from smaller named pieces, they ought to be able to apply that same skill set to regexes, if regexes could be built that way.

• But, as we’ve already seen, regular expressions are just software…so regexes can be built that way.

• For example, consider the earlier extended-formatting technique for JavaScript.12

• We could have created a version that is even more maintainable, by decomposing the one lone “main” of the regex into three separate “macros” and then coding the complete regex in terms of those macros:

JavaScript

    var SIGN     = "([+-])";
    // join("") is required: join() with no argument would insert commas
    var MANTISSA = [ "(", "\\d+", "\\.?", "\\d*", "|", "\\.", "\\d+", ")" ].join("");
    var EXPONENT = "([eE]" + SIGN+"?" + "\\d+)";

    var numberRegex = new RegExp( SIGN+"?" + MANTISSA + EXPONENT+"?" );

• The same technique works well for constructing complex regexes inside any programming language, even those that already support extended formatting:

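Python

    import re

    # The fragments contain insignificant whitespace, so the final pattern
    # is compiled with re.VERBOSE (the flag form of (?x)); recent Python
    # versions reject a (?x) that is not at the very start of the pattern.
    SIGN     = r"( [+-] )"
    MANTISSA = r"( \d+\.?\d* | \.\d+ )"
    EXPONENT = r"( [eE] " + SIGN+"?" + r" \d+ )"

    number_regex = re.compile( SIGN+"?" + MANTISSA + EXPONENT+"?", re.VERBOSE )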

Perl 5

    my $SIGN     = qr{(?x) ( [+-] ) };
    my $MANTISSA = qr{(?x) ( \d+\.?\d* | \.\d+ ) };
    my $EXPONENT = qr{(?x) ( [eE] $SIGN? \d+ ) };

    my $number_regex = qr{(?x) $SIGN? $MANTISSA $EXPONENT? };

12 As described in the “Extended formatting” section.




Perl 6

    my regex SIGN     { <[+-]> }
    my regex MANTISSA { \d+ '.'? \d* | '.' \d+ }
    my regex EXPONENT { <[eE]> <SIGN>? \d+ }

    my $number_regex = rx{ (<SIGN>)? (<MANTISSA>) (<EXPONENT>)? };

• It’s important to remember, however, that in all of the above examples (except the Perl 6 version), the SIGN, MANTISSA, and EXPONENT variables are acting like macro expansions in the source code of the regex…not like subroutine calls during its execution.
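• That macro-like behaviour is easy to observe. In this sketch (Python, with the same illustrative variable names as above but without extended formatting, so that the expanded text is easier to read), the compiled regex retains no trace of the names: the text of SIGN has simply been pasted in twice, producing two unrelated capture groups:

```python
import re

SIGN     = r"([+-])"
MANTISSA = r"(\d+\.?\d*|\.\d+)"
EXPONENT = r"([eE]" + SIGN + r"?\d+)"

number_regex = re.compile(SIGN + "?" + MANTISSA + EXPONENT + "?")

# What the regex engine actually receives is the fully expanded text.
print(number_regex.pattern)   # ([+-])?(\d+\.?\d*|\.\d+)([eE]([+-])?\d+)?
print(number_regex.groups)    # 4

groups = number_regex.fullmatch("-1.5e+3").groups()
print(groups)                 # ('-', '1.5', 'e+3', '+')
```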

Structured regex programming

• The problem with this kind of macro-based approach is that developers often unconsciously confuse macros and subroutines.

• So, when they need to build a regex that matches nested data (such as a list whose elements may be numbers or nested lists):

<123,404,<7,8,<1000,1001>,9>,668,42>

• …they initially try something like this:

Vimscript

    let ITEM = '\(' . '\d\+' . '\|' . LIST . '\)'
    let LIST = '<' . ITEM . '\('.','.ITEM.'\)*' . '>'

JavaScript

    var ITEM = "(" + "\\d+" + "|" + LIST + ")";
    var LIST = "<" + ITEM + "(,"+ITEM+")*" + ">";

Python

    ITEM = "(?x) ( \d+ | " + LIST + ")"
    LIST = "<" + ITEM + "(,"+ITEM+")*" + ">"

Perl 5

    my $ITEM = qr{(?x) ( \d+ | $LIST ) };
    my $LIST = qr{(?x) < $ITEM (, $ITEM)* > };



• …which would work fine if the interpolated macro variables were actually functions that could use recursion to match the nested LISTs within the ITEMs.

• But macros aren’t functions, and they certainly don’t recurse.

• So, instead of a clever recursive regex, what we actually end up with in the LIST variable is:

Vimscript

    Error detected while processing bad_macro.vim:
    line 1:
    E121: Undefined variable: LIST
    E15: Invalid expression: '\(' . '\d\+' . '\|' . LIST . '\)'

JavaScript

    ReferenceError: LIST is not defined on line 1

Python

    Traceback (most recent call last):
      File "bad_macro.py", line 1, in <module>
        ITEM = "(?x) ( \d+ | " + LIST + ")"
    NameError: name 'LIST' is not defined

Perl 5

    qr{(?x) < ( \d+ | ) (, ( \d+ | ))* > };

• …because the initial macro expansion of LIST within the ITEM regex either causes a no-such-variable error (in Vimscript and JavaScript and Python) or else “helpfully” interpolates the current contents of $LIST (i.e. nothing!) in Perl 5.

• So the macro approach is useful for building complex regexes out of simple fragments, but it can also be error-prone in subtle and frustrating ways.
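• Even when the ordering problem is worked around, textual expansion can only ever be carried out a fixed number of times. The sketch below expands the LIST and ITEM "macros" by hand to a chosen depth; list_regex is a hypothetical helper written for this illustration, not a technique from the original notes:

```python
import re

def list_regex(max_depth: int) -> str:
    """Expand the LIST/ITEM "macros" by hand, max_depth times (must be >= 1)."""
    item = r"\d+"                                  # innermost items: plain numbers
    lst = ""
    for _ in range(max_depth):
        lst = f"<(?:{item})(?:,(?:{item}))*>"      # a LIST of the current ITEMs
        item = rf"\d+|{lst}"                       # next ITEM: a number or that LIST
    return lst

data = "<123,404,<7,8,<1000,1001>,9>,668,42>"      # nested three levels deep

deep_enough = re.fullmatch(list_regex(3), data) is not None
too_shallow = re.fullmatch(list_regex(2), data) is not None

print(deep_enough, too_shallow)                    # True False
print(len(list_regex(2)), len(list_regex(3)))      # expanded text grows with each level
```

• The expanded regex copes with exactly as many nesting levels as were pasted in, and its text more than doubles with every extra level, which is why a genuinely recursive mechanism is needed.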

• So we’ll look at another way of functionally decomposing and structuring complex regexes, without the constraints and drawbacks of the macro approach.

• The technique is known as independent subpatterns (in PCRE) or grammar rules (in PSIX).

• Unfortunately, it is only available in those two dialects…however that does still cover a wide range of languages and utilities, and if your preferred tool supports either dialect, the approach is definitely worth learning and adopting.



• The technique is based on the understanding that a regex really is a series of executable instructions telling the regex engine how to match against strings.

• It works by adding just two extra regex constructs.

• The first extra construct is a mechanism that allows parts of the regex to be denoted as representing declarations of independent named sequences of regex code, which can be invoked from other locations in the regex, and which return to the invoking location after execution.

• In other words: a mechanism to declare named “regex subroutines”13 inside a larger regex:

PCRE

    /(?x)
        # Main regex code here, then...

        (?(DEFINE)
            (?<NAME1>
                # "subroutine"’s regex code here
            )
            (?<NAME2>
                # "subroutine"’s regex code here
            )
            (?<ET_CETERA>
                # et cetera
            )
        )
    /

13 …which are often referred to as “independent subpatterns”…though not in this discussion.



PSIX

    /
        :my regex NAME1 {
            # "subroutine"’s regex code here
        }

        :my regex NAME2 {
            # "subroutine"’s regex code here
        }

        :my regex ET_CETERA {
            # et cetera
        }

        # Then main regex code here
    /

• Note that in PCRE, regex subroutines are specified with the same syntax as named captures.

• Indeed, if they weren’t hidden away inside a (?(DEFINE)…) block, they would actually be named captures, and the regex engine would then attempt to execute them as part of the main regex.

• When they’re declared inside a (?(DEFINE)…), the regex engine ignores them, unless they’re explicitly invoked from the main regex (or from inside some other regex subroutine that was itself invoked from the main regex).

• In PSIX, in contrast, regexes literally are just another kind of subroutine available in the Perl 6 language, and may be declared inside (or outside) the regex using a subroutine-like syntax (but using the keyword regex, instead of sub).

• The second new construct that’s required is–naturally enough–a mechanism for actually invoking these regex subroutines from other parts of the regex code:

PCRE / (?&NAME) /

PSIX / <NAME> /

• To make use of regex subroutines, we simply define the named components of our complete regex separately (like we do with subroutines), then call them wherever they’re needed in the “main” regex (again, as we do with subroutines).



• For example, we could re-implement the decimal number regex like so:

PCRE

    /(?x)
        (?&SIGN)? (?&MANTISSA) (?&EXPONENT)?

        (?(DEFINE)
            (?<SIGN>     [+-] )
            (?<MANTISSA> \d++\.?+\d*+ | \. \d++ )
            (?<EXPONENT> [eE] (?&SIGN)?+ \d++ )
        )
    /

PSIX

    /
        :my regex SIGN     { <[+-]> }
        :my regex MANTISSA { \d+: '.'?: \d*: | '.' \d+: }
        :my regex EXPONENT { <[eE]> <SIGN>?: \d+: }

        <SIGN>? <MANTISSA> <EXPONENT>?
    /

• Using the same mechanisms, we could also rewrite the list-of-numbers-or-nested-lists example, so that it would actually work correctly:

PCRE

    /(?x)
        (?&LIST)               # Just call the LIST subroutine

        (?(DEFINE)
            (?<LIST>           # The LIST "subroutine"...
                <              # Match an angle
                (?&ITEM)       # Then call the ITEM sub
                (?:            # Then start a loop...
                    ,          # Match a comma
                    (?&ITEM)   # Then call the ITEM sub
                )*+            # End of loop (possessive)
                >              # Then match an angle
            )                  # End of LIST definition

            (?<ITEM>           # The ITEM "subroutine"...
                \d++           # Loop, matching digits
              |                # Or...
                (?&LIST)       # Call LIST sub recursively
            )                  # End of ITEM definition
        )
    /



PSIX

    /
        :my regex LIST {       # The LIST "subroutine"...
            '<'                # Match an opening angle
            <ITEM>*:           # Then loop, repeatedly matching ITEM...
                % ','          # ...matching commas between ITEMs
            '>'                # Then match a closing angle
        }                      # End of LIST definition

        :my regex ITEM {       # The ITEM "subroutine"...
            \d+:               # Loop possessively, matching digits
          |                    # Or...
            <LIST>             # Call LIST recursively
        }                      # End of ITEM definition

        <LIST>                 # Main regex: just call LIST
    /

• Now, because the ITEM and LIST regex subroutines are actual pre-declared symbolic entities in the regex language, they are able to refer to each other without the ordering problems of macro expansion.

• More importantly, they are able to invoke each other in a fully recursive manner, which allows the regex to match any number of ITEMs containing any depth of nested LISTs.

• Note, however, that we are now very far away from DFA implementations of simple graph transition algorithms.

• A regex subroutine is effectively an entirely separate subgraph, that a particular “invocation link” can instantiate, then jump across to, then traverse, then jump back to the original invoking link, where it then continues traversing.
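• Python's standard re module provides neither (?(DEFINE)…) nor (?&NAME), so the following is only an analogy, not the PCRE mechanism itself: it models the LIST and ITEM subroutines as two mutually recursive Python functions, each taking a position in the string and returning the position where its match ended (or None on failure). All function names here are hypothetical:

```python
import re
from typing import Optional

DIGITS = re.compile(r"\d+")

def match_item(text: str, pos: int) -> Optional[int]:
    """ITEM: one-or-more digits, or else a nested LIST."""
    m = DIGITS.match(text, pos)
    if m:
        return m.end()
    return match_list(text, pos)         # "invoke" LIST, then return here

def match_list(text: str, pos: int) -> Optional[int]:
    """LIST: '<' ITEM (',' ITEM)* '>', with no backtracking into the loop."""
    if pos >= len(text) or text[pos] != "<":
        return None
    end = match_item(text, pos + 1)      # "invoke" ITEM, then return here
    if end is None:
        return None
    while end < len(text) and text[end] == ",":
        nxt = match_item(text, end + 1)
        if nxt is None:
            break                        # this iteration failed: leave the loop
        end = nxt
    if end < len(text) and text[end] == ">":
        return end + 1
    return None

def is_nested_list(text: str) -> bool:
    """True if the whole of text is a single LIST."""
    return match_list(text, 0) == len(text)

print(is_nested_list("<123,404,<7,8,<1000,1001>,9>,668,42>"))  # True
print(is_nested_list("<1,<2>"))                                # False
```

• Each call plays the role of an “invocation link”: control jumps into the other routine, works through it, and then resumes at the point of the call, to whatever depth the data requires.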

• Whether you use regex subroutines merely as a clean way of decomposing complex regular expressions, or whether you employ their full recursive power to match complex hierarchically structured data within a string, they are a vastly better way to create regexes.



Learn more about regular expressions

• These notes (and the corresponding presentation) are taken from our full-day class on regular expressions: http://damian.conway.org/Courses/RegexesML.html

• The complete class covers all of the above topics plus:
    - The full regex execution process
    - Further consequences of the full matching algorithm
    - The complete set of metasyntactic commands in all dialects
    - Character sets and named character sets
    - Alternations
    - Other kinds of loops
    - Grouping and capturing of subpatterns
    - Boundary anchors
    - Lookaround
    - Identifier-boundary assertions
    - Match-limiting commands
    - Rematching commands
    - Named captures
    - Deeper consequences and limitations of regex (de)composition
    - Minimizing and eliminating loop backtracking for efficiency
    - Controlling alternation backtracking
    - Refactoring regexes
    - Structured regex programming

• For more information on this class, or on any of our other training services, please contact us via: http://damian.conway.org
