
By: Andrew Cory

Grouping Things & Hierarchical Matching

Grouping characters: ( and )

Allow parts of a regular expression to be treated as a single unit

Useful for matching several words or phrases that share the same base characters or words

Ex. /house(cat|keeper)/ is equivalent to /housecat|housekeeper/

Ex. /(a|[bc])d/ matches ‘ad’, ‘bd’, or ‘cd’

Ex. /(19|20|)\d\d/ matches 19xx, 20xx, or xx
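The grouping examples above can be tried directly. This is a minimal sketch: the word list and the years are sample values chosen for illustration, and the ^ and $ anchors on the year pattern are an addition so that only the whole string is tested.

```perl
use strict;
use warnings;

# Grouping with alternation: /house(cat|keeper)/ accepts the same words
# as the spelled-out form /housecat|housekeeper/.
my @words   = ('housecat', 'housekeeper', 'housefly');
my @grouped = grep { /house(cat|keeper)/ } @words;
my @spelled = grep { /housecat|housekeeper/ } @words;
print "grouped: @grouped\n";
print "spelled: @spelled\n";

# The empty third alternative makes the century prefix optional.
my @years = grep { /^(19|20|)\d\d$/ } ('1999', '2024', '87', '1850');
print "years:   @years\n";
```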

Continued

Backtracking: the step-by-step process of trying alternatives, checking whether each one matches, and moving on to the next alternative when it doesn’t

A regular expression with alternatives or quantifiers can usually be matched against a string along several different paths

Backtracking is a trial-and-error method that works through the string one character at a time

Continued

Backtracking Example: "abcd" =~ /(af|ab)(ce|c|cd)/;

1. Start with the first letter of the string, ‘a’
2. Try the 1st alternative in the first group, ‘af’
3. ‘a’ matches, but ‘f’ doesn’t match ‘b’; backtrack to ‘a’ and try the 2nd alternative, ‘ab’
4. ‘a’ and ‘b’ match the first 2 characters; the first group is satisfied, so move on to the next group
5. ‘c’ matches, but ‘e’ doesn’t match ‘d’; backtrack to ‘c’ and try the 2nd alternative, ‘c’
6. ‘c’ matches; the second group is satisfied, therefore the whole expression is satisfied by “abcd” (the text actually matched is ‘abc’)

Note: the 3rd alternative in the 2nd group, ‘cd’, would match too, but it is irrelevant: the string has already satisfied the regular expression, so it is never tried.
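The outcome of the walk-through above can be checked by looking at what the match actually captured. A minimal sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $string = 'abcd';
my ($matched, $first, $second);

if ($string =~ /(af|ab)(ce|c|cd)/) {
    # $& is the text that matched; $1 and $2 are the two groups.
    ($matched, $first, $second) = ($&, $1, $2);
    print "matched '$matched': group 1 = '$first', group 2 = '$second'\n";
}
```

The second group holds ‘c’ rather than ‘cd’, because the 2nd alternative succeeded before the 3rd was ever tried.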

Extracting Matches

Parentheses not only group; they also extract the parts of the string that matched each group

Ex.

if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
    $hours = $1;
    $minutes = $2;
    $seconds = $3;
}

In list context, the match returns the extracted parts directly:

($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
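Both ways of extracting the time fields can be run side by side. A minimal sketch, assuming a sample time string that is not from the original slides.

```perl
use strict;
use warnings;

my $time = '12:34:56';   # sample value, chosen for illustration

# Scalar context: test the match, then read $1, $2, $3.
my ($hours, $minutes, $seconds);
if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
    ($hours, $minutes, $seconds) = ($1, $2, $3);
}

# List context: the match returns the captured groups directly.
my ($h, $m, $s) = ($time =~ /(\d\d):(\d\d):(\d\d)/);

print "scalar context: $hours h $minutes m $seconds s\n";
print "list context:   $h h $m m $s s\n";
```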

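Continued

Nested groupings in a regular expression are extracted too; the groups are numbered by the position of their opening parenthesis, counting from the left

Ex. /(ab(cd|ef)((gi)|j))/;

$1 = the whole match (‘ab’, followed by group 2 and group 3)
$2 = ‘cd’ or ‘ef’
$3 = ‘gi’ or ‘j’
$4 = ‘gi’ (undefined when ‘j’ matched instead)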
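The group numbering can be checked against two sample strings, which are assumptions chosen for illustration. In list context the match returns one entry per group, and a group that did not take part in the match comes back as undef.

```perl
use strict;
use warnings;

my $re = qr/(ab(cd|ef)((gi)|j))/;

# Groups are numbered by the position of their opening parenthesis.
my @first  = ('abcdgi' =~ $re);
my @second = ('abefj'  =~ $re);

for my $groups (\@first, \@second) {
    my @shown = map { defined $_ ? "'$_'" : 'undef' } @$groups;
    print join(', ', @shown), "\n";
}
```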

Backreferences: \1, \2, etc. are related to the matching variables $1, $2, etc., but can only be used inside the regular expression

Useful for matching repeated phrases

Ex. /(\w\w\w)\1/ matches ‘booboo’ or ‘murmur’
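A short sketch of the backreference example. The third word is a sample value added to show a non-match, and the ^ and $ anchors are an addition so that the whole word must be a doubled triple.

```perl
use strict;
use warnings;

# \1 must repeat exactly the text captured by the first group.
my @words   = ('booboo', 'murmur', 'banana');
my @doubled = grep { /^(\w\w\w)\1$/ } @words;

print "doubled: @doubled\n";
```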

Continued

The positions of the parts of the string that matched are also stored, in the @- (start) and @+ (end) arrays

Ex.

$x = "Mmm...donut";
$x =~ /^(Mmm)\.\.\.(donut)/;
foreach $expr (1..$#-) {
    print "$expr: '${$expr}' at ($-[$expr],$+[$expr])\n";
}

Output:

1: 'Mmm' at (0,3)
2: 'donut' at (6,11)
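The slide's loop reads $1 and $2 through a symbolic reference, which does not run under `use strict`. The sketch below is one possible strict-safe variant, not the author's code: it recovers each group with substr from the offsets in @- and @+.

```perl
use strict;
use warnings;

my $x = 'Mmm...donut';
my @report;

if ($x =~ /^(Mmm)\.\.\.(donut)/) {
    # $#- is the index of the last group that matched.
    for my $i (1 .. $#-) {
        my ($start, $end) = ($-[$i], $+[$i]);
        my $text = substr($x, $start, $end - $start);
        push @report, sprintf("%d: '%s' at (%d,%d)", $i, $text, $start, $end);
    }
}

print "$_\n" for @report;
```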

Continued

Even when a regular expression has no groupings, the pieces of a successful match are still stored in special variables

$` is the string before the match
$& is the string that matched
$' is the string after the match

Ex.

$x = "I like chips";
$x =~ /like/;

$` = "I "    $& = "like"    $' = " chips"
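The three variables can be copied out right after a successful match. A minimal sketch with illustrative variable names.

```perl
use strict;
use warnings;

my $x = 'I like chips';
my ($before, $match, $after);

if ($x =~ /like/) {
    # Copy the special variables straight away; a later match overwrites them.
    ($before, $match, $after) = ($`, $&, $');
}

print "before='$before' match='$match' after='$after'\n";
```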

Matching Repetitions

The quantifier characters ?, *, +, and {} are used to match words or syllables of any length without massive amounts of repetition in the regular expression

Definitions

a? = matches ‘a’ zero or one times
a* = matches ‘a’ zero or more times (any number of times)
a+ = matches ‘a’ one or more times (at least once)
a{n,m} = matches ‘a’ at least n times, but not more than m times
a{n,} = matches ‘a’ at least n times
a{n} = matches ‘a’ exactly n times

Continued

Examples

/[a-z]+\s+\d*/ = a lowercase word, at least some space, and any number of digits (ajc 93, jgro 843986)

/(\w+)\s\1/ = a doubled word of any length with a single whitespace character in between (jon jon, hidalgo hidalgo)

/y(es)?/i = ‘y’ or ‘yes’ in any mix of upper and lower case (y, Y, yes, Yes, YES)
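The quantifier examples above can be run against a few inputs. This is a sketch: the ^ and $ anchors are an addition so that each pattern has to cover the whole string, and ‘yep’ is a sample non-match that is not from the slides.

```perl
use strict;
use warnings;

# Each entry pairs a sample string with the pattern it is tested against.
my @checks = (
    [ 'ajc 93',          qr/^[a-z]+\s+\d*$/ ],
    [ 'jgro 843986',     qr/^[a-z]+\s+\d*$/ ],
    [ 'jon jon',         qr/^(\w+)\s\1$/    ],
    [ 'hidalgo hidalgo', qr/^(\w+)\s\1$/    ],
    [ 'Yes',             qr/^y(es)?$/i      ],
    [ 'Y',               qr/^y(es)?$/i      ],
    [ 'yep',             qr/^y(es)?$/i      ],
);

my @results = map { $_->[0] =~ $_->[1] ? 1 : 0 } @checks;

for my $i (0 .. $#checks) {
    printf "%-16s %s\n", $checks[$i][0], $results[$i] ? 'matches' : 'no match';
}
```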

Continued

Perl will always try to match as much of a given string as possible, so long as the regular expression as a whole still matches

I.e. with /a?.../, Perl first tries to match with the ‘a’ present; if that fails, it tries again without the ‘a’

Ex.

$x = "the cat in the hat";
$x =~ /^(.*)(at)(.*)$/;

$1 = 'the cat in the h'
$2 = 'at'
$3 = ''

Continued

Quantifiers that grab as much of the string as possible are known as ‘maximal match’ or ‘greedy’ quantifiers

4 important regular expression principles

Principle 1: Any regexp will be matched at the earliest possible position in the string

Principle 2: In an alternation such as (a|b|c), the leftmost alternative that matches will be the one used

Principle 3: Greedy quantifiers will match as much of the string as possible while still allowing the whole regexp to match

Principle 4: When there are several greedy quantifiers, the leftmost one takes priority over the others

Continued

Examples

$x = "The programming republic of Perl";

$x =~ /^(.+)(e|r)(.*)$/;
$1 = 'The programming republic of Pe'
$2 = 'r'
$3 = 'l'

$x =~ /.*(m{1,2})(.*)$/;
$1 = 'm'
$2 = 'ing republic of Perl'
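The two greedy examples can be verified by capturing the groups in list context. A minimal sketch; the array names are illustrative.

```perl
use strict;
use warnings;

my $x = 'The programming republic of Perl';

# The leftmost greedy quantifier takes as much as it can, then gives
# characters back only until the rest of the regexp can match.
my @first  = ($x =~ /^(.+)(e|r)(.*)$/);
my @second = ($x =~ /.*(m{1,2})(.*)$/);

print join(' | ', map { "'$_'" } @first),  "\n";
print join(' | ', map { "'$_'" } @second), "\n";
```

In the second pattern the leading .* swallows the first ‘m’ of ‘programming’, so m{1,2} is left with only one ‘m’ to match.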

Continued

Sometimes matching the minimal piece of a string is essential; for this, the ‘minimal match’ or ‘non-greedy’ quantifiers ??, *?, +?, and {}? were created.

Definitions

a?? = match ‘a’ 0 or 1 times; try 0 first, then 1
a*? = match ‘a’ any number of times, but as few as possible
a+? = match ‘a’ 1 or more times, but as few as possible
a{n,m}? = match ‘a’ at least n times, not more than m times, as few as possible
a{n,}? = match ‘a’ at least n times, as few as possible
a{n}? = match ‘a’ exactly n times; same thing as a{n}

Continued

Examples: same as above, different operators!

$x = "The programming republic of Perl";

$x =~ /^(.+?)(e|r)(.*)$/;
$1 = 'Th'
$2 = 'e'
$3 = ' programming republic of Perl'

$x =~ /.*?(m{1,2})(.*)$/;
$1 = 'mm'
$2 = 'ing republic of Perl'
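The non-greedy results can be checked the same way. A minimal sketch with illustrative array names.

```perl
use strict;
use warnings;

my $x = 'The programming republic of Perl';

# .+? and .*? take as little as possible, growing one character at a
# time until the rest of the regexp can match.
my @first  = ($x =~ /^(.+?)(e|r)(.*)$/);
my @second = ($x =~ /.*?(m{1,2})(.*)$/);

print join(' | ', map { "'$_'" } @first),  "\n";
print join(' | ', map { "'$_'" } @second), "\n";
```

Here the leading .*? stops before the first ‘m’, so the still-greedy m{1,2} gets both.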

Continued

Note: for non-greedy quantifiers, Principle 3 (matching quantifiers) is modified so that the leftmost non-greedy quantifier matches as little of the string as possible while still allowing the whole regexp to match

Continued

Quantifiers are susceptible to backtracking

Ex.

$x = "the cat in the hat";
$x =~ /^(.*)(at)(.*)$/;

$1 = 'the cat in the h'    $2 = 'at'    $3 = ''

1. Start with the first letter, ‘t’
2. The first quantifier, .*, starts and matches the whole string
3. ‘a’ has nothing to match at the end of the string; backtrack one character
4. ‘a’ does not match the last letter, ‘t’; backtrack one more character
5. Now ‘a’ matches, and then the ‘t’
6. Move on to the 3rd element; already at the end of the string, so it is assigned the empty string

Continued

Error alert!

Nested indeterminate quantifiers are dangerous things

Ex. /(a|b+)*/;

In the above example, the inner b+ can match a run of b’s of any length, and the outer * can then repeat the group any number of times, so the same run of b’s can be split up in a huge number of ways

If a match is not found early in the process, Perl will attempt EVERY one of these possibilities before giving up, which can take an enormous amount of time

Building a Regexp

Step one: decide what we want to match and what we want to exclude.

Ex. A regexp that matches numbers should reject any string that isn’t a number, and accept both integers and floating point numbers

Step two: break the problem down into smaller parts

Smaller parts are easier to work with

Ex. Any integer: /[+-]?\d+/

\d+ represents one or more digits

[+-]? represents the number’s optional sign (positive/negative)
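The integer pattern can be wrapped in a small test function. This is a sketch: `is_integer` is a hypothetical helper name, the sample inputs are chosen for illustration, and the ^ and $ anchors are an addition so that strings which merely contain an integer are rejected.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the whole string is an optionally
# signed integer, 0 otherwise.
sub is_integer {
    my ($text) = @_;
    return $text =~ /^[+-]?\d+$/ ? 1 : 0;
}

for my $sample ('42', '-7', '+0', '3.14', 'abc', '') {
    printf "%-6s %s\n", "'$sample'", is_integer($sample) ? 'integer' : 'not an integer';
}
```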

Continued

Ex. Floating point

Can have a sign, a decimal point, a fractional part, and an exponent, e.g. 25.4E-72

/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;

1st part, [+-]?, is the optional sign of the number

2nd part, (\d+\.\d+|\d+\.|\.\d+|\d+), covers the several different ways a floating point number can be written (2.54, 346., .395, 500)

3rd part, ([eE][+-]?\d+)?, is the optional exponent, which is represented by e or E followed by an optional sign, then an integer of any length (e-5, E9000)
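The three parts can be assembled and tried on the sample numbers from the slide. This is a sketch under stated assumptions: `is_number` is a hypothetical helper name, the pattern is written in its fully anchored form (leading ^, trailing $) with \d+\.\d+ for the a.b case, and the rejected inputs are sample values added for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: sign, then mantissa, then optional exponent.
sub is_number {
    my ($text) = @_;
    return $text =~ /^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/ ? 1 : 0;
}

my @accepted = grep {  is_number($_) } ('25.4E-72', '2.54', '346.', '.395', '500', '1e', '.', 'abc');
my @rejected = grep { !is_number($_) } ('25.4E-72', '2.54', '346.', '.395', '500', '1e', '.', 'abc');

print "accepted: @accepted\n";
print "rejected: @rejected\n";
```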

Continued

The //x modifier in Perl allows one to write complex regexps with as much spacing as the programmer wants

/^[+-]?
    (
        \d+\.\d+|\d+\.|\.\d+|\d+
    )
    ([eE][+-]?\d+)?
$/x;

Continued

The downside to the //x modifier: certain symbols must be typed differently

Spacing

Since //x ignores spaces in the regexp, a literal space must be typed as ‘\ ’ or ‘[ ]’

Pound Signs

Similarly, since # starts a comment under //x, a literal pound sign must be typed as ‘\#’ or ‘[#]’

Continued

Example:

/^
    [+-]?\ *      # any number of spaces is now allowed
    (             # between the sign and the floating point
                  #
        \d+
        (         # the coding for the floating point has been
            \.\d* # reworked, since most of the conditions
        )?        # started similarly.
       |\.\d+
    )
    ([eE][+-]?\d+)?
$/x;
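The commented pattern can be stored with qr//x and exercised on a few strings. A sketch only: the comments are reworded, the variable names are illustrative, and the sample inputs are not from the slides.

```perl
use strict;
use warnings;

my $number = qr/^
    [+-]?\ *          # optional sign, then any number of literal spaces
    (
        \d+           # digits before the point
        (
            \.\d*     # optional point and fraction: a.b or a.
        )?
      | \.\d+         # or a fraction with no leading digits: .b
    )
    ([eE][+-]?\d+)?   # optional exponent
$/x;

my @samples = ('- 12.5e3', '+.5', '7', '346.', '1 2', 'e5');
my @valid   = grep { $_ =~ $number } @samples;

print "valid: ", join(', ', map { "'$_'" } @valid), "\n";
```

Only the escaped space after the sign is matched literally; every other space and newline in the pattern is ignored under //x.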
