By: Andrew Cory

Grouping Things & Hierarchical Matching
Grouping characters: ( and )
Allow parts of a regular expression to be treated as a single unit
Useful for matching multiple words and/or phrases that share base characters and/or words
Ex. /house(cat|keeper)/ is equivalent to /housecat|housekeeper/
Ex. /(a|[bc])d/ matches 'ad', 'bd', or 'cd'
Ex. /(19|20|)\d\d/ matches 19xx, 20xx, or xx
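
A minimal Perl sketch of the three grouping examples above. The word lists and the ^ and $ anchors are assumptions added for illustration, so that each test string has to match as a whole.

```perl
use strict;
use warnings;

# Grouping lets one base word share several endings.
my @house = grep { /^house(cat|keeper)$/ } qw(housecat housekeeper houseboat);

# A character class can be one of the alternatives inside a group.
my @d_words = grep { /^(a|[bc])d$/ } qw(ad bd cd dd);

# An empty alternative makes the century prefix optional.
my @years = grep { /^(19|20|)\d\d$/ } qw(1999 2024 85 1850);

print "house:   @house\n";     # housecat housekeeper
print "d words: @d_words\n";   # ad bd cd
print "years:   @years\n";     # 1999 2024 85
```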

Continued
Backtracking: the step-by-step process of trying an alternative, checking whether it matches, and moving on to the next alternative if it doesn't
A regular expression with alternatives has several paths, each of which matches a different string
Backtracking is a trial-and-error method that works through the string one character at a time

Continued
Backtracking Example: "abcd" =~ /(af|ab)(ce|c|cd)/;
1. Start with the first letter, 'a'
2. Try the 1st alternative of the first group, 'af'
3. 'a' matches, but 'f' doesn't match 'b'; backtrack to 'a' and try the 2nd alternative, 'ab'
4. 'a' and 'b' match the first 2 characters; the first group is satisfied, so move on to the next group
5. Try 'ce': 'c' matches, but 'e' doesn't match 'd'; backtrack to 'c' and try the 2nd alternative
6. 'c' matches; the second group is satisfied, therefore the whole expression is satisfied by "abcd"
Note: the 3rd alternative in the 2nd group ('cd') would match too, but it is never tried, because the string has already satisfied the regular expression.
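
The walk-through above can be checked by looking at what each group ends up holding. This is only a sketch; the variable names are illustrative.

```perl
use strict;
use warnings;

my ($first, $second, $whole);
if ("abcd" =~ /(af|ab)(ce|c|cd)/) {
    # 'af' fails on 'b', so the first group settles on 'ab'.
    # 'ce' fails on 'd', so the second group settles on 'c'.
    ($first, $second, $whole) = ($1, $2, $&);
}
print "group 1: $first\n";    # ab
print "group 2: $second\n";   # c
print "matched: $whole\n";    # abc (the trailing 'd' is never consumed)
```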

Extracting Matches
Parentheses not only group, they also extract the parts of a string that matched each group
I.e.
    if ($time =~ /(\d\d):(\d\d):(\d\d)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }
In list context, a match returns the extracted values directly:
    ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
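
A short runnable version of the list-context form, with a sample time string that is an assumption for illustration. It also shows what happens when the match fails.

```perl
use strict;
use warnings;

my $time = "12:34:56";   # hypothetical sample value

# List context: the match returns ($1, $2, $3) directly.
my ($hours, $minutes, $seconds) = ($time =~ /(\d\d):(\d\d):(\d\d)/);
print "$hours h, $minutes min, $seconds s\n";   # 12 h, 34 min, 56 s

# A failed match returns an empty list, so nothing is extracted.
my @parts = ("noon" =~ /(\d\d):(\d\d):(\d\d)/);
print "parts extracted from 'noon': ", scalar(@parts), "\n";   # 0
```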
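
Continued
Nested grouping in a regular expression results in more extracted parts; groups are numbered by the position of their opening parenthesis, from left to right
Ex. /(ab(cd|ef)((gi)|j))/;
$1 = the whole match, ab(cd|ef)((gi)|j)
$2 = cd|ef
$3 = (gi)|j
$4 = gi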
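
A sketch of the group numbering for this regexp, run against two sample strings chosen for illustration (they are not from the slides).

```perl
use strict;
use warnings;

my @groups;
if ("abcdgi" =~ /(ab(cd|ef)((gi)|j))/) {
    @groups = ($1, $2, $3, $4);
}
print join(", ", @groups), "\n";   # abcdgi, cd, gi, gi

# When the 'j' branch matches, group 4 takes no part and $4 is undefined.
my $four_defined;
if ("abefj" =~ /(ab(cd|ef)((gi)|j))/) {
    $four_defined = defined $4 ? 1 : 0;
    print "$1, $2, $3\n";          # abefj, ef, j
}
```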
Backreferences: \1, \2, etc. are related to the matching variables $1, $2, etc., but can only be used inside the regular expression
Useful for matching repeated phrases
Ex. /(\w\w\w)\1/ matches 'booboo' or 'murmur'
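
The backreference example as a runnable sketch; 'banana' is an extra test word added here to show a non-match.

```perl
use strict;
use warnings;

# \1 must repeat exactly the text that group 1 captured.
my @doubled = grep { /(\w\w\w)\1/ } qw(booboo murmur banana);
print "@doubled\n";   # booboo murmur
```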

Continued
The positions of the string portions that matched are also stored, in the @- (start) and @+ (end) arrays
Ex.
    $x = "Mmm...donut";
    $x =~ /^(Mmm)\.\.\.(donut)/;
    foreach $expr (1..$#-) {
        print "$expr: '${$expr}' at ($-[$expr],$+[$expr])\n";
    }
Output:
    1: 'Mmm' at (0,3)
    2: 'donut' at (6,11)
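
The slide's loop relies on symbolic references (${$expr}), which "use strict" forbids. A sketch of the same report under strict, assuming it is acceptable to recover each group with substr from the @- and @+ offsets:

```perl
use strict;
use warnings;

my $x = "Mmm...donut";
my @report;
if ($x =~ /^(Mmm)\.\.\.(donut)/) {
    for my $i (1 .. $#-) {
        my ($start, $end) = ($-[$i], $+[$i]);
        # The group's text is the slice of $x between the two offsets.
        my $text = substr($x, $start, $end - $start);
        push @report, sprintf("%d: '%s' at (%d,%d)", $i, $text, $start, $end);
    }
}
print "$_\n" for @report;
# 1: 'Mmm' at (0,3)
# 2: 'donut' at (6,11)
```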

Continued
Even when a regexp has no groupings, the parts of the string around the match are still stored in separate variables
$` is the string before the match
$& is the string that matched
$' is the string after the match
Ex.
    $x = "I like chips";
    $x =~ /like/;
    $` = "I "   $& = "like"   $' = " chips"
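
A runnable sketch of the three variables; copying them into named variables right after the match is a choice made here for readability.

```perl
use strict;
use warnings;

my $x = "I like chips";
my ($before, $match, $after);
if ($x =~ /like/) {
    # Copy the special variables before another match overwrites them.
    ($before, $match, $after) = ($`, $&, $');
}
printf "before: '%s'\n", $before;   # 'I '
printf "match:  '%s'\n", $match;    # 'like'
printf "after:  '%s'\n", $after;    # ' chips'
```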

Matching Repetitions
Quantifier characters ?, *, +, and {} are used to match words or syllables of any length without massive amounts of repetition in the regexp
Definitions
a? = matches 'a' one or zero times
a* = matches 'a' zero or more times (any number of times)
a+ = matches 'a' one or more times (at least once)
a{n,m} = matches at least n times, but not more than m times
a{n,} = matches at least n times
a{n} = matches exactly n times

Continued
Examples
/[a-z]+\s+\d*/ = a lowercase word, some space, and any number of digits (ajc 93, jgro 843986)
/(\w+)\s\1/ = a doubled word of any length with a single whitespace character in between (jon jon, hidalgo hidalgo)
/y(es)?/i = 'y', 'Y', or a case-insensitive 'yes'
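
A sketch that runs the quantifier examples against a few sample strings. The extra test strings and the ^ and $ anchors on some patterns are assumptions added so that non-matches are visible.

```perl
use strict;
use warnings;

# a{n,m}: at least 2 and at most 3 repetitions.
my @a_runs = grep { /^a{2,3}$/ } qw(a aa aaa aaaa);

# A lowercase word, some whitespace, then any number of digits.
my @word_num = grep { /^[a-z]+\s+\d*$/ } ('ajc 93', 'jgro 843986', 'AJC 93');

# A doubled word: \1 repeats whatever (\w+) captured.
my @doubled = grep { /(\w+)\s\1/ } ('jon jon', 'hidalgo hidalgo', 'jon tom');

# The /i modifier makes 'y' and 'yes' case-insensitive.
my @yes = grep { /^y(es)?$/i } qw(y Y yes YES no);

print "a runs:   @a_runs\n";                  # aa aaa
print "word+num: ", join(", ", @word_num), "\n";   # ajc 93, jgro 843986
print "doubled:  ", join(", ", @doubled), "\n";    # jon jon, hidalgo hidalgo
print "yes:      @yes\n";                     # y Y yes YES
```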

Continued
Perl will always try to match as much of a given string as possible to a quantified element, so long as the regular expression as a whole still matches
I.e. with the '?' quantifier, Perl first tries to match with the preceding element present; if that fails, it tries again without it
Ex.
    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/;
    $1 = 'the cat in the h'
    $2 = 'at'
    $3 = ''
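
The same example as a runnable sketch, capturing the three groups in list context:

```perl
use strict;
use warnings;

my $x = "the cat in the hat";
my @parts = ($x =~ /^(.*)(at)(.*)$/);

# The first .* is greedy, so it runs on to the last 'at' in the string.
print join(" | ", map { "'$_'" } @parts), "\n";
# 'the cat in the h' | 'at' | ''
```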

Continued
Quantifiers that grab as much of the string as possible are known as 'maximal match' or 'greedy' quantifiers
4 important regular expression principles
Principle 1: Any regexp will be matched at the earliest possible position in the string
Principle 2: In an alternation such as (a|b|c), the leftmost alternative that allows the whole regexp to match will be the one used
Principle 3: Greedy quantifiers will match as much of the string as possible while still allowing the whole regexp to match
Principle 4: When a regexp has several greedy quantifiers, the leftmost one takes priority over the others

Continued
Examples
    $x = "The programming republic of Perl";
    $x =~ /^(.+)(e|r)(.*)$/;
    $1 = 'The programming republic of Pe'
    $2 = 'r'
    $3 = 'l'
    $x =~ /.*(m{1,2})(.*)$/;
    $1 = 'm'
    $2 = 'ing republic of Perl'
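
A sketch that runs both greedy examples, so the effect of the leftmost quantifier taking priority can be seen directly:

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

# The leftmost .+ takes as much as it can; (e|r) gets the last possible 'r'.
my @first = ($x =~ /^(.+)(e|r)(.*)$/);

# The leading .* swallows the first 'm', leaving only one for m{1,2}.
my @second = ($x =~ /.*(m{1,2})(.*)$/);

print join(" | ", map { "'$_'" } @first), "\n";
# 'The programming republic of Pe' | 'r' | 'l'
print join(" | ", map { "'$_'" } @second), "\n";
# 'm' | 'ing republic of Perl'
```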

Continued
Sometimes matching the minimal piece of a string is essential; for this there are the 'minimal match' or 'non-greedy' quantifiers ??, *?, +?, and {}?.
Definitions
a?? = match 'a' 0 or 1 times, trying 0 first, then 1
a*? = match 'a' 0 or more times, as few as possible
a+? = match 'a' 1 or more times, as few as possible
a{n,m}? = match at least n times, not more than m times, as few as possible
a{n,}? = match at least n times, as few as possible
a{n}? = match exactly n times, same thing as a{n}

Continued
Examples: same string as above, different operators!
    $x = "The programming republic of Perl";
    $x =~ /^(.+?)(e|r)(.*)$/;
    $1 = 'Th'
    $2 = 'e'
    $3 = ' programming republic of Perl'
    $x =~ /.*?(m{1,2})(.*)$/;
    $1 = 'mm'
    $2 = 'ing republic of Perl'
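
The non-greedy versions as a runnable sketch, for comparison with the greedy results:

```perl
use strict;
use warnings;

my $x = "The programming republic of Perl";

# .+? takes as little as it can, so (e|r) gets the first 'e'.
my @first = ($x =~ /^(.+?)(e|r)(.*)$/);

# .*? stops before the first 'm', so m{1,2} can take both of them.
my @second = ($x =~ /.*?(m{1,2})(.*)$/);

print join(" | ", map { "'$_'" } @first), "\n";
# 'Th' | 'e' | ' programming republic of Perl'
print join(" | ", map { "'$_'" } @second), "\n";
# 'mm' | 'ing republic of Perl'
```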

Continued
Note: Principles 3 and 4 are adjusted for non-greedy quantifiers: a non-greedy quantifier matches as little of the string as possible while still allowing the whole regexp to match, and the leftmost quantifier still takes priority

Continued
Quantifiers are susceptible to backtracking
Ex.
    $x = "the cat in the hat";
    $x =~ /^(.*)(at)(.*)$/;
    $1 = 'the cat in the h'   $2 = 'at'   $3 = ''
1. Start with the first letter, 't'
2. The first quantifier, .*, starts and matches the whole string
3. 'a' cannot match at the end of the string; backtrack one character
4. 'a' does not match the last letter, 't'; backtrack once more
5. Match the 'a', then the 't'
6. Move on to the 3rd element; already at the end of the string, so it is assigned the empty string

Continued
Error alert!
Nested indeterminate quantifiers are dangerous things
Ex. /(a|b+)*/;
In the above example, the inner b+ can match a run of b's of any length, and the outer * then repeats the group any number of times, so the same run of b's can be split up in many different ways
If a match is not found early in the process, Perl will attempt EVERY such possibility before giving up, which can take an extremely long time
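
One possible way around the nested form, sketched under the assumption that only a yes/no answer is needed (the captured group contents differ between the two patterns). Anchored at both ends, /(a|b+)*/ accepts the same strings as a single character class with one quantifier; the sample strings are illustrative and kept short.

```perl
use strict;
use warnings;

# Hypothetical sample strings, kept short so the nested form stays fast.
my @samples = ('', 'a', 'bbb', 'abba', 'abca', 'c');

my $disagreements = 0;
for my $s (@samples) {
    my $nested = ($s =~ /^(a|b+)*$/) ? 1 : 0;   # nested quantifiers
    my $flat   = ($s =~ /^[ab]*$/)   ? 1 : 0;   # one quantifier only
    $disagreements++ if $nested != $flat;
    printf "%-6s nested=%d flat=%d\n", "'$s'", $nested, $flat;
}
print "disagreements: $disagreements\n";   # 0
```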

Building a Regexp
Step one: decide what we want to match and what we want to exclude.
Ex. A regexp that matches numbers should accept both integers and floating point numbers, and reject any string that is not a number
Step two: break the problem down into smaller parts
Smaller parts are easier to work with
Ex. Any integer: /[+-]?\d+/
\d+ represents one or more digits
[+-]? represents the number's optional sign (positive/negative)

Continued
Ex. Floating point
Has a sign, an integer part, a decimal point, a fractional part, and an exponent, e.g. 25.4E-72
/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;
1st part, [+-]?, is the optional sign of the number
2nd part, (\d+\.\d+|\d+\.|\.\d+|\d+), covers the several different ways the digits of a floating point number can be written (2.54, 346., .395, 500)
3rd part, ([eE][+-]?\d+)?, is the optional exponent, which is written as e or E followed by an optional sign and then an integer of any length (e-5, E9000)
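
A sketch that tries the assembled regexp on the numbers mentioned above plus a few strings that should be rejected. The rejected samples and the variable names are assumptions added for illustration.

```perl
use strict;
use warnings;

# Sign, then one of four digit layouts, then an optional exponent.
my $number = qr/^[+-]?(\d+\.\d+|\d+\.|\.\d+|\d+)([eE][+-]?\d+)?$/;

my @candidates = ('25.4E-72', '2.54', '346.', '.395', '500', '-7',
                  'abc', '1.2.3', 'e5');
my @accepted = grep { $_ =~ $number } @candidates;
my @rejected = grep { $_ !~ $number } @candidates;

print "accepted: @accepted\n";   # 25.4E-72 2.54 346. .395 500 -7
print "rejected: @rejected\n";   # abc 1.2.3 e5
```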

Continued
The //x modifier in Perl allows one to write complex regexps with as much spacing as the programmer wants
    /^
        [+-]?
        (
            \d+\.\d+|\d+\.|\.\d+|\d+
        )
        ([eE][+-]?\d+)?
    $/x;

Continued
The downside to the //x modifier: certain symbols must be typed differently
Spacing
Since //x ignores spaces in the regexp, a literal space must be typed as '\ ' or '[ ]'
Pound Signs
Similarly, since //x treats # as the start of a comment, a literal pound sign must be typed as '\#' or '[#]'
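
A small sketch of both cases; the test strings are made up for illustration.

```perl
use strict;
use warnings;

# Under /x an unescaped space is ignored, so this pattern is really /ab/.
my $ignored  = ("a b" =~ /a b/x) ? 1 : 0;   # 0
my $squeezed = ("ab"  =~ /a b/x) ? 1 : 0;   # 1

# An escaped or bracketed space is matched literally.
my $escaped   = ("a b" =~ /a\ b/x)  ? 1 : 0;   # 1
my $bracketed = ("a b" =~ /a[ ]b/x) ? 1 : 0;   # 1

# The same goes for the pound sign, which otherwise starts a comment.
my $pound_escaped   = ("#1" =~ /\#\d/x)  ? 1 : 0;   # 1
my $pound_bracketed = ("#1" =~ /[#]\d/x) ? 1 : 0;   # 1

print "$ignored $squeezed $escaped $bracketed ",
      "$pound_escaped $pound_bracketed\n";   # 0 1 1 1 1 1
```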

Continued
Example:
    /^
        [+-]?\ *         # any number of spaces is now allowed
        (                # between the sign and the number
            \d+
            (            # the coding for the floating point has been
                \.\d*    # reworked, since most of the alternatives
            )?           # started the same way
            |\.\d+
        )
        ([eE][+-]?\d+)?
    $/x;
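
A sketch of the reworked regexp stored with qr//x and tried on a few inputs. The sample strings and comments are assumptions for illustration; note that a comment inside the regexp must not contain the / delimiter.

```perl
use strict;
use warnings;

my $number = qr/^
    [+-]?\ *             # optional sign, then any number of spaces
    (
        \d+              # digits before the point
        (
            \.\d*        # optional point with an optional fraction
        )?
        |\.\d+           # or a bare fraction such as .395
    )
    ([eE][+-]?\d+)?      # optional exponent
$/x;

my @candidates = ('2.54', '346.', '.395', '500', '- 12.5e3', '+.5',
                  'abc', '1.2.3', '.');
my @accepted = grep { $_ =~ $number } @candidates;
my @rejected = grep { $_ !~ $number } @candidates;

print "accepted: ", join(", ", @accepted), "\n";
# 2.54, 346., .395, 500, - 12.5e3, +.5
print "rejected: ", join(", ", @rejected), "\n";
# abc, 1.2.3, .
```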