Post on 21-Dec-2015

Regular expressions (contd.) -- remembering subpattern matches

• When a <pattern> is being matched with a target string, substrings that match sub-patterns can be remembered and re-used later in the same pattern

• Sub-patterns whose matching substrings are to be remembered are enclosed in parentheses

• The sub-patterns are implicitly numbered, starting from 1 and their matching substrings can then be re-used later in the pattern by using back-references like \1 or \2 or \3

• However, because the pattern is written inside a double-quoted PHP string, the backslash itself must be escaped, so we must type \\1 or \\2 or \\3 in our regular expressions

Using back-references (contd.)

• PHP code

<?php
$myString1 = "klmAklmAAklmABklmBklmBBklm";
echo "myString1 is $myString1 <br>";
$myString1 = preg_replace("/([A-Z])\\1/","_",$myString1);
echo "myString1 is now $myString1 ";
?>

• Resultant output is

myString1 is klmAklmAAklmABklmBklmBBklm

myString1 is now klmAklm_klmABklmBklm_klm
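
A further illustration of the same idea, offered as a minimal sketch rather than part of the original slides: a back-reference can find accidentally doubled words. The sample text and the variable names $text and $fixed are assumptions made for this example.

```php
<?php
// Hypothetical sample text; the doubled words are deliberate.
$text = "this is is a test of of the system";

// (\w+) remembers a word; \\1 demands the same word again after a space.
// The \b anchors stop "this is" from matching on the shared letters "is".
$fixed = preg_replace("/\\b(\\w+) \\1\\b/", "$1", $text);

echo $fixed;
```

The replacement "$1" keeps one copy of the remembered word, so each doubled pair collapses to a single word.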

Regular expressions(contd.) -- using subpattern matches in replacements

• We saw that, within a regular expression, substrings that matched sub-patterns can be re-used later in the pattern by preceding the appropriate integer with a pair of backslashes, \\

• Within a <replacement>, substrings that matched sub-patterns in the regular expression can be used by preceding the appropriate integer with a dollar sign, as in $1 or $2

Using sub-pattern matches in replacements (contd.)

• PHP code

<?php
$myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("/<(\w+)>(.+?)<\/\\1>/","$2",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);
?>

• Resultant output is

myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>

myString is now This is paragraph 1.This is paragraph 2.
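
As a small sketch of using more than one remembered substring in a replacement (the name string and variable names here are assumptions for illustration only), $1 and $2 can be emitted in a different order from the one in which they were matched:

```php
<?php
// Hypothetical input in "Surname, Forename" form.
$name = "Smith, John";

// $1 holds the surname and $2 the forename; the replacement swaps them.
$swapped = preg_replace("/(\\w+), (\\w+)/", "$2 $1", $name);

echo $swapped;
```

Because the comma and space sit outside the parentheses, they are matched and consumed but not remembered, so they disappear from the result.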

A reminder about greedy/frugal quantifiers

• What would happen if we had used a greedy quantifier on the previous slide?

• PHP code

<?php
$myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("/<(\w+)>(.+)<\/\\1>/","$2",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);
?>

• Resultant output is

myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>

myString is now This is paragraph 1.</p><p>This is paragraph 2.

Choice of regexp delimiters

• Up to now, we have used the forward slash character to mark the start and end of regular expressions

• In fact, we can use any non-alphanumeric character for the purpose (other than a backslash or whitespace)

• For example, instead of writing

/ab+c/

• we could write

%ab+c%

• This is useful when we wish to use the / character in our regular expression

• See next slide

Using different regexp delimiters (contd.)

• In the regular expression below, we do not have to escape the / character at the start of the close tag, because we are not using the / character as the regexp delimiter:

<?php
$myString = "<p>This is paragraph 1.</p><p>This is paragraph 2.</p>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);
?>

• Resultant output is

myString is <p>This is paragraph 1.</p><p>This is paragraph 2.</p>

myString is now This is paragraph 1.This is paragraph 2.
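
Another case where an alternative delimiter helps, sketched here with a made-up URL and variable names (assumptions for illustration only), is a pattern that is mostly slashes:

```php
<?php
// Hypothetical URL used only for this example.
$url = "http://example.com/docs/php/regex";

// With % as the delimiter, none of the slashes need escaping.
// The greedy .* runs to the last slash; ([^/]+) captures what follows it.
$lastPart = preg_replace("%^.*/([^/]+)$%", "$1", $url);

echo $lastPart;
```

Written with / as the delimiter, every slash inside the pattern would need a backslash in front of it, which is harder to read.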

Using regexps to process nested HTML

• PHP code:

<?php
$myString = "<ol><li>fred</li><li>tom</li></ol>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("%<(\w+)>(.+)</\\1>%","$2",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);
?>

• Resultant output is

myString is <ol><li>fred</li><li>tom</li></ol>

myString is now <li>fred</li><li>tom</li>

• Suppose we wanted to remove all pairs of HTML tags. That is, suppose we wanted

myString is <ol><li>fred</li><li>tom</li></ol>

myString is now fredtom

• How would we achieve that?

Using regexps to process nested HTML (contd.)

• Would a frugal quantifier do the trick? PHP code:

<?php
$myString = "<ol><li>fred</li><li>tom</li></ol>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);
?>

• No. The resultant output is still

myString is <ol><li>fred</li><li>tom</li></ol>

myString is now <li>fred</li><li>tom</li>

• The reason is that, while preg_replace does replace all matching substrings in the target string, it does not rescan the text it has just substituted in, so it performs no replacement operations on the replacement string

• The value <li>fred</li><li>tom</li> above is the result of a replacement operation, so it is not modified

• However, suppose we wanted to remove all pairs, no matter how deep the nesting. How would we do that?

Using regexps to process nested HTML (contd.)

• We must use repetition to attack the nested instances

<?php
$myString = "<ol><li>fred</li><li>tom</li></ol>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
// Keep replacing until a pass leaves the string unchanged
while ($newString != $myString)
{ $myString = $newString;
  echo "myString is now ".str_replace("<","&lt;",$myString)."<br>";
  $newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
}
?>

• The resultant output is now

myString is <ol><li>fred</li><li>tom</li></ol>

myString is now <li>fred</li><li>tom</li>

myString is now fredtom

Using regexps to process nested HTML (contd.)

• Of course, we would not want to run words together like we did on the last slide, so we would use spaces in the replacement string

<?php
$myString = "<ol><li>fred</li><li>tom</li></ol>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$newString = preg_replace("%<(\w+)>(.+?)</\\1>%","$2",$myString);
// Keep replacing until a pass leaves the string unchanged
while ($newString != $myString)
{ $myString = $newString;
  echo "myString is now ".str_replace("<","&lt;",$myString)."<br>";
  $newString = preg_replace("%<(\w+)>(.+?)</\\1>%"," $2 ",$myString);
}
?>

• The resultant output is now

myString is <ol><li>fred</li><li>tom</li></ol>

myString is now <li>fred</li><li>tom</li>

myString is now fred tom
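
The repetition idea above can be packaged as a function. This is only a sketch under stated assumptions: the function name stripTagPairs is hypothetical, and it relies on the optional fifth argument of preg_replace, which receives the number of replacements made, instead of comparing old and new strings.

```php
<?php
// Hypothetical helper: removes matched tag pairs until a pass changes nothing.
function stripTagPairs($html)
{
    do {
        // $count is filled in with the number of replacements in this pass.
        $html = preg_replace("%<(\\w+)>(.+?)</\\1>%", " $2 ", $html, -1, $count);
    } while ($count > 0);

    // Collapse the runs of spaces left behind and trim both ends.
    return trim(preg_replace("/\\s+/", " ", $html));
}

$result = stripTagPairs("<ol><li>fred</li><li>tom</li></ol>");
echo $result;
```

Like the loop on the slide, this handles only simple tags without attributes; it is not a general HTML parser.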

More on regular expressions – checking for context

• All the preg_replace operations we have written so far have consumed all the characters that matched the regular expression

• There was no notion of examining the context surrounding the consumed characters: any characters that were matched were consumed

• We often need some way of matching characters without removing them from the target string

• There are four meta-expressions for doing this, two for forward context and two for backward context

Look-ahead context checks

(?=regexp)

This is a positive lookahead context check

It matches characters in the target string against the pattern specified by the embedded regular expression regexp without consuming them from the target string

• Example

preg_replace("/\w+(?= cat)/","_",$myString)

This replaces with an underscore any word that is followed by a space and the word cat, without removing the space or the word cat from the target string

• An example application is on the next slide

Look-ahead checks (contd.)

• Program fragment:

$myString = "tabby is a big cat. fido is a fat dog.";
echo "myString is $myString <br>";
$myString = preg_replace("/\w+(?= cat)/","_",$myString);
echo "myString is now $myString";

• Output produced

myString is tabby is a big cat. fido is a fat dog.

myString is now tabby is a _ cat. fido is a fat dog.

Look-ahead checks (contd.)

(?!regexp)

This is a negative lookahead context check

It ensures that the characters that follow in the target string do not match the pattern specified by the embedded regular expression regexp

• Example

preg_replace("/cow(?!boy)/","_",$myString)

This replaces all sub-strings “cow” with “_”, provided these sub-strings are not followed by the sub-string “boy”

Look-ahead checks (contd.)

• Program fragment:

$myString = "Fred is a cowboy. Dolly is a cow.";
echo "myString is $myString <br>";
$myString = preg_replace("/cow(?!boy)/","_",$myString);
echo "myString is now $myString";

• Output produced

myString is Fred is a cowboy. Dolly is a cow.

myString is now Fred is a cowboy. Dolly is a _.

Look-behind context checks

(?<=regexp)

This is a positive look-behind context check

It ensures that preceding characters in the target string match the pattern specified by the embedded regular expression regexp

• Example

preg_replace("/(?<=cow)boy/","girl",$myString)

This replaces all sub-strings “boy” with “girl”, provided these sub-strings are preceded by the sub-string “cow”, but the sub-string “cow” is not consumed.

Look-behind checks (contd.)

• Program fragment:

$myString = "Fred is a cowboy. Tom is a boy.";
echo "myString is $myString <br>";
$myString = preg_replace("/(?<=cow)boy/","girl",$myString);
echo "myString is now $myString";

• Output produced

myString is Fred is a cowboy. Tom is a boy.

myString is now Fred is a cowgirl. Tom is a boy.

Look-behind checks (contd.)

(?<!regexp)

This is a negative look-behind context check

It ensures that preceding characters in the target string do not match the pattern specified by the embedded regular expression regexp

• Example

preg_replace("/(?<!cow)boy/","girl",$myString)

This replaces all sub-strings “boy” with “girl”, provided these sub-strings are not preceded by the sub-string “cow”

Look-behind checks (contd.)

• Program fragment:

$myString = "Fred is a cowboy. Tom is a boy.";
echo "myString is $myString <br>";
$myString = preg_replace("/(?<!cow)boy/","girl",$myString);
echo "myString is now $myString";

• Output produced

myString is Fred is a cowboy. Tom is a boy.

myString is now Fred is a cowboy. Tom is a girl.
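
Because context checks consume nothing, a pattern can consist of context checks alone and match a position rather than any characters. A common illustration, sketched here with an assumed sample number and variable names, is inserting thousands separators:

```php
<?php
// Hypothetical number to format.
$number = "1234567";

// (?<=\d)        : there must be a digit just before this position
// (?=(\d{3})+$)  : and a multiple of three digits between here and the end
// Nothing is consumed, so the comma is inserted and no digit is removed.
$formatted = preg_replace("/(?<=\\d)(?=(\\d{3})+$)/", ",", $number);

echo $formatted;
```

The look-behind prevents a comma being placed at the very start of the string when the number of digits is itself a multiple of three.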

Regexp pattern modifiers

• We have seen that a regexp is of the form /…../ where the slash characters are delimiters (and could be replaced by other non-alphanumeric printable characters)

• The terminating character can be followed by a sequence of modifiers which affect the meaning of the regexp between the delimiting characters

Example pattern modifier: the caseless match modifier

• Program fragment:

$myString = "Fred is a boy. Tom is a BOY.";
echo "myString is $myString <br>";
$newstring1 = preg_replace("/boy/","_",$myString);
echo "newstring1 is $newstring1 <br>";
$newstring2 = preg_replace("/boy/i","_",$myString);
echo "newstring2 is $newstring2";

• Output produced

myString is Fred is a boy. Tom is a BOY.

newstring1 is Fred is a _. Tom is a BOY.

newstring2 is Fred is a _. Tom is a _.

Contrast these 2 examples

• Program fragment 1:

<?php
$oldstring1 = "<p>Fred is a boy.</p>";
echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
$newstring1 = preg_replace("%<p>.+</p>%","_",$oldstring1);
echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
?>

• Output produced

oldstring1 is <p>Fred is a boy.</p>

newstring1 is _

• Program fragment 2:

<?php
$oldstring1 = "<p>Fred
is a boy.</p>";
echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
$newstring1 = preg_replace("%<p>.+</p>%","_",$oldstring1);
echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
?>

• Output produced

oldstring1 is <p>Fred is a boy.</p>

newstring1 is <p>Fred is a boy.</p>

• Why no replacement? Why no match?

• Answer:

– the target string is a multi-line string

– the . meta-character does not match newline characters

The dot-all modifier

• Program fragment:

<?php
$oldstring1 = "<p>Fred
is a boy.</p>";
echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
$newstring1 = preg_replace("%<p>.+</p>%s","_",$oldstring1);
echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
?>

• Output produced

oldstring1 is <p>Fred is a boy.</p>

newstring1 is _

• The dot-all modifier says that the dot meta-character should match all characters, including newlines

The dot-all modifier again

• Program fragment 2:

<?php
$oldstring1 = "<p>Fred \n is a boy.</p>";
echo "oldstring1 is ".str_replace("<","&lt;",$oldstring1)." <br>";
$newstring1 = preg_replace("%<p>.+</p>%s","_",$oldstring1);
echo "newstring1 is ".str_replace("<","&lt;",$newstring1);
?>

• Output produced

oldstring1 is <p>Fred is a boy.</p>

newstring1 is _

• \n is a newline but the dot-all modifier says that the dot meta-character should match all characters, including newlines
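
The effect of the s modifier can also be checked directly with preg_match, which returns 1 when the pattern matches and 0 when it does not. The sketch below uses assumed variable names; the third pattern shows a character class that matches any character, newlines included, without needing the modifier.

```php
<?php
// A string with a newline in the middle of the paragraph.
$html = "<p>Fred\nis a boy.</p>";

// Without s, the dot stops at the newline, so there is no match.
$plain = preg_match("%<p>.+</p>%", $html);

// With s, the dot also matches the newline.
$dotAll = preg_match("%<p>.+</p>%s", $html);

// [\s\S] means "whitespace or non-whitespace", i.e. any character at all.
$charClass = preg_match("%<p>[\\s\\S]+</p>%", $html);

echo "$plain $dotAll $charClass";
```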

The Ungreedy modifier

• The U modifier makes every quantifier in the regexp ungreedy by default; this is very similar to the use of the ? character to stop quantifiers being greedy

Example usage of the ungreedy modifier

• Program fragment:

$myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("%<p>(.+)</p>%U","$1",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);

• Output produced

myString is <p>Fred is a boy.</p><p>Ann is a girl.</p>

myString is now Fred is a boy.Ann is a girl.

• The + meta-character is normally greedy but the U modifier has made it ungreedy

2nd Example usage of U modifier, part 1

• Program fragment 1:

$myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("%<p>(.+?)</p>(.+)%","$1x$2",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);

• Output produced

myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr>

myString is now Fred is a boy.x<p>Ann is a girl.</p><hr>

• Here, the first + is frugal but the second is greedy

2nd Example usage of U modifier, part 2

• Program fragment 2:

$myString = "<p>Fred is a boy.</p><p>Ann is a girl.</p><hr>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("%<p>(.+?)</p>(.+)%U","$1x$2",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);

• Output produced

myString is <p>Fred is a boy.</p><p>Ann is a girl.</p><hr>

myString is now Fred is a boy.</p><p>Ann is a girl.x<hr>

• Here, the first + is greedy but the second is frugal

• That is, the U modifier reverses the meaning of the presence or absence of the ? character after a quantifier
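
A quick way to convince oneself of this reversal, sketched here with assumed variable names, is to compare a greedy quantifier under the U modifier with an explicitly frugal quantifier and no modifier:

```php
<?php
$html = "<p>Fred is a boy.</p><p>Ann is a girl.</p>";

// Plain + made ungreedy by the U modifier.
$withModifier = preg_replace("%<p>(.+)</p>%U", "$1", $html);

// +? made ungreedy by the question mark, with no modifier.
$withQuestionMark = preg_replace("%<p>(.+?)</p>%", "$1", $html);

// Both approaches strip each paragraph's tags separately.
echo $withModifier;
```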

More on multi-line strings

• By default, the subject string is regarded as consisting of a single "line" of characters (even if it actually contains several newlines).

• The "start of line" metacharacter (^) matches only at the start of the string

• The "end of line" metacharacter ($) matches only at the end of the string

• Suppose, however, that we want these metacharacters to also match at newlines inside the string

Multi-line strings (contd.)

• Program fragment:

$myString = "<p>Fred is a boy.</p>\n<p>Ann is a girl.</p>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("%^<p>(.+)</p>$%","x$1y",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);

• Output produced

myString is <p>Fred is a boy.</p><p>Ann is a girl.</p>

myString is now <p>Fred is a boy.</p><p>Ann is a girl.</p>

• On the one hand, the ^ and $ characters did not match at the newline in the middle of the string

• On the other hand, the dot did not match the newline

The multi-line modifier

• Program fragment:

$myString = "<p>Fred is a boy.</p>\n<p>Ann is a girl.</p>";
echo "myString is ".str_replace("<","&lt;",$myString)." <br>";
$myString = preg_replace("%^<p>(.+)</p>$%m","x$1y",$myString);
echo "myString is now ".str_replace("<","&lt;",$myString);

• Output produced

myString is <p>Fred is a boy.</p><p>Ann is a girl.</p>

myString is now xFred is a boy.y xAnn is a girl.y

• The multiline modifier m has made the ^ and $ characters match at the newline in the middle of the string, so each line is matched and replaced separately
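
The m modifier is often combined with preg_match_all to pick out whole lines from a multi-line string. The following is a minimal sketch; the log text and the variable names $log and $matches are assumptions made for this example.

```php
<?php
// Hypothetical four-line log held in a single string.
$log = "ok: start\nerror: disk full\nok: done\nerror: timeout";

// With m, ^ and $ match at the start and end of every line,
// so each line beginning with "error: " is found separately.
preg_match_all("/^error: (.+)$/m", $log, $matches);

// $matches[1] holds what the first sub-pattern matched in each line.
echo implode(", ", $matches[1]);
```

Without the m modifier the pattern would find nothing here, because the string as a whole does not start with "error: ".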

CS4408 got here on 21 oct 2005

More PHP functions for using regular expressions

• To be continued in next slide set