more php functions for using regular expressions


Upload: quinlan-roberts

Post on 04-Jan-2016


Page 1: More PHP functions for using regular expressions

More PHP functions for using regular expressions

• Up to now, we have seen just one of the library of functions which PHP provides for using regular expressions

• The full library is described in Chapter CVIII of the PHP Manual

• We will consider just two more of them
– preg_match and
– preg_match_all

Page 2: More PHP functions for using regular expressions

preg_match

• The format of a call to this function is

int preg_match ( string pattern, string subject [, array &matches [, int flags [, int offset]]] )

• As can be seen, only the first two arguments are required, so a minimum-argument call is of the form

int preg_match ( string regexp, string subject )

which returns
0 if the regexp is not matched inside the subject string, or
1 if the regexp is matched inside the string
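The two return values can be seen in a minimal sketch like the one below. The subject string and the patterns are invented for illustration and are not from the slides. The strict comparison with 1 is a design choice: preg_match can also return FALSE when an error occurs, and a loose test would treat that the same as 0.

```php
<?php
// Hypothetical sample data, not taken from the slides.
$subject = "The year 2005 was a good one.";

$found    = preg_match("%[0-9]{4}%", $subject);   // four digits in a row occur in the subject
$notFound = preg_match("%[0-9]{5}%", $subject);   // five digits in a row do not

var_dump($found);
var_dump($notFound);

// Comparing with === 1 keeps "no match" (0) distinct from "error" (false).
if ($found === 1) {
    echo "matched\n";
}
```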

Page 3: More PHP functions for using regular expressions

• PHP code

<?php
$document = "<h1>France</h1>
<p>Foods of France:
<ol><li>wine</li><li>bread</li></ol></p>";
$regexp = "%<p>.+</p>%";
if ( preg_match($regexp,$document) )
   { echo "Yes"; }
else { echo "No"; }
?>

• Output

No

• Why no match?

• Answer: on next slide

Page 4: More PHP functions for using regular expressions

• We need to make the dot match newlines, by adding the s modifier after the closing delimiter of the regexp

• Revised PHP code

<?php
$document = "<h1>France</h1>
<p>Foods of France:
<ol><li>wine</li><li>bread</li></ol></p>";
$regexp = "%<p>.+</p>%s";
if ( preg_match($regexp,$document) )
   { echo "Yes"; }
else { echo "No"; }
?>

• Output

Yes

Page 5: More PHP functions for using regular expressions

preg_match (contd.)

• Frequently, it is useful to use the third, optional, argument

int preg_match ( string pattern, string subject, array &matches )

• As before, this returns 0 or 1 depending on whether a match was found in the subject string

• However, in addition, elements of the array in the third argument are set to parts of the matching substring of the subject string
– matches[0] is set to contain the whole matching substring
– matches[1] is set to contain the substring matched by the first () group
– matches[2] is set to contain the substring matched by the second () group
– … etc

Page 6: More PHP functions for using regular expressions

• PHP code

<?php
$document = "<h1>France</h1>
<p>Foods of France:
<ol><li>wine</li><li>bread</li></ol></p>";
echo "document is ".str_replace("<","&lt;",$document)." <br>";
$regexp = "%<p>(.+)</p>%s";
if ( preg_match($regexp,$document,$matches) )
   { echo "Yes <br>";
     echo "matches[0] is ".str_replace("<","&lt;",$matches[0])."<br>";
     echo "matches[1] is ".str_replace("<","&lt;",$matches[1])."<br>";
   }
else { echo "No"; }
?>

• Output

document is <h1>France</h1> <p>Foods of France: <ol><li>wine</li><li>bread</li></ol></p>
Yes
matches[0] is <p>Foods of France: <ol><li>wine</li><li>bread</li></ol></p>
matches[1] is Foods of France: <ol><li>wine</li><li>bread</li></ol>
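The example above has only one parenthesised group, so matches[2] is never set. The sketch below uses a hypothetical pattern with two groups, and a shortened sample string invented for illustration, to show how matches[1] and matches[2] are filled.

```php
<?php
// Hypothetical sample string, chosen so that both groups have something to capture.
$document = "<h1>France</h1>\n<p>Foods of France: wine and bread</p>";

// Group 1 captures the heading text, group 2 the paragraph text.
$regexp = "%<h1>(.+)</h1>.*<p>(.+)</p>%s";

if (preg_match($regexp, $document, $matches) === 1) {
    echo "matches[1] is " . $matches[1] . "\n";
    echo "matches[2] is " . $matches[2] . "\n";
}
```

Groups are numbered by the position of their opening parenthesis, counting from the left.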

Page 7: More PHP functions for using regular expressions

preg_match_all

• So why would we need a function called preg_match_all?

• See next slide

Page 8: More PHP functions for using regular expressions

• PHP code

<?php
$document = "<p>This is paragraph 1.</p> <p>And this is paragraph 2.</p>";
echo "document is ".str_replace("<","&lt;",$document)." <br>";
$regexp = "%<p>(.+?)</p>%s";
if ( preg_match($regexp,$document,$matches) )
   { echo "Yes <br>";
     echo "matches[0] is ".str_replace("<","&lt;",$matches[0])."<br>";
     echo "matches[1] is ".str_replace("<","&lt;",$matches[1])."<br>";
   }
else { echo "No"; }
?>

• Output

document is <p>This is paragraph 1.</p> <p>And this is paragraph 2.</p>
Yes
matches[0] is <p>This is paragraph 1.</p>
matches[1] is This is paragraph 1.

• That is, preg_match only finds the first match
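The regexp on this slide uses .+? rather than .+, which the slide does not comment on. As a rough sketch of the difference, the code below runs both forms against the same two-paragraph string; the variable names $greedy and $lazy are chosen here for illustration.

```php
<?php
$document = "<p>This is paragraph 1.</p> <p>And this is paragraph 2.</p>";

// Greedy: .+ takes as much as it can, so the match runs to the LAST </p>.
preg_match("%<p>(.+)</p>%s", $document, $greedy);

// Lazy: .+? takes as little as it can, so the match stops at the FIRST </p>.
preg_match("%<p>(.+?)</p>%s", $document, $lazy);

echo htmlspecialchars($greedy[1]) . "\n";
echo htmlspecialchars($lazy[1]) . "\n";
```

With the greedy form, the single match would swallow both paragraphs, which is why the lazy form is the one that suits paragraph-by-paragraph extraction.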

Page 9: More PHP functions for using regular expressions

preg_match_all

• preg_match_all is like preg_match except
– that it finds all matches and
– thus, the value returned in $matches is actually an array of arrays

int preg_match_all ( string pattern, string subject, array &matches )

• $matches[0] is an array of all the substrings which match the overall regular expression

• $matches[1] is an array of all the substrings which match the first parenthesised sub-expression

• $matches[2] is an array of all the substrings which match the second parenthesised sub-expression

• and so on

Page 10: More PHP functions for using regular expressions

• PHP code

<?php
$document = "<p>This is paragraph 1.</p> <p>And this is paragraph 2.</p><p>Paragraph 3.</p>";
echo str_replace("<","&lt;",$document)." <br>";
$regexp = "%<p>(.+?)</p>%s";
if ( preg_match_all($regexp,$document,$matches) )
   { $numMatches = count($matches[0]);
     for ($i=0; $i < $numMatches; $i++)
        { echo "matches[0][$i] is ".str_replace("<","&lt;",$matches[0][$i])."<br>"; }
     for ($i=0; $i < $numMatches; $i++)
        { echo "matches[1][$i] is ".str_replace("<","&lt;",$matches[1][$i])."<br>"; }
   }
else { echo "No"; }
?>

• Output

<p>This is paragraph 1.</p> <p>And this is paragraph 2.</p><p>Paragraph 3.</p>
matches[0][0] is <p>This is paragraph 1.</p>
matches[0][1] is <p>And this is paragraph 2.</p>
matches[0][2] is <p>Paragraph 3.</p>
matches[1][0] is This is paragraph 1.
matches[1][1] is And this is paragraph 2.
matches[1][2] is Paragraph 3.
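The slides group the results by sub-expression. As an alternative sketch, not used in the slides, preg_match_all also accepts a fourth flags argument, and passing PREG_SET_ORDER groups the results by match instead, which can be more convenient when looping. The variable names $count and $sets are chosen here for illustration.

```php
<?php
$document = "<p>This is paragraph 1.</p> <p>And this is paragraph 2.</p><p>Paragraph 3.</p>";

// The return value is the number of full matches found.
$count = preg_match_all("%<p>(.+?)</p>%s", $document, $sets, PREG_SET_ORDER);

echo "found $count matches\n";
foreach ($sets as $i => $set) {
    // $set[0] is the whole match, $set[1] the first parenthesised group.
    echo "match $i: " . $set[1] . "\n";
}
```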

Page 11: More PHP functions for using regular expressions

PHP Filesystem Functions

• Chapter XXXVIII of the PHP manual

• We will consider just four

• resource fopen ( string filename, string mode [, bool use_include_path [, resource zcontext]] )

Usually used as

resource fopen ( string filename, string mode)

• bool fclose ( resource handle )

• string fread ( resource handle, int length )

• int fwrite ( resource handle, string someString [, int length] )

Usually used as

int fwrite (resource handle, string someString )

Page 12: More PHP functions for using regular expressions

fopen

• Typical call format:

resource fopen ( string filename, string mode )

• Example calls

$fileHandle1 = fopen("names.txt","w");
opens, for writing, a file called "names.txt" in the same directory as the PHP program; the file is created if it does not exist and emptied if it does

$fileHandle1 = fopen("names.txt","r");
opens, for reading, a file called "names.txt" in the same directory as the PHP program

$fileHandle1 = fopen("/usr/csr/names.txt","w");
opens, for writing, a file called "/usr/csr/names.txt" on the same computer as the PHP program

$fileHandle1 = fopen("/usr/csr/names.txt","r");
opens, for reading, a file called "/usr/csr/names.txt" on the same computer as the PHP program

$fileHandle1 = fopen("http://www.rte.ie/index.html","r");
opens, for reading, a file on an external web-site (this requires the allow_url_fopen setting to be enabled in the PHP configuration)
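fopen returns FALSE when the file cannot be opened, so it is usually worth checking the handle before using it. The sketch below is one way to do that. The file name is hypothetical, and the system temporary directory is used only so that the example does not depend on the current directory being writable.

```php
<?php
// Hypothetical file name, placed in the system temp directory for this sketch.
$path = sys_get_temp_dir() . "/names_example.txt";

$handle = fopen($path, "w");
if ($handle === false) {
    exit("Could not open $path for writing\n");
}
fwrite($handle, "Alice\nBob\n");
fclose($handle);

// @ suppresses the warning PHP would otherwise print for a missing file.
$missing = @fopen($path . ".does-not-exist", "r");
var_dump($missing);
```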

Page 13: More PHP functions for using regular expressions

fread and file_get_contents

• Typical call format:

string fread ( resource handle, int length )

• Example calls

$contents = fread($fileHandle1,1000);
reads the next 1000 bytes from the file with handle $fileHandle1, or up to the end of the file if fewer than 1000 bytes are still unread

$contents = fread($fileHandle1,100000000);
reads the next 100 MB from the file with handle $fileHandle1, or up to the end of the file if less than 100 MB is still unread. Since very few files are as large as 100 MB, for a local file this probably just makes the computer read up to the end of the file. For a network stream such as a URL, however, a single fread call may return as soon as one chunk of data has arrived, so it is not guaranteed to return the whole page

• Typical call format:

string file_get_contents ( string filename [, bool use_include_path [, resource context [, int offset [, int maxlen]]]] )

• Example call

$contents = file_get_contents($someURL);
reads the entire file or URL into a string
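Because a single fread call is not guaranteed to return everything, a common pattern is to read in a loop until feof reports the end of the stream. The sketch below illustrates that pattern on a local file created just for the example (the file name and its contents are invented), and compares the result with file_get_contents.

```php
<?php
// Hypothetical local file standing in for a remote page.
$path = sys_get_temp_dir() . "/fread_example.txt";
file_put_contents($path, str_repeat("0123456789", 2000));   // 20000 bytes

// Read in chunks until end of file instead of relying on one large fread.
$handle = fopen($path, "r");
$contents = "";
while (!feof($handle)) {
    $chunk = fread($handle, 8192);
    if ($chunk === false) {
        break;                      // read error: stop rather than loop forever
    }
    $contents .= $chunk;
}
fclose($handle);

// file_get_contents does the open / read-to-end / close in one call.
$sameContents = file_get_contents($path);

echo strlen($contents) . " bytes read in a loop\n";
echo strlen($sameContents) . " bytes read by file_get_contents\n";
```

When the whole file is wanted as one string, file_get_contents is usually the simpler choice; the loop is mainly useful when the data should be processed piece by piece.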

Page 14: More PHP functions for using regular expressions

fwrite

• Typical call format:

int fwrite ( resource handle, string someString )

• Example call

$result = fwrite($fileHandle1,"<h1>Blah blah</h1>");
writes the string <h1>Blah blah</h1> into the file with handle $fileHandle1 and returns the number of bytes written into the file, or returns FALSE if there was an error
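A minimal sketch of checking the value returned by fwrite is shown below. The file name is hypothetical and the file is placed in the system temporary directory. The strict comparison with false matters because writing an empty string legitimately returns 0, which a loose test would mistake for an error.

```php
<?php
$path = sys_get_temp_dir() . "/fwrite_example.html";   // hypothetical file name

$handle = fopen($path, "w");
$result = fwrite($handle, "<h1>Blah blah</h1>");
fclose($handle);

if ($result === false) {
    echo "write failed\n";
} else {
    echo "$result bytes written\n";
}
```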

Page 15: More PHP functions for using regular expressions

fclose

• Typical call format:

bool fclose ( resource handle )

• Example call

fclose($fileHandle1);
closes the file with handle $fileHandle1

Page 16: More PHP functions for using regular expressions

Example usage

• PHP code:

<?php

$rte = fopen("http://www.rte.ie/","r");

$contents = fread($rte,100000000);

fclose($rte);

echo str_replace("<","&lt;",$contents);

?>

• Output:

<html> <head> <META http-equiv="Content-Type" content="text/html; charset=iso-8859-15"> <title>RTE.ie - Irish Public Service TV and radio stations online</title> <META name="Description" content="RTE.ie - Irish public service television and radio broadcaster on the World Wide Web - bringing you Irish news, sports, business, entertainment, weather, television and radio, programmes, current affairs, health, motors, travel, video and audio."> <META name="Keywords" content="rte, rte.ie, irish, television, radio, ireland, Irish, news, business, sport, results, news, Ireland, video, audio, broadcaster, irish"> <STYLE TYPE="text/css"><!-- A {text-decoration: none; color: #000000} A:hover {text-decoration: none; color: #660000} --></STYLE> <STYLE TYPE="text/css"> <!-- FORM {display:inline;} --></STYLE> <script language="JavaScript"> <!-- function AertelPage() { var

Page 17: More PHP functions for using regular expressions

Compare output of program with page seen in browser

<html> <head> <META http-equiv="Content-Type" content="text/html; charset=iso-8859-15"> <title>RTE.ie - Irish Public Service TV and radio stations online</title> <META name="Description" content="RTE.ie - Irish public service television and radio broadcaster on the World Wide Web - bringing you Irish news, sports, business, entertainment

Page 18: More PHP functions for using regular expressions

Example application

• Extracting output from the website of The Guardian:

– The Guardian maintains a page, updated almost daily, of recent stories on Israel and the Middle East at http://www.guardian.co.uk/israel

– Its appearance on 25 October 2005 is on the next slide

– The first part of the source code for the page, obtained from a browser, is on the slide after that

– The complete source code for the 25 October 2005 version of the page is in the file http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianIsrael25October2005.txt

(Note that this is an exact copy of the page from the Guardian site, so the src and href attribute values in the file assume that the page is stored at the Guardian URL above.)

• We want to extract the text stories, as shown in the third-next slide

Page 19: More PHP functions for using regular expressions
Page 20: More PHP functions for using regular expressions

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<!-- artifact_id=377264, built 2005-10-25 11:00 -->

<html lang="en">

<head>

<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

<meta name="artifact" content="377264">

<title>Guardian Unlimited | Special reports | Special report: Israel & the Middle East</title>

<link rel="shortcut icon" href="/favicon.ico" type="image/x-icon">

<link rel="stylesheet" href="/external/styles/basic/0,14491,,00.css" type="text/css">

<style type="text/css">

<!--

a.GUARDIANUNLIMITED {

text-decoration:none;

Page 21: More PHP functions for using regular expressions
Page 22: More PHP functions for using regular expressions

We want to produce a page like that on the right-hand side below from that on the left-hand side

Page 23: More PHP functions for using regular expressions

We want to get all stories between Latest and Audio reports

Page 24: More PHP functions for using regular expressions

Here is the source code around the headline Latest on the page

<div class="maintrail"><font face="Geneva, Arial, Helvetica, sans-serif" size="2">

<b>Latest</b><hr size="1">

<p><span class="mainlink"><a HREF="/israel/Story/0,2763,1599840,00.html">Israel still in control of Gaza, says envoy</a></span><br /><b>October 25: </b>The international Middle East envoy, James Wolfensohn, has accused Israel of behaving as if it has not withdrawn from the Gaza Strip, by blocking its borders and failing to fulfil commitments to allow the movement of Palestinians and goods.

</p>

<b>Qur'an test</b><hr size="1">

Page 25: More PHP functions for using regular expressions

Here is the source code around the headline Audio reports

<p><span class="mainlink"><a HREF="/israel/Story/0,2763,1596052,00.html">Israel's closed zone</a></span><br /><b>October 20, letters: </b>You graphically highlight the continuing expansionism of the Israeli government (Report, October 18).

</p>

<b>Audio reports</b><hr size="1">

<p><span class="mainlink"><a HREF="http://stream.guardian.co.uk:7080/ramgen/sys-audio/Guardian/audio/2005/09/12/120905McGreal.ra">Palestinians rush to Gaza</a></span><br /><b>September 12:</b> Many descended on the former Jewish Gaza settlements intent on causing chaos, but others came simply to see the beach for the first time, reports <b>Chris McGreal</b> from Khan Yunis. (2min 33s)

Page 26: More PHP functions for using regular expressions

This PHP program will extract all the source code between the two headlines

<?php

$f1 = fopen("http://www.guardian.co.uk/israel/","r");

$document = fread($f1,100000);

fclose($f1);

$regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s";

preg_match($regexp,$document,$matches);

$stories = $matches[1];

echo $stories;

?>
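The same extraction can be tried without network access on a small stand-in string. The sample below is a heavily abbreviated, invented imitation of the Guardian markup, not the real page. It also checks the result of preg_match before reading $matches[1], which the slide's program does not do.

```php
<?php
// Abbreviated stand-in for the Guardian page; the real page is much longer.
$document = <<<HTML
<b>Latest</b><hr size="1">
<p>Story one</p>
<b>Qur'an test</b><hr size="1">
<p>Story two</p>
<b>Audio reports</b><hr size="1">
<p>Audio item</p>
HTML;

$regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s";

if (preg_match($regexp, $document, $matches) === 1) {
    $stories = $matches[1];
} else {
    $stories = "";   // headlines not found: avoid reading an undefined $matches[1]
}

echo htmlspecialchars($stories);
```

Everything between the two headlines is kept, including the intermediate headline, while the audio item after the second headline is left out.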

Page 27: More PHP functions for using regular expressions

But we also want to remove all the intermediate headlines and rulings between the stories

Wolfensohn, has accused Israel of behaving as if it has not withdrawn from the Gaza Strip, by blocking its borders and failing to fulfil commitments to allow the movement of Palestinians and goods.

</p>

<b>Qur'an test</b><hr size="1">

<p><span class="mainlink"><a HREF="/israel/Story/0,2763,1599227,00.html">Qur'an competition tests participants' memories</a></span><br /><b>October 24: </b>With senior militant leaders looking on, Palestinian officials opened an international competition yesterday testing participants' knowledge of the Qur'an.

</p>

<b>Comment and analysis</b><hr size="1">

<p><span class="mainlink"><a HREF="/israel/Story/0,2763,1596291,00.html">Christian leanings at the Jerusalem Post</a></span><br /><b>October 20, Chris McGreal:</b> The strange and uneasy embrace between the Jewish state and America's evangelical right is being tightened.

<br />

<a HREF="/israel/comment/0,10551,1590082,00.html">12.10.05, Jonathan Freedland: One and three-quarter state solution</a><br />

<a HREF="/israel/Story/0,2763,1584308,00.html">04.10.05, Chris McGreal: House that became a war zone</a></p>

<b>West Bank</b><hr size="1">

<p><span class="mainlink"><a HREF="/israel/Story/0,2763,1596168,00.html">Israel accused of 'road apartheid' in West Bank</a></span><br /><b>October 20: </b>Army seals off main route to Palestinian vehicles <br><b>&#183; </b>Opponents say plan is to carve out new borders.

<br />

Page 28: More PHP functions for using regular expressions

This program will extract stories and remove all intermediate headlines and rulings between stories

<?php

$f1 = fopen("http://www.guardian.co.uk/israel/","r");

$document = fread($f1,100000);

fclose($f1);

$regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s";

preg_match($regexp,$document,$matches);

$stories = $matches[1];

$regexp = "%<b>.+</b><hr size=\"1\">%";

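/* On this page, equivalent to $regexp = "%<b>.+?</b><hr size=\"1\">%"; because, without the s modifier, the dot cannot cross a line break. Adding the s modifier would change the result: a match could then start at the <b> of a date and swallow the story text up to the next ruling. */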

$stories = preg_replace($regexp,"",$stories);

echo $stories;

?>
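The effect of the s modifier on the headline-removing regexp can be checked on a small sample. The sketch below uses an invented, abbreviated imitation of the extracted stories and compares the pattern without s against a lazy pattern with s; the variable names are chosen here for illustration.

```php
<?php
// Abbreviated stand-in for the extracted stories.
$stories = <<<HTML
<p><b>October 25: </b>First story text.
</p>
<b>Qur'an test</b><hr size="1">
<p><b>October 24: </b>Second story text.
</p>
HTML;

// Without the s flag the dot cannot cross a newline, so each match stays on one line.
$lineOnly = preg_replace("%<b>.+</b><hr size=\"1\">%", "", $stories);

// With the s flag a match may start at a date's <b> and run across lines to the next ruling.
$acrossLines = preg_replace("%<b>.+?</b><hr size=\"1\">%s", "", $stories);

echo htmlspecialchars($lineOnly) . "\n----\n";
echo htmlspecialchars($acrossLines) . "\n";
```

In this sample the first pattern removes only the headline line, whereas the second also removes the first story, because a regexp match always begins at the leftmost position where a match is possible.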

Page 29: More PHP functions for using regular expressions

Correcting the URLs

• The URLs on the Guardian page assume that the page is being delivered from the Guardian server

• Anchor elements are of this form:<a HREF="/israel/Story/0,2763,1599840,00.html">Israel still in control of

Gaza, says envoy</a>

• Thus, if someone clicks a hotlink on our output, the browser will think the target page is on our server

• We must make the URLs point to the Guardian server by making them full (or "absolute") URLs

Page 30: More PHP functions for using regular expressions

This program corrects the URLs

Notice that we need only string manipulation

<?php

$f1 = fopen("http://www.guardian.co.uk/israel/","r");

$document = fread($f1,100000);

fclose($f1);

$regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s";

preg_match($regexp,$document,$matches);

$stories = $matches[1];

$regexp = "%<b>.+</b><hr size=\"1\">%";

$stories = preg_replace($regexp,"",$stories);

$stories = str_replace('HREF="/','HREF="http://www.guardian.co.uk/',$stories);

echo "<h1>Today's Guardian stories on Palestine</h1>";

echo $stories;

?>

• You can run this program at http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor1.php
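The URL correction can be seen in isolation in the sketch below. The first anchor is the one quoted on the earlier slide; the second, with an already absolute URL, is a made-up example added to show that it is left untouched.

```php
<?php
// One site-relative anchor (from the slides) and one hypothetical absolute anchor.
$stories = '<a HREF="/israel/Story/0,2763,1599840,00.html">Israel still in control of Gaza, says envoy</a>'
         . '<a HREF="http://stream.guardian.co.uk:7080/example.ra">Audio</a>';

// Only values that begin with HREF="/ are rewritten; absolute URLs are left alone.
$stories = str_replace('HREF="/', 'HREF="http://www.guardian.co.uk/', $stories);

echo htmlspecialchars($stories);
```

Note that str_replace is case-sensitive, so this relies on the page writing HREF in upper case; anchors written as href="/..." would not be changed.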

Page 31: More PHP functions for using regular expressions

Output from program on previous page

Page 32: More PHP functions for using regular expressions

Putting the stories into a local file

<?php

$f1 = fopen("http://www.guardian.co.uk/israel/","r");

$document = fread($f1,100000);

fclose($f1);

$regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s";

preg_match($regexp,$document,$matches);

$stories = $matches[1];

$regexp = "%<b>.+?</b><hr size=\"1\">%";

$stories = preg_replace($regexp,"",$stories);

$stories = str_replace('HREF="/','HREF="http://www.guardian.co.uk/',$stories);

$f2 = fopen("todaysGuardian.html","w");

fwrite($f2,"<h1>Today's Guardian Palestine stories</h1>");

fwrite($f2,$stories);

fclose($f2);

?>

<a href="todaysGuardian.html">See result</a>

• You can run this program at http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor2.php

Page 33: More PHP functions for using regular expressions

There is a problem

• The trouble is that our program does not have write-access in the directory which contains the program

• We must get it to write to a different directory, where it will have write-access

Page 34: More PHP functions for using regular expressions

Putting the stories into a file in a write-access directory

<?php

$f1 = fopen("http://www.guardian.co.uk/israel/","r");

$document = fread($f1,100000);

fclose($f1);

$regexp = "%<b>Latest</b><hr size=\"1\">(.+)<b>Audio reports</b><hr size=\"1\">%s";

preg_match($regexp,$document,$matches);

$stories = $matches[1];

$regexp = "%<b>.+?</b><hr size=\"1\">%";

$stories = preg_replace($regexp,"",$stories);

$stories = str_replace('HREF="/','HREF="http://www.guardian.co.uk/',$stories);

$f2 = fopen("writable/todaysGuardian.html","w");

fwrite($f2,"<h1>Today's Guardian Palestine stories</h1>");

fwrite($f2,$stories);

fclose($f2);

?>

<a href="writable/todaysGuardian.html">See result</a>

• You can run this program at http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor3.php

Page 35: More PHP functions for using regular expressions

Output from program on previous page

Page 36: More PHP functions for using regular expressions

Contents of page generated by previous program

Page 37: More PHP functions for using regular expressions

A note on allowing PHP programs to write files

• PHP programs are run by your Apache server

• On Unix/Linux machines, the Apache server is treated as an ordinary user, here with the name "nobody" (the exact user name depends on how the server is configured)

• Thus, your PHP programs can only write into a directory where user nobody has write-permission

Page 38: More PHP functions for using regular expressions

A note on allowing PHP programs to write files (contd.)

• This could be a directory where you have given everybody write-access, as in

drwxrwxrwx 2 jabowen staff 35 Oct 25 09:58 writable

• But this is unsafe

• It is better to create a group which contains only yourself and nobody and give write access to that group, as in

drwxrwxr-x 2 jabowen jbApach 35 Oct 25 09:58 writable

where jbApach is a group that contains jabowen and nobody
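Whether the web server user can actually write into a directory can be tested from PHP before attempting to create the file. The sketch below is only illustrative: the system temporary directory stands in for the writable directory of the slides, and the file name is hypothetical.

```php
<?php
// sys_get_temp_dir() stands in for the "writable" directory from the slides.
$dir = sys_get_temp_dir();

if (is_writable($dir)) {
    $path = $dir . "/todaysGuardian_example.html";   // hypothetical file name
    $bytes = file_put_contents($path, "<h1>Test</h1>");
    echo "wrote $bytes bytes to $path\n";
} else {
    echo "The web server user cannot write into $dir\n";
}
```

The check is made with the permissions of the user running PHP, so when the script is run through Apache it reflects what the server user is allowed to do.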

Page 39: More PHP functions for using regular expressions

Generating this page automatically every day

• To generate this page automatically every day, we need to run this program automatically

http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor3.php

• To do this, we can use some Linux/Unix features
– a utility called wget
– another one called nohup
– a third one called crontab

Page 40: More PHP functions for using regular expressions

wget

• wget is for non-interactive download of files from the Web

• It supports the HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies

• wget is non-interactive, meaning that it can work in the background, while the user is not logged on

• This allows you to start a retrieval and disconnect from the system, letting wget finish the work

• By contrast, most web browsers require the user's constant presence

Page 41: More PHP functions for using regular expressions

Using wget

Page 42: More PHP functions for using regular expressions

Using wget (contd.)

Output is saved into a local file with the same name as the remote web-page

Page 43: More PHP functions for using regular expressions

Using wget (contd.)

See contents of local file, part 1

Page 44: More PHP functions for using regular expressions

Using wget (contd.)

Local file contains exactly the output from our program, i.e. without any HTTP headers

Page 45: More PHP functions for using regular expressions

• The local file generated by wget contains the output from our program

• Of course, we are not interested in that output

• We are simply interested in the fact that the file writable/todaysGuardian.html was generated

• We are not even interested in hanging around while wget executes our web program

• Thus, we use another utility, called nohup, to call wget to run our program

nohup wget http://www.cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor3.php

Page 46: More PHP functions for using regular expressions

Automating all this

• Suppose we want to be able to look at the file writable/todaysGuardian.html first thing every morning

• Suppose we want to ensure that it is updated each morning before we wake up

• We can use the crontab utility to make this happen

Page 47: More PHP functions for using regular expressions

crontab

• crontab will execute programs automatically at times we specify

• To ask it to execute our program, 7 days a week at 3:30 AM, use the following crontab entry (the five leading fields are minute, hour, day of month, month and day of week):

30 3 * * 1,2,3,4,5,6,7 nohup wget http://cs.ucc.ie/j.bowen/cs4408/resources/GuardianExtractor3.php