Regular Expressions CISC/QCSE 810

Upload: evan-neil-young

Post on 02-Jan-2016


Recognizing Matching Strings

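ls *.exe translates to "any set of characters, followed by the exact string .exe"

The "*.exe" is a pattern in the spirit of a regular expression (strictly it is a shell wildcard, or "glob"; the equivalent regular expression is /^.*\.exe$/)

The shell expands the pattern against the list of all files, and ls is handed only the names that match "*.exe"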

In Perl

In Perl, can see if strings match using the =~ operator

$s = "Cat In the Hat";
if ($s =~ /Cat/) {
    print "Matches Cat";
}

if ($s =~ /Chat/) {
    print "Matches Chat";
}

Common references

\w  Characters in words
\W  Non-word character
\s  Space, tab
\S  Non-whitespace character
\d  Match a digit
\D  Non-digit match
\n  Newline
\t  Tab
.   Any character (except newline, by default)
^   Start of string
$   End of string
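As a rough sketch of how these references combine, the snippet below tests a few made-up strings. The sample strings and variable names are illustrative assumptions, not from the slides; the + used here means "one or more".

```perl
use strict;
use warnings;

# Hypothetical sample line: a word, whitespace, then digits.
my $line = "width\t42";

# ^ and $ force the whole string to fit: word chars, whitespace, digits.
my $whole = ($line =~ /^\w+\s+\d+$/) ? 1 : 0;

# \D matches any non-digit, so a purely numeric string does not match it.
my $has_non_digit = ("12345" =~ /\D/) ? 1 : 0;

# . does not match a newline by default, so "a\nb" fails against /a.b/.
my $dot = ("a\nb" =~ /a.b/) ? 1 : 0;

print "whole=$whole non_digit=$has_non_digit dot=$dot\n";
```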

Modifiers

*      0 or more occurrences
{n}    Exactly n matches
{n,}   n or more matches
{n,m}  Match n to m matches
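A minimal sketch of the modifiers in use, assuming some made-up test strings (the ten-digit "phone number" layout is only an illustration):

```perl
use strict;
use warnings;

# Hypothetical candidates: 9, 10 and 11 digits long.
my @candidates = ("6135551234", "613555123", "61355512345");

# {3} and {7} demand exact counts; the anchors reject longer strings.
my @ten_digits = grep { /^\d{3}\d{7}$/ } @candidates;

# {2,} means two or more; {1,3} means between one and three.
my $two_or_more  = ("aaa"  =~ /^a{2,}$/)  ? 1 : 0;
my $one_to_three = ("aaaa" =~ /^a{1,3}$/) ? 1 : 0;

# * allows zero occurrences, so "ac" still matches /ab*c/.
my $star = ("ac" =~ /^ab*c$/) ? 1 : 0;

print "@ten_digits $two_or_more $one_to_three $star\n";
```

Without the ^ and $ anchors the 11-digit string would also match, since a ten-digit run is contained inside it.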

Character Groups

[a-z] [xyz] [0-9A-Z] [\w_]

[^a-z]  NOT a-z
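The groups above can be tried out on a few names. This is only a sketch; the list of names is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical identifiers to classify.
my @names = ("total_2", "Total", "x-ray", "abc");

# [a-z] is a range; with anchors and +, only all-lowercase names pass.
my @lower = grep { /^[a-z]+$/ } @names;

# [^a-z] is the negation: true if any character falls outside a-z.
my @has_other = grep { /[^a-z]/ } @names;

# \w already includes the underscore, so [\w_] accepts the same characters as \w.
my @wordy = grep { /^[\w_]+$/ } @names;

print "lower: @lower\nother: @has_other\nwordy: @wordy\n";
```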

Exercise 1

Write a regexp that matches only on Canadian postal codes
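One possible starting point, not a definitive answer: the sketch assumes the common "A1A 1A1" layout in upper case with an optional space (the ? makes the space optional). It does not exclude the letters that real postal codes never use, and the test strings are made up.

```perl
use strict;
use warnings;

# Assumed layout: letter digit letter, optional space, digit letter digit.
my $postal = qr/^[A-Z]\d[A-Z] ?\d[A-Z]\d$/;

# Hypothetical inputs: two well-formed, three malformed.
my @tests = ("K7L 3N6", "K7L3N6", "K7L  3N6", "7KL 3N6", "K7L 3N66");
my @valid = grep { $_ =~ $postal } @tests;

print "valid: ", join(", ", @valid), "\n";
```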

Exercise 2

Write a regexp that matches typical intermediate files (.o, .dvi, .tmp)

Helpful if you want a systematic way to delete them
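A hedged sketch of one way to approach this, using an invented list of file names in place of a real directory listing. The delete itself is left commented out.

```perl
use strict;
use warnings;

# Escape the dot so it matches a literal ".", and anchor at the end of the name.
my $intermediate = qr/\.(o|dvi|tmp)$/;

# Hypothetical file names; a real script might read them with glob("*").
my @files = ("main.c", "main.o", "paper.dvi", "paper.tex", "scratch.tmp", "demo");
my @to_delete = grep { $_ =~ $intermediate } @files;

print "would delete: @to_delete\n";
# unlink @to_delete;   # left commented out: deleting is irreversible
```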

String Substitution

Found an input file (*.dat), looking for a matching output file (<same>.out)

@input_files = <*.dat>;

foreach $input_file (@input_files) {
    # Copy to output name
    $output_file = $input_file;
    # Replace the .dat extension with .out (dot escaped, anchored at the end)
    $output_file =~ s/\.dat$/.out/;

    if (! -f $output_file) {
        print "Need to create output for $output_file\n";
    }
}
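The left side of a substitution is a regexp, so an unescaped . matches any character. The sketch below, with a made-up file name, shows how that can rewrite the wrong part of a name and how escaping and anchoring avoid it.

```perl
use strict;
use warnings;

# Hypothetical file name containing "dat" more than once.
my $name = "my.data.dat";

# Unescaped dot, no anchor: the first "any character + dat" is replaced.
(my $loose = $name) =~ s/.dat/.out/;

# Escaped dot, anchored at the end: only the real extension is replaced.
(my $anchored = $name) =~ s/\.dat$/.out/;

print "loose:    $loose\nanchored: $anchored\n";
```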

Translating

$s = "Alternate Ending";
$s =~ tr/a-z/A-Z/;

Can also use 'uc' and 'lc' (more generic for non-English languages)
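A short sketch comparing tr with uc and lc. The variable names are illustrative; the counting use of tr is standard Perl but is not mentioned on the slide.

```perl
use strict;
use warnings;

my $s = "Alternate Ending";

# tr works on character lists, not regexps, so no square brackets are needed.
(my $upper = $s) =~ tr/a-z/A-Z/;

# With an empty replacement list, tr only counts the matching characters.
my $count = ($s =~ tr/a-z//);

print "$upper | ", uc($s), " | ", lc($s), " | $count lowercase letters\n";
```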

Grabbing Substrings

Get root URL

$url = "http://www.mast.queensu.ca/~math224/Slides/Week_09/driven_spring2.m";
$url =~ /(www[\w.]*)/;
$short_url = $1;
print "Full URL: $url\n";
print "Site URL: $short_url\n";
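$1 is only meaningful if the match succeeded, so it may be worth checking. The sketch below is one way to do that: a match in list context returns its captured groups, or an empty list on failure. The pattern and the names $host and $file are assumptions for illustration.

```perl
use strict;
use warnings;

my $url = "http://www.mast.queensu.ca/~math224/Slides/Week_09/driven_spring2.m";

# Capture the host after "scheme://" and the last path component.
my ($host, $file) = $url =~ m{^\w+://([\w.]+)/.*/([^/]+)$};

if (defined $host) {
    print "Host: $host\nFile: $file\n";
} else {
    print "No match\n";
}
```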

End options

s/a/A/g - global; swap all matches
    changes "aaaba" to "AAAbA"

Compare with s/a/A/
    changes "aaaba" to "Aaaba"

/tmp/i - case insensitive
    recognizes "tmp", "Tmp", "tMP", "TMP"…

Exercise

Write a regexp line that returns all the integers in the text

Can it be extended to handle floating point values?
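One possible starting point, offered as a sketch rather than the intended answer. It relies on a /g match in list context returning every occurrence; the input text is made up, and the floating point pattern ignores exponents such as 1e5.

```perl
use strict;
use warnings;

# Hypothetical input text.
my $text = "Run 12 took 3.5 s and used 4096 bytes, offset -7";

# Integers: optional minus sign, then digits. Note that 3.5 comes back as 3 and 5.
my @ints = $text =~ /-?\d+/g;

# Floating point variant: optional sign, digits, optional fractional part.
my @nums = $text =~ /-?\d+(?:\.\d+)?/g;

print "ints: @ints\nnums: @nums\n";
```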

Functions with Regex

split
    split /\s+/, $line;
    split /,/, $line;
    split /\t/, $line;
    split //, $line;

grep
    @v = qw( aaa bba bbc);
    @matches = grep /bb/, @v;
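A small sketch of what these calls return, using invented input strings:

```perl
use strict;
use warnings;

# Hypothetical line with irregular whitespace.
my $line = "alpha   beta\tgamma";
my @words = split /\s+/, $line;

# Comma-separated fields; the empty field in the middle is kept.
my @fields = split /,/, "1,2,,4";

# An empty pattern splits into individual characters.
my @chars = split //, "abc";

# grep keeps the elements for which the pattern matches.
my @v = qw(aaa bba bbc);
my @matches = grep { /bb/ } @v;

print scalar(@words), " words, ", scalar(@fields), " fields, @chars, @matches\n";
```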

Longer example – Log files

Parsing log files

195.5.23.103 - - [25/Mar/2003:02:22:11 -0800] "GET /gcs/new.gif HTTP/1.1" 200 926
195.5.23.103 - - [25/Mar/2003:02:22:11 -0800] "GET /gcs/update.gif HTTP/1.1" 200 971
proxy.skynet.be - - [25/Mar/2003:02:40:54 -0800] "GET /gcs/gc1hint.html HTTP/1.1" 200 16358
j3194.inktomisearch.com - - [25/Mar/2003:03:13:12 -0800] "GET /~gcs/K-12.html HTTP/1.0" 200 3235
kittyhawk.hhmi.org - - [25/Mar/2003:03:17:20 -0800] "HEAD /gcs/ HTTP/1.0" 200 0
j3104.inktomisearch.com - - [25/Mar/2003:03:54:43 -0800] "GET /gcs/pa.html HTTP/1.0" 200 5614
crawl11-public.alexa.com - - [25/Mar/2003:04:51:41 -0800] "GET /gcs/clinical.html HTTP/1.0" 200 20132
…
livebot-65-55-208-64.search.live.com - - [24/Jul/2007:22:16:58 -0700] "GET /gcs/webstats/usage_200602.html HTTP/1.0" 200 128720
203.129.234.42 - - [24/Jul/2007:22:22:39 -0700] "GET /gcs/status/statuscheck.html HTTP/1.1" 200 1522624
livebot-65-55-208-65.search.live.com - - [24/Jul/2007:22:47:32 -0700] "GET /gcs/webstats/usage_200610.html HTTP/1.0" 200 132580
…
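A minimal sketch of pulling fields out of entries like these, assuming each entry sits on its own line with the layout host, two placeholder fields, [timestamp], "request", status, bytes. The field names and the per-host byte total are illustrative choices, not from the slides.

```perl
use strict;
use warnings;

# Two entries in the format shown above.
my @log = (
    '195.5.23.103 - - [25/Mar/2003:02:22:11 -0800] "GET /gcs/new.gif HTTP/1.1" 200 926',
    'kittyhawk.hhmi.org - - [25/Mar/2003:03:17:20 -0800] "HEAD /gcs/ HTTP/1.0" 200 0',
);

my %bytes_by_host;
for my $entry (@log) {
    # host, two ignored fields, [timestamp], "method path protocol", status, bytes
    if ($entry =~ /^(\S+) \S+ \S+ \[([^\]]+)\] "(\w+) (\S+) [^"]*" (\d{3}) (\d+)$/) {
        my ($host, $when, $method, $path, $status, $bytes) = ($1, $2, $3, $4, $5, $6);
        $bytes_by_host{$host} += $bytes;
        print "$host $method $path -> $status ($bytes bytes)\n";
    }
}
```

Entries that do not fit the pattern are silently skipped here; a real script would probably want to count or report them.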

Alternate uses

If you write your own program, with many print statements, you can

1. make print statements meaningful

"Time spent on loading: 23.5s"

2. parse afterwards to process/store values

$line =~ m/: ([\d.]+)s/;
$time = $1;
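A sketch of the parse-afterwards idea over several lines of output. The "solving" line and the hash %seconds are assumptions added for illustration. Note that the + sits inside the parentheses so that the whole number is captured, not just its last character.

```perl
use strict;
use warnings;

# Hypothetical program output with one timing per line.
my @output = (
    "Time spent on loading: 23.5s",
    "Time spent on solving: 101.25s",
    "Done.",
);

my %seconds;
for my $line (@output) {
    # Capture the stage name and the full number of seconds.
    if ($line =~ /^Time spent on (\w+): ([\d.]+)s$/) {
        $seconds{$1} = $2;
    }
}

my $total = $seconds{loading} + $seconds{solving};
print "total: ${total}s\n";
```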

Resources

Any web search for "perl regular expression tutorial"

Perl reg exp by example
http://www.somacon.com/p127.php

Reference card
http://www.erudil.com/preqr.pdf

Perl site reference
http://perldoc.perl.org/perlre.html