Regular Expressions
Regular expressions are a powerful string manipulation tool
All modern languages have similar library packages for regular expressions
Use regular expressions to:
• Search a string (search and match)
• Replace parts of a string (sub)
• Break strings into smaller pieces (split)

Regular Expression Python Syntax
Most characters match themselves
The regular expression "test" matches the string 'test', and only that string
[x] matches any one of a list of characters
"[abc]" matches 'a', 'b', or 'c'
[^x] matches any one character that is not included in x
"[^abc]" matches any single character except 'a', 'b', or 'c'
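
The character-set rules above can be tried out directly. This is a minimal sketch, assuming Python 3.4 or later for re.fullmatch; the sample text "a cab ride" is made up for illustration.

```python
import re

# Literal characters match themselves; fullmatch requires the whole string to match.
literal_ok = re.fullmatch(r"test", "test") is not None
literal_bad = re.fullmatch(r"test", "tests") is not None

# [abc] matches one character that is 'a', 'b', or 'c'.
in_set = re.findall(r"[abc]", "a cab ride")

# [^abc] matches one character that is anything else (spaces included).
not_in_set = re.findall(r"[^abc]", "a cab ride")

print(literal_ok, literal_bad)   # True False
print(in_set)                    # ['a', 'c', 'a', 'b']
print(not_in_set)                # [' ', ' ', 'r', 'i', 'd', 'e']
```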

Regular Expression Syntax
"." matches any single character (except a newline, by default)
Parentheses can be used for grouping
"(abc)+" matches 'abc', 'abcabc', 'abcabcabc', etc.
x|y matches x or y
"this|that" matches 'this' or 'that', but not 'thisthat'.
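
A short sketch of the dot, grouping, and alternation, assuming Python 3.4 or later for re.fullmatch. The sample strings are illustrative only.

```python
import re

# "." matches any one character except a newline (by default).
dot_hits = re.findall(r"c.t", "cat cot c\nt cut")

# Parentheses group "abc" so that + repeats the whole group.
repeated = re.fullmatch(r"(abc)+", "abcabc") is not None
partial = re.fullmatch(r"(abc)+", "abcab") is not None

# "|" chooses between alternatives; fullmatch insists the whole string is one of them.
either = [w for w in ["this", "that", "thisthat"] if re.fullmatch(r"this|that", w)]

print(dot_hits)            # ['cat', 'cot', 'cut']
print(repeated, partial)   # True False
print(either)              # ['this', 'that']
```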

Regular Expression Syntax
x* matches zero or more x's
"a*" matches '', 'a', 'aa', etc.
x+ matches one or more x's
"a+" matches 'a', 'aa', 'aaa', etc.
x? matches zero or one x
"a?" matches '' or 'a'
x{m,n} matches i x's, where m ≤ i ≤ n
"a{2,3}" matches 'aa' or 'aaa'
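
One way to see the four quantifiers side by side is to test each against the same list of strings. The helper fully_matches below is a hypothetical name introduced for this sketch, not part of the re module.

```python
import re

def fully_matches(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

samples = ["", "a", "aa", "aaa", "aaaa"]

star = fully_matches(r"a*", samples)         # zero or more
plus = fully_matches(r"a+", samples)         # one or more
optional = fully_matches(r"a?", samples)     # zero or one
bounded = fully_matches(r"a{2,3}", samples)  # two or three, inclusive

print(star)      # ['', 'a', 'aa', 'aaa', 'aaaa']
print(plus)      # ['a', 'aa', 'aaa', 'aaaa']
print(optional)  # ['', 'a']
print(bounded)   # ['aa', 'aaa']
```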

Regular Expression Syntax
"\d" matches any digit; "\D" matches any non-digit
"\s" matches any whitespace character; "\S" matches any non-whitespace character
"\w" matches any alphanumeric character (or underscore); "\W" matches any other character
"^" matches the beginning of the string; "$" matches the end of the string
"\b" matches a word boundary; "\B" matches a position that is not a word boundary
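
The shorthand classes and anchors can be checked on a couple of sample sentences. This is a sketch only; the sentences are invented for the example.

```python
import re

text = "Room 42 is on floor 7, not floor 17b."

digits = re.findall(r"\d+", text)                       # runs of digits
pieces = re.split(r"\s+", "one  two\tthree\nfour")      # split on any whitespace

starts_with_room = re.search(r"^Room", text) is not None   # anchored at the start
ends_with_floor = re.search(r"floor$", text) is not None   # anchored at the end

# \b requires a word boundary on each side, so 'floors' and 'subfloor' are skipped.
whole_words = re.findall(r"\bfloor\b", "floor floors subfloor floor.")

print(digits)                             # ['42', '7', '17']
print(pieces)                             # ['one', 'two', 'three', 'four']
print(starts_with_room, ends_with_floor)  # True False
print(len(whole_words))                   # 2
```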

Search and Match
The two basic functions are re.search and re.match
• Search looks for a pattern anywhere in a string
• Match looks for a match starting at the beginning
Both return None if the pattern is not found (logical false) and a "match object" if it is found
>>> pat = "a*b"
>>> import re
>>> re.search(pat,"fooaaabcde")
<_sre.SRE_Match object at 0x809c0>
>>> re.match(pat,"fooaaabcde")
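
The session above prints nothing for re.match because it returned None. A script version makes the difference explicit; this is a sketch, and the second test string "aabxyz" is made up for illustration.

```python
import re

pat = r"a*b"
text = "fooaaabcde"

found = re.search(pat, text)     # scans forward and finds 'aaab' at index 3
anchored = re.match(pat, text)   # only tries index 0, where 'f' cannot start a match

if found:
    print("search found:", found.group())
if anchored is None:
    print("match found nothing at the start")

# match succeeds when the pattern fits at the very beginning of the string.
at_start = re.match(pat, "aabxyz")
print(at_start.group())          # aab
```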

Python's raw string notation
Use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:
>>> re.match(r"\\", r"\\")
<_sre.SRE_Match object at ...>
>>> re.match("\\\\", r"\\")
<_sre.SRE_Match object at ...>
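
The effect of the r prefix is easy to confirm by comparing string lengths. A minimal sketch; the Windows-style path is a hypothetical value used only to have some backslashes to find.

```python
import re

# Without the r prefix, Python turns \n into a single newline character.
plain_len = len("\n")
raw_len = len(r"\n")      # a backslash followed by 'n'

# Both spellings give the same two-character pattern, which matches one literal backslash.
same_pattern = (r"\\" == "\\\\")

windows_path = r"C:\temp\notes.txt"
backslashes = re.findall(r"\\", windows_path)

print(plain_len, raw_len)   # 1 2
print(same_pattern)         # True
print(len(backslashes))     # 2
```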

Search example
import re
programming = ["Python", "Perl", "PHP", "C++"]
pat = "^B|^P|i$|H$"
for lang in programming:
    if re.search(pat, lang, re.IGNORECASE):
        print(lang + " FOUND")
    else:
        print(lang + " NOT FOUND")
The output of the above script will be:
Python FOUND
Perl FOUND
PHP FOUND
C++ NOT FOUND

Q: What's a match object?
An instance of the match class with the details of the match result
>>> pat = "a*b"
>>> r1 = re.search(pat,"fooaaabcde")
>>> r1.group() # group returns string matched
'aaab'
>>> r1.start() # index of the match start
3
>>> r1.end() # index just past the match end
7
>>> r1.span() # tuple of (start, end)
(3, 7)

What got matched?
Here's a pattern to match simple email addresses
\w+@(\w+\.)+(com|org|net|edu)
>>> pat1 = r"\w+@(\w+\.)+(com|org|net|edu)"
>>> r1 = re.match(pat1,"finin@cs.umbc.edu")
>>> r1.group()
'finin@cs.umbc.edu'
We might want to extract the pattern parts, like the email name and host

What got matched?
We can put parentheses around groups we want to be able to reference
>>> pat2 = r"(\w+)@((\w+\.)+(com|org|net|edu))"
>>> r2 = re.match(pat2,"finin@cs.umbc.edu")
>>> r2.groups()
('finin', 'cs.umbc.edu', 'umbc.', 'edu')
>>> r2.group(1)
'finin'
>>> r2.group(2)
'cs.umbc.edu'
Note that the 'groups' are numbered by the position of their opening parenthesis, left to right (a preorder traversal of the forest of nested groups)

What got matched?
We can 'label' the groups as well…
>>> pat3 = r"(?P<name>\w+)@(?P<host>(\w+\.)+(com|org|net|edu))"
>>> r3 = re.match(pat3,"finin@cs.umbc.edu")
>>> r3.group('name')
'finin'
>>> r3.group('host')
'cs.umbc.edu'
And reference the matching parts by the labels
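
Labelled groups can also be pulled out all at once with groupdict(), and they still keep their ordinary numbers. A sketch, assuming the sample address finin@cs.umbc.edu implied by the group values shown in these slides.

```python
import re

pat3 = r"(?P<name>\w+)@(?P<host>(\w+\.)+(com|org|net|edu))"
m = re.match(pat3, "finin@cs.umbc.edu")

# groupdict() maps each label to the text it matched; unlabelled groups are left out.
labels = m.groupdict()

# A labelled group is still reachable by number: 'name' is group 1, 'host' is group 2.
same_group = (m.group("name") == m.group(1))

# span() accepts a label too and returns where that group matched.
host_span = m.span("host")

print(labels)      # {'name': 'finin', 'host': 'cs.umbc.edu'}
print(same_group)  # True
print(host_span)   # (6, 17)
```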

Pattern object methods
There are methods defined for a pattern object that parallel the regular expression functions, e.g.,
• match
• search
• split
• findall
• sub
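
Each of the listed methods takes the same arguments as the module-level function, minus the pattern. A small sketch with a made-up pattern and made-up sample strings:

```python
import re

# Compile once; flags are given at compile time.
word = re.compile(r"[a-z]+", re.IGNORECASE)
text = "Spam, eggs and SPAM"

first = word.match(text).group()            # match at the start of the string
later = word.search("123 abc").group()      # first match anywhere
all_words = word.findall(text)              # every non-overlapping match
numbers = word.split("12ab34CD56")          # split wherever the pattern matches
masked = word.sub("#", text)                # replace every match

print(first)      # Spam
print(later)      # abc
print(all_words)  # ['Spam', 'eggs', 'and', 'SPAM']
print(numbers)    # ['12', '34', '56']
print(masked)     # #, # # #
```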

More re functions
re.split() is like split but can use patterns
>>> re.split(r"\W+", "This... is a test, short and sweet, of split().")
['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']
re.sub substitutes one string for a pattern
>>> re.sub('(blue|white|red)', 'black', 'blue socks and red shoes')
'black socks and black shoes'
re.findall() finds all matches
>>> re.findall(r"\d+","12 dogs,11 cats, 1 egg")
['12', '11', '1']

findall With Files
f = open('test.txt', 'r')
# it returns a list of all the found strings
strings = re.findall('some pattern', f.read())
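
The file snippet above never closes the file. A variant using a with block is sketched here; it writes its own temporary file so it can run anywhere, and the file contents and the \d+ pattern are assumptions made for the example.

```python
import os
import re
import tempfile

# Create a small sample file so the sketch is self-contained (hypothetical contents).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("12 dogs\n11 cats\n1 egg\n")
    path = tmp.name

# "with" closes the file automatically once the block ends.
with open(path, "r") as f:
    numbers = re.findall(r"\d+", f.read())

os.remove(path)    # clean up the temporary file
print(numbers)     # ['12', '11', '1']
```

For very large files, reading line by line and calling findall on each line avoids holding the whole file in memory, at the cost of missing matches that span a line break.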

Compiling regular expressions
re.compile
If you plan to use a re pattern more than once, compile it to a re object
Python produces a special data structure that speeds up matching
>>> cpat3 = re.compile(pat3)
>>> cpat3
<_sre.SRE_Pattern object at 0x2d9c0>
>>> r3 = cpat3.search("finin@cs.umbc.edu")
>>> r3
<_sre.SRE_Match object at 0x895a0>
>>> r3.group()
'finin@cs.umbc.edu'

Example: pig latin
Rules
• If word starts with consonant(s)
    Move them to the end, append "ay"
• Else word starts with vowel(s)
    Keep as is, but add "zay"
• How might we do this?

The pattern
([bcdfghjklmnpqrstvwxyz]+)(\w+)

piglatin.py
import re
pat = r'([bcdfghjklmnpqrstvwxyz]+)(\w+)'
cpat = re.compile(pat)
def piglatin(string):
    return " ".join( [piglatin1(w) for w in string.split()] )

piglatin.py
def piglatin1(word):
    match = cpat.match(word)
    if match:
        consonants = match.group(1)
        rest = match.group(2)
        return rest + consonants + "ay"
    else:
        return word + "zay"
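
Put together, the two pieces of piglatin.py can be run as one script. This is a sketch of the same approach with a sample sentence added for illustration; the sentence is not from the slides.

```python
import re

# Same pattern as the slides: leading consonants, then the rest of the word.
cpat = re.compile(r"([bcdfghjklmnpqrstvwxyz]+)(\w+)")

def piglatin1(word):
    """Translate one word: move leading consonants to the end, or add 'zay'."""
    match = cpat.match(word)
    if match:
        consonants, rest = match.groups()
        return rest + consonants + "ay"
    return word + "zay"

def piglatin(string):
    """Translate every whitespace-separated word in the string."""
    return " ".join(piglatin1(w) for w in string.split())

result = piglatin("this is a simple string")
print(result)   # isthay iszay azay implesay ingstray
```

The character class lists lowercase letters only, so a capitalized word such as "Hello" falls through to the vowel branch and becomes "Hellozay"; compiling with re.IGNORECASE would be one way to change that.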

Exercises
Write a Python program using regexp to validate an IP address (e.g. 172.1.2.200)
Write a regexp to validate your USN.
Find the email domain in an address
    E.g. 'My name is Ram, and [email protected] is my email.' and the program should return @gmail.
Write a program to validate a name and phone number using re. It should continue to ask until you enter correct data (e.g. phone number: (800) 555.1212 #1234). Use re.compile
Define a simple "spelling correction" function correct() that takes a string and sees to it that 1) two or more occurrences of the space character are compressed into one, and 2) an extra space is inserted after a period if the period is directly followed by a letter. (Use regular expressions.)
    E.g. correct("This is very funny and cool.Indeed!") should return "This is very funny and cool. Indeed!"
Find all five-character-long words in a sentence