An Introduction to Python, Part 3
Regular Expressions for Data Formatting
Jacob Morgan, Brent Frakes
National Park Service, Fort Collins, CO
April, 2008
Overview
• Regular Expressions
• Regular Expressions Module
• Formatting a dataset
• Utilize all skills learned so far!
Regular Expressions
• Regular expressions enable string manipulation, searching, and substitution
• Useful built-in string methods:
– count(sub, start=0, end=max) #returns the number of non-overlapping occurrences of the substring sub
– find(sub, start=0, end=max) #returns the position of the first occurrence of sub, or -1 if it is not found
– isalnum() #returns True if all characters are letters or digits; otherwise returns False
– isdigit() #returns True if all characters are digits; otherwise returns False
– lower() #returns a lowercase copy of the string
– strip() #removes leading and trailing whitespace, including the end-of-line character (\n)
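As a quick illustration of the built-in methods listed above, here is a minimal sketch. The sample strings are hypothetical and chosen only for illustration; they do not come from the course files.

```python
# Hypothetical sample line, as it might be read from a text file.
text = "Study: elk survey\n"

cleaned = text.strip()              # drop the trailing newline
n_e = cleaned.count("e")            # number of non-overlapping "e" characters
pos_elk = cleaned.find("elk")       # index where "elk" first appears
pos_none = cleaned.find("moose")    # find() returns -1 when nothing is found
lowered = cleaned.lower()           # lowercase copy of the string

print(cleaned)     # Study: elk survey
print(n_e)         # 2
print(pos_elk)     # 7
print(pos_none)    # -1
print(lowered)     # study: elk survey

print("2008".isdigit())          # True: every character is a digit
print("April2008".isalnum())     # True: only letters and digits
print("April, 2008".isalnum())   # False: the comma and space are neither
```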
Exercises
>>> string = 'the brown fox'
>>> string.count('o')
>>> string.find('o')
>>> string.isalnum()
>>> string.isdigit()
>>> string.split('b')
Regular Expressions Module
• Built-in module
• Enhances the basic string functionality
• >>> import re
Regular Expressions - Syntax
.   matches any character except \n
*   matches zero or more repetitions of the preceding expression
+   matches one or more repetitions of the preceding expression
\d  matches one digit
\D  matches one non-digit
\s  matches one whitespace character
\S  matches one non-whitespace character
\w  matches one alphanumeric character (or an underscore)
\W  matches one non-alphanumeric character
|   alternation: matches the pattern on either side
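The symbols above can be tried out one at a time. The sketch below uses re.findall, which is not covered in these slides but is a standard function in the re module that returns every non-overlapping match as a list. The sample string is hypothetical.

```python
import re

# Hypothetical sample line, for illustration only.
sample = "plot 12, elk: 7\n"

digits = re.findall(r"\d+", sample)         # runs of one or more digits
words = re.findall(r"\w+", sample)          # runs of alphanumeric characters
spaces = re.findall(r"\s", sample)          # each whitespace character
non_digits = re.findall(r"\D+", "12ab34")   # runs of non-digits
either = re.findall(r"elk|deer", sample)    # "elk" or "deer"
dots = re.findall(r".", "a\nb")             # "." skips the newline

print(digits)      # ['12', '7']
print(words)       # ['plot', '12', 'elk', '7']
print(spaces)      # [' ', ' ', ' ', '\n']
print(non_digits)  # ['ab']
print(either)      # ['elk']
print(dots)        # ['a', 'b']

# "*" accepts zero repetitions, "+" needs at least one.
star = re.match(r"ab*", "a")
plus = re.match(r"ab+", "a")
print(star.group())  # a
print(plus)          # None
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the re module sees them.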
Functions
split(pattern, string) #returns a list of the pieces of string separated by pattern
search(pattern, string) #returns a match object for the first match, or None if there is no match
Examples
• import re
• string = 'the brown fox'
• re.split(r'\s+', string)     ['the', 'brown', 'fox']
• re.split('b|w', string)      ['the ', 'ro', 'n fox']
• re.search('z', string)       None
• f = re.search('o', string)   (f is a match object)
• f.start()                    6
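A minimal sketch of split and search on hypothetical data follows. It splits on \s+ rather than \s*, because on Python 3.7 and later a pattern that can match an empty string also splits between adjacent characters.

```python
import re

# Hypothetical row with uneven whitespace (several spaces and a tab).
row = "10   20 30\t40"

fields = re.split(r"\s+", row)   # split on runs of whitespace
print(fields)                    # ['10', '20', '30', '40']

# search() returns a match object, or None when nothing matches.
m = re.search(r"\d+", "Date: April 2008")
print(m.start())   # 12, the index where the match begins
print(m.group())   # 2008, the matched text

missing = re.search("z", "the brown fox")
print(missing)     # None

# Test for None before calling methods on the result.
if missing is None:
    print("no match")
```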
Exercises
>>> import re
>>> string = "10 20 30 40"
>>> re.split(r'\s+', string)
>>> re.search('a', string)
Problem
• You have 500 of the following data tables in separate text files
Rules
• The table Taxa.txt is an example of such a file
• The number of header lines is not always consistent
• All headers contain 'Study:', 'Author:', and 'Date:'
• The table always begins with Taxon_ID
• The number of columns and rows varies
• The table is space-delimited
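The rules above translate fairly directly into two patterns: one for the three header fields and one for the start of the table. The sketch below is only one possible reading. The sample lines are hypothetical (the real Taxa.txt is not reproduced here), and the names field, header, and table_start are made up for illustration.

```python
import re

# Hypothetical file contents; the real Taxa.txt may look different.
lines = [
    "Study: Elk survey\n",
    "Some free-form note\n",
    "Author: J. Doe\n",
    "Date: April 2008\n",
    "Taxon_ID Count\n",
    "101 4\n",
]

# One of the three known field names, a colon, then the value.
field = re.compile(r"(Study|Author|Date):\s*(.*)")

header = {}          # field name -> value
table_start = None   # index of the line that begins with Taxon_ID

for i, line in enumerate(lines):
    line = line.strip()
    m = field.match(line)
    if m:
        header[m.group(1)] = m.group(2)
    elif re.match(r"Taxon_ID\b", line):
        table_start = i
        break   # everything after this line belongs to the table

print(header)       # {'Study': 'Elk survey', 'Author': 'J. Doe', 'Date': 'April 2008'}
print(table_start)  # 4
```

Searching for Taxon_ID rather than counting header lines is what makes this approach tolerant of headers of varying length.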
Hints
• Break the exercise into simple tasks:
– open the file
– read a line from the file
– evaluate the line with a regular expression
– loop through the lines
– print to a file
– close the files
• More hints in taxa.py
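The tasks listed above could be strung together roughly as follows. This is a sketch under stated assumptions, not the solution in taxa.py: the slides do not say what the output format should be, so comma-delimited output is assumed here, and the function name extract_table and the sample data are hypothetical.

```python
import os
import re
import tempfile


def extract_table(in_path, out_path):
    """Copy the table part of in_path to out_path as comma-delimited text.

    Assumption: comma-delimited output is wanted; the exercise may ask
    for something else. Returns the number of lines written.
    """
    in_table = False
    rows = 0
    # "with" closes both files automatically, even if an error occurs.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:                      # loop through the lines
            line = line.strip()
            # The table starts at the first line beginning with Taxon_ID.
            if not in_table and re.match(r"Taxon_ID\b", line):
                in_table = True
            if in_table and line:
                cells = re.split(r"\s+", line)    # space-delimited input
                dst.write(",".join(cells) + "\n")
                rows += 1
    return rows


# Hypothetical input standing in for Taxa.txt.
sample = (
    "Study: Elk survey\n"
    "Author: J. Doe\n"
    "Date: April 2008\n"
    "\n"
    "Taxon_ID Count  Plot\n"
    "101 4 A\n"
    "102   7 B\n"
)

tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "Taxa.txt")
out_path = os.path.join(tmp_dir, "Taxa.csv")
with open(in_path, "w") as f:
    f.write(sample)

n = extract_table(in_path, out_path)
with open(out_path) as f:
    result = f.read()

print(n)        # 3
print(result)   # Taxon_ID,Count,Plot / 101,4,A / 102,7,B on separate lines
```

To handle all 500 files, the same function could be called once per file inside a loop over the file names.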