An Introduction to Python, Part 3: Regular Expressions for Data Formatting

Jacob Morgan, Brent Frakes

National Park Service, Fort Collins, CO

April 2008



Overview

• Regular Expressions

• Regular Expressions Module

• Formatting a dataset

• Utilize all skills learned so far!


Regular Expressions

• Regular expressions enable string manipulation, searching, and substitution

• Useful built-in string methods:
– count(sub, start=0, end=max) # returns the number of non-overlapping occurrences of the substring
– find(sub, start=0, end=max) # returns the position of the first occurrence of the substring, or -1 if it is not found
– isalnum() # returns True if all characters are letters or digits; otherwise returns False
– isdigit() # returns True if all characters are digits; otherwise returns False
– lower() # returns a lowercase copy of the string
– strip() # removes leading and trailing whitespace, including the end-of-line character (\n)


Exercises

>>> string = 'the brown fox'
>>> string.count('o')
>>> string.find('o')
>>> string.isalnum()
>>> string.isdigit()
>>> string.split('b')
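For checking your answers, here is a minimal sketch of the same calls as a Python 3 script. The variable names are only illustrative and are not part of the original exercise.

```python
# Worked version of the exercise above (Python 3).
string = 'the brown fox'

o_count = string.count('o')    # 'o' occurs once in "brown" and once in "fox"
first_o = string.find('o')     # index of the first 'o', counting from 0
all_alnum = string.isalnum()   # False: spaces are neither letters nor digits
all_digits = string.isdigit()  # False: the string contains letters
pieces = string.split('b')     # split on the literal character 'b'

print(o_count, first_o, all_alnum, all_digits, pieces)
# prints: 2 6 False False ['the ', 'rown fox']
```

Note that split() drops the separator itself, so the 'b' does not appear in either piece.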


Regular Expressions Module

• Built-in module

• Enhances basic functionality

• >>> import re


Regular Expressions - Syntax

.   matches any character except \n
*   matches zero or more repetitions of the preceding element
+   matches one or more repetitions of the preceding element
\d  matches one digit
\D  matches one non-digit
\s  matches one whitespace character
\S  matches one non-whitespace character
\w  matches one alphanumeric character (or an underscore)
\W  matches one character that is not alphanumeric or an underscore
|   alternation: matches the expression on either side
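To see a few of these symbols at work, the sketch below applies them to a made-up two-line string. It uses re.findall() and the match method group(), which are real parts of the re module but are not covered in these slides. The sample text is hypothetical.

```python
import re

# Hypothetical sample text containing a newline.
text = 'Plot 12 had 3 taxa\nPlot 7 had none'

digits = re.findall(r'\d+', text)            # runs of one or more digits
words = re.findall(r'\w+', text)             # runs of alphanumeric characters
first_line = re.search(r'.*', text).group()  # '.' does not match \n, so this stops at the line end
either = re.findall(r'taxa|none', text)      # '|' accepts either alternative

print(digits)      # ['12', '3', '7']
print(first_line)  # Plot 12 had 3 taxa
print(either)      # ['taxa', 'none']
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the re module sees them.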


Functions

split(pattern, string) # returns a list of the pieces of string separated by matches of pattern

search(pattern, string) # returns a match object for the first match of pattern in string, or None if there is no match

Examples
• import re
• string = 'the brown fox'
• re.split(r'\s+', string)   ['the', 'brown', 'fox']
• re.split('b|w', string)   ['the ', 'ro', 'n fox']
• re.search('z', string)   None
• f = re.search('o', string)   a match object, e.g. <_sre.SRE_Match object at 0x011E1790>
• f.start()   6
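The examples above can be run as a short Python 3 script. This is a sketch, and the variable names are illustrative. It splits on \s+ (one or more whitespace characters) because recent Python 3 versions also split on empty matches, so a pattern such as \s* that can match nothing would break the string apart between every character.

```python
import re

string = 'the brown fox'

words = re.split(r'\s+', string)   # split on runs of whitespace
parts = re.split('b|w', string)    # split on either 'b' or 'w'
no_match = re.search('z', string)  # None, because 'z' does not occur
match = re.search('o', string)     # match object for the first 'o'

# search() may return None, so check before calling start().
position = match.start() if match is not None else -1

print(words)     # ['the', 'brown', 'fox']
print(parts)     # ['the ', 'ro', 'n fox']
print(no_match)  # None
print(position)  # 6
```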


Exercises

>>> import re
>>> string = "10 20 30 40"
>>> re.split(r'\s+', string)
>>> re.search('a', string)
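One possible follow-up to this exercise, sketched for Python 3: once the line is split into fields, each field can be converted to a number. The conversion step and the variable names are additions for illustration and are not part of the original slides.

```python
import re

string = "10 20 30 40"

fields = re.split(r'\s+', string)               # list of strings, one per column
numbers = [int(field) for field in fields]      # convert each field to an integer
has_a = re.search('a', string) is not None      # False: search() returned None

print(fields)   # ['10', '20', '30', '40']
print(numbers)  # [10, 20, 30, 40]
print(has_a)    # False
```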


Problem

• You have 500 of the following data tables in separate text files


Desired Format

Rules

• The table Taxa.txt is an example of such a file

• The number of header lines is not consistent across files

• All headers contain 'Study:', 'Author:', and 'Date:'

• The table always begins with Taxon_ID

• The number of columns and rows varies

• The table is space-delimited
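These rules translate fairly directly into two patterns: one that recognizes the three header fields and one that recognizes the start of the table. The sketch below is only an illustration. The sample lines are made up, since the real layout of Taxa.txt is not shown here. It also uses ^, $, \b, parentheses for groups, and re.compile(), which are real re features not covered in the syntax slide.

```python
import re

# Hypothetical sample lines; the actual contents of Taxa.txt are assumed.
sample_lines = [
    'Study: Alpine survey',
    'Author: J. Doe',
    'Some free-text note',
    'Date: 04/2008',
    'Taxon_ID Count Cover',
    '101 4 0.25',
]

header_field = re.compile(r'^(Study|Author|Date):\s*(.*)$')  # field name, then its value
table_start = re.compile(r'^Taxon_ID\b')                     # first line of the table

header = {}
table_index = None
for index, line in enumerate(sample_lines):
    found = header_field.search(line)
    if found:
        header[found.group(1)] = found.group(2)  # e.g. header['Study'] = 'Alpine survey'
    elif table_start.search(line):
        table_index = index                      # everything from here on is table data
        break

print(header)
print(table_index)  # 4
```

Because the loop looks for the Taxon_ID line instead of counting header lines, it does not matter how many header lines a given file has.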


Hints

• Break the exercise into simple tasks:
– open the file
– read a line from the file
– evaluate the line with a regular expression
– loop through the lines
– print to a file
– close the files

• More hints in taxa.py
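The tasks listed above can be strung together as in the following Python 3 sketch. It is not the contents of taxa.py. The function name reformat_table, the file names, and the demonstration data are hypothetical. The output layout (header lines dropped, table written comma-delimited) is an assumption, because the desired format is not reproduced in this transcript.

```python
import os
import re
import tempfile


def reformat_table(in_path, out_path):
    """Copy the table in in_path to out_path as comma-delimited text.

    Assumption: the target format is the table alone, with commas
    between columns. Adjust this to the real desired format.
    """
    in_table = False
    rows_written = 0
    # "with" closes both files automatically when the block ends.
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:                      # loop through the lines
            line = line.strip()                  # drop the trailing \n
            if re.search(r'^Taxon_ID', line):    # the table starts here
                in_table = True
            if in_table and line:                # skip header and blank lines
                fields = re.split(r'\s+', line)  # space-delimited columns
                outfile.write(','.join(fields) + '\n')
                rows_written += 1
    return rows_written


# Demonstration on a made-up file written to a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_in = os.path.join(demo_dir, 'Taxa.txt')
demo_out = os.path.join(demo_dir, 'Taxa_formatted.txt')
with open(demo_in, 'w') as f:
    f.write('Study: Example\nAuthor: Someone\nDate: 2008\n\n'
            'Taxon_ID Count\n101  4\n102 7\n')

count = reformat_table(demo_in, demo_out)
with open(demo_out) as f:
    result = f.read()
print(result)
```

To process all 500 files, the same function could be called once per file name inside a loop.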
