regular expressions · regexp-module author: gvwilson created date: 10/21/2010 1:22:46 pm keywords...

46
More Tools Regular Expressions More Tools Copyright © Software Carpentry 2010 This work is licensed under the Creative Commons Attribution License See http://software-carpentry.org/license.html for more information.

Upload: others

Post on 17-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

More Tools

Regular Expressions

More Tools

Copyright © Software Carpentry 2010

This work is licensed under the Creative Commons Attribution License

See http://software-carpentry.org/license.html for more information.

Page 2: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Thousands of papers and theses written in LaTeX

Regular Expressions More Tools

Page 3: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Thousands of papers and theses written in LaTeX

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

Regular Expressions More Tools

Page 4: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Thousands of papers and theses written in LaTeX

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

All share a common bibliography

Regular Expressions More Tools

All share a common bibliography

Page 5: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Thousands of papers and theses written in LaTeX

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

All share a common bibliography

Regular Expressions More Tools

All share a common bibliography

Want to see how often citations appear together

Page 6: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Thousands of papers and theses written in LaTeX

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

All share a common bibliography

Regular Expressions More Tools

All share a common bibliography

Want to see how often citations appear together

First step: extract citation sets from documents

Page 7: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

Regular Expressions More Tools

Page 8: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

Multiple labels separated by commas

Regular Expressions More Tools

Multiple labels separated by commas

Page 9: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

Multiple labels separated by commas

Regular Expressions More Tools

Multiple labels separated by commas

May be white space

Page 10: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

Multiple labels separated by commas

Regular Expressions More Tools

Multiple labels separated by commas

May be white space (including line breaks)

Page 11: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Citations enclosed in \cite{…}

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.

⋮ ⋮ ⋮ ⋮ ⋮

⋮ ⋮

Multiple labels separated by commas

Regular Expressions More Tools

Multiple labels separated by commas

May be white space (including line breaks)

And multiple citations per line

Page 12: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()

Idea #1: capture everything in cite{…} in a group

('X',)

Regular Expressions More Tools

Page 13: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()

Idea #1: capture everything in cite{…} in a group

('X',)

print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X} b \\cite{Y',)

What about multiple citations?

Regular Expressions More Tools

Page 14: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()

Idea #1: capture everything in cite{…} in a group

('X',)

print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X} b \\cite{Y',)

What about multiple citations?

Regular Expressions More Tools

Page 15: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.search('cite{(.+)}', 'a \\cite{X} b').groups()

Idea #1: capture everything in cite{…} in a group

('X',)

print re.search('cite{(.+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X} b \\cite{Y',)

What about multiple citations?

Regular Expressions More Tools

Matching is greedy

Page 16: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #2: match everything inside '{}' except '}'

Regular Expressions More Tools

Page 17: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #2: match everything inside '{}' except '}'

Use '[^}]' to negate the set containing only '}'

Regular Expressions More Tools

Page 18: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #2: match everything inside '{}' except '}'

Use '[^}]' to negate the set containing only '}'

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

Regular Expressions More Tools

Page 19: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #2: match everything inside '{}' except '}'

Use '[^}]' to negate the set containing only '}'

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

Regular Expressions More Tools

Page 20: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #2: match everything inside '{}' except '}'

Use '[^}]' to negate the set containing only '}'

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

What about multiple citations?

Regular Expressions More Tools

Page 21: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #2: match everything inside '{}' except '}'

Use '[^}]' to negate the set containing only '}'

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups()

What about multiple citations?

Regular Expressions More Tools

print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X',)

Page 22: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #2: match everything inside '{}' except '}'

Use '[^}]' to negate the set containing only '}'

print re.search('cite{([^}]+)}', 'a \\cite{X} b').groups()

('X',)

print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups()

What about multiple citations?

Regular Expressions More Tools

print re.search('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c').groups()

('X',)

Need to extract all matches, not just the first

Page 23: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #3: use re.findall instead of re.search

Regular Expressions More Tools

Page 24: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her knowledge

of her language's libraries."of her language's libraries."

Regular Expressions More Tools

Page 25: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her knowledge

of her language's libraries."

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')

['X', 'Y']

of her language's libraries."

Regular Expressions More Tools

Page 26: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her knowledge

of her language's libraries."

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')

['X', 'Y']

of her language's libraries."

Regular Expressions More Tools

Page 27: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her knowledge

of her language's libraries."

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')

['X', 'Y']

of her language's libraries."

What about spaces?

Regular Expressions More Tools

Page 28: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Idea #3: use re.findall instead of re.search

"A programmer is only as good as her knowledge

of her language's libraries."

print re.findall('cite{([^}]+)}', 'a \\cite{X} b \\cite{Y} c')

['X', 'Y']

of her language's libraries."

What about spaces?

Regular Expressions More Tools

print re.search('cite{([^}]+)}', 'a \\cite{ X} b \\cite{Y } c').groups()

[' X', 'Y ']

Page 29: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Could tidy this up after matching using string.strip()

Regular Expressions More Tools

Page 30: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

Could tidy this up after matching using string.strip()

Let's modify the pattern instead

Regular Expressions More Tools

Page 31: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

Could tidy this up after matching using string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Regular Expressions More Tools

Page 32: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

Could tidy this up after matching using string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Regular Expressions More Tools

Page 33: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

Could tidy this up after matching using string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Still capturing the space after 'Y'

Regular Expressions More Tools

Page 34: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

Could tidy this up after matching using string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Still capturing the space after 'Y'

Match the word-to-nonword transition as well

Regular Expressions More Tools

Page 35: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

Could tidy this up after matching using string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Still capturing the space after 'Y'

Match the word-to-nonword transition as well

Regular Expressions More Tools

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b

[' X', 'Y']

Page 36: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

Could tidy this up after matching using string.strip()

Let's modify the pattern instead

print re.findall('cite{\\s*([^}]+)\\s*}', 'a \\cite{ X} b \\cite{Y } c')

['X', 'Y ']

Still capturing the space after 'Y'

Match the word-to-nonword transition as well

Regular Expressions More Tools

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', 'a \\cite{ X} b

[' X', 'Y']

Page 37: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

What about multiple labels in a single citation?

Regular Expressions More Tools

Page 38: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ')

What about multiple labels in a single citation?

['X,Y']

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ')

['X, Y, Z']

Regular Expressions More Tools

Page 39: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ')

What about multiple labels in a single citation?

['X,Y']

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ')

['X, Y, Z']

Actually can be done, but it's very complex

Regular Expressions More Tools

Actually can be done, but it's very complex

Page 40: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X,Y} ')

What about multiple labels in a single citation?

['X,Y']

print re.findall('cite{\\s*\\b([^}]+)\\b\\s*}', '\\cite{X, Y, Z} ')

['X, Y, Z']

Actually can be done, but it's very complex

Regular Expressions More Tools

Actually can be done, but it's very complex

Use re.split() to break matches on '\\s*,\\s*'

Page 41: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

# Start with a working skeleton.

defdefdefdef get_citations(text):

'''Return the set of all citation tags found in a block of text.'''

returnreturnreturnreturn set()

ifififif __name__ == '__main__':ifififif __name__ == '__main__':

test = '''\

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.'''

Regular Expressions More Tools

stibbons2008} highlight several dangers.'''

printprintprintprint get_citations(test)

set([])

Page 42: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

import import import import re

CITE = 'cite{\\s*\\b([^}]+)\\b\\s*}'

SPLIT = '\\s*,\\s*'

defdefdefdef get_citations(text):defdefdefdef get_citations(text):

'''Return the set of all citation tags found in a block of text.'''

result = set()

match = re.findall(CITE, text)

if if if if match:

forforforfor citation inininin match:

cites = re.split(SPLIT, citation)

Regular Expressions More Tools

cites = re.split(SPLIT, citation)

forforforfor c inininin cites:

result.add(c)

returnreturnreturnreturn result

Page 43: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

import import import import re

CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}')

SPLIT = re.compile('\\s*,\\s*')

defdefdefdef get_citations(text):defdefdefdef get_citations(text):

'''Return the set of all citation tags found in a block of text.'''

result = set()

match = CITE.findall(text)

if if if if match:

forforforfor citations inininin match:

label_list = SPLIT.split(citations)

Regular Expressions More Tools

label_list = SPLIT.split(citations)

forforforfor label inininin label_list:

result.add(label)

returnreturnreturnreturn result

Page 44: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

import import import import re

CITE = re.compile('cite{\\s*\\b([^}]+)\\b\\s*}')

SPLIT = re.compile('\\s*,\\s*')

defdefdefdef get_citations(text):defdefdefdef get_citations(text):

'''Return the set of all citation tags found in a block of text.'''

result = set()

match = CITE.findall(text)

if if if if match:

forforforfor citations inininin match:

label_list = SPLIT.split(citations)

Regular Expressions More Tools

label_list = SPLIT.split(citations)

forforforfor label inininin label_list:

result.add(label)

returnreturnreturnreturn result

Page 45: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

# Now test it all out.

ifififif __name__ == '__main__':

test = '''\

Granger's work on graphs \cite{dd-gr2007,gr2009},

particularly ones obeying Snape's Inequalityparticularly ones obeying Snape's Inequality

\cite{ snape87 } (but see \cite{quirrell89}),

has opened up new lines of research. However,

studies at Unseen University \cite{stibbons2002,

stibbons2008} highlight several dangers.'''

printprintprintprint get_citations(test)

Regular Expressions More Tools

set(['gr2009', 'stibbons2002', 'dd-gr2007', 'stibbons2008',

'snape87', 'quirrell89'])

Page 46: Regular Expressions · regexp-module Author: gvwilson Created Date: 10/21/2010 1:22:46 PM Keywords ()

June 2010

created by

Greg Wilson

June 2010

Copyright © Software Carpentry 2010

This work is licensed under the Creative Commons Attribution License

See http://software-carpentry.org/license.html for more information.