Understanding Regular Expressions

Programming Historian Study Group

University of Florida, George A. Smathers Libraries

5 November 2015

Allison Jai O’Dell | [email protected]


Use regexes for pattern matching

“A regular expression (regex or regexp for short) is a special text string for describing a search pattern.

“You can think of regular expressions as wildcards on steroids.”

– http://www.regular-expressions.info/

Use regexes in searches and programs

“Regular expression processors are found in several search engines, search and replace dialogs of several word processors and text editors, and in the command lines of text processing utilities …

“Many programming languages provide regular expression capabilities, some built-in, for example Perl, JavaScript, Ruby, AWK, and Tcl, and others via a standard library, for example .NET languages, Java, Python and C++ … Most other languages offer regular expressions via a library.”

– https://en.wikipedia.org/wiki/Regular_expression

Use regexes in the humanities

“As a simple example, if we want to find a reference to a particular year, say 1877, in a document, it’s easy enough to search for that single date. But if we want to find any references to years in the latter half of the 19th century, it is impractical to search several dozen times for 1850, 1851, 1852, etc., in turn. By using regular expressions we can use a concise pattern like ‘18[5-9][0-9]’ to effectively match any year from 1850 to 1899.”

– http://programminghistorian.org/lessons/understanding-regular-expressions
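The year-range pattern from the quoted lesson can be tried out with Python's standard `re` module. This is only a rough sketch: the sample sentence is invented for illustration, and the `\b` word boundaries are an addition that is not part of the quoted pattern (they keep it from matching inside a longer run of digits).

```python
import re

# Pattern from the quoted lesson, wrapped in \b word boundaries (an added
# assumption) so it does not match inside longer numbers such as 18775.
YEAR_PATTERN = re.compile(r"\b18[5-9][0-9]\b")

# Hypothetical sample text, invented for this illustration.
sample = "Founded in 1849, rebuilt in 1877, expanded in 1899, and closed in 1901."

# findall returns every non-overlapping match as a list of strings.
years = YEAR_PATTERN.findall(sample)
print(years)  # ['1877', '1899']
```

Only 1877 and 1899 are returned: 1849 fails the `[5-9]` test on its third digit, and 1901 does not start with 18.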

Use regexes to save the day

https://xkcd.com/208/

Regexes 101

| Pipe separates alternatives: gray|grey can match "gray" or "grey"

() Parentheses define the scope and precedence of the operators: gray|grey and gr(a|e)y are equivalent patterns which both describe the set of "gray" or "grey"

? Question mark indicates zero or one occurrences of the preceding element: colou?r matches both "color" and "colour"

* Asterisk indicates zero or more occurrences of the preceding element: ab*c matches "ac", "abc", "abbc", "abbbc", etc.

+ Plus sign indicates one or more occurrences of the preceding element: ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac"
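The five operators so far can be checked against the slide's own examples. The sketch below assumes Python's `re` module; the helper name `matches` is made up for this illustration, and it uses `re.fullmatch` so that a pattern must account for the entire string.

```python
import re

def matches(pattern, text):
    """Return True when the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# | and (): both patterns accept the same two spellings.
assert matches(r"gray|grey", "gray") and matches(r"gray|grey", "grey")
assert matches(r"gr(a|e)y", "gray") and matches(r"gr(a|e)y", "grey")

# ? : zero or one "u".
assert matches(r"colou?r", "color") and matches(r"colou?r", "colour")

# * : zero or more "b", so "ac" is accepted.
assert all(matches(r"ab*c", s) for s in ["ac", "abc", "abbc", "abbbc"])

# + : one or more "b", so "ac" is rejected.
assert matches(r"ab+c", "abc") and matches(r"ab+c", "abbc")
assert not matches(r"ab+c", "ac")

print("all examples behave as described")
```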

{n} The preceding item is matched exactly n times.

{min,} The preceding item is matched min or more times.

{min,max} The preceding item is matched at least min times, but not more than max times.

. Dot represents any single character except for a line break or paragraph break: "sh.rt" returns both "shirt" and "short"

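^ Caret anchors the match to the beginning of a paragraph (in LibreOffice; most other tools treat it as the beginning of the string or line).

$ Dollar sign anchors the match to the end of a paragraph (in LibreOffice; most other tools treat it as the end of the string or line).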
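Counted repetition, the dot, and the two anchors can be sketched the same way in Python. Note the assumptions: the definitions above come from LibreOffice, where `^` and `$` refer to paragraphs, whereas Python's `re` module anchors to the whole string unless `re.MULTILINE` is given, in which case it anchors to each line. The sample strings are invented.

```python
import re

# {n}: exactly n repetitions; {min,}: at least min; {min,max}: a bounded range.
assert re.fullmatch(r"a{3}", "aaa")
assert not re.fullmatch(r"a{3}", "aa")
assert re.fullmatch(r"a{2,}", "aaaaa")
assert re.fullmatch(r"a{2,4}", "aaa")
assert not re.fullmatch(r"a{2,4}", "aaaaa")

# . : any single character except a line break, so "sh\nrt" is not matched.
words = re.findall(r"sh.rt", "a short shirt, not a sh\nrt")
print(words)  # ['short', 'shirt']

# ^ and $ : with re.MULTILINE they anchor to the start and end of each line.
text = "Dear reader\nThe end\nDear diary"  # hypothetical sample text
starts = re.findall(r"^Dear", text, flags=re.MULTILINE)
ends = re.findall(r"end$", text, flags=re.MULTILINE)
print(starts, ends)  # ['Dear', 'Dear'] ['end']
```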

[ ] Brackets:

[abc123] Represents one of the characters between the brackets.

[a-e] Represents any of the characters that are between a and e, including both start and end characters.

[a-eh-x] Represents any one character in the range a-e or in the range h-x.

[^a-s] Represents any one character that is not between a and s, including digits, spaces, and punctuation.
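The four bracket forms can be compared side by side on a single string. This is a small sketch using Python's `re.findall`, which returns each matching character in turn; the sample string is made up for the purpose.

```python
import re

sample = "bead 42 zebra"  # hypothetical sample text

# [abc123]: any one of the listed characters.
listed = re.findall(r"[abc123]", sample)

# [a-e]: any one character from a through e, inclusive.
single_range = re.findall(r"[a-e]", sample)

# [a-eh-x]: any one character from either range (note "r" now matches).
two_ranges = re.findall(r"[a-eh-x]", sample)

# [^a-s]: any one character outside a-s, which includes spaces and digits.
negated = re.findall(r"[^a-s]", sample)

print(listed)        # ['b', 'a', '2', 'b', 'a']
print(single_range)  # ['b', 'e', 'a', 'd', 'e', 'b', 'a']
print(two_ranges)    # ['b', 'e', 'a', 'd', 'e', 'b', 'r', 'a']
print(negated)       # [' ', '4', '2', ' ', 'z']
```

The negated class is the one that most often surprises: it matches anything not listed, not just other letters.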

LibreOffice List of Regular Expressions

https://help.libreoffice.org/Common/List_of_Regular_Expressions

#inspo

http://allisonjai.com/iPhone/gleason_vis.html

Library catalog subject data Regular expressions Magic

Programming Historian Lesson

http://programminghistorian.org/lessons/understanding-regular-expressions