Regular expressions 2 Day 7 - 9/10/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University


Regular expressions 2

Day 7 - 9/10/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

10-Sept-2014, NLP, Prof. Howard, Tulane University

http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/


Regular expressions

Review


Regular expressions

re.findall(' be ', S)

Regex meta-characters:

| () [], [^] {} .

` to | be | it | as `

` (to|be|it|as) `

` ([a-z][a-z]) `

` ([a-z]{2}) ` ` (..) ` ` (.{2}) `
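As a refresher, the patterns on this slide can be run side by side. This is only a sketch: `sample` is a made-up string standing in for the course's `S`, which is not reproduced on this slide, and the variable names are illustrative.

```python
import re

# Hypothetical sample text (an assumption; the course's S is not shown on this slide).
sample = 'try to sing it now at home as many 19 year olds'

# Without parentheses, each alternative carries its own pair of spaces,
# and findall returns the whole match, spaces included.
bare = re.findall(' to | be | it | as ', sample)

# With a group, findall returns only what the parentheses captured.
grouped = re.findall(' (to|be|it|as) ', sample)

# [a-z][a-z] and [a-z]{2} are two spellings of "two lowercase letters".
two_letters_a = re.findall(' ([a-z][a-z]) ', sample)
two_letters_b = re.findall(' ([a-z]{2}) ', sample)

# .. and .{2} are two spellings of "any two characters".
two_any_a = re.findall(' (..) ', sample)
two_any_b = re.findall(' (.{2}) ', sample)

print(bare)
print(grouped)
print(two_letters_a == two_letters_b, two_letters_a)
print(two_any_a == two_any_b, two_any_a)
```

The sample was chosen so that no two short words stand next to each other. Matches cannot overlap, and each match here consumes the space on both sides, so in a string like 'to be' the second word would be skipped.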


Open Spyder


import re


4.2.7. Will the best regex please stand up?

§4. Regular expressions 2


4.2.7.1. Under-fitting vs. over-fitting

This challenge of finding the regular expression that is just right may remind you of the story of Goldilocks and the three bears, in which Goldilocks tried to find the bowl of porridge that was neither too hot nor too cold.

Statisticians have their own version of Goldilocks, which evaluates how well a statistical analysis fits the data that it is applied to. An analysis that over-fits the data is too specific, in that it excludes data points from a larger data set that should be included.

Conversely, an analysis that under-fits the data is too general, in that it includes data points from a larger data set that should be excluded. In our example, the first two regular expressions over-fit the data set (‘at’ should be included), while the last two under-fit it (‘19’ should be excluded).
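One way to see over-fitting and under-fitting concretely is to compare what each pattern finds against the set of items we actually want. The sketch below assumes a made-up string `sample`, a made-up `target` set, and a hypothetical helper `fit_report`; none of these come from the course materials.

```python
import re

# Hypothetical sample text and target set (assumptions for illustration).
sample = 'try to sing it now at home as many 19 year olds'
target = {'to', 'it', 'at', 'as'}   # the two-letter words we want to find

def fit_report(pattern, text, wanted):
    """Return (missed, extra): wanted items not found, and found items not wanted."""
    found = set(re.findall(pattern, text))
    return wanted - found, found - wanted

over = fit_report(' (to|be|it|as) ', sample, target)    # too specific
just_right = fit_report(' ([a-z]{2}) ', sample, target)
under = fit_report(' (.{2}) ', sample, target)          # too general

print('over-fit   missed, extra:', over)
print('just right missed, extra:', just_right)
print('under-fit  missed, extra:', under)
```

An over-fitting pattern shows up as a non-empty `missed` set, an under-fitting one as a non-empty `extra` set, and the pattern that is just right leaves both empty.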


4.2.7.2. False positives and false negatives

Statistical test theory provides an alternative way of conceptualizing the problem, which I unfortunately can’t figure out how to tie in to Goldilocks.

Though it is usually illustrated in terms of medical tests, I believe that explaining it in terms of legal ‘tests’ is easier to understand.


A trial

Imagine that a person is charged with a crime and goes through a trial. If she is guilty and the verdict is guilty, the trial has produced a true positive data point: a guilty person is found guilty.

Conversely, if she is innocent and the verdict is not guilty, the trial has produced a true negative data point: an innocent person is found not guilty.

We expect an accurate test to produce only true positives and true negatives, but there are two more logical possibilities that keep a real test from being perfectly accurate.

One is for an innocent person to be found guilty. This is called a false positive data point, because the accused should have failed the test but instead passed it.

Alternatively, if a guilty person is found not guilty, the legal test has produced a false negative data point, because the accused should have passed the test but instead failed it.


Four outcomes of a trial

            true                         false
positive    guilty found guilty          innocent found guilty
negative    innocent found not guilty    guilty found not guilty


4.2.7.3. Summary of the two sorts of regex evaluation

            true                                              false
positive    ‘to’ accepted by [a-z]{2}: good fit               ‘19’ accepted by .{2}: under-fit
negative    ‘the’ rejected by [a-z]{2}: correctly excluded    ‘at’ rejected by (?:to|be|it|as): over-fit
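The four outcomes can also be checked mechanically, one token at a time. The sketch below is an illustration under stated assumptions: `outcome` is a hypothetical helper, and the `should_match` flag encodes the goal of accepting two-letter lowercase words and nothing else. It uses `re.fullmatch` so that the whole token has to fit the pattern.

```python
import re

def outcome(pattern, token, should_match):
    """Classify one token as a true/false positive/negative for a pattern."""
    matched = re.fullmatch(pattern, token) is not None
    if matched:
        return 'true positive' if should_match else 'false positive'
    return 'false negative' if should_match else 'true negative'

# Goal (an assumption for this sketch): accept two-letter lowercase words only.
print(outcome('[a-z]{2}', 'to', True))          # accepted, and should be
print(outcome('[a-z]{2}', 'the', False))        # rejected, and should be
print(outcome('.{2}', '19', False))             # accepted, but should not be
print(outcome('(?:to|be|it|as)', 'at', True))   # rejected, but should not be
```

A false positive is the signature of a pattern that is too general, and a false negative is the signature of one that is too specific.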


4.2.8. More on ranges and negation

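>>> S2 = 'otolaryngologist'

English only has five letters for vowels, so it would be easy enough to list them in a disjunction:

>>> re.findall('a|e|i|o|u', S2)
['o', 'o', 'a', 'o', 'o', 'i']

A character class in square brackets does the same job more compactly:

>>> re.findall('[aeiou]', S2)
['o', 'o', 'a', 'o', 'o', 'i']

A caret as the first character inside the brackets negates the class, so this finds every character that is not a vowel letter:

>>> re.findall('[^aeiou]', S2)
['t', 'l', 'r', 'y', 'n', 'g', 'l', 'g', 's', 't']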
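A negated class matches anything outside the set, not just consonants, which matters as soon as the text contains capitals, digits or spaces. The sketch below assumes a made-up string `mixed`; the consonant class built from ranges is one possible spelling, not necessarily the one used in the course.

```python
import re

word = 'otolaryngologist'
vowels = re.findall('[aeiou]', word)
non_vowels = re.findall('[^aeiou]', word)

# Every character is either in the class or in its negation.
print(len(vowels) + len(non_vowels) == len(word))

# Hypothetical mixed string (an assumption for illustration).
mixed = 'Room 42b'

# The negated class also picks up the capital, the space and the digits.
everything_else = re.findall('[^aeiou]', mixed)

# Ranges that skip the five vowel letters match lowercase consonants only.
consonants = re.findall('[b-df-hj-np-tv-z]', mixed)

print(everything_else)
print(consonants)
```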


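4.2.9. A range of repetition with {}

character{minimum,maximum}

Write the bounds without a space after the comma; with a space, Python treats the braces as literal characters rather than as a repetition range.

>>> S3 = 'bookkeeper'
>>> S4 = 'goddessship'
>>> re.findall('[aeiou]{2}', S3)
['oo', 'ee']
>>> re.findall('[^aeiou]{3}', S4)
['sss']
>>> re.findall('[^aeiou]{2,3}', S4)
['dd', 'sss']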
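The different forms of the braces can be compared on the same word. This is a small sketch with illustrative variable names; the open-ended form `{2,}` is not on the slide and is included only for contrast.

```python
import re

word = 'goddessship'

exact = re.findall('[^aeiou]{3}', word)        # exactly three non-vowels
bounded = re.findall('[^aeiou]{2,3}', word)    # two or three, as many as possible
open_ended = re.findall('[^aeiou]{2,}', word)  # two or more, no upper limit

# With a space after the comma the braces are no longer a repetition range:
# the pattern now looks for the literal text '{2, 3}' after one non-vowel.
spaced = re.findall('[^aeiou]{2, 3}', word)

print(exact)
print(bounded)
print(open_ended)
print(spaced)
```

The bounded form stops at three characters, so the 'h' that follows 'sss' is left over, while the open-ended form takes it as well.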


4.2.10. Match the beginning or end of a string with ^ and $

>>> re.findall('^.|.$', S)
['T', '.']
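The anchors combine with the fixed-length tools from the earlier sections. The sketch below assumes a made-up string `line`, since the course's `S` is defined elsewhere and is not reproduced on this slide.

```python
import re

# Hypothetical string (an assumption; the course's S is not shown here).
line = 'Regex is fun.'

first_and_last = re.findall('^.|.$', line)   # first character or last character
first_five = re.findall('^.{5}', line)       # five characters from the start
last_four = re.findall('.{4}$', line)        # four characters up to the end

# Outside brackets ^ anchors to the start; inside brackets it negates the class.
starts_with_non_vowel = re.findall('^[^aeiou]', line)

print(first_and_last)
print(first_five)
print(last_four)
print(starts_with_non_vowel)
```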


http://www.tulane.edu/~howard/CompCultEN/regex.html#further-practice-of-fixed-length-matching

4.3. Variable-length matching

Next time
