
Regular expressions 2

Day 7 - 9/10/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University

Course organization

10-Sept-2014

NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/

Regular expressions

Review


Regular expressions

re.findall(' be ', S)

Regex meta-characters:

| () [], [^] {} .

` to | be | it | as `

` (to|be|it|as) `

` ([a-z][a-z]) `

` ([a-z]{2}) ` ` (..) ` ` (.{2}) `
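The equivalences on this slide can be checked directly. The following is a minimal sketch, assuming a made-up sample string (the string S used in class is not reproduced here); the variable names are likewise only illustrative.

```python
import re

# Hypothetical sample text; the S used in class is not shown on the slides.
sample = 'I want to sleep and be quiet as the 19 mice at home.'

# Two ways to ask for two lowercase letters between spaces.
pair_spelled_out = re.findall(' ([a-z][a-z]) ', sample)
pair_with_braces = re.findall(' ([a-z]{2}) ', sample)

# Two ways to ask for any two characters between spaces.
any_spelled_out = re.findall(' (..) ', sample)
any_with_braces = re.findall(' (.{2}) ', sample)

print(pair_spelled_out == pair_with_braces)  # the two notations agree
print(any_spelled_out == any_with_braces)    # likewise for the dot
print(pair_with_braces)
print(any_with_braces)
```

Because the parentheses form a group, re.findall() returns only the two characters inside them, without the surrounding spaces.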



Open Spyder



import re

4.2.7. Will the best regex please stand up?

§4. Regular expressions 2



4.2.7.1. Under-fitting vs. over-fitting

This challenge of finding the regular expression that is just right may remind you of the story of Goldilocks and the three bears, in which Goldilocks tried to find the bowl of porridge that was neither too hot nor too cold.

Statisticians have their own version of Goldilocks, which evaluates how well a statistical analysis fits the data that it is applied to. An analysis that over-fits the data is too specific, in that it excludes data points from a larger data set that should be included.

Conversely, an analysis that under-fits the data is too general, in that it includes data points from a larger data set that should be excluded. In our example, the first two regular expressions over-fit the data set (at should be included), while the last two under-fit it (19 should be excluded).
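One way to see over-fitting and under-fitting is to compare what a pattern finds with what we hoped to find. This is only a sketch: the sample string, the wanted set and the helper fit_report() are assumptions made up for illustration, not part of the course materials.

```python
import re

# Hypothetical sample and target list, chosen only for illustration.
sample = 'I want to sleep and be quiet as the 19 mice at home.'
wanted = {'to', 'be', 'as', 'at'}   # the two-letter words we hope to find

def fit_report(pattern):
    """Return (missed, spurious) for one pattern against the wanted set."""
    found = set(re.findall(pattern, sample))
    missed = wanted - found      # should be included but is not: over-fit
    spurious = found - wanted    # should be excluded but is not: under-fit
    return missed, spurious

too_specific = fit_report(' (to|be|it|as) ')
too_general = fit_report(' (.{2}) ')
in_between = fit_report(' ([a-z]{2}) ')

print(too_specific)
print(too_general)
print(in_between)
```

On this sample the disjunction misses 'at', the dot pattern lets '19' through, and the letter range does neither. A different sample could of course expose problems with the letter range as well.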



4.2.7.2. False positives and false negatives

Statistical test theory provides an alternative way of conceptualizing the problem, which I unfortunately can’t figure out how to tie in to Goldilocks.

Though it is usually illustrated in terms of medical tests, I believe that explaining it in terms of legal ‘tests’ is easier to understand.



A trial

Imagine that a person is charged with a crime and goes through a trial. If she is guilty and the verdict is guilty, the trial has produced a true positive data point: a guilty person is found guilty.

Conversely, if she is innocent and the verdict is not guilty, the trial has produced a true negative data point: a not-guilty person is found not guilty.

We expect an accurate test to produce only true positives and true negatives, but there are two more logical possibilities, which make a test less than fully accurate.

One is for an innocent person to be found guilty. This is called a false positive data point, because the accused should have failed the test but instead passed it. Alternatively, if a guilty person is found not guilty, the legal test has produced a false negative data point, because the accused should have passed the test but instead failed it.



Four outcomes of a trial

            true                         false

positive    guilty found guilty          innocent found guilty

negative    innocent found not guilty    guilty found not guilty



4.2.7.3. Summary of the two sorts of regex evaluation

true positive: evaluation of ‘to’ by [a-z]{2} results in good fit

false positive: evaluation of ‘19’ by .{2} results in under-fit

true negative: evaluation of ‘the’ by [a-z]{2} results in bad fit

false negative: evaluation of ‘at’ by (?:to|be|it|as) results in over-fit
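The four outcomes can also be worked out mechanically, following the definitions from the trial example: a positive is a match, and an outcome is true when the match agrees with what should have happened. The sketch below assumes that only two-letter words should match; the helper classify() and the labels in cases are made up for illustration.

```python
import re

def classify(token, pattern, should_match):
    """Label one token as a true/false positive/negative for a pattern."""
    matched = re.fullmatch(pattern, token) is not None
    truth = 'true' if matched == should_match else 'false'
    sign = 'positive' if matched else 'negative'
    return truth + ' ' + sign

# Hypothetical gold labels: only two-letter words should match.
cases = [
    ('to', '[a-z]{2}', True),
    ('19', '.{2}', False),
    ('the', '[a-z]{2}', False),
    ('at', '(?:to|be|it|as)', True),
]

labels = [classify(token, pattern, should) for token, pattern, should in cases]
for (token, pattern, _), label in zip(cases, labels):
    print(token, pattern, label)
```

re.fullmatch() is used here instead of re.findall() so that each token is tested as a whole, without needing the surrounding spaces.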



4.2.8. More on ranges and negation

>>> S2 = 'otolaryngologist'

English only has five letters for vowels, so it would be easy enough to list them in a disjunction:

>>> re.findall('a|e|i|o|u', S2)

['o', 'o', 'a', 'o', 'o', 'i']

>>> re.findall('[aeiou]', S2)

['o', 'o', 'a', 'o', 'o', 'i']

>>> re.findall('[^aeiou]', S2)

['t', 'l', 'r', 'y', 'n', 'g', 'l', 'g', 's', 't']
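Note that a negated class such as [^aeiou] matches anything that is not a vowel letter, which includes spaces, digits and punctuation. The sketch below contrasts it with a class built from ranges of consonant letters; the test string and variable names are made up for illustration.

```python
import re

# Hypothetical string with spaces, a digit and punctuation added.
text = 'otolaryngologist 4 u!'

# Negation: every character that is not one of a, e, i, o, u.
non_vowels = re.findall('[^aeiou]', text)

# Ranges: only the lowercase consonant letters, skipping each vowel.
consonants = re.findall('[b-df-hj-np-tv-z]', text)

print(non_vowels)
print(consonants)
```

The first list ends with the two spaces, the digit and the exclamation mark, while the second contains letters only.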



4.2.9. A range of repetition with {}

character{minimum, maximum}

>>> S3 = 'bookkeeper'

>>> S4 = 'goddessship'

>>> re.findall('[aeiou]{2}', S3)

['oo', 'ee']

>>> re.findall('[^aeiou]{3}', S4)

['sss']

>>> re.findall('[^aeiou]{2,3}', S4)

['dd', 'sss']
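The consonant run in 'goddessship' is actually four characters long ('sssh'), so it is a handy test of how the minimum and maximum interact. A small sketch, with variable names chosen only for illustration:

```python
import re

S4 = 'goddessship'

# Exactly two: the run 'sssh' is cut into two pairs.
exactly_two = re.findall('[^aeiou]{2}', S4)

# Two to three: the longest allowed stretch is taken, and 'h' is left over.
two_to_three = re.findall('[^aeiou]{2,3}', S4)

# Two to four: now the whole run fits in one match.
two_to_four = re.findall('[^aeiou]{2,4}', S4)

print(exactly_two)   # ['dd', 'ss', 'sh']
print(two_to_three)  # ['dd', 'sss']
print(two_to_four)   # ['dd', 'sssh']
```

In each case the match takes as many characters as the maximum allows before moving on, and matches never overlap.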



4.2.10. Match the beginning or end of a string with ^ and $

>>> re.findall('^.|.$', S)

['T', '.']
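The caret does double duty: outside square brackets it anchors the match to the start of the string, while as the first character inside them it negates the class. A brief sketch, assuming a made-up string that, like S, begins with 'T' and ends with a period:

```python
import re

# Hypothetical text; it only needs to start with 'T' and end with '.'.
text = 'This is a short text.'

first_char = re.findall('^.', text)      # one character at the start
last_char = re.findall('.$', text)       # one character at the end
both_ends = re.findall('^.|.$', text)    # either of the two

# Anchor followed by a negated class: a non-vowel at the very start.
first_non_vowel = re.findall('^[^aeiou]', text)

print(first_char, last_char, both_ends, first_non_vowel)
```

Here the first ^ in '^[^aeiou]' is the anchor and the second one is the negation.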



Next time

http://www.tulane.edu/~howard/CompCultEN/regex.html#further-practice-of-fixed-length-matching

4.3. Variable-length matching


