LING/C SC/PSYC 438/538 Lecture 15 Sandiway Fong

Post on 24-Dec-2015

Administrivia

• Homework 4 review

• Homework 5 out today
– due next Monday by midnight
– one PDF file

Homework 4 Review

• Task:
– Treat everything that begins with a capital letter as a possible named entity.
  • What's the problem with this simple approach?
– By inspection, design regex(s) to label PER, ORG, LOC and MISC named entities.
– e.g.
  • /in ([A-Z][a-z]+)/ $1 => LOC
  • /Mr\. ([A-Z][a-z]+ ([A-Z]\.|) ([A-Z][a-z]+|))/ $1 => PER
  • /chairman of ([A-Z][a-z]+ ([A-Z][a-z]+)*)/ $1 => ORG
  • /([A-Z][a-z]+ ([A-Z][a-z]+)*) (Inc|Corp)/ $1 => ORG
  • etc.
  • all others as MISC
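A minimal sketch of how the example patterns above might be applied, assuming they are tried one after another on a line of text. The helper names `label_entities` and `capitalized_candidates` and both sample sentences are invented for illustration; they are not part of the homework data or solution.

```perl
use strict;
use warnings;

# The example patterns from the slide, each paired with the label it assigns.
my @rules = (
    [ LOC => qr/in ([A-Z][a-z]+)/ ],
    [ PER => qr/Mr\. ([A-Z][a-z]+ ([A-Z]\.|) ([A-Z][a-z]+|))/ ],
    [ ORG => qr/chairman of ([A-Z][a-z]+ ([A-Z][a-z]+)*)/ ],
    [ ORG => qr/([A-Z][a-z]+ ([A-Z][a-z]+)*) (Inc|Corp)/ ],
);

# Hypothetical helper: return "LABEL:text" for every match of every rule.
sub label_entities {
    my ($text) = @_;
    my @found;
    for my $rule (@rules) {
        my ($label, $re) = @$rule;
        while ($text =~ /$re/g) {
            push @found, "$label:$1";
        }
    }
    return @found;
}

# The naive candidate finder: any word starting with a capital letter.
sub capitalized_candidates {
    my ($text) = @_;
    my @words = $text =~ /\b([A-Z][a-z]+)\b/g;
    return @words;
}

# Invented sentences, not taken from the homework data.
print "$_\n" for label_entities(
    'Mr. Pierre A. Vinken is chairman of Elsevier Group in Amsterdam. '
  . 'Cray Research Inc said so.');
print join(' ', capitalized_candidates('The company is based in Ohio.')), "\n";
```

The last line hints at the problem with the capital-letter approach: a sentence-initial word such as "The" is picked up as a candidate alongside "Ohio".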

Homework 4 Review

• Baseline: the Illinois NER system: … not ground truth

Homework 4 Review

• Preamble:
    # define hashes
    %per = ();
    %org = ();
    %loc = ();
    %misc = ();

• Subroutine:
    # take out end white space
    sub trim {
        $name = shift @_;
        $name =~ s/^\s+//;
        $name =~ s/\s+$//;
        return $name
    }

    open($fh, $ARGV[0]) or die "$ARGV[0] not found!\n";
    $intext = 0;
    while ($line = <$fh>) {
        $line =~ s/^\s+//;    # trim left
        if ($line =~ /<TEXT>/) { $intext = 1 }
        elsif ($line =~ m!^</TEXT>!) { $intext = 0 }
        elsif ($intext) {
            # Body of code goes here!
        }
    }

Homework 4 Review

For each of PER, ORG, LOC, define a basic format.
Example:
• # ABC ((D.) EFG) (Jr./Sr./III)
• $name_format = '([A-Z][a-z]*(\s+[A-Z](\.|[a-z]*))*(\s[JS]r\.|III)?)';

Define an array of regexes for easy testing.
Example:
    @per = (
        # Mr./Dr. name_format
        qr/[MD]r\. $name_format/,
        # Judge name_format
        qr/Judge $name_format/,
        # name_format, XX (years old)
        qr/$name_format,\s+\d+( years old|,)/,
        # ABC D. EFG (Jr./Sr./III)
        qr/([A-Z][a-z]*\s[A-Z]\.\s[A-Z][a-z]*(\s[JS]r\.|III)?)/,
        # $name_format ending in Jr./Sr./III
        qr/([A-Z][a-z]*(\s+[A-Z](\.|[a-z]*))*(\s[JS]r\.|III))/,
        # $name_format, editor/(vice) president/chairman
        qr/$name_format, (editor|(vice |)president|chairman)/
    );

    foreach $re (@per) {
        while ($line =~ /$re/g) {
            $name = trim($1);
            $per{$name}++;
        }
    }

    print "PER:\n";
    foreach $name (keys %per) {
        print "$name:$per{$name}\n"
    }

qr/STRING/ compiles a regex out of STRING
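A small sketch of what qr// buys here, assuming the name format is kept in a single-quoted string and interpolated into each pattern. The sub `find_names` and the sample sentence are invented for illustration; only `$name_format` and the two patterns come from the slide.

```perl
use strict;
use warnings;

# Basic name format copied from the slide (single quotes keep the backslashes).
my $name_format = '([A-Z][a-z]*(\s+[A-Z](\.|[a-z]*))*(\s[JS]r\.|III)?)';

# qr// interpolates $name_format and returns a compiled regex object,
# so the patterns can be stored in an array and tried in turn.
my @per = (
    qr/[MD]r\. $name_format/,
    qr/Judge $name_format/,
);

# Hypothetical helper: collect $1 (the whole name) for every match.
sub find_names {
    my ($line) = @_;
    my @names;
    for my $re (@per) {
        while ($line =~ /$re/g) {
            push @names, $1;
        }
    }
    return @names;
}

# Invented sample sentence.
print "$_\n" for find_names('Judge Curry met Dr. Brooke T. Mossman.');
```

Because the outermost parentheses of `$name_format` open first, `$1` is always the full name, whichever pattern matched.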

Homework 4 Review

PER:
1. Barnum:1
2. Rudolph Agnew:1
3. Gary P. Smaby:1
4. Ross:1
5. Terrence D. Daniels:1
6. Spoon:2
7. Darrell Phillips:1
8. Wilbur Ross Jr.:1
9. Daniel M. Rexinger:1
10. Neil Davenport:1
11. Talcott:3
12. Cray:7
13. William H. Hudnut:1
14. Curry:6
15. Vitulli:2
16. Douglas R. Wheeland:2
17. Mortimer B. Zuckerman:1
18. Mossman:1
19. Clark J. Vitulli:1
20. Rowe:1
21. Joseph M. Blanchard:2
22. Pierre Vinken:1
23. Frederick Deane Jr.:2
24. John Rowe:1
25. Vinken:1
26. Brooke T. Mossman:1
27. Brenda Malizia Negus:1
28. Norman Ricken:1
29. James A. Talcott:1
30. Robert R. Glauber:1
31. Richard Curry:1
32. Malcolm A. Hammerton:2

PER: 32

Illinois NER

PER:
1. Byron:2
2. Mr. Vitulli:1
3. Donoghue:1
4. William H. Hudnut III:1
5. Barnum:2
6. Rudolph Agnew:1
7. Gary P. Smaby:1
8. Edison:1
9. Seymour:3
10. Terrence D. Daniels:1
11. Darrell Phillips:1
12. Daniel M. Rexinger:1
13. Alan Spoon:1
14. Neil Davenport:1
15. Talcott:2
16. Mr. Ross:1
17. Victor Borge:1
18. Curry:6
19. Douglas R. Wheeland:1
20. Wilbur Ross:1
21. Vose:1
22. J.P. Bolduc:1
23. Wilcox:1
24. Braidwood:3
25. Du Pont:1
26. J. Vitulli:1
27. Mortimer B. Zuckerman:1
28. Lorillard:1
29. Messrs.
30. Cray:1
31. Rowe:1
32. Joseph M. Blanchard:1
33. Clark:1
34. Pierre Vinken:1
35. Frederick Deane Jr.:1
36. Dr. Mossman:1
37. John Rowe:1
38. Vinken:1
39. Mr. Cray:3
40. Brooke T. Mossman:1
41. Brenda Malizia Negus:1
42. W.R. Grace:2
43. Norman Ricken:1
44. Seymour Cray:1
45. James A. Talcott:1
46. Gregory Barnum:1
47. Robert R. Glauber:1
48. Mr. Spoon:2
49. Richard Curry:1
50. Malcolm A. Hammerton:1

PER: 50

Homework 4 Review

ORG example:
• $org_format = '([A-Z](\.|[a-z]+)(\s*[A-Z](\.|[a-z]+))*)';

ORG regex array:
    @org = (
        # Co./Inc./Corp./Ltd./N.V./PLC/S.p.A.
        qr/$org_format ((Co|Inc|Corp|Ltd|N\.V|S\.p\.A)\.|PLC)/,
        # Institute/Commission/Edison
        qr/([A-Z](\.|[a-z]+)(\s+(and|of|[A-Z](\.|[a-z]+)))*\s(Institute|Commission|Edison))/,
        # SEC
        qr/(?!Co|Inc|Corp|Ltd|PLC)([A-Z]{2,})/,
        # XX &/of YY
        qr/($org_format( (&amp;|of) $org_format)+)/,
        # $org_format, a unit of
        qr/$org_format, a unit of/,
        # $org_format (vice )president/chairman
        qr/$org_format (vice |)(president|chairman)/
    );

    foreach $re (@org) {
        while ($line =~ /$re/g) {
            $name = trim($1);
            $org{$name}++;
        }
    }

    print "ORG:\n";
    foreach $name (keys %org) {
        print "$name:$org{$name}\n"
    }

Homework 4 Review

ORG:
1. Hitachi:1
2. Federal Energy Regulatory Commission:1
3. Gary P. Smaby of Smaby Group Inc:1
4. Consolidated Gold Fields:1
5. NEC:1
6. Royal Trustco:1
7. Public Service:1
8. Time Warner:1
9. LC:1
10. Signet Banking:1
11. Commerce Commission:1
12. Loews:1
13. U.S. News &amp; World Report:1
14. Vose:1
15. Texas Instruments Japan:1
16. Washington Post:1
17. International Business Machines:1
18. Illinois Commerce Commission:1
19. Newsweek:2
20. Smaby Group:1
21. Cray Research:1
22. Pacific First Financial:1
23. National Cancer Institute:1
24. W.R. Grace &amp; Co:1
25. SEC:1
26. United Illuminating:1
27. Lorillard:1
28. James A. Talcott of Boston:1
29. III:1
30. Us:1
31. FERC:1
32. The National Association of Manufacturers:1
33. Finmeccanica:1
34. University of Vermont College of Medicine:1
35. McDermott International:1
36. W.R. Grace:1
37. Commonwealth Edison:14
38. Chrysler:1
39. Audit Bureau of Circulations:1
40. IBC:1
41. Elsevier:1
42. CEO:1
43. Hollingsworth &amp; Vose Co:1
44. Rothschild:1
45. Farber Cancer Institute:1
46. Texas Instruments:1
47. Hollingsworth &amp; Vose:1
48. Securities and Exchange Commission:1
49. New England Journal of Medicine:1
50. Babcock &amp; Wilcox:1
51. Mazda Motor:1
52. Fujitsu:1
53. Cray Computer:1
54. PS:7

Missing: New York Stock Exchange

cf. LOC Newsweek:1

ORG: 54

Illinois NER

ORG:
1. Dana-Farber Cancer Institute:1
2. IBC/Donoghue's Money Fund Report:1
3. Edison:1
4. Pacific First Financial Corp.:1
5. Texas Instruments Inc.:1
6. U.S. Treasury:1
7. Signet Banking Corp.:1
8. PS of New Hampshire:3
9. Indianapolis Motor Speedway:1
10. Finmeccanica S.p.A. for $295 million:1
11. Mazda Motor Corp:1
12. Money Fund Report:1
13. Commonwealth Edison Co.:1
14. Valley Queen Cheese Factory:1
15. Commerce Commission:1
16. Treasury:3
17. Bailey Controls Operations:1
18. New York Stock Exchange:2
19. Cray Research:11
20. Fujitsu Ltd:1
21. SEC:1
22. Mazda:1
23. Texas Instruments Japan Ltd.:1
24. Us Inc.:1
25. Circulation Credit Plan:1
26. Chrysler for:1
27. Vose Co.:1
28. Finmeccanica:1
29. University of Vermont College of Medicine:1
30. Smaby Group Inc.:1
31. United Illuminating Co. and Northeast Utilities:1
32. Boston University:1
33. Trojan Steel:1
34. Chrysler:1
35. Harvard University:1
36. U.S. News &amp:1
37. Boca:1
38. National Association of Manufacturers:1
39. McDermott International Inc.:1
40. Circuit City:1
41. Consolidated Gold Fields PLC:1
42. Lorillard Inc.:1
43. Senate:1
44. Trade and Industry Ministry:1
45. Illinois Supreme Court:1
46. Federal Energy Regulatory Commission:1
47. NEC Corp.:1
48. Publishers Information Bureau:1
49. W.R. Grace &amp; Co.:1
50. Cray Research is:1
51. World Report:1
52. Public Service Co. of New Hampshire:1
53. Babcock &amp:1
54. Elsevier N.V.:1
55. U.S. News:2
56. Washington Post Co.:1
57. Cray Computer Corp.:1
58. Hollingsworth &amp:2
59. Illinois Commerce Commission:1
60. New England Electric System:1
61. Japan's Big Three:1
62. LaSalle I:1
63. Environmental Protection Agency:1
64. Newsweek:3
65. Northeast:1
66. National Cancer Institute:1
67. Time:5
68. United Illuminating:2
69. House:1
70. Dreyfus World-Wide Dollar:1
71. FERC:1
72. Maytag:1
73. Rothschild Inc.:1
74. Kent:5
75. Loews Corp.:1
76. Illinois Appellate Court:1
77. Indiana Roof:1
78. International Business Machines Corp.:1
79. Chrysler Corp.:1
80. Royal Trustco Ltd. of Toronto:1
81. New England Electric:4
82. Hitachi Ltd.:1
83. Indianapolis:2
84. Cray Research.:1
85. Indianapolis Symphony Orchestra:1
86. Cray Research Inc.:1
87. Bailey Controls:1
88. Time Warner Inc.:1
89. Securities and Exchange Commission:1
90. New England Journal of Medicine:1
91. Cray Computer:8
92. Japan Automobile Dealers' Association:1
93. PS:1

ORG: 93

Homework 4 Review

LOC example:
• $loc_format = '([A-Z][a-z]*(\s[A-Z][a-z]*)*)';

LOC regex array:
    @loc = (
        # -based
        qr/$loc_format-based/,
        # loc_format, XXX.
        qr/$loc_format, [A-Z][a-z]*\.,/,
        # in
        qr/\bin $loc_format/,
        # based in
        qr/\bbased in $loc_format/
    );

    foreach $re (@loc) {
        while ($line =~ /$re/g) {
            $name = trim($1);
            $loc{$name}++;
        }
    }

    print "LOC:\n";
    foreach $name (keys %loc) {
        print "$name:$loc{$name}\n"
    }

Homework 4 Review

LOC:
1. September:1
2. Time:1
3. West Groton:1
4. October:5
5. Chinchon:1
6. February:1
7. New Haven:3
8. Colorado Springs:1
9. Western:1
10. New York:1
11. May:1
12. South Korea:2
13. Wickliffe:2
14. Westborough:3
15. Hartford:2
16. Chapter:1
17. Minneapolis:1
18. Newsweek:1
19. January:1
20. New York Stock Exchange:2
21. Rockford:1
22. August:1

cf. ORG Newsweek:2

LOC: 22 – 6 (months) = 16

Illinois NER

LOC:
1. Seoul:1
2. Chinchon:1
3. Boston:1
4. Ill.:1
5. U.S.:6
6. Colorado Springs:1
7. Japan:1
8. Mass.:2
9. Commonwealth Edison:12
10. England:1
11. Hot Springs:1
12. South Korea:4
13. Wickliffe:1
14. Ohio:1
15. Westborough:1
16. Hartford:1
17. New Hampshire:1
18. Grace Energy:1
19. Conn:1
20. Manchester:1
21. Cray Computer:2
22. Minneapolis:1
23. Rockford:1
24. Newsweek:3
25. Indiana:1
26. Northeast:1

LOC: 26

Illinois NER

MISC:
1. Chapter 11:1
2. West Groton:1
3. Fortune 500:1
4. Lorillard:2
5. Japanese:1
6. New Haven:1
7. Cray-3:2
8. Western:1
9. U.S. News:1
10. Audit Bureau of Circulations:1
11. Minneapolis-based:1
12. Dutch:1
13. Micronite:1
14. Nasdaq:1
15. CEOs:1
16. South Korea:2
17. New York-based:1
18. New Hampshire:3
19. British:1
20. Hoosier:1
21. Boca Raton:1
22. Rust Belt:1
23. C-90:1
24. Cray-3's:1
25. New York Stock Exchange:1
26. Italian:1
27. Northeast:1

MISC: 27

Prolog online resources

• Useful Online Tutorial
– Learn Prolog Now!
  • Patrick Blackburn, Johan Bos & Kristina Striegnitz
  • http://www.learnprolognow.org

SWI Prolog

• When the program is loaded, grammar rules are translated into Prolog rules.

• This solves the mystery of why we have to type two arguments with the nonterminal at the command prompt.

• Recall list notation:
– [1|[2,3,4]] = [1,2,3,4]

Grammar rules:
1. s --> [a],b.
2. b --> [a],b.
3. b --> [b],c.
4. b --> [b].
5. c --> [b],c.
6. c --> [b].

Translated Prolog rules:
1. s([a|A], B) :- b(A, B).
2. b([a|A], B) :- b(A, B).
3. b([b|A], B) :- c(A, B).
4. b([b|A], A).
5. c([b|A], B) :- c(A, B).
6. c([b|A], A).
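To make the role of the two arguments concrete, here is a rough Perl analogue of the six translated clauses. It is a sketch, not how Prolog executes them: each nonterminal takes the input list (the first argument) and returns every possible remainder (the second argument). The names `terminal`, `nt_s`, `nt_b`, `nt_c` and `accepts` are invented for illustration.

```perl
use strict;
use warnings;

# Match one terminal at the front of the list. Returns the remainder as a
# one-element list of array refs, or the empty list if the terminal is absent.
sub terminal {
    my ($symbol, $input) = @_;
    return () unless @$input && $input->[0] eq $symbol;
    my @rest = @{$input}[1 .. $#{$input}];
    return (\@rest);
}

# c --> [b],c.    c --> [b].
sub nt_c {
    my ($input) = @_;
    my @results;
    for my $rest (terminal('b', $input)) {
        push @results, nt_c($rest), $rest;
    }
    return @results;
}

# b --> [a],b.    b --> [b],c.    b --> [b].
sub nt_b {
    my ($input) = @_;
    my @results;
    push @results, nt_b($_) for terminal('a', $input);
    for my $rest (terminal('b', $input)) {
        push @results, nt_c($rest), $rest;
    }
    return @results;
}

# s --> [a],b.
sub nt_s {
    my ($input) = @_;
    my @results;
    push @results, nt_b($_) for terminal('a', $input);
    return @results;
}

# Accept when some parse leaves the empty list as remainder,
# the analogue of the Prolog query s([a,a,b,b], []).
sub accepts {
    my ($string) = @_;
    my @remainders = nt_s([split //, $string]);
    return scalar(grep { @$_ == 0 } @remainders);
}

print "$_: ", (accepts($_) ? "yes" : "no"), "\n" for qw(aabb ab ba a);
```

Passing `[]` as the second argument in Prolog plays the same role as the final `grep` here: it insists that nothing is left over.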

Prolog Recursion

• Example (factorial):
– 0! = 1
– n! = n * (n-1)! for n>0

• In Prolog:
– factorial(0,1).
– factorial(N,NF) :- M is N-1, factorial(M,MF), NF is N * MF.

• Problem: infinite loop (the 2nd clause also matches N = 0, so on backtracking it recurses on -1, -2, …)
• Fix: 2nd case only applies to numbers > 0
– factorial(N,NF) :- N>0, M is N-1, factorial(M,MF), NF is N * MF.

Prolog built-in: X is <math expression>


NDFSA → (D)FSA

• Recall the set-of-states construction
• Let's write the NDFSA in Prolog directly…

[Figure: NDFSA over states s, x, y, z with arcs labeled a and b, and the DFSA built from it with states s, {x,y}, {z}, {y,z}, {y} (PowerPoint animation)]
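The set-of-states construction can be sketched in Perl as follows. The machine in `%ndfsa` is a made-up example, not the one drawn on the slide (its arcs cannot be read off the transcript), and `set_name`, `determinize` and `accepts` are hypothetical helper names.

```perl
use strict;
use warnings;

# Hypothetical NDFSA for illustration (not the machine drawn on the slide).
# state => { symbol => [possible next states] }; z is the only final state.
my %ndfsa = (
    s => { a => ['x', 'y'] },
    x => { b => ['z'] },
    y => { b => ['y'] },
    z => {},
);

# Canonical name for a set of NDFSA states, e.g. {x,y}.
sub set_name { return '{' . join(',', sort @_) . '}' }

# Set-of-states construction: expand every reachable set of NDFSA states.
sub determinize {
    my ($nfa, $start, $alphabet) = @_;
    my %dfa;                          # set name => { symbol => set name }
    my @agenda = ([$start]);
    while (my $set = shift @agenda) {
        my $from = set_name(@$set);
        next if exists $dfa{$from};   # already expanded
        $dfa{$from} = {};
        for my $symbol (@$alphabet) {
            my %next;
            for my $state (@$set) {
                $next{$_} = 1 for @{ $nfa->{$state}{$symbol} || [] };
            }
            my @to = sort keys %next;
            next unless @to;          # no arc: the DFSA rejects here
            $dfa{$from}{$symbol} = set_name(@to);
            push @agenda, \@to;
        }
    }
    return \%dfa;
}

# Run the DFSA; a set-state is final if it contains the NDFSA final state.
sub accepts {
    my ($dfa, $start, $final, $string) = @_;
    my $state = set_name($start);
    for my $symbol (split //, $string) {
        $state = $dfa->{$state}{$symbol};
        return 0 unless defined $state;
    }
    my %members = map { $_ => 1 } split /,/, substr($state, 1, -1);
    return $members{$final} ? 1 : 0;
}

my $dfa = determinize(\%ndfsa, 's', ['a', 'b']);
for my $from (sort keys %$dfa) {
    for my $symbol (sort keys %{ $dfa->{$from} }) {
        printf "%s -%s-> %s\n", $from, $symbol, $dfa->{$from}{$symbol};
    }
}
```

Only sets reachable from the start state are built, so the result is usually far smaller than the full power set of NDFSA states.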

Homework 5 Question 1

• Give the shortest regular expression for the FSA below
• Give an equivalent FSA without the ε-transition
• answer in the form of a diagram

[Figure: FSA with states s, x, y and arcs labeled a]

Homework 5 Question 2

• convert the NDFSA on the right into a DFSA
• give a diagram for the DFSA
• implement the DFSA in Perl
• run the DFSA on the strings abab, abaaba and ababb
• implement the NDFSA in Prolog
• run the same strings

[Figure: NDFSA from figure 2.27 in the textbook]

Note: predicate names can't be numbers in Prolog. I suggest you call the states: one, two, three and four, respectively


Homework 5 Question 3

• Give a FSA diagram that accepts a string of a's and b's that contains only a's except for a single substring bab or b^n (n>0)
– examples: aababaaa, bbbba, aabaa
– non-examples: aaaa, babbab, abbaabb

• Implement your machine (in Perl or Prolog). Test it.
• Give the regular expression equivalent of the FSA