comments from task #1 · pangrams, anagrams • 368 noun-verb homographs from bnc • apostrophes...


Post on 08-Jul-2020


Page 1: Comments from Task #1 · Pangrams, anagrams • 368 noun-verb homographs from BNC • Apostrophes in Pope’s Iliad translation: o’er, e’en, jav’lin, gen’rous, neighb’ring

Comments from Task #1
• Books vs. journal articles vs. conference proceedings
• Peer-reviewed vs. not
  • Blind, double-blind
• Hirsch index: metric for rating research productivity (people, publication venues, schools)
  • h = 24: 24 artifacts cited ≥ 24 times each; h5 = 10: 10 artifacts from the last 5 years cited ≥ 10 times each
• CiteSeerX:
  • Citations: literature that the paper cites
  • Co-citation: frequency with which two documents are cited together
  • Active bibliography: when two documents’ bibliographies overlap
  • Based on Solr (a Lucene derivative)
  • Source code and data are downloadable; API is also available
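The Hirsch index described above can be sketched in a few lines. This is a minimal illustration, not code from CiteSeerX or any citation service; the function name `h_index` and the citation counts are made up for the example.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` items all have >= rank citations
        else:
            break
    return h

# Hypothetical citation counts for one author's papers.
example_counts = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_counts))  # 3
```

An h5-style figure would use the same function, applied only to items published in the last five years.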

Page 2:

Sample corpus phenomena (Task #2)
• Pangrams, anagrams
• 368 noun-verb homographs from BNC
• Apostrophes in Pope’s Iliad translation: o’er, e’en, jav’lin, gen’rous, neighb’ring
• Pres. Monson’s distinctive use of passive auxiliaries
• Longest sentences in German
• Split infinitives in General Conf. talks
• Usage of anaphoric locative adverbs: whither, thither, hither, whence, thence, hence
• "Unnecessary" quotation marks in blogs
• Spoken instances of “whom” + stranded preposition (whom…with)
• Japanese verbs marking locative object with を
• Double modals
• Mutually intelligible phrases/sentences in Spanish & Portuguese
• What we’re asked (not) to do from last Gen. Conf. (in Japanese)
• Usages of 100 instances of ‘why’ from live Twitter stream
• OK vs. okay over time in COCA
• Lexical richness in most recent 200 tweets of 7 political figures
• Subjunctive “be” in English
• “-ish” suffixation on Reddit
• Coincidental "emojis" in the scriptures: …the ruler of the feast had tasted the water that was made wine, and knew not whence it was: (but the servants which drew the water knew;)
• “Yall” in tweets
• Superlatives in movie reviews on Reddit

Page 3:

A note about your final project
• First milestone (proposal) will be due in about a month
• Two possible changes to coordinate with a local company:
  • LiteralApp
  • POS-tagging of n-grams (email me for contact info)
• Caveats
  • Probably max. 2-3 students total
  • I don’t know, and hence can’t vouch for, these people/organizations or accept responsibility for their behavior, policies, working conditions, etc.
  • This is not a formal BYU-sponsored internship
  • You’ll have to work out particulars re schedule, code ownership, deliverables
  • You need my approval
  • You can’t start coding until Feb.
  • Grading will be as with any other project
• IMO it’s better if you conceive/design/spec/develop/evaluate your own project

Page 4:

CORPORA AND REGEXP’S
Finding lexical expressions

Page 5:

Corpus (pl. corpora)
• Body of language data
• Various types
  • Text/written, speech/oral, gestures
  • Images
• Annotation levels
  • Structured (aligned, annotated, treebanks, etc.)
• Spans the range of human linguistic communication

Page 6:

Purposes for corpora
• Language instruction
• Task analysis
• NLP systems development
  • Training systems
  • Testing/evaluating systems
• Knowledge source development (dictionaries, lexicons, etc.)
• Linguistic research

Page 7:

Sources for corpora
• e-text centers, digital libraries (e.g. Project Gutenberg, Oxford Text Archive)
• Clearinghouses (e.g. LDC, NIST, ELRA)
• Specialized collections of corpora (e.g. SIG’s)
• Specially deployed corpora (e.g. BYU)
• The web itself
• Your own house! (Watch this fascinating discussion of the Human Speechome Project)

Page 8:

Vocabulary lists
• Single-word term lists
• Collocations and compound lists (aka MWE’s)
• KWIC listings
• Frequency lists
• Saliency lists
• Weirdness: typos, low-freq words, etc.

Page 9:

Accessing corpus content
• Simple term (word/phrase) search
  • Provided in most text editors
  • KWIC listing for results context
  • Ranked documents from browser
    • KWIC listings
• Search engines return documents
• Searchable: words, character strings
• Not (yet):
  • Word senses
  • Wildcards
  • Arbitrarily complex specifiable patterns
  • Syntax
  • Conversational structure
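A KWIC (keyword-in-context) listing, as mentioned above, shows each hit with a window of text on either side. The sketch below is one simple way to build such a listing; the function name `kwic`, the `width` parameter, and the sample sentence are illustrative assumptions, not part of any corpus tool named in these slides.

```python
import re

def kwic(text, term, width=20):
    """Return one line per hit: left context, the hit in brackets, right context."""
    lines = []
    # \b keeps "corpus" from matching inside a longer word; re.escape protects
    # any regex special characters in the search term.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

sample = "The corpus is large. A corpus may be annotated; this corpus is not."
for line in kwic(sample, "corpus"):
    print(line)
```

Right-aligning the left context lines the hits up in a column, which is what makes KWIC listings easy to scan.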

Page 10:

Data formats
• Text
  • Character encodings: ASCII, EBCDIC, Unicode, etc.
  • File formats: word processing, tab-delimited, etc.
  • With or without markup (SGML, HTML, XML, RTF, etc.)
  • Application-specific (doc, wpd, spreadsheet, etc.)
  • Can vary widely across languages
• Speech
  • Huge amount of variation across projects/hw/sw
  • TIMIT, NIST (US Gov.), AIFF (Apple), SUNAU8 (Sun), OGI File Format, WAV (Microsoft)
• Pictures/graphics (GIF, JPEG, BMP, WMF, …)
• Video
• Live data streams

Page 11:

Scripting
• Scripting ability is important in corpus work
• A crucial skill for computational linguists
  • In fact, for any linguist!
• Languages: not quite full-fledged programming languages; slightly different focus
  • Wikipedia re programming languages
  • Wikipedia re scripting languages
• Computational linguists and NLP practitioners typically work with several of both types
  • Learn some (more)!
• Be familiar with already existing conversion tools

Page 12:

Useful corpus/scripting tools
• Grep: finds and prints lines that match a regular expression
• Awk, gawk, tawk: scripting specifically for text manipulation
• SED: stream editor (for buffered text)
• PERL
  • Pathologically Eclectic Rubbish Lister
  • Practical Extraction and Report Language
  • Built-in variables, dynamic arrays, user-definable functions
  • Large repository of file/string manipulation functions
  • Relies heavily on regular expressions
• Lex, YACC: lexical analyzers, compilers
• more (or less) plus FINDSTR in Windows COM
• Many, many, many others

Page 13:

Levels of corpus processing
• Building (capturing data, organizing, encoding)
• Cleanup (inconsistencies w/ data, etc.)
• Annotation (marking features)
• Dissemination (platform, copyright, licensing, etc.)
• Translation (conversion from one format/application to another)
• Feature extraction

Page 14:

Types of words in a corpus
• Tokens: each word instance is counted
• Types: each distinct word form is counted only once
• Type/Token ratio: measure of richness of vocabulary
• Disfluencies: fragments, filled pauses
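The token/type distinction can be made concrete with a small count. This is a minimal sketch under simplifying assumptions: tokens are lower-cased runs of letters and apostrophes, and the function name `type_token_ratio` and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

def type_token_ratio(text):
    """Return (number of tokens, number of types, types / tokens)."""
    # Assumption: a token is a run of letters or apostrophes, case-folded.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0, 0, 0.0
    counts = Counter(tokens)          # one entry per distinct word form
    return len(tokens), len(counts), len(counts) / len(tokens)

sample = "the cat sat on the mat and the dog sat too"
tokens, types, ttr = type_token_ratio(sample)
print(tokens, types, round(ttr, 3))  # 11 8 0.727
```

The ratio depends heavily on text length, so it is usually compared only across samples of similar size.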

Page 15:

From task #1: re the LDC
• BYU is a member of LDC; membership years: 1998, 2003-2015, 2018-2020
• You can order corpora for BYU classwork, research, theses, etc.
• Always obey distribution conditions, restrictions
• Some corpora require a signed license agreement

Page 16:

Regular expressions (regexes)
• Way to process data:
  • Filter out extraneous data
  • Home in on targeted data
• One of the most important tools in natural language processing (NLP) and corpus linguistics
• Widely used tool for searching, matching, and replacing text

Page 17:

Where do you use regexes?
• Some corpus interfaces
• Command lines: Unix/Linux, findstr in DOS, PowerShell
• Supported to various degrees by many programming/scripting languages
• Some Web browsers

Page 18:

Learning regular expressions
• Documentation, tutorials abound
• Graphical user interfaces (GUIs) to try out your own regexes on texts
  • Some allow you to upload your own texts
• Some editors support regular expressions
• Alas, various flavors have slightly differing conventions
  • Need to be flexible, willing to consult documentation

Page 19:

Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks
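One possible answer to the woodchuck question, sketched with Python's `re` module: a bracket set handles the capital letter and an optional `s` handles the plural. The extra word "woodchip" is added only to show a non-match.

```python
import re

# [wW] accepts either case of the first letter; s? makes the plural optional.
pattern = re.compile(r"[wW]oodchucks?")

candidates = ["woodchuck", "woodchucks", "Woodchuck", "Woodchucks", "woodchip"]
matched = [w for w in candidates if pattern.fullmatch(w)]
print(matched)  # ['woodchuck', 'woodchucks', 'Woodchuck', 'Woodchucks']
```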

Page 20:

Using regexes
• Need:
  • a pattern
  • a corpus
• Specifying a pattern:
  • delimiters (often slashes) around outside of pattern
    /something/
    /honor|honour/
  • Character ranges, groupings
  • Constraints on number of matches
  • Special characters, variables

Page 21:

Special characters
• What do the following represent:
  *
  .
  +
  ?
  / /
  [ ]
  |
  -

Page 22:

Grouping
• Use parentheses
  /hono(u)r/
  /gray|grey/
  /gr(a|e)y/
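The three grouping patterns can be tried against a few words. This is a quick sketch using Python's `re` module; the word list and the helper `hits` are invented for the example. Note that parentheses alone do not make the `u` optional, so `hono(u)r` matches only "honour".

```python
import re

words = ["honor", "honour", "gray", "grey", "groy"]

def hits(pattern):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(hits(r"hono(u)r"))   # ['honour']
print(hits(r"gray|grey"))  # ['gray', 'grey']
print(hits(r"gr(a|e)y"))   # ['gray', 'grey']
```

The last two patterns accept the same strings; the grouped form just factors out the shared letters.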

Page 23:

Counting (quantifiers)
• ? means zero or one
• * means zero or more (Kleene star)
• + means one or more
  /honou?r/
  /aie*/
  /ba+/
  /H(ae?|ä)ndel/
  (a|b|c)*
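A short check of what each quantifier pattern accepts, using Python's `re` module. The candidate strings are made up for the example, and each is tested as a whole-string match.

```python
import re

# Pattern -> candidate strings (illustrative only).
tests = {
    r"honou?r": ["honor", "honour", "honouur"],
    r"aie*": ["ai", "aie", "aieee", "a"],
    r"ba+": ["b", "ba", "baaa"],
    r"H(ae?|ä)ndel": ["Handel", "Haendel", "Händel", "Hndel"],
    r"(a|b|c)*": ["", "abcabc", "abd"],
}

# Keep only the candidates that each pattern matches in full.
results = {
    pattern: [s for s in strings if re.fullmatch(pattern, s)]
    for pattern, strings in tests.items()
}

for pattern, accepted in results.items():
    print(pattern, accepted)
```

Because `*` allows zero repetitions, `(a|b|c)*` also accepts the empty string.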

Page 24:

Other frequently used symbols
• Dot: represents any single character
  /b.*e/
  /b.+e/
  /b.?e/
• Brackets: match any single character from the listed set
  /[abc]/
• Hyphens: specify ranges inside brackets
  /[A-Z][a-z]/
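The dot, bracket, and range patterns above behave as follows in Python's `re` module. The test strings are illustrative; the dot patterns are checked as whole-string matches.

```python
import re

strings = ["be", "bee", "bare"]

# Whole-string matches for each dot pattern.
dot_hits = {
    pattern: [s for s in strings if re.fullmatch(pattern, s)]
    for pattern in (r"b.*e", r"b.+e", r"b.?e")
}
print(dot_hits)

# A bracket set matches exactly one character drawn from the set.
bracket_hits = re.findall(r"[abc]", "cabbage")
print(bracket_hits)  # ['c', 'a', 'b', 'b', 'a']

# Ranges: one upper-case letter followed by one lower-case letter.
range_hits = re.findall(r"[A-Z][a-z]", "Palo Alto")
print(range_hits)  # ['Pa', 'Al']
```

The only difference among the three dot patterns is how many characters may sit between `b` and `e`: any number, at least one, or at most one.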

Page 25:

Regular Expressions: Disjunctions
• Letters inside square brackets []

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit

• Ranges [A-Z]

Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole

Page 26:

Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret means negation only when first in []

Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reason
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
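The negation rows can be checked with Python's `re` module, where `search` returns the first character that satisfies each class. One flavor difference is worth noting: in Python a caret outside brackets is always an anchor, so the literal pattern "a caret b" has to be written with a backslash.

```python
import re

# First character that is NOT in the negated set.
first_non_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
first_non_s = re.search(r"[^Ss]", "I have no exquisite reason").group()
first_other = re.search(r"[^e^]", "Look here").group()
print(first_non_upper, first_non_s, first_other)  # y I L

# In Python's flavor, an unescaped ^ after "a" is still a start anchor,
# so this pattern cannot match; escaping the caret makes it literal.
unescaped = re.search(r"a^b", "Look up a^b now")
escaped = re.search(r"a\^b", "Look up a^b now")
print(unescaped, escaped.group())  # None a^b
```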

Page 27:

Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c                        = [abc]
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck

Page 28:

Regular Expressions: ? * + .

Pattern    Matches                       Examples
colou?r    Optional previous char        color colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char    oh! ooh! oooh! ooooh!
baa+                                     baa baaa baaaa baaaaa
beg.n                                    begin begun beg3n

Kleene *, Kleene + (named for Stephen C Kleene)

Page 29:

Regular Expressions: Anchors ^ $

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1 “Hello”
\.$           The end.
.$            The end? The end!

Page 30:

More symbols, more flexibility
• To “protect” special characters, use the backslash: \
  /.*\.txt/
• To anchor the match at the beginning of a string: ^
  /^I\, Nephi/
• To anchor the match at the end of a string: $
  /Amen\.$/
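The escaping and anchoring patterns above can be tried directly in Python's `re` module. The file names and sentences below are invented test strings.

```python
import re

# \. is a literal period, so "notes_txt" (no period) is rejected.
names = ["notes.txt", "notes_txt", "a.txt.bak"]
has_txt = [n for n in names if re.search(r".*\.txt", n)]
print(has_txt)  # ['notes.txt', 'a.txt.bak']

# ^ anchors at the start of the string, $ at the end.
starts = bool(re.search(r"^I\, Nephi", "I, Nephi, having been born of goodly parents"))
ends = bool(re.search(r"Amen\.$", "... forever. Amen."))
not_at_end = bool(re.search(r"Amen\.$", "Amen. And it came to pass"))
print(starts, ends, not_at_end)  # True True False
```

Without a trailing `$`, `/.*\.txt/` also accepts "a.txt.bak"; adding the end anchor would restrict it to names that actually end in ".txt".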

Page 31:

Metacharacters
• \b: word boundary (a position, not a character)
• \w: word character
• \s: whitespace character
• \B: any position that is NOT a word boundary
• \W: everything BUT a word character
• \S: everything BUT a whitespace character
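A small demonstration of these metacharacters in Python's `re` module, using an invented sentence. The boundary assertions match positions between characters, which is why they can separate "cat" as a word from "cat" inside a longer word.

```python
import re

text = "The cat scattered catnip."

whole_word = re.findall(r"\bcat\b", text)     # "cat" as a free-standing word
anywhere = re.findall(r"cat", text)           # every occurrence of the letters
inside_only = re.findall(r"\Bcat\B", text)    # only the one inside "scattered"

words = re.findall(r"\w+", text)              # runs of word characters
chunks = re.findall(r"\S+", text)             # runs of non-whitespace
non_word = re.findall(r"\W", text)            # spaces and punctuation

print(len(whole_word), len(anywhere), len(inside_only))  # 1 3 1
print(words)   # ['The', 'cat', 'scattered', 'catnip']
print(chunks)  # ['The', 'cat', 'scattered', 'catnip.']
```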

Page 32:

Advanced topics
• Capture groups
• Lookahead
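Both advanced topics can be illustrated briefly with Python's `re` module. The e-mail address, the name, and the sentence are made-up examples, and the e-mail pattern is deliberately simplistic.

```python
import re

# Capture groups: parentheses remember the parts of the match.
m = re.search(r"(\w+)@(\w+\.\w+)", "Contact: student@example.edu")
user, host = m.group(1), m.group(2)
print(user, host)  # student example.edu

# Captured text can be reused in a replacement via \1, \2, ...
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Pope, Alexander")
print(swapped)  # Alexander Pope

# Lookahead: (?=...) requires what follows without including it in the match.
before_conf = re.findall(r"\w+(?= Conference)", "General Conference and Regional Conference")
print(before_conf)  # ['General', 'Regional']
```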

Page 33:

Recent job ads requiring regex skill
• Oracle: Computational linguist, programmer
• Lionbridge: Linguist, computational linguist
• Appen Butler Hill: Consultant
• The Long Now Foundation: Intern
• Nuance Mobile: NLU automotive vocon speech engineer
• Amazon: text-to-speech linguist
• Measurement, Inc.: student evaluation tools expert
• Infogain.com: language solutions consultant
• Linnaeus University, Växjö: funded PhD
• Webinterpret.com: localization engineer

Page 34:

Examples of regex usage

• Compiling frequency statistics on word usage from a French corpus for a French dictionary that I published.

• Exploring translation differences between two different translations of the Book of Mormon in Farsi, for the Church Scripture Translation team.

• Searching for expressions that need to be simplified in English running text and that can be replaced with Church Basic English vocabulary that we developed (e.g. for "Preach My Gospel").

• Processing the names of several thousand suspected terrorists for identification of Romanized variants, for a DoD contract.

• Specifying Syriac morphological composition for a parser being used in developing, annotating and deploying a corpus of writings of the poet/theologian Ephrem, a joint project with the Maxwell Institute.

• Locating data elements from Web documents in an ontology-based conceptual modeling system that I have been developing with other BYU researchers and which is being used for various applications including biomedical research and decision support systems.

• Identifying the phonotactic sequences found in spoken language and matching them with patterns extant in several natural languages and dialects, for a spoken language ID system.

• Specifying the morphological composition for a parser that I'm using to annotate a corpus of Lushootseed utterances, working along with several Native American tribes who are hoping to save the language from extinction.

• Retrieving, parsing, and using morphological, syntactic, and semantic data from WordNet, a lexical database that I'm using in the lexical access component of a cognitive modeling system.

• Matching personal information (names, affiliations, email addresses, phone numbers, etc.) from Web pages in order to improve recall and precision in person-based information extraction, used recently in a worldwide programming contest called WePS-2.


Page 35:

NLP corpus tasks and regular expressions
• Text preprocessing
  • Sentence segmentation
  • Tokenization: parsing out words from document, data stream, etc.
    • Not as trivial a problem as most people expect!
  • Normalization: standardizing formats (dates, locations, addresses, phone #’s, etc.)
  • Case folding
  • Lemmatization (next lecture)
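A rough sketch of three of these preprocessing steps with regular expressions. The sample text, the patterns, and the month/day/year reading of the date are all assumptions made for illustration; real pipelines need far more care, which is the point of the "not as trivial" remark.

```python
import re

text = "Dr. Smith didn't call on 3/4/2020. He e-mailed instead!"

# Naive sentence segmentation: split after ., ! or ? followed by whitespace.
# It wrongly splits after the abbreviation "Dr.".
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)

# Case folding + tokenization: words (keeping internal ' and -) or single
# punctuation marks.
tokens = re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text.casefold())
print(tokens)

# Normalization: rewrite dates as YYYY-MM-DD, ASSUMING month/day/year input.
def to_iso(m):
    return f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}"

normalized = re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", to_iso, text)
print(normalized)
```

The naive splitter yields three "sentences" for a two-sentence text, and the tokenizer breaks the date into five tokens, so the order in which normalization and tokenization run matters.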

Page 36:

Formal languages
• Symbol: primitive term (like “point” in geometry)
• String: finite sequence of symbols (If a, b, c are symbols then aabcba is a string.) Strings are also called “words”.
• If w is a word, |w| is the length of w (e.g. |aabcba| = 6).
• The empty string will be denoted by ε. |ε| = 0.
• Prefix: any number of initial symbols. Proper prefix: prefix other than whole string.
• String concatenation: w1 • w2 = w1w2

Page 37:

Formal languages (cont.)
• Alphabet: finite set of symbols
• Language: set of strings (words) from some one alphabet. The empty set Ø and the set {ε} consisting of the empty string are (different) languages.
• Set operations: L1 ∪ L2, L1 ∩ L2, L1 – L2
• L1 • L2 = {w1 • w2 | w1 ∈ L1, w2 ∈ L2}
• Kleene closure: L* = {ε} ∪ L ∪ L² ∪ L³ ∪ …
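Language concatenation and the Kleene closure can be sketched with Python sets of strings, with the empty string "" standing in for the empty word. Since the closure is infinite for any language containing a non-empty word, the function below stops at a chosen power; the names `concat`, `kleene_closure`, and `max_power` are invented for this illustration.

```python
def concat(l1, l2):
    """Language concatenation: every word of l1 followed by every word of l2."""
    return {w1 + w2 for w1 in l1 for w2 in l2}

def kleene_closure(language, max_power):
    """Union of L^0, L^1, ..., L^max_power (a finite slice of L*)."""
    result = {""}          # L^0 contains only the empty string
    power = {""}
    for _ in range(max_power):
        power = concat(power, language)   # L^(k+1) = L^k concatenated with L
        result |= power
    return result

L = {"a", "bc"}
print(sorted(kleene_closure(L, 2)))
# ['', 'a', 'aa', 'abc', 'bc', 'bca', 'bcbc']

# The empty language and the language holding only the empty string differ:
print(concat(set(), L))   # set()  (nothing to concatenate with)
print(concat({""}, L) == L)  # True (the empty string is the identity)
```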

Page 38:

Examples
• Set of words with two consecutive c’s:
  • {a,b,c}*{cc}{a,b,c}*
• Set of all binary strings with exactly two 1’s
• What about recognizing dates? What about only correct dates?
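One way to express these examples as Python regexes. The first pattern transcribes {a,b,c}*{cc}{a,b,c}* directly; the second is one possible answer for "exactly two 1's". For dates, the sketch assumes a YYYY-MM-DD shape (an arbitrary choice) and shows that a regex can check the shape while a calendar check, here `datetime.date`, is what rejects impossible dates.

```python
import re
from datetime import date

two_cs = re.compile(r"[abc]*cc[abc]*")        # {a,b,c}*{cc}{a,b,c}*
exactly_two_ones = re.compile(r"0*10*10*")    # zeros around exactly two 1's
date_shape = re.compile(r"(\d{4})-(\d{2})-(\d{2})")  # assumed YYYY-MM-DD

def is_real_date(s):
    """True if s has the assumed shape AND names a date that exists."""
    m = date_shape.fullmatch(s)
    if not m:
        return False
    try:
        date(*map(int, m.groups()))   # raises ValueError for e.g. Feb 30
    except ValueError:
        return False
    return True

print(bool(two_cs.fullmatch("abccba")), bool(two_cs.fullmatch("abcabc")))
print(bool(exactly_two_ones.fullmatch("01010")), bool(exactly_two_ones.fullmatch("0111")))
print(bool(date_shape.fullmatch("2021-02-29")), is_real_date("2021-02-29"))
```

The last line shows the gap between recognizing dates and recognizing only correct dates: "2021-02-29" has the right shape but is not a real date.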