Comments from Task #1
• Books vs. journal articles vs. conference proceedings
• Peer-reviewed vs. not
  • Blind, double-blind
• Hirsch index (h-index): metric for rating research productivity (people, publication venues, schools)
  • h = 24: 24 artifacts each cited ≥ 24 times
  • h5 = 10: 10 artifacts from the last 5 years each cited ≥ 10 times
• CiteSeerX:
  • Citations: literature that the paper cites
  • Co-citation: frequency with which two documents are cited together
  • Active bibliography: when two documents’ bibliographies overlap
  • Based on Solr (built on Lucene)
  • Source code and data are downloadable; API is also available
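The h-index definition above can be worked through in a few lines. This is only a sketch, assuming the citation counts are already available as a plain list; the function name `h_index` and the sample counts are made up for illustration.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` items all have >= rank citations
        else:
            break
    return h

# Hypothetical citation counts for six papers.
counts = [25, 8, 5, 3, 3, 0]
print(h_index(counts))
```

An h5-style figure would use the same function, applied only to items published in the last five years.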
Sample corpus phenomena (Task #2)
• Pangrams, anagrams
• 368 noun-verb homographs from BNC
• Apostrophes in Pope’s Iliad translation: o’er, e’en, jav’lin, gen’rous, neighb’ring
• Pres. Monson’s distinctive use of passive auxiliaries
• Longest sentences in German
• Split infinitives in General Conf. talks
• Usage of anaphoric locative adverbs: whither, thither, hither, whence, thence, hence
• "Unnecessary" quotation marks in blogs
• Spoken instances of “whom” + stranded preposition (whom…with.)
• Japanese verbs marking locative object with を
• Double modals
• Mutually intelligible phrases/sentences in Spanish & Portuguese
• What we’re asked (not) to do from last Gen. Conf. (in Japanese)
• Usages of 100 instances of ‘why’ from live Twitter stream
• OK vs. okay over time in COCA
• Lexical richness in most recent 200 tweets of 7 political figures
• Subjunctive “be” in English
• “-ish” suffixation on Reddit
• Coincidental "emojis" in the scriptures: …the ruler of the feast had tasted the water that was made wine, and knew not whence it was: (but the servants which drew the water knew;)
• “yall” in tweets
• Superlatives in movie reviews on Reddit
A note about your final project
• First milestone (proposal) will be due in about a month
• Two possible changes to coordinate with a local company:
  • LiteralApp
  • POS-tagging of n-grams (email me for contact info)
  • Caveats
    • Probably max. 2-3 students total
    • I don’t know, and hence can’t vouch for, these people/organizations or accept responsibility for their behavior, policies, working conditions, etc.
    • This is not a formal BYU-sponsored internship
    • You’ll have to work out particulars re schedule, code ownership, deliverables
    • You need my approval
    • You can’t start coding until Feb.
    • Grading will be as with any other project
• IMO it’s better if you conceive/design/spec/develop/evaluate your own project
CORPORA AND REGEXP’S
Finding lexical expressions

Corpus (pl. corpora)
• Body of language data
• Various types
  • Text/written, speech/oral, gestures
  • Images
• Annotation levels
  • Structured (aligned, annotated, treebanks, etc.)
• Spans the range of human linguistic communication

Purposes for corpora
• Language instruction
• Task analysis
• NLP systems development
  • Training systems
  • Testing/evaluating systems
• Knowledge source development (dictionaries, lexicons, etc.)
• Linguistic research

Sources for corpora
• e-text centers, digital libraries (e.g. Project Gutenberg, Oxford Text Archive)
• Clearinghouses (e.g. LDC, NIST, ELRA)
• Specialized collections of corpora (e.g. SIG’s)
• Specially deployed corpora (e.g. BYU)
• The web itself
• Your own house! (Watch this fascinating discussion of the Human Speechome Project)

Vocabulary lists
• Single-word term lists
• Collocations and compound lists (aka MWE’s)
• KWIC listings
• Frequency lists
• Saliency lists
• Weirdness: typos, low-freq words, etc.

Accessing corpus content
• Simple term (word/phrase) search
  • Provided in most text editors
  • KWIC listing for results context
  • Ranked documents from browser
    • KWIC listings
• Search engines return documents
• Searchable: words, character strings
• Not (yet):
  • Word senses
  • Wildcards
  • Arbitrarily complex specifiable patterns
  • Syntax
  • Conversational structure
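Frequency lists and KWIC (keyword in context) listings are easy to prototype. The sketch below is a minimal illustration, not any particular corpus tool: the helper names `tokenize` and `kwic`, the context width, and the sample sentence are all assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())

def kwic(text, keyword, width=20):
    # One line per hit: left context, the keyword in brackets, right context.
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

sample = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

freq = Counter(tokenize(sample))
for word, count in freq.most_common(4):
    print(count, word)

for line in kwic(sample, "woodchuck"):
    print(line)
```

The `\b` word boundaries keep a search for "wood" from also hitting "woodchuck".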
Data formats
• Text
  • Character encodings: ASCII, EBCDIC, Unicode, etc.
  • File formats: word processing, tab-delimited, etc.
  • With or without markup (SGML, HTML, XML, RTF, etc.)
  • Application-specific (doc, wpd, spreadsheet, etc.)
  • Can vary widely across languages
• Speech
  • Huge amount of variation across projects/hw/sw
  • TIMIT, NIST (US Gov.), AIFF (Apple), SUNAU8 (Sun), OGI File Format, WAV (Microsoft)
• Pictures/graphics (GIF, JPEG, BMP, WMF, …)
• Video
• Live data streams

Scripting
• Scripting ability is important in corpus work
• A crucial skill for computational linguists
  • In fact, for any linguist!
• Languages: not quite full-fledged programming languages; slightly different focus
  • Wikipedia re programming languages
  • Wikipedia re scripting languages
• Computational linguists and NLP practitioners typically work with several of both types
• Learn some (more)!
• Be familiar with already existing conversion tools

Useful corpus/scripting tools
• Grep: searches text for lines matching a regular expression and prints them
• Awk, gawk, tawk: scripting specifically for text manipulation
• SED: stream editor (for buffered text)
• PERL
  • Pathologically Eclectic Rubbish Lister
  • Practical Extraction and Report Language
  • Built-in variables, dynamic arrays, user-definable functions
  • Large repository of file/string manipulation functions
  • Relies heavily on regular expressions
• Lex, YACC: generators for lexical analyzers and parsers (used in building compilers)
• more (or less) plus FINDSTR on the Windows command line
• Many, many, many others

Levels of corpus processing
• Building (capturing data, organizing, encoding)
• Cleanup (inconsistencies w/ data, etc.)
• Annotation (marking features)
• Dissemination (platform, copyright, licensing, etc.)
• Translation (conversion from one format/application to another)
• Feature extraction

Types of words in a corpus
• Tokens: each word instance is counted
• Types: each distinct word form is counted only once
• Type/token ratio: measure of richness of vocabulary
• Disfluencies: fragments, filled pauses
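A quick way to see the token/type distinction is to count both. This is a rough sketch that assumes a simple `\w+` tokenizer and lowercasing; real corpus work would need more careful tokenization, and the function name `type_token_ratio` is just illustrative.

```python
import re

def type_token_ratio(text):
    tokens = re.findall(r"\w+", text.lower())   # every word instance
    types = set(tokens)                          # each distinct form once
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ratio

n_tokens, n_types, ttr = type_token_ratio("The cat sat on the mat. The end.")
print(n_tokens, n_types, ttr)
```

Note that the ratio tends to fall as texts get longer, so it is safest to compare samples of similar size.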
From task #1: re the LDC
• BYU is a member of LDC; membership years: 1998, 2003-2015, 2018-2020
• You can order corpora for BYU classwork, research, theses, etc.
• Always obey distribution conditions, restrictions
• Some corpora require a signed license agreement
Regular expressions (regexes)
• Way to process data:
  • Filter out extraneous data
  • Home in on targeted data
• One of the most important tools in natural language processing (NLP) and corpus linguistics
• Widely used tool for searching, matching, and replacing text

Where do you use regexes?
• Some corpus interfaces
• Command lines: Unix/Linux, findstr in DOS, PowerShell
• Supported to various degrees by many programming/scripting languages
• Some Web browsers
Learning regular expressions
• Documentation, tutorials abound
• Graphical user interfaces (GUIs) to try out your own regexes on texts
  • Some allow you to upload your own texts
  • Some editors support regular expressions
• Alas, various flavors have slightly differing conventions
• Need to be flexible, willing to consult documentation

Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks
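One possible answer, sketched with Python's `re` module (the candidate list is just for illustration): a character class handles the capitalization and an optional `s` handles the plural.

```python
import re

# [wW] accepts either capitalization; s? makes the plural ending optional.
pattern = re.compile(r"[wW]oodchucks?")

candidates = ["woodchuck", "woodchucks", "Woodchuck", "Woodchucks", "groundhog"]
matched = [w for w in candidates if pattern.fullmatch(w)]
print(matched)
```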

Using regexes
• Need:
  • a pattern
  • a corpus
• Specifying a pattern:
  • delimiters (often slashes) around outside of pattern
    /something/
    /honor|honour/
  • Character ranges, groupings
  • Constraints on number of matches
  • Special characters, variables
Special characters
• What do the following represent:
  *
  .
  +
  ?
  / /
  [ ]
  |
  -
Grouping
• Use parentheses
  /hono(u)r/
  /gray|grey/
  /gr(a|e)y/
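To see why the parentheses matter, compare an ungrouped alternation with a grouped one. This is a small sketch using Python's `re` module; the variable names and test words are illustrative only.

```python
import re

# Without parentheses, | splits the whole pattern: "gra" OR "ey".
loose = re.compile(r"gra|ey")
# With parentheses, | only chooses between "a" and "e".
grouped = re.compile(r"gr(a|e)y")

for word in ["gray", "grey", "gra", "groy"]:
    print(word, bool(loose.fullmatch(word)), bool(grouped.fullmatch(word)))

# A parenthesized group also remembers what it matched.
print(grouped.fullmatch("grey").group(1))
```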
Counting (quantifiers)
• ? means zero or one
• * means zero or more (Kleene star)
• + means one or more
  /honou?r/
  /aie*/
  /ba+/
  /H(ae?|ä)ndel/
  /(a|b|c)*/
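The quantifier patterns above can be tried against a few candidate strings. This sketch uses Python's `re.fullmatch`, so a string counts only if the whole string fits the pattern; the candidate strings are made up for illustration.

```python
import re

examples = {
    r"honou?r": ["honor", "honour", "honouur"],
    r"aie*": ["ai", "aie", "aieee", "a"],
    r"ba+": ["b", "ba", "baaa"],
    r"H(ae?|ä)ndel": ["Handel", "Haendel", "Händel", "Hendel"],
    r"(a|b|c)*": ["", "abcab", "abd"],
}

# For each pattern, keep only the strings it matches in full.
results = {
    pattern: [s for s in strings if re.fullmatch(pattern, s)]
    for pattern, strings in examples.items()
}

for pattern, accepted in results.items():
    print(pattern, accepted)
```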
Other frequently used symbols
• Dot: represents any single character
  /b.*e/
  /b.+e/
  /b.?e/
• Brackets: match any one of the enclosed characters
  /[abc]/
• Hyphens: specify ranges inside brackets
  /[A-Z][a-z]/
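A short sketch of how the dot combines with each quantifier, and how brackets and ranges behave, using Python's `re` module. The word list and sample strings are illustrative assumptions.

```python
import re

words = ["be", "bee", "bake", "blue", "b"]

# Dot plus a quantifier: how many arbitrary characters may sit between b and e?
dot_star = [w for w in words if re.fullmatch(r"b.*e", w)]   # zero or more
dot_plus = [w for w in words if re.fullmatch(r"b.+e", w)]   # at least one
dot_opt = [w for w in words if re.fullmatch(r"b.?e", w)]    # at most one

# Brackets pick one character from a set; a hyphen inside them gives a range.
set_hits = re.findall(r"[abc]", "cabbage")
range_hits = re.findall(r"[A-Z][a-z]", "Palo Alto")

print(dot_star, dot_plus, dot_opt, set_hits, range_hits)
```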
Regular Expressions: Disjunctions
• Letters inside square brackets []
  Pattern         Matches
  [wW]oodchuck    Woodchuck, woodchuck
  [1234567890]    Any digit
• Ranges [A-Z]
  Pattern    Matches                 Example
  [A-Z]      An upper case letter    Drenched Blossoms
  [a-z]      A lower case letter     my beans were impatient
  [0-9]      A single digit          Chapter 1: Down the Rabbit Hole

Regular Expressions: Negation in Disjunction
• Negations [^Ss]
• Caret means negation only when first in []
  Pattern    Matches                     Example
  [^A-Z]     Not an upper case letter    Oyfn pripetchik
  [^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reason
  [^e^]      Neither e nor ^             Look here
  a^b        The pattern a caret b       Look up a^b now
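The negation rows can be checked with Python's `re` module, as sketched below. One flavor difference to be aware of: the table treats `a^b` as a literal caret (as some flavors, such as POSIX basic regular expressions, do), but in Python a caret outside brackets is always an anchor, so it has to be escaped there.

```python
import re

# First character that is NOT in the bracketed set.
first_non_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
first_non_s = re.search(r"[^Ss]", "I have no exquisite reason").group()
first_non_e_or_caret = re.search(r"[^e^]", "Look here").group()

# In Python, an unescaped ^ outside brackets is an anchor, so this finds nothing...
unescaped = re.search(r"a^b", "Look up a^b now")
# ...while the escaped form matches the literal text a^b.
escaped = re.search(r"a\^b", "Look up a^b now")

print(first_non_upper, first_non_s, first_non_e_or_caret, unescaped, escaped.group())
```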

Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog!
• The pipe | for disjunction
  Pattern                      Matches
  groundhog|woodchuck          groundhog, woodchuck
  yours|mine                   yours, mine
  a|b|c                        = [abc]
  [gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck

Regular Expressions: ? * + .
  Pattern    Matches                       Example
  colou?r    Optional previous char        color colour
  oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
  o+h!       1 or more of previous char    oh! ooh! oooh! ooooh!
  baa+                                     baa baaa baaaa baaaaa
  beg.n                                    begin begun beg3n
Kleene *, Kleene + (named for Stephen C. Kleene)

Regular Expressions: Anchors ^ $
  Pattern       Matches
  ^[A-Z]        Palo Alto
  ^[^A-Za-z]    1 “Hello”
  \.$           The end.
  .$            The end? The end!

More symbols, more flexibility
• To “protect” special characters, use the backslash: \
  /.*\.txt/
• To anchor the match at the beginning of a string: ^
  /^I\, Nephi/
• To anchor the match at the end of a string: $
  /Amen\.$/
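Here is a brief sketch of anchoring and escaping in Python's `re` module. The file names and the sample sentence are hypothetical; the patterns are the ones from the slides.

```python
import re

# Escaped dot: only a literal "." before "txt" is accepted.
escaped = [f for f in ["notes.txt", "notes_txt"] if re.fullmatch(r".*\.txt", f)]
# Unescaped dot: any character is accepted in that position.
unescaped = [f for f in ["notes.txt", "notes_txt"] if re.fullmatch(r".*.txt", f)]

# ^ anchors at the beginning of the string.
starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()
starts_nonletter = re.search(r"^[^A-Za-z]", '1 "Hello"').group()

# $ anchors at the end of the string.
ends_with_period = bool(re.search(r"\.$", "The end."))
question_is_not_period = re.search(r"\.$", "The end?")
any_last_char = re.search(r".$", "The end?").group()
ends_with_amen = bool(re.search(r"Amen\.$", "And thus it is. Amen."))

print(escaped, unescaped, starts_upper, starts_nonletter, any_last_char)
```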
Metacharacters
• \b: word boundary (a position between characters, not a character itself)
• \w: word character
• \s: whitespace character
• \B: any position that is NOT a word boundary
• \W: any character that is NOT a word character
• \S: any character that is NOT a whitespace character
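The following sketch tries each metacharacter in Python's `re` module. The sample strings are made up; note that `\w` does not include the apostrophe, which matters for forms like jav'lin and o'er.

```python
import re

# \b: "cat" as a whole word only; \B: "cat" strictly inside a longer word.
whole_word = re.findall(r"\bcat\b", "cat concatenate cat.")
inside_word = re.findall(r"\Bcat\B", "cat concatenate cat.")

# \w+ splits at the apostrophe; \S+ keeps everything between whitespace.
word_chars = re.findall(r"\w+", "jav'lin o'er")
non_space = re.findall(r"\S+", "o'er  the\tsea")

# \s+ treats any run of spaces or tabs as one separator.
fields = re.split(r"\s+", "o'er  the\tsea")

# \W: everything that is not a word character.
non_word = re.findall(r"\W", "o'er the sea!")

print(whole_word, inside_word, word_chars, non_space, fields, non_word)
```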
Advanced topics
• Capture groups
• Lookahead
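As a taste of both topics, here is a minimal sketch in Python's `re` module. The sample strings and the date layout ("Month D, YYYY") are assumptions chosen for illustration.

```python
import re

# Capture groups: parentheses save pieces of the match for later use.
m = re.search(r"(\w+)\s+(\d{1,2}),\s+(\d{4})", "Due February 3, 2020 at noon")
month, day, year = m.groups()

# A backreference (\1) reuses what group 1 captured: finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "Paris in the the spring").group(1)

# Positive lookahead (?=...): numbers followed by " dollars", without consuming it.
dollar_amounts = re.findall(r"\d+(?= dollars)", "5 dollars, 10 euros, 20 dollars")

# Negative lookahead (?!...): a q that is NOT followed by u.
q_without_u = re.findall(r"q(?!u)", "Iraq quit")

print(month, day, year, doubled, dollar_amounts, q_without_u)
```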
Recent job ads requiring regex skill
• Oracle: Computational linguist, programmer
• Lionbridge: Linguist, computational linguist
• Appen Butler Hill: Consultant
• The Long Now Foundation: Intern
• Nuance Mobile: NLU automotive vocon speech engineer
• Amazon: text-to-speech linguist
• Measurement, Inc.: student evaluation tools expert
• Infogain.com: language solutions consultant
• Linnaeus University, Växjö: funded PhD
• Webinterpret.com: localization engineer
Examples of regex usage
• Compiling frequency statistics on word usage from a French corpus for a French dictionary that I published.
• Exploring translation differences between two different translations of the Book of Mormon in Farsi, for the Church Scripture Translation team.
• Searching for expressions that need to be simplified in English running text and that can be replaced with Church Basic English vocabulary that we developed (e.g. for "Preach My Gospel").
• Processing the names of several thousand suspected terrorists for identification of Romanized variants, for a DoD contract.
• Specifying Syriac morphological composition for a parser being used in developing, annotating and deploying a corpus of writings of the poet/theologian Ephrem, a joint project with the Maxwell Institute.
• Locating data elements from Web documents in an ontology-based conceptual modeling system that I have been developing with other BYU researchers and which is being used for various applications including biomedical research and decision support systems.
• Identifying the phonotactic sequences found in spoken language and matching them with patterns extant in several natural languages and dialects, for a spoken language ID system.
• Specifying the morphological composition for a parser that I'm using to annotate a corpus of Lushootseed utterances, working along with several Native American tribes who are hoping to save the language from extinction.
• Retrieving, parsing, and using morphological, syntactic, and semantic data from WordNet, a lexical database that I'm using in the lexical access component of a cognitive modeling system.
• Matching personal information (names, affiliations, email addresses, phone numbers, etc.) from Web pages in order to improve recall and precision in person-based information extraction, used recently in a worldwide programming contest called WePS-2.
NLP corpus tasks and regular expressions
• Text preprocessing
  • Sentence segmentation
  • Tokenization: parsing out words from document, data stream, etc.
    • Not as trivial a problem as most people expect!
  • Normalization: standardizing formats (dates, locations, addresses, phone #’s, etc.)
  • Case folding
  • Lemmatization (next lecture)
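The preprocessing steps above can be roughed out with regexes. This is a deliberately naive sketch: the function names, the splitting rule, and the US-style phone format are assumptions, and the sample shows how quickly a simple rule goes wrong.

```python
import re

def split_sentences(text):
    # Naive rule: break after . ! or ? when followed by whitespace and a capital.
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())

def tokenize(text):
    # Words (keeping internal apostrophes) or single punctuation marks.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def normalize_phone(text):
    # Assumed target format NNN-NNN-NNNN for 10-digit numbers.
    return re.sub(r"\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})", r"\1-\2-\3", text)

text = "Dr. Smith can't come. Call (801) 555-1234 instead!"

sentences = split_sentences(text)
tokens = tokenize("Smith can't come.")
normalized = normalize_phone(text)
folded = "Woodchuck".casefold()

print(sentences)
print(tokens)
print(normalized, folded)
```

The segmenter wrongly splits after "Dr.", which is one reason sentence segmentation and tokenization are harder than they look.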
Formal languages
• Symbol: primitive term (like “point” in geometry)
• String: finite sequence of symbols (if a, b, c are symbols then aabcba is a string). Strings are also called “words”.
• If w is a word, |w| is the length of w (e.g. |aabcba| = 6).
• The empty string will be denoted by ε. |ε| = 0.
• Prefix: any number of initial symbols. Proper prefix: prefix other than whole string.
• String concatenation: w1 • w2 = w1w2

Formal languages (cont.)
• Alphabet: finite set of symbols
• Language: set of strings (words) from some one alphabet. The empty set Ø and the set {ε} consisting of the empty string are (different) languages.
• Set operations: L1 ∪ L2, L1 ∩ L2, L1 – L2
• L1•L2 = {w1•w2 | w1 ∈ L1, w2 ∈ L2}
• Kleene closure: L* = {ε} ∪ L ∪ L² ∪ L³ ∪ …
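These definitions translate almost directly into Python sets of strings, with the empty string written as `""`. The sketch below is illustrative only: the names `concat` and `kleene_up_to` are made up, and because the Kleene closure is infinite, only a bounded approximation (up to n concatenations) is built.

```python
def concat(l1, l2):
    # Language concatenation: every word of l1 followed by every word of l2.
    return {w1 + w2 for w1 in l1 for w2 in l2}

def kleene_up_to(lang, n):
    # Union of the empty-string language, L, LL, ... up to n copies of L.
    result = {""}
    power = {""}
    for _ in range(n):
        power = concat(power, lang)
        result |= power
    return result

L1 = {"a", "ab"}
L2 = {"b", ""}

print(sorted(concat(L1, L2)))
print(sorted(kleene_up_to({"a", "b"}, 2)))

# The empty language and the language holding only the empty string differ:
print(concat(set(), L1), concat({""}, L1) == L1)
```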

Examples
• Set of words with two consecutive c’s:
  • {a,b,c}*{cc}{a,b,c}*
• Set of all binary strings with exactly two 1’s
• What about recognizing dates? What about only correct dates?
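One way to work through these examples in Python is sketched below. It assumes dates are written M/D/YYYY, and the names `TWO_CS`, `TWO_ONES`, `DATE_SHAPE`, and `is_correct_date` are made up. A regex can check the shape of a date, but checking that the date really exists (e.g. no February 30) is left here to the standard `datetime` module.

```python
import re
from datetime import date

TWO_CS = re.compile(r"[abc]*cc[abc]*")      # {a,b,c}*{cc}{a,b,c}*
TWO_ONES = re.compile(r"0*10*10*")          # exactly two 1's, any number of 0's
DATE_SHAPE = re.compile(r"(\d{1,2})/(\d{1,2})/(\d{4})")

def is_correct_date(s):
    m = DATE_SHAPE.fullmatch(s)
    if m is None:
        return False                        # wrong shape altogether
    month, day, year = (int(g) for g in m.groups())
    try:
        date(year, month, day)              # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

print(bool(TWO_CS.fullmatch("abccba")), bool(TWO_CS.fullmatch("abcabc")))
print(bool(TWO_ONES.fullmatch("0101")), bool(TWO_ONES.fullmatch("111")))
print(bool(DATE_SHAPE.fullmatch("02/30/2024")), is_correct_date("02/30/2024"))
```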