comments from task #1 · pangrams, anagrams • 368 noun-verb homographs from bnc • apostrophes...

Comments from Task #1
• Books vs. journal articles vs. conference proceedings
• Peer-reviewed vs. not
  • Blind, double-blind
• Hirsch index (h-index): metric for rating research productivity (people, publication venues, schools)
  • h = 24: 24 artifacts each cited ≥ 24 times
  • h5 = 10: 10 artifacts from the last 5 years, each cited ≥ 10 times
• CiteSeerX:
  • Citations: literature that the paper cites
  • Co-citation: frequency with which two documents are cited together
  • Active bibliography: when two documents’ bibliographies overlap
  • Based on Solr (built on Lucene)
  • Source code and data are downloadable; API is also available
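The Hirsch index mentioned on this slide reduces to a short computation over citation counts. The following is a minimal Python sketch; the function name `h_index` and the sample counts are made up for illustration and do not come from any real author or venue.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank      # the top `rank` items all have >= rank citations
        else:
            break
    return h

# Hypothetical citation counts for six publications (invented data).
sample = [25, 8, 5, 3, 3, 0]
print(h_index(sample))
```

An h5-style figure would use the same function, applied only to items published in the last five years.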


Sample corpus phenomena (Task #2)
• Pangrams, anagrams
• 368 noun-verb homographs from BNC
• Apostrophes in Pope’s Iliad translation: o’er, e’en, jav’lin, gen’rous, neighb’ring
• Pres. Monson’s distinctive use of passive auxiliaries
• Longest sentences in German
• Split infinitives in General Conf. talks
• Usage of anaphoric locative adverbs: whither, thither, hither, whence, thence, hence
• "Unnecessary" quotation marks in blogs
• Spoken instances of “whom” + stranded preposition (whom…with.)
• Japanese verbs marking locative object with を
• Double modals
• Mutually intelligible phrases/sentences in Spanish & Portuguese
• What we’re asked (not) to do from last Gen. Conf. (in Japanese)
• Usages of 100 instances of ‘why’ from live Twitter stream
• OK vs. okay over time in COCA
• Lexical richness in most recent 200 tweets of 7 political figures
• Subjunctive “be” in English
• “-ish” suffixation on Reddit
• Coincidental "emojis" in the scriptures: …the ruler of the feast had tasted the water that was made wine, and knew not whence it was: (but the servants which drew the water knew;)
• “yall” in tweets
• Superlatives in movie reviews on Reddit

A note about your final project
• First milestone (proposal) will be due in about a month
• Two possible changes to coordinate with a local company:
  • LiteralApp
  • POS-tagging of n-grams (email me for contact info)
• Caveats
  • Probably max. 2-3 students total
  • I don’t know, and hence can’t vouch for, these people/organizations or accept responsibility for their behavior, policies, working conditions, etc.
  • This is not a formal BYU-sponsored internship
  • You’ll have to work out particulars re schedule, code ownership, deliverables
  • You need my approval
  • You can’t start coding until Feb.
  • Grading will be as with any other project
• IMO it’s better if you conceive/design/spec/develop/evaluate your own project


http://linguistics.byu.edu/classes/ling581dl/LinguisticsEngineeringInternJobReq.pdf

CORPORA AND REGEXP’S
Finding lexical expressions

Corpus (pl. corpora)
• Body of language data
• Various types
  • Text/written, speech/oral, gestures
  • Images
• Annotation levels
  • Structured (aligned, annotated, treebanks, etc.)
• Spans the range of human linguistic communication

Purposes for corpora
• Language instruction
• Task analysis
• NLP systems development
  • Training systems
  • Testing/evaluating systems
• Knowledge source development (dictionaries, lexicons, etc.)
• Linguistic research

Sources for corpora
• e-text centers, digital libraries (e.g. Project Gutenberg, Oxford Text Archive)
• Clearinghouses (e.g. LDC, NIST, ELRA)
• Specialized collections of corpora (e.g. SIG’s)
• Specially deployed corpora (e.g. BYU)
• The web itself
• Your own house!

(Watch this fascinating discussion of the Human Speechome Project)

http://www.ted.com/talks/deb_roy_the_birth_of_a_word.html

Vocabulary lists
• Single-word term lists
• Collocations and compound lists (aka MWE’s)
• KWIC listings
• Frequency lists
• Saliency lists
• Weirdness: typos, low-freq words, etc.
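Two of these list types, frequency lists and KWIC (keyword in context) listings, are easy to prototype. The sketch below is only an illustration under simple assumptions: the sample sentence, the letters-only tokenizer, and the helper name `kwic` are all invented here.

```python
import re
from collections import Counter

text = ("The woodchuck chucked wood. "
        "How much wood would a woodchuck chuck?")

# Crude tokenization (an assumption): lowercase, then pull out runs of letters.
tokens = re.findall(r"[a-z]+", text.lower())

# Frequency list: each word type with its number of occurrences.
freq = Counter(tokens)

def kwic(tokens, keyword, width=2):
    """Return each hit of `keyword` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines

print(freq.most_common(3))
for line in kwic(tokens, "wood"):
    print(line)
```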

Accessing corpus content
• Simple term (word/phrase) search
  • Provided in most text editors
  • KWIC listing for results context
  • Ranked documents from browser
• KWIC listings
• Search engines return documents
• Searchable: words, character strings
• Not (yet):
  • Word senses
  • Wildcards
  • Arbitrarily complex specifiable patterns
  • Syntax
  • Conversational structure


Data formats
• Text
  • Character encodings: ASCII, EBCDIC, Unicode, etc.
  • File formats: word processing, tab-delimited, etc.
  • With or without markup (SGML, HTML, XML, RTF, etc.)
  • Application-specific (doc, wpd, spreadsheet, etc.)
  • Can vary widely across languages
• Speech
  • Huge amount of variation across projects/hw/sw
  • TIMIT, NIST (US Gov.), AIFF (Apple), SUNAU8 (Sun), OGI File Format, WAV (Microsoft)
• Pictures/graphics (GIF, JPEG, BMP, WMF, …)
• Video
• Live data streams

Scripting
• Scripting ability is important in corpus work
• A crucial skill for computational linguists
  • In fact, for any linguist!
• Languages: not quite full-fledged programming languages; slightly different focus
  • Wikipedia re programming languages: http://en.wikipedia.org/wiki/List_of_programming_languages
  • Wikipedia re scripting languages: https://en.wikipedia.org/wiki/Scripting_language
• Computational linguists and NLP practitioners typically work with several of both types
  • Learn some (more)!
• Be familiar with already existing conversion tools

Useful corpus/scripting tools
• Grep: finds and prints lines that match a regular expression
• Awk, gawk, tawk: scripting specifically for text manipulation
• SED: stream editor (for buffered text)
• PERL
  • Pathologically Eclectic Rubbish Lister
  • Practical Extraction and Report Language
  • Built-in variables, dynamic arrays, user-definable functions
  • Large repository of file/string manipulation functions
  • Relies heavily on regular expressions
• Lex, YACC: generators for lexical analyzers and parsers (used in building compilers)
• more (or less) plus FINDSTR in Windows COM
• Many, many, many others

Levels of corpus processing
• Building (capturing data, organizing, encoding)
• Cleanup (inconsistencies w/ data, etc.)
• Annotation (marking features)
• Dissemination (platform, copyright, licensing, etc.)
• Translation (conversion from one format/application to another)
• Feature extraction

Types of words in a corpus
• Tokens: each word instance is counted
• Types: distinct word forms; repeated instances are counted only once
• Type/Token ratio: number of types divided by number of tokens; a measure of richness of vocabulary
• Disfluencies: fragments, filled pauses
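The token/type distinction and the type/token ratio can be made concrete with a few lines of Python. This is a sketch under a simplifying assumption: a token is any run of letters or apostrophes after lowercasing. The function name `type_token_ratio` and the sample sentence are illustrative only.

```python
import re

def type_token_ratio(text):
    """Return (token count, type count, type/token ratio) for a text."""
    # Assumption: a token is a run of letters or apostrophes, lowercased.
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)                      # each distinct form counted once
    ratio = len(types) / len(tokens) if tokens else 0.0
    return len(tokens), len(types), ratio

n_tokens, n_types, ttr = type_token_ratio(
    "the cat saw the dog and the dog saw the cat"
)
print(n_tokens, n_types, round(ttr, 3))
```

Because the ratio tends to fall as a text grows, comparisons are usually made over samples of equal length.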


From task #1: re the LDC
• BYU is a member of LDC; membership years: 1998, 2003-2015, 2018-2020
• You can order corpora for BYU classwork, research, theses, etc.
• Always obey distribution conditions, restrictions
• Some corpora require a signed license agreement


Regular expressions (regexes)
• Way to process data:
  • Filter out extraneous data
  • Home in on targeted data
• One of the most important tools in natural language processing (NLP) and corpus linguistics
• Widely used tool for searching, matching, and replacing text

Where do you use regexes?
• Some corpus interfaces
• Command lines: Unix/Linux, findstr in DOS, PowerShell
• Supported to various degrees by many programming/scripting languages
• Some Web browsers


Learning regular expressions
• Documentation, tutorials abound
• Graphical user interfaces (GUIs) to try out your own regexes on texts
  • Some allow you to upload your own texts
  • Some editors support regular expressions
• Alas, various flavors have slightly differing conventions
  • Need to be flexible, willing to consult documentation

Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks

Using regexes
• Need:
  • a pattern
  • a corpus
• Specifying a pattern:
  • delimiters (often slashes) around outside of pattern
    /something/
    /honor|honour/
  • Character ranges, groupings
  • Constraints on number of matches
  • Special characters, variables


Special characters
• What do the following represent:
  *   .   +   ?   / /   [ ]   |   -


Grouping
• Use parentheses
  /hono(u)r/
  /gray|grey/
  /gr(a|e)y/
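To see what the parentheses contribute, the grouping patterns can be tried out in Python's `re` module, which takes the pattern without the surrounding slashes. This is a small sketch; the test words are chosen here for illustration.

```python
import re

words = ["gray", "grey", "groy", "gry"]

# Parentheses limit the scope of | to the enclosed letters.
grouped = [w for w in words if re.fullmatch(r"gr(a|e)y", w)]

# Spelling out both alternatives is equivalent.
spelled_out = [w for w in words if re.fullmatch(r"gray|grey", w)]

# Without parentheses, | splits the whole pattern: "gra" or "ey".
ungrouped = [w for w in ["gra", "ey", "gray"] if re.fullmatch(r"gra|ey", w)]

print(grouped, spelled_out, ungrouped)
```

Note that parentheses alone do not make anything optional: `hono(u)r` still requires the "u" until a quantifier is attached to the group.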


Counting (quantifiers)
• ? means zero or one
• * means zero or more (Kleene star)
• + means one or more
  /honou?r/
  /aie*/
  /ba+/
  /H(ae?|ä)ndel/
  /(a|b|c)*/
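The quantifier patterns on this slide can be checked against candidate strings. The sketch below uses Python's `re.fullmatch`, so a candidate counts only if the whole string fits the pattern; the helper name `full` and the candidate lists are assumptions made for this example.

```python
import re

def full(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

optional_u = full(r"honou?r", ["honor", "honour", "honouur"])
star_e = full(r"aie*", ["a", "ai", "aie", "aieee"])
plus_a = full(r"ba+", ["b", "ba", "baaa"])
handel = full(r"H(ae?|ä)ndel", ["Handel", "Haendel", "Händel", "Hndel"])
abc_star = full(r"(a|b|c)*", ["", "abcabc", "abd"])

for result in (optional_u, star_e, plus_a, handel, abc_star):
    print(result)
```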


Other frequently used symbols
• Dot: represents any single character
  /b.*e/
  /b.+e/
  /b.?e/
• Brackets: match a single character from the enclosed set
  /[abc]/
• Hyphens: specify ranges
  /[A-Z][a-z]/
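A quick way to compare the three dot patterns, and to see brackets and ranges at work, is to run them in Python. This is an illustrative sketch; the word list and sample strings are made up.

```python
import re

words = ["be", "bye", "blue"]

# .* allows any number of characters between b and e (including none).
dot_star = [w for w in words if re.fullmatch(r"b.*e", w)]
# .+ requires at least one character between b and e.
dot_plus = [w for w in words if re.fullmatch(r"b.+e", w)]
# .? allows at most one character between b and e.
dot_opt = [w for w in words if re.fullmatch(r"b.?e", w)]

# Brackets match exactly one character out of the listed set.
in_set = re.findall(r"[abc]", "cabbage")
# Ranges: one upper case letter followed by one lower case letter.
cap_pairs = re.findall(r"[A-Z][a-z]", "Palo Alto")

print(dot_star, dot_plus, dot_opt, in_set, cap_pairs)
```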


Regular Expressions: Disjunctions
• Letters inside square brackets []

Pattern         Matches
[wW]oodchuck    Woodchuck, woodchuck
[1234567890]    Any digit

• Ranges [A-Z]

Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
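The bracket patterns in these tables also answer the earlier woodchuck question. As a sketch in Python: `[wW]` covers the capitalization and an optional final `s` covers the plural. The sample sentence is invented; the three example strings come from the table, where each pattern matches the first qualifying character.

```python
import re

sentence = "Woodchucks chuck; a woodchuck would."

# [wW] handles either case of the first letter; s? makes the plural optional.
hits = re.findall(r"[wW]oodchucks?", sentence)

# re.search returns the first match, as in the table examples.
first_upper = re.search(r"[A-Z]", "Drenched Blossoms").group()
first_lower = re.search(r"[a-z]", "my beans were impatient").group()
first_digit = re.search(r"[0-9]", "Chapter 1: Down the Rabbit Hole").group()

print(hits, first_upper, first_lower, first_digit)
```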

Regular Expressions: Negation in Disjunction
• Negations [^Ss]
  • Caret means negation only when first in []

Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’         I have no exquisite reason
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
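The negation rows can be verified in Python. One flavor difference is worth noting, as far as Python's `re` module is concerned: outside brackets the caret is a start-of-string anchor there, so the literal sequence "a^b" has to be written with a backslash. The variable names below are illustrative.

```python
import re

# First character that is not an upper case letter.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
# First character that is neither S nor s.
not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()
# Inside brackets, a caret that is not first is just a literal caret.
not_e_or_caret = re.search(r"[^e^]", "Look here").group()

# In Python, an unescaped ^ outside brackets is an anchor, so this finds nothing.
unescaped = re.search(r"a^b", "Look up a^b now")
# Escaping the caret matches the literal text.
escaped = re.search(r"a\^b", "Look up a^b now").group()

print(not_upper, not_s, not_e_or_caret, unescaped, escaped)
```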

Regular Expressions: ? * + .

Pattern    Matches                       Examples
colou?r    Optional previous char        color colour
oo*h!      0 or more of previous char    oh! ooh! oooh! ooooh!
o+h!       1 or more of previous char    oh! ooh! oooh! ooooh!
baa+       1 or more of previous char    baa baaa baaaa baaaaa
beg.n      Any single char               begin began begun beg3n

Kleene *, Kleene + (named after Stephen C Kleene)

Regular Expressions: Anchors ^ $

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1 “Hello”
\.$           The end.
.$            The end? The end!

More symbols, more flexibility
• To “protect” special characters, use the backslash: \
  /.*\.txt/
• To anchor the match at the beginning of a string: ^
  /^I\, Nephi/
• To anchor the match at the end of a string: $
  /Amen\.$/
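Escaping and anchoring can be tried out as follows. This is a sketch in Python, with invented file names and text lines. The comma in the slide's pattern needs no backslash, so it is left unescaped here.

```python
import re

files = ["notes.txt", "notes_txt", "a.txt.bak"]

# The backslash makes the dot literal, so "notes_txt" is rejected.
literal_dot = [f for f in files if re.search(r".*\.txt", f)]
# Adding $ also requires ".txt" to come at the very end.
at_end = [f for f in files if re.search(r"\.txt$", f)]

lines = ["I, Nephi, having been born", "And I, Nephi, said",
         "Amen.", "Amen. And"]

# ^ ties the match to the start of the string.
starts = [s for s in lines if re.search(r"^I, Nephi", s)]
# $ ties the match to the end of the string.
ends = [s for s in lines if re.search(r"Amen\.$", s)]

print(literal_dot, at_end, starts, ends)
```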


Metacharacters
• \b: word boundary (a position between characters, not a character itself)
• \w: word character
• \s: whitespace character
• \B: any position that is NOT a word boundary
• \W: any character that is NOT a word character
• \S: any character that is NOT a whitespace character
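Each of these metacharacters can be observed on a short string. The following is a sketch in Python with made-up sample text. Other regex flavors may define `\w` slightly differently.

```python
import re

text = "the other theme, then the end"

# \b on both sides: "the" only as a whole word.
whole_word = re.findall(r"\bthe\b", text)
# \B on both sides: "the" only when buried inside a longer word.
inside = re.findall(r"\Bthe\B", text)

# \w+ groups word characters; the apostrophe is not one of them.
words = re.findall(r"\w+", "o'er the jav'lin")
# \W picks out the non-word characters.
non_word = re.findall(r"\W", "o'er!")

# \s+ matches runs of spaces, tabs, and newlines.
fields = re.split(r"\s+", "a  b\tc\nd")
# \S+ matches runs of anything that is not whitespace.
chunks = re.findall(r"\S+", "a  b\tc")

print(whole_word, inside, words, non_word, fields, chunks)
```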


Advanced topics
• Capture groups
• Lookahead
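As a brief preview of both topics, here is a sketch in Python. A capture group remembers the text its parentheses matched so that it can be reused; a lookahead checks what follows without including it in the match. The sample strings are invented, and the "-ish" example echoes one of the Task #2 phenomena.

```python
import re

# Capture groups: each parenthesized part is remembered by position.
m = re.fullmatch(r"(\w+), (\w+)", "Kleene, Stephen")
surname, given = m.groups()

# \1 and \2 in the replacement refer back to the captured text.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Kleene, Stephen")

# Lookahead (?=...): match a stem only if "ish" plus a word boundary follows,
# without making "ish" part of the match.
stems = re.findall(r"\w+(?=ish\b)", "greenish and tallish but finish")

print(surname, given, swapped, stems)
```

The last stem shows a limit of pattern matching on its own: "finish" fits the pattern even though it is not an "-ish" derivation.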


Recent job ads requiring regex skill
• Oracle: Computational linguist, programmer
• Lionbridge: Linguist, computational linguist
• Appen Butler Hill: Consultant
• The Long Now Foundation: Intern
• Nuance Mobile: NLU automotive vocon speech engineer
• Amazon: text-to-speech linguist
• Measurement, Inc.: student evaluation tools expert
• Infogain.com: language solutions consultant
• Linnaeus University, Växjö: funded PhD
• Webinterpret.com: localization engineer


Examples of regex usage
• Compiling frequency statistics on word usage from a French corpus for a French dictionary that I published.

• Exploring translation differences between two different translations of the Book of Mormon in Farsi, for the Church Scripture Translation team.

• Searching for expressions that need to be simplified in English running text and that can be replaced with Church Basic English vocabulary that we developed (e.g. for "Preach My Gospel").

• Processing the names of several thousand suspected terrorists for identification of Romanized variants, for a DoD contract.

• Specifying Syriac morphological composition for a parser being used in developing, annotating and deploying a corpus of writings of the poet/theologian Ephrem, a joint project with the Maxwell Institute.

• Locating data elements from Web documents in an ontology-based conceptual modeling system that I have been developing with other BYU researchers and which is being used for various applications including biomedical research and decision support systems.

• Identifying the phonotactic sequences found in spoken language and matching them with patterns extant in several natural languages and dialects, for a spoken language ID system.

• Specifying the morphological composition for a parser that I'm using to annotate a corpus of Lushootseed utterances, working along with several Native American tribes who are hoping to save the language from extinction.

• Retrieving, parsing, and using morphological, syntactic, and semantic data from WordNet, a lexical database that I'm using in the lexical access component of a cognitive modeling system.

• Matching personal information (names, affiliations, email addresses, phone numbers, etc.) from Web pages in order to improve recall and precision in person-based information extraction, used recently in a worldwide programming contest called WePS-2.


NLP corpus tasks and regular expressions
• Text preprocessing
  • Sentence segmentation
  • Tokenization: parsing out words from document, data stream, etc.
    • Not as trivial a problem as most people expect!
  • Normalization: standardizing formats (dates, locations, addresses, phone #’s, etc.)
  • Case folding
  • Lemmatization (next lecture)
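Several of these preprocessing steps can be roughed out with regular expressions. The sketch below is deliberately naive and rests on labeled assumptions: sentences end at ., ?, or ! followed by whitespace; tokens are words (with internal straight apostrophes) or single punctuation marks; dates are written month/day/year. All function names are invented for this example.

```python
import re

def split_sentences(text):
    """Naive segmentation: break after ., ?, or ! followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", text.strip())

def tokenize(sentence):
    """Words (allowing internal apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

def normalize_dates(text):
    """Rewrite month/day/year dates (an assumed input format) as YYYY-MM-DD."""
    def to_iso(m):
        month, day, year = m.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", to_iso, text)

raw = "Dr. Smith arrived on 3/14/2015. Was he late? No!"
sentences = split_sentences(raw)
# Case folding: map every token to a single case before counting or lookup.
tokens = [t.casefold() for t in tokenize("Don't stop, y'all!")]
dates = normalize_dates(raw)

print(sentences)
print(tokens)
print(dates)
```

The segmenter wrongly splits after "Dr.", which illustrates why these tasks are less trivial than they look.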


Formal languages
• Symbol: primitive term (like “point” in geometry)
• String: finite sequence of symbols (if a, b, c are symbols, then aabcba is a string). Strings are also called “words”.
• If w is a word, |w| is the length of w (e.g. |aabcba| = 6).
• The empty string will be denoted by ε. |ε| = 0.
• Prefix: any number of initial symbols. Proper prefix: prefix other than the whole string.
• String concatenation: w1 • w2 = w1w2

Formal languages (cont.)
• Alphabet: finite set of symbols
• Language: set of strings (words) from some one alphabet. The empty set Ø and the set {ε} consisting of the empty string are (different) languages.
• Set operations: L1 ∪ L2, L1 ∩ L2, L1 – L2
• L1 • L2 = {w1 • w2 | w1 ∈ L1, w2 ∈ L2}
• Kleene closure: L* = {ε} ∪ L ∪ L² ∪ L³ ∪ …
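For finite languages these operations can be mirrored with Python sets of strings, taking the empty string "" as ε. This is a sketch only: the names `concat` and `kleene_upto` are invented, and because L* is infinite whenever L contains a non-empty string, the closure is cut off at a chosen power n.

```python
def concat(l1, l2):
    """L1 • L2: every word of L1 followed by every word of L2."""
    return {w1 + w2 for w1 in l1 for w2 in l2}

def kleene_upto(lang, n):
    """Finite approximation of L*: the union of L^0, L^1, ..., L^n."""
    result = {""}          # L^0 = {ε}
    power = {""}
    for _ in range(n):
        power = concat(power, lang)   # next power of L
        result |= power
    return result

L1 = {"a", "ab"}
L2 = {"b", ""}

print(concat(L1, L2))
print(L1 | L2, L1 & L2, L1 - L2)      # union, intersection, difference
print(sorted(kleene_upto({"a", "b"}, 2)))

# The empty language and {ε} behave differently under concatenation.
print(concat(set(), L1), concat({""}, L1))
```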

Examples
• Set of words with two consecutive c’s: {a,b,c}*{cc}{a,b,c}*
• Set of all binary strings with exactly two 1’s
• What about recognizing dates? What about only correct dates?
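One possible rendering of these examples as regular expressions is sketched below in Python. The pattern for exactly two 1's and the MM/DD/YYYY date shape are assumptions chosen for illustration. The date pattern recognizes strings that look like dates but still accepts impossible ones, so the correctness check here hands off to the standard `datetime` module.

```python
import re
from datetime import datetime

def has_cc(word):
    """{a,b,c}*{cc}{a,b,c}*: words over a, b, c with two consecutive c's."""
    return re.fullmatch(r"[abc]*cc[abc]*", word) is not None

def exactly_two_ones(bits):
    """Binary strings with exactly two 1's: zeros, 1, zeros, 1, zeros."""
    return re.fullmatch(r"0*10*10*", bits) is not None

# Assumed format MM/DD/YYYY: month 01-12, day 01-31, four-digit year.
DATE_SHAPE = re.compile(r"(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/[0-9]{4}")

def looks_like_date(s):
    """Recognize the date shape only; 02/30/2020 passes this test."""
    return DATE_SHAPE.fullmatch(s) is not None

def is_real_date(s):
    """Shape check plus a calendar check (month lengths, leap years)."""
    if not looks_like_date(s):
        return False
    try:
        datetime.strptime(s, "%m/%d/%Y")
        return True
    except ValueError:
        return False

print(has_cc("abccba"), exactly_two_ones("01010"))
print(looks_like_date("02/30/2020"), is_real_date("02/30/2020"))
```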
