TRANSCRIPT
The Creation of a Corpus of English Metalanguage
Shomir Wilson
Carnegie Mellon University (current affiliation); University of Maryland (where the research was performed)
ACL 2012 – 10 July 2012
Speaking or Writing About Language: Observations
When we write or speak about language (to discuss words, phrases, syntax, meaning…):
– We often convey very direct, salient information about language.
– We tend to be instructive, and we (often) try to be easily understood.
– We do this to clarify the meaning of words or phrases we (or our audience) use.
2012-07-10 Shomir Wilson - ACL 2012
Examples
1) This is sometimes called tough love.
2) I wrote "meet outside" on the chalkboard.
3) Has is a conjugation of the verb have.
4) The button labeled go was illuminated.
5) That bus, was its name 61C?
6) Mississippi is fun to spell.
7) He said, "Dinner is served."
And Yet…
Metalanguage (sometimes described as self-referential language, or the "mention" part of the use-mention distinction) should be fertile ground for language technologies. However:
– Metalinguistic constructions have atypical properties.
– Metalanguage defies trends in language (e.g., in syntax, word senses, topicality) that language technologies usually exploit.
What Goes Wrong
(ROOT
  (S
    (NP (NP (DT The) (NN button))
      (VP (VBN labeled)
        (S (VP (VB go)))))
    (VP (VBD was)
      (VP (VBN illuminated)))
    (. .)))
Dialog System: Where do you wish to depart from?
User: Arlington.
Dialog System: Departing from Allegheny West. Is this right?
User: No, I said "Arlington".
Dialog System: Please say where you are leaving from.

The word "bank" can refer to many things.
bank: n|1| a financial institution that accepts deposits and channels the money into lending activities
Dialog System: Let's Go! (Carnegie Mellon University)
Parser: Stanford Parser (Stanford University)
Word Sense Disambiguation: IMS (National University of Singapore)
Metalanguage and Mentioned Language
The goal of this project was to provide a basis for the study of metalanguage (i.e., language about language) in English.
A better understanding of metalanguage will enable us to construct language technologies that (at worst) can cope with it and (at best) exploit the information it conveys.
To make the problem tractable, the focus was on mentioned language: instances of metalanguage that can be explicitly delimited within a sentence.
Previous Efforts
• Two proof-of-concept corpora preceded this one:
– A "pilot corpus" established that Wikipedia was a fertile source of mentioned language [1].
– A "combined cues corpus" validated the combination of lexical and stylistic cues to gather candidate instances [2].
• Anderson et al. [3] gathered a metalanguage corpus of human dialog, but it lacked word- or phrase-level annotations and contained substantial noise.
• Many have discussed mentioned language or metalanguage in purely theoretical terms (Saka, Cappelen, Lepore, Maier, Geach, Partee, et al.).
[1] Shomir Wilson. "Distinguishing Use and Mention in Natural Language." NAACL HLT Student Research Workshop, 2010.
[2] Shomir Wilson. "In Search of the Use-Mention Distinction and Its Impact on Language Processing Tasks." CICLing 2011.
[3] M. L. Anderson, Yoshi A. Okamoto, Darsana Josyula, and Donald Perlis. "The Use-Mention Distinction and Its Importance to HCI." EDILOG 2002.
Mentioned Language: A Definition
The following definition was used in previous efforts to build pilot corpora of mentioned language:
For T a token or a set of tokens in a sentence, if T is produced to draw attention to a property of the token T or the type of T, then T is an instance of mentioned language.
Example: "The cat is on the mat" is a sentence.
New in the present effort: an equivalent substitution-based "labeling rubric" was used to produce consistent results. The rubric appears in the paper.
Corpus Creation: Overview
• A randomly selected subset of English Wikipedia articles was chosen as a text source.
• To make human annotation tractable, sentences were examined only if they fit a combination of cues:
The term chip has a similar meaning.
Metalinguistic cue: "term". Stylistic cue: italic text, bold text, or quoted text.
• Mechanical Turk did not work well for labeling.
• Candidate instances were labeled by a human annotator. A subset was labeled by multiple annotators to verify the reliability of the corpus.
Collection and Filtering
1) 5,000 Wikipedia articles (in HTML)
2) Article section filtering and sentence tokenizer → main body text of the articles
3) Stylistic cue filter → 17,753 sentences containing 25,716 instances of highlighted text
4) Metalinguistic cue proximity filter → 1,914 sentences containing 2,393 candidate instances
   (The filter used 8,735 metalinguistic cues, gathered by a WordNet crawl seeded with 23 hand-selected metalinguistic cues.)
5) Human annotator → 629 instances of mentioned language and 1,764 negative instances
6) Random selection procedure for 100 instances → 100 instances labeled by three additional human annotators
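The stylistic and proximity filters can be sketched as follows. This is a minimal illustration, assuming quoted or HTML-tagged text marks the stylistic cue; the small cue lexicon and three-token window are stand-ins for the 8,735 cues and the actual settings used for the corpus.

```python
import re

# Illustrative cue lexicon; the corpus work used 8,735 cues crawled from WordNet.
CUES = {"word", "term", "name", "called", "phrase", "meaning"}
WINDOW = 3  # assumed proximity window, in tokens

def highlighted_spans(sentence):
    """Stylistic cue filter: find quoted, italic, or bold spans."""
    pattern = r'"([^"]+)"|<i>(.*?)</i>|<b>(.*?)</b>'
    return [next(g for g in m.groups() if g) for m in re.finditer(pattern, sentence)]

def candidate_instances(sentence):
    """Proximity filter: keep highlighted spans with a cue within WINDOW tokens."""
    candidates = []
    for span in highlighted_spans(sentence):
        before, _, after = sentence.partition(span)
        context = before.split()[-WINDOW:] + after.split()[:WINDOW]
        context_words = {re.sub(r"\W", "", w).lower() for w in context}
        if context_words & CUES:
            candidates.append(span)
    return candidates

print(candidate_instances('The term "chip" has a similar meaning.'))  # → ['chip']
```

Sentences passing both filters would then go to the human annotator.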
Corpus Composition: Frequent Leading and Trailing Words
These were the most common words to appear in the three words before and after instances of mentioned language.

Before instances:
Rank  Word            Freq.  Precision (%)
1     call (v)        92     80
2     word (n)        68     95.8
3     term (n)        60     95.2
4     name (n)        31     67.4
5     use (v)         17     70.8
6     know (v)        15     88.2
7     also (rb)       13     59.1
8     name (v)        11     100
9     sometimes (rb)  9      81.9
10    Latin (n)       9      69.2

After instances:
Rank  Word          Freq.  Precision (%)
1     mean (v)      31     83.4
2     name (n)      24     63.2
3     use (v)       11     55
4     meaning (n)   8      57.1
5     derive (v)    8      80
6     refers (n)    7      87.5
7     describe (v)  6      60
8     refer (v)     6      54.5
9     word (n)      6      50
10    may (md)      5      62.5
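The window counts above can be produced with a simple tally over three-token windows around each instance. The tokenized sentence below is illustrative (not corpus data), and lemmatization and POS tagging are omitted for brevity.

```python
from collections import Counter

def window_words(tokens, start, end, width=3):
    """Words in the three-token windows before and after an instance span
    (tokens[start:end] is the mentioned language)."""
    before = tokens[max(0, start - width):start]
    after = tokens[end:end + width]
    return before, after

# Illustrative tokenized sentence; "chip" (index 2) is the mentioned language.
tokens = ["The", "term", "chip", "has", "a", "similar", "meaning", "."]
before_counts, after_counts = Counter(), Counter()
before, after = window_words(tokens, 2, 3)
before_counts.update(w.lower() for w in before)
after_counts.update(w.lower() for w in after)
print(before_counts.most_common())  # tally of leading-window words
```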
Corpus Composition: Categories
Category: Words as Words (WW). Frequency: 438.
Examples:
– The IP Multimedia Subsystem architecture uses the term transport plane to describe a function roughly equivalent to the routing control plane.
– The material was a heavy canvas known as duck, and the brothers began making work pants and shirts out of the strong material.

Category: Names as Names (NN). Frequency: 117.
Examples:
– Digeri is the name of a Thracian tribe mentioned by Pliny the Elder, in The Natural History.
– Hazrat Syed Jalaluddin Bukhari's descendants are also called Naqvi al-Bukhari.

Category: Spelling and Pronunciation (SP). Frequency: 48.
Examples:
– The French changed the spelling to bataillon, whereupon it directly entered into German.
– Welles insisted on pronouncing the word apostles with a hard t.

Category: Other Mentioned Language (OM). Frequency: 26.
Examples:
– He kneels over Fil, and seeing that his eyes are open whispers: brother.
– During Christmas 1941, she typed The end on the last page of Laura.

Category: [Not Mentioned Language (XX)]. Frequency: 1,764.
Examples:
– NCR was the first U.S. publication to write about the clergy sex abuse scandal.
– Many Croats reacted by expelling all words in the Croatian language that had, in their minds, even distant Serbian origin.
Categories were observed through application of the substitution rubric.
Inter-Annotator Agreement
Code  Frequency  Kappa
WW    17         0.38
NN    17         0.72
SP    16         0.66
OM    4          0.09
XX    46         0.74
Three additional expert annotators labeled 100 instances selected randomly with quotas from each category.
For mention vs. non-mention labeling, the kappa statistic was 0.74. Kappa between the primary annotator and the "majority voter" of the rest was 0.90.
These statistics suggest that mentioned language can be labeled fairly consistently, but the categories are fluid.
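Cohen's kappa, the agreement statistic reported above, can be computed as follows; the label sequences here are illustrative, not the corpus annotations.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two annotators' label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n  # raw agreement rate
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each annotator's label distribution.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two annotators labeling mention (M) vs. non-mention (X):
print(cohen_kappa(["M", "M", "X", "X"], ["M", "X", "X", "X"]))  # → 0.5
```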
Discussion
• A core set of metalinguistic cues appeared frequently, followed by a long, thin tail.
– A core metalinguistic vocabulary seems to exist.
– The most popular metalinguistic cues were highly correlated with mentioned language.
• Recurring patterns were observed in how metalinguistic cues related to mentioned language.
– Noun apposition often occurs between a cue noun and mentioned language: "Sometimes the term scramble crossing is used."
– Mentioned language tends to appear in appropriate semantic roles for cue verbs: "This precipitation is called sleet."
Future Directions
• The corpus is online (URL on next slide) and available for use under a CC BY-SA 3.0 license.
• Next: automatic detection of mentioned language. It appears to be feasible.
• Potential applications to language technologies:
– Dialog systems
– Language instruction
– Dictionary-building tools
– Source attribution
– Automated typesetting and copyediting
Questions?
Shomir Wilson – [email protected]
http://www.cs.cmu.edu/~shomir
The corpus is available at:
http://www.cs.cmu.edu/~shomir/um_corpus.html
Special thanks to: Don Perlis, Tim Oates
Appendix
The Rubric
Rubric
Given S a sentence and X a copy of a linguistic entity in S:
1) Create X': the phrase "that [item]", where [item] is the appropriate term for linguistic entity X in the context of S.
2) Create S': copy S and replace the occurrence of X with X'.
3) Create W: the set of truth conditions of S.
4) Create W': the set of truth conditions of S', assuming that X' in S' is understood to refer deictically to X.
5) Compare W and W'. If they are equal, X is mentioned language in S; otherwise, X is not mentioned language in S.
Positive example
S: Spain is the name of a European country. X: Spain.
1) X': that name
2) S': That name is the name of a European country.
3) W: Stated briefly, Spain is the name of a European country.
4) W': Stated briefly, Spain is the name of a European country.
5) W and W' are equal, so Spain is mentioned language in S.

Negative example
S: Spain is a European country. X: Spain.
1) X': that name
2) S': That name is a European country.
3) W: Stated briefly, Spain is a European country.
4) W': Stated briefly, the name Spain is a European country.
5) W and W' are not equal, so Spain is not mentioned language in S.
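Steps 1 and 2 of the rubric are mechanical substitutions and can be sketched in code; choosing [item] and comparing truth conditions (steps 3-5) remain human judgments, so they appear here only as comments. The default item value is an assumption for this example.

```python
def substitute(sentence, x, item="name"):
    """Steps 1-2 of the rubric: build X' = "that [item]" and S' = S with X
    replaced by X'. Steps 3-5, constructing and comparing the truth
    conditions W and W', require human judgment and are not automated."""
    x_prime = f"that {item}"
    s_prime = sentence.replace(x, x_prime, 1)
    if sentence.startswith(x):  # recapitalize if X opened the sentence
        s_prime = s_prime[0].upper() + s_prime[1:]
    return x_prime, s_prime

x_prime, s_prime = substitute("Spain is the name of a European country.", "Spain")
print(s_prime)  # → That name is the name of a European country.
```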
"Mention Word" Collection via WordNet
For each of 23 mention words from the previous ("combined cues") corpus:
1) A human annotator found its most general linguistically significant hypernym.
2) All descendants of the hypernym were gathered.
Example, with seed mention word "term": term.n.01 generalizes upward through word.n.01 to language unit.n.01; the still more general part.n.01 was rejected by the annotator (marked x). Automated mass-collection of the hyponyms of language unit.n.01 then gathered synsets such as word.n.01, term.n.01, syllable.n.01, and anagram.n.01.
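The two steps above can be sketched as a hyponym closure. The toy hypernym graph below stands in for WordNet itself (the actual crawl queried WordNet), with synset names following WordNet's conventions.

```python
# Toy fragment of WordNet's hyponym relation; the real crawl read these
# edges from WordNet rather than a hand-written table.
HYPONYMS = {
    "language_unit.n.01": ["word.n.01", "syllable.n.01"],
    "word.n.01": ["term.n.01", "anagram.n.01"],
}

def collect_descendants(synset):
    """Step 2: gather all descendants of the chosen hypernym (depth-first)."""
    found, stack = [], [synset]
    while stack:
        for child in HYPONYMS.get(stack.pop(), []):
            found.append(child)
            stack.append(child)
    return found

# Step 1 (human): seed word "term" generalizes to language_unit.n.01.
print(sorted(collect_descendants("language_unit.n.01")))
# → ['anagram.n.01', 'syllable.n.01', 'term.n.01', 'word.n.01']
```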
Corpus Composition: Cumulative Coverage of Top Words