MASC The Manually Annotated Sub-Corpus of American English Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, Rebecca Passonneau


Post on 03-Jan-2016


MASC: The Manually Annotated Sub-Corpus of American English

Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, Rebecca Passonneau


MASC

• Manually Annotated Sub-Corpus
• NSF-funded project to provide a sharable, reusable annotated resource with rich linguistic annotations
• Vassar, ICSI, Columbia, Princeton
• Texts from diverse genres
• Manual annotations or manually validated annotations for multiple levels
  – WordNet senses
  – FrameNet frames and frame elements
  – Shallow parses
  – Named entities
• Enables linking WordNet senses and FrameNet frames into more complex semantic structures
• Enriches semantic and pragmatic information
• Detailed inter-annotator agreement measures


Contents

• Texts drawn from the Open ANC
  – Several genres
    • Written (travel guides, blog, fiction, letters, newspaper, non-fiction, technical, journal, government documents)
    • Spoken (face-to-face, academic, telephone)
  – Free of license restrictions, redistributable
  – Download from ANC website
• All MASC data and annotations will be freely downloadable


Annotation Process

• Smaller portions of the sub-corpus manually annotated for specific phenomena
  – Maintain representativeness
  – Include as many annotations of different types as possible
• Apply (semi-)automatic annotation techniques to determine the reliability of their results
• Study inter-annotator agreement on manually produced annotations
  – Determine benchmark of accuracy
  – Fine-tune annotator guidelines
• Consider whether accurate annotations for one phenomenon can improve performance of automatic annotation systems for another
  – E.g., validated WN sense tags and noun chunks may improve automatic semantic role labeling


Process (continued)

• Apply iterative process to maximize performance of automatic taggers
  – Manual annotation
  – Retrain automatic annotation software
• Improved annotation software can later be applied to the entire ANC
  – Provide more accurate automatically produced annotation of full corpus


Composition Relative to Whole OANC

[Diagram of corpus composition; region labels:]
• Genre-representative core with validated entity, shallow parse annotations
• WSJ with PropBank, NomBank, PTB, TimeBank and PDTB annotations
• Training examples
• FrameNet and WordNet full annotation
• WordNet annotations


MASC Core

• Includes
  – 25K fully annotated (“all words”) for FrameNet frames and WordNet senses
  – ~40K corpus annotated by Unified Linguistic Annotation project
    • PropBank, NomBank, Penn Treebank, Penn Discourse Treebank, TimeBank
  – Small subset of WSJ with many annotations
• Other annotations rendered into GrAF for compatibility


Representation

• ISO TC37 SC4 Linguistic Annotation Framework
  – Graph of feature structures (GrAF)
  – Isomorphic to other feature structure-based representations (e.g. UIMA CAS)
• Each annotation in a separate stand-off document linked to primary data or other annotations
• Merge annotations with ANC API
  – Output in any of several formats
    • XML
    • non-XML for use with systems such as NLTK and concordancing tools
    • UIMA CAS
    • Input to GraphViz
    • …
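
The stand-off idea above can be illustrated with a minimal Python sketch. It is not the GrAF format and does not use the ANC API; the `Annotation` class, the layer names, the sample sentence, and the `merge` helper are all hypothetical, chosen only to show annotations kept apart from the primary data and linked to it by character offsets.

```python
from dataclasses import dataclass, field

# Primary data is never modified; annotations point into it by offsets.
PRIMARY = "Nancy visited Vassar."


@dataclass
class Annotation:
    layer: str   # hypothetical layer name, e.g. "ne" or "chunk"
    start: int   # inclusive character offset into the primary data
    end: int     # exclusive character offset into the primary data
    features: dict = field(default_factory=dict)


# Each layer lives in its own "document" (here, simply a separate list).
named_entities = [
    Annotation("ne", 0, 5, {"type": "person"}),
    Annotation("ne", 14, 20, {"type": "org"}),
]
chunks = [
    Annotation("chunk", 0, 5, {"cat": "NP"}),
    Annotation("chunk", 6, 13, {"cat": "VP"}),
    Annotation("chunk", 14, 20, {"cat": "NP"}),
]


def merge(*layers):
    """Combine stand-off layers into one list ordered by text position."""
    merged = [a for layer in layers for a in layer]
    return sorted(merged, key=lambda a: (a.start, a.end, a.layer))


def text_of(annotation, primary=PRIMARY):
    """Recover the span of primary text that an annotation covers."""
    return primary[annotation.start:annotation.end]


for a in merge(named_entities, chunks):
    print(a.layer, repr(text_of(a)), a.features)
```

Because every layer refers to the same untouched text, a new layer (for example, sense tags) could be added as another list without rewriting the existing ones.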


WordNet annotation

• Updating WSD systems to use WordNet version 3.0
  – Pedersen’s SenseRelate
  – Mihalcea et al.’s SenseLearner
• Apply to automatically assign WN sense tags to all content words (nouns, verbs, adjectives, and adverbs) in the entire OANC
• Manually validate a set of words from whole OANC
• Manually validate all words in 25K FN-annotated subset


FrameNet Annotation

• Full manual annotation of 25K in FrameNet full-text manner
• Application of automatic semantic role labeling software over entire MASC
• Improve automatic semantic role labeling (ASRL)
  – Use active learning
• ASRL system results evaluated to determine where the most errors occur
• Extra manual annotation done to improve performance
  – Draw from entire OANC, possibly even other sources for examples
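
The slides do not say which active-learning strategy is intended. As one common possibility, the sketch below uses uncertainty sampling: items whose predicted label distribution has the highest entropy are sent to annotators first. The item ids, the probabilities, and the function names are hypothetical and not taken from any ASRL system.

```python
import math


def entropy(probs):
    """Shannon entropy (in bits) of a label probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def select_for_annotation(predictions, budget):
    """Return the ids of the `budget` items the tagger is least sure about.

    `predictions` maps a (hypothetical) item id to the tagger's probability
    distribution over candidate labels for that item.
    """
    ranked = sorted(predictions,
                    key=lambda item: entropy(predictions[item]),
                    reverse=True)
    return ranked[:budget]


# Made-up tagger output for four sentences and three candidate labels.
predictions = {
    "s1": [0.98, 0.01, 0.01],  # confident
    "s2": [0.40, 0.35, 0.25],  # very uncertain
    "s3": [0.55, 0.45, 0.00],  # uncertain between two labels
    "s4": [0.90, 0.05, 0.05],  # fairly confident
}

print(select_for_annotation(predictions, budget=2))  # ['s2', 's3']
```

In an iterative setup, the selected items would be annotated manually, added to the training data, and the tagger retrained before the next round of selection.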


Alignment of Lexical Resources

• Concurrent project investigating how and to what extent WordNet and FrameNet can be aligned

• MASC annotations of 25K for FrameNet frames and frame elements and WordNet senses provide a ready-made testing ground


Interannotator agreement

• Use a suite of metrics that measure different characteristics
  – Interannotator agreement coefficients such as Cohen’s Kappa
  – Average F-measure to determine proportion of the annotated data all annotators agree on
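
For reference, Cohen’s Kappa for two annotators is (observed agreement − chance agreement) / (1 − chance agreement). The Python sketch below computes it for made-up sense labels; the label names and data are illustrative only and are not MASC results. The F-measure variant is not covered here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Ten tokens tagged by two annotators with one of two hypothetical senses.
annotator_a = ["s1"] * 5 + ["s2"] * 5
annotator_b = ["s1", "s1", "s1", "s1", "s2", "s1", "s2", "s2", "s2", "s2"]

# Observed agreement is 0.8 and chance agreement is 0.5, so Kappa is 0.6.
print(round(cohens_kappa(annotator_a, annotator_b), 2))  # 0.6
```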


IAA

• Determine impact of these two measures
  – Consider the relation between the agreement coefficient values / F-measure and potential users of the planned annotations
• Simultaneous investigations of interannotator agreement and measurable results of using different annotations of the same data provide a stronger picture of the integrity of annotated data (Passonneau et al. 2005; Passonneau et al. 2006)


Overall Goal

• Continually augment MASC with contributed annotations from the research community
• Discourse structure, additional entities, events, opinions, etc.
• Distribution of effort and integration of currently independent resources such as the ANC, WordNet, and FrameNet will enable progress in resource development
  – Lower cost
  – No duplication of effort
  – Greater degree of accuracy and usability
  – Harmonization


Conclusion

• MASC will provide a much-needed resource for computational linguistics research aimed at the development of robust language processing systems

• MASC’s availability should have a major impact on the speed with which similar resources can be reliably annotated

• MASC will be the largest semantically annotated corpus of English in existence

• WN and FN annotation of the MASC will immediately create a massive multi-lingual resource network
  – Both WN and FN linked to corresponding resources in other languages
  – No existing resource approaches this scope