parsing tweets into universal...

1
Parsing Tweets Universal Dependencies into Dataset @ http://tiny.cc/0juzty · Software @ http://tiny.cc/fluzty Yijia Liu 1 · Yi Zhu 2 · Wanxiang Che 1 · Bing Qin 1 · Nathan Schneider 3 · Noah A. Smith 4 1 Harbin Institute of Technology 2 University of Cambridge 3 Georgetown University 4 University of Washington Annotation • Twitter-specific constructions that are not covered by UD guidelines (cf. Sanguinetti et al. 2017 for Italian) Pipeline • overcome noise in the annotation • accurate parsing without sacrificing speed Social Media NLP: domain adaptation and annotated datasets Universal Dependencies (UD): adaptable to different genres and languages Our work: UD v2 on English Social Media • Annotation: Tweebank v2 (4x larger than v1) • Pipeline: Distillation for fast/accurate parsing Introduction Tokenization Tradeoff between preservation of original tweet content and respecting the UD guidelines. Part-of-Speech Conform to UD guidelines in most cases. Use syntactic head’s POS for abbreviations. Dependencies Identify non-syntactic tokens (see above Fig.) discourse for sentiment emoticon, topical hashtag, and truncated word list for referential URL conforming UD Retweet construction is treated as a whole Foster et al. (2010) Stanford Dependencies Tweebank v1 (Kong et al., 2014) FUDG Dependencies Tweebank v2 (UD) URL Yes Yes Yes Ellipsis Not mentioned Not mentioned Yes Listing of entities Not mentioned Not mentioned Yes Parataxis sentences Not mentioned Not mentioned Yes Phrasal abbreviations Not mentioned Not mentioned Our contribution Retweet Yes Yes Our contribution @-mention (reply) Yes Yes Our contribution Hashtag Yes Yes Our contribution Truncated words Not mentioned Not mentioned Our contribution Common in web-text · Common in tweets Twitter-specific Constructions Challenges Annotation Guidelines 0.3% 0.1% 1.0% 0.7% 0.0% 1.0% 2.5% 1.2% 2.4% 0.5% 0.0% 1.0% 2.0% 3.0% sentiment emoticons retweet constructions topical hashtag referential URL truncated word Tokenizer POS tagger System F1 Stanford CoreNLP 97.3 Twokenizer 94.6 Ours biLSTM 98.3 Parser Tweet tokenization: contextual dependent and requires adaption Statistical modeling vs rule-based model • We propose to use bi- LSTM for tokenization and it performs better • We consider the existing POS taggers Rich feature-based (Owoputi et al., 2013) vs neural tagger (Ma and Hovy, 2016) and careful feature engineering still helps Data source: Tweebank v1 + Feb to Jul 2016 Twitter Stream Statistics: 18 people involved 3,550 annotated tweets 4.5 times larger than v1 • POS agreement: 96.6 • Dep. agreement: 88.8 (U) / 84.3 (L) Disagreements: • POS for named entities • Syntactically ambiguous tweets • See our paper for more details System Acc. Stanford CoreNLP 90.6 Owoputi et al., 2013 94.6 Ma and Hovy, 2016 92.5 Annotation: noisy, complicates the parser training • Overcome the noise with ensemble Ensemble is slow. We do distillation and it’s fast and accurate System LAS Kt/s Kong et al. (2014) 76.9 0.3 Dozat et al. (2017) 77.7 1.7 Ballesteros et al. (2015) 75.7 2.3 Ensemble 79.4 0.2 Distillation 77.9 2.3 Pipeline Evaluation Tweebank v2 Tokenization: 98.3, POS tagging: 93.3, UD parsing: 74.0 Proportion of Twitter-specific constructions, left bar: syntactic, right bar: non-syntactic

Upload: others

Post on 19-Aug-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parsing Tweets into Universal Dependenciespeople.cs.georgetown.edu/nschneid/p/twparse2-poster.pdf · Commonin web-text · Common in tweets Twitter-specific Constructions s AnnotationGuidelines

Parsing Tweets Universal Dependenciesinto

Dataset @ http://tiny.cc/0juzty · Software @ http://tiny.cc/fluzty

Yijia Liu1 · Yi Zhu2 · Wanxiang Che1 · Bing Qin1 · Nathan Schneider3 · Noah A. Smith4

1Harbin Institute ofTechnology

2University of Cambridge

3Georgetown University

4University of Washington

Annotation• Twitter-specific constructions that are not

covered by UD guidelines (cf. Sanguinetti et al. 2017 for Italian)

Pipeline• overcome noise in the annotation• accurate parsing without sacrificing speed

Social Media NLP: domain adaptation and annotated datasetsUniversal Dependencies (UD): adaptable to different genres and languagesOur work: UD v2 on English Social Media• Annotation: Tweebank v2 (4x larger than v1)• Pipeline: Distillation for fast/accurate parsingIn

tro

du

ctio

n

TokenizationTradeoff between preservation of original tweet content and respecting the UD guidelines.Part-of-SpeechConform to UD guidelines in most cases.Use syntactic head’s POS for abbreviations.DependenciesIdentify non-syntactic tokens (see above Fig.)• discourse for sentiment emoticon, topical

hashtag, and truncated word• list for referential URL conforming UD• Retweet construction is treated as a whole

Foster et al. (2010)Stanford

Dependencies

Tweebank v1 (Kong et al., 2014) FUDG

Dependencies

Tweebank v2 (UD)

• URL Yes Yes Yes• Ellipsis Not mentioned Not mentioned Yes• Listing of entities Not mentioned Not mentioned Yes• Parataxis

sentencesNot mentioned Not mentioned Yes

• Phrasal abbreviations

Not mentioned Not mentioned Our contribution

• Retweet Yes Yes Our contribution• @-mention (reply) Yes Yes Our contribution• Hashtag Yes Yes Our contribution• Truncated words Not mentioned Not mentioned Our contribution

Common in web-text · Common in tweets

Twitter-specific ConstructionsC

hal

len

ges

Annotation Guidelines

0.3% 0.1%1.0% 0.7%

0.0%

1.0%

2.5%

1.2%

2.4%

0.5%

0.0%

1.0%

2.0%

3.0%

sentiment emoticons retweet constructions topical hashtag referential URL truncated word

Tokenizer POS tagger

System F1Stanford CoreNLP 97.3

Twokenizer 94.6Ours biLSTM 98.3

Parser• Tweet tokenization:

contextual dependentand requires adaption

• Statistical modeling vsrule-based model

• We propose to use bi-LSTM for tokenizationand it performs better

• We consider theexisting POS taggers

• Rich feature-based(Owoputi et al., 2013)vs neural tagger (Maand Hovy, 2016) andcareful featureengineering still helps

Data source: Tweebank v1 +Feb to Jul 2016 Twitter StreamStatistics:• 18 people involved• 3,550 annotated tweets• 4.5 times larger than v1• POS agreement: 96.6• Dep. agreement: 88.8 (U) / 84.3 (L)Disagreements:• POS for named entities• Syntactically ambiguous tweets• See our paper for more details

System Acc.Stanford CoreNLP 90.6

Owoputi et al., 2013 94.6

Ma and Hovy, 2016 92.5

• Annotation: noisy,complicates the parsertraining

• Overcome the noisewith ensemble

• Ensemble is slow. Wedo distillation and it’sfast and accurate

System LAS Kt/sKong et al. (2014) 76.9 0.3

Dozat et al. (2017) 77.7 1.7Ballesteros et al. (2015) 75.7 2.3

Ensemble 79.4 0.2Distillation 77.9 2.3Pipeline Evaluation

Tweebank v2

Tokenization: 98.3, POS tagging: 93.3, UD parsing: 74.0

Proportion of Twitter-specific constructions,left bar: syntactic, right bar: non-syntactic