

Lexical Normalization for Dutch Social Media Texts
Rob van der Goot & Gertjan van Noord

Lexical Normalization
nee ! :-D kzal nog es vriendenlijk doen lol
nee ! :-D ik zal nog eens vriendelijk doen lol

tgaat goed , vdg rustig aaan .
Het gaat goed , vandaag rustig aan .

social ppl r anoying
social people are annoying

aaah buenoo esqe digo pa qe madrugara este jajaja
ah bueno es que digo para qué madrugará este jajaja

nekomu je sarkazm detektor crknu
nekomu je sarkazem detektor crknil

Performance Per Corpus

[Figure: ERR (y-axis, 0.2 to 0.7) plotted against train size in words (x-axis, 0 to 50,000) for the corpora GhentNorm, TweetNorm, LexNorm1.2, LexNorm2015, Janes-Norm, ReLDI-hr, and ReLDI-sr.]

Is normalization for Dutch more difficult?
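The poster does not spell out ERR. Assuming it stands for the error reduction rate, that is, word-level accuracy rescaled against a baseline that leaves every token unchanged, a minimal sketch of the computation could look like this. The function names and the toy predictions are illustrative and not taken from the poster.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted form equals the gold form."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def error_reduction_rate(original, predicted, gold):
    """ERR = (acc_system - acc_baseline) / (1 - acc_baseline).

    Assumed definition: the baseline copies the original tokens unchanged.
    """
    baseline = accuracy(original, gold)
    if baseline == 1.0:
        raise ValueError("baseline is already perfect; ERR is undefined")
    return (accuracy(predicted, gold) - baseline) / (1.0 - baseline)


# Toy example built from the English sentence on the poster.
original = ["social", "ppl", "r", "anoying"]
gold = ["social", "people", "are", "annoying"]
predicted = ["social", "people", "r", "annoying"]  # hypothetical system output
err = error_reduction_rate(original, predicted, gold)
```

Under this definition a system that changes nothing scores 0, a perfect system scores 1, and a system that introduces more errors than it fixes scores below 0.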

MoNoise

Rob van der Goot and Gertjan van Noord. MoNoise: Modeling Noise Using a Modular Normalization System. In CLIN Journal 2017

Tokenizer

Generation

Candidates proposed by each generation module (original word ↦ candidates):

Orig. Word: scoren ↦ scoren; #fail ↦ #fail
LookupList: alst ↦ als het; hahahaha ↦ haha
word2vec: mss ↦ mssn, missch, misschien; dinnetje ↦ dinnie, vriendinnetje, dinnetjes
Aspell: grapjee ↦ grapje, grapjes, greepje; felicteren ↦ feliciteren, flecteren, fluctueren
Split: kheb ↦ k heb; datis ↦ dat is
word.*: ech ↦ echt, echo, echeltje; waarsch ↦ waarschijnlijk, waarschuw, waarschuwt
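The modular generation step can be pictured as several independent functions that each propose candidates for one word, with the results merged. The sketch below covers only three of the modules (Orig. Word, LookupList, Split) and relies on tiny hand-made resources; `LOOKUP`, `VOCAB`, and all function names are hypothetical stand-ins, not MoNoise's actual data or API.

```python
# Hypothetical toy resources; a real system would derive these from
# training data and a large dictionary.
LOOKUP = {"alst": ["als het"], "hahahaha": ["haha"]}
VOCAB = {"k", "heb", "dat", "is", "als", "het"}


def gen_orig(word):
    """Always keep the original word as a candidate."""
    return [word]


def gen_lookup(word):
    """Propose replacements stored for this exact word."""
    return list(LOOKUP.get(word, []))


def gen_split(word):
    """Propose every two-way split whose halves are both in VOCAB."""
    return [
        f"{word[:i]} {word[i:]}"
        for i in range(1, len(word))
        if word[:i] in VOCAB and word[i:] in VOCAB
    ]


def generate(word):
    """Merge the candidates of all modules, dropping duplicates in order."""
    seen, candidates = set(), []
    for module in (gen_orig, gen_lookup, gen_split):
        for candidate in module(word):
            if candidate not in seen:
                seen.add(candidate)
                candidates.append(candidate)
    return candidates
```

With these toy resources, `generate("kheb")` yields the original word plus the split `k heb`, mirroring the Split example on the poster. Keeping each module as a separate function makes it easy to add or disable one without touching the others.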

Ranking

Features: isOrig, N-grams Wiki, Word2vec dist., N-grams Twitter, Aspell dist., dict, isSplit, length, word.*, containsAlpha, origFeats

Random Forest Classifier
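To illustrate how a candidate becomes input for the ranking classifier, the sketch below computes four of the listed features for an (original, candidate) pair. The poster gives only the feature names, so their exact definitions here are guesses from those names: in particular, `length` is taken to be the candidate's character count. The resource-based features (n-grams, word2vec, Aspell, dict) and the random forest itself are left out.

```python
def extract_features(original, candidate):
    """Return a few simple ranking features for one candidate.

    Feature definitions are assumptions inferred from the feature names.
    """
    return {
        # Candidate is identical to the word as it was written.
        "isOrig": candidate == original,
        # Candidate is the original with one or more spaces inserted.
        "isSplit": " " in candidate and candidate.replace(" ", "") == original,
        # Number of characters in the candidate (assumed meaning).
        "length": len(candidate),
        # Candidate contains at least one alphabetic character.
        "containsAlpha": any(ch.isalpha() for ch in candidate),
    }


feats_split = extract_features("kheb", "k heb")
feats_orig = extract_features("#fail", "#fail")
```

A classifier would receive one such feature vector per candidate and return a score, and the highest-scoring candidate for each word would be chosen as its normalization.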

New Dataset
• Annotate capitalization consistently
• Annotate tokenization in a separate layer
• Do not include phrasal abbreviations (‘lol’ ↦ ‘laughing out loud’)
• Make publicly available
• No Flemish ↦ Dutch
• Annotate POS dev/test data
• Annotate categories? [a]
• Annotate Universal Dependencies?

[a] Rob van der Goot, Rik van Noord, and Gertjan van Noord. A taxonomy for in-depth evaluation of normalization for user generated content. In Proceedings of LREC 2018

Beneficial for Parsing?

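Example input: new pix comming tomorroe

Normalization candidates per word (probabilities in parentheses), arranged as a lattice over positions 0 to 4:
new (1.0)
pix (0.6), pics (0.3), pictures (0.1)
comming (0.3), coming (0.6), common (0.1)
tomoroe (0.3), tomorrow (0.5), more (0.2)

[Figure: constituency tree over "new pictures coming tomorrow" with node labels S, VB, NP, NN, VBG, NNS, JJ.]

[Figure: dependency tree over "new pix comming tomoroe" with relations root, amod, nsubj, nmod.]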
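The candidate lists and probabilities shown for "new pix comming tomoroe" can be stored as a simple lattice: one list of (word, probability) pairs per position. The sketch below enumerates all paths and ranks them by the product of their probabilities. Treating the positions as independent is an assumption made for illustration only. The cited papers describe how the parser actually consumes the candidates, which this sketch does not attempt to reproduce.

```python
from itertools import product
from math import prod

# Candidates and probabilities as printed on the poster.
LATTICE = [
    [("new", 1.0)],
    [("pix", 0.6), ("pics", 0.3), ("pictures", 0.1)],
    [("comming", 0.3), ("coming", 0.6), ("common", 0.1)],
    [("tomoroe", 0.3), ("tomorrow", 0.5), ("more", 0.2)],
]


def ranked_paths(lattice):
    """Return (score, words) for every path, highest score first.

    The score is the product of per-word probabilities (assumed independence).
    """
    paths = []
    for combo in product(*lattice):
        words = [word for word, _ in combo]
        score = prod(prob for _, prob in combo)
        paths.append((score, words))
    return sorted(paths, key=lambda item: item[0], reverse=True)


paths = ranked_paths(LATTICE)
best_score, best_words = paths[0]
```

With the poster's numbers the top path keeps "pix" unchanged, because it has the highest probability at that position. Committing to a single best path can therefore lock in a poor choice, which is one reason to pass the remaining alternatives on to the parser as well.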

Rob van der Goot and Gertjan van Noord. Parser Adaptation for Social Media by Integrating Normalization. In Proceedings of ACL 2017

Rob van der Goot and Gertjan van Noord. Modeling Input Uncertainty in A Neural Network Dependency Parser. In Proceedings of EMNLP 2018, Brussels

Try it! www.let.rug.nl/rob/monoise