
Lexical Normalization for Dutch Social Media Texts
Rob van der Goot & Gertjan van Noord

Lexical Normalization

nee ! :-D kzal nog es vriendenlijk doen lol
nee ! :-D ik zal nog eens vriendelijk doen lol

tgaat goed , vdg rustig aaan .
Het gaat goed , vandaag rustig aan .

social ppl r anoying
social people are annoying

aaah buenoo esqe digo pa qe madrugara este jajaja
ah bueno es que digo para que madrugara este jajaja

nekomu je sarkazm detektor crknu
nekomu je sarkazem detektor crknil

Performance Per Corpus

[Figure: ERR (y-axis, 0.2 to 0.7) plotted against train size in words (x-axis, 0 to 50,000), one line per corpus: GhentNorm, TweetNorm, LexNorm1.2, LexNorm2015, Janes-Norm, ReLDI-hr, ReLDI-sr.]
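The y-axis of the performance plot is labelled ERR. The slide does not expand the abbreviation; the sketch below assumes it stands for error reduction rate, measured against a baseline that leaves every word unchanged. The function name and the example accuracies are hypothetical and are not read off the plot.

```python
def error_reduction_rate(system_accuracy: float, baseline_accuracy: float) -> float:
    """Share of the baseline's errors that the system removes.

    Assumed definition: (system - baseline) / (1 - baseline), where the
    baseline is the accuracy obtained by leaving every word as it is.
    """
    if baseline_accuracy >= 1.0:
        raise ValueError("baseline is already perfect; ERR is undefined")
    return (system_accuracy - baseline_accuracy) / (1.0 - baseline_accuracy)


# Hypothetical accuracies, for illustration only
err = error_reduction_rate(system_accuracy=0.96, baseline_accuracy=0.90)
```

Under this definition a value of 0 means no improvement over leaving the text as it is, and a value of 1 means every remaining error was fixed.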

Is normalization for Dutch more difficult?

MoNoise

Rob van der Goot and Gertjan van Noord. MoNoise: Modeling Noise Using a Modular Normalization System. In CLIN Journal 2017

Tokenizer

Generation

Module       Input        Candidates
Orig. Word   scoren       scoren
             #fail        #fail
LookupList   alst         als het
             hahahaha     haha
word2vec     mss          mssn, missch, misschien
             dinnetje     dinnie, vriendinnetje, dinnetjes
Aspell       grapjee      grapje, grapjes, greepje
             felicteren   feliciteren, flecteren, fluctueren
Split        kheb         k heb
             datis        dat is
word.*       ech          echt, echo, echeltje
             waarsch      waarschijnlijk, waarschuw, waarschuwt
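As a rough illustration of how generation modules like these can be combined, here is a minimal sketch of the Orig. Word, Split and word.* modules. It is not the MoNoise implementation: the function names and the toy vocabulary are hypothetical, and the lookup-list, word2vec and Aspell modules are left out because they need external resources.

```python
def split_candidates(word, vocab):
    """Split module: propose 'left right' when both halves are known words."""
    return [
        f"{word[:i]} {word[i:]}"
        for i in range(1, len(word))
        if word[:i] in vocab and word[i:] in vocab
    ]


def prefix_candidates(word, vocab):
    """word.* module: propose known words that start with the input word."""
    return sorted(w for w in vocab if w.startswith(word) and w != word)


def generate(word, vocab):
    """Collect candidates from all modules, without duplicates."""
    candidates = [word]  # the original word is always kept as a candidate
    for module in (split_candidates, prefix_candidates):
        for candidate in module(word, vocab):
            if candidate not in candidates:
                candidates.append(candidate)
    return candidates


# Toy vocabulary (hypothetical), built from the examples on the slide
TOY_VOCAB = {"k", "heb", "dat", "is", "echt", "echo", "echeltje"}

kheb = generate("kheb", TOY_VOCAB)
datis = generate("datis", TOY_VOCAB)
ech = generate("ech", TOY_VOCAB)
```

Each module only proposes candidates; choosing among them is left to the ranking step.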

Ranking

Features: isOrig, Word2vec dist., Aspell dist., isSplit, word.*, N-grams Wiki, N-grams Twitter, dict, length, containsAlpha

origFeats

Random Forest Classifier
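To make the ranking input concrete, here is a minimal sketch of how a few of the listed features could be computed for one candidate. The exact definitions are assumptions based on the feature names alone; the function name is hypothetical, and the n-gram, word2vec and Aspell features are omitted because they need external models.

```python
def candidate_features(original, candidate, dictionary):
    """Compute a handful of ranking features for one (original, candidate) pair."""
    return {
        # candidate is the unchanged original word
        "isOrig": candidate == original,
        # candidate is the original with a space inserted
        "isSplit": " " in candidate and candidate.replace(" ", "") == original,
        # candidate extends the original (the word.* module)
        "word.*": candidate != original and candidate.startswith(original),
        # every token of the candidate is a dictionary word
        "dict": all(token in dictionary for token in candidate.split()),
        # candidate length in characters (assumed; could also be the original's)
        "length": len(candidate),
        # candidate contains at least one alphabetic character
        "containsAlpha": any(ch.isalpha() for ch in candidate),
    }


split_feats = candidate_features("kheb", "k heb", {"k", "heb"})
orig_feats = candidate_features("#fail", "#fail", set())
```

Feature dictionaries like these would then be turned into vectors and scored by the classifier; that step is not shown here.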

New Dataset
• Annotate capitalization consistently
• Annotate tokenization in a separate layer
• Do not include phrasal abbreviations (‘lol’ ↦ ‘laughing out loud’)
• Make publicly available
• No Flemish ↦ Dutch
• Annotate POS dev/test data
• Annotate categories? [a]
• Annotate Universal Dependencies?

[a] Rob van der Goot, Rik van Noord, and Gertjan van Noord. A taxonomy for in-depth evaluation of normalization for user generated content. In Proceedings of LREC 2018

Beneficial for Parsing?

new pix comming tomoroe

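[Figure: word lattice over positions 0 to 4, with normalization candidates and their scores]
0 to 1: new (1.0)
1 to 2: pix (0.6), pics (0.3), pictures (0.1)
2 to 3: comming (0.3), coming (0.6), common (0.1)
3 to 4: tomoroe (0.3), tomorrow (0.5), more (0.2)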
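The lattice can be read as a set of competing normalizations of the whole sentence. The sketch below enumerates every path and scores it as the product of its candidate scores. Treating the scores as independent probabilities that multiply is an assumption made for illustration, and the function name is hypothetical; this is not the parser integration described in the cited papers.

```python
from itertools import product
from math import prod

# Candidates and scores per position, copied from the lattice on the slide
LATTICE = [
    [("new", 1.0)],
    [("pix", 0.6), ("pics", 0.3), ("pictures", 0.1)],
    [("comming", 0.3), ("coming", 0.6), ("common", 0.1)],
    [("tomoroe", 0.3), ("tomorrow", 0.5), ("more", 0.2)],
]


def ranked_paths(lattice):
    """Return every path through the lattice, highest product score first."""
    paths = []
    for combo in product(*lattice):
        words = [word for word, _ in combo]
        score = prod(score for _, score in combo)
        paths.append((score, words))
    return sorted(paths, key=lambda item: item[0], reverse=True)


paths = ranked_paths(LATTICE)
best_score, best_words = paths[0]
```

With these scores the single best path still keeps "pix", since it outscores "pics" and "pictures" at that position; a parser that only sees the best path loses the other candidates.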

[Figure: constituency tree for the normalized sentence, with node labels S, VB, NP, NP and tagged leaves JJ new, NNS pictures, VBG coming, NN tomorrow.]

[Figure: dependency parse of "new pix comming tomoroe" with arc labels root, amod, nsubj, nmod.]

Rob van der Goot and Gertjan van Noord. Parser Adaptation for Social Media by Integrating Normalization. In Proceedings of ACL 2017

Rob van der Goot and Gertjan van Noord. Modeling Input Uncertainty in A Neural Network Dependency Parser. In Proceedings of EMNLP 2018, Brussels

Try it! www.let.rug.nl/rob/monoise
