Word normalization in Twitter using finite-state transducers

Jordi Porta and José Luis Sancho
Centro de Estudios de la Real Academia Española

c/ Serrano 197-198, Madrid 28002
{porta,sancho}@rae.es


Abstract: This paper presents a linguistic approach based on weighted finite-state transducers for the lexical normalisation of Spanish Twitter messages. The system developed consists of transducers that are applied to out-of-vocabulary tokens. The transducers implement linguistic models of variation that generate sets of candidates according to a lexicon. A statistical language model is used to obtain the most probable sequence of words. The article includes a description of the components and an evaluation of the system and some of its parameters.

Keywords: Tweet Messages. Lexical Normalisation. Finite-state Transducers. Statistical Language Models.

1 Introduction

Text messaging (or texting) exhibits a considerable degree of departure from the writing norm, including spelling. There are many reasons for this deviation: the informality of the communication style, the characteristics of the input devices, etc. Although many people consider that these communication channels are “deteriorating” or even “destroying” languages, many scholars claim that even in these channels communication obeys maxims and that spelling is also principled. Moreover, it seems that, in general, the processes underlying variation are not new to languages. It is under these considerations that the modelling of spelling variation, and also its normalisation, can be addressed. Normalisation of text messaging is seen as a necessary preprocessing task before applying other natural language processing tools designed for standard language varieties.

Few works dealing with Spanish text messaging can be found in the literature. To the best of our knowledge, the most relevant and recent published works are Mosquera and Moreda (2012), Pinto et al. (2012), Gómez Hidalgo, Caurcel Díaz, and Íñiguez del Río (2013) and Oliva et al. (2013).

2 Architecture and components of the system

The system has four main components that are applied sequentially: an analyser performing tokenisation and lexical analysis on standard word forms and on other expressions like numbers, dates, etc.; a component generating word candidates for out-of-vocabulary (OOV) tokens; a statistical language model used to obtain the most likely sequence of words; and finally, a truecaser giving proper capitalisation to common words assigned to OOV tokens.
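As an illustration, the flow of these components can be sketched in Python; every function below (tokenize, is_standard, candidates, decode, truecase) is a hypothetical stand-in for the corresponding module described above, not the actual implementation:

    def normalise(message, tokenize, is_standard, candidates, decode, truecase):
        # Tokenise the message (the paper uses Freeling for this step).
        tokens = tokenize(message)
        # Build a confusion set per token: standard forms map to themselves,
        # OOV tokens receive candidates from the edit transducers.
        lattice = [[t] if is_standard(t) else candidates(t) for t in tokens]
        # The statistical language model selects the most likely path.
        best = decode(lattice)
        # Restore capitalisation on the words assigned to OOV tokens.
        return [truecase(word, token) for word, token in zip(best, tokens)]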

Freeling (Atserias et al., 2006), with a special configuration designed for this task, is used to tokenise the message and identify, among other tokens, standard word forms. The generation of candidates, i.e., the confusion set of an OOV token, is performed by components inspired by other modules used to analyse words found in historical texts, where other kinds of spelling variation can be found (Porta, Sancho, and Gómez, 2013). The approach to historical variation was based on weighted finite-state transducers over the tropical semiring implementing linguistically motivated models. Some experiments were conducted in order to assess the task of assigning to old word forms their corresponding modern lemmas. For each old word, lemmas were assigned via the possible modern forms predicted by the model. Results were comparable to the results obtained with the Levenshtein distance (Levenshtein, 1966) in terms of recall, but were better in terms of accuracy, precision and F. As with old words, the confusion set of an OOV token is generated by applying the shortest-paths algorithm to the following expression:

W ◦ E ◦ L

where W is the automaton representing the OOV token, E is an edit transducer generating possible variations on tokens, and L is the set of target words. The composition of these three modules is performed using an on-line implementation of the efficient three-way composition algorithm of Allauzen and Mohri (2008).
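The paper builds its transducers with OpenGrm tools; purely as an illustration, the same shortest-path search over W ◦ E ◦ L can be sketched with pynini, a Python library for weighted finite-state transducers (the toy edit rule and three-word lexicon are assumptions made for the example):

    import pynini

    sigma = pynini.closure(pynini.union(*"abcdefghijklmnopqrstuvwxyz"))

    # L: toy set of target words, as an unweighted acceptor.
    lexicon = pynini.union("que", "queso", "casa").optimize()

    # E: a toy edit transducer that optionally rewrites "k" as "qu" (cost 1).
    edit = pynini.cdrewrite(
        pynini.cross("k", pynini.accep("qu", weight=1)),
        "", "", sigma, mode="opt").optimize()

    # W: the OOV token. Compose W ◦ E ◦ L and keep the cheapest path.
    token = pynini.accep("ke")
    best = pynini.shortestpath(token @ edit @ lexicon)
    print(best.string())  # -> que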

3 Resources employed

In this section, the resources employed by the components of the system are described: the edit transducers, the lexical resources and the language model.

3.1 Edit transducers

We follow the classification of Crystal (2008) for texting features that are also present in Twitter messages. In order to deal with these features, several transducers were developed. Transducers are expressed as regular expressions and context-dependent rewrite rules of the form α → β / γ _ δ (Chomsky and Halle, 1968) that are compiled into weighted finite-state transducers using the OpenGrm Thrax tools (Tai, Skut, and Sproat, 2011).

3.1.1 Logograms and Pictograms

Some letters are used as logograms, with a phonetic value. They are dealt with by optional rewrites altering the orthographic form of tokens:

ReplaceLogograms = (x (→) por) ◦ (2 (→) dos) ◦ (@ (→) a|o) ◦ . . .

Laughs, which are very frequent, are also considered logograms, since they represent sounds associated with actions. The multiple ways they are realised, including plurals, are easily described with regular expressions.

Pictograms like emoticons entered by means of ready-to-use icons in input devices are not treated by our system, since they are not textual representations. However, textual representations of emoticons like :DDD or xDDDDDD are recognised by regular expressions and mapped to their canonical form by means of simple transducers.
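A rough regular-expression sketch of this canonicalisation in Python (the canonical forms jaja, :D and xD are assumptions for the example):

    import re

    # Laugh variants ("jaja", "jajajaja", "JJAJA", ...) and stretched
    # smileys (":DDD", "xDDDDDD") mapped to assumed canonical forms.
    LAUGH = re.compile(r"^(?:[jx]+[aeiou]+){2,}$", re.IGNORECASE)
    SMILEY = re.compile(r"^([:x])D+$", re.IGNORECASE)

    def canonicalise(token):
        if LAUGH.match(token):
            return "jaja"
        m = SMILEY.match(token)
        if m:
            return m.group(1) + "D"
        return token

    print(canonicalise("JAJAJAJA"))  # -> jaja
    print(canonicalise("xDDDDDD"))   # -> xD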

3.1.2 Initialisms, shortenings, and letter omissions

The string operations for initialisms (or acronymisation) and shortenings are difficult to model without incurring an overgeneration of candidates. For this reason, only common initialisms, e.g., sq (es que), tk (te quiero) or sa (se ha), and common shortenings, e.g., exam (examen) or nas (buenas), are listed.

For the omission of letters, several transducers are implemented. The simplest and most conservative one is a transducer introducing just one letter at any position of the token string. Consonantal writing is a special case of letter omission. This kind of writing relies on the assumption that consonants carry much more information than vowels do, which is in fact the norm in some languages, such as Semitic languages. Some rewrite rules are applied to OOV tokens in order to restore vowels:

InsertVowels = invert(RemoveVowels)
RemoveVowels = Vowels (→) ε
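A pynini sketch of this inversion (a toy lexicon filters the overgenerated candidates; it is illustrative, not the authors' grammar):

    import pynini

    sigma = pynini.closure(pynini.union(*"abcdefghijklmnopqrstuvwxyz"))
    vowels = pynini.union("a", "e", "i", "o", "u")

    # RemoveVowels: optionally delete any vowel anywhere in the token.
    remove_vowels = pynini.cdrewrite(
        pynini.cross(vowels, ""), "", "", sigma, mode="opt").optimize()

    # InsertVowels = invert(RemoveVowels): optionally insert vowels instead.
    insert_vowels = pynini.invert(remove_vowels)

    lexicon = pynini.union("beso", "base", "bss").optimize()
    candidates = pynini.accep("bs") @ insert_vowels @ lexicon
    print(sorted(candidates.paths().ostrings()))  # -> ['base', 'beso']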

3.1.3 Standard non-standard spellings

We consider non-standard spellings standard when they are widely used. These include spellings for representing regional or informal speech, or choices sometimes conditioned by input devices, such as non-accented writing. For the case of accents and tildes, they are restored using a cascade of optional rewrite rules like the following:

RestoreAccents = (n|ni|ny|nh (→) ñ) ◦ (a (→) á) ◦ (e (→) é) ◦ . . .

Words containing k instead of c or qu, which appear frequently in protest writings, are also standardised with simple transducers. Some other changes are made to certain endings in order to recover the standard ending. There are complete paradigms like the following, which relates non-standard to standard endings:

-a → -ada
-as → -adas
-ao → -ado
-aos → -ados
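This paradigm can be sketched as a word-final optional rewrite, again in pynini and with only the four pairs listed above:

    import pynini

    sigma = pynini.closure(pynini.union(*"abcdefghijklmnopqrstuvwxyz"))

    # Non-standard ending -> standard ending, only at the end of the word.
    endings = pynini.string_map(
        [("a", "ada"), ("as", "adas"), ("ao", "ado"), ("aos", "ados")])
    standardise_endings = pynini.cdrewrite(
        endings, "", "[EOS]", sigma, mode="opt").optimize()

    cands = pynini.accep("cansao") @ standardise_endings
    print(sorted(cands.paths().ostrings()))  # -> ['cansado', 'cansao']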


We also consider phonetic writing as a kind of non-standard writing in which a phonetic form of a word is alphabetically and syllabically approximated. The transducers used for generating standard words from their phonetic and graphical variants are:

DephonetiseWriting = invert(PhonographemicVariation)

PhonographemicVariation = GraphemeToPhoneme ◦ PhoneConflation ◦ PhonemeToGrapheme ◦ GraphemeVariation

In the previous definitions, PhoneConflation makes phonemes equivalent, as for example the IPA phonemes /ʎ/ and /ʝ/. Linguistic phenomena such as seseo and ceceo, in which several phonemes had been conflated by the 16th century, still remain in spoken variants and are also reflected in texting. The GraphemeVariation transducer models, among others, the writing of ch as x, which could be due to the influence of other languages.
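The full grapheme-phoneme round trip is beyond a short example, but the effect of conflation can be approximated directly at the grapheme level; a pynini sketch under that simplification (toy lexicon, rules limited to seseo/ceceo and the x spelling of ch):

    import pynini

    sigma = pynini.closure(pynini.union(*"abcdefghijklmnopqrstuvwxyz"))

    # Grapheme-level stand-in for GraphemeToPhoneme ◦ PhoneConflation ◦
    # PhonemeToGrapheme ◦ GraphemeVariation: conflate z/s, s/c before e or i,
    # and the x spelling of ch.
    conflate = (
        pynini.cdrewrite(pynini.cross("z", "s"), "", "", sigma, mode="opt") @
        pynini.cdrewrite(pynini.cross("s", "c"), "", pynini.union("e", "i"),
                         sigma, mode="opt") @
        pynini.cdrewrite(pynini.cross("x", "ch"), "", "", sigma, mode="opt")
    ).optimize()

    lexicon = pynini.union("cocina", "chico", "casa").optimize()
    for token in ("cosina", "xico"):
        cands = pynini.accep(token) @ conflate @ lexicon
        print(token, "->", sorted(cands.paths().ostrings()))
    # cosina -> ['cocina']
    # xico -> ['chico']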

3.1.4 Juxtapositions

Spacing in texting is also non-standard. In the normalisation task, some OOV tokens are in fact juxtaposed words. The possible decompositions of a token into a sequence of words are given by shortest-paths(W ◦ SplitConjoinedWords ◦ L(␣L)+), where W is the word to be analysed, L(␣L)+ represents the valid sequences of blank-separated words, and SplitConjoinedWords is a transducer introducing blanks (␣) between letters and optionally undoing possible fused vowels:

SplitConjoinedWords = invert(JoinWords)

JoinWords = (a␣a (→) a<1>) ◦ . . . ◦ (u␣u (→) u<1>) ◦ (␣ (→) ε)

Note that in the previous definition, some rules are weighted with a unit cost <1>. These costs are used by the shortest-paths algorithm as a preference mechanism to select non-fused over fused sequences when both cases are possible.
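A pynini sketch of the whole splitting expression (toy lexicon; the weight-1 vowel-fusion rules mirror the JoinWords definition above):

    import pynini

    sigma = pynini.closure(pynini.union(*"abcdefghijklmnopqrstuvwxyz "))

    # JoinWords: optionally fuse identical vowels across a blank (cost 1),
    # then delete the remaining blanks.
    fuse = pynini.cdrewrite(
        pynini.union(*[pynini.cross(v + " " + v, pynini.accep(v, weight=1))
                       for v in "aeiou"]),
        "", "", sigma, mode="opt")
    drop_blanks = pynini.cdrewrite(pynini.cross(" ", ""), "", "", sigma)
    split_conjoined = pynini.invert((fuse @ drop_blanks).optimize())

    # L( L)+: valid sequences of two or more blank-separated lexicon words.
    word = pynini.union("me", "encantaba", "la", "casa")
    words_seq = word + pynini.closure(pynini.accep(" ") + word, 1)

    best = pynini.shortestpath(
        pynini.accep("mencantaba") @ split_conjoined @ words_seq)
    print(best.string())  # -> me encantaba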

3.1.5 Other transducers

Expressive lengthening, which consists of repeating a letter in order to convey emphasis, is dealt with by means of rules removing a varying number of consecutive occurrences of the same letter. An example of a rule dealing with repetitions of the letter a is a (→) ε / a _. A transducer is generated for the whole alphabet.

Because messages are keyboarded, some errors found in words are due to letter transpositions and confusions between adjacent letters in the same row of the keyboard. These changes are also implemented with a transducer.

Finally, a Levenshtein transducer with a maximum distance of one has also been implemented.
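Procedurally, the combination of keyboard confusions, transpositions and distance-one edits can be approximated as below (toy lexicon and keyboard rows are assumptions; the actual system does this with weighted transducers):

    # Distance-one candidate generation restricted to a lexicon: deletions,
    # insertions, adjacent transpositions, and substitutions limited to
    # neighbouring keys on the same keyboard row.
    LEXICON = {"que", "casa", "perro"}          # toy stand-in
    KEY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def adjacent_keys(c):
        for row in KEY_ROWS:
            i = row.find(c)
            if i != -1:
                return row[max(0, i - 1):i] + row[i + 1:i + 2]
        return ""

    def candidates(token):
        cands = set()
        for i in range(len(token) + 1):
            left, right = token[:i], token[i:]
            if right:
                cands.add(left + right[1:])                      # deletion
                for c in adjacent_keys(right[0]):                # keyboard slip
                    cands.add(left + c + right[1:])
            if len(right) > 1:                                   # transposition
                cands.add(left + right[1] + right[0] + right[2:])
            for c in ALPHABET:                                   # insertion
                cands.add(left + c + right)
        return cands & LEXICON

    print(candidates("vasa"))  # -> {'casa'}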

3.2 The lexicon

The lexicon for OOV token normalisation contains mainly Spanish standard words, proper names and some frequent English words. These constitute the set of target words. We used the DRAE (RAE, 2001) as the source for Spanish standard words in the lexicon. Besides inflected forms, we have added verbal forms with clitics attached and derivative forms not found as entries in the DRAE: -mente adverbs, appreciatives, etc. The list of proper names was compiled from many sources and contains first names, surnames, aliases, cities, country names, brands, organisations, etc. Special attention was paid to hypocorisms, i.e., shorter or diminutive forms of a given name, as well as nicknames or calling names, since communication in channels such as Twitter tends to be interpersonal (or between members of a group) and affective. A list of common hypocorisms is provided to the system. For English words, we have selected the 100,000 most frequent words of the BNC (BNC, 2001).

3.3 Language model

We use a language model to decode the word graph and thus obtain the most probable word sequence. The model is estimated from a corpus of webpages compiled with Wacky (Baroni et al., 2009). The corpus contains about 11,200,000 tokens coming from about 21,000 URLs. We used as seeds the types found in the development set (about 2,500). Backed-off n-gram models, used as language models, are implemented with the OpenGrm NGram toolkit (Roark et al., 2012).
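The decoding step itself can be illustrated with a toy bigram model and a Viterbi search over the confusion sets; the log-probabilities below are invented for the example (the actual system uses backoff n-gram models compiled as FSTs):

    # Toy bigram log-probabilities; "<s>" marks the sentence start.
    LOGP = {
        ("<s>", "de"): -0.5, ("de", "nada"): -0.7, ("de", "nadar"): -2.5,
        ("<s>", "da"): -2.0, ("da", "nada"): -1.5, ("da", "nadar"): -2.0,
    }

    def viterbi(confusion_sets, logp, unk=-10.0):
        # trellis[i] maps each candidate to (best score, best predecessor).
        trellis = [{"<s>": (0.0, None)}]
        for cands in confusion_sets:
            column = {}
            for w in cands:
                column[w] = max(
                    (score + logp.get((prev, w), unk), prev)
                    for prev, (score, _) in trellis[-1].items())
            trellis.append(column)
        # Backtrace from the best-scoring final candidate.
        word = max(trellis[-1], key=lambda w: trellis[-1][w][0])
        out = []
        for column in reversed(trellis[1:]):
            out.append(word)
            word = column[word][1]
        return list(reversed(out))

    # Confusion sets produced upstream for the two tokens of "d nda".
    print(viterbi([{"de", "da"}, {"nada", "nadar"}], LOGP))  # -> ['de', 'nada']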

3.4 Truecasing

The restoration of case information in badly-cased text has been addressed in (Lita et al., 2003) and has been included as part of the normalisation task. Part of this process, for proper names, is performed by the application of the language model to the word graph. Words at message-initial position are not always uppercased, since doing so yielded contradictory results in our experiments. A simple heuristic is implemented to uppercase a normalisation candidate when the OOV token is also uppercased.
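The heuristic in the last sentence amounts to a few lines of Python (a sketch; the function name is ours):

    def truecase_candidate(candidate, oov_token):
        # Uppercase the chosen normalisation when the original OOV token
        # was itself fully uppercased.
        if len(oov_token) > 1 and oov_token.isupper():
            return candidate.upper()
        return candidate

    print(truecase_candidate("te quiero mucho", "TKM"))  # -> TE QUIERO MUCHO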

4 Settings and evaluation

In order to generate the confusion sets, we used two edit transducers applied in a cascade. If neither of the two is able to relate a token with a word, the token is assigned to itself.

The first transducer generates candidates by expanding abbreviations, identifying acronyms and pictograms, and producing words which result from the following composition of edit transducers, combining some of the features of texting:

RemoveSymbols ◦ LowerCase ◦ Deaccent ◦ RemoveReduplicates ◦ ReplaceLogograms ◦ StandardiseEndings ◦ DephonetiseWriting ◦ Reaccent ◦ MixCase

The second edit transducer analyses tokens that did not receive analyses from the first editor. This second editor implements consonantal writing, typing-error recovery, approximate matching using a Levenshtein distance of one, and the splitting of juxtaposed words. In all cases, case, accents and reduplications are also considered. This second transducer makes use of an extended lexicon containing sequences of simple words.

Several experiments were conducted in order to evaluate some parameters of the system: in particular, the effect of the order of the n-grams in the language model, and the effect of generating confusion sets for OOV tokens only versus generating confusion sets for all tokens. For all the experiments we used the test set provided, with the tokenisation delivered by Freeling.

For the first series of experiments, tokens identified as standard words by Freeling receive the same token as analysis and OOV tokens are analysed with the system. Recall on OOV tokens is 89.40%. Confusion set size follows a power-law distribution, with an average size of 5.48 for OOV tokens that goes down to 1.38 if we average over the rest of the tokens. Precision for 2- and 4-gram language models is 78.10%, but the best result is obtained with 3-grams, with a precision of 78.25%.

There are a number of non-standard forms that were wrongly recognised as in-vocabulary words because they clash with other standard words. In the second series of experiments, a confusion set is generated for each word in order to correct potentially wrong assignments. The average size of confusion sets increases to 5.56 (we removed from this calculation the token mes de abril, which receives 308,017 different analyses due to the combination of multiple edits and segmentations). Precision for the 2-gram language model is 78.25%, but 3- and 4-grams both reach a precision of 78.55%.

From a quantitative point of view, it seems that slightly better results are obtained using a 3-gram language model and generating confusion sets not only for OOV tokens but for all the tokens in the message. In a qualitative evaluation of errors, several categories show up. The most populated categories are those having to do with case restoration and wrong decoding by the language model. Some errors are related to particularities of the DRAE, from which the lexicon was derived (dispertar or malaleche). Non-standard morphology is observed in tweets, as in derivatives (tranquileo or loquendera). Lack of abbreviation expansion is also observed (Hum). Faulty application of segmentation accounts for a few errors (mencantaba). Finally, some errors are not in our output but in the reference (Hojo).

5 Conclusions and future work

No attention has been paid to multilingualism, since the task explicitly excluded tweets from bilingual areas of Spain. However, given that not a few Spanish speakers (both in Europe and America) are bilingual or live in bilingual areas, mechanisms should be provided to deal with languages other than English to make the system more robust.

We plan to build a corpus of lexically standard tweets via the Twitter streaming API to determine whether n-grams observed in a Twitter-only corpus improve decoding or not, as a side effect of syntax also being non-standard.

Qualitative analysis of the results showed that there is room for improvement by experimenting with selective deactivation of items in the lexicon and further development of the segmentation module.

However, initialisms and shortenings are features of texting that are difficult to model without causing overgeneration. Acronyms like FYQ, which corresponds to the school subject of Física y Química, are domain-specific, difficult to foresee, and therefore hard to have listed in the resources.

References

Allauzen, Cyril and Mehryar Mohri. 2008. 3-way composition of weighted finite-state transducers. In Proc. of the 13th Int. Conf. on Implementation and Application of Automata (CIAA-2008), pages 262–273, San Francisco, California, USA.

Atserias, Jordi, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proc. of the 5th Int. Conf. on Language Resources and Evaluation (LREC-2006), pages 48–55, Genoa, Italy, May.

Baroni, Marco, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3):209–226.

BNC. 2001. The British National Corpus, version 2 (BNC World). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk.

Chomsky, Noam and Morris Halle. 1968. The Sound Pattern of English. Harper & Row, New York.

Crystal, David. 2008. Txtng: The Gr8 Db8. Oxford University Press.

Gómez Hidalgo, José María, Andrés Alfonso Caurcel Díaz, and Yovan Íñiguez del Río. 2013. Un método de análisis de lenguaje tipo SMS para el castellano. Linguamática, 5(1):31–39, July.

Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710.

Lita, Lucian Vlad, Abe Ittycheriah, Salim Roukos, and Nanda Kambhatla. 2003. tRuEcasIng. In Proc. of the 41st Annual Meeting on ACL - Volume 1, ACL '03, pages 152–159, Stroudsburg, PA, USA.

Mosquera, Alejandro and Paloma Moreda. 2012. TENOR: A lexical normalisation tool for Spanish Web 2.0 texts. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 535–542. Springer.

Oliva, J., J. I. Serrano, M. D. del Castillo, and A. Iglesias. 2013. A SMS normalization system integrating multiple grammatical resources. Natural Language Engineering, 19(1):121–141.

Pinto, David, Darnes Vilariño Ayala, Yuridiana Alemán, Helena Gómez, Nahun Loya, and Héctor Jiménez-Salazar. 2012. The Soundex phonetic algorithm revisited for SMS text representation. In Petr Sojka, Aleš Horák, Ivan Kopeček, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of LNCS, pages 47–55. Springer.

Porta, Jordi, José-Luis Sancho, and Javier Gómez. 2013. Edit transducers for spelling variation in Old Spanish. In Proc. of the Workshop on Computational Historical Linguistics at NODALIDA 2013, NEALT Proc. Series 18, pages 70–79, Oslo, Norway, May 22-24.

RAE. 2001. Diccionario de la lengua española. Espasa, Madrid, 22nd edition.

Roark, Brian, Richard Sproat, Cyril Allauzen, Michael Riley, Jeffrey Sorensen, and Terry Tai. 2012. The OpenGrm open-source finite-state grammar software libraries. In Proc. of the ACL 2012 System Demonstrations, pages 61–66, Jeju Island, Korea, July.

Tai, Terry, Wojciech Skut, and Richard Sproat. 2011. Thrax: An open source grammar compiler built on OpenFst. In ASRU 2011, Waikoloa Resort, Hawaii, December.