
Resources for multilingual processing

Georgiana Puşcaşu
University of Wolverhampton, UK

Outline

Motivation and goals

NLP Methods, Resources and Applications
  Text Segmentation
  Part of Speech Tagging
  Stemming
  Lemmatization
  Syntactic Parsing
  Named Entity Recognition
  Term Extraction and Terminology Data Management Tools
  Text Summarization
  Language Identification
  Statistical Language Modeling Toolkits
  Corpora

Conclusions

Motivation and goals

Motivation
  Most NLP research and resources deal with English
  The Web is multilingual, and ideally the current NLP state of the art should be attained for all languages

Goals
  To present already available textual methods that can support multilingual NLP
  To offer an inventory of existing tools and resources that can be exploited in order to avoid reinventing the wheel


Text Segmentation

Electronic text is essentially just a sequence of characters

Before any real processing, text needs to be segmented

Text segmentation involves

Low-level text segmentation (performed at the initial stages of text processing)

Tokenization

Sentence splitting

High-level text segmentation

Intra-sentential: segmentation of linguistic groups such as Named Entities, Noun Phrases, splitting sentences into clauses

Inter-sentential: grouping sentences and paragraphs into discourse topics

Tokenization

Tokenization is the process of segmenting text into linguistic units such as words, punctuation, numbers, alphanumerics, etc.

It is normally the first step in the majority of text processing applications

Tokenization in languages that are:
  segmented: considered a relatively easy and uninteresting part of text processing (words are delimited by blank spaces and punctuation)
  non-segmented: more challenging (no explicit boundaries between words)

Tokenization in segmented languages

Segmented languages: all modern languages that use a Latin-, Cyrillic- or Greek-based writing system

Traditionally, tokenization rules are written using regular expressions

Problems:

Abbreviations: solved by lists of abbreviations (pre-compiled or automatically extracted from a corpus), guessing rules

Hyphenated words: “One word or two?”

Numerical and special expressions (Email addresses, URLs, telephone numbers, etc.) are handled by specialized tokenizers (preprocessors)

Apostrophe: (they’re => they + ’re; don’t => do + n’t) solved by language-specific rules
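The rules above can be sketched as a small regular-expression tokenizer. This is only an illustrative sketch for English: the abbreviation list, the clitic list and the names `tokenize`, `ABBREVIATIONS`, `SPECIAL`, `CLITIC` and `EDGE_PUNCT` are assumptions made for the example, and only straight apostrophes are handled.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (a real one would be
# pre-compiled or extracted from a corpus).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "etc.", "e.g."}

# "Special expressions" kept as single tokens: URLs and e-mail addresses.
SPECIAL = re.compile(r"^(https?://\S+|[\w.+-]+@[\w-]+(\.[\w-]+)+)$")

# English-specific apostrophe rule: they're -> they + 're, don't -> do + n't.
CLITIC = re.compile(r"^(.+?)(n't|'re|'s|'ll|'ve|'d|'m)$", re.IGNORECASE)

# Leading punctuation, core of the chunk, trailing punctuation.
EDGE_PUNCT = re.compile(r"^(\W*)(.*?)(\W*)$")


def tokenize(text):
    """Split whitespace-delimited text into word and punctuation tokens."""
    tokens = []
    for chunk in text.split():
        # Abbreviations and special expressions are not split any further.
        if SPECIAL.match(chunk) or chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        lead, core, trail = EDGE_PUNCT.match(chunk).groups()
        tokens.extend(lead)            # one token per punctuation character
        if core:
            clitic = CLITIC.match(core)
            tokens.extend(clitic.groups() if clitic else [core])
        tokens.extend(trail)
    return tokens


print(tokenize("They're here, Dr. Smith don't go."))
```

Hyphenated words are left as single tokens here, which is one possible answer to the "one word or two?" question, not the only one.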

Tokenization in non-segmented languages

Non-segmented languages: languages written without explicit word boundaries, such as many East and Southeast Asian languages (e.g. Chinese, Japanese, Thai)

Problems:
  tokens are written directly adjacent to each other
  almost every character can be a one-character word by itself but can also form part of multi-character words

Solutions:
  Pre-existing lexico-grammatical knowledge
  Machine learning employed to extract segmentation regularities from pre-segmented data
  Statistical methods: character n-grams
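As an illustration of the lexicon-based solution, here is a minimal sketch of greedy forward maximum matching, one common way of using a pre-existing word list. The function name `max_match`, the `max_len` parameter and the toy lexicon are assumptions for the example; unspaced English stands in for a non-segmented language.

```python
def max_match(text, lexicon, max_len=5):
    """Greedy forward maximum matching.

    At each position take the longest lexicon entry (up to max_len
    characters); if nothing matches, emit a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # j == i + 1 is the single-character fallback.
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Toy lexicon (hypothetical).
lexicon = {"the", "table", "down", "there"}
print(max_match("thetabledownthere", lexicon))
```

Greedy matching can pick the wrong segmentation when a longer lexicon entry overlaps the correct words, which is one reason machine learning and statistical methods are also used.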

Tokenizers (1)

ALEMBIC
Author(s): M. Vilain, J. Aberdeen, D. Day, J. Burger, The MITRE Corporation
Purpose: Alembic is a multi-lingual text processing system. Among other tools, it incorporates tokenizers for: English, Spanish, Japanese, Chinese, French, Thai.
Access: Free by contacting [email protected]

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Purpose: Ellogon is a multi-lingual, cross-platform, general-purpose language engineering environment. One of the provided components that can be adapted to various languages can perform tokenization. Supported languages: Unicode.
Access: Free at http://www.ellogon.org/

GATE (General Architecture for Text Engineering)
Author(s): NLP Group, University of Sheffield, UK
Access: Free but requires registration at http://gate.ac.uk/

HEART Of GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/

Tokenizers (2)

LT TTT
Author(s): Language Technology Group, University of Edinburgh, UK
Purpose: LT TTT is a text tokenization system and toolset which enables users to produce a swift and individually-tailored tokenisation of text.
Access: Free at http://www.ltg.ed.ac.uk/software/ttt/

MXTERMINATOR
Author(s): Adwait Ratnaparkhi
Platforms: Platform independent
Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html

QTOKEN
Author(s): Oliver Mason, Birmingham University, UK
Platforms: Platform independent
Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtoken.html

SProUT
Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany
Purpose: SProUT provides tokenization for Unicode, Spanish, Japanese, German, French, English, Chinese.
Access: Not free. More information at http://sprout.dfki.de/

Tokenizers (3)

THE QUIPU GROK LIBRARY
Author(s): Gann Bierner and Jason Baldridge, University of Edinburgh, UK
Access: Free at https://sourceforge.net/project/showfiles.php?group_id=4083

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish.
Access: Not free. More information at http://www.lingsoft.fi/

Sentence splitting

Sentence splitting is the task of segmenting text into sentences

In the majority of cases it is a simple task: . ? ! usually signal a sentence boundary

However, when a period denotes a decimal point or is part of an abbreviation, it does not always signal a sentence break.

The simplest algorithm is known as ‘period-space-capital letter’ (its performance is not very good). It can be improved with lists of abbreviations, a lexicon of frequent sentence-initial words and/or machine learning techniques
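A minimal sketch of the ‘period-space-capital letter’ heuristic, improved with an abbreviation list, might look as follows. The abbreviation set and the names `split_sentences`, `ABBREVIATIONS` and `BOUNDARY` are assumptions for illustration; a decimal point is never treated as a boundary because it is not followed by whitespace.

```python
import re

# Hypothetical abbreviation list; real systems use much larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}

# Candidate boundary: sentence-final punctuation, whitespace, capital letter.
BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at candidate boundaries not preceded by an abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.start()]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue                      # "Dr. Smith": not a sentence break
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Dr. Smith paid 3.5 pounds. He left! Why?"))
```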


Part of Speech (POS) Tagging

POS Tagging is the process of assigning a part-of-speech or lexical class marker to each word in a corpus (Jurafsky and Martin)

[Figure: each word (WORDS) of the sentence “The couple spent the honeymoon on a yacht” is linked to one of the tags (TAGS) N, V, P, DET]


POS Tagger Prerequisites

Lexicon of words

For each word in the lexicon information about all its possible tags according to a chosen tagset

Different methods for choosing the correct tag for a word:

Rule-based methods

Statistical methods

Transformation Based Learning (TBL) methods

POS Tagger Prerequisites: Lexicon of words

Classes of words

Closed classes: a fixed set
  Prepositions: in, by, at, of, …
  Pronouns: I, you, he, her, them, …
  Particles: on, off, …
  Determiners: the, a, an, …
  Conjunctions: or, and, but, …
  Auxiliary verbs: can, may, should, …
  Numerals: one, two, three, …

Open classes: new words can be created all the time, so the lexicon cannot be expected to contain every word from these classes
  Nouns
  Verbs
  Adjectives
  Adverbs

POS Tagger Prerequisites: Tagsets

To do POS tagging, we need to choose a standard set of tags to work with

A tagset is normally sophisticated and linguistically well grounded

We could pick a very coarse tagset: N, V, Adj, Adv.

A more commonly used set is finer grained: the “UPenn TreeBank tagset”, 48 tags

Even more fine-grained tagsets exist

POS Tagger Prerequisites: Tagset example (UPenn tagset)

 1  CC    Coordinating conjunction
 2  CD    Cardinal number
 3  DT    Determiner
 4  EX    Existential there
 5  FW    Foreign word
 6  IN    Preposition/subord. conjunction
 7  JJ    Adjective
 8  JJR   Adjective, comparative
 9  JJS   Adjective, superlative
10  LS    List item marker
11  MD    Modal
12  NN    Noun, singular or mass
13  NNS   Noun, plural
14  NNP   Proper noun, singular
15  NNPS  Proper noun, plural
16  PDT   Predeterminer
17  POS   Possessive ending
18  PRP   Personal pronoun
19  PP$   Possessive pronoun
20  RB    Adverb
21  RBR   Adverb, comparative
22  RBS   Adverb, superlative
23  RP    Particle
24  SYM   Symbol (mathematical or scientific)
25  TO    to
26  UH    Interjection
27  VB    Verb, base form
28  VBD   Verb, past tense
29  VBG   Verb, gerund/present participle
30  VBN   Verb, past participle
31  VBP   Verb, non-3rd ps. sing. present
32  VBZ   Verb, 3rd ps. sing. present
33  WDT   wh-determiner
34  WP    wh-pronoun
35  WP$   Possessive wh-pronoun
36  WRB   wh-adverb
37  #     Pound sign
38  $     Dollar sign
39  .     Sentence-final punctuation
40  ,     Comma
41  :     Colon, semi-colon
42  (     Left bracket character
43  )     Right bracket character
44  "     Straight double quote
45  `     Left open single quote
46  ``    Left open double quote
47  '     Right close single quote
48  ''    Right close double quote

POS Tagging: Rule-based methods

Start with a dictionary

Assign all possible tags to words from the dictionary

Write rules by hand to selectively remove tags, leaving the correct tag for each word

POS Tagging: Statistical methods (1)

The Most Frequent Tag Algorithm

Training
  Take a tagged corpus
  Create a dictionary containing every word in the corpus together with all its possible tags
  Count the number of times each tag occurs for a word and compute the probability P(tag|word); then save all probabilities

Tagging
  Given a new sentence, for each word, pick the most frequent tag for that word from the corpus
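The training and tagging steps above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names `train` and `tag_sentence`, the lower-casing of words and the `default_tag` used for unknown words are assumptions, and the three-sentence corpus is made up.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Estimate P(tag|word) = count(word, tag) / count(word)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    model = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        model[word] = {tag: n / total for tag, n in tag_counts.items()}
    return model


def tag_sentence(words, model, default_tag="NN"):
    """Give each word its most frequent tag; unknown words get default_tag."""
    tagged = []
    for word in words:
        probs = model.get(word.lower())
        best = max(probs, key=probs.get) if probs else default_tag
        tagged.append((word, best))
    return tagged


# Made-up toy corpus for illustration only.
corpus = [
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
    [("a", "DT"), ("race", "NN")],
]
model = train(corpus)
print(tag_sentence(["to", "race"], model))
```

Because context is ignored, "race" receives NN even after "to", which is exactly the kind of error the HMM and transformation-based taggers try to avoid.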

POS Tagging: Statistical methods (2)

Bigram HMM Tagger

Training
  Create a dictionary containing every word in the corpus together with all its possible tags
  Compute the probability of each tag generating a certain word, and the probability that each tag is preceded by a specific tag (Bigram HMM Tagger => the probability depends only on the previous tag)

Tagging
  Given a new sentence, for each word, pick the most likely tag for that word using the parameters obtained after training
  HMM taggers choose the tag sequence that maximizes the product, over all words in the sentence, of: P(word|tag) * P(tag|previous tag)

Bigram HMM Tagging: Example

People/NNS are/VBP expected/VBN to/TO queue/VB at/IN the/DT registry/NN

The/DT police/NN is/VBZ to/TO blame/VB for/IN the/DT queue/NN

to/TO queue/???
the/DT queue/???

$\hat{t}_i = \arg\max_k P(t_k \mid t_{i-1}) \cdot P(w_i \mid t_k)$, where i = position of the word in the sequence and k = index among the possible tags for the word “queue”

How do we compute $P(t_k \mid t_{i-1})$? $\mathrm{count}(t_{i-1}\, t_k) / \mathrm{count}(t_{i-1})$

How do we compute $P(w_i \mid t_k)$? $\mathrm{count}(w_i, t_k) / \mathrm{count}(t_k)$

For to/TO queue/???: max[P(VB|TO)*P(queue|VB) , P(NN|TO)*P(queue|NN)]

Corpus:
  P(NN|TO) * P(queue|NN) = 0.021 * 0.00041 ≈ 0.000009
  P(VB|TO) * P(queue|VB) = 0.34 * 0.00003 ≈ 0.00001
  => after to/TO, queue is tagged VB
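The comparison for to/TO queue/??? can be reproduced with a few lines of code. The sketch below only scores one word given a known previous tag, using the corpus probabilities quoted on the slide; it is not a full HMM decoder over whole tag sequences, and the names `transition`, `emission` and `best_tag` are assumptions.

```python
# Probabilities quoted in the example above.
transition = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}              # P(tag | previous tag)
emission = {("queue", "VB"): 0.00003, ("queue", "NN"): 0.00041}     # P(word | tag)


def best_tag(word, previous_tag, candidates):
    """Return the candidate tag maximizing P(tag|previous) * P(word|tag)."""
    scores = {
        tag: transition[(previous_tag, tag)] * emission[(word, tag)]
        for tag in candidates
    }
    return max(scores, key=scores.get), scores


tag, scores = best_tag("queue", "TO", ["VB", "NN"])
print(tag, scores)
```

Although "queue" is more often a noun, the high probability of a verb after TO makes VB the winner here.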

POS Tagging: Transformation Based Tagging (1)

Combination of rule-based and stochastic tagging methodologies

Like rule-based because rule templates are used to learn transformations

Like stochastic approach because machine learning is used, with a tagged corpus as input

Input:

tagged corpus

lexicon (with all possible tags for each word)

POS Tagging: Transformation Based Tagging (2)

Basic Idea:
  Set the most probable tag for each word as a start value
  Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order

Training is done on a tagged corpus:
  1. Write a set of rule templates
  2. Among the set of rules, find the one with the highest score
  3. Continue from 2 until the lowest score threshold is passed
  4. Keep the ordered set of rules

Rules make errors that are corrected by later rules

Transformation Based Tagging: Example

Tagger labels every word with its most likely tag
  For example: race has the following probabilities in the Brown corpus: P(NN|race) = 0.98, P(VB|race) = 0.02

Transformation rules make changes to tags
  “Change NN to VB when previous tag is TO”

  … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
  becomes
  … is/VBZ expected/VBN to/TO race/VB tomorrow/NN
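Applying an ordered list of learned transformations can be sketched as below. Only the application step is shown, for a single rule template ("change tag X to Y when the previous tag is Z"); learning the rules from a tagged corpus is not covered, and the names `Rule` and `apply_rules` are assumptions for the example.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Change from_tag to to_tag when the previous tag is prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def apply_rules(tagged, rules):
    """Apply each rule in order, scanning the sentence left to right."""
    tagged = list(tagged)
    for rule in rules:
        for i in range(1, len(tagged)):
            word, tag = tagged[i]
            if tag == rule.from_tag and tagged[i - 1][1] == rule.prev_tag:
                tagged[i] = (word, rule.to_tag)
    return tagged


# Initial tagging with the most likely tag for each word.
initial = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
           ("race", "NN"), ("tomorrow", "NN")]
print(apply_rules(initial, [Rule("NN", "VB", "TO")]))
```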

POS Taggers (1)

ACOPOST
Author(s): Jochen Hagenstroem, Kilian Foth, Ingo Schröder, Parantu Shah
Purpose: ACOPOST is a collection of POS taggers. It implements and extends well-known machine learning techniques and provides a uniform environment for testing.
Platforms: All POSIX (Linux/BSD/UNIX-like OSes)
Access: Free at http://sourceforge.net/projects/acopost/

BRILL’S TAGGER
Author(s): Eric Brill
Purpose: Transformation Based Learning POS Tagger
Access: Free at http://www.cs.jhu.edu/~brill

fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection). It is currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

POS Taggers (2)

LINGSOFT
Author(s): LINGSOFT, Finland
Purpose: Among the services offered by Lingsoft one can find POS taggers for Danish, English, German, Norwegian, Swedish.
Access: Not free. Demos at http://www.lingsoft.fi/demos.html

LT POS (LT TTT)
Author(s): Language Technology Group, University of Edinburgh, UK
Purpose: The LT POS part of speech tagger uses a Hidden Markov Model disambiguation strategy. It is currently trained only for English.
Access: Free but requires registration at http://www.ltg.ed.ac.uk/software/pos/index.html

MACHINESE PHRASE TAGGER
Author(s): Connexor
Purpose: Machinese Phrase Tagger is a set of program components that perform basic linguistic analysis tasks at very high speed and provide relevant information about words and concepts to volume-intensive applications. Available for: English, French, Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Free access to online demo at http://www.connexor.com/demo/tagger/

POS Taggers (3)

MXPOST
Author(s): Adwait Ratnaparkhi
Purpose: MXPOST is a maximum entropy POS tagger. The downloadable version includes a Wall St. Journal tagging model for English, but can also be trained for different languages.
Platforms: Platform independent
Access: Free at http://www.cis.upenn.edu/~adwait/statnlp.html

MEMORY BASED TAGGER
Author(s): ILK - Tilburg University, CNTS - University of Antwerp
Purpose: Memory-based tagging is based on the idea that words occurring in similar contexts will have the same POS tag. The idea is implemented using the memory-based learning software package TiMBL.
Access: Usable by email or on the Web at http://ilk.uvt.nl/software.html#mbt

µ-TBL
Author(s): Torbjörn Lager
Purpose: The µ-TBL system is a powerful environment in which to experiment with transformation-based learning.
Platforms: Windows
Access: Free at http://www.ling.gu.se/~lager/mutbl.html

POS Taggers (4)

QTAG
Author(s): Oliver Mason, Birmingham University, UK
Purpose: QTag is a probabilistic parts-of-speech tagger. Resource files for English and German can be downloaded together with the tool.
Platforms: Platform independent
Access: Free at http://www.english.bham.ac.uk/staff/omason/software/qtag.html

STANFORD POS TAGGER
Author(s): Kristina Toutanova, Stanford University, USA
Purpose: The Stanford POS tagger is a log-linear tagger written in Java. The downloadable package includes components for command-line invocation and a Java API both for training and for running a trained tagger.
Platforms: Platform independent
Access: Free at http://nlp.stanford.edu/software/tagger.shtml

SVM TOOL
Author(s): TALP Research Center, University of Catalunya, Spain
Purpose: The SVMTool is a simple and effective part-of-speech tagger based on Support Vector Machines. The SVMLight software implementation of Vapnik's Support Vector Machine by Thorsten Joachims has been used to train the models for Catalan, English and Spanish.
Access: Free. SVMTool at http://www.lsi.upc.es/~nlp/SVMTool/ and SVMLight at http://svmlight.joachims.org/

POS Taggers (5)

TnT
Author(s): Thorsten Brants, Saarland University, Germany
Purpose: TnT, the short form of Trigrams'n'Tags, is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The tagger is an implementation of the Viterbi algorithm for second order Markov models. TnT comes with two language models, one for German, and one for English.
Platforms: Platform independent.
Access: Free but requires registration at http://www.coli.uni-saarland.de/~thorsten/tnt/

TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany
Purpose: The TreeTagger has been successfully used to tag German, English, French, Italian, Spanish, Greek and old French texts and is easily adaptable to other languages if a lexicon and a manually tagged training corpus are available.
Access: Free at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

POS Taggers (6)

Xerox XRCE MLTT Part Of Speech Taggers
Author(s): Xerox Research Centre Europe
Purpose: Xerox has developed morphological analysers and part-of-speech disambiguators for various languages including Dutch, English, French, German, Italian, Portuguese, Spanish. More recent developments include Czech, Hungarian, Polish and Russian.
Access: Not free. Demos at http://www.xrce.xerox.com/competencies/content-analysis/fsnlp/tagger.en.html

YAMCHA
Author(s): Taku Kudo
Purpose: YamCha is a generic, customizable, and open source text chunker oriented toward many NLP tasks, such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking. YamCha uses Support Vector Machines (SVMs), first introduced by Vapnik in 1995. YamCha is the same system that performed best in the CoNLL2000 Shared Task, Chunking and BaseNP Chunking task.
Platforms: Linux, Windows
Access: Free at http://www2.chasen.org/~taku/software/yamcha/

Stemming

Stemmers are used in IR to reduce as many related words and word forms as possible to a common canonical form (not necessarily the base form), which can then be used in the retrieval process.

Frequently, the performance of an IR system will be improved if term groups such as: CONNECT, CONNECTED, CONNECTING, CONNECTION, CONNECTIONS are conflated into a single term (by removal of the various suffixes -ED, -ING, -ION, -IONS to leave the single term CONNECT). The suffix stripping process will reduce the total number of terms in the IR system, and hence reduce the size and complexity of the data in the system, which is always advantageous.

The Porter Stemmer

A conflation stemmer developed by Martin Porter at the University of Cambridge in 1980

Idea: the English suffixes (approximately 1200) are mostly made up of a combination of smaller and simpler suffixes

Can be adapted to other languages (needs a list of suffixes and context sensitive rules)
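The suffix-stripping idea behind the CONNECT example can be illustrated with a deliberately naive sketch. This is not the Porter algorithm, which applies ordered rule steps with context-sensitive conditions; the suffix list, the `min_stem` length check and the name `strip_suffix` are assumptions chosen only to reproduce the CONNECT example.

```python
# Hypothetical suffix list, longest first so that "-ions" is tried before "-ion".
SUFFIXES = ["ions", "ing", "ion", "ed", "s"]


def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[: -len(suffix)]
    return lower


for form in ["CONNECT", "CONNECTED", "CONNECTING", "CONNECTION", "CONNECTIONS"]:
    print(form, "->", strip_suffix(form))
```

A list like this over-stems many words (for example, it would strip "-ing" from "string" if the length check allowed it), which is why real stemmers add context-sensitive rules.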

Stemmers (1)

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Access: Free at http://www.ellogon.org/

FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland
Purpose: Supported languages: German, English, French, Polish.
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

HEART Of GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/

Stemmers (2)

LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech.
Access: Not free. More information at http://www.petamem.com/

SNOWBALL
Purpose: Presentation of stemming algorithms, and Snowball stemmers, for English, Russian, Romance languages (French, Spanish, Portuguese and Italian), German, Dutch, Swedish, Norwegian, Danish and Finnish.
Access: Free at http://www.snowball.tartarus.org/

SProUT
Author(s): Feiyu Xu, Tim vor der Brück, LT-Lab, DFKI GmbH, Germany
Purpose: Available for: Unicode, Spanish, Japanese, German, French, English, Chinese
Access: Not free. More information at http://sprout.dfki.de/

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish
Access: Not free. More information at http://www.lingsoft.fi/

Lemmatization

The process of grouping the inflected forms of a word together under a base form, or of recovering the base form from an inflected form, e.g. grouping the inflected forms COME, COMES, COMING, CAME under the base form COME

Dictionary based
  Input: token + pos
  Output: lemma

Note: needs POS information
  Example: left+v -> leave, left+a -> left

It is the same as looking for a transformation to apply on a word to get its normalized form (word endings: what word suffix should be removed and/or added to get the normalized form) => lemmatization can be modeled as a machine learning problem
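A dictionary-based lemmatizer keyed on token + POS can be sketched as a simple lookup. The dictionary entries below come from the examples above; the names `LEMMA_DICT` and `lemmatize`, the single-letter POS codes and the fallback of returning the lower-cased token for unknown pairs are assumptions.

```python
# (token, pos) -> lemma; entries taken from the examples above.
LEMMA_DICT = {
    ("left", "v"): "leave",
    ("left", "a"): "left",
    ("come", "v"): "come",
    ("comes", "v"): "come",
    ("coming", "v"): "come",
    ("came", "v"): "come",
}


def lemmatize(token, pos):
    """Look up the lemma for token + pos; fall back to the token itself."""
    return LEMMA_DICT.get((token.lower(), pos), token.lower())


print(lemmatize("left", "v"), lemmatize("left", "a"), lemmatize("CAME", "v"))
```

The POS argument is what separates the verb reading of "left" (leave) from the adjective reading (left); without it the lookup would be ambiguous.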

Lemmatizers (1)

CONNEXOR LANGUAGE ANALYSIS TOOLS
Author(s): Connexor, Finland
Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Demos at http://www.conexor.fi/

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Access: Free at http://www.ellogon.org/

FSA
Author(s): Jan Daciuk, Rijksuniversiteit Groningen and Technical University of Gdansk, Poland
Purpose: Supported languages: German, English, French, Polish.
Access: Free at http://juggernaut.eti.pg.gda.pl/~jandac/fsa.html

MBLEM
Author(s): ILK Research Group, Tilburg University
Purpose: MBLEM is a lemmatizer for English, German, and Dutch.
Access: Demo at http://ilk.uvt.nl/mblem/

Lemmatizers (2)

SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English
Access: Free at http://www.euroling.se/produkter/swesum.html

TREETAGGER
Author(s): Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Germany
Purpose: The TreeTagger has been successfully used for German, English, French, Italian, Spanish, Greek and old French texts and is easily adaptable to other languages if a lexicon is available.
Access: Free at http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html

TWOL
Author(s): Lingsoft
Purpose: Supported languages: Swedish, Norwegian, German, Finnish, English, Danish
Access: Not free. More information at http://www.lingsoft.fi/


Syntactic Parsing

Syntax refers to the way words are arranged together and the relationship between them

Parsing is the process of using a grammar to assign a syntactic analysis to a string of words

Approaches:
  Shallow Parsing
  Dependency Parsing
  Context-Free Parsing


Shallow Parsing

Partition the input into a sequence of non-overlapping units, or chunks, each a sequence of words labelled with a syntactic category and possibly a marking to indicate which word is the head of the chunk

How?

Set of regular expressions over POS labels

Training the chunker on manually marked up text
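The first option, a set of regular expressions over POS labels, can be sketched for noun phrase chunks as follows. The pattern (optional determiner, any adjectives, one or more nouns) and the names `NP_PATTERN` and `chunk_noun_phrases` are assumptions for illustration; tags follow the UPenn tagset.

```python
import re

# Optional determiner, any number of adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(<DT>)?(<JJ[RS]?>)*(<NN[PS]*>)+")


def chunk_noun_phrases(tagged):
    """Return non-overlapping NP chunks as lists of words."""
    # Encode the tag sequence as a string such as "<DT><NN><VBD>".
    tags = "".join(f"<{tag}>" for _, tag in tagged)
    chunks = []
    for match in NP_PATTERN.finditer(tags):
        start = tags[: match.start()].count("<")   # index of first token
        end = start + match.group().count("<")     # one "<" per token
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


sentence = [("The", "DT"), ("couple", "NN"), ("spent", "VBD"), ("the", "DT"),
            ("honeymoon", "NN"), ("on", "IN"), ("a", "DT"), ("yacht", "NN")]
print(chunk_noun_phrases(sentence))
```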


Dependency Parsing

Based on dependency grammars, where a syntactic analysis takes the form of a set of head-modifier dependency links between words, each link labelled with the grammatical function of the modifying word with respect to the head

Parser first labels each word with all possible function types and then applies handwritten rules to introduce links between specific types and remove other function-type readings

Context-Free (CF) Parsing

CF parsing algorithms form the basis for almost all approaches to parsing that build hierarchical phrase structure

CFG Example:
  S -> NP VP
  NP -> Det NOMINAL
  NOMINAL -> Noun
  VP -> Verb
  Det -> a
  Noun -> flight
  Verb -> left

A derivation is a sequence of rules applied to a string that accounts for that string (derivation tree)

Parsing is the process of taking a string and a grammar and returning one (more?) parse tree(s) for that string

Treebanks = Parsed corpora in the form of trees

Probabilistic CFGs

Assigning probabilities to parse trees
  Attach probabilities to grammar rules
  The probabilities of the expansions for a given non-terminal sum to 1
  A derivation (tree) consists of the set of grammar rules that are in the tree
  The probability of a tree is just the product of the probabilities of the rules in the derivation.

Needed: grammar, dictionary with POS, parser

Task is to find the max probability tree for an input
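The tree probability can be computed recursively as the product of rule probabilities. The sketch below reuses the small CFG for "a flight left"; the probabilities and the two extra rules (NP -> NOMINAL, VP -> Verb NP) are made up so that some non-terminals have more than one expansion, and the names `RULE_PROB` and `tree_probability` are assumptions.

```python
from collections import defaultdict
from math import isclose, prod

# (left-hand side, right-hand side) -> probability. Values are invented.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "NOMINAL")): 0.6,
    ("NP", ("NOMINAL",)): 0.4,
    ("NOMINAL", ("Noun",)): 1.0,
    ("VP", ("Verb",)): 0.7,
    ("VP", ("Verb", "NP")): 0.3,
    ("Det", ("a",)): 1.0,
    ("Noun", ("flight",)): 1.0,
    ("Verb", ("left",)): 1.0,
}

# Check that the expansions of every non-terminal sum to 1.
totals = defaultdict(float)
for (lhs, _), p in RULE_PROB.items():
    totals[lhs] += p
assert all(isclose(total, 1.0) for total in totals.values())


def tree_probability(tree):
    """Product of the probabilities of all rules used in the tree.

    A tree is a tuple (label, child, ...); a leaf is a plain string.
    """
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return RULE_PROB[(label, rhs)] * prod(tree_probability(c) for c in children)


tree = ("S",
        ("NP", ("Det", "a"), ("NOMINAL", ("Noun", "flight"))),
        ("VP", ("Verb", "left")))
print(tree_probability(tree))
```

A probabilistic parser would compute this score for every candidate tree of the input and return the one with the maximum probability.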

Noun Phrase (NP) Chunkers

fnTBL
Author(s): Radu Florian and Grace Ngai, Johns Hopkins University, USA
Purpose: fnTBL is a customizable, portable and free source machine-learning toolkit primarily oriented towards Natural Language-related tasks (POS tagging, base NP chunking, text chunking, end-of-sentence detection, word sense disambiguation). It is currently trained for English and Swedish.
Platforms: Linux, Solaris, Windows
Access: Free at http://nlp.cs.jhu.edu/~rflorian/fntbl/

Syntactic parsers

MACHINESE PHRASE TAGGER
Author(s): Connexor
Purpose: Machinese Phrase Tagger is a set of program components that perform basic linguistic analysis tasks at very high speed and provide relevant information about words and concepts to volume-intensive applications. Available for: English, French, Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Free access to online demo at http://www.connexor.com/demo/tagger/

Named Entity Recognition

Identification of proper names in texts, and their classification into a set of predefined categories of interest:
  entities: organizations, persons, locations
  temporal expressions: time, date
  quantities: monetary values, percentages, numbers

Two kinds of approaches

Knowledge Engineering
  rule based
  developed by experienced language engineers
  make use of human intuition
  small amount of training data
  very time consuming
  some changes may be hard to accommodate

Learning Systems
  use statistics or other machine learning
  developers do not need language engineering (LE) expertise
  require large amounts of annotated training data
  some changes may require re-annotation of the entire training corpus

Named Entity Recognition: Knowledge engineering approach

Identification of named entities in two steps:
  recognition patterns expressed as WFSA (Weighted Finite-State Automaton) are used to identify phrases containing potential candidates for named entities (longest match strategy)
  additional constraints (depending on the type of candidate) are used for validating the candidates

Usage of an on-line base lexicon for geographical names, first names

Named Entity Recognition: Problems

Variation of NEs, e.g. John Smith, Mr. Smith, John

Since named entities may appear without designators (companies, persons) a dynamic lexicon for storing such named entities is used

Example:

“Mars Ltd is a wholly-owned subsidiary of Food Manufacturing Ltd, a non-trading company registered in England. Mars is controlled by members of the Mars family.”

Resolution of type ambiguity using the dynamic lexicon:

If an expression can be a person name or company name (Martin Marietta Corp.) then use type of last entry inserted into dynamic lexicon for making decision.

Issues of style, structure, domain, genre etc.

Punctuation, spelling, spacing, formatting

Named Entity Recognizers (1)

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Purpose: Available for Unicode.
Access: Free at http://www.ellogon.org/

HEART Of GOLD
Author(s): Ulrich Schäfer, DFKI Language Technology Lab, Germany
Purpose: Supported languages: Unicode, Spanish, Polish, Norwegian, Japanese, Italian, Greek, German, French, English, Chinese.
Access: Free at http://heartofgold.dfki.de/

INSIGHT DISCOVERER EXTRACTOR
Author(s): TEMIS
Purpose: Supported languages: Spanish, Russian, Portuguese, Polish, Italian, Hungarian, Greek, German, French, English, Dutch, Czech.
Access: Not free. More information at http://www.temis-group.com/

Named Entity Recognizers (2)

LINGPIPE
Author(s): Bob Carpenter, Breck Baldwin, Alias-I
Purpose: Supported languages: Unicode, Spanish, German, French, English, Dutch.
Access: Free at http://www.alias-i.com/lingpipe/


Automatic term extraction

Terms = linguistic labels of concepts

Concepts = units of thought (vague definition): if a term represents a unit of thought, its appearance in textual data has to be statistically significant, otherwise, the “unit” nature of the concept the term represents is in question.

Label: different labels can be used for the same concept, and the same label can be used for different concepts.

Automatic term extraction

Lexico-syntactic approaches use lexical and syntactic patterns:
  domain-specific prefixes and suffixes (e.g. formaldehyde)
  part-of-speech sequences (AN; NN; AAN; ANN; NAN; NNN; NPN) (How about ((A|N)+|((A|N)*(N|P)?)(A|N)*)N ?)
  cue word or phrase (immediate left/right contexts)

Statistical approaches use different statistical measures:
  Frequency, relative frequency, tf.idf etc. (for the whole unit)
  Mutual information; t-score; z-score; etc. (collocation measurement)
  C-value: combines both internal and external statistical measures.
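The part-of-speech pattern quoted above can be applied directly as a regular expression over a string of one-letter POS codes. In this sketch A = adjective, N = noun, P = preposition, and any other letter stands for a tag outside the pattern; the coding scheme and the names `TERM_PATTERN` and `candidate_terms` are assumptions.

```python
import re

# The pattern from the slide, over one-letter POS codes.
TERM_PATTERN = re.compile(r"((A|N)+|((A|N)*(N|P)?)(A|N)*)N")


def candidate_terms(tagged):
    """Return word sequences whose POS codes match the term pattern."""
    codes = "".join(code for _, code in tagged)   # one character per token
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in TERM_PATTERN.finditer(codes)
    ]


# D = determiner and V = verb are outside the pattern and break candidates.
sentence = [("the", "D"), ("annual", "A"), ("interest", "N"),
            ("rate", "N"), ("rose", "V")]
print(candidate_terms(sentence))
```

Candidates found this way would normally be filtered afterwards with one of the statistical measures, for example by frequency.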

Terminology extractors (1)

CONNEXOR LANGUAGE ANALYSIS TOOLS
Author(s): Connexor, Finland
Purpose: Supported languages: English, French, Spanish, German, Dutch, Italian, Finnish.
Access: Not free. Demos at http://www.conexor.fi/

ELLOGON
Author(s): George Petasis, Vangelis Karkaletsis, National Center for Scientific Research, Greece
Purpose: Available for Unicode.
Access: Free at http://www.ellogon.org/

FASTR
Author(s): Christian Jacquemin, Groupe Langage et Cognition, CNRS-LIMSI
Purpose: Available for French and English.
Access: Free at http://www.limsi.fr/Individu/jacquemi/FASTR/

INTEX
Author(s): Max Silberztein, New York University
Purpose: Supported languages: Spanish, Portuguese, Italian, French, English.
Access: Free at http://www.nyu.edu/pages/linguistics/intex/

Terminology extractors (2)

NOMINO
Author(s): Université du Québec à Montréal
Purpose: French and English term extractors.
Access: Free at http://www.ling.uqam.ca/nomino/

PROMEMORIA
Author(s): BridgeTerm
Purpose: Translation memory system with terminology extraction component.
Access: Not free. More information at http://www.bridgeterm.com/en/promem.html

PWA
Author(s): Jörg Tiedemann, Mikael Andersson, Magnus Merkel, Lars Ahrenberg, Anna Sågvall Hein, Department of Linguistics, Uppsala University; Department of Computer and Information Science, Linköping University, Sweden
Purpose: Language independent terminology extractor.
Access: Free at http://stp.ling.uu.se/~corpora/plug/pwa/index.html

TerminologyExtractor
Author(s): Etienne Cornu, Chamblon Systems Inc., Cambridge, Ontario, Canada
Purpose: Available for French and English.
Access: Not free. More information at http://www.chamblon.com/terminologyextractor.htm

Terminology extractors (3)

Xerox TermFinder
Author(s): Xerox Multilingual Knowledge Management Solutions
Purpose: Supported languages: Swedish, Spanish, Russian, Portuguese, Norwegian, Hungarian, German, French, Finnish, English, Dutch, Danish.
Access: Not free. More information at http://www.mkms.xerox.com/

Terminology data management tools (1)

DÉJÀ VU
Author(s): Atril Software
Purpose: Translation memory system with integrated terminology tool.
Access: Not free. Trial version at: http://www.atril.com

DICOMAKER
Author(s): Dalix Software
Access: http://www.dicomaker.com/

EDITERM
Author(s): EDIT INC.
Access: Not free. More information at http://www.editerm.com/indexN.html

LEXSYN
Author(s): Babeling
Access: Not free. Evaluation version at http://www.babeling.com/accueil.html

LOGITERM
Author(s): Terminotix Inc.
Access: Not free. More information at http://www.terminotix.com/eng/index.htm

Terminology data management tools (2)

MULTITERM
Author(s): TRADOS
Purpose: Available as a stand-alone version or as part of the TRADOS TM Workbench translation memory system.
Access: Not free. More information at http://www.trados.com/products.asp?page=22

MULTITRANS
Author(s): MultiCorpora R&D Inc.
Purpose: Translation memory system with integrated terminology tool.
Access: Not free. More information at http://www.multicorpora.ca

SYSTEM QUIRK
Author(s): School of ECM, University of Surrey, UK
Access: Free at http://www.computing.surrey.ac.uk/SystemQ/

TERMBASE
Author(s): University of Mainz
Access: Free at http://www.fask.uni-MAINZ.de/user/srini/srini.html

TERMSTAR
Author(s): STAR-USA, LLC
Access: Not free. http://www.star-group.net/eng/software/sprachtech/termstar.html


Text Summarization

Text summarization = automatic creation of summaries of one or more texts

Summary = a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s) and that is no longer than half of the original text(s)

Types of summary:

Extracts: summaries created by reusing portions (words, sentences) of the input text(s)

Abstracts: summaries created by re-generating the extracted content

57

Text Summarization: Methodology

There are three stages of automated text summarization:

Stage 1: Topic identification. Using different criteria of importance, the system identifies the most important units (words, sentences, passages). If it simply lists these units, the result is an extract; otherwise processing continues with Stage 2 and Stage 3.

Criteria of importance:
Cue phrase indicator criteria
Word and phrase frequency criteria
Query and title overlap criteria
Combination of various criteria and scores

Stage 2: Interpretation or topic fusion: template representation of the important topics identified at Stage 1

Stage 3: Summary generation: the information captured in the templates is processed by NLG modules to obtain the summary (abstract)
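
As an illustration of Stage 1 with the word frequency criterion only, the following is a minimal sketch rather than the method of any summarizer listed in this inventory. The stop-word list, the naive sentence splitter and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list used only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lower-cased alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOP_WORDS]

def extract_summary(text, max_sentences=1):
    """Score sentences by average word frequency and list the best ones."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in chosen]
```

Because the selected sentences are listed unchanged, the output is an extract; producing an abstract would require the interpretation and generation stages.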

58

Text summarizers (1)

BREVITY
Author(s): Art Pollard, Lextek International
Access: Not free, demo available at http://www.lextek.com/brevity/

CAST
Author(s): Constantin Orasan, Laura Hasler, Ruslan Mitkov, University of Wolverhampton, UK
Access: Free at http://clg.wlv.ac.uk/projects/CAST

COPERNIC SUMMARIZER
Author(s): Copernic Technologies
Purpose: Supported languages: Spanish, German, French, English.
Access: Not free, trial available at http://www.copernic.com/en/products/summarizer/index.html

GATE
Author(s): NLP Group, University of Sheffield, UK
Access: Free but requires registration at http://gate.ac.uk/

59

Text summarizers (2)

LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech.
Access: Not free. More information at http://www.petamem.com/

MEAD
Author(s): The Center for Language and Speech Processing, Johns Hopkins University, USA
Access: Free at http://www.summarization.com/mead/

MUST
Author(s): Chin-Yew Lin, Eduard Hovy, ISI, USA
Purpose: MuST performs web access, text summarization and translation into English from Japanese, Arabic, Spanish, and Indonesian.
Access: Demo at http://www.isi.edu/natural-language/projects/MuST.html

PERTINENCE
Author(s): A. Lehmam, P. Bouvet, Pertinence
Purpose: Available for English, French and Spanish.
Access: Free at http://www.pertinence.net/index.html

60

Text summarizers (3)

SUMMARIST
Author(s): Eduard Hovy, Chin-Yew Lin, Daniel Marcu, ISI, USA
Purpose: SUMMARIST produces extract summaries in five languages (English, Japanese, Arabic, Spanish and Indonesian)
Access: Demo at http://www.isi.edu/natural-language/projects/SUMMARIST.html

SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English
Access: Free at http://www.euroling.se/produkter/swesum.html

SYSTEM QUIRK
Author(s): School of ECM, University of Surrey, UK
Access: Free at http://www.computing.surrey.ac.uk/SystemQ/

61

Language Identification

The task of detecting the language a text is written in.

Identifying the language of a text from some of the text’s attributes is a typical classification problem.

Two approaches to language identification:

Short words (articles, prepositions, etc.)

N-grams (sequences of n letters). Best results are obtained for trigrams (3 letters).
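
The short-word approach can be sketched as counting how many tokens of the input appear in a per-language list of very frequent function words. The word lists below are hypothetical and hand-picked for illustration; a real system would derive them from corpora and cover far more languages.

```python
import re

# Hypothetical short-word lists (articles, prepositions, conjunctions, ...).
SHORT_WORDS = {
    "english": {"the", "of", "and", "to", "in", "is", "a"},
    "french": {"le", "la", "les", "de", "et", "un", "une", "est"},
    "german": {"der", "die", "das", "und", "ist", "ein", "nicht"},
}

def guess_by_short_words(text):
    """Return the language whose short words occur most often, or None."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {lang: sum(1 for t in tokens if t in words)
              for lang, words in SHORT_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

This approach needs enough running text for function words to show up, which is one reason the trigram method tends to be preferred for short inputs.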

62

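Language Identification: Trigram method

[Diagram: texts in the source languages are fed to the Training Module, which produces language-specific Trigram Data Files; these are merged into a Combined Data File (all languages). The Language Detection Module takes the Combined Data File and the input text and outputs the language of the input text.]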

63

Trigram method: Training module

Given a specific language and a text file written in this language, the training module will execute the following steps:

Remove characters that may reduce the probability of correct language identification (! " ( ) [ ] { } : ; ? , . & £ $ % * 0 1 2 3 4 5 6 7 8 9 - ` +)

Replace all white spaces with _ to mark word boundaries, then replace any sequence of __ with _ so that double spaces are treated as one

Store all three-character sequences within an array, with each having a counter indicating number of occurrences

Remove from the list of trigrams all trigrams with underscores in the middle (‘e_a’ for example) as they are considered to be invalid trigrams

Retain for further processing only those trigrams appearing more than x times

Approximate the probability of each trigram occurring in a particular language by summing the frequencies of all the retained trigrams for that language, and dividing each frequency by the total sum

This process is repeated for all languages the system should be trained on.

All language-specific trigram data files are merged into one combined training file.
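
The training steps above can be sketched as follows. This is a minimal illustration, not the code of any tool in this inventory: the function names are invented, the parameter `min_count` stands in for the threshold x, and the "combined training file" is modelled as an in-memory dictionary rather than a file.

```python
import re
from collections import Counter

# Characters listed above as harmful to language identification.
UNWANTED = set('!"()[]{}:;?,.&£$%*0123456789-`+')

def normalise(text):
    """Drop unwanted characters; mark word boundaries with single underscores."""
    kept = "".join(ch for ch in text if ch not in UNWANTED)
    return re.sub(r"_+", "_", re.sub(r"\s", "_", kept))

def train_language(text, min_count=0):
    """Return {trigram: estimated probability} for one language.

    Only trigrams seen more than min_count times are retained.
    """
    s = normalise(text)
    counts = Counter(s[i:i + 3] for i in range(len(s) - 2))
    # Discard trigrams with an underscore in the middle, then apply threshold.
    kept = {t: c for t, c in counts.items() if t[1] != "_" and c > min_count}
    total = sum(kept.values())
    return {t: c / total for t, c in kept.items()} if total else {}

def train_all(samples, min_count=0):
    """Combine the per-language tables into one structure keyed by language."""
    return {lang: train_language(text, min_count)
            for lang, text in samples.items()}
```

For example, `train_language("the the")` keeps the trigrams `the`, `he_` and `_th` and drops `e_t`, so the retained probabilities are 0.5, 0.25 and 0.25.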

64

Trigram method: Language detection module

Input: text written in an unknown language

The unknown text sample is processed in a similar way to the training data (i.e. removing unwanted characters, replacing spaces with underscores and then dividing it into trigrams), and for each trained language the probability of the resulting sequence of trigrams is computed. This assumes that a zero probability is assigned to each unknown trigram.

The language will be identified by the language trigram data set with the highest combined probability of occurrence.

The fewer characters in the source text, the less accurate the language detection is likely to be.

This method is successful in more than 90% of the cases when the input text contains at least 40 characters.
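
A minimal sketch of the detection step is given below. It repeats a compact version of the preprocessing so that it runs on its own. One point is an assumption of this sketch: the "combined probability" is computed as a sum of trigram probabilities, because with zero probability for unknown trigrams a product would vanish as soon as one unseen trigram occurs. The training sentences in the usage note are toy data.

```python
import re
from collections import Counter

UNWANTED = set('!"()[]{}:;?,.&£$%*0123456789-`+')

def trigrams(text):
    """Preprocess as in training and return the list of valid trigrams."""
    kept = "".join(ch for ch in text if ch not in UNWANTED)
    s = re.sub(r"_+", "_", re.sub(r"\s", "_", kept))
    return [s[i:i + 3] for i in range(len(s) - 2) if s[i + 1] != "_"]

def train(text):
    """Relative frequency of each trigram in a training text."""
    counts = Counter(trigrams(text))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}

def detect(text, models):
    """Return (best language, scores); unknown trigrams contribute zero."""
    grams = trigrams(text)
    scores = {lang: sum(table.get(g, 0.0) for g in grams)
              for lang, table in models.items()}
    return max(scores, key=scores.get), scores
```

Usage with two toy models: after `models = {"english": train("the quick brown fox jumps over the lazy dog and the cat"), "french": train("le renard brun saute par dessus le chien paresseux et le chat")}`, calling `detect("the dog and the fox", models)[0]` returns `"english"`. Real training texts would need to be far larger for the accuracy figures quoted above to be plausible.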

65

Language Guessers (1)

SWESUM
Author(s): Hercules Dalianis, Martin Hassel, KTH, Euroling AB
Purpose: Supported languages: Swedish, Spanish, German, French, English
Access: Free at http://www.euroling.se/produkter/swesum.html

LANGSUITE
Author(s): PetaMem
Purpose: Supported languages: Unicode, Spanish, Polish, Italian, Hungarian, German, French, English, Dutch, Danish, Czech.
Access: Not free. More information at http://www.petamem.com/

TED DUNNING'S LANGUAGE IDENTIFIER
Author(s): Ted Dunning
Access: Free at ftp://crl.nmsu.edu/pub/misc/lingdet_suite.tar.gz

TEXTCAT
Author(s): Gertjan van Noord
Purpose: TextCat is an implementation of the N-Gram-Based Text Categorization algorithm; at the moment, the system knows about 69 natural languages.
Access: Free at http://grid.let.rug.nl/~vannoord/TextCat/

66

Language Guessers (2)

XEROX LANGUAGE IDENTIFIER
Author(s): Xerox Research Centre Europe
Purpose: Supported languages: Albanian, Arabic, Basque, Breton, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Georgian, German, Greek, Hebrew, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latin, Latvian, Lithuanian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Slovakian, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Vietnamese, Welsh
Access: Not free. More information at http://www.xrce.xerox.com/competencies/content-analysis/tools/guesser-ISO-8859-1.en.html

67

Statistical language modeling toolkits

CMU - Cambridge Statistical Language Modeling Toolkit
Author(s): Philip Clarkson and Roni Rosenfeld, Carnegie Mellon University, USA
Purpose: The toolkit is a suite of UNIX software tools to facilitate the construction and testing of statistical language models.
Access: Free at http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html

BOW - A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering
Author(s): Andrew McCallum, Carnegie Mellon University, USA
Purpose: Bow (or LIBBOW) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (RAINBOW), document retrieval (ARROW) and document clustering (CROSSBOW).
Access: Free at http://www-2.cs.cmu.edu/~mccallum/bow/

68

CMU - Cambridge Statistical Language Modeling Toolkit

The Carnegie Mellon Statistical Language Modeling (CMU SLM) Toolkit is a set of Unix software tools designed to facilitate language modeling work in the research community.

Some of the tools are used to process general textual data into:
word frequency lists and vocabularies
word bigram and trigram counts
vocabulary-specific word bigram and trigram counts
bigram- and trigram-related statistics
various backoff bigram and trigram language models

69

CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (1)

text2wfreq
Input: Text file
Output: List of every word which occurred in the text, along with its number of occurrences.

wfreq2vocab
Input: A word-frequency file as produced by text2wfreq.
Output: A file containing a list of vocabulary words

text2wngram
Input: Text file
Output: List of every word n-gram (n is a parameter) which occurred in the text, along with its number of occurrences

text2idngram
Input: Text file plus a vocabulary file
Output: List of every id n-gram (n-tuples of numbers corresponding to the mapping of the word n-grams relative to the vocabulary) which occurred in the text, along with its number of occurrences
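
To make the data flow between these four tools concrete, here is a Python sketch of the same kinds of computation. It is not the toolkit's code and does not reproduce its file formats or command-line options; the function names are invented, and the choice of id 0 for out-of-vocabulary words is an assumption of this sketch.

```python
from collections import Counter

def word_freq(text):
    """Word -> number of occurrences (the kind of list text2wfreq produces)."""
    return Counter(text.split())

def vocab_from_freq(wfreq, top=None):
    """Keep the `top` most frequent words (all if None), sorted alphabetically."""
    return sorted(w for w, _ in wfreq.most_common(top))

def word_ngrams(text, n):
    """Word n-gram -> number of occurrences."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def id_ngrams(text, vocab, n):
    """Id n-gram -> number of occurrences, relative to the given vocabulary.

    Words get 1-based ids in vocabulary order; id 0 marks unknown words.
    """
    ids = {w: i for i, w in enumerate(vocab, start=1)}
    tokens = [ids.get(w, 0) for w in text.split()]
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
```

With the text `"a b a c a b"` and a vocabulary restricted to the two most frequent words, the word bigram `("a", "b")` and the id bigram `(1, 2)` both have count 2, while `"c"` is mapped to id 0.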

70

CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (2)

wngram2idngram
Input: Word n-gram file, plus a vocabulary file
Output: List of every id n-gram which occurred in the text, along with its number of occurrences

idngram2stats
Input: An id n-gram file
Output: A list of the frequency-of-frequencies for each of the 2-grams, …, n-grams

mergeidngram
Input: A set of id n-gram files
Output: One id n-gram file containing the merged id n-grams from the input files

idngram2lm
Input: An id n-gram file and a vocabulary file
Output: A language model in either binary format or in ARPA format
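
Two of the operations above, merging id n-gram counts and computing frequency-of-frequencies, are simple enough to sketch in Python. The functions below are illustrative only: they work on in-memory counters for a single n-gram order and say nothing about the toolkit's actual file formats.

```python
from collections import Counter

def merge_id_ngrams(*tables):
    """Add up the counts of several id n-gram tables."""
    merged = Counter()
    for table in tables:
        merged.update(table)
    return merged

def freq_of_freqs(id_ngram_counts):
    """Map each count r to the number of distinct n-grams seen exactly r times."""
    return Counter(id_ngram_counts.values())
```

Frequency-of-frequencies tables of this kind are the input that discounting schemes such as Good-Turing work from when estimating probabilities for rare and unseen n-grams.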

71

CMU - Cambridge Statistical Language Modeling Toolkit – The Tools (3)

binlm2arpa
Input: A binary format language model, as generated by idngram2lm
Output: An ARPA format language model

evallm
Input: A binary or ARPA format language model, as generated by idngram2lm.
Output: Confirmation or denial that the probabilities of all vocabulary words, given the context supplied by the user, sum to one.
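
The sum-to-one check can be illustrated on a toy model. The sketch below does not use the toolkit or a backoff model; it swaps in a bigram model with add-one smoothing, chosen only because its normalisation is easy to verify, and all names are invented for the example.

```python
from collections import Counter

def bigram_model(tokens):
    """Return (vocabulary, prob) for an add-one smoothed bigram model."""
    vocab = sorted(set(tokens))
    contexts = Counter(tokens[:-1])             # how often each word is a context
    bigrams = Counter(zip(tokens, tokens[1:]))  # (context, word) -> count

    def prob(word, context):
        return (bigrams[(context, word)] + 1) / (contexts[context] + len(vocab))

    return vocab, prob

def sums_to_one(vocab, prob, context, tol=1e-9):
    """Check that P(w | context) summed over the vocabulary is 1."""
    return abs(sum(prob(w, context) for w in vocab) - 1.0) <= tol
```

For the token sequence `the cat sat on the mat`, the check succeeds for every context, including a context that never occurred in training, since add-one smoothing then yields a uniform distribution.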

72

Corpora - Large collections aimed at the NLP community

LDC (Linguistic Data Consortium)
Access: http://www.ldc.upenn.edu/

ELRA (European Language Resources Association)
Access: http://www.elra.info/

TRACTOR (TELRI Research Archive of Computational Tools and Resources)
Access: http://www.tractor.de/

CLR (Consortium for Lexical Research)
Access: http://crl.nmsu.edu/Tools/CLR/

European Corpus Initiative Multilingual Corpus I (ECI/MCI)
Access: http://www.elsnet.org/resources/eciCorpus.html

MULTEXT: Multilingual Text Tools and Corpora
Access: http://www.lpl.univ-aix.fr/projects/multext/

Electronic Text Collections in Western European Literature
Purpose: Pointers to internet sources for literary texts in the western European languages other than English: Catalan, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Old Norse, Portuguese, Provençal, Spanish, Swedish.
Access: Free at http://www.lib.virginia.edu/wess/etexts.html

73

Other multilingual corpora

CRATER Multilingual Aligned Annotated Corpus
Purpose: Aligned corpus in English, French and Spanish.
Access: http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html

EMILLE/CIIL
Purpose: Monolingual written corpus data for 14 South Asian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu and Urdu). Orthographically transcribed spoken data and parallel corpus data for five South Asian languages (Bengali, Gujarati, Hindi, Punjabi and Urdu). In addition, the parallel corpus contains the English originals from which the translations stored in the corpus were derived. All data in the corpus is CES and Unicode compliant. The EMILLE corpus totals some 94 million words.
Access: Free at http://bowland-files.lancs.ac.uk/corplang/emille/

OPUS
Purpose: An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.
Access: http://logos.uio.no/opus/

Searchable Canadian Hansard French-English parallel texts (1986-1993)
Access: http://rali.iro.umontreal.ca/

European Union Web Server
Access: http://europa.eu.int/

74

Online multilingual dictionaries

ECTACO
Access: www.ectaco.com

YOURDICTIONARY
Purpose: It is the most comprehensive index of dictionaries available on the web.
Access: http://www.yourdictionary.com/

75

Lexical resources (wordnets)

Balkanet
Purpose: The Balkanet project aimed at the development of a multilingual lexical database comprising individual WordNets for the Balkan languages (Bulgarian, Czech, Greek, Romanian, Serbian and Turkish).
Access: http://www.ceid.upatras.gr/Balkanet/

EuroWordnet
Purpose: EuroWordNet is a multilingual database with wordnets for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian).
Access: http://www.illc.uva.nl/EuroWordNet/

WordNet
Purpose: WordNet is an online lexical reference system. The wordnets developed as a result of the Balkanet and EuroWordnet projects are linked to the original Princeton WordNet to ensure conceptual equivalence.
Access: http://wordnet.princeton.edu/

76

Treebanks (1)

Penn Treebank
Language: US-English
Size: 2 million+ words
Access:

BLLIP WSJ corpus
Language: US-English
Size: 30 million words
Access:

ICE-GB
Language: UK-English
Size: 1 million words
Access:

NEGRA Corpus
Language: German
Size: 20000 sentences
Access:

77

Treebanks (2)

TIGER Corpus
Language: German
Size: 700000 words
Access:

Alpino Dependency Treebank
Language: Dutch
Size: 150000 words
Access:

The Prague Dependency Treebank 1.0
Language: Czech
Size: 500000 words
Access:

Bulgarian Treebank
Language: Bulgarian
Size: n/a
Access:

78

Treebanks (3)

Penn Chinese Treebank
Language: Chinese
Size: 100000 words
Access:

Danish Dependency Treebank 1.0
Language: Danish
Size: 100000 words
Access:

Syntactic Spanish Database
Language: Spanish
Size: 1.5 million words
Access:

LDC Korean Treebank
Language: Korean
Size: n/a
Access:

79

Methods and applications that did not make it into this presentation

Word Sense Disambiguation Nancy Ide and Dan Tufis

Anaphora Resolution Dan Cristea, Constantin Orasan and Oana Postolache

Machine Translation Daniel Marcu and Dragos Stefan Munteanu

Question Answering Bernardo Magnini and Marius Pasca

80

Conclusions

Many resources for textual NLP already exist on the Web and can be exploited and adapted to new languages

All methods presented today can be adapted to a new language

Hopefully the present inventory will be of help in your future NLP activity

81

Thank you!
