evalita 2007 frascati, september 10th 2007 emanuele pianta and roberto zanoli fbk-irst, trento

19
EVALITA 2007 EVALITA 2007 Frascati, September 10th 2007 Frascati, September 10th 2007 TAGPRO A system for ITALIAN POS A system for ITALIAN POS TAGGING based on SVM TAGGING based on SVM Emanuele Pianta and Roberto Emanuele Pianta and Roberto Zanoli Zanoli FBK-irst, Trento FBK-irst, Trento

Upload: tucker-cousins

Post on 15-Dec-2015

222 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

EVALITA 2007EVALITA 2007

Frascati, September 10th 2007Frascati, September 10th 2007

TAGPROA system for ITALIAN POS A system for ITALIAN POS TAGGING based on SVMTAGGING based on SVM

Emanuele Pianta and Roberto ZanoliEmanuele Pianta and Roberto Zanoli

FBK-irst, TrentoFBK-irst, Trento

Page 2: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

TextProTextPro

22

A suite of modular NLP tools developed at FBK-irst TokenPro: tokenization MorphoPro: morphological analysis TagPro: Part-of-Speech tagging LemmaPro: lemmatization EntityPro: Named Entity recognition ChunkPro: phrase chunking SentencePro: sentence splitting

Architecture designed to be efficient, scalable and robust. Cross-platform: Unix / Linux / Windows / MacOS X Multi-lingual models All modules integrated and accessible through unified command line interface

Page 3: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

33

TagPro’s architecture

To build TagPro we used YamCha, an SVM-based machine learning environment. TagPro can exploit a rich set of linguistic features, such as morphological analysis, prefixes and suffixes

Feature Feature selectionselection

ControllerController

Feature extractionortho, prefix, suffix, dictionary,

morpho analysis

dictionary

Learning

models

ClassificationClassification

YamCha

Training

data

Test

dataFeature Feature selectionselection

TagPro

Feature extractionortho, prefix, suffix, dictionary,

morpho analysis

MorphoPro

Page 4: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

YamChaYamCha

44

• Created as generic, customizable, open source text chunker

• Can be adapted to a lot of other tag-oriented NLP tasks

• Uses state-of-the-art machine learning algorithm (SVM)

Can redefine Context (window-size) parsing-direction (forward/backward) algorithms for multi-class problem (pair wise/one vs rest)

Practical chunking time (1 or 2 sec./sentence.)

Available as C/C++ library

Page 5: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

Support Vector MachinesSupport Vector Machines

55

Based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995)

• SVM map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed. • Two parallel hyperplanes are constructed on each side of the hyperplane that separates the data. • The separating hyperplane is the hyperplane that maximizes the distance between the two parallel hyperplanes.

Page 6: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

YamCha: YamCha: Setting Window SizeSetting Window Size

66

Default setting is "F:-2..2:0.. T:-2..-1".

The window setting can be customized

Page 7: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

Training and Tuning SetTraining and Tuning Set

77

The Evalita development set was randomly split into 2 parts

Training: 89,170 tokens

Tuning: 44,586 tokens

Page 8: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

FEATURESFEATURES

88

For each running word a rich set of features are extracted

WORD: the word itself (both unchanged and lower-cased) e.g. Autore autore

MORPHO: the morphological analysis (produced by MorphoPro)e.g. Autore autore+n+m+sing Calcio calcio calcio+n+m+sing

calciare+v+indic+pres+nil+1+sing

AFFIX: prefixes/suffixes (2, 3, 4 or 5 chars. at the start/end of the word)e.g. libro {li,lib,libr,libro,ro,bro,ibro,libro}

ORTHOgraphic information (e.g. capitalization, hypenation)e.g. Oggi C (capitalized) oggi L (lowercased)

GAZETTeers of proper nouns (154,000 proper names, 12,000 cities,5,000 organizations and 3,200 locations)

Page 9: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

99

Static vs Dynamic FeaturesStatic vs Dynamic Features

STATIC FEATURES extracted for the current, previous and

following word WORD, MORPHO, AFFIXes, ORTHO,

GAZET

DYNAMIC FEATURES decided dynamically during tagging tag of the two tokens preceding the current

token.

Page 10: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

An Example of An Example of Feature Extraction Feature Extraction

1010

l' ARTex ADJleader NNsocialista ADJBettino NN_PCraxi NN_P

l' l' l' __nil__ __nil__ __nil__ l' __nil__ __nil__ __nil__ L A N N N N N N N N N N N Y N N N N N N N N Y N O O O O ARTex ex ex __nil__ __nil__ __nil__ ex __nil__ __nil__ __nil__ L N N N N N N N N N N Y 2 N N N Y N N N N N N N O O O O ADJleader leader le lea lead leade er der ader eader L N N N N N N N N N N Y N N Y 0 N N N N N N N N O O O O NNsocialista socialista so soc soci socia ta sta ista lista L N N N N N N N N N N Y 2 N Y 0 N N N N N N N N O O O O ADJBettino bettino be bet bett betti no ino tino ttino C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-NAM NN_PCraxi craxi cr cra crax craxi xi axi raxi craxi C N N N N N N N N N N N N N N N N N N N N Y N N O O O B-SUR NN_P

Page 11: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

Finding the best featuresFinding the best features

1111

EAGLES TagSet Accuracy UTAccuracy

baseline 86.70 59.95

+AFFIX +ORTHO +8.56 +25.56

+AFFIX +ORTHO +MORPHO +10,69 +33.18

+AFFIX +ORTHO +MORPHO +GAZETT +10.72 +33.13

Baseline: WORD (both unchanged and lower-cased) window-size: +1,-1

Page 12: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

Finding the best window-sizeFinding the best window-size

1212

EAGLES TagSet STAT DYN Accuracy

+1,-1 -1 97.42

+2,-2 -2 -0.34+1,-1 -2 +0.23+1,-1 -3 +0.22

Given the best set of features (F1=97.42) we tried to improve Accuracy by changing the window-size

Page 13: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

multi-class problemmulti-class problempair-wise/one vs restpair-wise/one vs rest

1313

one vs rest: fewer bigger classifiers pairwise:

a classifier for each possible pair of classes choose the classifier with best confidence many relatively small classifiers faster, less memory

EAGLES TagSet method Accuracy

pairwise 97.65one vs rest 97.78

Page 14: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

Evaluating the best algorithmEvaluating the best algorithmPKI vs. PKEPKI vs. PKE

1414

EAGLES TagSet Accuracy

PKI 97.78PKE 97.64

YamCha uses two implementations of SVMs: PKI and PKE.

• both are faster than the original SVMs

PKI (3-12 x faster) produces the same accuracy as the original SVMs.

PKE (10-300 x) approximates the orginal SVM, slightly less accurate but much faster

Page 15: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

Results on the development setResults on the development set

1515

EAGLES DISTRIB

Accuracy 97.78 97.52

Known Words: 40,320 MorphoPro coverage: 96.20% Accuracy 98.29 97.95

Unknown Words: 4,396 MorphoPro coverage: 84.41% Accuracy 93.03 93.56

Page 16: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

Test ResultsTest Results

1616

TagSet Accuracy UTAccuracy

EAGLES 98.04 95.02

DISTRIB 97.68 94.65

Page 17: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

ConclusionsConclusions

1717

A statistical approach to PoS-Tagging for Italian based on YamCha / SVMs.

Results confirm that SVMs can deal with a big number of features without incurring in overfitting.

We used the same best configuration for both tagsets.

No specific method was applied for classifying unknown words.

Features: AFFIX+ORTHO: +8.56 over baselineMORPHO: 2.13 improvement over AFFIX+ORTHOGAZETteers do not contribute any further significant improvement

• Features for unknown words: AFFIX+ORTHO:+25.56 MORPHO: ++7,62

No benefit from a larger context (e.g. window-size +2,-2 and more)

Page 18: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

TagProTagPro

1818

TagPro is a system for PoS-tagging based on YamCha.

YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo)

is a generic, customizable, and open source text chunker.is based on Support Vector Machines (SVMs)

TagPro exploits a rich set of linguistic features such as the morphological analysis prefixes and suffixes.

The system is part of TextPro, a suite of NLP tools developed at FBK-irst.

Page 19: EVALITA 2007 Frascati, September 10th 2007 Emanuele Pianta and Roberto Zanoli FBK-irst, Trento

1919

PRON_PER V_AVERE V_PP ADV ART ADJ NN NN_P P_OTH PREP_A P_EOS V_GVRB PREP CONJ_S ADJ_DIM V_MOD CONJ_C V_ESSERE PRON_REL PRON_DIM PRON_IND ADJ_IND PRON_IES V_CLIT ADJ_NUM C_NUM ADJ_POS ADJ_IES PRON_POS NULL P_APO INT

PRON_PER 972 0 0 20 12 0 0 0 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

V_AVERE 0 521 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

V_PP 0 0 1188 2 0 97 19 0 0 0 0 2 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0

ADV 9 0 2 2248 0 12 10 3 0 0 0 0 20 14 0 0 0 0 0 0 3 1 0 0 0 0 0 0 0 0 0 0

ART 6 0 0 0 3780 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0 0 0 0

ADJ 0 0 97 14 0 2454 120 3 0 0 0 6 2 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0 0 0 0

NN 0 0 21 23 0 114 8673 43 0 0 0 24 0 1 0 0 0 1 0 0 1 0 2 0 6 2 0 0 0 1 0 0

NN_P 0 0 0 1 1 3 15 1847 0 3 0 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0

P_OTH 0 0 0 0 0 0 0 0 3476 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

PREP_A 0 0 0 0 0 2 0 1 0 2655 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

P_EOS 0 0 0 0 0 0 0 0 2 0 1786 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

V_GVRB 0 0 7 2 0 15 27 0 0 0 0 2754 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0

PREP 0 0 0 5 0 0 1 0 0 1 0 0 4230 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

CONJ_S 1 0 0 7 0 0 1 0 0 0 0 0 4 598 0 0 3 0 70 0 0 0 0 0 0 0 0 0 0 0 0 0

ADJ_DIM 0 0 0 0 0 2 0 0 0 0 0 0 0 0 261 0 0 0 0 8 0 7 0 0 0 0 1 0 0 0 0 0

V_MOD 0 0 1 0 0 3 0 0 0 0 0 4 0 0 0 285 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

CONJ_C 0 0 0 2 0 0 3 0 0 1 0 0 1 10 0 0 1759 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0

V_ESSERE 0 0 0 2 0 0 2 0 0 0 0 1 0 0 0 0 1 1285 0 0 0 0 0 0 1 0 0 0 0 0 0 0

PRON_REL 0 0 0 1 0 0 0 0 0 0 0 0 0 24 0 0 0 0 664 0 0 2 2 0 0 0 0 0 0 0 0 0

PRON_DIM 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 0 0 0 0 191 0 0 0 0 0 0 0 0 0 0 0 0

PRON_IND 0 0 0 7 4 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 214 16 0 0 0 0 0 0 0 0 0 0

ADJ_IND 0 0 0 6 0 4 1 0 0 0 0 0 1 0 0 0 0 0 0 0 5 395 0 0 0 0 0 0 0 0 0 0

PRON_IES 0 0 0 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 19 0 0 0 0 0 0 0 0 0

V_CLIT 0 3 7 0 0 0 8 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 232 0 0 0 0 0 0 0 0

ADJ_NUM 0 0 0 0 2 4 3 2 0 0 0 0 3 0 0 0 0 0 0 0 2 3 0 0 314 0 0 0 0 0 0 0

C_NUM 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 348 0 0 0 0 0 0

ADJ_POS 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 351 0 0 0 0 0

ADJ_IES 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 3 0 0 2 0 0 0 0 0 8 0 0 0 0

PRON_POS 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8 0 0 0 0 0

NULL 0 0 0 0 0 1 4 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 21 0 0

P_APO 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 30 0

INT 0 0 0 0 0 2 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2

Confusion matrix