Automatic Keyphrase Extraction from Croatian Newspaper Articles

Renee Ahel, Bojana Dalbelo Bašić, Jan Šnajder

Knowledge Technologies Lab
Department of Electronics, Microelectronics, Computer and Intelligent Systems

University of Zagreb, Faculty of Electrical Engineering and Computing

INFuture 2009

INFuture 2009, 05.11.2009. 2/19

Agenda

• Assigning keyphrases

• Related work

• Extraction system

• Corpora

• Results

• Conclusion

Assigning keyphrases

• Keyphrases
  – Summarize documents
  – Are often not assigned to documents
  – Manual assignment is a tedious task

• Automatic assignment methods
  – Keyphrase assignment (in the narrow sense)
  – Keyphrase extraction

Related work

• Our work is inspired by KEA (Witten et al. 1999)
  – Good performance despite using a simple set of features
  – Our approach: more features, improvements on candidate generation

• POS tag filtering similar to Hulth (2003)
  – Larger set of POS tags for filtering (Petrović et al. 2009)

Extraction system

[Diagram: extraction system architecture. Processing steps: pre-processing, candidate generation, feature calculation, learning, ranking. Data stores: staging area, candidate warehouse, learning candidate feature matrix, classification candidate feature matrix, knowledge base. Legend: program, relational database.]

Extraction – pre-processing

Input: “Vrhovni sud potvrdio je presudu Županijskog suda u Splitu.” (English: "The Supreme Court upheld the verdict of the County Court in Split."), together with the HINA categories

Steps: phrase boundaries, tokenization, lemmatization

Output (one row per token):

Token_ID  Document_token_index  Is_in_categories  Is_in_title  Token_text   POS_tag  IsStopword  Lemma
1591219   3                     1                 1            vrhovni      A        0           vrhovan
1591220   4                     1                 1            sud          N        0           sud
1591221   5                     0                 0            potvrdio     V        0           potvrditi
1591222   6                     0                 0            je           X        1           je
1591223   7                     1                 0            presudu      N        0           presuda
1591224   8                     1                 0            županijskog  A        0           županijski
1591225   9                     1                 1            suda         N        0           sud
1591226   10                    0                 0            u            X        1           u
1591227   11                    0                 0            splitu       N        0           split

Extraction – candidate generation

Input (tokens):

Token_ID  Document_token_index  Is_in_categories  Is_in_title  Token_text  POS_tag  IsStopword  Lemma
1591219   3                     1                 1            vrhovni     A        0           vrhovan
1591220   4                     1                 1            sud         N        0           sud
1591221   5                     0                 0            potvrdio    V        0           potvrditi
1591222   6                     0                 0            je          X        1           je
1591223   7                     1                 0            presudu     N        0           presuda
...       ...                   ...               ...          ...         ...      ...         ...

Candidate generation (n-grams with their POS tag patterns):

vrhovni               A
vrhovni sud           AN
vrhovni sud potvrdio  ANV

Output (candidates):

Candidate_ID  Appear_idx  Is_in_categories  Is_in_title  IsKeyword  Ngram_level  Original_text  Lemmatized_text  POS_tag_pattern
978155        3           1                 1            0          2            vrhovni sud    vrhovan sud      AN
978137        4           1                 1            0          1            sud            sud              N
978138        7           1                 0            1          1            presudu        presuda          N
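
The n-gram enumeration on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Token`, `Candidate`, and `generate_candidates` are hypothetical, the maximum n-gram length of 3 is an assumption based on the longest patterns shown in the slides, and phrase boundaries are ignored for brevity.

```python
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    pos: str    # single-letter POS tag; X marks a stopword
    lemma: str

class Candidate(NamedTuple):
    appear_idx: int
    ngram_level: int
    original_text: str
    lemmatized_text: str
    pos_tag_pattern: str

def generate_candidates(tokens, first_index=0, max_n=3):
    """Enumerate every n-gram of 1..max_n consecutive tokens (max_n is assumed)."""
    candidates = []
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            window = tokens[start:start + n]
            if len(window) < n:
                break  # ran past the end of the token list
            candidates.append(Candidate(
                appear_idx=first_index + start,
                ngram_level=n,
                original_text=" ".join(t.text for t in window),
                lemmatized_text=" ".join(t.lemma for t in window),
                pos_tag_pattern="".join(t.pos for t in window),
            ))
    return candidates

# First five tokens of the sample sentence; the first token has document index 3.
tokens = [
    Token("vrhovni", "A", "vrhovan"),
    Token("sud", "N", "sud"),
    Token("potvrdio", "V", "potvrditi"),
    Token("je", "X", "je"),
    Token("presudu", "N", "presuda"),
]
candidates = generate_candidates(tokens, first_index=3)
for c in candidates[:3]:
    print(c.original_text, c.pos_tag_pattern)
```

The first three candidates reproduce the n-grams listed on the slide (A, AN, ANV). Keeping the lemmatized text and the POS pattern next to the surface form is what later allows lemma matching and POS-pattern filtering.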

Extraction – feature calculation

Candidates:

Candidate_ID  Document_length  Appearance_index  Is_in_categories  Is_in_title  IsKeyword  Ngram_level  First_lemma_match  Lemma_matches  Original_text             Lemmatized_text
978155        33               3                 1                 1            0          2            0                  1              vrhovni sud               vrhovan sud
978137        33               4                 1                 1            0          1            0                  2              sud                       sud
978138        33               7                 1                 0            1          1            1                  0              presudu                   presuda
978164        33               7                 1                 1            0          3            1                  0              presudu županijskog suda  presuda županijski sud

Feature calculation for the candidate "presuda" (with the corpus as an additional input):

TF_IDF                     0.967
Is_in_title                0
First_appearance_relative  0.212
Is_in_categories           1

Candidate feature matrix:

Candidate_ID  Original_text  First_appearance_relative  Is_in_categories  Is_in_Title  TF_IDF  IsKeyword
978138        presudu        0.212                      1                 0            0.097   1
978140        splitu         0.333                      0                 0            0.124   0
...           ...            ...                        ...               ...          ...     ...
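
The sample values suggest that First_appearance_relative is the candidate's first token index divided by the document length (7 / 33 ≈ 0.212 for "presudu"). The sketch below encodes that reading; it is an inference from the numbers on the slides rather than a stated definition, and the function name is hypothetical.

```python
def first_appearance_relative(appearance_index: int, document_length: int) -> float:
    """Relative position of a candidate's first occurrence (assumed: index / length)."""
    if document_length <= 0:
        raise ValueError("document_length must be positive")
    return appearance_index / document_length

# Appearance indexes and the document length (33) are taken from the sample rows.
presudu = first_appearance_relative(7, 33)
splitu = first_appearance_relative(11, 33)
print(round(presudu, 3), round(splitu, 3))
```

Candidates that first occur early in the article get values close to 0, which matches the intuition that newspaper articles introduce their main topics near the start.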

Extraction – learning and ranking

Candidate feature matrix:

Candidate_ID  Original_text             First_appearance_relative  Is_in_categories  Is_in_title  TF_IDF  IsKeyword
978138        presudu                   0.212                      1                 0            0.096   1
978140        splitu                    0.333                      0                 0            0.124   0
978156        županijskog suda          0.242                      1                 1            0.089   0
978164        presudu županijskog suda  0.212                      1                 1            0.188   0
978165        suda u splitu             0.272                      1                 1            0.183   0

[Diagram: discretization for naive Bayes, naive Bayes learning, knowledge base, ranking]

Ranking output:

Candidate_ID  Original_text             TF_IDF  Output_probability  Higher_scored_pattern_matches
978164        presudu županijskog suda  0.188   0.1777              0
978165        suda u splitu             0.183   0.0876              0
978140        splitu                    0.124   0.0876              0
978138        presudu                   0.096   0.1777              1
978156        županijskog suda          0.089   0.1777              1
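
One way to turn these scores into a final list is sketched below. It assumes a KEA-style procedure: candidates are ordered by classifier probability, ties are broken by TF_IDF, and candidates flagged in Higher_scored_pattern_matches are dropped. The slides do not state the sort key or how flagged candidates are handled, so both are assumptions, and the names `Scored` and `rank` are hypothetical.

```python
from typing import NamedTuple

class Scored(NamedTuple):
    candidate_id: int
    text: str
    tf_idf: float
    probability: float
    higher_scored_pattern_match: int

def rank(candidates, top_k=10):
    """Drop flagged candidates, then sort by probability with TF_IDF as tie-breaker."""
    kept = [c for c in candidates if not c.higher_scored_pattern_match]
    kept.sort(key=lambda c: (c.probability, c.tf_idf), reverse=True)
    return kept[:top_k]

# Rows copied from the ranking table on this slide.
SLIDE_ROWS = [
    Scored(978164, "presudu županijskog suda", 0.188, 0.1777, 0),
    Scored(978165, "suda u splitu", 0.183, 0.0876, 0),
    Scored(978140, "splitu", 0.124, 0.0876, 0),
    Scored(978138, "presudu", 0.096, 0.1777, 1),
    Scored(978156, "županijskog suda", 0.089, 0.1777, 1),
]
ranked = rank(SLIDE_ROWS)
for c in ranked:
    print(c.candidate_id, c.text)
```

Under these assumptions the two flagged candidates are removed and "presudu županijskog suda" ranks first, ahead of the two candidates with the lower probability.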

Corpora

                                           October 4457  January 4532
Number of documents (articles)             3905          4521
Average document length (in words)         525           335
Average number of candidates per document  235           153
Average number of keyphrases per document  2             2

• Deficiencies
  – Assigned keyphrases not appearing in text removed (57% of original keyphrases)
  – Unknown inter-annotator agreement
  – Inconsistent keyphrases (63% of keyphrases assigned to only one document)

• Experimental set
  – 200 documents
  – On average, 6.5 keyphrases and 370 candidates per document

Results – basic configurations

[Figure: F1 (%) versus number of extracted keyphrases (0 to 16), comparing MDL discretization and percentile discretization.]

Results – additional POS filter

• Filtering out the candidates that do not match the POS patterns N, AN, NN, NXN
  – discards 30% of negative candidates
  – discards only 7.5% of positive candidates

POS pattern  Positive candidates (%)  Negative candidates (%)  Total (%)
N            49.77                    38.75                    38.87
AN           28.17                    13.26                    13.42
NN           11.58                    10.41                    10.42
F            1.91                     9.72                     9.64
NXN          2.98                     7.57                     7.52
NAN          1.56                     3.22                     3.2
Other        4.02                     17.06                    16.92

F = lemmatization failed, X = stopword
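
The 30% and 7.5% figures follow directly from the table: they are the shares of candidates whose pattern is not one of N, AN, NN, NXN. The short check below recomputes them; the names `ALLOWED_PATTERNS`, `SHARE`, `passes_pos_filter`, and `discarded_share` are hypothetical and only serve this illustration.

```python
ALLOWED_PATTERNS = {"N", "AN", "NN", "NXN"}

# Share of candidates per POS pattern in percent, copied from the table:
# pattern -> (positive candidates, negative candidates)
SHARE = {
    "N": (49.77, 38.75),
    "AN": (28.17, 13.26),
    "NN": (11.58, 10.41),
    "F": (1.91, 9.72),
    "NXN": (2.98, 7.57),
    "NAN": (1.56, 3.22),
    "Other": (4.02, 17.06),
}

def passes_pos_filter(pattern: str) -> bool:
    """True if a candidate's POS tag pattern is kept by the additional filter."""
    return pattern in ALLOWED_PATTERNS

def discarded_share(column: int) -> float:
    """Percentage of candidates removed by the filter (0 = positive, 1 = negative)."""
    kept = sum(values[column] for pattern, values in SHARE.items()
               if passes_pos_filter(pattern))
    return 100.0 - kept

print(round(discarded_share(0), 1), round(discarded_share(1), 1))
```

The filter is attractive because it is asymmetric: it removes roughly four times as large a share of negative candidates as of positive ones.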

Results – additional POS filter

[Figure: % change in F1 versus number of extracted keyphrases (0 to 16) for MDL + POS filtering.]

[Figure: F1 (%) versus number of extracted keyphrases (0 to 16), comparing MDL + POS filtering with MDL discretization.]

Results – ablation study

• Influence of each feature on performance
• Holding out one feature and doing keyphrase extraction using the remaining features

[Figure: ablation results for each held-out feature (Is in category, Is in title, First appearance relative, TFxIDF), showing precision (%), recall (%), F1 (%), and % change in F1; vertical axis from -30 to 30.]
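
The hold-one-out procedure can be sketched as a loop over features. This is an illustration under stated assumptions: `ablation` and `toy_evaluate` are hypothetical names, the % change is assumed to be measured relative to the run with all features, and the toy scores are placeholders chosen for easy arithmetic, not results from the slides.

```python
def ablation(features, evaluate):
    """Return the % change in F1 when each feature is held out in turn."""
    baseline = evaluate(features)
    changes = {}
    for held_out in features:
        remaining = [f for f in features if f != held_out]
        score = evaluate(remaining)
        changes[held_out] = 100.0 * (score - baseline) / baseline
    return changes

FEATURES = ["Is_in_categories", "Is_in_title", "First_appearance_relative", "TF_IDF"]

# Placeholder contributions; a real evaluate() would train and test the extractor.
TOY_CONTRIBUTION = {
    "Is_in_categories": 1.0,
    "Is_in_title": 2.0,
    "First_appearance_relative": 3.0,
    "TF_IDF": 4.0,
}

def toy_evaluate(features):
    """Stand-in for a full train/extract/score run; returns a made-up F1 value."""
    return sum(TOY_CONTRIBUTION[f] for f in features)

changes = ablation(FEATURES, toy_evaluate)
for name, change in changes.items():
    print(name, change)
```

A negative change means the extractor does worse without the feature, so the feature helps; a positive change would indicate that the feature hurts performance on this data.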

Examples

Document title: Lijevo-desna nerazumijevanja ("Left-right misunderstandings")

Pre-assigned keyphrases: ljevica, politički život, vrijednosti, socijalna država, socijaldemokrati, liberali, socijalna politika, crkveni socijalni nauk, ekonomska politika, suverenitet, zakonitost

Extracted keyphrases*: poimanju, suvereniteta, stranke, politika, različitim poimanjem, predizbornu kampanju, zakonitosti, liberale, socijaldemokrate, konzervativce

* Keyphrase normalization is a work in progress

Examples

Document title: EU lansirala petogodišnji plan sigurnosti ("EU launched a five-year security plan")

Pre-assigned keyphrases: akcijski plan, sigurnost, imigracijska politika, pravosuđe, granična kontrola, međunarodna suradnja

Extracted keyphrases: petogodišnji plan, sigurnosti, akcijski plan, imigracijske politike, pravosuđa i sigurnosti, europsku sigurnost, suradnje između zemalja, terorizma, sigurnost granica, slobode i sigurnosti
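
To make the evaluation measures concrete, the sketch below computes precision, recall, and F1 for this example. It assumes exact string matching between extracted and pre-assigned keyphrases, which the slides do not specify, and the two lists are the keyphrases for this article as read from the slide; the function name is hypothetical.

```python
def precision_recall_f1(extracted, assigned):
    """Score extracted keyphrases against assigned ones using exact string matches."""
    extracted_set, assigned_set = set(extracted), set(assigned)
    correct = len(extracted_set & assigned_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(assigned_set) if assigned_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

assigned = [
    "akcijski plan", "sigurnost", "imigracijska politika",
    "pravosuđe", "granična kontrola", "međunarodna suradnja",
]
extracted = [
    "petogodišnji plan", "sigurnosti", "akcijski plan", "imigracijske politike",
    "pravosuđa i sigurnosti", "europsku sigurnost", "suradnje između zemalja",
    "terorizma", "sigurnost granica", "slobode i sigurnosti",
]
precision, recall, f1 = precision_recall_f1(extracted, assigned)
print(precision, recall, f1)
```

With exact matching only "akcijski plan" counts as correct. Inflected forms such as "imigracijske politike" do not match "imigracijska politika", which illustrates why keyphrase normalization matters for a morphologically rich language like Croatian.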

Conclusion

• Overall best result achieved – MDL + additional POS filtering, 10 extracted keyphrases

• In the absence of comparable results, we consider the performance of our system modest

• Possible causes
  – Low inter-annotator agreement suspected
  – Inconsistently assigned keyphrases

• Results show that performance can be improved, despite deficiencies in corpora

• New corpus of much higher quality obtained

Acknowledgements

• This work has been supported by the Ministry of Science, Education and Sports, Republic of Croatia, under Grant 036-1300646-1986

• The authors are grateful to the Croatian News Agency (HINA) for making available the newspaper corpora

Thank you!

Questions?
