Automatic Keyphrase Extraction from Croatian Newspaper Articles
Renee Ahel, Bojana Dalbelo Bašić, Jan Šnajder
Knowledge Technologies Lab
Department of Electronics, Microelectronics, Computer and Intelligent Systems
University of Zagreb, Faculty of Electrical Engineering and Computing
INFuture 2009
INFuture 2009, 05.11.2009. 2/19
Agenda
• Assigning keyphrases
• Related work
• Extraction system
• Corpora
• Results
• Conclusion
Assigning keyphrases
• Keyphrases
  – Summarize documents
  – Are often not assigned to documents
  – Manual assignment is a tedious task
• Automatic assignment methods
  – Keyphrase assignment (in the narrow sense)
  – Keyphrase extraction
Related work
• Our work is inspired by KEA (Witten et al. 1999)
  – Good performance despite using a simple set of features
  – Our approach: more features, improvements on candidate generation
• POS tag filtering similar to Hulth (2003)
  – Larger set of POS tags for filtering (Petrović et al. 2009)
Extraction system
[Diagram: system architecture]
• Processing stages: pre-processing, candidate generation, feature calculation, learning, ranking
• Data stores: staging area, candidate warehouse, learning candidate feature matrix, classification candidate feature matrix, knowledge base
• Legend: program, relational database
Extraction – pre-processing
Input: “Vrhovni sud potvrdio je presudu Županijskog suda u Splitu.” (“The Supreme Court upheld the verdict of the County Court in Split.”), together with the HINA categories
Steps: phrase boundaries, tokenization, lemmatization
Output (token table):

Token_ID  Document_token_index  Is_in_categories  Is_in_title  Token_text   POS_tag  IsStopword  Lemma
1591219   3                     1                 1            vrhovni      A        0           vrhovan
1591220   4                     1                 1            sud          N        0           sud
1591221   5                     0                 0            potvrdio     V        0           potvrditi
1591222   6                     0                 0            je           X        1           je
1591223   7                     1                 0            presudu      N        0           presuda
1591224   8                     1                 0            županijskog  A        0           županijski
1591225   9                     1                 1            suda         N        0           sud
1591226   10                    0                 0            u            X        1           u
1591227   11                    0                 0            splitu       N        0           split
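The pre-processing step can be pictured with a minimal Python sketch. It is not the authors' implementation: the toy `LEXICON`, the `STOPWORDS` set, the `Token` record and the `start_index` parameter are assumptions made for illustration, and the Is_in_categories and Is_in_title flags are left out. Tagging stopwords as X and unknown words as F follows the legend on the POS filter slide.

```python
from dataclasses import dataclass

# Toy lexicon (assumption for illustration): surface form -> (lemma, POS tag).
LEXICON = {
    "vrhovni": ("vrhovan", "A"),
    "sud": ("sud", "N"),
    "potvrdio": ("potvrditi", "V"),
    "presudu": ("presuda", "N"),
    "županijskog": ("županijski", "A"),
    "suda": ("sud", "N"),
    "splitu": ("split", "N"),
}
STOPWORDS = {"je", "u"}  # stopwords carry the tag X on the slides


@dataclass
class Token:
    index: int
    text: str
    pos_tag: str
    is_stopword: bool
    lemma: str


def preprocess(sentence, start_index=0):
    """Lowercase, strip punctuation, then look up lemma and POS tag per token."""
    words = [w.strip(".,;:!?\"“”").lower() for w in sentence.split()]
    tokens = []
    for offset, word in enumerate(w for w in words if w):
        if word in STOPWORDS:
            lemma, pos = word, "X"
        else:
            # F marks a failed lemmatization (word missing from the lexicon).
            lemma, pos = LEXICON.get(word, (word, "F"))
        tokens.append(Token(start_index + offset, word, pos, word in STOPWORDS, lemma))
    return tokens


tokens = preprocess("Vrhovni sud potvrdio je presudu Županijskog suda u Splitu.", start_index=3)
```

With `start_index=3` the token indices run from 3 to 11, as in the token table.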
Extraction – candidate generation
Input (token table, excerpt):

Token_ID  Document_token_index  Is_in_categories  Is_in_title  Token_text   POS_tag  IsStopword  Lemma
1591219   3                     1                 1            vrhovni      A        0           vrhovan
1591220   4                     1                 1            sud          N        0           sud
1591221   5                     0                 0            potvrdio     V        0           potvrditi
1591222   6                     0                 0            je           X        1           je
1591223   7                     1                 0            presudu      N        0           presuda
...       ...                   ...               ...          ...          ...      ...         ...

Candidate generation, n-grams starting at the token "vrhovni":

vrhovni               A
vrhovni sud           AN
vrhovni sud potvrdio  ANV

Output (candidates):

Candidate_ID  Appear_idx  Is_in_categories  Is_in_title  IsKeyword  Ngram_level  Original_text  Lemmatized_text  POS_tag_pattern
978155        3           1                 1            0          2            vrhovni sud    vrhovan sud      AN
978137        4           1                 1            0          1            sud            sud              N
978138        7           1                 0            1          1            presudu        presuda          N
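The n-gram expansion shown on this slide can be sketched as follows. This is an illustrative sketch only: the limit of three tokens is inferred from the Ngram_level values on the slides, and skipping n-grams that begin or end with a stopword is a KEA-style assumption that the slides do not state.

```python
def generate_candidates(tokens, max_n=3):
    """tokens: list of (text, pos_tag, lemma) tuples in document order."""
    candidates = []
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            window = tokens[start:start + n]
            if len(window) < n:
                break  # ran past the end of the token list
            # Assumption (KEA-style): no candidate begins or ends with a stopword (tag X).
            if window[0][1] == "X" or window[-1][1] == "X":
                continue
            candidates.append({
                "appear_idx": start,
                "ngram_level": n,
                "original_text": " ".join(t[0] for t in window),
                "lemmatized_text": " ".join(t[2] for t in window),
                "pos_tag_pattern": "".join(t[1] for t in window),
            })
    return candidates


tokens = [
    ("vrhovni", "A", "vrhovan"),
    ("sud", "N", "sud"),
    ("potvrdio", "V", "potvrditi"),
    ("je", "X", "je"),
    ("presudu", "N", "presuda"),
]
candidates = generate_candidates(tokens)
```

The first three candidates are "vrhovni" (A), "vrhovni sud" (AN) and "vrhovni sud potvrdio" (ANV), the same expansion as on the slide.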
Extraction – feature calculation
Input (candidate warehouse):

Candidate_ID  Document_length  Appearance_index  Is_in_categories  Is_in_title  IsKeyword  Ngram_level  First_lemma_match  Lemma_matches  Original_text             Lemmatized_text
978155        33               3                 1                 1            0          2            0                  1              vrhovni sud               vrhovan sud
978137        33               4                 1                 1            0          1            0                  2              sud                       sud
978138        33               7                 1                 0            1          1            1                  0              presudu                   presuda
978164        33               7                 1                 1            0          3            1                  0              presudu županijskog suda  presuda županijski sud

Feature calculation, example values for the candidate "presuda" (TF_IDF is computed using the corpus):

TF_IDF                     0.967
Is_in_title                0
First_appearance_relative  0.212
Is_in_categories           1

Output (candidate feature matrix):

Candidate_ID  Original_text  First_appearance_relative  Is_in_categories  Is_in_title  TF_IDF  IsKeyword
978138        presudu        0.212                      1                 0            0.097   1
978140        splitu         0.333                      0                 0            0.124   0
...           ...            ...                        ...               ...          ...     ...
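The relative first appearance values on this slide are consistent with dividing the appearance index by the document length (7 / 33 ≈ 0.212 for "presudu"). The sketch below shows that calculation next to a KEA-style TF×IDF. The slides do not give the TF×IDF formula, so the one used here is an assumption, and the counts passed to `tf_idf` are made up for illustration.

```python
import math


def first_appearance_relative(appearance_index, document_length):
    """Position of the first occurrence, normalized by document length."""
    return appearance_index / document_length


def tf_idf(count_in_doc, document_length, docs_with_phrase, corpus_size):
    """KEA-style TFxIDF (assumed): (freq / doc size) * -log2(doc freq / corpus size)."""
    tf = count_in_doc / document_length
    idf = -math.log2(docs_with_phrase / corpus_size)
    return tf * idf


presudu_position = round(first_appearance_relative(7, 33), 3)   # slide: 0.212
splitu_position = round(first_appearance_relative(11, 33), 3)   # slide: 0.333
example_score = tf_idf(count_in_doc=2, document_length=100, docs_with_phrase=1, corpus_size=8)
```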
Extraction – learning and ranking

Input (candidate feature matrix):

Candidate_ID  Original_text             First_appearance_relative  Is_in_categories  Is_in_title  TF_IDF  IsKeyword
978138        presudu                   0.212                      1                 0            0.096   1
978140        splitu                    0.333                      0                 0            0.124   0
978156        županijskog suda          0.242                      1                 1            0.089   0
978164        presudu županijskog suda  0.212                      1                 1            0.188   0
978165        suda u splitu             0.272                      1                 1            0.183   0

[Diagram labels: Discretization for naive Bayes (shown twice), Naive Bayes learning, Knowledge base, Ranking]

Output (ranked candidates):

Candidate_ID  Original_text             TF_IDF  Output_probability  Higher_scored_pattern_matches
978164        presudu županijskog suda  0.188   0.1777              0
978165        suda u splitu             0.183   0.0876              0
978140        splitu                    0.124   0.0876              0
978138        presudu                   0.096   0.1777              1
978156        županijskog suda          0.089   0.1777              1
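The ranking criterion is not spelled out on the slide. The sketch below uses one sort key that reproduces the displayed order: candidates without a higher-scored pattern match first, then higher output probability, then higher TF_IDF. The key is an assumption, and the Higher_scored_pattern_matches flag is taken from the slide as given rather than computed.

```python
# Rows copied from the slide:
# (candidate_id, original_text, tf_idf, output_probability, higher_scored_pattern_matches)
CANDIDATES = [
    (978138, "presudu", 0.096, 0.1777, 1),
    (978140, "splitu", 0.124, 0.0876, 0),
    (978156, "županijskog suda", 0.089, 0.1777, 1),
    (978164, "presudu županijskog suda", 0.188, 0.1777, 0),
    (978165, "suda u splitu", 0.183, 0.0876, 0),
]


def rank(candidates):
    # Assumed key: flag ascending, probability descending, TF_IDF descending.
    return sorted(candidates, key=lambda c: (c[4], -c[3], -c[2]))


ranked_ids = [c[0] for c in rank(CANDIDATES)]
```

Sorting by the flag and then by TF_IDF alone would give the same order on these five rows, so the slide data cannot distinguish between the two keys.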
Corpora
• Deficiencies
  – Assigned keyphrases not appearing in text removed (57% of original keyphrases)
  – Unknown inter-annotator agreement
  – Inconsistent keyphrases (63% of keyphrases assigned to only one document)
• Experimental set
  – 200 documents
  – On average, 6.5 keyphrases and 370 candidates per document

                                           October 4457  January 4532
Number of documents (articles)             3905          4521
Average document length (in words)         525           335
Average number of candidates per document  235           153
Average number of keyphrases per document  2             2
Results – basic configurations
[Plot: F1 (%) against the number of extracted keyphrases (0 to 16), for MDL discretization and percentile discretization]
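The slides do not define percentile discretization. One plausible reading is equal-frequency binning, where cut points are placed at percentiles of the training values so that each bin holds about the same number of examples. The sketch below follows that reading; the function names and the choice of four bins are assumptions.

```python
import bisect
import statistics
from collections import Counter


def percentile_cut_points(values, n_bins=4):
    """Cut points at the 1/n_bins, 2/n_bins, ... quantiles of the training values."""
    return statistics.quantiles(values, n=n_bins)


def discretize(value, cut_points):
    """Index of the bin (0 .. n_bins - 1) that the value falls into."""
    return bisect.bisect_right(cut_points, value)


training_values = list(range(1, 101))      # made-up feature values
cuts = percentile_cut_points(training_values)
bin_sizes = Counter(discretize(v, cuts) for v in training_values)
```

On this toy data each of the four bins receives 25 values. A naive Bayes classifier would then treat the bin index as a categorical feature.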
Results – additional POS filter
• Filtering out the candidates that do not match the POS patterns N, AN, NN, NXN
  – discards 30% of negative candidates
  – discards only 7.5% of positive candidates

POS pattern  Positive candidates (%)  Negative candidates (%)  Total (%)
N            49.77                    38.75                    38.87
AN           28.17                    13.26                    13.42
NN           11.58                    10.41                    10.42
F            1.91                     9.72                     9.64
NXN          2.98                     7.57                     7.52
NAN          1.56                     3.22                     3.20
Other        4.02                     17.06                    16.92

F = lemmatization failed, X = stopword
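The two percentages in the bullet can be recomputed from the table: the four allowed patterns cover 92.50% of positive and 69.99% of negative candidates, so the filter discards about 7.5% and 30% respectively. A short sketch of the filter and of that arithmetic follows. The candidate dictionaries and function names are assumptions for illustration; the percentages are copied from the table.

```python
ALLOWED_PATTERNS = {"N", "AN", "NN", "NXN"}

# Share of candidates per POS pattern in %, from the table: (positive, negative).
PATTERN_SHARES = {
    "N": (49.77, 38.75),
    "AN": (28.17, 13.26),
    "NN": (11.58, 10.41),
    "F": (1.91, 9.72),
    "NXN": (2.98, 7.57),
    "NAN": (1.56, 3.22),
    "Other": (4.02, 17.06),
}


def pos_filter(candidates):
    """Keep only candidates whose POS tag pattern is one of the allowed patterns."""
    return [c for c in candidates if c["pos_tag_pattern"] in ALLOWED_PATTERNS]


def discarded_share(column):
    """Percentage of candidates removed by the filter (column 0 = positive, 1 = negative)."""
    kept = sum(shares[column] for pattern, shares in PATTERN_SHARES.items()
               if pattern in ALLOWED_PATTERNS)
    return 100.0 - kept


positive_discarded = discarded_share(0)
negative_discarded = discarded_share(1)
kept = pos_filter([{"pos_tag_pattern": "AN"}, {"pos_tag_pattern": "ANV"}])
```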
Results – additional POS filter
[Plot: % change in F1 against the number of extracted keyphrases (0 to 16), for MDL + POS filtering]
[Plot: F1 (%) against the number of extracted keyphrases (0 to 16), for MDL + POS filtering and MDL discretization]
Results – ablation study
• Influence of each feature on performance
• Holding out one feature and doing keyphrase extraction using the remaining features
[Bar chart: Precision (%), Recall (%), F1 (%) and % change in F1 for each held-out feature (Is in category, Is in title, First appearance relative, TFxIDF); vertical axis from -30 to 30]
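The hold-one-out procedure can be sketched as a loop over feature subsets. In the sketch below, `evaluate` is a hypothetical callable that trains and tests the extractor on a given feature list and returns F1. The slides do not state how extracted and assigned keyphrases are matched, so the exact set match in `f1_score` is an assumption, and the feature names are taken from the feature matrix columns.

```python
FEATURES = ["is_in_categories", "is_in_title", "first_appearance_relative", "tf_idf"]


def f1_score(extracted, assigned):
    """F1 of extracted keyphrases against assigned ones (exact set match assumed)."""
    extracted, assigned = set(extracted), set(assigned)
    correct = len(extracted & assigned)
    if correct == 0:
        return 0.0
    precision = correct / len(extracted)
    recall = correct / len(assigned)
    return 2 * precision * recall / (precision + recall)


def ablation(evaluate, features=FEATURES):
    """Percentage change in F1 when each feature is held out in turn.

    Assumes the baseline F1 with all features is greater than zero.
    """
    baseline = evaluate(features)
    report = {}
    for held_out in features:
        remaining = [f for f in features if f != held_out]
        report[held_out] = 100.0 * (evaluate(remaining) - baseline) / baseline
    return report


# Stub evaluator for demonstration only: F1 grows with the number of features.
report = ablation(lambda feats: 0.1 * len(feats))
```

A negative value means that removing the feature hurts F1, and a positive value means the extractor does better without it.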
Examples
Document title: Lijevo-desna nerazumijevanja (“Left-right misunderstandings”)
Pre-assigned keyphrases: ljevica, politički život, vrijednosti, socijalna država, socijaldemokrati, liberali, socijalna politika, crkveni socijalni nauk, ekonomska politika, suverenitet, zakonitost
Extracted keyphrases*: poimanju, suvereniteta, stranke, politika, različitim poimanjem, predizbornu kampanju, zakonitosti, liberale, socijaldemokrate, konzervativce
* Keyphrase normalization is a work in progress
Examples
Document title: EU lansirala petogodišnji plan sigurnosti (“EU launched a five-year security plan”)
Pre-assigned keyphrases: akcijski plan, sigurnost, imigracijska politika, pravosuđe, granična kontrola, međunarodna suradnja
Extracted keyphrases: petogodišnji plan, sigurnosti, akcijski plan, imigracijske politike, pravosuđa i sigurnosti, europsku sigurnost, suradnje između zemalja, terorizma, sigurnost granica, slobode i sigurnosti
Conclusion
• Overall best result achieved with MDL + additional POS filtering, 10 extracted keyphrases
• In the absence of comparable results, we consider our performance to be modest
• Possible causes
  – Low inter-annotator agreement suspected
  – Inconsistently assigned keyphrases
• Results show that performance can be improved, despite deficiencies in corpora
• New corpus of much higher quality obtained
Acknowledgements
• This work has been supported by the Ministry of Science, Education and Sports, Republic of Croatia, under Grant 036-1300646-1986
• The authors are grateful to the Croatian News Agency (HINA) for making available the newspaper corpora
Thank you!
Questions?