INFuture2009: Digital Resources and Knowledge Sharing, 4-7 November 2009
FF & FER
Comparative Analysis of Automatic Term and Collocation Extraction
Sanja Seljan, Bojana Dalbelo Bašić, Jan Šnajder, Davor Delač, Matija Šamec-Gjurin, Dina Crnec
Faculty of Humanities and Social Sciences, Department of Information Sciences
Faculty of Electrical Engineering and Computing

Overview
I. Introduction
– Reasons for extraction
II. Research
– Resources & tools
– Extracted lists
III. Evaluation
– Precision, recall, F-measure
IV. Conclusion

I. Introduction
• Monolingual and multilingual resources
– Helpful
– Integrated
– Require human intervention
• EU pre-accession activities
– Speed-up and consistency
• Used in further research and practice
• List:
– Terms (Member State, European Union)
– Collocations (adopt a/the resolution, decided as follows)
– Multi-word units (depend on, well-being)
• Term extraction process:
– Term extraction (term acquisition): identification
– Term recognition: verification

II. Research
• Resources
– 10 documents: legislation, Croatian-English
• Tools
– TermeX tool (FER): list A
– SDL Multi Term Extract + NooJ (FF): list B
• Reference list
– Used for evaluation

Reference list
• 470 terms and collocations
• Unigrams excluded
• Balance between lexical coverage, adequacy, practicality
– terms (NPs: 346/470)
– collocations (VPs)

Reference list
• Contains:
– Terms (acquiring company, applicant country)
– Collocations (adopt a/the resolution, decided as follows, entry into force, having regard to)
– Names and abbreviations (Economic and Monetary Union EMU, European Union EU)
– Relevant embedded terms (crime prevention, crime prevention bodies, national crime prevention measures)

List B
• Language-independent, statistically based SDL Multi Term Extract tool
– Frequency threshold set to 4
– Filtered by the list of stop words -> 369 candidates
• Language-dependent NooJ tool
– 36 local grammars -> 512 candidates
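
The frequency threshold and stop-word filter on this slide can be illustrated with a minimal sketch. It is not the SDL tool's actual algorithm, which the slides do not describe. The function name `candidate_ngrams`, the stop-word list, and the rule "drop n-grams that start or end with a stop word" are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical stop-word list; the list used in the study is not given.
STOP_WORDS = {"the", "of", "a", "an", "and", "to", "in"}


def candidate_ngrams(tokens, n_min=2, n_max=4, min_freq=4, stop_words=STOP_WORDS):
    """Count n-grams (n_min..n_max words) and keep those that occur at
    least min_freq times and neither start nor end with a stop word."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            # Assumed filter: reject candidates with a stop word at either edge.
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            counts[gram] += 1
    # Apply the frequency threshold (4 on the slide).
    return {" ".join(g): c for g, c in counts.items() if c >= min_freq}


if __name__ == "__main__":
    # Toy input, lower-cased and whitespace-tokenized for simplicity.
    sample = ("the member state of the european union " * 4).split()
    for term, freq in sorted(candidate_ngrams(sample).items()):
        print(freq, term)
```

A purely statistical filter like this is language-independent, which is why it needs the stop-word list to suppress frequent but uninformative word sequences.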

List A
• TermeX
– Lexical association measures (AMs)
– 14 AMs (PMI, Dice, Chi-square, …)
– Lemmatization
– POS filtering
– Frequency threshold set to ?

List A
• Extracted terms ranked by AM value
– 1816 candidates
• AMs used:
– 2-grams: PMI
– 3-grams, 4-grams: heuristic extensions
• Noun phrases only
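
As an illustration of ranking 2-grams by PMI, here is a minimal sketch using the standard definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))) with probabilities estimated from raw counts. It is not TermeX code: the function name `pmi_ranking` is hypothetical, and the lemmatization, POS filtering and 3-/4-gram extensions mentioned on the slides are left out.

```python
import math
from collections import Counter


def pmi_ranking(tokens, min_freq=1):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Probabilities are simple relative frequencies; min_freq is an assumed
    parameter standing in for a frequency threshold.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_freq:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))

    # Highest association first, as in a ranked candidate list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    text = "european union law and european union policy and the law"
    for pair, score in pmi_ranking(text.split()):
        print(f"{score:6.2f}  {' '.join(pair)}")
```

PMI tends to score rare pairs highly, which is one reason a frequency threshold is usually applied before ranking.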

Results
• Evaluation
– F1-measure (precision, recall)
– True positives calculated by taking into account inflection (suffix stripping)

                 List A   List B
No. of terms       1816      508
Valid terms         202      234
Precision (%)     11.56    47.37
Recall (%)        42.98    49.79
F1 (%)            18.22    48.55
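
The recall and F1 rows can be checked with a short sketch. It assumes recall is the number of valid terms divided by the 470 entries of the reference list, and F1 is the harmonic mean of precision and recall. Precision is taken as reported rather than recomputed, since the slides do not state which candidate count it was computed against. The function names are illustrative.

```python
REFERENCE_SIZE = 470  # terms and collocations in the reference list


def recall_pct(valid_terms, reference_size=REFERENCE_SIZE):
    """Recall in percent: share of reference entries found by the tool."""
    return 100.0 * valid_terms / reference_size


def f1_pct(precision, recall):
    """F1 in percent: harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Values copied from the results table.
reported = {
    "List A": {"valid": 202, "precision": 11.56, "recall": 42.98},
    "List B": {"valid": 234, "precision": 47.37, "recall": 49.79},
}

if __name__ == "__main__":
    for name, row in reported.items():
        r = recall_pct(row["valid"])
        f = f1_pct(row["precision"], row["recall"])
        print(f"{name}: recall = {r:.2f} %, F1 = {f:.2f} %")
```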

Results
• List A unsatisfactory
– Low recall: verb phrases and terms consisting of more than 4 words are not covered
– Low precision: the list is ranked, so precision can be improved with a cut-off (true positives are better ranked)
• List B modest
– Can be improved with lemmatization, definition of upper/lower cases, more detailed local grammars
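
The effect of a cut-off on a ranked list can be sketched as precision over the top k candidates. The data below is a toy example, not the study's lists, and the function name `precision_at_k` is hypothetical. Matching is by exact string, without the suffix stripping used in the evaluation.

```python
def precision_at_k(ranked_candidates, reference, k):
    """Precision over the k best-ranked candidates (exact string match)."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    hits = sum(1 for term in top if term in reference)
    return hits / len(top)


if __name__ == "__main__":
    # Toy ranked list (best first) and toy reference set, for illustration only.
    ranked = ["member state", "european union", "of the", "crime prevention", "in a"]
    reference = {"member state", "european union", "crime prevention"}
    for k in (2, 3, 5):
        print(k, precision_at_k(ranked, reference, k))
```

When true positives cluster near the top of the ranking, a smaller k raises precision at the cost of recall, so the cut-off has to be chosen with both in mind.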

Conclusion
• Comparison of two hybrid approaches to term extraction
• Human-created lists differ from extracted lists
– human knowledge, experience and intuition
• Space for improvement
– automatic extraction combined with human intervention