Evaluation of NLP Systems
Martin Hassel
KTH CSC
Royal Institute of Technology
100 44 Stockholm
+46-8-790 66 34
Why Evaluate?
• Otherwise you won’t know if what you’re doing is any good!
• Human languages are very loosely defined
  • This makes it hard to prove that something works (as you do in mathematics or logic)
Aspects of Evaluation
• General aspects
  • To measure progress
• Commercial aspects
  • To ensure consumer satisfaction
  • Edge against competitors / PR
• Scientific aspects
  • Good science
Manual Evaluation
• Human judges
  + Semantically based assessment
  – Subjective
  – Time consuming
  – Expensive
Semi-Automatic Evaluation
• Task-based evaluation
  + Measures the system’s utility
  – Subjective interpretation of questions and answers
Automatic Evaluation
Example from Text Summarization
• Sentence recall
  + Cheap and repeatable
  – Does not distinguish between different summaries
Why Automatic Evaluation?
• Manual labor is expensive and takes time
• It’s practical to be able to evaluate often
  – Does this parameter lead to improvements?
• It’s tedious to evaluate manually
• Human factor
  – People tend to tire and make mistakes
Corpora
• A body of data considered to represent ”reality” in a balanced way
  • Sampling
• Raw format vs. annotated data
Corpora can be…
• a Part-of-Speech tagged data collection

  Arrangör       nn.utr.sin.ind.nom
  var            vb.prt.akt.kop
  Järfälla       pm.gen
  naturförening  nn.utr.sin.ind.nom
  där            ha
  Margareta      pm.nom
  är             vb.prs.akt.kop
  medlem         nn.utr.sin.ind.nom
  .              mad
Corpora can be…
• a parse tree data collection

  (S (NP-SBJ (NNP W.R.) (NNP Grace))
     (VP (VBZ holds)
         (NP (NP (CD three))
             (PP (IN of)
                 (NP (NP (NNP Grace) (NNP Energy) (POS 's))
                     (CD seven) (NN board) (NNS seats)))))
     (. .))
Corpora can be…
• a collection of sound samples
Widely Accepted Corpora
• Pros
  • Well-defined origin and context
  • Well-established evaluation schemes
  • Inter-system comparability
• Cons
  • Optimizing for a specific data set
  • May establish a common “truth”
Gold Standard
• Counting ”correct guesses” requires knowing what the result should be
• This ”optimal” result is often called a gold standard
• How the gold standard looks and how you count can differ a lot between tasks
• The basic idea is, however, the same
Example of a Gold Standard

Gold standard for tagging, shallow parsing and clause boundary detection

Han     pn.utr.sin.def.sub                  NPB             CLB
är      vb.prs.akt.kop                      VCB             CLI
mest    ab.suv                              ADVPB|APMINB    CLI
road    jj.pos.utr.sin.ind.nom              APMINB|APMINI   CLI
av      pp                                  PPB             CLI
äldre   jj.kom.utr/neu.sin/plu.ind/def.nom  APMINB|NPB|PPI  CLI
sorter  nn.utr.plu.ind.nom                  NPI|PPI         CLI
.       mad                                 0               CLI
Some Common Measures
• Precision = correct guesses / all guesses
• Recall = correct guesses / correct answers

• Precision and recall are often mutually dependent
  • higher recall → lower precision
  • higher precision → lower recall
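These definitions translate directly into code; a minimal sketch in Python (function names and counts are illustrative, not from the lecture):

```python
def precision(correct_guesses, all_guesses):
    """Fraction of the system's guesses that were correct."""
    return correct_guesses / all_guesses

def recall(correct_guesses, correct_answers):
    """Fraction of the correct answers the system actually found."""
    return correct_guesses / correct_answers

# Hypothetical example: a tagger makes 100 guesses, 80 of them
# correct, and the gold standard contains 90 answers.
print(precision(80, 100))  # 0.8
print(recall(80, 90))
```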
More Evaluation Terminology

• True positive
  – Alarm given at correct point
• False negative
  – No alarm when one should be given
• False positive
  – Alarm given when none should be given
• (True negative)
  – The algorithm is quiet on uninteresting data

• In e.g. spell checking the above could correspond to detected errors, missed errors, false alarms and correct words without warning.
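In alarm terminology, precision and recall fall out of these counts directly; a sketch with made-up spell-checking numbers (the function name and counts are illustrative):

```python
def alarm_metrics(tp, fp, fn):
    """Precision and recall from alarm counts:
    tp = detected errors, fp = false alarms, fn = missed errors."""
    precision = tp / (tp + fp)  # how many alarms were justified
    recall = tp / (tp + fn)     # how many real errors were caught
    return precision, recall

# A spell checker raises 50 alarms, 40 on real errors,
# and misses 20 errors entirely:
p, r = alarm_metrics(tp=40, fp=10, fn=20)
print(p, r)  # 0.8 and 2/3
```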
How Good Is 95%?
• It depends on what problem you are solving!
• Try to determine expected upper and lower bounds for performance (of a specific task)
• A baseline tells you the performance of a naïve approach (lower bound)
Lower Bound
• Baselines
  • Serve as lower limit of acceptability
  • Common to have several baselines

• Common baselines
  • Random
  • Most common choice/answer (e.g. in tagging)
  • Linear selection (e.g. in summarization)
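A most-common-choice baseline of the kind listed above takes only a few lines; a toy sketch with made-up tags (the function name is illustrative):

```python
from collections import Counter

def most_common_baseline(train_tags, eval_tags):
    """Accuracy of always guessing the single most frequent
    tag observed in the training data."""
    guess = Counter(train_tags).most_common(1)[0][0]
    return sum(t == guess for t in eval_tags) / len(eval_tags)

# Toy data: 'nn' dominates the training tags,
# so the baseline always answers 'nn'.
train = ['nn', 'vb', 'nn', 'pp', 'nn']
test = ['nn', 'nn', 'vb', 'nn']
print(most_common_baseline(train, test))  # 0.75
```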
Upper Bound
• Sometimes there is an upper bound lower than 100%
• Example: in 10% of all cases experts disagree on the correct answer
• Human ceiling (inter-assessor agreement)
  • Low inter-assessor agreement can sometimes be countered with comparison against several ”sources”
Limited Data

• Limited data is often a problem, especially in machine learning
• We want lots of data for training
  • Better results
• We want lots of data for evaluation
  • More reliable numbers
• If possible, create your own data!
  • Missplel
Limited Data
N-fold Cross Validation

• Idea:
  1. Set 5% of the data aside for evaluation and train on the remaining 95%
  2. Set another 5% aside for evaluation and repeat training on the remaining 95%
  3. … and again (repeat 20 times in total)
• Take the mean of the evaluation results to be the final result
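The procedure above can be sketched as a generic loop; `train_and_eval` is a hypothetical user-supplied callback standing in for whatever system is being evaluated:

```python
def n_fold_cross_validation(data, n, train_and_eval):
    """Hold each of n folds out once for evaluation, train on the
    rest, and return the mean score. With n = 20, each held-out
    fold is 5% of the data, as in the slide."""
    fold_size = len(data) // n
    scores = []
    for i in range(n):
        held_out = data[i * fold_size:(i + 1) * fold_size]
        training = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(train_and_eval(training, held_out))
    return sum(scores) / n
```

For simplicity this sketch drops any remainder when `len(data)` is not divisible by `n`; real splits would distribute it across folds.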
Concrete Examples

• Tagging
  • Force the tagger to assign exactly one tag to each token – precision?
• Parsing
  • What happens when the parse is almost correct?
  • Partial trees – how many sentences got full trees?
• Spell checking
  • Recall / precision for alarms
  • How far down the suggestion list is the correct suggestion?
Concrete Examples
• Grammar checking
  • How many alarms are false alarms (precision)?
  • How many errors are detected (recall)?
  • How many of these have the correct diagnosis?
• Machine translation & text summarization
  • How many n-grams overlap with the gold standard(s)?
  • BLEU scores & ROUGE scores
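The n-gram-overlap idea behind ROUGE-style scores can be illustrated with a simplified recall computation (no clipping of repeated n-grams and no brevity penalty, unlike the real metrics; function names are illustrative):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_recall(candidate, reference, n=2):
    """Fraction of the reference's n-grams that also occur in the
    candidate -- the core idea behind ROUGE-n, simplified."""
    ref = ngrams(reference, n)
    cand = set(ngrams(candidate, n))
    return sum(g in cand for g in ref) / len(ref)

summary = "the cat sat on the mat".split()
gold = "the cat sat on a mat".split()
print(ngram_recall(summary, gold, n=2))  # 3 of 5 gold bigrams match: 0.6
```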
Concrete Examples
• Information retrieval
  • What is the precision of the first X hits? At Y% recall?
  • Mean precision
• Text categorization
  • How many documents were correctly classified?
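Precision over the first X hits can be computed directly from a ranked list of relevance judgments; a toy sketch (the data and function name are made up for illustration):

```python
def precision_at_k(relevance, k):
    """Precision over the first k hits; relevance is a list of
    booleans for the ranked result list, True = relevant."""
    top = relevance[:k]
    return sum(top) / len(top)

# Judgments for the first 10 hits of a hypothetical query:
hits = [True, True, False, True, False, False, True, False, True, False]
print(precision_at_k(hits, 5))   # 3 relevant in top 5: 0.6
print(precision_at_k(hits, 10))  # 5 relevant in top 10: 0.5
```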