Developing and validating a document classifier: a real-life story - Marko Smiljanic
TRANSCRIPT
![Page 1: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/1.jpg)
Marko Smiljanić, NIRI Intelligent Computing Ltd, CEO
Developing and validating a document classifier: a real-life story
![Page 2: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/2.jpg)
Developing and validating a document classifier: a real-life story
Marko Smiljanić, CEO
www.niri-ic.com
![Page 3: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/3.jpg)
About us.
NIRI: 10 years in Intelligent Computing
Text Mining
Knowledge Discovery and Management
All about Data Science
NIŠ
![Page 4: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/4.jpg)
About me.
My role
COMPANY
![Page 5: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/5.jpg)
The flow
Business Context
The Challenge
The Solution
Effectiveness: Laboratory measurements, Impact estimation, Reality
Wrap up
![Page 6: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/6.jpg)
Business context
![Page 7: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/7.jpg)
Business context
Largest clients include:
Public Employment Services in EU, USA, and Asia
Staffing companies in EU, USA
![Page 8: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/8.jpg)
ELISE Platform
Vacancies, Job seekers
Job Taxonomy, Skill Taxonomy
![Page 9: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/9.jpg)
![Page 10: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/10.jpg)
Document Classification
Vacancies → Job Taxonomy
![Page 11: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/11.jpg)
Occupation Taxonomies
ISCO (International Standard Classification of Occupations)
ESCO
O*NET
and many more

ISCO level 1 (10)
ISCO level 2 (42)
ISCO level 3 (124)
ISCO level 4 (400)
ESCO level 5 (5000)

“Delivery service worker”

Challenges (for humans)
Knowing the taxonomy
Ambiguous taxonomy
Hybrid positions
Vague vacancy
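The level structure above is hierarchical: in ISCO, a 4-digit code nests inside its 3-, 2-, and 1-digit parents, so a prediction can also be compared at a coarser level by truncating the code. A minimal sketch of that idea follows. The function names and the example codes are illustrative placeholders, not taken from the talk, and the sketch covers the four ISCO levels only (not the ESCO level 5).

```python
def isco_ancestors(code: str) -> list[str]:
    """Return the level 1..4 codes for a 4-digit ISCO code."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected a 4-digit ISCO code")
    # Each level is a prefix of the next, more specific one.
    return [code[:level] for level in range(1, 5)]


def agree_at_level(predicted: str, expected: str, level: int) -> bool:
    """True if two 4-digit codes share the same ancestor at the given level."""
    return predicted[:level] == expected[:level]


# Placeholder codes, used only to show the prefix mechanics.
print(isco_ancestors("9621"))              # ['9', '96', '962', '9621']
print(agree_at_level("9621", "9629", 3))   # True: same level-3 group
print(agree_at_level("9621", "9629", 4))   # False: different level-4 codes
```

Scoring at a coarser level is one way to separate "completely wrong" from "close but not exact" predictions in a large taxonomy.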
![Page 12: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/12.jpg)
Client’s situation in 2014
Vacancy Aggregator and Classifier: 2000-4000 per day (into >2000 taxonomy classes), %?
Correct Code?
OK 65%: Publish
NO 23%: Repair Code! (OK 9%, no help 14%)
no code 12%
![Page 13: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/13.jpg)
![Page 14: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/14.jpg)
The Solution: NIRI will build you a better classifier
Vacancy Aggregator and Classifier → NIRI Classifier → Publish
2000-4000 per day
![Page 15: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/15.jpg)
Really? How accurate will it be? How will it fit our process?
Really. We will (try to):
Reduce manual effort
Increase volume
Improve final accuracy
![Page 16: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/16.jpg)
But you need to give us training data
> 1M vacancies
No class: 12%
Not verified: 14%
Verified: 74%
%?
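The three shares of the training data appear to follow from the 2014 workflow figures (65% OK, 23% sent to repair, of which 9% ended OK and 14% got no help, and 12% with no code). This reading is an inference from the numbers on the slides, not something the slides state. The arithmetic can be checked as follows; the variable names are illustrative.

```python
# Shares of the vacancy flow, in percent, as read from the 2014 workflow slide.
ok_first_check = 65   # code confirmed as correct
sent_to_repair = 23   # code judged incorrect
repaired_ok = 9       # repair produced an accepted code
no_help = 14          # repair did not help
no_code = 12          # no code was assigned at all

# The repair branch splits into its two outcomes, and the top level sums to 100.
assert repaired_ok + no_help == sent_to_repair
assert ok_first_check + sent_to_repair + no_code == 100

verified = ok_first_check + repaired_ok   # codes a human confirmed or fixed
not_verified = no_help
no_class = no_code
print(verified, not_verified, no_class)   # 74 14 12
```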
![Page 17: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/17.jpg)
Long tail effect
![Page 18: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/18.jpg)
Architecture of our solution
Vacancy Classifier: Vacancy → Feature Extractor → Classifier 1, Classifier 2, …, Classifier N → Negotiator → [Class, Confidence]+
External Services
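The architecture slide names the components but not how the Negotiator combines the classifiers' outputs. The sketch below shows one plausible arrangement, assuming confidence-summed voting; that rule, the toy feature extractor, the keyword classifiers, and the class codes are all assumptions made for illustration and are not NIRI's implementation.

```python
from collections import defaultdict
from typing import Callable

# A base classifier maps extracted features to (class_code, confidence).
Classifier = Callable[[set[str]], tuple[str, float]]


def extract_features(vacancy_text: str) -> set[str]:
    """Toy feature extractor: the set of lowercase words."""
    return set(vacancy_text.lower().split())


def negotiate(votes: list[tuple[str, float]], top_n: int = 5) -> list[tuple[str, float]]:
    """Assumed rule: sum confidence per class, divide by the number of voters."""
    if not votes:
        return []
    totals: dict[str, float] = defaultdict(float)
    for code, confidence in votes:
        totals[code] += confidence
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(code, score / len(votes)) for code, score in ranked[:top_n]]


def classify(vacancy_text: str, classifiers: list[Classifier]) -> list[tuple[str, float]]:
    """Run every base classifier and let the negotiator rank the classes."""
    features = extract_features(vacancy_text)
    return negotiate([clf(features) for clf in classifiers])


# Hypothetical base classifiers: keyword rules standing in for trained models.
def by_driver(features):
    return ("CODE_A", 0.9) if "driver" in features else ("CODE_X", 0.1)


def by_delivery(features):
    return ("CODE_B", 0.6) if "delivery" in features else ("CODE_X", 0.1)


def by_van(features):
    return ("CODE_A", 0.7) if "van" in features else ("CODE_X", 0.1)


result = classify("Delivery van driver wanted", [by_driver, by_delivery, by_van])
print(result)  # CODE_A ranks first, CODE_B second
```

Returning a ranked list rather than a single class matches the `[Class, Confidence]+` output on the slide and leaves room for offering several candidate codes to a human reviewer.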
![Page 19: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/19.jpg)
What to do with confidence?
Batch Processing: rows of (Vacancy, Code, Confidence), ordered by CONFIDENCE
High confidence, high accuracy: Bulk Accept
Low confidence, low accuracy: To check manually
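The batch split described on this slide can be sketched as a sort followed by a threshold cut. This is a minimal illustration; the `Prediction` type, the threshold value, and the sample rows are assumptions, not values from the talk.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    vacancy_id: str
    code: str
    confidence: float


def split_batch(batch, threshold):
    """Sort by confidence and split into (bulk_accept, check_manually)."""
    ranked = sorted(batch, key=lambda p: p.confidence, reverse=True)
    bulk_accept = [p for p in ranked if p.confidence >= threshold]
    check_manually = [p for p in ranked if p.confidence < threshold]
    return bulk_accept, check_manually


# Made-up batch for illustration.
batch = [
    Prediction("v1", "CODE_A", 0.97),
    Prediction("v2", "CODE_B", 0.41),
    Prediction("v3", "CODE_A", 0.88),
    Prediction("v4", "CODE_C", 0.63),
]
accepted, to_check = split_batch(batch, threshold=0.85)
print([p.vacancy_id for p in accepted])  # ['v1', 'v3']
print([p.vacancy_id for p in to_check])  # ['v4', 'v2']
```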
![Page 20: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/20.jpg)
Using confidence
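One way to use confidence in practice is to pick the bulk-accept threshold from labelled validation data, trading accuracy of the accepted set against the share of vacancies it covers. The sketch below assumes that approach; the validation pairs, candidate thresholds, and target accuracy are made up and do not come from the slides.

```python
def accuracy_and_coverage(scored, threshold):
    """scored: list of (confidence, is_correct). Returns (accuracy, coverage) at the threshold."""
    accepted = [ok for conf, ok in scored if conf >= threshold]
    if not accepted:
        return None, 0.0
    return sum(accepted) / len(accepted), len(accepted) / len(scored)


def lowest_threshold_for(scored, target_accuracy, candidates):
    """Lowest candidate threshold whose accepted set meets the target accuracy, or None."""
    for threshold in sorted(candidates):
        accuracy, _ = accuracy_and_coverage(scored, threshold)
        if accuracy is not None and accuracy >= target_accuracy:
            return threshold
    return None


# Made-up validation results: (confidence, prediction was correct).
validation = [
    (0.95, True), (0.92, True), (0.90, True), (0.85, True), (0.80, False),
    (0.75, True), (0.60, False), (0.55, True), (0.40, False), (0.30, False),
]
candidates = [0.3, 0.5, 0.7, 0.9]
for t in candidates:
    print(t, accuracy_and_coverage(validation, t))
print(lowest_threshold_for(validation, 0.8, candidates))  # 0.7
```

A lower threshold bulk-accepts more vacancies but lets more errors through, so the target accuracy is a business decision rather than a property of the classifier.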
![Page 21: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/21.jpg)
![Page 22: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/22.jpg)
Measuring accuracy in the laboratory
Corpus: No class 12%, Not verified 14%, Verified 74%
Split: Train 80%, Test 20% (x 5)
Vacancy Classifier: Train, Test
Outcome: Correct, Incorrect, No class
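The 80/20 split repeated five times reads like 5-fold cross-validation, with each test prediction counted as Correct, Incorrect, or No class. A minimal sketch under that assumption follows; the toy lookup "model" and the corpus are stand-ins, and all function names are hypothetical.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold holds out about 1/k of the data."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


def evaluate(corpus, train_fn, k=5):
    """corpus: list of (text, verified_code). train_fn(train) returns predict(text) -> code or None."""
    tally = Counter()
    for train_idx, test_idx in k_fold_indices(len(corpus), k):
        predict = train_fn([corpus[i] for i in train_idx])
        for i in test_idx:
            text, expected = corpus[i]
            predicted = predict(text)
            if predicted is None:
                tally["no class"] += 1
            elif predicted == expected:
                tally["correct"] += 1
            else:
                tally["incorrect"] += 1
    total = sum(tally.values())
    return {outcome: count / total for outcome, count in tally.items()}


def train_first_word_lookup(train):
    """Toy stand-in for a real model: remember the code seen for each first word."""
    table = {text.split()[0]: code for text, code in train}
    return lambda text: table.get(text.split()[0])


# Made-up corpus of 20 vacancies with two classes.
corpus = [(f"{word} vacancy {n}", code)
          for n in range(10)
          for word, code in [("driver", "CODE_A"), ("nurse", "CODE_B")]]
print(evaluate(corpus, train_first_word_lookup))
```

Note that this measures agreement with the labels in the corpus, so any errors or bias in those labels carry over into the reported accuracy.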
![Page 23: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/23.jpg)
Measuring accuracy in the laboratory
One of many Laboratory Measurements

| | Corpus | Classifier | Classifier 100 | Classifier 1000 |
|---|---|---|---|---|
| Correct | 74% | 78% | 80% | 85% |
| Incorrect | 14% | 13% | 12% | 10% |
| No class | 12% | 9% | 8% | 5% |

Does this make any sense?
Yes, but…
![Page 24: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/24.jpg)
Measuring accuracy in the laboratory
Original Classifier: No class 12%, Not verified 14%, Verified 74%
Vacancy Classifier: No class 9%, Incorrect 13%, Correct 78%
This is not reality:
Biased train/test set
Accuracy of test set unknown
Inability to test against 26%
![Page 25: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/25.jpg)
![Page 26: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/26.jpg)
Remember the process?
Vacancy Aggregator and Classifier
Correct Code?
OK 65%: Publish
NO 23%: Repair Code! (OK 9%, no help 14%)
no code 12%
![Page 27: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/27.jpg)
This is what it actually looks like.
Check, Repair
We will:
Reduce manual effort
Increase volume
Improve final accuracy
![Page 28: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/28.jpg)
And we proposed this one.
Bulk Accept, Check, Repair
![Page 29: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/29.jpg)
Best/worst case analysis, some manual validation, careful assumptions:
Bulk Accept
Check Repair
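A best/worst case analysis bounds a result by pushing every unknown quantity to its extremes. As an illustration of the technique only (not the calculation behind this slide), the sketch below bounds the accuracy of the coded part of the corpus, assuming the 74% verified codes are all correct and the 14% not-verified codes are of unknown quality. Both the assumption and the function name are mine.

```python
def accuracy_bounds(verified_share, unverified_share):
    """Worst/best-case accuracy of coded items when unverified codes are of unknown quality."""
    coded = verified_share + unverified_share
    worst = verified_share / coded                        # every unverified code is wrong
    best = (verified_share + unverified_share) / coded    # every unverified code is right
    return worst, best


# Shares taken from the corpus breakdown: 74% verified, 14% not verified.
worst, best = accuracy_bounds(verified_share=0.74, unverified_share=0.14)
print(f"{worst:.1%} .. {best:.1%}")  # 84.1% .. 100.0%
```

Some manual validation of a sample from the unknown share is what narrows such an interval to a usable estimate.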
![Page 30: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/30.jpg)
Impact estimation showed that:
Step 1 effort reduction: 60% (due to bulk acceptance)
Step 2 effort reduction: 11% (due to bulk acceptance and top 5 offers)
Significant published volume increase (almost to 100%)
Accuracy slightly higher (+1%, to around 92%)
Does this make any sense?
Yes, but…
![Page 31: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/31.jpg)
![Page 32: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/32.jpg)
How can we measure production accuracy?
No class: 12%
Not verified: 14%
Verified: 74%
%?
We cannot, unless…
![Page 33: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/33.jpg)
Golden Test Set
![Page 34: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/34.jpg)
How was it built?
Check & Repair, 4 eye principle
Published → Vacancy Classifier → Original Code & Top 5 VC codes
Every single classification was marked as either Correct, Acceptable, or Wrong
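The marking scheme can be sketched as two independent reviews per classification, followed by a tally. The slides do not say how disagreements were handled or how the rates were computed, so the agreement rule and the cumulative "correct or acceptable" rate below are assumptions, and the review data is made up.

```python
from collections import Counter

LABELS = ("correct", "acceptable", "wrong")


def four_eye_label(first_review: str, second_review: str):
    """Assumed rule: keep a label only when both reviewers agree, else flag it (None)."""
    if first_review not in LABELS or second_review not in LABELS:
        raise ValueError("unknown label")
    return first_review if first_review == second_review else None


def summarize(final_labels):
    """Share of 'correct' and of 'correct or acceptable' among resolved labels."""
    counts = Counter(final_labels)
    total = sum(counts[label] for label in LABELS)
    correct = counts["correct"] / total
    acceptable = (counts["correct"] + counts["acceptable"]) / total
    return correct, acceptable


# Made-up reviews for five classifications: (reviewer 1, reviewer 2).
reviews = [("correct", "correct"), ("acceptable", "acceptable"),
           ("wrong", "wrong"), ("correct", "acceptable"), ("correct", "correct")]
resolved = [four_eye_label(a, b) for a, b in reviews]
print(resolved)  # ['correct', 'acceptable', 'wrong', None, 'correct']
print(summarize([label for label in resolved if label is not None]))  # (0.5, 0.75)
```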
![Page 35: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/35.jpg)
Results
Golden Test Set Results

| | Current | NIRI VC | Current (HQ source) | NIRI VC (HQ source) |
|---|---|---|---|---|
| Correct | 63.05% | 73.91% | 72.06% | 74.38% |
| Acceptable | 65.98% | 77.56% | 76.25% | 78.69% |

HQ source: Highest Quality Source (Training)
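The gain of the NIRI VC over the current classifier can be read off as percentage-point differences. The sketch below assumes the first row of numbers is "Correct" and the second is "Acceptable", and that the four columns are Current, NIRI VC, Current (HQ source), and NIRI VC (HQ source); the "all sources" label for the first column pair is my own naming.

```python
# Golden test set results from the slide, in percent: (correct, acceptable).
results = {
    "all sources": {"Current": (63.05, 65.98), "NIRI VC": (73.91, 77.56)},
    "HQ source": {"Current": (72.06, 76.25), "NIRI VC": (74.38, 78.69)},
}

gains = {}
for source, systems in results.items():
    cur_correct, cur_acceptable = systems["Current"]
    new_correct, new_acceptable = systems["NIRI VC"]
    # Differences in percentage points, rounded to the precision of the inputs.
    gains[source] = (round(new_correct - cur_correct, 2),
                     round(new_acceptable - cur_acceptable, 2))
    print(f"{source}: correct +{gains[source][0]} pp, acceptable +{gains[source][1]} pp")
```

Under this reading the gain is roughly 11 percentage points overall and a little over 2 points on the highest quality source, where the current classifier already performs best.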
![Page 36: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/36.jpg)
![Page 37: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/37.jpg)
Wrap up
Clean semantic data, in real life, can only be a myth. We are looking into data cleansing approaches.
Measuring usefulness can be hard and expensive, but it can/must be monitored after the system is deployed. It changes over time.
Continuous learning, where possible, is a great thing.
1) Implementing a state-of-the-art machine learning algorithm is one thing.
2) Making it useful is another.
3) Explaining that to the end-user is the third.
NIRI is a very cool company to work with!
I hope you liked the story, and I thank you for your attention.
![Page 38: Developing and validating a document classifier: a real-life story - Marko Smiljanic](https://reader036.vdocuments.us/reader036/viewer/2022062823/586f79981a28ab10258b7007/html5/thumbnails/38.jpg)