1
sCooL: A System for Academic Institution Name Normalization
Ferosh Jacob, Faizan Javed, Meng Zhao, and Matt McNair
Classification R&D, CareerBuilder
2
Presentation overview
About sCooL
◦ What is entity normalization?
◦ Why is academic entity normalization important?
◦ What are the academic entity normalization challenges?
Inside sCooL
◦ A high-level overview of the core components
◦ Atlas, the mapping manager
Evaluating sCooL
◦ Comparing sCooL with the existing implementation
◦ Independent evaluation of sCooL
Concluding remarks
◦ Demo
◦ Questions?
3
About sCooL: Academic entity normalization facts
Facts
7,021 post-secondary Title IV institutions in 2010-11*
200 million unique visitors at CB U.S.
12 million unique academic institution entries in the CB resume database
* http://nces.ed.gov/fastfacts/display.asp?id=84
4
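About sCooL: Academic entity normalization definition
[Figure: table with columns No., Name (surface forms), and Frequency, listing ten surface forms with frequencies 410, 139, 131, 6, 1, 1, 1, 1, 1, 1; a brace labeled "Surface forms" maps them to an "Entity:" box]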
6
About sCooL: Academic entity normalization challenges
No.  Name (surface forms)            Frequency
1    Salford College                 410
2    Salford College of Technology   139
3    Salford City College            131
4    Salford Uni                     6
5    Salford University -            1
6    The University of Salford.      1
7    Salford University **+          1
8    University of Salford 1982      1
9    =- University OF SALFORD        1
10   University of Salford-          1
Entity: Salford City College, Merchants Quay, Salford Quays, United Kingdom
Entity: University of Salford, Salford, Lancashire, United Kingdom
Entity: Salford College, 68 Grenfell Street, Adelaide, Australia
How will you identify the most accurate normalization from a given surface form?
7
About sCooL: Academic entity normalization challenges
String similarity algorithms
◦ Edit distance
Salford university -> Salford Unevarsity (edit distance 2) (spelling error)
St. Loye’s College -> St. Luke’s College (edit distance 2) (two different academic institutions)
How will you distinguish spelling or typing errors from the names of two different institutions?
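The two examples on this slide can be checked with a small edit-distance routine. The following is a minimal sketch of the standard Levenshtein dynamic program, not the implementation used inside sCooL; the helper name `distance_ignoring_case` is ours, and lowercasing is an assumption made so that the first pair comes out at distance 2 as stated on the slide.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def distance_ignoring_case(a: str, b: str) -> int:
    """Hypothetical helper: compare after lowercasing both strings."""
    return levenshtein(a.lower(), b.lower())


pairs = [
    ("Salford university", "Salford Unevarsity"),  # spelling error
    ("St. Loye's College", "St. Luke's College"),  # two different institutions
]
for left, right in pairs:
    print(left, "->", right, distance_ignoring_case(left, right))
```

Both pairs come out at distance 2, which is the point of the slide: edit distance alone cannot tell a typo from a different institution.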
8
About sCooL: Academic entity normalization challenges
How will you create and maintain the surface form-entity mappings?
Legacy names (mergers)
◦ University of Central England in Birmingham is an old name of Birmingham City University
◦ In January 2009, Salford College merged with Eccles College and Pendleton College to form Salford City College
◦ In October 2004, Victoria University of Manchester merged with the University of Manchester Institute of Science and Technology to form The University of Manchester
Popular names and acronyms
◦ Ole Miss is a popular name for The University of Mississippi
◦ MIT is an acronym for Massachusetts Institute of Technology. However, GIT is not an acronym for Georgia Institute of Technology; Georgia Tech and Ga Tech are the popular names for that institution.
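One simple way to picture such mappings is a lookup table from known surface forms to entities. The sketch below is only an illustration built from the examples on this slide; the table layout and the `resolve` helper are assumptions, not the Atlas data model.

```python
from typing import Dict, Optional

# Hypothetical alias table; keys are lowercased surface forms.
ALIASES: Dict[str, str] = {
    "university of central england in birmingham": "Birmingham City University",
    "eccles college": "Salford City College",
    "pendleton college": "Salford City College",
    "victoria university of manchester": "The University of Manchester",
    "university of manchester institute of science and technology":
        "The University of Manchester",
    "ole miss": "The University of Mississippi",
    "mit": "Massachusetts Institute of Technology",
    "georgia tech": "Georgia Institute of Technology",
    "ga tech": "Georgia Institute of Technology",
}


def resolve(surface_form: str) -> Optional[str]:
    """Return the mapped entity, or None when the surface form is unknown."""
    key = " ".join(surface_form.lower().split())  # lowercase, collapse spaces
    return ALIASES.get(key)


print(resolve("Ole Miss"))
print(resolve("  Ga   Tech "))
print(resolve("GIT"))  # not a known alias, so no entity is returned
```

A hand-written table like this only covers names someone has already seen, which is why creating and maintaining the mappings is listed as a challenge.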
9
About sCooL: Academic entity normalization challenges
How can we remove K-12 schools and noise?
Top 10 frequent universities in UK dataset
No.  Name                         Frequency
1    N/A                          128976
2    City & Guilds                23992
3    Not Specified                18598
4    City and Guilds              17441
5    Open University              6886
6    MIDDLESEX UNIVERSITY         5490
7    University of East London    5266
8    University of Greenwich      5108
9    CITY UNIVERSITY              4863
10   Kingston University          4856
Institution type   Distribution
College            23.32%
University         16.57%
K-12 school        34.22%
Not sure           25.89%
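The top-10 table shows two kinds of noise: placeholder entries ("N/A", "Not Specified") and trivially different spellings ("City & Guilds" vs. "City and Guilds"). A minimal sketch of a pre-filter for these cases follows; it is not how sCooL removes K-12 schools (the deck names a Weka classifier for that), and the placeholder list and `canonical_key` rule are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

# Counts copied from the first six rows of the top-10 table.
RAW_COUNTS = [
    ("N/A", 128976),
    ("City & Guilds", 23992),
    ("Not Specified", 18598),
    ("City and Guilds", 17441),
    ("Open University", 6886),
    ("MIDDLESEX UNIVERSITY", 5490),
]

# Hypothetical placeholder list; real data would need a longer one.
PLACEHOLDERS = {"n/a", "not specified"}


def canonical_key(name: str) -> str:
    """Lowercase, spell out '&', and collapse repeated whitespace."""
    key = name.lower().replace("&", " and ")
    return " ".join(key.split())


def merge_counts(rows: Iterable[Tuple[str, int]]) -> Counter:
    """Drop placeholder entries and sum counts of equivalent spellings."""
    merged: Counter = Counter()
    for name, count in rows:
        key = canonical_key(name)
        if key not in PLACEHOLDERS:
            merged[key] += count
    return merged


merged = merge_counts(RAW_COUNTS)
print(merged.most_common(3))
```

With the placeholders dropped, the two "City and Guilds" spellings combine to 41,433 entries and become the most frequent name in this subset.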
10
About sCooL: Challenges summary
How will you identify the most accurate normalization from a given surface form?
How will you distinguish spelling or typing errors from the names of two different institutions?
How will you create and maintain the surface form-entity mappings?
How can we remove K-12 schools and noise?
11
Inside sCooL: A high-level overview of the system
Raw input query (surface form)
Remove K-12 schools
• Weka classifier
Search institutions using mappings DB
• Lucene index
Refine results
• String comparison algorithm
Normalized entity
• Update mappings DB
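The stages of this overview can be read as a chain of four steps. The skeleton below is a sketch of that control flow only: the function name `normalize` and the pluggable callables are assumptions, and the toy stand-ins in the usage part replace the Weka classifier, the Lucene index, and the string comparison algorithm named on the slide.

```python
from typing import Callable, Dict, List, Optional


def normalize(
    surface_form: str,
    is_noise: Callable[[str], bool],
    search: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], Optional[str]],
    mappings: Dict[str, str],
) -> Optional[str]:
    """Run one raw query through the stages shown on the slide."""
    if is_noise(surface_form):                 # remove K-12 schools and noise
        return None
    candidates = search(surface_form)          # search institutions
    entity = refine(surface_form, candidates)  # refine results
    if entity is not None:
        mappings[surface_form] = entity        # update the mappings DB
    return entity


# Toy stand-ins, for illustration only.
KNOWN = ["University of Salford", "Salford City College"]
mappings: Dict[str, str] = {}
result = normalize(
    "university of salford",
    is_noise=lambda s: "high school" in s.lower(),
    search=lambda s: [k for k in KNOWN if "salford" in s.lower()],
    refine=lambda s, cands: next(
        (c for c in cands if c.lower() == s.lower()), None),
    mappings=mappings,
)
print(result, mappings)
```

Passing the stages in as callables keeps the sketch independent of any particular classifier or index.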
12
Inside sCooL: Atlas, sCooL’s mapping manager
[Figure: diagram with the components CB mappings, Wiki mappings, MongoDB, Lucene, Atlas, and sCooL]
13
Inside sCooL: Refining Lucene results
\[ \text{Accuracy} = \frac{\text{True normalizations}}{\text{Non-null normalizations (True + False)}} \]
\[ \text{Coverage} = \frac{\text{True normalizations}}{\text{All normalizations (True + False + Null)}} \]
[Figure: Coverage and Accuracy plotted against threshold similarity (x-axis from 0 to 2)]
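The trade-off in this plot can be reproduced on toy data. The sketch below assumes the slide's definitions read as accuracy = true / (true + false) and coverage = true / (true + false + null), and that a result whose similarity falls below the threshold is returned as null; the scores in `SCORED` are made up for illustration and use a 0-to-1 scale rather than the plot's axis.

```python
from typing import List, Tuple


def accuracy(true: int, false: int) -> float:
    """Share of returned (non-null) normalizations that are correct."""
    returned = true + false
    return true / returned if returned else 0.0


def coverage(true: int, false: int, null: int) -> float:
    """Share of all queries that received a correct normalization."""
    total = true + false + null
    return true / total if total else 0.0


def evaluate(scored: List[Tuple[float, bool]],
             threshold: float) -> Tuple[float, float]:
    """Treat results below the threshold as null, then compute both metrics."""
    true = sum(1 for s, ok in scored if s >= threshold and ok)
    false = sum(1 for s, ok in scored if s >= threshold and not ok)
    null = sum(1 for s, _ in scored if s < threshold)
    return accuracy(true, false), coverage(true, false, null)


# Hypothetical (similarity, is_correct) pairs for five queries.
SCORED = [(1.0, True), (0.95, True), (0.9, False), (0.8, True), (0.6, False)]
for t in (0.5, 0.85, 0.92):
    acc, cov = evaluate(SCORED, t)
    print(f"threshold={t:.2f} accuracy={acc:.2f} coverage={cov:.2f}")
```

On this toy data, raising the threshold increases accuracy while coverage drops, which is the balance the threshold has to strike.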
14
Evaluation: Comparing sCooL with the existing implementation
Targeted metrics: accuracy and coverage
Precision is more important than recall
Stratified sampling to estimate ratios
Favor high-frequency queries in sampling
15
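Evaluation: Comparing sCooL with the existing implementation
\[ \Pr\left( \left| \frac{\hat{p}_i - P_i}{P_i} \right| < h_i \right) = \alpha \]
\[ n_0 = \frac{Z_\alpha^2 \, P_i (1 - P_i)}{h_i^2}, \qquad n_i = \frac{n_0}{1 + (n_0 - 1)/N_i} \]
\[ \hat{p} = \sum_{i=1}^{3} \frac{N_i}{\sum_j N_j} \, \hat{p}_i \]
[Figure: sampling design, with frequency groups [1, 6], [7, 39], and [40, max] labeled 91%, 7%, and 2%]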
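The sample-size step can be sketched numerically. The code below assumes the formulas read as n_0 = z^2 p (1 - p) / h^2 with the finite-population correction n_i = n_0 / (1 + (n_0 - 1) / N_i), plus a size-weighted combination of per-stratum estimates. The values z = 1.96, p = 0.5, and h = 0.035 are our assumptions; the slides do not state which values were used. The stratum sizes are the group sizes from the deck's results table.

```python
import math
from typing import Sequence


def initial_sample_size(z: float, p: float, h: float) -> float:
    """n_0 = z^2 * p * (1 - p) / h^2, before any population correction."""
    return z * z * p * (1.0 - p) / (h * h)


def corrected_sample_size(n0: float, stratum_size: int) -> int:
    """n_i = n_0 / (1 + (n_0 - 1) / N_i), rounded up to a whole sample."""
    return math.ceil(n0 / (1.0 + (n0 - 1.0) / stratum_size))


def stratified_estimate(stratum_sizes: Sequence[int],
                        stratum_estimates: Sequence[float]) -> float:
    """Combine per-stratum estimates, weighting each by N_i / sum(N)."""
    total = sum(stratum_sizes)
    return sum(n / total * p
               for n, p in zip(stratum_sizes, stratum_estimates))


# Assumed inputs (not stated in the slides): z = 1.96, p = 0.5, h = 0.035.
n0 = initial_sample_size(z=1.96, p=0.5, h=0.035)
group_sizes = [145_126, 11_938, 3_896]  # groups [1, 6], [7, 39], [40, max]
print([corrected_sample_size(n0, n) for n in group_sizes])  # [780, 736, 653]
```

With these assumed inputs the sketch yields 780, 736, and 653, the same sample sizes as in the deck's results table, although that agreement does not confirm which parameters the authors chose.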
16
Evaluation: Comparing sCooL with the existing implementation
Groups      Group Size   Sample Size   Sampling Rate   sCooL Accuracy   Existing System Accuracy
[1, 6]      145,126      780           1%              92%              75%
[7, 39]     11,938       736           6%              96%              79%
[40, max]   3,896        653           17%             95%              85%
Total       160,960      2,169         1%              95%              80%
                        Coverage                  Weighted Coverage
Dataset                 sCooL   Existing System   sCooL   Existing System
UK CareerBuilder data   40%     1%                73%     46%
17
Evaluation: Independent evaluation of sCooL
Test 1: 4ICU university list
The 4ICU [22] website: 145 popular universities and colleges in the U.K.
Test 2: Guardian university list
The Guardian [23]: a list of 135 universities in the U.K.
               Accuracy                  Coverage
Dataset        sCooL   Existing System   sCooL   Existing System
Test 1 (145)   93%     91%               95%     79%
Test 2 (135)   93%     90%               88%     72%
18
sCooL: Demo
Atlas: http://ec2-54-193-1-73.us-west-1.compute.amazonaws.com/Atlas/
20
sCooL: Appendix
Lucene search results for “University of Milan”
Rank  Searchable field                 Display name
1     polytechnic university of milan  Polytechnic University of Milan
2     university of milan              University of Milan
3     catholic university of milan     Università Cattolica del Sacro Cuore
4     iulm university of milan         IULM University of Milan
5     university of milan bicocca      University of Milan Bicocca
6     milan university                 University of Milan
7     politecnico of milan             Polytechnic University of Milan
8     milan polytechnic                Polytechnic University of Milan
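The ranked list shows why a refinement step is needed: the top Lucene hit for this query is not the intended institution. Below is a minimal sketch of re-scoring the candidates by string similarity and returning nothing when no candidate is close enough. It uses Python's `difflib.SequenceMatcher` as a stand-in for the string comparison algorithm, and the 0.9 threshold is an assumption; neither is taken from sCooL.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# (searchable field, display name) pairs copied from the table above.
CANDIDATES: List[Tuple[str, str]] = [
    ("polytechnic university of milan", "Polytechnic University of Milan"),
    ("university of milan", "University of Milan"),
    ("catholic university of milan", "Università Cattolica del Sacro Cuore"),
    ("iulm university of milan", "IULM University of Milan"),
    ("university of milan bicocca", "University of Milan Bicocca"),
    ("milan university", "University of Milan"),
    ("politecnico of milan", "Polytechnic University of Milan"),
    ("milan polytechnic", "Polytechnic University of Milan"),
]


def refine(query: str,
           candidates: List[Tuple[str, str]],
           threshold: float = 0.9) -> Optional[str]:
    """Return the display name of the most similar candidate, or None."""
    best_score, best_name = 0.0, None
    for searchable, display in candidates:
        score = SequenceMatcher(None, query.lower(), searchable).ratio()
        if score > best_score:
            best_score, best_name = score, display
    return best_name if best_score >= threshold else None


print(refine("University of Milan", CANDIDATES))
print(refine("Harvard", CANDIDATES))  # nothing similar enough
```

Returning `None` for weak matches corresponds to a null normalization: the query is left unmapped instead of being mapped wrongly.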
21
sCooL: Appendix
String similarity algorithms
Rank  String similarity algorithm
1     Levenshtein
2     Lucene Levenshtein
3     N-gram
4     Jaccard Similarity
5     Jaro Winkler
6     Hamming
7     Equals
8     Ignore case Equals
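For readers unfamiliar with the simpler measures in this list, the sketch below gives textbook versions of three of them: Hamming distance, Jaccard similarity over character n-grams, and case-insensitive equality. These are generic definitions under our own function names and may differ in detail from the variants that were ranked for sCooL.

```python
from typing import Set


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for ca, cb in zip(a, b) if ca != cb)


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Set of character n-grams (bigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Size of the n-gram intersection divided by the size of the union."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 1.0  # both strings are too short to produce any n-gram
    return len(grams_a & grams_b) / len(union)


def equals_ignore_case(a: str, b: str) -> bool:
    """True when the strings match after case folding."""
    return a.casefold() == b.casefold()


print(hamming("karolin", "kathrin"))
print(jaccard("night", "nacht"))
print(equals_ignore_case("CITY UNIVERSITY", "City University"))
```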
22
Evaluation: Comparing sCooL with the existing implementation
Balancing between accuracy and coverage
[Figure: numbers of Correct, Wrong, and Null results plotted against threshold similarity (x-axis from 0 to 2; y-axis: total input queries, 0 to 7000)]
[Figure: Coverage and Accuracy plotted against threshold similarity (x-axis from 0 to 2)]
23
About sCooL: Related work
Cucerzan, S. (Microsoft Research) worked on large-scale named entity disambiguation using Wikipedia data in 2007
Jijkoun, V. et al. (Univ. of Amsterdam) proposed named entity normalization (NEN) for user-generated content in 2008
Liu, X. et al. (Microsoft Research, China) performed joint inference of named entity recognition (NER) and NEN for tweets in 2012
Magdy, W. et al. (IBM, Egypt) developed NEN for Arabic names in 2007
Jonnalagadda, S. et al. (Lnx Research, CA) developed NEMO, a NER and NEN system for PubMed author affiliations, in 2011
Cohen, A. (OHSU) studied gene/protein NEN using automatically generated libraries in 2005