sCooL: A System for Academic Institution Name Normalization
Ferosh Jacob, Faizan Javed, Meng Zhao, and Matt McNair
Classification R & D, CareerBuilder

Presentation overview

About sCooL
◦ What is entity normalization?
◦ Why is academic entity normalization important?
◦ What are the academic entity normalization challenges?

Inside sCooL
◦ A high-level overview of the core components
◦ Atlas, the mapping manager

Evaluating sCooL
◦ Comparing sCooL with the existing implementation
◦ Independent evaluation of sCooL

Concluding remarks
◦ Demo
◦ Questions?

About sCooL: Academic entity normalization facts

Facts

7,021 post-secondary Title IV institutions in 2010-11*

200 million unique visitors @ CB U.S.

12 million unique academic institution entries in the CB resume database

*http://nces.ed.gov/fastfacts/display.asp?id=84

About sCooL: Academic entity normalization definition

Table: ten surface forms (columns: No., Name (surface forms), Frequency) with frequencies 410, 139, 131, 6, 1, 1, 1, 1, 1, and 1, bracketed together under the labels "Entity" and "Surface forms".

About sCooL: Why academic entity normalization?

Improved Searching

Labor market dynamics insights

About sCooL: Academic entity normalization challenges

No.  Name (surface form)            Frequency
1    Salford College                410
2    Salford College of Technology  139
3    Salford City College           131
4    Salford Uni                    6
5    Salford University -           1
6    The University of Salford.     1
7    Salford University **+         1
8    University of Salford 1982     1
9    =- University OF SALFORD       1
10   University of Salford-         1

Entity: Salford City College, Merchants Quay, Salford Quays, United Kingdom

Entity: University of Salford, Salford, Lancashire, United Kingdom

Entity: Salford College, 68 Grenfell Street, Adelaide, Australia

How will you identify the most accurate normalization from a given surface form?

About sCooL: Academic entity normalization challenges

String similarity algorithms
◦ Edit distance

Salford university -> Salford Unevarsity (edit distance 2, ignoring case): a spelling error

St. Loye’s College -> St. Luke’s College (edit distance 2): two different academic institutions

How will you distinguish a spelling or typing error from a mapping between two different institutions?
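
The two examples on this slide can be checked with a small sketch of Levenshtein edit distance. This is a textbook dynamic-programming version, not necessarily the implementation used in sCooL; lowercasing both strings before comparing is an assumption made here so that the first pair yields distance 2.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


# Assumption: comparison is case-insensitive, so both inputs are lowercased.
typo = levenshtein("salford university", "salford unevarsity")
different = levenshtein("st. loye's college", "st. luke's college")
print(typo, different)  # 2 2
```

Both pairs score the same, which is the point of the slide: edit distance alone cannot separate a typo from a genuinely different institution.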

About sCooL: Academic entity normalization challenges

How will you create and maintain the surface form-entity mappings?

Legacy names (Mergers)

◦ University of Central England in Birmingham is an old name of Birmingham City University

◦ In January 2009, Salford College merged with Eccles College and Pendleton College to form Salford City College

◦ In October 2004, Victoria University of Manchester merged with the University of Manchester Institute of Science and Technology to form The University of Manchester

Popular names and acronyms

◦ Ole Miss is a popular name for The University of Mississippi

◦ MIT is an acronym for Massachusetts Institute of Technology. However, GIT is not used as an acronym for Georgia Institute of Technology; Georgia Tech and Ga Tech are its popular names.
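
A surface form-entity mapping of the kind described on this slide can be sketched as a plain lookup table. The table below is hand-written from the examples on the slide, and the `ALIASES` and `lookup` names are hypothetical; it is not the sCooL mappings database.

```python
from typing import Optional

# Hypothetical mapping table built only from the examples on this slide.
ALIASES = {
    "university of central england in birmingham": "Birmingham City University",
    "eccles college": "Salford City College",
    "pendleton college": "Salford City College",
    "victoria university of manchester": "The University of Manchester",
    "ole miss": "The University of Mississippi",
    "mit": "Massachusetts Institute of Technology",
    "georgia tech": "Georgia Institute of Technology",
    "ga tech": "Georgia Institute of Technology",
}


def lookup(surface_form: str) -> Optional[str]:
    """Return the mapped entity, or None when no mapping is known."""
    key = " ".join(surface_form.lower().split())  # lowercase, collapse spaces
    return ALIASES.get(key)


print(lookup("Ole Miss"))    # The University of Mississippi
print(lookup("  Ga  Tech"))  # Georgia Institute of Technology
print(lookup("GIT"))         # None: not a name people use for Georgia Tech
```

A table like this has to be curated and kept current as institutions merge or rename, which is the maintenance problem the slide raises.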

About sCooL: Academic entity normalization challenges

How can we remove K-12 schools and noise?

Top 10 most frequent universities in the UK dataset

No.  Name                       Frequency
1    N/A                        128976
2    City & Guilds              23992
3    Not Specified              18598
4    City and Guilds            17441
5    Open University            6886
6    MIDDLESEX UNIVERSITY       5490
7    University of East London  5266
8    University of Greenwich    5108
9    CITY UNIVERSITY            4863
10   Kingston University        4856

Institution type  Distribution
College           23.32%
University        16.57%
K-12 school       34.22%
Not sure          25.89%

About sCooL: Challenges summary

How will you identify the most accurate normalization from a given surface form?

How will you distinguish a spelling or typing error from a mapping between two different institutions?

How will you create and maintain the surface form-entity mappings?

How can we remove K-12 schools and noise?

Inside sCooL: A high-level overview of the system

Raw input query (surface form)

Remove K-12 schools
• Weka classifier

Search institutions using mappings DB
• Lucene index

Refine results
• String comparison algorithm

Normalized entity
• Update mappings DB
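
The four stages above can be sketched end to end in a few lines. Every component here is a stand-in and an assumption: a keyword check replaces the Weka classifier, a token-overlap scan over an in-memory dictionary replaces the Lucene index and mappings DB, and `difflib.SequenceMatcher` with a 0.8 threshold replaces the string comparison algorithm.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Stand-in for the mappings DB: searchable field -> normalized entity.
MAPPINGS = {
    "university of salford": "University of Salford",
    "salford university": "University of Salford",
    "salford city college": "Salford City College",
}
K12_HINTS = ("high school", "primary school", "secondary school")  # assumption
NOISE = {"", "n/a", "not specified"}


def is_k12_or_noise(query: str) -> bool:
    """Keyword stand-in for the K-12 classifier stage."""
    return query in NOISE or any(hint in query for hint in K12_HINTS)


def search(query: str, limit: int = 5) -> List[str]:
    """Token-overlap stand-in for the index search stage."""
    tokens = set(query.split())
    scored = [(len(tokens & set(field.split())), field) for field in MAPPINGS]
    return [field for overlap, field in sorted(scored, reverse=True) if overlap][:limit]


def refine(query: str, candidates: List[str], threshold: float = 0.8) -> Optional[str]:
    """Keep the most similar candidate if it clears the threshold."""
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in candidates]
    best = max(scored, default=(0.0, None))
    return best[1] if best[0] >= threshold else None


def normalize(raw: str) -> Optional[str]:
    query = " ".join(raw.lower().split())
    if is_k12_or_noise(query):
        return None
    best = refine(query, search(query))
    if best is None:
        return None
    entity = MAPPINGS[best]
    MAPPINGS.setdefault(query, entity)  # update the mappings with the new surface form
    return entity


print(normalize("Salford Unevarsity"))   # University of Salford
print(normalize("Salford High School"))  # None
```

Returning `None` when nothing clears the threshold corresponds to a null normalization, which is what lets a system trade coverage for accuracy.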

Inside sCooL: Atlas, sCooL’s mapping manager

Diagram components: CB mappings, Wiki mappings, MongoDB, Lucene, Atlas, sCooL

Inside sCooL: Refining Lucene results

$$\text{Accuracy} = \frac{\text{True normalizations}}{\text{Non-null normalizations (True + False)}}$$

$$\text{Coverage} = \frac{\text{True normalizations}}{\text{All normalizations (True + False + Null)}}$$

Figure: Coverage and Accuracy plotted against threshold similarity (x-axis from 0 to 2; y-axis ticks from 0 to 0.6 and from 0.82 to 1).
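
The two metrics defined on this slide can be computed directly from the counts of true, false, and null normalizations. The counts in the usage lines are hypothetical and chosen only to show the arithmetic.

```python
def accuracy(true_count: int, false_count: int) -> float:
    """True normalizations over all non-null normalizations."""
    non_null = true_count + false_count
    return true_count / non_null if non_null else 0.0


def coverage(true_count: int, false_count: int, null_count: int) -> float:
    """True normalizations over all normalizations, nulls included."""
    total = true_count + false_count + null_count
    return true_count / total if total else 0.0


# Hypothetical counts: 90 correct, 10 wrong, 100 left unnormalized (null).
print(accuracy(90, 10))       # 0.9
print(coverage(90, 10, 100))  # 0.45
```

Raising the similarity threshold turns more wrong answers into nulls, which tends to raise accuracy while lowering coverage.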

Evaluation: Comparing sCooL with existing implementation

Targeted metrics: Accuracy & Coverage

Precision is more important than Recall

Stratified sampling to estimate the ratios

Favor high-frequency queries in sampling

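Evaluation: Comparing sCooL with existing implementation

$$\Pr\left( \left| \frac{\hat{P}_i - P_i}{P_i} \right| < h_i \right) = C$$

$$\begin{cases} n_0 = \dfrac{Z_{\alpha}^{2} \, P_i (1 - P_i)}{h_i^{2}} \\[2ex] n_i = \dfrac{n_0}{1 + (n_0 - 1)/N_i} \end{cases}$$

$$\hat{P} = \sum_{i=1}^{3} \frac{N_i}{\sum_{j} N_j} \, \hat{P}_i$$

Sampling design (query frequency groups): [1, 6] 91%; [7, 39] 7%; [40, max] 2%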
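
The sample-size and pooled-estimate formulas on this slide can be worked through numerically. The function names are hypothetical, and all inputs (P = 0.5, h = 0.05, Z = 1.96, a stratum of 1,000 queries, and the per-stratum estimates) are illustrative assumptions rather than values from the study; only the 91/7/2 stratum weights echo the sampling design above.

```python
from typing import Sequence


def initial_sample_size(p: float, h: float, z: float) -> float:
    """n_0 = Z^2 * P * (1 - P) / h^2, before any population correction."""
    return z ** 2 * p * (1 - p) / h ** 2


def corrected_sample_size(n0: float, population: int) -> float:
    """n_i = n_0 / (1 + (n_0 - 1) / N_i), the finite-population correction."""
    return n0 / (1 + (n0 - 1) / population)


def stratified_estimate(sizes: Sequence[float], estimates: Sequence[float]) -> float:
    """Pooled estimate: each stratum estimate weighted by N_i / sum of N."""
    total = sum(sizes)
    return sum(size / total * est for size, est in zip(sizes, estimates))


n0 = initial_sample_size(p=0.5, h=0.05, z=1.96)  # about 384.16
n1 = corrected_sample_size(n0, population=1000)  # smaller than n0
pooled = stratified_estimate([91, 7, 2], [0.9, 0.8, 0.7])
print(round(n0, 2), round(n1, 2), round(pooled, 3))
```

The correction matters most for small strata: the smaller the stratum, the further the required sample falls below n_0.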

Evaluation: Comparing sCooL with existing implementation

Groups     Group size  Sample size  Sampling rate  sCooL accuracy  Existing system accuracy
[1, 6]     145,126     780          1%             92%             75%
[7, 39]    11,938      736          6%             96%             79%
[40, max]  3,896       653          17%            95%             85%
Total      160,960     2,169        1%             95%             80%

Dataset                Coverage (sCooL)  Coverage (existing system)  Weighted coverage (sCooL)  Weighted coverage (existing system)
UK CareerBuilder data  40%               1%                          73%                        46%

Evaluation: Independent evaluation of sCooL

Test 1: 4ICU university list

The 4ICU [22] website lists 145 popular universities and colleges in the U.K.

Test 2: Guardian university list

The Guardian [23] publishes a list of 135 universities in the U.K.

Dataset       Accuracy (sCooL)  Accuracy (existing system)  Coverage (sCooL)  Coverage (existing system)
Test 1 (145)  93%               91%                         95%               79%
Test 2 (135)  93%               90%                         88%               72%

sCooL: Demo

Atlas http://ec2-54-193-1-73.us-west-1.compute.amazonaws.com/Atlas/

sCooL: Questions

sCooL: Appendix

Lucene search results for “University of Milan”

Rank  Searchable field                 Display name
1     polytechnic university of milan  Polytechnic University of Milan
2     university of milan              University of Milan
3     catholic university of milan     Università Cattolica del Sacro Cuore
4     iulm university of milan         IULM University of Milan
5     university of milan bicocca      University of Milan Bicocca
6     milan university                 University of Milan
7     politecnico of milan             Polytechnic University of Milan
8     milan polytechnic                Polytechnic University of Milan
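
The table shows why search results need a refinement step: the exact match is only ranked second. Below is a minimal sketch of re-ranking these candidates by a normalized Levenshtein similarity. The normalization (one minus distance over the longer length) and the `best_match` helper are assumptions for illustration, not the scoring sCooL uses.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, lower as edits grow."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# (searchable field, display name) pairs copied from the table, in rank order.
CANDIDATES: List[Tuple[str, str]] = [
    ("polytechnic university of milan", "Polytechnic University of Milan"),
    ("university of milan", "University of Milan"),
    ("catholic university of milan", "Università Cattolica del Sacro Cuore"),
    ("iulm university of milan", "IULM University of Milan"),
    ("university of milan bicocca", "University of Milan Bicocca"),
    ("milan university", "University of Milan"),
    ("politecnico of milan", "Polytechnic University of Milan"),
    ("milan polytechnic", "Polytechnic University of Milan"),
]


def best_match(query: str, candidates: List[Tuple[str, str]]) -> Tuple[float, str]:
    """Return (similarity, display name) of the closest searchable field."""
    scored = [(similarity(query.lower(), field), display) for field, display in candidates]
    return max(scored, key=lambda pair: pair[0])


score, display = best_match("University of Milan", CANDIDATES)
print(score, display)  # 1.0 University of Milan
```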

sCooL: Appendix
String similarity algorithms

Rank  String similarity algorithm
1     Levenshtein
2     Lucene Levenshtein
3     N-gram
4     Jaccard Similarity
5     Jaro Winkler
6     Hamming
7     Equals
8     Ignore case Equals
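
For readers unfamiliar with the simpler measures in this list, here is a minimal sketch of three of them. The definitions are common textbook ones and the function names are hypothetical; in particular, computing Jaccard similarity over character bigrams is one possible reading of the "N-gram" and "Jaccard Similarity" entries, not a description of the implementations ranked above.

```python
from typing import Set


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Set of overlapping character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Shared n-grams divided by all n-grams seen in either string."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def equals_ignore_case(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()


print(hamming("salford university", "salford unevarsity"))       # 2
print(equals_ignore_case("CITY UNIVERSITY", "City University"))  # True
print(equals_ignore_case("City & Guilds", "City and Guilds"))    # False
print(jaccard("city & guilds", "city and guilds"))
```

The last two lines illustrate why exact-match measures rank lowest: "City & Guilds" and "City and Guilds" name the same body, yet only a fuzzy measure gives them any credit.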

Evaluation: Comparing sCooL with existing implementation

Balancing between Accuracy and Coverage

Figure: numbers of Correct, Wrong, and Null normalizations (y-axis: total input queries, 0 to 7000) plotted against threshold similarity (0 to 2).

Figure: Coverage and Accuracy plotted against threshold similarity (0 to 2).

About sCooL: Related work

Cucerzan, S. (Microsoft Research) worked on large-scale disambiguation using Wikipedia data in 2007

Jijkoun, V. et al. (Univ. of Amsterdam) proposed NEN for user-generated content in 2008

Liu, X. et al. (Microsoft Research, China) performed joint inference on NER and NEN for tweets in 2012

Magdy, W. et al. (IBM, Egypt) developed NEN for Arabic names in 2007

Jonnalagadda, S. et al. (Lnx Research, CA) developed NEMO, an NER and NEN system for PubMed author affiliations, in 2011

Cohen, A. (OHSU) studied gene/protein NEN using automatically generated libraries in 2005