Results on Tracks 1 and 2 of KDD Cup 2013
Chih-Jen Lin
Department of Computer Science, National Taiwan University
Joint work with members of the team “Algorithm” from National Taiwan University
Chih-Jen Lin (National Taiwan Univ.) Aug 11, 2013
Outline
1 Introduction
2 Track 1: paper-author identification
  Feature generation
  Classification
  Ensemble and post-processing
  Results
3 Track 2: author disambiguation
  Strategies and architecture
  Implementation
  Ensemble and typo handling
  Analysis
4 Conclusions
Introduction
Team Members
At National Taiwan University, we organized a course for KDD Cup 2013
Three instructors, three TAs, and 18 students
The 18 students were split into six sub-teams named after algorithms:
A*, Binary-Search, Dijkstra, K-means, Quick-Sort, Simplex
Submission quotas were divided equally among the six sub-teams
Track 1: paper-author identification
Paper-author Identification
Given an (author, paper) pair, did the author write the paper?
What information do we have?
  Author and paper profiles
  Labeled (author, paper) pairs
    Confirmed: author wrote the paper
    Deleted: author didn’t write the paper
For a given (author, paper) pair, we say “target author” and “target paper” to distinguish them from other authors/papers
Paper-author Identification (Cont’d)
Submission: ranking query papers for each query author
Example: author 9417 has query papers 1, 2, 3, 6, and 9.
If 3 and 6 are confirmed and 1, 2, and 9 are deleted, we should submit “9417, 3 6 1 2 9”
MAP (Mean Average Precision) is the evaluation measure
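As a rough illustration of the evaluation measure, the sketch below computes average precision for one author's ranked list and averages it over authors. The function names are hypothetical, and the exact behavior of the official evaluation code (for example, on duplicated paper IDs) is not reproduced here.

```python
from typing import Iterable, Sequence, Set, Tuple


def average_precision(ranked: Sequence[int], confirmed: Set[int]) -> float:
    """Average precision of one ranked list of paper IDs."""
    hits = 0
    total = 0.0
    for rank, paper in enumerate(ranked, start=1):
        if paper in confirmed:
            hits += 1
            total += hits / rank  # precision at this hit
    return total / len(confirmed) if confirmed else 0.0


def mean_average_precision(queries: Iterable[Tuple[Sequence[int], Set[int]]]) -> float:
    """Mean of the per-author average precisions."""
    scores = [average_precision(ranked, confirmed) for ranked, confirmed in queries]
    return sum(scores) / len(scores) if scores else 0.0


# Author 9417 from the slide: papers 3 and 6 are confirmed.
ideal = average_precision([3, 6, 1, 2, 9], {3, 6})   # confirmed papers first
worse = average_precision([1, 3, 2, 6, 9], {3, 6})   # confirmed papers at ranks 2 and 4
```

Here `ideal` is 1.0 and `worse` is (1/2 + 2/4) / 2 = 0.5, which is why confirmed papers should be ranked ahead of deleted ones.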
Internal Validation Set
We split Train.csv into internal training and validation sets because the number of submissions per day is limited.
This also avoids overfitting the leaderboard
[Figure: the data is divided as Train : Valid : Test = 5 : 2 : 3. Train is randomly shuffled and split into internal train : internal valid = 5 : 2.]
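A minimal sketch of such a shuffled 5 : 2 split is shown below. The slides do not say whether the split is made per row or per author, nor which seed was used; this sketch assumes a per-row split, and `split_internal` and its parameters are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_internal(rows: Sequence[T], train_parts: int = 5, valid_parts: int = 2,
                   seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle rows and split them train_parts : valid_parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = len(shuffled) * train_parts // (train_parts + valid_parts)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(70))  # stand-in for the rows of Train.csv
internal_train, internal_valid = split_internal(rows)
```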
System Overview
The full list of 97 features is given in the paper
Track 1: paper-author identification Feature generation
Features from Author Profiles
Given a query (author 1360414, paper 1841516), what information do we have about the author?
Author.csv: 1360414,Chih-Jen Lin,National Taiwan University
PaperAuthor.csv: 1841516,1360414,Chih-Jen Lin,"National Taiwan University, Taipei"
Distances between the target author’s names, affiliations, etc. in the two csv files ⇒ features that indicate consistency
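To make the idea concrete, the sketch below turns the two records above into two consistency features: a character-level Levenshtein distance on names and a word-level Jaccard distance on affiliations. Which distance is applied to which field, and how text is tokenized, are assumptions made for illustration; the actual feature definitions are in the paper.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A ∩ B| / |A ∪ B| over lower-cased word sets."""
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    if not sa and not sb:
        return 0.0  # two empty strings: zero distance
    return 1.0 - len(sa & sb) / len(sa | sb)


# Records for author 1360414 taken from the slide.
author_row = ("Chih-Jen Lin", "National Taiwan University")
paper_author_row = ("Chih-Jen Lin", "National Taiwan University, Taipei")

features = {
    "name_levenshtein": levenshtein(author_row[0], paper_author_row[0]),
    "affiliation_jaccard": jaccard_distance(author_row[1], paper_author_row[1]),
}
```

Small values on both features suggest that the two files describe the same person.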
Features from Author Profiles (Cont’d)
We need to address two issues
  Distance between full and abbreviated names
  Western and eastern order of names
    Example: “Chih Jen Lin” and “Lin Chih Jen”
See paper for details
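One simple way to handle both issues, sketched here under our own assumptions rather than as the method in the paper, is to compare names through their initials and to try both word orders before comparing. All helper names are hypothetical.

```python
from typing import Set


def normalize(name: str) -> str:
    """Lower-case and treat hyphens and periods as word separators."""
    return " ".join(name.lower().replace("-", " ").replace(".", " ").split())


def initials(name: str) -> str:
    """First letter of every word, e.g. 'Chih-Jen Lin' -> 'c j l'."""
    return " ".join(word[0] for word in normalize(name).split())


def initials_match(a: str, b: str) -> bool:
    """True if the two names agree once both are abbreviated."""
    return initials(a) == initials(b)


def order_variants(name: str) -> Set[str]:
    """The name as written, plus the variant with the last word moved to the front."""
    words = normalize(name).split()
    variants = {" ".join(words)}
    if len(words) > 1:
        variants.add(" ".join(words[-1:] + words[:-1]))
    return variants


def same_name_ignoring_order(a: str, b: str) -> bool:
    """True if one name equals the other in western or eastern order."""
    return normalize(b) in order_variants(a) or normalize(a) in order_variants(b)
```

With these helpers, `same_name_ignoring_order("Chih Jen Lin", "Lin Chih Jen")` and `initials_match("Chih-Jen Lin", "C. J. Lin")` both hold; a distance-based feature could take the minimum distance over the order variants instead of testing equality.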
Features from Coauthors Names
Example: deleted paper 5633 of Li Zhang has two authors with the same name
The relation between the target author and the authors of the target paper can provide features
Examples
1. Minimum name distance between the target author and authors of the target paper
2. Same as 1, but check abbreviated names
Features for Author/Paper Consistency
Information should be consistent across papers and authors
Examples
1. Maximum distance between target author’s affiliation and affiliations of co-authors in target paper
2. Maximum distance between target paper’s title and target author’s papers
Missing Value Handling
Two empty strings have zero distance, so a pair with missing values looks more similar than a genuinely similar pair:
d(’Chih J Lin’, ’C Jen Lin’) ≥ d(’’, ’’) = 0
Replace the distance between empty strings with a non-zero value

| Distance | Value |
| --- | --- |
| Jaro | 0.5 |
| Jaccard | 0.5 |
| Levenshtein | average length of all entries |

Missing value indicators. Example: number of coauthors without affiliation information
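The sketch below shows how the replacement values and a missing-value indicator might be wired in. The constant 0.5 and the "average length of all entries" rule come from the slide; the function names and the choice to average over every entry passed in (empty ones included) are assumptions.

```python
import re
from typing import Sequence

EMPTY_PAIR_JACCARD = 0.5  # replacement value from the slide


def jaccard_distance(a: str, b: str) -> float:
    """Word-level Jaccard distance, with two empty strings mapped to 0.5."""
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    if not sa and not sb:
        return EMPTY_PAIR_JACCARD  # instead of the misleading 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def empty_pair_levenshtein(entries: Sequence[str]) -> float:
    """Replacement Levenshtein distance for two empty strings:
    the average length of all entries of that field."""
    return sum(len(e) for e in entries) / len(entries) if entries else 0.0


def count_missing_affiliations(coauthor_affiliations: Sequence[str]) -> int:
    """Indicator feature: number of coauthors without affiliation information."""
    return sum(1 for aff in coauthor_affiliations if not aff.strip())
```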
Features Using Publication Time
Examples
1. Earliest/latest publication year of target author
2. Publication year of target paper
Data cleaning:
  Years outside [1800, 2013] are removed
  Then we must handle missing values
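A small sketch of this cleaning step follows. The interval [1800, 2013] is from the slide; representing a removed year as `None` and adding an explicit missing indicator are assumptions, since the slides do not say how the missing years are then filled.

```python
from typing import Dict, Iterable, Optional


def clean_year(year: Optional[int]) -> Optional[int]:
    """Keep years inside [1800, 2013]; anything else becomes missing."""
    if year is not None and 1800 <= year <= 2013:
        return year
    return None


def year_features(author_years: Iterable[Optional[int]],
                  paper_year: Optional[int]) -> Dict[str, Optional[int]]:
    """Earliest/latest year of the target author and year of the target paper."""
    valid = [y for y in map(clean_year, author_years) if y is not None]
    cleaned_paper_year = clean_year(paper_year)
    return {
        "author_earliest": min(valid) if valid else None,
        "author_latest": max(valid) if valid else None,
        "paper_year": cleaned_paper_year,
        "paper_year_missing": int(cleaned_paper_year is None),
    }
```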
Features Based on A Network
We construct a network of authors, papers, journals, and conferences
Features Based on A Network (Cont’d)
From the network we can extract features to describe node relationships
Examples
1. # of publications of the author
2. # of coauthored papers of the target author with all the coauthors of the target paper
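The two example features can be sketched on a toy author-paper network as follows. Only the author and paper nodes are modeled, and the second feature is interpreted as a sum over the coauthors of the target paper, which is one possible reading of the slide; all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_network(paper_author_pairs: Iterable[Tuple[int, str]]) -> Tuple[Graph, Graph]:
    """Build author -> papers and paper -> authors adjacency maps."""
    papers_of: Graph = defaultdict(set)
    authors_of: Graph = defaultdict(set)
    for paper, author in paper_author_pairs:
        papers_of[author].add(paper)
        authors_of[paper].add(author)
    return dict(papers_of), dict(authors_of)


def num_publications(papers_of: Graph, author: str) -> int:
    """Feature 1: number of publications of the author."""
    return len(papers_of.get(author, set()))


def num_coauthored_papers(papers_of: Graph, authors_of: Graph,
                          author: str, paper: int) -> int:
    """Feature 2: papers the target author shares with each coauthor
    of the target paper, summed over those coauthors."""
    own = papers_of.get(author, set())
    coauthors = authors_of.get(paper, set()) - {author}
    return sum(len(own & papers_of.get(c, set())) for c in coauthors)


pairs = [(1, "a"), (1, "b"), (2, "a"), (2, "b"), (3, "a"), (3, "c"), (4, "b")]
papers_of, authors_of = build_network(pairs)
```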
Track 1: paper-author identification Classification
Classification
Tree-based classifiers
  Random forests (RF)
  Gradient boosting decision tree (GBDT)
  LambdaMART (LM)

| Classifier | Tree ensemble | # of trees | Parallel | MAP on Valid.csv |
| --- | --- | --- | --- | --- |
| RF | bagging | 12,000 | yes | 0.983340 |
| GBDT | boosting | 300 | no | 0.983046 |
| LM | boosting | 300 | no | 0.983047 |

RF is sensitive to the initial random seed. Using 12,000 trees stabilizes the results
Track 1: paper-author identification Ensemble and post-processing
Ensemble
Weighted average over RF, GBDT, and LambdaMART
We didn’t use more complicated settings such as regression because we have only three models
A simple grid search on weights
Final weights
RF: 5, GBDT: 1, and LambdaMART: 1.
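The weighted average and the grid search over weights can be sketched as below. The sketch assumes the three models' scores are already on comparable scales and that `evaluate` returns a validation measure such as MAP; the grid range and all function names are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def weighted_average(scores: Dict[str, List[float]],
                     weights: Dict[str, float]) -> List[float]:
    """Blend per-model scores with normalized weights."""
    total = sum(weights.values())
    n = len(next(iter(scores.values())))
    return [sum(weights[m] * scores[m][i] for m in weights) / total
            for i in range(n)]


def grid_search(scores: Dict[str, List[float]],
                evaluate: Callable[[List[float]], float],
                grid: Iterable[int] = range(1, 6)) -> Tuple[float, Dict[str, int]]:
    """Try every weight combination on the grid and keep the best one."""
    names = sorted(scores)
    grid = list(grid)
    best: Optional[Tuple[float, Dict[str, int]]] = None
    for combo in product(grid, repeat=len(names)):
        weights = dict(zip(names, combo))
        value = evaluate(weighted_average(scores, weights))
        if best is None or value > best[0]:
            best = (value, weights)
    assert best is not None
    return best


# Final weights from the slide, applied to toy scores for two (author, paper) pairs.
toy_scores = {"rf": [1.0, 0.0], "gbdt": [0.0, 1.0], "lm": [0.0, 1.0]}
blended = weighted_average(toy_scores, {"rf": 5, "gbdt": 1, "lm": 1})
```

With only three models the grid has at most a few hundred points, so an exhaustive search is cheap.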
Post-Processing
Our post-processing procedure is simple, but one thing to note is duplicated paper IDs.
Suppose an author has confirmed papers 1, 2, 2, 4 and deleted paper 3
The evaluation code seems to consider the 2nd “2” as a deleted paper
Thus, MAP of 1,2,4,3,2 > MAP of 1,2,2,4,3
We move duplicated ones to the end
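The effect can be checked with a short sketch. The scoring function below only models the behavior the slide describes as apparent (a repeated ID counts as a miss); it is an assumption, not the official evaluation code, and both function names are hypothetical.

```python
from typing import List, Sequence, Set


def move_duplicates_to_end(ranked: Sequence[int]) -> List[int]:
    """Keep first occurrences in order and append repeated IDs at the end."""
    seen: Set[int] = set()
    first: List[int] = []
    repeats: List[int] = []
    for paper in ranked:
        (repeats if paper in seen else first).append(paper)
        seen.add(paper)
    return first + repeats


def ap_repeats_as_misses(ranked: Sequence[int], confirmed: Set[int]) -> float:
    """Average precision where a repeated paper ID is scored as a miss."""
    seen: Set[int] = set()
    hits = 0
    total = 0.0
    for rank, paper in enumerate(ranked, start=1):
        if paper in confirmed and paper not in seen:
            hits += 1
            total += hits / rank
        seen.add(paper)
    return total / len(confirmed) if confirmed else 0.0


original = [1, 2, 2, 4, 3]
fixed = move_duplicates_to_end(original)            # [1, 2, 4, 3, 2]
before = ap_repeats_as_misses(original, {1, 2, 4})  # (1 + 1 + 3/4) / 3
after = ap_repeats_as_misses(fixed, {1, 2, 4})      # (1 + 1 + 1) / 3
```

Under this model the repeated “2” at rank 3 pushes paper 4 down to rank 4, which is what lowers the score of the original ordering.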
Track 1: paper-author identification Results
Results

| Team | Public | Private |
| --- | --- | --- |
| 1st on public | 0.98554 | 0.98100 |
| 12th on public (ours) | 0.98235 | 0.98259 (1st) |

Possible reasons for the best result in the end
Improvements after Valid.csv was released
1. Data cleaning: unicode → ASCII
2. Missing-value handling (0.98334)
3. Ensemble (0.98390)
We didn’t give up even though we were 12th!
We effectively used the internal validation set to avoid overfitting
Track 2: author disambiguation
Author disambiguation
| | Author A | Author B |
| --- | --- | --- |
| Name | C. J. Smile Lin | C. J. Cry Lin |
| Affiliation | National Taiwan Univ. | Univ. of Michigan |
| Paper | LIBSVM Guide | LIBLINEAR Guide |

Are they duplicates?
Track 2: author disambiguation Strategies and architecture
Main Strategies
Using string matching rather than other learning techniques
An author without any papers is treated as a single group without duplicates
Recognizing whether an author is Chinese or not
Architecture
[Figure: Implementation 1 and Implementation 2 are combined by an ensemble step to produce the final result.]
Results
| Method | Public | Private |
| --- | --- | --- |
| Implementation 1 | 0.99186 | 0.99198 |
| Implementation 2 | 0.99071 | 0.99083 |
| Final | 0.99195 | 0.99202 |
Framework of the Two Implementations
1. Cleaning: remove redundant information
2. Chinese-or-not: classify each author as Chinese or non-Chinese
3. Selection: select a set of candidates of possible duplicates for each author
4. Identification: identify duplicates from the set of candidates for each author
5. Splitting: split incorrect cases (not discussed here)
Differences between Two Implementations
The basic elements are different
Implementation 1: author identifiers
Implementation 2: author names

| Author identifier | 1001 |
| --- | --- |
| Name in Author.csv | Chih Jen Lin |
| Names in PaperAuthor.csv | C. J. Lin, Chih Jen Peter Lin, C. J. P. Lin |

More (complicated) rules in Implementation 1
We focus on Implementation 1 because of limited time
Track 2: author disambiguation Implementation
Cleaning
Clean redundant information.
Examples:
CHih JEN LIn → chih jen lin
Mr. Chih Jen Lin → chih jen lin
Chih Jen Lin → chih jen lin
Chih Jen Bill Lin → chih jen william lin
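A minimal sketch of a cleaning function that reproduces the four examples is given below. The title list and the nickname table are hypothetical placeholders containing only what the examples need; the real rules live in the released code.

```python
# Hypothetical lookup tables; only "mr" and "bill" are implied by the slide.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}
NICKNAMES = {"bill": "william"}


def clean_name(name: str) -> str:
    """Lower-case, collapse whitespace, drop titles, expand nicknames."""
    cleaned = []
    for word in name.lower().split():   # split() also collapses repeated spaces
        if word.rstrip(".") in TITLES:  # "mr." and "mr" are both dropped
            continue
        cleaned.append(NICKNAMES.get(word, word))
    return " ".join(cleaned)
```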
Chinese or not
Chinese and non-Chinese names are very different
No middle name in Chinese. “Chih Lin” and “Chih J. Lin” are likely different
Some Chinese last names like “Wang” are too common. Also, “林” and “藺” are both romanized to “Lin”
Chinese or not (Cont’d)
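[Figure: a dictionary of Chinese name words (wu, lin, juan, zhuang, chin, tung) is checked against the example names “C. J. Chris Peter Lin”, “C J L”, and “C. J. Lin”.]
Using common Chinese last names and words as a dictionary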
Chinese or not (Cont’d)
Check if the name contains words in our dictionary
Full word: a word without “.” and longer than one character
Examples:
  No full word → Non-Chinese
    e.g., C J L
  Only one full word and it is in the Chinese dictionary → Chinese
    e.g., C. J. Lin
  More than one full word not in the Chinese dictionary → Non-Chinese
    e.g., C. J. Chris Peter Lin
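The three example rules can be sketched as follows. The dictionary here is a toy stand-in, and the fallback for cases the slide does not cover (several full words with at most one outside the dictionary) is an assumption rather than the team's rule.

```python
from typing import List

# Toy dictionary; the real one holds common Chinese last names and words.
CHINESE_WORDS = {"lin", "wu", "juan", "zhuang", "chin", "tung", "chih", "jen"}


def full_words(name: str) -> List[str]:
    """Words without '.' that are longer than one character."""
    return [w for w in name.lower().split() if "." not in w and len(w) > 1]


def is_chinese(name: str) -> bool:
    """Classify a name as Chinese (True) or non-Chinese (False)."""
    full = full_words(name)
    if not full:
        return False                     # rule 1: no full word
    if len(full) == 1:
        return full[0] in CHINESE_WORDS  # rule 2: one full word, looked up
    unknown = [w for w in full if w not in CHINESE_WORDS]
    if len(unknown) > 1:
        return False                     # rule 3: several words outside the dictionary
    return not unknown                   # assumption: remaining cases need all words known
```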
Selection
Find candidates of duplicates to reduce the quadratic complexity of later comparisons to linear
Links indicate candidates
Each author generates several keys. “Chih Jen Lin” has:
“Chih”, “Jen”, “Lin”, “Chih Jen”, “Jen Lin”, “Chih Lin”, “Chih Jen Lin”
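The seven keys listed for “Chih Jen Lin” are exactly the non-empty, order-preserving subsets of its words, so one way to generate them is with `itertools.combinations`. Whether the released code builds keys this way is an assumption; `generate_keys` is a hypothetical name.

```python
from itertools import combinations
from typing import List


def generate_keys(name: str) -> List[str]:
    """All non-empty word subsets of a name, keeping the original word order."""
    words = name.split()
    return [" ".join(subset)
            for size in range(1, len(words) + 1)
            for subset in combinations(words, size)]


keys = generate_keys("Chih Jen Lin")
```

A name with k words yields 2^k - 1 keys, which stays small because names rarely have more than a handful of words.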
Selection − Chih Jen Lin’s candidates
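One author is a candidate of another if the two share a key. Common keys are ignored.
[Figure: links between Chih Jen Lin and the other authors indicate shared keys. Authors shown: Chih Jen Lin, Shou De Lin, Hsuan Tien Lin, Chien Chih Wang, Peng Jen Chen, Yong Zhaung, Felix Wu, Hsiao Yu Tung, Lin Chih Jen, C. J. Lin, Yu Chin Juan, Wei Sheng Chin]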
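Candidate selection can be sketched with an inverted index from keys to authors: each author is only compared with authors filed under a shared key, and keys shared by too many authors are skipped. The threshold `max_key_count` is a hypothetical parameter; the slides do not say how a common key is defined.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def generate_keys(name: str) -> Set[str]:
    """Lower-cased, order-preserving word subsets of a name."""
    words = name.lower().split()
    return {" ".join(subset)
            for size in range(1, len(words) + 1)
            for subset in combinations(words, size)}


def select_candidates(names: Iterable[str], max_key_count: int = 3) -> Dict[str, Set[str]]:
    """Map each name to the other names that share a non-common key with it."""
    names = list(names)
    index: Dict[str, Set[str]] = defaultdict(set)
    for name in names:
        for key in generate_keys(name):
            index[key].add(name)
    candidates: Dict[str, Set[str]] = {name: set() for name in names}
    for group in index.values():
        if len(group) > max_key_count:  # common key (e.g. a frequent last name): ignore
            continue
        for name in group:
            candidates[name] |= group - {name}
    return candidates


names = ["Chih Jen Lin", "Shou De Lin", "Hsuan Tien Lin", "Lin Chih Jen",
         "Chien Chih Wang", "Peng Jen Chen", "Felix Wu"]
candidates = select_candidates(names)
```

In this toy run the key “lin” is shared by four names and is ignored, so “Shou De Lin” gets no candidates, while “Chih Jen Lin” is linked through “chih” and “jen”.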
Identification
Find duplicates from candidates
[Figure: candidate links marked “?” are resolved in two steps: matching functions, then dry-run.]
Identification − Matching Functions
13 matching functions
1. Two authors have the same words
2. (Non-Chinese only) Only one author has middle name and their last names differ in the last two characters
3. . . .
Examples:

| Two names | Matching function |
| --- | --- |
| Chih Jen Lin, Lin Chih Jen | Fun. 1 |
| Michael I. Jordan, Michael Jordan | Fun. 2 |
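As an illustration of how matching functions might be organized, the sketch below implements only function 1 (same words, in any order) and applies a list of such functions to a candidate pair. The other twelve functions are not reproduced, and the names used here are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def same_words(a: str, b: str) -> bool:
    """Matching function 1: both names consist of the same words."""
    return Counter(a.lower().split()) == Counter(b.lower().split())


# The slides mention 13 matching functions; only the first is sketched here.
MATCHING_FUNCTIONS: List[Callable[[str, str], bool]] = [same_words]


def is_duplicate(a: str, b: str) -> bool:
    """A candidate pair is a duplicate if any matching function accepts it."""
    return any(match(a, b) for match in MATCHING_FUNCTIONS)
```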
Identification − Dry-run
Make corrections, as matching functions may wrongly identify duplicates
Check if two names are “loosely identical”
Examples:

| Potential duplicates | Pass |
| --- | --- |
| C. J. Lin, Chih Jen Lin, Chih Lin, Chen Ju Lin | no |
| C. J. Lin, Chih Jen Lin, Chih J. Lin | yes |
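The slides do not define “loosely identical”, so the sketch below uses one assumed definition: two words are compatible if they are equal or one is the initial of the other, and two names are loosely identical if the words of the shorter name can be matched, in order, to compatible words of the longer one. A group passes the dry-run only if every pair in it is loosely identical. This is an illustration that agrees with the two examples above, not the team's actual check.

```python
from itertools import combinations
from typing import List, Sequence


def _words(name: str) -> List[str]:
    """Lower-cased words with trailing periods removed."""
    return [w.rstrip(".") for w in name.lower().split()]


def _compatible(x: str, y: str) -> bool:
    """Equal words, or one word is the single-letter initial of the other."""
    if x == y:
        return True
    short, long_ = sorted((x, y), key=len)
    return len(short) == 1 and long_.startswith(short)


def loosely_identical(a: str, b: str) -> bool:
    """Assumed definition: the shorter name matches a subsequence of the longer."""
    shorter, longer = sorted((_words(a), _words(b)), key=len)
    position = 0
    for word in shorter:
        while position < len(longer) and not _compatible(word, longer[position]):
            position += 1
        if position == len(longer):
            return False
        position += 1
    return True


def dry_run(group: Sequence[str]) -> bool:
    """A group of potential duplicates passes if all pairs are loosely identical."""
    return all(loosely_identical(a, b) for a, b in combinations(group, 2))
```

Under this definition the first example group fails because “Chih Jen Lin” and “Chen Ju Lin” conflict, while the second group passes.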
Track 2: author disambiguation Ensemble and typo handling
Ensemble
| Method | Author identifier | Duplicates |
| --- | --- | --- |
| Implementation 1 | 10 | 10,11 |
| Implementation 2 | 10 | 10,11,12,13,14 |
| Ensembled | 10 | 10,11,12,14 |

Implementation 1 is considered as the major prediction
{12, 13, 14} become candidate additional duplicates
Check if each of (10, 12), (10, 13), (10, 14) has similar affiliations or fields
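The ensemble rule can be sketched as follows: the duplicates from Implementation 1 are kept as they are, and each extra identifier proposed only by Implementation 2 is accepted if a similarity check passes. The `similar` callback stands in for the affiliation-or-field comparison, whose details are not given on the slide; all names are hypothetical.

```python
from typing import Callable, Dict, Set

Groups = Dict[int, Set[int]]


def ensemble(primary: Groups, secondary: Groups,
             similar: Callable[[int, int], bool]) -> Groups:
    """Extend the primary duplicates with secondary ones that pass `similar`."""
    result: Groups = {}
    for author, duplicates in primary.items():
        extra = secondary.get(author, set()) - duplicates
        accepted = {other for other in extra if similar(author, other)}
        result[author] = duplicates | accepted
    return result


# Example from the slide: 12 and 14 pass the check, 13 does not.
implementation_1 = {10: {10, 11}}
implementation_2 = {10: {10, 11, 12, 13, 14}}
passes_check = {(10, 12), (10, 14)}  # stand-in for "similar affiliations or fields"
final = ensemble(implementation_1, implementation_2,
                 lambda a, b: (a, b) in passes_check)
```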
Track 2: author disambiguation Analysis
Analysis
We conducted some analyses after the competition. We thank Kaggle for re-opening the submission site

| Method | Public | Private |
| --- | --- | --- |
| Final | 0.99195 | 0.99202 |
| Implementation 1 | 0.99186 | 0.99198 |
| Without Chinese-or-not | 0.99109 | 0.99125 |
| Without dry-run | 0.99097 | 0.99112 |
| Without both | 0.98891 | 0.98934 |

Splitting Chinese/non-Chinese and the dry-run function in the identification stage are useful
Conclusions
Our code is available at
github.com/kdd-cup-2013-ntu/track1
github.com/kdd-cup-2013-ntu/track2
Papers are at www.csie.ntu.edu.tw/~cjlin/papers/kddcup2013/kddcup2013track1.pdf and kddcup2013track2.pdf
We thank the organizers and the support from National Taiwan University