Name Conflict Resolution
Introduction
• System to automate the company registration process
• Compares company names using string matching algorithms
• Names are ranked according to their similarity percentage
• A name is rejected if the similarity score is 100%
8/30/2013
Objectives
• To develop a system that resolves naming conflicts
• To find existing names similar to the name proposed by the user
• To rank the matching existing names against the proposed name
Methodology
Building Base Dictionary
Keyword Generation
Finding Possible Matches
Finding Duplicates
Finding Ranks
Downcasting
• Act of converting uppercase letters in the input to lowercase
“Centre Nepal Metals Industries” → “centre nepal metals industries”
Transformation
• Conversion of British English words to American English words
“centre nepal metals industries” → “center nepal metals industries”
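The downcasting and transformation steps can be sketched as below. This is a minimal illustration, not the system's actual code: the names `BRITISH_TO_AMERICAN`, `downcast` and `transform` are hypothetical, and the three word pairs are an assumed sample standing in for the real word list.

```python
# Hypothetical sample of British-to-American word pairs; the real list
# used by the system is not reproduced here.
BRITISH_TO_AMERICAN = {
    "centre": "center",
    "colour": "color",
    "jewellery": "jewelry",
}


def downcast(name: str) -> str:
    """Convert every uppercase letter in the input to lowercase."""
    return name.lower()


def transform(name: str) -> str:
    """Replace each British spelling with its American equivalent, word by word."""
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in name.split())


normalized = transform(downcast("Centre Nepal Metals Industries"))
print(normalized)  # center nepal metals industries
```

Downcasting runs first so that the lookup table only needs lowercase keys.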
Stopword Removal
• Process of removing predefined stopwords from the string literal
“center nepal metals industries” → “center nepal metals”
Tokenization
• Process of reducing a large string to a set of tokens
“center nepal metals” → center, nepal, metals
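Stopword removal followed by tokenization might look like the sketch below. The stopword set is an assumption (only "industries" appears as a stopword in the example), and the function names are hypothetical. Splitting on whitespace and hyphens follows the rule stated under Limitations.

```python
import re

# Assumed stopword list; only "industries" is confirmed by the slide example.
STOPWORDS = {"industries", "company", "pvt.", "ltd."}


def remove_stopwords(name: str) -> str:
    """Drop every whitespace-separated word that appears in STOPWORDS."""
    return " ".join(word for word in name.split() if word not in STOPWORDS)


def tokenize(name: str) -> list[str]:
    """Split on runs of whitespace and hyphens, discarding empty pieces."""
    return [token for token in re.split(r"[\s-]+", name) if token]


cleaned = remove_stopwords("center nepal metals industries")
print(cleaned)            # center nepal metals
print(tokenize(cleaned))  # ['center', 'nepal', 'metals']
```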
Stemming
• Process of reducing a word to a root, or simpler form
Tokens     Stemmed Tokens
center     center
nepal      nepal
metals     metal
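The slides do not say which stemmer is used. As a stand-in, the sketch below applies two naive plural-stripping rules. The function `naive_stem` is hypothetical and far simpler than a real stemmer such as Porter's.

```python
def naive_stem(token: str) -> str:
    """Strip a plural ending using two simple suffix rules (not a full stemmer)."""
    if len(token) > 4 and token.endswith("ies"):
        return token[:-3] + "y"  # industries -> industry
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]        # metals -> metal
    return token


tokens = ["center", "nepal", "metals"]
print([naive_stem(t) for t in tokens])  # ['center', 'nepal', 'metal']
```

A rule of this kind would also shorten a romanized Nepali word such as "bikas" to "bika", which may be the sort of error noted under Limitations.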
Translation
• Conversion of the meaning of a source-language text by means of an equivalent target-language text
Stemmed Tokens     Translated Tokens
metal              धातु
center             केन्द्र
nepal              नेपाल
Transliteration
• Conversion of a text from one script to another
Translated Tokens     Transliterated Tokens
धातु                  dhatu
केन्द्र                 kendra
नेपाल                 nepal
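One way to combine translation and transliteration into a final token list is sketched below. Everything here is an assumption for illustration: the two lookup tables are tiny hypothetical samples, transliteration is done by whole-word lookup rather than character mapping, and the final list is assumed to hold both the English tokens and their romanized Nepali forms.

```python
# Hypothetical sample entries standing in for the English-Nepali dictionary.
ENGLISH_TO_NEPALI = {"metal": "धातु", "center": "केन्द्र", "nepal": "नेपाल"}
# Assumed lookup-based transliteration; a real system would map characters.
NEPALI_TO_ROMAN = {"धातु": "dhatu", "केन्द्र": "kendra", "नेपाल": "nepal"}


def expand_tokens(stemmed):
    """Return each stemmed token followed by its romanized Nepali form, without duplicates."""
    final = []
    for token in stemmed:
        candidates = [token]
        nepali = ENGLISH_TO_NEPALI.get(token)
        if nepali is not None and nepali in NEPALI_TO_ROMAN:
            candidates.append(NEPALI_TO_ROMAN[nepali])
        for candidate in candidates:
            if candidate not in final:
                final.append(candidate)
    return final


print(expand_tokens(["metal", "center", "nepal"]))
# ['metal', 'dhatu', 'center', 'kendra', 'nepal']
```

Tokens missing from the dictionary pass through unchanged, so an incomplete dictionary reduces recall without breaking the pipeline.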
Database Query using Final Token List
• nepal medical centre pvt. ltd.
• nepal dhatu company
• metal nepal pvt. ltd.
• enter nepal
• nepal metal industries
• dhatu sankalan kendra
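The candidate lookup could be done with a query like the one sketched below, using Python's built-in `sqlite3` module. The `company` table, its `name` column, the `find_candidates` helper and the substring (`LIKE`) matching rule are all assumptions. The slides do not describe the actual schema or query.

```python
import sqlite3

# Hypothetical schema: a single table `company` with a `name` column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (name TEXT)")
conn.executemany(
    "INSERT INTO company (name) VALUES (?)",
    [
        ("nepal medical centre pvt. ltd.",),
        ("nepal dhatu company",),
        ("dhatu sankalan kendra",),
        ("himalayan tea traders",),  # made-up row that should not match
    ],
)


def find_candidates(conn, tokens):
    """Return names that contain at least one token as a substring."""
    if not tokens:
        return []
    where = " OR ".join("name LIKE ?" for _ in tokens)
    params = [f"%{token}%" for token in tokens]
    rows = conn.execute(
        f"SELECT name FROM company WHERE {where} ORDER BY name", params
    )
    return [row[0] for row in rows]


print(find_candidates(conn, ["metal", "dhatu", "center", "kendra", "nepal"]))
```

Only placeholders are interpolated into the SQL string. The token values are passed as bound parameters, so a proposed name cannot inject SQL.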
Permutation
• kendra nepal dhatu
• kendra nepal metal
• center nepal dhatu
• center nepal metal
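The four variants listed above keep the word order and swap each word for its English or Nepali alternative, which is a Cartesian product rather than a reordering. A minimal sketch, assuming the alternatives are grouped per position (the `alternatives` structure is hypothetical):

```python
from itertools import product

# One list of alternatives per word position, assumed to pair each English
# token with its transliterated Nepali form.
alternatives = [["kendra", "center"], ["nepal"], ["dhatu", "metal"]]

variants = [" ".join(combo) for combo in product(*alternatives)]
for variant in variants:
    print(variant)
# kendra nepal dhatu
# kendra nepal metal
# center nepal dhatu
# center nepal metal
```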
Comparison
3 steps
1. Levenshtein Distance Calculation
2. Optimized Maximal Similarity using Hungarian Algorithm
3. Sørensen Index to Calculate Similarity %
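The three comparison steps can be sketched as follows. Several details are assumptions, since the slides do not give them: the Levenshtein distance is normalized by the longer token's length, the optimal token pairing is found by brute force over permutations as a small stand-in for the Hungarian algorithm (both reach the same optimum), and the Sørensen index is applied in a "soft" form, 2 × matched similarity / (|A| + |B|). All function names are hypothetical.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b using a rolling one-row DP table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def token_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / longer length, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_assignment_score(xs, ys) -> float:
    """Maximum total similarity over one-to-one token pairings.

    Brute force over permutations; the Hungarian algorithm finds the same
    optimum in polynomial time.
    """
    short, long_ = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    return max(
        sum(token_similarity(s, l) for s, l in zip(short, perm))
        for perm in permutations(long_, len(short))
    )


def similarity_percent(name_a: str, name_b: str) -> float:
    """Soft Sørensen index: 2 * matched similarity / (|A| + |B|), as a percentage."""
    xs, ys = name_a.split(), name_b.split()
    if not xs or not ys:
        return 0.0
    return 100.0 * 2.0 * best_assignment_score(xs, ys) / (len(xs) + len(ys))


print(similarity_percent("center nepal metal", "metal nepal center"))  # 100.0
print(round(similarity_percent("center nepal metal", "nepal metal industries"), 1))
```

Because tokens are paired by best match rather than by position, two names with the same words in a different order score 100% in this sketch and would be rejected.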
Dataset
111160 Registered Company Names
106299 Unique Reg. ID / Company Names
16326 Words in English-Nepali Dictionary
144 British-American Words for Transformation
Result Analysis
Number of Tokens vs. Computation Time
Number of Tokens     Time to compute (sec) in i5 CPU     Time to compute (sec) in Dual Core CPU
1 Token              1.664                               8.959
2 Tokens             2.204                               13.315
3 Tokens             11.952                              39.994
4 Tokens             37.743                              107.498
Limitations
• Stemming sometimes produces incorrect results if the input contains a Nepali word
• The English-Nepali dictionary does not contain enough words
• Tokenization is based on whitespace and hyphens only
• Comparison is not phonetic-based
Future Enhancements
• Use of a taxonomy for classifying the tokens
• Use of weighting measures to assign weights to tokens
• Implementation of faster searching methods
• Integration of phonetic-based similarity measures