

A Multistage Hierarchical Method for Author Name Disambiguation

Tasleem Arif (a), Rashid Ali (b), M Asger (c)

(a) Department of Information Technology, BGSB University Rajouri (Jammu and Kashmir)-185234 India, Contact: [email protected]

(b) Department of Computer Engineering, AMU Aligarh, India.

(c) School of Mathematical Sciences and Engineering, BGSB University Rajouri (Jammu and Kashmir), India.

Author name ambiguity has long been studied as a problem that affects literature management and leads to incorrect attribution of publications and credit to authors. Most existing solutions suffer from either the split citation problem or the mixed citation problem. In recent years, there has been a tendency to use and store additional attributes of a publication to enrich its metadata. Almost all publishing houses now record the e-mail ID of the corresponding author. In addition to traditional metadata such as author(s), title, venue and year, other attributes such as the e-mail ID and affiliation of the author(s) are available in publication headers and, in some cases, in the metadata as well. For example, the ACM Digital Library stores the affiliations of authors as part of its metadata. In this paper, we propose a method that creates clusters in stages, with each stage applying a different attribute to the clusters created in the previous stage. The purpose is to explore the effect of these additional attributes in resolving the author name ambiguity problem. Experimental results on publications obtained from DBLP show that our method obtained significant improvements over the existing state-of-the-art methods, with average precision, recall and F-score of 93.93, 91.57 and 92.33 percent respectively and an average execution time of 0.07 seconds per publication.

Keywords: Hierarchical Clustering, Metadata, Name Disambiguation.
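The staged idea summarized in the abstract, where each stage applies a different attribute to the clusters produced by the previous stage, can be illustrated with a minimal Python sketch. The attribute names (`email`, `affiliation`, `coauthors`), their order, the exact-match merge rule and the helper names `merge_stage` and `multistage_cluster` are assumptions made for illustration only; the stages and similarity measures actually used by the method may differ.

```python
from typing import Dict, List, Set

# A publication is modelled as: attribute name -> set of normalized values.
# This record layout is a hypothetical choice for the sketch.
Publication = Dict[str, Set[str]]


def merge_stage(clusters: List[Set[int]], pubs: List[Publication],
                attr: str) -> List[Set[int]]:
    """Merge clusters that share at least one value of `attr` (union-find)."""
    parent = list(range(len(clusters)))

    def find(i: int) -> int:
        # Find the representative cluster index, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen: Dict[str, int] = {}  # attribute value -> first cluster index
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            for value in pubs[p].get(attr, set()):
                if value in first_seen:
                    parent[find(ci)] = find(first_seen[value])
                else:
                    first_seen[value] = ci

    merged: Dict[int, Set[int]] = {}
    for ci, cluster in enumerate(clusters):
        merged.setdefault(find(ci), set()).update(cluster)
    return list(merged.values())


def multistage_cluster(pubs: List[Publication],
                       stages: List[str]) -> List[Set[int]]:
    """Start from singleton clusters and refine them one attribute at a time."""
    clusters = [{i} for i in range(len(pubs))]
    for attr in stages:
        clusters = merge_stage(clusters, pubs, attr)
    return clusters


# Four made-up publications that all carry the same ambiguous author name.
example_pubs: List[Publication] = [
    {"email": {"a@x.org"}, "affiliation": {"U1"}, "coauthors": {"B"}},
    {"email": {"a@x.org"}, "affiliation": set(), "coauthors": {"C"}},
    {"email": set(), "affiliation": {"U2"}, "coauthors": {"C"}},
    {"email": {"z@y.org"}, "affiliation": {"U3"}, "coauthors": {"D"}},
]
example_clusters = multistage_cluster(
    example_pubs, ["email", "affiliation", "coauthors"])
print(sorted(sorted(c) for c in example_clusters))
```

In this toy input the e-mail stage joins publications 0 and 1, the affiliation stage changes nothing, and the co-author stage attaches publication 2 through the shared co-author. Merging on any shared value is deliberately permissive; a stricter rule per stage would trade split citations against mixed citations.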

1. INTRODUCTION

The advent of the Internet has changed the way we live, communicate and maintain relationships. Research collaborations are no exception: they have benefited a lot from the advances in Information and Communication Technology (ICT) [1], and web technologies have been driving developments at the research front [2].

The rise in the number of publications produced with each passing year has made it quite difficult for literature management services to properly index the publications of authors who share a name. This can be attributed to the fact that our parents had a limited number of options to choose our names from. In America alone 114 million males share 300 common names [3], whereas in China the problem is more severe, as 1.1 billion of its population shares just 129 surnames [4]. Name ambiguity gives rise to two subproblems in indexing, split citation and mixed citation: split citation means that we have to manage publications of non-existent authors [3], and mixed citation means that we are mixing two different real authors into one author. However, the increase in the percentage of joint publications can be treated as a relief for solutions to this problem, as information about more authors is present in a publication's metadata, which enriches that metadata.
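The two error types defined above can be made concrete with a small Python sketch. Given a predicted cluster label and a true author identity for each publication, it reports clusters that mix several real authors (mixed citation) and real authors whose work is spread over several clusters (split citation). The function name `citation_errors` and the toy labels are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from typing import Dict, List


def citation_errors(predicted: List[str], truth: List[str]) -> Dict[str, List[str]]:
    """Return mixed clusters and split authors for aligned label lists."""
    authors_in_cluster = defaultdict(set)   # cluster id -> true authors in it
    clusters_of_author = defaultdict(set)   # true author -> clusters holding them
    for cluster_id, author_id in zip(predicted, truth):
        authors_in_cluster[cluster_id].add(author_id)
        clusters_of_author[author_id].add(cluster_id)

    # Mixed citation: one cluster contains more than one real author.
    mixed = sorted(c for c, a in authors_in_cluster.items() if len(a) > 1)
    # Split citation: one real author appears in more than one cluster.
    split = sorted(a for a, c in clusters_of_author.items() if len(c) > 1)
    return {"mixed": mixed, "split": split}


# Toy example: five publications, three real authors (A, B, C).
predicted_labels = ["c1", "c1", "c2", "c3", "c3"]
true_authors = ["A", "A", "A", "B", "C"]
errors = citation_errors(predicted_labels, true_authors)
print(errors)
```

Here author A is split across clusters `c1` and `c2`, while cluster `c3` mixes authors B and C.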

As pointed out by Liu [5], low availability of some key fields in the metadata of a citation record has resulted in poor performance of author disambiguation mechanisms. As such the need for extra information is being


International Journal of Information Processing, 9(3), 92-105, 2015. ISSN: 0973-8215. IK International Publishing House Pvt. Ltd., New Delhi, India


very well. Experiments and comparisons with other name disambiguation methods indicated substantial improvement over HAC and CONSTRAINT and considerable improvement over Fixed-K. Although the performance improvement over Fixed-K is only 1.35 percent, it is significant because further gains become much harder to achieve once prediction performance exceeds 90 percent.

Publication venue title has been used for author name disambiguation in previous studies as well as in this study, but we are of the view that, with the ever-increasing number of publications, it may not serve as a good feature for this purpose, as more and more similarly named authors may publish in the same or relatively similar venues. The low recall and F1 score in the case of Gang Wu can be primarily attributed to more than one Gang Wu publishing in similar publication venues.

An important observation made during this study was that, in the majority of cases, one author in a group of similarly named authors accounts for around forty percent or more of the publications. Another observation was that the greater the number of publications, the better the disambiguation results.

As part of future work we intend to exclude the venue information, include title information, and examine the effects of this change. In addition, we intend to make use of soft computing techniques to deal with the split citation problem so that the precision may be increased further.

REFERENCES

1. Chang H W and Huang M H. Cohesive Subgroups in the International Collaboration Network in Astronomy and Astrophysics, Scientometrics, 101:1587–1607, 2014.

2. Zhao D and Strotmann A. The Knowledge Base and Research Front of Information Science 2006–2010: An Author Cocitation and Bibliographic Coupling Analysis, Journal of the Association for Information Science and Technology, 65(5):995–1006, 2014.

3. Liu Y, Li W, Huang Z and Fang Q. A Fast Method Based on Multiple Clustering for Name Disambiguation in Bibliographic Citations, Journal of the Association for Information Science and Technology, 66(3):634–644, 2014.

4. Qiu J. Scientific Publishing: Identity Crisis, Nature, 451:766–767, 2008.

5. Liu W, Dogan R I, Kim S, Comeau D C, Kim W, Yeganova L, Lu Z and Wilbur W J. Author Name Disambiguation for PubMed, Journal of the Association for Information Science and Technology, 65:765–781, 2014.

6. Arif T, Ali R and Asger M. Author Name Disambiguation using Vector Space Model and Hybrid Similarity Measures, In Proceedings of 7th International Conference on Contemporary Computing, Noida, India: IEEE, pages 135–140, 2014.

7. Ferreira A A, Gonçalves M A and Laender A H F. A Brief Survey of Automatic Methods for Author Name Disambiguation, ACM SIGMOD Record, pages 15–26, 2012.

8. Torvik V I, Weeber M, Swanson D R and Smalheiser N R. A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation: Research Articles, Journal of the American Society for Information Science and Technology, 56(2):140–158, 2005.

9. Smalheiser N R and Torvik V I. Author Name Disambiguation, Annual Review of Information Science and Technology, 43(1):1–43, 2009.

10. Arif T, Asger M and Ali R. Author Name Disambiguation using Two Stage Clustering, INROADS (Special Issue), ISSN: 2277-4904, 3:340–345, 2014.

11. Tang J, Fong A C M, Wang B and Zhang J. A Unified Probabilistic Framework for Name Disambiguation in Digital Library, IEEE Transactions on Knowledge and Data Engineering, 24:975–987, 2012.

12. D'Angelo C A, Giuffrida C and Abramo G. A Heuristic Approach to Author Name Disambiguation in Bibliometrics Databases for Large-scale Research Assessments, Journal of the American Society for Information Science and Technology, 62:257–269, 2011.

13. Kanani P, McCallum A and Pal C. Improving Author Coreference by Resource-bounded Information Gathering from the Web, In Proceedings of 20th International Joint Conference on Artificial Intelligence-IJCAI, pages 429–434, 2007.

14. Kang I S, Na S H, Lee S, Jung H, Kim P, Sung W K and Lee J H. On Co-authorship for Author Disambiguation, Information Processing and Management, 45:84–97, 2009.

15. Yang K H, Peng H T, Jiang J Y, Lee H M and Ho J M. Author Name Disambiguation for Citations using Topic and Web Correlation, In B Christensen-Dalsgaard, D Castelli, B A Jurik and J Lippincott, Editors, Research and Advanced Technology for Digital Libraries, pages 185–196, 2008.

16. Pereira D A, Ribeiro-Neto B, Ziviani N, Laender A H F, Gonçalves M A and Ferreira A A. Using Web Information for Author Name Disambiguation, In Proceedings of the Ninth ACM/IEEE-CS Joint Conference on Digital Libraries, ACM, pages 49–58, 2009.

17. Wang X, Tang J, Cheng H and Yu P S. ADANA: Active Name Disambiguation, In Proceedings of 11th IEEE International Conference on Data Mining, pages 794–803, 2011.

18. Aswani N, Bontcheva K and Cunningham H. Mining Information for Instance Unification, In Proceedings of Fifth International Semantic Web Conference, Athens, GA, USA, pages 329–342, 2006.

19. Lin Q, Wang B, Du Y, Wang X, Li Y and Chen S. Disambiguating Authors by Pairwise Classification, In CSWS, Beijing, China, pages 668–677, 2010.

20. Imran M, Gillani S Z H and Marchese M. A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries, Magazine of Digital Library Research, 2013.

21. Culotta A, Kanani P, Hall R, Wick M and McCallum A. Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function, In Proceedings of the AAAI Sixth International Workshop on Information Integration on the Web, 2007.

22. Mann G S and Yarowsky D. Unsupervised Personal Name Disambiguation, In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, 4:33–40, 2003.

23. Song Y, Huang J, Councill I G, Li J and Giles C L. Efficient Topic-based Unsupervised Name Disambiguation, In Proceedings of the Seventh ACM/IEEE-CS Joint Conference on Digital Libraries, Vancouver, BC, Canada: ACM, pages 342–351, 2007.

24. Wang J, Berzins K, Hicks D, Melkers J, Xiao F and Pinheiro D. A Boosted Trees Method for Name Disambiguation, Scientometrics, 93(2):391–411, 2012.

25. Jain A K, Murty M N and Flynn P J. Data Clustering: A Review, ACM Computing Surveys (CSUR), pages 264–323, 1999.

26. On B W, Lee D, Kang J and Mitra P. Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework, In Proceedings of ACM Conference JCDL, pages 344–353, 2005.

27. Cormen T H, Leiserson C E, Rivest R L and Stein C. Introduction to Algorithms, Cambridge: MIT Press, 2001.

28. Aini A and Salehipour A. Speeding up the Floyd-Warshall Algorithm for the Cycled Shortest Path Problem, Applied Mathematics Letters, 25:1–5, 2012.

29. Jonnalagadda S and Topham P. NEMO: Extraction and Normalization of Organization Names from PubMed Affiliation Strings, Journal of Biomedical Discovery and Collaboration, 5:50–75, 2010.

30. French J, Powell A, Schulman E and Pfaltz J. Automating the Construction of Authority Files in Digital Libraries: A Case Study, Research and Advanced Technology for Digital Libraries, 1324:55–71, 1997.

31. French J C, Powell A and Schulman E. Using Clustering Strategies for Creating Authority Files, Journal of the American Society for Information Science, 51:774–786, 2000.

32. Cota R G, Ferreira A A, Nascimento C, Gonçalves M A and Laender A H F. An Unsupervised Heuristic-based Hierarchical Method for Name Disambiguation in Bibliographic Citations, Journal of the American Society for Information Science and Technology, 61(9):1853–1870, 2010.

33. Han H, Zha H and Giles C L. Name Disambiguation in Author Citations using a K-way Spectral Clustering Method, In Proceedings of the Fifth ACM/IEEE-CS Joint Conference on Digital Libraries-JCDL05, pages 334–343, 2005.

34. Tan Y F, Kan M and Lee D. Search Engine Driven Author Disambiguation, In Proceedings of ACM/IEEE Joint Conference on Digital Libraries, pages 314–315, 2006.

35. Zhang D, Tang J, Li J and Wang K. A Constraint based Probabilistic Framework for Name Disambiguation, In Proceedings of ACM Conference on Information and Knowledge Management, pages 1019–1022, 2007.

36. Accomazzi A, Eichhorn G, Kurtz M J, Grant C S and Murray S S. The ADS Article Service Data Holdings and Access Methods, In G Hunt and H Payne, Editors, Astronomical Data Analysis Software and Systems VI, Conference Series, 125:357–360, 1997.

37. Torvik V I and Smalheiser N R. Author Name Disambiguation in MEDLINE, ACM Transactions on Knowledge Discovery from Data, 3(3):1–29, 2009.

Tasleem Arif obtained his MCA from University of Jammu in 2004. He obtained his Ph.D in Computer Science from Baba Ghulam Shah Badshah University Rajouri in 2015. His Ph.D work was on Academic Social Network Extraction from online sources. He has authored about 25 papers in various international journals and international conference proceedings. Currently he is working as Sr. Assistant Professor in the Post Graduate Department of Information Technology, Baba Ghulam Shah Badshah University Rajouri, Jammu and Kashmir. His research interests include Web Mining, Soft Computing, Information Retrieval, Data Mining and Cryptography and Network Security.

Rashid Ali obtained his BTech and MTech from AMU Aligarh, India in 1999 and 2001 respectively. He obtained his Ph.D in Computer Engineering in February 2010 from AMU Aligarh. His Ph.D work was on Performance Evaluation of Web Search Engines. He has authored about 75 papers in various international journals and international conference proceedings. He has presented papers in many international conferences and has also chaired sessions in a few international conferences. He has reviewed articles for some reputed international journals and international conference proceedings. He has supervised 15 MTech dissertations. Currently, he is supervising four Ph.D candidates. His research interests include Web Searching, Web Mining, Soft Computing (Rough Sets, Artificial Neural Networks, Fuzzy Logic, etc.) and Image Retrieval Techniques.

Mohammed Asger obtained his M.Sc and Ph.D from Jamia Millia Islamia, New Delhi. He has vast teaching and administrative experience and has been instrumental in setting up a few engineering colleges in the NCR region. He has authored about 50 papers in various international journals and international conference proceedings. Currently he is working as Dean, School of Mathematical Sciences and Engineering, Dean of Students, and Principal, University College of Engineering and Technology, Baba Ghulam Shah Badshah University Rajouri, Jammu and Kashmir. His research interests include Soft Computing, Quantum Computing, etc.