Extracting and sharing data citations from Google Scholar for collaborative exploitation


Upload: sfausto

Post on 24-Jul-2015


Extracting and sharing data citations from Google Scholar for collaborative exploitation

Sibele Fausto*, Tiago Rodrigo Marçal Murakami**

Several studies have drawn attention to the poor indexing of scientific journals in the Social Sciences, Applied Social Sciences and Humanities in large commercial databases (Frandsen & Nicolaisen, 2008; Neuhaus & Daniel, 2007). This gap is even more acute for journals in these areas that are published in languages other than English and in developing countries (Archambault & Larivière, 2010), which makes it difficult to investigate the importance and impact of these journals.

This situation is changing as a result of the new opportunities provided by the emergence of Open Access (OA) and of tools such as the search engine Google Scholar (GS) and data-processing software such as Publish or Perish (PoP) (Harzing, 2007). The increasing shift of Social Sciences and Humanities (SSH) journals to the Web, including those of Library and Information Science (LIS), is making them more widely available. This allows detailed searches to be conducted through GS and the citations of articles to be retrieved, so that GS can be regarded as an alternative to traditional databases in bibliometric studies of the impact of scientific production published in these areas. In addition, GS is a free-access source, in contrast with expensive commercial databases. It also has broad coverage of other kinds of SSH material, such as books, book chapters and conference materials, which are not normally covered by traditional databases, and it is therefore able to make a comprehensive retrieval of open access journals, including journals in languages other than English, some of which come from emerging countries.

However, this apparently favorable context for bibliometric research in these areas still faces challenges owing to questions about the reliability of GS as a data source (Jacsó, 2010). This criticism of GS reinforces the need for more research into the tool, in order to find a rational basis for understanding the full potential of Google Scholar for bibliometric studies, especially in areas not covered by commercial databases (Caregnato, 2011).

This situation motivated our attempt to share citation data from Brazilian LIS journals as a pilot scheme, so that the Brazilian scientometrics community can investigate Google Scholar further, with the aim of encouraging its greater use for bibliometric purposes.

This pilot scheme adopted the following procedures:

a. Conducting a survey of LIS journal titles by compiling lists of those that exist on the web;

b. Carrying out searches using PoP software for Windows, with the journal title as a parameter, and confirming the official titles and abbreviations, in the period from January 28, 2014 to March 02, 2014;

c. Displaying the results in Google Drive spreadsheets, one for each retrieved journal title;

d. Creating a spreadsheet that brings together all the spreadsheets with the articles that had at least one citation;

e. Carrying out statistical tests using Excel and Tableau Public.
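Steps c and d above can be sketched in code. The following is a minimal illustration, not the procedure actually used in the pilot (which relied on Google Drive spreadsheets): it assumes each per-journal result has been exported as CSV text with columns named `Cites`, `Title` and `Year`. Those column names, the function names and the sample rows are all assumptions made for the example.

```python
import csv
import io


def read_journal_csv(journal, csv_text):
    """Parse one per-journal export and tag every row with the journal title.

    The column names (Cites, Title, Year) are assumed for this sketch.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = row["Year"].strip()
        rows.append({
            "journal": journal,
            "title": row["Title"].strip(),
            "year": int(year) if year.isdigit() else None,
            "cites": int(row["Cites"] or 0),
        })
    return rows


def merge_cited(exports, min_cites=1):
    """Combine all journals, keeping articles with at least min_cites citations."""
    merged = []
    for journal, csv_text in exports.items():
        records = read_journal_csv(journal, csv_text)
        merged.extend(r for r in records if r["cites"] >= min_cites)
    return merged


# Hypothetical sample data, not taken from the shared spreadsheet.
exports = {
    "Journal A": "Cites,Title,Year\n3,First article,2010\n0,Second article,2011\n",
    "Journal B": "Cites,Title,Year\n1,Third article,2012\n",
}

merged = merge_cited(exports)
for record in merged:
    print(record["journal"], record["year"], record["cites"], record["title"])
```

Keeping the journal title on every row means the combined table can later be grouped by journal without going back to the individual spreadsheets.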

Google Drive allows its contents to be shared publicly, so the extracted data were made available through the following link:

https://docs.google.com/spreadsheets/d/19kcMMnfi_5Ohe60_mev-myFc85FkppqRJy-HhXpfB_Q/edit.

Data extraction from GS with PoP covered a total of 24 Brazilian LIS journals, all open access. However, the searches retrieved some inaccurate data; the records were therefore analyzed article by article, and those with inconsistencies were withdrawn. The data obtained allowed some exploratory exercises to be conducted with Tableau Public, using various categorizations such as the citations received by each journal, citations per year and the articles cited, among others. These preliminary exercises were also shared publicly through the following link:

http://public.tableausoftware.com/views/EstudodascitaesrecebidasporperidicosdaCI/Citaesrecebidasporperidi-cos?:embed=y&:display_count=no

One example is shown in Figure 1.

Figure 1. Number of Citations per journal and per year
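The kind of categorization behind Figure 1 can be approximated with a simple grouping. The sketch below is only an illustration under stated assumptions: the exercises themselves were carried out in Tableau Public, the record fields (`journal`, `year`, `cites`) are hypothetical, and "year" is taken here to mean the publication year of the cited article, which the source does not specify.

```python
from collections import Counter


def citations_per_journal_year(records):
    """Sum citation counts for every (journal, year) pair."""
    totals = Counter()
    for record in records:
        totals[(record["journal"], record["year"])] += record["cites"]
    return totals


# Hypothetical records, not taken from the shared spreadsheet.
records = [
    {"journal": "Journal A", "year": 2010, "cites": 3},
    {"journal": "Journal A", "year": 2010, "cites": 2},
    {"journal": "Journal A", "year": 2011, "cites": 1},
    {"journal": "Journal B", "year": 2010, "cites": 4},
]

totals = citations_per_journal_year(records)
for (journal, year), cites in sorted(totals.items()):
    print(journal, year, cites)
```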

Citation studies are an important research subject in bibliometrics. Until recently, reliable data for them were the prerogative of restrictive and expensive commercial databases, even though these sources still show inconsistencies, as is widely discussed in the literature. Google Scholar provides an alternative source for these studies, particularly in the SSH, where many journals are not covered by the large databases.

Tools that facilitate the extraction and processing of GS data, such as PoP, Google Refine, Google Drive and Tableau Public, help to simplify the task of validating these data. In our view, the public sharing of pretreated citation data can stimulate more collaborative investigations by the community of Brazilian scientometricians, with the aim of demonstrating the capacity of Google Scholar to act as an alternative and reliable data source for metric studies of national journals, and thus enable better measures of SSH results in the context of scientific evaluation in Brazil.
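As an illustration of how part of the validation task might be supported, the sketch below flags records for manual, article-by-article review. The two checks (an implausible publication year and a repeated title within the same journal) are assumptions chosen for the example; the source does not state which inconsistencies were found. The field names and the year bounds are likewise hypothetical.

```python
def flag_for_review(records, first_year=1900, last_year=2014):
    """Return (record, reasons) pairs for records that deserve a manual check.

    The checks and the year bounds are illustrative assumptions only.
    """
    seen = set()
    flagged = []
    for record in records:
        reasons = []
        year = record.get("year")
        if not isinstance(year, int) or not first_year <= year <= last_year:
            reasons.append("implausible year")
        key = (record["journal"], record["title"].strip().casefold())
        if key in seen:
            reasons.append("duplicate title")
        seen.add(key)
        if reasons:
            flagged.append((record, reasons))
    return flagged


# Hypothetical records, not taken from the shared spreadsheet.
records = [
    {"journal": "Journal A", "title": "First article", "year": 2010, "cites": 3},
    {"journal": "Journal A", "title": "first article ", "year": 2010, "cites": 1},
    {"journal": "Journal B", "title": "Third article", "year": None, "cites": 2},
]

for record, reasons in flag_for_review(records):
    print(record["journal"], "|", record["title"].strip(), "|", ", ".join(reasons))
```

A flag here only marks a record for inspection; deciding whether to withdraw it remains a manual judgment.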

References

Archambault, E. & Larivière, V. (2010). The limits of bibliometrics for the analysis of the social sciences and humanities literature. In UNESCO (Ed.), 2010 World Social Science Report: Knowledge Divides (pp. 251-254). Paris: UNESCO, International Social Science Council. Retrieved February 20, 2014 from: http://unesdoc.unesco.org/images/0018/001883/188333e.pdf.

Caregnato, S. E. (2011). Google Acadêmico como ferramenta para os estudos de citações: avaliação da precisão das buscas por autor. Ponto de Acesso, 5 (3), 72-86.

Frandsen, T.F. & Nicolaisen, J. (2008). Intradisciplinary differences in database coverage and the consequences for bibliometric research. Journal of the American Society for Information Science and Technology, 59 (10), 1570-1581.

Harzing, A.-W. (2007). Publish or Perish. Retrieved February 20, 2014 from: http://www.harzing.com/pop.htm.

Jacsó, P. (2010). Metadata mega mess in Google Scholar. Online Information Review, 34 (1), 175-191.

Neuhaus, C. & Daniel, H.-D. (2007). Data sources for performing citation analysis: An overview. Journal of Documentation, 64 (2), 193-210.

Background and purpose

Methods

Preliminary findings

Final considerations

*[email protected]

Escola de Comunicações e Artes, University of São Paulo,

Av. Prof. Lúcio M. Rodrigues, 443, São Paulo, SP, CEP 05608-020 (Brazil)

**[email protected]

Departamento Técnico, Sistema Integrado de Bibliotecas, University of São Paulo

Rua da Biblioteca, S/N, Complexo Brasiliana, Piso Embasamento, São Paulo, SP, CEP 05508-050 (Brazil)