
Journal of Informetrics 4 (2010) 647–651

Contents lists available at ScienceDirect

Journal of Informetrics

journal homepage: www.elsevier.com/locate/joi

Short communication

A new family of old Hirsch index variants

Michael Schreiber ∗

Institut für Physik, Technische Universität Chemnitz, 09107 Chemnitz, Germany

Article info

Article history: Received 20 April 2010; received in revised form 11 May 2010; accepted 11 May 2010

Keywords: Hirsch index; g-Index; Performance evaluation; Citations; Ranking; Generalized mean

Abstract

The Hirsch index h and the g index proposed by Egghe as well as the f index and the t index proposed by Tol are shown to be special cases of a family of Hirsch index variants, based on the generalized mean with exponent p. Inequalities between the different indices are derived from the generalized mean inequality. The graphical determination of the indices is shown for one example.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The Hirsch index h is defined as the highest number of papers of a scientist that received h or more citations (Hirsch, 2005). A shortcoming of the h index is that it is insensitive to highly cited papers, because it is irrelevant whether or not a paper in the h-core (i.e. the h-defining set) receives further citations (Egghe, 2006a). This deficit is remedied by the g index, which was proposed by Egghe (2006a, 2006b) as the highest number of papers that together received $g^2$ or more citations. This definition is equivalent to the determination of the g index as the highest number of papers that received on average g or more citations (Schreiber, 2010a), where the average is given by the arithmetic mean of the number of citations to the g most cited papers. Thus highly cited papers enhance the index value significantly. Tol (2009) proposed a different solution, namely to utilize the harmonic or the geometric mean, yielding the f index or the t index, respectively. Accordingly, the indices f, t, and g are based on the three Pythagorean means, and it is natural to investigate the generalization of their definition to the Hölder mean, also known as the power mean or generalized mean. It is the purpose of the present communication to analyze the thus defined family of infinitely many Hirsch index variants. In particular, it will be shown that the original Hirsch index h as well as the highest number of citations to a single paper belong to this family as limiting cases. The special case of the quadratic mean yields a Hirsch index variant which suggests itself but has nevertheless never been proposed before.

As a simple example of the usefulness of these findings, the generalized mean inequality is employed to derive the inequalities between the various indices. The determination of the indices is visualized for the citation record of Tol.

∗ Tel.: +49 371 531 21910; fax: +49 371 531 21919. E-mail address: [email protected].

1751-1577/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.joi.2010.05.002


2. Definition of the generalized Hirsch index $h_p$

The Hölder mean, or power mean, or generalized mean with exponent p is defined as

$$\bar{c}_p(r) = \left( \frac{1}{r} \sum_{r'=1}^{r} c^{p}(r') \right)^{1/p} \qquad (1)$$

for positive real numbers c(r′). In the context of citation analysis these numbers are given by the citation frequencies of a scientist's papers. For practical purposes it will be assumed that the citation frequencies are sorted into a decreasing sequence, with ties being resolved by an additional condition like antichronological order. The number r′ denotes the rank of a paper in this series. The general search facility in the Science Citation Index provided by Thomson Scientific in the ISI Web of Science allows one to arrange the publication list in this way.

Using the thus determined citation record, I define the generalized Hirsch index $h_p$ as the highest number of papers of a scientist that received on average $h_p$ or more citations, where the average is determined by (1):

$$h_p \le \bar{c}_p(h_p) \quad \text{while} \quad \bar{c}_p(h_p + 1) < h_p + 1 \qquad (2)$$

In rather rare cases where the number n of all papers or the number $n_1$ of all cited papers is relatively small, the generalized index $h_p$ has to be constrained by the number of papers (for positive p) or the number of cited papers (for non-positive p), see Appendix A.
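The definition (2), together with the constraints described in Appendix A, can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the procedure actually used for the paper: the function names `generalized_mean` and `h_p` and the sample citation record are hypothetical.

```python
import math


def generalized_mean(cites, r, p):
    """Generalized mean, Eq. (1), of the r most cited papers.

    `cites` must be sorted in decreasing order. For p <= 0 the first r
    entries must all be positive (zero counts are excluded by h_p below).
    """
    top = cites[:r]
    if p == math.inf:
        return float(max(top))   # limit p -> +infinity, Eq. (8)
    if p == -math.inf:
        return float(min(top))   # limit p -> -infinity, Eq. (6)
    if p == 0:
        return math.exp(sum(math.log(c) for c in top) / r)  # Eq. (4)
    return (sum(c ** p for c in top) / r) ** (1.0 / p)


def h_p(citations, p):
    """Largest rank r with r <= mean_p(r), i.e. definition (2)."""
    cites = sorted(citations, reverse=True)
    n = len(cites)                        # number of papers
    n1 = sum(1 for c in cites if c > 0)   # number of cited papers
    r_max = n if p > 0 else n1            # constraints (13) and (14)
    index = 0
    for r in range(1, r_max + 1):
        if generalized_mean(cites, r, p) >= r:
            index = r
        else:
            break   # the mean is non-increasing in r, so we can stop
    return index


# Hypothetical citation record, for illustration only.
record = [10, 8, 5, 4, 3, 0]
for label, p in [("h", -math.inf), ("f", -1), ("t", 0),
                 ("g", 1), ("h_2", 2), ("h_inf", math.inf)]:
    print(label, h_p(record, p))
```

The early exit from the loop relies on the fact that, for a decreasingly sorted record, the mean of the top r papers cannot grow when a further paper is added. In this small record the citation data are exhausted for p → ∞, so the cap at the number of papers takes effect.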

Obviously, for p = 1 one obtains the arithmetic average from (1), so that the definition (2) yields the g index as $g = h_1$. Likewise, for p = −1 the harmonic average follows from (1) and thus the f index $f = h_{-1}$ from (2). For the case p = 0 one can use the limit

$$\lim_{p \to 0} \frac{1}{p} \ln \left( \frac{1}{r} \sum_{r'=1}^{r} c^{p}(r') \right) = \frac{1}{r} \sum_{r'=1}^{r} \ln c(r') \qquad (3)$$

to show that

$$\bar{c}_0(r) = \lim_{p \to 0} \bar{c}_p(r) = \exp \left( \frac{1}{r} \sum_{r'=1}^{r} \ln c(r') \right) \qquad (4)$$

which is equivalent to the geometric average

$$\bar{c}_0(r) = \left( \prod_{r'=1}^{r} c(r') \right)^{1/r} \qquad (5)$$

used for the definition of the t index as $t = h_0$ (Tol, 2009). Thus the three Pythagorean means, given by p = −1, 0, 1, have all been utilized.
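The limit (3) can be checked with a first-order expansion in p. The following is a short sketch that uses only the standard expansions $c^{p} = e^{p \ln c} = 1 + p \ln c + O(p^{2})$ and $\ln(1 + x) = x + O(x^{2})$:

```latex
\begin{align*}
\frac{1}{r} \sum_{r'=1}^{r} c^{p}(r')
  &= 1 + \frac{p}{r} \sum_{r'=1}^{r} \ln c(r') + O(p^{2}), \\
\frac{1}{p} \ln \left( \frac{1}{r} \sum_{r'=1}^{r} c^{p}(r') \right)
  &= \frac{1}{r} \sum_{r'=1}^{r} \ln c(r') + O(p).
\end{align*}
```

Letting p → 0 removes the O(p) remainder and leaves the right-hand side of (3); exponentiating then gives (4).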

Of course other values of p can be employed in a similar way, e.g. the quadratic mean is given by (1) for p = 2 and yields $h_2$.¹ More interesting are the limiting cases p → −∞ and p → ∞. In the first case one obtains

$$\bar{c}_{-\infty}(r) = \lim_{p \to -\infty} \bar{c}_p(r) = \min_{r' \le r} c(r'). \qquad (6)$$

Due to the decreasing order into which the citation records have been arranged, the minimum in (6) is given by c(r), so that

$$\bar{c}_{-\infty}(r) = c(r) \qquad (7)$$

which means that we have $h_{-\infty} = h$, because (2) reduces to the original definition of the Hirsch index. To avoid confusion, I note that the use of the maximum function in the definition (1) of Tol (2009) is incorrect, and the context shows that the same prescription as in the present analysis was meant.

Similar to (6) and (7) for p → ∞ the limit gives

$$\bar{c}_{\infty}(r) = \lim_{p \to \infty} \bar{c}_p(r) = \max_{r' \le r} c(r') \qquad (8)$$

Obviously, for the decreasing sequence of the citation data the maximum is given by the first paper, i.e.

$$\bar{c}_{\infty}(r) = c(1) \qquad (9)$$

so that $h_{\infty} = c(1)$.
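The limits (6) and (8) can be illustrated numerically. The sketch below is only an illustration: the helper `power_mean` and the four sample values are hypothetical, and moderate exponents are used to stay within floating-point range.

```python
def power_mean(values, p):
    """Generalized mean, Eq. (1), for a finite non-zero exponent p."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Hypothetical citation counts of the four top-ranked papers.
values = [10.0, 8.0, 5.0, 4.0]

for p in (1, 10, 100):
    # The first mean drifts towards max(values), the second towards min(values).
    print(p, power_mean(values, p), power_mean(values, -p))
```

For finite p the mean stays strictly between the minimum and the maximum; it only reaches c(r) or c(1) in the respective limit.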

¹ To avoid confusion, it should be noted that the label $h_2$ has already been used (Prathap, 2006; Schubert, 2007) for a second-order index evaluating the number of, e.g., the h indices of the researchers of an institute. Nevertheless, I use the same notation here as a special case of the general definition $h_p$.


Fig. 1. Number of citations to the publications of Tol (lowest data curve) and average number of citations up to rank r for different values of the power p of the generalized mean, yielding the maximum number of citations, the quadratic mean, the arithmetic mean, the geometric mean, the harmonic mean, and the minimum number of citations up to rank r (from top to bottom). The diagonal straight line reflects the function c(r) = r, so that its intersection with the different data curves determines (from left to right) h, f, t, g, $h_2$, c(1).

For p ≥ 1 the generalized mean can be and is used to define the norm in different ways. For example, interpreting c(r′) as the components of a vector c, the definition (1) yields the maximum norm for p → ∞ and the $\ell_1$ norm for p = 1. It is remarkable that the most important norm, namely the Euclidean norm corresponding to p = 2, has never been utilized before for the definition of a Hirsch-type index.

3. Example

As an example of the application, I have determined the citation record of Tol from the Web of Science in March 2010, yielding 144 papers, out of which 109 have been cited at least once. The rank-frequency function is plotted in Fig. 1. The average numbers of citations for p = −1, 0, 1, and 2 are also displayed in this figure, as is the maximum c(1). Of course, the averaged curves are much smoother than the original data curve. The various indices can be easily obtained from the intersections of these curves with the diagonal straight line c(r) = r. To be specific, due to the restriction to integer values, the respective indices are given by the floor function of the real number of the rank determined by the intersection points. The respective values are compiled in Table 1.

It is obvious from the graphical determination of the indices that

$$h < f < t < g < h_2 < c(1). \qquad (10)$$

These inequalities can also be proven, using the generalized mean inequality which states that

$$\bar{c}_p(r) \le \bar{c}_q(r), \quad \text{if } p < q. \qquad (11)$$

As a consequence one obtains

$$h_p \le h_q, \quad \text{if } p < q. \qquad (12)$$

The equality in (11) occurs if and only if c(1) = c(2) = · · · = c(r), which is hardly ever the case for realistic citation data. In practice, however, equal index values might occur in (12) also as a discretization effect, i.e. due to the restriction of the indices to integer values.
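The ordering (11) and its equality case can be checked for the four means with p = −1, 0, 1, 2. This is a small sketch that relies on Python's standard `statistics` module (version 3.8 or later); the function name `means_for_p` and both sample lists are hypothetical.

```python
import math
import statistics


def means_for_p(values):
    """Generalized means for p = -1, 0, 1, 2, in this order."""
    return [
        statistics.harmonic_mean(values),                   # p = -1
        statistics.geometric_mean(values),                  # p = 0
        statistics.fmean(values),                           # p = 1
        math.sqrt(statistics.fmean([v * v for v in values])),  # p = 2
    ]


unequal = means_for_p([10, 8, 5, 4])   # hypothetical citation counts
equal = means_for_p([7, 7, 7])         # all counts identical

# Unequal counts: the means increase strictly with p, as in (11).
print(unequal)
# Identical counts: all four means coincide (up to rounding).
print(equal)
```

Because the indices are integers, a strict ordering of the means does not always carry over to the indices, which is the discretization effect mentioned above.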

From (12) it follows that the number of papers in the $h_p$ core increases with p. Although the generalized mean (1) puts more and more weight on the highly cited papers in the citation record as p increases, all the less cited publications up to the index value $h_p$ contribute, albeit less and less. Only in the limit p → ∞ do their contributions vanish, so that only in this limit not all the papers in the $h_p$ core are relevant, but only the citation count of the first paper in the list.

Table 1. Generalized means for the power p, the corresponding Hirsch index variants, and the respective values of the citation record of Tol.

| p | Generalized mean | Index | Tol's value |
|----|--------------------|-------|----|
| −∞ | Minimum | h | 18 |
| −1 | Harmonic average | f | 24 |
| 0 | Geometric average | t | 27 |
| 1 | Arithmetic average | g | 30 |
| 2 | Quadratic average | $h_2$ | 35 |
| ∞ | Maximum | c(1) | 92 |

As the index values of individual researchers increase with p, one could speculate that for a given group of researchers this would lead to a larger spreading of their index values with increasing p. Thus one would expect fewer ties, i.e. a higher discriminative power with increasing p. Indeed, in the sample of 100 economists (Tol, 2009) the range of index values increases from h via f and t to g. However, the number of ties in that investigation is rather high for the g index, significantly lower for f and smallest for t, which means that t shows the highest discriminative power. This empirical result is surprising and I believe that it is a mere accident. I have recently analyzed twenty Hirsch index variants for a sample of 26 physicists (Schreiber, 2010b) and found a decreasing number of ties from h via f and g to c(1), as expected. But unexpectedly, the largest number of ties occurred for the t index, larger even than the number of ties for the h index in this case.

4. Concluding remarks

The generalized means are an elegant way of describing various Hirsch index variants on the same footing, including the original Hirsch index and the most cited paper as limiting cases. However, care has to be taken, because the data in the citation record might be exhausted before the definition (2) can be fulfilled for large or negative values of p. The generalization allows one to easily compare different indices as exemplified above in (10), and it should also enable a unified treatment of all generalized indices $h_p$, for example analyzing the influence of transformations as suggested by Egghe (2008) and Rousseau (2006) for h and g, or studying certain models (Burrell, 2009), or applying axiomatic characterizations (Woeginger, 2009).

The generalized mean (1) emphasizes large signal values for large powers p. The other way round, one could state that the $h_p$ index becomes more egalitarian (Tol, 2009) for smaller values of p. It remains a matter of taste whether one prefers an index which is more egalitarian or one which emphasizes higher citation counts. Thus the present generalized index allows everybody to choose a variant to his/her liking.

Appendix A.

If the number n of papers or the number $n_1$ of cited papers is relatively small, then the citation record might not be sufficient to fulfill the definition (2), because for large values of p there might be no paper of rank $h_p + 1$, so that the second inequality of (2) cannot be fulfilled. For negative values of p it could be the case that for the calculation of the average (1) in the first inequality of (2) one would formally want to include non-cited papers, leading to negative powers of zero citation counts.

To be specific, for the g index one would usually require that $\bar{c}_1(n) \le n$, although this condition can be circumvented by adding fictitious non-cited articles to the publication list as proposed by Egghe (2006a). It was discussed (Schreiber, 2010a) that this extrapolation of the citation record does better justice to a high total number of citations, but it can also be avoided by adding the stipulation

$$h_p = n, \quad \text{if } \bar{c}_p(n) > n \qquad (13)$$

to the definition (2) for p > 0. In the case of the g index this problem usually does not occur (Schreiber, 2008). I encountered only one example (Schreiber, 2010a) where such an extrapolation or the application of (13) was necessary for the determination of the g index. Of course, for larger values of p it is more likely that the citation record is exhausted so soon that the condition (13) has to be utilized or fictitious non-cited papers have to be added to the citation record. In the limit p → ∞ this occurs more frequently, e.g. in my analysis of 26 non-prominent physicists (Schreiber, 2007) it was found that n < c(1) for 12 of the 26 datasets. And for the eight famous physicists (Schreiber, 2010a) I even found n < c(1) in 7 of the 8 datasets.

For negative values of p, the definition (2) makes sense only if the number of cited papers is sufficiently large. Otherwise one has to require additionally

$$h_p = n_1, \quad \text{if } \bar{c}_p(n_1) > n_1 \qquad (14)$$

for p ≤ 0, where $n_1$ is the number of papers which have been cited at least once. In this case an extrapolation to fictitious papers is not possible. But experience shows (Tol, 2009; Schreiber, 2010b) that it is rarely the case that the condition (14) has to be applied. Tol (2009) claimed that such a problem does not exist at all for the f and t index, but it is easy to construct a counterexample: if a scientist has published 3 papers of which two received 10 citations each and the third one no citation, then the restriction (14) is already necessary. In fact, it remains unclear from Table A1 of Tol (2009) whether in the case of Poterba it would already be necessary to invoke (14) for the determination of the t index, because in that table $n = n_1 = t = g = 12$. Probably this is the case, but the citation record of Poterba which I checked in April 2010 yielded 22 papers, out of which 17 had been cited, and it was no problem to determine t = 16 as well as g = 20 from (2).
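The three-paper counterexample can be worked through explicitly for the f index (p = −1). The following is a minimal sketch; the helper `harmonic_mean_top` and the variable names are hypothetical and only serve this illustration.

```python
# Citation record from the counterexample, sorted in decreasing order.
citations = [10, 10, 0]
n1 = sum(1 for c in citations if c > 0)   # number of cited papers


def harmonic_mean_top(cites, r):
    """Harmonic mean (p = -1) of the r most cited papers."""
    return r / sum(1.0 / c for c in cites[:r])


# At rank n1 the first inequality of (2) still holds comfortably ...
mean_at_n1 = harmonic_mean_top(citations, n1)

# ... but the mean at rank n1 + 1 would need 1/0 and is undefined.
try:
    harmonic_mean_top(citations, n1 + 1)
    rank_beyond_n1_defined = True
except ZeroDivisionError:
    rank_beyond_n1_defined = False

# Condition of (14): the mean at rank n1 exceeds n1, so f is set to n1.
needs_restriction = mean_at_n1 > n1
f_index = n1 if needs_restriction else None
print(n1, mean_at_n1, rank_beyond_n1_defined, f_index)
```

The second inequality of (2) cannot be tested here, because the mean at rank 3 does not exist; the stipulation (14) closes this gap and fixes f at the number of cited papers.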

References

Burrell, Q. L. (2009). On Hirsch's h Egghe's g and Kosmulski's h(2). Scientometrics, 79, 79–91.

Egghe, L. (2006a). Theory and practise of the g-index. Scientometrics, 69, 131–152.

Egghe, L. (2006b). An improvement of the h-index: The g-index. ISSI Newsletter, 2, 8–9.

Egghe, L. (2008). The influence of transformations on the h-index and the g-index. Journal of the American Society for Information Science and Technology, 59, 1304–1312.

Hirsch, J. E. (2005). An index to quantify an individual's scientific research output. Proceedings of the National Academy of Sciences of the United States of America, 102, 16569–16572.

Prathap, G. (2006). Hirsch-type indices for ranking institutions' scientific research output. Current Science, 91, 1439.

Rousseau, R. (2006). Simple models and the corresponding h- and g-index. http://eprints.rclis.org/archive/00006153.

Schreiber, M. (2007). A case study of the Hirsch index for 26 non-prominent physicists. Annalen der Physik (Leipzig), 16, 640–652.

Schreiber, M. (2008). An empirical investigation of the g-index for 26 physicists in comparison with the h-index, the A-index, and the R-index. Journal of the American Society for Information Science and Technology, 59, 1513–1522.

Schreiber, M. (2010a). Revisiting the g-index: The average number of citations in the g-core. Journal of the American Society for Information Science and Technology, 61, 169–174.

Schreiber, M. (2010b). Twenty Hirsch index variants and other indicators giving more or less preference to highly cited papers. Annalen der Physik (Berlin) arXiv:physics/1005.5227.

Schubert, A. (2007). Successive h-indices. Scientometrics, 70, 201–205.

Tol, R. S. J. (2009). The h-index and its alternatives: An application to the 100 most prolific economists. Scientometrics, 80, 317–324.

Woeginger, G. J. (2009). Generalizations of Egghe's g-index. Journal of the American Society for Information Science and Technology, 60, 1267–1273.