submitted to ec hnology 1 - kurtz/jasist-submitted.pdf · 2003. 5. 23. · submitted to the journal...

36

Upload: others

Post on 15-Feb-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

  • Submitted to The Journal of the American Society for Information Science and Technology 1

    The NASA Astrophysics Data System: Sociology, Bibliometrics,

    and Impact

    Michael J. Kurtz, Guenther Eichhorn, Alberto Accomazzi, Carolyn Grant, Markus

    Demleitner, Stephen S. Murray, Nathalie Martimbeau and Barbara ElwellHarvard-Smithsonian Center for Astrophysics, Cambridge, MA 02138

    ABSTRACT

    The NASA Astrophysics Data System (ADS), along with astronomy's journals and datacenters, has developed a distributed on-line digital library which has become the dominant meansby which astronomers search, access and read their technical literature. By combining data fromthe text, citation, and reference databases with data from the ADS readership logs we have beenable to create Second Order Bibliometric Operators, a customizable class of collaborative �lterswhich permits substantially improved accuracy in literature queries. Using the ADS usage logsalong with membership statistics from the International Astronomical Union and data on thepopulation and gross domestic product (GDP) we develop an accurate model for world-widebasic research where the number of scientists in a country is proportional to the GDP of thatcountry, and the amount of basic research done by a country is proportional to the number ofscientists in that country times that country's per capita GDP.

    We examine the obsolescence function as measured by actual reads, and show that it canbe well �t by the sum of four exponentials with very di�erent time constants. We compare theobsolescence function as measured by readership with the obsolescence function as measured bycitations. We �nd that the citation function is proportional to the sum of two of the componentsof the readership function. This proves that the normative theory of citation is true in themean. We further examine in detail the similarities and di�erences between the citation rate, thereadership rate and the total citations for individual articles, and discuss some of the causes.

    Using the number of reads as a bibliometric measure for individuals, we introduce the read-cite diagram to provide a two-dimensional view of an individual's scienti�c productivity. Wedevelop a simple model to account for an individual's reads and cites and use it to show thatthe position of a person in the read-cite diagram is a function of age, innate productivity, andwork history. We show the age biases of both reads and cites, and develop two new bibliometricmeasures which have substantially less age bias than citations: SumProd, a weighted sum oftotal citations and the readership rate, intended to show the total productivity of an individual;and Read10, the readership rate for papers published in the last ten years, intended to show anindividual's current productivity. We also discuss the e�ect of normalization (dividing by thenumber of authors on a paper) on these statistics.

    We apply SumProd and Read10 using new, non-parametric techniques to rank and comparedi�erent astronomical research organizations. We then introduce the concept of utility time tomeasure the impact of the ADS and the electronic astronomical library on astronomical research.We �nd that in 2002 it amounted to the equivalent of 736 FTE researchers, or $250 Million, orthe astronomical research done in France.

    Subject headings: digital libraries; bibliometrics; sociology of science; information retrieval

  • Submitted to The Journal of the American Society for Information Science and Technology 2

    1. Introduction

    The NASA Astrophysics Data System AbstractService resides at the center of the most sophisti-cated discipline centered bibliographic system everdeveloped (Boyce 1998), (Boyce 1996). The typ-ical astronomer, on average, uses the ADS everyday; ADS users read approximately �ve to tentimes the number of papers which are read in allthe (traditional) astronomy libraries of the world,combined.

    A detailed technical and historical descriptionof the ADS can be found in a set of four papers:Kurtz et al. (2000), Eichhorn et al. (2000), Ac-comazzi et al. (2000), and Grant et al. (2000).

    The key fact which links the sections of thispaper together is that the existence of digital li-braries, like the ADS, means that detailed infor-mation on the readership of journal articles, on aper article and per reader basis, can now be rou-tinely obtained. This is basically a new bibliomet-ric measure of importance equal to citations.

    In section 2 we briey describe the system, andnote some of its more important or sophisticatedfeatures. In section 3 we describe the growth andcurrent use of the system, and in section 4 we useADS usage to show the relationship a country'swealth and population has with the number ande�ectiveness of its scientists.

    In section 5 we develop a four component modelto describe how the astronomy literature becomesobsolete (as measured by actual article reads) withage, and in section 6 we demonstrate the closeconnection between citations and readership.

    In section 7 we develop a methodology to evalu-ate the research performance of individuals, usinga combination of citations and readership, and welook at several interesting cases. Extending the re-sults for individuals in section 8, we then developtechniques for comparing organizations.

    Finally in section 9 we discuss how the total im-pact of the ADS and the electronic astronomy li-brary can be evaluated; and we conclude in section10 by discussing some of the future consequencesof the new technologies and data used here.

    2. Descriptive Background

    The ADS Abstract Service was �rst demon-strated in 1992 (Kurtz et al. 1993) and was put

    on-line for general use in April 1993. In early 1994the service was moved onto the World Wide Webfrom its original NASA developed network home(Murray et al. 1992). In this section we will brieydescribe some of the more important milestonesand features in the ADS development; Kurtz etal. (2000) contains a detailed description of theearly history of the ADS Abstract Service.

    The ADS from the beginning has been designedto be used primarily by end-user astronomers, asopposed to librarians. This has led us to createa system which emphasizes recall over precision,assuming that the expert user can easily separatethe desired from the undesired. The behavior ofthe system can be easily modi�ed from its default,however, to match almost any desired searchingstyle.

    From the beginning our standard search in-volved partial match techniques, natural languagequeries with an extensive discipline speci�c syn-onym list, relevance feedback, and the combina-tion of evidence (Belkin, et al. 1995) from simul-taneous queries, such as text from the abstractcombined with text from the title, combined withan author's name.

    Within a few months of the initial public re-lease, in 1993, we were able to include the abilityto query for papers containing information abouta particular astronomical object (i.e. a particularstar or galaxy) by performing simultaneous jointqueries with the SIMBAD data base (Wenger etal. 2000), operated by the CDS (Centre des don-nees astronomique de Strasbourg) in France. Webelieve this is the �rst time the internet was usedto permit routine joint transatlantic queries of thissort.

    Also, soon after moving to a web based system,we were able to make direct hyperlinks betweenthe bibliographic entries and on-line versions ofdata tables from those articles from the journalAstronomy & Astrophysics kept by the CDS inStrasbourg (Ochsenbein and Lequeux 1995). Thiscapacity has grown so that now essentially everydata table from every article in every major jour-nal of astronomy is available via these means.

    Towards the end of 1994, the ADS began scan-ning (creating bi-tonal bitmapped images of) re-cent issues of The Astrophysical Journal (Letters).With the collaboration of virtually every astron-

  • Submitted to The Journal of the American Society for Information Science and Technology 3

    omy journal we now have electronic copies ofnearly every astronomy research paper publishedin the last �fty years, and with the collaborationof the Center for Astrophysics' Wolbach Libraryand the Harvard University Library (Corbin andColetti 1995) we expect in the near future to havea nearly complete collection of the astronomy re-search literature going back two centuries. Thiscollection has been called \Earth's Largest FreeFull-Text Science Archive" (Highwire 2001).

    In July 1995, The Astrophysical Journal (Let-ters) published its �rst issue in HTML electronicformat (Boyce 1995), (Dalterio et al. 1995); itwas the �rst major scienti�c journal to be fullyon-line. The reference sections of that �rst elec-tronic issue had hyperlinks from the references tothe corresponding entries in the ADS. This prac-tice of linking from the electronic article referencesto the relevant sections of a bibliographic servicehas now become standard for electronic journals;now all astronomy journals, and several others,link to the ADS.

    Also in 1995 the ADS made its �rst links be-tween a bibliographic reference and the data whichwere used to do the research described in the arti-cle (Plante et al. 1996) . This practice has growntremendously; more than 175,000 such links arenow in the ADS, and the growth is becoming morerapid as part of the emerging Astronomical Vir-tual Observatory (Brunner et al. 2001), (Kurtzand Eichhorn 2001).

    In 1996, using data purchased from the Insti-tute for Scienti�c Information (ISI) with fundsprovided by the American Astronomical Society(AAS) as part of the electronicAstrophysical Jour-nal project, we provided links from the main ar-ticle references to lists of the papers they refer-enced, and to the papers which cite them (Kurtzet al. 1996). With the cooperation of all the mainjournals of astronomy we have since greatly ex-panded this capability, by purchasing more datafrom ISI, by performing optical character recogni-tion on the reference sections of scanned articlesin our database, and by taking directly the ref-erences from electronic articles provided to us bythe publishers (Demleitner et al. 1999). Currentlythere are more than 8,000,000 citing-cited articlepairs in our database, and the number continuesto grow.

    In 1996, we began maintaining mirror sites

    around the world (Eichhorn et al. 1996); the �rstwas hosted by the CDS in Strasbourg. Currentlywe have twelve mirrors in twelve countries on fourcontinents.

    In 2000, we added the ability to see articleswhich were the most read by people who read an-other article (i.e. people who read this article alsoread: : : :, hereafter also-read). Because ADS usersread a substantial fraction of all astronomy arti-cles (see sections 3 and 9) this feature provides anaccurate sampling of the entire discipline.

    2.1. Second Order Operators

    In 1997, we began permitting some second or-der queries (Kurtz 1992) by allowing the resultof a query to be the collated reference or citationlists of the articles returned by the query, ratherthan the articles themselves (Kurtz et al. 2000) .In 2001 we added the also-read lists to this capa-bility, and created a new interface, both to makeit more intuitive, and to improve its functional-ity. The resulting set of second order operatorsprovides an extremely powerful basis for creatingspecialized collaborative �lters (e.g. Goldberg, etal. (1992)). We believe many of these capabilitiesare unique to the ADS.

    A full description of the second order operatorswill be published elsewhere (Kurtz et al. 2002);here we will give one example of their use, in orderto demonstrate their power.

    We will begin by assuming that we are inter-ested in the current state of knowledge in sub�eldX. We will further assume that we know enoughabout sub�eld X (by knowing an author's name,perhaps) to be able to �nd the abstract of at leastone paper about sub�eld X.

    We retrieve the abstract(s) using the ADS anduse the \�nd similar abstracts" feature, modifyingthe query to return only articles published withinthe past year, we then retrieve a list of the recentpapers with the most similar abstracts. This is alist of the most recent papers in sub�eld X basedon similarity of text. We can now use the secondorder operators.

    The people who have recently read the mostrecent papers on sub�eld X are people who are in-terested in sub�eld X; the papers they read themost will be the currently most popular paperson sub�eld X. Clicking on the \get also-read lists"

  • Submitted to The Journal of the American Society for Information Science and Technology 4

    feature retrieves exactly those most popular arti-cles on sub�eld X.

    These papers are the most popular for a reason;presumedly they are informative papers concern-ing the current state of a�airs of sub�eld X. Thepapers most referenced by these papers would bethose currently most useful to sub�eld X; clickingon the \get reference lists" feature gets this list ofmost useful (most cited by papers in sub�eld X)articles.

    Papers which cite a large number of these mostuseful articles are papers with extensive discus-sions of topics of interest to sub�eld X; they are themost instructive articles about sub�eld X. Clickingon the \get citation lists" feature gets this list ofmost instructive articles. Figure 1 shows this listafter following the above procedure starting withthe abstract from the paper by Riess et al. (1998)\Observational Evidence from Supernovae for anAccelerating Universe and a Cosmological Con-stant." The score refers to the number of refer-ences in the previous list which are also cited bythese articles. For example, the top article in thislist cited 96 of the papers from the previous list ofmost referenced papers; for three of the �ve papersshown more than half of all the papers in their ref-erence lists were also in the previous list. All ofthe articles on the top of the list are review articleson exactly the original subject; the high quality ofthis list tends to validate each stage of the process.

    Beginning with a minimum of knowledge (theability to �nd a single article on the desired sub-ject) we are quickly able to obtain lists of the mostcurrent, most popular, most useful, and most in-structive articles about the desired subject. Giventhe ability to write a good query, or to edit thelists to include only the most relevant articles, onecan obtain in-depth results for very narrow sub-ject matters. Of course very many di�erent waysof using these second order operators are possible.

    While we believe our use of these second orderoperators is unique, there are some similar featuresin other systems. The \customers who bought thisbook also bought" feature of Amazon.com (Ama-zon 2002) and other on-line retailers is similar tousing the \get also-read" feature on a single ar-ticle. The \get related records" feature of ISI'sWeb of Science (ISI 2002) is the same as runningthe \get citation lists" feature on the reference listfrom a single article.

    Fig. 1.| The \most instructive" papers list ob-tained by following the procedure described in thetext, beginning with the abstract from Riess etal. (1998).

    3. The Growth of the ADS

    The ADS went on-line in April 1993 using awide area networking system created for NASA;in its ten months of existence usage approximatelydoubled. In February 1994, the ADS converted tothe World Wide Web (Grant et al. 1994); usagewent up by a factor of four within �ve weeks. Sincethen use of the ADS has doubled about every twoyears. There are a number of di�erent ways tomeasure ADS use, �gure 2 shows the number ofunique users per month. The dotted line shows atwo-year doubling.

    We caution that the number of unique usersmay be misleading. We track usage by the num-ber of unique \cookies"1 which access the ADS,and by the number of unique IP2 addresses. Thisleads to a number of unique users which may ex-

    1A cookie is a unique identi�er which WWW providers (inthis case ADS) assign to each user, and store on the user'scomputer using the browser.

    2Each Machine on the internet has a unique IP (InternetProtocol) address.

  • Submitted to The Journal of the American Society for Information Science and Technology 5

    Table 1

    Use by Selected Institutions, August 2001

    Institution heavy usersa AAS Membersb unique readsc

    Harvard-Smithsonian 236 250 8085Space Telescope 126 163 5271CalTech 173 196 5589Cornell 33 52 899European Southern Obs. 385 18 11274Total worldwide 10532 6429 350404

    aUnique users who have read ten or more articles during the month

    bAAS (2001)

    cNumber of unique user-paper pairs

    ceed the actual number of di�erent people usingthe system; a person who logs in from work andfrom home would count twice, but any number ofpeople using a library computer would count asone; also a person who used more than one of ourdozen mirror sites would be counted more thanonce. The monthly usage for October 2001 was51,960 users.

    This number may be compared to 8,692 mem-bers of the International Astronomical Union(IAU) (IAU 2001), the largest organization ofprofessional astronomers. There were 10,677 dif-ferent authors listed in the papers published in2001 by the seven major astronomy journals, 5576of these were listed as (co-)author on more thanone paper. Independent of how many actual usersthe ADS has, we believe nearly every workingastronomer uses the ADS very regularly.

    Table 1 compares the number of unique userswho read ten or more articles during August 2001using the ADS to the number of members of theAmerican Astronomical Society for some selectedastronomical institutions. The number of AASmembers is similar to the number of heavy ADSusers for the US organizations; more than twiceas many users are present at each institution, butthey use the ADS less frequently, the heavy usersaccount for about 85% of the total use of the sys-tem.

    The number of unique reads represents the

    number of unique user-article pairs in a month,thus if a single user reads the same article twice,it only counts once. This is the most conservativeestimate of number of articles read in the system.For heavy users the median number of articles readper month is 22, with very little change among var-ious organizations or subgroups. This is the basisfor our claim that the typical working astronomer,on average, uses the ADS every working day.

    As the numbers for the European Southern Ob-servatory, which is located near Munich, demon-strate, the use of the ADS is not limited to orga-nizations within the United States. Three �fths ofall ADS use comes from outside the United States;this will be explored in the next section.

    Tenopir, et al. (2003) independently dis-cuss the degree to which the ADS is used by as-tronomers.

    4. ADS as indicator of the worldwide re-

    search e�ort

    Astronomy is a worldwide enterprise, and, asprobably the least practically useful of all sciences,is perhaps the best indicator of how pure researchis done internationally. As the ADS use is over-whelmingly by astronomy researchers, a countryby country comparison of ADS use can yield in-sights into the factors which inuence the world'spure research.

  • Submitted to The Journal of the American Society for Information Science and Technology 6

    1994 1996 1998 2000 2002 100

    1000

    10000

    100000

    year

    Fig. 2.| The number of unique users of the ADSeach month. The dotted line represents a 2 yeardoubling time.

    4.1. ADS use and GDP

    Certainly one factor which inuences the re-search done in a country is the wealth of that coun-try. Figure 3 shows the ADS use per capita vs. theGDP per capita. There is clearly a trend. Thecenter line shows the relation where ADS use percapita is proportional to GDP per capita squared;the lines to each side are separated by a factor ofp2 in per capita GDP.

    Taking ADS use as a proxy for basic researchwe obtain directly a simple relation for the amountof basic research done in a country:

    BasicResearch / (GDP)2=Population (1)

    The ADS use is obtained by counting the num-ber of entries in the query log (this is approxi-mately twice the true number of requests to thesystem) made during the year 2000, and the GDPis by the purchase power parity method (CIA2001). Most queries can be automatically resolvedto the internet host address, but about 20% can-not; the IP addresses which do not resolve are sys-tematically more frequently found in the less in-dustrialized countries. We have resolved by hand(to the country level) all class B and class C

    GDP dollars per capita 1000 10000

    100

    1000

    10000

    AR

    AM

    AU

    AT

    BE

    BR

    BG

    CA

    CL

    CN

    HR

    CZ

    DK

    EG

    EE

    FI

    FR

    GE

    DE

    GRHU

    INIR

    IE

    IL

    IT

    JP

    KRLTMX

    NZ

    NO

    PL

    PT

    RO

    RU

    SA

    SK

    ZA

    ES

    SECH

    TW

    NL

    TR

    UA

    UK

    US

    VE

    MA

    LVSI

    YU

    Fig. 3.| The per capita GDP of a country vs. theADS use per million inhabitants in that country.The symbols are the ISO symbols for each country.

    IP subnets with more than 1,000 queries duringthe year 2000, this procedure only signi�cantlychanged the data for two countries, Egypt and Mo-rocco. By this process we can be certain that wehave not missed any large group of astronomers(1,000 queries is a typical amount for a single veryactive user); we have not resolved about 10% ofthe queries, and can be certain that the results forsome countries with low usage are systematicallyunderestimated.

    Comparing the GDP �gures from the CIA (CIA2001) with those from the Economist Magazine(Economist 2001) shows that these can di�er bya factor of two for some developing countries, andby as much as 30% even in industrial countries.

    Figure 3 contains only countries where the num-ber of ADS queries is equal or greater than 1,737,which is the average number of queries per mem-ber of the IAU, and is about equal to the use oftwo active users. There are several things whichcan be seen in the plot. First, notice that theUnited States is not the largest per capita user ofthe ADS; France (FR) and the Netherlands (NL)are; their symbols obscure each other on the plot.Also notice that while the di�erence in per capitaincome between the richest and poorest countries

  • Submitted to The Journal of the American Society for Information Science and Technology 7

    is a factor of ten to �fteen, the di�erence in percapita ADS use between these countries is close tothree hundred.

    Next notice that the countries to the left ofthe dotted lines, those which perform more re-search per capita than their GDP per capitasquared would suggest, are, with one exception(Morocco, MA) all eastern European countrieswhich have undergone substantial economic andpolitical change in recent years. It is possible thatin the face of such large historical changes GDPis not a fully accurate measure of the total wealthof a country, which would also include existinginfrastructure.

    Also notice that the four countries to the rightof the dotted lines, those which perform less re-search than expected, are all oil producing states.It is possible that the (historically) recent in-creased GDP due to oil revenues in these coun-tries is not a true measure of their actual longterm wealth.

    4.2. ADS use and IAU membership

    Another factor which a�ects the amount of re-search done in a country is the number of re-searchers in that country. Again using astronomyas an indicator of the pure research e�ort we willtake the membership in the IAU on a per countrybasis (IAU 2001) and compare that to the amountof ADS use in each country.

    One might naively assume that on average theper astronomer use in each country would be con-stant; the average astronomer would do the av-erage amount of research independent of location.However, �gure 4 shows for each country the num-ber of ADS queries per IAU member as a func-tion of GDP per capita. In fact there is a cleartrend, the dotted line shows the relation whereADS use per IAU member is proportional to GDPper capita, hence the productivity of individualastronomers is not independent of location. Thedotted line is not a �t, and the scatter in this rela-tion is too large to derive a parameterization. Thedotted line is perhaps the simplest relation whichis consistent with the data.

    4.3. IAU membership vs. GDP

    Combining the relations shown in �gure 3 (ADSuse per capita is proportional to GDP per capita

    1000 10000

    100

    1000

    GDP per capita

    AR

    AM

    AU

    AT

    BEBR

    BG

    CA

    CL

    CN

    HR

    CZ

    DK

    EG

    EE

    FI

    FR

    GE

    DE

    GR

    HU

    INIR

    IE

    IL

    IT

    JP

    KR

    LT

    MX

    NZ

    NO

    PL

    PT

    RO

    RU

    SA

    SK

    ZA

    ES

    SECH

    TW

    NL

    TR

    UA

    UK

    US

    VE

    Fig. 4.| The per capita GDP of a country vs.the ADS use per the number of IAU members inthat country. The symbols are the ISO symbolsfor each country.

    squared) and �gure 4 (ADS use per IAU mem-ber is proportional to GDP per capita) one ob-tains the relation that IAU membership is propor-tional to GDP. This relation can also be obtainedin other ways, and is obviously independent of theADS. Figure 5 shows the relation; the central dot-ted line represents the relation one astronomer per$4,300,000,000 which is the total world GDP di-vided by the number of IAU members.

    The trend in these data is clear. Nearly allcountries on the plot follow the relation, the prin-cipal exceptions being Indonesia, Thailand, andPakistan (ID, TH, and PK). The countries, how-ever, do not seem to form a single linear relation.The two anking lines, which are separated fromeach other by a factor of three, seem to betterdescribe the distribution than the single average.We suggest that this bifurcation is real, and repre-sents the cultural di�erence between western na-tions and eastern nations with regard to supportof basic research. Additionally some countries per-form essentially no astronomy; they are primarilyfrom southeast Asia (excluding India) and subsa-haran Africa.

    If one divides the data by the mean line down

  • Submitted to The Journal of the American Society for Information Science and Technology 8

    1 10 100 1000 1

    10

    100

    1000

    IAU Members

    DZ

    AR

    AM

    AU

    AT

    AZ

    BE

    BR

    BG

    CA

    CL

    CN

    CO

    HR

    CZDK

    EG

    EE

    FI

    FR

    GE

    DE

    GR

    HU

    IS

    IN

    ID

    IR

    IE

    IL

    IT

    JP

    JO

    KZ

    KR

    LV

    LB LT

    MY

    MU

    MX

    MA

    NZ

    NO

    PK

    PY

    PE

    PHPL

    PTRO

    RU

    SA

    SG

    SK

    SI

    ZA

    ES

    LK

    SECH

    TW

    TH

    NLTR

    UA

    UK

    US

    UY

    VE

    Fig. 5.| The GDP of a country in units of$4,300,000,000 vs. the number of IAU membersin that country. The symbols are the IS) symbolsfor each country.

    the center, one �nds that 31 European countriesare below the line and three above. One also �ndsthat seventeen Asian countries are above the line,and none are below. We thus propose:

    Scientists =c

    kGDP (2)

    where for astronomy

    c =

    8<:

    p3 if european culture;1p3

    if asian culture;

    0 otherwise.

    and

    k = $4; 300; 000; 000

    Scientists is the number of scientists in a coun-try, GDP is the GDP of that country, k is the meanGDP per scientist worldwide, and c is a culturalfactor which di�ers among countries.

    4.4. A model for worldwide basic research

    Rearranging the relation shown in �gure 4 weobtain the following model for the research donein a country:

    0.1 1 10 100 1000 10000 0.1

    1

    10

    100

    1000

    10000

    queries/1000

    DZ

    AR

    AM

    AU

    AT

    BE

    BR

    BG

    CA

    CL

    CN

    CO

    HR

    CZDK

    EGEE

    FI

    FR

    GE

    DE

    GR

    HU

    IS

    IN

    ID

    IR

    IEIL

    ITJP

    KR

    LV

    LB

    LTMY

    MU

    MX

    MA

    NZNO

    PE

    PL

    PTRO

    RU

    SA

    SG

    SK

    SI

    ZA

    ES

    LK

    SECH

    TW

    TH

    NL

    TRUA

    UK

    US

    UY

    VE

    Fig. 6.| The number of astronomers times theper capita GDP vs. ADS use for each country.The symbols are the ISO symbols for each country

    BasicResearch / Scientists GDPPop

    (3)

    The amount of basic research done in a countryis proportional to the number of scientists timesthe per capita GDP.

    Figure 6 shows this model; the dotted line rep-resents the same mean proportionality constant asin the line in �gure 4. The model in equation 3provides a reasonably accurate prediction of theresearch output of a country for countries di�eringin research output over a factor of ten thousand.

    4.5. Discussion

    Combining equations 2 and 3 we get a sim-ple picture of the major factors a�ecting researchworldwide. The number of scientists in a countryis just proportional to the GDP of that country,except that Asian and European countries have adi�erent proportionality constant, by a factor ofthree (and some countries simply do no astron-omy); and the amount of research done in a coun-try is proportional to the number of scientists inthat country, times an eÆciency factor which isjust the GDP per capita.

  • Submitted to The Journal of the American Society for Information Science and Technology 9

    This model for the research of nations is es-sentially a quanti�cation of the Matthew E�ectfor Countries (Bonitz 1997), an extension of theMatthew E�ect (Merton 1968) which essentiallystates that the system of rewards in science is thatthe rich get richer and the poor stay poor.

    According to this model two countries with thesame GDP (for example India and France) willhave the same number of scientists (ignoring thecultural factor). The amount of research these sci-entists perform, however, di�ers by the ratio of theper capita GDP for the two countries, more thana factor of ten in the case of India and France.

    5. Readership as a function of age

    Because the use of the ADS is now the domi-nant means by which astronomers access the tech-nical literature (see sections 2 and 9) the ADS us-age logs can provide a uniquely powerful view ofthe way an entire discipline (astronomy) uses thetechnical literature. In this section we will ex-amine the obsolescence (e.g. White and McCain(1989), Line and Sandison (1975)) of the techni-cal literature of astronomy as a function of articleage based on the actual readership of an article.This is an extension and reexamination of the workexplored in Kurtz et al. (2000).

    We use, as our basic data source, the log of allarticle \reads" using the ADS between January1st and August 20th, 2001. We de�ne a \read"as every time a user who has access to a list ofarticles, their dates, journal names, titles and au-thors (such as in �gure 1) chooses to view moreinformation about an article. Currently 50% ofthese \reads" are of the abstract, 38% are of oneof the forms of whole text, 8% are of the citationlist, and the rest are distributed amongst ten otheroptions. There are more than 4.2 million \reads"in this log.

    For this study we extracted reads for the threemajor U.S. astronomy journals; The Astrophysi-cal Journal, The Astronomical Journal, and ThePublications of the Astronomical Society of the Pa-ci�c. All three of these journals have been sta-ble over the past century and are currently amongthe most important astronomy journals. All havehad their full text versions beginning with their�rst issues available on-line through the ADS sincewell before the beginning of the reporting period.

    1900 1920 1940 1960 1980 2000

    1

    10

    100

    publication year

    H

    CI

    Fig. 7.| The average number of reads per ar-ticle per year for three U.S. astronomy journals.The thin line is the actual data, the thick line isthe model in the text, and the three dotted linesrepresent the three components of the model.

    These journals accounted for slightly more than1.8 million reads in the �rst 7.66 months of 2001.

    5.1. The obsolescence model for reads

    Figure 7 shows the average number of reads perarticle per year for these three journals as a func-tion of publication year. This shows more than afull century from the �rst issue of The Publicationsof the Astronomical Society of the Paci�c in 1889through 2000.

    Figure 8 shows an expanded view of the last 25years of data from �gure 7. The dotted lines showthe relevant three components of the four com-ponent readership model of Kurtz et al. (2000),as modi�ed here. In this model, research articlereadership (R) is parameterized by the sum of fourexponential functions with very di�erent time con-stants. We associate these four functions with fourdi�erent modes of readership: Historical (RH), In-teresting (RI ), Current (RC) and New (RN ). TheNew (RN ) mode, which corresponds to the newlyarrived (either on-line or in the mail) issue, can-not be parameterized by the data in �gures 7 and8. The Historical (RH) mode we actually parame-

  • Submitted to The Journal of the American Society for Information Science and Technology 10

    1975 1980 1985 1990 1995 2000

    10

    100

    publication year

    H

    CI

    Fig. 8.| An expanded view of Figure 7 showingthe most recent 25 years. Note that the model �tsthe actual data very well.

    terize as a constant, H0, we leave the exponentialform in equation 4 (with kH = 0) because othercombinations of multiplicative and time constantscan also be found which �t the data well, includingsome combinations with kH 6= 0.

    R = RH +RI +RC +RN (4)

    whereRH = H0e

    �kHt

    RI = I0e�kIt

    RC = C0e�kC t

    RN = N0e�kN t

    andH0 = 1:5; kH = 0

    I0 = 45; kI = 0:065

    C0 = 110; kC = 0:4

    N0 = 1600; kN = 16

    t = time since publication in years

    The three longer term functions, RH , RI , andRC are parameterized to �t the data shown in �g-ures 7 and 8. The RN function is included for com-pleteness; kN is taken from Kurtz et al. (2000),

    and N0 obtained by assuming kN is correct, andascribing all readership of the Astrophysical Jour-nal electronic edition which does not originatewith the ADS to the N mode. This is a very crudeapproximation, but the three component (RH ; RI ;and RC) model for archival readership is not af-fected by the N mode usage, which fades veryrapidly following publication.

    The three mode model is not unique, but does,as �gures 7 and 8 show, provide a very good �tto the existing data. No model consisting of onlytwo exponential functions can �t both the recentand historical data, as comparing the two �guresmakes clear.

    Most studies of obsolescence �nd that the useof the literature declines exponentially with age,and parameterize this with a single number, oftencalled the \half-life," which is related to the coeÆ-cient in the exponent by half-life = loge(2)=k, thepoint where the use of an article drops to half theuse of a newly published article. There are severalother de�nitions of half-life in the literature, weuse this one. Thus the Historical (RH) componentof equation 4 does not have a half-life; the half-lifeof the Interesting (RI) component is 10.7 years;the half-life of the Current (RC) component is 1.7years. Kurtz et al. (2000) estimate the half-life ofthe New (RN ) component at 16 days.

    Several studies (e.g. Egghe (1993) and refer-ences therein) decompose the exponential decayin use into the product of an intrinsic decay andthe general growth of the literature. The resultspresented here are for the mean current use perarticle published as a function of time since thepresent, thus we measure directly the intrinsic de-cay. Kurtz et al. (2000) show the growth of theastronomy literature has been 3.7% per year, mea-sured in terms of number of papers published overthe past 22 years.

    The total number of papers read over timein each mode is just the integral of the func-tion from zero to in�nity, which for a negativeexponent is just the ratio of the two constants:H0=kH =1 reads (one and a half reads per yearforever); I0=kI = 818 reads; C0=kC = 275 reads;N0=kN = 100 reads. This assumes no growth inthe number of reads; if the number of reads peryear increases long term at the same rate, 3.7%, atwhich the number of publications is now increas-ing, then the constants in the exponents would all

  • Submitted to The Journal of the American Society for Information Science and Technology 11

    be increased by 0.037. This would have very littlee�ect on the integrals of the RN and RC func-tions, but would more than triple the articles readin the RI mode; and the RH mode would growapace with the growth in the number of reads.

    5.2. Discussion

    Beginning with Burton and Kebler (1960) therehave been a number of studies (see White and Mc-Cain (1989) for a review) which suggest that theobsolescence function consists of the sum of twoexponentials, which Burton and Kebler (1960) at-tribute to \classic" and \ephemeral" papers; pa-rameterizations (e.g. Price (1965)) tend to besimilar to our RH +RI functions.

    If we ignore the RN component, which neitherthis study, nor any of the other studies of obso-lescence could see, we still very clearly �nd threeseparate components to the obsolescence function.Why have these three components not been seenuntil now?

    We suggest that the data available to previousstudies have not been adequate to see these sub-tle e�ects. Most studies have used citation datato determine the obsolescence function. Becauseit takes time after a paper is published for it tobe cited (e.g. section 6) the peak in the RC modeis obscured in citation data. Also citation stud-ies have substantial problems accounting for thegrowth of the literature, which has not been at allconstant over the past century. Related to thisis the determination of the size of the sample uni-verse (the number of relevant papers to the study)at past times.

    There are certainly other possibilities, perhapsthe obsolescence function is di�erent for reads andcites, and perhaps the very existence of the ADShas changed the way the literature is used. Wewill explore these questions further in section 6.

    The reason why readership studies have notseen the three component nature of archival read-ership which we see, we suggest, is that the dataavailable in such studies have been too sparse.The largest astronomy library, the Center for As-trophysics Library, has a reshelve rate of about1,000/month (Coletti 1999) , which is less than0.2% of the rate of reads in the ADS. Additionallymany astronomers keep (and use) their own papercopies of recent journals, which would suppress the

    RC mode in library use.

    6. The relationship between reads and

    cites

    Central to bibliometrics is the study of cita-tions (Gar�eld 1979), and central to the studyof citations is the so called normative assumption(Liu 1993) that \the number of times a docu-ment is cited ... reects how much it has beenused..." (White and McCain 1989). There havebeen many articles suggesting problems with ci-tation studies (e.g. MacRoberts and MacRoberts(1989)), and many articles defending them (e.g.Small (1987)). White (2001) and Phelan (1999)discuss these issues.

    The readership data discussed in section 5 pro-vide a totally independent, direct new measure of\how much [an article] has been used." Compar-ing the readership statistics with citation measureswill show the similarities and di�erences betweencitations, which are an indirect measure of use,but, some would argue, a direct measure of use-fulness, and reads, which are a direct measure ofuse, but perhaps an indirect measure of usefulness.Here we expand considerably on the comparisonpresented in Kurtz et al. (2000).

    6.1. The mean relationship between reads

    and cites

    While there have been many dozens of studieson obsolescence using citations, and many dozensmore using readership as determined by using li-brary circulation statistics (see White and McCain(1989) for review), there are very few studies com-paring the two methodologies over the same data.Tsay (1998) compared the readership obsoles-cence function (obtained by reshelving statistics)for a number of medical journals with the citationobsolescence function for the same journals. Hefound the half-life of the readership function wassigni�cantly shorter than the half-life of the cita-tion function. Tsay (1998) reviewed the literatureand found only one previous comparable study;Guitard, in 1985 (discussed in Line (1993)), us-ing photocopy requests as the use proxy, found thecitation half-life shorter than the readership half-life.

    We have only found two other studies. Cooperand McGregor (1994), also using photocopy data,

  • Submitted to The Journal of the American Society for Information Science and Technology 12

    found citation half-life substantially longer thanthe use half-life; also they found \no correlationbetween obsolescence measured by photocopy de-mand and obsolescence measured by citation fre-quency." Satariano (1978) used the questionnairemethod to �nd \citation patterns reect a cross-disciplinary focus that is not found in the journalsmost often read."

    We believe Kurtz et al. (2000) contains the �rststudy using a data-set large enough to show thesimilarities and di�erences between the two ob-solescence functions. Here we use a substantiallyimproved data-set; we intend this section to super-sede the study in section 6.2 of Kurtz et al. (2000).

    6.1.1. Synchronous relation

    Kurtz et al. (2000) found that the instanta-neous obsolescence function for articles from therecent technical astronomy literature as measuredby citations is simply equal to a proportionalityconstant times the function measured by readstimes an exponential ramp-up to account for thetime delay from when an article is �rst publishedto when an article which cites that article is pub-lished:

    C = cR(1� e�kDT ) (5)where

    c � 120; kD = 0:7

    The proportionality constant, c, represents thenumber of citations per read. This changes withthe (always incomplete) citation databases, andwith time, as ADS use increases (see section 3).Currently we estimate that the average paper isread about twenty times using the ADS for everytime it is cited. In comparing the citation andreads obsolescence functions we have adjusted cto provide the best �t to the samples.

    Figure 9 compares the reads and cites obsoles-cence functions for recent articles. The reader-ship data are for articles from The AstrophysicalJournal, The Astronomical Journal, The Publica-tions of the Astronomical Society of the Paci�c,and The Monthly Notices of the Royal Astronom-ical Society which were read between 1 January2001 and 20 August 2001. The citation data aretaken from those four journals and both Astron-omy & Astrophysics and Nature, where the publi-

    1975 1980 1985 1990 1995 2000

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    11

    2

    3

    publication year

    Fig. 9.| A comparison of the C = cR(1�e�kDT )model with the actual citations from the 2001 sam-ple for papers from the most recent 25 years. Thethick line is the actual citations, the thin line isthe model, using the actual reads, and the dottedline is the model using the reads model, equation4.

    cation date was also between 1 January 2001 and20 August 2001. Only citations to one of the fourjournals in the readership sample were taken; thedata contain 45,000 citations.

    As can be seen from �gure 9 the citation func-tion follows the reads function very closely. In par-ticular the RC function (section 5.1) clearly hasan analog in the citation data, despite the sup-pression of the steep increase compared with theraw reads due to the exponential ramp-up. Thechange in slope in the citation function seen be-ginning about 1994 is exactly what is predictedfrom the reads function; the number of citationsin 1998 and 1999 are more than 40% above thatexpected by an extrapolation of the exponentialdecay seen between 1975 and 1990, a decay whichcorresponds very closely to the RI function. Wesuggest this shows that the citation derived ob-solescence function has two components with ex-actly the same parameters as the two mid-range(in time) readership functions.

    To examine the obsolescence function over a

  • Submitted to The Journal of the American Society for Information Science and Technology 13

    longer time period we use a di�erent data-set of ci-tations. We take all citations to the four journalsin the readership sample from articles publishedbetween 1 January 1995 and 20 August 2001 inthe ADS database. The data contain 625,000 ci-tations.

    We continue to use as our comparison the year2001 reads sample. Clearly papers published in1995 could not have cited papers published in2000, so comparison with recent obsolescence isimpossible (this comparison is in �gure 9). We usethese data exclusively to examine the long termbehavior of the obsolescence function.

    Figure 10 shows the long term obsolescencefunction obtained from citation data comparedwith the readership function. They clearly arenot the same. The citation function follows theRI function, but not the RH +RI function. Thisis not a statistical uke based on having a smallnumber of citations; the number of citations inthe period from 1889 to 1940 which are \missing"from the citation function exceeds 5,000. In theyear 1900, for example, there were 18 citations inthe six year sample, while about 150 would be ex-pected, were the RH mode to produce citations atthe same amplitude as the RI mode.

    We are therefore driven to the conclusion that:

    C = c(RC +RI)(1� e�kDT ) (6)where

    c � 120; kD = 0:7

    for research articles in the astronomy literature.

    There are a number of possible reasons forthe citation obsolescence function to be di�erentfrom the reads function. There are also a num-ber of possible reasons why the citation obsoles-cence function measured here does not show theRH component, whereas this component is seen inother citation based obsolescence functions, begin-ning with Price (1965). We see no clear candidateexplanation which accounts for both di�erences,however.

    6.1.2. Diachronous relation

    As a further test of the citation model in equa-tion 6 we compare the model with the citationsto articles published in four di�erent years, 1983,1986, 1990, and 1992 by the four main journals in

    1900 1920 1940 1960 1980 2000

    0.01

    0.1

    1

    publication year

    H

    C

    I

    Fig. 10.| A comparison of the C = cR(1�e�kDT )model with the actual citations for papers fromthe six year sample for the last 111 years. Thethick line is the actual citations, the thin line isthe model, using the actual reads, and the dottedlines are the modi�ed reads model from equation4 and its components.

    astronomy; The Astrophysical Journal, The As-tronomical Journal, The Monthly Notices of theRoyal Astronomical Society, and Astronomy & As-trophysics. We use all the citations in the ADS ci-tation database; each year has between 67,000 and75,000 citations to articles from these four jour-nals.

    Figure 11 shows the comparison. The thin linesare the actual number of citations each year tothe articles, normalized so that the the averagecitation rate over the �rst �ve years after publica-tion is exactly one. The citation rates go to zerowhen the timespan following publication reachesthe present (or more accurately 20 August 2001);thus one can tell which line represents which year,1992 goes to zero �rst, 1983 last.

    The thick line is the citation model of equa-tion 6, using exactly the same parameters as inequation 4, modi�ed to account for a 3.7% yearlygrowth in the literature.

    It can be noted that the model matches the dataas well or better than the curves for the di�erent

  • Submitted to The Journal of the American Society for Information Science and Technology 14

    years since publication0 5 10 15 20

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    11

    Fig. 11.| A comparison of the C = c(RC +RI)(1 � e�kDT ) model with the actual citationsfor papers published in 1983, 1986, 1990 and 1992.The thin lines are the actual data for the fourdi�erent years, the thick line is the model, usingthe RC and RI components of the reads model ofequation 4 and assuming the 3.7% yearly growthrate of Kurtz et al. (2000).

    years match themselves. While these data may be�t with a di�erent citation function, one where theRC component is zero, the slope of the required RIfunction necessary to �t these data is very muchtoo steep to �t the long term decline shown inthe data in �gure 10. To �t the data in �gure 11with a single RI function it would have to havea slope kI = 0:15 which is completely excludedby the citation data shown in �gure 10. Also thefunction would then �t only the 1990 data well,the one with the o�set peak. We believe the o�setpeak is caused by a mean three weeks delay inthe time of actual publication compared with thepublication date in that year, and that the otherthree years better represent the true function.

    Therefore we conclude that this comparisonalso provides strong evidence for a model wherecitations follow the function C = c(RC +RI)(1�e�kDT ) where the RC and RI modes are those use-age behaviors seen in the study of the readershipdata in section 5.1.

    6.1.3. Discussion

    We have shown that the citation rate as a func-tion of time is equal to a constant times the sumof two modes of the readership function. There isno a priori reason why the constant c in equation6 should not actually be a function of time. Whyshould the number of citations per read (about0.05) be constant, independent of the age of thearticle?

    Examining �gures 9 and 10 we see that if c isa function of time it cannot change by more thanabout 1% per year. This is an extraordinary re-sult, it says that within the (small) measurementerror the C function and the RC + RI functionmust be measuring exactly the same thing, themean usefulness of journal articles as a function oftime.

    The private act of reading an article entailsnone of the various sociological inuences that thepublic act of citing an article does (Seglen (1997)lists several of these factors). This suggests thatin the mean these factors do not inuence the ci-tation rate.

    Unless the sum of all the various sociologicalinuences as a function of time is exactly the sameas the usefulness of articles as a function of timethe existence of these inuences would cause c notto be constant. That c is constant means that atevery age the total e�ect of these various inuencesis zero.

    We therefore assert that we have proven thatthe normative theory of citing (Liu 1993) is truein the mean.

    6.2. The relationship for individual pa-

    pers

    6.2.1. All articles in the database

    While the citation rate may be a direct measureof usefulness in the mean, there are clearly otherfactors at play for any individual article. Certainlycitations are not the only valid measure of useful-ness; newspapers, for example, are very useful butnot often cited.

    In this section we will compare reads and citeson a per article basis. We believe this is the �rsttime data of this sort have been published.

    Table 2 shows the relationship between readsand cites for all articles in the ADS database. The

  • Submitted to The Journal of the American Society for Information Science and Technology 15

    Table 2

    Reads vs Cites in 2000 for All Articlesa

    NCnNR 0 1 2 4 8 16 32 64 128 256 512 1024 2048 total

    0 20.71 17.44 16.72 16.08 15.29 14.14 12.86 11.40 9.19 5.75 0.00 0.00 0.00 2133024

    1 12.03 11.00 11.82 12.56 12.90 12.74 11.90 10.82 9.17 6.07 0.00 � � � � � � 366212 9.32 8.55 9.65 10.83 11.87 12.43 12.25 11.35 9.82 7.19 0.00 � � � � � � 214184 5.81 5.21 6.30 7.83 9.40 10.95 11.68 11.39 9.95 7.64 3.46 0.00 � � � 102178 2.32 2.00 3.00 3.58 5.93 7.83 9.63 10.40 9.73 7.69 4.32 � � � � � � 353216 0.00 0.00 0.00 1.00 3.46 4.81 5.78 7.84 8.52 7.47 4.64 � � � � � � 89632 � � � � � � � � � � � � 1.00 2.00 3.46 4.09 5.78 6.29 4.46 0.00 � � � 19064 � � � � � � � � � � � � � � � � � � 2.00 2.00 1.58 4.09 3.58 2.00 � � � 44128 � � � � � � � � � � � � � � � � � � 0.00 1.00 � � � 0.00 0.00 1.00 0.00 8total 1714338 179938 112360 77346 52260 32681 20256 11392 4328 946 94 9 2 2205950

    aIn terms of the base 2 logarithm of the cross-tab counts.

    reads and cites refer to articles read during cal-endar year 2000 and to articles cited in paperspublished during calendar year 2000. Table 2 isformatted as follows: the horizontal direction cor-responds to the number of reads and the verticaldirection shows the number of cites. The data arebinned in factors of two, so for example, the sixthcolumn shows the number of citations for articleswhich were read between 16 and 31 times and the�fth row shows the number of reads for articleswhich were cited between 8 and 15 times. Theactual numbers in the cells are the base 2 loga-rithm of the actual counts, thus the number ofarticles which were read between 16 and 31 timesand which were cited between 8 and 15 times is27:83 = 227.

    Looking at table 2 we can see several things,�rst the vast majority of articles in the ADSdatabase (1020:71 = 77:5%) were neither read norcited during the year 2000. Nearly 97% of all ar-ticles in the database were not cited during thisperiod. The most likely number of times whichan article was cited was zero for articles whichwere read fewer than 128 times. 72,926 di�er-ent articles in the database were cited during thisperiod, while 491,612 di�erent articles were read.The large number of unread articles is because theADS contains a very large physics data set (it islarger than the astronomy data set); overwhelm-ingly the ADS is used by astronomers, the physicsdata are not yet heavily used. At the time these

    data were captured the ADS did not contain themajority of citations to articles in the physics lit-erature (it does now, however); it did contain mostof the citations from the astronomy literature tothe physics literature.

    Looking at the number of cites (the left col-umn) and �nding the most likely number of readsfor a given number of cites we �nd that the mostlikely number of reads for articles cited zero timesis zero, for articles cited 1 time is 8-15, for articlescited 2-3 times is 16-31, etc. The maximum in thereadership rate given the citation rate moves verynearly as number of citations times eight. Lookingat the cells to the right and left of these maximawe see that the average decline is somewhat lessthan a factor of two (a di�erence of one in the base2 log) for each cell, where each cell is a factor oftwo in number of reads.

    The number of citations are thus a good pre-dictor of the number of reads, with the distribu-tion approximately normal in the log, and withthe standard deviation about a factor of two (onein the base 2 log). The same cannot be said of thenumber of reads, they clearly make a very poorpredictor of the number of citations. Looking forexample at 64-127 reads the most likely numberof citations is zero, but the cell with 2-3 and thecell with 4-7 citations are each close to as full.

    With nearly seven times as many di�erent ar-ticles being read as cited, the factors which cause

  • Submitted to The Journal of the American Society for Information Science and Technology 16

    someone to read an article are not always the sameas the factors which cause someone to cite an ar-ticle. This is also apparent in that the cell withthe most likely number of reads, given a numberof cites, contains, in nearly every case, fewer pa-pers than any of the less well cited cells at thatreadership level.

    The most important factor which di�erentiatesreads from cites is youth; as parameterized by thefactor (1 � e�kDT ) in equation 6. The ratio ofreads to cites (in the year 2000, using the ADS)for papers published in 1995 is 17.4, and in 1985is 17.1. The ratio for papers published in 2000is 143. 14.5% of all reads in 2000 were to paperspublished in 2000, but only 2.5% of all cites wereto papers published during the calendar year theciting papers were published (2000). 95% of allpapers which were read 256 or more times, butwhich were cited fewer than four times were pub-lished within 1.5 years of the end of 2000.

    The second most important factor which di�er-entiates reads from cites is the type of article. In1995 the Astrophysical Journal, published 2,258full length refereed articles. During 2000 2,250(99.6%) of these were read and 1,541 (68%) werecited. Also in 1995 the Bulletin of the AmericanAstronomical Society published 1,272 unrefereedabstracts. During 2000 1,152 (91%) of these wereread, but only 22 (1.6%) were cited.

    The third factor di�erentiating cites from readsis the actual content of each article. The mostcited and read papers during 2000 make this point.The most cited paper during 2000 (Landolt 1992)is also the second most read paper during this pe-riod. Landolt (1992) is a prototypical highly citedpaper; it contains the basic calibration data for afundamental measurement technique.

    The most read article during 2000 (Trimble andAschwanden 2000) was not cited during that time.This is a prototypical newspaper type article, itcontains an extensive review of the past year's lit-erature in astrophysics, part of a yearly series ofarticles by the same authors. While it might beexpected that this article would not be cited dur-ing the year following its publication it still hasnever been cited as this present article is written,2.5 years later, nor have the two articles whichfollowed it or the article which preceded it in thesame series (Trimble and Aschwanden 1999, 2001,2002) each of which was (is) the most read article

    by ADS users in the year following its publication.

    The �nal factor which di�erentiates reads fromcites is age; the RH component of readership hasno counterpart in the citation function. It is likelythat this component of the readership representspapers which are read simply for their historicalinterest, not because they directly inuence a cur-rent research problem.

    6.2.2. The 1990|1997 Astrophysical Journal

    Table 3 is similar in format to table 2, butonly includes reads and cites during 2000 to pa-pers published in the Astrophysical Journal in theyears 1990|1997. This eliminates many of thesystematic causes of di�erentiation between readsand cites, such as youth, age, and article type seenin table 2.

    The most obvious di�erence between tables 2and 3 is that most of the articles represented intable 3 are heavily used, while most of the articlesin the whole database are not used at all. 78% ofthe articles in the whole database were not readin 2000 and 97% were not cited, but only 0.6%of the 1990|1997 Astrophysical Journal articlesin the database were not read, and 36% were notcited. Recent Astrophysical Journal articles are atthe core of modern research in astronomy.

    The number of cites in table 3 predicts the num-ber of reads to about the same accuracy (a factorof two) as seen in table 2 for the whole database.The number of reads shown in table 3, however, isalso a reasonable predictor of the number of cites(also to about a factor of two), which it is not forthe whole database.

    For the homogeneous sample of standard jour-nal articles used here, this again demonstrates thatthe normative theory of citing (Liu 1993) is truein the mean; the scatter about this mean is a mul-tiplicative factor of about two, and contains withinit all the various sociological, political, and scien-ti�c factors which cause variations for individualpapers.

    Twelve (out of 16,557) of the AstrophysicalJournal articles in table 3 were cited more timesthan they were read, 11 were cited once but notread, and 1 was cited twice and read once. Thissuggests two things. First: ADS users must repre-sent all the sub�elds of astronomy which are pub-lished by the ApJ. If there were any large sub�eld

  • Submitted to The Journal of the American Society for Information Science and Technology 17

    Table 3

    Reads vs Cites in 2000 for 1990-1997 ApJ Articlesa

    NCnNR 0 1 2 4 8 16 32 64 128 256 512 1024 2048 total

    0 6.64 7.17 8.89 10.17 10.79 10.61 9.28 6.66 2.81 � � � � � � � � � � � � 59221 3.46 4.00 6.21 8.36 9.69 10.29 9.64 7.38 3.46 � � � � � � � � � � � � 34792 � � � 0.00 4.39 6.99 9.04 10.30 10.32 8.78 5.00 � � � � � � � � � � � � 36884 � � � � � � � � � 3.91 6.52 8.84 9.96 9.38 6.70 1.58 � � � 0.00 � � � 23328 � � � � � � � � � � � � 1.00 5.04 7.88 8.70 7.37 3.17 0.00 � � � � � � 86316 � � � � � � � � � � � � � � � 0.00 2.58 6.04 6.79 5.32 � � � � � � � � � 22432 � � � � � � � � � � � � � � � � � � � � � 1.58 3.70 4.64 1.00 � � � � � � 4364 � � � � � � � � � � � � � � � � � � � � � � � � 0.00 1.00 1.00 0.00 � � � 6total 111 161 569 1622 3210 4567 3927 1860 444 79 5 2 0 16557

    aIn terms of the base 2 logarithm of the cross-tab counts.

    of astronomers who did not use the ADS, the pa-pers they read and reference would show up in ta-ble 3 as cited but not read. Second: astronomersoverwhelmingly read the articles which they refer-ence. If there were any substantial population ofarticles which were being pro forma cited, but notread they would also appear in table 3.

    6.2.3. How historical cites inuence reads

    As shown in section 5 the maximum rate atwhich articles are read occurs immediately follow-ing their publication, and then declines with time.However astronomers learn of the existence of anarticle during the �rst year or so following publica-tion, they do not learn of it by a citation, as noneexists at that time. It is reasonable to ascribe acausality in this situation: following publicationan article is cited because it has been read.

    Later in the life of an article this causation isoften reversed: an article is read because it hasbeen cited. Pointing readers of an article to otherarticles of interest to read is one of the principalfunctions of citations.

    Table 4 shows the number of reads duringthe two year period 2000|2001 for articles pub-lished in the Astrophysical Journal during theyears 1950|1959 versus the total number of ci-tations for these articles from the ADS citationsdatabase.3 Very clearly there is a strong correla-

    3In 1969 several major European astronomy journals ceased

    tion of the readership rate of articles with the totalnumber of citations for that article.

    The total number of citations can predict thereadership rate about as well as the citation ratepredicts the readership rate (to within about afactor of two) in the samples of tables 2 and 3.The readership rate is also a predictor of the to-tal number of citations, again to about a factor oftwo, similar to the ability of the readership rate of1990|1997 Astrophysical Journal articles to pre-dict their current citation rate. Also, as with the1990|1997Astrophysical Journal articles, the dis-tribution of citations as a function of reads is not asymmetric normal distribution, but is skewed to-ward fewer citations. Unlike the e�ect with thewhole database (table 2) this does not overwhelmthe predictive power of the reads, however.

    Although papers with higher total citationstend strongly to be the papers most read, this isnot due to the existence of more links to thesepapers from the on-line reference lists. By exam-ining the reads of Astrophysical Journal articles byunique individuals (as determined by their cookie)which follow these individuals accessing the refer-ence list of a di�erent article, as a function of the

    publication and Astronomy & Astrophysics began publica-tion. The ADS citation database does not yet contain ci-tations from these earlier European journals, perhaps 30%of the expected citations to the articles in table 4. We donot believe having these citations would change our con-clusions.

  • Submitted to The Journal of the American Society for Information Science and Technology 18

    Table 4

    Reads in 2000+2001 vs Total Cites for 1950-1959 ApJ Articlesa

    NCnNR 0 1 2 4 8 16 32 64 128 256 512 total0 5.93 4.25 5.58 5.81 4.75 1.00 1.00 0.00 � � � � � � � � � 2161 3.91 2.58 3.70 5.17 5.09 2.81 � � � � � � � � � � � � � � � 1112 4.32 3.70 4.86 5.83 5.83 4.32 1.58 � � � � � � � � � � � � 1994 3.70 3.32 4.64 6.61 6.58 5.52 1.58 0.00 � � � � � � � � � 2928 2.81 3.00 3.00 6.36 6.88 6.04 3.58 0.00 � � � � � � � � � 30216 � � � � � � 2.32 5.09 6.34 6.44 4.39 2.32 � � � � � � � � � 23332 � � � � � � 0.00 0.00 5.09 6.04 4.75 4.00 0.00 � � � � � � 14664 � � � � � � � � � � � � 2.00 4.09 4.81 4.32 2.81 � � � � � � 76128 � � � � � � � � � � � � � � � 0.00 3.00 3.17 3.17 0.00 � � � 28256 � � � � � � � � � � � � � � � � � � � � � 1.58 1.00 1.58 � � � 8512 � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 1.00 21024 � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 0.00 1total 116 56 129 364 451 312 104 56 19 4 3 1614

    aIn terms of the base 2 logarithm of the cross-tab counts.

    publication date of the Astrophysical Journal ar-ticle read and the time lag after the reference listwas accessed, we can estimate the fraction of ar-ticles which were read by directly clicking on areference link in the back of another article.

    For articles less than one year old, this fractionis very near zero, as there has not been time forrecently published papers to be referenced; thisfraction slowly increases to about 1% for papersten years old, 2% for papers twenty years old and3% for papers thirty or more years old. For pa-pers published in the Astrophysical Journal be-tween 1950 and 1959 only about 3% of total readsare directly caused by their being cited in on-linereference lists.

    Another possibility is that substantial reader-ship of the older literature comes from non-journalbased outside links. This is, in fact, true for thesecond most read article in table 3, Allen et al.(1995), which is pointed to by a number of edu-cational web sites, following its appearance in theNASA Astronomy Picture of the Day web site on1 November 1997 (NASA-APOD 1997).

    To check if this practice is widespread amongthe older material we take the three most citedand most read papers from the 1950's Astrophys-

    ical Journal, Salpeter (1955), Johnson & Morgan(1953), Abell (1958) and examine the details oftheir use during the month of September, 2001 us-ing the raw web server logs. These three articlesare among the most famous papers in the historyof astronomy. If articles are being accessed rou-tinely from outside educational web sites, thesearticles would be among those so accessed.

    We found no indication that outside links playa large role in the use of these articles; two reads ofJohnson & Morgan (1953) came from a static linkin a Taiwanese web site, and a couple of reads ofAbell (1958) came from the NED database (Helouet al. 1995) and another couple from the SIMBADdatabase (Wenger et al. 2000). The overwhelm-ing majority of reads of these papers came fromqueries using the ADS system.

    In addition to the three very well cited paperswe also examined the details of the query types(from the web server logs) for all AstrophysicalJournal articles published during the 1950's andread during the �rst ten days of September, 2001.We �nd the distribution of queries to be consis-tent with the distribution for the whole database(Eichhorn et al. 2000) with only one clear di�er-ence. The number of queries which have a restric-

  • Submitted to The Journal of the American Society for Information Science and Technology 19

    tive date range is higher; typical is a query withan author name and an exact year. These exactqueries more commonly retrieve highly cited pa-pers than less well cited ones.

    7. Measuring individual research produc-

    tivity using reads and cites

    As was demonstrated in section 6, reads andcites have substantially di�erent properties, butfundamentally measure the same thing, the use-fulness of an article. In this section we discusssome of the issues involved in combining measuresof cites and reads to assess the scienti�c productiv-ity of individuals. The possibility that combininginformation from the access logs of an electroniclibrary with citation information could result inimproved productivity measures has been notedbefore, notably by Kaplan and Nelson (2000).

    7.1. Joint properties of cites and reads for

    individuals

    To show some of the basic properties of mea-suring both cites and reads for individuals we usea data-set designed to show the breadth of au-thors of astronomy research papers; we includeold/young, active/inactive, tenured/nontenured,... individuals in the sample. For each individ-ual we obtain from the ADS databases a list of allhis/her papers published before May 2000 and foreach of these papers a list of the total number ofcitations up to May 2000, and a list of the totalnumber of reads in the period between January1999 and May 2000.

    The sample contains 441 individuals whose pri-mary employment is/was in the �eld of astronomy.130 of these are members of the astronomy de-partments of �ve prominent universities: Prince-ton, CalTech, University of California at Berkeley,University of California at Santa Cruz, and Cor-nell. 166 individuals are employed at the Harvard-Smithsonian Center for Astrophysics, includingmost of the long-term researchers plus everyonelisted in the Center's phone book with a Dr. titleand whose family name begins with the letters A{G. 53 individuals are winners of the Russell Prize,the highest award of the American AstronomicalSociety (many of these individuals are deceased,some for more than 40 years). Finally, 92 indi-viduals are all persons who received their PhD in

    10 100 1000 10000 10

    100

    1000

    10000

    reads

    All papers since 1990

    All papers before 1990

    Fig. 12.| Reads vs Cites for individuals in thesample described in the text. Open circles are forpapers more than ten years old and �lled circlesare for papers less than ten years old. The linesshow linear relations, with the two relations sepa-rated by a factor of ten.

    astronomy from a U.S. university during the year1980, and have published at least one paper in theADS database since 1990; slightly fewer than halfof the persons receiving U.S. PhDs in 1980 are onthe list.

    7.1.1. The read-cite diagram

    Figure 12 shows total cites vs reads for authorsfrom the sample. Each point represents the sumof papers written by a single individual; the opencircles for papers published before 1990 and the�lled circles for papers written after 1990. Oncepapers are separated by date the relationship be-tween reads and cites for papers summed over sin-gle authors appears quite linear, in agreement withequation 6. The factor of ten di�erence betweenthe two age groups is fully consistent with thatexpected using equation 4.

    It is obvious from �gure 12 that the positionof a document in a read-cite diagram is stronglydependent on the age of the document; likewisethe position of the sum of an individual author'spapers in a read-cite diagram will be strongly in-

  • Submitted to The Journal of the American Society for Information Science and Technology 20

    Fig. 13.| Cites vs Reads for individuals in thetotal sample described in the text. Filled circlesare winners of the Russell Prize; open circles arethe rest of the sample.

    uenced by the age of that individual.

    Figures 13 and 14 show this e�ect. Figure 13shows the total sample in the space of cites vsreads (with the reads and cites normalized for thenumber of authors on each paper), the �lled circlesare winners of the Russell Prize; �gure 14 is thesame, but here the �lled squares are the 1980 PhDcohort and the crosses are persons who receivedtheir PhD after 1985. The Russell Prize winnersare all substantially older than the members of the1980 cohort; clearly the three distributions formcoherent subsamples within the entire sample, andthese subsamples are substantially di�erent fromeach other.

    The two dotted lines in �gure 13 show the locusfor papers which are read once every two (line onbottom-right) or �ve (line on top-left) years for ev-ery time they have been cited. The two dots on the�ve year line at about 2,500 normalized cites rep-resent W.W. Morgan and M. Schwartschild, two ofthe most distinguished astronomers of the twenti-eth century. Each won the Russell prize aboutforty years ago, and each is currently deceased.Along with table 4 this demonstrates that paperscontinue to be read as a function of how much they

    Fig. 14.| Cites vs Reads for individuals in thetotal sample described in the text. Filled squaresare members of the 1980 PhD cohort; crosses arepersons who received their PhD after 1985; opencircles are the rest of the sample.

    have been cited. The di�erence in read rates percitation between �gure 13 and table 4 is due tothe growth of the ADS.

    Another interesting facet of �gure 13 is its pre-dictive power. Just to the right of the two yearline, at about 3,000 normalized reads and 4,000normalized cites is a group of three points, onea �lled circle representing a Russell prize winner.One of the two open circles in the triplet repre-sents W.L.W. Sargent; Professor Sargent won theRussell prize six months after this plot was made.

    7.1.2. The age-productivity model

    Using the relation between reads and cites ofequation 6 and the fact that older papers are readaccording to how much they have been cited, wecan understand these diagrams in detail. Theright and top outer envelopes of �gures 14 and13 represent the most productive astronomers atdi�erent ages. The basic shape of an individual'scareer tract in this diagram is similar to a (some-what tilted) �gure 7, with the sharp angle appear-ing at the time of retirement.

    A simple model which �ts the data is that a

  • Submitted to The Journal of the American Society for Information Science and Technology 21

    Fig. 15.| Cites vs Reads for individuals in thetotal sample described in the text. Filled circlesare Russell prize winners; the curve represents themodel for the current status of persons whose re-search is at the level of Russell prize winners. Thegray squares represent the 1980 PhD cohort, thecrosses persons who received their PhD after 1985,and circles the rest of the sample.

    person starts o� following the PhD being produc-tive, this productivity grows by about a factor oftwo during the �rst seven years of one's career andstays stable for the next twenty-three years. Atthirty years past the PhD one's productivity be-gins to decline until forty-two years past the PhDwhen one is fully retired and the number of readsdeclines to one per citation per couple of years.

    Figure 15 shows this model for the current readsvs lifetime cites of individuals currently betweenone and one hundred years past the PhD. Thecurve bends down for individuals more than 45years past the PhD because when they were ac-tive astronomy was a smaller �eld than currently,as discussed in section 6.1.2.

    The fact that the large majority of Russell prizewinners lie near the curve suggests that withinsmall factors the model is an accurate represen-tation of the productivity paths of the most pro-ductive astronomers. The tightness of the relationfor Russell prize winners is underestimated in the

    Fig. 16.| Cites vs Reads for individuals in a sub-set of the total sample described in the text. Filledgray squares are members of the 1980 PhD co-hort; gray crosses are persons who received theirPhD after 1985. The curves represent six di�er-ent models for current researchers described in thetext, the large dark �lled squares are the di�erentmodel predictions for researchers 20 years past thePhD, corresponding with the 1980 cohort.

    actual data (the �lled circles) due to the increas-ing incompleteness of the citation data base as afunction of age.

    Figure 16 is complex; the gray symbols are thesame as in �gure 15, the small gray squares are the1980 cohort and the gray crosses are persons nowworking as astronomers who received their PhDafter 1985. The solid lines are based on the modelin �gure 15. There are three groups of two mod-els, these groups are separated from each other byequal multiplicative factors of

    p10 in total pro-

    ductivity.

    The pairs of curves at each of three productivi-ties represent the �rst twenty years of two di�erentextreme career paths for individuals. The taller ofthe two curves represents the same model careeras the curve in �gure 15, the �rst twenty years ofthe career of a person who works at his/her fullcapacity; the shorter curve (the one with the large"hook") represents the career of someone with ex-

  • Submitted to The Journal of the American Society for Information Science and Technology 22

    actly the same latent productivity as the adjacenttaller curve, but who stops performing astronomyresearch after ten years (perhaps leaving the �eld).

    The six large �lled black squares represent aperson 20 years past the PhD (corresponding withthe 1980 cohort) for each of the six models (threelatent productivities, with the most productive be-ing a factor of ten times more productive than theleast productive, times two di�erent career paths).The large black squares have essentially the samedistribution as the small gray squares, thus thissimple model is able to account for the distribu-tion of the 1980 cohort in a direct way.

    There are some small gray squares, those onthe lower left in the diagram, which are not ac-counted for by the six model points. These arethe least cited/read individuals. They can be ac-counted for either by ascribing to them a lowerlatent productivity, or an earlier time when theyleft active research. The spread in latent produc-tivity of active researchers who got their PhDs af-ter 1985 (the gray crosses) is about a factor often, and is well contained within the model linesin �gure 16, thus we postulate that the naturalvariation of latent productivity in astronomers isabout a factor of ten, and is not great enough toaccount for the very low cited/read individuals,thus they must have essentially left active researchbefore ten years past the PhD. Although all in the1980 cohort were chosen to have been co-author onat least one article of any kind since 1990 (whichwould be ten years past the PhD) examination ofthe publication records of the least cited/read con-�rms that they e�ectively stopped doing researchbefore then.

    7.1.3. Discussion

    The read-cite diagram is a powerful newmethod for examining the research productivityof individuals; by comparing both measures one isable to see di�erences which would not be visibleusing either reads or cites alone. For example, ifwe take two hypothetical researchers, A and B,both A and B are the same age past the PhD, andboth A and B have the exact same total productiv-ity from their PhDs to now. Person A began moreproductive than B, and has become less productivethan B over time; person B began less productivethan A and has become more productive. Becausecitation measures more heavily weight past work,

    and readership measures more heavily weight re-cent work person A would have more citationsthan person B, while person B would have morereads than person A.

    Combined with the use of similar cohorts, espe-cially same age cohorts, for comparisons, the read-cite diagram allows an individual's research to beevaluated in a manner which is more accurate andfair than is possible with the use of citation countsalone. It does not, however, provide a measure ac-curate enough to preclude the necessity of carefuland direct examination of the record of the personbeing evaluated.

    7.2. New productivity measures for indi-

    viduals

    It is often useful to have a single number todescribe the productivity of an individual, partic-ularly when aggregations of individuals are beingstudied. Citations (Gar�eld 1979) have servedthis role almost exclusively. The existence of thereadership information allows other measures tobe developed which in many cases are better suitedto the particular study being done.

    To investigate the properties of di�erent pro-ductivity measures we have created a data-set con-taining the readership information for the calen-dar years 2000 and 2001, and the lifetime citationinformation for articles published before 1 Jan-uary 2002 for a set of astronomers which is alltenured faculty in the ten highest rated astron-omy departments in the United States (NationalResearch Council 1995) and the federal civil ser-vant astronomers at the Smithsonian Astrophysi-cal Observatory, arguably the leading U.S. federallaboratory for astronomical research. For each ofthese 308 individuals we have also found the yeartheir PhD was granted. We will call this data setthe tenured faculty sample.

    Figure 17 shows the normalized (by dividing bythe number of authors for each paper) citationsvs PhD date for individuals in the tenured fac-ulty sample. The solid line represents the citationcomponent of the individual productivity modeldiscussed in section 7.1.2, it seems to match theouter envelope of the data quite well.

    Figure 17 shows clearly the most importantdrawback of using citations as a productivity mea-sure: they are very strongly biased towards older

  • Submitted to The Journal of the American Society for Information Science and Technology 23

    Fig. 17.| Total normalized cites for tenured fac-ulty from the leading astronomical organizationsvs. the year each person received his/her PhD.The solid line is the cites model described in thetext.

    individuals. Of the top ten astronomers in �gure17, one received his PhD in the 1940s, three in the1950s, and six in the 1960s. Using the solid lineas a line of constant ability shows that a persontwenty years past the PhD (i.e. at peak produc-tivity) will only have half the citations of a similarindividual forty years past the PhD, (i.e. at re-tirement).

    Another measure of productivity is the rate atwhich an individual's papers are being read. Fig-ure 18 shows the normalized read rate (for the twoyear period) for members of the tenured facultysample vs. PhD date. The solid line representsthe readership component of the model in section7.1.2.

    Compared to the citation information the readsare much less biased as a function of age. Theyare, however, still biased, but toward younger in-dividuals. Of the top ten astronomers in �gure 18one received his PhD in the 1960s, three in the1970s, four in the 1980s, and two in the 1990s.Only one person (J.P. Ostriker) is in both lists.

    Just as citations have been used with greatpro�t in a large number of studies, the reader-

    Fig. 18.| Total normalized reads in 2000 and2001 for tenured faculty from the leading astro-nomical institutions vs. the year each person re-ceived his/her PhD. The solid line is the readsmodel described in the text.

    ship information, which is a more direct measureof current usefulness, seems also well suited for usein bibliometric studies.

    7.2.1. The SumProd statistic

    While the direct use of the readership informa-tion may eliminate much of the age bias inherentin citation analysis it introduces other diÆculties.As can be seen in �gure 8 and in equation 4 thereadership of an article declines very rapidly fol-lowing its publication. Thus, for example, sinceindividuals tend to have not more than a few veryhighly read/cited papers in their careers, the readstatistic will be substantially elevated if the sampleperiod coincides with the release of such a papercompared to a few years before or later. Citationsare much more stable to these inuences.

    It seems justi�ed to create a statistic which hasthe attributes of both the reads and the cites. Wemake one by adding the scaled reads to the cites,and name the statistic SumProd. SumProd �(f �reads+ cites)=2, where the reads and cites arenormalized by the number of authors in each pa-per, and the reads are multiplied by the factor f

  • Submitted to The Journal of the American Society for Information Science and Technology 24

    Fig. 19.| The SumProd statistic for tenured fac-ulty from the leading astronomical institutions vs.the year each person received his/her PhD. Thesolid line is the productivity model described inthe text.

    to make the numerical weight of the current read-ership rate equal to the lifetime sum of the cites,over the entire sample. Dividing by two makes thenumerical range the same as for citations. Figure19 shows this statistic for the tenured faculty sam-ple; the solid line is derived from the sum of thereads and cites lines in �gures 17 and 18. Thescaling factor for the reads was 0.8.

    Because SumProd uses two measures with sub-stantially di�erent age dependencies, it is substan-tially more stable for younger scientists than cita-tion counts, and it is also more stable for olderscientists than the readership rate.

    Of the top ten astronomers in �gure 19 tworeceived their PhDs in the 1950s, three in the1960s, three in the 1970s, and two in the 1980s.SumProd has substantially less age bias than ci-tation counts alone; for nearly all studies where ci-tation counts are an appropriate measure, we sug-gest that SumProd would be a more appropriatemeasure.

    For studies where age bias must be minimized,one can perform a model dependent age normal-ization of the data, by dividing by a model of

    Fig. 20.| The age normalized SumProd statisticfor tenured faculty from the leading astronomi-cal institutions vs. the year each person receivedhis/her PhD.

    productivity vs age, such as shown in �gures 17,18, and 19. Figure 20 shows the SumProd statis-tic from �gure 19 divided by the age-productivitymodel in that �gure, that is, the age normal-ized relative productivities of the members of thetenured faculty sample.

    7.2.2. The Read10 statistic

    SumProd is intended to measure an individ-ual's lifetime scienti�c production, where for in-dividual's who are still active, this involves anextrapolation of their current status. Once anindividual has written a paper, SumProd countsthe total citations and current reads; it is not ameasure of current activity. Indeed an individualsSumProd measure remains nearly constant in per-petuity. For example, W.W. Morgan would rankin the top quarter of the tenured faculty sampleaccording to SumProd. Professor Morgan wrotehis last scienti�c paper about twenty years ago,and died in 1994.

    We create the Read10 statistic to measure thecurrent activity of individuals. Read10 is the cur-rent readership rate for all an indi