Counting on OpenDOAR

Counting on OpenDOAR Peter Millington SHERPA Technical Development Officer CRC, University of Nottingham [email protected] .uk


Post on 23-Mar-2016



TRANSCRIPT

Page 1: Counting on  Open DOAR

Counting on OpenDOAR

Peter Millington
SHERPA Technical Development Officer

CRC, University of Nottingham
[email protected]

Page 2: Counting on  Open DOAR

http://www.opendoar.org/

Background to OpenDOAR

• Created in 2005
  – Lists over 2320 repositories (2013-07-02)

• Manually validated
  – High quality…
  – …but we didn’t like to talk about the record counts

• Counts not updated after the initial entry
  – Unless prompted by users

• Fixed in 2012
  – Record counts updated about every 2 weeks

Page 3: Counting on  Open DOAR


Established counting methods

• Manual inspection
  – Labour-intensive

• Counting OAI-PMH record identifiers
  – Inefficient
    • Handling big files
    • Iterative
  – Unreliable
    • File size limits and timeouts
  – Inaccurate
    • Need to account for deleted records

Page 4: Counting on  Open DOAR


How difficult can it be?

• SELECT COUNT(*) FROM repository;
  – Still fast even with added complexity
  – Statuses, breakdown by date, etc.

• The number is often there on the web page
  – Headline number, or
  – “x to y of z” tally, or
  – Adding up numbers on a “Browse by year” page
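The point that the repository itself can count cheaply can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name `repository` comes from the slide; the `status` and `year` columns and the sample rows are assumptions made up for illustration, not a real repository schema.

```python
import sqlite3

# Hypothetical schema: one row per record, with a status and a deposit year.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repository (id INTEGER PRIMARY KEY, status TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO repository (status, year) VALUES (?, ?)",
    [("live", 2011), ("live", 2012), ("deleted", 2012), ("live", 2013)],
)


def total_records(conn):
    """Headline count: every row, whatever its status."""
    return conn.execute("SELECT COUNT(*) FROM repository").fetchone()[0]


def live_by_year(conn):
    """Added complexity: live records only, broken down by year."""
    rows = conn.execute(
        "SELECT year, COUNT(*) FROM repository "
        "WHERE status = 'live' GROUP BY year ORDER BY year"
    )
    return dict(rows.fetchall())
```

An outside service such as OpenDOAR has no access to this query, which is why the remaining slides look at where the resulting number surfaces on the web page.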

Page 5: Counting on  Open DOAR


OpenDOAR’s Strategy

• Avoid OAI-PMH whenever possible
• Use other m2m interfaces, if available/suitable
• Screen scrape numbers from web pages
• If all else fails, use manual methods

• Counts for “full texts” as well, where possible
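The strategy reads as an ordered fallback chain, which could be sketched roughly as follows. This is only an illustration of the idea, not OpenDOAR's actual code: the function name `first_available_count` and the convention that each method is a zero-argument callable returning a count or `None` are assumptions.

```python
def first_available_count(methods):
    """Try each (name, callable) pair in priority order.

    Returns (name, count) for the first method that yields a count.
    If every method fails or returns None, falls back to manual counting.
    """
    for name, method in methods:
        try:
            count = method()
        except Exception:
            # Timeouts, HTTP errors, parse failures: move on to the next method.
            continue
        if count is not None:
            return name, count
    return "Manual counting", None
```

The caller decides the order, so the preferred interfaces go first and OAI-PMH harvesting goes near the end of the list.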

Page 6: Counting on  Open DOAR

Some examples…

Page 7: Counting on  Open DOAR


Generic n records

Documents avec texte intégral 229181
(Documents with full text: 229181)
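A "generic n records" scrape can be sketched as a search for the first number that follows a known label. The function name `count_after_label` and the 40-character gap limit are assumptions chosen for illustration; a real harvester would need a per-repository label setting.

```python
import re

# A number, optionally written with thousands separators (1,234 or 1.234 or 1 234).
NUMBER = r"(\d{1,3}(?:[,.\s]\d{3})+|\d+)"


def count_after_label(page_text, label):
    """Return the first integer that follows `label`, or None if absent."""
    pattern = re.escape(label) + r"\D{0,40}?" + NUMBER
    match = re.search(pattern, page_text)
    if match is None:
        return None
    # Drop separators before converting.
    return int(re.sub(r"\D", "", match.group(1)))
```

For the page text on this slide, `count_after_label("Documents avec texte intégral 229181", "Documents avec texte intégral")` yields 229181.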

Page 8: Counting on  Open DOAR


Generic x to y of z counters

DSpace Browse Counter is a special case

Showing results 1 to 20 of 6727
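An "x to y of z" tally can be picked out with a regular expression that keeps only z. This is a minimal sketch; the pattern below is an assumption about common wording and would need adjusting per site and language.

```python
import re

# Matches e.g. "1 to 20 of 6727" or "1-20 of 6,727".
TALLY = re.compile(
    r"(\d[\d,]*)\s*(?:to|-|–)\s*(\d[\d,]*)\s+of\s+(\d[\d,]*)",
    re.IGNORECASE,
)


def total_from_tally(page_text):
    """Return z from an 'x to y of z' tally, or None if no tally is found."""
    match = TALLY.search(page_text)
    if match is None:
        return None
    return int(match.group(3).replace(",", ""))
```

Applied to the text on this slide, `total_from_tally("Showing results 1 to 20 of 6727")` gives 6727.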

Page 9: Counting on  Open DOAR

DSpace totalCnt Add-on

NCKUR 中的社群 [40782/74662] [ 全文筆數 / 總筆數 ]
(Communities in NCKUR [40782/74662] [ full-text records / total records ])
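The bracketed pair in this example carries both the full-text count and the total, so one pattern can return both. The sketch below assumes the `[full texts/total]` layout shown on the slide; the function name is hypothetical.

```python
import re

# First "[number/number]" pair on the page.
PAIR = re.compile(r"\[\s*(\d+)\s*/\s*(\d+)\s*\]")


def full_text_and_total(page_text):
    """Return (full_text_count, total_count), or None if no pair is found."""
    match = PAIR.search(page_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The second bracket on the slide holds the labels rather than digits, so the pattern skips it.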


Page 10: Counting on  Open DOAR

Generic Sum of List Counters

EPrints count Browse List is a special case

Add up the numbers in brackets
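Adding up the bracketed numbers on a browse page might look like the sketch below. The sample text in the usage note is invented for illustration; only numbers inside round brackets are summed, so the year labels themselves are ignored.

```python
import re


def sum_bracketed_counts(page_text):
    """Sum every number that appears alone inside round brackets."""
    counts = re.findall(r"\((\d[\d,]*)\)", page_text)
    return sum(int(c.replace(",", "")) for c in counts)
```

For a hypothetical browse list reading `2013 (12) 2012 (30) 2011 (1,008)`, the function returns 1050.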

Page 11: Counting on  Open DOAR

EPrints V.3 Counter

http://eprints.nonesuch.ac.uk/cgi/counter

Number of items

Page 12: Counting on  Open DOAR

Generic Sum of Numbers

Add up the numbers

Page 13: Counting on  Open DOAR

Generic HTML tag counting

Count item tags in HTML source code
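Counting item tags can be done with Python's standard `html.parser` module, as in this minimal sketch. Which tag and class mark an item differs per site, so the `tag` and `css_class` parameters are assumptions to be configured per repository.

```python
from html.parser import HTMLParser


class ItemCounter(HTMLParser):
    """Counts start tags with a given name and, optionally, a given class."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.count += 1


def count_item_tags(html, tag, css_class=None):
    parser = ItemCounter(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.count
```

The whole page has to be downloaded and parsed to get one number, which fits the slide deck's later finding that tag counting is among the slower methods.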

Page 14: Counting on  Open DOAR


Counting multiple pages

• Separate pages per letter, document type, etc

• Issues with Greenstone – lack of predictability

Page 15: Counting on  Open DOAR

OAI-PMH ListIdentifiers: Simple

http:// ... /oai?verb=ListIdentifiers&metadataPrefix=oai_dc

Count these

No resumptionToken
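When the whole list arrives in one response, counting reduces to counting `<header>` elements, skipping those flagged `status="deleted"`. The sketch below uses the standard OAI-PMH 2.0 namespace; the sample response is invented for illustration.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Illustrative response only; identifiers are made up.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header status="deleted"><identifier>oai:example.org:2</identifier></header>
    <header><identifier>oai:example.org:3</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""


def count_live_identifiers(xml_text):
    """Count record headers, leaving out records marked as deleted."""
    root = ET.fromstring(xml_text)
    return sum(
        1 for header in root.iter(OAI + "header")
        if header.get("status") != "deleted"
    )
```

Here `count_live_identifiers(SAMPLE)` counts two live records out of three headers.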

Page 16: Counting on  Open DOAR

OAI-PMH ListIdentifiers: Iterative

resumptionToken for blocks of identifiers

<resumptionToken>193114FUS</resumptionToken>
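The iterative case repeats the request, passing back each resumptionToken until the server returns an empty one or none at all. The following is a sketch under assumptions: the `fetch` parameter, the `max_requests` guard and the 60-second timeout are illustrative choices, not part of the protocol or of OpenDOAR's harvester.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def http_fetch(base_url, params):
    """Default fetcher: one HTTP GET per block of identifiers."""
    with urlopen(base_url + "?" + urlencode(params), timeout=60) as response:
        return response.read()


def count_iteratively(base_url, fetch=http_fetch, max_requests=10000):
    """Follow resumptionTokens, summing non-deleted headers in each block."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
    total = 0
    for _ in range(max_requests):
        root = ET.fromstring(fetch(base_url, params))
        total += sum(
            1 for header in root.iter(OAI + "header")
            if header.get("status") != "deleted"
        )
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return total  # last block reached
        # A resumptionToken is sent on its own, without metadataPrefix.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}
    raise RuntimeError("too many requests; giving up")
```

Every block costs a full round trip, which is why this method scales so poorly for large repositories.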

Page 17: Counting on  Open DOAR

OAI-PMH completeListSize

<resumptionToken completeListSize="89805"

Bingo!
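When the server supplies `completeListSize`, a single request is enough: read the attribute from the first resumptionToken and stop. The attribute is optional in OAI-PMH, so the sketch below returns `None` when it is missing. The sample element is illustrative, combining values from the slides with an invented `cursor` attribute.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Illustrative response fragment only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <resumptionToken completeListSize="89805" cursor="0">193114FUS</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def complete_list_size(xml_text):
    """Return completeListSize from the resumptionToken, or None if absent."""
    root = ET.fromstring(xml_text)
    token = root.find(".//" + OAI + "resumptionToken")
    if token is None:
        return None
    size = token.get("completeListSize")
    return int(size) if size is not None and size.isdigit() else None
```

A caller would fall back to iterative counting only when this returns `None`.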

Page 18: Counting on  Open DOAR


Twelve count harvesting methods

• Generic
  – Generic n records
  – Generic x to y of z counters
  – Generic Sum of List Counters
  – Generic HTML tag counting
  – Generic Sum of Numbers

• DSpace
  – DSpace Browse Counter
  – DSpace totalCnt Add-on

• EPrints
  – EPrints count Browse List
  – EPrints V.3 Counter

• OAI-PMH ListIdentifiers
  – Simple
  – Iterative
  – completeListSize

• Manual counting

Page 19: Counting on  Open DOAR

Efficiency of the methods

[Bar chart: microseconds per item for each method (horizontal axis 0 to 25000), with separate series for big files and small files. Methods, in chart order: Generic Sum of Numbers; Generic n records; OAI-PMH completeListSize; EPrints count Browse List; Generic Sum of List Counters; EPrints V.3 Counter; DSpace totalCnt Add-on; Generic x to y of z counter; DSpace Browse Counter; OAI-PMH Simple count; Generic HTML tag counting; OAI-PMH Iterative count.]

Iterative OAI-PMH so much slower

Page 20: Counting on  Open DOAR

Relative Frequency of Methods

[Pie chart: share of repositories counted by each method. Slice labels, in extracted order: 41%, 3%, 11%, 4%, 6%, 1%, 0%, 18%, 8%, 3%, 0%, 0%, 5%. Legend, in extracted order: DSpace Browse Counter; DSpace totalCnt Add-on; EPrints V.3 Counter; EPrints count Browse List; OAI-PMH completeListSize; OAI-PMH Simple count; OAI-PMH Iterative count; Generic n records; Generic Sum of List Counters; Generic HTML tag counting; Generic x to y of z counter; Generic Sum of Numbers; Manual counting.]

Page 21: Counting on  Open DOAR


Ugent
Numbers galore

DSpace and EPrints
Easily scrapeable counts

Page 22: Counting on  Open DOAR


Count harvesting issues

• No counts visible or harvestable
• Static counts – often approx. – e.g. “over 2m items”
• Connectivity issues
  – Infrastructure limitations – e.g. heavy internet traffic
  – HTTP 401 (unauthorised) & 403 (forbidden) errors
• Data hidden in include files (e.g. JavaScript)
  – Not visible in View Source code
• No direct URL known for the pages with counts
  – Only accessible to human navigators
• Remodelled websites – requiring updated settings

Page 23: Counting on  Open DOAR


Help OpenDOAR count your repository

• Display record counts on your home page
  – Using distinctive wording & highlighting
  – Ideally in <div id="[ID]"> or <span id="[ID]"> tags
• Ensure numbers can be seen in View Source code
• Ensure pages & files are not blocked to robots
  – Grant read-only access if necessary
• Implement OAI-PMH properly
  – Return ListIdentifiers in chunks – not one big file
  – Include completeListSize in the resumptionToken
• Tell us about any changes, so we can update settings
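To show why an `id` attribute helps, here is a minimal sketch of scraping a count from an element with a known id, using only the standard library. The id value `record-count` in the usage note is hypothetical (the slide leaves it as `[ID]`), and the parser assumes the element contains no unclosed void tags such as `<br>`.

```python
import re
from html.parser import HTMLParser


class IdTextGrabber(HTMLParser):
    """Collects the text inside the first element whose id matches."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0       # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested tag inside the target
        elif not self.chunks and dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def count_from_id(html, element_id):
    """Return the digits found in the element with this id, or None."""
    parser = IdTextGrabber(element_id)
    parser.feed(html)
    parser.close()
    digits = re.sub(r"\D", "", "".join(parser.chunks))
    return int(digits) if digits else None
```

With a page containing `<span id="record-count">12,345</span>`, `count_from_id(page, "record-count")` gives 12345 regardless of the wording around it, so a site redesign is less likely to break the harvest.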

Page 24: Counting on  Open DOAR


Ideas for the Future

• Comparing counts from OpenDOAR & ROAR
  – E.g. Nottm ePrints: 1,239 < 1,277
  – E.g. HAL-Inserm: 7,498 > 2,773

• OpenDOAR
  – Growth charts
  – Full text counts

• Extending OAI-PMH
  – Statistical features
  – Trial PSH