Page 1

PubMed Central Archive at the US National Institutes of Health

Prepared by Martha R. Fishel, Deputy Chief, Public Services Division
Presented by Becky J. Lyon, Deputy Associate Director, Library Operations

10th International Conference of Medical and Health Librarians

Cluj-Napoca, Romania

September 19, 2006

Page 2

Focus of Today’s Presentation

1. What is PubMed Central?
2. What material is included? (current content, scanned back files, manuscripts, books)
3. How does the content get added?
4. How is it used?
5. Value-added features

Page 3

What is PubMed Central?

Digital archive of life sciences journals
  – includes health policy, bioinformatics and other fields
Participation is open to journals that are:
  – covered by a major abstracting/indexing service
  – or, that have 3 editorial board members with current grants from major non-profit funding agencies
Free access to full-text articles and supporting data
Integrated with PubMed and other bibliographic and factual databases in NCBI’s Entrez network

Page 4

PMC Basic Policy

Journal deposits an authoritative electronic copy that meets PMC data quality standards
  – full-text XML
  – original high-resolution graphics
  – PDF
  – supplementary data
Journal may delay free access (up to 2 years)
  – research articles usually free in one year or less
Copyright is retained by publisher or author
Deposits – and free access permissions – are permanent
  – journal may stop depositing new material but may not withdraw material already deposited

Page 5

PMC Numbers – August 2006

Articles available: 696,000
  – 60 percent from back issue digitization
  – Oldest content – Trans Am Ophthalmol Soc – is from 1865
  – Total items incl. issue covers, corrections, etc.: 758,000
Unique IP addresses: 2.07 million
  – Estimated unique users (1.5x): 2.35 million
Articles retrieved – HTML full text, PDF, or scanned article summary: 7.7 million
Total page views, incl. searches: 11.5 million

Page 6

How content gets in

1. SGML/XML from Publishers
2. Back Issue Scanning
3. NIH Manuscripts from Public Access
4. Books and other non-article content

Page 7

Current Content – XML format

1. Each journal workflow is different, and frequency depends on publishing cycle of the journal
2. PMC accepts SGML or XML – must meet certain criteria
3. Problems are often present (e.g. special characters that must be mapped to NLM DTD)
4. Time consuming and error prone to adjust data in every issue

Page 8

Text Processing

[Flow diagram. Boxes: Source SGML; Source XML; OpenSX; XML; Resolve Named Character Entities; Parse (shown twice); Source-specific XSL Transform to PMC Style; Validate with PMC StyleChecker; Load to PMC QA]

These steps can take a lot of time and cause NLM to reject or send content back for rework. All articles must pass validation and the style checker.
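As a rough illustration of three of the boxes on this slide (resolve named character entities, parse, validate), here is a minimal Python sketch using only the standard library. It is not PMC's actual tooling: the function names, the simplified element names (`article`, `front`, `article-title`) and the toy style rules are all assumptions made for the example, and the source-specific XSL transform step is left out because the Python standard library has no XSLT processor.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def resolve_named_entities(source: str) -> str:
    """Rewrite named entities such as &eacute; as numeric references."""
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave it alone; the parser will accept or reject it later.
            return match.group(0)
        return f"&#{name2codepoint[name]};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", substitute, source)


def check_style(root) -> list:
    """Hypothetical stand-in for a style checker: returns a list of problems."""
    problems = []
    if root.tag != "article":
        problems.append("root element must be <article>")
    if root.find("./front/article-title") is None:
        problems.append("missing front/article-title")
    return problems


def process(source: str):
    """Resolve entities, parse, then run the toy style check."""
    resolved = resolve_named_entities(source)
    try:
        root = ET.fromstring(resolved)
    except ET.ParseError as err:
        # Content that fails to parse would be sent back for rework.
        return None, [f"parse error: {err}"]
    return root, check_style(root)


sample = (
    "<article><front>"
    "<article-title>Caf&eacute; &amp; tea</article-title>"
    "</front></article>"
)
root, problems = process(sample)
print(problems)
```

An empty problem list means the sample passed both the parse and the toy style check; anything else would correspond to content being rejected or returned to the publisher.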

Page 9

How content gets in

1. SGML/XML from Publishers
2. Back Issue Scanning
3. NIH Manuscripts from Public Access
4. Books and other non-article content

Page 10

Back Issue Digitization

Objective: Create a complete digital archive of PMC journals back to volume 1
Cover-to-cover digital copy of everything up to where journal began producing electronic copy
  – (includes articles, covers, TOCs, advertisements and administrative matter)
Publisher gets free, unencumbered copy

Page 11

Scanning Highlights

Currently in Production:
American Journal of Public Health from 1873
Biophysical Journal (1960)
Canadian Veterinary Medical Association Titles
Environmental Health Perspectives (1972)
Health Services Research (1966)
Public Health Reports (1896)
Transactions of the American Ophthalmological Society (1864)

Page 12

Production Planning

Contractor’s capacity at 5 facilities is approximately 250,000 to 300,000 pages per month
Production is currently planned through December 2006
Schedules are adjusted monthly based on deliveries
New shipments are sent from NLM every 6-8 weeks

Page 13

Progress to Date

83 Journal Titles Included
5,606,059: Pages collected for processing
3,500,000: Page images delivered
297,000: XML Citations created
470,000: Scanned articles in PMC

Page 14

Digitized Samples

Page 15

Wellcome Trust Collaboration

September 2004 - Cooperative agreement signed
Expect to complete an additional 2.7 million pages
  – Biochemical Journal - Largest archive to date – 350,000 pages scanned
  – Annals of Surgery
  – British Journal of Pharmacology
  – Journal of Physiology
  – Journal of Anatomy

Page 16

Challenges To Date

Locating old, rare copies in good condition
Assessing donations for completeness (covers, TOCs)
Scanning and delivering fill-in pages at NLM
Feeding the pipeline
Quality Assurance (understanding requirements)
Information tracking
Every title is different

Page 17

Related Activities

Citations for scanned articles are being phased into PubMed
Some completed archives delivered to their publishers
  – ASM titles (HighWire)
  – Plant Physiology (HighWire)
  – Biochemical Journal (Portland Press)

Page 18

How content gets in

1. SGML/XML from Publishers
2. Back Issue Scanning
3. NIH Manuscripts from Public Access
4. Books and other non-article content

Page 19

NIH Public Access Program

The National Institutes of Health (NIH) Policy on Enhancing Public Access to Archived Publications Resulting from NIH-Funded Research (Public Access Policy), which took effect on May 2, 2005, requests and strongly encourages all investigators to make their NIH-funded peer-reviewed, author's final manuscript available to other researchers and the public through the NIH National Library of Medicine's (NLM) PubMed Central (PMC) immediately after the final date of journal publication. The NIH has developed a password-protected, Web-based NIH Manuscript Submission (NIHMS) system to implement the NIH Public Access Policy.

Page 20

NIH Manuscript Submission System

Author deposits began May 2, 2005
Voluntary submissions by NIH-funded authors
Third party deposits began in July 2005
August 2006 - Just under 5% of qualifying authors are submitting manuscripts

Page 21

NIH Public Access Policy - Status

As of February 10, 2006, two bills are in Congress that mandate participation in the NIHMS
Neither bill is expected to pass in this legislative session

Page 22

NIH Manuscript Submission

http://www.nihms.nih.gov/

Page 23

How content gets in

1. SGML/XML from Publishers
2. Back Issue Scanning
3. NIH Manuscripts from Public Access
4. Books and other non-article content

Page 24

Bookshelf

The books may be accessed in two ways:

(1) searched directly using any search term or phrase (in the same way as the bibliographic database PubMed);

(2) found through links to PubMed abstracts. Each PubMed abstract has a "Books" button that displays a facsimile of the abstract, in which some phrases are hypertext links.

Page 25

NCBI Bookshelf

NLM prefers the content of the book to be supplied in SGML
The book text files are converted into XML according to the NCBI Book Document Type Definition (DTD)

Page 26

PMC – Added Value

Separate, high resolution images for all illustrations
Links from the PMC article to its bibliographic citation in PubMed
Links from the references in the bibliography to the citation in PubMed
Links from the original article to corrections and retractions and vice versa
Links to PubMed related articles
Links to PubMed articles by each author
Links to related resources – such as chemical compounds and protein sequences

Page 27

PMC International

Collaboration between NLM, publishers in PMC and international partners
Portable PMC (pPMC)
Literature Archiving Software Suite
  – pPMC
  – NLM XML DTD Suite
  – NLM XML Authoring Tool
  – Portable NIHMS (pNIHMS)

Page 28

PMC News – August 2006

Springer ‘Open Choice’ and Blackwell ‘Online Open’ articles now coming in to PMC
Also working with OUP ‘Oxford Open’
Detailed tagging guidelines released for NLM Journal Publishing DTD
Library of Congress and British Library are adopting NLM Journal DTD as a standard

Page 29

Links

PubMed Central
http://www.pubmedcentral.gov

NLM DTDs and documentation
http://dtd.nlm.nih.gov

Page 30

Thank you!

Contact: Martha Fishel

[email protected]@nlm.nih.gov

Page 31

Costs

Costs vary title to title – factors include:
Quantity of color and/or grayscale images vs. straight black & white text
Quantity of new XML citations prepared
Errors in the deliverables (e.g. image quality, accuracy of XML, OCR)
Media (number of DVDs)

Page 32

Costs – real example

Biochemical Journal – (scanned 1906-1995)

287,000 pages
Total cost = $152,000 USD*
Approximately $0.53 per page

*Excludes project management costs at NLM and Wellcome/JISC
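The per-page figure follows directly from the two numbers on this slide. A short Python check, using only the slide's total cost and page count (the variable names are illustrative):

```python
# Figures taken from the slide: Biochemical Journal, scanned 1906-1995.
total_cost_usd = 152_000
pages_scanned = 287_000

# Cost per page, excluding project management costs as noted on the slide.
cost_per_page = total_cost_usd / pages_scanned
print(f"${cost_per_page:.2f} per page")
```

Dividing the total by the page count gives roughly 53 cents per page.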

Page 33

Sample Pages