iassit kansa presentation

Post on 29-Nov-2014

DESCRIPTION

A presentation given at the "Data Stewardship: Increasing the Integrity and Effectiveness of Science and Scholarship" session on Friday, June 8, 2012 at the IASSIST 2012 conference in Washington, DC. This presentation introduced data publishing, using a social science (archaeology) case study to explore editorial processes and dissemination outcomes that increasingly demand “Linked Data” capabilities.

TRANSCRIPT

Case-Study: Publishing to the “Web of Data” in Archaeology

Quality and Workflows

Eric Kansa UC Berkeley / OpenContext.org

Unless otherwise indicated, this work is licensed under a Creative Commons Attribution 3.0 License <http://creativecommons.org/licenses/by/3.0/>

“Small Science” data sharing is hard:

(1) Complexity

(2) Scalability

(3) Ethics, cultural property claims, IP

(4) Incentives

(5) Preservation

Image Credit: “Grand Canyon NPS” via Flickr (CC-By) http://www.flickr.com/photos/grand_canyon_nps/5975537378/

Thousand Flowers

● Open Context: Open access, open licensed data for archaeology

● Archiving by California Digital Library

● Persistent Identifiers (DOIs, ARKs)

● Web services

● NSF/NEH links for data management plans

Thousand Flowers

Fills a Gap:

Most data sources are institutional. Open Context publishes individual, small group contributions

Challenge: Diverse contributions, needing lots of work to clean up and “link” to the Web of Data

• 3-year project Oct 2010 – Sep 2013

• Funded with a National Leadership Grant from the Institute of Museum and Library Services, LG-06-10-0140-10, “Dissemination Information Packages for Information Reuse”

• Ixchel Faniel, PI & Elizabeth Yakel, Co-PI

http://www.dipir.org

DIPIR Collaboration

The Big DIPIR Questions

Research Questions

1. What are the significant properties of data that facilitate reuse by the designated communities at the three sites?

2. How can these significant properties be expressed as representation information to ensure the preservation of meaning and enable data reuse?

Open Context Interviewees

• 22 Ph.D. or graduate students interviewed

– 13 men

– 9 women

• Novices / Experts

– 19 experts

– 3 novices

• Interviewees who were curators, or professors with a curatorial role: 6

Raw Data is Unappetizing?

Data Documentation Practices

“I use an Excel spreadsheet…which I … inherited from my research advisers. …my dissertation advisor was still recording data for each specimen on paper when I was in graduate school so that's what I started …then quickly, I was like, "This is ridiculous." … I just started using an Excel spreadsheet that has sort of slowly gotten bigger and bigger over time with more variables or columns…I've added …color coding…I also use…a very sort of primitive numerical coding system, again, that I inherited from my research advisers…So, this little book that goes with me of codes which is sort of odd, but …we all know that a 14 is a sheep.” (CCU13)

A long way to go before we get usable, intelligible data

Sometimes data is better served cooked.

Thousand Flowers

● Clean-up and document contributed data

● Map to ArchaeoML (general ontology)

● Mint URIs to entities (potsherds, projects, contexts, people)

● Link to important vocabularies / collections (Pleiades, Encyclopedia of Life)

● Working on CIDOC-CRM (RDF) representations (not straightforward)
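The clean-up and URI-minting steps above can be sketched roughly as follows. This is only an illustration, not Open Context's actual pipeline: the base URI, the `Entity` class, and every function name are hypothetical, and the code book contains only the one entry ("14 is a sheep") mentioned in the interview quote.

```python
from dataclasses import dataclass
from urllib.parse import quote

# Hypothetical code book; only "14 = sheep" comes from the interview quote.
CODE_BOOK = {14: "sheep"}

# Placeholder base URI (assumption), not Open Context's real URI scheme.
BASE_URI = "https://example.org/"


@dataclass
class Entity:
    uri: str
    label: str
    kind: str


def clean_label(raw: str) -> str:
    """Collapse stray whitespace and lowercase a contributed value."""
    return " ".join(raw.split()).lower()


def decode_taxon(code: int) -> str:
    """Turn a private numeric code into a readable label, or flag it."""
    return CODE_BOOK.get(code, f"unknown code {code}")


def mint_uri(kind: str, local_id: str) -> str:
    """Build a URL-safe identifier for one entity from its local ID."""
    return f"{BASE_URI}{kind}/{quote(local_id.strip(), safe='')}"


def publish_row(specimen_id: str, taxon_code: int) -> Entity:
    """Clean one contributed row and give it a web identifier."""
    label = clean_label(decode_taxon(taxon_code))
    return Entity(mint_uri("specimens", specimen_id), label, "specimen")


record = publish_row(" Bone 0042 ", 14)
print(record.uri, record.label)
```

Codes missing from the code book are flagged rather than guessed, so an editor can ask the contributor what they mean.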

Open Context: Record

● XHTML + RDFa (Dublin Core, Open Annotation, etc.)

● XML (ArchaeoML)

● Atom

● RDF (draft CIDOC)

● Link to GitHub versioned file

Open Context: Record

Open Context: Visualization of Data Linked to the EOL

My Precious Data

Image Credit: “Lord of the Rings” (2003, New Line), All Rights Reserved Copyright

Data sharing as publication

Data Publishing

Data Quality and Standards Alignment

(1) Check consistency

(2) Edit functions

(3) Align to common standards (“Linked Data” if applicable)

(4) Issue tracking, version control
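As a rough illustration of the "check consistency" step, the sketch below groups column values that differ only in case, spacing, punctuation, or word order. It is loosely modeled on the key-collision ("fingerprint") style of clustering, not a reproduction of any tool's exact algorithm; the function names and sample column are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Normalize a value so that trivial spelling variants collide."""
    text = value.strip().lower()
    # Drop ASCII punctuation, then sort and de-duplicate the tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def find_inconsistencies(values):
    """Return groups of values that share a fingerprint but differ in spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical contributed column with inconsistent taxon labels.
column = ["Sheep", "sheep ", "Sheep, Goat", "goat sheep", "Cattle"]
print(find_inconsistencies(column))
```

Each reported group is a candidate for an editor to merge into a single spelling; the sketch only flags variants and does not decide which one is correct.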

Publishing

Tools of the Trade

(1) Google Refine (check, edit, consistency)

(2) Mantis (issue-tracker, coordinate edits, metadata creation)

Publishing

Tools of the Trade

(1) Domain scientists (Editorial Board) check data

(2) Iterative “coproduction” between contributors and editors

Publishing

Project Metadata

Column Descriptions

Web of Data (2011)

Main Contributors:

● Institutions (esp. government)

● Thematic collections / projects

Entity Reconciliation

(1) With Google Refine

(2) Implemented, EOL and Pleiades (gazetteer)

(3) Use existing mappings to improve future reconciliation
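One way to picture point (3) is a mapping table that remembers editor-approved matches, so that later datasets reconcile automatically. The sketch below is an assumption about how this could work, not the project's implementation: the function names are hypothetical, and the URIs are placeholders rather than real EOL or Pleiades identifiers.

```python
from typing import Optional

# Hypothetical mapping table built from earlier, editor-approved matches.
# The URIs are placeholders, not real EOL or Pleiades identifiers.
known_mappings = {
    "sheep": "https://example.org/vocab/taxon/sheep",
}


def normalize(term: str) -> str:
    """Collapse whitespace and lowercase a term before lookup."""
    return " ".join(term.split()).lower()


def reconcile(term: str, mappings: dict) -> Optional[str]:
    """Return a previously approved URI for the term, if one exists."""
    return mappings.get(normalize(term))


def reconcile_column(terms, mappings):
    """Split terms into matched links and a queue for editorial review."""
    matched, review_queue = {}, []
    for term in terms:
        uri = reconcile(term, mappings)
        if uri is None:
            review_queue.append(term)
        else:
            matched[term] = uri
    return matched, review_queue


def approve(term: str, uri: str, mappings: dict) -> None:
    """Record an editor's decision so later datasets match automatically."""
    mappings[normalize(term)] = uri


matched, review_queue = reconcile_column(["Sheep", "  sheep", "Goat"], known_mappings)
approve("Goat", "https://example.org/vocab/taxon/goat", known_mappings)
rematched, remaining = reconcile_column(review_queue, known_mappings)
print(matched, review_queue, rematched, remaining)
```

Unmatched terms go to a review queue instead of being linked automatically, which keeps a human editor in the loop for new vocabulary.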

Publishing

● CDL Archiving Service

● EZID for persistent identity: DOIs (aggregate resources), ARKs (granular resources) and Merritt Repository

● Helps build trust in community

● Platform / Services disciplinary communities can use for “Data Publishing”

● Different communities work out semantic/interoperability needs, editorial policies, incentives, etc.

University of California (System) Repository,

All disciplines (UC-funded library, grants)

CDL as Infrastructure

Future data publisher

eScholarship: UC’s OA Publishing Platform

Platform for traditional publishing

Also supports new genres

Outcomes of Publishing Data:

(1) Communicate and set expectations about content and quality

(2) Organize workflows to improve data quality and usability

(3) Make “datasets” first-class citizens in the world of scholarly communications

Summary

Final Thoughts

Publication needs to evolve!

(1) Participating in Linked Data is a great goal, but far removed from most everyday practice

(2) Researchers need help.

(3) 19th-century publication norms are poorly suited to 21st-century methods, research, and public goals
