Dirk Roorda, coordinator infrastructure

Post on 24-Mar-2016

http://www.dans.knaw.nl

Dirk Roorda, coordinator infrastructure

Overview

Part 1: The rising role of data
Part 2: The free use of data
Part 3: The care for data
Part 4: The re-use of data

Part 1: The rising role of data

http://en.wikipedia.org/wiki/Exabyte

Internet size (May 2009):
500 EB
500,000 PB
500 million TB
500 million fat USB disks
500 billion memory cards of 1 GB
70 memory cards per person
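The equivalences on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes and a 2009 world population of roughly 6.8 billion; neither assumption is stated on the slide, and the constant names are chosen for illustration only.

```python
# Decimal (SI) storage units, in bytes. Assumption: the slide uses SI prefixes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

internet_bytes = 500 * EB  # figure taken from the slide (May 2009)

# Assumption: world population in 2009, roughly 6.8 billion (not from the slide).
WORLD_POPULATION_2009 = 6_800_000_000

petabytes = internet_bytes // PB
terabytes = internet_bytes // TB
gigabyte_cards = internet_bytes // GB
cards_per_person = gigabyte_cards / WORLD_POPULATION_2009

print(f"{petabytes:,} PB")
print(f"{terabytes:,} TB")
print(f"{gigabyte_cards:,} memory cards of 1 GB")
print(f"about {cards_per_person:.0f} memory cards per person")
```

Under these assumptions the per-person figure lands in the low seventies, in line with the rounded value of 70 on the slide.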

Data deluge

http://www.datadeluge.com/

http://en.wikipedia.org/wiki/File:Tree_of_life_SVG.svg

http://tolweb.org/tree/

Where does it come from?

• Instruments
  • satellites, sensors, DNA sequencing
• Records
  • administrations, censuses, surveys
• Digitisation
  • the analog legacy
• Hobby
  • pictures, movies, genealogy
• Integration
  • better interoperability of existing data

The driving force

Information and Communication Technology

Babbage Analytical Engine, 1870

A datacenter

Genealogy
2.5 PB
5328 servers
1.12 MW

http://blog.familytreemagazine.com/insider/Inside+Ancestrycoms+TopSecret+Data+Center.aspx

http://www.ancestry.com/

A closer look

• Linguistics
  • text corpora, automatic translation
• Philology
  • how to read a million books?
• History
  • historical census data
• Archaeology
  • archive law, commercial research

Linguistics and Philology

• A chronometric approach to Indian alchemical literature
• Assessing frequency changes in multistage diachronic corpora
• Evaluating methods for computer-assisted stemmatology using artificial benchmark data sets
• A Corpus Study of the Rigveda
• Dictionary generation for less-frequent language pairs using WordNet
• An exercise in non-ideal authorship attribution: the mysterious Maria Ward

http://llc.oxfordjournals.org/

History

http://www.volkstellingen.nl/nl/

http://www.volkstellingen.nl/en/

Archaeology

http://edna.itor.org/nl/intern/upload_directory/a00002/downloads/IMG0013.tif

Archaeology (2)

http://edna.itor.org/nl/oai/oai_addi/oai_addi/OAI:EVALMA:a00002.xml/

Part 2: The free use of data

Open Access

Data is information
Information is knowledge
Knowledge is power
Why share it?

Open Access

Shared knowledge is double knowledge

Without free sharing of knowledge, scientific progress will halt

Tensions between sharing and not sharing remain, though

A good example

http://www.ploscompbiol.org/home.action

Work to do

• organise your data
• let your data work together with those of others
  • (colleagues, future scientists, the public)
• ask new questions of the data
  • because there is so much of it
• create new (virtual) data collections

Part 3: The care for data

Research Data Recycling

• existing data
• collecting by experiments, surveys
• primary research data
• verifying results by others
• preserving unique data from experiments
• compilation, aggregation, annotation
• databanks
• data mining, analysis, visualisation
• new data as research input

Challenge: Software

Operating systems (DOS, Windows 95, ...)
Programming languages (Basic, Pascal)
File formats (WordPerfect, dBase)
Applications (Addressbook, Websites)

Old data may be locked up in old software.

Meeting the challenge

To prevent the problem in the future
• Backward compatibility
• Open Standards
• Open Source Applications
• Modular software engineering
  • keep data separated from interface and business logic

To remedy the problems of the past
• Emulation
• Migration
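As a minimal sketch of two of these ideas, the example below migrates records from a hypothetical legacy fixed-width layout into CSV (an open format) and keeps the data handling apart from the presentation code. The field layout, function names, and sample values are invented for illustration and do not come from the presentation.

```python
import csv
import io

# Hypothetical legacy layout: fixed-width columns (name: 10 chars, year: 4 chars).
LEGACY_FIELDS = [("name", 0, 10), ("year", 10, 14)]


def parse_legacy(text):
    """Data layer: turn fixed-width lines into plain dictionaries."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        records.append({field: line[start:end].strip()
                        for field, start, end in LEGACY_FIELDS})
    return records


def to_csv(records):
    """Migration step: write the records in an open, widely readable format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer,
                            fieldnames=[field for field, _, _ in LEGACY_FIELDS],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def render(records):
    """Interface layer: presentation only, no knowledge of the storage format."""
    return "\n".join(f"{r['name']} ({r['year']})" for r in records)


# Invented sample data in the hypothetical legacy layout.
legacy_text = "alpha".ljust(10) + "1990\n" + "beta".ljust(10) + "2001\n"

records = parse_legacy(legacy_text)
print(to_csv(records))
print(render(records))
```

Because `render` only sees dictionaries, the storage format can change again later without touching the interface code.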

Challenge: Human organisation

Forgotten jargon
Forgotten knowledge
No metadata
Websites with broken links

Jargon

• II.17. Posterior berry aneurysm with subarachnoid bleed.

• II.18. Subarachnoid bleed with extension into the ventricles.

• II.19. Ruptured berry aneurysm at the end of the internal carotid artery, with obstructive hydrocephalus. Morgagni found the rupture.

• II.22. Subarachnoid hemorrhage.

http://www.pathguy.com/morgagni.htm

Meeting the challenge

Persistent Identifiers
Enough Metadata
Codification of knowledge and practices
Wikipedia
Data management early on
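A persistent identifier guards against broken links through indirection: citations hold a stable name, and a resolver maps that name to wherever the data currently lives. The sketch below is a toy in-memory resolver; the class, the identifier syntax, and the URLs are hypothetical and do not describe any particular identifier system.

```python
class Resolver:
    """Toy resolver: maps stable identifiers to their current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        """Record the first known location of an identifier."""
        self._locations[pid] = url

    def move(self, pid, new_url):
        """Update the location; the identifier itself never changes."""
        if pid not in self._locations:
            raise KeyError(f"unknown identifier: {pid}")
        self._locations[pid] = new_url

    def resolve(self, pid):
        """Return the current location for an identifier."""
        return self._locations[pid]


resolver = Resolver()
resolver.register("pid:example-0001", "http://old.example.org/data/1")

# A publication cites the identifier, not the URL.
citation = "pid:example-0001"

# The data moves to a new server; only the resolver entry is updated.
resolver.move("pid:example-0001", "http://new.example.org/archive/1")

print(resolver.resolve(citation))
```

The citation stays valid after the move because only the resolver entry had to change.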

Part 4: The re-use of data

Data management

Use common infrastructure rather than private means

Use open formats rather than proprietary formats

Use open source software rather than closed software

Use standard ways of documenting data: taxonomies, ontologies, metadata schemes
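A metadata scheme becomes concrete once records use agreed field names and can be checked for completeness. The sketch below borrows a few element names from the Dublin Core element set as one example of such a scheme. The slides do not name a specific scheme, so the choice of required fields, the helper function, and the sample values are all assumptions.

```python
import json

# Assumption: which fields count as "enough" metadata is a local policy choice.
# The names are a subset of the Dublin Core element set.
REQUIRED_FIELDS = ("title", "creator", "date", "format", "identifier")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Invented sample record; the identifier has not been assigned yet.
record = {
    "title": "Example survey dataset",
    "creator": "A. Researcher",
    "date": "2009-05-01",
    "format": "text/csv",
    "identifier": "",
}

print(missing_fields(record))

# Once an identifier is assigned, the record passes the check.
record["identifier"] = "pid:example-0001"
print(missing_fields(record))
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing the record as JSON keeps the documentation itself in an open format.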

Common Infrastructure

Local file shares
University repository
DANS
European Infrastructures

DANS

http://easy.dans.knaw.nl/dms

EASY

Dataset

Datafiles

Metadata

Linguists make their technology accessible
- resources
- algorithms
- techniques

Humanities and social sciences
- they are the target users

Geleerdenbrieven = Circulation of Knowledge

Archiving = circulation of information

Keep imagining
