ukoln is supported by: to name: persistently: ay, there’s the rub andy powell, ukoln, university...

33
UKOLN is supported by: To name: persistently: ay, there’s the rub Andy Powell, UKOLN, University of Bath [email protected] DCC Persistent Identifiers Workshop, University of Glasgow – June 2005 www.bath.ac.u k a centre of expertise in digital information management www.ukoln.ac.u k

Upload: rudolf-owens

Post on 31-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

UKOLN is supported by:

To name: persistently: ay, there’s the rub

Andy Powell, UKOLN, University of Bath

[email protected]

DCC Persistent Identifiers Workshop, University of Glasgow – June 2005

www.bath.ac.uk

a centre of expertise in digital information management

www.ukoln.ac.uk

                                                             

DCC Persistent Identifiers 2

Contents

• beginning• middle• end• chance for discussion• note: middle section will focus on

technical functional requirements

                                                             

DCC Persistent Identifiers 3

Overall theme

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

despite appearances to the

contrary…

…the most useful

registered scheme is the ‘http’ URI

scheme

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

despite appearances to the

contrary…

…the most useful

registered scheme is the ‘http’ URI

scheme

note use of “is” in the first line… “can be mapped to a

URI” is not good enough!

digital

                                                             

DCC Persistent Identifiers 4

Introduction

• PI meetings often try to focus on functional requirements…

• uniqueness, persistence, resolvability, usability, transportability, simplicity of assignment, applicability to digital and non-digital resources, cost, blah, blah, blah…

• difficult, because requirements are abstract… not clear how to meet them in practice - difficult to move forward

• forget abstract functional requirements… let’s get technical

                                                             

DCC Persistent Identifiers 5

Space/time continuum

time

Inte

rnet

space

my application

“Internet space” represents some

combination of geographic / network distance and

domain / administration / application distance…

“time” represents time…

                                                             

DCC Persistent Identifiers 6

Space/time continuum

time

Inte

rnet

space applications that are

closely related in terms of space or time likely to

share understanding about identifiers – often by

hardwiring knowledge into code

myapplication

otherapplication

otherapplication

                                                             

DCC Persistent Identifiers 7

Space/time continuum

time

Inte

rnet

space applications that are

“distant” are less likely to share understanding about

identifiers

knowledge locked within domain or lost over time

or, worse, both

myapplication

otherapplication

otherapplication

                                                             

DCC Persistent Identifiers 8

Pushing the boundaries

• how do we push the boundaries of identifier understanding further out across the space/time continuum?– standards, standards, standards– go with the crowd– stop telling people existing stuff is broken… or

at least, we stop pretending that we can do better!

– use what already works and is widely deployed– focus on existing technical standards– stop inventing, start doing

                                                             

DCC Persistent Identifiers 9

W3C Web Architecture

• Global Identifiers - Global naming leads to global network effects.

(Principle)

• Identify with URIs - To benefit from and increase the value of the

World Wide Web, agents should provide URIs as identifiers for

resources. (Good practice)

• URIs Identify a Single Resource - Assign distinct URIs to distinct

resources. (Constraint)

• Avoiding URI aliases - A URI owner SHOULD NOT associate

arbitrarily different URIs with the same resource. (Good practice)

• Consistent URI usage - An agent that receives a URI SHOULD

refer to the associated resource using the same URI, character-by-

character. (Good practice)

• Reuse URI schemes - A specification SHOULD reuse an existing

URI scheme (rather than create a new one) when it provides the

desired properties of identifiers and their relation to resources.

(Good practice)

• URI opacity - Agents making use of URIs SHOULD NOT attempt to

infer properties of the referenced resource. (Good practice)

• Global Identifiers - Global naming leads to global network effects.

(Principle)

• Identify with URIs - To benefit from and increase the value of the

World Wide Web, agents should provide URIs as identifiers for

resources. (Good practice)

• URIs Identify a Single Resource - Assign distinct URIs to distinct

resources. (Constraint)

• Avoiding URI aliases - A URI owner SHOULD NOT associate

arbitrarily different URIs with the same resource. (Good practice)

• Consistent URI usage - An agent that receives a URI SHOULD

refer to the associated resource using the same URI, character-by-

character. (Good practice)

• Reuse URI schemes - A specification SHOULD reuse an existing

URI scheme (rather than create a new one) when it provides the

desired properties of identifiers and their relation to resources.

(Good practice)

• URI opacity - Agents making use of URIs SHOULD NOT attempt to

infer properties of the referenced resource. (Good practice)

http://www.w3.org/TR/webarch/

                                                             

DCC Persistent Identifiers 10

URIs and XML

• in order for identifiers to work across the space/time continuum we need– global and unambiguous identifiers– global and unambiguous ways of exchanging

identifiers between software applications

• the Uniform Resource Identifier is the only option for the former

• XML is the “best” option for the latter– and in particular the XML Schema AnyURI

datatype“global” means “very widely deployedtechnology” – e.g. in my mum’s house!“global” means “very widely deployedtechnology” – e.g. in my mum’s house!

                                                             

DCC Persistent Identifiers 11

Use URIs

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

despite appearances to the

contrary…

…the most useful

registered scheme is the ‘http’ URI

scheme

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

despite appearances to the

contrary…

…the most useful

registered scheme is the ‘http’ URI

scheme

digital

                                                             

DCC Persistent Identifiers 12

URI scheme registration

• registration of URI schemes is important• registration helps to ensure uniqueness• without registration the same scheme can be

used in ignorance by someone, somewhere else in the space/time continuum

• registration doesn’t guarantee that every URI with a scheme will be unique – but it helps!

• without registration there are no guarantees of uniqueness or persistence

                                                             

DCC Persistent Identifiers 13

Use registered URI schemes

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

despite appearances to the

contrary…

…the most useful

registered scheme is the ‘http’ URI

scheme

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

despite appearances to the

contrary…

…the most useful

registered scheme is the ‘http’ URI

scheme

digital

                                                             

DCC Persistent Identifiers 14

Semantic Web

• the Semantic Web relies on URIs to identify resources

• resources == stuff (digital/physical/conceptual things)

• the semantic Web is built on a global, shared body of metadata (RDF)

• terms in the metadata language are identified using URIs

• those URIs must be “resolvable”… in order that “reasoning” can be performed

                                                             

DCC Persistent Identifiers 15

Note: dereferencing URIs

• the Web Architecture talks about “dereferencing” URIs rather than “resolving” them– in many cases “dereferencing” a URI results in

obtaining a “representation” of the resource– several representations may be available

• the Web Architecture says:

• only ‘http’ URIs offer simple, widely deployed dereferencing mechanism

• Available representation - A URI owner SHOULD provide

representations of the resource it identifies (Good practice)• Available representation - A URI owner SHOULD provide

representations of the resource it identifies (Good practice)

http://www.w3.org/TR/webarch/

                                                             

DCC Persistent Identifiers 16

Quick quiz…

• what kind of identifier is this?– 1361-3200 is an ISSN

it identifies UKOLN’s Ariadne magazine

                                                             

DCC Persistent Identifiers 17

Quick quiz…

• what kind of identifier is this?– 1361-3200– info:lccn/n78890351 is an ‘info’ URI

it identifies a Libraryof Congress metadatarecord (an authority file)but I don’t know which

                                                             

DCC Persistent Identifiers 18

Quick quiz…

• what kind of identifier is this?– 1361-3200– info:lccn/n78890351– 10.1000/182 is a DOI

it is also a Handle

it identifies the “DOIHandbook”

                                                             

DCC Persistent Identifiers 19

Quick quiz…

• what kind of identifier is this?– 1361-3200– info:lccn/n78890351– 10.1000/182– 79 Ceti is a Flamsteed Designation

it identifies a 7th magnitude star in the constellation of Cetus

                                                             

DCC Persistent Identifiers 20

Quick quiz…

• what kind of identifier is this?– 1361-3200– info:lccn/n78890351– 10.1000/182– 79 Ceti– http://purl.org/dc/terms/audience

is an ‘http’ URIa.k.a. a URLit is also a PURL

it identifies a DCMI metadata term – i.e. a conceptual resource

                                                             

DCC Persistent Identifiers 21

Quick quiz…

• what kind of identifier is this?– 1361-3200– info:lccn/n78890351– 10.1000/182– 79 Ceti– http://purl.org/dc/terms/audience

• only one of these can be understood and dereferenced by every single bit of currently deployed Internet software…

Question: why would we wantto use anything else?Question: why would we wantto use anything else?

                                                             

DCC Persistent Identifiers 22

But…

• ‘http’ URIs are identifiers, just like any other

• ‘http’ URIs can identify any resource – digital, physical or conceptual

• ‘http’ URIs don’t have to break, they just need to be assigned/managed carefully

But, ‘http’ URIs are just locators aren’t they?

But, ‘http’ URIs can only be used for Web resources, accessed over HTTP, can’t they?

But, ‘http’ URIs break every 30 days or something, don’t they?

                                                             

DCC Persistent Identifiers 23

Use ‘http’ URIs

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

despite appearances to the

contrary…

…the most useful registered

scheme is the ‘http’ URI scheme

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

despite appearances to the

contrary…

…the most useful registered

scheme is the ‘http’ URI scheme

                                                             

DCC Persistent Identifiers 24

Case study 1 - LOM

• XML example from IEEE LOM…• typical of many XML / identifier encodings

…<general> <identifier> <catalog>DOI</catalog> <entry>10.1000/182</entry> </identifier> …</general>…

…<general> <identifier> <catalog>DOI</catalog> <entry>10.1000/182</entry> </identifier> …</general>…

                                                             

DCC Persistent Identifiers 25

…<general> <identifier> <catalog>URI</catalog> <entry>http://purl.org/poi/rdn.ac.uk/12-34</entry> </identifier> …</general>…

…<general> <identifier> <catalog>URI</catalog> <entry>http://purl.org/poi/rdn.ac.uk/12-34</entry> </identifier> …</general>…

Case study 1 - LOM

• the “catalogue” indicates what kinds of identifier is being used

just a string

just a string

nothing in the XMLschema indicates thatthis is a URI – someapplications will be blind

                                                             

DCC Persistent Identifiers 26

Case study 1 - LOM

• a improved version might be…

…<general> <identifier>http://purl.org/poi/rdn.ac.uk/12-34</identifier> …</general>…

…<general> <identifier>http://purl.org/poi/rdn.ac.uk/12-34</identifier> …</general>…

where the XML schemaindicates that this is ofdatatype AnyURI

therefore all XML-awareapplications will know thisis a URI

                                                             

DCC Persistent Identifiers 27

Case study 1 - LOM

• a improved version might be…

…<general> <identifier>doi:10.1000/182</identifier> …</general>…

…<general> <identifier>doi:10.1000/182</identifier> …</general>…

the URI syntax providesthe “catalogue” from theoriginal example

all XML-aware applicationswill know this is a URI, somewill know it is a DOI

                                                             

DCC Persistent Identifiers 28

Case study 2 - DOI

• the DOI “10.1000/182” can be encoded as a URI in several ways:– http://dx.doi.org/10.1000/182

– doi:10.1000/182

– urn:doi:10.1000/182

• however…– DOI-aware applications have to have knowledge of these

encodings hard-coded into them (since the DOI itself is just a string)

– nothing in the URI specification indicates that these URIs are equivalent

– note that the 2nd and 3rd forms are not registered

http://dx.doi.org/10.1000/182http://dx.doi.org/10.1000/182

Question: which of these forms is most persistent and why?

Question: which of these forms is most persistent and why?

                                                             

DCC Persistent Identifiers 29

Case study 3 – ‘info’ URI

• consider the following ‘info’ URI:– info:lccn/n78890351

• ‘info’ URIs are explicitly defined to be non-dereferencable

• therefore, there is no documented way of finding out what this URI identifies

• there is no documented way of getting a representation of the resource it identifies

• and there is no documented way of finding out any more about it

http://info-uri.info/registry/http://info-uri.info/registry/

Question: how is this useful?Question: how is this useful?

                                                             

DCC Persistent Identifiers 30

But, what happens when…

• …the Internet disappears?• who cares!• we’ll deal with it• we’ll be with the crowd• there’ll be a global transition• everyone will need to deal with it• every software component on the whole

Internet will need fixing• the people left behind will be the people who

invented their own solutions

                                                             

DCC Persistent Identifiers 31

Conclusion

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

the most useful registered scheme

is the ‘http’ URI scheme

URIs are only really

useful if encoding syntaxes

explicitly indicate “this is a URI”

the only useful form of identifier

is the URI,

the only useful form of URI is one

that conforms to a registered

scheme,

the most useful registered scheme

is the ‘http’ URI scheme

URIs are only really

useful if encoding syntaxes

explicitly indicate “this is a URI”

digital

                                                             

DCC Persistent Identifiers 32

Questions and discussion?

                                                             

DCC Persistent Identifiers 33

Discussion

• What do we need to do to make ‘http’ URIs more persistent? (Are ‘http’ URIs the answer after all?)

• Do we have functional requirements that aren’t met by ‘http’ URIs? (When and why should we create new URI schemes?)