lis512 lecture 6 identifiers, dublin core and RDF

Upload: mio

Post on 22-Feb-2016


Page 1: lis512 lecture  6

lis512 lecture 6

identifiers, dublin core and RDF

Page 2: lis512 lecture  6

identifiers

• When you maintain your information on your own and don't exchange it, you don't need to worry about identifiers.

• Identifiers come into play when you try to communicate to another party that you are talking about the same thing as before, or that you and the other party mean the same thing.

Page 3: lis512 lecture  6

keys

• We have already seen keys in a database.

• A key is simply a column in a table whose value is unique for each line in the table.

• Put differently, a key is a field in a record whose value is unique for each record in the table.
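
The uniqueness property of a key can be checked mechanically. The following is a minimal sketch, assuming a table is held as a list of dictionaries; the function name `is_key` and the sample records are hypothetical and only serve as illustration.

```python
def is_key(records, field_name):
    """Return True if field_name has a distinct value in every record."""
    values = [record[field_name] for record in records]
    # A field is a key when no value occurs twice.
    return len(values) == len(set(values))


# Hypothetical sample table: two people who share a name.
people = [
    {"id": 1, "name": "Thomas Krichel"},
    {"id": 2, "name": "Thomas Krichel"},
]

print(is_key(people, "id"))    # the id column is unique for each record
print(is_key(people, "name"))  # the name column is not
```

Note that this only tests uniqueness inside one table. It says nothing about whether two records are about the same thing, which is the job of an identifier.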

Page 4: lis512 lecture  6

identifier

• An identifier is something a bit different.

• When two records have the same identifier, they are about the same thing.

• Being about the same thing calls the philosophers into the room. Things become pretty messy.

• Discussion on identifiers is often very confused, even among experts.

Page 5: lis512 lecture  6

example

Record 1
Name: Thomas Krichel
Homepage: http://wotan.liu.edu/home/krichel

Record 2
Name: Thomas Krichel
Homepage: http://openlib.org/home/krichel

Are they about the same person?

Page 6: lis512 lecture  6

a person homepage finder

• Imagine a home page finder.

• Ask it what the homepage of Thomas Krichel is.

• It may say that there are two Thomas Krichels with two different homepages.

• If there is only one, the service is wrong.

Page 7: lis512 lecture  6

work to do

• If we have two records that look like they describe the same person, we try to unite them.

• In a practical scenario, most of the time we already have a first Thomas Krichel.

• When the other Thomas Krichel comes along, we check, decide whether it is the same person, and then update the homepage setting.

Page 8: lis512 lecture  6

human labour

• Finding out whether a person is the same as another person is something that no computer can do.

• We need a person to verify this.

• Identity can only be controlled by a person.

• Worse: two people may differ on what “same person” means.

Page 9: lis512 lecture  6

identifier

• The identifier is a special field that we keep constant over time as long as the record is about the same thing.

• When we get our first record about Thomas Krichel, we have
Name: Thomas Krichel
Homepage: http://openlib.org/home/krichel
Identifier: tk1

Page 10: lis512 lecture  6

changes to the record

• Thomas gets married
Name: Thomas Lechirk
Homepage: http://openlib.org/home/krichel
Identifier: tk1

• Later, he decides to change his homepage
Name: Thomas Lechirk
Homepage: http://openlib.org/home/lechirk
Identifier: tk1

Page 11: lis512 lecture  6

problem

• Since he is now known as Thomas Lechirk, should he not be given the identifier tl1?

• Well, in general, the resolution of this question depends on the environment that we are operating in.

• Changing the identifier would destroy its value.

Page 12: lis512 lecture  6

value of the identifier

• If you change the identifier every time you change the record, it is useless.

• The value of the identifier is that it is long-lived.

• With a good identifier, users will still be able to find the same thing even when they query the same dataset in hundreds of years.

Page 13: lis512 lecture  6

dumb vs intelligent identifiers

• Identifiers can be dumb or intelligent.

• Choosing a scheme for identifiers is a key step at the start of a project.

• Once we are running our project, changing identifiers can only be accomplished at great expense.

• Dumb identifiers are strings that have no meaning or structure, e.g. “na$19h-ab9a19]”

Page 14: lis512 lecture  6

dumb identifiers

• The advantages are
– If they are long and complicated enough they may be found by web searches.
– They are not at risk from lobbying for change.

• The problem is
– It is not easy to resolve the identifier into data about the thing it identifies.
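
One common way to produce dumb identifiers is to draw them at random. The sketch below assumes random UUIDs from Python's standard `uuid` module are acceptable as identifiers; the names `registry`, `register` and `resolve` are hypothetical. It also shows the resolution problem: since the string carries no meaning, the only way to get from the identifier to the data is a lookup table that somebody has to maintain.

```python
import uuid

# The lookup table that maps each dumb identifier to its record.
registry = {}


def new_dumb_identifier():
    """Return a random 32-character hexadecimal string with no built-in meaning."""
    return uuid.uuid4().hex


def register(record):
    """Store a record under a freshly drawn identifier and return that identifier."""
    identifier = new_dumb_identifier()
    registry[identifier] = record
    return identifier


def resolve(identifier):
    """Resolution is only possible through the registry, never from the string itself."""
    return registry[identifier]


identifier = register({"name": "Thomas Krichel"})
print(identifier, resolve(identifier))
```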

Page 15: lis512 lecture  6

intelligent identifiers

• Most identifiers have some form of intelligence.

• That intelligence says something about the identifier to a person who looks at it.

• The intelligence may also help a computer program to make decisions.

Page 16: lis512 lecture  6

ISBN

• An ISBN is assigned to each edition of a book.

• Since 2007, it has 13 digits; the first 3 are 978 or 979. Before 2007, it had only 10 digits: essentially the remaining ones, although the final check digit is computed differently.

• The next 1 to 5 digits are the group identifier for a country or country group.

• Then follows a publisher identifier.

• Then follows an item number assigned by the publisher.

• The parts are often split by hyphens but the hyphen is not part of the actual number.
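
The ISBN is a good example of intelligence that helps a computer program. Its last digit is a check digit computed from the other digits, so a program can catch many typing errors. The sketch below uses the standard ISBN-13 rule (weights 1 and 3 alternating, sum made up to a multiple of 10). The function names are hypothetical, and the sample number 978-0-306-40615-7 is a commonly quoted example, not one from this lecture.

```python
def strip_separators(isbn):
    """Hyphens and spaces are not part of the actual number."""
    return isbn.replace("-", "").replace(" ", "")


def isbn13_check_digit(first12):
    """Compute the ISBN-13 check digit from the first 12 digits."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_isbn13(isbn):
    """Check the length, the digits, and the check digit of an ISBN-13."""
    digits = strip_separators(isbn)
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    return isbn13_check_digit(digits[:12]) == int(digits[12])


def isbn10_to_isbn13(isbn10):
    """Prefix 978, drop the old check digit, and compute the new one."""
    digits = strip_separators(isbn10)
    if len(digits) != 10:
        raise ValueError("an ISBN-10 has exactly 10 characters")
    core = "978" + digits[:9]
    return core + str(isbn13_check_digit(core))


print(is_valid_isbn13("978-0-306-40615-7"))
print(isbn10_to_isbn13("0-306-40615-2"))
```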

Page 17: lis512 lecture  6

Uniform Resource Locators URL

• We all know URLs.

• Speaking in simplified terms, a URL typically has three components.
– a mechanism
– a host
– a local part, local to the host

Page 18: lis512 lecture  6

URL mechanism

• The URL mechanism essentially says what to do with the thing when you actually find it.

• Typical URL mechanisms include
– http
– ftp
– telnet
– mailto

Page 19: lis512 lecture  6

host

• The host is the domain name of a machine.

• Domain names are a hierarchical scheme of names that can be resolved into IP addresses.

• Example
– openlib.org
– fafner.openlib.org
– www.fafner.openlib.org

• I lease openlib.org.
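
The hierarchy in a domain name reads from right to left. As a small illustration, the hypothetical function `domain_hierarchy` below lists the chain of ever more specific names that a host name contains. It only manipulates strings and does not resolve anything into an IP address.

```python
def domain_hierarchy(host):
    """List the domains a host name belongs to, from most general to most specific."""
    labels = host.lower().rstrip(".").split(".")
    # Start with the last label alone, then add one label to the left each time.
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


for name in domain_hierarchy("www.fafner.openlib.org"):
    print(name)
```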

Page 20: lis512 lecture  6

local part

• Once the resolution reaches the host it needs to know what to send out.

• This is some sort of local identifier that the host knows about.

Page 21: lis512 lecture  6

URIs

• The World Wide Web Consortium, in its quest to make the web a more universal information medium, generalized URLs to URIs.

• URIs are Uniform Resource Identifiers.

• They don't only allow for locating resources but also for naming them.

• The structure is similar, but there is a wide variety of mechanisms, called schemes.

Page 22: lis512 lecture  6

example scheme info

• The info: URI scheme was created by representatives of the library community and the publishing industry as a light-weight way to make existing identifiers available as URIs.

• http://info-uri.info/ has the list of these schemes.

Page 23: lis512 lecture  6

URL examples

• http://openlib.org/home/krichel/
– mechanism: http
– host: openlib.org
– local part: /home/krichel/

• mailto:[email protected]
– mechanism: mailto
– host: openlib.org
– local part: krichel
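
The split into mechanism, host and local part can be reproduced with `urlsplit` from Python's standard `urllib.parse` module. This is a rough sketch: the function `url_components` and its dictionary labels are hypothetical, the treatment of mailto follows the simplified three-part view of these slides rather than the full URI rules, and the mail address used is a made-up placeholder.

```python
from urllib.parse import urlsplit


def url_components(url):
    """Split a URL into the three simplified components used in this lecture."""
    parts = urlsplit(url)
    if parts.scheme == "mailto":
        # A mailto URL has no "//host" section; the host sits after the "@".
        local, _, host = parts.path.partition("@")
        return {"mechanism": parts.scheme, "host": host, "local part": local}
    return {
        "mechanism": parts.scheme,
        "host": parts.hostname,
        "local part": parts.path,
    }


print(url_components("http://openlib.org/home/krichel/"))
print(url_components("mailto:someone@example.org"))  # placeholder address
```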

Page 24: lis512 lecture  6

DC contributor

• URI http://purl.org/dc/elements/1.1/contributor

• Definition: An entity responsible for making contributions to the resource.

• Comment: Examples of a Contributor include a person, an organization, or a service. Typically, the name of a Contributor should be used to indicate the entity.

Page 25: lis512 lecture  6

Dublin Core

• Dublin Core was initially created at an invitational meeting at OCLC in Dublin, OH.

• The aim was to create a metadata format that people who are not trained specialists could use to annotate resources on the Web.

• Nowadays, it serves as a basic common denominator metadata in a wide variety of applications.

Page 26: lis512 lecture  6

DC coverage

• URI: http://purl.org/dc/elements/1.1/coverage

• Definition: The spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.

• Comment: Spatial topic and spatial applicability may be a named place or a location specified by its geographic coordinates. Temporal topic may be a named period, date, or date range.

Page 27: lis512 lecture  6

DC creator

• URI: http://purl.org/dc/elements/1.1/creator

• Definition: An entity primarily responsible for making the resource.

• Comment: Examples of a Creator include a person, an organization, or a service. Typically, the name of a Creator should be used to indicate the entity.

Page 28: lis512 lecture  6

DC date

• URI: http://purl.org/dc/elements/1.1/date

• Definition: A point or period of time associated with an event in the lifecycle of the resource.

• Comment: Date may be used to express temporal information at any level of granularity. Recommended best practice is to use an encoding scheme.
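
The slide does not name an encoding scheme. A frequent choice is the ISO 8601 style year-month-day notation, which also works at coarser granularity by leaving out the day or the month. The sketch below assumes that scheme; the function name `dc_date` is hypothetical.

```python
from datetime import date


def dc_date(year, month=None, day=None):
    """Encode a date at year, month or day granularity in ISO 8601 style."""
    if month is None:
        return "%04d" % year
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    if day is None:
        return "%04d-%02d" % (year, month)
    # date() validates the day, e.g. it rejects 30 February.
    return date(year, month, day).isoformat()


print(dc_date(2016))
print(dc_date(2016, 2))
print(dc_date(2016, 2, 22))
```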

Page 29: lis512 lecture  6

DC description

• URI: http://purl.org/dc/elements/1.1/description

• Definition: An account of the resource.

• Comment: Description may include but is not limited to:
– an abstract
– a table of contents
– a graphical representation, or
– a free-text account of the resource.

Page 30: lis512 lecture  6

DC format

• URI: http://purl.org/dc/elements/1.1/format

• Definition: The file format, physical medium, or dimensions of the resource.

• Comment: Examples of dimensions include size and duration. Recommended best practice is to use a controlled vocabulary.

• The MIME types by the IANA are such a vocabulary for files.
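
For files, a value from the MIME type vocabulary can often be guessed from the file name. The sketch below uses `guess_type` from Python's standard `mimetypes` module; the function name `dc_format` and the file names are hypothetical. The guess relies on the file extension only, so it should be treated as a suggestion rather than a fact about the file's content.

```python
import mimetypes


def dc_format(filename):
    """Suggest a MIME type for a file name, or None if the extension is unknown."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


print(dc_format("index.html"))
print(dc_format("lecture6.pdf"))
```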

Page 31: lis512 lecture  6

DC identifier

• URI: http://purl.org/dc/elements/1.1/identifier

• Definition: An unambiguous reference to the resource within a given context.

• Comment: Recommended best practice is to identify the resource by means of a string conforming to a formal identification system.

Page 32: lis512 lecture  6

DC language

• URI: http://purl.org/dc/elements/1.1/language

• Definition: A language of the resource.

• Comment: Recommended best practice is to use a controlled vocabulary such as RFC 4646.

Page 33: lis512 lecture  6

DC publisher

• URI: http://purl.org/dc/elements/1.1/publisher

• Definition: An entity responsible for making the resource available.

• Comment: Examples of a Publisher include a person, an organization, or a service. Typically, the name of a Publisher should be used to indicate the entity.

Page 34: lis512 lecture  6

DC relation

• URI: http://purl.org/dc/elements/1.1/relation

• Definition: A related resource.

• Comment: Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system.

Page 35: lis512 lecture  6

DC source

• URI: http://purl.org/dc/elements/1.1/source

• Definition: A related resource from which the described resource is derived.

• Comment: The described resource may be derived from the related resource in whole or in part. Recommended best practice is to identify the related resource by means of a string conforming to a formal identification system.

• Apparently only kept for historical purposes.

Page 36: lis512 lecture  6

DC subject

• URI: http://purl.org/dc/elements/1.1/subject

• Definition: The topic of the resource.

• Comment: Typically, the subject will be represented using keywords, key phrases, or classification codes. Recommended best practice is to use a controlled vocabulary. To describe the spatial or temporal topic of the resource, use the DC coverage element.

Page 37: lis512 lecture  6

DC title

• URI: http://purl.org/dc/elements/1.1/title

• Definition: A name given to the resource.

• Comment: Typically, a Title will be a name by which the resource is formally known.

Page 38: lis512 lecture  6

DC type

• URI: http://purl.org/dc/elements/1.1/type

• Definition: The nature or genre of the resource.

• Comment: Recommended best practice is to use a controlled vocabulary such as the DCMI Type Vocabulary.

• To describe the file format, physical medium, or dimensions of the resource, use the DC Format element.

Page 39: lis512 lecture  6

DC abstract model

• Basically, this is something like the FRBR for Dublin Core.

• It may be quite useless in practice, but it makes for beautiful theory.

Page 40: lis512 lecture  6

the resource

• A resource is anything that can be given an identifier.

• Since a URI can be given to almost everything, this includes a vast array of "stuff".

• But the important thing is that the resource is actually identified by some identifier, rather than being named.

• It's essentially the same thing as an entity in FRBR.

Page 41: lis512 lecture  6

property / value pair

• A description occurs when a bunch of property / value pairs are associated with a resource.

• This is like the attributes in FRBR, just that it counts them together with the values.

Page 42: lis512 lecture  6

property values

• Property values can be of two forms.

• They can be literals, i.e. human-understandable strings.

• Or they can be resources, which again can be described. When a resource is used as a value, we quote an identifier.

Page 43: lis512 lecture  6

example

• Resource: http://openlib.org/home/krichel
– DC.Creator: Thomas Krichel
– DC.Format: text/html

• Note: properties themselves are identified by URIs.
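
The abstract model can be made concrete with a small data structure. The following is a minimal sketch and not an implementation of any official Dublin Core software: the classes `Literal`, `ResourceRef` and `Description` are hypothetical names. Properties are identified by their URIs, and a value is either a literal string or a reference to another resource. The `relation` statement at the end is invented purely to show a resource used as a value.

```python
from dataclasses import dataclass, field

DC = "http://purl.org/dc/elements/1.1/"


@dataclass(frozen=True)
class Literal:
    """A human-understandable string used as a value."""
    text: str


@dataclass(frozen=True)
class ResourceRef:
    """A value that is itself a resource, quoted by its identifier."""
    uri: str


@dataclass
class Description:
    """A resource together with its property / value pairs."""
    resource: str
    statements: list = field(default_factory=list)

    def add(self, property_uri, value):
        self.statements.append((property_uri, value))

    def values(self, property_uri):
        return [v for p, v in self.statements if p == property_uri]


page = Description("http://openlib.org/home/krichel")
page.add(DC + "creator", Literal("Thomas Krichel"))
page.add(DC + "format", Literal("text/html"))
# Hypothetical statement, only to illustrate a resource as a value.
page.add(DC + "relation", ResourceRef("http://openlib.org/"))

print(page.values(DC + "creator"))
```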

Page 44: lis512 lecture  6

similar: the resource description framework

• The Resource Description Framework (RDF) has been developed by the W3C as a way to represent information generically, such that it can be processed via the Web.

• It again is a conceptual model.

• RDF can be expressed in a variety of syntaxes.

• The XML syntax of RDF is very ugly.
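
One of the plainer RDF syntaxes is N-Triples, where each statement is written on one line as subject, property and value, followed by a dot. URIs go in angle brackets and literals in double quotes. The sketch below writes the earlier Dublin Core example in that syntax. The helper names are hypothetical, and the escaping covers only backslashes, quotes and line breaks, which is enough for simple strings but is not a full N-Triples writer.

```python
DC = "http://purl.org/dc/elements/1.1/"


def ntriples_literal(text):
    """Quote a literal, escaping the characters that would break the line."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return '"' + escaped + '"'


def ntriple(subject, property_uri, value, value_is_resource=False):
    """Write one statement as a line of N-Triples."""
    if value_is_resource:
        obj = "<" + value + ">"
    else:
        obj = ntriples_literal(value)
    return "<%s> <%s> %s ." % (subject, property_uri, obj)


page = "http://openlib.org/home/krichel"
print(ntriple(page, DC + "creator", "Thomas Krichel"))
print(ntriple(page, DC + "format", "text/html"))
```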

Page 45: lis512 lecture  6

general idea

• The general idea is that of a semantic web.

• Of such a semantic web you could ask real questions, such as what tomorrow's weather will be here, or which athlete won the most gold medals in the Olympic games of 1936.

• Such a system is many years away.

• Computer scientists don't make much progress with this.

Page 46: lis512 lecture  6

key problem

• The key problem is identity.

• You and I may understand what a gold medal is and what the Olympic games are.

• But agreeing on what these precisely mean appears to be a matter of great difficulty.

• Recently, some research, DBpedia, used Wikipedia headings as concept names.