
NISO/DCMI Webinar: Implementing Linked Data in Developing Countries and Low-Resource Conditions

September 25, 2013

Speakers: Johannes Keizer - Information Systems Officer, Food and Agriculture Organization of the United Nations

Caterina Caracciolo - Senior Information Specialist at the Food and Agriculture Organization of the United Nations

http://www.niso.org/news/events/2013/dcmi/developing

Implementing Linked Data in Developing Countries and Low-Resource Conditions

NISO/DCMI Webinar

25 September, 2013

Caterina Caracciolo, Johannes Keizer

caterina.caracciolo@fao.org, johannes.keizer@fao.org

Goal of this Webinar

• Overview of the Linked Data stack and its components

• LOD in low-resource conditions

– Is it possible? Why do it?

• What to consider when doing LOD with low resources

• Present some initiatives that enable LOD in low-resource conditions

• Walk through a real-world LOD scenario

The importance of the issue

[Map: the world by population (www.worldmapper.org). Source: United Nations Population Division, World Population Prospects: The 2010 Revision, medium variant (2011).]

http://www.worldmapper.org/extraindex/language_notes.html

~ 7000 languages

http://w3techs.com/technologies/overview/content_language/all

And there is something more

~ 7000 languages

[Map: the world by languages spoken (www.worldmapper.org)]

Let’s get into the nitty-gritty

Implementing Linked Data in Developing Countries and Low-Resource Conditions

Part 2

NISO/DCMI Webinar

25 September, 2013

Caterina Caracciolo

[email protected]

Today

• A bird’s-eye view of the Linked Data lifecycle, from data consumption to data generation

• Discussion of the major difficulties, especially in the data generation phase

• Some considerations on possible solutions, especially from a strategic and organizational point of view

• No ambition to be a comprehensive survey of tools!

What are low-resource conditions, really?

CPU, memory and technology constraints...

Electricity may be unreliable…

…occasionally available…

…expensive…

Internet connection may be slow...

… and dependent on the weather…

Funding...

is always a problem

IT competencies…

Few IT people, already over-busy, trained on different technologies, with little or no incentive to learn or adopt new ones

IT and domain-specific competencies

• Usually, complete separation between those working on IT and those working on collecting/analysing/maintaining data (domain specialists)

• Domain specialists do not want to spend time changing formats, validating conversions, explaining intended meaning of data etc.

– Tendency to consider data as “my” data

Linked Data

Scenario

An institution has data to publish as Linked Data

– Data is produced internally, e.g. list of publications produced by the institution, specimens in the local museum, factsheets on local plants, statistics on production, …

– Data may be online or inside somebody’s computer

– Typically stored in some relational database, or in spreadsheets on a file system

Remark

• Although not strictly necessary, here we consider RDF as the format for Linked Data

A typical Linked Data flow

[Diagram: the data lifecycle (data conversion, data linking, data maintenance) feeds data storage in an RDF store; the data is then exposed through a SPARQL endpoint, an RDF dump, and HTML/RDF served via content negotiation, and consumed by LOD-based applications.]

Data consumption

Building LOD-based applications is easy…

(relatively)

Relatively easy…

• It is about making mash-up applications…

• But interfacing with the data may be an issue

– Developers need to know SPARQL

– And how to use it within their framework of choice (see the sketch below)
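For instance, a minimal sketch of querying a SPARQL endpoint from Python, assuming the SPARQLWrapper library is installed; the endpoint URL is only illustrative, while the concept URI is the AGROVOC "farmland" concept that appears later in this presentation:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Illustrative endpoint URL (an assumption, not taken from the slides).
    sparql = SPARQLWrapper("https://agrovoc.fao.org/sparql")
    sparql.setQuery("""
        PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
        SELECT ?label WHERE {
          <http://aims.fao.org/aos/agrovoc/c_2808> skos:prefLabel ?label .
        }
    """)
    sparql.setReturnFormat(JSON)

    # Run the query and print one preferred label per language.
    results = sparql.query().convert()
    for binding in results["results"]["bindings"]:
        print(binding["label"]["value"])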

A pointer

• Research to Impact Hackathon, Kenya, Jan 2013

– @iHub Research, Kenya

• local agricultural and nutritional sector

– Comments on that in Tim Davies’ blog

• http://www.timdavies.org.uk/

• Other blogs around … (search for them!)

Data exposure can be done in various ways

Exposing de-referenceable URIs

• Need to set up a content negotiation mechanism (a minimal sketch follows after this list)

– Serving content for URIs

• In our experience, not a big problem

– Simple back-ends are available, e.g. Pubby

• Still, you need a properly configured server running 24/7
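A minimal sketch of such a content negotiation back-end, assuming the Flask library; Pubby does this out of the box, so this only illustrates the idea, and the URI path and data are hypothetical:

    from flask import Flask, Response, request

    app = Flask(__name__)

    # Hypothetical representations of one concept, kept inline for brevity.
    HTML_PAGE = "<html><body><h1>farmland</h1></body></html>"
    TURTLE = ('<http://example.org/agrovoc/c_2808> '
              '<http://www.w3.org/2004/02/skos/core#prefLabel> "farmland"@en .')

    @app.route("/agrovoc/c_2808")
    def concept():
        # Serve RDF to Linked Data clients and HTML to browsers,
        # based on the Accept header sent by the client.
        best = request.accept_mimetypes.best_match(["text/html", "text/turtle"])
        if best == "text/turtle":
            return Response(TURTLE, mimetype="text/turtle")
        return Response(HTML_PAGE, mimetype="text/html")

    if __name__ == "__main__":
        app.run()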

Provide an RDF dump

• Always a good choice

– Data is downloaded for inclusion in applications

– Efficiency of data access remains under your control

– It is not always clear how to produce the dump, or what to include in it (a minimal sketch follows after this list)…

• Only the data? Also the links?
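A minimal sketch of producing a dump with the rdflib library; the source file name is hypothetical, and here the whole graph, links included, goes into the dump:

    from rdflib import Graph

    g = Graph()
    g.parse("our_data.ttl", format="turtle")   # hypothetical source file

    # Serialize everything to a single downloadable N-Triples file.
    g.serialize(destination="dump.nt", format="nt")
    print(f"Wrote {len(g)} triples to dump.nt")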

Expose SPARQL endpoint

• Endpoint typically provided by triple store

• Heavy on the server side

• Query processing is left to the SPARQL engine (see the sketch below)

– Implementation of reasoning

– Implementation of the order of clause processing: filters, unions, select

• Requires 24/7 server availability
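A minimal sketch of what that means in practice: the engine, not your code, handles FILTERs and UNIONs. Here rdflib's in-memory engine stands in for a triple store, and the data file is hypothetical:

    from rdflib import Graph

    g = Graph()
    g.parse("our_data.ttl", format="turtle")   # hypothetical source file

    # UNION, FILTER and result shaping are processed by the SPARQL engine.
    query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {
      { ?concept skos:prefLabel ?label }
      UNION
      { ?concept skos:altLabel ?label }
      FILTER (lang(?label) = "en")
    }
    """
    for concept, label in g.query(query):
        print(concept, label)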

Expose Web Services

• Known technology

• May be built on top of RDF stores

• Good performance

• Control over what data may be accessed

• API formats to simplify the use of linked data by web developers, e.g. https://code.google.com/p/linked-data-api/ (a minimal client sketch follows below)
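A minimal sketch of the client side of such a web service, assuming the requests library; the service URL and the JSON response shape are hypothetical, the point being that plain HTTP and JSON are enough for web developers:

    import requests

    # Hypothetical search endpoint exposed by a web-service layer over an RDF store.
    resp = requests.get(
        "http://example.org/api/concepts",
        params={"q": "farmland"},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()

    for item in resp.json():   # hypothetical response: a list of {uri, label} objects
        print(item.get("uri"), item.get("label"))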

Data storage is tricky

Triple stores are well-known resource guzzlers

• Intense use of CPU, memory

• Server configuration needs to be appropriate

• Internet connection may be a bottleneck

• Again, some technical know-how is needed to choose the best solution

– Also considering other technologies, e.g. NoSQL

The Semantic Web is a resource guzzler!

Downscale the Semantic Web!

http://worldwidesemanticweb.org/events/downscale2012/

http://worldwidesemanticweb.org/events/downscale2013/

Data generation

Producing RDF may be a daunting task

Getting to RDF… from what?

• In many cases, moving to RDF means an abrupt jump from formats that we consider long abandoned

• From a recent survey, we learn that some AGROVOC users (libraries, institutions) use the paper version

– Last published in 1992

RDF generation

• RDF is a simple format: just triples

• But it requires some familiarity with the technology, and especially with the mentality around it, particularly regarding standards and reuse

A much simplified example from AGROVOC

TermCode1  TermCode2  TermSpell1      TermSpell2  LangCode1  LangCode2  LinkType
1          2          Irrigated farm  Farm        EN         EN         BT
1          3          Irrigated farm  irrigation  EN         EN         RT
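In practice, rows like these usually live in a spreadsheet or database export. A minimal sketch of reading them with Python's csv module; the file name is hypothetical, and the column names are taken from the table above:

    import csv

    with open("agrovoc_legacy.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)   # expects a header row with the column names above
        for row in reader:
            print(row["TermCode1"], row["LinkType"], row["TermCode2"],
                  row["TermSpell1"], "->", row["TermSpell2"])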

Can be turned into some RDF…

Subject  Predicate  Object
Entity1  TermSpell  Irrigated farm
Entity1  BT         Entity2
Entity2  TermSpell  Farm
Entity3  TermSpell  Irrigation
Entity1  RT         Entity3

The problem is the middle column

• These are locally defined predicates (TermSpell, BT, RT)

• One has to guess what they stand for!


Better something like this:

Subject  Predicate     Object
URI_1    rdfs:label    “Irrigated farm”
URI_1    skos:broader  URI_2
URI_2    rdfs:label    “Farm”
URI_3    rdfs:label    “Irrigation”
URI_1    skos:related  URI_3
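A minimal sketch of producing these triples from the legacy rows with the rdflib library; the base URI is a made-up example, and the mapping of the local BT/RT codes to skos:broader/skos:related follows the table above:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDFS, SKOS

    BASE = Namespace("http://example.org/vocab/")          # hypothetical base URI
    LINK_TYPES = {"BT": SKOS.broader, "RT": SKOS.related}  # local codes -> SKOS properties

    # (TermCode1, TermCode2, TermSpell1, TermSpell2, LangCode1, LangCode2, LinkType)
    rows = [
        ("1", "2", "Irrigated farm", "Farm", "EN", "EN", "BT"),
        ("1", "3", "Irrigated farm", "Irrigation", "EN", "EN", "RT"),
    ]

    g = Graph()
    for code1, code2, spell1, spell2, lang1, lang2, link in rows:
        uri1, uri2 = BASE[code1], BASE[code2]
        g.add((uri1, RDFS.label, Literal(spell1, lang=lang1.lower())))
        g.add((uri2, RDFS.label, Literal(spell2, lang=lang2.lower())))
        g.add((uri1, LINK_TYPES[link], uri2))

    print(g.serialize(format="turtle"))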

Using standard vocabularies is the key

• Standard, or de facto standard

• Only a few of them:

– Dublin Core, BIBO, FOAF, SKOS, ..

• They ensure that data can be reused

Standard vocabularies as Step 0 of Linked Data

• Reusing existing vocabularies is the first step towards indicating which data may be linked and which may not

– E.g. dct:subject in a bibliographic record indicates the “topic” of the record (see the sketch below)
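A minimal sketch of such a record with rdflib and Dublin Core terms; the record URI and title are hypothetical, while the dct:subject value is the real AGROVOC "farmland" concept URI used later in this presentation:

    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DCTERMS

    g = Graph()
    record = URIRef("http://example.org/records/12345")   # hypothetical record URI

    # dct:title carries the title; dct:subject points at a controlled vocabulary concept.
    g.add((record, DCTERMS.title, Literal("A study of irrigated farmland", lang="en")))
    g.add((record, DCTERMS.subject, URIRef("http://aims.fao.org/aos/agrovoc/c_2808")))

    print(g.serialize(format="turtle"))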

How to know what vocabulary to use?

• And how to know if the right vocabulary exists?

– We very often receive questions about this from local institutions (who expect to use AGROVOC for that…)

• This is probably the very first conceptual blocker!

Need to support data managers

• Initiatives such as Linked Open Vocabularies (LOV) are useful:

– http://lov.okfn.org/dataset/lov/index.html

• But also need usable and stable tools to support data managers

Drupal’s way to support small users

• Allows one to import data from other sources, create RDF, and expose RDF dumps

• At conversion time, one can choose the vocabulary to use

• Then, it becomes the tool for data maintenance

• No programming skills required, but still some competency with Drupal! And you need to understand RDF and your data!

Other attempts along the same line

• AgriDrupal

– Drupal specially customized for small institutions

– And for bibliographic data, data on people, and organizations

• ScratchPad

– Customized for biodiversity data

URIs

Is assigning URIs also a problem?

• Often not a technical issue…

• Choice may have to do with the languages of the data

– AGROVOC uses numbers because it was not possible to choose one language over the others, but software developers often complain

• Or with the organization’s internal assets

• It may take longer than one would expect…

An AGROVOC URI

Linking data is a bottleneck

Example of linking from AGROVOC

http://aims.fao.org/aos/agrovoc/c_2808 skos:exactMatch http://www.caas.net.cn/caas/cat/c_33429

“farmland” from AGROVOC is declared an exact match to the corresponding Chinese term
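A minimal sketch of stating that link as a triple with rdflib; both URIs are copied from the example above:

    from rdflib import Graph, URIRef
    from rdflib.namespace import SKOS

    g = Graph()
    g.add((
        URIRef("http://aims.fao.org/aos/agrovoc/c_2808"),       # AGROVOC "farmland"
        SKOS.exactMatch,
        URIRef("http://www.caas.net.cn/caas/cat/c_33429"),      # the Chinese thesaurus concept
    ))
    print(g.serialize(format="turtle"))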

Linking entities

• Still an active research area

• Maintenance is still an issue

– See the example of AGROVOC linked to the Chinese thesaurus…

• Data validation usually happens outside the rest of the data lifecycle

Data maintenance

• One choice: keep everything in your database and periodically regenerate the RDF

• Or move maintenance to different tools

In what language is your data?

Certainly, there are many languages beyond English…

http://ioannis.parapontis.com/

Some considerations from a managerial perspective…

Assuming an institution with constrained resources has already decided to go Linked Data, what should it do?

Options

• Go ahead on your own

• Organize a collaboration

– A network creation effort

AGRIS is an example of a network

[Diagram: a central “data coordination” hub connected to many partners; the network can be much smaller or bigger!]

Our conclusions

1) The Semantic Web is energy intensive

• Because of its infrastructure requirements

• The biggest bottleneck is often on the side of IT competencies, and at the interface between IT and domain knowledge, especially for data modeling

• Linked Data-related technologies must become lighter in order to be adoptable in low-resource conditions

2) In low-resource conditions…

• Do a careful assessment of your data and in-house skills

• It is a good idea to organize your effort in collaboration

• Start mobilizing IT specialists, data curators

3) Start with Step 0: identify and use standards to describe your data

• Mobilize IT specialists, data curators

The AGRIS network


…a bibliographical record, original

…the same record transformed

Data Flow


OpenAGRIS data flow

How is linked data produced?

…using title and author

…using the keywords

…using the journal name

http://agris.fao.org/openagris/search.do?recordID=PL2009000495

Linking URIs

Linking vocabularies

Questions?

NISO/DCMI Webinar: Implementing Linked Data in Developing Countries and Low-Resource Conditions

NISO/DCMI Webinar • September 25, 2013

Questions? All questions will be posted with presenter answers on the NISO website following the webinar:

http://www.niso.org/news/events/2013/dcmi/developing

Thank you for joining us today.

Please take a moment to fill out the brief online survey.

We look forward to hearing from you!

THANK YOU