Linked Open Government Data: What’s Next?

Post on 01-Jul-2015

DESCRIPTION

Presented at SemTech 2011. Open government data and related services and applications are growing quickly on the Web. Although most agree that government data has great potential for solving real-world problems, many challenges must still be addressed. This talk describes several representative domain applications and gives concrete examples of the technical challenges that remain. We show solution paths that have proven useful and make recommendations on the corresponding Semantic Web best practices.

• Scalability. How can we handle (e.g., search and cleanse) the 3,000+ raw/tool datasets and the additional 300,000+ geo datasets from data.gov?

• Interoperability. Multi-scale open government data comes from city, state, and national governments. How can one compare the GDP of the US and China, and later link to state-level financial data? Open government data also covers many domains. How can one associate open government data with domain knowledge to build a cancer prevention application?

• Provenance and quality. How should provenance be leveraged to facilitate high-quality data management interactions (e.g., reuse, mash-up, and feedback) between the government and the public?

TRANSCRIPT

Linked Open Government Data:

What’s Next?

Li Ding, James A. Hendler, and Deborah L. McGuinness

With thanks to the entire RPI Tetherless World LOGD team (logd.tw.rpi.edu), particularly John Erickson, Tim Lebo, Dominic DiFranzo, Alvaro Graves, Gregory Williams, Xian Li, James Michaelis, Jin Zheng, Zhenning Shangguan, Johanna Flores, and Evan Patton

Tetherless World Constellation, Rensselaer Polytechnic Institute

SemTech 2011 San Francisco June 7, 2011

Outline

• Open Government Data

• Linked Open Government Data

• Challenges and Opportunities

• Future Directions

Open Government Data:

Government data is already available and

open on the Web and is growing.

Let’s create mash ups to expose more value.

Opening Government Data

“Openness will strengthen our

democracy and promote efficiency and

effectiveness in Government.”

--- President Obama (Jan 2009)

“if people put data onto the web --

government data, scientific data,

community data, whatever it is -- it

will be used by other people to do

wonderful things, in ways that they

never could have imagined.”

-- Tim Berners-Lee (Feb 2010)

Source: http://www.whitehouse.gov/open, http://www.ted.com/talks/lang/eng/tim_berners_lee_the_year_open_data_went_worldwide.html

Linked Data and Semantic Tech are key enablers!

International Open Government Data:

A Great Opportunity

• 13 Other nations establishing open data

• 24 States now offering data sites

• 11 Cities in America with open data

• 236 New applications from Data.gov datasets

• 258 Data contacts in Federal Agencies

• 308,650 Datasets available on Data.gov

• Open Government Data (OGD)

– A public asset (collected by

government) with a large amount

of high value data and wide

domain coverage

– An international mandate for

government transparency,

business applications, citizen

participation, etc.

Deployment Status (source: Data.gov)

Source: http://www.data.gov/

Challenges from Raw Open

Government Data

• Data in proprietary formats

• Independent curators

• Distributed and unlinked data, e.g., smoking rate (Impacteen.org) and policy coverage (NCI)

• Limited participation

Linked Open Government Data

TWC: Tetherless World Constellation at Rensselaer Polytechnic Institute logd.tw.rpi.edu

LOGD: Linked Open Government Data

Linked Data is Large and is Growing

Linked Open Government Data

A Linked Open Government Data (LOGD)

ecosystem is a Linked Data-based system

where stakeholders of different sizes and

roles find, manage, archive, publish, reuse,

integrate, mash-up, and consume open

government data in connection with online

tools, services and societies.

Moving data.gov to linked data (US)

• Third parties (like RPI) translate the government datasets into linked data formats

• US Data.gov hosts 6.4B RDF triples (5/21/2010)

• Acknowledges Semantic Web as a key technology for open government data

Government Data within the LD Cloud

http://linkeddata.org/

Government data is currently over half the cloud in size (~17B triples), with tens of thousands of links to other data (within and without)

TWC LOGD: 50+ Demos in Many

Domains using Various Technologies

Technology

• Semantic Web

• Semantic CMS

• Semantic Search

• Social network

• NLP

• Mobile

• Visualization

• Provenance

• …

Domain

• Health

• Finance

• Politics

• Society

• Economy

• …

Selected TWC Mashups

Trends in Smoking Prevalence, Tobacco Policy

Coverage and Tobacco Prices (1991-2007)

PopSciGrid with NIH/NCI & Northwestern

Aimed at conveying complex health-related information to

consumers and health decision makers

Diverse datasets from NIH

Uses lightweight semantic technologies to produce

mashups that make data accessible that would be otherwise

difficult to view in perspective

Maintains provenance about data and manipulations

Two-way communication: Feedback users’ comments to

gov contacts (e.g. %)

PopSciGrid Workflow

[Figure: PopSciGrid workflow diagram. Steps shown: Convert, Enhance, Integrate, Publish, Visualize. Edge label: derive. Data label: Ban coverage.]

The Abstract LOGD Workflow

[Figure: abstract workflow diagram. Actors: Gov Agency, Developer, End User. Data: Raw OGD, LOGD, Mashed Data. Steps: Convert, Enhance, Integrate, Publish, Visualize.]

Usability of LOGD

• Interoperability

• Scalability

• Provenance

Mashup Workflow

(Conventional OGD)

1. Publish

2. Mashup

3. Visualize

Challenge: Interoperability

TBL’s 5-star Deployment Scheme for Linked Data

★ make your stuff available on the Web (whatever format) under an open license

★★ make it available as structured data (e.g., Excel instead of image scan of a table)

★★★ use non-proprietary formats (e.g., CSV instead of Excel)

★★★★ use URIs to identify things, so that people can point at your stuff

★★★★★ link your data to other data to provide context

Syntactic

• Extract entities from HTML tables

• Parse Excel tables

Semantic

• Does “Georgia” refer to a US state or a country?

• Is “2000” a calendar year, a fiscal year, or a dollar amount?
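To make the step from three stars (CSV) to four stars (URIs) concrete, here is a minimal sketch of converting one CSV row into N-Triples. It is an illustration only, not the TWC converter: the `http://example.org/ogd/` namespace, the column names, and the sample row are all made up. The typed literals show one way to record whether "2000" is a calendar year or an amount.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace, chosen for illustration; not from the talk.
BASE = "http://example.org/ogd/"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Column name -> (predicate local name, XSD datatype).
# The datatype records whether "2000" is a year or an amount.
COLUMN_TYPES = {
    "year": ("year", XSD + "gYear"),
    "spending_usd": ("spendingUSD", XSD + "decimal"),
}

# Made-up sample data.
SAMPLE_CSV = "state,year,spending_usd\nNew York,2000,2000\n"


def row_to_ntriples(row, key_column="state"):
    """Turn one CSV row into N-Triples lines about a URI-identified subject."""
    subject = "<" + BASE + "record/" + quote(row[key_column]) + ">"
    lines = []
    for column, (local_name, datatype) in COLUMN_TYPES.items():
        predicate = "<" + BASE + "vocab/" + local_name + ">"
        # Note: values containing quotes or backslashes would need escaping.
        literal = '"' + row[column] + '"^^<' + datatype + ">"
        lines.append(f"{subject} {predicate} {literal} .")
    return lines


def convert(csv_text):
    """Convert every row of a CSV document into N-Triples lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [line for row in reader for line in row_to_ntriples(row)]


if __name__ == "__main__":
    for line in convert(SAMPLE_CSV):
        print(line)
```

Both values in the sample row are the string "2000"; only the datatype attached during conversion tells a consumer which one is the year.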

Mashing up data from different

countries

http://data-gov.tw.rpi.edu/demo/USForeignAid/demo-1554.html

Even if not “rationalized” together

Build ontology mapping

based on shared terms

“Economic”

Enhance interoperability using Linked

Data: drill down contextual knowledge

• Identity: URI

• Context

– Description: metadata, esp. type & datatype

– Mapping (linking identities)

• Syntactic

– Common string name

– Common URI

• Semantic

– Complex Object: attributes + context (siblings)

– Ontological Mapping: e.g., owl:sameAs

– Rule-based Mapping: e.g. mapping “Liter” to “Gallon”
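The two semantic mapping styles listed above can be sketched in a few lines. This is a hedged illustration, not the TWC LOGD implementation: `SameAsIndex` is a hypothetical helper that groups URIs connected by owl:sameAs links, the `example.org` URIs are invented, and the unit rule assumes the US liquid gallon (the slide does not say which gallon is meant).

```python
# Rule-based mapping: 1 US liquid gallon is defined as 3.785411784 liters.
LITERS_PER_US_GALLON = 3.785411784


def liters_to_gallons(liters):
    """Convert a volume in liters to US liquid gallons."""
    return liters / LITERS_PER_US_GALLON


class SameAsIndex:
    """Groups URIs connected by owl:sameAs links (a small union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        # Register unseen URIs as their own representative.
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving keeps lookups short.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def add_same_as(self, a, b):
        """Record that a owl:sameAs b."""
        self._parent[self._find(a)] = self._find(b)

    def same(self, a, b):
        """True if a and b are linked, directly or transitively."""
        return self._find(a) == self._find(b)


if __name__ == "__main__":
    index = SameAsIndex()
    # Hypothetical local URIs linked to DBpedia-style identifiers.
    index.add_same_as("http://example.org/us/state/Georgia",
                      "http://dbpedia.org/resource/Georgia_(U.S._state)")
    index.add_same_as("http://example.org/world/country/Georgia",
                      "http://dbpedia.org/resource/Georgia_(country)")
    # The two "Georgia" records stay distinct because no link joins them.
    print(index.same("http://example.org/us/state/Georgia",
                     "http://example.org/world/country/Georgia"))
    print(liters_to_gallons(10.0))
```

Treating owl:sameAs as an equivalence relation is a simplification; in practice such links vary in quality, so a real system may want to keep their provenance rather than merge identities blindly.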

Scalability factors in LOGD

deployment

• Large number of OGD datasets

– 6k+ Data.gov.uk

– 200k+ Data.gov

– 323k+ International OGD datasets

• Non-trivial human workload: clean-up syntax, enhance semantics, integrate datasets, visualize resulting data …

• Substantial computing workload: running time of complex tasks, memory and disk space, maintenance costs …

International catalog

Scalability issues in the International

Open Government Dataset Catalog

• Social aspect: crawled 40+ different dataset catalogs from 19 countries (“non-trivial customized programming workload”)

• Computing aspect: searching 323,304 datasets (“Complex SPARQL query got timeout”)

Social Aspect: Distribute human

workload to the right developers

[Figure: developer roles and tasks arranged along two axes, Domain Expertise and Application Development Expertise. Roles: Layman End Users; Software Engineers; Scientists, Experts; Genus; Students; Knowledge Engineers. Tasks: Convert, Enhance, Combine, Visualize, Publish.]

Joint work with Alvaro Graves, PhD student at RPI

1. Decompose workload into fine-grained jobs

2. Leverage a wider range of developers

Computing Aspect: fit computing

power to LOGD deployment

• Scale up for more government data

– Support collective incremental data processing

– Support large scale data analysis: graph connectivity, complex pattern/hypotheses discovery

– Map repetitive developers’ workload to automated tools

– Reduce service maintenance costs

• Scale down for wider range of end user apps

– Limited computing power, e.g., mobile devices

– End users’ cognitive constraints, e.g., screen-size, executive summary

Provenance

• Provenance-aware frameworks are

needed to support transparency,

appropriate attribution, and ultimately trust

of any kind of open data.

• Versioning and persistence are important

factors for sustainable applications

• Workflow provenance can help increase

understanding and trust since it can be

used to explain behavior and

dependencies of intelligent systems

Attribution in PopSciGrid

[Figure: attribution graph. Node types: demo, person, technology, dataset, agency, version, conversion. Edge labels: logd:uses_technology, logd:uses_dataset, dcterms:contributor, dcterms:publisher, void:subset. Example dataset: State-wise Tobacco Policy coverage stats.]

Example scenarios

• List direct/indirect contributors

• End users send feedback to curators

• Curators learn usage of datasets

• List demos by technology
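The "list direct/indirect contributors" scenario can be sketched as a walk over attribution triples. This is an illustration under stated assumptions, not the PopSciGrid code: the triples and names (`person:alice`, `agency:example-agency`, and so on) are made up, and the edge directions are a guess, with `void:subset` read as "parent has subset child" as in the VoID vocabulary.

```python
from collections import deque

# Made-up attribution triples; edge directions are assumptions.
TRIPLES = [
    ("demo:popscigrid", "dcterms:contributor", "person:alice"),
    ("demo:popscigrid", "logd:uses_dataset", "conversion:policy-v1-enhanced"),
    ("version:policy-v1", "void:subset", "conversion:policy-v1-enhanced"),
    ("dataset:policy", "void:subset", "version:policy-v1"),
    ("dataset:policy", "dcterms:publisher", "agency:example-agency"),
    ("conversion:policy-v1-enhanced", "dcterms:contributor", "person:bob"),
]

# Predicates that assign credit to a person or agency.
CREDIT = {"dcterms:contributor", "dcterms:publisher"}


def objects(triples, subject, predicate):
    """All o such that (subject, predicate, o) is in triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def subjects(triples, predicate, obj):
    """All s such that (s, predicate, obj) is in triples."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def contributors(triples, demo):
    """Return (direct, indirect) credited parties for a demo."""
    direct = {o for s, p, o in triples if s == demo and p in CREDIT}
    indirect = set()
    # Start from the datasets the demo uses, then climb void:subset
    # links to the enclosing versions and datasets.
    queue = deque(objects(triples, demo, "logd:uses_dataset"))
    seen = set(queue)
    while queue:
        node = queue.popleft()
        indirect.update(o for s, p, o in triples if s == node and p in CREDIT)
        for parent in subjects(triples, "void:subset", node):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return direct, indirect - direct


if __name__ == "__main__":
    direct, indirect = contributors(TRIPLES, "demo:popscigrid")
    print(sorted(direct))
    print(sorted(indirect))
```

The same traversal run in reverse (from a dataset to the demos that use it) would cover the "curators learn usage of datasets" scenario.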

TWC Semantic Water Quality Portal

Aimed at helping people investigate local water quality

Diverse datasets, regulations, datatypes

Uses lightweight semantic technologies to produce mashups that make

data accessible that would be otherwise difficult to view in perspective

Maintains provenance about data and manipulations

Exposes unexpected uses of data (and thus unexpected usage patterns)

Detailed View of Pollution

Provenance of regulations

Challenges Revisited

• Interoperability

– Syntactic: Linked Data, RDF

– Semantic: ontology, evolving

• Scalability (9.9 billion triples on the TWC LOGD)

– Effective social platform for task dispatching

– More automation, e.g., data cleaning and link detection

– Scalable tools, esp. SPARQL endpoint

• Provenance

– Accountability: privacy, licensing, trust

– Credit / Blame

– Replicate applications and transfer system building knowledge

• More issues

– Persistent data access for changing data

– …

Summary

• Open government data is a key resource

– Many governments releasing data, growing number in structured form

• Government (and general data) transparency comes through

in the “mashing up” of data from many sites while maintaining (and

exposing) provenance

– Key to linked data

• While there has been tremendous progress, many

challenges remain

– Trust, Provenance, Scaling, Interoperability, Archiving, Curation, …

• The Research agenda for linked government data is

an important driving area for semantic technologies

Questions?

The work presented in this talk was primarily conducted at the

Tetherless World Constellation at Rensselaer Polytechnic Institute.

Comments / Questions:

[ dingl | dlm ] @ cs.rpi.edu.

Events:

Open Linked Govt. Data Symposium: submission deadline June 15

http://tw.rpi.edu/web/event/AAAI/2011/Fall_Symposium_OGK

TWC / Elsevier Hackathon: June 27-28

http://tw.rpi.edu/web/event/TWCElsevierHackathonJune2011

Reference: Li Ding, Timothy Lebo, John S. Erickson, Dominic DiFranzo, Gregory Todd Williams, Xian Li, James

Michaelis, Alvaro Graves, Jin Guang Zheng, Zhenning Shangguan, Johanna Flores, Deborah L. McGuinness and

Jim Hendler, TWC LOGD: A Portal for Linked Open Government Data Ecosystems, submitted to JWS, special

issue on semantic web challenge’10
