Post on 01-Jul-2015
Linked Open Government Data:
What’s Next?
Li Ding, James A. Hendler, and Deborah L. McGuinness
With thanks to the entire RPI Tetherless World LOGD team (logd.tw.rpi.edu),
particularly John Erickson, Tim Lebo, Dominic DiFranzo, Alvaro Graves,
Gregory Williams, Xian Li, James Michaelis, Jin Zheng, Zhenning Shangguan, Johanna Flores, and Evan Patton
Tetherless World Constellation, Rensselaer Polytechnic Institute
SemTech 2011, San Francisco, June 7, 2011
Outline
• Open Government Data
• Linked Open Government Data
• Challenges and Opportunities
• Future Directions
Open Government Data:
Government data is already available and
open on the Web and is growing.
Let’s create mashups to expose more value.
Opening Government Data
“Openness will strengthen our
democracy and promote efficiency and
effectiveness in Government.”
-- President Obama (Jan 2009)
“if people put data onto the web --
government data, scientific data,
community data, whatever it is -- it
will be used by other people to do
wonderful things, in ways that they
never could have imagined.”
-- Tim Berners-Lee (Feb 2010)
Source: http://www.whitehouse.gov/open, http://www.ted.com/talks/lang/eng/tim_berners_lee_the_year_open_data_went_worldwide.html
Linked Data and Semantic Tech are key enablers!
International Open Government Data:
A Great Opportunity
• 13 Other nations establishing open data
• 24 States now offering data sites
• 11 Cities in America with open data
• 236 New applications from Data.gov datasets
• 258 Data contacts in Federal Agencies
• 308,650 Datasets available on Data.gov
• Open Government Data (OGD)
– A public asset (collected by
government) with a large amount
of high value data and wide
domain coverage
– An international mandate for
government transparency,
business applications, citizen
participation, etc.
Deployment Status (source: Data.gov)
Source: http://www.data.gov/
Challenges from Raw Open Government Data
• Data in proprietary formats
• Independent curators
• Distributed and unlinked data
• Limited participation
• Example datasets: smoke rate (Impacteen.org), policy coverage (NCI)
Linked Open Government Data
TWC: Tetherless World Constellation at Rensselaer Polytechnic Institute logd.tw.rpi.edu
LOGD: Linked Open Government Data
Linked Data is Large and is Growing
The Tetherless World Constellation
Linked Open Govt Data Portal
[Diagram: TWC LOGD portal. Stages: Create, Convert, Enhance, Query/Access (LOGD SPARQL Endpoint). Output formats: RDF, RSS, JSON, XML, HTML, CSV, … Related components: Community Portal, Data.gov deployment]
Linked Open Government Data
A Linked Open Government Data (LOGD)
ecosystem is a Linked Data-based system
where stakeholders of different sizes and
roles find, manage, archive, publish, reuse,
integrate, mash-up, and consume open
government data in connection with online
tools, services and societies.
Moving data.gov to linked data (US)
• Third parties (like RPI) translate the government datasets into linked data formats
• US Data.gov hosts 6.4B RDF triples (5/21/2010)
• Acknowledges Semantic Web as a key technology for open government data
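The translation step mentioned above can be illustrated with a minimal sketch. It is not the TWC LOGD converter. It assumes a naive mapping where each CSV row becomes a subject URI and each cell becomes one literal-valued triple. The `example.org` namespace, the column names, and the sample values are all hypothetical.

```python
import csv
import io

# Hypothetical base namespace; a real deployment would mint its own URIs.
BASE = "http://example.org/dataset/demo"


def csv_to_ntriples(csv_text, base=BASE):
    """Map each CSV row to a subject URI and each cell to one N-Triples line."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{base}/row/{row_number}>"
        for column, value in row.items():
            # Derive a predicate URI from the column header (assumed convention).
            local_name = column.strip().lower().replace(" ", "_")
            predicate = f"<{base}/vocab/{local_name}>"
            # Escape backslashes and quotes only; enough for this sketch.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            triples.append(f'{subject} {predicate} "{escaped}" .')
    return triples


# Made-up sample values, for illustration only.
sample = "State,Smoking Rate\nNew York,18.2\nGeorgia,19.5\n"
for line in csv_to_ntriples(sample):
    print(line)
```

Every value stays a plain string literal here. Assigning datatypes and linking values to shared URIs is left to a later enhancement step.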
Government Data within the LD Cloud
http://linkeddata.org/
Government data is currently over ½ of the cloud in
size (~17B triples), with tens of thousands of links
to other data (within and without)
TWC LOGD: 50+ Demos in Many
Domains using Various Technologies
Technology
• Semantic Web
• Semantic CMS
• Semantic Search
• Social network
• NLP
• Mobile
• Visualization
• Provenance
• …
Domain
• Health
• Finance
• Politics
• Society
• Economy
• …
Selected TWC Mashups
Trends in Smoking Prevalence, Tobacco Policy
Coverage and Tobacco Prices (1991-2007)
PopSciGrid with NIH/NCI & Northwestern
Aimed at conveying complex health-related information to
consumers and health decision makers
Diverse datasets from NIH
Uses lightweight semantic technologies to produce
mashups that make data accessible that would be otherwise
difficult to view in perspective
Maintains provenance about data and manipulations
Two-way communication: feed users’ comments back to
gov contacts (e.g. %)
PopSciGrid Workflow
[Diagram: PopSciGrid workflow. Labels: Convert, Enhance, Integrate, Visualize, Publish, derive, Ban coverage]
The Abstract LOGD Workflow
[Diagram. Actors: Gov Agency, Developer, End User. Data: Raw OGD, LOGD, Mashed Data. Steps: Publish, Convert, Enhance, Integrate, Visualize]
Usability of LOGD
• Interoperability
• Scalability
• Provenance
Mashup Workflow
(Conventional OGD)
1. Publish
2. Mashup
3. Visualize
Challenge: Interoperability
TBL’s 5-star Deployment Scheme for Linked Data
★ make your stuff available on the Web (whatever format) under an open license
★★ make it available as structured data (e.g., Excel instead of image scan of a table)
★★★ use non-proprietary formats (e.g., CSV instead of Excel)
★★★★ use URIs to identify things, so that people can point at your stuff
★★★★★ link your data to other data to provide context
Syntactic
• Extract entities from HTML tables
• Parse Excel tables
Semantic
• Does “Georgia” refer to a US state or a country?
• Is “2000” a calendar year, a fiscal year, or a dollar amount?
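The 5-star scheme can be read as a cumulative checklist. The sketch below assumes each star counts only if all lower levels are also satisfied. The `DatasetInfo` fields are hypothetical names chosen for illustration and are not part of any published vocabulary.

```python
from dataclasses import dataclass


@dataclass
class DatasetInfo:
    """Hypothetical yes/no description of how a dataset is published."""
    open_license: bool            # one star: on the Web under an open license
    structured: bool              # two stars: structured data
    non_proprietary_format: bool  # three stars: e.g., CSV instead of Excel
    uses_uris: bool               # four stars: URIs identify things
    links_to_other_data: bool     # five stars: linked to other data


def star_rating(info):
    """Count stars cumulatively, stopping at the first unmet level."""
    levels = [
        info.open_license,
        info.structured,
        info.non_proprietary_format,
        info.uses_uris,
        info.links_to_other_data,
    ]
    stars = 0
    for satisfied in levels:
        if not satisfied:
            break
        stars += 1
    return stars


# An openly licensed CSV file without URIs or links.
csv_file = DatasetInfo(True, True, True, False, False)
print(star_rating(csv_file))
```

Under this reading, a dataset that links to other data but lacks an open license still scores zero, because the license is the first level.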
Mashing up data from different
countries
http://data-gov.tw.rpi.edu/demo/USForeignAid/demo-1554.html
Even if not “rationalized” together
Build ontology mapping based on shared terms, e.g., “Economic”
Enhance interoperability using Linked
Data: drill down contextual knowledge
• Identity: URI
• Context
– Description: metadata, esp. type & datatype
– Mapping (linking identities)
• Syntactic
– Common string name
– Common URI
• Semantic
– Complex Object: attributes + context (siblings)
– Ontological Mapping: e.g., owl:sameAs
– Rule-based Mapping: e.g. mapping “Liter” to “Gallon”
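Two of the semantic mappings listed above can be sketched in a few lines. This assumes `owl:sameAs` statements are treated as an equivalence relation, tracked here with a small union-find structure, and that a rule-based mapping is a plain conversion function. The `example.org` URIs are hypothetical, and the sketch is not the TWC LOGD implementation.

```python
# Exact definition of the US liquid gallon in liters.
LITERS_PER_US_GALLON = 3.785411784


def liters_to_gallons(liters):
    """Rule-based mapping: convert a value in liters to US gallons."""
    return liters / LITERS_PER_US_GALLON


class SameAsIndex:
    """Groups URIs declared identical via owl:sameAs (union-find)."""

    def __init__(self):
        self._parent = {}

    def _find(self, uri):
        self._parent.setdefault(uri, uri)
        while self._parent[uri] != uri:
            # Path halving keeps lookups short.
            self._parent[uri] = self._parent[self._parent[uri]]
            uri = self._parent[uri]
        return uri

    def add_same_as(self, a, b):
        """Record one owl:sameAs statement between two URIs."""
        self._parent[self._find(a)] = self._find(b)

    def same(self, a, b):
        """True if a and b are connected by a chain of owl:sameAs links."""
        return self._find(a) == self._find(b)


index = SameAsIndex()
# Hypothetical agency URIs, both mapped to the same shared identifier.
index.add_same_as("http://example.org/agencyA/state/GA",
                  "http://dbpedia.org/resource/Georgia_(U.S._state)")
index.add_same_as("http://example.org/agencyB/region/georgia",
                  "http://dbpedia.org/resource/Georgia_(U.S._state)")

print(index.same("http://example.org/agencyA/state/GA",
                 "http://example.org/agencyB/region/georgia"))
print(index.same("http://example.org/agencyA/state/GA",
                 "http://dbpedia.org/resource/Georgia_(country)"))
```

Once both agencies point at the same URI, the “Georgia” ambiguity is settled by identity rather than by string matching.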
Scalability factors in LOGD
deployment
• Large number of OGD datasets
– 6k+ Data.gov.uk
– 200k+ Data.gov
– 323k+ International OGD datasets
• Non-trivial human workload: clean-up syntax, enhance semantics, integrate datasets, visualize resulting data …
• Substantial computing workload: running time of complex tasks, memory and disk space, maintenance costs …
International catalog
Scalability issues in the International
Open Government Dataset Catalog
Social Aspect
• Crawled 40+ different dataset catalogs from 19 countries
• “non-trivial customized programming workload”
Computing Aspect
• Searching 323,304 datasets
• “Complex SPARQL query got timeout”
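One common way to keep a large catalog query under an endpoint's time limit is to page through results with `LIMIT` and `OFFSET`. The sketch below is an assumption about how that could look, not a description of what the catalog actually does. `run_query` is a hypothetical callable that sends one SPARQL string to an endpoint and returns a list of rows, and the `example.org` class URI is a placeholder.

```python
def paged_queries(base_query, page_size, max_pages):
    """Yield copies of base_query with LIMIT/OFFSET appended, one per page."""
    for page in range(max_pages):
        yield f"{base_query}\nLIMIT {page_size}\nOFFSET {page * page_size}"


def fetch_all(run_query, base_query, page_size=1000, max_pages=1000):
    """Collect rows page by page until a short page signals the end.

    run_query is supplied by the caller (hypothetical); it takes a SPARQL
    string and returns a list of result rows.
    """
    results = []
    for query in paged_queries(base_query, page_size, max_pages):
        page = run_query(query)
        results.extend(page)
        if len(page) < page_size:
            break
    return results


# ORDER BY keeps the paging stable across requests.
BASE_QUERY = """SELECT ?dataset ?title WHERE {
  ?dataset a <http://example.org/vocab/Dataset> ;
           <http://purl.org/dc/terms/title> ?title .
}
ORDER BY ?dataset"""

for query in paged_queries(BASE_QUERY, page_size=1000, max_pages=2):
    print(query.splitlines()[-2:])
```

Paging trades one long query for many short ones. It does not help when a single page is already too expensive to compute.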
Social Aspect: Distribute human
workload to the right developers
[Diagram: developer roles placed along two axes, Domain Expertise and Application Development Expertise. Roles: Layman End Users, Software Engineers, Scientists/Experts, Genus, Students, Knowledge Engineers. Tasks: Convert, Enhance, Combine, Visualize, Publish]
Joint work with Alvaro Graves, PhD student at RPI
1. Decompose workload into fine-grained jobs
2. Leverage a wider range of developers
Computing Aspect: fit computing
power to LOGD deployment
• Scale up for more government data
– Support collective incremental data processing
– Support large scale data analysis: graph connectivity, complex pattern/hypotheses discovery
– Map repetitive developers’ workload to automated tools
– Reduce service maintenance costs
• Scale down for wider range of end user apps
– Limited computing power, e.g., mobile devices
– End users’ cognitive constraints, e.g., screen-size, executive summary
Provenance
• Provenance-aware frameworks are
needed to support transparency,
appropriate attribution, and ultimately trust
of any kind of open data.
• Versioning and persistence are important
factors for sustainable applications
• Workflow provenance can help increase
understanding and trust since it can be
used to explain behavior and
dependencies of intelligent systems
Attribution in PopSciGrid
[Diagram: provenance graph. Node types: demo, person, technology, dataset, agency, version, conversion. Edge labels: logd:uses_technology, logd:uses_dataset, dcterms:contributor, dcterms:publisher, void:subset. Example node: State-wise Tobacco Policy coverage stats]
Example scenarios
• List direct/indirect contributors
• End users send feedback to curators
• Curators learn usage of datasets
• List demos by technology
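The "list direct/indirect contributors" scenario can be sketched as a small graph walk. The node names and edge directions below are assumptions, since the slide shows only the labels. The sketch treats a demo's own `dcterms:contributor` values as direct, and the contributors and publishers of the datasets it uses, plus the datasets that contain them via `void:subset`, as indirect.

```python
# Hypothetical provenance triples (subject, predicate, object).
TRIPLES = [
    ("ex:demo", "dcterms:contributor", "ex:alice"),
    ("ex:demo", "logd:uses_dataset", "ex:conversion"),
    ("ex:version", "void:subset", "ex:conversion"),
    ("ex:dataset", "void:subset", "ex:version"),
    ("ex:conversion", "dcterms:contributor", "ex:bob"),
    ("ex:dataset", "dcterms:publisher", "ex:agency"),
]


def contributors(triples, demo):
    """Return (direct, indirect) sets of contributors for one demo."""
    direct = {o for s, p, o in triples
              if s == demo and p == "dcterms:contributor"}

    # Start from the datasets the demo uses, then climb to enclosing datasets.
    frontier = [o for s, p, o in triples
                if s == demo and p == "logd:uses_dataset"]
    seen = set()
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        # "parent void:subset node" means node is a subset of parent.
        frontier.extend(s for s, p, o in triples
                        if p == "void:subset" and o == node)

    indirect = {o for s, p, o in triples
                if s in seen and p in ("dcterms:contributor", "dcterms:publisher")}
    return direct, indirect - direct


direct, indirect = contributors(TRIPLES, "ex:demo")
print(sorted(direct))
print(sorted(indirect))
```

The same walk, run in the other direction from a dataset to the demos that use it, would cover the "curators learn usage of datasets" scenario.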
TWC Semantic Water Quality Portal
Aimed at helping people investigate local water quality
Diverse datasets, regulations, datatypes
Uses lightweight semantic technologies to produce mashups that make
data accessible that would be otherwise difficult to view in perspective
Maintains provenance about data and manipulations
Exposes unexpected uses of data (and thus unexpected usage patterns)
Detailed View of Pollution
Provenance of regulations
Challenges Revisited
• Interoperability
– Syntactic: Linked Data, RDF
– Semantic: ontology, evolving
• Scalability (9.9 billion triples on the TWC LOGD)
– Effective social platform for task dispatching
– More automation, e.g., data cleaning and link detection
– Scalable tools, esp. SPARQL endpoint
• Provenance
– Accountability: privacy, licensing, trust
– Credit / Blame
– Replicate applications and transfer system building knowledge
• More issues
– Persistent data access for changing data
– …
Summary
• Open government data is a key resource
– Many governments releasing data, growing number in structured form
• Government (and general data) transparency comes through
in the “mashing up” of data from many sites maintaining (and
exposing) provenance
– Key to linked data
• While there has been tremendous progress, many
challenges remain
– Trust, Provenance, Scaling, Interoperability, Archiving, Curation, …
• The research agenda for linked government data is
an important driving area for semantic technologies
Questions?
The work presented in this talk was primarily conducted at the
Tetherless World Constellation at Rensselaer Polytechnic Institute.
Comments / Questions:
[ dingl | dlm ] @ cs.rpi.edu.
Events:
Open Linked Govt. Data Symposium: submission deadline June 15
http://tw.rpi.edu/web/event/AAAI/2011/Fall_Symposium_OGK
TWC / Elsevier Hackathon: June 27-28
http://tw.rpi.edu/web/event/TWCElsevierHackathonJune2011
Reference: Li Ding, Timothy Lebo, John S. Erickson, Dominic DiFranzo, Gregory Todd Williams, Xian Li, James
Michaelis, Alvaro Graves, Jin Guang Zheng, Zhenning Shangguan, Johanna Flores, Deborah L. McGuinness and
Jim Hendler, TWC LOGD: A Portal for Linked Open Government Data Ecosystems, submitted to JWS, special
issue on semantic web challenge’10