Stefano Mazzocchi, Researcher at MIT, Application Catalyst at Metaweb Technologies, Inc.
Stephen Garland, Principal Research Scientist, Emeritus, MIT Computer Science and Artificial Intelligence Laboratory
Ryan Lee, W3C Research Engineer
January 26, 2005 - XML.com
Massachusetts Institute of Technology (MIT) Research Activity
Alireza Abbasi, Technology Management, Economics and Policy Program (TEMEP), College of Eng., SNU
SIMILE Project
Focused on collecting and publishing Semantic Web data to the (non-Semantic) Web.
Researching solutions to data interoperability problems for digital libraries using semantic web technologies.
RDF-based tools: Longwell, Gadget, RDFizer, Welkin, Fresnel, Timeline (new), Referee (new), Crowbar (new), Piggy Bank (new), Solvent (new)
Introduction
Digital Libraries’ Problem: Browsing digital libraries is a difficult process of
navigating through different interfaces and different terminologies for each collection
SIMILE Project [Semantic Interoperability of Metadata In unLike Environments]
Make it easier to wander from collection to collection and, more generally, to find your way around in the Semantic Web.
Motivated by DSpace, a repository for storing, indexing, preserving, and redistributing digital assets. DSpace manages metadata about the content and distributes it on the web.
DSpace
Jointly developed by HP Research Labs and the MIT Libraries (open source software).
Used by many research-producing organizations, and often by their libraries, to manage digital data and for researchers to find that data.
DSpace (2)
Needs to support additional metadata schemas for a variety of purposes:
finding digital research material described in various, domain-specific ways,
managing that digital content over time in order to preserve it.
As DSpace expands to use new metadata schemas, it will have to deal with the problem of interoperability.
Incentive of SIMILE
The Semantic Web core stack (RDF, RDFS, and OWL) enables people to create ontologies to describe their specialized metadata and to make them generally reusable.
But most people are not trained Semantic Web developers.
So they need tools to do this and to assess whether they did the job correctly.
Goals of SIMILE
To extend DSpace, enhancing support for arbitrary schemas and metadata and providing an architecture for disseminating digital assets
Create the tools that metadata specialists (e.g., librarians) need to produce good-quality RDF, given their limited expertise in defining ontologies, creating RDF, and converting existing XML-based metadata into RDF.
Make metadata interoperability easier for digital libraries by providing useful tools for browsing, searching, and mapping heterogeneous metadata in RDF.
SIMILE – Delivered Components
Tools for Metadata Managers:
Gadget - XML inspector
RDFizers - Batch tools to transform existing XML data into RDF
Solvent* - Firefox extension for JavaScript screen scraping
Welkin - Graphical tool to inspect/edit RDF graph
Tools for End-Users:
Longwell - Web-based RDF faceted metadata browser
Fresnel - Vocabulary for specifying how RDF graphs are presented
Piggy Bank* - Firefox extension for personal info. management of metadata in RDF
Semantic Bank* - Web-based server that allows data publishing and sharing by individuals, groups, or communities
Exhibit* - Lightweight structured data publishing framework
Timeline* - AJAXy widget for visualizing time-based events
*: new tools after the paper
SIMILE: Tools for Metadata Managers
RDFizers - Batch tools to transform existing XML data into RDF
Gadget - XML inspector
Welkin - Graphical tool to inspect/edit RDF graph
Solvent* - Firefox extension for JavaScript screen scraping
*: new tools after the paper
RDFizers: Transform XML data into RDF
RDF’s strength is “defining models in the highly distributed nature”
But the RDF/XML serialization is a very unfriendly compromise.
So RDFizers was created to build and catalog software tools and scripts that transform data from existing syntaxes into RDF. This allows people to explore their existing data in available RDF browsing tools.
It helps to resolve the Semantic Web chicken-and-egg problem: "not much RDF data will be created without a killer app, but no killer app will be created without more RDF data."
Solution: making it easier for specialists (like librarians and other metadata experts) to convert popular and widely available metadata sources into RDF.
RDFizers (2)
Done with XSLT style sheets, simple scripts
Need to define RDF “ontologies” for each
List of RDFizers in SIMILE:
MARC/MODS to RDF
OAI-PMH to RDF
OCW to RDF
EMail to RDF
BibTeX to RDF
Flat to RDF
Weather to RDF
Java to RDF
Javadoc to RDF
Jira to RDF
Subversion to RDF
Random to RDF
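The "simple scripts" approach can be illustrated with a minimal sketch. It is not one of the SIMILE RDFizers: the sample XML, the element names, and the `http://example.org/` namespaces are all hypothetical, and a real RDFizer would map onto an agreed ontology. It only shows the general idea of turning each XML record into RDF statements, here written as N-Triples.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input: element and attribute names are illustrative only.
SAMPLE_XML = """
<courses>
  <course id="c1">
    <title>Introductory Example Course</title>
    <subject>Computer Science</subject>
  </course>
  <course id="c2">
    <title>Another "Example" Course</title>
    <subject>Mathematics</subject>
  </course>
</courses>
"""

# Hypothetical namespaces (assumptions, not SIMILE's actual ontologies).
BASE = "http://example.org/course/"
VOCAB = "http://example.org/vocab#"


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def rdfize(xml_text):
    """Emit one N-Triples line per child element of each <course> record."""
    root = ET.fromstring(xml_text)
    triples = []
    for course in root.iter("course"):
        subject = f"<{BASE}{course.get('id')}>"
        for child in course:
            predicate = f"<{VOCAB}{child.tag}>"
            value = escape_literal((child.text or "").strip())
            triples.append(f'{subject} {predicate} "{value}" .')
    return triples


if __name__ == "__main__":
    for line in rdfize(SAMPLE_XML):
        print(line)
```

The output can be loaded into any RDF tool that reads N-Triples, which is the point of an RDFizer: existing data becomes browsable without hand-writing RDF/XML.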
Gadget: XML inspector
Problem in transforming existing XML datasets into RDF: a lack of tools that give you an at-a-glance overview of an XML dataset (or a collection of XML documents).
Gadget helps data managers understand the structure of an XML dataset by providing a summary of the count, unique values, and percentage of unique values for XML attributes.
Works on any well-formed XML
Used for:
Data exploration, understanding
Data migration, transformation
Data cleanup
Complexity evaluation
Schema adherence understanding
Schema emergence (if none provided)
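The kind of summary described above (count, unique values, and percentage of unique values per XML attribute) can be sketched in a few lines. This is not Gadget's implementation; the sample document and the `element/@attribute` key format are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document; any well-formed XML would do.
SAMPLE_XML = """
<records>
  <record lang="en" type="book"/>
  <record lang="en" type="article"/>
  <record lang="fr" type="book"/>
  <record lang="en" type="thesis"/>
</records>
"""


def summarize_attributes(xml_text):
    """Return count, unique-value count, and percent unique for each attribute."""
    root = ET.fromstring(xml_text)
    values = defaultdict(list)
    # Collect every attribute value, keyed by "element/@attribute".
    for element in root.iter():
        for name, value in element.attrib.items():
            values[f"{element.tag}/@{name}"].append(value)
    summary = {}
    for key, seen in values.items():
        unique = len(set(seen))
        summary[key] = {
            "count": len(seen),
            "unique": unique,
            "percent_unique": 100.0 * unique / len(seen),
        }
    return summary


if __name__ == "__main__":
    for key, stats in summarize_attributes(SAMPLE_XML).items():
        print(key, stats)
```

A low percentage of unique values suggests a controlled vocabulary (a good facet candidate); a value near 100% suggests an identifier or free text.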
Gadget: sample
OCW: 2,002,015 Lines of XML
Welkin: Graphical tool to inspect/edit RDF graph
Configuring tools like Longwell requires a thorough understanding of the structure of the data being examined.
It is hard to get a global overview of an RDF model, and few tools summarize RDF and give a quick mental model of the data being manipulated with a browser.
So Welkin was created: an interactive graphical RDF browser that
visualizes any RDF model without requiring prior configuration (like Knowle, but unlike Longwell),
displays RDF as a clustered set of nodes and arcs,
is useful for understanding and mining the layout of unfamiliar datasets,
tries to empower the user with an interactive approach, allowing users to mine, zoom, drag, select, cluster, filter, and highlight nodes and arcs.
Solvent (new*): Easier Scraping to RDF
A Firefox extension that helps write JavaScript screen scrapers for Piggy Bank.
Motivation: turn a regular web page into a semantic web page, freeing the data from the page/site that contains it.
Piggy Bank needs web pages to embed information in RDF.
Unfortunately, not many web pages embed or link to RDF information.
So Piggy Bank can execute a particular screen scraper on particular pages in order to "extract" the information it needs.
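The idea of a screen scraper that "extracts" structured statements from an ordinary page can be sketched as follows. Solvent scrapers are written in JavaScript and run inside Piggy Bank; this Python sketch does not use that API. The page markup, the CSS class names, and the `http://example.org/vocab#` namespace are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page markup: class names are illustrative only.
SAMPLE_HTML = """
<html><body>
  <div class="book">
    <span class="title">An Example Book</span>
    <span class="author">A. N. Author</span>
    <span class="price">not scraped</span>
  </div>
</body></html>
"""

# Fields the scraper knows how to extract (an assumption of this sketch).
FIELDS = {"title", "author"}
VOCAB = "http://example.org/vocab#"


class FieldScraper(HTMLParser):
    """Collect the text of <span> elements whose class names a known field."""

    def __init__(self):
        super().__init__()
        self.current_field = None
        self.found = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in FIELDS:
            self.current_field = css_class

    def handle_data(self, data):
        if self.current_field and data.strip():
            self.found.append((self.current_field, data.strip()))

    def handle_endtag(self, tag):
        if tag == "span":
            self.current_field = None


def scrape(html_text, page_url):
    """Return (subject, predicate, object) statements about the page."""
    parser = FieldScraper()
    parser.feed(html_text)
    parser.close()
    return [(page_url, VOCAB + field, value) for field, value in parser.found]


if __name__ == "__main__":
    for triple in scrape(SAMPLE_HTML, "http://example.org/page"):
        print(triple)
```

Because each scraper is tied to one site's markup, a tool that helps write and test scrapers interactively makes this approach practical.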
Solvent (example)
SIMILE: Tools for End-Users
Longwell - Web-based RDF faceted metadata browser
Fresnel - Vocabulary for specifying how RDF graphs are presented
Piggy Bank* - Firefox extension for personal info. management of metadata in RDF
Semantic Bank* - Web-based server that allows data publishing and sharing by individuals, groups, or communities
Exhibit* - Lightweight structured data publishing framework
*: new tools after the paper
Longwell: RDF faceted metadata browser
RDF browsing for library users
Longwell, a web-based, RDF-powered, highly configurable faceted browser, targets end users by hiding the presence of the underlying RDF model.
Knowle (shipped as part of the Longwell distribution), a node-focused graph navigation browser, targets people who want to see or debug the underlying RDF model.
The browsing suite is written as Java servlets and is built around HP's Jena2 Semantic Web toolkit.
Longwell (sample)
Haystack: extensible "universal information client"
Enables users to manage diverse sources of information (e.g., email, calendars, address books, and web pages) by defining whichever arrangements of, connections between, and views of information they find most effective.
The interaction offered by a web-browser interface is too limited, so the Haystack project is exploring a "rich client" interface that allows RDF data to be manipulated as well as navigated.
Unlike Welkin, which displays information as a graph, Haystack aims for a Longwell-like presentation of information that is natural for simple end users. It uses standard primitives like drag and drop and context menus to give users access to various operations on the data being viewed at any given time.
It is currently being repackaged as a plugin for the Eclipse platform.
Fresnel: vocabulary for specifying how RDF graphs are presented
In working on RDF browsing for both SIMILE and Haystack, the developers found that it is better to have a general ontology governing how to display RDF: a kind of stylesheet for RDF that lets us indicate how we would like to present some abstract data to the user.
Together with other members of the Semantic Web development community, SIMILE is working on putting together Fresnel, a generic ontology for describing how to render RDF in a human-friendly manner.
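The "stylesheet for RDF" idea, keeping the display rules separate from the data, can be sketched as below. This is not Fresnel syntax and does not use the Fresnel vocabulary; the resource layout, the `LENSES` structure, and all property names are hypothetical stand-ins chosen for illustration.

```python
# A resource as a plain dict of property -> values (a stand-in for RDF statements).
RESOURCE = {
    "type": "Book",
    "title": ["An Example Book"],
    "creator": ["A. N. Author", "B. Coauthor"],
    "internal_id": ["x-123"],  # present in the data, but not meant for display
}

# Hypothetical display rules: for each type, which properties to show,
# in which order, and under which human-friendly label.
LENSES = {
    "Book": [("title", "Title"), ("creator", "Author(s)")],
}


def render(resource, lenses):
    """Render a resource as text using the display rules for its type."""
    lens = lenses.get(resource.get("type"), [])
    lines = []
    for prop, label in lens:
        values = resource.get(prop, [])
        if values:
            lines.append(f"{label}: {', '.join(values)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(RESOURCE, LENSES))
```

Changing `LENSES` changes the presentation without touching the data, which is the separation a shared display ontology aims to provide across browsers.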
23 ©MIT CNI Spring 2006
Piggy Bank*: information management of metadata in RDF
Firefox extension for managing metadata
Loads RDF into local Longwell server
Search and faceted browse of local RDF
Views defined by library, other users
Users can find, collect, annotate RDF
Can then publish for access by others
Piggy Bank* (Sample)
Semantic Bank*: Web-based server that allows data publishing and sharing by individuals, groups, or communities
To persist remotely, share, and publish data on a server
For individuals, groups, communities, e.g. conference proceedings
Ability to tag resources
Longwell faceted browsing view of published information
Exhibit*: create web pages with support for sorting, filtering, and rich visualizations
SIMILE Categories of Work
Projects after this Paper - Done
Timeplot: a cross-browser DHTML (canvas-based) time series plotting widget.
Timeline: a DHTML AJAX timeline widget for visualizing temporal information.
Projects after this Paper - ongoing
Piggy Bank: an extension to Firefox that turns it into a Semantic Web browser, letting you make use of existing information on the Web in more useful and flexible ways not offered by the original Web sites.
Semantic Bank: the server companion of Piggy Bank that lets you persist, share, and publish data collected by individuals, groups, or communities.
Solvent: a Firefox extension that helps you write JavaScript screen scrapers for Piggy Bank.
jsTeX: a JavaScript library capable of interpreting some (basic) TeX encodings and transforming them into HTML definitions directly on a web page.
Citeline: a web application that facilitates the web publishing of bibliographies and citation collections as interactive exhibits, and the sharing of this type of data.
Zotz: a Firefox add-on that lets you publish citations from your Zotero to an Exhibit (via Citeline) in one step.
Projects after this Paper – ongoing (2)
Referee: reads your web server logs, crawls your referrers (the links that point to your pages), and extracts metadata from those pages and from the text around the links that point to your pages.
Babel: lets you convert between various data formats.
Exhibit: lets you create web pages with support for sorting, filtering, and rich visualizations by writing only HTML and optionally some CSS and JavaScript code.
Appalachian: a Firefox add-on that adds the ability to manage and use several OpenIDs to ease the login parts of your browsing experience.
Seek: adds faceted browsing features to Mozilla Thunderbird and lets you search through your email more effectively.
An Incomplete Picture
For metadata specialists and system developers:
What about editing RDF?
http://www.altova.com/features_RDF.html
http://www.cs.rpi.edu/~puninj/rdfeditor
http://rhodonite.angelite.nl
What about building new ontologies?
Universidad Politécnica de Madrid’s School of Computing (FIUPM) has developed a new method for building multilingual ontologies that can be applied to the Semantic Web.
What about storing vast quantities of (potentially distributed) RDF and accessing it efficiently?
http://tucana.es.northropgrumman.com/solutions/technology.htm
What about using performance-enhancing techniques (such as caching) for RDF?
What about quickly inferencing over RDF data?
For users:
Can we design faceted browsing interfaces that scale to dozens of RDF ontologies?
How about improving navigation across the linkages between ontologies?
How can we support searching that will start in one domain/ontology and expand into relevant related domains/ontologies?
References
“SIMILE: Practical Metadata for the Semantic Web” by Stefano Mazzocchi, Stephen Garland, Ryan Lee, January 26, 2005. http://www.xml.com/pub/a/2005/01/26/simile.html
http://simile.mit.edu/
http://en.wikipedia.org/wiki/SIMILE
“MIT’s SIMILE Project: Demonstrating Practical Value of Semantic Web Technology for Digital Libraries” by MacKenzie Smith, MIT Libraries
“Tutorial – Semantic Digital Libraries, Comparison and the Future” by Sebastian R. Kruk, Bernhard Haslhofer, Philipp Nußbaumer, Sandy Payette, Tomasz Woroniecki, Univ. of Vienna, 2007.
Faceted browsing
A technique for accessing a collection of information represented using a faceted classification, allowing users to explore by filtering available information.
Displays only the metadata fields that are configured to be 'facets' (i.e., important for the user browsing data in one or more specific domains), using values for those fields as a means of zooming into a collection by selecting those items with a particular field-value pair (e.g., 26 works of art in the example dataset have a subject of Abstract Expressionism).
Provides a mechanism that allows users to explore different schemas from different domains with a unified interface and to discover the synergies across them. For example, the interface can be designed to show users that one schema uses a "subject" facet while another uses a "topic" facet for similar information.
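The mechanics described above (counting items per facet value, then zooming in by selecting a field-value pair) can be sketched as follows. This is not Longwell's implementation; the items, the field names, and the `FACET_ALIASES` mapping that treats "topic" as "subject" are hypothetical.

```python
from collections import Counter

# Hypothetical items from two schemas: one uses "subject", the other "topic".
ITEMS = [
    {"title": "Work A", "subject": "Abstract Expressionism", "medium": "Oil"},
    {"title": "Work B", "subject": "Cubism", "medium": "Oil"},
    {"title": "Work C", "topic": "Abstract Expressionism", "medium": "Acrylic"},
]

# Assumed mapping of schema-specific field names onto one shared facet.
FACET_ALIASES = {"topic": "subject"}


def normalize(item):
    """Rename aliased fields so both schemas expose the same facet names."""
    return {FACET_ALIASES.get(field, field): value for field, value in item.items()}


def facet_counts(items, facet):
    """Count how many items carry each value of the given facet."""
    return Counter(item[facet] for item in items if facet in item)


def select(items, facet, value):
    """Zoom in: keep only the items with this field-value pair."""
    return [item for item in items if item.get(facet) == value]


if __name__ == "__main__":
    unified = [normalize(item) for item in ITEMS]
    print(facet_counts(unified, "subject"))
    for item in select(unified, "subject", "Abstract Expressionism"):
        print(item["title"])
```

The counts are what a faceted browser shows next to each facet value; applying `select` repeatedly on different facets narrows the collection step by step.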
Welkin (sample)
Welkin is used to browse a fragment of the MIT OpenCourseWare metadata converted to RDF.
Timeline*: visualizing temporal information
Behind the Curtain
Four groups support SIMILE: HP Research Labs, the W3C, MIT Libraries, and MIT CSAIL.
The principal investigators have included Mick Bass, Eric Miller, MacKenzie Smith, and David Karger.
The developers are Stefano Mazzocchi, Stephen Garland, and Ryan Lee.
Mark Butler (bootstrapper of the Longwell project)