TRANSCRIPT
Commodity Semantic Search:
A Case Study of DiscoverEd
Nathan R. Yergler, Creative Commons
Semantic Technology Conference, 24 June 2010
Good afternoon. My name is Nathan Yergler, and I'm Chief Technology Officer at Creative Commons. This afternoon I'm going to talk about DiscoverEd, a semantically enhanced search engine for education that we've been working on. It's built on commodity hardware and open source tools, and the software can be used for other domains. I'm going to talk about some approaches we tried and rejected, and give you some information on tools you can use to build your own semantic search without investing in your own server farm.
share, reuse, and remix legally
Creative Commons provides legal and technical tools that make sharing easy, legal, and scalable.
Attribution 3.0 Unported
CC Rights Expression Language
CC licenses are based on international copyright law
There are hundreds of millions of pieces of CC-licensed content on the web
OER
Open Educational Resources
Learning materials that are freely available to use, remix, and redistribute.
Wide variety of formats, content types, audiences
CC licenses make this content interoperable
But how do you find the OER you're looking for?
OER Search == CC Search++
Similarities
No central registry or repository
It's up to publishers to label their works
And additional interesting problems
Differing views on what makes it Educational
Additional facets: subject, language, etc.
A Model for OER Search
Curators identify educational resources
Curators optionally add metadata
A Curator may also be the Publisher
Or a Curator may add metadata to someone else's resources
A Model for OER Search (2)
Ingest resource lists, metadata via RSS/Atom feeds, OAI-PMH
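To make the ingest step concrete, here is a minimal sketch of pulling resource URLs and subject tags out of an Atom feed using only the Python standard library. The feed content, the example.org URLs, and the function name resources_from_atom are assumptions made up for illustration; they are not taken from the DiscoverEd code, and OAI-PMH ingestion is not covered.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical curator feed, inlined so the sketch runs without network access.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example curator feed</title>
  <entry>
    <title>Intro to Biology</title>
    <link href="http://example.org/courses/biology"/>
    <category term="biology"/>
  </entry>
  <entry>
    <title>Algebra Basics</title>
    <link href="http://example.org/courses/algebra"/>
  </entry>
</feed>
"""


def resources_from_atom(feed_xml):
    """Return one dict per feed entry: resource URL, title, and subject tags."""
    root = ET.fromstring(feed_xml)
    resources = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        if link is None or not link.get("href"):
            continue  # an entry without a URL gives the crawler nothing to fetch
        resources.append({
            "url": link.get("href"),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            # Metadata is optional: entries without categories get an empty list.
            "subjects": [c.get("term") for c in entry.findall("atom:category", ATOM_NS)],
        })
    return resources


if __name__ == "__main__":
    for resource in resources_from_atom(SAMPLE_FEED):
        print(resource["url"], resource["subjects"])
```

The second entry carries no category, which mirrors the model above: curators identify resources and only optionally add metadata.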
Curators & Feeds
Two Prototypes
Google CSE
Nutch
Initial effort: Google CSE
Google Custom Search Engine allows you to create a search engine for a website or a collection of interesting websites.
Define resource patterns for inclusion
Optionally include annotations (facets and labels)
Python scripts to consume resource lists
Output XML suitable for Google CSE
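A rough sketch of what such a script could look like is below. It is not the original script. The Annotations / Annotation / Label element names follow the CSE annotations file format as commonly documented, but check them against Google's current reference before relying on them; the label name _cse_example and the input records are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_annotations(resources, include_label):
    """Serialize resource URL patterns and their labels as an annotations document."""
    root = ET.Element("Annotations")
    for res in resources:
        annotation = ET.SubElement(root, "Annotation", about=res["url"])
        # Every resource gets the label that marks it as part of the engine.
        ET.SubElement(annotation, "Label", name=include_label)
        # Optional facet labels, one per subject supplied by the curator.
        for subject in res.get("subjects", []):
            ET.SubElement(annotation, "Label", name=subject)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical resource list, e.g. the output of a feed-reading script.
    sample = [
        {"url": "http://example.org/courses/biology", "subjects": ["biology"]},
        {"url": "http://example.org/courses/*"},
    ]
    print(build_annotations(sample, "_cse_example"))
```

Each individual resource becomes its own Annotation element in this sketch, so the file grows linearly with the number of resources listed.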
Scaling with CSE
Lists of individual resources did not scale well
Labels and Facets worked best with fixed, limited vocabulary
License-filtered search unavailable
Nutch-based Prototype
Nutch + Jena triple store
Simple scripts for generating seeds from the store
IndexingFilter plugin for injecting metadata into Nutch index
QueryFilter plugins for field-specific searches
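The two data flows out of the store can be sketched as follows: a seed list for the crawler, and per-page metadata for the indexer. This is an illustration only. The real prototype used Jena and Java Nutch plugins, whereas here the store is a plain Python list of (subject, predicate, object) tuples, and every name and URL is made up for the example.

```python
from collections import defaultdict

DCT_SUBJECT = "http://purl.org/dc/terms/subject"
DCT_LANGUAGE = "http://purl.org/dc/terms/language"

# Hypothetical stand-in for the triple store contents.
TRIPLES = [
    ("http://example.org/courses/biology", DCT_SUBJECT, "biology"),
    ("http://example.org/courses/biology", DCT_LANGUAGE, "en"),
    ("http://example.org/courses/algebra", DCT_SUBJECT, "mathematics"),
]


def seed_urls(triples):
    """Unique subject URLs, sorted so the seed file is stable between runs."""
    return sorted({s for (s, _p, _o) in triples
                   if s.startswith(("http://", "https://"))})


def write_seed_file(triples, path):
    """Write one URL per line, the plain-text form Nutch reads its seeds from."""
    with open(path, "w", encoding="utf-8") as handle:
        for url in seed_urls(triples):
            handle.write(url + "\n")


def index_fields(triples, url):
    """Collect predicate -> values for one page, i.e. what an indexing
    plugin would add to that page's index document."""
    fields = defaultdict(list)
    for subject, predicate, value in triples:
        if subject == url:
            fields[predicate].append(value)
    return dict(fields)


if __name__ == "__main__":
    print(seed_urls(TRIPLES))
    print(index_fields(TRIPLES, "http://example.org/courses/biology"))
```

Because the seeds come only from resources that curators have described, the crawl stays directed and small.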
DiscoverEd (Nutch)
Prototype Results
Curator model allows for very directed crawl
Low cost, not very resource intensive
Scale
Flexibly filter on predicate values
Limitations
Provenance for curator metadata
Predicate filters had to be hand-crafted
Current DiscoverEd Work
We're Open
Education is our test domain, but the tool can be generally useful
Other organizations have expressed interest in using the DiscoverEd software
Making code available on Gitorious,
http://gitorious.org/discovered
Provenance
Initial work complete on storing provenance for curator metadata
Working on integrating this with the query front end now
When complete, will allow users to:
Limit their query to specific curators
Exclude curators from their query
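One way to picture curator-level filtering is to keep the asserting curator alongside each statement and filter on it. The sketch below is an assumption about the general shape, not the DiscoverEd implementation; the Statement type, the curator URLs, and the function name are all hypothetical.

```python
from collections import namedtuple

# A metadata statement plus the curator who asserted it (its provenance).
Statement = namedtuple("Statement", ["subject", "predicate", "value", "curator"])

DCT_SUBJECT = "http://purl.org/dc/terms/subject"

# Hypothetical data: two curators describe the same resource differently.
STATEMENTS = [
    Statement("http://example.org/courses/biology", DCT_SUBJECT, "biology",
              "http://curator-a.example.org/"),
    Statement("http://example.org/courses/biology", DCT_SUBJECT, "life sciences",
              "http://curator-b.example.org/"),
    Statement("http://example.org/courses/algebra", DCT_SUBJECT, "mathematics",
              "http://curator-b.example.org/"),
]


def filter_by_curator(statements, only=None, exclude=None):
    """Keep statements from the `only` curators (all curators if None),
    then drop any statement whose curator is in `exclude`."""
    only = set(only) if only else None
    exclude = set(exclude or ())
    return [st for st in statements
            if (only is None or st.curator in only) and st.curator not in exclude]


if __name__ == "__main__":
    for st in filter_by_curator(STATEMENTS, exclude=["http://curator-b.example.org/"]):
        print(st.subject, st.value)
```

Storing the curator per statement, rather than per resource, is what lets two curators disagree about the same page without one overwriting the other.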
Field Queries
Need to map predicate URIs to human names
Currently map at query time
Index using the URI
Map specific terms to URIs
For example, tag maps to http://purl.org/dc/terms/subject
Requires a Filter for every predicate
Landing work now to map at index time
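The difference between the two approaches can be sketched in a few lines. Only the tag mapping comes from the slide; the function names and the lookup-table approach are assumptions for illustration, and the real work happens inside Nutch plugins rather than in Python.

```python
# Human field name -> predicate URI. The "tag" entry is the example from the talk.
FIELD_TO_PREDICATE = {
    "tag": "http://purl.org/dc/terms/subject",
}

# Reverse table, used when mapping at index time.
PREDICATE_TO_FIELD = {uri: name for name, uri in FIELD_TO_PREDICATE.items()}


def rewrite_query_term(term):
    """Query-time mapping: the index stores the URI, so a term such as
    'tag:biology' is rewritten to (predicate URI, 'biology').
    Anything without a known field prefix is treated as a plain keyword."""
    field, sep, value = term.partition(":")
    if sep and field in FIELD_TO_PREDICATE:
        return FIELD_TO_PREDICATE[field], value
    return None, term


def index_field_name(predicate_uri):
    """Index-time mapping: store values under the human name when one is
    known, falling back to the URI, so queries need no per-predicate rewrite."""
    return PREDICATE_TO_FIELD.get(predicate_uri, predicate_uri)


if __name__ == "__main__":
    print(rewrite_query_term("tag:biology"))
    print(index_field_name("http://purl.org/dc/terms/subject"))
```

With index-time mapping, supporting a new predicate becomes a matter of adding a table entry instead of writing another filter.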
Information for Curators
We want publishers/curators to publish more linked data
Need a feedback loop to help drive this
Working on dashboard to see what's indexed, how, etc.
Second phase: documentation, tools to help improve their RDFa
DiscoverEd Team & Supporters
Ahrash Bissell
Asheesh Laroia
Raphael Krut-Landau
Alex Kozak
Christine Geith
Karen Vignare
Hewlett Foundation
Bill & Melinda Gates Foundation
OSI
Michigan State University
AgShare
Cloud Tools for Semantic Search
Yahoo! BOSS
Retrieve RDF extracted from pages
Google CSE
Filter using structured data (Page Maps, RDFa)
Customize display using structured data
Conclusion
Cloud tools can help build simple semantic search quickly
Nutch provides a powerful, extensible platform for prototyping search tools
DiscoverEd software demonstrates semantic search without large hardware investment
http://wiki.creativecommons.org/DiscoverEd
@nyergler (identi.ca, twitter)