Upload: nathan-yergler

Post on 16-Apr-2017

Commodity Semantic Search:
A Case Study of DiscoverEd

Nathan R. Yergler
Creative Commons

Semantic Technology Conference
24 June 2010

Good afternoon. My name is Nathan Yergler, and I'm Chief Technology Officer at Creative Commons. This afternoon I'm going to talk about a semantically enhanced search engine for education we've been working on called DiscoverEd. It's built on commodity hardware and open source tools, and the software can be used for other domains. I'm going to talk about some approaches we tried and rejected, and give you some information on tools you can use for building your own semantic search without investing in your own server farm.

share, reuse, and remix legally

Creative Commons provides legal and technical tools that make sharing easy, legal, and scalable.


Attribution 3.0 Unported








CC Rights Expression Language

CC licenses are based on international copyright law

There are hundreds of millions of pieces of CC-licensed content on the web

OER

Open Educational Resources

Learning materials that are freely available to use, remix, and redistribute.

Wide variety of formats, content types, and audiences

CC licenses make this content interoperable

But how do you find the OER you're looking for?

OER Search == CC Search++

Similarities

No central registry or repository

It's up to publishers to label their works

And additional interesting problems

Differing views on what makes it Educational

Additional facets: subject, language, etc.

A Model for OER Search

Curators identify educational resources

Curators optionally add metadata

A Curator may also be the Publisher

Or a Curator may add metadata to someone else's resources

A Model for OER Search (2)

Ingest resource lists, metadata via RSS/Atom feeds, OAI-PMH
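As a rough illustration of the ingest step, the sketch below reads an Atom feed and pulls out one record per entry: the resource URL, its title, and any category terms as subjects. The sample feed, the `parse_resource_feed` name, and the record layout are all hypothetical; this is not DiscoverEd's actual ingest code, and OAI-PMH harvesting is not shown.

```python
import xml.etree.ElementTree as ET

# Atom namespace, as defined by RFC 4287.
ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical curator feed, used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example curator feed</title>
  <entry>
    <title>Intro to Biology</title>
    <link rel="alternate" href="http://example.org/oer/biology-101"/>
    <category term="biology"/>
  </entry>
  <entry>
    <title>Algebra Basics</title>
    <link href="http://example.org/oer/algebra"/>
    <category term="mathematics"/>
    <category term="algebra"/>
  </entry>
</feed>
"""


def parse_resource_feed(feed_xml):
    """Return one dict per feed entry that carries a link."""
    root = ET.fromstring(feed_xml)
    resources = []
    for entry in root.findall(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        if link is None or not link.get("href"):
            continue  # an entry without a URL cannot be crawled
        subjects = [
            category.get("term")
            for category in entry.findall(ATOM + "category")
            if category.get("term")
        ]
        resources.append({
            "url": link.get("href"),
            "title": entry.findtext(ATOM + "title", default=""),
            "subjects": subjects,
        })
    return resources


resources = parse_resource_feed(SAMPLE_FEED)
for resource in resources:
    print(resource["url"], resource["subjects"])
```

Each record gives the crawler a URL to fetch and gives the index the curator's metadata for that URL.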

Curators & Feeds

Two Prototypes

Google CSE

Nutch

Initial effort: Google CSE

Google Custom Search Engine allows you to create a search engine for a website or a collection of interesting websites.

Define resource patterns for inclusion

Optionally include annotations: facets and labels

Python scripts to consume resource lists

Output XML suitable for Google CSE
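A minimal sketch of what such a script might look like: it takes a list of URL patterns and writes an annotations document that attaches one label to each pattern. The `Annotations`/`Annotation`/`Label` element names and the `about`/`name` attributes are my recollection of the CSE annotations file format and should be checked against Google's documentation. The function name, label name, and patterns are hypothetical, not taken from the original scripts.

```python
import xml.etree.ElementTree as ET


def build_annotations(url_patterns, label_name):
    """Build an annotations XML string with one Annotation per URL pattern."""
    root = ET.Element("Annotations")
    for pattern in url_patterns:
        # Assumed format: the pattern goes in the "about" attribute.
        annotation = ET.SubElement(root, "Annotation", {"about": pattern})
        # Assumed format: the label that includes the pattern in the engine.
        ET.SubElement(annotation, "Label", {"name": label_name})
    return ET.tostring(root, encoding="unicode")


# Hypothetical patterns and label name.
xml_out = build_annotations(
    ["http://example.org/oer/*", "http://example.edu/courses/*"],
    "_cse_examplelabel",
)
print(xml_out)
```

Listing every individual resource this way produces one `Annotation` element per URL, which is consistent with the scaling problem described on the next slide.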

Scaling with CSE

Lists of individual resources did not scale well

Labels and Facets worked best with fixed, limited vocabulary

License-filtered search unavailable

Nutch-based Prototype

Nutch + Jena triple store

Simple scripts for generating seeds from the store

IndexingFilter plugin for injecting metadata into Nutch index

QueryFilter plugins for field-specific searches
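The data flow in this prototype can be sketched in a few lines. The example below is a language-neutral illustration in Python, not the Nutch plugin API (which is Java) and not Jena: the triple store is stood in for by a plain list of (subject, predicate, object) tuples, and the function names and sample data are hypothetical. It shows two steps: deriving crawl seeds from the subjects in the store, and copying curator metadata onto a crawled document as extra index fields keyed by predicate URI.

```python
DC_SUBJECT = "http://purl.org/dc/terms/subject"
DC_LANGUAGE = "http://purl.org/dc/terms/language"

# Stand-in for the triple store: (subject, predicate, object) tuples.
TRIPLES = [
    ("http://example.org/oer/biology-101", DC_SUBJECT, "biology"),
    ("http://example.org/oer/biology-101", DC_LANGUAGE, "en"),
    ("http://example.org/oer/algebra", DC_SUBJECT, "mathematics"),
]


def seed_urls(triples):
    """Return each distinct subject URL once, in first-seen order."""
    return list(dict.fromkeys(subject for subject, _, _ in triples))


def metadata_fields(triples, url):
    """Collect all values asserted about a URL, grouped by predicate URI."""
    fields = {}
    for subject, predicate, obj in triples:
        if subject == url:
            fields.setdefault(predicate, []).append(obj)
    return fields


def enrich_document(doc, triples):
    """Return a copy of a crawled document with metadata added as fields."""
    enriched = dict(doc)
    enriched.update(metadata_fields(triples, doc["url"]))
    return enriched


seeds = seed_urls(TRIPLES)
doc = enrich_document(
    {"url": "http://example.org/oer/biology-101", "content": "page text"},
    TRIPLES,
)
print(seeds)
print(doc)
```

Because the seeds come only from curated resources, the crawl stays small and directed, and a field-specific search can then match against the predicate-keyed fields.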

DiscoverEd (Nutch)

Prototype Results

Curator model allows for very directed crawl

Low cost, not very resource intensive

Scale

Flexibly filter on predicate values

Limitations

Provenance for curator metadata

Predicate filters had to be hand-crafted

Current DiscoverEd Work

We're Open

Education is our test domain, but the tool can be generally useful

Other organizations have expressed interest in using the DiscoverEd software

Making code available on Gitorious,
http://gitorious.org/discovered

Provenance

Initial work complete on storing provenance for curator metadata

Working on integrating this with the query front end now

When complete, will allow users to

Limit their query to specific curators

Exclude curators from their query
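One way to picture curator filtering: once each indexed resource records which curators asserted metadata about it, a query can keep or drop results based on that list. The sketch below is an assumption about how such a filter could behave, not the DiscoverEd implementation. The result layout, curator URLs, and the rule that a result survives if at least one acceptable curator vouches for it are all hypothetical.

```python
def filter_by_curator(results, include=None, exclude=None):
    """Keep results that have at least one acceptable curator.

    include: if given, only these curators count.
    exclude: these curators never count.
    """
    include = set(include or [])
    exclude = set(exclude or [])
    kept = []
    for result in results:
        curators = set(result["curators"])
        if include:
            curators &= include  # restrict to the requested curators
        curators -= exclude      # drop the curators the user rejected
        if curators:
            kept.append(result)
    return kept


# Hypothetical search results with provenance attached.
CURATOR_A = "http://curator-a.example/"
CURATOR_B = "http://curator-b.example/"
RESULTS = [
    {"url": "http://example.org/a", "curators": [CURATOR_A]},
    {"url": "http://example.org/b", "curators": [CURATOR_A, CURATOR_B]},
    {"url": "http://example.org/c", "curators": [CURATOR_B]},
]

only_a = filter_by_curator(RESULTS, include=[CURATOR_A])
without_b = filter_by_curator(RESULTS, exclude=[CURATOR_B])
print([r["url"] for r in only_a])
print([r["url"] for r in without_b])
```

Under this rule, excluding a curator drops only the resources that no other curator has described. A stricter variant could drop any resource the excluded curator touched.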

Field Queries

Need to map predicate URIs to human names

Currently map at query time

Index using the URI

Map specific terms to URIs

For example,
tag to http://purl.org/dc/terms/subject

Requires a Filter for every predicate

Landing work now to map at index time
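The query-time mapping can be illustrated with a small lookup table. In the sketch below, a token such as `tag:biology` is rewritten to a (predicate URI, value) pair, and any other token is treated as a plain full-text term. The `tag` mapping comes from the slide; the `language` entry, the `field:value` token syntax, and the `parse_query` name are assumptions for illustration only.

```python
# Human-readable field names mapped to the predicate URIs used in the index.
FIELD_TO_PREDICATE = {
    "tag": "http://purl.org/dc/terms/subject",        # from the slide
    "language": "http://purl.org/dc/terms/language",  # assumed example
}


def parse_query(query):
    """Split a query into (predicate URI or None, value) pairs."""
    terms = []
    for token in query.split():
        name, separator, value = token.partition(":")
        if separator and value and name in FIELD_TO_PREDICATE:
            terms.append((FIELD_TO_PREDICATE[name], value))
        else:
            # Unknown field names fall through as ordinary search terms.
            terms.append((None, token))
    return terms


parsed = parse_query("tag:biology cells")
print(parsed)
```

With a table like this, supporting a new predicate means adding one entry instead of writing a new filter, which is the kind of maintenance cost the hand-crafted approach incurs.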

Information for Curators

We want publishers/curators to publish more linked data

Need a feedback loop to help drive this

Working on dashboard to see what's indexed, how, etc.

Second phase: documentation, tools to help improve their RDFa

DiscoverEd Team & Supporters

Ahrash Bissell

Asheesh Laroia

Raphael Krut-Landau

Alex Kozak

Christine Geith

Karen Vignare

Hewlett Foundation

Bill & Melinda Gates Foundation

OSI

Michigan State University

AgShare

Cloud Tools for Semantic Search

Yahoo! BOSS

Retrieve RDF extracted from pages

Google CSE

Filter using structured data (PageMaps, RDFa)

Customize display using structured data

Conclusion

Cloud tools can help build simple semantic search quickly

Nutch provides a powerful, extensible platform for prototyping search tools

DiscoverEd software demonstrates semantic search without large hardware investment

http://wiki.creativecommons.org/DiscoverEd

[email protected]
@nyergler (identi.ca, twitter)
