The WWW as a Database: WWW Query Languages

Curtis Dyreson

James Cook University

(Townsville, Australia)

Aalborg University

Outline

• searching the WWW

- search engines

- WWW query languages

• WebSQL

- WWW graph

- cost

• Jumping Spider

- hybrid

Searching the WWW

• search engines

- AltaVista, Infoseek, 2100 others!

• static architecture

- robot: periodic, slow, non-uniform coverage

- index: keywords to URLs, fast, ranking algorithm

• example query: Lecture notes on trees in a data structures course.

A Search Engine Index

[Figure build: index entries for the keywords ‘data structures’, ‘lecture notes’, and ‘trees’]
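The keyword-to-URL index can be sketched as an inverted index. This is a minimal illustration, not how any particular search engine is built: the URLs and page text in `PAGES` are made up, and `lookup` is a plain keyword AND rather than a ranked or phrase search.

```python
from collections import defaultdict

# Hypothetical pages; the URLs and text are invented for illustration.
PAGES = {
    "http://example.edu/cs210/": "data structures course home page",
    "http://example.edu/cs210/notes/": "lecture notes for the course",
    "http://example.edu/cs210/notes/trees.html": "binary trees and balanced trees",
}

def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def lookup(index, phrase):
    """Return URLs that contain every word of the phrase (keyword AND)."""
    sets = [index.get(word, set()) for word in phrase.lower().split()]
    return set.intersection(*sets) if sets else set()

index = build_index(PAGES)
hits = lookup(index, "data structures")
all_three = lookup(index, "data structures lecture notes trees")
```

In this toy corpus `hits` holds only the course home page, while `all_three` is empty: no single page mentions all three concepts, which is the limitation the example query is meant to expose.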

WWW Query Languages

• search engines index single pages

• multi-page concepts

• hunting strategy

- search engine to nearby page

- manual search

• WWW query languages

- WebSQL, W3QS, WebLog

WWW Graph Structure

• large (650K servers, 350M pages)

• dynamic, cyclic

• page = node, link = edge

WebSQL

• SQL-like

• search engine to find pages

• path expression (regular expression of links)

• text manipulation predicates

SELECT <attribute list>
FROM <document list>
WHERE <predicate>;

WebSQL From Clause

• from clause collects a set of documents

• unstructured - primitive schema: Document[URL, text, link to URL, modify date]

• MENTIONS - retrieve from search engine

DOCUMENT x SUCH THAT x MENTIONS ‘data structures’

SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y ->* z
WHERE y CONTAINS ‘lecture notes’
  AND z CONTAINS ‘trees’;

WebSQL From Clause

• path expression finds related documents

• URL

DOCUMENT x SUCH THAT “http://www.cs.auc.dk”

• local link: ->

DOCUMENT y SUCH THAT x -> y

• global link: =>

DOCUMENT y SUCH THAT x => y

WebSQL From Clause

• at most one link: ?

DOCUMENT y SUCH THAT x ->(->)? y

• any number of links: *

DOCUMENT y SUCH THAT x ->* y

• alternation: |

DOCUMENT y SUCH THAT x (=> | ->*) y
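Because a path expression is a regular expression over links, it can be sketched with an ordinary regex engine. The following is an illustration under stated assumptions, not the WebSQL implementation: links are encoded as the letters `L` (local, `->`) and `G` (global, `=>`), the graph `GRAPH` is hypothetical, and the walk is cut off at `max_links` because the WWW graph is cyclic.

```python
import re

# Hypothetical graph: each edge is (target, kind); "L" = local link, "G" = global link.
GRAPH = {
    "a": [("b", "L"), ("x", "G")],
    "b": [("c", "L")],
    "c": [("a", "L")],  # cycle back to a
    "x": [],
}

def to_regex(path_expr):
    """Translate WebSQL-style link symbols into a regex over the letters L and G."""
    text = path_expr.replace(" ", "").replace("=>", "G").replace("->", "L")
    return re.compile(text)

def reachable(graph, start, path_expr, max_links=4):
    """Return nodes reachable from start along a link sequence matching path_expr.

    Every walk of up to max_links links is enumerated, so the work grows
    exponentially with max_links.
    """
    pattern = to_regex(path_expr)
    found = set()
    frontier = [(start, "")]
    for _ in range(max_links + 1):
        next_frontier = []
        for node, labels in frontier:
            if pattern.fullmatch(labels):
                found.add(node)
            for target, kind in graph[node]:
                next_frontier.append((target, labels + kind))
        frontier = next_frontier
    return found

one_or_two_local = reachable(GRAPH, "a", "->(->)?")
any_local = reachable(GRAPH, "a", "->*")
global_or_local = reachable(GRAPH, "a", "(=> | ->*)")
```

Here `->*` also matches the empty path, so the start node itself is included in `any_local`.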

WebSQL From Clause: Example

FROM Document X SUCH THAT X MENTIONS ‘Java’,
     Document Y SUCH THAT X -> | ->-> Y

[Figure build: WWW graph, starting from the pages labeled ‘Java’]

WebSQL From Clause

• path expression limits search space

• local link, search limited to local machine

• global link, can go anywhere

• =>* would search all of WWW

• pre-analysis, filtering

• even three to four local links infeasible

WebSQL Where Clause

• like SQL

• CONTAINS, text search of retrieved document

• can push CONTAINS into navigation

WHERE y CONTAINS ‘lecture notes’ AND y.length < 4000;

WebSQL Query

• Find lecture notes on trees in a data structures course.

SELECT z.URL
FROM DOCUMENT x SUCH THAT x MENTIONS ‘data structures’,
     DOCUMENT y SUCH THAT x -> y,
     DOCUMENT z SUCH THAT y ->* z
WHERE y CONTAINS ‘lecture notes’
  AND z CONTAINS ‘trees’;

[Figure build: data structures -> lecture notes, then lecture notes ->* trees. Result: data structures, lecture notes, trees]

WebSQL Example

WebSQL Architecture

• Java implementation

WWW Query Language - Drawbacks

• dynamic architecture

• O(k**p)

- k is branching factor

- p is length of path expression

• a priori knowledge of topology

• back links are a problem
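The cost drawback is easiest to see with numbers. The sketch below is only arithmetic under simplifying assumptions: every page has the same number of outgoing links (the branching factor), no page is reached twice, and the figure of 10 links per page is made up.

```python
def pages_fetched(branching_factor, path_length):
    """Count pages touched when every link is followed out to path_length steps.

    Assumes a uniform branching factor and no repeated pages.
    """
    return sum(branching_factor ** depth for depth in range(path_length + 1))

# Pages touched for path expressions of 1 to 4 links at 10 links per page.
for links in range(1, 5):
    print(links, pages_fetched(10, links))
```

Each extra link in the path expression multiplies the work by roughly the branching factor, and every page touched is fetched over the network at query time.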

Jumping Spider - a Hybrid

• like a search engine

- static architecture

- keyword searches

• like a WWW query language

- uses modified WWW graph

- one kind of path expression

Kinds of Links

• content refinement queries are common

• heuristic: information in subdirectories is refined

• different kinds of links

back - subdirectory to parent

down - parent directory to subdirectory

side - unrelated directories
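The three kinds of links can be told apart by comparing the directories of the source and target URLs. This is a sketch of that idea only; the function names are hypothetical, and two choices are assumptions rather than something the slides state: a link within the same directory is counted as a down link, and a link to another host is counted as a side link.

```python
import posixpath
from urllib.parse import urlsplit

def directory(url):
    """Return (host, directory path) for a URL; the directory always ends in '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return parts.netloc, path

def classify(source, target):
    """Label a link as 'down', 'back', or 'side' by comparing directories."""
    src_host, src_dir = directory(source)
    dst_host, dst_dir = directory(target)
    if src_host != dst_host:
        return "side"  # assumption: other hosts are unrelated directories
    if dst_dir.startswith(src_dir):
        return "down"  # same directory (assumption) or a subdirectory
    if src_dir.startswith(dst_dir):
        return "back"  # target lives in an ancestor directory
    return "side"

home = "http://example.edu/cs210/index.html"
trees = "http://example.edu/cs210/notes/trees.html"
kinds = (classify(home, trees), classify(trees, home))
```

With these made-up URLs, `kinds` is `("down", "back")`: the home page links down into `notes/`, and the link in the opposite direction is a back link.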

Re-using the WWW Graph

Directory Trees

Down Links

Back Links

Eliminate Back Links

Transitive Closure of Down Links

Plus a Side Link
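The restructuring steps named in the slide titles above can be sketched as follows. This is one reading of those titles, not the Jumping Spider implementation: back links are dropped, down links are closed transitively, and a path may include at most one side link, that is down* (side)? down*. The page names in `LINKS` are hypothetical.

```python
def restructure(links):
    """Build reach[page] = pages reachable by down* (side)? down*.

    links is an iterable of (source, target, kind) with kind in
    {'down', 'back', 'side'}; back links are discarded.
    """
    down, side, pages = {}, {}, set()
    for source, target, kind in links:
        pages.update((source, target))
        if kind == "down":
            down.setdefault(source, set()).add(target)
        elif kind == "side":
            side.setdefault(source, set()).add(target)

    def down_closure(start):
        """Pages reachable from start over zero or more down links."""
        seen, stack = {start}, [start]
        while stack:
            page = stack.pop()
            for nxt in down.get(page, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    closure = {page: down_closure(page) for page in pages}
    reach = {}
    for page in pages:
        result = set(closure[page])
        for via in closure[page]:
            for jumped in side.get(via, ()):
                result |= closure[jumped]  # one side link, then more down links
        reach[page] = result - {page}
    return reach

# Hypothetical links between five pages.
LINKS = [
    ("ds", "notes", "down"),
    ("notes", "ds", "back"),
    ("notes", "trees", "down"),
    ("trees", "other", "side"),
    ("other", "other/sub", "down"),
]
reach = restructure(LINKS)
```

In the result `ds` reaches `trees` in a single step of the restructured graph, while `notes` no longer reaches `ds` because the back link was eliminated.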

[Figure build: data structures -> lecture notes, then lecture notes -> trees]

Analysis

• search engine index

- adds a pertinent index

• pertinent index - O(n log n) to O(n**2) space

- all URLs that can reach this URL

- tree-like, so should be close to O(n log n)

• more intersections

• implemented in Perl 5
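A query over the two indexes can be sketched as repeated set intersection. This is an assumed evaluation strategy written in Python for illustration (the slides say the system itself is in Perl 5); `KEYWORDS`, `PERTINENT`, and all URLs are hypothetical, and the only property taken from the slides is that the pertinent index stores, for each URL, all URLs that can reach it.

```python
# Hypothetical keyword index: keyword -> URLs containing it.
KEYWORDS = {
    "data structures": {"/cs210/", "/cs330/"},
    "lecture notes": {"/cs210/notes/", "/bio101/notes/"},
    "trees": {"/cs210/notes/trees.html", "/bio101/notes/trees.html"},
}

# Hypothetical pertinent index: URL -> URLs that can reach it.
PERTINENT = {
    "/cs210/notes/": {"/cs210/"},
    "/bio101/notes/": {"/bio101/"},
    "/cs210/notes/trees.html": {"/cs210/", "/cs210/notes/"},
    "/bio101/notes/trees.html": {"/bio101/", "/bio101/notes/"},
}

def refine(candidates, keyword):
    """Keep pages containing keyword that at least one candidate page can reach."""
    return {
        url
        for url in KEYWORDS.get(keyword, set())
        if PERTINENT.get(url, set()) & candidates
    }

def query(first, *rest):
    """Evaluate first -> k2 -> ... -> kn with one intersection step per keyword."""
    result = set(KEYWORDS.get(first, set()))
    for keyword in rest:
        result = refine(result, keyword)
    return result

answer = query("data structures", "lecture notes", "trees")
```

Each step is a lookup plus an intersection against precomputed sets, so no pages are fetched at query time; the biology pages about trees are filtered out because no data structures page can reach them.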

Related Work

• WWW query languages

WebSQL (Arocena et al. - WWW6 ’97)

W3QS (Konopnicki and Shmueli - VLDB ’95)

WebLog (Lakshmanan et al. - RIDE ’96)

AKIRA (Lacroix et al. - ER ’97)

• Indexes that already use directories

Infoseek

WebGlimpse (Manber et al. - Usenix ’97)

• Semi-structured data models - many

Future Work

• scale to size of WWW

• extended query language (negation)

• easier installation
