
Modeling and Querying Structure and Contents of the Web

Wolfgang May
Institut für Informatik
Universität Freiburg
Germany


Overview

• Integrated Architecture for Web Data Extraction

• Unified World Model

• Implementation: F-Logic/FLORID

• Examples / Case Studies:

– the DBLP Publications Web Server (single-site)

– Geographical Information (multi-site)


Integrated Architecture

[Figure: the FLORID system mediates between the user, who works in F-Logic, and external resources on the Internet (HTML documents addressed by url). Inside FLORID: objects, including Web pages; wrapper and mediator rules; application logic rules; an SGML parser; and an http/ftp Web interface invoked through url.get.]

• Unified, monolithic framework for wrappers and mediators

• F-Logic: unified data model, wrapper, mediator, and querying language

• Data Model: representation of the Web fragment and application-level representation. Structure + contents of the Web as a unit


The Web Model

• unified object-oriented model

– the Web (carrier of information)

– the application domain (carried information)

• graph-based model

• inter-document level: topology of the Web ((Web) skeleton):

– nodes: Web documents,

– (labeled) edges: hyperlinks between documents,

– skeleton: no information apart from the link structure is available.

• intra-document level: the page markup (tags) induces a tree structure of the page contents.

• Web skeleton and parse trees: application-independent

• an object-oriented model of the application domain.


The Skeleton: URLs and Web Documents

• Every resource in the Web has a unique url.

• The document associated with a url contains hyperlinks to other urls.

(x, ℓ, y) ∈ SK ⇔ the Web document x contains a hyperlink labeled with ℓ to the Web document y ("<a href="y">ℓ</a>").
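The skeleton definition above can be pictured as a plain set of labeled edges. The following is only a minimal sketch in Python, not part of FLORID; the page names, labels, and helper functions `hrefs` and `reachable` are made up for illustration.

```python
# Hypothetical illustration: the Web skeleton SK as a set of labeled edges.
from collections import defaultdict

# Each triple (x, label, y) says: document x contains a hyperlink
# with anchor text `label` that points to document y. All names are invented.
SK = {
    ("dblp", "Conferences", "conf/index"),
    ("dblp", "Journals", "journals/index"),
    ("conf/index", "VLDB", "conf/vldb"),
    ("journals/index", "Inf. Systems", "journals/is"),
}

def hrefs(skeleton, x):
    """Return {label: set of targets} for all hyperlinks leaving document x."""
    out = defaultdict(set)
    for src, label, dst in skeleton:
        if src == x:
            out[label].add(dst)
    return dict(out)

def reachable(skeleton, start):
    """All documents reachable from `start` by following hyperlinks."""
    seen, todo = {start}, [start]
    while todo:
        x = todo.pop()
        for src, _label, dst in skeleton:
            if src == x and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen
```

Nothing but the link structure is stored here, which is exactly the restriction the skeleton imposes: page contents only enter the model once parse-trees are added.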


Example: The DBLP Server

[Figure: Web skeleton of the DBLP server. The root page dblp links to the conference indexes, the author tree (a-tree), journals, and series. Below them are conference series pages (conf/vldb/, conf/iclp/, conf/popl/, conf/edbt/) with their proceedings pages, author pages (e.g. Altman, Jarke, Lockemann, Senko), and journal pages (journals/tods/, journals/is/, journals/lncs/) with their volume pages. Edges carry the hyperlink labels, e.g. VLDB, EDBT, ICLP, POPL, TODS, Inf. Systems, LNCS, Contents, volume numbers, and author names.]

• skeleton: Web pages and hyperlinks

• corresponds to real-world objects: journals, conferences, books, and authors


Extending the Web Skeleton: Parse-trees

• real-world objects are represented as individual Web pages, or by substructures of pages.

• integration of parse-trees


Example: Extended Web Skeleton of DBLP

[Figure: extended Web skeleton of DBLP. The pages dblp, conf/vldb/, conf/vldb/vldb76, journals/is/, journals/is/is1, a-tree/s/senko, and a-tree/a/altman are connected by hrefs@(label) edges, e.g. hrefs@(VLDB), hrefs@(Inf.Systems), hrefs@(VLDB'76), hrefs@(M.Senko), hrefs@(E.Altman), hrefs@(volume1). Each page has a parse edge to its parse-tree (vldb76.parse, is.parse, senko.parse, is1.parse), whose nodes are reached by indexed edges such as html@(0), body@(1), ul@(0), li@(0), table@(0), tr@(0), td@(1). Anchors (<a href=...>) inside the parse-trees lead back to pages.]


Extended Web Skeleton

• extended Web skeleton: unified, but still based on the Web representation, not on the application semantics.

• many objects already have a direct counterpart in the extended Web skeleton:

– objects have a Web representation as a Web page, referenceable via url.

– objects correspond to nodes in a parse-tree (journal volumes and papers), referenceable in HTML via page#anchor.

– counterparts in several parse-trees ⇒ object fusion: objects as objects in the Web representation and in the application model.

– mapping between nodes/arcs of the extended Web skeleton and instances of distinguished classes/relationships of the application model.

• XML? Parse-tree = application-semantic model?


Formal Framework: F-Logic

• object-oriented database language

• id-terms are composed from object constructors and variables (capital letters) as usual.

• is-a atoms: o : c

• subclass atoms: c :: d

• Method applications to objects:
    o[m → v] (scalar)
    o[m →→ v] (multivalued)
  analogous with arguments: o[m@(x1, …, xn) → v].
  inheritable:
    c[m •→ v]
    c[m •→→ v]

• Signatures of methods:
    c[m ⇒ v] (scalar)
    c[m ⇒⇒ v] (multivalued)

• Variables allowed at all positions

• Entities can act at the same time as classes, objects, and methods

• Rules over atoms: <head> :- <body>.

• Program: a set of rules


Example: F-Logic Model of DBLP

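[Figure: class hierarchy. paper has the subclasses journal_p and conf_p; further classes are institution, publisher, person, journal, journal_vol, conf_proc, and conf_series, with string and integer as value classes. The instances of the example are attached to their classes.]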

journal_p :: paper.    conf_p :: paper.

paper[title ⇒ string; authors ⇒⇒ person].

journal_p[in_vol ⇒ journal_vol].

conf_p[at_conf ⇒ conf_proc].

o_j1 : journal_p[title → "Records, Relations, Sets, Entities, and Things"; authors →→ {o_mes}; in_vol → o_i11].

o_di : conf_p[title → "DIAM II and Levels of Abstraction"; authors →→ {o_mes, o_eba}; at_conf → o_v76].

o_i11 : journal_vol[of → o_is; number → 1; volume → 1; year → 1975].

o_is : journal[name → "Information Systems"; editors@(…) →→ {o_mj}].

o_v76 : conf_proc[of → o_vldb; year → 1976; editors →→ {o_pcl, o_ejn}].

o_vldb : conf_series[name → "Very Large Databases"].

o_mes : person[name → "Michael E. Senko"].    o_mj : person[name → "Matthias Jarke"; affil@(…) → o_rwt].

o_rwt : institution[name → "RWTH Aachen"].    …

[Figure: object graph of these instances. Edges: authors, editors, editors@(Year), publisher, in_vol, at_conf, of, affil@(Year); attribute values: title, name, abbrev, address, number, volume, year. Further instances shown: o_eba (Edward B. Altman), o_pcl (Peter C. Lockemann), o_ejn (Erich J. Neuhold), o_uka (Uni Karlsruhe), o_gmd (GMD Darmstadt); the VLDB series also carries abbrev → "VLDB".]


Formal Framework: F-Logic

• path expressions:
    o.m    the object o′ such that o[m → o′]
    o..m   all objects o′ such that o[m →→ o′]

    ?- P : conf_proc.of[abbrev → "VLDB"], P[year → 1976],
       P..editors[affil@(1976) → A].

• object creation by path expressions in the head: o.m[…] :- …

• derived equality via object fusion: o1 = o2 :- …
  (implemented in the object manager)

• Aggregates: sum, count, ...

• nonmonotonic inheritance

• FLORID: bottom-up inflationary semantics with user-defined stratification
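To make the two kinds of path expression concrete, here is a minimal Python sketch that evaluates the editor-affiliation query over a tiny object store. It is an illustration only and does not reflect how FLORID evaluates queries; the store layout, the helpers `scalar` and `multi`, and the institution ids `inst_a` and `inst_b` are invented placeholders, not data from the slides.

```python
# Toy object store: scalar methods map to one value, multivalued ones to a set.
# Methods with arguments, such as affil@(1976), are modeled as tuple keys.
# The affiliations inst_a / inst_b are invented placeholders.
store = {
    "o_v76":  {"isa": "conf_proc", "of": "o_vldb", "year": 1976,
               "editors": {"o_pcl", "o_ejn"}},
    "o_vldb": {"isa": "conf_series", "abbrev": "VLDB"},
    "o_pcl":  {"isa": "person", ("affil", 1976): "inst_a"},
    "o_ejn":  {"isa": "person", ("affil", 1976): "inst_b"},
}

def scalar(o, m):
    """o.m : the single object o' with o[m -> o'], or None if undefined."""
    v = store.get(o, {}).get(m)
    return None if isinstance(v, (set, frozenset)) else v

def multi(o, m):
    """o..m : the set of all objects o' with o[m ->> o']."""
    v = store.get(o, {}).get(m, set())
    return set(v) if isinstance(v, (set, frozenset)) else set()

def vldb_editor_affiliations(year):
    """Sketch of: ?- P : conf_proc.of[abbrev -> "VLDB"], P[year -> year],
                     P..editors[affil@(year) -> A].
    Returns the set of (P, A) answers."""
    answers = set()
    for p, props in store.items():
        if props.get("isa") != "conf_proc" or props.get("year") != year:
            continue
        if scalar(scalar(p, "of"), "abbrev") != "VLDB":
            continue
        for editor in multi(p, "editors"):
            a = scalar(editor, ("affil", year))
            if a is not None:
                answers.add((p, a))
    return answers
```

The scalar path yields at most one object per step, while the multivalued path fans out over a set, which is why the loop over editors is needed only for `..editors`.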


Requirements for Implementation of Web Access

• non-logical features ⇒ built-ins:

– Web access via the http protocol,

– parsing of HTML/SGML/XML,

– matching with Perl regular expressions,

• logical issues:

– suitable modeling (classes)

– object creation on demand

– object fusion

– navigation in the model

– powerful, flexible reasoning


Exploration of the Web

• classes url and webdoc (subclasses of string)

• class url: url.get is implemented as an active method (C++):

    u.get :- …

– accesses the Web document which is accessible via u,

– assigns it to u.get (object creation),

– u.get becomes an instance of class webdoc,

– and several properties are automatically filled in:

    u.get[hrefs@(ℓ) →→ u′] ⇔ u.get contains "<a href="u′">ℓ</a>".

Signatures:

    url :: string[get ⇒ webdoc].
    webdoc[url ⇒ url; author ⇒ string; type ⇒ string;
           hrefs@(string) ⇒⇒ url; modif ⇒ string; error ⇒⇒ string].

Example:

    url1[get → wd1].    url2[get → wd2].
    wd1 : webdoc[url → url1; hrefs@(label) →→ {url2}].
    wd2 : webdoc[url → url2; type → html; …].
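The effect of the active method can be imitated in a few lines. The sketch below is a Python stand-in, not the C++ built-in of FLORID: `FAKE_WEB`, its URLs, and the function `get` are assumptions made for illustration, and no network access takes place. It fills a webdoc-like record with the url, a type, and the hrefs@(label) links found in the page.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects (label, url) pairs for every <a href=...>label</a>."""
    def __init__(self):
        super().__init__()
        self.links = []      # finished (label, href) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Stand-in for the Web: url -> page source (made-up URLs and contents).
FAKE_WEB = {
    "http://example.org/dblp":
        '<html><body><a href="http://example.org/conf">Conferences</a> '
        '<a href="http://example.org/journals">Journals</a></body></html>',
}

def get(url):
    """Mimics u.get: returns a webdoc-like dict; unknown urls get an error entry."""
    source = FAKE_WEB.get(url)
    if source is None:
        return {"url": url, "hrefs": {}, "error": {"not found"}}
    collector = LinkCollector()
    collector.feed(source)
    collector.close()
    hrefs = {}
    for label, target in collector.links:
        hrefs.setdefault(label, set()).add(target)   # hrefs@(label) ->> target
    return {"url": url, "type": "html", "hrefs": hrefs, "error": set()}
```

Keeping `hrefs` as a mapping from label to a set of targets mirrors the multivalued signature hrefs@(string) ⇒⇒ url: the same label may occur on several links of one page.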


Data-Driven Web Exploration

• in the course of the information extraction and restructuring process, additional pages are recognized to be relevant:

    U.get :- A : author[homepage → U].

[Figure: the document wd1 at url1 contains <A HREF="url2">label</A>, and url2 leads to the document wd2; in the model, wd1[hrefs@(label) →→ url2].]

• the approach implements a hybrid concept by embedding data-driven wrapping into a warehouse approach

[Figure: pages of the WWW (dblp, vldb, is, 76, v1, v5, …, senko) are loaded into the database by access along hrefs@(…) and then analyzed, which may trigger further loading.]
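The load/analyze cycle can be sketched as a worklist loop in which a page is fetched only when some rule asks for it. This Python fragment is an assumption-laden illustration: the link table `WEB`, the predicate `wanted`, and the function `explore` are invented, and the predicate merely stands in for rule bodies that derive U.get.

```python
# Made-up link structure: page -> {label: target}.
WEB = {
    "dblp":             {"Conferences": "conf", "Journals": "journals", "About": "about"},
    "conf":             {"VLDB": "conf/vldb", "Imprint": "imprint"},
    "conf/vldb":        {"Contents": "conf/vldb/vldb76"},
    "conf/vldb/vldb76": {},
    "journals":         {},
    "about":            {},
    "imprint":          {},
}

def explore(start, wanted):
    """Load `start`, then keep loading targets of links that `wanted` accepts.

    `wanted(page, label)` plays the role of the rules that derive U.get;
    pages behind other links are never loaded.
    """
    loaded, agenda = [], [start]
    while agenda:
        url = agenda.pop(0)
        if url in loaded or url not in WEB:
            continue
        loaded.append(url)                          # load
        for label, target in WEB[url].items():      # analyze
            if wanted(url, label):
                agenda.append(target)
    return loaded

def wanted(page, label):
    """Hypothetical relevance test: follow only these three labels."""
    return label in {"Conferences", "VLDB", "Contents"}
```

Because only demanded pages are loaded, the database stays a small warehouse of relevant documents rather than a mirror of everything reachable.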


Parsing of Web Pages

• url.parse: active method

– generates the F-Logic representation of the parse-tree,

– assigns it to the object u.parse : parsetree,

– SGML-tagged groups <tag> … </tag> become objects,

– classes webdoc.<tag>,

– navigation: o.<tag>@(0), …, o.<tag>@(n) are the segments inside o.<tag>,

– tag attributes: o.<attr>

– tables whose header contains "1998" in any header row/column are identified by

    ?- T : wd.table,
       T.table@(Row).tr@(Col)[th@(0) → S],
       substr(S, "1998").

– the contents of the third column of the 17th row of a given table tab are addressed by a path of the form tab.table@(Row).tr@(Col), with Row and Col bound to that row and column.

• hyperlinks emanating from the parse-tree:

    Z[hrefs@(Label) →→ Url] :- Z : (U:url.parse.a), Z[a@(0) → Label; href → Url].
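The indexed navigation tag@(i) can be mimicked with a small tree of nodes. The Python sketch below is illustrative only: the classes `Node` and `TreeBuilder`, the function `header_cells_containing`, and the sample table are all made up, and the builder assumes well-nested markup without void elements, which real HTML does not guarantee.

```python
from html.parser import HTMLParser

class Node:
    """One tagged group; children are Nodes or text strings, in document order."""
    def __init__(self, tag, attrs):
        self.tag, self.attrs, self.children = tag, dict(attrs), []

    def at(self, tag, index):
        """Mimics o.tag@(index): the index-th segment inside an element <tag>."""
        if self.tag != tag:
            raise KeyError(f"{tag}@({index}) is not defined on <{self.tag}>")
        return self.children[index]

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root, self.stack = None, []

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        if self.stack:
            self.stack[-1].children.append(node)
        else:
            self.root = node
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumption: markup is well nested; mismatched end tags are ignored.
        if self.stack and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.stack[-1].children.append(data.strip())

def parse(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root

def header_cells_containing(table, text):
    """(row, col) positions of <th> cells whose first segment is text containing `text`."""
    hits = []
    for r, row in enumerate(table.children):
        if not isinstance(row, Node):
            continue
        for c, cell in enumerate(row.children):
            if (isinstance(cell, Node) and cell.tag == "th" and cell.children
                    and isinstance(cell.children[0], str) and text in cell.children[0]):
                hits.append((r, c))
    return hits

# Made-up sample table.
table = parse("<table><tr><th>Year</th><th>1998</th></tr>"
              "<tr><td>papers</td><td>42</td></tr></table>")
```

With this sample, `table.at("table", 1).at("tr", 1).at("td", 0)` walks to the second cell of the second row, in the same way a path such as tab.table@(Row).tr@(Col) addresses a cell.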


Wrapping

• url.get, url.parse: raw, uninterpreted data ⇒ extended Web skeleton

• wrapping by F-Logic rules

• Logical markup: parser-based. DBLP server: sufficiently well-structured HTML

– direct correspondence between HTML nodes and objects (extended Web skeleton).

• Optical and syntactical markup: pattern matching via regular expressions

– construction of the object-oriented model from scratch / identifying new objects


More than Parsing

• not all Web pages provide logical markup

• well-structured pages need further wrapping:

– keywords,

– commalists,

– text search for relevant words

    auth1, auth2, ..., and authn: title. Number n in Volume v of series, pages p1-p2, year.

Pattern Matching in FLORID

Perl regular expressions are available through the built-in predicate

    pmatch(<string>, "/<regexp>/", [<fmt-list>], [X1, …, Xn])

    pmatch(STRING,
           "/\A([^:]*): (.*)\.\sNumber ([0-9]*) in Volume ([0-9]*) of ([A-Za-z]*), pages ([0-9-]*), ([0-9]*)/",
           [$1, $2, "$4($3)", $5, $6, $7],
           [AuthList, Title, Num, Series, Pages, Year])

AuthList is a commalist ...
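For readers more used to other regex engines, the same extraction can be sketched with Python's `re` module. This is not FLORID's pmatch; the function `match_citation`, the slightly stricter pattern (`+` instead of `*`, spaces allowed in the series name), and the sample citation are assumptions made for illustration.

```python
import re

# Seven groups: authors, title, number, volume, series, pages, year.
CITATION = re.compile(
    r"\A([^:]*): (.*)\.\s+"
    r"Number ([0-9]+) in Volume ([0-9]+) of ([A-Za-z ]+), "
    r"pages ([0-9-]+), ([0-9]+)"
)

def match_citation(text):
    """Return the extracted fields as a dict, or None if the text does not fit."""
    m = CITATION.search(text)
    if m is None:
        return None
    authors, title, number, volume, series, pages, year = m.groups()
    # Split the commalist; an "and" before the last author is another separator.
    auth_list = [a.strip()
                 for a in re.split(r",\s*(?:and\s+)?|\s+and\s+", authors)
                 if a.strip()]
    return {
        "authors": auth_list,
        "title": title,
        "num": f"{volume}({number})",   # corresponds to the format "$4($3)"
        "series": series,
        "pages": pages,
        "year": year,
    }

# Invented sample citation in the layout described above.
SAMPLE = ("A. Author, B. Writer, and C. Third: A Sample Title. "
          "Number 2 in Volume 7 of Sample Series, pages 10-20, 1999.")
```

The format list of pmatch corresponds to the dictionary built at the end: six output values are assembled from seven captured groups, with volume and number merged into one.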


Example: DBLP Server

constructing the application model:

    dblp[url → "http://…"].
    dblp.url : url.
    dblp.url.get.

    dblp[journals_page → (X:url)] :- dblp.url.get[hrefs@("Journals") →→ X].
    dblp[conf_page → (X:url)] :- dblp.url.get[hrefs@("Conferences") →→ X].

    dblp.journals_page.get.
    dblp.conf_page.get.

    % conferences
    S : conf_series[name → S; url → (U:url)], U.get :-                            % "VLDB"
        dblp.conf_page.get[hrefs@(S) →→ U].

    (S.year@(Year) : conf)[series → S; year → Year; url → (U:url)], U.get :-      % "VLDB"@(1976)
        (S : conf_series).url.get[hrefs@("Contents") →→ U],
        pmatch(U, "/[A-z]*([0-9]*).html/", "19$1", Year).

    % ... similar for journals
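The year is recovered from the file name of the proceedings page by prefixing the captured digits with "19". A small Python sketch of that step follows; the function name `year_from_url`, the stricter two-digit pattern, and the example paths are assumptions for illustration.

```python
import re

def year_from_url(url):
    """Return e.g. 1976 for '.../vldb76.html', or None if the name does not fit.

    Assumes two-digit years that all belong to the 1900s, as the "19$1"
    format in the rule above does.
    """
    m = re.search(r"[A-Za-z]+([0-9]{2})\.html$", url)
    return int("19" + m.group(1)) if m else None
```

The fixed "19" prefix is what ties this wrapper to proceedings before 2000; any later volumes would need a different rule.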


Example: DBLP Server

Every paper on a conference or journal volume page is represented in an <li> tag, e.g.

    <li><a href=…>author1</a>, …, <a href=…>authorn</a>: <b>title</b> pages.

    conf_paper :: paper.
    journal_paper :: paper.

    % conference papers
    p(P) : conf_paper[parsenode →→ P] :-
        C : conf, P : C.parse.li.

    % journal papers
    p(P) : journal_paper[parsenode →→ P] :-
        V : journal_vol, P : V.parse.li.

    % papers: titles and pages
    P[title → T] :- P : paper, (P.parsenode.li@(_) : b)[b@(0) → T], string(T).
    P[pages → N] :- P : paper, P.parsenode[li@(_) → N], string(N).

    % papers: authors
    N : author[name → Name; url →→ (U:url)], P[authors →→ N] :-
        (((P : paper).parsenode.li@(_)) : a)[href → U; a@(0) → Name].
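What these rules pull out of one list item can be shown with a short Python sketch. It is not the FLORID wrapper: the class `PaperItem`, the function `wrap_paper`, and the sample item with its hrefs are invented, and the code assumes the simple layout described above (anchors for authors, one bold title, page range as trailing text).

```python
from html.parser import HTMLParser

class PaperItem(HTMLParser):
    """Extracts authors, title and pages from one <li> entry."""
    def __init__(self):
        super().__init__()
        self.authors, self.title, self.pages = [], "", ""
        self._open = []      # stack of currently open tags
        self._href = None    # href of the most recent <a>

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        inside = self._open[-1] if self._open else None
        if inside == "a":
            self.authors.append((data.strip(), self._href))
        elif inside == "b":
            self.title += data
        elif inside == "li" and self.title:
            # Text after the title, directly inside <li>, holds the page range.
            self.pages += data.strip().rstrip(".")

def wrap_paper(li_source):
    item = PaperItem()
    item.feed(li_source)
    item.close()
    return {"title": item.title.strip(), "pages": item.pages, "authors": item.authors}

# Invented sample entry.
SAMPLE_LI = ('<li><a href="a-tree/x">A. Author</a>, <a href="a-tree/y">B. Writer</a>: '
             '<b>A Sample Title</b> 10-20.</li>')
```

Each (name, href) pair corresponds to one author object with its url, and the href is what later drives the loading of author pages.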


Example: DBLP Server

    % authors' pages
    U.get :- A : author[url →→ U].

    % authors' homepages
    A[homepage → (U:url)], U.get :-
        A : author.url.get[hrefs@("Homepage") →→ U].

⇒ data-driven Web exploration


Example: DBLP Server

• single-site source

• "best case" example

• well-structured HTML/SGML

• parser-based wrapping

• the model contains the Web skeleton, the parse-trees, and the application model


Generic Wrapping Tasks

Extracting contents of the pages:

• Logical Markup

– HTML-Lists

– HTML-Tables: Headers, Columns

• Optical Markup

– Paragraphs

– Boldfacing, Emphasizing

• Syntactical Markup:

– Commalists, Semicolons, Parentheses

• Generic rules for these tasks

• program skeleton completed by application-specific rules and refining rules (rapid prototyping).

• (semi-)automatic approaches to wrapper generation do not (yet) provide a sufficiently fine granularity


Mediating: Integration and Restructuring

• every source defines a schema

• overlapping classes

• different names for objects

• object fusion

• Inter-Source Links
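Object fusion across sources can be pictured as merging object ids into equivalence classes. The following Python sketch uses a union-find structure for this; it says nothing about how the FLORID object manager is implemented, and the class `Fusion`, the function `fuse_by_key`, the object ids, and the choice of `name` as the identifying key are assumptions.

```python
class Fusion:
    """Union-find over object ids: fused ids share one representative."""
    def __init__(self):
        self.parent = {}

    def find(self, o):
        self.parent.setdefault(o, o)
        while self.parent[o] != o:
            self.parent[o] = self.parent[self.parent[o]]   # path halving
            o = self.parent[o]
        return o

    def fuse(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def fuse_by_key(objects, key):
    """Fuse all objects that agree on `key`, e.g. the same name in two sources."""
    fusion, first_seen = Fusion(), {}
    for oid, props in objects.items():
        fusion.find(oid)                  # register the object
        value = props.get(key)
        if value is None:
            continue
        if value in first_seen:
            fusion.fuse(first_seen[value], oid)
        else:
            first_seen[value] = oid
    return fusion

# Invented objects from two hypothetical sources.
OBJECTS = {
    "src1:de":      {"name": "Germany"},
    "src2:germany": {"name": "Germany"},
    "src1:fr":      {"name": "France"},
}
```

Once two ids are fused, every property stated for one of them holds for the other, which is how facts from overlapping sources end up on a single object.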


Conclusion

• practicable approach (multi-source MONDIAL case study).

• unified model for the Web representation and the model of the application,

• integrated data model/language for wrapping, mediating, and querying.

• Further Work:

– "intelligent" wrapping (analysis of tables)

– Usage with search engines

– XML


Appendix: Formal Semantics of Web Access

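Herbrand semantics of get and parse:

$$\mathit{explore} : \mathrm{URL} \to 2^{HB}, \qquad \mathit{parse} : \mathrm{URL} \to 2^{HB}$$

A Herbrand model $H$ of an F-Logic program $P$ is a model of $P$ wrt. Web access (built-in semantics of u.get and u.parse) if

• if $H \models u : \mathit{url} \wedge u.\mathit{get}$, then $\mathit{explore}(u) \subseteq H$,

• if $H \models u : \mathit{url} \wedge u.\mathit{parse}$, then $\mathit{parse}(u) \subseteq H$.

This is integrated into the $T_P$ operator. For an F-Logic program $P$ and an H-interpretation $H$,

$$T_P(H) := H \cup \{\, h \mid (h \leftarrow \mathit{body}) \in \mathit{ground}(P),\ H \models \mathit{body} \,\},$$

$$T_P^{W,0}(H) := H,$$

$$T_P^{W,i+1}(H) := \mathcal{C}\ell\Big( T_P\big(T_P^{W,i}(H)\big) \;\cup\; \bigcup \{\, \mathit{explore}(u) \mid T_P\big(T_P^{W,i}(H)\big) \models u : \mathit{url} \wedge u.\mathit{get} \,\} \;\cup\; \bigcup \{\, \mathit{parse}(u) \mid T_P\big(T_P^{W,i}(H)\big) \models u : \mathit{url} \wedge u.\mathit{parse} \,\} \Big).$$

Then use $T_P^{W,\omega}$.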
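The interleaving of rule application and Web access can be sketched as a simple fixpoint loop. This Python fragment is a toy under stated assumptions: facts are tuples, `t_p` hard-codes a single made-up rule, `explore` reads from an invented in-memory link table, the closure operator is omitted, and the loop terminates only because that table is finite.

```python
# Facts are tuples. ("get", u) means the program demands u.get;
# ("href", u, label, v) is a fact contributed by explore(u).
FAKE_WEB = {"u0": [("Next", "u1")], "u1": [("Next", "u2")], "u2": []}

def explore(u):
    """Facts the built-in contributes for u.get (empty if u is unknown)."""
    return {("href", u, label, v) for label, v in FAKE_WEB.get(u, [])}

def t_p(facts):
    """One inflationary rule application: demand the target of every 'Next' link."""
    derived = {("get", f[3]) for f in facts if f[0] == "href" and f[2] == "Next"}
    return facts | derived

def t_w_fixpoint(facts):
    """Iterate rule application plus Web access until nothing new is added."""
    current = set(facts)
    while True:
        step = t_p(current)
        for f in list(step):
            if f[0] == "get":
                step |= explore(f[1])    # add explore(u) for every demanded u
        if step == current:
            return current
        current = step
```

Starting from the single demand for `u0`, each round both derives new demands and fetches the pages already demanded, so the model grows along the links the rules choose to follow.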
