Page 1: Monitoring XML Data on the Web

Monitoring XML Data on the Web

Benjamin Nguyen, Serge Abiteboul, Grégory Cobéna and Mihaï Preda

INRIA Rocquencourt, Projet Verso and Xyleme S.A.

FRANCE

Contact: [email protected] or [email protected]

http://www-rocq.inria.fr/verso/ and http://www.xyleme.com

Page 2: Monitoring XML Data on the Web

SIGMOD'01 Santa-Barbara 2

Organization

Introduction

Query Subscription

Motivations

Subscription System Architecture

Subscription Language

Complex Event Detection Algorithm

Alerters

Conclusion

Page 3: Monitoring XML Data on the Web

A Dynamic Warehouse for the XML Data of the Web

Xyleme

"A complex tissue in the vascular system of higher plants… functions chiefly in conduction but also in support and storage." - Webster

Page 4: Monitoring XML Data on the Web

A brief look back…

1999/2000: a group of researchers from
• Inria Rocquencourt, Verso Group
• U. of Mannheim, Database Group
• U. of Orsay, IASI Group
• CNAM, Vertigo Group

October 2000: creation of a start-up. http://www.xyleme.com/

Page 5: Monitoring XML Data on the Web

The three aspects of Xyleme

Webhouse
• Xyleme stores huge quantities of data (terabytes)
• Xyleme is more than a search engine (index only) or a mediator (virtual data only)

XML
• Xyleme is focused on XML, i.e., trees

Dynamic
• Xyleme is interested in data evolution/changes

Page 6: Monitoring XML Data on the Web

Xyleme Global Architecture

[Figure: architecture diagram. Labelled components: User Interface, Xyleme Interface, Web Interface, Change Control, Query Processor, Semantic Module, Repository and Index Manager, Acquisition & Crawler, Loader, and the Internet.]

Runs on a cluster of Linux PCs. Implemented in C++.

Page 7: Monitoring XML Data on the Web

Query Subscription

1. Motivations

Page 8: Monitoring XML Data on the Web

The Web changes all the time

• Data acquisition + maintenance keep the warehouse up to date: “Acquisition and Maintenance of XML Data from the Web”, L. Mignet, M. Preda, S. Abiteboul, B. Amann, A. Marian, Tech. Report

• Version management: “Change-Centric Management of Versions in an XML Warehouse”, A. Marian, S. Abiteboul, G. Cobena, L. Mignet, VLDB’01

• Change monitoring: query subscription

Page 9: Monitoring XML Data on the Web

Query Subscription

Users may subscribe to certain events
• Changes in a page or a set of pages
• Changes in pages from a particular semantic domain, containing some specific words or with a particular DTD
• Changes of particular elements somewhere (new products in a catalog)

Users may request to be notified
• Immediately at the time the event is detected
• Regularly, e.g., weekly
• After a certain number of event detections

Users want to be notified
• By email
• Upon login to our site
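The three notification timings listed on this slide (immediately, regularly, after a number of detections) can be pictured with a small Python sketch. The class `NotificationPolicy` and its fields are hypothetical names chosen for illustration, not part of Xyleme. As a simplification, the periodic case is checked only when an event arrives, whereas a real system would presumably be driven by a clock.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class NotificationPolicy:
    """Hypothetical policy: notify at once, every `period`, or every `count` events."""
    immediate: bool = False
    period: Optional[timedelta] = None   # e.g. timedelta(weeks=1)
    count: Optional[int] = None          # e.g. 10 detections
    pending: List[str] = field(default_factory=list)
    last_sent: Optional[datetime] = None

    def on_event(self, event: str, now: datetime) -> List[str]:
        """Record an event; return the batch to send now (empty if nothing is due)."""
        self.pending.append(event)
        if self.last_sent is None:
            self.last_sent = now  # start the period at the first event
        due = (
            self.immediate
            or (self.count is not None and len(self.pending) >= self.count)
            or (self.period is not None and now - self.last_sent >= self.period)
        )
        if not due:
            return []
        batch, self.pending = self.pending, []
        self.last_sent = now
        return batch
```

A policy built with `count=2` stays silent on the first event and returns both events on the second.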

Page 10: Monitoring XML Data on the Web

Query subscription

2. Architecture

Page 11: Monitoring XML Data on the Web

Architecture

[Figure: subscription system architecture. Labelled components: Web Browser, Xyleme Subscription Manager, Subscription Manager, Xyleme Alerter, Complex Event Detection, Trigger Engine, Xyleme Query Processor, Xyleme Reporter, Reporter, SQL (twice), documents.]

Page 12: Monitoring XML Data on the Web

Step 1: Atomic Event Detection

[Figure: document flow. Labels: loading, metadata manager, HTML parser, XML loader, document & alerts, complex event detection. Document annotations shown: d, d/46, d/46,67.]

atomic event 46: URL matches pattern www.musee-orsay.fr/*
atomic event 67: XML document contains the tag <painter> with the value “Monet”

5 million pages/day
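To make the two atomic events of this slide concrete, here is a minimal Python sketch of how a document could pick up the annotations 46 and 67. The function names are hypothetical, and shell-style pattern matching from the standard library stands in for whatever matching the real alerters use.

```python
import fnmatch
import xml.etree.ElementTree as ET
from typing import Set


def url_alerter(url: str) -> Set[int]:
    """Raise atomic event 46 when the URL matches www.musee-orsay.fr/*."""
    matched = fnmatch.fnmatchcase(url, "www.musee-orsay.fr/*")
    return {46} if matched else set()


def xml_alerter(xml_text: str) -> Set[int]:
    """Raise atomic event 67 when a <painter> element contains "Monet"."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return set()  # not well-formed XML: no XML events raised
    for painter in root.iter("painter"):
        if "Monet" in "".join(painter.itertext()):
            return {67}
    return set()


def detect_atomic_events(url: str, xml_text: str) -> Set[int]:
    """Union of the events raised by each alerter (the d/46,67 annotation)."""
    return url_alerter(url) | xml_alerter(xml_text)
```

Calling `detect_atomic_events` on a page under www.musee-orsay.fr/ whose XML has a `<painter>` element mentioning Monet returns the set `{46, 67}`.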

Page 13: Monitoring XML Data on the Web

Step 2: Complex Event Detection

[Figure: document flow. Labels: HTML parser, XML loader, complex event detection.]

complex event 12: 67 & 46 (XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*)

Millions of alerts of pages/day
Millions of subscriptions

Page 14: Monitoring XML Data on the Web

Step 3: Notification Processor

[Figure: notification flow. Labels: complex event detection, alerts, triggers, clock, continuous queries, notification/monitoring, notification/results, Reporter.]

Millions of notifications/day

Page 15: Monitoring XML Data on the Web

Query subscription

3. The language

Page 16: Monitoring XML Data on the Web

Subscription Language

• SQL-like language.
• Combines the use of monitoring queries and continuous queries.
• The language can be extended by adding new types of atomic events.
• Uses the XML Query Language for continuous queries: “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report

Page 17: Monitoring XML Data on the Web

Example

    subscription myPaintings

    % what are the new painting entries in the Musee d’Orsay site
    monitoring newPainting
    select URL
    where URL extends www.musee-orsay.fr/*
    and <painter> contains “Monet”

    % manage the changes in the expositions
    continuous delta Exposition
    select ... from ... where
    when monthly

    notify daily % send me a daily report
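One way to picture the structure of this subscription is as plain data: a monitoring query whose `where` clause is a conjunction of atomic conditions, a continuous query with its own frequency, and a notification frequency. The Python sketch below is only an illustration. All class and field names, and the condition labels `url_extends` and `tag_contains`, are hypothetical, and the continuous query text is left elided exactly as on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class AtomicCondition:
    kind: str                 # hypothetical label, e.g. "url_extends"
    args: Tuple[str, ...]


@dataclass
class MonitoringQuery:
    name: str
    select: str
    where: List[AtomicCondition]   # conjunction: every condition must hold


@dataclass
class ContinuousQuery:
    name: str
    query: str                     # elided on the slide, kept elided here
    when: str
    delta: bool = False


@dataclass
class Subscription:
    name: str
    monitoring: List[MonitoringQuery] = field(default_factory=list)
    continuous: List[ContinuousQuery] = field(default_factory=list)
    notify: str = "daily"


my_paintings = Subscription(
    name="myPaintings",
    monitoring=[MonitoringQuery(
        name="newPainting",
        select="URL",
        where=[
            AtomicCondition("url_extends", ("www.musee-orsay.fr/*",)),
            AtomicCondition("tag_contains", ("painter", "Monet")),
        ],
    )],
    continuous=[ContinuousQuery(
        name="Exposition", query="select ... from ... where",
        when="monthly", delta=True,
    )],
    notify="daily",
)
```

Under this reading, the two conditions of `newPainting` correspond to two atomic events, and the monitoring query as a whole to one complex event.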

Page 18: Monitoring XML Data on the Web

Query Subscription

4. Complex Event Detection

Page 19: Monitoring XML Data on the Web

Atomic Event Set Algorithm

Complex events, each defined as a set of atomic events:

C0 = a0
C1 = a0 a4
C2 = a0 a1 a3
C3 = a2
C4 = a4 a5 a6 a7

[Figure: structure indexing the complex events C0 to C4 by their atomic events a0 to a7.]

Page 20: Monitoring XML Data on the Web

Atomic Event Set Algorithm

S = {a0 a2 a4}

Detected Events: C0, C1, C3

[Figure: the structure over atomic events a0 to a7 traversed for the set S; C0, C1 and C3 are marked as detected, C2 and C4 are not.]
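The slide text does not spell out the data structure, so the following Python sketch is an assumption: it stores each complex event in a prefix tree keyed by its sorted atomic events, then finds every complex event whose atomic events all belong to the detected set S. The names `ComplexEventIndex`, `add` and `detect` are hypothetical. The example uses the complex events C0 = a0, C1 = a0 a4, C2 = a0 a1 a3, C3 = a2, C4 = a4 a5 a6 a7 and S = {a0 a2 a4} from the talk.

```python
from typing import Dict, Iterable, List, Sequence


class _Node:
    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}
        self.complex_events: List[str] = []   # complex events ending at this node


class ComplexEventIndex:
    """Prefix tree over sorted atomic-event ids (illustrative, not Xyleme's code)."""

    def __init__(self) -> None:
        self.root = _Node()

    def add(self, name: str, atomic_events: Iterable[int]) -> None:
        node = self.root
        for a in sorted(set(atomic_events)):
            node = node.children.setdefault(a, _Node())
        node.complex_events.append(name)

    def detect(self, raised: Iterable[int]) -> List[str]:
        """Return every complex event whose atomic events are all in `raised`."""
        s = sorted(set(raised))
        found: List[str] = []
        self._walk(self.root, s, 0, found)
        return sorted(found)

    def _walk(self, node: _Node, s: Sequence[int], start: int,
              found: List[str]) -> None:
        found.extend(node.complex_events)
        # Try each remaining atomic event of S as the next step down the tree.
        for i in range(start, len(s)):
            child = node.children.get(s[i])
            if child is not None:
                self._walk(child, s, i + 1, found)


index = ComplexEventIndex()
index.add("C0", [0])
index.add("C1", [0, 4])
index.add("C2", [0, 1, 3])
index.add("C3", [2])
index.add("C4", [4, 5, 6, 7])
print(index.detect([0, 2, 4]))   # prints ['C0', 'C1', 'C3']
```

Because both the stored sets and S are sorted, the walk only follows branches that can still lead to a subset of S, so C2 and C4 are abandoned as soon as a required atomic event is missing.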

Page 21: Monitoring XML Data on the Web

Complexity results

• A formal study has been conducted.
• Experimental (simulation) values concur with this study.
• Results show that the algorithm is well suited for our application:
   10 million complex events
   1 million atomic events
   100 atomic events detected per document
   0.8 ms to process a document
   ~2 million documents per day
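As a rough sanity check on these figures, the short Python calculation below multiplies the per-document cost by the daily document count. It assumes, which the slide does not state, that documents are processed one after another on a single processor.

```python
MS_PER_DOCUMENT = 0.8          # from the slide
DOCUMENTS_PER_DAY = 2_000_000  # from the slide ("~2 million")

# Assumption: sequential processing on one processor.
busy_seconds = MS_PER_DOCUMENT * DOCUMENTS_PER_DAY / 1000
seconds_per_day = 24 * 60 * 60

print(f"detection time per day: {busy_seconds:.0f} s ({busy_seconds / 60:.1f} min)")
print(f"share of one day: {busy_seconds / seconds_per_day:.2%}")
```

Under that assumption, complex event detection would occupy about 1600 seconds, i.e. under half an hour, of each day.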

Page 22: Monitoring XML Data on the Web

Query Subscription

5. Alerters

Page 23: Monitoring XML Data on the Web

Alerters

Each Alerter can be viewed as a plugin that acts on a document flow.

All sorts of Atomic events can be detected: URL pattern detection, Keywords, XML structure, Page rank…

Can be distributed.
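The idea of an alerter as a plugin acting on a document flow can be sketched in Python as a common interface plus a loop over the flow. Everything here (`Document`, `Alerter`, `KeywordAlerter`, `UrlPrefixAlerter`, `run_alerters`, and the string event labels) is a hypothetical illustration rather than the Xyleme interface, which is implemented in C++.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Protocol, Set, Tuple


@dataclass
class Document:
    url: str
    text: str


class Alerter(Protocol):
    """Plugin interface: inspect one document, return the atomic events raised."""
    def detect(self, doc: Document) -> Set[str]: ...


class KeywordAlerter:
    def __init__(self, event: str, keyword: str) -> None:
        self.event, self.keyword = event, keyword

    def detect(self, doc: Document) -> Set[str]:
        return {self.event} if self.keyword in doc.text else set()


class UrlPrefixAlerter:
    def __init__(self, event: str, prefix: str) -> None:
        self.event, self.prefix = event, prefix

    def detect(self, doc: Document) -> Set[str]:
        return {self.event} if doc.url.startswith(self.prefix) else set()


def run_alerters(flow: Iterable[Document],
                 alerters: List[Alerter]) -> Iterator[Tuple[Document, Set[str]]]:
    """Apply every plugged-in alerter to each document of the flow."""
    for doc in flow:
        events: Set[str] = set()
        for alerter in alerters:
            events |= alerter.detect(doc)
        yield doc, events
```

Since each alerter only looks at one document at a time, alerters of this shape could run on different machines, each handling part of the flow.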

Page 24: Monitoring XML Data on the Web

Conclusion and Perspectives

• This work has been implemented and integrated in the Xyleme system.
• The core of our system is reusable.
• The system is expandable, and can be used to trigger various other modules:
   versioning of documents
   semantic classification

Page 25: Monitoring XML Data on the Web

Perspectives

• Re-use of the core of our system.
• Triggering of various other modules:
   versioning documents
   semantic classification

Page 26: Monitoring XML Data on the Web

The Coming of XML

HTML (old standard)
• comes from SGML
• hypertext language
• fixed number of tags
• content and presentation are mixed
• very difficult to extract data from a page

XML (new standard)
• also comes from SGML
• semistructured data
• number of tags not fixed
• content and presentation not mixed
• very easy to extract data

Page 27: Monitoring XML Data on the Web

XML = Semistructured Data

Information System

Ref     Name     Price
X23     Camera   359.99
R2D2    Robot    19350.00
Z25     PC       1299.99
...

XML

<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>

Data + Structure
Semistructured: more flexible
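To illustrate the point that XML carries both the data and its structure, the Python sketch below reads a small product table back into rows with the standard library parser. The XML string is a cleaned-up excerpt of the slide's example (quoted attributes, descriptions omitted), and the variable names are chosen for illustration.

```python
import xml.etree.ElementTree as ET

XML = """
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
  </product>
</product-table>
"""

root = ET.fromstring(XML)

# Each <product> element becomes one (Ref, Name, Price) row.
rows = [
    (
        p.get("reference"),
        p.findtext("designation", default="").strip(),
        float(p.findtext("price", default="nan")),
    )
    for p in root.iter("product")
]
print(rows)
```

The element and attribute names are enough to rebuild the Ref, Name and Price columns without knowing anything about page layout.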

Page 28: Monitoring XML Data on the Web

The Web and XML