Monitoring XML Data on the Web
Benjamin Nguyen, Serge Abiteboul, Grégory Cobéna and Mihaï Preda
INRIA Rocquencourt, Projet Verso and Xyleme S.A.
FRANCE
Contact: [email protected] or [email protected]
http://www-rocq.inria.fr/verso/ and http://www.xyleme.com
SIGMOD'01 Santa-Barbara 2
Organization
Introduction
Query Subscription Motivations
Subscription System Architecture
Subscription Language
Complex Event Detection Algorithm
Alerters
Conclusion
A Dynamic Warehouse for the XML Data of the Web
Xyleme
“A complex tissue in the vascular system of higher plants… functions chiefly in conduction but also in support and storage.” (Webster)
A brief look back…
1999/2000: a group of researchers from
• Inria Rocquencourt, Verso Group
• U. of Mannheim, Database Group
• U. of Orsay, IASI Group
• CNAM, Vertigo Group
October 2000: creation of a start-up. http://www.xyleme.com/
The three aspects of Xyleme
Webhouse
• Xyleme stores huge quantities of data (terabytes)
• Xyleme is more than a search engine (index only) or a mediator (virtual data only)
XML
• Xyleme is focused on XML, i.e., trees
Dynamic
• Xyleme is interested in data evolution/changes
Xyleme Global Architecture
[Figure: architecture diagram. Labels: INTERNET, Web Interface, User Interface, Xyleme Interface, Query Processor, Semantic Module, Change Control, Repository and Index Manager, Acquisition & Crawler, Loader.]
Runs on a cluster of Linux PCs. Implemented in C++.
Query Subscription
1. Motivations
The Web changes all the time
• Data acquisition + maintenance keep the warehouse up-to-date: “Acquisition and Maintenance of XML Data from the Web”, L. Mignet, M. Preda, S. Abiteboul, B. Amann, A. Marian, Tech. Report
• Version management: “Change-Centric Management of Versions in an XML Warehouse”, A. Marian, S. Abiteboul, G. Cobena, L. Mignet, VLDB’01
• Change monitoring: query subscription
Query Subscription
Users may subscribe to certain events
• Changes in a page, a set of pages
• Changes in pages from a particular semantic domain, containing some specific words or with a particular DTD
• Changes of particular elements somewhere (new products in a catalog)
Users may request to be notified
• Immediately at the time the event is detected
• Regularly, e.g., weekly
• After a certain number of event detections
Users want to be notified
• By email
• Upon login to our site
Query subscription
2. Architecture
Architecture
[Figure: subscription system architecture diagram. Labels: Web Browser, Xyleme Alerter, Xyleme Reporter, Xyleme Subscription Manager, Complex Event Detection, Subscription Manager, Reporter, Trigger Engine, Xyleme Query Processor, SQL, documents.]
Step 1: Atomic Event Detection
[Figure: diagram labels: metadata manager, HTML parser, XML loader, loading, complex event detection, document & alerts. A document d is tagged d/46, then d/46,67.]
atomic event 46: URL matches pattern www.musee-orsay.fr/*
atomic event 67: XML document contains the tag <painter> with the value “Monet”
5 million pages/day
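The two atomic events on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event codes 46 and 67 and the pattern come from the slide, while the function name `detect_atomic_events`, the use of shell-style matching for the URL pattern, and the sample document are hypothetical (the Xyleme alerters themselves are implemented in C++).

```python
from fnmatch import fnmatchcase
import xml.etree.ElementTree as ET

# Event codes 46 and 67 come from the slide; the rest is illustrative.
URL_EVENT = 46
PAINTER_EVENT = 67


def detect_atomic_events(url: str, xml_text: str) -> set[int]:
    """Return the codes of the atomic events raised by one document."""
    events = set()

    # Atomic event 46: URL matches pattern www.musee-orsay.fr/*
    # (assumption: "*" is treated as a shell-style wildcard).
    if fnmatchcase(url, "www.musee-orsay.fr/*"):
        events.add(URL_EVENT)

    # Atomic event 67: a <painter> element whose text contains "Monet".
    root = ET.fromstring(xml_text)
    for painter in root.iter("painter"):
        if "Monet" in (painter.text or ""):
            events.add(PAINTER_EVENT)
            break

    return events


# Hypothetical sample document and URL.
doc = "<painting><painter>Claude Monet</painter><title>Nympheas</title></painting>"
print(sorted(detect_atomic_events("www.musee-orsay.fr/collections/42.xml", doc)))
# prints [46, 67]
```

The resulting set of codes plays the role of the "d/46,67" annotation that travels with the document.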
Step 2: Complex Event Detection
[Figure: diagram labels: HTML parser, XML loader, complex event detection.]
complex event 12: 67 & 46 (XML document contains the tag <painter> with value “Monet” and URL matches pattern www.musee-orsay.fr/*)
Millions of page alerts/day
Millions of subscriptions
Step 3: Notification Processor
[Figure: diagram labels: complex event detection, alerts, triggers, notification/monitoring, continuous queries, clock, Reporter, notification/results.]
Millions of notifications/day
Query subscription
3. The language
Subscription Language
• SQL-like language.
• Combines the use of monitoring queries and continuous queries.
• The language can be extended by adding new types of atomic events.
• Uses the XML Query Language for continuous queries: “Querying the XML Documents of the Web”, V. Aguilera, S. Cluet, F. Boiscuvier, Tech. Report
Example
subscription myPaintings
% what are the new painting entries in Musee d’Orsay site
monitoring newPainting
select URL
where URL extends www.musee-orsay.fr/*
and <painter> contains “Monet”
% manage the changes in the expositions
continuous delta Exposition
select ... from ... where
when monthly
notify daily % send me a daily report
Query Subscription
4. Complex Event Detection
Atomic Event Set Algorithm
C0 = a0
C1 = a0 a4
C2 = a0 a1 a3
C3 = a2
C4 = a4 a5 a6 a7
[Figure: diagram of the structure linking the atomic events a0 to a7 to the complex events C0 to C4.]
Atomic Event Set Algorithm
S = {a0 a2 a4}
Detected Events: C0, C1, C3
[Figure: the same structure traversed for the atomic event set S; C2 and C4 are not detected.]
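The example on these two slides can be reproduced with a small prefix tree over sorted atomic event codes. This is a minimal sketch, assuming that a complex event is a conjunction of atomic events and that it fires when all of its atomic events appear in the document's set S. The names `Node`, `build_index` and `detect` are hypothetical, and the exact structure used in Xyleme may differ from this one.

```python
class Node:
    """One node of a prefix tree keyed by atomic event code."""

    def __init__(self):
        self.children = {}  # atomic event code -> Node
        self.complex = []   # complex events whose sorted atoms end here


def build_index(complex_events):
    """Insert each complex event along the path of its sorted atomic events."""
    root = Node()
    for name, atoms in complex_events.items():
        node = root
        for atom in sorted(atoms):
            node = node.children.setdefault(atom, Node())
        node.complex.append(name)
    return root


def detect(root, detected_atoms):
    """Return the complex events whose atoms are all in detected_atoms."""
    s = sorted(detected_atoms)
    found = []

    def walk(node, start):
        found.extend(node.complex)
        # Only follow branches labelled with atoms that were detected,
        # and only in increasing order, so each subset is visited once.
        for i in range(start, len(s)):
            child = node.children.get(s[i])
            if child is not None:
                walk(child, i + 1)

    walk(root, 0)
    return sorted(found)


# Complex events from the slide, with a0..a7 written as 0..7.
index = build_index({
    "C0": [0],
    "C1": [0, 4],
    "C2": [0, 1, 3],
    "C3": [2],
    "C4": [4, 5, 6, 7],
})
print(detect(index, {0, 2, 4}))  # prints ['C0', 'C1', 'C3']
```

The traversal only descends into branches labelled with detected atoms, so its cost depends on the size of S and on the matching paths rather than on the total number of complex events.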
Complexity results
A formal study has been conducted.
Experimental (simulation) values concur with this study.
Results show that the algorithm is well suited for our application:
• 10 million Complex Events
• 1 million Atomic Events
• 100 Atomic Events detected per document
• 0.8 ms to process a document
• ~2 million documents per day
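To put the two figures side by side, the short calculation below converts the 0.8 ms processing time into a daily capacity. It assumes a single detector processing documents one after another with no other overhead, which is a simplification not stated on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

ms_per_document = 0.8              # from the slide
documents_per_day_quoted = 2e6     # "~2 million documents per day", from the slide

# Assumption: one sequential detector, no other overhead.
docs_per_second = 1000 / ms_per_document
docs_per_day = docs_per_second * SECONDS_PER_DAY
seconds_for_quoted_load = documents_per_day_quoted * ms_per_document / 1000

print(f"{docs_per_second:.0f} documents/s")               # prints 1250 documents/s
print(f"{docs_per_day / 1e6:.0f} million documents/day")  # prints 108 million documents/day
print(f"{seconds_for_quoted_load / 60:.0f} minutes for the quoted daily load")
```

Under that assumption, the quoted daily load would take well under an hour of detection time.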
Query Subscription
5. Alerters
Alerters
Each Alerter can be viewed as a plugin that acts on a document flow.
All sorts of Atomic events can be detected: URL pattern detection, Keywords, XML structure, Page rank…
Can be distributed.
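One way to picture the plugin idea is a common interface that every alerter implements, applied in turn to each document of the flow. The sketch below is hypothetical: the `Alerter` protocol, the two example alerters, the event names and `run_alerters` are illustrative and not part of the Xyleme code base.

```python
from typing import Iterable, Iterator, Protocol


class Alerter(Protocol):
    """Hypothetical plugin interface: inspect one document, return atomic events."""

    def detect(self, url: str, text: str) -> set[str]: ...


class UrlPrefixAlerter:
    def __init__(self, event: str, prefix: str):
        self.event, self.prefix = event, prefix

    def detect(self, url: str, text: str) -> set[str]:
        return {self.event} if url.startswith(self.prefix) else set()


class KeywordAlerter:
    def __init__(self, event: str, keyword: str):
        self.event, self.keyword = event, keyword.lower()

    def detect(self, url: str, text: str) -> set[str]:
        return {self.event} if self.keyword in text.lower() else set()


def run_alerters(
    documents: Iterable[tuple[str, str]], alerters: list[Alerter]
) -> Iterator[tuple[str, set[str]]]:
    """Apply every plugin to every document of the flow."""
    for url, text in documents:
        events: set[str] = set()
        for alerter in alerters:
            events |= alerter.detect(url, text)
        yield url, events


# Hypothetical document flow and plugin configuration.
flow = [
    ("www.musee-orsay.fr/monet.xml", "Paintings by Monet"),
    ("www.example.org/news.xml", "Nothing about painters here"),
]
plugins = [
    UrlPrefixAlerter("url:orsay", "www.musee-orsay.fr/"),
    KeywordAlerter("kw:monet", "Monet"),
]
for url, events in run_alerters(flow, plugins):
    print(url, sorted(events))
```

Because each alerter only sees one document at a time, plugins written this way could run on different machines over partitions of the flow.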
Conclusion and Perspectives
This work has been implemented and integrated in the Xyleme System.
The core of our system is reusable.
The system is expandable, and can be used to trigger various other modules:
• versioning of documents
• semantic classification
Perspectives
Re-use of the core of our system.
Triggering of various other modules:
• versioning documents
• semantic classification
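The Coming of XML
HTML (old standard)
• comes from SGML
• hypertext language
• fixed number of tags
• content and presentation are mixed
• very difficult to extract data from a page
XML (new standard)
• also comes from SGML
• semistructured data
• tags are not fixed
• content and presentation are not mixed
• very easy to extract data from a page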
XML = Semistructured Data
Information System
Ref    Name     Price
X23    Camera   359.99
R2D2   Robot    19350.00
Z25    PC       1299.99
...
XML
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
Data + Structure
Semistructured: more flexible
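The correspondence between the two representations can be sketched by reading the XML back into table rows. The snippet below is an illustration only: it uses a shortened copy of the slide's example with quoted attribute values, and the helper name `to_rows` is hypothetical. A missing <price> element is tolerated, which is one way to read "more flexible".

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example (descriptions omitted).
XML = """<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
  </product>
</product-table>"""


def to_rows(xml_text):
    """Flatten <product> elements into (reference, designation, price) rows."""
    rows = []
    for product in ET.fromstring(xml_text).iter("product"):
        price_text = product.findtext("price")
        rows.append((
            product.get("reference"),
            (product.findtext("designation") or "").strip(),
            # A product without a <price> element simply yields None.
            float(price_text) if price_text is not None else None,
        ))
    return rows


print(to_rows(XML))
# prints [('X23', 'camera', 359.99), ('R2D2', 'Robot', 19350.0)]
```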
The Web and XML