representing and querying xml with incomplete information

Post on 09-Jan-2016

Representing and Querying XML with Incomplete Information

Serge Abiteboul, INRIA

Luc Segoufin, INRIA

Victor Vianu, UCSD

pods 2001 Abiteboul-Segoufin-Vianu 2

Organization

• Motivations
• Simplifying assumptions
• Model of incompleteness
• Answering queries
• Results
• Discussion
• Conclusion

Motivations

The Web is a world of incompleteness

• Information you get from the Web is seldom complete:
  • Queries return some, not all, of the data
  • Limited storage capability
  • Documents change on the Web: expiration
  • Sites are unavailable…
• Context: Xyleme, a warehouse of XML documents from the Web

This work

• This work: a simple, practically appealing approach to managing incomplete information
• Sequence of queries to the Web
  • (q1,A1)+(q2,A2)+…
  • Answers are cached
• Process a new query without access to the Web
  • Give an incomplete answer
  • Explain the incompleteness to the user
  • Seek additional information, i.e., find a minimal set of queries to fully answer

Related works

• Semantic caching
• Answering queries using views
  • keep (Qi,Ai)
  • try to rewrite query Q into Q’(A1,...,An)
  • reject if you cannot
• Incomplete database
  • (Qi,Ai) is some incomplete knowledge of the DB
  • Related to querying incomplete information, e.g. Lipski-Imielinski

Challenge: balance expressiveness and tractability

• Choice of data model
• Choice of the query language
• Choice of a representation of incompleteness
• Results
  • Simple, practical solution
  • Extra features lead to serious problems

Simplifying Assumptions

Data is XML: trees

<dealer>
  <UsedCars>
    <ad>
      <model>Honda</model>
      <year>96</year>
    </ad>
  </UsedCars>
  <NewCars>
    <ad>
      <model>Acura</model>
    </ad>
  </NewCars>
</dealer>

[Figure: the same document drawn as a tree. dealer has children UsedCars and NewCars; the ad under UsedCars has children model (Honda) and year (96); the ad under NewCars has child model (Acura).]

Simplified XML

[Figure: a catalog drawn as an unordered tree. The root catalog has two product children. One product has children name, price, cat, picture, and subcategory; the other has children name, price, category, and subcategory. Values shown: can, 444, electronique, camera, c.jpg for one product and nik, 234, electronic, camera for the other. Two annotations point at the tree: "labelling function" (element names on nodes) and "value function" (data values).]
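The data model on this slide can be sketched in code. The following is a minimal Python sketch, not the authors' implementation: the `Node` class, its field names, and the `same_tree` helper are assumptions made for illustration. Each node carries a label (the labelling function) and an optional data value (the value function), and trees are compared without regard to sibling order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of an unordered data tree (hypothetical class)."""
    label: str                      # labelling function: node -> element name
    value: Optional[str] = None     # value function: node -> data value
    children: List["Node"] = field(default_factory=list)

    def canonical(self):
        """Order-independent form: children are sorted by their repr."""
        kids = sorted((c.canonical() for c in self.children), key=repr)
        return (self.label, self.value, tuple(kids))


def same_tree(a: Node, b: Node) -> bool:
    """Two unordered trees are equal if their canonical forms agree."""
    return a.canonical() == b.canonical()


# The product with values can / 444 / camera from the slide,
# written with two different sibling orders.
p1 = Node("product", children=[
    Node("name", "can"), Node("price", "444"), Node("subcategory", "camera")])
p2 = Node("product", children=[
    Node("subcategory", "camera"), Node("name", "can"), Node("price", "444")])

print(same_tree(p1, p2))
```

Because order is ignored, `p1` and `p2` denote the same tree in this sketch.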

Simple XML types

[Figure: a tree type. catalog has child product; product has children name, price, cat, picture, and subcategory. Two children are marked with *, apparently product and picture.]

Multiplicity of a child:
• 1 : 1 child (default)
• * : 0 or more
• + : 1 or more
• ? : 0 or 1
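The multiplicity rules above can be checked mechanically. Below is a minimal Python sketch under stated assumptions: trees are written as `(label, children)` pairs, the `CATALOG_TYPE` table is a hypothetical encoding of the slide's type, and the placement of `*` on `product` and `picture` is a reading of the garbled figure rather than something the transcript confirms.

```python
from collections import Counter

# Allowed number of children for each multiplicity symbol on the slide.
MULTIPLICITY = {
    "1": lambda n: n == 1,
    "*": lambda n: n >= 0,
    "+": lambda n: n >= 1,
    "?": lambda n: n <= 1,
}

# Hypothetical simple tree type: parent label -> {child label: multiplicity}.
CATALOG_TYPE = {
    "catalog": {"product": "*"},
    "product": {"name": "1", "price": "1", "cat": "1",
                "picture": "*", "subcategory": "1"},
}


def conforms(tree, tree_type):
    """Check a (label, children) tree against a simple tree type."""
    label, children = tree
    allowed = tree_type.get(label, {})          # leaves allow no children
    counts = Counter(child[0] for child in children)
    if any(child_label not in allowed for child_label in counts):
        return False                            # unexpected child element
    for child_label, mult in allowed.items():
        if not MULTIPLICITY[mult](counts[child_label]):
            return False                        # wrong number of children
    return all(conforms(child, tree_type) for child in children)


good = ("catalog", [("product", [("name", []), ("price", []),
                                 ("cat", []), ("subcategory", [])])])
bad = ("catalog", [("product", [("name", []), ("name", []), ("price", []),
                                ("cat", []), ("subcategory", [])])])

print(conforms(good, CATALOG_TYPE), conforms(bad, CATALOG_TYPE))
```

The `bad` tree fails only because `name` has the default multiplicity 1 and appears twice.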

Prefix Selection Queries (ps-queries)

[Figure: two queries drawn as tree patterns. Query1: catalog, product, with children name, price (< 200), cat (= elec), and subcategory. Query2: catalog, product, with children name and picture.]
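One way to read a ps-query is as a tree pattern with conditions on data values, whose answer is the prefix of the data tree that matches the pattern. The Python sketch below follows that reading; it is an illustration, not the authors' algorithm. The dictionary encoding, the `evaluate`, `leaf`, and `product` helpers, and the product named `zoom` are all assumptions made for the example.

```python
def evaluate(node, pattern):
    """Return the pruned copy of node selected by pattern, or None."""
    if node["label"] != pattern["label"]:
        return None
    cond = pattern.get("cond")
    if cond is not None and not cond(node.get("value")):
        return None
    kept = []
    for sub in pattern.get("children", []):
        candidates = [evaluate(child, sub) for child in node.get("children", [])]
        matches = [m for m in candidates if m is not None]
        if not matches:
            return None            # every pattern child must be matched
        kept.extend(matches)
    return {"label": node["label"], "value": node.get("value"), "children": kept}


def leaf(label, value):
    return {"label": label, "value": value, "children": []}


def product(name, price, cat, subcategory):
    return {"label": "product", "value": None, "children": [
        leaf("name", name), leaf("price", price),
        leaf("cat", cat), leaf("subcategory", subcategory)]}


catalog = {"label": "catalog", "value": None, "children": [
    product("canon", 120, "elec", "camera"),
    product("nikon", 199, "elec", "camera"),
    product("zoom", 450, "elec", "camera"),   # hypothetical, not on the slides
]}

# Query1 from the slide: electronic products with price < 200.
query1 = {"label": "catalog", "children": [
    {"label": "product", "children": [
        {"label": "name"},
        {"label": "price", "cond": lambda v: v < 200},
        {"label": "cat", "cond": lambda v: v == "elec"},
        {"label": "subcategory"},
    ]},
]}

answer1 = evaluate(catalog, query1)
names = [c["value"] for p in answer1["children"]
         for c in p["children"] if c["label"] == "name"]
print(names)
```

In this sketch a product is dropped as soon as one of the pattern's children has no match, which is why the product priced at 450 is absent from the answer.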

Simplifications

Data
• No order
• No distinction attribute/element
• No recursion
• No links

Query
• No complex path expressions
• No join
• No repeated child

[Figure: a query on product with children name, cat = elec, and cat = toy, marked NO because the child cat is repeated.]

Crucial assumption: XID

[Figure: two partial views of the same prod node, both carrying the node ID &245. One view shows the values canon, 120, elec and camera; the other shows the picture c.jpg. The two views are combined (+, =) into a single prod &245 with canon, 120, elec, camera, and c.jpg.]

• URLs
• ID/IDrefs
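With persistent node IDs, two answers that mention the same node can be glued together on that ID. The following Python sketch illustrates the idea under stated assumptions: the dictionary layout and the `merge` function are hypothetical, and every ID other than &245 is invented for the example.

```python
def merge(a, b):
    """Merge two partial views of the same node, matching children by id."""
    if a["id"] != b["id"]:
        raise ValueError("cannot merge nodes with different ids")
    value = a.get("value") if a.get("value") is not None else b.get("value")
    by_id = {child["id"]: child for child in a.get("children", [])}
    for child in b.get("children", []):
        if child["id"] in by_id:
            by_id[child["id"]] = merge(by_id[child["id"]], child)
        else:
            by_id[child["id"]] = child
    return {"id": a["id"], "label": a["label"], "value": value,
            "children": list(by_id.values())}


def leaf(node_id, label, value):
    return {"id": node_id, "label": label, "value": value, "children": []}


# View 1: values canon / 120 / elec / camera (child ids are made up).
view1 = {"id": "&245", "label": "prod", "value": None, "children": [
    leaf("&246", "name", "canon"), leaf("&247", "price", "120"),
    leaf("&248", "cat", "elec"), leaf("&249", "subcategory", "camera")]}

# View 2: the same prod node, seen with its name and picture.
view2 = {"id": "&245", "label": "prod", "value": None, "children": [
    leaf("&246", "name", "canon"), leaf("&250", "picture", "c.jpg")]}

merged = merge(view1, view2)
print(sorted(child["label"] for child in merged["children"]))
```

Without IDs, the merge could not tell whether the two views describe one product or two, which is the point the slide title makes.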

Representation of incomplete information:

Incomplete trees

Document Type Definitions (DTDs) are used to represent incompleteness

• Set of rules: e → r
  • e: element name
  • r: regular expression
• Set of trees satisfying a DTD d: tree(d)
• Shortcoming of DTDs
  • An element has a single definition, independent of the context
  • Here the type of ad depends on the context

[Figure: dealer has children newcar and usedcar, each with an ad child. One ad has children model and year; the other has only model.]

Solution: specialization (decoupled tags)

• Specialized tags adused and adnew, with h(adused) = h(adnew) = ad

[Figure: on top, the specialized tree: dealer has children newcar and usedcar, with adnew under newcar and adused under usedcar, and model and year below the ads. Below, its image under h: the same tree with both specialized tags renamed to ad.]

DTDs + Specialization

The sets of trees that can be specified are the regular unranked tree languages [Brüggemann-Klein, Murata, Wood]

• Same closure properties: intersection, union, complement

• Same complexity

Example

Q1: name, subcategory, and price of electronic products with price less than $200

Q2: name and pictures of cameras with at least one picture

--------------------------------------------------------

Q3: name, price, and pictures of cameras costing less than $100 and with at least one picture
can be completely answered using A1, A2

Q4: list all cameras
can be partially answered using A1, A2

Q1: name, subcategory, and price of electronic products with price less than 200

[Figure: the incomplete tree after Q1. Known data under catalog: product (canon, 120, elec, camera), product (nikon, 199, elec, camera), and product (sony, 175, elec, cdplayer). Missing data: product1 and product2, each marked with *.]

Missing data after Q1

[Figure: the types of the missing products. product1 and product2 each have children name, price, cat, picture, and subcategory, with one child marked *. product1 carries the condition cat != elec; product2 carries the conditions cat = elec and price > 200.]

Q2: name and pictures of cameras with at least one picture

[Figure: the incomplete tree after Q2. Known data under catalog: product (canon, 120, elec, camera, c.jpg), product (nikon, 199, elec, camera), product (sony, 175, elec, cdplayer), and a new product (akai, a.jpg, elec, camera). Missing data: product1, product2a, product2b, and product2c, with * multiplicities.]

Incomplete information

• Known information
  • Prefix of the real data tree
• Missing information
  • Extended tree type
  • Conditions on data values
  • Specializations, disjunctions

[Figure: the incomplete tree after Q1 and Q2, split into known data and missing data (product +). The missing data is described by specialized product types (product1, product2a, product2b, product2c, product3) over the children name, price, cat, picture, and subcategory. The types are annotated with conditions on data values: cat != elec; cat = elec and price > 200; subcategory != camera; no picture.]

Answering Queries

Complete answer to Q3

• Q3: name, price, and pictures of cameras costing less than $150 and having at least one picture
• Can be fully answered using available information
• Need to check whether the answer is complete

[Figure: the answer tree. catalog has one prod child with values canon, 120, and c.jpg.]

Incomplete answer to Q4

• Provide known cameras
• Explain incompleteness

[Figure: the answer shows the names canon, nikon, sony, and akai, plus a node for more products with child name, under the conditions price > 200 and no picture.]

Completing answer to Q4

• It suffices to ask:

[Figure: a query on product with children name, price (> 200), cat (= elec), subcategory (= camera), and picture with multiplicity 0.]

Revisit the types

• DTD
• Conditions
• Specialization: same element name may have several types
• Not sufficient
• Need to extend the types again: disjunctions

[Figure: type product2b with children name, price (> 200), cat (= elec), picture (*), and subcategory (!= camera).]

Disjunction

[Figure: a type over the elements vehicle, data, description, engine, and sail, with two elements marked optional (?). Query1’ asks for vehicle, data, and description and returns node &322 with data = “….” and description = “….”. Query2’ asks for vehicle and data together with engine and sail; its answer is empty.]

Disjunction continued

• Type of &322: vehicle1 + vehicle2

[Figure: vehicle1 is drawn with data, engine, and description; vehicle2 is drawn with data, description, and sail.]

The type of &322 cannot be described independently of that of the data node below it

Results

Representation System: Lipski’s + Imielinski’s

[Figure: a commuting diagram. T, the representation of information, is mapped by rep to rep(T), the set of possible worlds. Applying the query q to T gives q(T), the representation of the result; applying q to rep(T) gives the set of possible answers. The two paths agree:]

q(rep(T)) = rep(q(T))

Representation System for PS-queries

• Incomplete tree T to represent $q_1^{-1}(A_1) \cap \dots \cap q_k^{-1}(A_k)$
• PS-query q
• q(T) can be computed in ptime (the representation of the answer can be computed in ptime)

Querying Incomplete Trees

• Given T and a query q, one can
  • Give in ptime the sure answers up to our current knowledge
  • Check in ptime whether query q can be fully answered
  • Generate in ptime queries to complete the answer


Comparison with IL

Relational model

• Relational calculus/algebra

• Conditional table

• Closed or open world

• Representation system

XML tree model

• Weaker language (no join)

• Weaker system (no variable)

+ Closed and open World

• Representation system

Drawback: exponential blowup

• Incomplete information may become exponential w.r.t. the sequence of query/answer pairs q1/A1; q2/A2; …

[Figure: Type: database with children a and b, each with multiplicity 1. Queries qi: database with a = i and b = i. Answers are empty.]

Dealing with exponential blowup

• Make the representation more complex using disjunctions of types
  • Size of representation stays polynomial
  • Manipulations much more complex
• Restrict tree types and PS-queries
  • Already very (too?) simple
• Accept to lose some information
• Ask extra queries to simplify the representation

Discussion

Discussion: extend language

• Some results in paper
• Extensions often lead to intractability
• E.g.: k-pebble transducers [Milo, Suciu, Vianu], which somehow subsume XML-QL and XSL
  • No (known) representation system
  • Testing whether rep(T) is empty is non-elementary

Discussion: node Ids

Without node Ids
• much less information to integrate results
• more complex
• tedious case analysis

Discussion: ordering

• Ordering in XML, DTD, queries
  • Problem is totally different and very complex
• Example:
  • Q1/A1: list of males; Q2/A2: list of females; Q3: list all
• Depending on the type of input
  • (Male)*(Female)*: A3 = A1 || A2
  • (Male Female)*: A3 = shuffle(A1,A2)
  • (Male + Female)*: we cannot answer Q3
• Regular expression processing
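The three cases of the ordering example can be sketched as follows. This is a minimal Python illustration, not part of the original slides: the function name `answer_all` and the string keys are hypothetical, and `shuffle` is interpreted here as strict alternation of A1 and A2, which assumes both answers are in document order and have the same length.

```python
from itertools import chain


def answer_all(input_type, males, females):
    """Combine A1 (males) and A2 (females) into A3, if the type allows it."""
    if input_type == "(Male)*(Female)*":
        # All males come first, so A3 is the concatenation A1 || A2.
        return males + females
    if input_type == "(Male Female)*":
        # Males and females alternate, so A3 interleaves A1 and A2.
        if len(males) != len(females):
            raise ValueError("answers are inconsistent with (Male Female)*")
        return list(chain.from_iterable(zip(males, females)))
    if input_type == "(Male + Female)*":
        # Any interleaving is possible: the order cannot be recovered.
        return None
    raise ValueError("unknown input type")


a1 = ["m1", "m2"]
a2 = ["f1", "f2"]
print(answer_all("(Male)*(Female)*", a1, a2))
print(answer_all("(Male Female)*", a1, a2))
print(answer_all("(Male + Female)*", a1, a2))
```

Returning `None` in the last case stands for the slide's point that the cached answers do not determine the order of the full list.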

Conclusion

• Framework for acquiring, maintaining, querying incomplete XML data
• Limitations:
  • simple queries
  • no order and Id assumption
  • small extensions lead to problems
• Possible to represent the incompleteness
• Possible to answer with incompleteness
• Possible to obtain queries to provide full answer
