answering queries using views: a survey - columbia universitygravano/qual/papers/halevy01.pdf ·...

25
The VLDB Journal 10: 270–294 (2001) / Digital Object Identifier (DOI) 10.1007/s007780100054 Answering queries using views: A survey A.Y. Halevy Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195; E-mail: [email protected] Edited by M. Carey. Received: 1 August 1999 / Accepted: 23 March 2001 Published online: 6 September 2001 – c Springer-Verlag 2001 Abstract. The problem of answering queries using views is to find efficient methods of answering a query using a set of previ- ously defined materialized views over the database, rather than accessing the database relations. The problem has recently re- ceived significant attention because of its relevance to a wide variety of data management problems. In query optimization, finding a rewriting of a query using a set of materialized views can yield a more efficient query execution plan. To support the separation of the logical and physical views of data, a storage schema can be described using views over the logical schema. As a result, finding a query execution plan that accesses the storage amounts to solving the problem of answering queries using views. Finally, the problem arises in data integration systems, where data sources can be described as precomputed views over a mediated schema. This article surveys the state of the art on the problem of answering queries using views, and synthesizes the disparate works into a coherent frame- work. We describe the different applications of the problem, the algorithms proposed to solve it and the relevant theoretical results. Keywords: Materialized views – Data integration – Query optimization – Survey – Date warehousing – Web-site man- agement 1 Introduction The problem of answering queries using views (also known as rewriting queries using views) has recently received sig- nificant attention because of its relevance to a wide variety of data management problems: query optimization, maintenance of physical data independence, data integration and data ware- house design. Informally speaking, the problem is the follow- ing. Suppose we are given a query Q over a database schema, and a set of view definitions V 1 ,...,V n over the same schema. Is it possible to answer the query Q using only the answers to the views V 1 ,...,V n ?Alternatively, what is the maximal set of tuples in the answer of Q that we can obtain from the views? If we can access both the views and the database relations, what is the cheapest query execution plan for answering Q? The first class of applications in which we encounter the problem of answering queries using views is query optimiza- tion and database design. In the context of query optimiza- tion, computing a query using previously materialized views can speed up query processing because part of the computa- tion necessary for the query may have already been done while computing the views. Such savings are especially significant in decision support applications when the views and queries con- tain grouping and aggregation. Furthermore, in some cases, certain indices can be modeled as precomputed views (e.g., join indices [Val87]), 1 and deciding which indices to use re- quires a solution to the query rewriting problem. In the context of database design, view definitions provide a mechanism for supporting the independence of the physical view of the data and its logical view. This independence enables us to modify the storage schema of the data (i.e., the physical view) with- out changing its logical schema, and to model more complex types of indices. Hence, several authors describe the storage schema as a set of views over the logical schema [YL87,TSI96, Flo96]. Given these descriptions of the storage, the problem of computing a query execution plan (which, of course, must access the physical storage) involves figuring out how to use the views to answer the query. A second class of applications in which our problem arises is data integration. Data integration systems provide a uniform query interface to a multitude of autonomous data sources, which may reside within an enterprise or on the World-Wide Web. Data integration systems free the user from having to locate sources relevant to a query, interact with each one in isolation, and manually combine data from multiple sources. Users of data integration systems do not pose queries in terms of the schemas in which the data is stored, but rather in terms of a mediated schema. The mediated schema is a set of rela- tions that is designed for a specific data integration application, and contains the salient aspects of the domain under consid- eration. The tuples of the mediated schema relations are not actually stored in the data integration system. Instead, the sys- tem includes a set of source descriptions that provide semantic mappings between the relations in the source schemas and the relations in the mediated schema. 1 Strictly speaking, to model join indices we need to extend the logical model to refer to row IDs.

Upload: others

Post on 24-Aug-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

The VLDB Journal 10: 270–294 (2001) / Digital Object Identifier (DOI) 10.1007/s007780100054

Answering queries using views: A survey

A.Y. Halevy

Department of Computer Science and Engineering, University of Washington, Seattle, WA, 98195; E-mail: [email protected]

Edited by M. Carey. Received: 1 August 1999 / Accepted: 23 March 2001Published online: 6 September 2001 –c© Springer-Verlag 2001

Abstract. Theproblemof answering queries using views is tofindefficientmethodsof answeringaqueryusingaset of previ-ously definedmaterialized viewsover the database, rather thanaccessing the database relations. The problem has recently re-ceived significant attention because of its relevance to a widevariety of data management problems. In query optimization,finding a rewriting of a query using a set of materialized viewscan yield amore efficient query execution plan. To support theseparation of the logical and physical views of data, a storageschema can be described using views over the logical schema.As a result, finding a query execution plan that accesses thestorage amounts to solving the problem of answering queriesusing views. Finally, the problem arises in data integrationsystems, where data sources can be described as precomputedviews over a mediated schema. This article surveys the stateof the art on the problem of answering queries using views,and synthesizes the disparate works into a coherent frame-work. We describe the different applications of the problem,the algorithms proposed to solve it and the relevant theoreticalresults.

Keywords: Materialized views – Data integration – Queryoptimization – Survey – Date warehousing – Web-site man-agement

1 Introduction

The problem of answering queries using views (also knownas rewriting queries using views) has recently received sig-nificant attention because of its relevance to a wide variety ofdatamanagement problems: query optimization, maintenanceof physical data independence, data integration and data ware-house design. Informally speaking, the problem is the follow-ing. Suppose we are given a queryQ over a database schema,and a set of view definitionsV1, . . . , Vn over the same schema.Is it possible to answer the queryQ usingonly the answers tothe viewsV1, . . . , Vn?Alternatively,what is themaximal set oftuples in the answer ofQ that we can obtain from the views? Ifwe can access both the views and the database relations, whatis the cheapest query execution plan for answeringQ?

The first class of applications in which we encounter theproblem of answering queries using views is query optimiza-tion and database design. In the context of query optimiza-tion, computing a query using previously materialized viewscan speed up query processing because part of the computa-tion necessary for the querymay have already been donewhilecomputing theviews.Suchsavingsareespecially significant indecision support applicationswhen the views and queries con-tain grouping and aggregation. Furthermore, in some cases,certain indices can be modeled as precomputed views (e.g.,join indices [Val87]),1 and deciding which indices to use re-quires a solution to the query rewriting problem. In the contextof database design, view definitions provide a mechanism forsupporting the independence of thephysicalview of the dataand itslogical view. This independence enables us to modifythe storage schema of the data (i.e., the physical view) with-out changing its logical schema, and to model more complextypes of indices. Hence, several authors describe the storageschemaasaset of viewsover the logical schema [YL87,TSI96,Flo96]. Given these descriptions of the storage, the problemof computing a query execution plan (which, of course, mustaccess the physical storage) involves figuring out how to usethe views to answer the query.

A second class of applications in which our problem arisesis data integration. Data integration systems provide a uniformquery interface to a multitude of autonomous data sources,which may reside within an enterprise or on the World-WideWeb. Data integration systems free the user from having tolocate sources relevant to a query, interact with each one inisolation, and manually combine data from multiple sources.Users of data integration systems do not pose queries in termsof the schemas in which the data is stored, but rather in termsof amediated schema. The mediated schema is a set of rela-tions that is designed for a specificdata integrationapplication,and contains the salient aspects of the domain under consid-eration. The tuples of the mediated schema relations are notactually stored in the data integration system. Instead, the sys-tem includes a set ofsource descriptionsthat provide semanticmappings between the relations in the source schemas and therelations in the mediated schema.

1 Strictly speaking, to model join indices we need to extend thelogical model to refer to row IDs.

Page 2: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 271

The data integration systems described in [LRO96b,DG97b,KW96,LKG99] follow an approach in which the con-tents of the sources are described as views over the mediatedschema.As a result, the problemof reformulating a user query,posed over the mediated schema, into a query that refers di-rectly to the source schemas becomes the problem of answer-ing queries using views. In a sense, the data integration contextcan be viewed as an extreme case of the need tomaintain phys-ical data independence, where the logical and physical layoutof the data sources has been defined in advance. The solutionsto the problem of answering queries using views differ in thiscontext because the number of views (i.e., sources) tends tobe much larger, and the sources need not contain thecompleteextensions of the views.

In the area of data warehouse design we need to choose aset of views (and indexes on the views) to materialize in thewarehouse [HRU96,TS97,YKL97,GHRU97,ACN00,CG00].Similarly, in web-site design, the performance of a web sitecan be significantly improved by choosing a set of views tomaterialize [FLSY99]. In both of these problems, the first stepin determining the utility of a choice of views is to ensure thatthe views are sufficient for answering the queries we expect toreceive over the data warehouse or the web site. This problem,again, translates into the view rewriting problem.

Finally, answering queries using views plays a key role indevelopingmethods for semantic data caching in client-serversystems [DFJ+96,KB96,CR94,ACPS96]. In these works, thedata cached at the client is modeled semantically as a set ofqueries, rather than at the physical level as a set of data pagesor tuples. Hence, deciding which data needs to be shippedfrom the server in order to answer a given query requires ananalysis of which parts of the query can be answered by thecached views.

The many applications of the problem of answeringqueries using views has spurred a flurry of research, rangingfrom theoretical foundations to algorithm design and imple-mentation in several commercial systems. This article surveysthe current state of the art in this area, and classifies the worksinto a coherent framework based on a set of dimensions alongwhich the treatments of the problem differ.

The treatments of the problem differ mainly dependingon whether they are concerned with query optimization anddatabase design or with data integration. In the case of queryoptimization and database design, the focus has been on pro-ducing a query execution plan that involves the views, andhence the effort has been on extending query optimizers toaccommodate the presence of views. In this context, it is nec-essary that rewriting of the query using the views be anequiv-alent rewriting in order for the query execution plan to becorrect. It is important to note that some of the views includedin the query plan may not contribute to the logical correctnessof the plan, but only to reducing the plan’s cost.

In the data integration context, the focus has been on trans-lating queries formulated in terms of a mediated schema intoqueries formulated in terms of data sources. Hence, the outputof the algorithm is a query expression, rather than a query ex-ecution plan. Because the data sources may not entirely coverthe domain, we sometimes need to settle for acontainedqueryrewriting, rather than an equivalent one. A contained queryrewriting provides a subset of the answer to the query, butperhaps not the entire answer. In addition, the works on data

integration distinguished between the case in which the indi-vidual views are complete (i.e., contain all the tuples in theirdefinition) and the case where they may be incomplete (as iscommon when modeling autonomous data sources). Further-more, the works on data integration distinguished the transla-tion problem from the more general problem of finding all theanswers to a query given the data in the sources, and showedthat the two problems differ in interesting ways.

The survey is organized as follows. Section 2 presents inmore detail the applications motivating the study of the prob-lem and the dimensions along which we can study the prob-lem. Section 3 defines the problem formally. As a basis forthe discussion of the different algorithms, Sect.4 provides anintuitive explanation of the conditions under which a view canbe used to answer a query. Section 5 describes how materi-alized views have been incorporated into query optimization.Section 6 describes algorithms for answering queries usingviews that were developed in the context of data integration.Section 7 surveys some theoretical issues concerning the prob-lem of answering queries using views, and Sect.8 discussesseveral extensions to the algorithms in Sects. 5 and 6 to ac-commodatequeriesoverobject-orienteddatabasesandquerieswith access-pattern limitations. Finally, Sect.9 concludes, andoutlines some of the open problems in this area.

We note that this survey is not concerned with the closelyrelated problems of incremental maintenance of material-ized views, which is surveyed in [GM99b], selection ofwhich views to maintain in a data warehouse [HRU96,TS97,GHRU97,Gup97b,YKL97,GM99c,CG00,CHS01] or auto-mated selection of indexes [CN98b,CN98a].

2 Motivation and illustrative examples

Before beginning the detailed technical discussion, we moti-vate the problem of answering queries using views throughsome of its applications. In particular, this section serves toillustrate the wide and seemingly disparate range of applica-tions of the problem. We end the section by classifying thedifferent works on the topic into a taxonomy.

We use the following familiar university schema in our ex-amples throughout the paper. We assume that professors, stu-dents, and departments are uniquely identified by their names,and courses are uniquely identified by their numbers. TheRegistered relation describes the students’ registration inclasses, while theMajor relation describes in which depart-ment aparticular student ismajoring (weassume for simplicitythat every department has a single major program).

Prof(name, area)Course(c-number, title)Teaches(prof, c-number, quarter, evaluation)Registered(student, c-number, quarter)Major(student, dept)WorksIn(prof, dept)Advises(prof, student).

2.1 Query optimization

The first and most obvious motivation for considering theproblem of answering queries using views is for query opti-mization. If part of the computation needed to answer a query

Page 3: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

272 A.Y. Halevy: Answering queries using views: A survey

Fig. 1. An entity/relationship diagram for the university domain.Note thatquarter is an attribute of the relationshipsregistered andteaches

has already been performed in computing amaterialized view,then we can use the view to speed up the computation of thequery.

Consider the following query, asking for students andcourse titles for students who registered in Ph.D-level classestaught by professors in theDatabase area (in our example, uni-versity graduate-level classes have numbers of 400 and above,and Ph.D-level courses numbers of 500 and above):

select Registered.student, Course.titlefrom Teaches, Prof, Registered, Coursewhere Prof.name=Teaches.prof and

Teaches.c-number =Registered.c-number andTeaches.quarter=Registered.quarter andRegistered.c-number=Course.c-number andCourse.c-number ≥ 500 and Prof.area="DB".

Suppose we have the following materialized view, con-taining the registration records of graduate-level courses andabove.

create view Graduate asselect Registered.student, Course.title, Course.c-number,

Registered.quarterfrom Registered, Coursewhere Registered.c-number=Course.c-number and

Course.c-number ≥ 400.

The viewGraduate can be used in the computation of theabove query as follows:

select Graduate.student, Graduate.titlefrom Teaches, Prof, Graduatewhere Prof.name=Teaches.prof and

Teaches.c-number=Graduate.c-number andTeaches.quarter=Graduate.quarter andGraduate.c-number ≥ 500 and Prof.area="DB".

The resulting evaluation will be cheaper because the viewGraduate has already performed the join betweenRegis-tered andCourse, and has already pruned the non-graduatecourses (the courses that actually account for most of the ac-tivity going on in a typical university). It is important to notethat the viewGraduate is useful for answering the query eventhough it does notsyntacticallymatch any of the subparts ofthe query.

Even if a view has already computed part of the query,it is not necessarily the case that using the view will lead toa more efficient evaluation plan, especially considering theindexes available on the database relations and on the views.

For example, suppose the relationsCourse andRegisteredhave indexes on thec-number attribute. In this case, if theview Graduate does not have any indexes, then evaluatingthe query directly from the database relations may be cheaper.Hence, the challenge is not only to detect when a view islogically usable for answering a query, but also to make ajudicious cost-based decision on when to use the availableviews.

2.2 Maintaining physical data independence

Several works on answering queries using viewswere inspiredby the goal of maintaining physical data independence in re-lational and object-oriented databases [YL87,TSI96,Flo96].One of the principles underlying modern database systems isthe separation between the logical view of the data (e.g., astables with their named attributes) and the physical view ofthe data (i.e., how it is laid out on disk). With the exceptionof horizontal or vertical partitioning of relations into multiplefiles, relational database systems are still largely based on a1-1 correspondence between relations in the schema and filesin which they are stored. In object-oriented systems, main-taining the separation is necessary because the logical schemacontains significant redundancy, and does not correspond toa good physical layout. Maintaining physical data indepen-dence becomes more crucial in applications where the logicalmodel is introduced as an intermediate level after the physi-cal representation has already been determined. This is com-mon in applications of semi-structured data [Bun97,Abi97,FLM98], storage of XML data in relational databases [FK99,SGT+99,DFS99,TIHW01], and in data integration. In fact,the STORED System [DFS99] stores XML documents in arelational database, and uses views to describe the mappingfrom XML into relations in the database. In some sense, dataintegration, discussed in the next section, is an extreme casewhere there is a separation between the logical view of thedata and its physical view.

To maintain physical data independence, several authorsproposed to use views as a mechanism for describing the stor-age of the data. In particular, [TSI96] described the storage ofthe data using GMAPs(generalized multi-level access paths),expressed over the conceptual model of the database.

To illustrate, consider the entity-relationship model of aslightly extended university domain shown in Fig.1. Figure 2shows GMAPs expressing the different storage structures forthis data.

A GMAP describes the physical organization and indexesof the storage structure. The first clause of the GMAP (theas clause) describes the actual data structure used to store aset of tuples (e.g., a B+-tree, hash index, etc.) The remainingclauses describe the content of the structure, much like a viewdefinition.Thegivenandselect clauses describe the availableattributes, where thegiven clause describes the attributes onwhich thestructure is indexed.Thedefinitionof theview,givenin thewhere clause uses infix notation over the conceptualmodel.

In our example, the GMAP G1 stores a set of pairs con-taining students and the departments in which they major,and these pairs are indexed by a B+-tree on attributeStu-dent.name. The GMAP G2 stores an index from the names

Page 4: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 273

def gmap G1 as b+-tree bygiven Student.nameselect Departmentwhere Student major Department.

def gmap G2 as b+-tree bygiven Student.nameselect Course.c-numberwhere Student registered Course.

def gmap G3 as b+-tree bygiven Course.c-numberselect Departmentwhere Student registered Course andStudent major Department.

Fig. 2.GMAPs for the university domain

of students to the numbers of the courses in which they areregistered. The GMAP G3 stores an index from course num-bers to departments whose majors are enrolled in the course.As shown in [TSI96], using GMAPs it is possible to express alarge family of data structures, including secondary indexes onrelations, nested indexes, collection-based indexes, and struc-tures implementing field replication.

Given that the data is stored in the structures describedby the GMAPs, the question that arises is how to use thesestructures to answer queries. Since the logical content of theGMAPsare described by views, answering a query amounts tofinding away of rewriting the query using these views. If thereare multiple ways of answering the query using the views, wewould like to find the cheapest one. Note that in contrast to thequery optimization context, wemustuse the views to answera given query, because all the data is stored in the GMAPs,

Consider the following query in our domain, which asksfor names of students registered for Ph.D-level courses andthe departments in which these students are majoring.

select Student.name, Departmentwhere Student registered Course and

Student major Department andCourse.c-number≥500.

The query can be answered in two ways. First, sinceStu-dent.name uniquely identifies a student, we can take thejoin of G1 and G2, and then apply a selectionCourse.c-number≥500, and a projection onStudent.name andDe-partment. A second solution would be to join G3 with G2and selectCourse.c-number≥500. In fact, this solutionmayeven be more efficient because G3 has an index on the coursenumber and therefore the intermediate joins may be muchsmaller.

2.3 Data integration

Much of the recent work on answering queries using views hasbeen spurred because of its applicability to data integrationsystems. A data integration system (a.k.a. a mediator system[Wie92]) provides auniform query interface to a multitudeof autonomous heterogeneous data sources. Prime examplesof data integration applications include enterprise integration,

querying multiple sources on the World-Wide Web, and in-tegration of data from distributed scientific experiments. Thesources in such an application may be traditional databases,legacy systems, or even structured files. The goal of a data in-tegration system is to free the user from having to find the datasources relevant to a query, interact with each source in isola-tion, and manually combine data from the different sources.

To provide a uniform interface, a data integration systemexposes to the user amediated schema. A mediated schema isa set ofvirtual relations, in the sense that they are not actuallystored anywhere. The mediated schema is designed manuallyfor a particular data integration application. To be able to an-swer queries, the system must also contain a set ofsource de-scriptions.Adescription of a data source specifies the contentsof the source, the attributes that can be found in the source,and the constraints on the contents of the source.

One of the approaches for specifying source descriptions,which has been adopted in several systems ([LRO96b,KW96,FW97,DG97b,LKG99]), is to describe the contents of a datasource as aview over the mediated schema. This approachfacilitates the addition of new data sources and the specifica-tion of constraints on contents of sources (see [Ull97,FLM98,Lev00] for a comparison of different approaches for specify-ing source descriptions).

In order to answer a query, a data integration system needsto translate a query formulated on the mediated schema intoone that refers directly to the schemas in the data sources.Since the contents of the data sources are described as views,the translation problem amounts to finding a way to answer aquery using a set of views.

We illustrate the problem with the following example,where the mediated schema exposed to the user is our univer-sity schema, except that the relationsTeaches andCoursehave an additional attribute identifying the university at whicha course is being taught:

Teaches(prof, c-number, quarter, evaluation, univ)Course(c-number, title, univ)

Suppose we have the following two data sources. The firstsource provides a listing of all the courses titled “DatabaseSystems” taught anywhere and their instructors. This sourcecan be described by the following view definition:

create view DB-courses asselect Course.title, Teaches.prof, Course.c-number,

Course.univfrom Teaches, Coursewhere Teaches.c-number=Course.c-number and

Teaches.univ=Course.univ andCourse.title=“Database Systems”.

The second source lists Ph.D-level courses being taught atthe University of Washington (UW), and is described by thefollowing view definition:

create view UW-phd-courses asselect Course.title, Teaches.prof, Course.c-number,

Course.univfrom Teaches, Coursewhere Teaches.c-number=Course.c-number and

Course.univ=“UW” andTeaches.univ=“UW” andCourse.c-number≥500.

If we were to ask the data integration system who teachescourses titled “Database Systems” at UW, it would be able to

Page 5: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

274 A.Y. Halevy: Answering queries using views: A survey

answer the query by applying a selection on the sourceDB-courses:

select proffrom DB-courseswhere univ=“UW”.

On the other hand, suppose we ask for all the graduate-level courses (not just indatabases)beingofferedatUW.Giventhat only these two sources are available, the data integrationsystem cannot findall tuples in the answer to the query. In-stead, the system can attempt to find the maximal set of tuplesin the answer that are available from the sources. In particu-lar, the system can obtain graduatedatabasecourses at UWfrom theDB-courses source, and the Ph.D-level courses atUW from theUW-Phd-courses source. Hence, the follow-ing query provides the maximal set of answers that can beobtained from the two sources:

select title, c-numberfrom DB-courseswhere univ=“UW” and c-number≥400

UNIONselect title, c-numberfrom UW-phd-courses.

Note that courses that are not Ph.D-level courses ordatabase courses will not be returned as answers. Whereas inthe contexts of query optimization and maintaining physicaldata independence the focus is on finding a query expressionthat isequivalentto the original query, here we attempt to finda query expression that provides themaximal answersfromthe views. We formalize both of these notions in Sect.3.

Other applications:Before proceeding, we also note that theproblem of answering queries using views arises in the designof data warehouses (e.g., [HRU96,TS97,GHRU97,YKL97])and in semantic data caching. In data warehouse design, whenwechoosea set of views tomaterialize in a datawarehouse,weneed to check that we will be able to answer all the requiredqueries over the warehouse using only these views. In thecontext of semantic data caching (e.g., [DFJ+96,KB96,CR94,ACPS96]) we need to check whether the cached results of apreviously computed query can be used for a new query, orwhether the client needs to request additional data from theserver. In [FLSY99,YFIV00] it is shown that precomputingviews can significantly speed up the response time from websites, which again raises the question of view selection.

2.4 A taxonomy of the field

As illustrated by the examples, there are several dimensionsalong which we can classify the treatments of the problemof answering queries using views. In this section we describea taxonomy for classifying the different works on this prob-lem, and highlight the main differences between the problemtreatments. Figure 3 shows the taxonomy and some of therepresentative works belonging to each of its classes.

The most significant distinction between the differentworks is whether their goal is data integration or whether it

is query optimization and maintenance of physical data inde-pendence. The key difference between these two classes ofworks is the output of the algorithm for answering queries us-ing views. In the former case, given a queryQ, and a set ofviewsV, the goal of the algorithm is to produce an expressionQ′ that references the views and is either equivalent to or con-tained inQ. In the latter case, the algorithm must go furtherand produce a (hopefully optimal) query execution plan foransweringQ using the views (and possibly the database rela-tions). Here the rewriting must be an equivalent toQ in orderto ensure the correctness of the plan.

The similarity between these two bodies of work is thatthey are concerned with the core issue of whether a rewritingof a query is equivalent to or contained in the query. How-ever, while logical correctness suffices for the data integrationcontext, it does not in the query optimization context wherewe also need to find thecheapestplan using the views. Thecomplication arises because the optimization algorithms needto consider views that do not contribute to thelogical correct-ness of the rewriting, but do reduce the cost of the resultingplan. Hence, while the reasoning underlying the algorithms inthe data integration context is mostly logical, in the query op-timization case it is both logical and cost-based. On the otherhand, an aspect stressed in the data integration context is theimportance of dealing with a large number of views, whichcorrespond to data sources. In the context of query optimiza-tion it is generally assumed (not always!) that the number ofviews is roughly comparable to the size of the schema.

The works on query optimization can be classified intoSystem-R style optimizers and transformational optimizers.The initial works incorporated views into System-R style joinenumeration,while laterworks that attempt todealwithamoreextended subset of SQL realized that the power of rewritingrules is required in order to incorporate views.

The main line of work on data integration attempted todevelop algorithms for answering queries using views thatscale up to a large number of views2. A second line of workstarted considering different properties of the data sources.For example, it was shown that if data sources are assumed tobe complete (i.e., they include all the tuples that satisfy theirdefinition), then the problemof answering queries using viewsbecomescomputationally harder. Intuitively, the reason for theadded complexity is that when sources are complete, we canalso infer negative information as a result of a query to thesource. This led to asking the following more basic question:given a queryQ, a set of viewsV and their extensions, whatis the complexity of finding the maximal set of tuples in theanswer toQ from V.3 This work established an interestingconnection between the problem of answering queries usingviews and query answering in conditional tables [IL84]. Inthese works, a major factor affecting the complexity of theproblem is whether the view extensions are assumed to becomplete or not (when they are complete, the complexity ishigher). Note that in the context of query optimization, theviews are always assumed to be complete.

2 Strictly speaking, the motivation for the work of [YL87] was themaintenance of physical data independence, but their algorithm hasmore similarities with the data integration algorithms.

3 Some authors refer to the distinction between the two problemsas therewriting problem versus thequery answeringproblem.

Page 6: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 275

Answering queries using views✘✘✘✘✘✘✘✘✘✘

Cost-based rewriting

(query optimization and physical data independence)

Logical rewriting

(data integration)✘✘✘✘✘✘✘

✘✘✘✘✘✘✘

System-R style Query answering algorithms

(complete or incomplete sources)

Transformational approaches Rewriting algorithms

[YL87,LMSS95][Qia96,LRO96b][DG97a,PL00]

[AD98,GM99a,CGLV00a]

[FRV96,BDD+98][DPT99,ZCL+00,GL01]

[CKPS95,TSI96,PH01]

Fig. 3.A taxonomy of work on answering queries using views. Themain distinction is betweenworks on query optimization andmaintenance ofphysical data independence andworks considering logical rewritings,mostly in the context of data integration. Theworks on query optimizationhave considered both System-R style algorithms and transformation-based algorithms. The works on data integration considered algorithmsthat scale to a large number of views, and the question of finding all the answers to the query, given the view extensions

Table 1.Extensions to query and view languages

Extension Relevant works

Grouping and aggregation [GHQ95,SDJL96,CNS99,GRT99,ZCL+00,GT00] (Sect.5.3)

Bag semantics [CKPS95,ZCL+00] (Sect.5.3)

OQL [FRV96,DPT99] (Sect.8.1)

Multi-block queries [ZCL+00] (Sect.5.2)

Integrity constraints [DL97,Gry98,ZCL+00,DPT99] (Sect.7.2)

Access-pattern limitations [RSU95,KW96,DL97] (Sect.8.2)

Unions in the views [AGK99,Dus98] (Sect.8.3)

Queries over semi-structured data[CGLV99,PV99] (Sect.8.3)

Hierarchies in Description Logics [BLR97,CGL99] (Sect.8.3)

Languages for querying schema [Mil98] (Sect.8.3)

A separate dimension for classifying the different works isthe specific language used for expressing views and queries.Much of the early work on the problem focused on select-project-join queries, but, as shown inTable 1,many extensionshave been considered as well. The works on query optimiza-tion have considered extensions of interest to SQL engines,such as grouping and aggregation and the presence of certainintegrity constraints on the database relations. For obviousreasons, these works have also considered the implicationsof bag semantics on the rewriting problem. The data integra-tion works have considered extensions such as access-patternlimitation to the views, recursive queries, path expressions inthe queries, and integrity constraints expressed in descriptionlogics.

3 Problem definition

In this section we define the basic terminology used through-out this paper. We define the concepts of query containmentand query equivalence that provide a semantic basis for com-paring between queries and their rewritings, and then definethe problem of answering queries using views. Finally, we de-fine the problem of extractingall the answers to a query froma set of views (referred to as the set of certain answers).

The bulk of our discussionwill focus on the class of select-project-join queries on relational databases.A view is a named

query. It is said to bematerialized if its results are stored in thedatabase.A database instance is an assignment of an extension(i.e., a set of tuples) to each of the relations in the database.

Weassume the reader is familiar with the basic elements ofSQL. We will distinguish between queries that involve arith-metic comparison predicates (e.g.,≤, <, �=) and those that donot. Our discussion of answering queries using views in thecontext of data integration systems will require consideringrecursive datalog queries.We recall the basic concepts of dat-alog in Sect.6.

In our discussion, we denote the result of computing thequeryQ over the databaseD by Q(D). We often refer toqueries that reference named views (e.g., in query rewritings).In that case,Q(D) refers to the result of computingQ afterthe views have been computed fromD.

3.1 Containment and equivalence

The notions of query containment and query equivalence en-able comparison between different reformulations of queries.They will be used when we test the correctness of a rewritingof a query in terms of a set of views. In the definitions belowwe assume the answers to queries are sets of tuples. The def-initions can be extended in a straightforward fashion to bagsemantics. In the context of our discussion it is important tonote that the definitions below also apply to queries that mayreference named views.

Page 7: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

276 A.Y. Halevy: Answering queries using views: A survey

Definition 1. Query containment and equivalence: A queryQ1 is said to be contained in a queryQ2, denotedbyQ1 � Q2,if for all database instancesD, the set of tuples computedfor Q1 is a subset of those computed forQ2, i.e.,Q1(D) ⊆Q2(D). The two queries are said to be equivalent ifQ1 � Q2andQ2 � Q1.

The problems of query containment and equivalencehave been studied extensively in the literature and shouldbe a topic of a specialized survey. Some of the caseswhich are most relevant to our discussion include: con-tainment of select-project-join queries and unions thereof[CM77,SY81], querieswith arithmetic comparison predicates[Klu88,LS93,ZO93,KMT98], recursive queries [Shm93,Sag88,LS93,CV92,CV94], and queries with bag semantics[CV93].

3.2 Rewriting of a query using views

Given a queryQ and a set of view definitionsV1, . . . , Vm, arewriting of the query using the views is a query expressionQ′ that refersonly to the viewsV1, . . . , Vm.4 In SQL, a queryrefers only to the views if all the relations mentioned in thefrom clauses are views. In practice, we may also be interestedin rewritings that can also refer to the database relations. Con-ceptually, rewritings that refer to the database relations do notintroduce new difficulties, because we can always simulatethe previous case by inventing views that mirror precisely thedatabase tables.

As we saw in Sect.2, we need to distinguish betweentwo types of query rewritings:equivalent rewritingsandmaximally-contained rewritings.For query optimization andmaintaining physical data independence we consider equiva-lent rewritings.

Definition 2. Equivalent rewritings: LetQ be a query andV = {V1, . . . , Vm} be a set of view definitions. The queryQ′is an equivalent rewriting ofQ usingV if:• Q′ refers only to the views inV, and• Q′ is equivalent toQ.

In the context of data integration, we often need to considermaximally-contained rewritings. Unlike the case of equivalentrewritings, the maximally-contained rewriting may differ de-pending on the query language we consider for the rewriting.Hence, the following definition depends on a particular querylanguage:

Definition 3. Maximally-contained rewritings: LetQ be aquery,V = {V1, . . . , Vm} be a set of view definitions, andLbe a query language. The queryQ′ is a maximally-containedrewriting ofQ usingV with respect toL if:• Q′ is a query inL that refers only to the views inV,• Q′ is contained inQ, and• there is no rewritingQ1 ∈ L, such thatQ′ � Q1 � QandQ1 is not equivalent toQ′.

4 Note that rewritings that refer only to the views were calledcom-plete rewritingsin [LMSS95].

When a rewritingQ′ is contained inQ but is not amaximally-contained rewriting we refer to it as a containedrewriting. Note that the above definitions are independent ofthe particular query language we consider. Furthermore, wenote that algorithms for query containment and equivalenceprovide methods fortestingwhether a candidate rewriting ofa query is an equivalent or contained rewriting. However, bythemselves, these algorithms do not provide a solution to theproblem of answering queries using views.

A more fundamental question we can consider is how tofind all the possible answers to the query, given a set of viewdefinitions and their extensions. Finding a rewriting of thequery using the views and then evaluating the rewriting overthe views is clearly one candidate algorithm. If the rewriting isequivalent to the query, then we are guaranteed to find all thepossible answers. However, as we see in Sect.7, a maximally-contained rewriting of a query using a set of views does notalways provide all the possible answers that can be obtainedfrom theviews. Intuitively, the reason for this is that a rewritingis maximally-contained only with respect to a specific querylanguage, and hence theremay sometimes beaquery in amoreexpressive language that may provide more answers.

The problem of finding all the answers to a query givena set of views is formalized below by the notion ofcertainanswers, originally introduced in [AD98]. In the definition,wedistinguish the case in which the view extensions are assumedto be complete (closed-world assumption) from the case inwhich the views may be partial (open-world).

Definition 4. Certain answers: LetQ be a query andV ={V1, . . . , Vm} be a set of view definitions over the databaseschemaR1, . . . , Rn. Let the sets of tuplesv1, . . . , vm be ex-tensions of the viewsV1, . . . , Vm, respectively.

The tuplea is a certain answerto the queryQ under theclosed-world assumption givenv1, . . . , vm if a ∈ Q(D) forall database instancesD such thatVi(D) = vi for everyi,1 ≤ i ≤ m.

The tuplea is a certain answerto the queryQ under theopen-world assumption givenv1, . . . , vm if a ∈ Q(D) forall database instancesD such thatVi(D) ⊇ vi for everyi,1 ≤ i ≤ m.

The intuition behind the definition of certain answers isthe following. The extensions of a set of views do not define aunique database instance. Hence, given the extensions of theviews we have only partial information about the real state ofthe database. A tuple is a certain answer of the queryQ if it isan answer foranyof the possible database instances that areconsistent with the given extensions of the views. Section 7.3considers the complexity of finding certain answers.

Example 1.As a very simple example, consider a databaseschemaR(A,B) that includes a single relation with two at-tributes. Suppose the viewV1 is defined to be the projectionof R onA, whileV2 is defined to be the projection ofR onB,and suppose that our queryQ is to retrieve all of the relationR.

Suppose we are given that the extension ofV1 includesthe single tuple(c1), and that the extension ofV2 includes thesingle tuple(c2),

Under the closed-world assumption, we can infer that thetuple(c1, c2) mustbe in the relationR, and hence it is a cer-

Page 8: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 277

tain answer toQ. However, under the open-world assumption,sinceV1 andV2 are not necessarily complete, the tuple(c1, c2)neednot be inR. For example,Rmaycontain the tuples(c1, d)and(e, c2) for some constantsd ande. Hence,(c1, c2) is nota certain answer toQ. �

4 When is a view usable for a query?

The common theme across all of the works on answeringqueries using views is that they all have to deal with the fun-damental question of when a view is usable to answer a query.Hence, before describing the actual algorithms for answeringqueries using views it is instructive to examine a few exam-ples and gain an intuition for the conditions under which aview is usable for answering a query, and in what ways a viewmay be useful. In this section we consider select-project-joinqueries under set semantics. Note that in some cases a viewmay be usable in maximally-contained rewritings but not inequivalent rewritings.

Informally, a view can be useful for a query if the set ofrelations it mentions overlaps with that of the query, and it se-lects some of the attributes selected by the query. Moreover, ifthe query applies predicates to attributes that it has in commonwith the view, then the view must apply either equivalent orlogically weaker predicates in order to be part of an equivalentrewriting. If the view applies a logically stronger predicate, itmay be part of a contained rewriting.

Consider the following query, asking for the triplets ofprofessors, students, and teaching quarters, where the studentis advised by the professor, and has taken a class taught by theprofessor during the winter of 1998 or later.

select Advises.prof, Advises.student, Registered.quarterfrom Registered, Teaches, Adviseswhere Registered.c-number=Teaches.c-number and

Registered.quarter=Teaches.quarter andAdvises.prof=Teaches.prof and Advises.student=Registered.student andRegistered.quarter ≥ "winter98".

The followingviewV1 is usablebecause it applies thesamejoin conditions to the relationsRegistered andTeaches.Hence,we can useV1 to answer the query by joining it with therelationAdvises. Furthermore,V1 selects the attributesReg-istered.student, Registered.quarter andTeaches.prof thatare needed for the join with the relationAdvises and for theselect clauseof the query. Finally,V1 applies apredicateReg-istered.quarter > "winter97"which is weaker than the pred-icateRegistered.quarter ≥ "winter98" in the query. How-ever, sinceV1 selects the attributeRegistered.quarter, thestronger predicate can be applied as part of the rewriting.

create view V1 asselect Registered.student, Teaches.prof,

Registered.quarterfrom Registered, Teacheswhere Registered.c-number=Teaches.c-number and

Registered.quarter=Teaches.quarter andRegistered.quarter > "winter97".

The views shown in Fig.4 illustrate how minor modifi-cations toV1 change their usability in answering the query.The viewV2 is similar toV1, except that it does not selectthe attributeTeaches.prof, which is needed for the join withthe relationAdvises and in theselect clause of the query.Hence, to useV2 in the rewriting, we would need to joinV2with theTeaches relation again (in addition to a join withAdvises). Still, if the join of the relationsRegistered andTeaches is very selective, then employingV2 may actuallyresult in a more efficient query execution plan.

The viewV3 does not apply the necessary equi-join pred-icate betweenRegistered.quarter and Teaches.quarter.Since the attributes Teaches.quarter and Regis-tered.quarter are not selected byV3, the join predicatecannot be applied in the rewriting, and therefore there islittle to gain by usingV3. The viewV4 considers only theprofessors who have at least one area of research. Hence, theview applies an additional condition that does not exist in thequery, and cannot be used in an equivalent rewriting unless weallow union and negation in the rewriting language. However,if we have an integrity constraint stating that every professorhas at least one area of research, then an optimizer should beable to realize thatV4 is usable. Finally, viewV5 applies astronger predicate than in the query (Registered.quarter >"winter99"), and is therefore usable for a contained rewriting,but not for an equivalent rewriting of the query.

To summarize, the following conditions need to hold inorder for a select-project-join viewV to be usable in an equiv-alent rewriting of a queryQ. The intuitive conditions belowcan be made formal in the context of a specific query lan-guage and/or available integrity constraints (see e.g., [YL87,LMSS95]):

1. Theremust be amappingψ from the occurrences of tablesmentioned in thefrom clause ofV to those mentioned inthefrom clause ofQ, mapping every table name to itself.In the case of bag semantics,ψ must be a 1-1 mapping,whereas for set semantics,ψ can be amany-to-1 mapping.

2. V must either apply the join and selection predicates inQon the attributes of the tables in the domain ofψ, or mustapply to them a logically weaker selection, and select theattributes on which predicates need to still be applied.

3. V must not project out any attributes of the tables in thedomain ofψ that are needed in the selection ofQ, unlessthese attributes can be recovered from another view (orfrom the original table if it’s available).

Finally, we note that the introduction of bag semanticsintroduces additional subtleties. In particular, we must ensurethat the multiplicity of answers required in the query are notlost in the views (e.g., by the use ofdistinct), and are notincreased (e.g., by the introduction of additional joins).

5 Incorporating materialized viewsinto query optimization

This section describes the different approaches to incorporat-ing materialized views into query optimization. The focus ofthese algorithms is to judiciously decide when to use views toanswer a query. The output of the algorithm is an executionplan for the query. The approaches differ depending on which

Page 9: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

278 A.Y. Halevy: Answering queries using views: A survey

create view V2 as create view V3 asselect Registered.student, Registered.quarter select Registered.student, Teaches.prof, Registered.quarterfrom Registered, Teaches from Registered, Teacheswhere Registered.c-number=Teaches.c-number where Registered.c-number=Teaches.c-numberand Registered.quarter=Teaches.quarter and Registered.quarter ≥ "winter98".and Registered.quarter ≥ "winter98".

create view V4 as create view V5 asselect Registered.student, Registered.quarter, select Registered.student, Teaches.prof, Registered.quarter

Teaches.proffrom Registered, Teaches, Advises, Area from Registered, Teacheswhere Registered.c-number=Teaches.c-number where Registered.c-number=Teaches.c-numberand Registered.quarter=Teaches.quarter and Registered.quarter=Teaches.quarterand Teaches.prof=Advises.prof and Registered.quarter > "winter99".and Teaches.prof=Area.nameand Registered.quarter ≥ "winter98"

Fig. 4.Examples of unusable views

phase of query optimization was modified to consider materi-alized views. Section 5.1 describes algorithms based on Sys-tem R-style optimization, where materialized views are con-sidered during the join enumeration phase [CKPS95,TSI96].Section 5.2 describes works based on transformational opti-mizers [ZCL+00,DPT99,PDST00,GL01].There, thekey ideais that replacing a query subexpression by a view is yet an-other transformation employed by the optimizer. Section 5.3discusses some of the issues that arise when rewriting algo-rithms are extended to consider grouping and aggregation.These extensions are key to incorporating materialized viewsinto decision support applications.

5.1 System-R style optimization

In this section we consider select-project-join queries and dis-cuss the changes that need to be made to a join enumerationalgorithm to incorporate materialized views. To illustrate thechanges to a System R-style optimizer we first briefly recallthe principles underlying System-R optimization [SAC+79].System-R takes a bottom-up approach to building query exe-cution plans. In the first phase, it constructs plans of size 1, i.e.,chooses the best access paths to every table mentioned in thequery. In phasen, the algorithm considers plans of sizen, bycombiningpairs of plansobtained in thepreviousphases (Notethat if the algorithm is considering only left-deep plans, it willtry to combine plans of sizen − 1 with plans of size 1. Other-wise, it will consider combining plans of sizek with plans ofsizen− k.) The algorithm terminates after constructing plansthat cover all the relations in the query.

Intuitively, the efficiency of System-R stems from thefact that it partitions query execution plans intoequivalenceclasses, and only considers a single execution plan for ev-ery equivalence class. Two plans are in the same equivalenceclass if they: (1) cover the same set of relations in the query(and therefore are also of the same size); and (2) produce theanswers in the same interesting order. In the process of build-ing plans, two plans are combined only if they cover disjointsubsets of the relations mentioned in the query.

In our context, the query optimizer builds query executionplans by accessing a set of views, rather than a set of database

relations. Hence, in addition to the meta-data that the queryoptimizer has about the materialized views (e.g., statistics,indexes) the optimizer is also given as input the query expres-sions defining the views. Recall that a database relation canalways be modeled as a view as well.

We illustrate the changes to the join enumerationalgorithmwith an example that includes the following views:

create view V1 asselect student, deptfrom Major.

create view V2 asselect Registered.student, Registered.c-numberfrom Registered, Coursewhere Registered.c-number=Course.c-number and

Course.title LIKE ’%theory%’.

create view V3 asselect Major.dept, Registered.c-numberfrom Registered, Majorwhere Registered.student=Major.student and

Registered.c-number≥500.

Suppose the query below asks for all of the students at-tending Ph.D level classes with ’theory’ in their title, and thedepartments in which the students are majoring.

select Registered.student, Major.deptfrom Registered, Major, Coursewhere Registered.student=Major.student and

Registered.c-number=Course.c-number andCourse.c-number≥500 andCourse.title LIKE ’%theory%’.

We now describe the additional issues that the optimizerneeds to consider in the presence of materialized views. Fig-ure 5 shows a side-by-side comparison of the steps of a tra-ditional optimizer vs. one that exploits materialized views.The algorithm described below is a slight modification ofthe GMAP algorithm [TSI96]. The algorithm described in[CKPS95] uses the same principles, but, as we explain later,with several differences.

A. In the first iteration the algorithm needs to decide whichviews arerelevant to the query. A view is relevant if it

Page 10: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 279

is usable in answering the query (illustrated by the con-ditions in Sect.4). The corresponding step in a traditionaloptimizer is trivial: a relation is relevant to the query if itis mentioned in thefrom clause.

In our example, the algorithm will determine that all threeviews are relevant to the query, because each of themmentionsthe relations in the query and applies some of the same joinpredicates as in the query. Therefore, the algorithm choosesthe best access path to each of the views, depending on theexisting index structures and selection predicates in the query.

B. Since the query execution plans involve joins over views,rather than joins over database relations, plans can nolonger be neatly partitioned into equivalence classeswhichcan be explored in increasing size. This observation im-plies several changes to the traditional algorithm:1. Termination testing: the algorithm needs to distin-

guishpartial query execution plansof the query fromcomplete execution plans. The enumeration of the pos-sible join orders terminates when there are no moreunexplored partial plans. In contrast, in the traditionalsetting the algorithm terminates after considering theequivalence classes that include all the relations in thequery.

2. Pruning of plans: a traditional optimizer comparesbetween pairs of planswithin one equivalence classand saves only the cheapest one for each class. In ourcontext, the query optimizer needs to compare betweenany pairof plans generated thus far.A planp is prunedif there is another planp′ that: (1) is cheaper thanp; and(2) has greater or equal contribution to the query thanp. Informally, a planp′ contributes more to the querythan the planp if it covers more of the relations in thequery and selects more of the necessary attributes.

3. Combining partial plans: in the traditional setting,when two partial plans are combined, the join predi-cates that involve both plans are explicit in the query,and the enumeration algorithm need only consider themost efficient way to apply these predicates. However,in our case, it may not be obvious a priori which joinpredicate will yield a correct rewriting of the query,since we are joining views rather than database rela-tions directly. Hence, the enumeration algorithm needsto consider several alternative join predicates. Fortu-nately, in practice, the number of join predicates thatneed to be considered can be significantly pruned us-ing meta-data about the schema. For example, thereis no point in trying to join a string attribute with anumeric one. Furthermore, in some cases we can useknowledge of integrity constraints and the structure ofthe query to reduce the number of join predicates weconsider. Finally, after considering all the possible joinpredicates, the optimizer also needs to check whetherthe resulting plan is still a partial solution to the query.

In our example, the algorithm will consider in the seconditeration all possible methods to join pairs of plans producedin the first iteration. The algorithm will save the cheapest planfor each of the two-way joins, assuming the result is still apartial or complete solution to the query. The algorithm will

consider the following combinations (in this discussion weignore the choice of inner versus outer input to the join):

• The join ofV1 andV2 on the attributestudent: This joinproduces a partial result to the query. There are two waysto extend this join to complete execution plan. The firstis to apply an additional selection on thec-number at-tribute and a projection onstudent anddept. The second,which is explored in the subsequent iteration, is to join theresult withV3. Hence, the algorithm produces one com-plete execution plan and keepsV1✶V2 for the subsequentiterations.In principle, as explained in bullet 3 above, the algorithmshould also consider joiningV1 andV2 on other attributes(e.g.,V1.student=V2.c-number), but in this case, a sim-ple semantic analysis shows that such a join will not yielda partial solution.

• The joins ofV1withV3 (ondept) andofV2withV3 (onc-number): These two joins produce partial solutions to thequery, but only if set semantics are considered (otherwise,the resulting rewriting will have multiple occurrences oftheMajor (orRegistered) relation, whereas the query hasonly one occurrence).

In the third iteration, thealgorithm tries to join theplans forthe partial solutions from the second iteration with a plan fromthe first iteration. One of the plans the algorithm will consideris the one in which the result of joiningV2 andV3 is thenjoined withV1. Even though this plan may seem redundantcompared toV1 ✶ V2, it may be cheaper depending on theavailable indexes on the views, because it enables pruning the(possibly larger) set of students based on the selective coursenumber.

Variations on the above principles are presented in [TSI94,TSI96] and [CKPS95]. The algorithm in [TSI96] attempts toreformulate a query on a logical schema to refer directly toGMAPs storing the data (see Sect.2). They consider select-project-join queries with set semantics. To test whether a so-lution is complete (i.e., whether it is equivalent to the originalquery) they use an efficient and sufficient query-equivalencecondition that alsomakes use of some inclusion and functionaldependencies.

The goal of the algorithm described in [CKPS95] is tomake use ofmaterialized views in query evaluation. They con-sider select-project-join querieswith bag semantics andwhichmayalso include arithmetic comparison predicates.Under bagsemantics, the ways in which views may be combined to an-swer a query are more limited. This is due to the fact that twoqueries are equivalent if and only if there is a bi-directional1-1 mapping between the two queries, which maps the joinpredicates of one query to those of the other [CV93]. Hence,if we ignore the arithmetic comparison operators, a view isusable only if it is isomorphic to a subset of the query. Anadditional difference between [TSI96] and [CKPS95] is thatthe latter searches the space of join orderings in a top-downfashion, compared to the bottom-up fashion in [TSI96]. How-ever, since the algorithms consider different semantics, theirsearch spaces are incomparable. Both [TSI96] and [CKPS95]present experimental results that examine the cost of consid-ering materialized views in query optimization.

Page 11: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

280 A.Y. Halevy: Answering queries using views: A survey

Conventional optimizer Optimizer using viewsIteration 1 Iteration 1a) find all possible access paths. a1) Find all views that arerelevantto the query.

a2) Distinguish between partial and complete solutionsto the query.

b) Compare their cost and keep the least b) Compare all pairs of views. If one has neither greaterexpensive. contribution nor a lower cost than the other, prune it.

c) If the query has one relation, stop. c) If there are no partial solutions, stop.Iteration 2 Iteration 2For each query join:a) Consider joining the relevant access paths a1) Consider joining all partial solutions found in the

found in the previous iteration using all previous iteration using all possible equi-join methods andpossible join methods. trying all possible subsets of join predicates.

a2) Distinguish between complete and partial solutions.b) Compare the cost of the resulting join b) If any newly generated solution is either not relevant

plans and keep the least expensive. to the query, or dominated by another, prune it.c) If the query has only 2 relations, stop. c) If there are no partial solutions, stop.Iteration 3 Iteration 3. . . . . .

Fig. 5.A comparison of a traditional query optimizer with one that exploits materialized views

5.2 Transformational and other approaches to view rewriting

In this sectionwe describe several works that incorporate viewrewriting as transformations. The common theme in theseworks is that replacing somepart of a querywith a view is con-sidered as another transformation available to the optimizer.This approach is necessary when: (1) the entire optimizer istransformational (e.g, in [GL01]); and (2) in the logical rewrit-ing phase of a System-R style optimizer that is consideringmore complex SQL queries (as in [ZCL+00]).

In [GL01] the authors describe an algorithm for rewrit-ing queries using views that is implemented in the transfor-mational optimizer of Microsoft SQL Server. In the algo-rithm, view matching is added as another transformation rulein the optimizer. The transformation rule is invoked on select-project-join-group-by (SPJG) expressions, and it attempts toreplace the SPJG expression by a single view. The authors de-scribe in detail the conditions under which a sub-query is re-placed by a view. The key novelty in this work is thefilter-tree,a clever index structure that makes it possible to efficientlyfilter the set of views that are relevant to a particular SPJG ex-pression. The index is composed of several sub-indexes, eachof which is built on a particular property of the views (e.g.,the set of tables in the view, the set of output columns, group-ing columns). The sub-indexes are combined in a hierarchicalfashion into the filter tree, where each level in the tree fur-ther partitions the views according to another property. Theauthors describe a set of experiments that shows that their al-gorithm adds relatively little to the optimization time, even inthe presence of 1,000 views.

In [ZCL+00] the authors describe how view rewriting isincorporated into the query rewrite phase of the IBM DB2UDB optimizer. Their algorithm operates on the Query GraphModel (QGM) representation of a query [HFLP89], whichdecomposes the query into multiple QGMboxes, each cor-responding to a select-project-join block. The algorithm at-tempts to match pairs of QGM boxes in the views with thosein the query. The algorithm navigates the QGM in a bottom upfashion, starting from the leaf boxes. A match between a boxin the query and in the view can be either: (1) exact, meaning

that the two boxes represent equivalent queries; or (2) mayrequire acompensation. A compensation represents a set ofadditional operations that need to be performed on a box ofthe view in order to obtain an equivalent result to a box in thequery. The algorithm considers a pair of boxes only after thematchalgorithmhasbeenapplied toeverypossiblepair of theirchildren. Therefore, thematch (and corresponding compensa-tion) can be determined without looking into the subtrees oftheir children. The algorithm terminates when it finds a matchbetween the root of the view and some box in the QGM ofthe query. The authors show that by considering rewritings atthe QGM level, they are able to extend previous algorithmsto handle SQL queries and views with multiple blocks, whileprevious algorithms considered only single block queries. Aswe point out in the next section, their algorithm also extendsprevious work to handle more complex types of grouping andaggregation.

In [DPT99] the authors use a transformational approach touniformly incorporate the use materialized views, specializedindexes and semantic integrity constraints.All of these are rep-resented as constraints. Their procedure involves two phases,each involving a different set of transformations. In the firstphase, thechase,the query is expanded to include any otherstructure (e.g,. materialized view or access structure) that isrelevant to the query, resulting in auniversalquery plan. Inthe second phase, theback-chase,the optimizer tries to re-move structures (and hence joins) from the universal plan, inorder to obtain a plan of minimal cost. The chase procedure isbased on a generalization of the standard chase procedure tohandle path conjunctive queries [PT99], thereby enabling thealgorithm to handle certain forms of object-oriented queries.In [PDST00] the authors describe an implementation of theframework and experiments that prove its feasibility, focusingon methods for speeding up the back-chase phase.

In [BDD+98] the authors describe a limited use of trans-formation rules to incorporate view rewriting algorithm intothe Oracle 8i DBMS. The algorithm works in two phases. Inthe first phase, the algorithm applies a set of rewrite rules thatattempt to replace parts of the query with references to exist-ing materialized views. The rewrite rules consider the cases in

Page 12: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 281

which views satisfy the conditions described in Sect.4, andalso consider common integrity constraints encountered inpractice, such as functional dependencies and foreign key con-straints. The result of the rewrite phase is a query that refersto the views. In the second phase, the algorithm compares theestimated cost of two plans: the cost of the result of the firstphase, and the cost of the best plan found by the optimizer thatdoesnotconsider the use ofmaterialized views. The optimizerchooses to execute the cheaper of these two plans. The mainadvantage of this approach is its ease of implementation, sincethe capability of using views is added to the optimizer withoutchanging the join enumerationmodule. On the other hand, thealgorithm considers the cost of only one possible rewriting ofthe query using the views, and hence may miss the cheapestuse of the materialized views.

Finally, in [ALU01] the authors consider using views forquery optimization from a different angle. They consider theproblemof finding the rewritingof thequerywithminimal costunder three specific costmodels: (1)minimizing thenumber ofviews in the rewriting (hence the number of joins); (2) reduc-ing the size of the intermediate relations computed during therewriting; and (3) reducing the size of intermediate relationswhile also dropping irrelevant attributes as the computationproceeds. The techniques underlying theCoreCover algo-rithm described in [ALU01] are closer in spirit to those usedin the MiniCon Algorithm [PL00] described in Sect.6.4.

5.3 Queries with grouping and aggregation

In decision support applications, when queries contain group-ingandaggregation, there isevenmoreof anopportunity toob-tain significant speedups by reusing the results of materializedviews. However, the presence of grouping and aggregation inthe queries or the views introduces several new difficulties tothe problem of answering queries using views. The first dif-ficulty that arises is dealing with aggregated columns. Recallthat for a view to be usable by a query, it must not projectout an attribute that is needed in the query (and is not other-wise recoverable). When a view performs an aggregation onan attribute, we lose some information about the attribute, andin a sensepartially projecting it out. If the query requires thesameor a coarser grouping than performed in the view, and theaggregated column is either available or can be reconstructedfrom other attributes, then the view is still usable for the query.The second difficulty arises due to the loss of multiplicity ofvalues on attributes on which grouping is performed. Whenwe group on an attributeA, we lose the multiplicity of theattribute in the data, thereby losing the ability to perform sub-sequent sum, counting or averaging operations. In some cases,it may be possible to recover the multiplicity using additionalinformation.

The following simple example illustrates some of the sub-tleties that arise in the presence of grouping and aggregation.To make this example slightly more appealing, we assume thequarter attribute of the relationTeaches is replacedby ayearattribute (and hence, there are likely to be several offerings ofthe same course during an academic year). Suppose we havethe following viewavailable, which considers all the graduate-level courses, and for every pair of course and year, gives the

maximal course evaluation for that course in the given year,and the number of times the course was offered.

create view V asselect c-number, year, Max(evaluation) as maxeval,

Count(∗) as offeringsfrom Teacheswhere c-number ≥ 400groupBy c-number, year.

The following query considers only Ph.D-level courses,and asks for the maximal evaluation obtained foranycourseduring a given year, and the number of different course offer-ings during that year.

select year, count(∗), Max(evaluation)from Teacheswhere c-number ≥ 500groupBy year.

The following rewriting uses the viewV to answer our query.

select year, sum(offerings), Max(maxeval)from Vwhere c-number ≥ 500groupBy year.

There are a couple of points to note about the rewriting.First, even though the view performed an aggregation on theattributeevaluation, we could still use the view in the query,because the groupings in the query (onyear) are more coarsethan those in the view (onyear andc-number). Thus, theanswer to the query can be obtained by coalescing groupsfrom the view. Second, since the view groups the answers byc-number and thereby loses the multiplicity of each course,we would have ordinarily not been able to use the view tocompute the number of course offerings per year. However,since the view also computed the attributeofferings, therewas still enough information in the view to recover the totalnumber of course offerings per year, by summing the offeringsper course.

Several works considered the problem of answeringqueries using views in the presence of grouping and aggrega-tion. One approach considered involved a set of transforma-tions in the query rewrite phase [GHQ95]. In this approach,the algorithm performs syntactic transformations on the queryuntil it is possible to identify a subexpression of the query thatis identical to the view, and hence substitute the view for thesubexpression. However, as the authors point out, the purelysyntactic nature of this approach is a limiting factor in its ap-plicability.

A more semantic approach is proposed in [SDJL96]. Theauthors describe the conditions required for a view to be us-able for answering a query in the presence of grouping andaggregation, and a rewriting algorithm that incorporates theseconditions. That paper considers the cases in which the viewsand/or the queries contain grouping and aggregation. It is in-teresting to note that when the view contains grouping andaggregation but the query does not, then unless the query re-moves duplicates in theselect clause, the view cannot be usedto answer a query.Another important point to recall about thiscontext is that because of the bag semantics a view will be us-able to answer a query only if there is an isomorphismbetweenthe viewanda subset of the query [CV93]. Thework described

Page 13: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

282 A.Y. Halevy: Answering queries using views: A survey

in [ZCL+00] extends the treatment of grouping and aggrega-tion to consider multi-block queries and to multi-dimensionalaggregation functions such as cube, roll-up, and grouping sets[GBLP98].

Several works [CNS99,GRT99,GT00] consider the for-mal aspects of answering queries using views in the presenceof grouping and aggregation. They present cases in which itcan be shown that a rewriting algorithm is complete, in thesense that it will find a rewriting if one exists. Their algorithmsare based on insights into the problem of query containmentfor queries with grouping and aggregation.

An interesting issue that has not received attention to dateis extending the notion of maximally-contained rewritings tothe presence of grouping and aggregation. In particular, onecan imagine a notion of maximally-contained plans in whichtheanswersprovide thebest possibleboundson theaggregatedcolumns.5

6 Answering queries using views for data integration

Theprevious section focusedonextending query optimizers toaccommodate the use of views. They were designed to handlecases where the number of views is relatively small (i.e., com-parable to the size of the database schema), and cases wherewe require an equivalent rewriting of the query. In addition,for the most part, these algorithms did not consider cases inwhich the resulting rewriting may contain a union over theviews.

In contrast, the context of data integration requires that weconsider a large number of views, since each data source isbeing described by one or more views. In addition, the viewdefinitions contain many complex predicates, whose goal isto express fine-grained distinctions between the contents ofdifferent data sources. As shown in Sect.2, we will often notbe able to find an equivalent rewriting of the query using thesource views, and the best we can do is find the maximally-contained rewriting of the query. The maximally-containedrewriting will often involve a union of several queries over thesources. Furthermore, in the context of data integration it isoften assumed that the views are not complete, i.e., they mayonly contain a subset of the tuples satisfying their definition.

In this section we describe algorithms for answeringqueries using views that were developed specifically for thecontext of data integration. These algorithms are thebucket al-gorithmdeveloped in the context of the Information Manifoldsystem [LRO96b] and later studied in [GM99a], theinverse-rules algorithm[Qia96,DGL00] which was implemented inthe InfoMaster system [DG97b], and the MiniCon algorithm[PL00,PH01]. It should be noted that unlike the algorithmsde-scribed in the previous section, the output of these algorithmsis not a query execution plan, but rather a query referring tothe view relations.

6.1 Datalog notation

For this and the next section, it is necessary to revert to datalognotation and terminology. Hence, belowwe provide a brief re-

5 I thank an anonymous reviewer for suggesting this problem.

minder of datalog notation and of conjunctive queries [Ull89,AHV95].

Conjunctive queries are able to express select-project-joinqueries. A conjunctive query has the form:

q(X̄) :− r1(X̄1), . . . , rn(X̄n)

whereq, andr1, . . . , rn are predicate names. The predicatenamesr1, . . . , rn refer to database relations. The atomq(X̄)is called theheadof thequery, and refers to theanswer relation.Theatomsr1(X̄1), . . . , rn(X̄n)are thesubgoalsin thebodyofthe query. The tuples̄X, X̄1, . . . , X̄n contain either variablesor constants. We require that the query besafe, i.e., thatX̄ ⊆X̄1 ∪ . . .∪ X̄n (that is, every variable that appears in the headmust also appear in the body).

Queries may also contain subgoals whose predicates arearithmetic comparisons<,≤,=, �=. In this case, we requirethat if a variableX appears in a subgoal of a comparisonpredicate, thenX must also appear in an ordinary subgoal.We refer to the subgoals of comparison predicates of a queryQ byC(Q).

As an example of expressing an SQL query in datalog,consider the following SQL query asking for the students (andtheir advisors) who took courses from their advisors after thewinter of 1998:

select Advises.prof, Advises.studentfrom Registered, Teaches, Adviseswhere Registered.c-number=Teaches.c-number and

Registered.quarter=Teaches.quarter andAdvises.prof=Teaches.prof and Advises.student=Registered.student andRegistered.quarter > "winter98".

In the notation of conjunctive queries, the above query wouldbe expressed as follows:

q(prof, student) :-Registered(student, c-number, quarter),Teaches(prof, c-number, quarter),Advises(prof, student), quarter > "winter98".

Note that when using conjunctive queries, join predicatesof SQL are expressed by multiple occurrences of the samevariable in different subgoals of the body (e.g., the variablesquarter, c-number, prof, andstudent above). Unions canbe expressed in this notation by allowing a set of conjunctivequeries with the same head predicate.

A datalog query is a set of rules, each having the same formasaconjunctivequery, except that predicates in thebodydonothave to refer to database relations. In adatalogquerywedistin-guish EDB (extensional database) predicates that refer to thedatabase relations from the IDB (intensional database) predi-cates that refer to intermediate computed relations. Hence, inthe rules, EDB predicates appear only in the bodies, whereasthe IDB predicates may appear anywhere.We assume that ev-ery datalog query has a distinguished IDB predicate called thequery predicate, referring to the relation of the result.

A predicatep in a datalog program is said todependon apredicateq if q appears in one of the ruleswhose head isp. Thedatalog program is said to berecursiveif there is a cycle in thedependency graph of predicates. It is important to recall that ifa datalog program is not recursive, then it can be equivalentlyrewritten as a union of conjunctive queries, though possiblywith an exponential blowup in the size of the query. As we

Page 14: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 283

see in Sect.7.2, certain cases may require rewritings that arerecursive datalog queries.

The input to a datalog queryQ consists of a databaseDstoring extensions of all EDB predicates inQ. Given such adatabaseD, the answer toQ, denoted byQ(D), is the leastfixpointmodel ofQandD, which can be computed as follows.Weapply the rules of the program in anarbitrary order, startingwith empty extensions for the IDB relations. An applicationof a rule may derive new tuples for the relation denoted bythe predicate in the head of the rule. We apply the rules untilwe cannot derive any new tuples. The outputQ(D) is the setof tuples computed for the query predicate. Note that sincethe number of tuples that can be computed for each relation isfinite and monotonically increasing, the evaluation is guaran-teed to terminate. Finally, we say that a datalog query refersonly to views if instead of EDB predicates we have predicatesreferring to views (but we still allow arithmetic comparisonpredicates and IDB predicates).

6.2 The bucket algorithm

The goal of the bucket algorithm is to reformulate a user querythat is posed on a mediated (virtual) schema into a query thatrefers directly to the available data sources. Both the queryand the sources are described by conjunctive queries that mayinclude atoms of arithmetic comparison predicates (hereafterreferred to simply as predicates). As we explain in Sect.7, thenumber of possible rewritings of the query using the viewsis exponential in the size of the query. Hence, the main ideaunderlying the bucket algorithm is that the number of queryrewritings that need to be considered can be drastically re-duced if we first consider each subgoal in the query in iso-lation, and determine which views may be relevant to eachsubgoal.

Given a queryQ, the bucket algorithm proceeds in twosteps. In the first step, the algorithm creates a bucket for eachsubgoal inQ that is not inC(Q), containing the views (i.e.,data sources) that are relevant to answering the particular sub-goal. More formally, to decide whether the viewV should bein the bucket of a subgoalg, we consider each of the subgoalsg1 in V and do the following:

A. Check whether there is a unifierθ for g andg1, i.e., θ isa variable mapping such thatθ(g) = θ(g1). If there is nounifier, we proceed to the next subgoal.

B. Given the unifierθ, we check whether the view and thequery would be compatible after the unifier is applied.Hence, we applyθh(V ) to the query and to the view, whereθh(V ) is the same asθ but its domain does not include theexistential variables inV (since only the head variables ofV are part of a rewriting). Then we check two conditions:(1) that the predicates inQ and inV are mutually satisfi-able, i.e.,θh(V )(C(Q)) ∧ θh(V )(C(V )) is satisfiable; and(2) thatθ treats the head variables occurring ing correctly,i.e., ifX is a head variable that appears in positioni of thesubgoalg, then the variable appearing in positioni of g1must be a head variable ofV .

If the above conditions are satisfied, then we insert theatomθ(head(V )) into the bucket ofg. Note that a subgoalg

Table 2.Contents of the buckets. The primed variables are those thatare not in the domain of the unifying mapping

Teaches(P,C,Q) Registered(S,C,Q) Course(C,T)

V2(S’,P,C,Q) V1(S,C,Q,T’) V1(S’,C,Q’,T)V4(P,C,T’,Q) V2(S,P’,C,Q) V4(P’,C,T,Q’)

may unify with more than one subgoal in a viewV , and in thatcase the bucket ofg will contain multiple occurrences ofV .

In the second step, the bucket algorithm finds a set ofcon-junctive query rewritings, each of them being a conjunctivequery that includes one conjunct from every bucket. Each ofthese conjunctive query rewritings represents one way of ob-taining part of the answer toQ from the views. The result ofthe bucket algorithm is defined to be the union of the con-junctive query rewritings (since each of the rewritings maycontribute different tuples). Given a conjunction, constructedfrom a single element from every bucket, it is a conjunctivequery rewriting if either: (1) the conjunction is contained inthe queryQ; or (2) it is possible to add atoms of compari-son predicates such that the resulting conjunction is containedin Q.

Example 2.Consider the following views

V1(student,c-number,quarter,title):-Registered(student,c-number,quarter),Course(c-number,title), c-number≥500, quarter≥Aut98.

V2(student,prof,c-number,quarter):-Registered(student,c-number,quarter),Teaches(prof,c-number,quarter)

V3(student,c-number):-Registered(student,c-number,quarter), quarter ≤ Aut94.

V4(prof,c-number,title,quarter):-Registered(student,c-number,quarter),Course(c-number,title), Teaches(prof,c-number,quarter),quarter≤Aut97.

Suppose our query is:

q(S,C,P) :- Teaches(P,C,Q), Registered(S,C,Q), Course(C,T),C≥300, Q≥Aut95.

In the first step the algorithm creates a bucket for eachof the relational subgoals in the query in turn. The resultingcontents of the buckets are shown in Table 2. The bucket ofTeaches(P,C,Q) includes viewsV2 andV4, since the follow-ing mapping unifies the subgoal in the query with the corre-spondingTeaches subgoal in the views (thereby satisfyingcondition (a) above):

{ P → prof, C → c-number, Q → quarter }.Note that each view head in a bucket only includes vari-

ables in the domain of the mapping. Fresh variables (primed)are used for the other head variables of the view.

The bucket of the subgoalRegistered(S,C,Q) containsthe viewsV1 andV2, since the following mapping unifiesthe subgoal in the query with the correspondingRegisteredsubgoal in the views:

{ S → student, C → c-number, Q → quarter }.The viewV3 is not included in the bucket ofRegis-

tered(S,C,Q) because after applying the unificationmapping,

Page 15: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

284 A.Y. Halevy: Answering queries using views: A survey

the predicatesQ≥ Aut95 andQ≤ Aut94 aremutually incon-sistent. The viewV4 is not included in the bucket ofRegis-tered(S,C,Q) because the variablestudent is not in the headofV4, whileS is in the head of the query.

Next, consider the bucket of the subgoalCourse(C,T).The viewsV1 andV4 will be included in the bucket becauseof the mapping

{ C → c-number, T → title }.In the second step of the algorithm, we combine elements

from the buckets. In our example, we start with a rewritingthat includes the top elements of each bucket, i.e.,

q’(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’), V1(S’, C, Q’, T).

As can be checked, this rewriting can be simplified byequating the variablesS andS’, andQ andQ’, and then re-moving the third subgoal, resulting with

q’(S,C,P) :- V2(S’,P,C,Q), V1(S,C,Q,T’).

Another possibility that the bucket algorithm would exploreis:

q’(S,C,P) :- V4(P, C, T’, Q), V1(S,C,Q,T’), V4(P’, C, T, Q’).

However, this rewriting would be dismissed because thequarters given inV1 are disjoint from those given inV4. Inthis case, the viewsV1 andV4 are relevant to the query whenthey are considered inisolation, but, if joined, would yield theempty answer.

Finally, the algorithm would also produce the rewriting

q’(S,C,P) :- V2(S,P,C,Q), V4(P, C, T’, Q).

Hence, the result of the bucket algorithm is the union oftwo conjunctive queries, one obtains answers by joiningV1andV2, and the other by joiningV2 andV4. The reader shouldnote that in this example, as often happens in the data integra-tion context, the algorithm produced amaximally-containedrewriting of the query using the views, and not an equivalentrewriting. In fact, when the query does not contain arithmeticcomparison predicates (but the view definitions still may)the bucket algorithm is guaranteed to return the maximally-contained rewriting of the query using the views. �

The strength of the bucket algorithm is that it exploits thepredicates in the query to prune significantly the number ofcandidate conjunctive rewritings that need to be considered.Checking whether a view should belong to a bucket can bedone in time polynomial in the size of the query and view def-inition when the predicates involved are arithmetic compar-isons. Hence, if the data sources (i.e., the views) are indeeddistinguished by having different comparison predicates, thenthe resulting buckets will be relatively small. The bucket algo-rithm also extends naturally to cases where the query (but notthe views) is a union of conjunctive queries, and to other formsof predicates in the query such as class hierarchies [LRO96a].Finally, the bucket algorithm also makes it possible to iden-tify opportunities for interleaving optimization and executionin a data integration system in cases where one of the bucketscontains an especially large number of views [LRO96a].

The main disadvantage of the bucket algorithm is that theCartesian product of the buckets may still be rather large. Fur-thermore, in the second step the algorithm needs to perform

a query containment test for every candidate rewriting. Thetesting problem isΠp

2 -complete,6 though only in the size ofthe query and the view definition, and hence quite efficient inpractice.

6.3 The inverse-rules algorithm

Like thebucket algorithm, the inverse-rulesalgorithmwasalsodeveloped in the context of a data integration system [DG97b].The key idea underlying the algorithm is to construct a set ofrules thatinvert the view definitions, i.e., rules that show howto compute tuples for the database relations from tuples of theviews. We illustrate inverse rules with an example. Supposewe have the following view (we omit thequarter attribute ofRegistered for brevity in this example):

V3(dept, c-number) :- Major(student,dept),Registered(student,c-number).

We construct one inverse rule for every subgoal in the body ofthe view:

Major(f1(dept,X), dept) :- V3(dept,X)Registered(f1(Y, c-number), c-number) :- V3(Y,c-number)

Intuitively, the inverse rules have the following meaning.A tuple of the form(dept,c-number) in the extension of theviewV3 is awitnessof tuples in the relationsMajor andReg-istered. The tuple(dept,c-number) is a witness in the sensethat it tells us two things:

• The relationMajor contains a tuple of the form(Z, dept),for some value ofZ.

• The relationRegistered contains a tuple of the form(Z,c-number), for thesamevalue ofZ.

In order to express the information that the unknown valueof Z is the same in the two atoms, we refer to it using thefunctional termf1(dept,c-number). Formally,f1 is a Skolemfunction (see [ABS99], Pg. 96) and therefore uninterpreted.Note that there may be several values ofZ in the database thatcause the tuple(dept,c-number) to be in the join ofMajorandRegistered, but all that matters is that there exists at leastone such value.

In general, we create one function symbol for every ex-istential variable that appears in the view definitions. Thesefunction symbols are used in the heads of the inverse rules.

The rewriting of a queryQ using the set of viewsV is thedatalog program that includes:

• The inverse rules forV.• The queryQ.

As shown in [DG97a,DGL00], the inverse-rules algorithmreturns the maximally-contained rewriting ofQ usingV. Infact, thealgorithm returns themaximally containedquery evenif Q is an arbitrary recursive datalog program.

Example 3.Suppose a query asks for the departments inwhich the students of the 444 course are majoring,

6 For conjunctive queries with no comparison predicates, querycontainment is in NP because we only need to guess a containmentmapping. Here, however, we need to guess a containment mappingfor every possible ordering on the variables in containing query.

Page 16: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 285

q(dept) :- Major(student,dept), Registered(student, 444)

and the viewV3 includes the tuples:

{ (CS, 444), (EE, 444), (CS, 333) }.The inverse rules would compute the following tuples:

Registered: { (f1(CS,444), CS), (f1(EE,444), EE),(f1(CS,333), CS) }

Major: { (f1(CS,444),444), (f1(EE,444),444),(f1(CS,333),333) }

Applying the query to these extensions would yield the an-swersCS andEE. �

In the above example we showed how functional termsare generated as part of the evaluation of the inverse rules.However, the resulting rewriting can actually be rewritten insuch a way that no functional terms appear [DG97a].

There are several interesting similarities and differencesbetween the bucket and inverse rules algorithms that are worthnoting. In particular, the step of computing buckets is similarin spirit to that of computing the inverse rules, because both ofthemcompute the views that are relevant to single atoms of thedatabase relations. The difference is that the bucket algorithmcomputes the relevant views by taking into consideration thecontextin which the atom appears in the query, while the in-verse rules algorithm does not. Hence, if the predicates in aview definition entail that the view cannot provide tuples rel-evant to a query (because they are mutually unsatisfiable withthe predicates in the query), then the view will not end up in abucket. In contrast, the query rewriting obtained by the inverserules algorithm may contain views that are not relevant to thequery. However, the inverse rules can be computed once, andbe applicable to any query. In order to remove irrelevant viewsfrom the rewriting produced by the inverse-rules algorithmweneed to apply a subsequent constraint propagation phase (asin [LFS97,SR92]).

A key advantage of the inverse-rules algorithm is its con-ceptual simplicity and modularity. As shown in [DGL00],extending the algorithm to exploit functional dependencieson the database schema, recursive queries or the existence ofaccess-pattern limitations can be done by adding another set ofrules to the inverse rules. Furthermore, the algorithm producesthe maximally-contained rewriting in time that is polynomialin the size of the query and the views. Note that the algorithmdoes not tell us whether the maximally-contained rewritingis equivalent to the original query, which would contradictthe fact that the problem of finding an equivalent rewriting isNP-complete [LMSS95] (see Sect.7).

On the other hand, using the resulting rewriting producedby the algorithm for actually evaluating queries from the viewshas a significant drawback, since it insists on recomputing theextensions of the database relations. In our example, evalu-ating the inverse rules computes tuples forRegistered andMajor, and the query is then evaluated over these extensions.However, by doing that, we lose the fact that the view alreadycomputed the join that the query is requesting. Hence, muchof the computational advantage of exploiting the materializedview is lost.

In order to obtain a more efficient rewriting from the in-verse rules, we must unfold the inverse rules and remove re-dundant subgoals from the unfolded rules. Unfolding the rules

turns out to be similar to (but still slightly better than) the sec-ond phase of the bucket algorithm, where we consider theCartesian product of the buckets (see [PL00] for an experi-mental analysis).

6.4 The MiniCon algorithm

The MiniCon algorithm [PL00,PH01] addresses the limita-tions of the previous algorithms. The key idea underlyingthe algorithm is a change of perspective: instead of buildingrewritings by combining rewritings for each of the querysub-goalsor the database relation, we consider how each of thevariablesin the query can interact with the available views.The result is that the second phase of the MiniCon algorithmneeds to consider drastically fewer combinations of views.The following example illustrates the key idea of MiniCon.Consider the query

q(D) :- Major(S, D), Registered(S, 444, Q), Advises(P, S)

and the views:

V1(dept) :- Major(student,dept),Registered(student, 444, quarter).

V2(prof, dept, area) :- Advises(prof, student),Prof(name, area)

V3(dept, c-number) :- Major(student,dept),Registered(student, c-number, quarter),Advises(prof, student).

The bucket algorithm considers each of the subgoals inthe query in isolation, and therefore puts the viewV1 into thebuckets ofMajor(student, dept) andRegistered(student,444, quarter). However, a careful analysis reveals thatV1cannot possibly be useful in a rewriting of the query. Thereason is that since the variablestudent is not in the head ofthe view, then in order forV1 to be useful, it must contain thesubgoalAdvises(prof,student) as well. Otherwise, the joinon the variableS in the query cannot be applied using theresults ofV1.

The MiniCon algorithm starts out like the bucket algo-rithm, considering which views contain subgoals that corre-spond to subgoals in the query. However, once the algorithmfinds a partial variable mapping from a subgoalg in the queryto a subgoalg1 in a viewV , it changes perspective and looksat the variables in the query. The algorithm considers the joinpredicates in the query (which are specified bymultiple occur-rences of the same variable) and finds the minimal additionalset of subgoals thatmustto bemapped to subgoals inV , giventhatg will be mapped tog1. This set of subgoals and mappinginformation is called aMiniCon Description(MCD), and canbe viewed as a generalized bucket. Unlike buckets, MCDs areassociated withsetsof subgoals in the query. In the secondphase, the MCDs are combined to produce the query rewrit-ings.

In the above example, the algorithm will determine that itcannot create an MCD forV1 because it cannot apply the joinpredicates in the query.WhenV2 is considered, the MCDwillcontain only the subgoalAdvises(prof, student). WhenV3is considered, the MCDwill include all of the query subgoals.

The key advantage of the MiniCon algorithm is that thesecond phase of the algorithm considersmany fewer combina-tions of MCDs compared to the Cartesian product of the buck-ets or compared to the number of unfoldings of inverse rules.

Page 17: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

286 A.Y. Halevy: Answering queries using views: A survey

Thework in [PL00] describesadetailed set of experiments thatshows that the MiniCon significantly outperforms the inverserules algorithm, which in turn outperforms the bucket algo-rithm. The paper demonstrates exactly how these savings areobtained in the second phase of the algorithm. Furthermore,the experiments show that the algorithm scales up to hundredsof views with commonly occurring shapes such as chain, star,and complete queries (see [MGA97] for a description of thesequery shapes). The work in [PH01] also explains how to ex-ploit the key ideas of the theMiniCon algroithm to the contextof query optimization with materialized views, where the costof the query plan if the primary concern.

7 Theory of answering queries using views

In the previous sections we discussed specific algorithms foranswering queries using views. Here we consider several fun-damental issues that cut across all of the algorithms we havediscussed thus far, and which have been studied from a moretheoretical perspective in the literature.

The first question concerns thecompletenessof the queryrewriting algorithms. That is, given a set of views and a query,will the algorithm always find a rewriting of the query usingthe views if one exists? A related issue is characterizing thecomplexity of the query rewriting problem. We discuss theseissues in Sect.7.1.

Completeness of a rewriting algorithm is characterizedwith respect to a specific query language in which the rewrit-ings are expressed (e.g., select-project-join queries, querieswith union, recursion). For example, there are cases in whichif we do not allow unions in the rewriting of the query, thenwewill not be able to find an equivalent rewriting of a query usinga set of views. The language that we consider for the rewritingis even more crucial when we consider maximally-containedrewritings, because the notion of maximal containment is de-fined with respect to a specific query language. As it turnsout, there are several important cases in which a maximally-contained rewriting of a query can only be found if we con-siderrecursivedatalog rewritings. These cases are illustratedin Sect.7.2.

At the limit, we would like to be able to extract all thecertainanswers for a query given a set of views, whether wedo it by applying a query rewriting to the extensions of theviews or via an arbitrary algorithm. In Sect.7.3 we considerthe complexity of finding all the certain answers, and showthat even in some simple cases the problem is surprisinglyco-NP-hard in the size of the extensions of the views.

7.1 Completeness and complexity of finding query rewritings

The first question one can ask about an algorithm for rewrit-ing queries using views is whether the algorithm is complete:given a queryQ and a set of viewsV, will the algorithm finda rewriting ofQ usingV when one exists. The first answerto this question was given for the class of queries and viewsexpressed as conjunctive queries [LMSS95]. In that paper itwas shown that when the query does not contain comparisonpredicates and hasn subgoals, then there exists an equivalent

conjunctive rewriting ofQ usingV only if there is a rewrit-ing with at mostn subgoals. An immediate corollary of thebound on the size of the rewriting is that the problem of find-ing an equivalent rewriting of a query using a set of views isin NP, because it suffices to guess a rewriting and check itscorrectness.7

The bound on the size of the rewriting also sheds somelight on the algorithms described in the previous sections. Inparticular, it entails that the search strategy that the GMAPalgorithm [TSI96] employs is guaranteed to be complete underthe conditions that (1) we use a sound and complete algorithmfor query containment for testing equivalence of rewritings;(2) when combining two subplans, the algorithm considersall possible join predicates on the attributes of the combinedsubplans; and (3) we consider self-joins of the views. Theseconditions essentially guarantee that the algorithm searchesthrough all rewritings whose size is bounded by the size of thequery. It is important to emphasize that the rewriting of thequery that produces the mostefficientplan for answering thequery may havemoreconjuncts that the original query. Thebound of [LMSS95] also guarantees that the bucket algorithmis guaranteed to find themaximally-contained rewriting of thequery when the query does not contain arithmetic comparisonpredicates (but the views may), and that we consider unionsof conjunctive queries as the language for the rewriting.

In [LMSS95] it is also shown that the problem of find-ing a rewriting is NP-hard for two independent reasons: (1)the number of possible ways to map a single view into thequery; and (2) the number of ways to combine the mappingsof different views into the query.

In [RSU95] the authors extend the bound on the size of therewriting to the case where the views contain access-patternlimitations (discussed in detail in Sect.8.2). In [CR97] theauthors exploit the close connection between the containmentand rewriting problems, and show several polynomial-timecases of the rewriting problems, corresponding to analogouscases for the problem of query containment.

7.2 The need for recursive rewritings

As noted earlier, in cases where we cannot find an equiva-lent rewriting of the query using a set of views, we considerthe problem of finding maximally-contained rewritings. Ourhope is that when we apply the maximally-contained rewrit-ing to the extensions of the views, we will obtain the set ofall certain answers to the query (Definition 4). Interestingly,there are several contexts where in order to achieve this goalwe need to consider recursive datalog rewritings of the query[DGL00]. We recall that a datalog rewriting is a datalog pro-gram in which the base (EDB) predicates are the view re-lations, and there are additional intermediate IDB relations.Except for the obvious case in which the query is recursive[DG97a], other cases include: when we exploit the presenceof functional dependencies on the database relations or whenthere are access-pattern limitations on the extensions of the

7 Note that checking the correctness of a rewriting isNP-complete;however, the guess of a rewriting can be extended to a guess forcontainment mappings showing the equivalence of the rewriting andof the query.

Page 18: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 287

views [DL97] (see Sect.8.2 for a more detailed discussion),when views contain unions [Afr00] (though even recursiondoes not always suffice here), and the case where additionalsemantic information about class hierarchies on objects is ex-pressed using description logics [BLR97]. We illustrate thecase of functional dependencies below.

Example 4.To illustrate the need for recursive rewritings inthe presence of functional dependencies, we temporarily ven-ture into the domain of airline flights. Suppose we have thefollowing database relation

schedule(Airline,Flight no,Date,Pilot,Aircraft)

which stores tuples describing the pilot that is scheduled fora certain flight, and the aircraft that is used for this flight.Assume we have the following functional dependencies onthe relations in the mediated schema

Pilot → Airline andAircraft → Airline

expressing the constraints that pilots work for only one air-line, and that there is no joint ownership of aircrafts betweenairlines. Suppose we have the following view available, whichprojects thedate, pilot, andaircraft attributes from thedatabaserelation:

v(D,P,C) :- schedule(A,N,D,P,C)

The view v records on which date which pilot flies whichaircraft. Now consider a query asking for pilots that work forthe same airline as Mike (expressed as the following self joinon the attributeAirline of theschedule relation):

q(P) :- schedule(A,N,D,‘mike’,C), schedule(A,N’,D’,P,C’)

The viewv doesn’t record the airlines that pilots work for, andtherefore, deriving answers to the above query requires usingthe functional dependencies in subtle ways. For example, ifboth Mike and Ann are known to have flown aircraft #111,then, since each aircraft belongs to a single airline, and everypilot flies for only one airline, Ann must work for the sameairlineasMike.Moreover, if, in addition,Ann is known tohaveflown aircraft #222, and John has flown aircraft #222 then thesame line of reasoning leads us to conclude thatAnn and Johnwork for the same airline. In general, for any value ofn, thefollowing conjunctive rewriting is a contained rewriting:

qn(P ) :− v(D1, mike, C1), v(D2, P2, C1),v(D3, P2, C2), v(D4, P3, C2), . . . ,v(D2n−2, Pn, Cn−1), v(D2n−1, Pn, Cn), v(D2n, P, Cn)

Moreover, for eachn, qn(P ) may provide answers thatwere not given byqi for i < n, because one can always buildan extension of the viewv that requiresn steps of chainingin order to find answers to the query. The conclusion is thatwe cannot find a maximally-contained rewriting of this queryusing the views if we only consider non-recursive rewritings.Instead, the maximally-contained rewriting is the followingdatalog program:

relevantPilot(“mike”).relevantAirCraft(C) :- v(D, “mike”, C).relevantAirCraft(C) :- v(D,P,C), relevantPilot(P).relevantPilot(P) :- relevantPilot(P1), relevantAirCraft(C),

v(D1, P1, C), v(D2, P, C).

In the program above, the relationrelevantPilot will in-clude the set of pilots who work for the same airline as Mike,and the relationrelevantAirCraft will include the aircraftflown by relevant pilots. Note that the fourth rule is mutuallyrecursive with the definition ofrelevantAirCraft. �

In [DL97,DGL00] it is shown how to augment the inverse-rules algorithm to incorporate functional dependencies. Thekey element of that algorithm is to add a set of rules thatsimulate the application of a Chase algorithm [MMS79] onthe atoms of the database relations.

7.3 Finding the certain answers

A different perspective on the problem of answering queriesusing views is the following. Given a set of materialized viewsand the corresponding view definitions, we obtain somein-completeinformation about the contents of the database.Morespecifically, the views define a set ofpossibleunderlyingdatabasesD. Given a queryQ over the database and a tuplet,there are a few possibilities: (1)twould be an answer toQ forevery database inD; (2) t is an answer toQ for some databasein D; or (3)t is not an answer toQ for any database inD. Thenotion of certain answers, (see Definition 4) formalizes case(1).

If Q′ is anequivalent rewritingof a queryQ using theset of viewsV, then it will always produce the same result asQ, independent of the state of the database or of the views.In particular, this means thatQ′ will always produce all thecertain answers toQ for any possible database.

WhenQ′ is amaximally-contained rewritingof Q usingthe viewsV it may produce only a subset of the answers ofQ for a given state of the database. The maximality ofQ′ isdefined only with respect to the other possible rewritings in aparticular query languageL that we consider forQ′. Hence,the question that remains is how to find all the certain answers,whetherwedo it byapplying some rewrittenquery to theviewsor by some other algorithm.

The question of finding all the certain answers is consid-ered in detail in [AD98,GM99a]. In their analysis they distin-guish the case of theopen-world assumptionfrom that of theclosed-world assumption. With the closed-world assumption,the extensions of the views are assumed to containall the tu-ples that would result from applying the view definition to thedatabase. Under the open-world assumption, the extensionsof the views may be missing tuples. The open-world assump-tion is especially appropriate in data integration applications,where the views describe sources that may be incomplete (see[EGW97,Lev96,Dus97] for treatments of complete sourcesin the data integration context). The closed-world assumptionis appropriate for the context of query optimization and main-tainingphysical data independence,where viewshaveactuallybeen computed from existing database relations.

Under the open-world assumption, [AD98] show that inmany practical cases, finding all the certain answers can bedone in polynomial time. However, the problem becomes co-NP-hard (in the size of the view extensions!) as soon as weallow union in the language for defining the views, or allowthe predicate�= in the language defining the query.

Under the closed-world assumption the situation is evenworse. Even when both the views and the query are defined by

Page 19: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

288 A.Y. Halevy: Answering queries using views: A survey

conjunctive queries without comparison predicates, the prob-lem of finding all certain answers is already co-NP-hard. Thefollowing example is the crux of the proof of the co-NP-hardness result [AD98].

Example 5.The following example shows a reduction of theproblem of graph 3-colorability to the problem of finding allthe certain answers. Suppose the relationedge(X,Y) encodesthe edges of a graph, and the relationcolor(X,Z) encodeswhich colorZ is attached to the nodes of the graph. Considerthe following three views:

V1(X) :- color(X,Y)V2(Y) :- color(X,Y)V3(X,Y) :- edge(X,Y)

where the extension ofV1 is the set of nodes in a graph, theextension ofV2 is the set{red, green, blue}, and the extensionof V3 is the set of edges in the graph. Consider the followingquery:

q(c) :- edge(X,Y), color(X,Z), color(Y,Z)

In [AD98] it is shown thatc is a certain answer toq ifand only if the graph encoded byedge isnot three-colorable.Hence, they show that the problem of finding all certain an-swers is co-NP-hard. �

The hardness of finding all the certain answers provides aninteresting perspective on formalisms for data integration. In-tuitively, the result entails that when we use views to describethe contents of data sources, even if we only use conjunctivequeries to describe the sources, the complexity of finding allthe answers to a query from the set of sources is co-NP-hard.In contrast, using a formalism inwhich the relations of theme-diated schemaare described by views over the source relations(as in [GMPQ+97]), the complexity of finding all the answersis always polynomial. Hence, this result hints that the formerformalism has a greater expressive power as a formalism fordata integration.

It is also interesting to note the connection established in[AD98] between the problem of finding all certain answersand computation with conditional tables [IL84]. As the au-thors show, the partial information about the database that isavailable from a set of views can be encoded as a conditionaltable using the formalism studied in [IL84], providing a for-malization to the intuition starting out this section.

The work in [GM99a] also considers the case where theviews may either be incomplete, complete, or contain tuplesthat don’t satisfy the view definition (referred to asincorrecttuples). It is shown that without comparison predicates in theviews or the query, when either all the views are complete orall of them may contain incorrect tuples, finding all certainanswers can be done in polynomial time in the size of theview extensions. In other cases, the problem is co-NP-hard.The work in [MM01] considers the query answering problemin cases where we may have bounds on the soundness and/orcompleteness of the views.

Finally, [MLF00] considers the problem of relative querycontainment, i.e., whether the set of certain answers of a queryQ1 is always contained in the set of certain answers of a queryQ2.Thepaper shows that for the conjunctivequeriesandviewswith no comparison predicates the problem isΠp

2 -complete,and that the problem is still decidable in the presence of accesspattern limitations.

8 Extensions to the query language

In this section we survey the algorithms for answering queriesusing views in the context of several important extensionsto the query languages considered thus far. We consider ex-tensions for Object Query Language (OQL) [FRV96,Flo96,DPT99], and views with access pattern limitations [RSU95,KW96,DL97].

8.1 Object query language

In [FRV96,Flo96] the authors studied the problem of answer-ing queries using views in the context of querying object-oriented databases, and have incorporated their algorithm intothe FloraOQL optimizer. In object-oriented databases the cor-respondence between thelogical model of the data and thephysicalmodel is even less direct than in relational systems.Hence, as argued in [Flo96], it is imperative for a query opti-mizer for object-oriented database be based on the notion ofphysical data independence.

Answering queries using views in the context of object-oriented systems introduces two key difficulties. First, findingrewritings often requires that we exploit some semantic in-formation about the class hierarchy and about the attributesof classes. Second, OQL does not make a clean distinctionbetween theselect , from andwhere clauses as in SQL.Select clauses may contain arbitrary expressions, and thewhere clauses also allow path navigation.

Thealgorithm foransweringqueriesusingviewsdescribedin [Flo96] operates in two phases. In the first phase the algo-rithm rewrites the query into a canonical form, thereby ad-dressing the lack of distinction between theselect , fromandwhere clauses.As an example, in this phase, navigationalexpressions are removed from thewhere clause by introduc-ing new variables and terms in thefrom clause.

In the second phase, the algorithm exploits semanticknowledge about the class hierarchy in order to find a subex-pression of the query that is matched by one of the views.When such amatch is found, the subexpression in the query isreplaced by a reference to the view and appropriate conditionsare added in order to conserve the equivalence to the query.

We illustrate the main novelties of the algorithm with thefollowing example from [Flo96], using a French version ofour university domain. Here we have the classUniversities,with subclassFrance.Universities and the classCity. Thefirst two classes have the attributesstudents, PhDstudents(a sub-attribute ofstudents), professors andadjuncts.

Example 6.Suppose we have the following view asking forstudents who are at least as old as their professors, and whostudy in universities in small cities. Below we use the notationof OQL. Note that theselect clause of OQL defines the recordstructure of the result. In addition, note the use of path expres-sions – for example,y in x.students means that the variabley will be bound to each of the students of the object to whichx will be bound.

create view V1 asselect distinct [A:=x.name, B:=y.identifier, C:=z]from x in Universities, y in x.students,

z in union(x.professors, x.adjuncts)where x.city.kind="small" and y.age ≥ z.age.

Page 20: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 289

Suppose a query asks for Ph.D students in French universitieswho have the same age as their professors, and study in smalltown universities:

select distinct [D:=u.name, E:=v.name, F:=t.name]from u in France.Universities, v in u.PhDstudents,

t in u.professorswhere u.city.kind="small" and v.age=t.age.

In the first step, the algorithmwill transform the query andthe view into their normal form. The resulting expression forthe query would be: (note that the variablew was added tothe query in order to eliminate the navigation term from thewhere clause)

select distinct [D:=u.name, E:=v.name, F:=t.name]from u in France.Universities, w in City,

v in u.PhDstudents, t in u.professorswhere w.kind="small" and v.age=t.age and u.city=w.

In the next step, the algorithmwill note the following prop-erties of the schema:

1. The collectionFrance.Universities is included in the col-lectionUniversities,

2. The collection denoted by the expressionu.PhDstudentsis included in the collection denoted byx.students. Thisinclusion follows from the first inclusion and the fact thatPhD students are a subset of students.

3. The collectionu.professors is included in the collectionunion(x.professors, x.adjuncts).

Putting these three inclusions together, the algorithm de-termines that the view can be used to answer the query, be-cause the selections in the view are less restrictive than thosein the query. The rewriting of the query using the view is thefollowing:

select distinct [D:=a.A, E:=a.B.name, F:=t.name]from a in V1, u in France.Universities,

v in u.PhDstudents, t in u.professorswhere u.city.kind="small" and v.age=t.age and

u.name=a.A and v.name=a.B and t=a.C.

Note that the role of the view is only to restrict the possiblebindings of the variables used in the query. In particular, thequery still has to restrict the universities to only the Frenchones, the students to only the Ph.Ds, and the range of thevariablet to cover only professors. In this case, the evaluationof the query using the view is likely to be more efficient thancomputing the query only from the class extents. �

As noted in Sect.5.2, the algorithm described in [DPT99,PDST00] also considers certain types of queries over object-oriented data.

8.2 Access pattern limitations

In the context of data integration, where data sources aremod-eled as views, we may have limitations on the possible accesspaths to the data. For example, when querying the InternetMovie Database, we cannot simply ask for all the tuples in thedatabase. Instead, we must supply one of several inputs, (e.g.,actor name or director), and obtain the set of movies in whichthey are involved.

We can model limited access paths by attaching a set ofadornments to every data source. If a source is modeled by aview with n attributes, then an adornment consists of a stringof lengthn, composed of the lettersb andf . The meaning ofthe letterb in an adornment is that the sourcemustbe givenvalues for the attribute in that position. The meaning of theletterf in an adornment is that the source doesn’t have to begiven a value for the attribute in that position. For example,an adornmentbf for a viewV (A,B) means that tuples ofVcan be obtained only by providing values for the attributesA.

Several works have considered the problem of answeringqueries using views when the views are also associated withadornments describing limited access patterns. In [RSU95] itis shown that the bound given in [LMSS95] on the length of apossible rewriting does not hold anymore. To illustrate, con-sider the following example, where the binary relationCitesstores pairs of papersX,Y , whereX citesY . Suppose wehave the following views with their associated adornments:

CitationDBbf (X,Y) :- Cites(X,Y)CitingPapersf (X) :- Cites(X,Y)

and suppose we have the following query asking for all thepapers citing paper #001:

Q(X) :- Cites(X,001)

The bound given in [LMSS95] would require that if thereexists a rewriting, then there is one with at most one atom, thesize of the query. However, the only possible rewriting in thiscase is:

q(X) :- CitingPapers(X), CitationDB(X,001).

[RSU95] shows that in the presence of access-pattern lim-itations it is sufficient to consider a slightly larger bound onthe size of the rewriting:n + v, wheren is the number ofsubgoals in the query andv is the number of variables in thequery. Hence, the problem of finding an equivalent rewritingof the query using a set of views is still NP-complete.

The situation becomes more complicated when we con-sider maximally-contained rewritings. As the following ex-ample given in [KW96] shows, there may beno bound onthe size of a rewriting. In the following example, the relationDBpapers denotes the set of papers in the database field, andthe relationAwardPapers stores papers that have receivedawards (in databases or any other field). The following viewsare available:

DBSourcef (X) :- DBpapers(X)CitationDBbf (X,Y) :- Cites(X,Y)AwardDBb(X) :- AwardPaper(X)

The first source provides all the papers in databases, and hasno access-pattern limitations. The second source, when givena paper, will return all the papers that are cited by it. The thirdsource, when given a paper, returns whether the paper is anaward winner or not.

The query asks for all the papers that won awards:

Q(X) :- AwardPaper(X).

Since the viewAwardDB requires its input to be bound,we cannot query it directly. One way to get solutions to thequery is to obtain the set of all database papers from theviewDBSource, and perform a dependent join with the view

Page 21: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

290 A.Y. Halevy: Answering queries using views: A survey

AwardDB. Another way would be to begin by retrieving thepapers inDBSource, join the result with the viewCitationDBto obtainall papers citedbypapers inDBSource, and then jointhe result with the viewAwardDB. As the rewritings belowshow, we can follow any length of citation chains beginningwith papers inDBSource and obtain answers to the query thatwere possibly not obtained by shorter chains. Hence, there isno bound on the length of a rewriting of the query using theviews.

Q’(X) :- DBSource(X), AwardDB(X)Q’(X) :- DBSource(V), CitationDB(V,X1), . . . ,

CitationDB(Xn,X), AwardDB(X).

Fortunately, as shown in [DL97,DGL00], we can still findafinite rewritingof thequeryusing the views, albeit a recursiveone. The following datalog rewriting will obtain all the pos-sible answers from the above views. The key in constructingthe program is to define a new intermediate relationpaperswhose extension is the set of all papers reachable by citationchains from papers in databases, and is defined by a transitiveclosure over the viewCitationDB.

papers(X) :- DBsource(X)papers(X) :- papers(Y), CitationDB(Y,X)Q’(X) :- papers(X), AwardDB(X).

In [DL97] it is shown that amaximally-contained rewritingof the query using the views can always be obtained witha recursive rewriting. In [FW97] and [LKG99] the authorsdescribe additional optimizations to this basic algorithm.

8.3 Other extensions

Several authors have considered additional extensions of thequery rewriting problems in various contexts. We mentionsome of them here.

Extensions to the query and schema language:In [AGK99,Dus98] the authors consider the rewriting problem when theviews may contain unions. The consideration of inclusion de-pendencies on the database relations introduces several sub-tleties to the query rewriting problem, which are considered in[Gry98]. In [Mil98], the author considers the query rewritingproblem for a language that enables querying the schema anddata uniformly, and hence, names of attributes in the data maybecome constants in the extensions of the views. In [MRP99]the authors show that when the schema contains a single uni-versal relation, answering queries using views and several re-lated operations can be done more efficiently.

Semi-structured data:The emergence of XML as a standardfor sharing data on theWWW has spurred significant interestin building systems for integrating XML data from multiplesources.Theemerging formalisms formodelingXMLdata arevariations on labeled directed graphs, which have also beenused to model semi-structured data [Abi97,Bun97,ABS99].The model of labeled directed graphs is especially well suitedfor modeling the irregularity and the lack of schema which

are inherent in XML data. Several languages have been devel-oped for querying semi-structured data and XML [AQM+97,FFLS97,BDHS96,DFF+99,CRF00].

Several works have started considering the problem of an-swering queries using views when the views and queries areexpressed in a language for querying semi-structured data.There are two main difficulties that arise in this context. First,such query languages enable usingregular path expressionsin the query, to express navigational queries over data whosestructure is not well known a priori. Regular path expressionsessentially provideavery limitedkindof recursion in thequerylanguage. In [CGLV99] the authors consider the problem ofrewriting a regular path query using a set of regular path views,and show that the problem is in 2EXPTIME (and checkingwhether the rewriting is an equivalent one is in 2EXPSPACE).In [CGLV00a] the authors consider the problem of finding allthe certain answers when queries and views are expressed us-ing regular path expressions, and show that the problem isco-NP-complete when data complexity (i.e., size of the viewextensions) is considered. In [CGLV00b] the authors extendthe results of [CGLV99,CGLV00a] to path expressions thatinclude the inverse operator, allowing both forward and back-ward traversals in a graph.

The second problem that arises in the context of semi-structured data stems from the rich restructuring capabilitieswhich enable the creation of arbitrary graphs in the output.The output graphs can also include nodes that did not existin the input data. In [PV99] the authors consider the rewrit-ing problem in the case where the query can createanswertrees, and queries that do not involve regular path expressionswith recursion. For the most part, considering queries withrestructuring remains an open research problem.

Infinite number of views:Two works have considered theproblem of answering queries using views in the presence ofan infinite number of views [LRU96,VP97]. The motivationfor this seemingly curious problem is that when a data sourcehas the capability to performlocal processing, it can be mod-eled by the (possibly infinite) set of views it can supply, ratherthan a single one.As a simple example, consider a data sourcethat stores a set of documents, and can answer queries of theform:

q(doc) :- document(doc), contains(doc, w1), . . .,contains(doc,wn)

wherewe can have any number of occurrences of thecontainssubgoal, each with a different word.

To answer queries using such sources, one need not onlychoosewhich sources to query, but wemust also choosewhichquery to send to it out of the set of possible queries it cananswer. In [LRU96,VP97] it is shown that in certain importantcases the problem of answering a query using an infinite setof views is decidable. Of particular note is the case in whichthe set of views that a source can answer is described by thefinite unfoldings of a datalog program.

Description logics:Description logics are a family of logicsfor modeling complex hierarchical structures. A descriptionlogic makes it possible to define sets of objects by specifying

Page 22: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 291

their properties, and then to reason about the relationship be-tween these sets (e.g., subsumption, disjointness). A descrip-tion logic also enables reasoning about individual objects, andtheir membership in different sets. One of the reasons that de-scription logics are useful in data management is their abilityto describe complex models of a domain and reason aboutinter-schema relationships [CL93]. For that reason, descrip-tion logics have been used in several data integration systems[AKS96,LRO96a]. Borgida [Bor95] surveys the use of de-scription logics in data management.

Several works have considered the problem of answeringqueries using viewswhen description logics are used tomodelthe domain. In [BLR97] it is shown that in general, answer-ing queries using views in this context may be NP-hard, andpresents cases in which we can obtain a maximally-containedrewritingof aquery in recursivedatalog.Thecomplexity of an-sweringqueriesusingviews foranexpressivedescription logic(which also includes n-ary relations) is studied in [CGL99].

9 Conclusions

As this survey has shown, the problem of answering queriesusing views raises amultitude of challenges, ranging from the-oretical foundations to considerations of a more practical na-ture. While algorithms for answering queries using views arealready being incorporated into commercial database systems(e.g., [BDD+98,ZCL+00]), these algorithms will have evenmore importance in data integration systems and data ware-house design. Furthermore, answering queries using views isa key technique to give database systems the ability of main-taining physical data independence.

There are many issues that remain open in this realm. Al-thoughwe have touched upon several query languages and ex-tensions thereof,many cases remain to be investigated.Of par-ticular note are studying rewriting algorithms in the presenceof a wider class of integrity constraints on both the databaseand view relations, and studying the effect of restructuringcapabilities of query languages (as in OQL or languages forquerying semistructured data [BDHS96,AQM+97,FFLS97,DFF+99,CRF00]).

As described in the article, different motivations have ledto two strands of work on answering queries using views, onein the context of optimization and the other in the context ofdata integration. In part, these differences are due to the factthat in the data integration context the algorithms search for amaximally-contained rewriting of the query and assume thatthe number of views is relatively large. However, as we illus-trated, the principles underlying the two strands are similar.Furthermore, interesting challenges arise as we try to bridgethe gaps between these bodies of work. The first challenge isto extend the work on query optimization to handle a muchlarger number of more complex views. The second challengeis to extend data integration algorithms to choose judiciouslythe best rewritings of the query. This can be done by eithertrying to order the access to the data sources (as in [FKL97,DL99,NLF99]), or to combine the choice of rewritings withother adaptive aspects of query processing explored in dataintegration systems (e.g., [UFA98,IFF+99]).

The context of data warehouse design, when one tries toselect a set of views to materialize in the warehouse, raises

another challenge.Thedatawarehousedesignproblem isoftentreated as a problem ofsearchthrough a set of warehouseconfigurations. In each configuration, we need to determinewhether the workload queries anticipated on the warehousecan be answered using the selected views, and estimate thecost of the configuration. In this context it is important to beable to reuse the results of the computation from the previousstate in the search space. In particular, this raises the challengeof developingincrementalalgorithms for answering queriesusing views, which can compute rewritings more efficientlywhen only minor changes are made to the set of availableviews.

In this survey we considered the problem of using ma-terialized views when they are available. I believe thatthe next challenge isselectingwhich views to material-ize in the first place. The problem of view selection alsohas a surprising number of potential applications, such asquery optimization, data warehousing, web-site design, con-tent distribution networks, peer-to-peer computing and ubiq-uitous computing. Even though there has been work onthis problem (e.g., [CHS01,ACN00,Gup97a,CG00,GM99c,TS97,YKL97,BPT97,GHRU97,HRU96,GHI+01]), the re-search is still in its infancy. The wealth of techniques devel-oped for answering queries using views will be very useful inthis realm.

Acknowledgements.I would like to thank Phil Bernstein, MikeCarey, Anhai Doan, Todd Millstein, Rachel Pottinger, Arnie Rosen-thal, Igor Tatarinov and the anonymous reviewers for valuable com-ments on earlier drafts of this paper. I would like to acknowledgethe support of a Sloan Fellowship, NSF Grants #IIS-9978567 and#IIS-9985114, and gifts fromMicrosoft Corporation and from NEC,Japan.

References

[Abi97] Abiteboul S. Querying semi-structured data. In: Proc.of ICDT. pp 1–18, Delphi, Greece, 1997

[ABS99] Abiteboul S., Buneman P., Suciu D. Data on the Web.Morgan Kaufmann, San Francisco, 1999

[ACN00] Agrawal S., Chaudhuri S., NarasayyaV. Automated se-lection of materialized views and indexes in MicrosoftSQL Server. In: Proc. of VLDB. pp 496–505, Cairo,Egypt, 2000

[ACPS96] Adali S., Candan K., Papakonstantinou Y., Subrahma-nianV.S. Query caching and optimization in distributedmediator systems. In: Proc. of SIGMOD. pp 137–148,Montreal, Canada, 1996

[AD98] Abiteboul S., Duschka O. Complexity of answeringqueries using materialized views. In: Proc. of PODS.pp 254–263, Seattle, Wash., USA, 1998

[Afr00] Afrati F. Personal communication, 2000[AGK99] Afrati F., Gergatsoulis M., Kavalieros T. Answering

queries using materialized views with disjunctions. In:Proc. of ICDT. pp 435–452, 1999

[AHV95] Abiteboul S., Hull R., Vianu V. Foundations ofdatabases. Addison Wesley, Reading, Mass., USA,1995

[AKS96] ArensY., KnoblockC.A., ShenW.M. Query reformula-tion for dynamic information integration. Int. J. Intell.Coop. Inf. Syst. (6)2/3:99–130, 1996

Page 23: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

292 A.Y. Halevy: Answering queries using views: A survey

[ALU01] Afrati F., Li C., Ullman J. Generating efficient plans forqueries using views. In: Proc. of SIGMOD. pp 319–330, 2001

[AQM+97] Abiteboul S., Quass D., McHugh J., Widom J., WienerJ. The Lorel query language for semistructured data.Int. J. Digital Libr. 1(1):68–88, 1997

[BDD+98] Bello R., Dias K., Downing A., Feenan J., Finnerty J.,NorcottW., SunH.,WitkowskiA., ZiauddinM. Materi-alized views inOracle. In: Proc. ofVLDB. pp 659–664,1998

[BDHS96] Buneman P., Davidson S., Hillebrand G., Suciu D. Aquery language and optimization techniques for un-structured data. In: Proc. of SIGMOD. pp 505–516,Montreal, Canada, 1996

[BLR97] Beeri C., Levy A.Y., Rousset M.C. Rewriting queriesusing views in description logics. In: Proc. of PODS.pp 99–108, Tucson, Ariz., USA, 1997

[Bor95] Borgida A. Description logics in data management.IEEE Trans. Knowl. Data Eng. 7(5):671–682, 1995

[BPT97] Baralis E., Paraboschi S., Teniente E. Materializedviews selection in a multidimensional database. In:Proc. of VLDB. pp 156–165, 1997

[Bun97] Buneman P. Semistructured data. In: Proc. of PODS.pp 117–121, Tucson, Ariz., USA, 1997

[CG00] Chirkova R., Genesereth M. Linearly bounded refor-mulationsof conjunctivedatabases. In:Proc. ofDOOD.pp 987–1001, 2000

[CGL99] CalvaneseD.,DeGiacomoG., LenzeriniM.Answeringqueries using views in description logics. In: Workingnotes of the KRDBWorkshop. 1999

[CGLV99] Calvanese D., De Giacomo G., Lenzerini M., VardiM. Rewriting of regular expressions and regular pathqueries. In: Proc. of PODS. pp 194–204, 1999

[CGLV00a] Calvanese D., De Giacomo G., Lenzerini M., Vardi M.Answering regular path queries using views. In: Proc.of ICDE. pp 389–398, 2000

[CGLV00b] Calvanese D., De Giacomo G., Lenzerini M., Vardi M.View-based query processing for regular path querieswith inverse. In: Proc. of PODS. pp 58–66, 2000

[CHS01] Chirkova R., HalevyA., Suciu D. A formal perspectiveon the viewselection problem. In: Proc. ofVLDB. 2001

[CKPS95] Chaudhuri S., Krishnamurthy R., Potamianos S., ShimK. Optimizing queries with materialized views. In:Proc. of ICDE. pp 190–200, Taipei, Taiwan, 1995

[CL93] Catarci T., Lenzerini M. Representing and using inter-schemaknowledge in cooperative information systems.J. Intell. Coop. Inf. Syst.. pp 55–62, 1993

[CM77] Chandra A.K., Merlin P.M. Optimal implementationof conjunctive queries in relational databases. In: Proc.NinthAnnualACMSymposium onTheory of Comput-ing. pp 77–90, 1977

[CN98a] Chaudhuri S., NarasayyaV.R. Autoadmin ’what-if’ in-dex analysis utility. In: Proc. of SIGMOD. pp 367–378,1998

[CN98b] Chaudhuri S., Narasayya V.R. Microsoft index tuningwizard forSQL Server 7.0. In: Proc. of SIGMOD. pp553–554, 1998

[CNS99] Cohen V.R., Nutt W., Serebrenik A. Rewriting aggre-gate queries using views. In: Proc. of PODS. pp 155–166, 1999

[CR94] Chen C., Roussopoulos N. Implementation and perfor-mance evaluation of the ADMS query optimizer. In:Proc. of EDBT. pp 323–336, March 1994

[CR97] Chekuri C., Rajaraman A. Conjunctive query contain-ment revisited. In: Proc. of ICDT. pp 56–70, Delphi,Greece, 1997

[CRF00] Chamberlin D.D., Robie J., Florescu D. Quilt: an XMLquery language for heterogeneous data sources. In:WebDB (Informal Proceedings) 2000. pp 53–62, 2000

[CV92] Chaudhuri S.,Vardi M. On the equivalence of recursiveand nonrecursive datalog programs. In: Proc. of PODS.pp 55–66, San Diego, Calif., USA, 1992

[CV93] Chaudhuri S., Vardi M. Optimizing real conjunctivequeries. In: Proc. of PODS. pp 59–70, WashingtonD.C., USA, 1993

[CV94] Chaudhuri S., Vardi M. On the complexity of equiva-lence between recursive and nonrecursive datalog pro-grams. In: Proc. of PODS. pp 55–66, Minneapolis,Minn., USA, 1994

[DFF+99] DeutschA., FernandezM., Florescu D., LevyA., SuciuD. A query language for XML. In: Proc. World-WideWeb 8 Conference. pp 1155–1169, 1999

[DFJ+96] Dar S., Franklin M.J., Jonsson B., Srivastava D., TanM. Semantic data caching and replacement. In: Proc.of VLDB. pp 330–341, 1996

[DFS99] Deutsch A., Fernandez M., Suciu D. Storing semi-structured data with STORED. In: Proc. of SIGMOD.pp 431–442, 1999

[DG97a] Duschka O.M., Genesereth M.R. Answering recursivequeries using views. In: Proc. of PODS. pp 109–116,Tucson, Ariz., USA, 1997

[DG97b] DuschkaO.M., GeneserethM.R. Query planning in in-fomaster. In: Proc.ACMSymposium onApplied Com-puting. pp 109–111, San Jose, Calif., USA, 1997

[DGL00] Duschka O., Genesereth M., Levy A. Recursive queryplans for data integration. In: Special issue on LogicBased Heterogeneous Information Systems. J. LogicProgram. 43(1):49–73, 2000

[DL97] Duschka O.M. LevyA.Y. Recursive plans for informa-tion gathering. In: Proc. 15th International Joint Con-ference on Artificial Intelligence. pp 778–784, 1997

[DL99] Doan A., Levy A. Efficiently ordering query plans fordata integration. In: IJCAI Workshop on IntelligentData Integration. Stockholm, Sweden, August 1999

[DPT99] Deutsch A., Popa L., Tannen V. Physical data inde-pendence, constraints and optimization with universalplans. In: Proc. of VLDB. pp 459–470, 1999

[Dus97] Duschka O. Query optimization using local complete-ness. In: Proc. AAAI 14th National Conference onAr-tificial Intelligence. pp 249–255, 1997

[Dus98] Duschka O.M. Query planning and optimization in in-formation integration. PhD thesis, Stanford University,Stanford, Calif., USA, 1998

[EGW97] Etzioni O., Golden K., Weld D.S. Sound and efficientclosed-world reasoning for planning.Artif. Intell. 89(1-2):113–148, 1997

[FFLS97] Fernandez M., Florescu D., LevyA., Suciu D. A querylanguage for aweb-sitemanagement system. SIGMODRecord 26(3):4–11, 1997

[FK99] Florescu D., Kossmann D. Storing and querying XMLdata using an rdbms. IEEE Data Eng. Bull. 22(3):27–34, 1999

[FKL97] Florescu D., Koller D., Levy A. Using probabilisticinformation in data integration. In: Proc. of VLDB. pp216–225, Athens, Greece, 1997

[FLM98] Florescu D., Levy A., Mendelzon A. Database tech-niques for the world-wide web: a survey. SIGMODRecord 27(3):59–74, 1998

Page 24: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

A.Y. Halevy: Answering queries using views: A survey 293

[Flo96] Florescu D.D. Search spaces for object-oriented queryoptimization. PhD thesis, Univerisity of Paris VI.France, 1996

[FLSY99] Florescu D., Levy A., Suciu D., Yagoub K. Optimiza-tion of run-time management of data intensive websites. In: Proc. of VLDB. pp 627–638, 1999

[FRV96] FlorescuD.,RaschidL.,ValduriezP.Answeringqueriesusing OQL view expressions. In: Workshop on Mate-rialized Views, in cooperation with ACM SIGMOD.Montreal, Canada, 1996

[FW97] Friedman M., Weld D. Efficient execution of infor-mation gathering plans. In: Proc. International JointConference on Artificial Intelligence, Nagoya, Japan.pp 785–791, 1997

[GBLP98] Gray J., Bosworth A., Layman A., Pirahesh H. Datacube: a relational aggregation operator generalizinggroup-by, cross-tab and sub-totals. In: Proc. of ICDE.pp 152–159, 1998

[GHI+01] Gribble S., Halevy A., Ives Z., Rodrig M., Suciu D.What can databases do for peer-to-peer? In:ACMSIG-MODWebDBWorkshop 2001. 2001

[GHQ95] Gupta A., Harinarayan V., Quass D. Aggregate-queryprocessing in datawarehousing environments. In: Proc.of VLDB. pp 358–369, 1995

[GHRU97] Gupta H., Harinarayan V., Rajaraman A., Ullman J.D.Index selection for OLAP. In: Proc. of ICDE. pp 208–219, 1997

[GL01] Goldstein J., Larson P.A. Optimizing queries usingma-terialized views: a practical, scalable solution. In: Proc.of SIGMOD. pp 331–342, 2001

[GM99a] Grahne G., Mendelzon A.O. Tableau techniques forquerying information sources through global schemas.In: Proc. of ICDT. pp 332–347, 1999

[GM99b] Gupta A., Mumick I., (eds.) Materialized views: tech-niques, implementations and applications. MIT, Cam-bridge, Mass., USA, 1999

[GM99c] GuptaH.,Mumick I.S. Selectionof views tomaterializeunder amaintenance cost constraint. In: Proc. of ICDT.pp 453–470, 1999

[GMPQ+97] Garcia-MolinaH., PapakonstantinouY.,QuassD., Ra-jaraman A., SagivY., Ullman J., Widom J. The TSIM-MIS project: Integration of heterogeneous informationsources. J. Intell. Inf. Syst. 8(2):117–132, 1997

[GRT99] Grumbach S., Rafanelli M., Tininini L. Querying ag-gregate data. In: Proc. of PODS. pp 174–184, 1999

[Gry98] Gryz J. Query folding with inclusion dependencies. In:Proc. of ICDE. pp 126–133, Orlando, Fla., USA, 1998

[GT00] Grumbach S., Tininini L. On the content of materialzedaggregate views. In: Proc. of PODS. 2000

[Gup97a] Gupta H. Selection of views to materialize in a datawarehouse. In: Proc. of ICDT. pp 98–112, 1997

[Gup97b] Gupta H. Selection of views to materialize in a datawarehouse. In: Proc. of ICDT. pp 98–112, Delphi,Greece, 1997

[HFLP89] Haas L., Freytag J., LohmanG., PiraheshH. Extensiblequery processing in Starburst. In: Proc. of SIGMOD.pp 377–388, 1989

[HRU96] HarinarayanV.,RajaramanA.,UllmanJ.D. Implement-ing data cubes efficiently. In: Proc. of SIGMOD. pp205–216, 1996

[IFF+99] Ives Z., Florescu D., Friedman M., Levy A., Weld D.Anadaptivequeryexecutionengine fordata integration.In: Proc. of SIGMOD. pp 299–310, 1999

[IL84] Imielinski T., Lipski W. Incomplete information in re-lational databases. J. ACM. 31(4):761–791, 1984

[KB96] KellerA.M., Basu J. A predicate-based caching schemefor client-server database architectures. VLDB J.5(1):35–47, 1996

[Klu88] KlugA. Onconjunctive queries containing inequalities.J. ACM. pp 35(1):146–160, 1988

[KMT98] Kolaitis P., Martin D., Thakur M. On the complexity ofthe containment problem for conjunctive queries withbuilt-in predicates. In: Proc. of PODS. pp 197–204,Seattle, Wash., USA, 1998

[KW96] Kwok C.T., Weld D.S. Planning to gather information.In: Proc. AAAI 13th National Conference on ArtificialIntelligence. pp 32–39, 1996

[Lev96] Levy A.Y. Obtaining complete answers from incom-plete databases. In: Proc. of VLDB. pp 402–412, Bom-bay, India, 1996

[Lev00] Levy A.Y. Logic-based techniques in data integration.In: MinkerJ (ed.) Logic-based artificial intelligence.Kluwer Academic, Dordrecht, 2000, pp 575–595

[LFS97] LevyA.Y., Fikes R.E., Sagiv S. Speeding up inferencesusing relevance reasoning: a formalism and algorithms.Artif. Intell. 97(1–2), 1997

[LKG99] Lambrecht E., Kambhampati S., Gnanaprakasam S.Optimizing recursive information gathering plans. In:Proc. 16th International Joint Conference on ArtificialIntelligence. pp 1204–1211, 1999

[LMSS95] Levy A.Y., Mendelzon A.O., Sagiv Y., Srivastava D.Answering queries using views. In: Proc. of PODS. pp95–104, San Jose, Calif, USA, 1995

[LRO96a] LevyA.Y., RajaramanA., Ordille J.J. Query answeringalgorithms for information agents. In: Proc. NationalConference on Artificial Intelligence. pp 40–47, 1996

[LRO96b] LevyA.Y., RajaramanA., Ordille J.J. Querying hetero-geneous information sources using source descriptions.In: Proc. of VLDB. pp 251–262, Bombay, India, 1996

[LRU96] Levy A.Y., Rajaraman A., Ullman J.D. Answeringqueries using limited external processors. In: Proc. ofPODS. pp 227–237, Montreal, Canada, 1996

[LS93] Levy A.Y., Sagiv Y. Queries independent of updates.In: Proc. of VLDB. pp 171–181, Dublin, Ireland, 1993

[MGA97] Steinbrunn M., Moerkotte G., Kemper A. Heuristicand randomized optimization for the join. VLDB J.6(3):191–208, 1997

[Mil98] Miller R.J. Using schematically heterogeneous struc-tures. In: Proc. of SIGMOD. pp 189–200, Seattle,Wash., USA, 1998

[MLF00] Millstein T., LevyA., FriedmanM. Query containmentfor data integration systems. In: Proc. of PODS. pp67–75, Dallas, Tex., USA, 2000

[MM01] Mendelzon A., Mihaila G. Querying partially soundand complete data sources. In: Proc. of PODS. pp 162–170, 2001

[MMS79] Maier D., Mendelzon A., Sagiv Y. Testing implica-tions of data dependencies. ACMTrans. Database Syst.4(4):455–469, 1979

[MRP99] Minock M., Rusinkiewicz M., Perry B. The identi-fication of missing information resources through thequery difference operator. In: Proc. 4th IFCIS Interna-tional Conference onCooperative Information Systems(CoopIS 99). September 1999

[NLF99] NaumannF., LeserU., Freytag J.C. Quality-driven inte-gration of heterogeneous information systems. In: 25thConference onVery Large Database Systems (VLDB).pp 447–458, 1999

[PDST00] Popa L., Deutsch A., Sahuguet A., Tannen V. A chasetoo far? In: Proc. of SIGMOD. pp 273–284, 2000

Page 25: Answering queries using views: A survey - Columbia Universitygravano/Qual/Papers/halevy01.pdf · the views are sufficient for answering the queries we expect to receive over the

294 A.Y. Halevy: Answering queries using views: A survey

[PH01] Pottinger R., Halevy A. Minicon: a scalable algorithmfor answering queries using views. VLDB J. 10(2):182–198, 2001

[PL00] Pottinger R., LevyA. A scalable algorithm for answer-ing queries using views. In: Proc. of VLDB. pp 484–495, Cairo, Egypt, 2000

[PT99] Popa L., Tannen V. An equational chase for path con-junctive queries, constraints and views. In: Proc. ofICDT. 1999

[PV99] Papakonstantinou Y., Vassalos V. Query rewriting forsemi-structured data. In: Proc. of SIGMOD. pp 455–466, 1999

[Qia96] Qian X. Query folding. In: Proc. of ICDE. pp 48–55,New Orleans, La., USA, 1996

[RSU95] RajaramanA., SagivY.,UllmanJ.D.Answeringqueriesusing templates with binding patterns. In: Proc. ofPODS. pp 105–112, San Jose, Calif., USA, 1995

[SAC+79] SelingerP.,AstrahanM.,ChamberlinD., LorieR.,PriceT. Access path selection in relational database systems.In: Proc. of SIGMOD. pp 23–34, Boston, Mass., USA,1979

[Sag88] Sagiv Y. Optimizing datalog programs. In: Minker J(ed) Foundations of deductive databases and logic pro-gramming. pp 659–698. Morgan Kaufmann, LosAltos,Calif., USA, 1988

[SDJL96] SrivastavaD.,DarS., JagadishH.V., LevyA.Y.Answer-ing SQL queries using materialized views. In: Proc. ofVLDB. Bombay, India, 1996

[SGT+99] Shanmugasundaram J., Gang H., Tufte K., Zhang C.,DeWitt D.J., Naughton J. Relational databases forquerying XML documents: limitations and opportuni-ties. In: Proc. of VLDB. pp 302–314, 1999

[Shm93] Shmueli O. Equivalence of datalog queries is undecid-able. J. Logic Program. 15:231–241, 1993

[SR92] Srivastava D., Ramakrishnan R. Pushing constraint se-lections. In: Proc. of PODS. pp 301–315, San Diego,Calif., USA, 1992

[SY81] SagivY.,Yannakakis M. Equivalence among relationalexpressions with the union and difference operators. J.ACM. 27(4):633–655, 1981

[TIHW01] Tatarinov I., Ives Z., Halevy A., Weld D. UpdatingXML. In: Proc. of SIGMOD. pp 413–424, 2001

[TS97] Theodoratos D., Sellis T. Data warehouse configura-tion. In: Proc. of VLDB. pp 126–135, Athens, Greece,1997

[TSI94] Tsatalos O.G., Solomon M.H., Ioannidis Y.E. TheGMAP: a versatile tool for physical data independence.In: Proc. of VLDB. pp 367–378, Santiago, Chile, 1994

[TSI96] Tsatalos O.G., Solomon M.H., Ioannidis Y.E. TheGMAP: a versatile tool for physical data independence.VLDB J. 5(2):101–118, 1996

[UFA98] Urhan T., Franklin M.J., Amsaleg L. Cost-based queryscrambling for initial delays. In: Proc. of SIGMOD. pp130–141, Seattle, Wash., USA, 1998

[Ull89] UllmanJ.D. Principlesof databaseandknowledge-basesystems, vols. I, II. Computer Science, Rockville, Md.,USA, 1989

[Ull97] Ullman J.D. Information integration using logicalviews. In: Proc. of ICDT. pp 19–40, Delphi, Greece,1997

[Val87] Valduriez P. Join indices. ACM Trans. Database Syst.12(2):218–246, 1987

[VP97] Vassalos V., PapakonstantinouY. Describing and usingquery capabilities of heterogeneous sources. In: Proc.of VLDB. pp 256–265, Athens, Greece, 1997

[Wie92] Wiederhold G. Mediators in the architecture of futureinformation systems. IEEE Comput. pp 38–49, 1992

[YFIV00] Yagoub K., Florescu D., Issarny V., Valduriez P.Caching strategies for data-intensive web sites. In:Proc. of VLDB. pp 188–199, Cairo, Egypt, 2000

[YKL97] Yang J., Karlapalem K., Li Q. Algorithms for material-ized view design in data warehousing environment. In:Proc. of VLDB. pp 136–145, Athens, Greece, 1997

[YL87] Yang H.Z., Larson P.A. Query transformation for PSJ-queries. In: Proc. ofVLDB. pp 245–254, Brighton,UK,1987

[ZCL+00] Zaharioudakis M., Cochrane R., Lapis G., Pirahesh H.,Urata M. Answering complex SQL queries using au-tomatic summary tables. In: Proc. of SIGMOD. pp105–116, 2000

[ZO93] Zhang X., Ozsoyoglu M.Z. On efficient reasoning withimplication constraints. In: Proc. of DOOD. pp 236–252, 1993