institut für scientific computing – universität wienp.brezany distributed query processing –an...

29
Institut für Scientific Computing – Universität Wien P.Brezany Distributed Query Processing Distributed Query Processing An Overview An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific Computing Universität Wien Tel. 4277 39425 Sprechstunde: Di, 13.00-14.00 LV-Portal: www.par.univie.ac.at /~ brezany / teach / gckfk /300658.html

Upload: antony-bingham

Post on 14-Dec-2015

216 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany

Distributed Query ProcessingDistributed Query Processing – –An OverviewAn Overview

Univ.-Prof. Dr. Peter Brezany

Institut für Scientific Computing

Universität WienTel. 4277 39425

Sprechstunde: Di, 13.00-14.00

LV-Portal: www.par.univie.ac.at/~brezany/teach/gckfk/300658.html

Page 2: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany2

Introduction• To construct the answer to the query, the user does not

precisely specify the procedure to follow this procedure is devised by a DBMS module called query processor, which performs query optimization.

• The query processing problem is much more difficult in distributed environments a larger number of parameters affect the performance: relations involved in a distributed query may be fragmented and/or replicated, thereby inducing communication costs. Furthermore, with many sites to access, query response time may become very high.

• The role of a distributed query processor is to map a high-level query (expressed in relational calculus) on a distributed DB (i.e., a set of global relations) into a sequence of DB operations (of relational algebra) on relation fragments.

– The calculus query must be decomposed into a sequence of relational operations called an algebraic query.

– The data accessed by the query must be localized so that the operations on relations are translated to bear on local data (fragments).

– The algebraic query on fragments must be extended with communication operations and optimized with respect to a cost function (based on features of disk I/Os, CPUs, and communication networks) to be minimized.

Page 3: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany3

Query Processing ProblemExample 1: Consider EMP(ENO, ENAME, TITLE)

ASG (ENO, PNO, RESP, DUR) and the following simple user query: „Find the names of employees who are managing a project“

In relational calculus using SQL:SELECT ENAMEFROM EMP, ASGWHERE EMP.ENO = ASG.ENOAND ASG.RESP = “Manager“

Two equivalent rel. algebra queries:

ENAME (ASG.RESP=“Manager“ EMP.ENO=ASG.ENO (EMP ASG)and

ENAME (ENAME ⋈ENO (ASG.RESP=“Manager“(ASG))) consumes less comp.

resources – it is intuitively obvious

Page 4: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany4

Query Processing Problem (cont.)In a distributed system, rel. algebra is not enough to expressexecution strategy. It must be suplemented with operations forexchanging data between sites. The best sites to process datamust also be selected. This increases the solution space.

Example 2: ENAME (ENAME ⋈ENO (RESP=“Manager“(ASG)))

We assume that EMP and ASG are horizontally fragmented:

EMP1 = ENO “E3“ (EMP)

EMP2 = ENO “E3“ (EMP)

ASG1 = ENO “E3“ (ASG)

ASG2 = ENO “E3“ (ASG)

Fragments ASG1, ASG2, EMP1, and EMP2 are stored at sites 1, 2, 3, and 4, respectively, and the result is expected at site 5.

Page 5: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany5

Equivalent Distributed Execution Strategies

For the sake of pedagogical simplicity, the project

operation is ignored.

(b) Strategy B

Strategy A exploits the factthat EMP and ASG are fragmented the same way in order to perform the select andjoin in parallel.

Strategy B centralizes all the operand data at the result site before processing the query.

Page 6: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany6

Simple Cost Model and Statistics

• tuple access: 1 unit (which we leave unspecified)

• tuple transfer: tuptrans: 10 units• relations EMP and ASG have 400 and 1000

tuples, respectively.• there are 20 managers in relation ASG.• data is is uniformly distributed

(fragmentation + allocation) among the sites.• relations ASG and EMP are locally clustered

on attributes RESP and ENO, respectively. Therefore, there is direct access to tuples of ASG (respectively, EMP) based on the value of attribute RESP (respectively, ENO)

Page 7: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany7

Cost Estimation

Page 8: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany8

Complexity of Relational Algebra Operations

n denotes the relation cardinality

Page 9: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany9

Layers of Query Processing

Page 10: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany10

Query Decomposition – An Overview The distributed calculus query is decomposed into

an algebraic query on global relations. The information about data distribution is not used. The techniques used by this layer are those of a centralized DBMS.

• Query is rewritten in a normalized form that is suitable for subsequent manipulation.

• The normalized query is analyzed semantically so that incorrect queries are rejected as soon as possible.

• The correct query is simplified (e.g., eliminating redundant predicates).

• The calculus query is restructured as an algebraic query. Several alg. queries can be derived from the same calc. query, but some alg. queries are “better“ than others. The quality is defined in terms of expected performance.

Page 11: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany11

Data Localization – An Overview• The main role is to localize the query‘s data

using data distribution information. • Repetition: Relations are fragmented; each

being stored at a different site. Fragmentation is defined through fragmentation rules (fragmentation scheme).

• This layer determines which fragments are involved in the query and transforms the distributed query into a fragment query.1. The distributed query is mapped into a fragment query by

substituting each distributed relation by its recontructing program (materialization program).

2. The fragment query is simplified and restructured to produce another “good“ query (applying the same rules used in the decomposition layer).

Page 12: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany12

Global and Local Query Optimization

• Global optimization: the goal is to find an execution strategy for the query which is close to optimal

• Local optimization: it is performed by all the sites having fragments involved in the query. Each subquery executing at one site, called a local query, is then optimized using the local schema of the site. At this time, the algorithms to perform the relational operations may be chosen. Local optimization uses the algorithms of centralized systems.

Page 13: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany13

Query Decomposition – 1. Normalization• The most important transformation is that of the

query qualification (the WHERE clause), which may be arbitrarily complex, quantifier-free predicate or preceded by all necessary quantifiers ( or ).

• Conjunctive and disjunctive normal forms.• Rules for the transformation of the quantifier-free

predicates.

Page 14: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany14

Normalization (cont.)• „Find the names of employees who have been

working on project P1 for 12 or 24 months“• SELECT ENAME FROM EMP, ASG WHERE EMP.ENO = ASG.ENO AND ASG.PNO = “P1“ AND DUR = 12 OR DUR = 24--------------------------------------------The qualification in conjunctive normal form is EMP.ENO = ASG.ENO ASG.PNO = “P1“ (DUR = 12

DUR = 24) and in disjunctive normal form: (EMP.ENO = ASG.ENO ASG.PNO = “P1“ DUR = 12)

(EMP.ENO = ASG.ENO ASG.PNO = “P1“ DUR = 24)

In the latter form, treating the two conjunctions may lead to redundant work if common subexpressions are not eliminated.

Page 15: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany15

Query Decomposition - 2.Analysis• The main reasons for query rejection (The query is

simply returned to the user with an explanation.) are that the query is type incorrect or semantically incorrect.

• Example 1: The following query is type incorrectSELECT E#FROM EMPWHERE ENAME > 200

for 2 reasons: (1) Attribute E# is not declared in the schema; and (2) Operation “>200“ is incompatible with the type

string of ENAME.

Page 16: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany16

Analysis (cont.)• A query is semantically incorrect if components of it do

not contribute in any way to the generation of the result.

• In the context of relational calculus, it is not possible to determine the semantic correctness of general queries. However, it is possible to do so for large class of relational queries, those which do not contain disjunction and negation.

• This is based on the representation of the query as a query graph or connection graph – one node indicates the result relation, and any other node indicates an operand relation. An edge between two nodes that are not results represents a join, whereas an edge whose destination node is the result represents a project. Furthermore, a nonresult node may be labeled by a select or a self-join predicate.

• An important subgraph of the relation connection graph is the join graph, in which only the joins are considered. The join graph is particularly useful in the query optimization phase.

Page 17: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany17

Analysis (cont.)• Example 1: “Find the names and responsibility of

programmers who have been working on the CAD/CAM project for more than 3 years“

SELECT ENAME, RESPFROM EMP, ASG, PROJWHERE EMP.ENO = ASG.ENOAND ASG.PNO = PROJ.PNOAND PNAME = “CAD/CAM“ AND DUR >= 36AND TITLE = “Programmer“

The query graph and the corresponding join graph are shown in the next slide.

Page 18: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany18

Analysis (cont.)

Fig. 8.1

Page 19: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany19

Analysis (cont.)The query graph is useful to determine the semantic correctness of a conjunctive multivariable query without negation.

Such a query is semantically incorrect if its query graphis not connected. In this case one or more subgraphs (corresponding to subqueries) are disconnected from thegraph that contains the result relation.

The query could be considered correct (which some systemsdo) by considering the missing connection as a Cartesianproduct.

Page 20: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany20

Analysis (cont.)

Example 2: Let us consider the following query:SELECT ENAME, RESPFROM EMP, ASG, PROJWHERE EMP.ENO = ASG.PROJAND PNAME = “CAD/CAM“AND DUR >= 36AND TITLE = “Programmer“

Its query graph is disconnected, which tells us that the query is semantically incorrect.There are basically 3 solutions to the problem:

(1) reject the query (2) assume that there is an implicit Cartesian product

between relations ASG and PROJ, or (3) infer (using the schema) the missing join

predicate ASG.PNO = PROJ.PNO which transforms the query

into that of Example 1.

Page 21: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany21

Query Decomposition - 3.Elimination of RedundancyThe query qualification may contain redundant predicates.

A naive evaluation can well lead to duplicated work. This can be eliminated by simplifying the qualification with the followingwell-known idempotency rules:

Example:

Page 22: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany22

Elimination of Redundancy (cont.)

Example (cont.)

Slide

12 Slide

12

Page 23: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany23

Query Decomposition - 4.Rewriting

The last step of query decomposition rewrites the query in relational algebra – in two substeps:

(1) straighforward transformation of the query from

relational calculus into relational algebra(2) restructuring of the relational algebra query

to improve performance.It is customary to represent the relational algebra query graphically by an operator tree.

Example 1: The query

can be mapped in a straightforward way in the tree in the following slide.

Page 24: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany24

Rewriting (cont.)

By applying transformation rules, many different equivalent trees may be found Vorlesung Datenbanksysteme (Prof. Schikuta)

Page 25: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany25

Rewriting (cont.)

Example 2: The restructuring of the tree in previous slide leads to the tree below. The resulting tree is good in the sense that repeated access to the same relation (as in the previous figure is avoided and that the most selective operations are done first. However, this tree is far from optimal. The select operation on EMP is not very useful before the join because it does not greatly reduce the size of the operand relation.

Page 26: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany26

Localization of Distributed Data• This layer translates an algebraic query on

global relations into an algebraic query expressed on physical fragments.

• Localization uses information stored in the fragment schema.

• Fragmentation is defined through fragmentation rules.

• Reduction techniques are a way how to localize a distributed query.

Page 27: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany27

Reduction for Primary Horizontal FragmentationThe following example is used in subsequent discussions.

Example: Relation EMP(ENO, ENAME, TITLE) can be split into three horizontal fragments EMP1, EMP2, and EMP3, defined as follows:

EMP1 = ENO “E3“(EMP)

EMP2 = “E3“ < ENO “E6“(EMP)

EMP3 = ENO > “E6“(EMP)The localization program for an horizontally fragmented relation is the union of the fragments. EMP = EMP1 EMP2 EMP3

Thus the generic form of any query specified on EMP is obtained by replacing it by (EMP1 EMP2 EMP3).

The reduction of queries consists primarily of detecting those subtrees that will produce empty relations, and removing them.

Page 28: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany28

Reduction with Selection• Selections on fragments that have a qualification

contradicting the qualifications of the fragmentation rule generate empty relations. Given a relation R that has been horizontally fragmented as R1, R2, ..., Rw, where Rj = pj (R), the rule can be stated formally as follows:

Rule: pi (Rj) = if x in R: (pi(x) pj(x))

where pi and pj are selection predicates, x denotes a tuple, and p(x) denotes “predicate p holds for x.“

Determining the contradicting predicates requires theorem-proving techniques if the predicates are quite general. However, DBMSs generally simplify predicate comparison by supporting only simple predicates for defining fragmentation rules (by the DB administrator).

Page 29: Institut für Scientific Computing – Universität WienP.Brezany Distributed Query Processing –An Overview Univ.-Prof. Dr. Peter Brezany Institut für Scientific

Institut für Scientific Computing – Universität Wien

P.Brezany29

Reduction with Selection (cont.)Example: SELECT *

FROM EMP WHERE ENO = “E5“

Applying the naive approach to localize EMP from EMP1, EMP2, and EMP3 gives the generic query of Figure (a) below. It is easy to detect that the selection predicate contradicts the predicate of EMP1 and EMP3, thereby producing empty relations. The reduced query is simply applied to EMP2 as shown in Figure (b).