the sum is greater than the parts global query optimization in federated systems tony young m.math...

100
The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Upload: marlene-banks

Post on 30-Dec-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

The Sum is GreaterThan the Parts

Global Query Optimization in Federated Systems

Tony Young

M.Math Candidate

CS 848 - Fall 2004

Page 2: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Outline Introduction

Motivation Issues

System Overview Optimization

Semijoin Algorithm Reduction Algorithm The Garlic Approach

Conclusion My Project

Page 3: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Outline Introduction

Motivation Issues

System Overview Optimization

Semijoin Algorithm Reduction Algorithm The Garlic Approach

Conclusion My Project

Page 4: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Introduction References:

L. M. Haas, E. T. Lin, and M. A. Roth. Data integration through database federation. IBM Systems Journal, 41(4):578–596, 2002.

David K. Hsiao. Federated databases and systems: part i — a tutorial on their data sharing. The VLDB Journal, 1(1):127–180, 1992.

David K. Hsiao. Federated databases and systems: part ii — a tutorial on their resource consolidation. The VLDB Journal, 1(2):285–310, 1992.

Hongjun Lu, Beng-Chin Ooi, and Cheng-Hian Goh. On global multidatabase query optimization. SIGMOD Rec., 21(4):6–11, 1992.

Page 5: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Introduction References:

Neil Coburn and Per-Ake Larson. Multidatabase services: issues and architectural design. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 57–66, Toronto, Ontario, Canada, 1992. IBM Press.

Qiang Zhu. Query optimization in multidatabase systems. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 111–127. IBM Press, 1992.

Page 6: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Introduction

References: H. Lu, B. C. Ooi, and C. H. Goh.

Multidatabase query optimization: Issues and solutions. In Proceedings of Third International Workshop on Research Issues in Data Engineering: Interoperability in Multidatabase Systems, pages 137–143, 1993.

Page 7: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Introduction

Page 8: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Introduction

Page 9: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Introduction

Page 10: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Introduction

What is a federated database system (FDBS)? Also referred to as a multidatabase system

(MDBS) FDBS’s combine multiple heterogeneous

data sources into one global view Users think that the data all resides in one

place

Page 11: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Introduction

The notion of an FDBS has been around for many years First semi-commercial product was Mermaid

(later became InterViso) in 1984 Flurry of papers began to surface in late

1980’s - early 1990’s What motivated the development of these

systems?

Page 12: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Motivation

Replacement of Data Processing With Databases Organizations were moving away from

traditional data processing techniques (such as storing information in flat files

The rise in popularity of DBMS's can be attributed to powerful data mining applications as well as ease of data access

Page 13: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Motivation Proliferation of Heterogeneous Databases

Within an Organization It is not uncommon for different departments within

an organization to make use of their own database servers

Departments often do not coordinate to ensure that a corporation is using a homogeneous DBMS to store data

There is no guarantee that the schema individual departments use to store data will be homogeneous

Page 14: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Motivation Data Sharing Within Organizations

Many organizations seek to share data between different departments

• Finance department may require information regarding projects in progress in the marketing department

Such information sharing is difficult without the guarantee of schema, data model or access language homogeneity

• Finance may use SQL on a relational database and marketing may use Xquery on an XML database

Page 15: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Motivation Different Rates of Technology Adoption

Different departments will adopt technology at different rates

• IT will adopt quickly to store inventory information, call tickets, etc. and are tech savvy

• HR may adopt slowly as they use many paper forms and are not so tech savvy

Older systems are in place, and newer systems to come online use different products

Page 16: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Motivation Geographic Separation of Teams

Different teams may be broken up across geographic locations

Different sites will hold teams working on different projects

Each site may have their own IT staff and make their own purchasing and installation decisions

Often there is no coordination between sites

Page 17: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Motivation Mergers and Acquisitions

When companies join forces, their IT systems must be joined as well

Old applications will depend on the old software, and users might be reluctant to learn an entirely new system

If we can merge the two system so that each user can make use of their old applications and old access language, transitions might be easier

Page 18: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Issues

Several issues affect query optimization in FDBS’s

Page 19: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Issues Site Autonomy

Data - Local database administrators have direct and complete control over the schemas

• This information cannot be modified in any way Design - Local database administrators decide when

and how to replicate and fragment data Communication - Each site decides locally whether

or not to communicate with the FDBS Execution - Each site can determine how, when and

whether to execute global queries, as well as how queries are prioritized

Page 20: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Issues

Local Parameters Local cost parameters for individual sites are

not always available to the FDBS• The FDBS often doesn't know what indices are

available for relations at local sites• Can’t predict what access methods will be used

by local sites• No idea what page size and disk latency are• … etc.

Page 21: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Issues

Translation Queries must be translated to and from the

local schema, query language, and data model on-the-fly

This requires additional query processing time

Page 22: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Issues

Heterogeneous Capabilities Not all local sites have the same capabilities.

• Some sites may not implement any ranking operations

This means that intermediate results might have to be shifted to sites that can provide these capabilities, further increasing processing time

Page 23: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Issues Additional Costs

Cost based optimization needs to take into consideration some additional factors such as:

• Transmission speeds• Network loads• Local site configurations

As with local parameters, this information is not always available to the FDBS

Page 24: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Issues Overriding theme - the FDBS is just

another application as far as the local sites are concerned We can make no assumptions about the local

sites We can assume no global control over the

local sites We have no hooks into the local sites to

directly access information

Page 25: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Outline Introduction

Motivation Issues

System Overview Optimization

Semijoin Algorithm Reduction Algorithm The Garlic Approach

Conclusion My Project

Page 26: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

System Overview References:

Qiang Zhu. Query optimization in multidatabase systems. In Proceedings of the 1992 conference of the Centre for Advanced Studies on Collaborative research, pages 111–127. IBM Press, 1992.

M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1999.

John Grant, Witold Litwin, Nick Roussopoulos, and Timos Sellis. Query languages for relational multidatabases. The VLDB Journal, 2(2):153–172, 1993.

Page 27: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

System Overview Two main approaches to a federated system: Multidatabase Language Approach

Users must learn a special access language Users must use one standardized data model Users must know the sites they are contacting and

how data is organized at those sites Users must enter their authentication information

each time they use a site Users must…

Page 28: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

System Overview

Page 29: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

System Overview

Main idea: User specifies the sites, relations and

columns used in the query by their name at the local site

Queries are still submitted to a middleware, but user must know where things are stored

Page 30: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

System Overview Two main approaches to federated system: Global Schema Approach

Global DBA implements wrappers/agents to convert access language and data model before sending to source

Global DBA generates an integrated global schema Global DBA stores authentication for individual users

at each local site and FDBS handles login Global DBA…

Page 31: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

System Overview

Page 32: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

System Overview

Main idea: User specifies the relations and columns

used in the query by their global name Queries are submitted to a middleware that

does conversion and subquery generation

Page 33: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

System Overview

The difference is how users perceive the system MDBL: As far as the user is concerned, the

data is stored at separate sites and their query must explicitly use those sites

GS: As far as the user is concerned, the data is stored in the middleware and they can access it directly

Page 34: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Outline Introduction

Motivation Issues

System Overview Optimization

Semijoin Algorithm Reduction Algorithm The Garlic Approach

Conclusion My Project

Page 35: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Optimization There are many many optimization algorithms for

federated systems 2-Phase Statistical Sampling Adaptive Probing Query-based …etc.

We will look at three Semijoin and Reduction: From Mermaid - state of the art for

many years and pioneer in the field Garlic: From IBM - incorporated into a shipping product

(DB2 Information Integrator)

Page 36: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Semijoin Optimization Algorithm

References: David Brill, Marjorie Templeton, and Clement

T. Yu. Distributed query processing strategies in mermaid, a frontend to data management systems. In Proceedings of the First International Conference on Data Engineering, pages 211–218. IEEE Computer Society, 1984.

Page 37: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Semijion Optimization Algorithm The semijoin algorithm was proposed in

Mermaid (1984) Assumes that the cost of data transfer through

a network outweighs local site CPU overhead Seeks to reduce the size of relations required

for a query at local sites before transferring results back to the Controller

Four steps

Page 38: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Site Selection A set of sites that will be used to perform a query

must first be chosen Requires finding a set of minimal size that

includes one copy of each local, partitioned and replicated relation i.e. each site holding a data fragment must be in the

set, but only one replica of a relation must be in the set Some sites may hold more than one relation

required by the query Allows us to further reduce the size of the site set

Page 39: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Site Selection

The system statistics that can be used to optimize this selection, such as link speeds and system loads, remains an open problem

Page 40: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Site Selection

Page 41: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Site Selection

Page 42: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 2: Local Reduction In parallel at each local site in the chosen site set,

reduce each relation by performing selections and projections Parameters used to perform these operations are taken

from select, where and join conditions in the original query It might be possible to optimize the order in which

site reduction queries are performed by exploiting network traffic and speed, CPU load at local sites, etc. i.e. submit queries to slow sites first and hope they don't

increase the overall execution time too much

Page 43: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 2: Local Reduction

Page 44: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 3: Global Reduction Find and execute an efficient sequence of

semijoins that will reduce the set of records to be transmitted Mermaid uses a hill-climbing algorithm to determine

this set Once the semijoins are performed, the smallest

amount of data required to answer the query is ready for transport Some other algorithms to determine the optimal

semijoin sequence should be investigated as this one is slow!

Page 45: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 3: Global Reduction

Page 46: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

Transfer the data to one central query site and generate the result set. Return the result set to the user May perform joins at local sites or wait until

we get to the FDBS

Page 47: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

Page 48: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

Page 49: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

Page 50: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

Page 51: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

It may be less costly to generate the result set at one central site and then transfer the data back to the user It may also be less costly to assemble the

result set at the user's site

Page 52: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Semijoin Optimization Algorithm

This algorithm exploits the capabilities of the DBMS's in the federation

Attempts to reduce the transmission overhead required to send data to sites

Statistics are computed on-the-fly and are discarded once the query is complete No attempt is made to store or make

statistics more accurate

Page 53: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Reduction Algorithm

References David Brill, Marjorie Templeton, and Clement

T. Yu. Distributed query processing strategies in mermaid, a frontend to data management systems. In Proceedings of the First International Conference on Data Engineering, pages 211–218. IEEE Computer Society, 1984.

Page 54: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Reduction Algorithm The replicate algorithm was proposed in

Mermaid (1984) Assumes that CPU overhead at local sites

outweighs transfer costs between them Seeks to transfer data to local sites in order to

exploit the differences in processing speeds of each system

Four steps

Page 55: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Site Selection

As with the semijoin algorithm, we choose a minimal site set

Instead of choosing only one replica for each replicated relation, we include all replicas of the data Allows us to run queries in parallel at each

replica

Page 56: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Site Selection

Page 57: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 2: Data Transfer Copy each relation to each site where it is to be

used to process a subquery, but does not already exist I.e. if site 1 holds relation A and requires relation B,

transfer B to site 1 This may require composing fragmented relations into

one large relation After this step, each site should have a copy of

the relations that are to be used to form the partial query result for which that site is responsible As per the subquery sent to it by the Controller

Page 58: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 2: Data Transfer

Page 59: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 3: Query Execution

Once each site has the data it needs to run its partial query, the queries are executed

After this step, each site should have a partial answer to the user's query

Page 60: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 3: Query Execution

Page 61: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

Transfer the partial answers from local sites and create the final result set at the user's home site

Return the results to the user

Page 62: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

Page 63: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

Page 64: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 4: Assembly

It may be less costly to generate the result set at one central site and then transfer the data back to the user It may also be less costly to assemble the

result set at the user's site

Page 65: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Reduction Algorithm This algorithm exploits the configurations of the

DBMS's in the federation Attempts to reduce the processing time for the

query by working at sites of varying capabilities Statistics are computed on-the-fly and are

discarded once the query is complete No attempt is made to store or make statistics more

accurate

Page 66: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Garlic References

L. M. Haas, P. M. Schwarz, P. Kodali, E. Kotlar, J. E. Rice, and W. C. Swope. Discoverylink: a system for integrated access to life sciences data sources. IBM Syst. J., 40(2):489–511, 2001.

Laura M. Haas, Donald Kossmann, Edward L. Wimmers, and Jun Yang. Optimizing queries across diverse data sources. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 276–285. Morgan Kaufmann Publishers Inc., 1997.

Page 67: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Garlic References:

L. M. Haas, P. Kodali, J. E. Rice, P. M. Schwarz, and W. C. Swope. Integrating life sciences data-with a little garlic. In Proceedings of the 1st IEEE International Symposium on Bioinformatics and Biomedical Engineering, page 5. IEEE Computer Society, 2000.

Mary Tork Roth, Fatma Ozcan, and Laura M. Haas. Cost models DO matter: Providing cost information for diverse data sources in a federated system. In The VLDB Journal, pages 599–610, 1999.

Page 68: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Garlic

References: Guy M. Lohman. Grammar-like functional

rules for representing query optimization alternatives. In Proceedings of the 1988 ACM SIGMOD international conference on Management of data, pages 18–27. ACM Press, 1988.

Page 69: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Garlic Proposed by IBM for use in several products

DB2 II and DiscoveryLink The Garlic optimizer uses wrappers to gather

costing info for developing a query plan Goal is to allow Garlic to find a good plan without

knowledge of the capabilities of the local site Optimizer uses a set of strategy alternative rules

(STAR’s) that are used to rewrite plans Plans are a set of plan operators (POP’s) that compose

the query plan tree (sort, filter, scan, etc). • A generic pushdown POP that encapsulates work to be done at a

local site is also included

Page 70: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Garlic

STAR’s are fired over the query in order to generate POP’s STAR’s can be seen as grammatical

production rules STAR's generate cost and cardinality

information using input from the wrapper Works in three phases

Page 71: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Fire Access STAR’s

Access STAR’s are applied in order to enumerate plans that read data from a source

Plan space is pruned in order to remove plans that have the same or weaker cost properties Just what is “weaker” is not detailed!

Page 72: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Fire Access STAR’s

Page 73: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 1: Fire Access STAR’s

Page 74: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 2: Fire Join STAR’s Join STAR’s are applied in order to enumerate

all plans involving joins Plan space is filled with all possible join

combinations Garlic considers bushy plans and left-deep plans as

collocated data may make a bushy plan more efficient

• I.e. joins might be able to be performed at a local site

Plan space is pruned in order to remove plans that have the same or weaker cost properties

Page 75: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 2: Fire Join STAR’s

Page 76: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 2: Fire Join STAR’s

Page 77: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 3: Fire FinishRoot STAR

The FinishRoot STAR is applied to provide orderings, selects, projects, etc. that were not already completed by some local site I.e. the local site did not have the proper

capabilities to perform that operation The plan with lowest cost is then chosen

for execution

Page 78: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Step 3: Fire FinishRoot STAR

Page 79: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Garlic Garlic plans are built bottom-up using dynamic

programming This method will take large amounts of time to

optimize plans containing even a small number of joins An 8-way clique join (i.e. there is a join between every

pair of relations) was shown to take approximately 4.5 min to optimize!

• Considered the upper limit of the algorithm.

Wrappers are assumed to be able to gather statistics directly from the local site Not always possible!

Page 80: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Outline Introduction

Motivation Issues

System Overview Optimization

Semijoin Algorithm Reduction Algorithm The Garlic Approach

Conclusion My Project

Page 81: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Conclusion

Many good algorithms exist for optimizing queries in federated systems

Those queries make use of statistics that must be gathered somehow Probing queries Utilities Statistical sampling of data …etc.

Page 82: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Conclusion

Statistical accuracy is important As we all know, accuracy directly affects

estimates, which may directly affect algorithm choices

The optimizer is only as good as the statistics!

Page 83: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Questions?

Page 84: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Outline Introduction

Motivation Issues

System Overview Optimization

Semijoin Algorithm Reduction Algorithm The Garlic Approach

Conclusion My Project

Page 85: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project There are many methods of statistics collection

that have been proposed Utility based methods dispatch an agent to gather

data from a database and build statistics Utility is invoked on a regular schedule to keep

statistics up to date Imposes great overhead at the site Requires a DBA to set the schedule and rerun if the

stats are out of date between scheduled runs

Page 86: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project There are many methods of statistics collection

that have been proposed Piggybacking requests more data than is required by

a user query and builds statistics using it• Ex: ask for an additional column or remove filter predicates

We can develop accurate statistics with the returned data

Has the potential to significantly increase the overhead of a query at a local site as well as the running time of a query

Page 87: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project There are many methods of statistics collection that

have been proposed Fuzzy cost models attempt to model the changing state of

the database using probabilities about the parameters needed for costing

Builds a parameter by taking portions of the possible values (according to the probability that they will occur)

Models must be built offline adding a significant task to the DBA

What do we do to update the model once it’s built? How do we determine what parameters to add to the

model?

Page 88: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project There are many methods of statistics collection that

have been proposed Data sampling allows us to build a statistically accurate

value for a statistic Different methods are used to draw a sample

• Random: draw a random sample of tuples• Stratified: draw a sample of tuples from tuples classified into

groups• Adaptive: draw n tuples where n is determined to be a statistically

relevant sample size or provides an error less than the relative error required by the application

• Systemic: draw a certain percentage of the number of tuples in the table

• …etc.

Page 89: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project

There are many methods of statistics collection that have been proposed Among sample bias and estimation errors,

sampling methods require direct access to the data in the tables

• FDBS does not have direct access! Not practical to run online

• Might as well use a utility to get more accurate stats

Page 90: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project

The overriding theme: We want something that imposes limited

overhead We want something that calculates

reasonably accurate statistics

Page 91: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project

WE CANNOT ASSUME ANY ACCESS TO DATA WILL BE AVAILABLE!!! To the local sites, our FDBS is just another

application, and can only do the things an application can do!

• I.e. we can’t dictate the access plan to use, etc.

Page 92: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project

Meet FLO Federated Learning cOllector (FLO) is able to

“learn” statistics from the data that is returned to answer user queries

Calculates statistics regularly according to the amount that they change

• I.e. if a stat changes by 2%, recalculate it in a day; if a stat changes by 20%, recalculate it at the next chance

Page 93: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project

Meet FLO FLO does not request any additional data as

piggybacking FLO does not run any additional queries as

utilities FLO does not impose any additional

overhead on the local sites, and stats are calculated offline

Page 94: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project

Meet FLO Tradeoffs exist

• Stats won’t be perfectly accurate• Stats won’t necessarily exist for a range of values

Questions exist• How often is often enough to recalculate stats?• How do we get initial statistics?• …?

Page 95: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project Statistics Gathering References:

E. Whalen. Oracle Performance Tuning and Optimization. SAMS Publishing, 1996.

Amira Rahal, Qiang Zhu, and Per-Ake Larson. Evolutionary techniques for updating query cost models in a dynamic multidatabase environment. The VLDB Journal, 13(2):162–176, 2004.

A. H. H. Ngu, B. Harangsri, and J. Shepherd. Query size estimation for joins using systematic sampling. Distrib. Parallel Databases, 15(3):237–275, 2004.

Page 96: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project Statistics Gathering References

Peter J. Haas and Arun N. Swami. Sequential sampling procedures for query size estimation. SIGMOD Rec., 21(2):341–350, 1992.

Richard J. Lipton, Jeffrey F. Naughton, and Donovan A. Schneider. Practical selectivity estimation through adaptive sampling. In Proceedings of the 1990 ACM SIGMOD international conference on Management of data, pages 1–11. ACM Press, 1990.

Page 97: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project Statistics Gathering References

Qiang Zhu, Brian Dunkel, Wing Lau, Suyun Chen, and Berni Schiefer. Piggyback statistics collection for query optimization: Towards a self-maintaining database management system. The Computer Journal, 47(2):221–244, March 2004.

Qiang Zhu and P. A. Larson. Query optimization using fuzzy set theory for multidatabase systems. In Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research, pages 848–859, Toronto, Ontario, Canada, 1993. IBM Press.

Page 98: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project Statistics Gathering References

Qiang Zhu and Per-Ake Larson. A query sampling method of estimating local cost parameters in a multidatabase system. In Proceedings of the Tenth International Conference on Data Engineering, February 14-18, 1994, Houston, Texas, USA, pages 144–153. IEEE Computer Society, 1994.

Qiang Zhu. An integrated method for estimating selectivities in a multidatabase system. In Proceedings of the 1993 conference of the Centre for Advanced Studies on Collaborative research, pages 832–847. IBM Press, 1993.

Page 99: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

My Project Statistics Gathering References

Per-Ake Larson Qiang Zhu. Establishing a fuzzy cost model for query optimization in a multidatabase system. In System Sciences, 1994. Vol.II: Software Technology, Proceedings of the Twenty-Seventh Hawaii International Conference on, volume 2, pages 263–272, 1994.

Page 100: The Sum is Greater Than the Parts Global Query Optimization in Federated Systems Tony Young M.Math Candidate CS 848 - Fall 2004

Questions?