multidatabase global query optimization - · pdf fileproblem between various local systems can...

10
Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995 MULTIDATABASE GLOBAL QUERY OPTIMIZATION Prabhu Ram, Department of Computer Science, University of North Dakota, Grand Forks, ND 58202. Abstract Data requests ro mulriaiztabase (MDBSs) are posed through non-procedural languages such as SQL and for a global data request optimization must be performed to achieve good system performance. However, the issues are ofren complicated in multidatabases, due to addirional issues arising because of the autonomy and heterogeneity restrictions of the independent local DBMSs. The data translation problem between various local systems can be observed at the schema level and at the instance level. We have identified the need for a domain translation table in mulridatabase query processing and discuss methods of implementing it. Some observations about multidatabase inter-site joins are made and four methods by which mulridatabase global queries can be processed are introduced and analytically compared. 1.0 Introduction Progress in communication anddatabase technologies is radicallychanging userdataprocessing capabilities. A rapidly growing numberof applications need to be able to access and manipulate data from autonomous databases and otherdataresources that often pre-exist. These areoftenlocated in heterogeneous hardware and softwareenvironments distributedamongnodes of a computer network [MSI90]. Such application domains arebecoming common in environments such asoffice information systems, computer-integrated manufacturing systems (with computer-aideddesign as a subset), personal computing,business and financial computing,and scientific research information bases [McL90]. A logical integration into one logical database of all data needed by such an application appears to be a promising solution, and systems that facilitate suchan integrationare called multidatabase systems [MSI90]. William Per&o, Department of Comp. Sci.and OR, North Dakota State University, Fargo, ND 58 102. The state of the art and research directions in multidatabases were the focus of the workshop reported in the final report of the Workshop on Multidatabases and Semantic Interoperability [MSI90]. Some of the research directionsidentified were as follows : (i) Transaction Management in Multidatabase Systems, (ii) Modeling Multidatabase Applications, (iii) Query Optimization in Multidatabase Systems, and (iv) Semantic Data Integration. 2.0 Multidatabase Query Optimization Multidatabase query optimization is a topic of research interest is multidatabase systems or heterogeneous distributed database systems (HDDBSs). The object of query optimization is to improve performance. The final report of the workshop on multidatabases and semantic interoperability [MS1901 points out the lack of research in this area. Some of the facilities cited in [MS1901 that are necessary for executingqueriesin multidatabases include the following : (a) data value mapping functions for converting betweenmultiple representations of reals,dates,units, etc.; (b) efficient waysof processing joins; (c) semijoin exchanges; and (d) the passing of parsed trees and,possibly, compiled queries. The local systems under the MDBS are often autonomous, i.e., they are often under separate and independent control. A local DBMS may exhibit several typesof autonomy. A classification discussed by Veijalainen andPopescu-Zeletin [VEI86] includes design, communication, and execution autonomy. Sheth and Larson [SHE901 include an additional autonomy called association autonomy. An MDBS can be classified into two types, based on the autonomy of their componentDBSs : nonfederated database and federated database systems (FDBSs) 253 1060-3425/96 $4.00 Q 1996 IEEE Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Upload: doanmien

Post on 21-Mar-2018

218 views

Category:

Documents


3 download

TRANSCRIPT

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

MULTIDATABASE GLOBAL QUERY OPTIMIZATION

Prabhu Ram, Department of Computer Science, University of North Dakota, Grand Forks, ND 58202.

Abstract

Data requests ro mulriaiztabase (MDBSs) are posed through non-procedural languages such as SQL and

for a global data request optimization must be performed to achieve good system performance. However, the issues are ofren complicated in multidatabases, due to addirional issues arising because of the autonomy and heterogeneity restrictions of the independent local DBMSs. The data translation problem between various local systems can be observed at the schema level and at the instance level. We have identified the need for a domain translation table in mulridatabase query processing and discuss methods of implementing it. Some observations about multidatabase inter-site joins are made and four methods by which mulridatabase global queries can be processed are introduced and analytically compared.

1.0 Introduction

Progress in communication and database technologies is radically changing user data processing capabilities. A rapidly growing number of applications need to be able to access and manipulate data from autonomous databases and other data resources that often pre-exist. These are often located in heterogeneous hardware and software environments distributed among nodes of a computer network [MSI90]. Such application domains are becoming common in environments such as office information systems, computer-integrated manufacturing systems (with computer-aideddesign as a subset), personal computing, business and financial computing, and scientific research information bases [McL90]. A logical integration into one logical database of all data needed by such an application appears to be a promising solution, and systems that facilitate such an integration are called multidatabase systems [MSI90].

William Per&o, Department of Comp. Sci.and OR, North Dakota State University, Fargo, ND 58 102.

The state of the art and research directions in multidatabases were the focus of the workshop reported in the final report of the Workshop on Multidatabases and Semantic Interoperability [MSI90]. Some of the research directions identified were as follows : (i) Transaction Management in Multidatabase Systems, (ii) Modeling Multidatabase Applications, (iii) Query Optimization in Multidatabase Systems, and (iv) Semantic Data Integration.

2.0 Multidatabase Query Optimization

Multidatabase query optimization is a topic of research interest is multidatabase systems or heterogeneous distributed database systems (HDDBSs). The object of query optimization is to improve performance. The final report of the workshop on multidatabases and semantic interoperability [MS1901 points out the lack of research in this area. Some of the facilities cited in [MS1901 that are necessary for executing queries in multidatabases include the following : (a) data value mapping functions for converting between multiple representations of reals, dates,units, etc.; (b) efficient ways of processing joins; (c) semijoin exchanges; and (d) the passing of parsed trees and, possibly, compiled queries.

The local systems under the MDBS are often autonomous, i.e., they are often under separate and independent control. A local DBMS may exhibit several types of autonomy. A classification discussed by Veijalainen and Popescu-Zeletin [VEI86] includes design, communication, and execution autonomy. Sheth and Larson [SHE901 include an additional autonomy called association autonomy. An MDBS can be classified into two types, based on the autonomy of their component DBSs : nonfederated database and federated database systems (FDBSs)

253 1060-3425/96 $4.00 Q 1996 IEEE

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

WWO]. A nonfederated database system is an integration of local DBMSs that are not autonomous. An FDBS, on the other hand, consists of local DBMSs that are autonomous yet participate in a federation to allow partial and controlled sharing of their data. The term federated database systems was coined by Hammer and McLcod [HAM791 and Heimbigner and McLeod @IBI85], and in this study, the terms MDBS and HDDBS are used synonymously with FDBS.

A considerable amount of research has been done in the area of integration and interoperability [MSNO, IMS91] of HDDBSs. Perrizo etal. [PERglb, PER931 introduce a method by which global serializability and atomic commitment in a heterogeneous environment can be achieved. Though extensive amount of work has been done on the problem of query processing very few papers have been published on query optimization in heterogeneous databases.

The global query optimization problem in a heterogeneous environment is fundamentally different from the optimization in a homogeneous distributed database system (DDBS), and existing query processing and optimization technologies must be re-examined [LU92]. The heterogeneous environment may have local systems residing on machines with different architectures. These computers may be connected using different communication protocols. Some of them may only provide limited access to remote users and even terminate their services without advance notice. On the software side, the local databases may be managed by different DBMSs, some of which may lack important DBM’S functions. More importantly, from the query optimization point of view, strategies and execution plans adopted at a local DBMS are not known at foreign sites (due to the execution autonomy of the local systems). Due to the difference in environment, new primitive query processing techniques have to be developed [LU92]. In this study, a method of implementing the join operation based on value based reduction method [RAM931 is proposed. This method will extend the distributed domain vector join approach [PERgla] to perform a multi-site or remote join in an MDBS environment.

In an MDBS environment, the pagination of data in the local system is not known, unlike in a DDBS environment. However, if value based reduction is performed to identify only the participating values at either sites of the join, considerable gains can be obtained. An approach such as the one suggested by

Du et al. [DU92] can be used to calibrate the local systems, and important statistics about the local systems can be gathered. A cost model can then be developed to estimate the costs of global queries. This cost model can be used by an optimizer to provide optimization for global query in an MDBS.

3.0 The Need for MDBS Query Optimization

Just as in centralized, distributed, and other databases, query optimization is necessary for high performance in multidatabases. However, the issues are often complicated in multidatabases, due to additional issues arising because of the autonomy and heterogeneity restrictions of the independent local DBMSs. Existing query processing techniques need to be re-examined, and new methods of performing primitive operations are required [LU92]. In this section, the need for a domain translation table and four multidatabase global join processing methods are proposed.

3.1 Domain Translation Table

Multidatabase systems are typically developed by aggregating preexisting systems, using a bottom up like approach. Since these systems may not share a common data model and data manipulation language (DML), a common data model and DML are often used to communicate between systems [CER84]. Data representations and primitives of each Iocal system are mapped to equivalent representations and primitives of the common data model and DML. An advantage of this approach of having a common model is that mappings and translations need not be done between each pair of DBMSs. Another advantage is that it becomes possible to have a global and consistent view of the database, as though it is not distributed and as if all the DBMSs are of the same type.

The data translation problem between various local systems can be observed at the schema level and at the instance level. At the schema level, a schema of the data model of each of the local systems must be converted to the schemaof the common data model (or vice versa). At the instance level, large amounts of data have to be converted automatically from one representation to the other. Differences in data definition of the same facts are called conflicts [DAY82]. Conflicts could occur due to names (often

254

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

due to homonyms and synonyms), scales (difference in units of measure), structures (same entity stored using two different physical representations), and different abstractions (one schema containiig more detailed information than the other one). Therefore, the need to perform operations over two semantically related but syntactically varying entities is clearly pronounced, particularly in the multidatabase environment [DeM89]. The problem of matching the same entity instance (e.g., join-able attributes) stored in two relations at distinct local DBMSs is called the “inter-database instance identification problem” by Wang and Madnick [wAN89]. h&rkowitz [MAR9 1 ] suggests that the identifiers ofall possible synonyms of a particular object could be stored in a table and this table can be used for conflict resolution.

The importance of solving the domain m ismatch problem has been recognized by Dayal [DAY84]. Solutions for cases in which straightforward data type conversions suffice or in which an element of the source domain can be mapped to a unique element of the target domain (i.e., a closed form function exists), have been proposed [BRE86, CZE87, TEM87]. Litwin and Vigier [LIT861 defme “dynamic attributes” to execute queries over m ismatched domains. A dynamic attribute is a fictitious attribute instantly defined from some actual ones that are not known to the conceptual schema or view schema of a database. A dynamic attribute may be explicitly defined in the query in which it occurs, or its definition may be stored as an executable program and invoked within a query. After dynamic attribute mappings have been applied to a relation, the standard relational operations can be invoked.

Most of the work in the literature makes strong assumptions about the domains of the data involved. Most of these efforts deal only with situations where one-to-one mappings can be established between domains or where an element of the source domain can

VID ATR A ATR B I

be mapped to a unique element of the target domain. In practice, however, this is not so [DeM89]. DeMichiel [EleM89] uses mechanisms calleddomain mappings and virtual attributes to address the above problem. A virtual attribute, like a real attribute, denotes the property of some entity and is associated with a particular domain. It is derivable from other attributes in the database or other information associated with the database. The domain mapping definitions are registered and stored within the database and exist as a layer above the individual database schema to allow these schema to be integrated with the schema of other databases. In this study, the above information is assumed to be logically stored in a table form and is called a domain translation table (D’IT). A tuple in the DTI’ contains a surrogate identifier (henceforth referred to as a value identifier or ‘vid’) and semantically equivalent attribute values.

Since the access to the local systems is through their local data manipulation and data definition languages, the DTI’ can be stored as a table at a global site or it can be vertically fragmented and co-located with the local attributes at the relation’s site. Figure 1 shows such a DTI’ as it would be stored in a global site. We present the following as a very simple example. In Figure 1, the attribute A has values (“Smith, John”, “Taylor, James”, ” Thomas, Joe”), and attribute B has values {“Bill Ross”, “John Smith”, “Joe Thomas”). All matching instances are mapped to the same vid value (such as “Smith, John” and “John Smith” being mapped to vid value 2). It should be noted that not all attributes will display the closed form like conversion from one attribute to another as shown in Figure 1.

When stored locally, an insert to the D’IT will require obtaining a new vid position from the global catalog manager prior to the actual insert into the D’IT. DlTs are essential to process all global queries (selects and joins) and provide the mappings between the semantically related attributes of the autonomous local databases.

3.2 Multidatabase Query Acceleration

In this section, four methods by which global join queries in a multidatabase system can be processed are discussed. Of these methods, two methods show how join queries can be processed without optimization, and the other two methods show how the queries can be optimized. Each method’s costs are detailed and

255

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

then compared graphically. A global schema is assumed to be present, and DTTs are assumed to be maintained. The cost of maintaining (inserts, deletes, updates) multidatabase catalog information is charged to the schema and instance integration process. This approach can be justified as follows: When a local DBMS elects to be a part of the multidatabase, it exports the data and the schema that it wisbes to expose to the multidaitabase world (as in the federated approach). It must also provide the semantics of the data that it exports to the global catalog manager. If the data or the semantics change, the multidatabase catalog has no way of recognizing the chaniges unless the multidatabase catalog is intimated by the local systems effecting the change.

For illustration purposes, let relations R and Sreside at site 1 and site 2, respectively. The join attribute of relation R is A and that of relation S is B. When appropriate, R is assumed to be the driving relation. It should be noted that no assumptions are made about the syntax or semantics of A and B. A and B are join-able entities (as indicated by the global schema), and the conversion between A and B is done by using the DTT or some translation function. For the methods in which optimization is not performed, the DTT is assumed to be implemented as an unfragmented table ordered on the driving relation’s join attribute.

3.2.1 Multiple Autonomous System Nested Loop Join (MAS-NL)

In this method, a nested loop like approach is taken to process a join query. For each tuple ri in relation R with the A value A, the DTI’ is accessed to obtain the matching Bi value, which can then be used to query relation S. For example, let cname be the A attribute in relation R and pname be the B attribute in S. If the value accessed using Rename is ’ Smith, John ’ and the DTT returns ‘John Smith’ as the equivalent pname value, then a SQL query would be constructed as :

SELECTS.* FROM S WHERE S.pname = ‘John Smith’.

This process is repeated for each tuple until all R tuples are processed. This method is a brute force like approach that does not take advantage of the memory available in the system; neither does it take advantage of the presence of D’IT to accelerate queries. The DTI’ is used only to convert data from one representation to the other. This method is only

proposed as a baseline for comparison with the other methods.

3.2.2 Multiple Autonomous System Hybrid Hash Join (MAS-HIT)

This method makes use of all available memory in the joining site to reduce the number of accesses to the local systems. MAS-HH is the only one of the four methods considered that is sensitive to the size of the memory. Using the size of the tuples of the relations, the number of tuples (say n) that will fit into the available memory can be found. Using this information, a query can be issued to the R relation, such that only the number of tuples that will fit in the memory are accessed. This can be done in SQL as follows and stored in a temporary relation, R-TEMP

SELECT R.* FROM R ORDER BY R.A SET RECNO n.

The last A value (say Aj) returned by the above query is noted. For each A value in R-TEMP, the D‘IT is accessed to find the equivalent Bvalue, denoted by (Bi. .., &). The tuples corresponding to (Bj,..,$) values in S are accessed as follows:

SELECT S.* FROM S WHERE S.B EXISTS (B,..,B&

Once all the values in R-TEMP have been processed, a query is reissued to R using Ajin the WHERE clause to restrict the output. The SQL query is issued as :

SELECT R.* FROM R WHERE R.A > Aj ORDER BY R.A SET RECNO n.

The process is repeated until all tuples in R are processed. MAS-HH is similar in principle to the hybrid hash method [DEW84, SHA86], except that the ORDER BY clause in SQL is used to replace the hashing.

3.2.3 Multiple Autonomous System Sort Merge Join (MAS-SM)

This method, a modification of the standard sort- merge join method, requires that the D’IT be maintained as fragments, along with the corresponding vids, in the same site as the join attribute it represents.

256

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - I995

The fragmented D’IT is called D’lT-R in site1 and D’KS in site2. Both are stored as two column IddOnS and are maintained ordered on their rqective attribute values. Any update to the DTT is now a local operation (except for obtaining the right vid position from thns global catalog manager in. the case of inserts), and since they a,m maintained ordered on the attribute value, access to the structure is fast. MAS-SM attempts to take advantage of the capability of local systems to provide the output of a query in sorted order.

MAS-SM begins its processing by issuing the following sub-queries in parallel at both sites.

Site 1 : SELECT R.*, D’IT-R.vid FROM R, DTI-R WHERE R.A = D’IT-R.A ORDER BY D’IT-R.vid

Site 2 : SELECTS.*, DTT-Svid FROM S, DlT-S WHERE S.B = DTT-S.B ORDER BY Dl’TJ.vid

The result from site1 is stored into a temporary relation called R-TEMP, and the result from site2 is stored in S-TEMP. If we assume that site1 is the join site, then S-TEMP is shipped to sitel. A merge of R-TEMP and S-TEMP on the vid columns produces the join result. The drawbacks of this approach are that regardless of the semijoin selectivities of R and S, the base relations are always joined with their corresponding DTT fragments and a whole relation is always transmitted to the other site, regardless of whether the tuples of the shipped relation participate in the join or not.

3.2.4 Multiple Autonomous System Value Filtered Join (MAS-VF)

The MAS-VF method is proposed by us to reduce the size of relations read (and as a result, lower I/O costs) and reduce the amount of data transmitted through the network to only tuples participating in the join (hence, lower network transmission costs). The DTT is assumed to be maintained in the same form as described in Section 3.2.3. MAS-VF attempts to reduce the relations to only tuples that participate in the join by identifying the values (and hence the vids) that participate in the join.

The method simultaneously issues the following sub-

queries to the participating sites of the join. Site 1 :

SELECT R.*, DTT-R.vid FROM R, D’IT-R WHERE R.A = D’I7’mR.A

DlT-R.vid EXISTS { SELECT D’IT-R.vid

FROM D’IT-R, DTT-S WHERE DTT-R.vid = DTT-S.vid

I ORDER BY DTT-R.vid

Site 2 : SELECTS.*, DTTJ.vid FROM S, DTT-S WHERE S.B = DTT’,S.B

D’IT-S .vid EXIST { SELECT D’IT-S .vid

FROM DTT-R, DTT-S WHERE D’IT-R.vid = DTT-S.vid

ORDER BY D’IT-S.vid The nested inner query is equivalent to extracting the

semijoins of relations R and S in terms of vids. Assuming that the nested query executes first, the main query maps the semijoin vids to the attribute values. The results from the two simultaneously executing sub-queries are ordered, with respect to their vids, and a merge of the results produces the join result.

3.3 Performance Analysis

In this section, the cost of performing MAS-NL, MAS-HH, MAS-SM, and MAS-VF method is analyzed using a performance model. The analysis parameters and their values are given in Table 1. These validated parameters are referred from the study by Du et al. [DU92]. For a detailed description of other validated parameters used, the reader is referred to [RAM93]. Two DBMSs, namely Informix and DB2, are used as the local DBMSs in this study.

TSCANsSENP is the overhead cost associated with accessing a relation sequentially from a database in terms of seconds, regardless of the number of tuples accessed from that relation. Similarly, Tcurs_s- is the overhead cost of accessing a relation using a clustered index which is defined on some attribute of the relation. TAMOR is the amortized I/O cost of fetching each tuple of a relation if the relation is sequentially

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

scanned, and Tmmmx is the cost if the relation is accessed using an index.

INFORMIX T SCAN-SSTVP 0.06 seconds T cLus~ssTuP 0.001 seconds T AwoR (168+tuple size) /7e+5 sec. T mRD== 0.001 seconds

DB2 T schN&?mwP 0.6 seconds T CLUS2ZlVP 0.003 seconds T,umt (tuple size)/14e+5 set, T--m 0.003 seconds

Table 1. MDBS Analysis Parameters

3.3.1 MAS-NL Performance

MAS-NL begins by accessing one tuple from the driving relation (R), and the tiost of accessing the tuple is given by

T ONEJKJPLE = T~CAN_SEIVP + TAMOR . Having obtained the join attribute (A) value, the Dm has to be accessed to find the equivalent join attribute (B) value in the other relation (S). Since the D’lT is assumed to be ordered on the A attribute, a binary search-like procedure can obtain the value efficiently. The cost of this operation is given by

T BS = TCLUS~VP + TAMOR

The cost of accessing each tuple with the corresponding B value is the same as TONE~ITl~, and the total cost of the method is given by

TM,,., = II R II * II S II * ( Tom + Tes 1.

3.3.2 MAS-HH Performance

Let memory represent the available memory in the system. MAS-HH begins its execution by estimating the number of tuples of the drivipg relation (R) that can fit into the available memory and the number of accesses that need to be performed to access all the tuples of R. The number of tuples that fit into memory, RSS-num, and the number of database accesses, num-runs, are given by:

RSS-num = II R II / (memory * PO), and num-runs = ( II R II * hpLG / (memory * PO))

The cost of accessing RSS-num tuples from a database 258

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - I995

is given by : T Rss~cou = T--m + TM, * RSS-num .

For each set of RSS-num tuples in .R, the corresponding B values in the other relation (S) have to be identified. Since the D’IT is assumed to be ordered on the driving relation’s attribute, a binary search can be perforr& to access the B values.

T ~..~ti= TOWS--+ RSS-num * SR * T,,, where SR is the semijoin selectivity of the R relation, and TDnuoR is the amortized I/O and CPU cost of reading the D’IT. These parameter costs are derived by using the procedures suggested by Du etal. [DU92] and shown in Table 1. Having obtained the B values, the cost of accessing the tuples from S and storing them in a temporary relation SJEMP is given by

T = T,, ~anrp + RSS-num * TAMOR . An index :zsumed to be maintained on DTT, keyed on the B values to provide fast access to the DTI’ structure. This index is assumed to be implemented as 3 level B+ tree index, and it is assumed that the root is maintained in memory. The cost of accessing such an index denoted by CIcost is shown in [RAM93]. Cursoring through the STEMP’s B values, the index is accessed to obtain the matching A values costing

T aTEMP = CIcost(RSS-num, SS, LKEY) , where SS is the semijoin selectivity of relation S. The total cost of each of the num-run - 1 (except the last run) is given by the function

T omm WS-num) = TRSS wit + TB valr

+T- - .TEMP + TRTEMP.

The total response time for the MAS-HI-I is given by TMAS-HH = (num-runs - 1) * T,,,,(RSS-num)

+ Tmm, (IIRII - (RSS-num * (num-runs - 1)) .

3.3.3 MAS-SM Performance

The DTT in this method is assumed to be fragmented and stored in the same location as the attribute in the fragment. MAS-SM begins by performing a join operation between the base relations (R or S) and the DTT fragment at the base relation’s site. Assuming that R site is the joining site, the results of the join have to be shipped to the R site. Tm, and Ts- the costs of performing the join at site1 and site2, respectively, and the cost of shipping the results id the join site are given by

TSllW = TINNERS + Tcxrm + Tscm + T,.~GEI 9 and T SITE? = TDWW + Tmm + ‘Ikmn + TMERGU

+ TSHP ,

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

where TrmR, the cost of accessing the inner relation (in this case, the DTT fragment relation), TW, the cost of accessing the outer relation (reelation R), and Ts~nr ,the cost of sorting the base relation. TsoRT is cabdated using Sterling’s estimtion [ KNU73]. Since the DTI’ is already maintained sorted on the join attribute, no cost is assigned to that. TlvatRas, is the cost of merging the ordered relations, and Tsm. is the cost of sending the joined results from the S site to the R site. Each of the above c’osts for the R site are calculated as follows

TINNER1 = ‘.kam-up + T~TAMOR * II R 11, T DUTERI =T~,-m+TAMm*IIRII, T SORT1 = II R II * logZ( IF R II) * (L&4)

* TWMp + II R II * L&4 * T-NW + II R II * LrupLe * Tk, and

T MERGE, = ( 2 * II R II) * L&4 * TWCW + IIRII*SR*(L’IUPLE+L&*T~~~.

The similar costs for the S site are as follows T INNEERZ = Tsc~~,sFnrp + TDTWOR * 11 S 11, T =TWSFNP+T,-*IISII, T :z= II S II * lo&( II S II) * (LwI4) * TV,W

+ II S II * L&4 * TwcMp + II S II *LUPLR*TWMV,

T MEROE2=(2* IISII)*L,&~*TWW + II S II * SS * (L,,.,pw + Lm) * T,,w ,

and T SHIP = Tmsrmp+IISII*SS

*(Lw+Lm)*8/~. Since the results from both sites are ordered on their vids, the cost of performing a merge to obtain the join result is given by

T ENDMERGE =(lIRII*SR+IISII*SS) *LvJ4*TWCMp+ (llRII*SR+

IISII*SS)*LTIJPU*TWM~. The total cost of the MAS-SM is given by

T MM-W = max (TwE,, TS& + TENDMERGE.

3.3.4 MAS-VF Performance

The analysis of MAS-VF is similar to that of MAS- SM except that the cost of the nested inner query has to be calculated along with other costs. The cost of the inner sub-query is as follows:

T INNER-R = TX--- +max(IIRII*SR* Tm-b II S II * SS *Tw,-wcx)

+ (II R II * SR + II S II * SS ) * L&4 *T~CMp+IIRII*SR*(L~~+L~) * TWMV + TNTRJEW + (Lm+Lvm) *IIsII*ss*8/NTR.

The first line in the above equation shows the setup cost of accessing both the databases and the. cost of accessing both fragments of the D’IT. The second and third lines show the cost of merging the DTT fragments. The fourth line shows the shipping cost of sending the joined DTT result to the join site. Similar costs exist for the S site and are denoted by T NESTED-S. The cost of performing the outer query at the R site is denoted by T-- and is calculated as :

Tom-~ = T--snup+ (IIRII*SR) *TAMoR+TSDRT .

TsmT is calculated in the same way as shown in section 3.3.3. A similar cost denoted by Tomks exists for processing the outer query at the S site. The only other additional cost at the S site is the cost of sending the nested query results to the join site, T,,,.

T SHIP = T~-smrp+IISII*SS *(Lmy+LvID)*8/NTR.

Once the results from the queries at both sites are obtained, a merge is performed to produce the join result, and the cost for the merge is given by

T EmMERGE = ( 11 R ll * SR + ll S ll * SS ) *Lv&4*TwcMp + (IIRII*SR + II S II * SS ) * LNPLE * T&v .

The total cost of the MAS-VF method is as follows

4.0 Graphical Comparison

In this section, the response times of MAS-NL, MAS-HH, MAS-SM, and MAS-VF are compared using the performance model described in Section 3.3. A number of parameters, such as semijoin selectivities, memory sizes, sizes of relations, etc., were varied to observe their effects on the methods. One parameter is varied while the others are held constant.

MAS-SM and MAS-VF have the potential of taking advantage of pipelining if such a mechanism is available. Since only MAS-HH is dependant on the size of the memory, varying the size of the memory and the relations are found to be of very little interest. Hence, the effects of change in memory and relation sizes on response time are not discussed. In multidatabases, the notion of a primary key - foreign key relationship may not be relevant, as it may violate the autonomy of a local DBMS. Hence, the semijoin of both relations is treated the same way. In other

269

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Figure 2. Response Time Vs. SS. DB=DB2, R=&200,000 tuples

NTR=4Mbps, SR-0.1

words, the join is treated as a foreign key - foreign key join, and by varying a semijoin and holding the other constant, the same ef$ect as holding the first semijoin constant and varying the other occurs. In this section, the SS values are varied, while the SR values are held constant. The analysis parameters of Table 1 are used to calculate the response time of the four multidatabase query processing methods.

Figure 2 shows the effect of varying SS when memory is held constant at 1000 pages, the SR value at 0.1, the sizes of both the participating relations at 200,000 tuples, and the effective network transmission rate at 4 Mbps for a DB2 local database system. When the SS value is 0.001, the response times of MAS-NL, MAS-HH, MAS-SM, and MAS-VF are 219.21 seconds, 102.21 seconds, 69.74 seconds, and 10.27 seconds, respectively. When the SS value reaches a value of 1.0, the response times of MAS-NL, MAS- HH, MAS-SM. and MAS-VF are 130226.69 seconds, 10229.67 seconds, 160.86 seconds, and 138.15 seconds, respectively. It should be noted that the Y- axis is based on a logarithmic scale. The MAS-NL method takes the most response time because of the excessive database setup cost paid to access each tuple required. The MAS-HH method accesses as many tuples as will fit in its memory at a time and, hence, does not have to pay the setup cost of accessing a local DBMS quite as many times as MAS-NL does. Considerable amount of time is spent in accessing the D’IT structure, and this reflects on the method’s poor performance. The MAS-SM method performs the join by ordering the base relations’ tuples and then merging

them, based on their vid values. Since this method does not benefit from semijoin reduction, the curve stays flat until SS reaches a value of 0.1. The overhead costs are the costs of reading the respective DTI’ fragments and the costs of merging a base relation and a DTT fragment. When the SS value is 1.0, the cost of reading the R relation and the DTI’ fragment is approximately 56.753 seconds, and the cost of merging them is 10.172 seconds. The .similar costs at the S site are 98.554 seconds (includes the data transmission cost) and 31.683 seconds, respectively. MAS-VF, on the other hand, benefits from semijoin reduction which is realized by joining the DTT fragments prior to accessing the base relations. The cost of executing the inner subquery is approximately 1.548 seconds when the SS value is 0.001. The cost of executing the outer query, which produces the tuples in an ordered sequence at the same SS value, at the R site is 5.896 seconds and 0.646 seconds at the S site. The S site also incurs a network transmission cost of 0.241 seconds, which is the cost of sending the reduced relation to the join site at 4 Mbps.

Figure 3 has the same parameter values as in Figure 2 except that the local DBMS is Informix. When the SS value is 0.001, the response times of MAS-NL, MAS-HH, MAS-SM, and MAS-VF are 487.357 seconds, 475.657 seconds, 131.485 seconds, and 20.433 seconds, respectively. The difference in costs between Figures 2 and 3 is due to the higher per tuple processing cost in Informix and a higher database setup cost (both for access through a scan or through

Figure 3. Response Time Vs. SS. DB=lnformix, R=S=200,000 tuples

NTR=4Mbps, SR=O.l

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

260

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

Figure 4. Response lime Vs. SS. DMnformix, R=S=200,000 tuplas

NTFMMbps, SR=O.l

an index) in DB2. The higher processing cost in Informix is reflected in larger execution times in MAS- NL and MAS-HH in Figure 3.

Figure 4 is the same plot as in Figure 3 except that it highlights the MAS-SM and MAS-VF methods, which were overshadowed in Figure 3 due to the logarithmic scaling of the Y-axis. Since MAS-VF benefits from semijoin reduction, its response time is much lower than the response time of the MAS-SM method. At lower SS values, the cost of joining the base relation and the DTT fragment is the principal cost in the MAS-SM method. As the semijoin value increases, the cost of processing the accessed tuples and the cost of merging sorted relations become significant. When SS reaches 1.0, both the MAS-SM and MAS-VF method become the same. But the plot for MAS-VF in Figure 4 shows that it exceeds the response time of

Figure 5. Response Time Vs. SS. DB=DB2, R=S=200,000 tuples

NTR=4Mbps, SR=O.25

MAS-SM. When SS=l.O, the outer query in the MAS-VF method is in fact the MAS-SM method. The extra cost in the figure is the cost of perfomiing the inner part of the nested query, which does not produce any reduction.

Figure 5 compares the response times of the four methods. The number of tuples of both relations is held constant at 2OOOO0, the memory size at 1000 pages, the SR value at 0.25, and the network transmission rate at 4 Mbps. Just as in Figures 2 and 3, the MAS-VF method clearly outperforms the MAS- NL and MAS-HH methods with the response times being lower by one order of magnitude in low semijoin selectivity values and almost three orders of magnitude in higher values. In comparison to the MAS-SM method, the MAS-VF method outperforms MAS-SM at almost all values, except when the semijoin selectivity is 1.0, in which case it performs just as well. It should be noted that a case where SS=l.O, implying a primary key - foreign key relationship between the relations, is rare in the multidatabase environment.

5.0 Conclusions and Future Directions

In this paper, we have discussed the need for query optimization for better overall system performance in a multidatabase system. The need for a domain translation table is discussed and possible ways of implementing such tables are discussed. Four methods by which a global join query to a multidatabase system can be processed have been introduced. An analytical model to compare the performance of each method is developed and the response times of each method on multidatabases comprising of two commercial databases, Informix and DB2, are compared graphically. The MAS-Value Filter method clearly performs much better than the other methods. Its response time grows linearly with increase in selectivities as against the exponential growth of some of the other methods. It also benefits fully from “semijoin reduction” as against the MAS-Sort Merge method which incurs a constant cost of sorting the participating relations of a join and hence does not benefit from semijoin reduction. Current efforts are underway to build prototype implementations on a Cray XMP, a Solboume, a Sun 670/4, 5 Decstation 5OOOs, a AIX based RS6000, 3 NeXTs, a few Sun Spares and a few PCs.

261

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE

Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995

6.0 Bibliography

lB=q Breitbart, Y., Olson, P. L., and, Thompson, G. R., “Database Integration in a Distributed Heterogeneous Database System”, Proc. of the Id. Conf. on Data Engg, 1986. pp. 301-310. K=W Ceri, S., and Pelagatti, G., Distributed Databases Principles and Systems, McGraw-Hill Book Company, 1984. v=Jm Czejdo, B., Embley, D. W., and, Rusinkiewicz, M., “An Approach to Schema Integration and Query Formulation in Federated Database Systems”, Proc. of the 3rd Intl. Conf. on Data Engineering, 1987. pp. 477-484. [DAY841 Dayal, U., and Hwang, H. J., “View Definition and Generalization for Database Integration in a MDBS System” IEEE Transaction on Software Engg., Vol. 10, No. 6, Nov. 1984. pp.628~645. [DEW841 Dewitt, D.J., et. al., “Implementation Techniques for Main Memory Database Systems”, Proc. ACM SIGMOD, June 1991,. pp. 1-8. [DeM89] DeMichiel, L. G., “Performing Operations over Mismatched Domains”, Proc. of the Fifth Intl. Conf. on Data Engg., 1989. pp. 36-45. [DU92]Du, W., Krishnamurthy, R., and Shan, M. C., “Query Optimization in Heterogeneous DBMS”, Proceedings of the 18th Very Large Data Bases Conference, 1992. pp. 277-291. [HAM791 Hammer, M., and McLeod, D., “On Database Management System Architecture”, Tech. Report. MITiLCS/Tm- 141, Massachusetts Institute of Technology, Cambridge, Mass. 1979. [mm Heimbigner, D., and McLeod, D., “A Federated Architecture for Information Management”, ACM Transactions on Office Information Systems, 3, 3, July, 1985. pp. 253-278. [IMS91] Proc. of the First Intl. Conf. on Interoperability in Multidatabase Systems, Kyoto, Japan, April 1991. [KNU73] Knuth, D., The Art of Computer Programming: Sorting and Searching, Vol. 3, Addison-Wesley, Reading, Mass., 1973. WTw Litwin, W., and, Vigier, P., “Dynamic Attributes in the MDBS System MRDSM”, Proc. of the Intl. Conf. on Data Engg., 1986. pp. 103-l 10. [LU92] Lu, H., and Shan, M., “Global Query Optimization in Multidatabase Systems”, 1992 NSF Workshop on Heterogeneous Databases and Semantic Interoperability, 1992.

[MAR911 Markowitz, V. M., “An Architecture for Identifying Objects in Multidatabases”, Proc. of the First Intl. Workshop on Interoperability of Multidatabases, April 1991. pp. 294-301. FICLW McLeod, D., “INTERBASE: An Approach to Controlled Sharing Among Autonomous, Heterogeneous Database Systems”, Proc. of the Intl. Conf. on Data Engg., 1990. Vol. 13, No.2. pp. 4-9. [MS1901 Breitbart, Y ., Garcia-Molina, H., Litwin, W., Roussopoulos, N., Rusinkiewicz, M., Thompson, G., and Wiederhold, G., Final Report of the Workshop on Multidatabases and Semantic Interoperability,, November 2-4, 1990. [PER9la] Perrizo, W., Gustafson, J., Thureen, D., Wenberg, D., and Davidson, W., “Domain Vector Accelerator (DVA): A Query Accelerator for Relational Operators”, IEEE Conf. on Data Engg., April, 1991, Kobe, Japan. [PER9lb] Perrizo, W., Rajkumar, J., and Ram, P., “HYDRO: A Heterogeneously Distributed Database System”, The Proc. of SIGMOD Intl. Conf. on Management of Data, 199 1. IpER Perrizo, W., Ram, P.,er al., “KB-HYDRO: A Distributed System for Integrating Data and Knowledge Bases”, in the International Journal on Mini and Microcomputers, 1993. [RAM931 Ram, P., “Residual Surrogates in Database Query Processing”, Ph.D. Dissertation, North Dakota State University, NDSU-CSOR-TR-9312. [SHA86] Shapiro, L., “Join Processing in Database Systems with Large Main Memories”, ACM Trans. on Database Systems, Sept. 1986. pp. 239-264. [SHE90] Sheth, A., and Larson, J., “Federated Database Systems”, ACM Computing Surveys, Vol. 22, Number 2, Sept., 1990. pp. 185-236. [TEM87] Templeton, et. al., , “Mermaid - A Front End to Distributed Heterogeneous Databases”, Proc. of the IEEE, Vol. 75, No. 5, May 1987. pp. 695-708. [TH090] Thomas, G., et. al., ., “Heterogeneous DDBSs for Production Use”, ACM Comp. Surveys, Vol. 22, No. 3, Sept. 1990. pp. 237-265. [VEIS6] Veijalainen, J., and Popescu-Zeletin, R., “On MDBS Trans. in a Cooperative Autonomous Environment”, Tech. Report, Hahn-Meitnet Institut, Berlin GmnH, D-1000 Berlin 39, FRG, 1986. [WAN891 Wang, R. Y ., and, Madnick, S. E., “The Inter-Database Identification Problem in Integrating Autonomous Systems”, Proceedings of the Fifth Intl. Conf. on Data Engg., L 1989. pp. 46-55.

262

Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS '95) 1060-3425/95 $10.00 © 1995 IEEE