informix - db.cs.berkeley.edudb.cs.berkeley.edu/papers/informix/ · informix corporation, or their...

15
Informi x Performance Penalties for Simulating Object-Relational DBMSs Dr. Michael Stonebraker Chief Technology Officer Informix Software, Inc.

Upload: ngothuy

Post on 29-Apr-2018

218 views

Category:

Documents


1 download

TRANSCRIPT

I n f o r m i x

Performance Penalties

for Simulating

Object-Relational

DBMSs

©1997 Informix Software, Inc. The following are worldwide trademarks of Informix Software, Inc.,Informix Corporation, or their subsidiaries, registered in the United States of America as indicatedby “®” and in numerous countries worldwide: Informix®, The Informix Logo®. All other names ormarks are registered trademarks or trademarks of their respective owners.

4100 Bohannon Drive

Menlo Park, California 94025

1 415 926 6300

World Wide Web: http://www.informix.comPrinted in U

.S.A.–10K

–02/97–000-21451-70 Dr. Michael Stonebraker ♦ Chief Technology Officer ♦ Informix Software, Inc.

Unleashing Business Innovation

Informix Software is leading the next greatwave in database innovation to enables theworld’s leading corporations to manageand grow their businesses. With its focusedtechnology strategy, superior customer service, and best-or breed partnerships,Informix is the demonstrated technologyleader for corporate computing environmentsranging from workgroups to very largeOLTP and data warehouse applications—as well as a catalyst in major new IT trendssuch as dynamic content management, theWeb, smart cards, and mobile computing.

For more information about Informixproducts and services, please contact thesales office nearest you, or visit us on theWeb at www.informix.com

Regional Sales Offices

Asia/Pac i f i c65 298 1716

Canada (Toronto)416 730 9009

Europe/Midd le East/Afr i ca44 1784 422 000

Federa l703 847 2900

Japan81 3 5562 4500

Lat in Amer i ca305 265 7545

North Amer i ca800 331 1763415 926 6300

Table of Contents

Abstract

Introduction 1

Performance Problems with Simulation 3

Problems with Cross-System Services 7

Conclusion 10

References 11

Abstract

This white paper examines several of the serious performance problems

associated with current architectures used to simulate an object-relational

database management system (ORDBMS). Unlike a native object-relational

DBMS engine, simulator middleware and multiple specialized servers both

impose a number of performance penalties.

Introduction

The object-relational architecture proposed by both Sybase and Oracle

(Version 8), attempts to simulate object-relational functionality by means

of an object-relational simulator in middleware outside the relational DBMS.

The simulator works in conjunction with a series of separate specialized

servers (see companion white paper “Architectural Options for Object-

Relational DBMSs” (STON96). This architecture is illustrated in Figure 1.

The object middleware layer serves three distinct purposes:

1) Object Simulation: An object simulator is required to support

SQL functionality for objects stored in the relational DBMS.

2) SQL Executor for Objects: An SQL execution engine is required

for objects stored in one of the specialized servers.

3) Cross-System Services (Glue Layer): Because this architecture

fundamentally has two “heads,” a relational engine and an object

executor, data can be stored in both systems. Therefore this

object-simulation environment requires cross-system joins and

transactions supported by a cross-system glue layer.

Re l a t i ona l DBMS

Ob j e c t -Re l a t i ona l S imu la t o r

Se r ve r 1 Se r ve r 2 Se r ve r X

1

F i g u r e 1 :

Middleware Layer

Supporting

Specialized Servers

These three functions are illustrated in more detail in Figure 2.

This white paper examines several of the performance problems associated

with the simulator architecture shown in Figure 2. The performance of a

simulator-based architecture is contrasted with that of a native implementation

of a true object-relational engine.

The simulator architecture stores objects in a middleware layer outside the

relational engine. The next sections analyze specific examples of performance

problems that result from this sort of simulation.

The last section examines the problems that arise from storing data in

multiple server systems and then using cross-system services to support SQL

across those systems.

Re l a t i ona l DBMS

Ob j e c t S imu la t o r SQL Exe cu t o r

C ro s s - s y s t em s e r v i c e s ( g l ue l aye r )

Se r ve r 1 Se r ve r 2 Se r ve r X

2

F i g u r e 2 :

Cross-System

Services

Performance Problemswith Simulation

In the late 1980s, a relational database company had just implemented the

SQL-89 notion of date and time. This simple datatype had been added to

SQL-89 to support the many business data processing environments with

temporal data. The company, of course, implemented Julian calendar semantics

for time, as mandated in the SQL standard. Using the Julian calendar, a

calculation subtracting February 15th from March 15th yields 28 days,

except during leap years and double leap years.

One customer, who had been eagerly awaiting this new functionality, called

the company a few days after the release in a state of extreme agitation. He

claimed the company had implemented time incorrectly. This is what happened:

The customer had an application that computed interest on financial

bonds for a Wall Street investment firm. A simplified version of his schema is:

Bond (name, coupon_rate,

date_sold, date_bought, interest)

His application included the following calculation:

update Bond

set interest = coupon_rate *

(date_sold - date_bought)

where …

In effect, he wished to use SQL to compute the interest due to the bond

holder. Unfortunately, the above SQL statement failed in his application for

the following reason:

In the United States bond market (but not in most other countries), the

same amount of interest is earned during each month, no matter how many

days in the month. Hence, Wall Street bonds use a calendar in which a year

of 360 days is divided into 12 equal-length months. As such, March 15th

minus February 15th equals 30 days. Let’s call this semantics “bond time” to

contrast it with “Julian time.”

3

The application required a notion of subtraction for times that followed

the bond time calendar, but the SQL command implements only Julian time.

As a result, the interest computed was incorrect. The reason for the customer’s

anguish was that he was forced to utilize the following workaround:

1) Set up a cursor for the query on Bond.

2) Iterate over the result set, for each qualifying row, retrieving the two

times and the coupon rate.

3) Compute bond time subtraction in user code, followed by the rest of

the interest calculation.

4) Put the answer back in the database by updating the current record.

This workaround has two serious drawbacks relative to the native SQL code.

First, it is very slow. Instead of performing a set-oriented update in SQL

with a single round trip from the application to the DBMS, the customer

required a cursor and a round trip per qualifying record. Moreover, he

needed to copy the data out of the DBMS to a user program. In aggregate,

he reported that the workaround was a factor of three times slower than the

native code. As a result, his inability to use the Julian calendar time cost him

a factor of three in performance.

The second problem was the requirement of maintaining a library of

routines in user space that programs could call to perform bond time calculations.

This presented serious system administration difficulties. Altogether, the

workaround was both slow and painful.

The obviously better solution is to use a true object-relational DBMS,

which supports type extensions. In such a system, the customer can add

bond time and the correct notion of subtraction to the DBMS. In this case,

bond time will run at the same performance as Julian time.

The conclusion to be drawn from this example is that a simulator will

run much slower than a native implementation.

A second example is equally revealing. Consider the following SQL

command for the traditional employee table:

emp (name, salary, age, dept)

select name

from emp

order by name;

4

Although this yields the correct answer in the United States, it fails in

Scotland. The following names are all part of the Tavish clan and sort together:

MacTavish

McTavish

M’Tavish

Unfortunately, the character string datatype in SQL sorts according to the

ASCII notion of “<=”. This won’t work in Scotland, because the Tavish clan does

not obey the ASCII sort order. Hence, Scottish names must be implemented in

the middleware layer. Rather than the underlying DBMS engine sorting the

records with a minimum of data copying, all records must be sent from the

relational engine to the middleware where a second sort routine must be

implemented. Naturally, this entails a serious performance degradation.

As a third example, consider a variant of the above query:

select name

from emp

where name < ‘B’

order by name;

This query requests a sorted list of a subset of the names. A native object-

relational engine can use a B-tree index on the Scottish name datatype. Because

the object-relational engine knows that the Tavish name is of type “Scottish

name,” it can use the correct definition of “<” to build the B-tree. Then the

engine can use an indexed scan to find the required subset of employees.

5

In contrast, a simulator architecture can request that a B-tree be built on

name, but it will not work correctly because the B-tree will use the ASCII

notion of “<” in building and exploring the index. An indexed scan is not

available to the simulator technology, so a complete sequential scan of

employees must be utilized. The performance difference between an indexed

scan and a sequential search on the above query will be at least one order of

magnitude. Again, this example demonstrates the presence of very serious

performance degradation.

In summary, there is no reason to believe that object-relational simulators

will work well on top of relational engines. The performance problems with

simulators include:

• Difficulties in generating a good query plan

• Inability to use indexing

• Excessive copying of data

Many customers have hand-coded such simulators and noted similar

experiences. The performance penalty varies from application to application.

However, in time-series applications for the investment industry, Informix

saw about two orders of magnitude performance improvement when using

type extensions. This difference in performance was the difference between

the application being infeasible or viable.

6

Problems withCross-System Services

A customer would be expected to use the simulator architecture shown in

Figure 2 by putting traditional datatypes in the relational engine and rich

datatypes in one of the specialized servers supported by an object executor.

This solution entails a second kind of serious performance problem, illustrated

with the following example. Consider the schema:

emp (id, name, age, salary,

picture, resume)

In this case, the table might be divided like this:

emp-R (id, name, age, salary)

‡ relational DBMS

emp-OR (id, picture, resume)

‡ one or more servers

Of course, id must be repeated in the emp-OR table as a mechanism

to join the corresponding records in both tables together. Now consider

the following query:

select name, resume

from emp

where age > 40

and wants (resume) = “marketing”’;

In a native object-relational engine, this is a single query to a single engine.

A single query optimizer can decide whether to use an indexed scan over the

B-tree index on age, or a functional index on the wants function, or both.

7

In contrast, the multiple server architecture must run the following

pair of queries:

select id, name

from emp-R

where age > 40;

select id, name, resume

from emp-OR

where wants (resume) = “marketing”;

There are two serious problems with the multiple server architecture.

First, the B-tree index on age is not available to help with the emp-OR

portion of the query. Likewise, a functional index on wants is not available

to help with the emp-R subquery.

Obviously, the single query to the native object-relational engine is much

more efficient than the pair of queries above.

There is another problem with the dual query. The results of the two

queries must be joined together to produce the final answer. This join can be

done in one of two ways:

1) Load the result of one query into the “other head.” Then, do the join.

2) Retrieve both results into the cross-system services layer, and implement

a join algorithm in the middleware.

Either way is slow. Moreover, the join is extra work, necessitated by the

presence of a second system. Joins are typically one order of magnitude slower

than single-table queries.

The combination of requiring a join and other indexing inefficiencies

causes a performance degradation of about one order of magnitude.

Performance problems are even more severe when updates are encountered.

Consider the following command:

delete emp-OR

where name = ‘Mike’;

8

This command deletes a single record. It can be efficiently performed by

a native object-relational engine using a B-tree to find Mike’s record and then

removing it.

In contrast, in a multiple server architecture, Mike’s record will span

multiple storage systems. As a result, two serious performance problems

occur. First, Mike’s record must be found in the multiple server systems and

then removed. To a first approximation, the extra work is proportional to the

number of storage systems involved.

However, the second problem is even more troublesome. If one of the

storage systems fails while the second remains operational, then the transaction

that deletes Mike’s record may be committed in one system and aborted in the

second one. This would result in an inconsistent database. To prevent such

inconsistencies, the cross-system services layer must implement a two-phase

commit protocol. This protocol would avoid inconsistencies; however, a

performance penalty is incurred by an extra set of messages between the

middleware and the storage engines. This adds to the already serious performance

tax imposed by a multiple storage system architecture.

9

Conclusion

The multiple system architecture depicted in Figure 2 incurs serious performance

problems. The object simulator on top of a relational engine imposes one sort of

severe penalty. On the other hand, storing data in multiple systems requires a

cross-system services layer that turns most queries into joins and most updates

into distributed updates. Again, serious performance problems result.

For these reasons, a single object-relational engine is the preferred alternative.

10

References

[STON96] Stonebraker, M., “Architecture Options for Object-Relational

DBMSs,” Informix White Paper, November 1996.

11

I n f o r m i x

Performance Penalties

for Simulating

Object-Relational

DBMSs

©1997 Informix Software, Inc. The following are worldwide trademarks of Informix Software, Inc.,Informix Corporation, or their subsidiaries, registered in the United States of America as indicatedby “®” and in numerous countries worldwide: Informix®, The Informix Logo®. All other names ormarks are registered trademarks or trademarks of their respective owners.

4100 Bohannon Drive

Menlo Park, California 94025

1 415 926 6300

World Wide Web: http://www.informix.com

Printed in U.S.A

.–10K–02/97–000-21451-70 Dr. Michael Stonebraker ♦ Chief Technology Officer ♦ Informix Software, Inc.

Unleashing Business Innovation

Informix Software is leading the next greatwave in database innovation to enable theworld’s leading corporations to manageand grow their businesses. With its focusedtechnology strategy, superior customer service, and best-of-breed partnerships,Informix is the demonstrated technologyleader for corporate computing environmentsranging from workgroups to very largeOLTP and data warehouse applications—as well as a catalyst in major new IT trendssuch as dynamic content management, theWeb, smart cards, and mobile computing.

For more information about Informixproducts and services, please contact thesales office nearest you, or visit us on theWeb at www.informix.com

Regional Sales Offices

Asia/Pac i f i c65 298 1716

Canada (Toronto)416 730 9009

Europe/Midd le East/Afr i ca44 1784 422 000

Federa l703 847 2900

Japan81 3 5562 4500

Lat in Amer i ca305 265 7545

North Amer i ca800 331 1763415 926 6300