Managing Time in Relational Databases

This document is compiled from a series of articles published in a trade journal, and should be of interest to database and data warehouse professionals.

Articles in the Time and Time Again Series (each titled "Time and Time Again: Managing Time in Relational Databases"):

Part 1. Tom Johnston, Randall Weis. DM Review magazine, May 2007.
Part 2 - Preliminaries. Tom Johnston, Randall Weis. DM Direct, May 2007.
Part 3 - Version Patterns 1 and 2. Tom Johnston, Randall Weis. DM Direct, May 2007.
Part 4 - Preliminaries to Version Pattern 3. Tom Johnston, Randall Weis. DM Direct, June 2007.
Part 5: Version Pattern 3. Tom Johnston, Randall Weis. DM Direct, June 2007.
Part 6: Version Pattern 4. Tom Johnston, Randall Weis. DM Direct, June 2007.
Part 7: Completing Version Pattern 4. Tom Johnston, Randall Weis. DM Direct, July 2007.
Part 8: Version Pattern 5. Tom Johnston, Randall Weis. DM Direct, August 2007.


Part 9: Completing Version Pattern 5. Tom Johnston, Randall Weis. DM Direct, August 2007.
Part 10 - Foreign Keys in Version Tables. Tom Johnston, Randall Weis. DM Direct, September 2007.
Part 11 - Foreign Keys in Versioned Tables Continued. Tom Johnston, Randall Weis. DM Direct, October 2007.
Part 12: Foreign Keys in Version Tables. Tom Johnston, Randall Weis. DM Direct, October 2007.
Part 13 - A Roadmap of the Rest of This Series. Tom Johnston, Randall Weis. DM Direct, November 2007.
Part 14 - Versioning with Surrogate Keys. Randall Weis, Tom Johnston. DM Direct, November 2007.
Part 15 - For and Against Surrogate Keys. Tom Johnston, Randall Weis. DM Direct, December 2007.
Part 16 - Physical Implementation of Version Pattern 1. Randall Weis, Tom Johnston. DM Direct, December 21, 2007.

Part 17 - Physical Implementation of Version Pattern 2. Randall Weis, Tom Johnston. DM Direct, January 4, 2008.
Part 18 - Physical Implementation of Version Pattern 3. Randall Weis, Tom Johnston. DM Direct, January 18, 2008.
Part 19 - RI and Temporal RI. Tom Johnston, Randall Weis. DM Direct, February 1, 2008.
Part 20 - RI and Temporal RI (Continued). Tom Johnston, Randall Weis. DM Direct, February 15, 2008.
Part 21 - Enforcement Time Options for Temporal RI. Tom Johnston, Randall Weis. DM Direct, March 7, 2008.
Part 22: Original and Temporal Deletes. Tom Johnston, Randall Weis. DM Direct, March 21, 2008.
Part 23 - Original and Temporal Updates. Tom Johnston, Randall Weis. DM Direct, April 4, 2008.
Part 24: Original and Temporal Updates (Concluded). Tom Johnston, Randall Weis. DM Direct, April 18, 2008.

Part 25 - Original and Temporal Inserts. Tom Johnston, Randall Weis. DM Direct, May 2, 2008.
Part 26: Points in Time and Periods of Time. Tom Johnston, Randall Weis. DM Direct, May 16, 2008.


Time and Time Again: Managing Time in Relational Databases, Part 1

Tom Johnston, Randall Weis
DM Review Magazine, May 2007

Relational databases usually keep data about the current state of things - about how things are now. Updates do occur, of course; but when they do, they overwrite the data already there. In doing so, history is lost. We can't tell how things used to be. To tell how things used to be, as well as how they are now, requires that we introduce time into our databases.

This series of articles on a very technical topic is motivated by the fact that increasingly, businesses need real-time access to historical data. No longer can all historical data be pushed off onto archive files or onto less immediately accessible historical databases. The objective of this series of articles is to show how to provide real-time access to history in particular, and to any kind of noncurrent data in general, and to show how to do so using the relational database management systems (DBMSs) and SQL that are available today.

First, a note on terminology. It will often be convenient to talk about history, because that is the kind of noncurrent data that businesses are primarily concerned with. But the real topic is the representation of time in relational databases, whether past, present or future time, whether real-world or database time, whether actual or hypothetical time, whether linear or branching time.


Managing time in today's relational databases is difficult. With more than 50 years of combined experience in data and database management, we have encountered no database model more complex for its size than the model we will propose, no queries fraught with more possibilities for misstatement and misinterpretation than queries against databases that attempt to manage time. But we know of no simpler way to meet business requirements for real-time access to what we will call "versioned history."

However, very little support for time management has yet been provided by DBMS vendors or specified in the current SQL standards. The complexity of time management cannot yet be hidden, as it ultimately must be, inside the DBMS itself, with declarative interfaces provided in extensions to the SQL DDL and DML languages.

But businesses need robust and reliable time management in their databases now, not eventually. So we take a do-it-yourself approach. And the first thing we must do ourselves is find a way to encapsulate both time management data structures and the additional complexities in SQL statements that are required for time management data access.

As for updating temporal data structures, all updates should be mediated by application program interfaces (APIs). If an operational data store (ODS) or data warehouse is the database being updated, the API function will be provided by the extract, transform and load (ETL) process, whether hand coded or implemented with a commercial ETL tool.

For retrievals against temporal data structures, all queries should access a set of view tables from which as much of the technical detail of managing time as possible will have been removed. These tables will satisfy the following two requirements (a sketch of such a view follows the list):

- When point-in-time data is wanted, these view tables will look, column for column, like tables that are not temporally managed, i.e., like "typical" or "normal" tables in a database.
- When that desired point in time is now, these view tables will guarantee that existing retrievals will remain completely unaffected as time management is added to the tables they access.
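
As a minimal sketch of such an encapsulating view, assuming a hypothetical Policy_Version table whose version begin and end date columns (ver_beg_dt, ver_end_dt) delimit each version (all names here are ours, for illustration only):

    -- A view exposing only the versions in effect today, with the
    -- temporal machinery hidden; it looks like an ordinary Policy table.
    CREATE VIEW Policy_Current AS
    SELECT policy_nbr, client_nbr, policy_type
    FROM Policy_Version
    WHERE ver_beg_dt <= CURRENT_DATE
      AND ver_end_dt >= CURRENT_DATE;

A retrieval written against Policy_Current neither knows nor cares that versioning is going on underneath.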

We should think of these temporal tables as no more accessible to the end user, DBA or developer than are the dynamic balanced-tree access method structures that provide the file storage subsystem for many commercial DBMSs. The only database users able to directly manipulate these temporal tables should be:

- The developers who translate source system or end-user-supplied update transactions into updates to these temporal tables, and
- The developers who write the CREATE VIEW statements that encapsulate these tables for retrieval transactions.

Figure 1 illustrates the encapsulation necessary to hide the complexities of temporal database structures from both those who update those structures and those who retrieve data from them.

Figure 1: Encapsulating Temporal Tables

When we IT professionals talk about history in databases, we may have several different things in mind. A taxonomy of the types of history that IT data management professionals talk about is presented in Figure 2.


Figure 2: A Taxonomy of History as Managed in RDBMSs

Database history is historical data kept in relational database tables. The two major categories of historical data are reconstructable history and queryable history.

Reconstructable history is data about the past state of something, obtained by restoring a backup file and then applying update transactions captured in a DBMS logfile, from that point in time forward to the point in time of interest. This requires the intervention of IT personnel and is therefore not real time.

Queryable history is data about the past state of something, obtained with a SQL query, without the need for restoring backups and reapplying logged updates. Because the query may be directly executed, this type of history is possibly real time. It is actually real time if its performance meets the parameters which define real time for the class of query it belongs to. Queryable history may be kept as either a record of events or as a record of states.

Event history is data about events that have altered the states of things. This data is captured as transactions in tables that are best queried with links to the leaf-level entries of a set of dimensional hierarchies. Thus, these tables are often queried as the fact tables in star schema data marts, usually by means of OLAP reporting tools.

State history is historical data, captured as copies of the current state of an object, either on a periodic basis or in response to a specific update event. State history may be kept as either snapshots or versions.

Snapshot history is a coordinated set of copies in a relational database. Snapshots are often taken at the entire database level, but sometimes at the level of a semantically related subset of tables. Snapshots occur at regularly scheduled intervals. Consequently, snapshots will miss any updates that are overlaid by later updates to the same data that happen before the next snapshot is taken. Another drawback to snapshots is that they are an inefficient way to record history, because they create copies of rows whether or not those rows have changed.

Version history is individual row-level logical updates, implemented without overlaying data, by "retiring" the current version of the object and replacing it with a new version that contains the updated data. Versions are created on an as-needed basis, i.e., every time an update important enough to be versioned occurs. Consequently, versions will not miss any updates. They are also an efficient way to record history because they create new rows only when a versionable change has occurred.

Thus, for example, suppose that the current version of customer 123's name is Mary Smith. When she changes her name to Mary Jones, the Mary Smith version of that customer loses its status as the current version. The Mary Jones version is inserted and marked as the new current version. Effectivity dates indicate when each version was the truth about Mary, as the business knew it.
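
As a sketch of the two physical actions behind this one logical update, using hypothetical effectivity-date columns and the "until next changed or deleted" date value ("12/31/9999") discussed in Part 2 (the names and dates are illustrative, not from the articles):

    -- Retire the Mary Smith version: close its effectivity period one
    -- clock tick (one day) before the new version takes effect.
    UPDATE Customer_Version
    SET eff_end_dt = DATE '2007-06-14'
    WHERE cust_nbr = '123'
      AND eff_end_dt = DATE '9999-12-31';

    -- Insert the Mary Jones version as the new current version.
    INSERT INTO Customer_Version (cust_nbr, cust_name, eff_beg_dt, eff_end_dt)
    VALUES ('123', 'Mary Jones', DATE '2007-06-15', DATE '9999-12-31');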

Over the past two decades, many suggestions have been made as to how vendors should implement time management in relational databases, and how SQL DML and DDL should be modified to provide declarative access to those new capabilities. The best access to pre-2000 computer science research is written by the principal computer scientist working in the field of temporal databases [1]. A well-known commentator on relational databases and a member of the SQL standards committee have also written on this topic [2].

But our focus is not on recommendations to vendors or to the SQL standards committees. Our focus is on the here and now. Given today's relational DBMSs, and the SQL common to them, how can we provide the temporal support that businesses increasingly need, and that they need now?

This series of articles will continue in the DM Direct newsletter and will be published every other issue beginning May 4. The discussion will be at the implementation level of detail, although to emphasize its relevance, we will present this very technical material in terms of a real-world business application. In these articles, academic research will be brought to bear on the problem of providing temporal data management in today's relational databases. However, our emphasis is more on real-world implementation than on theory. These articles will address the problems and issues faced by business IT data architects, data modelers, DBAs, those developers who write the ETL or other interface logic to update temporal tables and those developers who write the views which insulate the rest of the IT and business community from the complexities of temporal data.

It's about time.

References:

1. R. T. Snodgrass. Developing Time-Oriented Database Applications in SQL. San Francisco: Morgan-Kaufmann, 2000.

2. C. J. Date, Hugh Darwen and Nikos Lorentzos. Temporal Data and the Relational Model. San Francisco: Morgan-Kaufmann, 2002.


Time and Time Again: Managing Time in Relational Databases, Part 2 - Preliminaries

Tom Johnston, Randall Weis
DM Direct, May 2007

The needs which businesses have for versioned history are many and varied. However, those needs and the data patterns which satisfy them seem to fall quite nicely along a linear progression from simple to complex. We will call the patterns that satisfy each set of business needs "version patterns" because, in the terms organized by the taxonomy in Part 1 of this series, we are concerned with queryable history of the states of persistent objects, kept as versions of those objects.

However, our concern will not be exclusively with past and present versions. The more general topic is maintaining and querying tables which contain noncurrent as well as current versions of their persistent objects. And there are other kinds of noncurrent versions than historical ones. We will also consider versions for error correction, versions for future states, versions which overlap in time, coordinated (pending/sandbox) updates and even versions for hypothetical states.

Version Patterns: Manner of Presentation

In this series of articles, we will discuss a dozen version patterns. Each pattern will be introduced by describing the business requirements it supports.


Next, we will show how these version patterns work by presenting several scenarios and timelines for each pattern. Each scenario will show the results of doing an insert, update or delete against a table constructed according to the pattern. Each timeline will show what would be returned by a query against the past, present or future state of the object whose versions are being used as the example.

Initially, these transactions will be simple ones that physically affect just one row in one table. But gradually, more semantically expressive scenarios will be considered, ones in which what is seen as one transaction to the issuing code or person will result in several rows being affected in possibly several tables.

Following the scenarios for each history pattern, we will present and comment on a list of semantic constraints that must be followed for the pattern to work. Most of these constraints can be enforced only by triggers, stored procedures or application code. It is an indication of how "special" and complex temporal state management is that the normal DBMS entity and referential integrity mechanisms can do very little to enforce the semantics of these patterns.


Finally, each pattern will conclude with a list of objections to it that we have encountered. Some of these objections are valid in the sense that if the different business requirements for the pattern were weighted a little differently, a different pattern, or at least a different physical implementation than that suggested by the pattern, would be appropriate. Other objections are not valid, either because the pattern was not fully understood by those criticizing it, or because the flaws in the alternative pattern that the critics preferred were not fully understood.

In this series of articles, it will not be possible to discuss all the objections to these patterns that we are aware of. But we can, and will, at least note the existence of the objections we are aware of.

Following is an annotated list of the version patterns which we will discuss in these articles.

Figure 1: Annotated List of Version Patterns

Scenarios


Figure 2 shows the template we will use to present scenarios.

Figure 2: The Scenario Template

The numbered boxes in the template will contain the following information:

1. Name of the scenario.
2. Table used for this scenario.
3. Definition of the table.
4. Description of the scenario.
5. Now (whatever date that may be).
6. Column headings for the table.
7. One or more rows of the table.

Basically, boxes 1 through 5 contain metadata that is necessary to understand the scenario. Box 6 contains the column headings for the table under discussion. Box 7 is repeated one or more times, and each one represents a row in a relational table.

Timelines

Except for the first two patterns, we will also use a timeline diagram to explain what is going on. Figure 3 illustrates this diagram. It will be used beginning with Version Pattern 3, which is the first pattern that goes beyond a simple update-in-place approach to history.

Figure 3: The Timeline Diagram

Health Care Insurance: A Business Example for Version Patterns

Health care insurance is not the only industry which has a need for real-time history. But the insurance industry in general is certainly one in which real-time history is very important. Two paradigmatic insurance situations make the point: processing a claim and answering a customer telephone inquiry.

As for the first paradigmatic situation, a claim is filed for an event which happened at some point in the past. The claimant has a policy with the insurance company, to which she sends the claim.


But the policy which is relevant to the claim is not her policy as it is when the claim is received. Rather, it is her policy as it was when the claimable event occurred.

In property insurance, the claimable event is damage to property. In health care insurance, the claimable event is a service event, an event in which a doctor, hospital or other health care professional or organization provided medical services to the claimant.

Processing a claim involves comparing the claim to a past state of the policy against which the claim is being made. That past state may happen to be identical to the current state of the policy. But the point is that it equally well may not. And, in point of fact, it often is different. So processing a claim requires real-time access to historical data.

There is a customer-facing aspect to real-time claims data, also. And because it is customer-facing, it represents not just a need which a processing system has, but a product differentiator for the insurance company. This is our second paradigmatic situation.

These events occur when a customer calls the insurance company and asks for an explanation of benefits. Usually, what they are concerned with are benefits not provided, i.e., claims not paid or claims that were reimbursed for less than the policyholder thought they should be.

The customer service representative (CSR) needs online, real-time history to answer the customer's questions during that phone call, there and then. He needs complete information on the caller's policy, as well as the date on which the service behind the claim in question was provided. The need for real-time access to policy history exists because it is important to answer the caller's questions while the caller is on the phone. Companies which cannot do that are at a serious competitive disadvantage.

Other industries will have their own paradigmatic situations illustrating their own needs for real-time access to history. But claims processing is the operational heart of insurance, and customer satisfaction is the operational soul of any business.

So: historical data is already being managed in insurance company databases. Our intent, in this series of articles, is to present what we believe is a better way of satisfying this need than the various ad hoc solutions we have seen over the past few decades. And "better" has the bottom-line meaning it always has had: better in terms of revenue enhancement, cost containment and customer service.

Two additional points will serve to complete our introduction to this topic. The first point illustrates the increasing importance of versioned history by considering the breakdown of a traditional alignment of current versus historical with operational versus analytic. The second point is that dates and timestamps (of varying granularity) are just different ways of representing the tick of a logical clock that provides reference times for all versioned history.

Operational Data and Analytical Data

Another way of realizing the significance of this topic is to understand it in terms of the breakdown of a traditional alignment. Traditionally, the distinction between operational data and analytical data has lined up with the distinction between current data and historical data. In talking about the increasing need for real-time access to historical data, we are emphasizing that this alignment is breaking down.

On the one hand, data marts have traditionally been considered part of the analytical environment, not part of the operational environment. Data marts are and always have been a means of discovering trending-over-time patterns. But near real-time data is increasingly being required in data marts. The reason is that businesses are finding that they can make better analytical decisions if the relevant data reaches right up to the present moment.

This is a particularly acute need for large-scale inventory management systems of consumer goods, especially during the holiday season. The analytics needed from such systems do not directly concern on-hand, on-order, sales and returns volumes. Rather, the single critical concern is with velocities. The critical question isn't, "How much did we sell last week?" or even "How much do we have on-hand and on-order right now?" The critical question is, "Is our on-hand and our pipeline adequate to keep product in stock, given the rate at which it is being sold?" Since sales volumes for much consumer merchandise peak during the holiday season, the velocity that decision-makers are concerned with, at that time of year, isn't one based on rates of change reaching back six months. It's one based on rates of change reaching back six days! So this is a situation in which current data is needed for analytical decision-making.

If you are thinking, at this point, that we have just described a need for operational data, not for analytical data, then you are illustrating the confusion caused when an alignment (operational = current, analytical = historical) breaks down. What we have just described is a need for near real-time data. If we put the currency of data aside, however, what distinguishes operational from analytical data?

One answer might be that analytical data is used by upper-level management, while operational data is used by lower-level management. While this happens to be true, it isn't very informative. What we need to know is what upper-level management is doing with its data that is different from what lower-level management is doing with its data.

The answer is that analytical data attempts to predict the future, while operational data attempts to optimize the present. Analytical data uses the past to establish curves, one of whose axes is time. The more accurate those curves are, the better an indicator they will be of future trends. For the best basis on which to guess what the future will be like is a sufficiently long and accurate timeline of what has happened in the past. Just project the curve into the future.

On the one hand, as our inventory management example illustrated, real-time data is needed for analytical decision-making. Conversely, operational data, whether in OLTP systems or in operational data stores (ODSs), increasingly includes relevant historical data, and operational decision-making increasingly requires access to that data that is just as real-time as is access to current data.

Again, in the insurance industry, the past state of insurance policies is the relevant example. The claim is current data. The relevant state of the policy is historical data. So this is a situation in which historical data is needed for operational decision-making. The old "operational = current, analytical = historical" alignment is breaking down. Historical data is no longer "second-class" data.

Every Tick of the Clock

Throughout these articles, we will be talking about dates, not timestamps. This is for the sake of simplicity; nothing in these patterns would be disturbed if we were to talk about timestamps instead. However, this simplification means that the granularity at which time will appear in these examples is the granularity of a single day. For example, we will talk about a version ending one day prior to the start of the next version.

More generally, a version ends one tick of the clock before the start of the next version (assuming, as we do for now, that there can be no gaps between versions of the same thing being versioned).

That is one tick of whatever "clock" is being used. With dates, every clock tick is one day. However, our discussions apply equally to situations in which timestamps are used instead of dates. With timestamps, the clock ticks once for every change in timestamp value, at the granularity at which the timestamps are recorded. Thus, for example, timestamps can record a clock that ticks in hours, minutes, seconds, milliseconds, etc.
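
As a sketch, the one-tick rule can even be verified in SQL, assuming a clock that ticks in days and hypothetical version-date columns; each version's begin date must fall exactly one day after its predecessor's end date:

    -- Flag adjacent version pairs that violate the one-tick rule.
    SELECT v1.policy_nbr, v1.ver_end_dt, v2.ver_beg_dt
    FROM Policy_Version v1, Policy_Version v2
    WHERE v2.policy_nbr = v1.policy_nbr
      AND v2.ver_beg_dt = (SELECT MIN(v3.ver_beg_dt)
                           FROM Policy_Version v3
                           WHERE v3.policy_nbr = v1.policy_nbr
                             AND v3.ver_beg_dt > v1.ver_beg_dt)
      AND v2.ver_beg_dt <> v1.ver_end_dt + INTERVAL '1' DAY;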

Using a clock for which each tick is one day isn't at all unrealistic. It supports a business requirement, for example, to load a data warehouse or ODS once a day, with whatever the source data is like at that point in time. For many warehousing situations, a clock that ticks once a day is fine-grained enough to meet all business requirements for the warehouse.

Note: we will not include century in the dates used in these examples. Instead, we will specify them as "mm/dd/yy." This is to keep the columns as narrow as possible, to fit entire rows across a page.

However, there is one "special date" value that will require a century. It's the value "12/31/9999." Obviously, "12/31/99" cannot be reserved for a special meaning, since it is a date that is likely to appear as real business data. But "12/31/9999" is available to carry a special meaning, on the assumption that no business data will occur that needs to specify that exact day nearly eight thousand years from now. That special meaning has some of the semantics of "end date unknown," and also some of the semantics of "end of time." But we think that the most accurately descriptive name for this value is "until next changed or deleted." These semantics will be examined more thoroughly in a later installment in this series. The differences are subtle; and sometimes, in the real world of business IT, they are even relevant.

With this special meaning, "12/31/9999" is not a date; that's why we refer to it as a date value, i.e., a value that appears in a date data type but which does not carry the semantics of a date. (If this appears awkward to you, it is. We need the SQL standards committees, and the vendors, to provide something better for us. But until they do, this is what we have to work with.)
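
In practice, the sentinel makes "give me the current version" a simple predicate, as in this sketch (the table and column names are hypothetical):

    -- The current version of each policy is the one still marked
    -- "until next changed or deleted."
    SELECT *
    FROM Policy_Version
    WHERE ver_end_dt = DATE '9999-12-31';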

Part 3 begins the discussion of specific version patterns.


Time and Time Again: Managing Time in Relational Databases, Part 3 - Version Patterns 1 and 2

Tom Johnston, Randall Weis
DM Direct, May 2007

Version Pattern 1: Updating in Place

Updates in place destroy history by overwriting the previous version of the rows they update. But in most cases, this is the way tables are in fact updated. In these cases, the business does not require that the earlier states of the updated rows be available as queryable history. For this kind of data and these business requirements, the need for history is so infrequent, and so rarely real-time critical, that reconstructable history is good enough.

However, even with tables that lack queryable history, it is quite common for developers to include both a create date and a last update date; these two dates provide a minimal level of support for history. Let's include them as business requirements for Version Pattern 1 and note them as follows:

{BR 1}. Retain the date each row was created, and also the date of its last update.


The scenarios below illustrate Version Pattern 1. Our example will be an insurance policy table.

Scenario 1.1: An Insert to the Policy Table

On 1/12/04, a policy for client Mary Jaspers is inserted into the Policy table for the Acme Healthcare Insurance Company (AHIC). For now, we will assume that this is also the action that makes the policy effective. In other words, a policy with AHIC becomes effective on the date that a row representing that policy is inserted into AHIC's Policy table.

Note: although 1/12/04 is in the past, as of the date of publication of this article, the paragraph above said that on "1/12/04, a policy for client Mary Jaspers is inserted into the Policy table" (italics added). It did not say ".... A policy ...... was inserted .......". This "is" should be considered a tenseless "is," not an indication of the present tense. The reason to speak tenselessly whenever possible is that otherwise, the tenses involved in an English language description of these situations simply become too complex. With a tenseless natural language description, we will not be distracted from the focus of these articles, which is the management of time in relational databases, not the management of time and tenses in natural languages.


In later history patterns, we will drop this assumption, and distinguish system activity dates, such as the dates rows were inserted, updated or deleted, from business effectivity dates, such as the begin and end dates of business effectivity, or the effectivity dates for a correction to an earlier entry. This distinction between system activity and business effectivity dates is referred to, in the literature, as the "bi-temporal" distinction.

Note: those in the health care insurance industry will recognize that this example about policies is not a typical situation. Typically, health care insurance policies are group policies, which are contracts between a client company and the insurance company, not a contract between the insured individuals and the insurance company.

In the property and casualty industry, on the other hand, policies which cover an individual or family are usually contracts between the insurance company and that individual or family. But health care insurance companies do issue individual policies, and we believe that this will provide a simpler set of examples with which to illustrate our proposal for managing versioned history.

Scenario 1.1 shows the result of entering Mary's policy into the Policy table. (See the previous article - Part 2 - for an explanation of the scenario template used here and throughout the remaining articles in this series.)

Note: the conventions for distinguishing different types of columns in these illustrative tables are as follows:

- Primary key columns are listed left-most, and the column headings are bold-faced. They contain the suffix (PK).
- Foreign key columns are listed next, and the column headings are underlined. They contain the suffix (FK).
- Non-primary key and non-foreign key business data columns are listed next.
- Metadata columns, such as crt-dt and last-updt-dt, are listed last.

Note: these articles do not contain examples of foreign keys which are also primary keys. One of us (Johnston) has written several articles explaining why he thinks that foreign keys should never occur in primary keys. See References for links to these articles.

The client-nbr foreign key can be ignored for now. It will be used in later scenarios. The same is true for [now], whose template cell contains "n/a" (for "not applicable"). [Now] (the brackets indicate that we are talking about a metadata item, and not merely referring to the present moment) will not become relevant until we move on to more complex scenarios.


The policy type is PPO (preferred provider organization). The first metadata column (crt-dt) tells us that the row was inserted on 1/12/04. The {null} in the second metadata column (last-updt-dt) tells us that, so far, this row has not been updated.
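
The scenario's table graphic is not reproduced here, but as a sketch, a Pattern 1 Policy table and Mary's inserted row might look like this in SQL (the names and the client number are our own invention, following the conventions above):

    -- Version Pattern 1: an ordinary table, updated in place.
    CREATE TABLE Policy (
        policy_nbr   CHAR(4) NOT NULL PRIMARY KEY,  -- (PK)
        client_nbr   CHAR(4),                       -- (FK), ignored for now
        policy_type  CHAR(3),                       -- e.g., PPO, HMO
        crt_dt       DATE    NOT NULL,              -- metadata: row creation date
        last_updt_dt DATE                           -- metadata: null until updated
    );

    -- Scenario 1.1: Mary's policy becomes effective on insertion.
    INSERT INTO Policy (policy_nbr, client_nbr, policy_type, crt_dt, last_updt_dt)
    VALUES ('P138', 'C882', 'PPO', DATE '2004-01-12', NULL);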

Presumably there are many other rows and columns in the Policy table. But we won't show any of them until later on, when they are needed to illustrate discussions of more complex patterns.

Scenario 1.2: An Update to the Policy Table

On 3/14/05, Mary changes her policy type from PPO to HMO (health maintenance organization). After that change is recorded in the Policy table, Mary's Policy row looks like this:
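
Against the hypothetical table sketched in Scenario 1.1, the update in place would be:

    -- The overwrite: the prior policy_type value (PPO) is lost.
    UPDATE Policy
    SET policy_type  = 'HMO',
        last_updt_dt = DATE '2005-03-14'
    WHERE policy_nbr = 'P138';
    -- Resulting row: P138 | C882 | HMO | crt-dt 1/12/04 | last-updt-dt 3/14/05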

The use of a create date and last update date is common in business IT. And with those two dates, we are not entirely lacking in temporal metadata about this row. Crt-dt tells us that the row was inserted on 1/12/04, last-updt-dt that it was last changed on 3/14/05.

But there is, nonetheless, a lot that we do not know. We do not know how many times this row was changed prior to 3/14/05 (if any), nor what column or columns were changed by any earlier updates. We don't even know what column or columns were changed on 3/14/05. And even if we did know that it was the policy type column that was changed, we still wouldn't know what the previous value was.

So, information has been lost. More precisely, using the taxonomy introduced in Part 1, information in the form of queryable history has been lost. Reconstructable history is always available, of course. In this case, to get Policy P138's reconstructable history, we would scan the logfile of the initial inserts and all subsequent updates to the Policy table, starting on 1/12/04, and arrange the logfile entries for P138 in chronological order.

Scenario 1.3: A Physical Delete to the Policy Table

On 6/1/06, Mary's policy is ended (by contract expiration, explicit termination, revocation, etc. - it doesn't matter). This is recorded, using Version Pattern 1, by physically deleting her policy's row in the Policy table. After that change is recorded in the Policy table, the table looks like this:
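
Again as a sketch against the same hypothetical table:

    -- Version Pattern 1 delete: the row, and all memory of it, is gone.
    DELETE FROM Policy
    WHERE policy_nbr = 'P138';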


After the update shown in Scenario 1.2, as we said above, we have both the current state of Mary's policy, and also a small amount of metadata. Now we have neither. We can't even tell that Mary ever had a policy with AHIC.

Sometimes losing this much information to queryable access doesn't matter. Sometimes all we need to know in real time is what things are like right now. But frequently, businesses do require a minimal amount of real-time available history, such as, in this case, information about policies no longer in effect.

This requirement can be met by replacing a physical delete with a logical delete. This too is a technique well-known to business IT professionals.

Eventually, logically deleted rows will be archived off the online table. This is usually done based on their age. For example, the business rule might be "at each month-end, archive all logically deleted rows that were deleted more than forty-nine months ago." In terms of the taxonomy introduced in Part 1, this is a rule which states the conditions under which a row changes from queryable to reconstructable.
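
A sketch of that month-end archive rule, assuming the del-dt delete date column described under Version Pattern 2 below (written del_dt in SQL) and a hypothetical, identically structured Policy_Archive table:

    -- Move rows logically deleted more than 49 months ago off the online table.
    INSERT INTO Policy_Archive
    SELECT *
    FROM Policy
    WHERE del_dt IS NOT NULL
      AND del_dt < CURRENT_DATE - INTERVAL '49' MONTH;

    DELETE FROM Policy
    WHERE del_dt IS NOT NULL
      AND del_dt < CURRENT_DATE - INTERVAL '49' MONTH;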

But for now, we can ignore the archive process. What we want to do is see how a logical delete is typically implemented. This is shown in Version Pattern 2.

Version Pattern 2: Updating in Place with Logical Deletion

Because physical deletes destroy even the little historical information that Version Pattern 1 provides, businesses frequently ask their IT departments for some way to preserve that minimal level of historical information. Usually, the way this is done is to do a logical delete instead of a physical delete.

{BR 2}. Support {BR 1}, and also retain deleted rows as queryable history.

There are variations on this theme, of course, as there are on almost any data pattern. We will explain below why we chose the variation we did.

Scenario 2.1: A Logical Delete to the Policy Table

To support logical deletes, we need an additional column - a delete indicator. Often this is implemented as a delete flag, set to "Y" if the row is logically deleted, and to "N" otherwise. The date of the deletion is then put in last-updt-dt.

Another option is to have a separate delete date column. With this option, a delete flag isn't needed, because logically deleted rows are all and only those rows whose delete dates are not null.

We prefer this second option because it preserves, rather than overwrites, the metadata information about the last update date (and also because we have a further use for this new column, which we will discuss later on).

On 6/12/06, Mary Jaspers' policy ends. Starting on that date, it is no longer in effect. This time, we record that event by logically deleting her policy's row in the Policy table. After that change is recorded, the table looks like this:
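
As a sketch, extending the hypothetical Pattern 1 table with the delete date column:

    -- The new metadata column; null means "not logically deleted."
    ALTER TABLE Policy ADD COLUMN del_dt DATE;

    -- Scenario 2.1: the logical delete. The row remains queryable.
    UPDATE Policy
    SET del_dt = DATE '2006-06-12'
    WHERE policy_nbr = 'P138';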


Note: to make it possible to show an entire row on one line, we will start doubling up. The two metadata dates, crt-dt and last-updt-dt, will be stacked from this point forward. Later, other such doubling-ups may be required. Keep in mind that in spite of the graphical layout, crt-dt and last-updt-dt are still two distinct columns in this table.

The Semantics of Version Pattern 2

For Version Pattern 2 to satisfy its business requirements, the following constraints must be enforced when updating the table:

{SC 2-1}. Always, del-dt must be null if the row is not logically deleted, and must otherwise contain a valid date.

{SC 2-2}. When entered, del-dt must be the current date, or else a future date. It cannot be a past date.

{SC 2-3}. When entered, foreign key semantics must be enforced.

{SC 2-4}. Once entered, del-dt cannot be changed. If del-dt is not null, the date in it must be the date originally put in it.
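
As noted earlier in this series, constraints like these generally need triggers, stored procedures or application code. A minimal sketch of {SC 2-2} and {SC 2-4} as a row trigger follows, in a generic trigger dialect; the exact syntax (and the availability of an UPDATE OF clause and SIGNAL) varies by DBMS:

    CREATE TRIGGER policy_del_dt_check
    BEFORE UPDATE OF del_dt ON Policy
    FOR EACH ROW
    BEGIN
        -- {SC 2-4}: once entered, del_dt cannot be changed.
        IF OLD.del_dt IS NOT NULL THEN
            SIGNAL SQLSTATE '75001'
                SET MESSAGE_TEXT = 'del-dt cannot be changed once entered';
        END IF;
        -- {SC 2-2}: del_dt must be the current date or a future date.
        IF NEW.del_dt IS NOT NULL AND NEW.del_dt < CURRENT_DATE THEN
            SIGNAL SQLSTATE '75002'
                SET MESSAGE_TEXT = 'del-dt cannot be a past date';
        END IF;
    END;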

{SC 2-2} is particularly interesting. What is the reasoning behind it? Why can't we use a past date when logically deleting a row? After all, if we were supposed to delete a row on a particular date, and simply forgot to do it, why can't we just apply the delete retroactively? Wouldn't that produce the same result as if we had done the delete on its originally scheduled date?

The answer is "no." If a past date is used, we are changing the past, and doing so without leavingany indication that we did. In other words, if a past date is used, we were either not telling thetruth before the logical delete, or we are not telling the truth after the logical delete. Here's why.

Imagine we were supposed to delete the row seven days ago and that we logically delete the row, today, by setting its del-dt to the date seven days ago. Whether we did this seven days ago, or do it today, the result will be the same. From today forward, the database will show a logical delete on 6/12/06.

However, for the seven days prior to this action, the row was not deleted, and so during that period of time our database said that the policy was active. But as soon as we put a past date into del-dt, our database now says that the row was logically deleted during those seven days, and therefore that the policy was not active during that period of time. Well, either it was active over the past seven days, or it wasn't. We can't have it both ways.


With this retroactive delete, we have changed the past, and left no indication that we have done so. A query counting active policies in the period 6/6/06 - 6/12/06, run sometime during the past week, would include Mary's policy in that count. The identical query, run anytime after that event, would not include her policy.
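
The counting query in this example might be sketched as follows; run before the retroactive delete, it counts Mary's policy (del-dt still null), and run after, it does not:

    -- Policies active throughout 6/6/06 - 6/12/06: created by the start
    -- of the period and not logically deleted during or before it.
    SELECT COUNT(*)
    FROM Policy
    WHERE crt_dt <= DATE '2006-06-06'
      AND (del_dt IS NULL OR del_dt > DATE '2006-06-12');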

Even so, the assumption is that we were supposed to delete that row some time in the past. The reason is that, just as a policy becomes effective on the date a row for that policy is inserted into the Policy table, a policy ceases to be effective on the date a row for that policy is deleted from the Policy table. We missed the date to delete the policy, and now we apparently have no way to correct the mistake and show the date on which Mary's policy ceased to be in effect.

Our dilemma is this: as long as the date of a physical insert into a table is also a business effectivity start date, and the date of a physical delete is also a business effectivity end date, we had better be sure we can insert and delete on time! For if we miss either an insert or a delete date, the only way we can correct the mistake is to lie about the past!

The dilemma is resolved, not by lying about the past, but rather by eliminating the homonyms that are the source of the dilemma. Crt-dt is a homonym because it means both "the date the row is inserted" and also "the date that the object represented by the row becomes effective." Del-dt is a homonym for analogous reasons.

As we said at the beginning of this article, distinguishing database activity dates from business effectivity dates creates what computer scientists call a "bi-temporal" pattern. We will have more to say about bi-temporality in later installments, starting with the discussion of Version Pattern 5.

As for {SC 2-3}, it simply says that semantically, a logical deletion must follow the same referential integrity rules as does a physical deletion. So, for example, if there are any foreign key dependencies on P138, then either the logical delete of P138 must be blocked, or else it must trigger a cascade (logical) delete to all rows that refer to it.

As for {SC 2-4}, the reason we cannot change del-dt once it has become non-null is the same. Doing so would change the past. Our database would give different answers, depending on when it was queried. At least one of those answers would have to be false.

This concludes our discussion of Version Patterns 1 and 2. The next installment will discuss Version Pattern 3.

References:

Tom Johnston. "Primary Key Reengineering Projects: the Problem."DM Review. February2000.

1.

Tom Johnston. "Primary Key Reengineering Projects: the Solution."DMReview.com. March 1,2000.

2.

Tom Johnston. "The Semantics of Surrogate Keys and Business Keys: Part

1."Datawarehouse.com. December 6, 2002.

3.

Tom Johnston. "The Semantics of Surrogate Keys and Business Keys: Part 2."

Datawarehouse.com. January 13, 2003.

4.


Time and Time Again: Managing Time in Relational Databases, Part 4 - Preliminaries to Version Pattern 3

Tom Johnston, Randall Weis
DM Direct, June 2007

The first two version patterns, described in Part 3, support a minimal form of queryable history in business databases. But sometimes that isn't enough. For even if we use Version Pattern 2 - updating in place with logical deletion - there are still two important ways in which history can be lost.

First, as soon as the first update takes place, the row, as originally entered, is lost because the update overwrites the row. So we lose the information that was true of the policy on the date it became effective. Next, considering the entire row as a single unit of data, all updates but the most recent one are also lost, including even the knowledge of whether or not there were any previous updates. Sometimes this is acceptable; but sometimes it is not.

Note: this may suggest that we should manage versions at the level of individual row/columns, not at the level of entire rows. Although that would certainly be possible, we believe that the same semantics can be supported by versioning entire rows, and at a lower overall system cost. We will not consider row/column level versioning in the remainder of these articles.


Now we will consider a more stringent business requirement for history.

{BR-3}. Support {BR-2}, and also keep a complete queryable history on policies, i.e., on all inserts, updates and deletes to all policies.

As our history taxonomy indicates, our first choice is to keep this history as a chronological record of either events or states. We could keep an event-based history of changes to policies by adding a row to the Policy table each time a new policy was created, and after that keeping a transaction table in which each transaction was an update or delete to the policy.

Event-based history is most appropriately used to manage changes to metric values of relationships among persistent objects, values such as counts, quantities and amounts. These relationships that contain metrics are often called balances. Each change to a relationship metric is recorded as a transaction, and in the traditional star schema design, each transaction contains a date, the change to each metric, and a foreign key to each object that is affected by the transaction, objects such as customers, products, suppliers and so on.
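
A sketch of what one such transaction might look like as a star schema fact table (the names and metrics are our own, for illustration):

    -- One row per balance-affecting event: a date, the metric changes,
    -- and a foreign key to each object affected by the transaction.
    CREATE TABLE Sales_Transaction (
        tran_dt      DATE         NOT NULL,  -- when the event occurred
        customer_nbr CHAR(4)      NOT NULL,  -- FK to the customer dimension
        product_nbr  CHAR(4)      NOT NULL,  -- FK to the product dimension
        qty_chg      INTEGER,                -- change to a quantity metric
        amt_chg      DECIMAL(9,2)            -- change to an amount metric
    );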


However, event-based history is not appropriate for managing changes to non-metric values of persistent objects. For them, state-based history is the preferred option. Because we have chosen to keep a state-based history of policies, we must now decide whether to do so using snapshots or versions. Snapshots are the preferred method for recording the coordinated history of many rows, usually across many tables, as of a single point in time. However, if the same column in the same row is updated more than once between snapshots, only the last update will be captured. Therefore, since the business requirement is to track all changes to policies, we must do so using versions.

What Version Tables Version

Before we dive into the details of managing versioned history, we must first understand what is being logically inserted, updated and deleted in versioned tables. It is not versions. It is the objects they are versions of. The physical insertion of a version logically inserts, updates or deletes an object, and does so in the following way:

- The physical insert of the first version for a policy is a logical insert of that object, i.e., that policy.
- The physical inserts of subsequent versions for that policy, which are not designated as a logical delete of the policy, are logical updates of the policy.
- The physical insert of a version marked as a logical delete logically deletes the policy.

Note that "object" is being used here in the sense of "anything that persists over time," not in itsnarrower object-oriented sense. Customers, products, suppliers, contracts, employees, facilities,inventories, packages, invoices, purchase orders - all these are objects. For the most part, we areusing insurance policies as illustrative objects in these articles.

A Policy Version Table

Each row in the original Policy table represents a policy (of course). But no row in the Policy Version table represents a policy! Instead, each row represents what we know about what a policy looked like during a designated period of time.

Multiple rows on the Policy Version table may have the same policy-nbr, of course. If this were a Policy table, that would be an error. But on a Policy Version table, it is not an error. On a Policy Version table, those rows, and only those rows, with a given policy number are versions of that policy. Those rows, and only those rows, contain information about that policy during a period of time.

So a version is a timeslice of an object. It is not the object itself. A complete sequenced set of versions is a full timeline for an object.

Since none of these rows in the Policy Version table represent policies, where are the rows that do? If this table is not a table of policies, where is that table?

The answer is that, in our implementation of versioned history, there is no Policy table. The reason is that we don't need one. For any point in time during which a policy was a policy, we can retrieve a row from the Policy Version table that represents the policy as of that point in time. The first row in the Policy Version table for a policy shows what the policy was like when it became effective. Subsequent rows show what the policy was like after each of the changes that happened to it and until the next one happened. If the policy terminated, there is a version for that event also.
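Such a point-in-time retrieval can be sketched in SQL. Under the single-episode, no-gaps assumptions introduced below, the version in effect on a given date is simply the latest version whose begin date is on or before that date (hypothetical table and column names, as in the earlier sketch; the row-limiting clause varies by DBMS):

-- What did policy P138 look like on 6/30/05?
SELECT *
FROM Policy_Version
WHERE policy_nbr = 'P138'
  AND ver_beg_dt <= DATE '2005-06-30'
ORDER BY ver_beg_dt DESC
FETCH FIRST 1 ROW ONLY;

A production version of this query would also have to check that the policy had not already been logically deleted as of the requested date.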

A Simplifying Assumption: One and Only One Episode per Object

Before proceeding, we will introduce two simplifying assumptions into our initial discussion of versions, assumptions which we will drop later on. The first one is this: objects can't recur.

For example, if Mary Jaspers' policy P138 terminated on a given date, but the next day Mary called and said she wanted to continue her policy, then given this assumption, AHIC would have to issue her a new policy. If Mary wanted no changes to her policy, the new policy would be identical to the old one except for its policy ID number, but it would in fact be a new, distinct policy.

If an object could recur, what would that look like, as far as data is concerned? In the case of Mary's policy, it would look like a logical delete of policy P138 followed, after one or more clock ticks, by another version for P138, which is not a logical delete. Normally, persistent objects re-appear after some period of time, i.e., after some possibly large number of clock ticks. So there can be gaps between episodes of the same object, although if there is only one clock tick between a logical delete and the next version of an object, those episodes would not have a gap between them.

This is illustrated in Figure 1.

Figure 1: Episodes of an Object

Since that next version is an initial version of the policy after a prior logical delete of that policy, it constitutes a reappearance of the policy. Let us call the first version for P138 (the Jan 1, 2005 version) the original initial version, and each other version which immediately follows a logical delete of P138 (the Jul 15, 2005 version) a successor initial version. The count of initial and successor versions for an object is a count of what we will call episodes of that object. Gaps may or may not exist between successive episodes. In this case, the gap extends from one clock tick after the first episode is terminated, to the clock tick on which the second episode begins.

If the number of logical deletes of an object is the same as the number of episodes, then, for as long as that is true, the object does not currently exist on our database. That is because every episode, including the most recent one, has been terminated by a logical delete.

The only other case that can arise is that the number of episodes is one greater than the number of logical deletes. That is because, in a series of versions for the same object, what splits them into episodes are the logical delete versions. In this case, the object does currently exist on our database because there is no logical delete version for the most recent (or only) episode.

So our simplifying assumption can also be expressed like this: (for now), each persistent object has one and only one episode. That episode, of course, can consist of any number of versions.

This single-episode constraint means that AHIC (our fictional insurance company) does not permit policies to recur. So if a policy lapsed for even a single day, and the policy holder wanted her policy reinstated immediately, she would be assigned a new policy. The initial version of that new policy would become effective one clock tick after the deletion of her previous policy. This is not a realistic way of doing business, of course, and later on we will consider scenarios in which a lapsed policy may be reinstated.
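The existence test just described can also be sketched as a query. Since logical delete versions are the episode boundaries, an object currently exists if and only if its chronologically latest version is not a logical delete version (again, hypothetical table and column names, with a non-null obj_end_dt marking a logical delete):

-- Does policy P138 currently exist on our database?
SELECT CASE WHEN obj_end_dt IS NULL
            THEN 'P138 currently exists'
            ELSE 'P138 has been logically deleted'
       END AS existence
FROM Policy_Version
WHERE policy_nbr = 'P138'
  AND ver_beg_dt = (SELECT MAX(ver_beg_dt)
                    FROM Policy_Version
                    WHERE policy_nbr = 'P138');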


Another Simplifying Assumption: No Gaps Within Episodes

Our second assumption is that there can be no gaps, within a given episode of an object, between successive versions. One version must follow right after the other. That is, all noninitial versions of an episode of an object must begin on the next tick of the clock after the version they supersede ends. So in Figure 1, each arrow which is labeled "version {n} ends; version {n+1} starts" actually points to two adjacent clock ticks.

A corollary is that each nonterminal version of an episode of a persistent object must end on the tick of the clock immediately before the tick on which the version it precedes begins. (See Part 2 for a discussion of clock ticks.)

Let us call a gap within an episode an intra-episodic gap, using the term "episode" as indicated above, to refer to the time span for an object that reaches from the begin date of its initial version to the end date of the first logical delete reached by following that object's versions in chronological sequence. And let us call the first assumption, that objects have only a single episode, the requirement that there be no inter-episodic gaps.

Note: Bear in mind that while we reference an event's commencement and conclusion as begin date and end date, these temporal markers would frequently be a more granular date/timestamp. As always, the key concept here is a "tick of the clock," at whatever level of granularity is appropriate for your specific application.
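To illustrate the no-gap rule, here is a sketch of a gap-detecting audit query. It assumes, purely for illustration, a variant of the version table that stores an explicit version end date, a clock that ticks once per day, and a DBMS with window functions (date arithmetic syntax varies by dialect); in the patterns discussed next, end dates are inferred rather than stored.

-- Find adjacent versions of the same object that are not exactly
-- one clock tick apart: an intra-episodic gap (or an overlap).
SELECT policy_nbr, ver_beg_dt
FROM (SELECT policy_nbr, ver_beg_dt, ver_end_dt,
             LEAD(ver_beg_dt) OVER (PARTITION BY policy_nbr
                                    ORDER BY ver_beg_dt) AS next_beg_dt
      FROM Policy_Version) AS v
WHERE next_beg_dt IS NOT NULL
  AND next_beg_dt <> ver_end_dt + INTERVAL '1' DAY;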

Semantic Constraints

Our two simplifying assumptions come to this: until further notice, these articles are discussing single-episode objects whose versions have no gaps. The semantic constraints imposed by these assumptions are as follows:

{SC 3-1}. If a version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version.

{SC 3-2}. If a version for an object is not followed by another version for that same object, and is not itself a logical delete version, the end date will be assumed to be unspecified. (This is not always the case, of course; and in later patterns, we will drop this simplifying assumption.)

{SC 3-3}. If a version for an object is a delete version, the object will be considered deleted as of that delete date.

{SC 3-4}. It is semantically illegitimate for any version of an object to follow (to be chronologically later than, based on its version date) a logical delete version of that object.

There are several issues raised by these semantic constraints, all of which we will discuss later on. They include:

The semantics of "unspecified" end dates. For some versions which do have an end date,that end date may be unknown at the time the version is entered. For others, which we canthink of as "until further notice" versions, there is no end date, known or not, at the time theyare entered. In the literature, an unspecified end date is sometimes called "forever." But

Time and Time Again: Managing Time in Relational Databases, Part 4 - Pr... http://www.dmreview.com/dmdirect/20070601/1085255-1.html?type=pri...

4 of 6 10/7/2008 9:26 AM

Page 26: Managing time in relational databases

forever implies that we know the version will never terminate, and there aren't many objectswhich have versions like that. And one may hope that if an end date for a version should beknown, it is known at the time the version is entered. If we can assume that, it leaves the"until further notice" kind of version as the only kind of version with an unknown end date.But its semantics is not the same as the semantics of forever. This issue will be discussed inlater installments in this series.

The existence of versions with a known end date. We must be able to manage versions of objects with a known end date, and also versions of objects whose end date is not known. As we will see later, this can prove quite challenging.

The fact, as already noted, that businesses often need to recognize the re-appearance of the same object after a gap in time. This is the need to permit multiple episodes of an object on the database. It is a common business requirement; customers come and go, products are offered for sale, later retired, and later still reintroduced, etc.

The fact that businesses often need to keep two or more versions with overlapping or identical effectivity periods.

This last point is especially important. In normal cases, versions cannot temporally overlap one another. If they did, there would be a period of time during which both were effective. But if they were both effective, how would we know which one was the truth? And what would the other one represent, if not the truth? These are topics for later articles in this series.

Corrections, Gaps and Overlaps

To avoid intra-episodic gaps, we must write code to ensure that every new noninitial version begins exactly one tick of the clock after the version it is about to supersede. However, while necessary, this is not sufficient to guarantee that there will be no intra-episodic gaps. A second way in which such a gap can be introduced is with a correction.

Consider a non-initial version entered with an effective begin date that is earlier than it should be. In that case, we need to move its begin date forward. But doing so would introduce a gap between it and its predecessor version. By the same token, moving an effective end date back would introduce a gap between it and its successor version. To prevent such corrections from violating {SC 3-1}, we must always correct two versions when an effective begin date on a non-initial version is moved forward or when an effective end date on a non-terminal version is moved backward. One is the version itself that is being corrected. The other is the adjacent version, which must now be corrected to "fill in the gap" created by the first correction.

We should not think of these two physical updates as two transactions. There is one transaction; it is the movement of the transition point in time between two adjacent versions. This single semantic update requires two physical rows to be updated.
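Sketched in SQL, a correction that moves the transition point between two adjacent versions might look like this (hypothetical names as before, again assuming for illustration an explicit version end date and day-granularity clock ticks):

-- One semantic update: move the P138 transition point from 3/14/05 to 3/20/05.
START TRANSACTION;

UPDATE Policy_Version                  -- the version being corrected
SET ver_beg_dt = DATE '2005-03-20'
WHERE policy_nbr = 'P138'
  AND ver_beg_dt = DATE '2005-03-14';

UPDATE Policy_Version                  -- the adjacent version fills the gap
SET ver_end_dt = DATE '2005-03-19'     -- one clock tick before the new begin date
WHERE policy_nbr = 'P138'
  AND ver_end_dt = DATE '2005-03-13';

COMMIT;

Either both rows are updated or, if anything fails, the transaction is rolled back and neither is.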

Semantic constraint {SC 3-1} also requires that updating an effectivity date range must not create an overlap. One caveat, of course, is that a correction which does not alter a version's effectivity date range must have exactly the same date range as the version it corrects.

Version Patterns 1 and 2 were both "update in place" patterns. In this installment, we have laid the groundwork for discussions of patterns which do not update in place, i.e., which do not overlay data already on the database. First, we introduced two simplifying assumptions, which we will eventually drop as we discuss increasingly complex patterns. Next, we introduced some concepts which will prove useful in those later discussions. The concepts were:

Timeslice
Timeline
Original initial version
Successor initial version
Episode
Gaps
Intra-episodic gaps
Inter-episodic gaps

In Part 5, we will begin our discussion of Version Pattern 3.


Time and Time Again: Managing Time in Relational Databases, Part 5: Version Pattern 3

Tom Johnston, Randall Weis
DM Direct, June 2007

The business requirement for Version Pattern 3 is to be able to show what an object was like at any point in its lifetime. To do this, we must retain the before-update state of an object, not just the after-update state. We must retain a versioned history of changes to the state of an object. This means that in place of the Policy table we have been using thus far, we will need a Policy Version table.

{BR-3}. Support {BR-2}, and also retain a queryable history of changes to objects.

Tables and Keys

The primary key of our original Policy table is policy-nbr. So the primary key of our Policy Version table must be policy-nbr plus something else. The purpose of that something else is to distinguish each version of a policy from all other versions of that same policy.


Pretty much anything will do to distinguish policy versions. In fact, any value guaranteed not to recur for the same policy will do. The first thing that comes to mind is a sequence number. The initial version is 1, and the subsequent n versions are 2, 3, ..., n.

This isn't a bad idea, just as the logical delete flag, discussed in Version Pattern 2, was not a bad idea. But just as we preferred a delete date to a delete flag, we prefer to use a date as our version-discriminating part of the primary key for policy versions, and thus the combination of policy-nbr and this date to uniquely identify each row in the Policy Version table.

Business Dates and System Dates

Version Pattern 3 does not distinguish business effectivity dates from physical database activity dates. We must have a business effectivity date, of course. But for the purposes of this analysis, we can consider this date as either the business effectivity start date for the version, or that date and also the date the version was physically inserted. In the former case, we simply fail to record the date of physical database activity. In the latter case, the two dates are identical, with complications that we will examine later.


The former case is simpler, but it is also almost never seen in production databases. Physical insertion dates (and physical last update dates, in the case of updates in place) are almost always included. So for Version Pattern 3, we will assume that version begin date is both the date the inserted row became business effective and also the date the row was physically inserted.

Let's now recreate the series of events described in Scenario 1. We begin with the event that created a new policy for Mary Jaspers. This event takes place on 1/12/04.

Scenario 3.1: A Policy Insert in the Policy Version Table

As we have already pointed out, the physical action of inserting the initial version for a policy is also the logical action of inserting a new policy. Here is the result of inserting Mary's policy effective 1/12/04:

Figure 1: Result of Inserting Mary Jaspers's Policy Effective 1/12/04

The primary key of the policy version is policy-nbr (the unique identifier for a policy) plus version begin date. All versions for the same policy have the same policy number and are distinguished from one another by their begin date. Because Version Pattern 3 is based on the assumption that objects cannot recur (as described in Part 4), there will be no temporal gaps among versions for the same object. Consequently, we also know the end date for all versions but the current one. It is one day (one clock tick) prior to the begin date for the chronologically next version.

If the above row were inserted on any date later than 1/12/04, it would be a retroactive insert. If it were inserted on any date earlier than 1/12/04, it would be a proactive insert. But as long as rows must be inserted on their business effectivity dates, then neither retroactive nor proactive inserts are possible. In Version Pattern 4, we will distinguish these two kinds of dates and consider retroactive and proactive database activity.

Using this primary key presupposes that we can never have two versions of the same policy that begin on the same date, i.e., on the same tick of whatever clock we are using. For now, we'll stick with this assumption. Later on, we will examine an exception to it.
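Because the end date of every version but the latest is implied rather than stored, it can be derived at query time. A sketch, assuming day-granularity clock ticks, a DBMS with window functions, and the hypothetical column names used earlier:

-- Derive each version's implied end date: one day before the next
-- version's begin date; null (open-ended) for the current version.
SELECT policy_nbr,
       ver_beg_dt,
       LEAD(ver_beg_dt) OVER (PARTITION BY policy_nbr
                              ORDER BY ver_beg_dt) - INTERVAL '1' DAY
         AS implied_end_dt
FROM Policy_Version;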

Figure 2: Timeline Diagram for Scenario 3.1


By inserting this initial version, we have created policy P138, effective 1/12/2004. Now let's see how the first update to this policy is represented in the Policy Version table.

Scenario 3.2: A Policy Update in the Policy Version Table

On 3/14/05, Mary changes her policy type from PPO to HMO (Health Maintenance Organization). After that change is recorded in the Policy Version table, the table looks like this:

Figure 3: Policy Version Table After Policy Type Change

Because of our no-gap assumption, we know from this data that this policy was a PPO policy from 1/12/04 through 3/13/05 - the latter date being one clock tick prior to the begin date of the next version.

Note: One way of representing date ranges is by a pair of dates (one of them inferred, in this example). Another is to specify a begin date and a duration, i.e., a number of clock ticks. These two ways of representing date ranges are semantically equivalent; each can handle all (and only) the business requirements that the other can. They differ only in details of implementation. Throughout these articles, we have chosen to represent date ranges with a pair of dates.

When a pair of dates is used, we must specify whether neither, either or both are included in the range they specify. Again, the options are semantically equivalent. In these articles, we have chosen to include begin dates in the range, but exclude end dates.

The timeline for policy P138, after this update, is shown below.

Figure 4: Timeline Diagram for Scenario 3.2

Note: graphics for timeline diagrams permit us to draw the lines indicating the start of a version only in alignment with the start of a vertical bar. Vertical bars represent months. So our convention will be to draw these version start lines at the start of the month of their ver-beg-dt. Thus, a ver-beg-dt of 3/14/05 is drawn aligned with the start of the vertical bar representing March, 2005.

After this change, we do have an end date for our initial version, although it is inferred rather than expressed as an explicit data value. However, we still do not have an end date for Mary's policy itself, as is graphically illustrated by the ellipsis on the above timeline.

Scenario 3.3: A Policy Deletion in the Policy Version Table

On 6/12/06, Mary's policy is terminated. But in the Policy Version table, as shown in Scenario 3.1, we have no way to distinguish a version representing a policy deletion from one representing a policy insert.

To distinguish versions which logically delete a policy, we could add a delete flag. But a delete date carries more information than a delete flag. So while we do not need a second date to manage policy inserts and updates, we do need one to manage policy deletions.

Notice that the version date in the primary key is called version begin date, not version create date or version insert date. By the same token, we will call this second date object end date, not object delete date. And, of course, we do not call it version end date. It is the end date of the object, i.e., the policy, not of a version of the object.
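Putting the pieces together, the Pattern 3 table described so far might be declared as follows. This is a minimal sketch; the column names follow the abbreviations used in the figures, but the types and sizes are our own illustrative choices:

CREATE TABLE Policy_Version (
    policy_nbr   CHAR(4) NOT NULL,   -- identifies the policy (the object)
    ver_beg_dt   DATE    NOT NULL,   -- business effectivity begin date,
                                     -- also the physical insert date in Pattern 3
    obj_end_dt   DATE,               -- non-null only on the terminal version
    policy_type  CHAR(3) NOT NULL,   -- e.g., 'PPO' or 'HMO'
    PRIMARY KEY (policy_nbr, ver_beg_dt)
);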

After we terminate Mary's policy, the Policy Version table looks like this:

Figure 5: Policy Version Table After Termination of Policy

We can identify the first row as the initial version for P138 because there is no other row for P138 with an earlier ver-beg-dt. We can identify the second row as the terminal version, i.e., the logical delete point, for policy P138 because it is the only version for P138 whose obj-end-dt is not null.

The lifetime of policy P138 is from 1/12/04 to 6/11/06, inclusive. A query asking for what the policy looked like, at any given point during its lifetime, can now be satisfied.

Notice that these two rows are a physically condensed way of expressing the same semantics that would be expressed by having nearly 900 rows in the table, one for each day in the life of this policy. These two rows are semantically equivalent to a set of nearly 900 consecutive daily snapshots. This illustrates how, from a data storage perspective, versions are much more efficient than snapshots.

Notice also that this is the first instance in which we have physically updated a row in the Policy Version table. Prior to 6/12/06, the row with a 3/14/05 ver-beg-dt has a null obj-end-dt. From 6/12/06 forward, it does not. So we have overwritten that state of the row.

In doing so, have we lost information? In fact, we have not. We can infer from the second row above that its obj-end-dt would have been shown as null prior to 6/12/06, and as not null from that date forward. So although we did physically overwrite that row, we lost no information about the policy, or about what we knew about the policy at any point in its lifetime.
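As a sketch, the logical delete of Scenario 3.3 comes down to a single physical update of the current version (hypothetical column names as before; we show the delete date itself being stamped):

-- Terminate policy P138 on 6/12/06 by stamping the current
-- version's object end date.
UPDATE Policy_Version
SET obj_end_dt = DATE '2006-06-12'
WHERE policy_nbr = 'P138'
  AND ver_beg_dt = (SELECT MAX(ver_beg_dt)
                    FROM Policy_Version
                    WHERE policy_nbr = 'P138');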

This shows why we think Dr. Snodgrass's metaphor of the relationship between nontemporal and temporal tables of the same "things" is a poor one. He encourages us to think of a temporal table as derived from a nontemporal one by adding various dates to it. So he would encourage us to think of the Policy Version table as the Policy table with time management added to it. Indeed, in his examples, he changes nothing in table names as he changes them from nontemporal to temporal. (This metaphor is first encountered in {Snodgrass 2000, p.20}, where a Lot table is still a Lot table after adding two pairs of dates. It is again encountered on p.113 (Adding History), p.117 "...adding valid-time support to a table..." and elsewhere.)

Throughout our discussion of Version Pattern 3, we have seen the profound difference between objects and versions of objects. In temporal tables, we physically manage rows which represent versions of objects. In the process, we logically/semantically manage those objects themselves. But in that process, the table whose rows represent those objects disappears! We have not kept a Policy table and added a Policy Version table. We have replaced the Policy table with a Policy Version table; and that version table should not be thought of as the Policy table with time management added to it.

The complete lifetime of policy P138 is shown on the timeline below.

Figure 6: Timeline Diagram for Scenario 3.3

It is only now, when the policy has been logically deleted, that we have full knowledge of its timeline. It is only now that the last version on the timeline is not followed by an ellipsis.

The Semantics of Version Pattern 3

In Part 4, we defined four semantic constraints for Version Pattern 3. These constraints spell out the implications of the single-episode, no-gaps assumptions that are in effect for this version pattern. To repeat, those constraints are as follows:

{SC 3-1}. If a version for an object is followed by another version for that same object, the (implied) end date of the former version must be one tick of the clock prior to the begin date for the latter version.

{SC 3-2}. If a version for an object is not followed by another version for that same object and is not itself a logical delete version, the end date will be assumed to be unspecified. (This is not always the case, of course; and in later patterns, we will drop this simplifying assumption.)

{SC 3-3}. If a version for an object is a delete version, the object will be considered deleted as of that delete date.


{SC 3-4}. It is semantically illegitimate for any version of an object to follow (to be chronologically later than, based on its version date) a logical delete version of that object.

In this installment, we have defined two dates which are needed to implement Version Pattern 3 - a version begin date and an object end date. Additional semantic constraints specify the rules for populating these dates.

The first group of additional constraints specifies the rules for populating these two dates during the insert of the first version for an object. Semantically, this is the action of inserting a new policy.

{SC 3-5}. Ver-beg-dt cannot be in the past or in the future, because it is also the system insert date. To set such a date to anything but [now] would be a lie. We can't do anything in the past, or in the future. [Now] is all we have.

{SC 3-6}. Obj-end-dt must be null. We can't end something before it begins. We can't even end something when it begins. We can only end something after it begins.

The second group of additional constraints specifies the rules for populating these two dates during the insert of a noninitial, non-logical-delete version for an object. Semantically, this is the action of updating an existing policy.

{SC 3-7}. For every noninitial version of a policy, its ver-beg-dt must be at least one clock tick past the ver-beg-dt of the immediately prior version for the policy. If it is not, then the two versions will overlap for at least one clock tick. But that is invalid because there cannot be two versions of the truth at any given point in time.

However, for Version Pattern 3, this condition should never arise. Since a ver-beg-dt is also the system insert date, only an incorrect system clock could permit a version to have an earlier begin date than the version which preceded it. It shouldn't be necessary to write code to enforce {SC 3-7} because if the system clock is incorrect, we have a much bigger problem on our hands and will probably have to roll back and reapply updates anyway.

{SC 3-8}. Obj-end-dt must be null. If it is not, the policy is being deleted; and, by assumption, this is not a delete transaction.

The third group of additional constraints specifies the rules for populating object end date during a logical delete. Semantically, this is the action of terminating a policy.

{SC 3-9}. To terminate a given policy, set the object end date of the current version (which may or may not be the initial version) to [now]. The object end date cannot be in the past or in the future, because it is also the system insert date.
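Constraints like {SC 3-4} cannot be declared in ordinary SQL DDL, but they can at least be audited. A sketch of such an audit query, using the same hypothetical column names:

-- {SC 3-4}: no version of an object may follow a logical delete
-- version of that object. Any rows returned are violations.
SELECT later.policy_nbr, later.ver_beg_dt
FROM Policy_Version AS del
JOIN Policy_Version AS later
  ON  later.policy_nbr = del.policy_nbr
  AND later.ver_beg_dt > del.ver_beg_dt
WHERE del.obj_end_dt IS NOT NULL;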

What Comes Next

Version Pattern 3 is a significant achievement. It retains the knowledge of the state of an object at any point during the lifetime of that object. Indeed, what more could there be? What more could businesses want that versioned tables can provide?

The answer is: quite a lot. To begin with, we might want to see what would happen with Version Pattern 3 if we distinguished business dates from system activity dates. This would mean that version begin date and object end date are business dates only. For system activity dates, we would then want to keep an insert date and a version update date. That latter date will be {null} except when the version is updated. And because with versions we are no longer considering updates in place, this update can only be one thing - the date the object end date was specified.

In the academic literature, versioning which has both a business set of dates and a system activity set of dates is said to implement a "bitemporal" pattern.

In providing an annotated list of version patterns, in Part 2, we did not distinguish a bitemporal and nonbitemporal variation of Version Pattern 3. (In addition, the description of Version Pattern 3 was incorrect, and should have stated the opposite of what it did state.) So following is a revised annotated list of the version patterns which we are discussing in these articles.

This list provides a correct description for pattern 3. It "bumps down" the other patterns, and specifies a pattern 4 which is pattern 3 with business and system activity dates distinguished. Finally, it reverses the sequence of the original list's ninth and tenth patterns in order to put the two patterns which describe snapshots next to one another.


Figure 7: Annotated List of Version Patterns - Revision 1


Time and Time Again: Managing Time in Relational Databases, Part 6: Version Pattern 4

Tom Johnston, Randall Weis
DM Direct, June 2007

Version Pattern 3, discussed in Part 5, keeps a record of the state of an object at every point during its lifetime. This results in an unbroken chronological series of versions of the object and guarantees that no versionable updates will be lost.

With Pattern 3 and Pattern 4, each version of an object is distinguished from all other versions of the same object by means of a version begin date.

The start of an object's lifetime is marked by the chronologically earliest version of that object.

The end of an object's lifetime is marked by a chronologically latest version which contains an object end date.

The current version of an object whose lifetime is not yet ended is marked by a chronologically latest version which contains {null} in its object end date.

Notice that these versions have no version end date. This is because Patterns 3 and 4 assume that every non-initial version follows its predecessor immediately, i.e., that when a new version is added, the previous version ends its business effectivity exactly one clock tick prior to the start of business effectivity of the newly added version. In other words, with Patterns 3 and 4, inserting a new version into a table determines both the begin date for that version, and also the (implicit) end date for what had been the current version until that insert took place.


A second assumption that characterizes Patterns 3 and 4 is that after an object has been deleted, it cannot reappear in the database. More technically, and using our Policy table example, given a version with a non-null object end date, no version with a later version begin date can have the same policy number. Equivalently, any version with a later begin date must be a version of a different policy. In Part 4, we introduced the concept of an episode, and characterized this second assumption as the assumption that no object can have more than one episode.

These two assumptions are semantic constraints on Patterns 3 and 4. If your business requirements for versioned history rule out either or both of these assumptions, then neither of these patterns will meet your needs.

Patterns 3 and 4


A version pattern is most precisely defined by the combination of a) its business requirement (noted as {BR-x}, where x is any number) which the pattern must satisfy, and b) its semantic constraints (noted as {SC-x-y}, where x is the number for the pattern, and y is the number for the constraint). In these terms, Pattern 4, which is the pattern we will examine in this article, satisfies the same business requirement and operates under the same semantic constraints as Pattern 3 - with one exception. That exception is semantic constraint {SC 3-5}:

{SC 3-5}. Ver-beg-dt cannot be in the past nor in the future because it is also the system insert date. To set such a date to anything but [now] would be a lie. We can't do anything in the past, nor in the future. [Now] is all we have.

With Pattern 4, this constraint is changed. Semantic constraints 1 - 4 and 6 - 7 are identical to their corresponding Pattern 3 constraints. But Pattern 4 has no constraint corresponding to {SC 3-5} because it uses distinct dates to indicate when the version becomes effective (ver-beg-dt) and when the version is inserted into its table (crt-dt).

Bi-Temporality

With Pattern 4, we introduce a second set of dates. The first set is business effectivity dates. For Patterns 3 and 4, they are:

Version-begin-date (ver-beg-dt): the clock tick on which that version of the object becomes the currently in effect version.

Object-end-date (obj-end-dt): the last clock tick on which the last version of the object is in effect.

The second set, included in Pattern 4 but not Pattern 3, also consists of a pair of dates. They are:

Create-date (crt-dt): the clock tick on which that row is physically inserted into its table.

Last update date (last-updt-dt): the clock tick on which that row is physically updated.

Normally, versions aren't updated. And indeed, versions are never logically updated. The information they contain is never lost. But as noted in Part 5, there is one case in which a version may be physically updated. That happens when a transaction is received which directs the system to logically delete the object. This logical delete of the object is implemented by physically updating the current version, changing its object end date from {null} to the date of the logical delete. When this physical update takes place, the last update date records the clock tick on which it does. In all other cases, last update date is null.

These two sets of dates are orthogonal to one another. Changes to one set do not necessarily entail changes to the other set.

Computer scientists have called the use of these two sets of dates the "bi-temporal" data pattern. The difference between Patterns 3 and 4, then, is solely that Pattern 3 is not bi-temporal while Pattern 4 is.
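In schema terms, Pattern 4 simply adds the second, system-side pair of dates to the Pattern 3 table sketched earlier. The same caveat applies: the names and types are our illustrative choices, following the abbreviations used in the text.

CREATE TABLE Policy_Version (
    policy_nbr    CHAR(4) NOT NULL,
    ver_beg_dt    DATE    NOT NULL,  -- business: effectivity begin date
    obj_end_dt    DATE,              -- business: non-null only on the terminal version
    crt_dt        DATE    NOT NULL,  -- system: physical insert date
    last_updt_dt  DATE,              -- system: non-null only after the one
                                     -- permitted physical update, a logical delete
    policy_type   CHAR(3) NOT NULL,
    PRIMARY KEY (policy_nbr, ver_beg_dt)
);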

To work out the implications of this difference between the two patterns, let's begin by recreating the series of events described in Scenario 1.

Scenario 4.1: A Proactive Insert


With Pattern 4, it is now possible to physically insert a row on a different date than the date it becomes effective. So let's assume that Mary's policy, entered into the database on 1/12/04, does not become effective until 6/1/04. (Correction: In scenario templates, the fourth cell down on the left-hand side specifies the date or dates on which the version could validly have been inserted. In Part 5, Scenario 3.1, that cell contains "1/12/04 or any later date." It should contain "N/A," because for that Version Pattern, there is no choice about when the insert takes place; it must take place on the ver-beg-dt.)

The primary key of the policy version is policy-nbr (the unique identifier for a policy) plus version begin date. All versions for the same policy have the same policy number, and are distinguished from one another by their version begin date. In addition, for all versions but the current one (if one is currently in effect), we also know the dates on which they end their effectivity. If there is a terminal version, it is the clock tick in its object end date. Otherwise, it is one clock tick prior to the begin date of the chronologically next version.

The insert of the row shown above is a proactive insert. It takes place some four and a half months prior to the clock tick on which the inserted row becomes the in-effect row for that object. This insert is shown on the timeline diagram below.

With proactive inserts, there is a period of time between when the row was physically inserted and when the version it represents becomes effective. In the timeline diagram for Scenario 4.1, this period of time is indicated by the shaded block at the front of the version.

One situation in which proactive inserts are valuable is when there are multiple versions that become effective on the same clock tick. If our clock ticks once per day, then we have 24 hours to insert all those versions. Usually, that will be enough time to handle all but the largest volume situations.

But suppose our clock ticks once per second. With Pattern 3, we cannot insert versions rapidly enough to give more than a small number of them the correct ver-beg-dt. This immediately suggests that we should assign the correct ver-beg-dt to versions, rather than reading that date from the system clock once for each version.

Indeed, the more granular our clock ticks are, the less reasonable it is to design a system in which each version's ver-beg-dt is populated from a read of the system clock. So suppose we redesign our system to assign version begin dates rather than base them on a read of the system clock. Since we also need to record the clock tick on which rows are physically inserted - for all the usual reasons, such as backing out an entire batch run - we need to add an additional pair of dates, which we will call create date and last update date. Doing this gets us from Pattern 3 to Pattern 4.

Retroactive Inserts?

By inserting this initial version, on 1/12/04, we have created policy P138, effective 6/1/2004. But with increased expressive power - here the power to physically insert versions on dates other than their effectivity dates - comes an increased possibility for making a mistake.

One such mistake is a retroactive insert. Suppose that instead of inserting Mary's policy on 1/12/04, we instead inserted it on 8/1/04.

Before we explain why this is an error, let's note that there is nothing the database management system (DBMS) can do to prevent it. Put another way, SQL DDL does not let us tell the DBMS that retroactive inserts are invalid. Until DBMSs are upgraded with robust temporal management capabilities, we are stuck with a do-it-yourself approach. In this case, we must write code that prevents an insert whenever the transaction's create date is later than its version begin date. Because this is an integrity rule which applies to all versions in the database, it should not be left to individual applications to enforce. It should be enforced by the database itself, by means of a pre-insert trigger.
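Such a trigger might be sketched as follows. Trigger DDL varies by DBMS; this sketch uses PostgreSQL syntax, with the hypothetical column names introduced earlier:

CREATE FUNCTION reject_retroactive_insert() RETURNS trigger AS $$
BEGIN
    -- Reject any insert whose create date is later than its version begin date.
    IF NEW.crt_dt > NEW.ver_beg_dt THEN
        RAISE EXCEPTION 'retroactive insert: crt_dt % is later than ver_beg_dt %',
                        NEW.crt_dt, NEW.ver_beg_dt;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER policy_version_no_retroactive
BEFORE INSERT ON Policy_Version
FOR EACH ROW EXECUTE PROCEDURE reject_retroactive_insert();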

Why Not?

The reason that retroactive inserts are not valid is easy to see. They change the past. Prior to the retroactive insert, the past looked one way; afterwards, it looks different.

Consider policy P138's version that has a 6/1/04 effective date but that we didn't insert until 8/1/04. Prior to the retroactive insert, the past from 6/1/04 up to 8/1/04 looked like there was no policy P138 in effect at any point during that period of time. Afterwards, it looks like there was a policy P138 in effect during that period of time. But both things can't be true of the period 6/1/04 - 8/1/04; either policy P138 was in effect, or it wasn't!

It's certainly true that during any period of time, including the 6/1/04 - 8/1/04 period, either P138 was in effect or it wasn't. But suppose that it was. In that case, for those two months, the database was incorrect. Surely there must be a way to correct mistakes?

Well, with mistakes, we can either correct them or we can ignore them. If we correct them, we can either eradicate all traces of the error or we can preserve the evidence. The choice depends on the cost/benefit ratio of each alternative. Ignoring errors, actually doing nothing about them, is usually not an option when we are dealing with production databases, so we won't consider that option any further.


When we are dealing with transactionally updated databases, accountants know that the way to correct an error is to a) create a transaction which offsets the bad transaction, and then b) add the correct transaction. Although our Policy table is not transactionally updated, it is versioned, so something similar should take place. The error should not be erased or overwritten. It should be "sequestered" in some way, so that it is invisible to "normal" queries but nonetheless retrievable with specially written queries. Then the correct version should be inserted in its place.

This is the "preserve the evidence" approach to error correction. But it is not always the bestchoice. Consider that correcting mistakes while at the same time not removing all traces of themfrom queryable history (see Part 1 for an explanation of the term "queryable history") requirescomplex logic for both updating the database and subsequently querying the database after thecorrection. The cost of developing this logic is not trivial. Moreover, the potential for mistakes inqueries against a database containing such corrections - especially mistakes in end-user-written,non-IT-managed queries - is very real and can be as costly as you care to imagine. If there are nolegal requirements to retain a queryable record of the mistake, for example if the only legalrequirement is to be able to pull archive tapes when auditors ask for them, then the benefit ofmaking both the correction and the error immediately queryable is not very great. In a case likethis, the best approach is to snapshot the database (so the error can later be preserved in archivalstorage), and then proceed with the retroactive insert (or the update in place of an existing version,if that is what is required).

However, suppose our queryable database is a database of record for the data in question (policies, in our case). Suppose, further, that there is either a legal or a customer relations downside to telling one story about whether or not policy P138 was in effect throughout 6/1/04 - 8/1/04, when asked prior to 8/1/04, and subsequently telling the opposite story when asked.

A legal downside is easy to imagine. If a report on policies in effect was run originally on 7/31/04, and sent to the appropriate regulatory agency, but a week later rerun and sent again to that agency, the two reports are not going to agree about P138, or about any counts or other statistics it might contribute to.

A customer relations downside is equally easy to imagine. If our customer service representative (CSR) tells a customer one story about her policy on one call and a contradictory story on a later call, the customer is not likely to be happy and won't feel increased confidence in the reliability of our company.

Downsides that cannot be contained within the enterprise, such as regulatory, customer relationship and even vendor relationship downsides, are usually taken far more seriously by executive management than are downsides that affect only internal operations. With externally visible downsides, the cost/benefit ratio quickly swings to the "preserve the evidence" approach to error correction.

But neither Version Pattern 3 nor 4 supports correction with error retention. So if your business requirements for versioned history do not require error retention in queryable history and do permit you to manage nonrecurring, continuously existing objects, then you may still be able to use one or both of these patterns. Otherwise, neither Pattern 3 nor 4 will meet your needs. We will have to wait for a later Version Pattern to get error correction with error retention.

In Part 7, we will finish our discussion of Version Pattern 4.


Time and Time Again: Managing Time in Relational Databases, Part 7: Completing Version Pattern 4

Tom Johnston, Randall Weis
DM Direct, July 2007

As we saw in Part 6, what distinguishes Pattern 4 from Pattern 3 is bi-temporality. With Pattern 4, the business effectivity begin date of a version can be different than the date the version was physically inserted into its table. This was not possible with Pattern 3.

Consequently, bi-temporality makes it possible to physically insert a row representing a new policy proactively, prior to the date the policy becomes effective. As long as queries are careful to filter out rows whose effectivity lies in the future (unless those queries are expressly written to retrieve such rows), the ability to proactively insert new objects is often very useful. With this ability, we can "insert and forget," meaning that we can insert a new object as soon as we know when it will become effective, and then not have to do anything later on to "activate" it. The mere passage of time will ensure that the object becomes a "now in effect" object on the indicated effectivity date.
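The filtering this requires of queries is straightforward. A sketch, with the hypothetical column names used throughout:

-- "Normal" queries see only versions already in effect; proactively
-- inserted, not-yet-effective versions are excluded.
SELECT *
FROM Policy_Version
WHERE ver_beg_dt <= CURRENT_DATE;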

By the same token, bi-temporality also makes it possible to physically insert a row representing a new policy retroactively, after the date the policy becomes effective. We concluded Part 6 by discussing why retroactive inserts were semantically invalid, and why they must therefore be prevented. A retroactive insert of a new object, n days after its business effective begin date, is semantically invalid because it changes the past. Prior to the insert, a query about any point in time within n days prior to the time of insert would show that the object did not then exist. After the insert, the same query would show that the object did exist at that point in time. One query, contradictory results.


Let's turn now to the two other scenarios we have been using to illustrate all of these version patterns - updates and deletes.

Scenario 4.2a: A Type 1 Proactive Update

On 1/12/04, policy P138 is created for Mary Jaspers. As we have already described, the creation of a new policy is implemented by inserting an initial version for that policy into the Policy Version table.

However, policy P138 does not become immediately effective. It becomes effective on 6/1/04. It is, thus, a proactive insert of a policy. Since inserting a new policy and updating an existing policy are both implemented by physically inserting a row in the Policy Version table, proactive updating of a policy should follow the same pattern as proactive insertion of a policy. Proactive deletion should be similarly straightforward. But as we shall see, things are a little more complicated than that.

While there is only one type of proactive insertion of a new policy, there are two types of proactive updates. With the first type, Mary requests a change to her policy to become effective after the policy itself becomes effective. With the second type, Mary requests a change to her policy to become effective after it is entered into the database, but prior to its effective date.

To illustrate a Type 1 proactive update, let's assume that on 6/18/04, Mary requests a change in policy type from PPO to HMO, to become effective on 9/1/04, and that this change is physically recorded in the Policy Version table on 7/1/04. This is indeed a proactive update since the create date is two months prior to the effectivity date.

After inserting the new version (a physical insert which semantically updates the policy), the Policy Version table looks like this:

What these two rows tell us is semantically straightforward. Policy P138 goes into effect on 6/1/04, as a PPO policy. Three months to the day later, it remains in effect but becomes an HMO policy.

Scenario 4.2b: A Type 2 Proactive Update

But there is a second kind of proactive update to consider. What will the Policy Version table look like if, instead of the update just illustrated, we instead enter an update on 3/10/04, which also changes P138 from PPO to HMO, but this time becoming effective on 3/15/04?

Clearly this is also a proactive update, since it is entered prior to its effectivity. So shouldn't things be just as straightforward as with the Type 1 update? Let's see.

After inserting the new version (a physical insert which semantically updates the policy), the Policy Version table looks like this:


What these two rows tell us may be semantically straightforward, but it is also wrong! In other words, the result is not what we intended. These two rows tell us that Policy P138 will go into effect on 3/15/04, as an HMO policy, and that two and a half months later, it will remain in effect but will become a PPO policy.

Type 1 and Type 2 Proactive Updates. What's the Difference?

The difference between Type 1 and Type 2 proactive updates does not lie in the update transactions themselves. We may assume that the schema for the transactions is the same in both cases. Neither does the difference lie in the code processing the two transactions. We may assume that it is the same code.

What, then, is the difference? Why do Type 1 proactive updates "make sense," while Type 2 proactive updates do not?

The relevant difference between these two types of proactive update seems to be that Type 1 is an update to a policy that is already in effect, i.e., that has a version in effect at the time the new version is physically inserted. Type 2, however, is an update to a policy that has not yet gone into effect.

But how does this difference result in the difference between a semantically correct database state and a semantically incorrect one?

One way to look at it is this. Version Pattern 4 does not permit us to enter versions with a known effective end date. All versions entered are assumed to be valid from their effective begin date "until further notice." So when we enter a version of P138 that becomes effective prior to a version already in the table, we enter it with the understanding that it is effective until further notice, i.e., effective until we take some other action regarding policy P138.

The semantics of a Type 2 proactive insert, one whose effective begin date is prior to the effective begin date of an existing version, violates this understanding. It violates the semantics of proactive activity carried out in the absence of effective end dates because, without any further action being taken with respect to policy P138, the version which was entered first will change its status from being the insertion of a new policy to being an update of an existing policy.


From Mary's point of view, she understood the semantics of her two requests to be this:

In the case of the Type 1 update, Mary's second request is to change an existing policy. She requests that her policy, in effect since 6/1/04, be changed to an HMO policy three months later. And this is what she gets.

In the case of the Type 2 update, Mary's second request is not to change an existing policy. Instead, with this second request, Mary changes her mind about a policy which is not yet in effect. Instead of a new PPO policy becoming effective on 6/1/04 and remaining in effect until further notice, she now wants an HMO policy becoming effective on 3/15/04 and remaining in effect until further notice. But this is not what she gets.

Again, the exact same code is used to make both changes. In the first case, Mary got what she wanted. A database update correctly implements her change request. But in the second case, that same code produces a disaster in the making. For two and a half months after the Type 2 update, everything works as it should. Then, without any other database change taking place, on 6/1/04, Mary's policy suddenly becomes a PPO policy again.

In the first case, Mary wants a change to something already in effect. In the second case, she wants a replacement for a request that has not yet been implemented.

We can easily imagine that the business analyst who gathered the requirements for a versioned history of policies failed to distinguish between changes and replacements. After all, there are only three things you can do to rows in a table. You can insert them, update them or delete them. What the user calls a change and what she calls a replacement (or perhaps calls "changing her mind") are, to the business analyst, both physical changes to something already in the database.

In other words, to the business analyst, Mary's changing a policy and Mary's changing her mind amount to a distinction without a difference. As Paul Newman said in Cool Hand Luke, "What we've got here is a failure to communicate."

Proactive Deletion

We need to discuss proactive deletions only briefly. A Type 1 proactive deletion terminates an existing policy. A Type 2 proactive deletion terminates a policy that does not yet exist. But because you can't do anything to something that doesn't exist, we need to say what we mean more clearly than this.

In a non-versioned table, one row represents one object. In a Policy table, each row would represent a policy. But in the Policy Version table, no row represents a policy and each row represents one time slice of one policy. Consequently, actions to the policies themselves do not line up, one for one, with actions to rows in the Policy Version table.

With non-versioned tables, whose rows are one for one with objects, semantics are generally one for one with physical activity. For example, physically inserting a row is semantically inserting an object; physically updating a row is semantically updating an object; and so on. But with versioned tables, none of whose rows represent objects, semantics are not similarly isomorphic with physical activity. For example, physically inserting a row may be semantically inserting an object; but it may not be. Instead, it may be semantically updating an object. And the only physical updates to versioned tables, at least with the version patterns considered so far, are done to implement semantic deletions.

So to speak more clearly about deletions, we should say this: a Type 1 proactive deletion terminates an existing policy by supplying a termination date for the version of the policy that was current at the time of the physical Type 1 action. A Type 2 proactive deletion terminates the version of a policy which has the latest effective begin date, but does so prior to that date. Thus, unless some other action takes place, a Type 2 proactive deletion ensures that its corresponding policy will never become effective.

What Comes Next

We will not itemize the semantic constraints which define Version Pattern 4. Enough has already been said that the reader should be able to modify the semantic constraints for Pattern 3 and create a set of constraints that correctly define Pattern 4.

Next time, we will begin our discussion of Version Pattern 5. It is simpler than Pattern 4 in that it is not a bi-temporal pattern. But it is more complex than Pattern 4 in that it drops the non-recurring assumption. Once we have learned to handle objects that appear for awhile, vanish for awhile, and then show up again, and to do so without bi-temporality, we will then re-introduce bi-temporality. That will take us to Version Pattern 6, which is the penultimate versioning pattern.


Time and Time Again: Managing Time in Relational Databases, Part 8: Version Pattern 5

Tom Johnston, Randall Weis
DM Direct, August 2007

Version Pattern 5 is the first pattern to manage objects with multiple episodes, a concept introduced in Part 4. To avoid taking on too much at once, however, Pattern 5 will not include bi-temporality. That will come with Pattern 6, which will be, consequently, a pattern supporting both multiple episode objects and bi-temporality.

Up to this point, the version patterns we have examined have incorporated the assumption that once an object is logically deleted, it will never recur. In Part 4, we called this the "single episode" assumption. Technically, this assumption means that a) there can be at most one logical delete version for an object; and b) no other version for the same object can have an effectivity begin date later than the logical delete date.
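Stated that way, the assumption is easy to check mechanically. The following queries flag violations of a) and b) respectively (a sketch; the Policy_Version table and its underscored column names are our assumed rendering of the articles' schema):

    -- a) Objects with more than one logical delete version.
    SELECT policy_nbr
      FROM Policy_Version
     WHERE obj_end_dt IS NOT NULL
     GROUP BY policy_nbr
    HAVING COUNT(*) > 1;

    -- b) Versions that begin after the object's logical delete date.
    SELECT v.policy_nbr, v.ver_beg_dt
      FROM Policy_Version v
      JOIN Policy_Version d
        ON d.policy_nbr = v.policy_nbr
       AND d.obj_end_dt IS NOT NULL
     WHERE v.ver_beg_dt > d.obj_end_dt;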

Single Episode Objects

In the case of single episode versioning, it is reasonable to ask, "What kind of objects never recur?" For if there aren't any such objects, then the patterns considered thus far have no real-world applicability.


The answer to this question depends on what we mean by "recur." So, at this point, we could embark on a general discussion of the concept of recurrence. That discussion would include an examination of the ontology of types vs. instances, and of enduring (persisting through time) objects vs. events (processes in and through which objects change over time).

The types vs. instances distinction is, in relational databases, the distinction between a table and its rows or, in entity-relationship terms, between an entity type and its instances. In object-oriented theory, it is the distinction between object classes and their objects, and there the ontological question looms large, as the question of whether or not classes are "first-class objects." These questions are the current manifestation of what the medieval Scholastics (especially Ockham and Duns Scotus) knew as the realism / nominalism debate.

For realists, types really existed, just as truly as their instances did. Realists would have treated classes as themselves objects. For nominalists, instances were what really existed; a type was just a name given to the common features we observed across many different objects. Nominalists would not have treated classes as objects, able to exert influence on other objects, for example.


What about the second philosophical distinction, that between objects and events? Some proponents and followers of the Kimball methodology believe that everything of importance can be represented in dimensional data structures. From the perspective of the history of Philosophy, we would see that this position reflects a commitment to the ontological priority of events over objects. We would see that the approach to change which we have taken in this series of articles, which involves preserving both the before and after states of objects undergoing change, reflects a commitment to the ontological priority, or at least equivalence, of objects to events.

This, we believe, would be an important discussion. The history of Philosophy has much to tell us about problems which IT professionals and computer scientists alike are still struggling with. A discussion like this would illustrate how such business IT-specific issues as we have just mentioned are but the current manifestation of two ways of thinking about the world, whose roots go back at least to the pre-Socratic philosophers. Although object-orientation (in the sense of the ontological priority of "things" and "stuff") has dominated Western metaphysics for two-and-a-half millennia, there has also been a strain of process-orientation (in the sense of the ontological priority of events and occasions). This latter strain originates, perhaps, in the reflections of Heraclitus and his famous dictum "panta rhei" - "all things change." It reached perhaps its fullest expression some eighty years ago, in the work Process and Reality, by Alfred North Whitehead.

Don't Know and Don't Care

The problem with the single-episode assumption is that things exist over time, and so always contain within themselves the possibility of recurrence - of appearing to us, going away, and then returning. So, by their very natures, objects are fraught with multi-episode possibilities.

The reason that version patterns incorporating the single-episode assumption may still have some real-world use is that there are objects we may be interested in whose recurrence is not of interest to us. In other words, there may be objects which, when they do recur, we treat as new objects. Either we don't notice that they are the same objects we encountered before, or we do notice but don't care.

A simple example is a policy holder. Suppose that Joe Smith is a policy holder with our company from 2002 to 2004, at the end of which year he lets his policy lapse. A year later, Joe takes out another policy. If our company makes no effort to identify Joe as the same person who held a policy from 2002 to 2004, then when he returns, in 2006, Joe is effectively a new policy holder for us - not just a holder of a new policy, but a new holder of a policy.

Twenty years ago, to take another example, it was common for banks to manage "accounts." Some effort was made to identify cases in which one person held several accounts concurrently, but almost no effort was made to identify cases in which one person held several accounts successively.

So the answer to the question "What kind of objects are single-episode objects?" is clear. Single-episode objects are those whose recurrence we either don't know about, or don't care about.

But with Version Pattern 5, we begin to deal with multi-episode objects. These, then, are objects whose recurrence we do make an effort to identify and track.

Across industries that deal with people as customers, the same change has taken or is currently taking place. Insurance companies who managed a relationship with policy holders now want to manage a relationship with persons who are, or who may become, policy holders. Banks which used to manage the account relationship now want to manage the customer relationship. In some industries, companies are even striving to recognize the appearance and re-appearance of households.

Multi-Episode Objects

If an object could recur, what would that look like, as far as data is concerned? In the case of Mary's policy, it would look like a logical delete of policy P138 followed, after one or more clock ticks, by another version for P138 which is not a logical delete.

Normally, persistent objects re-appear after some period of time, i.e., after some possibly large number of clock ticks. So there can be gaps between episodes of the same object, although if there is only one clock tick between a logical delete and the next version of an object, those episodes would not have a gap between them.

This is illustrated in Figure 1.

Figure 1: Episodes of an Object

If the number of logical deletes of an object is the same as the number of episodes, then, for as long as that is true, the object does not currently exist on our database. That is because every episode, including the most recent one, has been terminated by a logical delete.

The only other case that can arise is that the number of episodes is one greater than the number of logical deletes. That is because, in a series of versions for the same object, what splits them into episodes are the logical delete versions. In this case, the object does currently exist on our database because there is no logical delete version for the most recent (or only) episode.
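This arithmetic can be read straight off the version table. The sketch below counts, for each object, its logical deletes and its episodes: a version opens an episode if it is the object's first version, or if it immediately follows a logical delete version. The object currently exists if and only if episodes = logical deletes + 1. (A sketch only; it assumes a DBMS with window functions, and our underscored rendering of the articles' column names.)

    WITH ordered AS (
      SELECT policy_nbr, ver_beg_dt, obj_end_dt,
             LAG(CASE WHEN obj_end_dt IS NULL THEN 0 ELSE 1 END)
               OVER (PARTITION BY policy_nbr ORDER BY ver_beg_dt) AS prev_is_delete
        FROM Policy_Version
    )
    SELECT policy_nbr,
           SUM(CASE WHEN obj_end_dt IS NOT NULL THEN 1 ELSE 0 END) AS logical_deletes,
           SUM(CASE WHEN prev_is_delete IS NULL    -- first version of the object
                      OR prev_is_delete = 1        -- version following a logical delete
                    THEN 1 ELSE 0 END) AS episodes
      FROM ordered
     GROUP BY policy_nbr;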

Semantic Constraints for Pattern 5

Pattern 5 is identical to Pattern 3, but with the single-episode constraint dropped. So instead of presenting our usual insert/update/delete scenarios, we will develop Pattern 5 by beginning with the semantic constraints (called "assumptions" when that is the more convenient terminology) for Pattern 3, and seeing what changes need to be made to them to accommodate the multi-episode functionality.

We start with the first constraint.

Semantic Constraint {SC 3-1}.

{SC 3-1} states that "if a version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version." This constraint says that versions must have no gaps in time between them. But multiple episodes of an object nearly always will have gaps in time between them. For example, a policy expires because of a late payment. But a week later that policy is re-instated, because the company granted a grace period in this case in order to retain a good customer.

So to accommodate multiple episodes, {SC 3-1} must be modified to apply only to versions which do not follow a terminal (logical delete) version. This gives us the corresponding constraint for Pattern 5.

{SC 5-1}. If a non-terminal version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version.
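As a sketch of what enforcing {SC 5-1} looks like, the query below flags adjacent version pairs that should abut but don't. It assumes a clock tick of one day, a ver_end_dt column holding the effectivity end date, and a DBMS with window functions; none of these details come from the articles themselves.

    WITH pairs AS (
      SELECT policy_nbr, ver_beg_dt, ver_end_dt, obj_end_dt,
             LEAD(ver_beg_dt)
               OVER (PARTITION BY policy_nbr ORDER BY ver_beg_dt) AS next_beg_dt
        FROM Policy_Version
    )
    SELECT policy_nbr, ver_beg_dt, ver_end_dt, next_beg_dt
      FROM pairs
     WHERE obj_end_dt IS NULL                            -- non-terminal versions only
       AND next_beg_dt IS NOT NULL                       -- ... that have a successor
       AND ver_end_dt <> next_beg_dt - INTERVAL '1' DAY; -- must abut, one tick apart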

Semantic Constraints {SC 3-2 and 3-3}.

{SC 3-2} states that "if a version for an object is not followed by another version for that same object, and is not itself a logical delete version, the end date will be assumed to be unspecified."

And {SC 3-3} states that "if a version for an object is a delete version, the object will be considered deleted as of that delete date." Both of these semantic constraints apply to Pattern 5 as well.

Semantic Constraint {SC 3-4}.

{SC 3-4} states that "it is semantically illegitimate for any version of an object to follow (to be chronologically later than, based on its version date) a logical delete version of that object."

This is the single-episode constraint. So it is precisely the constraint that is dropped for Pattern 5.

Semantic Constraints {SC 3-5 and 3-6}.

The next two semantic constraints for Pattern 3 explicitly apply to a logical insert, i.e., to the first version of an object.

{SC 3-5} states the constraint on the effectivity begin date. It says, of the first version of an object, that "ver-beg-dt cannot be in the past or in the future, because it is also the system insert date. To set such a date to anything but [now] would be a lie. We can't do anything in the past, or in the future. [Now] is all we have."

{SC 3-6} states the constraint on the logical delete date. It says, of the first version of an object, that "obj-end-dt must be null. We can't end something before it begins. We can't even end something when it begins. We can only end something after it begins."

What's different about Pattern 5 is that there can be multiple first versions of an object, one for each episode. So the question is whether these two constraints apply to initial versions of any episode, or just to the initial version of the first episode.

The answer is that these two constraints apply to initial versions of any episode of an object. No matter what episode we are dealing with, it is obviously true that (without bi-temporality) we can't do anything in the past, or in the future. And it is equally obvious that (without bi-temporality) we can't end something before it begins, or even when it begins.
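Concretely, the insert that opens a later episode looks exactly like the insert that opens the first one (a sketch, with our assumed underscored schema):

    -- Opening a new episode of policy P138, after an earlier episode
    -- was terminated by a logical delete.
    -- {SC 3-5}: ver_beg_dt is [now], the system insert date.
    -- {SC 3-6}: obj_end_dt is null; nothing can end before (or when) it begins.
    INSERT INTO Policy_Version (policy_nbr, ver_beg_dt, obj_end_dt, policy_type, client_nbr)
    VALUES ('P138', CURRENT_DATE, NULL, 'PPO', 'C882');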

Taking Stock of Where We Are

We will complete Version Pattern 5 in Part 9. But here, at a point about a third of the way through this series of articles, we want to remind you why we are considering several versioned history patterns, and also why the discussion has been so painstakingly detailed.

There are two reasons for considering several version patterns. The first is that each pattern can be used to satisfy any set of business requirements which conform to the semantic constraints stated for the pattern. Earlier patterns are less complex than later ones, and some developers may prefer to use simpler versioned history patterns when their system requirements permit it.

The second reason for considering several version patterns is to gradually build up to what is the most powerful versioned history pattern that we know of. As we said in Part 1, the versioned history pattern we have evolved is the most complex data pattern for its size that we know of. It is best, pedagogically, to approach it one step at a time.

Another pedagogical feature of these articles is their discursive style. Often, data patterns or data models are presented as completed objects. Little effort is spent on explaining the thought processes that led up to them.

We believe, however, that the thought processes are, if anything, even more important than the completed pattern. If we can teach senior data modelers how to think about time, by explaining how we thought through the issues and why we chose the solutions we did, then the final versioning pattern we present should seem almost inevitable. If, at the end of this series of articles, that pattern does seem inevitable, or at least completely plausible to you, then we will have given you not just a tool for managing time in databases, but also the skills to build other such tools.


Time and Time Again: Managing Time in Relational Databases, Part 9: Completing Version Pattern 5

Tom Johnston, Randall Weis
DM Direct, August 2007

We noted in Part 8 that Version Pattern 5 is equivalent to Version Pattern 3, but with the single-episode constraint {SC 3-4} removed. However, in that same installment, we also pointed out a difference in the first semantic constraint for each pattern. Constraint {SC 3-1} states that "if a version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version." But for multi-episode objects, clearly, there may be any number of clock ticks between the logical delete version for one episode and the initial version of the next episode. We accommodated this change by specifying that constraint {SC 5-1} has the same wording as {SC 3-1}, except for replacing "if a version for an object ..." with "if a non-terminal version for an object ..."

On examining constraints {SC 3-5} and {SC 3-6}, we found no differences. These constraints apply, without modification, to Pattern 5. This brings us to the last two constraints for Pattern 3. Do they also apply to Pattern 5? Except for the minor rewording of {SC 3-1} and dropping {SC 3-4}, are the semantic constraints for Pattern 5 exactly the same as those for Pattern 3?

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns were like.


Figure 1: Chart of Version Patterns

The Importance of Semantic Constraints

Before we proceed, let's remind ourselves of the importance of explicitly listing the semantic constraints for each pattern. There are two reasons for doing so. The first is that, together with the schemas for each pattern - the columns used to implement the pattern - these constraints are the formal and fully explicit definition of each pattern.

The scenarios show how each pattern works. The diagrams provide useful visual representations of what is going on. But the constraints and the schemas are the patterns.

The second reason for explicitly listing semantic constraints is that we will have fully and correctly implemented a version pattern if and only if a) we include the versioning columns defined for the pattern in all tables we wish to version; b) we enforce each of these pattern-specific constraints; and c) for all version patterns, we implement what we will call "temporal referential integrity."

Entity integrity constraints - the uniqueness of primary keys - are still enforced by the DBMS. But, as we shall see later on, we cannot use the DBMS to enforce referential integrity. When standards committees stop squabbling and produce a temporal data management standard, and then when vendors implement that standard, it is to be hoped that "temporal referential integrity" will be enforced by the DBMS. But because we are still in the "do it yourself" period, as far as the evolution of temporal database management is concerned, that's what we will have to do. We will have to implement temporal referential integrity ourselves, and also the specific constraints that define each version pattern. That means, in most cases, that we will have to write code. In general, that code will be written as before-insert triggers.
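To make that concrete, here is one possible shape such a trigger could take, in DB2-flavored SQL (syntax varies by DBMS). The rule checked - that some version of the referenced client must be in effect on the new policy version's begin date - is our provisional reading of temporal referential integrity, which later installments develop properly.

    CREATE TRIGGER policy_version_tri
      NO CASCADE BEFORE INSERT ON Policy_Version
      REFERENCING NEW AS n
      FOR EACH ROW
      WHEN (NOT EXISTS
             (SELECT 1
                FROM Client_Version c
               WHERE c.client_nbr = n.client_nbr
                 AND c.ver_beg_dt <= n.ver_beg_dt        -- in effect by then ...
                 AND (c.obj_end_dt IS NULL
                      OR c.obj_end_dt >= n.ver_beg_dt))) -- ... and not yet logically deleted
      SIGNAL SQLSTATE '75001'
        SET MESSAGE_TEXT = 'temporal referential integrity violation';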

But this is getting ahead of ourselves. For now, let's wrap up Version Pattern 5 by looking at its last two semantic constraints.

Semantic Constraints {SC 3-7 and 3-8}

Constraints 5 and 6 apply to initial versions of objects. For Pattern 3, these are the versions that implement the insert of a new object. For Pattern 5, these are the versions that implement the insert of a new episode of an object. Inserting a new object is thus a special case of inserting a new episode of an object. It is the special case of inserting an initial episode of an object.

Constraint 5 applies to the effectivity begin date, and constraint 6 to the logical delete date. Constraints 7 and 8 apply to the same pair of dates, but govern versioned updates rather than versioned inserts.

{SC 3-7} states the constraint on the effectivity begin date. It says, of any non-initial, non-logical-delete version of an object, that "its ver-beg-dt must be at least one clock tick past the ver-beg-dt of the immediately prior version for the policy. If it is not, then the two versions will overlap for at least one clock tick. But that is invalid because there cannot be two versions of the truth at any given point in time."

This constraint also applies to Pattern 5, and for exactly the same reason. However, in reviewing Pattern 3, we have discovered that constraint 7 is a corollary of constraint 1. Constraint 1 states that "if a (non-terminal) version for an object is followed by another version for that same object, the end date of the former version must be one tick of the clock prior to the begin date for the latter version." This is the same thing as saying that "... the begin date of the latter version must be one tick of the clock later than the end date for the former version." If it must be one tick of the clock later, then it follows that it "must be at least one tick of the clock" later.

So constraint 7, first of all, should have said "exactly" instead of "at least." But in either form, it is not a distinct constraint, because it follows from constraint 1. If code implemented constraint 1, there would be no need for additional code to implement constraint 7.

{SC 3-8} states the constraint on the logical delete date. It says, of any non-initial, non-logical-delete version of an object, that "obj-end-dt must be null. If it is not, the policy is being deleted; and, by assumption, this is not a delete transaction."

Version Pattern 5 has the same semantic constraint. It says that the object end date for a version has a value if and only if its corresponding object is being logically deleted. Note that obj-end-dt is a business date, not a date reflecting activity against the DBMS. (In general, when we characterize a column as a "business" or "logical" or "semantic" column, we mean the same thing. More precisely, since these terms emphasize different things and thus are not truly synonyms, every column described in these articles is either all three of these things, or none of them.)


Infrastructure is a Hard Sell

More than one company has turned down the opportunity to implement versioned history as an enterprise resource: a data management layer on top of all production databases, on which a view layer can be built to hide its complexities. The attitude within IT seems to be that they need to get on with the job of building or re-building new OLTP systems, warehouses, data marts or ODSs (Operational Data Stores). If versioned history is required, they will just add a date to the primary key and be done with it. It works like this.

Versioned history is usually introduced during a project to build some kind of data warehouse, some kind of database for which there is a requirement to keep some amount of versioned history (as defined in Part 1). The next time a database is being created or modified and versioned history again becomes a requirement, it is implemented for that database. If a data mart is being built or modified, one based on the dimensional model, and there is a requirement to provide not only dimensions as they are currently, but also dimensions as they were at some point in the past, then work is done and, eventually, the result is achieved.

This approach works, more or less. As time becomes increasingly important in databases, it will be used more and more frequently. But it is the wrong approach, and it results in a set of databases with (a) hampered semantic interoperability; (b) little or no flexibility to accommodate additional semantic temporal requirements; (c) high maintenance and enhancement costs; (d) a higher proportion of serious (but often difficult to detect) semantic errors in the data; and (e) time management functions which must be understood and manipulated by both end-user and IT-developer query writers.

Let's consider each of these flaws in the ad hoc approach, one at a time.

1. Hampered semantic interoperability. An ad hoc, one-database-at-a-time approach might be represented by a set of databases, each of which uses one of these version patterns. Obviously, federated queries that range across two or more such databases will have to reconcile the different implementations of versioned history. As we have seen, just by considering these first five version patterns, that reconciliation is likely to be difficult and error-prone.

2. Restricted flexibility. Suppose one of these version patterns has been implemented in a database, and that later on, the business comes back and asks for additional versioned history functionality. This is equivalent to asking for a new version pattern. If this involves nothing more than rewriting insert triggers, the changes will be relatively easy to implement. If it involves adding non-primary key columns to a versioned table, the changes will still be relatively easy to implement.

But the change from Pattern 2 to Pattern 3, and another change that we will consider later on, both involve changes to the primary key of the versioned tables. These changes are still easy to implement, compared to changes in primary keys for non-versioned tables. The reason, as we will see, is that referential integrity for versioned tables is not implemented by storing primary key values in foreign key columns. This means that changes to the primary keys of versioned tables affect those tables only, and do not "ripple out" to affect all foreign key-related tables.

Nonetheless, the costs of incrementally adding temporal functionality are not negligible. Whenever the schemas are changed, for example, all code that touches those schemas must be altered. Although, as described in Part 1, only view-creating and physical update code will directly access versioned tables, this can still be a significant burden when changes must be made. Far better to implement an "ultimate" versioned history solution, and then "turn on" specific functionality as and when it is requested.

3. High maintenance and enhancement costs. High maintenance costs result from the need to maintain different version-implementing codebases. High enhancement costs are just a corollary of the previous point, restricted flexibility.

The main source of high enhancement costs is the cost of adding a date column to the primary key of each table being converted to a version table, together with the (erroneous) belief that foreign keys to the table being converted must also be changed. With primary key expansion, and the belief that foreign keys must be maintained when a table is converted to a versioned table, foreign keys in all directly related objects must also be expanded. Thus, the cost of change is no longer simply the cost of adding a date to a primary key in a single table. Furthermore, when foreign keys to the original table are themselves part of the primary keys in dependent tables, the need to modify foreign keys ripples out to the tables dependent on those tables. This ripple effect can continue indefinitely, and require changes to any number of tables. Another problem is that complex "trickle" processes must be developed to replicate changes in versions and to replicate terminating events to dependent objects.

4. Risk of errors. This is a special case of the previous point, high maintenance and enhancement costs.

5. Risk of misinterpretation of query results. In addition, there is the specific risk of errors in interpretation. Because of the complexity of versioned history, it is quite easy to write queries that return apparently valid results, but that are actually answers to a slightly different question than the one the query author had in mind. This risk of misinterpreting query results exists for all queries, of course. But we have found that it is an especially high risk when temporal queries are involved.

The Biggest Risk: Internal Machinery Exposed to Query Writers with the Ad Hoc Approach

Temporal management functions are infrastructure. It's just that standards committees can't agree on how to implement them, and so DBMS vendors hold back. This does not mean that temporal management functions should be developed on the fly, one database at a time. It means that they are infrastructure we must currently develop ourselves.

Although it need not be so, every ad hoc implementation of versioned history that we have seen exposes the temporal management infrastructure to both developers and business users of the databases. It also asks modelers and developers to design and build perhaps the most complex semantic patterns that databases are asked to manage. And the ability to do this well, let alone the ability to do it correctly, is found only among the most advanced modelers and developers an IT department has. For those IT managers who think that modeling and database development are commodity skills, the management of temporal data is the strongest disproof possible.

Because temporal data management is infrastructure, it should be encapsulated within a data access layer that hides its complexity from modelers, developers and business users alike, as explained in Part 1. Just as designers and developers don't have to implement balanced-tree access methods, but make use of them every time they design and build a relational database, so too should they be able to use temporal data management without the internal machinery being exposed to them.
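As a small illustration of what such encapsulation might look like (a sketch only; the view and the underscored schema are our assumptions, not a design from these articles), a view can present versioned data as if it were a conventional table of current rows:

    -- Applications query "Policy" as an ordinary table; the versioning
    -- machinery stays hidden behind the view.
    CREATE VIEW Policy AS
    SELECT pv.policy_nbr, pv.client_nbr, pv.policy_type
      FROM Policy_Version pv
     WHERE pv.obj_end_dt IS NULL               -- not a logical delete version
       AND pv.ver_beg_dt =                     -- the version current as of today
           (SELECT MAX(pv2.ver_beg_dt)
              FROM Policy_Version pv2
             WHERE pv2.policy_nbr = pv.policy_nbr
               AND pv2.ver_beg_dt <= CURRENT_DATE);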

Infrastructure projects are always a hard sell. By definition, they don't satisfy any particular need of a specific business unit. They are paradigmatically enterprise work, work whose value is realized indirectly by the specific applications and databases that serve specific business users and purposes. The management of time with today's relational DBMS products is a "do it yourself" effort. For the reasons just articulated, it is work whose considerable cost should be borne by the enterprise.

One final note. We have noticed that when infrastructure work is proposed, and objections are raised to it, at some point someone will say that what we are proposing is a "build it and they will come" approach. And, of course, everyone knows that that is a bad idea. But with infrastructure work, the only alternative to "build it and they will come" is "get in, patch it, and get out as quickly and cheaply as possible."

We believe that when an enterprise solution to temporal database management is built, they will come. It is not the business that will come. But the business will make demands on developers that require temporal data management, and those developers will come.

Earlier, in the numbered paragraph entitled "High Maintenance and Enhancement Costs," we suggested that foreign keys don't work with versioned tables the same way they work with "normal" unversioned ones. It is important to be very clear about this. A common misunderstanding of versioning is that when a new version of an object is created, all foreign keys that pointed to the original version must be cascade-updated to point to the new one. Indeed, were it true that foreign keys work the same way in versioned and un-versioned tables, the problems would get even more complicated.

So next time, in Part 10, we will stop and examine the use of foreign keys when versioned tables are involved. We will see that DBMS-enforced referential integrity cannot be used. A new way of using foreign keys and enforcing referential integrity is required. And once again, because temporal database management is on hold as far as DBMS vendors are concerned, we will have to take a "do it yourself" approach.


Time and Time Again: Managing Time in Relational Databases, Part 10 - Foreign Keys in Version Tables

Tom Johnston, Randall Weis
DM Direct, September 2007

In trying to explain versioning to other data modelers and DBAs, we have found that many of them are worried about the cascade update implications for versioned tables. In general, the concern is that if a new version of an object is inserted into a version table, then all foreign key references to the version it replaced must be updated to point to the new version. Furthermore, if those updates take place in tables which are themselves versioned tables, then those updates create additional new versions, and all foreign keys that point to them must also be updated. And so a single versioned update can cascade out to affect many other versioned tables, ending only when a cascade chain reaches either a) a non-versioned table or b) a versioned table which does not itself contain any foreign keys.

These concerned modelers and DBAs correctly point out that this cascade effect can make versioned updates very expensive. Furthermore, if there are any circular references, the cascade will become an infinite loop. For example, consider this pair of tables: a Client Version table and our Policy Version table. The Policy Version table contains a foreign key to the Client Version table, and each row in the Client Version table contains a count of the total number of versioned updates across all policies for that client. The infinite loop happens like this:

1. A new version of a policy is inserted. Call the old version P12 and the new version P13.

2. This triggers code that updates the policy version count in the related client, resulting in a new version of that client. Call the old version C25 and the new version C26.

3. At this point, we must cascade update the policy version. P13's foreign key still points to client version C25, and it must be updated to point to C26. This creates a new version of that policy, P14.

4. But now the trigger executes once again, updating the policy version count in the related client. This loops us back to step 2, where we update that count, creating client version C27.

5. But once again, we must cascade update the policy version. P14's client foreign key still points to C26. This update creates a new version of that policy, whose client foreign key points to C27. This new version of the policy is P15.

6. And so on, and so on, to either the last syllable of recorded time, or to a frustrated machine operator who cancels the process.

Also, notice that so far, we have talked only about cascade updates. There are also, of course, cascade deletes. As we will see later on, the semantic constraints surrounding cascade deletes in versioned tables are not identical to those for cascade deletes of non-versioned tables.


But for now, we will concentrate on the cascade update problem.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns are.


Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief description of what each one is about. As this series grows longer, the back references to earlier installments will only increase, and we think that having a chart like this at hand will be helpful.


Figure 2: Chart of Installments to Date in this Series

What to Do About the Cascade Update Problem for Versioned Tables

There are only three possible responses to the versioned table cascade update problem (or to any problem). They are: a) ignore the problem, b) avoid it, or c) fix it.

Ignore it. The first response is to ignore the problem. If we take this approach, it means that in many if not most cases, a single versioned update will trigger any number of additional versioned updates. Perhaps the attitude about multiple updates is: "DASD is cheap; CPU cycles are cheap. So just let it happen." But then what about the possibility of an infinite loop? Perhaps the attitude about infinite loops is: "Design your databases so there aren't any loops."

In some specific cases, this may be an acceptable response. But if our objective is to develop an enterprise solution (which, indeed, is our objective), then this response just won't do. We need a response that works in all possible cases, not just in special individual cases. We can't just ignore the issue of cascade updates in versioned tables.

Avoid it. The second response is to avoid the problem. This means to turn off referential integrity checking in the DBMS (which would stop the loop in the client to policy direction). If we can accept cascade updating, and are only concerned to avoid the infinite loop, an alternative is to turn off the developer-written code which updates counts (which would stop the loop in the policy to client direction).

Avoiding the problem is not as cavalier as it sounds. Standards committees are at loggerheads over how to introduce temporal management functions into relational DBMSs, and as a result, DBMS vendors are stymied. Consequently, there is no adequate support for temporal database management in current relational DBMSs. So we shouldn't be surprised that there is also no DBMS-implemented referential integrity that works for temporal (versioned) tables. Until there is such support, avoiding the problem by turning off referential integrity may be a very reasonable response.

But isn't the proper conclusion, given DBMS shortcomings, to implement referential integrity with our own code? Isn't the proper conclusion to fix it, not to avoid it?

That depends, of course, on a comparison of the costs and benefits of the two responses. Specifically:

What costs do we incur if we fix the problem? We may assume that these are all IT costs, not business costs. The answer is that we incur the development and maintenance costs that are part of any IT software artifact.

What benefits do we garner if we fix the problem? We may assume that these are all business benefits, not IT benefits. The answer is that a) we can use DBMS-supplied referential integrity, and thereby gain the assurance that RI errors will not occur; and b) by updating foreign keys whenever referenced rows are replaced by newer versions, the relationships involving versioned tables will remain current.

What costs do we incur if we avoid the problem? We may assume that these are all business costs, not IT costs. The answer is that a) since we cannot use DBMS-supplied referential integrity, we must therefore write our own code to guarantee that integrity; and b) since relationships among versioned tables are not updated if we avoid the problem, such relationships gradually become more and more "out of date" as additional versions of a foreign-key referenced object are created.

We may assume that there are no benefits to avoiding the problem (other than cost avoidance).

Fix it. The third response is to fix the problem. To fix any problem, we need to start with a correct diagnosis. We need to identify the root cause, the thing which, if fixed, will eliminate the problem. We suggest that the root cause of the problem is the instability of foreign-key references when those foreign keys point to versioned tables.

With "normal" (non-versioned) tables, cascade updates are rare because key changes are rare. Ifwe were dealing with a Client and a Policy table - instead of with a Client Version and PolicyVersion table - then nothing we have considered so far would require a cascade update. Butbecause we are dealing with foreign key-related version tables, every new version in thereferenced table (Client Version) requires a cascade update in the referencing table (PolicyVersion). Because we have chosen to create a new version in the referenced table every time anew version is added in the referencing table (by incrementing a count), we loop back and createan infinite cascade update problem.

We can fix the infinite loop problem quite easily. In the case of the example we are working with, the infinite loop will be broken if a versioned update in the referencing Policy Version table does not cause a new version to be created in the referenced Client Version table. That means that we either don't keep a count of policy versions in the client, or else that updating that count is an overwrite type of update, not one that requires a new version to be created.
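In SQL terms, the loop-breaking choice is an update in place. A sketch (the column policy_version_count is our hypothetical name for the count described above):

    -- Overwrite the count on the client's current version.
    -- No new client version is created, so no cascade is triggered.
    UPDATE Client_Version
       SET policy_version_count = policy_version_count + 1
     WHERE client_nbr = 'C882'
       AND ver_beg_dt = (SELECT MAX(cv.ver_beg_dt)
                           FROM Client_Version cv
                          WHERE cv.client_nbr = 'C882');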

Additional discussion of the infinite loop problem, and the fix for it, might provide additional insight into the root cause of the overall cascade update problem. But we prefer to go after that additional insight by turning to the other part of the problem, the cascade update from "parent" to "child," from the referenced table to the referencing table.

A cascade update happens when a primary key changes, and there exist one or more foreign keys which reference it. So to fix a cascade update problem (not ignore it and not avoid it), we must either a) eliminate primary keys, b) eliminate foreign keys, or c) stabilize the link between them.

We can rule out the first two options immediately. They amount to conceding that the relational model cannot handle versions and, more generally, cannot handle time. While it might be interesting to pursue this concession, we have neither the time nor the inclination to do so here.

But what does the third fix mean? What does it mean to "stabilize" the link between a primary and a foreign key? Let's start by looking at an example of these keys.

Figure 3: The Client Version and Policy Version Tables

In both tables, the primary key consists of two columns. The first column is an identifier for the individual object, either a client or a policy. The second column, a date, distinguishes and sequences the versions of each object. No two versions of the same object can have the same begin date.

Note: in all examples up to this point, the foreign key in the Policy Version table consists of only a single column - client number. Now that we are considering an issue involving foreign keys to versioned tables, we have to drop that simplification and show the full foreign key.
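In DDL, the arrangement Figure 3 depicts would look something like this (a sketch; the data types are guesses, non-key columns are omitted, and the underscored names are our rendering of the figure's hyphenated ones):

    CREATE TABLE Client_Version (
      client_nbr  CHAR(4) NOT NULL,
      ver_beg_dt  DATE    NOT NULL,
      -- non-key columns omitted
      PRIMARY KEY (client_nbr, ver_beg_dt)
    );

    CREATE TABLE Policy_Version (
      policy_nbr         CHAR(4) NOT NULL,
      ver_beg_dt         DATE    NOT NULL,
      client_nbr         CHAR(4) NOT NULL,
      client_ver_beg_dt  DATE    NOT NULL,  -- the full, two-column foreign key
      -- non-key columns omitted
      PRIMARY KEY (policy_nbr, ver_beg_dt),
      FOREIGN KEY (client_nbr, client_ver_beg_dt)
        REFERENCES Client_Version (client_nbr, ver_beg_dt)
    );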

What Does It Mean?

We said earlier that to understand how avoiding the problem might work, we need to answer the question, "What do foreign keys relate when they relate versioned tables?" The same thing is true for the "fix it" option. To understand how to fix the problem, we need to answer the same question.

Next time, in Part 11, we will see that the "avoid it" and "fix it" options are based on two different answers to the question, "What do foreign keys relate when they relate versioned tables?" Since different answers are being given, the responses based on those answers will very likely support different semantics. It is a cost/benefit analysis of these different semantics which will determine whether we should use an "avoid it" or a "fix it" response.


Editor's note: Watch for Part 11 in the October 5, 2007 issue of DM Direct.


Time and Time Again: Managing Time in Relational Databases, Part 11 - Foreign Keys in Version Tables Continued

Tom Johnston, Randall Weis
DM Direct, October 2007

In Part 10, we presented a concern that many modelers and database administrators (DBAs) have expressed about what happens to foreign keys when a new version of an object is created. We said that "in general, the concern is that if a new version of an object is inserted into a version table, then all foreign key references to the version it replaced must be updated to point to the new version. Furthermore, if those updates take place in tables which are themselves versioned tables, then those updates create additional new versions and all foreign keys that point to them must also be updated. And so a single versioned update can cascade out to affect many other versioned tables, ending only when a cascade chain reaches either a) a non-versioned table, or b) a versioned table which does not itself contain any foreign keys."

We also noted that there is a potential cascade problem when an object is deleted. Note two key phrases here: a) "when a version is replaced," and b) "when an object is deleted." Versions are not updated; versions are replaced. It is objects which are updated in the process of one version replacing a previous one. Also, versions are not deleted; objects are deleted by placing a deletion indicator on the current version of the object, and also on any future versions on the database at the time of the deletion.


Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this will help the reader recall what those prior version patterns are.


Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief description of what each one is about. As this series grows longer, the back references to earlier installments will only increase, and we think that having a chart like this at hand will be helpful.


Figure 2: Chart of Installments to Date in this Series

What Do Foreign Keys Relate? The Issue

In non-versioned tables, foreign keys relate objects. Each row in a Client table represents a client. Each row in a Policy table represents a policy. What a foreign key from a Policy table back to a Client table does can be expressed in two different ways. First, we can say that it relates the row representing that policy, which contains the foreign key, to the row representing the client for that policy. Secondly, we can equally well say that it relates that policy to that client.

The first way of saying what the foreign key does refers to the data; the second way refers to the things which that data represents. The first way, from the point of view of data management, is physical; the second way is logical or, more precisely, semantic.

In non-versioned tables, there is a one-to-one pairing of things we are interested in (clients, policies) with data that represents them (rows in tables). Because the mapping between what is represented and its representation is one-to-one, we don't need to pay much attention to the distinction. We can talk about what foreign keys relate in terms of either objects or rows.

Seldom dealing with versioned tables, modelers and DBAs are usually not very concerned with the difference between data and semantics, between tables, rows and columns, and the types, instances and properties that they represent. But when we are dealing with versioned tables, the one-to-one mapping breaks down. It is only when there is exactly one row, representing the initial version of an object, that the mapping appears to be one-to-one. After that, it is one object to many versions of that object.

Time and Time Again: Managing Time in Relational Databases, Part 11 - ... http://www.dmreview.com/dmdirect/20071005/10000163-1.html?type=pr...

3 of 7 10/7/2008 1:03 PM

Page 68: Managing time in relational databases

However, a more careful reading of the situation would make it clear that, even in this case of a single version of an object, there is not a one-to-one mapping between object and row. For a row which represents a version of an object, even when there is only one version of that object being represented, does not represent the object. It represents a delimited section of the timeline of that object. Each version explicitly says, "I represent this object as it was, beginning on this specific date." Then (in the "ultimate" version pattern we are leading up to), the version also says either "And I no longer represent this object after this specific end date," or else it says "I will continue to represent this object 'until further notice'" (which, as we shall explain later, is not the same thing as saying "I represent this object until the end of time").

Because of this breakdown of a one-to-one mapping between an object in the world and a row in a table, talk about relating objects and talk about relating rows are not, as they are in the case of non-versioned tables, effectively equivalent. Let's consider a more typical situation, say, one client and five versions for that object in the Client Version table, and one policy related to that client, and three versions for that object, in the Policy Version table. Here, it is clear that the one-to-one mapping has broken down, both for clients and for policies. And here we will have to make a choice between the two ways of speaking. Do foreign keys relate policies to clients, or do they relate policy versions to client versions?

What Do Foreign Keys Relate? One Answer

Those modelers and DBAs who are concerned about the foreign key problem in versioned tables assume that foreign keys relate rows which represent versions. A foreign key from a Policy Version table back to a Client Version table relates one row to another, a policy version to its associated client version.

If we want a justification for the assumption that foreign keys in versioned tables relate versions to versions, it might go like this: we understand that the business is interested in clients and policies, and which go with which. But we modelers and DBAs serve the business by managing data. If the business analyst has gotten the requirements right, and if we have expressed them correctly, then foreign keys relating rows to rows will accurately reflect the real-world relationships between objects and objects.

An even shorter justification goes like this: the DBMS doesn't know anything about what's "out there," about what the data it manages represents. So there is no way we can ask the DBMS to relate objects; all it knows how to do is to relate rows to rows, and foreign keys are the method it uses to do that. Versioned tables or non-versioned tables, it's all the same to the DBMS. Foreign keys relate rows to rows.

But if we proceed on the assumption that foreign keys in versioned tables relate rows to rows, and thus versions to versions, we can easily find ourselves dealing with the cascade update problem described in Part 10. Specifically, whenever a new version is created for a client, then all the policies whose foreign keys pointed back to the original client version must be updated to point to the new client version. If this update is a versionable update, it will create a new version of the policy. Then, any versions anywhere else across the database that pointed to the policy which just received a new version will have to be updated to point to that new version. And so on, and so on.

In Part 10, we concluded that we could not ignore this problem. The reason we can't ignore the problem is that we are looking for an enterprise solution, one that will apply to all sets of version tables. It may be that in databases that contain only a few versioned tables, and/or databases whose versioned tables change very infrequently, the performance penalty of cascading updates might be manageable. But with an enterprise solution, we cannot count on that. With an enterprise solution, some databases may be crippled by the performance demands of frequently updated version tables with lengthy cascade chains.

So, on the assumption that we are relating versions to versions, we must somehow either fix the problem or avoid it. How can we do that? As we said in Part 10, we could avoid the problem by turning off RI. But DBAs just don't like to turn off RI. That opens the door to foreign keys that don't point to anything. So, is there a way to fix the problem rather than avoid it?

There are two ways to fix the problem. One way is to make foreign key updates non-versionable updates, i.e., updates in place, updates which overwrite the foreign key but which do not produce new versions in the process. The second way is to not update foreign keys among versions. In either case, RI among versions can remain on, but cascade updates are prevented. Let's consider each solution in turn.

Solution 1. Make Foreign Key Updates Non-Versionable

Figure 1 below shows the example we introduced in Part 10. Both the Client and Policy tables are versioned tables. However, the inclusion of policy-type as a column of the Client Version table, back in Part 10, was a mistake. But it turns out to be one we will need as this analysis continues. So to distinguish it from the policy-type column on the Policy Version table, let's assume that policy-type on client versions means something like "the type of policy which the client's sales representative most recently tried to sell to the client." So let's rename it "upsell-policy-type."

Figure 1: The Client Version and Policy Version Tables

Let's assume that a new version of client C882 has just been created. This is shown in Figure 2 below. We must now update all foreign keys that pointed to [C882, 5/23/04], and make them point to [C882, 10/01/04]. And if we make this a non-versionable change, then there is no cascade problem. Instead, the result is shown in Figure 3.

Figure 2: Client C882 is Versioned

There is no cascade problem because, as Figure 3 shows, the foreign key change in the Policy Version table was made as a physical update to C882's current policy version. No new version of that policy was created.
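In SQL terms, the in-place overwrite might look something like the following sketch. The table and column names (Policy_Version, client_nbr, client_ver_beg_dt) are illustrative assumptions of ours, not schema these articles have defined:

    -- Re-point every policy version that referenced client version
    -- [C882, 5/23/04] to the new client version [C882, 10/01/04].
    -- This is an overwrite, not a versionable update: no new policy
    -- version rows are created, so nothing cascades.
    UPDATE Policy_Version
       SET client_ver_beg_dt = '2004-10-01'
     WHERE client_nbr        = 'C882'
       AND client_ver_beg_dt = '2004-05-23';

One UPDATE is the whole of the change; the price, as discussed next, is that the overwritten foreign key value is gone.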

Figure 3: Client C882's Policies are Updated

This is one way that we can keep foreign keys pointing to the current versions of the objects they reference, and at the same time avoid the cascade update problem. But notice the drawback to this solution: it loses information. The specific information that is lost is that, from 6/01/04 to 10/01/04, policy version [P138, 6/01/04] pointed to client version [C882, 5/23/04]. During that period of time, a join from that policy version to its related client version would have shown an upsell policy type of PPO. This is the information that is lost.

Why is it lost? Because we physically updated the foreign key in the policy version. We did that in order to avoid the cascade update problem. But why does the foreign key need to be updated?

The answer we get to this question, from the modelers and DBAs we have spoken to, is that joins should always join to the most current version of the object referenced. After all, with non-versioned tables, foreign keys point to the most current version – which also happens to be the only version.

In Part 12, we will continue this line of questioning, which will lead us to consider the second solution to the cascade update problem, which is: don't update foreign keys that point to versions. We will then conclude our discussion of foreign keys and versioned tables by considering the other answer to the question, "What do foreign keys relate?" – that answer being, "Objects, not versions."

Time and Time Again: Managing Time in Relational Databases, Part 12: Foreign Keys in Version Tables

Tom Johnston, Randall Weis
DM Direct, October 2007

In this article, and in Part 10 and Part 11, we have been discussing the issue of foreign keys and referential integrity in versioned tables. The issue arises because most modelers and database administrators (DBAs) seem to believe that if a change is made in a versioned table, and there are foreign keys pointing to that table, those foreign keys must also be changed. If they themselves are in a versioned table, and foreign keys in yet other versioned tables point to the rows they are in, versioned updates will ripple out to those other tables, creating a cascade update problem.

In Part 10, we decided that our options are the options we always have for any problem: we can ignore the problem, avoid the problem, or fix the problem. We also determined that we cannot ignore the problem because we are looking for an enterprise solution, one applicable to all databases. For databases containing small and/or infrequently updated versioned tables, we might be able to ignore the problem. But an enterprise solution must apply to all databases, regardless of size or frequency of update.

In Part 11, we suggested that the most perspicuous line of investigation was to begin by asking, "What do foreign keys to versioned tables relate?" One answer is that they relate rows to rows. The "related-to" rows are the rows in versioned tables. The "related-from" rows may be either versions or non-versions. On the assumption that foreign keys relate rows to rows, whether or not any of those rows are in versioned tables, we developed one solution to the versioned foreign key problem. Here in Part 12, we develop a second solution to the versioned foreign key problem, also based on the assumption that those foreign keys still relate rows to rows. We will then develop a solution based on the assumption that foreign keys that point into versioned tables do not relate rows to rows, i.e., versions to versions, but rather relate objects to objects.

We will conclude our discussion of referential integrity involving versioned tables by distinguishing between the machinery of versioning and its semantics. The machinery is complex, and should be as invisible as possible to developers and users. But the semantics are the semantics of temporal databases, and once such databases are supported, those who write queries must understand those additional semantics in order to write unambiguous queries against those databases.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this chart will help the reader recall what those prior version patterns are. It also gives a hint of version patterns not yet discussed.

Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief description of what each one is about. As this series grows longer, the back references to earlier installments will only increase, and we think that having a chart like this at hand will be helpful.

Figure 2: Chart of Installments to Date in this Series

Solution 2. Don't Update Foreign Keys to Versions

As explained in Part 10, this is the second solution to the cascade problem, given the assumption that foreign keys to versioned tables relate versions to versions, if the foreign-key containing table is also a versioned table, or else relate objects to versions, if the foreign-key containing table is not a versioned table. The first solution is to update foreign keys in versioned tables, but to do so as updates in place, updates which overwrite the foreign key but which do not produce new versions in the process. The second way, discussed now, is to not update foreign keys that point to versions.

Figure 3 shows the example we introduced in Parts 10 and 11. Both the Client and Policy tables are versioned tables.

Figure 3: The Client Version and Policy Version Tables

Let's assume that a new version of client C882 has just been created. This is shown in Figure 4. With this "don't update" solution, no further physical activity against the database is required. In particular, we do not update the foreign key that points to [C882, 5/23/04].

Figure 4: Client C882 is Versioned

Now let's insert a new policy and see what happens. The result of this insert is shown in Figure 5.

Figure 5: Client C882 Gets Another Policy

At first sight, Figure 5 is puzzling. Policy P138 points to the first version of client C882, but policy P704 points to the second version. Why is there this difference? The difference is in the appearance of things only. It is, in fact, the result of applying the following rule.

Figure 6: Rule for Managing Foreign Keys in Versioned Tables

The version of the client that was current when policy P138 was created is the first version. The version current when policy P704 was created is the second version. So with this rule, every foreign key pointing to a versioned table points to the version of the related object that was current when the foreign-key containing row was created. This rule applies both when the foreign key is a column in a versioned table, and also when it is a column in a non-versioned table.

Joins using such versioned foreign keys show what things looked like, across joins, at the moment in time that the row from which we are joining was created. Therefore, using this rule, we can always follow a chain of foreign keys out from a particular row of interest, and bring together versions of the related objects as they were when that original row was created. We can, in other words, bring back data which is equivalent to a snapshot of that data taken at the moment the original row was created.

But sometimes that isn't what we want. Sometimes, we want to join to the most current version of the related object. For example, consider a query which takes policy version [P138, 6/1/04] and joins it to the client who owns that policy. If we use the foreign key in that version, it will join to client version [C882, 5/23/04].

If this query was executed anytime between 5/23/04 and 9/30/04, it returned an upsell policy type of PPO. And during that period of time, this is the correct upsell policy type. But suppose this same query was executed on or after 10/1/04. In that case, the query returned an incorrect result, because from 10/1/04 forward, the current upsell policy type was HMO.

What are we to make of this? First of all, it is clear that this way of managing foreign keys in versioned tables will always produce correct results provided that, implicit in every join, is the assumption that joined-to versions will be the versions current when the joined-from row was created. But if this assumption is not valid, and in particular when what is desired is joins that return the then-current versions of all joined-to objects, then using the foreign keys in versioned tables to make those joins is not guaranteed to return correct results.

How, then, can we write these other queries, the ones that want the current version of all joined-to versioned objects? The way we can do this is to write a query which specifies the joined-to object identifier – client number in this case – together with a clause that specifies the greatest version begin date not greater than the current date, and containing a {null} object end date. The "not greater than the current date" qualifier is needed to filter out future versions, versions that are not yet in effect at the time the query is executed. The "containing a {null} object end date" qualifier is needed to filter out an otherwise qualifying version that is for an object that has been deleted.

Something else that we must be aware of is that even though foreign keys to versioned objects point to rows current at the time the containing row is created, this does not guarantee that using these foreign keys will always return versions current when the joined-from row was first created.

Here's how this apparent anomaly can happen. Suppose that we also have an Employer table, which is also versioned. This table contains the employers of our clients. Now suppose we have a query which selects a policy, and then joins first to the client version current when it was created, and from there to the employer version current when that client version was created.
This query may return employer data that was not current when that policy version was created. Figure 7 illustrates this situation.

Figure 7: "Current When" Joins Do Not Always Retrieve Contemporary Versions

This join will return [E721, 11/15/01], even though the version of employer E721 that was current on 6/1/04 was [E721, 5/15/04]. Supposing that the third column is employer data retrieved in the query, that query will show "abc" as the "current" employer value, whereas "xyz" was the value current when the policy version was created. These situations can arise because "current when" is not a transitive relationship. This is illustrated in Figure 8.

Figure 8: “Current When” is not a Transitive Relationship
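To make the two join semantics concrete, here is a sketch of both kinds of query. The table and column names (Policy_Version, Client_Version, client_nbr, client_ver_beg_dt, obj_end_dt, upsell_policy_type) are illustrative assumptions of ours, not schema these articles have defined:

    -- 1. "Current when created": follow the stored foreign key. This
    --    returns the client version that was current when the policy
    --    version row was inserted.
    SELECT p.policy_nbr, c.upsell_policy_type
      FROM Policy_Version p
      JOIN Client_Version c
        ON  c.client_nbr        = p.client_nbr
        AND c.client_ver_beg_dt = p.client_ver_beg_dt;

    -- 2. "Current now": ignore the stored version begin date, and
    --    locate the version with the greatest begin date not greater
    --    than the current date, for an object not yet deleted.
    SELECT p.policy_nbr, c.upsell_policy_type
      FROM Policy_Version p
      JOIN Client_Version c
        ON  c.client_nbr = p.client_nbr
     WHERE c.obj_end_dt IS NULL
       AND c.client_ver_beg_dt =
           (SELECT MAX(c2.client_ver_beg_dt)
              FROM Client_Version c2
             WHERE c2.client_nbr        = p.client_nbr
               AND c2.client_ver_beg_dt <= CURRENT_DATE
               AND c2.obj_end_dt        IS NULL);

The second query is what a query author must write whenever the implicit "current when created" assumption behind the stored foreign key is not the semantics he wants.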

Assumption 2. Foreign Keys in Versioned Tables Relate Objects to Objects

The first answer to the question, "What do foreign keys in versioned tables relate?" was that they relate rows to rows. On that assumption, we found that there were two solutions to the cascade update problem.

The second answer to that question is, "Foreign keys in versioned tables relate objects, not versions." But we have to understand what this means since, in a physical database sense, it is obvious that foreign keys relate rows to rows. And in the case of our examples, since both the Policy and Client tables are versioned tables, foreign keys relating these two tables must relate rows to rows – in other words, versions to versions.

But in a semantic sense, our ability to relate versions to versions, given current RDBMS technology, is limited. If we don't update foreign keys when the versions they point to are superseded, then the relationships implemented by those foreign keys slowly "age." On the other hand, if we do update (overwrite) foreign keys when the versions they point to are superseded, then the relationships established when those rows were first created are no longer stored in the database. Our only other option is to create a new version each time a foreign key changes. But that option creates the cascade update problem we are trying to avoid.

On Assumption 2, we may as well store only the object identifier component of a foreign key to a versioned table. In our example, we may as well store only client number, and not bother with version begin date. If we want to join a policy to the client version current when the policy was created, the WHERE clause will ask for the version of that client whose begin date is the greatest date less than or equal to the begin date of the policy version, and whose object end date is {null}. On the other hand, if we want to join a policy version to the most current version of the related client, the WHERE clause will ask for the version of that client whose begin date is the greatest date less than or equal to [now], and whose object end date is {null}.

One would hope, of course, that a versioned client would not be version-deleted (given a non-null object end date) if it still had at least one non-deleted versioned object related to it. If we were dealing with non-versioned objects, the DBMS would guarantee this by either carrying out a cascade delete, or by denying the delete until all dependent objects were first deleted. But with versioned objects, the DBMS cannot guarantee this, because such temporal semantics are not part of any current RDBMS. So since we must guarantee it ourselves, there is always a chance of error. To guard against that error, the WHERE clause we are discussing must also specify that the client version whose begin date is the greatest date less than or equal to [now] has an object end date of {null}.
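As a sketch of what this looks like in practice – again with illustrative names only – a join to the client version current when the policy version was created, using nothing but the object identifier as the stored foreign key, might read:

    -- The foreign key stores only client_nbr; the right version is
    -- located by date arithmetic. policy_ver_beg_dt stands in for
    -- "when the policy version was created."
    SELECT p.policy_nbr, c.upsell_policy_type
      FROM Policy_Version p
      JOIN Client_Version c
        ON  c.client_nbr = p.client_nbr
     WHERE c.obj_end_dt IS NULL
       AND c.client_ver_beg_dt =
           (SELECT MAX(c2.client_ver_beg_dt)
              FROM Client_Version c2
             WHERE c2.client_nbr        = p.client_nbr
               AND c2.client_ver_beg_dt <= p.policy_ver_beg_dt
               AND c2.obj_end_dt        IS NULL);

Substituting [now] (e.g., CURRENT_DATE) for p.policy_ver_beg_dt turns this into the join to the most current version of the related client.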

Hiding the Complex Semantics of Time

We have already mentioned that managing time in relational databases is perhaps the most difficult challenge we have ever encountered as analysts, modelers and DBAs. This discussion of the semantics of versioned foreign keys certainly emphasizes the point.

It is clear that the mechanisms which manage time in relational databases must be hidden from developers and users alike. One reason is the complexity of those mechanisms. Another reason is that the management of time is not an application- or subject-area-specific requirement, any more than the management of dynamic balanced trees is an application- or subject-area-specific requirement.

The optimal solution will be one provided by vendors and specified by standards committees. The mechanics may differ from one vendor to another, as long as the syntax and semantics of both the SQL DML and SQL DDL are supported. But a SQL with temporal data management capabilities must first be proposed and then accepted by the standards committees. The proposal was made almost a decade ago, but acceptance seems no closer today than it was back then.

So in the meantime, we must provide our own encapsulation of this complexity, our own enterprise-level solution to temporal database management. That is what we are doing in this series of articles.

Being Aware of the Complex Semantics of Time

There is one thing about temporal database management that cannot be hidden, however. It is the fact that with such databases, new temporal semantics are available. This means that anyone who writes queries must understand these semantics, and code his queries accordingly. For example, a query asking for a join of a policy version to a client version raises the question, "Which client version?" A query asking for a join of a policy to a client raises no such question.

If the query writer does not understand the possibilities, and therefore does not code for them, then either the DBMS will tell him that his query is ambiguous, or it will execute the query using whatever default behavior has been defined. In the former case, the query writer needs to understand what the error message means, and then write a new query that makes his temporal assumptions explicit. In the latter case, the query writer had better understand what the issues are, and what the default behavior is, or else he may ask one question with his query, but receive an answer to a different question in the result set returned by the DBMS.

The basic point is this. The mechanics of temporal database management can and should be as invisible to developers and other query authors as possible. But it is still necessary for any query author to understand temporal semantics, because he cannot unambiguously say what he means unless he does understand those semantics. No matter how elegant and expressive temporal SQL will prove to be, it cannot relieve query authors of the obligation to understand the options involved, and to be careful to code queries which say exactly what they mean.

The Versioned Table Foreign Key Position, Going Forward

In Part 13, we will take up Version Pattern 6, the bi-temporal versioning of recurring objects. When we do, we will handle foreign keys to versioned tables by capturing the full foreign key of the version current as of the time the row in question is inserted – both object identifier and version begin date. This will be done whether the row is being inserted into a versioned or a non-versioned table. The issue is the table pointed to, and what to do when that table is versioned. Thus, in terms of our example, we will continue to record both client number (the object identifier) and version begin date as a foreign key in the Policy table.

Time and Time Again: Managing Time in Relational Databases, Part 13 - A Roadmap of the Rest of This Series

Tom Johnston, Randall Weis
DM Direct, November 2007

We are nearly ready to discuss the penultimate pattern, Version Pattern 6. This pattern expresses the combined semantics of all the version patterns we have discussed so far. The last versioning requirement we will examine adds the requirement to version error corrections, and it is Version Pattern 7 that will accommodate that requirement, as well as all the patterns previous to it. Version Pattern 7 is therefore the "ultimate" versioning pattern to be discussed in these articles.

It is clear that if Pattern 6 can express the semantics of Patterns 4 and 5, it can express the semantics of all prior patterns. And indeed it can.

Like Pattern 4, it distinguishes business effectivity dates from database activity dates, making it possible, for example, to create new versions in advance of the date on which they will become effective.

Like Pattern 5, it manages versions which may have temporal gaps between them, making it possible, for example, to record the reappearance of an object after its absence for some period of time.

But before we discuss Pattern 6, there are two preliminary matters to address. The first is to provide a roadmap to the rest of the articles in this series. The second is to introduce the use of a special set of surrogate keys for versioned tables, and to explain why they are needed and how they will be used.

The rest of this article lays out that roadmap. Then, Parts 14 and 15 make the case for using surrogate keys, and describe the special considerations needed to make surrogate keys work with versioned tables.

Chart of Version Patterns

Before continuing, we insert here the current chart of version patterns. References back to earlier patterns are happening pretty frequently now, and we hope this chart will help the reader recall what those prior version patterns are. It also gives a hint of version patterns not yet discussed.

Figure 1: Chart of Version Patterns

Chart of This Series

Before continuing, we insert here a chart listing the installments to date in this series, and a brief description of what each one is about. As this series grows longer, the back references to earlier installments will only increase, and we think that having a chart like this at hand will be helpful.

Figure 2: Chart of Installments to Date in this Series

Roadmap for the Remainder of the Series

The high-level view of the roadmap to the rest of this series is as follows. Items are listed in the sequence in which they will be discussed.

1. Versioning with surrogate keys.
2. The physical implementation of Version Patterns 1 – 5.
3. Version Pattern 6. Semantics and implementation.
4. Managing recurring objects.
5. Version Pattern 7. Semantics and implementation.
6. Star schema versioning.
7. Snapshots.
8. Sandbox versioning.
9. Hierarchical versioning.
10. Performance considerations.
11. Migrating to an enterprise versioning solution.
12. Notes towards a standards-based and vendor-implemented solution.

Let’s briefly consider each of these items.

Versioning Using Surrogate Keys

Surrogate keys are primary keys whose values are system-created and free of any business content. Although Dr. E. F. Codd's Information Principle recommends the use of business data rather than surrogate keys as primary keys, there are both theoretical and practical reasons for using surrogates as primary keys.

The distinction we have already made, between objects and their versions, is best represented, when surrogate keys are used, by a pair of such keys. One of them we will call an "object identifier" ("object id," abbreviated as OID), and the other a "version identifier" ("version id," abbreviated as VID). These matters will be the focus of the Roadmap Item #1 discussion.

The Physical Implementation of Version Patterns 1 – 5

In earlier articles, we noted that each of the first five version patterns might be an appropriate solution in a specific versioning situation. Those articles discussed the semantics which each pattern supports, and described each pattern in what we will call logical data model (LDM) terms. With Roadmap Item #2, we will provide the SQL DDL and DML that define the corresponding physical database schemas, and that provide access to them. We will call this the physical data model (PDM) representation of these patterns.

This clears the deck by completing our discussion of the first five patterns. It's time, then, to move on to more advanced patterns.

Version Pattern 6: Semantics and Implementation

In semantic terms, this pattern is the logical union of the semantics of all preceding patterns. After demonstrating that this is so, we go on to a PDM discussion, which demonstrates how to actually build tables that conform to this pattern. Also, with this pattern, we switch from business keys to surrogate keys. This will be the focus of the Roadmap Item #3 discussion.

Managing Recurring Objects

A recurring object is anything that persists over time, is recognized by us, goes away, and later on comes back. A policy that has lapsed, but then is renewed, is a recurring object.

But the most important kind of recurring object, for businesses, is the customer – a person or organization who purchases goods or services from us. The principal reason that recognizing returning customers is so important is human psychology. Returning customers want to be greeted like old friends. They don't want the standard "We're happy to welcome you as a customer of ours" greeting. They want a "Welcome! It's great to have you back!" greeting.

Another reason it is important to recognize returning customers is so we can use the valuable knowledge we have gained about them from their previous history with us, particularly the knowledge of what kinds of products or services they tend to purchase from us. If we're a company really on its toes, we may even have captured the information about why they left us in the first place. All this information is far too valuable to lose just because the IT department tells upper management that "the match problem is difficult"!

From the point of view of temporal database management, however, it doesn't matter whether the returning/recurring object is a person or a policy, a customer or a competitor. The point is that when a new row is inserted into a versioned table, it is inserted as the next version of an object for which we already have prior versions, even though there is a gap in time between when the object was terminated and when this new version is inserted. These gaps mean that we can recognize logical groupings of versions, one group for each continuous set of versions, with gaps distinguishing the groups. These groups are what we called "episodes" back in Part 5.

We already know how to record the reappearance of a returning object in a versioned table. But there are two related issues, one of which we will discuss, and one of which we will not. The issue we will not discuss is how to recognize a returning object as an object we have seen (and recorded in our databases) previously. In a non-temporal form, this is identical to the issue of how to recognize that data from two different input streams, very often with different and incomplete identifying data, are indeed for the same object. The issue, in either its temporal or its non-temporal form, is "match logic," and it differs so much from business to business, and object type to object type, that nothing useful could be said about the general match logic problem in much less than a book-length treatment.

The issue we will discuss is what to do when we handle reappearance incorrectly. The more general issue is what to do when our match logic makes the wrong choice. The first wrong choice is to say that we have a recurring object (a match to an object already in our databases) when in fact we do not. The second wrong choice is to say that we do not have a recurring object when in fact we do. The third wrong choice is to say (correctly) that we have a match, but to identify the wrong object to match it with.

In the first case, once we identify our mistake, we have to split the incorrect match, separating data identified as being for one object into two sets of data, one for each of the objects we now recognize as distinct. In the second case, once we identify our mistake, we have to merge the data for the two objects into data for a single object, either letting one of the two objects "survive" the merge, or else creating a new third object to merge the first two into. In the third case, we have to do both.

This is not a versioning problem, but it is a problem that inevitably arises when we are required to (try to) correctly recognize recurring objects. When that requirement is combined with the requirement to version the objects in question, we get a situation in which error correction becomes a very complex problem. So the relevance of this issue to versioning is that if both recurring objects and error correction are requirements, the problem of correcting match logic mistakes becomes in part a versioning problem.
This complex problem is the subject matter of Roadmap Item #4. But most of the analysis of this problem is relevant whether or not the target tables being affected are versioned. So because the requirement to recognize recurring objects is being imposed on more and more business application systems, and because the work required to correct the inevitable match mistakes is so complex, we think it is warranted for us to branch off from our core topic of versioning, and consider the related topic of correcting object matching mistakes.

Version Pattern 7: Semantics and Implementation

With Roadmap Item #5, we present Version Pattern 7, our "ultimate" version pattern. Because of its importance, we will provide a more thorough and detailed discussion of it than we have of the prior version patterns.

In particular, we will provide over a dozen usage scenarios for this pattern, where by "usage scenario" we mean something like a use case, except that a workflow of separate processing steps is not part of a usage scenario. Our usage scenarios focus on those events in which one or more versioned tables are either updated from source transactions, or queried with SQL.

For each usage scenario, we will discuss both the updates required to get the data into the database, and the queries required to get that data back out. An LDM-level discussion will be followed by the physical schemas, transformation code and SQL which constitute the PDM discussion. Specifically, we will cover the following points:

Sample tables, in their logical form, as already used with the earlier patterns. In addition, sample physical tables, and the DDL used to generate them. All sample tables will contain instance data sufficient to illustrate the scenario being presented.

Timeline diagrams for each scenario. These are the same diagrams that have been used to explain earlier version patterns.

The semantics of each update scenario. These semantics will be clearly stated. In addition, the transforms from source transaction to versioned table updates will be documented, either as code or as "logically tight English." These transforms provide one aspect of the encapsulation of versioning shown in Figure 2 of Part 1.

The semantics of view creation. This will include SQL view creation statements, and their instance-level results. These views provide the complementary aspect of the encapsulation of versioning shown in Figure 2 of Part 1.

The semantics of temporal queries against these views. The objective here is to maximize the simplicity of the query, while keeping all the temporal semantics available for the query author.

Star Schema Versioning

Star schemas are the most common data model for fact/dimension databases, often referred to as "data marts." But versioning doesn't apply to fact tables, because fact tables are about events, and events don't change. Events happen, and then they are over with. Versioning tracks change over time, and only objects which persist over time can change over time. They change by participating in events. In the Roadmap Item #6 articles, we will discuss the semantics – indeed, the ontology – of events and persistent objects, in order to understand this important point more clearly.

However, the dimension tables of fact/dimension databases do represent persistent objects. Indeed, they represent the very persistent objects whose properties or relationships are altered by their participation in the events recorded in their fact tables. Because dimensions represent objects that change over time, versioning is applicable to them. The Roadmap Item #6 articles will provide both a logical and a physical description of the use of versioning in fact/dimension databases.

Snapshots

In Part 1, we provided a complete taxonomy of ways in which historical data is maintained in business databases. In particular, we distinguished snapshots from versioning. Although this series has focused on versioning, snapshots also have their place in the management of temporal data. Roadmap Item #7 discusses snapshots and their uses.

Sandbox Versioning

Often, application systems need to accumulate updates over a period of time, and across several physical units of work (UOW) that represent a single logical business transaction. Not until all updates are applied, and the results of the complete set of updates validated, are those results made visible to database end users. In other words, the updates are accumulated from several different UOWs, but are then assembled into a single logically atomic transaction, i.e., one in which the updates either all succeed or all fail, and in which the state of the database while the updates are being applied is not visible to end users.

One common way of implementing the requirement to create a single, multi-transaction atomic update out of updates from several distinct UOWs is to copy the database schema and store the results of the updates in this copy. Once all updates have been applied, and all results stored in the copy database, those results are then validated and either approved, or re-applied until the results are all correct. At that point, all the data is moved from the tables in the database copy to the corresponding tables in the live database.

This way of implementing a multiple-UOW to single atomic update requirement is often costly. Redundant objects and processes may be needed to validate the data stored in the copy database. In addition, it may require significant amounts of system resources to marshal this data into and out of the copy database.

Sandbox versioning will be discussed when we reach Roadmap Item #8. We will explain how the requirement we have described can be met without the need for a copy database, and also without the need for intra-database copies of the updated tables for these pending updates. Instead, everything is done within the original tables themselves. When the results have been validated and approved, they become visible to the database user almost instantly, regardless of how many rows across how many tables have participated in this "mass update." Sandbox versioning functionality also provides the basis for the "check-out" and "check-in" capabilities of hierarchical versioning, described in the next section. A sketch of the general idea follows.
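Purely as an illustration of that general idea – the ver_status and sandbox_id columns are hypothetical, and the actual design will be presented in the Roadmap Item #8 articles – pending versions might be hidden inside the original table, and then exposed, like this:

    -- Pending versions are inserted with status 'P' and grouped by a
    -- sandbox identifier that spans many physical units of work.
    INSERT INTO Client_Version
           (client_nbr, client_ver_beg_dt, ver_status, sandbox_id)
    VALUES ('C882', '2004-10-01', 'P', 1017);

    -- End users query through a view that filters out pending work.
    CREATE VIEW Client_Version_Visible AS
    SELECT client_nbr, client_ver_beg_dt
      FROM Client_Version
     WHERE ver_status = 'A';

    -- Approval is one small update per table: the entire accumulated
    -- set of changes becomes visible at once.
    UPDATE Client_Version
       SET ver_status = 'A'
     WHERE sandbox_id = 1017;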

Hierarchical Versioning

The versioning we have been considering up to this point we may call "linear versioning." Except for error corrections, versions of the same object follow one after the other, with or without intervening gaps, but never overlapping. They thus constitute a linear sequence of versions, each the truth, and the only truth, about the object during the version's effectivity period of time.

But businesses also have a need for hierarchical versioning. As the name suggests, this is versioning in which one or more versions are both a) "child" versions of the same "parent" version, and b) "valid," in some sense, in concurrent and/or overlapping periods of time. IT professionals deal with this kind of versioning when they use software that allows them to "check out" and "check in" copies of a base object. Roadmap Item #9 discusses hierarchical versioning.

Performance Considerations

Other things being equal, a versioned table of an object (i.e., an entity type) will not perform as well as a non-versioned table of the same object. Several factors account for this, including the following:

Versioning combines current and non-current versions of an object in the same physical table, and in general, the non-current versions will account for most of the rows and most of the size of the table.

As explained in Parts 10 – 12, temporal joins are not directly supported in current RDBMSs.

Necessarily, dates are used to carry much of the semantics of temporal distinctions among versions. In general, this means that to retrieve a version current as of a specific date, the versioned table must be searched first for the object, and then for the version possessing the unique pair of effectivity dates that frame the specific date.

Roadmap Item #10 presents some real-world production database metrics for updating and querying versioned tables. The articles also include some "tips and tricks" for performance tuning, based on several years' experience we have had in implementing and supporting relational databases which contain versioned tables.
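As a sketch of the kind of search involved – assuming effectivity columns named ver_beg_dt and ver_end_dt, with {null} meaning "until further notice" and the end date treated as exclusive – retrieving the version of client C882 current as of 7/15/04 might look like this:

    -- Find the one version of the object whose effectivity dates
    -- frame the date of interest.
    SELECT *
      FROM Client_Version
     WHERE client_nbr = 'C882'
       AND ver_beg_dt <= DATE '2004-07-15'
       AND (ver_end_dt > DATE '2004-07-15' OR ver_end_dt IS NULL);

Every such retrieval is a two-step search – the object first, then the pair of dates framing the date of interest – which is one reason why indexing and tuning loom larger for versioned tables than for their non-versioned counterparts.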

Migrating to an Enterprise Versioning Solution

Starting from scratch is easy. It's like bringing a ship into dry dock for major repairs. But in business IT, we seldom have that luxury. Usually, we are required to carry out major repairs while the ship is still at sea. The articles in this series, up through Roadmap Item #10, will serve you well if you are able to design one or more versioned databases from scratch, before they begin to be populated with production data. Otherwise, you will have to face the challenge of introducing versioning into production databases.

From a theoretical point of view, this is a special case of the "schema change" problem, one in which the schema changes are from non-versioned to versioned schemata. (See the reference at the end of this article.) From a practical point of view, the challenge is even more difficult than that, and the reason is that the schema changes cannot be done all at once, against a database that has been taken off-line, schema-changed, instance-updated, and then brought back on-line. That would be the dry dock approach to schema change.

Our challenge, in the world of production databases, is to carry out schema changes while still "at sea." The Roadmap Item #11 articles will present our recommendations for implementing schema changes in already productionalized databases.

Notes Toward a Standards-Based and Vendor-Implemented Solution

We lack a vendor implementation of anything more than elementary functionality for temporal database management. The reason is that the SQL Standards Committee, and the British committee in particular, rejected the original temporal database proposal made several years ago by Dr. Richard Snodgrass.

We are not aware that any progress has been made towards resolution of this issue by the standards committees. For those interested in what the differences are, each side has written a book explaining its position. These are the two books referenced in Part 1, one by Dr. Richard Snodgrass and the other by Chris Date, Hugh Darwen and Nikos Lorentzos.

With Roadmap Item #12, we will attempt to find common ground between the two positions, and also describe implementation issues which we believe should be taken into consideration when designing this standard.

Roadmap Wrap-Up

In summary, we anticipate that a sufficiently detailed coverage of these twelve topics will require thirty to thirty-five more articles. At two articles a month, that means that this series will continue for approximately another fifteen to eighteen months.

As always, the authors welcome questions and comments from readers. This input can help us identify issues we have not discussed clearly enough, or objections we have not considered at all, and we can then clarify the issue, or respond to the objection. As always, we can be reached via the indicated email addresses, and also at our respective websites.

Note: like many authors in this field, we have begun to use the word "semantics" quite frequently. And, like most of them, we haven't bothered to define the term before using it. Of course, one way to learn a new term is "by osmosis," as it were: to simply see it used in enough sentences that one gets a "feel" for how to use it. This, in fact, is how we all learn nearly all the words we use. Googling is a way to accelerate this kind of learning. But some words are important enough and specialized enough that we should supplement this bootstrap approach to learning by studying a definition and the discussion that accompanies it. To that end, one of us (Johnston) has recently added just such a definition with discussion to his website, MindfulData.com.

Reference:

1. E. Rahm and P. A. Bernstein. "An Online Bibliography on Schema Evolution." SIGMOD Record 35(4), December 2006, 30–31.

Time and Time Again: Managing Time in Relational Databases, Part 14 - Versioning with Surrogate Keys

Randall Weis, Tom Johnston
DM Direct, November 2007

Beginning with Version Pattern 6, we will use surrogate version keys in all versioned tables. But note that the phrase "surrogate version keys" should be read as "surrogate keys for versions," and not as "keys for surrogate versions." We are recommending the use of surrogate keys for versioned tables. We are not introducing a concept of "surrogate versions."

Neither the chart of version patterns nor the chart of this series is included in this article. The reason is that this discussion of surrogate keys is relatively self-contained and makes few references to earlier articles in this series.

We begin by defining our terms.

Surrogate and Other Kinds of Keys: Definitions

Primary Key. The column or columns of a table in a relational database that the database management system (DBMS) recognizes as a unique identifier for that table, and as the column or columns whose values will be duplicated in foreign keys in order to create relationships.

Comment: The second half of the definition, the part following the "and," is necessary. The reason is that a DBMS is able to recognize and accommodate more than one unique identifier for a table. Any column or columns over which a unique index is defined is as much a unique identifier for a table as is the primary key. Once one of those columns or groups of columns is designated the primary key, the DBMS knows that their values will appear in foreign keys whenever a relationship to their rows is created.

Foreign Key. The column or columns of a table in a relational database that the DBMS uses to establish a relationship between that table and another (not necessarily distinct) table. At the instance level, foreign keys relate pairs of rows such that each foreign key value is identical to the primary key value of a row in the referenced table.

Surrogate Key. A primary key whose values are system-generated and are either hidden from business users or exposed with the understanding that no meaning other than their function as unique identifiers should be assumed true of them.

Comment: The second half of the definition, the part following the "and," is necessary. The reason is that system-generated primary key values may be exposed to business users. A typical example is system-generated document numbers for invoices, purchase orders, receipts, etc. The exposure does not mean that those primary keys are no longer surrogate keys. But, if exposed, such keys must be exposed to business users with the clear understanding that their only role is to identify, and that no other meaning can reliably be given to them.

Because this has turned out to be an excellent example with which to illustrate the concepts of meaning and semantics, the next section will be devoted to elaborating on it. This is an aside to our main theme of introducing the use of surrogate keys with versioned tables, but we think it is an important one.

Business Key. A unique identifier for a table, all of whose columns express business meaning.

Natural Key. A primary key which is a business key.

Semantics and Meaning, Illustrated With a Surrogate Key Example

Speaking of surrogate keys, we said above that "... if exposed, such keys must be exposed to business users with the clear understanding that their only role is to identify, and that no other meaning can reliably be given to them."

The word "reliably" is essential here, because sometimes additional meaning is imputed to surrogate keys by business users, whether we IT folks like it or not. For example, perhaps many users of Acme's Policy table know that, following the "P" in its values, the three-digit number is a number that was sequentially assigned by the system. Knowing that, these business users may then use that knowledge to conclude, for example, that of two policies, the one with the higher numeric part in the policy-nbr was added to the database later than the other one.

In this way, such business users are imputing meaning, over and above the meaning of the key as a unique identifier, to the values that key takes on. The additional meaning, or semantics, is that the surrogate key also chronologically sequences the events in which the clients or policies were added to Acme's tables. Those key values "tell us" the sequence in which rows of that table were created, and it is in their "telling us that" that their imputed semantics consists.

While IT cannot stop users from reading these semantics into these keys, we can and must explicitly state that such imputed semantics are not reliable. We must tell the users that if they continue to do this, they do so at their own risk. Let's see, now, what kind of trouble such users can get themselves into.

Acme's Policy Management System initially did start out by creating primary key values using a sequence number for the last three characters in the key. Although many business users knew this, what they did not know is that policy numbers, starting about a year ago, were being reused. Whenever an Acme policy was deleted, a six-month waiting period was observed (to allow all references to the deleted policy to be "flushed out" of the on-line database), after which the policy number was used as the unique identifier of a different policy. Obviously, the reuse of policy numbers makes them useless for telling which of two policies was created before the other one. This illustrates the risks of relying on undocumented semantics.

As semantically risky as such reuse of identifiers is, most of us who have been around for a while have seen this kind of thing happen. In our case, Acme apparently thought that it would never have to manage more than a thousand policies. A very modest way to start out, but also a little shortsighted. And this shortsightedness is the cause of this semantically risky reuse of policy numbers. Acme's IT department is desperately trying to put off the evil day when they will have to change from four-byte to five-byte policy numbers.

It doesn't take much reflection to realize that the changeover will be very "evil," i.e., very costly and complicated. Existing policy numbers are everywhere – in primary keys, in foreign keys, across several Acme databases, and in the databases of customers and partners. Those numbers are "in flight" on transactions created but yet to be applied to Acme's databases. This can't be cleaned up in one fell swoop, on one night between midnight and 6 a.m. while all the data is unavailable for access. It can't be cleaned up, in other words, in one multi-transaction atomic unit of work. While the changeover from a four-byte to a five-byte policy number is taking place, semantic disintegrity will be exposed to business users. Joins will return only partial result sets. Code and SQL will have to be rewritten to use "link tables" and, after the changeover is complete, rewritten again to not use those link tables.

It reminds me of what Everett Dirksen, the senior senator from Illinois who helped to write the Civil Rights Act of 1964, said about how the government spends money: "A billion here, a billion there, and pretty soon you're talking about real money." Well, when it comes to changing the syntax of the primary keys of critical enterprise tables, it may not be a billion here and a billion there. But it can often be "a million here and a million there."

The Client Version and Policy Version Tables – Without Surrogate Keys

Figure 1 shows our Client Version and Policy Version tables, as we have seen them through the first fourteen articles in this series. In these tables, the primary key is a natural key, not a surrogate key, because the same columns that play the role of primary key also play the role of business key.

Client-nbr and Policy-nbr. Although we have identified client-nbr and policy-nbr as surrogate keys, each is really a hybrid. We said that the numeric part was surrogate because, even if it had the semantics of chronological sequence, those semantics were an accident of how the numbers were created and could not be relied on. But we tacitly accepted the semantics of the alphabetic portion, in which "C" designated a client key and "P" designated a policy key.

So because one part of those keys has business meaning, let's think of them as natural keys. In other words, let's think of them as not being surrogate keys. (This notion of a primary key column having "parts" brings up the old issue of "intelligent keys." But intelligent keys have no relevance to the issues being discussed here. However, for an extensive discussion of intelligent keys, see Johnston's two articles on "Primary Key Re-Engineering Projects" (Part 1, Part 2).)

Client-ver-beg-dt and Policy-ver-beg-dt. These primary key columns have business meaning, because they designate business effective begin dates in both cases. Therefore, [client-nbr + client-ver-beg-dt] and [policy-nbr + policy-ver-beg-dt] are business keys for versions of clients and policies, respectively.

Figure 1: The Client Version and Policy Version Tables with Natural Keys

The Client Version and Policy Version Tables – With Surrogate Keys

Figure 2 shows our examples of a Client Version and Policy Version table, with surrogates used as primary keys. The first thing to notice is that these tables still contain their original business keys. Business keys are what provide semantic entity integrity. Primary keys provide physical entity integrity. The distinction between semantic and physical entity integrity arises, of course, only when we substitute surrogate for natural keys, and thus separate what the DBMS recognizes as the primary key from what the business recognizes as the combination of data elements that properly plays the role of distinguishing each instance of the entity from every other instance.

This is a complex topic, and we will not say anything more about it in these articles. However, at the DAMA 2008 Conference, Johnston's presentation will discuss this issue in greater detail.

Client-nbr, client-ver-beg-dt, policy-nbr and policy-ver-beg-dt have the same semantics they have had all along, except that their role as primary keys has been removed. In the primary key role, we have provided two new columns for each table: an object identifier ("oid") and a version identifier ("vid").

We have also "stacked" a few more columns, again so that the picture of these tables could fit comfortably on a page. So although the picture of each table suggests that there are six columns apiece, in reality the Client Version table has nine columns and the Policy Version table has ten columns. Or, if you prefer, the graphic columns are six and six, respectively, while the database columns are nine and ten, respectively.

The Client Version and Policy Version tables, with surrogate keys, are shown in Figure 2.

Figure 2: The Client Version and Policy Version Tables with Surrogate Keys

For both oids and vids, we have used five-digit numbers. But the important things are that a) these numbers carry no business meaning whatsoever, and b) they play their roles as unique identifiers. Their semantics as unique identifiers are, at a minimum:

No two policies may have the same policy-oid.
All versions of the same policy must have the same policy-oid.
No two versions of the same policy may have the same policy-vid.

And the same for clients.

For reasons described in Johnston's "Primary Key Re-Engineering Project" articles, we would prefer additional semantics for oids and vids, namely:

No two objects, of whatever type (policies and clients, here), may have the same oid.
No two versions, of the same or of different objects, may have the same vid.
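To make Figure 2 concrete, here is a minimal DDL sketch of what the two tables might look like under this design. It assumes SQL Server syntax, and the column names and data types are our illustrative assumptions, not a reproduction of the figure.

-- A sketch, assuming hypothetical names and types; Figure 2 may differ.
Create table client_version (
    client_oid        int         not null,  -- object identifier (surrogate)
    client_vid        int         not null,  -- version identifier (surrogate)
    client_nbr        varchar(10) not null,  -- business key, retained
    client_ver_beg_dt datetime    not null,  -- business effective begin date
    client_name       varchar(50) not null,
    constraint pk_client_version primary key (client_oid, client_vid)
);

Create table policy_version (
    policy_oid        int         not null,
    policy_vid        int         not null,
    policy_nbr        varchar(10) not null,
    policy_ver_beg_dt datetime    not null,
    client_oid        int         not null,  -- points to the client object; not
                                             -- declarable as an ordinary foreign
                                             -- key, since client_oid alone is not
                                             -- unique in client_version
    policy_typ        char(10)    not null,
    constraint pk_policy_version primary key (policy_oid, policy_vid)
);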

One final point before we conclude Part 14. Surrogate keys are usually single-column primary keys, but our versioned tables have two-column primary keys. It isn't necessary to use two columns – an oid and a vid – as the primary key of versioned tables. But it is very useful to have a system-generated key for an object, with an additional, distinct system-generated key for each version of an object. What isn't necessary is that these two system-generated columns be used as the primary key of a versioned table. Instead, we could have a typical surrogate key, i.e., one consisting of a single column.

Why do we need a system-generated oid and also a system-generated vid? It is not for the sake of semantics. As far as semantics is concerned, we could use a single-column surrogate key for versioned tables, while retaining the client-nbr (or policy-nbr) and associated version begin date. No oid, no vid.

Recall that client-nbr and policy-nbr were our somewhat simplified examples of business keys. It could just as easily have been that policies had a four-column business key while clients had a three-column business key. Quite possibly, no two of those columns would have the same data type and length.

Why is that significant? It is significant because, all along, we have had a single, enterprise-wide solution to versioning as our objective. Not one solution for one database and a slightly different solution for another database. While we may be able to retrieve data from versioned tables with SQL alone, it is not possible, as we will see, to update data in versioned tables with SQL alone. For updating versioned tables, therefore, there is a codebase. Our enterprise objective, if we are to take it seriously, requires that one and the same codebase be deployed for all versioning solutions.

If an enterprise versioning solution is used in, let us say, six databases across the enterprise, then there may be six physical copies of that codebase. An enterprise solution does not require federated access to a single physical codebase instance. But it does require that all instances of that codebase be line-for-line, byte-for-byte identical. The only alternative is to develop a custom versioning solution for each database that requires versioned tables. While that may be the most common way of implementing versioning in an enterprise, it is not our objective.

The Versioned Table Surrogate Key Position, Going Forward

The concept of surrogate keys is a contentious one. Publications arguing both for and against their use go back at least twenty years. Arguments based on both theoretical and practical concerns can be made for using surrogate keys, and also for not using them.

One of us (Johnston) has been an advocate of using surrogate keys for all kinds of tables, in all kinds of databases. This means for dependent and associative tables, not just for the "kernel" tables on which those tables depend or which they relate. It means for tables in OLTP and data mart databases, not just in ODS and data warehouse databases. We will refer to this as the "strong surrogate key" position, to distinguish it from positions which recommend surrogate keys for some kinds of databases but not for others, or for some kinds of tables but not for others.

In Part 15, we will provide an outline of the major arguments for and against the strong surrogate key position. This will not be an impartial review of arguments on both sides. It will be a distillation of the arguments Johnston has made in half a dozen articles going back to 2000, and also a preview of the presentation he will be making at the DAMA International Conference next March in San Diego.

Time and Time Again: Managing Time in Relational Databases, Part 15 - For and Against Surrogate Keys

Tom Johnston, Randall Weis
DM Direct, December 2007

In Part 14, we introduced the concept of surrogate keys and showed how our example database of clients and policies would look if we used surrogate keys with that database. We also introduced some terminology that will help us examine the arguments for and against their use, terms such as "business key," "natural key" and "mixed key." We then went on to define two sets of semantic constraints that give surrogate keys their meaning. The minimal set consisted of three rules for the Client table and a corresponding three rules for the Policy table. The full set consisted of an additional two rules, both of which applied to both tables.

We also designated the approach to surrogate keys that we will use in the remainder of these articles as the "strong approach." The approach we will contrast it with we will designate the "standard approach," for such it is. We begin, then, by describing the standard approach and our reasons for dismissing it.

Neither the chart of version patterns nor the chart of this series is included in this article. The reason is that this discussion of surrogate keys is relatively self-contained and makes few references to earlier articles in this series.

Surrogate and Other Kinds of Keys: Definitions

Primary Key. The column or columns of a table in a relational database that the database management system (DBMS) recognizes as the unique identifier column or columns for that table.

Foreign Key. One or more columns of a table in a relational database used by the DBMS to establish a relationship between that table and another (not necessarily distinct) table. At the instance level, foreign keys relate pairs of rows such that each foreign key value is identical to a primary key value in the referenced table.

Surrogate Key. A primary key whose values are system-generated, and which therefore carry no business meaning. Surrogate keys are usually single-column keys.

Mixed Key. A multi-column primary key, some of whose columns contain system-generated values and others of whose columns contain business-meaningful values.

Business Key. A unique identifier for a table, all of whose values carry business meaning.

Natural Key. A primary key that is a business key.

Surrogate Keys: The Standard Approach

When IT practitioners speak of a database that uses surrogate keys, they are usually referring to a database some of whose tables have single-column surrogate primary keys, while other tables have multi-column primary keys whose values may or may not be system-assigned. Thus, they might use a single-column surrogate key for a Purchase Order table but create a mixed primary key for a Purchase Order Line Item table, one consisting of a foreign key to the Purchase Order table together with a line number. For an associative table matching up purchase orders to receipts, they would typically follow the general practice for associative tables and create a primary key by concatenating the primary keys of the Purchase Order and Receipt tables.

Let's call this the standard approach to using surrogate keys. What might justify the standard approach as the best way to assign primary keys to the tables in a database? It would have to be something like this. Whether or not surrogate keys are preferable to business keys is something that should be determined for each table. Now consider a database all of whose tables have natural keys. Among those tables, there is a subset whose primary keys do not contain any foreign keys, i.e., which do not use "identifying relationships." Call them "kernel tables." Assign a single-column surrogate key to these kernel tables.

That surrogate key will then become part of an identifying relationship in all of the other tables to which each kernel table is related. There are two types of such other tables: dependent tables and associative tables. Dependent tables have primary keys which contain one surrogate foreign key. Associative tables have primary keys that contain two surrogate foreign keys.

Our objection to this standard approach is that it makes the decision to use or not to use surrogate keys at the level of individual tables. The result is a database in which some tables use surrogate keys, other tables use mixed keys, and yet other tables may use natural keys.

We have seen various rules of thumb offered for deciding whether to use a surrogate, mixed or natural key for a particular table. For example, some database modelers/designers will always use a natural key for reference/code tables. Their argument is that, besides the value being referenced, all that such tables contain are things like a short and a long description. Since for most purposes it is just the code that is wanted in a result set, using that code as the natural key of the reference table means that a join to that table is not required in order to get the code value. ICD-9 codes on a medical claim are a good example. If the ICD-9 code itself is the foreign key, the claim with its list of ICD-9 codes can be materialized without needing to join back to the ICD-9 reference table.

We prefer to make the decision to use or not use surrogate keys at the database level. For all the tables in the database, it's all or nothing: either all of them use surrogate keys, or none of them do. If single-column surrogate keys are being used, then the primary keys of all tables contain exactly one column. This is true of associative and dependent tables, of core tables and code tables. All use surrogate keys.

Moreover, on our strong approach to surrogate keys, no surrogate key value will appear in more than one primary key instance anywhere in the database. A primary key value is unique, in other words, across all tables in the database, not merely within one table.
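As an illustration of the difference, here is a hedged DDL sketch of the purchase order example under each approach. The table and column names are hypothetical; they do not come from the article.

-- Standard approach: a surrogate key on the kernel table, a mixed key on the
-- dependent table.
Create table purchase_order (
    po_oid int      not null primary key,  -- single-column surrogate
    po_nbr char(10) not null               -- business key, retained
);
Create table po_line_item (
    po_oid   int not null references purchase_order (po_oid),
    line_nbr int not null,                 -- business-meaningful line number
    constraint pk_po_line_item primary key (po_oid, line_nbr)  -- mixed key
);

-- Strong approach: the dependent table also gets its own single-column
-- surrogate, and the identifying relationship becomes an ordinary (non-key)
-- foreign key.
Create table po_line_item_strong (
    line_oid int not null primary key,     -- single-column surrogate
    po_oid   int not null references purchase_order (po_oid),
    line_nbr int not null
);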

Argument Against the Standard Approach to Surrogate Keys

Our argument against the standard approach to using surrogate keys is simplicity itself. The standard approach leaves us with a mixture of primary and foreign keys to manage, keys that share no consistent syntax and across which duplicate values are possible.

The argument for using surrogate keys rather than natural keys is simple: natural keys can change, and that change can be expensive. As for change, the values in natural keys describe something about what the rows they are part of represent; that follows from the definition of "natural key." Suppose one natural key is a two-byte column whose values represent geographical sales regions: NE, NW, SE and SW. Now suppose the VP of Sales, after extensive analysis and reflection, of course, decides to change the sales regions as follows:

California is too big to be lumped in with the rest of the SW, so it will become its own sales region, with "CA" as its code.
The NE and SE regions are also getting too big, but we don't want to carve out one or more states to fix their problem. Instead, we will split them each into two regions, with designations as follows: NE1, NE2, SE1 and SE2.

We invite you to reflect on the cost of such changes. Next, think about the cost of these changes if IT had, instead of using natural keys, used a surrogate key for this Sales Region table. The sketch below suggests the difference.
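The sketch assumes two hypothetical schemas, a sales_region table and a single referencing customer table; in a real database, the natural-key change would be repeated for every table carrying the region code.

-- Natural-key schema: region_cd is the primary key of sales_region and is
-- copied into every referencing foreign key. Re-coding NE must therefore
-- cascade (via on update cascade, or a careful insert/update/delete sequence)
-- through every referencing table:
Update sales_region set region_cd = 'NE1' where region_cd = 'NE';
Update customer     set region_cd = 'NE1' where region_cd = 'NE';
-- ...and likewise for every other table that carries region_cd.

-- Surrogate-key schema: region_cd is an ordinary attribute, so the same
-- re-coding is one single-row update, and no foreign key anywhere changes:
Update sales_region set region_cd = 'NE1' where region_oid = 10007;

Either way, moving customers into a newly carved-out region requires reassigning those rows; what the surrogate key eliminates is the cascading change to key values.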

Argument 1 Against Surrogate Keys: Users Don't Understand Them

Taken in one sense, this argument means that users don't know what surrogate key values stand for. But of course, that's the whole point. The system generates values for surrogate keys, and those values have no business meaning. No wonder users don't understand them.

Taken in another sense, this argument means that users don't understand the need for surrogate keys, that they are uncomfortable when they see surrogate keys in their result sets, and so on.

The need for surrogate keys can be explained to business users in two steps. First, primary keys are needed to uniquely designate each row in each table. Second, the business may occasionally need to require IT to change the values and/or the data types and lengths of primary keys, in order to make the data being used as a unique identifier more accurately describe what it is about. And when those changes are required, the costs of implementing them can be enormous. We saw this above, in the argument against the standard approach to surrogate keys.

Argument 2 Against Surrogate Keys: Users Don't Use Them

This argument is that, because surrogate keys do not contain business data, users will never select rows from tables by supplying surrogate key values. This is certainly true, but it doesn't support the conclusion that users don't use them. On the contrary, surrogate keys, together with the foreign keys that replicate their values, are used to join rows to other rows.

If a user wants information about a particular policy, for example, we may assume that he knows what the business key of that policy is. His query then begins with a Select operation on the Policy table, with policy-nbr as the target and a specific policy number as the search argument. Since business keys should be unique, this process should select exactly one row (provided the supplied policy number is valid).

Next, logically although not temporally speaking, the query proceeds to the Join operation, in which one or more other rows, foreign key-related to the original row, are also selected.

Finally, again speaking logically and not temporally, the query proceeds to the Project operation, in which some columns are designated for inclusion in the result set while the others drop out. The entire query, at this point, asks a question of the database. It says, "Do there exist one or more instances of data that satisfy this specification?" If so, the resulting instances are returned, each one as a row in a result set.

There is generally no good reason to include surrogate keys or their associated foreign keys in the result set; indeed, there is generally no good reason for users ever to see a surrogate key value. For even if a user writes his own SQL and uses surrogate keys and foreign keys in his joins, he does not supply a surrogate key value anywhere in the query and is not interested in a surrogate key result. Primary and surrogate key join pairs, like two adjacent links in a chain, are purely DBMS machinery for linking data together.
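To illustrate, here is a hedged query against the oid-keyed tables sketched in Part 14 (hypothetical names): the user supplies only a business key, the join runs entirely on surrogate values, and no surrogate appears in the result.

-- Select by business key, join on the surrogate pair, project business data.
Select c.client_name,
       p.policy_nbr,
       p.policy_typ
from   policy_version p
join   client_version c
       on c.client_oid = p.client_oid  -- the surrogate join pair: pure machinery
where  p.policy_nbr = 'P138';          -- the only value the user supplies
-- (Version-selection predicates are omitted here for brevity.)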

Argument 3 Against Surrogate Keys: They Take Up Extra Space

This argument points out that if we used a table's business key as its primary key, we wouldn't need an extra surrogate key column. Say that the surrogate key is eight bytes long, and it occurs in a million-row Client table. That's eight million unnecessary bytes, approximately 7.5MB extra for the entire table.

While direct access storage device (DASD) space is relatively cheap, the argument continues, response time to a user's query is not. And the eight extra bytes on each row also contribute to poorer response time, because fewer rows will fit into a single I/O block; fewer rows at a time can be transferred between secondary storage and main memory.

In response, we need to distinguish between the impact of a surrogate key on the table it is the primary key for and its impact across the entire database. In its native table, the Client table in this case, adding a surrogate key will indeed add extra space to the storage requirements for the table, and will also add some amount of extra time when doing sequential retrievals, i.e., retrievals that require moving I/O block after I/O block from DASD to main storage.

But let's consider all the tables in the database, and nonsequential retrievals. Across all the tables in a database, the use of surrogate keys may well reduce the total requirement for storage space. For consider our eight-byte, single-column surrogate key, and compare it to a natural key of one or more columns, which may well be longer than eight bytes.

Next, suppose that the Client table is an important table in the specific sense that many other tables have foreign keys to the Client table, and for some of them those foreign keys are part of their own primary keys, i.e., those foreign keys implement identifying relationships. Finally, suppose that for some of these other tables with identifying relationships to the Client table, there are further tables that have foreign keys to them.

Throughout this chain of direct and also indirect foreign keys to the Client table, and depending, of course, on the number of other rows which instantiate these foreign keys, we may have dozens or even hundreds of occurrences of Client table primary key values.

Let us assume that, at some point in time, there are 5,000 such Client foreign keys. And let us assume, further, that the business key for the Client table would have been two columns totaling twelve bytes. The arithmetic here is simple. With the twelve-byte business key, those foreign keys total 60KB (5,000 x 12 bytes); with the eight-byte surrogate key, they total 40KB (5,000 x 8 bytes).

Now consider a database in which important tables like our Client table contain millions of rows. And consider, then, the trade-off in space consumed between such a database in which surrogate keys were used for all tables and one in which business keys were used for all tables. There is indeed a significant difference in storage requirements between using and not using surrogate keys: surrogate keys are much more space-efficient.

Now let's briefly consider the argument that retrievals will take longer with surrogate keys because fewer rows will fit into an I/O block. Although there are several things to note, we will take the time to mention only one of them. This argument applies only when sequential retrievals are being considered. When a business key value is in hand, and about to be used to select a client, then after a lookup in the unique index defined on the business key, only a single I/O block will be retrieved. As for joins, once a primary key value is in hand, and about to be used in a join, then after the lookup in the index, only a single I/O block will be retrieved. Any negative impact on retrieval time caused by using surrogate keys is negligible to the point of being undetectable.

Argument 4 Against Surrogate Keys: They Require Extra Joins

This argument points out that whenever a surrogate primary key is used, all relationships to a row with such a key use foreign keys whose values are identical to that row's surrogate key value. So whenever a query has specified a row which has such a foreign key, and that query wants to include all or part of the business key of the related row, a join is required to bring that business key into the result set.

This argument is correct. However, it is possible to avoid the joins that the use of surrogate keys would otherwise make necessary. The solution is to denormalize: put duplicate columns on the referencing table, columns that duplicate the business key of the row that the surrogate foreign key is pointing to.
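A hedged sketch of that denormalization, continuing with the hypothetical tables from Part 14: the referencing table carries a redundant copy of the parent's business key, so the common query needs no join.

-- policy_version duplicates the business key of the client row that its
-- client_oid points to:
Alter table policy_version add client_nbr varchar(10) not null default '';

-- The common query now reaches the client's business key without a join:
Select policy_nbr, policy_typ, client_nbr
from   policy_version
where  policy_nbr = 'P138';

The cost is the usual one for denormalization: the duplicated column must be kept in sync with the parent row, typically by the same codebase that performs the updates.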

The Argument for Surrogate Keys

One of us (Johnston) will present a more complete case for using surrogate keys on all tables in a database at the DAMA 2008 Conference in San Diego in March 2008. But in brief, the reasons we will proceed to discuss versioning in terms of sample tables that use surrogate keys are these:

1. We want an enterprise solution to versioning, one applicable to any table and any database without modifying the code that implements it. That requires that all versioned tables have the same syntax for their primary keys, i.e., the same columns, each with the same data types, lengths and domains.

2. We want to avoid the usually very high costs of changing primary keys, by adopting keys that there will never be a business reason to change. Such keys are precisely those that carry no business meaning and whose only semantics are to link related rows together.

A Physical Description of Earlier Version Patterns

In Part 16, we begin to revisit our earlier version patterns and provide a physical description of them. This includes physical schemas and SQL, and shows how these version patterns work in a real database. For a longer-term look at where this series is going, consult Part 13.

Time and Time Again: Managing Time in Relational Databases Part 16 - Physical Implementation of Version Pattern 1

Tom Johnston, Randall Weis
DM Direct, December 21, 2007

Starting with this installment in our series, we will begin to describe the physical implementation of these versioning patterns. To this point, the patterns have been presented from a logical perspective. Data modelers would recognize those discussions as being about logical data models rather than physical data models. Almost equivalently, we could say that previous articles in this series have discussed the semantics of versioning rather than its implementation.

We now turn to the corresponding physical data models and to their implementation in a real database. Throughout the rest of this series, we will refer to articles that focus on implementation as "PDM (physical data model) segments" of this series, where each segment is a separate article. We will reference each pattern by version number, and we will use the illustrations and the scenarios of the version pattern within the context of each of these PDM segments.

These PDM segments will show the physical Data Definition Language (DDL) used to create the tables in the schema, the Structured Query Language (SQL) used to populate and update those tables, and the SQL used to query the tables. We will also discuss performance and data integrity in the physical database. In addition, we will describe and discuss known issues, database management system (DBMS) limitations, and alternatives to the implementations that are depicted here.

We encourage you to create, populate and query these objects in your own database. By doing so, you will deepen your appreciation of the temporal functionality provided by the early version patterns, and of the more complete functionality provided by the later patterns. In addition, you may learn to develop some of your own variations on these patterns, variations that might better apply in your own DBMS and application environment.

Preliminaries

We will be using Microsoft SQL Server as the DBMS. However, with some minor adjustments, the SQL and the designs suggested in these PDM segments could be used in other DBMSs, such as IBM's DB2, Oracle or MySQL. Various database objects will be created, including tables, views, indexes, primary keys, foreign keys, triggers and stored procedures. However, we will assume that the reader will create other needed objects, such as databases, containers, file groups and tablespaces, and will have the appropriate permissions to create and use these objects.

In SQL Server, the schema owner is often defined as "dbo." We will not prefix the objects with "dbo," but that will be the implicit schema name. However, in many DBMSs the default schema name is the same as the current SQL ID or login ID of the person connected to the database. This should suffice for most of these examples.

Some DBMSs will implicitly create certain database objects based on the creation of other objects. For example, some DBMSs will automatically create indexes based on the creation of primary keys and foreign keys, but others will not. In many of our examples, we will include the DDL to create objects that might be implicitly created by other DBMSs. Thus, you may encounter some minor syntax or duplicate object errors when declaring these objects to your DBMS. In many cases, you will be able to ignore these errors. We suggest that you confirm that the error does not compromise the model, and that the fundamental purpose of the object, function or statement is fulfilled in some other way.

The SQL we present will often use literals for the variables rather than parameter markers (host variables). This is for illustrative purposes only. Normally, for systems that require high performance, applications should use either statically bound SQL or prepared statements that use parameter markers in dynamic SQL, such as JDBC, rather than dynamic literals. This strategy saves a significant amount of overhead by avoiding constant reoptimization of the SQL's access path. So, while we will use literals in our examples, this is not the recommended approach and is used only to clarify the examples; see the sketch below.

Note also that some companies have standards that prohibit the use of "select *". We will use it in our examples simply to save textual space.
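For instance, here is a minimal sketch of the same query in both styles; the parameter-marker form is shown as it might be prepared through JDBC or a similar interface.

-- Literal form, as used in these articles for clarity; each distinct literal
-- may force the DBMS to reoptimize the statement:
Select * from p1_policy where policy_nbr = 138;

-- Parameter-marker form, as recommended for high-performance systems; the
-- access path is determined once and reused across executions:
Select * from p1_policy where policy_nbr = ?;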

The Application

The insurance industry has extensive date/time requirements for its application databases. These include:

The date range over which an insurance policy is effective,
The date an event occurred,
The date a service was performed,
The date a claim was adjudicated, and
The date a claim was paid.

Often the dates on which a policy or a policy-related event is recorded in a database do not match the effectivity range of the policy or the date on which the related event actually occurred. This distinction is one of the reasons we chose to use an insurance application in our examples. However, many other industries have complex date-driven rules and business processes. In fact, most large companies need to adhere to best practices or to various regulations, such as Sarbanes-Oxley, that require appropriate accounting and tracking of data changes. Therefore, we believe that while your application may not appear to need advanced version patterns now, it is likely that, in time, it will.

So, as you read these physical model segments and try out the examples, you should be able to easily translate these PDM entities, attributes, objects and examples into your own industry's terms.

Keys

In non-versioned data models, and also in some of the earlier patterns in this series, the primary key of the table is also considered the unique identifier of the object. For example, either a business key or a surrogate key of the Policy object might be called policy_nbr, and it would be the primary key in the physical database. However, as we progress through the more advanced versioning patterns, we will discover that the object identifier alone will not uniquely identify a single occurrence (row) in a table, because multiple versions of an object may (and usually will) exist in a versioned table. This restriction in current relational databases – that the distinction between an object and its versions must be made by the developer and is not built in to either standard or vendor-augmented DDL – forces us to create a new surrogate key that represents the composite of an object id and either a version date/time or a version id, and that composite now becomes the unique primary key.

Therefore, as you review these patterns, consider the fact that the object id will not always be the primary key, regardless of whether it is a business key or a surrogate. So, to help you distinguish the different columns that would normally represent a single object and the primary key in non-versioned tables, these columns (keys) will be suffixed with "OID," meaning object id, and surrogate columns that distinguish versions of an object will be suffixed with "VID," meaning version id.

PDM Structure

These PDM segment articles will be structured as follows:

Reference to the version pattern as defined in the earlier articles.
A brief description of the pattern.
A table facsimile with sample rows.
The DDL to create the base table(s).
Supporting DDL, such as constraint, index or view creation.
Special design notes and considerations.
A copy of the scenario figure used in the original version pattern.
The Data Manipulation Language (DML) to update the database for each event.
SQL to query the results at various points in time.
The results of the queries in tabular format.

Figures 1 and 2: Templates that Will be Used (All figures are oversized and linked at the end ofthe article) (See pdf).

Setup

The initial patterns will concentrate on the Policy table. We will retain the patterns' versions in separate tables so we can compare results. Therefore, each Policy table will be prefixed with "P#" to identify the pattern number. We will also, as needed, reference a Client table. The following DDL and SQL can be executed to create the base Client table.

Figure 3: Client Table Setup (see pdf)

Version Pattern 1

Now let's see how Version Pattern 1 will look in a physical database.

Figure 4: Version Pattern 1 (see pdf)

Version 1 Scenario 1.1

Figure 5: Scenario 1.1 (see pdf)

Version 1 Scenario 1.2

Figure 6: Scenario 1.2 (see pdf)

Version 1 Scenario 1.3

Figure 7: Scenario 1.3 (see pdf)

These scenarios show the physical data model for the first version pattern, which has no history or future-dating requirements whatsoever, as well as the insert, update, delete and retrieve SQL against that physical model. This is the easiest version pattern to update and query but, by the same token, the least expressive of temporal requirements. The date a row was created, and the date it was last updated, provide the only persisted temporal information about the data.

Relational databases and SQL standards have not yet evolved to easily support the various temporal states. So, this series will continue to show you how to implement feature-rich version patterns in your database, patterns that could also become part of your enterprise architecture. These physical design segments will take this concept from theory to actual implementation. It's about time!

Figure 1

Version: Pattern Number and Name

Description: Brief Description of Pattern

Table Facsimile
my_oid (pk)  parent_oid (fk)  attr_n_val  crt_dt     last_updt_dt
138          882              PPO         1/12/2004  1/12/2004

Table Creation DDL
Create table my_table (
    my_oid       int not null,
    parent_oid   int null,
    attr_n_val   char(10) not null,
    crt_dt       datetime default getdate() not null,
    last_updt_dt datetime default getdate() not null
);

Index DDL
Create unique index ix_pk_my_table on my_table (my_oid);
Create clustered index ix_fk_parent_child on my_table (parent_oid);

Primary Key
Alter table my_table add constraint pk_my_table primary key (my_oid);

Foreign Keys
Alter table my_table add constraint fk_parent_child foreign key (parent_oid)
    references parent (parent_oid) on delete restrict;

Notes: Special design notes and options

Object/Version PDM Pattern Template

Figure 2

Scenario: Version Pattern Number Scenario Number and Name

Description: Brief Description of Scenario

Scenario Figure

Insert SQL
Insert into my_table (my_oid, parent_oid, attr_n_val, crt_dt)
values (138, 882, 'PPO', '2004-01-12');

Query SQL
Select * from my_table where my_oid = 138;

Notes: Explanation of how the SQL works and any special notes regarding the SQL and its criteria. Hints, tips, options.

Scenario Template

Query Results
my_oid  parent_oid  attr_n_val  crt_dt     last_updt_dt
138     882         PPO         1/12/2004  1/12/2004

Figure 3

Client Table Setup
Create table client (
    client_nbr   int not null,
    client_name  varchar(50) not null,
    crt_dt       datetime default getdate() not null,
    last_updt_dt datetime default getdate() not null
);

Alter table client add constraint pk_client primary key (client_nbr);

Insert into client (client_nbr, client_name, crt_dt)
values (882, 'Mary Client', '2004-01-19');

Figure 4

Version Pattern 1: Updating in Place

Description: No versioning with updates occurring in place, and the end of an object’s episode is represented by the deletion of the object.

Table Facsimile
policy_nbr (pk)  client_nbr (fk)  policy_typ  crt_dt      last_updt_dt
138              882              PPO         2004-01-12  2004-01-12

Table Creation DDL
Create table p1_policy (
    policy_nbr   int not null,
    client_nbr   int null,
    policy_typ   char(10) not null,
    crt_dt       datetime default getdate() not null,
    last_updt_dt datetime default getdate() not null
);

Index DDL
Create unique nonclustered index ix_pk_p1_policy on p1_policy (policy_nbr);
Create clustered index ix_fk_p1_policy_client on p1_policy (client_nbr);

Primary Key
Alter table p1_policy add constraint pk_p1_policy primary key (policy_nbr);

Foreign Keys
Alter table p1_policy add constraint fk_p1_policy_client foreign key (client_nbr)
    references client (client_nbr) on delete no action on update no action;

Notes: We clustered this table by the foreign key because access is normally by the foreign key. The delete/update rule is set to "no action," which is similar to "restrict"; however, cascade could also be used. Remember, though, that there is no versioning in this pattern.

Object/Version PDM Pattern

Figure 5

Scenario 1.1 of Version Pattern 1: an Insert to the Policy Table

Description: Shows the result of entering Mary's policy into the Policy table.

Scenario Figure

Insert SQL
Insert into p1_policy (policy_nbr, client_nbr, policy_typ, crt_dt, last_updt_dt)
values (138, 882, 'PPO', '2004-01-12', '2004-01-12');

Query SQL
Select * from p1_policy where policy_nbr = 138;

Notes: Add and look up Mary’s Policy.

Scenario

Query Results
policy_nbr  client_nbr  policy_typ  crt_dt      last_updt_dt
138         882         PPO         2004-01-12  2004-01-12

Figure 6

Scenario 1.2 of Version Pattern 1: an Update on the Policy Table

Description: On 3/14/05, Mary changes her policy type from PPO to HMO (Health Maintenance Organization). After that change is recorded in the Policy table, Mary's Policy row looks like this:

Scenario Figure

Update SQL
Update p1_policy
set    policy_typ = 'HMO', last_updt_dt = '2005-03-14'
where  policy_nbr = 138;

Query SQL
Select * from p1_policy where policy_nbr = 138;

Notes: Update Mary's Policy in place. We have lost all record that Mary originally had a PPO policy.

Scenario

Query Results
policy_nbr  client_nbr  policy_typ  crt_dt      last_updt_dt
138         882         HMO         2004-01-12  2005-03-14

Figure 7

Scenario 1.3 of Version Pattern 1: a Physical Delete to the Policy Table

Description: On 6/1/06, Mary's policy is ended (by contract expiration, explicit termination, revocation, etc.; it doesn't matter). This is recorded, using Version Pattern 1, by physically deleting her policy's row in the Policy table.

Scenario Figure

Delete SQL
Delete from p1_policy where policy_nbr = 138;

Query SQL
Select * from p1_policy where policy_nbr = 138;

Notes: Mary's Policy is deleted. We have lost all record that Mary ever had a PPO or HMO policy, or indeed any policy whatsoever.

Scenario

Query Results
policy_nbr  client_nbr  policy_typ  crt_dt  last_updt_dt
[no row]

Time and Time Again: Managing Time in Relational Databases, Part 17 - Physical Implementation of Version Pattern 2

Tom Johnston, Randall Weis
DM Direct, January 4, 2008

In this PDM segment, we will describe the physical implementation of Version Pattern 2. This pattern was described, from a logical point of view, in Part 3 of this series. This pattern expands on Pattern 1 by adding a logical delete capability.

This pattern is more semantically expressive than Version Pattern 1, just as each pattern described in this series is more expressive than the patterns prior to it. This pattern allows us to keep a record of deleted rows, and thus information about when they were first inserted, last updated and, finally, deleted. A delete request which is carried out as a physical delete loses all this information, removing all evidence in the online table that the row ever existed.

To support logical deletes, we need an additional column: either some kind of delete indicator or else a delete date. A delete indicator would be set to "N" when the row is first inserted, and set to "Y" if and when the row is logically deleted. A delete date would be set to {null} when the row is first inserted, and set to the date of the logical deletion if and when the row is logically deleted.

Remember, when we reference a column as a "date" in this series, we would normally consider a "date/timestamp" to be more accurate and to provide more flexibility as to when we can change the state of a row. We use dates in our examples to help keep them simple.

We prefer to use a delete date because it provides more information than a delete indicator. If a delete indicator is used, we must decide either to leave the last update date alone, or to put the delete date in that column. If we choose the first option, we have lost the delete date. If we choose the second option, we have overwritten the date of the last update. But if we use a delete date instead of a delete indicator, we can preserve both dates. In addition, as we will show later, it becomes easier and more efficient to query when looking for rows that were active at a given point in the past. Also discussed later in this article is the option of using a special date value rather than {null}.

A variant of Version Pattern 2 would be to add a "delete reason" column. Together with a delete date, this would tell us both when and why rows were logically deleted. For example, a row in a Customer table might have been logically deleted because of no recent orders, bad credit, etc. And, of course, the combination of insert date and delete date tells us how long the row was logically present in the table. (This may not correspond to how long the customer was active, however. It will correspond only if the business chooses to treat row insertion and deletion dates as also representing customer effectivity dates.)

With rows logically rather than physically deleted, it is also possible to maintain referential integrity for rows in dependent tables. Otherwise, a physical deletion would force us either to set foreign keys to {null} or to cascade-delete those dependent rows. In certain types of queries, data from a logically deleted parent row might frequently be usefully joined to nondeleted child rows. This would provide data about the parent row related to those child rows, a relationship that was in effect between the later of the insert dates for the parent and the child rows and the delete date of the parent row.

Foreign Key Constraints

Most popular database management systems (DBMSs) allow a "Set Null" delete option. With this option, if there are one or more rows in one or more dependent tables that have a foreign key to a row being deleted, then those foreign keys are set to {null}. As just mentioned, while this approach retains the dependent rows, it loses the link to the parent row. So a logical delete allows us to retain more information than a physical delete which uses the Set Null option.

These same DBMSs also provide a "Cascade" delete option. With this option, if there are one or more rows in one or more dependent tables that have a foreign key to a row being deleted, then those dependent rows are also deleted. If other rows have foreign keys to these dependent rows, then those other rows will also be deleted. Cascade delete options are usually avoided in financial and other mission-critical applications. A Cascade delete retains less information than a Set Null delete against the same set of rows, because a Cascade delete removes the dependent rows while a Set Null delete retains the rows and merely sets their foreign keys to {null}. So a logical delete also allows us to retain more information than a physical delete with the Cascade option.

The third common delete option is called "No Action" or "Restrict." No Action and Restrict are nearly identical; in DBMSs that support both, the only difference is when the constraint is enforced – before or after other constraints. This constraint is used as a safety valve to prevent the inadvertent deletion of dependents when a parent row is physically deleted, and the inadvertent deletion of a parent row that has dependents. Therefore, in this physical implementation of Version Pattern 2, we will use the "Delete No Action" option, because no rows should ever be physically deleted; if someone tries to physically delete a row, this option will prevent the deletion if that row has dependent rows.

This will not, however, prevent the physical deletion of rows that do not have dependents. So, if we really want to prevent the physical deletion of rows in a table, the safest approach would be to revoke the delete permission on the table from the users, or not to grant delete authority in the first place. This will force the user to update the logical delete indicator (or logical delete date) in order to perform a business deletion.

A logical delete date or indicator column can help maintain referential integrity if we want to inactivate a parent table row while retaining the dependents' relationship. For example, we may want to inactivate a customer but retain both the orders they placed and the foreign key relationship. This is helpful for several reasons, including the ability to find the customer name and address of an order that was shipped last year, even though the customer is no longer active.

This approach also gives us the ability to reactivate the customer if conditions change – without having to create a new customer entity occurrence – thus retaining the customer's previous history. But if we do this, we will lose the information that this customer was logically deleted and inactive for a period of time before its reactivation. To handle reactivation without loss of information requires a more advanced version pattern.

Figure 1 shows how Version Pattern 2 will look in a physical database.

Figure 1: Version Pattern 2 (see PDF below)
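Since Figure 1 itself is available only in the PDF, here is a minimal sketch of how such a table might be declared, assuming the naming conventions of the Pattern 1 DDL; the actual figure may differ.

-- A sketch, assuming SQL Server syntax and the p1_policy conventions; del_dt
-- is the logical delete date, {null} until the row is logically deleted.
Create table p2_policy (
    policy_nbr   int not null,
    client_nbr   int null,
    policy_typ   char(10) not null,
    crt_dt       datetime default getdate() not null,
    last_updt_dt datetime default getdate() not null,
    del_dt       datetime null
);
Alter table p2_policy add constraint pk_p2_policy primary key (policy_nbr);
Alter table p2_policy add constraint fk_p2_policy_client foreign key (client_nbr)
    references client (client_nbr) on delete no action on update no action;

-- A business deletion is now an update, not a delete:
Update p2_policy set del_dt = '2006-06-01' where policy_nbr = 138;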

Figure 2: Version 2 Setup (see PDF below)

Starting with this PDM segment, we will include a setup section that shows the preparation of the data and the chronology of SQL events leading up to the scenarios.

Figure 3: Version 2 Scenario 2.1 (see PDF below)

Figure 4: Version 2 Follow-Up (see PDF below)

We will also perform some follow-up SQL that shows various ways the data can be queried.

Two Approaches to a Delete Date

In the physical implementation of our remaining versioning patterns, we will not use {null} in delete dates. Instead, we will use the special value "12/31/9999." Two reasons to use this special value are to simplify queries and to improve performance.

For example, if a delete date is inclusive and nullable, the SQL that looks for rows active at a given point in time may be written as:

… AND (crt_dt <= '01/01/2005' AND (del_dt >= '01/01/2005' OR del_dt is null))

…using the table facsimile in Figure 5.

Figure 5: Version 2 Follow-Up (see PDF below)

This query will return policies 138 and 139, because they were both active on 01/01/2005. However, if we use "12/31/9999" for the delete date, and it is inclusive, the SQL may be written simply as:

… AND (:my-date between crt_dt AND del_dt)

or, another way:

… AND (crt_dt <= :my-date AND del_dt >= :my-date)

where :my-date is '01/01/2005', using the table facsimile in Figure 6.

Figure 6: (see PDF below)

This query will also return policies 138 and 139, because they were both active on 01/01/2005.

Using "12/31/9999" instead of {null}, the SQL is clearly simpler to write. In addition, performance will be better, because we have removed the "OR" from the optimization criteria. Many DBMSs do not optimize ORs very well, and will often alter the access path to something less efficient compared to similar SQL without the OR. Using "12/31/9999," we may also decide to put an index on the delete date, usually combined with some other columns, such as client_nbr.
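Putting the fragments together, a complete query under the "12/31/9999" convention might look like the following sketch, written against the hypothetical p2_policy table above; the actual follow-up SQL is in the PDF figures.

-- Find all policies active on a given date. del_dt holds '9999-12-31' for rows
-- not yet logically deleted, so no OR and no null test is needed:
Select policy_nbr, client_nbr, policy_typ
from   p2_policy
where  '2005-01-01' between crt_dt and del_dt;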

Wrap-Up

This article has shown how to implement Version Pattern 2, which builds on Version Pattern 1 by providing a logical delete to use in place of a physical delete. However, updates using this version pattern still lose a great deal of information. This pattern does not retain previous versions of an object, nor can it distinguish between reactivated objects and those which were never deleted. Also, we have only discussed one dimension of time: the system dates of the row's creation and deletion. Thus, Version Pattern 2 does not manage bi-temporality. Its dates are dates of database activity. Business-meaningful dates – specifically, the dates on which a version of an object became effective and ceased being in effect – are not explicitly represented in Version Pattern 2.

In the next several articles, we will proceed to demonstrate how to physically model the more advanced semantics of the higher version patterns. We will also present the SQL needed to maintain and query these tables when these more advanced patterns are used.

Time is on our side!

Time and Time Again: Managing Time in Relational Databases, Part 18 - Physical Implementation of Version Pattern 3

Tom Johnston, Randall Weis
DM Direct, January 18, 2008

In this article, we will describe the physical implementation of Version Pattern 3. This pattern expands on Patterns 1 and 2 by retaining a history of updates. It was logically described in Part 4 and Part 5 of this series.

Pattern 2 provided the ability to query objects that were previously deleted. However, it failed to save the history created by updates. In Pattern 3, we will start saving this update history by using versioning, and will lay the groundwork for the more robust history patterns that follow.

Version Tables

One way to keep track of updates is to use history tables that are nearly clones of the object table and are populated by using update triggers. There are pros and cons to this approach, which we will discuss later. Pattern 3 takes an approach in which all of the versions (current and past) are stored in the same table. A policy was the object in the Policy table in Patterns 1 and 2. Now the table will contain multiple versions of the same object. Therefore, to mark the distinction, the table that contains the versions of a policy will be called the "Policy Version" table.

While you might assume that having a Policy Version table would also require a Policy (object) table, this is not so. We will populate and query only the Policy Version table for our current and previous versions of an object. Some organizations might demand both tables for various reasons, but both are not required to support the semantics of this pattern. There are benefits to having only one (version) table to update, manage and query, so we will describe this pattern with that design decision.

Version Primary Key

Having versions of the same object in the same table requires a change to the primary key of the table. Policy number (policy_nbr) was the primary key in Patterns 1 and 2. However, policy number alone is no longer a valid primary key now that we are storing versions in the same table. We need something else as part of the primary key to distinguish multiple versions of the same policy. An alternative might be to replace policy number with a surrogate key. But since surrogate keys were not introduced as part of these earlier patterns, we will use a composite key made up of the policy_nbr and something else to define the version.

Figure 1: Correlated Subselect for Max (see PDF)

That "something else" could be a sequence number (version number), but for our examples we will use the version beginning (system activity) date (ver_beg_dt). Remember, we use dates in our examples to simplify them, but we would really use a date/timestamp to allow for multiple events per day. The primary key will be policy_nbr and ver_beg_dt.

The current version is identified as the row for the policy that has the highest (most recent) version begin date. So, to find the most current version, you need to find the highest (MAX) date for the policy and then use that date to qualify ver_beg_dt to get the single current row. This can be accomplished by using a correlated subselect, as shown in Figure 1. Either way, this approach has negative performance implications in applications that read these rows. We have seen this performance problem in implementations of this pattern, and it is often not discovered until there is a larger quantity of history and a load is placed on the application. This issue will be addressed and overcome in later patterns in this series, so stay tuned.

Returning a version of an object "as of" a certain point in time, while possible, is a bit more complex. You need to find the specific row whose begin date (ver_beg_dt) is the latest one on or before the input date; it is not a simple "between" predicate, because a row carries no explicit end date. This kind of query is shown in Figure 2. Hedged sketches of both queries follow below.

Figure 2: "As Of" (see PDF)

The SQL in the scenarios following the physical data model (PDM) design will help demonstrate how to retrieve the various versions using an input parameter representing the as-of date.

Remember, this pattern works with only a single dimension of time: the time at which the event is recorded in the computer system. We assume for this pattern that this date is both the system activity and the business effectivity date. The later patterns in this series will show us how to work with both dimensions of time. But, for now, here is Pattern 3 – Update History.
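Since Figures 1 and 2 are available only in the PDF, here are sketches of both queries, assuming a p3_policy table keyed on (policy_nbr, ver_beg_dt); the names and predicates are our assumptions, not reproductions of the figures.

-- Current version: the row with the MAX begin date, via a correlated subselect.
Select *
from   p3_policy p
where  p.policy_nbr = 138
and    p.ver_beg_dt =
       (Select max(p2.ver_beg_dt)
        from   p3_policy p2
        where  p2.policy_nbr = p.policy_nbr);

-- "As of" a given date: the row with the latest begin date on or before the
-- as-of date. No end date is stored, so this is not a simple BETWEEN predicate.
Select *
from   p3_policy p
where  p.policy_nbr = 138
and    p.ver_beg_dt =
       (Select max(p2.ver_beg_dt)
        from   p3_policy p2
        where  p2.policy_nbr = p.policy_nbr
        and    p2.ver_beg_dt <= '2005-01-01');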

Version Pattern 3

Figure 3: Policy – Version 3 – PDM Pattern (see PDF)

Version 3 Scenario 3.1

Figure 4: Scenario 3.1 (see PDF)

Version 3 Scenario 3.2

Figure 5: Scenario 3.2 (see PDF)

Version 3 Scenario 3.3

Figure 6: Version 3 Scenario 3.3 (see PDF)

Version 3 Follow-up

As explained in prior PDM articles, we will expand upon the original scenarios with a follow-up section that demonstrates additional capabilities of the pattern. This follow-up for Pattern 3 shows one of the benefits of storing the current and past versions in the same table: the same SQL and application logic can retrieve and process both the current and previous versions, just by supplying a different input date.

Figure 7: Version 3 Follow-up 3.4 (see PDF)

Figure 8: Version 3 Follow-up 3.5 (see PDF)

One or Two Tables?

Pattern 3 combines both the current version and the past versions into a single table. There are pros and cons to this, both physically and logically. One of the semantic constraints of Pattern 3 is that its date represents both the business effectivity date and the system activity date. So, the date on which an event appears in the computing system is the date that the event is effective for the business. This also implies that we cannot accurately allow retroactive or future-date processing; that is, an event cannot be dated prior to the current version, nor can it be dated in the future.

Consequently, as far as Pattern 3 is concerned, it would not significantly impact the application whether the various versions were stored in one table or in two. However, in future patterns we will show why we combine all versions in a single table: to better support bi-temporal patterns and retroactive and future-date processing.

Following is a list showing some of the pros and cons of combining versions into one table or separating them.

Figure 9: Versioned Tables versus Current Plus Historical Tables (see PDF)

If you are trying to retrofit history into an existing database and application that has no history, then you might consider creating a separate history table, populated by triggers on the original base object table (sketched below). Otherwise, for a new application, we lean toward combining the versions into the same table, possibly with partitioning to separate the non-current versions.
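Here is a hedged sketch of that trigger-populated history table, assuming SQL Server trigger syntax and the p1_policy table from Part 16; the names are hypothetical.

-- History table: a near-clone of the base table, plus a capture timestamp.
Create table p1_policy_hist (
    policy_nbr   int not null,
    client_nbr   int null,
    policy_typ   char(10) not null,
    crt_dt       datetime not null,
    last_updt_dt datetime not null,
    hist_crt_dt  datetime default getdate() not null
);

-- On every update, save the before-images of the changed rows; the "deleted"
-- pseudo-table holds the pre-update row images.
Create trigger tr_p1_policy_hist on p1_policy
after update
as
    Insert into p1_policy_hist
        (policy_nbr, client_nbr, policy_typ, crt_dt, last_updt_dt)
    Select policy_nbr, client_nbr, policy_typ, crt_dt, last_updt_dt
    from   deleted;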

Wrap-Up

This article has shown how to implement Version Pattern 3. This pattern allows you to easily query any version, current or past, based on the supplied as-of date input parameter. This reduces the number of database objects, application objects and SQL required for the application. More importantly, we begin to retain, more completely and more accurately, the history that is required by governing bodies and regulations, such as Sarbanes-Oxley (SOX), the National Association of Insurance Commissioners (NAIC) and the Securities and Exchange Commission (SEC). It also helps support auditing requirements and best practices desired by most major companies today. However, it does not yet support accurately post-dating or pre-dating database events.

In upcoming articles, we will continue to demonstrate how to physically model these more advanced temporal patterns, and will also present the DDL to create these tables and their related database objects, plus the SQL needed to maintain and query the data.

Stay with us as we travel through time!



Time and Time Again: Managing Time in Relational Databases, Part 19 - RI and Temporal RI

Tom Johnston, Randall Weis
DM Direct, February 1, 2008

This installment should have been the continuation of our presentation of physical implementations of the version patterns discussed thus far. It should have presented a physical implementation of Version Patterns 4 and 5. However, because editors are just plain tough when it comes to deadlines, and because of some personal exigencies with the authors, we are unable to stick to this logical sequence. Instead, in this article, we will consolidate and extend some of the things we have learned about the semantic constraints that apply to versions, and especially to what referential integrity means when applied to versions.

In terms of the roadmap presented in Part 13, this means that we are putting Step 2 aside, for the moment, and moving on to Step 3, the semantics and implementation of Version Pattern 6. The first part of those semantics which we wish to discuss is what we will call "temporal referential integrity" (temporal RI).


A glossary is included at the end of this article. As we have been writing this series, we have developed a set of terms which have been defined more by means of introducing and using them than in any more formal way. Of course, the later Wittgenstein (someone whose name C. J. Date likes to invoke) did indeed say "Meaning is use," but it is well-known (by philosophers, at least) that the later Wittgenstein is notoriously difficult to interpret and, consequently, correspondingly inappropriate as a source of quotations taken out of context. This glossary, like dictionaries when they are periodically undergoing revisions, takes usage to be prescriptive of meaning, but not identical to it. And in so doing, it and we part company with the later Wittgenstein, for whom most philosophical problems need to be dissolved, not solved.

RI and Temporal RI

The referential integrity (RI) that we all know and love is RI that the database management system (DBMS) can enforce, based on declarations we make to it when we submit our data definition language (DDL). It is RI against non-version tables. For clarity's sake, we have been calling non-version tables "object tables," not in the object-oriented sense, but rather in the sense of things that persist through time, things that versions are versions of. Thus, a Customer table is a table of customers, which are instances of a certain kind of object, and which persist through time. But a Customer Version table is not a table of customers; it is a table of chronologically sequenced, nonoverlapping time slices of customers.


As we continue, we will use the term "object RI" when we think the adjective is needed for clarity, and simply "RI" when context makes it clear that we are talking about object RI. Finally, we will always use the term "temporal RI" when talking about RI as it applies to versions.

Object RI

Here is the definition of object RI:

If there is an object RI dependency from table Y to table X (not necessarily distinct), then no state of the database is valid in which any row in table Y is not foreign key-linked to a row in table X. No inserts, updates or deletes are valid that would transform the database into such a state.

The RI machinery of the DBMS ensures this by carrying out the following checks. We assume throughout this discussion, unless otherwise noted, that all foreign keys are defined as non-nullable.

1. When inserting a row into object table Y, that row must reference a row in object table X that already exists in the database (or that is being inserted as part of the same atomic transaction).

2. When updating the foreign key to a row in object table X, from a row in object table Y, that new foreign key must reference a row in X that already exists in the database (or that is being inserted as part of the same atomic transaction).

3. When deleting a row in object table X, no rows may remain in object table Y, after that transaction is complete, that reference that row.

With insert, update and delete actions, and two tables, we have six table/action combinations. Notice that three of them are not mentioned as constraints that object RI must enforce. Inserting or updating a parent row cannot affect RI, nor can deleting a child row or updating it in any way that leaves its foreign key to its parent unchanged.

Note that for the time being, we are not concerned with the different ways that the DBMS can enforce constraint number 3 above – those ways being a) block, b) set null and c) cascade delete.
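By way of contrast with what follows, this is how standard RI is declared so that the DBMS itself performs checks 1 through 3. The Client and Policy tables here are illustrative, not taken from the original articles:

CREATE TABLE Client (
  client_nbr   INTEGER     NOT NULL PRIMARY KEY,
  client_name  VARCHAR(60)
);

CREATE TABLE Policy (
  policy_nbr   INTEGER NOT NULL PRIMARY KEY,
  client_nbr   INTEGER NOT NULL,
  -- the DBMS rejects any insert or update of Policy that references
  -- a nonexistent client, and blocks the delete of a client that
  -- still has policies
  FOREIGN KEY (client_nbr) REFERENCES Client (client_nbr)
    ON DELETE RESTRICT  -- the "block" option; SET NULL and CASCADE are the alternatives
);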

Temporal RI

We can now define and illustrate the enforcement of temporal RI in a way that emphasizes that object and temporal RI are both forms of referential integrity, but with the latter bringing time into consideration. But before we do, here is a depiction of what Version Pattern 6 looks like, using our Client Version table as an example.

Figure 1: Version Table Schema for Version Pattern 6

This schema will be refined later on, as we begin our discussion of Version Pattern 6. It is presented here simply to provide a graphic image of the various time-implementing columns that are discussed below.

So here is the definition of temporal RI, stated initially as a constraint involving two version tables. As we will see later, we must also consider temporal RI constraints, and their enforcement, between two tables one of which is a version table and the other of which is an object table:

If there is a temporal RI dependency from version table Y to version table X (not necessarily distinct), then no state of the database is valid in which any row in table Y is not object RI-linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No inserts, updates or deletes are valid that would transform a database into such a state.

Now for the enforcement of temporal RI. No existing DBMS provides the machinery to carry out this enforcement. Instead, it will be up to us, the developers, to support temporal RI by writing our own code, possibly using triggers, to enforce the following constraints on activity against the database:

1. When inserting a row into version table Y, that row must reference a row in version table X that a) already exists in the database (or that is being inserted as part of the same atomic transaction), and b) whose effectivity time period wholly contains the effectivity time period of the row being inserted into Y.

2. When updating the foreign key to a row in version table X, from a row in version table Y, that new foreign key must reference a row in X that a) already exists in the database (or that is being inserted as part of the same atomic transaction), and b) whose effectivity time period wholly contains the effectivity time period of the row being updated in Y.

3. When deleting a row in version table X, no rows may remain in version table Y, after that transaction is complete, that reference that row.

4. No update to a row in version table X (where a change in primary key is not considered an update) has any effect on RI constraints, other than a change to that row's effectivity time period. In that case, when updating the effectivity time period of an object X (which is done by creating a new current version of X that specifies the new time period), that effectivity time period may not be shortened as long as one or more object-RI linked rows in version table Y have an effectivity end date beyond the shortened end date for object X.

Recall, from our discussions thus far, that deletions in version tables are always logical deletions. That is, they are always done by marking the current version of an object deleted as of an indicated date, not by physically removing a row from a table.

Informally, what is going on is this:

Temporal RI is object RI plus the constraint that the time period of a child row must always be wholly contained within the time period of its parent row.

If we keep this in mind as we begin to look more closely at the details of implementing temporal RI enforcement, those details will be much less likely to confuse us. All of those details exist to support this one-sentence definition of temporal RI. (A sketch of the containment check appears below.)
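Since no DBMS provides this machinery, here is a minimal sketch of the containment check behind constraint 1, written as a validation query that our own insert code (or a trigger) could run before accepting a new Policy version. The table and column names (Client_Version, eff_beg_dt, eff_end_dt) are assumptions modeled loosely on Figure 1:

-- Valid only if some version of the parent client wholly contains
-- the effectivity time period of the version being inserted.
SELECT COUNT(*)
FROM Client_Version cv
WHERE cv.client_nbr = :new_client_nbr     -- the OFK value on the new row
  AND cv.eff_beg_dt <= :new_eff_beg_dt    -- parent begins no later than child
  AND cv.eff_end_dt >= :new_eff_end_dt;   -- parent ends no earlier than child
-- A count of zero means the insert would violate temporal RI
-- and must be rejected.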


Glossary

Child table, child row. Y is a child table to X if and only if there is a foreign key dependency from Y to X. A row in Y is a child to a row in X if and only if the row in Y has a foreign key whose value is identical to the primary key value of that related row in X.

1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum cardinality of optional for the parent and required for the child. But it is a matter of which table contains the foreign key and which table is the reference of that foreign key that differentiates parent from child tables and rows. Cardinality constraints are not what make the difference.

Effectivity time period. The period of time during which a version is regarded as the truth about an object.

1. In our version patterns, an effectivity time period is defined by an effectivity begin date and an effectivity end date, where "date" may be a calendar date or any other "tick of the clock" (as described in Part 2).

2. Our convention is that the time period begins on the begin date, but ends one clock tick prior to the end date.

Object referential integrity, object RI. Referential integrity for objects, as opposed to object versions.

1. This is what the term "referential integrity" refers to, outside of these articles. It is the integrity constraint, enforced by all current RDBMSs, that every non-null foreign key must contain a value identical to the value in the primary key of a row in the RI-related table.

Object table. A table whose rows represent persistent objects. Sometimes called a "non-versioned" table.

1. Persistent objects are things that exist over time and can change over time, such as vendors, customers, employees, regulatory agencies, products, services, bills of material, invoices, purchase orders, claims, certifications, etc.

2. From an OLAP, star-schema point of view, dimension tables are tables of persistent objects. For a more in-depth discussion of different types of tables, see the articles "An Ontology of Tables" at MindfulData.com.

3. From an OLTP database point of view, assuming that there are no version tables in the database, object tables are all the tables which are not transaction tables.

4. Roughly speaking, object tables are the tables which are the concern of master data management.

Parent table, parent row. X is a parent table to Y if and only if there is a foreign key dependency from Y to X. A row in X is a parent to a row in Y if and only if that row in Y has a foreign key whose value is identical to the primary key value of that row in X.

1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum cardinality of optional for the parent and required for the child. But it is a matter of which table contains the foreign key and which table is the reference of that foreign key that differentiates parent from child tables and rows. Cardinality constraints are not what make the difference.

Temporal referential integrity, temporal RI. Referential integrity for versions, which consists of object RI plus the constraint that the effectivity time periods of child rows are wholly contained within the effectivity time periods of their object RI-related parent rows.

Version table. A table whose rows represent versions of persistent objects.

1. A version of a persistent object is a time-slice of that object. A row in a version table represents a version, and describes that object as it was, is, will be and/or might be during its effectivity time period.

2. A row in a version table is what the custodians of the table believe is the truth about the object it represents, during the indicated effectivity time period.

Wholly contained in (for effectivity time periods). Time period 2 is wholly contained in time period 1 if and only if the effectivity begin date of time period 2 is equal to or later than the effectivity begin date of time period 1, and the effectivity end date of time period 2 is equal to or earlier than the effectivity end date of time period 1.

1. In order to implement this constraint in today's production databases, we rely on SQL data types and operators. Specifically, we rely on dates and/or timestamps, and the DBMS implementation of comparison operators for those data types.

2. Also for ease of current implementation, we use "12/31/9999" in a way whose intricacies will be discussed in Part 20. For now, we note that when we use "12/31/9999," we let the DBMS interpret that string as a valid date. However, semantically, it almost never would be a valid date, because no business activities that we know of are concerned with an effectivity time period that extends up to, but not including or beyond, that date nearly eight thousand years in the future.

Wrap-Up

In Parts 10 through 12 of this series, we discussed RI as it applies to version tables. There we concluded that when managing rows in version tables, the only RI we needed to enforce among those rows was object RI, RI expressed by means of object identifiers (OIDs).

Clearly, if this is all that were done to check the validity of insert, update and delete transactions against version tables, it would be possible to introduce violations of temporal RI. As we will explain next time, our concern was to avoid the performance penalty of enforcing temporal RI on inserts, updates and deletes to the database. This concern, indeed, is the principal reason that IT professionals tend to avoid versioning as much as possible.

In Part 20, we will examine our reasons for taking that position, and will explain why it is still a reasonable position to take. However, we will look more closely at the performance implications of doing temporal RI checking on insert, update and delete transactions. Those implications may not be as forbidding as we had originally thought them to be.


We will also begin a discussion of the value "12/31/9999" which may appear in effectivity end dates. In date or timestamp data types, this string is understood by the DBMS to represent an actual date, nearly eight thousand years in the future. But from a semantic point of view, that string does not represent a date. It functions as a temporal {null}.

As such, we shall argue, its most mathematically secure management would require the use of a temporal three-valued logic. But fortunately, our focus is on implementation with today's technology. So just as {null}s in today's SQL fudge a robust three-truth-valued logic, we will show how these "temporal {null}s" can also limp along without the support of a secure foundation in a three-time-valued (modal) logic.

Beyond these issues, we must also discuss temporal RI as it applies to future versions, to hypothetical versions, and also to versions which correct errors in other versions. This will take us several installments further into this series, at which point we will resume our presentation of physical implementations of Version Patterns 4 and 5.

After that, we will begin our discussion of Version Pattern 6.


Time and Time Again: Managing Time in Relational Databases, Part 20 - RI and Temporal RI (Continued)

Tom Johnston, Randall Weis
DM Direct, February 15, 2008

In this installment, we will begin a discussion of how temporal referential integrity (RI) may be enforced with today's relational database management systems (RDBMSs). In addition, we will expand the glossary which we began last time, in Part 19. Indeed, the glossary work has proven to be the major portion of the work done here in Part 20. This seems to be the right point in our series to begin formalizing the definitions of the terms we have introduced.

Object RI and Temporal RI: A Clarification


In Part 19, we said that by "object RI" we meant the normal DBMS-enforceable referential integrity, implemented by populating a column designated as a foreign key to table X with the value of one of the primary keys in that table. That was wrong, and is indicated as such in the glossary by shading that entry. A new, corrected definition is also included.

On that basis, we went on to give the following definition of "temporal RI":

If there is a temporal RI dependency from version table Y to version table X (not necessarily distinct), then no state of the database is valid in which any row in table Y is not object-RI linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No inserts, updates or deletes are valid that would transform a database into such a state.

As this discussion should make clear, our corrected definition of the term "object RI" does not require us to amend the above definition of "temporal RI."

Notice that in this definition, the dependent-on table (X) is itself a version table. Consequently, it contains no rows for objects, only rows for versions of objects. To illustrate, consider this schematic of a version table, also copied from Part 19. We also add a schematic of a dependent table which, in our ongoing example, is the Policy table.


Figure 1: Version Table Schema for Version Pattern 6

In the Policy table shown here, the notation (OFK) stands for object foreign key. The object mentioned in this definition is, in this example, a specific client. The client is represented by the OFK column client-nbr.

If the referenced table were itself an object table, and not a version table, then client-nbr would be a "normal" foreign key, not an OFK. It would contain a value that was unique in the designated primary key column of the referenced table, and its validity could be checked by the standard RI mechanism of the DBMS (see Glossary). But the definition specifically states that it is a definition that involves two version tables. Consequently, client-nbr in the referenced table, by itself, is not a complete primary key. So it is not something the DBMS can enforce.

So what does "object RI" mean when there is no unique object to refer to, when, whatever it is, it is not enforceable by an RDBMS? By the same token, what kind of foreign key is an OFK?

An OFK column contains the value of a unique object even though a table with that value as primary key does not exist. It is as if, instead of Figure 1, versioning were implemented as shown in Figure 2.
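Figure 2 itself is available only in the original PDF, but a DDL sketch of its as-if design may help; the column names are illustrative assumptions. Every foreign key here is an ordinary one, referencing a complete primary key:

CREATE TABLE Client (
  client_nbr INTEGER NOT NULL PRIMARY KEY
);

CREATE TABLE Client_Version (
  client_nbr INTEGER NOT NULL REFERENCES Client (client_nbr),
  eff_beg_dt DATE    NOT NULL,
  eff_end_dt DATE    NOT NULL,
  PRIMARY KEY (client_nbr, eff_beg_dt)
);

CREATE TABLE Policy (
  policy_nbr INTEGER NOT NULL PRIMARY KEY,
  client_nbr INTEGER NOT NULL REFERENCES Client (client_nbr)
);

CREATE TABLE Policy_Version (
  policy_nbr INTEGER NOT NULL REFERENCES Policy (policy_nbr),
  eff_beg_dt DATE    NOT NULL,
  eff_end_dt DATE    NOT NULL,
  PRIMARY KEY (policy_nbr, eff_beg_dt)
);

-- Remove the Client object table, and client_nbr in Policy becomes an OFK:
-- it would match only part of Client_Version's primary key, which is why
-- the DBMS cannot enforce it.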


In the as-if situation depicted in Figure 2, these are the foreign key dependencies:

Policy Version table to Policy table.
Client Version table to Client table.
Policy table to Client table.

These RI dependencies are all object RI dependencies because in all cases, the table referenced by a foreign key is an object (a non-version) table. In addition, all these object RI dependencies can be enforced by the RDBMS. So to keep this distinction clear, we will use the term "standard RI" (or "RI" by itself) to mean RI among non-versioned tables. Standard RI is RI that is enforceable by the DBMS.

This leaves us free to redefine the term "object RI." We will now use it to mean a constraint that applies between a pair of tables, where the referenced (parent) table is a version table. In that referenced version table, one part of its primary key will be an identifier for the object whose version it is. In the referencing table, there is a column which is constrained to contain a value matching the object identifier part of a version in the referenced table which is current at the time the referencing (child) row was inserted. That column is what we are calling an OFK. In addition, if the referencing row is not in a version table, then an update in place can be applied to it, of course. If that happens, the object RI constraint requires that the new value put in the OFK column must match the object identifier in a version current at the time of the update.

The Glossary below contains the indicated corrections, and also includes new expressions not defined previously.


Glossary

Child table, child row. Y is a child table to X if and only if there is a foreign key dependency from Y to X. A row in Y is a child to a row in X if and only if the row in Y has a foreign key whose value is identical to the primary key value of that related row in X.

1. Parent/child relationships typically have a maximum cardinality of one/many, and a minimum cardinality of optional for the parent and required for the child. But it is a matter of which table contains the foreign key and which table is the reference of that foreign key that differentiates parent from child tables and rows. Cardinality constraints are not what make the difference.

2. Also sometimes referred to as "dependent tables" or "dependent rows," or various cognates (e.g., "RI-dependent tables," "RI-dependent rows").

Current version, most recent version. A version of an object whose effectivity begin date is chronologically the latest across all non-future versions of that object, and whose effectivity end date is either unknown (represented by "12/31/9999") or in the future.

1. Until we begin discussing Version Pattern 7, we will assume that no two versions of the same object can have the same effectivity begin date. After all, between that date and the earliest effectivity end date between them, that would mean that we had two different statements of what was true about an object during that period of time. But there can only be one version of the truth.

2. With Version Pattern 7, we will discuss cases in which two or more versions of the same object might have the same effectivity begin date. One source of such cases is a business requirement to correct an error in a version table, but also to retain, as queryable history, the version discovered to be in error. Another source of such cases is where one of a small number of things might be true of an object in a given time period, but we don't know which one.

Effectivity time period. The period of time during which a version is regarded as the truth about an object.

1. In our version patterns, an effectivity time period is defined by an effectivity begin date and an effectivity end date, where "date" may be a calendar date or any other "tick of the clock" (as described in Part 2).

2. Our convention is that the time period begins on the begin date, but ends one clock tick prior to the end date.

Logical delete. A delete which marks the row as deleted, but leaves it in the table.

1. Also known as a "business delete," or referred to as a "semantic delete" or a cognate.
2. See "physical delete."

Episode. Each version Y of an object which does not supercede a version X current at the time Y is created begins an episode of an object.

1. See "supercession."

2. The first version of an object creates the initial episode of that object. That remains the current episode until a logical delete version for that object is inserted into the table.

3. If, when a version Y is inserted into the table, the most recent version of that object already in the table is a logical delete version, then Y creates a noninitial episode of that object. That remains the current episode until a logical delete version for that object is inserted into the table.

4. If, when a version Y is inserted into the table, the most recent version of that object already in the table is a current version, then Y supercedes that version but remains within the same episode as the version it supercedes.

5. Also:
a. Current episode. An episode of an object whose most recent version is not in the future and is not a logical delete.
b. Past terminated episode, terminated episode. An episode of an object whose most recent version is not in the future and is a logical delete.
c. Future episode. An episode of an object whose most recent version is in the future and is not a logical delete.
d. Future terminated episode. An episode of an object whose most recent version is in the future and is a logical delete.

Object foreign key (OFK). A column in a table (versioned or not) which contains the object identifier used by one or more rows in a (not necessarily distinct) version table.

1. For example, in Figure 2 of Part 20, client-nbr in the Policy table is an OFK. It is constrained to have the value of a client-nbr in one or more rows in the Client table, such that at least one of those rows was current as of the time the row containing the OFK was created.

Object referential integrity, object RI (revised). The constraint that when a row containing an OFK is inserted, the value in that OFK matches the value of the object identifier portion of a primary key in a (not necessarily distinct) version table.

1. Object RI cannot be enforced by today's RDBMSs. This follows from the fact that OFKs reference an object only indirectly, by means of the one or more versions that implement it. In those versions, the referenced object identifier is only part of the primary key of its table, and is thus not necessarily (or even usually) unique.

2. Object RI requires that the referenced (parent) table be a versioned table. But the referencing (child) table, the one that contains the OFK, may itself be either a versioned or an object (non-versioned) table.

Object table. A table whose rows represent persistent objects. Sometimes called a "non-versioned" table.

1. Persistent objects are things that exist over time and can change over time, such as vendors, customers, employees, regulatory agencies, products, services, bills of material, invoices, purchase orders, claims, certifications, etc.

2. From an online analytical processing, star-schema point of view, dimension tables are tables of persistent objects. For a more in-depth discussion of different types of tables, see the articles "An Ontology of Tables" at MindfulData.com.

3. From an online transaction processing database point of view, assuming that there are no version tables in the database, object tables are all the tables that are not transaction tables.

4. Roughly speaking, object tables are the tables which are the concern of master data management.

Parent table, parent row. X is a parent table to Y if and only if there is a foreign key dependency from Y to X. A row in X is a parent to a row in Y if and only if that row in Y has a foreign key whose value is identical to the primary key value of that row in X.

1. Parent/child relationships typically have a maximum cardinality of one/many and a minimum cardinality of optional for the parent and required for the child. But it is a matter of which table contains the foreign key and which table is the reference of that foreign key that differentiates parent from child tables and rows. Cardinality constraints are not what make the difference.

2. Also sometimes referred to as "dependent-on tables," or "dependent-on rows," or various cognates.

Physical delete. A delete which physically removes the row from its table.

1. See "logical delete."

Queryable history. Data about an object that was valid at some time in the past, which is no longer currently valid, but which is as easily and rapidly accessible as current data.

1. "As easily and rapidly accessible as current data" means what it says. Our way of providing this access is to use version tables. In such tables, production queries against current data (the most common kind of query) can be used to retrieve historical data simply by adding a date to a BETWEEN clause of the SQL statement.

2. As we have mentioned before (and will again), providing queryable history in this manner can significantly lower the development and operations cost of accessing historical data, and significantly improve the availability of the historical data retrieved.
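As a sketch of the point made in item 1, and assuming a Client_Version table with effectivity begin and end dates, one statement serves both current and historical retrieval; only the supplied date differs:

-- Returns the version of the client in effect on :as_of_dt.
-- Pass today's date for current data, or a past date for history.
SELECT *
FROM Client_Version
WHERE client_nbr = :client_nbr
  AND eff_beg_dt <= :as_of_dt
  AND eff_end_dt >  :as_of_dt;  -- strict, given the convention that a period
                                -- ends one clock tick before its end date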

Standard referential integrity, standard RI. The referential integrity constraint among non-versioned tables that today's RDBMSs can enforce. Because we are calling these non-versioned tables "object tables," we erroneously equated object RI with standard RI in Part 19.

Supercede, supercession. In these articles, we use these terms to refer to the replacement of a current version with a new current version.

1. Supercession is a logical function. Physically, supercession is done by inserting a new row in a version table.

2. Deletion in a version table is done via supercession.

3. Versionable updates in a version table are also done via supercession.

4. However, creating the first version of an object does not involve supercession, as there is no current version of that object to supercede.

5. A more subtle point: creating the first version of a noninitial episode of an object also does not involve supercession. Even if other versions of an object exist, every episode is concluded with a logical delete version. Thus, when a new episode is started, the most recent prior version of that object is a logical delete version. Thus, there is no current version of the object to supercede when that new episode begins.

Temporal referential integrity, temporal RI. Referential integrity for versions, which consists of object RI plus the constraint that the effectivity time periods of child rows are wholly contained within the effectivity time periods of their parent rows.

Version, version table. A table whose rows represent versions of persistent objects.


1. A version of a persistent object is a time-slice of that object. A row in a version table represents a version, and describes that object as it was, is, will be and/or might be during its effectivity time period.

2. A row in a version table is what the custodians of the table believe is the truth about the object it represents during the indicated effectivity time period.

Wholly contained in (for effectivity time periods). Time period 2 is wholly contained in time period 1 if and only if the effectivity begin date of time period 2 is equal to or later than the effectivity begin date of time period 1, and the effectivity end date of time period 2 is equal to or earlier than the effectivity end date of time period 1.

1. In order to implement this constraint in today's production databases, we rely on SQL data types and operators. Specifically, we rely on dates and/or timestamps, and the DBMS implementation of comparison operators for those data types.

2. When we use "12/31/9999," we let the DBMS interpret that string as a valid date. However, semantically, it almost never would be a valid date, because no business activities that we know of are concerned with an effectivity time period that extends up to, but not including or beyond, that date nearly eight thousand years in the future.

The ground-clearing needed before we discuss options for enforcing temporal RI is, we hope, now complete. If it is, Part 21 will begin our discussion of those options.

Beyond these issues, we must also discuss temporal RI as it applies to future versions, to hypothetical versions and also to versions that correct errors in other versions. This will take us several installments further into this series, at which point we will resume our presentation of physical implementations of Version Patterns 4 and 5.

After that, we will begin our discussion of Version Patterns 6 and 7.


Time and Time Again: Managing Time in Relational Databases, Part 21 - Enforcement Time Options for Temporal RI

Tom Johnston, Randall Weis
DM Direct, March 7, 2008

At the conclusion of our previous article, we said that there are two options for enforcing temporal integrity constraints on a database. One is to enforce the constraints at the time of update, so as to prevent violations from being persisted in the database. The other is to let violations occur, but to filter them out at retrieval time. This would be done by means of views against which all normal queries would be written.

It's easy to disparage the latter option by describing it as letting temporal "garbage" onto the database, and then trying to hide that garbage from queries. But this is indeed the approach we have taken on temporal databases that we have designed, built, put into production and supported, and we had good reason for doing so. Like almost all database professionals, we were concerned that there would be an overwhelming volume of cascade updates if we tried to enforce temporal RI at time of update, especially when the update involves the delete of a versioned object that was the "parent" of possibly many other objects, in many other tables. (See the introduction to Part 10 for a more complete description of this "cascade update" fear.)


We believe it is time to re-examine this fear, and see if the performance hit for enforcing temporal RI on update transactions is as great as we have feared. We assume here that, for as long as data storage continues to involve spinning disks, cylinders and mechanical movement of read/write heads to get to required data, we can safely simplify our performance issue and treat it as a matter of how many physical I/Os, how many block-level transfers between a storage device (or its cache) and main memory, are required to complete a temporal RI validation. With this in mind, let us now examine two things: the logic of transactions against versioned objects, and the I/O that is required for those transactions.

Temporal RI Reformulated

To more explicitly allow for the query-time enforcement of temporal integrity constraints, we will slightly amend the definition of temporal RI given in Part 20, as follows:

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then no exposed state of the database is valid in which any row in table Y is not object-RI linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No queries are valid which would expose such a state.

There are three changes to the original definition:

"Version table Y" changed to "table Y." As explained last time, we allow both versioned andnon-versioned tables to be object-RI dependent on a versioned table.

1.

"State of the database" changed to "exposed state of the database." This allows for theoption of enforcing temporal RI on queries rather than on updates.

2.

"No inserts, updates or deletes are valid that would transform a database into such a state"replaced by "No queries are valid that would expose such a state." The replaced sentencewas appropriate for update-enforced temporal RI. The replacing sentence is appropriate foreither update- or query-enforced temporal RI.

3.

We will assume, in the following discussion, that we are dealing with update-enforced temporal RI. In later articles, we will discuss query-time enforcement; the view sketched below suggests what that filtering might look like.
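For the deferred option just set aside, the filtering view might look like the following sketch, with all names illustrative assumptions. Normal queries would be written against the view, so rows violating the containment constraint are never exposed:

CREATE VIEW Policy_Version_V AS
SELECT pv.*
FROM Policy_Version pv
WHERE EXISTS
      (SELECT 1
         FROM Client_Version cv
        WHERE cv.client_nbr = pv.client_nbr     -- object-RI link via the OFK
          AND cv.eff_beg_dt <= pv.eff_beg_dt    -- wholly contained: parent begins first
          AND cv.eff_end_dt >= pv.eff_end_dt);  -- and ends no earlier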

Note: Before proceeding, please note that the glossary introduced a couple of installments ago is not included in this article, nor will it be included in future articles. The reason is that the glossary is already over 3,000 words long and likely to grow to several times that size.

One reason for the glossary being this extensive is that we are making the definitions of its entries extensive. Most of these concepts are complex, and we don't think it is possible to do them justice with one- or two-sentence definitions.

Another reason for the glossary being this extensive is that we are attempting to construct it as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in a definition must itself be a defined glossary entry. This is the kind of rigor that must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like predicate logic and thus available for manipulation by software-realized inferencing engines. It is also the kind of rigor that must be applied (to any set of technical definitions) before we can say with confidence that we know what we mean.

Of course, all this just makes the glossary that much more important than if its entries were brief and informal. So we have decided to put the glossary on each of our Web sites, to keep it updated there, and to include a link on our respective homepages to that glossary. The URLs are MindfulData.com and InbaseInc.com.

In this review of updates to versioned tables, our approach will be to first describe the logic of those updates, including the enforcement of temporal constraints. After that, we will go on to consider the physical I/O costs of enforcing those constraints on those updates. The specific types of update that we will consider are:

Deletes;
Inserts;
(Row-level) updates; and
Upserts.

We begin with some terminology. An "original" transaction is a transaction created or authorized by the business. A "temporal" transaction is the result of translating an original transaction into a transaction against a versioned object. As we will see, while each original transaction appears to its author to be an update against a single row of an object table, the translation may result in a set of updates to multiple rows in possibly multiple tables. (The original/temporal distinction, and other key terms of our temporal data management approach, are defined in the glossary.)

Original Deletes: Their Semantics

To the business, when they are issuing a delete against an object, they don't care if the object is versioned or not, and indeed may not know. They are directing us to remove the object from the database. Thus, an original delete transaction doesn't need to specify a version, just the object being deleted and the date the delete is to take effect. This is illustrated in Figure 1.

Figure 1: An Original Delete Transaction

Let's organize this discussion around three questions.

1. First, when an object is versioned, and a business user issues or authorizes a delete, what does he think he is deleting, an object or a version?

2. Second, when an object is versioned, and a business user issues or authorizes a delete, what is actually deleted?

3. And finally, how is the actual delete transaction carried out?

As for the first question, the author of the delete transaction intends to delete an object, and believes he is doing so. For example, he may issue a delete of a specific client, against a database in which clients are versioned, but he does not issue a directive to delete a specific version of that client. The reason that original update transactions are specified against objects, and not against versions, is that the burden of managing versions should not be placed on the business user, in this case on the author of the delete transaction. Versions are simply an internal mechanism for satisfying the business demand for real-time access to noncurrent data, which is just as fast and easy to access as current data. As far as possible, versions and their complexities should be hidden from both end users and developers.

A delete transaction against a versioned object is valid if and only if there is a current version of that object on the database at the time of the transaction. Necessarily, this version will always be the one current version of the one current episode of that object, and applying the delete will delete the object by terminating its current episode.

But what about past versions, future-dated versions or hypothetical versions? As for past versions, it is reasonable to expose semantics to the user that prohibit them from updating the past. Just as with good accounting practice, if what we recorded about the past was wrong, we do not update that original entry. Instead, we enter adjusting transactions.

As for future or hypothetical versions, it could be said that if we allow users to create them, as we in fact do, then we expose the mechanism of versioning to those users. But this isn't so. Consistent with our principle of hiding the mechanism of versioning as much as possible, we do not expose to the user or developer the concept of deleting something that doesn't yet exist.


Instead, all that we have exposed to the user is functionality: the ability to insert transactions in advance of when they are to become effective, or even transactions that may never become effective. The mechanisms by which we do this are not themselves exposed. From the user's point of view, it appears as if we have simply accepted transactions in advance of when they should be applied and stored them somewhere in a transaction file, with the promise that we will apply them precisely on the date they are to become effective.

Temporal Deletes: Their Targets

So, given that there must be a current version of the object of an original delete, and there may be, in addition, past, future and hypothetical versions as well, what is actually deleted? This is determined by the code that transforms an original delete into a temporal delete. We may think of this code as the "transform" portion of an extract, transform and load (ETL) process against a batch file of original transactions. We may also think of this code as a stored procedure invoked against individual online transactions.

In either case, the important thing is that business users, developers and database administrators (DBAs) all work with original transactions and not with temporal transactions. The code that translates original to temporal transactions encapsulates the complexity of versions and their manipulations, supporting their functional semantics without exposing their implementation complexities.

This question about the targets of temporal transactions is the second of the three questions in the previous section. The answer is: the current version, and any future and hypothetical versions as well. To the user, it will seem that we have done three things:

Deleted the one row representing the object in question;
Flushed any remaining transactions for that object from the transaction file just mentioned; and
If there were hypothetical versions, removed that object from one or more "hypothetical situation" databases.

But how is this done? What physically happens when the object of an original delete is a versioned object? The answer is that this is done by inserting a new version that supercedes the current version of the object being deleted, as well as a new version for each of the future or hypothetical versions of that object.

The process begins by finding the current version of the object specified on the original transaction. If it is not found, the transaction is invalid. Otherwise, a superceding delete version is created which has the same object identifier and business effectivity begin date as the version it is superceding.

The effectivity end date on the superceding delete version is set to the delete date on the transaction. This is because an original delete is a business transaction. Its intent is to remove the object from the database as of the indicated date. For example, if a client is being deleted, the intent is that after the transaction is complete, that client no longer exists, i.e., that person or organization no longer exists as a client of our company. (We may assume that if a delete date is not supplied on the original transaction, the effectivity end date defaults to the date current when the transaction is physically applied to the database.)

On these versions created by an original delete transaction, a delete flag is set to "yes." For all but these delete-created versions, this flag is set to, and remains, "no." The purpose of this delete flag is to distinguish deletes that happen in response to an original delete transaction from deletes that happen because a non-12/31/9999 end date comes to pass.

Note that nowhere in this process have we discussed a physical deletion of a physical row in a version table. Physically removing rows from versioned tables is a process we will call "archiving" those rows, and it is outside the scope of what this series of articles is about. See the taxonomy in Part 1 for additional discussion of the difference between archiving, creating snapshots, and versioning.
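A sketch of the superceding insert at the heart of this process, using the illustrative Client_Version names from earlier in this rewrite plus an assumed delete_flag column. (Because the delete version shares the begin date of the version it supercedes, the table's key must include one more date than object identifier plus begin date, a point the series takes up with Version Pattern 7.)

-- Supercede the current version with a delete version: same client and
-- effectivity begin date, end date set to the delete date, delete flag set.
INSERT INTO Client_Version
       (client_nbr, eff_beg_dt, eff_end_dt, delete_flag)
SELECT cv.client_nbr, cv.eff_beg_dt, :delete_dt, 'Y'
FROM   Client_Version cv
WHERE  cv.client_nbr = :client_nbr
  AND  cv.eff_beg_dt <= CURRENT_DATE    -- began in the past
  AND  cv.eff_end_dt >  CURRENT_DATE;   -- and has not yet ended: the current version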

Wrap-Up

Next time, we will finish our discussion of original and temporal delete transactions by examining the foreign key implications of deleting a versioned object. We will then be in a position to assess the risk and the extent of potential "cascade" effects of such deletions. Since what we are proposing is an enterprise solution, we must either evaluate performance implications against worst-case scenarios, or else design our original transactions with a flag that can selectively turn on or off transaction-time temporal validation checks.


Time and Time Again: Managing Time in Relational Databases - Part 22: Original and Temporal Deletes

Tom Johnston, Randall Weis
DM Direct, March 21, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.


One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like predicate logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines.

Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can say with confidence that we know what we mean, and that we mean what we say, no more and no less.

In our previous article, we introduced the distinction between original and temporal transactions, and began by discussing original and temporal deletes. This article completes our discussion of delete transactions. We will then go on to discuss insert, update and upsert transactions.

These discussions describe the transformations required to map original transactions onto temporal transactions. This mapping is the "database update side" of the encapsulation of the complexities of temporal data management which, in Part 1, we said was an essential part of our approach.

These same discussions also provide the insight necessary to decide whether we should attempt to enforce temporal referential integrity (RI) on the updates themselves, or instead defer that integrity checking until data is retrieved. With respect to the deferral option, we will repeat what we said about it in the previous article:

It's easy to disparage [the deferral option] by describing it as letting temporal "garbage" onto the database, and then trying to hide that garbage from queries. But this is indeed the approach we have taken on temporal databases that we have designed, built, put into production and supported, and we had good reason for doing so. Like almost all database professionals, we were concerned that there would be an overwhelming volume of cascade updates if we tried to enforce temporal RI at time of update, especially when the update involves the delete of a versioned object that was the "parent" of possibly many other objects, in many other tables. (See the introduction to Part 10 for a more complete description of this "cascade update" fear.)

From the Previous Article

For the sake of continuity, we repeat the following material from the previous article.

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then no exposed state of the database is valid in which any row in table Y is not object-RI linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No queries are valid which would expose such a state.

An "original" transaction is a transaction created or authorized by the business. A "temporal" transaction is the result of translating an original transaction into a transaction against a versioned object.

To the business, when they are issuing a delete against an object, they don't care if the object is versioned or not, and indeed may not know. They are directing us to remove the object from the database. Thus, an original delete transaction doesn't need to specify a version, just the object being deleted and the date the delete is to take effect. This is illustrated in Figure 1.

Figure 1: An Original Delete Transaction

Deleting a Versioned Object: The Existence Constraint

The original transaction we are considering is a delete. Just as with a delete to a non-versioned object, the first thing that must be done is to find the target row(s). One target row must be the most recent row of the current episode of the object. The original transaction provides the electronic identifier (EID) for the object. The most recent version of the current episode of the object is the one (nonhypothetical) version of that object with an effectivity begin date in the past and an effectivity end date either in the future, or containing the "temporal null" value "12/31/9999."

First of all, note the assumption that there cannot be more than one current version of an object in a versioned table. We recommend that you attempt to state, clearly and precisely, what guarantees that this assumption is valid. (Hint: Consider the logic of inserting into a versioned table. In the next article, this logic will be described in terms of the mapping from an original insert to a temporal insert, but in both the glossary and in previous articles, the material already exists for you to answer this question.)

If a current version of the indicated object is found, the delete can proceed. Otherwise, it is rejected as invalid. (We consider deletes which are explicitly against future versions in a later article. For now, we assume that all original transactions are against current versions.)


As described last time, an original delete of a versioned object requires us to supersede the most recent version of the current episode of the object with a delete version, and also to do the same for any active future (or hypothetical) episodes of that object. An active future episode is a future episode whose most recent version is not a delete version.

Deleting a Versioned Object: What is a Dependent Row?

If the version is found, the next thing is to look for any rows, in either object tables or version tables, that are dependent on it. This dependency is precisely that of temporal RI.

Recall as well that the database management system (DBMS) cannot help us here. If we are to enforce temporal RI on delete transactions, we must do the work ourselves. And since this is work to enforce a database constraint, it does not belong anywhere in application-specific code. It must be implemented in the codebase that maps original transactions to temporal transactions. (This codebase, of course, may be expressed in part, or even in its entirety, as DBMS triggers.)

If the child table is an object table, a dependent row is one with the to-be-deleted row's EID as an object foreign key (OFK). Performance considerations would thus suggest that OFK columns, in all non-versioned tables, have a non-unique index defined on them.

If the child table is a versioned table, a dependent row in a current episode of its object is one which has an OFK that contains the value of the to-be-deleted row's EID, and with an effectivity end date either in the future, or else containing the value "12/31/9999". A dependent row in an active future episode of its object is the most recent version of that episode. (Until further notice, we will not refer to hypothetical objects, versions or episodes any longer. The approach to temporal data management that we are describing does accommodate such hypotheticals, but they require special treatment. They will be discussed in later articles.) In short, dependent rows for a temporal delete are all and only those rows which are OFK-related to the to-be-deleted row, and which are the most recent versions in the active episodes for their object.

Here, performance could be improved by using a non-unique index on OFK plus effectivity begin date (plus one more date, as we shall see when we discuss Version Pattern 7) in the dependent table. However, at this point we are getting into a detailed discussion of implementation options. We intend to discuss such issues in depth toward the end of this series, and so we will say no more about them here.
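As a sketch of the dependent-row check for a versioned child table, again with hypothetical names (a Policy_Version table whose OFK column is client_eid):

SELECT *
FROM Policy_Version
WHERE client_eid = :eid                  -- OFK contains the to-be-deleted row's EID
  AND (eff_end_dt > CURRENT_DATE         -- effectivity end date in the future,
       OR eff_end_dt = '12/31/9999')     -- or the temporal null

The non-unique index suggested above narrows the candidate rows, but, as noted below, the rows themselves must still be fetched so that their effectivity end dates can be examined.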

Deleting a Versioned Object: Block, Set Null and Cascade Options

The search for dependent rows is conducted under one of three directives. They are the familiar relational delete options of:

Block,
Set null, or
Cascade.

From a performance point of view, which, as we stated last time, means from a physical I/O point of view, the block option is often less costly than the other options. This is because if there are multiple rows dependent on the to-be-deleted current version, we can stop checking as soon as the first one is found.
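Under the block option, then, the check can stop at the first dependent row found. A sketch, using the same hypothetical names, on a DBMS that supports the FETCH FIRST clause:

SELECT 1
FROM Policy_Version
WHERE client_eid = :eid
  AND (eff_end_dt > CURRENT_DATE OR eff_end_dt = '12/31/9999')
FETCH FIRST 1 ROW ONLY

If this returns a row, the original delete is blocked, and no further lookups against this child table are needed.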


But if either the set null or the cascade option is chosen, then our search for temporally RI dependent rows in versioned child tables must be exhaustive. It must retrieve and examine all OFK-related rows.

Notice that even if the child table index used for the search is as described above, the rows themselves must be retrieved, because the check must look at the effectivity end date of the child rows. In very large tables, possibly with multiple OFKs, the cost of such indexes can be high. When an important to-be-deleted object has many dependent children, the I/O costs of a) searching both the index and the base table for possible episodes to terminate (i.e., possible most recent versions to supersede with temporal delete versions), and b) doing the appropriate writes to the database to terminate those episodes, may be prohibitively high.

But these are simply the trade-offs that any database administrator (DBA) must consider when physicalizing any data model. It is the cumulative weight of such costs that can lead us to decide to postpone temporal RI checking until retrieval time. In the case of temporal deletes, it would mean that we would just terminate all active episodes of the designated object, and not check for any dependencies.

Deleting a Versioned Object: The Foreign Key Ripple Effect

We know that normal, DBMS-enforced RI doesn't just link child tables to parent tables. When using the set null or cascade options, we must also consider child tables to those child tables, and continue down the RI-dependency chain as far as it extends. Thus, in a situation where we have clients, policies and claims, a cascade or set null delete of a client may require one or more policies to be set null or deleted. And if any are, the DBMS must go on to set null or delete the claims dependent on those policies.

Let's call this the "foreign key ripple effect." There are three things to note about this ripple effect as it applies to versioned objects:

1. It does apply to versioned objects.
2. The DBMS can't help us. We must write our own code to enforce temporal RI on temporal delete transactions.
3. The performance cost will usually be much higher than for normal, DBMS-enforced RI.

So if we are enforcing temporal RI as the database is updated, the index and base-table lookups that we described above must be performed. In addition, as we just pointed out, they must be performed on all tables in the temporal RI chain of dependencies, thus not just on children of the to-be-deleted parent row, but also on grandchildren, great-grandchildren, etc.

What of the claim that performance costs will usually be much higher than for normal RI? A detailed defense of the claim, of course, cannot be provided unless we describe a specific implementation. And even that would not prove that some better implementation might not keep down the incremental performance costs. However, regardless of the specific implementation, they all share the following characteristic: there are going to be a lot more rows in a versioned table than there would be in the corresponding object table. The reason is that in versioned tables, it is not just inserts that create new rows. Versionable updates also create new rows, and so do versioned deletes.
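To make the ripple effect concrete, here is a sketch of the lookups for the clients-policies-claims example, adding a hypothetical Claim_Version table with a policy_eid OFK to the names used above:

-- Level 1: active policy versions dependent on the to-be-deleted client
SELECT policy_eid
FROM Policy_Version
WHERE client_eid = :eid
  AND (eff_end_dt > CURRENT_DATE OR eff_end_dt = '12/31/9999');

-- Level 2: active claim versions dependent on any of those policies
SELECT claim_eid
FROM Claim_Version
WHERE policy_eid IN
      (SELECT policy_eid
       FROM Policy_Version
       WHERE client_eid = :eid
         AND (eff_end_dt > CURRENT_DATE OR eff_end_dt = '12/31/9999'))
  AND (eff_end_dt > CURRENT_DATE OR eff_end_dt = '12/31/9999');

Each further level of the dependency chain adds another such lookup, which is one reason the performance cost grows with the depth of the chain.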

Deleting a Versioned Object: Temporal Delete Transactions

Let's suppose that we have identified both the most recent version of the current episode of the object specified on the original delete, the most recent version of any active future episodes of that object, and also the most recent version of all active episodes in all tables along the temporal RI-dependency chain. We then add a temporal delete transaction to each of these episodes, superseding the most recent version in each of them with a temporal delete version.

Note that these temporal deletes do not flag and rewrite these versions. That would be to do an update-in-place logical delete. As we explained in an earlier article, update-in-place logical deletes lose information, and that information may be considered versionable information by the business.

Temporal deletes insert a temporal delete version for each active episode. The effectivity end date on the delete version is set to the delete date on the transaction. This is because an original delete is a business transaction. Its intent is to remove the object from the database, as of the indicated date. For example, if a client is being deleted, the intent is that after the transaction is complete, that client no longer exists, i.e., that person or organization no longer exists as a client of our company. (We may assume that if a delete date is not supplied on the original transaction, the effectivity end date defaults to the date current when the transaction is physically applied to the database.)

Note also that nowhere in this process have we discussed a physical deletion of a physical row in a versioned table. Physically removing rows from versioned tables is a process we will call "archiving" those rows, and it is outside the scope of what this series of articles is about. See the taxonomy in Part 1 for additional discussion of the difference between archiving, creating snapshots, and versioning.

So just like original inserts and original (versionable) updates, original deletes to versioned objects become physical inserts of one or more rows.
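A sketch of the write side, under the same assumptions, and ignoring the physical key structure of the version table, which the series has not yet specified (the delete_flag marker column is our own illustrative device):

INSERT INTO Client_Version (client_eid, eff_begin_dt, eff_end_dt, delete_flag)
SELECT client_eid, eff_begin_dt, :delete_dt, 'Y'
FROM Client_Version
WHERE client_eid = :eid
  AND eff_begin_dt <= CURRENT_DATE
  AND (eff_end_dt > CURRENT_DATE OR eff_end_dt = '12/31/9999');

The same insert, with the predicate adjusted to find the most recent version of each active future episode, would be issued along the entire temporal RI-dependency chain.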

Wrap-Up

There is one unfinished piece of business. Having argued that it is sometimes too costly, in performance terms, to enforce temporal RI on original delete transactions against versioned objects, we should now show how that same RI is enforced if it is, instead, enforced at retrieval time. However, with deadlines approaching on our bill-paying projects, we are forced to postpone that discussion for a later time.

Next time, we will provide the same treatment for original insert transactions against versioned objects.


Time and Time Again: Managing Time in Relational Databases, Part 23 - Original and Temporal Updates

Tom Johnston, Randall Weis
DM Direct, April 4, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts used in the definition of a glossary entry must itself be a glossary entry.


One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is needed to make definitions available for manipulation by software-realized inferencing engines.

Another reason for being this careful with definitions is that this rigor must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

In our previous three articles, we introduced the distinction between original and temporal transactions. We then reviewed temporal referential integrity (RI) rules as they apply to temporal deletes. This article continues the discussion by reviewing temporal RI rules as they apply to temporal updates.

These same discussions also provide the insight necessary to decide whether we should attempt to enforce temporal RI on the updates themselves, or instead defer that integrity checking until data is retrieved. With respect to the deferral option, we will repeat what we said about it in the previous article:

It's easy to disparage [the deferral option] by describing it as letting temporal "garbage" onto the database, and then trying to hide that garbage from queries. But this is indeed the approach we have taken on temporal databases that we have designed, built, put into production and supported, and we had good reason for doing so. Like almost all database professionals, we were concerned that there would be an overwhelming volume of cascade updates if we tried to enforce temporal RI at time of update, especially when the update involves the delete of a versioned object that was the "parent" of possibly many other objects, in many other tables. (See the introduction to Part 10 for a more complete description of this "cascade update" fear.)

For the sake of continuity, we repeat the following definitions, which are especially important for understanding the material in this article. These, and other technical terms, are all defined in the glossary.

Temporal Referential Integrity: Definition

If there is a temporal RI dependency from table Y to version table X (not necessarily distinct), then no exposed state of the database is valid in which any row in table Y is not object-RI linked to a row in table X, or in which the effectivity time period of a row in table Y is not wholly contained within the effectivity time period of its parent row in table X. No queries are valid which would expose such a state.

Original and Temporal Transactions: Definitions

An "original" transaction is a transaction against a versioned object. A "temporal" transaction is the result of translating an original transaction into one or more transactions against one or more episodes of that versioned object.

Updating a Versioned Object: the Original Update Transaction

To the business, when they are issuing an update against an object, they don't care if the object is versioned or not, and, indeed, they may not know. They are directing us to apply the specified change to the object. Thus, an original update transaction doesn't need to specify a version, just the object being updated and the date the update is to take effect. This is illustrated in Figure 1.

As we have pointed out before, we are using dates throughout these articles only for convenience. The approach to versioning that we are presenting here applies no matter what the granularity of the clock ticks that measure out time. The granularity used here would be the correct granularity only for updates that take place in batch mode and are applied at most once per day. As soon as online transactional updates are involved, however, we would need a finer granularity, such as a full timestamp.

Figure 1: An Original Update Transaction

Update transactions can be submitted either as complete images of the updated row, or by specifying first the row, and then the column/value pairs for each column to be updated in that row. If these transactions are to become effective exactly when they are applied, the effectivity begin date in Figure 1 is not required. In those cases, an original update transaction against versioned data will be identical to the same update submitted against non-versioned data.

Updating a Versioned Object: the Existence Constraint

The original transaction we are considering is an update. Just as with an update to a non-versioned object, the first thing that must be done is to find the target row(s). Given that the target table is a versioned table, one target row must be the most recent version of the current episode of the object. The original transaction provides the EID for the object. The most recent version of the current episode of the object is the one version of that object with an effectivity begin date in the past and an effectivity end date either in the future, or containing the "temporal null" value "12/31/9999."

First of all, note the assumption that there cannot be more than one current version of an object in a versioned table. This assumption is guaranteed by the logic of inserts and deletes against versioned objects. This reflects our commonsense belief that at any point in time, there cannot be two different things that are true about an object.

When a user submits an update transaction against a non-versioned table, she is asserting that the indicated object already exists in that table. If the object doesn't already exist, the transaction is invalid. When a corresponding original transaction is submitted, i.e., when the update will be applied to a versioned table, the update logic must also ensure that the object already exists in that table. Specifically, the update logic must ensure that the object exists right now. Just as with non-versioned tables, it's not enough that the object may have existed in the table at some point in the past or will exist at some point in the future. It must exist right now. There must also be only one row in the table representing that object. In a versioned table, that means there must be only one current version of the object, where a current version is one whose effectivity begin date is in the past and whose effectivity end date is not.

If a current version of the indicated object is found, the update can proceed. Otherwise, it is rejected as invalid. (We consider updates which are explicitly against future versions in a later article. For now, we assume that all original transactions are against current versions.)

An original update of a versioned object requires us to supersede the most recent version of the current episode of the object (which is, by definition, the current version of the object) with an updated version, and also to do the same for any active future episodes of that object. An active future episode is a future episode whose most recent version is not a delete version.

Updating a Versioned Object: Temporal Update Transactions

Let's suppose that we have identified the most recent version of the current episode of the object specified on the original update, and also the most recent versions of any active future episodes of that object. We then add a temporal update transaction to each of these episodes, superseding the most recent version in each of them with a new most recent version.

Note that these temporal updates do not flag and rewrite these versions. That would be to do an update in place, which defeats the whole purpose of versioning by overwriting the past.

Temporal updates insert a new most recent version for each active episode. The effectivity end date on the pre-update most current version is set to one clock tick prior to the effectivity begin date on the version that is superseding it.

This is because an original update is a business transaction. Its intent is to place the updated information in the database, as of the indicated date. For example, if a client is being updated, the intent is that after the transaction is complete, the current description of that client is different than it was before the update, reflecting new information we have come across about the current description of that client.
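A minimal sketch of this supersession, using the same hypothetical Client_Version layout as in earlier sketches, with client_name standing in for the business data being updated, and assuming a clock tick of one day:

-- close the pre-update current version one clock tick before the new version begins
UPDATE Client_Version
SET eff_end_dt = :eff_begin_dt - INTERVAL '1' DAY
WHERE client_eid = :eid
  AND eff_begin_dt <= CURRENT_DATE
  AND (eff_end_dt > CURRENT_DATE OR eff_end_dt = '12/31/9999');

-- insert the new most recent version
INSERT INTO Client_Version (client_eid, eff_begin_dt, eff_end_dt, client_name)
VALUES (:eid, :eff_begin_dt, '12/31/9999', :client_name);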

But what about temporal RI? If we supersede a version of a client, what about the policies for that client that are temporally RI dependent on the client? And what about the claims for those policies that are temporally RI dependent on them?

Updating a Versioned Object: the Concern

As we've mentioned before, a pervasive fear among business IT professionals is that temporal RI enforcement on updates raises the likelihood of an extensive cascade effect, one we have called the "foreign key ripple effect" in other articles.

We know that normal, database management system (DBMS)-enforced RI doesn't just link child tables to parent tables. When using the set null or cascade options, we must also consider child tables to those child tables, and continue down the RI-dependency chain as far as it extends. Thus, in a situation where we have clients, policies and claims, a cascade or set null delete of a client may require one or more policies to be set null or deleted. And if any are, the DBMS must go on to set null or delete the claims dependent on those policies.

If these RI checks must be done on normal tables whenever an object is deleted, don't temporal RI checks need to be done on versioned tables whenever a version of an object is superseded?

Contrary to a widespread belief among data management professionals, the answer is that they do not need to be done whenever a version of an object is superseded. To understand why this fear is largely unfounded, note first of all that we implement temporal RI with object foreign keys (OFKs), not with normal DBMS-enforced FKs. OFKs in a temporally dependent table do not point to a specific parent row in the versioned table they are dependent on. They point only to the object they are dependent on. Thus, in a Policy table, each row has an OFK to the client whose policy it is. But the table the Policy table is temporally RI dependent on is a versioned table, and thus an OFK does not pick out a specific row in that table.

Initially, we might want to conclude that temporal updates have no effect on temporal RI at all, and that, just as with normal RI, a temporal update does not require any RI checking at all. However, this is not so. Most temporal updates have no effect on temporal RI, and thus incur no RI-related performance penalties. But there is one kind of temporal update that does require checking of dependent tables, and that may involve a lengthy chain of dependencies that must be checked so that the temporal update can either be blocked, or trigger a cascade of temporal updates, or trigger a cascade of temporal set nulls to the OFKs in the chain.

We will examine this temporal RI-triggering original update in Part 24. In the meantime, we recommend that our readers think about temporal updates, and try to figure out for themselves what kinds of temporal updates can trigger a temporal RI cascade and what kinds cannot.
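To see what an OFK looks like physically, here is a sketch of a possible Policy_Version layout; the column names and the primary key choice are ours, for illustration only:

CREATE TABLE Policy_Version (
    policy_eid    INTEGER NOT NULL,
    eff_begin_dt  DATE    NOT NULL,
    eff_end_dt    DATE    NOT NULL,
    client_eid    INTEGER NOT NULL,   -- OFK: identifies a client, not a row
    policy_type   CHAR(2),
    PRIMARY KEY (policy_eid, eff_begin_dt)
    -- deliberately no REFERENCES clause on client_eid: the DBMS cannot
    -- enforce temporal RI, so our own code must do it
);

Because client_eid does not identify a single row in the client version table, there is nothing for a DBMS-enforced foreign key to point at.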


Time and Time Again - Managing Time in Relational Databases, Part 24: Original and Temporal Updates (Concluded)

Tom Johnston, Randall Weis
DM Direct, April 18, 2008

Note: A glossary of technical terms used in these articles can be found on the websites of the authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.


One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines.

Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the discussion is not lost. We begin by updating our chart of installments, which last appeared in Part 13. (See PDF at the end of this article for a chart of installments to date in this series.)

With Part 19, we interrupted our presentation of the physical implementation of the versioning patterns that we developed in earlier articles in the series and began a discussion of temporal referential integrity (TRI). On completing Part 23, we had introduced and defined TRI and related concepts, and discussed TRI as it applies to deletes and updates. We now turn to TRI as it applies to inserts.

Updating a Versioned Object: When TRI Checking Does Not Come Into Play

Last time, we described a temporal update in which a current version of a client was superseded by a new current version in a client version table. We also alluded to a TRI dependent policy version table, in which every policy was owned by exactly one client. So if the client version that was superseded owned one or more policies, doesn't that mean that we must either block the supersession, set null the references in the policies, or else cascade update them to point to the new version of the client?

To raise the question is to begin to see the outline of the answer to it. Those policies (versioned or not) do not point to the superseded client version, because versioned tables are not referenced by foreign keys; they are referenced only by object foreign keys (OFKs), which point to the referenced object but do not point to any specific version of it.

Let's think about the situation in terms of what is going on in the real world, not simply in terms of what is going on in the database. In the real world, policies are owned by clients. As those policies change over time, or as the client that owns them changes over time, those policies are still owned by those clients. This last sentence, translated into the language of database activity against versioned tables, reads like this: even as versions of those policies are superseded by newer versions, or as versions of the client that owns them are superseded by newer versions, those policies are still owned by those clients. Versioning of either or both objects has no effect on the ownership relationship between them.

But because we are dealing with a temporally qualified relationship, there is a proviso. Versioning of either or both objects has no effect on the ownership relationship between them, provided that the superseding version of a TRI child (parent) object has an effectivity period that is wholly contained within (wholly contains) the effectivity period of all objects it is a TRI child of (a TRI parent of).

Note that this proviso does not distinguish between TRI-mandatory and TRI-optional relationships. It clearly applies to mandatory relationships, but at this time we are not sure whether or not it should also apply to optional ones.

It's time now to see what the "provided that" clause entails.

Updating a Versioned Object: When TRI Checking Does Come Into Play

Suppose we have a policy whose effectivity period is 1/1/2007 through 12/31/2007, and an owning client whose effectivity period is 1/1/2004 through 12/31/9999. Although "12/31/9999" means "until further notice," it is treated by the database management system as a real date. Thus, the client's effectivity period wholly contains that of the policy. This is shown in Figure 1.

Figure 1: A Client and Her Policy

The graphical conventions used in this illustration are:

1. The vertical bars constitute a timeline, with each bar representing one month. The timeline runs from Jan 1, 2004 through Dec 31, 2009.

2. A client and a policy episode are shown, each starting on its effectivity begin date.

3. The client episode contains four versions, the policy episode three.
   - All versions begin with left-most vertical bars.
   - Adjacent versions share a vertical bar.
   - Versions with a known effectivity end date have a right-most vertical bar indicating that end date. Thus, the most recent version of the policy episode has a 12/31/2007 effectivity end date.
   - Versions whose effectivity end date is not known have an ellipsis in place of the right-most vertical bar and always have an end date value of "12/31/9999."
   - We will call the former kind of episode "closed," and the latter kind "open."

4. {Now}, relative to the example, is indicated by the black box arrow, pointing to March 1, 2007. So, for purposes of discussing the example, we should assume that it is now March 1, 2007.

5. The "OFK"-labeled crow's foot notation indicates a TRI relationship from clients to policies. Note in particular what this is not a relationship between. It is not a relationship between:
   - A specific client and a specific policy.
   - A specific version of the client and a specific version of the policy.
   - A specific episode of the client and a specific episode of the policy.

   Just as a relationship type is defined on a pair of conventional tables, a temporal relationship type is defined on a pair of tables, at least one of which is temporal.

As long as these effectivity periods do not change, and the policy continues to have an OFK to its owning client, TRI constraints between these two rows remain satisfied. Thus, no updates that leave this OFK and these pairs of dates alone can violate TRI, and therefore no versioning caused by these updates needs to be accompanied by TRI checks. Consequently, there is no TRI performance penalty in such cases.

Suppose that it is now some time during 2007, say March 1, 2007. And suppose that the client's effectivity end date is moved back to 5/1/2007. If no other changes are made to that client, she will be temporally deleted on that date, which will then terminate her current episode. In addition, all versions of all future episodes will be temporally deleted. (Although we are postponing discussion of how future episodes are managed, recall that, as we explained earlier, original transaction authors will think of future episodes as future transactions that are queued up in a virtual "transaction batch file," with the promise that they will be applied on their effectivity date. From this point of view, temporally deleting all future episodes appears to be flushing those not-yet-applied transactions from that virtual batch file.)

Because this new version of the client "shrinks" the client's effectivity time period, TRI checks are required. In our example, there is a policy that is TRI dependent on that client, and that policy's effectivity end date is 12/31/2007. Since 5/1/2007 is earlier than 12/31/2007, this violates the TRI constraint that the effectivity period of a child version be wholly contained in the effectivity period of the corresponding parent version. From the point of view of the real world, it means that if we let the update proceed, the result would be a database which showed a policy in effect for eight months after its owning client was deleted. This situation, a violation of TRI, is shown in Figure 2. The period of time during which TRI is violated is indicated by a red block.


Figure 2: A TRI Violation
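A sketch of the check this shrinkage requires, in the same hypothetical schema: before moving the client's effectivity end date back to 5/1/2007, we look for dependent policy versions that the smaller effectivity period would temporally strand:

SELECT policy_eid
FROM Policy_Version
WHERE client_eid = :client_eid
  AND eff_end_dt > '5/1/2007'            -- the client's new, earlier end date

If this returns any rows (here, the policy version ending 12/31/2007), the temporal update must be blocked, or must trigger a cascade of temporal updates or temporal set nulls.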

This is an example of a change to the effectivity period of a parent object. If that effectivity period "shrinks," TRI checking is triggered on all its dependents to see that they are not "temporally stranded" by the shrinkage. There is obviously a correlative process when an update changes the effectivity period of a child object. In that case, if the effectivity period "expands," TRI checking is triggered on all its parents to see that they do not "temporally strand" themselves by extending their effectivity range beyond that of one of the objects they are dependent on.

On original updates to versioned objects, then, TRI checks must be done whenever there are TRI dependencies and effectivity periods are changed. Otherwise, original updates to versioned objects do not trigger TRI checks, and thus do not incur any TRI performance penalty.

There is one more situation in which a temporal update may trigger TRI checking. That is when a {null} OFK in the updated object takes on a real value, or when an existing OFK value is changed. In either case, TRI checking is required to ensure that the effectivity timespan of a version of the newly referenced object wholly encloses the effectivity timespan of the object whose OFK was updated.

With conventional RI, constraints exist to ensure that after any database update, there are no foreign keys with "dangling references," with values for which there is no corresponding primary key in the referenced table. With TRI, constraints exist to ensure that after any database update, there are no object foreign keys with "dangling references," with values for which there is no row in the referenced table for that object whose effectivity timespan wholly encloses the effectivity timespan of the referencing object.

New Terms Added to the Glossary

Adjacent versions.
Closed episode.
Dangling reference.
Expanding a time period.
Flushing the virtual batch file.
Open episode.
Shrinking a time period.
Temporally stranded objects.
TRI (as an acronym for "temporal referential integrity").
TRI mandatory relationship.
TRI optional relationship.
Virtual batch (transaction) file.

Our discussion of TRI, except for a consideration of how future versions are affected by it, is nearly done. Next time, we will discuss TRI for the temporal transactions resulting from original inserts and original upserts. Then we will go on to extend our discussion to consider the effect of TRI constraints on future versions of objects. As we shall see, those considerations are not insignificant.


Time and Time Again: Managing Time in Relational Databases, Part 25 - Original and Temporal Inserts

Tom Johnston, Randall Weis
DM Direct, May 2, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.


One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines.

Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the discussion is not lost. Here is what we've done so far. (See PDF below for Figure 1: Chart of Installments to Date in this Series.)

Beginning with Part 19, we interrupted our presentation of the physical implementation of the versioning patterns, which we described earlier in the series, and began a discussion of temporal integrity constraints. On completing Part 24, we had introduced and defined temporal RI and related concepts, and discussed these constraints as they apply to delete and to update transactions. In this and the following article, we turn to temporal integrity constraints as they apply to insert and to upsert transactions.

The "ultimate" versioning pattern we are working toward is what we, following the computer science community, call a "bi-temporal" pattern. In Parts 19 through 25, we have been discussing both temporal entity integrity and temporal referential integrity constraints. But it is important to note that these discussions have taken place in the context of a "uni-temporal" pattern. The only time period considered has been an effective time period, and the only dates an effective begin and effective end date.

As we have seen, integrity constraints in a uni-temporal versioning context have been far from simple. As we will see, integrity constraints in a bi-temporal versioning context are considerably more complex than that.

Inserting a Versioned Object: the Original Insert Transaction

To the business, when they are issuing an insert of an object, they don't care if the object is versioned or not, and indeed may not know. They are directing us to insert a row representing an object which is not already represented by a row in the target table. Thus, an original insert transaction doesn't need to specify a version, just the object being inserted, its business key, whether or not match logic should be applied, and the date the insert is to take effect. This is illustrated in Figure 2.

As we have pointed out before, we are using dates throughout these articles only for convenience. The approach to versioning which we are presenting here applies no matter what the granularity of the clock ticks that measure out time. The granularity used here would be the correct granularity for transactions taking place in batch mode and being applied at most once per day. As soon as online, individually applied transactions are involved, however, we would need a finer granularity, such as a full timestamp.

Figure 2: An Original Insert Transaction

Before we begin to examine an original insert transaction in detail, let's look specifically at the "Match? (Y/N)" flag in Figure 2.

Inserting a Versioned Object: Haven't We Met Before?

Objects of interest to an enterprise frequently appear, go away, and then reappear later on. One response to the fact of recurrence is to ignore it. When the object reappears, it is treated as a new object, and no attempt is made to match it up with the history we have for it.

But enterprises generally prefer to recognize a recurring object rather than treat it as a new object. For example, if the object is a customer, then if we don't attempt to determine whether or not she used to be a customer of ours, we won't be able to greet her and make her feel special, like a long-lost friend. We won't be able to say, "Welcome back. Glad to have you as a customer once again." We will also be unable to use the history of transactions we have with this customer to more intelligently manage the reestablished relationship. We know the kinds of things she likes to purchase from us. We may even know the kinds of advertising she is most responsive to. But unless we can recognize her as a returning customer, we will lose the competitive advantage that this past history can give us.

So the preferred response, if our target table is a versioned table, would be to look for a match of the object being inserted with a past version. Assuming that the primary keys of rows in the target table have not been back-propagated to their populating source systems, we must look for a match using the source system business keys of the objects. If we find a match, we will use the target table EID we found as the EID for the insert. This means, for a versioned target table, that it will be used as the EID for the first version of a new episode of the object. With this EID firmly attached to all the versions of all the episodes of the object, we retain the information we can glean from past episodes. We can say "Hi, and welcome back" to our returning customer. We will already know what she likes, and what she doesn't like.

But notice the [Match? (Y/N)] flag on the original transaction. Why do we need it? If we keep the business key on our versioned tables, and a business key is provided on the transaction, why can't we automatically check for a match?

In a perfect world, with objects that are assigned unique business keys, we would not need this match flag. But in the imperfect world of often dirty data that we IT professionals must manage, sometimes we can't rely on the business key. Sometimes we must even allow duplicate business keys, or keys with default values or {null}s in one or more columns, into our target tables.

For example, multiple independent source systems might provide insert, update and delete transactions to the same target table. The same business key might be used in both systems, and the two rows thus designated might or might not represent the same object. One way we could find ourselves in this situation is if the two source systems were, up to now, managed rather "loosely," as systems for quasi-independent organizations. Mergers and acquisitions often start off with this "loosely coupled" approach because it is much quicker and much easier to bring online than full integration. But when the customer bases of the acquiring and the acquired companies overlap, as they often will, it is inefficient to manage the customers in common as different customers. It is often an inefficiency visible to those customers as well, whose impressions of our company will certainly not be enhanced by this annoying inability of ours to recognize them as one and the same customer.

Another example would be overlapping product lines across the manufacturing plants of an acquiring and an acquired company. Here, too, there is a loosely coupled and a tightly coupled approach. Here, too, moving from the former to the latter will likely uncover latent business key collisions and make them fully explicit.

These M&A scenarios are often exacerbated by the not infrequent practice of using a system-generated sequence number as the last column of a business key. When this happens, the odds of identical cross-system business keys representing different objects quickly escalate. For example, if each of two systems has three business keys, each set identical except for the sequence number portion, and both sets identical in the non-sequence number portion, and if we assume that there are indeed exactly three customers involved, then only one of the 3! = 6 possible ways of pairing them up across the two systems is correct; the odds of a random matching being right are five to one, against.

Another way business keys can be compromised is when a source system uses a surrogate key, one which, for example, recycles every couple of years. By the time it recycles, all reference to the object it originally represented will have been archived out of that system. But if our target is a versioned table with a longer online lifecycle than the source table, then when the reused surrogate appears, it may still exist in the target table as an identifier for the previous object it designated.

In these ways, or in any of the other ways in which dirty data can make its way into the very business keys that identify the objects we are interested in, if we automatically search for a match on original insert transactions, we may inadvertently create synonyms, multiple rows which represent the same object.
For this reason, original inserts must indicate whether or not match logic should be applied to them. But because the same semantics often need to be provided even when the target table is not versioned, requiring the match flag does not "tip our hand" and indicate that the target is a versioned table. The distinction between versioned and conventional tables remains hidden from the users and IT personnel who provide the source transactions.
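A minimal sketch of the match step, assuming a hypothetical src_business_key column on the version table and a :match host variable carrying the transaction's Match? flag:

-- executed only when the original insert specifies Match? = 'Y'
SELECT DISTINCT client_eid
FROM Client_Version
WHERE src_business_key = :business_key

If an EID is found, it is reused as the EID for the first version of the new episode, so that the object's history is retained; otherwise a new EID is assigned.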

Unreliable Business Keys: Why Assume a No-Match?

If we always assume a no-match when dealing with unreliable business keys, the effect is to turn every source update transaction into an insert. The result quickly becomes a situation in which there are a large number of rows for each object that is frequently updated. In business IT, we look for ways to keep these numbers down, but it's not our purpose to describe those heuristics here.

But one could ask why the assumption must be that a match of unreliable business keys is a no-match. Perhaps application-specific circumstances mean that the odds are nine out of ten that such matching keys do in fact represent the same object. In that case, why not assume that they do?

The reason, as all reasons are in business IT, is a negative impact on the bottom line. In many cases, the cost of creating one mistaken homonym is higher than the cost of creating nine, or even a hundred, synonyms. In this case, "homonym" refers to one row representing multiple objects, and "synonym" refers to multiple rows representing the same object.

In these terms, the question, "Why assume a no-match?" can be rephrased as, "Why assume that homonyms are more costly than synonyms?" And the answer is that in general, they are. Here's why.

In the case of synonyms, when we discover one, we have to do what is called a "file merge." If two Client table rows are discovered to represent the same client, for example, we must replace them with a single row. As for source transactions, we must change the match logic to point pairs of transactions hitherto directed to different targets to the one new target which replaced them. As for RI dependencies, we must take all the children RI-dependent on either of the two original rows and change their foreign keys to point back to the one new row which replaced them. But notice that all these changes can be done with code. There is no need for human judgment in the revised match logic, or in the new RI dependencies.

But such is not the case when we are dealing with homonyms. When we discover a homonym, we must do what is called a "file split." Say we split one Client table row into two rows. The next time update transactions appear that had both been applied to the original row, how do we determine which of the two new clients to direct each one to? By the very fact that the homonym existed to begin with, we know that source systems have thus far failed to make the distinction. So the match logic must be updated to make a discrimination it previously did not, or could not, make. As for RI dependencies, with a single parent client row replaced by two (or more) client rows, how are we to know which new parent to redirect each RI child to?

Occasionally, the means to make these new discriminations will exist in the data, and so the new logic can be fully automated. But far more frequently, those means do not exist in the data. In that case, file splits can only be done if we can devote human resources to the one-time task of fixing foreign keys, and to the recurring task of augmenting match logic to make discriminations that cannot be made on the basis of the data alone.

In short, doing less discriminatory work is easy; doing more is hard. That is why homonyms cost more than synonyms. And that, finally, is why, in the absence of cost/benefit data to the contrary, we should treat unreliable source system business keys by permitting duplicate rows into the target tables.

Sometimes, of course, a cost/benefit analysis in a specific situation will indicate that it is better to create the occasional homonym than it is to create a flurry of synonyms. But the default assumption always has to be that homonyms are to be avoided, and synonyms reluctantly tolerated.

We have spent this much time on the business key match problem because it is so important. In the past, the problem was often ignored, and each doubtful situation treated as an insert, as a new-object situation. The reason was always the difficulty of producing reliable matches in the presence of match criteria data of poor quality, data which often came from outside our enterprise and so was beyond our ability to clean up, or in the absence of common match criteria across different source systems.

But throwing up our hands and populating target tables with an unknown and possibly significant number of duplicate rows, rows which represent the same object, is increasingly unacceptable. Increasingly, businesses believe that the high cost of solving the problem, or at least reducing its impact, is less than the cost of living with it. We have several times used the example of merging customer files in the presence of unreliable source system keys. We chose this example because the high cost of living with un-merged customer data is most apparent in customer-facing systems.

We will have more to say about a general approach to keeping the cost of duplicate-containing tables to a minimum in a later set of articles. We think there is something of value to describe that will apply to nearly all situations in which unreliable business keys must be matched. While this is not a problem exclusive to versioned tables, neither is it a problem from which versioned tables are exempt.

In the meantime, we will turn our attention back to inserts and upserts to temporal tables in our next article.


Time and Time Again - Managing Time in Relational Databases, Part 26: Points in Time and Periods of Time

Tom Johnston, Randall Weis
DM Direct, May 16, 2008

Note: A glossary of technical terms used in these articles can be found on the Web sites of the authors. The URLs are MindfulData.com and InbaseInc.com.

This glossary is being constructed as a controlled vocabulary. By that we mean that any expression involving temporal concepts which is used in the definition of a glossary entry must itself be a glossary entry.


One reason for being this careful with definitions is that this kind of rigor must be applied to definitions before they can be "ontologized," i.e., expressed in a formal notation like first-order logic. That, in turn, is what is needed to make definitions available for manipulation by software-realized inferencing engines.

Another reason for being this careful with definitions is that this is the kind of rigor that must be applied to any set of technical definitions before we can claim not only that we say what we mean, but also that we know what we mean when we say it.

As this series continues, context becomes increasingly important so that the thread of the discussion is not lost. Figure 1 reviews what we've done so far.

Figure 1: Chart of Installments to Date in the Series (See PDF link at the end of the article.)

Last time, we began looking at original and temporal insert transactions but immediately digressed into a discussion of the problems that can be encountered when matching transactions to target tables. We planned to discuss both temporal entity integrity and temporal referential integrity as they apply to insert transactions in this installment. But we discovered that those discussions require another digression, this time into how we use pairs of dates to represent periods of time.

We also discovered that we have vacillated in this series between two methods of using pairs of dates to represent periods of time, methods which are called the "closed-closed" and "closed-open" approaches. Two other methods are possible: "open-closed" and "open-open." But we won't enter into an extensive discussion of all four methods, because that has already been covered both by Snodgrass and by Darwen and Lorentzos.1,2


There is both a practical and a theoretical side to the question of how to represent time periods with dates. On the theoretical side, dates (or datetimes, or timestamps) are discrete points along a continuous timeline. And as Zeno first discovered, and Leibniz and Newton later formalized, the relationship between the continuous and the discrete is problematic, equally so for time as for space.

Points in Time and Periods of Time: Practical Issues

Consider the following two versions, V1 and V2, of the same object. Let us suppose that there is no gap in time between them, i.e., no gap along the "effectivity timeline." How are we to represent this? Two of the possibilities, the two specific ones we have vacillated between, are shown in Figures 2 and 3.

We also assume, in these examples, that the clock tick granularity is one day.

In both cases, the first effective time period starts on 2/19/06 and ends on 5/22/07, and the second starts on 5/23/07 and ends on 10/14/09. So it might seem that the closed-closed representation is the correct one, and that the closed-open representation is simply wrong. But that is not the case.

We could try to illustrate the wrongness of the closed-open representation by querying for the time period that contains the date 5/23/07. On the closed-open representation, can we tell which version's time period is the desired one? Yes, we can. We just need the following WHERE clause in our SQL query:

WHERE EFFECTIVE_BEGIN_DATE <= '5/23/07'
  AND '5/23/07' < EFFECTIVE_END_DATE

With this clause, the query will correctly pick out V2.

So why might we have thought that the closed-open representation is wrong? Probably because we had the mental image of our desired date being between the begin and end dates of the versions. But "between" as we ordinarily understand it is not "BETWEEN" as SQL understands it.

We would not have gotten confused, of course, if we had used the closed-closed approach. In that case, we could have written:


WHERE "5/23/07" BETWEEN EFFECTIVE-BEGIN-DATE AND EFFECTIVE-END-DATE But because both clauses return the correct result, provided each is used with its correspondingmethod of representing periods of time, both methods are equally correct. So in our own vacillation between these two methods, it was this consideration which led us to(sometimes) describe the closed-closed method as the one we preferred. What, then, is the advantage of using the closed-open method? Well, look again at Figures 2 and3. In both cases, V1 and V2 are contiguous. We know this because we have set up the exampleon the assumption that there is no gap in time between them. With the closed-openrepresentation, however, we can instantly see that there is no such gap. But with the closed-closed representation, we would also need the information that the clock tick granularity beingused is a granularity of one day. The real problem with the closed-closed representation, however, is not what is or is notimmediately clear to us human beings. It is the relative complexity of the code which will often becalled upon to determine, of two consecutive versions, whether or not they are contiguous, i.e.,whether or not there is a clock tick between them. With a closed-closed representation, that codewill also have to know what the clock tick granularity is. So in our own vacillation between these two methods, it was this consideration which led us to(sometimes) describe the closed-open method as the one we preferred. Once the SQL standards groups can agree on temporal extensions to the standard, part of thatagreement will certainly be a way of representing time periods directly, without relying on theconfusing circumlocutions of various ways to use pairs of dates to represent time periods. But ourconcern, in these articles, is with today's SQL, and so we must choose a date-pair method ofrepresentation. Therefore, from this point forward, unless we discover other considerations whichwould dramatically tip the scale the other way, we will use the closed-open representation of timeperiods.

Points in Time and Periods of Time: Theoretical Issues

During our "closed-closed" phase, some time ago, one of our readers wrote us to say that closed-closed was simply wrong, for what we will call "theoretical" reasons. He wrote:

A time period most definitely does not end one clock tick prior to the end date. It ends at precisely the time specified as the end date… If the period ends at 12:10:15 then when the clock strikes 12:10:15 the period is over THEN and the next period begins precisely THEN. At that instant…

If you don't do it that way then you are saying that date/time fields that are used for the beginning times are interpreted as meaning the beginning of the stated time period and date/time fields used for ending times are interpreted a DIFFERENT way, namely as indicating the end of the stated time period. One can't do things that way, i.e., changing the definition of a data type based on the context.

This reader also has an argument that is practical in nature, namely that with the closed-open method, date arithmetic will always involve a "minus 1" clause that would not be required on the closed-closed approach. But as we have seen, there are practical concerns no matter how we choose to use date-pairs to represent time periods, and these concerns are therefore not conclusive.

Our first response to our reader's theoretical argument, briefly, is this. The closed-open approach does not "chang(e) the definition of a data type." If we changed the definition of a data type depending on which columns we applied it to, then the DBMS or programming language that accessed those columns would not be able to handle them correctly. So we are not changing the definition of the data type. The issue, instead, is what we take each date to mean. And as long as we write our SQL carefully (see the WHERE clauses, above), we will get the correct results no matter which date-pair interpretation we choose.

Our second response is that decades of computer science research have shown that we can and should represent the flow of time as a succession of atomic clock ticks, ticks which may be of any actual length of time, such as a day, a nanosecond, etc. But no one ever has, and no one ever will, invent a clock so precise that there is no time at all between its ticks. Only if we had such a clock could we use clock ticks to represent time periods in a manner that conforms to the intuitions of this reader, namely that "If the period ends at 12:10:15 then when the clock strikes 12:10:15 the period is over THEN and the next period begins precisely THEN. At that instant."

This interpretation creates an ambiguity in the use of date-pairs to delimit time periods that can never be resolved. For on this interpretation, two adjacent versions overlap at exactly that tick of the clock. On that tick of the clock, this interpretation requires us to say that both V1 and V2 are in effect.

Without going into any detail about the theoretical underpinnings of these arguments, we can make the following points.

First, at the Planck scale, both space and time are, as far as we can measure them, discrete, with a unit of Planck time (5.3906 × 10^-44 seconds) being the time it takes for a photon of light to travel the Planck distance (1.6160 × 10^-35 meters). This is too theoretically remote to be relevant to a discussion of time as far as IT is concerned. However, it does cast considerable doubt, obviously, on the hope that we will ever have a clock whose ticks do not take up a finite period of time.

Second, quantum physics considerations aside, both space and time are continuous, but neither points (in space) nor instants (points in time) are. The ensuing conceptual difficulties were first noted by Zeno, who described them in his paradox of Achilles and the tortoise. The problem of correctly representing something continuous by means of something discrete was ultimately solved, for space at least, by Leibniz and Newton, and their concept of a limit applies equally well to time. But this is still too remote for its relevance to managing time in relational databases to be apparent. For a good basic discussion of these issues, however, we recommend Sowa 2000.3
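The unresolvable overlap can be seen directly in SQL. A minimal sketch, assuming a hypothetical closed-closed table in which, following the reader's interpretation, V1 ends at exactly the instant at which V2 begins:

CREATE TABLE Policy_Version_CC (
    version_nbr          INTEGER NOT NULL,
    effective_begin_date DATE    NOT NULL,  -- inclusive
    effective_end_date   DATE    NOT NULL   -- also inclusive, on this reading
);

-- V1 ends, and V2 begins, on the same clock tick.
INSERT INTO Policy_Version_CC VALUES (1, DATE '2006-02-19', DATE '2007-05-23');
INSERT INTO Policy_Version_CC VALUES (2, DATE '2007-05-23', DATE '2009-10-14');

-- Which version was in effect on 5/23/07?
SELECT version_nbr
FROM   Policy_Version_CC
WHERE  DATE '2007-05-23'
       BETWEEN effective_begin_date AND effective_end_date;

-- Both rows qualify: on the shared clock tick, V1 and V2 are both "in
-- effect," and no query can say which one was really in effect then.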

Finally, though, the concepts necessary to correctly manage the relationships among time periods represented as pairs of dates were originally formulated by James F. Allen in the mid-1980s. References to the earliest of the three articles in which Allen explained his concepts, together with an excellent exposition of this material, can be found in both Snodgrass 2000 and Darwen and Lorentzos 2002.
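Although we will not present Allen's relationships here, a sketch of how three of them translate into SQL predicates over closed-open date pairs may be suggestive. The table is again our hypothetical Policy_Version; the relationship names are Allen's.

SELECT v1.version_nbr AS p1,
       v2.version_nbr AS p2,
       CASE
         -- p1 ends strictly before p2 begins: a gap lies between them.
         WHEN v1.effective_end_date < v2.effective_begin_date
           THEN 'before'
         -- p1 ends exactly where p2 begins: contiguous, no gap, no overlap.
         WHEN v1.effective_end_date = v2.effective_begin_date
           THEN 'meets'
         -- p1 starts first, and the two periods share clock ticks.
         WHEN v1.effective_begin_date < v2.effective_begin_date
          AND  v2.effective_begin_date < v1.effective_end_date
          AND  v1.effective_end_date   < v2.effective_end_date
           THEN 'overlaps'
         ELSE 'some other Allen relationship'
       END AS allen_relationship
FROM   Policy_Version v1
JOIN   Policy_Version v2
  ON   v1.policy_id   = v2.policy_id
 AND   v1.version_nbr < v2.version_nbr;

Run against the V1 and V2 rows inserted earlier, this query reports that V1 meets V2, which is just the contiguity test shown above.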

With our digressions completed, we will proceed to a discussion of original and temporal inserts in our next article, and will explain how the temporal correlates of entity integrity and referential integrity apply to them.

References:

1. Date, C. J., Hugh Darwen, and Nikos Lorentzos. Temporal Data and the Relational Model. Morgan Kaufmann, 2002, pp. 90-94.

2. Snodgrass, R. T. Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann, 2000.

3. Sowa, John F. Knowledge Representation. Brooks/Cole, Pacific Grove, CA, 2000, pp. 103-124.
