TRANSCRIPT
UNIT 1
MODULE 1: INTRODUCTION
Database systems – Definition – Components – Advantages – Objectives – Evolution.
Definition : DBMS
A database management system (DBMS) is a system that allows users to build and maintain
databases, as well as to utilize their data and retrieve information from them. A DBMS
defines the database type that it supports, as well as its functionality and operational
capabilities. A DBMS provides the internal processes for the external applications built on
it.
A DBMS consists of software that operates databases, providing storage, access,
security, backup and other facilities to meet the needed requirements.
A database management system (DBMS) is a collection of programs that manages
the database structure and controls access to the data stored in the database. In a sense,
a database resembles a very well-organized electronic filing cabinet in which powerful
software, known as a database management system, helps manage the cabinet’s
contents.
Well known DBMSs include Oracle, IBM DB2, Microsoft SQL Server, PostgreSQL,
MySQL and SQLite.
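To make the definition concrete, here is a minimal sketch using SQLite (one of the DBMSs listed above) through Python's built-in sqlite3 module; the student table and its rows are invented for the example:

```python
import sqlite3

# Open an in-memory SQLite database; SQLite is an embedded DBMS.
conn = sqlite3.connect(":memory:")

# The DBMS manages the database structure: a table with typed columns.
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT, course TEXT)")

# The DBMS controls access to the stored data.
conn.execute("INSERT INTO student VALUES (10, 'Asha', 'Computer')")
conn.execute("INSERT INTO student VALUES (11, 'Ravi', 'Accounts')")

# Retrieve information from the data.
for row in conn.execute("SELECT name, course FROM student ORDER BY roll_no"):
    print(row)

conn.close()
```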
Evolution of database and DBMS technology
The database concept has evolved since the 1960s to ease increasing difficulties in
designing, building, and maintaining complex information systems (typically with
many concurrent end-users, and with a diverse large amount of data). It has evolved
together with database management systems which enable the effective handling of
databases. Though the terms database and DBMS define different entities, they are
inseparable: a database's properties are determined by its supporting DBMS and vice-
versa. The Oxford English Dictionary cites a 1962 technical report as the first to use the
term "data-base". With the progress in technology in the areas of processors, computer
memory, computer storage, and computer networks, the sizes, capabilities, and
performance of databases and their respective DBMSs have grown by orders of
magnitude. For decades now it has been unlikely that a complex information system could
be built effectively without a proper database supported by a DBMS. The utilization
of databases is now spread to such a wide degree that virtually every technology and
product relies on databases and DBMSs for its development and commercialization, or
even may have such embedded in it. Also, organizations and companies, from small to
large, heavily depend on databases for their operations.
The introduction of the term database coincided with the availability of direct-access
storage (disks and drums) from the mid-1960s onwards. The term represented a
contrast with the tape-based systems of the past, allowing shared interactive use rather
than daily batch processing.
In the earliest database systems, efficiency was perhaps the primary concern, but it
was already recognized that there were other important objectives. One of the key aims
was to make the data independent of the logic of application programs, so that the
same data could be made available to different applications.
The first generation of database systems were navigational: applications typically
accessed data by following pointers from one record to another. The two main data
models at this time were the hierarchical model, epitomized by IBM's IMS system,
and the Codasyl model (Network model), implemented in a number of products such
as IDMS.
The Relational model, first proposed in 1970 by Edgar F. Codd, departed from this
tradition by insisting that applications should search for data by content, rather than by
following links. This was considered necessary to allow the content of the database to
evolve without constant rewriting of applications. Relational systems placed heavy
demands on processing resources, and it was not until the mid 1980s that computing
hardware became powerful enough to allow them to be widely deployed. By the early
1990s, however, relational systems were dominant for all large-scale data processing
applications, and they remain dominant today (2012) except in niche areas. The
dominant database language is the standard SQL for the Relational model, which has
influenced database languages also for other data models.
Because the relational model emphasizes search rather than navigation, it does not
make relationships between different entities explicit in the form of pointers, but
rather represents them using primary keys and foreign keys. While this is a good basis
for a query language, it is less well suited as a modelling language. For this reason a
different model, the Entity-relationship model which emerged shortly later (1976),
gained popularity for database design.
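As a small illustration of keys standing in for pointers (a sketch with an invented two-table schema, again using Python's sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        emp_name TEXT,
        dept_id  INTEGER REFERENCES department(dept_id)  -- a key value, not a stored pointer
    );
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Research');
    INSERT INTO employee VALUES (100, 'Meena', 1), (101, 'John', 2);
""")

# The relationship is recovered by matching key values at query time,
# not by following pointers placed when the records were stored.
for row in conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM employee e JOIN department d ON e.dept_id = d.dept_id
"""):
    print(row)
```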
In the period since the 1970s database technology has kept pace with the increasing
resources becoming available from the computing platform: notably the rapid increase
in the capacity and speed (and reduction in price) of disk storage, and the increasing
capacity of main memory. This has enabled ever larger databases and higher
throughputs to be achieved.
The rigidity of the relational model, in which all data is held in tables with a fixed
structure of rows and columns, has increasingly been seen as a limitation when
handling information that is richer or more varied in structure than the traditional
'ledger-book' data of corporate information systems: for example, document databases,
engineering databases, multimedia databases, or databases used in the molecular
sciences. Various attempts have been made to address this problem, many of them
gathering under banners such as post-relational or NoSQL. Two developments of note
are the Object database and the XML database. The vendors of relational databases
have fought off competition from these newer models by extending the capabilities of
their own products to support a wider variety of data types.
General-purpose DBMS
A DBMS has evolved into a complex software system and its development typically
requires thousands of person-years of development effort. Some general-purpose
DBMSs, like Oracle, Microsoft SQL server, and IBM DB2, have been undergoing
upgrades for thirty years or more. General-purpose DBMSs aim to satisfy as many
applications as possible, which typically makes them even more complex than special-
purpose databases. However, the fact that they can be used "off the shelf", as well as
their amortized cost over many applications and instances, makes them an attractive
alternative (Vs. one-time development) whenever they meet an application's
requirements.
Though attractive in many cases, a general-purpose DBMS is not always the optimal
solution: When certain applications are pervasive with many operating instances, each
with many users, a general-purpose DBMS may introduce unnecessary overhead and
too large "footprint" (too large amount of unnecessary, unutilized software code).
Such applications usually justify dedicated development. Typical examples are email
systems, even though they need to possess certain DBMS properties: email systems are
built in a way that optimizes the handling and management of email messages, and they
do not need significant portions of general-purpose DBMS functionality.
ADVANTAGES OF DBMS
The DBMS (Database Management System) is preferred over the conventional file
processing system due to the following advantages:
1. Controlling Data Redundancy - In the conventional file processing system, every
user group maintains its own files for handling its data. This may lead to:
• Duplication of the same data in different files.
• Wastage of storage space, since duplicated data is stored.
• Errors may be generated due to updating of the same data in different files.
• Time is wasted in entering the same data again and again.
• Computer resources are used needlessly.
• It is very difficult to combine information.
2. Elimination of Inconsistency - In the file processing system information is
duplicated throughout the system, so changes made in one file may need to be
carried over to another file. This may lead to inconsistent data. So we need to remove
this duplication of data in multiple files to eliminate inconsistency.
For example, let us consider a student result system. Suppose that in the
STUDENT file it is indicated that Roll No. 10 has opted for the 'Computer' course, but in the
RESULT file it is indicated that Roll No. 10 has opted for the 'Accounts' course. Thus,
in this case the two entries for a particular student do not agree with each other, and the
database is said to be in an inconsistent state. So to eliminate this conflicting
information we need to centralize the database. On centralizing the database the
duplication will be controlled and hence inconsistency will be removed.
Data inconsistencies are often encountered in everyday life. Consider another
example: we have all come across situations when a new address is communicated to
an organization that we deal with (e.g. a telecom company, gas company, or bank). We
find that some of the communications from that organization are received at the new
address while others continue to be mailed to the old address. Combining all the data
in a database involves a reduction in redundancy as well as inconsistency, so it is likely
to reduce the costs of collecting, storing and updating data.
Let us again consider the example of Result system.
Suppose that a student with Roll No. 201 changes his course from 'Computer' to
'Arts'. The change is made in the SUBJECT file but not in the RESULT file. This may
lead to inconsistency of the data. So we need to centralize the database so that changes
once made are reflected in all the tables where a particular field is stored. The
update is then applied automatically; this is known as propagating updates.
3. Better service to the users - A DBMS is often used to provide better services to
the users. In a conventional system, availability of information is often poor, since it is
normally difficult to obtain information that the existing systems were not designed
for. Once several conventional systems are combined to form one centralized
database, the availability of information and its timeliness are likely to improve, since
the data can now be shared and the DBMS makes it easy to respond to anticipated
information requests.
Centralizing the data in the database also means that user can obtain new and
combined information easily that would have been impossible to obtain otherwise.
Also, the use of a DBMS should allow users who do not know programming to interact with
the data more easily, unlike a file processing system where the programmer may need to
write new programs to meet every new demand.
4. Flexibility of the System is Improved - Since changes are often necessary to the
contents of the data stored in any system, these changes are made more easily in a
centralized database than in a conventional system. Application programs need not
be changed when the data in the database changes.
5. Integrity can be improved - Since the data of an organization using the database
approach is centralized and is used by a number of users at a time, it is
essential to enforce integrity constraints.
In conventional systems, because the data is duplicated in multiple files, updates or
changes may sometimes lead to the entry of incorrect data in some of the files where
it exists.
For example, consider the result system that we have already discussed. Since
multiple files are maintained, you may sometimes enter a value for course
that does not exist. Suppose course can have the values (Computer, Accounts,
Economics, and Arts) but we enter the value 'Hindi' for it; this leads to
inconsistent data and hence a lack of integrity.
Even if we centralize the database it may still contain incorrect data. For example:
• The salary of a full-time employee may be entered as Rs. 500 rather than Rs. 5000.
• A student may be shown to have borrowed books but has no enrolment.
• A list of employee numbers for a given department may include a number of
nonexistent employees.
These problems can be avoided by defining validation procedures that are checked
whenever any update operation is attempted.
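In SQL such validation procedures are typically declared as constraints. A sketch, reusing the invented course list from the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The constraint is checked by the DBMS on every insert or update.
conn.execute("""
    CREATE TABLE result (
        roll_no INTEGER PRIMARY KEY,
        course  TEXT CHECK (course IN ('Computer', 'Accounts', 'Economics', 'Arts'))
    )
""")
conn.execute("INSERT INTO result VALUES (10, 'Computer')")    # accepted
try:
    conn.execute("INSERT INTO result VALUES (11, 'Hindi')")   # rejected by the DBMS
except sqlite3.IntegrityError as e:
    print("update rejected:", e)
```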
6. Standards can be enforced - Since all access to the database must be through
DBMS, so standards are easier to enforce. Standards may relate to the naming of data,
format of data, structure of the data etc. Standardizing stored data formats is usually
desirable for the purpose of data interchange or migration between systems.
7. Security can be improved - In conventional systems, applications are developed in
an ad hoc/temporary manner. Often different systems of an organization access
different components of the operational data; in such an environment enforcing
security can be quite difficult. Setting up a database makes it easier to enforce
security restrictions since the data is now centralized. It is easier to control who has
access to what parts of the database. Different checks can be established for each type of
access (retrieve, modify, delete etc.) to each piece of information in the database.
Consider an example of banking in which employees at different levels may be
given access to different types of data in the database. A clerk may be given the
authority to know only the names of all the customers who have a loan in the bank, but
not the details of each loan a customer may have. This can be accomplished by giving
privileges to each employee.
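Standard SQL expresses such privileges with GRANT statements; SQLite (used for the sketches here) has no user accounts, so the sketch below shows the companion technique of exposing only a restricted view. The loan table and its contents are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loan (loan_no INTEGER PRIMARY KEY, customer_name TEXT, amount REAL);
    INSERT INTO loan VALUES (1, 'Asha', 250000.0), (2, 'Ravi', 40000.0);

    -- The clerk's window onto the data: customer names only, no loan details.
    CREATE VIEW loan_customers AS SELECT customer_name FROM loan;
""")
# In a multi-user DBMS the clerk would then be granted access to the view
# only, e.g.  GRANT SELECT ON loan_customers TO clerk;  (standard SQL).
print(conn.execute("SELECT * FROM loan_customers").fetchall())
```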
8. Organization's requirement can be identified - All organizations have sections
and departments and each of these units often consider the work of their unit as the
most important and therefore consider their need as the most important. Once a
database has been set up with centralized control, it will be necessary to identify the
organization's requirements and to balance the needs of the competing units. So it
may become necessary to ignore some requests for information if they conflict with
higher priority need of the organization.
It is the responsibility of the DBA (Database Administrator) to structure the database
system to provide the overall service that is best for an organization.
For example, a DBA must choose the best file structure and access method to give fast
responses for highly critical applications as compared to less critical applications.
9. Overall cost of developing and maintaining systems is lower - It is much easier
to respond to unanticipated requests when data is centralized in a database than when
it is stored in a conventional file system. Although the initial cost of setting up a
database can be large, one normally expects the overall cost of setting up the database and
developing and maintaining application programs to be far lower than for similar
services using conventional systems, since the productivity of programmers can be
higher when using the non-procedural languages that have been developed with DBMSs
than when using procedural languages.
10. Data Model must be developed - Perhaps the most important advantage of
setting up a database system is the requirement that an overall data model for the
organization be built. In conventional systems, it is more likely that files will be
designed as the needs of particular applications demand. The overall view is often not
considered. Building an overall view of an organization's data is usually cost effective
in the long term.
11. Provides backup and Recovery - Centralizing a database provides schemes
for backup and recovery from failures, including disk crashes, power failures and
software errors, which help the database recover from an inconsistent state to
the state that existed prior to the occurrence of the failure, though the methods are
very complex.
Disadvantages of Database
Although the database system yields considerable advantages over previous data
management approaches, database systems do carry significant disadvantages. For
example:
1. Increased costs. Database systems require sophisticated hardware and software and highly skilled
personnel. The cost of maintaining the hardware, software, and personnel required to
operate and manage a database system can be substantial. Training, licensing, and
regulation compliance costs are often overlooked when database systems are
implemented.
2. Management complexity. Database systems interface with many different technologies and have a significant
impact on a company’s resources and culture. The changes introduced by the adoption
of a database system must be properly managed to ensure that they help advance the
company’s objectives. Given the fact that database systems hold crucial company data
that are accessed from multiple sources, security issues must be assessed constantly.
3. Maintaining currency. To maximize the efficiency of the database system, you must keep your system
current. Therefore, you must perform frequent updates and apply the latest patches
and security measures to all components. Because database technology advances
rapidly, personnel training costs tend to be significant.
4. Vendor dependence. Given the heavy investment in technology and personnel training, companies might
be reluctant to change database vendors. As a consequence, vendors are less likely to
offer pricing point advantages to existing customers, and those customers might be
limited in their choice of database system components.
5. Frequent upgrade/replacement cycles. DBMS vendors frequently upgrade their products by adding new functionality. Such
new features often come bundled in new upgrade versions of the software. Some of
these versions require hardware upgrades. Not only do the upgrades themselves cost
money, but it also costs money to train database users and administrators to properly
use and manage the new features.
The major characteristics of the database approach are:
• Self-describing Nature of a Database System
• Insulation between Programs and Data, and Data Abstraction
• Support of Multiple Views of the Data
• Sharing of Data and Multi user Transaction Processing
Self-describing nature
A DBMS contains not only the data but also a full description of the data (called "metadata").
Metadata is data about data: data formats, record structures, locations, how to
access, indexes. Metadata is stored in a catalogue and is used by the DBMS software to
know how to access the data. Contrast this with the file processing approach, where
application programs need to know the structure and format of records and data.
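The catalogue is itself queryable. In SQLite, for example, the metadata lives in the sqlite_master table (other DBMSs expose a comparable system catalogue or information schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")

# Querying the catalogue: the DBMS describes its own contents.
for row in conn.execute("SELECT type, name, sql FROM sqlite_master"):
    print(row)
```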
Program-data independence
Data independence is immunity of application programs to changes in storage
structures and access techniques, e.g. adding a new field, changing an index structure,
or changing a data format. In a DBMS environment these changes are reflected in the
catalogue. Applications aren't affected. Traditional file processing programs would all
have to change, possibly substantially.
Data abstraction
A DBMS provides users with a conceptual representation of data (for example, as
objects with properties and inter-relationships). Storage details are hidden. Conceptual
representation is provided in terms of a data model.
Support for multiple views
A DBMS may allow different users to see different "views" of the DB, according to the
perspective each one requires. E.g. a subset of the data: for example, the people
using the payroll system need not (and should not) see data about students and class
schedules. E.g. data presented in a different form from the way it is stored: for
example, someone interested in student transcripts might get a view which is formed
by combining information from separate files or tables.
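A sketch of that second kind of view (invented student/result tables), where a "transcript" view presents data in a form different from the way it is stored:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE result  (roll_no INTEGER, course TEXT, grade TEXT);
    INSERT INTO student VALUES (10, 'Asha');
    INSERT INTO result  VALUES (10, 'Computer', 'A'), (10, 'Economics', 'B');

    -- A view combining information from two separate tables.
    CREATE VIEW transcript AS
        SELECT s.name, r.course, r.grade
        FROM student s JOIN result r ON s.roll_no = r.roll_no;
""")
print(conn.execute("SELECT * FROM transcript").fetchall())
```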
Centralized control of the data resource
The DBMS provides centralized control of data in an organization.
This brings a number of advantages:
(a) Reduces redundancy
(b) Avoids inconsistencies
(c) Data can be shared
(d) Standards can be enforced
(e) Security restrictions can be applied
(f) Integrity can be maintained
Redundancy and Inconsistencies
Redundancy is unnecessary duplication of data. For example if accounts department
and registration department both keep student name, number and address.
Redundancy wastes space and duplicates effort in maintaining the data.
Redundancy also leads to inconsistency.
Inconsistent data is data which contradicts itself - e.g. two different addresses for a
given student number. Inconsistency cannot occur if data is represented by a single
entry (i.e. if there is no redundancy).
Controlled redundancy: Some redundancy may be desirable (for efficiency). A
DBMS should be aware of it, and take care of propagating updates to all copies of a
data item.
This is an objective that is not yet fully supported by current systems.
Sharing
• Need concurrency control
• Multiple user views
Standards
E.g. data formats, record structures, naming, documentation
International, organizational, departmental ... standards
Security
- restricting unauthorized access
DBMS should perform security checks on all accesses.
Integrity
Maintaining validity of data;
E.g. employee numbers must be in some range.
E.g. every course must have an instructor.
E.g. student numbers must be unique.
E.g. hours worked cannot be more than 150.
These things are expressed as constraints.
The DBMS should perform integrity checks on all updates. Currently DBMSs provide
only limited integrity checks.
MODULE 2:
DBMS Architecture – Associations – Relationships – Mappings between Abstractions –
Generalisation – Classifications – Conceptual Data modeling – File Organization –
Data Structure – Data models: HDBMS, NDBMS, RDBMS, OODBMS.
DBMS Architecture
Components of a DBMS
Database Management System (DBMS)
DBMSs are the technology tools that directly support managing organizational data.
With a DBMS you can create a database including its logical structure and constraints,
you can manipulate the data and information it contains, or you can directly create a
simple database application or reporting tool. Human administrators, through a user
interface, perform certain tasks with the tool such as creating a database, converting
an existing database, or archiving a large and growing database. Business applications,
which perform the higher level tasks of managing business processes, interact with
end users and other applications and, to store and manage data, rely on and directly
operate their own underlying database through a standard programming interface like
ODBC.
The following diagram illustrates the five components of a DBMS.
Database Engine:
The Database Engine is the core service for storing, processing, and securing data.
The Database Engine provides controlled access and rapid transaction processing to
meet the requirements of the most demanding data consuming applications within
your enterprise. Use the Database Engine to create relational databases for online
transaction processing or online analytical processing data. This includes creating
tables for storing data, and database objects such as indexes, views, and stored
procedures for viewing, managing, and securing data. You can use SQL Server
Management Studio to manage the database objects, and SQL Server Profiler for
capturing server events.
Data dictionary:
A data dictionary is a reserved space within a database which is used to store
information about the database itself. A data dictionary is a set of tables and views
which can only be read and never altered. Most data dictionaries contain different
information about the data used in the enterprise. In terms of the database
representation of the data, the data dictionary defines all schema objects including views,
tables, clusters, indexes, sequences, synonyms, procedures, packages, functions,
triggers and many more. This ensures that all these objects follow one standard
defined in the dictionary. The data dictionary also records how much space has been
allocated for and/or is currently in use by each schema object. A data dictionary is
used when finding information about users, objects, and schema and storage
structures. Every time a data definition language (DDL) statement is issued, the data
dictionary becomes modified.
A data dictionary may contain information such as:
Database design information
Stored SQL procedures
User permissions
User statistics
Database process information
Database growth statistics
Database performance statistics
Query Processor:
A relational database consists of many parts, but at its heart are two major
components: the storage engine and the query processor.
The storage engine writes data to and reads data from the disk. It manages
records, controls concurrency, and maintains log files.
The query processor accepts SQL syntax, selects a plan for executing the syntax, and
then executes the chosen plan. The user or program interacts with the query processor,
and the query processor in turn interacts with the storage engine. The query processor
isolates the user from the details of execution. The user specifies the result, and the
query processor determines how this result is obtained.
The query processor components include the DDL interpreter, the DML compiler,
and the query evaluation engine.
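The plan-selection step can be observed directly. SQLite, for instance, exposes the chosen plan through EXPLAIN QUERY PLAN (most DBMSs offer a similar EXPLAIN facility); the emp table is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# The query processor picks a plan; the user only states the desired result.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT name FROM emp WHERE emp_id = 7"):
    print(row)   # a search using the primary key
for row in conn.execute("EXPLAIN QUERY PLAN SELECT name FROM emp WHERE age = 30"):
    print(row)   # a full table scan: no index on age
```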
Mappings between Abstractions
The conceptual/internal mapping:
o defines conceptual and internal view correspondence
o specifies mapping from conceptual records to their stored counterparts
An external/conceptual mapping:
o defines a particular external and conceptual view correspondence
A change to the storage structure definition means that the conceptual/internal
mapping must be changed accordingly, so that the conceptual schema may
remain invariant, achieving physical data independence.
A change to the conceptual definition means that the conceptual/external
mapping must be changed accordingly, so that the external schema may remain
invariant, achieving logical data independence.
The architecture for DBMSs is divided into three general levels:
external
conceptual
internal
External View
A user is anyone who needs to access some portion of the data. Users may range from
application programmers to casual users with ad hoc queries. Each user has a language
at his/her disposal.
The application programmer may use a high level language (e.g. COBOL) while the
casual user will probably use a query language.
Regardless of the language used, it will include a data sublanguage (DSL): the
subset of the language concerned with the storage and retrieval of information in
the database, which may or may not be apparent to the user.
A DSL is a combination of two languages:
a data definition language (DDL) - provides for the definition or description of
database objects
a data manipulation language (DML) - supports the manipulation or processing
of database objects.
Each user sees the data in terms of an external view: Defined by an external schema,
consisting basically of descriptions of each of the various types of external record in
that external view, and also a definition of the mapping between the external schema
and the underlying conceptual schema.
Conceptual View
An abstract representation of the entire information content of the database.
It is in general a view of the data as it actually is; that is, it is a 'model' of the
'real world'.
It consists of multiple occurrences of multiple types of conceptual record,
defined in the conceptual schema.
To achieve data independence, the definitions of conceptual records must
involve information content only.
storage structure is ignored
access strategy is ignored
In addition to definitions, the conceptual schema contains authorisation and
validation procedures.
Internal View
The internal view is a low-level representation of the entire database consisting of
multiple occurrences of multiple types of internal (stored) records.
It is, however, at one remove from the physical level, since it does not deal in terms of
physical records or blocks, nor with any device-specific constraints such as cylinder or
track sizes. Details of the mapping to physical storage are highly implementation-specific
and are not expressed in the three-level architecture.
The internal view described by the internal schema:
defines the various types of stored record
what indices exist
how stored fields are represented
what physical sequence the stored records are in
In effect the internal schema is the storage structure definition.
Here we compare these three types of data models. The table below compares the
different features:
Feature              | Conceptual | Logical | Physical
---------------------|------------|---------|---------
Entity Names         |     ✓      |    ✓    |
Entity Relationships |     ✓      |    ✓    |
Attributes           |            |    ✓    |
Primary Keys         |            |    ✓    |    ✓
Foreign Keys         |            |    ✓    |    ✓
Table Names          |            |         |    ✓
Column Names         |            |         |    ✓
Column Data Types    |            |         |    ✓
We can see that the complexity increases from conceptual to logical to physical. This
is why we always first start with the conceptual data model (so we understand at high
level what are the different entities in our data and how they relate to one another),
then move on to the logical data model (so we understand the details of our data
without worrying about how they will actually be implemented), and finally the physical
data model (so we know exactly how to implement our data model in the database of
choice). In a data warehousing project, sometimes the conceptual data model and the
logical data model are considered as a single deliverable.
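A sketch of what the physical end of that progression looks like: the DDL below pins down table names, column names, data types, and primary/foreign keys for an invented customer/orders pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical data model: everything the earlier levels left open is fixed.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT
    );
""")
```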
DATA MODELS
Hierarchical Model
The hierarchical data model organizes data in a tree structure. There is a hierarchy of
parent and child data segments. This structure implies that a record can have repeating
information, generally in the child data segments. Data is stored in a series of records,
each of which has a set of field values attached to it. All the instances of a specific record
are collected together as a record type. These record types are the equivalent of tables in the
relational model, with the individual records being the equivalent of rows. To
create links between these record types, the hierarchical model uses Parent Child
Relationships: a 1:N mapping between record types. This is done by using
trees which, like the set theory used in the relational model, are "borrowed" from mathematics. For
example, an organization might store information about an employee, such as name,
employee number, department, salary. The organization might also store information
about an employee's children, such as name and date of birth. The employee and
children data forms a hierarchy, where the employee data represents the parent
segment and the children data represents the child segment. If an employee has three
children, then there would be three child segments associated with one employee
segment. In a hierarchical database the parent-child relationship is one to many. This
restricts a child segment to having only one parent segment. Hierarchical DBMSs
were popular from the late 1960s, with the introduction of IBM's Information
Management System (IMS) DBMS, through the 1970s.
Network Model
The popularity of the network data model coincided with the popularity of the
hierarchical data model. Some data were more naturally modelled with more than one
parent per child. So, the network model permitted the modelling of many-to-many
relationships in data. In 1971, the Conference on Data Systems Languages
(CODASYL) formally defined the network model. The basic data modelling construct
in the network model is the set construct. A set consists of an owner record type, a set
name, and a member record type. A member record type can have that role in more
than one set; hence the multiparent concept is supported. An owner record type can
also be a member or owner in another set. The data model is a simple network, and
link and intersection record types (called junction records by IDMS) may exist, as
well as sets between them. Thus, the complete network of relationships is represented
by several pairwise sets; in each set one record type is the owner (at the tail of the
network arrow) and one or more record types are members (at the head of the
relationship arrow). Usually, a set defines a 1: M relationship, although 1:1 is
permitted. The CODASYL network model is based on mathematical set theory.
Relational Model
(RDBMS - relational database management system) A database based on the
relational model developed by E.F. Codd. A relational database allows the definition
of data structures, storage and retrieval operations and integrity constraints. In such a
database the data and relations between them are organised in tables. A table is a
collection of records and each record in a table contains the same fields.
Properties of Relational Tables:
• Values are atomic.
• Column values are of the same kind.
• Each row is unique.
• The sequence of columns is insignificant.
• The sequence of rows is insignificant.
• Each column must have a unique name.
Certain fields may be designated as keys, which means that searches for specific
values of that field will use indexing to speed them up. Where fields in two different
tables take values from the same set, a join operation can be performed to select
related records in the two tables by matching values in those fields. Often, but not
always, the fields will have the same name in both tables. For example, an "orders"
table might contain (customer-ID, product-code) pairs and a "products" table might
contain (product-code, price) pairs so to calculate a given customer's bill you would
sum the prices of all products ordered by that customer by joining on the product-code
fields of the two tables. This can be extended to joining multiple tables on multiple
fields. Because these relationships are only specified at retrieval time, relational
databases are classed as dynamic database management systems. The relational
database model is based on relational algebra.
Object/Relational Model
Object/relational database management systems (ORDBMSs) add new object storage
capabilities to the relational systems at the core of modern information systems. These
new facilities integrate management of traditional fielded data, complex objects such
as time-series and geospatial data and diverse binary media such as audio, video,
images, and applets. By encapsulating methods with data structures, an ORDBMS
server can execute complex analytical and data manipulation operations to search and
transform multimedia and other complex objects.
As an evolutionary technology, the object/relational (OR) approach has inherited the
robust transaction- and performance-management features of its relational ancestor
and the flexibility of its object-oriented cousin. Database designers can work with
familiar tabular structures and data definition languages (DDLs) while assimilating
new object-management possibilities. Query and procedural languages and call
interfaces in ORDBMSs are familiar: SQL3, vendor procedural languages, and
ODBC, JDBC, and proprietary call interfaces are all extensions of RDBMS languages
and interfaces. And the leading vendors are, of course, quite well known: IBM,
Informix, and Oracle.
Object-Oriented Model
Object DBMSs add database functionality to object programming languages. They
bring much more than persistent storage of programming language objects. Object
DBMSs extend the semantics of the C++, Smalltalk and Java object programming
languages to provide full-featured database programming capability, while retaining
native language compatibility. A major benefit of this approach is the unification of
the application and database development into a seamless data model and language
environment. As a result, applications require less code, use more natural data
modelling, and code bases are easier to maintain. Object developers can write
complete database applications with a modest amount of additional effort.
According to Rao (1994), "The object-oriented database (OODB) paradigm is the
combination of object-oriented programming language (OOPL) systems and persistent
systems. The power of the OODB comes from the seamless treatment of both
persistent data, as found in databases, and transient data, as found in executing
programs."
In contrast to a relational DBMS where a complex data structure must be flattened out
to fit into tables or joined together from those tables to form the in-memory structure,
object DBMSs have no performance overhead to store or retrieve a web or hierarchy
of interrelated objects. This one-to-one mapping of object programming language
objects to database objects has two benefits over other storage approaches: it provides
higher performance management of objects, and it enables better management of the
complex interrelationships between objects. This makes object DBMSs better suited
to support applications such as financial portfolio risk analysis systems,
telecommunications service applications, World Wide Web document structures,
design and manufacturing systems, and hospital patient record systems, which have
complex relationships between data.
Data Model: A data model is a collection of conceptual tools for describing data, data relationships,
data semantics, and consistency constraints.
Entity: An entity is a "thing" or "object" in the real world that is distinguishable from
all other objects. For example, each person in an enterprise is an entity.
Entity set :An entity set is a set of entities of the same type that share the same
properties, or attributes. The set of all persons who are customers at a given bank, for
example, can be defined as the entity set customer. Similarly, the entity set loan might
represent the set of all loans awarded by a particular bank.
An entity is represented by a set of attributes. Attributes are descriptive properties
possessed by each member of an entity set. The designation of an attribute for an
entity set expresses that the database stores similar information concerning each entity
in the entity set; however, each entity may have its own value for each attribute.
Simple and composite attributes: Attributes that are not divided into subparts are
called simple attributes. Composite attributes, on the other hand, can be divided
into subparts. For example, an attribute
name could be structured as a composite attribute consisting of first-name, middle-
initial, and last-name.
Single-valued and multivalued attributes: For instance, the loan-number attribute
for a specific loan entity refers to only one loan number. Such attributes are said to be
single valued. There may be instances where an attribute has a set of values for a
specific entity. Consider an employee entity set with the attribute phone-number. An
employee may have zero, one, or several phone numbers, and different employees
may have different numbers of phones.
This type of attribute is said to be multivalued.
Derived attribute: The value for this type of attribute can be derived from the values
of other related attributes or entities. For instance, let us say that the customer entity
set has an attribute loans-held, which represents how many loans a customer has from
the bank. We can derive the value for this attribute by counting the number of loan
entities associated with that customer.
Relationship Sets: A relationship is an association among several entities. A
relationship set is a set of relationships of the same type.
Mapping Cardinalities: Mapping cardinalities, or cardinality ratios, express the
number of entities to which another entity can be associated via a relationship set.
Mapping cardinalities are most useful in describing binary relationship sets, although
they can contribute to the description of relationship sets that involve more than two
entity sets.
One to one. An entity in A is associated with at most one entity in B, and an
entity in B is associated with at most one entity in A.
One to many. An entity in A is associated with any number (zero or more) of
entities in B. An entity in B, however, can be associated with at most one entity
in A.
Many to one. An entity in A is associated with at most one entity in B. An
entity in B, however, can be associated with any number (zero or more) of
entities in A.
Many to many. An entity in A is associated with any number (zero or more) of
entities in B, and an entity in B is associated with any number (zero or more) of
entities in A.
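A sketch of how these cardinalities are commonly realized in relational schemas (invented bank tables): one to many becomes a foreign key on the "many" side, and many to many becomes a junction table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One to many: each account is held at at most one branch, so the
    -- 'many' side carries a foreign key to the 'one' side.
    CREATE TABLE branch  (branch_id  INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE account (account_no INTEGER PRIMARY KEY,
                          branch_id  INTEGER REFERENCES branch(branch_id));

    -- Many to many: a customer may hold many accounts and an account may
    -- be joint, so a junction table carries one foreign key per side.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE depositor (
        customer_id INTEGER REFERENCES customer(customer_id),
        account_no  INTEGER REFERENCES account(account_no),
        PRIMARY KEY (customer_id, account_no)
    );
""")
```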
Keys: A key allows us to identify a set of attributes that suffice to distinguish entities
from each other. Keys also help uniquely identify relationships, and thus distinguish
relationships from each other.
superkey: A superkey is a set of one or more attributes that, taken collectively, allow
us to identify uniquely an entity in the entity set. For example, the customer-id
attribute of the entity set customer is sufficient to distinguish one customer entity from
another. Thus, customer-id is a superkey. Similarly, the combination of customer-
name and customer-id is a superkey for the entity set customer. The customer-name
attribute of customer is not a superkey, because several people might have the same
name.
candidate key: minimal super keys are called candidate keys. If K is a super key, then
so is any superset of K. We are often interested in super keys for which no proper
subset is a super key. It is possible that several distinct sets of attributes could serve as
a candidate key. Suppose that a combination of customer-name and customer-street is
sufficient to distinguish among members of the customer entity set. Then, both
{customer-id} and {customer-name, customer-street} are candidate keys. Although
the attributes customer-id and customer-name together can distinguish customer
entities, their combination does not form a candidate key, since the attribute customer-
id alone is a candidate key.
primary key: A primary key is the candidate key that is chosen by the database designer
as the principal means of identifying entities within an entity set. A key (primary, candidate, and
super) is a property of the entity set, rather than of the individual entities. Any two
individual entities in the set are prohibited from having the same value on the key
attributes at the same time. The designation of a key represents a constraint in the real-
world enterprise being modeled.
Weak Entity Sets: An entity set may not have sufficient attributes to form a primary
key. Such an entity set is termed a weak entity set. An entity set that has a primary key
is termed a strong entity set.
For a weak entity set to be meaningful, it must be associated with another entity set,
called the identifying or owner entity set. Every weak entity must be associated with
an identifying entity; that is, the weak entity set is said to be existence dependent on
the identifying entity set. The identifying entity set is said to own the weak entity set
that it identifies. The relationship associating the weak entity set with the identifying
entity set is called the identifying relationship. The identifying relationship is many to
one from the weak entity set to the identifying entity set, and the participation of the
weak entity set in the relationship is total.
In our example, the identifying entity set for payment is loan, and a relationship loan-
payment that associates payment entities with their corresponding loan entities is the
identifying relationship.
Although a weak entity set does not have a primary key, we nevertheless need a
means of distinguishing among all those entities in the weak entity set that depend on
one particular strong entity. The discriminator of a weak entity set is a set of attributes
that allows this distinction to be made.
In E-R diagrams, a doubly outlined box indicates a weak entity set, and a doubly
outlined diamond indicates the corresponding identifying relationship. In the figure the weak
entity set payment depends on the strong entity set loan via the relationship set loan-
payment.
The figure also illustrates the use of double lines to indicate total participation: the
participation of the (weak) entity set payment in the relationship loan-payment is total, meaning that
every payment must be related via loan-payment to some loan. Finally, the arrow from
loan-payment to loan indicates that each payment is for a single loan. The
discriminator of a weak entity set also is underlined, but with a dashed, rather than a
solid, line.
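A sketch of how the loan-payment example might be realized in SQL (invented column choices): the weak entity's primary key combines the identifying entity's key with the discriminator:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loan (loan_no INTEGER PRIMARY KEY, amount REAL);

    -- Weak entity set: payment has no key of its own. Its primary key is
    -- the identifying entity's key plus the discriminator payment_no.
    CREATE TABLE payment (
        loan_no    INTEGER NOT NULL REFERENCES loan(loan_no),
        payment_no INTEGER NOT NULL,        -- discriminator
        pay_date   TEXT,
        amount     REAL,
        PRIMARY KEY (loan_no, payment_no)
    );
""")
```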
Specialization : An entity set may include sub groupings of entities that are distinct in
some way from other entities in the set. For instance, a subset of entities within an
entity set may have attributes that are not shared by all the entities in the entity set.
The E-R model provides a means for representing these distinctive entity groupings.
Consider an entity set person, with attributes name, street, and city. A person may be
further classified as one of the following:
• customer
• employee
Each of these person types is described by a set of attributes that includes all the
attributes of entity set person plus possibly additional attributes. For example,
customer entities may be described further by the attribute customer-id, whereas
employee entities may be described further by the attributes employee-id and salary.
The process of designating sub groupings within an entity set is called specialization.
The specialization of person allows us to distinguish among persons according to
whether they are employees or customers.
Generalization: The design process may also proceed in a bottom-up manner, in
which multiple entity sets are synthesized into a higher-level entity set on the basis of
common features. The database designer may have first identified a customer entity
set with the attributes name, street, city, and customer-id, and an employee entity set
with the attributes name, street, city, employee-id, and salary. There are similarities
between the customer entity set and the employee entity set in the sense that they have
several attributes in common. This commonality can be expressed by generalization,
which is a containment relationship that exists between a higher-level entity set and
one or more lower-level entity sets. In our example, person is the higher-level entity
set and customer and employee are lower-level entity sets.
Higher- and lower-level entity sets also may be designated by the terms super class
and subclass, respectively. The person entity set is the super class of the customer and
employee subclasses .For all practical purposes, generalization is a simple inversion of
specialization. We will apply both processes, in combination, in the course of
designing the E-R schema for an enterprise. In terms of the E-R diagram itself, we do
not distinguish between specialization and generalization. New levels of entity
representation will be distinguished (specialization) or synthesized (generalization) as
the design schema comes to express fully the database application and the user
requirements of the database. Differences in the two approaches may be characterized
by their starting point and overall goal. Generalization proceeds from the recognition
that a number of entity sets share some common features (namely, they are described
by the same attributes and participate in the same relationship sets).
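One common way to realize such a specialization/generalization hierarchy in tables (a sketch; the exact column choices are invented) is to give each subclass table the super class key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Higher-level entity set (super class).
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name TEXT, street TEXT, city TEXT
    );
    -- Lower-level entity sets (subclasses) reuse the super class key and
    -- add only their own attributes.
    CREATE TABLE customer (
        person_id   INTEGER PRIMARY KEY REFERENCES person(person_id),
        customer_id TEXT
    );
    CREATE TABLE employee (
        person_id   INTEGER PRIMARY KEY REFERENCES person(person_id),
        employee_id TEXT,
        salary      REAL
    );
""")
```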
Aggregation:
Aggregation is an abstraction in which relationship sets (along with their associated
entity sets) are treated as higher-level entity sets, and can participate in relationships.
Symbols used in the E-R notation:
File Organization
Database needs to be stored on some computer storage medium. Two categories of
storage are:
• Primary storage: Includes storage media that can be operated on directly by
CPU. Fast access to data. Limited storage capacity.
• Secondary storage: Have large capacity, costs less, slower access to data. Can’t
be processed directly by CPU.
A file is a sequence of records. If every record has exactly the same size, the file is said to
be made of fixed-length records; if different records in the file have different sizes, the file
is said to be made of variable-length records.
Records of a file must be allocated to disk blocks because a block is the unit of data
transfer between disk and memory. When the block size is larger than the record
size, each block will contain numerous records, although at times records may not fit in a
block. Suppose the block size is B bytes. For a file of fixed-length records of size R
bytes, with B >= R, the blocking factor for the file is bfr = ⌊B/R⌋, the number of
records per block.
Unused space in each block is B - (bfr x R) bytes.
• To utilize the unused space, we can store part of a record on one block and the
rest on another. A pointer at the end of the first block points to the block
containing the remainder of the record, in case it is not the next consecutive
block on disk. This organisation is called spanned.
• If records are not allowed to cross block boundaries, the organisation is called
unspanned. This is used for fixed-length records, having B > R because it
makes each record start at a known location in the block.
• For variable-length records, either a spanned or unspanned organization can be
used. For a file of r records, the number of blocks needed is b = ⌈r/bfr⌉.
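A worked example of these formulas (the block size, record size, and record count are assumed values):

```python
import math

B = 512    # block size in bytes (assumed)
R = 100    # fixed record length in bytes (assumed)
r = 30000  # number of records in the file (assumed)

bfr = B // R              # blocking factor: floor(B/R) = 5 records per block
unused = B - bfr * R      # unused bytes per block (unspanned) = 12
b = math.ceil(r / bfr)    # blocks needed: ceil(r/bfr) = 6000

print(bfr, unused, b)
```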
Techniques for allocating file blocks on disk:
• Contiguous allocation : File blocks are allocated to consecutive disk blocks.
This makes reading the file faster using double buffering, but file expansion
is difficult.
• Linked allocation : Each file block contains a pointer to the next file block.
Expansion is easy but reading of file is slower.
• A combination of consecutive disk blocks is called a cluster. Clusters are
sometimes called file segments or extents.
The file header contains information about a file that is needed by the system
programs that access the file records. It gives the addresses of file blocks, field
lengths, and the order of fields.
To search for a record on disk, one or more blocks are copied into main memory
buffers. Programs search for the desired record or records within the buffers, using
the information in the file header. If the address of the block that contains the
desired record is not known, the search programs must do a linear search through
the file blocks.
Each file block is copied into a buffer and searched either until the record is
located or all file blocks have been searched unsuccessfully
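A sketch of that linear search in Python (the record and block sizes and the sample data are invented; a real system would read disk blocks rather than an in-memory BytesIO buffer):

```python
import io

RECORD_SIZE = 32   # fixed-length records (assumed size, in bytes)
BLOCK_SIZE = 128   # one block holds 4 records (assumed size, in bytes)

def linear_search(f, match):
    """Copy one block at a time into a main-memory buffer and scan its records."""
    while True:
        buffer = f.read(BLOCK_SIZE)          # one block into the buffer
        if not buffer:
            return None                      # all blocks searched unsuccessfully
        for i in range(0, len(buffer), RECORD_SIZE):
            record = buffer[i:i + RECORD_SIZE]
            if match(record):
                return record                # desired record located

# Usage: a tiny in-memory stand-in for a disk file of four records.
data = b"".join(name.ljust(RECORD_SIZE) for name in
                [b"alice", b"bob", b"carol", b"dave"])
print(linear_search(io.BytesIO(data), lambda rec: rec.startswith(b"carol")))
```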
Operations on files
• Include retrieval operations and update operations
• Retrieval operations locate records for processing and examining values.
• Update operations change the data in the file
• E.g. Select, Insert, Delete, Update.
These operations can use equality or comparison operators.
• Open: Prepare files for reading or writing.
• Reset: Set the file pointer of an open file to the beginning of the file.
• Find: Searches for the first record that satisfies a search condition.
• Read: Copies the current record from the buffer to a program variable in the
user program.
• FindNext: Searches for the next record in the file.
• Delete: Delete current record
• Modify: Modify some field values for current record.
• Insert: Inserts a new record in the file by locating the block where it belongs.
• Close: Releases buffers and completes file access
• Findall : locate all records in file satisfying a condition
• Find n : Searches for the first record that satisfies a search condition and
continues to locate the next n-1 records
• FindOrdered: Retrieves all records in some specified order.
• Reorganize: Reordering records.
File organization: Organization of data of a file into records, blocks and access
structures.
Access method: Group of operations that can be applied to a file.
Buffering of blocks.
When several blocks need to be transferred from the disk to the main memory and
all the block addresses are known, several buffers can be reserved in main memory
to speed up the transfer.
While a buffer is being read or written, the CPU can process data in the other
buffer. Buffering is useful when processes can run concurrently in a parallel
fashion (i.e. when a separate I/O processor or multiple CPU processors exist).
Double buffering: the CPU starts processing a block once its transfer to main memory
is complete; at the same time the disk I/O processor can be reading and transferring
the next block into a different buffer. This reduces waiting time.
INDEXES
An index is a data structure that organizes data records on disk to optimize certain
kinds of retrieval operations. An index allows us to efficiently retrieve all records
that satisfy search conditions on the search key fields of the index. Additional
indexes can be created on a given collection of data records, each with a different
search key, to speed up search operations that are not efficiently supported by the
file organization used to store the data records.
• E.g. store the records in a file organized as an index on employee age; this is an
alternative to sorting the file by age.
When the ordering of the data records is the same as, or close to, the ordering of the
data entries in some index, we say that the index is clustered; otherwise it is an
unclustered index.
An index on a set of fields that includes the primary key is called a primary index;
other indexes are called secondary indexes. Two data entries are said to be
duplicates if they have the same value for the search key field associated with the
index. A primary index is guaranteed not to contain duplicates.
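A sketch of a secondary index in SQL, on the employee-age example from above (the table and index names are invented); the query plan confirms the index is chosen:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# A secondary index on a non-key search field (age values may repeat).
conn.execute("CREATE INDEX idx_emp_age ON employee(age)")

for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM employee WHERE age = 30"):
    print(row)   # the plan now searches employee using idx_emp_age
```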
Clustered Field
If the records of a file are physically ordered on a non-key field that does not have a
distinct value for each record, that field is called the clustering field of the file.
The index created is a clustering index; it speeds retrieval of records. It is an ordered
file with two fields: one field is the same as the clustering field of the data file and
the second is a block pointer.
Ordered Indices
A file may have several indices, on different search keys. If the file containing the
records is sequentially ordered, a primary index is an index whose search key also
defines the sequential order of the file. Primary indices are also called clustering
indices. The search key of a primary index is usually the primary key.
Indices whose search key specifies an order different from the sequential order of
the file are called secondary indices.
Primary index: files arranged sequentially on some search key, with a primary
index on that search key, are called index-sequential files. They are designed for
applications that require both sequential processing and random processing.
Dense and Sparse index
• An index record or index entry consists of a search-key value and pointers to one
or more records with that value as their search-key value. A pointer to a record
consists of the identifier of a disk block and an offset within the disk block to
identify the record within the block.
• There are two types: dense index and sparse index.
Dense index: An index record appears for every search-key value in the file. The
index record has the search-key value and a pointer to the first data record with
that search-key value.
The rest of the records with the same search-key value are stored sequentially after
the first record.
Some implementations store a list of pointers to all records with the same search-key value.
Sparse indices
An index record appears for only some of the search-key values. Each index record
contains a search-key value and a pointer to the first data record with that search-
key value. To locate a record, we find the index entry with the largest search-key
value that is less than or equal to the search-key value for which we are looking.
We start at the record pointed to by that index entry and follow the pointers in the
file until we find the desired record.
A secondary index must be dense, with an index entry for every search-key value
and a pointer to every record in the file.
A primary index may be sparse, storing only some of the search-key values, since it
is always possible to find records with intermediate search-key values by a sequential
access to a part of the file.
If a secondary index stores only some of the search-key values, records with
intermediate search-key values may be anywhere in the file and, in general, we
cannot find them without searching the entire file.
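A sketch of the sparse-index lookup procedure described above (the index entries and block contents are invented; the data file is assumed to be sorted on the search key):

```python
import bisect

# A sparse index: one (search-key, block number) entry per block.
index_keys   = [10, 40, 70]                  # first search-key value in each block
index_blocks = [0, 1, 2]
blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

def lookup(key):
    # Find the index entry with the largest search-key value <= key ...
    pos = bisect.bisect_right(index_keys, key) - 1
    if pos < 0:
        return None                          # key precedes the whole file
    # ... then scan sequentially from the block that entry points to.
    block = blocks[index_blocks[pos]]
    return key if key in block else None

print(lookup(50))   # 50 is found in block 1
print(lookup(55))   # None: 55 is not in the file
```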
MODULE 3: DATABASE DESIGN
Relational Data Model – Relational Algebra – ER Diagrams – Data Dictionary –
Normalisation – Boyce Codd Normal Forms – Integrity – Relational Database
Languages – Database Administration – File Structures and Indexing.
A relation is a table that holds the data we are interested in. It is two-dimensional and
has rows and columns.
Each entity type in the ER model is mapped into a relation.
The attributes become the columns.
The individual entities become the rows.
Figure : a relation
Relations can be represented textually as:
tablename(primary key, attribute 1, attribute 2, ... , foreign key)
If matric_no was the primary key, and there were no foreign keys, then the table
above could be represented as:
student(matric no, name, address, date_of_birth)
When referring to relations or tables, cardinality is considered to be the number of
rows in the relation or table, and arity is the number of columns in a table or
attributes in a relation.
Foreign keys
A foreign key is an attribute (or group of attributes) that is the primary key to another
relation.
Roughly, each foreign key represents a relationship between two entity types.
They are added to relations as we go through the mapping process.
They allow the relations to be linked together.
A relation can have several foreign keys.
The Relational Data Model
The Relation
The Relation is the basic element in a relational data model.
Relations in the Relational Data Model
A relation is subject to the following rules:
1. Relation (file, table) is a two-dimensional table.
2. Attribute (i.e. field or data item) is a column in the table.
3. Each column in the table has a unique name within that table.
4. Each column is homogeneous. Thus the entries in any column are all of the
same type (e.g. age, name, employee-number, etc).
5. Each column has a domain, the set of possible values that can appear in that
column.
6. A Tuple (i.e. record) is a row in the table.
7. The order of the rows and columns is not important.
8. Values of a row all relate to some thing or portion of a thing.
9. Repeating groups (collections of logically related attributes that occur multiple
times within one record occurrence) are not allowed.
10. Duplicate rows are not allowed (candidate keys are designed to prevent this).
11. Cells must be single-valued (but can be variable length). Single valued means
the following:
o Cannot contain multiple values such as 'A1,B2,C3'.
o Cannot contain combined values such as 'ABC-XYZ' where 'ABC' means
one thing and 'XYZ' another.
A relation may be expressed using the notation R(A,B,C, ...) where:
R = the name of the relation.
(A,B,C, ...) = the attributes within the relation.
A = the attribute(s) which form the primary key.
Keys
1. A simple key contains a single attribute.
2. A composite key is a key that contains more than one attribute.
3. A candidate key is an attribute (or set of attributes) that uniquely identifies a
row. A candidate key must possess the following properties:
o Unique identification - For every row the value of the key must uniquely
identify that row.
o Non redundancy - No attribute in the key can be discarded without
destroying the property of unique identification.
4. A primary key is the candidate key which is selected as the principal unique
identifier. Every relation must contain a primary key. The primary key is usually
the key selected to identify a row when the database is physically implemented.
For example, a part number is selected instead of a part description.
5. A superkey is any set of attributes that uniquely identifies a row. A superkey
differs from a candidate key in that it does not require the non redundancy
property.
6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non
key attribute in one relation and as a primary key attribute in another relation. I
say usually because it is possible for a foreign key to also be the whole or part of
a primary key:
o A many-to-many relationship can only be implemented by introducing an
intersection or link table which then becomes the child in two one-to-many
relationships. The intersection table therefore has a foreign key for each of its
parents, and its primary key is a composite of both foreign keys.
o A one-to-one relationship requires that the child table has no more than one
occurrence for each parent, which can only be enforced by letting the foreign
key also serve as the primary key.
7. A semantic or natural key is a key for which the possible values have an
obvious meaning to the user or the data. For example, a semantic primary key
for a COUNTRY entity might contain the value 'USA' for the occurrence
describing the United States of America. The value 'USA' has meaning to the
user.
8. A technical or surrogate or artificial key is a key for which the possible values
have no obvious meaning to the user or the data. These are used instead of
semantic keys for any of the following reasons:
o When the value in a semantic key is likely to be changed by the user, or can
have duplicates. For example, on a PERSON table it is unwise to use
PERSON_NAME as the key as it is possible to have more than one person
with the same name, or the name may change such as through marriage.
o When none of the existing attributes can be used to guarantee uniqueness. In
this case adding an attribute whose value is generated by the system, e.g
from a sequence of numbers, is the only way to provide a unique value.
Typical examples would be ORDER_ID and INVOICE_ID. The value
'12345' has no meaning to the user as it conveys nothing about the entity to
which it relates.
9. A key functionally determines the other attributes in the row, thus it is always a
determinant.
10. Note that the term 'key' in most DBMS engines is implemented as an index
which does not allow duplicate entries.
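As a sketch, the main key types above map onto SQL declarations; since most engines
implement a key as a unique index, both keys below reject duplicates. The person table
and its columns are illustrative assumptions:
CREATE TABLE person (
    person_id   INT PRIMARY KEY,       -- surrogate (technical) primary key
    passport_no VARCHAR(20) UNIQUE,    -- a semantic candidate key, enforced by a unique index
    person_name VARCHAR(50)            -- not a key: duplicate names are possible
);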
Relationships
One table (relation) may be linked with another in what is known as a relationship.
Relationships may be built into the database structure to facilitate the operation of
relational joins at runtime.
1. A relationship is between two tables in what is known as a one-to-many or
parent-child or master-detail relationship where an occurrence on the 'one' or
'parent' or 'master' table may have any number of associated occurrences on the
'many' or 'child' or 'detail' table. To achieve this the child table must contain
fields which link back to the primary key on the parent table. These fields on the
child table are known as a foreign key, and the parent table is referred to as the
foreign table (from the viewpoint of the child).
2. It is possible for a record on the parent table to exist without corresponding
records on the child table, but it should not be possible for an entry on the child
table to exist without a corresponding entry on the parent table.
3. A child record without a corresponding parent record is known as an orphan.
4. It is possible for a table to be related to itself. For this to be possible it needs a
foreign key which points back to the primary key. Note that these two keys
cannot be comprised of exactly the same fields otherwise the record could only
ever point to itself.
5. A table may be the subject of any number of relationships, and it may be the
parent in some and the child in others.
6. Some database engines allow a parent table to be linked via a candidate key,
but if this were changed it could result in the link to the child table being
broken.
7. Some database engines allow relationships to be managed by rules known as
referential integrity or foreign key constraints. These will prevent entries on
child tables from being created if the foreign key does not exist on the parent
table, or will deal with entries on child tables when the entry on the parent table
is updated or deleted.
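A sketch of such referential integrity rules in SQL, with illustrative parent/child
table names (the cascade/restrict choices are examples, not the only options):
CREATE TABLE parent (
    parent_id INT PRIMARY KEY
);
CREATE TABLE child (
    child_id  INT PRIMARY KEY,
    parent_id INT NOT NULL,            -- NOT NULL: every child must have a parent (no orphans)
    FOREIGN KEY (parent_id) REFERENCES parent (parent_id)
        ON UPDATE CASCADE              -- follow changes to the parent key
        ON DELETE RESTRICT             -- refuse to delete a parent that still has children
);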
Relational Joins
The join operator is used to combine data from two or more relations (tables) in order
to satisfy a particular query. Two relations may be joined when they share at least one
common attribute. The join is implemented by considering each row in an instance of
each relation. A row in relation R1 is joined to a row in relation R2 when the value of
the common attribute(s) is equal in the two relations. The join of two relations is often
called a binary join.
The join of two relations creates a new relation. The notation R1 x R2 indicates the
join of relations R1 and R2. For example, consider the following:
Relation R1
A B C
1 5 3
2 4 5
8 3 5
9 3 3
1 6 5
5 4 3
2 7 5
Relation R2
B D E
4 7 4
6 2 3
5 7 8
7 2 3
3 2 2
Note that the instances of relation R1 and R2 contain the same data values for attribute
B. Data normalisation is concerned with decomposing a relation (e.g. R(A,B,C,D,E))
into smaller relations (e.g. R1 and R2). The data values for attribute B in this context
will be identical in R1 and R2. The instances of R1 and R2 are projections of the
instances of R(A,B,C,D,E) onto the attributes (A,B,C) and (B,D,E) respectively. A
projection will not eliminate data values - duplicate rows are removed, but this will
not remove a data value from any attribute.
The join of relations R1 and R2 is possible because B is a common attribute. The
result of the join is:
Relation R1 x R2
A B C D E
1 5 3 7 8
2 4 5 7 4
8 3 5 2 2
9 3 3 2 2
1 6 5 2 3
5 4 3 7 4
2 7 5 2 3
The row (2 4 5 7 4) was formed by joining the row (2 4 5) from relation R1 to the row
(4 7 4) from relation R2. The two rows were joined since each contained the same
value for the common attribute B. The row (2 4 5) was not joined to the row (6 2 3)
since the values of the common attribute (4 and 6) are not the same.
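In SQL the same binary join could be expressed as follows, assuming R1 and R2 exist as
tables with the columns shown above:
SELECT R1.A, R1.B, R1.C, R2.D, R2.E
FROM   R1
JOIN   R2 ON R1.B = R2.B;    -- (2,4,5) meets (4,7,4) because both have B = 4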
The relations joined in the preceding example shared exactly one common attribute.
However, relations may share multiple common attributes. All of these common
attributes must be used in creating a join. For example, the instances of relations R1
and R2 in the following example are joined using the common attributes B and C:
Before the join:
Relation R1
A B C
6 1 4
8 1 4
5 1 2
2 7 1
Relation R2
B C D
1 4 9
1 4 2
1 2 1
7 1 2
7 1 3
After the join:
Relation R1 x R2
A B C D
6 1 4 9
6 1 4 2
8 1 4 9
8 1 4 2
5 1 2 1
2 7 1 2
2 7 1 3
The row (6 1 4 9) was formed by joining the row (6 1 4) from relation R1 to the row
(1 4 9) from relation R2. The join was created since the common set of attributes (B
and C) contained identical values (1 and 4). The row (6 1 4) from R1 was not joined to
the row (1 2 1) from R2 since the common attributes did not share identical values - (1
4) in R1 and (1 2) in R2.
The join operation provides a method for reconstructing a relation that was
decomposed into two relations during the normalisation process. The join of two
rows, however, can create a new row that was not a member of the original relation.
Thus invalid information can be created during the join process.
Lossless Joins
A set of relations satisfies the lossless join property if the instances can be joined
without creating invalid data (i.e. new rows). The term lossless join may be somewhat
confusing. A join that is not lossless will contain extra, invalid rows. A join that is
lossless will not contain extra, invalid rows. Thus the term gainless join might be
more appropriate.
To give an example of incorrect information created by an invalid join let us take the
following data structure:
R(student, course, instructor, hour, room, grade)
Assuming that only one section of a class is offered during a semester we can define
the following functional dependencies:
1. (HOUR, ROOM) -> COURSE
2. (COURSE, STUDENT) -> GRADE
3. (INSTRUCTOR, HOUR) -> ROOM
4. (COURSE) -> INSTRUCTOR
5. (HOUR, STUDENT) -> ROOM
Take the following sample data:
STUDENT COURSE INSTRUCTOR HOUR ROOM GRADE
Smith Math 1 Jenkins 8:00 100 A
Jones English Goldman 8:00 200 B
Brown English Goldman 8:00 200 C
Green Algebra Jenkins 9:00 400 A
The following four relations, each in 4th normal form, can be generated from the
given and implied dependencies:
R1(STUDENT, HOUR, COURSE)
R2(STUDENT, COURSE, GRADE)
R3(COURSE, INSTRUCTOR)
R4(INSTRUCTOR, HOUR, ROOM)
Note that the dependencies (HOUR, ROOM) -> COURSE and (HOUR, STUDENT) ->
ROOM are not explicitly represented in the preceding decomposition. The goal is
to develop relations in 4th normal form that can be joined to answer any ad hoc
inquiries correctly. This goal can be achieved without representing every functional
dependency as a relation. Furthermore, several sets of relations may satisfy the goal.
The preceding sets of relations can be populated as follows:
R1
STUDENT HOUR COURSE
Smith 8:00 Math 1
Jones 8:00 English
Brown 8:00 English
Green 9:00 Algebra
R2
STUDENT COURSE GRADE
Smith Math 1 A
Jones English B
Brown English C
Green Algebra A
R3
COURSE INSTRUCTOR
Math 1 Jenkins
English Goldman
Algebra Jenkins
R4
INSTRUCTOR HOUR ROOM
Jenkins 8:00 100
Goldman 8:00 200
Jenkins 9:00 400
Now suppose that a list of courses with their corresponding room numbers is required.
Relations R1 and R4 contain the necessary information and can be joined using the
attribute HOUR.
The result of this join is:
R1 x R4
STUDENT COURSE INSTRUCTOR HOUR ROOM
Smith Math 1 Jenkins 8:00 100
Smith Math 1 Goldman 8:00 200
Jones English Jenkins 8:00 100
Jones English Goldman 8:00 200
Brown English Jenkins 8:00 100
Brown English Goldman 8:00 200
Green Algebra Jenkins 9:00 400
This join creates the following invalid information:
Smith, Jones, and Brown take the same class at the same time from two different
instructors in two different rooms.
Jenkins (the Maths teacher) teaches English.
Goldman (the English teacher) teaches Maths.
Both instructors teach different courses at the same time.
Another possibility for a join is R3 and R4 (joined on INSTRUCTOR). The result
would be:
R3 x R4
COURSE INSTRUCTOR HOUR ROOM
Math 1 Jenkins 8:00 100
Math 1 Jenkins 9:00 400
English Goldman 8:00 200
Algebra Jenkins 8:00 100
Algebra Jenkins 9:00 400
This join creates the following invalid information:
Jenkins teaches Math 1 and Algebra simultaneously at both 8:00 and 9:00.
A correct sequence is to join R1 and R3 (using COURSE) and then join the resulting
relation with R4 (using both INSTRUCTOR and HOUR). The result would be:
R1 x R3
STUDENT COURSE INSTRUCTOR HOUR
Smith Math 1 Jenkins 8:00
Jones English Goldman 8:00
Brown English Goldman 8:00
Green Algebra Jenkins 9:00
(R1 x R3) x R4
STUDENT COURSE INSTRUCTOR HOUR ROOM
Smith Math 1 Jenkins 8:00 100
Jones English Goldman 8:00 200
Brown English Goldman 8:00 200
Green Algebra Jenkins 9:00 400
Extracting the COURSE and ROOM attributes (and eliminating the duplicate row
produced for the English course) would yield the desired result:
COURSE ROOM
Math 1 100
English 200
Algebra 400
The correct result is obtained since the sequence (R1 x R3) x R4 satisfies the lossless
(gainless?) join property.
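Expressed in SQL (assuming tables R1, R3 and R4 as populated above), the invalid and
valid join sequences look like this:
-- Invalid: joining R1 to R4 on HOUR alone produces spurious course/room pairs.
SELECT DISTINCT R1.course, R4.room
FROM   R1 JOIN R4 ON R1.hour = R4.hour;
-- Valid: join R1 to R3 on COURSE, then to R4 on both INSTRUCTOR and HOUR.
SELECT DISTINCT R1.course, R4.room
FROM   R1
JOIN   R3 ON R1.course = R3.course
JOIN   R4 ON R3.instructor = R4.instructor AND R1.hour = R4.hour;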
A relational database is in 4th normal form when the lossless join property can be
used to answer unanticipated queries. However, the choice of joins must be evaluated
carefully. Many different sequences of joins will recreate an instance of a relation.
Some sequences are more desirable since they result in the creation of less invalid data
during the join operation.
Suppose that a relation is decomposed using functional dependencies and multi-valued
dependencies. Then at least one sequence of joins on the resulting relations exists that
recreates the original instance with no invalid data created during any of the join
operations.
For example, suppose that a list of grades by room number is desired. This question,
which was probably not anticipated during database design, can be answered without
creating invalid data by either of the following two join sequences:
R1 x R3
(R1 x R3) x R2
((R1 x R3) x R2) x R4
or
R1 x R3
(R1 x R3) x R4
((R1 x R3) x R4) x R2
The required information is contained within relations R2 and R4, but these relations
cannot be joined directly. In this case the solution requires joining all four relations.
The database may require a 'lossless join' relation, which is constructed to assure that
any ad hoc inquiry can be answered with relational operators. This relation may
contain attributes that are not logically related to each other. This occurs because the
relation must serve as a bridge between the other relations in the database. For
example, the lossless join relation will contain all attributes that appear only on the
left side of a functional dependency. Other attributes may also be required, however,
in developing the lossless join relation.
Consider the relational schema R(A, B, C, D), with A -> B and C -> D. Relations R1(A, B) and
R2(C, D) are in 4th normal form. A third relation R3(A, C), however, is required to
satisfy the lossless join property. This relation can be used to join attributes B and D.
This is accomplished by joining relations R1 and R3 and then joining the result to
relation R2. No invalid data is created during these joins. The relation R3(A, C) is the
lossless join relation for this database design.
A relation is usually developed by combining attributes about a particular subject or
entity. The lossless join relation, however, is developed to represent a relationship
among various relations. The lossless join relation may be difficult to populate
initially and difficult to maintain - a result of including attributes that are not logically
associated with each other.
The attributes within a lossless join relation often contain multi-valued dependencies.
Consideration of 4th normal form is important in this situation. The lossless join
relation can sometimes be decomposed into smaller relations by eliminating the multi-
valued dependencies. These smaller relations are easier to populate and maintain.
Determinant and Dependent
The terms determinant and dependent can be described as follows:
1. The expression X -> Y means 'if I know the value of X, then I can obtain the
value of Y' (in a table or somewhere).
2. In the expression X -> Y, X is the determinant and Y is the dependent
attribute.
3. The value X determines the value of Y.
4. The value Y depends on the value of X.
Functional Dependencies (FD)
A functional dependency can be described as follows:
1. An attribute is functionally dependent if its value is determined by another
attribute which is a key.
2. That is, if we know the value of one (or several) data items, then we can find the
value of another (or several).
3. Functional dependencies are expressed as X -> Y, where X is the determinant
and Y is the functionally dependent attribute.
4. If A -> (B,C) then A -> B and A -> C.
5. If (A,B) -> C, then it is not necessarily true that A -> C and B -> C.
6. If A -> B and B -> A, then A and B are in a 1-1 relationship.
7. If A -> B then for A there can only ever be one value for B.
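Whether a proposed dependency X -> Y actually holds in a populated table can be checked
with a query: any determinant value associated with more than one dependent value is a
counter-example. A sketch with placeholder names (tablename, X, Y are not from the text):
-- Returns every value of X that maps to more than one value of Y,
-- i.e. the violations of the proposed dependency X -> Y.
SELECT X
FROM   tablename
GROUP  BY X
HAVING COUNT(DISTINCT Y) > 1;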
Transitive Dependencies (TD)
A transitive dependency can be described as follows:
1. An attribute is transitively dependent if its value is determined by another
attribute which is not a key.
2. If X -> Y and X is not a key then this is a transitive dependency.
3. A transitive dependency exists when A -> B -> C but NOT A -> C.
Multi-Valued Dependencies (MVD)
A multi-valued dependency can be described as follows:
1. A table involves a multi-valued dependency if it may contain multiple values for
an entity.
2. A multi-valued dependency may arise as a result of enforcing 1st normal form.
3. X ->> Y, i.e. X multi-determines Y, when for each value of X we can have more
than one value of Y.
4. If A ->> B and A ->> C then we have a single attribute A which multi-determines
two other independent attributes, B and C.
5. If A ->> (B,C) then we have an attribute A which multi-determines a set of
associated attributes, B and C.
Join Dependencies (JD)
A join dependency can be described as follows:
1. If a table can be decomposed into three or more smaller tables, it must be
capable of being joined again on common keys to form the original table.
Modification Anomalies
A major objective of data normalisation is to avoid modification anomalies. These
come in two flavours:
1. An insertion anomaly is a failure to place information about a new database
entry into all the places in the database where information about that new entry
needs to be stored. In a properly normalized database, information about a new
entry needs to be inserted into only one place in the database. In an inadequately
normalized database, information about a new entry may need to be inserted into
more than one place, and, human fallibility being what it is, some of the needed
additional insertions may be missed.
2. A deletion anomaly is a failure to remove information about an existing
database entry when it is time to remove that entry. In a properly normalized
database, information about an old, to-be-gotten-rid-of entry needs to be deleted
from only one place in the database. In an inadequately normalized database,
information about that old entry may need to be deleted from more than one
place, and, human fallibility being what it is, some of the needed additional
deletions may be missed.
An update of a database involves modifications that may be additions, deletions, or
both. Thus 'update anomalies' can be either of the kinds of anomalies discussed above.
All three kinds of anomalies are highly undesirable, since their occurrence constitutes
corruption of the database. Properly normalised databases are much less susceptible to
corruption than are unnormalised databases.
Types of Relational Join
A JOIN is a method of creating a result set that combines rows from two or more
tables (relations). When comparing the contents of two tables the following conditions
may occur:
Every row in one relation has a match in the other relation.
Relation R1 contains rows that have no match in relation R2.
Relation R2 contains rows that have no match in relation R1.
INNER joins contain only matches. OUTER joins may contain mismatches as well.
Inner Join
This is sometimes known as a simple join. It returns all rows from both tables where
there is a match. If there are rows in R1 which do not have matches in R2, those rows
will not be listed. There are two possible ways of specifying this type of join:
SELECT * FROM R1, R2 WHERE R1.r1_field = R2.r2_field;
SELECT * FROM R1 INNER JOIN R2 ON R1.r1_field = R2.r2_field
If the fields to be matched have the same names in both tables then the ON condition,
as in:
ON R1.fieldname = R2.fieldname
ON (R1.field1 = R2.field1 AND R1.field2 = R2.field2)
can be replaced by the shorter USING condition, as in:
USING fieldname
USING (field1, field2)
Natural Join
A natural join is based on all columns in the two tables that have the same name. It is
semantically equivalent to an INNER JOIN or a LEFT JOIN with a USING clause
that names all columns that exist in both tables.
SELECT * FROM R1 NATURAL JOIN R2
The alternative is a keyed join which includes an ON or USING condition.
Left [Outer] Join
Returns all the rows from R1 even if there are no matches in R2. If there are no
matches in R2 then the R2 values will be shown as null.
SELECT * FROM R1 LEFT [OUTER] JOIN R2 ON R1.field = R2.field
Right [Outer] Join
Returns all the rows from R2 even if there are no matches in R1. If there are no
matches in R1 then the R1 values will be shown as null.
SELECT * FROM R1 RIGHT [OUTER] JOIN R2 ON R1.field = R2.field
Full [Outer] Join
Returns all the rows from both tables even if there are no matches in one of the tables.
If there are no matches in one of the tables then its values will be shown as null.
SELECT * FROM R1 FULL [OUTER] JOIN R2 ON R1.field = R2.field
Self Join
This joins a table to itself. This table appears twice in the FROM clause and is
followed by table aliases that qualify column names in the join condition.
SELECT a.field1, b.field2 FROM R1 a, R1 b WHERE a.field = b.field
Cross Join
This type of join is rarely used as it does not have a join condition, so every row of R1
is joined to every row of R2. For example, if both tables contain 100 rows the result
will be 10,000 rows. This is sometimes known as a cartesian product and can be
specified in either one of the following ways:
SELECT * FROM R1 CROSS JOIN R2
SELECT * FROM R1, R2
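A small runnable comparison of the join types, using two assumed tables (names, columns
and data are illustrative only):
CREATE TABLE emp  (emp_id INT, dept_id INT);
CREATE TABLE dept (dept_id INT, dept_name VARCHAR(20));
INSERT INTO emp  VALUES (1, 10), (2, 20), (3, NULL);
INSERT INTO dept VALUES (10, 'Sales'), (30, 'HR');
SELECT * FROM emp INNER JOIN dept USING (dept_id);  -- 1 row: only emp 1 matches
SELECT * FROM emp LEFT JOIN dept USING (dept_id);   -- 3 rows: emps 2 and 3 get NULL dept_name
SELECT * FROM emp CROSS JOIN dept;                  -- 3 x 2 = 6 rows, no join condition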
Entities
An entity is any object in the system that we want to model and store
information about
Individual objects are called entities
Groups of the same type of objects are called entity types or entity sets
Entities are represented by rectangles (either with round or square corners)
There are two types of entities: weak and strong entity types.
Attribute
All the data relating to an entity is held in its attributes.
An attribute is a property of an entity.
Each attribute can have any value from its domain.
Each entity within an entity type:
o May have any number of attributes.
o Can have different attribute values from those in any other entity.
o Has the same number of attributes as every other entity of that type.
Attributes can be
simple or composite
single-valued or multi-valued
Attributes can be shown on ER models
They appear inside ovals and are attached to their entity.
Note that entity types can have a large number of attributes... If all are shown
then the diagrams would be confusing. Only show an attribute if it adds
information to the ER diagram, or clarifies a point.
Figure : Attributes
Keys
A key is a data item that allows us to uniquely identify individual occurrences
of an entity type.
A candidate key is an attribute or set of attributes that uniquely identifies
individual occurrences of an entity type.
An entity type may have one or more possible candidate keys, the one which is
selected is known as the primary key.
A composite key is a candidate key that consists of two or more attributes
The name of each primary key attribute is underlined.
Relationships
A relationship type is a meaningful association between entity types
A relationship is an association of entities where the association includes one
entity from each participating entity type.
Relationship types are represented on the ER diagram by a series of lines.
As always, there are many notations in use today...
In the original Chen notation, the relationship is placed inside a diamond, e.g.
managers manage employees:
Figure : Chen's notation for relationships
For this module, we will use an alternative notation, where the relationship is a
label on the line. The meaning is identical
Figure : Relationships used in this document
Degree of a Relationship
The number of participating entities in a relationship is known as the degree of
the relationship.
If there are two entity types involved it is a binary relationship type
Figure : Binary Relationships
If there are three entity types involved it is a ternary relationship type
Figure : Ternary relationship
It is possible to have an n-ary relationship (e.g. quaternary or unary).
Unary relationships are also known as recursive relationships.
It is a relationship where the same entity participates more than once in
different roles.
In the example above we are saying that employees are managed by
employees.
If we wanted more information about who manages whom, we could introduce
a second entity type called manager.
Multiple Relationships
It is also possible to have entities associated through two or more distinct
relationships.
Figure : Multiple relationships
In the representation we use it is not possible to have attributes as part of a
relationship. To support this other entity types need to be developed.
Replacing ternary relationships
When ternary relationships occur in an ER model they should always be removed
before finishing the model. Sometimes the relationship can be replaced by a series of
binary relationships that link pairs of the entities in the original ternary relationship.
Figure : A ternary relationship example
This can result in the loss of some information - It is no longer clear which
sales assistant sold a customer a particular product.
Try replacing the ternary relationship with an entity type and a set of binary
relationships.
Relationships are usually verbs, so name the new entity type by the relationship verb
rewritten as a noun.
The relationship sells can become the entity type sale.
Figure : Replacing a ternary relationship
So a sales assistant can be linked to a specific customer and both of them to the
sale of a particular product.
This process also works for higher order relationships.
Cardinality
Relationships are rarely one-to-one
For example, a manager usually manages more than one employee
This is described by the cardinality of the relationship, for which there are four
possible categories.
One to one (1:1) relationship
One to many (1:m) relationship
Many to one (m:1) relationship
Many to many (m:n) relationship
On an ER diagram, if the end of a relationship is straight, it represents 1, while
a "crow's foot" end represents many.
A one to one relationship - a man can only marry one woman, and a woman
can only marry one man, so it is a one to one (1:1) relationship
Figure : One to One relationship example
A one to many relationship - one manager manages many employees, but each
employee only has one manager, so it is a one to many (1:m) relationship
Figure : One to Many relationship example
A many to one relationship - many students study one course. They do not
study more than one course, so it is a many to one (m:1) relationship
Figure : Many to One relationship example
A many to many relationship - One lecturer teaches many students and a
student is taught by many lecturers, so it is a many to many (m:n) relationship
Figure : Many to Many relationship example
Optionality
A relationship can be optional or mandatory.
If the relationship is mandatory
an entity at one end of the relationship must be related to an entity at the other
end.
The optionality can be different at each end of the relationship
For example, a student must be on a course. This is mandatory: the
relationship `student studies course' is mandatory.
But a course can exist before any students have enrolled. Thus the relationship
`course is_studied_by student' is optional.
To show optionality, put a circle or `0' at the `optional end' of the relationship.
As the optional relationship is `course is_studied_by student', and the optional
part of this is the student, then the `O' goes at the student end of the relationship
connection.
Figure : Optionality example
It is important to know the optionality because you must ensure that whenever
you create a new entity it has the required mandatory links.
Entity Sets
Sometimes it is useful to try out various examples of entities from an ER model. One
reason for this is to confirm the correct cardinality and optionality of a relationship.
We use an `entity set diagram' to show entity examples graphically. Consider the
example of `course is_studied_by student'.
Figure : Entity set example
Confirming Correctness
Figure : Entity set confirming errors
Use the diagram to show all possible relationship scenarios.
Go back to the requirements specification and check to see if they are allowed.
If not, then put a cross through the forbidden relationships
This allows you to show the cardinality and optionality of the relationship
Deriving the relationship parameters
To check we have the correct parameters (sometimes also known as the degree) of a
relationship, ask two questions:
1. One course is studied by how many students? Answer = `zero or more'.
o This gives us the degree at the `student' end.
o The answer `zero or more' needs to be split into two parts.
o The `more' part means that the cardinality is `many'.
o The `zero' part means that the relationship is `optional'.
o If the answer was `one or more', then the relationship would be
`mandatory'.
2. One student studies how many courses? Answer = `One'
o This gives us the degree at the `course' end of the relationship.
o The answer `one' means that the cardinality of this relationship is 1, and
is `mandatory'
o If the answer had been `zero or one', then the cardinality of the
relationship would have been 1, and be `optional'.
Redundant relationships
Some ER diagrams end up with a relationship loop.
check to see if it is possible to break the loop without losing info
Given three entities A, B, C, where there are relations A-B, B-C, and C-A,
check if it is possible to navigate between A and C via B. If it is possible, then
A-C was a redundant relationship.
Always check carefully for ways to simplify your ER diagram. It makes it
easier to read the remaining information.
Redundant relationships example
Consider entities `customer' (customer details), `address' (the address of a
customer) and `distance' (distance from the company to the customer address).
Figure : Redundant relationship
Splitting n:m Relationships
A many to many relationship in an ER model is not necessarily incorrect. It can be
replaced using an intermediate entity. This should only be done where:
the m:n relationship hides an entity
the resulting ER diagram is easier to understand.
Splitting n:m Relationships - Example
Consider the case of a car hire company. Customers hire cars: one customer hires
many cars and a car is hired by many customers.
Figure : Many to Many example
The many to many relationship can be broken down to reveal a `hire' entity, which
contains an attribute `date of hire'.
Figure : Splitting the Many to Many example
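A sketch of the revealed `hire' entity as an intersection table in SQL. The customer
and car tables, and the use of date_of_hire in the key, are assumptions for illustration:
CREATE TABLE hire (
    customer_id  INT  REFERENCES customer (customer_id),
    car_id       INT  REFERENCES car (car_id),
    date_of_hire DATE,
    -- composite key built from both foreign keys, plus the date so that
    -- the same customer may hire the same car on different days
    PRIMARY KEY (customer_id, car_id, date_of_hire)
);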
Constructing an ER model
Before beginning to draw the ER model, read the requirements specification carefully.
Document any assumptions you need to make.
1. Identify entities - list all potential entity types. These are the objects of interest
in the system. It is better to put too many entities in at this stage and then
discard them later if necessary.
2. Remove duplicate entities - Ensure that they really are separate entity types and not
just two names for the same thing.
o Also do not include the system as an entity type
o e.g. if modelling a library, the entity types might be books, borrowers,
etc.
o The library is the system, thus should not be an entity type.
3. List the attributes of each entity (all properties to describe the entity which are
relevant to the application).
o Ensure that the entity types are really needed.
o are any of them just attributes of another entity type?
o if so keep them as attributes and cross them off the entity list.
o Do not have attributes of one entity as attributes of another entity!
4. Mark the primary keys.
o Which attributes uniquely identify instances of that entity type?
o This may not be possible for some weak entities.
5. Define the relationships
o Examine each entity type to see its relationship to the others.
6. Describe the cardinality and optionality of the relationships
o Examine the constraints between participating entities.
7. Remove redundant relationships
o Examine the ER model for redundant relationships.
ER modelling is an iterative process, so draw several versions, refining each one until
you are happy with it.
Normalisation is the process of taking data from a problem and reducing it to a set of
relations while ensuring data integrity and eliminating data redundancy
Data integrity - all of the data in the database are consistent, and satisfy all
integrity constraints.
Data redundancy – if data in the database can be found in two different
locations (direct redundancy) or if data can be calculated from other data items
(indirect redundancy) then the data is said to contain redundancy.
Data should only be stored once, and we should avoid storing data that can be calculated
from other data already held in the database. During the process of normalisation
redundancy must be removed, but not at the expense of breaking data integrity rules.
If redundancy exists in the database then problems can arise when the database is in
normal operation:
When data is inserted the data must be duplicated correctly in all places where
there is redundancy. For instance, if two tables exist in a database, and both
tables contain the employee name, then creating a new employee entry requires
that both tables be updated with the employee name.
When data is modified in the database, if the data being changed has
redundancy, then all versions of the redundant data must be updated
simultaneously. So in the employee example a change to the employee name
must happen in both tables simultaneously.
The removal of redundancy helps to prevent insertion, deletion, and update errors,
since the data is only available in one attribute of one table in the database.
The data in the database can be considered to be in one of a number of `normal forms'.
Basically the normal form of the data indicates how much redundancy is in that data.
The normal forms have a strict ordering:
1. 1st Normal Form
2. 2nd Normal Form
3. 3rd Normal Form
4. BCNF
There are other normal forms, such as 4th and 5th normal forms. They are rarely
utilised in system design and are not considered further here.
To be in a particular form requires that the data meets the criteria to also be in all
normal forms before that form. Thus to be in 2nd normal form the data must meet the
criteria for both 2nd normal form and 1st normal form. The higher the form, the more
redundancy has been eliminated.
Integrity Constraints
An integrity constraint is a rule that restricts the values that may be present in the
database. The relational data model includes constraints that are used to verify the
validity of the data as well as adding meaningful structure to it:
entity integrity :
The rows (or tuples) in a relation represent entities, and each one must be uniquely
identified. Hence we have the primary key that must have a unique non-null value for
each row.
referential integrity :
This constraint involves the foreign keys. Foreign keys tie the relations together, so it
is vitally important that the links are correct. Every foreign key must either be null or
its value must be the actual value of a key in another relation.
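Both constraints are declared directly in SQL; a minimal sketch using the tutor/student
example that appears below (column types are assumptions):
CREATE TABLE tutor (
    tutor_number INT PRIMARY KEY,     -- entity integrity: unique and non-null
    tutor_name   VARCHAR(30)
);
CREATE TABLE student (
    matric_no    INT PRIMARY KEY,
    tutor_number INT REFERENCES tutor (tutor_number)   -- referential integrity:
                                                       -- must be NULL or an existing tutor_number
);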
Understanding Data
Sometimes the starting point for understanding data is given in the form of relations
and functional dependancies. This would be the case where the starting point in the
process was a detailed specification of the problem. We already know what relations
are. Functional dependancies are rules stating that given a certain set of attributes (the
determinant) determines a second set of attributes.
The definition of a functional dependency looks like A->B. In this case B is a single
attribute but it can be as many attributes as required (for instance, X->J,K,L,M). In the
functional dependency, the determinant (the left hand side of the -> sign) can
determine the set of attributes on the right hand side of the -> sign. This basically
means that A selects a particular value for B, and that A is unique. In the second
example X is unique and selects a particular set of values for J,K,L, and M. It can also
be said that B is functionally dependent on A. In addition, a particular value of A
ALWAYS gives you a particular value for B, but not vice-versa.
Consider this example:
R(matric_no, firstname, surname, tutor_number, tutor_name)
tutor_number -> tutor_name
Here there is a relation R, and a functional dependency that indicates that:
each instance of tutor_number determines exactly one tutor_name
from the data, given a tutor_number, it is always possible to work out the
tutor_name.
As an example tutor number 1 may be "Mr Smith", but tutor number 10 may
also be "Mr Smith". Given a tutor number of 1, this is ALWAYS "Mr Smith".
However, given the name "Mr Smith" it is not possible to work out if we are
talking about tutor 1 or tutor 10.
There is actually a second functional dependency for this relation, which can be
worked out from the relation itself. As the relation has a primary key, then given this
attribute you can determine all the other attributes in R. This is an implied functional
dependency and is not normally listed in the list of functional dependents.
Extracting understanding
It is possible that the relations and the determinants have not yet been defined for a
problem, and therefore must be calculated from examples of the data. Consider the
following Student table.
Student - an unnormalised table with repeating groups
matric_no  name      date_of_birth  subject    grade
960100     Smith, J  14/11/1977     Databases  C
                                    Soft_Dev   A
                                    ISDE       D
960105     White, A  10/05/1975     Soft_Dev   B
                                    ISDE       B
960120     Moore, T  11/03/1970     Databases  A
                                    Soft_Dev   B
                                    Workshop   C
960145     Smith, J  09/01/1972     Databases  B
960150     Black, D  21/08/1973     Databases  B
                                    Soft_Dev   D
                                    ISDE       C
                                    Workshop   D
The subject/grade pair is repeated for each student: 960145 has one pair while 960150
has four. Repeating groups are placed inside another set of parentheses. From the table
the following relation is generated:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
The repeating group needs a key in order that the relation can be correctly defined.
Looking at the data one can see that grade repeats within matric_no (for instance,
student 960150 has two D grades). However, subject never seems to repeat for a
single matric_no, and is therefore a candidate key in the repeating group.
Whenever keys or dependencies are extracted from example data, the information extracted is
only as good as the data sample examined. It could be that another data sample
disproves some of the key selections made or dependencies extracted. What is
important however is that the information extracted during these exercises is correct
for the data being examined.
Looking at the data itself, we can see that the same name appears more than once in
the name column. The name in conjunction with the date_of_birth seems to be unique,
suggesting a functional dependency of:
name, date_of_birth -> matric_no
This implies that not only is the matric_no sufficient to uniquely identify a student, the
student's name combined with the date of birth is also sufficient to uniquely identify a
student. It is therefore possible to have the relation Student written as:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
(this time with name and date_of_birth, rather than matric_no, as the primary key).
As guidance in cases where a variety of keys could be selected, one should try to select
the candidate key with the least number of attributes.
Flattened Tables
Note that the student table shown above explicitly identifies the repeating group. It is
also possible that the table presented will be what is called a flat table, where the
repeating group is not explicitly shown:
Student #2 - Flattened Table
matric_no name date_of_birth subject grade
960100 Smith, J 14/11/1977 Databases C
960100 Smith, J 14/11/1977 Soft_Dev A
960100 Smith, J 14/11/1977 ISDE D
960105 White, A 10/05/1975 Soft_Dev B
960105 White, A 10/05/1975 ISDE B
960120 Moore, T 11/03/1970 Databases A
960120 Moore, T 11/03/1970 Soft_Dev B
960120 Moore, T 11/03/1970 Workshop C
960145 Smith, J 09/01/1972 Databases B
960150 Black, D 21/08/1973 Databases B
960150 Black, D 21/08/1973 Soft_Dev D
960150 Black, D 21/08/1973 ISDE C
960150 Black, D 21/08/1973 Workshop B
The table still shows the same data as the previous example, but the format is
different. We have removed the repeating group (which is good) but we have
introduced redundancy (which is bad).
Sometimes you will miss spotting the repeating group, so you may produce something
like the following relation for the Student data.
Student(matric_no, name, date_of_birth, subject, grade )
matric_no -> name, date_of_birth
name, date_of_birth -> matric_no
This data does not explicitly identify the repeating group, but as you will see the result
of the normalisation process on this relation produces exactly the same relations as the
normalisation of the version that explicitly does have a repeating group.
First Normal Form
First normal form (1NF) deals with the `shape' of the record type
A relation is in 1NF if, and only if, it contains no repeating attributes or groups
of attributes.
Example:
The Student table with the repeating group is not in 1NF
It has repeating groups, and it is called an `unnormalised table'.
Relational databases require that each row only has a single value per attribute, and so
a repeating group in a row is not allowed.
To remove the repeating group, one of two things can be done:
either flatten the table and extend the key, or
decompose the relation - leading to First Normal Form
Flatten table and Extend Primary Key
The Student table with the repeating group can be written as:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
If the repeating group was flattened, as in the Student #2 data table, it would look
something like:
Student(matric_no, name, date_of_birth, subject, grade )
Although this is an improvement, we still have a problem. matric_no can no longer be
the primary key - it does not have a unique value for each row. So we have to find a
new primary key - in this case it has to be a compound key since no single attribute
can uniquely identify a row. The new primary key is a compound key (matric_no +
subject).
We have now solved the repeating groups problem, but we have created other
complications. Every repetition of the matric_no, name, and date_of_birth is
redundant and liable to produce errors.
With the relation in its flattened form, strange anomalies appear in the system.
Redundant data is the main cause of insertion, deletion, and updating anomalies.
Insertion anomaly:
With the primary key including subject, we cannot enter a new student until they have
at least one subject to study. We are not allowed NULLs in the primary key so we
must have an entry in both matric_no and subject before we can create a new record.
This is known as the insertion anomaly. It is difficult to insert new records into
the database.
On a practical level, it also means that it is difficult to keep the data up to date.
Update anomaly
If the name of a student were changed, for example Smith, J. was changed to Green, J.,
this would require not one change but many: one for every subject that Smith, J.
studied.
Deletion anomaly
If all of the records for the `Databases' subject were deleted from the table, we would
inadvertently lose all of the information on the student with matric_no 960145. This
would be the same for any student who was studying only one subject and the subject
was deleted. Again this problem arises from the need to have a compound primary
key.
Decomposing the relation
The alternative approach is to split the table into two parts, one for the
repeating groups and one for the non-repeating groups.
The primary key for the original relation is included in both of the new relations.
Record
matric_no subject grade
960100 Databases C
960100 Soft_Dev A
960100 ISDE D
960105 Soft_Dev B
960105 ISDE B
... ... ...
960150 Workshop B
Student
matric_no name date_of_birth
960100 Smith,J 14/11/1977
960105 White,A 10/05/1975
960120 Moore,T 11/03/1970
960145 Smith,J 09/01/1972
960150 Black,D 21/08/1973
We now have two relations, Student and Record.
Student contains the original non-repeating groups
Record has the original repeating groups and the matric_no
Student(matric_no, name, date_of_birth )
Record(matric_no, subject, grade )
Matric_no remains the key to the Student relation. It cannot be the complete key to the
new Record relation - we end up with a compound primary key consisting of
matric_no and subject. The matric_no is the link between the two tables - it will allow
us to find out which subjects a student is studying. So in the Record relation,
matric_no is the foreign key.
This method has eliminated some of the anomalies, though it does not always do so; it
depends on the example chosen.
In this case we no longer have the insertion anomaly
It is now possible to enter new students without knowing the subjects that they
will be studying
They will exist only in the Student table, and will not be entered in the Record
table until they are studying at least one subject.
We have also removed the deletion anomaly
If all of the `databases' subject records are removed, student 960145 still exists
in the Student table.
We have also removed the update anomaly
Student and Record are now in First Normal Form.
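The decomposition translates directly into SQL; a sketch with assumed column types:
CREATE TABLE student (
    matric_no     VARCHAR(6) PRIMARY KEY,
    name          VARCHAR(30),
    date_of_birth DATE
);
CREATE TABLE record (
    matric_no VARCHAR(6) REFERENCES student (matric_no),  -- foreign key back to student
    subject   VARCHAR(20),
    grade     CHAR(1),
    PRIMARY KEY (matric_no, subject)                      -- compound primary key
);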
Second Normal Form
Second normal form (or 2NF) is a more stringent normal form defined as:
A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully
functionally dependent on the whole key.
Thus the relation is in 1NF with no repeating groups, and all non-key attributes must
depend on the whole key, not just some part of it. Another way of saying this is that
there must be no partial key dependencies (PKDs).
The problems arise when there is a compound key, e.g. the key to the Record relation
- matric_no, subject. In this case it is possible for non-key attributes to depend on only
part of the key - i.e. on only one of the two key attributes. This is what 2NF tries to
prevent.
Consider again the Student relation from the flattened Student #2 table:
Student(matric_no, name, date_of_birth, subject, grade )
There are no repeating groups
The relation is already in 1NF
However, we have a compound primary key - so we must check all of the non-
key attributes against each part of the key to ensure they are functionally
dependent on it.
matric_no determines name and date_of_birth, but not grade.
subject together with matric_no determines grade, but not name or
date_of_birth.
So there is a problem with potential redundancies
A dependency diagram is used to show how non-key attributes relate to each part or
combination of parts in the primary key.
Figure : Dependency Diagram
This relation is not in 2NF
It appears to be two tables squashed into one.
the solution is to split the relation up into its component parts.
separate out all the attributes that are solely dependent on matric_no
put them in a new Student_details relation, with matric_no as the primary key
separate out all the attributes that are solely dependent on subject.
in this case no attributes are solely dependent on subject.
separate out all the attributes that are solely dependent on matric_no + subject
put them into a separate Student relation, keyed on matric_no + subject
All attributes in each relation are fully functionally dependent upon its primary key.
These relations are now in 2NF.
Figure : Dependencies after splitting
Interestingly this is the same set of relations as when we recognized that there were
repeating terms in the table and directly removed the repeating terms. It should not
really matter what process you followed when normalizing, as the end result should be
similar relations.
Third Normal Form
3NF is an even stricter normal form and removes virtually all the redundant data:
A relation is in 3NF if, and only if, it is in 2NF and there are no transitive
functional dependencies
Transitive functional dependencies arise:
when one non-key attribute is functionally dependent on another non-key
attribute:
FD: non-key attribute -> non-key attribute
and when there is redundancy in the database
By definition transitive functional dependency can only occur if there is more than
one non-key field, so we can say that a relation in 2NF with zero or one non-key field
must automatically be in 3NF.
project_no manager address
p1 Black,B 32 High Street
p2 Smith,J 11 New Street
p3 Black,B 32 High Street
p4 Black,B 32 High Street
Project has more than one non-key field so
we must check for transitive dependency:
address depends on the value in the manager column
every time Black,B is listed in the manager column, the address column has the
value `32 High Street'. From this the relation and functional dependency can be
implied as:
Project(project_no, manager, address)
manager -> address
in this case address is transitively dependent on manager. Manager is the
determinant - it determines the value of address. It is a transitive functional
dependency only if all attributes on the left of the "->" are not in the key but
are all in the relation, and all attributes to the right of the "->" are not in the key,
with at least one actually being in the relation.
Data redundancy arises from this:
we duplicate the address if a manager is in charge of more than one project
this causes problems if we have to change the address - several entries have to
be changed, and this could lead to errors.
The solution is to eliminate transitive functional dependency by splitting the
table
create two relations - one with the transitive dependency in it, and another for
all of the remaining attributes.
split Project into Project and Manager.
the determinant attribute becomes the primary key in the new relation
manager becomes the primary key to the Manager relation
the original key is the primary key to the remaining non-transitive attributes
in this case, project_no remains the key to the new Projects table.
Project
project_no manager
p1 Black,B
p2 Smith,J
p3 Black,B
p4 Black,B
Manager
manager address
Black,B 32 High Street
Smith,J 11 New Street
Now we need to store the address only once
If we need to know a manager's address we can look it up in the Manager
relation
The manager attribute is the link between the two tables, and in the Projects
table it is now a foreign key.
These relations are now in third normal form.
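The same 3NF split in SQL, with assumed column types:
CREATE TABLE manager (
    manager VARCHAR(30) PRIMARY KEY,   -- the determinant becomes the key
    address VARCHAR(50)                -- each address now stored only once
);
CREATE TABLE project (
    project_no VARCHAR(5) PRIMARY KEY,
    manager    VARCHAR(30) REFERENCES manager (manager)   -- foreign key
);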
Summary: 1NF
A relation is in 1NF if it contains no repeating groups
To convert an unnormalised relation to 1NF either:
Flatten the table and change the primary key, or
Decompose the relation into smaller relations, one for the repeating groups and
one for the non-repeating groups.
Remember to put the primary key from the original relation into both new
relations.
This option is liable to give the best results.
Summary: 2NF
A relation is in 2NF if it contains no repeating groups and no partial key
functional dependencies
Rule: A relation in 1NF with a single key field must be in 2NF
To convert a relation with partial functional dependencies to 2NF, create a set
of new relations:
One relation for the attributes that are fully dependent upon the key.
One relation for each part of the key that has partially dependent attributes
Summary: 3NF
A relation is in 3NF if it contains no repeating groups, no partial functional
dependencies, and no transitive functional dependencies
To convert a relation with transitive functional dependencies to 3NF, remove
the attributes involved in the transitive dependency and put them in a new
relation
Rule: A relation in 2NF with only one non-key attribute must be in 3NF
In a normalised relation a non-key field must provide a fact about the key, the
whole key and nothing but the key.
Relations in 3NF are sufficient for most practical database design problems.
However, 3NF does not guarantee that all anomalies have been removed.
Boyce-Codd Normal Form (BCNF)
When a relation has more than one candidate key, anomalies may result even
though the relation is in 3NF.
3NF does not deal satisfactorily with the case of a relation with overlapping
candidate keys
i.e. composite candidate keys with at least one attribute in common.
BCNF is based on the concept of a determinant.
A determinant is any attribute (simple or composite) on which some other
attribute is fully functionally dependent.
A relation is in BCNF if, and only if, every determinant is a candidate key.
Consider the following relation and determinants.
R(a,b,c,d)
a,c -> b,d
a,d -> b
Here, the first determinant suggests that the primary key of R could be changed from
a,b to a,c. If this change were made, all of the non-key attributes present in R could still
be determined, and therefore this change is legal. However, the second determinant
indicates that a,d determines b, but a,d could not be the key of R as a,d does not
determine all of the non-key attributes of R (it does not determine c). We would say
that the first determinant is a candidate key, but the second determinant is not a
candidate key, and thus this relation is not in BCNF (but is in 3rd normal form).
Normalisation to BCNF - Example 1
Patient No Patient Name Appointment Id Time Doctor
1 John 0 09:00 Zorro
2 Kerr 0 09:00 Killer
3 Adam 1 10:00 Zorro
4 Robert 0 13:00 Killer
5 Zane 1 14:00 Zorro
Let's consider the database extract shown above. This depicts a special dieting clinic
where each patient has 4 appointments. On the first they are weighed, on the second
they are exercised, on the third their fat is removed by surgery, and on the fourth their
mouth is stitched closed... Not all patients need all four appointments! If the Patient
Name begins with a letter before "P" they get a morning appointment, otherwise they
get an afternoon appointment. Appointment 1 is either 09:00 or 13:00, appointment 2
is 10:00 or 14:00, and so on. From this (hopefully) make-believe scenario we can extract
the following determinants:
DB(Patno,PatName,appNo,time,doctor)
Patno -> PatName
Patno,appNo -> Time,doctor
Time -> appNo
Now we have to decide what the primary key of DB is going to be. From the
information we have, we could choose:
DB(Patno,PatName,appNo,time,doctor) with key Patno,appNo (example 1a)
or
DB(Patno,PatName,appNo,time,doctor) with key Patno,time (example 1b)
Example 1a - DB(Patno,PatName,appNo,time,doctor)
1NF Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)
2NF Eliminate partial key dependencies
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)
3NF Eliminate transitive dependencies
None: so just as 2NF
BCNF Every determinant is a candidate key
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)
Go through all determinants where ALL of the left hand attributes are present
in a relation and at least ONE of the right hand attributes is also present in the
relation.
Patno -> PatName
Patno is present in DB, but not PatName, so not relevant.
Patno,appNo -> Time,doctor
All LHS present, and time and doctor also present, so relevant. Is this a
candidate key? Patno,appNo IS the key, so this is a candidate key. Thus this is
OK for BCNF compliance.
Time -> appNo
Time is present, and so is appNo, so relevant. Is this a candidate key? If it were,
then we could rewrite DB with time as the key:
DB(Patno,appNo,time,doctor)
This will not work, as you need both time and Patno together to form a unique
key. Thus this determinant is not a candidate key, and therefore DB is not in
BCNF. We need to fix this.
BCNF: rewrite to
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
time is enough to work out the appointment number of a patient. Now BCNF is
satisfied, and the final relations shown are in BCNF.
Example 1b - DB(Patno,PatName,appNo,time,doctor)
1NF Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)
2NF Eliminate partial key dependencies
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
3NF Eliminate transitive dependencies
None: so just as 2NF
BCNF Every determinant is a candidate key
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
Go through all determinants where ALL of the left-hand attributes are present
in a relation and at least ONE of the right-hand attributes is also present in the
relation.
Patno -> PatName
Patno is present in DB, but not PatName, so not relevant.
Patno,appNo -> Time,doctor
Not all LHS present, so not relevant.
Time -> appNo
Time is present, and so is appNo, so relevant. Is this a candidate key? Time is
the key of R2, so it is, and the rule for BCNF is satisfied.
BCNF: as 3NF
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
Summary - Example 1
This example has demonstrated three things:
BCNF is stronger than 3NF, relations that are in 3NF are not necessarily in
BCNF
BCNF is needed in certain situations to obtain full understanding of the data
model
there are several routes to take to arrive at the same set of relations in BCNF.
Unfortunately there are no rules as to which route will be the easiest one to
take.
Example 2
Grade_report(StudNo,StudName,(Major,Advisor,
(CourseNo,Ctitle,InstrucName,InstructLocn,Grade)))
Functional dependencies
StudNo -> StudName
CourseNo -> Ctitle,InstrucName
InstrucName -> InstructLocn
StudNo,CourseNo,Major -> Grade
StudNo,Major -> Advisor
Advisor -> Major
Unnormalised
Grade_report(StudNo,StudName,(Major,Advisor,
(CourseNo,Ctitle,InstrucName,InstructLocn,Grade)))
1NF Remove repeating groups
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,
Ctitle,InstrucName,InstructLocn,Grade)
2NF Remove partial key dependencies
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName,InstructLocn)
3NF Remove transitive dependencies
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstrucName,InstructLocn)
BCNF Every determinant is a candidate key
Student : only determinant is StudNo
StudCourse: only determinant is StudNo,Major,CourseNo
Course: only determinant is CourseNo
Instructor: only determinant is InstrucName
StudMajor: the determinants are
StudNo,Major, or
Advisor
Only StudNo,Major is a candidate key.
BCNF
Student(StudNo,StudName)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstrucName,InstructLocn)
StudMajor(StudNo,Advisor)
Advisor(Advisor,Major)
Problems BCNF overcomes
STUDENT MAJOR ADVISOR
123 PHYSICS EINSTEIN
123 MUSIC MOZART
456 BIOLOGY DARWIN
789 PHYSICS BOHR
999 PHYSICS EINSTEIN
If the record for student 456 is deleted, we lose not only information on student
456 but also the fact that DARWIN advises in BIOLOGY.
We cannot record the fact that WATSON can advise on COMPUTING until we
have a student majoring in COMPUTING to whom we can assign WATSON
as an advisor.
In BCNF we have two tables:
STUDENT ADVISOR
123 EINSTEIN
123 MOZART
456 DARWIN
789 BOHR
999 EINSTEIN
ADVISOR MAJOR
EINSTEIN PHYSICS
MOZART MUSIC
DARWIN BIOLOGY
BOHR PHYSICS
Relational Database Language - Structured Query Language
SQL is a popular relational database language first standardized in 1986 by the
American National Standards Institute (ANSI). Since then, it has been formally
adopted as an International Standard by the International Organization for
Standardization (ISO) and the International Electrotechnical Commission (IEC). It has
also been adopted as a Federal Information Processing Standard (FIPS) for the U.S.
federal government.
Database Language SQL is under continual development by the above-mentioned
standardization bodies. The most recent published version was in 1992, a 580-page
specification published by ANSI as American National Standard X3.135-1992 and by
ISO/IEC as International Standard 9075:1992. The two specifications are word-for-
word identical. Both versions are available in hard copy only from ANSI (Sales
telephone: +1-212-642-4900). Further extensions and enhancements are under
development.
DESCRIPTION of SQL
The basic structure of the relational model is a table, consisting of rows and columns.
Data definition includes declaring the name of each table to be included in a database,
the names and data types of all columns of each table, constraints on the values in and
among columns, and the granting of table manipulation privileges to prospective
users. Tables can be accessed by inserting new rows, deleting or updating existing
rows, or selecting rows that satisfy a given search condition for output. Tables can be
manipulated to produce new tables by Cartesian products, unions, intersections, joins
on matching columns, or projections on given columns.
SQL data manipulation operations may be invoked through a cursor or through a
general query specification. The language includes all arithmetic operations,
predicates for comparison and string matching, universal and existential quantifiers,
summary operations for max/min or count/sum, and GROUP BY and HAVING
clauses to partition tables by groups. Transaction management is achieved through
COMMIT and ROLLBACK statements.
The standard provides language facilities for defining application specific views of the
data. Each view is the specification of database operations that would produce a
desired table. The viewed table is then materialized at application execution time.
The SQL standard provides a Module Language for interface to other languages. Each
SQL statement may be packaged as a procedure that can be called and have
parameters passed to it from an external language. A cursor mechanism provides row-
at-a-time access from languages that can only handle one row of a table at one time.
Access control is provided by GRANT and REVOKE statements. Each prospective
user must be explicitly granted the privilege to access a specific table or view using a
specific statement.
The SQL Integrity Enhancement facility offers additional tools for referential
integrity, CHECK constraint clauses, and DEFAULT clauses. Referential integrity
allows specification of primary and foreign keys with the requirement that no foreign
key row may be inserted or updated unless a matching primary key row exists. Check
clauses allow specification of inter-column constraints to be maintained by the
database system. Default clauses provide optional default values for missing data.
The Embedded SQL specification provides SQL interface to programming languages,
specifically Ada, C, COBOL, FORTRAN, MUMPS, Pascal, and PL/I. Applications
may thereby integrate program control structures with SQL data manipulation
capabilities. The Embedded SQL syntax is just a shorthand for an explicit SQL
Module accessed from a standard conforming programming language.
SQL-92 significantly increases the size of the original 1986 standard to include a
schema manipulation language for modifying or altering schemas, schema information
tables to make schema definitions accessible to users, new facilities for dynamic
creation of SQL statements, and new data types and domains. Other new SQL-92
features include outer join, cascade update and delete referential actions, set algebra
on tables, transaction consistency levels, scrolled cursors, deferred constraint
checking, and greatly expanded exception reporting. SQL-92 also removes a number
of restrictions in order to make the language more flexible and orthogonal.
Why is it needed?
When data is stored on disk-based storage devices, it is stored as blocks of
data. These blocks are accessed in their entirety, making them the atomic
disk access operation. Disk blocks are structured in much the same way as
linked lists; both contain a section for data, a pointer to the location of the
next node (or block), and both need not be stored contiguously.
Because a set of records can be physically sorted on only one field, searching
on a field that isn't sorted requires a Linear Search, which takes N/2 block
accesses on average, where N is the number of blocks that the table spans. If
that field is a non-key field (i.e. it doesn't contain unique entries) then the
entire table space must be searched, at N block accesses. With a sorted field a
Binary Search may be used instead; this takes log2 N block accesses. Also,
since the data is sorted, a non-key field no longer requires the rest of the table
to be searched for duplicate values once a higher value is found. Thus the
performance increase is substantial.
What is indexing?
Indexing is a way of sorting a number of records on multiple fields. Creating
an index on a field in a table creates another data structure which holds the
field value, and pointer to the record it relates to. This index structure is then
sorted, allowing Binary Searches to be performed on it.
The downside to indexing is that these indexes require additional space on
the disk: since the indexes are stored together in a single file in a MyISAM
database, this file can quickly reach the size limits of the underlying file
system if many fields within the same table are indexed.
How does it work?
Firstly, let's outline a sample database table schema:
Field name Data type Size on disk
id (Primary key) Unsigned INT 4 bytes
firstName Char(50) 50 bytes
lastName Char(50) 50 bytes
emailAddress Char(100) 100 bytes
Note: char was used in place of varchar to allow for an accurate size on disk
value. This sample database contains five million rows, and is unindexed.
The performance of several queries will now be analyzed. These are a query
using the id (a sorted key field) and one using the firstName (a non-key
unsorted field).
Example 1
Given our sample database of r = 5,000,000 records of a fixed size giving a
record length of R = 204 bytes and they are stored in a MyISAM database
which is using the default block size B = 1,024 bytes. The blocking factor of
the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The
total number of blocks required to hold the table is N = (r/bfr) = 5000000/5
= 1,000,000 blocks.
A linear search on the id field would require an average of N/2 = 500,000
block accesses to find a value given that the id field is a key field. But since
the id field is also sorted a binary search can be conducted requiring an
average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see
this is a drastic improvement.
The firstName field, however, is neither sorted (so a binary search is
impossible) nor unique, and thus the table must be searched to the very
end, for exactly N = 1,000,000 block accesses. It is this situation that
indexing aims to correct.
Given that an index record contains only the indexed field and a pointer to
the original record, it stands to reason that it will be smaller than the multi-
field record that it points to. So the index itself requires fewer disk blocks
than the original table, and therefore fewer block accesses to iterate
through. The schema for an index on the firstName field is outlined
below:
Field name Data type Size on disk
firstName Char(50) 50 bytes
(record pointer) Special 4 bytes
Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the
size of the table.
Example 2
Given our sample database of r = 5,000,000 records with an index record
length of R = 54 bytes and using the default block size B = 1,024 bytes. The
blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records
per disk block. The total number of blocks required to hold the table is N =
(r/bfr) = 5000000/18 = 277,778 blocks.
Now a search using the firstName field can utilise the index to increase
performance. This allows for a binary search of the index with an average of
log2 277778 = 18.08 = 19 block accesses. The index entry then gives the
address of the actual record, which requires a further block access to read,
bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000
block accesses required by a linear search of the non-indexed table.
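The arithmetic in Examples 1 and 2 can be reproduced directly. The short Python
sketch below is only an illustration, using the figures assumed above (5,000,000
records, 1,024-byte blocks, 204-byte records, 54-byte index entries); it prints the
same block-access counts:

    import math

    r, B = 5_000_000, 1024            # records, block size in bytes

    def blocks(record_len):
        bfr = B // record_len         # blocking factor: records per block
        return math.ceil(r / bfr)     # blocks needed to hold all records

    N = blocks(204)                   # table -> 1,000,000 blocks
    print(N // 2)                     # linear search average: 500,000 accesses
    print(math.ceil(math.log2(N)))    # binary search on sorted id: 20 accesses

    Ni = blocks(54)                   # index on firstName -> 277,778 blocks
    print(math.ceil(math.log2(Ni)) + 1)  # index search + record read: 20 accesses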
When should it be used?
Given that creating an index requires additional disk space (277,778 blocks
extra from the above example), and that too many indexes can cause issues
arising from the file system's size limits, careful thought must be given to
selecting the correct fields to index.
Since indexes are only used to speed up the searching for a matching field
within the records, it stands to reason that indexing fields used only for
output would be simply a waste of disk space and processing time when
doing an insert or delete operation, and thus should be avoided. Also given
the nature of a binary search, the cardinality or uniqueness of the data is
important. Indexing on a field with a cardinality of 2 would split the data in
half, whereas a cardinality of 1,000 would return approximately 1,000
records. With such a low cardinality the effectiveness is reduced to that of a
linear search, and the query optimizer will avoid using the index if the
cardinality is less than 30% of the record number, effectively making the
index a waste of space.
Overview of Physical Storage Media
1. Several types of data storage exist in most computer systems. They vary in
speed of access, cost per unit of data, and reliability.
o Cache: most costly and fastest form of storage. Usually very small, and
managed by the hardware.
o Main Memory (MM): the storage area for data available to be operated
on.
General-purpose machine instructions operate on main memory.
Contents of main memory are usually lost in a power failure or
``crash''.
Usually too small (even with megabytes) and too expensive to
store the entire database.
o Flash memory: EEPROM (electrically erasable programmable read-
only memory).
Data in flash memory survives power failure.
Reading data from flash memory takes about 10 nanoseconds
(roughly as fast as from main memory), while writing data into
flash memory is more complicated: a single write takes about 4-10
microseconds.
To overwrite what has been written, one has to first erase the
entire bank of the memory. Flash memory supports only a limited
number of erase cycles.
It has become popular as a replacement for disks for storing
small volumes of data (5-10 megabytes).
o Magnetic-disk storage: primary medium for long-term storage.
Typically the entire database is stored on disk.
Data must be moved from disk to main memory in order for the
data to be operated on.
After operations are performed, data must be copied back to disk
if any changes were made.
Disk storage is called direct access storage as it is possible to
read data on the disk in any order (unlike sequential access).
Disk storage usually survives power failures and system crashes.
o Optical storage: CD-ROM (compact-disk read-only memory), WORM
(write-once read-many) disk (for archival storage of data), and Juke box
(containing a few drives and numerous disks loaded on demand).
o Tape Storage: used primarily for backup and archival data.
Cheaper, but much slower access, since tape must be read
sequentially from the beginning.
Used as protection from disk failures!
2. The storage device hierarchy is presented in Figure 10.1: the higher
levels are more expensive (cost per bit) and faster (access time), but
smaller in capacity.
Figure 10.1: Storage-device hierarchy
3. Another classification: Primary, secondary, and tertiary storage.
1. Primary storage: the fastest storage media, such as cache and main
memory.
2. Secondary (or on-line) storage: the next level of the hierarchy, e.g.,
magnetic disks.
3. Tertiary (or off-line) storage: magnetic tapes and optical disk juke
boxes.
Volatility of storage. Volatile storage loses its contents when the power is
removed. Without power backup, data in the volatile storage (the part of the hierarchy
from main memory up) must be written to nonvolatile storage for safekeeping.
Physical Characteristics of Disks
1. The storage capacity of a single disk ranges from 10MB to 10GB. A typical
commercial database may require hundreds of disks.
2. Figure 10.2 shows a moving-head disk mechanism.
o Each disk platter has a flat circular shape. Its two surfaces are covered
with a magnetic material and information is recorded on the surfaces.
The platters of hard disks are made from rigid metal or glass, while
floppy disks are made from flexible material.
o The disk surface is logically divided into tracks, which are subdivided
into sectors. A sector (varying from 32 bytes to 4096 bytes, usually 512
bytes) is the smallest unit of information that can be read from or written
to disk. There are 4-32 sectors per track and 20-1500 tracks per disk
surface.
o The arm can be positioned over any one of the tracks.
o The platter is spun at high speed.
o To read information, the arm is positioned over the correct track.
o When the data to be accessed passes under the head, the read or write
operation is performed.
3. A disk typically contains multiple platters (see Figure 10.2). The read-write
heads of all the platters are mounted on a single assembly called a disk arm,
and move together.
o Multiple disk arms are moved as a unit by the actuator.
o Each arm has two heads, to read the platter surfaces above and below it.
o The set of tracks over which the heads are located forms a cylinder.
o This cylinder holds the data that is accessible within the disk latency
time.
o It is clearly sensible to store related data in the same or adjacent
cylinders.
4. Disk platters range from 1.8" to 14" in diameter; 5¼" and 3½" disks
dominate because they have lower cost and faster seek times than larger
disks, yet still provide high storage capacity.
5. A disk controller interfaces between the computer system and the actual
hardware of the disk drive. It accepts commands to read or write a sector, and
initiates the corresponding actions. Disk controllers also attach checksums to
each sector to detect read errors.
6. Remapping of bad sectors: If a controller detects that a sector is damaged when
the disk is initially formatted, or when an attempt is made to write the sector, it
can logically map the sector to a different physical location.
7. SCSI (Small Computer System Interconnect) is commonly used to connect
disks to PCs and workstations. Mainframe and server systems usually have a
faster and more expensive bus to connect to the disks.
8. Head crash: the head touching the platter surface can damage the recorded
data and may cause the entire disk to fail.
9. A fixed-head disk has a separate head for each track -- very many heads, very
expensive. Multiple disk arms allow more than one track to be accessed at a
time. Both were used in high-performance mainframe systems but are relatively
rare today.
Performance Measures of Disks
The main measures of the quality of a disk are capacity, access time, data transfer
rate, and reliability.
1. access time: the time from when a read or write request is issued to when data
transfer begins. To access data on a given sector of a disk, the arm first must
move so that it is positioned over the correct track, and then must wait for the
sector to appear under it as the disk rotates. The time for repositioning the arm
is called seek time, and it increases with the distance the arm must move.
Typical seek times range from 2 to 30 milliseconds.
Average seek time is the average of the seek time, measured over a sequence of
(uniformly distributed) random requests, and it is about one third of the worst-
case seek time.
Once the seek has occurred, the time spent waiting for the sector to be accessed
to appear under the head is called rotational latency time. Average rotational
latency time is about half of the time for a full rotation of the disk. (Typical
rotational speeds of disks range from 60 to 120 rotations per second.)
The access time is then the sum of the seek time and the latency, and ranges
from 10 to 40 milliseconds (a small worked example follows this list).
2. data transfer rate, the rate at which data can be retrieved from or stored to the
disk. Current disk systems support transfer rates of 1 to 5 megabytes per
second.
3. reliability, measured by the mean time to failure. The typical mean time to
failure of disks today ranges from 30,000 to 800,000 hours (about 3.4 to 91
years).
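To make the access-time figures concrete, here is a small illustrative calculation;
the seek time and rotational speed are assumed values chosen from the ranges
quoted above:

    seek_ms = 10.0                       # assumed average seek time
    rotations_per_sec = 90               # within the 60-120 range above
    full_rotation_ms = 1000 / rotations_per_sec
    latency_ms = full_rotation_ms / 2    # average latency: half a rotation
    print(seek_ms + latency_ms)          # ~15.6 ms, inside the 10-40 ms range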
Optimization of Disk-Block Access
1. Data is transferred between disk and main memory in units called blocks.
2. A block is a contiguous sequence of bytes from a single track of one platter.
3. Block sizes range from 512 bytes to several thousand.
4. The lower levels of the file system manager convert block addresses into the
hardware-level cylinder, surface, and sector number.
5. Access to data on disk is several orders of magnitude slower than access to
data in main memory. Several optimization techniques exist besides buffering
of blocks in main memory:
o Scheduling: If several blocks from a cylinder need to be transferred, we
may save time by requesting them in the order in which they pass under
the heads. A commonly used disk-arm scheduling algorithm is the
elevator algorithm (a sketch of it follows this list).
o File organization. Organize blocks on disk in a way that corresponds
closely to the manner that we expect data to be accessed. For example,
store related information on the same track, or physically close tracks, or
adjacent cylinders in order to minimize seek time. IBM mainframe OSs
give programmers fine control over the placement of files, but this
increases the programmer's burden.
UNIX or PC OSs hide disk organizations from users. Over time, a
sequential file may become fragmented. To reduce fragmentation, the
system can make a back-up copy of the data on disk and restore the
entire disk. The restore operation writes back the blocks of each file
contiguously (or nearly so). Some systems, such as MS-DOS, have
utilities that scan the disk and then move blocks to decrease the
fragmentation.
o Nonvolatile write buffers. Use nonvolatile RAM (such as battery-backed-
up RAM) to speed up disk writes drastically (first write to the nonvolatile
RAM buffer and inform the OS that the write has completed).
o Log disk. Another approach to reducing write latency is to use a log
disk, a disk devoted to writing a sequential log. All access to the log disk
is sequential, essentially eliminating seek time, and several consecutive
blocks can be written at once, making writes to log disk several times
faster than random writes.
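As a rough illustration of the elevator algorithm mentioned under Scheduling, the
Python sketch below (illustrative only; the cylinder numbers are invented) services
pending cylinder requests in one sweep direction and reverses at the end of each
sweep:

    def elevator(requests, head, direction=+1):
        order, pending = [], sorted(requests)
        while pending:
            # requests still ahead of the head in the current direction
            ahead = [c for c in pending
                     if (c >= head if direction > 0 else c <= head)]
            if not ahead:
                direction = -direction          # reverse the sweep
                continue
            head = min(ahead) if direction > 0 else max(ahead)
            order.append(head)
            pending.remove(head)
        return order

    print(elevator([98, 183, 37, 122, 14, 124, 65, 67], head=53))
    # [65, 67, 98, 122, 124, 183, 37, 14]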
Optical Disks
1. CD-ROM has become a popular medium for distributing software, multimedia
data, and other electronic published information.
2. Capacity of CD-ROM: 500 MB. Disks and drives are cheap to mass-produce.
3. CD-ROM: much longer seek time (250 ms) and lower rotation speed (400 rpm),
leading to higher latency and a lower data-transfer rate (about 150 KB/sec).
Drives spinning at multiples of the standard audio CD speed are available.
4. Recently, a new optical format, the digital video disk (DVD), has become
standard. These disks hold between 4.7 and 17 GB of data.
5. WORM (write-once, read-many) disks are popular for archival storage of data
since they have a high capacity (about 500 MB), a longer lifetime than hard
disks, and can be removed from the drive -- good for audit trails (hard to
tamper with).
Magnetic Tapes
1. Long history, slow, and limited to sequential access, and thus are used for
backup, storage for infrequent access, and off-line medium for system transfer.
2. Moving to the correct spot may take minutes, but once positioned, tape drives
can write data at densities and speeds approaching those of disk drives.
3. 8mm tape drives have the highest density; a 350-foot tape can store 5 GB of
data.
4. Popularly used for storage of large volumes of data, such as video, image, or
remote sensing data.
Storage Access
1. Each file is partitioned into fixed-length storage units, called blocks, which are
the units of both storage allocation and data transfer.
2. It is desirable to keep as many blocks as possible in main memory. Usually, we
cannot keep all blocks in main memory, so we need to manage the allocation of
available main memory space.
3. We need to use disk storage for the database, and to transfer blocks of data
between main memory and disk. We also want to minimize the number of such
transfers, as they are time-consuming.
4. The buffer is the part of main memory available for storage of copies of disk
blocks.
Buffer manager
1. The subsystem responsible for the allocation of buffer space is called the
buffer manager.
o The buffer manager handles all requests for blocks of the database.
o If the block is already in main memory, the address in main memory is
given to the requester.
o If not, the buffer manager must read the block in from disk (possibly
displacing some other block if the buffer is full) and then pass the
address in main memory to the requester.
2. The buffer manager must use some sophisticated techniques in order to provide
good service:
o Replacement Strategy -- When there is no room left in the buffer, some
block must be removed to make way for the new one. Typical operating
system memory management schemes use a ``least recently used''
(LRU) method. (Simply remove the block least recently referenced.)
This can be improved upon for database applications.
o Pinned Blocks - For the database to be able to recover from crashes, we
need to restrict the times when a block may be written back to disk. A block
not allowed to be written is said to be pinned. Many operating systems
do not provide support for pinned blocks, and such a feature is essential
if a database is to be ``crash resistant''.
o Forced Output of Blocks - Sometimes it is necessary to write a block
back to disk even though its buffer space is not needed (this is called the
forced output of a block). This is because main memory
contents (and thus the buffer) are lost in a crash, while disk data usually
survives.
Buffer replacement policies
1. Replacement Strategy: Goal is minimization of accesses to disk. Generally it
is hard to predict which blocks will be referenced. So operating systems use the
history of past references as a guide to prediction.
o General Assumption: Blocks referenced recently are likely to be used
again.
o Therefore, if we need space, throw out the least recently referenced
block (LRU replacement scheme).
2. LRU is acceptable in operating systems; however, a database system is able to
predict future references more accurately.
3. Consider processing of the relational algebra expression borrower ⋈ customer
(the join of borrower and customer).
4. Further, assume the strategy to process this request is given by the following
pseudo-code:
for each tuple b of borrower do
    for each tuple c of customer do
        if b[cname] = c[cname]
        then begin
            let x be a tuple defined as follows:
            x[cname] := b[cname]
            x[loan#] := b[loan#]
            x[street] := c[street]
            x[city] := c[city]
            include tuple x as part of the result of borrower ⋈ customer
        end
    end
end
5. Assume that the two relations in this example are stored in different files.
o Once a tuple of borrower has been processed, it is not needed again.
Therefore, once processing of an entire block of tuples is finished, that
block is not needed in main memory, even though it has been used very
recently.
o The buffer manager should free the space occupied by a borrower block
as soon as it is processed. This strategy is called toss-immediate.
o Consider blocks containing customer tuples.
o Every block of customer tuples must be examined once for every tuple
of the borrower relation. When processing of a customer block is
completed, it will not be used again until all other customer blocks have
been processed. This means the most recently used (MRU) block will be
the last block to be re-referenced, and the least recently used will be
referenced next.
o This is the opposite of the LRU assumption. So for the inner relation's
blocks, use the MRU strategy -- if a customer block must be removed
from the buffer, choose the MRU block (a sketch contrasting the two
policies follows this list).
o For MRU strategy, the system must pin the customer block currently
being processed until the last tuple has been processed. Then it is
unpinned, becoming the most recently used block.
6. The buffer manager may also use statistical information regarding the
probability that a request will reference a particular relation.
o The data dictionary is the most frequently-used part of the database. It
should, therefore, not be removed from main memory unless necessary.
o File indices are also frequently used, and should generally be in main
memory.
o No single strategy is known that handles all possible scenarios well.
o Many database systems use LRU, despite its faults.
o Concurrency and recovery may need other buffer management
strategies, such as delayed buffer-out or forced output.
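The difference between the two victim-selection rules discussed above can be shown
in a few lines. This is only a toy sketch (block names and reference times invented
for illustration): LRU evicts the block referenced longest ago, which for the inner
relation of the nested-loop join is exactly the block needed next, while MRU evicts
the block just finished with:

    def lru_victim(last_used):
        return min(last_used, key=last_used.get)   # least recently used

    def mru_victim(last_used):
        return max(last_used, key=last_used.get)   # most recently used

    last_used = {'cust-blk-1': 10, 'cust-blk-2': 20, 'cust-blk-3': 30}
    print(lru_victim(last_used))   # cust-blk-1: the block needed soonest!
    print(mru_victim(last_used))   # cust-blk-3: re-referenced last, best victim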
File Organization
1. A file is organized logically as a sequence of records.
2. Records are mapped onto disk blocks.
3. Files are provided as a basic construct in operating systems, so we assume the
existence of an underlying file system.
4. Blocks are of a fixed size determined by the operating system.
5. Record sizes vary.
6. In a relational database, tuples of distinct relations may be of different sizes.
7. One approach to mapping the database to files is to store records of a single
fixed length in a given file.
8. An alternative is to structure files to accommodate variable-length records.
(Fixed-length is easier to implement.)
Fixed-Length Records
Consider a file of deposit records of the form:
type deposit = record
    bname : char(22);
    account# : char(10);
    balance : real;
end
o If we assume that each character occupies one byte, an integer occupies
4 bytes, and a real 8 bytes, our deposit record is 40 bytes long.
o The simplest approach is to use the first 40 bytes for the first record, the
next 40 bytes for the second, and so on.
o However, there are two problems with this approach.
o It is difficult to delete a record from this structure.
o The space it occupied must be filled by another record, or we need to
mark deleted records so that they can be ignored.
o Unless block size is a multiple of 40, some records will cross block
boundaries.
o It would then require two block accesses to read or write such a record.
2. When a record is deleted, we could move all successive records up one (Figure
10.7), which may require moving a lot of records.
o We could instead move the last record into the ``hole'' created by the
deleted record (Figure 10.8).
o This changes the order of the records.
o It turns out to be undesirable to move records to occupy freed space, as
moving requires block accesses.
o Also, insertions tend to be more frequent than deletions.
o It is acceptable to leave the space open and wait for a subsequent
insertion.
o This leads to a need for additional structure in our file design.
3. So one solution is:
o At the beginning of a file, allocate some bytes as a file header.
o This header for now need only be used to store the address of the first
record whose contents are deleted.
o This first record can then store the address of the second available
record, and so on (Figure 10.9).
o To insert a new record, we use the record pointed to by the header, and
change the header pointer to the next available record.
o If no deleted records exist we add our new record to the end of the file.
4. Note: Use of pointers requires careful programming. If a record pointed to is
moved or deleted, and that pointer is not corrected, the pointer becomes a
dangling pointer. Records pointed to are called pinned.
5. Fixed-length file insertions and deletions are relatively simple because ``one
size fits all''. For variable length, this is not the case.
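As an illustration of the header free-list scheme in point 3 above, the Python sketch
below (a simplification: list slots stand in for 40-byte record positions, and the
header is a simple variable) chains deleted slots together and reuses them on
insertion:

    class FixedLengthFile:
        def __init__(self):
            self.slots = []          # one entry per record slot
            self.first_free = None   # header: index of first deleted slot

        def delete(self, i):
            self.slots[i] = self.first_free   # deleted slot stores next free
            self.first_free = i

        def insert(self, record):
            if self.first_free is None:
                self.slots.append(record)     # no hole: grow the file
            else:
                i = self.first_free
                self.first_free = self.slots[i]   # pop the free list
                self.slots[i] = record

    f = FixedLengthFile()
    for rec in ['r0', 'r1', 'r2']:
        f.insert(rec)
    f.delete(1)
    f.insert('r3')        # reuses slot 1 instead of growing the file
    print(f.slots)        # ['r0', 'r3', 'r2']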
Variable-Length Records
1. Variable-length records arise in a database in several ways:
o Storage of multiple record types in a file.
o Record types allowing variable field size
o Record types allowing repeating fields
We'll look at several techniques, using one example with a variable-length record:
type account-list = record
    bname : char(22);
    account-info : array of record
        account# : char(10);
        balance : real;
    end
end
Account-information is an array with an arbitrary number of elements.
Byte string representation
1. Attach a special end-of-record symbol (⊥) to the end of each record. Each
record is stored as a string of successive bytes (See Figure 10.10).
Byte string representation has several disadvantages:
o It is not easy to re-use space left by a deleted record
o In general, there is no space for records to grow longer. (Must move to
expand, and record may be pinned.)
So this method is not usually used.
2. An interesting structure: the slotted-page structure.
There is a header at the beginning of each block, containing:
o the number of record entries in the header
o the end of free space in the block
o an array whose entries contain the location and size of each record.
3. The slotted-page structure requires that there be no pointers that point directly
to records. Instead, pointers must point to the entry in the header that contains
the actual location of the record. This level of indirection allows records to be
moved to prevent fragmentation of space inside a block, while still supporting
indirect pointers to the record.
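A minimal Python sketch of the slotted-page idea (illustrative only): the header's
slot array holds (offset, length) pairs, records grow downward from the end of the
block, and external pointers name a slot number rather than a byte offset, so
records can be moved within the block freely:

    class SlottedPage:
        def __init__(self, size=1024):
            self.data = bytearray(size)
            self.slots = []            # header array: (offset, length)
            self.free_end = size       # records grow down from block end

        def insert(self, record):
            self.free_end -= len(record)
            self.data[self.free_end:self.free_end + len(record)] = record
            self.slots.append((self.free_end, len(record)))
            return len(self.slots) - 1     # external pointer = slot number

        def read(self, slot):
            off, length = self.slots[slot]
            return bytes(self.data[off:off + length])

    page = SlottedPage()
    s = page.insert(b'Perryridge|A-102|400')
    print(page.read(s))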
Fixed-length representation
1. Uses one or more fixed-length records to represent one variable-length record.
2. Two techniques:
o Reserved space - uses fixed-length records large enough to
accommodate the largest variable-length record. (Unused space filled
with end-of-record symbol.)
o Pointers - represent by a list of fixed-length records, chained together.
3. The reserved space method requires the selection of some maximum record
length. (Figure 10.12)
If most records are of near-maximum length this method is useful. Otherwise,
space is wasted.
4. Then the pointer method may be used (Figure 10.13). Its disadvantage is that
space is wasted in all records in a chain except the first, as the non-repeating
fields are still present in them.
5. To overcome this last disadvantage we can split records into two blocks (See
Figure 10.14)
o Anchor block - contains first records of a chain
o Overflow block - contains records other than first in the chain.
Now all records in a block have the same length, and there is no wasted space.
Organization of Records in Files
There are several ways of organizing records in files.
heap file organization. Any record can be placed anywhere in the file where
there is space for the record. There is no ordering of records.
sequential file organization. Records are stored in sequential order, based on
the value of the search key of each record.
hashing file organization. A hash function is computed on some attribute of
each record. The result of the function specifies in which block of the file the
record should be placed -- to be discussed in chapter 11 since it is closely
related to the indexing structure.
clustering file organization. Records of several different relations can be
stored in the same file. Related records of the different relations are stored on
the same block so that one I/O operation fetches related records from all the
relations.
Sequential File Organization
1. A sequential file is designed for efficient processing of records in sorted
order on some search key.
o Records are chained together by pointers to permit fast retrieval in
search key order.
o Pointer points to next record in order.
o Records are stored physically in search key order (or as close to this as
possible).
o This minimizes number of block accesses.
o Figure 10.15 shows an example, with bname as the search key.
2. It is difficult to maintain physical sequential order as records are inserted and
deleted.
o Deletion can be managed with the pointer chains.
o Insertion poses problems if there is no space where the new record
should go.
o If space exists, use it; else put the new record in an overflow block.
o Adjust pointers accordingly.
o Figure 10.16 shows the previous example after an insertion.
o Problem: we now have some records out of physical sequential order.
o If very few records in overflow blocks, this will work well.
o If order is lost, reorganize the file.
o Reorganizations are expensive and done when system load is low.
3. If insertions rarely occur, we could keep the file in physically sorted order and
reorganize when insertion occurs. In this case, the pointer fields are no longer
required.
Clustering File Organization
1. One relation per file, with fixed-length records, is good for small databases,
and it also reduces the code size.
2. Many large-scale DB systems do not rely directly on the underlying operating
system for file management. One large OS file is allocated to the DB system,
and all relations are stored in that one file.
3. To efficiently execute queries involving the join depositor ⋈ customer, one
may store the depositor tuple for each cname near the customer tuple for the
corresponding cname, as shown in Figure 10.19.
4. This structure mixes together tuples from two relations, but allows for efficient
processing of the join.
5. If the customer has many accounts which cannot fit in one block, the remaining
records appear on nearby blocks. This file structure, called clustering, allows us
to read many of the required records using one block read.
6. Our use of clustering enhances the processing of a particular join but may
result in slow processing of other types of queries, such as selection on
customer.
For example, the query
select * from customer
now requires more block accesses, as our customer relation is now interspersed
with the depositor relation.
7. Thus it is a trade-off, depending on the types of query that the database
designer believes to be most frequent. Careful use of clustering may produce
significant performance gain.
Data Dictionary Storage
1. The database also needs to store information about the relations, known as the
data dictionary. This includes:
o Names of relations.
o Names of attributes of relations.
o Domains and lengths of attributes.
o Names and definitions of views.
o Integrity constraints (e.g., key constraints).
plus data on the system users:
o Names of authorized users.
o Accounting information about users.
plus (possibly) statistical and descriptive data:
o Number of tuples in each relation.
o Method of storage used for each relation (e.g., clustered or non-
clustered).
2. When we look at indices (Chapter 11), we'll also see a need to store
information about each index on each relation:
o Name of the index.
o Name of the relation being indexed.
o Attributes the index is on.
o Type of index.
3. This information is, in itself, a miniature database. We can use the database to
store data about itself, simplifying the overall structure of the system, and
allowing the full power of the database to be used to permit fast access to
system data.
4. The exact choice of how to represent system data using relations must be made
by the system designer. One possible representation follows.
System-catalog-schema = (relation-name, number-attrs)
Attr-schema = (attr-name, rel-name, domain-type, position, length)
User-schema = (user-name, encrypted-password, group)
Index-schema = (index-name, rel-name, index-type, index-attr)
View-schema = (view-name, definition)
INDEXES
1. An index for a file works like a catalogue in a library. Cards in alphabetic order
tell us where to find books by a particular author.
2. In real-world databases, indices like this might be too large to be efficient.
We'll look at more sophisticated indexing techniques.
3. There are two kinds of indices.
o Ordered indices: indices are based on a sorted ordering of the values.
o Hash indices: indices are based on the values being distributed
uniformly across a range of buckets. The bucket to which a value is
assigned is determined by a function, called a hash function.
4. We will consider several indexing techniques. No one technique is the best.
Each technique is best suited for a particular database application.
5. Methods will be evaluated on:
1. Access Types -- types of access that are supported efficiently, e.g.,
value-based search or range search.
2. Access Time -- time to find a particular data item or set of items.
3. Insertion Time -- time taken to insert a new data item (includes time to
find the right place to insert).
4. Deletion Time -- time to delete an item (includes time taken to find
item, as well as to update the index structure).
5. Space Overhead -- additional space occupied by an index structure.
We may have more than one index or hash function for a file. (The library may
have card catalogues by author, subject or title.)
The attribute or set of attributes used to look up records in a file is called the
search key (not to be confused with primary key, etc.).
Ordered Indices
1. In order to allow fast random access, an index structure may be used.
2. A file may have several indices on different search keys.
3. If the file containing the records is sequentially ordered, the index whose
search key specifies the sequential order of the file is the primary index, or
clustering index. Note: The search key of a primary index is usually the
primary key, but it is not necessarily so.
4. Indices whose search key specifies an order different from the sequential order
of the file are called the secondary indices, or nonclustering indices.
Primary Index
1. Index-sequential files: Files are ordered sequentially on some search key, and a
primary index is associated with it.
Figure 11.1: Sequential file for deposit records.
Dense and Sparse Indices
1. There are two types of ordered indices:
Dense Index:
o An index record appears for every search key value in the file.
o This record contains search key value and a pointer to the actual record.
Sparse Index:
o Index records are created only for some of the records.
o To locate a record, we find the index record with the largest search key
value less than or equal to the search key value we are looking for.
o We start at the record pointed to by the index record, and proceed along
the pointers in the file (that is, sequentially) until we find the desired
record.
2. Figures 11.2 and 11.3 show dense and sparse indices for the deposit file.
Figure 11.2: Dense index.
3. Notice how we would find records for Perryridge branch using both methods.
(Do it!)
Figure 11.3: Sparse index.
4. Dense indices are faster in general, but sparse indices require less space and
impose less maintenance for insertions and deletions. (Why?)
5. A good compromise: to have a sparse index with one entry per block.
Why is this good?
o Biggest cost is in bringing a block into main memory.
o We are guaranteed to have the correct block with this method, unless
record is on an overflow block (actually could be several blocks).
o Index size still small.
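The sparse-index lookup rule ("largest entry <= the search key, then scan that
block") is easy to sketch. The Python below is illustrative only, with one index
entry per block and invented branch-name data:

    import bisect

    index_keys = ['Brighton', 'Downtown', 'Mianus', 'Perryridge']
    blocks = {                        # first key in block -> block contents
        'Brighton':   [('Brighton', 217), ('Clearview', 117)],
        'Downtown':   [('Downtown', 101), ('Downtown', 110)],
        'Mianus':     [('Mianus', 215)],
        'Perryridge': [('Perryridge', 102), ('Perryridge', 201)],
    }

    def lookup(key):
        i = bisect.bisect_right(index_keys, key) - 1   # largest entry <= key
        return [rec for rec in blocks[index_keys[i]] if rec[0] == key]

    print(lookup('Perryridge'))   # [('Perryridge', 102), ('Perryridge', 201)]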
Multi-Level Indices
1. Even with a sparse index, index size may still grow too large. For 100,000
records, 10 per block, at one index record per block, that's 10,000 index
records! Even if we can fit 100 index records per block, this is 100 blocks.
2. If index is too large to be kept in main memory, a search results in several disk
reads.
o If there are no overflow blocks in the index, we can use binary search.
o This will read as many as ⌈log2(b)⌉ blocks, where b is the number of
index blocks (as many as 7 for our 100 blocks).
o If the index has overflow blocks, then a sequential search is typically
used, reading all b index blocks.
3. Solution: Construct a sparse index on the index (Figure 11.4).
Figure 11.4: Two-level sparse index.
4. Use binary search on outer index. Scan index block found until correct index
record found. Use index record as before - scan block pointed to for desired
record.
5. For very large files, additional levels of indexing may be required.
6. Indices must be updated at all levels when insertions or deletions require it.
7. Frequently, each level of index corresponds to a unit of physical storage (e.g.
indices at the level of track, cylinder and disk).
Index Update
Regardless of what form of index is used, every index must be updated whenever a
record is either inserted into or deleted from the file.
1. Deletion: o Find (look up) the record
o If the last record with a particular search key value, delete that search
key value from index.
o For dense indices, this is like deleting a record in a file.
o For sparse indices, delete a key value by replacing key value's entry in
index by next search key value. If that value already has an index entry,
delete the entry.
2. Insertion: o Find place to insert.
o Dense index: insert search key value if not present.
o Sparse index: no change unless new block is created. (In this case, the
first search key value appearing in the new block is inserted into the
index).
Secondary Indices
1. If the search key of a secondary index is not a candidate key, it is not enough to
point to just the first record with each search-key value because the remaining
records with the same search-key value could be anywhere in the file.
Therefore, a secondary index must contain pointers to all the records.
Figure 11.5: Sparse secondary index on cname.
2. We can use an extra-level of indirection to implement secondary indices on
search keys that are not candidate keys. A pointer does not point directly to the
file but to a bucket that contains pointers to the file.
o See Figure 11.5 on secondary key cname.
o To perform a lookup on Peterson, we must read all three records pointed
to by entries in bucket 2.
o Only one entry points to a Peterson record, but three records need to be
read.
o As file is not ordered physically by cname, this may take 3 block
accesses.
3. Secondary indices must be dense, with an index entry for every search-key
value, and a pointer to every record in the file.
4. Secondary indices improve the performance of queries on non-primary keys.
5. They also impose serious overhead on database modification: whenever a file
is updated, every index must be updated.
6. Designer must decide whether to use secondary indices or not.
B+-Tree Index Files
1. Primary disadvantage of index-sequential file organization is that performance
degrades as the file grows. This can be remedied by costly re-organizations.
2. The B+-tree file structure maintains its efficiency despite frequent insertions
and deletions. It imposes some acceptable update and space overheads.
3. A B+-tree index is a balanced tree in which every path from the root to a leaf
is of the same length.
4. Each nonleaf node in the tree must have between ⌈n/2⌉ and n children, where n
is fixed for a particular tree.
Structure of a B+-Tree
1. A B+-tree index is a multilevel index, but it is structured differently from the
multilevel index of index-sequential files.
2. A typical node (Figure 11.6) contains up to n-1 search key values K1, K2, ...,
Kn-1, and n pointers P1, P2, ..., Pn. Search key values in a node
are kept in sorted order.
Figure 11.6: Typical node of a B+-tree.
3. For leaf nodes, pointer Pi (i = 1, ..., n-1) points to either a file record with
search key value Ki, or a bucket of pointers to records with that search key
value. The bucket structure is used if the search key is not a primary key, and
the file is not sorted in search key order.
Pointer Pn (the nth pointer in the leaf node) is used to chain leaf nodes together
in linear order (search key order). This allows efficient sequential processing of
the file.
The ranges of values in the leaves do not overlap.
4. Non-leaf nodes form a multilevel index on the leaf nodes.
A non-leaf node may hold up to n pointers and must hold at least ⌈n/2⌉
pointers. The number of pointers in a node is called the fan-out of the node.
Consider a node containing m pointers. Pointer Pi (2 <= i <= m-1) points to a
subtree containing search key values >= Ki-1 and < Ki. Pointer Pm points to a
subtree containing search key values >= Km-1. Pointer P1 points to a subtree
containing search key values < K1.
5. Figures 11.7 (textbook Fig. 11.8) and textbook Fig. 11.9 show B+-trees for the
deposit file with n=3 and n=5.
Figure 11.7: B+-tree for deposit file with n = 3.
Queries on B+-Trees
1. Suppose we want to find all records with a search key value of k.
o Examine the root node and find the smallest search key value greater
than k; call it Ki.
o Follow pointer Pi to another node.
o If k is greater than or equal to the largest search key value Km-1 in the
node, follow pointer Pm instead.
o Continue down through non-leaf nodes, looking for the smallest search
key value > k and following the corresponding pointer.
o Eventually we arrive at a leaf node, where the pointer associated with k
will point to the desired record or bucket.
2. In processing a query, we traverse a path from the root to a leaf node. If there
are K search key values in the file, this path is no longer than ⌈log⌈n/2⌉(K)⌉.
This means that the path is not long, even in large files. For a 4k byte disk
block with a search-key size of 12 bytes and a disk pointer of 8 bytes, n is
around 200. If n = 100, a look-up of 1 million search-key values requires at
most ⌈log50(1,000,000)⌉ = 4 nodes to be accessed. Since the root is usually in
the buffer, typically only 3 or fewer disk reads are needed (a small lookup
sketch follows).
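The lookup procedure above can be sketched in a few lines of Python (illustrative
only; the node layout and the "keys >= separator go right" convention are
assumptions consistent with the description above):

    import bisect

    class Node:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys            # sorted search key values
            self.children = children    # internal node: list of subtrees
            self.values = values        # leaf node: records or buckets

    def find(node, key):
        while node.children is not None:    # descend to the leaf level
            node = node.children[bisect.bisect_right(node.keys, key)]
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None                         # search key not in the file

    leaf1 = Node(['Brighton', 'Downtown'], values=[217, 101])
    leaf2 = Node(['Mianus', 'Perryridge'], values=[215, 102])
    root = Node(['Mianus'], children=[leaf1, leaf2])
    print(find(root, 'Perryridge'))   # 102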
Updates on B+-Trees
1. Insertions and Deletions:
Insertion and deletion are more complicated, as they may require splitting or
combining nodes to keep the tree balanced. If splitting or combining are not
required, insertion works as follows:
o Find leaf node where search key value should appear.
o If value is present, add new record to the bucket.
o If value is not present, insert value in leaf node (so that search keys are
still in order).
o Create a new bucket and insert the new record.
If splitting or combining are not required, deletion works as follows:
o Deletion: Find record to be deleted, and remove it from the bucket.
o If bucket is now empty, remove search key value from leaf node.
2. Insertions Causing Splitting:
When insertion causes a leaf node to be too large, we split that node. In Figure
11.8, assume we wish to insert a record with a bname value of ``Clearview''.
o There is no room for it in the leaf node where it should appear.
o We now have n values (the n-1 search key values plus the new one we
wish to insert).
o We put the first ⌈n/2⌉ values in the existing node, and the remainder into
a new node.
o Figure 11.10 shows the result.
o The new node must be inserted into the B+-tree.
o We also need to update search key values for the parent (or higher)
nodes of the split leaf node. (Except if the new node is the leftmost one)
o Order must be preserved among the search key values in each node.
o If the parent was already full, it will have to be split.
o When a non-leaf node is split, the children are divided among the two
new nodes.
o In the worst case, splits may be required all the way up to the root. (If
the root is split, the tree becomes one level deeper.)
o Note: when we start a B+-tree, we begin with a single node that is both
the root and a single leaf. When it gets full and another insertion occurs,
we split it into two leaf nodes, requiring a new root.
3. Deletions Causing Combining:
Deleting records may cause tree nodes to contain too few pointers. Then we
must combine nodes.
o If we wish to delete ``Downtown'' from the B+-tree of Figure 11.11,
this occurs.
o In this case, the leaf node is empty and must be deleted.
o If we wish to delete ``Perryridge'' from the B+-tree of Figure 11.11, the
parent is left with only one pointer, and must be coalesced with a sibling
node.
o Sometimes higher-level nodes must also be coalesced.
o If the root becomes empty as a result, the tree is one level less deep
(Figure 11.13).
o Sometimes the pointers must be redistributed to keep the tree balanced.
o Deleting ``Perryridge'' from Figure 11.11 produces Figure 11.14.
4. To summarize:
o Insertion and deletion are complicated, but require relatively few
operations.
o Number of operations required for insertion and deletion is proportional
to logarithm of number of search keys.
o B+-trees are fast index structures for databases.
B+-Tree File Organization
1. The B+-tree structure is used not only as an index but also as an organizer for
records in a file.
2. In a B+-tree file organization, the leaf nodes of the tree store records instead of
storing pointers to records, as shown in Fig. 11.17.
3. Since records are usually larger than pointers, the maximum number of records
that can be stored in a leaf node is less than the maximum number of pointers
in a nonleaf node.
4. However, the leaf nodes are still required to be at least half full.
5. Insertion and deletion from a B+-tree file organization are handled in the same
way as in a B+-tree index.
6. When a B+-tree is used for file organization, space utilization is particularly
important. We can improve the space utilization by involving more sibling
nodes in redistribution during splits and merges.
7. In general, if m nodes are involved in redistribution, each node can be
guaranteed to contain at least ⌊(m-1)n/m⌋ entries. However, the cost of an
update becomes higher as more siblings are involved in the redistribution.
B-Tree Index Files
1. B-tree indices are similar to B+-tree indices.
o Difference is that B-tree eliminates the redundant storage of search key
values.
o In the B+-tree of Figure 11.11, some search key values appear twice.
o A corresponding B-tree of Figure 11.18 allows search key values to
appear only once.
o Thus we can store the index in less space.
Figure 11.8: Leaf and nonleaf node of a B-tree.
2. Advantages: o Lack of redundant storage (but only marginally different).
o Some searches are faster (key may be in non-leaf node).
3. Disadvantages: o Leaf and non-leaf nodes are of different size (complicates storage)
o Deletion may occur in a non-leaf node (more complicated)
Generally, the structural simplicity of the B+-tree is preferred.
Static Hashing
1. Index schemes force us to traverse an index structure. Hashing avoids this.
Hash File Organization
1. Hashing involves computing the address of a data item by computing a
function on the search key value.
2. A hash function h is a function from the set of all search key values K to the
set of all bucket addresses B.
o We choose a number of buckets to correspond to the number of search
key values we will have stored in the database.
o To perform a lookup on a search key value Ki, we compute h(Ki), and
search the bucket with that address.
o If two search keys i and j map to the same address, because
h(Ki) = h(Kj), then the bucket at the address obtained will contain
records with both search key values.
o In this case we will have to check the search key value of every record
in the bucket to get the ones we want.
o Insertion and deletion are simple.
Hash Functions
1. A good hash function gives an average-case lookup that is a small constant,
independent of the number of search keys.
2. We hope records are distributed uniformly among the buckets.
3. The worst hash function maps all keys to the same bucket.
4. The best hash function maps all keys to distinct addresses.
5. Ideally, distribution of keys to addresses is uniform and random.
6. Suppose we have 26 buckets, and map names beginning with the ith letter of
the alphabet to the ith bucket.
o Problem: this does not give uniform distribution.
o Many more names will be mapped to ``A'' than to ``X''.
o Typical hash functions perform some operation on the internal binary
machine representations of characters in a key.
o For example, compute the sum, modulo # of buckets, of the binary
representations of characters of the search key.
o See Figure 11.18, using this method for 10 buckets (assuming the ith
character in the alphabet is represented by integer i).
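That character-sum hash is small enough to state directly. The Python sketch below
uses machine character codes via ord() rather than alphabet positions, which is the
same idea in a slightly different encoding:

    def h(key, buckets=10):
        # sum of the character codes of the key, modulo the bucket count
        return sum(ord(c) for c in key) % buckets

    for name in ['Perryridge', 'Round Hill', 'Brighton']:
        print(name, '-> bucket', h(name))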
Handling of bucket overflows
1. Open hashing occurs where records are stored in different buckets: compute
the hash function and search the corresponding bucket to find a record.
2. Closed hashing occurs where all records are stored in one bucket: the hash
function computes addresses within that bucket. (Deletions are difficult.) Not
used much in database applications.
3. Drawback to our approach: Hash function must be chosen at implementation
time.
o Number of buckets is fixed, but the database may grow.
o If number is too large, we waste space.
o If number is too small, we get too many ``collisions'', resulting in
records of many search key values being in the same bucket.
o Choosing the number to be twice the number of search key values in the
file gives a good space/performance tradeoff.
Hash Indices
1. A hash index organizes the search keys with their associated pointers into a
hash file structure.
2. We apply a hash function on a search key to identify a bucket, and store the
key and its associated pointers in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are only secondary index structures, since if a
file itself is organized using hashing, there is no need for a separate hash index
structure on it.
Dynamic Hashing
1. As the database grows over time, we have three options:
o Choose hash function based on current file size. Get performance
degradation as file grows.
o Choose hash function based on anticipated file size. Space is wasted
initially.
o Periodically re-organize hash structure as file grows. Requires selecting
new hash function, recomputing all addresses and generating new
bucket assignments. Costly, and shuts down database.
2. Some hashing techniques allow the hash function to be modified dynamically
to accommodate the growth or shrinking of the database. These are called
dynamic hash functions.
o Extendable hashing is one form of dynamic hashing.
o Extendable hashing splits and coalesces buckets as database size
changes.
o This imposes some performance overhead, but space efficiency is
maintained.
o As reorganization is on one bucket at a time, overhead is acceptably
low.
3. How does it work?
Figure 11.9: General extendable hash structure.
o We choose a hash function that is uniform and random and that generates
values over a relatively large range.
o The range is b-bit binary integers (typically b = 32).
o 2^32 is over 4 billion, so we don't generate that many buckets!
o Instead we create buckets on demand, and do not use all b bits of the
hash initially.
o At any point we use i bits, where 0 <= i <= b.
o The i bits are used as an offset into a table of bucket addresses.
o Value of i grows and shrinks with the database.
o Figure 11.19 shows an extendable hash structure.
o Note that the i appearing over the bucket address table tells how many
bits are required to determine the correct bucket.
o It may be the case that several entries point to the same bucket.
o All such entries will have a common hash prefix, but the length of this
prefix may be less than i.
o So we give each bucket an integer giving the length of the common hash
prefix.
o This is shown in Figure 11.9 (textbook 11.19) as i_j.
o The number of bucket address table entries pointing to bucket j is then
2^(i - i_j).
4. To find the bucket containing search key value K_l:
o Compute h(K_l).
o Take the first i high-order bits of h(K_l).
o Look at the corresponding table entry for this i-bit string.
o Follow the bucket pointer in the table entry.
5. We now look at insertions in an extendable hashing scheme.
o Follow the same procedure for lookup, ending up in some bucket j.
o If there is room in the bucket, insert information and insert record in the
file.
o If the bucket is full, we must split the bucket, and redistribute the
records.
o If bucket is split we may need to increase the number of bits we use in
the hash.
6. Two cases exist:
1. If i = i_j, then only one entry in the bucket address table points to bucket j.
o Then we need to increase the size of the bucket address table so that we
can include pointers to the two buckets that result from splitting bucket
j.
o We increment i by one, thus considering more of the hash, and doubling
the size of the bucket address table.
o Each entry is replaced by two entries, each containing the original value.
o Now two entries in bucket address table point to bucket j.
o We allocate a new bucket z, and set the second pointer to point to z.
o Set i_j and i_z to i.
o Rehash all records in bucket j which are put in either j or z.
o Now insert new record.
o It is remotely possible, but unlikely, that the new hash will still put all of
the records in one bucket.
o If so, split again and increment i again.
2. If i > i_j, then more than one entry in the bucket address table points to
bucket j.
o Then we can split bucket j without increasing the size of the bucket
address table (why?).
o Note that all entries that point to bucket j correspond to hash prefixes
that have the same value on the leftmost i_j bits.
o We allocate a new bucket z, and set i_j and i_z to the original i_j value
plus 1.
o Now adjust entries in the bucket address table that previously pointed to
bucket j.
o Leave the first half pointing to bucket j, and make the rest point to
bucket z.
o Rehash each record in bucket j as before.
o Reattempt new insert.
7. Note that in both cases we only need to rehash records in bucket j.
8. Deletion of records is similar. Buckets may have to be coalesced, and bucket
address table may have to be halved.
9. Insertion is illustrated for the example deposit file of Figure 11.20.
o 32-bit hash values on bname are shown in Figure 11.21.
o An initial empty hash structure is shown in Figure 11.22.
o We insert records one by one.
o We (unrealistically) assume that a bucket can only hold 2 records, in
order to illustrate both situations described.
o As we insert the Perryridge and Round Hill records, this first bucket
becomes full.
o When we insert the next record (Downtown), we must split the bucket.
o Since i = i_j, we need to increase the number of bits we use from the
hash.
o We now use 1 bit, allowing us 2^1 = 2 buckets.
o This makes us double the size of the bucket address table to two entries.
o We split the bucket, placing the records whose search key hash begins
with 1 in the new bucket, and those with a 0 in the old bucket (Figure
11.23).
o Next we attempt to insert the Redwood record, and find it hashes to 1.
o That bucket is full, and i = i_j.
o So we must split that bucket, increasing the number of bits we must use
to 2.
o This necessitates doubling the bucket address table again to four entries
(Figure 11.24).
o We rehash the entries in the old bucket.
o We continue on for the deposit records of Figure 11.20, obtaining the
extendable hash structure of Figure 11.25.
10. Advantages:
o Extendable hashing provides performance that does not degrade as the
file grows.
o Minimal space overhead - no buckets need be reserved for future use.
The bucket address table only contains one pointer for each hash value
of the current prefix length.
11. Disadvantages:
o Extra level of indirection in the bucket address table.
o Added complexity.
12. Summary: A highly attractive technique, provided we accept added
complexity.
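The following compact Python sketch puts the lookup procedure and the two split cases together; the class and variable names (ExtendableHash, ij for the local depth i_j) are illustrative, and the bucket capacity of 2 mirrors the unrealistic assumption of the worked example.

import hashlib

B = 32   # b: number of bits the hash function generates

def h(key):
    # Uniform, random-looking b-bit hash of the search key.
    d = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(d[:4], 'big')

class Bucket:
    def __init__(self, depth):
        self.ij = depth        # local depth: length of common hash prefix (i_j)
        self.records = {}      # search key -> record

class ExtendableHash:
    def __init__(self, capacity=2):
        self.i = 0                  # global depth: bits of the hash in use
        self.capacity = capacity
        self.table = [Bucket(0)]    # bucket address table, 2^i entries

    def _index(self, key):
        return h(key) >> (B - self.i) if self.i else 0

    def lookup(self, key):
        return self.table[self._index(key)].records.get(key)

    def insert(self, key, record):
        j = self.table[self._index(key)]
        if key in j.records or len(j.records) < self.capacity:
            j.records[key] = record
            return
        if j.ij == self.i:
            # Case i = i_j: double the bucket address table.
            self.table = [b for b in self.table for _ in (0, 1)]
            self.i += 1
        # Now i > i_j: split bucket j without further table growth.
        j.ij += 1
        z = Bucket(j.ij)
        for idx in range(len(self.table)):
            # Redirect the upper half of the entries pointing at j.
            if self.table[idx] is j and (idx >> (self.i - j.ij)) & 1:
                self.table[idx] = z
        old, j.records = j.records, {}
        for k, r in old.items():       # rehash only the records of bucket j
            self.table[self._index(k)].records[k] = r
        self.insert(key, record)       # reattempt the insert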
Comparison of Indexing and Hashing
1. To make a wise choice between the methods seen, the database designer must
consider the following issues:
o Is the cost of periodic re-organization of the index or hash structure
acceptable?
o What is the relative frequency of insertion and deletion?
o Is it desirable to optimize average access time at the expense of
increasing worst-case access time?
o What types of queries are users likely to pose?
2. The last issue is critical to the choice between indexing and hashing. If most
queries are of the form

select A1, A2, ..., An
from r
where A_i = c

then to process this query the system will perform a lookup on an index or hash
structure for attribute A_i with value c.
8. For these sorts of queries a hashing scheme is preferable.
o Index lookup takes time proportional to the log of the number of values in R
for A_i.
o Hash structure provides lookup average time that is a small constant
(independent of database size).
9. However, the worst case favors indexing:
o Hash worst case gives time proportional to the number of values in R
for A_i.
o Index worst case is still the log of the number of values in R.
10. Index methods are preferable where a range of values is specified in the query,
e.g.

select A1, A2, ..., An
from r
where A_i >= c1 and A_i <= c2

This query finds records with A_i values in the range from c1 to c2.
o Using an index structure, we can find the bucket for value c1, and then
follow the pointer chain to read the next buckets in alphabetic (or
numeric) order until we find c2.
o If we have a hash structure instead of an index, we can find a bucket for
c1 easily, but it is not easy to find the ``next bucket''.
o A good hash function assigns values randomly to buckets.
o Also, each bucket may be assigned many search key values, so we
cannot chain them together.
o To support range queries using a hash structure, we need a hash function
that preserves order.
o For example, if K1 and K2 are search key values and K1 < K2, then
h(K1) < h(K2).
o Such a function would ensure that buckets are in key order.
o Order-preserving hash functions that also provide randomness and
uniformity are extremely difficult to find.
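A small sketch of why order matters for range queries: the sorted list plus bisect stands in for the leaf level of an ordered index, while a plain dict would stand in for a hash structure; the data and names are illustrative.

import bisect

keys = sorted(['Brighton', 'Downtown', 'Mianus', 'Perryridge', 'Redwood'])

def range_lookup(lo, hi):
    # Find all keys with lo <= key <= hi by scanning in key order.
    start = bisect.bisect_left(keys, lo)
    out = []
    for k in keys[start:]:
        if k > hi:
            break
        out.append(k)
    return out

print(range_lookup('Downtown', 'Perryridge'))
# A dict (hash structure) finds 'Downtown' in one step, but it has no
# notion of the ``next'' key, so the ordered scan above is impossible
# without an order-preserving hash function.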
Index Definition in SQL
1. Some SQL implementations include data definition commands to create and
drop indices. The IBM SAA-SQL commands are as follows.
o An index is created by

create index <index-name>
on r (<attribute-list>)

o The attribute list is the list of attributes in relation r that form the search
key for the index.
o To create an index on bname for the branch relation:

create index b-index
on branch (bname)

o If the search key is a candidate key, we add the word unique to the
definition:

create unique index b-index
on branch (bname)

o If bname is not a candidate key, an error message will appear.
o If the index creation succeeds, any attempt to insert a tuple violating this
requirement will fail.
o The unique keyword is redundant if primary keys have been defined
with integrity constraints already.
To remove an index, the command is

drop index <index-name>
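These statements can be tried in SQLite, whose create/drop index syntax is essentially the same; the snippet below is a stand-in for, not a reproduction of, the IBM SAA-SQL dialect, and the index is renamed b_index because SQLite does not accept a bare hyphen in identifiers.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table branch (bname text, bcity text, assets int)')
conn.execute('create unique index b_index on branch (bname)')

conn.execute("insert into branch values ('Perryridge', 'Horseneck', 1700000)")
try:
    # Violates the uniqueness requirement enforced by the index.
    conn.execute("insert into branch values ('Perryridge', 'Brooklyn', 0)")
except sqlite3.IntegrityError as e:
    print('rejected:', e)

conn.execute('drop index b_index')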
Multiple-Key Access
1. For some queries, it is advantageous to use multiple indices if they exist.
2. If there are two indices on deposit, one on bname and one on balance, then
suppose we have a query like
select balance
from deposit
where bname = ``Perryridge'' and balance = 1000
3. There are 3 possible strategies to process this query:
o Use the index on bname to find all records pertaining to Perryridge
branch. Examine them to see if balance = 1000
o Use the index on balance to find all records with balance = 1000.
Examine them to see if bname = ``Perryridge''.
o Use index on bname to find pointers to records pertaining to Perryridge
branch. Use index on balance to find pointers to records pertaining to
1000. Take the intersection of these two sets of pointers.
4. The third strategy takes advantage of the existence of multiple indices (a
rough sketch of this pointer-intersection approach follows this list). This
may still not work well if
o There are a large number of Perryridge records AND
o There are a large number of 1000 records AND
o Only a small number of records pertain to both Perryridge and 1000.
5. To speed up multiple search key queries special structures can be maintained.
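A rough sketch of the third strategy, with each index modelled as a dict from a key value to a set of record pointers (row ids); the data is invented for illustration.

bname_index = {'Perryridge': {1, 4, 7, 9}, 'Downtown': {2, 3}}
balance_index = {1000: {4, 5, 9}, 500: {2}}

def lookup(bname, balance):
    # Records satisfying bname = ... and balance = ...: intersect the
    # two pointer sets, then fetch only those records.
    return bname_index.get(bname, set()) & balance_index.get(balance, set())

print(lookup('Perryridge', 1000))   # {4, 9}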
Grid File
1. A grid structure for queries on two search keys is a 2-dimensional grid, or
array, indexed by values for the search keys. Figure 11.10 (textbook 11.31)
shows part of a grid structure for the deposit file.
Figure 11.10: Grid structure for deposit file.
2. A particular entry in the array contains pointers to all records with the specified
search key values.
o No special computations need to be done
o Only the right records are accessed
o Can also be used for single search key queries (one column or row)
o Easy to extend to queries on n search keys - construct an n-dimensional
array.
o Significant improvement in processing time for multiple-key queries.
o Imposes space overhead.
o Performance overhead on insertion and deletion.
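A toy sketch of the idea, assuming exact-match cells rather than the value ranges a real grid file partitions; each cell of the two-dimensional structure holds pointers to the matching records.

from collections import defaultdict

grid = defaultdict(set)   # (bname, cname) -> set of record pointers

def insert(bname, cname, rid):
    grid[(bname, cname)].add(rid)

def lookup_both(bname, cname):
    return grid[(bname, cname)]        # multiple-key query: one cell

def lookup_bname(bname):
    # Single-key query: scan one ``row'' of the grid (all cname cells).
    out = set()
    for (b, _), ptrs in grid.items():
        if b == bname:
            out |= ptrs
    return out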
Partitioned Hashing
1. An alternative approach to multiple-key queries:
o To construct a structure for queries on deposit involving bname and
cname, we construct a hash structure for the key (cname, bname).
o We split this hash function into two parts, one for each part of the key.
o The first part depends only on the cname value.
o The second part depends only on the bname value.
o Figure 11.32 shows a sample partitioned hash function.
o Note that pairs with the same cname or bname value will have 3 bits the
same in the appropriate position.
o To find the balance in all of Williams' accounts at the Perryridge branch,
we compute h(Williams, Perryridge) and access the hash structure.
2. The same hash structure can be used to answer a query on one of the search
keys:
o Compute part of partitioned hash.
o Access hash structure and scan buckets for which that part of the hash
coincides.
o Text doesn't say so, but the hash structure must have some grid-like
form imposed on it to enable searching the structure based on only some
part of the hash.
3. Partitioned hashing can also be extended to n-key search.
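A sketch of a partitioned hash on (cname, bname), assuming 3 bits per part as in the figure; the component hash h3 is invented for illustration.

def h3(s):
    # 3-bit component hash of one part of the key (illustrative).
    return sum(map(ord, s)) % 8

def h(cname, bname):
    # Concatenate the two parts: cname bits first, then bname bits.
    return (h3(cname) << 3) | h3(bname)

buckets = {}   # 6-bit bucket address -> list of records

def insert(cname, bname, record):
    buckets.setdefault(h(cname, bname), []).append(record)

def lookup_cname(cname):
    # Query on cname alone: scan every bucket whose high 3 bits match.
    part = h3(cname)
    return [r for addr, recs in buckets.items() if addr >> 3 == part
            for r in recs]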
New DB Applications
1. Traditional applications.
1. Four generations of traditional DB systems: file system, hierarchical,
CODASYL, and relational. All are designed for business applications:
inventory, employee, university, bank, library, air-line reservation
systems, etc.
2. Common features of `traditional' applications:
1. Uniformity: large number of similarly structured data items, all of
which have the same size,
2. Record orientation: the basic data items consist of fixed-length
records,
3. Small data items: each record is short,
4. Atomic fields: fields within a record are short and of fixed length.
There is no structure within fields. The 1st normal form holds.
5. Short transactions: within fractions of a second. There is no
human interaction with a transaction during its execution.
6. Static conceptual schemes: The database scheme is changed
infrequently. Only simple changes are allowed, e.g., in relational
systems: create relation, remove relation, add/remove attributes
to/from a relation scheme.
2. New applications
1. Engineering databases (CAx): computer-aided design (CAD),
computer-aided manufacturing (CAM), computer-aided engineering (CAE),
and computer-integrated manufacturing (CIM).
Tasks: A CAD database stores data pertaining to an engineering design,
including the components of the items being designed, the
inter-relationships of components, and old versions of designs.
2. Computer-aided software engineering (CASE): A CASE database stores
data required to assist software developers, including source code,
dependencies among software modules, definitions and uses of
variables, and the development history of the software system.
3. Multimedia databases: A multimedia database contains spatial data,
audio/video data, and the like. DBs of this sort arise from geophysical
data, voice mail systems and graphics applications.
4. Office Information Systems (OIS): Office automation includes
workstation-based tools for document creation and retrieval, tools for
maintaining appointment calendars, and so on. An OIS DB must allow
queries pertaining to schedules, documents, and contents of documents.
5. Hypertext databases: Hypertext is text enriched with links that point to
other documents, e.g., WWW. Hypertext documents may also be
structured in specific ways that help index them. Hypertext database
must support the ability to retrieve documents based on links, and to
query documents based on their structure.
3. Expected Features for New Applications
1. Complex objects: A complex object is an item that is viewed as a single
object in the real world, but that contains other objects (with an arbitrary
complex internal structure). Often objects are stored hierarchically,
representing the containment relationship. This leads to object-oriented
DBs and nested relational DBs.
2. Behavioral data: Distinct objects may need to respond in different ways
to the same command. For example, the deletion of certain tuples may
require the deletion of other tuples, as in the case of weak entities. In CAD and
CASE applications the behavior of different objects in response to a
given command may be widely different. This behavioral information
can be captured by storing executable code with objects in the database.
This capability is provided by the methods of OODBs and by the rule
base of KB systems.
3. Meta knowledge: General rules about the application rather than specific
tuples (i.e., data about data) form an important part of expert databases.
4. Long-duration transactions: CAD and CASE applications involve human
interaction with the data: ``what-if'' modifications that the user may wish
to undo, and concurrent designer efforts that may lead to conflicts among
transactions. Important concepts: nested transactions and correct,
nonserializable executions.
The Object-Oriented Data Model
1. A data model is a logical organization of the real-world objects (entities),
constraints on them, and the relationships among objects. A DB language is a
concrete syntax for a data model. A DB system implements a data model.
2. A core object-oriented data model consists of the following basic object-
oriented concepts:
(1) object and object identifier: Any real world entity is uniformly modeled
as an object (associated with a unique id: used to pinpoint an object to retrieve).
(2) attributes and methods: every object has a state (the set of values for the
attributes of the object) and a behavior (the set of methods - program code -
which operate on the state of the object). The state and behavior encapsulated
in an object are accessed or invoked from outside the object only through
explicit message passing.
[ An attribute is an instance variable, whose domain may be any class: user-
defined or primitive. A class composition hierarchy (aggregation relationship)
is orthogonal to the concept of a class hierarchy. The link in a class
composition hierarchy may form cycles. ]
(3) class: a means of grouping all the objects which share the same set of
attributes and methods. An object must belong to only one class as an instance
of that class (instance-of relationship). A class is similar to an abstract data
type. A class may also be primitive (no attributes), e.g., integer, string,
Boolean.
(4) Class hierarchy and inheritance: derive a new class (subclass) from an
existing class (superclass). The subclass inherits all the attributes and methods
of the existing class and may have additional attributes and methods. single
inheritance (class hierarchy) vs. multiple inheritance (class lattice).
Object Structure
1. The object-oriented paradigm is based on encapsulating code and data into a
single unit. Conceptually, all interactions between an object and the rest of the
system are via messages. Thus, the interface between an object and the rest of
the system is defined by a set of allowed messages.
2. In general, an object has associated with it:
o A set of variables that contain the data for the object. The value of each
variable is itself an object.
o A set of messages to which the object responds.
o A set of methods, each of which is a body of code to implement each
message; a method returns a value as the response to the message.
3. Motivation for using messages and methods:
All employee objects respond to the annual-salary message, but with different
computations for managers, tellers, etc. By encapsulating within the employee
object itself the information about how to compute the annual salary, all
employee objects present the same interface.
Since the only external interface presented by an object is the set of messages
to which it responds, it is possible to (i) modify the definition of methods and
variables without affecting the rest of the system, and (ii) replace a variable
with the method that computes a value, e.g., age from birth_date.
The ability to modify the definition of an object without affecting the rest of the
system is considered to be one of the major advantages of the OO
programming paradigm.
4. Methods of an object may be classified as either read-only or update. Messages
can also be classified as read-only or update. Derived attributes of an entity in
the ER model can be expressed as read-only messages.
Object Classes
1. Usually, there are many similar objects in a DB. By ``similar'', it means that
they respond to the same messages, use the same methods, and have variables
of the same name and type. We group similar objects to form a class. Each
such object is called an instance of its class. E.g., in a bank DB, customers,
accounts and loans are classes.
2. The definition of the class employee, written in pseudo-code, is shown below;
a Python rendering follows it. The definition shows the variables and the
messages to which the objects of the class respond, but not the methods that
handle the messages.

class employee {
/* Variables */
string name;
string address;
date start-date;
int salary;
/* Messages */
int annual-salary();
string get-name();
string get-address();
int set-address(string new-address);
int employment-length();
};
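A hedged Python rendering of the same class (attribute and method names adapted to Python conventions; employment_length is computed from start_date rather than stored, illustrating a derived value):

from datetime import date

class Employee:
    def __init__(self, name, address, start_date, salary):
        self.name = name
        self.address = address
        self.start_date = start_date
        self.salary = salary

    def annual_salary(self):
        return self.salary

    def get_name(self):
        return self.name

    def get_address(self):
        return self.address

    def set_address(self, new_address):
        self.address = new_address

    def employment_length(self):
        # Derived by a method rather than stored as a variable.
        return (date.today() - self.start_date).days

e = Employee('Jones', 'Main St', date(2000, 1, 1), 40000)
print(e.get_name(), e.employment_length())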
3. A class (i) captures the instance-of relationship, (ii) is the basis on which a query
may be formulated, (iii) enhances the integrity of OO systems by introducing
type checking, and (iv) reduces replication of names and integrity-related
specifications among objects in the same class.
4. The concept of classes is similar to the concept of abstract data types. There are
several additional aspects to the class concept beyond those of ADTs. To
represent these properties, we treat each class as itself being an object.
Metaclass: the class of a class. Most OODB systems do not support the strict
notion of metaclass. In ORION, CLASS is the root of the class hierarchy (the
metaclass of all other classes). A class object includes
o a set-valued variable whose value is the set of all objects that are
instances of the class,
o implementation of a method for the message new, which creates a new
instance of the class.
Inheritance
1. An object-oriented database schema typically requires a large number of
classes. Often, however, several classes are similar. For example, bank
employees are similar to customers.
2. In order to allow the direct representation of similarities among classes, we
need to place classes in a specialization hierarchy. E.g., Fig. 8.1 is a
specialization hierarchy for the ER model.
Figure 8.1: Specialization hierarchy for the banking example
The concept of a class hierarchy is similar to that of specialization in the ER
model. The corresponding class hierarchy is shown in Fig. 8.2.
Figure 8.2: Class hierarchy corresponding to the banking example
The class hierarchy can be defined in pseudo-code in Fig. 8.3, in which the
variables associated with each class are as follows. For brevity, we do not
present the methods associated with these classes.
Figure 8.3: Definition of class hierarchy in pseudo-code
3. The keyword isa is used to indicate that a class is a specialization of another
class. The specializations of a class are called subclasses. E.g., employee is a
subclass of person; teller is a subclass of employee. Conversely, employee is a
superclass of teller.
4. Class hierarchy and inheritance of properties from more general classes. E.g.,
an object representing an officer contains all the variables of classes officer,
employee, and person. Methods are inherited in a manner identical to
inheritance of variables.
5. An important benefit of inheritance in OO systems is the notion of
substitutability: any method of a class A can equally well be invoked with
an object belonging to any subclass B of A. This characteristic leads to code
reuse: methods and functions in class A (such as get-name() in class person) do
not have to be rewritten for objects of class B. A sketch follows this list.
6. Two plausible ways of associating objects with nonleaf classes:
o associate with the employee class all employee objects including those
that are instances of officer, teller, and secretary.
o associate with the employee class only those employee objects that are
instances of neither officer, nor teller, nor secretary.
Typically, the latter choice is made in OO systems. It is possible to determine
the set of all employee objects in this case by taking the union of those objects
associated with all classes in the subtree rooted at employee.
7. Most OO systems allow specialization to be partial, i.e., they allow objects that
belong to a class such as employee that do not belong to any of that class's
subclasses.
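A minimal sketch of substitutability and code reuse, using invented classes and an illustrative 10% officer bonus:

class Person:
    def __init__(self, name):
        self.name = name

    def get_name(self):        # written once, reused by all subclasses
        return self.name

class Employee(Person):
    def __init__(self, name, salary):
        super().__init__(name)
        self.salary = salary

    def annual_salary(self):
        return self.salary

class Officer(Employee):
    def annual_salary(self):          # specialized behaviour
        return self.salary * 1.10    # illustrative 10% bonus

def greet(p):
    # Accepts a Person; works equally well for any subclass of Person.
    print('Hello,', p.get_name())

greet(Officer('Jones', 50000))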
Multiple Inheritance
1. In most cases, a tree-structured organization of classes is adequate to describe
applications; all superclasses of a class are then ancestors or descendants of
one another in the hierarchy. However, there are situations that cannot
be represented well in a tree-structured class hierarchy.
2. Example. We could create subclasses: part-time-teller, full-time-teller, etc., as
shown in Fig. 8.4. But problems: (1) redundancy leads to potential
inconsistency on updates; and (2) the hierarchy cannot represent full/part-time
employees who are neither secretaries nor tellers.
Figure 8.4: Class hierarchy for full- and part-time employees.
3. Multiple inheritance: the ability of a class to inherit variables and methods
from multiple superclasses.
4. The class/subclass relationship is represented by a rooted directed acyclic graph
(DAG) in which a class may have more than one superclass.
Figure 8.5: Class DAG for the banking example.
5. Handling name conflicts: When multiple inheritance is used, there is potential
ambiguity if the same variable or method can be inherited from more than one
superclass.
6. Example. In our banking example, we may define a variable pay for each full-
time, part-time, teller and secretary as follows:
o full-time: pay is an integer from 0 to 100,000 containing annual salary.
o part-time: pay is an integer from 0 to 20 containing an hourly rate of
pay.
o teller: pay is an integer from 0 to 20,000 containing the annual salary.
o secretary: pay is an integer from 0 to 25,000 containing the annual
salary.
7. For part-time-secretary, it could inherit the definition of pay from either part-
time or secretary. We have the following options:
o Include both variables, renaming them to part-time-pay and secretary-
pay.
o Choose one or the other based on the order of creation.
o Force the user to make a choice at the time of class definition.
o Treat the situation as an error.
No single solution has been accepted as best, and different systems make
different choices.
8. Not all cases of multiple inheritance lead to ambiguity. If, instead of defining
pay, we retain the definition of variable salary in class employee, and define it
nowhere else, then all the subclasses inherit salary from employee (no
ambiguity).
9. We can use multiple inheritance to model the concept of roles. For example,
for subclasses, student, teacher and footballPlayer, an object can belong to
several categories at once and each of these categories is called a role. We can
use multiple inheritance to create subclasses, such as student-teacher, student-
footballPlayer, and so on to model the possibility of an object simultaneously
having multiple roles.
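The pay conflict can be reproduced in Python, which resolves it by the method resolution order, i.e., the ``choose one or the other based on order'' option above; the classes and values are illustrative.

class PartTime:
    pay = 15        # hourly rate, 0..20

class Secretary:
    pay = 25000     # annual salary, 0..25,000

class PartTimeSecretary(PartTime, Secretary):
    pass

# Python looks up pay along the method resolution order and finds
# PartTime's definition first, because it is listed first.
print(PartTimeSecretary.pay)   # 15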
Object Identity
1. Object identity: An object retains its identity even if some or all of the values
of variables or definitions of methods change over time.
This concept of object identity is necessary in applications but does not apply to
tuples of a relational database.
2. Object identity is a stronger notion of identity than typically found in
programming languages or in data models not based on object orientation.
3. Several forms of identity:
o value: A data value is used for identity (e.g., the primary key of a tuple
in a relational database).
o name: A user-supplied name is used for identity (e.g., file name in a file
system).
o built-in: A notion of identity is built into the data model or
programming languages, and no user-supplied identifier is required
(e.g., in OO systems).
4. Object identity is typically implemented via a unique, system-generated OID.
The value of the OID is not visible to the external user, but is used internally by
the system to identify each object uniquely and to create and manage inter-
object references.
5. There are many situations where having the system generate identifiers
automatically is a benefit, since it frees humans from performing that task.
However, this ability should be used with care. System-generated identifiers
are usually specific to the system, and have to be translated if data are moved to
a different database system. System-generated identifiers may be redundant if
the entities being modeled already have unique identifiers external to the
system, e.g., SIN#.
Object Containment
1. Objects that contain other objects are called complex or composite objects.
There can be multiple levels of containment, forming a containment hierarchy
among objects.
2. Example: A bicycle design database:
Figure 8.6: Containment hierarchy for bicycle-design database.
Fig. 8.6 shows the containment relationship in a schematic way by listing class
names. Thus the links between classes must be interpreted as is-part-of, rather
than the is-a interpretation of links in an inheritance hierarchy.
3. Containment allows data to be viewed at different granularities by different
users. E.g., wheel by wheel designer but bicycle by a sales-person.
The containment hierarchy is used to find all objects contained in a bicycle
object.
4. In certain applications, an object may be contained in several objects. In such
cases, the containment relationship is represented by a DAG rather than by a
hierarchy.
Object-Oriented Languages
The expression of object-orientation can be done in one of two ways.
1. The concepts of object-orientation can be used purely as a design tool, and are
encoded into, e.g., a relational database. E.g., ER modeling.
2. The concepts of object-orientation are incorporated into a language that is used
to manipulate the database. There are several possible languages into which the
concepts can be integrated.
1. Extend a DML such as SQL by adding complex types and object-
orientation. Systems that provide object-oriented extensions to relational
systems are called object-relational systems.
2. Take an existing object-oriented programming language and extend it to
deal with databases. Such languages are called persistent programming
languages.
Persistent Programming Languages
1. Persistent data: data that continue to exist even after the program that created it
has terminated.
2. A persistent programming language is a programming language extended with
constructs to handle persistent data. It differs from embedded SQL in at
least two ways:
1. In a persistent programming language, the query language is fully integrated
with the host language and both share the same type system. Any format
changes required in databases are carried out transparently.
Compare this with embedded SQL, where (1) the host language and DML
have different type systems, so code conversion operates outside the OO
type system and hence has a higher chance of undetected errors; and (2)
format conversion takes a substantial amount of code.
2. Using Embedded SQL, a programmer is responsible for writing explicit
code to fetch data into memory or store data back to the database.
In a persistent programming language, a programmer can manipulate
persistent data without having to write such code explicitly.
Drawbacks: (1) persistent programming languages are powerful, but it is easy
to make programming errors that damage the database; (2) automatic
high-level optimization is harder; and (3) they do not support declarative
querying well.
Persistence of Objects
Several approaches have been proposed to make the objects persistent.
1. persistence by class. Declare class to be persistent: all objects of the class are
then persistent objects. Simple, not flexible since it is often useful to have both
transient and persistent objects in a single class. In many OODB systems,
declaring a class to be persistent is interpreted as ``persistable'' -- objects in the
class potentially can be made persistent.
2. persistence by creation. Introduce new syntax to create persistent objects.
3. persistence by marking. Mark an object persistent after it is created (and before
the program terminates).
4. persistence by reference. One or more objects are explicitly declared as (root)
persistent objects. All other objects are persistent if and only if they are
referenced, directly or indirectly, from a root persistent object. It is easy
to make an entire data structure persistent by merely declaring its root
persistent, but it is expensive for a database system to follow the reference
chains to detect which objects are persistent.
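Persistence by reference can be imitated in plain Python with pickle: dumping a declared root object automatically persists everything reachable from it, directly or indirectly. This is only a sketch of the idea, not how an OODB implements it.

import pickle

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.nxt = nxt

root = Node(1, Node(2, Node(3)))     # the declared root persistent object

with open('db.bin', 'wb') as f:
    pickle.dump(root, f)             # persists the whole reachable chain

with open('db.bin', 'rb') as f:
    again = pickle.load(f)
print(again.nxt.nxt.value)           # 3 -- reached via references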
Object Identity and Pointers
1. The association of an object with a physical location in storage (as in C++)
may change over time.
2. There are several degrees of permanence of identity:
o intraprocedure: Identity persists only during the execution of a single
procedure, e.g., local variables within procedures.
o intraprogram: Identity persists only during the execution of a single
program or query, e.g., global variables in programming languages, and
main memory or virtual memory pointers.
o interprogram: Identity persists from one program execution to another,
e.g., pointers to file system data on disk; these may change if the way data
is stored in the file system is changed.
o persistent: Identity persists not only among program executions but also
among structural reorganizations of the data. This is the persistent form
of identity required for object-oriented systems.
3. In persistent extension of C++, object identifiers are implemented as
``persistent pointers'' which can be viewed as a pointer to an object in the
database.
Storage and Access of Persistent Objects
1. How are objects stored in a database?
Code (that implements methods) should be stored in the database as part of the
schema, along with type definitions, but many implementations store them
outside of the database, to avoid having to integrate system software such as
compilers with the database system.
Data: stored individually for each object.
2. How to find the objects?
1. Give names to objects like we give names to files: works only for small
sets of objects.
2. Expose object identifiers or persistent pointers to the objects; these
identifiers can then be stored externally.
3. Store the collections of object and allow programs to iterate over the
collections to find required objects. The collections can be modeled as
objects of a collection type. A special case of a collection is a class
extent, which is a collection of all objects belonging to the class.
Most OODB systems support all three ways of accessing persistent objects. All
objects have object identifiers. Names are typically given only to class extents
and other collection objects, and perhaps to other selected objects, but most
objects are not given names. Class extents are usually maintained for all
classed that can have persistent objects, but in many implementations, they
contain only persistent objects of the class.
OBJECT MODELLING
Objects are the physical and conceptual things we find in the universe around us.
Hardware, software, documents, human beings, and even concepts are all examples of
objects. For purposes of modeling his or her company, a chief executive officer could
view employees, buildings, divisions, documents, and benefits packages as objects.
An automotive engineer would see tires, doors, engines, top speed, and the current
fuel level as objects. Atoms, molecules, volumes, and temperatures would all be
objects a chemist might consider in creating an object-oriented simulation of a
chemical reaction. Finally, a software engineer would consider stacks, queues,
windows, and check boxes as objects.
Objects are thought of as having state. The state of an object is the condition of the
object, or a set of circumstances describing the object. It is not uncommon to hear
people talk about the "state information" associated with a particular object. For
example, the state of a bank account object would include the current balance, the
state of a clock object would be the current time, the state of an electric light bulb
would be "on" or "off." For complex objects like a human being or an automobile, a
complete description of the state might be very complex. Fortunately, when we use
objects to model real world or imagined situations, we typically restrict the possible
states of the objects to only those that are relevant to our models.
We also think of the state of an object as something that is internal to an object. For
example, if we place a message in a mailbox, the (internal) state of the mailbox object
is changed, whereas the (internal) state of the message object remains unchanged.
Sometimes people think of objects as being strictly static. That is, the state of an
object will not change unless something outside of the object requests the object to
change its state. Indeed, many objects are passive (static). A list of names does not
spontaneously add new names to itself, nor would we expect it to spontaneously delete
names from itself.
However, it is possible for some objects to change their own state. If an object is
capable of spontaneously changing its own state, we refer to it as an "object with
life." (Objects with life are sometimes also called "active objects" or "actors.") Clocks
and timers are common examples of objects with life. If we were modeling a business
process, we would recognize that salespeople and customers were also objects with
life.
Classes, Metaclasses, Parameterized Classes, and Exemplars
There are two broad categories of objects: classes and instances. Users of object-
oriented technology usually think of classes as containing the information necessary to
create instances, i.e., the structure and capabilities of an instance are determined by its
corresponding class. There are three commonly used (and different) views on the
definition for "class":
A class is a pattern, template, or blueprint for a category of structurally
identical items. The items created using the class are called instances. This is
often referred to as the "class as a `cookie cutter'" view. As you might guess,
the instances are the "cookies."
A class is a thing that consists of both a pattern and a mechanism for creating
items based on that pattern. This is the "class as an `instance factory'" view;
instances are the individual items that are "manufactured" (created) using the
class's creation mechanism.
A class is the set of all items created using a specific pattern. Said another
way, the class is the set of all instances of that pattern.
We should note that it is possible for an instance of a class to also be a class. A
metaclass is a class whose instances themselves are classes. This means when we use
the instance creation mechanism in a metaclass, the instance created will itself be a
class. The instance creation mechanism of this class can, in turn, be used to create
instances -- although these instances may or may not themselves be classes.
A concept very similar to the metaclass is the parameterized class. A parameterized
class is a template for a class wherein specific items have been identified as being
required to create non-parameterized classes based on the template. In effect, a
parameterized class can be viewed as a "fill in the blanks" version of a class. One
cannot directly use the instance creation mechanism of a parameterized class. First, we
must supply the required parameters, resulting in the creation of a non-parameterized
class. Once we have a non-parameterized class, we can use its creation mechanisms to
create instances.
In this article, we will use the term "class" to mean metaclass, parameterized class, or
a class that is neither a metaclass nor a parameterized class. We will make a
distinction only when it is necessary to do so. Further, we will occasionally refer to
"non-class instances." A non-class instance is an instance of a class, but is itself not a
class. An instance of a metaclass, for example, would not be a non-class instance.
In this article, we will sometimes refer to "instantiation." Instantiation has two
common meanings:
as a verb, instantiation is the process of creating an instance of a class, and
as a noun, an instantiation is an instance of a class.
Some people restrict the use of the term "object" to instances of classes. For these
people, classes are not objects. However, when these people are confronted with the
concepts of metaclasses and parameterized classes, they have difficulty attempting
to resolve the "problems" these concepts introduce. For example, is a class that is an
instance of a metaclass an object -- even though it is itself a class? In this article, we
will use the term "object" to refer to both classes and their instances. We will only
distinguish between the two when needed.
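Python makes the ``class as object'' view concrete: every class is itself an instance of the metaclass type, as the short sketch below shows (the names are invented).

# type acts as a metaclass: calling it creates a new class object.
Account = type('Account', (), {'balance': 0})

a = Account()                     # the class's creation mechanism
print(isinstance(a, Account))     # True: a non-class instance
print(isinstance(Account, type))  # True: the class is itself an instance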
Black Boxes and Interfaces
Objects are "black boxes." Specifically, the underlying implementations of objects are
hidden from those that use the object. In object-oriented systems, it is only the
producer (creator, designer, or builder) of an object that knows the details about the
internal construction of that object. The consumers (users) of an object are denied
knowledge of the inner workings of the object, and must deal with an object via one of
its three distinct interfaces:
the "public" interface. This is the interface that is open (visible) to everybody.
the "inheritance" interface. This is the interface that is accessible only by direct
specializations of the object. (We will discuss inheritance and specialization
later in this chapter.) In class-based object-oriented systems, only classes can
provide an inheritance interface.
the "parameter" interface. In the case of parameterized classes, the parameter
interface defines the parameters that must be supplied to create an instance of
the parameterized class.
Another way of saying that an item is in the public interface of an object is to say that
the object "exports" that item. Similarly, when an object requires information from
outside of itself (e.g., as with the parameters in a parameterized class), we can say that
the object needs to "import" that information.
Aggregation
It is, of course, possible for objects to be composed of other objects. Aggregation is
either:
the process of creating a new object from two or more other objects, or
an object that is composed of two or more other objects.
For example, a date object could be fashioned from a month object, a day object, and
a year object. A list of names object, for example, can be thought of as containing
many name objects.
Specialization and Inheritance
Aggregation is not the only way in which two objects can be related. One object can
be a specialization of another object. Specialization is either:
the process of defining a new object based on a (typically) more narrow
definition of an existing object, or
an object that is directly related to, and more narrowly defined than, another
object.
Specialization is usually associated with classes. It is usually only in the so-called
"classless" object-oriented systems that we think of specialization for objects other
than classes.
Depending on their technical background, there are a number of different ways in
which people express specialization. For example, those who are familiar with an
object-oriented programming language called Smalltalk refer to specializations as
"subclasses" and to the corresponding generalizations of these specializations as
"superclasses." Those with a background in the C++ programming language use the
term "derived class" for specialization and "base class" for corresponding
generalizations.
It is common to say that everything that is true for a generalization is also true for its
corresponding specialization. We can, for example, define "checking accounts" and
"savings accounts" as specializations of "bank accounts." Another way of saying this
is that a checking account is a kind of bank account, and a savings account is a kind of
bank account. Still another way of expressing this idea is to say that everything that
was true for the bank account is also true for the savings account and the checking
account.
In an object-oriented context, we speak of specializations as "inheriting"
characteristics from their corresponding generalizations. Inheritance can be defined
as the process whereby one object acquires (gets, receives) characteristics from one or
more other objects. Some object-oriented systems permit only single inheritance, a
situation in which a specialization may only acquire characteristics from a single
generalization. Many object-oriented systems, however, allow for multiple
inheritance, a situation in which a specialization may acquire characteristics from two
or more corresponding generalizations.
Our previous discussion of the bank account, checking account, and savings account
was an example of single inheritance. A telescope and a television set are both
specializations of "device that enables one to see things far away." A television set is
also a kind of "electronic device." You might say that a television set acquires
characteristics from two different generalizations, "device that enables one to see
things far away" and "electronic device." Therefore, a television set is a product of
multiple inheritance.
Abstract Classes
We usually think of classes as being complete definitions. However, there are
situations where incomplete definitions are useful, and classes that represent these
incomplete definitions are equally useful. For example, in everyday conversation, we
might talk about such items as bank accounts, insurance policies, and houses. In
object-oriented thinking, we often isolate useful, but incomplete, concepts such as
these into their own special classes.
Abstract classes are classes that embody coherent and cohesive, but incomplete,
concepts, and in turn, make these characteristics available to their specializations via
inheritance. People sometimes use the terms "partial type" and "abstract superclass" as
synonyms for abstract class. While we would never create instances of abstract
classes, we most certainly would make their individual characteristics available to
more specialized classes via inheritance.
For example, consider the concept of an automobile. On one hand, most people know
what an automobile is. On the other hand, "automobile" is not a complete definition
for any vehicle. It would be quite accurate to describe "automobile" as the set of
characteristics that make a thing an automobile, in other words, the "essence of
automobile-ness."
Operations
The public interface of an object typically contains three different categories of items:
operations (sometimes referred to as "method selectors," "method interfaces,"
"messages," or "methods"),
constants, and
exceptions.
An operation in the public interface of an object advertises a functional capability of
that object. For example, "deposit" would be an operation in the public interface of a
bank account object, "what is current temperature" would be an operation in the public
interface of a temperature sensor object, and "increment" would be an operation in the
public interface of a counter object.
The actual algorithm for accomplishing an operation is referred to as a method.
Unlike operations, methods are not in the public interface for an object. Rather,
methods are hidden on the inside of an object. So, while users of bank account objects
would know that they could make a deposit into a bank account, they would be
unaware of the details as to how that deposit actually got credited to the bank account.
MODULE 5: OPERATIONS AND MANAGEMENT
Client / Server and Databases – Data Warehousing – Query Processing – Concurrency
Management – Heterogeneous and Homogenous Systems – Distributed Databases –
Controls – Atomicity, Recovery – Security, Back-up and Recovery.
The client/server model is a computing model that acts as a distributed application
which partitions tasks or workloads between the providers of a resource or service,
called servers, and service requesters, called clients.[1]
Often clients and servers
communicate over a computer network on separate hardware, but both client and
server may reside in the same system. A server machine is a host that is running one
or more server programs which share their resources with clients. A client does not
share any of its resources, but requests a server's content or service function. Clients
therefore initiate communication sessions with servers which await incoming requests.
Figure: Schematic client-server interaction.
The client/server characteristic describes the relationship of cooperating programs in
an application. The server component provides a function or service to one or many
clients, which initiate requests for such services. A notable example of this is the way
OpenGL treats the video card of a computer as a server, with the actual application
making rendering requests to it. This model is further solidified with the OpenGL
Shading Language, with the user writing small programs that live in video memory,
and are requested from the main program through the graphics driver.
Functions such as email exchange, web access and database access are built on the
client/server model. Users accessing banking services from their computer use a web
browser client to send a request to a web server at a bank. That web server runs a
program which may in turn, forward the request to its own database client program,
which sends a request to the bank's database server (which runs on another computer)
to retrieve the account information. The balance and transaction records are returned
to the bank database client, which in turn serves it back to the user's web browser
client, displaying the results to the user. The client–server model has become one of
the central ideas of network computing. Many business applications being written
today use the client–server model, as do the Internet's main application protocols, such
as HTTP, SMTP, Telnet, and DNS.
The interaction between client and server is often described using sequence diagrams.
The Unified Modeling Language has support for sequence diagrams.
Specific types of clients include web browsers, email clients, and online chat clients.
Specific types of servers include web servers, ftp servers, application servers,
database servers, name servers, mail servers, file servers, print servers, and
terminal servers. Most web services are also types of servers.
Comparison to peer-to-peer architecture
A client/server network involves multiple clients connecting to a single, central server.
The file server on a client/server network must have robust hardware configurations to
handle large volumes of client service requests. Hence servers usually are high speed
computers with large hard disk capacity.
By contrast, peer-to-peer networks involve two or more computers pooling individual
resources such as disk drives, CD-ROMs and printers. These shared resources are
available to every computer in the network, while each of them communicates in a
session. Each computer acts as both client and server, which means all the
computers on the network are equals; that is where the term peer-to-peer comes from.
The advantage of peer-to-peer networking is the simpler control concept: it
requires no additional coordination entity and does not delay transfers by
routing via server entities. However, session collisions may be more frequent
than with routing via server nodes.
In a peer-to-peer network, software applications can be installed on a single
computer and shared by every computer in the network. They are also cheaper to set
up because most desktop operating systems have the software required for the network
installed by default. On the other hand, the client/server model works with any size or
physical layout of LAN and doesn't tend to slow down under heavy use.[3]
Peer-to-peer networks are typically less secure than client/server networks because
security is handled by the individual computers, not controlled and supervised on the
network as a whole. The resources of the computers in the network can become
congested as they have to support not only the workstation user, but also the requests
from network users. It may be difficult to provide systemwide services when the client
operating system typically used in this type of network is incapable of hosting the
service.
Client/server networks, with their additional capacities, have a higher initial
setup cost than peer-to-peer networks. In the long term, however, administering
a client/server network whose applications are largely server-hosted saves
administration effort compared with administering the application settings on
each client. In addition, the concentration of functions in high-performance
servers allows for lower-grade performance qualification of the clients.
It is possible to set up a server on a modern desktop computer, but it is recommended
to consider investment in enterprise-wide server facilities with standardised choice of
hardware and software and with a systematic and remotely operable administering
strategy. It is easier to configure and manage the server hardware and software
compared with the distributed administering requirements of a flock of computers.
In centralized database systems, the database system, application programs, and
user interface are all executed on a single system, and dumb terminals are
connected to it. The processing power of the single system is utilized, and the
dumb terminals are used only to display the information. As personal computers
became faster, more powerful, and cheaper, database systems started to exploit
the available processing power of the system at the user’s side, which led to
the development of client/server architecture. In client/server architecture,
the processing power of the computer system at the user’s end is utilized by
processing the user interface on that system. A client is a computer system
that sends requests to the server connected to the network,
and a server is a computer system that receives the request, processes it, and returns
the requested information back to the client. Client and server are usually present at
different sites. The end users (remote database users) work on client computer system
and database system runs on the server. Servers can be of several types, for example,
file servers, printer servers, web servers, database servers, etc. The client machines
have user interfaces that help users to utilize the servers. They also provide
users the local processing power to run local applications on the client side.
There are two approaches to implement client/server architecture. In the first
approach, the user interface and application programs are placed on the client side and
the database system on the server side. This architecture is called two-tier
architecture. The application programs that reside at the client side invoke the DBMS
at the server side. Application program interface standards like Open Database
Connectivity (ODBC) and Java Database Connectivity (JDBC) are used for interaction
between client and server.
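A hedged sketch of the two-tier pattern in Python, with the DB-API playing the role that ODBC/JDBC play in the text; sqlite3 is only a stand-in, since it runs in-process rather than on a separate server machine.

import sqlite3

conn = sqlite3.connect(':memory:')   # in a real two-tier setup, a network
                                     # driver would connect to the server
conn.execute('create table account (accno int, balance int)')
conn.execute('insert into account values (101, 1000)')

# The client-side application program invokes the DBMS through the API:
cur = conn.execute('select balance from account where accno = ?', (101,))
print(cur.fetchone()[0])             # the user interface displays 1000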
The second approach, that is, three-tier architecture, is primarily used for web-based
applications. It adds an intermediate layer known as the application server (or web server)
between the client and the database server. The client communicates with the
application server, which in turn communicates with the database server. The
application server stores the business rules (procedures and constraints) used for
accessing data from database server. It checks the client’s credentials before
forwarding a request to database server. Hence, it improves database security.
When a client requests information, the application server accepts the request,
processes it, and sends corresponding database commands to the database server.
The database server sends the result back to the application server, where it is
converted into GUI format and presented to the client.
Fundamentals of Distributed Databases
In recent years, the distributed database system has been emerging as an important
area of information processing, and its popularity is increasing rapidly. A distributed
database is a database that is under the control of a central DBMS in which not all
storage devices are attached to a common CPU. It may be stored on multiple
computers located in the same physical location, or may be dispersed over a network
of interconnected computers. Collections of data (e.g., in a database) can be
distributed across multiple physical locations. In short, a distributed database is a
logically interrelated collection of shared data, and a description of this data is
physically distributed over a computer network. A distributed database must ensure
the following:
The distribution is transparent – users must be able to interact with the
system as if it is a single logical system. This applies to the system
performance and methods of access, amongst other things.
The transactions are transparent – each transaction must maintain database
integrity across multiple databases. Transactions may also be divided into
subtransactions; each subtransaction affects one database system.
Features of a Distributed DBMS: A Distributed DBMS may have a number of local
applications, but it has at least one global application. Thus, a distributed DBMS has
the following features:
i. A distributed DBMS is a collection of logically related shared data.
ii. The data in a distributed DBMS is split into a number of fragments or
partitions.
iii. Fragments may be replicated in a distributed system.
iv. Fragments/replicas are allocated to different sites.
v. In a distributed system, the sites are linked by a communications network.
vi. The data at each site is under the control of a DBMS.
vii. The DBMS at each site is autonomous, that is, it can handle local
applications independently.
viii. Each DBMS in a distributed system participates in at least one global
application.
Advantages and Disadvantages of Distributed DBMS
A distributed DBMS has several advantages over traditional centralized DBMSs.
These are listed in the following.
Sharing of information – The major advantage of a distributed database system is
the provision for sharing information. Users at one site in a distributed system may
be able to access data residing at other sites. For example, consider an organization
that has a number of branches throughout the country. Each branch stores its own
data locally, yet a user in one branch can access data from another branch;
thus, information sharing is possible in a distributed system.
Components of a Distributed DBMS
A Distributed DBMS controls the storage and efficient retrieval of logically
interrelated data that are physically distributed among several sites. Therefore, a
distributed DBMS includes the following components.
a. Computer workstations (sites or nodes) – A distributed DBMS consists of a
number of computer workstations that form the network system. The
distributed database system must be independent of the computer system
hardware.
b. Network components (both hardware and software) – Each workstation in a
distributed system contains a number of network hardware and software
components. These components allow each site to interact and exchange data
with each other site. Network system independence is a desirable property of
the distributed system.
c. Communication media – In a distributed system, any type of communication
(data transfer, information exchange) among nodes is carried out through
communication media. This is a very important component of a distributed
DBMS. It is desirable that a distributed DBMS be communication media
independent, that is, it must be able to support several types of communication
media.
d. Transaction processor (TP) – A TP is a software component that resides in
each computer connected with the distributed system and is responsible for
receiving and processing both local and remote applications’ data requests.
This component is also known as the application processor (AP) or the
transaction manager (TM).
e. Data processor (DP) – A DP is also a software component that resides in each
computer connected with the distributed system and stores and retrieves data
located at that site. The DP is also known as the data manager (DM). In a
distributed DBMS, a DP may be a centralized DBMS.
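As an illustration of the TP/DP split just described, the following toy sketch (class names, sites, and the routing rule are all invented here) pictures the transaction processor as the component that decides which site owns a data item, while the data processor at that site performs the actual storage and retrieval.

import java.util.HashMap;
import java.util.Map;

// A DP stores and retrieves the data held at one site.
class DataProcessor {
    private final Map<String, String> localStore = new HashMap<>();
    void write(String key, String value) { localStore.put(key, value); }
    String read(String key) { return localStore.get(key); }
}

// The TP routes each request to the site that owns the data.
public class TransactionProcessor {
    private final Map<String, DataProcessor> dpBySite = new HashMap<>();

    TransactionProcessor() {
        dpBySite.put("north", new DataProcessor());
        dpBySite.put("south", new DataProcessor());
    }

    // Naive ownership rule for illustration; a real TP consults a catalog.
    private DataProcessor owner(String key) {
        return key.compareTo("m") < 0 ? dpBySite.get("north") : dpBySite.get("south");
    }

    public static void main(String[] args) {
        TransactionProcessor tp = new TransactionProcessor();
        tp.owner("alice").write("alice", "balance=100");   // stored at "north"
        System.out.println(tp.owner("alice").read("alice"));
    }
}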
Objectives of Distributed Concurrency Control
Concurrency control is the activity of coordinating concurrent accesses to a database
in a multi-user system. Concurrency control allows users to access a database in a
multi-programmed fashion while preserving the consistency of the data. The main
technical difficulty in achieving this goal is the necessity to prevent database updates
performed by one user from interfering with database retrievals and updates
performed by other users. The concurrency control problem is exacerbated in a
distributed DBMS (DDBMS), because in a distributed system users may access data
stored at many different sites, and a concurrency control mechanism at one site cannot
be aware of transactions at other sites instantaneously. The replication of data items in
a distributed database adds extra complexity to the concurrency control mechanism. In
this context, it is necessary to discuss the objectives of the concurrency control
mechanism in a distributed database environment. As in centralized DBMSs, the
primary goal of the concurrency control mechanism in a DDBMS is to ensure that
the consistency of the data items is preserved and that each atomic action completes
in finite time if the system has not failed. In addition, a good concurrency
control mechanism for a DDBMS has the following objectives (a minimal
lock-manager sketch follows the list):
It must be resilient to site and communication failures.
It should permit parallel execution of transactions to achieve maximum
concurrency, thus satisfying performance requirements.
Its computational methods and storage mechanisms should be modest to
minimize overhead.
It should perform satisfactorily in a network environment, taking into
account the significant communication delays involved.
It must impose few constraints on the structure of atomic actions of
transactions.
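To make the coordination problem concrete, here is a minimal single-site lock-manager sketch (the class and item names are invented for illustration): before touching a data item, a transaction acquires that item's lock, which serializes conflicting accesses. Distributed schemes layer voting, primary copies, or timestamping on top of this idea, none of which is shown here.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class LockManager {
    // One lock per data item, created on first use.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    public void lock(String item) {
        locks.computeIfAbsent(item, k -> new ReentrantLock()).lock();
    }

    public void unlock(String item) {
        locks.get(item).unlock();
    }

    public static void main(String[] args) {
        LockManager lm = new LockManager();
        lm.lock("account_42");
        try {
            // ... read/update account_42 while holding the lock ...
        } finally {
            lm.unlock("account_42");   // always release, even on failure
        }
    }
}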
Atomicity
The phrase "all or nothing" succinctly describes the first ACID
property, atomicity. When an update occurs to a database, either all or none of the
update becomes available to anyone beyond the user or application performing the
update. This update to the database is called a transaction, and it either commits or
aborts. Atomicity guarantees that no mere fragment of the update can be left in the
database should a problem occur with either the hardware or the software involved
(a JDBC sketch follows the list below).
Features to consider for atomicity:
a transaction is a unit of operation - either all the transaction's actions are
completed or none are
atomicity is maintained in the presence of deadlocks
atomicity is maintained in the presence of database software failures
atomicity is maintained in the presence of application software failures
atomicity is maintained in the presence of CPU failures
atomicity is maintained in the presence of disk failures
atomicity can be turned off at the system level
atomicity can be turned off at the session level
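For illustration, a minimal JDBC sketch of "all or nothing" follows; the connection URL, credentials, table, and amounts are placeholders, and a PostgreSQL server with its JDBC driver is assumed. Either both updates commit together, or the rollback undoes whatever was applied.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class AtomicTransferDemo {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://dbserver.example.com:5432/bank";
        try (Connection conn = DriverManager.getConnection(url, "appuser", "secret")) {
            conn.setAutoCommit(false);            // start an explicit transaction
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate("UPDATE accounts SET balance = balance - 100 WHERE id = 1");
                stmt.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 2");
                conn.commit();                    // both updates become visible at once
            } catch (SQLException e) {
                conn.rollback();                  // neither update survives a failure
                throw e;
            }
        }
    }
}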
Data Security
Data security is the protection of the database from unauthorized users; only
authorized persons are allowed to access the database. Most users are allowed
to access only a part of the database, i.e., the data that is related to them or to
their department, whereas typically the DBA or a head of department can access
all the data in the database. Some users may be permitted only to retrieve data,
whereas others are allowed to retrieve as well as to update data. Database access
is controlled by the DBA, who creates user accounts and grants rights to access the
database. Typically, users or groups of users are given usernames protected by
passwords.
Most DBMSs provide a security subsystem, which the DBA uses to
create user accounts and to specify account restrictions. The user enters his/her
account number (or username) and password to access the data in the database. For
example, if you have an account on hotmail.com (a popular website), then
you have to give your correct username and password to access your e-mail
account. Similarly, when you insert your ATM card into an automated teller machine
(ATM) at a bank, the machine reads the ID number recorded on the card and then
asks you to enter your PIN code (or password). In this way you can access your
account.
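As a sketch of how a DBA might create such restricted accounts, the following issues PostgreSQL-style SQL over JDBC; the user names, passwords, table, and privileges are invented for illustration, and the exact CREATE USER/GRANT syntax varies between DBMS products.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateRestrictedUser {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://dbserver.example.com:5432/sales";
        try (Connection conn = DriverManager.getConnection(url, "dba", "dbasecret");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("CREATE USER clerk WITH PASSWORD 'clerk_pw'");
            stmt.executeUpdate("CREATE USER manager WITH PASSWORD 'mgr_pw'");
            // clerk may only retrieve customer data; manager may also update it
            stmt.executeUpdate("GRANT SELECT ON customers TO clerk");
            stmt.executeUpdate("GRANT SELECT, UPDATE ON customers TO manager");
        }
    }
}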
Homogeneous and Heterogeneous Distributed DBMSs
A distributed DBMS may be classified as homogeneous or heterogeneous. In an ideal
distributed database system, the sites would share a common global schema (although
some relations may be stored only at some local sites), all sites would run the same
database management software, and every site would be aware of the existence of
the other sites.
In a distributed system, if all sites use the same DBMS product, it is called a
homogeneous distributed database system. However, in reality, a distributed
database has to be constructed by linking multiple already-existing database systems
together, each with its own schema and possibly running different database
management software. Such systems are called heterogeneous distributed database
systems. In a heterogeneous distributed database system, sites may run different
DBMS products that need not be based on the same underlying data model, and thus,
the system may be composed of relational, network, hierarchical, and object-oriented
DBMSs. A homogeneous distributed DBMS provides several advantages, such as
simplicity, ease of design, and incremental growth. It is much easier to design and
manage a homogeneous distributed DBMS than a heterogeneous one. In a
homogeneous distributed DBMS, adding a new site to the distributed system is much
easier, thereby supporting incremental growth. Such systems also improve
performance by exploiting the parallel processing capability of multiple sites.
Functions of Distributed DBMS: A distributed DBMS manages the storage and
processing of logically related data on interconnected computers wherein both data
and processing functions are distributed among several sites. Thus, a distributed
DBMS has at least all the functionality of a centralized DBMS. In addition, it must
provide the following functions to be classified as distributed.
Application interfaces – A distributed DBMS provides application interfaces to
interact with the end-users or application programs and with remote databases within
the distributed system. This feature is most important and is provided by all systems
that are distributed in nature.