TRANSCRIPT
UNIT 1
MODULE 1: INTRODUCTION
Database systems – Definition – Components – Advantages – Objectives – Evolution.
Definition : DBMS
A database management system (DBMS) is a system that allows users to build and maintain
databases, as well as to utilize their data and retrieve information from them. A DBMS
defines the database type that it supports, as well as its functionality and operational
capabilities. A DBMS provides the internal processes for the external applications built on
it.
A DBMS consists of software that operates databases, providing storage, access,
security, backup and other facilities to meet the needed requirements.
A database management system (DBMS) is a collection of programs that manages
the database structure and controls access to the data stored in the database. In a sense,
a database resembles a very well-organized electronic filing cabinet in which powerful
software, known as a database management system, helps manage the cabinet’s
contents.
Well known DBMSs include Oracle, IBM DB2, Microsoft SQL Server, PostgreSQL,
MySQL and SQLite.
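To make the definition concrete, here is a minimal sketch using SQLite (one of the DBMSs listed above) through Python's built-in sqlite3 module; the student table and its rows are invented for the example:

```python
import sqlite3

# Open an in-memory SQLite database; SQLite is an embedded DBMS.
conn = sqlite3.connect(":memory:")

# The DBMS manages the database structure: a table with typed columns.
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT, course TEXT)")

# The DBMS controls access to the stored data.
conn.execute("INSERT INTO student VALUES (10, 'Asha', 'Computer')")
conn.execute("INSERT INTO student VALUES (11, 'Ravi', 'Accounts')")

# Retrieve information from the data.
for row in conn.execute("SELECT name, course FROM student ORDER BY roll_no"):
    print(row)

conn.close()
```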
Evolution of database and DBMS technology
The database concept has evolved since the 1960s to ease increasing difficulties in
designing, building, and maintaining complex information systems (typically with
many concurrent end-users, and with a diverse large amount of data). It has evolved
together with database management systems which enable the effective handling of
databases. Though the terms database and DBMS define different entities, they are
inseparable: a database's properties are determined by its supporting DBMS and vice-
versa. The Oxford English Dictionary cites a 1962 technical report as the first to use the
term "data-base". With the progress in technology in the areas of processors, computer
memory, computer storage, and computer networks, the sizes, capabilities, and
performance of databases and their respective DBMSs have grown by orders of
magnitude. For decades now it has been unlikely that a complex information system could
be built effectively without a proper database supported by a DBMS. The utilization
of databases is now spread to such a wide degree that virtually every technology and
product relies on databases and DBMSs for its development and commercialization, or
even may have such embedded in it. Also, organizations and companies, from small to
large, heavily depend on databases for their operations.
The introduction of the term database coincided with the availability of direct-access
storage (disks and drums) from the mid-1960s onwards. The term represented a
contrast with the tape-based systems of the past, allowing shared interactive use rather
than daily batch processing.
In the earliest database systems, efficiency was perhaps the primary concern, but it
was already recognized that there were other important objectives. One of the key aims
was to make the data independent of the logic of application programs, so that the
same data could be made available to different applications.
The first generation of database systems were navigational: applications typically
accessed data by following pointers from one record to another. The two main data
models at this time were the hierarchical model, epitomized by IBM's IMS system,
and the Codasyl model (Network model), implemented in a number of products such
as IDMS.
The Relational model, first proposed in 1970 by Edgar F. Codd, departed from this
tradition by insisting that applications should search for data by content, rather than by
following links. This was considered necessary to allow the content of the database to
evolve without constant rewriting of applications. Relational systems placed heavy
demands on processing resources, and it was not until the mid 1980s that computing
hardware became powerful enough to allow them to be widely deployed. By the early
1990s, however, relational systems were dominant for all large-scale data processing
applications, and they remain dominant today (2012) except in niche areas. The
dominant database language is the standard SQL for the Relational model, which has
influenced database languages also for other data models.
Because the relational model emphasizes search rather than navigation, it does not
make relationships between different entities explicit in the form of pointers, but
rather represents them using primary keys and foreign keys. While this is a good basis
for a query language, it is less well suited as a modelling language. For this reason a
different model, the Entity-relationship model which emerged shortly later (1976),
gained popularity for database design.
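As a small illustration of keys standing in for pointers (a sketch with an invented two-table schema, again using Python's sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        emp_name TEXT,
        dept_id  INTEGER REFERENCES department(dept_id)  -- a key value, not a stored pointer
    );
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Research');
    INSERT INTO employee VALUES (100, 'Meena', 1), (101, 'John', 2);
""")

# The relationship is recovered by matching key values at query time,
# not by following pointers placed when the records were stored.
for row in conn.execute("""
    SELECT e.emp_name, d.dept_name
    FROM employee e JOIN department d ON e.dept_id = d.dept_id
"""):
    print(row)
```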
In the period since the 1970s database technology has kept pace with the increasing
resources becoming available from the computing platform: notably the rapid increase
in the capacity and speed (and reduction in price) of disk storage, and the increasing
capacity of main memory. This has enabled ever larger databases and higher
throughputs to be achieved.
The rigidity of the relational model, in which all data is held in tables with a fixed
structure of rows and columns, has increasingly been seen as a limitation when
handling information that is richer or more varied in structure than the traditional
'ledger-book' data of corporate information systems: for example, document databases,
engineering databases, multimedia databases, or databases used in the molecular
sciences. Various attempts have been made to address this problem, many of them
gathering under banners such as post-relational or NoSQL. Two developments of note
are the Object database and the XML database. The vendors of relational databases
have fought off competition from these newer models by extending the capabilities of
their own products to support a wider variety of data types.
General-purpose DBMS
A DBMS has evolved into a complex software system and its development typically
requires thousands of person-years of development effort. Some general-purpose
DBMSs, like Oracle, Microsoft SQL server, and IBM DB2, have been undergoing
upgrades for thirty years or more. General-purpose DBMSs aim to satisfy as many
applications as possible, which typically makes them even more complex than special-
purpose databases. However, the fact that they can be used "off the shelf", as well as
their amortized cost over many applications and instances, makes them an attractive
alternative (Vs. one-time development) whenever they meet an application's
requirements.
Though attractive in many cases, a general-purpose DBMS is not always the optimal
solution: When certain applications are pervasive with many operating instances, each
with many users, a general-purpose DBMS may introduce unnecessary overhead and
too large "footprint" (too large amount of unnecessary, unutilized software code).
Such applications usually justify dedicated development. Typical examples are email
systems, even though they need to possess certain DBMS properties: email systems are
built in a way that optimizes the handling and management of email messages, and they
do not need significant portions of general-purpose DBMS functionality.
ADVANTAGES OF DBMS
The DBMS (Database Management System) is preferred over the conventional file
processing system due to the following advantages:
1. Controlling Data Redundancy - In the conventional file processing system, every
user group maintains its own files for handling its data. This may lead to:
• Duplication of the same data in different files.
• Wastage of storage space, since duplicated data is stored.
• Errors may be generated due to updating of the same data in different files.
• Time is wasted in entering the same data again and again.
• Computer resources are used needlessly.
• It is very difficult to combine information.
2. Elimination of Inconsistency - In the file processing system information is
duplicated throughout the system, so changes made in one file may need to be
carried over to another file. This may lead to inconsistent data. So we need to remove
this duplication of data in multiple files to eliminate inconsistency.
For example, let us consider a student result system. Suppose that in the
STUDENT file it is indicated that Roll No. 10 has opted for the 'Computer' course, but in the
RESULT file it is indicated that Roll No. 10 has opted for the 'Accounts' course. Thus,
in this case the two entries for a particular student do not agree with each other, and the
database is said to be in an inconsistent state. So to eliminate this conflicting
information we need to centralize the database. On centralizing the database the
duplication will be controlled and hence inconsistency will be removed.
Data inconsistencies are often encountered in everyday life. Consider another
example: we have all come across situations when a new address is communicated to
an organization that we deal with (e.g. a telecom company, gas company, or bank). We
find that some of the communications from that organization are received at the new
address while others continue to be mailed to the old address. Combining all the data
in a database involves a reduction in redundancy as well as inconsistency, so it is likely
to reduce the costs of collecting, storing and updating data.
Let us again consider the example of Result system.
Suppose that a student with Roll No. 201 changes his course from 'Computer' to
'Arts'. The change is made in the SUBJECT file but not in the RESULT file. This may
lead to inconsistency of the data. So we need to centralize the database so that changes
once made are reflected in all the tables where a particular field is stored. The
update is then applied automatically; this is known as propagating updates.
3. Better service to the users - A DBMS is often used to provide better services to
the users. In a conventional system, availability of information is often poor, since it is
normally difficult to obtain information that the existing systems were not designed
for. Once several conventional systems are combined to form one centralized
database, the availability of information and its timeliness are likely to improve, since
the data can now be shared and the DBMS makes it easy to respond to anticipated
information requests.
Centralizing the data in the database also means that user can obtain new and
combined information easily that would have been impossible to obtain otherwise.
Also, the use of a DBMS should allow users who do not know programming to interact with
the data more easily, unlike a file processing system where the programmer may need to
write new programs to meet every new demand.
4. Flexibility of the System is Improved - Since changes are often necessary to the
contents of the data stored in any system, these changes are made more easily in a
centralized database than in a conventional system. Application programs need not
be changed when the data in the database changes.
5. Integrity can be improved - Since the data of an organization using the database
approach is centralized and is used by a number of users at a time, it is
essential to enforce integrity constraints.
In conventional systems, because the data is duplicated in multiple files, updates or
changes may sometimes lead to the entry of incorrect data in some of the files where
it exists.
For example, consider the result system that we have already discussed. Since
multiple files are maintained, you may sometimes enter a value for course
that does not exist. Suppose course can have the values (Computer, Accounts,
Economics, and Arts) but we enter the value 'Hindi' for it; this leads to
inconsistent data and hence a lack of integrity.
Even if we centralize the database it may still contain incorrect data. For example:
• The salary of a full-time employee may be entered as Rs. 500 rather than Rs. 5000.
• A student may be shown to have borrowed books but has no enrolment.
• A list of employee numbers for a given department may include a number of
nonexistent employees.
These problems can be avoided by defining validation procedures that are checked
whenever any update operation is attempted.
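In SQL such validation procedures are typically declared as constraints. A sketch, reusing the invented course list from the example above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The constraint is checked by the DBMS on every insert or update.
conn.execute("""
    CREATE TABLE result (
        roll_no INTEGER PRIMARY KEY,
        course  TEXT CHECK (course IN ('Computer', 'Accounts', 'Economics', 'Arts'))
    )
""")
conn.execute("INSERT INTO result VALUES (10, 'Computer')")    # accepted
try:
    conn.execute("INSERT INTO result VALUES (11, 'Hindi')")   # rejected by the DBMS
except sqlite3.IntegrityError as e:
    print("update rejected:", e)
```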
6. Standards can be enforced - Since all access to the database must be through
DBMS, so standards are easier to enforce. Standards may relate to the naming of data,
format of data, structure of the data etc. Standardizing stored data formats is usually
desirable for the purpose of data interchange or migration between systems.
7. Security can be improved - In conventional systems, applications are developed in
an ad hoc/temporary manner. Often different systems of an organization access
different components of the operational data; in such an environment enforcing
security can be quite difficult. Setting up a database makes it easier to enforce
security restrictions since the data is now centralized. It is easier to control who has
access to what parts of the database. Different checks can be established for each type of
access (retrieve, modify, delete etc.) to each piece of information in the database.
Consider an example of banking in which employees at different levels may be
given access to different types of data in the database. A clerk may be given the
authority to know only the names of all the customers who have a loan in the bank, but
not the details of each loan a customer may have. This can be accomplished by giving
privileges to each employee.
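Standard SQL expresses such privileges with GRANT statements; SQLite (used for the sketches here) has no user accounts, so the sketch below shows the companion technique of exposing only a restricted view. The loan table and its contents are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loan (loan_no INTEGER PRIMARY KEY, customer_name TEXT, amount REAL);
    INSERT INTO loan VALUES (1, 'Asha', 250000.0), (2, 'Ravi', 40000.0);

    -- The clerk's window onto the data: customer names only, no loan details.
    CREATE VIEW loan_customers AS SELECT customer_name FROM loan;
""")
# In a multi-user DBMS the clerk would then be granted access to the view
# only, e.g.  GRANT SELECT ON loan_customers TO clerk;  (standard SQL).
print(conn.execute("SELECT * FROM loan_customers").fetchall())
```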
8. Organization's requirement can be identified - All organizations have sections
and departments and each of these units often consider the work of their unit as the
most important and therefore consider their need as the most important. Once a
database has been set up with centralized control, it will be necessary to identify the
organization's requirements and to balance the needs of the competing units. So it
may become necessary to ignore some requests for information if they conflict with
higher priority need of the organization.
It is the responsibility of the DBA (Database Administrator) to structure the database
system to provide the overall service that is best for an organization.
For example, a DBA must choose the best file structure and access method to give fast
responses for highly critical applications as compared to less critical applications.
9. Overall cost of developing and maintaining systems is lower - It is much easier
to respond to unanticipated requests when data is centralized in a database than when
it is stored in a conventional file system. Although the initial cost of setting up a
database can be large, one normally expects the overall cost of setting up the database and
developing and maintaining application programs to be far lower than for similar
services using conventional systems, since the productivity of programmers can be
higher when using the non-procedural languages that have been developed with DBMSs
than when using procedural languages.
10. Data Model must be developed - Perhaps the most important advantage of
setting up a database system is the requirement that an overall data model for the
organization be built. In conventional systems, it is more likely that files will be
designed as the needs of particular applications demand. The overall view is often not
considered. Building an overall view of an organization's data is usually cost effective
in the long term.
11. Provides backup and Recovery - Centralizing a database provides schemes
for backup and recovery from failures, including disk crashes, power failures and
software errors, which help the database recover from an inconsistent state to
the state that existed prior to the occurrence of the failure, though the methods are
very complex.
Disadvantages of Database
Although the database system yields considerable advantages over previous data
management approaches, database systems do carry significant disadvantages. For
example:
1. Increased costs. Database systems require sophisticated hardware and software and highly skilled
personnel. The cost of maintaining the hardware, software, and personnel required to
operate and manage a database system can be substantial. Training, licensing, and
regulation compliance costs are often overlooked when database systems are
implemented.
2. Management complexity. Database systems interface with many different technologies and have a significant
impact on a company’s resources and culture. The changes introduced by the adoption
of a database system must be properly managed to ensure that they help advance the
company’s objectives. Given the fact that database systems hold crucial company data
that are accessed from multiple sources, security issues must be assessed constantly.
3. Maintaining currency. To maximize the efficiency of the database system, you must keep your system
current. Therefore, you must perform frequent updates and apply the latest patches
and security measures to all components. Because database technology advances
rapidly, personnel training costs tend to be significant.
4. Vendor dependence. Given the heavy investment in technology and personnel training, companies might
be reluctant to change database vendors. As a consequence, vendors are less likely to
offer pricing point advantages to existing customers, and those customers might be
limited in their choice of database system components.
5. Frequent upgrade/replacement cycles. DBMS vendors frequently upgrade their products by adding new functionality. Such
new features often come bundled in new upgrade versions of the software. Some of
these versions require hardware upgrades. Not only do the upgrades themselves cost
money, but it also costs money to train database users and administrators to properly
use and manage the new features.
The major characteristics of the database approach are:
• Self-describing Nature of a Database System
• Insulation between Programs and Data, and Data Abstraction
• Support of Multiple Views of the Data
• Sharing of Data and Multi user Transaction Processing
Self-describing nature
A DBMS contains not only the data but also a full description of the data (called "metadata").
Metadata is data about data: data formats, record structures, locations, how to
access, indexes. Metadata is stored in a catalogue and is used by the DBMS software to
know how to access the data. Contrast this with the file processing approach, where
application programs need to know the structure and format of records and data.
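The catalogue is itself queryable. In SQLite, for example, the metadata lives in the sqlite_master table (other DBMSs expose a comparable system catalogue or information schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT)")

# Querying the catalogue: the DBMS describes its own contents.
for row in conn.execute("SELECT type, name, sql FROM sqlite_master"):
    print(row)
```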
Program-data independence
Data independence is immunity of application programs to changes in storage
structures and access techniques, e.g. adding a new field, changing an index structure,
or changing a data format. In a DBMS environment these changes are reflected in the
catalogue. Applications aren't affected. Traditional file processing programs would all
have to change, possibly substantially.
Data abstraction
A DBMS provides users with a conceptual representation of data (for example, as
objects with properties and inter-relationships). Storage details are hidden. Conceptual
representation is provided in terms of a data model.
Support for multiple views
A DBMS may allow different users to see different "views" of the DB, according to the
perspective each one requires. E.g. a subset of the data: for example, the people
using the payroll system need not (and should not) see data about students and class
schedules. E.g. data presented in a different form from the way it is stored: for
example, someone interested in student transcripts might get a view which is formed
by combining information from separate files or tables.
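A sketch of that second kind of view (invented student/result tables), where a "transcript" view presents data in a form different from the way it is stored:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (roll_no INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE result  (roll_no INTEGER, course TEXT, grade TEXT);
    INSERT INTO student VALUES (10, 'Asha');
    INSERT INTO result  VALUES (10, 'Computer', 'A'), (10, 'Economics', 'B');

    -- A view combining information from two separate tables.
    CREATE VIEW transcript AS
        SELECT s.name, r.course, r.grade
        FROM student s JOIN result r ON s.roll_no = r.roll_no;
""")
print(conn.execute("SELECT * FROM transcript").fetchall())
```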
Centralized control of the data resource
The DBMS provides centralized control of data in an organization.
This brings a number of advantages:
(a) Reduces redundancy
(b) Avoids inconsistencies
(c) Data can be shared
(d) Standards can be enforced
(e) Security restrictions can be applied
(f) Integrity can be maintained
Redundancy and Inconsistencies
Redundancy is unnecessary duplication of data. For example if accounts department
and registration department both keep student name, number and address.
Redundancy wastes space and duplicates effort in maintaining the data.
Redundancy also leads to inconsistency.
Inconsistent data is data which contradicts itself - e.g. two different addresses for a
given student number. Inconsistency cannot occur if data is represented by a single
entry (i.e. if there is no redundancy).
Controlled redundancy: Some redundancy may be desirable (for efficiency). A
DBMS should be aware of it, and take care of propagating updates to all copies of a
data item.
This is an objective that is not yet fully supported by current systems.
Sharing
• Need concurrency control
• Multiple user views
Standards
E.g. data formats, record structures, naming, documentation
International, organizational, departmental ... standards
Security
- restricting unauthorized access
DBMS should perform security checks on all accesses.
Integrity
Maintaining validity of data;
E.g. employee numbers must be in some range.
E.g. every course must have an instructor.
E.g. student numbers must be unique.
E.g. hours worked cannot be more than 150.
These things are expressed as constraints.
The DBMS should perform integrity checks on all updates. Currently DBMSs provide
only limited integrity checks.
MODULE 2:
DBMS Architecture – Associations – Relationships – Mappings between Abstractions –
Generalisation – Classifications – Conceptual Data modeling – File Organization –
Data Structure – Data models: HDBMS, NDBMS, RDBMS, OODBMS.
DBMS Architecture
Components of a DBMS
Database Management System (DBMS)
DBMSs are the technology tools that directly support managing organizational data.
With a DBMS you can create a database including its logical structure and constraints,
you can manipulate the data and information it contains, or you can directly create a
simple database application or reporting tool. Human administrators, through a user
interface, perform certain tasks with the tool such as creating a database, converting
an existing database, or archiving a large and growing database. Business applications,
which perform the higher level tasks of managing business processes, interact with
end users and other applications and, to store and manage data, rely on and directly
operate their own underlying database through a standard programming interface like
ODBC.
The following diagram illustrates the five components of a DBMS.
Database Engine:
The Database Engine is the core service for storing, processing, and securing data.
The Database Engine provides controlled access and rapid transaction processing to
meet the requirements of the most demanding data consuming applications within
your enterprise. Use the Database Engine to create relational databases for online
transaction processing or online analytical processing data. This includes creating
tables for storing data, and database objects such as indexes, views, and stored
procedures for viewing, managing, and securing data. You can use SQL Server
Management Studio to manage the database objects, and SQL Server Profiler for
capturing server events.
Data dictionary:
A data dictionary is a reserved space within a database which is used to store
information about the database itself. A data dictionary is a set of tables and views
which can only be read and never altered. Most data dictionaries contain different
information about the data used in the enterprise. In terms of the database
representation of the data, the data dictionary defines all schema objects including views,
tables, clusters, indexes, sequences, synonyms, procedures, packages, functions,
triggers and many more. This ensures that all these objects follow one standard
defined in the dictionary. The data dictionary also records how much space has been
allocated for and/or is currently in use by each schema object. A data dictionary is
used when finding information about users, objects, and schema and storage
structures. Every time a data definition language (DDL) statement is issued, the data
dictionary becomes modified.
A data dictionary may contain information such as:
Database design information
Stored SQL procedures
User permissions
User statistics
Database process information
Database growth statistics
Database performance statistics
Query Processor:
A relational database consists of many parts, but at its heart are two major
components: the storage engine and the query processor.
The storage engine writes data to and reads data from the disk. It manages
records, controls concurrency, and maintains log files.
The query processor accepts SQL syntax, selects a plan for executing the syntax, and
then executes the chosen plan. The user or program interacts with the query processor,
and the query processor in turn interacts with the storage engine. The query processor
isolates the user from the details of execution. The user specifies the result, and the
query processor determines how this result is obtained.
The query processor components include the DDL interpreter, the DML compiler,
and the query evaluation engine.
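The plan-selection step can be observed directly. SQLite, for instance, exposes the chosen plan through EXPLAIN QUERY PLAN (most DBMSs offer a similar EXPLAIN facility); the emp table is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# The query processor picks a plan; the user only states the desired result.
for row in conn.execute("EXPLAIN QUERY PLAN SELECT name FROM emp WHERE emp_id = 7"):
    print(row)   # a search using the primary key
for row in conn.execute("EXPLAIN QUERY PLAN SELECT name FROM emp WHERE age = 30"):
    print(row)   # a full table scan: no index on age
```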
Mappings between Abstractions
The conceptual/internal mapping:
o defines conceptual and internal view correspondence
o specifies mapping from conceptual records to their stored counterparts
An external/conceptual mapping:
o defines a particular external and conceptual view correspondence
A change to the storage structure definition means that the conceptual/internal
mapping must be changed accordingly, so that the conceptual schema may
remain invariant, achieving physical data independence.
A change to the conceptual definition means that the conceptual/external
mapping must be changed accordingly, so that the external schema may remain
invariant, achieving logical data independence.
The architecture for DBMSs is divided into three general levels:
external
conceptual
internal
External View
A user is anyone who needs to access some portion of the data. Users may range from
application programmers to casual users with ad hoc queries. Each user has a language
at his/her disposal.
The application programmer may use a high level language (e.g. COBOL) while the
casual user will probably use a query language.
Regardless of the language used, it will include a data sublanguage (DSL): the
subset of the language concerned with the storage and retrieval of information in
the database, which may or may not be apparent to the user.
A DSL is a combination of two languages:
a data definition language (DDL) - provides for the definition or description of
database objects
a data manipulation language (DML) - supports the manipulation or processing
of database objects.
Each user sees the data in terms of an external view: Defined by an external schema,
consisting basically of descriptions of each of the various types of external record in
that external view, and also a definition of the mapping between the external schema
and the underlying conceptual schema.
Conceptual View
An abstract representation of the entire information content of the database.
It is in general a view of the data as it actually is; that is, it is a 'model' of the
'real world'.
It consists of multiple occurrences of multiple types of conceptual record,
defined in the conceptual schema.
To achieve data independence, the definitions of conceptual records must
involve information content only.
storage structure is ignored
access strategy is ignored
In addition to definitions, the conceptual schema contains authorisation and
validation procedures.
Internal View
The internal view is a low-level representation of the entire database consisting of
multiple occurrences of multiple types of internal (stored) records.
It is, however, at one remove from the physical level, since it does not deal in terms of
physical records or blocks, nor with any device-specific constraints such as cylinder or
track sizes. Details of the mapping to physical storage are highly implementation-specific
and are not expressed in the three-level architecture.
The internal view described by the internal schema:
defines the various types of stored record
what indices exist
how stored fields are represented
what physical sequence the stored records are in
In effect the internal schema is the storage structure definition.
Here we compare these three types of data models. The table below compares the
different features:
Feature              | Conceptual | Logical | Physical
---------------------|------------|---------|---------
Entity Names         |     ✓      |    ✓    |
Entity Relationships |     ✓      |    ✓    |
Attributes           |            |    ✓    |
Primary Keys         |            |    ✓    |    ✓
Foreign Keys         |            |    ✓    |    ✓
Table Names          |            |         |    ✓
Column Names         |            |         |    ✓
Column Data Types    |            |         |    ✓
We can see that the complexity increases from conceptual to logical to physical. This
is why we always first start with the conceptual data model (so we understand at high
level what are the different entities in our data and how they relate to one another),
then move on to the logical data model (so we understand the details of our data
without worrying about how they will actually be implemented), and finally the physical
data model (so we know exactly how to implement our data model in the database of
choice). In a data warehousing project, sometimes the conceptual data model and the
logical data model are considered as a single deliverable.
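A sketch of what the physical end of that progression looks like: the DDL below pins down table names, column names, data types, and primary/foreign keys for an invented customer/orders pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Physical data model: everything the earlier levels left open is fixed.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT
    );
""")
```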
DATA MODELS
Hierarchical Model
The hierarchical data model organizes data in a tree structure. There is a hierarchy of
parent and child data segments. This structure implies that a record can have repeating
information, generally in the child data segments. Data is stored in a series of records,
each of which has a set of field values attached to it. All the instances of a specific record
are collected together as a record type. These record types are the equivalent of tables in the
relational model, with the individual records being the equivalent of rows. To
create links between these record types, the hierarchical model uses Parent Child
Relationships: a 1:N mapping between record types. This is done by using
trees which, like the set theory used in the relational model, are "borrowed" from mathematics. For
example, an organization might store information about an employee, such as name,
employee number, department, salary. The organization might also store information
about an employee's children, such as name and date of birth. The employee and
children data forms a hierarchy, where the employee data represents the parent
segment and the children data represents the child segment. If an employee has three
children, then there would be three child segments associated with one employee
segment. In a hierarchical database the parent-child relationship is one to many. This
restricts a child segment to having only one parent segment. Hierarchical DBMSs
were popular from the late 1960s, with the introduction of IBM's Information
Management System (IMS) DBMS, through the 1970s.
Network Model
The popularity of the network data model coincided with the popularity of the
hierarchical data model. Some data were more naturally modelled with more than one
parent per child. So, the network model permitted the modelling of many-to-many
relationships in data. In 1971, the Conference on Data Systems Languages
(CODASYL) formally defined the network model. The basic data modelling construct
in the network model is the set construct. A set consists of an owner record type, a set
name, and a member record type. A member record type can have that role in more
than one set; hence the multiparent concept is supported. An owner record type can
also be a member or owner in another set. The data model is a simple network, and
link and intersection record types (called junction records by IDMS) may exist, as
well as sets between them. Thus, the complete network of relationships is represented
by several pairwise sets; in each set one record type is the owner (at the tail of the
network arrow) and one or more record types are members (at the head of the
relationship arrow). Usually, a set defines a 1: M relationship, although 1:1 is
permitted. The CODASYL network model is based on mathematical set theory.
Relational Model
(RDBMS - relational database management system) A database based on the
relational model developed by E.F. Codd. A relational database allows the definition
of data structures, storage and retrieval operations and integrity constraints. In such a
database the data and relations between them are organised in tables. A table is a
collection of records and each record in a table contains the same fields.
Properties of Relational Tables:
• Values are atomic.
• Column values are of the same kind.
• Each row is unique.
• The sequence of columns is insignificant.
• The sequence of rows is insignificant.
• Each column must have a unique name.
Certain fields may be designated as keys, which means that searches for specific
values of that field will use indexing to speed them up. Where fields in two different
tables take values from the same set, a join operation can be performed to select
related records in the two tables by matching values in those fields. Often, but not
always, the fields will have the same name in both tables. For example, an "orders"
table might contain (customer-ID, product-code) pairs and a "products" table might
contain (product-code, price) pairs so to calculate a given customer's bill you would
sum the prices of all products ordered by that customer by joining on the product-code
fields of the two tables. This can be extended to joining multiple tables on multiple
fields. Because these relationships are only specified at retrieval time, relational
databases are classed as dynamic database management systems. The relational
database model is based on relational algebra.
Object/Relational Model
Object/relational database management systems (ORDBMSs) add new object storage
capabilities to the relational systems at the core of modern information systems. These
new facilities integrate management of traditional fielded data, complex objects such
as time-series and geospatial data and diverse binary media such as audio, video,
images, and applets. By encapsulating methods with data structures, an ORDBMS
server can execute complex analytical and data manipulation operations to search and
transform multimedia and other complex objects.
As an evolutionary technology, the object/relational (OR) approach has inherited the
robust transaction- and performance-management features of its relational ancestor
and the flexibility of its object-oriented cousin. Database designers can work with
familiar tabular structures and data definition languages (DDLs) while assimilating
new object-management possibilities. Query and procedural languages and call
interfaces in ORDBMSs are familiar: SQL3, vendor procedural languages, and
ODBC, JDBC, and proprietary call interfaces are all extensions of RDBMS languages
and interfaces. And the leading vendors are, of course, quite well known: IBM,
Informix, and Oracle.
Object-Oriented Model
Object DBMSs add database functionality to object programming languages. They
bring much more than persistent storage of programming language objects. Object
DBMSs extend the semantics of the C++, Smalltalk and Java object programming
languages to provide full-featured database programming capability, while retaining
native language compatibility. A major benefit of this approach is the unification of
the application and database development into a seamless data model and language
environment. As a result, applications require less code, use more natural data
modelling, and code bases are easier to maintain. Object developers can write
complete database applications with a modest amount of additional effort.
According to Rao (1994), "The object-oriented database (OODB) paradigm is the
combination of object-oriented programming language (OOPL) systems and persistent
systems. The power of the OODB comes from the seamless treatment of both
persistent data, as found in databases, and transient data, as found in executing
programs."
In contrast to a relational DBMS where a complex data structure must be flattened out
to fit into tables or joined together from those tables to form the in-memory structure,
object DBMSs have no performance overhead to store or retrieve a web or hierarchy
of interrelated objects. This one-to-one mapping of object programming language
objects to database objects has two benefits over other storage approaches: it provides
higher performance management of objects, and it enables better management of the
complex interrelationships between objects. This makes object DBMSs better suited
to support applications such as financial portfolio risk analysis systems,
telecommunications service applications, World Wide Web document structures,
design and manufacturing systems, and hospital patient record systems, which have
complex relationships between data.
Data Model: A data model is a collection of conceptual tools for describing data, data relationships,
data semantics, and consistency constraints.
Entity: An entity is a "thing" or "object" in the real world that is distinguishable from
all other objects. For example, each person in an enterprise is an entity.
Entity set :An entity set is a set of entities of the same type that share the same
properties, or attributes. The set of all persons who are customers at a given bank, for
example, can be defined as the entity set customer. Similarly, the entity set loan might
represent the set of all loans awarded by a particular bank.
An entity is represented by a set of attributes. Attributes are descriptive properties
possessed by each member of an entity set. The designation of an attribute for an
entity set expresses that the database stores similar information concerning each entity
in the entity set; however, each entity may have its own value for each attribute.
Simple and composite attributes: Attributes that are not divided into subparts are
called simple attributes. Composite attributes, on the other hand, can be divided
into subparts. For example, an attribute
name could be structured as a composite attribute consisting of first-name, middle-
initial, and last-name.
Single-valued and multivalued attributes: For instance, the loan-number attribute
for a specific loan entity refers to only one loan number. Such attributes are said to be
single valued. There may be instances where an attribute has a set of values for a
specific entity. Consider an employee entity set with the attribute phone-number. An
employee may have zero, one, or several phone numbers, and different employees
may have different numbers of phones.
This type of attribute is said to be multivalued.
Derived attribute: The value for this type of attribute can be derived from the values
of other related attributes or entities. For instance, let us say that the customer entity
set has an attribute loans-held, which represents how many loans a customer has from
the bank. We can derive the value for this attribute by counting the number of loan
entities associated with that customer.
Relationship Sets: A relationship is an association among several entities. A
relationship set is a set of relationships of the same type.
Mapping Cardinalities: Mapping cardinalities, or cardinality ratios, express the
number of entities to which another entity can be associated via a relationship set.
Mapping cardinalities are most useful in describing binary relationship sets, although
they can contribute to the description of relationship sets that involve more than two
entity sets.
One to one. An entity in A is associated with at most one entity in B, and an
entity in B is associated with at most one entity in A.
One to many. An entity in A is associated with any number (zero or more) of
entities in B. An entity in B, however, can be associated with at most one entity
in A.
Many to one. An entity in A is associated with at most one entity in B. An
entity in B, however, can be associated with any number (zero or more) of
entities in A.
Many to many. An entity in A is associated with any number (zero or more) of
entities in B, and an entity in B is associated with any number (zero or more) of
entities in A.
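A sketch of how these cardinalities are commonly realized in relational schemas (invented bank tables): one to many becomes a foreign key on the "many" side, and many to many becomes a junction table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One to many: each account is held at at most one branch, so the
    -- 'many' side carries a foreign key to the 'one' side.
    CREATE TABLE branch  (branch_id  INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE account (account_no INTEGER PRIMARY KEY,
                          branch_id  INTEGER REFERENCES branch(branch_id));

    -- Many to many: a customer may hold many accounts and an account may
    -- be joint, so a junction table carries one foreign key per side.
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE depositor (
        customer_id INTEGER REFERENCES customer(customer_id),
        account_no  INTEGER REFERENCES account(account_no),
        PRIMARY KEY (customer_id, account_no)
    );
""")
```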
Keys: A key allows us to identify a set of attributes that suffice to distinguish entities
from each other. Keys also help uniquely identify relationships, and thus distinguish
relationships from each other.
superkey: A superkey is a set of one or more attributes that, taken collectively, allow
us to identify uniquely an entity in the entity set. For example, the customer-id
attribute of the entity set customer is sufficient to distinguish one customer entity from
another. Thus, customer-id is a superkey. Similarly, the combination of customer-
name and customer-id is a superkey for the entity set customer. The customer-name
attribute of customer is not a superkey, because several people might have the same
name.
candidate key: minimal super keys are called candidate keys. If K is a super key, then
so is any superset of K. We are often interested in super keys for which no proper
subset is a super key. It is possible that several distinct sets of attributes could serve as
a candidate key. Suppose that a combination of customer-name and customer-street is
sufficient to distinguish among members of the customer entity set. Then, both
{customer-id} and {customer-name, customer-street} are candidate keys. Although
the attributes customer-id and customer-name together can distinguish customer
entities, their combination does not form a candidate key, since the attribute customer-
id alone is a candidate key.
primary key: A primary key is the candidate key that is chosen by the database designer
as the principal means of identifying entities within an entity set. A key (primary, candidate, and
super) is a property of the entity set, rather than of the individual entities. Any two
individual entities in the set are prohibited from having the same value on the key
attributes at the same time. The designation of a key represents a constraint in the real-
world enterprise being modeled.
Weak Entity Sets: An entity set may not have sufficient attributes to form a primary
key. Such an entity set is termed a weak entity set. An entity set that has a primary key
is termed a strong entity set.
For a weak entity set to be meaningful, it must be associated with another entity set,
called the identifying or owner entity set. Every weak entity must be associated with
an identifying entity; that is, the weak entity set is said to be existence dependent on
the identifying entity set. The identifying entity set is said to own the weak entity set
that it identifies. The relationship associating the weak entity set with the identifying
entity set is called the identifying relationship. The identifying relationship is many to
one from the weak entity set to the identifying entity set, and the participation of the
weak entity set in the relationship is total.
In our example, the identifying entity set for payment is loan, and a relationship loan-
payment that associates payment entities with their corresponding loan entities is the
identifying relationship.
Although a weak entity set does not have a primary key, we nevertheless need a
means of distinguishing among all those entities in the weak entity set that depend on
one particular strong entity. The discriminator of a weak entity set is a set of attributes
that allows this distinction to be made.
In E-R diagrams, a doubly outlined box indicates a weak entity set, and a doubly
outlined diamond indicates the corresponding identifying relationship. In the figure the weak
entity set payment depends on the strong entity set loan via the relationship set loan-
payment.
The figure also illustrates the use of double lines to indicate total participation: the
participation of the (weak) entity set payment in the relationship loan-payment is total, meaning that
every payment must be related via loan-payment to some loan. Finally, the arrow from
loan-payment to loan indicates that each payment is for a single loan. The
discriminator of a weak entity set also is underlined, but with a dashed, rather than a
solid, line.
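A sketch of how the loan-payment example might be realized in SQL (invented column choices): the weak entity's primary key combines the identifying entity's key with the discriminator:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loan (loan_no INTEGER PRIMARY KEY, amount REAL);

    -- Weak entity set: payment has no key of its own. Its primary key is
    -- the identifying entity's key plus the discriminator payment_no.
    CREATE TABLE payment (
        loan_no    INTEGER NOT NULL REFERENCES loan(loan_no),
        payment_no INTEGER NOT NULL,        -- discriminator
        pay_date   TEXT,
        amount     REAL,
        PRIMARY KEY (loan_no, payment_no)
    );
""")
```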
Specialization : An entity set may include sub groupings of entities that are distinct in
some way from other entities in the set. For instance, a subset of entities within an
entity set may have attributes that are not shared by all the entities in the entity set.
The E-R model provides a means for representing these distinctive entity groupings.
Consider an entity set person, with attributes name, street, and city. A person may be
further classified as one of the following:
• customer
• employee
Each of these person types is described by a set of attributes that includes all the
attributes of entity set person plus possibly additional attributes. For example,
customer entities may be described further by the attribute customer-id, whereas
employee entities may be described further by the attributes employee-id and salary.
The process of designating sub groupings within an entity set is called specialization.
The specialization of person allows us to distinguish among persons according to
whether they are employees or customers.
Generalization: The design process may also proceed in a bottom-up manner, in
which multiple entity sets are synthesized into a higher-level entity set on the basis of
common features. The database designer may have first identified a customer entity
set with the attributes name, street, city, and customer-id, and an employee entity set
with the attributes name, street, city, employee-id, and salary. There are similarities
between the customer entity set and the employee entity set in the sense that they have
several attributes in common. This commonality can be expressed by generalization,
which is a containment relationship that exists between a higher-level entity set and
one or more lower-level entity sets. In our example, person is the higher-level entity
set and customer and employee are lower-level entity sets.
Higher- and lower-level entity sets also may be designated by the terms super class
and subclass, respectively. The person entity set is the super class of the customer and
employee subclasses .For all practical purposes, generalization is a simple inversion of
specialization. We will apply both processes, in combination, in the course of
designing the E-R schema for an enterprise. In terms of the E-R diagram itself, we do
not distinguish between specialization and generalization. New levels of entity
representation will be distinguished (specialization) or synthesized (generalization) as
the design schema comes to express fully the database application and the user
requirements of the database. Differences in the two approaches may be characterized
by their starting point and overall goal. Generalization proceeds from the recognition
that a number of entity sets share some common features (namely, they are described
by the same attributes and participate in the same relationship sets).
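One common way to realize such a specialization/generalization hierarchy in tables (a sketch; the exact column choices are invented) is to give each subclass table the super class key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Higher-level entity set (super class).
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name TEXT, street TEXT, city TEXT
    );
    -- Lower-level entity sets (subclasses) reuse the super class key and
    -- add only their own attributes.
    CREATE TABLE customer (
        person_id   INTEGER PRIMARY KEY REFERENCES person(person_id),
        customer_id TEXT
    );
    CREATE TABLE employee (
        person_id   INTEGER PRIMARY KEY REFERENCES person(person_id),
        employee_id TEXT,
        salary      REAL
    );
""")
```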
Aggregation:
Aggregation is an abstraction in which relationship sets (along with their associated
entity sets) are treated as higher-level entity sets, and can participate in relationships.
Symbols used in the E-R notation:
File Organization
Database needs to be stored on some computer storage medium. Two categories of
storage are:
• Primary storage: Includes storage media that can be operated on directly by
CPU. Fast access to data. Limited storage capacity.
• Secondary storage: Have large capacity, costs less, slower access to data. Can’t
be processed directly by CPU.
A file is a sequence of records. If every record has exactly the same size, the file is said to
be made of fixed-length records; if different records in the file have different sizes, the file
is said to be made of variable-length records.
Records of a file must be allocated to disk blocks because a block is the unit of data
transfer between disk and memory. When the block size is larger than the record
size, each block will contain numerous records, although at times records may not fit in a
block. Suppose the block size is B bytes. For a file of fixed-length records of size R
bytes, with B >= R, the blocking factor for the file is bfr = ⌊B/R⌋, the number of
records per block.
Unused space in each block is B - (bfr x R) bytes.
• To utilize the unused space, we can store part of a record on one block and the
rest on another. A pointer at the end of the first block points to the block
containing the remainder of the record, in case it is not the next consecutive
block on disk. This organisation is called spanned.
• If records are not allowed to cross block boundaries, the organisation is called
unspanned. This is used for fixed-length records, having B > R because it
makes each record start at a known location in the block.
• For variable-length records, either a spanned or unspanned organization can be
used. For a file of r records, the number of blocks needed is b = ⌈r/bfr⌉.
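A worked example of these formulas (the block size, record size, and record count are assumed values):

```python
import math

B = 512    # block size in bytes (assumed)
R = 100    # fixed record length in bytes (assumed)
r = 30000  # number of records in the file (assumed)

bfr = B // R              # blocking factor: floor(B/R) = 5 records per block
unused = B - bfr * R      # unused bytes per block (unspanned) = 12
b = math.ceil(r / bfr)    # blocks needed: ceil(r/bfr) = 6000

print(bfr, unused, b)
```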
Techniques for allocating file blocks on disk:
• Contiguous allocation : File blocks are allocated to consecutive disk blocks.
This makes reading the file faster using double buffering, but file expansion
is difficult.
• Linked allocation : Each file block contains a pointer to the next file block.
Expansion is easy but reading of file is slower.
• A combination of consecutive disk blocks is called a cluster. Clusters are
sometimes called file segments or extents.
The file header contains information about a file that is needed by the system
programs that access the file records. It gives the addresses of file blocks, field
lengths, and the order of fields.
To search for a record on disk, one or more blocks are copied into main memory
buffers. Programs search for the desired record or records within the buffers, using
the information in the file header. If the address of the block that contains the
desired record is not known, the search programs must do a linear search through
the file blocks.
Each file block is copied into a buffer and searched either until the record is
located or all file blocks have been searched unsuccessfully
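A sketch of that linear search in Python (the record and block sizes and the sample data are invented; a real system would read disk blocks rather than an in-memory BytesIO buffer):

```python
import io

RECORD_SIZE = 32   # fixed-length records (assumed size, in bytes)
BLOCK_SIZE = 128   # one block holds 4 records (assumed size, in bytes)

def linear_search(f, match):
    """Copy one block at a time into a main-memory buffer and scan its records."""
    while True:
        buffer = f.read(BLOCK_SIZE)          # one block into the buffer
        if not buffer:
            return None                      # all blocks searched unsuccessfully
        for i in range(0, len(buffer), RECORD_SIZE):
            record = buffer[i:i + RECORD_SIZE]
            if match(record):
                return record                # desired record located

# Usage: a tiny in-memory stand-in for a disk file of four records.
data = b"".join(name.ljust(RECORD_SIZE) for name in
                [b"alice", b"bob", b"carol", b"dave"])
print(linear_search(io.BytesIO(data), lambda rec: rec.startswith(b"carol")))
```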
Operations on files
• Include retrieval operations and update operations
• Retrieval operations locate records for processing and examining values.
• Update operations change the data in the file
• E.g. Select, Insert, Delete, Update.
These operations can use equality or comparison operators.
• Open: Prepare files for reading or writing.
• Reset: Set the file pointer of an open file to the beginning of the file.
• Find: Searches for the first record that satisfies a search condition.
• Read: Copies the current record from the buffer to a program variable in the
user program.
• FindNext: Searches for the next record in the file.
• Delete: Delete current record
• Modify: Modify some field values for current record.
• Insert: Inserts a new record in the file by locating the block where it belongs.
• Close: Releases buffers and completes file access
• Findall : locate all records in file satisfying a condition
• Find n : Searches for the first record that satisfies a search condition and
continues to locate the next n-1 records
• FindOrdered: Retrieves all records in some specified order.
• Reorganize: Reordering records.
File organization: Organization of data of a file into records, blocks and access
structures.
Access method: Group of operations that can be applied to a file.
Buffering of blocks.
When several blocks need to be transferred from the disk to the main memory and
all the block addresses are known, several buffers can be reserved in main memory
to speed up the transfer.
While a buffer is being read or written, the CPU can process data in the other
buffer. Buffering is useful when processes can run concurrently in a parallel
fashion (i.e. when a separate I/O processor or multiple CPU processors exist).
Double buffering: the CPU starts processing a block once its transfer to main memory
is complete; at the same time the disk I/O processor can be reading and transferring
the next block into a different buffer. This reduces waiting time.
INDEXES
An index is a data structure that organizes data records on disk to optimize certain
kinds of retrieval operations. An index allows us to efficiently retrieve all records
that satisfy search conditions on the search key fields of the index. Additional
indexes can be created on a given collection of data records, each with a different
search key, to speed up search operations that are not efficiently supported by the
file organization used to store the data records.
• E.g. store the records in a file organized as an index on employee age; this is an
alternative to sorting the file by age.
When the ordering of the data records is the same as, or close to, the ordering of the
data entries in some index, we say that the index is clustered; otherwise it is an
unclustered index.
An index on a set of fields that includes the primary key is called a primary index;
other indexes are called secondary indexes. Two data entries are said to be
duplicates if they have the same value for the search key field associated with the
index. A primary index is guaranteed not to contain duplicates.
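A sketch of a secondary index in SQL, on the employee-age example from above (the table and index names are invented); the query plan confirms the index is chosen:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# A secondary index on a non-key search field (age values may repeat).
conn.execute("CREATE INDEX idx_emp_age ON employee(age)")

for row in conn.execute("EXPLAIN QUERY PLAN SELECT * FROM employee WHERE age = 30"):
    print(row)   # the plan now searches employee using idx_emp_age
```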
Clustered Field
If the records of a file are physically ordered on a non-key field that does not have a
distinct value for each record, that field is called the clustering field of the file.
The index created is a clustering index; it speeds retrieval of records. It is an ordered
file with two fields: one field is the same as the clustering field of the data file and
the second is a block pointer.
Ordered Indices
A file may have several indices, on different search keys. If the file containing the
records is sequentially ordered, a primary index is an index whose search key also
defines the sequential order of the file. Primary indices are also called clustering
indices. The search key of a primary index is usually the primary key.
Indices whose search key specifies an order different from the sequential order of
the file are called secondary indices.
Primary index: files arranged sequentially on some search key, with a primary
index on that search key, are called index-sequential files. They are designed for
applications that require both sequential processing and random processing.
Dense and Sparse index
• An index record or index entry consists of a search-key value and pointers to one
or more records with that value as their search-key value. A pointer to a record
consists of the identifier of a disk block and an offset within the disk block to
identify the record within the block.
• There are two types: dense index and sparse index.
Dense index: An index record appears for every search-key value in the file. The
index record has the search-key value and a pointer to the first data record with
that search-key value.
The rest of the records with the same search-key value are stored sequentially after
the first record.
Some implementations store a list of pointers to all records with the same search-key value.
Sparse indices
An index record appears for only some of the search-key values. Each index record
contains a search-key value and a pointer to the first data record with that search-
key value. To locate a record, we find the index entry with the largest search-key
value that is less than or equal to the search-key value for which we are looking.
We start at the record pointed to by that index entry and follow the pointers in the
file until we find the desired record.
A secondary index must be dense, with an index entry for every search-key value
and a pointer to every record in the file.
A primary index may be sparse, storing only some of the search-key values, since it
is always possible to find records with intermediate search-key values by a sequential
access to a part of the file.
If a secondary index stores only some of the search-key values, records with
intermediate search-key values may be anywhere in the file and, in general, we
cannot find them without searching the entire file.
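A sketch of the sparse-index lookup procedure described above (the index entries and block contents are invented; the data file is assumed to be sorted on the search key):

```python
import bisect

# A sparse index: one (search-key, block number) entry per block.
index_keys   = [10, 40, 70]                  # first search-key value in each block
index_blocks = [0, 1, 2]
blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

def lookup(key):
    # Find the index entry with the largest search-key value <= key ...
    pos = bisect.bisect_right(index_keys, key) - 1
    if pos < 0:
        return None                          # key precedes the whole file
    # ... then scan sequentially from the block that entry points to.
    block = blocks[index_blocks[pos]]
    return key if key in block else None

print(lookup(50))   # 50 is found in block 1
print(lookup(55))   # None: 55 is not in the file
```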
MODULE 3: DATABASE DESIGN
Relational Data Model – Relational Algebra – ER Diagrams – Data Dictionary –
Normalisation – Boyce Codd Normal Forms – Integrity – Relational Database
Languages – Database Administration – File Structures and Indexing.
A relation is a table that holds the data we are interested in. It is two-dimensional and
has rows and columns.
Each entity type in the ER model is mapped into a relation.
The attributes become the columns.
The individual entities become the rows.
Figure : a relation
Relations can be represented textually as:
tablename(primary key, attribute 1, attribute 2, ... , foreign key)
If matric_no was the primary key, and there were no foreign keys, then the table
above could be represented as:
student(matric no, name, address, date_of_birth)
When referring to relations or tables, cardinality is considered to be the number of
rows in the relation or table, and arity is the number of columns in a table or
attributes in a relation.
Foreign keys
A foreign key is an attribute (or group of attributes) that is the primary key to another
relation.
Roughly, each foreign key represents a relationship between two entity types.
They are added to relations as we go through the mapping process.
They allow the relations to be linked together.
A relation can have several foreign keys.
The Relational Data Model
The Relation
The Relation is the basic element in a relational data model.
Relations in the Relational Data Model
A relation is subject to the following rules:
1. Relation (file, table) is a two-dimensional table.
2. Attribute (i.e. field or data item) is a column in the table.
3. Each column in the table has a unique name within that table.
4. Each column is homogeneous. Thus the entries in any column are all of the
same type (e.g. age, name, employee-number, etc).
5. Each column has a domain, the set of possible values that can appear in that
column.
6. A Tuple (i.e. record) is a row in the table.
7. The order of the rows and columns is not important.
8. Values of a row all relate to some thing or portion of a thing.
9. Repeating groups (collections of logically related attributes that occur multiple
times within one record occurrence) are not allowed.
10. Duplicate rows are not allowed (candidate keys are designed to prevent this).
11. Cells must be single-valued (but can be variable length). Single valued means
the following:
o Cannot contain multiple values such as 'A1,B2,C3'.
o Cannot contain combined values such as 'ABC-XYZ' where 'ABC' means
one thing and 'XYZ' another.
A relation may be expressed using the notation R(A,B,C, ...) where:
R = the name of the relation.
(A,B,C, ...) = the attributes within the relation.
A = the attribute(s) which form the primary key.
Keys
1. A simple key contains a single attribute.
2. A composite key is a key that contains more than one attribute.
3. A candidate key is an attribute (or set of attributes) that uniquely identifies a
row. A candidate key must possess the following properties:
o Unique identification - For every row the value of the key must uniquely
identify that row.
o Non redundancy - No attribute in the key can be discarded without
destroying the property of unique identification.
4. A primary key is the candidate key which is selected as the principal unique
identifier. Every relation must contain a primary key. The primary key is usually
the key selected to identify a row when the database is physically implemented.
For example, a part number is selected instead of a part description.
5. A superkey is any set of attributes that uniquely identifies a row. A superkey
differs from a candidate key in that it does not require the non redundancy
property.
6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non
key attribute in one relation and as a primary key attribute in another relation. I
say usually because it is possible for a foreign key to also be the whole or part of
a primary key:
o A many-to-many relationship can only be implemented by introducing an
intersection or link table which then becomes the child in two one-to-many
relationships. The intersection table therefore has a foreign key for each of its
parents, and its primary key is a composite of both foreign keys.
o A one-to-one relationship requires that the child table has no more than one
occurrence for each parent, which can only be enforced by letting the foreign
key also serve as the primary key.
7. A semantic or natural key is a key for which the possible values have an
obvious meaning to the user or the data. For example, a semantic primary key
for a COUNTRY entity might contain the value 'USA' for the occurrence
describing the United States of America. The value 'USA' has meaning to the
user.
8. A technical or surrogate or artificial key is a key for which the possible values
have no obvious meaning to the user or the data. These are used instead of
semantic keys for any of the following reasons:
o When the value in a semantic key is likely to be changed by the user, or can
have duplicates. For example, on a PERSON table it is unwise to use
PERSON_NAME as the key as it is possible to have more than one person
with the same name, or the name may change such as through marriage.
o When none of the existing attributes can be used to guarantee uniqueness. In
this case adding an attribute whose value is generated by the system, e.g
from a sequence of numbers, is the only way to provide a unique value.
Typical examples would be ORDER_ID and INVOICE_ID. The value
'12345' has no meaning to the user as it conveys nothing about the entity to
which it relates.
9. A key functionally determines the other attributes in the row, thus it is always a
determinant.
10. Note that the term 'key' in most DBMS engines is implemented as an index
which does not allow duplicate entries.
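As a sketch, the main key types above map onto SQL declarations; since most engines
implement a key as a unique index, both keys below reject duplicates. The person table
and its columns are illustrative assumptions:
CREATE TABLE person (
    person_id   INT PRIMARY KEY,       -- surrogate (technical) primary key
    passport_no VARCHAR(20) UNIQUE,    -- a semantic candidate key, enforced by a unique index
    person_name VARCHAR(50)            -- not a key: duplicate names are possible
);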
Relationships
One table (relation) may be linked with another in what is known as a relationship.
Relationships may be built into the database structure to facilitate the operation of
relational joins at runtime.
1. A relationship is between two tables in what is known as a one-to-many or
parent-child or master-detail relationship where an occurrence on the 'one' or
'parent' or 'master' table may have any number of associated occurrences on the
'many' or 'child' or 'detail' table. To achieve this the child table must contain
fields which link back to the primary key on the parent table. These fields on the
child table are known as a foreign key, and the parent table is referred to as the
foreign table (from the viewpoint of the child).
2. It is possible for a record on the parent table to exist without corresponding
records on the child table, but it should not be possible for an entry on the child
table to exist without a corresponding entry on the parent table.
3. A child record without a corresponding parent record is known as an orphan.
4. It is possible for a table to be related to itself. For this to be possible it needs a
foreign key which points back to the primary key. Note that these two keys
cannot be comprised of exactly the same fields otherwise the record could only
ever point to itself.
5. A table may be the subject of any number of relationships, and it may be the
parent in some and the child in others.
6. Some database engines allow a parent table to be linked via a candidate key,
but if this were changed it could result in the link to the child table being
broken.
7. Some database engines allow relationships to be managed by rules known as
referential integrity or foreign key constraints. These will prevent entries on
child tables from being created if the foreign key does not exist on the parent
table, or will deal with entries on child tables when the entry on the parent table
is updated or deleted.
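A sketch of such referential integrity rules in SQL, with illustrative parent/child
table names (the cascade/restrict choices are examples, not the only options):
CREATE TABLE parent (
    parent_id INT PRIMARY KEY
);
CREATE TABLE child (
    child_id  INT PRIMARY KEY,
    parent_id INT NOT NULL,            -- NOT NULL: every child must have a parent (no orphans)
    FOREIGN KEY (parent_id) REFERENCES parent (parent_id)
        ON UPDATE CASCADE              -- follow changes to the parent key
        ON DELETE RESTRICT             -- refuse to delete a parent that still has children
);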
Relational Joins
The join operator is used to combine data from two or more relations (tables) in order
to satisfy a particular query. Two relations may be joined when they share at least one
common attribute. The join is implemented by considering each row in an instance of
each relation. A row in relation R1 is joined to a row in relation R2 when the value of
the common attribute(s) is equal in the two relations. The join of two relations is often
called a binary join.
The join of two relations creates a new relation. The notation R1 x R2 indicates the
join of relations R1 and R2. For example, consider the following:
Relation R1
A B C
1 5 3
2 4 5
8 3 5
9 3 3
1 6 5
5 4 3
2 7 5
Relation R2
B D E
4 7 4
6 2 3
5 7 8
7 2 3
3 2 2
Note that the instances of relation R1 and R2 contain the same data values for attribute
B. Data normalisation is concerned with decomposing a relation (e.g. R(A,B,C,D,E))
into smaller relations (e.g. R1 and R2). The data values for attribute B in this context
will be identical in R1 and R2. The instances of R1 and R2 are projections of the
instances of R(A,B,C,D,E) onto the attributes (A,B,C) and (B,D,E) respectively. A
projection will not eliminate data values - duplicate rows are removed, but this will
not remove a data value from any attribute.
The join of relations R1 and R2 is possible because B is a common attribute. The
result of the join is:
Relation R1 x R2
A B C D E
1 5 3 7 8
2 4 5 7 4
8 3 5 2 2
9 3 3 2 2
1 6 5 2 3
5 4 3 7 4
2 7 5 2 3
The row (2 4 5 7 4) was formed by joining the row (2 4 5) from relation R1 to the row
(4 7 4) from relation R2. The two rows were joined since each contained the same
value for the common attribute B. The row (2 4 5) was not joined to the row (6 2 3)
since the values of the common attribute (4 and 6) are not the same.
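In SQL the same binary join could be expressed as follows, assuming R1 and R2 exist as
tables with the columns shown above:
SELECT R1.A, R1.B, R1.C, R2.D, R2.E
FROM   R1
JOIN   R2 ON R1.B = R2.B;    -- (2,4,5) meets (4,7,4) because both have B = 4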
The relations joined in the preceding example shared exactly one common attribute.
However, relations may share multiple common attributes. All of these common
attributes must be used in creating a join. For example, the instances of relations R1
and R2 in the following example are joined using the common attributes B and C:
Before the join:
Relation R1
A B C
6 1 4
8 1 4
5 1 2
2 7 1
Relation R2
B C D
1 4 9
1 4 2
1 2 1
7 1 2
7 1 3
After the join:
Relation R1 x R2
A B C D
6 1 4 9
6 1 4 2
8 1 4 9
8 1 4 2
5 1 2 1
2 7 1 2
2 7 1 3
The row (6 1 4 9) was formed by joining the row (6 1 4) from relation R1 to the row
(1 4 9) from relation R2. The join was created since the common set of attributes (B
and C) contained identical values (1 and 4). The row (6 1 4) from R1 was not joined to
the row (1 2 1) from R2 since the common attributes did not share identical values - (1
4) in R1 and (1 2) in R2.
The join operation provides a method for reconstructing a relation that was
decomposed into two relations during the normalisation process. The join of two
rows, however, can create a new row that was not a member of the original relation.
Thus invalid information can be created during the join process.
Lossless Joins
A set of relations satisfies the lossless join property if the instances can be joined
without creating invalid data (i.e. new rows). The term lossless join may be somewhat
confusing. A join that is not lossless will contain extra, invalid rows. A join that is
lossless will not contain extra, invalid rows. Thus the term gainless join might be
more appropriate.
To give an example of incorrect information created by an invalid join let us take the
following data structure:
R(student, course, instructor, hour, room, grade)
Assuming that only one section of a class is offered during a semester we can define
the following functional dependencies:
1. (HOUR, ROOM) -> COURSE
2. (COURSE, STUDENT) -> GRADE
3. (INSTRUCTOR, HOUR) -> ROOM
4. (COURSE) -> INSTRUCTOR
5. (HOUR, STUDENT) -> ROOM
Take the following sample data:
STUDENT COURSE INSTRUCTOR HOUR ROOM GRADE
Smith Math 1 Jenkins 8:00 100 A
Jones English Goldman 8:00 200 B
Brown English Goldman 8:00 200 C
Green Algebra Jenkins 9:00 400 A
The following four relations, each in 4th normal form, can be generated from the
given and implied dependencies:
R1(STUDENT, HOUR, COURSE)
R2(STUDENT, COURSE, GRADE)
R3(COURSE, INSTRUCTOR)
R4(INSTRUCTOR, HOUR, ROOM)
Note that the dependencies (HOUR, ROOM) -> COURSE and (HOUR, STUDENT) ->
ROOM are not explicitly represented in the preceding decomposition. The goal is
to develop relations in 4th normal form that can be joined to answer any ad hoc
inquiries correctly. This goal can be achieved without representing every functional
dependency as a relation. Furthermore, several sets of relations may satisfy the goal.
The preceding sets of relations can be populated as follows:
R1
STUDENT HOUR COURSE
Smith 8:00 Math 1
Jones 8:00 English
Brown 8:00 English
Green 9:00 Algebra
R2
STUDENT COURSE GRADE
Smith Math 1 A
Jones English B
Brown English C
Green Algebra A
R3
COURSE INSTRUCTOR
Math 1 Jenkins
English Goldman
Algebra Jenkins
R4
INSTRUCTOR HOUR ROOM
Jenkins 8:00 100
Goldman 8:00 200
Jenkins 9:00 400
Now suppose that a list of courses with their corresponding room numbers is required.
Relations R1 and R4 contain the necessary information and can be joined using the
attribute HOUR.
The result of this join is:
R1 x R4
STUDENT COURSE INSTRUCTOR HOUR ROOM
Smith Math 1 Jenkins 8:00 100
Smith Math 1 Goldman 8:00 200
Jones English Jenkins 8:00 100
Jones English Goldman 8:00 200
Brown English Jenkins 8:00 100
Brown English Goldman 8:00 200
Green Algebra Jenkins 9:00 400
This join creates the following invalid information:
Smith, Jones, and Brown take the same class at the same time from two different
instructors in two different rooms.
Jenkins (the Maths teacher) teaches English.
Goldman (the English teacher) teaches Maths.
Both instructors teach different courses at the same time.
Another possibility for a join is R3 and R4 (joined on INSTRUCTOR). The result
would be:
R3 x R4
COURSE INSTRUCTOR HOUR ROOM
Math 1 Jenkins 8:00 100
Math 1 Jenkins 9:00 400
English Goldman 8:00 200
Algebra Jenkins 8:00 100
Algebra Jenkins 9:00 400
This join creates the following invalid information:
Jenkins teaches Math 1 and Algebra simultaneously at both 8:00 and 9:00.
A correct sequence is to join R1 and R3 (using COURSE) and then join the resulting
relation with R4 (using both INSTRUCTOR and HOUR). The result would be:
R1 x R3
STUDENT COURSE INSTRUCTOR HOUR
Smith Math 1 Jenkins 8:00
Jones English Goldman 8:00
Brown English Goldman 8:00
Green Algebra Jenkins 9:00
(R1 x R3) x R4
STUDENT COURSE INSTRUCTOR HOUR ROOM
Smith Math 1 Jenkins 8:00 100
Jones English Goldman 8:00 200
Brown English Goldman 8:00 200
Green Algebra Jenkins 9:00 400
Extracting the COURSE and ROOM attributes (and eliminating the duplicate row
produced for the English course) would yield the desired result:
COURSE ROOM
Math 1 100
English 200
Algebra 400
The correct result is obtained since the sequence (R1 x R3) x R4 satisfies the lossless
(gainless?) join property.
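Expressed in SQL (assuming tables R1, R3 and R4 as populated above), the invalid and
valid join sequences look like this:
-- Invalid: joining R1 to R4 on HOUR alone produces spurious course/room pairs.
SELECT DISTINCT R1.course, R4.room
FROM   R1 JOIN R4 ON R1.hour = R4.hour;
-- Valid: join R1 to R3 on COURSE, then to R4 on both INSTRUCTOR and HOUR.
SELECT DISTINCT R1.course, R4.room
FROM   R1
JOIN   R3 ON R1.course = R3.course
JOIN   R4 ON R3.instructor = R4.instructor AND R1.hour = R4.hour;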
A relational database is in 4th normal form when the lossless join property can be
used to answer unanticipated queries. However, the choice of joins must be evaluated
carefully. Many different sequences of joins will recreate an instance of a relation.
Some sequences are more desirable since they result in the creation of less invalid data
during the join operation.
Suppose that a relation is decomposed using functional dependencies and multi-valued
dependencies. Then at least one sequence of joins on the resulting relations exists that
recreates the original instance with no invalid data created during any of the join
operations.
For example, suppose that a list of grades by room number is desired. This question,
which was probably not anticipated during database design, can be answered without
creating invalid data by either of the following two join sequences:
R1 x R3
(R1 x R3) x R2
((R1 x R3) x R2) x R4
or
R1 x R3
(R1 x R3) x R4
((R1 x R3) x R4) x R2
The required information is contained within relations R2 and R4, but these relations
cannot be joined directly. In this case the solution requires joining all four relations.
The database may require a 'lossless join' relation, which is constructed to assure that
any ad hoc inquiry can be answered with relational operators. This relation may
contain attributes that are not logically related to each other. This occurs because the
relation must serve as a bridge between the other relations in the database. For
example, the lossless join relation will contain all attributes that appear only on the
left side of a functional dependency. Other attributes may also be required, however,
in developing the lossless join relation.
Consider the relational schema R(A, B, C, D), with A -> B and C -> D. Relations R1(A, B) and
R2(C, D) are in 4th normal form. A third relation R3(A, C), however, is required to
satisfy the lossless join property. This relation can be used to join attributes B and D.
This is accomplished by joining relations R1 and R3 and then joining the result to
relation R2. No invalid data is created during these joins. The relation R3(A, C) is the
lossless join relation for this database design.
A relation is usually developed by combining attributes about a particular subject or
entity. The lossless join relation, however, is developed to represent a relationship
among various relations. The lossless join relation may be difficult to populate
initially and difficult to maintain - a result of including attributes that are not logically
associated with each other.
The attributes within a lossless join relation often contain multi-valued dependencies.
Consideration of 4th normal form is important in this situation. The lossless join
relation can sometimes be decomposed into smaller relations by eliminating the multi-
valued dependencies. These smaller relations are easier to populate and maintain.
Determinant and Dependent
The terms determinant and dependent can be described as follows:
1. The expression X -> Y means 'if I know the value of X, then I can obtain the
value of Y' (in a table or somewhere).
2. In the expression X -> Y, X is the determinant and Y is the dependent
attribute.
3. The value X determines the value of Y.
4. The value Y depends on the value of X.
Functional Dependencies (FD)
A functional dependency can be described as follows:
1. An attribute is functionally dependent if its value is determined by another
attribute which is a key.
2. That is, if we know the value of one (or several) data items, then we can find the
value of another (or several).
3. Functional dependencies are expressed as X -> Y, where X is the determinant
and Y is the functionally dependent attribute.
4. If A -> (B,C) then A -> B and A -> C.
5. If (A,B) -> C, then it is not necessarily true that A -> C and B -> C.
6. If A -> B and B -> A, then A and B are in a 1-1 relationship.
7. If A -> B then for A there can only ever be one value for B.
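Whether a proposed dependency X -> Y actually holds in a populated table can be checked
with a query: any determinant value associated with more than one dependent value is a
counter-example. A sketch with placeholder names (tablename, X, Y are not from the text):
-- Returns every value of X that maps to more than one value of Y,
-- i.e. the violations of the proposed dependency X -> Y.
SELECT X
FROM   tablename
GROUP  BY X
HAVING COUNT(DISTINCT Y) > 1;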
Transitive Dependencies (TD)
A transitive dependency can be described as follows:
1. An attribute is transitively dependent if its value is determined by another
attribute which is not a key.
2. If X -> Y and X is not a key then this is a transitive dependency.
3. A transitive dependency exists when A -> B -> C but NOT A -> C.
Multi-Valued Dependencies (MVD)
A multi-valued dependency can be described as follows:
1. A table involves a multi-valued dependency if it may contain multiple values for
an entity.
2. A multi-valued dependency may arise as a result of enforcing 1st normal form.
3. X ->> Y, i.e. X multi-determines Y, when for each value of X we can have more
than one value of Y.
4. If A ->> B and A ->> C then we have a single attribute A which multi-determines
two other independent attributes, B and C.
5. If A ->> (B,C) then we have an attribute A which multi-determines a set of
associated attributes, B and C.
Join Dependencies (JD)
A join dependency can be described as follows:
1. If a table can be decomposed into three or more smaller tables, it must be
capable of being joined again on common keys to form the original table.
Modification Anomalies
A major objective of data normalisation is to avoid modification anomalies. These
come in two flavours:
1. An insertion anomaly is a failure to place information about a new database
entry into all the places in the database where information about that new entry
needs to be stored. In a properly normalized database, information about a new
entry needs to be inserted into only one place in the database. In an inadequately
normalized database, information about a new entry may need to be inserted into
more than one place, and, human fallibility being what it is, some of the needed
additional insertions may be missed.
2. A deletion anomaly is a failure to remove information about an existing
database entry when it is time to remove that entry. In a properly normalized
database, information about an old, to-be-gotten-rid-of entry needs to be deleted
from only one place in the database. In an inadequately normalized database,
information about that old entry may need to be deleted from more than one
place, and, human fallibility being what it is, some of the needed additional
deletions may be missed.
An update of a database involves modifications that may be additions, deletions, or
both. Thus 'update anomalies' can be either of the kinds of anomalies discussed above.
All three kinds of anomalies are highly undesirable, since their occurrence constitutes
corruption of the database. Properly normalised databases are much less susceptible to
corruption than are unnormalised databases.
Types of Relational Join
A JOIN is a method of creating a result set that combines rows from two or more
tables (relations). When comparing the contents of two tables the following conditions
may occur:
Every row in one relation has a match in the other relation.
Relation R1 contains rows that have no match in relation R2.
Relation R2 contains rows that have no match in relation R1.
INNER joins contain only matches. OUTER joins may contain mismatches as well.
Inner Join
This is sometimes known as a simple join. It returns all rows from both tables where
there is a match. If there are rows in R1 which do not have matches in R2, those rows
will not be listed. There are two possible ways of specifying this type of join:
SELECT * FROM R1, R2 WHERE R1.r1_field = R2.r2_field;
SELECT * FROM R1 INNER JOIN R2 ON R1.r1_field = R2.r2_field
If the fields to be matched have the same names in both tables then the ON condition,
as in:
ON R1.fieldname = R2.fieldname
ON (R1.field1 = R2.field1 AND R1.field2 = R2.field2)
can be replaced by the shorter USING condition, as in:
USING fieldname
USING (field1, field2)
Natural Join
A natural join is based on all columns in the two tables that have the same name. It is
semantically equivalent to an INNER JOIN or a LEFT JOIN with a USING clause
that names all columns that exist in both tables.
SELECT * FROM R1 NATURAL JOIN R2
The alternative is a keyed join which includes an ON or USING condition.
Left [Outer] Join
Returns all the rows from R1 even if there are no matches in R2. If there are no
matches in R2 then the R2 values will be shown as null.
SELECT * FROM R1 LEFT [OUTER] JOIN R2 ON R1.field = R2.field
Right [Outer] Join
Returns all the rows from R2 even if there are no matches in R1. If there are no
matches in R1 then the R1 values will be shown as null.
SELECT * FROM R1 RIGHT [OUTER] JOIN R2 ON R1.field = R2.field
Full [Outer] Join
Returns all the rows from both tables even if there are no matches in one of the tables.
If there are no matches in one of the tables then its values will be shown as null.
SELECT * FROM R1 FULL [OUTER] JOIN R2 ON R1.field = R2.field
Self Join
This joins a table to itself. This table appears twice in the FROM clause and is
followed by table aliases that qualify column names in the join condition.
SELECT a.field1, b.field2 FROM R1 a, R1 b WHERE a.field = b.field
Cross Join
This type of join is rarely used as it does not have a join condition, so every row of R1
is joined to every row of R2. For example, if both tables contain 100 rows the result
will be 10,000 rows. This is sometimes known as a cartesian product and can be
specified in either one of the following ways:
SELECT * FROM R1 CROSS JOIN R2
SELECT * FROM R1, R2
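A small runnable comparison of the join types, using two assumed tables (names, columns
and data are illustrative only):
CREATE TABLE emp  (emp_id INT, dept_id INT);
CREATE TABLE dept (dept_id INT, dept_name VARCHAR(20));
INSERT INTO emp  VALUES (1, 10), (2, 20), (3, NULL);
INSERT INTO dept VALUES (10, 'Sales'), (30, 'HR');
SELECT * FROM emp INNER JOIN dept USING (dept_id);  -- 1 row: only emp 1 matches
SELECT * FROM emp LEFT JOIN dept USING (dept_id);   -- 3 rows: emps 2 and 3 get NULL dept_name
SELECT * FROM emp CROSS JOIN dept;                  -- 3 x 2 = 6 rows, no join condition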
Entities
An entity is any object in the system that we want to model and store
information about
Individual objects are called entities
Groups of the same type of objects are called entity types or entity sets
Entities are represented by rectangles (either with round or square corners)
There are two types of entities: weak and strong entity types.
Attribute
All the data relating to an entity is held in its attributes.
An attribute is a property of an entity.
Each attribute can have any value from its domain.
Each entity within an entity type:
o May have any number of attributes.
o Can have different attribute values from those in any other entity.
o Has the same number of attributes as every other entity of that type.
Attributes can be
simple or composite
single-valued or multi-valued
Attributes can be shown on ER models
They appear inside ovals and are attached to their entity.
Note that entity types can have a large number of attributes... If all are shown
then the diagrams would be confusing. Only show an attribute if it adds
information to the ER diagram, or clarifies a point.
Figure : Attributes
Keys
A key is a data item that allows us to uniquely identify individual occurrences
of an entity type.
A candidate key is an attribute or set of attributes that uniquely identifies
individual occurrences of an entity type.
An entity type may have one or more possible candidate keys, the one which is
selected is known as the primary key.
A composite key is a candidate key that consists of two or more attributes
The name of each primary key attribute is underlined.
Relationships
A relationship type is a meaningful association between entity types
A relationship is an association of entities where the association includes one
entity from each participating entity type.
Relationship types are represented on the ER diagram by a series of lines.
As always, there are many notations in use today...
In the original Chen notation, the relationship is placed inside a diamond, e.g.
managers manage employees:
Figure : Chen's notation for relationships
For this module, we will use an alternative notation, where the relationship is a
label on the line. The meaning is identical
Figure : Relationships used in this document
Degree of a Relationship
The number of participating entities in a relationship is known as the degree of
the relationship.
If there are two entity types involved it is a binary relationship type
Figure : Binary Relationships
If there are three entity types involved it is a ternary relationship type
Figure : Ternary relationship
It is possible to have an n-ary relationship (e.g. quaternary or unary).
Unary relationships are also known as recursive relationships.
It is a relationship where the same entity participates more than once in
different roles.
In the example above we are saying that employees are managed by
employees.
If we wanted more information about who manages whom, we could introduce
a second entity type called manager.
Multiple Relationships
It is also possible to have entities associated through two or more distinct
relationships.
Figure : Multiple relationships
In the representation we use it is not possible to have attributes as part of a
relationship. To support this other entity types need to be developed.
Replacing ternary relationships
When ternary relationships occur in an ER model they should always be removed
before finishing the model. Sometimes the relationship can be replaced by a series of
binary relationships that link pairs of the entities in the original ternary relationship.
Figure : A ternary relationship example
This can result in the loss of some information - It is no longer clear which
sales assistant sold a customer a particular product.
Try replacing the ternary relationship with an entity type and a set of binary
relationships.
Relationships are usually verbs, so name the new entity type by the relationship verb
rewritten as a noun.
The relationship sells can become the entity type sale.
Figure : Replacing a ternary relationship
So a sales assistant can be linked to a specific customer and both of them to the
sale of a particular product.
This process also works for higher order relationships.
Cardinality
Relationships are rarely one-to-one
For example, a manager usually manages more than one employee
This is described by the cardinality of the relationship, for which there are four
possible categories.
One to one (1:1) relationship
One to many (1:m) relationship
Many to one (m:1) relationship
Many to many (m:n) relationship
On an ER diagram, if the end of a relationship is straight, it represents 1, while
a "crow's foot" end represents many.
A one to one relationship - a man can only marry one woman, and a woman
can only marry one man, so it is a one to one (1:1) relationship
Figure : One to One relationship example
A one to many relationship - one manager manages many employees, but each
employee only has one manager, so it is a one to many (1:m) relationship
Figure : One to Many relationship example
A many to one relationship - many students study one course. They do not
study more than one course, so it is a many to one (m:1) relationship
Figure : Many to One relationship example
A many to many relationship - One lecturer teaches many students and a
student is taught by many lecturers, so it is a many to many (m:n) relationship
Figure : Many to Many relationship example
Optionality
A relationship can be optional or mandatory.
If the relationship is mandatory
an entity at one end of the relationship must be related to an entity at the other
end.
The optionality can be different at each end of the relationship
For example, a student must be on a course. This is mandatory: the
relationship `student studies course' is mandatory.
But a course can exist before any students have enrolled. Thus the relationship
`course is_studied_by student' is optional.
To show optionality, put a circle or `0' at the `optional end' of the relationship.
As the optional relationship is `course is_studied_by student', and the optional
part of this is the student, then the `O' goes at the student end of the relationship
connection.
Figure : Optionality example
It is important to know the optionality because you must ensure that whenever
you create a new entity it has the required mandatory links.
Entity Sets
Sometimes it is useful to try out various examples of entities from an ER model. One
reason for this is to confirm the correct cardinality and optionality of a relationship.
We use an `entity set diagram' to show entity examples graphically. Consider the
example of `course is_studied_by student'.
Figure : Entity set example
Confirming Correctness
Figure : Entity set confirming errors
Use the diagram to show all possible relationship scenarios.
Go back to the requirements specification and check to see if they are allowed.
If not, then put a cross through the forbidden relationships
This allows you to show the cardinality and optionality of the relationship
Deriving the relationship parameters
To check we have the correct parameters (sometimes also known as the degree) of a
relationship, ask two questions:
1. One course is studied by how many students? Answer = `zero or more'.
o This gives us the degree at the `student' end.
o The answer `zero or more' needs to be split into two parts.
o The `more' part means that the cardinality is `many'.
o The `zero' part means that the relationship is `optional'.
o If the answer was `one or more', then the relationship would be
`mandatory'.
2. One student studies how many courses? Answer = `One'
o This gives us the degree at the `course' end of the relationship.
o The answer `one' means that the cardinality of this relationship is 1, and
is `mandatory'
o If the answer had been `zero or one', then the cardinality of the
relationship would have been 1, and be `optional'.
Redundant relationships
Some ER diagrams end up with a relationship loop.
check to see if it is possible to break the loop without losing info
Given three entities A, B, C, where there are relations A-B, B-C, and C-A,
check if it is possible to navigate between A and C via B. If it is possible, then
A-C was a redundant relationship.
Always check carefully for ways to simplify your ER diagram. It makes it
easier to read the remaining information.
Redundant relationships example
Consider entities `customer' (customer details), `address' (the address of a
customer) and `distance' (distance from the company to the customer address).
Figure : Redundant relationship
Splitting n:m Relationships
A many to many relationship in an ER model is not necessarily incorrect. It can be
replaced using an intermediate entity. This should only be done where:
the m:n relationship hides an entity
the resulting ER diagram is easier to understand.
Splitting n:m Relationships - Example
Consider the case of a car hire company. Customers hire cars: one customer hires
many cars and a car is hired by many customers.
Figure : Many to Many example
The many to many relationship can be broken down to reveal a `hire' entity, which
contains an attribute `date of hire'.
Figure : Splitting the Many to Many example
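A sketch of the revealed `hire' entity as an intersection table in SQL. The customer
and car tables, and the use of date_of_hire in the key, are assumptions for illustration:
CREATE TABLE hire (
    customer_id  INT  REFERENCES customer (customer_id),
    car_id       INT  REFERENCES car (car_id),
    date_of_hire DATE,
    -- composite key built from both foreign keys, plus the date so that
    -- the same customer may hire the same car on different days
    PRIMARY KEY (customer_id, car_id, date_of_hire)
);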
Constructing an ER model
Before beginning to draw the ER model, read the requirements specification carefully.
Document any assumptions you need to make.
1. Identify entities - list all potential entity types. These are the objects of interest
in the system. It is better to put too many entities in at this stage and then
discard them later if necessary.
2. Remove duplicate entities - Ensure that they really are separate entity types and not
just two names for the same thing.
o Also do not include the system as an entity type
o e.g. if modelling a library, the entity types might be books, borrowers,
etc.
o The library is the system, thus should not be an entity type.
3. List the attributes of each entity (all properties to describe the entity which are
relevant to the application).
o Ensure that the entity types are really needed.
o are any of them just attributes of another entity type?
o if so keep them as attributes and cross them off the entity list.
o Do not have attributes of one entity as attributes of another entity!
4. Mark the primary keys.
o Which attributes uniquely identify instances of that entity type?
o This may not be possible for some weak entities.
5. Define the relationships
o Examine each entity type to see its relationship to the others.
6. Describe the cardinality and optionality of the relationships
o Examine the constraints between participating entities.
7. Remove redundant relationships
o Examine the ER model for redundant relationships.
ER modelling is an iterative process, so draw several versions, refining each one until
you are happy with it.
Normalisation is the process of taking data from a problem and reducing it to a set of
relations while ensuring data integrity and eliminating data redundancy
Data integrity - all of the data in the database are consistent, and satisfy all
integrity constraints.
Data redundancy – if data in the database can be found in two different
locations (direct redundancy) or if data can be calculated from other data items
(indirect redundancy) then the data is said to contain redundancy.
Data should only be stored once, and we should avoid storing data that can be calculated
from other data already held in the database. During the process of normalisation
redundancy must be removed, but not at the expense of breaking data integrity rules.
If redundancy exists in the database then problems can arise when the database is in
normal operation:
When data is inserted the data must be duplicated correctly in all places where
there is redundancy. For instance, if two tables exist in a database, and both
tables contain the employee name, then creating a new employee entry requires
that both tables be updated with the employee name.
When data is modified in the database, if the data being changed has
redundancy, then all versions of the redundant data must be updated
simultaneously. So in the employee example a change to the employee name
must happen in both tables simultaneously.
The removal of redundancy helps to prevent insertion, deletion, and update errors,
since the data is only available in one attribute of one table in the database.
The data in the database can be considered to be in one of a number of `normal forms'.
Basically the normal form of the data indicates how much redundancy is in that data.
The normal forms have a strict ordering:
1. 1st Normal Form
2. 2nd Normal Form
3. 3rd Normal Form
4. BCNF
There are other normal forms, such as 4th and 5th normal forms. They are rarely
utilised in system design and are not considered further here.
To be in a particular form requires that the data meets the criteria to also be in all
normal forms before that form. Thus to be in 2nd normal form the data must meet the
criteria for both 2nd normal form and 1st normal form. The higher the form, the more
redundancy has been eliminated.
Integrity Constraints
An integrity constraint is a rule that restricts the values that may be present in the
database. The relational data model includes constraints that are used to verify the
validity of the data as well as adding meaningful structure to it:
entity integrity :
The rows (or tuples) in a relation represent entities, and each one must be uniquely
identified. Hence we have the primary key that must have a unique non-null value for
each row.
referential integrity :
This constraint involves the foreign keys. Foreign keys tie the relations together, so it
is vitally important that the links are correct. Every foreign key must either be null or
its value must be the actual value of a key in another relation.
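Both constraints are declared directly in SQL; a minimal sketch using the tutor/student
example that appears below (column types are assumptions):
CREATE TABLE tutor (
    tutor_number INT PRIMARY KEY,     -- entity integrity: unique and non-null
    tutor_name   VARCHAR(30)
);
CREATE TABLE student (
    matric_no    INT PRIMARY KEY,
    tutor_number INT REFERENCES tutor (tutor_number)   -- referential integrity:
                                                       -- must be NULL or an existing tutor_number
);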
Understanding Data
Sometimes the starting point for understanding data is given in the form of relations
and functional dependancies. This would be the case where the starting point in the
process was a detailed specification of the problem. We already know what relations
are. Functional dependancies are rules stating that given a certain set of attributes (the
determinant) determines a second set of attributes.
The definition of a functional dependency looks like A->B. In this case B is a single
attribute but it can be as many attributes as required (for instance, X->J,K,L,M). In the
functional dependency, the determinant (the left hand side of the -> sign) can
determine the set of attributes on the right hand side of the -> sign. This basically
means that A selects a particular value for B, and that A is unique. In the second
example X is unique and selects a particular set of values for J,K,L, and M. It can also
be said that B is functionally dependent on A. In addition, a particular value of A
ALWAYS gives you a particular value for B, but not vice-versa.
Consider this example:
R(matric_no, firstname, surname, tutor_number, tutor_name)
tutor_number -> tutor_name
Here there is a relation R, and a functional dependency that indicates that:
each instance of tutor_number determines exactly one tutor_name
from the data, given a tutor_number, it is always possible to work out the
tutor_name.
As an example tutor number 1 may be "Mr Smith", but tutor number 10 may
also be "Mr Smith". Given a tutor number of 1, this is ALWAYS "Mr Smith".
However, given the name "Mr Smith" it is not possible to work out if we are
talking about tutor 1 or tutor 10.
There is actually a second functional dependency for this relation, which can be
worked out from the relation itself. As the relation has a primary key, then given this
attribute you can determine all the other attributes in R. This is an implied functional
dependency and is not normally listed in the list of functional dependents.
Extracting understanding
It is possible that the relations and the determinants have not yet been defined for a
problem, and therefore must be calculated from examples of the data. Consider the
following Student table.
Student - an unnormalised table with repeating groups
matric_no  name      date_of_birth  subject    grade
960100     Smith, J  14/11/1977     Databases  C
                                    Soft_Dev   A
                                    ISDE       D
960105     White, A  10/05/1975     Soft_Dev   B
                                    ISDE       B
960120     Moore, T  11/03/1970     Databases  A
                                    Soft_Dev   B
                                    Workshop   C
960145     Smith, J  09/01/1972     Databases  B
960150     Black, D  21/08/1973     Databases  B
                                    Soft_Dev   D
                                    ISDE       C
                                    Workshop   D
The subject/grade pair is repeated for each student: 960145 has one pair while 960150
has four. Repeating groups are placed inside another set of parentheses. From the table
the following relation is generated:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
The repeating group needs a key in order that the relation can be correctly defined.
Looking at the data one can see that grade repeats within matric_no (for instance,
student 960150 has two D grades). However, subject never seems to repeat for a
single matric_no, and is therefore a candidate key in the repeating group.
Whenever keys or dependencies are extracted from example data, the information extracted is
only as good as the data sample examined. It could be that another data sample
disproves some of the key selections made or dependencies extracted. What is
important however is that the information extracted during these exercises is correct
for the data being examined.
Looking at the data itself, we can see that the same name appears more than once in
the name column. The name in conjunction with the date_of_birth seems to be unique,
suggesting a functional dependency of:
name, date_of_birth -> matric_no
This implies that not only is the matric_no sufficient to uniquely identify a student, the
student's name combined with the date of birth is also sufficient to uniquely identify a
student. It is therefore possible to have the relation Student written as:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
(this time with name and date_of_birth, rather than matric_no, as the primary key).
As guidance in cases where a variety of keys could be selected, one should try to select
the candidate key with the least number of attributes.
Flattened Tables
Note that the student table shown above explicitly identifies the repeating group. It is
also possible that the table presented will be what is called a flat table, where the
repeating group is not explicitly shown:
Student #2 - Flattened Table
matric_no name date_of_birth subject grade
960100 Smith, J 14/11/1977 Databases C
960100 Smith, J 14/11/1977 Soft_Dev A
960100 Smith, J 14/11/1977 ISDE D
960105 White, A 10/05/1975 Soft_Dev B
960105 White, A 10/05/1975 ISDE B
960120 Moore, T 11/03/1970 Databases A
960120 Moore, T 11/03/1970 Soft_Dev B
960120 Moore, T 11/03/1970 Workshop C
960145 Smith, J 09/01/1972 Databases B
960150 Black, D 21/08/1973 Databases B
960150 Black, D 21/08/1973 Soft_Dev D
960150 Black, D 21/08/1973 ISDE C
960150 Black, D 21/08/1973 Workshop B
The table still shows the same data as the previous example, but the format is
different. We have removed the repeating group (which is good) but we have
introduced redundancy (which is bad).
Sometimes you will miss spotting the repeating group, so you may produce something
like the following relation for the Student data.
Student(matric_no, name, date_of_birth, subject, grade )
matric_no -> name, date_of_birth
name, date_of_birth -> matric_no
This data does not explicitly identify the repeating group, but as you will see the result
of the normalisation process on this relation produces exactly the same relations as the
normalisation of the version that explicitly does have a repeating group.
First Normal Form
First normal form (1NF) deals with the `shape' of the record type
A relation is in 1NF if, and only if, it contains no repeating attributes or groups
of attributes.
Example:
The Student table with the repeating group is not in 1NF
It has repeating groups, and it is called an `unnormalised table'.
Relational databases require that each row only has a single value per attribute, and so
a repeating group in a row is not allowed.
To remove the repeating group, one of two things can be done:
either flatten the table and extend the key, or
decompose the relation - leading to First Normal Form
Flatten table and Extend Primary Key
The Student table with the repeating group can be written as:
Student(matric_no, name, date_of_birth, ( subject, grade ) )
If the repeating group was flattened, as in the Student #2 data table, it would look
something like:
Student(matric_no, name, date_of_birth, subject, grade )
Although this is an improvement, we still have a problem. matric_no can no longer be
the primary key - it does not have a unique value for each row. So we have to find a
new primary key - in this case it has to be a compound key since no single attribute
can uniquely identify a row. The new primary key is a compound key (matric_no +
subject).
We have now solved the repeating groups problem, but we have created other
complications. Every repetition of the matric_no, name, and date_of_birth is
redundant and liable to produce errors.
With the relation in its flattened form, strange anomalies appear in the system.
Redundant data is the main cause of insertion, deletion, and updating anomalies.
Insertion anomaly:
With the primary key including subject, we cannot enter a new student until they have
at least one subject to study. We are not allowed NULLs in the primary key so we
must have an entry in both matric_no and subject before we can create a new record.
This is known as the insertion anomaly. It is difficult to insert new records into
the database.
On a practical level, it also means that it is difficult to keep the data up to date.
Update anomaly
If the name of a student were changed, for example Smith, J. was changed to Green, J.,
this would require not one change but many: one for every subject that Smith, J.
studied.
Deletion anomaly
If all of the records for the `Databases' subject were deleted from the table, we would
inadvertently lose all of the information on the student with matric_no 960145. This
would be the same for any student who was studying only one subject and the subject
was deleted. Again this problem arises from the need to have a compound primary
key.
Decomposing the relation
The alternative approach is to split the table into two parts, one for the
repeating groups and one for the non-repeating groups.
The primary key for the original relation is included in both of the new relations.
Record
matric_no subject grade
960100 Databases C
960100 Soft_Dev A
960100 ISDE D
960105 Soft_Dev B
960105 ISDE B
... ... ...
960150 Workshop B
Student
matric_no name date_of_birth
960100 Smith,J 14/11/1977
960105 White,A 10/05/1975
960120 Moore,T 11/03/1970
960145 Smith,J 09/01/1972
960150 Black,D 21/08/1973
We now have two relations, Student and Record.
Student contains the original non-repeating groups
Record has the original repeating groups and the matric_no
Student(matric_no, name, date_of_birth )
Record(matric_no, subject, grade )
Matric_no remains the key to the Student relation. It cannot be the complete key to the
new Record relation - we end up with a compound primary key consisting of
matric_no and subject. The matric_no is the link between the two tables - it will allow
us to find out which subjects a student is studying. So in the Record relation,
matric_no is the foreign key.
This method has eliminated some of the anomalies, though it does not always do so; it
depends on the example chosen.
In this case we no longer have the insertion anomaly
It is now possible to enter new students without knowing the subjects that they
will be studying
They will exist only in the Student table, and will not be entered in the Record
table until they are studying at least one subject.
We have also removed the deletion anomaly
If all of the `databases' subject records are removed, student 960145 still exists
in the Student table.
We have also removed the update anomaly
Student and Record are now in First Normal Form.
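The decomposition translates directly into SQL; a sketch with assumed column types:
CREATE TABLE student (
    matric_no     VARCHAR(6) PRIMARY KEY,
    name          VARCHAR(30),
    date_of_birth DATE
);
CREATE TABLE record (
    matric_no VARCHAR(6) REFERENCES student (matric_no),  -- foreign key back to student
    subject   VARCHAR(20),
    grade     CHAR(1),
    PRIMARY KEY (matric_no, subject)                      -- compound primary key
);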
Second Normal Form
Second normal form (or 2NF) is a more stringent normal form defined as:
A relation is in 2NF if, and only if, it is in 1NF and every non-key attribute is fully
functionally dependent on the whole key.
Thus the relation is in 1NF with no repeating groups, and all non-key attributes must
depend on the whole key, not just some part of it. Another way of saying this is that
there must be no partial key dependencies (PKDs).
The problems arise when there is a compound key, e.g. the key to the Record relation
- matric_no, subject. In this case it is possible for non-key attributes to depend on only
part of the key - i.e. on only one of the two key attributes. This is what 2NF tries to
prevent.
Consider again the Student relation from the flattened Student #2 table:
Student(matric_no, name, date_of_birth, subject, grade )
There are no repeating groups
The relation is already in 1NF
However, we have a compound primary key - so we must check all of the non-
key attributes against each part of the key to ensure they are functionally
dependent on it.
matric_no determines name and date_of_birth, but not grade.
subject together with matric_no determines grade, but not name or
date_of_birth.
So there is a problem with potential redundancies
A dependency diagram is used to show how non-key attributes relate to each part or
combination of parts in the primary key.
Figure : Dependency Diagram
This relation is not in 2NF
It appears to be two tables squashed into one.
the solution is to split the relation up into its component parts.
separate out all the attributes that are solely dependent on matric_no
put them in a new Student_details relation, with matric_no as the primary key
separate out all the attributes that are solely dependent on subject.
in this case no attributes are solely dependent on subject.
separate out all the attributes that are solely dependent on matric_no + subject
put them into a separate Student relation, keyed on matric_no + subject
All attributes in each relation are fully functionally dependent upon its primary key.
These relations are now in 2NF.
Figure : Dependencies after splitting
Interestingly this is the same set of relations as when we recognized that there were
repeating terms in the table and directly removed the repeating terms. It should not
really matter what process you followed when normalizing, as the end result should be
similar relations.
Third Normal Form
3NF is an even stricter normal form and removes virtually all the redundant data:
A relation is in 3NF if, and only if, it is in 2NF and there are no transitive
functional dependencies
Transitive functional dependencies arise:
when one non-key attribute is functionally dependent on another non-key
attribute:
FD: non-key attribute -> non-key attribute
and when there is redundancy in the database
By definition transitive functional dependency can only occur if there is more than
one non-key field, so we can say that a relation in 2NF with zero or one non-key field
must automatically be in 3NF.
project_no manager address
p1 Black,B 32 High Street
p2 Smith,J 11 New Street
p3 Black,B 32 High Street
p4 Black,B 32 High Street
Project has more than one non-key field so
we must check for transitive dependency:
address depends on the value in the manager column
every time Black,B is listed in the manager column, the address column has the
value `32 High Street'. From this the relation and functional dependency can be
implied as:
Project(project_no, manager, address)
manager -> address
in this case address is transitively dependent on manager. Manager is the
determinant - it determines the value of address. It is a transitive functional
dependency only if all attributes on the left of the "->" are not in the key but
are all in the relation, and all attributes to the right of the "->" are not in the key,
with at least one actually being in the relation.
Data redundancy arises from this:
we duplicate the address if a manager is in charge of more than one project
this causes problems if we have to change the address - several entries have to
be changed, and this could lead to errors.
The solution is to eliminate transitive functional dependency by splitting the
table
create two relations - one with the transitive dependency in it, and another for
all of the remaining attributes.
split Project into Project and Manager.
the determinant attribute becomes the primary key in the new relation
manager becomes the primary key to the Manager relation
the original key is the primary key to the remaining non-transitive attributes
in this case, project_no remains the key to the new Projects table.
Project
project_no manager
p1 Black,B
p2 Smith,J
p3 Black,B
p4 Black,B
Manager
manager address
Black,B 32 High Street
Smith,J 11 New Street
Now we need to store the address only once
If we need to know a manager's address we can look it up in the Manager
relation
The manager attribute is the link between the two tables, and in the Projects
table it is now a foreign key.
These relations are now in third normal form.
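The same 3NF split in SQL, with assumed column types:
CREATE TABLE manager (
    manager VARCHAR(30) PRIMARY KEY,   -- the determinant becomes the key
    address VARCHAR(50)                -- each address now stored only once
);
CREATE TABLE project (
    project_no VARCHAR(5) PRIMARY KEY,
    manager    VARCHAR(30) REFERENCES manager (manager)   -- foreign key
);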
Summary: 1NF
A relation is in 1NF if it contains no repeating groups
To convert an unnormalised relation to 1NF either:
Flatten the table and change the primary key, or
Decompose the relation into smaller relations, one for the repeating groups and
one for the non-repeating groups.
Remember to put the primary key from the original relation into both new
relations.
This option is liable to give the best results.
Summary: 2NF
A relation is in 2NF if it contains no repeating groups and no partial key
functional dependencies
Rule: A relation in 1NF with a single key field must be in 2NF
To convert a relation with partial functional dependencies to 2NF, create a set
of new relations:
One relation for the attributes that are fully dependent upon the key.
One relation for each part of the key that has partially dependent attributes
Summary: 3NF
A relation is in 3NF if it contains no repeating groups, no partial functional
dependencies, and no transitive functional dependencies
To convert a relation with transitive functional dependencies to 3NF, remove
the attributes involved in the transitive dependency and put them in a new
relation
Rule: A relation in 2NF with only one non-key attribute must be in 3NF
In a normalised relation a non-key field must provide a fact about the key, the
whole key and nothing but the key.
Relations in 3NF are sufficient for most practical database design problems.
However, 3NF does not guarantee that all anomalies have been removed.
Boyce-Codd Normal Form (BCNF)
When a relation has more than one candidate key, anomalies may result even
though the relation is in 3NF.
3NF does not deal satisfactorily with the case of a relation with overlapping
candidate keys
i.e. composite candidate keys with at least one attribute in common.
BCNF is based on the concept of a determinant.
A determinant is any attribute (simple or composite) on which some other
attribute is fully functionally dependent.
A relation is in BCNF if, and only if, every determinant is a candidate key.
Consider the following relation and determinants.
R(a,b,c,d)
a,c -> b,d
a,d -> b
Here, the first determinant suggests that the primary key of R could be changed from
a,b to a,c. If this change were made, all of the non-key attributes present in R could still
be determined, and therefore this change is legal. However, the second determinant
indicates that a,d determines b, but a,d could not be the key of R as a,d does not
determine all of the non-key attributes of R (it does not determine c). We would say
that the first determinant is a candidate key, but the second determinant is not a
candidate key, and thus this relation is not in BCNF (but is in 3rd normal form).
Normalisation to BCNF - Example 1
Patient No Patient Name Appointment Id Time Doctor
1 John 0 09:00 Zorro
2 Kerr 0 09:00 Killer
3 Adam 1 10:00 Zorro
4 Robert 0 13:00 Killer
5 Zane 1 14:00 Zorro
Let's consider the database extract shown above. This depicts a special dieting clinic
where each patient has 4 appointments. On the first they are weighed, on the second
they are exercised, on the third their fat is removed by surgery, and on the fourth their
mouth is stitched closed... Not all patients need all four appointments! If the Patient
Name begins with a letter before "P" they get a morning appointment, otherwise they
get an afternoon appointment. Appointment 1 is either 09:00 or 13:00, appointment 2
is 10:00 or 14:00, and so on. From this (hopefully) make-believe scenario we can extract
the following determinants:
DB(Patno,PatName,appNo,time,doctor)
Patno -> PatName
Patno,appNo -> Time,doctor
Time -> appNo
Now we have to decide what the primary key of DB is going to be. From the
information we have, we could choose:
DB(Patno,PatName,appNo,time,doctor) with key Patno,appNo (example 1a)
or
DB(Patno,PatName,appNo,time,doctor) with key Patno,time (example 1b)
Example 1a - DB(Patno,PatName,appNo,time,doctor)
1NF Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)
2NF Eliminate partial key dependencies
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)
3NF Eliminate transitive dependencies
None: so just as 2NF
BCNF Every determinant is a candidate key
DB(Patno,appNo,time,doctor)
R1(Patno,PatName)
Go through all determinants where ALL of the left hand attributes are present
in a relation and at least ONE of the right hand attributes is also present in the
relation.
Patno -> PatName
Patno is present in DB, but not PatName, so not relevant.
Patno,appNo -> Time,doctor
All LHS present, and time and doctor also present, so relevant. Is this a
candidate key? Patno,appNo IS the key, so this is a candidate key. Thus this is
OK for BCNF compliance.
Time -> appNo
Time is present, and so is appNo, so relevant. Is this a candidate key? If it were,
then we could rewrite DB with time as the key:
DB(Patno,appNo,time,doctor)
This will not work, as you need both time and Patno together to form a unique
key. Thus this determinant is not a candidate key, and therefore DB is not in
BCNF. We need to fix this.
BCNF: rewrite to
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
time is enough to work out the appointment number of a patient. Now BCNF is
satisfied, and the final relations shown are in BCNF.
Example 1b - DB(Patno,PatName,appNo,time,doctor)
1NF Eliminate repeating groups.
None:
DB(Patno,PatName,appNo,time,doctor)
2NF Eliminate partial key dependencies
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
3NF Eliminate transitive dependencies
None: so just as 2NF
BCNF Every determinant is a candidate key
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
Go through all determinants where ALL of the left-hand attributes are present
in a relation and at least ONE of the right-hand attributes is also present in the
relation.
Patno -> PatName
Patno is present in DB, but not PatName, so not relevant.
Patno,appNo -> Time,doctor
Not all LHS present, so not relevant.
Time -> appNo
Time is present, and so is appNo, so relevant. Is this a candidate key? Time is
the key of R2, so it is, and the rule for BCNF is satisfied.
BCNF: as 3NF
DB(Patno,time,doctor)
R1(Patno,PatName)
R2(time,appNo)
Summary - Example 1
This example has demonstrated three things:
BCNF is stronger than 3NF, relations that are in 3NF are not necessarily in
BCNF
BCNF is needed in certain situations to obtain full understanding of the data
model
there are several routes to take to arrive at the same set of relations in BCNF.
Unfortunately there are no rules as to which route will be the easiest one to
take.
Example 2
Grade_report(StudNo,StudName,(Major,Advisor,
(CourseNo,Ctitle,InstrucName,InstructLocn,Grade)))
Functional dependencies
StudNo -> StudName
CourseNo -> Ctitle,InstrucName
InstrucName -> InstructLocn
StudNo,CourseNo,Major -> Grade
StudNo,Major -> Advisor
Advisor -> Major
Unnormalised
Grade_report(StudNo,StudName,(Major,Advisor,
(CourseNo,Ctitle,InstrucName,InstructLocn,Grade)))
1NF Remove repeating groups
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,
Ctitle,InstrucName,InstructLocn,Grade)
2NF Remove partial key dependencies
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName,InstructLocn)
3NF Remove transitive dependencies
Student(StudNo,StudName)
StudMajor(StudNo,Major,Advisor)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstrucName,InstructLocn)
BCNF Every determinant is a candidate key
Student : only determinant is StudNo
StudCourse: only determinant is StudNo,Major,CourseNo
Course: only determinant is CourseNo
Instructor: only determinant is InstrucName
StudMajor: the determinants are
StudNo,Major, or
Advisor
Only StudNo,Major is a candidate key.
BCNF
Student(StudNo,StudName)
StudCourse(StudNo,Major,CourseNo,Grade)
Course(CourseNo,Ctitle,InstrucName)
Instructor(InstrucName,InstructLocn)
StudMajor(StudNo,Advisor)
Advisor(Advisor,Major)
Problems BCNF overcomes
STUDENT MAJOR ADVISOR
123 PHYSICS EINSTEIN
123 MUSIC MOZART
456 BIOLOGY DARWIN
789 PHYSICS BOHR
999 PHYSICS EINSTEIN
If the record for student 456 is deleted, we lose not only information on student
456 but also the fact that DARWIN advises in BIOLOGY.
We cannot record the fact that WATSON can advise on COMPUTING until we
have a student majoring in COMPUTING to whom we can assign WATSON
as an advisor.
In BCNF we have two tables:
STUDENT ADVISOR
123 EINSTEIN
123 MOZART
456 DARWIN
789 BOHR
999 EINSTEIN
ADVISOR MAJOR
EINSTEIN PHYSICS
MOZART MUSIC
DARWIN BIOLOGY
BOHR PHYSICS
Relational Database Language - Structured Query Language
SQL is a popular relational database language first standardized in 1986 by the
American National Standards Institute (ANSI). Since then, it has been formally
adopted as an International Standard by the International Organization for
Standardization (ISO) and the International Electrotechnical Commission (IEC). It has
also been adopted as a Federal Information Processing Standard (FIPS) for the U.S.
federal government.
Database Language SQL is under continual development by the above-mentioned
standardization bodies. The most recent published version was in 1992, a 580-page
specification published by ANSI as American National Standard X3.135-1992 and by
ISO/IEC as International Standard 9075:1992. The two specifications are word-for-
word identical. Both versions are available in hard copy only from ANSI (Sales
telephone: +1-212-642-4900). Further extensions and enhancements are under
development.
DESCRIPTION of SQL
The basic structure of the relational model is a table, consisting of rows and columns.
Data definition includes declaring the name of each table to be included in a database,
the names and data types of all columns of each table, constraints on the values in and
among columns, and the granting of table manipulation privileges to prospective
users. Tables can be accessed by inserting new rows, deleting or updating existing
rows, or selecting rows that satisfy a given search condition for output. Tables can be
manipulated to produce new tables by Cartesian products, unions, intersections, joins
on matching columns, or projections on given columns.
SQL data manipulation operations may be invoked through a cursor or through a
general query specification. The language includes all arithmetic operations,
predicates for comparison and string matching, universal and existential quantifiers,
summary operations for max/min or count/sum, and GROUP BY and HAVING
clauses to partition tables by groups. Transaction management is achieved through
COMMIT and ROLLBACK statements.
The standard provides language facilities for defining application specific views of the
data. Each view is the specification of database operations that would produce a
desired table. The viewed table is then materialized at application execution time.
The SQL standard provides a Module Language for interface to other languages. Each
SQL statement may be packaged as a procedure that can be called and have
parameters passed to it from an external language. A cursor mechanism provides row-
at-a-time access from languages that can only handle one row of a table at one time.
Access control is provided by GRANT and REVOKE statements. Each prospective
user must be explicitly granted the privilege to access a specific table or view using a
specific statement.
The SQL Integrity Enhancement facility offers additional tools for referential
integrity, CHECK constraint clauses, and DEFAULT clauses. Referential integrity
allows specification of primary and foreign keys with the requirement that no foreign
key row may be inserted or updated unless a matching primary key row exists. Check
clauses allow specification of inter-column constraints to be maintained by the
database system. Default clauses provide optional default values for missing data.
The Embedded SQL specification provides SQL interface to programming languages,
specifically Ada, C, COBOL, FORTRAN, MUMPS, Pascal, and PL/I. Applications
may thereby integrate program control structures with SQL data manipulation
capabilities. The Embedded SQL syntax is just a shorthand for an explicit SQL
Module accessed from a standard conforming programming language.
SQL-92 significantly increases the size of the original 1986 standard to include a
schema manipulation language for modifying or altering schemas, schema information
tables to make schema definitions accessible to users, new facilities for dynamic
creation of SQL statements, and new data types and domains. Other new SQL-92
features include outer join, cascade update and delete referential actions, set algebra
on tables, transaction consistency levels, scrolled cursors, deferred constraint
checking, and greatly expanded exception reporting. SQL-92 also removes a number
of restrictions in order to make the language more flexible and orthogonal.
Why is it needed?
When data is stored on disk-based storage devices, it is stored as blocks of
data. These blocks are accessed in their entirety, making them the atomic
disk access operation. Disk blocks are structured in much the same way as
linked lists; both contain a section for data, a pointer to the location of the
next node (or block), and both need not be stored contiguously.
Because a set of records can be physically sorted on only one field, searching
on a field that isn't sorted requires a Linear Search, which takes N/2 block
accesses on average, where N is the number of blocks that the table spans. If
that field is a non-key field (i.e. it doesn't contain unique entries) then the
entire table space must be searched, at N block accesses. With a sorted field a
Binary Search may be used instead; this takes log2 N block accesses. Also,
since the data is sorted, a non-key field no longer requires the rest of the table
to be searched for duplicate values once a higher value is found. Thus the
performance increase is substantial.
What is indexing?
Indexing is a way of sorting a number of records on multiple fields. Creating
an index on a field in a table creates another data structure which holds the
field value, and pointer to the record it relates to. This index structure is then
sorted, allowing Binary Searches to be performed on it.
The downside to indexing is that these indexes require additional space on
the disk: since the indexes are stored together in a single file in a MyISAM
database, this file can quickly reach the size limits of the underlying file
system if many fields within the same table are indexed.
How does it work?
Firstly, let's outline a sample database table schema:
Field name Data type Size on disk
id (Primary key) Unsigned INT 4 bytes
firstName Char(50) 50 bytes
lastName Char(50) 50 bytes
emailAddress Char(100) 100 bytes
Note: char was used in place of varchar to allow for an accurate size on disk
value. This sample database contains five million rows, and is unindexed.
The performance of several queries will now be analyzed. These are a query
using the id (a sorted key field) and one using the firstName (a non-key
unsorted field).
Example 1
Given our sample database of r = 5,000,000 records of a fixed size giving a
record length of R = 204 bytes and they are stored in a MyISAM database
which is using the default block size B = 1,024 bytes. The blocking factor of
the table would be bfr = (B/R) = 1024/204 = 5 records per disk block. The
total number of blocks required to hold the table is N = (r/bfr) = 5000000/5
= 1,000,000 blocks.
A linear search on the id field would require an average of N/2 = 500,000
block accesses to find a value given that the id field is a key field. But since
the id field is also sorted a binary search can be conducted requiring an
average of log2 1000000 = 19.93 = 20 block accesses. Instantly we can see
this is a drastic improvement.
The firstName field, however, is neither sorted (so a binary search is
impossible) nor unique, and thus the table must be searched to the very
end, for exactly N = 1,000,000 block accesses. It is this situation that
indexing aims to correct.
Given that an index record contains only the indexed field and a pointer to
the original record, it stands to reason that it will be smaller than the multi-
field record that it points to. So the index itself requires fewer disk blocks
than the original table, and therefore fewer block accesses to iterate
through. The schema for an index on the firstName field is outlined
below:
Field name Data type Size on disk
firstName Char(50) 50 bytes
(record pointer) Special 4 bytes
Note: Pointers in MySQL are 2, 3, 4 or 5 bytes in length depending on the
size of the table.
Example 2
Given our sample database of r = 5,000,000 records with an index record
length of R = 54 bytes and using the default block size B = 1,024 bytes. The
blocking factor of the index would be bfr = (B/R) = 1024/54 = 18 records
per disk block. The total number of blocks required to hold the table is N =
(r/bfr) = 5000000/18 = 277,778 blocks.
Now a search using the firstName field can utilise the index to increase
performance. This allows for a binary search of the index with an average of
log2 277778 = 18.08 = 19 block accesses. The index entry then gives the
address of the actual record, which requires a further block access to read,
bringing the total to 19 + 1 = 20 block accesses, a far cry from the 1,000,000
block accesses required by a linear search of the non-indexed table.
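The arithmetic in Examples 1 and 2 can be reproduced directly. The short Python
sketch below is only an illustration, using the figures assumed above (5,000,000
records, 1,024-byte blocks, 204-byte records, 54-byte index entries); it prints the
same block-access counts:

    import math

    r, B = 5_000_000, 1024            # records, block size in bytes

    def blocks(record_len):
        bfr = B // record_len         # blocking factor: records per block
        return math.ceil(r / bfr)     # blocks needed to hold all records

    N = blocks(204)                   # table -> 1,000,000 blocks
    print(N // 2)                     # linear search average: 500,000 accesses
    print(math.ceil(math.log2(N)))    # binary search on sorted id: 20 accesses

    Ni = blocks(54)                   # index on firstName -> 277,778 blocks
    print(math.ceil(math.log2(Ni)) + 1)  # index search + record read: 20 accesses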
When should it be used?
Given that creating an index requires additional disk space (277,778 blocks
extra from the above example), and that too many indexes can cause issues
arising from the file system's size limits, careful thought must be given to
selecting the correct fields to index.
Since indexes are only used to speed up the searching for a matching field
within the records, it stands to reason that indexing fields used only for
output would be simply a waste of disk space and processing time when
doing an insert or delete operation, and thus should be avoided. Also given
the nature of a binary search, the cardinality or uniqueness of the data is
important. Indexing on a field with a cardinality of 2 would split the data in
half, whereas a cardinality of 1,000 would return approximately 1,000
records. With such a low cardinality the effectiveness is reduced to that of a
linear search, and the query optimizer will avoid using the index if the
cardinality is less than 30% of the record number, effectively making the
index a waste of space.
Overview of Physical Storage Media
1. Several types of data storage exist in most computer systems. They vary in
speed of access, cost per unit of data, and reliability.
o Cache: most costly and fastest form of storage. Usually very small, and
managed by the hardware.
o Main Memory (MM): the storage area for data available to be operated
on.
General-purpose machine instructions operate on main memory.
Contents of main memory are usually lost in a power failure or
``crash''.
Usually too small (even with megabytes) and too expensive to
store the entire database.
o Flash memory: EEPROM (electrically erasable programmable read-
only memory).
Data in flash memory survives power failure.
Reading data from flash memory takes about 10 nanoseconds
(roughly as fast as from main memory), while writing data into
flash memory is more complicated: a single write takes about 4-10
microseconds.
To overwrite what has been written, one has to first erase the
entire bank of the memory. Flash memory supports only a limited
number of erase cycles.
It has become popular as a replacement for disks for storing
small volumes of data (5-10 megabytes).
o Magnetic-disk storage: primary medium for long-term storage.
Typically the entire database is stored on disk.
Data must be moved from disk to main memory in order for the
data to be operated on.
After operations are performed, data must be copied back to disk
if any changes were made.
Disk storage is called direct access storage as it is possible to
read data on the disk in any order (unlike sequential access).
Disk storage usually survives power failures and system crashes.
o Optical storage: CD-ROM (compact-disk read-only memory), WORM
(write-once read-many) disk (for archival storage of data), and Juke box
(containing a few drives and numerous disks loaded on demand).
o Tape Storage: used primarily for backup and archival data.
Cheaper, but much slower access, since tape must be read
sequentially from the beginning.
Used as protection from disk failures!
2. The storage device hierarchy is presented in Figure 10.1: the higher
levels are more expensive (cost per bit) and faster (access time), but
smaller in capacity.
Figure 10.1: Storage-device hierarchy
3. Another classification: Primary, secondary, and tertiary storage.
1. Primary storage: the fastest storage media, such as cache and main
memory.
2. Secondary (or on-line) storage: the next level of the hierarchy, e.g.,
magnetic disks.
3. Tertiary (or off-line) storage: magnetic tapes and optical disk juke
boxes.
Volatility of storage. Volatile storage loses its contents when the power is
removed. Without power backup, data in the volatile storage (the part of the hierarchy
from main memory up) must be written to nonvolatile storage for safekeeping.
Physical Characteristics of Disks
1. The storage capacity of a single disk ranges from 10MB to 10GB. A typical
commercial database may require hundreds of disks.
2. Figure 10.2 shows a moving-head disk mechanism.
o Each disk platter has a flat circular shape. Its two surfaces are covered
with a magnetic material and information is recorded on the surfaces.
The platters of hard disks are made from rigid metal or glass, while
floppy disks are made from flexible material.
o The disk surface is logically divided into tracks, which are subdivided
into sectors. A sector (varying from 32 bytes to 4096 bytes, usually 512
bytes) is the smallest unit of information that can be read from or written
to disk. There are 4-32 sectors per track and 20-1500 tracks per disk
surface.
o The arm can be positioned over any one of the tracks.
o The platter is spun at high speed.
o To read information, the arm is positioned over the correct track.
o When the data to be accessed passes under the head, the read or write
operation is performed.
3. A disk typically contains multiple platters (see Figure 10.2). The read-write
heads of all the platters are mounted on a single assembly called a disk arm,
and move together.
o Multiple disk arms are moved as a unit by the actuator.
o Each arm has two heads, to read the platter surfaces above and below it.
o The set of tracks over which the heads are located forms a cylinder.
o This cylinder holds the data that is accessible within the disk latency
time.
o It is clearly sensible to store related data in the same or adjacent
cylinders.
4. Disk platters range from 1.8" to 14" in diameter; 5¼" and 3½" disks
dominate because they have lower cost and faster seek times than larger
disks, yet still provide high storage capacity.
5. A disk controller interfaces between the computer system and the actual
hardware of the disk drive. It accepts commands to read or write a sector, and
initiates the corresponding actions. Disk controllers also attach checksums to
each sector to detect read errors.
6. Remapping of bad sectors: If a controller detects that a sector is damaged when
the disk is initially formatted, or when an attempt is made to write the sector, it
can logically map the sector to a different physical location.
7. SCSI (Small Computer System Interconnect) is commonly used to connect
disks to PCs and workstations. Mainframe and server systems usually have a
faster and more expensive bus to connect to the disks.
8. Head crash: the head touching the platter surface can damage the recorded
data and may cause the entire disk to fail.
9. A fixed-head disk has a separate head for each track -- very many heads, very
expensive. Multiple disk arms allow more than one track to be accessed at a
time. Both were used in high-performance mainframe systems but are relatively
rare today.
Performance Measures of Disks
The main measures of the quality of a disk are capacity, access time, data transfer
rate, and reliability.
1. access time: the time from when a read or write request is issued to when data
transfer begins. To access data on a given sector of a disk, the arm first must
move so that it is positioned over the correct track, and then must wait for the
sector to appear under it as the disk rotates. The time for repositioning the arm
is called seek time, and it increases with the distance the arm must move.
Typical seek times range from 2 to 30 milliseconds.
Average seek time is the average of the seek time, measured over a sequence of
(uniformly distributed) random requests, and it is about one third of the worst-
case seek time.
Once the seek has occurred, the time spent waiting for the sector to be accessed
to appear under the head is called rotational latency time. Average rotational
latency time is about half of the time for a full rotation of the disk. (Typical
rotational speeds of disks range from 60 to 120 rotations per second.)
The access time is then the sum of the seek time and the latency, and ranges
from 10 to 40 milliseconds (a small worked example follows this list).
2. data transfer rate, the rate at which data can be retrieved from or stored to the
disk. Current disk systems support transfer rates of 1 to 5 megabytes per
second.
3. reliability, measured by the mean time to failure. The typical mean time to
failure of disks today ranges from 30,000 to 800,000 hours (about 3.4 to 91
years).
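To make the access-time figures concrete, here is a small illustrative calculation;
the seek time and rotational speed are assumed values chosen from the ranges
quoted above:

    seek_ms = 10.0                       # assumed average seek time
    rotations_per_sec = 90               # within the 60-120 range above
    full_rotation_ms = 1000 / rotations_per_sec
    latency_ms = full_rotation_ms / 2    # average latency: half a rotation
    print(seek_ms + latency_ms)          # ~15.6 ms, inside the 10-40 ms range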
Optimization of Disk-Block Access
1. Data is transferred between disk and main memory in units called blocks.
2. A block is a contiguous sequence of bytes from a single track of one platter.
3. Block sizes range from 512 bytes to several thousand.
4. The lower levels of the file system manager convert block addresses into the
hardware-level cylinder, surface, and sector number.
5. Access to data on disk is several orders of magnitude slower than access to
data in main memory. Several optimization techniques exist besides buffering
of blocks in main memory:
o Scheduling: If several blocks from a cylinder need to be transferred, we
may save time by requesting them in the order in which they pass under
the heads. A commonly used disk-arm scheduling algorithm is the
elevator algorithm (a sketch of it follows this list).
o File organization. Organize blocks on disk in a way that corresponds
closely to the manner that we expect data to be accessed. For example,
store related information on the same track, or physically close tracks, or
adjacent cylinders in order to minimize seek time. IBM mainframe OSs
give programmers fine control over the placement of files, but this
increases the programmer's burden.
UNIX or PC OSs hide disk organizations from users. Over time, a
sequential file may become fragmented. To reduce fragmentation, the
system can make a back-up copy of the data on disk and restore the
entire disk. The restore operation writes back the blocks of each file
contiguously (or nearly so). Some systems, such as MS-DOS, have
utilities that scan the disk and then move blocks to decrease the
fragmentation.
o Nonvolatile write buffers. Use nonvolatile RAM (such as battery-backed-
up RAM) to speed up disk writes drastically (first write to the nonvolatile
RAM buffer and inform the OS that the write has completed).
o Log disk. Another approach to reducing write latency is to use a log
disk, a disk devoted to writing a sequential log. All access to the log disk
is sequential, essentially eliminating seek time, and several consecutive
blocks can be written at once, making writes to log disk several times
faster than random writes.
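As a rough illustration of the elevator algorithm mentioned under Scheduling, the
Python sketch below (illustrative only; the cylinder numbers are invented) services
pending cylinder requests in one sweep direction and reverses at the end of each
sweep:

    def elevator(requests, head, direction=+1):
        order, pending = [], sorted(requests)
        while pending:
            # requests still ahead of the head in the current direction
            ahead = [c for c in pending
                     if (c >= head if direction > 0 else c <= head)]
            if not ahead:
                direction = -direction          # reverse the sweep
                continue
            head = min(ahead) if direction > 0 else max(ahead)
            order.append(head)
            pending.remove(head)
        return order

    print(elevator([98, 183, 37, 122, 14, 124, 65, 67], head=53))
    # [65, 67, 98, 122, 124, 183, 37, 14]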
Optical Disks
1. CD-ROM has become a popular medium for distributing software, multimedia
data, and other electronic published information.
2. Capacity of CD-ROM: 500 MB. Disks and drives are cheap to mass-produce.
3. CD-ROM: much longer seek time (250 ms) and lower rotation speed (400 rpm),
leading to higher latency and a lower data-transfer rate (about 150 KB/sec).
Drives spinning at multiples of the standard audio CD speed are available.
4. Recently, a new optical format, the digital video disk (DVD), has become
standard. These disks hold between 4.7 and 17 GB of data.
5. WORM (write-once, read-many) disks are popular for archival storage of data
since they have a high capacity (about 500 MB), a longer lifetime than hard
disks, and can be removed from the drive -- good for audit trails (hard to
tamper with).
Magnetic Tapes
1. Long history, slow, and limited to sequential access, and thus are used for
backup, storage for infrequent access, and off-line medium for system transfer.
2. Moving to the correct spot may take minutes, but once positioned, tape drives
can write data at densities and speeds approaching those of disk drives.
3. 8mm tape drives have the highest density; a 350-foot tape can store 5 GB of
data.
4. Popularly used for storage of large volumes of data, such as video, image, or
remote sensing data.
Storage Access
1. Each file is partitioned into fixed-length storage units, called blocks, which are
the units of both storage allocation and data transfer.
2. It is desirable to keep as many blocks as possible in main memory. Usually, we
cannot keep all blocks in main memory, so we need to manage the allocation of
available main memory space.
3. We need to use disk storage for the database, and to transfer blocks of data
between main memory and disk. We also want to minimize the number of such
transfers, as they are time-consuming.
4. The buffer is the part of main memory available for storage of copies of disk
blocks.
Buffer manager
1. The subsystem responsible for the allocation of buffer space is called the
buffer manager.
o The buffer manager handles all requests for blocks of the database.
o If the block is already in main memory, the address in main memory is
given to the requester.
o If not, the buffer manager must read the block in from disk (possibly
displacing some other block if the buffer is full) and then pass the
address in main memory to the requester.
2. The buffer manager must use some sophisticated techniques in order to provide
good service:
o Replacement Strategy -- When there is no room left in the buffer, some
block must be removed to make way for the new one. Typical operating
system memory management schemes use a ``least recently used''
(LRU) method. (Simply remove the block least recently referenced.)
This can be improved upon for database applications.
o Pinned Blocks - For the database to be able to recover from crashes, we
need to restrict the times when a block may be written back to disk. A block
not allowed to be written is said to be pinned. Many operating systems
do not provide support for pinned blocks, and such a feature is essential
if a database is to be ``crash resistant''.
o Forced Output of Blocks - Sometimes it is necessary to write a block
back to disk even though its buffer space is not needed (this is called the
forced output of a block). This is because main memory
contents (and thus the buffer) are lost in a crash, while disk data usually
survives.
Buffer replacement policies
1. Replacement Strategy: Goal is minimization of accesses to disk. Generally it
is hard to predict which blocks will be referenced. So operating systems use the
history of past references as a guide to prediction.
o General Assumption: Blocks referenced recently are likely to be used
again.
o Therefore, if we need space, throw out the least recently referenced
block (LRU replacement scheme).
2. LRU is acceptable in operating systems; however, a database system is able to
predict future references more accurately.
3. Consider processing of the relational algebra expression borrower ⋈ customer
(the join of borrower and customer).
4. Further, assume the strategy to process this request is given by the following
pseudo-code:
for each tuple b of borrower do
    for each tuple c of customer do
        if b[cname] = c[cname]
        then begin
            let x be a tuple defined as follows:
            x[cname] := b[cname]
            x[loan#] := b[loan#]
            x[street] := c[street]
            x[city] := c[city]
            include tuple x as part of the result of borrower ⋈ customer
        end
    end
end
5. Assume that the two relations in this example are stored in different files.
o Once a tuple of borrower has been processed, it is not needed again.
Therefore, once processing of an entire block of tuples is finished, that
block is not needed in main memory, even though it has been used very
recently.
o The buffer manager should free the space occupied by a borrower block
as soon as it is processed. This strategy is called toss-immediate.
o Consider blocks containing customer tuples.
o Every block of customer tuples must be examined once for every tuple
of the borrower relation. When processing of a customer block is
completed, it will not be used again until all other customer blocks have
been processed. This means the most recently used (MRU) block will be
the last block to be re-referenced, and the least recently used will be
referenced next.
o This is the opposite of the LRU assumption. So for the inner relation's
blocks, use the MRU strategy -- if a customer block must be removed
from the buffer, choose the MRU block (a sketch contrasting the two
policies follows this list).
o For MRU strategy, the system must pin the customer block currently
being processed until the last tuple has been processed. Then it is
unpinned, becoming the most recently used block.
6. The buffer manager may also use statistical information regarding the
probability that a request will reference a particular relation.
o The data dictionary is the most frequently-used part of the database. It
should, therefore, not be removed from main memory unless necessary.
o File indices are also frequently used, and should generally be in main
memory.
o No single strategy is known that handles all possible scenarios well.
o Many database systems use LRU, despite its faults.
o Concurrency and recovery may need other buffer management
strategies, such as delayed buffer-out or forced output.
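The difference between the two victim-selection rules discussed above can be shown
in a few lines. This is only a toy sketch (block names and reference times invented
for illustration): LRU evicts the block referenced longest ago, which for the inner
relation of the nested-loop join is exactly the block needed next, while MRU evicts
the block just finished with:

    def lru_victim(last_used):
        return min(last_used, key=last_used.get)   # least recently used

    def mru_victim(last_used):
        return max(last_used, key=last_used.get)   # most recently used

    last_used = {'cust-blk-1': 10, 'cust-blk-2': 20, 'cust-blk-3': 30}
    print(lru_victim(last_used))   # cust-blk-1: the block needed soonest!
    print(mru_victim(last_used))   # cust-blk-3: re-referenced last, best victim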
File Organization
1. A file is organized logically as a sequence of records.
2. Records are mapped onto disk blocks.
3. Files are provided as a basic construct in operating systems, so we assume the
existence of an underlying file system.
4. Blocks are of a fixed size determined by the operating system.
5. Record sizes vary.
6. In a relational database, tuples of distinct relations may be of different sizes.
7. One approach to mapping the database to files is to store records of a single
fixed length in a given file.
8. An alternative is to structure files to accommodate variable-length records.
(Fixed-length is easier to implement.)
Fixed-Length Records
Consider a file of deposit records of the form:
type deposit = record
    bname : char(22);
    account# : char(10);
    balance : real;
end
o If we assume that each character occupies one byte, an integer occupies
4 bytes, and a real 8 bytes, our deposit record is 40 bytes long.
o The simplest approach is to use the first 40 bytes for the first record, the
next 40 bytes for the second, and so on.
o However, there are two problems with this approach.
o It is difficult to delete a record from this structure.
o The space it occupied must be filled by another record, or we need to
mark deleted records so that they can be ignored.
o Unless block size is a multiple of 40, some records will cross block
boundaries.
o It would then require two block accesses to read or write such a record.
2. When a record is deleted, we could move all successive records up one (Figure
10.7), which may require moving a lot of records.
o We could instead move the last record into the ``hole'' created by the
deleted record (Figure 10.8).
o This changes the order of the records.
o It turns out to be undesirable to move records to occupy freed space, as
moving requires block accesses.
o Also, insertions tend to be more frequent than deletions.
o It is acceptable to leave the space open and wait for a subsequent
insertion.
o This leads to a need for additional structure in our file design.
3. So one solution is:
o At the beginning of a file, allocate some bytes as a file header.
o This header for now need only be used to store the address of the first
record whose contents are deleted.
o This first record can then store the address of the second available
record, and so on (Figure 10.9).
o To insert a new record, we use the record pointed to by the header, and
change the header pointer to the next available record.
o If no deleted records exist we add our new record to the end of the file.
4. Note: Use of pointers requires careful programming. If a record pointed to is
moved or deleted, and that pointer is not corrected, the pointer becomes a
dangling pointer. Records pointed to are called pinned.
5. Fixed-length file insertions and deletions are relatively simple because ``one
size fits all''. For variable length, this is not the case.
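As an illustration of the header free-list scheme in point 3 above, the Python sketch
below (a simplification: list slots stand in for 40-byte record positions, and the
header is a simple variable) chains deleted slots together and reuses them on
insertion:

    class FixedLengthFile:
        def __init__(self):
            self.slots = []          # one entry per record slot
            self.first_free = None   # header: index of first deleted slot

        def delete(self, i):
            self.slots[i] = self.first_free   # deleted slot stores next free
            self.first_free = i

        def insert(self, record):
            if self.first_free is None:
                self.slots.append(record)     # no hole: grow the file
            else:
                i = self.first_free
                self.first_free = self.slots[i]   # pop the free list
                self.slots[i] = record

    f = FixedLengthFile()
    for rec in ['r0', 'r1', 'r2']:
        f.insert(rec)
    f.delete(1)
    f.insert('r3')        # reuses slot 1 instead of growing the file
    print(f.slots)        # ['r0', 'r3', 'r2']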
Variable-Length Records
1. Variable-length records arise in a database in several ways:
o Storage of multiple record types in a file.
o Record types allowing variable field size
o Record types allowing repeating fields
We'll look at several techniques, using one example with a variable-length record:
type account-list = record
    bname : char(22);
    account-info : array of record
        account# : char(10);
        balance : real;
    end
end
Account-information is an array with an arbitrary number of elements.
Byte string representation
1. Attach a special end-of-record symbol (⊥) to the end of each record. Each
record is stored as a string of successive bytes (See Figure 10.10).
Byte string representation has several disadvantages:
o It is not easy to re-use space left by a deleted record
o In general, there is no space for records to grow longer. (Must move to
expand, and record may be pinned.)
So this method is not usually used.
2. An interesting structure: the slotted-page structure.
There is a header at the beginning of each block, containing:
o the number of record entries in the header
o the end of free space in the block
o an array whose entries contain the location and size of each record.
3. The slotted-page structure requires that there be no pointers that point directly
to records. Instead, pointers must point to the entry in the header that contains
the actual location of the record. This level of indirection allows records to be
moved to prevent fragmentation of space inside a block, while still supporting
indirect pointers to the record.
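A minimal Python sketch of the slotted-page idea (illustrative only): the header's
slot array holds (offset, length) pairs, records grow downward from the end of the
block, and external pointers name a slot number rather than a byte offset, so
records can be moved within the block freely:

    class SlottedPage:
        def __init__(self, size=1024):
            self.data = bytearray(size)
            self.slots = []            # header array: (offset, length)
            self.free_end = size       # records grow down from block end

        def insert(self, record):
            self.free_end -= len(record)
            self.data[self.free_end:self.free_end + len(record)] = record
            self.slots.append((self.free_end, len(record)))
            return len(self.slots) - 1     # external pointer = slot number

        def read(self, slot):
            off, length = self.slots[slot]
            return bytes(self.data[off:off + length])

    page = SlottedPage()
    s = page.insert(b'Perryridge|A-102|400')
    print(page.read(s))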
Fixed-length representation
1. Uses one or more fixed-length records to represent one variable-length record.
2. Two techniques:
o Reserved space - uses fixed-length records large enough to
accommodate the largest variable-length record. (Unused space filled
with end-of-record symbol.)
o Pointers - represent by a list of fixed-length records, chained together.
3. The reserved space method requires the selection of some maximum record
length. (Figure 10.12)
If most records are of near-maximum length this method is useful. Otherwise,
space is wasted.
4. Then the pointer method may be used (Figure 10.13). Its disadvantage is that
space is wasted in all records in a chain except the first, as the non-repeating
fields are still present in them.
5. To overcome this last disadvantage we can split records into two blocks (See
Figure 10.14)
o Anchor block - contains first records of a chain
o Overflow block - contains records other than first in the chain.
Now all records in a block have the same length, and there is no wasted space.
Organization of Records in Files
There are several ways of organizing records in files.
heap file organization. Any record can be placed anywhere in the file where
there is space for the record. There is no ordering of records.
sequential file organization. Records are stored in sequential order, based on
the value of the search key of each record.
hashing file organization. A hash function is computed on some attribute of
each record. The result of the function specifies in which block of the file the
record should be placed -- to be discussed in chapter 11 since it is closely
related to the indexing structure.
clustering file organization. Records of several different relations can be
stored in the same file. Related records of the different relations are stored on
the same block so that one I/O operation fetches related records from all the
relations.
Sequential File Organization
1. A sequential file is designed for efficient processing of records in sorted
order on some search key.
o Records are chained together by pointers to permit fast retrieval in
search key order.
o Pointer points to next record in order.
o Records are stored physically in search key order (or as close to this as
possible).
o This minimizes number of block accesses.
o Figure 10.15 shows an example, with bname as the search key.
2. It is difficult to maintain physical sequential order as records are inserted and
deleted.
o Deletion can be managed with the pointer chains.
o Insertion poses problems if there is no space where the new record
should go.
o If space exists, use it; else put the new record in an overflow block.
o Adjust pointers accordingly.
o Figure 10.16 shows the previous example after an insertion.
o Problem: we now have some records out of physical sequential order.
o If very few records in overflow blocks, this will work well.
o If order is lost, reorganize the file.
o Reorganizations are expensive and done when system load is low.
3. If insertions rarely occur, we could keep the file in physically sorted order and
reorganize when insertion occurs. In this case, the pointer fields are no longer
required.
Clustering File Organization
1. One relation per file, with fixed-length records, is good for small databases,
and it also reduces the code size.
2. Many large-scale DB systems do not rely directly on the underlying operating
system for file management. One large OS file is allocated to the DB system,
and all relations are stored in that one file.
3. To efficiently execute queries involving the join depositor ⋈ customer, one
may store the depositor tuple for each cname near the customer tuple for the
corresponding cname, as shown in Figure 10.19.
4. This structure mixes together tuples from two relations, but allows for efficient
processing of the join.
5. If the customer has many accounts which cannot fit in one block, the remaining
records appear on nearby blocks. This file structure, called clustering, allows us
to read many of the required records using one block read.
6. Our use of clustering enhances the processing of a particular join but may
result in slow processing of other types of queries, such as selection on
customer.
For example, the query
select * from customer
now requires more block accesses, as our customer relation is now interspersed
with the depositor relation.
7. Thus it is a trade-off, depending on the types of query that the database
designer believes to be most frequent. Careful use of clustering may produce
significant performance gain.
Data Dictionary Storage
1. The database also needs to store information about the relations, known as the
data dictionary. This includes:
o Names of relations.
o Names of attributes of relations.
o Domains and lengths of attributes.
o Names and definitions of views.
o Integrity constraints (e.g., key constraints).
plus data on the system users:
o Names of authorized users.
o Accounting information about users.
plus (possibly) statistical and descriptive data:
o Number of tuples in each relation.
o Method of storage used for each relation (e.g., clustered or non-
clustered).
2. When we look at indices (Chapter 11), we'll also see a need to store
information about each index on each relation:
o Name of the index.
o Name of the relation being indexed.
o Attributes the index is on.
o Type of index.
3. This information is, in itself, a miniature database. We can use the database to
store data about itself, simplifying the overall structure of the system, and
allowing the full power of the database to be used to permit fast access to
system data.
4. The exact choice of how to represent system data using relations must be made
by the system designer. One possible representation follows.
System-catalog-schema = (relation-name, number-attrs)
Attr-schema = (attr-name, rel-name, domain-type, position, length)
User-schema = (user-name, encrypted-password, group)
Index-schema = (index-name, rel-name, index-type, index-attr)
View-schema = (view-name, definition)
INDEXES
1. An index for a file works like a catalogue in a library. Cards in alphabetic order
tell us where to find books by a particular author.
2. In real-world databases, indices like this might be too large to be efficient.
We'll look at more sophisticated indexing techniques.
3. There are two kinds of indices.
o Ordered indices: indices are based on a sorted ordering of the values.
o Hash indices: indices are based on the values being distributed
uniformly across a range of buckets. The bucket to which a value is
assigned is determined by a function, called a hash function.
4. We will consider several indexing techniques. No one technique is the best.
Each technique is best suited for a particular database application.
5. Methods will be evaluated on:
1. Access Types -- types of access that are supported efficiently, e.g.,
value-based search or range search.
2. Access Time -- time to find a particular data item or set of items.
3. Insertion Time -- time taken to insert a new data item (includes time to
find the right place to insert).
4. Deletion Time -- time to delete an item (includes time taken to find
item, as well as to update the index structure).
5. Space Overhead -- additional space occupied by an index structure.
We may have more than one index or hash function for a file. (The library may
have card catalogues by author, subject or title.)
The attribute or set of attributes used to look up records in a file is called the
search key (not to be confused with primary key, etc.).
Ordered Indices
1. In order to allow fast random access, an index structure may be used.
2. A file may have several indices on different search keys.
3. If the file containing the records is sequentially ordered, the index whose
search key specifies the sequential order of the file is the primary index, or
clustering index. Note: The search key of a primary index is usually the
primary key, but it is not necessarily so.
4. Indices whose search key specifies an order different from the sequential order
of the file are called the secondary indices, or nonclustering indices.
Primary Index
1. Index-sequential files: Files are ordered sequentially on some search key, and a
primary index is associated with it.
Figure 11.1: Sequential file for deposit records.
Dense and Sparse Indices
1. There are two types of ordered indices:
Dense Index:
o An index record appears for every search key value in the file.
o This record contains search key value and a pointer to the actual record.
Sparse Index:
o Index records are created only for some of the records.
o To locate a record, we find the index record with the largest search key
value less than or equal to the search key value we are looking for.
o We start at the record pointed to by the index record, and proceed along
the pointers in the file (that is, sequentially) until we find the desired
record.
2. Figures 11.2 and 11.3 show dense and sparse indices for the deposit file.
Figure 11.2: Dense index.
3. Notice how we would find records for Perryridge branch using both methods.
(Do it!)
Figure 11.3: Sparse index.
4. Dense indices are faster in general, but sparse indices require less space and
impose less maintenance for insertions and deletions. (Why?)
5. A good compromise: to have a sparse index with one entry per block.
Why is this good?
o Biggest cost is in bringing a block into main memory.
o We are guaranteed to have the correct block with this method, unless
record is on an overflow block (actually could be several blocks).
o Index size still small.
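The sparse-index lookup rule ("largest entry <= the search key, then scan that
block") is easy to sketch. The Python below is illustrative only, with one index
entry per block and invented branch-name data:

    import bisect

    index_keys = ['Brighton', 'Downtown', 'Mianus', 'Perryridge']
    blocks = {                        # first key in block -> block contents
        'Brighton':   [('Brighton', 217), ('Clearview', 117)],
        'Downtown':   [('Downtown', 101), ('Downtown', 110)],
        'Mianus':     [('Mianus', 215)],
        'Perryridge': [('Perryridge', 102), ('Perryridge', 201)],
    }

    def lookup(key):
        i = bisect.bisect_right(index_keys, key) - 1   # largest entry <= key
        return [rec for rec in blocks[index_keys[i]] if rec[0] == key]

    print(lookup('Perryridge'))   # [('Perryridge', 102), ('Perryridge', 201)]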
Multi-Level Indices
1. Even with a sparse index, index size may still grow too large. For 100,000
records, 10 per block, at one index record per block, that's 10,000 index
records! Even if we can fit 100 index records per block, this is 100 blocks.
2. If index is too large to be kept in main memory, a search results in several disk
reads.
o If there are no overflow blocks in the index, we can use binary search.
o This will read as many as ⌈log2(b)⌉ blocks, where b is the number of
index blocks (as many as 7 for our 100 blocks).
o If the index has overflow blocks, then a sequential search is typically
used, reading all b index blocks.
3. Solution: Construct a sparse index on the index (Figure 11.4).
Figure 11.4: Two-level sparse index.
4. Use binary search on outer index. Scan index block found until correct index
record found. Use index record as before - scan block pointed to for desired
record.
5. For very large files, additional levels of indexing may be required.
6. Indices must be updated at all levels when insertions or deletions require it.
7. Frequently, each level of index corresponds to a unit of physical storage (e.g.
indices at the level of track, cylinder and disk).
Index Update
Regardless of what form of index is used, every index must be updated whenever a
record is either inserted into or deleted from the file.
1. Deletion: o Find (look up) the record
o If the last record with a particular search key value, delete that search
key value from index.
o For dense indices, this is like deleting a record in a file.
o For sparse indices, delete a key value by replacing key value's entry in
index by next search key value. If that value already has an index entry,
delete the entry.
2. Insertion: o Find place to insert.
o Dense index: insert search key value if not present.
o Sparse index: no change unless new block is created. (In this case, the
first search key value appearing in the new block is inserted into the
index).
Secondary Indices
1. If the search key of a secondary index is not a candidate key, it is not enough to
point to just the first record with each search-key value because the remaining
records with the same search-key value could be anywhere in the file.
Therefore, a secondary index must contain pointers to all the records.
Figure 11.5: Sparse secondary index on cname.
2. We can use an extra-level of indirection to implement secondary indices on
search keys that are not candidate keys. A pointer does not point directly to the
file but to a bucket that contains pointers to the file.
o See Figure 11.5 on secondary key cname.
o To perform a lookup on Peterson, we must read all three records pointed
to by entries in bucket 2.
o Only one entry points to a Peterson record, but three records need to be
read.
o As file is not ordered physically by cname, this may take 3 block
accesses.
3. Secondary indices must be dense, with an index entry for every search-key
value, and a pointer to every record in the file.
4. Secondary indices improve the performance of queries on non-primary keys.
5. They also impose serious overhead on database modification: whenever a file
is updated, every index must be updated.
6. Designer must decide whether to use secondary indices or not.
B+-Tree Index Files
1. Primary disadvantage of index-sequential file organization is that performance
degrades as the file grows. This can be remedied by costly re-organizations.
2. The B+-tree file structure maintains its efficiency despite frequent insertions
and deletions. It imposes some acceptable update and space overheads.
3. A B+-tree index is a balanced tree in which every path from the root to a leaf
is of the same length.
4. Each nonleaf node in the tree must have between ⌈n/2⌉ and n children, where n
is fixed for a particular tree.
Structure of a B+-Tree
1. A B+-tree index is a multilevel index, but it is structured differently from the
multilevel index of index-sequential files.
2. A typical node (Figure 11.6) contains up to n-1 search key values K1, K2, ...,
Kn-1, and n pointers P1, P2, ..., Pn. Search key values in a node
are kept in sorted order.
Figure 11.6: Typical node of a B+-tree.
3. For leaf nodes, pointer Pi (i = 1, ..., n-1) points to either a file record with
search key value Ki, or a bucket of pointers to records with that search key
value. The bucket structure is used if the search key is not a primary key, and
the file is not sorted in search key order.
Pointer Pn (the nth pointer in the leaf node) is used to chain leaf nodes together
in linear order (search key order). This allows efficient sequential processing of
the file.
The ranges of values in the leaves do not overlap.
4. Non-leaf nodes form a multilevel index on the leaf nodes.
A non-leaf node may hold up to n pointers and must hold at least ⌈n/2⌉
pointers. The number of pointers in a node is called the fan-out of the node.
Consider a node containing m pointers. Pointer Pi (2 <= i <= m-1) points to a
subtree containing search key values >= Ki-1 and < Ki. Pointer Pm points to a
subtree containing search key values >= Km-1. Pointer P1 points to a subtree
containing search key values < K1.
5. Figures 11.7 (textbook Fig. 11.8) and textbook Fig. 11.9 show B+-trees for the
deposit file with n=3 and n=5.
Figure 11.7: B+-tree for deposit file with n = 3.
Queries on B+-Trees
1. Suppose we want to find all records with a search key value of k.
o Examine the root node and find the smallest search key value greater
than k; call it Ki.
o Follow pointer Pi to another node.
o If k is greater than or equal to the largest search key value Km-1 in the
node, follow pointer Pm instead.
o Continue down through non-leaf nodes, looking for the smallest search
key value > k and following the corresponding pointer.
o Eventually we arrive at a leaf node, where the pointer associated with k
will point to the desired record or bucket.
2. In processing a query, we traverse a path from the root to a leaf node. If there
are K search key values in the file, this path is no longer than ⌈log⌈n/2⌉(K)⌉.
This means that the path is not long, even in large files. For a 4k byte disk
block with a search-key size of 12 bytes and a disk pointer of 8 bytes, n is
around 200. If n = 100, a look-up of 1 million search-key values requires at
most ⌈log50(1,000,000)⌉ = 4 nodes to be accessed. Since the root is usually in
the buffer, typically only 3 or fewer disk reads are needed (a small lookup
sketch follows).
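The lookup procedure above can be sketched in a few lines of Python (illustrative
only; the node layout and the "keys >= separator go right" convention are
assumptions consistent with the description above):

    import bisect

    class Node:
        def __init__(self, keys, children=None, values=None):
            self.keys = keys            # sorted search key values
            self.children = children    # internal node: list of subtrees
            self.values = values        # leaf node: records or buckets

    def find(node, key):
        while node.children is not None:    # descend to the leaf level
            node = node.children[bisect.bisect_right(node.keys, key)]
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None                         # search key not in the file

    leaf1 = Node(['Brighton', 'Downtown'], values=[217, 101])
    leaf2 = Node(['Mianus', 'Perryridge'], values=[215, 102])
    root = Node(['Mianus'], children=[leaf1, leaf2])
    print(find(root, 'Perryridge'))   # 102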
Updates on B+-Trees
1. Insertions and Deletions:
Insertion and deletion are more complicated, as they may require splitting or
combining nodes to keep the tree balanced. If splitting or combining are not
required, insertion works as follows:
o Find leaf node where search key value should appear.
o If value is present, add new record to the bucket.
o If value is not present, insert value in leaf node (so that search keys are
still in order).
o Create a new bucket and insert the new record.
If splitting or combining are not required, deletion works as follows:
o Deletion: Find record to be deleted, and remove it from the bucket.
o If bucket is now empty, remove search key value from leaf node.
2. Insertions Causing Splitting:
When insertion causes a leaf node to be too large, we split that node. In Figure
11.8, assume we wish to insert a record with a bname value of ``Clearview''.
o There is no room for it in the leaf node where it should appear.
o We now have n values (the n-1 search key values plus the new one we
wish to insert).
o We put the first ⌈n/2⌉ values in the existing node, and the remainder into
a new node.
o Figure 11.10 shows the result.
o The new node must be inserted into the B+-tree.
o We also need to update search key values for the parent (or higher)
nodes of the split leaf node. (Except if the new node is the leftmost one)
o Order must be preserved among the search key values in each node.
o If the parent was already full, it will have to be split.
o When a non-leaf node is split, the children are divided among the two
new nodes.
o In the worst case, splits may be required all the way up to the root. (If
the root is split, the tree becomes one level deeper.)
o Note: when we start a B+-tree, we begin with a single node that is both
the root and a single leaf. When it gets full and another insertion occurs,
we split it into two leaf nodes, requiring a new root.
3. Deletions Causing Combining:
Deleting records may cause tree nodes to contain too few pointers. Then we
must combine nodes.
o If we wish to delete ``Downtown'' from the B+-tree of Figure 11.11,
this occurs.
o In this case, the leaf node is empty and must be deleted.
o If we wish to delete ``Perryridge'' from the B+-tree of Figure 11.11, the
parent is left with only one pointer, and must be coalesced with a sibling
node.
o Sometimes higher-level nodes must also be coalesced.
o If the root becomes empty as a result, the tree is one level less deep
(Figure 11.13).
o Sometimes the pointers must be redistributed to keep the tree balanced.
o Deleting ``Perryridge'' from Figure 11.11 produces Figure 11.14.
4. To summarize:
o Insertion and deletion are complicated, but require relatively few
operations.
o Number of operations required for insertion and deletion is proportional
to logarithm of number of search keys.
o B+-trees are fast index structures for databases.
B+-Tree File Organization
1. The B+-tree structure is used not only as an index but also as an organizer for
records in a file.
2. In a B+-tree file organization, the leaf nodes of the tree store records instead of
storing pointers to records, as shown in Fig. 11.17.
3. Since records are usually larger than pointers, the maximum number of records
that can be stored in a leaf node is less than the maximum number of pointers
in a nonleaf node.
4. However, the leaf nodes are still required to be at least half full.
5. Insertion and deletion from a B+-tree file organization are handled in the same
way as in a B+-tree index.
6. When a B+-tree is used for file organization, space utilization is particularly
important. We can improve the space utilization by involving more sibling
nodes in redistribution during splits and merges.
7. In general, if m nodes are involved in redistribution, each node can be
guaranteed to contain at least ⌊(m-1)n/m⌋ entries. However, the cost of an
update becomes higher as more siblings are involved in the redistribution.
B-Tree Index Files
1. B-tree indices are similar to B+-tree indices.
o Difference is that B-tree eliminates the redundant storage of search key
values.
o In the B+-tree of Figure 11.11, some search key values appear twice.
o A corresponding B-tree of Figure 11.18 allows search key values to
appear only once.
o Thus we can store the index in less space.
Figure 11.8: Leaf and nonleaf node of a B-tree.
2. Advantages: o Lack of redundant storage (but only marginally different).
o Some searches are faster (key may be in non-leaf node).
3. Disadvantages: o Leaf and non-leaf nodes are of different size (complicates storage)
o Deletion may occur in a non-leaf node (more complicated)
Generally, the structural simplicity of the B+-tree is preferred.
Static Hashing
1. Index schemes force us to traverse an index structure. Hashing avoids this.
Hash File Organization
1. Hashing involves computing the address of a data item by computing a
function on the search key value.
2. A hash function h is a function from the set of all search key values K to the
set of all bucket addresses B.
o We choose a number of buckets to correspond to the number of search
key values we will have stored in the database.
o To perform a lookup on a search key value Ki, we compute h(Ki), and
search the bucket with that address.
o If two search keys i and j map to the same address, because
h(Ki) = h(Kj), then the bucket at the address obtained will contain
records with both search key values.
o In this case we will have to check the search key value of every record
in the bucket to get the ones we want.
o Insertion and deletion are simple.
Hash Functions
1. A good hash function gives an average-case lookup that is a small constant,
independent of the number of search keys.
2. We hope records are distributed uniformly among the buckets.
3. The worst hash function maps all keys to the same bucket.
4. The best hash function maps all keys to distinct addresses.
5. Ideally, distribution of keys to addresses is uniform and random.
6. Suppose we have 26 buckets, and map names beginning with the ith letter of
the alphabet to the ith bucket.
o Problem: this does not give uniform distribution.
o Many more names will be mapped to ``A'' than to ``X''.
o Typical hash functions perform some operation on the internal binary
machine representations of characters in a key.
o For example, compute the sum, modulo # of buckets, of the binary
representations of characters of the search key.
o See Figure 11.18, using this method for 10 buckets (assuming the ith
character in the alphabet is represented by integer i).
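That character-sum hash is small enough to state directly. The Python sketch below
uses machine character codes via ord() rather than alphabet positions, which is the
same idea in a slightly different encoding:

    def h(key, buckets=10):
        # sum of the character codes of the key, modulo the bucket count
        return sum(ord(c) for c in key) % buckets

    for name in ['Perryridge', 'Round Hill', 'Brighton']:
        print(name, '-> bucket', h(name))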
Handling of bucket overflows
1. Open hashing occurs where records are stored in different buckets: compute
the hash function and search the corresponding bucket to find a record.
2. Closed hashing occurs where all records are stored in one bucket: the hash
function computes addresses within that bucket. (Deletions are difficult.) Not
used much in database applications.
3. Drawback to our approach: Hash function must be chosen at implementation
time.
o Number of buckets is fixed, but the database may grow.
o If number is too large, we waste space.
o If number is too small, we get too many ``collisions'', resulting in
records of many search key values being in the same bucket.
o Choosing the number to be twice the number of search key values in the
file gives a good space/performance tradeoff.
Hash Indices
1. A hash index organizes the search keys with their associated pointers into a
hash file structure.
2. We apply a hash function on a search key to identify a bucket, and store the
key and its associated pointers in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are only secondary index structures, since if a
file itself is organized using hashing, there is no need for a separate hash index
structure on it.
Dynamic Hashing
1. As the database grows over time, we have three options:
o Choose hash function based on current file size. Get performance
degradation as file grows.
o Choose hash function based on anticipated file size. Space is wasted
initially.
o Periodically re-organize hash structure as file grows. Requires selecting
new hash function, recomputing all addresses and generating new
bucket assignments. Costly, and shuts down database.
2. Some hashing techniques allow the hash function to be modified dynamically
to accommodate the growth or shrinking of the database. These are called
dynamic hash functions.
o Extendable hashing is one form of dynamic hashing.
o Extendable hashing splits and coalesces buckets as database size
changes.
o This imposes some performance overhead, but space efficiency is
maintained.
o As reorganization is on one bucket at a time, overhead is acceptably
low.
3. How does it work?
Figure 11.9: General extendable hash structure.
o We choose a hash function that is uniform and random and that generates
values over a relatively large range.
o The range is b-bit binary integers (typically b = 32).
o 2^32 is over 4 billion, so we don't generate that many buckets!
o Instead we create buckets on demand, and do not use all b bits of the
hash initially.
o At any point we use i bits, where 0 <= i <= b.
o The i bits are used as an offset into a table of bucket addresses.
o Value of i grows and shrinks with the database.
o Figure 11.19 shows an extendable hash structure.
o Note that the i appearing over the bucket address table tells how many
bits are required to determine the correct bucket.
o It may be the case that several entries point to the same bucket.
o All such entries will have a common hash prefix, but the length of this
prefix may be less than i.
o So we give each bucket an integer giving the length of the common hash
prefix.
o This is shown in Figure 11.9 (textbook 11.19) as i_j.
o The number of bucket address table entries pointing to bucket j is then
2^(i - i_j).
4. To find the bucket containing search key value K_l:
o Compute h(K_l).
o Take the first i high-order bits of h(K_l).
o Look at the corresponding table entry for this i-bit string.
o Follow the bucket pointer in the table entry.
5. We now look at insertions in an extendable hashing scheme.
o Follow the same procedure for lookup, ending up in some bucket j.
o If there is room in the bucket, insert information and insert record in the
file.
o If the bucket is full, we must split the bucket, and redistribute the
records.
o If bucket is split we may need to increase the number of bits we use in
the hash.
6. Two cases exist:
1. If i = i_j, then only one entry in the bucket address table points to bucket j.
o Then we need to increase the size of the bucket address table so that we
can include pointers to the two buckets that result from splitting bucket
j.
o We increment i by one, thus considering more of the hash, and doubling
the size of the bucket address table.
o Each entry is replaced by two entries, each containing the original value.
o Now two entries in bucket address table point to bucket j.
o We allocate a new bucket z, and set the second pointer to point to z.
o Set i_j and i_z to i.
o Rehash all records in bucket j which are put in either j or z.
o Now insert new record.
o It is remotely possible, but unlikely, that the new hash will still put all of
the records in one bucket.
o If so, split again and increment i again.
2. If i > i_j, then more than one entry in the bucket address table points to
bucket j.
o Then we can split bucket j without increasing the size of the bucket
address table (why?).
o Note that all entries that point to bucket j correspond to hash prefixes
that have the same value on the leftmost i_j bits.
o We allocate a new bucket z, and set i_j and i_z to the original i_j value
plus 1.
o Now adjust entries in the bucket address table that previously pointed to
bucket j.
o Leave the first half pointing to bucket j, and make the rest point to
bucket z.
o Rehash each record in bucket j as before.
o Reattempt new insert.
7. Note that in both cases we only need to rehash records in bucket j.
8. Deletion of records is similar. Buckets may have to be coalesced, and bucket
address table may have to be halved.
9. Insertion is illustrated for the example deposit file of Figure 11.20.
o 32-bit hash values on bname are shown in Figure 11.21.
o An initial empty hash structure is shown in Figure 11.22.
o We insert records one by one.
o We (unrealistically) assume that a bucket can only hold 2 records, in
order to illustrate both situations described.
o As we insert the Perryridge and Round Hill records, this first bucket
becomes full.
o When we insert the next record (Downtown), we must split the bucket.
o Since i = i_j, we need to increase the number of bits we use from the
hash.
o We now use 1 bit, allowing us 2^1 = 2 buckets.
o This makes us double the size of the bucket address table to two entries.
o We split the bucket, placing the records whose search key hash begins
with 1 in the new bucket, and those with a 0 in the old bucket (Figure
11.23).
o Next we attempt to insert the Redwood record, and find it hashes to 1.
o That bucket is full, and i = i_j.
o So we must split that bucket, increasing the number of bits we must use
to 2.
o This necessitates doubling the bucket address table again to four entries
(Figure 11.24).
o We rehash the entries in the old bucket.
o We continue on for the deposit records of Figure 11.20, obtaining the
extendable hash structure of Figure 11.25.
10. Advantages:
o Extendable hashing provides performance that does not degrade as the
file grows.
o Minimal space overhead - no buckets need be reserved for future use.
The bucket address table only contains one pointer for each hash value
of the current prefix length.
11. Disadvantages:
o Extra level of indirection in the bucket address table.
o Added complexity.
12. Summary: A highly attractive technique, provided we accept added
complexity.
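The following compact Python sketch puts the lookup procedure and the two split cases together; the class and variable names (ExtendableHash, ij for the local depth i_j) are illustrative, and the bucket capacity of 2 mirrors the unrealistic assumption of the worked example.

import hashlib

B = 32   # b: number of bits the hash function generates

def h(key):
    # Uniform, random-looking b-bit hash of the search key.
    d = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(d[:4], 'big')

class Bucket:
    def __init__(self, depth):
        self.ij = depth        # local depth: length of common hash prefix (i_j)
        self.records = {}      # search key -> record

class ExtendableHash:
    def __init__(self, capacity=2):
        self.i = 0                  # global depth: bits of the hash in use
        self.capacity = capacity
        self.table = [Bucket(0)]    # bucket address table, 2^i entries

    def _index(self, key):
        return h(key) >> (B - self.i) if self.i else 0

    def lookup(self, key):
        return self.table[self._index(key)].records.get(key)

    def insert(self, key, record):
        j = self.table[self._index(key)]
        if key in j.records or len(j.records) < self.capacity:
            j.records[key] = record
            return
        if j.ij == self.i:
            # Case i = i_j: double the bucket address table.
            self.table = [b for b in self.table for _ in (0, 1)]
            self.i += 1
        # Now i > i_j: split bucket j without further table growth.
        j.ij += 1
        z = Bucket(j.ij)
        for idx in range(len(self.table)):
            # Redirect the upper half of the entries pointing at j.
            if self.table[idx] is j and (idx >> (self.i - j.ij)) & 1:
                self.table[idx] = z
        old, j.records = j.records, {}
        for k, r in old.items():       # rehash only the records of bucket j
            self.table[self._index(k)].records[k] = r
        self.insert(key, record)       # reattempt the insert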
Comparison of Indexing and Hashing
1. To make a wise choice between the methods seen, the database designer must
consider the following issues:
o Is the cost of periodic re-organization of the index or hash structure
acceptable?
o What is the relative frequency of insertion and deletion?
o Is it desirable to optimize average access time at the expense of
increasing worst-case access time?
o What types of queries are users likely to pose?
2. The last issue is critical to the choice between indexing and hashing. If most
queries are of the form

select A1, A2, ..., An
from r
where A_i = c

then to process this query the system will perform a lookup on an index or hash
structure for attribute A_i with value c.
8. For these sorts of queries a hashing scheme is preferable.
o Index lookup takes time proportional to the log of the number of values in R
for A_i.
o Hash structure provides lookup average time that is a small constant
(independent of database size).
9. However, the worst case favors indexing:
o Hash worst case gives time proportional to the number of values in R
for A_i.
o Index worst case is still the log of the number of values in R.
10. Index methods are preferable where a range of values is specified in the query,
e.g.

select A1, A2, ..., An
from r
where A_i >= c1 and A_i <= c2

This query finds records with A_i values in the range from c1 to c2.
o Using an index structure, we can find the bucket for value c1, and then
follow the pointer chain to read the next buckets in alphabetic (or
numeric) order until we find c2.
o If we have a hash structure instead of an index, we can find a bucket for
c1 easily, but it is not easy to find the ``next bucket''.
o A good hash function assigns values randomly to buckets.
o Also, each bucket may be assigned many search key values, so we
cannot chain them together.
o To support range queries using a hash structure, we need a hash function
that preserves order.
o For example, if K1 and K2 are search key values and K1 < K2, then
h(K1) < h(K2).
o Such a function would ensure that buckets are in key order.
o Order-preserving hash functions that also provide randomness and
uniformity are extremely difficult to find.
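A small sketch of why order matters for range queries: the sorted list plus bisect stands in for the leaf level of an ordered index, while a plain dict would stand in for a hash structure; the data and names are illustrative.

import bisect

keys = sorted(['Brighton', 'Downtown', 'Mianus', 'Perryridge', 'Redwood'])

def range_lookup(lo, hi):
    # Find all keys with lo <= key <= hi by scanning in key order.
    start = bisect.bisect_left(keys, lo)
    out = []
    for k in keys[start:]:
        if k > hi:
            break
        out.append(k)
    return out

print(range_lookup('Downtown', 'Perryridge'))
# A dict (hash structure) finds 'Downtown' in one step, but it has no
# notion of the ``next'' key, so the ordered scan above is impossible
# without an order-preserving hash function.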
Index Definition in SQL
1. Some SQL implementations include data definition commands to create and
drop indices. The IBM SAA-SQL commands are as follows.
o An index is created by

create index <index-name>
on r (<attribute-list>)

o The attribute list is the list of attributes in relation r that form the search
key for the index.
o To create an index on bname for the branch relation:

create index b-index
on branch (bname)

o If the search key is a candidate key, we add the word unique to the
definition:

create unique index b-index
on branch (bname)

o If bname is not a candidate key, an error message will appear.
o If the index creation succeeds, any attempt to insert a tuple violating this
requirement will fail.
o The unique keyword is redundant if primary keys have been defined
with integrity constraints already.
To remove an index, the command is

drop index <index-name>
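These statements can be tried in SQLite, whose create/drop index syntax is essentially the same; the snippet below is a stand-in for, not a reproduction of, the IBM SAA-SQL dialect, and the index is renamed b_index because SQLite does not accept a bare hyphen in identifiers.

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table branch (bname text, bcity text, assets int)')
conn.execute('create unique index b_index on branch (bname)')

conn.execute("insert into branch values ('Perryridge', 'Horseneck', 1700000)")
try:
    # Violates the uniqueness requirement enforced by the index.
    conn.execute("insert into branch values ('Perryridge', 'Brooklyn', 0)")
except sqlite3.IntegrityError as e:
    print('rejected:', e)

conn.execute('drop index b_index')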
Multiple-Key Access
1. For some queries, it is advantageous to use multiple indices if they exist.
2. If there are two indices on deposit, one on bname and one on balance, then
suppose we have a query like
select balance
from deposit
where bname = ``Perryridge'' and balance = 1000
3. There are 3 possible strategies to process this query:
o Use the index on bname to find all records pertaining to Perryridge
branch. Examine them to see if balance = 1000
o Use the index on balance to find all records with balance = 1000.
Examine them to see if bname = ``Perryridge''.
o Use index on bname to find pointers to records pertaining to Perryridge
branch. Use index on balance to find pointers to records pertaining to
1000. Take the intersection of these two sets of pointers.
4. The third strategy takes advantage of the existence of multiple indices (a
rough sketch of this pointer-intersection approach follows this list). This
may still not work well if
o There are a large number of Perryridge records AND
o There are a large number of 1000 records AND
o Only a small number of records pertain to both Perryridge and 1000.
5. To speed up multiple search key queries special structures can be maintained.
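A rough sketch of the third strategy, with each index modelled as a dict from a key value to a set of record pointers (row ids); the data is invented for illustration.

bname_index = {'Perryridge': {1, 4, 7, 9}, 'Downtown': {2, 3}}
balance_index = {1000: {4, 5, 9}, 500: {2}}

def lookup(bname, balance):
    # Records satisfying bname = ... and balance = ...: intersect the
    # two pointer sets, then fetch only those records.
    return bname_index.get(bname, set()) & balance_index.get(balance, set())

print(lookup('Perryridge', 1000))   # {4, 9}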
Grid File
1. A grid structure for queries on two search keys is a 2-dimensional grid, or
array, indexed by values for the search keys. Figure 11.10 (textbook 11.31)
shows part of a grid structure for the deposit file.
Figure 11.10: Grid structure for deposit file.
2. A particular entry in the array contains pointers to all records with the specified
search key values.
o No special computations need to be done
o Only the right records are accessed
o Can also be used for single search key queries (one column or row)
o Easy to extend to queries on n search keys - construct an n-dimensional
array.
o Significant improvement in processing time for multiple-key queries.
o Imposes space overhead.
o Performance overhead on insertion and deletion.
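A toy sketch of the idea, assuming exact-match cells rather than the value ranges a real grid file partitions; each cell of the two-dimensional structure holds pointers to the matching records.

from collections import defaultdict

grid = defaultdict(set)   # (bname, cname) -> set of record pointers

def insert(bname, cname, rid):
    grid[(bname, cname)].add(rid)

def lookup_both(bname, cname):
    return grid[(bname, cname)]        # multiple-key query: one cell

def lookup_bname(bname):
    # Single-key query: scan one ``row'' of the grid (all cname cells).
    out = set()
    for (b, _), ptrs in grid.items():
        if b == bname:
            out |= ptrs
    return out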
Partitioned Hashing
1. An alternative approach to multiple-key queries:
o To construct a structure for queries on deposit involving bname and
cname, we construct a hash structure for the key (cname, bname).
o We split this hash function into two parts, one for each part of the key.
o The first part depends only on the cname value.
o The second part depends only on the bname value.
o Figure 11.32 shows a sample partitioned hash function.
o Note that pairs with the same cname or bname value will have 3 bits the
same in the appropriate position.
o To find the balance in all of Williams' accounts at the Perryridge branch,
we compute h(Williams, Perryridge) and access the hash structure.
2. The same hash structure can be used to answer a query on one of the search
keys:
o Compute part of partitioned hash.
o Access hash structure and scan buckets for which that part of the hash
coincides.
o Text doesn't say so, but the hash structure must have some grid-like
form imposed on it to enable searching the structure based on only some
part of the hash.
3. Partitioned hashing can also be extended to n-key search.
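A sketch of a partitioned hash on (cname, bname), assuming 3 bits per part as in the figure; the component hash h3 is invented for illustration.

def h3(s):
    # 3-bit component hash of one part of the key (illustrative).
    return sum(map(ord, s)) % 8

def h(cname, bname):
    # Concatenate the two parts: cname bits first, then bname bits.
    return (h3(cname) << 3) | h3(bname)

buckets = {}   # 6-bit bucket address -> list of records

def insert(cname, bname, record):
    buckets.setdefault(h(cname, bname), []).append(record)

def lookup_cname(cname):
    # Query on cname alone: scan every bucket whose high 3 bits match.
    part = h3(cname)
    return [r for addr, recs in buckets.items() if addr >> 3 == part
            for r in recs]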
New DB Applications
1. Traditional applications.
1. Four generations of traditional DB systems: file system, hierarchical,
CODASYL, and relational. All are designed for business applications:
inventory, employee, university, bank, library, air-line reservation
systems, etc.
2. Common features of `traditional' applications:
1. Uniformity: large number of similarly structured data items, all of
which have the same size,
2. Record orientation: the basic data items consist of fixed-length
records,
3. Small data items: each record is short,
4. Atomic fields: fields within a record are short and of fixed length.
There is no structure within fields. The 1st normal form holds.
5. Short transactions: within fractions of a second. There is no
human interaction with a transaction during its execution.
6. Static conceptual schemes: The database scheme is changed
infrequently. Only simple changes are allowed, e.g., in relational
systems: create relation, remove relation, add/remove attributes
to/from a relation scheme.
2. New applications
1. Engineering databases (CAx): computer-aided design (CAD),
computer-aided manufacturing (CAM), computer-aided engineering (CAE),
and computer-integrated manufacturing (CIM).
Tasks: A CAD database stores data pertaining to an engineering design,
including the components of the items being designed, the
inter-relationships of components, and old versions of designs.
2. Computer-aided software engineering (CASE): A CASE database stores
data required to assist software developers, including source code,
dependencies among software modules, definitions and uses of
variables, and the development history of the software system.
3. Multimedia databases: A multimedia database contains spatial data,
audio/video data, and the like. DBs of this sort arise from geophysical
data, voice mail systems and graphics applications.
4. Office Information Systems (OIS): Office automation includes
workstation-based tools for document creation and retrieval, tools for
maintaining appointment calendars, and so on. An OIS DB must allow
queries pertaining to schedules, documents, and contents of documents.
5. Hypertext databases: Hypertext is text enriched with links that point to
other documents, e.g., WWW. Hypertext documents may also be
structured in specific ways that help index them. Hypertext database
must support the ability to retrieve documents based on links, and to
query documents based on their structure.
3. Expected Features for New Applications
1. Complex objects: A complex object is an item that is viewed as a single
object in the real world, but that contains other objects (with an arbitrary
complex internal structure). Often objects are stored hierarchically,
representing the containment relationship. This leads to object-oriented
DBs and nested relational DBs.
2. Behavioral data: Distinct objects may need to respond in different ways
to the same command. For example, the deletion of certain tuples may
require the deletion of other tuples, as in the case of weak entities. In CAD and
CASE applications the behavior of different objects in response to a
given command may be widely different. This behavioral information
can be captured by storing executable code with objects in the database.
This capability is provided by the methods of OODBs and by the rule
base of KB systems.
3. Meta knowledge: General rules about the application rather than specific
tuples (i.e., data about data) form an important part of expert databases.
4. Long-duration transactions: CAD and CASE applications involve human
interaction with the data: ``what-if'' modifications that the user may wish
to undo, and concurrent designer efforts that may lead to conflicts among
transactions. Important concepts: nested transactions and correct,
nonserializable executions.
The Object-Oriented Data Model
1. A data model is a logical organization of the real-world objects (entities),
constraints on them, and the relationships among objects. A DB language is a
concrete syntax for a data model. A DB system implements a data model.
2. A core object-oriented data model consists of the following basic object-
oriented concepts:
(1) object and object identifier: Any real world entity is uniformly modeled
as an object (associated with a unique id: used to pinpoint an object to retrieve).
(2) attributes and methods: every object has a state (the set of values for the
attributes of the object) and a behavior (the set of methods - program code -
which operate on the state of the object). The state and behavior encapsulated
in an object are accessed or invoked from outside the object only through
explicit message passing.
[ An attribute is an instance variable, whose domain may be any class: user-
defined or primitive. A class composition hierarchy (aggregation relationship)
is orthogonal to the concept of a class hierarchy. The link in a class
composition hierarchy may form cycles. ]
(3) class: a means of grouping all the objects which share the same set of
attributes and methods. An object must belong to only one class as an instance
of that class (instance-of relationship). A class is similar to an abstract data
type. A class may also be primitive (no attributes), e.g., integer, string,
Boolean.
(4) Class hierarchy and inheritance: derive a new class (subclass) from an
existing class (superclass). The subclass inherits all the attributes and methods
of the existing class and may have additional attributes and methods. single
inheritance (class hierarchy) vs. multiple inheritance (class lattice).
Object Structure
1. The object-oriented paradigm is based on encapsulating code and data into a
single unit. Conceptually, all interactions between an object and the rest of the
system are via messages. Thus, the interface between an object and the rest of
the system is defined by a set of allowed messages.
2. In general, an object has associated with it:
o A set of variables that contain the data for the object. The value of each
variable is itself an object.
o A set of messages to which the object responds.
o A set of methods, each of which is a body of code to implement each
message; a method returns a value as the response to the message.
3. Motivation for using messages and methods:
All employee objects respond to the annual-salary message, but with different
computations for managers, tellers, etc. By encapsulating within the employee
object itself the information about how to compute the annual salary, all
employee objects present the same interface.
Since the only external interface presented by an object is the set of messages
to which it responds, it is possible to (i) modify the definition of methods and
variables without affecting the rest of the system, and (ii) replace a variable
with the method that computes a value, e.g., age from birth_date.
The ability to modify the definition of an object without affecting the rest of the
system is considered to be one of the major advantages of the OO
programming paradigm.
4. Methods of an object may be classified as either read-only or update. Messages
can also be classified as read-only or update. Derived attributes of an entity in
the ER model can be expressed as read-only messages.
Object Classes
1. Usually, there are many similar objects in a DB. By ``similar'', it means that
they respond to the same messages, use the same methods, and have variables
of the same name and type. We group similar objects to form a class. Each
such object is called an instance of its class. E.g., in a bank DB, customers,
accounts and loans are classes.
2. The definition of the class employee, written in pseudo-code, is shown below;
a Python rendering follows it. The definition shows the variables and the
messages to which the objects of the class respond, but not the methods that
handle the messages.

class employee {
/* Variables */
string name;
string address;
date start-date;
int salary;
/* Messages */
int annual-salary();
string get-name();
string get-address();
int set-address(string new-address);
int employment-length();
};
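A hedged Python rendering of the same class (attribute and method names adapted to Python conventions; employment_length is computed from start_date rather than stored, illustrating a derived value):

from datetime import date

class Employee:
    def __init__(self, name, address, start_date, salary):
        self.name = name
        self.address = address
        self.start_date = start_date
        self.salary = salary

    def annual_salary(self):
        return self.salary

    def get_name(self):
        return self.name

    def get_address(self):
        return self.address

    def set_address(self, new_address):
        self.address = new_address

    def employment_length(self):
        # Derived by a method rather than stored as a variable.
        return (date.today() - self.start_date).days

e = Employee('Jones', 'Main St', date(2000, 1, 1), 40000)
print(e.get_name(), e.employment_length())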
3. A class (i) captures the instance-of relationship, (ii) is the basis on which a query
may be formulated, (iii) enhances the integrity of OO systems by introducing
type checking, and (iv) reduces replication of names and integrity-related
specifications among objects in the same class.
4. The concept of classes is similar to the concept of abstract data types. There are
several additional aspects to the class concept beyond those of ADTs. To
represent these properties, we treat each class as itself being an object.
Metaclass: the class of a class. Most OODB systems do not support the strict
notion of metaclass. In ORION, CLASS is the root of the class hierarchy (the
metaclass of all other classes). A class object includes
o a set-valued variable whose value is the set of all objects that are
instances of the class,
o implementation of a method for the message new, which creates a new
instance of the class.
Inheritance
1. An object-oriented database schema typically requires a large number of
classes. Often, however, several classes are similar. For example, bank
employees are similar to customers.
2. In order to allow the direct representation of similarities among classes, we
need to place classes in a specialization hierarchy. E.g., Fig. 8.1 is a
specialization hierarchy for the ER model.
Figure 8.1: Specialization hierarchy for the banking example
The concept of a class hierarchy is similar to that of specialization in the ER
model. The corresponding class hierarchy is shown in Fig. 8.2.
Figure 8.2: Class hierarchy corresponding to the banking example
The class hierarchy can be defined in pseudo-code in Fig. 8.3, in which the
variables associated with each class are as follows. For brevity, we do not
present the methods associated with these classes.
Figure 8.3: Definition of class hierarchy in pseudo-code
3. The keyword isa is used to indicate that a class is a specialization of another
class. The specializations of a class are called subclasses. E.g., employee is a
subclass of person; teller is a subclass of employee. Conversely, employee is a
superclass of teller.
4. Class hierarchy and inheritance of properties from more general classes. E.g.,
an object representing an officer contains all the variables of classes officer,
employee, and person. Methods are inherited in a manner identical to
inheritance of variables.
5. An important benefit of inheritance in OO systems is the notion of
substitutability: any method of a class A can equally well be invoked with
an object belonging to any subclass B of A. This characteristic leads to code
reuse: methods and functions in class A (such as get-name() in class person) do
not have to be rewritten for objects of class B. A sketch follows this list.
6. Two plausible ways of associating objects with nonleaf classes:
o associate with the employee class all employee objects including those
that are instances of officer, teller, and secretary.
o associate with the employee class only those employee objects that are
instances of neither officer, nor teller, nor secretary.
Typically, the latter choice is made in OO systems. It is possible to determine
the set of all employee objects in this case by taking the union of those objects
associated with all classes in the subtree rooted at employee.
7. Most OO systems allow specialization to be partial, i.e., they allow objects that
belong to a class such as employee that do not belong to any of that class's
subclasses.
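A minimal sketch of substitutability and code reuse, using invented classes and an illustrative 10% officer bonus:

class Person:
    def __init__(self, name):
        self.name = name

    def get_name(self):        # written once, reused by all subclasses
        return self.name

class Employee(Person):
    def __init__(self, name, salary):
        super().__init__(name)
        self.salary = salary

    def annual_salary(self):
        return self.salary

class Officer(Employee):
    def annual_salary(self):          # specialized behaviour
        return self.salary * 1.10    # illustrative 10% bonus

def greet(p):
    # Accepts a Person; works equally well for any subclass of Person.
    print('Hello,', p.get_name())

greet(Officer('Jones', 50000))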
Multiple Inheritance
1. In most cases, a tree-structured organization of classes is adequate to describe
applications; all superclasses of a class are then ancestors or descendants of
one another in the hierarchy. However, there are situations that cannot
be represented well in a tree-structured class hierarchy.
2. Example. We could create subclasses: part-time-teller, full-time-teller, etc., as
shown in Fig. 8.4. But problems: (1) redundancy leads to potential
inconsistency on updates; and (2) the hierarchy cannot represent full/part-time
employees who are neither secretaries nor tellers.
Figure 8.4: Class hierarchy for full- and part-time employees.
3. Multiple inheritance: the ability of a class to inherit variables and methods
from multiple superclasses.
4. The class/subclass relationship is represented by a rooted directed acyclic graph
(DAG) in which a class may have more than one superclass.
Figure 8.5: Class DAG for the banking example.
5. Handling name conflicts: When multiple inheritance is used, there is potential
ambiguity if the same variable or method can be inherited from more than one
superclass.
6. Example. In our banking example, we may define a variable pay for each full-
time, part-time, teller and secretary as follows:
o full-time: pay is an integer from 0 to 100,000 containing annual salary.
o part-time: pay is an integer from 0 to 20 containing an hourly rate of
pay.
o teller: pay is an integer from 0 to 20,000 containing the annual salary.
o secretary: pay is an integer from 0 to 25,000 containing the annual
salary.
7. For part-time-secretary, it could inherit the definition of pay from either part-
time or secretary. We have the following options:
o Include both variables, renaming them to part-time-pay and secretary-
pay.
o Choose one or the other based on the order of creation.
o Force the user to make a choice at the time of class definition.
o Treat the situation as an error.
No single solution has been accepted as best, and different systems make
different choices.
8. Not all cases of multiple inheritance lead to ambiguity. If, instead of defining
pay, we retain the definition of variable salary in class employee, and define it
nowhere else, then all the subclasses inherit salary from employee (no
ambiguity).
9. We can use multiple inheritance to model the concept of roles. For example,
for subclasses, student, teacher and footballPlayer, an object can belong to
several categories at once and each of these categories is called a role. We can
use multiple inheritance to create subclasses, such as student-teacher, student-
footballPlayer, and so on to model the possibility of an object simultaneously
having multiple roles.
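The pay conflict can be reproduced in Python, which resolves it by the method resolution order, i.e., the ``choose one or the other based on order'' option above; the classes and values are illustrative.

class PartTime:
    pay = 15        # hourly rate, 0..20

class Secretary:
    pay = 25000     # annual salary, 0..25,000

class PartTimeSecretary(PartTime, Secretary):
    pass

# Python looks up pay along the method resolution order and finds
# PartTime's definition first, because it is listed first.
print(PartTimeSecretary.pay)   # 15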
Object Identity
1. Object identity: An object retains its identity even if some or all of the values
of variables or definitions of methods change over time.
This concept of object identity is necessary in applications but does not apply to
tuples of a relational database.
2. Object identity is a stronger notion of identity than typically found in
programming languages or in data models not based on object orientation.
3. Several forms of identity:
o value: A data value is used for identity (e.g., the primary key of a tuple
in a relational database).
o name: A user-supplied name is used for identity (e.g., file name in a file
system).
o built-in: A notion of identity is built into the data model or
programming languages, and no user-supplied identifier is required
(e.g., in OO systems).
4. Object identity is typically implemented via a unique, system-generated OID.
The value of the OID is not visible to the external user, but is used internally by
the system to identify each object uniquely and to create and manage inter-
object references.
5. There are many situations where having the system generate identifiers
automatically is a benefit, since it frees humans from performing that task.
However, this ability should be used with care. System-generated identifiers
are usually specific to the system, and have to be translated if data are moved to
a different database system. System-generated identifiers may be redundant if
the entities being modeled already have unique identifiers external to the
system, e.g., SIN#.
Object Containment
1. Objects that contain other objects are called complex or composite objects.
There can be multiple levels of containment, forming a containment hierarchy
among objects.
2. Example: A bicycle design database:
Figure 8.6: Containment hierarchy for bicycle-design database.
Fig. 8.6 shows the containment relationship in a schematic way by listing class
names. Thus the links between classes must be interpreted as is-part-of, rather
than the is-a interpretation of links in an inheritance hierarchy.
3. Containment allows data to be viewed at different granularities by different
users. E.g., wheel by wheel designer but bicycle by a sales-person.
The containment hierarchy is used to find all objects contained in a bicycle
object.
4. In certain applications, an object may be contained in several objects. In such
cases, the containment relationship is represented by a DAG rather than by a
hierarchy.
Object-Oriented Languages
The expression of object-orientation can be done in one of two ways.
1. The concepts of object-orientation can be used purely as a design tool, and are
encoded into, e.g., a relational database. E.g., ER modeling.
2. The concepts of object-orientation are incorporated into a language that is used
to manipulate the database. There are several possible languages into which the
concepts can be integrated.
1. Extend a DML such as SQL by adding complex types and object-
orientation. Systems that provide object-oriented extensions to relational
systems are called object-relational systems.
2. Take an existing object-oriented programming language and extend it to
deal with databases. Such languages are called persistent programming
languages.
Persistent Programming Languages
1. Persistent data: data that continue to exist even after the program that created it
has terminated.
2. A persistent programming language is a programming language extended with
constructs to handle persistent data. It differs from embedded SQL in at
least two ways:
1. In a persistent programming language, the query language is fully integrated
with the host language and both share the same type system. Any format
changes required in databases are carried out transparently.
Compare this with embedded SQL, where (1) the host language and DML
have different type systems, so code conversion operates outside the OO
type system and hence has a higher chance of undetected errors; and (2)
format conversion takes a substantial amount of code.
2. Using Embedded SQL, a programmer is responsible for writing explicit
code to fetch data into memory or store data back to the database.
In a persistent programming language, a programmer can manipulate
persistent data without having to write such code explicitly.
Drawbacks: (1) persistent programming languages are powerful, but it is easy
to make programming errors that damage the database; (2) automatic
high-level optimization is harder; and (3) they do not support declarative
querying well.
Persistence of Objects
Several approaches have been proposed to make the objects persistent.
1. persistence by class. Declare class to be persistent: all objects of the class are
then persistent objects. Simple, not flexible since it is often useful to have both
transient and persistent objects in a single class. In many OODB systems,
declaring a class to be persistent is interpreted as ``persistable'' -- objects in the
class potentially can be made persistent.
2. persistence by creation. Introduce new syntax to create persistent objects.
3. persistence by marking. Mark an object persistent after it is created (and before
the program terminates).
4. persistence by reference. One or more objects are explicitly declared as (root)
persistent objects. All other objects are persistent if and only if they are
referenced, directly or indirectly, from a root persistent object. It is easy
to make an entire data structure persistent by merely declaring its root
persistent, but it is expensive for a database system to follow the reference
chains to detect which objects are persistent.
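Persistence by reference can be imitated in plain Python with pickle: dumping a declared root object automatically persists everything reachable from it, directly or indirectly. This is only a sketch of the idea, not how an OODB implements it.

import pickle

class Node:
    def __init__(self, value, nxt=None):
        self.value = value
        self.nxt = nxt

root = Node(1, Node(2, Node(3)))     # the declared root persistent object

with open('db.bin', 'wb') as f:
    pickle.dump(root, f)             # persists the whole reachable chain

with open('db.bin', 'rb') as f:
    again = pickle.load(f)
print(again.nxt.nxt.value)           # 3 -- reached via references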
Object Identity and Pointers
1. The association of an object with a physical location in storage (as in C++)
may change over time.
2. There are several degrees of permanence of identity:
o intraprocedure: Identity persists only during the execution of a single
procedure, e.g., local variables within procedures.
o intraprogram: Identity persists only during the execution of a single
program or query, e.g., global variables in programming languages, and
main memory or virtual memory pointers.
o interprogram: Identity persists from one program execution to another,
e.g., pointers to file system data on disk; these may change if the way data
is stored in the file system is changed.
o persistent: Identity persists not only among program executions but also
among structural reorganizations of the data. This is the persistent form
of identity required for object-oriented systems.
3. In persistent extension of C++, object identifiers are implemented as
``persistent pointers'' which can be viewed as a pointer to an object in the
database.
Storage and Access of Persistent Objects
1. How are objects stored in a database?
Code (that implements methods) should be stored in the database as part of the
schema, along with type definitions, but many implementations store them
outside of the database, to avoid having to integrate system software such as
compilers with the database system.
Data: stored individually for each object.
2. How to find the objects?
1. Give names to objects like we give names to files: works only for small
sets of objects.
2. Expose object identifiers or persistent pointers to the objects; these
identifiers can then be stored externally.
3. Store the collections of object and allow programs to iterate over the
collections to find required objects. The collections can be modeled as
objects of a collection type. A special case of a collection is a class
extent, which is a collection of all objects belonging to the class.
Most OODB systems support all three ways of accessing persistent objects. All
objects have object identifiers. Names are typically given only to class extents
and other collection objects, and perhaps to other selected objects, but most
objects are not given names. Class extents are usually maintained for all
classed that can have persistent objects, but in many implementations, they
contain only persistent objects of the class.
OBJECT MODELLING
Objects are the physical and conceptual things we find in the universe around us.
Hardware, software, documents, human beings, and even concepts are all examples of
objects. For purposes of modeling his or her company, a chief executive officer could
view employees, buildings, divisions, documents, and benefits packages as objects.
An automotive engineer would see tires, doors, engines, top speed, and the current
fuel level as objects. Atoms, molecules, volumes, and temperatures would all be
objects a chemist might consider in creating an object-oriented simulation of a
chemical reaction. Finally, a software engineer would consider stacks, queues,
windows, and check boxes as objects.
Objects are thought of as having state. The state of an object is the condition of the
object, or a set of circumstances describing the object. It is not uncommon to hear
people talk about the "state information" associated with a particular object. For
example, the state of a bank account object would include the current balance, the
state of a clock object would be the current time, the state of an electric light bulb
would be "on" or "off." For complex objects like a human being or an automobile, a
complete description of the state might be very complex. Fortunately, when we use
objects to model real world or imagined situations, we typically restrict the possible
states of the objects to only those that are relevant to our models.
We also think of the state of an object as something that is internal to an object. For
example, if we place a message in a mailbox, the (internal) state of the mailbox object
is changed, whereas the (internal) state of the message object remains unchanged.
Sometimes people think of objects as being strictly static. That is, the state of an
object will not change unless something outside of the object requests the object to
change its state. Indeed, many objects are passive (static). A list of names does not
spontaneously add new names to itself, nor would we expect it to spontaneously delete
names from itself.
However, it is possible for some objects to change their own state. If an object is
capable of spontaneously changing its own state, we refer to it as an "object with
life." (Objects with life are sometimes also called "active objects" or "actors.") Clocks
and timers are common examples of objects with life. If we were modeling a business
process, we would recognize that salespeople and customers were also objects with
life.
Classes, Metaclasses, Parameterized Classes, and Exemplars
There are two broad categories of objects: classes and instances. Users of object-
oriented technology usually think of classes as containing the information necessary to
create instances, i.e., the structure and capabilities of an instance are determined by its
corresponding class. There are three commonly used (and different) views on the
definition for "class":
A class is a pattern, template, or blueprint for a category of structurally
identical items. The items created using the class are called instances. This is
often referred to as the "class as a `cookie cutter'" view. As you might guess,
the instances are the "cookies."
A class is a thing that consists of both a pattern and a mechanism for creating
items based on that pattern. This is the "class as an `instance factory'" view;
instances are the individual items that are "manufactured" (created) using the
class's creation mechanism.
A class is the set of all items created using a specific pattern. Said another
way, the class is the set of all instances of that pattern.
We should note that it is possible for an instance of a class to also be a class. A
metaclass is a class whose instances themselves are classes. This means when we use
the instance creation mechanism in a metaclass, the instance created will itself be a
class. The instance creation mechanism of this class can, in turn, be used to create
instances -- although these instances may or may not themselves be classes.
A concept very similar to the metaclass is the parameterized class. A parameterized
class is a template for a class wherein specific items have been identified as being
required to create non-parameterized classes based on the template. In effect, a
parameterized class can be viewed as a "fill in the blanks" version of a class. One
cannot directly use the instance creation mechanism of a parameterized class. First, we
must supply the required parameters, resulting in the creation of a non-parameterized
class. Once we have a non-parameterized class, we can use its creation mechanisms to
create instances.
In this article, we will use the term "class" to mean metaclass, parameterized class, or
a class that is neither a metaclass nor a parameterized class. We will make a
distinction only when it is necessary to do so. Further, we will occasionally refer to
"non-class instances." A non-class instance is an instance of a class, but is itself not a
class. An instance of a metaclass, for example, would not be a non-class instance.
In this article, we will sometimes refer to "instantiation." Instantiation has two
common meanings:
as a verb, instantiation is the process of creating an instance of a class, and
as a noun, an instantiation is an instance of a class.
Some people restrict the use of the term "object" to instances of classes. For these
people, classes are not objects. However, when these people are confronted with the
concepts of metaclasses and parameterized classes, they have difficulty attempting
to resolve the "problems" these concepts introduce. For example, is a class that is an
instance of a metaclass an object -- even though it is itself a class? In this article, we
will use the term "object" to refer to both classes and their instances. We will only
distinguish between the two when needed.
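Python makes the ``class as object'' view concrete: every class is itself an instance of the metaclass type, as the short sketch below shows (the names are invented).

# type acts as a metaclass: calling it creates a new class object.
Account = type('Account', (), {'balance': 0})

a = Account()                     # the class's creation mechanism
print(isinstance(a, Account))     # True: a non-class instance
print(isinstance(Account, type))  # True: the class is itself an instance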
Black Boxes and Interfaces
Objects are "black boxes." Specifically, the underlying implementations of objects are
hidden from those that use the object. In object-oriented systems, it is only the
producer (creator, designer, or builder) of an object that knows the details about the
internal construction of that object. The consumers (users) of an object are denied
knowledge of the inner workings of the object, and must deal with an object via one of
its three distinct interfaces:
the "public" interface. This is the interface that is open (visible) to everybody.
the "inheritance" interface. This is the interface that is accessible only by direct
specializations of the object. (We will discuss inheritance and specialization
later in this chapter.) In class-based object-oriented systems, only classes can
provide an inheritance interface.
the "parameter" interface. In the case of parameterized classes, the parameter
interface defines the parameters that must be supplied to create an instance of
the parameterized class.
Another way of saying that an item is in the public interface of an object is to say that
the object "exports" that item. Similarly, when an object requires information from
outside of itself (e.g., as with the parameters in a parameterized class), we can say that
the object needs to "import" that information.
Aggregation
It is, of course, possible for objects to be composed of other objects. Aggregation is
either:
the process of creating a new object from two or more other objects, or
an object that is composed of two or more other objects.
For example, a date object could be fashioned from a month object, a day object, and
a year object. A list of names object, for example, can be thought of as containing
many name objects.
Specialization and Inheritance
Aggregation is not the only way in which two objects can be related. One object can
be a specialization of another object. Specialization is either:
the process of defining a new object based on a (typically) more narrow
definition of an existing object, or
an object that is directly related to, and more narrowly defined than, another
object.
Specialization is usually associated with classes. It is usually only in the so-called
"classless" object-oriented systems that we think of specialization for objects other
than classes.
Depending on their technical background, there are a number of different ways in
which people express specialization. For example, those who are familiar with an
object-oriented programming language called Smalltalk refer to specializations as
"subclasses" and to the corresponding generalizations of these specializations as
"superclasses." Those with a background in the C++ programming language use the
term "derived class" for specialization and "base class" for corresponding
generalizations.
It is common to say that everything that is true for a generalization is also true for its
corresponding specialization. We can, for example, define "checking accounts" and
"savings accounts" as specializations of "bank accounts." Another way of saying this
is that a checking account is a kind of bank account, and a savings account is a kind of
bank account. Still another way of expressing this idea is to say that everything that
was true for the bank account is also true for the savings account and the checking
account.
In an object-oriented context, we speak of specializations as "inheriting"
characteristics from their corresponding generalizations. Inheritance can be defined
as the process whereby one object acquires (gets, receives) characteristics from one or
more other objects. Some object-oriented systems permit only single inheritance, a
situation in which a specialization may only acquire characteristics from a single
generalization. Many object-oriented systems, however, allow for multiple
inheritance, a situation in which a specialization may acquire characteristics from two
or more corresponding generalizations.
Our previous discussion of the bank account, checking account, and savings account
was an example of single inheritance. A telescope and a television set are both
specializations of "device that enables one to see things far away." A television set is
also a kind of "electronic device." You might say that a television set acquires
characteristics from two different generalizations, "device that enables one to see
things far away" and "electronic device." Therefore, a television set is a product of
multiple inheritance.
Abstract Classes
We usually think of classes as being complete definitions. However, there are
situations where incomplete definitions are useful, and classes that represent these
incomplete definitions are equally useful. For example, in everyday conversation, we
might talk about such items as bank accounts, insurance policies, and houses. In
object-oriented thinking, we often isolate useful, but incomplete, concepts such as
these into their own special classes.
Abstract classes are classes that embody coherent and cohesive, but incomplete,
concepts, and in turn, make these characteristics available to their specializations via
inheritance. People sometimes use the terms "partial type" and "abstract superclass" as
synonyms for abstract class. While we would never create instances of abstract
classes, we most certainly would make their individual characteristics available to
more specialized classes via inheritance.
For example, consider the concept of an automobile. On one hand, most people know
what an automobile is. On the other hand, "automobile" is not a complete definition
for any vehicle. It would be quite accurate to describe "automobile" as the set of
characteristics that make a thing an automobile, in other words, the "essence of
automobile-ness."
Operations
The public interface of an object typically contains three different categories of items:
operations (sometimes referred to as "method selectors," "method interfaces,"
"messages," or "methods"),
constants, and
exceptions.
An operation in the public interface of an object advertises a functional capability of
that object. For example, "deposit" would be an operation in the public interface of a
bank account object, "what is current temperature" would be an operation in the public
interface of a temperature sensor object, and "increment" would be an operation in the
public interface of a counter object.
The actual algorithm for accomplishing an operation is referred to as a method.
Unlike operations, methods are not in the public interface for an object. Rather,
methods are hidden on the inside of an object. So, while users of bank account objects
would know that they could make a deposit into a bank account, they would be
unaware of the details as to how that deposit actually got credited to the bank account.
MODULE 5: OPERATIONS AND MANAGEMENT
Client / Server and Databases – Data Warehousing – Query Processing – Concurrency
Management – Heterogeneous and Homogenous Systems – Distributed Databases –
Controls – Atomicity, Recovery – Security, Back-up and Recovery.
The client/server model is a computing model that acts as a distributed application
which partitions tasks or workloads between the providers of a resource or service,
called servers, and service requesters, called clients.[1]
Often clients and servers
communicate over a computer network on separate hardware, but both client and
server may reside in the same system. A server machine is a host that is running one
or more server programs which share their resources with clients. A client does not
share any of its resources, but requests a server's content or service function. Clients
therefore initiate communication sessions with servers which await incoming requests.
Figure: Schematic client-server interaction.
The client/server characteristic describes the relationship of cooperating programs in
an application. The server component provides a function or service to one or many
clients, which initiate requests for such services. A notable example of this is the way
OpenGL treats the video card of a computer as a server, with the actual application
making rendering requests to it. This model is further solidified with the OpenGL
Shading Language, with the user writing small programs that live in video memory,
and are requested from the main program through the graphics driver.
Functions such as email exchange, web access and database access are built on the
client/server model. Users accessing banking services from their computer use a web
browser client to send a request to a web server at a bank. That web server runs a
program which may in turn, forward the request to its own database client program,
which sends a request to the bank's database server (which runs on another computer)
to retrieve the account information. The balance and transaction records are returned
to the bank database client, which in turn serves it back to the user's web browser
client, displaying the results to the user. The client–server model has become one of
the central ideas of network computing. Many business applications being written
today use the client–server model, as do the Internet's main application protocols, such
as HTTP, SMTP, Telnet, and DNS.
The interaction between client and server is often described using sequence diagrams.
The Unified Modeling Language has support for sequence diagrams.
Specific types of clients include web browsers, email clients, and online chat clients.
Specific types of servers include web servers, ftp servers, application servers,
database servers, name servers, mail servers, file servers, print servers, and
terminal servers. Most web services are also types of servers.
Comparison to peer-to-peer architecture
A client/server network involves multiple clients connecting to a single, central server.
The file server on a client/server network must have robust hardware configurations to
handle large volumes of client service requests. Hence servers usually are high speed
computers with large hard disk capacity.
By contrast, peer-to-peer networks involve two or more computers pooling individual
resources such as disk drives, CD-ROMs and printers. These shared resources are
available to every computer in the network, while each of them communicates in a
session. Each computer acts as both client and server, which means all the
computers on the network are equals; that is where the term peer-to-peer comes from.
The advantage of peer-to-peer networking is the simpler control concept: it
requires no additional coordination entity and does not delay transfers by
routing via server entities. However, session collisions may be more frequent
than with routing via server nodes.
In a peer-to-peer network, software applications can be installed on a single
computer and shared by every computer in the network. They are also cheaper to set
up because most desktop operating systems have the software required for the network
installed by default. On the other hand, the client/server model works with any size or
physical layout of LAN and doesn't tend to slow down under heavy use.[3]
Peer-to-peer networks are typically less secure than client/server networks because
security is handled by the individual computers, not controlled and supervised on the
network as a whole. The resources of the computers in the network can become
congested as they have to support not only the workstation user, but also the requests
from network users. It may be difficult to provide systemwide services when the client
operating system typically used in this type of network is incapable of hosting the
service.
Client/server networks, with their additional capacities, have a higher initial
setup cost than peer-to-peer networks. In the long term, however, administering
a client/server network whose applications are largely server-hosted saves
administration effort compared with administering the application settings on
each client. In addition, the concentration of functions in high-performance
servers allows for lower-grade performance qualification of the clients.
It is possible to set up a server on a modern desktop computer, but it is recommended
to consider investment in enterprise-wide server facilities with standardised choice of
hardware and software and with a systematic and remotely operable administering
strategy. It is easier to configure and manage the server hardware and software
compared with the distributed administering requirements of a flock of computers.
In centralized database systems, the database system, application programs, and
user interface are all executed on a single system, and dumb terminals are
connected to it. The processing power of the single system is utilized, and the
dumb terminals are used only to display the information. As personal computers
became faster, more powerful, and cheaper, database systems started to exploit
the available processing power of the system at the user’s side, which led to
the development of client/server architecture. In client/server architecture,
the processing power of the computer system at the user’s end is utilized by
processing the user interface on that system. A client is a computer system
that sends requests to the server connected to the network,
and a server is a computer system that receives the request, processes it, and returns
the requested information back to the client. Client and server are usually present at
different sites. The end users (remote database users) work on client computer system
and database system runs on the server. Servers can be of several types, for example,
file servers, printer servers, web servers, database servers, etc. The client machines
have user interfaces that help users to utilize the servers. They also provide
users the local processing power to run local applications on the client side.
There are two approaches to implement client/server architecture. In the first
approach, the user interface and application programs are placed on the client side and
the database system on the server side. This architecture is called two-tier
architecture. The application programs that reside at the client side invoke the DBMS
at the server side. Application program interface standards like Open Database
Connectivity (ODBC) and Java Database Connectivity (JDBC) are used for interaction
between client and server.
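A hedged sketch of the two-tier pattern in Python, with the DB-API playing the role that ODBC/JDBC play in the text; sqlite3 is only a stand-in, since it runs in-process rather than on a separate server machine.

import sqlite3

conn = sqlite3.connect(':memory:')   # in a real two-tier setup, a network
                                     # driver would connect to the server
conn.execute('create table account (accno int, balance int)')
conn.execute('insert into account values (101, 1000)')

# The client-side application program invokes the DBMS through the API:
cur = conn.execute('select balance from account where accno = ?', (101,))
print(cur.fetchone()[0])             # the user interface displays 1000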
The second approach, that is, three-tier architecture, is primarily used for web-based
applications. It adds an intermediate layer known as the application server (or web server)
between the client and the database server. The client communicates with the
application server, which in turn communicates with the database server. The
application server stores the business rules (procedures and constraints) used for
accessing data from database server. It checks the client’s credentials before
forwarding a request to database server. Hence, it improves database security.
When a client requests information, the application server accepts the request,
processes it, and sends corresponding database commands to the database server.
The database server sends the result back to the application server, where it is
converted into GUI format and presented to the client.
Fundamentals of Distributed Databases
In recent years, the distributed database system has been emerging as an important
area of information processing, and its popularity is increasing rapidly. A distributed
database is a database that is under the control of a central DBMS in which not all
storage devices are attached to a common CPU. It may be stored on multiple
computers located in the same physical location, or may be dispersed over a network
of interconnected computers. Collections of data (e.g., in a database) can be
distributed across multiple physical locations. In short, a distributed database is a
logically interrelated collection of shared data, and a description of this data is
physically distributed over a computer network. A distributed database must ensure
the following:
The distribution is transparent – users must be able to interact with the
system as if it is a single logical system. This applies to the system
performance and methods of access, amongst other things.
The transactions are transparent – each transaction must maintain database
integrity across multiple databases. Transactions may also be divided into
subtransactions; each subtransaction affects one database system.
Features of a Distributed DBMS: A Distributed DBMS may have a number of local
applications, but it has at least one global application. Thus, a distributed DBMS has
the following features:
i. A distributed DBMS is a collection of logically related shared data.
ii. The data in a distributed DBMS is split into a number of fragments or
partitions.
iii. Fragments may be replicated in a distributed system.
iv. Fragments/replicas are allocated to different sites.
v. In a distributed system, the sites are linked by a communications network.
vi. The data at each site is under the control of a DBMS.
vii. The DBMS at each site is autonomous, that is, it can handle local
applications independently.
viii. Each DBMS in a distributed system participates in at least one global
application.
Advantages and Disadvantages of Distributed DBMS
A distributed DBMS has several advantages over traditional centralized DBMSs.
These are listed in the following.
Sharing of information – The major advantage of a distributed database system is
the provision for sharing information. Users at one site in a distributed system may
be able to access data residing at other sites. For example, consider an organization
that has a number of branches throughout the country. Each branch stores its own
data locally, yet a user in one branch can access data from another branch;
thus, information sharing is possible in a distributed system.
Components of a Distributed DBMS
A Distributed DBMS controls the storage and efficient retrieval of logically
interrelated data that are physically distributed among several sites. Therefore, a
distributed DBMS includes the following components.
a. Computer workstations (sites or nodes) – A distributed DBMS consists of a
number of computer workstations that form the network system. The
distributed database system must be independent of the computer system
hardware.
b. Network components (both hardware and software) – Each workstation in a
distributed system contains a number of network hardware and software
components. These components allow each site to interact and exchange data
with each other site. Network system independence is a desirable property of
the distributed system.
c. Communication media – In a distributed system, any type of communication
(data transfer, information exchange) among nodes is carried out through
communication media. This is a very important component of a distributed
DBMS. It is desirable that a distributed DBMS be communication media
independent, that is, it must be able to support several types of communication
media.
d. Transaction processor (TP) – A TP is a software component that resides in
each computer connected with the distributed system and is responsible for
receiving and processing both local and remote applications’ data requests.
This component is also known as the application processor (AP) or the
transaction manager (TM).
e. Data processor (DP) – A DP is also a software component that resides in each
computer connected with the distributed system and stores and retrieves data
located at that site. The DP is also known as the data manager (DM). In a
distributed DBMS, a DP may be a centralized DBMS.
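As an illustration of the TP/DP split just described, the following toy sketch (class names, sites, and the routing rule are all invented here) pictures the transaction processor as the component that decides which site owns a data item, while the data processor at that site performs the actual storage and retrieval.

import java.util.HashMap;
import java.util.Map;

// A DP stores and retrieves the data held at one site.
class DataProcessor {
    private final Map<String, String> localStore = new HashMap<>();
    void write(String key, String value) { localStore.put(key, value); }
    String read(String key) { return localStore.get(key); }
}

// The TP routes each request to the site that owns the data.
public class TransactionProcessor {
    private final Map<String, DataProcessor> dpBySite = new HashMap<>();

    TransactionProcessor() {
        dpBySite.put("north", new DataProcessor());
        dpBySite.put("south", new DataProcessor());
    }

    // Naive ownership rule for illustration; a real TP consults a catalog.
    private DataProcessor owner(String key) {
        return key.compareTo("m") < 0 ? dpBySite.get("north") : dpBySite.get("south");
    }

    public static void main(String[] args) {
        TransactionProcessor tp = new TransactionProcessor();
        tp.owner("alice").write("alice", "balance=100");   // stored at "north"
        System.out.println(tp.owner("alice").read("alice"));
    }
}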
Objectives of Distributed Concurrency Control
Concurrency control is the activity of coordinating concurrent accesses to a database
in a multi-user system. Concurrency control allows users to access a database in a
multi-programmed fashion while preserving the consistency of the data. The main
technical difficulty in achieving this goal is the necessity to prevent database updates
performed by one user from interfering with database retrievals and updates
performed by other users. The concurrency control problem is exacerbated in a
distributed DBMS (DDBMS), because in a distributed system users may access data
stored at many different sites, and a concurrency control mechanism at one site cannot
be aware of transactions at other sites instantaneously. The replication of data items in
a distributed database adds extra complexity to the concurrency control mechanism. In
this context, it is necessary to discuss the objectives of the concurrency control
mechanism in a distributed database environment. As in centralized DBMSs, the
primary goal of the concurrency control mechanism in a DDBMS is to ensure that
the consistency of the data items is preserved and that each atomic action completes
in finite time if the system has not failed. In addition, a good concurrency
control mechanism for a DDBMS has the following objectives (a minimal
lock-manager sketch follows the list):
It must be resilient to site and communication failures.
It should permit parallel execution of transactions to achieve maximum
concurrency, thus satisfying performance requirements.
Its computational methods and storage mechanisms should be modest to
minimize overhead.
It should perform satisfactorily in a network environment, taking into
account the significant communication delays involved.
It must impose few constraints on the structure of atomic actions of
transactions.
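To make the coordination problem concrete, here is a minimal single-site lock-manager sketch (the class and item names are invented for illustration): before touching a data item, a transaction acquires that item's lock, which serializes conflicting accesses. Distributed schemes layer voting, primary copies, or timestamping on top of this idea, none of which is shown here.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class LockManager {
    // One lock per data item, created on first use.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    public void lock(String item) {
        locks.computeIfAbsent(item, k -> new ReentrantLock()).lock();
    }

    public void unlock(String item) {
        locks.get(item).unlock();
    }

    public static void main(String[] args) {
        LockManager lm = new LockManager();
        lm.lock("account_42");
        try {
            // ... read/update account_42 while holding the lock ...
        } finally {
            lm.unlock("account_42");   // always release, even on failure
        }
    }
}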
Atomicity
The phrase "all or nothing" succinctly describes the first ACID
property, atomicity. When an update occurs to a database, either all or none of the
update becomes available to anyone beyond the user or application performing the
update. This update to the database is called a transaction, and it either commits or
aborts. Atomicity guarantees that no mere fragment of the update can be left in the
database should a problem occur with either the hardware or the software involved
(a JDBC sketch follows the list below).
Features to consider for atomicity:
a transaction is a unit of operation - either all the transaction's actions are
completed or none are
atomicity is maintained in the presence of deadlocks
atomicity is maintained in the presence of database software failures
atomicity is maintained in the presence of application software failures
atomicity is maintained in the presence of CPU failures
atomicity is maintained in the presence of disk failures
atomicity can be turned off at the system level
atomicity can be turned off at the session level
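For illustration, a minimal JDBC sketch of "all or nothing" follows; the connection URL, credentials, table, and amounts are placeholders, and a PostgreSQL server with its JDBC driver is assumed. Either both updates commit together, or the rollback undoes whatever was applied.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class AtomicTransferDemo {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://dbserver.example.com:5432/bank";
        try (Connection conn = DriverManager.getConnection(url, "appuser", "secret")) {
            conn.setAutoCommit(false);            // start an explicit transaction
            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate("UPDATE accounts SET balance = balance - 100 WHERE id = 1");
                stmt.executeUpdate("UPDATE accounts SET balance = balance + 100 WHERE id = 2");
                conn.commit();                    // both updates become visible at once
            } catch (SQLException e) {
                conn.rollback();                  // neither update survives a failure
                throw e;
            }
        }
    }
}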
Data Security
Data security is the protection of the database from unauthorized users; only
authorized persons are allowed to access the database. Most users are allowed
to access only a part of the database, i.e., the data that is related to them or to
their department, whereas typically the DBA or a head of department can access
all the data in the database. Some users may be permitted only to retrieve data,
whereas others are allowed to retrieve as well as to update data. Database access
is controlled by the DBA, who creates user accounts and grants rights to access the
database. Typically, users or groups of users are given usernames protected by
passwords.
Most DBMSs provide a security subsystem, which the DBA uses to
create user accounts and to specify account restrictions. The user enters his/her
account number (or username) and password to access the data in the database. For
example, if you have an account on hotmail.com (a popular website), then
you have to give your correct username and password to access your e-mail
account. Similarly, when you insert your ATM card into an automated teller machine
(ATM) at a bank, the machine reads the ID number recorded on the card and then
asks you to enter your PIN code (or password). In this way you can access your
account.
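As a sketch of how a DBA might create such restricted accounts, the following issues PostgreSQL-style SQL over JDBC; the user names, passwords, table, and privileges are invented for illustration, and the exact CREATE USER/GRANT syntax varies between DBMS products.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateRestrictedUser {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://dbserver.example.com:5432/sales";
        try (Connection conn = DriverManager.getConnection(url, "dba", "dbasecret");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate("CREATE USER clerk WITH PASSWORD 'clerk_pw'");
            stmt.executeUpdate("CREATE USER manager WITH PASSWORD 'mgr_pw'");
            // clerk may only retrieve customer data; manager may also update it
            stmt.executeUpdate("GRANT SELECT ON customers TO clerk");
            stmt.executeUpdate("GRANT SELECT, UPDATE ON customers TO manager");
        }
    }
}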
Homogeneous and Heterogeneous Distributed DBMSs
A distributed DBMS may be classified as homogeneous or heterogeneous. In an ideal
distributed database system, the sites would share a common global schema (although
some relations may be stored only at some local sites), all sites would run the same
database management software, and every site would be aware of the existence of
the other sites.
In a distributed system, if all sites use the same DBMS product, it is called a
homogeneous distributed database system. However, in reality, a distributed
database has to be constructed by linking multiple already-existing database systems
together, each with its own schema and possibly running different database
management software. Such systems are called heterogeneous distributed database
systems. In a heterogeneous distributed database system, sites may run different
DBMS products that need not be based on the same underlying data model, and thus,
the system may be composed of relational, network, hierarchical, and object-oriented
DBMSs. A homogeneous distributed DBMS provides several advantages, such as
simplicity, ease of design, and incremental growth. It is much easier to design and
manage a homogeneous distributed DBMS than a heterogeneous one. In a
homogeneous distributed DBMS, adding a new site to the distributed system is much
easier, thereby supporting incremental growth. Such systems also improve
performance by exploiting the parallel processing capability of multiple sites.
Functions of Distributed DBMS: A distributed DBMS manages the storage and
processing of logically related data on interconnected computers wherein both data
and processing functions are distributed among several sites. Thus, a distributed
DBMS has at least all the functionality of a centralized DBMS. In addition, it must
provide the following functions to be classified as distributed.
Application interfaces – A distributed DBMS provides application interfaces to
interact with the end-users or application programs and with remote databases within
the distributed system. This feature is most important and is provided by all systems
that are distributed in nature.