Database Management Systems: Complete Notes

Prepared by II CSE (2010-2014), CBIT, Proddatur, R09 batch (guidance by J. Uma Mahesh, CBIT CSE Dept.)

DATABASE MANAGEMENT SYSTEMS

UNIT-1

Database systems:

Data vs information, introducing the database and the DBMS, why database design is important, files and file systems, problems with file system data management, database systems.

Data models:

Data modeling and data models, the importance of data models, data model basic building blocks, business rules, the evolution of data models, degrees of data abstraction.

UNIT-2

ENTITY RELATIONSHIP MODELLING:

The entity relationship model (ERM), developing an ER diagram, database design challenges: conflicting goals, the extended entity relationship model, entity clustering, entity integrity: selecting primary keys, learning flexible database design, data modeling checklist.

UNIT-3

RELATIONAL DATABASE MODEL:

A logical view of data, keys, integrity rules, relational set operators, the data dictionary and the system catalog, relationships within the relational database, data redundancy revisited, indexes, Codd's relational database rules.

UNIT-4

STRUCTURED QUERY LANGUAGE (SQL):

Introduction to SQL, data definition commands, data manipulation commands, SELECT queries, advanced data definition commands, advanced SELECT queries, virtual tables: creating a view, joining database tables.

ADVANCED SQL:

Relational set operators, SQL join operators, subqueries and correlated queries, SQL functions, Oracle sequences, updatable views, procedural SQL, embedded SQL.

UNIT-5

NORMALIZATION OF DATABASE TABLES

Database tables and normalization, the need for normalization, the normalization process, improving the design, surrogate key considerations, higher-level normal forms, normalization and database design, denormalization.

UNIT-6

TRANSACTION MANAGEMENT AND CONCURRENCY CONTROL

What is a transaction?, transaction states, implementation of atomicity and durability, concurrency control, serializability, testing for serializability, concurrency control with locking methods, concurrency control with timestamping methods, concurrency control with optimistic methods, database recovery management, validation-based protocols, multiple granularity.

UNIT-7

RECOVERY SYSTEM

Recovery and atomicity, log-based recovery, recovery with concurrent transactions, buffer management, failure with loss of non-volatile storage, advanced recovery techniques, remote backup systems.

UNIT-8

FILE STRUCTURE AND INDEXING

Overview of physical storage media, magnetic disks, RAID, tertiary storage, storage access, file organization, organization of records in files, data dictionary storage, basic concepts of indexing, ordered indices, B+-tree index files, B-tree index files, multiple key access, static hashing, dynamic hashing, comparison of ordered indexing and hashing, bitmap indices, indexed sequential access method (ISAM).

DATABASE MANAGEMENT SYSTEM

UNIT 1

DATABASE SYSTEMS:

o DATA VS INFORMATION
o INTRODUCING THE DATABASE AND THE DBMS
o WHY DATABASE DESIGN IS IMPORTANT
o FILES AND FILE SYSTEMS
o PROBLEMS WITH FILE SYSTEMS
o DATA MANAGEMENT
o DATABASE SYSTEMS

DATA MODELS:

o DATA MODELING AND DATA MODELS
o THE IMPORTANCE OF DATA MODELS
o DATA MODEL BASIC BUILDING BLOCKS
o BUSINESS RULES
o THE EVOLUTION OF DATA MODELS
o DEGREES OF DATA ABSTRACTION

DATA VS INFORMATION

INTRODUCTION:

DATABASE: A database is a collection of data managed by a DBMS, or "Database Management System". The father of the RDBMS is Edgar F. Codd. A database system is software used to store a huge amount of data and to retrieve that data from the database.

The primary goal of a DBMS is to store data in, and retrieve (extract) data from, the database.

Examples of DBMS are as follows:

Oracle (with SQL, Structured Query Language, and PL/SQL, Procedural Language/SQL), IBM DB2, Ingres, PostgreSQL, data warehouse systems, etc.

The hierarchy underlying every database is as follows: data, when processed, gives information, and information gives knowledge to the user.

DATA:

Data is a collection of raw facts/materials. Raw facts are collections of characters, numbers, special symbols, special operators, pictures, measurements, etc.

Example: 21212.

Fig: hierarchy pyramid (DATA at the base, INFORMATION above it, KNOWLEDGE at the top)

INFORMATION:

Information is data that has been processed so that we can understand it. From information the user gains knowledge.

Example: 21-2-12. From this we can say "it is a date".

Difference between data and information:

Data:
- Data is a collection of raw facts: characters, numbers, special symbols, special operators, pictures, measurements, etc.
- Data by itself does not convey meaning; it is unstructured (or unorganized) data.
- Example: seeing 21212 (actually a date), a new user gets no meaning from it, because the data is unorganized, i.e., not processed.

Information:
- Information is processed data.
- The user can always understand information because it is organized (or processed, or structured) data.
- Information gives knowledge to the user.
- Example: 21-2-12 is a date; the user understands it because it is organized (processed) data, so it conveys meaning.
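The 21212 vs 21-2-12 example can be sketched in a few lines of Python (a minimal illustration; the choice of day-month-year interpretation is ours, not stated in the notes):

```python
from datetime import datetime

raw = "21212"  # raw facts: just digits, meaningless on their own

# Processing the data (imposing a structure) turns it into information:
# interpret the digits as day, month, two-digit year.
processed = datetime.strptime(raw, "%d%m%y")
info = processed.strftime("%d-%m-%y")

print(raw)   # unorganized data: 21212
print(info)  # information: 21-02-12
```

The raw string and the formatted date contain the same digits; only the added structure makes the second one understandable.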

INTRODUCING THE DATABASE AND THE DBMS

DATABASE:-

A database is a collection of interrelated data.

By data, we mean known facts; for example, the names, telephone numbers, and addresses of people.

The primary goal of a database is to store data and make it easy to retrieve. Some database software products are Oracle, DB2, Ingres, PostgreSQL, and data warehouse systems; SQL and PL/SQL are languages used to work with them.

DATABASE MANAGEMENT SYSTEMS (DBMS):-

A database management system (DBMS) is a collection of programs that enables users to create and maintain a database. The DBMS is hence a general-purpose software system that facilitates the processes of defining, constructing, and manipulating databases for various applications.

Defining a database involves specifying the data types, structures, and constraints for the data to be stored in the database.

Constructing the database is the process of storing the data itself on some storage medium that is controlled by the DBMS.

Manipulating a database includes such functions as querying the database to retrieve specific data, updating the database to reflect changes in the miniworld, and generating reports from the data.
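The three activities above can be sketched with Python's built-in sqlite3 module (a minimal sketch; the table and column names are our own, not from the notes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Defining: specify the types, structure, and constraints of the data
conn.execute("""CREATE TABLE person (
                    name    TEXT NOT NULL,
                    phone   TEXT,
                    address TEXT)""")

# Constructing: store the data itself on the DBMS-controlled medium
conn.execute("INSERT INTO person VALUES ('Ravi', '98480-11111', 'Proddatur')")
conn.execute("INSERT INTO person VALUES ('Sita', '98480-22222', 'Kadapa')")

# Manipulating: update the data and query it back
conn.execute("UPDATE person SET address = 'Tirupati' WHERE name = 'Sita'")
for name, phone, address in conn.execute("SELECT * FROM person"):
    print(name, phone, address)
```

The same CREATE/INSERT/UPDATE/SELECT pattern applies to any relational DBMS, not just SQLite.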

WHY IS DATABASE DESIGN IMPORTANT?

A database design is a plan for the database structure that will be used to store and manage data; a database management system is the software used to implement that design.

Modern database and application development software is so easy to use that many people can quickly learn to implement a simple database and develop simple applications.

Note: As data requirements become more complex, untrained designers tend to produce the required data incorrectly, by simply adding columns and tables to the database. So we follow some rules for a good database design.

The following are the basic reasons for doing database design:

- Good application programs cannot overcome a bad database design.
- The end user and the designer together decide what data will be stored in the database.

FILE SYSTEM VERSUS DATABASE

BASIC FILE TERMINOLOGY:

DATA: Data is a collection of raw facts/materials: characters, numbers, special symbols, special operators, pictures, measurements, etc. Example: 21212.

FIELD: A character or group of characters that has a specific meaning. A field is used to define and store data.

RECORD: A logically connected set of one or more fields that describes a person, place, or thing.

FILE: A collection of related records.
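The field/record/file terminology can be mirrored directly in a short Python sketch (illustrative names of our own):

```python
from collections import namedtuple

# FIELD: each named slot below (sid, name, place) stores one meaningful value.
# RECORD: one Student tuple is a logically connected set of fields
#         describing a person.
Student = namedtuple("Student", ["sid", "name", "place"])

# FILE: a collection of related records.
student_file = [
    Student(1, "Ravi", "Proddatur"),
    Student(2, "Sita", "Kadapa"),
]

for record in student_file:
    print(record.sid, record.name, record.place)
```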

PROBLEMS WITH FILE SYSTEMS

The major problems with file systems are:

1. Data redundancy
2. Data isolation
3. Data inconsistency
4. Difficulty in accessing data
5. Concurrent access and anomalies
6. Atomicity problems
7. Data integrity problems
8. Security problems

1. Data redundancy:

Data redundancy is duplication: the same copy of data is present in several files. Redundancy causes problems such as:
1. wasted storage space
2. wasted time

BY USING DBMS: A DBMS keeps a single, unique copy of each item of data, so there is no redundancy.

2. Data isolation:

In a file system, data is stored in various file formats (.exe, .doc, etc.), so it is difficult to write programs that access the information present in those files.

BY USING DBMS: With a DBMS it is easy to write programs and retrieve the information.

3. Data inconsistency:

When a modification made in one file does not carry over to the other files that hold copies of the same data, the files disagree and results are inaccurate. This is called data inconsistency.

BY USING DBMS: A DBMS gives accurate, consistent results.

4. Difficulty in accessing data:

In a file system it is difficult to access data conveniently and efficiently.

BY USING DBMS: A DBMS provides query facilities that retrieve the required data directly.

5. Concurrent access and anomalies: A file system does not allow multiple transactions at a time.

BY USING DBMS: A DBMS allows concurrent transactions.

6. Atomicity problems: A DBMS supports recovery of transactions.

Example: Consider two accounts, A (in transaction T1) and B (in transaction T2). While transferring funds from A to B, if the power fails or the system crashes partway through, there is a chance of data inconsistency. To overcome this problem we use a DBMS: on restart, recovery resumes from the point of failure, whereas in a file system the transfer must be restarted from the initial state.
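The atomicity guarantee described above can be demonstrated with SQLite's transaction support (a sketch under our own account names; the "power failure" is simulated by raising an exception mid-transfer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 500), ('B', 100)")
conn.commit()

try:
    with conn:  # one atomic transaction: commit on success, rollback on error
        conn.execute("UPDATE account SET balance = balance - 200 WHERE name = 'A'")
        raise RuntimeError("power failure")  # simulated crash mid-transfer
        conn.execute("UPDATE account SET balance = balance + 200 WHERE name = 'B'")
except RuntimeError:
    pass

# The half-done debit was rolled back, so the data stays consistent:
print(conn.execute("SELECT balance FROM account WHERE name = 'A'").fetchone())  # (500,)
```

Without the transaction, account A would have lost 200 while B gained nothing, which is exactly the inconsistency a plain file system cannot prevent.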

7. Data integrity problems: In a file system, a modification made in one file can affect another file and cause problems.

BY USING DBMS: Before an update, the DBMS checks the constraints defined on the data.

8. Security problems: In a file system there is no security.

BY USING DBMS: Security can be provided through the Database Administrator (DBA).

EVOLUTION OF DATA MODELS OR Data modeling and data models

A data model is a collection of conceptual tools that can be used to describe the structure of a database. Data models include a set of basic operations for specifying retrievals and updates on the database. The basic building blocks of all data models are entities, attributes, relationships, and constraints.

ENTITY: An entity represents a particular type of object in the real world. An entity may be a physical object, such as a customer.

ATTRIBUTE: An attribute is a characteristic of an entity; each entity has associated with it a set of attributes describing it.

RELATIONSHIPS: A relationship is an association among several entities. Data models use three types of relationships: one-to-one, one-to-many, and many-to-many.

CONSTRAINTS: A constraint is a restriction placed on the data. Constraints are important because they help to ensure data integrity. Constraints are normally expressed in the form of rules.

Data models can be classified into three types:

1. OBJECT-BASED LOGICAL DATA MODELS

2. RECORD-BASED LOGICAL DATA MODELS

3. PHYSICAL DATA MODELS

Fig: EVOLUTION OF DATA MODELS
- OBDM (object-based data model): ERDM (entity relationship data model), OODM (object-oriented data model)
- RBDM (record-based data model): HDM (hierarchical data model), NDM (network data model)
- PDM (physical data model)

1. OBJECT-BASED DATA MODEL (OBDM): This is the first step in data modeling. It describes data at the conceptual and view levels, provides fairly flexible structuring capabilities, and allows data constraints to be specified explicitly. It includes the E-R model and the object-oriented data model.

E-R MODEL: The E-R model is also called the entity relationship model. It is based on a perception of the world as consisting of a collection of basic objects (entities) and relationships among those objects. Special symbols are used to draw an E-R diagram, which shows the relationships among entities and between entities and their attributes:

RECTANGLES represent entity sets.

ELLIPSES represent attributes.

DIAMONDS represent relationships among entities.

LINES link attributes to entities and entity sets to relationships.

Object-oriented data model:
1. An object-oriented model is based on a collection of objects, like the E-R model.
2. Unlike in the record-oriented models, the values stored in an object are themselves objects.
3. An object also contains bodies of code that operate on the object. The internal parts of the object, the instance variables and code, are not visible externally.

2. Record-based logical data models:
1. These also describe data at the view level and conceptual level.
2. Unlike object-based models, they are used to specify the overall logical structure of the database and to provide a higher-level description of the implementation.
3. The database is structured in fixed-format records of several types.
4. Each record type defines a fixed number of fields, or attributes.
5. Each field is usually of a fixed length.
6. Separate languages associated with the model are used to express database queries and updates.
7. The three most widely accepted record-based models are the relational, network, and hierarchical models.

Hierarchical model:
1. This model uses a tree structure to represent relationships among records.
2. A hierarchical database consists of a collection of records; each record consists of a collection of fields, each of which contains one data value.
3. Each hierarchical tree can have only one root record, and this record does not have a parent record.
4. The root can have any number of child records; a child record can have only one parent record.
5. Any modification done to a parent record applies to all of its child records as well.
6. It is conceptually simple and easy to design, but its implementation is difficult.
7. It is very efficient for one-to-many relationships.
8. It cannot handle many-to-many relationships.

Network data model:
1. The network model replaces the hierarchical tree with a graph.
2. In this model, records are allowed to have more than one parent.
3. So many-to-many relationships are possible.
4. Data access is flexible.
5. Updates done to a parent do not affect the child records.
6. It presents a very complex structure to the application programmer.
7. It is difficult to make changes to the database.

Fig: (network between different vendors)

Relational data model: The relational model was introduced by Dr. E. F. Codd. It allows data to be represented in a simple row-and-column format: each data field is a column and each record is a row of a table. The relational model eliminates all parent-child relationships. Databases based on the relational model have built-in support for query languages.
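The row-and-column format and the built-in query-language support just described can be seen in a short sqlite3 sketch (table and data are illustrative, not from the notes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each data field is a column; each record is a row of the table.
conn.execute("CREATE TABLE emp (eid INTEGER, ename TEXT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [(1, "Ravi"), (2, "Sita"), (3, "Arun")])

# Built-in query language: we ask for rows by content,
# not by following parent-child pointers as in earlier models.
rows = conn.execute("SELECT ename FROM emp WHERE eid > 1 ORDER BY eid").fetchall()
print(rows)  # [('Sita',), ('Arun',)]
```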

ADVANTAGES:

- Flexibility
- Ease of use
- Security
- Data independence
- Data manipulation language

3. Physical data models: These describe data at the lowest level. A physical model is a representation of how data is stored in the database, and it applies the constraints of a given database management system. Examples: the unifying model, frame memory.

The Importance of Data Models

A data model is:

1. A relatively simple representation, usually graphical, of complex real-world data structures.

2. A communications tool that facilitates interaction among the designer, the applications programmer, and the end user.

3. The foundation of good database design, which uses an appropriate data model.

4. A response to the fact that end users have different views of, and needs for, data.

5. A way to organize data for the various users.

Business Rules of Data Models: The entity relational data model was given by E. F. Codd. Data models and their properties serve as business rules for an enterprise. The business rules are as follows:

1. Data should be brief.
2. It should be precise (accurate).
3. It should be unambiguous (without errors).
4. Data should follow certain policies, i.e., the rules set when designing the database within the organization.
5. Business rules apply to any organization that uses the data to get results.
6. Data should be processed, not left as a narrow raw fact.
7. Rules must be put in writing.
8. Rules must be kept up to date.
9. The data should be easy to understand.
10. Every property or characteristic of the data should be explained clearly.

Basic building blocks: The basic building blocks of all data models are:

- Entities
- Attributes
- Relationships
- Constraints

Entity:

An entity is a particular type of object in the real world. An entity is capable of independent existence and can be uniquely identified. An entity is anything (a person, a place, a thing, or an event) about which data are to be collected and stored.

An entity may be a physical object such as a house or a car, an event such as a house sale or a car service, or a concept such as a customer transaction or order. An entity is represented by a rectangle.

Attribute:

An attribute is a characteristic of an entity; it is a single data item related to a database object. The database schema associates one or more attributes with each database entity. Attributes are the equivalent of fields in file systems. An attribute is represented by an oval.

Relationship:

A relationship describes an association among entities. A relationship is represented by a rhombus (diamond).

For example, a relationship exists between customers and agents that can be described as follows: an agent can serve many customers, and each customer may be served by one agent.

Data models use four types of relationships:

- One-to-many (1:M or 1..*)
- One-to-one (1:1 or 1..1)
- Many-to-many (M:N or *..*)
- Many-to-one (M:1 or *..1)

One-to-many relationship: A painter paints many different paintings, but each of them is painted by only one painter. Thus the painter (the "one") is related to the paintings (the "many"). Therefore, database designers label the relationship "PAINTER paints PAINTING" as 1:M.

Many-to-one relationship: The preceding discussion identified the relationship in both directions; that is, relationships are bidirectional. The paintings are painted by one painter, so the relationship "PAINTINGS are painted by PAINTER" is labeled M:1.

One-to-one relationship: A retail company's management structure may require that each of its stores be managed by a single employee. In turn, each store manager, who is an employee, manages only a single store. Therefore, the relationship "EMPLOYEE manages STORE" is labeled 1:1.

Many-to-many relationship: An employee may learn many job skills, and each job skill may be learned by many employees. Database designers label the relationship "EMPLOYEE learns SKILL" as M:N.
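The 1:M "PAINTER paints PAINTING" relationship above maps directly onto a foreign key in a relational schema (a sketch with our own painting titles; SQLite only enforces foreign keys when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# PAINTER is the "one" side; PAINTING is the "many" side of the 1:M relationship
conn.execute("CREATE TABLE painter (pid INTEGER PRIMARY KEY, pname TEXT)")
conn.execute("""CREATE TABLE painting (
                    title TEXT,
                    pid   INTEGER REFERENCES painter(pid))""")

conn.execute("INSERT INTO painter VALUES (1, 'Ravi')")
conn.executemany("INSERT INTO painting VALUES (?, 1)",
                 [("Sunrise",), ("Harvest",), ("Monsoon",)])

# One painter is related to many paintings; each painting has one painter.
n, = conn.execute("SELECT COUNT(*) FROM painting WHERE pid = 1").fetchone()
print(n)  # 3
```

An M:N relationship (EMPLOYEE learns SKILL) would instead need a third, bridge table holding pairs of foreign keys.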

Constraint: A constraint is a restriction placed on the data. Constraints are important because they help to ensure data integrity. Constraints are normally expressed in the form of rules.

Eg: An employee's salary must have a value between 6,000 and 35,000.

Eg: Each class must have one and only one teacher.
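The salary rule above can be written as a CHECK constraint so the DBMS itself rejects invalid rows (a sketch with hypothetical employee names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The rule "salary must be between 6,000 and 35,000" as a CHECK constraint
conn.execute("""CREATE TABLE employee (
                    ename  TEXT,
                    salary INTEGER CHECK (salary BETWEEN 6000 AND 35000))""")

conn.execute("INSERT INTO employee VALUES ('Ravi', 12000)")   # accepted
try:
    conn.execute("INSERT INTO employee VALUES ('Sita', 50000)")  # rejected
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

Because the rule lives in the schema, every application that touches the table gets the same integrity guarantee for free.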

Degrees of Data Abstraction:

1. A way of classifying data models.
2. Many processes begin at a high level of abstraction and proceed to an ever-increasing level of detail.
3. Designing a usable database follows the same basic process.
4. The American National Standards Institute / Standards Planning and Requirements Committee (ANSI/SPARC) classified data models according to their degree of abstraction (1970s):
   1. Conceptual
   2. Internal
   3. External

Data Abstraction Levels

1. The Conceptual Model:

1. Represents a global view of the database.
2. Enterprise-wide representation of data as viewed by high-level managers.
3. Basis for identification and description of the main data objects, avoiding details.
4. The most widely used conceptual model is the entity relationship (ER) model.

Advantages of the Conceptual Model:

1. Independent of both software and hardware.
2. Does not depend on the DBMS software used to implement the model.
3. Does not depend on the hardware used in the implementation of the model.
4. Changes in either the hardware or the DBMS software have no effect on the database design at the conceptual level.

2. The Internal Model:

1. Representation of the database as "seen" by the DBMS.
2. Adapts the conceptual model to the DBMS.
3. Software dependent.
4. Hardware independent.

3. The External Model:

1. End users' view of the data environment.
2. Requires the modeler to subdivide the set of requirements and constraints into functional modules that can be examined within the framework of their external models.
3. Good design should consider the relationships between views and provide programmers with a set of restrictions that govern common entities.

Advantages of External Models:

The use of database subsets makes application program development much simpler.

The Physical Model:

1. Operates at the lowest level of abstraction, describing the way data are saved on storage media such as disks or tapes.
2. Software and hardware dependent.
3. Requires that database designers have detailed knowledge of the hardware and software used to implement the database design.

UNIT-2

ENTITY RELATIONSHIP MODELLING:

The entity relationship model (ERM), developing an ER diagram, database design challenges: conflicting goals, the extended entity relationship model, entity clustering, entity integrity: selecting primary keys, learning flexible database design, data modeling checklist.

The Entity-Relationship model:

• The E-R model is expressed in terms of entities in the business environment, the relationships (or associations) among those entities, and the attributes (properties) of both the entities and their relationships.
• The E-R model is usually expressed as an E-R diagram.

E-R Model Constructs:

• Entity: a person, place, object, event, or concept.
• Entity type (or entity set): a collection of entities that share common properties or characteristics. Each entity type is given a name; since this name represents a set of items, it is always singular. It is placed inside the box representing the entity type.
• Entity instance: a single occurrence of an entity type. An entity type is described just once (using metadata) in a database, while many instances of that entity type may be represented by data stored in the database. E.g., there is one EMPLOYEE entity type in most organizations, but there may be hundreds of instances of this entity stored in the database.

ER Model Basics:

• Entity: a real-world object distinguishable from other objects. An entity is described (in the DB) using a set of attributes.
• Entity set: a collection of similar entities, e.g., all employees. All entities in an entity set have the same set of attributes. (Until we consider ISA hierarchies, anyway!) Each entity set has a key, and each attribute has a domain.

ENTITY TYPES: Entities are classified into two types, namely strong entities and weak entities.

Strong entity: An entity that does not depend on another entity is called a strong entity.

Weak entity: An entity that depends on another entity is called a weak entity.

Key and key attributes:

- Key: a unique value for an entity.
- Key attributes: a group of one or more attributes that uniquely identify an entity in the entity set.

Super key, candidate key, and primary key:

- Super key: a set of attributes that allows an entity to be identified uniquely in the entity set.
- Candidate key: a minimal super key; there can be many candidate keys.
- Primary key: a candidate key chosen by the designer, denoted by underlining in ER diagrams.
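The key distinctions above can be made concrete in a schema (a sketch; the ssn/email columns are our own example of two candidate keys):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# ssn is the candidate key chosen as PRIMARY KEY; email is another
# candidate key, kept UNIQUE. The pair (ssn, email) would also be a
# super key: it identifies rows uniquely, but it is not minimal.
conn.execute("""CREATE TABLE employee (
                    ssn   TEXT PRIMARY KEY,
                    email TEXT UNIQUE,
                    ename TEXT)""")

conn.execute("INSERT INTO employee VALUES ('111-22-3333', 'jack@x.com', 'Jack')")
try:
    # A second entity with the same primary-key value is rejected
    conn.execute("INSERT INTO employee VALUES ('111-22-3333', 'jill@x.com', 'Jill')")
except sqlite3.IntegrityError:
    print("duplicate ssn rejected")
```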

Fig: Example ER diagram

Relationship: an association among two or more entities. E.g., Jack works in the Pharmacy department.

Relationship set: a collection of similar relationships. An n-ary relationship set R relates n entity sets E1 ... En; each relationship in R involves entities e1 in E1, ..., en in En. The same entity set can participate in different relationship sets, or in different "roles" in the same set.


Fig: Example of a bank ER diagram

Note: The Entity Relationship (E-R) Model

A relationship's degree indicates the number of associated entities or participants:

1. A unary relationship exists when an association is maintained within a single entity.
2. A binary relationship exists when two entities are associated.
3. A ternary relationship exists when three entities are associated.
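A unary relationship, an association within a single entity, can be sketched as a self-referencing foreign key (illustrative EMPLOYEE-manages-EMPLOYEE example of our own):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Unary relationship: EMPLOYEE manages EMPLOYEE, so the foreign key
# mgr_eid points back into the same entity's table.
conn.execute("""CREATE TABLE employee (
                    eid     INTEGER PRIMARY KEY,
                    ename   TEXT,
                    mgr_eid INTEGER REFERENCES employee(eid))""")

conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Asha", None),   # top manager, no manager of her own
                  (2, "Ravi", 1),
                  (3, "Sita", 1)])

# A self join pairs each employee with his or her manager
q = """SELECT e.ename, m.ename
       FROM employee e JOIN employee m ON e.mgr_eid = m.eid
       ORDER BY e.eid"""
print(conn.execute(q).fetchall())  # [('Ravi', 'Asha'), ('Sita', 'Asha')]
```

A binary relationship would use a foreign key into a different table, and a ternary one a bridge table referencing three tables.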

DATABASE DESIGN CHALLENGES: CONFLICTING GOALS

DESIGN STANDARDS:

• The database must be designed to conform to design standards, i.e., the exact information requirements.
• High-speed processing may require design compromises.
• The quest for timely information may be the focus of database design.
• Other concerns:
  – Security
  – Performance
  – Shared access
  – Integrity

Note: High processing speed means minimal access time, which may be achieved by minimizing the number and complexity of logically desirable relationships.

Conflicting Goals

A design that meets all logical requirements and design conventions is an important goal. However, if this perfect design fails to meet the customer's transaction-speed and/or information requirements, the designer will not have done a proper job from the end user's point of view. Compromises are a fact of life in the real world of database design.

Even while focusing on the entities, attributes, relationships, and constraints, the designer should begin thinking about end-user requirements such as performance, security, shared access, and data integrity. The designer must consider processing requirements and verify that all update, retrieval, and deletion options are available.

Finally, a design is of little value unless the end product is capable of delivering all specified query and reporting requirements.

You are likely to discover that even the best design process produces an ERD that requires further changes mandated by operational requirements. Such changes should not discourage you from using the process.

ER modeling is essential in the development of a sound design that is capable of meeting the demands of adjustment and growth.

THE EXTENDED ENTITY RELATIONSHIP MODEL

The extended entity relationship model is simply called the "EERM". It is the extended form of the entity relationship diagram (ERD); that is, it adds some additional features to the ERD:

- Specialization
- Generalization
- Entity clustering
- Crow's foot notation (cardinalities)

Recall what an ERD is: ERD means entity relationship diagram. It consists of entities, attributes, relationships, cardinalities, and constraints.

Example:

Let us discuss about the features

SPECIALIZATION:

The relation from the super class (parent) entities to the sub class (child) entities is called "SPECIALIZATION". It is the top-down approach of forming subgroup entities.

Example:

[Figure: a specialization hierarchy in which SCHOOL (sname, saddress) is the super class, with sub-group entities TEACHER (tid, tname) and STUDENTS (sid, sname), linked by 'consists of', 'has', and 'teaches' relationships.]


Specialization

Generalization: Generalization is the reverse of specialization. The relationship from the child (sub class) entities up to the parent (super class) entity is called GENERALIZATION. It is the bottom-up approach to forming sub-group entities.

The combination of specialization and generalization is called a CLASS HIERARCHY. The ISA relationship is denoted with a TRIANGLE, i.e.:

A discriminator constraint can be attached to the ISA relationship; it is also called the COMPLETENESS CONSTRAINT. It is of 2 types:

1. Total participation
2. Partial participation

[Figure: a super class connected to its sub class through an ISA triangle.]


Total completeness means that every super type occurrence must belong to at least one sub type. Total participation is denoted with a double line (just as a weak entity is drawn with a double rectangle).

Partial completeness means that there may be some super type occurrences that are not members of any sub type. Partial participation is denoted with a single line (just as a strong entity is drawn with a single rectangle).

ENTITY CLUSTERING: Entity clustering is an advanced technique used to group several related entities into one entity. Clustering means grouping similar entities; we group them to remove confusion and reduce the complexity of the ER diagram, including its relationships and their cardinalities.

The clustered entity is also called a VIRTUAL ENTITY or ABSTRACT ENTITY.

An ER diagram can be drawn in 2 ways:

1. With clustering
2. Without clustering

A diagram with clustering is much simpler than one without clustering.



EXAMPLE WITHOUT CLUSTERING:

[Figure: an unclustered ER diagram in which SCHOOL consists of CLASS ROOM (roomno), the HEAD MASTER orders the TEACHER, the school appoints the HEAD MASTER, the TEACHER teaches STUDENTS (id, name, address), and the school has a BUILDING (name, colour), with every entity, relationship, and cardinality drawn separately.]

WITH CLUSTERING:

[Figure: the same design with the related entities grouped into one clustered (virtual) entity, leaving a far simpler diagram.]


Cardinalities:

Cardinality gives the range of entity occurrences that participate in a relationship. It can be written in normal notation (1:1, 1:M, M:N) or in crow's foot notation.

Entity integrity: selecting primary keys - learning flexible database design:

A key is a column (attribute), or a combination of columns, of a database table that is used to identify records. For example, in a table with id, name, and address columns, the id column can serve as a key because it identifies each record in the table. The following are the various types of keys available in a DBMS.


1. A simple key contains a single attribute.

2. A composite key is a key that contains more than one attribute.

3. A candidate key is an attribute (or set of attributes) that uniquely identifies a row. A candidate key must possess the following properties:

o Unique identification - For every row, the value of the key must uniquely identify that row.

o Non-redundancy - No attribute in the key can be discarded without destroying the property of unique identification.

4. A primary key is the candidate key which is selected as the principal unique identifier. Every relation must contain a primary key. The primary key is usually the key selected to identify a row when the database is physically implemented. For example, a part number is selected instead of a part description.

5. A superkey is any set of attributes that uniquely identifies a row. A superkey differs from a candidate key in that it does not require the non redundancy property.

6. A foreign key is an attribute (or set of attributes) that appears (usually) as a non-key attribute in one relation and as a primary key attribute in another relation. I say usually because it is possible for a foreign key to also be the whole or part of a primary key:

o A many-to-many relationship can only be implemented by introducing an intersection or link table which then becomes the child in two one-to-many relationships. The intersection table therefore has a foreign key for each of its parents, and its primary key is a composite of both foreign keys.


o A one-to-one relationship requires that the child table has no more than one occurrence for each parent, which can only be enforced by letting the foreign key also serve as the primary key.

7. A semantic or natural key is a key for which the possible values have an obvious meaning to the user or the data. For example, a semantic primary key for a COUNTRY entity might contain the value 'USA' for the occurrence describing the United States of America. The value 'USA' has meaning to the user.

8. A technical or surrogate or artificial key is a key for which the possible values have no obvious meaning to the user or the data. These are used instead of semantic keys for any of the following reasons: 

o When the value in a semantic key is likely to be changed by the user, or can have duplicates. For example, on a PERSON table it is unwise to use PERSON_NAME as the key as it is possible to have more than one person with the same name, or the name may change such as through marriage.

o When none of the existing attributes can be used to guarantee uniqueness. In this case adding an attribute whose value is generated by the system, e.g. from a sequence of numbers, is the only way to provide a unique value. Typical examples would be ORDER_ID and INVOICE_ID. The value '12345' has no meaning to the user as it conveys nothing about the entity to which it relates.
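The key types above can be seen in action. Below is a minimal sketch using SQLite from Python, chosen only so the example is runnable (the notes elsewhere use Oracle syntax); all table and column names are invented for illustration. It shows a surrogate primary key, foreign keys, and a composite primary key on an intersection table, which is exactly the pattern described above for implementing a many-to-many relationship.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable FK enforcement in SQLite

conn.execute("""CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,   -- surrogate (artificial) key
    name       TEXT NOT NULL)""")
conn.execute("""CREATE TABLE course (
    course_id  INTEGER PRIMARY KEY,
    title      TEXT NOT NULL)""")

# Intersection (link) table: its primary key is a composite of the
# two foreign keys, one per parent relation.
conn.execute("""CREATE TABLE enrolment (
    student_id INTEGER REFERENCES student(student_id),
    course_id  INTEGER REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id))""")

conn.execute("INSERT INTO student VALUES (1, 'Asha')")
conn.execute("INSERT INTO course VALUES (10, 'DBMS')")
conn.execute("INSERT INTO enrolment VALUES (1, 10)")

# A second enrolment of the same pair violates the composite primary key.
try:
    conn.execute("INSERT INTO enrolment VALUES (1, 10)")
except sqlite3.IntegrityError as e:
    print("duplicate rejected:", e)

# A foreign key value must reference an existing parent row.
try:
    conn.execute("INSERT INTO enrolment VALUES (99, 10)")
except sqlite3.IntegrityError as e:
    print("orphan rejected:", e)
```

Both failed inserts are rejected by the DBMS itself, so only the one valid enrolment row remains.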


Data Modeling Checklist: The data modeling checklist is a combination of the business rules and the data models.


UNIT-III


RELATIONAL DATA BASE MODEL:

A logical view of data - keys - integrity rules - relational set operators - the data dictionary and the system catalog - relationships within the relational database - data redundancy revisited - indexes - Codd's relational database rules.

A Logical View of Data - Keys - Integrity Rules

Types of Data Constraints:

There are 2 types of data constraints that can be applied to data being inserted into an Oracle table:

1. Input/Output constraints
2. Business rule constraints

Input Output Constraints:

These are

1. Primary Key Constraints
2. Foreign Key Constraints
3. Candidate Key Constraints
4. Composite Key Constraints
5. Super Key Constraints
6. Partial Key Constraints
7. Unique Key Constraints

1. Primary Key Constraints :

A primary key is one (or more) columns that uniquely identifies the tuples (rows) of a table. A primary key allows neither NULL values nor duplicates.

Primary key is divided into 2 types.

1. Simple Primary key.


2. Composite Primary key.

1. Simple primary key: a table whose primary key is a single field has a simple primary key.

2. Composite primary key: a table whose primary key is made up of more than one field has a composite primary key. The primary key always acts as the parent.

Ex: create table student (st_name varchar2(10), st_id varchar2(10) primary key);

2. Foreign Key Constraints:

A foreign key is one (or more) columns in a table that references the primary key of another table; it is derived from the primary key. Every foreign key value must match an existing primary key value in the referenced table. A foreign key allows NULL values, and (unlike a primary key) its values may repeat. The combination of a primary key and a foreign key gives the referential integrity constraint.

Ex: create table emp1 (emp_name varchar2(10), emp_id varchar2(10) references employee(emp_id));

3.Candidate Key Constraints:

A candidate key is a set of fields that uniquely identifies the tuples (rows) of a table. A relation may have several candidate keys, and one of them is chosen as the primary key. A candidate key doesn't allow duplicates or NULL values.

4.Composite Key Constraints:

A composite key is a key made up of more than one field. When it is chosen as the primary key, more than one field together acts as the primary key of the table.


5.Super key Constraints:

It can be defined as set of pairs of columns which uniquely identifies the rows (or)tuples from a table.

Ex: create table emp1(emp name varchar(10),emp_id varchar2(10) super key)

It allows NULL values & doesn’t allow duplication.

6.Partial Key Constraints:

It can be defined as one or more columns which identifies the weak entity.

7.Unique Key Constraints:

A unique key is one (or more) columns whose values must be distinct across the rows of a table. It allows NULL values but doesn't allow duplicates.
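The difference between a primary key and a unique key can be sketched as follows, using SQLite from Python purely so the example is runnable (the emp table and its columns are invented). A UNIQUE column rejects duplicate values but, unlike a primary key, accepts NULLs, even more than once (this is the behaviour in SQLite and Oracle; some systems differ).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_id INTEGER PRIMARY KEY, email TEXT UNIQUE)")

conn.execute("INSERT INTO emp VALUES (1, 'a@x.com')")
conn.execute("INSERT INTO emp VALUES (2, NULL)")   # NULL is allowed...
conn.execute("INSERT INTO emp VALUES (3, NULL)")   # ...even repeatedly

try:
    conn.execute("INSERT INTO emp VALUES (4, 'a@x.com')")  # duplicate value
except sqlite3.IntegrityError:
    print("duplicate email rejected")
```

The duplicate non-NULL value is rejected, so three rows remain.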

Business Rule Constraints:

It is also called a general constraint. General constraints are classified into 2 types:

1. Table constraints
2. Assertion constraints

General Constraints:

A general constraint is a constraint that applies a business rule in general.

Ex: a student must be older than 18 to be eligible to sit in the B.Tech class, so the end user can't violate this constraint.

1. Table Constraints:

The constraints which is applicable to particular table is called “Table Constraints”.

2. Assertion Constraints:


The constraint which is applicable to “Collection of Tables” is called Assertion Constraints.

--------------------------------------------------------------------------------------------------------------------------------------------------------

RELATIONAL SET OPERATORS

Relational set operators work properly only if the relations are union-compatible: the names of the relation attributes must be the same and their data types must be identical.

There are 4 relational set operators:

1. UNION
2. UNION ALL
3. INTERSECT
4. MINUS

1.UNION:

It combines rows from two or more queries without including duplicate rows.

QUERY:

select cus_Lname,cus_Fname,cus_initial,cus_Areacode

From customer

UNION

Select cus_Lname,cus_Fname,cus_initial,cus_Areacode

From customer1;

It can be used to unite more than two queries.

EXAMPLE:

customer:               customer1:

Cus_name  Cus_id        Cus_name  Cus_id
A         1             D         4
B         2             A         1
C         3             B         2
                        C         3

RESULT:

Cus_name  Cus_id
A         1
B         2
C         3
D         4

UNION ALL:

It produces a relation that retains duplicate rows. It can be used to unite more than two queries. It allows redundancy.

QUERY: select cus_Lname,cus_Fname,cus_initial,cus_Areacode

From customer

UNION ALL

select cus_Lname,cus_Fname,cus_initial,cus_Areacode

From customer1;

EXAMPLE:

customer:               customer1:

Cus_name  Cus_id        Cus_name  Cus_id
A         1             D         4
B         2             A         1
C         3             B         2
                        C         3

Result:

Cus_name  Cus_id
A         1
B         2
C         3
D         4
A         1
B         2
C         3

INTERSECT:

It combines rows from two queries, returning only the rows that appear in both sets.

QUERY:

select cus_Lname,cus_Fname,cus_initial,cus_Areacode
From customer
INTERSECT
select cus_Lname,cus_Fname,cus_initial,cus_Areacode
From customer1;

EXAMPLE:

customer:               customer1:

Cus_name  Cus_id        Cus_name  Cus_id
A         1             D         4
B         2             A         3
C         3             B         2
D         4             C         1
E         5

RESULT:

Cus_name  Cus_id
B         2
D         4

MINUS:

It combines rows from two queries, returning only the rows that appear in the first set but not in the second.

QUERY:

select cus_Lname,cus_Fname,cus_initial,cus_Areacode
From customer
MINUS
select cus_Lname,cus_Fname,cus_initial,cus_Areacode
From customer1;

EXAMPLE:

customer:               customer1:

Cus_name  Cus_id        Cus_name  Cus_id
A         1             D         4
B         2             A         1
C         3             B         2
E         5             C         3
F         6

RESULT:

Cus_name  Cus_id
E         5
F         6
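All four set operators can be demonstrated on two union-compatible customer tables. The sketch below uses SQLite from Python purely so it is runnable; SQLite spells MINUS as EXCEPT, and the sample rows are modelled on the examples above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer  (cus_name TEXT, cus_id INTEGER);
CREATE TABLE customer1 (cus_name TEXT, cus_id INTEGER);
INSERT INTO customer  VALUES ('A',1),('B',2),('C',3);
INSERT INTO customer1 VALUES ('D',4),('A',1),('B',2),('C',3);
""")

def q(sql):
    return conn.execute(sql).fetchall()

# UNION drops duplicates; UNION ALL keeps them; INTERSECT keeps common
# rows; EXCEPT (MINUS) keeps rows only in the first query's result.
union     = q("SELECT * FROM customer UNION     SELECT * FROM customer1")
union_all = q("SELECT * FROM customer UNION ALL SELECT * FROM customer1")
inter     = q("SELECT * FROM customer INTERSECT SELECT * FROM customer1")
minus     = q("SELECT * FROM customer1 EXCEPT   SELECT * FROM customer")

print(len(union), len(union_all), len(inter), minus)
```

With these rows, UNION yields 4 distinct rows, UNION ALL all 7, INTERSECT the 3 common rows, and the MINUS of customer1 against customer only the row for D.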

-----------------------------------------------------------------------------------------------


DATA DICTIONARY AND SYSTEM CATALOG:

DATA DICTIONARY:

The data dictionary provides a detailed accounting of all tables found within the user- or designer-created database. Thus the data dictionary contains at least all of the attribute names and characteristics for each table in the system. It therefore contains metadata, or data about data. It is sometimes described as "the database designer's database" because it records the design decisions about tables and their structures.

SYSTEM CATALOG:

Like the data dictionary, the system catalog contains metadata. It can be described as a



detailed system data dictionary that describes all objects within the database. Because the system catalog contains all required data dictionary information, the terms system catalog and data dictionary are often used interchangeably. The system catalog is actually a system-created database whose tables store the characteristics and contents of the user- or designer-created database, and it can therefore be queried like any other user- or designer-created database.
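Because the catalog is stored as ordinary tables, it really can be queried like any other database. A small runnable sketch: SQLite's catalog table is called sqlite_master (Oracle instead exposes views such as USER_TABLES, and the SQL standard defines INFORMATION_SCHEMA); the customer table created here is just an example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cus_id INTEGER PRIMARY KEY, cus_name TEXT)")

# Query the system catalog itself with plain SQL: it returns metadata
# (object type and name) about the user-created table.
rows = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE type = 'table'").fetchall()
print(rows)
```

The query returns one catalog row describing the customer table, i.e. data about data.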

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Relationship In Relational Database

Relation: A relation is a data structure that consists of a set of attributes and a set of tuples that share the same type. A relation has zero or more tuples.

For example, to store the data of students, a relation is represented by the 'Student' table. This table has the column attributes 'Roll-No', 'Name' and 'City'. Each column contains values of its attribute, e.g., the 'Name' column contains only the names of students. If there are four rows in the table, then records of four students are represented.

Therefore, in relational data model terminology, we define the following terms as:

Attributes: Each column of a relation has a heading, called 'field name'. The named columns of the relation are called attributes. In the above "Student" relation, the attribute names are 'Roll-No', 'Name' and 'City'.

Degree: The number of attributes in a relation is called its degree. In the "Student" relation, the degree is 3.

Tuple: Each row of the relation is called tuple. Each tuple consists of set of values that are related to a particular instance of an entity type.

Cardinality: The number of tuples (rows or records) in the relation is called cardinality. In the 'Student' relation, there are 4 tuples, so the cardinality of the relation is 4.

Domain: The range of acceptable values for an attribute is called the domain for the attribute. Two or more attributes may have the same domain.


A relation that contains a minimum amount of redundancy and allows users to insert, modify, and delete rows without errors is known as a well-structured relation.

A relation schema is used to describe a relation. It is made up of a relation name and a list of attributes. For example, if R is a relation name and its attributes are A1, A2, ..., An, then the relation schema of R is denoted as R(A1, A2, ..., An).

An example of relation schema for a relation of degree 3, which describes the students of university, is given below.

STUDENT (Roll-No, Name, City)

Where 'Student' represents the name of relation (or table) followed by names of attributes (in parenthesis) in the relation.

Properties of Relations:

1. There is no duplication of tuples in a relation.
2. All attribute values of a relation are atomic.
3. All values of an attribute must come from the same domain.
4. Tuples of a relation are unordered from top to bottom.
5. Attributes of a relation are unordered from left to right.
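Degree and cardinality can be checked programmatically. A sketch using SQLite from Python, with invented sample rows for the STUDENT relation described above (degree = number of attributes, cardinality = number of tuples):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (roll_no INTEGER, name TEXT, city TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, 'Ravi', 'Proddatur'), (2, 'Sita', 'Kadapa'),
                  (3, 'Arun', 'Tirupati'), (4, 'Lata', 'Nellore')])

# Degree: count the attributes (columns) of the relation.
degree = len(conn.execute("SELECT * FROM student").description)
# Cardinality: count the tuples (rows) of the relation.
cardinality = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]

print(degree, cardinality)  # 3 4
```

This matches the text: the Student relation has degree 3 and, with four rows inserted, cardinality 4.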

The relationship between any two entities can be one-to-one, one-to-many, many-to-one, or many-to-many.

One-to-many relationships are probably the most common type. An invoice includes many products. A salesperson creates many invoices. These are both examples of one-to-many relationships.

One -to-One:

Eg:

1. In South Africa a person can have one and only one Identity Document.  Also, a specific Identity Document can belong to one and only one person.


2. A voter can cast only one vote in an election.  A ballot paper can belong to only one voter.  So there will be a 1:1 relationship between a Voter and a Ballot Paper.

Let's look at the Person - ID Document example:

The two entities could translate into the following two tables:

Person Table: ID, Surname, Name, Second Name, Initials, Date Of Birth

ID_Document Table: IDNumber, DateIssued, ExpiryDate, IssuedLocation, CheckedBy

In the Person table, the column ID will be the Primary Key.  In the ID_Document table, the IDNumber will be the Primary key.

One -to-Many:

Eg:

Master-detail.  You have a master (or header) record with many detail records.  For example an order.  There will be a master record with the order date, person


placing the order, etc.  And then detail records of everything ordered.  The master record will have many details, and the detail will have only 1 master.

Let’s look at the Master - Detail:

The two entities could translate into the following two tables:

Master Table: Order Number, Date, Client Code, Captured By

Detail Table: Order Number, Line Number, Item Code, Quantity

In the Master table, the Order Number is the primary key. In the Detail table, Order Number is the foreign key. The primary key in the Detail table will be a combination of Order Number and Line Number.

Many-to-Many:

Eg:


Student - professor.  A student will have one or more professors.  The same professor will have lots of students.

Let’s look at the Student - Professor :

Now, because a many-to-many relationship cannot be implemented directly, we change this by adding a junction entity in the middle. Hint: the name of the junction entity can be a combination of the other 2 entities, in this example: Student Professor.

Notice that the Many- side always points to the junction entity.

The three entities could translate into the following three tables:

Student Table: Student Number, Name, Address, Home Tel Number

Student Professor Table: Student Number, Prof Number

Professor Table: Prof Number, Name, Office Number, Work Tel Number
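The three tables above can be sketched and queried as follows, using SQLite from Python so the example is runnable (the sample students and professors are invented). The join through the junction table is what resolves the many-to-many relationship into two one-to-many relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student   (StudentNumber INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Professor (ProfNumber    INTEGER PRIMARY KEY, Name TEXT);
-- Junction table: child in two one-to-many relationships, with a
-- composite primary key made of the two foreign keys.
CREATE TABLE StudentProfessor (
    StudentNumber INTEGER REFERENCES Student(StudentNumber),
    ProfNumber    INTEGER REFERENCES Professor(ProfNumber),
    PRIMARY KEY (StudentNumber, ProfNumber));
INSERT INTO Student   VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO Professor VALUES (10, 'Rao'), (20, 'Devi');
INSERT INTO StudentProfessor VALUES (1, 10), (1, 20), (2, 10);
""")

# Which students does professor 10 teach? Join through the junction.
names = [r[0] for r in conn.execute("""
    SELECT s.Name
    FROM Student s
    JOIN StudentProfessor sp ON sp.StudentNumber = s.StudentNumber
    WHERE sp.ProfNumber = 10
    ORDER BY s.Name""")]
print(names)  # ['Asha', 'Ravi']
```

Student 1 has two professors and professor 10 has two students, yet every table holds only one-to-many links.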

-------------------------------------------------------------------------------------------------


DATA REDUNDANCY REVISITED

Data redundancy leads to data anomalies. Those anomalies can destroy the effectiveness of the database. It is however possible to control data redundancies by using common attributes also known as foreign keys, that are shared by tables. The proper use of foreign keys is crucial to data redundancy control; they however do not eliminate data redundancies because they can be repeated many times. There are also instances where data redundancy is necessary.

Problems caused by redundancy:

Redundancy can lead to several problems:

1. Redundant storage
2. Insertion anomalies
3. Update anomalies
4. Deletion anomalies

Redundant Storage:

The same information is stored repeatedly; this is called redundant storage.

Insertion Anomalies :

It may be impossible to insert certain information unless some unrelated (waste) information is inserted into the relation along with it.

Update Anomalies:

If only one copy of such repeated data is updated, the copies disagree, which leads to data inconsistency.

Deletion Anomalies:

It may be impossible to delete certain information without also deleting some other, still meaningful information.
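A small sketch of an update anomaly, using SQLite from Python (the emp table and its rows are invented for illustration): the department name is stored redundantly with every employee, so updating only one copy leaves the table inconsistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_id INTEGER, dept_name TEXT)")
# The department name 'CSE' is stored redundantly, once per employee.
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [(1, 'CSE'), (2, 'CSE'), (3, 'CSE')])

# Update just one of the repeated copies, e.g. a rename applied to
# only one row instead of all of them.
conn.execute("UPDATE emp SET dept_name = 'CSE & IT' WHERE emp_id = 1")

variants = {r[0] for r in conn.execute("SELECT DISTINCT dept_name FROM emp")}
print(variants)  # two different spellings of one department: inconsistency
```

Normalizing the design (a separate dept table referenced by a foreign key) would store the name once, so no partial update could occur.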

Indexes :


An index is a special access structure in a DBMS, provided as an additional feature. The primary key itself acts as an index.

A unique index doesn't allow duplication; indexes are used for quick access to the columns of a table.

Syntax for normal index:

Create index emp_x on emp (emp_name,emp_id);

Syntax for unique index:

Create unique index emp_y on emp1 (emp_name, emp_id1);

Indexes are of 2 types:

1. Normal index
2. Unique index

A unique index is an index in which the index key can have only one pointer value or row associated with it.

A table can have many indexes, but each index is associated with only one table.
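The index syntax above can be exercised directly. The sketch below uses SQLite from Python for runnability (SQLite happens to accept the same CREATE INDEX / CREATE UNIQUE INDEX form; the table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_name TEXT, emp_id INTEGER)")

conn.execute("CREATE INDEX emp_x ON emp (emp_name, emp_id)")  # normal index
conn.execute("CREATE UNIQUE INDEX emp_y ON emp (emp_id)")     # unique index

conn.execute("INSERT INTO emp VALUES ('a', 1)")
try:
    conn.execute("INSERT INTO emp VALUES ('b', 1)")  # duplicate emp_id
except sqlite3.IntegrityError:
    print("unique index enforced")

# One table, many indexes; each index belongs to exactly one table.
idx = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
print(idx)  # ['emp_x', 'emp_y']
```

The unique index rejects the second row with emp_id 1, while the normal index only speeds up lookups without restricting values.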

CODD'S RULES:

In 1985, E. F. Codd published his rules: the minimum rules a system must satisfy to be considered a relational DBMS. Every system claiming to be relational should satisfy these rules.

Information. All information in a relational database must be logically represented as column values in rows within tables.

Guaranteed Access. Every value in a table is guaranteed to be accessible through a combination of table name, primary key value, and column name.

Systematic Treatment of Nulls. Nulls must be represented and treated in a systematic way, independent of data type.

Dynamic On-Line Catalog Based on the Relational Model. The metadata must be stored and managed as ordinary data, that is, in tables within the database. Such data must be available to authorized users using the standard database relational language.


Comprehensive Data Sublanguage. The relational database may support many languages. However, it must support one well-defined declarative language with support for data definition, view definition, data manipulation, integrity constraints, authorization, and transaction management.

View Updating. Any view that is theoretically updatable must be updatable through the system.

High-Level Insert, Update and Delete. The database must support set-level inserts, updates, and deletes.

Physical Data Independence. Application programs and ad hoc facilities are logically unaffected when physical access methods or storage structures are changed.

Logical Data Independence. Application programs and ad hoc facilities are logically unaffected when changes are made to the table structures that preserve the original table values.

Integrity Independence. All relational integrity constraints must be definable in the relational language and stored in the system catalog, not at the application level.

Distribution Independence. The end users and application programs are unaware of and unaffected by the data location.

No Subversion. If the system supports low-level access to the data, there must not be a way to bypass the integrity rules of the database.

Rule Zero. All preceding rules are based on the notion that in order for a database to be considered relational, it must use its relational facilities exclusively to manage the database.


UNIT-4

Structured query language: Introduction to SQL - Data definition commands - Data manipulation commands - Advanced SELECT queries - Virtual tables - Creating a view - Joining database tables

Advanced SQL: Relational set operators - SQL join operators - Subqueries and correlated queries - SQL functions - Oracle sequences - Updatable views - Procedural SQL - Embedded SQL

Introduction to SQL:

SQL means Structured Query Language. SQL was developed by IBM. SQL is based on relational algebra and tuple relational calculus, a branch of mathematics. The first name of SQL was SEQUEL, which means Structured English Query Language. In 1968, IBM used IMS to store and retrieve data; it was similar to the CODASYL approach, and SQL was later developed as an improvement over such systems. Here IMS means Information Management System.


Data definition commands:

These are the data definition (DDL) commands:

CREATE TABLE tablename (fieldname datatype(size));
ALTER TABLE tablename ADD (fieldname datatype(size));
ALTER TABLE tablename MODIFY (fieldname datatype(size));
ALTER TABLE tablename DROP COLUMN columnname;
DROP TABLE tablename;
TRUNCATE TABLE tablename;
RENAME tablename1 TO tablename2;

Data manipulation commands:

INSERT INTO tablename VALUES ('&fieldname');
UPDATE tablename SET columnname = value WHERE columnname = 'value';
DELETE FROM tablename;
SELECT * FROM tablename;
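The commands above, run end to end. The sketch uses SQLite from Python for runnability (SQLite omits Oracle's '&' substitution variables and TRUNCATE, so those are left out, and the emp table is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_name TEXT)")             # CREATE TABLE
conn.execute("ALTER TABLE emp ADD COLUMN emp_sal INTEGER")   # ALTER ... ADD
conn.execute("INSERT INTO emp VALUES ('Ravi', 100)")         # INSERT
conn.execute("UPDATE emp SET emp_sal = 200 "                 # UPDATE
             "WHERE emp_name = 'Ravi'")

rows = conn.execute("SELECT * FROM emp").fetchall()          # SELECT
print(rows)  # [('Ravi', 200)]

conn.execute("DELETE FROM emp")                              # DELETE
conn.execute("DROP TABLE emp")                               # DROP TABLE
```

After the update, the single row reflects the new salary; the DELETE and DROP then remove the data and the table itself.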

Advanced select queries:

Syntax to display a column from a table:
Select columnname from tablename;
Ex: select empname from emp;

Virtual tables:

Creating a view:

View: A view is a virtual table based on a SELECT query. The query can contain columns, computed columns, aliases, and aggregate functions from one or more tables. The tables on which a view is based are called BASE TABLES. A view is stored in the schema as a named query over its tables. Views provide security to the database.


Views are used by higher-authority people such as the administrator for security purposes. A view acts as a duplicate of the original table and is managed by the database administrator. Changes made in the original table are reflected in the views, so the administrators see the modifications done in the tables; working through the view shields users from the structure of the original tables.

Views are dynamically updated. We can create, drop, and delete from views; the syntax is shown below:

CREATE:
Create view viewname as select column1, column2, ... from tablename;
Ex: Create view emp1 as select emp_name, emp_id from emp;

DROP:
Drop view viewname;
Ex: Drop view emp1;

DELETE:
Delete from viewname where columnname = 'value';
Ex: Delete from emp1 where emp_id = '519';
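A runnable sketch of the view syntax, using SQLite from Python (the emp table and its rows are invented). Note how the view hides the salary column and automatically reflects rows inserted into the base table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_name TEXT, emp_id INTEGER, emp_sal INTEGER)")

# The view exposes only two of the three base-table columns.
conn.execute("CREATE VIEW emp1 AS SELECT emp_name, emp_id FROM emp")

# Insert into the BASE table after creating the view...
conn.execute("INSERT INTO emp VALUES ('Ravi', 519, 100)")

# ...and the view already shows the new row (views are dynamic),
# with the salary column hidden from view users.
rows = conn.execute("SELECT * FROM emp1").fetchall()
print(rows)  # [('Ravi', 519)]

conn.execute("DROP VIEW emp1")  # dropping the view leaves emp intact
```

Dropping the view removes only the stored query; the base table and its data survive.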

Joining data base tables:

Joins: Joins are used to join tables on their common attributes. A join is performed when data is retrieved from more than one table at a time. To join tables, you simply list the tables in the FROM clause of the SELECT statement; the DBMS forms the Cartesian product of every table in the FROM clause, and the join condition restricts it to the matching rows. In relational algebra the join has its own symbol and is related to the mathematical Cartesian product, but in relational calculus JOIN is used directly.

Joins are of 3 types:

1. Natural join
2. Outer join
3. Theta join


Natural join: A natural join performs the operation on both tables, matching them on their common attribute. It is denoted with the symbol ⋈. Ex: R1 ⋈ R2 with R1 = EMP, R2 = EMP WORK.

Syntax: Select * from table1 natural join table2;

It can involve some loss of data (unmatched rows are dropped), as shown in the following example.

EMP:

name  street  city
A     X       1
B     Y       2
C     Z       3
D     W       4

EMP WORK:

name  branch  sal
A     CSE     10
B     EEE     20
C     ECE     30
D     IT      40
E     CIVIL   50

RESULT:

name  street  city  branch  sal
A     X       1     CSE     10
B     Y       2     EEE     20
C     Z       3     ECE     30
D     W       4     IT      40

To overcome this drawback (the row for E is lost), we use the outer join.


Outer join: It is classified into 3 types:

1. Left outer join
2. Right outer join
3. Full outer join

Left outer join: It preserves the left table. It displays the matching values on the common column, plus the remaining rows of the left table (padded with nulls).

Syntax: Select * from T1 left outer join T2 on T1.C1 = T2.C1;

There is no loss of information from the left table. Using the same EMP and EMP WORK tables as above:


RESULT:

name  street  city  branch  sal
A     X       1     CSE     10
B     Y       2     EEE     20
C     Z       3     ECE     30
D     W       4     IT      40

Right outer join: It is one type of outer join It performs operation on the right table It is denoted by the symbol ‘ ‘ It will display the matching values in the common column and the remaining

values in the right table

Syntax:Select * from T1 right outer join T2 where T1.C1=T2.C1;

It does not have the loss of information in the right table

EMP:

name  street  city
A     X       1
B     Y       2
C     Z       3
D     W       4

EMP WORK:

NAME  BRANCH  SAL
A     CSE     10
B     EEE     20
C     ECE     30
D     IT      40
E     CIVIL   50

RESULT:

NAME  STREET  CITY  BRANCH  SAL
A     X       1     CSE     10
B     Y       2     EEE     20
C     Z       3     ECE     30
D     W       4     IT      40
E     NULL    NULL  CIVIL   50

Full outer join:
- It is one type of outer join that preserves both tables.
- It is denoted by the symbol '⟗'.
- It displays the matching rows plus the unmatched rows of both tables.

Syntax: SELECT * FROM T1 FULL OUTER JOIN T2 ON T1.C1 = T2.C1;

No information from either table is lost.

EMP:

name  street  city
A     X       1
B     Y       2
C     Z       3
D     W       4

EMP WORK:

NAME  BRANCH  SAL
A     CSE     10
B     EEE     20
C     ECE     30
D     IT      40
E     CIVIL   50


RESULT:

NAME  STREET  CITY  BRANCH  SAL
A     X       1     CSE     10
B     Y       2     EEE     20
C     Z       3     ECE     30
D     W       4     IT      40
E     NULL    NULL  CIVIL   50
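Not every engine supports FULL OUTER JOIN directly (older SQLite versions, for instance, implement only LEFT OUTER JOIN). A common workaround, sketched here with Python's sqlite3 and the same example tables, emulates the full outer join as a UNION of two left joins:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE emp (name TEXT, street TEXT, city TEXT);
CREATE TABLE emp_work (name TEXT, branch TEXT, sal INTEGER);
INSERT INTO emp VALUES ('A','X','1'),('B','Y','2'),('C','Z','3'),('D','W','4');
INSERT INTO emp_work VALUES ('A','CSE',10),('B','EEE',20),('C','ECE',30),
                            ('D','IT',40),('E','CIVIL',50);
""")
# full outer join = (emp LEFT JOIN emp_work) UNION (emp_work LEFT JOIN emp);
# UNION removes the duplicated matched rows.
full = con.execute("""
    SELECT e.name, e.street, e.city, w.branch, w.sal
      FROM emp e LEFT OUTER JOIN emp_work w ON e.name = w.name
    UNION
    SELECT w.name, e.street, e.city, w.branch, w.sal
      FROM emp_work w LEFT OUTER JOIN emp e ON e.name = w.name
""").fetchall()
for row in full:
    print(row)        # five rows; E appears as ('E', None, None, 'CIVIL', 50)
```

The unmatched employee E survives with NULLs (Python's None) in the columns coming from EMP, matching the RESULT table above.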

Theta join:
- It is denoted by the symbol '⋈θ'.
- It is used to place a condition θ on the join; only the rows that satisfy the condition appear in the result.
- It can be applied to a single table or a pair of tables.

Ex: sal < 20 in EMP1 gives:

name  branch  sal
A     CSE     10

Relational set operators: These are of 4 types:
1. Union
2. Union all
3. Intersect
4. Minus

They work properly only if the relations are union compatible: the relations must have the same attributes, with matching names and identical data types. They operate column-wise on whole rows.


UNION: It combines rows from 2 or more queries without including duplicate rows. It can be used to combine more than 2 queries.
Ex:
SELECT e_name, e_id FROM emp
UNION
SELECT e_name, e_id FROM emp1;

Union all: It produces a relation that retains duplicate rows. It can also combine more than 2 queries.
Ex:
SELECT e_name, e_id FROM emp
UNION ALL
SELECT e_name, e_id FROM emp1;

Intersect: It combines rows from 2 queries and returns only the rows that appear in both sets; duplicates are not allowed.
Ex:
SELECT e_name, e_id FROM emp
INTERSECT
SELECT e_name, e_id FROM emp1;

Minus: It is also called DIFFERENCE or SET DIFFERENCE. It combines rows from 2 queries and returns only the rows that appear in the first set but not in the second set.
Ex:
SELECT e_name, e_id FROM emp
MINUS
SELECT e_name, e_id FROM emp1;

Ex for relational set operators:

EMP
name  id
A     1
B     2
C     3
D     4
E     5

EMP1
name  id
A     1
B     2
F     6
G     7
H     8

UNION

NAME  ID
A     1
B     2
C     3
D     4
E     5
F     6
G     7
H     8


INTERSECT

NAME  ID
A     1
B     2

UNION ALL

NAME  ID
A     1
A     1
B     2
B     2
C     3
D     4
E     5
F     6
G     7
H     8

MINUS

NAME  ID
C     3
D     4
E     5


QUERIES USING SQL FUNCTIONS:

Sub queries and correlated queries:

Sub queries:
- A sub query is also called a nested query.
- A nested query is a query that appears inside another SQL statement; simply, a SELECT within a SELECT is a nested query.
- When one SELECT statement is embedded in another, the embedded SELECT is called the inner query or sub query, and the outer query is called the main query.
- Nested queries are used to design complex queries.

Ex: Find all the employees who have the same post as 'RAJ':
SELECT post FROM employee WHERE e_name = 'RAJ';
Output: MANAGER

Note: With ordinary queries each intermediate result must be displayed separately, whereas a sub query lets us obtain the final result with a single query.

Ex:

To find all employees who have the same SAL as MANOJ

SELECT * FROM employee WHERE sal = (SELECT sal FROM employee WHERE name = 'MANOJ');

Output:
Name   Branch     Post  Sal
MANOJ  HYDERABAD  G.M   2L
RAJ    BANGLORE   M.D   2L


CORRELATED SUBQUERY:
- In a correlated subquery the inner query is executed once for each row of the outer query.
- It commonly uses the comparison operators EXISTS and NOT EXISTS, which test, for each row of the outer query, whether the inner query returns any rows.
- Here the sub query depends on the main query.

Ex: Find the names of the sailors who have reserved boat number 103:

SELECT s.s_name FROM sailors s WHERE EXISTS (SELECT * FROM reserves r WHERE r.sid = s.s_id AND r.bid = 103);

Ex:

Find the names of the sailors who have not reserved boat number 103:

SELECT s.s_name FROM sailors s WHERE NOT EXISTS (SELECT * FROM reserves r WHERE r.sid = s.s_id AND r.bid = 103);

Oracle sequences:
- A sequence is an object used to generate sequential values, which are often assigned to primary keys.
- A sequence is not physically linked with a table; it is the developer's responsibility to use a sequence only with the primary key it was defined for.

Syntax:
CREATE SEQUENCE seq_name
INCREMENT BY n
START WITH 1000;


Here INCREMENT BY increases the sequence value by n each time, and the generated numbers are assigned to records in sequence.

Ex:
CREATE SEQUENCE emp_no_seq
INCREMENT BY n
START WITH n;

A sequence has 2 pseudo columns:
1. NEXTVAL
2. CURRVAL

NEXTVAL: It is used to get the next sequential value of the sequence.
Ex:
INSERT INTO emp VALUES (emp_no_seq.NEXTVAL, …);

CURRVAL: It is used to get the current value of the sequence.
Ex:
SELECT emp_no_seq.CURRVAL FROM dual;

Updatable views:
- An updatable view can be changed with the UPDATE command; here the view is a schema object.
- A view is a duplicate (virtual) table of the original table.
- In updatable views we can perform joins, but in normal views we cannot perform the join operator.

Ex:
PROD_MASTER
prod_id  prod_no
1        A123
2        B456


PROD_SALES
prod_id  prod_qty
1        50
2        60
3        70

RESULT

prod_id  prod_qty
1        50
2        60

PLSQL AND EMBEDDED SQL:

PL/SQL: PL/SQL has control statements, whereas SQL has none. With SQL alone, the DBA suffers from various disadvantages:
1. SQL does not have any procedural capability.
2. SQL does not provide programming constructs such as control statements, i.e. checking, looping and branching.
3. In SQL the speed of data processing is slow.
4. SQL does not offer exception handling.

ADVANTAGES OF PL/SQL:
- PL/SQL is a development tool which supports SQL and provides other facilities such as checking, looping and branching.
- PL/SQL provides exception handling mechanisms.
- PL/SQL allows procedures.
- PL/SQL makes data processing very fast.

Architecture of PL/SQL: A PL/SQL program contains 3 blocks, in this order:
1. DECLARE block
2. BEGIN block
3. END block


PL/SQL supports other functionality such as procedures, triggers and cursors.

EMBEDDED SQL: We access the data in the database by embedding DML and DDL statements in an application program, where the application program is written in a conventional programming language. For example, we may embed DML and DDL statements in a COBOL program:

COBOL
BEGIN
  _ _ _ _ _ _ _
  DDL STATEMENTS
  DML STATEMENTS
  _ _ _ _ _ _ _
END


UNIT -5

NORMALIZATION OF DATA BASE TABLES

Database tables & normalization – the need for normalization – the normalization process – improving the design – surrogate key considerations – higher-level normal forms – normalization and database design – denormalization.

Normalization of Database Tables

Database tables and Normalization:

Database normalization is a combination of schema refinement and decomposition techniques.

Or: Database normalization is the process of removing redundant data and other anomalies such as multivalued dependencies, partial dependencies and transitive dependencies; by applying the different normal forms the database becomes an efficient database.

The process of normalization includes a series of stages known as normal forms. These normal forms were developed by Edgar F. Codd.

The main normal forms are first normal form (1NF), second normal form (2NF) and third normal form (3NF). Although there are higher normal forms such as 4NF and 5NF, relations in 3NF are often described as "normalized" because they are largely free from redundancy.


Dependencies: Definitions

Multivalued attributes (or repeating groups): non-key attributes, or groups of non-key attributes, whose values are not uniquely identified by (i.e. not functionally dependent on) the value of the primary key or a part of it.

Partial dependency: when a non-key attribute is determined by only a part of a composite primary key, it is called a partial dependency.

STUDENT

Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00

Partial Dependency


Transitive dependency: when a non-key attribute determines another non-key attribute, it is called a transitive dependency.

The normalization process:

First normal form (1NF): A table is in 1NF when it has a primary key and contains no multivalued attributes, repeating groups or redundant values. In other words, a relation schema is in first normal form if the attribute values in the relation are atomic, i.e. there are no repeated groups of values in a particular column.

Converting a Relational Table in 1NF

The conversion of a relational table to 1NF can be done by performing the following steps.

Step1

In this step, the repeating groups are removed by eliminating the null columns: ensure that every attribute of each repeating group has an appropriate value in every row.

EMPLOYEE

Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg

Transitive Dependency


Step2

In this step, a primary key must be identified among the attributes; this primary key uniquely identifies every row.

Step3

In this step, the dependencies existing in a relation must be identified. This can be done by knowing which attribute in the relation is dependent / based on the primary key.

Example

Step1

Bringing a Relation to 1NF

Make a determinant of the repeating group (or the multivalued attribute) a part of the primary key.

Step2

STUDENT

Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00

Composite Primary Key


Make Course_ID, the determinant of the repeating group, part of the primary key.

The table now has a primary key made of two attributes, Stud_ID and Course_ID, called a composite primary key.

Step3

Remove repeating group from the relation. Create another relation which would contain all the attributes of the repeating group plus the PK from the 1st relation.

STUDENT

Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00

STUDENT_COURSE

Stud_ID  Course   Units
101      MSI 250  3
101      MSI 415  3
125      MSI 331  3


Second normal form (2NF): A relation is in 2NF if the following two conditions are satisfied:
- The relation is in 1NF.
- It does not contain any partial dependencies.

In other words, a relation is in 2NF when every non-key attribute is fully functionally dependent on the whole primary key.

Example

The conversion of a relation in 1NF (with composite primary keys) into a relation in 2NF can be done by performing the following steps.

Step1

Initially, the key components of the relation are listed separately, and the component that forms a composite key is identified.

Consider the STUDENT table, which has a composite primary key (Stud_ID, Course_ID) and partial dependencies on each part of it.


Step2

The above table contains two partial dependencies. Remove the attributes that depend on part, but not the whole, of the primary key from the original relation.

Step3

In this step, for each partial dependency create a new relation with the corresponding part of the primary key.

STUDENT

Stud_ID  Name     Course_ID  Units
101      Lennon   MSI 250    3.00
101      Lennon   MSI 415    3.00
125      Johnson  MSI 331    3.00

STUDENT

Stud_ID  Name
101      Lennon
125      Johnson

COURSE

Course_ID  Units
MSI 250    3.00
MSI 415    3.00
MSI 331    3.00


Third normal form (3NF): A relation is in third normal form if the following two conditions are satisfied:
- The relation is in 2NF.
- The relation has no transitive dependencies, i.e. no non-key attribute determines another non-key attribute.

Converting 2NF to 3NF

The conversion of a relation in 2NF into a relation in 3NF can be done by performing the following steps.

Step1:

In this step, consider an EMPLOYEE table that contains a transitive dependency.


Step2

Remove the attributes that depend on a non-key attribute from the original relation.

Step3

For each transitive dependency create a new relation with the non key attribute which is determinant in the transitive dependency as a primary key and dependent non key attribute as a dependent.

EMPLOYEE

Emp_ID  F_Name  L_Name  Dept_ID  Dept_Name
111     Mary    Jones   1        Acct
122     Sarah   Smith   2        Mktg

EMPLOYEE

Emp_ID  F_Name  L_Name  Dept_ID
111     Mary    Jones   1
122     Sarah   Smith   2


Boyce-Codd Normal Form (BCNF):
- A table is in BCNF when each and every determinant in the table is a candidate key.
- BCNF is an extension of 3NF.
- BCNF removes dependencies of key attributes on non-key attributes.
- A table may contain more than one candidate key; BCNF does not allow redundancy.
- BCNF is used in higher-level databases such as business transaction databases.
- Every relation in BCNF is also in 3NF, but a relation in 3NF need not be in BCNF.
- 3NF and BCNF are said to be equal when the table has a single candidate key; otherwise they can differ.
- A table can be in 3NF but not in BCNF when some determinant in the table is not a candidate key, for example when a non-key attribute determines part of a candidate key. The reason is that BCNF requires every determinant in the table to be a candidate key.

Fourth Normal Form (4NF):
- 4NF builds on the lower normal forms and is used to remove multivalued dependencies and the anomalies they cause; a multivalued dependency produces redundant value combinations in a relation.
- You might encounter a poorly designed database, or be asked to convert spreadsheets into a database format, in which multiple multivalued attributes exist.

Unnormalized tables: An unnormalized relation contains redundancy or repeating groups.

DEPARTMENT

Dept_ID  Dept_Name
1        Acct
2        Mktg


An unnormalized table does not contain a primary key.

Normalized database tables generally make a good database design, but when processing requirements are considered, unnormalized tables sometimes provide a better solution: normalized designs split the data across several tables, which must be joined back together using I/O operations, thereby reducing processing speed.

However, rare and occasional circumstances may allow some degree of denormalization so that processing speed can be increased.

The disadvantages are:

- Programs that read and update the tables must deal with large tables, so data updates are less efficient.
- The implementation of indexing is very complex in unnormalized tables.
- Virtual tables (views) cannot be created from unnormalized tables.

Improving the design: Every relation can be designed with the following rules in order to improve the efficiency of the database:

1. Evaluate the primary key of each relation.

2. Evaluate naming conventions (no two attributes should have the same name).

3. Refine attribute atomicity.

4. Evaluate related attributes.

5. Evaluate related relationships.

6. Connect existing relation details to the new relation.

7. Evaluate the types of the attributes.


Surrogate key considerations:
- A surrogate key is an artificial primary key which is generated automatically by the system and managed through the DBMS.
- The surrogate key is numeric and is automatically incremented for each new row.
- Designers need a surrogate key when the natural primary key is considered inappropriate.
- A surrogate key is used to improve the efficiency of the database; it acts like an index, and an index is used for quick reference.
- Indexes are classified into 2 types: normal indexes and unique indexes. A surrogate key is enforced by applying a unique index.

Syntax:

CREATE UNIQUE INDEX index_name ON table_name (field1, field2);

Eg:

CREATE UNIQUE INDEX cus_x ON customer (cus_name, cus_id);


UNIT-VI

What is transaction? - Transaction state-implementation of atomicity and durability- concurrency control- serializability -testing for serializability- concurrency control with locking methods -concurrency control with time stamping methods - concurrency control with optimistic methods - database recovery management -validation based protocols - multiple granularities

What is transaction?

Transaction processing systems collect, store, modify and retrieve the transactions of an organisation.

A transaction is an event that generates or modifies data that is eventually stored in an information system.

A transaction is a small application program, or a unit of a program, executed by the user.

Every transaction is written in a high-level data manipulation language or programming language.

Transaction states:-

For recovery purposes, the system needs to keep track of when each transaction starts, terminates, and commits or aborts.

Every transaction has 5 states, which are as follows:
1. Active state
2. Partially committed state
3. Committed state
4. Failed state
5. Aborted state


Active state:-

Active state marks the beginning of transaction execution.

Or

Active state is initial state of the transaction.

If there is an error in the transaction, it goes to the failed state.

If there is no error in the transaction, it goes to the partially committed state.

Partially committed state:-

A transaction moves from the active state to the partially committed state once it has executed its last operation, but before its changes are made permanent.

Failed state:-

Whenever failures occur while a transaction is executing, i.e. hardware or software failures or system crashes, the transaction goes to the failed state.

Committed state:-

Commit signals the successful end of the transaction, so that any changes executed by the transaction can be safely made permanent in the database.

Or: if the transaction executes successfully, it enters the committed state.

Abort state:-

The aborted state is also called the rollback state.

Abort signals that the transaction has ended unsuccessfully, so that any changes or effects the transaction may have applied to the database are undone.

After a transaction fails, the system performs one of 2 actions:
- Kill the transaction
- Restart the transaction


Kill the transaction:-

The transaction is killed when it contains logical errors; a new transaction is then constructed and executed in its place.

Restart the transaction:-

The transaction is restarted after the type of error has been identified and rectified by a recovery mechanism.

Examples of recovery mechanisms are:
- Write-ahead logging protocol
- Shadow paging technique (or shadow database scheme)

In the aborted state, a transaction may be restarted later, either automatically or after being resubmitted by the user.

The transaction state diagram is as follows (figure not reproduced):

Active → Partially committed → Committed
Active → Failed → Aborted
Partially committed → Failed


Types of Transactions:-

Transactions are classified into 2 types, namely:
1. Serial transactions
2. Non-serial transactions

Serial Transaction:-

Serial transactions follow the property called "serializability": the list of operations executes serially, one transaction at a time, in sequence.

Ex: Consider the following read and write operations, performed serially, transferring funds from account A to account B:

T1: READ (A)

A: A-50

WRITE (A)

T2: READ (B)

B: B+50

WRITE (B)
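The serial transfer above can be run as a real atomic transaction. The following is a minimal sketch using Python's sqlite3; the balances and the insufficient-funds check are illustrative additions, not part of the notes:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 100)])
con.commit()

def transfer(amount):
    """Debit A and credit B as one atomic unit: both writes happen, or neither."""
    try:
        con.execute("UPDATE account SET balance = balance - ? WHERE name = 'A'", (amount,))
        if con.execute("SELECT balance FROM account WHERE name = 'A'").fetchone()[0] < 0:
            raise ValueError("insufficient funds")
        con.execute("UPDATE account SET balance = balance + ? WHERE name = 'B'", (amount,))
        con.commit()      # the transaction reaches the committed state
    except Exception:
        con.rollback()    # the transaction is aborted; the partial debit is undone
        raise

transfer(50)              # succeeds: A = 50, B = 150
try:
    transfer(500)         # fails: the debit of A is rolled back
except ValueError:
    pass
print(dict(con.execute("SELECT name, balance FROM account")))
```

The failed transfer leaves both balances untouched: rollback undoes the partial debit, which is precisely the atomicity property discussed later in this unit.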

Non serial transaction:-

Non-serial transactions are also called concurrent transactions.

Transactions whose operations execute in parallel (interleaved, non-serial) order are called non-serial transactions.

A non-serial schedule need not be serializable; when it is not equivalent to any serial schedule it is said to be "non-serializable".

Ex: Consider the following read and write operations forming a non-serial schedule, i.e. funds transfers involving both accounts:


T1: READ (A)

A: A-50

WRITE (A)

T2: READ (B)

B: B-50

WRITE (B)

T3: READ (B)

B: B+50

WRITE (B)

T4: READ (A)

A: A+50

WRITE (A)

Schedule:-

Executing the list of operations of transactions in some order is called a schedule.

Schedules are classified into 2 types:
- Serial schedules
- Non-serial schedules

A serial schedule corresponds to serial transactions; a non-serial schedule corresponds to concurrent (non-serial) transactions.

Note: A schedule includes all the operations of the transactions, between their begin and end statements.


Properties of transaction:-

Every transaction is executed following 4 properties, called the ACID properties, namely:

A- Atomicity

C- Consistency

I- Isolation

D- Durability.

Atomicity:-

Atomicity means a transaction executes completely or not at all: either all of its operations take effect, or none of them do.

Consistency:-

If the database is consistent before a transaction executes, then the result of the transaction leaves the database consistent as well.

Isolation:-

A DBMS allows transactions to execute concurrently, but each transaction must be unaware of the others. Isolation leads to complications when transactions execute concurrently; the complications amount to loss of data (inconsistent results).

To prevent such loss of data we use concurrency control mechanisms, such as the two-phase locking protocol and the two-phase commit protocol.

Durability:-

If a transaction executes successfully, its result is stored persistently, i.e. if any failure occurs afterwards, it does not affect the result of the transaction.


Implementation of atomicity and durability:-

Problems can occur with atomicity and durability, so we must implement mechanisms that guarantee them.

Atomicity problem: Due to hardware or software failures or system crashes, a transaction may not execute completely, which leads to an inconsistent database.

Durability problem: If the atomicity property fails, durability is also affected, which means the result is not stored persistently.

To overcome the atomicity and durability problems we implement atomicity and durability with the help of recovery mechanisms, namely:
- Write-ahead logging protocol
- Shadow paging technique (or shadow database scheme)

Write ahead logging protocol:-

To be able to recover from failures that affect transactions, the system maintains a log to keep track of all transaction operations that affect the values of database items. This information may be needed to permit recovery from failures. The log is kept on disk, so it is not affected by any type of failure except disk or catastrophic failure. In addition, the log is periodically backed up to archival storage (tape) to guard against such disk failures.

The write-ahead log protocol is used by the recovery mechanism, which maintains a log file. The log contains all the details needed for the redo and undo operations of each transaction.

If a transaction fails while executing, the lost data can be recovered with the help of the log file, after finding out and rectifying the type of error.
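The redo/undo idea can be sketched in a few lines. This toy write-ahead log (all names and the in-memory "database" are illustrative) appends each update's before/after images to the log before touching the data, then rebuilds the state from the log after a simulated crash:

```python
# Toy write-ahead log: log first, write second; recovery redoes committed
# transactions and undoes uncommitted ones using the logged images.
log = []                 # the stable log (would survive a crash)
db = {"A": 100}          # volatile database state

def write(txn, key, new_value):
    log.append(("update", txn, key, db.get(key), new_value))  # log record first...
    db[key] = new_value                                       # ...then the write

def commit(txn):
    log.append(("commit", txn))

def recover(log):
    """Rebuild the database state from the log alone."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = {}
    for rec in log:
        if rec[0] == "update":
            _, txn, key, old, new = rec
            # redo committed writes; undo (restore old image) uncommitted ones
            state[key] = new if txn in committed else old
    return state

write("T1", "A", 50); commit("T1")
write("T2", "A", 10)            # T2 never commits before the "crash"
print(recover(log))             # {'A': 50}: T1 is redone, T2 is undone
```

Because every change is logged before it is applied, recovery never needs the lost in-memory state; the log alone is enough.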


Here the type of error means a hardware failure, a system crash, a software failure, or improper logic.

Shadow paging technique:-

Shadow paging technique is also called as shadow database scheme.

Shadow paging technique is one of the recovery mechanisms.

- The shadow paging technique maintains 2 copies of the database, namely the old copy and the shadow copy.
- The shadow copy is also called the new copy, and the old copy is also called the original copy.
- The database has a database pointer, which points to the current copy of the database.
- Shadow paging does not work with concurrent transactions; it works only with serial transactions.
- It is not suitable for large databases.
- During transaction execution the old (original) copy is never modified; all updates go to the shadow copy.
- On failure, the database pointer stays on (or moves back to) the old copy; by doing this we recover the lost data of the failed transaction.
- In a multi-user environment with concurrent transactions, logs and checkpoints must be incorporated into the shadow paging technique.
- If the operations on the new copy execute successfully, the old copy of the database is deleted.

Diagrammatically, shadow paging shows an old copy and a shadow copy, with the database pointer switched from one to the other on commit (figure not reproduced).
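The pointer-switching behaviour can be sketched directly. In this toy version (class and method names are illustrative), the old copy is never touched; commit is a single pointer swap, and abort simply discards the shadow:

```python
import copy

class ShadowDB:
    """Toy shadow paging: updates go to a shadow copy; commit swaps the pointer."""

    def __init__(self, data):
        self.old_copy = data
        self.pointer = self.old_copy       # database pointer -> current copy
        self.shadow = None

    def begin(self):
        self.shadow = copy.deepcopy(self.old_copy)   # work on a private copy

    def write(self, key, value):
        self.shadow[key] = value           # the old copy is never modified

    def commit(self):
        self.pointer = self.old_copy = self.shadow   # atomic pointer swap

    def abort(self):
        self.shadow = None                 # pointer still at the old copy

db = ShadowDB({"A": 100})
db.begin(); db.write("A", 50); db.abort()
print(db.pointer)                          # {'A': 100} - failed txn leaves no trace
db.begin(); db.write("A", 50); db.commit()
print(db.pointer)                          # {'A': 50}  - committed txn is durable
```

An aborted transaction needs no undo work at all, which is the main attraction of the scheme; the costs (copying, no concurrency) are the drawbacks listed above.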


Concurrency control:-

Database management systems allow multiple transactions to execute concurrently.

Concurrent transactions have advantages compared with serial transactions:
- Increased processing speed and disk utilisation.
- Reduced average waiting time for a transaction.

Concurrency control scheme:-

Concurrency control schemes are mechanisms for achieving the isolation property: they control the interactions among concurrent transactions in order to keep the database consistent.

Eg:
- Locking protocols
- Timestamp-based protocols
- Optimistic methods


Concurrency control mechanism through locking technique:-

- A lock is a variable associated with a data item.
- Locking is a technique for concurrency control; lock information is managed by the lock manager.
- Every database management system contains a "lock manager".
- The main techniques used to control concurrent execution of transactions are based on the concept of locking.

Types of locks:-

Locks are classified into 3 types:
1. Binary locks
2. Multiple-mode locks
3. Two-phase locking

Binary locking:-

- Binary locking is applied at the different levels of granularity.
- A binary lock has 2 states: the locked state, denoted 1, and the unlocked state, denoted 0.
- Locking means acquiring the lock; unlocking means releasing the lock.
- There are 4 levels of granularity:
  1. Column level
  2. Row level
  3. Page level
  4. Table level
- To control concurrent transactions we use binary locking.


A binary lock is simple to implement: represent the lock as a record of the form <data item name, LOCK, locking transaction>, together with a queue of transactions waiting for the data item X.

Multiple mode locking (or) shared (or) exclusive locking:-

Multiple-mode locking is also called shared/exclusive locking, or read/write locking.

Every database management system contains a "lock manager".

These locks are of 2 types:
1. Shared locks
2. Exclusive locks

Shared lock:-

Shared lock: Suppose transaction T1 holds a shared lock on a data item. When T2 wants to read that item, it sends a request to the lock manager; since shared locks are compatible with reading, the lock manager grants permission and T2 reads the data.

Mutually exclusive lock:-

If the transactions are T1 and T2.T1 is apply with exclusive lock when T2 is read the data from T1 but it cannot read because it has licked then T2 send request for the lock manager but the lock manager does not releasing to the T2. If the T1 transaction is completed automatically it releases then we read the data from T1 toT2.If does not give permission then T2 is waiting (or) suspended.

If LOCK(X) = write-locked, the value of locking-transactions is the single transaction that holds the exclusive (write) lock on X.

If LOCK(X) = read-locked, the value of locking-transactions is a list of one or more transactions that hold the shared (read) lock on X.
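The shared/exclusive behaviour described above amounts to a lock-mode compatibility check; a minimal sketch (the `can_grant` helper is hypothetical, for illustration):

```python
# Lock-mode compatibility for shared/exclusive locking: a requested
# mode is granted only if it is compatible with every mode already
# held on the item.
COMPATIBLE = {
    ("S", "S"): True,   # many readers may share an item
    ("S", "X"): False,  # a writer excludes readers
    ("X", "S"): False,  # readers exclude a writer
    ("X", "X"): False,  # at most one writer
}

def can_grant(requested, held_modes):
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

assert can_grant("S", ["S", "S"]) is True   # read lock joins other readers
assert can_grant("X", ["S"]) is False       # write must wait for readers
assert can_grant("S", ["X"]) is False       # read must wait for the writer
assert can_grant("X", []) is True           # free item: write lock granted
```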


Two-phase locking:-

A transaction is said to follow the two-phase locking protocol if all of its locking operations (read-lock, write-lock) precede the first unlock operation in the transaction.

(Or)

In other words, to follow the two-phase locking protocol, a transaction must not acquire any new lock after its first unlock; no unlocking occurs during the first phase.

Such a transaction can be divided into 2 phases:

1. Growing phase
2. Shrinking phase

Growing phase:-

Growing phase is also called as expanding phase.

Growing phase is the first phase, during which new locks on items can be acquired but none can be released.

(Or)

In the growing phase the transaction acquires locks, but no lock is released.

Shrinking phase:-

The shrinking phase is the second phase, during which existing locks can be released but no new locks can be acquired.

(Or)

No new locks can be granted.
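The two-phase rule, every lock preceding the first unlock, can be checked mechanically; a small illustrative sketch (the operation strings are an assumed encoding):

```python
# Check whether a transaction's operation sequence obeys two-phase
# locking: once any unlock occurs (shrinking phase), no further lock
# may be acquired.
def follows_2pl(ops):
    unlocked = False            # True once the shrinking phase starts
    for op in ops:
        if op.startswith("unlock"):
            unlocked = True
        elif op.startswith("lock") and unlocked:
            return False        # a lock after an unlock violates 2PL
    return True

assert follows_2pl(["lock(X)", "lock(Y)", "unlock(X)", "unlock(Y)"]) is True
assert follows_2pl(["lock(X)", "unlock(X)", "lock(Y)", "unlock(Y)"]) is False
```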


[Figure: the number of locks held rises during the growing phase from the transaction start (T starts) and falls during the shrinking phase until the transaction end (T ends).]

Note:-

Locking mechanisms for concurrency control suffer from the deadlock problem.

Dead lock:-

Deadlock occurs when each transaction T in a set of 2 (or more) transactions is waiting for some data item that is locked by some other transaction T′ in the set.

Example:-

T1                          T2
READ-LOCK(X)
                            READ-LOCK(X)
WRITE-LOCK(X) (waits for T2 to release X)
                            WRITE-LOCK(X) (waits for T1 to release X)

Both shared locks on X are granted, but each transaction then waits for the other before it can upgrade to a write lock, so neither can proceed.

Dead lock can be solved by using some other concurrency control mechanisms like:


1. Optimistic concurrency control
2. Deadlock prevention protocol
3. Validation-based protocol
4. Deadlock detection protocol
5. Timeouts
6. Deadlock avoidance protocol
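One of these, the deadlock detection protocol, is commonly based on a wait-for graph: an edge T1 → T2 means T1 waits for a lock held by T2, and a cycle signals deadlock. A minimal sketch (the graph encoding is illustrative):

```python
# Deadlock detection via a wait-for graph: an edge T1 -> T2 means T1
# waits for a lock held by T2; any cycle in the graph means deadlock.
def has_deadlock(waits_for):
    def reachable(start, target, seen):
        # Is there a path from start back to target?
        for nxt in waits_for.get(start, []):
            if nxt == target or (nxt not in seen and
                                 reachable(nxt, target, seen | {nxt})):
                return True
        return False
    return any(reachable(t, t, {t}) for t in waits_for)

# T1 waits for T2 and T2 waits for T1: the deadlock from the example above.
assert has_deadlock({"T1": ["T2"], "T2": ["T1"]}) is True
assert has_deadlock({"T1": ["T2"], "T2": []}) is False
```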

Serializability:-

Serializability means that a list of instructions (operations) executed non-serially produces the same result as some serial execution.

In this we are using the concept of schedules.

If there is serializability then the database is always consistent.

Serializability is classified into 2 types.

1. Conflict serializability
2. View serializability

Conflict serializability:-

Consider 2 transactions TA and TB with instructions IA and IB that perform read and write operations on the same data item X; such operations are the subject of conflict serializability.

In conflict serializability, IA depends on IB and IB depends on IA: their relative order affects the result.

Conflicts are of four types; they are:

1. Read–read conflict (not a true conflict, so it mostly does not occur)
2. Read–write conflict
3. Write–read conflict
4. Write–write conflict

To test for conflict serializability we use the precedence graph, also called a directed graph or serializability graph.


A directed graph can be represented as follows: every directed graph contains vertices (nodes) and edges (arcs).

Transactions can be represented as nodes.

Conflicting operations are represented as edges (arcs) between transactions.

If the precedence graph contains a cycle, the conflicting dependencies cannot be ordered serially; if it is acyclic, the schedule is conflict serializable.

Testing for serializability:-

Testing for serializability means testing whether a schedule is conflict serializable.
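The precedence-graph test can be sketched as follows: build an edge for every pair of conflicting operations from different transactions, then check the graph for cycles (the encoding of a schedule as (transaction, operation, item) triples is an assumption for illustration):

```python
# Build a precedence graph from a schedule and test it for cycles:
# the schedule is conflict serializable iff the graph is acyclic.
def conflict_edges(schedule):
    """schedule: list of (txn, op, item) with op in {'r', 'w'}.
    Two operations conflict when they come from different transactions,
    touch the same item, and at least one of them is a write."""
    edges = set()
    for i, (t1, op1, x1) in enumerate(schedule):
        for t2, op2, x2 in schedule[i + 1:]:
            if t1 != t2 and x1 == x2 and "w" in (op1, op2):
                edges.add((t1, t2))   # edge: earlier txn -> later txn
    return edges

def acyclic(edges):
    """Depth-first search; a back edge means a cycle."""
    nodes = {n for edge in edges for n in edge}
    adj = {n: [b for a, b in edges if a == n] for n in nodes}
    visiting, done = set(), set()
    def dfs(n):
        if n in visiting:
            return False          # back edge found: cycle
        if n in done:
            return True
        visiting.add(n)
        ok = all(dfs(m) for m in adj[n])
        visiting.discard(n)
        done.add(n)
        return ok
    return all(dfs(n) for n in nodes)

# T1's operations all precede T2's conflicting ones: serializable.
s1 = [("T1", "r", "X"), ("T1", "w", "X"), ("T2", "r", "X"), ("T2", "w", "X")]
assert acyclic(conflict_edges(s1)) is True

# Interleaving that creates edges in both directions: not serializable.
s2 = [("T1", "r", "X"), ("T2", "w", "X"), ("T1", "w", "X")]
assert acyclic(conflict_edges(s2)) is False
```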

Read – read conflict:-

Consider 2 transactions TA and TB of instructions IA and IB.

Suppose IB wants to perform a read operation on data item X; here it reads only after IA has performed its read operation on X. Since neither operation changes X, their order does not affect the result.

Here data item is same for TA and TB transactions.

Read – write conflict:-

Consider 2 transactions TA and TB with instructions IA and IB. Suppose IA wants to perform a read operation on data item X, but it cannot read until the write operation on X by instruction IB in transaction TB has been performed.

Write – read conflict:- Consider 2 transactions TA and TB with instructions IA and IB. Suppose IA wants to perform a write operation on data item X, but it cannot do so until the read operation on X by instruction IB in transaction TB has been performed.



Write – write conflict:-

Consider 2 transactions TA and TB with instructions IA and IB. Suppose IA wants to perform a write operation on data item X, but it cannot write until the write operation on X by instruction IB in transaction TB has been performed.

View serializability:-

View serializability is a part of serializability.

Here we first execute all operations (instructions) of TA and then move to TB, or else we first execute all operations of TB and then TA.

Example:-

TA                          TB
Read(A)
A := A − 50
Write(A)
Read(B)
B := B + 50
Write(B)
                            Read(A)
                            A := A − 50
                            Write(A)
                            Read(B)
                            B := B + 50
                            Write(B)


Time stamp mechanisms to control concurrent transactions:-

The time stamp mechanism has a time stamp manager.

Time stamp managers are classified into 2 types; they are:

1. Global clock manager
2. Local clock manager

The time stamp mechanism is used to decide, when transactions are involved in a deadlock situation, which operation to perform to resolve the deadlock:

1. Abort (or pre-empt) the transaction
2. Kill the transaction

The time stamp of each transaction is denoted with the symbol TS(TA).

Here,

TS is time stamp value

TA is transaction A.

Every time stamp has 2 properties:

1. Uniqueness
2. Monotonicity

Uniqueness:-

Every transaction contains unique time stamp value.

Monotonicity:-

Time stamp values always increase.

Example:-

If T1 started before T2 then it is noted as

TS (T1) < TS (T2)

Here T1 is the older transaction.

T2 is the younger transaction.

There are 2 schemes to prevent dead lock which are as follows:


Wait – die and

Wound – wait

Wait – die:-

Suppose that transaction T1 tries to lock data item X but cannot, because X is already locked by T2. Here the rule is:

If TS(T1) < TS(T2) (T1 is older), then T1 is allowed to wait; otherwise (T1 is the younger transaction) T1 dies and is restarted later with the same time stamp.

Wound – wait:-

Suppose that T1 tries to lock X but cannot, because X is locked by T2 with a conflicting lock. Then the rule is:

If TS(T1) < TS(T2) (T1 is older), then abort T2 (T1 wounds T2) and restart it later with the same time stamp; otherwise T1 is allowed to wait.
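The two rules can be written as small decision functions (a sketch; a smaller time stamp means an older transaction, and the function names are ours):

```python
# Decision rules for the two timestamp-based deadlock-prevention
# schemes: a requester asks for a lock held by another transaction,
# and the timestamps decide who waits and who is rolled back.
def wait_die(ts_req, ts_hold):
    # Older requester waits; younger requester dies (restarts later).
    return "wait" if ts_req < ts_hold else "die"

def wound_wait(ts_req, ts_hold):
    # Older requester wounds (aborts) the holder; younger one waits.
    return "wound holder" if ts_req < ts_hold else "wait"

assert wait_die(1, 2) == "wait"            # older T1 waits for younger T2
assert wait_die(2, 1) == "die"             # younger requester is rolled back
assert wound_wait(1, 2) == "wound holder"  # older requester preempts T2
assert wound_wait(2, 1) == "wait"          # younger requester waits
```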

Without such rules, transactions left in the wait state can lead to deadlock.

Optimistic concurrency control:-

Optimistic concurrency control is one mechanism for controlling concurrent transactions.

Optimistic concurrency control is also called the “validation-based protocol.”

Optimistic concurrency control does not use any locking or time stamp mechanism.

In this every transaction can be executed in 3 phases:

1. Read phase
2. Validation phase
3. Write phase


Read phase:-

In this phase we read the data items, perform computations, and store the results in a temporary database (a copy of the database). Then we move to the validation phase.

Validation phase:-

In this phase we validate the transaction whose results are stored in the temporary database.

If validation is positive, we move to the write phase, copying the results from the temporary to the permanent database.

If validation is negative, we restart (reschedule) the transaction from the first phase, that is, the read phase.

Write phase:-

In the write phase, the transaction’s results are written to the permanent database, provided validation was positive.
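A simplified sketch of the validation test (one common variant checks a committing transaction's read set against the write sets of transactions that committed meanwhile; the `validate` helper name and encoding are ours, for illustration):

```python
# Sketch of validation in optimistic concurrency control: the
# transaction fails validation if anything it read was overwritten
# by a transaction that committed while it was running.
def validate(read_set, committed_write_sets):
    return all(read_set.isdisjoint(ws) for ws in committed_write_sets)

# T read {A, B}; another transaction committed a write to C meanwhile:
# no overlap, so T proceeds to the write phase.
assert validate({"A", "B"}, [{"C"}]) is True
# A concurrent commit wrote B, which T read: restart from the read phase.
assert validate({"A", "B"}, [{"B", "C"}]) is False
```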

Recovery management:-

Recovery management means using a recovery mechanism to recover lost data, through the shadow copy (shadow paging) technique and the write-ahead logging protocol.

Write ahead logging protocol:-

To be able to recover from failures that affect transactions, the system maintains a log to keep track of all transaction operations that affect the values of database items. This information may be needed to permit recovery from failures. The log is kept on disk, so it is not affected by any type of failure except disk (or catastrophic) failure. In addition, the log is periodically backed up to archival storage (tape) to guard against such disk failures.

Write-ahead logging is the protocol used by the recovery mechanism, which maintains a log file.

Log contains all the details of the transaction redo and undo operations.


If a transaction fails during execution, the log file lets us recover the lost data after finding and rectifying the type of error.

Here the type of error means hardware (or) system crashes (or) software failure (or) improper logic.

Shadow paging technique:-

Shadow paging technique is also called as shadow database scheme.

Shadow paging technique is one of the recovery mechanisms.

Shadow paging techniques maintain 2 databases namely old copy and shadow copy.

Shadow copy is also called as new copy and old copy is also called as original copy.

Every database contains the database pointer.

The database pointer points to the current copy of the database.

The shadow paging technique does not work with concurrent transactions.

It means it is worked with only serial transactions.

It is not suitable for large databases.

During transaction execution the shadow copy is never modified.

On failure, the database pointer is moved back to the old copy; by doing this we recover the lost data of the particular transaction.

In a multi-user environment with concurrent transactions, logs and checkpoints must be incorporated into the shadow paging technique.

If the operations on the new copy of the database execute successfully, the old copy of the database is deleted.

[Figure omitted: diagrammatic representation of the shadow paging technique.]


Multiple Granularities:-

Multiple-granularity locking is a locking mechanism for applying locks to different data items at a time, whereas normal locking cannot lock different data items of different transactions; normal locking locks a single data item at a time.

Multiple granularity is defined as applying locking at different levels.

Here, different levels are database, tables, columns, pages, rows, etc.

Multiple granularity offers 2 methods of applying locking:

1. Fine granularity
2. Coarse granularity

Fine granularity:-

Fine granularity means applying locking from the bottom level upward; here performance is high for concurrent transactions.

High order locking is also available.

Coarse granularity:-

Coarse granularity is defined as locking from the top level downward; here performance is low for concurrent transactions, and only low-order locking is available.


UNIT-VII

RECOVERY SYSTEM

1. RECOVERY AND ATOMICITY
2. LOG BASED RECOVERY
3. RECOVERY WITH CONCURRENT TRANSACTIONS
4. BUFFER MANAGEMENT
5. FAILURE WITH LOSS OF NONVOLATILE STORAGE
6. ADVANCED RECOVERY TECHNIQUES
7. REMOTE BACKUP SYSTEMS

RECOVERY & ATOMICITY:

Consider a transaction on 2 accounts A and B; we transfer funds from account A to account B. Initially A contains 700 and B contains 500.

Now the user performs the transaction of transferring Rs. 80 from account A to account B. If a power failure or any other error occurs during the transaction, the database is left in an inconsistent state: the transaction is left incomplete (idle), and Rs. 80 is not added to B. Simply re-executing the transaction also leads to an inconsistent state, because the original transaction was never committed. So we use recovery techniques to bring the database from an inconsistent state back to a consistent state. Recovery provides mechanisms such as log-based recovery, shadow paging, checkpoints, and the write-ahead logging protocol.

LOG-BASED RECOVERY:-

A log is a file which maintains all the details of a transaction. The log is the most widely used structure for recording database modifications. It is a sequence of log records and maintains a history of all update activities in the database. The log maintains details such as:

*Transaction identifier

*Data-item identifier

*Old value

*New value

The transaction identifier refers to the transaction which executes the write operation.


The data-item identifier refers to the data item written to the disk, that is, the location of the data item on disk.

The old value refers to the value of the data item before the write operation.

The new value refers to the value of the data item after the write operation.

Log records that are useful for recovery from system and disk failures must reside on stable storage. Log records help in recording the significant activities associated with database transactions, such as the beginning of a transaction, and the aborting or committing of a transaction, etc.

< T begin>

This refers to the beginning of transaction T.

< T, x, v1, v2>

This refers to transaction T performing a write operation on the data item x; the values of x before and after the write operation are v1 and v2 respectively.

< T commit>

This refers to the commit of transaction T.

The basic operations of the log are redo and undo.

Through the log we can recover lost data using 2 tables:

1. Transaction Table: It contains all the new values or updated operations on a transaction

2. Dirty page table: it contains all the old values or outdated operations of a transaction.
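The undo (backward scan, restoring old values) and redo (forward scan, applying new values) operations on log records of the form <T, x, old, new> can be sketched as follows (representing the database as a dict is an illustrative simplification):

```python
# Sketch of undo/redo over log records (txn, item, old_value, new_value):
# redo scans forward and applies new values; undo scans backward and
# restores old values for a failed transaction.
def redo(db, log):
    for txn, item, old, new in log:            # forward scanning
        db[item] = new
    return db

def undo(db, log, failed_txn):
    for txn, item, old, new in reversed(log):  # backward scanning
        if txn == failed_txn:
            db[item] = old
    return db

log = [("T", "x", 100, 80), ("T", "y", 50, 70)]
db = {"x": 80, "y": 70}                  # T's updates reached the disk
undo(db, log, "T")                       # T failed before committing
assert db == {"x": 100, "y": 50}         # old values restored
assert redo({"x": 100, "y": 50}, log) == {"x": 80, "y": 70}
```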

RECOVERY WITH CONCURRENT TRANSACTIONS:

1. Log-based recovery techniques are used for handling concurrent transactions.
2. Every system not only deals with the execution of concurrent transactions but also contains a log and a disk buffer.
3. In case of transaction failure, an undo operation is performed during the recovery phase. Undo is also called backward scanning (redo is called forward scanning).


4. Consider an example: suppose the value of a data item X has been changed or modified by T1; then some log-based recovery technique can be used for restoring its value.

5. In this technique the old value of data-item(x) can be achieved by using undo operation.

6. Only a single transaction should update a data item at any time, that is, only after the previous transaction commits.

Note:- Multiple transactions can perform different update operations under the two-phase locking protocol.

TRANSACTION ROLLBACK:- Every transaction can be rolled back by using 2 operations:

1. Undo
2. Redo

Every record in the log is of the form <T, x, 15, 25>.

For example consider 2 transactions T1 & T2 in the form <T1, p, 25, 35>&<T2, q, 35, 45>

The above 2 transactions T1 and T2 have modified the values of their data items.

Hence backward scanning of the log assigns the value 25 to p, and if forward scanning is done the value of p is 35. During the rollback of a transaction, if a data item has been updated by T1 then no other transaction can perform an update operation on it.

CHECK POINTS:-

1. A checkpoint is a recovery mechanism.
2. Checkpoints are synchronization points between the database and the log file, used to reduce the time taken to recover from a system crash.
3. During a system crash, without checkpoints the entire log must be searched to determine which transactions need undo or redo operations.
4. Scanning the whole log causes 2 major problems:

(a) It is time-consuming to search for a record in the entire log.

(b) The redo operation takes a long time.


5. A checkpoint triggers a sequence of actions:

(a) Every log record currently in main memory is moved to stable storage.

(b) A log record of the form <checkpoint L> is written.

Note:- During recovery, the number of records in the log that the system must scan can be minimized by using checkpoints.

BUFFER MANAGEMENT: Buffer management uses 2 techniques to ensure consistency and to reduce the disk interactions of transactions in a database. The 2 techniques are:

1. Log record buffering
2. Database buffering

Log record buffering:-

1. As log records are created, they are output to stable storage.
2. Log records in stable storage are stored in units called data blocks.
3. At the physical level these blocks are usually larger than individual records.
4. Writing one data block may therefore combine several different output operations.
5. Several different output operations can lead to confusion in the recovery mechanism.
6. So additional requirements are used while storing log records into stable storage.
7. One requirement is that a transaction is committed only after its <T commit> record has been output to stable storage.
8. Another is that, before the <T commit> record, all other log records associated with the transaction must already be in stable storage.

Under these rules, log records are accumulated and given to stable storage in blocks; this is called log record buffering.


DATABASE BUFFERING:-

1. The database is stored on non-volatile storage (disk); needed data blocks are transferred from disk into main memory.
2. Typically the size of main memory is less than that of the entire database.
3. Applying requirements (rules) to the blocks of data in memory is called database buffering.
4. Consider 2 data blocks B1 and B2; following certain rules, the relevant log records are given to stable storage first.
5. These rules restrict the system’s freedom to bring blocks of data into main memory: before bringing block B2 into main memory, B1 must be written out from main memory to the disk; only then are operations possible on block B2.
6. In database buffering the following actions take place:

(a) The log records are brought onto stable storage.

(b) Block B1 is written onto the disk after its operations.

(c) From the disk, block B2 is brought into main memory.

FAILURES WITH LOSS OF NON-VOLATILE STORAGE:-

1. The information present in volatile storage is lost whenever a system crash occurs.
2. Loss of information in non-volatile storage, however, is very rare.
3. To handle this type of failure, certain techniques must be considered.
4. One technique is to “dump” the entire database onto stable storage periodically.
5. When a crash destroys the physical database, the information in it is lost.
6. To bring the database back to a consistent state, the system uses the dump for restoring the information.
7. While a dump is in progress, no transaction may be in process.
8. The dump follows additional requirements, which are:

All the log records currently in main memory must be output to stable storage.


All the database contents are copied onto stable storage, and a log record of the form <dump> is written to stable storage.

NOTE: If a failure results in the loss of information residing in non-volatile storage, then with the help of the dump procedure the system can restore the database back to disk.

Disadvantages:

1. While the dump technique is in progress, no transaction is allowed.

2. The dump procedure is expensive, because a huge amount of data must be copied to stable storage.

ADVANCED RECOVERY TECHNIQUES: An advanced recovery technique is a combination of several techniques: undo-phase and redo-phase techniques such as transaction rollback, the write-ahead logging protocol, checkpoints, etc.

ARIES: It is a recovery algorithm. It is based on the write-ahead logging protocol. It supports undo and redo operations. It stores the related log records in stable storage. It supports concurrency control protocols, for example optimistic concurrency control and time stamp concurrency control.

It maintains a ‘dirty page table’ and a ‘transaction table’. It has three phases:

a. Analysis phase, in which the following recovery steps are performed:
1. Determining the log records in stable storage.
2. Determining checkpoints.
3. Determining the dirty page table and the transaction table.

b. Redo phase: forward scanning.

c. Undo phase: backward scanning.

Other features include “supporting locking at different levels by using the multiple-granularity technique”.


REMOTE BACKUP SYSTEM:

1. There is a need for designing ‘transaction processing systems’ that can continue even if the system fails or crashes due to natural disasters like earthquakes, floods, etc.

2. This is possible by giving the transaction processing system the property of a ‘high degree of availability’.

3. Availability can be defined as the same copy of data being present at the primary site as well as at secondary sites.

4. A remote backup system is thus defined as maintaining the same copy of data at several sites; even if one site fails, we can continue our transactions with the backup sites.

5. While designing a remote backup system we should consider the following major issues:

a. Failure detection

b. Recovery time

c. Control transfer

d. commit time

a.FAILURE DETECTION:

The failure of the primary site must be detected by the remote backup system; this is possible through ‘good network communication’.

b.RECOVERY TIME:

Using checkpoints we can reduce the recovery time, because the log size at the remote system affects the recovery time: if the size of the log increases, the time taken to perform recovery also increases.

c.CONTROL TRANSFER:

Control transfer can be defined as the remote backup system taking over as the primary site.


d.COMMIT TIME:

Every transaction must be committed upon completion; otherwise the database is left inconsistent, meaning durability is not achieved.

UNIT-VIII


FILE STRUCTURE AND INDEXING

Overview of physical storage media – magnetic disks – RAID – tertiary storage – storage access – file organization – organization of records in files – data-dictionary storage – basic concepts of indexing – ordered indices – B+-tree index files – B-tree index files – multiple key access – static hashing – dynamic hashing – comparison of ordered indexing & hashing – bitmap indices – indexed sequential access method (ISAM).

Overview of physical storage media- magnetic disks:

Overview of Physical Storage Media

Several types of data storage exist in most computer systems. They vary in speed of access, cost per unit of data, and reliability.

– Cache: most costly and fastest form of storage. Usually very small, and managed by the operating system.

– Main Memory (MM): the storage area for data available to be operated on.

General-purpose machine instructions operate on main memory.

Contents of main memory are usually lost in a power failure or ``crash''.

Usually too small (even with megabytes) and too expensive to store the entire database.

Flash memory: EEPROM (electrically erasable programmable read-only memory).

Data in flash memory survive from power failure.

Reading data from flash memory takes about 10 nano-secs (roughly as fast as from main memory), and writing data into flash memory is more complicated: write-once takes about 4-10 micro secs.

To overwrite what has been written, one has to first erase the entire bank of the memory. It may support only a limited number of erase cycles.

It has found its popularity as a replacement for disks for storing small volumes of data (5-10 megabytes).


Magnetic-disk storage: primary medium for long-term storage.

Typically the entire database is stored on disk.

Data must be moved from disk to main memory in order for the data to be operated on.

After operations are performed, data must be copied back to disk if any changes were made.

Disk storage is called direct access storage as it is possible to read data on the disk in any order (unlike sequential access).

Disk storage usually survives power failures and system crashes.

Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many) disk (for archival storage of data), and Juke box (containing a few drives and numerous disks loaded on demand).

Tape Storage: used primarily for backup and archival data.

Cheaper, but much slower access, since tape must be read sequentially from the beginning.

Used as protection from disk failures!

The storage device hierarchy is presented in Figure 10.1, where the higher levels are expensive (cost per bit), fast (access time), but the capacity is smaller.


Figure 10.1:   Storage-device hierarchy


Magnetic disk:

Read-write head

– Positioned very close to the platter surface (almost touching it)

– Reads or writes magnetically encoded information.

Surface of platter divided into circular tracks

– Over 50K-100K tracks per platter on typical hard disks

Each track is divided into sectors.

– Sector size typically 512 bytes

– Typical sectors per track: 500 (on inner tracks) to 1000 (on outer tracks)


To read/write a sector

– disk arm swings to position head on right track

– platter spins continually; data is read/written as sector passes under head

Head-disk assemblies

– multiple disk platters on a single spindle (1 to 5 usually)

– one head per platter, mounted on a common arm.

Cylinder i consists of ith track of all the platters

• Earlier generation disks were susceptible to “head-crashes” leading to loss of all data on disk

Current generation disks are less susceptible to such disastrous failures, but individual sectors may get corrupted

RAID: Redundant Arrays of Independent Disks

– disk organization techniques that manage a large numbers of disks, providing a view of a single disk of

• high capacity and high speed by using multiple disks in parallel, and

• high reliability by storing data redundantly, so that data can be recovered even if a disk fails

• Improvement of Reliability via Redundancy

• Redundancy – store extra information that can be used to rebuild information lost in a disk failure

E.g., Mirroring (or shadowing)

Duplicate every disk. Logical disk consists of two physical disks.

Every write is carried out on both disks

Reads can take place from either disk

If one disk in a pair fails, data still available in the other


RAID Levels:

RAID Level 0: Block striping; non-redundant

RAID Level 1: Mirrored disks with block striping

RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.

RAID Level 3: Bit-Interleaved Parity

RAID Level 4: Block-Interleaved Parity

RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk


RAID

Redundant Arrays of Independent Disks: disk organization that takes advantage of utilizing large numbers of inexpensive, mass-market disks

Main Idea: Improvement of reliability via redundancy, i.e., store extra

information that can be used to rebuild information lost in case of a disk

failure. Use Mirroring (or shadowing): duplicate every disk (logical disk

consists of two physical disks)

Different RAID levels (0-6) have different cost, performance, and reliability characteristics.

Storage Access

A database file is partitioned into fixed-length storage units called blocks (or pages). Blocks/pages are units of both storage allocation and data transfer.

The database system seeks to minimize the number of block transfers between disk and main memory. Transfers can be reduced by keeping as many blocks as possible in main memory.

Buffer Pool: Portion of main memory available to store copies of disk blocks.


Buffer Manager: System component responsible for allocating and managing buffer space in main memory.

Buffer Manager

A program calls on the buffer manager when it needs a block from disk.

The requesting program is given the address of the block in main memory, if the block is already present in the buffer.

If the block is not in the buffer, the buffer manager allocates space in the buffer for the block, replacing (throwing out) some other block, if necessary, to make space for the new block.

The block that is thrown out is written back to the disk only if it was modified since the most recent time it was written to/fetched from the disk.

Once space is allocated in the buffer, the buffer manager reads the block from the disk into the buffer, and passes the address of the block in main memory to the requesting program.

Buffer Replacement Policies

Most operating systems replace the block least recently used (LRU strategy )

LRU: use past references to a block as a predictor of future references.

Queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user's query to predict future references.

LRU can be a bad strategy for certain access patterns involving repeated sequential scans of data files.

Mixed strategy with hints on replacement strategies provided by the query optimizer is preferable (based on the used query processing algorithm(s)).

Pinned block: memory block that is not allowed to be written back to disk

Toss-immediate strategy: frees the space occupied by a block as soon as the final record (tuple) of that block has been processed.

Most recently used (MRU) strategy: the system must pin the block currently being processed. After the final tuple of that block has been processed, the block is unpinned, and it becomes the most recently used block.

The buffer manager can use statistical information regarding the probability that a request will reference a particular relation. E.g., the data dictionary is frequently accessed, so data dictionary blocks should be kept in the main-memory buffer.

File Organization

A database is stored as a collection of files. Each file is a sequence of records, and a record is a sequence of fields.

Approaches to organizing records in files assume: the record size is fixed; each file has records of one particular type only; different files are (typically) used for different relations.

Fixed-Length Records

Simple approach:

Store record i starting at byte n × (i − 1), where n is the size of each record. Record access is simple, but records may span blocks.

Deletion of record i (alternatives to avoid fragmentation)

move records i+1, …, n to i, …, n−1; or move record n to i

Link all free records in a free list

Free Lists

Maintained in the block(s) that contains deleted records.

Store the address of the first record whose content is deleted in the file header.

Use this first record to store the address of the second available record, and so on.

One can think of these stored addresses as pointers, because they "point" to the location of a record (a linked list).

More space efficient representation: reuse space for normal attributes of free records to store pointers (i.e., no pointers are stored in in-use records).

Requires careful programming: dangling pointers occur if a record is moved or deleted while another record contains a pointer to it; that pointer then no longer points to the desired record.
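A minimal sketch of the free-list idea, with an in-memory array of fixed-length slots standing in for the file. The slot contents, the ("FREE", next) marker, and the sentinel value −1 for end-of-list are illustrative choices:

```python
# Toy free-list over an array of fixed-length record slots.
# A deleted slot stores the index of the next free slot (reusing the record's
# own space); the "file header" is just free_head. -1 marks the end of the list.
class RecordFile:
    def __init__(self, n):
        self.slots = [None] * n
        self.free_head = -1

    def delete(self, i):
        self.slots[i] = ("FREE", self.free_head)  # chain pointer lives in the slot
        self.free_head = i

    def insert(self, record):
        if self.free_head == -1:
            self.slots.append(record)             # no free slot: grow the file
            return len(self.slots) - 1
        i = self.free_head                        # reuse the first free slot
        self.free_head = self.slots[i][1]
        self.slots[i] = record
        return i

f = RecordFile(4)
for i, r in enumerate(["r0", "r1", "r2", "r3"]):
    f.slots[i] = r
f.delete(1); f.delete(3)        # free list: 3 -> 1 -> end
print(f.insert("new"))          # reuses slot 3 (the head of the list)
```

Insertion reuses the most recently deleted slot first, which is exactly the space-efficient representation described above: in-use slots hold records, free slots hold pointers.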

Variable-Length Records

Variable-length records arise in database systems in several ways: Storage of multiple record types in a file.

Record types that allow variable length for one or more fields (e.g., varchar)

Approaches to store variable length records:

1. End-of-Record marker

difficult to reuse space of deleted records (fragmentation); no space for a record to grow (e.g., due to an update), which requires moving the record

2. Field delimiters (e.g., a $)

Field1 $ Field2 $ Field3 $ Field4 $

requires a scan of the record to get to the n-th field value; requires a field for a null value

3. Each record has an array of field offsets

Field1 Field2 Field3 Field4

+ For the overhead of the offsets, we get direct access to any field

+ Null values: begin/end pointer of a field point to the same address
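The offset-array layout can be sketched as follows. The serialization (a list of end offsets plus a byte string) is a simplification for illustration; real systems pack the offsets into the record itself:

```python
# Offset-array layout: a record is serialized as an offset array plus the
# field bytes. Field i occupies bytes offs[i]:offs[i+1]; a null field has
# offs[i] == offs[i+1] (begin and end pointer are the same address).
def encode(fields):
    body, offs = b"", [0]
    for f in fields:
        body += b"" if f is None else f.encode()
        offs.append(len(body))
    return offs, body

def get_field(offs, body, i):
    lo, hi = offs[i], offs[i + 1]
    return None if lo == hi else body[lo:hi].decode()   # direct access, no scan

offs, body = encode(["Alice", None, "CS"])
print(get_field(offs, body, 0))   # Alice
print(get_field(offs, body, 1))   # None
```

Unlike the delimiter approach, `get_field` jumps straight to any field without scanning the earlier ones.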

Variable-length records typically cause problems in case of (record) attribute modifications:

Growth of a field requires shifting all other fields

A modified record may no longer fit into the block (leave a forwarding address at the old location)

A record can span multiple blocks

Block formats for variable length records

Each record is identified by a record identifier (rid) (or tuple identifier (tid)). The rid/tid contains the number of the block and the position in the block.

This simplifies shifting a record from one position/block to another position/block.

Allocation of records in a block is based on slotted block structure:

[Figure: slotted block structure — a block header (number of entries, end of free space, per-record location and size), free space in the middle, and records R1…R4 packed from the end of the block.]

block header contains number of record entries, end of the free space in the block, and location and size of each record

records are inserted from the end of the block

records can be moved around in block to keep them contiguous
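A rough Python model of such a slotted block (block size and record contents are made up; a real header would be packed into the block's bytes rather than kept in Python lists):

```python
# Sketch of a slotted block: records grow from the end of the block,
# the header's slot directory grows from the front, free space in between.
class SlottedBlock:
    def __init__(self, size=64):
        self.data = bytearray(size)
        self.slots = []               # header: (location, length) per record
        self.free_end = size          # end of free space; records sit after it

    def insert(self, record: bytes):
        self.free_end -= len(record)                  # place record at the end
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append((self.free_end, len(record)))
        return len(self.slots) - 1                    # slot number within the block

    def get(self, slot):
        loc, length = self.slots[slot]                # indirection: records may move
        return bytes(self.data[loc:loc + length])

b = SlottedBlock()
rid = b.insert(b"rec1")
b.insert(b"record2")
print(b.get(rid))    # b'rec1'
```

Because readers go through the slot directory, a record can be moved within the block (to keep records contiguous) without invalidating its slot number, which is what makes the rid = (block, position-in-block) scheme stable.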

RID-Concept:

[Figure: RID concept — a record with RID 4711/2 is updated and no longer fits in block 4711; the record is moved (here to 5013/1), and the old slot keeps the new address as a forwarding address, so the original RID remains valid.]

Organization of Records in Files

Requirements: A file must (efficiently) support:

insert/delete/update of a record

access to a record (typically using its rid)

scan of all records

Ways to organize blocks (pages) in a file:

Heap File (unsorted file): simplest file structure; contains records in no particular order; a record can be placed anywhere in the file where there is space

Sequential File: records are stored in sequential order, based on the value of the search key of each record

Clustered Index: related to sequential files; we'll talk about this later in Section 7 (Indexes)

Heap File Organization

At DB run-time, pages/blocks are allocated and deallocated. Information to maintain for a heap file includes pages, free space on pages, and records on a page.

A typical implementation is based on doubly linked lists of pages, starting with a header block.

Two lists can be associated with the header block: (1) a full-page list, and (2) a list of pages having free space.

Sequential File Organization

Suitable for applications that require sequential processing of the entire file

The records in the file are ordered by a search key.

Deletions of records are managed using pointer chains.

Insertions must locate the position in the file where the record is to be inserted: if there is free space, insert the record there; if there is no free space, insert the record in an overflow block. In either case, the pointer chain must be updated.

If there are many record modifications (in particular insertions and deletions), the correspondence between search-key order and physical order can be totally lost ⇒ file reorganization.

Basic Concepts

Ordered Indices

B+-Tree Index Files

B-Tree Index Files

Static Hashing

Dynamic Hashing

Comparison of Ordered Indexing and Hashing

Index Definition in SQL

Multiple-Key Access

Data Dictionary Storage:

Data dictionary (also called system catalog) stores metadata: that is, data about data, such as

Information about relations

names of relations

names and types of attributes of each relation

names and definitions of views

integrity constraints

User and accounting information, including passwords

Statistical and descriptive data

number of tuples in each relation

Physical file organization information

How relation is stored (sequential/hash/…)

Physical location of relation

Information about indices

Catalog structure

Relational representation on disk

specialized data structures designed for efficient access, in memory

A possible catalog representation:

Relation_metadata = (relation_name, number_of_attributes, storage_organization, location)

Attribute_metadata = (attribute_name, relation_name, domain_type, position, length)

User_metadata = (user_name, encrypted_password, group)

Index_metadata = (index_name, relation_name, index_type, index_attributes)

View_metadata = (view_name, definition)

Indexing: Basic Concepts

Indexing mechanisms are used to speed up access to desired data.

E.g., author catalog in library

Search Key – attribute or set of attributes used to look up records in a file.

An index file consists of records (called index entries) of the form (search-key, pointer).

Index files are typically much smaller than the original file

Two basic kinds of indices:

Ordered indices: search keys are stored in sorted order

Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.

Index definition: Index is a file whose records (entries) have the following structure:

(search-key, pointer or set of pointers to data records).

When the search key is unique the index has fixed size records, otherwise it may have variable size records.

What is the pointer?

Block number or RID!

Index Evaluation Metrics:

Access types supported efficiently. E.g.,

records with a specified value in the attribute

or records with an attribute value falling in a specified range of values.

Access time

Insertion time

Deletion time

Space overhead

Ordered Indices:


In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library.

Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.

Also called clustering index

The search key of a primary index is usually but not necessarily the primary key.

Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.

Index-sequential file: ordered sequential file with a primary index.

Classes of Indexes:

Unique/non-unique – whether search key is unique/non-unique

Dense/Nondense – every/not every search key (every record for Unique key) in the data file has a corresponding pointer in the index.

Clustered/Nonclustered – order of index search key is equal/not equal to file order.

Primary/Secondary – search key equal/not equal to primary key. Also, the answer is a single/set of data file records.

Note difference in definition of Primary index (us vs. Silberschatz…)

Note: for Unique key – Dense index indexes every record, Non-dense does not index every record

Note: Non-dense index must be clustered!

Dense Clustered Index File:

Dense index — Index record appears for every search-key value in the file. (note, for every search key vs. for every record…)

Sparse Index Files:

Sparse Index: contains index records for only some search-key values.

Applicable when records are sequentially ordered on search-key

To locate a record with search-key value K we:

Find the index record with the largest search-key value ≤ K

Search file sequentially starting at the record to which the index record points

Less space and less maintenance overhead for insertions and deletions.

Generally slower than dense index for locating records.

Good tradeoff: sparse index with an index entry for every block in file, corresponding to least search-key value in the block.

Basis of the ISAM index
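The lookup procedure above, sketched with a hand-built sparse index holding one entry per block (the keys, block contents, and block numbers are invented for illustration):

```python
import bisect

# Sparse index with one entry per block: (least search-key in block, block_no).
# To find key K: take the index entry with the largest key <= K, then scan
# that block sequentially (the file is ordered on the search key).
index = [("Adams", 0), ("Garcia", 1), ("Patel", 2)]   # illustrative entries
blocks = [["Adams", "Brown", "Diaz"],
          ["Garcia", "Kim", "Nair"],
          ["Patel", "Singh", "Wu"]]

def lookup(key):
    keys = [k for k, _ in index]
    i = bisect.bisect_right(keys, key) - 1    # largest indexed key <= key
    if i < 0:
        return None                           # key sorts before the whole file
    for rec in blocks[index[i][1]]:           # sequential scan within one block
        if rec == key:
            return rec
    return None

print(lookup("Kim"))     # Kim
print(lookup("Lopez"))   # None
```

The index holds 3 entries for 9 records; the price is one short in-block scan per lookup, which is the dense-vs-sparse trade-off described above.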

Multilevel Index:

If primary index does not fit in memory, access becomes expensive.

To reduce number of disk accesses to index records, treat primary index kept on disk as a sequential file and construct a sparse index on it.

• outer index – a sparse index of primary index

• inner index – the primary index file

If even outer index is too large to fit in main memory, yet another level of index can be created, and so on.

Indices at all levels must be updated on insertion or deletion from the file.

Index Update: Deletion:

If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.

Single-level index deletion:

Dense indices – deletion of search-key is similar to file record deletion.

Sparse indices – if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.

Index Update: Insertion:

Single-level index insertion:

Perform a lookup using the search-key value appearing in the record to be inserted.

Dense indices – if the search-key value does not appear in the index, insert it.

Sparse indices – if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. In this case, the first search-key value appearing in the new block is inserted into the index.

Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms

Secondary Indices:

Frequently, one wants to find all the records whose values in a certain field (which is not the search key of the primary index) satisfy some condition.

• Example 1: In the account database stored sequentially by account number, we may want to find all accounts in a particular branch

• Example 2: as above, but where we want to find all accounts with a specified balance or range of balances

We can have a secondary index with an index record for each search-key value; index record points to a bucket that contains pointers to all the actual records with that particular search-key value

Primary and Secondary Indices:

Secondary indices have to be dense.

Indices offer substantial benefits when searching for records.

When a file is modified, every index on the file must be updated, Updating indices imposes overhead on database modification.

Sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive

each record access may fetch a new block from disk since index is usually un-clustered

B+-Tree Index Files:

A B+-tree is a rooted tree satisfying the following properties:

All paths from root to leaf are of the same length

Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.

A leaf node has between ⌈(n−1)/2⌉ and n−1 values

Special cases:

If the root is not a leaf, it has at least 2 children.

If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.

B+-Tree Node Structure:

Typical node

Ki are the search-key values

Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).

The search-keys in a node are ordered

K1 < K2 < K3 < . . . < Kn–1
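The search procedure implied by this node structure can be sketched over a hand-built two-level tree. Keys, record names, and the dict-based node layout are invented for illustration; real nodes would be disk blocks:

```python
import bisect

# Search in a tiny hand-built B+-tree: an internal node holds ordered keys
# K1 < K2 < ... and child pointers; for a lookup value v we follow the child
# whose key interval contains v, until we reach a leaf.
leaf1 = {"leaf": True, "keys": [10, 20], "recs": ["r10", "r20"]}
leaf2 = {"leaf": True, "keys": [30, 40], "recs": ["r30", "r40"]}
root = {"leaf": False, "keys": [30], "children": [leaf1, leaf2]}  # v < 30 left, v >= 30 right

def search(node, v):
    while not node["leaf"]:
        i = bisect.bisect_right(node["keys"], v)   # child interval containing v
        node = node["children"][i]
    i = bisect.bisect_left(node["keys"], v)        # exact match within the leaf
    if i < len(node["keys"]) and node["keys"][i] == v:
        return node["recs"][i]
    return None

print(search(root, 30))   # r30
print(search(root, 15))   # None
```

Every search walks one root-to-leaf path, so (all paths being the same length) cost is proportional to the height of the tree.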

B+-tree for account file (n = 3)

Note, index points to FIRST key

B+-Tree File Organization:

Index file degradation problem is solved by using B+-Tree indices. Data file degradation problem is solved by using B+-Tree File Organization.

The leaf nodes in a B+-tree file organization store records, instead of pointers.

Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a nonleaf node.

Leaf nodes are still required to be half full.

Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.

Simplest to implement sequential file organization – no need for

overflows!

Good space utilization important since records use more space than pointers.

To improve space utilization, involve more sibling nodes in redistribution during splits and merges

Involving 2 siblings in redistribution (to avoid a split/merge where possible) results in each node having at least ⌊2n/3⌋ entries

B-Tree Index Files:

Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant storage of search keys.

Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a nonleaf node must be included.

Generalized B-tree leaf node

Nonleaf node – pointers Bi are the bucket or file record pointers.

Advantages of B-Tree indices:

May use less tree nodes than a corresponding B+-Tree.

Sometimes possible to find search-key value before reaching leaf node.

Disadvantages of B-Tree indices:

Only small fraction of all search-key values are found early

Non-leaf nodes are larger, so fan-out is reduced (because of the extra pointer!). Thus B-Trees typically have greater depth than corresponding B+-Tree

Insertion and deletion more complicated than in B+-Trees

Implementation is harder than B+-Trees.

Reading in sorted order is hard

Typically, the advantages of B-Trees do not outweigh the disadvantages.

Hash Indices:

Hashing can be used not only for file organization, but also for index-structure creation.

A hash index organizes the search keys, with their associated record pointers, into a hash file structure.

Strictly speaking, hash indices are always secondary indices

if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.

However, we use the term hash index to refer to both secondary index structures and hash organized files.

I don’t agree! You may have Hash index as primary and the file is clustered on another key or a Heap file

Note the difference between a Hash Index and a Hash file – whether the buckets contain Pointers to data records or the records themselves! (same as difference between A B+-tree index and a B+-tree file! )

Deficiencies of Static Hashing:

In static hashing, a function h maps search-key values to a fixed set B of bucket addresses.

• Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflow buckets.

• If file size at some point in the future is anticipated and number of buckets allocated accordingly, significant amount of space will be wasted initially.

• If database shrinks, again space will be wasted.

• One option is periodic re-organization of the file with a new hash function, but it is very expensive.

These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.

Summary of Static Hashing:

Advantages :

Simple.

Very fast (1-2 accesses).

Disadvantages :

Direct access only (no ordered access!).

Non dynamic, needs Reorganization.

Dynamic Hashing:

Good for database that grows and shrinks in size

Allows the hash function to be modified dynamically

Extendable hashing – one form of dynamic hashing

Hash function generates values over a large range — typically b-bit integers, with b = 32.

At any time use only a prefix of the hash function to index into a table of bucket addresses.

Let the length of the prefix be i bits, 0 ≤ i ≤ 32.

Bucket address table size = 2^i. Initially i = 0.

Value of i grows and shrinks as the size of the database grows and shrinks.

Multiple entries in the bucket address table may point to a bucket.

Thus, the actual number of buckets is < 2^i

The number of buckets also changes dynamically due to coalescing and splitting of buckets.
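A compact and deliberately simplified sketch of extendable hashing. For predictability it indexes the directory with the low-order i bits of the key itself rather than a prefix of a 32-bit hash, the bucket capacity is made up, and redistribution after a split is not applied recursively; it only illustrates directory doubling and entry sharing:

```python
# Extendable-hashing sketch (simplified; see caveats in the lead-in).
class ExtHash:
    def __init__(self, bucket_size=2):
        self.bsize = bucket_size
        self.i = 0                                   # directory uses i bits
        self.dir = [{"depth": 0, "keys": []}]        # 2^0 = 1 entry

    def _bucket(self, key):
        return self.dir[key & ((1 << self.i) - 1)]   # low-order i bits of key

    def insert(self, key):
        b = self._bucket(key)
        if len(b["keys"]) < self.bsize:
            b["keys"].append(key)
            return
        if b["depth"] == self.i:                     # no spare directory bits:
            self.dir = self.dir + self.dir           # double the directory
            self.i += 1
        b["depth"] += 1                              # split the full bucket
        new = {"depth": b["depth"], "keys": []}
        mask = 1 << (b["depth"] - 1)
        pending, b["keys"] = b["keys"] + [key], []
        for j in range(len(self.dir)):               # repoint half of b's entries
            if self.dir[j] is b and j & mask:
                self.dir[j] = new
        for k in pending:                            # redistribute (one level only,
            self._bucket(k)["keys"].append(k)        # so a skewed split may overfill)

    def lookup(self, key):
        return key in self._bucket(key)["keys"]

h = ExtHash()
for k in [0, 1, 2, 4]:
    h.insert(k)
print(h.i, len(h.dir))   # 2 4  -- the directory has doubled twice
```

After the inserts, several directory entries still share one bucket, showing why the actual number of buckets can be smaller than the directory size 2^i.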

Multiple-Key Access:

Use multiple indices for certain types of queries.

Example:

select account-number

from account

where branch-name = “Perryridge” and balance = 1000

Possible strategies for processing query using indices on single attributes:

1. Use the index on branch-name to find accounts of the Perryridge branch; test balance = 1000.

2. Use index on balance to find accounts with balances of $1000; test branch-name = “Perryridge”.

3. Use branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained.
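Strategy 3 in miniature, with Python sets standing in for the rid lists returned by the two single-attribute indices (the branch names, balances, and rids are invented):

```python
# Two single-attribute indices, each mapping a key value to a set of rids.
branch_index = {"Perryridge": {101, 102, 105}, "Downtown": {103}}   # made-up rids
balance_index = {1000: {102, 103, 105}, 500: {101}}

# Strategy 3: fetch both pointer sets, intersect, then read only those records.
rids = branch_index["Perryridge"] & balance_index[1000]
print(sorted(rids))   # [102, 105]
```

Only the records in the intersection are fetched from disk; strategies 1 and 2 would each fetch a superset and filter afterwards.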

Indices on Multiple Attributes:

Suppose we have an index on combined search-key

(branch-name, balance).

With the where clause where branch-name = “Perryridge” and balance = 1000

the index on the combined search key will fetch only records that satisfy both conditions. Using separate indices is less efficient: we may fetch many records (or pointers) that satisfy only one of the conditions.

Can also efficiently handle: where branch-name = “Perryridge” and balance < 1000

But cannot efficiently handle: where branch-name < “Perryridge” and balance = 1000. This may fetch many records that satisfy the first but not the second condition.

ISAM: Indexed-Sequential-Access-Method:

A Note of Caution

• ISAM is an old-fashioned idea

– B+-trees are usually better, as we’ll see

• Though not always

• But, it’s a good place to start

– Simpler than B+-tree, but many of the same ideas

• Upshot

– Don’t brag about being an ISAM expert on your resume, Do understand how they work, and tradeoffs with B+-trees

ADDITIONAL NOTES ON DATABASE NORMALIZATION

Normalization

Normalization is a formal process for determining which fields belong in which tables in a relational database. Normalization follows a set of rules worked out at the time relational databases were born. A normalized relational database provides several benefits:

Elimination of redundant data storage.

Close modeling of real world entities, processes, and their relationships.

Structuring of data so that the model is flexible.

Normalization ensures that you get the benefits relational databases offer. Time spent learning about normalization will begin paying for itself immediately.

Normalized Design: Pros and Cons

There are various advantages to producing a properly normalized design before you implement your system. A detailed list of the pros and cons are given below:

Pros of normalizing:

More efficient database structure.

Better understanding of your data.

More flexible data structure.

Easier to maintain database structure.

Few (if any) costly surprises down the road.

Validates your common sense and intuition.

Avoids redundant fields.

Ensures that distinct tables exist when necessary.

Cons of normalizing:

You can't start building the database before you know what the user needs.

1st Normal Form (1NF)

Def: A table (relation) is in 1NF if

1. There are no duplicated rows in the table.

2. Each cell is single-valued (i.e., there are no repeating groups).

3. Entries in a column (attribute, field) are of the same kind.

Note: The order of the rows is immaterial; the order of the columns is immaterial. The requirement that there be no duplicated rows in the table means that the table has a key (although the key might be made up of more than one column—even, possibly, of all the columns).

A relation is in 1NF if and only if all underlying domains contain atomic values only.

The first normal form deals only with the basic structure of the relation and does not resolve the problems of redundant information or the anomalies.

For example, consider the following relation:

student(sno, sname, dob)

The attribute dob is the date of birth, and the primary key of the relation is sno, with the functional dependencies sno → sname and sno → dob. The relation is in 1NF as long as dob is considered an atomic value and not as consisting of three components (day, month, year). The above relation of course suffers from all the anomalies and needs to be normalized.

A relation is in first normal form if and only if, in every legal value of that relation, every tuple contains exactly one value for each attribute.

The above definition merely states that all relations are always in first normal form, which is correct. However, a relation that is only in first normal form has a structure that is undesirable for a number of reasons.

First normal form (1NF) sets the very basic rules for an organized database:

• Eliminate duplicative columns from the same table.

• Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

2nd Normal Form (2NF)

Def: A table is in 2NF if it is in 1NF and if all non-key attributes are dependent on the entire key.

The second normal form attempts to deal with the problems that are identified with the relation above that is in 1NF. The aim of second normal form is to ensure that all information in one relation is only about one thing.

A relation is in 2NF if it is in 1NF and every non-key attribute is fully dependent on each candidate key of the relation.

Note: Since a partial dependency occurs when a non-key attribute is dependent on only a part of the (composite) key, the definition of 2NF is sometimes phrased as, "A table is in 2NF if it is in 1NF and if it has no partial dependencies."

Recall the general requirements of 2NF:

• Remove subsets of data that apply to multiple rows of a table and place them in separate tables.

• Create relationships between these new tables and their predecessors through the use of foreign keys.

These rules can be summarized in a simple statement: 2NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in new table(s) and creating relationships between those tables.

For example. Imagine an online store that maintains customer information in a database. Their Customers table might look something like this:

CustNum  FirstName  LastName  Address          City        State  ZIP
1        John       Doe       12 Main Street   Sea Cliff   NY     11579
2        Alan       Johnson   82 Evergreen Tr  Sea Cliff   NY     11579
3        Beth       Thompson  1912 NE 1st St   Miami       FL     33157
4        Jacob      Smith     142 Irish Way    South Bend  IN     46637
5        Sue        Ryan      412 NE 1st St    Miami       FL     33157

A brief look at this table reveals a small amount of redundant data. We're storing the "Sea Cliff, NY 11579" and "Miami, FL 33157" entries twice each. Now, that

might not seem like too much added storage in our simple example, but imagine the wasted space if we had thousands of rows in our table. Additionally, if the ZIP code for Sea Cliff were to change, we'd need to make that change in many places throughout the database.

In a 2NF-compliant database structure, this redundant information is extracted and stored in a separate table. Our new table (let's call it ZIPs) might look like this:

ZIP City State

11579 Sea Cliff NY

33157 Miami FL

46637 South Bend IN

If we want to be super-efficient, we can even fill this table in advance -- the post office provides a directory of all valid ZIP codes and their city/state relationships. Surely, you've encountered a situation where this type of database was utilized. Someone taking an order might have asked you for your ZIP code first and then knew the city and state you were calling from. This type of arrangement reduces operator error and increases efficiency.

Now that we've removed the duplicative data from the Customers table, we've satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new Customers table:

CustNum  FirstName  LastName  Address          ZIP
1        John       Doe       12 Main Street   11579
2        Alan       Johnson   82 Evergreen Tr  11579
3        Beth       Thompson  1912 NE 1st St   33157
4        Jacob      Smith     142 Irish Way    46637
5        Sue        Ryan      412 NE 1st St    33157

We've now minimized the amount of redundant information stored within the database and the structure is in second normal form.
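The decomposition can be checked in miniature: splitting the table on ZIP and joining back on ZIP reproduces the original rows. The natural join is sketched with a dict lookup, and the data is abbreviated to three customers:

```python
# 2NF decomposition in miniature: Customers keeps only the ZIP foreign key,
# ZIPs maps each ZIP to its (City, State); a natural join on ZIP recovers
# the original, redundant rows.
customers = [
    (1, "John", "Doe", "12 Main Street", "11579"),
    (2, "Alan", "Johnson", "82 Evergreen Tr", "11579"),
    (3, "Beth", "Thompson", "1912 NE 1st St", "33157"),
]
zips = {"11579": ("Sea Cliff", "NY"), "33157": ("Miami", "FL")}

# natural join: look each customer's ZIP up in the ZIPs table
joined = [(cn, fn, ln, addr, *zips[z], z) for cn, fn, ln, addr, z in customers]
print(joined[0])   # (1, 'John', 'Doe', '12 Main Street', 'Sea Cliff', 'NY', '11579')
```

"Sea Cliff, NY" is now stored once, however many customers live there, and a ZIP's city can be corrected in exactly one place.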

Let’s take one more example

The concept of 2NF requires that all attributes that are not part of a candidate key be fully dependent on each candidate key. If we consider the relation

student (sno, sname, cno, cname)

and the functional dependencies

sno → sname

cno → cname

and assume that (sno, cno) is the only candidate key (and therefore the primary key), the relation is not in 2NF since sname and cname are not fully dependent on the key. The above relation suffers from the same anomalies and repetition of information, since sname and cname will be repeated. To resolve these difficulties we could remove those attributes from the relation that are not fully dependent on the candidate keys of the relations. Therefore we decompose the relation into the following projections of the original relation:

S1 (sno, sname) S2 (cno, cname) SC (sno, cno)

We may recover the original relation by taking the natural join of the three relations. If, however, we assume that sname and cname are unique, then we have the following candidate keys

(sno, cno) (sno, cname) (sname, cno) (sname, cname)

The above relation is now in 2NF since it has no non-key attributes. The relation still has the same problems as before, but it nevertheless satisfies the requirements of 2NF. Higher-level normalization is needed to resolve such problems with relations that are in 2NF, and further normalization will result in the decomposition of such relations.

3rd Normal Form (3NF)

Def: A table is in 3NF if it is in 2NF and if it has no transitive dependencies.

The basic requirements of 3NF are as follows

• Meet the requirements of 1NF and 2NF

• Remove columns that are not directly (i.e. non-transitively) dependent upon the primary key.

Although transforming a relation that is not in 2NF into a number of relations that are in 2NF removes many of the anomalies, not all anomalies are removed, and further normalization is sometimes needed. These anomalies arise because a 2NF relation may have attributes that are not directly related to the thing being described by the candidate keys of the relation.

A relation R is in third normal form if it is in 2NF and every non-key attribute of R is non-transitively dependent on each candidate key of R.

To understand the third normal form, we need to define transitive dependence which is based on one of Armstrong's axioms. Let A, B and C be three attributes of a relation R such that

A→ B and B→ C

From these FDs we may derive A → C; this dependency A → C is transitive.
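Transitive dependencies can be detected mechanically by computing the closure of an attribute set under a given set of FDs. The short Python sketch below implements the standard attribute-closure algorithm; the attribute names are the A, B, C of the text:

```python
def closure(attrs, fds):
    """Closure of the attribute set attrs under the FDs in fds.
    fds is a list of (lhs, rhs) pairs of frozensets of attributes."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already in the closure, pull in the right.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# A -> B and B -> C together give A -> C by transitivity:
fds = [(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))]
print(closure({"A"}, fds))  # contains C, so A -> C holds
```

Any attribute that appears in the closure of A but is not directly given by an FD with A on the left has been reached transitively.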


The 3NF differs from the 2NF in that all non-key attributes in 3NF are required to be directly dependent on each candidate key of the relation. The 3NF therefore insists, in the words of Kent (1983) that all facts in the relation are about the key (or the thing that the key identifies), the whole key and nothing but the key. If some attributes are dependent on the keys transitively then that is an indication that those attributes provide information not about the key but about a non-key attribute. So the information is not directly about the key, although it obviously is related to the key.

Consider the following relation

subject (cno, cname, instructor, office)

Assume that cname is not unique and therefore cno is the only candidate key. The following functional dependencies exist

cno → cname

cno → instructor

instructor → office

We can derive cno → office from the above functional dependencies, and therefore the above relation is in 2NF. The relation is, however, not in 3NF, since office is not directly dependent on cno. This transitive dependence is an indication that the relation has information about more than one thing (viz. course and instructor) and should therefore be decomposed. The primary difficulty with the above relation is that an instructor might be responsible for several subjects, so his office address may need to be repeated many times. This leads to all the update problems. To overcome these difficulties we need to decompose the above relation into the following two relations:

s (cno, cname, instructor)

ins (instructor, office)

s is now in 3NF and so is ins.

An alternate decomposition of the relation subject is possible:

s(cno, cname)

inst(instructor, office)

si(cno, instructor)


The decomposition into three relations is not necessary since the original relation is based on the assumption of one instructor for each course.

The 3NF is usually quite adequate for most relational database designs. There are, however, some situations (for example, the relation student(sno, sname, cno, cname) shown above to be in 2NF) where 3NF may not eliminate all the redundancies and inconsistencies. The problem with student(sno, sname, cno, cname) is the redundant information within the candidate keys. Such problems are resolved by further normalization using BCNF.

Imagine that we have a table of widget orders:

Order Number  Customer Number  Unit Price  Quantity  Total
1             241              $10         2         $20
2             842              $9          20        $180
3             919              $19         1         $19
4             919              $12         10        $120

Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are there any duplicative columns? No. Do we have a primary key? Yes, the order number. Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF.

Now, are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn't appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order.


What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity; therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form:

Order Number  Customer Number  Unit Price  Quantity
1             241              $10         2
2             842              $9          20
3             919              $19         1
4             919              $12         10

Now our table is in 3NF.
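Dropping the stored Total loses nothing, because it can always be recomputed from the columns that remain. A minimal sketch (the tuple layout is assumed from the table above):

```python
orders = [
    # (order_no, customer_no, unit_price, quantity); Total is not stored
    (1, 241, 10, 2),
    (2, 842, 9, 20),
    (3, 919, 19, 1),
    (4, 919, 12, 10),
]

def total(order):
    """Derive Total = unit price * quantity instead of storing it."""
    return order[2] * order[3]

print([total(o) for o in orders])  # [20, 180, 19, 120]
```

Deriving the value on demand means an update to unit price or quantity can never leave a stale, inconsistent total behind.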

Boyce-Codd Normal Form (BCNF)

The relation student(sno, sname, cno, cname) has all attributes participating in candidate keys since all the attributes are assumed to be unique. We therefore had the following candidate keys:

(sno, cno)

(sno, cname)

(sname, cno)

(sname, cname)

Since the relation has no non-key attributes, the relation is in 2NF and also in 3NF, in spite of suffering from the modification anomalies discussed earlier.


The difficulty in this relation is caused by dependence within the candidate keys. The second and third normal forms assume that all attributes not part of the candidate keys depend on the candidate keys, but do not deal with dependencies within the keys. BCNF deals with such dependencies.

A relation R is said to be in BCNF if whenever X → A holds in R, and A is not in X, then X is a candidate key for R.

It should be noted that most relations that are in 3NF are also in BCNF. Infrequently, a 3NF relation is not in BCNF, and this happens only if

(a) the candidate keys in the relation are composite keys (that is, they are not single attributes),

(b) there is more than one candidate key in the relation, and

(c) the keys are not disjoint, that is, some attributes in the keys are common.

The BCNF differs from the 3NF only when there is more than one candidate key and the keys are composite and overlapping. Consider, for example, the relation

enrol (sno, sname, cno, cname, date-enrolled)

Let us assume that the relation has the following candidate keys:

(sno, cno)

(sno, cname)

(sname, cno)

(sname, cname)

(we have assumed sname and cname are unique identifiers). The relation is in 3NF but not in BCNF because there are dependencies

sno → sname

cno → cname


where attributes that are part of a candidate key are dependent on part of another candidate key. Such dependencies indicate that although the relation is about some entity or association identified by the candidate keys, e.g. (sno, cno), there are attributes that are not about the whole thing that the keys identify. For example, the above relation is about an association (enrolment) between students and subjects, and therefore it needs only one identifier for students and one identifier for subjects. Providing two identifiers for students (sno, sname) and two for subjects (cno, cname) supplies information about students and subjects that is not needed here. This results in repetition of information and the associated anomalies. If we wish to include further information about students and courses in the database, it should be done not by putting that information in the present relation but by creating new relations that represent information about the entities student and subject.

These difficulties may be overcome by decomposing the above relation in the following three relations:

(sno, sname)

(cno, cname)

(sno, cno, date-of-enrolment)

We now have a relation that only has information about students, another only about subjects and the third only about enrolments. All the anomalies and repetition of information have been removed.

So, a relation is said to be in the BCNF if and only if it is in the 3NF and every non-trivial, left-irreducible functional dependency has a candidate key as its determinant. In more informal terms, a relation is in BCNF if it is in 3NF and the only determinants are the candidate keys.

Desirable properties of decompositions

So far our approach has consisted of looking at individual relations and checking whether they belong to 2NF, 3NF or BCNF. If a relation was not in the normal form being checked for, and we wished to normalize it to that form so that some of the anomalies could be eliminated, it was necessary to decompose the relation into two or more relations. The process of decomposition of a relation R into a set of relations R1, R2, ..., Rn was based on identifying different components and using that as a basis of decomposition. The decomposed relations R1, R2, ..., Rn are projections of R and are of course not disjoint, otherwise the glue holding the information together would be lost. Decomposing relations in this way, based on a recognize-and-split method, is not a particularly sound approach, since we do not even have a basis to determine that the original relation can be reconstructed if necessary from the decomposed relations.

Desirable properties of decomposition are:

1. Attribute preservation

2. Lossless-join decomposition

3. Dependency preservation

4. Lack of redundancy

Attribute Preservation

This is a simple and an obvious requirement that involves preserving all the attributes that were there in the relation that is being decomposed.

Lossless-Join Decomposition

We decomposed a relation intuitively. We need a better basis for deciding decompositions since intuition may not always be correct. We illustrate how a careless decomposition may lead to problems including loss of information.

Consider the following relation

enrol (sno, cno, date-enrolled, room-No., instructor)

Suppose we decompose the above relation into two relations enrol1 and enrol2 as follows

enrol1 (sno, cno, date-enrolled)

enrol2 (date-enrolled, room-No., instructor)

There are problems with this decomposition but we wish to focus on one aspect at the moment. Let an instance of the relation enrol be


Sno     Cno    Date-Enrolled  Room-No.  Instructor
830057  CP302  1FEB1984       MP006     Gupta
830057  CP303  1FEB1984       MP006     Jones
820159  CP302  10JAN1984      MP006     Gupta
825678  CP304  1FEB1984       CE122     Wilson
826789  CP305  15JAN1984      EA123     Smith

and let the decomposed relations enrol1 and enrol2 be:

Sno     Cno    Date-Enrolled
830057  CP302  1FEB1984
830057  CP303  1FEB1984
820159  CP302  10JAN1984
825678  CP304  1FEB1984
826789  CP305  15JAN1984

Date-Enrolled  Room-No.  Instructor
1FEB1984       MP006     Gupta
1FEB1984       MP006     Jones
10JAN1984      MP006     Gupta
1FEB1984       CE122     Wilson
15JAN1984      EA123     Smith

All the information that was in the relation enrol appears to be still available in enrol1 and enrol2 but this is not so. Suppose, we wanted to retrieve the student numbers of all students taking a course from Wilson, we would need to join enrol1 and enrol2. The join would have 11 tuples as follows:

Sno     Cno    Date-Enrolled  Room-No.  Instructor
830057  CP302  1FEB1984       MP006     Gupta
830057  CP302  1FEB1984       MP006     Jones
830057  CP303  1FEB1984       MP006     Gupta
830057  CP303  1FEB1984       MP006     Jones
830057  CP302  1FEB1984       CE122     Wilson
830057  CP303  1FEB1984       CE122     Wilson

(plus five further tuples)

The join contains a number of spurious tuples that were not in the original relation enrol. Because of these additional tuples, we have lost the information about which students take courses from Wilson (we have more tuples but less information, because we are unable to say with certainty who is taking courses from Wilson). Such decompositions are called lossy decompositions.
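The spurious tuples are easy to reproduce. The sketch below joins enrol1 and enrol2 on their only common attribute, Date-Enrolled, and gets 11 tuples back instead of the original 5:

```python
enrol1 = [  # (sno, cno, date_enrolled)
    ("830057", "CP302", "1FEB1984"),
    ("830057", "CP303", "1FEB1984"),
    ("820159", "CP302", "10JAN1984"),
    ("825678", "CP304", "1FEB1984"),
    ("826789", "CP305", "15JAN1984"),
]
enrol2 = [  # (date_enrolled, room_no, instructor)
    ("1FEB1984", "MP006", "Gupta"),
    ("1FEB1984", "MP006", "Jones"),
    ("10JAN1984", "MP006", "Gupta"),
    ("1FEB1984", "CE122", "Wilson"),
    ("15JAN1984", "EA123", "Smith"),
]

# Natural join on the shared, non-unique Date-Enrolled attribute.
join = [(sno, cno, d1, room, instr)
        for (sno, cno, d1) in enrol1
        for (d2, room, instr) in enrol2
        if d1 == d2]
print(len(join))  # 11: six spurious tuples have crept in
```

Student 830057 now appears paired with Wilson, although only 825678 took Wilson's course in the original relation.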


A nonloss or lossless decomposition is that which guarantees that the join will result in exactly the same relation as was decomposed.

We need to analyze why some decompositions are lossy. The common attribute in above decompositions was Date-enrolled. The common attribute is the clue that gives us the ability to find the relationships between different relations by joining the relations together. If the common attribute is not unique, the relationship information is not preserved. If each tuple had a unique value of Date-enrolled, the problem of losing information would not have existed. The problem arises because several enrolments may take place on the same date.

A decomposition of a relation R into relations R1, R2, ..., Rn is called a lossless-join decomposition (with respect to the FDs F) if the relation R is always the natural join of the relations R1, R2, ..., Rn. It should be noted that the natural join is the only way to recover the relation from the decomposed relations; there is no other set of operators that can recover the relation if the join cannot. Furthermore, it should be noted that when the decomposed relations R1, R2, ..., Rn are obtained by projecting on the relation R, a stored relation Ri may not always be precisely equal to its projection, since it might contain additional tuples called dangling tuples.

It is not difficult to test whether a given decomposition is lossless-join, given a set of functional dependencies F. Consider the simple case of a relation R being decomposed into R1 and R2. If the decomposition is lossless-join, then one of the following two conditions must hold:

R1 ∩ R2 → R1 − R2

R1 ∩ R2 → R2 − R1

That is, the common attributes in R1 and R2 must include a candidate key of either R1 or R2.
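The test can be mechanized: compute the closure of the common attributes and check whether it covers either difference. The self-contained sketch below repeats the standard attribute-closure algorithm; the relation and attribute names are taken from the examples in this chapter:

```python
def closure(attrs, fds):
    """Closure of attrs under fds, given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless(r1, r2, fds):
    """True if the binary decomposition (r1, r2) is lossless-join:
    the common attributes must determine r1 - r2 or r2 - r1."""
    common_closure = closure(r1 & r2, fds)
    return (r1 - r2) <= common_closure or (r2 - r1) <= common_closure

# enrol1/enrol2 share only date-enrolled, which determines nothing: lossy.
print(lossless({"sno", "cno", "date"}, {"date", "room", "instructor"}, []))

# s/ins share instructor, and instructor -> office holds: lossless.
fds = [(frozenset({"instructor"}), frozenset({"office"}))]
print(lossless({"cno", "cname", "instructor"}, {"instructor", "office"}, fds))
```

The first call prints False (the lossy enrol decomposition), the second True (the 3NF decomposition of the subject relation).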

Dependency Preservation

It is clear that decomposition must be lossless so that we do not lose any information from the relation that is decomposed. Dependency preservation is another important requirement, since a dependency is a constraint on the database: if X → Y holds, then we know that the two (sets of) attributes are closely related, and it would be useful if both appeared in the same relation so that the dependency can be checked easily.

Let us consider a relation R(A, B, C, D) that has the dependencies F that include the following:

A → B, A → C, etc.

If we decompose the above relation into R1(A, B) and R2(B, C, D) the dependency A→C cannot be checked (or preserved) by looking at only one relation. It is desirable that decompositions be such that each dependency in F may be checked by looking at only one relation and that no joins need be computed for checking dependencies. In some cases, it may not be possible to preserve each and every dependency in F but as long as the dependencies that are preserved are equivalent to F, it should be sufficient.

Let F be the set of dependencies on a relation R which is decomposed into relations R1, R2, ..., Rn.

We can partition the dependencies in F into sets F1, F2, ..., Fn such that Fi contains only dependencies involving attributes from relation Ri. If the union of F1, F2, ..., Fn implies all the dependencies in F, then we say that the decomposition has preserved dependencies; otherwise it has not.

If the decomposition does not preserve the dependencies F, then the decomposed relations may contain relations that do not satisfy F or the updates to the decomposed relations may require a join to check that the constraints implied by the dependencies still hold.


Consider the following relation

sub(sno, instructor, office)

We may wish to decompose the above relation to remove the transitive dependency of office on sno. A possible decomposition is

S1(sno, instructor)

S2(sno, office)

The relations are now in 3NF but the dependency instructor → office cannot be verified by looking at one relation; a join of S1 and S2 is needed. In the above decomposition, it is quite possible to have more than one office number for one instructor although the functional dependency instructor → office does not allow it.
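The danger is easy to demonstrate. In the sketch below (sample tuples invented for illustration), each decomposed relation looks fine on its own, and the violation of instructor → office only surfaces after a join:

```python
# S1(sno, instructor) and S2(sno, office): each is internally consistent.
s1 = [("s1", "Gupta"), ("s2", "Gupta")]
s2 = [("s1", "MP006"), ("s2", "CE122")]   # Gupta ends up with two offices

# The FD instructor -> office can only be checked after joining on sno.
joined = {(instr, office)
          for (a, instr) in s1
          for (b, office) in s2
          if a == b}
offices_of_gupta = {office for instr, office in joined if instr == "Gupta"}
print(len(offices_of_gupta))  # 2: the dependency is violated
```

Neither S1 nor S2 alone could have rejected these inserts; the constraint spans both relations, which is exactly what dependency preservation is meant to avoid.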

Lack of Redundancy

Lossless-join, dependency preservation and lack of redundancy are not always simultaneously achievable with BCNF. With 3NF, lossless-join, dependency preservation and lack of redundancy are always possible.

Deriving BCNF

Given a set of dependencies F, we may decompose a given relation into a set of relations that are in BCNF using the following algorithm. The algorithm consists of

(1) Find out the facts about the real world.

(2) Reduce the list of functional relationships.

(3) Find the keys.

(4) Combine related facts.

If there is any relation R that has a dependency A → B where A is not a key, the relation violates the conditions of BCNF and may be decomposed into AB and R − B. The relation AB is now in BCNF, and we can then check whether R − B is also in BCNF. If not, we apply the above procedure again until all the relations are in BCNF.
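The decomposition step described above can be sketched as a small recursive procedure. This is a simplified illustration (it checks each given FD directly rather than every derived FD), run here on the subject relation from the 3NF discussion:

```python
def closure(attrs, fds):
    """Attribute closure under fds, given as (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_decompose(r, fds):
    """If some A -> B with A not a superkey applies to r, split r into
    AB and r - B (keeping A in both halves), then recurse."""
    for lhs, rhs in fds:
        applies = lhs <= r and rhs <= r and not rhs <= lhs
        if applies and not r <= closure(lhs, fds):   # lhs is not a superkey
            return (bcnf_decompose(lhs | rhs, fds)
                    + bcnf_decompose(r - (rhs - lhs), fds))
    return [r]

# subject(cno, cname, instructor, office) with instructor -> office:
fds = [(frozenset({"cno"}), frozenset({"cname", "instructor"})),
       (frozenset({"instructor"}), frozenset({"office"}))]
relation = frozenset({"cno", "cname", "instructor", "office"})
result = bcnf_decompose(relation, fds)
print(result)  # ins(instructor, office) and s(cno, cname, instructor)
```

The violating dependency instructor → office is split off first, after which both remaining relations pass the superkey test and the recursion stops.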


Multivalued dependencies

Several difficulties can arise when an entity has multivalued attributes. In the relational model, if all of the information about such an entity is to be represented in one relation, it is necessary to repeat all the information other than the multivalued attribute value. This results in many tuples about the same instance of the entity, and the relation acquires a composite key (the entity id and the multivalued attribute). The other option, of course, is to represent this multivalued information in a separate relation.

The situation of course becomes much worse if an entity has more than one multivalued attribute and these values are represented in one relation by a number of tuples for each entity instance, such that every value of one multivalued attribute appears with every value of the second multivalued attribute to maintain consistency. The multivalued dependency relates to this problem of more than one multivalued attribute.

Consider the following relation that represents an entity employee with one multivalued attribute, proj:

emp (e#, dept, salary, proj)

Normalization up to BCNF is based on functional dependencies, which apply only to single-valued facts. For example, e# → dept implies only one dept value for each value of e#. Not all information in a database is single-valued; for example, proj in an employee relation may be the list of all projects that the employee is currently working on. Although e# determines the list of all projects that an employee is working on, e# → proj is not a functional dependency.

Multivalued facts about an entity are represented by having a separate relation for each multivalued attribute and then inserting a tuple for each value of that fact. This results in composite keys, since the multivalued fact must form part of the key.

The fourth and fifth normal forms deal with multivalued dependencies.

programmer (emp_name, qualifications, languages)

The above relation includes two multivalued attributes of the entity programmer: qualifications and languages. There are no functional dependencies.

The attributes qualifications and languages are assumed to be independent of each other. If we were to consider qualifications and languages separate entities, we would have two relationships (one between employees and qualifications and the other between employees and programming languages). Both relationships are many-to-many, i.e. one programmer could have several qualifications and may know several programming languages; also, one qualification may be obtained by several programmers, and one programming language may be known to many programmers.

The above relation is therefore in 3NF (even in BCNF), but it still has some disadvantages. Suppose a programmer has several qualifications (B.Sc, Dip. Comp. Sc, etc.) and is proficient in several programming languages; how should this information be represented? There are several possibilities.

Emp Name  Qualification  Language
Smith     B.Sc           Fortran
Smith     B.Sc           Cobol
Smith     B.Sc           Pascal
Smith     Dip. CS        Fortran
Smith     Dip. CS        Cobol
Smith     Dip. CS        Pascal

Emp Name  Qualification  Language
Smith     B.Sc           NULL
Smith     Dip. CS        NULL
Smith     NULL           Fortran
Smith     NULL           Cobol
Smith     NULL           Pascal

Emp Name  Qualification  Language
Smith     B.Sc           Fortran
Smith     Dip. CS        Cobol
Smith     NULL           Pascal

Other variations are possible (there is no relationship between qualifications and programming languages). All these variations have some disadvantages. If the information is repeated, we face the same problems of repeated information and anomalies as when second or third normal form conditions are violated. If there is no repetition, there are still difficulties with searches, insertions and deletions. For example, the role of NULL values in the above relations is confusing.

Also, the candidate key in the above relations is (emp name, qualification, language), and entity integrity requires that no NULLs be specified. These problems may be overcome by decomposing a relation like the one above as follows:

Emp Name  Qualification
Smith     B.Sc
Smith     Dip. CS


Emp Name  Language
Smith     Fortran
Smith     Cobol
Smith     Pascal

The basis of the above decomposition is the concept of multivalued dependency (MVD).

Functional dependency A → B relates one value of A to one value of B, while multivalued dependency A →→ B defines a relationship in which a set of values of attribute B is determined by a single value of A. The concept of multivalued dependencies was developed to provide a basis for the decomposition of relations like the one above. If a relation such as enrolment(sno, subject#) has a relationship between sno and subject# in which sno uniquely determines the values of subject#, the dependence of subject# on sno is called a trivial MVD, since the relation enrolment cannot be decomposed any further. More formally, an MVD X →→ Y is called trivial if either Y is a subset of X or X and Y together form the relation R. The MVD is trivial since it places no constraints on the relation. A relation having non-trivial MVDs must therefore have at least three attributes, two of them multivalued. Non-trivial MVDs result in the relation having constraints on it, since all possible combinations of the multivalued attributes are then required to be in the relation.

Let us now define the concept of multivalued dependency. The multivalued dependency X →→ Y is said to hold for a relation R(X, Y, Z) if, for a given set of values for the attributes X, there is a set of (zero or more) associated values for the set of attributes Y, and the Y values depend only on the X values, having no dependence on the set of attributes Z.

In the example above, if there were some dependence between the attributes qualifications and language (for example, if the qualification was a training certificate in a particular language), then the relation would not have an MVD and could not be decomposed into two relations as above. Whenever X →→ Y holds, so does X →→ Z, since the roles of the attributes Y and Z are symmetrical.

Consider two different situations:

(a) Z is a single-valued attribute. In this situation, we deal with R(X, Y, Z) as before by entering several tuples about each entity.

(b) Z is multivalued.


Now, more formally, X →→ Y is said to hold for R(X, Y, Z) if, whenever t1 and t2 are two tuples in R that have the same values for the attributes X, that is t1[X] = t2[X], then R also contains tuples t3 and t4 (not necessarily distinct) such that

t1[X] = t2[X] = t3[X] = t4[X]

t3[Y] = t1[Y] and t3[Z] = t2[Z]

t4[Y] = t2[Y] and t4[Z] = t1[Z]

In other words if t1 and t2 are given by

t1 = [X, Y1, Z1], and

t2 = [X, Y2, Z2]

then there must be tuples t3 and t4 such that

t3 = [X, Y1, Z2], and

t4 = [X, Y2, Z1]

We are therefore insisting that every value of Y appears with every value of Z to keep the relation instances consistent. In other words, the above conditions insist that Y and Z are determined by X alone and there is no relationship between Y and Z since Y and Z appear in every possible pair and hence these pairings present no information and are of no significance. Only if some of these pairings were not present, there would be some significance in the pairings.
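The condition on t1, t2, t3 and t4 translates directly into a brute-force check over a relation instance. A sketch, using the programmer relation from above (tuples as dictionaries; the attribute lists play the roles of X, Y and Z in the text):

```python
def mvd_holds(rel, x, y, z):
    """Check X ->-> Y in R(X, Y, Z): for every t1, t2 that agree on X,
    the tuple taking Y from t1 and Z from t2 must also be present."""
    def key(t):
        return tuple(sorted(t.items()))
    present = {key(t) for t in rel}
    for t1 in rel:
        for t2 in rel:
            if all(t1[a] == t2[a] for a in x):
                t3 = {a: t1[a] for a in x}
                t3.update({a: t1[a] for a in y})
                t3.update({a: t2[a] for a in z})
                if key(t3) not in present:
                    return False
    return True

# Every qualification appears with every language, so the MVD holds.
rel = [{"emp": "Smith", "qual": q, "lang": l}
       for q in ("B.Sc", "Dip. CS")
       for l in ("Fortran", "Cobol", "Pascal")]
print(mvd_holds(rel, ["emp"], ["qual"], ["lang"]))  # True

# Remove one qualification/language pairing and the MVD is broken.
print(mvd_holds(rel[:-1], ["emp"], ["qual"], ["lang"]))  # False
```

Only checking t3 is needed in the loop: the symmetric tuple t4 is covered when t1 and t2 exchange roles.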

Consider, for example, a relation (instructor, quals, subjects). If subject were single-valued, each instructor would simply have one tuple per subject; since both quals and subjects are multivalued and independent, all combinations of qualifications and subjects must occur for each instructor.


(Note: if Z is single-valued and functionally dependent on X, then Z1 = Z2; if Z is multivalued dependent on X, then Z1 and Z2 may differ.)

The theory of multivalued dependencies is very similar to that of functional dependencies. Given a set D of MVDs, we may find D+, the closure of D, using a set of axioms.

Multivalued Normalization: Fourth Normal Form (4NF)

Def: A table is in 4NF if it is in BCNF and if it has no multivalued dependencies.

Consider again the relation programmer(emp name, qualifications, languages) and the problems that may arise if it is not normalized further. The relation can be decomposed into P1(emp name, qualifications) and P2(emp name, languages) to overcome these problems. The decomposed relations are in fourth normal form (4NF).

A relation R is in 4NF if, whenever a multivalued dependency X →→ Y holds, then either

(a) the dependency is trivial, or (b) X is a candidate key for R.

The dependency X →→ Y in a relation R(X, Y) is trivial, since it must hold for every R(X, Y). Similarly, (X, Y) →→ Z must hold for every relation R(X, Y, Z) with only three attributes.

In fourth normal form, we have a relation that has information about only one entity. If a relation has more than one multivalue attribute, we should decompose it to remove difficulties with multivalued facts.

Intuitively R is in 4NF if all dependencies are a result of keys. When multivalued dependencies exist, a relation should not contain two or more independent multivalued attributes. The decomposition of a relation to achieve 4NF would normally result in not only reduction of redundancies but also avoidance of anomalies.


Fifth Normal Form

Def: A table is in 5NF, also called "Projection-Join Normal Form" (PJNF), if it is in 4NF and if every join dependency in the table is a consequence of the candidate keys of the table.

The normal forms discussed so far required that a relation R, if not in the given normal form, be decomposed into two relations to meet the requirements of that normal form. In some rare cases, a relation can have problems such as redundant information and update anomalies, yet cannot be decomposed into two relations to remove them. In such cases it may be possible to decompose the relation into three or more relations using 5NF.

The fifth normal form deals with join dependencies, which are a generalisation of the MVD. The aim of fifth normal form is to have relations that cannot be decomposed further. A relation in 5NF cannot be constructed from several smaller relations.

A relation R satisfies a join dependency (R1, R2, ..., Rn) if and only if R is equal to the join of R1, R2, ..., Rn, where the Ri are subsets of the set of attributes of R.

A relation R is in 5NF (or projection-join normal form, PJNF) if for every join dependency *(R1, R2, …, Rn) that holds on R, at least one of the following holds:

(a) *(R1, R2, …, Rn) is a trivial join dependency (that is, one of the Ri is R itself);

(b) every Ri is a superkey of R, that is, each Ri contains a candidate key of R.
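The definition can be made concrete with a small checker (an illustrative sketch, not from the notes; the helper names and the sample relations are invented): project R onto each component Ri, natural-join the projections, and compare the result with R.

```python
from functools import reduce

def natural_join(r, s):
    """Natural join of two relations represented as lists of dicts."""
    rows = []
    for a in r:
        for b in s:
            # keep the combined row when the common attributes agree
            if all(a[k] == b[k] for k in a.keys() & b.keys()):
                rows.append({**a, **b})
    # remove duplicate rows
    return [dict(t) for t in {tuple(sorted(d.items())) for d in rows}]

def satisfies_jd(rel, components):
    """True iff rel equals the join of its projections onto each component."""
    projs = [[dict(t) for t in {tuple(sorted((k, row[k]) for k in comp))
                                for row in rel}]
             for comp in components]
    canon = lambda rs: {tuple(sorted(d.items())) for d in rs}
    return canon(reduce(natural_join, projs)) == canon(rel)

# This relation satisfies the join dependency *((A, B), (B, C)) ...
ok = [{"A": 1, "B": 1, "C": 1}, {"A": 2, "B": 1, "C": 1}]
assert satisfies_jd(ok, [("A", "B"), ("B", "C")])

# ... while this one does not: joining its projections invents spurious tuples.
bad = [{"A": 1, "B": 1, "C": 1}, {"A": 2, "B": 1, "C": 2}]
assert not satisfies_jd(bad, [("A", "B"), ("B", "C")])
```

A decomposition into R1, …, Rn is lossless exactly when the corresponding join dependency holds, which is what this check tests.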

An example of 5NF can be provided by the example below that deals with departments, subjects and students.

Dept          Subject    Student
------------  ---------  --------------
Comp Sc.      CP1000     John Smith
Mathematics   MA1000     John Smith
Comp Sc.      CP2000     Arun Kumar
Comp Sc.      CP3000     Reena Rani
Physics       PH1000     Raymond Chew
Chemistry     CH2000     Albert Garcia

The above relation says that Comp Sc. offers the subjects CP1000, CP2000 and CP3000, which are taken by a variety of students. No student takes all of the subjects and no subject has every student enrolled in it, so all three attributes are needed to represent the information.

The above relation does not show MVDs, since the attributes subject and student are not independent; they are related to each other, and the pairings carry significant information. The relation therefore cannot be decomposed into the two relations

(dept, subject), and

(dept, student)

without losing some important information. The relation can, however, be decomposed into the following three relations:

(dept, subject),

(dept, student), and

(subject, student)

and now it can be shown that this decomposition is lossless.
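Both claims can be checked directly. The sketch below (assuming the table above is represented as a set of tuples) shows that joining only the first two projections produces spurious tuples, while adding the third projection recovers the original relation exactly:

```python
# The dept/subject/student relation from the table above, as a set of tuples.
R = {
    ("Comp Sc.",    "CP1000", "John Smith"),
    ("Mathematics", "MA1000", "John Smith"),
    ("Comp Sc.",    "CP2000", "Arun Kumar"),
    ("Comp Sc.",    "CP3000", "Reena Rani"),
    ("Physics",     "PH1000", "Raymond Chew"),
    ("Chemistry",   "CH2000", "Albert Garcia"),
}

ds = {(d, s) for d, s, t in R}   # projection (dept, subject)
dt = {(d, t) for d, s, t in R}   # projection (dept, student)
st = {(s, t) for d, s, t in R}   # projection (subject, student)

# Joining only (dept, subject) and (dept, student) invents spurious tuples,
# e.g. (Comp Sc., CP1000, Arun Kumar), so the two-way decomposition is lossy.
two_way = {(d, s, t) for d, s in ds for d2, t in dt if d == d2}
assert two_way > R   # a strict superset of R

# Filtering through the third projection removes them: the three-way
# decomposition is lossless, i.e. the join dependency *(ds, dt, st) holds.
three_way = {(d, s, t) for (d, s, t) in two_way if (s, t) in st}
assert three_way == R
```

This is the characteristic behaviour of a join dependency that is not implied by any MVD: no single two-way split works, but the three-way split does.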

Domain-Key Normal Form (DKNF)

Def: A table is in DKNF if every constraint on the table is a logical consequence of the definition of keys and domains.
