Jamnalal Bajaj Institute of Management Studies
MMS – I Alpha
10/11/2011
Submitted by: Abhijeet Shinde (93), Akshay Wagle (115), Amrit Pradhan (80), Rahul Tambakhe (104), Saiprasad Sale (85), Sagar Patil (74), Srinath Subramanian (100), Vaibhav Thakur (106)
DATABASE CONCEPTS WITH EMERGING TRENDS
Table of Contents
Introduction
Database Design Methodologies
Database Management Systems
Relational Database Management Systems
Multimedia Database Management Systems
Distributed Databases
Document Oriented Database
Conclusion
Introduction
Data
The term data refers to qualitative or quantitative attributes of a variable or set of variables.
Data is typically the result of measurements and can be the basis of graphs, images, or
observations of a set of variables. Data are often viewed as the lowest level of
abstraction from which information and then knowledge are derived. Raw data, i.e.
unprocessed data, refers to a collection of numbers, characters, images or other outputs
from devices that collect information to convert physical quantities into symbols. Also, data
is a representation of facts, figures, and ideas. Such usage is the origin of data as a concept in
computer science: data are numbers, words, images, etc., accepted as they stand.
The terms information and knowledge are frequently used for overlapping concepts. The
main difference is in the level of abstraction being considered. Data is the lowest level of
abstraction, information is the next level, and finally, knowledge is the highest level among
all three. Data on its own carries no meaning. For data to become information, it must be
interpreted and take on a meaning. For example, the height of Mt. Everest is generally
considered as "data", a book on Mt. Everest geological characteristics may be considered as
"information", and a report containing practical information on the best way to reach Mt.
Everest's peak may be considered as "knowledge".
Information as a concept bears a diversity of meanings, from everyday usage to technical
settings. Generally speaking, the concept of information is closely related to notions of
constraint, communication, control, data, form, instruction, knowledge, meaning, mental
stimulus, pattern, perception, and representation.
Beynon-Davies uses the concept of a sign to distinguish between data and information; data
are symbols while information occurs when symbols are used to refer to something.
It is people and computers who collect data and impose patterns on it. These patterns are
seen as information which can be used to enhance knowledge. Whether these patterns are
accepted as truth is judged against aesthetic and ethical criteria. Events that leave
behind perceivable physical or virtual remains can be traced back through data. Marks are
no longer considered data once the link between the mark and observation is broken.
Raw data refers to a collection of numbers, characters, images or other outputs from
devices that convert physical quantities into symbols; such data is as yet unprocessed. It is
typically further processed by a human or input into a computer, stored and processed
there, or transmitted (output) to another human or computer (possibly through a data
cable). Raw data is a relative term; data processing commonly occurs in stages, and the
"processed data" from one stage may be considered the "raw data" of the next.
Mechanical computing devices are classified according to the means by which they
represent data. An analog computer represents a datum as a voltage, distance, position, or
other physical quantity. A digital computer represents a datum as a sequence of symbols
drawn from a fixed alphabet. The most common digital computers use a binary alphabet,
that is, an alphabet of two characters, typically denoted "0" and "1". More familiar
representations, such as numbers or letters, are then constructed from the binary alphabet.
Some special forms of data are distinguished. A computer program is a collection of data,
which can be interpreted as instructions. Most computer languages make a distinction
between programs and the other data on which programs operate, but in some languages,
notably Lisp and similar languages, programs are essentially indistinguishable from other
data. It is also useful to distinguish metadata, that is, a description of other data. A similar
yet earlier term for metadata is "ancillary data." The prototypical example of metadata is
the library catalog, which is a description of the contents of books.
Experimental data refers to data generated within the context of a scientific investigation by
observation and recording. Field data refers to raw data collected in an uncontrolled in
situ environment.
Example: ROBCOR Company tracks all sales for its two divisions through invoices. Each of
the invoices contains raw facts such as these:
Invoice number = 300124, Invoice date = 12-Jan-2004, and Sales amount = $125.98
Suppose that ROBCOR's two divisions have generated 1,380,456 and 1,453,907 invoices,
respectively, between the first quarter of 1999 and the first quarter of 2004. These millions
of invoices that ROBCOR has generated are raw facts which have no meaning unless and
until they are processed into information. Now, suppose that for the purpose of drawing
conclusions and/or making decisions, ROBCOR's sales managers want information about sales
productivity per employee for each of the two divisions. The generated sales productivity
information will reveal the meaning of the data in the sales invoices. An application
program in an information system will generate the required information. When data is
stored electronically in files, it can be used as input for an information system. An
information system has programs to process (or transform) data to produce information as
an output.
Information reveals the meaning of data. For example, student data values such as ID, Name,
Address, Major, and Phone Number represent raw facts. A class roll is a list showing the IDs
and names of the students enrolled in a particular class.
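The transformation of raw invoice data into sales-productivity information described above can be sketched in a few lines of Python. The figures and division names below are hypothetical, not ROBCOR's actual data:

```python
# Hypothetical raw facts: a few invoices, each tagged with its division.
invoices = [
    ("Division 1", 125.98),
    ("Division 1", 300.50),
    ("Division 2", 89.99),
    ("Division 2", 410.00),
]
employees = {"Division 1": 2, "Division 2": 4}  # assumed head counts

# Processing step: aggregate the raw facts into totals per division.
totals = {}
for division, amount in invoices:
    totals[division] = totals.get(division, 0.0) + amount

# Information step: sales productivity per employee reveals the meaning
# hidden in the individual invoices.
for division, total in sorted(totals.items()):
    print(f"{division}: total {total:.2f}, per employee {total / employees[division]:.2f}")
```

In a real information system the invoices would come from stored files or a database rather than an in-memory list, but the process-data-into-information step is the same.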
Summary of Data
o Facts and statistics used for reference or analysis.
o Numbers, characters, symbols, images, etc., which can be processed by a computer.
o Data must be interpreted, by a human or machine, to derive meaning.
o "Data is a representation of information."
o From the Latin 'datum', meaning "that which is given".
o Data is plural; datum is singular.
Information
Information in its most restricted technical sense is a message (utterance or expression) or
collection of messages that consists of an ordered sequence of symbols, or it is the meaning
that can be interpreted from such a message or collection of messages. Information can be
recorded or transmitted. It can be recorded as signs, or conveyed as signals by waves.
Information is any kind of event that affects the state of a dynamic system. The concept has
numerous other meanings in different contexts. Moreover, the concept of information is
closely related to notions of constraint, communication, control, data, form, instruction,
knowledge, meaning, mental stimulus, pattern, perception, representation, and especially
entropy.
Records are a specialized form of information. Essentially, records are information
produced consciously or as by-products of business activities or transactions and retained
because of their value. Primarily their value is as evidence of the activities of the
organization but they may also be retained for their informational value. Sound records
management ensures that the integrity of records is preserved for as long as they are
required.
The international standard on records management, ISO 15489, defines records as
"information created, received, and maintained as evidence and information by an
organization or person, in pursuance of legal obligations or in the transaction of business".
The International Council on Archives (ICA) Committee on Electronic Records defined a
record as, "a specific piece of recorded information generated, collected or received in the
initiation, conduct or completion of an activity and that comprises sufficient content,
context and structure to provide proof or evidence of that activity".
Records may be maintained to retain corporate memory of the organization or to meet
legal, fiscal or accountability requirements imposed on the organization. Willis (2005)
expressed the view that sound management of business records and information delivered
"…six key requirements for good corporate governance…transparency; accountability; due
process; compliance; meeting statutory and common law requirements; and security of
personal and corporate information."
Summary of Information
o Knowledge derived from study, experience (by the senses), or instruction.
o Communication of intelligence.
o Information is any kind of knowledge that is exchangeable amongst people, about things,
facts, concepts, etc., in some context.
o Information is interpreted data.
Data processed into information: for example, student enrolment data, processed by a
student registration system, produces a class roll.
Data constitute the building blocks of information:
• Information is produced by processing data.
• Information is used to reveal the meaning of data.
• Good, relevant, and timely information is the key to good decision making.
• Good decision making is the key to organizational survival in a global environment.
Timely and useful information requires accurate data. To achieve accurate information, the
data must be stored and generated properly. Also, the data must be stored in a format that
is easy to access and process. And like any basic resource, the data environment must be
managed carefully. Thus, data management is a discipline that focuses on the proper
generation, storage, and retrieval of data.
Database Introduction
The term database is correctly applied to the data and their supporting data structures, and
not to the DBMS which is a software system used to manage the database. The structure of a
database is generally too complex to be handled without its DBMS, and any attempt to do
otherwise is very likely to result in database corruption. DBMSs are packaged as computer
software products: well-known and widely used proprietary DBMSs (ranked by market share)
include Oracle, IBM DB2, and Microsoft SQL Server; open-source DBMSs include PostgreSQL
and the simpler, lower-end, but in many cases satisfactory, MySQL and SQLite. Each such
DBMS product currently supports many thousands of databases all over the world; the
mostly embedded DBMS SQLite probably supports many millions of instances. A database is
not generally portable across different DBMSs, but different DBMSs can inter-operate to
some degree (while each DBMS type controls a database of its own type) by
using standards like SQL and ODBC to support a single application together. A successful
general-purpose DBMS is designed in such a way that it can satisfy as many different
applications and application designers and builders as possible. A DBMS also needs to
provide effective run-time execution to properly support (e.g., in terms
of performance, availability, and security) as many end-users (the database's application
users) as needed. Sometimes the combination of a database and its respective DBMS is
referred to as a Database system (DBS).
The design, construction, and maintenance of a complex database requires specialist skills:
the staff performing these functions are referred to as database
application programmers (different from the DBMS developers/programmers)
and database administrators. Their tasks are supported by tools provided either as part of
the DBMS or as free-standing software products. These tools include specialized database
languages such as data definition languages (DDLs), data manipulation languages (DMLs),
and query languages. These can be seen as special-purpose programming languages,
tailored specifically to manipulate databases; sometimes they are provided as extensions of
existing programming languages, with added special database commands. Database
languages are generally specific to one data model, and in many cases they are specific to
one DBMS type. The most widely supported standard database language is SQL, which was
developed for the relational data model and combines the roles of DDL, DML, and
query language.
The database concept
The database concept has evolved since the 1960s to ease increasing difficulties in
designing, building, and maintaining complex information systems (typically with many
concurrent end-users, and with diverse, large amounts of data). It has evolved together with
the evolution of database management systems (DBMSs), which enable the effective
handling of databases. Though the terms database and DBMS define different entities, they
are inseparable: A database's properties are determined by its supporting DBMS and vice-
versa. The Oxford English Dictionary cites a 1962 technical report as the first to use the
term "data-base." With progress in processor, computer memory, computer storage, and
computer network technology, the sizes, capabilities, and performance of databases and
their respective DBMSs have grown by orders of magnitude. For decades it
has been unlikely that a complex information system can be built effectively without a
proper database supported by a DBMS. The utilization of databases is now spread to such a
wide degree that virtually every technology and product relies on databases and DBMSs for
its development and commercialization, or even may have such embedded in it. Also,
organizations and companies, from small to large, heavily depend on databases for their
operations.
No widely accepted exact definition of a DBMS exists. However, a system needs to provide
considerable functionality to qualify as a DBMS, and its supported data collection
needs to meet respective usability requirements to qualify as a database. Thus, a database
and its supporting DBMS are defined here by a set of general requirements listed below.
Virtually all existing mature DBMS products meet these requirements to a great extent,
while less mature products either meet them or are converging toward meeting them.
Parts of a database
A database has four major parts: tables, forms, reports, and queries.
Tables - Where all the information and data is stored within a database.
Forms - Forms are used to enter, edit, or delete data in a database, one record at a time.
Reports - A report produces a paper copy of the data contained within a database.
Queries and Sorts - Queries are a special database tool that allows you to search a database
for specific records based on certain information.
Definition of a key
A key is a set of columns that can be used to identify or access a particular row or rows. The
key is identified in the description of a table, index, or referential constraint. The same
column can be part of more than one key.
A unique key is a key that is constrained so that no two of its values are equal. The columns
of a unique key cannot contain NULL values. For example, an employee number column can
be defined as a unique key, because each value in the column identifies only one employee.
No two employees can have the same employee number.
The mechanism used to enforce the uniqueness of the key is called a unique index. The
unique index of a table is a column, or an ordered collection of columns, for which each
value identifies (functionally determines) a unique row. A unique index can contain NULL
values.
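As a concrete illustration, the following sketch uses SQLite (which ships with Python) to create a unique index on an employee-number column; the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE employee (emp_no INTEGER NOT NULL, name TEXT)")
# The unique index is the mechanism that enforces the unique key.
conn.execute("CREATE UNIQUE INDEX idx_emp_no ON employee(emp_no)")

conn.execute("INSERT INTO employee VALUES (1001, 'Asha')")
try:
    # A second employee with the same number violates the unique key.
    conn.execute("INSERT INTO employee VALUES (1001, 'Ravi')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```

Note that NULL handling varies by DBMS; SQLite, for instance, permits multiple NULLs under a unique index.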
Primary Key
The primary key of a relational table uniquely identifies each record in the table. It can
either be a normal attribute that is guaranteed to be unique (such as Social Security Number
in a table with no more than one record per person) or it can be generated by the DBMS
(such as a globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may
consist of a single attribute or multiple attributes in combination.
Examples:
Imagine we have a STUDENTS table that contains a record for each student at a university.
The student's unique student ID number would be a good choice for a primary key in the
STUDENTS table. The student's first and last name would not be a good choice, as there is
always the chance that more than one student might have the same name.
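A sketch of the STUDENTS example in SQLite (the column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,  -- uniquely identifies each record
        first_name TEXT,
        last_name  TEXT
    )
""")
# Two students may share a name...
conn.execute("INSERT INTO students VALUES (1, 'Priya', 'Shah')")
conn.execute("INSERT INTO students VALUES (2, 'Priya', 'Shah')")
try:
    # ...but never a student ID: the primary key rejects the duplicate.
    conn.execute("INSERT INTO students VALUES (1, 'Arjun', 'Rao')")
except sqlite3.IntegrityError:
    print("Duplicate primary key rejected")
```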
Secondary Key
A secondary key is defined on a field that you would like to have indexed for faster searches.
A secondary key is used to identify rows but is not usually unique, and a table can have
multiple secondary keys. The attributes used for a secondary key need not be the ones used
for a superkey; that is, a secondary key need not itself be a superkey.
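A secondary key can be sketched as an ordinary (non-unique) index. Here, assumed last names are indexed so searches on them are fast even though several rows share a value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(1, "Shah"), (2, "Shah"), (3, "Rao")])

# The secondary key: a non-unique index on last_name for faster searches.
conn.execute("CREATE INDEX idx_last_name ON students(last_name)")

rows = conn.execute(
    "SELECT student_id FROM students WHERE last_name = 'Shah' ORDER BY student_id"
).fetchall()
print(rows)  # two rows share the same secondary-key value
```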
Database Design Methodologies
Database design
Database design is a structured approach that uses procedures, techniques, tools, and
documentation aids to support and facilitate the process of designing a database.
Critical Success Factors in Database Design:
1. Use diagrams to represent as much of the data models as possible.
2. Good documentation of additional data semantics.
3. Build a data dictionary to supplement the data model diagrams.
4. Be willing to repeat steps.
5. Work interactively with the users as much as possible.
6. Follow a structured methodology throughout the data modeling process.
7. Employ a data-driven approach.
8. Incorporate structural and integrity considerations into the data models.
9. Combine conceptualization, normalization, and transaction validation techniques into the
data modeling methodology.
3 main phases of database design:
I. Conceptual database design: Construct a model of the information used, independent of all
physical considerations.
II. Logical database design: Construct a model of the information used, based on a specific
data model (e.g. relational), but independent of a particular DBMS and other physical
considerations.
III. Physical database design: Produce a description of the implementation of the DB on
secondary storage; it describes the storage structures and access methods used to achieve
efficient access to the data.
Methodology Overview
Step 1 Build local conceptual data model for each user view
Step 2 Build and Validate Local Logical Data Model
Step 3 Build and Validate Global Logical Data Model
Step 4 Translate Global Logical Data Model for Target DBMS
Step 5 Design Physical Representation
Step 6 Design and Implement Security Mechanisms
Step 7 Monitor and Tune the Operational System
Methodology Overview - Conceptual Database Design
Step 1 Build local conceptual data model for each user view
Step 1.1 Identify entity types
Step 1.2 Identify relationship types
Step 1.3 Identify and associate attributes with entity or relationship types
Step 1.4 Determine attribute domains
Step 1.5 Determine candidate and primary key attributes
Step 1.6 Specialize/generalize entity types (optional step)
Step 1.7 Draw Entity–Relationship diagram
Step 1.8 Review Local Conceptual Data Model with User
Methodology Overview – Logical Database Design for Relational Model
Step 2 Build and Validate Local Logical Data Model
Step 2.1 Map Local Conceptual Data Model to Local Logical Data Model
Step 2.2 Derive Relations from Local Logical Data Model
Step 2.3 Validate Model using Normalization
Step 2.4 Validate Model against User Transactions
Step 2.5 Draw Entity-Relationship Diagram
Step 2.6 Define Integrity Constraints
Step 2.7 Review Local Logical Data Model with User
Step 3 Build and Validate Global Logical Data Model
Step 3.1 Merge Local Logical Data Models into Global Model
Step 3.2 Validate Global Logical Data Model
Step 3.3 Check for Future Growth
Step 3.4 Draw Final Entity-Relationship Diagram
Step 3.5 Review Global Logical Data Model with Users
Methodology Overview – Physical Database Design for Relational Databases
Step 4 Translate Global Logical Data Model for Target DBMS
Step 4.1 Design Base Relations for Target DBMS
Step 4.2 Design Integrity Rules for Target DBMS
Step 5 Design Physical Representation
Step 5.1 Analyze Transactions
Step 5.2 Choose File Organization
Step 5.3 Choose Secondary Indexes
Step 5.4 Consider the Introduction of Controlled Redundancy
Step 5.5 Estimate Disk Space
Step 6 Design and Implement Security Mechanisms
Step 6.1 Design User Views
Step 6.2 Design Access Rules
Step 7 Monitor and Tune the Operational System
Entity
An Entity is defined as a person, place, thing or concept that (a) is of interest to the
corporation, (b) is capable of being described in concrete terms, and (c) is relevant within
the context of the specific environment of the firm. An entity is something that has a distinct,
separate existence, although it need not be a material existence. Entities are used in system
development models to show, for example, how documents or orders are communicated and
processed.
Entities are represented on the diagram by labelled boxes.
Types of entities:
Strong Entity: Objects are represented by their attributes and, as objects are mutually
distinguishable, a subset of these attributes forms a primary key that uniquely identifies an
instance of the entity. Entity types that have primary keys are called strong entities.
Weak Entity: Instances of a weak entity set associated with the same instance of a strong
entity must be distinguishable from each other by a subset of the attributes of the weak
entity. This subset of attributes is called the discriminator. Weak entities do not have a
primary key of their own.
Entity Set: An entity set is a set of entities of the same type that share the same properties,
or attributes. For example, all software engineers working in the department involved in the
Internet projects can be defined as the entity set InternetGroup. The individual entities that
constitute a set are called the extension of the entity set. Thus, all the individual software
engineers on the Internet projects are the extension of the entity set InternetGroup.
Entity sets need not be disjoint. For example, we can define an entity set Employee. An
employee may or may not be working on some Internet project. InternetGroup will contain
some entries that are also in the Employee entity set. Therefore, the entity sets Employee
and InternetGroup are not disjoint.
A database usually contains groups of entities that are similar. For example, employees of a
company share the same attributes. However, every employee entity has its own values for
each attribute. An entity type defines a set of entities that have the same attributes. A name
and a list of attributes describe each entity type.
An Entity Type is represented in ER diagrams as a rectangular box, and the corresponding
attributes are shown in ovals attached to the entity type by straight lines. An entity type is
basically the schema, intension, or structure for the set of entities that share the same
structure, whereas the individual entities of a particular entity type are collectively called
the entity set. The entity set is also called the extension of the entity type.
Entity Sub-Types
Sometimes it is useful to generalise about a group of entities which have similar
characteristics. For example, a VEHICLE is a generalisation of a CAR, a TRUCK and a
MOTORCYCLE.
Conversely, it can be useful to identify specialised sub-types of an entity. For example, a CAR
is a specialised type of VEHICLE, a TRUCK is a specialised type of VEHICLE and a
MOTORCYCLE is a specialised type of VEHICLE.
Attribute
An attribute is an aspect or quality of an entity which describes it or its actions. An attribute
may be a physical aspect such as height or weight, an aspect of the entity's location such as
place of residence or place of birth, or a quality such as the level of a particular skill or the
educational degree achieved.
An attribute is a property or descriptor of an entity, for example, Customer Name is an
attribute of the entity Customer. Each attribute will eventually be represented by one or
more entity attributes in the physical database structure.
Attributes Define Entities
Collectively, attributes define an entity. An attribute is meaningless by itself. For example,
date of birth takes its meaning from the context of the entity to which it is assigned, for
example, the date of birth of an employee. Attributes are not shown on the
Entity-Relationship Model but are recorded in the underlying data dictionary, which
contains the definitions of attributes for all entities and relationships identified in the
model. An attribute should not have facts recorded about it. In practice, however, there are
exceptions. For example, you might wish to show address as an attribute of Customer.
Address is not significant enough to be modelled as an entity in its own right and would
typically be shown as an attribute of Customer. However, at the detailed level, it may itself
have attributes such as an indicator for mailing address or home address. Attributes do not
have to be recognized and defined during the early stages of entity definition. Entity
definition is an iterative process, and it is unlikely that a completely satisfactory Entity-
Relationship Model will be obtained on the first iteration.
The following figure shows two entity types, Employee and Product, together with their
attribute lists. A few members of each entity type are shown.
Identifying Attributes
To identify entity attributes, examine:
1. All external entities from the Context Diagram,
2. The data flows passed by the external entities,
3. Existing automated data,
4. Each entity (i.e., generate a list of entity attributes that describe the entity).
Attributes Versus Data Elements
Attributes have a looser description than data elements. For instance, whereas an attribute
may have only a descriptive name, a data element needs:
1. a size and range,
2. a format and length,
3. an accurate and detailed description,
4. valid values,
5. defined edit rules.
Some attributes may be converted into many data elements. For instance, the attribute
"address" may become four data elements representing:
1. Street Address,
2. City/Town,
3. State/Province,
4. Postal or Zip Code.
Additional data elements may also be defined as a result of customer requirements. For
example, the customer may require a list of all companies by county. For the purposes of
data modelling, attributes and data elements are often considered identical, because
attributes in the data model typically become data elements in the database.
Categories of Attributes
Attributes fall into three categories depending on the information that the attribute
captures:
1. Descriptive attributes provide facts intrinsic to each instance of the entity. For example,
the Salary of Employee or the Address of Customer.
2. Naming attributes provide facts about arbitrary labels and names carried by each
instance of an entity. For example, the Employee Name of Employee or the Employee ID of
Employee.
3. Referential attributes (i.e., foreign keys) provide facts which tie an instance of one entity
to an instance of another entity. For example, the Department Number of the Department to
which an Employee is assigned ties the Employee to the Department.
Attribute domains
A domain is a set of values for an attribute (i.e., the properties or characteristics of entities).
The value set conforms to a common definition for the domain (e.g., type, format, syntax,
meaning).
Specify domains in one of the following ways:
1. List all possible values (e.g., for an attribute named color, possible values are red,
green, blue).
2. Identify a source (e.g., procedures manual) that contains the valid values.
3. List an acceptable range of values for the domain (e.g., for an attribute named
weight, possible values range from one to five pounds).
4. Define a business rule that permits determination of the validity of a value assigned
to an attribute (e.g., discounts greater than five percent only apply to commercial
customers).
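The first and third ways of specifying a domain can be sketched as CHECK constraints in SQLite; the table and rules below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE parcel (
        parcel_id INTEGER PRIMARY KEY,
        color  TEXT CHECK (color IN ('red', 'green', 'blue')),  -- listed values
        weight REAL CHECK (weight BETWEEN 1 AND 5)              -- value range
    )
""")
conn.execute("INSERT INTO parcel VALUES (1, 'red', 2.5)")  # within both domains
try:
    conn.execute("INSERT INTO parcel VALUES (2, 'mauve', 2.5)")  # outside the domain
except sqlite3.IntegrityError:
    print("Value outside the color domain rejected")
```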
Derived Attributes
Derived attributes are attributes whose values are generated from other attributes using
calculations, algorithms or procedures. For example, Account Balance is derived by
subtracting Total Debit from Total Credit.
Generally, the specifications for calculating derived attributes are a concern of the
processing aspects of the information system (e.g., process model). Derived attributes may
be included in the data model if the rules for calculating the attribute values would
otherwise be lost.
Clearly indicate in the data model when an attribute is derived. Ensure that the rules needed
to derive or calculate the attribute value are captured in the model. Verify that all attributes
needed to calculate the derived attribute are present in the data model. Once the physical
data model is constructed, some derived attributes are added to the model to improve
performance of the system.
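The Account Balance example can be sketched as a small function, so the derivation rule lives in the processing logic rather than in stored data (the attribute names and figures are illustrative):

```python
def account_balance(total_credit: float, total_debit: float) -> float:
    """Derived attribute: Account Balance = Total Credit - Total Debit."""
    return total_credit - total_debit

# Hypothetical figures: credits of 1500.00 against debits of 425.50.
print(account_balance(1500.00, 425.50))  # 1074.5
```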
Relationships
Frequently, a meaningful relationship exists between two different types of entity.
For example:
EMPLOYEEs work in a DEPARTMENT
LAWYERs advise CLIENTs
EQUIPMENT is allocated to PROJECTs
TRUCK is a type of VEHICLE
There are potentially three types of relationship which can exist between two different
entities:
1. One-to-One Relationships
2. One-to-Many Relationships
3. Many-to-Many Relationships
One-to-One Relationships:
This type of relationship takes place when a single occurrence of an entity is related to just
one occurrence of a second entity.
For example, a ROOF covers one BUILDING; a BUILDING is covered by one ROOF.
A One-to-One relationship is shown on the diagram by a line connecting the two entities.
One-to-Many Relationships:
This type of relationship takes place when a single occurrence of an entity is related to
many occurrences of a second entity.
For example, an EMPLOYEE works in one DEPARTMENT; a DEPARTMENT has many
EMPLOYEEs. A One-to-Many relationship is shown on the diagram by a line connecting the
two entities with a crow's foot symbol denoting the "many" end of the relationship.
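A one-to-many relationship is typically implemented with a foreign key on the "many" side. A sketch in SQLite, with invented names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES department(dept_id)  -- points at the "one" side
    )
""")
conn.execute("INSERT INTO department VALUES (10, 'Sales')")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Meera", 10), (2, "Vikram", 10)])  # many employees, one dept
n = conn.execute("SELECT COUNT(*) FROM employee WHERE dept_id = 10").fetchone()[0]
print(n)  # 2
```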
Many-to-Many Relationships:
This type of relationship takes place when many occurrences of an entity are related to
many occurrences of a second entity.
For example, EQUIPMENT is allocated to many PROJECTs; a PROJECT is allocated many
items of EQUIPMENT.
A Many-to-Many relationship is shown on the diagram by a line connecting the two entities
with a crow's foot symbol at each end of the line.
Eliminating Many-to-Many Relationships
Many-to-many relationships in an entity relationship diagram tend to conceal areas of poor
understanding. Almost always, a many-to-many relationship conceals a hidden entity.
For this reason many-to-many relationships are eliminated by identifying and adding the
hidden entity to the model. The new entity is related to the two original entities by a pair of
one-to-many relationships.
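This resolution can be sketched in SQLite: the hidden entity (here called allocation, an invented name) carries a pair of foreign keys, turning one many-to-many relationship into two one-to-many relationships:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE equipment (equip_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE project (proj_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE allocation (            -- the hidden entity
        equip_id INTEGER REFERENCES equipment(equip_id),
        proj_id  INTEGER REFERENCES project(proj_id),
        PRIMARY KEY (equip_id, proj_id)  -- one row per equipment/project pairing
    )
""")
conn.execute("INSERT INTO equipment VALUES (1, 'Crane')")
conn.executemany("INSERT INTO project VALUES (?, ?)",
                 [(100, "Bridge"), (200, "Tower")])
# The same item of EQUIPMENT is allocated to two PROJECTs.
conn.executemany("INSERT INTO allocation VALUES (?, ?)", [(1, 100), (1, 200)])
n = conn.execute("SELECT COUNT(*) FROM allocation WHERE equip_id = 1").fetchone()[0]
print(n)  # 2
```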
Choosing the Right Relationship
Depending on the purpose of the model, the length of time involved, and the definition of
the entities participating in the relationship, the type of relationship between two entities
can change.
For example, if the definition of a ROOF entity is an apex or flat surface covering a building,
then a BUILDING is covered by many ROOFs.
Likewise, over a period of time, an EMPLOYEE works in many DEPARTMENTs.
Optional and Mandatory Relationships
Sometimes the participation of an entity in a relationship is optional. At other times it is
mandatory.
For example, EQUIPMENT may be allocated to PROJECTs, an EMPLOYEE must work in a
DEPARTMENT, and a DEPARTMENT may have EMPLOYEEs working in it.
Optional participation in a relationship is shown by a circle placed on the relationship line
next to the entity that optionally participates in the relationship.
Mandatory participation is shown in a similar manner by placing a bar on the relationship
line.
Recursive Relationships
Relationships can exist between different occurrences of the same type of entity. For
example, a PERSON dances with another PERSON, an EMPLOYEE manages other
EMPLOYEEs, and a COMPONENT is made from other COMPONENTs.
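A recursive relationship can be sketched with a self-referencing foreign key and a self-join (SQLite in Python; the employee names and columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE employee (
    emp_id     INTEGER PRIMARY KEY,
    name       TEXT,
    manager_id INTEGER REFERENCES employee(emp_id)  -- recursive link to the same entity
)""")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', NULL)")   # top-level manager
conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 1)")
conn.execute("INSERT INTO employee VALUES (3, 'Meena', 1)")
# Self-join: pair each employee with his or her manager.
rows = conn.execute("""
    SELECT e.name, m.name
    FROM employee e JOIN employee m ON e.manager_id = m.emp_id
    ORDER BY e.name
""").fetchall()
print(rows)  # [('Meena', 'Asha'), ('Ravi', 'Asha')]
```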
ER Model
The whole purpose of ER modelling is to create an accurate reflection of the real world in a
database. The ER model doesn’t actually give us a database description. It gives us an
intermediate step from which it is easy to define a database. Let’s look at an example. (You
will see much more detail on these concepts in the rest of this chapter. For now just try to
understand the overall process.)
Suppose you are presented with the following situation and are told to create a database for
it:
Every department within our company is in only one division. Each division has more than
one department in it. We don’t have an upper limit on the number of departments that a
division can have. For example, the New Business Development---the one managed by
Mackenzie---and Higher Education departments are both in the Marketing division.
This is a fairly clear description of a situation. Many things that we understand about the
situation are left unsaid. For example: each division has a name, and that name is
unique within the company.
Deriving an E-R Diagram
An E-R diagram is used to represent the E-R model. It contains all known entities for the
given scenario, their attributes, identifiers, and the relationships that exist among the
entities. Deriving the entities, their attributes, identifiers, and the relationships among them
is one of the most important activities carried out during database development, so it is
imperative that it is done accurately.
From developing a number of databases, I have come up with a technique that allows for
rapid database design and development. This technique was initially taught to me by
consultants from the CRC SOGEMA INC. group in 1998, and a similar technique is described
in the IBM Informix Database Design and Implementation Guide (2005).
Steps
1. Discover the entities and the basic attributes for each entity. This process usually involves
performing a careful analysis of the processes and documentation used by the users in
the problem domain. Make a list of each of the entities discovered along with the basic
attributes of each entity.
2. Identify the relationships among the entities. Using an Entity-Entity Matrix (E-E Matrix)
is one of the fastest methods for deriving all the relationships among all the entities. This
matrix consists of an equal number of rows and columns, with each entity discovered
heading a row and a column. The intersection of the rows and columns represents
relationships that may exist between the entities – it is possible that no relationship will
exist between entities, that more than one relationship will exist between entities, and
that an entity will be related to itself.
3. Use the information from the E-E matrix to construct a simple initial E-R diagram. This
diagram will contain the entities and their basic attributes (and possible identifiers), and
the relationships indicated in the E-E matrix.
4. Derive optionalities and cardinalities for each relationship in the initial E-R diagram and
write assertions for each relationship. This records the assertions that can be made
about the data (relationships) in the user’s domain. Optionality says what can and must
happen in a relationship, and cardinality indicates the number of entity occurrences in a
relationship:
Optionality “0” – Can (optional)
Optionality “1” – Must (obligatory)
Cardinality “1” – Only one
Cardinality “N” – Many/At least one
5. Construct a detailed E-R diagram that includes all the derived information. This diagram
will include optionalities, cardinalities, attributes, identifiers, and relationships. Unique
identifiers are indicated with an asterisk (*), and non-unique identifiers with a plus (+).
Implementing the E-R Model – The Relational Model
While an E-R diagram displays the logical nature of the data that exists in the user’s
domain, the relational model shows how this data will be represented in a Relational
Database Management System (RDBMS). This model is important because the most
common and popular database management systems in use today are relational, and
relational model diagrams can be directly implemented on any RDBMS.
The relational model is similar to the E-R model but is based on a branch of mathematics
called relational algebra, and as such, there are strict definitions and rules regarding the
elements of this model. Here are some of these definitions and rules:
o A relation is defined as a two-dimensional table that contains rows and columns. Rows
are instances of the entity, and columns are attributes of the entity.
o No two rows of the table may be identical – each row must be unique (in its
contents), and the order of the rows is not important.
o Each column must have a unique name, and the order of the columns is not important.
They must also contain single values – multiple values and repeating groups are not
allowed.
o A key is a group of one or more attributes that uniquely identifies a row – each table
must have a key. Since each row in the table is unique, the key is what is used to
distinguish one row from another. It can consist of a single attribute, or it can be
composite, comprising more than one attribute. (Kroenke D, 1999, pp. 113-117)
o Every column (attribute) in a table must depend solely on the primary key. It should not
depend on any other column in the table, or on part of the primary key if it is a
composite key.
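These rules can be sketched with SQLite in Python (the ENROLMENT relation and its columns are illustrative): a composite key enforces row uniqueness, and each attribute holds a single value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE enrolment (
    student_id INTEGER,
    course_id  INTEGER,
    grade      TEXT,                      -- single-valued attribute
    PRIMARY KEY (student_id, course_id)   -- composite key
)""")
conn.execute("INSERT INTO enrolment VALUES (1, 101, 'A')")
try:
    # A second row with the same key value violates row uniqueness.
    conn.execute("INSERT INTO enrolment VALUES (1, 101, 'B')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
remaining = conn.execute("SELECT COUNT(*) FROM enrolment").fetchone()[0]
print(remaining)  # 1
```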
Database management systems
A database management system (DBMS) is a software package with computer programs
that control the creation, maintenance, and use of a database. It allows organizations to
conveniently develop databases for various applications, with the work carried out by
database administrators (DBAs) and other specialists.
and other specialists. A database is an integrated collection of data records, files, and other
database objects. A DBMS allows different user application programs to concurrently access
the same database. DBMSs may use a variety of database models, such as the relational
model or object model, to conveniently describe and support applications. A DBMS typically
supports query languages (in effect high-level programming languages dedicated to
databases) that considerably simplify the writing of database application programs.
Database languages also simplify the database organization as well as retrieving and
presenting information from it. A DBMS provides facilities for controlling data access,
enforcing data integrity, managing concurrency control, recovering the database after
failures and restoring it from backup files, as well as maintaining database security. The
topics discussed briefly are as follows:
Overview
A DBMS is a set of software programs that controls the system organization, storage,
management, and retrieval of data in a database. DBMSs are categorized according to their
data structures or types. The DBMS accepts requests for data from an application program
and instructs the operating system to transfer the appropriate data. The queries and
responses must be submitted and received according to a format that conforms to one or
more applicable protocols. When a DBMS is used, information systems can be changed more
easily as the organization's information requirements change. New categories of data can be
added to the database without disruption to the existing system.
Database servers are dedicated computers that hold the actual databases and run only the
DBMS and related software. Database servers are usually multiprocessor computers, with
generous memory and RAID disk arrays used for stable storage. Hardware database
accelerators, connected to one or more servers via a high-speed channel, are also used in
large volume transaction processing environments. DBMSs are found at the heart of most
database applications. DBMSs may be built around a custom multitasking kernel with
built-in networking support, but modern DBMSs typically rely on a standard operating
system to provide these functions.
History
Databases have been in use since the earliest days of electronic computing. Unlike modern
systems, which can be applied to widely different databases and needs, the vast majority of
older systems were tightly linked to custom databases in order to gain speed at the
expense of flexibility. Originally DBMSs were found only in large organizations with the
computer hardware needed to support large data sets.
1960s Navigational DBMS
As computers grew in speed and capability, a number of general-purpose database systems
emerged; by the mid-1960s there were a number of such systems in commercial use.
Interest in a standard began to grow, and Charles Bachman, author of one such product, the
Integrated Data Store (IDS), founded the "Database Task Group" within CODASYL, the group
responsible for the creation and standardization of COBOL. In 1971 they delivered their
standard, which generally became known as the "Codasyl approach", and soon a number of
commercial products based on this approach were made available.
The Codasyl approach was based on the "manual" navigation of a linked data set which was
formed into a large network. When the database was first opened, the program was handed
back a link to the first record in the database, which also contained pointers to other pieces
of data. To find any particular record the programmer had to step through these pointers
one at a time until the required record was returned. Simple queries like "find all the people
in India" required the program to walk the entire data set and collect the matching results
one by one. There was, essentially, no concept of "find" or "search". This may sound like a
serious limitation today, but in an era when most data was stored on magnetic tape such
operations were too expensive to contemplate anyway.
IBM also had their own DBMS system in 1968, known as IMS. IMS was a development of
software written for the Apollo program on the System/360. IMS was generally similar in
concept to Codasyl, but used a strict hierarchy for its model of data navigation instead of
Codasyl's network model. Both concepts later became known as navigational databases due
to the way data was accessed, and Bachman's 1973 Turing Award lecture was The
Programmer as Navigator. IMS is classified as a hierarchical database. IDMS and
CINCOM's TOTAL database are classified as network databases.
1970s relational DBMS
Edgar Codd worked at IBM in San Jose, California, in one of their offshoot offices that was
primarily involved in the development of hard disk systems. He was unhappy with the
navigational model of the Codasyl approach, notably the lack of a "search" facility. In 1970,
he wrote a number of papers that outlined a new approach to database construction that
eventually culminated in the groundbreaking A Relational Model of Data for Large Shared
Data Banks.
In this paper, he described a new system for storing and working with large databases.
Instead of records being stored in some sort of linked list of free-form records as in Codasyl,
Codd's idea was to use a "table" of fixed-length records. A linked-list system would be very
inefficient when storing "sparse" databases where some of the data for any one record
could be left empty. The relational model solved this by splitting the data into a series of
normalized tables, with optional elements being moved out of the main table to where they
would take up room only if needed.
In the relational model, related records are linked together with a "key".
For instance, a common use of a database system is to track information about users, their
name, login information, various addresses and phone numbers. In the navigational
approach all of these data would be placed in a single record, and unused items would
simply not be placed in the database. In the relational approach, the data would be
normalized into a user table, an address table and a phone number table (for instance).
Records would be created in these optional tables only if the address or phone numbers
were actually provided.
Linking the information back together is the key to this system. In the relational model,
some bit of information was used as a “key”, uniquely defining a particular record. When
information was being collected about a user, information stored in the optional (or related)
tables would be found by searching for this key. For instance, if the login name of a user is
unique, addresses and phone numbers for that user would be recorded with the login name
as its key. This "re-linking" of related data back into a single collection is something that
traditional computer languages are not designed for.
Just as the navigational approach would require programs to loop in order to collect
records, the relational approach would require loops to collect information about any one
record. Codd's solution to the necessary looping was a set-oriented language, a suggestion
that would later spawn the ubiquitous SQL. Using a branch of mathematics known as tuple
calculus, he demonstrated that such a system could support all the operations of normal
databases (inserting, updating etc.) as well as providing a simple system for finding and
returning sets of data in a single operation.
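The normalization-and-relinking idea described above can be sketched with SQLite in Python (table and column names are illustrative): a user table and a phone table, linked by the login name as key, are re-joined in a single set-oriented statement rather than a programmer-written loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user  (login TEXT PRIMARY KEY, name TEXT);
-- Optional data lives in its own table; rows exist only if provided.
CREATE TABLE phone (login TEXT REFERENCES user(login), number TEXT);
""")
conn.execute("INSERT INTO user VALUES ('jdoe', 'J. Doe')")
conn.execute("INSERT INTO phone VALUES ('jdoe', '555-0101')")
conn.execute("INSERT INTO phone VALUES ('jdoe', '555-0102')")
# One set-oriented statement re-links the related rows; no explicit loop.
rows = conn.execute("""
    SELECT u.name, p.number
    FROM user u JOIN phone p ON p.login = u.login
    ORDER BY p.number
""").fetchall()
print(rows)  # [('J. Doe', '555-0101'), ('J. Doe', '555-0102')]
```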
Codd's paper was picked up by two people at Berkeley, Eugene Wong and Michael
Stonebraker. They started a project known as INGRES using funding that had already been
allocated for a geographical database project, using student programmers to produce code.
Beginning in 1973, INGRES delivered its first test products which were generally ready for
widespread use in 1979. During this time, a number of people had moved "through" the
group — perhaps as many as 30 people worked on the project, about five at a time. INGRES
was similar to System R in a number of ways, including the use of a "language" for data
access, known as QUEL — QUEL was in fact relational, having been based on Codd's own
Alpha language, but has since been corrupted to follow SQL, thus violating much the same
concepts of the relational model as SQL itself.
IBM itself did one test implementation of the relational model, PRTV, and a production one,
Business System 12, both now discontinued. Honeywell did MRDS for Multics, and now
there are two new implementations: Alphora Dataphor and Rel. All other DBMS
implementations usually called relational are actually SQL DBMSs. In 1968, the University of
Michigan began development of the Micro DBMS. It was used to manage very large data
sets by the US Department of Labor, the Environmental Protection Agency, and researchers
from the University of Alberta, the University of Michigan, and Wayne State University. It
ran on mainframe computers using the Michigan Terminal System. The system remained in
production
until 1996.
Late-1970s SQL DBMS
IBM started working on a prototype system loosely based on Codd's concepts as System R in
the early 1970s. The first version was ready in 1974/5, and work then started on multi-
table systems in which the data could be split so that all of the data for a record (some of
which is optional) did not have to be stored in a single large "chunk". Subsequent multi-user
versions were tested by customers in 1978 and 1979, by which time a standardized query
language – SQL – had been added. Codd's ideas were establishing themselves as both
workable and superior to Codasyl, pushing IBM to develop a true production version of
System R, known as SQL/DS, and, later, Database 2 (DB2).
Many of the people involved with INGRES became convinced of the future commercial
success of such systems, and formed their own companies to commercialize the work but
with an SQL interface. Sybase, Informix, NonStop SQL and eventually Ingres itself were all
being sold as offshoots to the original INGRES product in the 1980s. Even Microsoft SQL
Server is actually a re-built version of Sybase, and thus, INGRES. Only Larry Ellison's Oracle
started from a different chain, based on IBM's papers on System R, and beat IBM to market
when the first version was released in 1978.
Stonebraker went on to apply the lessons from INGRES to develop a new database, Postgres,
which is now known as PostgreSQL. PostgreSQL is often used for global mission critical
applications (the .org and .info domain name registries use it as their primary data store, as
do many large companies and financial institutions).
In Sweden, Codd's paper was also read and Mimer SQL was developed from the mid-70s at
Uppsala University. In 1984, this project was consolidated into an independent enterprise.
In the early 1980s, Mimer introduced transaction handling for high robustness in
applications, an idea that was subsequently implemented in most other DBMSs.
1980s object-oriented databases
The 1980s, along with a rise in object oriented programming, saw a growth in how data in
various databases were handled. Programmers and designers began to treat the data in
their databases as objects. That is to say that if a person's data were in a database, that
person's attributes, such as their address, phone number, and age, were now considered to
belong to that person instead of being extraneous data. This allows relations between
data to be expressed as relations between objects and their attributes, rather than
between individual fields.
Another big game changer for databases in the 1980s was the focus on increasing reliability
and access speeds. In 1989, two professors from the University of Wisconsin at Madison
published an article at an ACM associated conference outlining their methods on increasing
database performance. The idea was to replicate specific important, and often queried
information, and store it in a smaller temporary database that linked these key features
back to the main database. This meant that a query could search the smaller database much
quicker, rather than searching the entire dataset. This eventually led to the practice of
indexing, which is used by almost every operating system, from Windows to the system that
operates Apple iPod devices.
21st century NoSQL databases
In the 21st century a new trend of NoSQL databases was started. Those non-relational
databases are significantly different from the classic relational databases. They often do not
require fixed table schemas, avoid join operations by storing denormalized data, and are
designed to scale horizontally. Most of them can be classified as either key-value stores or
document-oriented databases.
In recent years there has been a high demand for massively distributed databases with high
partition tolerance, but according to the CAP theorem it is impossible for a distributed
system to simultaneously provide consistency, availability, and partition tolerance
guarantees. A distributed system can satisfy any two of these guarantees at the same time,
but not all three. For that reason, many NoSQL databases use what is called eventual
consistency to provide both availability and partition tolerance guarantees with a maximum
achievable level of data consistency.
The most popular software in this category includes memcached, Redis, MongoDB,
CouchDB, Apache Cassandra, and HBase.
Current trends
In 1998, database management was in need of a new style of database to solve its
then-current problems. Researchers realized that the old trends of database management
were becoming too complex, and that there was a need for automated configuration and
management. Surajit Chaudhuri, Gerhard Weikum, and Michael Stonebraker were pioneers
who dramatically influenced thinking about database management systems. They believed
that database management needed a more modular approach, and that existing systems
demanded too much specification from users. Since this new development process emerged,
there are more possibilities: database management is no longer limited to "monolithic
entities". Many solutions have been developed to satisfy the individual needs of users, and
the development of numerous database options has created flexibility in database
management.
There are several ways database management has affected the field of technology. As
organizations have grown in size, their demand for directory services has grown with them;
businesses use directory services that provide prompted searches for company information.
Mobile
devices are able to store more than just the contact information of users, and can cache and
display a large amount of information on smaller displays. Search engine queries are able to
locate data within the World Wide Web. Retailers have also benefited from the
developments with data warehousing, recording customer transactions. Online transactions
have become tremendously popular for e-business. Consumers and businesses are able to
make payments securely through some company websites.
Components
DBMS engine accepts logical requests from various other DBMS subsystems, converts
them into physical equivalents, and actually accesses the database and data dictionary
as they exist on a storage device.
Data definition subsystem helps the user create and maintain the data dictionary and
define the structure of the files in a database.
Data manipulation subsystem helps the user to add, change, and delete information in
a database and query it for valuable information. Software tools within the data
manipulation subsystem are most often the primary interface between the user and the
information contained in a database. It allows users to specify their logical information
requirements.
Application generation subsystem contains facilities to help users develop
transaction-intensive applications, which usually require that a user perform a detailed
series of tasks to process a transaction. It provides easy-to-use data-entry screens,
programming languages, and interfaces.
Data administration subsystem helps users manage the overall database
environment by providing facilities for backup and recovery, security management,
query optimization, concurrency control, and change management.
Modeling language
A modeling language is a data modeling language used to define the schema of each
database hosted in the DBMS, according to the DBMS's database model. Database
management systems are designed to use one of five database structures to provide simple
access to information stored in databases. The five database structures are:
• the hierarchical model,
• the network model,
• the relational model,
• the multidimensional model, and
• the object model.
Inverted lists and other methods are also used. A given database management system may
provide one or more of the five models. The optimal structure depends on the natural
organization of the application's data, and on the application's requirements, which include
transaction rate (speed), reliability, maintainability, scalability, and cost.
The hierarchical structure was used in early mainframe DBMSs. Records’ relationships form
a treelike model. This structure is simple but inflexible, because each relationship is
confined to one-to-many. IBM’s IMS system and the RDM Mobile product are examples of a
hierarchical database system with multiple hierarchies over the same data. RDM Mobile is a
newly designed embedded database for a mobile computer system. The hierarchical
structure is used primarily today for storing geographic information and file systems.
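A hierarchical record set is a tree, so the file-system example above can be sketched in plain Python (names and layout are illustrative): every child has exactly one parent, and access follows the tree downward.

```python
# A toy hierarchy: each node has one parent, many children (one-to-many only).
filesystem = {
    "name": "/",
    "children": [
        {"name": "home", "children": [
            {"name": "report.txt", "children": []}]},
        {"name": "etc", "children": []},
    ],
}

def walk(node, path=""):
    """Depth-first traversal: the natural access pattern for a hierarchy."""
    full = path + node["name"] if path else node["name"]
    yield full
    for child in node["children"]:
        yield from walk(child, full.rstrip("/") + "/")

paths = list(walk(filesystem))
print(paths)  # ['/', '/home', '/home/report.txt', '/etc']
```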
The network structure consists of more complex relationships. Unlike the hierarchical
structure, it can relate many records and access them by following one of several paths.
In other words, this structure allows for many-to-many relationships.
The relational structure is the most commonly used today. It is used by mainframe,
midrange and microcomputer systems. It uses two-dimensional rows and columns to store
data. The tables of records can be connected by common key values. While working for IBM,
E.F. Codd designed this structure in 1970. The model is not easy for the end user to run
queries with because it may require a complex combination of many tables.
The multidimensional structure is similar to the relational model. The dimensions of the
cube-like model have data relating to elements in each cell. This structure gives a
spreadsheet-like view of data. This structure is easy to maintain because records are stored
as fundamental attributes—in the same way they are viewed—and the structure is easy to
understand. Its high performance has made it the most popular database structure when it
comes to enabling online analytical processing (OLAP).
The object-oriented structure has the ability to handle graphics, pictures, voice, and text
data types without difficulty, unlike the other database structures. This structure is
popular for multimedia Web-based applications. It was designed to work with object-
oriented programming languages such as Java.
The dominant model in use today is the ad hoc one embedded in SQL, despite the objections
of purists who believe this model is a corruption of the relational model since it violates
several fundamental principles for the sake of practicality and performance. Many DBMSs
also support the Open Database Connectivity API that supports a standard way for
programmers to access the DBMS.
Before the database management approach, organizations relied on file processing systems
to organize, store, and process data files. End users criticized file processing because the
data was stored in many different files, each organized in a different way. Each file was
specialized for use with a specific application. File processing was bulky, costly, and
inflexible when it came to supplying needed data accurately and promptly. Data
redundancy was an issue with file processing systems because the independent data files
held duplicate data, so when updates were needed each separate file had to be updated.
Another issue was the lack of data integration, since the data depended on other data to
organize and store it. Lastly, there was no consistency or standardization of the data in a
file processing system, which made maintenance difficult. For these reasons, the database
management approach was developed.
Data structure
Data structures (fields, records, files, and objects) are optimized to deal with very large
amounts of data stored on a permanent data storage device (which implies relatively slow
access compared to volatile main memory).
Database query language
A database query language and report writer allow users to interactively interrogate the
database, analyze its data, and update it according to the user's privileges on the data. It also
controls the security of the database. Data security prevents unauthorized users from
viewing or updating the database. Using passwords, users are allowed access to the entire
database or subsets of it called subschemas. For example, an employee database can contain
all the data about an individual employee, but one group of users may be authorized to view
only payroll data, while others are allowed access to only work history and medical data.
If the DBMS provides a way to interactively enter and update the database, as well as
interrogate it, this capability allows for managing personal databases. However, it may not
leave an audit trail of actions or provide the kinds of controls necessary in a multi-user
organization. These controls are only available when a set of application programs are
customized for each data entry and updating function.
Transaction mechanism
A database transaction mechanism ideally guarantees ACID properties in order to ensure
data integrity despite concurrent user accesses (concurrency control), and faults (fault
tolerance). It also maintains the integrity of the data in the database. The DBMS can
maintain the integrity of the database by not allowing more than one user to update the
same record at the same time. The DBMS can help prevent duplicate records via unique
index constraints; for example, no two customers with the same customer numbers (key
fields) can be entered into the database.
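Both points can be sketched with SQLite in Python (customer numbers and names are illustrative): a unique key rejects duplicate customer numbers, and a transaction rolls back as a whole when any statement in it fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (100, 'Mehta')")
conn.commit()
try:
    with conn:  # a transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO customer VALUES (101, 'Rao')")
        # Same customer number as an existing row -> IntegrityError, and the
        # whole transaction (including the 'Rao' row) is rolled back.
        conn.execute("INSERT INTO customer VALUES (100, 'Duplicate')")
except sqlite3.IntegrityError:
    pass
count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(count)  # 1
```

The all-or-nothing rollback is the atomicity of ACID; the rejected duplicate key preserves consistency.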
External, logical and internal view
Traditional view of data
A DBMS provides the ability for many different users to share data and process resources.
As there can be many different users, there are many different database needs. The question
is: How can a single, unified database meet varying requirements of so many users?
A DBMS minimizes these problems by providing three views of the database data: an
external view (or user view), logical view (or conceptual view) and physical (or internal)
view. The user’s view of a database program represents data in a format that is meaningful
to a user and to the software programs that process those data.
One strength of a DBMS is that while there is typically only one conceptual (or logical) and
physical (or internal) view of the data, there can be an endless number of different external
views. This feature allows users to see database information in a more business-related way
rather than from a technical, processing viewpoint. Thus the external view refers to the way
the user views the data, and the physical view refers to the way the data are physically
stored and processed.
Features and capabilities
Alternatively, and especially in connection with the relational model of database
management, the relation between attributes drawn from a specified set of domains can be
seen as being primary. For instance, the database might indicate that a car that was
originally "red" might fade to "pink" in time, provided it was of some particular "make" with
an inferior paint job. Such higher-arity relationships provide information on all of the
underlying domains at the same time, with none of them being privileged above the others.
Simple definition
A database management system is the system in which related data is stored in an efficient
and compact manner. "Efficient" means that the data which is stored in the DBMS can be
accessed quickly and "compact" means that the data takes up very little space in the
computer's memory. The phrase "related data" means that the data stored pertains to a
particular topic.
Specialized databases have existed for scientific, imaging, document storage and like uses.
Functionality drawn from such applications has begun appearing in mainstream DBMSs as
well. However, the main focus, at least when aimed at the commercial data processing
market, is still on descriptive attributes on repetitive record structures.
Thus, the DBMSs of today roll together frequently needed services or features of attribute
management. By externalizing such functionality to the DBMS, applications effectively share
code with each other and are relieved of much internal complexity. Features commonly
offered by database management systems include:
Query ability
Querying is the process of requesting attribute information from various
perspectives and combinations of factors. Example: "How many 2-door cars in Texas
are green?" A database query language and report writer allow users to
interactively interrogate the database, analyze its data, and update it according to
the user's privileges on the data.
Backup and replication
Copies of attributes need to be made regularly in case primary disks or other
equipment fails. A periodic copy of attributes may also be created for a distant
organization that cannot readily access the original. DBMSs usually provide utilities
to facilitate the process of extracting and disseminating attribute sets. When data is
replicated between database servers, so that the information remains consistent
throughout the database system and users cannot tell or even know which server in
the DBMS they are using, the system is said to exhibit replication transparency.
Rule enforcement
Often one wants to apply rules to attributes so that the attributes are clean and
reliable. For example, we may have a rule that says each car can have only one
engine associated with it (identified by Engine Number). If somebody tries to
associate a second engine with a given car, we want the DBMS to deny such a
request and display an error message. However, with changes in the model
specification such as, in this example, hybrid gas-electric cars, rules may need to
change. Ideally such rules should be able to be added and removed as needed
without significant data layout redesign.
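A minimal sketch of such a rule as a declarative constraint, again using sqlite3; the `engines` table and the encoding of the one-engine-per-car rule as UNIQUE are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE(car_id) encodes the rule "each car has at most one engine".
conn.execute("""CREATE TABLE engines (
    engine_no TEXT PRIMARY KEY,
    car_id    INTEGER NOT NULL UNIQUE)""")
conn.execute("INSERT INTO engines VALUES ('E100', 1)")
try:
    conn.execute("INSERT INTO engines VALUES ('E200', 1)")  # second engine for car 1
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

If the model later changes (e.g. hybrid gas-electric cars with two power units), only the constraint has to be redefined, not the data layout.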
Security
For security reasons, it is desirable to limit who can see or change specific attributes
or groups of attributes. This may be managed directly on an individual basis, or by
the assignment of individuals and privileges to groups, or (in the most elaborate
models) through the assignment of individuals and groups to roles which are then
granted entitlements.
Computation
Common computations requested on attributes are counting, summing, averaging,
sorting, grouping, cross-referencing, and so on. Rather than have each computer
application implement these from scratch, they can rely on the DBMS to supply such
calculations.
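For instance, counting, summing, averaging and grouping can all be delegated to the engine in a single query; a sketch with sqlite3 and invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])
# Counting, summing, averaging and grouping done by the DBMS, not the application:
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 2, 150.0, 75.0), ('west', 1, 75.0, 75.0)]
```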
Change and access logging
This describes who accessed which attributes, what was changed, and when it was
changed. Logging services allow this by keeping a record of access occurrences and
changes.
Automated optimization
For frequently occurring usage patterns or requests, some DBMS can adjust
themselves to improve the speed of those interactions. In some cases the DBMS will
merely provide tools to monitor performance, allowing a human expert to make the
necessary adjustments after reviewing the statistics collected.
Meta-data repository
Metadata is data describing data. For example, a listing that describes what attributes are
allowed to be in data sets is called "meta-information".
Advanced DBMS
An example of an advanced DBMS is the Distributed Database Management System (DDBMS),
which manages a collection of data that logically belongs to the same system but is spread
over the sites of a computer network. The two defining aspects of a distributed database are
distribution and logical correlation:
Distribution: The fact that the data are not resident at the same site, so that we can
distinguish a distributed database from a single, centralized database.
Logical Correlation: The fact that the data have some properties which tie them together, so
that we can distinguish a distributed database from a set of local databases or files which
are resident at different sites of a computer network.
Access Control in Database Management Systems
Introduction
The use of database management systems results in a single data set on which all users and
applications in an organization work. On one side, the previously hard problems of
duplication, inconsistency and application dependence of data disappear. On the other
side, much more dangerous security problems appear. A brief introduction to them follows.
Designing a secure DBMS requires identifying the security risks and selecting the right
security policies (what the security system is supposed to do) and mechanisms (how we
are going to achieve it) to neutralize them.
A secure database system should satisfy three basic requirements on data protection:
security, integrity and availability. What do these terms mean?
Ensuring security - preventing, detecting and deterring improper disclosure of
information. This is especially important in strongly protected environments (e.g.
the army).
Ensuring integrity - preventing, detecting and deterring improper changes to
information. The proper functioning of any organization depends on proper operations
on proper data.
Ensuring availability - preventing improper denial of the services that the DBMS
provides.
Security Policies
Security policies are guidelines describing all (if possible) actions pertinent to the
information system. Logical access control belongs to this area, so security policies
should define the principles on which the design of a secure DBMS is based.
Generally, policies should answer the basic security questions. Policies can be divided
into two basic types - minimal privilege (e.g. the army) and maximal privilege (e.g. a
university). A system with such a policy is called closed or open, respectively.
For access control, the way access rules are administered must also be determined:
hierarchical decentralized - a central authorizer distributes responsibilities among
dependent sub-authorizers
ownership based - the owner of an object (its author) determines access to the object
cooperative authorization - authorization of special rights for special resources is
approved by all members of a predefined group
Security Mechanisms
Once the policies are defined, mechanisms that fulfil them can be selected.
Mechanisms are either external:
administrative controls
physical controls
or internal, i.e. part of the information system itself:
authentication - the user's identity is verified; this process is based on knowledge of
something, ownership of an object, or a physical characteristic of the user
authorization - the system answers only those queries the user is authorized for
(access control)
audit - composed of two phases: logging of actions in the system and reporting
of the logged information
Security Threat
This term has been used several times already. A security threat is any hostile agent which,
randomly or with the use of specialized techniques, can obtain or change information in the
information system. Random security threats are:
natural or accidental disasters - earthquake, water damage or fire. Both data and
hardware are damaged, which leads to integrity violation and denial of service.
errors and bugs in hardware and software - cause improper application of
security policies.
human errors - unintentional violations such as incorrect input or wrong use of
applications.
Intended security threats can be categorized according to their originator:
authorized users - abuse their privileges
hostile agents - various hostile programs: viruses, Trojan horses, back-doors
Requirements on DBMS Security
We now have a basic picture of information system security and can look at the concrete
aspects that DBMS security mechanisms should cover.
1. Protection from improper access - only authorized users should be granted access
to objects of the DBMS. This control should also be applicable to finer-grained
objects (record, attribute, value).
2. Protection from inference - inference of confidential information from available
data should be avoided. This regards mainly statistical DBMSs.
3. Database integrity - ensured partly by system controls of the DBMS (atomic
transactions) and various back-up and recovery procedures, and partly by
security procedures.
4. Operational data integrity - logical consistence of data during concurrent
transactions (concurrency manager), serializability and isolation of transactions
(locking techniques).
5. Semantic data integrity - ensuring that attribute values are in allowed ranges. This
is ensured with integrity constraints.
6. Accountability and auditing - there should be possibility to log all data accesses.
7. User authentication - there should be unambiguous identification of each DBMS
user. This is basis for all authorization mechanisms.
8. Management and protection of sensitive data - access should be granted only to
a narrow circle of users.
9. Multilevel security - data may be classified according to their sensitivity. Access
granting should then depend on that classification.
10. Confinement (subject isolation) - there is necessity to isolate subjects to avoid
uncontrolled data flow between programs (memory channels, covert channels).
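Item 5, semantic data integrity, is typically expressed with declarative integrity constraints such as CHECK clauses. A minimal sketch with sqlite3; the table and the allowed range are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint keeps the attribute value in its allowed range.
conn.execute(
    "CREATE TABLE employee (name TEXT, age INTEGER CHECK (age BETWEEN 16 AND 70))")
conn.execute("INSERT INTO employee VALUES ('Smith', 35)")   # in range: accepted
try:
    conn.execute("INSERT INTO employee VALUES ('Jones', 200)")  # out of range
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```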
At least five aspects of the previous list must be ensured with special techniques that do
not exist in insecure DBMSs. There are three basic ways to do this:
flow control - controlling information flows within the DBMS
inference control - controlling dependencies among data
access control - restricting access to the information in the DBMS
Cryptographic techniques can be added to these controls.
Flow Control
Flow control regulates the distribution (flow) of information among available objects,
e.g. reading information from object X and writing it directly into object Y.
Flow control policies need a list of acceptable information flows and their constraints.
Flow constraints are often based on a classification of system elements and a definition
of the acceptable flows between classification levels.
Inference Control
The aim of inference control is to avoid indirect disclosure of information (a set of data X
readable by a user can be used to determine a data set Y, with Y = f(X)). Generally
there are three paths to unauthorized data disclosure:
1. correlated data - a typical channel where visible data X are semantically related to
invisible data Y
2. missing data - the result of a query contains NULL values that mask sensitive data.
The existence of that data may be detected that way.
3. statistical inference - typical for databases that provide statistical information
about entities.
Statistical databases do not allow direct access to data; the user may query only statistical
data. Attacks on these DBMSs can be countered with two different approaches:
1. data perturbation - concrete data are replaced with statistical results
2. query controls - more frequently used; mostly based on a minimal and maximal
number of items concerned by a query. The results are satisfactory, but this
technique is expensive and difficult to administer.
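A toy sketch of a query-set-size control of the second kind (all names, data and thresholds are invented): a statistic is refused when the query set, or its complement, is smaller than a minimum size, so that no individual value can be isolated.

```python
K_MIN = 2  # minimal allowed size for both the query set and its complement

SALARIES = {"ann": 100, "bob": 120, "carl": 90,
            "dana": 110, "eve": 95, "fred": 105}

def avg_salary(predicate):
    selected = [v for name, v in SALARIES.items() if predicate(name)]
    rest = len(SALARIES) - len(selected)
    if len(selected) < K_MIN or rest < K_MIN:
        raise PermissionError("query set size outside allowed bounds")
    return sum(selected) / len(selected)

print(avg_salary(lambda n: n < "d"))  # average over ann, bob, carl: allowed
try:
    avg_salary(lambda n: n != "ann")  # would isolate one person's salary
except PermissionError as exc:
    print("refused:", exc)
```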
Access Control
Access control enforces the rules determined by the security policies for all
direct accesses to the system. Traditional control systems work with the notions of
subject, object and operation.
Fig. 1. Schema of a secure database management system
Security Models
The aim of security modeling is to create an abstract, software-independent conceptual
model from the requirements specification for system protection.
A security model should provide a rich semantic representation which allows description of
the functional and structural properties of the security system. It should also provide
definitions of the protection requirements and system policies. A proof of the model's
properties should be available, too.
It is clear that the level at which we decide to describe access control can differ greatly,
as we will see in the next paragraph. The basic approaches to access control were
introduced above; descriptions of concrete models follow.
Abstractions of Access Models
Access control models can be classified at several levels, as proposed by LaPadula and
Williams. The following levels proceed from the general to the more implementation-dependent:
1. trust objectives - definition of the basic requirements on the system's trustworthiness
2. requirements on the external interface - security requirements on the system-
environment interface
3. internal requirements - requirements that must be satisfied in system components
4. operational rules - describe the assurance of the internal requirements
5. functional design - functional description of the behavior of system components
The models described below are at level 3. The TBAC and RBAC model families partly
overlap with level 2 because they are much more general and cover a wider range of
problems.
Types of Access Control Models
Security models can be classified according to many aspects: for example, target system,
type of security policy, addressed security aspects, type of control (direct or of flows) and so
on. Here we consider the type of security policy enforced by the model.
Two basic model types arose very early - discretionary and mandatory access control.
In the former, the owner of the data governs access. This is the most common form of
authorization administration - ownership based. That policy is very flexible but also very
difficult to control from a global point of view. Models with mandatory access control
enforce a global policy by controlling flows among the security levels assigned to objects.
It seemed that nothing else would emerge, but OO technologies have encouraged new
approaches that reflect OO DBMSs and the new requirements of the commercial sphere. The
first is RBAC - access control based on roles - and the second is TBAC, which is based on
the concept of a task. TBAC brings absolutely new ideas and the notion of active security.
This finishes the introduction; we now describe the policy types through concrete models.
HRU - Matrix Access Model
The HRU (Harrison-Ruzzo-Ullman) model covers the security of data for DBMSs and
operating systems. It was proposed in 1971 and formalized in 1976.
Authorization State
The authorization state is derived from the relations between objects and subjects. It is
defined as a triple Q = (S, O, A), where S is the set of subjects, O is the set of objects (which
also contains the subjects) and A is the access matrix (A[s, o] = r).
The authorization state for a DBMS can be enriched with predicates on access rights. Those
predicates may be data-, time-, context- or history-dependent.
Six primitive operations are defined for administering the authorization state: granting and
revoking a right r in A[s, o], creating and deleting a subject s, and creating and deleting an
object o. The operations are composed into commands for modifying the authorization
state. Harrison assumes that modification commands have two parts, conditional and
executive; the executive part is performed when the condition is true.
The set of possible access rights depends on the object. The model knows read, write,
append, execute and own; the last determines the owner, who administers access to the object.
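The triple Q = (S, O, A) and the primitive operations can be sketched directly. This toy Python model is an illustration only, not the formal HRU machinery (command conditions and predicates are omitted):

```python
# Sparse access matrix: A maps (subject, object) to a set of rights.
RIGHTS = {"read", "write", "append", "execute", "own"}

class AuthState:
    def __init__(self):
        self.subjects, self.objects = set(), set()
        self.A = {}  # (s, o) -> set of rights

    def create_subject(self, s):
        self.subjects.add(s)
        self.objects.add(s)  # O also contains the subjects

    def create_object(self, o):
        self.objects.add(o)

    def grant(self, s, o, r):
        assert r in RIGHTS
        self.A.setdefault((s, o), set()).add(r)

    def revoke(self, s, o, r):
        self.A.get((s, o), set()).discard(r)

    def check(self, s, o, r):
        return r in self.A.get((s, o), set())

q = AuthState()
q.create_subject("alice")
q.create_object("file1")
q.grant("alice", "file1", "read")
print(q.check("alice", "file1", "read"))   # True
q.revoke("alice", "file1", "read")
print(q.check("alice", "file1", "read"))   # False
```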
Administration of Authorizations
The owner of an object may grant/revoke any right (except own) on the given object to other
subjects - even rights he does not currently possess. This approach breaks the principle of
attenuation of privileges.
In some systems there are modifications of access rights that allow other subjects to
administer access rights. This is expressed by an explicit flag on the access right, which may
take two forms:
copy flag - m* in A[s1, o] allows subject s1 to grant access right m for object o to other
subjects; the rights of subject s1 do not change and subject s2 receives right m for object
o
Fig. 2. Use of copy flag
transfer flag - subject s1 grants right m for object o to subject s2, and s1 loses the
right. Subject s1 initially holds m+ (m with the transfer flag)
Fig. 3. Use of transfer flag
Model Extensions
The HRU model is very flexible for expressing and controlling access rights to information.
Its most important problem is the safety question (Is there a state, reachable from the given
starting authorization state, in which a subject receives a certain right for a given object?),
which is undecidable in general. This problem is related to the danger of Trojan horses in DAC.
The model has been extended in several ways to make the safety question decidable:
Schematic protection model (Sandhu)
Extended schematic protection model (Ammann and Sandhu)
Typed access matrix (TAM) model (Sandhu) - each object and subject has a type
that is assigned on creation of the entity and does not change during its lifecycle.
Bell-LaPadula Model
This model is an extension of the HRU model oriented towards the definition of security
requirements in complex systems whose elements can be classified (security levels). The
model was proposed in 1973 for operating systems and became the reference model for
mandatory access control (MAC).
Role-Based Access Control
This model family has been proposed to satisfy the requirements of the commercial sphere.
The basic idea is the notion of a role. Users are granted roles that allow them to perform
certain actions assigned to those roles. Roles are usually created according to the functions
in an organization. The moment we decide to revoke or change a user's access, the only
thing we have to do is remove him from a group or reassign him to another one. This change
is very easy for an administrator.
This approach offers a very suitable abstraction for expressing the security policies of large
companies and seems a promising alternative to the traditional MAC and DAC models.
RBAC is suitable for environments where MAC models are too strict. It has been proved
that RBAC can simulate mandatory access models and, in addition, that MAC models are a
specific case of RBAC.
One may say that RBAC is based on something very similar to the groups used in many
older models. Groups of users are used to associate subjects with the same access rights.
Such a group does not represent a role but rather the common needs for the execution of
certain system actions. Roles in RBAC exploit groups to abstract the access rights needed
for concrete actions, which are later accessible to role members.
The narrow, specific meaning of roles cannot exhaustively express the status of users. It
seems that an effective implementation of role-based access control needs to define three
aspects: role, individuality and locality. Consider an example in which a firm engages a new
person, Smith. His position is seller (his role). During a trial period you do not want him to
perform transactions with banks (individuality), and you do not want him to access your
system anywhere other than New York (locality).
What are the advantages of this approach? You are not bound during policy selection, the
model offers strict separation of duties, easy expression of organizational policy and, of
course, easy administration of access rights.
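The Smith example can be sketched in a few lines. All role names, permissions and users below are invented for illustration, and the individuality and locality checks are omitted:

```python
# Permissions attach to roles; users acquire permissions only through roles.
ROLE_PERMS = {
    "seller":     {"create_order", "view_stock"},
    "accountant": {"bank_transaction", "view_ledger"},
}
USER_ROLES = {"smith": {"seller"}}

def allowed(user, action):
    return any(action in ROLE_PERMS[role] for role in USER_ROLES.get(user, set()))

print(allowed("smith", "view_stock"))        # True
print(allowed("smith", "bank_transaction"))  # False
# Changing Smith's access is a single role reassignment, not a per-object change:
USER_ROLES["smith"] = {"accountant"}
print(allowed("smith", "bank_transaction"))  # True
```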
Relational Database Management Systems (RDBMS)
A relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model. A short definition of an RDBMS is: a DBMS in
which data is stored in tables and the relationships among the data are also stored in tables.
The data can be accessed or reassembled in many different ways without having to change
the table forms.
The relational database concept was no doubt contemplated by the people who, on a
day-to-day basis, had to manage large production data structures through the 1960s, 1970s,
1980s, 1990s and today. But it was probably Dr. E. F. Codd who first published (in an IBM
Research Report of August 1969) a specification - a set of rules - by which such a
species of database would be known:
CODD's TWELVE RULES
Rule 1: The Information Rule: All information in a relational database is represented
explicitly at the logical level in exactly one way by values in tables.
Rule 2: Guaranteed Access Rule: Each and every datum (atomic value) in a relational
database is guaranteed to be logically accessible by resorting to a table name, primary key
value, and column name.
Rule 3: Systematic Treatment of Null Values: Null values (distinct from empty character
string or a string of blank characters and distinct from zero or any other number) are
supported in the fully relational DBMS for representing missing information in a systematic
way, independent of data type.
Rule 4: Dynamic On-line Catalog Based on the Relational Model: The database description is
represented at the logical level in the same way as ordinary data, so authorized users can
apply the same relational language to its interrogation as they apply to regular data.
Rule 5: Comprehensive Data Sublanguage Rule: A relational system may support several
languages and various modes of terminal use (for example, the fill-in-blanks mode).
However, there must be at least one language whose statements are expressible, per some
well-defined syntax, as character strings and whose ability to support all of the following is
comprehensive: data definition, view definition, data manipulation (interactive and by
program), integrity constraints, and transaction boundaries (begin, commit, and rollback).
Rule 6: View Updating Rule: All views that are theoretically updateable are also updateable
by the system.
Rule 7: High-level Insert, Update, and Delete: The capability of handling a base relation or a
derived relation as a single operand applies not only to the retrieval of data but also to the
insertion, update, and deletion of data.
Rule 8: Physical Data Independence: Application programs and terminal activities remain
logically unimpaired whenever any changes are made in either storage representation or
access methods.
Rule 9: Logical Data Independence: Application programs and terminal activities remain
logically unimpaired when information preserving changes of any kind that theoretically
permit unimpairment are made to the base tables.
Rule 10: Integrity Independence: Integrity constraints specific to a particular relational
database must be definable in the relational data sublanguage and storable in the catalog,
not in the application programs.
A minimum of the following two integrity constraints must be supported:
1. Entity integrity: No component of a primary key is allowed to have a null value.
2. Referential integrity: For each distinct non null foreign key value in a relational database,
there must exist a matching primary key value from the same domain.
Rule 11: Distribution Independence: A relational DBMS has distribution independence.
Distribution independence implies that users should not have to be aware of whether a
database is distributed.
Rule 12: Nonsubversion Rule: If a relational system has a low-level (single-record-at-a-
time) language, that low-level language cannot be used to subvert or bypass the integrity
rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational
language.
Note: There is a rider to these 12 rules known as Rule Zero: "For any system that is claimed
to be a relational database management system, that system must be able to manage data
entirely through its relational capabilities."
On the basis of the above rules, there is no fully relational DBMS available today.
By the mid-1980s a number of database vendors had already claimed that their DBMS was
in reality an RDBMS. Oracle Corporation (previously known as Relational Software)
released the first commercially available database management system, packaged with its
own DBMS language called SQL (Structured Query Language). IBM also pioneered research
and development into making the DBMS more of an RDBMS, and into the language that was
to become known as SQL. It was not until almost the 1990s that Microsoft joined these two
major-league RDBMS/DBMS vendors with its SQL Server product.
Are these products RDBMS or DBMS?
What is it about an RDBMS that is missing from a DBMS, apart from the obvious R for
Relational? Some insist it is nothing but rigorous adherence to Codd's 12 rules, while
others insist a non-relational DBMS structure can be "made relational" by the application of
specific housekeeping and data-integrity exception-management subsystems.
From the perspective of IT Management however, whatever DBMS or RDBMS it was that
housed the organisational data in production real-time, it was this very ever-changing
database of information which needed to be safeguarded with the utmost diligence.
The daily backups needed to be performed and checked off, and even if they were automated
they still needed to be checked. Restores of such backups needed to be scheduled
periodically to test the integrity of the backup and restore process, the media and the
devices. A backup was useless if it could not be restored.
More importantly, the integrity of the database had to be maintained at the highest possible
level. If this duty were not attended to by some responsible party internal or external to the
organisation, then the value of the database as an entity would be suspect to varying
degrees.
With experience in the field one understands that there is indeed usually an entire series
of specific data integrity issues and problems which may arise at any time within any given
database, due to a number of different environmental factors.
At the top of the list, Rules 2 and 3 may have a direct bearing on data integrity issues if
they are not enforced. But it is Rule 10 that is often quoted as the one not established by
default within certain DBMSs, preventing them from being truly relational.
As C. J. Date points out in his article entitled Business Rules and the Relational Model, the
title of Codd's original paper was "Derivability, Redundancy, and Consistency of Relations
Stored in Large Data Banks". He goes on to reinforce that Codd's term "consistency of
stored relations" is implicitly the same thing as the term integrity constraints, and to
note that another formal description for the set of these things is "the relation predicate".
When all is said and done, however, the RDBMS vendors of today have a far more powerful
and robust product than they had a decade ago, and while it is often very interesting
examining the theoretical threads of the beginnings of the Relational Database concept, this
article is bound closely to follow the path and perspective of the IT Management staff who
have used all these evolving products, and seen the decades come and go.
For this reason, in terms of the technical evolution of the major RDBMS products available
since the 1990's, we are going to skip this era, and move right up to the present day in 2002
and have a look under the bonnet of one of these major RDBMS products, just to see what it
does.
Again, let me state clearly that while the academic criticism of RDBMS vendors - for failing
to supply greater support for integrity constraints, or for a number of technical issues by
which their products may not conform 100% to the relational model - is often warranted
and heeded by the vendors, my approach has not had the luxury of time.
When we are dealing with a production database, alive and breathing through all forms of
organisational processes, we cannot afford the time to wait for the longer term solutions to
such problems. Production database integrity, and by this I mean at real live day-to-day
production sites, is a concern that needs to be addressed without delay. If you as an IT
Manager designed a series of data-integrity exception alerts to identify the quantitative
failure of specific integrity constraints, and had them resolved, such a mechanism might
work around potentially major problems.
Perhaps the real difference between a DBMS and an RDBMS is similar to the difference
between a 6-cylinder car and an 8-cylinder car? If we looked under the bonnet we might be
able to tell, so let's choose one of the major three RDBMS vendors and have a little look. In
this article we have chosen to look under the bonnet of the SQL Server product from
Microsoft Corporation.
Relational model
The relational model for database management is a database model based on first-order
predicate logic, first formulated and proposed in 1969 by Edgar F. Codd.
Diagram of an example database according to the Relational model
In the relational model, related records are linked together with a "key".
The purpose of the relational model is to provide a declarative method for specifying data
and queries: users directly state what information the database contains and what
information they want from it, and let the database management system software take care
of describing data structures for storing the data and retrieval procedures for answering
queries.
The fundamental assumption of the relational model is that all data is represented as
mathematical n-ary relations, an n-ary relation being a subset of the Cartesian product of n
domains. In the mathematical model, reasoning about such data is done in two-valued
predicate logic, meaning there are two possible evaluations for each proposition: either true
or false (and in particular no third value such as unknown, or not applicable, either of which
are often associated with the concept of NULL). Some think two-valued logic is an important
part of the relational model, while others think a system that uses a form of three-valued
logic can still be considered relational.
Data are operated upon by means of a relational calculus or relational algebra, these being
equivalent in expressive power.
The relational model of data permits the database designer to create a consistent, logical
representation of information. Consistency is achieved by including declared constraints in
the database design, which is usually referred to as the logical schema. The theory includes
a process of database normalization whereby a design with certain desirable properties can
be selected from a set of logically equivalent alternatives. The access plans and other
implementation and operation details are handled by the DBMS engine, and are not
reflected in the logical model. This contrasts with common practice for SQL DBMSs in which
performance tuning often requires changes to the logical model.
The basic relational building block is the domain or data type, usually abbreviated
nowadays to type. A tuple is an ordered set of attribute values. An attribute is an ordered
pair of attribute name and type name. An attribute value is a specific valid value for the type
of the attribute. This can be either a scalar value or a more complex type.
A relation consists of a heading and a body. A heading is a set of attributes. A body (of an n-
ary relation) is a set of n-tuples. The heading of the relation is also the heading of each of its
tuples.
A relation is defined as a set of n-tuples. In both mathematics and the relational database
model, a set is an unordered collection of unique, non-duplicated items, although some
DBMSs impose an order to their data. In mathematics, a tuple has an order, and allows for
duplication. E.F. Codd originally defined tuples using this mathematical definition. Later, it
was one of E.F. Codd's great insights that using attribute names instead of an ordering
would be so much more convenient (in general) in a computer language based on relations.
This insight is still being used today. Though the concept has changed, the name "tuple" has
not. An immediate and important consequence of this distinguishing feature is that in the
relational model the Cartesian product becomes commutative.
A table is an accepted visual representation of a relation; a tuple is similar to the concept of
row, but note that in the database language SQL the columns and the rows of a table are
ordered.
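The consequence mentioned above (the Cartesian product becomes commutative once tuples carry attribute names rather than positions) can be sketched with plain Python sets; the toy relations are invented:

```python
# A tuple is modeled as a frozenset of (attribute, value) pairs: no ordering.
def tup(**attrs):
    return frozenset(attrs.items())

r = {tup(city="Austin", state="TX"), tup(city="Boston", state="MA")}
s = {tup(lang="SQL")}

def product(r1, r2):
    # Combine every tuple of r1 with every tuple of r2 by attribute union.
    return {t1 | t2 for t1 in r1 for t2 in r2}

# With named attributes, the Cartesian product is commutative:
print(product(r, s) == product(s, r))  # True
# A relation is a set: inserting an existing tuple changes nothing.
r.add(tup(city="Austin", state="TX"))
print(len(r))  # 2
```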
A relvar is a named variable of some specific relation type, to which at all times some
relation of that type is assigned, though the relation may contain zero tuples.
The basic principle of the relational model is the Information Principle: all information is
represented by data values in relations. In accordance with this Principle, a relational
database is a set of relvars and the result of every query is presented as a relation.
The consistency of a relational database is enforced, not by rules built into the applications
that use it, but rather by constraints, declared as part of the logical schema and enforced by
the DBMS for all applications. In general, constraints are expressed using relational
comparison operators, of which just one, "is subset of" (⊆), is theoretically sufficient. In
practice, several useful shorthands are expected to be available, of which the most
important are candidate key (really, superkey) and foreign key constraints.
SQL
SQL, often referred to as Structured Query Language, is a programming language designed
for managing data in relational database management systems (RDBMSs).
Originally based upon relational algebra and tuple relational calculus, its scope includes
data insert, query, update and delete, schema creation and modification, and data access
control.
SQL was one of the first commercial languages for Edgar F. Codd's relational model, as
described in his influential 1970 paper, "A Relational Model of Data for Large Shared Data
Banks". Despite not adhering to the relational model as described by Codd, it became the
most widely used database language. Though often described as, and to a great extent, a
declarative language, SQL also includes procedural elements. SQL became a standard of the
American National Standards Institute (ANSI) in 1986 and of the International Organization
for Standardization (ISO) in 1987. Since then the standard has been enhanced several times
with added features. However, issues of SQL code portability between major RDBMS
products still exist due to lack of full compliance with, or different interpretations of, the
standard. Among the reasons mentioned are the large size and incomplete specification of
the standard, as well as vendor lock-in.
The SQL language is sub-divided into several language elements, including:
Clauses, which are constituent components of statements and queries. (In some cases,
these are optional.)
Expressions, which can produce either scalar values or tables consisting of columns and
rows of data.
Predicates, which specify conditions that can be evaluated to SQL three-valued logic (3VL)
(true/false/unknown) or Boolean truth values, and which are used to limit the effects of
statements and queries, or to change program flow.
Queries, which retrieve the data based on specific criteria. This is the most important
element of SQL.
Statements, which may have a persistent effect on schemata and data, or which may
control transactions, program flow, connections, sessions, or diagnostics. SQL statements
also include the semicolon (";") statement terminator. Though not required on every
platform, it is defined as a standard part of the SQL grammar.
Insignificant whitespace is generally ignored in SQL statements and queries, making it
easier to format SQL code for readability.
Data manipulation
The Data Manipulation Language (DML) is the subset of SQL used to add, update and delete
data:
INSERT adds rows (formally tuples) to an existing table, e.g.,:
INSERT INTO My_table
(field1, field2, field3)
VALUES
('test', 'N', NULL);
UPDATE modifies a set of existing table rows, e.g.,:
UPDATE My_table
SET field1 = 'updated value'
WHERE field2 = 'N';
DELETE removes existing rows from a table, e.g.,:
DELETE FROM My_table
WHERE field2 = 'N';
MERGE is used to combine the data of multiple tables. It combines the INSERT and UPDATE
elements. It is defined in the SQL:2003 standard; prior to that, some databases provided
similar functionality via different syntax, sometimes called "upsert".
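SQLite, for instance, never implemented MERGE but provides the "upsert" behaviour described above through its own INSERT ... ON CONFLICT syntax (SQLite 3.24+). A sketch via Python's sqlite3 module; the table and data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Stock (item TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO Stock VALUES ('bolt', 5)")

# One statement that INSERTs a new key or UPDATEs an existing one:
# this is SQLite's dialect-specific analogue of the standard MERGE.
upsert = """
    INSERT INTO Stock (item, qty) VALUES (?, ?)
    ON CONFLICT(item) DO UPDATE SET qty = qty + excluded.qty
"""
conn.execute(upsert, ("bolt", 3))   # key exists  -> UPDATE branch, qty becomes 8
conn.execute(upsert, ("nut", 10))   # key is new  -> INSERT branch

print(dict(conn.execute("SELECT item, qty FROM Stock")))  # {'bolt': 8, 'nut': 10}
```

The `excluded` pseudo-table refers to the row that failed to insert, which is what lets the UPDATE branch combine the old and new values.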
Data definition
The Data Definition Language (DDL) manages table and index structure. The most basic
items of DDL are the CREATE, ALTER, RENAME, DROP and TRUNCATE statements:
CREATE creates an object (a table, for example) in the database, e.g.,:
CREATE TABLE My_table(
my_field1 INT,
my_field2 VARCHAR(50),
my_field3 DATE NOT NULL,
PRIMARY KEY (my_field1, my_field2)
);
ALTER modifies the structure of an existing object in various ways, for example, adding a
column to an existing table or a constraint, e.g.,:
ALTER TABLE My_table ADD my_field4 NUMBER(3) NOT NULL;
TRUNCATE deletes all data from a table in a very fast way, deleting the data inside the table
and not the table itself. It usually implies a subsequent COMMIT operation, i.e., it cannot be
rolled back.
TRUNCATE TABLE My_table;
DROP deletes an object in the database, usually irretrievably, i.e., it cannot be rolled back,
e.g.,:
DROP TABLE My_table;
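The DDL statements above can be exercised end-to-end in a short script. The sketch below uses Python's sqlite3 module, which supports CREATE, ALTER ... ADD COLUMN, and DROP (though not TRUNCATE); the names mirror the examples above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE the table, then ALTER it to add a column.
conn.execute("CREATE TABLE My_table (my_field1 INT, my_field2 TEXT)")
conn.execute("ALTER TABLE My_table ADD COLUMN my_field3 TEXT")

# PRAGMA table_info lists (cid, name, type, ...) for each column, in order.
cols = [row[1] for row in conn.execute("PRAGMA table_info(My_table)")]
print(cols)    # ['my_field1', 'my_field2', 'my_field3']

# DROP removes the object itself, not just its rows.
conn.execute("DROP TABLE My_table")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)  # []
```

Querying the catalog (`sqlite_master` here, information_schema in other systems) after each statement is a simple way to see that DDL changes the schema rather than the data.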
People who work with databases
1. Systems analyst
A systems analyst researches problems, plans solutions, recommends software and systems,
and coordinates development to meet business or other requirements. They will be familiar
with multiple varieties of programming languages, operating systems, and computer
hardware platforms. Because they often translate user requests into technical
specifications, systems analysts are the liaisons between vendors and information
technology professionals. They may be responsible for developing cost analyses, design
considerations, and implementation time-lines.
A systems analyst is a person who conducts a study, identifies activities and objectives and
determines a procedure to achieve the objectives. Designing and implementing systems to
suit organizational needs are the functions of the systems analyst.
Their responsibilities include:-
Communicate with each prospective database user group in order to understand its
o information needs
o processing needs
Develop a specification of each user group’s information and processing needs
Develop a specification integrating the information and processing needs of the user
groups
Document the specification
2. Database designers
The database designer defines the tables, indexes, views, constraints, triggers, stored
procedures, tablespaces or storage parameters, and other database-specific constructs
needed to store, retrieve, and delete persistent objects.
Their responsibilities include:-
Choose appropriate structures to represent the information specified by the system
analysts
Choose appropriate structures to store the information in a normalized manner in order
to guarantee integrity and consistency of data
Choose appropriate structures to guarantee an efficient system
Document the database design
3. Application Developers
A database developer writes and modifies databases. Data in a database can be reorganized,
dispersed, and accessed in a number of ways. Databases are important to companies and
organizations, because they contain records or files, such as sales transactions, product
catalog and inventories, and customer profiles.
Database developers create management systems to provide effective and efficient access to
information stored in databases. They provide expertise and guidance in the design,
implementation and maintenance of database management systems. Database developers
control the design and use of the database and they also determine the way the filing
systems will be organized and accessed. An important part of this work involves
implementing and controlling security procedures to protect the database from accidental
or intentional damage or loss.
They must be good at communicating not only in computer languages, but with people as
well. They not only develop the programs, but write up descriptions about them, prepare
manuals, help screens, or explain the new systems in person to users. They might have to
adapt the programs once they're in use, to make them more suitable to the users. In any
business or office environment, it is the database developers who are the authorities on the
use of these management systems. This means they may spend a lot of their day answering
questions on the correct usage of the database and correcting any glitches that come up
within the system.
Their responsibilities can be summarized as follows:-
Implement the database design
Implement the application programs to meet the program specifications
Test and debug the database implementation and the application programs
Document the database implementation and the application programs
4. Database administrators
A database administrator (DBA for short) is a person responsible for the design,
implementation, maintenance and repair of an organization's database. They are also
known by the titles Database Coordinator or Database Programmer, and the role is closely
related to the Database Analyst, Database Modeller, Programmer Analyst, and Systems
Manager. The
role includes the development and design of database strategies, monitoring and improving
database performance and capacity, and planning for future expansion requirements. They
may also plan, co-ordinate and implement security measures.
Database administrator's duties include:
Schema definition
Storage structure and access method definition
Schema and physical organization modification
Granting user authority to access the database
Specifying integrity constraints
Acting as liaison with users
Monitoring performance and responding to changes in requirements
Database backup and recovery
Database security
5. Database Users
Users are differentiated by the way they expect to interact with the system. The
different types of users are:-
Application programmers
An application programmer is someone who has acquired high-level skills in a
variety of computer programming languages. They are able to use these skills to
modify and create new software applications or programs. An application
programmer is also known as a developer or systems analyst. They interact with
system through DML calls.
Sophisticated users
These users form requests in a database query language, e.g. engineers, scientists,
business analysts, and others who thoroughly familiarize themselves with the facilities of
the DBMS and use them to meet their complex requirements.
Specialized users
These users write specialized database applications that do not fit into the
traditional data processing framework. They maintain personal databases by using
ready-made program packages that provide easy-to-use menu- or graphics-based
interfaces. An example is the user of a tax package that stores a variety of personal
financial data for tax purposes.
Typically they become very proficient in using a specific software package.
Naive users
These users invoke one of the permanent application programs that have been written
previously. Their characteristics include constantly querying and updating the database,
using standard types of queries and updates called canned transactions that have been
carefully programmed and tested. E.g. Bank tellers check account balances and post
withdrawals and deposits; Reservation clerks for airlines, hotels, and car rental companies
check availability for a given request and make reservations; Clerks at receiving stations for
courier mail enter package identifications via bar codes and descriptive information
through buttons to update a central database of received and in-transit packages.
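The canned transactions described above are pre-written, pre-tested routines in which only the parameters vary. As an illustration only, here is a hypothetical bank-teller withdrawal sketched with Python's sqlite3 module; the account table, numbers, and function name are all invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (acct_no INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO Account VALUES (1001, 500.0)")

def post_withdrawal(acct_no, amount):
    """A 'canned transaction': fixed, tested logic; only the parameters change."""
    with conn:  # commits on success, rolls back if an exception is raised
        bal, = conn.execute(
            "SELECT balance FROM Account WHERE acct_no = ?", (acct_no,)).fetchone()
        if bal < amount:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE Account SET balance = balance - ? WHERE acct_no = ?",
                     (amount, acct_no))

post_withdrawal(1001, 120.0)
bal, = conn.execute("SELECT balance FROM Account WHERE acct_no = 1001").fetchone()
print(bal)   # 380.0
```

The naive user never sees the SQL: the teller's terminal simply calls the routine with an account number and an amount, which is why such transactions can be "carefully programmed and tested" once and reused safely.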
Disadvantages of RDBMS
The main problem when using a relational database is the complexity that arises when it is
first created. It is vital that the defined relationships between the tables are correct and
that each piece of information is linked to its related data. Although less information has
to be entered in total than with other databases, making sure every relationship is set up
correctly is a slow process. Furthermore, the relationships can become extremely intricate
when a relational database contains more than just two tables.
It was well known from the outset that the relational model did not and could not model
some kinds of data. One common data structure that the relational model did not
accommodate, for example, was a hierarchy. You could implement laborious workarounds to
store such data within a relational database, but the simple reality was that the relational
model never defined hierarchies properly. Insofar as the relational model was a
mathematical model of data, it was incomplete.
With the development of Object Oriented programming languages in the early 1990s, the
limits of the Relational Model began to become awkwardly visible. What quickly became
clear was that the view of data that an OO program had was not compatible with the way an
RDBMS stored data.
An easy way to explain this problem is this:
Think of a car. When you have finished using it, you park it in your garage and leave it. When
you park it you do not take out the seats, take off the tires, remove the doors, pull the steering
wheel off, take out the engine, etc. However, an RDBMS with its "normalization" processes
would often insist on storing a car in exactly that way and then reassembling it for you
when you next needed to use it.
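The car analogy can be made concrete: a normalized design stores the "parts" in separate tables, and the DBMS reassembles the object with a join each time it is needed. A hypothetical sketch using Python's sqlite3 module, with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Car   (car_id INTEGER PRIMARY KEY, model TEXT);
    CREATE TABLE Wheel (car_id INTEGER REFERENCES Car(car_id), position TEXT);
    INSERT INTO Car VALUES (1, 'Sedan');
    INSERT INTO Wheel VALUES (1, 'front-left'), (1, 'front-right'),
                             (1, 'rear-left'), (1, 'rear-right');
""")

# "Reassembling the car": a join gathers the scattered parts back into one view.
rows = conn.execute("""
    SELECT Car.model, Wheel.position
    FROM Car JOIN Wheel ON Car.car_id = Wheel.car_id
""").fetchall()
print(len(rows))   # 4 -- one row per part, rebuilt at query time
```

An object-oriented store would instead keep the whole car as one object; the join overhead sketched here is exactly the "disassembly and reassembly" cost the analogy complains about.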
The limitations of RDBMS became even more obvious with the advent of the Internet.
RDBMS had no way of representing web pages which were, by and large, collections of data
and mark-up tags. So web pages were stored as character strings by RDBMS and their inner
structure was buried.
Multimedia Database Management Systems
INTRODUCTION
Due to the heterogeneous nature of multimedia data, systems intended to store, transport,
display and in general manage such data must have considerably more capabilities than
conventional information-management systems. In this paper, we elaborate on issues
pertaining to multimedia database management and discuss technical challenges involved
in developing a general-purpose multimedia DBMS. We can view these issues as
specification requirements for this DBMS and focus our discussion on a reference
architecture for such a system. A list of prominent issues in multimedia database
management follows.
Development of formal semantic modelling techniques for multimedia information,
especially for video and image data. These models should be rich in capabilities for
abstracting multimedia information and capturing semantics. They should be able to
provide canonical representations of complex images, scenes, and events in terms of objects
and their spatio-temporal behavior. These models need to be compared and evaluated.
Design of powerful indexing, searching, and organization methods for multimedia data.
Search in multimedia databases can be computationally intensive, especially if content-
based retrieval is needed for image and video data stored in compressed or uncompressed
form.
Development of models for specifying the media synchronization/integration requirements.
Integration of these models with the monomedia database schema will be required.
Subsequently, in order to determine the synchronization requirements at retrieval time,
transformation of these models into a metaschema is needed. This entails designing object-
retrieval algorithms for the operating systems. Similarly, integration of these models with
higher-level information abstractions such as hypermedia or object-oriented models may be
required.
Designing formal multimedia query languages. These languages should have strong
capabilities to express arbitrarily complex semantic and spatiotemporal schemas associated
with composite multimedia information. They must support manipulation of content-based
functions for multimedia objects.
Development of efficient data-replacement schemes for physical storage management.
These schemes are needed to manage real-time multimedia data, for both single and
parallel disk systems. Design and development of suitable architecture and operating
system support. Heterogeneity of multimedia information dictates that the architecture of a
general-purpose multimedia database management system must support a rich set of data-
management and computational functionalities. The operating system must also support
real-time requirements of multimedia data.
Management of distributed multimedia databases. In a networked environment extensive
coordination and management capabilities are needed among the distributed sites to
provide location-transparent access and support real-time delivery of data to distributed
users.
A REFERENCE ARCHITECTURE FOR MULTIMEDIA DBMS
Many types of real-world knowledge can be represented by describing the interplay among
objects (persons, buildings, events, etc.) in the course of time and their relationships in
space. An application may need to store and access information about this knowledge that
can be expressed by complex spatiotemporal logic. A video database is a typical replica of
this real-world environment [Day et al. 1995]. The spatiotemporal specification process is
reversed while composing multimedia information documents [Little and Ghafoor 1993;
Iino et al. 1994]. In this case, a user synthetically creates interplay among various
monomedia objects (akin to physical objects), both in space and time. In multimedia
databases these objects may represent individual data entities that serve as components of
some multimedia document [Little and Ghafoor 1993; Stotts and Furuta 1989].
Furthermore, these entities/documents can be grouped together for efficient management
and access.
In either case, it is essential that the user be able to identify and address different objects
and be able to express their relations in time and space. These relations should be
representable in a suitable structure that is powerful enough to specify higher-level
contents and semantic abstractions. It is, therefore, desirable that a general framework for
spatiotemporal and semantic modeling should be available. Such a framework can provide a
reasonable approach to address the issues discussed in the previous section and can be
used in designing a general purpose multimedia database-management system. Various
conceptual models for multimedia information have been proposed in the literature [Little
and Ghafoor 1993; Stotts and Furuta 1989]. These models are either aimed at
synchronization aspects of the multimedia data or are concerned with the browsing aspects
of information. Irrespective of the media type, it is imperative that a model should be able to
generate a clear specification of the metaschema and it must be integral with the underlying
data models of various monomedia.
Such a model must also facilitate the efficient development of high-level user interfaces.
From this discussion, we can perceive reference architecture for a general-purpose
multimedia DBMS as shown in Figure 1(a). The architecture consists of three layers: a
monomedia database management layer, a multimedia composition/management layer,
and a user interface layer. We now describe functionalities of each layer and discuss their
role in managing the issues mentioned in the previous section. The monomedia database-
management layer provides the functionalities essential for managing individual media. The
key aspects of each DBMS at this level include maintaining efficient indexing mechanisms
and allowing users to develop semantics-based modeling and grouping of complex
information associated with each medium. In this respect, this layer deals with the first two
issues discussed in the previous section. The primary objective is to process content-based
queries and facilitate retrieval of appropriate pieces of monomedia data, such as
video clips, parts of an image, or audio segments. For example, for video data, abstraction
hierarchies in space and time, as shown in Figure 1(b), are needed in order to express
events and allow content-based indexing and retrieval.
Spatiotemporal logic can provide a formal framework for representing such events and
building these hierarchical abstractions [Day et al. 1995]. Subsequently this process can
lead to schema definition for each monomedia database. Such schema may also contain
some unconventional information, for example, the quality of service (QoS) parameters for
presentation, such as speed, volume, resolution, delay bounds, and the like. The objective is
to allow retrieval of monomedia data with the desired quality, if possible. Accordingly, a
suitable database language is needed at this level to manipulate hierarchical abstractions
and to query about events of interest. It is important to mention that considerable
computational demand can be placed at this layer, especially for image and video data, if a
fully automated system is desired that is capable of generating content-based indices based
on object recognition [Gudivada and Raghavan 1995]. Such computation can span a broad
spectrum of effective and robust computer-vision and image-processing algorithms that are
intermingled with data and entities at the various levels of abstraction of Figure 1(b).
Another important function incorporated within this layer is the physical management of
individual databases. Such management may require, as mentioned previously, efficient
placement of data on a single/parallel disk system. For example, due to the time-variant
characteristics of compressed video, it is difficult to predict the disk seek latencies. A poor
data-placement strategy can result in significant degradation in quality due to excessive
droppage of video frames during multiple concurrent sessions. The primary objective of the
middle layer is to deal with the third issue, the integration of monomedia to compose
multimedia documents and to crosslink information stored in monomedia databases.
There is a growing demand for management of multimedia documents and libraries, and
the need for efficient integration models is one of the key research issues in developing a
general-purpose multimedia DBMS. Integration of media can span multiple dimensions
including space, time, and semantics. Therefore, this layer needs to maintain a
multidimensional monomedia integration model, in the form of a metaschema, along with the
QoS parameters associated with each medium. Basically, the layer processes users' queries
for composite multimedia information and generates appropriate subqueries for the
monomedia databases. Retrieval of the monomedia data is controlled by the integration
model maintained by the metaschema [Little and Ghafoor 1993].
It is important to note that for the purpose of consistency, this meta-schema needs to be
interfaced with the schemas associated with the monomedia databases present at the lower
layer, because both schemas share information about monomedia, including their content-
based semantics and QoS parameters. Another important function of this layer is to provide
coordination among monomedia databases if they are distributed.
As mentioned before, one of the major objectives in this case is to provide location-
transparent access to different database sites and maintain synchronization among media
streams originating from heterogeneous hosts. From the database point of view,
information about the location of various objects of composite multimedia and their schema
must be maintained by this layer. The interactive layer consists of various user-interface
facilities that support multimedia presentation functionalities such as display of images,
playout of video clips or some audio segment, and the like. These interactive facilities may
require some formal query language that can be identical to that used at the bottom layer,
with some enhanced capabilities to manipulate composite multimedia information.
Alternatively, some graphical query interface can be used for that purpose. Additional
capabilities may include browsing, media editing, and so on.
DEVELOPMENT CONSIDERATIONS FOR THE REFERENCE ARCHITECTURE
The development of a general-purpose multimedia database system, including design of
multilevel meta-schema, management of a large number of indices and physical databases,
as well as interfacing with the operating system for real-time retrieval, can be a
technologically daunting task. In this section, we focus on two development issues that are
unique to multimedia databases and are crucial for realizing the reference architecture of
Figure 1(a): spatiotemporal modeling of monomedia/composite multimedia and the
suitability of two well-known data-modeling paradigms, relational versus object-oriented,
for developing a general purpose multimedia DBMS. A number of attempts have been made
to develop synchronization models for multimedia information. HyTime, one such model,
has been recommended as an ISO standard.
However, currently this model suffers from the drawback that the extraction of various
spatiotemporal and content semantics from this model can be quite cumbersome. On the
other hand, the Petrinet-based model proposed in Little and Ghafoor [1993] not only allows
extraction of the desired semantics and generation of a database schema in a rather
straightforward manner, but also has the additional advantage of pictorially illustrating
synchronization aspects of the information. In this regard this model is unique and is thus
also well suited for the third layer of the reference architecture, where visual orchestration
of multimedia information can be highly desirable. The model uses a set of generalized
temporal operators. An expanded version of the model also allows specification of spatial
semantics of information [Iino et al. 1994].
The notion of generalized spatiotemporal relations has recently been extended to specify
complex events in video data, a process that is almost a reversal of the multimedia
composition process [Day et al. 1995]. Numerous extensions to this model have been
proposed in the literature, including provision for user interaction, modeling of distributed
object synchronization, development of synchronization protocols for communication, and
the like. In summary, the Petrinet-based model in Little and Ghafoor [1993] can be a good
candidate for spatiotemporal modeling and generation of meta-schemas for the reference
architecture. We now discuss the second development issue.
As the application domain for a given multimedia application can be highly complex and ill-
structured, both relational calculus and semantic-based object-oriented approaches need to
be scrutinized. We can evaluate these approaches on two important criteria: their ability to
express semantics associated with hierarchical abstractions, which need to be managed by
almost all the monomedia DBMSs of the bottom layer of Figure 1(a), and their expressive
power for the specification of meta-schema for media integration, which is managed by the
middle layer. Hierarchical abstractions can result in considerable semantic heterogeneity,
which has been a difficult problem for relational database models.
On the other hand, the semantic-based object-oriented models can provide a richer set of
abstractions that are particularly useful for the users to extract/define view of information
at various levels of these hierarchies [Day et al. 1995]. From the media integration and
composition point of view, the relational technology again provides no elegant mechanisms
for maintaining complex logical structures associated with composite multimedia objects.
Although some relational DBMSs support access to multimedia objects by using pointers to
BLOBs (binary large objects), there is no provision for interactively accessing various
portions of these objects, because a BLOB is treated as a single entity in its entirety. It has
been demonstrated that the object-oriented technology can provide a powerful paradigm to
meet the requirements of multimedia composition and developing semantic models [Day et
al. 1995; Iino et al. 1994]. Its data and computational encapsulation features provide
effective mechanisms for media synchronization.
Multimedia Database Systems: Where are we now?
Introduction
A Multimedia Database Management System (MMDBMS) must support multimedia
data types in addition to providing facilities for traditional DBMS functions like database
creation, data modeling, data retrieval, data access and organization, and data
independence. The area and its applications have experienced tremendous growth.
Especially with the rapid development of network technology, multimedia database systems
have grown tremendously and multimedia information exchange has become very
important. This paper reviews the history and the current state of the art in MMDBMS.
The first wave
The first MMDBMSs relied mainly on the operating system for storing and querying files.
These were ad hoc systems that served mostly as repositories. The mid-90s saw a first wave
of commercial, implemented-from-scratch, full-fledged MMDBMSs. Some of them were
MediaDB, now MediaWay [1], JASMINE [2], and ITASCA, the commercial successor of
ORION [3]. They were all able to handle diverse kinds of data and provided mechanisms for
querying, retrieving, inserting, and updating data. Most of these products disappeared from
the market after some years of existence, and only some of them continued and adapted
themselves successfully to hardware and software advances as well as to application
changes. For instance, MediaWay provided early, very specific support for a wide variety of
media types: different media file formats, varying from images and video to PowerPoint
documents, can be managed, segmented, linked, and searched.
Designing MMDBs
Many inherent characteristics of multimedia data have direct and indirect impacts on the
design of MMDBs. These include: the huge size of MMDBs, temporal nature, richness of
content, complexity of representation and subjective interpretation. The major challenges in
designing multimedia databases arise from several requirements they need to satisfy such
as the following:
1. Manage different types of input, output, and storage devices. Data input can come from
a variety of devices, such as scanners and digital cameras for images, microphones and
MIDI devices for audio, and video cameras for video. Typical output devices are high-
resolution monitors for images and video, and speakers for audio.
2. Handle a variety of data compression and storage formats. The data encoding can vary
even within a single application. For instance, in medical applications, MRI images of the
brain require lossless coding or a very stringent lossy coding technique, while X-ray images
of bones can tolerate less stringent coding. Also, radiological image data, ECG data, other
patient data, etc. have widely varying formats.
3. Support different computing platforms and operating systems. Different users
operate computers and devices suited to their needs and tastes. But they need the same
kind of user-level view of the database.
4. Integrate different data models. Some data such as numeric and textual data are best
handled using a relational database model, while some others such as video documents are
better handled using an object-oriented database model. So these two models should
coexist together in MMDBs.
5. Offer a variety of user-friendly query systems suited to different kinds of media.
From a user's point of view, easy-to-use queries and fast, accurate retrieval of information
are highly desirable. The query for the same item can take different forms. For example, a
portion of interest in a video can be queried by using either
1) A few sample video frames as an example,
2) A clip of the corresponding audio track or
3) A textual description using keywords.
6. Handle different kinds of indices. The inexact and subjective nature of multimedia
data has rendered keyword-based indices and the exact and range searches used in
traditional databases ineffective. For example, the retrieval of records of persons based on
social security number is precisely defined, but the retrieval of records of persons having
certain facial features from a database of facial images requires content-based queries and
similarity-based retrievals. This requires indices that are content-dependent, in addition to
keyword indices.
7. Develop measures of data similarity that correspond well with perceptual similarity.
Measures of similarity for different media types need to be quantified to correspond well
with the perceptual similarity of objects of those data types. These need to be incorporated
into the search process.
8. Provide a transparent view of geographically distributed data. MMDBs are likely to
be distributed in nature. The media data may reside in many different storage units,
possibly spread out geographically. This is partly due to the changing nature of
computation and computing resources from centralized to networked and distributed.
9. Adhere to real-time constraints for the transmission of media data. Video and audio
are inherently temporal in nature. For example, the frames of a video need to be presented
at the rate of at least 30 frames/sec. for the eye to perceive continuity in the video.
10. Synchronize different media types while presenting to user. It is likely that different
media types corresponding to a single multimedia object are stored in different formats, on
different devices, and have different rates of transfer. Thus they need to be periodically
synchronized for presentation.
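The real-time constraint in requirement 9 can be made concrete with a small back-of-the-envelope sketch. The resolution and pixel depth below are illustrative assumptions, not figures from any particular system:

```python
# Sketch: the timing and bandwidth arithmetic behind real-time video delivery.
# All numbers are illustrative assumptions.

FRAME_RATE = 30                       # frames/sec needed for perceived continuity
FRAME_INTERVAL = 1.0 / FRAME_RATE     # each frame must arrive within ~33.3 ms

def required_bandwidth(width, height, bytes_per_pixel, fps=FRAME_RATE):
    """Raw (uncompressed) data rate needed to sustain the frame rate, in bytes/sec."""
    return width * height * bytes_per_pixel * fps

# A 640x480, 24-bit video needs ~27.6 MB/s uncompressed -- which is why an MMDB
# must couple storage with compression and rate-controlled delivery.
raw = required_bandwidth(640, 480, 3)
print(f"frame deadline: {FRAME_INTERVAL*1000:.1f} ms, raw rate: {raw/1e6:.1f} MB/s")
# -> frame deadline: 33.3 ms, raw rate: 27.6 MB/s
```

The per-frame deadline is what makes media data "temporal": missing it is visible to the user, unlike a late answer to a conventional database query.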
The recent growth in using multimedia data in applications has been phenomenal.
Multimedia databases are essential for efficient management and effective use of huge
amounts of data. The diversity of applications using multimedia data, the rapidly changing
technology, and the inherent complexities in the semantic representation, interpretation
and comparison for similarity pose many challenges. MMDBs are still in their infancy.
Today's MMDBs are closely bound to narrow application areas. The experiences acquired
from developing and using novel multimedia applications will help advance the multimedia
database technology.
Benefits of using Multimedia Database
The incorporation of multimedia database systems will improve the quantity and quality of
information manipulated by computer users in all fields, including computer-aided design and
information retrieval. The area of intelligent multimedia content analysis and retrieval
techniques is an emerging discipline. Techniques for representing and extracting semantic
information from media such as speech, images, and video are required.
When a multimedia application lacks a database, the data structure is buried in the script,
where all of its value is lost. This omission also makes the script more complicated and less
flexible. Using a multimedia database makes the data structure logic available to other
multimedia applications and simplifies the script so that many scripts can share the same
multimedia metadata. In addition, when a database of multimedia or abstract data is
organized and annotated for one application, other applications can use those annotations
without going through the same time-consuming process. This capability adds great value
to the data through reuse and controlled redundancy.
When multimedia application content is controlled by the multimedia database, multimedia
content can be added, deleted, or modified without modifying the application script. For
example, interactive kiosks that display, describe, and demonstrate products can be
updated automatically without reprogramming the application script. Furthermore, a
multimedia application such as a multimedia textbook can control the operation of
book topics that share the same look and feel. This control lets the script serve as a
template: an entire series of math textbooks (algebra, calculus, trigonometry, and
geometry), including text and video, can use the same multimedia application because all
data is kept physically separate from the script.
Search and retrieval operations are critical in interactive multimedia applications; they
must be equally efficient and powerful. Search and retrieval of multimedia and abstract data
is challenging, but multimedia databases make it feasible through internal storage format
flexibility and efficient operation. The DBMS should have significant knowledge about the
data and its structure to enable powerful semantic optimizations and intelligent searches.
Search and retrieval operations also give the application access to media components so
that they can be dynamically and seamlessly processed when necessary.
Features of Multimedia Database
Query in Multimedia DBMS
In a DBMS, declarative user queries are supported by defining a query language as part of
the DBMS. The query facility is one of the most important parts of a DBMS. Since queries
are issued by both naive and expert users, it is critical to provide effective query languages
in order to promote user friendliness. Multimedia query languages must deal with the complex
spatial and temporal relationships inherent in the wide range of multimedia data types.
Powerful query languages help manipulate a multimedia DBMS and maintain the
desired independence between the database and the application. Due to the complexity of
multimedia data, we need a query language that supports semantic data retrieval,
which is done by using keywords, indexes on keywords, and the contents of multimedia objects.
However, the query languages of traditional DBMSs deal only with exact-match queries on
alphanumeric data types, which are sufficient only for queries posed against the metadata
and annotations of multimedia data.
There are two types of queries used in database systems. The first is called a
well-defined query, in which the properties of objects are represented by a
well-defined set of labels and the comparison operators are also well defined. In a well-defined
query, the user must have exact knowledge of the underlying database and the desired
query result; such a query only works with exact matches. The second type is called
the fuzzy query, in which the properties of the query objects are unclear or the comparison
operators do not give exact matches; such a query can return a range of different answers.
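The contrast between the two query types can be illustrated with a small sketch. The records below are hypothetical, and the "fuzzy" comparison uses a simple edit-distance ratio from Python's standard library as a stand-in for the content-based similarity measures a real MMDBS would use:

```python
# Sketch: well-defined (exact) vs. fuzzy queries over hypothetical person records.
from difflib import SequenceMatcher

records = [
    {"ssn": "111-22-3333", "name": "Johnson"},
    {"ssn": "444-55-6666", "name": "Johnston"},
    {"ssn": "777-88-9999", "name": "Smith"},
]

def well_defined_query(ssn):
    """Exact match: the user must know the precise key value."""
    return [r for r in records if r["ssn"] == ssn]

def fuzzy_query(name, threshold=0.8):
    """Inexact match: returns a ranked set of differently-similar answers."""
    scored = [(SequenceMatcher(None, name, r["name"]).ratio(), r) for r in records]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [r for score, r in scored if score >= threshold]

print(well_defined_query("777-88-9999"))   # exactly one answer
print(fuzzy_query("Johnson"))              # both "Johnson" and "Johnston" qualify
```

Note how the fuzzy query returns more than one answer for a single query term; this is exactly the "set of different answers" behavior described above.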
Charts and Graphs
A multimedia database supports all types of charts and graphs, which is considered a great
feature of an MDBS. Creating those charts and graphs requires special queries, which are
created using SQL+D.
Multimedia Presentation
The terms multimedia document and multimedia (or temporal) presentation have been used
interchangeably in the literature. A multimedia database can store large amounts of media
such as audio and video, all of which are also used in the presentation process. An MDBS also
stores multimedia documents and presentations.
Applications of MMDB
Data Streaming and E-mail Technologies
The use of streaming video and audio media is becoming an increasingly desirable feature
in network technologies. As networks become more saturated with traffic, the timely
delivery of latency-sensitive data, such as the data generated by video conferencing and
Internet Protocol (IP) telephony programs, becomes more problematic. ATM and QoS are
two technologies that you can use to facilitate the timely delivery of latency-sensitive data.
POP3 is an e-mail protocol that can be used to facilitate mail delivery to computers that are not
part of a network or that do not have an e-mail server on the local network.
Interactive video technology
A few interactive video technologies developed in the last few years use a
new way of encoding videos that allows users to click on any person, place, or object in the
video.
Various existing trends in the usage of this technology:
• Customizable online interactive videos
• Conversational online interactive videos
• Exploratory online interactive videos
• Interactive video in early computer games
• Interactive video in cinema
• Interactive video on YouTube
• Interactive video art
• Interactive video in cell phone gaming
• Interactive video in VJing
• Hypervideo
Interactive Video ads
These are designed to meet the marketing needs of advertisers in a wide variety of
industries, including: Entertainment, CPG, Automotive, Retail, Telecommunications, and
Travel, with more categories being added on an ongoing basis.
Some of the flexible features and functionality:
• Multiple videos within a single ad unit, including Interactive video, augmented
reality, and dynamic content
• Social media integration and RSS feeds
• In-player video games, polling, surveys, and quizzes
• Localized geographic data, store location, and mapping
• Product features/carousel
• Coupons and click-to-schedule/purchase
Digital library
A digital library is a library in which collections are stored in digital formats (as opposed to
print, microform, or other media) and accessible by computers. The digital content may be
stored locally, or accessed remotely via computer networks. A digital library is a type of
information retrieval system.
A distinction is often made between content that was created in a digital format, known as
born-digital, and information that has been converted from a physical medium, e.g., paper,
by digitizing. The term hybrid library is sometimes used for libraries that have both
physical collections and digital collections.
Large scale digitization projects are underway at Google, the Million Book Project, and
Internet Archive. With continued improvements in book handling and presentation
technologies such as optical character recognition and ebooks, and development of
alternative depositories and business models, digital libraries are rapidly growing in
popularity as demonstrated by Google, Yahoo!, and MSN's efforts. Just as libraries have
ventured into audio and video collections, so have digital libraries such as the Internet
Archive.
According to Larry Lannom, Director of Information Management Technology at the
nonprofit Corporation for National Research Initiatives, "all the problems
associated with digital libraries are wrapped up in archiving." He goes on to state, "If in 100
years people can still read your article, we'll have solved the problem." Daniel Akst, author
of The Webster Chronicle, proposes that "the future of libraries—and of information—is
digital." Peter Lyman and Hal Varian, information scientists at the University of California,
Berkeley, estimate that "the world's total yearly production of print, film, optical, and
magnetic content would require roughly 1.5 billion gigabytes of storage." Therefore, they
believe that "soon it will be technologically possible for an average person to access
virtually all recorded information."
Most digital libraries provide a search interface which allows resources to be found. These
resources are typically deep web (or invisible web) resources since they frequently cannot
be located by search engine crawlers. Some digital libraries create special pages or sitemaps
to allow search engines to find all their resources. Digital libraries frequently use the Open
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to expose their metadata to
other digital libraries, and search engines like Google Scholar, Yahoo! and Scirus can also
use OAI-PMH to find these deep web resources.[9]
There are two general strategies for searching a federation of digital libraries:
• Distributed searching, and
• Searching previously harvested metadata.
Distributed searching typically involves a client sending multiple search requests in parallel
to a number of servers in the federation. The results are gathered, duplicates are eliminated
or clustered, and the remaining items are sorted and presented back to the client. Protocols
like Z39.50 are frequently used in distributed searching. A benefit to this approach is that
the resource-intensive tasks of indexing and storage are left to the respective servers in the
federation. A drawback to this approach is that the search mechanism is limited by the
different indexing and ranking capabilities of each database, making it difficult to assemble
a combined result consisting of the most relevant found items.
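The distributed-searching steps above (fan out, deduplicate, merge, sort) can be sketched as follows. The three "servers" are stand-in functions with made-up scores; a real federation would issue the requests over a protocol such as Z39.50:

```python
# Sketch of federated (distributed) searching across a digital-library federation.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical member servers, each returning (document, relevance) pairs.
def server_a(q): return [("doc1", 0.9), ("doc2", 0.4)]
def server_b(q): return [("doc2", 0.7), ("doc3", 0.6)]
def server_c(q): return [("doc4", 0.8)]

def federated_search(query, servers):
    # 1. Send the search request to all servers in parallel.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda s: s(query), servers))
    # 2. Merge the results, keeping the best score for duplicates.
    best = {}
    for results in result_lists:
        for doc, score in results:
            best[doc] = max(score, best.get(doc, 0.0))
    # 3. Sort the remaining items and present them to the client.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

print(federated_search("digital library", [server_a, server_b, server_c]))
# -> [('doc1', 0.9), ('doc4', 0.8), ('doc2', 0.7), ('doc3', 0.6)]
```

The final sort illustrates the drawback noted above: scores produced by different ranking algorithms are not truly comparable, so the combined ordering is only as good as the least consistent member.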
Searching over previously harvested metadata involves searching a locally stored index of
information that has previously been collected from the libraries in the federation. When a
search is performed, the search mechanism does not need to make connections with the
digital libraries it is searching - it already has a local representation of the information. This
approach requires the creation of an indexing and harvesting mechanism which operates
regularly, connecting to all the digital libraries and querying the whole collection in order to
discover new and updated resources. OAI-PMH is frequently used by digital libraries for
allowing metadata to be harvested. A benefit to this approach is that the search mechanism
has full control over indexing and ranking algorithms, possibly allowing more consistent
results. A drawback is that harvesting and indexing systems are more resource-intensive
and therefore expensive.
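The harvesting approach relies on periodic OAI-PMH requests to each library. A minimal sketch of how a harvester forms those requests is shown below; the endpoint URL is hypothetical, while the verb and parameter names come from the OAI-PMH protocol:

```python
# Sketch: building OAI-PMH ListRecords requests for periodic metadata harvesting.
from urllib.parse import urlencode

BASE_URL = "https://example.org/oai"   # hypothetical repository endpoint

def list_records_url(metadata_prefix="oai_dc", from_date=None, token=None):
    """Build a ListRecords request; a resumptionToken pages through large result sets."""
    if token:
        # Per the protocol, a resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Incremental harvest: only records updated since the last run.
            params["from"] = from_date
    return BASE_URL + "?" + urlencode(params)

print(list_records_url(from_date="2011-01-01"))
# -> https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2011-01-01
```

Running such requests on a schedule, and following resumption tokens until each collection is exhausted, is exactly the "regularly operating harvesting mechanism" described above.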
Advantages of a digital library
The advantages of digital libraries as a means of easily and rapidly accessing books, archives
and images of various types are now widely recognized by commercial interests and public
bodies alike.[14]
Traditional libraries are limited by storage space; digital libraries have the potential to store
much more information, simply because digital information requires very little physical
space to contain it. As such, the cost of maintaining a digital library is much lower than that
of a traditional library.
A traditional library must spend large sums of money paying for staff, book maintenance,
rent, and additional books. Digital libraries may reduce or, in some instances, do away with
these fees. Both types of library require cataloguing input to allow users to locate and
retrieve material. Digital libraries may be more willing to adopt innovations in technology,
providing users with improvements in electronic and audio book technology as well as
presenting new forms of communication such as wikis and blogs; conventional libraries
may consider that providing online access to their OPAC catalogue is sufficient. An
important advantage of digital conversion is increased accessibility to users. Digital libraries
also increase availability to individuals who may not be traditional patrons of a library,
because of geographic location or organizational affiliation.
No physical boundary: The user of a digital library need not go to the library physically;
people from all over the world can gain access to the same information, as long as an
Internet connection is available.
Round-the-clock availability: A major advantage of digital libraries is that people can gain
access 24/7 to the information.
Multiple access: The same resources can be used simultaneously by a number of institutions
and patrons. This may not be the case for copyrighted material: a library may have a license
for "lending out" only one copy at a time; this is achieved with a system of digital rights
management where a resource can become inaccessible after expiration of the lending
period or after the lender chooses to make it inaccessible (equivalent to returning the
resource).
Information retrieval: The user is able to use any search term (word, phrase, title, name,
subject) to search the entire collection. Digital libraries can provide very user-friendly
interfaces, giving clickable access to its resources.
Preservation and conservation: Digitization is not a long-term preservation solution for
physical collections, but it does provide access copies for materials that would
otherwise degrade from repeated use.
Space: Traditional libraries are limited by storage space; digital libraries have the potential
to store much more information, simply because digital information requires very little
physical space to contain it, and media storage technologies are more affordable than
ever before.
Added value: Certain characteristics of objects, primarily the quality of images, may be
improved. Digitization can enhance legibility and remove visible flaws such as stains and
discoloration.
Ease of access: resources in a digital library are easily accessible.
Image retrieval
An image retrieval system is a computer system for browsing, searching and retrieving
images from a large database of digital images. Most traditional and common methods of
image retrieval utilize some method of adding metadata such as captioning, keywords, or
descriptions to the images so that retrieval can be performed over the annotation words.
Manual image annotation is time-consuming, laborious and expensive; to address this, there
has been a large amount of research done on automatic image annotation. Additionally, the
increase in social web applications and the semantic web have inspired the development of
several web-based image annotation tools.
Image search is a specialized data search used to find images. To search for images, a user
may provide query terms such as keyword, image file/link, or click on some image, and the
system will return images "similar" to the query. The similarity used for search criteria
could be meta tags, color distribution in images, region/shape attributes, etc.
• Image meta search - search of images based on associated metadata such as
keywords, text, etc.
• Content-based image retrieval (CBIR) – the application of computer vision to
image retrieval
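Similarity by color distribution, one of the search criteria mentioned above, can be sketched with a toy content-based retrieval example. The four-bin histograms below are made-up stand-ins for counts a real system would compute from actual pixel data:

```python
# Sketch: "similar to" image search by comparing color histograms.

def normalize(hist):
    total = sum(hist)
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    """1.0 = identical color distributions, 0.0 = no overlap at all."""
    return sum(min(a, b) for a, b in zip(normalize(h1), normalize(h2)))

# Hypothetical images, each reduced to pixel counts in 4 coarse color bins.
database = {
    "sunset.jpg": [70, 20, 5, 5],
    "beach.jpg":  [60, 25, 10, 5],
    "forest.jpg": [5, 10, 75, 10],
}

def most_similar(query_hist, db):
    return max(db, key=lambda name: histogram_intersection(query_hist, db[name]))

print(most_similar([68, 22, 5, 5], database))   # -> sunset.jpg
```

The query never mentions a keyword; ranking follows entirely from the content-derived features, which is the essential difference between CBIR and image meta search.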
POP3
Post Office Protocol version 3 (POP3) service enables a server to host e-mail accounts for
subscribers of Internet service providers (ISPs) or for small businesses. The most common
use is for home ISP clients or small businesses that cannot locally implement and support a
mail server, Simple Mail Transfer Protocol (SMTP) service, and related mail handling
infrastructure. The POP3 protocol is used to retrieve mail that is being held by a mail server
that offers SMTP service. The POP3 service performs the tasks of downloading messages,
moving the messages from a folder in the file system of a mail server to clients on small
networks, or to stand-alone home computers. Typically, after mail is downloaded from the
mail server, the mail server deletes the mail from its folder.
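The download-then-delete behavior described above can be sketched with Python's standard-library poplib module. The host and credentials are placeholders, and the connection code is shown but not executed here:

```python
# Sketch of a POP3 retrieve-and-delete session using the stdlib poplib module.
import poplib

def summarize_stat(msg_count, mailbox_bytes):
    """Human-readable summary of the (count, size) pair returned by POP3 STAT."""
    return f"{msg_count} message(s), {mailbox_bytes} byte(s) held by the server"

def download_mail(host, user, password, delete_after=True):
    """Connect, download every message, then (typically) delete the server copy."""
    conn = poplib.POP3_SSL(host)      # POP3 over SSL, default port 995
    conn.user(user)
    conn.pass_(password)
    count, size = conn.stat()
    print(summarize_stat(count, size))
    messages = []
    for i in range(1, count + 1):
        _resp, lines, _octets = conn.retr(i)     # fetch message i
        messages.append(b"\r\n".join(lines))
        if delete_after:
            conn.dele(i)              # mark the server copy for deletion
    conn.quit()                       # QUIT commits the deletions
    return messages

# download_mail("pop.example.org", "alice", "secret")   # hypothetical account
```

The `dele` followed by `quit` is what produces the behavior noted above: once the client has downloaded the mail, the server's copy is removed.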
Recent Developments
News-On-Demand
The Canadian Institute for Telecommunication Research (CITR) Broadband Services project studies
and prototypes enabling technologies for distributed multimedia applications using an object-
oriented multimedia database system. Such applications include multimedia news, distance
learning, and home shopping. A successful product of the project is the News-on-Demand
application, which incorporates a distributed database server storing up-to-date
news items from various sources, such as TV, radio, newspapers, and magazines.
News-on-Demand allows subscribers to retrieve one or more news items using an article
query interface. Various news items from different sources are annotated and organized
into multimedia documents by service providers. News-on-Demand uses a distributed
service in which a client can access news over the broadband network.
Video-On-Demand
Multimedia technologies are attracting more and more interest every day. Video-on-
Demand is one of the buzzwords today and is now available to the public. Content
providers such as publishers, broadcasting companies and audio/video production firms
must be able to archive and index their productions for later retrieval. This is a formidable
task and it is even more so when the material to be handled encompasses several media
types and covers a time span of several years. In order for such a vast amount of data to be
easily available, existing multimedia database design models, indexing and retrieval
methodologies and delivery methods have to be improved and refined. In addition, video,
image and audio data cannot be effectively managed with the exclusive use of older,
keyword-based search techniques.
KMeD
The UCLA Knowledge-Based Multimedia Medical Distributed Database (KMeD) project is a
joint project between the Computer Science Department and the Radiological Science
Department.
The KMeD project has five major objectives:
1. Query medical multimedia distributed database by image, alphabetical, and
numerical content.
2. Model temporal, spatial, and evolutionary nature of medical objects.
3. Formulate queries using conceptual and imprecise medical terms to support
cooperative processing.
4. Develop a domain-independent, high-level query language and a medical domain
user interface to support KMeD functionality.
5. Provide analysis and presentation methods for visualization of knowledge and data
models.
The KMeD database is represented by features and objects. Selected objects of interest in
medical images such as X-ray and MRI images are segmented using knowledge-based, model-
guided techniques. As in other multimedia database systems, the features and contents of the
medical images are extracted and stored in a feature-and-content database. In KMeD, Type
Abstraction Hierarchies (TAHs) are used to represent the knowledge about the image in three
structures. The query process supports operators such as "similar to" and "nearby" and
conceptual terms such as "small" and "large" to find approximate matches to the
features and contents. A visual-interface technique has been developed that uses point-
click-and-drag input. This development by UCLA, the KMeD database, can change
the way hospitals maintain and utilize their data.
EtherMed
EtherMed is an experimental database of Internet-accessible multimedia courseware in
health-professions education. EtherMed contains records of educational materials that are
freely accessible on the web, with links to the actual courseware. EtherMed records are
descriptive and non-evaluative. EtherMed is an experiment in collaborative database
development and maintenance. OHPCC staff and collaborating consultants at distant
locations can directly enter, modify, and delete records online. The aim is to determine
whether distributed management of database resources is possible and can become self-
sustaining, and to determine how well database approaches compare to those employing
search engines. Online videoconferencing and other collaboration tools are used to support
this effort. ImageMed is an experimental system that uses:
(1) the Unified Medical Language System (UMLS) Metathesaurus for indexing and retrieval
of medical images, and
(2) a distributed image database.
Future Work
Future requirements of distributed multimedia systems will be even more demanding
than they are now. It is envisaged that users will be highly mobile and will require ubiquitous
access to, and satisfactory presentation of, multimedia data, regardless of the actual
connectivity and the specific presentation device they are currently using (e.g., wire-based
vs. wireless network, high-resolution graphics screen vs. Web-enabled cellular phone).
Moreover, users will expect to find information and multimedia content faster and more
easily, and will interact with the content much more intensively than today.
In this scope, the CODAC project and its sister project aim at realizing a quality-adaptive end-to-
end multimedia system; i.e., we shall provide means for indexing and retrieving multimedia
data by their content and adaptation capabilities, and develop methods to guarantee
quality-adaptive video transport to the client.
To realize such an end-to-end multimedia system, a number of activities must be carried
out; the following are the first work items under development in the CODAC project:
1. Development of a multimedia cartridge in the core of an Oracle 9i DBMS: The advantages
of the cartridge technology, as proposed by Oracle and other database vendors, are
reusability, extensibility and especially a very clean interface to components of the database
system, like query processor, optimizer and page access management. This multimedia
cartridge realizes the meta-database and provides access to the clients for complex search
functionality, supported by advanced indexing structures (like a combination of X-trees, SS-
trees, etc.).
2. Realization of a Processing Unit for MPEG-4 videos: The Processing Unit is supposed to
be situated between the video server and the meta-database and shall extract the necessary
quality adaptation capability information from the A/V streams to be stored as meta-data in
the database. Upon the insertion of the video (the insertion could be on demand or on a
regular basis), the Processing Unit shall apply effects such as transformation and scaling to
the MPEG-4 encoded videos and report results (performance and variation information) to
the meta-database and write back the adapted videos to the video server.
3. Implementation of an indexing structure for the access of MPEG-7 files, possibly in BiM
format: A composition mechanism for different MPEG-7 Access Units shall be developed, and
the mapping process from MPEG-7 Access Units to MPEG-4 Access Units shall be implemented.
4. Realization of the cross-referencing between MPEG-7 and MPEG-4, i.e., how to access
media data from meta-data and vice versa: the indexing structures of the previous work item
shall be employed and integrated into a reference stream which allows efficient access to the
meta-data from the media data. Referencing of MPEG-7 and MPEG-4 will be employed in the
proxy-cache and in the active routers.
The application scenarios we focus upon are sport-event videos, the M3box, and tele-teaching.
Sport events are an interesting application scenario since these videos provide semantically
rich content. The M3box is an adaptive multimedia message box developed by Siemens
Corporate Technology in cooperation with our Institute. The final application is tele-teaching.
This application differs from the previous ones in that here multimedia data plays a supporting
role, rather than being the central data component. Therefore, the meta-database has to keep,
besides descriptive information on the multimedia data, information on the tele-teaching
material and the teaching process.
Future Improvement
The most interesting and exciting thing about multimedia databases is how quickly they are
evolving. This growth, along with the emergence of inexpensive removable storage devices
such as DVDs, which store tens of gigabytes, will ignite an explosion of multimedia
applications. This explosion, in turn, will fuel an intense need for powerful multimedia
databases.
Companies, institutes, and other organizations that interact directly with multimedia
databases
Laboratories and Research Groups
Research in this area involves the representation, indexing, transmission, retrieval and
presentation of multimedia in a distributed environment, content-based retrieval, data
mining, and bioinformatics. The main projects built on multimedia databases are:
MultiStore
MultiStore is a storage infrastructure project to construct, manage, and store
multidimensional data sets. The project will carry out geographic imaging research,
bioinformatics research, pharmacogenomic research, and basic research on large scale data
sets and information visualization. Collaborators include David Mark, Director of the NCGIA,
Nobel laureate Herbert Hauptman, and Norma Nowak, Director of the DNA Micro array
Facility at the Roswell Park Cancer Institute.
NetMedia
NetMedia addresses consistent and robust retrieval, transmission and presentation of
multimedia data in distributed multimedia database systems.
Data Mining and Bioinformatics
While available information retrieval techniques rely on human coding of semantic content,
Zhang's intelligent content analysis analyzes images automatically. One image analyzed
depicts Bell Hall, home to the main office of the Department of Computer Science and
Engineering on UB's North Campus.
Multimedia Database at RMIT
The Multimedia Database Systems Group at RMIT is one of Australia's leading centers of
research in electronic document management. The group develops systems for document
management and undertakes consultation within the wider area of database management
and information retrieval.
The TeraText Database System
RMIT MDS is responsible for the development and maintenance of the TeraText database
system (formerly called the Structured Information Manager (SIM)), a text database system
with native XML support. The TeraText DBS is designed for building web-based, text-
intensive applications for large enterprises. The TeraText product is now sold in both
Australia and the USA and is used by a large number of organizations in both the public and
private sectors.
DELTA Collaborative Multimedia Database Project
The DELTA Collaborative Multimedia Database Project is designed to provide faculty and
students with access to a rich diversity of digital instructional materials (DIM) for self study
and for use by faculty in developing courseware and electronic presentations for
instructional purposes. The materials have copyright protection, but are available for non-
commercial use without charge. Currently, databases of digital still images, sounds, and
video have been developed from analog materials provided by faculty who have certified
copyright on the original work. The material contained in this database may only be used
for non-profit, educational purposes. Any other use is prohibited without the express
written consent of the Trustees of the California State University or the copyright owner.
World War II Multimedia Database
Originally completed as its author's thesis at Fordham University, the World War II Multimedia
Database is dedicated to the 50 million people killed in the Second World War. Currently the
World War II Multimedia Database has 1850 photos, 93 video clips, and a virtual radio.
Distributed Databases
In today’s world of universal dependence on information systems, all sorts of people need
access to companies’ databases. In addition to a company’s own employees, these include
the company’s customers, potential customers, suppliers, and vendors of all types. It is
possible for a company to have all of its databases concentrated at one mainframe computer
site with worldwide access to this site provided by telecommunications networks, including
the Internet. Although the management of such a centralized system and its databases can
be controlled in a well-contained manner and this can be advantageous, it poses some
problems as well. For example, if the single site goes down, then everyone is blocked from
accessing the databases until the site comes back up again. Also, the communications costs
from the many far-flung PCs and terminals to the central site can be expensive. One solution to
such problems, and an alternative design to the centralized database concept, is known as
distributed database. The idea is that instead of having one, centralized database, we are
going to spread the data out among the cities on the distributed network, each of which has
its own computer and data storage facilities. All of this distributed data is still considered to
be a single logical database. When a person or process anywhere on the distributed
network queries the database, it is not necessary to know where on the network the data
being sought is located. The user just issues the query, and the result is returned. This
feature is known as location transparency. This can become rather complex very quickly,
and it must be managed by sophisticated software known as a distributed database
management system or distributed DBMS.
A distributed database (DDB) is a collection of multiple, logically interrelated databases
distributed over a computer network.
A distributed database management system (DDBMS) is the software that manages the
DDB, and provides an access mechanism that makes this distribution transparent to the
user.
Collections of data (e.g. in a database) can be distributed across multiple physical locations.
A distributed database can reside on network servers on the Internet, on corporate
intranets or extranets, or on other company networks. The replication and distribution of
databases improves database performance at end-user worksites.
Although there are a number of advantages to using a distributed DBMS, there are also a
number of problems and implementation issues. Finally, data in a distributed DBMS can be
partitioned or replicated or both.
Homogeneous distributed databases
A homogeneous distributed database is one where all sites are homogeneous, or similar:
all sites have identical software, are aware of each other, and agree to cooperate in
processing user requests.
Heterogeneous distributed databases
In a heterogeneous distributed database system, at least one of the databases uses different
schemas and software.
A database system having different schema may cause a major problem for query
processing. A database system having different software may cause a major problem for
transaction processing. Therefore, special attention needs to be paid to integrate a
heterogeneous distributed database.
Methods to achieve a distributed database
A distributed database is one that is held in several locations. There are three methods of
achieving this:
1. Each remote processor holds data on its own customers, stock, etc. The database is
uploaded each night and changes are made.
2. The entire database is duplicated at the remote site. The database is uploaded each
night and changes are made.
3. The central database contains only an index of entries. The actual records are held
at the remote site. A query to the central database will locate where the record is held. This
system is used by very large databases.
Fragmentation
In a DDBS we need to define the logical unit of database distribution and allocation. In some
cases it may be more efficient to split the tables into smaller units (fragments) and allocate
them to different sites.
There are three types of fragmentation:
1. Horizontal Fragmentation
1. Horizontal Fragmentation
In horizontal partitioning, the rows of a table are divided up among several sites on the
network. Each such partition must include the primary key attribute(s) of the table.
An example of horizontal fragmentation is an employees table: it makes sense for the
company to split the table into partitions based on the site where each employee works.
This makes management, queries, and transactions convenient and efficient. The downside
of this choice is that a query involving all records has to request all partitions from all sites
and take their union.
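The trade-off above can be sketched with two in-memory SQLite databases standing in for two sites; all table, column, and site names are illustrative, not taken from any particular system.

```python
import sqlite3

# Two in-memory databases stand in for two sites holding horizontal
# fragments of the same employee table.
mumbai = sqlite3.connect(":memory:")
delhi = sqlite3.connect(":memory:")

for site, rows in [(mumbai, [(1, "Asha", "Mumbai"), (2, "Ravi", "Mumbai")]),
                   (delhi,  [(3, "Kiran", "Delhi")])]:
    site.execute("CREATE TABLE employee "
                 "(emp_id INTEGER PRIMARY KEY, name TEXT, site TEXT)")
    site.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)

# A site-local query touches only one fragment ...
local = mumbai.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# ... but a global query must fetch every fragment and union the results.
all_rows = (mumbai.execute("SELECT * FROM employee").fetchall()
            + delhi.execute("SELECT * FROM employee").fetchall())
print(local, len(all_rows))   # 2 3
```

Because each fragment carries the full primary key, the union reconstructs the original table without duplicates.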
2. Vertical Fragmentation
In vertical partitioning, the columns of a table are divided up among several sites on the
network. Each such partition must include the primary key attribute(s) of the table. This
arrangement can make sense when different sites are responsible for processing different
functions involving an entity. For example, the salary attributes of a personnel table might
be stored in one city while the skills attributes of the table might be stored in another city.
Both partitions would include the employee number, the primary key of the full table. A
downside of this option is that a query involving the entire table would have to request all
partitions from all sites and join them.
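A minimal sketch of reconstructing a vertically partitioned record, using one SQLite connection for brevity (in a real DDBS the two fragments would live at different sites); the fragment and column names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Each vertical fragment repeats the primary key of the full table.
db.execute("CREATE TABLE emp_salary (emp_id INTEGER PRIMARY KEY, salary INTEGER)")
db.execute("CREATE TABLE emp_skills (emp_id INTEGER PRIMARY KEY, skill TEXT)")
db.execute("INSERT INTO emp_salary VALUES (1, 50000)")
db.execute("INSERT INTO emp_skills VALUES (1, 'SQL')")

# Reconstructing the full record requires a join on the shared key.
row = db.execute("""SELECT s.emp_id, s.salary, k.skill
                    FROM emp_salary s
                    JOIN emp_skills k ON s.emp_id = k.emp_id""").fetchone()
print(row)  # (1, 50000, 'SQL')
```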
3. Hybrid Fragmentation
In this scheme, the table is divided into arbitrary blocks based on the requirements at hand.
Each fragment can then be allocated to a specific site.
This is the most complex type of fragmentation and requires the most management.
Query Processing
A DDBS adds processing expenses beyond those of a conventional centralized DBS, because
of the additional hardware and software needed to handle the distribution. These expenses
appear mainly as the cost of data transfer over the network: the data transferred may be
intermediate files produced at local sites, or final results that must be sent back to the site
that issued the query. Database designers are therefore concerned with query optimization,
which targets minimizing the cost of transferring data across the network. One optimization
method, known as a semijoin, has relation R1 send only its join column CR1 to the site
holding the target relation R2; that site performs the join on CR1, projects the required
attributes, and ships the resulting tuples back to R1's site for further processing.
This can significantly enhance the query efficiency, since the data transferred on the
network is minimized.
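The optimization described above (often called a semijoin) can be sketched with plain Python lists; the relation names R1 and R2 follow the text, everything else is illustrative.

```python
# Relation fragments held at two different sites.
r1 = [(1, "order-a"), (2, "order-b"), (3, "order-c")]    # at site 1
r2 = [(2, "shipped"), (3, "pending"), (9, "cancelled")]  # at site 2

# Step 1: site 1 ships only the join column, not the whole relation.
join_column = {key for key, _ in r1}

# Step 2: site 2 joins locally and returns only the matching tuples.
matches = [row for row in r2 if row[0] in join_column]

# Step 3: site 1 completes the join with the reduced result.
result = [(k, payload, status)
          for k, payload in r1
          for k2, status in matches if k == k2]
print(result)  # [(2, 'order-b', 'shipped'), (3, 'order-c', 'pending')]
```

Only the join column and the matching tuples cross the network, rather than either relation in full.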
Concurrency and Recovery
The design of concurrency and recovery in a DDBS has to consider different aspects beyond
those of a centralized DBS. These aspects include:
• Multiple copies of data: concurrency control has to keep the copies consistent.
Recovery, on the other hand, has to bring a copy back into consistency with the others
whenever a site recovers from a failure.
• Failure of communication links
• Failure of individual sites
• Distributed commit: some sites may fail during transaction commit, so the two-phase
commit protocol is used to solve this problem.
• Deadlocks on multiple sites.
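The two-phase commit mentioned above can be sketched in a few lines; the site objects and their vote behaviour here are purely illustrative, not a real protocol implementation.

```python
def two_phase_commit(sites):
    # Phase 1 (prepare): ask every participant to vote.
    votes = [site["prepare"]() for site in sites]
    # Commit only if the vote is unanimous; otherwise abort.
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (decision): broadcast the outcome to every site's log.
    for site in sites:
        site["log"].append(decision)
    return decision

healthy = {"prepare": lambda: True, "log": []}
failed = {"prepare": lambda: False, "log": []}

print(two_phase_commit([healthy, {"prepare": lambda: True, "log": []}]))  # commit
print(two_phase_commit([healthy, failed]))                                # abort
```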
The following two sections describe two suggestions to manage concurrency control.
1. Distinguished Copy of a Data Item
There are three variations to this method:
• Primary site
In this method, a single site is designated as the coordinator site. All locks and unlocks for
all data units are controlled by this site. One advantage is that it is easy to implement.
However, two downsides of this method are the overloading of the coordinator site, and the
fact that this site forms a single point of failure for the entire DDBS.
• Primary site with backup site
This technique addresses the second disadvantage of the primary site technique by
designating a backup site that can take over as the new coordinator in case of failure; when
this happens, another backup site has to be selected.
• Primary copy technique
This method distributes the load to the sites that have a designated primary copy of a data
unit, as opposed to centralizing all data units at one coordinator site. This way, if a site goes
down, only transactions involving the primary copies residing on that site will be
affected.
2. Voting
This method does not designate any distinguished copy or site as coordinator, as the two
methods described above do. When a site attempts to lock a data unit, it must send requests
to all sites holding a copy of that unit, asking them to lock their copy. If the requesting
transaction is not granted the lock by a majority vote of the sites, it fails and sends a
cancellation to all of them. Otherwise it keeps the lock and informs all sites that it has been
granted the lock.
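The majority-vote rule can be sketched as follows; the site dictionaries and their fixed grant/deny behaviour are simulated stand-ins for real lock managers.

```python
def request_lock(sites_with_copy):
    """Grant the lock only if a strict majority of copy-holding sites agree."""
    grants = sum(1 for site in sites_with_copy if site["grants_lock"])
    if grants > len(sites_with_copy) / 2:
        return "locked"     # inform all sites that the lock is held
    return "cancelled"      # send a cancellation to all sites

# Two of three sites grant the request: the transaction gets the lock.
sites = [{"grants_lock": True}, {"grants_lock": True}, {"grants_lock": False}]
print(request_lock(sites))  # locked
```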
Recovery
The first step in dealing with a failure is to identify that it occurred, what type it was, and at
which site it happened. Distributed recovery then relies on mechanisms such as database
logs, update protocols, and transaction failure recovery protocols.
Advantages of Distributed Database
• Management of distributed data with different levels of transparency, such as
fragmentation transparency and replication transparency.
• Increased reliability and availability: Reliability is the probability that the system is
up at a given moment; availability is the probability that the system is up continuously
during a given time period. Both of these important system parameters improve with a
DDBS. In a centralized DBS, if any component of the DB goes down, the entire system goes
down, whereas in a DDBS only the affected site is down and the rest of the system is
unaffected. Furthermore, if the data is replicated at different sites, the effects of a failure are
greatly minimized.
• Easier expansion.
• Reflects organizational structure: Database fragments are located in the
departments they relate to.
• Local autonomy or site autonomy: A department can control its own data, as its
members are the ones most familiar with it.
• Protection of valuable data: If there were ever a catastrophic event such as a fire, all
of the data would not be in one place, but distributed in multiple locations.
• Improved performance: Data is located near the site of greatest demand, and the
database systems themselves are parallelized, allowing load on the databases to be
balanced among servers. (A high load on one module of the database won't affect other
modules of the database in a distributed database.)
• Economics: It costs less to create a network of smaller computers with the power of
a single large computer.
• Modularity: Systems can be modified, added and removed from the distributed
database without affecting other modules (systems).
• Reliable transactions, due to replication of the database.
• Hardware, Operating System, Network, Fragmentation, DBMS, Replication and
Location Independence.
• Continuous operation.
• Distributed Query processing.
• Distributed Transaction management.
Disadvantages of Distributed Databases
• Complexity: Extra work must be done by the DBAs to ensure that the distributed
nature of the system is transparent. Extra work must also be done to maintain multiple
disparate systems, instead of one big one. Extra database design work must also be done to
account for the disconnected nature of the database — for example, joins become
prohibitively expensive when performed across multiple systems.
• Economics: Increased complexity and a more extensive infrastructure means extra
labour costs.
• Security: Remote database fragments must be secured, and they are not centralized
so the remote sites must be secured as well. The infrastructure must also be secured (e.g.,
by encrypting the network links between remote sites).
• Difficulty maintaining integrity: In a distributed database, enforcing integrity
constraints over a network may require too much of the network's resources to be feasible.
• Inexperience: distributed databases are difficult to work with, and as a young field
there is not much readily available experience on proper practice.
• Lack of standards: There are no tools or methodologies yet to help users convert a
centralized DBMS into a distributed DBMS.
• Database design is more complex: Besides the normal difficulties, the design of a
distributed database has to consider fragmentation of data, allocation of fragments to
specific sites, and data replication.
• Additional software is required.
• Operating System should support distributed environment.
• Concurrency control: This is a major issue; it is addressed by locking and timestamping.
Document Oriented Database
A document-oriented database is a computer program designed for storing, retrieving, and
managing document-oriented information, also known as semi-structured data. Document-oriented
databases are one of the main categories of so-called NoSQL databases and the popularity of
the term "document-oriented database" (or "document store") has grown with the use of
the term NoSQL itself.
Documents
The central concept of a document-oriented database is the notion of a Document. While
each document-oriented database implementation differs on the details of this definition, in
general, they all assume documents encapsulate and encode data (or information) in some
standard format(s) (or encoding(s)). Encodings in use include XML, YAML, JSON and BSON,
as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so
on).
Documents inside a document-oriented database are similar, in some ways, to records or
rows in relational databases, but they are less rigid. They are not required to adhere to a
standard schema, nor will they all have the same sections, slots, parts, keys, or the like. For
example, here is a document:
FirstName="Bob", Address="5 Oak St.", Hobby="sailing".
Another document could be:
FirstName="Jonathan", Address="15 Wanamassa Point Road",
Children=[{Name:"Michael",Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5},
{Name:"Elena", Age:2}].
Both documents have some information in common and some that differs. Unlike a
relational database, where each record would have the same set of fields and unused fields
might be kept empty, there are no empty 'fields' in either document (record) here. This
system allows new information to be added without requiring that omitted pieces of
information be explicitly declared.
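The two example documents above can be held schema-free as plain Python dicts; the store and key names are illustrative. Neither record carries empty placeholder fields, and a field can be added to one document without touching the other.

```python
store = {}
store["doc1"] = {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"}
store["doc2"] = {"FirstName": "Jonathan",
                 "Address": "15 Wanamassa Point Road",
                 "Children": [{"Name": "Michael", "Age": 10},
                              {"Name": "Jennifer", "Age": 8},
                              {"Name": "Samantha", "Age": 5},
                              {"Name": "Elena", "Age": 2}]}

# A new field appears only on the document it was written to; no schema
# change and no empty slot is created anywhere else.
store["doc1"]["Boat"] = "sloop"
print("Boat" in store["doc1"], "Boat" in store["doc2"])  # True False
```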
Keys, Retrieval, and Organization
Keys
Documents are addressed in the database via a unique key that represents that document.
Often, this key is a simple string. In some cases, this string is a URI or path. Regardless, you
can use this key to retrieve the document from the database. Typically, the database retains
an index on the key such that document retrieval is fast.
Retrieval
One of the other defining characteristics of a document-oriented database is that, beyond
the simple key-document (or key-value) lookup that you can use to retrieve a document, the
database will offer an API or query language that will allow you to retrieve documents
based on their contents. For example, you may want a query that gets you all the documents
with a certain field set to a certain value. The set of query APIs or query language features
available, as well as the expected performance of the queries, varies significantly from one
implementation to the next.
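The two retrieval paths described above can be sketched with a dict standing in for the store; the keys, field names, and values are all hypothetical. Key lookup is a direct index access, while a content query scans documents for a field/value match, as a document store's query API would.

```python
db = {
    "user:1": {"name": "Bob", "city": "Pune"},
    "user:2": {"name": "Asha", "city": "Mumbai"},
    "user:3": {"name": "Ravi", "city": "Pune"},
}

# Key-based retrieval: one indexed lookup.
by_key = db["user:2"]

# Content-based retrieval: find every document with a given field value.
by_content = [key for key, doc in db.items() if doc.get("city") == "Pune"]

print(by_key["name"], by_content)  # Asha ['user:1', 'user:3']
```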
Organization
Implementations offer a variety of ways of organizing documents, including notions of
▪ Collections
▪ Tags
▪ Non-visible Metadata
▪ Directory hierarchies
XML database implementations
An XML database is a data persistence software system that allows data to be stored in XML
format. This data can then be queried, exported and serialized into the desired format.
Two major classes of XML database exist:
▪ XML-enabled: these map all XML to a traditional database (such as a relational
database), accepting XML as input and rendering XML as output. This term implies that the
database does the conversion itself (as opposed to relying on middleware).
▪ Native XML (NXD): the internal model of such databases depends on XML and uses XML
documents as the fundamental unit of storage, which are, however, not necessarily stored in
the form of text files.
Rationale for XML in databases
O'Connell gives one reason for the use of XML in databases: the increasingly common use of
XML for data transport, which has meant, "data is extracted from databases and put into
XML documents and vice-versa". It may prove more efficient (in terms of conversion costs)
and easier to store the data in XML format.
Native XML databases
The term "native XML database" (NXD) can lead to confusion. Many NXDs do not function as
standalone databases at all, and do not really store the native (text) form.
The formal definition from the XML:DB initiative (which appears to be inactive since 2003)
states that a native XML database:
Defines a (logical) model for an XML document — as opposed to the data in that document
— and stores and retrieves documents according to that model. At a minimum, the model
must include elements, attributes, PCDATA, and document order. Examples of such models
include the XPath data model, the XML Infoset, and the models implied by the DOM and the
events in SAX 1.0.
Has an XML document as its fundamental unit of (logical) storage, just as a relational
database has a row in a table as its fundamental unit of (logical) storage.
Need not have any particular underlying physical storage model. For example, NXDs can use
relational, hierarchical, or object-oriented database structures, or use a proprietary storage
format (such as indexed, compressed files).
Additionally, many XML databases provide a logical model of grouping documents, called
"collections". Databases can set up and manage many collections at one time. In some
implementations, a hierarchy of collections can exist, much in the same way that an
operating system's directory-structure works.
All XML databases now support at least one form of querying syntax. Minimally, just about
all of them support XPath for performing queries against documents or collections of
documents. XPath provides a simple pathing system that allows users to identify nodes that
match a particular set of criteria.
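Python's standard-library ElementTree supports a limited XPath subset, which is enough to show node selection by criteria as described above; the document and element names are made up for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<library>
  <book year="1999"><title>A</title></book>
  <book year="2011"><title>B</title></book>
</library>""")

# Select the titles of all books published in 2011 via an attribute predicate.
titles = [t.text for t in doc.findall("./book[@year='2011']/title")]
print(titles)  # ['B']
```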
In addition to XPath, many XML databases support XSLT as a method of transforming
documents or query-results retrieved from the database. XSLT provides a declarative
language written using an XML grammar. It aims to define a set of XPath filters that can
transform documents (in part or in whole) into other formats, including plain text, XML, or
HTML.
Many XML databases also support XQuery to perform querying. XQuery includes XPath as a
node-selection method, but extends XPath to provide transformational capabilities. Users
sometimes refer to its syntax as "FLWOR" (pronounced 'Flower') because the query may
include the following clauses: 'for', 'let', 'where', 'order by' and 'return'. Traditional RDBMS
vendors, who historically offered SQL-only engines, now ship hybrid SQL/XQuery engines.
These hybrid engines make it possible to query XML data alongside relational data in the
same query expression, helping to combine relational and XML data.
Some XML databases support an API called the XML:DB API (or XAPI) as a form of
implementation-independent access to the XML datastore. In XML databases, XAPI
resembles ODBC and JDBC as used with relational databases. On 24 June 2009, the Java
Community Process released the final version of the XQuery API for Java specification
(XQJ) - "a common API that allows an application to submit queries conforming to the W3C
XQuery 1.0 specification and to process the results of such queries".
Implementations of the Databases
Each entry lists name, publisher, license, implementation language, and notes, where available:
▪ Lotus Notes – IBM – Proprietary
▪ askSam – askSam Systems – Proprietary
▪ Apstrata – Apstrata – Proprietary
▪ Datawasp – Significant Data Systems – Proprietary
▪ Clusterpoint – Clusterpoint Ltd. – Free community license / Commercial – C++. Scalable, high-performance, schema-free, document-oriented database management system platform with server-based data storage, fast full-text search engine functionality, information ranking for search relevance, and clustering.
▪ CRX – Day Software – Proprietary
▪ MUMPS Database – Proprietary and GNU Affero GPL – MUMPS. Commonly used in health applications.
▪ UniVerse – Rocket Software – Proprietary
▪ UniData – Rocket Software – Proprietary
▪ Jackrabbit – Apache Software Foundation – Apache License – Java
▪ CouchDB – Couchbase, Apache Software Foundation – Apache License – Erlang. JSON over REST/HTTP with multi-version concurrency control and ACID properties; uses map and reduce for views and queries.
▪ FleetDB – FleetDB – MIT License – Clojure. A JSON-based schema-free database optimized for agile development.
▪ MongoDB – GNU AGPL v3.0 – C++. Fast, document-oriented database optimized for highly transient data.
▪ GemFire Enterprise – VMWare – Commercial – Java, .NET, C++. Memory-oriented, fast key-value database with indexing and querying support.
▪ OrientDB – Orient Technologies – Apache License – Java. JSON over HTTP.
▪ RavenDB – RavenDB – Commercial or GNU AGPL v3.0 – .NET. A .NET LINQ-enabled document database, focused on providing a high-performance, transactional, schema-less, flexible, and scalable NoSQL data store for the .NET and Windows platforms.
▪ Redis – BSD License – ANSI C. Key-value store supporting lists and sets, with a fast, simple, binary-safe protocol.
▪ StrokeDB – MIT License. Alpha software.
▪ Terrastore – Apache License – Java. JSON/HTTP.
▪ ThruDB – BSD License – C++, Java. Built on top of the Apache Thrift framework; provides indexing and document storage services for building and scaling websites. An alternate implementation is being developed in Java. Alpha software.
▪ Persevere – Persevere – BSD License. A JSON database and JavaScript application server; provides a RESTful JSON interface for create, read, update, and delete access to data; also supports JSONQuery/JSONPath querying.
▪ DBSlayer – DBSlayer – Apache License – C. Database abstraction layer (over MySQL) used by the New York Times; JSON over HTTP.
▪ Eloquera DB – Eloquera – Proprietary – .NET. High performance; based on dynamic objects; supports LINQ and SQL queries.
Mobile Database
A mobile database is a database that can be connected to by a mobile computing device over
a mobile network. The client and server have wireless connections. A cache is maintained to
hold frequent data and transactions so that they are not lost due to connection failure. A
database is a structured way to organize information. This could be a list of contacts, price
information or distance travelled.
The use of laptops, mobile phones, and PDAs is increasing and likely to increase in the
future, with more and more applications residing on mobile systems. While analysts cannot
tell us exactly which applications will be the most popular, it is clear that a large percentage
will require a database of some sort. Many applications would require the ability to
download information from an information repository and operate on this information
even when out of range or disconnected.
An example of this is a mobile workforce. In this scenario, users need to access and update
information from files in their home directories on a server, or customer records from
a database. This type of access and workload generated by such users is different from the
traditional workloads seen in client–server systems of today. With the advent of mobile
databases, now users can load up their smart phones or PDAs with mobile databases to
exchange mission-critical data remotely without worrying about time or distance. Mobile
databases let employees enter data on the fly. Information can be synchronized with a
server database at a later time.
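The store-and-forward pattern described above can be sketched as a small class; the class, method, and key names are all hypothetical. Writes succeed against the local cache while disconnected and are pushed to the server database during a later synchronization.

```python
class MobileCache:
    """Toy on-device cache that queues writes until a connection returns."""

    def __init__(self):
        self.local = {}      # data usable on the device right away
        self.pending = []    # writes not yet seen by the server

    def write(self, key, value):
        # Works offline: the write lands locally and is queued for later.
        self.local[key] = value
        self.pending.append((key, value))

    def synchronize(self, server_db):
        # Called when back in range: drain the queue into the server DB.
        while self.pending:
            key, value = self.pending.pop(0)
            server_db[key] = value

cache, server = MobileCache(), {}
cache.write("visit:42", "signed")   # entered in the field, offline
cache.synchronize(server)
print(server)  # {'visit:42': 'signed'}
```

A real synchronizer would also resolve conflicting updates made on the server in the meantime; this sketch assumes the device is the only writer.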
Embedded Database
An embedded database system is a database management system (DBMS), which is tightly
integrated with application software that requires access to stored data, such that the
database system is “hidden” from the application’s end-user and requires little or no
ongoing maintenance. It is actually a broad technology category that includes database
systems with differing application programming interfaces (SQL as well as proprietary,
native APIs); database architectures (client/server and in-process); storage modes (on-disk,
in-memory and combined); database models (relational, object-oriented, Entity-Attribute-
Value model and network/CODASYL); and target markets. The term "embedded database"
can be confusing because only a small subset of embedded database products is used in
real-time embedded systems such as telecommunications switches and consumer
electronics devices.
Mobile and Embedded DBMS Characteristics
The data access and management requirements of the applications described above are
significantly different from those of traditional server DBMSs. These new applications must
be able to run on multiple tiers ranging from devices to servers to the web, and would
benefit from various existing database mechanisms. However, these mechanisms (such as
query processing, indexing, and persistence) must be unlocked from the traditional
monolithic DBMS and made available as embeddable components (e.g. DLLs) that can be
embedded within applications, enabling them to meet the requirements described above.
Such mobile and embedded DBMSs have the following characteristics:
1. Embeddable in applications – Mobile and Embedded DBMSs form an integral part of the
application or the application infrastructure, often requiring no administration. Database
functionality is delivered as part of the application (or app infrastructure). While the
database must be embeddable as a DLL in applications, it must also be possible to deploy it
as a stand-alone DBMS with support for multiple transactions and applications.
2. Small footprint – For many applications, especially those that are downloadable, it is
important to minimize the DBMS footprint. Since the database system is part of the
application, the size of the DBMS affects the overall application footprint. In addition to a
small footprint, short code paths are desirable for efficient application execution. Most of
these applications do not require the full functionality of commercial DBMSs; they require
simple querying and execute in constrained environments.
3. Run on mobile devices – The DBMSs that run on mobile devices tend to be specialized
versions of mobile and embedded DBMSs. In addition to handling the memory, disk and
processor limitations of these devices, the DBMS must also run on specialized operating
systems. The DBMS must be able to store and forward data to the back-end databases as
synchronization with backend systems is critical for them.
4. Componentized DBMS – Often, to support the small-footprint requirement, it is important
to include only the functionality that the application requires. For example, many simple
applications just require ISAM-like record-oriented access; for these, there is no need to
include a query processor, which would only increase the footprint. Similarly,
many mobile and mid-tier applications require only a small set of relational operators while
others require XML access and not relational access. So, it should be possible to pick and
choose the desired components.
5. Self-managed DBMS – The embedded DBMS is invisible to the application user. There is
no DBA to manage the database, so operations like backups, recovery, indexing, and tuning
cannot be initiated by a DBA. If the database crashes, recovery must start instantaneously.
The database must be self-managed or managed by the application. Also, an embedded
DBMS must install automatically with the application; it should not require explicit
(user-initiated) or independent installation. Similarly, when the application is shut down,
the DBMS must transparently shut down.
6. In-Memory DBMS – These are specialized DBMSs serving applications that require high
performance on data that is small enough to be contained in main memory. In-memory
DBMSs require specialized query processing and indexing techniques that are optimized for
main memory usage. Such DBMSs also can support data that may never get persisted.
7. Portable databases – Many applications require very simple deployment: installing the
application should install the database associated with it. This requires the database to be
highly portable. Typically, single-file databases (like Microsoft Access databases) are ideally
suited for this purpose. Again, there should be no need to install the DBMS separately –
installing the application installs the DBMS, and copying the database file completes the
application migration.
8. No code in the database – A portable database must also be safe. Executable code can
carry viruses or other threats; by eliminating code storage in the database, it can be made
safer and more portable.
9. Synchronize with back-end data sources – In mobile and cached scenarios, it must be
possible to synchronize the data with the back-end data sources. In typical mid-tier
(application server) caches, data is fetched from the back-end databases into the cache,
operated on, and synchronized with the back-end database.
10. Remote management – While mobile and embedded DBMSs must be self-managed, it is
also important to allow them to be managed remotely, especially those on mobile devices.
In enterprises (e.g. FedEx, UPS), mobile devices must be configured and managed in a
manner compliant with company standards, so centralized remote management of these
devices is necessary.
11. Custom programming interfaces – An important use of embedded DBMSs is in
specialized data-centric applications. Such applications use a variety of data models (e.g.
relational, XML, streams, analytics) and query languages. The embedded DBMS must
therefore be componentized and extensible to allow (application) domain-specific query
languages and programming surfaces.
Conclusion
With the increased interest in specialized applications and database systems (such as
streaming, analytics, mobile, and sensor networks), the need for mobile and embedded
DBMSs is growing. There have been mobile and embedded DBMS products on the market
that satisfy some of the characteristics described above, but they were designed as low-end
database products and do not factor in recent hardware and application trends. For
example, some products focused on a small footprint but were not designed with
componentization for embedded applications; others provide only embedded (well
componentized) storage engines without any query processing components.
The hardware trends in multi core CPUs, large main memory, and NVRAMs (e.g. flash) are
beginning to disrupt the traditional database architectures. Particularly, advances in large
memory and flash storage open doors for further research in database architectures. The
componentized mobile and embedded DBMSs are in a position to adapt to these changing
hardware and software trends more rapidly.