Jamnalal Bajaj Institute of Management Studies
MMS – I Alpha
10/11/2011
Submitted by: Abhijeet Shinde (93), Akshay Wagle (115), Amrit Pradhan (80), Rahul Tambakhe (104), Saiprasad Sale (85), Sagar Patil (74), Srinath Subramanian (100), Vaibhav Thakur (106)
DATABASE CONCEPTS WITH EMERGING TRENDS
Table of Contents
Introduction
Database Design Methodologies
Database Management Systems
Relational Database Management Systems
Multimedia Database Management Systems
Distributed Databases
Document Oriented Database
Conclusion
Introduction
Data
The term data refers to qualitative or quantitative attributes of a variable or set of variables.
Data is typically the result of measurements and can be the basis of graphs, images, or
observations of a set of variables. Data are often viewed as the lowest level of
abstraction from which information and then knowledge are derived. Raw data, i.e.
unprocessed data, refers to a collection of numbers, characters, images or other outputs
from devices that collect information to convert physical quantities into symbols. Also, data
is a representation of facts, figures, and ideas. Such usage is the origin of data as a concept in
computer science: data are numbers, words, images, etc., accepted as they stand.
The terms information and knowledge are frequently used for overlapping concepts. The
main difference is in the level of abstraction being considered. Data is the lowest level of
abstraction, information is the next level, and finally, knowledge is the highest level among
all three. Data on its own carries no meaning. For data to become information, it must be
interpreted and take on a meaning. For example, the height of Mt. Everest is generally
considered as "data", a book on Mt. Everest geological characteristics may be considered as
"information", and a report containing practical information on the best way to reach Mt.
Everest's peak may be considered as "knowledge".
Information as a concept bears a diversity of meanings, from everyday usage to technical
settings. Generally speaking, the concept of information is closely related to notions of
constraint, communication, control, data, form, instruction, knowledge, meaning, mental
stimulus, pattern, perception, and representation.
Beynon-Davies uses the concept of a sign to distinguish between data and information; data
are symbols while information occurs when symbols are used to refer to something.
It is people and computers who collect data and impose patterns on it. These patterns are
seen as information which can be used to enhance knowledge. Whether these patterns are
accepted as truth is judged against aesthetic and ethical criteria. Events that leave
behind perceivable physical or virtual remains can be traced back through data. Marks are
no longer considered data once the link between the mark and observation is broken.
Raw data refers to a collection of numbers, characters, images or other outputs from
devices that convert physical quantities into symbols; such data is as yet unprocessed. It is
typically further processed by a human or input into a computer, stored and processed
there, or transmitted (output) to another human or computer (possibly through a data
cable). Raw data is a relative term; data processing commonly occurs in stages, and the
"processed data" from one stage may be considered the "raw data" of the next.
Mechanical computing devices are classified according to the means by which they
represent data. An analog computer represents a datum as a voltage, distance, position, or
other physical quantity. A digital computer represents a datum as a sequence of symbols
drawn from a fixed alphabet. The most common digital computers use a binary alphabet,
that is, an alphabet of two characters, typically denoted "0" and "1". More familiar
representations, such as numbers or letters, are then constructed from the binary alphabet.
Some special forms of data are distinguished. A computer program is a collection of data,
which can be interpreted as instructions. Most computer languages make a distinction
between programs and the other data on which programs operate, but in some languages,
notably Lisp and similar languages, programs are essentially indistinguishable from other
data. It is also useful to distinguish metadata, that is, a description of other data. A similar
yet earlier term for metadata is "ancillary data." The prototypical example of metadata is
the library catalog, which is a description of the contents of books.
Experimental data refers to data generated within the context of a scientific investigation by
observation and recording. Field data refers to raw data collected in an uncontrolled in
situ environment.
Example: ROBCOR Company tracks all sales for its two divisions through invoices. Each of
the invoices contains raw facts such as these:
Invoice number = 300124, Invoice date = 12-Jan-2004, and Sales amount = $125.98
Suppose that ROBCOR's two divisions have generated 1,380,456 and 1,453,907 invoices,
respectively, between the first quarter of 1999 and the first quarter of 2004. These millions
of invoices that ROBCOR has generated are raw facts which have no meaning unless and
until they are processed into information. Now, suppose that for the purpose of drawing
conclusions and/or making decisions, ROBCOR's sales managers want information about sales
productivity per employee for each of the two divisions. The generated sales productivity
information will reveal the meaning of the data in the sales invoices. An application
program in an information system will generate the required information. When data is
stored electronically in files, it can be used as input for an information system. An
information system has programs to process (or transform) data to produce information as
an output.
Information reveals the meaning of data. For example, student data values such as ID, Name,
Address, Major, and Phone Number represent raw facts. A class roll is a list showing the IDs
and names of the students enrolled in a particular class.
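The transformation of raw invoice data into sales-productivity information described above can be sketched in a few lines of Python. The figures and division names below are hypothetical, not ROBCOR's actual data:

```python
# Hypothetical raw facts: a few invoices, each tagged with its division.
invoices = [
    ("Division 1", 125.98),
    ("Division 1", 300.50),
    ("Division 2", 89.99),
    ("Division 2", 410.00),
]
employees = {"Division 1": 2, "Division 2": 4}  # assumed head counts

# Processing step: aggregate the raw facts into totals per division.
totals = {}
for division, amount in invoices:
    totals[division] = totals.get(division, 0.0) + amount

# Information step: sales productivity per employee reveals the meaning
# hidden in the individual invoices.
for division, total in sorted(totals.items()):
    print(f"{division}: total {total:.2f}, per employee {total / employees[division]:.2f}")
```

In a real information system the invoices would come from stored files or a database rather than an in-memory list, but the process-data-into-information step is the same.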
Summary of Data
o Facts and statistics used for reference or analysis.
o Numbers, characters, symbols, images, etc., which can be processed by a computer.
o Data must be interpreted, by a human or machine, to derive meaning.
o "Data is a representation of information."
o From the Latin 'datum', meaning "that which is given".
o Data is plural; datum is singular.
Information
Information in its most restricted technical sense is a message (utterance or expression) or
collection of messages that consists of an ordered sequence of symbols, or it is the meaning
that can be interpreted from such a message or collection of messages. Information can be
recorded or transmitted. It can be recorded as signs, or conveyed as signals by waves.
Information is any kind of event that affects the state of a dynamic system. The concept has
numerous other meanings in different contexts. Moreover, the concept of information is
closely related to notions of constraint, communication, control, data, form, instruction,
knowledge, meaning, mental stimulus, pattern, perception, representation, and especially
entropy.
Records are a specialized form of information. Essentially, records are information
produced consciously or as by-products of business activities or transactions and retained
because of their value. Primarily their value is as evidence of the activities of the
organization but they may also be retained for their informational value. Sound records
management ensures that the integrity of records is preserved for as long as they are
required.
The international standard on records management, ISO 15489, defines records as
"information created, received, and maintained as evidence and information by an
organization or person, in pursuance of legal obligations or in the transaction of business".
The International Council on Archives (ICA) Committee on Electronic Records defined a
record as, "a specific piece of recorded information generated, collected or received in the
initiation, conduct or completion of an activity and that comprises sufficient content,
context and structure to provide proof or evidence of that activity".
Records may be maintained to retain corporate memory of the organization or to meet
legal, fiscal or accountability requirements imposed on the organization. Willis (2005)
expressed the view that sound management of business records and information delivered
"…six key requirements for good corporate governance…transparency; accountability; due
process; compliance; meeting statutory and common law requirements; and security of
personal and corporate information."
Summary of Information
o Knowledge derived from study, experience (by the senses), or instruction.
o Communication of intelligence.
o Information is any kind of knowledge that is exchangeable amongst people, about things,
facts, concepts, etc., in some context.
o Information is interpreted data.
Data processed into information: for example, student enrolment data, processed by a
student registration system, produces a class roll.
Data constitute the building blocks of information:
• Information is produced by processing data.
• Information is used to reveal the meaning of data.
• Good, relevant, and timely information is the key to good decision making.
• Good decision making is the key to organizational survival in a global environment.
Timely and useful information requires accurate data. To achieve accurate information, the
data must be stored and generated properly. Also, the data must be stored in a format that
is easy to access and process. And like any basic resource, the data environment must be
managed carefully. Thus, data management is a discipline that focuses on the proper
generation, storage, and retrieval of data.
Database Introduction
The term database is correctly applied to the data and their supporting data structures, and
not to the DBMS which is a software system used to manage the database. The structure of a
database is generally too complex to be handled without its DBMS, and any attempt to do
otherwise is very likely to result in database corruption. DBMSs are packaged as computer
software products: well-known and widely used proprietary DBMSs (ranked by market share)
include Oracle, IBM DB2, and Microsoft SQL Server; open-source DBMSs include PostgreSQL
and the simpler, lower-end, but in many cases satisfactory, MySQL and SQLite. Each such
DBMS product currently supports many thousands of databases all over the world; the
mostly embedded DBMS SQLite probably supports many millions of instances. A database is
not generally portable across different DBMSs, but different DBMSs can inter-operate to
some degree (while each DBMS type controls a database of its own type) by
using standards like SQL and ODBC to support a single application together. A successful
general-purpose DBMS is designed in such a way that it can satisfy as many different
applications and application designers and builders as possible. A DBMS also needs to
provide effective run-time execution to properly support (e.g., in terms
of performance, availability, and security) as many end-users (the database's application
users) as needed. Sometimes the combination of a database and its respective DBMS is
referred to as a Database system (DBS).
The design, construction, and maintenance of a complex database requires specialist skills:
the staff performing these functions are referred to as database
application programmers (different from the DBMS developers/programmers)
and database administrators. Their tasks are supported by tools provided either as part of
the DBMS or as free-standing software products. These tools include specialized database
languages such as data definition languages (DDLs), data manipulation languages (DMLs),
and query languages. These can be seen as special-purpose programming languages,
tailored specifically to manipulate databases; sometimes they are provided as extensions of
existing programming languages, with added special database commands. Database
languages are generally specific to one data model, and in many cases they are specific to
one DBMS type. The most widely supported standard database language is SQL, which was
developed for the relational data model and combines the roles of DDL, DML, and
query language.
The database concept
The database concept has evolved since the 1960s to ease increasing difficulties in
designing, building, and maintaining complex information systems (typically with many
concurrent end-users, and with diverse, large amounts of data). It has evolved together with
the evolution of database management systems (DBMSs), which enable the effective
handling of databases. Though the terms database and DBMS define different entities, they
are inseparable: A database's properties are determined by its supporting DBMS and vice-
versa. The Oxford English Dictionary cites a 1962 technical report as the first to use the
term "data-base." With progress in processor, computer memory, computer storage, and
computer network technology, the sizes, capabilities, and performance of databases and
their respective DBMSs have grown by orders of magnitude. For decades it
has been unlikely that a complex information system can be built effectively without a
proper database supported by a DBMS. The utilization of databases is now spread to such a
wide degree that virtually every technology and product relies on databases and DBMSs for
its development and commercialization, or even may have such embedded in it. Also,
organizations and companies, from small to large, heavily depend on databases for their
operations.
No widely accepted exact definition of a DBMS exists. However, a system needs to provide
considerable functionality to qualify as a DBMS, and its supported data collection
needs to meet respective usability requirements to qualify as a database. Thus, a database
and its supporting DBMS are defined here by a set of general requirements listed below.
Virtually all existing mature DBMS products meet these requirements to a great extent,
while less mature products either meet them or are converging toward meeting them.
Parts of a database
A database has four major parts: tables, forms, reports, and queries.
Tables - Where all the information and data is stored within a database.
Forms - Forms are used to enter, edit, or delete data in a database, one record at a time.
Reports - A report produces a paper copy of the data contained within a database.
Queries and Sorts - Queries are a special database tool that allows you to search a database
for specific records based on certain information.
Definition of a key
A key is a set of columns that can be used to identify or access a particular row or rows. The
key is identified in the description of a table, index, or referential constraint. The same
column can be part of more than one key.
A unique key is a key that is constrained so that no two of its values are equal. The columns
of a unique key cannot contain NULL values. For example, an employee number column can
be defined as a unique key, because each value in the column identifies only one employee.
No two employees can have the same employee number.
The mechanism used to enforce the uniqueness of the key is called a unique index. The
unique index of a table is a column, or an ordered collection of columns, for which each
value identifies (functionally determines) a unique row. A unique index can contain NULL
values.
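As a concrete illustration, the following sketch uses SQLite (which ships with Python) to create a unique index on an employee-number column; the table and column names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE employee (emp_no INTEGER NOT NULL, name TEXT)")
# The unique index is the mechanism that enforces the unique key.
conn.execute("CREATE UNIQUE INDEX idx_emp_no ON employee(emp_no)")

conn.execute("INSERT INTO employee VALUES (1001, 'Asha')")
try:
    # A second employee with the same number violates the unique key.
    conn.execute("INSERT INTO employee VALUES (1001, 'Ravi')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```

Note that NULL handling varies by DBMS; SQLite, for instance, permits multiple NULLs under a unique index.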
Primary Key
The primary key of a relational table uniquely identifies each record in the table. It can
either be a normal attribute that is guaranteed to be unique (such as Social Security Number
in a table with no more than one record per person) or it can be generated by the DBMS
(such as a globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may
consist of a single attribute or multiple attributes in combination.
Examples:
Imagine we have a STUDENTS table that contains a record for each student at a university.
The student's unique student ID number would be a good choice for a primary key in the
STUDENTS table. The student's first and last name would not be a good choice, as there is
always the chance that more than one student might have the same name.
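A sketch of the STUDENTS example in SQLite (the column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,  -- uniquely identifies each record
        first_name TEXT,
        last_name  TEXT
    )
""")
# Two students may share a name...
conn.execute("INSERT INTO students VALUES (1, 'Priya', 'Shah')")
conn.execute("INSERT INTO students VALUES (2, 'Priya', 'Shah')")
try:
    # ...but never a student ID: the primary key rejects the duplicate.
    conn.execute("INSERT INTO students VALUES (1, 'Arjun', 'Rao')")
except sqlite3.IntegrityError:
    print("Duplicate primary key rejected")
```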
Secondary Key
A secondary key is defined on a field that you would like to have indexed for faster searches.
A secondary key is used to identify rows but is not usually unique, and a table can have
multiple secondary keys. The attributes used for a secondary key need not be the ones used
for a superkey; that is, a secondary key need not itself be a superkey.
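A secondary key can be sketched as an ordinary (non-unique) index. Here, assumed last names are indexed so searches on them are fast even though several rows share a value:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, last_name TEXT)")
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(1, "Shah"), (2, "Shah"), (3, "Rao")])

# The secondary key: a non-unique index on last_name for faster searches.
conn.execute("CREATE INDEX idx_last_name ON students(last_name)")

rows = conn.execute(
    "SELECT student_id FROM students WHERE last_name = 'Shah' ORDER BY student_id"
).fetchall()
print(rows)  # two rows share the same secondary-key value
```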
Database Design Methodologies
Database design
Database design is a structured approach that uses procedures, techniques, tools, and
documentation aids to support and facilitate the process of designing a database.
Critical Success Factors in Database Design:
1. Use diagrams to represent as much of the data models as possible.
2. Good documentation of additional data semantics.
3. Build a data dictionary to supplement the data model diagrams.
4. Be willing to repeat steps.
5. Work interactively with the users as much as possible.
6. Follow a structured methodology throughout the data modeling process.
7. Employ a data-driven approach.
8. Incorporate structural and integrity considerations into the data models.
9. Combine conceptualization, normalization, and transaction validation techniques into the
data modeling methodology.
3 main phases of database design:
I. Conceptual database design: Construct a model of the information used, independent of all
physical considerations.
II. Logical database design: Construct a model of the information used, based on a specific
data model (e.g. relational), but independent of a particular DBMS and other physical
considerations.
III. Physical database design: Produce a description of the implementation of the DB on
secondary storage; it describes the storage structures and access methods used to achieve
efficient access to the data.
Methodology Overview
Step 1 Build local conceptual data model for each user view
Step 2 Build and Validate Local Logical Data Model
Step 3 Build and Validate Global Logical Data Model
Step 4 Translate Global Logical Data Model for Target DBMS
Step 5 Design Physical Representation
Step 6 Design and Implement Security Mechanisms
Step 7 Monitor and Tune the Operational System
Methodology Overview - Conceptual Database Design
Step 1 Build local conceptual data model for each user view
Step 1.1 Identify entity types
Step 1.2 Identify relationship types
Step 1.3 Identify and associate attributes with entity or relationship types
Step 1.4 Determine attribute domains
Step 1.5 Determine candidate and primary key attributes
Step 1.6 Specialize/generalize entity types (optional step)
Step 1.7 Draw Entity–Relationship diagram
Step 1.8 Review Local Conceptual Data Model with User
Methodology Overview – Logical Database Design for Relational Model
Step 2 Build and Validate Local Logical Data Model
Step 2.1 Map Local Conceptual Data Model to Local Logical Data Model
Step 2.2 Derive Relations from Local Logical Data Model
Step 2.3 Validate Model using Normalization
Step 2.4 Validate Model against User Transactions
Step 2.5 Draw Entity-Relationship Diagram
Step 2.6 Define Integrity Constraints
Step 2.7 Review Local Logical Data Model with User
Step 3 Build and Validate Global Logical Data Model
Step 3.1 Merge Local Logical Data Models into Global Model
Step 3.2 Validate Global Logical Data Model
Step 3.3 Check for Future Growth
Step 3.4 Draw Final Entity-Relationship Diagram
Step 3.5 Review Global Logical Data Model with Users
Methodology Overview – Physical Database Design for Relational Databases
Step 4 Translate Global Logical Data Model for Target DBMS
Step 4.1 Design Base Relations for Target DBMS
Step 4.2 Design Integrity Rules for Target DBMS
Step 5 Design Physical Representation
Step 5.1 Analyze Transactions
Step 5.2 Choose File Organization
Step 5.3 Choose Secondary Indexes
Step 5.4 Consider the Introduction of Controlled Redundancy
Step 5.5 Estimate Disk Space
Step 6 Design and Implement Security Mechanisms
Step 6.1 Design User Views
Step 6.2 Design Access Rules
Step 7 Monitor and Tune the Operational System
Entity
An Entity is defined as a person, place, thing or concept that (a) is of interest to the
corporation, (b) is capable of being described in concrete terms, and (c) is relevant within
the context of the specific environment of the firm. An entity is something that has a distinct,
separate existence, although it need not be a material existence. Entities are used in system
development models to show, for example, how documents or orders are communicated and
processed.
Entities are represented on the diagram by labelled boxes.
Types of entities:
Strong Entity: Objects are represented by their attributes and, as objects are mutually
distinguishable, a subset of these attributes forms a primary key that uniquely identifies an
instance of the entity. Entity types that have primary keys are called strong entities.
Weak Entity: Instances of a weak entity set associated with the same instance of a strong
entity must be distinguishable from each other by a subset of the attributes of the weak
entity. This subset of attributes is called the discriminator. Weak entities do not have a
primary key of their own.
Entity Set: An entity set is a set of entities of the same type that share the same properties,
or attributes. For example, all software engineers working in the department involved in the
Internet projects can be defined as the entity set InternetGroup. The individual entities that
constitute a set are called the extension of the entity set. Thus, all the individual software
engineers on the Internet projects are the extension of the entity set InternetGroup.
Entity sets need not be disjoint. For example, we can define an entity set Employee. An
employee may or may not be working on some Internet project. InternetGroup will contain
some entries that are also in the Employee entity set. Therefore, the entity sets Employee
and InternetGroup are not disjoint.
A database usually contains groups of entities that are similar. For example, employees of a
company share the same attributes. However, every employee entity has its own values for
each attribute. An entity type defines a set of entities that have the same attributes. A name
and a list of attributes describe each entity type.
An Entity Type is represented in ER diagrams as a rectangular box, and the corresponding
attributes are shown in ovals attached to the entity type by straight lines. An entity type is
basically the schema, intension, or structure for the set of entities that share the same
structure, whereas the individual entities of a particular entity type are collectively called
the entity set. The entity set is also called the extension of the entity type.
Entity Sub-Types
Sometimes it is useful to generalise about a group of entities which have similar
characteristics. For example, a VEHICLE is a generalisation of a CAR, a TRUCK and a
MOTORCYCLE.
Conversely, it can be useful to identify specialised sub-types of an entity. For example, a CAR
is a specialised type of VEHICLE, a TRUCK is a specialised type of VEHICLE and a
MOTORCYCLE is a specialised type of VEHICLE.
Attribute
An attribute is an aspect or quality of an entity which describes it or its actions. An attribute
may be a physical aspect such as height or weight, an aspect of the entity's location such as
place of residence or place of birth, or a quality such as the level of a particular skill or the
educational degree achieved.
An attribute is a property or descriptor of an entity, for example, Customer Name is an
attribute of the entity Customer. Each attribute will eventually be represented by one or
more entity attributes in the physical database structure.
Attributes Define Entities
Collectively, attributes define an entity. An attribute is meaningless by itself. For example,
date of birth takes its meaning from the context of the entity to which it is assigned, for
example, the date of birth of an employee. Attributes are not shown on the
Entity-Relationship Model but are recorded in the underlying data dictionary, which
contains the definitions of attributes for all entities and relationships identified in the
model. An attribute should not have facts recorded about it. In practice, however, there are
exceptions. For example, you might wish to show address as an attribute of Customer.
Address is not significant enough to be modelled as an entity in its own right and would
typically be shown as an attribute of Customer. However, at the detailed level, it may itself
have attributes such as an indicator for mailing address or home address. Attributes do not
have to be recognized and defined during the early stages of entity definition. Entity
definition is an iterative process, and it is unlikely that a completely satisfactory Entity-
Relationship Model will be obtained on the first iteration.
The following figure shows two entity types, Employee and Product, together with their
attribute lists. A few members of each entity type are shown.
Identifying Attributes
To identify entity attributes, examine:
1. All external entities from the Context Diagram,
2. The data flows passed by the external entities,
3. Existing automated data,
4. Each entity (i.e., generate a list of entity attributes that describe the entity).
Attributes Versus Data Elements
Attributes have a looser description than data elements. For instance, whereas an attribute
may have only a descriptive name, a data element needs:
1. a size and range,
2. a format and length,
3. an accurate and detailed description,
4. valid values,
5. defined edit rules.
Some attributes may be converted into many data elements. For instance, the attribute
"address" may become four data elements representing:
1. Street Address,
2. City/Town,
3. State/Province,
4. Postal or Zip Code.
Additional data elements may also be defined as a result of customer requirements. For
example, the customer may require a list of all companies by county. For the purposes of
data modelling, attributes and data elements are often considered identical, because
attributes in the data model typically become data elements in the database.
Categories of Attributes
Attributes fall into three categories depending on the information that the attribute
captures:
1. Descriptive attributes provide facts intrinsic to each instance of the entity. For example,
the Salary of Employee or the Address of Customer.
2. Naming attributes provide facts about arbitrary labels and names carried by each
instance of an entity. For example, the Employee Name of Employee or the Employee ID of
Employee.
3. Referential attributes (i.e., foreign keys) provide facts which tie an instance of one entity
to an instance of another entity. For example, the Department Number of the Department to
which an Employee is assigned ties the Employee to the Department.
Attribute domains
A domain is a set of values for an attribute (i.e., the properties or characteristics of entities).
The value set conforms to a common definition for the domain (e.g., type, format, syntax,
meaning).
Specify domains in one of the following ways:
1. List all possible values (e.g., for an attribute named color, possible values are red,
green, blue).
2. Identify a source (e.g., procedures manual) that contains the valid values.
3. List an acceptable range of values for the domain (e.g., for an attribute named
weight, possible values range from one to five pounds).
4. Define a business rule that permits determination of the validity of a value assigned
to an attribute (e.g., discounts greater than five percent only apply to commercial
customers).
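The first and third ways of specifying a domain can be sketched as CHECK constraints in SQLite; the table and rules below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE parcel (
        parcel_id INTEGER PRIMARY KEY,
        color  TEXT CHECK (color IN ('red', 'green', 'blue')),  -- listed values
        weight REAL CHECK (weight BETWEEN 1 AND 5)              -- value range
    )
""")
conn.execute("INSERT INTO parcel VALUES (1, 'red', 2.5)")  # within both domains
try:
    conn.execute("INSERT INTO parcel VALUES (2, 'mauve', 2.5)")  # outside the domain
except sqlite3.IntegrityError:
    print("Value outside the color domain rejected")
```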
Derived Attributes
Derived attributes are attributes whose values are generated from other attributes using
calculations, algorithms or procedures. For example, Account Balance is derived by
subtracting Total Debit from Total Credit.
Generally, the specifications for calculating derived attributes are a concern of the
processing aspects of the information system (e.g., process model). Derived attributes may
be included in the data model if the rules for calculating the attribute values would
otherwise be lost.
Clearly indicate in the data model when an attribute is derived. Ensure that the rules needed
to derive or calculate the attribute value are captured in the model. Verify that all attributes
needed to calculate the derived attribute are present in the data model. Once the physical
data model is constructed, some derived attributes are added to the model to improve
performance of the system.
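The Account Balance example can be sketched as a small function, so the derivation rule lives in the processing logic rather than in stored data (the attribute names and figures are illustrative):

```python
def account_balance(total_credit: float, total_debit: float) -> float:
    """Derived attribute: Account Balance = Total Credit - Total Debit."""
    return total_credit - total_debit

# Hypothetical figures: credits of 1500.00 against debits of 425.50.
print(account_balance(1500.00, 425.50))  # 1074.5
```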
Relationships
Frequently, a meaningful relationship exists between two different types of entity.
For example:
EMPLOYEEs work in a DEPARTMENT
LAWYERs advise CLIENTs
EQUIPMENT is allocated to PROJECTs
TRUCK is a type of VEHICLE
There are potentially three types of relationship which can exist between two different
entities:
1. One-to-One Relationships
2. One-to-Many Relationships
3. Many-to-Many Relationships
One-to-One Relationships:
This type of relationship takes place when a single occurrence of an entity is related to just
one occurrence of a second entity.
For example, a ROOF covers one BUILDING; a BUILDING is covered by one ROOF.
A One-to-One relationship is shown on the diagram by a line connecting the two entities.
One-to-Many Relationships:
This type of relationship takes place when a single occurrence of an entity is related to
many occurrences of a second entity.
For example, an EMPLOYEE works in one DEPARTMENT; a DEPARTMENT has many
EMPLOYEEs. A One-to-Many relationship is shown on the diagram by a line connecting the
two entities with a crow's foot symbol denoting the "many" end of the relationship.
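A one-to-many relationship is typically implemented with a foreign key on the "many" side. A sketch in SQLite, with invented names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        dept_id INTEGER REFERENCES department(dept_id)  -- points at the "one" side
    )
""")
conn.execute("INSERT INTO department VALUES (10, 'Sales')")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                 [(1, "Meera", 10), (2, "Vikram", 10)])  # many employees, one dept
n = conn.execute("SELECT COUNT(*) FROM employee WHERE dept_id = 10").fetchone()[0]
print(n)  # 2
```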
Many-to-Many Relationships:
This type of relationship takes place when many occurrences of an entity are related to
many occurrences of a second entity.
For example, EQUIPMENT is allocated to many PROJECTs; a PROJECT is allocated many
items of EQUIPMENT.
A Many-to-Many relationship is shown on the diagram by a line connecting the two entities
with a crow's foot symbol at each end of the line.
Eliminating Many-to-Many Relationships
Many-to-many relationships in an entity relationship diagram tend to conceal areas of poor
understanding. Almost always, a many-to-many relationship conceals a hidden entity.
For this reason many-to-many relationships are eliminated by identifying and adding the
hidden entity to the model. The new entity is related to the two original entities by a pair of
one-to-many relationships.
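This resolution can be sketched in SQLite: the hidden entity (here called allocation, an invented name) carries a pair of foreign keys, turning one many-to-many relationship into two one-to-many relationships:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE equipment (equip_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE project (proj_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE allocation (            -- the hidden entity
        equip_id INTEGER REFERENCES equipment(equip_id),
        proj_id  INTEGER REFERENCES project(proj_id),
        PRIMARY KEY (equip_id, proj_id)  -- one row per equipment/project pairing
    )
""")
conn.execute("INSERT INTO equipment VALUES (1, 'Crane')")
conn.executemany("INSERT INTO project VALUES (?, ?)",
                 [(100, "Bridge"), (200, "Tower")])
# The same item of EQUIPMENT is allocated to two PROJECTs.
conn.executemany("INSERT INTO allocation VALUES (?, ?)", [(1, 100), (1, 200)])
n = conn.execute("SELECT COUNT(*) FROM allocation WHERE equip_id = 1").fetchone()[0]
print(n)  # 2
```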
Choosing the Right Relationship
Depending on the purpose of the model, the length of time involved, and the definition of
the entities participating in the relationship, the type of relationship between two entities
can change.
For example, if the definition of a ROOF entity is an apex or flat surface covering a building,
then a BUILDING is covered by many ROOFs.
Likewise, over a period of time, an EMPLOYEE works in many DEPARTMENTs.
Optional and Mandatory Relationships
Sometimes the participation of an entity in a relationship is optional. At other times it is
mandatory.
For example, EQUIPMENT may be allocated to PROJECTs, an EMPLOYEE must work in a
DEPARTMENT, and a DEPARTMENT may have EMPLOYEEs working in it.
Optional participation in a relationship is shown by a circle placed on the relationship line
next to the entity that optionally participates in the relationship.
Mandatory participation is shown in a similar manner by placing a bar on the relationship
line.
Recursive Relationships
Relationships can exist between different occurrences of the same type of entity. For
example, a PERSON dances with another PERSON, an EMPLOYEE manages other
EMPLOYEEs, and a COMPONENT is made from other COMPONENTs.
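A recursive relationship can be sketched with a self-referencing foreign key and a self-join (SQLite in Python; the employee names and columns are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE employee (
    emp_id     INTEGER PRIMARY KEY,
    name       TEXT,
    manager_id INTEGER REFERENCES employee(emp_id)  -- recursive link to the same entity
)""")
conn.execute("INSERT INTO employee VALUES (1, 'Asha', NULL)")   # top-level manager
conn.execute("INSERT INTO employee VALUES (2, 'Ravi', 1)")
conn.execute("INSERT INTO employee VALUES (3, 'Meena', 1)")
# Self-join: pair each employee with his or her manager.
rows = conn.execute("""
    SELECT e.name, m.name
    FROM employee e JOIN employee m ON e.manager_id = m.emp_id
    ORDER BY e.name
""").fetchall()
print(rows)  # [('Meena', 'Asha'), ('Ravi', 'Asha')]
```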
ER Model
The whole purpose of ER modelling is to create an accurate reflection of the real world in a
database. The ER model doesn’t actually give us a database description. It gives us an
intermediate step from which it is easy to define a database. Let’s look at an example. (You
will see much more detail on these concepts in the rest of this chapter. For now just try to
understand the overall process.)
Suppose you are presented with the following situation and are told to create a database for
it:
Every department within our company is in only one division. Each division has more than
one department in it. We don’t have an upper limit on the number of departments that a
division can have. For example, the New Business Development---the one managed by
Mackenzie---and Higher Education departments are both in the Marketing division.
This is a fairly clear description of a situation. Many things that we understand about the
situation are left unsaid. For example: each division has a name, and that name is
unique within the company.
Deriving an E-R Diagram
An E-R diagram is used to represent the E-R model. It contains all known entities for the
given scenario, their attributes, identifiers, and the relationships that exist among the
entities. Deriving the entities, their attributes, identifiers, and the relationships among them
is one of the most important activities carried out during database development, so it is
imperative that it is done accurately.
From developing a number of databases, I have come up with a technique that allows for
rapid database design and development. This technique was initially taught to me by
consultants from the CRC SOGEMA INC. group in 1998, and a similar technique is described
in the IBM Informix Database Design and Implementation Guide (2005).
Steps
1. Discover the entities and the basic attributes for each entity. This process usually involves
performing a careful analysis of the processes and documentation used by the users in
the problem domain. Make a list of each of the entities discovered along with the basic
attributes of each entity.
2. Identify the relationships among the entities. Using an Entity-Entity Matrix (E-E Matrix)
is one of the fastest methods for deriving all the relationships among all the entities. This
matrix consists of an equal number of rows and columns, with each entity discovered
heading a row and a column. The intersection of the rows and columns represents
relationships that may exist between the entities – it is possible that no relationship will
exist between entities, that more than one relationship will exist between entities, and
that an entity will be related to itself.
3. Use the information from the E-E matrix to construct a simple initial E-R diagram. This
diagram will contain the entities and their basic attributes (and possible identifiers), and
the relationships indicated in the E-E matrix.
4. Derive optionalities and cardinalities for each relationship in the initial E-R diagram and
write assertions for each relationship. This records the assertions that can be made
about the data (relationships) in the user’s domain. Optionality says what can and must
happen in a relationship, and cardinality indicates the number of entity occurrences in a
relationship:
Optionality “0” – Can (optional)
Optionality “1” – Must (obligatory)
Cardinality “1” – Only one
Cardinality “N” – Many/At least one
5. Construct a detailed E-R diagram that includes all the derived information. This diagram
will include optionalities, cardinalities, attributes, identifiers, and relationships. Unique
identifiers are indicated with an asterisk (*), and non-unique identifiers with a plus (+).
Implementing the E-R Model – The Relational Model
While an E-R diagram displays the logical nature of the data that exists in the user’s
domain, the relational model shows how this data will be represented in a Relational
Database Management System (RDBMS). This model is important because the most
common and popular database management systems in use today are relational, and
relational model diagrams can be directly implemented on any RDBMS.
The relational model is similar to the E-R model but is based on a branch of mathematics
called relational algebra, and as such, there are strict definitions and rules regarding the
elements of this model. Here are some of these definitions and rules:
o A relation is defined as a two-dimensional table that contains rows and columns. Rows
are instances of the entity, and columns are attributes of the entity.
o No two rows of the table may be identical – each row must be unique (in its
contents), and the order of the rows is not important.
o Each column must have a unique name, and the order of the columns is not important.
They must also contain single values – multiple values and repeating groups are not
allowed.
o A key is a group of one or more attributes that uniquely identifies a row – each table
must have a key. Since each row in the table is unique, the key is what is used to
distinguish one row from another. It can consist of a single attribute, or it can be
composite, comprising more than one attribute. (Kroenke D, 1999, pp. 113-117)
o Every column (attribute) in a table must depend solely on the primary key. It should not
depend on any other column in the table, or on part of the primary key if it is a
composite key.
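These rules can be sketched with SQLite in Python (the ENROLMENT relation and its columns are illustrative): a composite key enforces row uniqueness, and each attribute holds a single value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE enrolment (
    student_id INTEGER,
    course_id  INTEGER,
    grade      TEXT,                      -- single-valued attribute
    PRIMARY KEY (student_id, course_id)   -- composite key
)""")
conn.execute("INSERT INTO enrolment VALUES (1, 101, 'A')")
try:
    # A second row with the same key value violates row uniqueness.
    conn.execute("INSERT INTO enrolment VALUES (1, 101, 'B')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
remaining = conn.execute("SELECT COUNT(*) FROM enrolment").fetchone()[0]
print(remaining)  # 1
```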
Database management systems
A database management system (DBMS) is a software package with computer programs
that control the creation, maintenance, and use of a database. It allows organizations to
conveniently develop databases for various applications, with the work carried out by
database administrators (DBAs) and other specialists.
and other specialists. A database is an integrated collection of data records, files, and other
database objects. A DBMS allows different user application programs to concurrently access
the same database. DBMSs may use a variety of database models, such as the relational
model or object model, to conveniently describe and support applications. A DBMS typically
supports query languages (in effect high-level programming languages dedicated to
databases) that considerably simplify the writing of database application programs.
Database languages also simplify the database organization as well as retrieving and
presenting information from it. A DBMS provides facilities for controlling data access,
enforcing data integrity, managing concurrency control, recovering the database after
failures and restoring it from backup files, as well as maintaining database security. The
topics discussed briefly are as follows:
Overview
A DBMS is a set of software programs that controls the system organization, storage,
management, and retrieval of data in a database. DBMSs are categorized according to their
data structures or types. The DBMS accepts requests for data from an application program
and instructs the operating system to transfer the appropriate data. The queries and
responses must be submitted and received according to a format that conforms to one or
more applicable protocols. When a DBMS is used, information systems can be changed more
easily as the organization's information requirements change. New categories of data can be
added to the database without disruption to the existing system.
Database servers are dedicated computers that hold the actual databases and run only the
DBMS and related software. Database servers are usually multiprocessor computers, with
generous memory and RAID disk arrays used for stable storage. Hardware database
accelerators, connected to one or more servers via a high-speed channel, are also used in
large volume transaction processing environments. DBMSs are found at the heart of most
database applications. DBMSs may be built around a custom multitasking kernel with
built-in networking support, but modern DBMSs typically rely on a standard operating
system to provide these functions.
History
Databases have been in use since the earliest days of electronic computing. Unlike modern
systems, which can be applied to widely different databases and needs, the vast majority of
older systems were tightly linked to custom databases in order to gain speed at the
expense of flexibility. Originally DBMSs were found only in large organizations with the
computer hardware needed to support large data sets.
1960s Navigational DBMS
As computers grew in speed and capability, a number of general-purpose database systems
emerged; by the mid-1960s there were a number of such systems in commercial use.
Interest in a standard began to grow, and Charles Bachman, author of one such product, the
Integrated Data Store (IDS), founded the "Database Task Group" within CODASYL, the group
responsible for the creation and standardization of COBOL. In 1971 they delivered their
standard, which generally became known as the "Codasyl approach", and soon a number of
commercial products based on this approach were made available.
The Codasyl approach was based on the "manual" navigation of a linked data set which was
formed into a large network. When the database was first opened, the program was handed
back a link to the first record in the database, which also contained pointers to other pieces
of data. To find any particular record the programmer had to step through these pointers
one at a time until the required record was returned. Simple queries like "find all the people
in India" required the program to walk the entire data set and collect the matching results
one by one. There was, essentially, no concept of "find" or "search". This may sound like a
serious limitation today, but in an era when most data was stored on magnetic tape such
operations were too expensive to contemplate anyway.
IBM also had their own DBMS system in 1968, known as IMS. IMS was a development of
software written for the Apollo program on the System/360. IMS was generally similar in
concept to Codasyl, but used a strict hierarchy for its model of data navigation instead of
Codasyl's network model. Both concepts later became known as navigational databases due
to the way data was accessed, and Bachman's 1973 Turing Award lecture was The
Programmer as Navigator. IMS is classified as a hierarchical database. IDMS and
CINCOM's TOTAL database are classified as network databases.
1970s relational DBMS
Edgar Codd worked at IBM in San Jose, California, in one of their offshoot offices that was
primarily involved in the development of hard disk systems. He was unhappy with the
navigational model of the Codasyl approach, notably the lack of a "search" facility. In 1970,
he wrote a number of papers that outlined a new approach to database construction that
eventually culminated in the groundbreaking A Relational Model of Data for Large Shared
Data Banks.
In this paper, he described a new system for storing and working with large databases.
Instead of records being stored in some sort of linked list of free-form records as in Codasyl,
Codd's idea was to use a "table" of fixed-length records. A linked-list system would be very
inefficient when storing "sparse" databases where some of the data for any one record
could be left empty. The relational model solved this by splitting the data into a series of
normalized tables, with optional elements being moved out of the main table to where they
would take up room only if needed.
In the relational model, related records are linked together with a "key".
For instance, a common use of a database system is to track information about users, their
name, login information, various addresses and phone numbers. In the navigational
approach all of these data would be placed in a single record, and unused items would
simply not be placed in the database. In the relational approach, the data would be
normalized into a user table, an address table and a phone number table (for instance).
Records would be created in these optional tables only if the address or phone numbers
were actually provided.
Linking the information back together is the key to this system. In the relational model,
some bit of information was used as a “key”, uniquely defining a particular record. When
information was being collected about a user, information stored in the optional (or related)
tables would be found by searching for this key. For instance, if the login name of a user is
unique, addresses and phone numbers for that user would be recorded with the login name
as its key. This "re-linking" of related data back into a single collection is something that
traditional computer languages are not designed for.
Just as the navigational approach would require programs to loop in order to collect
records, the relational approach would require loops to collect information about any one
record. Codd's solution to the necessary looping was a set-oriented language, a suggestion
that would later spawn the ubiquitous SQL. Using a branch of mathematics known as tuple
calculus, he demonstrated that such a system could support all the operations of normal
databases (inserting, updating etc.) as well as providing a simple system for finding and
returning sets of data in a single operation.
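The normalization-and-relinking idea described above can be sketched with SQLite in Python (table and column names are illustrative): a user table and a phone table, linked by the login name as key, are re-joined in a single set-oriented statement rather than a programmer-written loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user  (login TEXT PRIMARY KEY, name TEXT);
-- Optional data lives in its own table; rows exist only if provided.
CREATE TABLE phone (login TEXT REFERENCES user(login), number TEXT);
""")
conn.execute("INSERT INTO user VALUES ('jdoe', 'J. Doe')")
conn.execute("INSERT INTO phone VALUES ('jdoe', '555-0101')")
conn.execute("INSERT INTO phone VALUES ('jdoe', '555-0102')")
# One set-oriented statement re-links the related rows; no explicit loop.
rows = conn.execute("""
    SELECT u.name, p.number
    FROM user u JOIN phone p ON p.login = u.login
    ORDER BY p.number
""").fetchall()
print(rows)  # [('J. Doe', '555-0101'), ('J. Doe', '555-0102')]
```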
Codd's paper was picked up by two people at Berkeley, Eugene Wong and Michael
Stonebraker. They started a project known as INGRES using funding that had already been
allocated for a geographical database project, using student programmers to produce code.
Beginning in 1973, INGRES delivered its first test products which were generally ready for
widespread use in 1979. During this time, a number of people had moved "through" the
group — perhaps as many as 30 people worked on the project, about five at a time. INGRES
was similar to System R in a number of ways, including the use of a "language" for data
access, known as QUEL — QUEL was in fact relational, having been based on Codd's own
Alpha language, but has since been corrupted to follow SQL, thus violating much the same
concepts of the relational model as SQL itself.
IBM itself did one test implementation of the relational model, PRTV, and a production one,
Business System 12, both now discontinued. Honeywell did MRDS for Multics, and now
there are two new implementations: Alphora Dataphor and Rel. All other DBMS
implementations usually called relational are actually SQL DBMSs. In 1968, the University of
Michigan began development of the Micro DBMS. It was used to manage very large data
sets by the US Department of Labor, the Environmental Protection Agency, and researchers
from the University of Alberta, the University of Michigan, and Wayne State University. It
ran on mainframe computers using the Michigan Terminal System. The system remained in
production
until 1996.
Late-1970s SQL DBMS
IBM started working on a prototype system loosely based on Codd's concepts as System R in
the early 1970s. The first version was ready in 1974/5, and work then started on multi-
table systems in which the data could be split so that all of the data for a record (some of
which is optional) did not have to be stored in a single large "chunk". Subsequent multi-user
versions were tested by customers in 1978 and 1979, by which time a standardized query
language – SQL – had been added. Codd's ideas were establishing themselves as both
workable and superior to Codasyl, pushing IBM to develop a true production version of
System R, known as SQL/DS, and, later, Database 2 (DB2).
Many of the people involved with INGRES became convinced of the future commercial
success of such systems, and formed their own companies to commercialize the work but
with an SQL interface. Sybase, Informix, NonStop SQL and eventually Ingres itself were all
being sold as offshoots to the original INGRES product in the 1980s. Even Microsoft SQL
Server is actually a re-built version of Sybase, and thus, INGRES. Only Larry Ellison's Oracle
started from a different chain, based on IBM's papers on System R, and beat IBM to market
when the first version was released in 1978.
Stonebraker went on to apply the lessons from INGRES to develop a new database, Postgres,
which is now known as PostgreSQL. PostgreSQL is often used for global mission critical
applications (the .org and .info domain name registries use it as their primary data store, as
do many large companies and financial institutions).
In Sweden, Codd's paper was also read and Mimer SQL was developed from the mid-70s at
Uppsala University. In 1984, this project was consolidated into an independent enterprise.
In the early 1980s, Mimer introduced transaction handling for high robustness in
applications, an idea that was subsequently implemented in most other DBMSs.
1980s object-oriented databases
The 1980s, along with a rise in object oriented programming, saw a growth in how data in
various databases were handled. Programmers and designers began to treat the data in
their databases as objects. That is to say that if a person's data were in a database, that
person's attributes, such as their address, phone number, and age, were now considered to
belong to that person instead of being extraneous data. This allows relations between
data to be expressed as relations between objects and their attributes, rather than
between individual fields.
Another big game changer for databases in the 1980s was the focus on increasing reliability
and access speeds. In 1989, two professors from the University of Wisconsin at Madison
published an article at an ACM associated conference outlining their methods on increasing
database performance. The idea was to replicate specific important, and often queried
information, and store it in a smaller temporary database that linked these key features
back to the main database. This meant that a query could search the smaller database much
quicker, rather than searching the entire dataset. This eventually led to the practice of
indexing, which is used by almost every operating system, from Windows to the system that
operates Apple iPod devices.
21st century NoSQL databases
In the 21st century a new trend of NoSQL databases was started. Those non-relational
databases are significantly different from the classic relational databases. They often do not
require fixed table schemas, avoid join operations by storing denormalized data, and are
designed to scale horizontally. Most of them can be classified as either key-value stores or
document-oriented databases.
In recent years there has been a high demand for massively distributed databases with high
partition tolerance, but according to the CAP theorem it is impossible for a distributed
system to simultaneously provide consistency, availability, and partition tolerance
guarantees. A distributed system can satisfy any two of these guarantees at the same time,
but not all three. For that reason, many NoSQL databases use what is called eventual
consistency to provide both availability and partition tolerance guarantees with a maximum
achievable level of data consistency.
The most popular software in this category includes memcached, Redis, MongoDB,
CouchDB, Apache Cassandra, and HBase.
Current trends
In 1998, database management was in need of a new style of database to solve its
then-current problems. Researchers realized that the old trends of database management
were becoming too complex, and that there was a need for automated configuration and
management. Surajit Chaudhuri, Gerhard Weikum, and Michael Stonebraker were pioneers
who dramatically influenced thinking about database management systems. They believed
that database management needed a more modular approach, and that existing systems
demanded too much specification from users. Since this new development process emerged,
there are more possibilities: database management is no longer limited to "monolithic
entities". Many solutions have been developed to satisfy the individual needs of users, and
the development of numerous database options has created flexibility in database
management.
There are several ways database management has affected the field of technology. As
organizations have grown in size, their demand for directory services has grown with them;
businesses use directory services that provide prompted searches for company information.
Mobile
devices are able to store more than just the contact information of users, and can cache and
display a large amount of information on smaller displays. Search engine queries are able to
locate data within the World Wide Web. Retailers have also benefited from the
developments with data warehousing, recording customer transactions. Online transactions
have become tremendously popular for e-business. Consumers and businesses are able to
make payments securely through some company websites.
Components
DBMS engine accepts logical requests from various other DBMS subsystems, converts
them into physical equivalents, and actually accesses the database and data dictionary
as they exist on a storage device.
Data definition subsystem helps the user create and maintain the data dictionary and
define the structure of the files in a database.
Data manipulation subsystem helps the user to add, change, and delete information in
a database and query it for valuable information. Software tools within the data
manipulation subsystem are most often the primary interface between the user and the
information contained in a database. It allows users to specify their logical information
requirements.
Application generation subsystem contains facilities to help users develop
transaction-intensive applications, which usually require that a user perform a detailed
series of tasks to process a transaction. It provides easy-to-use data-entry screens,
programming languages, and interfaces.
Data administration subsystem helps users manage the overall database
environment by providing facilities for backup and recovery, security management,
query optimization, concurrency control, and change management.
Modeling language
A modeling language is a data modeling language used to define the schema of each
database hosted in the DBMS, according to the DBMS's database model. Database
management systems are designed to use one of five database structures to provide simple
access to information stored in databases. The five database structures are:
• the hierarchical model,
• the network model,
• the relational model,
• the multidimensional model, and
• the object model.
Inverted lists and other methods are also used. A given database management system may
provide one or more of the five models. The optimal structure depends on the natural
organization of the application's data, and on the application's requirements, which include
transaction rate (speed), reliability, maintainability, scalability, and cost.
The hierarchical structure was used in early mainframe DBMSs. Records’ relationships form
a treelike model. This structure is simple but inflexible, because each relationship is
confined to one-to-many. IBM’s IMS system and the RDM Mobile product are examples of a
hierarchical database system with multiple hierarchies over the same data. RDM Mobile is a
newly designed embedded database for a mobile computer system. The hierarchical
structure is used primarily today for storing geographic information and file systems.
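A hierarchical record set is a tree, so the file-system example above can be sketched in plain Python (names and layout are illustrative): every child has exactly one parent, and access follows the tree downward.

```python
# A toy hierarchy: each node has one parent, many children (one-to-many only).
filesystem = {
    "name": "/",
    "children": [
        {"name": "home", "children": [
            {"name": "report.txt", "children": []}]},
        {"name": "etc", "children": []},
    ],
}

def walk(node, path=""):
    """Depth-first traversal: the natural access pattern for a hierarchy."""
    full = path + node["name"] if path else node["name"]
    yield full
    for child in node["children"]:
        yield from walk(child, full.rstrip("/") + "/")

paths = list(walk(filesystem))
print(paths)  # ['/', '/home', '/home/report.txt', '/etc']
```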
The network structure consists of more complex relationships. Unlike the hierarchical
structure, it can relate many records and access them by following one of several paths.
In other words, this structure allows for many-to-many relationships.
The relational structure is the most commonly used today. It is used by mainframe,
midrange and microcomputer systems. It uses two-dimensional rows and columns to store
data. The tables of records can be connected by common key values. While working for IBM,
E.F. Codd designed this structure in 1970. The model is not easy for the end user to run
queries with because it may require a complex combination of many tables.
The multidimensional structure is similar to the relational model. The dimensions of the
cube-like model have data relating to elements in each cell. This structure gives a
spreadsheet-like view of data. This structure is easy to maintain because records are stored
as fundamental attributes—in the same way they are viewed—and the structure is easy to
understand. Its high performance has made it the most popular database structure when it
comes to enabling online analytical processing (OLAP).
The object-oriented structure has the ability to handle graphics, pictures, voice, and text
data types without difficulty, unlike the other database structures. This structure is
popular for multimedia Web-based applications. It was designed to work with object-
oriented programming languages such as Java.
The dominant model in use today is the ad hoc one embedded in SQL, despite the objections
of purists who believe this model is a corruption of the relational model since it violates
several fundamental principles for the sake of practicality and performance. Many DBMSs
also support the Open Database Connectivity API that supports a standard way for
programmers to access the DBMS.
Before the database management approach, organizations relied on file processing systems
to organize, store, and process data files. End users criticized file processing because the
data was stored in many different files, each organized in a different way. Each file was
specialized for use with a specific application. File processing was bulky, costly, and
inflexible when it came to supplying needed data accurately and promptly. Data
redundancy was an issue with file processing systems because the independent data files
held duplicate data, so when updates were needed each separate file had to be updated.
Another issue was the lack of data integration, since the data depended on other data to
organize and store it. Lastly, there was no consistency or standardization of the data in a
file processing system, which made maintenance difficult. For these reasons, the database
management approach was developed.
Data structure
Data structures (fields, records, files, and objects) are optimized to deal with very large
amounts of data stored on a permanent data storage device (which implies relatively slow
access compared to volatile main memory).
Database query language
A database query language and report writer allow users to interactively interrogate the
database, analyze its data, and update it according to the user's privileges on the data. It also
controls the security of the database. Data security prevents unauthorized users from
viewing or updating the database. Using passwords, users are allowed access to the entire
database or subsets of it called subschemas. For example, an employee database can contain
all the data about an individual employee, but one group of users may be authorized to view
only payroll data, while others are allowed access to only work history and medical data.
If the DBMS provides a way to interactively enter and update the database, as well as
interrogate it, this capability allows for managing personal databases. However, it may not
leave an audit trail of actions or provide the kinds of controls necessary in a multi-user
organization. These controls are only available when a set of application programs are
customized for each data entry and updating function.
Transaction mechanism
A database transaction mechanism ideally guarantees ACID properties in order to ensure
data integrity despite concurrent user accesses (concurrency control), and faults (fault
tolerance). It also maintains the integrity of the data in the database. The DBMS can
maintain the integrity of the database by not allowing more than one user to update the
same record at the same time. The DBMS can help prevent duplicate records via unique
index constraints; for example, no two customers with the same customer numbers (key
fields) can be entered into the database.
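Both points can be sketched with SQLite in Python (customer numbers and names are illustrative): a unique key rejects duplicate customer numbers, and a transaction rolls back as a whole when any statement in it fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_no INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer VALUES (100, 'Mehta')")
conn.commit()
try:
    with conn:  # a transaction: commits on success, rolls back on error
        conn.execute("INSERT INTO customer VALUES (101, 'Rao')")
        # Same customer number as an existing row -> IntegrityError, and the
        # whole transaction (including the 'Rao' row) is rolled back.
        conn.execute("INSERT INTO customer VALUES (100, 'Duplicate')")
except sqlite3.IntegrityError:
    pass
count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(count)  # 1
```

The all-or-nothing rollback is the atomicity of ACID; the rejected duplicate key preserves consistency.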
External, logical and internal view
Traditional view of data
A DBMS provides the ability for many different users to share data and process resources.
As there can be many different users, there are many different database needs. The question
is: How can a single, unified database meet varying requirements of so many users?
A DBMS minimizes these problems by providing three views of the database data: an
external view (or user view), logical view (or conceptual view) and physical (or internal)
view. The user’s view of a database program represents data in a format that is meaningful
to a user and to the software programs that process those data.
One strength of a DBMS is that while there is typically only one conceptual (or logical) and
physical (or internal) view of the data, there can be an endless number of different external
views. This feature allows users to see database information in a more business-related way
rather than from a technical, processing viewpoint. Thus the external view refers to the way
the user views the data, and the physical view refers to the way the data are physically
stored and processed.
Features and capabilities
Alternatively, and especially in connection with the relational model of database
management, the relation between attributes drawn from a specified set of domains can be
seen as being primary. For instance, the database might indicate that a car that was
originally "red" might fade to "pink" in time, provided it was of some particular "make" with
an inferior paint job. Such higher-arity relationships provide information on all of the
underlying domains at the same time, with none of them being privileged above the others.
Simple definition
A database management system is the system in which related data is stored in an efficient
and compact manner. "Efficient" means that the data which is stored in the DBMS can be
accessed quickly and "compact" means that the data takes up very little space in the
computer's memory. The phrase "related data" means that the data stored pertains to a
particular topic.
Specialized databases have existed for scientific, imaging, document storage and like uses.
Functionality drawn from such applications has begun appearing in mainstream DBMSs as
well. However, the main focus, at least when aimed at the commercial data processing
market, is still on descriptive attributes on repetitive record structures.
Thus, the DBMSs of today roll together frequently needed services or features of attribute
management. By externalizing such functionality to the DBMS, applications effectively share
code with each other and are relieved of much internal complexity. Features commonly
offered by database management systems include:
Query ability
Querying is the process of requesting attribute information from various
perspectives and combinations of factors. Example: "How many 2-door cars in Texas
are green?" A database query language and report writer allow users to
interactively interrogate the database, analyze its data, and update it according to
the user's privileges on the data.
Backup and replication
Copies of attributes need to be made regularly in case primary disks or other
equipment fails. A periodic copy of attributes may also be created for a distant
organization that cannot readily access the original. DBMSs usually provide utilities
to facilitate the process of extracting and disseminating attribute sets. When data is
replicated between database servers, so that the information remains consistent
throughout the database system and users cannot tell or even know which server in
the DBMS they are using, the system is said to exhibit replication transparency.
Rule enforcement
Often one wants to apply rules to attributes so that the attributes are clean and
reliable. For example, we may have a rule that says each car can have only one
engine associated with it (identified by Engine Number). If somebody tries to
associate a second engine with a given car, we want the DBMS to deny such a
request and display an error message. However, with changes in the model
specification such as, in this example, hybrid gas-electric cars, rules may need to
change. Ideally such rules should be able to be added and removed as needed
without significant data layout redesign.
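A minimal sketch of such a rule as a declarative constraint, again using sqlite3; the `engines` table and the encoding of the one-engine-per-car rule as UNIQUE are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# UNIQUE(car_id) encodes the rule "each car has at most one engine".
conn.execute("""CREATE TABLE engines (
    engine_no TEXT PRIMARY KEY,
    car_id    INTEGER NOT NULL UNIQUE)""")
conn.execute("INSERT INTO engines VALUES ('E100', 1)")
try:
    conn.execute("INSERT INTO engines VALUES ('E200', 1)")  # second engine for car 1
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

If the model later changes (e.g. hybrid gas-electric cars with two power units), only the constraint has to be redefined, not the data layout.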
Security
For security reasons, it is desirable to limit who can see or change specific attributes
or groups of attributes. This may be managed directly on an individual basis, or by
the assignment of individuals and privileges to groups, or (in the most elaborate
models) through the assignment of individuals and groups to roles which are then
granted entitlements.
Computation
Common computations requested on attributes are counting, summing, averaging,
sorting, grouping, cross-referencing, and so on. Rather than have each computer
application implement these from scratch, they can rely on the DBMS to supply such
calculations.
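For instance, counting, summing, averaging and grouping can all be delegated to the engine in a single query; a sketch with sqlite3 and invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 75.0)])
# Counting, summing, averaging and grouping done by the DBMS, not the application:
rows = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 2, 150.0, 75.0), ('west', 1, 75.0, 75.0)]
```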
Change and access logging
This describes who accessed which attributes, what was changed, and when it was
changed. Logging services allow this by keeping a record of access occurrences and
changes.
Automated optimization
For frequently occurring usage patterns or requests, some DBMS can adjust
themselves to improve the speed of those interactions. In some cases the DBMS will
merely provide tools to monitor performance, allowing a human expert to make the
necessary adjustments after reviewing the statistics collected.
Meta-data repository
Metadata is data describing data. For example, a listing that describes what attributes are
allowed to be in data sets is called "meta-information".
Advanced DBMS
An example of an advanced DBMS is the Distributed Database Management System (DDBMS),
which manages a collection of data that logically belongs to the same system but is spread
over the sites of a computer network. The two defining aspects of a distributed database are
distribution and logical correlation:
Distribution: The fact that the data are not resident at the same site, so that we can
distinguish a distributed database from a single, centralized database.
Logical Correlation: The fact that the data have some properties which tie them together, so
that we can distinguish a distributed database from a set of local databases or files which
are resident at different sites of a computer network.
Access Control in Database Management Systems
Introduction
The use of database management systems results in a single data set on which all users and
applications in an organization work. On one side, the previously hard problems of
duplication, inconsistency and application dependence of data disappear. On the other
side, much more dangerous security problems appear. A brief introduction to them follows.
Designing a secure DBMS requires identifying the security risks and selecting the right
security policies (what the security system is supposed to do) and mechanisms (how we
are going to achieve it) to neutralize them.
A secure database system should satisfy three basic requirements on data protection:
security, integrity and availability. What do these terms mean?
Ensuring security - preventing, detecting and deterring improper disclosure of
information. This is especially important in strongly protected environments (e.g.
the army).
Ensuring integrity - preventing, detecting and deterring improper changes to
information. The proper functioning of any organization depends on proper operations
on proper data.
Ensuring availability - preventing improper denial of the services that the DBMS
provides.
Security Policies
Security policies are guidelines describing all (if possible) actions pertinent to the
information system. Logical access control belongs to this area, so security policies
should define the principles on which the design of a secure DBMS is based.
Generally, policies should answer the basic security questions. Policies can be divided
into two basic types - minimal privilege (e.g. the army) and maximal privilege (e.g. a
university). A system with such a policy is called closed or open, respectively.
For access control, the way access rules are administered must also be determined:
hierarchical decentralized - a central authorizer distributes responsibilities among
dependent sub-authorizers
ownership based - the owner of an object (its author) determines access to the object
cooperative authorization - authorization of special rights for special resources is
approved by all members of a predefined group
Security Mechanisms
Once the policies are defined, mechanisms that fulfil them can be selected.
Mechanisms are either external:
administrative controls
physical controls
or internal, i.e. part of the information system itself:
authentication - the user's identity is verified; this process is based on knowledge of
something, ownership of an object, or a physical characteristic of the user
authorization - the system answers only those queries the user is authorized for
(access control)
audit - composed of two phases: logging of actions in the system and reporting
of the logged information
Security Threat
This term has been used several times already. A security threat is any hostile agent which,
randomly or with the use of specialized techniques, can obtain or change information in the
information system. Random security threats are:
natural or accidental disasters - earthquake, water damage or fire. Both data and
hardware are damaged, which leads to integrity violation and denial of service.
errors and bugs in hardware and software - cause improper application of
security policies.
human errors - unintentional violations such as incorrect input or wrong use of
applications.
Intended security threats can be categorized according to their originator:
authorized users - abuse their privileges
hostile agents - various hostile programs: viruses, Trojan horses, back-doors
Requirements on DBMS Security
We now have a basic picture of information system security and can look at the concrete
aspects that DBMS security mechanisms should cover.
1. Protection from improper access - only authorized users should be granted access
to objects of the DBMS. This control should also be applicable to finer-grained
objects (record, attribute, value).
2. Protection from inference - inference of confidential information from available
data should be avoided. This regards mainly statistical DBMSs.
3. Database integrity - ensured partly by system controls of the DBMS (atomic
transactions) and various back-up and recovery procedures, and partly by
security procedures.
4. Operational data integrity - logical consistence of data during concurrent
transactions (concurrency manager), serializability and isolation of transactions
(locking techniques).
5. Semantic data integrity - ensuring that attribute values are in allowed ranges. This
is ensured with integrity constraints.
6. Accountability and auditing - there should be possibility to log all data accesses.
7. User authentication - there should be unambiguous identification of each DBMS
user. This is basis for all authorization mechanisms.
8. Management and protection of sensitive data - access should be granted only to
a narrow circle of users.
9. Multilevel security - data may be classified according to their sensitivity. Access
granting should then depend on that classification.
10. Confinement (subject isolation) - there is necessity to isolate subjects to avoid
uncontrolled data flow between programs (memory channels, covert channels).
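Item 5, semantic data integrity, is typically expressed with declarative integrity constraints such as CHECK clauses. A minimal sketch with sqlite3; the table and the allowed range are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint keeps the attribute value in its allowed range.
conn.execute(
    "CREATE TABLE employee (name TEXT, age INTEGER CHECK (age BETWEEN 16 AND 70))")
conn.execute("INSERT INTO employee VALUES ('Smith', 35)")   # in range: accepted
try:
    conn.execute("INSERT INTO employee VALUES ('Jones', 200)")  # out of range
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```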
At least five aspects of the previous list must be ensured with special techniques that do
not exist in insecure DBMSs. There are three basic ways to do this:
flow control - controlling information flows within the DBMS
inference control - controlling dependencies among data
access control - restricting access to the information in the DBMS
Cryptographic techniques can be added to these controls.
Flow Control
Flow control regulates the distribution (flow) of information among available objects,
e.g. reading information from object X and writing it directly into object Y.
Flow control policies need a list of acceptable information flows and their constraints.
Flow constraints are often based on a classification of system elements and a definition
of the acceptable flows between classification levels.
Inference Control
The aim of inference control is to avoid indirect disclosure of information (a set of data X
readable by a user can be used to determine a data set Y, with Y = f(X)). Generally
there are three paths to unauthorized data disclosure:
1. correlated data - a typical channel where visible data X are semantically related to
invisible data Y
2. missing data - the result of a query contains NULL values that mask sensitive data.
The existence of that data may be detected that way.
3. statistical inference - typical for databases that provide statistical information
about entities.
Statistical databases do not allow direct access to data; the user may query only statistical
data. Attacks on these DBMSs can be countered with two different approaches:
1. data perturbation - concrete data are replaced with statistical results
2. query controls - more frequently used; mostly based on a minimal and maximal
number of items concerned by a query. The results are satisfactory, but this
technique is expensive and difficult to administer.
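A toy sketch of a query-set-size control of the second kind (all names, data and thresholds are invented): a statistic is refused when the query set, or its complement, is smaller than a minimum size, so that no individual value can be isolated.

```python
K_MIN = 2  # minimal allowed size for both the query set and its complement

SALARIES = {"ann": 100, "bob": 120, "carl": 90,
            "dana": 110, "eve": 95, "fred": 105}

def avg_salary(predicate):
    selected = [v for name, v in SALARIES.items() if predicate(name)]
    rest = len(SALARIES) - len(selected)
    if len(selected) < K_MIN or rest < K_MIN:
        raise PermissionError("query set size outside allowed bounds")
    return sum(selected) / len(selected)

print(avg_salary(lambda n: n < "d"))  # average over ann, bob, carl: allowed
try:
    avg_salary(lambda n: n != "ann")  # would isolate one person's salary
except PermissionError as exc:
    print("refused:", exc)
```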
Access Control
Access control enforces the rules determined by the security policies for all
direct accesses to the system. Traditional control systems work with the notions of
subject, object and operation.
Fig. 1. Schema of a secure database management system
Security Models
The aim of security modeling is to create an abstract, software-independent conceptual
model from the requirements specification for system protection.
A security model should provide a rich semantic representation which allows description of
the functional and structural properties of the security system. It should also provide
definitions of the protection requirements and system policies. A proof of the model's
properties should be available, too.
It is clear that the level at which we decide to describe access control can differ greatly,
as we will see in the next paragraph. The basic approaches to access control were
introduced above; descriptions of concrete models follow.
Abstractions of Access Models
Access control models can be classified at several levels, as proposed by LaPadula and
Williams. The following levels proceed from the general to the more implementation-dependent:
1. trust objectives - definition of the basic requirements on the system's trustworthiness
2. requirements on the external interface - security requirements on the system-
environment interface
3. internal requirements - requirements that must be satisfied in system components
4. operational rules - describe the assurance of the internal requirements
5. functional design - functional description of the behavior of system components
The models described below are at level 3. The TBAC and RBAC model families partly
overlap with level 2 because they are much more general and cover a wider range of
problems.
Types of Access Control Models
Security models can be classified according to many aspects: for example, target system,
type of security policy, addressed security aspects, type of control (direct or of flows) and so
on. Here we consider the type of security policy enforced by the model.
Two basic model types arose very early - discretionary and mandatory access control.
In the former, the owner of the data governs access. This is the most common form of
authorization administration - ownership based. That policy is very flexible but also very
difficult to control from a global point of view. Models with mandatory access control
enforce a global policy by controlling flows among the security levels assigned to objects.
It seemed that nothing else would emerge, but OO technologies have encouraged new
approaches that reflect OO DBMSs and the new requirements of the commercial sphere. The
first is RBAC - access control based on roles - and the second is TBAC, which is based on
the concept of a task. TBAC brings absolutely new ideas and the notion of active security.
This finishes the introduction; we now describe the policy types through concrete models.
HRU - Matrix Access Model
The HRU (Harrison-Ruzzo-Ullman) model covers the security of data for DBMSs and
operating systems. It was proposed in 1971 and formalized in 1976.
Authorization State
The authorization state is derived from the relations between objects and subjects. It is
defined as a triple Q = (S, O, A), where S is the set of subjects, O is the set of objects (which
also contains the subjects) and A is the access matrix (A[s, o] = r).
The authorization state for a DBMS can be enriched with predicates on access rights. Those
predicates may be data-, time-, context- or history-dependent.
Six primitive operations are defined for administering the authorization state: granting and
revoking a right r in A[s, o], creating and deleting a subject s, and creating and deleting an
object o. The operations are composed into commands for modifying the authorization
state. Harrison assumes that modification commands have two parts, conditional and
executive; the executive part is performed when the condition is true.
The set of possible access rights depends on the object. The model knows read, write,
append, execute and own; the last determines the owner, who administers access to the object.
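The triple Q = (S, O, A) and the primitive operations can be sketched directly. This toy Python model is an illustration only, not the formal HRU machinery (command conditions and predicates are omitted):

```python
# Sparse access matrix: A maps (subject, object) to a set of rights.
RIGHTS = {"read", "write", "append", "execute", "own"}

class AuthState:
    def __init__(self):
        self.subjects, self.objects = set(), set()
        self.A = {}  # (s, o) -> set of rights

    def create_subject(self, s):
        self.subjects.add(s)
        self.objects.add(s)  # O also contains the subjects

    def create_object(self, o):
        self.objects.add(o)

    def grant(self, s, o, r):
        assert r in RIGHTS
        self.A.setdefault((s, o), set()).add(r)

    def revoke(self, s, o, r):
        self.A.get((s, o), set()).discard(r)

    def check(self, s, o, r):
        return r in self.A.get((s, o), set())

q = AuthState()
q.create_subject("alice")
q.create_object("file1")
q.grant("alice", "file1", "read")
print(q.check("alice", "file1", "read"))   # True
q.revoke("alice", "file1", "read")
print(q.check("alice", "file1", "read"))   # False
```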
Administration of Authorizations
The owner of an object may grant/revoke any right (except own) on the given object to other
subjects - even rights he does not currently possess. This approach breaks the principle of
attenuation of privileges.
In some systems there are modifications of access rights that allow other subjects to
administer access rights. This is expressed by an explicit flag on the access right, which may
take two forms:
copy flag - m* in A[s1, o] allows subject s1 to grant access right m for object o to other
subjects; the rights of subject s1 do not change and subject s2 receives right m for object
o
Fig. 2. Use of copy flag
transfer flag - subject s1 grants right m for object o to subject s2, and s1 loses the
right. Subject s1 initially holds m+ (m with the transfer flag)
Fig. 3. Use of transfer flag
Model Extensions
The HRU model is very flexible for expressing and controlling access rights to information.
Its most important problem is the safety question (Is there a state, reachable from the given
starting authorization state, in which a subject receives a certain right for a given object?),
which is undecidable in general. This problem is related to the danger of Trojan horses in DAC.
The model has been extended in several ways to make the safety question decidable:
Schematic protection model (Sandhu)
Extended schematic protection model (Ammann and Sandhu)
Typed access matrix (TAM) model (Sandhu) - each object and subject has a type
that is assigned on creation of the entity and does not change during its lifecycle.
Bell-LaPadula Model
This model is an extension of the HRU model oriented towards the definition of security
requirements in complex systems whose elements can be classified (security levels). The
model was proposed in 1973 for operating systems and became the reference model for
mandatory access control (MAC).
Role-Based Access Control
This model family has been proposed to satisfy the requirements of the commercial sphere.
The basic idea is the notion of a role. Users are granted roles that allow them to perform
certain actions assigned to those roles. Roles are usually created according to the functions
in an organization. The moment we decide to revoke or change a user's access, the only
thing we have to do is remove him from a group or reassign him to another one. This change
is very easy for an administrator.
This approach offers a very suitable abstraction for expressing the security policies of large
companies and seems a promising alternative to the traditional MAC and DAC models.
RBAC is suitable for environments where MAC models are too strict. It has been proved
that RBAC can simulate mandatory access models and, in addition, that MAC models are a
specific case of RBAC.
One may say that RBAC is based on something very similar to the groups used in many
older models. Groups of users are used to associate subjects with the same access rights.
Such a group does not represent a role but rather the common needs for the execution of
certain system actions. Roles in RBAC exploit groups to abstract the access rights needed
for concrete actions, which are later accessible to role members.
The narrow, specific meaning of roles cannot exhaustively express the status of users. It
seems that an effective implementation of role-based access control needs to define three
aspects: role, individuality and locality. Consider an example in which a firm engages a new
person, Smith. His position is seller (his role). During a trial period you do not want him to
perform transactions with banks (individuality), and you do not want him to access your
system anywhere other than New York (locality).
What are the advantages of this approach? You are not bound during policy selection, the
model offers strict separation of duties, easy expression of organizational policy and, of
course, easy administration of access rights.
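The Smith example can be sketched in a few lines. All role names, permissions and users below are invented for illustration, and the individuality and locality checks are omitted:

```python
# Permissions attach to roles; users acquire permissions only through roles.
ROLE_PERMS = {
    "seller":     {"create_order", "view_stock"},
    "accountant": {"bank_transaction", "view_ledger"},
}
USER_ROLES = {"smith": {"seller"}}

def allowed(user, action):
    return any(action in ROLE_PERMS[role] for role in USER_ROLES.get(user, set()))

print(allowed("smith", "view_stock"))        # True
print(allowed("smith", "bank_transaction"))  # False
# Changing Smith's access is a single role reassignment, not a per-object change:
USER_ROLES["smith"] = {"accountant"}
print(allowed("smith", "bank_transaction"))  # True
```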
Relational Database Management Systems (RDBMS)
A relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model. A short definition of an RDBMS is: a DBMS in
which data is stored in tables and the relationships among the data are also stored in tables.
The data can be accessed or reassembled in many different ways without having to change
the table forms.
The relational database concept was no doubt contemplated by the people who, on a
day-to-day basis, had to manage large production data structures through the 1960s, 1970s,
1980s, 1990s and today. But it was probably Dr. E. F. Codd who first published (in an IBM
Research Report of August 1969) a specification - a set of rules - by which such a
species of database would be known:
CODD's TWELVE RULES
Rule 1: The Information Rule: All information in a relational database is represented
explicitly at the logical level in exactly one way by values in tables.
Rule 2: Guaranteed Access Rule: Each and every datum (atomic value) in a relational
database is guaranteed to be logically accessible by resorting to a table name, primary key
value, and column name.
Rule 3: Systematic Treatment of Null Values: Null values (distinct from empty character
string or a string of blank characters and distinct from zero or any other number) are
supported in the fully relational DBMS for representing missing information in a systematic
way, independent of data type.
Rule 4: Dynamic On-line Catalog Based on the Relational Model: The database description is
represented at the logical level in the same way as ordinary data, so authorized users can
apply the same relational language to its interrogation as they apply to regular data.
Rule 5: Comprehensive Data Sublanguage Rule: A relational system may support several
languages and various modes of terminal use (for example, the fill-in-blanks mode).
However, there must be at least one language whose statements are expressible, per some
well-defined syntax, as character strings and whose ability to support all of the following is
comprehensive: data definition, view definition, data manipulation (interactive and by
program), integrity constraints, and transaction boundaries (begin, commit, and rollback).
Rule 6: View Updating Rule: All views that are theoretically updateable are also updateable
by the system.
Rule 7: High-level Insert, Update, and Delete: The capability of handling a base relation or a
derived relation as a single operand applies not only to the retrieval of data but also to the
insertion, update, and deletion of data.
Rule 8: Physical Data Independence: Application programs and terminal activities remain
logically unimpaired whenever any changes are made in either storage representation or
access methods.
Rule 9: Logical Data Independence: Application programs and terminal activities remain
logically unimpaired when information preserving changes of any kind that theoretically
permit unimpairment are made to the base tables.
Rule 10: Integrity Independence: Integrity constraints specific to a particular relational
database must be definable in the relational data sublanguage and storable in the catalog,
not in the application programs.
A minimum of the following two integrity constraints must be supported:
1. Entity integrity: No component of a primary key is allowed to have a null value.
2. Referential integrity: For each distinct non null foreign key value in a relational database,
there must exist a matching primary key value from the same domain.
Rule 11: Distribution Independence: A relational DBMS has distribution independence.
Distribution independence implies that users should not have to be aware of whether a
database is distributed.
Rule 12: Nonsubversion Rule: If a relational system has a low-level (single-record-at-a-
time) language, that low-level language cannot be used to subvert or bypass the integrity
rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational
language.
Note: There is a rider to these 12 rules known as Rule Zero: "For any system that is claimed
to be a relational database management system, that system must be able to manage data
entirely through its relational capabilities."
On the basis of the above rules, there is no fully relational DBMS available today.
By the mid-1980s a number of database vendors had already claimed that their DBMS was
in reality an RDBMS. Oracle Corporation (previously known as Relational Software)
released the first commercially available database management system, packaged with its
own DBMS language called SQL (Structured Query Language). IBM also pioneered research
and development into making the DBMS more of an RDBMS, and into the language that was
to become known as SQL. It was not until almost the 1990s that Microsoft joined these two
major-league RDBMS/DBMS vendors with its SQL Server product.
Are these products RDBMS or DBMS?
What is it about an RDBMS that is missing from a DBMS, apart from the obvious R for
Relational? Some insist it is nothing but rigorous adherence to Codd's 12 rules, while
others insist a non-relational DBMS structure can be "made relational" by the application of
specific housekeeping and data-integrity exception-management subsystems.
From the perspective of IT Management however, whatever DBMS or RDBMS it was that
housed the organisational data in production real-time, it was this very ever-changing
database of information which needed to be safeguarded with the utmost diligence.
The daily backups needed to be performed and checked off, and even if they were automated
they still needed to be checked. Restores of such backups needed to be scheduled
periodically to test the integrity of the backup and restore process, the media and the
devices. A backup was useless if it could not be restored.
More importantly, the integrity of the database had to be maintained at the highest possible
level. If this duty were not attended to by some responsible party internal or external to the
organisation, then the value of the database as an entity would be suspect to varying
degrees.
With experience in the field one understands that there is indeed usually an entire series
of specific data integrity issues and problems which may arise at any time within any given
database, due to a number of different environmental factors.
At the top of the list, Rules 2 and 3 may have a direct bearing on data integrity issues if
they are not enforced. But it is Rule 10 that is often quoted as the one not established by
default within certain DBMSs, preventing them from being truly relational.
As C. J. Date points out in his article entitled Business Rules and the Relational Model, the
title of Codd's original paper was "Derivability, Redundancy, and Consistency of Relations
Stored in Large Data Banks". He goes on to reinforce that Codd's term "consistency of
stored relations" is implicitly the same thing as the term integrity constraints, and to
note that another formal description for the set of these things is "the relation predicate".
When all is said and done, however, the RDBMS vendors of today have a far more powerful
and robust product than they had a decade ago, and while it is often very interesting
examining the theoretical threads of the beginnings of the Relational Database concept, this
article is bound closely to follow the path and perspective of the IT Management staff who
have used all these evolving products, and seen the decades come and go.
For this reason, in terms of the technical evolution of the major RDBMS products available
since the 1990's, we are going to skip this era, and move right up to the present day in 2002
and have a look under the bonnet of one of these major RDBMS products, just to see what it
does.
Again, let me state clearly that while the academic criticism of RDBMS vendors - for failing
to supply greater support for integrity constraints, or for a number of technical issues by
which their products may not conform 100% to the relational model - is often warranted
and heeded by the vendors, my approach has not had the luxury of time.
When we are dealing with a production database, alive and breathing through all forms of
organisational processes, we cannot afford the time to wait for the longer term solutions to
such problems. Production database integrity, and by this I mean at real live day-to-day
production sites, is a concern that needs to be addressed without delay. If you as an IT
Manager designed a series of data-integrity exception alerts to identify the quantitative
failure of specific integrity constraints, and had them resolved, such a mechanism might
work around potentially major problems.
Perhaps the real difference between a DBMS and an RDBMS is similar to the difference
between a 6-cylinder car and an 8-cylinder car? If we looked under the bonnet we might be
able to tell, so let's choose one of the major three RDBMS vendors and have a little look. In
this article we have chosen to look under the bonnet of the SQL Server product from
Microsoft Corporation.
Relational model
The relational model for database management is a database model based on first-order
predicate logic, first formulated and proposed in 1969 by Edgar F. Codd.
Diagram of an example database according to the Relational model
In the relational model, related records are linked together with a "key".
The purpose of the relational model is to provide a declarative method for specifying data
and queries: users directly state what information the database contains and what
information they want from it, and let the database management system software take care
of describing data structures for storing the data and retrieval procedures for answering
queries.
The fundamental assumption of the relational model is that all data is represented as
mathematical n-ary relations, an n-ary relation being a subset of the Cartesian product of n
domains. In the mathematical model, reasoning about such data is done in two-valued
predicate logic, meaning there are two possible evaluations for each proposition: either true
or false (and in particular no third value such as unknown, or not applicable, either of which
are often associated with the concept of NULL). Some think two-valued logic is an important
part of the relational model, while others think a system that uses a form of three-valued
logic can still be considered relational.
Data are operated upon by means of a relational calculus or relational algebra, these being
equivalent in expressive power.
The relational model of data permits the database designer to create a consistent, logical
representation of information. Consistency is achieved by including declared constraints in
the database design, which is usually referred to as the logical schema. The theory includes
a process of database normalization whereby a design with certain desirable properties can
be selected from a set of logically equivalent alternatives. The access plans and other
implementation and operation details are handled by the DBMS engine, and are not
reflected in the logical model. This contrasts with common practice for SQL DBMSs in which
performance tuning often requires changes to the logical model.
The basic relational building block is the domain or data type, usually abbreviated
nowadays to type. A tuple is an ordered set of attribute values. An attribute is an ordered
pair of attribute name and type name. An attribute value is a specific valid value for the type
of the attribute. This can be either a scalar value or a more complex type.
A relation consists of a heading and a body. A heading is a set of attributes. A body (of an n-
ary relation) is a set of n-tuples. The heading of the relation is also the heading of each of its
tuples.
A relation is defined as a set of n-tuples. In both mathematics and the relational database
model, a set is an unordered collection of unique, non-duplicated items, although some
DBMSs impose an order to their data. In mathematics, a tuple has an order, and allows for
duplication. E.F. Codd originally defined tuples using this mathematical definition. Later, it
was one of E.F. Codd's great insights that using attribute names instead of an ordering
would be so much more convenient (in general) in a computer language based on relations.
This insight is still being used today. Though the concept has changed, the name "tuple" has
not. An immediate and important consequence of this distinguishing feature is that in the
relational model the Cartesian product becomes commutative.
A table is an accepted visual representation of a relation; a tuple is similar to the concept of
row, but note that in the database language SQL the columns and the rows of a table are
ordered.
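The consequence mentioned above (the Cartesian product becomes commutative once tuples carry attribute names rather than positions) can be sketched with plain Python sets; the toy relations are invented:

```python
# A tuple is modeled as a frozenset of (attribute, value) pairs: no ordering.
def tup(**attrs):
    return frozenset(attrs.items())

r = {tup(city="Austin", state="TX"), tup(city="Boston", state="MA")}
s = {tup(lang="SQL")}

def product(r1, r2):
    # Combine every tuple of r1 with every tuple of r2 by attribute union.
    return {t1 | t2 for t1 in r1 for t2 in r2}

# With named attributes, the Cartesian product is commutative:
print(product(r, s) == product(s, r))  # True
# A relation is a set: inserting an existing tuple changes nothing.
r.add(tup(city="Austin", state="TX"))
print(len(r))  # 2
```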
A relvar is a named variable of some specific relation type, to which at all times some
relation of that type is assigned, though the relation may contain zero tuples.
The basic principle of the relational model is the Information Principle: all information is
represented by data values in relations. In accordance with this Principle, a relational
database is a set of relvars and the result of every query is presented as a relation.
The consistency of a relational database is enforced, not by rules built into the applications
that use it, but rather by constraints, declared as part of the logical schema and enforced by
the DBMS for all applications. In general, constraints are expressed using relational
comparison operators, of which just one, "is subset of" (⊆), is theoretically sufficient. In
practice, several useful shorthands are expected to be available, of which the most
important are candidate key (really, superkey) and foreign key constraints.
SQL
SQL, often referred to as Structured Query Language, is a programming language designed
for managing data in relational database management systems (RDBMSs).
Originally based upon relational algebra and tuple relational calculus, its scope includes
data insert, query, update and delete, schema creation and modification, and data access
control.
SQL was one of the first commercial languages for Edgar F. Codd's relational model, as
described in his influential 1970 paper, "A Relational Model of Data for Large Shared Data
Banks". Despite not adhering to the relational model as described by Codd, it became the
most widely used database language. Though often described as, and to a great extent, a
declarative language, SQL also includes procedural elements. SQL became a standard of the
American National Standards Institute (ANSI) in 1986 and of the International Organization
for Standardization (ISO) in 1987. Since then the standard has been enhanced several times
with added features. However, issues of SQL code portability between major RDBMS
products still exist due to lack of full compliance with, or different interpretations of, the
standard. Among the reasons mentioned are the large size and incomplete specification of
the standard, as well as vendor lock-in.
The SQL language is sub-divided into several language elements, including:
Clauses, which are constituent components of statements and queries. (In some cases,
these are optional.)
Expressions, which can produce either scalar values or tables consisting of columns and
rows of data.
Predicates, which specify conditions that can be evaluated to SQL three-valued logic (3VL)
(true/false/unknown) or Boolean truth values, and which are used to limit the effects of
statements and queries, or to change program flow.
Queries, which retrieve the data based on specific criteria. This is the most important
element of SQL.
Statements, which may have a persistent effect on schemata and data, or which may
control transactions, program flow, connections, sessions, or diagnostics. SQL statements
also include the semicolon (";") statement terminator. Though not required on every
platform, it is defined as a standard part of the SQL grammar.
Insignificant whitespace is generally ignored in SQL statements and queries, making it
easier to format SQL code for readability.
Data manipulation
The Data Manipulation Language (DML) is the subset of SQL used to add, update and delete
data:
INSERT adds rows (formally tuples) to an existing table, e.g.,:
INSERT INTO My_table
(field1, field2, field3)
VALUES
('test', 'N', NULL);
UPDATE modifies a set of existing table rows, e.g.,:
UPDATE My_table
SET field1 = 'updated value'
WHERE field2 = 'N';
DELETE removes existing rows from a table, e.g.,:
DELETE FROM My_table
WHERE field2 = 'N';
MERGE is used to combine the data of multiple tables. It combines the INSERT and UPDATE
elements. It is defined in the SQL:2003 standard; prior to that, some databases provided
similar functionality via different syntax, sometimes called "upsert".
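SQLite, for instance, never implemented MERGE but provides the "upsert" behaviour described above through its own INSERT ... ON CONFLICT syntax (SQLite 3.24+). A sketch via Python's sqlite3 module; the table and data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Stock (item TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO Stock VALUES ('bolt', 5)")

# One statement that INSERTs a new key or UPDATEs an existing one:
# this is SQLite's dialect-specific analogue of the standard MERGE.
upsert = """
    INSERT INTO Stock (item, qty) VALUES (?, ?)
    ON CONFLICT(item) DO UPDATE SET qty = qty + excluded.qty
"""
conn.execute(upsert, ("bolt", 3))   # key exists  -> UPDATE branch, qty becomes 8
conn.execute(upsert, ("nut", 10))   # key is new  -> INSERT branch

print(dict(conn.execute("SELECT item, qty FROM Stock")))  # {'bolt': 8, 'nut': 10}
```

The `excluded` pseudo-table refers to the row that failed to insert, which is what lets the UPDATE branch combine the old and new values.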
Data definition
The Data Definition Language (DDL) manages table and index structure. The most basic
items of DDL are the CREATE, ALTER, RENAME, DROP and TRUNCATE statements:
CREATE creates an object (a table, for example) in the database, e.g.,:
CREATE TABLE My_table(
my_field1 INT,
my_field2 VARCHAR(50),
my_field3 DATE NOT NULL,
PRIMARY KEY (my_field1, my_field2)
);
ALTER modifies the structure of an existing object in various ways, for example, adding a
column to an existing table or a constraint, e.g.,:
ALTER TABLE My_table ADD my_field4 NUMBER(3) NOT NULL;
TRUNCATE deletes all data from a table in a very fast way, deleting the data inside the table
and not the table itself. It usually implies a subsequent COMMIT operation, i.e., it cannot be
rolled back.
TRUNCATE TABLE My_table;
DROP deletes an object in the database, usually irretrievably, i.e., it cannot be rolled back,
e.g.,:
DROP TABLE My_table;
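The DDL statements above can be exercised end-to-end in a short script. The sketch below uses Python's sqlite3 module, which supports CREATE, ALTER ... ADD COLUMN, and DROP (though not TRUNCATE); the names mirror the examples above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE the table, then ALTER it to add a column.
conn.execute("CREATE TABLE My_table (my_field1 INT, my_field2 TEXT)")
conn.execute("ALTER TABLE My_table ADD COLUMN my_field3 TEXT")

# PRAGMA table_info lists (cid, name, type, ...) for each column, in order.
cols = [row[1] for row in conn.execute("PRAGMA table_info(My_table)")]
print(cols)    # ['my_field1', 'my_field2', 'my_field3']

# DROP removes the object itself, not just its rows.
conn.execute("DROP TABLE My_table")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)  # []
```

Querying the catalog (`sqlite_master` here, information_schema in other systems) after each statement is a simple way to see that DDL changes the schema rather than the data.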
People who work with databases
1. Systems analyst
A systems analyst researches problems, plans solutions, recommends software and systems,
and coordinates development to meet business or other requirements. They will be familiar
with multiple varieties of programming languages, operating systems, and computer
hardware platforms. Because they often translate user requests into technical
specifications, systems analysts are the liaisons between vendors and information
technology professionals. They may be responsible for developing cost analyses, design
considerations, and implementation time-lines.
A systems analyst is a person who conducts a study, identifies activities and objectives and
determines a procedure to achieve the objectives. Designing and implementing systems to
suit organizational needs are the functions of the systems analyst.
Their responsibilities include:-
Communicate with each prospective database user group in order to understand its
o information needs
o processing needs
Develop a specification of each user group’s information and processing needs
Develop a specification integrating the information and processing needs of the user
groups
Document the specification
2. Database designers
The database designer defines the tables, indexes, views, constraints, triggers, stored
procedures, tablespaces or storage parameters, and other database-specific constructs
needed to store, retrieve, and delete persistent objects.
Their responsibilities include:-
Choose appropriate structures to represent the information specified by the system
analysts
Choose appropriate structures to store the information in a normalized manner in order
to guarantee integrity and consistency of data
Choose appropriate structures to guarantee an efficient system
Document the database design
3. Application Developers
A database developer writes and modifies databases. Data in a database can be reorganized,
dispersed, and accessed in a number of ways. Databases are important to companies and
organizations, because they contain records or files, such as sales transactions, product
catalog and inventories, and customer profiles.
Database developers create management systems to provide effective and efficient access to
information stored in databases. They provide expertise and guidance in the design,
implementation and maintenance of database management systems. Database developers
control the design and use of the database and they also determine the way the filing
systems will be organized and accessed. An important part of this work involves
implementing and controlling security procedures to protect the database from accidental
or intentional damage or loss.
They must be good at communicating not only in computer languages, but with people as
well. They not only develop the programs, but write up descriptions about them, prepare
manuals, help screens, or explain the new systems in person to users. They might have to
adapt the programs once they're in use, to make them more suitable to the users. In any
business or office environment, it is the database developers who are the authorities on the
use of these management systems. This means they may spend a lot of their day answering
questions on the correct usage of the database and correcting any glitches that come up
within the system.
Their responsibilities can be summarized as follows:-
Implement the database design
Implement the application programs to meet the program specifications
Test and debug the database implementation and the application programs
Document the database implementation and the application programs
4. Database administrators
A database administrator (DBA for short) is a person responsible for the design,
implementation, maintenance and repair of an organization's database. They are also
known by the titles Database Coordinator or Database Programmer, and the role is closely
related to the Database Analyst, Database Modeller, Programmer Analyst, and Systems
Manager. The
role includes the development and design of database strategies, monitoring and improving
database performance and capacity, and planning for future expansion requirements. They
may also plan, co-ordinate and implement security measures.
Database administrator's duties include:
Schema definition
Storage structure and access method definition
Schema and physical organization modification
Granting user authority to access the database
Specifying integrity constraints
Acting as liaison with users
Monitoring performance and responding to changes in requirements
Database backup and recovery
Database security
5. Database Users
Users are differentiated by the way they expect to interact with the system. The
different types of users are:-
Application programmers
An application programmer is someone who has acquired high-level skills in a
variety of computer programming languages. They are able to use these skills to
modify and create new software applications or programs. An application
programmer is also known as a developer or systems analyst. They interact with
system through DML calls.
Sophisticated users
These users form requests in a database query language, e.g. engineers, scientists,
business analysts, and others who thoroughly familiarize themselves with the facilities of
the DBMS and use them to meet their complex requirements.
Specialized users
These users write specialized database applications that do not fit into the
traditional data processing framework. They maintain personal databases by using
ready-made program packages that provide easy-to-use menu- or graphics-based
interfaces. An example is the user of a tax package that stores a variety of personal
financial data for tax purposes.
Typically they become very proficient in using a specific software package.
Naive users
These users invoke one of the permanent application programs that have been written
previously. Their characteristics include constantly querying and updating the database,
using standard types of queries and updates called canned transactions that have been
carefully programmed and tested. E.g. Bank tellers check account balances and post
withdrawals and deposits; Reservation clerks for airlines, hotels, and car rental companies
check availability for a given request and make reservations; Clerks at receiving stations for
courier mail enter package identifications via bar codes and descriptive information
through buttons to update a central database of received and in-transit packages.
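The canned transactions described above are pre-written, pre-tested routines in which only the parameters vary. As an illustration only, here is a hypothetical bank-teller withdrawal sketched with Python's sqlite3 module; the account table, numbers, and function name are all invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Account (acct_no INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO Account VALUES (1001, 500.0)")

def post_withdrawal(acct_no, amount):
    """A 'canned transaction': fixed, tested logic; only the parameters change."""
    with conn:  # commits on success, rolls back if an exception is raised
        bal, = conn.execute(
            "SELECT balance FROM Account WHERE acct_no = ?", (acct_no,)).fetchone()
        if bal < amount:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE Account SET balance = balance - ? WHERE acct_no = ?",
                     (amount, acct_no))

post_withdrawal(1001, 120.0)
bal, = conn.execute("SELECT balance FROM Account WHERE acct_no = 1001").fetchone()
print(bal)   # 380.0
```

The naive user never sees the SQL: the teller's terminal simply calls the routine with an account number and an amount, which is why such transactions can be "carefully programmed and tested" once and reused safely.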
Disadvantages of RDBMS
The main problem when using a relational database is the complexity that arises when it is
first created. It is vital that the defined relationships between the tables are correct and
that each piece of information is linked to its related data. Although less information has
to be entered in total than with other databases, making sure every relationship is set up
correctly is a slow process. Furthermore, the relationships can become extremely intricate
when a relational database contains more than just two tables.
It was well known from the outset that the relational model did not and could not model
some kinds of data. One common data structure that the relational model did not
accommodate, for example, was a hierarchy. You could implement laborious workarounds to
store such data within a relational database, but the simple reality was that the relational
model never defined hierarchies properly. Insofar as the relational model was a
mathematical model of data, it was incomplete.
With the development of Object Oriented programming languages in the early 1990s, the
limits of the Relational Model began to become awkwardly visible. What quickly became
clear was that the view of data that an OO program had was not compatible with the way an
RDBMS stored data.
An easy way to explain this problem is this:
Think of a car. When you have finished using it, you park it in your garage and leave it. When
you park it you do not take out the seats, take off the tires, remove the doors, pull the steering
wheel off, take out the engine, etc. However, an RDBMS with its "normalization" processes
would often insist on storing a car in exactly that way and then reassembling it for you
when you next needed to use it.
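The car analogy can be made concrete: a normalized design stores the "parts" in separate tables, and the DBMS reassembles the object with a join each time it is needed. A hypothetical sketch using Python's sqlite3 module, with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Car   (car_id INTEGER PRIMARY KEY, model TEXT);
    CREATE TABLE Wheel (car_id INTEGER REFERENCES Car(car_id), position TEXT);
    INSERT INTO Car VALUES (1, 'Sedan');
    INSERT INTO Wheel VALUES (1, 'front-left'), (1, 'front-right'),
                             (1, 'rear-left'), (1, 'rear-right');
""")

# "Reassembling the car": a join gathers the scattered parts back into one view.
rows = conn.execute("""
    SELECT Car.model, Wheel.position
    FROM Car JOIN Wheel ON Car.car_id = Wheel.car_id
""").fetchall()
print(len(rows))   # 4 -- one row per part, rebuilt at query time
```

An object-oriented store would instead keep the whole car as one object; the join overhead sketched here is exactly the "disassembly and reassembly" cost the analogy complains about.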
The limitations of RDBMS became even more obvious with the advent of the Internet.
RDBMS had no way of representing web pages which were, by and large, collections of data
and mark-up tags. So web pages were stored as character strings by RDBMS and their inner
structure was buried.
Multimedia Database Management Systems
INTRODUCTION
Due to the heterogeneous nature of multimedia data, systems intended to store, transport,
display and in general manage such data must have considerably more capabilities than
conventional information-management systems. In this paper, we elaborate on issues
pertaining to multimedia database management and discuss technical challenges involved
in developing a general-purpose multimedia DBMS. We can view these issues as
specification requirements for this DBMS and focus our discussion on a reference
architecture for such a system. A list of prominent issues in multimedia database
management follows.
Development of formal semantic modelling techniques for multimedia information,
especially for video and image data. These models should be rich in capabilities for
abstracting multimedia information and capturing semantics. They should be able to
provide canonical representations of complex images, scenes, and events in terms of objects
and their spatio-temporal behavior. These models need to be compared and evaluated.
Design of powerful indexing, searching, and organization methods for multimedia data.
Search in multimedia databases can be computationally intensive, especially if content-
based retrieval is needed for image and video data stored in compressed or uncompressed
form.
Development of models for specifying the media synchronization/integration requirements.
Integration of these models with the monomedia database schema will be required.
Subsequently, in order to determine the synchronization requirements at retrieval time,
transformation of these models into a metaschema is needed. This entails designing object-
retrieval algorithms for the operating systems. Similarly, integration of these models with
higher-level information abstractions such as hypermedia or object-oriented models may be
required.
Designing formal multimedia query languages. These languages should have strong
capabilities to express arbitrarily complex semantic and spatiotemporal schemas associated
with composite multimedia information. They must support manipulation of content-based
functions for multimedia objects.
Development of efficient data-replacement schemes for physical storage management.
These schemes are needed to manage real-time multimedia data, for both single and
parallel disk systems. Design and development of suitable architecture and operating
system support. Heterogeneity of multimedia information dictates that the architecture of a
general-purpose multimedia database management system must support a rich set of data-
management and computational functionalities. The operating system must also support
real-time requirements of multimedia data.
Management of distributed multimedia databases. In a networked environment extensive
coordination and management capabilities are needed among the distributed sites to
provide location-transparent access and support real-time delivery of data to distributed
users.
A REFERENCE ARCHITECTURE FOR MULTIMEDIA DBMS
Many types of real-world knowledge can be represented by describing the interplay among
objects (persons, buildings, events, etc.) in the course of time and their relationships in
space. An application may need to store and access information about this knowledge that
can be expressed by complex spatiotemporal logic. A video database is a typical replica of
this real-world environment [Day et al. 1995]. The spatiotemporal specification process is
reversed while composing multimedia information documents [Little and Ghafoor 1993;
Iino et al. 1994]. In this case, a user synthetically creates interplay among various
monomedia objects (akin to physical objects), both in space and time. In multimedia
databases these objects may represent individual data entities that serve as components of
some multimedia document [Little and Ghafoor 1993; Stotts and Furuta 1989].
Furthermore, these entities/documents can be grouped together for efficient management
and access.
In either case, it is essential that the user be able to identify and address different objects
and be able to express their relations in time and space. These relations should be
representable in a suitable structure that is powerful enough to specify higher-level
contents and semantic abstractions. It is, therefore, desirable that a general framework for
spatiotemporal and semantic modeling should be available. Such a framework can provide a
reasonable approach to address the issues discussed in the previous section and can be
used in designing a general purpose multimedia database-management system. Various
conceptual models for multimedia information have been proposed in the literature [Little
and Ghafoor 1993; Stotts and Furuta 1989]. These models are either aimed at
synchronization aspects of the multimedia data or are concerned with the browsing aspects
of information. Irrespective of the media type, it is imperative that a model should be able to
generate a clear specification of the metaschema and it must be integral with the underlying
data models of various monomedia.
Such a model must also facilitate the efficient development of high-level user interfaces.
From this discussion, we can perceive reference architecture for a general-purpose
multimedia DBMS as shown in Figure 1(a). The architecture consists of three layers: a
monomedia database management layer, a multimedia composition/management layer,
and a user interface layer. We now describe functionalities of each layer and discuss their
role in managing the issues mentioned in the previous section. The monomedia database-
management layer provides the functionalities essential for managing individual media. The
key aspects of each DBMS at this level include maintaining efficient indexing mechanisms
and allowing users to develop semantics-based modeling and grouping of complex
information associated with each medium. In this respect, this layer deals with the first two
issues discussed in the previous section. The primary objective is to process content-based
queries and facilitate retrieval of appropriate pieces of monomedia data, such as
video clips, parts of an image, or audio segments. For example, for video data, abstraction
hierarchies in space and time, as shown in Figure 1(b), are needed in order to express
events and allow content-based indexing and retrieval.
Spatiotemporal logic can provide a formal framework for representing such events and
building these hierarchical abstractions [Day et al. 1995]. Subsequently this process can
lead to schema definition for each monomedia database. Such schema may also contain
some unconventional information, for example, the quality of service (QoS) parameters for
presentation, such as speed, volume, resolution, delay bounds, and the like. The objective is
to allow retrieval of monomedia data with the desired quality, if possible. Accordingly, a
suitable database language is needed at this level to manipulate hierarchical abstractions
and to query about events of interest. It is important to mention that considerable
computational demand can be placed at this layer, especially for image and video data, if a
fully automated system is desired that is capable of generating content-based indices based
on object recognition [Gudivada and Raghavan 1995]. Such computation can span a broad
spectrum of effective and robust computer-vision and image-processing algorithms that are
intermingled with data and entities at the various levels of abstraction of Figure 1(b).
Another important function incorporated within this layer is the physical management of
individual databases. Such management may require, as mentioned previously, efficient
placement of data on a single/parallel disk system. For example, due to the time-variant
characteristics of compressed video, it is difficult to predict the disk seek latencies. A poor
data-placement strategy can result in significant degradation in quality due to excessive
droppage of video frames during multiple concurrent sessions. The primary objective of the
middle layer is to deal with the third issue, the integration of monomedia to compose
multimedia documents and to crosslink information stored in monomedia databases.
There is a growing demand for management of multimedia documents and libraries, and
the need for efficient integration models is one of the key research issues in developing a
general-purpose multimedia DBMS. Integration of media can span multiple dimensions
including space, time, and semantics. Therefore, this layer needs to maintain a
multidimensional monomedia integration model, in the form of a metaschema, along with the
QoS parameters associated with each medium. Basically, the layer processes users' queries
for composite multimedia information and generates appropriate subqueries for the
monomedia databases. Retrieval of the monomedia data is controlled by the integration
model maintained by the metaschema [Little and Ghafoor 1993].
It is important to note that for the purpose of consistency, this meta-schema needs to be
interfaced with the schemas associated with the monomedia databases present at the lower
layer, because both schemas share information about monomedia, including their content-
based semantics and QoS parameters. Another important function of this layer is to provide
coordination among monomedia databases if they are distributed.
As mentioned before, one of the major objectives in this case is to provide location-
transparent access to different database sites and maintain synchronization among media
streams originating from heterogeneous hosts. From the database point of view,
information about the location of various objects of composite multimedia and their schema
must be maintained by this layer. The interactive layer consists of various user-interface
facilities that support multimedia presentation functionalities such as display of images,
playout of video clips or some audio segment, and the like. These interactive facilities may
require some formal query language that can be identical to that used at the bottom layer,
with some enhanced capabilities to manipulate composite multimedia information.
Alternatively, some graphical query interface can be used for that purpose. Additional
capabilities may include browsing, media editing, and so on.
DEVELOPMENT CONSIDERATIONS FOR THE REFERENCE ARCHITECTURE
The development of a general-purpose multimedia database system, including design of
multilevel meta-schema, management of a large number of indices and physical databases,
as well as interfacing with the operating system for real-time retrieval, can be a
technologically daunting task. In this section, we focus on two development issues that are
unique to multimedia databases and are crucial for realizing the reference architecture of
Figure 1(a): spatiotemporal modeling of monomedia/composite multimedia and the
suitability of two well-known data-modeling paradigms, relational versus object-oriented,
for developing a general purpose multimedia DBMS. A number of attempts have been made
to develop synchronization models for multimedia information. HyTime, one such model,
has been recommended as an ISO standard.
However, currently this model suffers from the drawback that the extraction of various
spatiotemporal and content semantics from this model can be quite cumbersome. On the
other hand, the Petrinet-based model proposed in Little and Ghafoor [1993] not only allows
extraction of the desired semantics and generation of a database schema in a rather
straightforward manner, but also has the additional advantage of pictorially illustrating
synchronization aspects of the information. In this regard this model is unique and is thus
also well suited for the third layer of the reference architecture, where visual orchestration
of multimedia information can be highly desirable. The model uses a set of generalized
temporal operators. An expanded version of the model also allows specification of spatial
semantics of information [Iino et al. 1994].
The notion of generalized spatiotemporal relations has recently been extended to specify
complex events in video data, a process that is almost a reversal of the multimedia
composition process [Day et al. 1995]. Numerous extensions to this model have been
proposed in the literature, including provision for user interaction, modeling of distributed
object synchronization, development of synchronization protocols for communication, and
the like. In summary, the Petrinet-based model in Little and Ghafoor [1993] can be a good
candidate for spatiotemporal modeling and generation of meta-schemas for the reference
architecture. We now discuss the second development issue.
As the application domain for a given multimedia application can be highly complex and ill-
structured, both relational calculus and semantic-based object-oriented approaches need to
be scrutinized. We can evaluate these approaches on two important criteria: their ability to
express semantics associated with hierarchical abstractions, which need to be managed by
almost all the monomedia DBMSs of the bottom layer of Figure 1(a), and their expressive
power for the specification of meta-schema for media integration, which is managed by the
middle layer. Hierarchical abstractions can result in considerable semantic heterogeneity,
which has been a difficult problem for relational database models.
On the other hand, the semantic-based object-oriented models can provide a richer set of
abstractions that are particularly useful for the users to extract/define view of information
at various levels of these hierarchies [Day et al. 1995]. From the media integration and
composition point of view, the relational technology again provides no elegant mechanisms
for maintaining complex logical structures associated with composite multimedia objects.
Although some relational DBMSs support access to multimedia objects by using pointers to
BLOBs (binary large objects), there is no provision for interactively accessing various
portions of these objects, because a BLOB is treated as a single entity in its entirety. It has
been demonstrated that the object-oriented technology can provide a powerful paradigm to
meet the requirements of multimedia composition and developing semantic models [Day et
al. 1995; Iino et al. 1994]. Its data and computational encapsulation features provide
effective mechanisms for media synchronization.
Multimedia Database Systems: Where are we now?
Introduction
A Multimedia Database Management System (MMDBMS) must support multimedia
data types in addition to providing facilities for traditional DBMS functions like database
creation, data modeling, data retrieval, data access and organization, and data
independence. The area and its applications have experienced tremendous growth.
Especially with the rapid development of network technology, multimedia database systems
have grown tremendously and multimedia information exchange has become very
important. This paper reviews the history and the current state of the art in MMDBMS.
The first wave
The first MMDBMSs relied mainly on the operating system for storing and querying files.
These were ad hoc systems that served mostly as repositories. The mid-90s saw a first wave
of commercial, implemented-from-scratch, full-fledged MMDBMSs. Some of them were
MediaDB, now MediaWay [1], JASMINE [2], and ITASCA, the commercial successor of
ORION [3]. They were all able to handle diverse kinds of data and provided mechanisms for
querying, retrieving, inserting, and updating data. Most of these products disappeared from
the market after some years of existence, and only some of them continued and adapted
themselves successfully to hardware and software advances as well as to application
changes. For instance, MediaWay provided early, very specific support for a wide variety of
media types: different media file formats, varying from images and video to PowerPoint
documents, can be managed, segmented, linked, and searched.
Designing MMDBs
Many inherent characteristics of multimedia data have direct and indirect impacts on the
design of MMDBs. These include: the huge size of MMDBs, temporal nature, richness of
content, complexity of representation and subjective interpretation. The major challenges in
designing multimedia databases arise from several requirements they need to satisfy such
as the following:
1. Manage different types of input, output, and storage devices. Data input can come from
a variety of devices, such as scanners and digital cameras for images, microphones and
MIDI devices for audio, and video cameras for video. Typical output devices are high-
resolution monitors for images and video, and speakers for audio.
2. Handle a variety of data compression and storage formats. The data encoding can vary
even within a single application. For instance, in medical applications, MRI images of the
brain require lossless coding or a very stringent lossy coding technique, while X-ray images
of bones can tolerate less stringent coding. Also, radiological image data, ECG data, other
patient data, etc. have widely varying formats.
3. Support different computing platforms and operating systems. Different users
operate computers and devices suited to their needs and tastes. But they need the same
kind of user-level view of the database.
4. Integrate different data models. Some data such as numeric and textual data are best
handled using a relational database model, while some others such as video documents are
better handled using an object-oriented database model. So these two models should
coexist together in MMDBs.
5. Offer a variety of user-friendly query systems suited to different kinds of media.
From a user's point of view, easy-to-use queries and fast, accurate retrieval of information
are highly desirable. The query for the same item can take different forms. For example, a
portion of interest in a video can be queried by using either
1) A few sample video frames as an example,
2) A clip of the corresponding audio track or
3) A textual description using keywords.
6. Handle different kinds of indices. The inexact and subjective nature of multimedia
data has rendered keyword-based indices and the exact and range searches used in
traditional databases ineffective. For example, the retrieval of records of persons based on
social security number is precisely defined, but the retrieval of records of persons having
certain facial features from a database of facial images requires content-based queries and
similarity-based retrievals. This requires indices that are content-dependent, in addition to
keyword indices.
7. Develop measures of data similarity that correspond well with perceptual similarity.
Measures of similarity for different media types need to be quantified to correspond well
with the perceptual similarity of objects of those data types. These need to be incorporated
into the search process.
8. Provide a transparent view of geographically distributed data. MMDBs are likely to
be distributed in nature. The media data may reside in many different storage units,
possibly spread out geographically. This is partly due to the changing nature of
computation and computing resources from centralized to networked and distributed.
9. Adhere to real-time constraints for the transmission of media data. Video and audio
are inherently temporal in nature. For example, the frames of a video need to be presented
at the rate of at least 30 frames/sec. for the eye to perceive continuity in the video.
10. Synchronize different media types while presenting to user. It is likely that different
media types corresponding to a single multimedia object are stored in different formats, on
different devices, and have different rates of transfer. Thus they need to be periodically
synchronized for presentation.
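The real-time constraint in requirement 9 can be made concrete with a small back-of-the-envelope sketch. The resolution and pixel depth below are illustrative assumptions, not figures from any particular system:

```python
# Sketch: the timing and bandwidth arithmetic behind real-time video delivery.
# All numbers are illustrative assumptions.

FRAME_RATE = 30                       # frames/sec needed for perceived continuity
FRAME_INTERVAL = 1.0 / FRAME_RATE     # each frame must arrive within ~33.3 ms

def required_bandwidth(width, height, bytes_per_pixel, fps=FRAME_RATE):
    """Raw (uncompressed) data rate needed to sustain the frame rate, in bytes/sec."""
    return width * height * bytes_per_pixel * fps

# A 640x480, 24-bit video needs ~27.6 MB/s uncompressed -- which is why an MMDB
# must couple storage with compression and rate-controlled delivery.
raw = required_bandwidth(640, 480, 3)
print(f"frame deadline: {FRAME_INTERVAL*1000:.1f} ms, raw rate: {raw/1e6:.1f} MB/s")
# -> frame deadline: 33.3 ms, raw rate: 27.6 MB/s
```

The per-frame deadline is what makes media data "temporal": missing it is visible to the user, unlike a late answer to a conventional database query.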
The recent growth in using multimedia data in applications has been phenomenal.
Multimedia databases are essential for efficient management and effective use of huge
amounts of data. The diversity of applications using multimedia data, the rapidly changing
technology, and the inherent complexities in the semantic representation, interpretation
and comparison for similarity pose many challenges. MMDBs are still in their infancy.
Today's MMDBs are closely bound to narrow application areas. The experiences acquired
from developing and using novel multimedia applications will help advance the multimedia
database technology.
Benefits of using Multimedia Database
The incorporation of multimedia database systems will improve the quantity and quality of
information manipulated by computer users in all fields, including computer-aided design and
information retrieval. The area of intelligent multimedia content analysis and retrieval
techniques is an emerging discipline. Techniques for representing and extracting semantic
information from media such as speech, images, and video are required.
When a multimedia application lacks a database, the data structure is buried in the script,
where all of its value is lost. This omission also makes the script more complicated and less
flexible. Using a multimedia database makes the data structure logic available to other
multimedia applications and simplifies the script so that many scripts can share the same
multimedia metadata. In addition, when a database of multimedia or abstract data is
organized and annotated for one application, other applications can use those annotations
without going through the same time-consuming process. This capability adds great value
to the data through reuse and controlled redundancy.
When multimedia application content is controlled by the multimedia database, multimedia
content can be added, deleted, or modified without modifying the application script. For
example, interactive kiosks that display, describe, and demonstrate products can be
updated automatically without reprogramming the application script. Furthermore, a
multimedia application such as a multimedia textbook can control the operation of
book topics that share the same look and feel. This control lets the script serve as a
template: an entire series of math textbooks (algebra, calculus, trigonometry, and
geometry), including text and video, can use the same multimedia application because all
data is kept physically separate from the script.
Search and retrieval operations are critical in interactive multimedia applications; they
must be equally efficient and powerful. Search and retrieval of multimedia and abstract data
is challenging, but multimedia databases make it feasible through internal storage format
flexibility and efficient operation. The DBMS should have significant knowledge about the
data and its structure to enable powerful semantic optimizations and intelligent searches.
Search and retrieval operations also give the application access to media components so
that they can be dynamically and seamlessly processed when necessary.
Features of Multimedia Database
Query in Multimedia DBMS
In a DBMS, declarative user queries are supported by defining a query language as part of
the DBMS. The query facility is one of the most important parts of a DBMS. Since queries
are issued by both naive and expert users, it is critical to provide effective query languages
in order to promote user friendliness. Multimedia query languages must deal with the complex
spatial and temporal relationships inherent in the wide range of multimedia data types.
Powerful query languages help manipulate a multimedia DBMS and maintain the
desired independence between the database and the application. Due to the complexity of
multimedia data, we need a query language that supports semantic data retrieval,
which is done by using keywords, indexes on keywords, and the contents of multimedia objects.
However, the query languages of traditional DBMSs deal only with exact-match queries on
alphanumeric data types, which are sufficient only for queries posed against the metadata
and annotations of multimedia data.
There are two types of queries used in database systems. The first is called a
well-defined query, in which the properties of objects are represented by a
well-defined set of labels and the comparison operators are also well defined. In a well-defined
query, the user must have exact knowledge of the underlying database and the desired
query result; such a query only works with exact matches. The second type is called
the fuzzy query, in which the properties of the query objects are unclear or the comparison
operators do not give exact matches; such a query can return a range of different answers.
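The contrast between the two query types can be illustrated with a small sketch. The records below are hypothetical, and the "fuzzy" comparison uses a simple edit-distance ratio from Python's standard library as a stand-in for the content-based similarity measures a real MMDBS would use:

```python
# Sketch: well-defined (exact) vs. fuzzy queries over hypothetical person records.
from difflib import SequenceMatcher

records = [
    {"ssn": "111-22-3333", "name": "Johnson"},
    {"ssn": "444-55-6666", "name": "Johnston"},
    {"ssn": "777-88-9999", "name": "Smith"},
]

def well_defined_query(ssn):
    """Exact match: the user must know the precise key value."""
    return [r for r in records if r["ssn"] == ssn]

def fuzzy_query(name, threshold=0.8):
    """Inexact match: returns a ranked set of differently-similar answers."""
    scored = [(SequenceMatcher(None, name, r["name"]).ratio(), r) for r in records]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [r for score, r in scored if score >= threshold]

print(well_defined_query("777-88-9999"))   # exactly one answer
print(fuzzy_query("Johnson"))              # both "Johnson" and "Johnston" qualify
```

Note how the fuzzy query returns more than one answer for a single query term; this is exactly the "set of different answers" behavior described above.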
Charts and Graphs
A multimedia database supports all types of charts and graphs, which is considered a great
feature of an MDBS. Creating those charts and graphs requires special queries, which are
created using SQL+D.
Multimedia Presentation
The terms multimedia document and multimedia (or temporal) presentation have been used
interchangeably in the literature. A multimedia database can store large amounts of media
such as audio and video, all of which are also used in the presentation process. An MDBS also
stores multimedia documents and presentations.
Applications of MMDB
Data Streaming and E-mail Technologies
The use of streaming video and audio media is becoming an increasingly desirable feature
in network technologies. As networks become more saturated with traffic, the timely
delivery of latency-sensitive data, such as the data generated by video conferencing and
Internet Protocol (IP) telephony programs, becomes more problematic. ATM and QoS are
two technologies that you can use to facilitate the timely delivery of latency-sensitive data.
POP3 is an e-mail protocol that can be used to facilitate mail delivery to computers that are not
part of a network or that do not have an e-mail server on the local network.
Interactive video technology
A few interactive video technologies developed in the last few years use a
new way of encoding videos that allows users to click on any person, place, or object in the
video.
Various existing trends in the usage of this technology:
• Customizable online interactive videos
• Conversational online interactive videos
• Exploratory online interactive videos
• Interactive video in early computer games
• Interactive video in cinema
• Interactive video on YouTube
• Interactive video art
• Interactive video in cell phone gaming
• Interactive video in VJing
• Hypervideo
Interactive Video ads
These are designed to meet the marketing needs of advertisers in a wide variety of
industries, including: Entertainment, CPG, Automotive, Retail, Telecommunications, and
Travel, with more categories being added on an ongoing basis.
Some of the flexible features and functionality:
• Multiple videos within a single ad unit, including Interactive video, augmented
reality, and dynamic content
• Social media integration and RSS feeds
• In-player video games, polling, surveys, and quizzes
• Localized geographic data, store location, and mapping
• Product features/carousel
• Coupons and click-to-schedule/purchase
Digital library
A digital library is a library in which collections are stored in digital formats (as opposed to
print, microform, or other media) and accessible by computers. The digital content may be
stored locally, or accessed remotely via computer networks. A digital library is a type of
information retrieval system.
A distinction is often made between content that was created in a digital format, known as
born-digital, and information that has been converted from a physical medium, e.g., paper,
by digitizing. The term hybrid library is sometimes used for libraries that have both
physical collections and digital collections.
Large scale digitization projects are underway at Google, the Million Book Project, and
Internet Archive. With continued improvements in book handling and presentation
technologies such as optical character recognition and ebooks, and development of
alternative depositories and business models, digital libraries are rapidly growing in
popularity as demonstrated by Google, Yahoo!, and MSN's efforts. Just as libraries have
ventured into audio and video collections, so have digital libraries such as the Internet
Archive.
According to Larry Lannom, Director of Information Management Technology at the
nonprofit Corporation for National Research Initiatives, "all the problems
associated with digital libraries are wrapped up in archiving." He goes on to state, "If in 100
years people can still read your article, we'll have solved the problem." Daniel Akst, author
of The Webster Chronicle, proposes that "the future of libraries—and of information—is
digital." Peter Lyman and Hal Varian, information scientists at the University of California,
Berkeley, estimate that "the world's total yearly production of print, film, optical, and
magnetic content would require roughly 1.5 billion gigabytes of storage." Therefore, they
believe that "soon it will be technologically possible for an average person to access
virtually all recorded information."
Most digital libraries provide a search interface which allows resources to be found. These
resources are typically deep web (or invisible web) resources since they frequently cannot
be located by search engine crawlers. Some digital libraries create special pages or sitemaps
to allow search engines to find all their resources. Digital libraries frequently use the Open
Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to expose their metadata to
other digital libraries, and search engines like Google Scholar, Yahoo! and Scirus can also
use OAI-PMH to find these deep web resources.[9]
There are two general strategies for searching a federation of digital libraries:
• Distributed searching, and
• Searching previously harvested metadata.
Distributed searching typically involves a client sending multiple search requests in parallel
to a number of servers in the federation. The results are gathered, duplicates are eliminated
or clustered, and the remaining items are sorted and presented back to the client. Protocols
like Z39.50 are frequently used in distributed searching. A benefit to this approach is that
the resource-intensive tasks of indexing and storage are left to the respective servers in the
federation. A drawback to this approach is that the search mechanism is limited by the
different indexing and ranking capabilities of each database, making it difficult to assemble
a combined result consisting of the most relevant found items.
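The distributed-searching steps above (fan out, deduplicate, merge, sort) can be sketched as follows. The three "servers" are stand-in functions with made-up scores; a real federation would issue the requests over a protocol such as Z39.50:

```python
# Sketch of federated (distributed) searching across a digital-library federation.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical member servers, each returning (document, relevance) pairs.
def server_a(q): return [("doc1", 0.9), ("doc2", 0.4)]
def server_b(q): return [("doc2", 0.7), ("doc3", 0.6)]
def server_c(q): return [("doc4", 0.8)]

def federated_search(query, servers):
    # 1. Send the search request to all servers in parallel.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda s: s(query), servers))
    # 2. Merge the results, keeping the best score for duplicates.
    best = {}
    for results in result_lists:
        for doc, score in results:
            best[doc] = max(score, best.get(doc, 0.0))
    # 3. Sort the remaining items and present them to the client.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

print(federated_search("digital library", [server_a, server_b, server_c]))
# -> [('doc1', 0.9), ('doc4', 0.8), ('doc2', 0.7), ('doc3', 0.6)]
```

The final sort illustrates the drawback noted above: scores produced by different ranking algorithms are not truly comparable, so the combined ordering is only as good as the least consistent member.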
Searching over previously harvested metadata involves searching a locally stored index of
information that has previously been collected from the libraries in the federation. When a
search is performed, the search mechanism does not need to make connections with the
digital libraries it is searching - it already has a local representation of the information. This
approach requires the creation of an indexing and harvesting mechanism which operates
regularly, connecting to all the digital libraries and querying the whole collection in order to
discover new and updated resources. OAI-PMH is frequently used by digital libraries for
allowing metadata to be harvested. A benefit to this approach is that the search mechanism
has full control over indexing and ranking algorithms, possibly allowing more consistent
results. A drawback is that harvesting and indexing systems are more resource-intensive
and therefore expensive.
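The harvesting approach relies on periodic OAI-PMH requests to each library. A minimal sketch of how a harvester forms those requests is shown below; the endpoint URL is hypothetical, while the verb and parameter names come from the OAI-PMH protocol:

```python
# Sketch: building OAI-PMH ListRecords requests for periodic metadata harvesting.
from urllib.parse import urlencode

BASE_URL = "https://example.org/oai"   # hypothetical repository endpoint

def list_records_url(metadata_prefix="oai_dc", from_date=None, token=None):
    """Build a ListRecords request; a resumptionToken pages through large result sets."""
    if token:
        # Per the protocol, a resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            # Incremental harvest: only records updated since the last run.
            params["from"] = from_date
    return BASE_URL + "?" + urlencode(params)

print(list_records_url(from_date="2011-01-01"))
# -> https://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc&from=2011-01-01
```

Running such requests on a schedule, and following resumption tokens until each collection is exhausted, is exactly the "regularly operating harvesting mechanism" described above.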
Advantages of a digital library
The advantages of digital libraries as a means of easily and rapidly accessing books, archives
and images of various types are now widely recognized by commercial interests and public
bodies alike.[14]
Traditional libraries are limited by storage space; digital libraries have the potential to store
much more information, simply because digital information requires very little physical
space to contain it. As such, the cost of maintaining a digital library is much lower than that
of a traditional library.
A traditional library must spend large sums of money paying for staff, book maintenance,
rent, and additional books. Digital libraries may reduce or, in some instances, do away with
these fees. Both types of library require cataloguing input to allow users to locate and
retrieve material. Digital libraries may be more willing to adopt innovations in technology,
providing users with improvements in electronic and audio book technology as well as
presenting new forms of communication such as wikis and blogs; conventional libraries
may consider that providing online access to their OPAC catalogue is sufficient. An
important advantage of digital conversion is increased accessibility to users. Digital libraries
also increase availability to individuals who may not be traditional patrons of a library,
because of geographic location or organizational affiliation.
No physical boundary: The user of a digital library need not go to the library physically;
people from all over the world can gain access to the same information, as long as an
Internet connection is available.
Round-the-clock availability: A major advantage of digital libraries is that people can gain
access 24/7 to the information.
Multiple access: The same resources can be used simultaneously by a number of institutions
and patrons. This may not be the case for copyrighted material: a library may have a license
for "lending out" only one copy at a time; this is achieved with a system of digital rights
management where a resource can become inaccessible after expiration of the lending
period or after the lender chooses to make it inaccessible (equivalent to returning the
resource).
Information retrieval: The user is able to use any search term (word, phrase, title, name,
subject) to search the entire collection. Digital libraries can provide very user-friendly
interfaces, giving clickable access to its resources.
Preservation and conservation: Digitization is not a long-term preservation solution for
physical collections, but it does provide access copies for materials that would
otherwise degrade from repeated use.
Space: Traditional libraries are limited by storage space; digital libraries have the potential
to store much more information, simply because digital information requires very little
physical space to contain it, and media storage technologies are more affordable than
ever before.
Added value: Certain characteristics of objects, primarily the quality of images, may be
improved. Digitization can enhance legibility and remove visible flaws such as stains and
discoloration.
Ease of access: resources in a digital library are easily accessible.
Image retrieval
An image retrieval system is a computer system for browsing, searching and retrieving
images from a large database of digital images. Most traditional and common methods of
image retrieval utilize some method of adding metadata such as captioning, keywords, or
descriptions to the images so that retrieval can be performed over the annotation words.
Manual image annotation is time-consuming, laborious and expensive; to address this, there
has been a large amount of research done on automatic image annotation. Additionally, the
increase in social web applications and the semantic web have inspired the development of
several web-based image annotation tools.
Image search is a specialized data search used to find images. To search for images, a user
may provide query terms such as keyword, image file/link, or click on some image, and the
system will return images "similar" to the query. The similarity used for search criteria
could be meta tags, color distribution in images, region/shape attributes, etc.
• Image meta search - search of images based on associated metadata such as
keywords, text, etc.
• Content-based image retrieval (CBIR) – the application of computer vision to
image retrieval
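Similarity by color distribution, one of the search criteria mentioned above, can be sketched with a toy content-based retrieval example. The four-bin histograms below are made-up stand-ins for counts a real system would compute from actual pixel data:

```python
# Sketch: "similar to" image search by comparing color histograms.

def normalize(hist):
    total = sum(hist)
    return [h / total for h in hist]

def histogram_intersection(h1, h2):
    """1.0 = identical color distributions, 0.0 = no overlap at all."""
    return sum(min(a, b) for a, b in zip(normalize(h1), normalize(h2)))

# Hypothetical images, each reduced to pixel counts in 4 coarse color bins.
database = {
    "sunset.jpg": [70, 20, 5, 5],
    "beach.jpg":  [60, 25, 10, 5],
    "forest.jpg": [5, 10, 75, 10],
}

def most_similar(query_hist, db):
    return max(db, key=lambda name: histogram_intersection(query_hist, db[name]))

print(most_similar([68, 22, 5, 5], database))   # -> sunset.jpg
```

The query never mentions a keyword; ranking follows entirely from the content-derived features, which is the essential difference between CBIR and image meta search.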
POP3
Post Office Protocol version 3 (POP3) service enables a server to host e-mail accounts for
subscribers of Internet service providers (ISPs) or for small businesses. The most common
use is for home ISP clients or small businesses that cannot locally implement and support a
mail server, Simple Mail Transfer Protocol (SMTP) service, and related mail handling
infrastructure. The POP3 protocol is used to retrieve mail that is being held by a mail server
that offers SMTP service. The POP3 service performs the tasks of downloading messages,
moving the messages from a folder in the file system of a mail server to clients on small
networks, or to stand-alone home computers. Typically, after mail is downloaded from the
mail server, the mail server deletes the mail from its folder.
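The download-then-delete behavior described above can be sketched with Python's standard-library poplib module. The host and credentials are placeholders, and the connection code is shown but not executed here:

```python
# Sketch of a POP3 retrieve-and-delete session using the stdlib poplib module.
import poplib

def summarize_stat(msg_count, mailbox_bytes):
    """Human-readable summary of the (count, size) pair returned by POP3 STAT."""
    return f"{msg_count} message(s), {mailbox_bytes} byte(s) held by the server"

def download_mail(host, user, password, delete_after=True):
    """Connect, download every message, then (typically) delete the server copy."""
    conn = poplib.POP3_SSL(host)      # POP3 over SSL, default port 995
    conn.user(user)
    conn.pass_(password)
    count, size = conn.stat()
    print(summarize_stat(count, size))
    messages = []
    for i in range(1, count + 1):
        _resp, lines, _octets = conn.retr(i)     # fetch message i
        messages.append(b"\r\n".join(lines))
        if delete_after:
            conn.dele(i)              # mark the server copy for deletion
    conn.quit()                       # QUIT commits the deletions
    return messages

# download_mail("pop.example.org", "alice", "secret")   # hypothetical account
```

The `dele` followed by `quit` is what produces the behavior noted above: once the client has downloaded the mail, the server's copy is removed.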
Recent Developments
News-On-Demand
The Canadian Institute for Telecommunication Research (CITR) Broadband Services project studies
and prototypes enabling technologies for distributed multimedia applications using an object-
oriented multimedia database system. Such applications include multimedia news, distance
learning, and home shopping. A successful product of the project is the News-on-Demand
application, which incorporates a distributed database server storing up-to-date
news items from various sources, such as TV, radio, newspapers, and magazines.
News-on-Demand allows subscribers to retrieve one or more news items using an article
query interface. Various news items from different sources are annotated and organized
into multimedia documents by service providers. News-on-Demand uses a distributed
service in which a client can access news over the broadband network.
Video-On-Demand
Multimedia technologies are attracting more and more interest every day. Video-on-
Demand is one of the buzzwords today and is now available to the public. Content
providers such as publishers, broadcasting companies and audio/video production firms
must be able to archive and index their productions for later retrieval. This is a formidable
task and it is even more so when the material to be handled encompasses several media
types and covers a time span of several years. In order for such a vast amount of data to be
easily available, existing multimedia database design models, indexing and retrieval
methodologies and delivery methods have to be improved and refined. In addition, video,
image and audio data cannot be effectively managed with the exclusive use of older,
keyword-based search techniques.
KMeD
The UCLA Knowledge-Based Multimedia Medical Distributed Database (KMeD) project is a
joint project between the Computer Science Department and the Radiological Science
Department.
The KMeD project has five major objectives:
1. Query medical multimedia distributed database by image, alphabetical, and
numerical content.
2. Model temporal, spatial, and evolutionary nature of medical objects.
3. Formulate queries using conceptual and imprecise medical terms to support
cooperative processing.
4. Develop a domain-independent, high-level query language and a medical domain
user interface to support KMeD functionality.
5. Provide analysis and presentation methods for visualization of knowledge and data
models.
The KMeD database is represented by features and objects. Selected objects of interest in
medical images such as X-ray and MRI images are segmented using knowledge-based, model-
guided techniques. As in other multimedia database systems, the features and contents of the
medical images are extracted and stored in a feature-and-content database. In KMeD, Type
Abstraction Hierarchies (TAHs) are used to represent the knowledge about the image in three
structures. The query process supports operators such as "similar to" and "nearby" and
conceptual terms such as "small" and "large" to find approximate matches to the
features and contents. A visual-interface technique has been developed that uses point-
click-and-drag input. This development by UCLA, the KMeD database, can change
the way hospitals maintain and utilize their data.
EtherMed
EtherMed is an experimental database of Internet-accessible multimedia courseware in
health-professions education. EtherMed contains records of educational materials that are
freely accessible on the web, with links to the actual courseware. EtherMed records are
descriptive and non-evaluative. EtherMed is an experiment in collaborative database
development and maintenance. OHPCC staff and collaborating consultants at distant
locations can directly enter, modify, and delete records online. The aim is to determine
whether distributed management of database resources is possible and can become self-
sustaining, and to determine how well database approaches compare to those employing
search engines. Online videoconferencing and other collaboration tools are used to support
this effort. ImageMed is an experimental system that uses:
(1) the Unified Medical Language System (UMLS) Metathesaurus for indexing and retrieval
of medical images, and
(2) a distributed image database.
Future Work
Future requirements of distributed multimedia systems will be even more demanding
than they are now. It is envisaged that users will be highly mobile and will require ubiquitous
access to, and satisfactory presentation of, multimedia data, regardless of the actual
connectivity and the specific presentation device they are currently using (e.g., wire-based
vs. wireless network, high-resolution graphics screen vs. Web-enabled cellular phone).
Moreover, users will expect to find information and multimedia content faster and more
easily, and will interact with the content much more intensively than today.
In this scope, the CODAC project and its sister project aim at realizing a quality-adaptive end-to-
end multimedia system; i.e., we shall provide means for indexing and retrieving multimedia
data by their content and adaptation capabilities, and develop methods to guarantee
quality-adaptive video transport to the client.
To realize such an end-to-end multimedia system, a number of activities must be carried
out; the following are the first work items under development in the CODAC project:
1. Development of a multimedia cartridge in the core of an Oracle 9i DBMS: The advantages
of the cartridge technology, as proposed by Oracle and other database vendors, are
reusability, extensibility and especially a very clean interface to components of the database
system, like query processor, optimizer and page access management. This multimedia
cartridge realizes the meta-database and provides access to the clients for complex search
functionality, supported by advanced indexing structures (like a combination of X-trees, SS-
trees, etc.).
2. Realization of a Processing Unit for MPEG-4 videos: The Processing Unit is supposed to
be situated between the video server and the meta-database and shall extract the necessary
quality adaptation capability information from the A/V streams to be stored as meta-data in
the database. Upon the insertion of the video (the insertion could be on demand or on a
regular basis), the Processing Unit shall apply effects such as transformation and scaling to
the MPEG-4 encoded videos and report results (performance and variation information) to
the meta-database and write back the adapted videos to the video server.
3. Implementation of an indexing structure for the access of MPEG-7 files, possibly in BiM
format: A composition mechanism for different MPEG-7 Access Units shall be developed, and
the mapping process from MPEG-7 Access Units to MPEG-4 Access Units shall be implemented.
4. Realization of the cross-referencing between MPEG-7 and MPEG-4, i.e., how to access
media data from meta-data and vice versa: the indexing structures of the previous work item
shall be employed and integrated into a reference stream which allows efficient access to the
meta-data from the media data. Referencing of MPEG-7 and MPEG-4 will be employed in the
proxy-cache and in the active routers.
The application scenarios we focus upon are sport-event videos, the M3box, and tele-teaching.
Sport events are an interesting application scenario since these videos provide semantically
rich content. The M3box is an adaptive multimedia message box developed by Siemens
Corporate Technology in cooperation with our Institute. The final application is tele-teaching.
This application differs from the previous ones in that here multimedia data plays a supporting
role, rather than being the central data component. Therefore, the meta-database has to keep,
besides descriptive information on the multimedia data, information on the tele-teaching
material and the teaching process.
Future Improvement
The most interesting and exciting thing about multimedia databases is how quickly they are
evolving. This growth, along with the emergence of inexpensive removable storage devices
such as DVDs, which store tens of gigabytes, will ignite an explosion of multimedia
applications. This explosion, in turn, will fuel an intense need for powerful multimedia
databases.
Companies, institutes, and other organizations that interact directly with multimedia
databases
Laboratories and Research Groups
Research in this area involves the representation, indexing, transmission, retrieval and
presentation of multimedia in a distributed environment, content-based retrieval, data
mining, and bioinformatics. The main projects built on multimedia databases are:
MultiStore
MultiStore is a storage infrastructure project to construct, manage, and store
multidimensional data sets. The project will carry out geographic imaging research,
bioinformatics research, pharmacogenomic research, and basic research on large scale data
sets and information visualization. Collaborators include David Mark, Director of the NCGIA,
Nobel laureate Herbert Hauptman, and Norma Nowak, Director of the DNA Micro array
Facility at the Roswell Park Cancer Institute.
NetMedia
NetMedia addresses consistent and robust retrieval, transmission and presentation of
multimedia data in distributed multimedia database systems.
Data Mining and Bioinformatics
While available information retrieval techniques rely on human coding of semantic content,
Zhang's intelligent content analysis analyzes images automatically. One image analyzed
depicts Bell Hall, home to the main office of the Department of Computer Science and
Engineering on UB's North Campus.
Multimedia Database at RMIT
The Multimedia Database Systems Group at RMIT is one of Australia's leading centers of
research in electronic document management. The group develops systems for document
management and undertakes consultation within the wider area of database management
and information retrieval.
The TeraText Database System
RMIT MDS is responsible for the development and maintenance of the TeraText database
system (formerly called the Structured Information Manager (SIM)), a text database system
with native XML support. The TeraText DBS is designed for building web-based, text-
intensive applications for large enterprises. The TeraText product is now sold in both
Australia and the USA and is used by a large number of organizations in both the public and
private sectors.
DELTA Collaborative Multimedia Database Project
The DELTA Collaborative Multimedia Database Project is designed to provide faculty and
students with access to a rich diversity of digital instructional materials (DIM) for self study
and for use by faculty in developing courseware and electronic presentations for
instructional purposes. The materials have copyright protection, but are available for non-
commercial use without charge. Currently, databases of digital still images, sounds, and
video have been developed from analog materials provided by faculty who have certified
copyright on the original work. The material contained in this database may only be used
for non-profit, educational purposes. Any other use is prohibited without the express
written consent of the Trustees of the California State University or the copyright owner.
World War II Multimedia Database
Originally completed as its author's thesis at Fordham University, the World War II Multimedia
Database is dedicated to the 50 million people killed in the Second World War. Currently the
World War II Multimedia Database has 1850 photos, 93 video clips, and a virtual radio.
Distributed Databases
In today’s world of universal dependence on information systems, all sorts of people need
access to companies’ databases. In addition to a company’s own employees, these include
the company’s customers, potential customers, suppliers, and vendors of all types. It is
possible for a company to have all of its databases concentrated at one mainframe computer
site with worldwide access to this site provided by telecommunications networks, including
the Internet. Although the management of such a centralized system and its databases can
be controlled in a well-contained manner and this can be advantageous, it poses some
problems as well. For example, if the single site goes down, then everyone is blocked from
accessing the databases until the site comes back up again. Also, the communications costs
from the many far-flung PCs and terminals to the central site can be expensive. One solution to
such problems, and an alternative design to the centralized database concept, is known as
distributed database. The idea is that instead of having one, centralized database, we are
going to spread the data out among the cities on the distributed network, each of which has
its own computer and data storage facilities. All of this distributed data is still considered to
be a single logical database. When a person or process anywhere on the distributed
network queries the database, it is not necessary to know where on the network the data
being sought is located. The user just issues the query, and the result is returned. This
feature is known as location transparency. This can become rather complex very quickly,
and it must be managed by sophisticated software known as a distributed database
management system or distributed DBMS.
A distributed database (DDB) is a collection of multiple, logically interrelated databases
distributed over a computer network.
A distributed database management system (DDBMS) is the software that manages the
DDB, and provides an access mechanism that makes this distribution transparent to the
user.
Collections of data (e.g. in a database) can be distributed across multiple physical locations.
A distributed database can reside on network servers on the Internet, on corporate
intranets or extranets, or on other company networks. The replication and distribution of
databases improves database performance at end-user worksites.
Although there are a number of advantages to using a distributed DBMS, there are also a
number of problems and implementation issues. Finally, data in a distributed DBMS can be
partitioned or replicated or both.
Homogeneous distributed databases
A homogeneous distributed database is one where all sites are homogeneous, or similar:
all sites have identical software, are aware of each other, and agree to cooperate in
processing user requests.
Heterogeneous distributed databases
In a heterogeneous distributed database system, at least one of the databases uses different
schemas and software.
A database system having different schema may cause a major problem for query
processing. A database system having different software may cause a major problem for
transaction processing. Therefore, special attention needs to be paid to integrate a
heterogeneous distributed database.
Methods to achieve a distributed database
A distributed database is one that is held in several locations. There are three methods of
achieving this:
1. Each remote processor holds data on its own customers, stock, etc. The database is
uploaded each night and changes are made.
2. The entire database is duplicated at the remote site. The database is uploaded each
night and changes are made.
3. The central database contains only an index of entries. The actual records are held
at the remote site. A query to the central database will locate where the record is held. This
system is used by very large databases.
Fragmentation
In a DDBS we need to define the logical unit of database distribution and allocation. In some
cases it may be more efficient to split the tables into smaller units (fragments) and allocate
them to different sites.
There are three types of fragmentation:
1. Horizontal Fragmentation
1. Horizontal Fragmentation
In horizontal partitioning, the rows of a table are divided up among several sites on the
network. Each such partition must include the primary key attribute(s) of the table.
An example of horizontal fragmentation is an employees table: it makes sense for the
company to split the table into partitions based on the site where each employee works.
This makes management, queries, and transactions convenient and efficient. The downside
of this choice is that a query involving all records has to request all partitions from all sites
and take their union.
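The trade-off above can be sketched with two in-memory SQLite databases standing in for two sites; all table, column, and site names are illustrative, not taken from any particular system.

```python
import sqlite3

# Two in-memory databases stand in for two sites holding horizontal
# fragments of the same employee table.
mumbai = sqlite3.connect(":memory:")
delhi = sqlite3.connect(":memory:")

for site, rows in [(mumbai, [(1, "Asha", "Mumbai"), (2, "Ravi", "Mumbai")]),
                   (delhi,  [(3, "Kiran", "Delhi")])]:
    site.execute("CREATE TABLE employee "
                 "(emp_id INTEGER PRIMARY KEY, name TEXT, site TEXT)")
    site.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)

# A site-local query touches only one fragment ...
local = mumbai.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# ... but a global query must fetch every fragment and union the results.
all_rows = (mumbai.execute("SELECT * FROM employee").fetchall()
            + delhi.execute("SELECT * FROM employee").fetchall())
print(local, len(all_rows))   # 2 3
```

Because each fragment carries the full primary key, the union reconstructs the original table without duplicates.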
2. Vertical Fragmentation
In vertical partitioning, the columns of a table are divided up among several sites on the
network. Each such partition must include the primary key attribute(s) of the table. This
arrangement can make sense when different sites are responsible for processing different
functions involving an entity. For example, the salary attributes of a personnel table might
be stored in one city while the skills attributes of the table might be stored in another city.
Both partitions would include the employee number, the primary key of the full table. A
downside of this option is that a query involving the entire table would have to request all
partitions from all sites and join them.
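A minimal sketch of reconstructing a vertically partitioned record, using one SQLite connection for brevity (in a real DDBS the two fragments would live at different sites); the fragment and column names are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Each vertical fragment repeats the primary key of the full table.
db.execute("CREATE TABLE emp_salary (emp_id INTEGER PRIMARY KEY, salary INTEGER)")
db.execute("CREATE TABLE emp_skills (emp_id INTEGER PRIMARY KEY, skill TEXT)")
db.execute("INSERT INTO emp_salary VALUES (1, 50000)")
db.execute("INSERT INTO emp_skills VALUES (1, 'SQL')")

# Reconstructing the full record requires a join on the shared key.
row = db.execute("""SELECT s.emp_id, s.salary, k.skill
                    FROM emp_salary s
                    JOIN emp_skills k ON s.emp_id = k.emp_id""").fetchone()
print(row)  # (1, 50000, 'SQL')
```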
3. Hybrid Fragmentation
In this scheme, the table is divided into arbitrary blocks based on the requirements at hand.
Each fragment can then be allocated to a specific site.
This is the most complex type of fragmentation and requires the most management.
Query Processing
A DDBS adds processing expenses beyond those of a conventional centralized DBS, because
of the additional hardware and software needed to handle the distribution. These expenses
appear mainly as the cost of data transfer over the network: the data transferred may be
intermediate files produced at local sites, or final results that must be sent back to the site
that issued the query. Database designers are therefore concerned with query optimization,
which targets minimizing the cost of transferring data across the network. One optimization
method, known as a semijoin, has relation R1 send only its join column CR1 to the site
holding the target relation R2; that site performs the join on CR1, projects the required
attributes, and ships the resulting tuples back to R1's site for further processing.
This can significantly enhance the query efficiency, since the data transferred on the
network is minimized.
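The optimization described above (often called a semijoin) can be sketched with plain Python lists; the relation names R1 and R2 follow the text, everything else is illustrative.

```python
# Relation fragments held at two different sites.
r1 = [(1, "order-a"), (2, "order-b"), (3, "order-c")]    # at site 1
r2 = [(2, "shipped"), (3, "pending"), (9, "cancelled")]  # at site 2

# Step 1: site 1 ships only the join column, not the whole relation.
join_column = {key for key, _ in r1}

# Step 2: site 2 joins locally and returns only the matching tuples.
matches = [row for row in r2 if row[0] in join_column]

# Step 3: site 1 completes the join with the reduced result.
result = [(k, payload, status)
          for k, payload in r1
          for k2, status in matches if k == k2]
print(result)  # [(2, 'order-b', 'shipped'), (3, 'order-c', 'pending')]
```

Only the join column and the matching tuples cross the network, rather than either relation in full.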
Concurrency and Recovery
The design of concurrency and recovery in a DDBS has to consider different aspects beyond
those of a centralized DBS. These aspects include:
• Multiple copies of data: concurrency control has to keep the copies consistent.
Recovery, on the other hand, has to bring a copy back into consistency with the others
whenever a site recovers from a failure.
• Failure of communication links
• Failure of individual sites
• Distributed commit: some sites may fail during transaction commit, so the two-phase
commit protocol is used to solve this problem.
• Deadlocks on multiple sites.
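The two-phase commit mentioned above can be sketched in a few lines; the site objects and their vote behaviour here are purely illustrative, not a real protocol implementation.

```python
def two_phase_commit(sites):
    # Phase 1 (prepare): ask every participant to vote.
    votes = [site["prepare"]() for site in sites]
    # Commit only if the vote is unanimous; otherwise abort.
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (decision): broadcast the outcome to every site's log.
    for site in sites:
        site["log"].append(decision)
    return decision

healthy = {"prepare": lambda: True, "log": []}
failed = {"prepare": lambda: False, "log": []}

print(two_phase_commit([healthy, {"prepare": lambda: True, "log": []}]))  # commit
print(two_phase_commit([healthy, failed]))                                # abort
```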
The following two sections describe two suggestions to manage concurrency control.
1. Distinguished Copy of a Data Item
There are three variations to this method:
• Primary site
In this method, a single site is designated as the coordinator site. All locks and unlocks for
all data units are controlled by this site. One advantage is that it is easy to implement.
However, two downsides of this method are the overloading of the coordinator site, and the
fact that this site forms a single point of failure for the entire DDBS.
• Primary site with backup site
This technique addresses the second disadvantage of the primary site technique by
designating a backup site that can take over as the new coordinator in case of failure; when
this happens, another backup site has to be selected.
• Primary copy technique
This method distributes the load to the sites that have a designated primary copy of a data
unit, as opposed to centralizing all data units at one coordinator site. This way, if a site goes
down, only transactions involving the primary copies residing on that site will be
affected.
2. Voting
This method does not designate any distinguished copy or site as coordinator, as the two
methods described above do. When a site attempts to lock a data unit, it must send requests
to all sites holding a copy of that unit, asking them to lock their copy. If the requesting
transaction is not granted the lock by a majority vote of the sites, it fails and sends a
cancellation to all of them. Otherwise it keeps the lock and informs all sites that it has been
granted the lock.
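The majority-vote rule can be sketched as follows; the site dictionaries and their fixed grant/deny behaviour are simulated stand-ins for real lock managers.

```python
def request_lock(sites_with_copy):
    """Grant the lock only if a strict majority of copy-holding sites agree."""
    grants = sum(1 for site in sites_with_copy if site["grants_lock"])
    if grants > len(sites_with_copy) / 2:
        return "locked"     # inform all sites that the lock is held
    return "cancelled"      # send a cancellation to all sites

# Two of three sites grant the request: the transaction gets the lock.
sites = [{"grants_lock": True}, {"grants_lock": True}, {"grants_lock": False}]
print(request_lock(sites))  # locked
```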
Recovery
The first step in dealing with a failure is to identify that it occurred, what type it was, and at
which site it happened. Distributed recovery then relies on mechanisms such as database
logs, update protocols, and transaction failure recovery protocols.
Advantages of Distributed Database
• Management of distributed data with different levels of transparency, such as
fragmentation transparency and replication transparency.
• Increased reliability and availability: Reliability is the probability that the system is
up at a given moment; availability is the probability that the system is up continuously
during a given time period. Both of these important system parameters improve with a
DDBS. In a centralized DBS, if any component of the DB goes down, the entire system goes
down, whereas in a DDBS only the affected site is down and the rest of the system is
unaffected. Furthermore, if the data is replicated at different sites, the effects of a failure are
greatly minimized.
• Easier expansion.
• Reflects organizational structure: Database fragments are located in the
departments they relate to.
• Local autonomy or site autonomy: A department can control its own data, as its
members are the ones most familiar with it.
• Protection of valuable data: If there were ever a catastrophic event such as a fire, all
of the data would not be in one place, but distributed in multiple locations.
• Improved performance: Data is located near the site of greatest demand, and the
database systems themselves are parallelized, allowing load on the databases to be
balanced among servers. (A high load on one module of the database won't affect other
modules of the database in a distributed database.)
• Economics: It costs less to create a network of smaller computers with the power of
a single large computer.
• Modularity: Systems can be modified, added and removed from the distributed
database without affecting other modules (systems).
• Reliable transactions, due to replication of the database.
• Hardware, Operating System, Network, Fragmentation, DBMS, Replication and
Location Independence.
• Continuous operation.
• Distributed Query processing.
• Distributed Transaction management.
Disadvantages of Distributed Databases
• Complexity: Extra work must be done by the DBAs to ensure that the distributed
nature of the system is transparent. Extra work must also be done to maintain multiple
disparate systems, instead of one big one. Extra database design work must also be done to
account for the disconnected nature of the database — for example, joins become
prohibitively expensive when performed across multiple systems.
• Economics: Increased complexity and a more extensive infrastructure means extra
labour costs.
• Security: Remote database fragments must be secured, and they are not centralized
so the remote sites must be secured as well. The infrastructure must also be secured (e.g.,
by encrypting the network links between remote sites).
• Difficulty maintaining integrity: In a distributed database, enforcing integrity
constraints over a network may require too much of the network's resources to be feasible.
• Inexperience: distributed databases are difficult to work with, and as a young field
there is not much readily available experience on proper practice.
• Lack of standards: There are no tools or methodologies yet to help users convert a
centralized DBMS into a distributed DBMS.
• Database design is more complex: Besides the normal difficulties, the design of a
distributed database has to consider fragmentation of data, allocation of fragments to
specific sites, and data replication.
• Additional software is required.
• Operating System should support distributed environment.
• Concurrency control: This is a major issue; it is addressed by locking and timestamping.
Document Oriented Database
A document-oriented database is a computer program designed for storing, retrieving, and
managing document-oriented information, also known as semi-structured data. Document-oriented
databases are one of the main categories of so-called NoSQL databases and the popularity of
the term "document-oriented database" (or "document store") has grown with the use of
the term NoSQL itself.
Documents
The central concept of a document-oriented database is the notion of a Document. While
each document-oriented database implementation differs on the details of this definition, in
general, they all assume documents encapsulate and encode data (or information) in some
standard format(s) (or encoding(s)). Encodings in use include XML, YAML, JSON and BSON,
as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so
on).
Documents inside a document-oriented database are similar, in some ways, to records or
rows in relational databases, but they are less rigid. They are not required to adhere to a
standard schema, nor will they all have the same sections, slots, parts, keys, or the like. For
example, here is a document:
FirstName="Bob", Address="5 Oak St.", Hobby="sailing".
Another document could be:
FirstName="Jonathan", Address="15 Wanamassa Point Road",
Children=[{Name:"Michael",Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5},
{Name:"Elena", Age:2}].
Both documents have some information in common and some that differs. Unlike a
relational database, where each record would have the same set of fields and unused fields
might be kept empty, there are no empty 'fields' in either document (record) here. This
system allows new information to be added without requiring that omitted pieces of
information be explicitly declared.
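The two example documents above can be held schema-free as plain Python dicts; the store and key names are illustrative. Neither record carries empty placeholder fields, and a field can be added to one document without touching the other.

```python
store = {}
store["doc1"] = {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"}
store["doc2"] = {"FirstName": "Jonathan",
                 "Address": "15 Wanamassa Point Road",
                 "Children": [{"Name": "Michael", "Age": 10},
                              {"Name": "Jennifer", "Age": 8},
                              {"Name": "Samantha", "Age": 5},
                              {"Name": "Elena", "Age": 2}]}

# A new field appears only on the document it was written to; no schema
# change and no empty slot is created anywhere else.
store["doc1"]["Boat"] = "sloop"
print("Boat" in store["doc1"], "Boat" in store["doc2"])  # True False
```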
Keys, Retrieval, and Organization
Keys
Documents are addressed in the database via a unique key that represents that document.
Often, this key is a simple string. In some cases, this string is a URI or path. Regardless, you
can use this key to retrieve the document from the database. Typically, the database retains
an index on the key such that document retrieval is fast.
Retrieval
One of the other defining characteristics of a document-oriented database is that, beyond
the simple key-document (or key-value) lookup that you can use to retrieve a document, the
database will offer an API or query language that will allow you to retrieve documents
based on their contents. For example, you may want a query that gets you all the documents
with a certain field set to a certain value. The set of query APIs or query language features
available, as well as the expected performance of the queries, varies significantly from one
implementation to the next.
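The two retrieval paths described above can be sketched with a dict standing in for the store; the keys, field names, and values are all hypothetical. Key lookup is a direct index access, while a content query scans documents for a field/value match, as a document store's query API would.

```python
db = {
    "user:1": {"name": "Bob", "city": "Pune"},
    "user:2": {"name": "Asha", "city": "Mumbai"},
    "user:3": {"name": "Ravi", "city": "Pune"},
}

# Key-based retrieval: one indexed lookup.
by_key = db["user:2"]

# Content-based retrieval: find every document with a given field value.
by_content = [key for key, doc in db.items() if doc.get("city") == "Pune"]

print(by_key["name"], by_content)  # Asha ['user:1', 'user:3']
```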
Organization
Implementations offer a variety of ways of organizing documents, including notions of
▪ Collections
▪ Tags
▪ Non-visible Metadata
▪ Directory hierarchies
XML database implementations
An XML database is a data persistence software system that allows data to be stored in XML
format. This data can then be queried, exported and serialized into the desired format.
Two major classes of XML database exist:
▪ XML-enabled: these map all XML to a traditional database (such as a relational
database), accepting XML as input and rendering XML as output. This term implies that the
database does the conversion itself (as opposed to relying on middleware).
▪ Native XML (NXD): the internal model of such databases depends on XML and uses XML
documents as the fundamental unit of storage, which are, however, not necessarily stored in
the form of text files.
Rationale for XML in databases
O'Connell gives one reason for the use of XML in databases: the increasingly common use of
XML for data transport, which has meant, "data is extracted from databases and put into
XML documents and vice-versa". It may prove more efficient (in terms of conversion costs)
and easier to store the data in XML format.
Native XML databases
The term "native XML database" (NXD) can lead to confusion. Many NXDs do not function as
standalone databases at all, and do not really store the native (text) form.
The formal definition from the XML:DB initiative (which appears to be inactive since 2003)
states that a native XML database:
Defines a (logical) model for an XML document — as opposed to the data in that document
— and stores and retrieves documents according to that model. At a minimum, the model
must include elements, attributes, PCDATA, and document order. Examples of such models
include the XPath data model, the XML Infoset, and the models implied by the DOM and the
events in SAX 1.0.
Has an XML document as its fundamental unit of (logical) storage, just as a relational
database has a row in a table as its fundamental unit of (logical) storage.
Need not have any particular underlying physical storage model. For example, NXDs can use
relational, hierarchical, or object-oriented database structures, or use a proprietary storage
format (such as indexed, compressed files).
Additionally, many XML databases provide a logical model of grouping documents, called
"collections". Databases can set up and manage many collections at one time. In some
implementations, a hierarchy of collections can exist, much in the same way that an
operating system's directory-structure works.
All XML databases now support at least one form of querying syntax. Minimally, just about
all of them support XPath for performing queries against documents or collections of
documents. XPath provides a simple pathing system that allows users to identify nodes that
match a particular set of criteria.
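Python's standard-library ElementTree supports a limited XPath subset, which is enough to show node selection by criteria as described above; the document and element names are made up for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<library>
  <book year="1999"><title>A</title></book>
  <book year="2011"><title>B</title></book>
</library>""")

# Select the titles of all books published in 2011 via an attribute predicate.
titles = [t.text for t in doc.findall("./book[@year='2011']/title")]
print(titles)  # ['B']
```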
In addition to XPath, many XML databases support XSLT as a method of transforming
documents or query-results retrieved from the database. XSLT provides a declarative
language written using an XML grammar. It aims to define a set of XPath filters that can
transform documents (in part or in whole) into other formats, including plain text, XML, or
HTML.
Many XML databases also support XQuery to perform querying. XQuery includes XPath as a
node-selection method, but extends XPath to provide transformational capabilities. Users
sometimes refer to its syntax as "FLWOR" (pronounced 'Flower') because the query may
include the following clauses: 'for', 'let', 'where', 'order by' and 'return'. Traditional RDBMS
vendors, who historically offered SQL-only engines, now ship hybrid SQL/XQuery engines.
These hybrid engines make it possible to query XML data alongside relational data in the
same query expression, helping to combine relational and XML data.
Some XML databases support an API called the XML:DB API (or XAPI) as a form of
implementation-independent access to the XML datastore. In XML databases, XAPI
resembles ODBC and JDBC as used with relational databases. On 24 June 2009, the Java
Community Process released the final version of the XQuery API for Java specification
(XQJ) - "a common API that allows an application to submit queries conforming to the W3C
XQuery 1.0 specification and to process the results of such queries".
Implementations of the Databases
Each entry lists name, publisher, license, implementation language, and notes, where available:
▪ Lotus Notes – IBM – Proprietary
▪ askSam – askSam Systems – Proprietary
▪ Apstrata – Apstrata – Proprietary
▪ Datawasp – Significant Data Systems – Proprietary
▪ Clusterpoint – Clusterpoint Ltd. – Free community license / Commercial – C++. Scalable, high-performance, schema-free, document-oriented database management system platform with server-based data storage, fast full-text search engine functionality, information ranking for search relevance, and clustering.
▪ CRX – Day Software – Proprietary
▪ MUMPS Database – Proprietary and GNU Affero GPL – MUMPS. Commonly used in health applications.
▪ UniVerse – Rocket Software – Proprietary
▪ UniData – Rocket Software – Proprietary
▪ Jackrabbit – Apache Software Foundation – Apache License – Java
▪ CouchDB – Couchbase, Apache Software Foundation – Apache License – Erlang. JSON over REST/HTTP with multi-version concurrency control and ACID properties; uses map and reduce for views and queries.
▪ FleetDB – FleetDB – MIT License – Clojure. A JSON-based schema-free database optimized for agile development.
▪ MongoDB – GNU AGPL v3.0 – C++. Fast, document-oriented database optimized for highly transient data.
▪ GemFire Enterprise – VMWare – Commercial – Java, .NET, C++. Memory-oriented, fast key-value database with indexing and querying support.
▪ OrientDB – Orient Technologies – Apache License – Java. JSON over HTTP.
▪ RavenDB – RavenDB – Commercial or GNU AGPL v3.0 – .NET. A .NET LINQ-enabled document database, focused on providing a high-performance, transactional, schema-less, flexible, and scalable NoSQL data store for the .NET and Windows platforms.
▪ Redis – BSD License – ANSI C. Key-value store supporting lists and sets, with a fast, simple, binary-safe protocol.
▪ StrokeDB – MIT License. Alpha software.
▪ Terrastore – Apache License – Java. JSON/HTTP.
▪ ThruDB – BSD License – C++, Java. Built on top of the Apache Thrift framework; provides indexing and document storage services for building and scaling websites. An alternate implementation is being developed in Java. Alpha software.
▪ Persevere – Persevere – BSD License. A JSON database and JavaScript application server; provides a RESTful JSON interface for create, read, update, and delete access to data; also supports JSONQuery/JSONPath querying.
▪ DBSlayer – DBSlayer – Apache License – C. Database abstraction layer (over MySQL) used by the New York Times; JSON over HTTP.
▪ Eloquera DB – Eloquera – Proprietary – .NET. High performance; based on dynamic objects; supports LINQ and SQL queries.
Mobile Database
A mobile database is a database that can be connected to by a mobile computing device over
a mobile network. The client and server have wireless connections. A cache is maintained to
hold frequent data and transactions so that they are not lost due to connection failure. A
database is a structured way to organize information. This could be a list of contacts, price
information or distance travelled.
The use of laptops, mobile phones, and PDAs is increasing and likely to increase in the
future, with more and more applications residing on mobile systems. While analysts cannot
tell us exactly which applications will be the most popular, it is clear that a large percentage
will require a database of some sort. Many applications would require the ability to
download information from an information repository and operate on this information
even when out of range or disconnected.
An example of this is a mobile workforce. In this scenario, users need to access and update
information from files in their home directories on a server, or customer records from
a database. This type of access and workload generated by such users is different from the
traditional workloads seen in client–server systems of today. With the advent of mobile
databases, now users can load up their smart phones or PDAs with mobile databases to
exchange mission-critical data remotely without worrying about time or distance. Mobile
databases let employees enter data on the fly. Information can be synchronized with a
server database at a later time.
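The store-and-forward pattern described above can be sketched as a small class; the class, method, and key names are all hypothetical. Writes succeed against the local cache while disconnected and are pushed to the server database during a later synchronization.

```python
class MobileCache:
    """Toy on-device cache that queues writes until a connection returns."""

    def __init__(self):
        self.local = {}      # data usable on the device right away
        self.pending = []    # writes not yet seen by the server

    def write(self, key, value):
        # Works offline: the write lands locally and is queued for later.
        self.local[key] = value
        self.pending.append((key, value))

    def synchronize(self, server_db):
        # Called when back in range: drain the queue into the server DB.
        while self.pending:
            key, value = self.pending.pop(0)
            server_db[key] = value

cache, server = MobileCache(), {}
cache.write("visit:42", "signed")   # entered in the field, offline
cache.synchronize(server)
print(server)  # {'visit:42': 'signed'}
```

A real synchronizer would also resolve conflicting updates made on the server in the meantime; this sketch assumes the device is the only writer.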
Embedded Database
An embedded database system is a database management system (DBMS), which is tightly
integrated with application software that requires access to stored data, such that the
database system is “hidden” from the application’s end-user and requires little or no
ongoing maintenance. It is actually a broad technology category that includes database
systems with differing application programming interfaces (SQL as well as proprietary,
native APIs); database architectures (client/server and in-process); storage modes (on-disk,
in-memory and combined); database models (relational, object-oriented, Entity-Attribute-
Value model and network/CODASYL); and target markets. The term "embedded database"
can be confusing because only a small subset of embedded database products is used in
real-time embedded systems such as telecommunications switches and consumer
electronics devices.
Mobile and Embedded DBMS Characteristics
The data access and management requirements of the applications described above are
significantly different from those of traditional server DBMSs. These new applications must
be able to run on multiple tiers ranging from devices to servers to the web, and would
benefit from various existing database mechanisms. However, these mechanisms (such as
query processing, indexing, and persistence) must be unlocked from the traditional
monolithic DBMS and made available as embeddable components (e.g. DLLs) that can be
embedded within applications, enabling them to meet the requirements described above.
Such mobile and embedded DBMSs have the following characteristics:
1. Embeddable in applications – Mobile and Embedded DBMSs form an integral part of the
application or the application infrastructure, often requiring no administration. Database
functionality is delivered as part of the application (or app infrastructure). While the
database must be embeddable as a DLL in applications, it must also be possible to deploy it
as a stand-alone DBMS with support for multiple transactions and applications.
2. Small footprint – For many applications, especially those that are downloadable, it is
important to minimize the DBMS footprint. Since the database system is part of the
application, the size of the DBMS affects the overall application footprint. In addition to a
small footprint, short code paths are desirable for efficient application execution. Most of
these applications do not require the full functionality of commercial DBMSs; they require
simple querying and execute in constrained environments.
3. Run on mobile devices – The DBMSs that run on mobile devices tend to be specialized
versions of mobile and embedded DBMSs. In addition to handling the memory, disk and
processor limitations of these devices, the DBMS must also run on specialized operating
systems. The DBMS must be able to store and forward data to the back-end databases as
synchronization with backend systems is critical for them.
4. Componentized DBMS – Often, to support the small-footprint requirement, it is important
to include only the functionality that the application requires. For example, many simple
applications just require ISAM-like record-oriented access; for these, there is no need to
include a query processor, which would only increase the footprint. Similarly,
many mobile and mid-tier applications require only a small set of relational operators while
others require XML access and not relational access. So, it should be possible to pick and
choose the desired components.
5. Self-managed DBMS – The embedded DBMS is invisible to the application user. There is
no DBA to manage the database, so operations like backups, recovery, indexing, and tuning
cannot be initiated by a DBA. If the database crashes, recovery must start instantaneously.
The database must be self-managed or managed by the application. Also, an embedded
DBMS must install automatically with the application; it should not require explicit
(user-initiated) or independent installation. Similarly, when the application is shut down,
the DBMS must transparently shut down.
6. In-Memory DBMS – These are specialized DBMSs serving applications that require high
performance on data that is small enough to be contained in main memory. In-memory
DBMSs require specialized query processing and indexing techniques that are optimized for
main memory usage. Such DBMSs also can support data that may never get persisted.
7. Portable databases – Many applications require very simple deployment: installing the
application should install the database associated with it. This requires the database to be
highly portable. Typically, single-file databases (like Microsoft Access databases) are ideally
suited for this purpose. Again, there should be no need to install the DBMS separately –
installing the application installs the DBMS, and copying the database file completes the
application migration.
8. No code in the database – A portable database must also be safe. Executable code can
carry viruses or other threats; by eliminating code storage in the database, it can be made
safer and more portable.
9. Synchronize with back-end data sources – In mobile and cached scenarios, it must be
possible to synchronize the data with the back-end data sources. In typical mid-tier
(application server) caches, data is fetched from the back-end databases into the cache,
operated on, and synchronized with the back-end database.
10. Remote management – While mobile and embedded DBMSs must be self-managed, it is
also important to allow them to be managed remotely, especially those on mobile devices.
In enterprises (e.g. FedEx, UPS), mobile devices must be configured and managed in a
manner compliant with company standards, so centralized remote management of these
devices is necessary.
11. Custom programming interfaces – An important use of embedded DBMSs is in
specialized data-centric applications. Such applications use a variety of data models (e.g.
relational, XML, streams, analytics) and query languages. The embedded DBMS must
therefore be componentized and extensible to allow (application) domain-specific query
languages and programming surfaces.
Conclusion
With the increased interest in specialized applications and database systems (such as
streaming, analytics, mobile, and sensor networks), the need for mobile and embedded
DBMSs is growing. There have been mobile and embedded DBMS products on the market
that satisfy some of the characteristics described above, but they were designed as low-end
database products and do not factor in recent hardware and application trends. For
example, some products focused on a small footprint but were not designed with
componentization for embedded applications; others provide only embedded (well
componentized) storage engines without any query processing components.
The hardware trends in multi core CPUs, large main memory, and NVRAMs (e.g. flash) are
beginning to disrupt the traditional database architectures. Particularly, advances in large
memory and flash storage open doors for further research in database architectures. The
componentized mobile and embedded DBMSs are in a position to adapt to these changing
hardware and software trends more rapidly.