• Academic Year 2014 Spring

Module CC3005NI: Advanced Database Systems

“Distributed Database (DDB) and Data Mining (DM)”


Definition of Distributed Database:

A Distributed Database (DDB) is a single logical database that is physically distributed across computers in multiple locations connected by a computer network.

A Distributed Database Management System (DDBMS) is the software system that manages the DDB and makes the distribution transparent to users.

Distributed vs. Decentralized DBs:

A decentralised database is also stored on computers at multiple locations. However, the computers are not connected by a network, so users at different locations cannot share data.

A decentralised database is therefore best regarded as a collection of independent databases rather than as a single database that is geographically distributed.

Advantages of Distributed Databases:

Reflects the distributed nature of some database applications: Many database applications are naturally distributed over several locations (e.g. a company may have offices in different cities).

Increased reliability and availability: When a centralised system fails, the database is unavailable to all users. A distributed system, however, can continue to function at a reduced level even when a component fails.

Local control: Data distribution in a distributed database encourages local groups to exercise greater control over “their” data. This promotes improved data integrity and administration, while users can still access non-local data when necessary.

Lower communication costs: With a distributed system, data can be located closer to the point of use. This can reduce communication costs compared with a centralised system.


Fast response and improved performance: Distributing a large database over multiple sites leaves a smaller database at each site. Some user queries at a particular site may need to access only the smaller database stored locally, which speeds up query processing and improves database performance. It may also be possible to decompose complex queries into sub-queries that can be processed in parallel at several different sites.
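The idea of decomposing a query into sub-queries that run in parallel can be sketched as follows. This is only an illustration: the site names and salary figures are invented, and threads stand in for what would be separate database sites in a real DDBMS.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical salary data held at three sites (all values are made up).
SITE_SALARIES = {
    "site_a": [50000, 42000],
    "site_b": [61000],
    "site_c": [38000, 45000, 52000],
}

def local_subquery(site):
    """Sub-query evaluated at one site: partial sum and count of local salaries."""
    salaries = SITE_SALARIES[site]
    return sum(salaries), len(salaries)

def average_salary():
    """Global query: run one sub-query per site in parallel, then combine."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(local_subquery, SITE_SALARIES))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count
```

Each site returns only a small partial result (a sum and a count), so the combining step needs far less data than shipping every row to one place.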


Interconnection of existing databases: When several databases already exist in an organisation and the need arises to run global applications, a distributed database often offers a natural solution by integrating and interconnecting the pre-existing local databases.


Disadvantages of Distributed Databases:

Software cost and complexity: These are far greater for a distributed environment than for a centralised system.

Communication / processing overheads: These are much greater because the various sites must exchange messages and perform additional calculations to ensure proper co-ordination among the sites. There is always a risk that the communication overhead will degrade system responsiveness and performance.

Data integrity and consistency: Given the increased complexity of the system and the need for co-ordination among multiple sites, additional control mechanisms are required to prevent improper updating of data and to avoid other problems of data integrity and consistency.

Centralized Schematic View:

The external schema describes the database views of a set of user groups. Each view typically describes the part of the database that a particular user group is interested in and hides the rest of the database from that group.

The conceptual schema is a global description of the database that hides the details of physical storage structures and concentrates on describing entities, data attributes, relationships and constraints.

The internal schema is a global description of the physical storage structures of the database.


Distributed Schematic View:

At the top level, a distributed database acts conceptually as a single centralised database. The global conceptual schema, which defines all the data contained in the distributed database, therefore presents end users with a unified view of the data for the complete distributed system.

Each set of end users interacts with the global schema through its own local schema, each conforming to the ANSI-SPARC three-level architecture. These users' external schemas are subsets of the global conceptual schema.

Note that in a centralised database the external schemas use subsets of the conceptual schema, whilst in a distributed environment the local external schemas describe subsets of the global conceptual schema.


Issues of Distributing a Database:

Data Replication: A DDBMS may hold multiple copies of data at several different sites.

Data Fragmentation (Partition): A relation may be divided into a number of sub-relations (fragments), which are then distributed.

Distribution Transparency: This allows users to perceive the distributed database as a single, logical entity.

Data Replication:

A desirable property of DDBs is the ability to keep a local repository of frequently used data while still being able to access data stored at other network sites.

Replicated Database: a distributed database where (some) stored data is duplicated at various sites.

Fully Replicated Database: a distributed database where all stored data is duplicated and allocated to all sites.

Data Replication – Key Points:

Replication affects read-only and update applications differently:

Read-only applications benefit from replication, which makes it more likely that they can reference data locally.

Update applications may run into problems because of replication, since they must update all copies in order to preserve data consistency.

Replicated data enhances locality of reference by satisfying more read-only queries locally. This reduces query response time and traffic on the communication network.

Replicated data provides greater reliability through backup copies from which data can be recovered in the event of media failure.

Data Replication – Update Strategies:

Proper provision must be made for update operations in a distributed database where data replication is used. Two of the update strategies are:

1. Unanimous Agreement Update Strategy

Updates are refused unless they have unanimous acceptance from all sites containing a replica.

To present a single-copy image of replicated files, updates are propagated to all replicas immediately.

Unanimous acceptance of the proposed update by all sites holding replicas is necessary before modifications are made, and all those sites must be available for this to happen.
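A minimal sketch of the unanimous agreement strategy is given below. The `Replica` class, its `vote` method and the rule that a site accepts whenever it is available are assumptions made purely for illustration; a real DDBMS would use a proper commit protocol.

```python
class Replica:
    """Hypothetical replica site holding a copy of the data."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}

    def vote(self, key, value):
        # Assumption: a site accepts an update whenever it is reachable.
        return self.available

    def apply(self, key, value):
        self.data[key] = value


def unanimous_update(replicas, key, value):
    """Apply the update to every replica only if all replicas accept it."""
    if not all(replica.vote(key, value) for replica in replicas):
        return False  # refused: at least one site rejected or was unavailable
    for replica in replicas:
        replica.apply(key, value)  # propagate to all copies immediately
    return True
```

With three available replicas, `unanimous_update(replicas, "salary", 50000)` updates every copy; if any one replica is marked unavailable, the update is refused and no copy changes.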

2. Single Primary Update Strategy

One replica is designated as PRIMARY and the remaining replicas as SECONDARY.

Update requests are issued to the primary replica, which serialises all updates to the secondary replicas and thereby preserves data consistency.
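The single primary strategy can be sketched in a similar way. The class names, the sequence-number scheme and the in-order check on the secondaries are illustrative assumptions, not part of the strategy as stated above.

```python
class SecondaryReplica:
    """Hypothetical secondary copy that applies updates in primary order."""

    def __init__(self):
        self.data = {}
        self.last_seq = 0

    def apply(self, seq, key, value):
        # Assumption: updates must arrive in exactly the order the primary chose.
        if seq != self.last_seq + 1:
            raise ValueError("update received out of order")
        self.data[key] = value
        self.last_seq = seq


class PrimaryReplica:
    """Hypothetical primary copy: the single point where updates are ordered."""

    def __init__(self, secondaries):
        self.data = {}
        self.log = []
        self.secondaries = secondaries

    def update(self, key, value):
        seq = len(self.log) + 1  # serialise: one global order for all updates
        self.log.append((seq, key, value))
        self.data[key] = value
        for secondary in self.secondaries:
            secondary.apply(seq, key, value)
        return seq
```

Because every update passes through the primary, all secondaries see the same sequence of changes and end up with identical data.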

Data Fragmentation:

Data fragmentation is the division of relations into fragments for distribution.

Some Advantages:

Usage: Since applications usually work with views rather than entire relations, it makes sense to use subsets of relations as the unit of distribution.

Efficiency: If a relation can be decomposed into fragments, a number of transactions can execute concurrently.

Parallelism: A single query can be split into a set of subqueries that operate on fragments and execute in parallel.

Some Disadvantages:

Integrity: Integrity checking can become more complex if data and functional dependencies are fragmented and distributed to different sites.

Performance: Global applications that require data from fragments at different sites can be slower to process.

Data Fragmentation – 3 Rules:

Completeness: Each data item from a global relation R must appear in at least one of its fragments. This rule ensures no loss of data during fragmentation.

Reconstruction: It must always be possible to reconstruct each global relation from its fragments. This rule ensures no loss of functional dependencies.

Disjointness: Each data item from a global relation should appear in only one of its fragments, except in vertical fragmentation, where primary key attributes must be repeated to allow reconstruction. This rule ensures minimal data redundancy.

Data Fragmentation – Options:

Horizontal Fragmentation

A horizontal fragment of a relation R is a subset of the tuples in that relation.

Using the RESTRICT operation, horizontal fragmentation divides a relation horizontally by grouping subsets of tuples, where each subset (fragment) is specified by a condition on one or more attributes of the relation.

These fragments can then be assigned to different sites in the distributed system.

The relation can be reconstructed from its fragments by applying the UNION operation to them.

Data Fragmentation – Example:

We may define three horizontal fragments on the EMPLOYEE relation with the following conditions:

(DNO = 10)

(DNO = 30)

(DNO = 20)

This fragmentation satisfies the three rules: Completeness, Reconstruction and Disjointness.
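The horizontal fragmentation of EMPLOYEE on DNO can be sketched in Python as follows. The sample tuples, the `(ENO, ENAME, DNO)` layout and the `restrict` helper are assumptions made for illustration; Python sets stand in for relations.

```python
# Hypothetical EMPLOYEE tuples (ENO, ENAME, DNO); the values are made up.
EMPLOYEE = {
    (1, "Asha", 10),
    (2, "Bikram", 20),
    (3, "Chen", 30),
    (4, "Dipa", 10),
}

def restrict(relation, predicate):
    """RESTRICT: keep only the tuples that satisfy the predicate."""
    return {row for row in relation if predicate(row)}

# One horizontal fragment per condition (DNO = 10), (DNO = 30), (DNO = 20).
fragments = {
    dno: restrict(EMPLOYEE, lambda row, dno=dno: row[2] == dno)
    for dno in (10, 30, 20)
}

# Reconstruction: UNION of all fragments.
reconstructed = set().union(*fragments.values())

# Completeness: every tuple appears in at least one fragment.
complete = all(
    any(row in fragment for fragment in fragments.values()) for row in EMPLOYEE
)

# Disjointness: no tuple appears in more than one fragment.
disjoint = sum(len(fragment) for fragment in fragments.values()) == len(reconstructed)
```

In this sample the three rules hold because every employee has a DNO of 10, 20 or 30; a tuple with any other DNO would fall outside all fragments and break completeness.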

Vertical Fragmentation

A vertical fragment of a relation R groups together certain attributes of the relation using the PROJECT operation.

With vertical fragmentation, some columns of a relation are projected into one fragment and the other columns are projected into other fragment(s).

A set of vertical fragments whose projection lists L1, L2, ... include all attributes of R but share only the primary key attribute of R is called a complete vertical fragmentation of R.


To reconstruct relation R from a complete vertical fragmentation, we apply the natural JOIN operation to the fragments. The fragments must therefore share a common attribute (normally the primary key) so that the original relation can be reconstructed if required.


We fragment the EMPLOYEE relation into two vertical fragments: the first includes personal information (ENAME, BDATE, ADDRESS) and the second includes work-related information (ENO, SALARY, DNO).

* The primary key attribute ENO is also needed in the personal information fragment.
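The vertical fragmentation of EMPLOYEE and its reconstruction by natural JOIN on ENO can be sketched as follows. The sample rows and the `project` and `natural_join` helpers are illustrative assumptions; the join helper relies on the key being unique, as a primary key is.

```python
# Hypothetical EMPLOYEE rows; all values are made up.
EMPLOYEE = [
    {"ENO": 1, "ENAME": "Asha", "BDATE": "1990-01-15",
     "ADDRESS": "Kathmandu", "SALARY": 50000, "DNO": 10},
    {"ENO": 2, "ENAME": "Bikram", "BDATE": "1985-07-02",
     "ADDRESS": "Pokhara", "SALARY": 61000, "DNO": 20},
]

def project(relation, attributes):
    """PROJECT: keep only the listed attributes of every row."""
    return [{name: row[name] for name in attributes} for row in relation]

def natural_join(left, right, key):
    """Natural JOIN on one shared attribute, assumed unique in `right`."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]

# Both fragments repeat the primary key ENO so that they can be joined again.
personal = project(EMPLOYEE, ["ENO", "ENAME", "BDATE", "ADDRESS"])
work = project(EMPLOYEE, ["ENO", "SALARY", "DNO"])

rebuilt = natural_join(personal, work, "ENO")
```

Without ENO in the personal fragment there would be no shared attribute to join on, which is why the primary key is the one permitted exception to the disjointness rule.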


Mixed Fragmentation

Mixed fragmentation combines the two types of fragmentation discussed above.

The original relation can be reconstructed by applying UNION and JOIN operations to the fragments in the appropriate order.


Transparency in Distributed Databases:

An important aspect of a distributed database is hiding the details of data distribution from its users. This allows them to perceive the distributed database as a single, logical entity.

Types of Transparency:

• Distribution Transparency

• Replication Transparency

• Fragmentation Transparency

Distribution Transparency: Users should be able to write global queries and transactions as though the database were centralised, without having to specify the sites at which the data referenced in the query resides.

Replication Transparency: Where data is replicated, the system should manage the copies, and users should normally act as if there were a single copy of the data. However, there may be situations where users should be made aware of the existence of copies (but not of their placement).


Fragmentation Transparency: When database relations are fragmented, the DDBMS must handle user queries that were specified on entire relations but now have to be performed on sub-relations. In other words, the issue is one of finding a query processing strategy based on fragments rather than on relations.
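Fragmentation transparency can be illustrated with a small sketch in which the user queries the whole EMPLOYEE relation and a helper rewrites the query over the fragments. The site names, sample tuples and the `query_employee` function are hypothetical.

```python
# Hypothetical horizontal fragments of EMPLOYEE (ENO, ENAME, DNO), one per site.
FRAGMENTS = {
    "site_a": [(1, "Asha", 10), (4, "Dipa", 10)],
    "site_b": [(2, "Bikram", 20)],
    "site_c": [(3, "Chen", 30)],
}

def query_employee(predicate):
    """Answer a query posed on the whole relation by evaluating it per fragment.

    The caller never names a site or a fragment; the results of the
    per-fragment sub-queries are combined with a UNION.
    """
    results = []
    for rows in FRAGMENTS.values():
        results.extend(row for row in rows if predicate(row))
    return results

# The user writes the query as if EMPLOYEE were a single relation.
dept_10_names = sorted(name for _, name, _ in query_employee(lambda row: row[2] == 10))
```

A real optimiser would also skip fragments whose defining condition cannot match the query, for example consulting only the DNO = 10 fragment here.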


Other Issues in Distributed Databases (1):

Distributed Query Processing: One of the most important additional factors to consider is the cost of transferring data (including intermediate relations and final results) over the network. Many DDBMS query optimisation algorithms therefore treat reducing the amount of data transferred as the main criterion in choosing a distributed query execution strategy.
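To see why data transfer dominates the choice of strategy, consider a rough worked example. All sizes below are assumed figures chosen for illustration: EMPLOYEE and DEPARTMENT are stored at two different sites, and the result of joining them is needed at a third site.

```python
# Assumed sizes in bytes (illustrative figures only).
EMPLOYEE_BYTES = 10_000 * 100    # 10,000 records of 100 bytes each
DEPARTMENT_BYTES = 100 * 35      # 100 records of 35 bytes each
RESULT_BYTES = 10_000 * 40       # 10,000 result records of 40 bytes each

# Total bytes sent over the network for three candidate strategies.
strategies = {
    "ship both relations to the result site":
        EMPLOYEE_BYTES + DEPARTMENT_BYTES,
    "ship EMPLOYEE to the DEPARTMENT site, then ship the result":
        EMPLOYEE_BYTES + RESULT_BYTES,
    "ship DEPARTMENT to the EMPLOYEE site, then ship the result":
        DEPARTMENT_BYTES + RESULT_BYTES,
}

# Pick the strategy that moves the least data.
best = min(strategies, key=strategies.get)
```

Under these assumptions, moving the small DEPARTMENT relation to the site holding EMPLOYEE transfers 403,500 bytes, against 1,003,500 and 1,400,000 bytes for the other two strategies.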

Distributed Concurrency and Recovery:

• Dealing with multiple copies of data items

Extension of centralised locking, where a particular copy of each data item is designated as the distinguished copy.

Distributed concurrency control based on voting, where there is no single distinguished copy and lock requests are sent to all sites.

• Distributed commit


• Distributed deadlock

• Recovery from failure of individual sites

• Recovery from failure of communication links

• etc.


Thank you!!!

Questions are WELCOME