TRANSCRIPT
• Academic Year 2014 Spring
MODULE CC3005NI: Advanced Database Systems
“Distributed Database (DDB) and Data Mining (DM)”
A Distributed Database (DDB) is a single logical database that is physically distributed across computers in multiple locations connected by a computer network.
A Distributed Database Management System (DDBMS) is the software system that manages the DDB and makes the distribution transparent to users.
Definition of Distributed Database:
A decentralised database is also stored on computers at multiple locations. However, the computers are not connected by a network. Consequently, data cannot be shared by users at different locations.
Thus, a decentralised database is best regarded as a collection of independent databases rather than as a single database that is geographically distributed.
Distributed vs. Decentralized DBs:
Reflecting the distributed nature of some database applications
Many database applications are naturally distributed over several locations (e.g. a company may have locations in different cities).
Increased reliability and availability
When a centralised system fails, the database is unavailable to all users. A distributed system, however, will continue to function at some reduced level even when a component fails.
Advantages of Distributed Databases:
Local Control
Data distribution in a distributed database encourages local groups to exercise greater control over “their” data. This promotes improved data integrity and administration, and users can still access non-local data when necessary.
Lower Communication Costs
With a distributed system, data can be located closer to the point of use. This can reduce communication costs compared with a centralised system.
Advantages of Distributed Databases:
Fast response and improved performance
When a large database is distributed over multiple sites, a smaller database exists at each site. Some user queries at a particular site may only need to access the smaller database stored locally. This speeds up query processing and enhances database performance.
It may also be possible to decompose complex queries into sub-queries that can be processed in parallel at several different sites.
Advantages of Distributed Databases:
Interconnection of existing databases
When several databases already exist in an organisation and the need to run global applications arises, a distributed database often offers a natural solution by integrating and interconnecting the pre-existing local databases.
Advantages of Distributed Databases:
Software Cost and Complexity
These are far greater for a distributed environment than for a centralised system.
Communication / Processing Overheads
These are much greater because the various sites must exchange messages and perform additional calculations to ensure proper co-ordination among the sites. There is always a risk that the communication overhead will degrade system responsiveness and performance.
Disadvantages of Distributed Databases:
Data Integrity and Consistency
Given the increased complexity of the system and the need for co-ordination among multiple sites, additional control mechanisms are required to prevent improper updating of data and to avoid other problems of data integrity / consistency.
Disadvantages of Distributed Databases:
Centralized Schematic View:
The External schema describes the database views of a set of user groups. Each view typically describes the part of the database that a particular user group is interested in and hides the rest of the database from that user group.
The Conceptual schema is a global description of the database that hides the details of physical storage structures and concentrates on describing entities, data attributes, relationships and constraints.
Centralized Schematic View:
The Internal schema is a global description of the physical storage structures of the database.
Centralized Schematic View:
Distributed Schematic View:
At the top level, a distributed database acts conceptually as a single centralised database. The global conceptual schema, which defines all data contained in the distributed database, therefore presents end users with a unified view of the data for the complete distributed system.
Distributed Schematic View:
Each set of end users interacts with the global schema through its own local schema, each conforming to the ANSI-SPARC three-level architecture. These users' external schemas are subsets of the global conceptual schema.
Note that in a centralised database the external schemas use subsets of the conceptual schema, whilst in a distributed environment the local external schemas describe subsets of the global conceptual schema.
Distributed Schematic View:
Data Replication
A DDBMS may contain multiple copies of data at several different sites.
Data Fragmentation (Partition)
A relation may be divided into a number of sub-relations (fragments), which are then distributed.
Distribution Transparency
This allows users to perceive the distributed database as a single, logical entity.
Issues of Distributing a Database:
A desirable property of DDBs is the ability to keep a local repository of frequently used data while still being able to access data stored at other network sites.
Replicated Database – is a distributed database where (some) stored data is duplicated at various sites.
Fully Replicated Database – is a distributed database where all stored data are duplicated and allocated to all sites.
Data Replication:
Replication affects read-only and update applications differently:
Read-only applications benefit from replication, which makes it more likely that they can reference data locally.
Update applications may run into problems with replication, since they must update all copies in order to preserve data consistency.
Data Replication – Key Points:
Replicated data enhances locality of reference by satisfying more read-only queries locally. This reduces query response time and traffic on the communication network.
Replicated data provides greater reliability through backup copies from which data can be recovered in the event of media failure.
Data Replication – Key Points:
Proper provision must be made for update operations in a distributed database where data replication is used. Two of the update strategies are:
1. Unanimous Agreement Update Strategy
Updates are refused unless they are unanimously accepted by all sites containing a replica.
To present a single-copy image of replicated files, updates are propagated to all replicas immediately.
Unanimous acceptance of a proposed update by all sites holding replicas is necessary before modifications are made, so all of those sites must be available.
Data Replication – Update Strategies:
2. Single Primary Update Strategy
One replica is designated as the PRIMARY and the remaining replicas as SECONDARIES.
Update requests are issued to the primary replica, which serialises the updates to the secondary replicas and thereby preserves data consistency.
Data Replication – Update Strategies:
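The two update strategies can be sketched in a few lines of Python. This is only an illustrative model, not part of the original slides: the `Replica` class, the site names and the `available` flag are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One copy of a data item held at a site (hypothetical model)."""
    site: str
    value: int = 0
    available: bool = True


def unanimous_update(replicas, new_value):
    """Apply the update only if every site holding a replica can accept it."""
    # A single unavailable site is enough to refuse the update.
    if not all(r.available for r in replicas):
        return False
    for r in replicas:
        r.value = new_value
    return True


def primary_update(primary, secondaries, new_value):
    """Send the update to the primary, which forwards it to the secondaries."""
    if not primary.available:
        return False
    primary.value = new_value
    # Unavailable secondaries would have to catch up later (not modelled here).
    for r in secondaries:
        if r.available:
            r.value = new_value
    return True


sites = [Replica("A"), Replica("B"), Replica("C", available=False)]
refused = unanimous_update(sites, 42)                # site C is down
accepted = primary_update(sites[0], sites[1:], 42)   # site A acts as primary
```

With one site down, the unanimous strategy refuses the update, while the single primary strategy still applies it to every reachable replica.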
…is basically the division of relations into fragments for distribution.
Some Advantages:
Usage: Since applications usually work with views rather than entire relations, it makes sense to use subsets of relations as the unit of distribution.
Efficiency: If a relation can be decomposed into fragments, a number of transactions can execute concurrently.
Parallelism: A single query can be split into a set of subqueries that operate on fragments and execute in parallel.
Data Fragmentation:
Some Disadvantages:
Integrity: Integrity checking can become more complex if data and functional dependencies are fragmented and distributed to different sites.
Performance: It can be slower to process some global applications which require data from fragments at different sites.
Data Fragmentation:
Completeness: Each data item from a global relation R must appear in at least one of its fragments. This rule ensures no loss of data during fragmentation.
Reconstruction: It must always be possible to reconstruct each global relation from its fragments. This rule ensures no loss of functional dependencies.
Disjointness: Each data item from a global relation should appear in only one of its fragments, except for vertical fragmentation, where primary key attributes must be repeated to allow reconstruction. This rule ensures minimal data redundancy.
Data Fragmentation – 3 Rules:
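For horizontal fragments, the three rules can be checked mechanically. The sketch below assumes that a relation and its fragments are modelled as Python sets of tuples; the function names and the sample data are hypothetical.

```python
def is_complete(relation, fragments):
    """Completeness: every tuple of the relation is in at least one fragment."""
    return all(any(row in frag for frag in fragments) for row in relation)


def is_disjoint(fragments):
    """Disjointness: no tuple appears in more than one fragment."""
    seen = set()
    for frag in fragments:
        for row in frag:
            if row in seen:
                return False
            seen.add(row)
    return True


def is_reconstructible(relation, fragments):
    """Reconstruction: the union of the fragments gives back the relation."""
    return set().union(*fragments) == set(relation)


R = {(1, "a"), (2, "b"), (3, "c")}
good = [{(1, "a")}, {(2, "b"), (3, "c")}]
bad = [{(1, "a"), (2, "b")}, {(2, "b")}]   # loses (3, "c") and repeats (2, "b")
```

Here `good` passes all three checks, while `bad` fails every one of them.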
Horizontal Fragmentation
A horizontal fragment of a relation R is a subset of the tuples in that relation.
Using the RESTRICT operation, horizontal fragmentation divides a relation horizontally by grouping subsets of tuples, where each subset (fragment) is specified by some condition on one or more attributes of the relation.
These fragments can then be assigned to different sites in the distributed system.
The relation can be reconstructed from its fragments by applying the UNION operation to the fragments.
Data Fragmentation – Options:
We may define three horizontal fragments on the EMPLOYEE relation with the following conditions:
(DNO = 10)
(DNO = 20)
(DNO = 30)
Provided every employee belongs to department 10, 20 or 30, this fragmentation satisfies the three rules: Completeness, Reconstruction and Disjointness.
Data Fragmentation – Example:
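The horizontal fragmentation of EMPLOYEE on DNO might be sketched as follows. The sample rows and the helper names `restrict` and `union` are assumptions for illustration; only the attribute names and the DNO conditions come from the slides.

```python
# Hypothetical sample rows; only the attribute names come from the slides.
EMPLOYEE = [
    {"ENO": 1, "ENAME": "Asha", "DNO": 10},
    {"ENO": 2, "ENAME": "Bikram", "DNO": 20},
    {"ENO": 3, "ENAME": "Chandra", "DNO": 30},
    {"ENO": 4, "ENAME": "Dipa", "DNO": 10},
]


def restrict(relation, predicate):
    """RESTRICT: keep only the tuples that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def union(*fragments):
    """UNION: merge fragments, dropping duplicate tuples."""
    result = []
    for fragment in fragments:
        for row in fragment:
            if row not in result:
                result.append(row)
    return result


# One fragment per condition (DNO = 10), (DNO = 20), (DNO = 30).
fragments = {
    dno: restrict(EMPLOYEE, lambda row, d=dno: row["DNO"] == d)
    for dno in (10, 20, 30)
}

# Reconstruct the global relation from its fragments.
rebuilt = union(*fragments.values())
```

Each fragment could then be stored at the site of its department, and `rebuilt` contains the same tuples as `EMPLOYEE`.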
Vertical Fragmentation
A vertical fragment of a relation R groups together certain attributes of the relation using the PROJECT operation.
With vertical fragmentation, some columns of a relation are projected into one fragment and other columns are projected into other fragment(s).
A set of vertical fragments whose projection lists L1, L2, ... include all attributes of R but share only the primary key attribute of R is called a complete vertical fragmentation of R.
Data Fragmentation – Options:
Vertical Fragmentation
To reconstruct relation R from a complete vertical fragmentation, we apply the natural JOIN operation to the fragments. The fragments must therefore share a common attribute (normally the primary key) so that the original relation can be reconstructed if required.
Data Fragmentation – Options:
We fragment the EMPLOYEE relation into two vertical fragments: the first includes personal information (ENAME, BDATE, ADDRESS) and the second includes work-related information (ENO, SALARY, DNO).
* The primary key attribute ENO is also needed in the personal information fragment.
Data Fragmentation – Example:
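A minimal sketch of this vertical fragmentation, assuming hypothetical sample rows and helper names (`project`, `natural_join`), with ENO repeated in both fragments so that they can be joined again:

```python
# Hypothetical sample rows; only the attribute names come from the slides.
EMPLOYEE = [
    {"ENO": 1, "ENAME": "Asha", "BDATE": "1990-01-15",
     "ADDRESS": "Kathmandu", "SALARY": 500, "DNO": 10},
    {"ENO": 2, "ENAME": "Bikram", "BDATE": "1988-07-02",
     "ADDRESS": "Pokhara", "SALARY": 700, "DNO": 20},
]


def project(relation, attributes):
    """PROJECT: keep only the listed attributes of every tuple."""
    return [{a: row[a] for a in attributes} for row in relation]


def natural_join(left, right, key):
    """Natural JOIN of two fragments whose only shared attribute is `key`."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# The primary key ENO appears in both fragments.
personal = project(EMPLOYEE, ["ENO", "ENAME", "BDATE", "ADDRESS"])
work = project(EMPLOYEE, ["ENO", "SALARY", "DNO"])

rebuilt = natural_join(personal, work, "ENO")
```

Joining the two fragments on ENO gives back the original tuples; without ENO in the personal fragment there would be nothing to join on.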
Mixed Fragmentation
The two types of fragmentation discussed above can be combined, resulting in mixed fragmentation.
The original relation can be reconstructed by applying UNION and JOIN operations to the fragments in the appropriate order.
Data Fragmentation – Options:
Data Fragmentation – Example:
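One possible mixed fragmentation, sketched under the assumption that EMPLOYEE is first split horizontally on DNO and each horizontal fragment is then split vertically. The sample rows, attribute groupings and helper names are hypothetical.

```python
# Hypothetical sample rows; only the attribute names come from the slides.
EMPLOYEE = [
    {"ENO": 1, "ENAME": "Asha", "SALARY": 500, "DNO": 10},
    {"ENO": 2, "ENAME": "Bikram", "SALARY": 700, "DNO": 20},
    {"ENO": 3, "ENAME": "Chandra", "SALARY": 600, "DNO": 10},
]
PERSONAL = ["ENO", "ENAME"]
WORK = ["ENO", "SALARY", "DNO"]


def restrict(relation, predicate):
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    return [{a: row[a] for a in attributes} for row in relation]


def natural_join(left, right, key):
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


# Step 1: horizontal split on DNO. Step 2: vertical split of each piece.
mixed = {}
for dno in (10, 20):
    rows = restrict(EMPLOYEE, lambda row, d=dno: row["DNO"] == d)
    mixed[dno] = (project(rows, PERSONAL), project(rows, WORK))

# Reconstruction in the reverse order: JOIN the vertical pieces of each
# horizontal fragment, then UNION the results (they are disjoint here).
rebuilt = []
for personal, work in mixed.values():
    rebuilt.extend(natural_join(personal, work, "ENO"))
```

The order matters: joining before taking the union undoes the vertical split first and the horizontal split second.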
An important aspect of a distributed database is hiding the details of data distribution from its users. This allows them to perceive the distributed database as a single, logical entity.
Types of Transparency
Distribution Transparency
Replication Transparency
Fragmentation Transparency
Transparency in Distributed Databases:
Distribution Transparency
Users should be able to write global queries and transactions as though the database were centralised, without having to specify the sites at which the data referenced in the query resides.
Replication Transparency
Where data is replicated, the system should handle the management of copies, and users should normally act as if there is a single copy of the data. However, there may be situations where users should be made aware of the existence of copies (but not the placement of copies).
Transparency in Distributed Databases:
Fragmentation Transparency
When database relations are fragmented, the DDBMS has to handle user queries that were specified on entire relations but must now be performed on sub-relations. In other words, the issue is one of finding a query processing strategy based on fragments rather than on relations.
Transparency in Distributed Databases:
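Fragmentation transparency can be pictured as a small query router: the user queries EMPLOYEE as a whole, and the system decides which fragments to visit. The catalogue layout, site names and the `query_employee` function below are assumptions for illustration, not a real DDBMS interface.

```python
# Hypothetical catalogue: site name -> (DNO value defining the fragment, rows).
CATALOGUE = {
    "site_A": (10, [{"ENO": 1, "DNO": 10, "SALARY": 500}]),
    "site_B": (20, [{"ENO": 2, "DNO": 20, "SALARY": 700},
                    {"ENO": 3, "DNO": 20, "SALARY": 300}]),
}


def query_employee(dno=None, min_salary=0):
    """Global query on EMPLOYEE; the caller never names a fragment or site."""
    visited = []
    result = []
    for site, (fragment_dno, rows) in CATALOGUE.items():
        # Skip fragments that cannot contain matching tuples.
        if dno is not None and fragment_dno != dno:
            continue
        visited.append(site)
        result.extend(r for r in rows if r["SALARY"] >= min_salary)
    return result, visited


rows, visited = query_employee(dno=20, min_salary=500)
all_rows, all_sites = query_employee()
```

The query on department 20 only touches `site_B`, whereas the unrestricted query has to visit every fragment.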
Distributed Query Processing
One of the most important additional factors to consider is the cost of transferring data (including intermediate relations and final results) over the network.
Many DDBMS query optimisation algorithms therefore treat reducing the amount of data transferred as the main criterion in choosing a distributed query execution strategy.
Other Issues in Distributed Databases (1):
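A small worked comparison shows why data transfer dominates the choice of strategy. All sizes below are made-up assumptions: EMPLOYEE is stored at site 1, a second relation DEPARTMENT at site 2, and the result of joining them is needed at site 3.

```python
# Hypothetical sizes in bytes (tuple count * tuple width).
EMPLOYEE_BYTES = 10_000 * 100    # stored at site 1
DEPARTMENT_BYTES = 100 * 35      # stored at site 2
RESULT_BYTES = 10_000 * 40       # join result, needed at site 3

# Total bytes sent over the network for each candidate strategy.
strategies = {
    "ship both relations to site 3": EMPLOYEE_BYTES + DEPARTMENT_BYTES,
    "ship EMPLOYEE to site 2, result to site 3": EMPLOYEE_BYTES + RESULT_BYTES,
    "ship DEPARTMENT to site 1, result to site 3": DEPARTMENT_BYTES + RESULT_BYTES,
}

best = min(strategies, key=strategies.get)
```

Under these assumed sizes, moving the small relation to the site of the large one transfers the least data, which is the kind of decision a distributed query optimiser has to make.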
Distributed Concurrency and Recovery
Dealing with multiple copies of data items:
Extension of centralised locking, where a particular copy of each data item is designated as the distinguished copy.
Distributed concurrency control based on voting, where there is no distinguished copy and lock requests are made to all sites.
Distributed commit
Other Issues in Distributed Databases (2):
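The voting approach can be sketched as follows. The slides only say that lock requests go to all sites; the rule that a majority of copies must grant the lock, along with the `CopyLock` class and transaction names, is an assumption made for this example.

```python
class CopyLock:
    """Lock state for one copy of a data item at one site (hypothetical)."""

    def __init__(self):
        self.holder = None

    def request(self, txn):
        # Grant the lock if the copy is free or already held by this transaction.
        if self.holder in (None, txn):
            self.holder = txn
            return True
        return False

    def release(self, txn):
        if self.holder == txn:
            self.holder = None


def acquire_by_vote(copies, txn):
    """Ask every copy for the lock; succeed only with a majority of votes."""
    granted = [c for c in copies if c.request(txn)]
    if len(granted) > len(copies) // 2:
        return True
    # Not enough votes: give back the locks that were granted.
    for c in granted:
        c.release(txn)
    return False


copies = [CopyLock(), CopyLock(), CopyLock()]
copies[0].request("T2")                   # T2 already holds one copy
t1_ok = acquire_by_vote(copies, "T1")     # T1 gets 2 of 3 votes
t2_ok = acquire_by_vote(copies, "T2")     # T2 gets only 1 of 3 votes
```

Because two transactions cannot both hold a majority, at most one of them obtains the lock, and the loser releases whatever votes it collected.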
Distributed Concurrency and Recovery
Distributed deadlock
Recovery from failure of individual sites
Recovery from failure of communication links
etc.
Other Issues in Distributed Databases (3):
Thank you!!!
Questions are WELCOME