bit-3107-2.docx

Upload: p-jorn

Post on 06-Mar-2016


BST 3107: DATABASE SYSTEMS II

Topic 2: Distributed Database

A distributed database is a database that consists of two or more data files located at different sites on a computer network. It is alternatively described as a collection of multiple logically related databases distributed over a computer network. A distributed database management system (DDBMS) is a software system that manages a distributed database (DDB) while making the distribution transparent to the user.

In a pure distributed database, the system manages a single copy of all data and supporting database objects. A key feature of the distributed database is that different users can access the data without interfering with one another.

Unlike a parallel system, in which the processors are tightly coupled and constitute a single database system, a distributed database system consists of loosely coupled sites that share no physical components. The database systems that run on each site are independent of each other, and transactions within the database may access data at one or more sites.

The key features of a distributed database system include:
(i) Relational data model - The system is assumed to use the relational data model.
(ii) Replication - The system maintains multiple copies of data, stored at different sites, for faster retrieval and fault tolerance.
(iii) Fragmentation - A relation is partitioned into several fragments stored at distinct sites.

NB: A relation is a table that represents an entity; its columns are the attributes of that entity.

To ensure that databases in distributed systems remain up to date, two processes are employed: replication and duplication.

In replication, specialized software looks for changes in the distributed database. Once the changes have been identified, the replication process makes all the databases look the same. Replication is alternatively described as the operation of copying and maintaining database objects in multiple databases belonging to a distributed system. Replication can be complex, and it requires a lot of time and computer resources.

Advantages of replication include:
(i) Availability - Failure of a site containing a given relation does not make the relation unavailable, as replicas exist.
(ii) Parallelism - Queries on a relation may be processed by several nodes in parallel.
(iii) Reduced data transfer - A relation is available locally at each site containing its replica.

The disadvantages of replication are:
(i) Increased cost of updates - Each replica of a given relation must be updated.
(ii) Increased complexity of concurrency control - Concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented. One solution is to choose one copy as the primary copy and apply concurrency control operations to it.
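The primary-copy idea can be sketched in a few lines of Python. This is a minimal single-process illustration under stated assumptions, not a real DDBMS protocol: the class names `Site` and `PrimaryCopyRelation` are hypothetical, and a thread lock stands in for the concurrency control that would be applied at the primary copy.

```python
import threading


class Site:
    """A hypothetical site holding one replica of a relation (key -> row)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


class PrimaryCopyRelation:
    """Hypothetical replicated relation with one replica chosen as primary."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)
        # Assumption: a lock stands in for concurrency control,
        # and it is applied only at the primary copy.
        self._lock = threading.Lock()

    def update(self, key, row):
        # Every update is serialized through the primary's lock, then
        # propagated to each secondary (the increased cost of updates).
        with self._lock:
            self.primary.rows[key] = dict(row)
            for site in self.secondaries:
                site.rows[key] = dict(row)

    def read(self, site, key):
        # Reads can be served locally by any replica.
        return site.rows.get(key)


nairobi, nakuru, kisumu = Site("Nairobi"), Site("Nakuru"), Site("Kisumu")
customers = PrimaryCopyRelation(primary=nairobi, secondaries=[nakuru, kisumu])
customers.update(1, {"name": "Okello", "town": "Nairobi"})
```

Because all writers pass through one lock, two concurrent updates can never leave different replicas with different values for the same key, at the price of making the primary site a bottleneck for writes.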

In duplication, one database is identified as the master, and this is then duplicated to every other location. The duplication process is carried out at set times to ensure that each distributed location has the same data. In duplication, users may only change the master database, which ensures that local data will not be overwritten.
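A rough sketch of duplication, assuming in-memory dictionaries in place of real databases, might look like the following. The names `MasterDatabase`, `LocalCopy` and `duplicate` are hypothetical, and the scheduling "at set times" is left out: `duplicate` represents a single run.

```python
import copy


class MasterDatabase:
    """Hypothetical master: the only database users may change."""

    def __init__(self):
        self.tables = {}

    def write(self, table, key, row):
        self.tables.setdefault(table, {})[key] = dict(row)


class LocalCopy:
    """Hypothetical read-only duplicate held at a distributed location."""

    def __init__(self):
        self._tables = {}

    def read(self, table, key):
        return self._tables.get(table, {}).get(key)

    def write(self, table, key, row):
        # Local changes would be overwritten by the next duplication run,
        # so they are rejected outright.
        raise PermissionError("only the master database may be changed")

    def refresh_from(self, master):
        # Replace the whole local copy with a snapshot of the master.
        self._tables = copy.deepcopy(master.tables)


def duplicate(master, locations):
    """One duplication run; in practice this would be scheduled at set times."""
    for location in locations:
        location.refresh_from(master)


master = MasterDatabase()
branches = [LocalCopy(), LocalCopy()]
master.write("customers", 2, {"name": "Kamau", "town": "Nakuru"})
duplicate(master, branches)
```

Between runs the local copies can be stale: a row written to the master after `duplicate` returns is not visible at the branches until the next run.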

Homogeneous and Heterogeneous Database Management Systems

A homogeneous distributed database has identical software and hardware running all database instances, and may appear through a single interface as if it were a single database. In a homogeneous distributed database, all sites have identical software and are aware of each other. They also agree to cooperate in processing user requests. A homogeneous distributed database management system appears to the user as a single system.

Homogeneous systems are much easier to design and manage. The approach provides incremental growth, making the addition of a new site to the DDBMS easy, and allows increased performance by exploiting the parallel processing capability of multiple sites. In a homogeneous distributed database:
(i) All sites have identical software. The operating systems used at each location are the same or compatible.
(ii) All sites are aware of each other and agree to cooperate in processing user requests.
(iii) The database appears to the user as a single system.
(iv) The data structures used at each location must be the same or compatible.
(v) The database applications (or database management systems) used at each location are the same or compatible.

A heterogeneous distributed database may have different hardware, operating systems, database management systems, and even data models for different databases. Different computers and operating systems, database applications or data models may be used at each of the locations. In a heterogeneous distributed database, different sites may use different schemas and software. Sites in a heterogeneous system may not be aware of each other, and may provide only limited facilities for cooperation in transaction processing. One location may, for example, have the latest relational database management technology, while another location may store data using conventional files or an old version of a database management system. The heterogeneous system is often not technically or economically feasible.

Heterogeneous systems usually result when individual sites have implemented their own databases and integration is considered at a later stage. In a heterogeneous system, translations are required to allow communication between different DBMSs. To provide DBMS transparency, users must be able to make requests in the language of the DBMS at their local site. The system then has the task of locating the data and performing any necessary translation. Data may be required from another site that may have:
(i) Different hardware.
(ii) Different DBMS products.
(iii) Different hardware and different DBMS products.

Federated Databases

A federated database is a system in which several databases appear to function as a single entity. Each component database in the system is completely self-sustained and functional. When an application queries the federated database, the system figures out which of its component databases contains the data being requested and passes the request to it. A federated database may be composed of a heterogeneous collection of databases. In a homogeneous environment, federated databases can help distribute the load of very large databases.

The federated database system distributes queries to the appropriate component database; the goal of the system is to ensure that a typical query will need to use only one component, thus drastically reducing the number of rows that need to be searched.

Federated databases have several drawbacks. Each component database is a potential point of failure, and latency from any one server will delay an entire call.
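The routing step described above can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the class names `ComponentDatabase` and `FederatedDatabase` are hypothetical, and the federation is assumed to route by key range, which is only one of several ways a real system could decide which component holds the data.

```python
class ComponentDatabase:
    """Hypothetical self-contained component database (key -> row)."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = dict(rows)

    def get(self, key):
        return self.rows.get(key)


class FederatedDatabase:
    """Hypothetical front end that presents the components as one database."""

    def __init__(self):
        self._routes = []  # list of (key_range, component) pairs

    def register(self, key_range, component):
        self._routes.append((key_range, component))

    def locate(self, key):
        # Work out which single component holds the requested key.
        for key_range, component in self._routes:
            if key in key_range:
                return component
        raise KeyError(f"no component database holds key {key}")

    def get(self, key):
        # Pass the request on to that component only.
        return self.locate(key).get(key)


east = ComponentDatabase("east", {1: "Okello", 2: "Kamau"})
west = ComponentDatabase("west", {1001: "Chepyegon"})

federation = FederatedDatabase()
federation.register(range(1, 1001), east)
federation.register(range(1001, 2001), west)
```

A call such as `federation.get(1001)` touches only the `west` component, so the rows held by `east` are never searched. The sketch also makes the drawback visible: if `west` is down or slow, every request routed to it fails or stalls.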

Data Fragmentation

Data fragmentation occurs when a body of data is broken up into pieces that are not stored close together. Also known as sharding or partitioning, data fragmentation involves splitting a data set into smaller fragments (or shards) and distributing them across a large number of machines. Specialized software carries out the fragmentation, automatically breaking data up into fragments for storage on different storage equipment, possibly in different locations, based on the sharding policies in place. Fragments are logical data units stored at various sites in a distributed database system.
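As an illustration of a sharding policy, the sketch below assigns each row to a shard by hashing its key. Hash-based placement is an assumption chosen for the example (the notes do not name a particular policy), and the function names `shard_for` and `distribute` are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard number in the range 0..num_shards-1."""
    # A stable hash is used so that every machine computes the same
    # placement (Python's built-in hash() of strings varies per process).
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(rows, key_field, num_shards):
    """Split a list of row dicts into num_shards fragments."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row[key_field], num_shards)].append(row)
    return shards


customers = [
    {"no": 1, "name": "Okello"},
    {"no": 2, "name": "Kamau"},
    {"no": 3, "name": "Chepyegon"},
]
shards = distribute(customers, "no", num_shards=2)
```

Every row lands in exactly one shard, and a later lookup for a given key only needs to contact the shard that `shard_for` returns for it.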

Fragmentation makes data more readily available: it is much faster to retrieve small data fragments than large ones, which significantly improves response times. It permits a number of transactions to execute concurrently, since they will access different portions of a relation. Above all, the process facilitates parallel execution of a single query (intra-query concurrency). The challenge, however, is that semantic data control (especially integrity enforcement) becomes more difficult.

Fragmentation aims to improve reliability, performance and security, to balance storage capacity and costs, and to reduce communication costs.

There are three types of fragmentation:
(i) Horizontal: partitions a relation along its tuples.
(ii) Vertical: partitions a relation along its attributes.
(iii) Mixed/hybrid: a combination of horizontal and vertical.

Horizontal and Vertical Fragmentation in Distributed Database Management Systems

Consider the following relation:

No.  Customer Name  Town     Payment Type  Gender
1    Okello         Nairobi  Credit Card   Male
2    Kamau          Nakuru   Cash          Male
3    Chepyegon      Kisumu   Cash          Female

Horizontal fragmentation divides the relation along its tuples (rows).

Fragment 1:

No.  Customer Name  Town     Payment Type  Gender
1    Okello         Nairobi  Credit Card   Male
2    Kamau          Nakuru   Cash          Male

Fragment 2:

No.  Customer Name  Town     Payment Type  Gender
3    Chepyegon      Kisumu   Cash          Female
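The horizontal fragments above can be reproduced with a short Python sketch. The notes do not say which condition was used to split the rows, so the predicate on gender is an assumption that happens to yield the same two fragments; the function name `horizontal_fragment` and the shortened column names are also hypothetical.

```python
CUSTOMERS = [
    {"no": 1, "name": "Okello", "town": "Nairobi", "payment": "Credit Card", "gender": "Male"},
    {"no": 2, "name": "Kamau", "town": "Nakuru", "payment": "Cash", "gender": "Male"},
    {"no": 3, "name": "Chepyegon", "town": "Kisumu", "payment": "Cash", "gender": "Female"},
]


def horizontal_fragment(relation, predicate):
    """Return the rows of the relation that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


# Assumed split condition: the predicates must not overlap and must
# together cover every row, so that no row is lost or duplicated.
fragment_1 = horizontal_fragment(CUSTOMERS, lambda r: r["gender"] == "Male")
fragment_2 = horizontal_fragment(CUSTOMERS, lambda r: r["gender"] == "Female")

# The original relation is recovered as the union of the fragments.
reconstructed = sorted(fragment_1 + fragment_2, key=lambda r: r["no"])
```

Each fragment keeps every column of the relation; only the set of rows differs, which is why a plain union is enough to rebuild the original.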

Vertical fragmentation divides the relation along its attributes (columns).

Fragment 1:

No.  Customer Name  Town     Gender
1    Okello         Nairobi  Male
2    Kamau          Nakuru   Male
3    Chepyegon      Kisumu   Female

Fragment 2:

No.  Customer Name  Payment Type
1    Okello         Credit Card
2    Kamau          Cash
3    Chepyegon      Cash
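A matching sketch for the vertical fragments is given below. Both fragments repeat the No. and Customer Name columns, and the sketch assumes these act as the key on which the fragments are joined back together; the names `vertical_fragment`, `rejoin` and `KEY`, and the shortened column names, are hypothetical.

```python
CUSTOMERS = [
    {"no": 1, "name": "Okello", "town": "Nairobi", "payment": "Credit Card", "gender": "Male"},
    {"no": 2, "name": "Kamau", "town": "Nakuru", "payment": "Cash", "gender": "Male"},
    {"no": 3, "name": "Chepyegon", "town": "Kisumu", "payment": "Cash", "gender": "Female"},
]

KEY = ("no", "name")  # assumed key attributes, repeated in every fragment


def vertical_fragment(relation, attributes):
    """Project each row onto the key attributes plus the given attributes."""
    columns = KEY + tuple(attributes)
    return [{col: row[col] for col in columns} for row in relation]


def rejoin(left, right):
    """Rebuild full rows by joining two fragments on the key attributes."""
    right_by_key = {tuple(row[k] for k in KEY): row for row in right}
    return [{**row, **right_by_key[tuple(row[k] for k in KEY)]} for row in left]


fragment_1 = vertical_fragment(CUSTOMERS, ["town", "gender"])
fragment_2 = vertical_fragment(CUSTOMERS, ["payment"])
reconstructed = rejoin(fragment_1, fragment_2)
```

Repeating the key in each fragment is what makes the join possible: without it there would be no way to tell which payment type belongs to which customer.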

Advantages of Fragmentation

Usage
Applications generally work with views rather than entire relations. Therefore, for data distribution, it is appropriate to work with subsets of relations as the unit of distribution.

Efficiency
Fragmentation ensures data is stored close to where it is most frequently used. In addition, data that is not needed by local applications is not stored.

Parallelism
With fragments as the unit of distribution, a transaction can be divided into several sub-queries that operate on fragments. This increases the degree of concurrency or parallelism in the system.

Security
Data not required by local applications is not stored, and consequently not available to unauthorized users.

Disadvantages of Fragmentation

Performance
The performance of a global application that requires data from several fragments located at different sites may be slower.

Integrity
Integrity control may be more difficult if data and functional dependencies are fragmented and located at different sites.

Data Transparency

Data transparency is the degree to which a system user may remain unaware of the details of how and where the data items are stored in a distributed system. Important aspects of transparency with regard to distributed systems include fragmentation transparency, replication transparency and location transparency. The levels of transparency featured in a distributed database system include:

Distribution or network transparency
    Location transparency
    Naming transparency
Replication transparency
Fragmentation transparency
    Vertical fragmentation
    Horizontal fragmentation

Advantages and Disadvantages of Distributed Database Systems

Advantages

Reflects Organizational Structure
Many organisations are naturally distributed over several locations. It is natural for databases used in such an application to be distributed over these locations. The company headquarters may wish to make global inquiries involving the access of data at all or a number of branches.

Improved Share-ability and Local Autonomy
The geographical distribution of an organisation can be reflected in the distribution of the data: users at one site can access data stored at other sites. Data can be placed at the site close to the users who normally use that data. In this way, users have local control of the data, and they can consequently establish and enforce local policies regarding the use of this data. A global database administrator (DBA) is responsible for the entire system.

Improved Availability
In a centralized DBMS, a computer failure terminates the applications of the DBMS. However, a failure at one site of a DDBMS, or a failure of a communication link making some sites inaccessible, does not make the entire system inoperable. If a single node fails, the system may be able to reroute the failed node's requests to another site.

Improved Reliability
As data may be replicated so that it exists at more than one site, the failure of a node or a communication link does not necessarily make the data inaccessible.
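The rerouting behaviour can be sketched as follows, assuming the data is replicated at every site in the list. The names `Site`, `SiteUnavailable` and `read_with_failover` are hypothetical, and a simple `up` flag stands in for a real node or link failure.

```python
class SiteUnavailable(Exception):
    """Raised when a site, or the link to it, is down."""


class Site:
    """Hypothetical site holding a replica of some data (key -> value)."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.up = True

    def read(self, key):
        if not self.up:
            raise SiteUnavailable(self.name)
        return self.data[key]


def read_with_failover(sites, key):
    """Try each site in turn, rerouting the request if a site has failed."""
    for site in sites:
        try:
            return site.name, site.read(key)
        except SiteUnavailable:
            continue  # reroute the request to the next replica
    raise SiteUnavailable("no site holding the data is reachable")


replica = {3: "Chepyegon"}
nairobi, kisumu = Site("Nairobi", replica), Site("Kisumu", replica)
nairobi.up = False  # simulate a node or link failure
served_by, value = read_with_failover([nairobi, kisumu], 3)
```

With the Nairobi site marked as down, the request is answered by the Kisumu replica. The data becomes inaccessible only if every site holding a copy is unreachable.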

Improved Performance
As the data is located near the site of 'greatest demand', and given the inherent parallelism of distributed DBMSs, speed of database access may be better than that achievable from a remote centralized database. Furthermore, since each site handles only a part of the entire database, there may not be the same contention for CPU and I/O services that characterizes a centralized DBMS.

Economics
It is generally accepted that it costs much less to create a system of smaller computers with the equivalent power of a single large computer. This makes it more cost-effective for corporate divisions and departments to obtain separate computers. It is also much more cost-effective to add workstations to a network than to update a mainframe system.

The second potential cost saving occurs where databases are geographically remote and the applications require access to distributed data. In such cases, owing to the relative expense of transmitting data across the network as opposed to the cost of local access, it may be much more economical to partition the application and perform the processing locally at each site.

Modular Growth
In a distributed environment, it is much easier to handle expansion. New sites can be added to the network without affecting the operations of other sites. This flexibility allows an organisation to expand relatively easily. Adding processing and storage power to the network can usually handle the increase in database size. In a centralized DBMS, growth may entail changes to both hardware (the procurement of a more powerful system) and software (the procurement of a more powerful or more configurable DBMS).

To achieve the advantages of a distributed database, the database management system must have these additional functions:
(i) Keeping track of data distribution, fragmentation and replication.
(ii) Distributed query processing.
(iii) Distributed transaction management.
(iv) Replicated data management.
(v) Distributed data recovery.
(vi) Security.
(vii) Distributed catalog management.

Disadvantages of DDBMS

Complexity
A distributed DBMS that hides the distributed nature from the user and provides an acceptable level of performance, reliability and availability is inherently more complex than a centralized DBMS.

Cost
Increased complexity means that we can expect the procurement and maintenance costs for a DDBMS to be higher than those for a centralized DBMS. Furthermore, a distributed DBMS requires additional hardware to establish a network between sites. There are ongoing communication costs incurred with the use of this network. There are also additional labor costs to manage and maintain the local DBMSs and the underlying network.

Security
In a centralised system, access to the data can be easily controlled. However, in a distributed DBMS not only does access to replicated data have to be controlled in multiple locations, but also the network itself has to be made secure. In the past, networks were regarded as an insecure communication medium. Although this is still partially true, significant developments have been made to make networks more secure.

Integrity Control More Difficult
Database integrity refers to the validity and consistency of stored data. Integrity is usually expressed in terms of constraints, which are consistency rules that the database is not permitted to violate. Enforcing integrity constraints generally requires access to a large amount of data that defines the constraints. In a distributed DBMS, the communication and processing costs required to enforce integrity constraints are high compared with a centralized system.

Lack of Standards
Although distributed DBMSs depend on effective communication, standard communication and data access protocols are only beginning to appear. This lack of standards has significantly limited the potential of distributed DBMSs. There are also no tools or methodologies to help users convert a centralized DBMS into a distributed DBMS.

Database Design More Complex
Besides the normal difficulties of designing a centralised database, the design of a distributed database has to take account of fragmentation of data, allocation of fragments to specific sites, and data replication.