Wednesday 21 May 2014

Distributed Database Management Systems

The below post is notes prepared by me by studying the book "Database Systems Design, Implementation and Management" by Peter Rob and Carlos Coronel
Content, examples and diagrams are taken from that book.

Unit –IV Chapter-II
Distributed Database Management Systems


Q: What are the disadvantages of centralized database management system?
  • Performance degradation due to growing number of remote locations over greater distances.
  • High costs associated with maintaining and operating large central (mainframe) database systems
  • Reliability problems created by dependence on a central site(single point of failure syndrome) and the need for data replication.
  • Scalability problems associated with the physical limits imposed by a single location(temperature conditioning and power consumption)
  • Organizational rigidity imposed by the database might not support the flexibility and agility required by modern global organizations.


Q: What is Distributed database
Ans A distributed database management system ( DDBMS ) governs the storage and processing of logically related data over interconnected computer systems in which both data and processing functions are distributed among several sites.


Q: What is DDBMS
Ans: The software system that permits the management of the distributed database and makes the distribution transparent to users.


Q: What are the advantages and disadvantages of DDBMS
Ans: Advantages
  1. Data are located near the greatest demand site.
  2. Faster data access as end users work with only a locally stored subset of the company data.
  3. Faster data processing as the workload is distributed at several sites.
  4. Growth facilitation as new sites can be added without affecting other sites
  5. Improved communication as local sites foster better communication between customer and  staff
  6. Reduced operating costs  as development work is done more cheaply and more quickly on low-cost PCs than on mainframes
  7. User friendly interface as the GUI simplifies training and use for end users
  8. Less danger of single point failure: when one computer fails, the workload is picked up by other workstations as data are distributed at multiple sites.
  9. Processor independence : The end user is able to access any available copy of the data, and an end users request is processed by any processor at the data location


Disadvantages
  1. Complexity of management and control: Applications must recognize data location and they must be able to stitch together data from various sites.
  2. Technological difficuilty: Data integrity, transaction management, concurrency control, security, back up recovery, query optimization, access path selection etc… must all be addressed and resolved
  3. Security : The probability of security lapses increases when data are located at multiple sites. The responsibility of data management will be shared by different people at several sites.
  4. Lack of standards: There are no standard communication protocols at the database level. For example, different database vendors employ different often incompatible techniques to manage the distribution of data and processing in a DDBMS environment.
  5. Increased storage and infrastructure requirements: multiple copies of data are required at different sites, thus requiring additional disk storage space.
  6. Increased training cost
  7. Costs: Distributed databases require duplicated infrastructure to operate (physical location, environment, personnel, software, licensing etc)


Distributed processing system in centralized database
Although the database resides at only one site, each site can access the data  and update the database. That is, shares the database processing chores among several sites.
These sites are connected through a communication network. Refer book for diagram


Distributed database requires distributed processing.
Distributed processing may be based on centralized database or distributed database
Both distributed processing and databases require a network to connect all components



Distribution database system
In a distribution database system, a database is composed of several parts known as database fragments.
The database fragments are located at different sites and can be replicated among various sites.
Distributed database requires distributed processing. Refer book for diagram
Each database fragment is managed by local database process. (Distributed processing)
For the management of distributed data to occur, copies or parts of database processing functions must be distributed to all data storage sites.


Characteristics of DDBMS
The DBMS must have the following functions to be classified as distributed:
  • Application interface to interact with the end user, application programs and other DBMSs  within the DDB.
  • Validation to analyze data requests for syntax correctness.
  • Transformation to decompose complex requests into atomic data request components
  • Query optimization to find the best access strategy
  • Mapping to determine the data location of local and remote fragments.
  • I/O interface to read or write data from or to permanent local storage.
  • Formatting to prepare the data for presentation to the end user or to an application program
  • Security to provide data privacy at both local and remote databases.
  • Back up and recovery to ensure the availability and recoverability of the database in case of a failure.
  • DB administration features for the database administrator
  • Concurrency control to manage simultaneous data access and to ensure data consistency across data fragments in the DDBMS.
  • Transaction management to ensure that the data moves from one consistent state to another. This activity includes the synchronization of local and remote transactions as well as transactions across multiple distributed segments.


Centralised DBMS functions
  • Receiving an application on end user’s request
  • Validate, analyse and decompose the request. The request might include mathematical and/or logical operations.
  • Ex: SELECT all customers with a balance > $1000.
    The request might require data from only a single table or it might require access to several tables.
  • Map the request’s logical -to -physical data components.
  • Decompose the request into several disk I/O operations
  • Search for, locate, read and validate the data.
  • Ensure database consistency, security and integrity.
  • Validate the data for the conditions, if any, specified by the request.
  • Present the selected data in the required format


DDBMS components
  1. Computers workstations (sites or nodes). The DDBMS must be independent of the computer system hardware
  2. Network hardware and software components that reside in each workstation to allow all sites to interact and exchange data. Because the components-computers, O.S, network hardware etc-- are likely to be supplied by different vendors, it is best to ensure that DDB functions can be run on multiple platforms.
  3. Communication media that carry the data from one workstation to another.
    The DDBMS must be communications media-independent
  4. The transaction processor (TP) also known as application processor(AP) or the transaction manager (TM), which is the software components found in each computer that requests data.The transaction processor receives and processes the application’s data requests (remote and local).
  5. The data processor (DP) also known as the data manager(DM), which is the software component residing on each computer that stores and retrieves data located at the site.
    A DP may even be a centralized DBMS.
The communication between TPs and DPs is possible through protocols used by the DDBMS
The protocol determines how the DDB system will
1.Interface with the network to transfer data and commands between DPs and TPs.
2.Synchronize all data received from DPs(TP side)
3.and route retrieved data to appropriate TPs(DP side)
4. Ensure database functions like security, concurrency control, back up and recovery in DDB


DPs and TPs can be added to the system without affecting the operations of the other components.
DPs and TPs can reside on the same computer.









Levels of data and process distribution
Current database systems can be classified on the basis of how process distribution and data distribution are supported.
For ex: a DBMS may store data at a single site or in multiple sites and
may support data processing at a single site or at multiple sites.




Single-Site Processing, Single-Site Data (SPSD)(Centralized)
- all processing is done on a single host computer
- all  data are stored on the host computer’s local disk system.


Ex: mainframe systems, single processor server and multiple processor server systems)
The functions of TP and DP are embedded within the DBMS located on single computer.
The DBMS usually runs under a time sharing, multitasking O.S, which allows several processes to run concurrently on a host computer


Multiple-Site Processing, Single Site Data (MPSD)- multiple processes run on different computers sharing a single data repository
This scenario requires a network file server running applications that are accessed through a network
  1. The TP on each workstation routes all network data requests to the file server.
  2. Only the data storage input/output (I/O) is handled by the file server, so offers limited capability of distributed processing.
  3. The end user must make direct reference to the fileserver to access remote data.
  4. All record and file locking activities are done at the end-user location
  5. All data selection, search and update functions take place at the workstation,
    thus requiring that entire file travel through the network for processing at the workstation.
    Such a requirement increases network traffic, slows response time and increase communication costs.
For ex: File server computer stores a CUSTOMER table containing 10000 data rows, 50 of which have
balances >$1000. Suppose site A issues the query: SELECT * FROM customer WHERE cus_balance>1000;
All 10000 CUSTOMER rows must travel through the network to be evaluated at site A



Client/Server architecture
MPSD
All database processing is done at server site
All database processing is done at client site
Thus reduces network traffic
Thus increases network traffic
Capable of supporting data at multiple sites.
Requires database to be located at a single site
Processing is distributed
Processing is not distributed
Performs multiple site processing



Multiple-Site Processing, Multiple-Site Data(MPMD) describes a fully DDB with support for multiple data processors and transaction processors at multiple sites


Types of DDBMS:
Depending on the level of support for various types of centralized DBMSs, DDBMSs are classified as
homogenous DDBMS
heterogenous DDBMS
Integrates only one type of centralized DBMS over a network.
Integrates different  types of centralized DBMS over a network.
The same DBMS will be running on different server platforms (single processor server, multi-processor server)
Fully heterogenous DDBMS will support different DBMSs that even support different data models (relational, hierarchical or network) running on different computer systems such as mainframes and PCs


Some DDBMS implementations support several platforms, O.S and networks and allow access remote data access to another DBMS, but subject to certain restrictions.
Remote access is provided on a read-only basis and does not support write privileges.
Restrictions are placed on the
  • number of remote tables that may be accessed in a single transaction.
  • number of distinct databases that may be accessed
  • database model that may be accessed. Access may be provided to relational databases but not to network or hierarchical databases.


Transparency features or functional characteristics
Allowing End user to feel like the database’s only user.
User believe that he is working with a centralized DBMS
All complexities are hidden or transparent to the user.


  1. Distribution transparency: it makes dispersed database look like a single database to the end user.
  2. Transaction transparency: Allows a transaction to update data at more than one
    network site.
    Transaction transparency ensures that the transaction will be either entirely completed or aborted, thus maintaining data integrity
  3. Failure transparency ensures that the system will continue to operate in the event of a node failure.
    Functions that were lost because of failure will be picked up by another network node
  4. Performance transparency :The system will not suffer any performance degradation
    due to its use on a network or due to the network’s platform differences
    Ensures that the system will find the most cost-effective path to access remote data.
  5. Heterogeneity transparency :Allows integration of several different local DBMSs (relational, hierarchical and network) under a common or global schema
    The DBMS is responsible for translating the data requests from the global schema to the local DBMS schema


Distribution transparency
Three levels of distribution transparency are
  1. Fragmentation transparency: is the highest level of transparency.
    Neither fragment names nor fragment locations are specified prior to data access.
  2. Location transparency: end user must specify the database fragment names but not their locations
  3. Location mapping transparency: end user need to specify both the fragment names and their locations
Ex: The CUSTOMER table contains cno,cname,city attributes
     The CUSTOMER data are distributed over 3 different locations: NewYork, Atlanta and Miami
     The table is divided by location i.e.., NewYork customers data are stored in fragment C1,
                              Atlanta customers data are stored in fragment C2
                              Miami customers data are stored in fragment C3


and each fragment is unique i.e.., it indicated each row is unique
No portion of the table is replicated at any other site.
Case 1: The database supports Fragmentation Transparency
SELECT * FROM customer;  (no fragment names or location specified)


Case 2: The database supports Location Transparency
SELECT * FROM C1;
UNION
SELECT * FROM C2;    (fragment names are specified and locations are not specified)
UNION                               
SELECT * FROM C3;


Case 3: The database supports Location Mapping Transparency
SELECT * FROM C1 NODE NY;
UNION
SELECT * FROM C2 NODE ATL;                              (fragment names are specified and locations are not specified)
UNION                               
SELECT * FROM C3 NODE MIA;


Distribution transparency is supported by distributed data dictionary (DDD) or distributed data catalogue (DDC). The DDC contains distributed global schema i.e.., entire database description. The DDC is itself distributed and replicated. Therefore, it must maintain consistency through updating all sites


Transaction transparency
Remote request/statements: lets a single SQL statement (or request) reference data at only one remote site or D.P
A remote transaction contains one or more remote requests, all of which reference one remote site or D.P
consider a transaction at site A
BEGIN WORK
UPDATE product SET qoh = qoh-1 WHERE pno = ‘P01’;
INSERT INTO invoice (invno,cno,invdate) VALUES (305, 202,SYSDATE);
COMMIT WORK;                                                 
Note the following remote transaction features:
  • The transaction updates PRODUCT and INVOICE table (located at SiteC)
  • The remote transaction is sent to and executed at the remote site C.
  • The entire transaction can reference and be executed at only one remote DP


A distributed transaction contains one or more requests and
Each request can access only one remote site at a time i.e..,
It allows a transaction to reference several different local or remote DP sites .
Consider a transaction at site A
BEGIN WORK
UPDATE product SET qoh = qoh-1 WHERE pno = ‘P01’;
INSERT INTO invoice (invno,cno,invdate) VALUES (305, 202,SYSDATE);
UPDATE customer SET baldue = baldue + 10 WHERE cno = 202;
COMMIT WORK;
Note the following following features
  1. The transaction references two remote sites (B and C)
  2. The first two requests (UPDATE PRODUCT and INSERT INTO INVOICE) are processed by the DP at remote site C, and the last request is processed by DP at the remote site B.
  3. Each request can access only one remote site at a time.
The third characteristics may create problems
If suppose the table PRODUCT is divided into PROD1 and PROD2, located at Site B and C respectively, then a distributed transaction cannot execute the request- SELECT * FROM product; because this request cannot access data from more than one remote site. So the DBMS must support a distributed request.


A distributed request lets a one SQL statement reference data located at several different local or remote DP sites.The ability to execute a distributed request provides fully distributed database processing capabilities because of the ability to :
  • Partition a database tables into several fragments
  • Reference to one or more of those fragments with only one request.


Distributed Concurrency control
Multisite, multiple-process operations are much more likely to create data inconsistencies and deadlocked transactions than are single-site systems


Q: Explain Two Phase Commit Protocol :
Distributed databases make it possible for a transaction to access data at several sites.
A final commit must not be issued until all sites have committed their parts of the transaction.
Each DP maintains its own transaction log.
The two phase commit protocol  requires
DO-UNDO-REDO protocol
Write-Ahead Protocol


DO-UNDO-REDO protocol
DO performs the operation and records the before and after values in the transaction log.
UNDO reverses an operation, using the log entries written by the DO portion
REDO redoes an operation.
To ensure that the DO, UNDO, REDO operations can survive a system crash while they were being executed, a write-ahead protocol is used.
The write-ahead protocol forces the log entries to be written to permanent storage before the actual operation takes place.
There are two types of nodes: the coordinator node and subordinates or cohorts.
Coordinator role is assigned to a node that initiates the transaction
The protocol is implemented in two phases
Phase-1 Preparation
The subordinate nodes receive the message, writes the transaction log, using the write-ahead protocol and
Sends an acknowledgement (YES/PREPARED TO COMMIT) or (NO/NOT PREPARED) message to coordinator.
If all nodes are PREPARED TO COMMIT , the transaction goes to phase -2
If one or more nodes reply NO or NOT PREPARED The coordinator broadcasts a ABORT message to all subordinates
Phase -2 The Final Commit
The coordinator broadcasts a COMMIT message to all subordinates and wait for replies
Each subordinate receives the COMMIT message,
and then updates the database using the DO protocol
The subordinates reply with a COMMITTED or NOT COMMITTED message to the coordinator.
If one or more subordinates did not COMMIT, it sends an ABORT message, thereby forcing all subordinates to UNDO


Performance transaparency and query optimization
The DDBMS uses query optimization to decide which copy of the data to access.
The objective of query optimization routine is to minimize the total cost associated with the execution of a request like
1. Access time (I/O) cost involved in accessing the physical data stored on disk
2. Communication cost associated with the transmission of data among nodes in DDB systems
3. CPU time cost associated with the processing overhead of managing distributed transactions


To evaluate query optimization, the TP must receive data from DP, synchronize it, assemble the answer and present it to end user or an application


Most of the algorithms proposed for query optimization are based on two principles:
1. The selection of the optimum execution order.
2. The selection of sites to be accessed to minimize communication costs.














Query optimization algorithms can be evaluated on the basis of operation mode or the timing of its optimization
Operation modes can be classified as manual or automatic--
Cost effective path is found and scheduled by end user or programmer in manual
Cost effective path is found and scheduled by DDBMS.


Classification according to when the optimization is done                     
1. Static query optimization takes place at compilation time. When the program is submitted to the DBMS for compilation, it creates the plan to access the database.
2. Dynamic query optimization takes place at execution time. Its cost is measured by run- time processing overhead


Classification according to type of information that is used to optimize the query
1. Statistically based query optimization algorithm uses statistical information like size, number of records, average access time of database.
The statistical information is managed by DDBMS and is generated in dynamic mode or in manual mode.
In dynamic statistical generation mode, the DDBMS automatically evaluates and updates the statistics after each access.
2. Rule- based query optimization algorithm is based on a set of user-defined rules to determine the best access strategy. The rules are entered by the end user or database administrator.


Distributed database design
The design of a distribution database design introduces three new issues.
1. How to partition the database into fragments.
2. Which fragment to replicate.
3. Where to locate those fragments and replicas


Data fragmentation :
it allows you to break a single object (database,table etc..) into two or more segments or fragments.
Each fragment can be stored at any site
(Information about data fragmentation is stored in distributed data catalog (DDC), from which it is accessed by TP to process user requests.
Fragmented tables can be recreated from its fragments by using Joins and Unions.)


Horizontal fragmentation:
It refers to the division of a table into subsets (fragments) of rows.
Each fragments is stored at a different node, and
each fragment has unique rows.
(Each fragment represents the equivalent of a SELECT statement, with the WHERE clause on a single attribute.
Ex:
cno
cname
cstate
climit
baldue
1
ANU
TN
3500
2700
2
RAMA
AP
6000
1200
3
RADHA
TN
4000
3500
4
GOPI
AP
1200
550


Suppose XYZ company requires information about its customers in the 2 sites (AP, TN)
And each state requires data regarding local customers only
So, distribute data by state ie., define
horizontal fragmentation by state





Fragment Name
Location
Condition
Node Name
Customer Numbers
No. of nodes
C1
TN
cstate = TN
CHE
1,3
2
C2
AP
cstate=AP
VJA
2,4
2


Table fragments in 2 states
Table name: C1    location : Tamil Nadu            Node:CHE
cno
cname
cstate
climit
baldue
1
ANU
TN
3500
2700
3
RADHA
TN
4000
3500


Table name: C2    location : AndhraPradesh        Node:VJA
cno
cname
cstate
climit
baldue
2
RAMA
AP
6000
1200
4
GOPI
AP
1200
550


Vertical fragmentation: It refers to the division of a table into attribute (column) subsets.
Each subset is stored at a different node and
each fragment has unique columns with the exception of the key column which is common to all fragments.
(For Ex: Suppose a company is divided into 2 departments --- service and collection department
Each department is in separate building and has interest in only some of the CUSTOMER attributes.
Vertical fragmentation of CUSTOMER table



Fragment Name
Location
Node Name
Attribute Names
V1
Service building
SVC
cno, cname, cstate
V2
Collection building
ARC
cno, climit, baldue


Vertically fragmented table contents
Table name: V1    location : Service building            Node:SVC
cno
cname
cstate
1
ANU
TN
2
RAMA
AP
3
RADHA
TN
4
GOPI
AP


Table name: V2    location : collection building            Node:ARC
cno
climit
baldue
1
3500
2700
2
6000
1200
3
4000
3500
4
1200
550


Mixed fragmentation: It refers to a combination of horizontal vertical strategies.
It requires two step process
  1. horizontal fragmentation is introduced
  2. vertical fragmentation is used within each horizontal fragment
(The XYZ company’s structure requires
CUSTOMER data to be fragmented horizontally to 2 company locations (TN,AP) and
Within locations, the data must be fragmented vertically to 2 departments (Service and Collection)


Mixed Fragmentation of CUSTOMER table
Fragment
Name
Location
Horizontal Criteria
Node Name
Resulting
Rows at Site
Vertical criteria
attributes at each fragments
M1
TN
cstate = TN
CHES
1,3
cno,cname,cstate
M2
TN
cstate = TN
CHEC
1,3
cno,climit,baldue
M3
AP
cstate = AP
VJAS
2,4
cno,cname,cstate
M4
AP
cstate = AP
VJAS
2,4
cno,climit,baldue


Table name: M1    location : Tamil Nadu    Node:CHES

cno
cname
cstate
1
ANU
TN
3
RADHA
TN
Table name: M2    location : Tamil Nadu   
Node:CHEC

cno
climit
baldue
1
3500
2700
3
4000
3500

Table name: M3    location : AndhraPradesh   Node:VJAS

cno
cname
cstate
2
RAMA
AP
4
GOPI
AP
Table name: M3    location : AndhraPradesh   Node:VJAC

cno
climit
baldue
2
6000
1200
4
1200
550

Table fragmentation after mixed fragmentation process
Data replication: It refers to the storage of data copies at multiple sites.
Replicated data are subject to the mutual consistency rule , which requires that
1. All copies of data fragments be identical.
2. DDBMS must ensure that a database update is performed at all sites where replicas exist.
Benefits of replication
Fragment copies can be stored at several sites to serve specific information requirements can
  • Increases data availability and response time
  • Reduced communication and query costs.
  • Better load distribution
  • Improved data failure tolerance and
Disadvantages
  • It imposes additional DDBMS processing overhead because
    each copy must be maintained by the system and also have to decide which replicated copy to use.
  • Increased Transaction time as data must be updated at several sites.
  • Storage Cost


Replication Conditions
A fully replicated database stores multiple copies of all database fragments at multiple sites.
A partially replicated database stores multiple copies of some database fragments at multiple sites.
An unreplicated database stores each database fragment at a single site.

Factors for Data Replication Decision
1. Database Size : Replicating large amount of data will have impact on
storage requirements,Data transmission cost and Higher network bandwidth
2. Usage Frequency
How frequently the data need to be updated and how big is the database.
Frequently used data needs to be updated more often.


Data allocation - deciding where to locate data.
Data allocation strategies are as follows:
With the centralized data allocation, the entire database is stored at one site.
With partitioned data allocation, the database is divided into two or more disjoint parts and stored at two or more sites.
With replicated data allocation, copies of one or more database fragments are stored at several sites.


Data allocation algorithms take into consideration a variety of factors:
1. Performance and data availability goals
2. Size, number of rows, the number of relations that an entity maintains with other entities.
3. Types of transactions to be applied to the database, the attributes accessed by each of those transactions.


Explain about Client/Server Architecture
Client/server architecture refers to the way in which computers interact to form a system.
It features a user of resources or a client and a provider of resources or a server .
The architecture can be used to implement a DBMS in which the client is the transaction processor (TP) and the server is the data processor (DP).
Client/Server Architecture
Client/Server Advantages
1. Client/server solutions are less expensive and
allow the end user to use the microcomputer’s graphical user interface (GUI),
thereby improving functionality and simplicity.
2. There are more people with PC skills than with mainframe skills.
3. Numerous data analysis and query tools exist to allow interaction with many of the DBMSs.
4. It is cheap to develop an application for PCs than for mainframes.
Client/Server Disadvantages
1. The client/server architecture creates a more complex environment with different platforms.
2. An increase in the number of users and processing sites often paves the way for security problems.
3. The burden of training increases the cost of maintaining the environment.

No comments:

Post a Comment