Title: Distributed Database Management Systems
1Distributed Database Management Systems
- Dan Freiman
- Jey Ping
- Adam Greenbaum
- Josh Simon
- Asha Haji
2Introduction
3Definitions
- Distributed Database
- A logically interrelated collection of shared
data physically distributed over a network. - Data split into fragments (segment of data).
- Distributed DBMS (DDBMS)
- The software system that permits the management
of the distributed database and makes the
distribution transparent to users. - Fundamental Principle
- Transparency in distributed system makes the
system appear centralized.
4DDBMS Overview
- Each site has a DBMS
- Fragments (replicated or unique).
- Linked by network.
- Can handle local users.
- Participates in at least one global requests.
5Distributed Processing
- A centralized database that can be accessed over
a computer network. - Data is not physically distributed.
- Ex. webdev.
6Multidatabase System (MDBS)
- A DDBMS in which each site maintains complete
autonomy. - Export schema defines data accessible to remote
users. - Unfederated have no local users.
- Federated have local and remote users.
- MDBA maintains global schema, global transaction
management, and gateways.
7DDBMS Types
- Homogeneous
- All sites use same DBMS.
- Heterogeneous
- Sites run different DBMSs.
- DMBSs can have different data models.
- Relational, Network, Object-Oriented, etc.
8Homogeneous DDBMS
- Easier to design and manage.
- Allows for easy incremental growth.
- Ex. Addition of a new site of existing nature.
- Allows for easier parallel processing.
9Heterogeneous DDBMS
- Allows multiple independent databases to be
integrated into on distributed database. - Must provide transparency of heterogeneity.
- Gateways
- Translates all requests made in local language.
- Translates the model of all data transmitted.
- Do not support transaction management.
- Difficult to homogenize schemas from different
models.
10Database Interoperability Consortium (DBIOP)
- Formed by Open Group as Specification Working
Group to develop - A common SQL API for programmers to interact with
any DBMS. - A common database protocol that enables universal
inter-DBMS communication without a gateway. - A common network protocol that allows connections
between any two DBMS.
11Advantages of DDBMS
- Location of Data
- Reflects Organizational Structure
- Each site has its own database containing
information about that site. - Improved Shareability and Local Autonomy
- Users can access data at other sites.
- Local users have control of local data.
- Local policies are easier to enforce.
12Advantages of DDBMS
- Improved Operation
- Improved Performance
- Data is located near users of frequent use which
decreases communication costs. - Inherent parallel processing.
- Improved Availability
- Site failure only matters if system is trying to
access that site. - Improved Reliability
- If site fails, request can be rerouted to
replicated data.
13Advantages of DDBMS
- Business Advantages
- Economics
- Several smaller computers may be cheaper than a
large mainframe. - Processing at remote site require less data in
communications. - Modular Growth
- Very easy to add an office with a standard office
database. - Integration
- Allows for combining of several legacy databases
into one DDBMS. - Different DBMS have different strengths best
suited for different requirements. - Remaining Competitive
14Disadvantages of DDBMS
- Operation and Maintenance
- Complexity
- Must handle fragmentation/replication data.
- Database Design More Complex
- Must design for fragmentation/replication data.
- Integrity Control More Difficult
- Constraints based on remote data has
communication costs and risks of communication
failure.
15Disadvantages of DDBMS
- Lack of Standards
- No widely used communication standard for
databases. - No standardized tools or methodologies or convert
a centralized DBMS to a DDBMS. - Lack of Experience
- Little industry experience to examine before
deciding when to use a DDBMS.
16Disadvantages of DDBMS
- Security
- Replicated data must be controlled in multiple
locations. - Communications must be secure.
- Cost
- Higher maintenance, software, and possibly
hardware costs because of complexity. - Communication costs.
- Higher labor costs.
17DDBMS
- Advantages
- Reflects Organizational Structure
- Improved Shareability and Local Autonomy
- Improved Availability
- Improved Reliability
- Improved Performance
- Economics
- Modular Growth
- Integration
- Remaining Competitive
- Disadvantages
- Complexity
- Cost
- Security
- Integrity control more difficult
- Lack of Standards
- Lack of Experience
- Database Design More Complex
18Architecture and Design
19Functions of a DDBMS
- Extensions of the functions of a centralized DBMS
to handle distributed data - Data communication - between sites
- Data storage, retrieval, update store
distribution details - User-accessible catalog distributed queries,
remote access - Authorization services
- Concurrency control
- Recovery services recover from failures of
individual sites or communication links
20Reference Architecture
Global external schema
Global external schema
Global external schema
Global conceptual schema
Fragmentation schema
Allocation schema
Local mapping schema
Local mapping schema
Local mapping schema
Local conceptual schema
Local conceptual schema
Local conceptual schema
Local internal schema
Local internal schema
Local internal schema
21Component Architecture
- Each site has
- Local DBMS component (LDBMS)
- Local data storage, retrieval, update
- Data Communications (DC)
- Global System Catalog (GSC)
- Distributed DBMS component (DDBMS)
22Distributed Database Design Issues
- Fragmentation
- Allocation
- Replication
23Design information
- Frequency transactions are run
- The site from which a transaction is run
- Performance criteria
- The relations, attributes, tuples accessed
- The type of access (read/write)
- Predicates of read operations
24Design objectives
- Locality of reference
- Improved reliability and availability
- Acceptable performance
- Balance storage capacities and costs
- Minimal communication costs
25Data Allocation Strategies
- Centralized distributed processing
- Fragmented
- Complete replication
- Selective replication
26Centralized Design
- Locality of reference Lowest
- Reliability Availability Lowest
- Performance Unsatisfactory
- Storage Costs Lowest
- Comm. Costs Highest
27Fragmented Design
- Locality of reference High
- Reliability Availability Low/High
- Performance Satisfactory
- Storage Costs Lowest
- Comm. Costs Low
28Complete Replication Design
- Locality of reference Highest
- Reliability Availability Highest
- Performance Read-optimized
- Storage Costs Highest
- Comm. Costs Read-low, update-high
29Selective Replication Design
- Locality of reference High
- Reliability Availability Low/High
- Performance Satisfactory
- Storage Costs Average
- Comm. Costs Low
30Fragmentation
- Advantages
- Usage
- Efficiency
- Parallelism
- Security
- Disadvantages
- Performance
- Integrity
31Fragmentation
- Correctness rules
- Completeness each item is found in at least one
fragment - Reconstruction it is possible to use a
relational operation to reconstruct the relation - Disjointness each item is found in only one
fragment to minimize redundancy
32Dilberts Insight ?
33Transparencies and Date's Rules
34Dates 12 Rules
To the user, a distributed system shall looketh
exactly like a non-distributed system
35Dates 12 Rules
- Physical Independence
- Data Independence
- Functional Requirements
36Physical Independence
- 1) Local autonomy local admin controls the
local DB completely - 2) No central server no single point of failure
- 4) Location independence all sites look the
same from the users perspective - 9) Hardware independence
- 10) OS independence
37 Data Independence
- 5) Fragmentation independence data divided
between sites arbitrarily - 6) Replication independence data duplicated
arbitrarily - 11) Network independence
- 12) Database independence supports different
DBMSs at each site
38Functional Requirements
- 3) Continuous operation no downtime
- 7) Distributed query processing queries access
different sites simultaneously - 8) Distributed transaction processing all sites
and system conform to ACID
39Database Transparency
- Transparency hide implementation details
- Distributed DBMS looks exactly like a
non-distributed DBMS
40Types of Transparencies
- Distribution transparency
- Transaction transparency
- Performance transparency
- DBMS transparency
- DDBMS can span multiple vendor DBMSs (Oracle,
Postgre, etc)
All transparencies rarely implemented in single
distributed DBMS
41Distribution Transparency
- Database is a single, logical entity
- Fragmentation transparency user does not know
the data is split up - Location transparency user does not know where
the data is
staffNo fname lname branchNo
Branch B001
Central Branch
staffNo position fname lname branchNo
staffNo fname lname branchNo
Branch B002
staffNo position
42Location Transparency Example
Central Branch
SELECT fName, lName FROM Staff WHERE position
Manager
staffNo position
SELECT fName, lName FROM (SELECT FROM Staff
WHERE branchNo B001) WHERE staffNo IN SELECT
staffNo FROM (SELECT staffNo, position FROM Staff
WHERE position Manager) UNION SELECT fName,
lName FROM (SELECT FROM Staff WHERE branchNo
B002) WHERE staffNo IN SELECT staffNo
FROM (SELECT staffNo, position FROM Staff WHERE
position Manager)
Branch B001
Branch B002
staffNo fname lname branchNo
staffNo fname lname branchNo
43Local Mapping Transparency
Central Branch
SELECT fName, lName FROM SELECT FROM Staff
WHERE branchNo B001 WHERE staffNo IN SELECT
staffNo FROM (SELECT staffNo, position FROM
Staff WHERE position Manager) UNION SELECT
fName, lName FROM SELECT FROM Staff WHERE
branchNo B002 WHERE staffNo IN SELECT
staffNo FROM (SELECT staffNo, position FROM
Staff WHERE position Manager)
AT Branch001
staffNo position
Branch B001
AT CentralBranch
Branch B002
AT Branch002
staffNo fname lname branchNo
staffNo fname lname branchNo
AT CentralBranch
44Distribution Transparency, Cont.
- Replication transparency user does not know
data is duplicated - Naming transparency each sub-DB must have
unique name - Map with name server (much like DNS)
- Add aliases with scope
- Local name
- Ex LocalBranch
- Globally unique identifier
- Ex Manager_at_London.LocalBranch_at_Glasgow
45Transaction Transparency
- Maintain DB integrity and consistency
- Atomicity is still the key
- Divided into subtransactions represented by an
agent
staffNo fname lname branchNo
Branch B001
Central Branch
staffNo position fname lname branchNo
staffNo fname lname branchNo
Branch B002
staffNo position
46Concurrency transaction
- Each subtransaction occurs
- concurrently
- Each subtransaction is independent
- Replication (need to propagate changes to all
copies) - Synchronization
47Failure Transaction
- Failure transaction
- If one subtransaction fails and rolls back, what
happens to entire transaction/other
subtransactions? - Loss of a message
- Failure of a communications link
- Failure of entire site
- Network partitioning
48Performance Transparency
- DDBMS must respond like a non-distributed system
- Similar to query optimizing by query processor,
except need to consider - Which fragment?
- Which copy of the fragment, if replication?
- Which location?
- Must take into account
- I/O bottlenecks of each system
- CPU time on each
- Communication cost over links (dominant)
49Transparency Overview
staffNo fname lname position
staffNo salary bonus
DDBMS
staffNo spouse dependent policyNum
50Problems with Transparency
- Gray (1989)
- Poor message performance
- Poor management
- Poor modularity
51Advanced Topics
52Wait! Ive seen all this before!
- This section covers things we have done before,
but with a distributed aspect to it - Distributed Transaction Management
- Distributed Currency Control
- Distributed Deadlock Management
- Distributed Query Optimization
53Distributed Transaction Management
- Just like standalone, we have
- Transaction manager, scheduler, recovery manager,
buffer manager - Unlike standalone, we have
- Global transaction manager (Transaction
Coordination)
54Distributed Currency Control
- Objectives
- Be resilient to site and communication failure
- Permit parallelism to satisfy performance reqs
- Incur modest computational/storage overhead
- Perform in a network with a lot of network delay
- Place few constraints on the structure of atomic
actions
55Old Stuff and New Stuff
- Lost Update Problem
- Uncommitted Dependency
- Inconsistent Analysis
- Multiple Copy Consistency Problem
- Can occur if the DB is replicated
- Updates can be done synchronously or
asynchronously
56Distributed Serializeability
- We use a global schedule
- Union of all of the local schedules
- Locking guarantees
- Timestamping guarantees
- If the DB is fragmented, but not replicated then
all of the protocols for a single DB can be used
57Locking Protocols
- Centralized 2PL
- Primary Copy 2PL
- Distributed 2PL
- Majority Rule
58Distributed 2PL
- Each site has its own lock mananger
- If data is not replicated, then is the same as
primary copy 2PL - Else, 2PL implements Read-One-Write-All Protocol
(ROWA) - Any copy can be read, but all copies must be
locked before an item can be updated - This avoid the drawbacks of centralized control
- Basically an extension of Primary Copy 2PL
- Deadlock handling is more complex, and
communication costs are higher than primary copy
2PL
59Majority Locking
- When a transaction wants to read or write, it
must obtain the lock on more than 50 of the
sites - Essentially the same as Distributed 2PL, but it
does not rely on ROWA, so it is more efficient. - This plan avoid the drawbacks of centralized
control, but is a more complicated protocol, with
more complicated deadlock.
60Timestamp Protocols
- Exactly the same as before, except we add a
location identifier to the stamp. - Timestamps with older timestamps get priority
- In a conflict (two different sites have the same
timestamp), tie goes to the earlier location
61Distributed Deadlock Management
- Centralized Deadlock Detection
- A single site is appointed the Deadlock Detection
Coordinator (DDC) - Hierarchical Deadlock Detection
- Every site sends its local WFG to the deadlock
detection side above it. - Distributed Deadlock Detection
62Hierarchical Deadlock Detection
1234
34
12
1
2
3
4
63Distributed Database Recovery
- Why things fail
- Loss of message
- Failure of communications link
- Failure of a site
- Choosing the correct time-out value is difficult
64How Failure Affects Recovery
- Abort any transactions affected by failure
- Flag the site as failed
- Check periodically to see if site has recovered
- On restart, the site must initiate recovery
procedures - After local recovery, the site must update its
copy to make it consistent with the rest of the
system - If a network partition occurs, the DDBMS must
ensure that communication is reestablished
65Distributed Recovery Protocols
- A failure at one site should not leave other
sites hanging. - This is called non-blocking
- Every global transaction should have a
coordinator or transaction manager - All of the sites at which the transaction occurs,
are called participants
66Two-Phase Commit
- Voting Phase and a Decision Phase
- The coordinator asks all of the sites if they are
ready to commit. Any single site can abort
(unilateral abort), which aborts for everyone. - Coordinator sends PREPARE. Everyone responds
either READY_COMMIT or ABORT - If ABORT is received, Coordinator sends a
GLOBAL ABORT - If everyone sends READY_COMMIT, Coordinator
sends a GLOBAL COMMIT - If a participant times out, it defaults to abort
- At this point, termination and recovery protocols
are implemented. - 3 PHASE COMMIT
67Protocols
- Termination Protocols
- Coordinator or Participant can time out
- Recovery Protocols
- Coordinator can fail in Initial, Waiting, Decided
state - Participant can fail in Initial, Prepared,
Aborted/Committed state - If the participants lose their coordinator, they
can elect a new one
68Network Partitioning
- Problems of identifying updates
- Maintaining integrity
- 2 separate groups that cant talk to each other
69Distributed Query Processing
- Query Decomposition
- Takes a query expressed on the global relation
and performs partial opimization - Data Localization
- Takes into account how the data is distributed.
Replaces global relations with their
reconstruction algorithms - Global Optimization
- Optimization strategy which uses the DDBMS to
make an execution strategy based on the
fragments. Parts of the query are sent to the
local DBMSs - Local Optimization
- Optimization described in the last lecture done
at the local site level.
70Special TopicMobile Agents for Wireless Access
to Distributed Databases
71Setbacks of Wireless for DDBMS
- With wireless, DDBMS no longer necessarily
provides - Fast, reliable, cheap communication
- Robust and research rich servers
- Stationary and fixed locations of participants
72What are mobile agents?
- Encapsulated processes dispatched from a source
computer to accomplish a specific task - data and execution state
- autonomous from sender
- can self-transport, spawn, interact
73And theyre so good, because?
- Computation pushed from host to fixed network
- Migrate to mobile clients
- Flexibility in shifting load to and from mobile
host depending on bandwidth
74How can I use mobile agents?
- Proposed models for mobile computation
- client/server (no mobile agents!)
- mobile agent model
- client/agent/server model
- client/intercept/server model
75The Bad Client-Server Model
- What weve always seen!
- Client downloads JDBC driver with every single
database request - Ridiculous amount of overhead!
- A JDBC driver can range from 300-500Kb!
76The Not as Bad Mobile Agent Model
- Client deploys mobile agent instead
- Mobile agent requests database access on
server-side - Brings back the goods to the client
- Still overhead generate agent with every
request!
77The Better Client/Server/Agent Model
- Deploy a database agent to server, have all
traffic go through it - Serve multiple clients
- Connection for duration of application
- You can also
- Have a messenger agent go back and forth between
client and database agent with results and
requests - Or send two types of messages results and
requests
78The Also Better Client/Server/Intercept Model
- Deploy database agent to network, but also create
client-side agent - Whats the point of the client-side agent?
- intercept and control communication to reduce
network traffic and query processing
79The Benefits of Mobile Agents
- Significant increase in performance compared to
client/server model - Increased ability to follow mobile clients
- Dual role as computation mechanism but also
network communication aide
80Mobile agent platforms out there
- Non-Java and Java-based platforms
- Some available platforms include
- IBMAglets Workbench
- ObjectSpaces Voyager
- Mitsubishis Concordia
- IKV Grasshopper
- General Magics Odyssey
- TACOMA
- Agent Tcl
81References
- Connolly T. and Begg C. (2005). Database Systems
A Practical Approach to Design, Implementation,
and Management 4th ed. New York, NY Addison
Wesley. Chapters 22-24. - Database Interoperability Consortium
- http//www.opengroup.org/dbiop/
- Multidatabase Systems
- Sheth A. and Larson J.L. (1990). Federated
Databases Architectures and Integration. ACM
Computing Surv., Special Issue on Heterogeneous
Databases, 22(3), 183-236. - Allocation
- Ozsu M. and Valduriez P. (1999). Principles of
Distributed Systems 2nd ed. Englewood Cliffs, NJ
Prentice-Hall. - Disadvantages
- Gray, Jim. TRANSPARENCY IN ITS PLACE -- The Case
Against Transparent Access to Geographically
Distributed Data http//research.microsoft.com/g
ray/papers/Transparency.doc - Özsu, M. and Patrick Valduriez. Distributed Data
Management Unsolved Problems and New Issues.
http//db.uwaterloo.ca/ddbms/publications/ozsu/di
stdb/UnsolvedIssues.ps