Title: Distributed Databases
Distributed Databases
- CS 95 Advanced Database Systems
- Handout 4
Definitions
- "... a collection of data which belong logically
to the same system but are spread over the sites
of a computer network." (Ceri, 1984)
- "A distributed database is a collection of
logically related data distributed across several
machines interconnected by a computer network. An
application program operating on a distributed
database may access data stored at more than one
machine." (Gardarin & Valduriez)
- "A set of cooperating databases, each resident at
a different site, that the user views and
manipulates as a centralized database." (Gardarin
& Valduriez)
- "... a kind of virtual object, ... physically
stored in a number of distinct real databases at
a number of distinct sites." (Date)
Concept
- Can have many sites, each with its own database
management system (DBMS).
- Local data is stored at each site.
- Sites are connected by a communications network.
- To a user (application program) all sites
together appear as one big database.
- Needs simple structures --> relational databases
- e.g., Ingres/Star
General structure of a DDBMS
Another diagram of a DDBMS
- DC = Data Communications component
- DBMS = local DBMS
- DDBMS = Distributed DBMS component
- GDD = Global Data Dictionary
Note that although site 3 has access to the DDB,
the local DB at site 3 is not part of the DDB.
Fragmentation
- Horizontal partitioning
- Rows of a single table are stored across different
sites.
- Vertical partitioning
- Each table is stored at a single site but different
tables are stored at different sites, or columns
of a table are stored across different sites
- Basic principle is that data is stored at the
site where it is most often accessed
- network traffic is reduced
- improves performance
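The two partitioning styles above can be sketched as follows. This is a minimal illustration, assuming a made-up EMPLOYEE relation held as Python dictionaries; the column names, the sample rows and the helper names (`horizontal_fragments`, `vertical_fragment`, `rejoin`) are hypothetical and do not come from the handout.

```python
# Hypothetical EMPLOYEE relation; columns and values are illustrative only.
employees = [
    {"emp_id": 1, "name": "Ann", "city": "Perth", "salary": 52000},
    {"emp_id": 2, "name": "Bob", "city": "Sydney", "salary": 48000},
    {"emp_id": 3, "name": "Cy", "city": "Perth", "salary": 61000},
]


def horizontal_fragments(rows, key):
    """Horizontal partitioning: whole rows are grouped by the value of `key`,
    giving one fragment per site."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[key], []).append(row)
    return fragments


def vertical_fragment(rows, columns, primary_key="emp_id"):
    """Vertical partitioning: keep only some columns, plus the primary key
    so that the fragments can be joined back together later."""
    keep = [primary_key] + [c for c in columns if c != primary_key]
    return [{c: row[c] for c in keep} for row in rows]


def rejoin(left, right, primary_key="emp_id"):
    """Rebuild full rows by joining two vertical fragments on the key."""
    index = {row[primary_key]: row for row in right}
    return [{**row, **index[row[primary_key]]} for row in left]


by_city = horizontal_fragments(employees, "city")        # rows split by site
payroll_site = vertical_fragment(employees, ["salary"])  # columns split by site
directory_site = vertical_fragment(employees, ["name", "city"])
```

Keeping the primary key in every vertical fragment is the design choice that makes `rejoin` possible; without it the original rows could not be reconstructed.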
Replication
- Data may be stored at more than one site (i.e.,
a copy of the data).
- Whole relations may be replicated, or just
fragments of relations.
- Faster retrieval: the network communications
link is used less often because data logically
belonging to another site may have a local copy.
- Provides backup.
- Updating records is harder, since every copy has to be changed.
12 Objectives for DDBMS (Date)
- 1. Local autonomy
- 'Local data is locally owned and managed ...'
- Each site must be able to control and process its
local data independently of any other site.
- Need to store the Global Data Dictionary at each
site
- Some aspects of data security and data integrity
are managed at the global level.
12 Objectives for DDBMS (Date)
- 2. No reliance on a central site
- Following from (1) 'all sites must be treated as
equals; there must not be any reliance on a
central "master" site for some central service
...'
- Advantages
- system is less vulnerable
- less chance of a bottleneck
12 Objectives for DDBMS (Date)
- 3. Continuous operation
- A DDBMS is not an all-or-nothing proposition:
if one site fails the other sites can still
operate, even if at a reduced level.
- If one site is down then that site's data may
still be available if it is replicated.
- Hence, disruptions are minimised
12 Objectives for DDBMS (Date)
- 4. Location independence (transparency)
- 'Users ... should be able to behave ... as if the
data was all stored at their own local site.'
- A primary objective
- User is not aware if data is being retrieved
locally or from another site.
- Even if data is moved from one site to another,
users of the data are not affected.
- Easy for retrieval, harder for updates.
12 Objectives for DDBMS (Date)
- 5. Fragmentation independence
- '... users should be able to behave ... as if the
data were ... not fragmented at all.'
- The location of data is kept (somehow) in the system
catalog
- DDBMS uses the catalog to retrieve data from the
relevant site
- A query may select data from a table that is
horizontally fragmented, and the query may not be
restricted to a single site
- The query engine would have to do something like:
split the query into separate subqueries, one for
each involved site
- Send each subquery to the relevant site
- UNION the results of the subqueries into a final
answer
- All this is hidden from the user who submits the
query
- Advantage
- '... simplifies user programs and terminal
activities.'
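The split, send and UNION steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `catalog` and `site_data` structures, the site names, and the functions `run_subquery` and `global_query` are all hypothetical, and a real DDBMS would ship SQL over the network rather than call a local function.

```python
# Hypothetical system catalog: table name -> sites holding a horizontal fragment.
catalog = {"customer": ["site_a", "site_b"]}

# Hypothetical local databases, one list of rows per table per site.
site_data = {
    "site_a": {"customer": [{"id": 1, "balance": 900}, {"id": 2, "balance": 50}]},
    "site_b": {"customer": [{"id": 3, "balance": 1200}]},
}


def run_subquery(site, table, predicate):
    """Stand-in for sending one subquery to one site and receiving its rows."""
    return [row for row in site_data[site][table] if predicate(row)]


def global_query(table, predicate):
    """Use the catalog to find the involved sites, run one subquery per site,
    and UNION the partial results into a single answer."""
    result = []
    for site in catalog[table]:
        result.extend(run_subquery(site, table, predicate))
    return result


# The user writes one query and never names a site.
rich = global_query("customer", lambda row: row["balance"] > 500)
```

Because horizontal fragments are assumed to be disjoint here, the UNION reduces to concatenating the partial results; overlapping fragments would also need duplicate elimination.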
12 Objectives for DDBMS (Date)
- 6. Replication independence (transparency)
- Similar to fragmentation independence.
- Some data may be stored at multiple sites.
- DDBMS determines the closest copy for retrieval.
- DDBMS changes the copies at all sites when updating.
- involves some interesting algorithms
- Reasons for replication
- cuts communication costs
- duplication of essential data allows processing to
continue when communication is cut.
- enables quick, easy recovery after failure.
- Replication is used to improve performance and
data availability.
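The "closest copy for retrieval, all sites when updating" behaviour can be sketched as below. It is a deliberately naive illustration: the class `ReplicatedItem`, the site names and the distance figures are assumptions made for this example, and the "interesting algorithms" needed when a site is unreachable during an update are not modelled.

```python
class ReplicatedItem:
    """One logical data item with a copy stored at each of several sites."""

    def __init__(self, value, distances):
        # distances: site name -> assumed network cost from the user's site
        self.distances = distances
        self.copies = {site: value for site in distances}

    def read(self):
        """Retrieve from the copy with the lowest assumed network cost."""
        closest = min(self.distances, key=self.distances.get)
        return closest, self.copies[closest]

    def write(self, value):
        """Update every copy so that all replicas stay identical."""
        for site in self.copies:
            self.copies[site] = value


item = ReplicatedItem(100, {"local": 0, "remote_1": 40, "remote_2": 75})
item.write(120)            # touches all three sites
site, value = item.read()  # served by the local copy only
```

The asymmetry is the point of the sketch: a read costs one access to the nearest site, while a write costs one access per copy.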
12 Objectives for DDBMS (Date)
- 7. Distributed query processing
- 'It is important that optimization in a
distributed system be performed from a global
perspective.'
- Communication time is now the slowest part of
executing a query, so it has to be taken
into account when formulating a query plan
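To see why communication time dominates the choice of plan, a rough cost comparison helps. The cost model (one message latency plus payload divided by bandwidth) and every number below are assumptions chosen for illustration, not figures from the handout.

```python
def transfer_seconds(rows, bytes_per_row, bandwidth_bytes_per_s, latency_s):
    """Assumed cost model: one message latency plus payload / bandwidth."""
    return latency_s + rows * bytes_per_row / bandwidth_bytes_per_s


# Illustrative numbers only.
TOTAL_ROWS = 1_000_000   # rows in the remote table
MATCHING_ROWS = 1_000    # rows that satisfy the query predicate
ROW_BYTES = 200
BANDWIDTH = 1_000_000    # bytes per second
LATENCY = 0.05           # seconds per message

# Plan A: ship the whole remote table, then filter it locally.
plan_a = transfer_seconds(TOTAL_ROWS, ROW_BYTES, BANDWIDTH, LATENCY)

# Plan B: send the filter to the remote site (one extra message),
# then ship back only the matching rows.
plan_b = LATENCY + transfer_seconds(MATCHING_ROWS, ROW_BYTES, BANDWIDTH, LATENCY)

best_plan = min([("ship table", plan_a), ("filter remotely", plan_b)],
                key=lambda plan: plan[1])
```

Under these assumed figures Plan A spends about 200 seconds on the network and Plan B about 0.3 seconds, so a globally aware optimiser would push the filter to the remote site.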
12 Objectives for DDBMS (Date)
- 8. Distributed transaction management
- '... recovery control and concurrency control
...' require 'extended treatment in the
distributed environment.'
- e.g. concurrency over multiple sites requires an
advanced form of the two-phase commit protocol
- As the transaction progresses, the nodes involved lock
the data that is accessed or to be updated
- At the end of the transaction the coordinating
node (site) sends "get ready" requests to all other
nodes involved in the transaction
- Each node responds whether it is ready or not
- If all replies are OK then the coordinating node
issues the actual commit, otherwise a rollback
- Each node performs the update and responds to the
coordinating node when the update is complete
- ONLY AFTER ALL nodes respond does the coordinating
node send a message to release all locks.
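The message flow described above can be sketched as a small simulation. This is basic two-phase commit only, under simplifying assumptions: the `Node` class and `two_phase_commit` function are hypothetical names, everything runs in one process, and timeouts, crashes and logging are not modelled.

```python
class Node:
    """A participant site in one distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.locked = True       # locks were taken while the transaction ran
        self.state = "active"

    def prepare(self):
        """Phase 1: answer the coordinator's "get ready" request."""
        return self.can_commit

    def finish(self, decision):
        """Phase 2: apply the coordinator's decision and acknowledge it."""
        self.state = decision
        return True

    def release_locks(self):
        self.locked = False


def two_phase_commit(nodes):
    """Coordinator logic: collect votes, decide, then release locks last."""
    votes = [node.prepare() for node in nodes]            # phase 1
    decision = "committed" if all(votes) else "rolled back"
    acks = [node.finish(decision) for node in nodes]      # phase 2
    if all(acks):                                         # only after ALL respond
        for node in nodes:
            node.release_locks()
    return decision


outcome_ok = two_phase_commit([Node("a"), Node("b")])
failing = [Node("a"), Node("b", can_commit=False)]
outcome_bad = two_phase_commit(failing)
```

A single "not ready" vote forces a rollback at every node, which is what keeps the sites consistent with one another.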
12 Objectives for DDBMS (Date)
- 9. Hardware independence
- 10. Operating system independence
- In theory, it does not matter if different nodes
in the DDBMS run on different OSs.
- e.g. a Unix computer, a PC and a Mac should all be
able to participate in the same distributed
system.
- The standard TCP/IP communications protocol has
facilitated this objective.
- 11. Network independence
- 12. DBMS independence
- Heterogeneous databases.
Why distributed databases?
- Organizational and economic structure may be
distributed. A DDB better models this.
- One can interconnect existing DBs.
- Incremental growth is supported.
- Reduction in communications overhead - emphasis
on local processing.
- Data is stored closer to where it is accessed, so
access times are reduced.
- Parallel processing - improved performance.
Why distributed databases?
- Reliability - graceful degradation
- Loss of one site does not bring the entire system
down
- Reduction in data processing bureaucracy.
- Gain in local autonomy
- Drops in computer costs allow more computing
power to be purchased
- Capacity and growth potential of the system are
increased
Problems with DDB
- Security
- The data is now spread over many sites; securing
all sites is more difficult than securing one
site.
- Authority to access data items must be
duplicated.
Problems with DDB
- Catalog Management (Data Dictionary)
- May be stored (i) centrally, (ii) fully
replicated on every site, (iii) fragmented over
each site, i.e. each site keeps and maintains the
catalog for objects stored at that site, or (iv) as a
combination of (i) and (iii).
- New data objects require updating each DD
concerned.
- Note: sites not aware of the new data object will
not be able to access it.
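One way to picture storage option (iv), the combination of a central catalog and per-site catalogs, is the sketch below. The structures `central_index` and `local_catalogs`, the object names and the helper functions are hypothetical; an actual DDBMS may organise its data dictionary quite differently.

```python
# Option (iii) part: each site keeps the catalog for its own objects.
local_catalogs = {
    "site_a": {"orders": ["order_id", "total"]},
    "site_b": {"parts": ["part_id", "weight"]},
}

# Option (i) part: a central index recording which site owns each object.
central_index = {"orders": "site_a", "parts": "site_b"}


def describe(object_name):
    """Find the owning site centrally, then read that site's local catalog."""
    site = central_index.get(object_name)
    if site is None:
        return None  # the object is unknown here, so it cannot be accessed
    return site, local_catalogs[site][object_name]


def create_object(site, object_name, columns):
    """A new data object must be recorded in every dictionary concerned."""
    local_catalogs[site][object_name] = columns
    central_index[object_name] = site


before = describe("invoices")                        # not yet in any dictionary
create_object("site_a", "invoices", ["invoice_id"])
after = describe("invoices")                         # now visible via the index
```

If `create_object` updated only the local catalog, lookups through the central index would still fail, which is the situation the note above warns about.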
Problems with DDB
- Query Processing
- Retrieving data that is fragmented over multiple
sites requires extra consideration
- The slow component is now transferring data between
sites
- Need to optimise a query to reduce this network
traffic
Problems with DDB
- Update Propagation for Replicated Data
- One copy of any replicated data is designated the
Primary Copy
- Primary Copy Update Strategy
- An update operation is deemed complete when the
primary copy is updated
- the DDBMS is then responsible for propagating the
update to the other (secondary) copies at some
subsequent time.
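The primary copy strategy can be sketched as an update that returns as soon as the primary is changed, with propagation deferred. The class `PrimaryCopyItem`, its queue of pending updates and the site names are assumptions made for this illustration; failure handling is left out.

```python
from collections import deque


class PrimaryCopyItem:
    """A replicated data item with one primary and several secondary copies."""

    def __init__(self, value, secondary_sites):
        self.primary = value
        self.secondaries = {site: value for site in secondary_sites}
        self.pending = deque()   # updates not yet propagated

    def update(self, value):
        """The update is deemed complete once the primary copy is changed."""
        self.primary = value
        self.pending.append(value)

    def propagate(self):
        """Run later by the DDBMS: push queued updates to every secondary."""
        while self.pending:
            value = self.pending.popleft()
            for site in self.secondaries:
                self.secondaries[site] = value


item = PrimaryCopyItem(10, ["site_b", "site_c"])
item.update(11)
stale = dict(item.secondaries)   # secondaries still hold the old value here
item.propagate()
```

The window between `update` and `propagate` is the trade-off of this strategy: updates complete quickly, but a read from a secondary copy may briefly return old data.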
Problems with DDB
- Recovery Control
- Typically based on the Two-Phase Commit protocol
- The coordinating resource instructs all resources
to "get ready". The resources reply "OK" or "Not
OK".
- If all resources reply "OK" then the commit
directive is given to each resource, otherwise
the rollback directive is given.
- Each resource indicates success after commit/rollback
Problems with DDB
- Recovery Control (Example)
- i.e. messages go something like:
Phase 1
1. Coord -> Resource ("Get Ready")
2. Resource -> Coord ("Okay")
Phase 2
3. Coord -> Resource ("Do It")
4. Resource -> Coord ("Done It")
- More messages on the network -> more overhead
- Coordinating resource is usually the site where
the transaction originates
- Coordinating and resource sites must write
details of every decision into their log files in
case a rollback or restart is required.
Problems with DDB
- Concurrency/Global Deadlock
- Accessing data from remote sites means extra lock
requests/grants
- even more network overhead
- Possibility of global deadlock that cannot be
detected at a single site.
- Administration
- With many sites there may be many DBAs.
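The global deadlock point above can be made concrete with wait-for graphs. In this sketch each site sees only its own "who waits for whom" edges; the transactions T1 and T2, the edge lists and the function `has_cycle` are hypothetical, and merging all edges in one place is just one possible detection approach.

```python
def has_cycle(edges):
    """Detect a cycle in a wait-for graph given as (waiter, holder) pairs."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, []).append(holder)

    visiting, done = set(), set()

    def visit(txn):
        if txn in done:
            return False
        if txn in visiting:
            return True          # returned to a transaction on the current path
        visiting.add(txn)
        found = any(visit(nxt) for nxt in graph.get(txn, []))
        visiting.discard(txn)
        done.add(txn)
        return found

    return any(visit(txn) for txn in list(graph))


# Hypothetical transactions T1 and T2, each holding locks at both sites.
site_1_waits = [("T1", "T2")]   # at site 1, T1 waits for a lock held by T2
site_2_waits = [("T2", "T1")]   # at site 2, T2 waits for a lock held by T1

local_1 = has_cycle(site_1_waits)                         # no cycle seen locally
local_2 = has_cycle(site_2_waits)                         # no cycle seen locally
global_deadlock = has_cycle(site_1_waits + site_2_waits)  # cycle appears
```

Neither site can see the problem alone; the cycle only shows up once the edges from both sites are combined, and shipping those edges between sites adds yet more network overhead.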
DDBMS Fundamental Principle
- To the user, a distributed system should look
exactly like a nondistributed system.