Distributed Databases - PowerPoint PPT Presentation

Slides: 27

Provided by: Chan86

Transcript and Presenter's Notes

Title: Distributed Databases

1
Distributed Databases

CS 95 Advanced Database Systems
Handout 4

2
Definitions

"... a collection of data which belong logically
to the same system but are spread over the sites
of a computer network." (Ceri, 1984)
"A distributed database is a collection of
logically related data distributed across several
machines interconnected by a computer network. An
application program operating on a distributed
database may access data stored at more than one
machine." (Gardarin Valduriez)
"A set of cooperating databases, each resident at
a different site, that the user views and
manipulates as a centralized database." (Gardarin
Valduriez)
".. a kind of virtual object, ... physically
stored in a number of distinct real databases at
a number of distinct sites." (Date)

3
Concept

Can have many sites, each with their own database
management system (DBMS).
Local data stored at each site.
Sites connected by communications network.
To a user (application program) all sites
together appear as one big database.
Needs simple structures --> relational databases
e.g., Ingres/Star

4
General structure of a DDBMS

5
Another diagram of a DDBMS

DC = Data Communications component
DBMS = local DBMS
DDBMS = Distributed DBMS component
GDD = Global Data Dictionary

Note that although site 3 has access to the DDB,
the local DB at site 3 is not part of the DDB.

6
Fragmentation

Horizontal partitioning
Rows of a single table are stored across different
sites.
Vertical partitioning
Either each table is stored whole at a single site,
with different tables at different sites, or the
columns of one table are stored across different
sites.
The basic principle is that data is stored at the
site where it is most often accessed:
network traffic is reduced
performance improves
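
The two kinds of partitioning can be illustrated with a minimal Python sketch. The EMPLOYEE rows, the column names and the helper functions below are all hypothetical and are not taken from the handout; they only show that horizontal fragments hold whole rows, while vertical fragments hold some columns plus the key needed to join them back together.

```python
# Hypothetical EMPLOYEE relation; every name and value is made up for illustration.
employees = [
    {"emp_id": 1, "name": "Ana", "site": "Manila", "salary": 500},
    {"emp_id": 2, "name": "Ben", "site": "Cebu", "salary": 450},
    {"emp_id": 3, "name": "Cy", "site": "Manila", "salary": 600},
]


def horizontal_fragments(rows, key):
    """Group whole rows by the value of `key` (one fragment per site)."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[key], []).append(row)
    return fragments


def vertical_fragment(rows, primary_key, columns):
    """Keep only `columns` of each row, plus the primary key so that
    the fragments can later be joined back together."""
    return [{c: row[c] for c in (primary_key, *columns)} for row in rows]


def rejoin(left, right, primary_key):
    """Rebuild full rows from two vertical fragments by matching keys."""
    by_key = {row[primary_key]: row for row in right}
    return [{**row, **by_key[row[primary_key]]} for row in left]


by_site = horizontal_fragments(employees, "site")            # rows split by site
personal = vertical_fragment(employees, "emp_id", ["name", "site"])
payroll = vertical_fragment(employees, "emp_id", ["salary"])  # columns split
```

Joining `personal` and `payroll` on `emp_id` gives back the original rows, which is why each vertical fragment has to carry the key.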

7
Replication

Data may be stored at more than one site. (i.e.,
a copy of the data)
Whole relations may be replicated, or just
fragments of relations
Faster retrieval as the network communications
link is not used as often because data logically
belonging to another site may have a local copy.
Provides backup
Problems with updating records.

8
12 Objectives for DDBMS (Date)

1. Local autonomy
'Local data is locally owned and managed ...'
Each site must be able to control and process its
local data independent of any other site.
Need to store the Global Data Dictionary at each
site
Some aspects of data security and data integrity
are managed at the global level.

9
12 Objectives for DDBMS (Date)

2. No reliance on a central site
Following from (1): 'all sites must be treated as
equals; there must not be any reliance on a
central "master" site for some central service
...'
Advantages
system is less vulnerable
less chance of a bottleneck

10
12 Objectives for DDBMS (Date)

3. Continuous operation
DDBMS are not an all or nothing proposition, in
that if one site fails the other sites can still
operate, even if at a reduced level.
If one site is down then that site's data may
still be available if it is replicated.
Hence, disruptions are minimised

11
12 Objectives for DDBMS (Date)

4. Location independence (transparency)
'Users ... should be able to behave ... as if the
data was all stored at their own local site.'
A primary objective
User is not aware if data is being retrieved
locally or from another site.
Even if data is moved from one site to another,
users of the data are not affected.
Easy for retrieval, harder for updates.

12
12 Objectives for DDBMS (Date)

5. Fragmentation independence
'... users should be able to behave ... as if the
data were ... not fragmented at all.'
Location of data is kept (somehow) in the system
catalog
DDBMS uses the catalog to retrieve data from the
relevant site
A query may select data from a table that is
horizontally fragmented, and the query may not be
restricted to a single site
The query engine would have to do something like:
split the query into separate subqueries, one for
each involved site
send each subquery to the relevant site
UNION the results of the subqueries into a final
answer
All this is hidden from the user who submits the
query
Advantage
'... simplifies user programs and terminal
activities.'
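
The split / send / UNION steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ACCOUNTS data, the site names, the `CATALOG` layout and the `run_subquery` helper are hypothetical, and the "network" is simulated with a local dictionary lookup.

```python
# Hypothetical horizontal fragments of an ACCOUNTS table, one list per site.
SITE_DATA = {
    "site_a": [
        {"acct": 1, "branch": "north", "balance": 120},
        {"acct": 2, "branch": "north", "balance": 40},
    ],
    "site_b": [
        {"acct": 3, "branch": "south", "balance": 300},
    ],
}

# Hypothetical system catalog: which sites hold fragments of each table.
CATALOG = {"accounts": ["site_a", "site_b"]}


def run_subquery(site, predicate):
    """Stand-in for shipping a subquery to `site` and getting rows back."""
    return [row for row in SITE_DATA[site] if predicate(row)]


def global_query(table, predicate):
    """Split the query per site, run each subquery, and combine the results.

    The fragments are disjoint here, so concatenating the partial results
    is equivalent to a UNION.
    """
    answer = []
    for site in CATALOG[table]:
        answer.extend(run_subquery(site, predicate))
    return answer


# The caller names only the table, never a site.
rich = global_query("accounts", lambda row: row["balance"] > 100)
```

The caller never mentions a site, which is the point of fragmentation independence: only `global_query` consults the catalog.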

13
12 Objectives for DDBMS (Date)

6. Replication independence (transparency)
Similar to fragmentation independence.
Some data may be stored at multiple sites.
DDBMS determines the closest copy for retrieval.
DDBMS updates the copies at all sites when updating.
involves some interesting algorithms
Reasons for replication:
cuts communication costs
duplication of essential data allows processing to
continue when communication is cut.
enables quick, easy recovery after a failure.
Replication is used to improve performance and
data availability.
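
A minimal sketch of "read the closest copy, update every copy", written in Python. The class name, the site names and the numeric distances are hypothetical; a real DDBMS would use far more elaborate algorithms than a loop over a dictionary.

```python
class ReplicatedItem:
    """One logical data item stored as a copy at several sites."""

    def __init__(self, distances):
        # distances: site name -> hypothetical network cost from the user's site
        self.distances = distances
        self.copies = {site: None for site in distances}

    def write(self, value):
        """Update the copy at every site holding the item."""
        for site in self.copies:
            self.copies[site] = value

    def read(self, available=None):
        """Read from the closest site, optionally restricted to the
        sites that are currently reachable."""
        candidates = [
            site for site in self.copies
            if available is None or site in available
        ]
        nearest = min(candidates, key=self.distances.__getitem__)
        return nearest, self.copies[nearest]


item = ReplicatedItem({"local": 0, "remote1": 5, "remote2": 9})
item.write("x")
```

If the local site is unreachable, `item.read(available={"remote1", "remote2"})` still returns the value from the next closest copy, which is the availability benefit mentioned above.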

14
12 Objectives for DDBMS (Date)

7. Distributed query processing
'It is important that optimization in a
distributed system be performed from a global
perspective.'
Communication time is now the slowest part of
execution of a query, so that has to be taken
into account when formulating a query plan
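
As a rough illustration of why communication cost drives the plan, the Python sketch below compares two ways of joining relations held at different sites. All figures (row counts, row sizes, bandwidth, latency) and the relation names are invented for the example; a real optimizer would also weigh local processing costs.

```python
def transfer_time(num_rows, row_bytes, bandwidth_bytes_per_s, latency_s):
    """Estimated seconds needed to ship a relation between two sites."""
    return latency_s + num_rows * row_bytes / bandwidth_bytes_per_s


# Hypothetical join of ORDERS (held at site B) with CUSTOMERS (held at site A).
BANDWIDTH = 1_000_000  # bytes per second, assumed
LATENCY = 0.1          # seconds per transfer, assumed

plans = {
    "ship ORDERS to site A": transfer_time(100_000, 100, BANDWIDTH, LATENCY),
    "ship CUSTOMERS to site B": transfer_time(1_000, 200, BANDWIDTH, LATENCY),
}

# A global optimizer would prefer the plan that moves the least data.
best = min(plans, key=plans.get)
```

With these assumed numbers, shipping the small CUSTOMERS relation takes a fraction of a second, while shipping ORDERS takes about ten seconds, so the choice of which relation to move dominates the cost of the query.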

15
12 Objectives for DDBMS (Date)

8. Distributed transaction management
'... recovery control and concurrency control
...' require 'extended treatment in the
distributed environment.'
e.g., concurrency over multiple sites requires an
advanced form of the two-phase commit protocol:
As the transaction progresses, the nodes involved
lock the data that is accessed or is to be updated.
At the end of the transaction, the coordinating
node (site) sends "get ready" requests to all other
nodes involved in the transaction. Each node
responds whether it is ready or not.
If all replies are OK, the coordinating node
issues the actual commit; otherwise it issues a
rollback. Each node performs the update and
responds to the coordinating node when the update
is complete. ONLY AFTER ALL nodes respond does the
coordinating node send a message to release all
locks.
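
The sequence described on this slide can be sketched as a small single-process Python simulation. The `Node` class, its `can_commit` switch and the function name are hypothetical; there is no real network, timeout handling or crash recovery here, only the order of the steps.

```python
class Node:
    """A participating site in the simulated transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical switch to simulate "not ready"
        self.locked = False
        self.state = "idle"

    def lock(self):
        self.locked = True

    def prepare(self):
        """Reply to the coordinator's "get ready" request."""
        return self.can_commit

    def commit(self):
        self.state = "committed"
        return True  # acknowledgement sent back to the coordinator

    def rollback(self):
        self.state = "rolled back"
        return True

    def release(self):
        self.locked = False


def two_phase_commit(nodes):
    # During the transaction, every node locks the data it touches.
    for node in nodes:
        node.lock()
    # Phase 1: ask every node whether it is ready.
    votes = [node.prepare() for node in nodes]
    decision = "commit" if all(votes) else "rollback"
    # Phase 2: issue the decision and collect acknowledgements.
    acks = [
        node.commit() if decision == "commit" else node.rollback()
        for node in nodes
    ]
    # Locks are released only after all nodes have responded.
    if all(acks):
        for node in nodes:
            node.release()
    return decision
```

A single node answering "not ready" (for example `Node("B", can_commit=False)`) is enough to make the whole transaction roll back at every site.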

16
12 Objectives for DDBMS (Date)

9. Hardware independence
10. Operating system independence
In theory, it does not matter if different nodes
in the DDBMS run on different OSs.
e.g., a Unix computer, a PC and a Mac should all be
able to participate in the same distributed
system.
The standard TCP/IP communications protocol has
facilitated this objective.
11. Network independence
12. DBMS independence
Heterogeneous databases.

17
Why distributed databases?

Organizational and economic structure may be
distributed. A DDB better models this.
One can interconnect existing DBs.
Incremental growth is supported.
Reduction in communications overhead - emphasis
on local processing.
Data is stored closer to where it is accessed, so
access times are reduced.
Parallel processing - improved performance.

18
Why distributed databases?

Reliability - graceful degradation
Loss of one site does not bring the entire system
down
Reduction in data processing bureaucracy.
Gain in local autonomy
Drops in computer costs allow more computing
power to be purchased
Capacity and growth potential of the system are
increased

19
Problems with DDB

Security
The data is now spread over many sites, securing
all sites is more difficult than securing one
site.
Authority to access data items must be
duplicated.

20
Problems with DDB

Catalog Management (Data Dictionary)
May be stored:
(i) Centrally
(ii) Fully replicated on every site
(iii) Fragmented over each site, i.e., each site
keeps and maintains the catalog for objects stored
at that site
(iv) Combination of (i) and (iii)
New data objects require updating each DD
concerned.
Note: sites not aware of the new data object will
not be able to access it.

21
Problems with DDB

Query Processing
Retrieving data that is fragmented over multiple
sites requires extra consideration
Slow component now is transferring data between
sites
Need to optimise a query to reduce this network
traffic

22
Problems with DDB

Update Propagation for Replicated Data
One copy of any replicated data is designated the
Primary Copy
Primary Copy Update Strategy
An update operation is deemed complete when the
primary copy is updated.
The DDBMS is then responsible for propagating the
update to the other (secondary) copies at some
subsequent time.
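
The primary copy strategy can be sketched in Python as follows. The class name, the site names and the `pending` queue are hypothetical; the sketch only shows that the update is reported complete before the secondary copies change, so they are briefly out of date.

```python
from collections import deque


class PrimaryCopyItem:
    """One replicated item with a designated primary copy."""

    def __init__(self, primary, secondaries, value=None):
        self.primary = primary
        self.copies = {primary: value, **{site: value for site in secondaries}}
        self.pending = deque()  # updates not yet sent to secondary copies

    def update(self, value):
        """Update the primary copy; secondaries are only queued for later."""
        self.copies[self.primary] = value
        for site in self.copies:
            if site != self.primary:
                self.pending.append((site, value))
        return "complete"  # deemed complete once the primary is updated

    def propagate(self):
        """Run by the DDBMS at some subsequent time."""
        while self.pending:
            site, value = self.pending.popleft()
            self.copies[site] = value


item = PrimaryCopyItem("A", ["B", "C"], value=1)
status = item.update(2)
stale = dict(item.copies)   # snapshot before propagation
item.propagate()
```

Between `update` and `propagate`, a read from site B or C would still see the old value, which is the trade-off this strategy accepts in exchange for a fast update.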

23
Problems with DDB

Recovery Control
Typically based on the Two-Phase Commit protocol
The coordinating resource instructs all resources
to "get ready". The resources reply "OK" or "Not
OK".
If all resources reply "OK" then the commit
directive is given to each resource, otherwise
the rollback directive is given. Each resource
reports success after the commit or rollback.

24
Problems with DDB

Recovery Control (Example)
i.e., messages go something like:
Phase 1
1. Coord -> Resource ("Get Ready")
2. Resource -> Coord ("Okay")
Phase 2
3. Coord -> Resource ("Do It")
4. Resource -> Coord ("Done It")
More messages on the network => more overhead
Coordinating resource is usually the site where
the transaction originates
Coordinating and resource sites must write
details of every decision into their log files in
case a rollback or restart is required.
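
To make the message overhead concrete, the Python sketch below generates the message trace and the log entries for one transaction. It is a simplified illustration: the function name is hypothetical, the "Not Okay" and "Roll Back" labels are assumed wording for the negative reply and the rollback directive, and the log is an in-memory list rather than a file.

```python
def trace_two_phase_commit(resources, votes):
    """Return (messages, log) for one simulated transaction.

    resources: list of resource site names
    votes: dict mapping each resource name to True ("Okay") or False
    """
    messages = []  # (sender, receiver, text) tuples sent over the network
    log = []       # (site, entry) tuples each site would write to its log

    # Phase 1: the coordinator asks every resource to get ready.
    for resource in resources:
        messages.append(("Coord", resource, "Get Ready"))
        reply = "Okay" if votes[resource] else "Not Okay"
        messages.append((resource, "Coord", reply))
        log.append((resource, "voted " + reply))

    # The coordinator decides and records the decision before acting on it.
    directive = "Do It" if all(votes[r] for r in resources) else "Roll Back"
    log.append(("Coord", "decided " + directive))

    # Phase 2: the directive goes out and each resource confirms.
    for resource in resources:
        messages.append(("Coord", resource, directive))
        messages.append((resource, "Coord", "Done It"))
        log.append((resource, "finished " + directive))

    return messages, log


messages, log = trace_two_phase_commit(
    ["R1", "R2", "R3"], {"R1": True, "R2": True, "R3": True}
)
```

In this sketch each resource costs four messages, so a transaction involving three resources exchanges twelve messages before it finishes.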

25
Problems with DDB