Distributed Databases - PowerPoint PPT Presentation

Provided by: Chan86
1
Distributed Databases
  • CS 95 Advanced Database Systems
  • Handout 4

2
Definitions
  • "... a collection of data which belong logically
    to the same system but are spread over the sites
    of a computer network." (Ceri, 1984)
  • "A distributed database is a collection of
    logically related data distributed across several
    machines interconnected by a computer network. An
    application program operating on a distributed
    database may access data stored at more than one
    machine." (Gardarin Valduriez)
  • "A set of cooperating databases, each resident at
    a different site, that the user views and
    manipulates as a centralized database." (Gardarin
    Valduriez)
  • ".. a kind of virtual object, ... physically
    stored in a number of distinct real databases at
    a number of distinct sites." (Date)

3
Concept
  • Can have many sites, each with their own database
    management system (DBMS).
  • Local data stored at each site.
  • Sites connected by communications network.
  • To a user (application program) all sites
    together appear as one big database.
  • Needs simple structures --> relational databases
  • e.g., Ingres/Star

4
General structure of a DDBMS
5
Another diagram of a DDBMS
  • DC: Data Communications component
  • DBMS: local DBMS
  • DDBMS: Distributed DBMS component
  • GDD: Global Data Dictionary

Note that although site 3 has access to the DDB,
the local DB at site 3 is not part of the DDB.
6
Fragmentation
  • Horizontal partitioning
  • Rows of one table are stored across different
    sites.
  • Vertical partitioning
  • Either each table is stored whole at a single site,
    with different tables at different sites, or the
    columns of one table are stored across different sites
  • Basic principle is that data is stored at the
    site where it is most often accessed
  • network traffic is reduced
  • improves performance
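
The two partitioning styles can be sketched in a few lines. This is only an illustration: the employee table, its columns and the site names are made up for the example and do not come from the handout.

```python
# Hypothetical table: each row is a dict and "emp_id" is the key column.
employees = [
    {"emp_id": 1, "name": "Ann", "salary": 50000, "site": "Sydney"},
    {"emp_id": 2, "name": "Bob", "salary": 42000, "site": "Perth"},
    {"emp_id": 3, "name": "Cy", "salary": 61000, "site": "Sydney"},
]

def horizontal_fragments(rows, column):
    """Split rows into fragments keyed by the value of `column`."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments

def vertical_fragments(rows, key, column_groups):
    """Split columns into groups; every fragment keeps the key column."""
    return [
        [{c: row[c] for c in (key, *cols)} for row in rows]
        for cols in column_groups
    ]

def rejoin_vertical(fragments, key):
    """Rebuild full rows by joining vertical fragments on the key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]

# Horizontal: rows grouped by the site that uses them most.
by_site = horizontal_fragments(employees, "site")

# Vertical: (name, site) columns in one fragment, salary in another.
vertical = vertical_fragments(employees, "emp_id", [("name", "site"), ("salary",)])
rebuilt = rejoin_vertical(vertical, "emp_id")
```

Each vertical fragment repeats the key column, which is what allows the original rows to be rebuilt by a join.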

7
Replication
  • Data may be stored at more than one site (i.e.,
    copies of the data are kept).
  • Whole relations may be replicated, or just
    fragments of relations
  • Faster retrieval as the network communications
    link is not used as often because data logically
    belonging to another site may have a local copy.
  • Provides backup
  • Problems with updating records.

8
12 Objectives for DDBMS (Date)
  • 1. Local autonomy
  • 'Local data is locally owned and managed ...'
  • Each site must be able to control and process its
    local data independent of any other site.
  • Need to store the Global Data Dictionary at each
    site
  • Some aspects of data security and data integrity
    are managed at the global level.

9
12 Objectives for DDBMS (Date)
  • 2. No reliance on a central site
  • Following from (1): 'all sites must be treated as
    equals; there must not be any reliance on a
    central "master" site for some central service
    ...'
  • Advantages
  • system is less vulnerable
  • less chance of a bottleneck

10
12 Objectives for DDBMS (Date)
  • 3. Continuous operation
  • DDBMS are not an all or nothing proposition, in
    that if one site fails the other sites can still
    operate, even if at a reduced level.
  • If one site is down then that site's data may
    still be available if it is replicated.
  • Hence, disruptions are minimised.

11
12 Objectives for DDBMS (Date)
  • 4. Location independence (transparency)
  • 'Users ... should be able to behave ... as if the
    data was all stored at their own local site.'
  • A primary objective
  • User is not aware if data is being retrieved
    locally or from another site.
  • Even if data is moved from one site to another,
    users of the data are not affected.
  • Easy for retrieval, harder for updates.

12
12 Objectives for DDBMS (Date)
  • 5. Fragmentation independence
  • '... users should be able to behave ... as if the
    data were ... not fragmented at all.'
  • location of data is kept (somehow) in the system
    catalog
  • DDBMS uses catalog to retrieve data from the
    relevant site
  • A query may select data from a table that is
    horizontally fragmented, and the query may not be
    restricted to a single site
  • The query engine would have to do something like
    split the query into separate subqueries, one for
    each involved site
  • Send each subquery to the relevant site
  • UNION the results of the subqueries into a final
    answer
  • All this is hidden from the user that submits the
    query
  • Advantage
  • '... simplifies user programs and terminal
    activities.'
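
The split / send / UNION steps above can be sketched as follows. The catalog layout, site names and customer rows are assumptions made for the example, and `run_subquery` merely stands in for a real network call.

```python
# Hypothetical horizontal fragments of one "customer" table, one per site.
SITE_DATA = {
    "site_a": [{"id": 1, "city": "Perth", "balance": 120},
               {"id": 2, "city": "Perth", "balance": 40}],
    "site_b": [{"id": 3, "city": "Sydney", "balance": 300}],
}

# Hypothetical system catalog: table name -> sites holding a fragment.
CATALOG = {"customer": ["site_a", "site_b"]}

def run_subquery(site, predicate):
    """Stand-in for sending a subquery to `site` and getting rows back."""
    return [row for row in SITE_DATA[site] if predicate(row)]

def global_query(table, predicate):
    """Run one subquery per involved site and combine the partial results."""
    result = []
    for site in CATALOG[table]:
        # Fragments are disjoint here, so appending acts as the UNION step.
        result.extend(run_subquery(site, predicate))
    return result

# The user writes one query and never names a site.
rich = global_query("customer", lambda row: row["balance"] > 100)
```

If fragments could overlap, the combining step would also need to remove duplicate rows.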

13
12 Objectives for DDBMS (Date)
  • 6. Replication independence (transparency)
  • Similar to fragmentation independence.
  • Some data may be stored at multiple sites.
  • DDBMS determines the closest copy for retrieval.
  • DDBMS updates the copies at all sites when updating.
  • involves some interesting algorithms
  • Reasons for replication
  • cuts communication costs
  • duplication of essential data allows processing to
    continue when communication is cut.
  • enables quick and easy recovery after failure.
  • Replication is used to improve performance and
    data availability.
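
A minimal sketch of "read the closest copy, update every copy", assuming a made-up distance table. The class and site names are hypothetical, and a real DDBMS would also have to cope with sites that are unreachable during the update.

```python
class ReplicatedItem:
    """One logical data item with copies at several sites (hypothetical)."""

    def __init__(self, value, sites):
        self.copies = {site: value for site in sites}

    def read(self, distances):
        """Read from the copy with the smallest distance to the caller."""
        nearest = min(self.copies, key=lambda site: distances[site])
        return nearest, self.copies[nearest]

    def write(self, value):
        """Update every copy so that all sites agree."""
        for site in self.copies:
            self.copies[site] = value

item = ReplicatedItem(10, ["perth", "sydney", "darwin"])
distances = {"perth": 5, "sydney": 40, "darwin": 25}  # made-up network costs
site, value = item.read(distances)   # served by the nearest copy
item.write(11)                       # every copy is changed
```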

14
12 Objectives for DDBMS (Date)
  • 7. Distributed query processing
  • 'It is important that optimization in a
    distributed system be performed from a global
    perspective.'
  • Communication time is now the slowest part of
    execution of a query, so that has to be taken
    into account when formulating a query plan
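
As a rough illustration of why communication time drives the plan, the sketch below compares two ways of joining tables held at different sites. The table sizes, row width, link speed and delay are invented numbers, and the cost model (fixed delay plus volume divided by bandwidth) is a simplifying assumption.

```python
def transfer_seconds(num_rows, row_bytes, bytes_per_second, delay_seconds):
    """Estimated time to ship rows over the network: fixed delay plus volume."""
    return delay_seconds + (num_rows * row_bytes) / bytes_per_second

def cheapest_plan(plans, bytes_per_second, delay_seconds):
    """Pick the plan whose shipped data takes the least time to transfer."""
    costs = {
        name: transfer_seconds(rows, width, bytes_per_second, delay_seconds)
        for name, (rows, width) in plans.items()
    }
    return min(costs, key=costs.get), costs

# Made-up join of a 10,000-row table at site A with a 100-row table at site B.
# Each plan is described by (rows shipped, bytes per row).
plans = {
    "ship_big_table_to_B": (10_000, 100),
    "ship_small_table_to_A": (100, 100),
}
best, costs = cheapest_plan(plans, bytes_per_second=10_000, delay_seconds=0.1)
```

Under these assumptions, shipping the small table is far cheaper. A purely local optimiser at site A would not see this, which is why the optimisation needs a global perspective.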

15
12 Objectives for DDBMS (Date)
  • 8. Distributed transaction management
  • '... recovery control and concurrency control
    ...' require 'extended treatment in the
    distributed environment.'
  • e.g., concurrency over multiple sites requires an
    advanced form of the two-phase commit protocol
  • As the transaction progresses, the nodes involved
    lock data that is accessed or to be updated
  • At the end of the transaction the coordinating
    node (site) sends "get ready" requests to all other
    nodes involved in the transaction. Each node
    responds whether it is ready or not.
  • If all replies are OK then the coordinating node
    issues the actual commit, otherwise a rollback. Each
    node performs the update and responds to the
    coordinating node when the update is complete. ONLY
    AFTER ALL nodes respond does the coordinating node
    send a message to release all locks.
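
The sequence above can be sketched as a small in-memory simulation. This is not a real protocol implementation: the `Node` class, its method names and the `can_commit` flag (which simulates a node's vote) are assumptions for the example, and failures and timeouts are ignored.

```python
class Node:
    """A participating site; `can_commit` simulates its vote (hypothetical)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.locked = set()
        self.data = {}
        self.pending = {}

    def lock_and_stage(self, key, value):
        self.locked.add(key)            # lock held while the transaction runs
        self.pending[key] = value

    def get_ready(self):
        return self.can_commit          # phase 1 reply: ready or not

    def commit(self):
        self.data.update(self.pending)  # phase 2: perform the update
        self.pending.clear()
        return True                     # acknowledge completion

    def rollback(self):
        self.pending.clear()            # discard the staged changes
        return True

    def release_locks(self):
        self.locked.clear()

def two_phase_commit(nodes):
    """Coordinator logic: collect votes, decide, then release all locks."""
    ready = all([node.get_ready() for node in nodes])  # ask every node
    if ready:
        acks = [node.commit() for node in nodes]
    else:
        acks = [node.rollback() for node in nodes]
    if all(acks):                       # only after ALL nodes respond
        for node in nodes:
            node.release_locks()
    return "committed" if ready else "rolled back"

a, b = Node("a"), Node("b")
a.lock_and_stage("x", 1)
b.lock_and_stage("y", 2)
outcome = two_phase_commit([a, b])
```

A single "not ready" reply is enough to roll back the transaction at every node.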

16
12 Objectives for DDBMS (Date)
  • 9. Hardware independence
  • 10. Operating system independence
  • In theory, it does not matter if different nodes
    in the DDBMS run on different OSs.
  • e.g., a Unix computer, a PC and a Mac should all be
    able to participate in the same distributed
    system.
  • The standard TCP/IP communications protocol has
    facilitated this objective.
  • 11. Network independence
  • 12. DBMS independence
  • Heterogeneous databases.

17
Why distributed databases?
  • Organizational and economic structure may be
    distributed. A DDB better models this.
  • One can interconnect existing DBs.
  • Incremental growth is supported.
  • Reduction in communications overhead - emphasis
    on local processing.
  • Data is stored closer to where it is accessed, so
    access times are reduced.
  • Parallel processing - improved performance.

18
Why distributed databases?
  • Reliability - graceful degradation
  • Loss of one site does not bring the entire system
    down
  • Reduction in data processing bureaucracy.
  • Gain in local autonomy
  • Drops in computer costs allow more computing
    power to be purchased
  • Capacity and growth potential of the system is
    increased

19
Problems with DDB
  • Security
  • The data is now spread over many sites, securing
    all sites is more difficult than securing one
    site.
  • Authority to access data items must be
    duplicated.

20
Problems with DDB
  • Catalog Management (Data Dictionary)
  • May be stored
    (i) centrally,
    (ii) fully replicated on every site,
    (iii) fragmented over the sites, i.e., each site keeps
    and maintains the catalog for objects stored at that
    site, or
    (iv) as a combination of (i) and (iii).
  • New data objects require updating each DD
    concerned.
  • Note: sites not aware of the new data object will
    not be able to access it.

21
Problems with DDB
  • Query Processing
  • Retrieving data that is fragmented over multiple
    sites requires extra consideration
  • The slow component is now transferring data between
    sites
  • Need to optimise a query to reduce this network
    traffic

22
Problems with DDB
  • Update Propagation for Replicated Data
  • One copy of any replicated data is designated the
    Primary Copy
  • Primary Copy Update Strategy
  • An update operation is deemed complete when the
    primary copy is updated
  • the DDBMS is then responsible for propagating the
    update to the other (secondary) copies at some
    subsequent time.
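
A minimal sketch of the primary copy strategy, assuming a simple in-memory queue of updates that have not yet been propagated. The class, the queue and the site names are hypothetical.

```python
from collections import deque

class PrimaryCopyItem:
    """A replicated item with one designated primary copy (hypothetical)."""

    def __init__(self, value, primary, secondaries):
        self.primary = primary
        self.copies = {site: value for site in (primary, *secondaries)}
        self.backlog = deque()          # updates not yet sent to secondaries

    def update(self, value):
        """The update is complete as soon as the primary copy is written."""
        self.copies[self.primary] = value
        self.backlog.append(value)

    def propagate(self):
        """Later: apply queued updates, in order, to the secondary copies."""
        while self.backlog:
            value = self.backlog.popleft()
            for site in self.copies:
                if site != self.primary:
                    self.copies[site] = value

item = PrimaryCopyItem(1, "perth", ["sydney"])
item.update(2)
stale = item.copies["sydney"]   # secondary still holds the old value
item.propagate()
```

Between `update` and `propagate` a read from a secondary copy can return stale data, which is the price paid for the faster update.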

23
Problems with DDB
  • Recovery Control
  • Typically based on the Two-Phase Commit protocol
  • The coordinating resource instructs all resources
    to "get ready". The resources reply "OK" or "Not
    OK".
  • If all resources reply "OK" then the commit
    directive is given to each resource, otherwise
    the rollback directive is given. Each resource
    indicates success after the commit/rollback.

24
Problems with DDB
  • Recovery Control (Example)
  • i.e., messages go something like:
    Phase 1
      1. Coord --> Resource ("Get Ready")
      2. Resource --> Coord ("Okay")
    Phase 2
      3. Coord --> Resource ("Do It")
      4. Resource --> Coord ("Done It")
  • More messages on the network --> more overhead
  • Coordinating resource is usually the site where
    the transaction originates
  • Coordinating and resource sites must write
    details of every decision into their log files in
    case a rollback or restart is required.
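
To make the message overhead concrete, the sketch below generates the message trace and log entries for one transaction. The function name, the log format and the "Not Okay" and "Undo It" message texts are assumptions. The handout only names "Get Ready", "Okay", "Do It" and "Done It".

```python
def two_phase_messages(resources, votes):
    """Return (messages, log) for one transaction.

    `votes` maps each resource name to True ("Okay") or False.
    Each message is a (sender, receiver, text) tuple.
    """
    messages, log = [], []
    for r in resources:                                   # phase 1
        messages.append(("Coord", r, "Get Ready"))
        reply = "Okay" if votes[r] else "Not Okay"
        log.append((r, "voted " + reply))                 # resource logs its vote
        messages.append((r, "Coord", reply))
    decision = "commit" if all(votes[r] for r in resources) else "rollback"
    log.append(("Coord", "decided " + decision))          # coordinator logs decision
    order = "Do It" if decision == "commit" else "Undo It"
    for r in resources:                                   # phase 2
        messages.append(("Coord", r, order))
        log.append((r, "performed " + decision))
        messages.append((r, "Coord", "Done It"))
    return messages, log

messages, log = two_phase_messages(["R1", "R2"], {"R1": True, "R2": True})
message_count = len(messages)   # four messages per resource
```

The count grows as four messages per participating resource, on top of whatever traffic the transaction itself generated.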

25
Problems with DDB
  • Concurrency/Global Deadlock
  • Accessing data from remote sites means extra lock
    requests/grants
  • even more network overhead
  • Possibility of global deadlock that cannot be
    detected at a single site.
  • Administration
  • With many sites there may be many DBAs.
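
The global deadlock problem noted above can be illustrated with wait-for graphs. In this sketch each site's graph maps a transaction to the set of transactions it is waiting for. The transaction names and the idea of merging the local graphs at one place are assumptions for the example, and this is only one of several possible detection schemes.

```python
def find_cycle(wait_for):
    """Return True if the wait-for graph (txn -> set of txns) has a cycle."""
    visiting, done = set(), set()

    def visit(txn):
        if txn in done:
            return False
        if txn in visiting:
            return True                 # reached a txn already on this path
        visiting.add(txn)
        found = any(visit(nxt) for nxt in wait_for.get(txn, ()))
        visiting.discard(txn)
        done.add(txn)
        return found

    return any(visit(txn) for txn in list(wait_for))

def merge_graphs(*graphs):
    """Union the local wait-for graphs from each site into a global graph."""
    merged = {}
    for graph in graphs:
        for txn, waits_on in graph.items():
            merged.setdefault(txn, set()).update(waits_on)
    return merged

site_1 = {"T1": {"T2"}}   # at site 1, T1 waits for a lock held by T2
site_2 = {"T2": {"T1"}}   # at site 2, T2 waits for a lock held by T1

local_1 = find_cycle(site_1)
local_2 = find_cycle(site_2)
global_deadlock = find_cycle(merge_graphs(site_1, site_2))
```

Neither site sees a cycle on its own, but the merged graph contains T1 --> T2 --> T1.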

26
DDBMS Fundamental Principle
  • To the user, a distributed system should look
    exactly like a nondistributed system.