Title: Objectives of these slides:
112. Distributed Data Management
- Objectives of these slides
- examine the main issues related to data
distribution
2Overview
- 1. Data Distribution
- 2. The Relational Data Model
- 3. Methods of Data Distribution
- 4. Data Placement Considerations
- 5. Distributed Data Access
- 6. Database Transaction Management
31. Data Distribution
- Data is distributed to multiple sites (servers)
- main aim is to make the distribution transparent
to the end users and the business logic inside
applications - The DBMS (database management system) also should
be distributed. - How and when should data be distributed?
- What are the best locations for data?
4Advantages
- Efficient data access
- Application load balancing
- Application and end-user scalability
- Matches international business models
- Replication of data makes failure less critical
5Disadvantages
- Management complexity
- Data synchronization
- transaction processing
62. The Relational Data Model
- SQL (Structured Query Language)
- DDL (Data Definition Language)
- DML (Data Manipulation Language)
Customer Table
SS
Saving Acct
Check Acct
Name
ad
12345
67823
3442
Savings
Acct
Bal
SS
67823
3442
23
73. Methods of Data Distribution
- 3.1. Manual Extraction (read-only)
Branch B1
Customer
Exchange
Savings
copy (push or pull)
Exchange
Branch B2
Exchange
Central Office
83.2. Snapshot (read-only)
- create snapshot lttable-namegt as select time
lthhmmssgt, interval lthhmmssgt - Frequency?
- e.g. exchange rate changes once per day
93.3. Replication
- How is data consistency maintained between the
original and its copies? - synchronous or asynchronous updates
- Replication management
- Location transparency
- the aim is to make the user/application think
that there is a single table at a single site
103.4. Fragmentation
Customer Table
SS
Saving Acct
Name
ad
12345
3442
Customer Table
set
34256
5001
SS
Saving Acct
Name
ad
12345
3442
set
34256
5001
po
00912
5002
jim
78123
9887
11Accounts
Saving Acct
Customer Table
SS
Saving Acct
12345
Name
ad
12345
3442
34256
00912
set
34256
5001
po
00912
5002
78123
jim
78123
9887
124. Data Placement Considerations
- Increase local autonomy
- Decrease network traffic (to/from original)
- Data type
- read/write, size, local use
- Cost of distribution/replication
- for data and application code
- Data management
13Data Location Analysis
- Assign costs to site read and update messages
per day. - Distribute the data to reduce the cost of the
network message traffic.
14Read/Update Messages/day
Site
S1
B1
2000
1000
Read
500
100
Update
15Data Location Analysis Matrix
Location of Data
MessageTraffic
S1
B1
Configuration
?
1100
1
?
2500
?
?
2
?
600
?
3
165. Distributed Data Access
- Answer queries by accessing data located on other
machines. - The (distributed) DBMS tries to hide the physical
location of the data.
175.1. Remote Request
- The client application issues a single data
request to be processed at a single remote server.
SQL query
Customer
Client
Server1
continued
18- SQL query
- select from server1.bankDB.customer where
server1.bankDB.customer.city Bangkok - Queries like this can be used for manual
extraction and snapshots. - User-level queries should not need to specify
server1.
195.2. Remote Transaction
- The client application issues multiple data
requests (all part of a single transaction) to a
single remote server.
begin work
SQL query 1
Customer
Client
SQL query 2
Branch
commit
Server1
continued
20- SQL query
- begin work select from server1.bankDB.custome
r where server1.bankDB.customer.city
Bangkok update server1.bankDB.branch
set posted_ind yescommit work
215.3. Distributed Transaction
- The client application issues multiple data
requests (all part of one distributed
transaction) to multiple remote servers.
begin work
SQL query 1
Employee
Medical
SQL query 2
Client
ServerB1
ServerM
commit
continued
22- SQL query
- begin work select from serverB1.bankDB.employ
ee where serverB1.bankDB.employee.edLevel
mba select from
serverM.medDB.medical where
serverM.medDB.medical.branch B1 commit
work
235.4. Distributed Request
- The client application issues multiple data
requests (all part of one distributed
transaction) to multiple remote servers and each
request can refer to multiple sites.
continued
24Employee
begin work
Client
SQL query 1
ServerB1
join
commit
SQL query 2
Medical
Branch
ServerM
Server1
continued
25- SQL query
- begin work select from serverB1.bankDB.employ
ee B1, serverM.bankDB.medical M where
B1.ID M.ID and B1.edLevel mba
update server1.bankDB.branch set
reported yes where server1.bankDB.branch
B1 commit work
266. Database Transaction Management
- A database transaction should possess the
properties - atomicity
- consistency
- isolation
- durability
- serializability
- A transaction is an all or nothing task.
continued
27- SQL contains the commands COMMIT and ROLLBACK for
explicit transaction management. - Implicit transaction management
- transaction completion ? COMMIT
- transaction failure ? ROLLBACK
28The Two-phase Commit Protocol
- A coordinator (which may be the client) sends its
queries to the servers, and waits for replies - the prepare phase
- If a server can answer its query then it responds
with an ok (or similar).
continued
29- If all the servers respond with ok, then the
coordinator tells them to commit - the commit phase
- If any of the servers do not respond with ok,
then the coordinator tells all the servers to
rollback.
30Distributed Two-phase Commit
Tree master(coordinator)
Syncpointresponses
Syncpointrequest
Slave/Master
Slave/Master
Slave
Slave
31Deadlock
- Updates are usually carried out by locking the
record (or group of records) being changed. - Two (or more) applications can become deadlocked
if they wait for each other to release a lock.
db1
c1
1. lock
2. wait
lock record
2. wait
db2
c2
lock record
1. lock
32Deadlock Detection
- Automatic time-out followed by transaction
failure (and retry). - Asynchronous updates.
- Deadlock avoidance algorithms.
3313. Distributed Data Management System Design
- Objectives of these slides
- examine some of the issues related to distributed
database management design
34The Main Problem
- How to obtain information from a system that is
distributed? - e.g. where is the customer database?
- e.g. what queries does the customer database
understand?
Overview
1. Metadata 2. Dates DDBMS Rules 3. Other
Issues
351. Metadata
- Information about a databases resources
- the data types, attributes, keys, indexes, etc.
- stored in a database dictionary
- Information about database location, machine, and
network characteristics - stored in a database directory
- Relational DBMS sometimes hold information about
the relationships between databases - stored in a system catalog.
36Benefits of Metadata
- improved project management
- reduced maintenance costs
- automated information sharing
- queries can be made data location independent
37Drawbacks
- the difficulty of creating metadata for legacy
systems - heterogeneous metadata formats
38Centralised or Distributed Metadata?
- Centralised
- simple to manage, maintain, etc.
- a bottleneck
- a single point of failure
- Distributed
- distribution issues (e.g. how to divide, where to
locate) - data synchronisation
392. Dates DDBMS Rules
- 2.1. Local Autonomy
- The local DBMS should manage its own security,
data integrity, data consistency, locking, and
recovery. - 2.2. No Reliance on a Central Site
- But an occasional coordinator is still
required, for example for two-phase commits. - 2.3. Continuous Operation
- This includes online backups, archiving, recovery.
40- 2.4. Data Location Independence
- Users and business logic should be unaware that
the data is distributed (they see a single local
database). - A distributed database dictionary, directory and
catalog are essential, probably replicated at
every site and synchronised.
41- 2.5. Data Fragmentation Independence
- Users and business logic should be unaware of
fragmentation (they see a single database). - At the DBMS level, SQL queries may require joins
(to combine vertical fragments) and unions (to
combine horizontal fragments).
42- 2.6. Data Replication Independence
- Replication is transparent to users and business
logic. - This requires the DBMS to use a hidden two-phase
commit protocol (or similar) when performing
transactions.
continued
43- Code transformation
- update bankDB.savings set amt amt -
withdrawal where bankDB.savings.acctno
12345 - becomes
- begin work update serverB1.bankDB.savings
set amt amt - withdrawal where
serverB1.bankDB.savings.acctno 12345
update serverB2.bankDB.savings set amt amt
- withdrawal where serverB2.bankDB.savings.a
cctno 12345commit work
44- 2.7. Distributed Query Processing
- The DBMS optimises the users query
- rule-based optimisations
- cost-based optimisations
- Example query
- select serverS1.bankDB.employees E,
serverS2.bankDB.managers M where E.ID M.ID
and E.edLevel MBA and
M.edLevel MBA
continued
45Server S1
DDM
employees
catalog
100,000 rows
SQLrequest
managers
client
catalog
Server S2
1,000 rows
continued
46- Rule-based Optimisation
- specify an execution order for the parts of a
query - e.g. search employees first (100,000 records) and
then compare with managers (1,000 records) - Cost-based Optimisation
- Estimate the cost of a query by considering
- the number of rows being processed
- the data properties (e.g. keys, types)
- machine/network speeds
- I/O operations
- e.g. copy the managers DB to the employees server
and join the databases
47- 2.8. Distributed Transaction Management
- ACIDS properties
- distributed locking
- deadlock detection
- Some commercial products (see chapter 13, Bersen
for details) - IBM CICS (Customer Information Control System)
- Novell TUXEDO
- Transarc Encina
48- 2.9. Hardware Independence
- 2.10. OS Indepedence
- 2.11. Network Independence
- 2.12. DBMS Independence
493. Other Issues
- Administration
- backups, recovery, DB creation and modification
- Security
- protection of data from illegal disclosure,
modification, or destruction - access rights, passwords, encryption
- very much harder in a distributed environment
- Currency control
- how to bring a server back online so that it
contains the same version of its database as on
other servers? - the aim is to make the database current with its
other versions