Title: TDD:%20Topics%20in%20Distributed%20Databases
1TDD Topics in Distributed Databases
- Distributed Databases
- Distributed database
- Distributed query processing joins and non-join
queries - Updating distributed data
2Distributed databases
- Data is stored in several sites (nodes),
geographically or administratively across
multiple systems - Each site is running an independent DBMS
- What do we get?
- Increased availability and reliability
- Increased parallelism
Data centers
- Complications
- Catalog management distributed data independence
and distributed transaction atomicity - Query processing and optimization replication
and fragmentation - Increased update costs, concurrency control
locking, deadlock, commit protocol, recovery
3Architectures
3
4Homogeneous vs heterogeneous systems
- Homogeneous identical DBMS, aware of each other,
cooperate - Heterogeneous different schemas/DBMS
- Multidatabase system uniform logical view of the
data -- common schema - difficult, yet common system is typically
gradually developed
5Architectures
- Client-server client (user interface, front
end), server (DBMS) - Client ships query to a server (query shipping)
- All query processing at server
client
client
client-server
server
server
server
6Architectures
- Collaborating server query can span several
servers
- Middleware
- Coordinator queries and transactions across
servers
7Warehouse architecture
client applications
data warehouse
integrator
monitor/wrapper
monitor/wrapper
monitor/wrapper
XML
RDB
OODB
8Monitor/wrapper
- A monitor/wrapper for each data source
- translation translate an information source into
a common integrating model - change detection detect changes to the
underlying data source and propagate the changes
to the integrator - active databases (triggers condition, event,
action) - logged sources inspecting logs
- periodic polling, periodic dumps/snapshots
- Data cleaning
- detect erroneous/incomplete information to ensure
validity - back flushing return cleaned data to the source
9Integrator
- Receive change notifications from the
wrapper/monitors and reflect the changes in the
data warehouse. - Typically a rule-based engine
- merging information (data fusion)
- handling references
- Data cleaning
- removing redundancies and inconsistencies
- inserting default values
- blocking sources
10When to use data warehouse
- Problem potential inconsistencies with the
sources. - Commonly used for relatively static data
- when clients require specific, predicable portion
of the available information - when clients require high query performance but
not necessarily the most recent state of the
information - when clients want summarized/aggregated
information such as historical information - Examples
- scientific data
- historical enterprise data
- caching frequently requested information
11Data warehouse vs. materialized views
- materialized view is over an individual
structured database, while a warehouse is over a
collection of heterogeneous, distributed data
sources - materialized view typically has the same form as
in the underlying database, while a warehouse
stores highly integrated and summarized data - materialized view modifications occur within the
same transaction updating its underlying
database, while a warehouse may have to deal with
independent sources - sources simply report changes
- sources may not have locking capability
- integrator is loosely coupled with the sources
12Mediated system architecture
- Virtual approach data is not stored in the
middle tier
client applications
Mediator
wrapper
wrapper
wrapper
XML
RDB
OODB
13Lazy vs. eager approaches
- Lazy approach (mediated systems)
- accept a query, determine the appropriate set of
data sources, generate sub-queries for each data
source - obtain results from the data sources, perform
translation, filtering and composing, and return
the final answer -
- Eager approach (warehouses)
- information from each source that may be of
interest is extracted in advance, translated,
filtered, merged with relevant sources, and
stored in a repository - query is evaluated directly against the
repository, without accessing the original
information sources
14Data warehouse vs. mediated systems
- Efficiency
- response time at the warehouse, queries can be
answered efficiently without accessing original
data sources. Advantageous when data sources are
slow, expensive or periodically unavailable, or
when translation, filtering and merging require
significant processing - space warehousing consumes extra storage space
- Extensibility warehouse
- consistency with the sources warehouse data may
become out of date - applicability
- warehouses for high query performance and static
data - mediated systems for information that changes
rapidly
15Distributed data storage -- replication
- Fragments of a relation are replicated at several
sites R is fragmented into R1, R2, R3 - Why?
- Increase availability/reliability if one site
fails - Increase parallelism faster query evaluation
- Increase overhead on updates consistency
- Dynamic issues synchronous vs. asynchronous
- Primary copy e.g.,
- Bank an account at the site in which it was
opened - Airline an flight at the site from which it
originates
R1
R2
R3
R2
Site 2
Site 1
16Distributed data storage -- fragmentation
- A relation R may be fragmented or partitioned
- Horizontal
- Vertical lossless join
- Question how to reconstruct the original R?
fragmentation determined by local ownership
NYC
EDI
17Transparency, independence
- Distributed data transparency (independence)
- location (name) transparency
- fragmentation
- replication transparency (catalog management)
- Transaction atomicity across several sites
- All changes persist if the transaction commits
- None persists if the transaction aborts
Data independency and transaction atomicity are
not supported currently the users have to be
aware of where data is located
18Distributed query processing joins and non-join
queries
18
19Distributed query processing and optimization
- New challenges
- Data transmission costs (network)
- parallelism
- Choice of replicas lowest transmission cost
- Fragmentation to reconstruct the original
relation - Query decomposition query rewriting/unfolding
- depending on how data is fragmented/replicated
decomposition
sub-query
sub-query
sub-query
20Non-join queries
- Schema account(acc-num, branch-name, balance)
- Query Q select from account where
branch-name EDI - Storage database DB is horizontally fragmented,
based on branch-name NYC, Philly, EDI, denoted
by DB1, , DBn - DB DB1 ? ? DBn
- Processing
- Rewrite Q into Q(DB1) ? ? Q(DBn)
- Q(DBi) is empty if branch-name ltgt EDI
- Q(DB1), where DB1 is the EDI branch
- Q(DB1) Q(DB1)
- Q select from account
21Simple join processing data shipping
- R1 R2 R3
- where Ri is at site i, S0 is the site where the
query is issued - Option 1 send copies of R1, R2, R3 to S0 and
compute the joins at S0 - Option 2
- Send R1 to S2, compute temp1 ? R1 R2 at S2
- Send temp1 to S3, compute temp2 ? R3 temp1
at S3 - Send temp2 to S0
- Decision on strategies
- The volume of data shipped
- The cost of transmitting a block
- Relative speed of processing at each site
22Semijoin reduce communication costs
- R1 R2, where Ri is at site i,
- Compute temp1 ? ? (R1 ? R2) R1 at site 1
- projection on join attributes only assume R1
smaller - Ship temp1 to site 2, instead of the entire
relation of R1 - Compute temp2 ? R2 temp1 at S2
- Ship temp2 to site 1
- compute result ? R1 temp2 at S1
- Effectiveness
- If sufficiently small fraction of the relation of
R2 contributes to the join - Additional computation cost may be higher than
the savings in communication costs
23Bloomjoin reduce communication costs
- R1 R2, where Ri is at site i,
- Compute a bit vector of size k by hashing ? (R1
? R2) R1 - bit set to 1 if some tuple hashes to it
- smaller than the projection (constant size)
- Ship the vector to site 2, instead of the entire
relation of R1 - Hash R2 using the same hashing function
- Ship to site 1 only those tuples of R2 that also
hash to 1, temp1 - compute result ? R1 temp1 at S1
- Effectiveness
- Less communication costs bit-vector vs
projection - The size of the reduction by hashing may be
larger than that of projection - Question set difference?
24exploring parallelism
- Consider R1 R2 R3 R4, where Ri is
at site i - temp1 ? R1 R2, by shipping R1 to site 2
- temp2 ? R3 R4, by shipping R3 to site 4
- result ? temp1 temp2 -- pipelining
- Question R1 R2 R3, using
- Pipelined parallelism
- Semi-join
- both
parallel
25Distributed query optimization
- The volume of data shipped
- The cost of transmitting a block
- Relative speed of processing at each site
- Site selection replication
- Two-step optimization
- At compile time, generate a query plan along
the same lines as centralized DBMS - Every time before the query is executed,
transform the plan and carry out site selection
(determine where the operators are to be
executed) dynamic, just site selection
26Practice validation of functional dependencies
- A functional dependency (FD) defined on schema R
X ? Y - For any instance D of R, D satisfies the FD if
for any pair of tuples t and t, if tX tX,
then tY tY - Violations of the FD in D
- t there exists t in D, such that tX
tX, but tY ? tY - Now suppose that D is fragmented and distributed
- Develop an algorithm that given fragmented and
distributed D and an FD, computes all the
violations of the FD in D - semijoin
- bloomjoin
- Questions what can we do if we are given a set
of FDs to validate?
horizontally or vertically
Minimize data shipment
27Practice relational operators
- Consider a relation R that is vertically
partitioned and distributed across n sites - Develop an algorithm to implement
- ?A R,
- ?C R
- by using
- semijoin
- bloomjoin
- Column-oriented DBMS store tables as sections of
columns of data, rather than rows (tuples) good
for, eg, certain aggregate queries
28Practice relational operators
- Consider relations R1 and R2 that are
horizontally partitioned and distributed across n
sites - Develop an algorithm to implement R1 R1.A
R2.B R2 by using - semijoin
- bloomjoin
- Question is your algorithm parallel scalable?
That is, the more processors are used, the faster
it is
29Updating distributed data
29
30Updating Distributed data
- Fragmentation an update may go across several
sites - Local transaction
- Global transaction
- Replication multiple copies of the same data --
consistency
updates
propagate
31System structure
- Local transaction manager either local
transaction or part of a global transaction - Maintain a log for recovery
- Concurrency control for transactions at the site
- Transaction coordinator (not in centralized DBMS)
- Start the execution of a transaction
- Break a transaction into a number of
sub-transactions and distribute them to the
appropriate sites - Coordinate the termination of the transaction
- Commit at all sites
- Abort at all sites
32Two-Phase Commit protocol (2PC) Phase 1
- Transaction T the transaction coordinator is at C
P1
(1) ltprepare Tgt
log
C
ltready Tgt
P2
(2) ltready Tgt
prepare T
if all responses are ltreadygt
commit T
P3
(2) ltabort Tgt
abort T
ltno Tgt
log
if one of the responses is ltabortgt
log
33Two-Phase Commit protocol (2PC) Phase 2
- Transaction T the transaction coordinator is at
C
P1
(3) ltcommit Tgt
log
C
P2
ltready Tgt
(4) ltack Tgt
ltcommit Tgt
prepare T
commit T
P3
(4) ltack Tgt
complete T
ltready Tgt
log
ltcommit Tgt
Similarly for abort
log
34Comments on 2PC
- Two rounds of communication both initiated at
the coordinator - Voting
- Terminating
- Any site can decide to abort a transaction
- Every message reflects a decision by the sender
- The decision is recorded in the log to ensure the
decision survives any failure transaction id - The log at each participating site the id of the
coordinator - The log at the coordinator ids of the
participating sites
35Concurrency control
- Single local manager at a chosen site
- Simple implementation and deadlock handling
- Bottleneck
- Vulnerability if the site fails
- Distributed lock manager each site has one to
update data item D at site j - Send request to the lock manager at site j
- Request is either granted or delayed
- Deadlock handling is hard
- Major complication replicated data
36replication
- Synchronous replication all copies must be
updated before the transaction commits - data distribution transparent, consistent
- expensive
- Asynchronous replication copies are periodically
updated - Allow modifying transaction to commit before all
copies have been changed - users are aware of data distribution, consistency
issues - Cheaper current products follow this approach
- Peer-to-peer replication (master copies)
- Primary site replication (only the primary is
updateable)
37Synchronous replication
- Majority approach -- voting data item D
replicated at n sites - A lock request is sent to more than one-half of
the sites - D is locked if the majority vote yes, write n/2
1 copies - Each copy maintains a version number
- Expensive
- 2(n/2 1) messages for lock
- Read at least n/2 1 copies to make sure it is
current - Deadlock is more complicated if only one copy is
locked
38Synchronous replication (cont.)
- Majority approach -- voting
- Biased protocol read-any write-all.
- Shared lock (read) simply requests a lock on D
at one site that contains a copy of D - Exclusive lock (write) lock on all sites that
contain a copy - Less overhead on read, expensive on write
- Commonly used approach to synchronous replication
39Synchronous replication -- exercise
- A distributed system uses the majority (voting)
approach to update data replicas. Suppose that a
data item D is replicated at 4 different sites
S1, S2, S3, S4. What should be done if a site S
wants to - Write D
- Read D
40Synchronous replication -- answer
- A distributed system uses majority the (voting)
approach to update data replicas. Suppose that a
data item D is replicated at 4 different sites
S1, S2, S3, S4. What should be done if a site S
wants to - Write D
- The site S sends a lock request to any 3 of S1,
S2, S3, S4 - The write operation is conducted if the lock is
granted by the lock manager of all the 3 sites
otherwise it is delayed until the lock can be
granted - Read D
- The site S reads copies of D from at least 3
sites - It picks the copy with the highest version number
41Asynchronous replication
- Primary site choose exactly one copy, residing
at a primary site - A lock request is sent to the primary site
- Replicas at other sites may not be updated they
are secondary to the primary copy - Simple to implement
- Main issues
- D becomes inaccessible if the primary site fails
- Propagation of changes from the primary site to
others
42Asynchronous replication (cont.)
- Primary site
- Peer-to-peer more than one of the copies can be
a master - Change to a master copy must be propagated to
others - Conflicts of changes to two copies have to be
resolved - Best used when conflicts do not arise e.g.,
- Each master site owns a distinct fragment
- Updating rights owned by one master at a time
43Distributed deadlock detection
- Recall wait-for graph
- Nodes transactions
- Edges T1 ? T2 if T1 requests a resource being
held by T2 - Local wait-for graph
- Nodes all transactions local or
holding/requesting data item local to the site - T1 ? T2 if T1 (at site 1) requests a resource
being held by T2 (at site 2) - Global wait-for graph union of local wait-for
graphs - Deadlock if it contains a cycle
T1
T2
T2
T4
T5
T3
T3
global
Site 2
Site 1
44False cycles
- Due to communication delay
- Site 1 local wait-for graph has T1 ? T2
- T2 releases the resource deletion of the edge
- Site 2 T2 requests the resource again addition
of T2 ? T1 - Cycle if insert T2 ? T1 arrives before removal of
T1 ? T2 - Centralized deadlock detection deadlock
detection coordinator - Constructs/maintains global wait-for graph
- Detect cycles
- If it finds a cycle, chose a victim to be rolled
back - Distributed deadlock manager? More expensive
T1
T2
T1
T2
T1
T2
global
Site 1
Site 2
45Summary and review
- Homogeneous vs heterogeneous systems
- Replication and fragmentation. Pros and cons of
replication. How to reconstruct a fragmented
relation (vertical, horizontal)? - Simple join (data shipping), semijoin, bloomjoin
- set-difference? Intersection? Aggregation?
- Transaction manager and coordinator.
Responsibilities? - Describe 2PC. Recovery coordinator and
participating sites - Replication
- majority, read-any write-all,
- primary site, peer-to-peer
- Local wait-for graph, global wait-for graph,
deadlock detection, deadlock handling
46Reading list
- MapReduce tutorial
- http//hadoop.apache.org/docs/r1.2.1/mapred_tutori
al.html - Take a look at the following
- Cassandra, http//en.wikipedia.org/wiki/Apache_Ca
ssandra - Clusterpoint, http//en.wikipedia.org/wiki/Cluste
rpoint - Riak, http//en.wikipedia.org/wiki/Riak
47- Reading for the next week
- Pregel a system for large-scale graph processing
- http//kowshik.github.io/JPregel/pregel_paper.pdf
- Distributed GraphLab A Framework for Machine
Learning in the Cloud, http//vldb.org/pvldb/vol5/
p716_yuchenglow_vldb2012.pdf - PowerGraph Distributed Graph-Parallel
Computation on Natural Graphs, http//select.cs.cm
u.edu/publications/paperdir/osdi2012-gonzalez-low-
gu-bickson-guestrin.pdf - GraphChi Large-Scale Graph Computation on Just a
PC, http//select.cs.cmu.edu/publications/paperdir
/osdi2012-kyrola-blelloch-guestrin.pdf - Performance Guarantees for Distributed
Reachability Queries, http//vldb.org/pvldb/vol5/p
1304_wenfeifan_vldb2012.pdf - W. Fan, X. Wang, and Y. Wu. Distributed Graph
Simulation Impossibility and Possibility. VLDB
2014. (parallel model) - http//homepages.inf.ed.ac.uk/wenfei/papers/vldb1
4-impossibility.pdf
Pick two papers and write reviews