Title: Parallel and distributed databases
1. Parallel and distributed databases
2. What is a distributed database?
3. Why distribute a database
- Scalability and performance
- Resilience to failures
[Figure: labels "Throughput" and "Data size"; two diagrams, each with a component marked X, shown side by side ("versus")]
4. Why distribute a database
- Data is already distributed
- Or needs to be distributed
- Data is in multiple systems
5. Why not distribute a database
- You must earn your complexity!
- Communication needed
  - Must build a complex infrastructure
  - Unpredictable latencies must be masked
- More types of failures
  - More components to fail
  - Network failures
  - Congestion, timeouts
- More complex planning
  - Communication cost plus I/O cost
- May have to deal with heterogeneity
  - Different types of systems
  - Different schemas, possibly incompatible
  - Different administrative domains
6. Types of distributed databases
7. The old days: mainframes
Definitely not distributed!
8. Client-server
[Figure: client-server diagram with labels "User interaction", "Network", and "Data processing"]
9. Parallel database
10. Primary/secondary
11. Multidatabase
12. How do they work?
- What is shared?
- How to distribute the data?
- How to process the data?
- How to update the data?
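13. What is shared?
Shared memory: all CPUs share the same RAM and disk. Example: most modern DBMSs.
14. What is shared?
Shared disk: each node has its own RAM, but all nodes access the same disk. Example: Oracle RAC.
15. What is shared?
Shared nothing: each node has its own RAM and disk, and nodes communicate only over the network. Examples: search engines, Teradata.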
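16. How to distribute the data?
17. How to distribute the data?
Hash partitioning: each (key,value) pair goes to the node selected by Hash() of its key.
Range partitioning: each (key,value) pair goes to a node by key range, e.g. keys less than X to one node and keys greater than X to another.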
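The two partitioning schemes can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names (`hash_partition`, `range_partition`, `distribute`), the choice of SHA-256, and the single boundary value 10 are all assumptions made for the example. Keys equal to a boundary are sent to the upper range here, which is one possible convention.

```python
import hashlib
from bisect import bisect_right

def hash_partition(key, num_nodes):
    """Pick a node by hashing the key.

    SHA-256 is used (an assumption for this sketch) so the mapping is
    stable across runs, unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def range_partition(key, boundaries):
    """Pick a node by key range. `boundaries` must be sorted.

    Keys below boundaries[0] go to node 0; keys equal to or above a
    boundary go to the next node up.
    """
    return bisect_right(boundaries, key)

def distribute(pairs, choose_node, num_nodes):
    """Place each (key, value) pair on the node chosen by `choose_node`."""
    nodes = [[] for _ in range(num_nodes)]
    for key, value in pairs:
        nodes[choose_node(key)].append((key, value))
    return nodes

pairs = [(5, "a"), (17, "b"), (42, "c"), (8, "d")]
# Range partitioning with one boundary X = 10: node 0 gets keys < 10.
by_range = distribute(pairs, lambda k: range_partition(k, [10]), 2)
# Hash partitioning over two nodes.
by_hash = distribute(pairs, lambda k: hash_partition(k, 2), 2)
```

Range partitioning keeps neighbouring keys together, which helps range scans; hash partitioning spreads keys more evenly but scatters ranges across nodes.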
18. How to distribute the data?
19. Query processing
- Intra-operator parallelism
- Inter-operator parallelism
20. Parallel scanning
[Figure: multiple filter operators running in parallel, feeding a single Result]
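A parallel scan applies the same filter to every partition at once and then combines the outputs. The sketch below assumes the data is already split into partitions and uses a thread pool purely to show the structure; the names `scan_partition` and `parallel_scan` are hypothetical, and in a real system each partition would be scanned on its own node or disk.

```python
from concurrent.futures import ThreadPoolExecutor

def scan_partition(partition, predicate):
    """Scan one partition and keep the rows that pass the filter."""
    return [row for row in partition if predicate(row)]

def parallel_scan(partitions, predicate):
    """Run the same filter over every partition concurrently,
    then concatenate the per-partition outputs into one result."""
    workers = max(1, len(partitions))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields outputs in partition order, not completion order.
        filtered = pool.map(lambda p: scan_partition(p, predicate), partitions)
        result = []
        for rows in filtered:
            result.extend(rows)
    return result

partitions = [[1, 8, 3], [10, 2], [7, 12, 5]]
result = parallel_scan(partitions, lambda x: x > 6)
print(result)  # [8, 10, 7, 12]
```

This is intra-operator parallelism: one logical operator (the filter) is run as several identical copies over disjoint pieces of the data.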
21. Sorting
22. Sorting
23. Parallel hash join
[Figure: parallel hash join; tuples are routed through Hash()]
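One common way to run a hash join in parallel is to partition both inputs by a hash of the join key, so that matching rows always land on the same node, and then join each pair of partitions locally. The sketch below follows that idea under simplifying assumptions: rows are tuples, the "nodes" are just lists processed in a loop, and all function names are made up for the example.

```python
from collections import defaultdict

def partition_by_key(rows, key_index, num_nodes):
    """Route each row to a node by hashing its join key."""
    parts = [[] for _ in range(num_nodes)]
    for row in rows:
        parts[hash(row[key_index]) % num_nodes].append(row)
    return parts

def local_hash_join(left, right, left_key, right_key):
    """Classic build/probe hash join on a single node."""
    table = defaultdict(list)
    for l in left:                      # build phase
        table[l[left_key]].append(l)
    return [l + r                       # probe phase
            for r in right
            for l in table.get(r[right_key], [])]

def parallel_hash_join(left, right, left_key, right_key, num_nodes):
    left_parts = partition_by_key(left, left_key, num_nodes)
    right_parts = partition_by_key(right, right_key, num_nodes)
    out = []
    # Each (left, right) partition pair is independent, so in a real
    # system every pair would be joined on a different node.
    for lp, rp in zip(left_parts, right_parts):
        out.extend(local_hash_join(lp, rp, left_key, right_key))
    return out

employees = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, "book"), (3, "pen"), (1, "ink")]
joined = sorted(parallel_hash_join(employees, orders, 0, 0, num_nodes=2))
```

Because both inputs use the same hash function on the join key, no row ever needs to be compared with a row on another node.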
24. Join
25. Semi-join
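A semi-join is commonly used to cut communication cost when the two join inputs live at different sites: instead of shipping a whole relation, one site ships only its join keys, and the other site sends back just the rows that will actually match. The sketch below assumes relation R lives at a hypothetical site A and relation S at site B; the function names and the returned row count are illustrative only.

```python
def semi_join_reduce(s_rows, s_key, r_keys):
    """Runs at site B: keep only the S rows whose key appears in R."""
    return [row for row in s_rows if row[s_key] in r_keys]

def join_with_semi_join(r_rows, r_key, s_rows, s_key):
    # Step 1: site A ships only the distinct join keys of R to site B.
    r_keys = {row[r_key] for row in r_rows}
    # Step 2: site B filters S and ships the reduced S back to site A.
    reduced_s = semi_join_reduce(s_rows, s_key, r_keys)
    # Step 3: site A performs the final join locally.
    by_key = {}
    for s in reduced_s:
        by_key.setdefault(s[s_key], []).append(s)
    joined = [r + s for r in r_rows for s in by_key.get(r[r_key], [])]
    # The second value is the number of S rows that crossed the network.
    return joined, len(reduced_s)

r = [(1, "Ann"), (3, "Cy")]
s = [(1, "book"), (2, "lamp"), (3, "pen"), (4, "desk"), (5, "rug")]
rows, shipped = join_with_semi_join(r, 0, s, 0)
```

In this example only two of the five S rows are shipped. The technique pays off when the key set is small relative to the rows it filters out.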
26. Inter-operator parallelism
27. Updating distributed data
- Synchronous read-any-write-all
  - Reads are fast
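Read-any-write-all can be sketched as a small in-memory model: a write must reach every replica before it succeeds, so a read can be served by any single replica. The classes `Replica` and `ReadAnyWriteAll` and the `up` flag are assumptions made for illustration; a real system would also need locking and failure detection.

```python
import random

class Replica:
    def __init__(self):
        self.data = {}
        self.up = True  # False models a disconnected replica

class ReadAnyWriteAll:
    def __init__(self, num_replicas):
        self.replicas = [Replica() for _ in range(num_replicas)]

    def write(self, key, value):
        """Synchronous write: succeeds only if every replica is reachable."""
        if not all(r.up for r in self.replicas):
            raise RuntimeError("write rejected: a replica is unreachable")
        for r in self.replicas:
            r.data[key] = value

    def read(self, key):
        """Any single live replica can answer, since all hold the same data."""
        live = [r for r in self.replicas if r.up]
        if not live:
            raise RuntimeError("read failed: no replica is reachable")
        return random.choice(live).data.get(key)

store = ReadAnyWriteAll(3)
store.write("x", 1)
value = store.read("x")        # served by one replica, no coordination
store.replicas[0].up = False   # one replica disconnects
still_readable = store.read("x")
```

The trade-off is visible in the model: reads stay cheap and survive a disconnected replica, while a single unreachable replica blocks every write.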
28. Updating distributed data
29. Updating distributed data
Writes tolerant to disconnection
30. Consistency of distributed data
31. Primary/secondary
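32. Two-phase commit
[Figure: message diagram with PREPARE, two PREPARED messages, and COMMIT]
33. Two-phase commit
[Figure: message diagram with PREPARE, one PREPARED message, and two ABORT messages]
34. Two-phase commit
[Figure: message diagram with PREPARE, one PREPARED message, and one ABORT message]
35. Two-phase commit
[Figure: message diagram with PREPARE and two PREPARED messages; one component is marked as failed (X)]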
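The basic message flow of two-phase commit can be sketched as follows. This is a simplified model, assuming in-process method calls instead of network messages; the `Participant` class, its `vote_yes` flag, and the state names are hypothetical. It deliberately leaves out timeouts, logging to stable storage, and coordinator failure, which are the hard parts of the real protocol.

```python
class Participant:
    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes  # hypothetical switch to force an ABORT vote
        self.state = "INIT"

    def prepare(self):
        """Phase 1: reply PREPARED if able to commit, otherwise ABORT."""
        if self.vote_yes:
            self.state = "PREPARED"
            return "PREPARED"
        self.state = "ABORTED"
        return "ABORT"

    def commit(self):
        self.state = "COMMITTED"

    def abort(self):
        self.state = "ABORTED"

def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant is PREPARED."""
    # Phase 1: send PREPARE to everyone and collect the votes.
    votes = [p.prepare() for p in participants]
    # Phase 2: broadcast the decision.
    if all(v == "PREPARED" for v in votes):
        for p in participants:
            p.commit()
        return "COMMIT"
    for p in participants:
        p.abort()
    return "ABORT"

happy = [Participant("A"), Participant("B")]
outcome_ok = two_phase_commit(happy)

unhappy = [Participant("A"), Participant("B", vote_yes=False)]
outcome_bad = two_phase_commit(unhappy)
```

A single ABORT vote is enough to abort the whole transaction, which is what makes the outcome atomic across nodes.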
36. Conclusion
- Parallelism and distribution very useful
  - Performance
  - Fault tolerance
  - Scale
- But complex!
  - Rethink lots of aspects of the system
  - Must earn the complexity