Title: Parallel Database System
1 Parallel Database System
presented by Ming Hao
2 Why parallel database
- Dominance of the relational data model
- 1. Large, uniform data records
- 2. A query can be decomposed into a set of relational operators. Each operator takes a relation as input and outputs a new relation.
- 3. These properties point to built-in parallelism
3
1. Pipelined parallelism: streaming the output of one operator into the input of another operator
2. Partitioned parallelism: partitioned data and execution
3. Inter-query parallelism: OLTP
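A minimal sketch of the first two forms, assuming relations are plain Python lists of dicts. The operator names `scan`, `select`, `project` and the sample rows are illustrative and not from the slides. Generators stream each row from one operator into the next (pipelining), and the same pipeline is applied to each partition (partitioning). The sketch runs sequentially and only shows the data flow.

```python
def scan(relation):
    """Emit the rows of a stored relation one at a time."""
    for row in relation:
        yield row

def select(rows, predicate):
    """Pass through only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, columns):
    """Keep only the named columns of each row."""
    for row in rows:
        yield {c: row[c] for c in columns}

def pipeline(partition):
    # Pipelined parallelism: each operator consumes the previous
    # operator's output row by row, without materializing it.
    return project(select(scan(partition), lambda r: r["age"] > 30), ["name"])

# Partitioned parallelism: the same pipeline over each data partition.
# (Hypothetical data; a real system would run partitions on separate nodes.)
partitions = [
    [{"name": "ann", "age": 41}, {"name": "bob", "age": 25}],
    [{"name": "cy", "age": 33}],
]
result = [row for part in partitions for row in pipeline(part)]
```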
4 Hardware support available
- High-speed networks
- Message-passing-based client-server operating systems
- Cheap and powerful PCs/workstations
5 Hardware architecture
1. Shared memory
a. Cannot scale up to lots of disks and processors (limited by network bandwidth)
b. Interference between processors (private caches do not solve the problem)
6 Hardware architecture
2. Shared disks
a. Same scaling problem as shared memory
b. Interference when updating data
7 Hardware architecture
3. Shared nothing
a. Linear scaleup and speedup
b. Less interference
c. Exploits commodity processors and memory
8 Parallelism metrics
- Speedup = small_system_elapsed_time / big_system_elapsed_time
- Scaleup = small_system_elapsed_time_on_small_problem / big_system_elapsed_time_on_big_problem
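A short worked example of the two ratios. The elapsed times are made-up numbers, and the big system is assumed to have 4 times the hardware of the small one. Under that assumption, linear speedup would be 4 and linear scaleup would be 1.

```python
def speedup(small_system_elapsed_time, big_system_elapsed_time):
    """Same problem, bigger system: how many times faster it finishes."""
    return small_system_elapsed_time / big_system_elapsed_time

def scaleup(small_on_small_problem, big_on_big_problem):
    """Bigger problem on a bigger system: 1.0 means elapsed time held constant."""
    return small_on_small_problem / big_on_big_problem

# Hypothetical measurements in seconds (not from the slides).
s = speedup(120.0, 30.0)    # same job, 4x the hardware
u = scaleup(120.0, 125.0)   # 4x the job on 4x the hardware
```

Here `s` is 4.0 (linear under the 4x assumption) and `u` is slightly below 1, which would indicate a small loss to the barriers listed on the next slide.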
9 Barriers to linear scaleup and speedup
- Startup: time to start a parallel program
- Interference: critical sections, synchronization, coherence
- Skew: load balance
10 Pipeline or Partitioning
- Operator chains are not very long
- Pipelining is not available for some operators, e.g. aggregate
- Skew
11 Data Partitioning
- Round-robin
- Good for accessing data by sequential scan
- Drawback: applications frequently want to access records associatively
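A minimal sketch of round-robin placement, assuming records are held in a Python list and partitions stand in for disks or nodes. The function name `round_robin_partition` is illustrative.

```python
def round_robin_partition(records, n_partitions):
    """Send the i-th record to partition i mod n."""
    partitions = [[] for _ in range(n_partitions)]
    for i, record in enumerate(records):
        partitions[i % n_partitions].append(record)
    return partitions

parts = round_robin_partition(list(range(7)), 3)
```

The partitions come out evenly sized, which suits a full scan. Placement does not depend on the record's value, so finding one particular record means searching every partition.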
12 Data Partitioning
- Hash partition
- Good for accessing data by sequential scan
- Good for associative access to records
- Drawback: not suited for clustering
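A minimal sketch of hash placement, assuming records are dicts and the partitioning attribute is named by `key`. The helper names are illustrative, and `zlib.crc32` is used only as a convenient deterministic hash, not as a recommendation from the slides.

```python
import zlib

def partition_of(value, n_partitions):
    """Map an attribute value to a partition number with a deterministic hash."""
    return zlib.crc32(str(value).encode("utf-8")) % n_partitions

def hash_partition(records, key, n_partitions):
    """Place each record in the partition chosen by hashing record[key]."""
    partitions = [[] for _ in range(n_partitions)]
    for record in records:
        partitions[partition_of(record[key], n_partitions)].append(record)
    return partitions

def lookup(partitions, key, value):
    """Associative access: only one partition has to be searched."""
    target = partitions[partition_of(value, len(partitions))]
    return [r for r in target if r[key] == value]

rows = [{"id": i, "name": f"user{i}"} for i in range(10)]
parts = hash_partition(rows, "id", 4)
found = lookup(parts, "id", 7)
```

Adjacent key values land in unrelated partitions, which is why this scheme does not help range queries or clustering.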
13 Data Partitioning
- Range partition
- Good for accessing data by sequential scan
- Good for associative access to records
- Good for clustering
- Drawback: data skew and execution skew
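A minimal sketch of range placement, assuming records are dicts and the split points are given as a sorted list `boundaries`. The names and the sample data are illustrative.

```python
import bisect

def range_partition(records, key, boundaries):
    """Partition i holds the records whose key falls in the i-th value range.

    boundaries must be sorted; values equal to a boundary go to the
    partition on its right.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        partitions[bisect.bisect_right(boundaries, record[key])].append(record)
    return partitions

# Hypothetical data with most keys bunched at the low end.
rows = [{"id": v} for v in [1, 2, 3, 4, 5, 50, 99]]
parts = range_partition(rows, "id", [33, 66])
sizes = [len(p) for p in parts]
```

Records with nearby keys are stored together, which gives the clustering benefit. With these evenly spaced boundaries and bunched keys, the first partition receives 5 of the 7 records, a small instance of data skew.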
14 Using existing sequential operators
- Merge operator: focuses data on one spot
- Split operator: used in multiple parallel stages
- Flow control and buffering
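A minimal sketch of how split and merge can wrap an unchanged sequential operator, here a plain sort. The function names and the routing rule are assumptions for illustration, and flow control and buffering are left out. The sketch runs in one process and only shows the data flow.

```python
import heapq

def split(rows, n_outputs, route):
    """Split operator: route each row of one stream to one of n output streams."""
    outputs = [[] for _ in range(n_outputs)]
    for row in rows:
        outputs[route(row) % n_outputs].append(row)
    return outputs

def sequential_sort(rows):
    """An existing sequential operator, used as-is on each stream."""
    return sorted(rows)

def merge(sorted_streams):
    """Merge operator: combine several sorted streams into one sorted stream."""
    return list(heapq.merge(*sorted_streams))

data = [9, 4, 7, 1, 8, 2]
streams = split(data, 3, route=lambda v: v)
result = merge(sequential_sort(s) for s in streams)
```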
15 Better algorithms
- Minimize data flow; tolerate data and execution skew
- Join
- 1. Sort-merge join: n log(n) cost
- 2. Hash join: linear cost
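A minimal in-memory sketch of the two equi-join algorithms, assuming relations are lists of dicts joined on a shared attribute `key`. The function names and sample relations are illustrative. The sort-merge version pays for two sorts. The hash version makes one pass to build a table and one pass to probe it, which is linear when hash lookups are constant time and matches are few.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Build a hash table on left, then probe it with each right row."""
    table = defaultdict(list)
    for l in left:
        table[l[key]].append(l)
    out = []
    for r in right:
        for l in table.get(r[key], []):
            out.append({**l, **r})
    return out

def sort_merge_join(left, right, key):
    """Sort both inputs on the key, then advance two cursors in step."""
    left = sorted(left, key=lambda row: row[key])
    right = sorted(right, key=lambda row: row[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row in the matching run with that right run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out

# Hypothetical relations.
emp = [{"dept": 1, "name": "ann"}, {"dept": 2, "name": "bob"}, {"dept": 1, "name": "cy"}]
dept = [{"dept": 1, "dname": "db"}, {"dept": 3, "dname": "os"}]
by_hash = hash_join(emp, dept, "dept")
by_sort = sort_merge_join(emp, dept, "dept")
```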
16 Summary
- Commodity components, not special hardware
- shared nothing architecture
- data partition, data flow
- only choice for some applications
- some remaining problems