Title: Exploiting Multithreaded Architectures to Improve Data Management Operations
1Exploiting Multithreaded Architectures to Improve
Data Management Operations
- Layali Rashid
- The Advanced Computer Architecture Group at U of C (ACAG)
- Department of Electrical and Computer Engineering
- University of Calgary
2Outline
- The SMT and the CMP Architectures
- Join (Hash Join)
  - Motivation
  - Algorithm
  - Results
- Sort (Radix and Quick Sorts)
  - Motivation
  - Algorithms
  - Results
- Index (CSB-Tree)
  - Motivation
  - Algorithm
  - Results
- Conclusions
3The SMT and the CMP Architectures
- Simultaneous Multithreading (SMT): multiple threads run simultaneously on a single processor.
- Chip Multiprocessor (CMP): more than one processor is integrated on a single chip.
4Hash Join Motivation
- Hash join is one of the most important operations commonly used in current commercial DBMSs.
- The L2 cache load miss rate is a critical factor in main-memory hash join performance.
- Increase the level of parallelism in hash join.
5Architecture-Aware Hash Join (AA_HJ)
- Build Index Partition Phase
  - Tuples are divided equally between threads; each thread has its own set of L2-cache-sized clusters.
- The Build and Probe Index Partition Phase
  - One thread builds a hash table from each key range; the other threads index-partition the probe relation as in the previous phase.
- Probe Phase
  - See figure.
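The build index partition phase described above can be sketched roughly as follows. This is a minimal illustration, not the AA_HJ implementation: the function names, the integer-key tuple layout, and the use of `key % num_clusters` as the cluster function are all assumptions made for the example (the slides partition by key range into L2-cache-sized clusters). Note also that threads in standard CPython builds do not execute Python bytecode in parallel, so the sketch shows the structure of the phase rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_chunk(chunk, num_clusters):
    """Index-partition one thread's share of the tuples into its own clusters."""
    clusters = [[] for _ in range(num_clusters)]
    for key, payload in chunk:
        # Assumption: integer keys, with key modulo cluster count standing in
        # for the key-range function used by the real algorithm.
        clusters[key % num_clusters].append((key, payload))
    return clusters


def index_partition(tuples, num_threads, num_clusters):
    """Divide tuples equally between threads; each thread fills private clusters."""
    step = max(1, (len(tuples) + num_threads - 1) // num_threads)
    chunks = [tuples[i:i + step] for i in range(0, len(tuples), step)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(lambda c: partition_chunk(c, num_clusters), chunks))


def build_hash_table(per_thread_clusters, cluster_id):
    """Build one hash table from the same cluster id across all threads' outputs."""
    table = {}
    for clusters in per_thread_clusters:
        for key, payload in clusters[cluster_id]:
            table.setdefault(key, []).append(payload)
    return table
```

Because every thread writes only to its own cluster set, the partitioning step in this sketch needs no locking; the per-thread clusters are combined only when a hash table is built for a given cluster.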
6AA_HJ Results
- We achieve speedups ranging from 2 to 4.6 compared to PT on a Quad Intel Xeon Dual Core server.
- Speedups on the Pentium 4 with HT range from 2.1 to 2.9 compared to PT.
7Memory-Analysis for Multithreaded AA_HJ
- The decrease in the L2 load miss rate is due to cache-sized index partitioning, constructive cache sharing and Group Prefetching.
- There is a minor increase in the L1 data cache load miss rate, from 1.5% to 4%.
8The Sort Motivation
- Some researchers find that sort algorithms suffer from high L2 cache miss rates.
- Others point out that radix sort has high TLB miss rates.
- In addition, most sort algorithms are sequential, which makes it difficult to derive efficient parallel sort algorithms.
- In our work we target Radix Sort (distribution-based sort) and Quick Sort (comparison-based sort).
9Our Parallel Sorts
- Radix Sort
  - A hybrid between Partition Parallel Radix Sort and Cache-Conscious Radix Sort.
  - Repartition large destination buckets only when they are significantly larger than the L2 cache size.
- Quick Sort
  - Use Fast Parallel Quick Sort.
  - Dynamically balance the load across threads.
  - Improve thread parallelism during the sequential cleanup sorting.
  - Stop the recursive partitioning when the size of the subarray is almost equal to the largest cache size.
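The last Quick Sort point, stopping the partitioning once subarrays approach the cache size and finishing with a cleanup sort, might look like the sketch below. It is a single-threaded illustration under stated assumptions: `threshold` is a hypothetical parameter standing in for "number of elements that fit in the largest cache", and Python's built-in `sorted` stands in for the cleanup sort. It is not the Fast Parallel Quick Sort used in the work.

```python
def partition(a, lo, hi):
    """Lomuto partition around the middle element; returns the pivot's final index."""
    mid = (lo + hi) // 2
    a[mid], a[hi] = a[hi], a[mid]
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i


def cache_bounded_quicksort(a, threshold):
    """Partition until subarrays hold at most `threshold` elements, then clean up."""
    stack = [(0, len(a) - 1)]
    small = []  # subarrays left for the cleanup phase
    while stack:
        lo, hi = stack.pop()
        if hi - lo + 1 <= threshold:
            if lo < hi:
                small.append((lo, hi))
            continue
        p = partition(a, lo, hi)
        stack.append((lo, p - 1))
        stack.append((p + 1, hi))
    # Cleanup phase: each small subarray is independent of the others,
    # so in a multithreaded version these could be sorted by separate threads.
    for lo, hi in small:
        a[lo:hi + 1] = sorted(a[lo:hi + 1])
    return a
```

The small subarrays collected in `small` cover disjoint index ranges, which is what would allow the cleanup phase to be distributed across threads without synchronization on the data.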
10The Sort Timing for the Random Datasets on the
SMT Architecture
- Radix Sort and Quick Sort show low L1 and L2 cache miss rates on our machines. Radix Sort has a DTLB store miss rate of up to 26%.
- Radix Sort achieves only a slight speedup on SMT architectures, not exceeding 3%, due to its CPU-intensive nature.
- Execution time improvements for Quick Sort are about 25% to 30%.
[Figure panels: Quick Sort, Radix Sort]
11The Sort Timing for the Random Datasets on the
CMP Architecture
- Our speedups for Radix Sort range from 54% with two threads up to 300% as the thread count increases from 2 to 8.
- Our speedups for Quick Sort range from 34% to 417%.
[Figure panels: Radix Sort, Quick Sort]
12The Index Motivation
- Although the CSB-tree achieves significant speedup over B-trees, experiments show that a large fraction of its execution time is still spent waiting for data.
- The L2 load miss rate for the single-threaded CSB-tree is as high as 42%.
13Dual-threaded CSB-Tree
- One CSB-Tree.
- Single thread for the bulkloading.
- Two threads for probing.
- Unlike inserts and deletes, search needs no
synchronization since it involves reads only.
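The arrangement above (one shared index, one bulkloading thread, two probing threads, no locks on search) can be sketched as follows. A sorted list searched with `bisect` is used purely as a stand-in for the CSB-tree, and all function names are hypothetical; the point of the sketch is only that read-only probes on a shared structure need no synchronization.

```python
import bisect
import threading


def build_index(keys):
    """Bulkload: a single thread builds the shared index once.

    Assumption: a sorted list stands in for the CSB-tree.
    """
    return sorted(keys)


def probe(index, probe_keys, out, slot):
    """Read-only search; no lock is taken because the index is never modified."""
    found = 0
    for k in probe_keys:
        i = bisect.bisect_left(index, k)
        if i < len(index) and index[i] == k:
            found += 1
    out[slot] = found  # each thread writes only to its own slot


def dual_threaded_search(index, probe_keys):
    """Split the probe keys between two threads that share one index."""
    half = len(probe_keys) // 2
    out = [0, 0]
    threads = [
        threading.Thread(target=probe, args=(index, probe_keys[:half], out, 0)),
        threading.Thread(target=probe, args=(index, probe_keys[half:], out, 1)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(out)
```

Inserts and deletes would modify the shared structure and would therefore need some form of synchronization, which this sketch deliberately leaves out.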
14Index Results
- Speedups for the dual-threaded CSB-tree range from 19% to 68% compared to the single-threaded CSB-tree.
- Running two threads for memory-bound operations offers more opportunities to keep the functional units busy.
- Sharing one CSB-tree between the two threads results in constructive behaviour and a reduction of 6% to 8% in the L2 miss rate.
15Conclusions
- State-of-the-art parallel architectures (SMT and CMP) have opened opportunities to improve software operations so that they better utilize the underlying hardware resources.
- It is essential to have efficient implementations of database operations.
- We propose architecture-aware multithreaded algorithms for the most important database operations (joins, sorts and indexes).
- We characterize the timing and memory behaviour of these database operations.
18Figure 1-1 The SMT Architecture
19Figure 1-2 Comparison between the SMT and the Dual Core Architectures
20Figure 1-3 Combining the SMT and the CMP Architectures
21Figure 2-1 The L1 Data Cache Load Miss Rate for Hash Join
22Figure 2-2 The L2 Cache Load Miss Rate for Hash Join
23Figure 2-3 The Trace Cache Miss Rate for Hash Join
24Figure 2-4 Typical Relational Table in RDBMS
25Figure 2-5 Database Join
26Figure 2-6 Hash Equi-join Process
27Figure 2-7 Hash Table Structure
28Figure 2-8 Hash Join Base Algorithm
partition R into R0, R1, ..., Rn-1
partition S into S0, S1, ..., Sn-1
for i = 0 until i = n-1
    use Ri to build hash-table i
for i = 0 until i = n-1
    probe Si using hash-table i
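The base algorithm above translates fairly directly into a short runnable sketch. The tuple layout `(key, payload)`, the function names, and the use of Python's built-in `hash` as the partitioning function are assumptions made for illustration; dictionaries stand in for the hash table structure.

```python
def partition_relation(relation, n):
    """Partition a relation of (key, payload) tuples into n parts by hashed key."""
    parts = [[] for _ in range(n)]
    for key, payload in relation:
        parts[hash(key) % n].append((key, payload))
    return parts


def hash_join(r, s, n):
    """Equi-join r and s on key, following the partition / build / probe steps."""
    r_parts = partition_relation(r, n)
    s_parts = partition_relation(s, n)

    # Build: one hash table per partition of R.
    tables = []
    for i in range(n):
        table = {}
        for key, payload in r_parts[i]:
            table.setdefault(key, []).append(payload)
        tables.append(table)

    # Probe: each partition of S only needs the matching partition's table.
    result = []
    for i in range(n):
        for key, s_payload in s_parts[i]:
            for r_payload in tables[i].get(key, []):
                result.append((key, r_payload, s_payload))
    return result
```

Since both relations are partitioned with the same function, matching keys always land in partitions with the same index, so partition i of S never has to consult any table other than hash table i.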
29Figure 2-9 AA_HJ Build Phase Executed by one Thread
30Figure 2-10 AA_HJ Probe Index Partitioning Phase Executed by one Thread
31Figure 2-11 AA_HJ S-Relation Partitioning and Probing Phases
32Figure 2-12 AA_HJ Multithreaded Probing Algorithm
33Table 2-1 Machines Specifications
34Table 2-2 Number of Tuples for Machine 1
35Table 2-3 Number of Tuples for Machine 2
36Figure 2-13 Timing for three Hash Join Partitioning Techniques
37Figure 2-14 Memory Usage for three Hash Join Partitioning Techniques
38Figure 2-15 Timing for Dual-threaded Hash Join
39Figure 2-16 Memory Usage for Dual-threaded Hash Join
40Figure 2-17 Timing Comparison of all Hash Join Algorithms
41Figure 2-18 Memory Usage Comparison of all Hash Join Algorithms
42Figure 2-19 Speedups due to the AA_HJSMT and the AA_HJGPSMT Algorithms
43Figure 2-20 Varying Number of Clusters for the AA_HJGPSMT
44Figure 2-21 Varying the Selectivity for Tuple Size 100 Bytes
45Figure 2-22 Time Breakdown Comparison for the Hash Join Algorithms for tuple sizes 20 Bytes and 100 Bytes
46Figure 2-23 Timing for the Multi-threaded Architecture-Aware Hash Join
47Figure 2-24 Speedups for the Multi-Threaded Architecture-Aware Hash Join
48Figure 2-25 Memory Usage for the Multi-Threaded Architecture-Aware Hash Join
49Figure 2-26 Time Breakdown Comparison for Hash Join Algorithms
50Figure 2-27 The L1 Data Cache Load Miss Rate for NPT and AA_HJ
51Figure 2-28 Number of Loads for NPT and AA_HJ
52Figure 2-29 The L2 Cache Load Miss Rate for NPT and AA_HJ
53Figure 2-30 The Trace Cache Miss Rate for NPT and AA_HJ
54Figure 2-31 The DTLB Load Miss Rate for NPT and AA_HJ
55Figure 3-1 The LSD Radix Sort
for (i = 0; i < number_of_digits; i++)
    sort source-array based on digit i
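The loop above leaves open how each per-digit sort is done. One common choice, assumed here, is a stable counting sort per digit. The sketch below handles non-negative integers that fit in `key_bits` bits; the parameter names and defaults are illustrative and are not taken from the slides.

```python
def lsd_radix_sort(values, bits_per_digit=8, key_bits=32):
    """Sort non-negative integers below 2**key_bits, least significant digit first."""
    radix = 1 << bits_per_digit
    mask = radix - 1
    src = list(values)
    for shift in range(0, key_bits, bits_per_digit):
        # Count how many keys fall into each digit value.
        counts = [0] * radix
        for v in src:
            counts[(v >> shift) & mask] += 1
        # Turn counts into starting offsets (exclusive prefix sum).
        offset = 0
        for d in range(radix):
            counts[d], offset = offset, offset + counts[d]
        # Scatter keys into the destination array; stable within each digit.
        dst = [0] * len(src)
        for v in src:
            d = (v >> shift) & mask
            dst[counts[d]] = v
            counts[d] += 1
        src = dst
    return src
```

The scatter step writes to `radix` different destination regions in an interleaved order, which is one plausible source of the store-side TLB pressure that the sort slides report for radix sort.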
56Figure 3-2 The Counting LSD Radix Sort Algorithm
57Figure 3-3 Parallel Radix Sort Algorithm
58Table 3-1 Memory Characterization for LSD Radix Sort with Different Datasets
59Figure 3-4 Radix Sort Timing for the Random Datasets on Machine 2
60Figure 3-5 Radix Sort Timing for the Gaussian Datasets on Machine 2
61Figure 3-6 Radix Sort Timing for Zero Datasets on Machine 2
62Figure 3-7 Radix Sort Timing for the Random Datasets on Machine 1
63Figure 3-8 Radix Sort Timing for the Gaussian Datasets on Machine 1
64Figure 3-9 Radix Sort Timing for the Zero Datasets on Machine 1
65Figure 3-10 The DTLB Stores Miss Rate for the Radix Sort on Machine 2 (Random Datasets)
66Figure 3-11 The L1 Data Cache Load Miss Rate for the Radix Sort on Machine 2 (Random Datasets)
67Table 3-2 Memory Characterization for Memory-Tuned Quick Sort with Different Datasets
68Figure 3-12 Quicksort Timing for the Random Datasets on Machine 2
69Figure 3-13 Quicksort Timing for the Random Dataset on Machine 1
70Figure 3-14 Quicksort Timing for the Gaussian Datasets on Machine 2
71Figure 3-15 Quicksort Timing for the Gaussian Dataset on Machine 1
72Figure 3-16 Quicksort Timing for the Zero Datasets on Machine 2
73Figure 3-17 Quicksort Timing for the Zero Dataset on Machine 1
74Table 3-3 The Sort Results for Machine 1
75Table 3-4 The Sort Results for Machine 2
76Figure 4-1 Search Operation on an Index Tree
77Figure 4-2 Differences between the B-Tree and the CSB-Tree
78Figure 4-3 Dual-Threaded CSB-Tree for the SMT Architectures
79Figure 4-4 Timing for the Single and Dual-Threaded CSB-Tree
80Figure 4-5 The L1 Data Cache Load Miss Rate for the Single and Dual-Threaded CSB-Tree
81Figure 4-6 The Trace Cache Miss Rate for the Single and Dual-Threaded CSB-Tree
82Figure 4-7 The L2 Load Miss Rate for the Single and Dual-Threaded CSB-Tree
83Figure 4-8 The DTLB Load Miss Rate for the Single and Dual-Threaded CSB-Tree
84Figure 4-9 The ITLB Load Miss Rate for the Single and Dual-Threaded CSB-Tree