Title: Sort-Merge Join Implementation Details for Minibase
1Sort-Merge Join Implementation Details for
Minibase
University of California Riverside Department
of Computer Science Engineering cs179G
Database Project Phase 4
- by
- Demetris Zeinalipour
- http//www.cs.ucr.edu/cs179g-t/
2Sort-Merge Join Review
Sailors S
Reserves R
Name
Carl
Tom
- Query with JOIN
- SELECT S.name
- FROM Sailors S, Reserves R
- WHERE S.sid R.id
SID NAME AGE
1 Carl 23
2 Tom 26
3 Peter 23
SID BID
2 5
1 4
2 9
- How to implement the JOIN operator?
- Using Nested Loop Joins. tuple-at-a-time For
each tuple in S check with every tuple in R
variation page-at-a-time (every page contains
several records) - Using Block Nested Loop Join. Idea Load in
memory smaller relation e.g. S and then scan
relation S on a page-at-a-time basis. Notice The
idea can be generalized even if the smaller
relation doesnt fit in memory - Using Sort-Merge Join Idea Sort both relations
using an external sort algorithm and then merge
the relations.
3External-Sorting Review 1/3
- When? If the data to be sorted is too big to fit
in main memory then we need an external sort
algorithm. - Simple approach 2-way Merge-Sort
Inpage1 11,7, 2,1
Inpage219, 7,5,4
Use quicksort internally
Memory
4
- Steps
- Fetch a page to memory
- Sort Page in memory
- Write page back to disk
- Merge pages levelwise
- (see next page)
HeapFile 1
inpage1
1-3
outpage
4
inpage2
4External-Sorting Review 2/3
- 2-way mergesort gt
- Passes log2N14
- Cost 2N(log2N1)64 I/O
- Expensive
- In project we use External
- Sort (sort.C) which utilizes
- all available buffer pages (10)
- and reduces the number of
- Passes and the I/O cost
5External-Sorting Review 3/3
- Idea similar with 2-way Mergesort with the
difference - that we utilize B-1 buffer pages (Bgt3)
Dont worry about this implementation as it is
already implemented in sort.C
Use quicksort internally
Memory
Heapf
1-3
0
1
2
In this project you should call Sort() from
within the sortMerge constructor before
proceeding to the merge phase
outpage
3
4
5
4
6
7
6The Merging Phase of SMJoin
- Now that the two relations R and S are sorted we
- must merge them.
- Merge using 2 iterators to move from page to page
and from record to record - Works fine ONLY if both R and S have NO
duplicates. - (e.g if sid is a foreign key in a 11 relation
Sailor, Address)
Page
Tr
2 5 10 70 80 90
R.sid
Heapfile
S.sid
1 3 6 10 30 40
Gs
Forward the iterator (Tr or Gs) that has the
smallest value until TrGs then output value
7The Merging Phase of SMJoin
- What if relations have duplicates? (check
example) - (either both of them or just one of them)
- Therefore we need to use 3 iterators (1 for
Marking)
Tr
2 5 10 10 80 90
R.sid
S.sid
1 3 6 10 10 40
Gs
Ts
- The one extra Ts iterator will be used as soon
TrGs at which point we will move Ts to Gs
position and use Ts to iterate S - The full version of the algorithm is shown in 2
slides
8HeapFiles (heapfile.h, scan.C)
- Database File Organization of various pages into
a logical. - In a heapfile pages are unordered within the file
- In order to Scan the pages (records) of a
heapfile we will use scan.C
Example // Sorting a heapfile (after heapR
stored in Catalog) Sort(unsortheapR, heapR,
..)rest of params from SortMerge() // Creating a
scan on a heapfile HeapFile heapR(heapR,
status) Scan Rscan heapR.openScan(Rstatus)
// Scan until DONE Rstatus Rscan-gtgetNext(RID
rid,char RecR, int lenR) // Inserting results
in Out Heapfile HeapFile heapOut(heapOut,
status) memmove(char RecO, char RecR, int
lenR)//do same for S heapOut.insertRecord(char
recptr, lenRlenS, RIDoutRID)
All the work of Pinning/Unpinning pages is done
from within Scan since it locates the
directoryheader Page from the catalog and
proceeds from there on with getNext()
9The Merging Phase of SMJoin
- The full algorithm (this is all you need to
implement)
Sort R S
Init Iterators
Fast forward R
Fast forward S
Consider ScanPosition()
10Implementation of the SMJoin Algorithm
- The Big Picture
- main.C (or smjoin_main.C same thing)
- gt SMJTester.C (runTests())
- gt test1() (this runs all 6 tests
actually) - gt createFiles() // creates 5 Heapfiles using
- // the data of same constant integer arrays
(data0,data4) - gt test(i) // inside sortMerge Constructor
//sortMerge
heapfileR
No_of_cols
joinColumn
sortMerge sm(file0, 2, attrType, attrsize, 0,
file1, 2, attrType, attrsize, 0,
test1, 10, Ascending, s) // inside
sortMerge Constructor
heapfileS
Out_heapfile
minirel.h
No_of_pages available
gt
Sort( (infile) file0, (outfile) BRfile1,
(no_of_cols)2, attrType, attrsize,
(joincolumn)0, (no_of_pages_available)10,
(status)s )
11Where to start from?
- Start out by reading Sort-Merge Join from book
(Chap 12.5.2 2nd edition, 14.4.2 3rd edition) - Study SMJTester.C which contains the tests and
understand the test scenarios. - Start Implementing sortMerge.C by invoking the
Sort() constructor etc. - Have a closer look at the new classes that you
have Sort.C, Heapfile.C and Scan.C