CMPT 354 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
CMPT 354
  • External Sorting

2
External Sorting
  • Two-way external merge sort
  • External merge sort
  • Replacement sort
  • External merge sort notes

3
External Sorting Introduction
  • It is often necessary to sort data in a database
  • The user may require that the results of a query
    are sorted
  • Sorting often results in making other database
    operations more efficient
  • Bulk loading a tree index
  • Eliminating duplicate records
  • Joining relations
  • As with most DB operations, the focus of a
    sorting algorithm is to keep the number of disk
    I/O operations as small as possible

4
Internal vs. External Sorting
  • Sorting a collection of records that fit within
    main memory can be done efficiently
  • There are a number of sorting algorithms that run
    in O(n log2 n) time
  • That is, with O(n log2 n) comparisons, e.g.
  • Merge sort, Quicksort, Heap sort
  • Because of the large size of many database files
    it may not be possible to fit all of the records
    to be sorted in main memory at one time
  • The focus in external sorting is therefore to
    reduce the number of disk I/O operations

5
Merge Sort Review
  • Consider the mergesort algorithm
  • mergesort(arr, start, end)
  •   if start ≥ end then return
  •   mid = (start + end) / 2
  •   mergesort(arr, start, mid)
  •   mergesort(arr, mid + 1, end)
  •   merge(arr, start, mid, mid + 1, end)
  • The array is repeatedly halved until the
    resulting subarrays contain one element
  • The subarrays are then built up into sorted
    subarrays by repeated merge operations
  • Where the merge algorithm merges the contents of
    two sorted subarrays
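The pseudocode above can be written as a runnable in-memory sketch (Python; names follow the slides, with the `merge` step spelled out explicitly):

```python
def merge(arr, start, mid, end):
    """Merge the sorted subarrays arr[start..mid] and arr[mid+1..end]."""
    merged = []
    i, j = start, mid + 1
    while i <= mid and j <= end:
        if arr[i] <= arr[j]:
            merged.append(arr[i]); i += 1
        else:
            merged.append(arr[j]); j += 1
    merged.extend(arr[i:mid + 1])     # leftovers from the first half
    merged.extend(arr[j:end + 1])     # leftovers from the second half
    arr[start:end + 1] = merged

def mergesort(arr, start, end):
    """Sort arr[start..end] in place by repeated halving and merging."""
    if start >= end:                  # one element is already sorted
        return
    mid = (start + end) // 2
    mergesort(arr, start, mid)
    mergesort(arr, mid + 1, end)
    merge(arr, start, mid, end)
```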

6
Naïve External Merge Sort
  • Start with single pages of the file
  • Where the file has N pages
  • Read the first two pages of the file and sort
    them (independently of each other)
  • Merge the two pages (using the in-memory merge
    algorithm) and write them out
  • Repeat for the rest of the file to produce N/2
    sorted runs of size 2
  • Read in the first page of the first two runs and
    merge into a sorted run of size 4

7
Details
  • Read in the first page of the first two runs and
    merge into a sorted run of size 4
  • What happens if there are only 3 main memory
    frames available for the sort?!
  • That is, how do we retain a sorted run of size 4
    in main memory?
  • While there are likely to be more than 3 free
    main memory frames, this is a general problem
    that must be dealt with at some point, as the
    sorted runs grow larger
  • Regardless of the size of the sorted runs only
    three main memory frames are ever required for
    the merge process

8
Memory Usage
  • After the first merge pass the file consists of
    N/2 sorted runs each of two pages
  • Read in the first page of each of the first two
    sorted runs
  • Leaving a third page free as an output buffer

(Figure: main memory frames)
9
Memory Usage
  • Records from the input pages are merged into the
    output buffer
  • Once the output buffer is full its contents are
    written out to disk, to form the first page of
    the first sorted run of length 4

(Figure: main memory frames)
10
Memory Usage
  • At this point all of the records from one of the
    input pages have been dealt with
  • The next page of that sorted run is read into the
    input page
  • And the process continues
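The merge process described on these slides can be sketched in Python, modelling pages as small lists of records. Only three frames are used regardless of run length: one input frame per run plus one output buffer (page size and names are illustrative):

```python
def merge_runs(run_a, run_b, page_size=2):
    """Merge two sorted runs (each a list of sorted pages) using three
    frames: one input frame per run plus one output buffer."""
    def load(run, idx):               # "read" page idx of a run, [] at end
        return list(run[idx]) if idx < len(run) else []

    ia, ib = 0, 0
    frame_a, frame_b = load(run_a, ia), load(run_b, ib)
    output, out_pages = [], []
    while frame_a or frame_b:
        if frame_a and (not frame_b or frame_a[0] <= frame_b[0]):
            output.append(frame_a.pop(0))
            if not frame_a:           # input frame exhausted: read next page
                ia += 1
                frame_a = load(run_a, ia)
        else:
            output.append(frame_b.pop(0))
            if not frame_b:
                ib += 1
                frame_b = load(run_b, ib)
        if len(output) == page_size:  # output buffer full: "write" a page
            out_pages.append(output)
            output = []
    if output:
        out_pages.append(output)
    return out_pages
```

Merging two 2-page runs this way produces one sorted 4-page run while never holding more than three pages in memory.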

(Figure: main memory frames)
11
Cost of Naïve Merge Sort
  • In the process described, assume that N = 2^k
  • After the first pass there are 2^(k-1) sorted
    runs, each two pages long
  • After the second pass there are 2^(k-2) sorted
    runs, of length 4
  • After the kth pass there is one sorted run
  • The number of passes is therefore ⌈log2 N⌉ + 1
  • ⌈log2 N⌉ is the number of merge passes required
  • The + 1 is for the initial pass to sort each page
  • Each pass requires that each page be read and
    written back, for a total cost of 2N(⌈log2 N⌉ + 1)
  • This process does not make full use of available
    main memory space and can be improved
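These counts can be checked with a quick calculation (a sketch; function names are illustrative):

```python
import math

def naive_passes(n_pages):
    """Two-way external merge sort: one initial sorting pass
    plus ceil(log2 N) merge passes."""
    return math.ceil(math.log2(n_pages)) + 1

def naive_cost(n_pages):
    """Each pass reads and writes every page: 2N I/Os per pass."""
    return 2 * n_pages * naive_passes(n_pages)
```

For example, an 8-page file needs 1 + log2(8) = 4 passes, i.e. 2 × 8 × 4 = 64 page I/Os.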

12
First Stage Improvement
  • In the first stage of the process each page is
    read into main memory, sorted and written out so
    that the pages are ready to be merged
  • Instead of sorting each page independently, read
    B pages into main memory, and sort them together
  • Where B is the number of main memory pages
  • After the first pass there will be N/B sorted
    runs, each of length B
  • This will reduce the number of subsequent merge
    passes that are required

13
Merge Pass Improvement
  • In the merge passes perform a B-1 way merge
    rather than a 2 way merge
  • B-1 input partitions, and
  • 1 page for an output buffer
  • This requires that the first item in each of the
    B-1 input partitions is compared to each other to
    determine the smallest
  • This may result in more comparisons being
    performed than a two-way merge
  • But results in fewer merge passes
  • Each merge pass will merge B-1 runs, and will
    produce sorted runs of size (B-1)B
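Finding the smallest front record among the B-1 inputs is typically done with a priority queue rather than pairwise comparison; a minimal in-memory sketch (runs modelled as Python lists rather than pages):

```python
import heapq

def multiway_merge(runs):
    """(B-1)-way merge of sorted runs: a heap holds the front record of
    each run, so each output record costs O(log(B-1)) comparisons."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, run_id, pos = heapq.heappop(heap)
        merged.append(value)
        if pos + 1 < len(runs[run_id]):   # advance within the same run
            heapq.heappush(heap, (runs[run_id][pos + 1], run_id, pos + 1))
    return merged
```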

14
Cost of External Merge Sort
  • The initial pass produces N/B sorted runs of size
    B
  • Each merge pass reduces the number of runs by a
    factor of B-1
  • The number of merge passes is ⌈log_(B-1) ⌈N/B⌉⌉
  • Each pass requires that the entire file is read
    and written
  • Total cost is therefore 2N(⌈log_(B-1) ⌈N/B⌉⌉ + 1)
  • As B is typically relatively large this reduction
    is considerable

15
Number of Passes
  • Assuming that main memory is of a reasonable size
    and is dedicated to sorting the file, a file can
    usually be sorted in two passes (one to build the
    initial runs, one merge pass), at a cost of 4N
    disk I/Os
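The two-pass claim can be checked numerically (a sketch assuming the pass counts from the previous slides; the B = 257 figure below is illustrative):

```python
import math

def total_passes(n_pages, buffers):
    """One initial pass producing ceil(N/B) runs of length B, then
    ceil(log_(B-1) of the run count) merge passes."""
    runs = math.ceil(n_pages / buffers)
    merge_passes = math.ceil(math.log(runs, buffers - 1)) if runs > 1 else 0
    return 1 + merge_passes

# With B = 257 buffer pages, any file up to B(B-1) = 65792 pages sorts in
# 2 passes, for a total of 2 passes * 2N I/Os per pass = 4N disk I/Os.
```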

16
Generating Longer Initial Runs
  • In the first stage N/B sorted runs of size B are
    produced
  • The size of these preliminary runs determines how
    many merge passes are required
  • The size of the initial runs can be increased by
    using replacement sort
  • On average the size of the initial runs is
    increased to 2B
  • However, using replacement sort does increase
    complexity

17
Replacement Sort
  • B-2 pages are used to sort the file
  • Called the current set
  • One page is used for input, and
  • One page is used for output
  • First enough pages to fill the current set are
    read in and sorted

(Figure: current set, input buffer, output buffer)
18
Replacement Sort
  • The next page of the file is read in to the input
    buffer
  • The smallest record from the current set is
    placed in the output buffer
  • A record from the input buffer is placed in the
    current set

(Figure: current set, input buffer, output buffer)
19
Replacement Sort
  • The process continues until the input buffer's
    contents are all in the current set
  • At this point another input page can be read into
    the input buffer
  • And the output buffer can be written to disk

(Figure: current set, input buffer, output buffer)
20
Replacement Sort
  • The current set can be kept sorted (e.g. as a
    heap) to reduce main memory costs
  • When a record read into the input buffer is less
    than the last record placed in the output buffer,
    it cannot join the current run; eventually the
    rest of the current set must be written out to
    complete the first run
  • The process then starts again

(Figure: current set, input buffer, output buffer)
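Replacement sort (also called replacement selection) is usually implemented with a heap as the current set. A minimal in-memory sketch, record-at-a-time rather than page-at-a-time (the paging detail from the slides is elided; names are illustrative):

```python
import heapq

def replacement_selection(records, capacity):
    """Generate sorted runs using a current set of `capacity` records.
    A record smaller than the last one output must wait for the next run."""
    records = iter(records)
    current = []                       # heap: the current set
    for _ in range(capacity):          # fill the current set from the input
        try:
            heapq.heappush(current, next(records))
        except StopIteration:
            break
    runs, run = [], []
    frozen = []                        # records held back for the next run
    while current:
        smallest = heapq.heappop(current)
        run.append(smallest)           # "write" to the output buffer
        try:
            incoming = next(records)
            if incoming >= smallest:   # can still join the current run
                heapq.heappush(current, incoming)
            else:                      # must wait for the next run
                frozen.append(incoming)
        except StopIteration:
            pass
        if not current:                # current set empty: close this run
            runs.append(run)
            run = []
            current = frozen
            heapq.heapify(current)
            frozen = []
    if run:
        runs.append(run)
    return runs
```

On random input each run averages about twice the current-set size, which is where the 2B figure from slide 16 comes from.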
21
Revisiting I/O Costs
  • The cost metric used so far (number of disk I/Os)
    is very simplistic
  • In practice it may be advantageous to make the
    input and output buffers larger than one page
  • This reduces the number of runs that can be
    merged at one time, and may increase the number
    of passes required
  • But, it allows a sequence of pages to be read (or
    written) to the buffers, decreasing the actual
    access time per page
  • We have also ignored (non-trivial) CPU costs
  • If double buffering is used, the CPU can process
    one part of a run while the next is being loaded
    into main memory

22
B Trees and Sorting
  • Can a B tree index be used to sort a file?
  • Clustered B tree index
  • The index can be used to find the first page, but
  • Note that the file must already be sorted!
  • Unclustered B tree index
  • The leaves point to data records that are not in
    sort order
  • In the worst case, each data entry could point to
    a different page from its adjacent entries
  • In this case all of the index leaf pages have to
    be retrieved, plus one disk read for each record!
  • In practice external sort is likely to be much
    more efficient than using an unclustered index
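The comparison at the end can be made concrete with back-of-the-envelope I/O counts (all numbers illustrative, assuming the 4N two-pass sort cost from slide 15):

```python
import math

def external_sort_cost(n_pages):
    """Common case of two passes: 4N page I/Os."""
    return 4 * n_pages

def unclustered_index_cost(n_pages, records_per_page, entries_per_leaf):
    """Worst case: read every leaf page, then one page I/O per record."""
    n_records = n_pages * records_per_page
    leaf_pages = math.ceil(n_records / entries_per_leaf)
    return leaf_pages + n_records
```

For a 1000-page file with 100 records per page and 400 entries per leaf, the unclustered index costs 250 + 100000 = 100250 I/Os versus 4000 for the external sort, about 25 times worse.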