CMPT 354 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
CMPT 354
  • External Sorting

2
External Sorting
  • Two-way external merge sort
  • External merge sort
  • Replacement sort
  • External merge sort notes

3
External Sorting Introduction
  • It is often necessary to sort data in a database
  • The user may require that the results of a query
    are sorted
  • Sorting often results in making other database
    operations more efficient
  • Bulk loading a tree index
  • Eliminating duplicate records
  • Joining relations
  • As with most DB operations, the focus of a
    sorting algorithm is to keep the number of disk
    I/O operations as small as possible

4
Internal vs. External Sorting
  • Sorting a collection of records that fit within
    main memory can be done efficiently
  • There are a number of sorting algorithms that run
    in O(n log2 n) time
  • That is, with O(n log2 n) comparisons, e.g.
  • Merge sort, Quicksort, Heap sort
  • Because of the large size of many database files
    it may not be possible to fit all of the records
    to be sorted in main memory at one time
  • The focus in external sorting is therefore to
    reduce the number of disk I/O operations

5
Merge Sort Review
  • Consider the mergesort algorithm
  • mergesort(arr, start, end)
  •   if start ≥ end then return
  •   mid = (start + end) / 2
  •   mergesort(arr, start, mid)
  •   mergesort(arr, mid + 1, end)
  •   merge(arr, start, mid, mid + 1, end)
  • The array is repeatedly halved until the
    resulting subarrays contain one element
  • The subarrays are then built up into sorted
    subarrays by repeated merge operations
  • Where the merge algorithm merges the contents of
    two sorted subarrays
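The pseudocode above can be written as a runnable in-memory sketch (Python; names follow the slides, with the `merge` step spelled out explicitly):

```python
def merge(arr, start, mid, end):
    """Merge the sorted subarrays arr[start..mid] and arr[mid+1..end]."""
    merged = []
    i, j = start, mid + 1
    while i <= mid and j <= end:
        if arr[i] <= arr[j]:
            merged.append(arr[i]); i += 1
        else:
            merged.append(arr[j]); j += 1
    merged.extend(arr[i:mid + 1])     # leftovers from the first half
    merged.extend(arr[j:end + 1])     # leftovers from the second half
    arr[start:end + 1] = merged

def mergesort(arr, start, end):
    """Sort arr[start..end] in place by repeated halving and merging."""
    if start >= end:                  # one element is already sorted
        return
    mid = (start + end) // 2
    mergesort(arr, start, mid)
    mergesort(arr, mid + 1, end)
    merge(arr, start, mid, end)
```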

6
Naïve External Merge Sort
  • Start with single pages of the file
  • Where the file has N pages
  • Read the first two pages of the file and sort
    them (independently of each other)
  • Merge the two pages (using the in-memory merge
    algorithm) and write them out
  • Repeat for the rest of the file to produce N/2
    sorted runs of size 2
  • Read in the first page of the first two runs and
    merge into a sorted run of size 4

7
Details
  • Read in the first page of the first two runs and
    merge into a sorted run of size 4
  • What happens if there are only 3 main memory
    frames available for the sort?!
  • That is, how do we retain a sorted run of size 4
    in main memory?
  • While there are likely to be more than 3 free
    main memory frames, this is a general problem
    that must be dealt with at some point, as the
    sorted runs grow larger
  • Regardless of the size of the sorted runs only
    three main memory frames are ever required for
    the merge process

8
Memory Usage
  • After the first merge pass the file consists of
    N/2 sorted runs each of two pages
  • Read in the first page of each of the first two
    sorted runs
  • Leaving a third page free as an output buffer

(Figure: main memory frames)
9
Memory Usage
  • Records from the input pages are merged into the
    output buffer
  • Once the output buffer is full its contents are
    written out to disk, to form the first page of
    the first sorted run of length 4

(Figure: main memory frames)
10
Memory Usage
  • At this point all of the records from one of the
    input pages have been dealt with
  • The next page of that sorted run is read into the
    input page
  • And the process continues
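The merge process described on these slides can be sketched in Python, modelling pages as small lists of records. Only three frames are used regardless of run length: one input frame per run plus one output buffer (page size and names are illustrative):

```python
def merge_runs(run_a, run_b, page_size=2):
    """Merge two sorted runs (each a list of sorted pages) using three
    frames: one input frame per run plus one output buffer."""
    def load(run, idx):               # "read" page idx of a run, [] at end
        return list(run[idx]) if idx < len(run) else []

    ia, ib = 0, 0
    frame_a, frame_b = load(run_a, ia), load(run_b, ib)
    output, out_pages = [], []
    while frame_a or frame_b:
        if frame_a and (not frame_b or frame_a[0] <= frame_b[0]):
            output.append(frame_a.pop(0))
            if not frame_a:           # input frame exhausted: read next page
                ia += 1
                frame_a = load(run_a, ia)
        else:
            output.append(frame_b.pop(0))
            if not frame_b:
                ib += 1
                frame_b = load(run_b, ib)
        if len(output) == page_size:  # output buffer full: "write" a page
            out_pages.append(output)
            output = []
    if output:
        out_pages.append(output)
    return out_pages
```

Merging two 2-page runs this way produces one sorted 4-page run while never holding more than three pages in memory.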

(Figure: main memory frames)
11
Cost of Naïve Merge Sort
  • In the process described, assume that N = 2^k
  • After the first pass there are 2^(k-1) sorted
    runs, each two pages long
  • After the second pass there are 2^(k-2) sorted
    runs, of length 4
  • After the kth pass there is one sorted run
  • The number of passes is therefore ⌈log2 N⌉ + 1
  • ⌈log2 N⌉ is the number of merge passes required
  • The + 1 is for the initial pass to sort each page
  • Each pass requires that each page be read and
    written back, for a total cost of 2N(⌈log2 N⌉ + 1)
  • This process does not make full use of available
    main memory space and can be improved
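These counts can be checked with a quick calculation (a sketch; function names are illustrative):

```python
import math

def naive_passes(n_pages):
    """Two-way external merge sort: one initial sorting pass
    plus ceil(log2 N) merge passes."""
    return math.ceil(math.log2(n_pages)) + 1

def naive_cost(n_pages):
    """Each pass reads and writes every page: 2N I/Os per pass."""
    return 2 * n_pages * naive_passes(n_pages)
```

For example, an 8-page file needs 1 + log2(8) = 4 passes, i.e. 2 × 8 × 4 = 64 page I/Os.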

12
First Stage Improvement
  • In the first stage of the process each page is
    read into main memory, sorted and written out so
    that the pages are ready to be merged
  • Instead of sorting each page independently, read
    B pages into main memory, and sort them together
  • Where B is the number of main memory pages
  • After the first pass there will be N/B sorted
    runs, each of length B
  • This will reduce the number of subsequent merge
    passes that are required

13
Merge Pass Improvement
  • In the merge passes perform a B-1 way merge
    rather than a 2 way merge
  • B-1 input partitions, and
  • 1 page for an output buffer
  • This requires that the first item in each of the
    B-1 input partitions is compared to each other to
    determine the smallest
  • This may result in more comparisons being
    performed than a two-way merge
  • But results in fewer merge passes
  • Each merge pass will merge B-1 runs, and will
    produce sorted runs of size (B-1)B
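Finding the smallest front record among the B-1 inputs is typically done with a priority queue rather than pairwise comparison; a minimal in-memory sketch (runs modelled as Python lists rather than pages):

```python
import heapq

def multiway_merge(runs):
    """(B-1)-way merge of sorted runs: a heap holds the front record of
    each run, so each output record costs O(log(B-1)) comparisons."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        value, run_id, pos = heapq.heappop(heap)
        merged.append(value)
        if pos + 1 < len(runs[run_id]):   # advance within the same run
            heapq.heappush(heap, (runs[run_id][pos + 1], run_id, pos + 1))
    return merged
```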

14
Cost of External Merge Sort
  • The initial pass produces N/B sorted runs of size
    B
  • Each merge pass reduces the number of runs by a
    factor of B-1
  • The number of merge passes is ⌈log_(B-1) ⌈N/B⌉⌉
  • Each pass requires that the entire file is read
    and written
  • Total cost is therefore 2N(⌈log_(B-1) ⌈N/B⌉⌉ + 1)
  • As B is typically relatively large this reduction
    is considerable

15
Number of Passes
  • Assuming that main memory is of a reasonable size
    and is dedicated to sorting the file, a file can
    usually be sorted in two passes (one to build the
    initial runs, one merge pass), at a cost of 4N
    disk I/Os
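The two-pass claim can be checked numerically (a sketch assuming the pass counts from the previous slides; the B = 257 figure below is illustrative):

```python
import math

def total_passes(n_pages, buffers):
    """One initial pass producing ceil(N/B) runs of length B, then
    ceil(log_(B-1) of the run count) merge passes."""
    runs = math.ceil(n_pages / buffers)
    merge_passes = math.ceil(math.log(runs, buffers - 1)) if runs > 1 else 0
    return 1 + merge_passes

# With B = 257 buffer pages, any file up to B(B-1) = 65792 pages sorts in
# 2 passes, for a total of 2 passes * 2N I/Os per pass = 4N disk I/Os.
```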

16
Generating Longer Initial Runs
  • In the first stage N/B sorted runs of size B are
    produced
  • The size of these preliminary runs determines how
    many merge passes are required
  • The size of the initial runs can be increased by
    using replacement sort
  • On average the size of the initial runs is
    increased to 2B
  • However, using replacement sort does increase
    complexity

17
Replacement Sort
  • B-2 pages are used to sort the file
  • Called the current set
  • One page is used for input, and
  • One page is used for output
  • First enough pages to fill the current set are
    read in and sorted

(Figure: current set, input buffer, output buffer)
18
Replacement Sort
  • The next page of the file is read in to the input
    buffer
  • The smallest record from the current set is
    placed in the output buffer
  • A record from the input buffer is placed in the
    current set

(Figure: current set, input buffer, output buffer)
19
Replacement Sort
  • The process continues until the input buffer's
    contents are all in the current set
  • At this point another input page can be read into
    the input buffer
  • And the output buffer can be written to disk

(Figure: current set, input buffer, output buffer)
20
Replacement Sort
  • The current set can be kept sorted (e.g. as a
    heap) to reduce main memory costs
  • When a record read into the input buffer is less
    than the last record placed in the output buffer,
    it cannot join the current run; eventually the
    rest of the current set must be written out to
    complete the first run
  • The process then starts again

(Figure: current set, input buffer, output buffer)
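Replacement sort (also called replacement selection) is usually implemented with a heap as the current set. A minimal in-memory sketch, record-at-a-time rather than page-at-a-time (the paging detail from the slides is elided; names are illustrative):

```python
import heapq

def replacement_selection(records, capacity):
    """Generate sorted runs using a current set of `capacity` records.
    A record smaller than the last one output must wait for the next run."""
    records = iter(records)
    current = []                       # heap: the current set
    for _ in range(capacity):          # fill the current set from the input
        try:
            heapq.heappush(current, next(records))
        except StopIteration:
            break
    runs, run = [], []
    frozen = []                        # records held back for the next run
    while current:
        smallest = heapq.heappop(current)
        run.append(smallest)           # "write" to the output buffer
        try:
            incoming = next(records)
            if incoming >= smallest:   # can still join the current run
                heapq.heappush(current, incoming)
            else:                      # must wait for the next run
                frozen.append(incoming)
        except StopIteration:
            pass
        if not current:                # current set empty: close this run
            runs.append(run)
            run = []
            current = frozen
            heapq.heapify(current)
            frozen = []
    if run:
        runs.append(run)
    return runs
```

On random input each run averages about twice the current-set size, which is where the 2B figure from slide 16 comes from.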
21
Revisiting I/O Costs
  • The cost metric used so far (number of disk I/Os)
    is very simplistic
  • In practice it may be advantageous to make the
    input and output buffers larger than one page
  • This reduces the number of runs that can be
    merged at one time, and may increase the number
    of passes required
  • But, it allows a sequence of pages to be read (or
    written) to the buffers, decreasing the actual
    access time per page
  • We have also ignored (non-trivial) CPU costs
  • If double buffering is used, the CPU can process
    one part of a run while the next is being loaded
    into main memory

22
B Trees and Sorting
  • Can a B tree index be used to sort a file?
  • Clustered B tree index
  • The index can be used to find the first page, but
  • Note that the file must already be sorted!
  • Unclustered B tree index
  • The leaves point to data records that are not in
    sort order
  • In the worst case, each data entry could point to
    a different page from its adjacent entries
  • In this case all of the index leaf pages have to
    be retrieved, plus one disk read for each record!
  • In practice external sort is likely to be much
    more efficient than using an unclustered index
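The comparison at the end can be made concrete with back-of-the-envelope I/O counts (all numbers illustrative, assuming the 4N two-pass sort cost from slide 15):

```python
import math

def external_sort_cost(n_pages):
    """Common case of two passes: 4N page I/Os."""
    return 4 * n_pages

def unclustered_index_cost(n_pages, records_per_page, entries_per_leaf):
    """Worst case: read every leaf page, then one page I/O per record."""
    n_records = n_pages * records_per_page
    leaf_pages = math.ceil(n_records / entries_per_leaf)
    return leaf_pages + n_records
```

For a 1000-page file with 100 records per page and 400 entries per leaf, the unclustered index costs 250 + 100000 = 100250 I/Os versus 4000 for the external sort, about 25 times worse.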