13. External Sorting - PowerPoint PPT Presentation

About This Presentation
Title:

13. External Sorting

Description:

... needed to sort a file in 2 passes, using merge or bucket sort ... Pass 0: Read a page, sort it, write it. How many buffer pages needed? Pass 1, 2, 3, ..., etc. ... – PowerPoint PPT presentation

Number of Views:618
Avg rating:3.0/5.0
Slides: 36
Provided by: RaghuRamak
Learn more at: http://web.cecs.pdx.edu
Category:
Tags: external | sorting

less

Transcript and Presenter's Notes

Title: 13. External Sorting


1
13. External Sorting
  • Motivation
  • 2-way External Sort Memory, passes,cost
  • General External Sort Memory, passes, cost
  • Optimizations
  • Snowplow
  • Double Buffering
  • Forecasting
  • Using a B tree index
  • Bucket Sort
  • Intergalactic Standard Reference
  • Graefe, Implementing Sorting in Database Systems
  • http//portal.acm.org/citation.cfm?id1132964

2
Learning Objectives
  • Derive formula for cost of external merge sort
  • Derive amount of memory needed to sort a file in
    2 passes, using merge or bucket sort
  • Describe algorithm for generating longer initial
    runs and identify its best and worst cases
  • Describe forecasting and why it is useful
  • Identify when indexes should be used for sorting
  • Identify the pros and cons of external bucket vs
    merge sort.

3
Why sort?
  • A classic problem in computer science!
  • Exercises many software and hardware features
  • Data is often requested in sorted order
  • e.g., find students in increasing gpa order
  • Sorting is first step in bulk loading B tree
    index.
  • Sorting useful for some query processing
    algoritms (Chapter 14)
  • Problem sort 1Gb of data with 1Mb of RAM.
  • why not virtual memory?

4
Sort algorithms?
  • If the data can fit in memory, which sort
    algorithm is best?
  • But most DBMS files will not fit in available
    memory
  • If the data is larger than memory, try the same
    alg.
  • Suppose
  • for this data, your sort alg. requires 220 random
    memory accesses
  • Memory access takes 1 microsec, disk takes 10
    millisecs.
  • How much time is required to do your sort
    algorithms memory accesses?
  • If there is enough memory to hold the data?
  • If the data is four times the size of memory?

5
External Sorts
  • Definition When data is larger than memory.
  • An aside What is Memory?
  • Physical memory? The DBMS is not the only player
  • We concluded that most in-memory sort algs wont
    be effective for external sorting.
  • What sort algorithms are best for external sort?
  • Sort-based
  • Hash-based

6
2-Way External Merge Sort Memory?
13. Sorting
  • Pass 0 Read a page, sort it, write it.
  • How many buffer pages needed?
  • Pass 1, 2, 3, , etc.
  • How many buffer pages needed?

INPUT 1
OUTPUT
INPUT 2
Main memory buffers
Result of Pass k1
Result of Pass k
7
2-Way External Merge Sort Passes?
  • Assume file is N pages
  • Run sorted subfile
  • What happens in pass Zero?
  • How many runs are produced?
  • What is the cost in I/Os?
  • What happens in pass 1?
  • What happens in pass i?
  • How many passes are required?
  • What is the total cost?

8
Two-Way External Merge Sort Cost
13. Sorting
Input file
6,2
2
3,4
9,4
8,7
5,6
3,1
  • Each pass we read write each page in file.
  • N pages in the file gt the number of passes
  • So total cost is
  • Idea Divide and conquer sort subfiles and merge

PASS 0
1-page runs
1,3
2
3,4
5,6
2,6
4,9
7,8
PASS 1
4,7
1,3
2,3
2-page runs
8,9
5,6
2
4,6
PASS 2
2,3
4,4
1,2
4-page runs
6,7
3,5
6
8,9
I/Os
PASS 3
1,2
2,3
3,4
8-page runs
4,5
6,6
7,8
9
9
General External Merge Sort
13. Sorting
  • Suppose we have more than 3 buffer pages.
  • To sort a file with N pages using B buffer pages
  • Pass 0 use B buffer pages. How many sorted runs
    of B pages each are produced? Cost?
  • Pass 1,2, , etc. merge B-1 runs.
  • How many runs are created after pass i?
    Cost of pass i?
  • How many passes? Total Cost?

10
Cost of External Merge Sort
13. Sorting
  • Number of passes
  • Cost 2N ( of passes)
  • E.g., with 5 buffer pages, to sort 108 page file
  • Number of passes is (1 ? log4 ? 108/5 ? ?) 4
  • Cost is 2(108)4
  • Pass 0 Output is 22 sorted
    runs of 5 pages each (last run is only 3 pages)
  • Pass 1 6 sorted runs of 20
    pages each (last run is only 8 pages)
  • Pass 2 2 sorted runs, 80 pages and 28 pages
  • Pass 3 Sorted file of 108 pages

11
How much memory is needed to sort a file in one
or two passes?
  • N number of data pages, B memory pages
    available
  • To sort in One pass N ? B
  • To Sort in Two passes 1logB-1?N/B? ? 2
  • N/B ? (B-1)1
  • Approximating B-1 by B, this yields
  • ?N ? B
  • For example, if pages are 4KBytes, a 4GByte file
    can be sorted in Two Passes with ? buffers.

12
Sorting in 2 passes graphical proof
  • File is B pages wide
  • Each run is B pages
  • File is x pages high
  • Merge x runs in pass 1
  • x?B since we must merge x runs in B pages of
    memory
  • So NxB ?BB or ? N?B

Each run, 1xB pages
The File, N pages
x
B
13
Memory required for 2-pass sort
Assuming page size of 4K
N Pages in File File size in Bytes B Buffer Pages Bytes to sort in Two passes
210 4Meg 25 128K
220 4 Gig 210 4 Meg
226 256 Gig 213 32 Meg
14
Can we always sort in 1 or 2 passes?
  • Assume only one query is active, and there is at
    least 1 gig of physical RAM.
  • Yes DB2,Oracle, SQLServer, MySQL
  • They allocate all available memory to queries
  • Tricky to manage memory allocation as queries
    need more memory during execution
  • No Postgres
  • Memory allocated to each query, for sort and
    other purposes, is fixed by a config parameter.
  • Sort memory is typically a few meg, in case there
    may be many queries executing.

15
Extremes of Sorting
One Pass N B, 1-pass sort, cost N
N
B
N
Original Data
Sorted Data
Quick-sort
Two Pass N B2, 2-pass sort, cost 3N
B
B
N
B
12..B
B
Sorted Data
Original Data
B
Runs
Merge
Quicksort into B runs, length B
16
Extreme Problems
  • Most sorting of large data falls between these
    two extremes
  • If we apply the intergalactic standard sort-merge
    algorithm, in every textbook, the cost for any
    dataset with BltNltB2 will be 3M.
  • Must we always pay that large price?
  • Might there be an algorithm that is a hybrid of
    the two extremes?

17
Hybrid Sort when N ? 3B
  • The key idea of hybrid sort is dont waste
    memory.
  • Here is an example of the hybrid sort algorithm
    when N is approximately 3B.

One Pass M ? 3B, 2-pass sort, cost 3M 2B
B
B
N
Sorted Data
B
B
Original Data
Runs
Merge the runs on disk with the run in memory
Quicksort into runs, length B, leaving the last
run in memory
18
Hybrid Sort in general
  • Let k N/B
  • Arrange R so the last run is B-k pages
  • Cost is N 2(N (B-k)) 3N -2B 2(N/B) 3N
    2( B-N/B)
  • When NB, cost is N1. When NB2, cost is 3N

19
13.3.1 Maximizing initial runs
  • Defn making initial runs as long as possible.
  • Why is it helpful to maximize initial runs?
  • If initial run size is doubled, what is the time
    savings?
  • How can you maximize initial runs?
  • What algorithm is best?

20
Replacement Selection
13. Sorting
  • Initialize empty priority queues CUR, NEXT
  • Read B buffer pages of data into CUR
  • Do
  • Pop record s with smallest key from CUR to
    current run
  • // key of s is now highest key in current run
  • If key of next input r gt key of s
  • //Can put in current run
  • insert r into CUR
  • else
  • insert r into NEXT
  • If CUR is empty
  • interchange NEXT and CUR and start a new run
  • Until (input is empty)

21
More on Replacement Selection
13. Sorting
  • Cf. Knuth, vol 3 442, page 255.
  • Theorem average length of a run in replacement
    sort is 2B
  • Worst-Case
  • What is min length of a run?
  • How does this arise?
  • Best-Case
  • What is max length of a run?
  • How does this arise?
  • Quicksort is faster, but ...

22
How can we prove the Theorem?
13. Sorting
  • Begin with some modeling assumptions
  • Data to be sorted are real numbers between 0 and
    1
  • Data appear at a uniform rate and distribution
  • A snowplow picks up one datum as one falls
  • Picking up a datum pop( ) off the queue
  • Each datum is infinitesimal
  • Each run begins when the plow passes zero

23
13. Sorting
CUR NEXT
B
0
1
2B
2B
24
Snowplow Conclusion
  • The figures on the previous page show that
  • At any time after run 0, the amount of snow
    size of memory B.
  • After the first run, the volume of snow removed
    in one circuit is 2B.
  • Cf. Larson and Graefe 471
  • In spite of memory management problems, the
    snowplow optimization is very effective.

25
13.4 I/O for External Merge Sort
13. Sorting
  • What else can we do to improve performance?
  • We have assumed I/O is done a page at a time
  • Text suggests reading a block of pages
    sequentially.
  • Pass 0 No problem
  • Pass 1,2, lowers fanin
  • Sometimes a win

26
External Sorts jerky behavior
  • Recall that each input is one page
  • What happens after the last record on a page is
    output?

27
Double Buffering
13. Sorting
  • To reduce wait time for I/O request to complete,
    can prefetch into shadow block.
  • This could increase the number of passes
  • In practice, most files still sorted in 1-2
    passes.

INPUT 1
INPUT 1'
INPUT 2
OUTPUT
INPUT 2'
OUTPUT'
b
block size
Disk
INPUT k
Disk
INPUT k'
B main memory buffers, k-way merge
28
Forecasting
A B C
  • Cf. Knuth, vol 3, pg 324-7
  • Double Buffering requires Doubling memory
  • What a huge waste!
  • Most shadow buffers lie idle, unused, wasted.
  • How can we forecast which shadow buffers will be
    needed first?
  • Forecasting can achieve performance of double
    buffering with little memory

3 5 14 34
1 33 45 55
4 6 7 9
88 91 93 99
50 65 74 83
56 57 58 59
29
Sorting Records!
13. Sorting
  • Sorting has become a blood sport!
  • Parallel sorting is the name of the game ...
  • www.research.microsoft.com/barc/SortBenchmark

30
Using B Trees for Sorting
13. Sorting
  • Scenario Table to be sorted has B tree index on
    sorting column(s).
  • Idea Can retrieve records in order by traversing
    leaf pages.
  • Is this a good idea?
  • Cases to consider
  • B tree is clustered Good idea!
  • B tree is not clustered Could be a very bad idea!

31
Clustered B Tree Used for Sorting
  • Cost root to the left-most leaf, then retrieve
    all leaf pages (Alternative 1)
  • If Alternative 2 is used? Additional cost of
    retrieving data records each page fetched just
    once.

Index
(Directs search)
Data Entries
("Sequence set")
Data Records
  • Always better than external sorting!

32
Unclustered B Tree Used for Sorting
  • Alternative (2) for data entries each data entry
    contains rid of a data record. In general, one
    I/O per data record!

Index
(Directs search)
Data Entries
("Sequence set")
Data Records
33
Bucket Sort
  • Suppose search key values are 0-K
  • B pages in memory, N pages in the file
  • Pass 0 Partition the file into B-1 intervals
  • Inervals are not runs!
  • If the interval fits in one page, sort it

0,K/(B-1))
. . .
K/(B-1),2K/(B-1))
OUTPUT 2
INPUT
. . .
OUTPUT B-1
(B-2)K/(B-1),K)
B Main memory buffers
Disk
34
Bucket sort cost
  • What happens after pass 0
  • ? intervals, each ? long
  • After pass 1?
  • ? intervals, each ? long
  • How much memory is required to sort in two
    passes?
  • Each interval is at most one page, or ?
  • Same as for external merge sort

35
External Merge Sort vs External Bucket Sort
  • Approximately the same I/O cost
  • Same memory requirement for two passes
  • Same number of passes required to sort
  • Bucket sort has less CPU cost
  • Bucketizing is much cheaper than sorting/merging
  • But bucket sort is subject to skew
  • Thus merge sort is used in practice
Write a Comment
User Comments (0)
About PowerShow.com