Title: A Survey of External Memory Algorithms and Data Structures
1. A Survey of External Memory Algorithms and Data Structures
- Presented by Reynold Cheng
- Feb 10, 2003
2. Internal vs. External Memory
- Internal memory (RAM) is not sufficient to store the data sets of large applications
- External memory (e.g., disks) is used to store the data for the algorithm
- The I/O between fast internal memory and slow external memory is a performance bottleneck.
3. Virtual Memory and I/O Performance
- Virtual memory in the OS
- Provides one uniform address space
- Principle of locality
- Caching and prefetching improve performance
- Designed to be general-purpose
- Doesn't help if computations are inherently nonlocal
- Results in large amounts of I/O and poor performance
4. EM Algorithms and Data Structures
- Incorporate locality directly into algorithm design
- Bypass the virtual memory system
- EM algorithms and data structures explicitly manage data placement and movement within the memory hierarchy
- I/O communication between internal memory (RAM) and external memory (magnetic disk)
5. Two Problem Categories
- Batched problems
- No preprocessing is done
- Process the entire file of data items
- Stream data through internal memory in one or more passes
- Online problems
- Immediate response to queries
- Only a small portion of the data is examined for each query
- Usually organize the data items in a hierarchical index
- Data can be static or dynamic
6. Talk Outline
- Modeling Parallel Disks
- Design Goal of EM Algorithms
- Striping and Load Balancing
- Batched Problems
- Algorithms for External Sorting
- Online Problems
- Hash tables and B-trees
- Conclusions
7. Parallel Disk Model (PDM)
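The model's parameters, following [JSV2001] (lowercase letters denote the same quantities measured in disk blocks):

```latex
\begin{align*}
N &= \text{problem size, in items} & n &= N/B \\
M &= \text{internal memory size, in items} & m &= M/B \\
Z &= \text{query answer size, in items} & z &= Z/B \\
B &= \text{block size, in items per block} & D &= \text{number of independent disks}
\end{align*}
```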
8. Disk Block Size B
- Rotational latency and seek time are long
- Improve average access time by transferring data in blocks
- Track size: 50-200 KB
- Batched applications: B a little larger than the track size (about 100 KB)
- Online applications (pointer-based indexes): B about 8 KB
- Further improve performance with parallel disks
9. Data Placement and Disk Striping
- We assume the input data are striped across the D disks
- Example: D = 5, B = 2
- Items 12 and 13 are stored in the 2nd block (stripe 1) of disk D1
- N items are read/written in O(N/DB) = O(n/D) I/Os, which is optimal
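As a sketch, the mapping from an item's index to its physical location under striping, assuming 0-based item numbering and round-robin block placement (the function name is ours):

```python
def locate(item: int, D: int, B: int) -> tuple[int, int, int]:
    """Return (disk, stripe, offset) of an item under disk striping."""
    block = item // B     # global block number
    stripe = block // D   # stripe: one logical row across all D disks
    disk = block % D      # disk holding this block within the stripe
    offset = item % B     # position of the item inside the block
    return disk, stripe, offset

# With D = 5 disks and block size B = 2, items 12 and 13 both land in
# stripe 1 on disk 1, matching the slide's example.
print(locate(12, D=5, B=2))
print(locate(13, D=5, B=2))
```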
10. Performance Measures in PDM
- Number of I/O operations
- Disk space used (utilization)
- Ideally, data structures should use linear space, i.e., O(N/B) = O(n) disk blocks
- Internal (sequential/parallel) computation time
- We focus only on the first two measures
- Most of the algorithms described run in optimal CPU time
11. Fundamental I/O Bounds (1)
- I/O performance is often expressed in terms of the bounds for:
- Scan(N): sequentially reading/writing N items
- Sort(N): sorting N items
- Search(N): searching among N sorted items
- Output(Z): outputting Z answers to a query in a blocked, output-sensitive fashion
12. Fundamental I/O Bounds (2)
- Scan(N) and Sort(N) apply to batched problems
- Search(N) and Output(Z) apply to online problems
- Typically combined in the form Search(N) + Output(Z)
13. Optimal I/O Bounds
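The bounds themselves, as given in Vitter's survey [JSV2001], with n = N/B, m = M/B, z = Z/B:

```latex
\begin{align*}
\mathrm{Scan}(N)   &= \Theta\!\left(\frac{n}{D}\right) \\
\mathrm{Sort}(N)   &= \Theta\!\left(\frac{n}{D}\log_m n\right) \\
\mathrm{Search}(N) &= \Theta\!\left(\log_{DB} N\right) \\
\mathrm{Output}(Z) &= \Theta\!\left(\max\!\left\{1,\ \frac{z}{D}\right\}\right)
\end{align*}
```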
14. Comments on I/O Bounds (1)
- Scan(N) = Θ(n/D), a linear number of I/Os
- Nontrivial batched problems like permuting, matrix transposition, and list ranking need a nonlinear number of I/Os, even though they can be solved in linear time in internal memory
15. Comments on I/O Bounds (2)
- Multiple-disk bounds for Scan(N), Sort(N), and Output(Z) are D times smaller than the single-disk bounds
- For Search(N), the speedup is less significant:
- Θ(log_B N) for D = 1 becomes Θ(log_{DB} N) for D > 1
- Speedup = Θ((log_B N) / (log_{DB} N)) = Θ((log DB) / (log B)) = Θ(1 + (log D) / (log B)) < 2
16. Locality and I/O Performance
- Key to efficient I/O: access data with a high degree of locality
- Single disk: each I/O transfers a block of B items; optimal when all B items are needed
- Multiple disks: each I/O transfers D blocks (a stripe); optimal when all DB items are needed
17. Single-Disk Locality
- Many batched problems like sorting require O(N log N) CPU time
- If the data set doesn't fit into RAM, relying on virtual memory may need O(N log n) I/Os!
- Goal: incorporate locality into the algorithm design to achieve O(n log_m n) I/Os
18. Design Goal of EM Algorithms
- For batched problems:
- The N and Z terms in the I/O bounds of the naive algorithms are replaced by n and z
- The base of the logarithm terms goes from 2 to m
- For online problems:
- The base of the logarithm terms goes from 2 to B
- Z becomes z
- The speedup is significant; e.g., for batched problems, (N log n) / (n log_m n) = B log m
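The last claim follows from N = nB and log_m n = (log n)/(log m):

```latex
\[
\frac{N\log n}{n\log_m n}
= \frac{nB\log n}{n\log_m n}
= B\cdot\frac{\log n}{(\log n)/(\log m)}
= B\log m .
\]
```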
19. Multiple-Disk Locality
- Disk striping: I/Os are permitted on entire stripes, one stripe at a time
20. Single-Disk to Multiple-Disk (1)
- Net effect of disk striping: the system behaves as a single disk with logical block size DB
- Apply the disk-striping paradigm to automatically convert:
- an algorithm for a single disk with block size DB
- into an algorithm for D disks with block size B
21. Single-Disk to Multiple-Disk (2)
- Example: a single-disk algorithm for searching requires Θ(log_B N) I/Os
- Using striping, we obtain a multiple-disk algorithm by substituting DB for B: Θ(log_{DB} N) I/Os
- Disk striping is very powerful: it yields optimal multiple-disk algorithms from optimal single-disk algorithms for streaming, online search, and output reporting
22. Disk Striping and Sorting
- Disk striping is not optimal for sorting!
- I/O bound with disk striping: Θ((n/D) log_{m/D} n), since the logical memory size in logical blocks shrinks to m/D
- Optimal I/O bound: Θ((n/D) log_m n)
- Striping exceeds the optimal bound by a factor of Θ((log m) / log(m/D))
23. Sorting with Multiple Disks
- To attain the optimal sorting bound, we need to forsake disk striping
- Control the disks independently
- Average/worst-case cost
- Algorithms are based on the distribution and merge paradigms
- Online load balancing: distribute the data evenly over the D disks for access
24. Distribution Sort with D Disks
(Figure: S = 7 buckets formed using 6 partitioning elements)
- Use S-1 partitioning elements to partition the items into S disjoint buckets
- Items in bucket i are smaller than items in bucket j for i < j
- Sort the buckets recursively
- Concatenate the sorted buckets
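The four steps above can be sketched in memory as follows. The pivot choice by sorting the whole input is purely illustrative (a real EM algorithm samples, as the next slide describes), and S and the recursion cutoff are assumptions of ours:

```python
import bisect

def distribution_sort(items, S=7, cutoff=8):
    if len(items) <= cutoff:                 # small bucket: sort directly
        return sorted(items)
    # Choose S-1 partitioning elements (here: evenly spaced sample values).
    sample = sorted(items)
    step = max(1, len(items) // (S - 1))
    pivots = sample[::step][1:S]
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in items:                          # bucket i holds smaller items than bucket i+1
        buckets[bisect.bisect_left(pivots, x)].append(x)
    if max(len(b) for b in buckets) == len(items):
        return sorted(items)                 # degenerate split (e.g. all items equal)
    out = []
    for b in buckets:                        # sort recursively, then concatenate
        out.extend(distribution_sort(b, S, cutoff))
    return out
```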
25. Partitioning Elements
- Choose the S-1 partitioning elements so that the bucket sizes are roughly equal, using Θ(n/D) I/Os
- Bucket sizes decrease by a factor of Θ(S), giving O(log_S n) levels of recursion
- The maximum S is Θ(m), so the minimum number of recursion levels is O(log_m n)
- Deterministic methods exist for choosing S ≈ m partitioning elements
26. I/O Bound for Distribution Sort
- If each level of recursion uses Θ(n/D) I/Os, the total number of I/Os is O((n/D) log_m n)
- Each set of items being partitioned is itself a bucket formed at the previous level of recursion
- The blocks of a bucket should be spread evenly across the disks for the next read, so that all D disks are involved when reading from a bucket
27. Randomized Distribution Sort [VS94]
- Goal: with high probability, each bucket is well balanced across the D disks
- If N is so large that the number of blocks per bucket, Θ(n/S), is Ω(D log D), then write the D blocks in independent random order to a disk stripe
28. Classical Occupancy Problem
- b balls are inserted independently and uniformly at random into d bins
- If b/d grows faster than log d, the largest bin contains about b/d balls on average, i.e., the distribution is even
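A quick simulation illustrates the claim, under the assumption of independent uniform throws (function name and parameters are ours):

```python
import random

def max_bin_load(b: int, d: int, seed: int = 42) -> float:
    """Throw b balls into d bins; return max load divided by average load b/d."""
    rng = random.Random(seed)
    bins = [0] * d
    for _ in range(b):
        bins[rng.randrange(d)] += 1
    return max(bins) / (b / d)

# With b = 200_000 balls and d = 10 bins (so b/d far exceeds log d),
# the fullest bin is within a few percent of the average.
print(max_bin_load(200_000, 10))
```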
29. Distribution Sort for the Case Θ(n/S) = O(D log D)
- The previous technique breaks down when Θ(n/S) = O(D log D)
- A typical memoryload contains S log S blocks and is well balanced among the S buckets
- 1st pass: read the file in one pass, one memoryload at a time; permute it and write it to disk
- 2nd pass: extract a part from each of several memoryloads to form a typical memoryload
- Output each memoryload via a round-robin placement of the S buckets on the D disks
30. Merge Sort with D Disks
- Orthogonal to the distribution paradigm
- Run formation: scan the n blocks of data, one memoryload (m blocks) at a time; sort each load and output it on stripes
- This yields N/M = n/m sorted runs
- Merging: scan and merge groups of R = Θ(m) runs
- Number of passes = log_R(n/m) ≈ log_m n - 1
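The run-formation and merging steps above can be sketched in memory as follows; run_len plays the role of M items and R is the merge arity (both names are ours, and a real EM implementation would stripe the runs across the D disks):

```python
import heapq

def external_merge_sort(items, run_len=8, R=4):
    # Run formation: sort one "memoryload" of run_len items at a time.
    runs = [sorted(items[i:i + run_len]) for i in range(0, len(items), run_len)]
    # Merge passes: R-way merge groups of runs until a single run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + R])) for i in range(0, len(runs), R)]
    return runs[0] if runs else []
```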
31. I/O Bound for Merge Sort
- If each pass uses Θ(n/D) I/Os, the total number of I/Os is O((n/D) log_m n)
- Must bring in the next Θ(D) blocks on average at each merge step
- Ensure that the blocks needed for merging reside on different disks
32. Simple Randomized Mergesort [BV99]
- Each run is striped starting at a randomly chosen disk
- At any time, the disk containing the leading block of any run is uniformly random
(Figure: D = 4 disks, R = 8 runs)
33. Cyclic Occupancy Problem
- Conjecture: the expected maximum bin size is at most that of the classical occupancy problem
34. External Hashing for Online Dictionary Search
- Advantage of hashing: the expected number of probes per operation is constant, independent of N
- Goal: develop dynamic EM structures that adapt smoothly to widely varying values of N
- Directory method (extendible hashing): 2 I/Os per access (directory + data)
- Directory-less method (linear hashing): 1 I/O only, but may require overflow lists
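A minimal single-machine sketch of the directory method: a directory of 2^g pointers indexed by the low g bits of the hash, buckets of capacity B (one disk block), and a split that doubles the directory only when the full bucket is at the global depth. The class and its internals are illustrative assumptions, not the survey's exact scheme:

```python
class ExtendibleHash:
    def __init__(self, B=4):
        self.B = B                         # bucket capacity: one disk block
        self.g = 0                         # global depth: directory has 2^g slots
        self.dir = [{"depth": 0, "items": []}]

    def _idx(self, key):
        return hash(key) & ((1 << self.g) - 1)   # low g bits select the slot

    def insert(self, key):
        b = self.dir[self._idx(key)]
        if len(b["items"]) < self.B:
            b["items"].append(key)
            return
        if b["depth"] == self.g:           # bucket at global depth: double directory
            self.dir = self.dir + self.dir
            self.g += 1
        # Split the full bucket: redistribute its items on bit `depth`.
        b0 = {"depth": b["depth"] + 1, "items": []}
        b1 = {"depth": b["depth"] + 1, "items": []}
        for k in b["items"]:
            ((hash(k) >> b["depth"]) & 1 and b1 or b0)["items"].append(k)
        for i, e in enumerate(self.dir):   # repoint all aliases of the old bucket
            if e is b:
                self.dir[i] = b1 if (i >> b["depth"]) & 1 else b0
        self.insert(key)                   # retry; may trigger further splits

    def lookup(self, key):
        return key in self.dir[self._idx(key)]["items"]
```

A lookup touches the directory once and one bucket once, which is the source of the "2 I/Os" figure above.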
35. Multiway Trees
- Items in the tree are sorted, so the tree can be used for 1D range search
- To find the items in [x, y]: search for x, then do an inorder traversal from x to y
- Arise naturally in online settings where updates and queries are processed immediately
- To exploit block transfers, the balanced multiway B-tree was proposed; it is the most widely used nontrivial EM data structure
36. B-Trees and B+-Trees
- In a B+-tree, all items are stored in the leaves
- Internal nodes store only key values and pointers, giving a higher branching factor
- Leaves are linked together sequentially to facilitate range queries or sequential access
- The B+-tree is the most popular variant of the B-tree
37. B*-Trees
- Postpone splitting by sharing a node's data with one of its adjacent siblings
- Split a node only when its sibling is also full, and evenly redistribute the data among the node and its siblings
- Reduces the number of times new nodes are created
- Higher storage utilization, lower search I/O costs
- Average node utilization rises from 69% to 81%
38. The Buffer Tree [ARGE95]
- Each B-tree insertion takes O(log_B N) I/Os
- Building a B-tree by repeated insertion therefore takes O(N log_B N) I/Os!
- We can take advantage of blocking and obtain O(n log_m n) I/Os instead
- The buffer tree [ARGE95]
39. Main Idea of the Buffer Tree
- Logically group nodes together and add buffers
- Balanced tree of degree Θ(m) rather than Θ(B)
- Each node has a buffer for storing Θ(M) items (Θ(m) blocks)
- Insertions are done lazily: items are inserted into buffers
- When a buffer is full, its items are pushed one level down
40. I/O Cost of Buffer Trees
- Buffer emptying takes O(m) I/Os
- Amortize the cost of distributing M items among the Θ(m) children
- Each item incurs an amortized cost of O(m/M) = O(1/B) I/Os per level
- The resulting cost per update/query is only O((1/B) log_m n) I/Os amortized
- Cost of inserting N = nB items: nB · O((1/B) log_m n) = O(n log_m n) I/Os
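Spelled out, with M = mB items distributed per buffer emptying:

```latex
\[
\frac{O(m)\ \text{I/Os}}{M\ \text{items}}
 = O\!\left(\frac{m}{M}\right) = O\!\left(\frac{1}{B}\right)
 \ \text{I/Os per item per level},
\qquad
N \cdot O\!\left(\frac{1}{B}\log_m n\right)
 = nB \cdot O\!\left(\frac{1}{B}\log_m n\right)
 = O\!\left(n \log_m n\right).
\]
```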
41. Conclusions (1)
- Internal-memory algorithms are not readily adapted to external memory
- EM algorithms are developed to handle I/O communication more efficiently
- The most fundamental I/O bounds are scanning, sorting, searching, and outputting
- The goal of EM algorithms is to achieve these optimal bounds
42. Conclusions (2)
- Striping is optimal for scanning, searching, and outputting, but not for sorting
- Randomized distribution sort and merge sort can achieve the optimal I/O bound
- Hash tables are best for online dictionary search
- B-trees are best for 1D online range search
- Buffer trees improve the speed of B-tree construction
43. Other Interesting Topics
- Handling duplicates during sorting
- Permutation, Fast Fourier Transform (FFT), computational geometry, and graphs
- Multi-dimensional data, R-trees, and range queries
- Dynamic data structures, moving objects, strings
- EM algorithm development tools (e.g., TPIE)
44. References
- [JSV2001] J. S. Vitter. External memory algorithms and data structures: Dealing with MASSIVE DATA. ACM Computing Surveys 33(2), June 2001, 209-271.
- [JSV1998] J. S. Vitter. External memory algorithms. Invited tutorial in Proceedings of the 17th Annual ACM Symposium on Principles of Database Systems (PODS '98), Seattle, WA, June 1998.
- [VS94] J. S. Vitter and E. A. M. Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica 12(2-3), 1994, 110-147.
45. References (continued)
- [BV99] R. D. Barve and J. S. Vitter. A simple and efficient parallel disk mergesort. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures (St. Malo, France, June 1999), 232-241.
- [BY96] R. Baeza-Yates. Bounded disorder: The effect of the index. Theoretical Computer Science 168, 1996, 21-38.
- [ARGE95] L. Arge. The buffer tree: A new technique for optimal I/O-algorithms. In Proceedings of the Workshop on Algorithms and Data Structures, Lecture Notes in Computer Science Vol. 955, Springer-Verlag, 1995, 334-345.