Title: Selected Problems in EM Algorithms
1. Selected Problems in EM Algorithms
- By Reynold Cheng
- Feb 24, 2003.
2. Two Problem Categories in EM Algorithms
- Batched Problems
  - Process an entire file of data items
  - Stream data through internal memory in one or more passes
  - Examples: sorting, computational geometry, graphs
- Online Problems
  - Immediate response to queries
  - Only a small portion of the data is examined for each query
  - Data can be static or dynamic
  - Examples: dictionary operations, range queries
3. Parallel Disk Model (PDM)
4. Parallel Disk Algorithms
- For most problems, parallel disks can be utilized effectively by
  - Striping: round-robin placement on disks
  - Distribution and merge paradigm
- In practice, disk striping is sufficient
- We now restrict ourselves to the single-disk case
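The round-robin placement behind striping can be sketched as follows. This is a minimal illustration, not a disk driver: the function name `stripe` and the block and disk counts are hypothetical, and logical block i is simply assigned to disk i mod D.

```python
def stripe(num_blocks, num_disks):
    """Round-robin placement: logical block i goes to disk i mod num_disks."""
    disks = [[] for _ in range(num_disks)]
    for block in range(num_blocks):
        disks[block % num_disks].append(block)
    return disks

# Hypothetical example: 10 logical blocks striped over 4 disks.
layout = stripe(10, 4)
for disk_id, blocks in enumerate(layout):
    print(f"disk {disk_id}: blocks {blocks}")
```

With this layout, any D consecutive logical blocks sit on D different disks, so they can be fetched in one parallel I/O step.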
5. Talk Outline
- Online Range Queries
- Buffer Trees and Bulk Loading
- String Processing
- Dynamic Memory Allocation
6. Goals of Online Range Search
- Using B-trees for 1-D range search, we achieve:
  - (1) Combined search and output cost for queries of O(log_B N + z) I/Os
  - (2) Use only a linear amount (O(n) blocks) of disk space, and
  - (3) Support dynamic updates in O(log_B n) I/Os
- Can we achieve the same performance for general range search?
7. Categories of Spatial Data Structures
- Space-Driven
  - Partitioning of the embedding space
  - Quad-trees and grid files
- Data-Driven
  - Partitioning of the data items
  - R-trees and kd-trees
- We study online range search for data-driven data structures
8. Lower Bound for General 2-D Orthogonal Queries
- It was proved that criteria (1) and (2) cannot be satisfied simultaneously
- Lower bound
  - Ω(n (log n) / log(log_B N + 1)) blocks of disk space are needed to achieve
  - O((log_B N)^c + z) I/Os per query, for any constant c
- Best algorithm achieves
  - O(n (log N) / log log_B n) space
  - O(log_B n + z) I/Os per query
9. Range Search Using a Linear Amount of Disk Space
- None of the linear-space data structures listed below comes close to satisfying Criteria 1 and 3 in the worst case
- In typical-case scenarios they perform well
- Examples: R-trees, kd-B-trees, cross trees, grid files, and linearization
10. Cross Trees
- Data-driven partitioning (weight-balanced B-trees) at the upper levels of the tree
- Space-driven partitioning (quad trees) at the lower levels
- d-dimensional range queries are done in O(n^(1-1/d) + z) I/Os
- Inserts and deletes take O(log_B N) I/Os
11. Grid Files and Linearization
- A grid file is a flattened data structure for storing the cells of a 2-D grid in disk blocks
- Linearization: impose a total ordering on an n-D space with space-filling curves
  - Organize the points into a B-tree
- Worst-case performance is no better than that of cross trees
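Linearization can be illustrated with a small sketch. The Z-order (Morton) curve is used here as one example of a space-filling curve; the slides do not name a specific curve at this point. A sorted Python list stands in for the leaf level of the B-tree, and the function name `z_order` and the sample points are hypothetical.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of x and y: x in even positions, y in odd ones."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Hypothetical 2-D points, ordered by their position along the curve.
points = [(3, 5), (0, 0), (1, 1), (2, 0), (0, 2)]
ordered = sorted(points, key=lambda p: z_order(*p))
print(ordered)
```

Points that are close along the curve are usually close in the plane, but not always, which is one reason the worst-case query cost stays poor.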
12. Special Cases of Range Search
- These are all important cases of general orthogonal 2-D range search
- Are there any data structures that meet Criteria 1-3 for these special cases?
- Yes!
13. Diagonal Corner Query and Stabbing Query
- Stabbing queries are important
  - Facilitate dynamic interval management
  - Indexing 1-D constraints in constraint DBs
  - Arise in graphics and GIS problems
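The link between the two query types in the slide title can be sketched as follows. Each interval [lo, hi] is mapped to the 2-D point (lo, hi); a stabbing query at q then becomes a two-sided query with its corner at (q, q) on the diagonal. The linear scan below only demonstrates the mapping; it is not the I/O-efficient structure, and the function names and data are hypothetical.

```python
def to_point(interval):
    """Map an interval [lo, hi] to the point (lo, hi), on or above the diagonal."""
    lo, hi = interval
    return (lo, hi)

def stabbing_query(points, q):
    """Diagonal corner query with corner (q, q): report points with x <= q <= y."""
    return [(x, y) for (x, y) in points if x <= q <= y]

# Hypothetical intervals; which of them contain the value 3?
intervals = [(1, 4), (2, 3), (5, 8), (3, 7)]
points = [to_point(iv) for iv in intervals]
hits = stabbing_query(points, 3)
print(hits)
```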
14. Range Searching Results
[Table: I/Os by query type]
15. Batched Dynamic Operations
- For batched problems in internal memory, we often use a dynamic data structure to process a sequence of updates
- To sort N items, we perform N insertions into a priority queue, followed by N deleteMin operations
- Known as batched dynamic operations, where queries do not need to be answered right away or in any particular order
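The sorting example above, sketched in internal memory with Python's standard `heapq` module as the priority queue. The function name `pq_sort` is hypothetical.

```python
import heapq

def pq_sort(items):
    """Sort by N insertions into a priority queue followed by N deleteMin calls."""
    heap = []
    for item in items:
        heapq.heappush(heap, item)          # N insertions
    # N deleteMin operations return the items in ascending order.
    return [heapq.heappop(heap) for _ in range(len(heap))]

print(pq_sort([5, 2, 9, 1, 5]))
```

None of the insertions needs an answer before the first deleteMin, which is what makes the sequence "batched".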
16. The Buffer Tree [ARGE95]
- In external memory, if we use a B-tree as the priority queue, the resulting I/O cost is not optimal
  - Each update and query takes O(log_B N) I/Os
  - Building a B-tree takes N insertions: O(N log_B N) I/Os!
  - O(N log_B N) >> Sort(N) = O(n log_m n) I/Os
- We can take advantage of blocking and obtain O(n log_m n) I/Os in building a B-tree
  - Buffer tree [ARGE95]
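To get a feel for the gap between the two bounds, the sketch below plugs sample values into N log_B N and n log_m n, ignoring constant factors. It assumes the usual PDM notation n = N/B and m = M/B; the values of N, M, and B are hypothetical and measured in items.

```python
import math

def btree_build_ios(N, B):
    """N one-at-a-time insertions at about log_B N I/Os each."""
    return N * math.log(N, B)

def sort_ios(N, M, B):
    """Sorting bound n log_m n, with n = N/B and m = M/B."""
    n, m = N / B, M / B
    return n * math.log(n, m)

# Hypothetical parameters: 10^9 items, 10^6 items of memory, 10^3 items per block.
N, M, B = 10**9, 10**6, 10**3
ratio = btree_build_ios(N, B) / sort_ios(N, M, B)
print(f"repeated insertion costs roughly {ratio:.0f}x the sorting bound")
```

For these values the ratio is on the order of a thousand, that is, roughly a factor of B.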
17. Main Idea of Buffer Tree
- Logically group nodes together and add buffers
- Balanced tree: degree Θ(m) rather than Θ(B)
- Each node has a buffer for storing Θ(M) items (Θ(m) blocks)
- Insertions are done lazily: items are inserted into buffers
- When a buffer is full, its items are pushed one level down
18. I/O Cost of Buffer Trees
- Buffer emptying takes O(m) I/Os
- Amortize the cost of distributing M items to Θ(m) children
- Each item incurs an amortized cost of O(m/M) = O(1/B) I/Os per level
- The resulting cost for an update/query is only O((1/B) log_m n) I/Os amortized
- Cost of inserting N = nB items
  - nB · O((1/B) log_m n) = O(n log_m n) I/Os
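The amortization above can be written out step by step. This is a restatement of the bullets under the assumptions m = M/B and n = N/B, plus the assumption that a balanced tree of degree Θ(m) over n blocks has height O(log_m n).

```latex
\begin{align*}
\text{cost per item per level}
  &= \frac{O(m)\ \text{I/Os per buffer emptying}}{\Theta(M)\ \text{items moved}}
   = O\!\left(\frac{m}{M}\right)
   = O\!\left(\frac{1}{B}\right), \\
\text{cost per item over all levels}
  &= O\!\left(\frac{1}{B}\right) \cdot O(\log_m n)
   = O\!\left(\frac{1}{B}\log_m n\right), \\
\text{cost of inserting } N = nB \text{ items}
  &= nB \cdot O\!\left(\frac{1}{B}\log_m n\right)
   = O(n \log_m n).
\end{align*}
```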
19. Applications of Buffer Trees
- Efficient DeleteMin operations
- External Heap
- String sorting
- Improved graph algorithms (list ranking)
- Bulk operations on R-trees
20. Bulk Loading of R-Tree
- Repeated (dynamic) insertion into an R-tree is costly (N log_B N I/Os)
- Bulk loading: build the R-tree in bottom-up fashion by grouping items into leaf nodes
- Apply a space-filling curve (e.g., Hilbert curve) to presort the objects
- Also known as static insertion
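A minimal in-memory sketch of the presort-and-pack step follows. It builds only the leaf level for point data; the upper levels would be formed by grouping the leaves the same way. `hilbert_index` is the standard iterative conversion from grid coordinates to a position along the Hilbert curve on an n x n grid (n a power of two). The function names, the grid size, and the leaf capacity are hypothetical choices for illustration.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate/flip the quadrant
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def bulk_load_leaves(points, n, capacity):
    """Presort points along the curve, then pack them into full leaf nodes."""
    ordered = sorted(points, key=lambda p: hilbert_index(n, p[0], p[1]))
    leaves = []
    for i in range(0, len(ordered), capacity):
        group = ordered[i:i + capacity]
        xs = [p[0] for p in group]
        ys = [p[1] for p in group]
        # Minimum bounding rectangle: (xmin, ymin, xmax, ymax).
        leaves.append({"mbr": (min(xs), min(ys), max(xs), max(ys)),
                       "items": group})
    return leaves

# Hypothetical data: every cell of a 4 x 4 grid, 4 points per leaf.
grid = [(x, y) for x in range(4) for y in range(4)]
leaves = bulk_load_leaves(grid, 4, 4)
for leaf in leaves:
    print(leaf["mbr"], leaf["items"])
```

Every leaf except possibly the last is completely full, which is where the high storage utilization of static insertion comes from.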
21. Dynamic vs. Static Insertion
- Dynamic insertion
  - Explicitly reduces coverage, overlap, or perimeter of the bounding boxes
  - Good query performance
- Static insertion
  - Does not consider bounding box information at all
  - Improved storage utilization (up to 100%) compensates for a higher degree of node overlap
  - Poorer query performance
22. Buffer Tree and Bulk Loading
- To get both good query performance and efficient bulk construction, Arge et al. [ARGE99] proposed using buffer trees for fast bulk loading
- Top-down construction in O(n log_m n) I/Os
- Matches the performance of bottom-up methods to within a constant factor
23. Experimental Results: I/O Costs
24. Text Indexing in Internal Memory
- Inverted file
  - Analogous to the index at the back of a book
  - Words of interest in the text are sorted alphabetically
  - Each item in the sorted list has a list of pointers to the occurrences of that word in the text
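A minimal sketch of an inverted file as described above. Word positions (counted in words, not bytes) serve as the "pointers"; the function name and the sample text are hypothetical, and every word is treated as a word of interest.

```python
from collections import defaultdict

def build_inverted_file(text):
    """Map each word to the list of positions where it occurs, sorted by word."""
    index = defaultdict(list)
    for position, word in enumerate(text.lower().split()):
        index[word].append(position)
    # Keep the words in alphabetical order, like a back-of-book index.
    return dict(sorted(index.items()))

text = "the cat sat on the mat"
index = build_inverted_file(text)
for word, positions in index.items():
    print(word, positions)
```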
25. Text Indexing in External Memory
- Hybrid approach: the text is divided into large chunks of one or more blocks
- An inverted file is used to specify the chunks containing each word
- Use a fast sequential method to search within a chunk, such as the Knuth-Morris-Pratt and Boyer-Moore methods
- Basis of the GLIMPSE search tool [MW94]
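The hybrid approach can be sketched as a two-step search: a chunk-level inverted file narrows the search to a few chunks, and a sequential scan finds the exact positions. This is an illustration only and not how GLIMPSE is implemented; Python's built-in `str.find` stands in for Knuth-Morris-Pratt or Boyer-Moore, and the function names and chunks are hypothetical.

```python
def build_chunk_index(chunks):
    """Map each word to the ids of the chunks that contain it."""
    index = {}
    for chunk_id, chunk in enumerate(chunks):
        for word in set(chunk.lower().split()):
            index.setdefault(word, []).append(chunk_id)
    return index

def search(chunks, index, word):
    """Look up candidate chunks, then scan each one sequentially."""
    word = word.lower()
    hits = []
    for chunk_id in index.get(word, []):
        text = chunks[chunk_id].lower()
        start = text.find(word)          # stand-in for KMP / Boyer-Moore
        while start != -1:
            hits.append((chunk_id, start))
            start = text.find(word, start + 1)
    return hits

# Hypothetical text split into three chunks.
chunks = ["the cat sat", "on the mat", "a dog barked"]
chunk_index = build_chunk_index(chunks)
print(search(chunks, chunk_index, "the"))
```

Only the chunks listed in the index are read, so a query for a rare word touches a small part of the text. Note that the scan matches substrings, so it can also report the word inside a longer word within a candidate chunk.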
26. B-Tree and Strings (1)
- In a B-tree, Θ(B) unit-sized keys are stored in each internal node for searching
  - The node fits into one or two blocks
- If keys are variable-sized text strings, the keys can be arbitrarily long
  - Many blocks are needed to store Θ(B) strings per node!
27. B-Tree and Strings (2)
- Can we store pointers to Θ(B) strings?
  - Access to the strings during search needs more than a constant number of I/Os per node
- Redesign the B-tree to handle strings: String B-tree or SB-tree [FG99]
- The SB-tree differs from the B-tree in the way each Θ(B)-way branching node is represented
28. Overview of SB-Tree
- B-tree on a set of pointers to strings (in lexicographical order)
- Pointers to Θ(B) strings are stored in internal nodes
- Every internal node is a variant of the Patricia trie called a blind trie (BT)
29. Internal Node of SB-Tree: Blind Trie
- Achieves B-way branching with only O(B) characters, fitting in O(1) blocks
- Contains characters from all B-1 strings
- Pointers to the strings are stored at the leaves
30. Illustrating Blind Trie
- Left subtrie contains strings whose position 4 (5th character) is a
- Right subtrie contains strings whose position 4 (5th character) is b
- The first 4 characters of all strings in the node's subtrie are bcbc
31. Searching a Blind Trie
32. Fixing the Search Bug by Sequential Scan
- Sequentially compare S and R
- The index where they differ can be found
  - S = bcbabcba
  - R = bcbcbbba
- S is smaller than the entire right subtrie of the root
Possible location of S
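The sequential comparison of S and R can be sketched as a scan for the first mismatching position. The function name `first_mismatch` is hypothetical; the two strings are the ones from the slide.

```python
def first_mismatch(s, r):
    """Return the first index where s and r differ (or the shorter length)."""
    limit = min(len(s), len(r))
    for i in range(limit):
        if s[i] != r[i]:
            return i
    return limit

S = "bcbabcba"
R = "bcbcbbba"
i = first_mismatch(S, R)
# S and R agree on the first i characters; the characters at index i
# decide on which side of R the string S belongs.
print(i, S[i], R[i])
```

Here the strings first differ at index 3, where S has a and R has c, so S sorts before R and before every string that shares the prefix bcbc.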
33. I/O Cost of SB-Tree
- Searching a blind trie requires one I/O to load it
- Additional I/Os are needed for the sequential scan of the string after the leaf is reached
- Each block of the search string examined during the sequential scan need not be read again at lower levels
- Search time for a string of S characters: O(log_B N + S/B) I/Os
- Insertions and deletions are also optimal
34. Substring Searching
- Suffix tree
  - Searches for any substring in time linear in the size of the substring
  - Applies Patricia tries to store the suffixes of a string
  - Suffixes of "hello": hello, ello, llo, lo, o
- Suffix array
  - Compact but static version of the suffix tree
- Both can be built on strings of total length N in optimal O(n log_m n) I/Os
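A small internal-memory sketch of a suffix array and substring search over it, using the "hello" example from the slide. The construction below sorts suffixes naively and is nowhere near the optimal external-memory bound; it only shows what the structure contains. Function names are hypothetical.

```python
def build_suffix_array(text):
    """Starting positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_substring(text, sa, pattern):
    """Binary-search the suffix array for suffixes that start with pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    matches = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        matches.append(sa[lo])
        lo += 1
    return sorted(matches)

text = "hello"
sa = build_suffix_array(text)
print([text[i:] for i in sa])            # suffixes in sorted order
print(find_substring(text, sa, "l"))     # start positions of "l"
```

Every substring of the text is a prefix of some suffix, which is why searching the sorted suffixes answers arbitrary substring queries.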
35. Dynamic Memory Allocation
- In practice, the amount of internal memory allocated to a program may fluctuate
  - Demands by other users/processes
- If the memory allocation is reduced, algorithms that assume a fixed amount of memory must resort to virtual memory
  - Severe performance degradation
- EM algorithms should adapt dynamically to whatever resources are available
- The design and analysis of EM algorithms that adapt gracefully to changing memory allocations are studied in [BV99]
36. Dynamic Memory Allocation Model [BV99]
- An EM algorithm is allocated memory in an allocation sequence σ = m_1, m_2, ..., m_k of allocation phases
- i-th phase: the algorithm owns m_i blocks of memory for 2m_i I/Os
- Dynamically optimal algorithms
  - Adversary chooses an allocation sequence σ_x = m_1, m_2, ..., m_k
  - A solves problem P during σ_x
  - A is dynamically optimal for P iff no other algorithm A' can solve P more than a constant number of times during σ_x
- Barve and Vitter [BV99] give dynamically optimal strategies for sorting, matrix multiplication, and buffer tree operations
37. Conclusions (1)
- Linear-space structures cannot provide O(log_B n + z) I/Os for general 2-D orthogonal range queries in the worst case
- Linear-space structures achieve O(log_B n + z) I/Os for corner, 2-sided, and 3-sided queries
- The buffer tree provides both good bulk-loading and query performance
38. Conclusions (2)
- Inverted files over large file chunks are used in EM text indexing
- The SB-tree enhances B-tree support for variable-length strings, with each internal node being a blind trie
- EM algorithms need to adapt to changing internal memory allocations
39. Other Interesting Topics
- External sorting of strings
- Computational geometry and graph problems (e.g., topological sort, BFS, DFS)
- EM algorithms and memory hierarchy
- Translating theoretical gains into observable improvements in practice
- New storage devices (e.g., disk drives with processing capability) present new challenges
40. References
- [JSV2001] J. S. Vitter. External Memory Algorithms and Data Structures: Dealing with MASSIVE DATA. ACM Computing Surveys, 33(2), June 2001, 209-271.
- [JSV1998] J. S. Vitter. External Memory Algorithms. Invited tutorial in Proceedings of the 17th Annual ACM Symposium on Principles of Database Systems (PODS '98), Seattle, WA, June 1998.
- [BV99] Barve, R. D. and Vitter, J. S. 1999. A theoretical framework for memory-adaptive algorithms. In Proceedings of the IEEE Symposium on Foundations of Computer Science (New York, Oct.), Vol. 40, 273-284.
41. References
- [MW94] Manber, U. and Wu, S. 1994. GLIMPSE: A tool to search through entire file systems. In USENIX Association, Ed., Proceedings of the Winter USENIX Conference (San Francisco, Jan.), 23-32.
- [FG99] Ferragina, P. and Grossi, R. 1996. Fast string searching in secondary storage: Theoretical developments and experimental results. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (Atlanta, June), Vol. 7, 373-382.
- [ARGE95] Arge, L. 1995. The buffer tree: A new technique for optimal I/O-algorithms. In Proceedings of the Workshop on Algorithms and Data Structures, Vol. 955 of Lecture Notes in Computer Science, Springer-Verlag, 334-345.
- [ARGE99] Arge, L., Hinrichs, K. H., Vahrenhold, J., and Vitter, J. S. 1999. Efficient bulk operations on dynamic R-trees. In Workshop on Algorithm Engineering and Experimentation, Vol. 1619 of Lecture Notes in Computer Science (Baltimore, Jan.), Springer-Verlag, 328-348. IEEE Transactions on Knowledge and Data Engineering 1, 2 (June), 248-257.