Selected Problems in EM Algorithms - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Selected Problems in EM Algorithms

By Reynold Cheng
Feb 24, 2003.

2
Two Problem Categories in EM Algorithms

Batched Problems
Process entire file of data items
Stream data through internal memory in one or more
passes
Examples: sorting, computational geometry, graphs
Online Problems
Immediate response to queries
Only a small portion of the data is examined for
each query
Data can be static or dynamic
Examples: dictionary operations, range queries

3
Parallel Disk Model (PDM)
4
Parallel Disk Algorithms

For most problems, parallel disks can be utilized
effectively by
Striping: round-robin placement on disks
The distribution and merge paradigms
In practice, disk striping is sufficient
We now restrict ourselves to the single-disk case

5
Talk Outline

Online Range Queries
Buffer Trees and Bulk Loading
String Processing
Dynamic Memory Allocation

6
Goals of Online Range Search

Using B-trees for 1-D range search, we get
(1) Combined search and output cost for queries of
O(log_B N + z) I/Os
(2) Use of only a linear amount (O(n) blocks) of
disk space, and
(3) Support for dynamic updates in O(log_B N) I/Os
Can we achieve the same performance for general
range search?
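
The O(log_B N + z) bound for the 1-D case comes from one descent to the first relevant leaf plus a scan of consecutive leaves. The following is a minimal in-memory sketch of that access pattern, not a real B-tree: the sorted keys are assumed to be pre-packed into blocks, a binary search over the block minima stands in for the root-to-leaf descent, and the name `range_query` is hypothetical.

```python
import bisect

def range_query(blocks, lo, hi):
    """Report all keys in [lo, hi] and the number of leaf blocks read.

    `blocks` is a list of sorted, non-empty lists of keys (the leaf level).
    """
    firsts = [blk[0] for blk in blocks]           # smallest key of each block
    # Stand-in for the B-tree descent: last block whose first key is <= lo.
    i = max(bisect.bisect_right(firsts, lo) - 1, 0)
    out, blocks_read = [], 0
    # Scan consecutive leaves until a block starts beyond hi.
    while i < len(blocks) and blocks[i][0] <= hi:
        blocks_read += 1
        out.extend(k for k in blocks[i] if lo <= k <= hi)
        i += 1
    return out, blocks_read

# 100 keys packed 10 per block; the query touches only the blocks it overlaps.
blocks = [list(range(s, s + 10)) for s in range(0, 100, 10)]
keys, blocks_read = range_query(blocks, 25, 47)
```

The number of leaf blocks read grows with the output size rather than with N, which is the z term in criterion (1).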

7
Categories of Spatial Data Structures

Space-Driven
Partitioning of the embedding space
Quad-trees and grid files
Data-Driven
Partitioning of the data items
R-trees and kd-trees
We study online range search for data-driven data
structures

8
Lower bound of general 2-D orthogonal queries

It was proved that criteria (1) and (2) cannot be
satisfied simultaneously
Lower bound: to answer queries in
O((log_B N)^c + z) I/Os, for any constant c,
Ω(n (log n) / log(log_B N + 1)) blocks of disk
space are needed
Best algorithm achieves
O(n (log N) / log log_B n) space
O(log_B N + z) I/Os per query

9
Range Search Using Linear Amount of Disk Space

None of these linear-space data structures comes
close to satisfying Criteria 1 and 3 in the worst
case
In typical-case scenarios they perform well
Examples: R-trees, kd-B-trees, cross trees, grid
files, and linearization

10
Cross Trees

Data-driven partitioning (weight-balanced
B-trees) at upper levels of the tree
Space-driven partitioning (quad trees) at lower
levels
d-dimensional range queries done in
O(n^(1-1/d) + z) I/Os
Inserts and deletes take O(log_B N) I/Os

11
Grid Files and Linearization

A grid file is a flattened data structure for
storing the cells of a 2-D grid in disk blocks
Linearization: impose a total ordering on an n-D
space with a space-filling curve, then
organize the points into a B-tree
Worst-case performance is no better than cross
trees
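
As an illustration of linearization, here is a minimal sketch that orders 2-D points along a Z-order (Morton) curve by interleaving coordinate bits. The Z-order curve is chosen here only because it is short to write down; the slides do not say which space-filling curve is meant, and the function name `interleave_bits` is hypothetical. Coordinates are assumed to be non-negative integers below 2^bits.

```python
def interleave_bits(x, y, bits=16):
    """Z-order key: bit i of x goes to position 2i, bit i of y to position 2i+1."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

points = [(3, 5), (0, 0), (1, 1), (2, 0), (0, 1)]
# Sorting by the 1-D key gives the total order that a B-tree would index.
ordered = sorted(points, key=lambda p: interleave_bits(*p))
```

Once every point has a one-dimensional key, an ordinary B-tree on those keys serves as the index; a range query then has to be translated into one or more key intervals, which is where the poor worst case comes from.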

12
Special Cases of Range Search

These are all important cases of the general
orthogonal 2D range search
Are there any data structures that meet Criteria
1-3 for these special cases?
Yes!

13
Diagonal Corner Query and Stabbing Query

Stabbing queries are important
Facilitate dynamic interval management
Indexing 1-D constraints in constraint DB
Arise in graphics and GIS problems
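
The link between the two query types can be sketched as follows: an interval [lo, hi] is treated as the 2-D point (lo, hi), and a stabbing query at q becomes the two-sided query with its corner at (q, q) on the diagonal, asking for points with x <= q and y >= q. The code below is a brute-force illustration of that mapping only, with hypothetical function names; it is not an I/O-efficient index.

```python
def stab_bruteforce(intervals, q):
    """All intervals [lo, hi] that contain the query value q."""
    return sorted((lo, hi) for lo, hi in intervals if lo <= q <= hi)

def stab_via_corner_query(intervals, q):
    """Same answer, phrased as a diagonal corner query on points (lo, hi)."""
    points = [(lo, hi) for lo, hi in intervals]
    # Corner at (q, q): report points to the left of and above the corner.
    return sorted((x, y) for x, y in points if x <= q and y >= q)

intervals = [(1, 4), (2, 9), (5, 6), (7, 8)]
assert stab_bruteforce(intervals, 5) == stab_via_corner_query(intervals, 5)
```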

14
Range Searching Results
(Table: I/Os by query type)
15
Batched Dynamic Operations

For batched problems in internal memory, we often
use a dynamic data structure to process a
sequence of updates
To sort N items, we perform N insertions to a
priority queue, followed by N deleteMin
operations.
Known as batched dynamic operations, where
queries do not need to be answered right away or
in any particular order
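
The sorting example can be sketched in internal memory with Python's standard `heapq` module playing the role of the priority queue. The function name `pq_sort` is hypothetical.

```python
import heapq

def pq_sort(items):
    """Sort by N insertions into a priority queue followed by N deleteMin operations."""
    heap = []
    for x in items:
        heapq.heappush(heap, x)          # N insertions
    # N deleteMin operations return the items in increasing order.
    return [heapq.heappop(heap) for _ in range(len(heap))]

result = pq_sort([5, 1, 4, 2, 3])
```

No individual operation needs its answer immediately; only the final sequence matters, which is what makes the batch amenable to lazy, buffered processing.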

16
The Buffer Tree [ARGE95]

In external memory, if we use a B-tree as the
priority queue, the resulting I/O cost is not
optimal
Each update and query takes O(log_B N) I/Os
Building a B-tree takes N insertions, i.e.,
O(N log_B N) I/Os!
This is much more than Sort(N) = Θ(n log_m n) I/Os
We can take advantage of blocking and obtain
O(n log_m n) I/Os in building a B-tree
Buffer tree [ARGE95]
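
To get a feel for the gap between N log_B N and n log_m n, the sketch below evaluates both expressions with all constant factors dropped, assuming the usual notation n = N/B and m = M/B. The parameter values are illustrative assumptions, not figures from the talk, and the function names are hypothetical.

```python
import math

def btree_build_ios(N, B):
    """N repeated insertions at about log_B N I/Os each (constants ignored)."""
    return N * math.log(N, B)

def sort_ios(N, M, B):
    """Sorting bound n log_m n with n = N/B and m = M/B (constants ignored)."""
    n, m = N / B, M / B
    return n * math.log(n, m)

# Assumed sizes: 10^9 items, memory for 10^6 items, 10^3 items per block.
N, M, B = 10**9, 10**6, 10**3
ratio = btree_build_ios(N, B) / sort_ios(N, M, B)
```

With these assumed sizes the ratio is on the order of a thousand, roughly the factor B that blocking is supposed to win back.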

17
Main Idea of Buffer Tree

Logically group nodes together and add buffers
Balanced tree: degree Θ(m) rather than Θ(B)
Each node has a buffer for storing Θ(M) items
(Θ(m) blocks)
Insertions are done lazily: items are inserted
into buffers
When a buffer is full, its items are pushed one
level down
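
The lazy-insertion idea can be sketched with a toy in-memory model. This is a heavy simplification under stated assumptions: the tree shape is fixed in advance (no rebalancing or splitting), `capacity` stands in for the Θ(M) buffer size, leaves simply accumulate items, and the class names `Node` and `ToyBufferTree` are hypothetical. It shows only how items travel downward in batches.

```python
import bisect

class Node:
    def __init__(self, splits=(), children=()):
        self.splits = list(splits)       # routing keys (internal nodes only)
        self.children = list(children)   # an empty list marks a leaf
        self.buffer = []                 # pending items; for a leaf, its stored items

class ToyBufferTree:
    def __init__(self, root, capacity):
        self.root = root
        self.capacity = capacity         # buffer size, standing in for Theta(M) items
        self.emptyings = 0               # number of batch push-down steps performed

    def insert(self, key):
        # Lazy insertion: the item only goes into the root buffer.
        self.root.buffer.append(key)
        if len(self.root.buffer) >= self.capacity:
            self._empty(self.root)

    def _push_down(self, node):
        # Distribute the whole buffer one level down in a single batch.
        for key in node.buffer:
            child = node.children[bisect.bisect_right(node.splits, key)]
            child.buffer.append(key)
        node.buffer = []
        self.emptyings += 1

    def _empty(self, node):
        self._push_down(node)
        # A full buffer in an internal child is emptied in turn.
        for child in node.children:
            if child.children and len(child.buffer) >= self.capacity:
                self._empty(child)

    def flush(self, node=None):
        """Force all buffered items to the leaves; return them in key order."""
        node = self.root if node is None else node
        if not node.children:
            return sorted(node.buffer)
        self._push_down(node)
        out = []
        for child in node.children:
            out.extend(self.flush(child))
        return out

# Two internal levels over four leaves, with routing keys 10, 20, 30.
leaves = [Node() for _ in range(4)]
root = Node([20], [Node([10], leaves[:2]), Node([30], leaves[2:])])
tree = ToyBufferTree(root, capacity=4)
for k in [25, 3, 17, 31, 8, 12, 40, 22, 5]:
    tree.insert(k)
result = tree.flush()
```

Each item is touched once per level and always as part of a batch, which is the source of the amortized per-level cost.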

18
I/O Cost of Buffer Trees

Buffer-emptying in O(m) I/Os
Amortize the cost of distributing M items to Θ(m)
children
Each item incurs an amortized cost of
O(m/M) = O(1/B) I/Os per level
The resulting cost per update/query is only
O((1/B) log_m n) I/Os amortized
Cost of inserting N = nB items:
nB · O((1/B) log_m n) = O(n log_m n) I/Os

19
Applications of Buffer Trees

Efficient DeleteMin operations
External Heap
String sorting
Improved graph algorithms (list ranking)
Bulk operations on R-trees

20
Bulk Loading of R-Tree

Repeated (dynamic) insertion into an R-tree is
costly (O(N log_B N) I/Os)
Bulk loading: build the R-tree in bottom-up
fashion by grouping items into leaf nodes
Apply a space-filling curve (e.g., the Hilbert
curve) to presort the objects
Also known as static insertion
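
A minimal sketch of bottom-up bulk loading, under several assumptions: rectangles are tuples (x0, y0, x1, y1) with non-negative integer coordinates, the input is non-empty, and the presort uses a Z-order key on rectangle centers as a simpler stand-in for the Hilbert curve mentioned above. The names `morton_key`, `mbr`, and `bulk_load` are hypothetical.

```python
def morton_key(x, y, bits=16):
    """Z-order key obtained by interleaving the bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

def mbr(rects):
    """Minimum bounding rectangle of a list of (x0, y0, x1, y1) tuples."""
    x0, y0, x1, y1 = zip(*rects)
    return (min(x0), min(y0), max(x1), max(y1))

def bulk_load(rects, fanout):
    """Build the tree level by level; each node is a pair (box, children)."""
    def center_key(r):
        return morton_key((r[0] + r[2]) // 2, (r[1] + r[3]) // 2)
    # Presort along the curve, then pack `fanout` entries per node.
    level = [(r, None) for r in sorted(rects, key=center_key)]
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [(mbr([box for box, _ in g]), g) for g in groups]
    return level[0]

rects = [(0, 0, 1, 1), (2, 2, 3, 3), (0, 2, 1, 3), (2, 0, 3, 1)]
root_box, children = bulk_load(rects, fanout=2)
```

Every node except possibly the last one on each level is packed full, and the bounding boxes are computed only after grouping, so nothing in the packing step looks at overlap or coverage.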

21
Dynamic vs. Static Insertion

Dynamic insertion
Explicitly reduce coverage, overlap, or perimeter
of the bounding boxes
Good query performance
Static insertion
Does not consider bounding box information at all
Improved storage utilization (up to 100%)
compensates for a higher degree of node overlap
Poorer query performance

22
Buffer Tree and Bulk Loading

To get good query performance and bulk
construction efficiency, Arge et al. [ARGE99]
proposed using buffer trees for fast
bulk loading.
Top-down construction in O(n log_m n) I/Os
Matches the performance of bottom-up methods to
within a constant factor

23
Experimental Results: I/O Costs
24
Text Indexing in Internal Memory

Inverted file
Analogous to the index at the back of a book
Words of interest in the text are sorted
alphabetically
Each item in the sorted list has a list of
pointers to the occurrences of that word in the
text
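
A minimal sketch of an inverted file held in internal memory. It assumes that words are maximal runs of word characters, compared case-insensitively, and that an occurrence is recorded as a character offset; the function name `build_inverted_file` is hypothetical.

```python
import re
from collections import defaultdict

def build_inverted_file(text):
    """Map each word to the sorted list of offsets where it occurs."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        index[match.group()].append(match.start())
    # Keep the words in alphabetical order, like the index of a book.
    return dict(sorted(index.items()))

index = build_inverted_file("the cat and the hat")
```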

25
Text Indexing in External Memory

Hybrid approach: the text is divided into large
chunks (one or more blocks each)
An inverted file is used to specify the chunks
containing each word
Use a fast sequential method to search within a
chunk, such as the Knuth-Morris-Pratt and
Boyer-Moore methods
Basis of the GLIMPSE search tool [MW94]
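
The hybrid scheme can be sketched as a two-step lookup: a coarse index maps each word to the chunks that contain it, and only those chunks are scanned sequentially. In this sketch the chunks are given as a list of strings, Python's built-in `str.find` stands in for a Knuth-Morris-Pratt or Boyer-Moore scan, and the function names are hypothetical. It is not how GLIMPSE itself is implemented.

```python
import re

def build_chunk_index(chunks):
    """Map each word to the ids of the chunks that contain it."""
    index = {}
    for cid, chunk in enumerate(chunks):
        for word in set(re.findall(r"\w+", chunk.lower())):
            index.setdefault(word, []).append(cid)
    return index

def search(chunks, index, word):
    """Return (chunk id, offset) pairs, scanning only the candidate chunks."""
    word = word.lower()
    hits = []
    for cid in index.get(word, []):
        chunk = chunks[cid].lower()
        pos = chunk.find(word)            # sequential scan inside the chunk
        while pos != -1:
            hits.append((cid, pos))
            pos = chunk.find(word, pos + 1)
    return hits

chunks = ["the cat sat", "a dog ran", "the dog sat"]
index = build_chunk_index(chunks)
hits = search(chunks, index, "dog")
```

The index stays small because it records chunks rather than exact positions; the price is the sequential scan, which here also matches the word inside longer words since it is a plain substring search.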

26
B-Tree and Strings (1)

In a B-tree, Θ(B) unit-sized keys are stored in
each internal node for searching
The node fits into one or two blocks
If keys are variable-sized text strings, the keys
can be arbitrarily long
Many blocks are needed to store Θ(B) strings per
node!

27
B-Tree and Strings (2)

Can we store pointers to Θ(B) strings?
Access to strings during search needs more than
a constant number of I/Os per node
Redesign the B-tree to handle strings: the String
B-tree, or SB-tree [FG99]
The SB-tree differs from the B-tree in the way each
Θ(B)-way branching node is represented

28
Overview of SB-Tree

B-tree on a set of pointers to strings
(lexicographical order)
Pointers to Θ(B) strings stored in internal nodes
Every internal node is a variant of the Patricia
trie called a blind trie (BT)

29
Internal Node of SB-Tree: Blind Trie

Achieve B-way branching with only O(B)
characters, fitting in O(1) blocks
Contains characters from all B-1 strings
Pointers to the strings are stored at the leaves

30
Illustrating Blind Trie

Left subtrie contains strings whose position 4
(5th character) is 'a'
Right subtrie contains strings whose position 4
(5th character) is 'b'
The first 4 characters of all strings in the node's
subtrie are "bcbc"

31
Searching a Blind Trie

S = bcbabcba
R = bcbcbbba

32
Fixing the Search Bug by Sequential Scan

Sequentially compare S and R
The index where they differ can be found
S = bcbabcba
R = bcbcbbba
S is smaller than the entire right subtrie of the
root

Possible location of S
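
The sequential comparison can be sketched directly on the two strings from the slides. A blind search that branches on position 4 follows S into the subtrie for 'b' without noticing that S does not share the prefix "bcbc"; comparing S with the string R found at the leaf reveals the first position where they really differ. The function name `first_mismatch` is hypothetical.

```python
def first_mismatch(s, r):
    """Index of the first position where s and r differ (or the shorter length)."""
    n = min(len(s), len(r))
    for i in range(n):
        if s[i] != r[i]:
            return i
    return n

S = "bcbabcba"
R = "bcbcbbba"
i = first_mismatch(S, R)          # the strings agree on "bcb" and differ at index 3
s_is_smaller = S[i] < R[i]        # 'a' < 'c', so S sorts before R
```

Because the mismatch lies before the branching position, S sorts before every string that shares the prefix "bcbc", not just before R.
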
33
I/O Cost of SB-Tree

Searching a blind trie requires one I/O to load
the node
Additional I/Os are needed for the sequential scan
of the string after the leaf is reached
Each block of the search string examined during a
sequential scan need not be read again at lower
levels
Search time for a string of S characters:
O(log_B N + S/B) I/Os
Insertions and deletions are also optimal

34
Substring Searching

Suffix Tree
Searches for any substring in time linear in the
size of the substring
Applies a Patricia trie to store the suffixes of a
string
Suffixes of "hello": hello, ello, llo, lo, o
Suffix Array
Compact but static version of the suffix tree
Both can be built on strings of total length N
optimally, in O(n log_m n) I/Os
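
A minimal in-memory sketch of a suffix array and substring search on it. The construction here simply sorts the suffixes, which is far from the I/O-optimal construction mentioned above and is used only to show the structure; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """All start positions of pattern, found by two binary searches on sa."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                        # first suffix whose k-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                        # first suffix whose k-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

sa = build_suffix_array("hello")          # suffix order: ello, hello, llo, lo, o
positions = find_occurrences("hello", sa, "l")
```

Every occurrence of a pattern is a prefix of some suffix, so all occurrences sit in one contiguous range of the array.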

35
Dynamic Memory Allocation

In practice, the amount of internal memory
allocated to a program may fluctuate
Demands by other users/processes
If the memory allocation is reduced, algorithms
that assume a fixed amount of memory must resort to
virtual memory
Severe performance degradation
EM algorithms should adapt dynamically to
whatever resources are available
Design and analysis of EM algorithms that adapt
gracefully to changing memory allocations are
studied in [BV99]

36
Dynamic Memory Allocation Model [BV99]

An EM algorithm is allocated memory in an
allocation sequence σ = m_1, m_2, ..., m_k of
allocation phases
i-th phase: the algorithm owns m_i blocks of memory
for 2m_i I/Os
Dynamically Optimal Algorithms
Adversary chooses allocation sequence σ_x =
m_1, m_2, ..., m_k
A solves problem P during σ_x
A is dynamically optimal for P iff no other
algorithm A' can solve P more than a constant
number of times during σ_x
Barve and Vitter [BV99] give dynamically optimal
strategies for sorting, matrix multiplication and
buffer tree operations

37
Conclusions (1)

Linear-space structures cannot provide
O(log_B N + z) I/Os for general 2-D orthogonal
range queries in the worst case
Linear-space structures achieve O(log_B N + z)
I/Os for corner, 2-sided and 3-sided queries
Buffer tree provides both good bulk-loading and
query performance

38
Conclusions (2)

Inverted files over large file chunks are used in
EM text indexing
The SB-tree extends the B-tree to support
variable-length strings, with each internal node
being a blind trie
EM algorithms need to adapt to changing internal
memory allocations

39
Other Interesting Topics

External sorting of strings
Computational geometry and graph problems (e.g.,
topological sort, BFS, DFS)
EM algorithms and memory hierarchy
Translating theoretical gains into observable
improvements in practice
New storage devices (e.g., disk drives with
processing capability) present new challenges

40
References

[JSV2001] J. S. Vitter. External Memory
Algorithms and Data Structures: Dealing with
Massive Data. ACM Computing Surveys, 33(2), June
2001, 209-271.
[JSV1998] J. S. Vitter. External Memory
Algorithms. Invited tutorial in Proceedings
of the 17th Annual ACM Symposium on Principles of
Database Systems (PODS '98), Seattle, WA, June
1998.
[BV99] Barve, R. D. and Vitter, J. S. 1999. A
theoretical framework for memory-adaptive
algorithms. In Proceedings of the IEEE Symposium
on Foundations of Computer Science (New York,
Oct.), Vol. 40, 273-284.

41
References

[MW94] Manber, U. and Wu, S. 1994. GLIMPSE: A
tool to search through entire file systems. In
USENIX Association, Ed., Proceedings of the
Winter USENIX Conference (San Francisco, Jan.),
23-32.
[FG99] Ferragina, P. and Grossi, R. 1996. Fast
string searching in secondary storage:
Theoretical developments and experimental
results. In Proceedings of the ACM-SIAM Symposium
on Discrete Algorithms (Atlanta, June), Vol. 7,
373-382.
[ARGE95] Arge, L. 1995. The buffer tree: A new
technique for optimal I/O-algorithms. In
Proceedings of the Workshop on Algorithms and
Data Structures, Vol. 955 of Lecture Notes in
Computer Science, Springer-Verlag, 334-345.
[ARGE99] Arge, L., Hinrichs, K. H., Vahrenhold,
J., and Vitter, J. S. 1999. Efficient bulk
operations on dynamic R-trees. In Workshop on
Algorithm Engineering and Experimentation, Vol.
1619 of Lecture Notes in Computer Science
(Baltimore, Jan.), Springer-Verlag, 328-348. IEEE
Transactions on Knowledge and Data Engineering 1,
2 (June), 248-257.