Selected Problems in EM Algorithms


1
Selected Problems in EM Algorithms
  • By Reynold Cheng
  • Feb 24, 2003.

2
Two Problem Categories in EM Algorithms
  • Batched Problems
  • Process entire file of data items
  • Stream data through internal memory in one or more passes
  • Examples: sorting, computational geometry, graph problems
  • Online Problems
  • Immediate response to queries
  • Only small portion of data is examined for each
    query
  • Data can be static or dynamic
  • Examples: dictionary operations, range queries

3
Parallel Disk Model (PDM)
4
Parallel Disk Algorithms
  • For most problems, parallel disks can be utilized
    effectively by
  • Striping: round-robin placement of blocks on disks
  • Distribution and merge paradigm
  • In practice, disk striping is sufficient
  • We now restrict ourselves to the single-disk case
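
A minimal sketch of round-robin striping, assuming D disks and blocks numbered from 0 (the function name and the list-of-lists layout are illustrative, not from the slides). Block i is placed on disk i mod D, so D consecutive blocks can be transferred in one parallel I/O step.

```python
def stripe(blocks, num_disks):
    """Place block i on disk i mod D (round-robin striping)."""
    disks = [[] for _ in range(num_disks)]
    for i, block in enumerate(blocks):
        disks[i % num_disks].append(block)
    return disks


# Ten logical blocks spread over three hypothetical disks.
layout = stripe(list(range(10)), num_disks=3)
print(layout)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```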

5
Talk Outline
  • Online Range Queries
  • Buffer Trees and Bulk Loading
  • String Processing
  • Dynamic Memory Allocation

6
Goals of Online Range Search
  • Using B-trees for 1-D range search, we get:
  • (1) Combined search and output cost for queries of
    O(log_B N + z) I/Os
  • (2) Use of only a linear amount (O(n) blocks) of disk
    space, and
  • (3) Support for dynamic updates in O(log_B N) I/Os
  • Can we achieve the same performance for general
    range search?

7
Categories of Spatial Data Structures
  • Space-Driven
  • Partitioning of the embedding space
  • Quad-trees and grid files
  • Data-Driven
  • Partitioning of the data items
  • R-trees and kd-trees
  • We study online range search for data-driven data
    structures

8
Lower bound of general 2-D orthogonal queries
  • It was proved that criteria (1) and (2) cannot be
    satisfied simultaneously
  • Lower bound
  • Ω(n (log n) / log(log_B N + 1)) blocks of disk space are needed,
  • even if queries are allowed O((log_B N)^c + z) I/Os, for any constant c
  • Best algorithm achieves
  • O(n (log N) / log log_B n) space
  • O(log_B n + z) I/Os per query

9
Range Search Using Linear Amount of Disk Space
  • None of these linear-space data structures come
    close to satisfying Criteria 1 and 3 in the worst
    case
  • In typical-case scenarios they perform well
  • Examples: R-trees, kd-B-trees, cross trees, grid
    files, and linearization

10
Cross Trees
  • Data-driven partitioning (weight-balanced
    B-trees) at upper levels of the tree
  • Space-driven partitioning (quad trees) at lower
    levels
  • d-dimensional range queries done in O(n^(1-1/d) + z)
    I/Os
  • Inserts and deletes take O(log_B N) I/Os

11
Grid Files and Linearization
  • A Grid File is a flattened data structure for
    storing the cells of a 2-D grid in disk blocks
  • Linearization: impose a total ordering on an n-D
    space with a space-filling curve, then
    organize the points into a B-tree
  • Worst-case performance is no better than cross
    trees
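
A minimal sketch of linearization in 2-D, assuming non-negative integer coordinates and using the Z-order (Morton) curve as the space-filling curve. The slides do not say which curve is used, so this choice and the name morton_key are illustrative. The sorted list stands in for the leaf level of a B-tree built on the 1-D keys.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of x and y into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return key


points = [(3, 1), (0, 0), (1, 1), (2, 3), (0, 1)]
# Sorting by the 1-D key imposes a total order on the 2-D points.
ordered = sorted(points, key=lambda p: morton_key(*p))
print(ordered)  # [(0, 0), (0, 1), (1, 1), (3, 1), (2, 3)]
```

Nearby points often receive nearby keys, but not always, which is one way to see why the worst-case query cost stays poor.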

12
Special Cases of Range Search
  • These are all important cases of the general
    orthogonal 2D range search
  • Are there any data structures that meet Criteria
    1-3 for these special cases?
  • Yes!

13
Diagonal Corner Query = Stabbing Query
  • Stabbing queries are important
  • Facilitate dynamic interval management
  • Indexing 1-D constraints in constraint DB
  • Arise in graphics and GIS problems
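
A minimal sketch of why a stabbing query is a diagonal corner query, assuming closed intervals [lo, hi]. Each interval becomes the 2-D point (lo, hi); the intervals containing q are exactly the points with x <= q and y >= q, a 2-sided query whose corner (q, q) lies on the diagonal. The linear scan only illustrates the mapping and is not an I/O-efficient structure.

```python
def stabbing_query(intervals, q):
    """Return the intervals [lo, hi] that contain q, via the point mapping."""
    points = [(lo, hi) for lo, hi in intervals]  # interval -> point (lo, hi)
    # 2-sided query with corner (q, q) on the diagonal x = y.
    return [(x, y) for x, y in points if x <= q and y >= q]


intervals = [(1, 4), (2, 3), (5, 8), (0, 9)]
print(stabbing_query(intervals, 3))  # [(1, 4), (2, 3), (0, 9)]
```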

14
Range Searching Results
[Table: I/O bounds by query type; table contents not preserved in this transcript]
15
Batched Dynamic Operations
  • For batched problems in internal memory, we often
    use a dynamic data structure to process a
    sequence of updates
  • To sort N items, we perform N insertions to a
    priority queue, followed by N deleteMin
    operations.
  • Known as batched dynamic operations, where
    queries do not need to be answered right away or
    in any particular order
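
The sorting example above can be sketched in internal memory with Python's heapq module as the priority queue (the function name pq_sort is illustrative). All N insertions come first and all N deleteMin operations follow, so no operation needs an immediate answer.

```python
import heapq


def pq_sort(items):
    """Sort by N insertions into a priority queue, then N deleteMin calls."""
    heap = []
    for item in items:
        heapq.heappush(heap, item)  # batched insertions
    # batched deleteMin operations, smallest item first
    return [heapq.heappop(heap) for _ in range(len(heap))]


print(pq_sort([5, 2, 9, 1, 5]))  # [1, 2, 5, 5, 9]
```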

16
The Buffer Tree [ARGE95]
  • In external memory, if we use a B-tree as the
    priority queue, the resulting I/O cost is not optimal
  • Each update and query takes O(log_B N) I/Os
  • Building a B-tree takes N insertions → O(N log_B N)
    I/Os!
  • This is much greater than (>>) Sort(N) = O(n log_m n) I/Os
  • We can take advantage of blocking and obtain O(n
    log_m n) I/Os in building a B-tree
  • Buffer tree [ARGE95]

17
Main Idea of Buffer Tree
  • Logically group nodes together and add buffers
  • Balanced tree with degree Θ(m) rather than Θ(B)
  • Each node has a buffer for storing Θ(M) items
    (Θ(m) blocks)
  • Insertions are done lazily: items are inserted into
    buffers
  • When a buffer is full, its items are pushed one
    level down

18
I/O Cost of Buffer Trees
  • Buffer-emptying in O(m) I/Os
  • Amortize the cost of distributing M items to Θ(m)
    children
  • Each item incurs an amortized cost of
    O(m/M) = O(1/B) I/Os per level
  • The resulting cost for update/query is only
    O((1/B) log_m n) I/Os amortized
  • Cost of inserting N = nB items:
  • nB · O((1/B) log_m n) = O(n log_m n) I/Os
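
To get a feel for the gap, here is a rough calculation that ignores constant factors and assumes hypothetical parameters N = 10^9 items, M = 10^6 items of internal memory, and B = 10^3 items per block (none of these values come from the slides).

```python
import math


def btree_build_ios(N, B):
    """N repeated B-tree insertions: about N * log_B N I/Os."""
    return N * math.log(N, B)


def buffer_tree_build_ios(N, M, B):
    """N buffer-tree insertions: about n * log_m n I/Os, n = N/B, m = M/B."""
    n = N / B
    m = M / B
    return n * math.log(n, m)


N, M, B = 10**9, 10**6, 10**3  # hypothetical sizes, in items
print(f"B-tree:      about {btree_build_ios(N, B):.3g} I/Os")
print(f"Buffer tree: about {buffer_tree_build_ios(N, M, B):.3g} I/Os")
```

Under these assumptions the buffer tree needs roughly a thousand times fewer I/Os, since the two estimates are about 3 × 10^9 and 2 × 10^6.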

19
Applications of Buffer Trees
  • Efficient DeleteMin operations
  • External Heap
  • String sorting
  • Improved graph algorithms (list ranking)
  • Bulk operations on R-trees

20
Bulk Loading of R-Tree
  • Repeated (dynamic) insertion into an R-tree is
    costly (O(N log_B N) I/Os)
  • Bulk loading: build the R-tree in bottom-up
    fashion by grouping items into leaf nodes
  • Apply space-filling curve to presort the objects
    (e.g., Hilbert curve)
  • Also known as static insertion
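
A minimal sketch of the leaf-packing step of bulk loading, assuming 2-D points with integer coordinates on an n × n grid where n is a power of two. The function names and the (bounding box, items) leaf representation are illustrative. Points are presorted by their position along a Hilbert curve and then cut into leaves of fixed capacity; upper levels would be built the same way from the leaf bounding boxes and are not shown.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate/flip the quadrant so the curve stays continuous
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def pack_leaves(points, leaf_capacity, grid_size):
    """Sort points along the curve, then fill leaves to capacity."""
    ordered = sorted(points, key=lambda p: hilbert_index(grid_size, p[0], p[1]))
    leaves = []
    for i in range(0, len(ordered), leaf_capacity):
        group = ordered[i:i + leaf_capacity]
        xs = [p[0] for p in group]
        ys = [p[1] for p in group]
        mbr = (min(xs), min(ys), max(xs), max(ys))  # minimum bounding rectangle
        leaves.append((mbr, group))
    return leaves


grid = [(x, y) for x in range(4) for y in range(4)]
for mbr, group in pack_leaves(grid, leaf_capacity=4, grid_size=4):
    print(mbr, group)
```

Every leaf is filled completely except possibly the last, which is where the high storage utilization of static insertion comes from.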

21
Dynamic vs. Static Insertion
  • Dynamic insertion
  • Explicitly reduce coverage, overlap, or perimeter
    of the bounding boxes
  • Good query performance
  • Static insertion
  • Does not consider bounding box information at all
  • Improved storage utilization (up to 100%)
    compensates for a higher degree of node overlap
  • Poorer query performance

22
Buffer Tree and Bulk Loading
  • To get good query performance and bulk
    construction efficiency, Arge et al. [ARGE99]
    proposed using buffer trees for fast
    bulk loading.
  • Top-down construction in O(n log_m n) I/Os
  • Matches the performance of bottom-up methods to
    within a constant factor

23
Experimental Results: I/O Costs
24
Text Indexing in Internal Memory
  • Inverted file
  • Analogous to the index at the back of a book
  • Words of interest in the text are sorted
    alphabetically
  • Each item in the sorted list has a list of
    pointers to the occurrences of that word in the
    text
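
A minimal in-memory sketch of an inverted file, assuming words are runs of alphanumeric characters and matching is case-insensitive (both are assumptions for illustration). Each word maps to the character offsets of its occurrences, and the words are kept in alphabetical order.

```python
import re
from collections import defaultdict


def build_inverted_file(text):
    """Map each word to the list of offsets where it occurs in the text."""
    index = defaultdict(list)
    for match in re.finditer(r"\w+", text.lower()):
        index[match.group()].append(match.start())
    # Keep the words sorted alphabetically, like a book index.
    return dict(sorted(index.items()))


index = build_inverted_file("the cat and the hat")
print(index)  # {'and': [8], 'cat': [4], 'hat': [16], 'the': [0, 12]}
```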

25
Text Indexing in External Memory
  • Hybrid approach: the text is divided into large
    chunks (one or more blocks each)
  • Inverted file is used to specify the chunks
    containing each word
  • Use a fast sequential method to search within a
    chunk, such as the Knuth-Morris-Pratt and Boyer-Moore
    methods
  • Basis of the GLIMPSE search tool [MW94]
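
A minimal sketch of the hybrid approach, assuming chunks are fixed-size lists of words (the helper names and the chunk size are illustrative, and this is not GLIMPSE's actual implementation). The index records only which chunks contain a word; a plain linear scan stands in for the Knuth-Morris-Pratt or Boyer-Moore scan that a real system would run over the raw chunk text.

```python
def split_into_chunks(words, words_per_chunk):
    """Cut the word sequence into consecutive chunks."""
    return [words[i:i + words_per_chunk]
            for i in range(0, len(words), words_per_chunk)]


def build_chunk_index(chunks):
    """Coarse inverted file: word -> set of chunk ids containing it."""
    index = {}
    for chunk_id, chunk in enumerate(chunks):
        for word in chunk:
            index.setdefault(word, set()).add(chunk_id)
    return index


def search(word, chunks, chunk_index):
    """Read only the candidate chunks and scan each one sequentially."""
    hits = []
    for chunk_id in sorted(chunk_index.get(word, ())):
        for pos, candidate in enumerate(chunks[chunk_id]):
            if candidate == word:
                hits.append((chunk_id, pos))
    return hits


chunks = split_into_chunks("a b c a d e a b".split(), words_per_chunk=3)
chunk_index = build_chunk_index(chunks)
print(search("b", chunks, chunk_index))  # [(0, 1), (2, 1)]
```

The index stays small because it stores chunk ids rather than every occurrence, at the price of a sequential scan inside each candidate chunk.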

26
B-Tree and Strings (1)
  • In a B-tree, Θ(B) unit-sized keys are stored in
    each internal node for searching
  • The node fits into one or two blocks
  • If keys are variable-sized text strings, the keys
    can be arbitrarily long
  • Many blocks to store Θ(B) strings per node!

27
B-Tree and Strings (2)
  • Can we store pointers to Θ(B) strings?
  • Access to strings during search needs more than
    a constant number of I/Os per node
  • Redesign the B-tree to handle strings: the String
    B-tree, or SB-tree [FG99]
  • SB-tree differs from B-tree in the way each
    Θ(B)-way branching node is represented

28
Overview of SB-Tree
  • B-tree on set of pointers to strings
    (lexicographical order)
  • Pointers to Θ(B) strings stored in internal nodes
  • Every internal node is a variant of the Patricia
    Trie called Blind Trie (BT)

29
Internal Node of SB-Tree: Blind Trie
  • Achieve B-way branching with only O(B)
    characters, fitting in O(1) blocks
  • Contains characters from all B-1 strings
  • Pointers to the strings are stored at the leaves

30
Illustrating Blind Trie
  • Left subtrie contains strings whose position 4
    (5th character) is 'a'
  • Right subtrie contains strings whose position 4
    (5th character) is 'b'
  • The first 4 characters of all strings in the node's
    subtrie are 'bcbc'

31
Searching a Blind Trie
  • S = bcbabcba
  • R = bcbcbbba

32
Fixing the Search Bug by Sequential Scan
  • Sequentially compare S and R
  • The index where they differ can be found
  • S = bcbabcba
  • R = bcbcbbba
  • S is smaller than entire right subtrie of root

Possible location of S
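
The sequential comparison of S and R can be sketched as follows (the function name first_mismatch is illustrative). It returns the first index where the two strings differ; the characters at that index tell on which side of R the string S belongs.

```python
def first_mismatch(s, r):
    """Index of the first position where s and r differ."""
    i = 0
    while i < len(s) and i < len(r) and s[i] == r[i]:
        i += 1
    return i


S = "bcbabcba"
R = "bcbcbbba"
i = first_mismatch(S, R)
print(i, S[i], R[i])  # 3 a c
```

Here the strings agree on 'bcb' and differ at position 3, where S has 'a' and R has 'c', so S sorts before R.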
33
I/O Cost of SB-Tree
  • Searching a blind trie requires one I/O to load the node
  • Additional I/Os to do the sequential scan of the
    string after the leaf is reached
  • Each block of the search string examined during
    sequential scan need not be read again for lower
    levels
  • Search cost for a string of S characters: O(log_B N + S/B) I/Os
  • Insertions and deletions also optimal

34
Substring Searching
  • Suffix Tree
  • Searches for any substring in time linear in the size
    of the substring
  • Applies a Patricia trie to store the suffixes of a
    string
  • Suffixes of 'hello': hello, ello, llo, lo, o
  • Suffix Array
  • Compact but static version of the suffix tree
  • Both can be built on strings of total length N
    in an optimal O(n log_m n) I/Os
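
A minimal in-memory sketch of substring search with a suffix array (function names are illustrative). The array is built by naively sorting the suffixes, which is far from the optimal external-memory construction mentioned above; two binary searches then locate the block of suffixes that start with the pattern.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All positions where pattern occurs, via binary search on the array."""
    k = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose k-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose k-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "hello"
sa = build_suffix_array(text)
print(sa)                               # [1, 0, 2, 3, 4]
print(find_occurrences(text, sa, "l"))  # [2, 3]
```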

35
Dynamic Memory Allocation
  • In practice, amount of internal memory allocated
    to a program may fluctuate
  • Demands by other users/processes
  • If the memory allocation is reduced, algorithms that
    assume a fixed amount of memory must resort to virtual
    memory
  • Severe performance degradation
  • EM algorithms should adapt dynamically to
    whatever resources are available
  • Design and analysis of EM algorithms that adapt
    gracefully to changing memory allocations are studied
    in [BV99]

36
Dynamic Memory Allocation Model [BV99]
  • An EM algorithm is allocated memory in an
    allocation sequence σ = m_1, m_2, ..., m_k of allocation
    phases
  • In the i-th phase, the algorithm owns m_i blocks of memory for
    2m_i I/Os
  • Dynamically Optimal Algorithms
  • Adversary chooses allocation sequence σ_x =
    m_1, m_2, ..., m_k
  • A solves problem P during σ_x
  • A is dynamically optimal for P iff no other algorithm
    A' can solve P more than a constant number of
    times during σ_x
  • Barve and Vitter [BV99] give dynamically optimal
    strategies for sorting, matrix multiplication, and
    buffer tree operations

37
Conclusions (1)
  • Linear-space structures cannot provide O(log_B n
    + z) I/Os for general 2-D orthogonal range
    queries in the worst case
  • Linear-space structures achieve O(log_B n + z)
    I/Os for diagonal corner, 2-sided, and 3-sided queries
  • Buffer tree provides both good bulk-loading and
    query performance

38
Conclusions (2)
  • Inverted files for large file chunks used in EM
    text indexing
  • SB-Tree enhances B-tree support for
    variable-length strings with internal node being
    a blind trie
  • EM algorithms need to adapt to changing internal
    memory allocations

39
Other Interesting Topics
  • External sorting of strings
  • Computational geometry and graph problems (e.g.,
    topological sort, BFS, DFS)
  • EM algorithms and memory hierarchy
  • Translating theoretical gains into observable
    improvements in practice
  • New storage devices (e.g., disk drives with
    processing capability) present new challenges

40
References
  • [JSV2001] J. S. Vitter. External Memory
    Algorithms and Data Structures: Dealing with
    MASSIVE DATA. ACM Computing Surveys, 33(2), June
    2001, 209-271.
  • [JSV1998] J. S. Vitter. External Memory
    Algorithms. Invited tutorial in Proceedings
    of the 17th Annual ACM Symposium on Principles of
    Database Systems (PODS '98), Seattle, WA, June
    1998.
  • [BV99] Barve, R. D. and Vitter, J. S. 1999. A
    theoretical framework for memory-adaptive
    algorithms. In Proceedings of the IEEE Symposium
    on Foundations of Computer Science (New York,
    Oct.), Vol. 40, 273-284.

41
References
  • [MW94] Manber, U. and Wu, S. 1994. GLIMPSE: A
    tool to search through entire file systems. In
    USENIX Association, Ed., Proceedings of the
    Winter USENIX Conference (San Francisco, Jan.),
    23-32.
  • [FG99] Ferragina, P. and Grossi, R. 1996. Fast
    string searching in secondary storage:
    Theoretical developments and experimental
    results. In Proceedings of the ACM-SIAM Symposium
    on Discrete Algorithms (Atlanta, June), Vol. 7,
    373-382.
  • [ARGE95] Arge, L. 1995. The buffer tree: A new
    technique for optimal I/O-algorithms. In
    Proceedings of the Workshop on Algorithms and
    Data Structures, Vol. 955 of Lecture Notes in
    Computer Science, Springer-Verlag, 334-345.
  • [ARGE99] Arge, L., Hinrichs, K. H., Vahrenhold,
    J., and Vitter, J. S. 1999. Efficient bulk
    operations on dynamic R-trees. In Workshop on
    Algorithm Engineering and Experimentation, Vol.
    1619 of Lecture Notes in Computer Science
    (Baltimore, Jan.), Springer-Verlag, 328-348. IEEE
    Transactions on Knowledge and Data Engineering 1,
    2 (June), 248-257.