Phrase Hierarchy Inference - PowerPoint PPT Presentation

Title: Phrase Hierarchy Inference


1
Phrase Hierarchy Inference
  • Gordon Paynter, UC Riverside
  • Craig Nevill-Manning, Google
  • Ian Witten, University of Waikato

2
Outline
  • Overlapping vs non-overlapping phrases
  • Memory-based algorithm
  • Suffix trees
  • Suffix arrays
  • Multipass algorithm

3
Non-overlapping phrases
  • Given a text, parse it into a tree of repeated
    phrases
  • Advantage
  • Based on existing data compression algorithms
  • Disadvantage
  • Sometimes arbitrary association of words

In the beginning, God created the heaven and the earth

4
Overlapping Phrases
  • Instead, we count all repeating phrases, even if
    two phrases overlap
  • Limit phrase length to, say, ten
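
Counting overlapping phrases can be sketched as a sliding window over the words. This is only an illustration: the function name `count_repeated_phrases`, the `max_len` parameter and the sample sentence are assumptions, not part of the presentation.

```python
from collections import Counter

def count_repeated_phrases(words, max_len=10):
    """Count every phrase of 1..max_len words and keep those seen more than once."""
    counts = Counter()
    for start in range(len(words)):
        # Every window starting here is counted, so overlapping phrases are all included.
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            counts[tuple(words[start:start + length])] += 1
    return {phrase: n for phrase, n in counts.items() if n > 1}

# Hypothetical sample text, chosen only for illustration.
words = "the old man and the sea and the old man".split()
repeated = count_repeated_phrases(words, max_len=3)
```

Here both ("the", "old") and ("old", "man") are kept even though they overlap inside ("the", "old", "man").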

5
Memory-based Algorithm
  • For each word w
  • Everywhere that word occurs, consider the phrase
    formed by the word plus the word to the left (aw)
  • Similarly for words to the right (wa)
  • If the phrase is always preceded or followed by
    the same word, extend the phrase
  • If the phrase begins or ends with a stopword,
    extend the phrase
  • Add all the extended phrases to the list of
    expansions for w
  • For each phrase p
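
A rough sketch of the per-word steps, assuming a plain in-memory word list. It covers only the neighbour phrases (aw, wa) and the "always preceded or followed by the same word" rule; the stopword rule and the per-phrase step are left out. All names are hypothetical, and restricting the search to phrases that occur at least twice is an assumption.

```python
def occurrences(words, phrase):
    """Start positions of phrase (a tuple of words) in the word list."""
    k = len(phrase)
    return [i for i in range(len(words) - k + 1) if tuple(words[i:i + k]) == phrase]

def extend_while_unambiguous(words, phrase, max_len=10):
    """Grow phrase while every occurrence has the same right (or left) neighbour."""
    while len(phrase) < max_len:
        starts = occurrences(words, phrase)
        k = len(phrase)
        # None stands for "no neighbour" at the start or end of the text.
        after = {words[i + k] if i + k < len(words) else None for i in starts}
        before = {words[i - 1] if i > 0 else None for i in starts}
        if len(after) == 1 and None not in after:
            phrase = phrase + (after.pop(),)
        elif len(before) == 1 and None not in before:
            phrase = (before.pop(),) + phrase
        else:
            break
    return phrase

def expansions_for(words, w, max_len=10):
    """Extended phrases built from the left (aw) and right (wa) neighbours of w."""
    found = set()
    for i in occurrences(words, (w,)):
        seeds = []
        if i > 0:
            seeds.append((words[i - 1], w))   # aw
        if i + 1 < len(words):
            seeds.append((w, words[i + 1]))   # wa
        for seed in seeds:
            # Assumption: only repeated phrases are of interest.
            if len(occurrences(words, seed)) > 1:
                found.add(extend_while_unambiguous(words, seed, max_len))
    return found

words = "the old man and the sea and the old man".split()
old_expansions = expansions_for(words, "old")
```

The repeated linear scans in `occurrences` are exactly the cost that the next slide identifies as the problem.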

6
Memory-based Algorithm
  • Problem
  • How to efficiently find words to the right and
    left for every occurrence of a word or a phrase?
  • Solution
  • Suffix trees

7
Suffix Tree
  • A compacted trie of suffixes
  • Trie: a tree containing a set of strings
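
A minimal trie sketch using nested dictionaries, built from the words of "she sells sea shells on the sea shore". The representation and the "$" end-of-string key are assumptions made for illustration.

```python
def build_trie(strings):
    """Build a trie as nested dicts; the key "$" marks the end of a stored string."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            # Reuse the existing child for this character, or create it.
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def contains(trie, s):
    """Return True if s was stored in the trie as a complete string."""
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie = build_trie("she sells sea shells on the sea shore".split())
```

Strings that share a prefix, such as "she", "shells" and "shore", share the path for that prefix.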

[Figure: trie of the words in "she sells sea shells on the sea shore"]

8
Suffix Tree
  • Compacted trie: no nodes with only one child
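
Compaction can be sketched as merging every chain of single-child nodes into one edge labelled with a substring. The nested-dict representation and the "$" terminator character are assumptions made for illustration.

```python
def build_trie(strings):
    """Build a trie as nested dicts; "$" is appended to every string as a terminator."""
    root = {}
    for s in strings:
        node = root
        for ch in s + "$":
            node = node.setdefault(ch, {})
    return root

def compact(node):
    """Return a copy of the trie in which no node has exactly one child."""
    compacted = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            next_label, next_child = next(iter(child.items()))
            label += next_label
            child = next_child
        compacted[label] = compact(child)
    return compacted

trie = build_trie("she sells sea shells on the sea shore".split())
compacted = compact(trie)
```

After compaction the words "on" and "the" each become a single edge, because no other word shares their first letter.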

[Figure: the same trie before and after compaction; chains of single-child nodes are merged, giving edges such as "lls", "ore", "on" and "the"]

9
Suffix Tree
  • Compacted trie of all suffixes

she sells sea shells on the sea shore
he sells sea shells on the sea shore
e sells sea shells on the sea shore
 sells sea shells on the sea shore
sells sea shells on the sea shore
ells sea shells on the sea shore
lls sea shells on the sea shore
ls sea shells on the sea shore
s sea shells on the sea shore
 sea shells on the sea shore
sea shells on the sea shore

10
Two Surprising Facts
  • Even though there are O(n^2) characters in all the
    suffixes,
  • Suffix trees consume O(n) space
  • Suffix trees take O(n) time to compute
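
The quadratic character count can be checked directly: a text of n characters has one suffix of each length from 1 to n. The snippet below verifies only that count, not the O(n) space and time bounds, and the variable names are illustrative.

```python
text = "she sells sea shells on the sea shore"
n = len(text)

# One suffix starts at every position, with lengths n, n-1, ..., 1.
total_suffix_chars = sum(len(text[i:]) for i in range(n))

# The lengths sum to n(n+1)/2, which grows as O(n^2).
closed_form = n * (n + 1) // 2
```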

11
Suffix Tree
  • How does the suffix tree help us?
  • Build a suffix tree of words (instead of single
    letters)
  • For any word, words to the right are children in
    the tree
  • Compaction means that the longest unique sequence
    is already computed
  • For words to the left, build a suffix tree for
    the reverse sequence
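
The idea can be illustrated with a naive, depth-limited trie over word sequences. This is not a true linear-time suffix tree and it is not compacted; it only shows how children give the words to the right and how the reversed sequence gives the words to the left. The function names, the `max_depth` limit and the sample text are assumptions.

```python
def build_word_suffix_trie(words, max_depth=3):
    """Insert the first max_depth words of every suffix into a nested-dict trie."""
    root = {}
    for start in range(len(words)):
        node = root
        for word in words[start:start + max_depth]:
            node = node.setdefault(word, {})
    return root

def next_words(trie, phrase):
    """Words that follow phrase somewhere in the text (children of its node)."""
    node = trie
    for word in phrase:
        node = node.get(word)
        if node is None:
            return set()
    # Phrases of max_depth or more words always yield an empty set in this sketch.
    return set(node)

words = "the old man and the sea and the old man".split()
forward = build_word_suffix_trie(words)
backward = build_word_suffix_trie(words[::-1])

right_of_the = next_words(forward, ["the"])    # words seen immediately after "the"
left_of_the = next_words(backward, ["the"])    # words seen immediately before "the"
```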

12
Suffix Array
  • Sorted list of suffixes

 sea shells on the sea shore
 sells sea shells on the sea shore
e sells sea shells on the sea shore
ells sea shells on the sea shore
he sells sea shells on the sea shore
lls sea shells on the sea shore
ls sea shells on the sea shore
s sea shells on the sea shore
sea shells on the sea shore
sells sea shells on the sea shore
she sells sea shells on the sea shore

13
Suffix Array
  • Advantages
  • Simple: 10 lines of code
  • Space efficient: one array of pointers
  • Disadvantages
  • More expensive to create: O(n log n)
  • More expensive to operate on (linear scans
    instead of following an edge)
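
A short sketch of a suffix array and a lookup on it. The construction below simply sorts suffix start positions by comparing the suffixes themselves, which is simple but costs more than O(n log n) in the worst case; the function names are assumptions.

```python
def build_suffix_array(text):
    """Return the suffix start positions, sorted by the suffix they point to."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions at which pattern occurs in text."""
    m = len(pattern)
    # Binary search for the first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # Matching suffixes are adjacent in the array, so scan forward linearly.
    end = lo
    while end < len(suffix_array) and text[suffix_array[end]:suffix_array[end] + m] == pattern:
        end += 1
    return sorted(suffix_array[lo:end])

text = "she sells sea shells on the sea shore"
suffix_array = build_suffix_array(text)
sea_positions = find_occurrences(text, suffix_array, "sea")
```

The forward scan after the binary search is the kind of linear scan that replaces following an edge in a suffix tree.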

14
Multi-pass Algorithm
  • Disk seeks dominate
  • minimize disk seeks
  • fit within available memory
  • Disk reads are cheap, seeks are expensive
  • Make multiple passes over the data, using as
    little memory as possible

15
Three Phases
  • Phase 1: count all single words, two word
    phrases, three word phrases
  • Phase 2: make expansion lists for each phrase
  • Phase 3: delete uninteresting phrases

16
Phase 1: Count Phrases
  • Make one pass over the data, counting individual
    words
  • Write out all words that appear more than once
  • Make a second pass over the data, counting pairs
    of words, where both words appear more than once
  • Write out all pairs that appear more than once
  • Make a third pass over the data, counting triples
    of words, where both overlapping pairs appear
    more than once
  • Write out all triples that appear more than once
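
The three passes can be sketched as follows. The callable `read_words` is a hypothetical stand-in for one sequential pass over the data on disk, and the results are returned as dictionaries instead of being written out; the sample text is also an assumption.

```python
from collections import Counter

def count_phrases(read_words):
    """read_words() must return a fresh iterator over the words (one pass each call)."""
    # Pass 1: count individual words.
    word_counts = Counter(read_words())
    words = {w: n for w, n in word_counts.items() if n > 1}

    # Pass 2: count pairs in which both words appear more than once.
    pair_counts = Counter()
    prev = None
    for w in read_words():
        if prev in words and w in words:
            pair_counts[(prev, w)] += 1
        prev = w
    pairs = {p: n for p, n in pair_counts.items() if n > 1}

    # Pass 3: count triples in which both overlapping pairs appear more than once.
    triple_counts = Counter()
    a = b = None
    for c in read_words():
        if (a, b) in pairs and (b, c) in pairs:
            triple_counts[(a, b, c)] += 1
        a, b = b, c
    triples = {t: n for t, n in triple_counts.items() if n > 1}
    return words, pairs, triples

sample = "the old man and the sea and the old man".split()
words, pairs, triples = count_phrases(lambda: iter(sample))
```

Each pass keeps counters only for candidates whose shorter parts already proved frequent, which is what keeps memory use low.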

17
Phase 1 Output
words        pairs of words     triples of words
and 31       and the 25         and the sea 3
Gone 2       Gone with 2        Gone with the 2
man 4        man and 3          man and the 2
old 12       old man 2          old man and 2
sea 8        The old 5          The old man 2
the 57       the sea 3          with the Wind 2
Wind 3       the Wind 2
with 17      with the 13

18
Phase 2: Make Expansion Lists
  • Read all pairs of words that appear more than
    once (from phase 1)
  • Insert each pair in the list for each word
  • Read all frequent triples
  • Insert each triple in the list for each
    overlapping pair
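
A sketch of building the expansion lists from the Phase 1 output, using the pairs and triples shown on the Phase 1 Output slide. Representing phrases as tuples and the lists as a dictionary is an assumption made for illustration.

```python
from collections import defaultdict

def make_expansion_lists(pairs, triples):
    """Map each word to the pairs containing it, and each pair to the triples containing it."""
    expansions = defaultdict(list)
    for pair in pairs:
        # dict.fromkeys removes a duplicate when both words of the pair are the same.
        for word in dict.fromkeys(pair):
            expansions[(word,)].append(pair)
    for triple in triples:
        # A triple overlaps two pairs: its first two words and its last two words.
        for pair in dict.fromkeys((triple[:2], triple[1:])):
            expansions[pair].append(triple)
    return dict(expansions)

pairs = [("and", "the"), ("Gone", "with"), ("man", "and"), ("old", "man"),
         ("The", "old"), ("the", "sea"), ("the", "Wind"), ("with", "the")]
triples = [("and", "the", "sea"), ("Gone", "with", "the"), ("man", "and", "the"),
           ("old", "man", "and"), ("The", "old", "man"), ("with", "the", "Wind")]
expansions = make_expansion_lists(pairs, triples)
```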

19
Phase 2 Output
words        pairs of words     triples of words
and 31       and the 25         and the sea 3
Gone 2       Gone with 2        Gone with the 2
man 4        man and 3          man and the 2
old 12       old man 2          old man and 2
sea 8        The old 5          The old man 2
the 57       the sea 3          with the Wind 2
Wind 3       the Wind 2
with 17      with the 13

20
Phase 3
  • Delete each phrase in the hierarchy if
  • it begins or ends in a stopword (e.g. "man and")
  • it occurs in a particular longer phrase more than
    75% of the time (e.g. "theoretical computer")
  • Pointers to that phrase now point to that
    phrase's expansions
  • Process is recursive
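
The deletion rules can be sketched recursively. The counts, the stopword set and the expansion lists below are hypothetical values invented for illustration, and comparing stopwords case-insensitively is an assumption.

```python
def is_uninteresting(phrase, counts, expansions, stopwords, threshold=0.75):
    """A phrase is deleted if it begins or ends in a stopword, or if one longer
    phrase accounts for more than `threshold` of its occurrences."""
    if phrase[0].lower() in stopwords or phrase[-1].lower() in stopwords:
        return True
    return any(counts[longer] > threshold * counts[phrase]
               for longer in expansions.get(phrase, []))

def prune(phrase, counts, expansions, stopwords, threshold=0.75):
    """Return the expansions of phrase, with each deleted phrase replaced by its own expansions."""
    kept = []
    for child in expansions.get(phrase, []):
        if is_uninteresting(child, counts, expansions, stopwords, threshold):
            # Pointers to a deleted phrase now point to that phrase's expansions.
            replacements = prune(child, counts, expansions, stopwords, threshold)
        else:
            replacements = [child]
        for r in replacements:
            if r not in kept:
                kept.append(r)
    return kept

# Hypothetical counts and expansion lists, for illustration only.
tc, cs, ca = ("theoretical", "computer"), ("computer", "science"), ("computer", "and")
tcs = ("theoretical", "computer", "science")
counts = {("computer",): 10, tc: 4, cs: 8, ca: 3, tcs: 4}
expansions = {("computer",): [tc, cs, ca], tc: [tcs], cs: [tcs]}
stopwords = {"and", "the", "with"}

pruned = prune(("computer",), counts, expansions, stopwords)
```

With these made-up counts, "theoretical computer" always occurs inside "theoretical computer science", so it is deleted and replaced by the longer phrase, while "computer and" is dropped because it ends in a stopword.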

21
Phase 3 Output
words        pairs of words     triples of words
and 31       and the 25         and the sea 3
Gone 2       Gone with 2        Gone with the 2
man 4        man and 3          man and the 2
old 12       old man 2          old man and 2
sea 8        The old 5          The old man 2
the 57       the sea 3          with the Wind 2
Wind 3       the Wind 2
with 17      with the 13