Phrase Hierarchy Inference presentation

Transcript and Presenter's Notes

1
Phrase Hierarchy Inference

Gordon Paynter, UC Riverside
Craig Nevill-Manning, Google
Ian Witten, University of Waikato

2
Outline

Overlapping vs non-overlapping phrases
Memory-based algorithm
Suffix trees
Suffix arrays
Multi-pass algorithm

3
Non-overlapping phrases

Given a text, parse it into a tree of repeated phrases
Advantage: based on existing data compression algorithms
Disadvantage: the association of words is sometimes arbitrary

In the beginning, God created the heaven and the earth

4
Overlapping Phrases

Instead, we count all repeating phrases, even if two phrases overlap
Limit phrase length to, say, ten words
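
As a rough illustration of counting overlapping phrases, the sketch below counts every window of up to max_len words and keeps those that repeat. The function name, the max_len parameter, and the sample text are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter

def repeated_phrases(words, max_len=10):
    """Count every phrase of 1..max_len words and keep those seen more than once."""
    counts = Counter()
    for start in range(len(words)):
        # Every window starting here is counted, so overlapping phrases all survive.
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            counts[tuple(words[start:start + length])] += 1
    return {phrase: n for phrase, n in counts.items() if n > 1}

# Hypothetical sample text, chosen only to make the overlap visible.
words = "the old man and the sea the old man and the boy".split()
phrases = repeated_phrases(words, max_len=3)
```

Here both "the old man" and "man and the" are kept, even though they share the word "man".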

5
Memory-based Algorithm

For each word w
Everywhere that word occurs, consider the phrase formed by the word plus the word to the left (aw)
Similarly for words to the right (wa)
If the phrase is always preceded or followed by the same word, extend the phrase
If the phrase begins or ends with a stopword, extend the phrase
Add all the extended phrases to the list of expansions for w
For each phrase p
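
The per-word steps can be sketched in memory roughly as follows. This is one possible reading of the slide, not the authors' code: it covers only the "always preceded or followed by the same word" rule, leaves out the stopword rule, and finds occurrences by a plain linear scan. The names find_all, extend, and expansions are invented for the example.

```python
from collections import Counter

def find_all(words, phrase):
    """Start positions of every occurrence of phrase (a tuple of words)."""
    n = len(phrase)
    return [i for i in range(len(words) - n + 1) if tuple(words[i:i + n]) == phrase]

def extend(words, phrase):
    """Grow phrase while all its occurrences share the same right (or left) neighbour."""
    while True:
        starts = find_all(words, phrase)
        n = len(phrase)
        # None marks an occurrence that touches the edge of the text.
        rights = {words[i + n] if i + n < len(words) else None for i in starts}
        lefts = {words[i - 1] if i > 0 else None for i in starts}
        if len(rights) == 1 and None not in rights:
            phrase = phrase + (rights.pop(),)
        elif len(lefts) == 1 and None not in lefts:
            phrase = (lefts.pop(),) + phrase
        else:
            return phrase

def expansions(words, w):
    """Repeated two-word phrases around w, each extended as far as possible."""
    candidates = Counter()
    for i, x in enumerate(words):
        if x != w:
            continue
        if i > 0:
            candidates[(words[i - 1], w)] += 1   # the phrase "aw"
        if i + 1 < len(words):
            candidates[(w, words[i + 1])] += 1   # the phrase "wa"
    return sorted({extend(words, p) for p, c in candidates.items() if c > 1})

words = "the old man and the sea the old man and the boy".split()
result = expansions(words, "man")
```

The repeated scans in find_all are the cost that the next slide asks how to avoid.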

6
Memory-based Algorithm

Problem: how to efficiently find words to the right and left for every occurrence of a word or a phrase?
Solution: suffix trees

7
Suffix Tree

A compacted trie of suffixes
Trie: a tree containing a set of strings
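
A minimal sketch of a trie, assuming nested dictionaries as the node type and "$" as an end-of-string marker (both are choices made for this example). It stores the words of the example sentence one character per edge.

```python
def build_trie(strings):
    """Insert each string character by character; "$" marks where a string ends."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

trie = build_trie("she sells sea shells on the sea shore".split())
```

Words with a common prefix share a path: "she", "shells", and "shore" all pass through the s and h nodes.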

she sells sea shells on the sea shore
[Figure: trie of the words in the sentence above, one character per edge, with an end marker after each word]

8
Suffix Tree

Compacted trie: no nodes with only one child
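
Compaction can be sketched as merging every chain of single-child nodes into one edge with a longer label. As before, nested dictionaries and the "$" end marker are assumptions of this example, and build_trie is repeated so the block runs on its own.

```python
def build_trie(strings):
    """Plain trie: one character per edge, "$" marks the end of a string."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def compact(node):
    """Return a copy in which no node below the root has exactly one child."""
    out = {}
    for label, child in node.items():
        # Follow the chain while the child has a single outgoing edge,
        # absorbing that edge's label into the current one.
        while len(child) == 1:
            (next_label, grandchild), = child.items()
            label += next_label
            child = grandchild
        out[label] = compact(child)
    return out

compacted = compact(build_trie("she sells sea shells on the sea shore".split()))
```

On the example sentence, "on" and "the" each collapse to a single edge, and "shells" becomes s, h, e, then "lls$".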

[Figure: the same trie before and after compaction; single-child chains such as l-l-s are merged into one edge]

9
Suffix Tree

Compacted trie of all suffixes

she sells sea shells on the sea shore
he sells sea shells on the sea shore
e sells sea shells on the sea shore
 sells sea shells on the sea shore
sells sea shells on the sea shore
ells sea shells on the sea shore
lls sea shells on the sea shore
ls sea shells on the sea shore
s sea shells on the sea shore
 sea shells on the sea shore
sea shells on the sea shore

10
Two Surprising Facts

Even though there are O(n^2) characters in all the suffixes:
Suffix trees consume O(n) space
Suffix trees take O(n) time to compute
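
The O(n^2) figure comes from the suffix lengths: a text of n characters has n suffixes of lengths n, n-1, ..., 1, which add up to n(n+1)/2 characters. A quick check on the example sentence (the function name is only for illustration):

```python
def total_suffix_characters(text):
    """Number of characters needed to write out every suffix in full."""
    return sum(len(text[i:]) for i in range(len(text)))

text = "she sells sea shells on the sea shore"
n = len(text)
total = total_suffix_characters(text)
# total equals n * (n + 1) // 2
```

A suffix tree avoids this cost because each edge label can be stored as a pair of positions in the text rather than as a copy of the characters.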

11
Suffix Tree

How does the suffix tree help us?
Build a suffix tree of words (instead of single letters)
For any word, words to the right are children in the tree
Compaction means that the longest unique sequence is already computed
For words to the left, build a suffix tree for the reverse sequence
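
To make the idea concrete, here is a naive word-level suffix trie. It is not compacted, is cut off at max_depth words, and is built by brute force rather than by a linear-time construction, so it only illustrates the lookups. The function names, max_depth, and the sample text are assumptions for this sketch.

```python
def word_suffix_trie(words, max_depth=3):
    """Insert the first max_depth words of every suffix into a nested-dict trie."""
    root = {}
    for i in range(len(words)):
        node = root
        for w in words[i:i + max_depth]:
            node = node.setdefault(w, {})
    return root

def followers(trie, phrase):
    """Words that follow phrase somewhere in the text: the children of its node."""
    node = trie
    for w in phrase:
        node = node.get(w)
        if node is None:
            return []
    return sorted(node)

words = "the old man and the sea and the old man".split()
forward = word_suffix_trie(words)
backward = word_suffix_trie(words[::-1])   # reversed text gives words to the left

right_of_the = followers(forward, ["the"])
left_of_the = followers(backward, ["the"])
```

In the backward trie a phrase has to be looked up in reverse word order, since the text itself was reversed.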

12
Suffix Array

Sorted list of suffixes

 sea shells on the sea shore
 sells sea shells on the sea shore
e sells sea shells on the sea shore
ells sea shells on the sea shore
he sells sea shells on the sea shore
lls sea shells on the sea shore
ls sea shells on the sea shore
s sea shells on the sea shore
sea shells on the sea shore
sells sea shells on the sea shore
she sells sea shells on the sea shore

13
Suffix Array

Advantages
Simple: 10 lines of code
Space efficient: one array of pointers
Disadvantages
More expensive to create: O(n log n)
More expensive to operate on (linear scans instead of following an edge)
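
A small sketch of a suffix array over characters. The construction below simply sorts the suffix start positions by comparing whole suffixes, which is short but slower than the O(n log n) mentioned above; the binary search then finds the block of suffixes that begin with a pattern. The function names are made up for this example.

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_range(text, sa, pattern):
    """Half-open range of sa whose suffixes start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                      # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo

text = "she sells sea shells on the sea shore"
sa = suffix_array(text)
start, end = find_range(text, sa, "sea")
occurrences = sorted(sa[start:end])     # positions where "sea" begins
```

All occurrences of a phrase sit next to each other in the array, so listing what follows them means scanning that block, which is the linear scan the slide refers to.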

14
Multi-pass Algorithm

Disk seeks dominate
Minimize disk seeks
Fit within available memory
Disk reads are cheap, seeks are expensive
Make multiple passes over the data, using as little memory as possible

15
Three Phases

Phase 1: count all single words, two-word phrases, three-word phrases
Phase 2: make expansion lists for each phrase
Phase 3: delete uninteresting phrases

16
Phase 1: Count Phrases

Make one pass over the data, counting individual words
Write out all words that appear more than once
Make a second pass over the data, counting pairs of words, where both words appear more than once
Write out all pairs that appear more than once
Make a third pass over the data, counting triples of words, where both overlapping pairs appear more than once
Write out all triples that appear more than once
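
The three passes can be sketched as below. To mimic re-reading the data, the sketch assumes a callable read_words() that returns a fresh iterator over the words for each pass; here it reads from an in-memory list, whereas the slides describe passes over data on disk. All names and the sample text are assumptions for the example.

```python
from collections import Counter, deque

def ngrams(stream, n):
    """Yield each run of n consecutive words as a tuple, reading the stream once."""
    window = deque(maxlen=n)
    for word in stream:
        window.append(word)
        if len(window) == n:
            yield tuple(window)

def frequent(counts):
    """Keep only the phrases that appear more than once."""
    return {phrase: c for phrase, c in counts.items() if c > 1}

def count_phrases(read_words):
    # Pass 1: individual words.
    words = frequent(Counter(ngrams(read_words(), 1)))
    # Pass 2: pairs whose two words are both frequent.
    pairs = frequent(Counter(
        p for p in ngrams(read_words(), 2)
        if p[:1] in words and p[1:] in words))
    # Pass 3: triples whose two overlapping pairs are both frequent.
    triples = frequent(Counter(
        t for t in ngrams(read_words(), 3)
        if t[:2] in pairs and t[1:] in pairs))
    return words, pairs, triples

text = "the old man and the sea the old man and the boy".split()
words, pairs, triples = count_phrases(lambda: iter(text))
```

Each pass only counts candidates built from phrases that survived the previous pass, which is what keeps the memory needed for the counters small.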

17
Phase 1 Output
words: and 31, Gone 2, man 4, old 12, sea 8, the 57, Wind 3, with 17
pairs of words: and the 25, Gone with 2, man and 3, old man 2, The old 5, the sea 3, the Wind 2, with the 13
triples of words: and the sea 3, Gone with the 2, man and the 2, old man and 2, The old man 2, with the Wind 2

18
Phase 2: Make Expansion Lists

Read all pairs of words that appear more than once (from phase 1)
Insert each pair in the list for each word
Read all frequent triples
Insert each triple in the list for each overlapping pair
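
A possible in-memory version of this step, assuming the frequent pairs and triples are available as dictionaries keyed by tuples of words. The function name is invented; the sample counts are a few of the Phase 1 figures, with case folded.

```python
from collections import defaultdict

def expansion_lists(pairs, triples):
    """Map each word to the pairs containing it, and each pair to the triples containing it."""
    expansions = defaultdict(list)
    for pair in sorted(pairs):
        for word in dict.fromkeys(pair):          # each distinct word once
            expansions[(word,)].append(pair)
    for triple in sorted(triples):
        # The two overlapping pairs inside a triple: first two words, last two words.
        for sub in dict.fromkeys((triple[:2], triple[1:])):
            expansions[sub].append(triple)
    return dict(expansions)

pairs = {("old", "man"): 2, ("man", "and"): 3, ("the", "old"): 5}
triples = {("old", "man", "and"): 2, ("the", "old", "man"): 2}
lists = expansion_lists(pairs, triples)
```

With these inputs, "man" expands to "man and" and "old man", and "old man" expands to "old man and" and "the old man".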

19
Phase 2 Output
words: and 31, Gone 2, man 4, old 12, sea 8, the 57, Wind 3, with 17
pairs of words: and the 25, Gone with 2, man and 3, old man 2, The old 5, the sea 3, the Wind 2, with the 13
triples of words: and the sea 3, Gone with the 2, man and the 2, old man and 2, The old man 2, with the Wind 2

20
Phase 3

Delete each phrase in the hierarchy if
it begins or ends in a stopword (man and)
it occurs in a particular longer phrase more than 75% of the time (theoretical computer)
Pointers to that phrase now point to that phrase's expansions
Process is recursive
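
The deletion rules and the pointer redirection can be sketched as follows. The stopword list, the function names, and the counts in the usage example are all hypothetical; only the two rules and the 75% threshold come from the slide.

```python
STOPWORDS = frozenset({"the", "and", "with"})   # hypothetical stopword list

def is_deleted(phrase, counts, expansions, threshold=0.75):
    """Apply the two deletion rules to one phrase (a tuple of words)."""
    if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
        return True
    # Deleted if one particular longer phrase accounts for more than 75% of its occurrences.
    return any(counts[longer] > threshold * counts[phrase]
               for longer in expansions.get(phrase, ()))

def kept_expansions(phrase, counts, expansions):
    """Expansions of phrase after deletion; a deleted phrase is replaced by its own expansions."""
    result = []
    for longer in expansions.get(phrase, ()):
        if is_deleted(longer, counts, expansions):
            result.extend(kept_expansions(longer, counts, expansions))   # recursive step
        else:
            result.append(longer)
    return result

# Made-up counts for illustration only.
tcs = ("theoretical", "computer", "science")
counts = {("computer",): 10, ("theoretical", "computer"): 4,
          ("computer", "science"): 6, tcs: 4}
expansions = {("computer",): [("theoretical", "computer"), ("computer", "science")],
              ("theoretical", "computer"): [tcs],
              ("computer", "science"): [tcs]}
kept = kept_expansions(("computer",), counts, expansions)
```

In this example "theoretical computer" always occurs inside "theoretical computer science", so it is removed and "computer" points directly at the three-word phrase.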

21
Phase 3 Output
words: and 31, Gone 2, man 4, old 12, sea 8, the 57, Wind 3, with 17
pairs of words: and the 25, Gone with 2, man and 3, old man 2, The old 5, the sea 3, the Wind 2, with the 13
triples of words: and the sea 3, Gone with the 2, man and the 2, old man and 2, The old man 2, with the Wind 2
