1
Efficient Unsupervised Discovery of Word
Categories Using Symmetric Patterns and High
Frequency Words
  • Dmitry Davidov, Ari Rappoport
  • The Hebrew University
  • ACL 2006

2
Introduction
  • Goal: discovering word categories, sets of words
    sharing a significant aspect of their meaning
  • Previous approaches: context feature vectors and
    pattern-based discovery
  • Existing pattern-based methods use manually
    prepared pattern sets (ex. x and y) and require
    POS tagging or partial or full parsing

3
Pattern Candidates
  • High frequency word (HFW): a word appearing more
    than TH times per million words
  • Ex. and, or, from, to
  • Content word (CW): a word appearing less than TC
    times per million words (a sketch of this
    classification follows below)
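
A minimal sketch of this frequency-based
classification; the helper name and the use of a raw
token list are illustrative assumptions:

  from collections import Counter

  def classify_words(tokens, th=100, tc=50):
      # Normalize raw counts to occurrences per million words.
      counts = Counter(tokens)
      per_million = 1_000_000 / len(tokens)
      hfw = {w for w, c in counts.items() if c * per_million > th}
      cw = {w for w, c in counts.items() if c * per_million < tc}
      return hfw, cw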

4
Pattern Candidates
  • Meta-patterns obey the following constraints
  • at most 4 words
  • exactly two content words
  • no two consecutive CWs
  • These constraints allow exactly four types
  • CHC, CHCH, CHHC, and HCHC (enumerated in the
    sketch below)
  • Ex. from x to y (HCHC), x and y (CHC),
    x and a y (CHHC)
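
A short sketch that enumerates every H/C string
satisfying these constraints; it produces exactly the
four types listed above:

  from itertools import product

  def valid_meta_patterns(max_len=4):
      # At most max_len words, exactly two CWs ('C'),
      # and no two consecutive CWs.
      found = []
      for n in range(2, max_len + 1):
          for p in product('HC', repeat=n):
              s = ''.join(p)
              if s.count('C') == 2 and 'CC' not in s:
                  found.append(s)
      return found

  print(valid_meta_patterns())  # ['CHC', 'CHCH', 'CHHC', 'HCHC']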

5
Symmetric Patterns
  • In order to find a usable subset of the pattern
    candidates, we focus on symmetric patterns
  • Example
  • x belongs to y (asymmetric relationship)
  • x and y (symmetric relationship)

6
Symmetric Patterns
  • We use a single-pattern graph G(P) to identify
    symmetric patterns
  • There is a directed arc A(x, y) from node x to
    node y iff
  • the words x and y both appear in an instance of
    the pattern P as its two CWs, and
  • x precedes y in P
  • SymG(P), the symmetric subgraph of G(P), contains
    only the bidirectional arcs and nodes of G(P)
    (see the sketch below)
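
A sketch of the construction, assuming pattern
instances arrive as (x, y) pairs of CWs with x
preceding y:

  def build_graphs(instances):
      # G(P): a directed arc (x, y) for every instance of P.
      arcs = set(instances)
      # SymG(P): only the bidirectional arcs and their nodes.
      sym_arcs = {(x, y) for (x, y) in arcs if (y, x) in arcs}
      sym_nodes = {n for arc in sym_arcs for n in arc}
      return arcs, sym_arcs, sym_nodes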

7
Symmetric Patterns
  • We compute three measures on G(P)
  • M1: the proportion of words that can appear in
    both slots of the pattern
  • M2, M3: the proportions of symmetric nodes and of
    symmetric arcs in G(P), respectively (sketched
    below)
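
A sketch of the three measures computed from the arcs
of G(P); the exact normalizations are assumptions:

  def symmetry_measures(arcs):
      left = {x for x, _ in arcs}    # words seen in the first slot
      right = {y for _, y in arcs}   # words seen in the second slot
      words = left | right
      sym = {(x, y) for (x, y) in arcs if (y, x) in arcs}
      sym_nodes = {n for a in sym for n in a}
      m1 = len(left & right) / len(words)  # words usable in both slots
      m2 = len(sym_nodes) / len(words)     # symmetric-node proportion
      m3 = len(sym) / len(arcs)            # symmetric-arc proportion
      return m1, m2, m3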

8
Symmetric Patterns
  • We remove patterns that appear in the corpus
    less than TP times per million words
  • We remove patterns that are not in the top ZT of
    any of the three lists
  • We remove patterns that are in the bottom ZB of
    at least one of the lists (see the sketch below)
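
A sketch of the rank-based filtering (the TP
frequency filter is assumed to have been applied
already):

  def filter_patterns(scores, zt=100, zb=100):
      # scores: {pattern: (m1, m2, m3)} for all candidates.
      tops, bottoms = set(), set()
      for i in range(3):
          ranked = sorted(scores, key=lambda p: scores[p][i],
                          reverse=True)
          tops |= set(ranked[:zt])       # top ZT of some list
          bottoms |= set(ranked[-zb:])   # bottom ZB of some list
      return tops - bottoms              # keep top, drop bottom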

9
Discovery of Categories
  • Words that are highly interconnected are good
    candidates to form a category
  • Word relationship graph G
  • merging all of the single-pattern graphs into a
    single unified graph
  • Requiring full cliques in G is overly strict and
    computationally costly, so we use a relaxed
    clique-set method (next slide)

10
The Clique-Set Method
  • Strong n-cliques
  • subgraphs containing n nodes that are all
    bidirectionally interconnected
  • A clique Q defines a category that contains the
    nodes in Q plus all of the nodes that are
  • (1) at least unidirectionally connected to all
    nodes in Q
  • (2) bidirectionally connected to at least one
    node in Q

11
The Clique-Set Method
  • We use 2-cliques (single symmetric arcs)
  • For each 2-clique, a category is generated that
    contains the nodes of all triangles that include
    the 2-clique and at least one additional
    bidirectional arc (see the sketch below)
  • For example, if the corpus contains book and
    newspaper, newspaper and book, book and note,
    note and book and note and newspaper, then
    book, newspaper, note form a category
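
A sketch of this category generation over the unified
graph G, given as a set of directed arcs (node
ordering is only used to visit each 2-clique once):

  def categories_from_2cliques(arcs):
      sym = {(x, y) for (x, y) in arcs if (y, x) in arcs}
      nodes = {n for arc in arcs for n in arc}
      connected = lambda a, b: (a, b) in arcs or (b, a) in arcs
      categories = []
      for (a, b) in sym:
          if a >= b:          # each symmetric pair once
              continue
          cat = {a, b}
          for c in nodes - {a, b}:
              # a triangle containing the 2-clique, with at
              # least one additional bidirectional arc
              if connected(a, c) and connected(b, c) and \
                 ((a, c) in sym or (b, c) in sym):
                  cat.add(c)
          categories.append(cat)
      return categories

On the example above, the symmetric arc (book,
newspaper) yields the category {book, newspaper, note}.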

12
The Clique-Set Method
  • A pair of nodes connected by a symmetric arc can
    appear in more than a single category
  • Suppose the nodes A, B, C, D, E define exactly
    three strongly connected triangles: ABC, ABD, ACE
  • Arc (A, B) yields the category {A, B, C, D}
  • Arc (A, C) yields the category {A, C, B, E}

13
Category Merging
  • In order to cover as many words as possible, we
    use the smallest clique, a single symmetric arc.
    This creates redundant categories
  • We use two simple merge heuristics
  • If two categories are identical, we treat them as
    one
  • Given two categories Q, R, we merge them iff
    there's more than a 50% overlap between them
    (see the sketch below)
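
A sketch of the merge pass; measuring overlap against
the smaller category is an assumption:

  def merge_categories(categories):
      merged = []
      for cat in categories:
          cat = set(cat)
          for m in merged:
              overlap = len(cat & m) / min(len(cat), len(m))
              if overlap > 0.5:   # also catches identical sets
                  m |= cat        # merge in place
                  break
          else:
              merged.append(cat)
      return merged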

14
Corpus Windowing
  • In order to increase category quality and remove
    categories that are too context-specific
  • Instead of running the algorithm of this section
    on the whole corpus, we divide the corpus into
    windows of equal size
  • we select only those categories that appear in at
    least two of the windows (sketched below)
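
A sketch of the windowing step; counting only exact
category matches across windows is a simplifying
assumption:

  def windowed_categories(tokens, n_windows, discover):
      # discover: a function mapping a token list to a list
      # of categories (sets of words); assumed given.
      size = len(tokens) // n_windows
      counts = {}
      for i in range(n_windows):
          window = tokens[i * size:(i + 1) * size]
          for cat in discover(window):
              key = frozenset(cat)
              counts[key] = counts.get(key, 0) + 1
      # keep categories that appear in at least two windows
      return [set(c) for c, n in counts.items() if n >= 2]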

15
Evaluation
  • Three corpora: two English and one Russian
  • BNC, containing about 100M words
  • Dmoz, 68GB, containing about 8.2G words from 50M
    web pages
  • The Russian corpus was assembled from many web
    sites and carefully filtered for duplicates,
    yielding 33GB and 4G words

16
Evaluation
  • Thresholds
  • TH, TC, TP, ZT, ZB were determined by memory
    size: 100, 50, 20, 100, 100
  • Window size
  • About 5% of the different words in the corpus
    were assigned into categories

Table legend: V is the number of different words in
the corpus, W is the number of words belonging to at
least one category, C is the number of categories,
and AS is the average category size.
17
Human Judgment Evaluation
  • Part I
  • Subjects were given 40 triplets of words and were
    asked to rank them using 1-5 (1 is best)
  • The 40 triplets were obtained as follows
  • 20 triplets came from 20 of our categories,
    selected at random from the non-overlapping
    categories
  • 10 triplets were selected from the categories
    produced by k-means (Pantel and Lin, 2002)
  • 10 triplets were generated by random selection of
    content words

18
Human Judgment Evaluation
  • Part II, subjects were given the full categories
    of the triplets that were graded as 1 or 2 in
    Part I
  • They were asked to grade the categories from 1
    (worst) to 10 (best) according to how much the
    full category had met the expectations they had
    when seeing only the triplet

19
Human Judgment Evaluation
  • Shared meaning is measured as the average
    percentage of triplets that were given scores
    of 1 or 2

20
WordNet-Based Evaluation
  • We took 10 WordNet (WN) subsets, referred to as
    subjects
  • We selected at random 10 pairs of words from each
    subject
  • For each pair we found the largest of our
    discovered categories containing it

21
WordNet-Based Evaluation
  • For each found category C containing N words
  • Precision: the number of words present in both C
    and WN, divided by N
  • Precision*: the number of correct words divided
    by N (correct words either appear in the WN
    subtree or have an American Heritage Dictionary
    entry defining them as related to the subject)
  • Recall: the number of words present in both C and
    WN, divided by the number of (single) words in WN
    (see the sketch below)
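
A sketch of the set-based precision and recall (the
dictionary check behind precision* is omitted):

  def precision_recall(category, wn_words):
      # category: discovered category C (a set of N words);
      # wn_words: the (single) words in the WN subtree.
      overlap = category & wn_words
      precision = len(overlap) / len(category)
      recall = len(overlap) / len(wn_words)
      return precision, recall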

22
WordNet-Based Evaluation
  • The table reports the number of correctly
    discovered words (New) that are not in WN, and
    the number of WN words (WN)

23
WordNet-Based Evaluation
  • BNC all-words precision is 90.47%. This metric
    was reported to be 82% in (Widdows and Dorow,
    2002)

24
Discussion
  • Evaluating the task is difficult
  • Ex. nightcrawlers, chicken, shrimp, liver,
    leeches (all of these can serve as fishing
    baits)
  • Ours is the first pattern-based lexical
    acquisition method that is fully unsupervised,
    requiring no corpus annotation or manually
    provided patterns or words
  • Categories are generated using a novel graph
    clique-set algorithm