1
Efficient Unsupervised Discovery of Word
Categories Using Symmetric Patterns and High
Frequency Words
  • Dmitry Davidov, Ari Rappoport
  • The Hebrew University
  • ACL 2006

2
Introduction
  • Goal: discovering word categories, sets of words
    sharing a significant aspect of their meaning
  • Previous approaches: context feature vectors and
    pattern-based discovery
  • Existing pattern-based methods use manually
    prepared pattern sets (ex. x and y) and require
    POS tagging or partial or full parsing

3
Pattern Candidates
  • High frequency word (HFW): a word appearing more
    than TH times per million words
  • Ex. and, or, from, to
  • Content word (CW): a word appearing less than TC
    times per million words (a sketch of this
    classification follows below)
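
A minimal sketch of this frequency-based
classification; the helper name and the use of a raw
token list are illustrative assumptions:

  from collections import Counter

  def classify_words(tokens, th=100, tc=50):
      # Normalize raw counts to occurrences per million words.
      counts = Counter(tokens)
      per_million = 1_000_000 / len(tokens)
      hfw = {w for w, c in counts.items() if c * per_million > th}
      cw = {w for w, c in counts.items() if c * per_million < tc}
      return hfw, cw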

4
Pattern Candidates
  • Meta-patterns obey the following constraints
  • at most 4 words
  • exactly two content words
  • no two consecutive CWs
  • These constraints allow exactly four types
  • CHC, CHCH, CHHC, and HCHC (enumerated in the
    sketch below)
  • Ex. from x to y (HCHC), x and y (CHC),
    x and a y (CHHC)
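
A short sketch that enumerates every H/C string
satisfying these constraints; it produces exactly the
four types listed above:

  from itertools import product

  def valid_meta_patterns(max_len=4):
      # At most max_len words, exactly two CWs ('C'),
      # and no two consecutive CWs.
      found = []
      for n in range(2, max_len + 1):
          for p in product('HC', repeat=n):
              s = ''.join(p)
              if s.count('C') == 2 and 'CC' not in s:
                  found.append(s)
      return found

  print(valid_meta_patterns())  # ['CHC', 'CHCH', 'CHHC', 'HCHC']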

5
Symmetric Patterns
  • In order to find a usable subset of the pattern
    candidates, we focus on symmetric patterns
  • Example
  • x belongs to y (asymmetric relationship)
  • x and y (symmetric relationship)

6
Symmetric Patterns
  • We use a single-pattern graph G(P) to identify
    symmetric patterns
  • There is a directed arc A(x, y) from node x to
    node y iff
  • the words x and y both appear in an instance of
    the pattern P as its two CWs, and
  • x precedes y in P
  • SymG(P), the symmetric subgraph of G(P), contains
    only the bidirectional arcs and nodes of G(P)
    (see the sketch below)
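
A sketch of the construction, assuming pattern
instances arrive as (x, y) pairs of CWs with x
preceding y:

  def build_graphs(instances):
      # G(P): a directed arc (x, y) for every instance of P.
      arcs = set(instances)
      # SymG(P): only the bidirectional arcs and their nodes.
      sym_arcs = {(x, y) for (x, y) in arcs if (y, x) in arcs}
      sym_nodes = {n for arc in sym_arcs for n in arc}
      return arcs, sym_arcs, sym_nodes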

7
Symmetric Patterns
  • We compute three measures on G(P)
  • M1: the proportion of words that can appear in
    both slots of the pattern
  • M2, M3: the proportions of symmetric nodes and of
    symmetric arcs in G(P), respectively (sketched
    below)
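
A sketch of the three measures computed from the arcs
of G(P); the exact normalizations are assumptions:

  def symmetry_measures(arcs):
      left = {x for x, _ in arcs}    # words seen in the first slot
      right = {y for _, y in arcs}   # words seen in the second slot
      words = left | right
      sym = {(x, y) for (x, y) in arcs if (y, x) in arcs}
      sym_nodes = {n for a in sym for n in a}
      m1 = len(left & right) / len(words)  # words usable in both slots
      m2 = len(sym_nodes) / len(words)     # symmetric-node proportion
      m3 = len(sym) / len(arcs)            # symmetric-arc proportion
      return m1, m2, m3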

8
Symmetric Patterns
  • We remove patterns that appear in the corpus
    less than TP times per million words
  • We remove patterns that are not in the top ZT of
    any of the three lists
  • We remove patterns that are in the bottom ZB of
    at least one of the lists (see the sketch below)
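
A sketch of the rank-based filtering (the TP
frequency filter is assumed to have been applied
already):

  def filter_patterns(scores, zt=100, zb=100):
      # scores: {pattern: (m1, m2, m3)} for all candidates.
      tops, bottoms = set(), set()
      for i in range(3):
          ranked = sorted(scores, key=lambda p: scores[p][i],
                          reverse=True)
          tops |= set(ranked[:zt])       # top ZT of some list
          bottoms |= set(ranked[-zb:])   # bottom ZB of some list
      return tops - bottoms              # keep top, drop bottom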

9
Discovery of Categories
  • Words that are highly interconnected are good
    candidates to form a category
  • Word relationship graph G
  • merging all of the single-pattern graphs into a
    single unified graph
  • Requiring full cliques in G is overly strict and
    computationally costly, so we use a relaxed
    clique-set method (next slide)

10
The Clique-Set Method
  • Strong n-cliques
  • subgraphs containing n nodes that are all
    bidirectionally interconnected
  • A clique Q defines a category that contains the
    nodes in Q plus all of the nodes that are
  • (1) at least unidirectionally connected to all
    nodes in Q
  • (2) bidirectionally connected to at least one
    node in Q

11
The Clique-Set Method
  • We use 2-cliques (single symmetric arcs)
  • For each 2-clique, a category is generated that
    contains the nodes of all triangles that include
    the 2-clique and at least one additional
    bidirectional arc (see the sketch below)
  • For example, if the corpus contains book and
    newspaper, newspaper and book, book and note,
    note and book and note and newspaper, then
    book, newspaper, note form a category
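
A sketch of this category generation over the unified
graph G, given as a set of directed arcs (node
ordering is only used to visit each 2-clique once):

  def categories_from_2cliques(arcs):
      sym = {(x, y) for (x, y) in arcs if (y, x) in arcs}
      nodes = {n for arc in arcs for n in arc}
      connected = lambda a, b: (a, b) in arcs or (b, a) in arcs
      categories = []
      for (a, b) in sym:
          if a >= b:          # each symmetric pair once
              continue
          cat = {a, b}
          for c in nodes - {a, b}:
              # a triangle containing the 2-clique, with at
              # least one additional bidirectional arc
              if connected(a, c) and connected(b, c) and \
                 ((a, c) in sym or (b, c) in sym):
                  cat.add(c)
          categories.append(cat)
      return categories

On the example above, the symmetric arc (book,
newspaper) yields the category {book, newspaper, note}.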

12
The Clique-Set Method
  • A pair of nodes connected by a symmetric arc can
    appear in more than a single category
  • Suppose the nodes A, B, C, D, E define exactly
    three strongly connected triangles: ABC, ABD, ACE
  • Arc (A, B) yields the category {A, B, C, D}
  • Arc (A, C) yields the category {A, C, B, E}

13
Category Merging
  • In order to cover as many words as possible, we
    use the smallest clique, a single symmetric arc.
    This creates redundant categories
  • We use two simple merge heuristics
  • If two categories are identical, we treat them as
    one
  • Given two categories Q, R, we merge them iff
    there's more than a 50% overlap between them
    (see the sketch below)
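
A sketch of the merge pass; measuring overlap against
the smaller category is an assumption:

  def merge_categories(categories):
      merged = []
      for cat in categories:
          cat = set(cat)
          for m in merged:
              overlap = len(cat & m) / min(len(cat), len(m))
              if overlap > 0.5:   # also catches identical sets
                  m |= cat        # merge in place
                  break
          else:
              merged.append(cat)
      return merged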

14
Corpus Windowing
  • In order to increase category quality and remove
    categories that are too context-specific
  • Instead of running the algorithm of this section
    on the whole corpus, we divide the corpus into
    windows of equal size
  • we select only those categories that appear in at
    least two of the windows (sketched below)
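
A sketch of the windowing step; counting only exact
category matches across windows is a simplifying
assumption:

  def windowed_categories(tokens, n_windows, discover):
      # discover: a function mapping a token list to a list
      # of categories (sets of words); assumed given.
      size = len(tokens) // n_windows
      counts = {}
      for i in range(n_windows):
          window = tokens[i * size:(i + 1) * size]
          for cat in discover(window):
              key = frozenset(cat)
              counts[key] = counts.get(key, 0) + 1
      # keep categories that appear in at least two windows
      return [set(c) for c, n in counts.items() if n >= 2]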

15
Evaluation
  • Three corpora: two English and one Russian
  • BNC, containing about 100M words
  • Dmoz, 68GB, containing about 8.2G words from 50M
    web pages
  • The Russian corpus was assembled from many web
    sites and carefully filtered for duplicates,
    yielding 33GB and 4G words

16
Evaluation
  • Thresholds
  • TH, TC, TP, ZT, ZB were determined by memory
    size: 100, 50, 20, 100, 100
  • Window size
  • About 5% of the different words in the corpus
    were assigned into categories

Table legend: V is the number of different words in
the corpus, W is the number of words belonging to at
least one category, C is the number of categories,
and AS is the average category size.
17
Human Judgment Evaluation
  • Part I
  • Subjects were given 40 triplets of words and were
    asked to rank them using 1-5 (1 is best)
  • The 40 triplets were obtained as follows
  • 20 triplets came from 20 of our categories,
    selected at random from the non-overlapping
    categories
  • 10 triplets were selected from the categories
    produced by k-means (Pantel and Lin, 2002)
  • 10 triplets were generated by random selection of
    content words

18
Human Judgment Evaluation
  • Part II, subjects were given the full categories
    of the triplets that were graded as 1 or 2 in
    Part I
  • They were asked to grade the categories from 1
    (worst) to 10 (best) according to how much the
    full category had met the expectations they had
    when seeing only the triplet

19
Human Judgment Evaluation
  • Shared meaning is measured as the average
    percentage of triplets that were given scores
    of 1 or 2

20
WordNet-Based Evaluation
  • We took 10 WordNet (WN) subsets, referred to as
    subjects
  • We selected at random 10 pairs of words from each
    subject
  • For each pair we found the largest of our
    discovered categories containing it

21
WordNet-Based Evaluation
  • For each found category C containing N words
  • Precision: the number of words present in both C
    and WN, divided by N
  • Precision*: the number of correct words divided
    by N (correct words either appear in the WN
    subtree or have an American Heritage Dictionary
    entry defining them as related to the subject)
  • Recall: the number of words present in both C and
    WN, divided by the number of (single) words in WN
    (see the sketch below)
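
A sketch of the set-based precision and recall (the
dictionary check behind precision* is omitted):

  def precision_recall(category, wn_words):
      # category: discovered category C (a set of N words);
      # wn_words: the (single) words in the WN subtree.
      overlap = category & wn_words
      precision = len(overlap) / len(category)
      recall = len(overlap) / len(wn_words)
      return precision, recall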

22
WordNet-Based Evaluation
  • The table reports the number of correctly
    discovered words (New) that are not in WN, and
    the number of WN words (WN)

23
WordNet-Based Evaluation
  • BNC all-words precision is 90.47%. This metric
    was reported to be 82% in (Widdows and Dorow,
    2002)

24
Discussion
  • Evaluating the task is difficult
  • Ex. nightcrawlers, chicken, shrimp, liver,
    leeches (all of these can serve as fishing
    baits)
  • Ours is the first pattern-based lexical
    acquisition method that is fully unsupervised,
    requiring no corpus annotation or manually
    provided patterns or words
  • Categories are generated using a novel graph
    clique-set algorithm