Title: GLSA Server PARC
1GLSA Server @PARC
- Christiaan Royer, Ayman Farahat, Peter Pirolli
- Presenter: Raluca Budiu
- (budiu_at_parc.com)
2Functionality
http://glsa.parc.com
5Outline
- Similarity
- PMI and strength of association
- Dimensionality reduction
- Corpus
- Parameters
- ACT-R Interface
- Future work
6Strength of Association
The association between words reflects their
odds of occurring together
7Semantic Similarity Math
(Pointwise Mutual Information)
(Farahat, Pirolli, & Markova, 2004; Pirolli, 2005)
8Information Retrieval
- Pointwise mutual information between two words
- PMI(w1, w2) = log [ P(w1, w2) / ( P(w1) P(w2) ) ]
In general, pointwise mutual information is the
reduction in uncertainty of one random variable
due to knowing about the other
(Manning and Schutze, 1999)
9Computing PMIs
- Estimate probabilities using frequency counts of words in a large corpus of N documents
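A minimal sketch of this estimation step, assuming probabilities are taken as document frequencies divided by N (the slide does not show the exact estimator, so this is an assumption). The function name `document_pmi` and the toy documents are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter
from itertools import combinations

def document_pmi(documents):
    """Estimate PMI for every word pair that co-occurs in at least one document.

    Assumption: P(w) is the fraction of documents containing w, and
    P(w1, w2) is the fraction of documents containing both words.
    """
    n_docs = len(documents)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        word_df.update(words)
        pair_df.update(combinations(words, 2))

    pmi = {}
    for (w1, w2), n12 in pair_df.items():
        p12 = n12 / n_docs
        p1 = word_df[w1] / n_docs
        p2 = word_df[w2] / n_docs
        pmi[(w1, w2)] = math.log(p12 / (p1 * p2))
    return pmi

# Toy corpus of N = 4 documents (made up for illustration)
docs = ["banana fruit", "banana fruit", "car engine", "car road"]
scores = document_pmi(docs)
print(scores[("banana", "fruit")])
```

Pairs that never co-occur get no entry here; a real system would need a policy for them (for example, a floor value), which the slides do not specify.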
10Dimensionality Reduction
Given a vocabulary V of v words
1. Build a v x v matrix R of strengths of association (PMIs)
2. Reduce the dimension of R to v x k (k = number of eigenvectors)
3. Compute similarity using the cosine measure
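The three steps above can be sketched with NumPy as follows. This is an illustration under assumptions, not the server's implementation: it assumes R is symmetric, keeps the k eigenvectors with the largest eigenvalues in magnitude, and scales them by the square root of the eigenvalue (one common convention). The function name `reduce_and_compare` and the toy matrix are hypothetical.

```python
import numpy as np

def reduce_and_compare(R, k):
    """Project a symmetric v x v association matrix onto k leading
    eigenvectors and return the v x v matrix of cosine similarities."""
    R = np.asarray(R, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(R)         # eigenvalues in ascending order
    top = np.argsort(np.abs(eigvals))[::-1][:k]  # k largest in magnitude
    # Assumed scaling: weight each eigenvector by sqrt(|eigenvalue|)
    vectors = eigvecs[:, top] * np.sqrt(np.abs(eigvals[top]))  # shape v x k
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                      # avoid division by zero
    unit = vectors / norms
    return unit @ unit.T                         # cosine of every word pair

# Toy association matrix over (banana, fruit, car, road); values are made up
R = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 3, 1],
     [0, 0, 1, 3]]
sim = reduce_and_compare(R, k=2)
print(np.round(sim, 3))
```

In this toy case the two retained dimensions separate the food words from the vehicle words, so words within a group come out as near-identical and words across groups as unrelated.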
11Dimensionality Reduction
- Other techniques of dimensionality reduction can
be used (e.g., Hellinger metric)
- Dimensionality reduction is a form of smoothing: it takes
into account similarity to other terms
while measuring the similarity between a specific
pair of terms
12The Corpus
- The first 10 million pages of the Stanford
Webbase project (generated by web crawling)
www.microsoft.com 10002
www.google.com 10002
www.w3.org 10001
www.whitehouse.gov 10002
www.apple.com 2
www.epa.gov 10002
www.yahoo.com 237
www.cdc.gov 10002
www.pbs.org 10002
www.un.org 10001
www.access.gpo.gov 2
http://dbpubs.stanford.edu:8091/testbed/doc2/WebBase/
13Number of Eigen Values
- Too high: overgeneralization
- Too low: overfitting (i.e., taking too much
noise into account)
- For several synonymy problem spaces,
n = 150-400 gave the best results
14Why Not Use Just PMIs?
- GLSA was compared with PMI on several synonymy
tests (TOEFL, TSL1, TSL2) (using a different
corpus made of New York Times articles)
- It achieved a performance of at least 70% (80% for TOEFL
and TS2; 70% for TS1)
- PMI alone was consistently worse (65-70%),
although still comparable with humans
15ACT-R Format
(chunk-type meaning)
(defun sym-add-ia-fct (chunk1 chunk2 ia)
  (if ia
      (first (eval `(no-output (add-ia (,chunk1 ,chunk2 ,ia)
                                       (,chunk2 ,chunk1 ,ia)))))
      0))
(defmacro sym-add-ia (chunk1 chunk2 ia)
  `(sym-add-ia-fct ',chunk1 ',chunk2 ',ia))
(add-dm (banana isa meaning))
(add-dm (fruit isa meaning))
(add-dm (vegetable isa meaning))

(sym-add-ia banana banana 1.0000008)
(sym-add-ia banana fruit 0.08276595)
(sym-add-ia banana vegetable -0.013203714)
(sym-add-ia banana citrus 0.06277088)
16 Declare a meaning chunk type
(chunk-type meaning)
Declare a function that maps GLSA values onto
ias (it can be modified)
(defun sym-add-ia-fct (chunk1 chunk2 ia)
  (if ia
      (first (eval `(no-output (add-ia (,chunk1 ,chunk2 ,ia)
                                       (,chunk2 ,chunk1 ,ia)))))
      0))
Add words to memory as chunks of type meaning
(add-dm (banana isa meaning))
(add-dm (fruit isa meaning))
Set ias
(sym-add-ia banana banana 1.0000008)
(sym-add-ia banana fruit 0.08276595)
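For readers who hold similarity values in another form, a small sketch of how lines in the ACT-R format above could be generated. The helper name `actr_commands` is hypothetical, and the output deliberately omits the `sym-add-ia-fct` and `sym-add-ia` definitions, which would still have to be loaded first.

```python
def actr_commands(triples):
    """Render (word1, word2, similarity) triples as ACT-R model lines
    in the format shown on the slides."""
    words = []
    for w1, w2, _ in triples:
        for w in (w1, w2):
            if w not in words:  # keep first-seen order, no duplicates
                words.append(w)
    lines = ["(chunk-type meaning)"]
    lines += [f"(add-dm ({w} isa meaning))" for w in words]
    lines += [f"(sym-add-ia {w1} {w2} {sim})" for w1, w2, sim in triples]
    return "\n".join(lines)

out = actr_commands([
    ("banana", "banana", 1.0000008),
    ("banana", "fruit", 0.08276595),
])
print(out)
```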
17Text Format
- More convenient if you want to parse it into your
own format
banana banana 1.0000008
banana fruit 0.08276595
banana vegetable -0.013203714
banana citrus 0.06277088
banana tropical 0.066639036
banana nut 0.035970457
banana grain 0.020719754
banana monkey 0.081182115
banana elephant -0.028544774
banana eat -0.0042282715
banana person 0.03531764
fruit fruit 1.0000001
fruit vegetable 0.24325794
fruit citrus 0.18533087
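A minimal sketch of parsing the text format into a lookup table, assuming the output is a whitespace-separated stream of "word1 word2 value" triples (as in the sample above). The function name `parse_similarities` is hypothetical.

```python
def parse_similarities(text):
    """Parse whitespace-separated 'word1 word2 value' triples into a dict
    keyed by (word1, word2)."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected a whole number of word1 word2 value triples")
    table = {}
    for i in range(0, len(tokens), 3):
        w1, w2, value = tokens[i], tokens[i + 1], float(tokens[i + 2])
        table[(w1, w2)] = value
    return table

sample = """banana banana 1.0000008
banana fruit 0.08276595
banana vegetable -0.013203714
fruit vegetable 0.24325794"""
table = parse_similarities(sample)
print(table[("banana", "fruit")])
```

Splitting on any whitespace rather than on line breaks means the parser also accepts output in which the triples run together on one line.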
18Future Work
- Corpus expansion and modification
- Collect misses and add them periodically to the
corpus
- See what documents need to be removed from the
corpus and what documents need to be added
- Analyze word frequency and decide whether it's
representative of the web's / a person's vocabulary
- TASA corpus as an option
- How does raw PMI compare with GLSA?
- Have a raw PMI server available
- Add word frequency counts
- Add the option to upload a file
19Suggestions?