Title: Latent Semantic Analysis
1. Latent Semantic Analysis
- Is it a solution to Plato's problem?
- And 10 other questions answered.
2. 10 questions
- How did this paper change our lives?
- What is Plato's problem?
- Oh no! Not more philosophy?
- How can Plato's problem be solved?
- What kind of solution do we need?
- What is latent semantic analysis?
- How is an LSA model constructed?
- How is the LSA model used?
- What's a cosine between vectors?
- What are some cool empirical findings?
- Is LSA psychologically plausible?
3. How did this paper change our lives?
- Because I saw a talk by Landauer on this work, I became interested in latent semantic analysis (LSA).
- Because I was interested in LSA, I became interested in Curt Burgess's HAL model.
- Because I was interested in HAL, I decided to come to Edmonton, where Lori Buchanan was working on it.
- Because I came to Edmonton, here I am teaching Psych 357.
- If Landauer hadn't written this paper, we probably wouldn't have the mutual pleasure of knowing each other as we do.
4. What is Plato's problem?
- Meno (in the Platonic dialog named after him) asks: how can one ever investigate what one does not know?
- He saw two problems:
- i.) How can you propose what you do not know as the object of your search?
- ii.) How will you recognize what you do not know as the thing you did not know if you do (by chance) find it?
- More generally, the problem is that there is a gap between what we experience and what we know, with the latter seeming to be larger than the former is able to support.
5. Oh no! Not more philosophy?
- Not at all (indeed, the opposite).
- Plato's problem is exactly the poverty-of-the-stimulus/failure-of-induction problem.
- It is thus central to syntactic knowledge as well as to many other dimensions of linguistic knowledge (wherever we make fine-grained untaught distinctions, e.g. prosody, phonology, and semantics).
6. How can Plato's problem be solved?
- i.) Plato's solution was recollection of knowledge gained in a previous life, famously demonstrated in the Meno by showing that a slave boy 'knows' the Pythagorean Theorem.
- ii.) Some favour the idea of innate knowledge, the modern equivalent of recollection of a previous life.
- The basic common principle is one we already know and love in Psych 357: we need some source of strong additional constraints on the problem (information) to narrow down the size of the search space.
7. What kind of solution do we need?
- That is: what properties are desirable in a scientifically acceptable explanation of how constraints on a search space operate?
- i.) They must be sufficient.
- ii.) They must be well-defined.
- iii.) They must be psychologically plausible.
8. What is latent semantic analysis?
- LSA is an algorithmically well-defined way of measuring lexical co-occurrence in some set of texts.
- The assumption is that co-occurrence says something about semantics: words about the same things are likely to occur in the same contexts.
- If we have many words and contexts, small differences in co-occurrence probabilities can be compiled together to give information about semantics.
- Think of 20 Questions: no single question might be sufficient to identify an unknown object, but 20 questions usually are sufficient.
9. How is an LSA model constructed?
- i.) Build a matrix with rows representing words and columns representing contexts (a document or word string).
- ii.) Enter in each cell (a word × document intersection) a count of how many times that word occurred in that document.
- iii.) Transform the matrix.
10. i.) Build a matrix with rows representing words and columns representing contexts (a document or word string):

             Sonnets   Learn C   A day at the zoo
  dog
  zebra
  computer
11. ii.) Enter in each cell (a word × document intersection) a count of how many times that word occurred in that document:

             Sonnets   Learn C   A day at the zoo
  dog              6         1                  7
  zebra            0         2                 46
  computer         0       123                  0
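- As a rough illustration (not from the paper), here is a minimal Python sketch of steps i.) and ii.): building a word × document count matrix from a few invented snippets of text, reusing the slide's three example contexts as document names.

```python
import numpy as np

# Toy corpus: three "contexts" (documents). The texts are invented
# purely for illustration.
docs = {
    "Sonnets":          "shall I compare thee to a summer day the dog barks",
    "Learn C":          "the computer compiles the program and the computer runs it",
    "A day at the zoo": "the zebra and the dog watched the lion at the zoo",
}

# i.) Rows represent words, columns represent contexts.
vocab = sorted({w for text in docs.values() for w in text.lower().split()})
doc_names = list(docs)

# ii.) Each cell holds how many times that word occurred in that document.
counts = np.zeros((len(vocab), len(doc_names)), dtype=float)
for j, name in enumerate(doc_names):
    for w in docs[name].lower().split():
        counts[vocab.index(w), j] += 1

print(doc_names)
for w, row in zip(vocab, counts):
    print(f"{w:10s}", row)
```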
14. iii.) Transform the matrix
- a.) Control for word frequency.
- The log transform compresses the effects of frequency.
- b.) Control for the number of contexts each word appeared in.
- Words that occur in few contexts are more informative about those contexts (they reduce uncertainty about their context more) than words that appear in many different contexts.
- E.g. knowing the word 'computer' was common places more constraint on what the document is about than knowing the word 'the' was common.
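- A Python sketch of steps iii.a and iii.b, using the common 'log-entropy' style of weighting; the exact transform used in the paper may differ, and the function name here is just illustrative.

```python
import numpy as np

def transform(counts, eps=1e-12):
    """Log-entropy style weighting (a sketch; not necessarily the paper's
    exact scheme).

    a.) log(1 + count) compresses the effect of raw word frequency.
    b.) Weighting by 1 minus the word's normalized entropy across contexts
        down-weights words like 'the' that occur everywhere and up-weights
        words concentrated in a few contexts.
    """
    counts = np.asarray(counts, dtype=float)
    log_counts = np.log(1.0 + counts)                     # a.) frequency compression

    row_totals = counts.sum(axis=1, keepdims=True) + eps
    p = counts / row_totals                               # P(context | word)
    n_docs = counts.shape[1]
    norm = np.log(n_docs) if n_docs > 1 else 1.0
    entropy = -(p * np.log(p + eps)).sum(axis=1) / norm   # 0 = concentrated, 1 = spread out
    weight = 1.0 - entropy                                # informative words get high weight

    return log_counts * weight[:, np.newaxis]
```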
15. iii.) Transform the matrix
- c.) Singular value decomposition (SVD).
- This reduces dimensionality by 'projecting' the tens of thousands of context dimensions onto a smaller number (roughly 300).
- A mathematical projection is roughly the same as a real projection: think of shining a light through a three-dimensional pattern and tracing the shadow it casts to get a two-dimensional projection.
- The 'discarded' dimensions are those that are least informative: they have low variance or are redundant (e.g. a word like 'the' occurred in every context, or a word like 'antidisestablishmentarianism' occurred in hardly any contexts).
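- A Python sketch of step iii.c: a truncated singular value decomposition with numpy, keeping only the k strongest dimensions (around 300 in practice, far fewer for a toy matrix). The function and variable names are illustrative, not the paper's code.

```python
import numpy as np

def reduce_dimensions(weighted, k=300):
    """Project the word x context matrix onto its k strongest dimensions.

    U has one row per word, S the singular value (importance) of each
    dimension, Vt one row per dimension over contexts. Keeping only the
    top-k dimensions discards the least informative ones.
    """
    U, S, Vt = np.linalg.svd(weighted, full_matrices=False)
    k = min(k, len(S))
    word_vectors = U[:, :k] * S[:k]       # reduced word representations
    doc_vectors = Vt[:k, :].T * S[:k]     # reduced context representations
    return word_vectors, doc_vectors
```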
16. How is the LSA model used?
- To get a measure of how related one word is to another, measure the distance between the rows (word vectors) for the two words.
- This gives you a measure of how different the contexts of the two words were, that is, how often the two words occurred different numbers of times in the same contexts.
- You can also take the distance between two document vectors to get a measure of how related they are.
- You can measure distance by taking the cosine between two vectors.
17. Huh? What's a cosine between vectors?
- They probably forgot to mention in your Grade 9 trigonometry class (as they did in mine) that cosine is extensible to dimensions above two.
- Typical teaching: always the special case, never the general.
- The dot product of two vectors is the sum of the products of corresponding entries in the two vectors, i.e. (x1·x2) + (y1·y2) + (z1·z2) for two vectors (x1, y1, z1) and (x2, y2, z2) of length 3.
- The dot product of two vectors is the cosine of the angle between those two vectors, multiplied by the lengths of those vectors.
- Therefore, cosine is the dot product divided by the product of the two vector lengths.
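- A minimal Python sketch of exactly that recipe: cosine = dot product divided by the product of the two vector lengths, which works for vectors of any dimensionality. The word_vectors and vocab names refer back to the earlier sketches and are assumptions, not the paper's code.

```python
import numpy as np

def cosine(x, y):
    """Cosine of the angle between two vectors (any number of dimensions):
    the dot product divided by the product of the two vector lengths."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# e.g. similarity between two word rows from the reduced matrix:
# cosine(word_vectors[vocab.index("zebra")], word_vectors[vocab.index("dog")])
```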
18. What are some cool empirical findings?
- i.) LSA models can pass the TOEFL.
- ii.) LSA can learn the meanings of words it has never encountered.
- iii.) LSA can explain some priming effects.
- iv.) LSA replicates human number judgments.
- v.) LSA can mark essays.
- vi.) LSA-like measures predict lexical decision (LD) RTs.
19. i.) LSA models can pass the TOEFL
- On a 4-alternative multiple-choice TOEFL, the model got 51.5% correct (corrected for guessing).
- The chance score is 25%.
- Real foreigners hoping to attend American universities averaged 52.7%.
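- One way such an item can be answered with the model (a sketch of the idea, not the paper's actual scoring code): pick the alternative whose vector has the highest cosine with the stem word. The `vectors` dictionary and the example item are illustrative assumptions.

```python
def answer_toefl_item(stem, choices, vectors):
    """Pick the alternative most similar to the stem word in LSA space.
    `vectors` maps words to reduced LSA vectors; `cosine` is the function
    sketched earlier. Illustrative only."""
    return max(choices, key=lambda c: cosine(vectors[stem], vectors[c]))

# e.g. (made-up item):
# answer_toefl_item("levied", ["imposed", "believed", "requested", "correlated"], vectors)
```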
20. ii.) LSA can learn the meanings of words it had never encountered
- So can children!
- By substituting words with nonsense words and controlling access, they showed that the model could learn the meanings of words it had never encountered.
- This replicated (and explained) an odd result which had been found in human children, and estimated that most word knowledge was inductive rather than direct.
- The result is not odd when you consider that the meaning of a word is distributed across all vectors with which it shares contexts.
- You can learn a lot about lions, even if you have never heard of them before, by knowing they are something like tigers.
21. iii.) LSA can explain some priming effects
- The model can explain some priming work using homographs, i.e. testing for 'mole' (the animal) versus 'mole' (the beauty mark).
- If context is marked by word form (either phonological or orthographic), then these words will indeed get overlapping contexts even though they are semantically different.
22. iv.) LSA replicates human number judgments
- Previous work has shown that judgments about number size are best represented on the assumption that numbers are represented as the log of their values.
- That is, people scale down large numbers.
- LSA arrived at the same representation using the numbers' contextual occurrences.
23. v.) LSA can mark essays
- LSA judgments of the quality of sentences correlate at r = 0.81 with expert ratings.
- LSA can judge how good an essay (on a well-defined set topic) is by computing the average distance between the essay to be marked and a set of model essays.
- The correlations are equal to between-human correlations.
- "If you wrote a good essay and scrambled the words you would get a good grade," Landauer said. "But try to get the good words without writing a good essay!"
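- A sketch of the averaging idea in Python (the paper's exact essay representation may differ): represent each essay as the sum of its words' LSA vectors, then score a new essay by its mean cosine to a set of model essays. All names here are illustrative assumptions.

```python
import numpy as np

def essay_vector(text, vectors):
    """Represent an essay as the sum of its words' LSA vectors (a common
    choice, assumed here; requires at least one word known to the model)."""
    words = [w for w in text.lower().split() if w in vectors]
    return np.sum([vectors[w] for w in words], axis=0)

def essay_score(essay, model_essays, vectors):
    """Average similarity between the essay to be marked and a set of model
    essays: the closer it sits to them, the higher the mark."""
    v = essay_vector(essay, vectors)
    return float(np.mean([cosine(v, essay_vector(m, vectors)) for m in model_essays]))
```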
24. vi.) LSA-like measures predict LD RTs
- An LSA-like measure for single words can predict human RTs in lexical decision.
- We used 10 words on each side of the target word as a document and got distances between all words.
- Words close to their nearest neighbours are recognized more quickly than words far away from them, after controlling for other known variables.
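- A Python sketch of the '10 words each side' idea: count co-occurrences inside a sliding window of ±10 tokens around each target word. From these counts one could build vectors and then compute each word's distance to its nearest neighbours; the function name and details are assumptions, not the original code.

```python
from collections import defaultdict

def window_cooccurrence(tokens, window=10):
    """Count co-occurrences within +/- `window` words of each target word,
    treating that window as the word's 'document'."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][tokens[j]] += 1
    return counts
```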
25. Is LSA psychologically plausible?
- Well, the above evidence suggests it might be, and it is nicely consistent with much of our talk about mapping between schemas.
- Neuro-philosopher Paul Churchland has written:
- "Explanatory understanding consists of the activation of a specific prototype vector in a well-trained network. It consists in the apprehension of the problematic case as an instance of a general type, a type for which the creature has a detailed and well-informed representation. Such a representation allows the creature to anticipate aspects of the case so far unperceived, and to deploy practical techniques appropriate to the case at hand."
- Paul Churchland, A Neurocomputational Perspective: The Nature of Mind and the Structure of Science