Title: Distributional learning
1 Distributional learning
- Holger Diessel
- University of Jena
- holger.diessel_at_uni-jena.de
- http://www.holger-diessel.de/
2 Semantic bootstrapping
Pinker (1984)
- Grammatical categories such as nouns and verbs are part of our genetic endowment.
- There is only one reliable cue in the ambient language that children of all languages can use to identify grammatical categories: their meaning.
- Structural cues (e.g. inflection) are not reliable because they are language-specific.
3 Semantic bootstrapping
Step 1: Children construct semantic word classes based on the words they encounter in the ambient language.
Step 2: The semantically defined word classes are hooked up to the categories of UG.
Step 3: Once the connections are established, children can use language-specific structural properties to identify semantically atypical category members (e.g. deverbal nouns).
4 Cues for category acquisition
Cues for grammatical category acquisition:
- Semantic cues (e.g. Gentner 1982; Pinker 1984)
- Pragmatic cues (e.g. Bruner 1975)
- Phonological cues (e.g. Kelly 1992)
- Distributional cues (e.g. Maratsos & Chalkley 1980)
5 Maratsos & Chalkley 1980
Nouns: the __ , x-s
Verbs: will __ , x-ing, x-ed
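Maratsos & Chalkley's frame-based cues can be sketched in code. The following toy Python example is my own illustration (the corpus and category labels are invented, not their model): occurring after "the" counts as evidence for nounhood, occurring after "will" as evidence for verbhood.

```python
from collections import defaultdict

def frame_categories(tokens):
    """Assign rough categories from local frames, in the spirit of
    Maratsos & Chalkley (1980): a word after 'the' is evidence for
    nounhood, a word after 'will' for verbhood. Toy illustration."""
    votes = defaultdict(lambda: {"noun": 0, "verb": 0})
    for prev, word in zip(tokens, tokens[1:]):
        if prev == "the":
            votes[word]["noun"] += 1
        elif prev == "will":
            votes[word]["verb"] += 1
    # Keep only words that occurred in at least one diagnostic frame
    return {w: max(v, key=v.get) for w, v in votes.items() if any(v.values())}

tokens = "the dog will run the cat will sleep".split()
print(frame_categories(tokens))
# {'dog': 'noun', 'run': 'verb', 'cat': 'noun', 'sleep': 'verb'}
```

A real learner would of course need many more frames (and the morphological cues x-s, x-ing, x-ed), but the principle of tallying frame evidence per word is the same.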
6 Theoretical arguments
"The vast number of possible relationships that might be included in a distributional analysis is likely to overwhelm any distributional learning mechanism in a combinatorial explosion." (Pinker 1984)
- Distributional learning mechanisms do not search blindly for all possible relationships between linguistic items; the search is focused on specific distributional cues (Redington et al. 1998).
7 Theoretical arguments
"The interesting properties of linguistic categories are abstract and such abstract properties cannot be detected in the input." (Pinker 1984)
- This assumption crucially relies on Pinker's particular view of grammar. From a construction grammar perspective, grammar (or syntax) is much more concrete (Redington et al. 1998).
8 Theoretical arguments
"Even if the child is able to determine certain correlations between distributional regularities and syntactic categories, this information is of little use because there are so many different cross-linguistic correlations that the child wouldn't know which ones are relevant in his/her language." (Pinker 1984)
- Syntactic categories vary to some extent across languages (i.e. there are no fixed categories). Children recognize any distributional pattern, regardless of the particular properties that categories in different languages may have (Redington et al. 1998).
9 Theoretical arguments
"Spurious correlations will occur in the input that will be misguiding. For instance, if the child hears 'John eats meat', 'John eats slowly', and 'The meat is good', he may erroneously infer that 'The slowly is good' is a possible English sentence." (Pinker 1984)
- Children do not learn categories from isolated examples (Redington et al. 1998).
10 Redington, Martin, Nick Chater, and Steven Finch. 1998. Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science 22: 425-469.
11 Redington et al. 1998
Steps of analysis:
- Measuring the distribution of contexts within which each word occurs.
- Comparing the distributions of contexts for pairs of words.
- Grouping together words with similar distributions of contexts.
12 Redington et al. 1998
Corpus: the speech of all adult speakers in the CHILDES database (2.5 million words).
Bigram statistics:
- Target words: the 1000 most frequent words in the corpus
- Context words: the 150 most frequent words in the corpus
- Context size: 2 words preceding + 2 words following the target word (x the __ of x, in the __ x x, will have __ the x)
13 Distributional learning
Distributional context: 2 words preceding + 2 words following the target word (the __ of x, in the __ x x, I have __ x x)
Bigram statistics
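The counting step behind these bigram statistics can be sketched as follows. This is a minimal Python illustration on a toy corpus (the example sentence and word sets are invented); the four window positions (-2, -1, +1, +2) are kept distinct:

```python
from collections import Counter

def context_vectors(tokens, targets, context_words):
    """For each target word, count how often each context word occurs
    in the four window positions (-2, -1, +1, +2) around it."""
    vecs = {t: Counter() for t in targets}
    for i, w in enumerate(tokens):
        if w not in vecs:
            continue
        for offset in (-2, -1, 1, 2):
            j = i + offset
            if 0 <= j < len(tokens) and tokens[j] in context_words:
                vecs[w][(offset, tokens[j])] += 1
    return vecs

tokens = "the dog chased the cat and the dog saw the bird".split()
vecs = context_vectors(tokens, targets={"dog", "cat", "bird"},
                       context_words={"the", "and", "chased", "saw"})
print(vecs["dog"][(-1, "the")])  # 2: both occurrences of 'dog' follow 'the'
```

In Redington et al.'s actual study the targets are the 1000 most frequent words and the contexts the 150 most frequent words, so each target word ends up with a 4 x 150 count vector.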
14-18 Distributional learning
Co-occurrence counts (built up incrementally across slides 14-18):

              Context 1    Context 2       Context 3    Context 4
              (the __ of)  (at the __ is)  (has __ him) (He __ in)
Target w. 1       210          321             2            0
Target w. 2       376          917             1            5
Target w. 3         0            1          1078         1298
Target w. 4         1            4           987         1398

Context vectors:
Target word 1: 210-321-2-0
Target word 2: 376-917-1-5
Target word 3: 0-1-1078-1298
Target word 4: 1-4-987-1398
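Redington et al. compared such context vectors with rank correlation. A minimal pure-Python sketch over the slide's four vectors, assuming Spearman correlation as the similarity measure:

```python
def rank(xs):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(u, v):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ru, rv = rank(u), rank(v)
    mu, mv = sum(ru) / len(ru), sum(rv) / len(rv)
    cov = sum((a - mu) * (b - mv) for a, b in zip(ru, rv))
    su = sum((a - mu) ** 2 for a in ru) ** 0.5
    sv = sum((b - mv) ** 2 for b in rv) ** 0.5
    return cov / (su * sv)

# Context vectors from the slide: words 1 and 2 pattern together,
# words 3 and 4 pattern together.
w1, w2 = [210, 321, 2, 0], [376, 917, 1, 5]
w3 = [0, 1, 1078, 1298]
print(round(spearman(w1, w2), 2), round(spearman(w1, w3), 2))  # 0.8 -0.8
```

Rank correlation has the nice property of ignoring raw frequency differences (word 2 is far more frequent than word 1, yet the two correlate highly because their contexts are distributed the same way).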
19 Statistical analysis
- Hierarchical cluster analysis over the context vectors (yielding a dendrogram)
- Slicing of the dendrogram
- Treatment of polysemous words
20 Statistical analysis
Slicing of the dendrogram
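The clustering step can be sketched with a greedy agglomerative procedure. This is a minimal single-link stand-in for the hierarchical cluster analysis (Redington et al.'s exact linkage method may differ); stopping at k clusters corresponds to slicing the dendrogram at one level:

```python
def agglomerative(points, dist, k):
    """Single-link agglomerative clustering: repeatedly merge the two
    closest clusters until only k remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters

def euclid(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

# The four context vectors from the slide: rows 0-1 vs. rows 2-3
vecs = [[210, 321, 2, 0], [376, 917, 1, 5],
        [0, 1, 1078, 1298], [1, 4, 987, 1398]]
print(sorted(map(sorted, agglomerative(vecs, euclid, 2))))
# [[0, 1], [2, 3]]
```

On realistic data one would use a library implementation (e.g. scipy's hierarchical clustering), which also records the full merge history needed to draw the dendrogram.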
21 Cluster analysis (dendrogram)
- Pronouns, auxiliaries (49)
- Question words, pronouns-auxiliaries (53)
- Verb (105)
- Verb (62)
- Verb, present participle (50)
- Determiner, possessive pronoun (29)
- Conjunction, interjection, proper noun (91)
- Proper noun (91)
- Preposition (33)
- Noun (317)
- Adjective (92)
- Proper noun (10)
22 Benchmark (Collins Cobuild Dictionary)

Category      Example                   N
Noun          truck, card, hand         407
Adjective     little, favorite, white   81
Numeral       two, ten, three           10
Verb          could, hope, empty        239
Article       the, a                    3
Pronoun       you, whose, more          52
Adverb        rather, always, softly    60
Preposition   in, around, between       21
Conjunction   because, while, and       9
Interjection  oh, huh, wow              16
Contractions  I'll, can't, there's      58
23 Scoring
- Accuracy: of the word pairs that the cluster analysis groups together, the proportion that also belong together in the benchmark.
- Completeness: of the word pairs that belong together in the benchmark, the proportion that the cluster analysis groups together.
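These two pairwise scores can be computed directly. A small Python sketch (the clusters and benchmark labels are invented for illustration):

```python
from itertools import combinations

def score(clusters, benchmark):
    """Pairwise accuracy and completeness of a clustering against a
    benchmark categorization (a dict mapping word -> category)."""
    words = list(benchmark)
    same_cluster = {frozenset(p) for c in clusters
                    for p in combinations(c, 2)}
    same_bench = {frozenset((a, b)) for a, b in combinations(words, 2)
                  if benchmark[a] == benchmark[b]}
    hits = len(same_cluster & same_bench)
    return hits / len(same_cluster), hits / len(same_bench)

clusters = [["dog", "cat", "run"], ["sleep"]]
benchmark = {"dog": "N", "cat": "N", "run": "V", "sleep": "V"}
print(score(clusters, benchmark))  # accuracy = 1/3, completeness = 1/2
```

Accuracy penalizes lumping unrelated words into one cluster; completeness penalizes splitting a benchmark category across many clusters. Slicing the dendrogram higher or lower trades one off against the other.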
24 Exp. 1: Context size
Result: Local contexts have the strongest effect; the word immediately preceding the target word is particularly important.
"Learners might be innately biased towards considering only these local contexts, whether as a result of limited processing abilities (e.g. Elman 1993) or as a result of language specific representational bias." (Redington et al. 1998)
25 Exp. 2: Number of target words
(Figure: level of accuracy as a function of the number of target words.)
Result: Distributional learning is most efficient for high-frequency open-class words.
26 Exp. 3: Category type
Result: accuracy is highest for nouns, then verbs, then function words.
"Although content words are typically much less frequent, their context is relatively predictable. Because there are many more content words, the context of function words will be relatively amorphous." (Redington et al. 1998)
27 Exp. 4: Corpus size
(Figure: level of accuracy as a function of the number of words in the corpus.)
28 Exp. 5: Utterance boundaries
Result: Including information about utterance boundaries did not improve the level of accuracy.
29 Exp. 6: Frequency vs. occurrence
Frequency vectors were replaced by occurrence vectors:

Frequency vector    Occurrence vector
27-0-12-0-0-12-2    1-0-1-0-0-1-1
0-213-2-1-45-3-0    0-1-1-1-1-1-0

Result: The cluster analysis still revealed significant clusters, but performance was much better when frequency information was included.
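The conversion from frequency to occurrence vectors is a simple thresholding step, shown here on the slide's two example vectors:

```python
def to_occurrence(freq_vec):
    """Collapse a frequency vector to a binary occurrence vector:
    did the context occur at all?"""
    return [1 if f > 0 else 0 for f in freq_vec]

print(to_occurrence([27, 0, 12, 0, 0, 12, 2]))  # [1, 0, 1, 0, 0, 1, 1]
print(to_occurrence([0, 213, 2, 1, 45, 3, 0]))  # [0, 1, 1, 1, 1, 1, 0]
```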
30 Exp. 7: Removing function words
Early child language includes very few function words. Redington et al. therefore removed all function words from the context and repeated the cluster analysis.
Result: Performance decreased but was still significant.
31 Exp. 8: Knowledge of word classes
The cluster analyses were performed over the distributions of individual items. It is conceivable that at some point the child recognizes discrete syntactic categories (e.g. nouns), which may facilitate the categorization task.
Result: Representing particular word classes through discrete category labels (e.g. N) does not improve the categorization of other categories (e.g. V).
32 Mintz, Toben, Elissa L. Newport, and Thomas Bever. 2002. The distributional structure of grammatical categories in speech to young children. Cognitive Science 26: 393-424.
33 Mintz et al. 2002. Cognitive Science
(1) The man in the yellow car
(2) She has not yet been to NY.
Results:
1. Information about phrasal boundaries improves performance.
2. Local contexts have the strongest effect (cf. Redington et al. 1998).
3. The results for nouns are better than the results for verbs (cf. Redington et al. 1998).
34 Monaghan, Padraic, Nick Chater, and Morten Christiansen. 2005. The differential role of phonological and distributional cues in grammatical categorization. Cognition 96: 143-182.
35 Monaghan et al. 2005. Cognition
Two category distinctions: (1) nouns vs. verbs; (2) open class vs. closed class.
Two types of cues: 1. distributional information; 2. phonological information.
36 Monaghan et al. 2005. Cognition
- Length: Open class words are longer than closed class words.
- Stress: Closed class words usually do not carry stress.
- Stress: Nouns tend to be trochaic more often than verbs (i.e. verbs are often iambic).
- Consonants: Closed class words have fewer consonant clusters.
- Reduced vowels: Closed class words include a higher proportion of reduced vowels than open class words.
37 Monaghan et al. 2005. Cognition
- Interdentals: Closed class words are more likely than open class words to begin with an interdental fricative.
- Nasals: Nouns are more likely than verbs to include nasals.
- Final voicing: Nouns are more likely than verbs to end in a voiced consonant.
- Vowel position: Nouns tend to include more back vowels than verbs.
- Vowel height: The vowels of verbs tend to be higher than the vowels of nouns.
38 Monaghan et al. 2005. Cognition
"For high-frequency items, distributional information is extremely useful, but drops off dramatically for lower frequency items. For the phonological cues, the opposite pattern is observed: better performance for lower frequency words." (p. 168)
39 Monaghan et al. 2005. Cognition
Phonological features do not just reinforce distributional information; they seem to be especially powerful in domains in which distributional information is not so easily available.
- Distributional information is especially useful for the categorization of high frequency open class words.
- Phonological information is more useful for the categorization of low frequency open class words (Zipf 1935).
- Phonological information is also useful for the distinction between open and closed class words.
40 Distributional learning
"We found confirmation for our hypothesis that phonological and distributional information contributed differentially towards categorization. At points where distributional information was better for classification (the high frequency items), phonological cues were found to be of less value. Conversely, for the lower-frequency items, where distributional information was less useful, phonological information contributed towards more accurate classification." (Monaghan et al. 2005)
41 Distributional learning
But are children able to detect and compute the distributional information that is available in the ambient language?
42 Saffran et al. 1996
Nonce words: tupiro, golabu, bidaku, padoti
Subjects: 8-month-old infants
43 Saffran et al. 1996
tupiro bidaku padoti bidaku golabu
44 Saffran et al. 1996
Condition 1: tupiro-bidaku-
Condition 2: da-pi-ku-ro-tu-
45 Head-turn procedure
(Diagram of the head-turn preference procedure: a green light cues the infant, and the auditory stimulus plays while the infant orients toward the light.)
46 Saffran et al. 1996
47 Saffran et al. 1996
Transitional probabilities: tu-pi-ro bi-da-ku padoti bidaku golabu
Within words: 100%; across word boundaries: 25%.
48 Saffran et al. 1996
Condition 1: 100-100-25-100-100-25
Condition 2: 8.3-8.3-8.3-8.3-8.3
(transitional probabilities in %)
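The statistic infants track here is the forward transitional probability P(Y | X) = freq(XY) / freq(X). A short Python sketch over a toy syllable stream built from the four nonce words (the particular word order below is my own example, not Saffran et al.'s stimuli):

```python
from collections import Counter

def transitional_probs(syllables):
    """Forward transitional probability P(Y | X) = freq(XY) / freq(X)
    over a syllable stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])  # each syllable as pair-initial
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}

# Toy familiarization stream: the nonce words in varying order
order = ["tu-pi-ro", "go-la-bu", "tu-pi-ro", "bi-da-ku",
         "pa-do-ti", "tu-pi-ro", "go-la-bu", "pa-do-ti"]
stream = [s for w in order for s in w.split("-")]
tp = transitional_probs(stream)
print(tp[("tu", "pi")])            # 1.0: within-word transition
print(round(tp[("ro", "go")], 2))  # 0.67: across a word boundary
```

Within a word the next syllable is fully predictable (TP = 1.0, the 100% on the slide), whereas at a word boundary the next syllable varies with word order, so the TP drops; with the four words of Condition 1 it averages out to roughly 0.25.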
49 Saffran et al. 1996
"[T]he existence of computational abilities that extract structure so rapidly suggests that it is premature to assert a priori how much of the striking knowledge base of human infants is primarily a result of experience-independent mechanisms. In particular, some aspects of early development may turn out to be best characterized as resulting from innately biased statistical learning mechanisms rather than innate knowledge. If this is the case, then the massive amount of experience gathered by infants during the first postnatal year may play a far greater role in development than has previously been recognized."