Title: Theoretical Corpus Analysis: Four Examples
1. Theoretical Corpus Analysis: Four Examples
2. Example 1: Structural Dependency
- Noisy input
- Incomplete input
- Totally absent input
- Feedback is unreliable
- Even if it is reliable, children ignore it
3. Two forms of the problem
- Strong Form
- inviolable constraints, error-free learning
- Weak Form
- violable constraints, recovery from error
4. The weak form deals with recovery; the strong form avoids this
5. Structural Dependency
- The man who is first in line is coming.
- *Is the man who __ first in line is coming?
- Is the man who is first in line ___ coming?
- This constraint is non-parameterized and inviolable.
- Learning is said to be error-free.
- No recovery is needed.
6. Structural Dependency
The boy who is smoking is crazy.
7. No need for positive evidence
- Chomsky (1980): "A person might go through much or all of his life without ever having been exposed to relevant evidence, but he will nevertheless unerringly employ the structure-dependent generalization, on the first relevant occasion."
- Hornstein and Lightfoot (1986): "People attain knowledge of the structure of their language for which no evidence is available in the data to which they are exposed as children."
- Crain (1991): "...every child comes to know facts about the language for which there is no decisive evidence from the environment. In some cases, there appears to be no evidence at all."
8. Emergentist solution
- Item-based learning for linking of aux/tense to the main verb
- Learning based on evidence from main clauses (Lightfoot, degree-0 learnability)
- Learning on positive instances
- No evidence of a contrary movement pattern ever occurs, so competition favors the basic item-based pattern
9. Two searches
- Pullum and Scholz (2002): nearly 1% of the Wall Street Journal corpus in the Penn Treebank consists of relevant positive exemplars
- Lewis and Elman (2000): searched the CHILDES database and found two relevant examples (CHILDES has about 3 million utterances)
10. A new search
- Complete morphological tagging of all English corpora from normally developing children (mor +1 *.cha)
- Training of the POSTTRAIN database file on the Eve corpus to reach a 95% accuracy level
- Running POST on the complete corpus (post +1 +s0 +tengtags.cut *.cha)
11. KWAL for the relevant patterns
- aux/cop … rel … aux/cop
- is/can/would … who/that/what … is/can/would
- Test file:
  - *CHI: is the boy who is next in line tall?
  - %mor: v|be&3S det|the n|boy pro:wh|who v|be&3S adj|next prep|in n|line adj|tall ?
  - *CHI: does the boy who does the dishes run?
  - %mor: v:aux|do&3S det|the n|boy pro:wh|who v:aux|do&3S det|the n|dish-PL v|run ?
  - *CHI: can the boy that can run walk?
  - %mor: v:aux|can det|the n|boy pro:dem|that v:aux|can v|run n|walk ?
- RESULT: NONE
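A search of this kind can be sketched in a few lines of Python. The tier strings and category labels below are simplified assumptions modeled on the test file above; this is an illustration of the search logic, not the actual KWAL implementation.

```python
import re

# Hypothetical simplified %mor tiers, modeled on the test file above
# (real CHILDES %mor tiers are richer; these are illustrations only).
tiers = [
    "v|be&3S det|the n|boy pro:wh|who v|be&3S adj|next prep|in n|line adj|tall ?",
    "v:aux|do&3S det|the n|boy pro:wh|who v:aux|do&3S det|the n|dish-PL v|run ?",
    "det|the n|dog v|run-3S ?",  # control: no initial aux, no relativizer
]

# The target configuration: a sentence-initial aux/copula, a relativizer
# (who/that/what), then a second aux/copula inside the relative clause.
PATTERN = re.compile(
    r"^(?:v:aux|v)\|(?:be|do|can|will|would)\S*"   # sentence-initial aux/cop
    r".*\bpro(?::wh|:dem)?\|(?:who|that|what)\b"   # relativizer
    r".*(?:v:aux|v)\|(?:be|do|can|will|would)\S*"  # aux/cop in the relative
)

def matches(tier: str) -> bool:
    """True if the tier shows the aux ... rel ... aux configuration."""
    return bool(PATTERN.search(tier))

hits = [t for t in tiers if matches(t)]  # first two tiers match; the control does not
```

Run over a full corpus, a matcher like this returns every utterance showing the critical configuration; the finding on this slide is that no child utterance does.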
12. A broader search
- Any initial aux with any embedded clause: aux/cop … rel
- Checked for that mistagged as pro:dem
13. Seven close matches
- Brown "adam20.cha" Line 3282
  - MOT: are you the kind of nut that a squirrel likes?
- Hall whitework "stl.cha" Line 15497
  - FAT: does he know who you are?
- Hall whitepro "gat.cha" Line 2889
  - MOT: isn't that the little boy who a few months ago saw your lunch box and liked it?
- Hall blackwork "roj.cha" Line 8387
  - EXP: is that where you're from?
- Hall blackwork "chj.cha" Line 848
  - MOT: is that the person you were talking about?
- Bates Snack 28 "ivy.cha" Line 175
  - MOT: can you tell them how old you are?
- Bates Snack 28 "rick.cha" Line 118
  - MOT: can you show them what you are eating?
14. Two closer matches (spotted by Lewis and Elman)
- Brown "adam30.cha" Line 2130
  - MOT: Is the ball you were speaking of in the box with the bowling pin?
- Korman "st11.cha" Line 386
  - MOT: Where's this little boy who's full of smiles?
15. What did we learn?
- Parents don't provide this input, but children don't say anything close to this either
- Probably this is learned during adolescence
- However, there is plenty of positive input demonstrating that the moved auxiliary or tense marker comes from the main clause
16. But isn't that Chomsky's point?
- If the child must process structure to get it right, then structure is innate
- But what level of structure is innate?
- Minimum needed:
  - main verb
  - aux that codes the tense of the verb
  - item-based relation
- Pairs and nested pairs rather than abstract trees
17. Item-based patterns
- MacWhinney (1976, 1978, 1982, 1987)
- is (pres, 3s, inter, init) --- X (action, state)
- can (pres, inter, init) --- X (action, state)
- have (pres, -3s, inter, init) --- X (action, state)
- featural pattern:
  - pres, inter, init --- V
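The step from item-based patterns to the featural pattern can be illustrated as feature intersection. The feature names come from the slide; the set representation itself is a hypothetical sketch, not MacWhinney's actual formalism.

```python
# Each item-based pattern is represented as a feature set (an assumed
# data structure; the feature labels are taken from the slide above).
item_patterns = {
    "is":   {"pres", "3s", "inter", "init"},
    "can":  {"pres", "inter", "init"},
    "have": {"pres", "-3s", "inter", "init"},
}

# The featural pattern keeps only the features shared by every item,
# yielding the generalization pres, inter, init --- V.
featural_pattern = set.intersection(*item_patterns.values())
```

Item-specific features like 3s and -3s cancel out, leaving exactly the features that license the general pattern.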
18. Verb Island Constructions at About 2 Years of Age
- Pictures from Tomasello
19. Superimposition Schematization
The dog eats the bone.
The cat the fish.
A bird a ladybug.
This one that one.
Two pin two dogs.
20. He's push -ing it.
He's kill
He's pull
He's show
He's draw
He's deed
21. Kemp 2002 -- Easy to Hard
- The cow is jumping → The pig is jumping
- The pig is jumping → Zibby is jumping
- Zibby is jumping → The pig is jumping
- The cow is niffing → The pig is niffing
- The cow is niffing → Zibby is niffing
- Zibby is niffing → The pig is niffing
22. Three mechanisms
- ROTE: repeated use makes other uses sound unconventional
  - Child hears "X hit Y" many times, but never "Y hit"
  - As a result, "X hit Y" is strengthened and entrenched
- ANALOGY: semantic subclasses of verbs
  - Child learns a verb for causing direct motion (remove)
  - Child assumes it behaves like other verbs of the same type, i.e., as fixed transitive (e.g., bring, take)
- COMPETITION: alternative forms block the extension of a verb to a construction
  - Child watches as adult tickles sibling.
  - Sibling says "I can't stop laughing."
  - Child now expects sibling to say "Don't laugh me."
  - Sibling says "Don't make me laugh."
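The ROTE and COMPETITION mechanisms above can be caricatured in a few lines; the frames and frequency counts below are invented for illustration, not drawn from any corpus.

```python
from collections import Counter

strength = Counter()

# ROTE: every attested use strengthens (entrenches) its frame.
attested_input = ["X hit Y", "X hit Y", "X hit Y",
                  "make X laugh", "make X laugh"]
for frame in attested_input:
    strength[frame] += 1

def compete(candidates):
    """COMPETITION: the more entrenched frame blocks the alternative."""
    return max(candidates, key=lambda f: strength[f])

# The unattested overgeneralization "laugh X" (as in "Don't laugh me")
# loses to the entrenched causative frame "make X laugh".
winner = compete(["laugh X", "make X laugh"])
```

The point of the toy is that blocking needs no negative evidence: the attested competitor simply outweighs the never-heard form.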
23. Aside: Wh-raising from complex NPs is not so error-free
- What am I cooking on a hot __?
- Who did pictures of __ surprise you?
- What do you think a picture of __ should be on my cake?
- What is this a funny __?
24. Example 2: Attachment competition
- The cop saw the thief with a revolver.
- The cop saw the thief with a telescope.
- The daughter of the colonel who bought the watch entered the room.
- Competition Model claim: the strength of a competitor is a function of cue validity
- If listeners prefer an attachment that is less frequent, the model is falsified
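Under the usual Competition Model definition, cue validity decomposes into availability times reliability. The sketch below uses invented counts for a hypothetical cue, not estimates from the Brown Corpus.

```python
def cue_validity(times_present: int, times_correct: int, total_cases: int) -> float:
    """Cue validity = availability * reliability (Competition Model definition)."""
    availability = times_present / total_cases    # how often the cue is present
    reliability = times_correct / times_present   # how often it signals correctly
    return availability * reliability

# Hypothetical cue over an imagined sample of 100 sentences: present in 40,
# pointing to the right attachment in 30 of those (numbers are illustrative).
v = cue_validity(times_present=40, times_correct=30, total_cases=100)
# 0.4 availability * 0.75 reliability = 0.3 validity
```

The falsification test on this slide then amounts to checking whether listeners' attachment preferences track validity scores computed this way from corpus counts.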
25. Usage preferences for attaching "and NP4" (Gibson & Schütze, 1999)
[Figure: low, high, and mid usage preferences for attachment sites N1 through N3]
26. Statistics of training data (estimated from Brown Corpus)
27. [Figure: preference (0-100%) by preferred attachment site (N1, N2, N3) under low, mid, and high conditions]
28. Example 3: How degenerate is the input?
- Tagged the Eve corpus with MOR and POST
- Applied LC-Flex (Rosé and Lavie, 2001) and got 47% accuracy
- Added a statistical parser -- 60%
- Robustness methods -- 78.5%
- This compares to 85% for the WSJ
29. Example 4: Growing Lexicon Model
- Current network models assume a static lexicon, but children's lexicons are growing
  - catastrophic interference problem
- Some current network models make neurologically improbable assumptions
- Children learn from both direct experience (Jusczyk, Aslin, Saffran) and cooccurrence
- Current models have no good links to morphological analysis
30. Components
- FLO to strip CHILDES codes
- FREQ to extract the 300 most frequent words
- WCD (word cooccurrence detector, as in HAL) to acquire bigram cooccurrences
- Compression to a constant dimensionality
- GLM (growing lexicon model) for a self-organizing map (SOM) with node insertion
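The WCD step can be sketched as a simple bigram counter. The utterances below are toy input, not the FREQ-selected 300-word vocabulary from the actual pipeline.

```python
from collections import Counter

# Toy FLO-style cleaned utterances (illustrative only).
utterances = ["the dog eats the bone", "the cat eats the fish"]

# HAL-style bigram cooccurrence detection over adjacent word pairs.
bigrams = Counter()
for utt in utterances:
    words = utt.split()
    for left, right in zip(words, words[1:]):
        bigrams[(left, right)] += 1

# ("eats", "the") occurs in both utterances, so its count is 2.
```

The resulting count table is what the compression step would then reduce to a constant dimensionality before feeding the SOM.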
31. DevLex (Farkas, Li, & MacWhinney, 2001/2002)
32. Results
- Confusions were within part of speech:
  - mummy and daddy
  - wouldn't and didn't
  - car and truck
- We disambiguated these using additional features from WordNet (courtesy of Robert Harms)
- Thus, the model requires both cooccurrence and perceptual feature information
- Next: linking GLM to DisLex
33. DevLex
34. Lexical and semantic maps
35. Growing Hierarchical Maps: Dittenbach et al. (2002) GHSOM
36. Syntax emerging from items
- jump  jumps  jumped  jumped  jumping
- run   runs   ran     run     running
- pull  pulls  -       -       pulling
- want  wants  -       -       wanting
- bet   bets   -       -       betting
- The model is given "pull" + Past and the sound "pulled"
- -ed should be learned as the past-tense marker
- Its position should be learned from the example
- The semantics of the head should also be learned
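A minimal sketch of the pull/pulled step: strip the shared prefix and treat the residue as the candidate past-tense affix. This illustrates the alignment idea only; it is not the model's actual learning mechanism.

```python
def candidate_affix(stem: str, inflected: str) -> str:
    """Return the residue of the inflected form after removing the
    longest prefix it shares with the stem."""
    i = 0
    while i < min(len(stem), len(inflected)) and stem[i] == inflected[i]:
        i += 1
    return inflected[i:]

# Given "pull" + Past paired with the sound "pulled", the residue "ed"
# emerges as the candidate past-tense marker; its final position falls
# out of the alignment as well.
affix = candidate_affix("pull", "pulled")
```

Regular pairs like jump/jumped yield the same residue, while irregulars like run/ran do not, which is exactly where competition between stored forms and the general pattern would take over.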
37. Summary
- Four examples of theoretical corpus analysis
  - Structural dependency
  - Attachment competition
  - Parsability of parental input
  - Growing lexicon model
- New Directions
  - Complete parsing of the database
  - New taggers for Spanish, Italian, Japanese, French, German
  - Conversion of the database to XML