Title: The Leo Corpus
1The Leo Corpus
2Overview
- corpora in child language research
- CHILDES project
- Leo corpus
- CLAN language analysis tools
3Corpora in acquisition research?
- linguistic intuitions of native speakers?
- adult speakers' intuitions fail
- child will not speak on demand
- child can't judge own sentences
- Leo (1;11,16): Leiter hoch.
- ladder up/high
- → particle: (pull/push/...) up
- → adjective: (the ladder is) long
- → utterance context is needed!
4Corpora in acquisition research?
- Leo (1;11,16): Leiter hoch.
- context: looking at a long ladder in a book
- adult German: (The) ladder (is) long
- → adjective (in adult terms!)
5Corpora in acquisition research
- corpora contain utterances that were actually produced
- situated in natural contexts
- data are verifiable
- frequency analyses possible
- Kinds of corpora
- diary studies (e.g., Preyer 1882)
- experimental data
- spoken-language corpora (longitudinal / cross-sectional)
6CHILDES
- Child Language Data Exchange System
- Brian MacWhinney / Catherine Snow
- founded in 1984
- part of TalkBank (adult corpora)
- 1500 published articles
- 4500 members
7CHILDES
- 3 parts
- language data, i.e. corpora
- CHAT transcription system
- CLAN computer programs
8CHILDES corpora
- 130 corpora publicly available (via www)
- 26 languages
- L1 normally developing
- L1 language disorders
- bi- and trilingual children and adults
- TalkBank adult corpora
- L2, aphasics (English, German, Hungarian, Chinese, Italian), ...
9CHAT and CLAN
- CHAT
- Codes for the Human Analysis of Transcripts
- ensure a standard format for all corpora
- CLAN
- Computerized Language Analysis
- several commands for analyzing data in CHAT format
- single program interface
10Leo corpus
- Leo, monolingual L1 German boy
- recorded 1999-2002
- Heike Behrens, MPI Leipzig
- transcribed in CHAT format
- analysable with CLAN programs
- not publicly available
11Leo corpus
- age 2;0-3;0: 5 x 1hr / week = 20-22hrs / month
- diary for new structures
- age 3;0-5;0: 5 x 1hr / month = 5hrs / month
- → ca. 400hrs total recording time
- includes utterances of child and conversation partners
- spontaneous interaction (free play)
- no book-reading
- experimenter present
- some sessions videotaped
12Leo corpus
- 1.8 million words of spoken language
- child: ca. 500,000 words
- BNC: largest balanced corpus
- 100 million words
- 10% spoken language = 10 million words
- dense corpus
13Dense corpora
- longitudinal databases with denser recording intervals
- traditional: 0.5-1hr / week
- Leo: 1.25-5hrs / week
- assumption
- child is awake and talks 10hrs / day
- traditional: ca. 1% of output
- Leo: 2-7% of output
- (Tomasello / Stahl 2004)
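The percentages above follow directly from the stated assumption of 10 hours of talk per day. A minimal sketch of that arithmetic (the function and variable names are illustrative, not from the source):

```python
# Assumption taken from the slide: the child is awake and talking 10 hours a day.
WAKING_HOURS_PER_DAY = 10
DAYS_PER_WEEK = 7


def sampled_share(recorded_hours_per_week: float) -> float:
    """Return the percentage of the child's weekly output that is recorded."""
    weekly_output_hours = WAKING_HOURS_PER_DAY * DAYS_PER_WEEK
    return 100.0 * recorded_hours_per_week / weekly_output_hours


# Recording ranges (hours per week) as given on the slide.
schemes = {"traditional": (0.5, 1.0), "Leo": (1.25, 5.0)}

for name, (low, high) in schemes.items():
    print(f"{name}: {sampled_share(low):.1f}% to {sampled_share(high):.1f}% of output")
# traditional: 0.7% to 1.4% of output
# Leo: 1.8% to 7.1% of output
```

Rounded, this gives the "ca. 1%" and "2-7%" figures quoted on the slide.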
14Dense corpora
- advantages
- capture of infrequent phenomena
- better estimate of vocabulary size
- age of emergence
- smoother developmental curves
- input / production frequency measures
15Dense corpora
- Likelihood of capturing a target token in a year of recording (Tomasello / Stahl 2004)
- [Figure: capture likelihood for target rates of 1 token/day and 10 tokens/day]
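To give a feel for how sampling density affects the chance of capturing a structure at all, here is a rough sketch. It assumes that target tokens occur independently at a constant rate (a Poisson model) and that a fixed share of the child's output is recorded. Both are simplifying assumptions made for illustration and not necessarily the model used by Tomasello / Stahl (2004).

```python
import math


def capture_probability(tokens_per_day: float,
                        sampled_share: float,
                        days: int = 365) -> float:
    """Probability of recording at least one target token.

    Assumes a Poisson process: tokens occur independently at a constant
    rate, and `sampled_share` (0..1) of all output is recorded.
    """
    expected_captured = tokens_per_day * days * sampled_share
    return 1.0 - math.exp(-expected_captured)


# Illustrative rates: one token per day down to one token per month.
for rate in (1.0, 1.0 / 7, 1.0 / 30):
    sparse = capture_probability(rate, 0.01)   # ca. 1% of output sampled
    dense = capture_probability(rate, 0.07)    # ca. 7% of output sampled
    print(f"{rate:.3f} tokens/day: sparse {sparse:.2f}, dense {dense:.2f}")
```

Under these assumptions the gap between sparse and dense sampling is largest for rare structures, which is the argument for dense corpora.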
16Drawback: only 1 child!
- no generalizations possible
- drawback?
- usage-based approach
- child is believed to construct language individually
- based on personal experience with language
- no help from language-specific knowledge
17Usage-based approach
- child moves gradually from lexically specific to abstract knowledge
- no adult categories
- input and frequency play a role (→ corpus needed!)
- close studies of individuals highly valuable
- dense longitudinal vs. traditional cross-sectional corpora
18Control corpora
- Kerstin, Simone
- Max Miller, MPI Nijmegen
- ages 1;3 / 1;9 - 4;0
- Kerstin
- 0.5-2.7 recordings / month
- ca. 270,000 words (child: 55,000)
- Simone
- 1.25-3.5 recordings / month
- ca. 450,000 words (child: 86,000)
19Control corpora
- Pauline, Sebastian
- Prof. Rigol
- Pauline
- age 0;0-7;11 / 1-2 recordings / month
- 340,000 words (child: 85,000)
- Sebastian
- age 0;0-7;4 / 1-2 recordings / month
- 350,000 words (child: 75,000)
20Leo corpus
- CHAT-format
- 1 transcription file per session
- txt-format
- no running text
- @Headers (file explanations)
- Main tier lines (utterances)
- Dependent tiers (annotations of utterances)
- @End
21CHAT Headers
- @Begin
- @Languages: de
- @Participants: CHI Leo Target_Child, MUT Maren Mother, VAT Thorsten Father, MEC Mechthild Observer
- @ID: de|mpi_evan|CHI|2;06.08|male|group|middle|Target_Child|education
- @ID: de|mpi_evan|MUT|30;00.00|female|group|middle|Mother|Abitur_Lehre
- @ID: de|mpi_evan|VAT|35;00.00|male|group|middle|Father|university
- @ID: de|mpi_evan|MEC|24;00.00|female|group|middle|Observer|university
- @Filename: le020608.cha
- @Date: 11-SEP-1999
- @Age of CHI: 2;06.08
- @Comment: Dependent %exp, %vrb, %act, %par,
- @Comment: in der Wohnung, beim Einkaufen (in the flat, while shopping)
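Because the headers have a fixed layout, they are easy to read by program. The sketch below splits one @ID line into named fields and converts a CHAT age to months. It assumes the standard CHAT conventions of "|"-separated @ID fields and ages written as years;months.days; the field names in `ID_FIELDS` and both function names are illustrative choices, not part of CLAN.

```python
from typing import Dict

# Assumed field order of a CHAT @ID header ("|"-separated).
ID_FIELDS = ["language", "corpus", "code", "age", "sex",
             "group", "ses", "role", "education"]


def parse_id_header(line: str) -> Dict[str, str]:
    """Split one @ID header line into named fields."""
    prefix = "@ID:"
    if not line.startswith(prefix):
        raise ValueError("not an @ID header: " + line)
    values = [v.strip() for v in line[len(prefix):].strip().split("|")]
    return dict(zip(ID_FIELDS, values))


def age_in_months(age: str) -> int:
    """Convert a CHAT age such as '2;06.08' (years;months.days) to whole months."""
    years, rest = age.split(";")
    months = rest.split(".")[0]
    return int(years) * 12 + int(months)


example = "@ID: de|mpi_evan|CHI|2;06.08|male|group|middle|Target_Child|education"
fields = parse_id_header(example)
print(fields["code"], fields["role"], age_in_months(fields["age"]))
# CHI Target_Child 30
```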
22CHAT Main tiers
- Each utterance on own line
- Each line starts with a tier label
- Each speaker has own tier: *CHI, *VAT, ...
- Annotations on dependent tiers: %mor, %pho, ...
- Child: Yes. Fish! Father: Fish? Child: Yes.
- *CHI: ja .
- %mor: INTERja .
- *CHI: Fisch !
- %mor: N03mNOMSGFisch !
- *VAT: Fisch ?
- %mor: N03mCASFisch ?
- *CHI: ja .
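The one-utterance-per-line layout is what makes CHAT files machine-readable. As a minimal sketch (not a replacement for CLAN), the following groups each main tier with the dependent tiers that follow it. It assumes main tiers start with "*", dependent tiers with "%", and headers with "@", and it ignores continuation lines; the annotation strings are treated as opaque text.

```python
from typing import Dict, List


def parse_chat(text: str) -> List[Dict[str, object]]:
    """Group each main tier with the dependent tiers that follow it."""
    utterances: List[Dict[str, object]] = []
    for line in text.splitlines():
        if line.startswith("*"):
            # Main tier: "*CHI:<tab>ja ."
            speaker, _, words = line[1:].partition(":")
            utterances.append({"speaker": speaker,
                               "text": words.strip(),
                               "tiers": {}})
        elif line.startswith("%") and utterances:
            # Dependent tier belongs to the most recent main tier.
            name, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][name] = content.strip()
        # Header lines starting with "@" are skipped.
    return utterances


sample = """@Begin
*CHI:\tja .
%mor:\tINTERja .
*CHI:\tFisch !
*VAT:\tFisch ?
*CHI:\tja .
@End"""

for utt in parse_chat(sample):
    print(utt["speaker"], "|", utt["text"], "|", utt["tiers"])
```

Filtering the result by speaker then corresponds roughly to what the tier options do in a CLAN search.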
23CHAT Transcription
- orthographic or not?
- depends on purpose
- orthographic transcription: ease of retrieval
- additional information via dependent tiers (%pho)
- utterances can be linked to digitized sound files (Sonic-CHAT)
- or to video files
24SONIC CHAT
25CHAT Transcription
- spoken language is not as orderly as written text
- coding scheme for spoken-language phenomena
- overlaps
- trailing off
- noncompletions: is(t)
- retracing: Schrei [//] Scheibenwischer
- non-words: hm@o
- replacements: nix [: nichts]
26CHAT Annotation
- annotations to an utterance (tagging) on the dependent tiers of that utterance
- *CHI: ist drin .
- %mor: VCOPSPOSPRES3ssein ADVdrin .
- %exp: es ist noch etwas Kakao im Becher . (there is still some cocoa in the cup)
- here:
- %mor: morphology
- %exp: explanation of utterance situation
27CHAT Annotation
- annotations to an utterance (tagging) on the dependent tiers of that utterance
- *CHI: ist drin .
- %mor: VCOPSPOSPRES3ssein ADVdrin .
- tag fields: copula, suppletive, (empty), tense, agreement, citation form
- → ist is the 3rd pers. sing. present tense of the suppletive copula verb sein
28CHAT Annotation
- tagging is based on theoretical notions of adult language!
- e.g., when ist is tagged as VCOP etc., this doesn't mean that it constitutes a VCOP for the child
- *CHI: Leiter hoch .
- %mor: N02fAKKSGLeiter PThoch .
- hoch: a verb particle for the child?
29CHAT Annotation
- transcription and annotation in CLAN editor
- converts txt- and SALT-format files to CHAT
- automatic tagging (mor-tiers)
- lexicon file with word information
- tag disambiguation (manual / probabilistic)
- computes coding reliability
- checks conformity with CHAT-conventions
- works on different workstations (unlike TRANSANA)
- access files on network drive
30CLAN / CHAT interface
31CLAN Commands
- search commands, e.g.
- simple and combined strings in utterances and annotations
- interaction blocks
- imitations, repetitions, overlaps
- computing commands, e.g.
- mlu / mlt (mean length of utterances / turns)
- longest words / utterances
- vocabulary diversity TTR, measure D
- frequency of phonemes and their positions
32CLAN Commands
- commands in DOS-like style
- Examples
- research question: wh-words
- emergence
- frequency
- use
33Emergence of Interrogative Pronouns
- coding
- *CHI: was machst du ? (lit. what do you?)
- %mor: PROintwas ...
- search
- kwal +t*CHI +t%mor +sproint* le020*.cha
- +t*CHI +t%mor: search the child's mor-tiers
- le020*.cha: in all files up to 2;9
- +sproint*: for strings starting with proint
34Emergence of Interrogative Pronouns
- output
- kwal (08-Dec-2004) is conducting analyses on:
- ONLY speaker main tiers matching: *CHI;
- and those speakers' ONLY dependent tiers matching: %MOR;
- From file
- From file
- From file
- ----------------------------------------
- File "le020008.cha": line 2603. Keyword: prointwo
- *CHI: Mama, wo bis(t) du ! (Mama, where are you!)
- %mor: N01fVOCSGMama PROintwo VCOPSPOSPRES2ssein PROpersNOMSGdu N01fVOCSGMama !
- Triple click to access utterance!
35Frequency of Interrogative Pronouns
- Does the child's use match the input frequency?
- Child
- freq +t*CHI +t%mor +s"PROINT*" le020*.cha +u +o
- give the frequency count of proint on the child's mor-tier, for all files up to 2;9 together, with sorted output
- Input
- freq -t*CHI +t%mor +s"PROINT*" le020*.cha +u +o
- same count for the other people's mor-tiers
36Frequency of Interrogative Pronouns
- result
- child | input
- 536 prointwas | 13162 prointwas
- 305 prointwo | 3486 prointwo
- 70 prointwie | 1608 prointwer
- 31 prointwer | 1255 prointwie
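Raw counts are hard to compare because the input sample is far larger than the child's. A small sketch that turns the counts above into relative shares and rank orders (the counts are those on the slide; the function names are illustrative):

```python
from typing import Dict, List

# Counts as reported on the slide.
child = {"was": 536, "wo": 305, "wie": 70, "wer": 31}
adult_input = {"was": 13162, "wo": 3486, "wer": 1608, "wie": 1255}


def shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Relative frequency of each wh-word within one speaker group."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def ranking(counts: Dict[str, int]) -> List[str]:
    """Words ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)


child_shares, input_shares = shares(child), shares(adult_input)
for word in ranking(adult_input):
    print(f"{word}: child {child_shares[word]:.1%}, input {input_shares[word]:.1%}")

print("child ranking:", ranking(child))
print("input ranking:", ranking(adult_input))
```

The two most frequent forms (was, wo) have the same rank in child and input; wie and wer swap places.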
37Non-interrogative use of wh-words
- Make a file with all words for interrogative pronouns Leo uses
- freq +s"PROINT*" +u +d1 > Leowh.cut
- +u: for all files together
- +d1: without frequency and count numbers
- > Leowh.cut: direct output to file
38Non-interrogative use of wh-words
- Leowh.cut
- prointwas
- prointwo
- prointwie
- prointwer
- strip file of all proint, so that just the wh-words are left
- chstring +s"proint" "" +y leowh.cut
- +s"proint" "": change from "proint" to "" (nothing)
- +y: file not in CHAT-format
39Non-interrogative use of wh-words
- then look for uses of these words in sentences that do not contain proint
- combo +t*CHI +t%mor le*.cha +s@leowh.cutmor !proint
- @leowh.cut: take words from file as search string for utterances
- followed by mor
- not containing proint
- combo: search with Boolean operators
40Word order in wh-questions
- German: verb has to follow wh-word directly; any errors?
- search for all utterances that do not follow this pattern
- combo +t*CHI +t%mor +sproint*^!v* le*.cha
- search the child's mor-tiers for proint not directly followed by any v
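The logic of this search can be sketched outside CLAN as a check over the sequence of tags in one mor tier. The tag strings below are simplified, hypothetical stand-ins for the corpus tags (anything starting with "proint" counts as an interrogative pronoun, anything starting with "v" as a verb), and the function name is illustrative.

```python
from typing import List


def wh_not_followed_by_verb(tags: List[str]) -> bool:
    """True if some interrogative pronoun tag is not directly followed by a verb tag."""
    lowered = [tag.lower() for tag in tags]
    for i, tag in enumerate(lowered):
        if tag.startswith("proint"):
            # An utterance-final wh-word has no following tag at all.
            following = lowered[i + 1] if i + 1 < len(lowered) else ""
            if not following.startswith("v"):
                return True
    return False


# Hypothetical, simplified tag sequences for illustration only.
target_like = ["PROint", "V", "PROpers"]        # wh-word, verb, pronoun
non_target = ["PROint", "PROpers", "V"]         # wh-word, pronoun, verb

print(wh_not_followed_by_verb(target_like))  # False
print(wh_not_followed_by_verb(non_target))   # True
```

Note that this sketch also flags utterances ending in a wh-word; whether the combo search does the same depends on how its pattern is written.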
41Cooccurrences of wh-words
- What words does was (what) co-occur with when used as an interrogative pronoun?
- kwal +t*CHI +t%mor +sproint""was +d +o le*.cha | cooccur +swas +t*CHI +u
- kwal looks for all uses of was as proint
- the results are directed to cooccur (piping)
- 6 was da
- 32 was das
- 1 was denkst
- 2 was denn
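The counting step can be illustrated with a few lines of Python: for each utterance, count which word directly follows the target word. The utterances below are made up for the example and are not taken from the corpus; the function name is illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple


def following_words(utterances: Iterable[str], target: str) -> Counter:
    """Count adjacent word pairs whose first member is `target`."""
    counts: Counter = Counter()
    for utterance in utterances:
        words = utterance.lower().split()
        for first, second in zip(words, words[1:]):
            if first == target:
                pair: Tuple[str, str] = (first, second)
                counts[pair] += 1
    return counts


# Made-up utterances for illustration only.
examples = ["was ist das", "was das", "was denn", "was das"]

for (first, second), n in following_words(examples, "was").most_common():
    print(n, first, second)
```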
42Measuring lexical diversity
- traditional: type-token ratio (TTR)
- number of different word types
- against total number of words
- every word is a new word: TTR = 1.0
- the lower the TTR, the less lexical diversity
- problem: TTR depends on sample size
- in a large sample, the total vocabulary will finally be exhausted
- TTR levels out because highly frequent words will increase the number of tokens disproportionately
- rarely occurring types will have little influence on TTR
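A short sketch makes the sample-size problem concrete: the TTR of the same made-up word sequence drops as more of it is included. The token list is invented for illustration.

```python
from typing import Sequence


def type_token_ratio(tokens: Sequence[str]) -> float:
    """Number of different word types divided by the total number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty sample")
    return len(set(tokens)) / len(tokens)


# Made-up token sequence for illustration only.
tokens = ["was", "ist", "das", "das", "ist", "ein",
          "fisch", "was", "ist", "das"]

for size in (3, 6, 10):
    print(size, round(type_token_ratio(tokens[:size]), 3))
# 3 1.0
# 6 0.667
# 10 0.5
```

The first three tokens are all new words (TTR 1.0); after that, repeated high-frequency words add tokens without adding types.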
43measure D
- measure D is obtained by
- randomly sampling the corpus
- calculating the actual leveling out of the TTR rate
- and comparing this to theoretical models of TTR curves
- the probability of new types being introduced in the corpus is calculated, regardless of sample size
- In CLAN
- TTR: freq
- measure D: vocd
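The procedure can be sketched as follows. This is an illustration under stated assumptions, not the VOCD implementation: it assumes the commonly cited model curve TTR = (D/N)(sqrt(1 + 2N/D) - 1), random samples of 35 to 50 tokens drawn without replacement, 100 trials per sample size, and a simple grid search for the best-fitting D. All function names and the toy data are illustrative.

```python
import math
import random
from typing import List, Sequence


def model_ttr(n: int, d: float) -> float:
    """Theoretical TTR for a sample of n tokens, given diversity parameter d."""
    return (d / n) * (math.sqrt(1.0 + 2.0 * n / d) - 1.0)


def mean_sampled_ttr(tokens: List[str], n: int, trials: int,
                     rng: random.Random) -> float:
    """Average TTR over random samples of n tokens drawn without replacement."""
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(tokens, n)
        total += len(set(sample)) / n
    return total / trials


def estimate_d(tokens: List[str], sizes: Sequence[int] = range(35, 51),
               trials: int = 100, seed: int = 0) -> float:
    """Find the D whose model curve best fits the empirical TTR curve."""
    if len(tokens) < max(sizes):
        raise ValueError("sample is too small for the requested sizes")
    rng = random.Random(seed)
    observed = [(n, mean_sampled_ttr(tokens, n, trials, rng)) for n in sizes]

    def squared_error(d: float) -> float:
        return sum((ttr - model_ttr(n, d)) ** 2 for n, ttr in observed)

    # Coarse grid search over D from 0.1 to 200.0 (an arbitrary range).
    candidates = [step / 10.0 for step in range(1, 2001)]
    return min(candidates, key=squared_error)


# Toy data: 200 tokens with 5 types vs. 200 tokens with 60 types.
low_diversity = ["a", "b", "c", "d", "e"] * 40
high_diversity = [f"w{i % 60}" for i in range(200)]

print("low:", estimate_d(low_diversity))
print("high:", estimate_d(high_diversity))
```

Because every sample has a fixed size, the resulting D does not shrink as the corpus grows, unlike the raw TTR.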