The Leo Corpus - PowerPoint PPT Presentation

Provided by: STEI8

1
The Leo Corpus
  • German L1 Learner Corpus

2
Overview
  • corpora in child language research
  • CHILDES project
  • Leo corpus
  • CLAN language analysis tools

3
Corpora in acquisition research?
  • linguistic intuitions of native speakers?
  • adult speakers' intuitions fail
  • child will not speak on demand
  • child can't judge own sentences
  • Leo (1;11,16): Leiter hoch.
  • ladder up/high
  • → particle: (pull/push/...) up
  • → adjective: (the ladder is) long
  • → utterance context is needed!

4
Corpora in acquisition research?
  • linguistic intuitions of native speakers?
  • adult speakers' intuitions fail
  • child will not speak on demand
  • child can't judge own sentences
  • Leo (1;11,16): Leiter hoch.
  • looking at a long ladder in a book
  • adult German: (The) ladder (is) long
  • → adjective (in adult terms!)

5
Corpora in acquisition research
  • corpora contain utterances that were actually produced
  • situated in natural contexts
  • data are verifiable
  • frequency analyses possible
  • Kinds of corpora
  • diary studies (e.g., Preyer 1882)
  • experimental data
  • spoken language corpora (longitudinal /
    cross-sectional)

6
CHILDES
  • Child Language Data Exchange System
  • Brian MacWhinney / Catherine Snow
  • founded in 1984
  • part of TalkBank (adult corpora)
  • 1500 published articles
  • 4500 Members

7
CHILDES
  • 3 parts
  • language data, i.e. corpora
  • CHAT transcription system
  • CLAN computer programs

8
CHILDES corpora
  • 130 corpora publicly available (via www)
  • 26 languages
  • L1 normally developing
  • L1 language disorders
  • bi- and trilingual children and adults
  • TalkBank adult corpora
  • L2, aphasics (English, German, Hungarian,
    Chinese, Italian), ...

9
CHAT and CLAN
  • CHAT
  • Codes for the Human Analysis of Transcripts
  • ensure a standard format for all corpora
  • CLAN
  • Computerized Language Analysis
  • several commands for analyzing data in CHAT
    format
  • single program interface

10
Leo corpus
  • Leo, monolingual L1 German boy
  • recorded 1999 to 2002
  • Heike Behrens, MPI Leipzig
  • transcribed in CHAT format
  • analysable with CLAN programs
  • not publicly available

11
Leo corpus
  • age 2;0 to 3;0: 5 x 1 hr / week = 20-22 hrs / month
  • diary for new structures
  • age 3;0 to 5;0: 5 x 1 hr / month = 5 hrs / month
  • → ca. 400 hrs total recording time
  • includes utterances of child and conversation
    partners
  • spontaneous interaction (free play)
  • no book-reading
  • experimenter present
  • some sessions videotaped

12
Leo corpus
  • 1.8 million words of spoken language
  • child: ca. 500,000 words
  • BNC: largest balanced corpus
  • 100 million words
  • 10% spoken language = 10 million words
  • dense corpus

13
Dense corpora
  • longitudinal databases with denser recording
    intervals
  • traditional: 0.5 to 1 hr / week
  • Leo: 1.25 to 5 hrs / week
  • assumption:
  • child is awake and talks 10 hrs / day
  • traditional: ca. 1% of output
  • Leo: 2 to 7% of output
  • (Tomasello / Stahl 2004)
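
The percentages on this slide follow from the stated assumption of 10 hours of talk per day. A minimal sketch of that arithmetic; the function name is hypothetical, and the weekly total of 70 hours is simply 10 hours times 7 days:

```python
# Assumption taken from the slide: the child is awake and talking 10 hours a day.
WAKING_HOURS_PER_WEEK = 10 * 7


def sampled_share(recorded_hours_per_week: float) -> float:
    """Return the share of the child's weekly output that is recorded, in percent."""
    return 100 * recorded_hours_per_week / WAKING_HOURS_PER_WEEK


for label, hours in [("traditional", (0.5, 1.0)), ("Leo", (1.25, 5.0))]:
    low, high = (sampled_share(h) for h in hours)
    print(f"{label}: {low:.1f}% to {high:.1f}% of weekly output")
```

Under this assumption the traditional recording schedule lands around 1% and the Leo schedule at roughly 2 to 7%, which agrees with the figures on the slide.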

14
Dense corpora
  • advantages
  • capture of infrequent phenomena
  • better estimate of vocabulary size
  • age of emergence
  • smoother developmental curves
  • input / production frequency measures

15
Dense corpora
  • Likelihood of capturing a target token in
    a year of recording (Tomasello / Stahl 2004)
  • shown for targets occurring 1 token / day and
    10 tokens / day
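
Why denser sampling helps with infrequent phenomena can be illustrated with a simple binomial sketch. This is not necessarily the model used by Tomasello / Stahl (2004); it assumes that every token of the target structure falls into recorded time independently, with a probability equal to the sampled share of the child's output. The function name and the two sampling shares are illustrative.

```python
def capture_probability(tokens_per_day, sampled_share, days=365):
    """Chance that at least one of the year's tokens falls into recorded time."""
    expected_tokens = round(tokens_per_day * days)
    # Probability that every single token is missed, subtracted from 1.
    return 1 - (1 - sampled_share) ** expected_tokens


for rate in (1, 10):                  # target frequencies named on the slide
    for share in (0.01, 0.07):        # roughly traditional vs. dense sampling
        p = capture_probability(rate, share)
        print(f"{rate}/day, {share:.0%} sampled: {p:.3f}")
```

The same function can be called with rates below 1 per day (for example 1/7 for a weekly structure) to see how quickly the chance of capture drops for rare targets at low sampling rates.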

16
Drawback: only 1 child!
  • no generalizations possible
  • drawback?
  • usage-based approach
  • child is believed to construct language
    individually
  • based on personal experience with language
  • no help from language-specific knowledge

17
Usage-based approach
  • child moves gradually from lexically specific to
    abstract knowledge
  • no adult categories
  • input and frequency play a role (→ corpus
    needed!)
  • close studies of individuals highly valuable
  • dense longitudinal vs. traditional
    cross-sectional corpora

18
Control corpora
  • Kerstin and Simone
  • Max Miller, MPI Nijmegen
  • ages 1;3 / 1;9 to 4;0
  • Kerstin
  • 0.5 to 2.7 recordings / month
  • ca. 270,000 words (child 55,000)
  • Simone
  • 1.25 to 3.5 recordings / month
  • ca. 450,000 words (child 86,000)

19
Control corpora
  • Pauline and Sebastian
  • Prof. Rigol
  • Pauline
  • age 0;0 to 7;11 / 1 to 2 recordings / month
  • 340,000 words (child 85,000)
  • Sebastian
  • age 0;0 to 7;4 / 1 to 2 recordings / month
  • 350,000 words (child 75,000)

20
Leo corpus
  • CHAT-format
  • 1 transcription file per session
  • txt-format
  • no running text
  • @Headers (file explanations)
  • Main tier lines (utterances)
  • Dependent tiers (annotations of utterances)
  • @End

21
CHAT Headers
  • @Begin
  • @Languages: de
  • @Participants: CHI Leo Target_Child, MUT Maren
    Mother, VAT Thorsten Father, MEC Mechthild
    Observer
  • @ID: de|mpi_evan|CHI|2;06.08|male|group|middle|Target_Child|education
  • @ID: de|mpi_evan|MUT|30;00.00|female|group|middle|Mother|Abitur_Lehre
  • @ID: de|mpi_evan|VAT|35;00.00|male|group|middle|Father|university
  • @ID: de|mpi_evan|MEC|24;00.00|female|group|middle|Observer|university
  • @Filename: le020608.cha
  • @Date: 11-SEP-1999
  • @Age of CHI: 2;06.08
  • @Comment: Dependent exp, vrb, act, par,
  • @Comment: in der Wohnung, beim Einkaufen (in the apartment, while shopping)

22
CHAT Main tiers
  • Each utterance on its own line
  • Each line starts with a tier
  • Each speaker has own tier: *CHI, *VAT, ...
  • Annotations on dependent tiers: %mor, %pho, ...
  • Child: Yes. Fish! Father: Fish? Child: Yes.
  • *CHI: ja .
  • %mor: INTERja .
  • *CHI: Fisch !
  • %mor: N03mNOMSGFisch !
  • *VAT: Fisch ?
  • %mor: N03mCASFisch ?
  • *CHI: ja .
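
Because every line announces its tier, a CHAT file can be read mechanically. The following is a minimal sketch, assuming the standard CHAT markers (headers start with `@`, main tiers with `*`, dependent tiers with `%`, and the tier code is separated from its content by a colon). The class and function names are hypothetical, continuation lines are not handled, and the tag strings in the sample are copied as they appear on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    speaker: str                                    # e.g. "CHI"
    text: str                                       # main tier content
    dependent: dict = field(default_factory=dict)   # e.g. {"mor": "..."}


def parse_chat(lines):
    """Split CHAT lines into header lines and utterances with dependent tiers."""
    headers, utterances = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
        elif line.startswith("*"):
            code, _, text = line[1:].partition(":")
            utterances.append(Utterance(code.strip(), text.strip()))
        elif line.startswith("%") and utterances:
            # A dependent tier annotates the most recent main tier.
            code, _, text = line[1:].partition(":")
            utterances[-1].dependent[code.strip()] = text.strip()
    return headers, utterances


sample = """\
@Begin
*CHI: ja .
%mor: INTERja .
*CHI: Fisch !
%mor: N03mNOMSGFisch !
*VAT: Fisch ?
*CHI: ja .
@End
""".splitlines()

headers, utterances = parse_chat(sample)
for u in utterances:
    print(u.speaker, "|", u.text, "|", u.dependent.get("mor", "-"))
```

Attaching each dependent tier to the preceding main tier mirrors the rule that annotations belong to the utterance directly above them.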

23
CHAT Transcription
  • orthographic or not?
  • depends on purpose
  • orthographic transcription: ease of retrieval
  • additional information via dependent tiers (%pho)
  • utterances can be linked to digitized sound files
    (Sonic-CHAT)
  • or to video files

24
SONIC CHAT

25
CHAT Transcription
  • spoken language is not as orderly as written text
  • coding scheme for spoken language phenomena
  • overlaps
  • trailing off
  • noncompletions: is(t)
  • retracing: Schrei [//] Scheibenwischer
  • non-words: hm@o
  • replacements: nix [: nichts]

26
CHAT Annotation
  • annotations to an utterance (tagging) on the
    dependent tiers of that utterance
  • *CHI: [D] ist drin . (is in there)
  • %mor: VCOPSPOSPRES3ssein ADVdrin .
  • %exp: es ist noch etwas Kakao im Becher .
    (there is still some cocoa in the cup)
  • here:
  • %mor: morphology
  • %exp: explanation of utterance situation

27
CHAT Annotation
  • annotations to an utterance (tagging) on the
    dependent tiers of that utterance
  • *CHI: [D] ist drin .
  • %mor: VCOPSPOSPRES3ssein ADVdrin .
  • tag fields: copula, suppletive, (empty), tense,
    agreement, citation form
  • → ist is the 3rd pers. sing. present tense of
    the suppletive copula verb sein

28
CHAT Annotation
  • tagging is based on theoretical notions of adult
    language!
  • e.g., when ist is tagged as VCOP etc., this
    doesn't mean that it constitutes a VCOP for the
    child
  • *CHI: [D] Leiter hoch .
  • %mor: N02fAKKSGLeiter PThoch .
  • hoch: a verb particle for the child?

29
CHAT Annotation
  • transcription and annotation in CLAN editor
  • converts txt- and SALT-format files to CHAT
  • automatic tagging (mor-tiers)
  • lexicon file with word information
  • tag disambiguation (manual / probabilistic)
  • computes coding reliability
  • checks conformity with CHAT-conventions
  • works on different workstations (unlike TRANSANA)
  • access files on network drive

30
CLAN / CHAT interface

31
CLAN Commands
  • search commands, e.g.
  • simple and combined strings in utterances and
    annotations
  • interaction blocks
  • imitations, repetitions, overlaps
  • computing commands, e.g.
  • mlu / mlt (mean length of utterances / turns)
  • longest words / utterances
  • vocabulary diversity: TTR, measure D
  • frequency of phonemes positions
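
What a computing command such as mlu produces can be illustrated outside CLAN. The sketch below counts words rather than morphemes, so it is a simplification and not a reimplementation of the CLAN command; the function names are hypothetical and the utterances are taken from examples in these slides.

```python
PUNCTUATION = {".", "!", "?", ","}


def utterance_length(utterance):
    """Number of word tokens in an utterance, ignoring punctuation tokens."""
    return sum(1 for token in utterance.split() if token not in PUNCTUATION)


def mean_length_of_utterance(utterances):
    """Mean utterance length in words (a word-based stand-in for MLU)."""
    if not utterances:
        return 0.0
    return sum(utterance_length(u) for u in utterances) / len(utterances)


child_utterances = ["ja .", "Fisch !", "Leiter hoch .", "was machst du ?"]

print("MLU in words:", mean_length_of_utterance(child_utterances))
print("longest utterance:", max(child_utterances, key=utterance_length))
```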

32
CLAN Commands
  • commands in DOS-like style
  • Examples
  • research question: WH-words
  • emergence
  • frequency
  • use

33
Emergence of Interrogative Pronouns
  • coding
  • *CHI: was machst du ? (lit. what do you?)
  • %mor: PROintwas ...
  • search
  • kwal +t*CHI +t%mor +sproint* le020*.cha
  • kwal: search
  • +t*CHI: in the child's utterances
  • +t%mor: on the mor-tiers
  • +sproint*: for strings starting with proint
  • le020*.cha: in all files up to age 2;9

34
Emergence of Interrogative Pronouns
  • output
  • kwal (08-Dec-2004) is conducting analyses on:
  • ONLY speaker main tiers matching: *CHI
  • and those speakers' ONLY dependent tiers
    matching: %MOR
  • From file
  • From file
  • From file
  • ----------------------------------------
  • File "le020008.cha": line 2603. Keyword:
    prointwo
  • *CHI: Mama, wo bis(t) du ! (Mama, where are
    you!)
  • %mor: N01fVOCSGMama PROintwo
    VCOPSPOSPRES2ssein PROpersNOMSGdu
    N01fVOCSGMama !

Triple click to access utterance!
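
The logic of this keyword search can be sketched in a few lines of Python. This is an illustration of the idea, not of how kwal is implemented: it keeps only one speaker's utterances and reports every %mor tag that starts with a given prefix. The record layout and function name are hypothetical; tag strings are copied as they appear on the slides, and the second record's line number is invented for the example.

```python
def keyword_search(records, speaker, prefix):
    """Return (matching tag, record) pairs for one speaker's %mor tiers."""
    prefix = prefix.lower()
    hits = []
    for record in records:
        if record["speaker"] != speaker:
            continue
        for tag in record["mor"].split():
            # Case-insensitive prefix match, like searching for "proint*".
            if tag.lower().startswith(prefix):
                hits.append((tag, record))
    return hits


records = [
    {"file": "le020008.cha", "line": 2603, "speaker": "CHI",
     "text": "Mama, wo bis(t) du !",
     "mor": "N01fVOCSGMama PROintwo VCOPSPOSPRES2ssein "
            "PROpersNOMSGdu N01fVOCSGMama !"},
    {"file": "le020008.cha", "line": 2610, "speaker": "VAT",  # hypothetical
     "text": "Fisch ?", "mor": "N03mCASFisch ?"},
]

for tag, record in keyword_search(records, "CHI", "proint"):
    print(f'File "{record["file"]}" line {record["line"]}. Keyword {tag}')
    print(f'*{record["speaker"]}: {record["text"]}')
```
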
35
Frequency of Interrogative Pronouns
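  • Does the child's use match the input
    frequency?
  • Child
  • freq +t*CHI +t%mor +s"PROINT*" le020*.cha +u +o
  • freq: give frequency count
  • +t*CHI: for the child's utterances
  • +t%mor: on the mor-tier
  • +s"PROINT*": of proint
  • le020*.cha: for all files up to age 2;9
  • +u: all files together
  • +o: sort output
  • Input
  • freq -t*CHI +t%mor +s"PROINT*" le020*.cha +u +o
  • -t*CHI: count for the other people's utterances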

36
Frequency of Interrogative Pronouns
  • result
  • child: 536 prointwas, 305 prointwo, 70 prointwie,
    31 prointwer
  • input: 13162 prointwas, 3486 prointwo, 1608 prointwer,
    1255 prointwie
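
Raw counts for child and input differ by an order of magnitude, so the comparison is easier on relative frequencies and rank order. A small sketch using the counts from this slide; the helper names are hypothetical:

```python
child = {"was": 536, "wo": 305, "wie": 70, "wer": 31}
caregiver_input = {"was": 13162, "wo": 3486, "wer": 1608, "wie": 1255}


def shares(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    return {word: 100 * n / total for word, n in counts.items()}


def ranking(counts):
    """Words ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)


for label, counts in [("child", child), ("input", caregiver_input)]:
    summary = ", ".join(f"{w} {shares(counts)[w]:.1f}%" for w in ranking(counts))
    print(f"{label}: {summary}")
```

In both rankings was and wo come first and second; only wie and wer swap places between the child's use and the input.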

37
Non-interrogative use of wh-words
  • Make a file with all words for interrogative
    pronouns Leo uses
  • freq +s"PROINT*" +u +d1 > Leowh.cut
  • +u: for all files together
  • +d1: without frequency count numbers
  • > Leowh.cut: direct output to file

38
Non-interrogative use of wh-words
  • Leowh.cut
  • prointwas
  • prointwo
  • prointwie
  • prointwer
  • strip the file of all proint, so that just the
    wh-words are left
  • chstring +s"proint" "" +y leowh.cut
  • +s"proint" "": change from proint to nothing
  • +y: file not in CHAT-format

39
Non-interrogative use of wh-words
  • then look for uses of these words in sentences
    that do not contain proint
  • combo +t*CHI +t%mor le*.cha +s@leowh.cut^%mor^!*proint*
  • +s@leowh.cut: take words from file as search string
    for utterances
  • ^: followed by
  • %mor^!*proint*: mor tier not containing proint
  • combo: search with Boolean operators
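
The three steps (collect the tagged wh-words, strip the tag prefix, search for untagged uses) can be sketched as follows. This illustrates the logic only and does not reproduce the CLAN commands. Each utterance is represented as a pair of word tokens and tags, which is an assumed layout; the tags other than those starting with PROint are hypothetical placeholders, as is the third example utterance.

```python
PREFIX = "proint"

utterances = [
    (["was", "machst", "du", "?"], ["PROintwas", "V", "PROpers", "?"]),
    (["wo", "bist", "du", "?"], ["PROintwo", "V", "PROpers", "?"]),
    (["da", "ist", "was", "."], ["ADV", "V", "PROindef", "."]),  # hypothetical
]


def wh_words(utterances):
    """Steps 1 and 2: collect tags starting with the prefix and strip it."""
    found = set()
    for _, tags in utterances:
        for tag in tags:
            if tag.lower().startswith(PREFIX):
                found.add(tag.lower()[len(PREFIX):])
    return found


def non_interrogative_uses(utterances, words):
    """Step 3: utterances that contain a wh-word but no interrogative tag."""
    hits = []
    for tokens, tags in utterances:
        has_word = any(token.lower() in words for token in tokens)
        has_tag = any(tag.lower().startswith(PREFIX) for tag in tags)
        if has_word and not has_tag:
            hits.append(" ".join(tokens))
    return hits


words = wh_words(utterances)
print(sorted(words))
print(non_interrogative_uses(utterances, words))
```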

40
Word order in wh-questions
  • German: verb has to follow wh-word directly. Any
    errors?
  • search for all utterances that do not follow this
    pattern
  • combo +t*CHI +t%mor +sproint*^!v* le*.cha
  • search the child's mor-tiers for proint not directly
    followed by any v
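
The pattern "wh-tag not directly followed by a verb tag" can also be expressed as a short check over the tags of a %mor tier. A minimal sketch, assuming (as the tags on the slides suggest) that interrogative pronoun tags start with proint and verb tags with v; the function name is hypothetical and the second tag line is an invented word order error.

```python
def wh_not_followed_by_verb(tag_lines, wh_prefix="proint", verb_prefix="v"):
    """Return tag lines in which a wh-tag is not directly followed by a verb tag."""
    flagged = []
    for line in tag_lines:
        tags = [tag.lower() for tag in line.split()]
        for i, tag in enumerate(tags):
            if tag.startswith(wh_prefix):
                next_tag = tags[i + 1] if i + 1 < len(tags) else ""
                if not next_tag.startswith(verb_prefix):
                    flagged.append(line)
                    break
    return flagged


tag_lines = [
    # From the slides: wh-word directly followed by the copula.
    "N01fVOCSGMama PROintwo VCOPSPOSPRES2ssein PROpersNOMSGdu N01fVOCSGMama !",
    # Hypothetical error: pronoun intervenes between wh-word and verb.
    "PROintwo PROpersNOMSGdu VCOPSPOSPRES2ssein ?",
]

for line in wh_not_followed_by_verb(tag_lines):
    print("check:", line)
```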

41
Cooccurrences of wh-words
  • What words does was (what) cooccur with when used
    as an interrogative pronoun?
  • kwal +t*CHI +t%mor +sproint""was +d +o
    le*.cha | cooccur +swas +t*CHI +u
  • kwal looks for all uses of was as proint
  • the results are directed to cooccur (piping)
  • 6 was da
  • 32 was das
  • 1 was denkst
  • 2 was denn

42
Measuring lexical diversity
  • traditional: type-token ratio (TTR)
  • number of different word types
  • against total number of words
  • every word is a new word: TTR = 1.0
  • the lower the TTR, the less lexical diversity
  • problem: depends on sample size
  • in a large sample, the total vocabulary will
    finally be exhausted
  • TTR levels out because highly frequent words will
    increase the number of tokens disproportionately
  • rarely occurring types will have little influence
    on TTR
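
The sample-size problem is easy to reproduce. The sketch below computes the TTR for growing prefixes of a short token list; the list is invented for illustration from words that occur in these slides.

```python
def type_token_ratio(tokens):
    """Number of different word types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = ("was ist das das ist ein Fisch "
          "was ist das das ist eine Leiter").split()

for size in (4, 8, len(sample)):
    print(size, "tokens: TTR =", round(type_token_ratio(sample[:size]), 3))
```

As more of the sample is included, frequent words such as ist and das keep adding tokens without adding types, so the ratio falls even though the speaker's vocabulary has not changed.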

43
Measure D
  • measure D is obtained by
  • randomly sampling the corpus
  • calculating the actual leveling out of the TTR
    rate
  • and comparing this to theoretical models of TTR
    curves
  • the probability of new types being introduced in
    the corpus is calculated, regardless of sample
    size
  • In CLAN
  • TTR: freq
  • Measure D: VOCD
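
The procedure can be sketched in Python. This is a simplified illustration and not the VOCD program itself. It assumes the theoretical curve commonly given for measure D, TTR = (D/N) · (√(1 + 2N/D) − 1), averages the TTR of random samples for several sample sizes N, and picks the D whose curve fits best. The sample sizes of 35 to 50 tokens, the 100 trials, the search bounds and all function names are assumptions of this sketch.

```python
import math
import random


def model_ttr(n_tokens, d):
    """Theoretical TTR for a sample of n_tokens, given diversity parameter d."""
    return (d / n_tokens) * (math.sqrt(1 + 2 * n_tokens / d) - 1)


def empirical_curve(tokens, sizes, trials=100, seed=0):
    """Average TTR of random samples (without replacement) per sample size."""
    rng = random.Random(seed)
    curve = {}
    for n in sizes:
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        curve[n] = sum(ttrs) / trials
    return curve


def fit_d(curve, low=0.1, high=500.0, steps=60):
    """Least-squares fit of d by ternary search (assumes a single minimum)."""
    def error(d):
        return sum((model_ttr(n, d) - ttr) ** 2 for n, ttr in curve.items())

    for _ in range(steps):
        third = (high - low) / 3
        left, right = low + third, high - third
        if error(left) < error(right):
            high = right
        else:
            low = left
    return (low + high) / 2


# Sanity check on synthetic data: a curve generated with d = 50 is recovered.
synthetic = {n: model_ttr(n, 50.0) for n in range(35, 51)}
print(round(fit_d(synthetic), 2))
```

For real data, `empirical_curve(tokens, range(35, 51))` would supply the curve to fit, provided the token list holds at least 50 tokens.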