The CareGiver corpus

Learn more at: http://www.lrec-conf.org

1
The CareGiver corpus
  • Toomas Altosaar, L. ten Bosch, G. Aimetti, C.
    Koniaris, K. Demuynck, H. van den Heuvel

2
Overview
  • Background of the ACORNS project
  • A speech corpus
  • Rationale
  • Design
  • A few details
  • Public availability

3
Background of the ACORNS project
  • Acquisition of COmmunication and RecogNition
    Skills
  • FP6 FET Project 2006-2009
  • www.acorns-project.org
  • Aim: to investigate language acquisition by young
    infants
  • Approach: simulate this learning process by designing
    and testing a computational model
  • Focus on word discovery
  • Improve ASR
  • To that end, a speech corpus was created

4
The ACORNS corpus - rationale
  • ACORNS model takes part in a caregiver-learner
    interaction loop
  • Corpus is required for testing various
    computational approaches for language learning
  • Utterances in corpus simulate the caregiver
  • Corpus strikes a balance in complexity between two extremes:
  • Real-life recordings of caretaker utterances in
    real-life noisy child-caretaker interactions
    (CHILDES)
  • Lab-fabricated speech-like stimuli (NEWPORT)

5
ACORNS-corpus design (1)
  • Four languages (FIN, SWE, UK, NL)
  • In total 10 speakers for FIN, UK, NL
  • 4 speakers for SWE
  • Speech from primary and secondary caregivers
  • Speakers read aloud sentences
  • Simple grammatical structure
  • Limited number of keywords
  • Two speaking styles
  • Infant-directed style (IDS) and adult-directed style
    (ADS)

6
Design (2)
  • Utterances across languages are highly comparable
    with respect to utterance length, syntactic
    structure, choice of keywords
  • Allows a cross-linguistic comparison of
    computational approaches to word discovery
  • Keyword selection was inspired by information
    about communicative development inventories (CDI)
  • E.g. the MacArthur-Bates CDI: http://www.sci.sdsu.edu/cdi/

7
Examples of Y1-utterances (UK)
  • Where is Miriam now ?
  • Do you see the shoe ?
  • Show me the book !
  • That is the bottle
  • The telephone is here
  • Look, Daddy
  • Here is the diaper
  • That is a telephone
  • Show me a shoe

8
Examples of Y2-utterances (UK)
  • I see a green turtle
  • Can you hear the red square and the airplane?
  • 50 keywords
  • Up to 4 keywords per sentence
  • Semantically free
  • But inconsistencies were avoided, e.g. "Look at the
    big small car", "red green ball"

9
Number of utterances
        Y1 (1 keyword/utt)    Y2 (multiple keywords/utt)
SWE     8000                  --
FIN     8000                  11600 (1588)
UK      4000 (IDS only)       11600 (1588)
NL      8000                  11600
Total   28000                 34800
(Totals are cross-linguistically comparable utterances.)
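
The stated totals can be checked by summing the per-language counts. The following is a minimal Python sketch; the dictionary layout and the helper name `total` are illustrative choices, not part of the corpus distribution.

```python
# Utterance counts per language as listed on the slide (None = no data listed).
counts = {
    "SWE": {"Y1": 8000, "Y2": None},
    "FIN": {"Y1": 8000, "Y2": 11600},
    "UK":  {"Y1": 4000, "Y2": 11600},   # Y1 is IDS only
    "NL":  {"Y1": 8000, "Y2": 11600},
}

def total(part):
    """Sum the counts for one corpus part, skipping languages without data."""
    return sum(c[part] for c in counts.values() if c[part] is not None)

y1_total = total("Y1")
y2_total = total("Y2")
print(y1_total, y2_total)
```

Both sums agree with the totals given for Y1 and Y2.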
10
Format
  • Each utterance is available as a single WAV file
    (44.1 kHz, mono)
  • Each WAV file is accompanied by an XML file containing:
  • Speaker information (gender)
  • Speech style (IDS, ADS)
  • Orthographic annotation (checked)
  • Keyword(s)
  • Duration
  • For FIN, some additional information about syntax
    (see paper)
  • Total size: 12 GB
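
As an illustration of how one utterance and its annotation might be read, here is a minimal Python sketch using only the standard library. The XML element names (`gender`, `style`, `orthography`, `keywords`, `duration`), the sample values, and the 16-bit sample width are assumptions made for the example; the actual ACORNS schema is not shown on the slide and may differ.

```python
import io
import wave
import xml.etree.ElementTree as ET

def check_wav(wav_file):
    """Return True if the WAV data is 44.1 kHz mono, as the slide specifies."""
    with wave.open(wav_file, "rb") as w:
        return w.getframerate() == 44100 and w.getnchannels() == 1

def read_metadata(xml_text):
    """Collect the root's child elements into a dict (tag names are hypothetical)."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

# Self-contained demo: synthesize 0.1 s of silence instead of using a corpus file.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)        # 16-bit samples: an assumption, not stated on the slide
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 4410)
buf.seek(0)

# Hypothetical annotation; the real schema and values may differ.
sample_xml = """<utterance>
  <gender>female</gender>
  <style>IDS</style>
  <orthography>Do you see the shoe ?</orthography>
  <keywords>shoe</keywords>
  <duration>1.2</duration>
</utterance>"""

wav_ok = check_wav(buf)
meta = read_metadata(sample_xml)
print(wav_ok, meta["style"], meta["keywords"])
```

With real corpus files, `check_wav` would be given a file path instead of the in-memory buffer, and the tag names would need to be adapted to the distributed XML.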

11
Research purposes
  • Simulation of word detection/word spotting
  • Acquisition of word-like units
  • Acquisition of (simple) syntax
  • Across morphologically and syntactically different
    European languages

12
Public availability
  • Corpus made available via ELRA
  • Interested parties must contact ELRA

13
Conclusion
  • Corpus available with cross-language compatible
    utterances
  • Speech based
  • IDS and ADS modes
  • Utterances have lexical and syntactic structure
    inspired by infant-directed speech
  • Primary and secondary caregivers
  • Ideal for testing models of language acquisition
    and word detection
  • Made available through ELRA
  • More information at www.acorns-project.org
  • Software is also available; see the website