Title: An Introduction to
1 An Introduction to CHILDES (Child Language Data Exchange System)
Jacqueline van Kampen
http://www.let.uu.nl/Jacqueline.vanKampen/personal/
2 CHILDES (Child Language Data Exchange System)
Brian MacWhinney and Catherine Snow, Carnegie Mellon University (Pittsburgh).
CHILDES provides tools for studying conversational interaction, including:
- a database of transcripts
- programs for computer analysis of the data
- methods for linguistic coding
- systems for linking transcripts to digitized audio and video
The database includes many megabytes of naturalistic data from children acquiring different languages (over 26 languages are included).
The CHILDES search programs are called CLAN (Computerized Language Analysis).
Information about CHILDES and the CLAN programs is available on the CHILDES homepage and in MacWhinney's handbook:
MacWhinney, B. (2006). The CHILDES Project: Tools for Analyzing Talk, 3rd edition. Mahwah, NJ: Lawrence Erlbaum Associates.
Open website: http://childes.psy.cmu.edu
3 Before you use CHILDES data
Read the ground rules for data usage at the CHILDES website: http://talkbank.org/share/
In a publication based on the use of CHILDES data you should cite:
- the references for the corpora you use (mentioned in the documentation)
- MacWhinney's handbook (latest edition)
Download the CLAN program at http://childes.psy.cmu.edu/clan/
Download the CHILDES files (zip files) at http://childes.psy.cmu.edu/data/local.html
Get the information you need from the manuals: the database manuals, the CHAT manual, and the CLAN manual, all available in PDF at http://childes.psy.cmu.edu
4 Database manuals, CHAT manual, and CLAN manual (in PDF)
The database manuals are available at http://childes.psy.cmu.edu/manuals/
In a database manual you will find:
- all the necessary information about a corpus
- the reference(s) you have to cite when using a corpus
The CHAT manual is available at http://childes.psy.cmu.edu/manuals/CHAT.pdf
In the CHAT manual you will find the information about the codes in the transcriptions.
The CLAN manual is available at http://childes.psy.cmu.edu/manuals/CLAN.pdf
In the CLAN manual you will find the information about the tools for analyzing the data.
- Go to Database Manuals -> American English
- (http://childes.psy.cmu.edu/manuals/02englishusa.doc)
5 The CHAT codes (Codes for the Human Analysis of Transcripts)
The files are transcribed in CHAT format. A CHAT file has the extension ".cha". CHAT codes make it possible to search the files with the various CLAN programs.
The headers (@)
- A header is a line of text that gives information about the participants and the setting.
- All headers begin with the @ sign.
- A CHAT file begins with @Begin and ends with @End.
- The @Begin header is followed by a series of @-headers that state information about the child, the other participants, and the date of recording/transcription.
The tiers (* and %)
- The data that come after the @-headers are divided into lines. Each line begins with a tier code.
- The tiers are an important tool for the CLAN programs in data searching.
- The most important tiers are the *-tiers and the %-tiers.
Put the cursor on the CLAN window. With Ctrl-O (Open), go to Soft Grid Q:\VFS\CLAN\CHILDES\CLAN\LIB and open sample.cha
6 The CHAT codes (Codes for the Human Analysis of Transcripts)
The *-tiers are followed by three capitals that indicate the name of the child or the child's conversation partner, for instance:
- *CHI (followed by an utterance of the child, i.e. the child stated in the @-headers)
- *MOT (followed by an utterance of the mother)
The %-tiers are "dependent tiers" that refer to the preceding utterance of the child or conversation partner. They are followed by three lowercase letters that represent a code, for instance:
- %act (action): describes the actions of the speaker or the listener
- %alt (alternative): provides an alternative possible transcription
- %com (comment): the general-purpose comment tier
- %par (paralinguistic): codes paralinguistic behaviors such as coughing and laughing
- %spa (speech act): used for speech act coding
The %-tiers contain additional information (optionally added).
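The three tier types can be told apart by their first character. The following Python sketch illustrates this on an invented mini transcript; the names `Tier`, `classify_line` and `SAMPLE` are hypothetical, and the code is not part of CLAN and does not handle continuation lines or other details of the CHAT format.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    kind: str   # "header", "main", or "dependent"
    code: str   # e.g. "Begin", "CHI", "act"
    text: str   # content after the code


def classify_line(line: str) -> Tier:
    """Classify one CHAT line by its first character (@, * or %)."""
    kinds = {"@": "header", "*": "main", "%": "dependent"}
    marker, rest = line[0], line[1:]
    if marker not in kinds:
        raise ValueError(f"not a tier line: {line!r}")
    code, _, text = rest.partition(":")
    return Tier(kinds[marker], code.strip(), text.strip())


# Invented mini transcript, for illustration only.
SAMPLE = """@Begin
@Participants:\tCHI Child, MOT Mother
*MOT:\twhat are you making ?
*CHI:\ta mouth .
%act:\tdraws on the paper
@End"""

tiers = [classify_line(line) for line in SAMPLE.splitlines()]
for tier in tiers:
    print(tier.kind, tier.code, tier.text, sep=" | ")
```

Each printed row shows the tier type, its code, and the text that follows the code.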
7 Three tiers coding linguistic information
Some %-tiers are very useful for linguistic analysis:
- %pho (phonology): describes phonological phenomena (in IPA or SAMPA format)
- %mor (morphology): codes morphemic segments by type and part of speech. Example:
  *CHI: I wanted a toy
  %mor: PRO|I&1S V|want-PAST DET|a&INDEF N|toy
- %syn (syntax): codes syntactic structure
Open LIB\sample2mor
8 The CHAT codes (Codes for the Human Analysis of Transcripts)
Some (!) frequently used notations:
- #  unfilled pause between words
- @  special form markers
- 6  schwa
- &  phonological fragment
- xxx  unintelligible speech (not treated as a word by the CLAN programs)
- www  untranscribed material
- [/]  retracing without correction, e.g. then [/] then
- [//]  retracing with correction, e.g. then [//] but
- ["]  quotation mark, used when the child literally repeats something, e.g. bear ["]
- < >  (item for) all words between the angle brackets, e.g. <bear sleeps> ["]
- +...  trailing off. The sentence is incomplete, but not interrupted by another speaker.
- +/.  interruption. The sentence is incomplete, and interrupted by another speaker.
- [=! text]  paralinguistic material, like crying, yelling, laughing, for instance [=! cries]
- [= text]  short explanation, e.g. [= look there in the closet]
- [: text]  standard form (in the adult language), e.g. he have [: has]
- See the CHAT manual, pages 129-133, for a full list of symbols
- (at http://childes.psy.cmu.edu/manuals/CHAT.pdf)
9 The CLAN (Computerized Language Analysis) programs
The CLAN programs are tools for analyzing the data. In order to run CLAN, you have to install it on your PC or Mac.
CLAN provides a commands window. In this window you can type the commands to run an analysis on one or more files.
The output of the analyses appears in another window, the "output window".
Open the CLAN window again
10 The CLAN commands
A CLAN command includes several components:
- directory (input/working): specifies the search space (obligatory)
- directory (output): specifies the directory in which the output will be stored (optional)
- the main command
- search file(s)
- output file (optional)
- +/- switch(es)
In the commands window:
- Working: specify the search space (obligatory)
- Output: specify the storing place (optional)
- Lib: set the LIB (library) directory
- command line: main command, search file(s), +/- switch(es) (in any order)
1. Select under Working: \lib\ne32
2. Select under Lib: \lib
11 The Search Functions: the main command
In the CLAN window you must specify the search functions:
main command, +/- switch(es), search file(s), output file (optional)
The main command: select this first, and select only one command (it is specified only once). Some frequently used options:
- freq (frequency counts)
- kwal (word/morpheme search)
- combo (combined searches of 2 or more words/morphemes)
- mlu (MLU counts)
- chip (comparison and analysis of utterances of different speakers)
CLAN icon (click): survey of all command options
Click the CLAN icon and select KWAL
12 The Search Functions: the +/- parameter switches
- Parameter switches may be specified more than once.
- The order of the parameter switches is free.
- Parameter switches have (in general) an option <+> (include) and <-> (exclude).
- Not all parameter switches go with all commands.
- The various switches across the commands can be seen in the commands window.
Some parameter switches:
- +t: selects the utterances of a specified speaker (the one following the * tier)
- +s: selects a word to be searched (search)
- +d: used with kwal, this option puts the output in CHAT format
- +o: used with freq, this option sorts the output by descending frequency
- +u: specifies that all search results are stored in 1 file
- +r: deals with the treatment of material between parentheses
- +x: the search includes only utterances longer than / shorter than the specified number of words (w), morphemes (m) or characters (c)
- +w / -w: gives extra utterances in the context of the searched item (window)
- +f / -f: with +f the output is stored in the (specified) file(s); with -f the output appears on the screen
Stay at the commands window, type kwal and press return
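To make the anatomy of a command line concrete, here is a small Python sketch that splits a command into its main command, its +/- switches, and its search files. The function name `split_command` is hypothetical, and the rules (first token is the command, tokens starting with + or - are switches, everything else is a file) are a simplification for illustration, not CLAN's own parser.

```python
import re


def split_command(line: str) -> dict:
    """Split a CLAN-style command line into command, switches, and files."""
    # A token is a run of non-space characters; a quoted part may hold spaces.
    tokens = re.findall(r'(?:[^\s"]+|"[^"]*")+', line)
    command, rest = tokens[0], tokens[1:]
    switches = [t for t in rest if t[0] in "+-"]
    files = [t for t in rest if t[0] not in "+-"]
    return {"command": command, "switches": switches, "files": files}


parts = split_command('kwal +t*CHI +s"a" 68.cha +w1 -w2')
print(parts["command"])   # the main command
print(parts["switches"])  # all +/- switches, in the order typed
print(parts["files"])     # everything else is treated as a search file
```

Because the switches and files are collected independently, the sketch gives the same result whatever order they are typed in, which mirrors the free ordering described above.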
13 The Search Functions: the +/- t option
The parameter switch +t
- +t selects the utterances of a specified speaker (the one following the * tier).
- The +t switch may be specified more than once!
- The +t/-t switch includes (+t) or excludes (-t) particular tier(s).
In CHAT-formatted files, there are three tier code types:
- main speaker tiers (denoted by *)
- speaker-dependent tiers (denoted by %)
- header tiers (denoted by @)
The speaker-dependent tiers are attached to speaker tiers. For example, +t*MOT (speaker tier mother) combined with +t%act (dependent tier action) analyzes all of the *MOT main tiers and only the %act dependent tiers associated with that speaker.
The +t option specifies which main speaker tiers, their dependent tiers, and header tiers should be included in the analysis. All other tiers found in the given file will be ignored by the program.
Put the cursor back on the previous line (kwal) and type +t*CHI
NOTE: You have to insert a space after each option!
14 The Search Functions: the +/- s option
The parameter switch +s
- +s selects a word (or code) to be searched (search).
- The +s switch may be specified more than once!
- The +s/-s switch is used to include or exclude certain words.
The +s option specifies the keyword you want to find. You do this by putting the word in quotes directly after the +s switch, as in +s"dog" to search for the word "dog". Use of the +s option will override the default (all utterances!).
kwal +t*CHI +s"a"
Add +s"a" to the line in the commands window
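The combination of a speaker filter and a keyword can be pictured with a few lines of Python. This is only a sketch of the idea behind a command such as kwal +t*CHI +s"a": the function `keyword_lines` and the transcript are made up for illustration, matching is exact and case-sensitive, and none of CLAN's other options are modelled.

```python
def keyword_lines(lines, speaker, keyword):
    """Return (line_number, line) pairs for one speaker that contain keyword."""
    prefix = f"*{speaker}:"
    hits = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(prefix):
            continue                      # other speakers, headers, % tiers
        words = line[len(prefix):].split()
        if keyword in words:              # whole-word match only
            hits.append((number, line))
    return hits


# Invented transcript lines, for illustration only.
TRANSCRIPT = [
    "*MOT:\twhat are you making ?",
    "*CHI:\ta mouth .",
    "*MOT:\ta mouth ?",
    "*CHI:\tnope .",
]

hits = keyword_lines(TRANSCRIPT, "CHI", "a")
for number, line in hits:
    print(f"line {number}. Keyword: a")
    print(line)
```

Only the child's utterance is reported, although the mother also says "a": the speaker filter is applied before the keyword is looked up.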
15 The Search Functions: the search files
main command, +/- switch(es), search file(s), output file (optional)
The search file
- CLAN takes as working space the directory specified under Working.
- The files (from the directory) have to be specified in the commands window.
- For all files in the directory: type *.* or go to the FILE IN icon.
- Search files may be specified more than once.
kwal +t*CHI +s"a" @
FILE IN icon (click). Choose:
- all files in the directory: click on Add All
- a subset of the files: double-click on each file
Click the FILE IN icon, select file 98.cha, and click Done
16 The Search Functions: the output file
main command, +/- switch(es), search file(s), output file (optional)
Important guidelines:
- You can put the file name and the switches in any order you wish.
- You must not forget to keep a space between each option.
- By default CLAN gives the output on the screen. With the option +f you can change this.
kwal +t*CHI +s"a" @ +fart
- +f puts the output in the directory specified under Output (or, by default, in the working directory).
- +f can (optionally) be given an extension of 3 letters. If so, the output file will get this extension.
Type +fart on the command line
17 The output file
kwal +t*CHI +s"a" @ +fart
Click on Run. Type Ctrl-O and see if you get this in the output file:
kwal +t*CHI +sa +f
Sun Nov 05 16:44:23 2006
kwal (25-Oct-2006) is conducting analyses on:
ONLY speaker main tiers matching: *CHI
From file <c:\childes\clan\lib\ne32\98.cha> to file <c:\childes\clan\lib\ne32\98.kwa.cex>
----------------------------------------
File "c:\childes\clan\lib\ne32\98.cha": line 447. Keyword: a
*CHI: <a yyy> [>] .
----------------------------------------
File "c:\childes\clan\lib\ne32\98.cha": line 481. Keywords: a, a
*CHI: a [/] a +...
----------------------------------------
File "c:\childes\clan\lib\ne32\98.cha": line 487. Keyword: a
*CHI: a .
----------------------------------------
File "c:\childes\clan\lib\ne32\98.cha": line 495. Keyword: a
*CHI: a coat .
----------------------------------------
File "c:\childes\clan\lib\ne32\98.cha": line 551. Keyword: a
*CHI: a clothes .
18 The output window
Erase the output file 98.art
+f: the output is stored in the (specified) file(s). By default, or with -f, the output appears on the screen.
Now try out the following:
- let the output appear on the screen
- select all files in the directory
- let kwal search for the word "shall" uttered by the mother
Click on Run. You will get this on the screen (Output window):
kwal +t*MOT +sshall @ +f +u
Sun Nov 05 16:57:20 2006
kwal (25-Oct-2006) is conducting analyses on:
ONLY speaker main tiers matching: *MOT
From file <c:\childes\clan\lib\ne32\55.cha> to file <c:\childes\clan\lib\ne32\55.kw0.cex>
From file <c:\childes\clan\lib\ne32\66.cha> to file <c:\childes\clan\lib\ne32\55.kw0.cex>
From file <c:\childes\clan\lib\ne32\68.cha> to file <c:\childes\clan\lib\ne32\55.kw0.cex>
----------------------------------------
File "c:\childes\clan\lib\ne32\68.cha": line 50. Keyword: shall
*MOT: but shall we find out ?
----------------------------------------
File "c:\childes\clan\lib\ne32\68.cha": line 204. Keyword: shall
*MOT: <shall we try the next box> [<] ?
----------------------------------------
File "c:\childes\clan\lib\ne32\68.cha": line 241. Keyword: shall
*MOT: shall we do these ?
----------------------------------------
File "c:\childes\clan\lib\ne32\68.cha": line 393. Keyword: shall
*MOT: shall we try the next one ?
19 The Search Functions: the search files
In order to
- let the output appear on the screen
- select all files in the directory
- let kwal search for the word "shall" uttered by the mother
you should:
- click the FILE IN icon and choose all files in the directory by clicking on Add All
- run: kwal +t*MOT +s"shall" @
20 Some more parameter switches: the +u option
We continue with the KWAL command in the commands window.
+u specifies that all search results are stored in 1 file.
By default, when the user has specified a series of files with @ on the command line, the analysis is performed on each individual file. The program then provides separate output for each data file. If the command line uses the +u option, the program combines the data found in all the specified files into one set and outputs the result for that set as a whole.
- First delete file 98.art.cex
- Go back with the cursor to the command line: kwal +t*CHI +s"a" @ +fart
- Delete @. Go to FILE IN, clear all files, and select files 55.cha and 66.cha
- Run and open the two output files
- Delete the output files 55.art and 66.art (under LIB\ne32)
- Add +u to the command line. Run and open the output file
- Delete the output file 55.art (under LIB\ne32)
21 The +/- w option with KWAL
+w / -w: gives extra utterances in the context of the searched item (window)
kwal +t*CHI +s"a" @ +w1 -w2
kwal +t*CHI +sa @ +w1 -w2
Sun Nov 05 17:35:31 2006
kwal (25-Oct-2006) is conducting analyses on:
ONLY speaker main tiers matching: *CHI
From file <c:\childes\clan\lib\ne32\68.cha>
----------------------------------------
File "c:\childes\clan\lib\ne32\68.cha": line 543. Keyword: a
*CHI: 0 .
*MOT: what are you making ?
*CHI: a mouth .
*MOT: a mouth ?
----------------------------------------
File "c:\childes\clan\lib\ne32\68.cha": line 1091. Keyword: a
*CHI: she toy wants to go in the chair .
*MOT: oh .
*CHI: a xxx chair .
*CHI: xxx on the table .
----------------------------------------
File "c:\childes\clan\lib\ne32\68.cha": line 1142. Keyword: a
*CHI: 0 [=! whines] .
*MOT: see if there's a chair in the garage .
*CHI: nope <a car> [>] .
*MOT: yeah [<] .?
This option can be used with either KWAL or COMBO. The -w option followed by a positive integer (1, 2, 3, etc.) causes the program to display that number of preceding utterances. The +w option followed by a positive integer (1, 2, 3, etc.) causes the program to display that number of succeeding utterances.
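The window idea can be sketched in Python as a slice around the matching utterance. The function `with_context` is a hypothetical helper, not a CLAN routine; its `before` and `after` arguments play the roles described for -w and +w. The first four utterances come from the sample output, the last one is invented.

```python
def with_context(utterances, index, before=0, after=0):
    """Return the utterance at `index` plus its surrounding utterances."""
    start = max(0, index - before)                 # do not run past the start
    stop = min(len(utterances), index + after + 1)  # do not run past the end
    return utterances[start:stop]


UTTERANCES = [
    "*CHI: 0 .",
    "*MOT: what are you making ?",
    "*CHI: a mouth .",
    "*MOT: a mouth ?",
    "*CHI: yeah .",   # invented extra line
]

# Two preceding utterances and one following utterance around "a mouth".
window = with_context(UTTERANCES, 2, before=2, after=1)
for line in window:
    print(line)
```

Note that the window counts utterances of all speakers, so the mother's lines appear in the context even though only the child's tier is searched.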
From here on:
- we work with the Output window
- we type the files directly on the command line
Type file 68.cha and add -w2 and +w1 to the command line
22 The +/- w option with KWAL
From here on:
- we work with the Output window, which means (by default) no +f
- we type the files directly on the command line, which means we erase the @ and type the file 68.cha
-w2 (2 preceding sentences) and +w1 (1 following sentence) give the output shown on the previous slide.
On the command line there should be:
kwal +t*CHI +s"a" 68.cha +w1 -w2
23 The +r option: +r1 and +r2
+r: this option deals with material in parentheses.
By default, CLAN searches for words including the material between parentheses (omitted parts of words). With the +r option, you can change this.
- +r1 removes the parentheses (like the default)
- +r2 leaves the parentheses
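The difference between the two settings can be illustrated with a short Python sketch. The helper names `strip_parentheses` and `matches` are hypothetical; the flag `keep_parentheses` stands for the behaviour described for r2, and its absence for r1 and the default. This only mimics the effect on matching and is not how CLAN is implemented.

```python
def strip_parentheses(word: str) -> str:
    """'(ex)cept' -> 'except': keep the omitted letters, drop the brackets."""
    return word.replace("(", "").replace(")", "")


def matches(keyword: str, utterance: str, keep_parentheses: bool) -> bool:
    """Whole-word match, with or without parentheses left in the words."""
    words = utterance.split()
    if not keep_parentheses:          # behaviour described for r1 / default
        words = [strip_parentheses(w) for w in words]
    return keyword in words


LINE = "yeah (ex)cept they go up and down ."

print(matches("except", LINE, keep_parentheses=False))
print(matches("except", LINE, keep_parentheses=True))
print(matches("(ex)cept", LINE, keep_parentheses=True))
```

With the parentheses kept, the keyword itself has to contain them in order to match.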
kwal +t*CHI +s"except" @ +r1
kwal +t*CHI +sexcept @ +r1
Sun Nov 05 17:54:16 2006
kwal (25-Oct-2006) is conducting analyses on:
ONLY speaker main tiers matching: *CHI
From file <c:\childes\clan\lib\ne32\68.cha>
----------------------------------------
File "c:\childes\clan\lib\ne32\68.cha": line 346. Keyword: except
*CHI: yeah (ex)cept they go up and down .
> kwal +t*CHI +s"except" @ +r2
kwal +t*CHI +sexcept @ +r2
Sun Nov 05 17:54:22 2006
kwal (25-Oct-2006) is conducting analyses on:
ONLY speaker main tiers matching: *CHI
From file <c:\childes\clan\lib\ne32\68.cha>
Try out the following:
- +r1
- with the word "except"
- for speaker child
Run the same with +r2
kwal +t*CHI +s"except" 68.cha +r1
kwal +t*CHI +s"except" 68.cha +r2
24 The +r option: +r1 and +r2
In the two runs shown on the previous slide:
- +r1 behaves like the default and matches "(ex)cept" at line 346
- +r2 gives no matches, because the file contains no word "except", only "(ex)cept"
What would you have to do to get a match with the +r2 option?
kwal +t*CHI +s"(ex)cept" 68.cha +r2
25 The +r option: +r5
kwal +t*CHI +swant 68.cha
Tue Nov 07 10:52:56 2006
kwal (25-Oct-2006) is conducting analyses on:
ONLY speaker main tiers matching: *CHI
From file <68.cha>
----------------------------------------
File "68.cha": line 707. Keyword: want
*CHI: <I want to draw her lips she drawing would sing with> [>] .
----------------------------------------
File "68.cha": line 840. Keywords: want, want, want
*CHI: <I wanna [: want to] make> [/] <I wanna [: want to] make> [//] uh I want to make that lady she was singing on tv [>] .
----------------------------------------
When the child says "wanna", meaning "want to", it is transcribed like this: wanna [: want to]. By default, the material in the form [: want to] replaces the material preceding it.
If you do not want this replacement, use the +r5 switch.
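The effect of the replacement notation can be pictured with a regular expression in Python. The names `REPLACEMENT`, `apply_replacements` and `drop_replacements` are hypothetical: the first function mimics the default described above (the bracketed standard form takes the place of the spoken word), the second mimics what is described for r5 (the spoken word is kept). The sketch only handles a single word before the brackets.

```python
import re

# Matches "word [: replacement text]" with one spoken word before the brackets.
REPLACEMENT = re.compile(r"(\S+) \[: ([^\]]+)\]")


def apply_replacements(utterance: str) -> str:
    """Use the bracketed standard form instead of the word before it."""
    return REPLACEMENT.sub(r"\2", utterance)


def drop_replacements(utterance: str) -> str:
    """Keep the word as spoken and discard the bracketed standard form."""
    return REPLACEMENT.sub(r"\1", utterance)


LINE = "I wanna [: want to] make that"

print(apply_replacements(LINE))
print(drop_replacements(LINE))
```

A search for "want" can only succeed on the first form, and a search for "wanna" only on the second, which is why the two settings give different match counts.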
Try out the following:
- +r1 (or by default no +r)
- with the word "wanna"
- for speaker child
Run the same with +r5
kwal +t*CHI +s"wanna" 68.cha (+r1) (no matches)
kwal +t*CHI +s"wanna" 68.cha +r5 (5 matches)
26 The +d option
+d: used with KWAL, this option puts the output in CHAT format.
By default, KWAL outputs the location of the tier where the match occurs. When the +d switch is turned on, each matched sentence is output without line number information, in a simple legal CHAT format. The +d1 switch outputs legal CHAT format along with file names and line numbers. The +d and +d1 switches can be extremely important tools for performing analyses on particular subsets of a text, because you can use the output file for further analysis with CLAN.
Using +d is sometimes handy when the output is very long and you want to have a quick overview. Leaving out the location specification shortens the output file.
Try out the following:
- the word "the"
- for all speakers
Run the same with +d
kwal +s"the" 68.cha
kwal +s"the" 68.cha +d
27 Some other commands: the FREQ command
FREQ counts the frequencies of words.
What you have to specify: the working directory
What you may specify:
- +t: a specified speaker
- +s: a word to get a frequency count of
- +f: if you want to store the output in a file (instead of in the output window)
- +o: used with FREQ, this option sorts the output by descending frequency
What you will get:
- a list of the words with their frequencies
- the type/token ratio (= the total number of unique words used by a selected speaker divided by the total number of words used by the same speaker)
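As a worked illustration of the type/token ratio, the Python sketch below counts words in a few invented child utterances and divides the number of different words (types) by the total number of words (tokens). The function names and the data are hypothetical, and the tokenization (lowercasing, skipping punctuation-only tokens) is a simplification of what FREQ does.

```python
from collections import Counter


def word_frequencies(utterances):
    """Count words, skipping punctuation-only tokens such as '.' and '?'."""
    counts = Counter()
    for utterance in utterances:
        for token in utterance.lower().split():
            if any(ch.isalnum() for ch in token):
                counts[token] += 1
    return counts


def type_token_ratio(counts):
    """Number of different words divided by the total number of words."""
    return len(counts) / sum(counts.values())


CHILD_UTTERANCES = ["a mouth .", "a coat .", "this a chair ."]  # invented

counts = word_frequencies(CHILD_UTTERANCES)
ratio = type_token_ratio(counts)

# most_common() lists the words by descending frequency.
for word, n in counts.most_common():
    print(n, word)
print(len(counts), "types,", sum(counts.values()), "tokens")
print(round(ratio, 3), "type/token ratio")
```

Here the child uses 5 different words in 7 word tokens, so the ratio is 5/7. A ratio close to 1 means few repetitions; a single word repeated many times gives a ratio close to 0.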
Try out the following (on a new line):
1) freq
2) for speaker child
3) all files
We are still working in ne32!
freq +t*CHI *.cha
28 The FREQ command: the options +u, +o and +s
What you may specify:
- +t: a specified speaker
- +s: a word to get a frequency count of
- +f: stores the output in a file (instead of in the output window)
- +o: used with FREQ, this option sorts the output by descending frequency
- +u: specifies that all search results are stored in 1 big file
Run the same with the +u option
Run again, but add the +o option
Run again, but add the +s option for the word "this"
freq +t*CHI *.cha +u
freq +t*CHI *.cha +u +o
freq +t*CHI *.cha +u +o +s"this"
29 The output of the options +u, +o and +s
Run the same with the +u option:
freq +t*CHI *.cha +u
Tue Nov 07 11:35:33 2006
freq (25-Oct-2006) is conducting analyses on:
ONLY speaker main tiers matching: *CHI
From file <55.cha>
From file <66.cha>
From file <68.cha>
From file <98.cha>
 54 a
  1 abcs
  1 about
  1 after
  1 again
Add the +o option:
freq +t*CHI *.cha +u +o
Tue Nov 07 11:42:18 2006
freq (25-Oct-2006) is conducting analyses on:
ONLY speaker main tiers matching: *CHI
From file <55.cha>
From file <66.cha>
From file <68.cha>
From file <98.cha>
114 yeah
 54 a
 40 this
 38 i
 38 the
 36 no
Add the +s option for the word "this":
 40 this
------------------------------
    1 Total number of different word types used
   40 Total number of words (tokens)
0.025 Type/Token ratio
The commands:
freq +t*CHI *.cha +u
freq +t*CHI *.cha +u +o
freq +t*CHI *.cha +u +s"this" (you can leave out the +o)
30 The COMBO command
COMBO searches for a combination of words combined with Boolean operators like "and" or "or".
What you have to specify: the working directory and the +s option
Examples with "and" (^):
- "want" directly followed by "to": combo +s"want^to" sample.cha
- "want" eventually followed by "to": combo +s"want^*^to" sample.cha
- both "want" and "to" in any order: combo +s"want^to" +x sample.cha
Some operators used with COMBO:
- ^  immediately followed by
- *  repeated character
- +  OR
- !  NOT
What you will get: the list of the utterances that contain the (combination of the) items
Run: combo +t*CHI *.cha +s"want^to"
The command matches: file 55.cha 0 times, file 66.cha 1 time, file 68.cha 10 times, file 98.cha 2 times
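The three kinds of combination in the examples above (directly followed by, eventually followed by, any order) can be spelled out as three small Python predicates over a list of words. The function names are hypothetical and the logic is only a reading of the descriptions on this slide; COMBO's actual pattern language is richer than this.

```python
def adjacent(words, first, second):
    """True if `first` is immediately followed by `second`."""
    return any(a == first and b == second for a, b in zip(words, words[1:]))


def eventually(words, first, second):
    """True if `second` occurs somewhere after the first `first`."""
    return first in words and second in words[words.index(first) + 1:]


def any_order(words, first, second):
    """True if both words occur, regardless of their order."""
    return first in words and second in words


# Invented utterances, for illustration only.
direct = "I want to make that".split()
distant = "I want her to go".split()
inverted = "to bed I want".split()

print(adjacent(direct, "want", "to"), adjacent(distant, "want", "to"))
print(eventually(distant, "want", "to"), eventually(inverted, "want", "to"))
print(any_order(inverted, "want", "to"))
```

Each predicate is more permissive than the one before it, so a looser search can only add matches, never remove them.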
31 The COMBO command: the +x option
COMBO searches are sequential. If you want to find clusters of words in any order, you need to use the +x option.
Try out the following:
- the +s option we^can
- for speaker mother
+swe^can
<55.cha> Strings matched 1 times
<66.cha> Strings matched 0 times
<68.cha> Strings matched 2 times
<98.cha> Strings matched 0 times
Run the same with +x
+swe^can +x
<55.cha> Strings matched 1 times
<66.cha> Strings matched 0 times
<68.cha> Strings matched 4 times
<98.cha> Strings matched 0 times
The +x option gives two extra matches for file 68.cha: one subject inversion (can we) and one random cluster (we . (he) can)
32 The MLU command
MLU calculates the mean length of utterance.
If the corpus has a %mor line, then MLU will give you a true MLU in number of morphemes.
Recall:
- dependent tiers (%) are added to the main tiers (*)
- the tiers %mor, %syn and %pho code linguistic information
If the corpus has no %mor line, you run MLU on the main line by adding the -t%mor switch. Then you get "MLU in words": MLU counts each word as one word and does no morphemic analysis.
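"MLU in words" is simply the total number of words divided by the number of utterances. The Python sketch below shows the arithmetic on four child utterances taken from the sample outputs in this tutorial; the function name `mlu_in_words` is hypothetical, and the rule used to decide what counts as a word (any token with a letter or digit) is a simplification of the exclusions that the MLU program applies.

```python
def mlu_in_words(utterances):
    """Mean number of words per utterance, ignoring punctuation-only tokens."""
    lengths = []
    for utterance in utterances:
        words = [t for t in utterance.split() if any(c.isalnum() for c in t)]
        if words:                      # skip utterances without any word
            lengths.append(len(words))
    return sum(lengths) / len(lengths)


CHILD_UTTERANCES = [
    "a mouth .",                            # 2 words
    "a coat .",                             # 2 words
    "yeah (ex)cept they go up and down .",  # 7 words
    "nope .",                               # 1 word
]

mlu = mlu_in_words(CHILD_UTTERANCES)
print(mlu)
```

The four utterances contain 2 + 2 + 7 + 1 = 12 words, so the mean length is 12 / 4 = 3 words. A morpheme-based MLU would instead count the segments coded on the %mor tier.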
Run: mlu +t*CHI *.cha, and then run mlu again with -t%mor added
Note: there is a folder morsamples under LIB that contains some files with %mor tiers
33 Two useful helps for searching: 1) The wildcard
CLAN offers two possibilities that facilitate searching:
- the wildcard (asterisk *)
- searching in a list of words
The wildcard (asterisk *)
A wildcard uses the asterisk symbol (*) to take the place of something else. Wildcards can be used to refer to:
- a group of files (*.cha)
- a group of speakers (CH*)
- a group of words with a common form
Examples:
- eve10.cha: searches in 1 file (eve10.cha)
- eve*.cha: searches in all (.cha) files of Eve
- +s"go": searches for "go"
- +s"go*": searches for all words that begin with "go": go, goes, goed (child language), going, gone, gold, golden, good, etc.
- +s"*go*": searches for all words that contain "go", so next to the ones above: ongoing, outgo, outgoing, etc.
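The three wildcard patterns can be tried out in Python with the standard library's `fnmatch` module, which also uses the asterisk for "any sequence of characters". This is an analogy only: `fnmatch` has extra special characters (such as ? and [ ]) and CLAN's own wildcard rules may differ in detail. The word list and the helper `select` are invented for illustration.

```python
from fnmatch import fnmatchcase

WORDS = ["go", "goes", "going", "gold", "good", "ongoing", "outgo", "dog"]


def select(pattern, words=WORDS):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]


print(select("go"))     # the exact word only
print(select("go*"))    # words that begin with "go"
print(select("*go*"))   # words that contain "go" anywhere
```

The last pattern returns everything the second one does plus "ongoing" and "outgo", while "dog" is never selected because it does not contain the sequence "go".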
34 1) The wildcard
Run the command in LIB\ne32:
- word search
- for the word "on"
- for speaker child
- 1 big output
kwal +t*CHI *.cha +u +s"on"
Same, but now for all words ending with "on":
kwal +t*CHI *.cha +u +s"*on"
also gives moon, station
Same, but now for all words containing "on":
kwal +t*CHI *.cha +u +s"*on*"
also gives monster, crayons, don't, gone, etc.
35 2) Searching for words in a list
In order to search for words in a list, you have to:
- create a file containing the list of words
- specify the file after the +s option
This saves you typing a series of +s switches.
Create the file either in CLAN (as an ordinary text file) or in any other editor (as a text-only file).
Specify the file on the command line by putting the file name after the +s, preceded by the @ sign.
Use the file articles under the LIB directory to search in the ne32 folder, in file number 55, for the speaker child:
kwal +t*CHI +s@articles 55.cha
(Easy) question: which words is the list made of?
36 Some exercises
Answer the following questions for the files of Adam:
- with FREQ: which question words does Adam use?
- with COMBO: does Adam use the question word "what" with the auxiliary "is"?
- with COMBO: does Adam use the question word "where" with the auxiliary "is" in a subordinate clause?
- with KWAL: does Adam use the words "little" and "small"?
Commands:
- freq +s"how" +s"wh*" +t*CHI +u *.cha
- combo +s"what^is" +t*CHI *.cha
- combo +s"where^is" +t*CHI *.cha +x
- kwal +s"little" +s"small" *.cha +t*CHI