Title: Introduction to Speech Corpora_at_Stanford
1Introduction to Speech Corpora_at_Stanford
- Neal Snider,
- snider_at_stanford.edu
- For LIN110,
- April 12th, 2005
- (adapted from slides by Florian Jaeger)
2Before we get to the real stuff
- This presentation will be available online at
- http://www.stanford.edu/dept/linguistics/corpora/material/ling110/
- Local support
- Where are our corpora?
- Setting up your account on AFS
3Local support
- Where can you get help with your project?
- Your TA
- The Corpora_at_Stanford website (http://www.stanford.edu/dept/linguistics/corpora/)
- The corpora_at_csli.stanford.edu email list (you have to subscribe first)
- The corpus TA (snider_at_stanford.edu)
4Where are our corpora? (1)
- AFS
- AFS is Stanford's file sharing system
- The linguistic corpora are stored at
- /afs/ir/data/linguistic-data/
- You need to register for AFS access
- You need to set up your account
5Where are our corpora? (2)
- Corpus Computer
- The computer is the one closest to the printer in the linguistics department's computer cluster (MJH, 1st floor)
- The corpora are stored on partition D:\
- Mapping the drive via a network
6The real part
- Example project
- Overview of available corpora
- Where to find them
- What does the annotation look like?
- How to search speech corpora
7Example projects (1)
- Differences in the realization of phonemes depending on their context
- Context can be segmental 1
- How does the realization of syllabic /m/ differ depending on the preceding onset?
- Word final vowel aspiration
- Context can be supra-segmental 3
- How does the realization of syllabic /m/ differ at the beginning/end of conversations/utterances/sentences?
- Reduction of complex clusters
8Example projects (2)
- Context could also include the register, style (formal vs. informal), genre (reading a fairy tale vs. reading an article), different dialects, etc. 2
- Pitch contours related to specific meanings 1
- Steady-state pitch contours
9Available corpora
- Handout in http://www.stanford.edu/dept/linguistics/corpora/material/X_speech_corpora/X_phonetic corpora.doc
- See also
- http://www.stanford.edu/dept/linguistics/corpora/
11Switchboard spontaneous AE speech
- Transcripts uploaded to AFS
- /afs/ir/data/linguistic-data/Switchboard/
- Sound files available on CD
- available in several formats
- All in one file
- Separate files for
- Syllables
- Words
- Orthographic transcription
12Example annotations (Switchboard)
- Some files in Switchboard
13Switchboard all in one file: Annotation key (1)
- Key
- SENTENCE word1 word2 ... (2005_A_0041)
- WORD word canonical? lm-probs rates positions morebigrams part-of-speech phone1 phone2 ...
- SYL baseform transcribed syl_structure stress length lm-probs rates positions
- PHONE baseform stress syl_part lm-probs rates positions tran1 tran2 ...
14Switchboard all in one file: Annotation key (2)
- lm-probs: trigram unigram trigram-unigram
- rates: seg_tr_syl seg_tr_phn lex_syl lex_phn enrate vrate nvrate mrate mfrate enmmfrate mmfrate
- positions: word_num_in_utterance word_num_in_turn
- morebigrams: bigram reverse-bigram reverse-trigram center-trigram
- part-of-speech: syntactic part of speech (currently only done for the word "to")
- wordX: word number X in acoustically segmented 'sentence'
- canonical?: can if canonical (pronlex) pronunciation, alt otherwise
- trigram: p(word | previous two words)
- unigram: p(word)
- trigram-unigram: difference between two probabilities
- seg_tr_syl: transcribed syllable rate between closest two pauses
- seg_tr_phn: transcribed phone rate between closest two pauses
- lex_syl: lexical syllabic rate (i.e. as determined from wd transcription)
- lex_phn: lexical phone rate (i.e. as determined from wd transcription)
15Switchboard all in one file: Annotation key (3)
- enrate: old enrate measure
- vrate: voicing rate
- nvrate: another voicing rate
- mrate: sub-part of mrate measure
- mfrate: sub-part of mrate measure
- enmmfrate: this is what we call mrate (average of enrate, mrate, mfrate)
- mmfrate: average of mrate, mfrate
- baseform: pronunciation as written in dictionary
- transcribed: transcribed syllable
- syl_structure: onset/nucleus/coda markings from dictionary
- stress: syllable stress marking from dictionary (P = primary, S = secondary, N = none)
- length: syllable length
- tranX: transcribed phone X corresponding to baseform phone
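The derived fields in the key can be checked against their components. The following is a minimal sketch in Python, using values copied from the "WORD like" record in the example annotations. The function names are illustrative and are not part of any Switchboard tooling.

```python
def enmmfrate(enrate, mrate, mfrate):
    """Average of enrate, mrate and mfrate, as described in the key."""
    return (enrate + mrate + mfrate) / 3


def mmfrate(mrate, mfrate):
    """Average of mrate and mfrate, as described in the key."""
    return (mrate + mfrate) / 2


def trigram_minus_unigram(trigram, unigram):
    """Difference between the two language-model values."""
    return trigram - unigram


# Values taken from the "WORD like" record:
# enrate = 3.80, mrate = 2.32, mfrate = 4.64, trigram = -2.408, unigram = -2.152
print(round(enmmfrate(3.80, 2.32, 4.64), 2))
print(round(mmfrate(2.32, 4.64), 2))
print(round(trigram_minus_unigram(-2.408, -2.152), 3))
```

Rounded to the precision used in the file, the three results agree with the enmmfrate, mmfrate and trigram-unigram values stored in that record (3.59, 3.48 and -0.256).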
16Arpabet
17Example annotations (Switchboard all in one file)
- SENTENCE like finding a proper nursing home (2005_A_0041)
- WORD like 1 can -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 l ay k
- SYL l_ay_k l_ay_k O_N_C P 0.258 -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26
- PHONE l P O -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 l
- PHONE ay P N -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 ay
- PHONE k P C -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 k
- WORD finding 2 alt -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 f ay n ih ng
- SYL f_ay_n f_ay_n O_N_C P 0.358 -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27
- PHONE f P O -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 f
- PHONE ay P N -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ay
- PHONE n P C -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 n
- SYL d_ih_ng NULL_ih_ng O_N_C N 0.117 -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27
- PHONE d N O -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 NULL 1 27
- PHONE ih N N -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ih
- PHONE ng N C -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ng
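A record like the WORD lines above can be split into named fields with a few lines of Python. The sketch below assumes the layout visible in the example (word, word number, canonical flag, 3 lm-probs, 11 rates, 2 positions, then the phones); the example records show no morebigrams or part-of-speech fields, so the sketch does not expect them. The function name and the dictionary keys are illustrative choices, not an official reader for the corpus.

```python
LM_FIELDS = ["trigram", "unigram", "trigram-unigram"]
RATE_FIELDS = ["seg_tr_syl", "seg_tr_phn", "lex_syl", "lex_phn", "enrate",
               "vrate", "nvrate", "mrate", "mfrate", "enmmfrate", "mmfrate"]


def parse_word_record(line):
    """Parse one WORD record laid out as in the example annotations.

    Assumed layout (taken from the example, not from a format specification):
    WORD word word_number canonical? <3 lm-probs> <11 rates> <2 positions> phones...
    """
    tokens = line.split()
    if not tokens or tokens[0] != "WORD":
        raise ValueError("not a WORD record")
    numbers = [float(t) for t in tokens[4:18]]  # 3 lm-probs followed by 11 rates
    return {
        "word": tokens[1],
        "word_number": int(tokens[2]),
        "canonical": tokens[3],                  # "can" or "alt"
        "lm_probs": dict(zip(LM_FIELDS, numbers[:3])),
        "rates": dict(zip(RATE_FIELDS, numbers[3:])),
        "positions": [int(t) for t in tokens[18:20]],
        "phones": tokens[20:],
    }


record = parse_word_record(
    "WORD like 1 can -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 "
    "5.79 2.32 4.64 3.59 3.48 0 26 l ay k"
)
print(record["word"], record["canonical"], record["phones"])
print(record["rates"]["enrate"], record["positions"])
```

Splitting on whitespace works here because no field in the example contains a space; syllables use underscores to join their phones.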
18Boston Radio Transcripts
- Includes read news etc. (i.e. non-spontaneous read speech)
- Transcripts uploaded to AFS at
- /afs/ir/data/linguistic-data/Boston-University-Radio
- Sound files available on CD
19Example annotations (Boston Radio)
- Boston News Corpus
- H 0 4
- >endsil
- DH 4 5
- IH1 9 10
- S 19 9
- >This
- HH 28 5
- AA1 33 9
- L 42 12
- AX 54 4
- DCL 58 3
- D 61 1
- EY 62 16
- >holiday
- S 78 11
- IY1 89 14
- Z 103 7
- EN 110 20
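In the listing above, each phone row carries two numbers, and each word line follows the phones that belong to it. A minimal Python sketch of reading this layout is given below. It assumes the two numbers are a start position and a duration (in the listing, start plus duration always equals the next start) and treats any single-token line as a word line, written here with a leading ">". The function names are illustrative, and the slides do not state the time unit, so none is assumed.

```python
def group_phones_by_word(lines):
    """Attach each run of (label, start, duration) rows to the word line after it."""
    words, pending = [], []
    for line in lines:
        tokens = line.split()
        if len(tokens) == 3:                      # phone row: label start duration
            label, start, duration = tokens
            pending.append((label, int(start), int(duration)))
        elif len(tokens) == 1:                    # word line closes the current run
            words.append((tokens[0].lstrip(">"), pending))
            pending = []
    return words, pending                         # pending: phones with no word line yet


def is_contiguous(phones):
    """True if every phone starts where the previous one ends (start + duration)."""
    return all(s + d == nxt for (_, s, d), (_, nxt, _) in zip(phones, phones[1:]))


listing = """H 0 4
>endsil
DH 4 5
IH1 9 10
S 19 9
>This
HH 28 5
AA1 33 9
L 42 12
AX 54 4
DCL 58 3
D 61 1
EY 62 16
>holiday
S 78 11
IY1 89 14
Z 103 7
EN 110 20""".splitlines()

words, leftover = group_phones_by_word(listing)
for word, phones in words:
    print(word, [label for label, _, _ in phones])
```

The last four phones in the listing are returned separately because the word line that would close them is not shown.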
20Example annotations (Boston Radio)
- XWAVES/PRAAT readable
- signal st43/f3ast43p1
- type 1
- color 76
- font --times-medium-r---17-------
- separator
- nfields 1
-
- 0.035000 76 H
- 0.085000 76 DH
- 0.185000 76 IH1
- 0.275000 76 S
- 0.325000 76 HH
- 0.415000 76 AA1
- 0.535000 76 L
- 0.575000 76 AX
- 0.605000 76 DCL
- 0.615000 76 D
- 0.775000 76 EY
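The label rows above can be turned into intervals with a short Python sketch. It assumes each data row is "time colour label" and that each time marks the end of the labelled segment, with the first segment starting at 0.0; both are assumptions, since the slides do not describe the format. Header lines are skipped simply because they do not have three fields with a numeric first field. The function names are illustrative.

```python
def parse_label_rows(lines):
    """Return (time, label) pairs from rows of the form 'time colour label'."""
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue                      # header lines such as 'nfields 1'
        try:
            rows.append((float(parts[0]), parts[2]))
        except ValueError:
            continue                      # three tokens, but not a data row
    return rows


def to_intervals(rows, start=0.0):
    """Pair each label with (start, end), assuming each time is a segment end."""
    intervals = []
    for time, label in rows:
        intervals.append((start, time, label))
        start = time
    return intervals


sample = """signal st43/f3ast43p1
type 1
color 76
nfields 1
0.035000 76 H
0.085000 76 DH
0.185000 76 IH1
0.275000 76 S""".splitlines()

for begin, end, label in to_intervals(parse_label_rows(sample)):
    print(label, begin, end)
```

If the times turn out to mark segment starts instead, only `to_intervals` needs to change.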
21CALLHOME Mandarin - Transcripts
- CALLHOME Mandarin
- Transcripts uploaded to AFS
- /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Transcripts/
- Lexicon with pronunciation information available at
- /afs/ir/data/linguistic-data/CALLHOME/CALLHOME-Mandarin-Lexicon/
- Sound files only available on CD/DVD, but I could put them on the corpus computer
22TIMIT dialect variation
- Recordings of read speech from 8 major dialects of American English
- (Orthographic) transcripts on AFS, sound files available on CD
- Comparable dialect corpora exist for the British Isles (IViE stored on the corpus computer)
23Example annotations (TIMIT)
- TIMIT
- Word label (.wrd)
- 7470 11362 she
- 11362 16000 had
- 15420 17503 your
- 17503 23360 dark
- 23360 28360 suit
- 28360 30960 in
- 30960 36971 greasy
- Phonetic label (.phn)
- (Note: beginning and ending silence regions are marked with h#)
- 0 7470 h#
- 7470 9840 sh
- 9840 11362 iy
- 11362 12908 hv
- 12908 14760 ae
- 14760 15420 dcl
- 15420 16000 jh
- 16000 17503 axr
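The word and phone labels above share one format (start sample, end sample, label), so a single reader covers both. The Python sketch below is an illustration with made-up function names: it reads the two listings, converts sample numbers to seconds using TIMIT's 16 kHz sampling rate, and lists the phones that fall inside each word.

```python
SAMPLE_RATE = 16000  # TIMIT audio is sampled at 16 kHz


def parse_labels(text):
    """Read 'start end label' rows into (start, end, label) tuples."""
    rows = []
    for line in text.strip().splitlines():
        start, end, label = line.split()
        rows.append((int(start), int(end), label))
    return rows


def phones_in_word(word, phones):
    """Labels of the phones whose span lies inside the word's span."""
    w_start, w_end, _ = word
    return [label for start, end, label in phones
            if start >= w_start and end <= w_end]


words = parse_labels("""
7470 11362 she
11362 16000 had
15420 17503 your
""")

phones = parse_labels("""
0 7470 h#
7470 9840 sh
9840 11362 iy
11362 12908 hv
12908 14760 ae
14760 15420 dcl
15420 16000 jh
16000 17503 axr
""")

for word in words:
    start_seconds = word[0] / SAMPLE_RATE
    print(word[2], start_seconds, phones_in_word(word, phones))
```

Because "had" and "your" overlap in the word labels (15420 to 16000), the phone jh is reported for both words.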
24How to search transcribed corpora?
- Either load the files into your favorite text editor
- Or use a command from the grep family (run on a UNIX shell)
- This allows you to search many files at once for patterns that are described by regular expressions
- For help, see our tutorial page at
- http://www.stanford.edu/dept/linguistics/corpora/cas-tut-grep.html
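For those who prefer a script to the shell, the same idea (one regular expression applied to many files) can be sketched in Python. The function name `search_files` is an illustrative choice, and the sketch is not taken from the tutorial page.

```python
import glob
import re


def search_files(file_glob, pattern):
    """Yield (path, line_number, line) for every line matching the pattern.

    file_glob is a shell-style wildcard such as '*.txt'; pattern is a
    regular expression in Python's `re` syntax.
    """
    regex = re.compile(pattern)
    for path in sorted(glob.glob(file_glob)):
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, number, line.rstrip("\n")
```

A call such as `search_files("transcripts/*.txt", r"^WORD finding ")` would list every WORD record for "finding"; the directory name here is hypothetical and should be replaced with wherever the transcript files actually live.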
25Example annotations (Switchboard all in one file)
- SENTENCE like finding a proper nursing home (2005_A_0041)
- WORD like 1 can -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 l ay k
- SYL l_ay_k l_ay_k O_N_C P 0.258 -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26
- PHONE l P O -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 l
- PHONE ay P N -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 ay
- PHONE k P C -2.408 -2.152 -0.256 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 0 26 k
- WORD finding 2 alt -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 f ay n ih ng
- SYL f_ay_n f_ay_n O_N_C P 0.358 -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27
- PHONE f P O -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 f
- PHONE ay P N -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ay
- PHONE n P C -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 n
- SYL d_ih_ng NULL_ih_ng O_N_C N 0.117 -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27
- PHONE d N O -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 NULL 1 27
- PHONE ih N N -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ih
- PHONE ng N C -3.604 -4.256 0.652 4.64 10.43 3.87 9.89 3.80 2.32 5.79 2.32 4.64 3.59 3.48 1 27 ng
26Demo search
- egrep 'SYL a-z_ a-z_ow.1,3ma-z_
Actual phonological pattern
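A comparable search can be sketched in Python. The sketch assumes the target is SYL records whose transcribed syllable (the third field) contains the phone ow followed later by m; this is one possible, looser reading of the phonological pattern, not a translation of the egrep command. The function names are illustrative, and the second sample record is invented for the demonstration.

```python
def has_phone_sequence(syllable, first, second):
    """True if `first` occurs before `second` among underscore-separated phones."""
    phones = syllable.split("_")
    if first not in phones:
        return False
    return second in phones[phones.index(first) + 1:]


def find_syllables(lines, first="ow", second="m"):
    """Return SYL records whose transcribed syllable has `first` before `second`."""
    hits = []
    for line in lines:
        fields = line.split()
        # SYL records start: SYL baseform transcribed syl_structure stress length ...
        if (len(fields) >= 3 and fields[0] == "SYL"
                and has_phone_sequence(fields[2], first, second)):
            hits.append(line)
    return hits


sample = [
    "SYL l_ay_k l_ay_k O_N_C P 0.258",    # shortened from the example annotations
    "SYL hh_ow_m hh_ow_m O_N_C P 0.300",  # invented record, for illustration only
]
print(find_syllables(sample))
```

Comparing phones as whole tokens, rather than matching characters, avoids accidental hits where the letters "ow" or "m" appear inside a longer phone label.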