Title: Compiling a corpus of transcribed speech
1Compiling a corpus of transcribed speech
2Anyqs
- A corpus for classroom use in training
interpreters
- Transcribed spontaneous speech (hard to come by)
- Understandable without much contextual
information (standard format)
- Contemporary
- A reasonable quantity (currently 850k words)
- Basic HTML markup in official transcript
(utterances, non-verbals)
- Easy to encode in TEI and to index with XAIRA
3No way is this publicly available
- The BBC site contains transcripts of all Any
Questions programmes in the last 3 years, which
you can download freely for personal
non-commercial use.
- But/and you cannot adapt, alter or create a
derivative work except for your own personal,
non-commercial use.
4What the BBC's original looks like
- PRESENTER Jonathan Dimbleby
PANELLISTS Lord Falconer, Malcolm Rifkind, Anne McElvoy, Chris Huhne
FROM Medical Women's Federation, Central London
DIMBLEBY
Welcome to London where we are on the edge of
Regent's Park at the Royal College of
Obstetricians and Gynaecologists. Our host here
is the Medical Women's Federation, which is
holding its 90th anniversary conference here.
With its origins in the late 19th Century the
federation was in 1917 formed with an initial
membership of 190 women doctors. Subjects at the
top of their agenda then Medical women engaged
in war and the contemporary challenges of
venereal diseases, prostitution, maternity and
infant welfare. Plus ca change. Except that today
more than half the present crop of medical
students are women and the federation's main aim
is to keep women doctors active in the medical
workforce with all that that implies for
part-time training and child welfare.
On our panel the former Lord Chancellor Charlie
Falconer. Lord Falconer there have been
scurrilous reports in some of the newspapers to
the effect that you're not happy with your
pension and that you want it to be doubled, it's
52,000 a year, we can presume that you are quite
happy yes?
FALCONER
I think I'd rather not talk about that, if you
don't mind Jonathan.
DIMBLEBY
You're entirely free not to talk about that which
suggests that it's unresolved.
The former Foreign Secretary, Sir Malcolm Rifkind
Chris Huhne who wants to be the next leader of
the Liberal Democrats - do you like being the
underdog?
HUHNE
I'm not sure, I think - I'm working on it, I'm
ambitious not to be the underdog Jonathan.
DIMBLEBY
And Anne McElvoy, executive editor and columnist
at the Evening Standard.
CLAPPING
Our first question please.
HICKS
Tom Hicks. Should Ian Blair resign?
5Marking it up in XML
- In the Header
- Programme details (including date)
- Participants and roles
- Setting
- In the Text
- Utterance boundaries and their speakers
- Sentence boundaries (based on punctuation in
transcript)
- Non-verbal events (e.g. clapping, laughter,
coughs)
- Topic boundaries (i.e. new question)
- Tokenisation - s
- POS tagging, maybe some day
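The text-level steps listed above (utterance boundaries, punctuation-based sentence boundaries, non-verbal events) could be approximated as follows. This is only a sketch under stated assumptions: it supposes the transcript has been reduced to plain lines in which each speaker label sits alone on an upper-case line, and the `EVENTS` set, `split_sentences` and `parse_turns` are hypothetical names, not part of the actual conversion used for the corpus.

```python
import re

# Assumed labels for non-verbal events; real transcripts may use others.
EVENTS = {"CLAPPING", "LAUGHTER", "COUGHING"}
# A line consisting only of an upper-case name is treated as a speaker label.
LABEL = re.compile(r"^[A-Z][A-Z']+$")


def split_sentences(text):
    """Split on sentence-final punctuation (naive: abbreviations will break it)."""
    parts = re.findall(r"[^.?!]+[.?!]+|[^.?!]+$", text)
    return [p.strip() for p in parts if p.strip()]


def parse_turns(lines):
    """Return a list of (speaker, items); each item is ('s', text) or ('event', desc)."""
    turns = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line in EVENTS:
            # An event is attached to the utterance in progress.
            if turns:
                turns[-1][1].append(("event", line.lower()))
        elif LABEL.match(line):
            turns.append((line, []))
        elif turns:
            turns[-1][1].extend(("s", s) for s in split_sentences(line))
        # Lines before the first speaker label (programme details) are skipped.
    return turns


sample = [
    "DIMBLEBY",
    "And Anne McElvoy, executive editor and columnist at the Evening Standard.",
    "CLAPPING",
    "Our first question please.",
    "HICKS",
    "Tom Hicks. Should Ian Blair resign?",
]
for speaker, items in parse_turns(sample):
    print(speaker, items)
```

Topic boundaries are not handled here; one plausible rule, again an assumption, would be to open a new division whenever a questioner's turn begins.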
6Overall document structure
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>
          Any questions <date> Date </date>
        </title>
      </titleStmt>
    </fileDesc>
    <profileDesc> Profile </profileDesc>
  </teiHeader>
  <text> Text </text>
</TEI>
7Profile
<profileDesc>
  <particDesc><listPerson>
    <person who="name" sex="f | m"
            role="presenter | questioner | party | profession">
      <para> fullname </para>
    </person>
  </listPerson></particDesc>
  <settingDesc>
    <setting> wherefrom </setting>
  </settingDesc>
</profileDesc>
8Text
<text>
  <div type="intro">
    <u who="DIMBLEBY">
      <s>Welcome to London </s>
      <s>And Anne McElvoy, executive editor and
         columnist at the Evening Standard. </s>
      <event desc="clapping"/>
      <s>Our first question please.</s>
    </u>
  </div>
  <div type="question">
    <u who="HICKS">
      <s>Tom Hicks. </s>
      <s>Should Ian Blair resign? </s>
    </u>
  </div>
</text>
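A structure like the one on this slide can be generated programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `build_div` and its item format are hypothetical, and the element and attribute names (`div`, `u`, `s`, `event`, `who`, `type`, `desc`) are simply those shown on the slide, with no TEI namespace assumed.

```python
import xml.etree.ElementTree as ET


def build_div(div_type, speaker, items):
    """Build one <div> holding a single <u>.

    items is a list of ('s', text) or ('event', description) pairs.
    """
    div = ET.Element("div", {"type": div_type})
    u = ET.SubElement(div, "u", {"who": speaker})
    for kind, value in items:
        if kind == "s":
            ET.SubElement(u, "s").text = value
        else:
            ET.SubElement(u, "event", {"desc": value})
    return div


text = ET.Element("text")
text.append(build_div("intro", "DIMBLEBY", [
    ("s", "And Anne McElvoy, executive editor and columnist at the Evening Standard."),
    ("event", "clapping"),
    ("s", "Our first question please."),
]))
text.append(build_div("question", "HICKS", [
    ("s", "Tom Hicks."),
    ("s", "Should Ian Blair resign?"),
]))

# Serialise to a string (unindented) for inspection or writing to a file.
print(ET.tostring(text, encoding="unicode"))
```

A real introduction would contain several utterances per division; the one-utterance helper is kept deliberately small.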
9Utterances / Sentences
- Role
- Lab 1438 / 4617
- Con 1225 / 4272
- Lib 701 / 2590
- Presenter 5437 / 9204
- Questioner 838 / 1582
- Other 2282 / 8480
- Sex
- Male 10096 / 24829
- Female 1817 / 5906
- Other 6 / 7
Total 11921 / 30745
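Counts of this kind can be derived directly from the markup. The following sketch assumes, as on the earlier slides, that each `<u who="...">` points at a `<person who="...">` carrying `role` and `sex` attributes; `SAMPLE` is a tiny invented document for illustration and `count_by` is a hypothetical helper, not the tool (XAIRA) actually used to index the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A tiny invented document in the shape shown on the slides (no namespace).
SAMPLE = """<TEI>
  <teiHeader><profileDesc><particDesc><listPerson>
    <person who="DIMBLEBY" sex="m" role="presenter"/>
    <person who="HICKS" sex="m" role="questioner"/>
  </listPerson></particDesc></profileDesc></teiHeader>
  <text>
    <div type="intro"><u who="DIMBLEBY">
      <s>Welcome to London</s><s>Our first question please.</s>
    </u></div>
    <div type="question"><u who="HICKS">
      <s>Tom Hicks.</s><s>Should Ian Blair resign?</s>
    </u></div>
  </text>
</TEI>"""


def count_by(root, attr):
    """Count <u> and <s> elements per value of a <person> attribute (role or sex)."""
    lookup = {p.get("who"): p.get(attr, "other") for p in root.iter("person")}
    utterances, sentences = Counter(), Counter()
    for u in root.iter("u"):
        key = lookup.get(u.get("who"), "other")
        utterances[key] += 1
        sentences[key] += len(u.findall("s"))
    return utterances, sentences


root = ET.fromstring(SAMPLE)
u_counts, s_counts = count_by(root, "role")
for key in u_counts:
    print(f"{key}: {u_counts[key]} / {s_counts[key]}")
```

Speakers missing from the participant list fall into an "other" bucket, which is one possible reading of the Other rows above.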
10Things to do with it (1): Politically preferred
lexis?
- Occurrences per 1000 <s>
- UK 293 9
- Lab 35 7
- Con 11 3
- Lib 16 6
- Other 230
- United Kingdom 67 2
- Lab 13 3
- Con 19 4
- Lib 6 2
- Other 29
- Occurrences per 1000 <s>
- Britain 371 12
- Lab 58 12
- Con 84 20
- Lib 31 12
- Other 198
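The normalisation behind these figures can be sketched as below. It assumes that the first number in each row is a raw count and the second a rate per 1000 `<s>`, with the sentence totals taken from the Utterances / Sentences slide; the names `SENTENCES`, `COUNTS` and `per_thousand` are illustrative only. Recomputed this way, every rate lands within one of the integer shown on the slide, though the rounding does not always agree.

```python
# Sentence totals per party, copied from the "Utterances / Sentences" slide.
SENTENCES = {"Lab": 4617, "Con": 4272, "Lib": 2590}
TOTAL_SENTENCES = 30745

# Raw occurrence counts copied from this slide ("all" is the whole-corpus row).
COUNTS = {
    "UK": {"all": 293, "Lab": 35, "Con": 11, "Lib": 16},
    "United Kingdom": {"all": 67, "Lab": 13, "Con": 19, "Lib": 6},
    "Britain": {"all": 371, "Lab": 58, "Con": 84, "Lib": 31},
}


def per_thousand(count, n_sentences):
    """Occurrences per 1000 sentences."""
    return 1000 * count / n_sentences


for term, by_group in COUNTS.items():
    for group, count in by_group.items():
        n = TOTAL_SENTENCES if group == "all" else SENTENCES[group]
        print(f"{term:15} {group:4} {count:4d} {per_thousand(count, n):6.2f}")
```

Normalising by sentences rather than by words is the choice made on the slide; a per-word rate would need token counts that are not given here.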
11Things to do with it (2): Fairly frequent
features of spontaneous speech
- Agreement
- Agree (138)
- Absolutely / actually / certainly / completely /
entirely / fully / quite / rather / totally
- Disagree (72)
- Fundamentally / profoundly
12The advantages of a small specialised corpus
- Homogeneous
- Knowable/predictable
- Manageable numbers
- Deal with all of the data
- Less distracting
13Another appropriate linguistic area
- Connectors
- In fact (163)
- However (131)
14However: connector or adverb?
15And another: the subjunctive in speech
- It were (128)
- As it were (104)
- If it were (20)
- I wish it were (2)
16As it were, data
17Women and vague language
- As it were 104
- Male 101
- Female 3
18As it were (female speakers)
19So are men vaguer, or do they mark vagueness
more? Is this a male politeness strategy?
- As it were
- BNC male 291
- BNC female 68
- <u> BNC male 301205
- <u> BNC female 306293
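Raw counts are hard to compare when men and women contribute different amounts of speech, so one way to read these figures is as rates per 1000 utterances. The sketch below does this for both corpora. It assumes that the utterance totals by sex on the Utterances / Sentences slide are the right denominators for Anyqs and that the BNC `<u>` counts above play the same role; `DATA` and `per_thousand_utterances` are illustrative names.

```python
# (occurrences of "as it were", number of <u>) per corpus and sex, from the slides.
DATA = {
    ("Anyqs", "male"): (101, 10096),
    ("Anyqs", "female"): (3, 1817),
    ("BNC", "male"): (291, 301205),
    ("BNC", "female"): (68, 306293),
}


def per_thousand_utterances(hits, utterances):
    """Occurrences per 1000 utterances."""
    return 1000 * hits / utterances


rates = {key: per_thousand_utterances(*value) for key, value in DATA.items()}

for corpus in ("Anyqs", "BNC"):
    male = rates[(corpus, "male")]
    female = rates[(corpus, "female")]
    print(f"{corpus}: male {male:.2f}, female {female:.2f} per 1000 <u> "
          f"(male/female ratio {male / female:.1f})")
```

With only 3 female occurrences in Anyqs, the ratio for that corpus is unstable and should be read as indicative at most.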
20Do men and women use "as it were" in the same way?