No slide title - PowerPoint PPT Presentation

Provided by: khal62
Learn more at: http://www.lrec-conf.org
1
Quick Rich Transcriptions of Arabic Broadcast News Speech Data
Niklas Paulsson, Djamel Mostefa, Chomicha Bendahman: ELDA (Evaluations and Language resources Distribution Agency)
Meghan Glenn, Stephanie Strassel: LDC (Linguistic Data Consortium)
2
Outline
  • Overview
  • Transcription method
  • Sources
  • Collection
  • Selection
  • Transcription
  • Segmentation, Sentence Units, Overlapping Speech,
    Markup
  • Quality Control
  • Conclusion

3
Overview
  • Broadcast News Transcripts
  • Arabic (MSA, MCA)
  • Sources: radio and TV, mostly Middle East
  • Verbatim orthographic transcripts, time-aligned,
    minimal mark-up
  • QRTR to reduce time

4
Transcription Method (1)
  • QRTR: Quick Rich Transcription
  • (QTR / QRTR / CTR)
  • Amount of detail in markup
  • Number of features identified
  • Degree of accuracy
  • Completeness
  • Amount of time
  • Number of quality checks

5
Transcription Method (2)
Type            QTR  QRTR  CTR
Markup            3     5    7
Speaker noise     2     4    8
External noise    2     1    6
Sections          -     3    3
Themes            -     -   10
Speaker turns     -     3    4
Speaker info      -     3    6
SU type           -     4    0
Features          7    23   47
6
Sources (1)
  • Two types of recordings
  • Broadcast news (BN): talking-head style news
    reports
  • Broadcast conversation (BC): more interactive;
    talk shows, interviews, call-in programs,
    roundtable discussions
  • Mainly MSA from Middle East
  • MCA from North Africa and Middle East
  • Overlapping speech
  • 30 to 60 minutes of recordings
  • collected from TV and Radio sources

7
Sources (2)
8
Collection
  • Sources recorded from satellite
  • Daily and weekly recordings
  • Records video stream
  • Audio extracted from video
  • Saved in WAV or SPH
  • 16 bits, 16 kHz
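The stated audio format (16 bits, 16 kHz) can be checked automatically for WAV files. The following is a minimal sketch using Python's standard `wave` module; the function name `check_wav_format` is hypothetical and not part of the authors' pipeline, and SPH (NIST SPHERE) files are not handled here because `wave` only reads WAV.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_sample_bytes=2):
    """Return a list of format problems for a WAV file (empty if none).

    Defaults follow the slide: 16 kHz sample rate, 16-bit (2-byte) samples.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        if rate != expected_rate:
            problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
        if width != expected_sample_bytes:
            problems.append(
                f"sample width is {8 * width} bits, expected {8 * expected_sample_bytes} bits"
            )
    return problems
```

An empty list means the file matches the expected format; any other result lists what differs.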

9
Audit
  • Manual audit of all programs
  • Procedure
  • Listen to 30-second samples from 3 sections:
    beginning, middle, end
  • Auditors can listen to additional segments if
    necessary
  • Auditors fill in a form for each audited recording
  • Web-based auditing interface
  • Checks
  • Is there a recording?
  • Is the audio quality ok?
  • What is the language?
  • Is it speech from the right program?
  • What is the data type?
  • What is the topic?
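The audit sampling (30-second samples at the beginning, middle and end of a recording) can be sketched as follows. This is only an illustration: the function name `audit_windows` is hypothetical, and centring the middle sample on the midpoint of the recording is an assumption, since the slides do not say exactly where each sample is taken.

```python
def audit_windows(duration_s, sample_s=30.0):
    """Return (start, end) times in seconds for three audit samples.

    The samples cover the beginning, the middle (centred, by assumption)
    and the end of a recording of length duration_s.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    # Short recordings: never sample more than the whole file.
    sample_s = min(sample_s, duration_s)
    mid_start = (duration_s - sample_s) / 2
    return [
        (0.0, sample_s),
        (mid_start, mid_start + sample_s),
        (duration_s - sample_s, duration_s),
    ]
```

For a 30-minute recording (1800 seconds) this gives the windows 0 to 30, 885 to 915 and 1770 to 1800 seconds.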

10
Selection
  • Recordings rejected
  • poor quality
  • wrong language
  • Recordings that passed the audit are eligible for transcription
  • Criteria based on
  • data amount
  • sources
  • dates
  • 2000 hours in 24 sets
  • Sent in packages of 20 to 300 hours for transcription
  • Period: Apr. 2004 to Aug. 2007
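The rejection rule can be sketched as a simple filter over audit results. The record type `AuditResult`, its field names and the function `eligible_for_transcription` are hypothetical. The slides name only poor quality and wrong language as rejection reasons; also rejecting missing recordings and recordings from the wrong program is an assumption based on the audit checks.

```python
from dataclasses import dataclass


@dataclass
class AuditResult:
    """Hypothetical record of one auditor's answers for one recording."""
    recording_id: str
    has_recording: bool
    quality_ok: bool
    language: str
    right_program: bool


def eligible_for_transcription(audit, target_language="Arabic"):
    """Return True if the recording passed the audit."""
    return (
        audit.has_recording          # assumption: an empty file is rejected
        and audit.quality_ok         # from the slide: poor quality is rejected
        and audit.language == target_language  # from the slide: wrong language is rejected
        and audit.right_program      # assumption: wrong program is rejected
    )
```

The further selection criteria (data amount, sources, dates) would be applied to the recordings that pass this filter.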

11
Transcription
  • Orthographic, verbatim transcripts
  • Arabic script
  • No vowels
  • Segmented and time stamped
  • Speaker names
  • Sentence Units
  • Noise markers
  • Overlapping speech
  • Foreign language markup

12
XTrans (1)
  • Tool for broadcast news and conversation
  • Multi-lingual (UTF-8)
  • Multi-platform (Windows, Linux, FreeBSD)
  • Output TDF format
  • Compatible with Transcriber format

13
XTrans (2)
14
Segmentation (1)
  • Segment data into sections
  • Speech delimited by pause or silence
  • Non-speech sections: music, silence, ads, etc.
  • Lasts 5 to 20 seconds
  • Sections are classified as
  • News report (BN)
  • Conversation (BC)
  • Non-news
  • Sections are next grouped into speaker turns
  • Single speaker or overlapping turn
  • Sentence Units (SU)
  • Speaker ID or name for each turn

15
Sentence Units
  • Group utterances into clusters of words
  • Each cluster represents a sentence-like unit
  • Each unit receives a label
  • Statement
  • Question
  • Incomplete
  • Non-Speech
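The four SU labels, and the way labelled units sit inside speaker turns, can be modelled roughly as below. This is a sketch of one possible in-memory representation, not the TDF format written by XTrans; all class, field and function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SUType(Enum):
    """The four sentence-unit labels listed on the slide."""
    STATEMENT = "statement"
    QUESTION = "question"
    INCOMPLETE = "incomplete"
    NON_SPEECH = "non-speech"


@dataclass
class SentenceUnit:
    start_s: float   # start time in seconds
    end_s: float     # end time in seconds
    text: str        # orthographic transcript of the unit
    su_type: SUType


@dataclass
class Turn:
    speaker: str     # speaker ID or name
    units: List[SentenceUnit] = field(default_factory=list)


def count_su_types(turns):
    """Count how many units carry each SU label across all turns."""
    return Counter(unit.su_type for turn in turns for unit in turn.units)
```

A turn then holds an ordered list of time-stamped units, each carrying exactly one label.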

16
Overlapping Speech
  • Many recordings include conversations
  • Portions of speech that are overlapping
  • Segmented and annotated
  • No SU type
  • Could be quite challenging
  • Difficult portions annotated as non-speech

17
Markup (1)
  • Minimal set of markers
  • Hesitations
  • Truncated words
  • Mispronunciations
  • Made up words
  • Noise
  • Difficult speech

18
Markup (2)
  • Noise markers
  • Background noise
  • Speaker noise: laugh, cough, sneeze, lipsmack
  • Dialect / language markup
  • Non-MSA (MCA)
  • English
  • French
  • Foreign Language

19
Quality Control
  • Limited quality control due to time constraints
  • Quick Verification procedure
  • Max 18 min / file
  • Focus
  • Transcription matches speech
  • Segmentation
  • Speaker names
  • Orthography
  • Procedure
  • Checks 3 segments of 3 min: beginning, middle and
    end
  • Transcriptions that did not pass were sent back to
    transcribers
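To give a sense of how limited this check is, the share of each file covered by 3 segments of 3 minutes can be worked out for the 30 to 60 minute recordings mentioned under Sources. The helper name `qc_coverage` is hypothetical, and the sketch assumes the three segments do not overlap.

```python
def qc_coverage(recording_min, segments=3, segment_min=3.0):
    """Return the fraction of a recording covered by the quick verification.

    Assumes the checked segments do not overlap; coverage is capped at 100%.
    """
    if recording_min <= 0:
        raise ValueError("recording_min must be positive")
    checked_min = segments * segment_min
    return min(checked_min, recording_min) / recording_min


for minutes in (30, 60):
    print(f"{minutes} min recording: {qc_coverage(minutes):.0%} checked")
```

Under these assumptions, 9 minutes are checked per file, which is 30% of a 30-minute recording and 15% of a 60-minute one.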

20
Conclusion
  • Arabic broadcast data
  • >2000 hours transcribed
  • 330k words
  • Useful for quantitative manual transcripts
  • Limited timeframe
  • Minimal but useful markup
  • Quality control
  • Training ASR systems for MT

21
  • Thanks for your attention