Talking to the Future: The MALACH Project

1
Talking to the Future: The MALACH Project
  • Douglas W. Oard
  • Joanne Archer, Ammie Feijoo, Xiaoli Huang
  • College of Information Studies

2
Telling Our Stories
3
Shoah Foundation's Collection
  • Enormous scale
  • 116,000 hours; 52,000 interviews; 180 TB
  • Grand challenges
  • 32 languages, accents, elderly, emotional, ...
  • Accessible
  • $100 million collection and digitization
    investment
  • Annotated
  • 10,000 hours (200,000 segments) fully described
  • Users
  • A department working full time on dissemination
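The scale figures on this slide imply some useful averages. The sketch below works through that arithmetic using only the numbers given above. It assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify, and the variable names are illustrative.

```python
# Figures taken from the slide.
TOTAL_HOURS = 116_000
INTERVIEWS = 52_000
STORAGE_TB = 180
DESCRIBED_HOURS = 10_000
DESCRIBED_SEGMENTS = 200_000

# Assumption: decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes).
BYTES_PER_TB = 10**12
BYTES_PER_GB = 10**9

# Average length of one interview, in hours.
hours_per_interview = TOTAL_HOURS / INTERVIEWS

# Average length of one fully described segment, in minutes.
minutes_per_segment = DESCRIBED_HOURS * 60 / DESCRIBED_SEGMENTS

# Average storage used per hour of video, in gigabytes.
gb_per_hour = STORAGE_TB * BYTES_PER_TB / BYTES_PER_GB / TOTAL_HOURS

# Share of the collection that has been fully described.
described_fraction = DESCRIBED_HOURS / TOTAL_HOURS

print(f"Average interview length: {hours_per_interview:.2f} hours")
print(f"Average described segment: {minutes_per_segment:.1f} minutes")
print(f"Average storage: {gb_per_hour:.2f} GB per hour")
print(f"Fully described: {described_fraction:.1%} of all hours")
```

Under these assumptions the averages come to roughly 2.2 hours per interview and 3 minutes per described segment, with under a tenth of the collection fully described.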

4
Who Uses the Collection?
Discipline
  • History
  • Linguistics
  • Journalism
  • Material culture
  • Education
  • Psychology
  • Political science
  • Law enforcement
Products
  • Book
  • Documentary film
  • Research paper
  • CDROM
  • Study guide
  • Obituary
  • Evidence
  • Personal use

Based on analysis of 280 access requests
5
Question Types
  • Content
  • Person, organization
  • Place, type of place (e.g., camp, ghetto)
  • Time, time period
  • Event, subject
  • Mode of expression
  • Language
  • Displayed artifacts (photographs, objects, ...)
  • Affective reaction (e.g., vivid, moving, ...)
  • Age appropriateness

6
Full-Description Cataloguing
Figure: interview time divided into segments, each described by Location-Time, Subject, and Person
  • Berlin-1939; Employment; Josef Stein
  • Berlin-1939; Family life; Gretchen Stein, Anna Stein
  • Dresden-1939; Relocation, Transportation-rail
  • Dresden-1939; Schooling; Gunter Wendt, Maria

7
Real-Time Cataloguing
Figure: descriptors assigned at points along interview time, in three tracks
  • Location-Time: Berlin-1939, Dresden-1939
  • Subject: Employment, Family life, Relocation, Transportation-rail, Schooling
  • Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt, Maria
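One way to read the two cataloguing slides is as two data models: full description attaches descriptors to whole segments, while real-time cataloguing attaches each descriptor to a point on the interview timeline. The sketch below illustrates that reading and is not the project's actual schema. The class names, both helper functions, all timestamps, and the pairing of people with segments are hypothetical; only the descriptor labels come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Full-description cataloguing: descriptors belong to a whole segment."""
    start_s: int          # hypothetical start time, in seconds
    end_s: int            # hypothetical end time, in seconds
    location_time: str
    subjects: list = field(default_factory=list)
    people: list = field(default_factory=list)


@dataclass
class TimedDescriptor:
    """Real-time cataloguing: one descriptor attached to one point in time."""
    time_s: int           # hypothetical time, in seconds
    kind: str             # "location-time", "subject", or "person"
    label: str


def segments_mentioning(segments, person):
    """Return the segments whose description lists the given person."""
    return [s for s in segments if person in s.people]


def descriptors_between(tags, start_s, end_s):
    """Return labels of descriptors with start_s <= time_s < end_s."""
    return [t.label for t in tags if start_s <= t.time_s < end_s]


# All times are invented for illustration; only the labels come from the slides.
segments = [
    Segment(0, 300, "Berlin-1939", ["Employment"], ["Josef Stein"]),
    Segment(300, 600, "Berlin-1939", ["Family life"],
            ["Gretchen Stein", "Anna Stein"]),
]
tags = [
    TimedDescriptor(0, "location-time", "Berlin-1939"),
    TimedDescriptor(10, "subject", "Employment"),
    TimedDescriptor(40, "person", "Josef Stein"),
    TimedDescriptor(320, "subject", "Family life"),
]

hits = segments_mentioning(segments, "Anna Stein")
labels = descriptors_between(tags, 0, 300)
print([(s.start_s, s.end_s) for s in hits])
print(labels)
```

In the segment model a search returns a ready-made passage with boundaries. In the time-point model the searcher gets descriptors near a moment and the passage boundaries have to be chosen at query time.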
8
Thesaurus-Based Search
9
The Goal
Dramatically improve access to large
multilingual spoken word collections
by capitalizing on the unique characteristics
of the Survivors of the Shoah Visual History
Foundation's collection of videotaped oral
history interviews.
10
Joanne Archer
11
Observational Studies
Workshop 1 (June)
  • Four searchers
  • History/Political Science
  • Holocaust studies
  • Holocaust studies
  • Documentary filmmaker
  • Sequential observation
  • Rich data collection
  • Intermediary interaction
  • Semi-structured interviews
  • Observational notes
  • Think-aloud
  • Screen capture
Workshop 2 (August)
  • Four searchers
  • Ethnography
  • German Studies
  • Sociology
  • High school teacher
  • Simultaneous observation
  • Opportunistic data collection
  • Intermediary interaction
  • Semi-structured interviews
  • Observational notes
  • Focus group discussions

12
Observed Selection Criteria
  • Topicality (57)
  • Judged based on: person, place, ...
  • Accessibility (23)
  • Judged based on: time to load video
  • Comprehensibility (14)
  • Judged based on: language, speaking style

13
Functionality
14
Xiaoli Huang
15
Supporting Information Access
Source Selection
16
Query Formulation
Speech Recognition
Automatic Search
Boundary Detection
Interactive Selection
Content Tagging
17
Description Strategies
  • Transcription
  • Manual transcription (with optional post-editing)
  • Annotation
  • Manually assign descriptors to points in a
    recording
  • Recommender systems (ratings, link analysis, ...)
  • Associated materials
  • Interviewers' notes, speech scripts, producers'
    logs
  • Automatic
  • Create access points with automatic speech
    processing

18
English ASR Error Rate
Training: 65 hours (acoustic model) / 200 hours (language model)
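The slide does not say how the error rate is measured. ASR error is commonly reported as word error rate (WER): the number of substituted, deleted, and inserted words divided by the number of words in the reference transcript. The following is a minimal sketch assuming that metric; the function name and the example sentences are illustrative and do not come from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: the recognizer drops one of four reference words.
wer = word_error_rate("we moved to dresden", "we moved dresden")
print(f"WER: {wer:.2f}")
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.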
19
Effect of ASR Errors
20
Building a Test Collection
  • Overall relevance: assessment is informed by the
    assessments for the individual reasons for
    relevance (categories of relevance), but the
    relationship is not straightforward
  • Provides direct evidence
  • Provides indirect / circumstantial evidence
  • Provides context (e.g., causes for the phenomenon
    of interest)
  • Provides comparison (similarity or contrast, same
    phenomenon in different environment, similar
    phenomenon)
  • Provides pointer to source of information

21
Ammie Feijoo
22
Some Statistics
  • 2,000 U.S. radio stations Webcasting
  • 250,000 hours of oral history in British Library
  • 35,000,000 audio streams on the Web

23
Spoken Word Collections
  • Broadcast programming
  • News, interview, talk radio, sports,
    entertainment
  • Scripted stories
  • Books on tape, poetry reading, theater
  • Spontaneous storytelling
  • Oral history, folklore
  • Incidental recording
  • Speeches, oral arguments, meetings, phone calls

24
Building a Web of Spoken Words
  • Affordable storage
  • For $1, you can store 1.5 million spoken words
  • Adequate network capacity
  • Internet capacity: 30 million simultaneous
    programs
  • Works with any modem
  • You can even read email while playing audio
  • Replay capabilities
  • 38% of US users recently used streaming audio
  • Effective search capabilities
  • Not quite yet

25
Looking Forward: 2006
  • Working systems in five languages
  • Real users searching real data
  • Rich experience beyond broadcast news
  • Frameworks, components, systems
  • Affordable application-tuned systems
  • Oral history, lectures, speeches, meetings, ...

26
For More Information
  • The MALACH project
  • http://www.clsp.jhu.edu/research/malach/
  • NSF/EU Spoken Word Access Group
  • http://www.dcs.shef.ac.uk/spandh/projects/swag/
  • Speech-based retrieval
  • http://www.glue.umd.edu/dlrg/speech/