Title: Talking to the Future: The MALACH Project
1Talking to the Future: The MALACH Project
- Douglas W. Oard
- Joanne Archer, Ammie Feijoo, Xiaoli Huang
- College of Information Studies
2Telling Our Stories
3Shoah Foundation's Collection
- Enormous scale
- 116,000 hours; 52,000 interviews; 180 TB
- Grand challenges
- 32 languages, accents, elderly, emotional, ...
- Accessible
- $100 million collection and digitization investment
- Annotated
- 10,000 hours (200,000 segments) fully described
- Users
- A department working full time on dissemination
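The totals on this slide imply a few averages that help put the scale in perspective. The short sketch below derives them; the constant and function names are illustrative, and the conversion of 1 TB to 1000 GB is an assumption the slide does not state.

```python
# Figures quoted on the slide.
TOTAL_HOURS = 116_000
TOTAL_INTERVIEWS = 52_000
TOTAL_TERABYTES = 180
DESCRIBED_HOURS = 10_000
DESCRIBED_SEGMENTS = 200_000


def collection_ratios():
    """Derive per-interview and per-segment averages from the slide's totals."""
    return {
        "hours_per_interview": TOTAL_HOURS / TOTAL_INTERVIEWS,
        # Assumption: decimal units, 1 TB = 1000 GB.
        "gb_per_hour": TOTAL_TERABYTES * 1000 / TOTAL_HOURS,
        "minutes_per_segment": DESCRIBED_HOURS * 60 / DESCRIBED_SEGMENTS,
        "fraction_fully_described": DESCRIBED_HOURS / TOTAL_HOURS,
    }


if __name__ == "__main__":
    for name, value in collection_ratios().items():
        print(f"{name}: {value:.2f}")
```

On these figures an interview averages a little over two hours, a fully described segment averages three minutes, and fewer than one hour in ten has been fully described.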
4Who Uses the Collection?
Discipline
- History
- Linguistics
- Journalism
- Material culture
- Education
- Psychology
- Political science
- Law enforcement
Products
- Book
- Documentary film
- Research paper
- CDROM
- Study guide
- Obituary
- Evidence
- Personal use
Based on analysis of 280 access requests
5Question Types
- Content
- Person, organization
- Place, type of place (e.g., camp, ghetto)
- Time, time period
- Event, subject
- Mode of expression
- Language
- Displayed artifacts (photographs, objects, ...)
- Affective reaction (e.g., vivid, moving, ...)
- Age appropriateness
6Full-Description Cataloguing
[Figure: interview time divided into segments, each described by Location-Time, Subject, and Person]
- Berlin-1939; Employment; Josef Stein
- Berlin-1939; Family life; Gretchen Stein, Anna Stein
- Dresden-1939; Relocation, Transportation-rail
- Dresden-1939; Schooling; Gunter Wendt, Maria
7Real-Time Cataloguing
[Figure: descriptors assigned at points along interview time, in three tracks]
- Location-Time: Berlin-1939, Dresden-1939
- Subject: Employment, Family Life, Relocation, Transportation-rail, Schooling
- Person: Josef Stein, Gretchen Stein, Anna Stein, Gunter Wendt, Maria
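The difference between the two cataloguing styles can be sketched as two data structures: full-description cataloguing yields one record per segment, while real-time cataloguing yields individual descriptors attached to points in time. This is only an illustration, not the Foundation's actual schema. The class and function names are hypothetical, the grouping of descriptors into segments is one reading of the slide, and every timestamp is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """Full-description style: one record per contiguous stretch of interview."""
    start_s: float
    end_s: float
    location_time: str
    subjects: List[str]
    people: List[str] = field(default_factory=list)


@dataclass
class PointDescriptor:
    """Real-time style: a single descriptor attached to one moment."""
    time_s: float
    track: str  # "location_time", "subject" or "person"
    value: str


def segments_about(segments: List[Segment], subject: str) -> List[Segment]:
    """Return every segment whose subject list contains the given subject."""
    return [s for s in segments if subject in s.subjects]


def first_mention(points: List[PointDescriptor], value: str) -> Optional[float]:
    """Return the earliest time at which a descriptor value was assigned."""
    times = [p.time_s for p in points if p.value == value]
    return min(times) if times else None


# Descriptors come from the slides; all timestamps are invented for illustration.
SEGMENTS = [
    Segment(0, 300, "Berlin-1939", ["Employment"], ["Josef Stein"]),
    Segment(300, 600, "Berlin-1939", ["Family life"],
            ["Gretchen Stein", "Anna Stein"]),
    Segment(600, 900, "Dresden-1939", ["Relocation", "Transportation-rail"]),
    Segment(900, 1200, "Dresden-1939", ["Schooling"], ["Gunter Wendt", "Maria"]),
]
POINTS = [
    PointDescriptor(0, "location_time", "Berlin-1939"),
    PointDescriptor(20, "subject", "Employment"),
    PointDescriptor(45, "person", "Josef Stein"),
    PointDescriptor(610, "location_time", "Dresden-1939"),
    PointDescriptor(930, "subject", "Schooling"),
]
```

A segment record answers "which stretch of the interview is about schooling, and who is mentioned there?" in one lookup. The point-based form records only where a descriptor was assigned, so a searcher still has to decide how far to play from that point.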
8Thesaurus-Based Search
9The Goal
Dramatically improve access to large
multilingual spoken word collections
by capitalizing on the unique characteristics
of the Survivors of the Shoah Visual History
Foundation's collection of videotaped oral
history interviews.
10Joanne Archer
11Observational Studies
Workshop 1 (June)
- Four searchers
- History/Political Science
- Holocaust studies
- Holocaust studies
- Documentary filmmaker
- Sequential observation
- Rich data collection
- Intermediary interaction
- Semi-structured interviews
- Observational notes
- Think-aloud
- Screen capture
Workshop 2 (August)
- Four searchers
- Ethnography
- German Studies
- Sociology
- High school teacher
- Simultaneous observation
- Opportunistic data collection
- Intermediary interaction
- Semi-structured interviews
- Observational notes
- Focus group discussions
12Observed Selection Criteria
- Topicality (57)
- Judged based on: person, place, ...
- Accessibility (23)
- Judged based on: time to load video
- Comprehensibility (14)
- Judged based on: language, speaking style
13Functionality
14Xiaoli Huang
15Supporting Information Access
Source Selection
16Query Formulation
Speech Recognition
Automatic Search
Boundary Detection
Interactive Selection
Content Tagging
17Description Strategies
- Transcription
- Manual transcription (with optional post-editing)
- Annotation
- Manually assign descriptors to points in a recording
- Recommender systems (ratings, link analysis, ...)
- Associated materials
- Interviewers' notes, speech scripts, producers' logs
- Automatic
- Create access points with automatic speech processing
18English ASR Error Rate
Training: 65 hours (acoustic model) / 200 hours (language model)
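The slide does not say which metric is plotted. Word error rate (WER) is the usual measure for speech recognition, so the sketch below assumes it: the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The function name and the example sentences are illustrative only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("moved" -> "move") and one insertion ("the")
    # against a four-word reference.
    print(word_error_rate("we moved to dresden", "we move to the dresden"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the recognizer outputs many more words than were spoken.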
19Effect of ASR Errors
20Building a Test Collection
- Overall relevance
- Assessment is informed by the assessments for the individual reasons for relevance (categories of relevance), but the relationship is not straightforward
- Provides direct evidence
- Provides indirect / circumstantial evidence
- Provides context (e.g., causes for the phenomenon of interest)
- Provides comparison (similarity or contrast, same phenomenon in different environment, similar phenomenon)
- Provides pointer to source of information
21Ammie Feijoo
22Some Statistics
- 2,000 U.S. radio stations Webcasting
- 250,000 hours of oral history in British Library
- 35,000,000 audio streams on the Web
23Spoken Word Collections
- Broadcast programming
- News, interview, talk radio, sports, entertainment
- Scripted stories
- Books on tape, poetry reading, theater
- Spontaneous storytelling
- Oral history, folklore
- Incidental recording
- Speeches, oral arguments, meetings, phone calls
24Building a Web of Spoken Words
- Affordable storage
- For $1, you can store 1.5 million spoken words
- Adequate network capacity
- Internet capacity: 30 million simultaneous programs
- Works with any modem
- You can even read email while playing audio
- Replay capabilities
- 38% of US users recently used streaming audio
- Effective search capabilities
- Not quite yet
25Looking Forward 2006
- Working systems in five languages
- Real users searching real data
- Rich experience beyond broadcast news
- Frameworks, components, systems
- Affordable application-tuned systems
- Oral history, lectures, speeches, meetings, ...
26For More Information
- The MALACH project
- http://www.clsp.jhu.edu/research/malach/
- NSF/EU Spoken Word Access Group
- http://www.dcs.shef.ac.uk/spandh/projects/swag/
- Speech-based retrieval
- http://www.glue.umd.edu/dlrg/speech/