Lattice-Based Statistical Spoken Document Retrieval (Presentation Transcript)

1
Lattice-Based Statistical Spoken Document
Retrieval
  • Chia Tee Kiah
  • Ph.D. thesis
  • Department of Computer Science
  • School of Computing
  • National University of Singapore
  • Supervisors: A/Prof. Ng Hwee Tou (NUS),
  • Dr. Li Haizhou (I2R)

2
Outline
  • Introduction
  • Original contribution
  • Background
  • Lattice-based SDR under statistical model
  • Other SDR methods
  • Experiments on SDR with short queries
  • Query-by-example SDR
  • Conclusion

3
Introduction: Spoken Document Retrieval
  • Information retrieval (IR)
  • Search for items of data according to user's
    info. need
  • Spoken document retrieval (SDR)
  • IR on speech recordings
  • Growing in importance: more & more speech data
    stored (news broadcasts, voice mails, ...)
  • SDR more difficult than text IR
  • Currently: need automatic speech recognition (ASR)
  • 1-best transcripts from ASR are error-prone
  • Word error rate for noisy, spontaneous speech may
    be 50%

4
Introduction: Lattices
[Figure: word lattice for an utterance, node times t = 0.00 s to t = 1.12 s; competing edge hypotheses include "my sons mentor </s>", "and <s> to tender </s>", and "its nice and tender"]
  • Lattice: connected directed acyclic graph
    (James & Young 1994; James 1995)
  • Each edge labeled with term hypothesis, probs.
  • Each path gives hypothesized seq. of terms, its
    probability
  • Use alternative hypotheses to overcome errors in
    1-best transcripts: lattice-based SDR

6
Outline
  • Introduction
  • Original contribution
  • Background
  • Lattice-based SDR under statistical model
  • Other SDR methods
  • Experiments on SDR with short queries
  • Query-by-example SDR
  • Conclusion

7
Original Contribution
  • A method for lattice-based SDR using a
    statistical IR model (Song & Croft 1999)
  • Calculate expected count of each word in each
    lattice
  • From counts, estimate statistical lang. models
    for docs.
  • Compute query-doc. relevance as probability
  • Previous lattice-based SDR methods all based on
    vector space IR model!
  • Extension to query-by-example SDR
  • SDR where queries are also full-fledged spoken
    docs.
  • Presented at EMNLP-CoNLL 2007 & SIGIR 2008

8
Outline
  • Introduction
  • Original contribution
  • Background
  • Lattice-based SDR under statistical model
  • Other SDR methods
  • Experiments on SDR with short queries
  • Query-by-example SDR
  • Conclusion

9
Background: Information Retrieval
  • The task of IR:
  • Given doc. collection C, query q expressing info.
    need
  • Find list of docs. in C relevant to info. need
  • Steps involved:
  • Before receiving query:
  • Document preprocessing: outputs an index for
    rapid access
  • Upon receiving query:
  • Retrieval: outputs ranked list of docs.
  • Done by assigning relevance scores, guided by
    retrieval model
  • Good IR systems give higher scores to more
    relevant docs.

[Figure: preprocessing & retrieval pipeline. Tokenization: "Nevertheless, information retrieval has become accepted as a description" becomes "nevertheless information retrieval has become accepted as a description"; stop word removal leaves "information retrieval accepted description"; stemming leaves "inform retriev accept descript"; indexing builds an inverted index ("document": 336, 624, 864, ...; "inform": 33, 128, 315, ...). Retrieval for query q = "Euclid's algorithm" outputs a ranked list, e.g. "an algorithm for finding the greatest common divisor of two numbers"]
10
Background (IR): Retrieval Models
  • Vector space with tf-idf weighting (Salton 1963;
    Spärck Jones 1972)
  • Docs. & queries are Euclidean vecs.
  • Compute relevance as cosine similarity
  • Each vec. component d(i), q(i) a product of:
  • tf(wi, d): term frequency, an increasing func.
    of no. of occurrences c(wi; d) of wi in d
  • idf(wi): inverse doc. frequency, a decreasing
    func. of no. of docs. containing wi

[Figure: doc. vector d and query vector q separated by angle θ; relevance = cos θ]
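To make the scoring concrete, here is a minimal Python sketch of tf-idf cosine scoring (function names are illustrative; the slides leave the exact tf and idf functions open, so raw counts and log(N / n_w) are assumed here):

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Sparse tf-idf vector: tf = raw count, idf = log(N / n_w).
    One simple choice of tf/idf functions, not the only one."""
    tf = Counter(tokens)
    return {w: c * math.log(n_docs / doc_freq[w])
            for w, c in tf.items() if doc_freq.get(w)}

def cosine(d, q):
    """Relevance = cosine of the angle between doc and query vectors."""
    dot = sum(x * q.get(w, 0.0) for w, x in d.items())
    norm = (math.sqrt(sum(x * x for x in d.values())) *
            math.sqrt(sum(x * x for x in q.values())))
    return dot / norm if norm else 0.0
```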
11
Background (IR): Retrieval Models
  • Okapi BM25 (Robertson et al. 1998)
  • Based on approximation to Harter's 2-Poisson
    theory of word distribution (1974) & the
    Robertson/Spärck Jones weight (1976)
  • Rel_bm25(d, q) = Σ_w w_RSJ ·
    ((k1 + 1) c(w; d)) / (K + c(w; d)) ·
    ((k3 + 1) c(w; q)) / (k3 + c(w; q))
    + k2 |q| (avdl - |d|) / (avdl + |d|),
    where K = k1 ((1 - b) + b |d| / avdl) and w_RSJ
    is the Robertson/Spärck Jones weight of w
  • |C|: no. of docs in collection
  • V: vocabulary
  • c(w; d): count of w in d
  • c(w; q): count of w in q
  • n_w: no. of docs containing w
  • R: no. of docs. known to be rel.
  • r_w: no. of rel. docs containing w
  • |d|: length of d
  • avdl: average doc. length
  • k1, k2, k3, b are parms.
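A hedged sketch of this scoring function, under two simplifications that match the experiments later in the talk: no relevance information (R = r_w = 0, so the RSJ weight reduces to log((|C| - n_w + 0.5) / (n_w + 0.5))) and k2 = 0 (the query-length correction term vanishes):

```python
import math

def bm25(c_d, dlen, c_q, n_w, n_docs, avdl, k1=1.2, b=0.75, k3=7.0):
    """BM25 sketch; c_d, c_q map word -> count in doc / query,
    n_w maps word -> document frequency. Assumes R = r_w = 0, k2 = 0."""
    K = k1 * ((1 - b) + b * dlen / avdl)
    score = 0.0
    for w, cq in c_q.items():
        cd = c_d.get(w, 0)
        if cd == 0 or w not in n_w:
            continue
        rsj = math.log((n_docs - n_w[w] + 0.5) / (n_w[w] + 0.5))
        score += rsj * ((k1 + 1) * cd / (K + cd)) * ((k3 + 1) * cq / (k3 + cq))
    return score
```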

12
Background (IR): Retrieval Models
  • Statistical language model (n-gram) (Song &
    Croft 1999)
  • Use Pr(d | q) as relevance measure
  • Assuming uniform Pr(d):
  • Pr(d | q) = Pr(q | d) Pr(d) / Pr(q) ∝ Pr(q | d)
  • We can thus define relevance as
  • Rel_stat(d, q) = log Pr(q | d)
  • Write q as seq. of words q1 q2 ... qK
  • Given unigram model Pr(· | d),
  • Rel_stat(d, q) = log ∏_{1 ≤ i ≤ K} Pr(qi | d)
    = Σ_w c(w; q) log Pr(w | d)
  • Estimate Pr(· | d) by smoothing word counts

[Figure: generative view of retrieval; document d generates query q, with Pr(d | q) and Pr(q) labeled]
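A minimal sketch of this unigram query-likelihood scoring (names hypothetical; assumes every query word has nonzero background probability, and uses simple Jelinek-Mercer interpolation as one smoothing choice; the thesis's 2-stage smoothing comes later):

```python
import math

def rel_stat(c_q, p_d, p_bg, lam=0.1):
    """Rel_stat(d, q) = sum_w c(w; q) log Pr(w | d), where Pr(w | d) is
    the doc model p_d interpolated with a background model p_bg."""
    return sum(cq * math.log((1 - lam) * p_d.get(w, 0.0) + lam * p_bg[w])
               for w, cq in c_q.items())
```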
13
Background: Information Retrieval
  • System evaluation
  • Compare IR engine's ranked list to ground truth
    relevance judgements
  • Eval. metric: mean average precision (MAP)
  • MAP for set of queries Q:
    MAP = (1 / |Q|) Σ_{q ∈ Q} (1 / R_q) Σ_{j=1..R_q} j / r'_{j,q}
  • |Q|: no. of queries
  • R_q: no. of docs. rel. to query q
  • r'_{j,q}: position of jth rel. doc. in ranked list
    output for query q
  • Intuitively, higher MAP means relevant docs.
    ranked higher
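Since j / r'_{j,q} is just the precision at the rank of the jth relevant document, MAP can be computed directly; a small sketch (assumes every ranked list is complete, so unretrieved relevant docs contribute zero):

```python
def average_precision(ranked, relevant):
    """AP for one query: the j-th relevant doc found at rank r
    contributes j / r; average over all R_q relevant docs."""
    j, ap = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            j += 1
            ap += j / rank
    return ap / len(relevant)

def mean_average_precision(per_query):
    """MAP: per_query is a list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in per_query) / len(per_query)
```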

14
Background: Automatic Speech Recognition
  • ASR transcribes speech waveform into text;
    involves:
  • Pronouncing dictionary: maps written words to
    phonemes
  • Phoneme: contrastive speech unit, e.g. /ae/, /ow/,
    /th/, /p/, ...
  • Acoustic models: describe acoustic realizations
    of phonemes
  • Each model usually for a triphone: a phoneme in
    the context of its 2 neighboring phonemes
  • Language model: gives word transition
    probabilities

15
Background: Automatic Speech Recognition
  • General paradigm: hidden Markov models (HMMs)
  • Acoustic models: left-right triphone HMMs,
    trained using EM algo.
  • Using lang. model & pron. dict., join HMMs into
    one large utterance HMM
  • Decoding (finding most probable transcript):
    Viterbi search with beam pruning (Ney et al.
    1992)
  • Lattices computed using extension of decoding
  • ASR system evaluation: word error rate (WER)
  • Edit dist. / ref. trans. length
  • Other metrics: char. error rate, syll. error rate

[Figure: structure of a typical triphone HMM]
16
Background: Spoken Document Retrieval
  • IR with collection of speech recordings
  • ASR engine produces document surrogates; may be:
  • 1-best word transcripts (e.g. Gauvain et al.
    2000)
  • 1-best subword transcripts (e.g. Turunen & Kurimo
    2006)
  • Phoneme lattices (e.g. James 1995; Jones et al.
    1996)
  • N-best transcript lists (Siegler 1999)
  • Word lattices (e.g. Mamou et al. 2006)
  • Phoneme & word lattices (e.g. Saraclar & Sproat
    2004)
  • IR models used in SDR:
  • For SDR with 1-best transcripts: vector space,
    BM25, & statistical IR models have been tried
  • For lattice-based SDR: only the vector space model

17
Background: Query By Example
  • IR where queries & docs. are of like form
  • Queries are exemplars of the type of objects sought
  • E.g. music ("query by humming") (Zhu et al.
    2003); images (Vu et al. 2003)
  • Work related to query-by-example SDR:
  • Query by example for speech & text
  • He et al. (2003), Lo & Gauvain (2002, 2003):
    tracking task in Topic Detection & Tracking (TDT)
  • Chen et al. (2004): newswire articles (text) for
    queries, broadcasts (speech) for docs.
  • All using 1-best transcripts
  • Lattices of short spoken queries for IR:
  • Colineau & Halber (1999)

18
Outline
  • Introduction
  • Original contribution
  • Background
  • Lattice-based SDR under statistical model
  • Other SDR methods
  • Experiments on SDR with short queries
  • Query-by-example SDR
  • Conclusion

19
Lattice-Based SDR Under the Statistical Model
  • Song & Croft's IR model:
  • Rel_stat(d, q) = log Pr(q | d) = Σ_w c(w; q) log
    Pr(w | d)
  • Our idea: estimate Pr(· | d) from lattices
  • Find expectations of word counts (Saraclar &
    Sproat 2004) & doc. lengths, summing over
    transcript hypotheses t in the lattice:
  • E[c(w; d)] = Σ_t c(w; t) Pr(t | d)
  • E[|d|] = Σ_t |t| Pr(t | d)
  • Expected counts can be computed efficiently by
    dynamic programming (Hatch et al. 2005), as
    sketched below
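A hedged sketch of that dynamic program as forward-backward over the lattice DAG (representation and names are assumptions: nodes are topologically ordered integers, and each edge carries an already-combined probability as on the next slides):

```python
from collections import defaultdict

def expected_counts(edges, start, end):
    """E[c(w; o)] and E[|o|]: each edge's posterior is
    alpha(u) * p * beta(v) / Z. edges: iterable of (u, v, word, p)."""
    edges = sorted(edges)                       # by source node id
    alpha = defaultdict(float); alpha[start] = 1.0
    for u, v, w, p in edges:                    # forward: path mass into each node
        alpha[v] += alpha[u] * p
    beta = defaultdict(float); beta[end] = 1.0
    for u, v, w, p in reversed(edges):          # backward: path mass to the end
        beta[u] += p * beta[v]
    Z = alpha[end]                              # total lattice probability
    counts = defaultdict(float)
    for u, v, w, p in edges:
        counts[w] += alpha[u] * p * beta[v] / Z
    return dict(counts), sum(counts.values())   # E[c(w; o)], E[|o|]
```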

20
Lattice-Based SDR Under the Statistical Model
[Figure: acoustic observation sequence o = o1 o2 o3 ...]
  • The method:
  • Start with speech segment's acoustic observations o
  • Generate lattice using ASR
  • Decoding with adaptation of Viterbi algo.: keep
    track of multiple paths (James 1995)
  • Use simple lang. model (bigram LM)
  • Rescore with more complex LM (trigram LM):
  • Replace bigram LM probs. with trigram probs.
  • Make duplicates of nodes with differing trigram
    contexts

[Figure: lattice from decoding with the simple LM, edges labeled with acoustic & LM probs., e.g. w1/Pr(o1|w1), Pr(w1|<s>) and w4/Pr(o3|w4), Pr(w4|w3); in the lattice rescored with the complex LM, nodes are duplicated so that e.g. w4 carries Pr(w4|w1 w3) on one copy and Pr(w4|w2 w3) on the other]
21
Lattice-Based SDR Under the Statistical Model
[Figure: acoustic observation sequence o = o1 o2 o3 ...]
  • The method (cont.):
  • Combine acoustic & LM probs.
  • In practice, apply grammar scale factor α & word
    insertion penalty β
  • Prune lattice:
  • Remove paths whose log probs. fall short of the
    best path's by more than T_doc (a pruning sketch
    follows the figure below)
  • Find expectations of word counts E[c(w; o)] &
    seg. lengths E[|o|]
  • Combine expected counts over segments to get
    E[c(w; d)], E[|d|]

[Figure: lattice with combined acoustic & LM probs., edges labeled wi/pj, e.g. p1 = Pr(w1|<s>) (Pr(o1|w1) e^β)^(1/α); after pruning, two paths remain: w4 w3 w4 (prob. p1 p3 p4) and w2 w2 (prob. p2 p5)]

Expected counts:
Word  Expected count
w2    2 p2 p5 / (p1 p3 p4 + p2 p5)
w3    p1 p3 p4 / (p1 p3 p4 + p2 p5)
w4    2 p1 p3 p4 / (p1 p3 p4 + p2 p5)
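As referenced above, a minimal sketch of one common way to realize the path-pruning criterion, using Viterbi-style max scores in place of the sums of the forward-backward pass (lattice representation as in the earlier expected-counts sketch; names are assumptions):

```python
import math
from collections import defaultdict

def prune(edges, start, end, T):
    """Keep only edges lying on some complete path whose log prob is
    within T of the best path's log prob."""
    edges = sorted(edges)
    fwd = defaultdict(lambda: float("-inf")); fwd[start] = 0.0
    for u, v, w, p in edges:                    # best log prob reaching v
        fwd[v] = max(fwd[v], fwd[u] + math.log(p))
    bwd = defaultdict(lambda: float("-inf")); bwd[end] = 0.0
    for u, v, w, p in reversed(edges):          # best log prob from u to end
        bwd[u] = max(bwd[u], math.log(p) + bwd[v])
    best = fwd[end]
    return [(u, v, w, p) for u, v, w, p in edges
            if fwd[u] + math.log(p) + bwd[v] >= best - T]
```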
22
Lattice-Based SDR Under the Statistical Model
  • The method (cont.):
  • Build unigram model to get Pr(· | d)
  • Zhai & Lafferty's (2004) 2-stage smoothing method:
  • Combination of Jelinek-Mercer & Bayesian
    smoothing
  • Adapt 2-stage smoothing to use expected counts
  • w is a word, e.g. a query word
  • U: a background language model
  • λ ∈ (0, 1): set according to nature of queries
  • μ: set using a variation of Zhai & Lafferty's
    estimation algo.
  • Thus we can compute
  • Rel_stat(d, q) = log Pr(q | d) = Σ_w c(w; q)
    log Pr(w | d)
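The standard two-stage estimate is Pr(w | d) = (1 - λ)(c(w; d) + μ Pr(w | U)) / (|d| + μ) + λ Pr(w | U); with the lattice expected counts substituted for exact counts, a sketch of the adapted estimate might look like this (names hypothetical):

```python
def two_stage(w, exp_c, exp_len, p_bg, lam, mu):
    """Two-stage smoothed Pr(w | d) (Zhai & Lafferty 2004) with
    E[c(w; d)] for c(w; d) and E[|d|] for |d|; p_bg is Pr(. | U)."""
    dirichlet = (exp_c.get(w, 0.0) + mu * p_bg[w]) / (exp_len + mu)
    return (1 - lam) * dirichlet + lam * p_bg[w]
```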

23
Outline
  • Introduction
  • Original contribution
  • Background
  • Lattice-based SDR under statistical model
  • Other SDR methods
  • Experiments on SDR with short queries
  • Query-by-example SDR
  • Conclusion

24
Other SDR Methods
  • Statistical, using 1-best transcripts
  • Motivated by Song & Croft (1999), Chen et al.
    (2004)
  • Vector space, using lattices
  • Mamou et al. (2006)
  • BM25, using lattices

25
Other SDR Methods: Statistical, Using 1-Best Trans.
  • Estimate Pr(· | d) from 1-best transcript
  • Use Zhai & Lafferty's 2-stage smoothing:
  • w is a word, e.g. a query word
  • c_1-best(w; d): count of w in d's 1-best
    transcript
  • |d|_1-best: length of d's transcript
  • U: a background language model
  • λ ∈ (0, 1), μ > 0 are smoothing parameters
  • Compute relevance:
  • Rel_stat(d, q) = log Pr(q | d) = Σ_w c(w; q) log
    Pr(w | d)

26
Other SDR Methods: Vector Space, Using Lattices
[Figure: acoustic observation sequence o = o1 o2 o3 ...]
  • Mamou et al. (2006)
  • Method:
  • Compute word confusion network (Mangu et al.
    2000)
  • Sequence of confusion sets
  • Compute term freq. vector
  • Weight of each term depends on ranks & probs. in
    confusion sets, freq. in doc. collection
  • Compute relevance:
  • Construct d & q vectors, compute cosine similarity

[Figure: pruned lattice (edges w4/p1, w3/p3, w4/p4, w2/p2, w2/p5) converted into a word confusion network, a sequence of confusion sets g1, g2, g3 over hypotheses w2, w3, w4, and ε; document & query vectors d, q at angle θ]
27
Other SDR Methods: BM25, Using Lattices
  • Modify Robertson et al.'s BM25 formula to use
    expected counts (Turunen & Kurimo 2007):
  • Rel_bm25,lat(d, q): Rel_bm25 with E[c(w; d)] in
    place of c(w; d) and E[|d|] in place of |d|
  • Estimate doc. freq. n_w from expected counts
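A hedged sketch of that substitution, mirroring the earlier BM25 sketch (how n_w is derived from expected counts is not specified on the slide; a thresholding rule is one plausible choice):

```python
import math

def bm25_lat(exp_c_d, exp_dlen, c_q, n_w, n_docs, avdl,
             k1=1.2, b=0.75, k3=7.0):
    """BM25 over lattices: E[c(w; d)] replaces c(w; d), E[|d|] replaces
    |d|; n_w here is assumed pre-estimated from expected counts."""
    K = k1 * ((1 - b) + b * exp_dlen / avdl)
    score = 0.0
    for w, cq in c_q.items():
        ecd = exp_c_d.get(w, 0.0)
        if ecd <= 0.0 or w not in n_w:
            continue
        rsj = math.log((n_docs - n_w[w] + 0.5) / (n_w[w] + 0.5))
        score += rsj * ((k1 + 1) * ecd / (K + ecd)) * ((k3 + 1) * cq / (k3 + cq))
    return score
```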
28
Outline
  • Introduction
  • Original contribution
  • Background
  • Lattice-based SDR under statistical model
  • Other SDR methods
  • Experiments on SDR with short queries
  • Query-by-example SDR
  • Conclusion

29
SDR Experiments: Mandarin Chinese Task Setup
  • Doc. collection:
  • Hub5 Mandarin training corpus (LDC98T26)
  • 42 telephone calls in Mandarin Chinese, total 17
    hours, ~600 KB text
  • Unit of retrieval ("document"):
  • ½-minute time windows with 50% overlap
    (Abberley et al. 1998; Tuerk et al. 2001)
  • 4,312 retrieval units
  • Queries:
  • 18 keyword queries: 14 test, 4 devel.
  • Ground truth relevance judgements:
  • Determined manually

30
SDR Experiments: Mandarin Chinese Task Details
  • Lattices:
  • Generated by Abacus (Hon et al. 1994)
  • Large vocab. triphone-based cont. speech
    recognizer
  • Rescored with trigram language model
  • Trained with TDT, Callhome, CSTSC-Flight corpora
  • 1-best transcripts:
  • Decoded from rescored lattices
  • Other tools used:
  • AT&T FSM (Mohri et al. 1998)
  • SRILM (Stolcke 2002)
  • Low et al.'s (2005) Chinese word segmenter

31
SDR Experiments: Mandarin Chinese Task
  • Retrieval:
  • SDR performed using:
  • baseline stat. method, on ref. transcripts
  • baseline stat. method, on 1-best transcripts
  • Mamou et al.'s vector space method, on lattices
  • our proposed method, on lattices
  • Smoothing parameter:
  • λ = 0.1, good for keyword queries (Zhai &
    Lafferty 2004)
  • Lattice pruning threshold T (i.e. T_doc):
  • Vary T on devel. queries, use best value on test
    queries
  • Evaluation measure: mean avg. prec. (MAP)



32
SDR Experiments: Mandarin Chinese Task Results
  • Results for statistical methods:
  • 1-best MAP was 0.1364; ref. MAP was 0.4798
  • Lattice-based MAP for devel. queries highest at
    T = 65,000
  • At this point, MAP for test queries was 0.2154

[Graphs: MAP vs. T for the 4 devel. queries and the 14 test queries]
33
SDR Experiments: Mandarin Chinese Task Results
  • Results for Mamou et al.'s vector space method:
  • MAP for devel. queries highest at T = 27,500
  • At this point, MAP for test queries was 0.1599

[Graphs: MAP vs. T for the 4 devel. queries and the 14 test queries]
34
SDR Experiments: Mandarin Chinese Task Results
  • Statistical significance testing: 1-tailed
    t-test
  • Improvement over 1-best significant at 99.5%
    level
  • Improvement over vector space significant at
    97.5% level
  • Our method outperforms stat. 1-best & vec. space
    with lattices

35
SDR Experiments: English Task Setup
  • Corpus: Fisher English Training corpus from LDC
  • 11,699 telephone calls, total 1,920 hours,
    ~109 MB text
  • Each call initiated by one of 40 topics
  • 6,605 calls for training ASR engine
  • Queries:
  • The 40 topic specifications
  • 32 test, 8 devel.
  • Doc. collection:
  • 5,094 calls
  • Unit of retrieval ("document"): a call
  • Ground truth rel. judgements:
  • d rel. to q iff conversation d was initiated by
    topic q

"ENG01. Professional sports on TV. Do either of
you have a favorite TV sport? How many hours per
week do you spend watching it and other sporting
events on TV?"
Example of a topic spec.
36
SDR Experiments: English Task Details
  • Lattices:
  • Generated by HTK (Young et al. 2006)
  • Large vocab. triphone-based cont. speech
    recognizer
  • Tried both trigram LM rescoring & decoding with
    only the bigram LM
  • 1-best transcripts:
  • Decoded from rescored lattices
  • Word error rate 48.1% (with rescoring), 50.8%
    (without)
  • Words stemmed with Porter stemmer
  • Also tried stop word removal; experimented with:
  • no stopping
  • stopping with 319-word list from U. of Glasgow
    ("gla")
  • stopping with 571-word list used in SMART system
    ("smart")
  • Index building used CMU Lemur toolkit

37
SDR Experiments: English Task
  • Retrieval:
  • Performed using:
  • baseline stat. method, on ref. transcripts
  • baseline stat. method, on 1-best transcripts
  • Mamou et al.'s vector space method, on lattices
  • BM25 method, on lattices
  • our proposed method, on lattices
  • Retrieval parameters:
  • For stat. methods: λ = 0.7, good for verbose
    queries
  • For BM25: k1 = 1.2, b = 0.75, k2 = 0 (following
    Robertson et al. (1998)); k3 & remaining parameters
    tuned with devel. queries
  • Evaluation measure: MAP

38
SDR Experiments: English Task Results
  • Main findings:
  • Our method outperforms 1-best stat. SDR, Mamou et
    al.'s vector space method, & BM25
  • Unlike Mamou et al.'s, does not need stop word
    removal
  • Rescoring lattices with trigram LM helps improve
    SDR

39
Outline
  • Introduction
  • Original contribution
  • Background
  • Lattice-based SDR under statistical model
  • Other SDR methods
  • Experiments on SDR with short queries
  • Query-by-example SDR
  • Conclusion

40
Query-By-Example SDR
  • The task:
  • Given collection C of spoken docs., query
    exemplar q (also a spoken doc.)
  • Task: find docs. in coll. on similar topic as
    query
  • Extending our stat. lat.-based SDR method to
    query-by-example: additional challenges
  • Problem 1: How to cope with uncertainty in ASR
    transcription of q?
  • Problem 2: How to handle high concentration of
    non-content words in q?

41
Query-By-Example SDR: Problems
  • Problem 1: Uncertainty in transcription of q
  • Use multiple ASR hypotheses for q
  • Reformulate 1-best stat. IR as negative
    Kullback-Leibler divergence ranking (Lafferty &
    Zhai 2001):
  • -KL(q ‖ d) is rank-equivalent to log Pr(q | d)
  • Thus, we can estimate models Pr(· | d) & Pr(· | q)
    from d & q lats., rank docs. by neg. KL div.
  • Problem 2: Lots of non-content words in q
  • Use stop word removal
42
Query-By-Example SDR: Proposed Method
  • Get lattices for d & q; rescore, prune, find
    expected counts
  • Use 2 pruning thresholds: T_doc for docs., T_qry
    for queries
  • Build unigram model of d
  • With expected counts
  • Again, use 2-stage smoothing (Zhai & Lafferty
    2004)
  • Build unigram model of q, unsmoothed
  • Compute relevance as neg. KL div. (Lafferty &
    Zhai 2001):
  • Rel_stat-qbe(d, q) = Σ_w Pr(w | q) log Pr(w | d)
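This sum equals -KL(q ‖ d) up to the query-model entropy term, which is constant across documents, so the ranking is unchanged; a minimal sketch (names hypothetical; the doc model must be smoothed so it is nonzero wherever the query model is):

```python
import math

def rel_qbe(p_q, p_d):
    """Rel_stat-qbe(d, q) = sum_w Pr(w | q) log Pr(w | d);
    p_q is the unsmoothed query model, p_d the smoothed doc model."""
    return sum(pq * math.log(p_d[w]) for w, pq in p_q.items())
```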

43
Query-By-Example SDR: Experiments
  • Corpus: Fisher English Training corpus
  • Queries:
  • 40 exemplars (32 test, 8 devel.) for the 40 topics
  • Doc. collection:
  • 5,054 telephone calls
  • Ground truth rel. judgements:
  • d rel. to q iff d & q on same topic
  • Smoothing parameter:
  • λ = 0.7
  • Lattice pruning thresholds T_doc and T_qry:
  • Varied independently on devel. queries
  • Stop word removal; experimented with:
  • no stopping
  • stopping with "gla" stop list
  • stopping with "smart" stop list

44
Query-By-Example SDR: Experiments
  • Retrieval performed using:
  • 1-best trans. of exemplars & docs. (1-best →
    1-best)
  • exemplar 1-best, doc. lat. (1-best → Lat)
  • exemplar lat., doc. 1-best (Lat → 1-best)
  • lat. counts of exemplars and docs. (Lat → Lat):
    our proposed method
  • Also tried:
  • ref. trans. of exemplars & docs. (Ref → Ref)
  • orig. Fisher topic spec. for queries (Top → Ref,
    Top → 1-best, Top → Lat)
  • Evaluation measure: MAP

45
Query-By-Example SDR: Experimental Results
  • MAP without stop word removal
  • Stat. significance testing: 1-tailed t-test &
    Wilcoxon test
  • Lat → Lat vs. 1-best → 1-best: improvement sig.
    at 99.95% level
  • However, original topic specs. still better; the
    nature of exemplars presents difficulties for
    retrieval

46
Query-By-Example SDR: Experimental Results
  • MAP with stop word removal
  • With "gla" stop list: Lat → Lat better than
    1-best → 1-best at 99.99% level
  • With "smart" stop list: better at 99.95% level
  • Our method (Lat → Lat) gives consistent
    improvement

47
Outline
  • Introduction
  • Original contribution
  • Background
  • Lattice-based SDR under statistical model
  • Other SDR methods
  • Experiments on SDR with short queries
  • Query-by-example SDR
  • Conclusion

48
Conclusion
  • Contributions:
  • Proposed novel SDR method: combines use of
    lattices & stat. IR model
  • Motivated by improved IR accuracy when each
    technique was used individually
  • New method performs well compared to previous
    methods & lattice-based BM25
  • Extended proposed method to query-by-example SDR
  • Lat.-based query by example, under stat. IR model
  • Significant improvement over using 1-best trans.
  • Consistently better, under variety of setups

49
Conclusion
  • Suggestions for future work
  • Incorporate proximity-based search into our
    method
  • Formulate a more principled way of deriving
    lattice pruning thresholds
  • Examine how stop words affect query-by-example
    SDR
  • Extend stat. lat.-based SDR framework to other
    speech processing tasks, e.g. spoken document
    classification

50
Thank you!