Title: Doctoral thesis
1 Spoken Document Retrieval experiments with IR-n system
Fernando Llopis Pascual, Patricio Martínez-Barco
Departamento de Lenguajes y Sistemas Informáticos
2 Index
IR-n System
Adapting IR-n System to SDR task
Evaluation
Conclusions and future work
4 IR-n System Passage Retrieval Systems
- Use short fragments of documents, instead of whole documents, to evaluate relevance or similarity
- These fragments are called passages
- Each document is divided into passages before its relevance is calculated
5 IR-n System Passage concept
- Why does the IR-n system use sentences to define passages?
- A sentence expresses an idea in the document
- There are algorithms that identify sentences with high precision
- Sentences are complete units, so they can be shown to users as understandable information or passed on to a subsequent system
6 IR-n System Passage concept
The IR-n system defines passages in the following way
General Custer was Civil War Union Major
soldier. One of the most famous and controversial
figures in United States Military history.
Graduated last in his West Point Class (June
1861). Spent first part of the Civil War as a
courier and staff officer. Promoted from Captain
to Brigadier General of Volunteers just prior to
the Battle of Gettysburg, and was given command
of the Michigan "Wolverines" Cavalry brigade.
He helped defeat General Stuart's attempt to
make a cavalry strike behind Union lines on the
3rd Day of the Battle (July 3, 1863), thus
markedly contributing to the Army of the
Potomac's victory (a large monument to his
Brigade now stands in the East Cavalry Field in
Gettysburg). Participated in nearly every cavalry
action in Virginia from that point until the end
of the war, always performing boldly, most often
brilliantly, and always seeking publicity for
himself and his actions. Ended the war as a Major
General of Volunteers and a Brevet Major General
in the Regular Army. Upon Army reorganization
in 1866, he was appointed Lieutenant Colonel of
the soon to be renowned 7th United States Cavalry.
Fought in the various actions against the Western
Indians, often with a singular brutality
(exemplified by his wiping out of a Cheyenne
village on the Washita in November 1868). His
exploits on the Plains were romanticized by
Eastern United States newspapermen, and he was
elevated to legendary status in his time. The
death of his friend, Lucarelli changed his life.
[Figure: the document split into sentences, labelled SENTENCE 1 to SENTENCE 15]
1 Obtains sentences from the document
2 Defines passages according to a fixed number of sentences
7 IR-n System Passage concept
- Every passage has the same number of sentences
- This number depends on
- The collection of documents
- Size of the query
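The passage definition described above (split the document into sentences, then group a fixed number of them) can be sketched as follows. This is only an illustration under stated assumptions: the regular-expression splitter, the function names and the choice of consecutive, non-overlapping groups are not taken from IR-n, whose actual sentence detection is not described here.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (an assumption, not IR-n's algorithm).

    Breaks after '.', '!' or '?' when followed by whitespace.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_passages(sentences, size):
    """Group sentences into consecutive passages of `size` sentences.

    The last passage may be shorter when the document length is not a
    multiple of `size`.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


text = (
    "Graduated last in his West Point Class. "
    "Spent first part of the Civil War as a courier and staff officer. "
    "Ended the war as a Major General of Volunteers."
)
sentences = split_sentences(text)
passages = build_passages(sentences, size=2)
```

Here `size` stands for the number of sentences per passage, the value that depends on the collection and on the size of the query.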
10 Adapting IR-n system to SDR task Spoken input
- As pointed out by Dahlback (1997)
- Spoken input is often incomplete and incorrect
- It contains interruptions and repairs
- Complete sentences occur only very occasionally
- Conclusion
- The sentence concept is not valid for spoken input
- Therefore new basic units for dialogue models must be proposed
- Utterances instead of sentences
- Turns instead of paragraphs
11 Adapting IR-n system to SDR task Definitions
- Utterance
- sequence of words produced by a speaker between two pauses
- Turn
- set of utterances that a speaker produces between two speaker changes (dialogues)
- set of utterances that a speaker produces about the same subject (monologues)
- (each section of the TREC SDR collection is considered a turn)
12 Adapting IR-n system to SDR task SDR problems
- The lack of punctuation marks impedes the recognition of utterance boundaries
- Utterance boundaries must be estimated by detecting the longest pauses
- Some turns have no semantic content
- Morning C.N.N. headline news Im Sachi Koto
- Some turns are interrupted due to
- Overlaps
- Speaker mistakes
- Repetitions
- Modifications of previous information
- Noise introduced by automatic transcribers
13 Adapting IR-n system to SDR task IR-n problems
- The lack of sentences to define passages must be solved by using utterances
- An utterance splitter was developed
- An overlapping passage technique was used to minimize the impact of utterance-splitting errors
- Noisy input
- How well the system copes with it must be tested
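One possible reading of the overlapping passage technique is a sliding window over the utterances, so that a wrongly placed utterance boundary affects several passages only slightly instead of breaking one passage completely. The sketch below assumes this reading; the function name and the `size` and `step` parameters are hypothetical, and the amount of overlap actually used by IR-n is not stated here.

```python
def overlapping_passages(utterances, size, step=1):
    """Return windows of `size` utterances, advancing `step` at a time.

    With step < size, consecutive passages share utterances. A final
    window is added when needed so the last utterances are covered.
    """
    if size < 1 or step < 1:
        raise ValueError("size and step must be at least 1")
    n = len(utterances)
    if n == 0:
        return []
    if n <= size:
        return [list(utterances)]
    starts = list(range(0, n - size + 1, step))
    if starts[-1] != n - size:
        starts.append(n - size)
    return [list(utterances[i:i + size]) for i in starts]


utterances = ["u1", "u2", "u3", "u4", "u5"]
windows = overlapping_passages(utterances, size=3, step=1)
```

With `size=3` and `step=1`, every utterance except the first and last appears in more than one passage.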
16 Evaluation Evaluation goal
- The main goal of this experiment is to assess the robustness of the IR-n system
- How a system based on passages (and therefore on sentences) can be adapted to utterances
- How well the system copes with noise
18 Evaluation Training focus
- Finding the minimum pause between words that marks the start of a new utterance
- ..
- TO
- THWART
- THEIR
- ABILITY
- TO
- ACQUIRE
- AND
- DEVELOP
- WEAPONS
- ..
That is not a new utterance
20 Evaluation Training focus
- Finding the minimum pause between words that marks the start of a new utterance
- ..
- BUT
- FOR
- THE
- BAY'S
- CHIEF
- I
- WHAT
- WOULD
- THEY
- ACHIEVED
- ..
That is a new utterance
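The decision illustrated on these slides can be sketched as a threshold on the silence between consecutive words. This is a minimal sketch, not the splitter that was actually developed: the `(token, start, end)` input format, the function name and the timings in the example are all invented for illustration, since the transcript timings are not shown here.

```python
def split_utterances(words, min_pause):
    """Split a time-ordered word list into utterances.

    words: list of (token, start_sec, end_sec) tuples (assumed format).
    A new utterance starts when the silence between the end of one word
    and the start of the next is at least `min_pause` seconds.
    """
    utterances = []
    current = []
    prev_end = None
    for token, start, end in words:
        if current and start - prev_end >= min_pause:
            utterances.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        utterances.append(current)
    return utterances


# Hypothetical timings: a 0.5 s silence separates "C" and "D".
words = [
    ("A", 0.00, 0.30), ("B", 0.35, 0.60), ("C", 0.65, 1.00),
    ("D", 1.50, 1.80), ("E", 1.85, 2.10), ("F", 2.15, 2.50),
]
result = split_utterances(words, min_pause=0.2)
```

A larger `min_pause` yields fewer, longer utterances; a smaller one yields more, shorter utterances, which is why the threshold has to be tuned.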
21 Evaluation Training focus
- Finding the best passage size
[Figure: a sequence of utterances, UTTERANCE 1 to UTTERANCE 15, repeated three times]
22 Evaluation Training
- Training corpus: TREC SDR-8 collection (according to the track specification)
- Parameters to be evaluated
- Number of utterances per passage (from 1 to 9)
- Pause length used for utterance splitting (0.1, 0.2, 0.3 sec.)
- Models
- With query expansion
- Without query expansion
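If every combination of these parameters is tried (an assumption; the slide lists the parameter values but not the search procedure), the training amounts to scoring 9 x 3 x 2 = 54 configurations and keeping the best one. In the sketch below, the dictionary keys and the `evaluate` callback are hypothetical names standing in for a full retrieval run on the training corpus.

```python
from itertools import product

PASSAGE_SIZES = range(1, 10)       # utterances per passage: 1 to 9
PAUSES_SEC = (0.1, 0.2, 0.3)       # pause length for utterance splitting
QUERY_EXPANSION = (True, False)    # with / without query expansion


def parameter_grid():
    """Enumerate every combination of the three training parameters."""
    return [
        {"passage_size": size, "pause_sec": pause, "query_expansion": qe}
        for size, pause, qe in product(PASSAGE_SIZES, PAUSES_SEC, QUERY_EXPANSION)
    ]


def best_configuration(evaluate):
    """Return the configuration with the highest score.

    evaluate: callable taking a configuration dict and returning its
    score (for example AvgP on the training corpus).
    """
    return max(parameter_grid(), key=evaluate)
```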
23 Evaluation Training
Training results
Best AvgP                0.4620
Best size of passage     5 utterances
Best pause estimation    0.2 sec.
Best model               With query expansion
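Assuming AvgP denotes the usual non-interpolated average precision, averaged over all queries, the measure can be computed as sketched below. The function names and the input format are illustrative assumptions, and each ranked list is assumed to contain no duplicate documents.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one query.

    Precision is taken at the rank of each relevant document retrieved
    and divided by the total number of relevant documents.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average_precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a ranking `d1, d2, d3, d4` in which `d1` and `d3` are the relevant documents, the precisions at the two relevant ranks are 1/1 and 2/3, so the average precision is 5/6.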
24 Evaluation Monolingual test
Monolingual results
Organization     AvgP      Rank
ITC-irst         0.3944    1
Exeter           0.3824    2
IR-n Alicante    0.3637    3
JHU/APL          0.3184    4
- Test corpus: TREC SDR-9 collection
- Parameters
- Number of utterances per passage: 5
- Pause length used for utterance splitting: 0.2 seconds
- Model: with query expansion
25 Evaluation Bilingual test (French-English)
- French queries were translated into English using machine translation
- Power translator
- Free translator
- Babel fish
26 Evaluation Bilingual test (French-English)
Bilingual results
Organization     AvgP      Rank
ITC-irst         0.3064    1
IR-n Alicante    0.3032    2
Exeter           0.2825    3
JHU/APL          0.1904    4
- Test corpus: TREC SDR-9 collection
- Parameters
- Number of utterances per passage: 5
- Pause length used for utterance splitting: 0.2 seconds
- Model: with query expansion
29 Conclusions and future work
- Conclusions
- The IR-n system is robust when working on the SDR task
- IR-n system performance must be improved
- Future work
- Reduce noise produced by repetitions and modifications
- Remove turns without semantic content
- Evaluate and improve our utterance splitter