Title: Evaluation of an Interactive Cross Language Retrieval System
1. Evaluation of an Interactive Cross Language Retrieval System
- Daqing He, Douglas W. Oard, Jianqiang Wang and Michael Nossal
- University of Maryland
2. Outline
- CLIR and interactive CLIR
- CLEF and iCLEF
- Experiment design for iCLEF 2002
- Preliminary results
- Conclusion
3. Cross-Language Information Retrieval (CLIR)
- The process of retrieving documents written in one natural language with automated systems that can accept queries expressed in other languages.
- Example
- Information need: European campaigns against racism, especially in Germany
- German document resources are the best
- No active knowledge of German
- English queries
- Machine-translated documents for selection and examination
4. Global Internet User Population
[Charts of the global Internet user population by language in 2000 and 2005; English is labeled in both years, Chinese in 2005. Source: Global Reach]
5. Why Study CLIR?
- The Internet is a multi-language collection
- Universal information access is probably true in a geographic sense, but not in a language sense
- NLP technologies are advancing, and machine translation systems are available
6. Interactions within the Search Process
[Diagram of the search process, including Query Formulation]
7. Cross-Language Evaluation Forum (CLEF)
- Cross-Language Evaluation Forum for European languages
- Annual event since 2000
- Infrastructure to evaluate cross-language retrieval systems
- Tasks: multilingual, bilingual, monolingual, interactive
- Multiple topic languages
- Dutch, English, French, German, Italian, Spanish, Swedish
8. Interactive CLEF
- Interactive Track of CLEF (iCLEF)
- Annual event since 2001
- Common framework for experiments comparing interactive CLIR systems with interactive query translation, interactive cross-language document selection, or both
- Subset of the CLEF 2001 collection containing French, English or German documents
- Participants from several countries
- Finland, Spain, UK and USA
9. Our Research Interests in Interactive CLIR
- How to support interactions in these processes:
- Query formulation/translation
- Document selection/examination
- Query reformulation
- Develop experiment designs for evaluation
- 2001 iCLEF: interactive document selection
- 2002 iCLEF: query formulation/translation
- 2003 iCLEF: query reformulation?
10. UMD Interactive CLIR System
- User involvement in query translation disambiguation
- Allows users to identify unintended translations
- Equipped with two aids to help disambiguation:
- Back translations
- Key Word In Context (KWIC)
- Built on top of the Inquery retrieval system (a query-construction sketch follows below)
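The deck does not show how user-approved translations reach Inquery, so the following is a minimal, hypothetical sketch (the #syn and #sum operators are real Inquery syntax; the helper function and sample terms are illustrative, not the authors' code):

```python
def build_inquery_query(translations_by_term):
    """Wrap each source term's user-approved translations in an
    Inquery #syn operator so alternatives are scored as one synonym
    set, then combine the per-term clauses with #sum."""
    clauses = []
    for term, translations in translations_by_term.items():
        if len(translations) == 1:
            clauses.append(translations[0])
        else:
            clauses.append("#syn(" + " ".join(translations) + ")")
    return "#sum(" + " ".join(clauses) + ")"

# Hypothetical example: the user kept only the intended German translations
print(build_inquery_query({
    "campaign": ["kampagne", "aktion"],
    "racism": ["rassismus"],
}))  # -> #sum(#syn(kampagne aktion) rassismus)
```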
11. Disambiguation Aid: Back Translations
12. Disambiguation Aid: Back Translations
- Provide synonyms of a query term that share the same translation
- Help the user to identify intended translations
- Obtained from bilingual term lists (see the sketch below)
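As a concrete illustration (a sketch assuming the term list is a simple English-to-German dictionary, not the authors' code), back translations can be obtained by inverting the bilingual term list so that each German translation maps back to every English term that produces it:

```python
from collections import defaultdict

def back_translations(term_list):
    """Invert an English->German term list: for each German
    translation, collect the English terms that map to it."""
    inverted = defaultdict(set)
    for english, germans in term_list.items():
        for german in germans:
            inverted[german].add(english)
    return inverted

# Hypothetical entries: seeing that "ufer" back-translates to
# "bank" and "shore" tells the user it is the river-bank sense.
term_list = {
    "bank": ["bank", "ufer"],
    "shore": ["ufer", "kueste"],
    "coast": ["kueste"],
}
print(sorted(back_translations(term_list)["ufer"]))  # ['bank', 'shore']
```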
13. Disambiguation Aid: Key Word in Context (KWIC)
- Give a passage as the context of usage
[Diagram: an English word t_e and its German translation t_g, each shown with a passage containing it]
14. Disambiguation Aid: Key Word in Context
15. (No transcript)
16. Relevance Judgment and Confidence
- Relevance judgment: highly relevant, somewhat relevant, not relevant
- Confidence of judgment: high, medium, or low confidence
- Judgments can be made in both the selection and examination windows
- Judgments carry across the session
- Judgments can be changed later
17. (No transcript)
18. Experiment Design
- Aims
- To explore the usefulness of the two query translation disambiguation aids
- To observe users' behavior in their interactions with the CLIR system
- Task
- Subjects are asked to find as many truly relevant documents as possible for the given topics
- Languages
- Query language: English
- Document language: German
19. Experiment Design: Topics
- Four topics (each with multiple facets)
- Topic 3
- Title: European Campaigns against Racism
- Desc: Find documents that talk about campaigns against racism in Europe.
- Narr: Relevant documents describe informative or educational campaigns against racism (ethnic or religious, or against immigrants) in Europe. Documents should refer to organized campaigns rather than reporting mere opinions against racism.
20. Experiment Design: Collections
- Der Spiegel 1994 (13,979 docs)
- SDA (German news agency) 1994 (71,677 docs)
- English translations for display and document selection, produced with Systran Premium 3.0
21. Experiment Design: Baseline System
- Approximates a current monolingual retrieval system
- No query translation disambiguation
- Multiple translations treated as if they were synonyms
- Similar to the way stemming is handled
- Sum TF, union DF (illustrated in the sketch below)
- Baseline system: the AUTO system
- UMD-CLIR system: the MANU system
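To make the "Sum TF, union DF" rule concrete, here is a minimal sketch (illustrative code, not the authors' implementation): all translations of a source term are collapsed into one pseudo-term whose term frequency per document is the sum over the translations, and whose document frequency is the size of the union of their posting sets:

```python
def synonym_statistics(translations, tf, postings):
    """Collapse a term's translations into one pseudo-term.
    tf: dict mapping (term, doc_id) -> term frequency
    postings: dict mapping term -> set of doc_ids containing it
    Returns (per-document summed TF, union document frequency)."""
    docs = set()
    for t in translations:
        docs |= postings.get(t, set())   # union DF
    merged_tf = {
        doc: sum(tf.get((t, doc), 0) for t in translations)  # sum TF
        for doc in docs
    }
    return merged_tf, len(docs)

# Hypothetical example: two German translations of one English query term
postings = {"kampagne": {1, 2}, "aktion": {2, 3}}
tf = {("kampagne", 1): 3, ("kampagne", 2): 1,
      ("aktion", 2): 2, ("aktion", 3): 4}
merged_tf, df = synonym_statistics(["kampagne", "aktion"], tf, postings)
print(merged_tf, df)  # doc->TF: {1: 3, 2: 3, 3: 4}, DF: 3 (order may vary)
```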
22. (No transcript)
23. Experiment Design: Measures
- Quantitative
- Mean uninterpolated average precision for query quality (defined in the sketch after this slide)
- Unbalanced F-measure for relevance judgments
- Favors precision over recall, to model the case in which:
- Fluent translation is expensive
- Missing some relevant documents would be okay
- Qualitative
- Analyze questionnaires, interviews and search logs
- How are the initial queries generated?
- When are the queries reformulated? And how?
- Are MT-generated documents sufficient for relevance judgment?
- How confident are the relevance judgments?
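For reference, uninterpolated average precision (the standard definition, not anything specific to this system) averages precision at the rank of each relevant document retrieved, divided by the total number of relevant documents:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated average precision for one query: the mean of
    precision at each rank where a relevant document appears, divided
    by the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Relevant docs retrieved at ranks 2 and 3, one never retrieved:
print(average_precision(["d3", "d1", "d7"], {"d1", "d7", "d9"}))  # ~0.389
```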
24. Experiment Design: Procedure
- Entry questionnaire and training session
- 20-minute search session for each of 2 topics with system 1, plus post-search and post-system questionnaires
- Break
- 20-minute search session for each of the other 2 topics with system 2, plus post-search and post-system questionnaires
- Post-experiment interview
- Total: 2.5 hours
25. Results: Subject Profiles
- 12 subjects
- Highly educated: 10 out of 12 hold a master's or higher degree
- Mature: average age 31, ranging from 19 to 43
- Mostly female: 9 females vs. 3 males
- Experienced searchers: 6 years of online experience on average
- Novices in TREC/iCLEF: none had participated before
- Novices in MT: 3 had very little MT experience
- All native English speakers
- Poor German skills: 11 with no or poor German, 1 with good German
26. Results: Subject Groups
- The first 8 subjects searched only the Der Spiegel collection (group Small)
- 8 to 20 relevant documents for 3 topics
- 0 relevant documents for 1 topic!
- Valuable data on search behavior
- Comparison with data from the last 4
- The last 4 subjects searched both the Der Spiegel and SDA collections (group Large)
- 20 to 60 relevant documents
- Official iCLEF results
27. Results: Subjective Assessment
- Both systems
- Equally easy to search with
- Query reformulation felt necessary
- A bit difficult to formulate queries
- Easy to make relevance judgments, and confident in those judgments
- MANU system
- The intended translation is usually available
- Easy to identify unintended translations
- Search improved after removing unintended translations
28. Results: Query Iterations
- Average of 8 iterations (across all conditions)
- Not very sensitive to topics
- Topic 4 (Hunger Strikes) averaged 6 query iterations
- Topic 2 (Treasure Hunting) averaged 16
- Sensitive to some system/topic combinations
- Topics 1 (Genes and Diseases) and 2 showed small changes between the two systems
- Topics 3 (European Campaigns against Racism) and 4 showed a substantial drop in query iterations with the MANU system
29. Results: Comparison between Groups Large and Small
[Chart; no transcript]
30. Results: Comparison between Groups Large and Small
[Chart of subjective ratings for groups Large and Small, on a scale from "Not at all" to "Extremely"]
31. Results: Subject Comments
- MANU system
- Selecting translations that lead to on/off-topic results: good
- AUTO system
- Simpler, with less user involvement needed: good
- Few functions, so easier to learn and use: good
- No control over translations: bad
- Both systems
- Highlighting keywords helps: good
- Untranslated or poorly translated words in documents: bad
- No Boolean or proximity operators: bad
32. Further Work
- Quantitative studies with MAP and the unbalanced F-measure
- Quality of the queries and quality of the relevance judgments
- Is manual query translation disambiguation really helpful?
- Is MAP a good measure for assessing query quality?
- Do searchers adopt different strategies with different systems, different search topics, and different numbers of relevant documents?
- Possible explanations for the outcomes
33. Conclusion
- Work in progress
- Subjects like to participate in query translation
- Back translations and KWIC are helpful
- An unexpected event gave us some insight: the number of relevant docs can greatly affect users' view of a search system and their search behavior
- On the way to understanding interactions in CLIR
- We want more participants in iCLEF
34. Experiment Design: Measures
- Quantitative
- Unbalanced F-measure for relevance judgments
- P: precision, R: recall
- α = 0.8 (see the formula sketch below)
- Favors precision over recall, because:
- Fluent translation is expensive
- Missing some relevant documents would be okay
- Mean average precision for query quality
- Qualitative
- Analyze questionnaires and interviews
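Given the P, R, and α = 0.8 on this slide, the measure is presumably van Rijsbergen's unbalanced F, F_α = 1 / (α/P + (1 − α)/R), where α > 0.5 weights precision more heavily; a minimal sketch under that assumption:

```python
def f_alpha(precision, recall, alpha=0.8):
    """Van Rijsbergen's unbalanced F-measure; alpha > 0.5
    weights precision more heavily than recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# Swapping P and R shows the precision bias at alpha = 0.8:
print(f_alpha(0.9, 0.3))  # ~0.64
print(f_alpha(0.3, 0.9))  # ~0.35
```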
35. Results: Comparison between Last 4 and First 8
[Chart of ratings, on a scale from "Not at all" to "Extremely", for: Easy to search with? Easy to judge relevance? Necessary to have query reformulation?]
36. Multilingual Information Access
[Diagram of cross-language search, starting from a query]
37. Results: Assessment across Topics
- A little unfamiliar with the topics
- A bit difficult to formulate queries
- Easy to identify unintended translations
- Helpful to have identified unintended translations
- The system does provide the intended translation
- Easy to judge the relevance of documents
- Confident in their relevance judgments
38. Results: Assessment across Topics
- A little unfamiliar with the topics (2.54 out of 5); Last 4 rated higher than First 8 (2.81 vs. 2.41)
- A bit difficult to formulate queries (2.73); Last 4 found it easier than First 8 (2.94 vs. 2.63)
- Easy to identify unintended translations (3.5); no difference between Last 4 and First 8
- Helpful to have identified unintended translations (3.5), but Last 4 felt this clearly more strongly than First 8 (3.9 vs. 3.25)
- The system does provide the intended translation (3.5), with almost no difference between Last 4 and First 8
- Easy to judge the relevance of documents (3.6); Last 4 found it much easier than First 8 (4 vs. 3.34)
- Confident in their relevance judgments (3.5); Last 4 gave clearly higher confidence than First 8 (3.8 vs. 3.3)
39. Results: Subjective Assessment
- Both systems are somewhat easy to search with, but overall the AUTO system is easier
- It is necessary to have query reformulation in both systems (AUTO 3.67, MANU 3.83)
- It is somewhat easy to make relevance judgments with both systems (AUTO 3.17, MANU 3.08)
40. Query Translation Aid: Key Word in Context (KWIC)
- Provide a sentence containing the query term, in which the term's meaning matches the given translation of that query term
- Help the user to identify intended translations
- Obtained from a sentence-aligned parallel corpus (see the sketch below)
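To make this concrete (an illustrative sketch, not the authors' code), KWIC examples can be pulled from a sentence-aligned parallel corpus by finding aligned pairs where the English side contains the query term and the German side contains the candidate translation; the sample sentences are hypothetical:

```python
import re

def tokens(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())

def kwic_examples(aligned_pairs, query_term, translation, max_examples=3):
    """Return aligned (English, German) sentence pairs in which the
    English sentence uses query_term and the German sentence uses the
    candidate translation, showing that translation's sense in context."""
    examples = []
    for english, german in aligned_pairs:
        if query_term in tokens(english) and translation in tokens(german):
            examples.append((english, german))
            if len(examples) == max_examples:
                break
    return examples

corpus = [
    ("The bank raised interest rates.", "Die Bank erhoehte die Zinsen."),
    ("We walked along the river bank.", "Wir gingen am Ufer entlang."),
]
print(kwic_examples(corpus, "bank", "ufer"))
# -> [('We walked along the river bank.', 'Wir gingen am Ufer entlang.')]
```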
41. Cross-Language Evaluation Forum
- Cross-Language Evaluation Forum (CLEF) for European languages
- Contact person: Carol Peters
- Homepage: http://clef.iei.pi.cnr.it:2002/
42. Experiment Design: Matrix
- Within-subject design
- 4 subjects form one block, because of the difficulty of recruiting subjects (an illustrative block assignment follows below)
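The actual presentation-order matrix is not reproduced in this transcript; the sketch below shows one typical counterbalanced assignment for a 4-subject block (2 systems × 2 topic pairs), offered purely as an illustration of a within-subject block design, not the authors' exact matrix:

```python
from itertools import product

# Hypothetical counterbalancing for one 4-subject block: every
# subject sees both systems, and every system/topic-pair pairing
# appears exactly once as a first session within the block.
systems = ["AUTO", "MANU"]
topic_pairs = [("T1", "T2"), ("T3", "T4")]

for subject, (first_system, first_pair) in enumerate(product(systems, topic_pairs), 1):
    second_system = systems[1 - systems.index(first_system)]
    second_pair = topic_pairs[1 - topic_pairs.index(first_pair)]
    print(f"Subject {subject}: session 1 = {first_system} {first_pair}, "
          f"session 2 = {second_system} {second_pair}")
```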
43. Results: Comparison between L4 and F8
- Only First 8 indicated a clear difference in how easy the two systems were to search with (AUTO 3.38 vs. MANU 2.75); Last 4 gave the same score (3.5) for both systems
- Only First 8 indicated a clear difference in the perceived necessity of query reformulation (AUTO 3.88 vs. MANU 4.13); Last 4 gave the same score (3.25) for both systems
- First 8 found the AUTO system somewhat easy (3) for making relevance judgments and the MANU system relatively more difficult (2.63); however, Last 4 found the MANU system quite easy (4) and the AUTO system clearly more difficult (3.5)
44. Results: Comparison between L4 and F8
- Last 4 found it easier to formulate queries, felt clearly more strongly about the helpfulness of identifying unintended translations, found it much easier to make relevance judgments, and were clearly more confident in their judgments
- But Last 4 and First 8 showed no difference in the ease of identifying unintended translations or in whether the system provided the intended translations
- Changes in iterations between F8 and L4:
- Topics 1 and 2 had only a small drop
- Topics 3 and 4 had a significant drop (both from 12 down to 5)
45. Interactions within the Search Process (CL)
[Diagram of the cross-language search process: Source Selection, Query Formulation, Query Translation, Search]
46. Our Research Interests in Interactive CLIR
- How to support interactions in:
- Query formulation
- Document selection
- Query reformulation
- Experiment designs for these processes
- Query formulation/reformulation
- How are the initial queries generated?
- When are the queries reformulated? And how?
- Query translation
- Strategies for translation disambiguation
- Tools for translation disambiguation
- Document selection/examination
- What kind of summary is useful for document selection?
- Are MT-generated documents sufficient for relevance judgment?
- How confident are the relevance judgments?