Title: Evaluation of an Interactive Cross Language Retrieval System
1. Evaluation of an Interactive Cross Language Retrieval System
- Daqing He, Douglas W. Oard, Jianqiang Wang and Michael Nossal
- University of Maryland
2. Outline
- CLIR and interactive CLIR
- CLEF and iCLEF
- Experiment design for iCLEF 2002
- Preliminary results
- Conclusion
3. Cross-Language Information Retrieval (CLIR)
- The process of retrieving documents written in one natural language with automated systems that can accept queries expressed in other languages.
- Example
- Information need: European campaigns against racism, especially in Germany
- German document resources are the best
- No active knowledge of German
- English queries
- Machine-translated documents for selection and examination
4. Global Internet User Population
[Charts of the global Internet user population by language in 2000 and 2005; English is labeled in both years, Chinese in 2005. Source: Global Reach]
5. Why Study CLIR?
- The Internet is a multi-language collection
- Universal information access is probably true in a geographic sense, but not in a language sense
- NLP technologies are advancing, and machine translation systems are available
6. Interactions within the Search Process
[Diagram of the search process, including Query Formulation]
7. Cross-Language Evaluation Forum (CLEF)
- Cross-Language Evaluation Forum for European languages
- Annual event since 2000
- Infrastructure to evaluate cross-language retrieval systems
- Tasks: multilingual, bilingual, monolingual, interactive
- Multiple topic languages
- Dutch, English, French, German, Italian, Spanish, Swedish
8. Interactive CLEF
- Interactive Track of CLEF (iCLEF)
- Annual event since 2001
- Common framework for experiments comparing interactive CLIR systems with interactive query translation, interactive cross-language document selection, or both
- Subset of the CLEF 2001 collection containing French, English or German documents
- Participants from several countries
- Finland, Spain, UK and USA
9. Our Research Interests in Interactive CLIR
- How to support interactions in these processes:
- Query formulation/translation
- Document selection/examination
- Query reformulation
- Develop experiment designs for evaluation
- 2001 iCLEF: interactive document selection
- 2002 iCLEF: query formulation/translation
- 2003 iCLEF: query reformulation?
10. UMD Interactive CLIR System
- User involvement in query translation disambiguation
- Allows users to identify unintended translations
- Equipped with two aids to help disambiguation:
- Back translations
- Key Word In Context (KWIC)
- Built on top of the Inquery retrieval system (a query-construction sketch follows below)
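The deck does not show how user-approved translations reach Inquery, so the following is a minimal, hypothetical sketch (the #syn and #sum operators are real Inquery syntax; the helper function and sample terms are illustrative, not the authors' code):

```python
def build_inquery_query(translations_by_term):
    """Wrap each source term's user-approved translations in an
    Inquery #syn operator so alternatives are scored as one synonym
    set, then combine the per-term clauses with #sum."""
    clauses = []
    for term, translations in translations_by_term.items():
        if len(translations) == 1:
            clauses.append(translations[0])
        else:
            clauses.append("#syn(" + " ".join(translations) + ")")
    return "#sum(" + " ".join(clauses) + ")"

# Hypothetical example: the user kept only the intended German translations
print(build_inquery_query({
    "campaign": ["kampagne", "aktion"],
    "racism": ["rassismus"],
}))  # -> #sum(#syn(kampagne aktion) rassismus)
```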
11. Disambiguation Aid: Back Translations
12. Disambiguation Aid: Back Translations
- Provide synonyms of a query term that share the same translation
- Help the user to identify intended translations
- Obtained from bilingual term lists (see the sketch below)
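As a concrete illustration (a sketch assuming the term list is a simple English-to-German dictionary, not the authors' code), back translations can be obtained by inverting the bilingual term list so that each German translation maps back to every English term that produces it:

```python
from collections import defaultdict

def back_translations(term_list):
    """Invert an English->German term list: for each German
    translation, collect the English terms that map to it."""
    inverted = defaultdict(set)
    for english, germans in term_list.items():
        for german in germans:
            inverted[german].add(english)
    return inverted

# Hypothetical entries: seeing that "ufer" back-translates to
# "bank" and "shore" tells the user it is the river-bank sense.
term_list = {
    "bank": ["bank", "ufer"],
    "shore": ["ufer", "kueste"],
    "coast": ["kueste"],
}
print(sorted(back_translations(term_list)["ufer"]))  # ['bank', 'shore']
```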
13. Disambiguation Aid: Key Word in Context (KWIC)
- Give a passage as the context of usage
[Diagram: an English word t_e and its German translation t_g, each shown with a passage containing it]
14. Disambiguation Aid: Key Word in Context
15. (No transcript)
16. Relevance Judgment and Confidence
- Relevance judgment: highly relevant, somewhat relevant, not relevant
- Confidence of judgment: high, medium, or low confidence
- Judgments can be made in both the selection and examination windows
- Judgments carry across the session
- Judgments can be changed later
17. (No transcript)
18. Experiment Design
- Aims
- To explore the usefulness of the two query translation disambiguation aids
- To observe users' behavior in their interactions with the CLIR system
- Task
- Subjects are asked to find as many truly relevant documents as possible for the given topics
- Languages
- Query language: English
- Document language: German
19. Experiment Design: Topics
- Four topics (each with multiple facets)
- Topic 3
- Title: European Campaigns against Racism
- Desc: Find documents that talk about campaigns against racism in Europe.
- Narr: Relevant documents describe informative or educational campaigns against racism (ethnic or religious, or against immigrants) in Europe. Documents should refer to organized campaigns rather than reporting mere opinions against racism.
20. Experiment Design: Collections
- Der Spiegel 1994 (13,979 docs)
- SDA (German news agency) 1994 (71,677 docs)
- English translations for display and document selection, produced with Systran Premium 3.0
21. Experiment Design: Baseline System
- Approximates a current monolingual retrieval system
- No query translation disambiguation
- Multiple translations treated as if they were synonyms
- Similar to the way stemming is handled
- Sum TF, union DF (illustrated in the sketch below)
- Baseline system: the AUTO system
- UMD-CLIR system: the MANU system
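To make the "Sum TF, union DF" rule concrete, here is a minimal sketch (illustrative code, not the authors' implementation): all translations of a source term are collapsed into one pseudo-term whose term frequency per document is the sum over the translations, and whose document frequency is the size of the union of their posting sets:

```python
def synonym_statistics(translations, tf, postings):
    """Collapse a term's translations into one pseudo-term.
    tf: dict mapping (term, doc_id) -> term frequency
    postings: dict mapping term -> set of doc_ids containing it
    Returns (per-document summed TF, union document frequency)."""
    docs = set()
    for t in translations:
        docs |= postings.get(t, set())   # union DF
    merged_tf = {
        doc: sum(tf.get((t, doc), 0) for t in translations)  # sum TF
        for doc in docs
    }
    return merged_tf, len(docs)

# Hypothetical example: two German translations of one English query term
postings = {"kampagne": {1, 2}, "aktion": {2, 3}}
tf = {("kampagne", 1): 3, ("kampagne", 2): 1,
      ("aktion", 2): 2, ("aktion", 3): 4}
merged_tf, df = synonym_statistics(["kampagne", "aktion"], tf, postings)
print(merged_tf, df)  # doc->TF: {1: 3, 2: 3, 3: 4}, DF: 3 (order may vary)
```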
22. (No transcript)
23. Experiment Design: Measures
- Quantitative
- Mean uninterpolated average precision for query quality (defined in the sketch after this slide)
- Unbalanced F-measure for relevance judgments
- Favors precision over recall, to model the case in which:
- Fluent translation is expensive
- Missing some relevant documents would be okay
- Qualitative
- Analyze questionnaires, interviews and search logs
- How are the initial queries generated?
- When are the queries reformulated? And how?
- Are MT-generated documents sufficient for relevance judgment?
- How confident are the relevance judgments?
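For reference, uninterpolated average precision (the standard definition, not anything specific to this system) averages precision at the rank of each relevant document retrieved, divided by the total number of relevant documents:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Uninterpolated average precision for one query: the mean of
    precision at each rank where a relevant document appears, divided
    by the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

# Relevant docs retrieved at ranks 2 and 3, one never retrieved:
print(average_precision(["d3", "d1", "d7"], {"d1", "d7", "d9"}))  # ~0.389
```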
24. Experiment Design: Procedure
- Entry questionnaire and training session
- 20-minute search session for each of 2 topics with system 1, plus post-search and post-system questionnaires
- Break
- 20-minute search session for each of the other 2 topics with system 2, plus post-search and post-system questionnaires
- Post-experiment interview
- Total: 2.5 hours
25. Results: Subject Profiles
- 12 subjects
- Highly educated: 10 out of 12 hold a master's or higher degree
- Mature: average age 31, ranging from 19 to 43
- Mostly female: 9 females vs. 3 males
- Experienced searchers: 6 years of online experience on average
- Novices in TREC/iCLEF: none had participated before
- Novices in MT: 3 had very little MT experience
- All native English speakers
- Poor German skills: 11 with no or poor German, 1 with good German
26. Results: Subject Groups
- The first 8 subjects searched only the Der Spiegel collection (group Small)
- 8 to 20 relevant documents for 3 topics
- 0 relevant documents for 1 topic!
- Valuable data on search behavior
- Comparison with data from the last 4
- The last 4 subjects searched both the Der Spiegel and SDA collections (group Large)
- 20 to 60 relevant documents
- Official iCLEF results
27. Results: Subjective Assessment
- Both systems
- Equally easy to search with
- Query reformulation felt necessary
- A bit difficult to formulate queries
- Easy to make relevance judgments, and confident in those judgments
- MANU system
- The intended translation is usually available
- Easy to identify unintended translations
- Search improved after removing unintended translations
28. Results: Query Iterations
- Average of 8 iterations (across all conditions)
- Not very sensitive to topics
- Topic 4 (Hunger Strikes) averaged 6 query iterations
- Topic 2 (Treasure Hunting) averaged 16
- Sensitive to some system/topic combinations
- Topics 1 (Genes and Diseases) and 2 showed small changes between the two systems
- Topics 3 (European Campaigns against Racism) and 4 showed a substantial drop in query iterations with the MANU system
29. Results: Comparison between Groups Large and Small
[Chart; no transcript]
30. Results: Comparison between Groups Large and Small
[Chart of subjective ratings for groups Large and Small, on a scale from "Not at all" to "Extremely"]
31. Results: Subject Comments
- MANU system
- Selecting translations that lead to on/off-topic results: good
- AUTO system
- Simpler, with less user involvement needed: good
- Few functions, so easier to learn and use: good
- No control over translations: bad
- Both systems
- Highlighting keywords helps: good
- Untranslated or poorly translated words in documents: bad
- No Boolean or proximity operators: bad
32. Further Work
- Quantitative studies with MAP and the unbalanced F-measure
- Quality of the queries and quality of the relevance judgments
- Is manual query translation disambiguation really helpful?
- Is MAP a good measure for assessing query quality?
- Do searchers adopt different strategies with different systems, different search topics, and different numbers of relevant documents?
- Possible explanations for the outcomes
33. Conclusion
- Work in progress
- Subjects like to participate in query translation
- Back translations and KWIC are helpful
- An unexpected event gave us some insight: the number of relevant docs can greatly affect users' view of a search system and their search behavior
- On the way to understanding interactions in CLIR
- We want more participants in iCLEF
34. Experiment Design: Measures
- Quantitative
- Unbalanced F-measure for relevance judgments
- P: precision, R: recall
- α = 0.8 (see the formula sketch below)
- Favors precision over recall, because:
- Fluent translation is expensive
- Missing some relevant documents would be okay
- Mean average precision for query quality
- Qualitative
- Analyze questionnaires and interviews
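Given the P, R, and α = 0.8 on this slide, the measure is presumably van Rijsbergen's unbalanced F, F_α = 1 / (α/P + (1 − α)/R), where α > 0.5 weights precision more heavily; a minimal sketch under that assumption:

```python
def f_alpha(precision, recall, alpha=0.8):
    """Van Rijsbergen's unbalanced F-measure; alpha > 0.5
    weights precision more heavily than recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# Swapping P and R shows the precision bias at alpha = 0.8:
print(f_alpha(0.9, 0.3))  # ~0.64
print(f_alpha(0.3, 0.9))  # ~0.35
```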
35. Results: Comparison between Last 4 and First 8
[Chart of ratings, on a scale from "Not at all" to "Extremely", for: Easy to search with? Easy to judge relevance? Necessary to have query reformulation?]
36. Multilingual Information Access
[Diagram of cross-language search, starting from a query]
37. Results: Assessment across Topics
- A little unfamiliar with the topics
- A bit difficult to formulate queries
- Easy to identify unintended translations
- Helpful to have identified unintended translations
- The system does provide the intended translation
- Easy to judge the relevance of documents
- Confident in their relevance judgments
38. Results: Assessment across Topics
- A little unfamiliar with the topics (2.54 out of 5); Last 4 rated higher than First 8 (2.81 vs. 2.41)
- A bit difficult to formulate queries (2.73); Last 4 found it easier than First 8 (2.94 vs. 2.63)
- Easy to identify unintended translations (3.5); no difference between Last 4 and First 8
- Helpful to have identified unintended translations (3.5), but Last 4 felt this clearly more strongly than First 8 (3.9 vs. 3.25)
- The system does provide the intended translation (3.5), with almost no difference between Last 4 and First 8
- Easy to judge the relevance of documents (3.6); Last 4 found it much easier than First 8 (4 vs. 3.34)
- Confident in their relevance judgments (3.5); Last 4 gave clearly higher confidence than First 8 (3.8 vs. 3.3)
39. Results: Subjective Assessment
- Both systems are somewhat easy to search with, but overall the AUTO system is easier
- It is necessary to have query reformulation in both systems (AUTO 3.67, MANU 3.83)
- It is somewhat easy to make relevance judgments with both systems (AUTO 3.17, MANU 3.08)
40. Query Translation Aid: Key Word in Context (KWIC)
- Provide a sentence containing the query term, in which the term's meaning matches the given translation of that query term
- Help the user to identify intended translations
- Obtained from a sentence-aligned parallel corpus (see the sketch below)
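To make this concrete (an illustrative sketch, not the authors' code), KWIC examples can be pulled from a sentence-aligned parallel corpus by finding aligned pairs where the English side contains the query term and the German side contains the candidate translation; the sample sentences are hypothetical:

```python
import re

def tokens(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())

def kwic_examples(aligned_pairs, query_term, translation, max_examples=3):
    """Return aligned (English, German) sentence pairs in which the
    English sentence uses query_term and the German sentence uses the
    candidate translation, showing that translation's sense in context."""
    examples = []
    for english, german in aligned_pairs:
        if query_term in tokens(english) and translation in tokens(german):
            examples.append((english, german))
            if len(examples) == max_examples:
                break
    return examples

corpus = [
    ("The bank raised interest rates.", "Die Bank erhoehte die Zinsen."),
    ("We walked along the river bank.", "Wir gingen am Ufer entlang."),
]
print(kwic_examples(corpus, "bank", "ufer"))
# -> [('We walked along the river bank.', 'Wir gingen am Ufer entlang.')]
```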
41. Cross-Language Evaluation Forum
- Cross-Language Evaluation Forum (CLEF) for European languages
- Contact person: Carol Peters
- Homepage: http://clef.iei.pi.cnr.it:2002/
42. Experiment Design: Matrix
- Within-subject design
- 4 subjects form one block, because of the difficulty of recruiting subjects (an illustrative block assignment follows below)
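The actual presentation-order matrix is not reproduced in this transcript; the sketch below shows one typical counterbalanced assignment for a 4-subject block (2 systems × 2 topic pairs), offered purely as an illustration of a within-subject block design, not the authors' exact matrix:

```python
from itertools import product

# Hypothetical counterbalancing for one 4-subject block: every
# subject sees both systems, and every system/topic-pair pairing
# appears exactly once as a first session within the block.
systems = ["AUTO", "MANU"]
topic_pairs = [("T1", "T2"), ("T3", "T4")]

for subject, (first_system, first_pair) in enumerate(product(systems, topic_pairs), 1):
    second_system = systems[1 - systems.index(first_system)]
    second_pair = topic_pairs[1 - topic_pairs.index(first_pair)]
    print(f"Subject {subject}: session 1 = {first_system} {first_pair}, "
          f"session 2 = {second_system} {second_pair}")
```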
43. Results: Comparison between L4 and F8
- Only First 8 indicated a clear difference in how easy the two systems were to search with (AUTO 3.38 vs. MANU 2.75); Last 4 gave the same score (3.5) for both systems
- Only First 8 indicated a clear difference in the perceived necessity of query reformulation (AUTO 3.88 vs. MANU 4.13); Last 4 gave the same score (3.25) for both systems
- First 8 found the AUTO system somewhat easy (3) for making relevance judgments and the MANU system relatively more difficult (2.63); however, Last 4 found the MANU system quite easy (4) and the AUTO system clearly more difficult (3.5)
44. Results: Comparison between L4 and F8
- Last 4 found it easier to formulate queries, felt clearly more strongly about the helpfulness of identifying unintended translations, found it much easier to make relevance judgments, and were clearly more confident in their judgments
- But Last 4 and First 8 showed no difference in the ease of identifying unintended translations or in whether the system provided the intended translations
- Changes in iterations between F8 and L4:
- Topics 1 and 2 had only a small drop
- Topics 3 and 4 had a significant drop (both from 12 down to 5)
45. Interactions within the Search Process (CL)
[Diagram of the cross-language search process: Source Selection, Query Formulation, Query Translation, Search]
46. Our Research Interests in Interactive CLIR
- How to support interactions in:
- Query formulation
- Document selection
- Query reformulation
- Experiment designs for these processes
- Query formulation/reformulation
- How are the initial queries generated?
- When are the queries reformulated? And how?
- Query translation
- Strategies for translation disambiguation
- Tools for translation disambiguation
- Document selection/examination
- What kind of summary is useful for document selection?
- Are MT-generated documents sufficient for relevance judgment?
- How confident are the relevance judgments?