1
Evaluation of an Interactive Cross Language
Retrieval System
  • Daqing He, Douglas W. Oard,
  • Jianqiang Wang and Michael Nossal
  • University of Maryland

2
Outline
  • CLIR and interactive CLIR
  • CLEF and iCLEF
  • Experiment design for iCLEF 2002
  • Preliminary results
  • Conclusion

3
Cross Language Information Retrieval (CLIR)
  • The process of retrieving documents written in
    one natural language with automated systems that
    can accept queries expressed in other languages.
  • Example
  • Information need: European Campaigns against
    Racism, especially in Germany
  • German document resources are the best
  • No active knowledge of German
  • English queries
  • Machine-translated documents for selection and
    examination

4
Global Internet User Population
(Chart: distribution of Internet users by language in 2000
and projected for 2005; visible labels include English and
Chinese. Source: Global Reach)
5
Why study CLIR?
  • The Internet is a multi-language collection
  • Universal information access is probably true in
    the geographic sense, but not in the language sense
  • NLP technologies are advancing, and machine
    translation systems are available

6
Interactions within the Search Process
(Diagram of the search process; visible label: Query
Formulation)
7
Cross-Language Evaluation Forum (CLEF)
  • Cross-Language Evaluation Forum for European
    languages
  • Annual event since 2000
  • Infrastructure to evaluate cross-language
    retrieval systems
  • Tasks: multilingual, bilingual, monolingual,
    interactive
  • Multiple topic languages:
  • Dutch, English, French, German, Italian, Spanish,
    Swedish

8
Interactive CLEF
  • Interactive Track of CLEF (iCLEF)
  • Annual event since 2001
  • Common framework to perform experiments comparing
    interactive CLIR systems with interactive query
    translation, interactive cross-language document
    selection, or both.
  • Subset of CLEF 2001 collection containing either
    French, English or German documents
  • Participants from several countries
  • Finland, Spain, UK and USA

9
Our Research Interests in Interactive CLIR
  • How to support interactions in these processes:
  • Query formulation/translation
  • Document selection/examination
  • Query reformulation
  • Develop experiment designs for evaluation
  • 2001 iCLEF: interactive document selection
  • 2002 iCLEF: query formulation/translation
  • 2003 iCLEF: query reformulation?

10
UMD Interactive CLIR System
  • User involvement in query translation
    disambiguation
  • Allows users to identify unintended translations
  • Equipped with two aids to help disambiguation:
  • Back translations
  • Key Word In Context (KWIC)
  • Built on top of the InQuery retrieval system

11
Disambiguation Aid - Back Translations
12
Disambiguation Aid - Back Translations
  • Provide synonyms of a query term that share the
    same translation
  • Help the user identify intended translations
  • Obtained from bilingual term lists (see the sketch
    below)
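
A back translation can be derived directly from a bilingual
term list: for each candidate German translation of an English
query term, collect every English term that maps to it. The
sketch below assumes a simple tab-separated term list and uses
hypothetical file and variable names; it is an illustration,
not the UMD implementation.

    from collections import defaultdict

    def load_term_list(path):
        """Load a bilingual term list with lines like
        'english<TAB>german' (hypothetical format)."""
        en_to_de = defaultdict(set)
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) == 2:
                    en, de = parts
                    en_to_de[en.lower()].add(de.lower())
        return en_to_de

    def back_translations(en_to_de, german_term):
        """All English terms that share the given German translation."""
        german_term = german_term.lower()
        return {en for en, de_set in en_to_de.items() if german_term in de_set}

    def show_candidates(en_to_de, query_term):
        """For each candidate translation of an English query term, list
        its back translations so the user can spot unintended senses."""
        for de in sorted(en_to_de.get(query_term.lower(), ())):
            synonyms = back_translations(en_to_de, de) - {query_term.lower()}
            print(f"{de}: back translations = {sorted(synonyms)}")

For example, if "bank" maps to both "Bank" and "Ufer", the back
translations of "Ufer" might include "shore" and "riverbank",
signalling that "Ufer" is the river-bank sense.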

13
Disambiguation Aid - Key Word in Context (KWIC)
  • Give a passage as the context of usage

(Diagram: an English word t_e and a candidate German
translation t_g, each shown with a passage that contains it.)
14
Disambiguation Aid - Key Word in Context
  • Example

15
(No Transcript)
16
Relevance Judgment and Confidence
  • Relevance judgment:
  • Highly relevant, somewhat relevant, not relevant
  • Confidence of judgment:
  • High confidence, medium confidence, low confidence
  • Judgments can be made in both the selection and
    examination windows
  • Judgments are carried across the session
  • Judgments can be changed later (a small data
    sketch follows)
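
A minimal sketch of how such session-wide judgments might be
stored; the class, field names, and value sets below are
assumptions for illustration, not the actual UMD data model.

    from dataclasses import dataclass

    RELEVANCE = ("not relevant", "somewhat relevant", "highly relevant")
    CONFIDENCE = ("low", "medium", "high")

    @dataclass
    class Judgment:
        relevance: str
        confidence: str

    class JudgmentStore:
        """One judgment per document for the whole session, whether it
        was entered in the selection or the examination window; a later
        judgment simply overwrites the earlier one."""
        def __init__(self):
            self._by_doc = {}

        def record(self, doc_id, relevance, confidence):
            assert relevance in RELEVANCE and confidence in CONFIDENCE
            self._by_doc[doc_id] = Judgment(relevance, confidence)

        def get(self, doc_id):
            return self._by_doc.get(doc_id)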

17
(No Transcript)
18
Experiment design
  • Aim
  • To explore the usefulness of the two query
    translation disambiguation aids
  • To observe users' behavior in their interactions
    with the CLIR system
  • Task
  • Subjects are asked to find as many truly relevant
    documents as possible for the given topics
  • Languages
  • Query language: English
  • Document language: German

19
Experiment design - Topics
  • Four topics (each with multiple facets)
  • Topic 3
  • Title: European Campaigns against Racism
  • Desc: Find documents that talk about campaigns
    against racism in Europe.
  • Narr: Relevant documents describe informative or
    educational campaigns against racism (ethnic or
    religious, or against immigrants) in Europe.
    Documents should refer to organized campaigns
    rather than reporting mere opinions against
    racism.

20
Experiment design - Collections
  • Der Spiegel - 1994 (13,979 docs)
  • SDA German news agency data, 1994 (71,677 docs)
  • English translations for display and document
    selection produced with Systran Premium 3.0

21
Experiment design - Baseline System
  • Approximates a current monolingual retrieval system
  • No query translation disambiguation
  • Multiple translations treated as if they were
    synonyms
  • Similar to the way stemming is handled
  • Sum TF, union DF (see the sketch below)
  • Baseline system --- AUTO system
  • UMD-CLIR system --- MANU system
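
A minimal sketch of the sum-TF / union-DF treatment of
translation alternatives: all translations of one query term
are scored as a single synonym class. The index layout, the
TF-IDF weighting, and the variable names are illustrative
assumptions, not the actual InQuery configuration.

    import math

    def synonym_class_stats(doc_term_freqs, postings, translations):
        """TF is the sum of the translations' frequencies in this
        document; DF is the size of the union of the document sets
        that contain any of the translations."""
        tf = sum(doc_term_freqs.get(t, 0) for t in translations)
        df = len(set().union(*(postings.get(t, set()) for t in translations)))
        return tf, df

    def score_doc(doc_term_freqs, postings, query_translation_sets, n_docs):
        """Simple TF-IDF scoring over one synonym class per
        source-language query term (illustrative only)."""
        score = 0.0
        for translations in query_translation_sets:
            tf, df = synonym_class_stats(doc_term_freqs, postings, translations)
            if tf and df:
                score += tf * math.log(n_docs / df)
        return score

The MANU system can be thought of as the same computation after
the user has removed unintended translations from each class.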

22
(No Transcript)
23
Experiment design - Measures
  • Quantitative
  • Mean uninterpolated average precision for query
    quality (sketch below)
  • Unbalanced F-Measure for relevance judgment
  • Favors precision over recall, to model the case in
    which:
  • Fluent translation is expensive
  • Missing some relevant documents would be okay
  • Qualitative
  • Analyze questionnaires, interviews and search
    logs
  • How are the initial queries generated?
  • When are the queries reformulated? And how?
  • Are MT generated docs sufficient for relevance
    judgment?
  • How confident are the relevance judgments?
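
Uninterpolated average precision for one topic can be computed
directly from the ranked list and the relevance judgments; this
is a generic sketch of the standard definition, not code from
the iCLEF evaluation package.

    def average_precision(ranked_doc_ids, relevant_ids):
        """Mean of precision@k over the ranks k at which relevant
        documents appear, divided by the total number of relevant
        documents; MAP is the mean of this value over topics."""
        relevant = set(relevant_ids)
        if not relevant:
            return 0.0
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranked_doc_ids, start=1):
            if doc_id in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant)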

24
Experiment design - Procedure
  • Entry Questionnaire and Training Session
  • A 20-minute search session for each of 2 topics
    with system 1, plus post-search and post-system
    questionnaires
  • Break
  • A 20-minute search session for each of the other 2
    topics with system 2, plus post-search and
    post-system questionnaires
  • Post-experiment interview
  • Total: 2.5 hours

25
Results - Subject Profiles
  • 12 subjects
  • Highly educated: 10 out of 12 hold a master's or
    higher degree
  • Mature: average age 31, ranging from 19 to 43
  • Mostly female: 9 females vs. 3 males
  • Experienced searchers: average of 6 years of online
    experience
  • Novices in TREC/iCLEF: none had participated before
  • Novices in MT: 3 with very little MT experience
  • All native English speakers
  • Poor German skills: 11 with no or poor German, 1
    with good German

26
Results Subject Groups
  • First 8 subjects searched only the Der Spiegel
    collection (group Small)
  • 8 to 20 relevant documents for 3 topics
  • 0 relevant documents for 1 topic!
  • Valuable data on search behavior
  • Comparison with data from the Last 4
  • Last 4 subjects searched both the Der Spiegel and
    SDA collections (group Large)
  • 20 to 60 relevant documents
  • Official iCLEF results

27
Results Subjective Assessment
  • Both systems
  • Equally easy to search with
  • Query reformulation is necessary,
  • and queries are a bit difficult to formulate
  • Easy to make relevance judgments,
  • and subjects were confident in the judgments
  • MANU system
  • The intended translation is usually available
  • Easy to identify unintended translations
  • Search improved after removing unintended
    translations

28
Results Query Iterations
  • Average of 8 iterations (across all conditions)
  • Not very sensitive to topics
  • Topic 4 (Hunger Strikes) averaged 6 query
    iterations
  • Topic 2 (Treasure Hunting) averaged 16
  • Sensitive to some system/topic combinations
  • Topics 1 (Genes and Diseases) and 2 showed only
    small changes between the two systems
  • Topics 3 (European Campaigns against Racism) and
    4 showed a substantial drop in query iterations
    with the MANU system

29
Results Comparison between Groups Large and
Small
30
Results Comparison between Groups Large and Small
(Chart: subjective ratings for groups Large and Small on a
scale from "Not at all" through "Somewhat" to "Extremely".)
31
Results Subjects' Comments
  • MANU system
  • Being able to select translations that lead to
    on/off-topic results - good
  • AUTO system
  • Simpler, less user involvement needed - good
  • Fewer functions, easier to learn and use - good
  • No control over translations - bad
  • Both systems
  • Highlighting keywords helps - good
  • Untranslated/poorly-translated words in documents
    - bad
  • No Boolean or proximity operators - bad

32
Further work
  • Quantitative studies with MAP and unbalanced F
    measure
  • Quality of the queries and quality of the
    relevance judgments
  • Is manual query translation disambiguation really
    helpful?
  • Is MAP a good measure for assessing query quality?
  • Different strategies in searches with different
    systems, with different search topics, with
    different numbers of relevant documents?
  • Possible explanations for the outcomes

33
Conclusion
  • Work in progress
  • Subjects like to participate in query translation
  • Back translations and KWIC are helpful
  • An unexpected event gave us the insight that the
    number of relevant docs can greatly affect
    users' view of a search system and their search
    behavior
  • On the route to understanding interactions in CLIR
  • We want more participants in iCLEF

34
Experiment design - Measures
  • Quantitative
  • Unbalanced F-Measure for relevance judgment
  • Fα = 1 / (α / P + (1 - α) / R), where P = precision,
    R = recall, and α = 0.8 (sketch below)
  • Favors precision over recall because:
  • Fluent translation is expensive
  • Missing some relevant documents would be okay
  • Mean average precision for query quality
  • Qualitative
  • Analyze Questionnaires and Interviews
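
A minimal sketch of this unbalanced F computation, assuming the
set of documents the subject marked relevant and the official
relevant set are available as Python sets (variable names are
illustrative).

    def unbalanced_f(selected_ids, relevant_ids, alpha=0.8):
        """F_alpha = 1 / (alpha/P + (1 - alpha)/R); alpha = 0.8
        weights precision more heavily than recall."""
        selected, relevant = set(selected_ids), set(relevant_ids)
        true_pos = len(selected & relevant)
        if true_pos == 0:
            return 0.0
        precision = true_pos / len(selected)
        recall = true_pos / len(relevant)
        return 1.0 / (alpha / precision + (1.0 - alpha) / recall)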

35
Results Comparison between Last 4 and First 8
(Chart: ratings from "Not at all" through "Somewhat" to
"Extremely" on the questions: Easy to search with? Easy to
judge relevance? Necessary to have query reformulation?)
36
Multilingual Information Access
(Diagram of cross-language search; visible labels:
Cross-Language Search, Query.)
37
Results Assessment via Topics
  • A little unfamiliar with the topics
  • A bit difficult to formulate queries
  • Easy to identify unintended translations
  • Helpful to have identified unintended translations
  • The system does provide the intended translation
  • Easy to judge the relevance of documents
  • Confident in the relevance judgments

38
Results Assessment via Topics
  • A little unfamiliar with the topics (2.54 out of
    5); the Last 4 rated higher than the First 8 (2.81
    vs. 2.41).
  • A bit difficult to formulate queries (2.73); the
    Last 4 found it easier than the First 8 (2.94 vs.
    2.63).
  • Easy to identify unintended translations (3.5);
    no difference between the Last 4 and the First 8.
  • Helpful to have identified unintended translations
    (3.5); the Last 4 felt this clearly more strongly
    than the First 8 (3.9 vs. 3.25).
  • The system does provide the intended translation
    (3.5); almost no difference between the Last 4
    and the First 8.
  • Easy to judge the relevance of documents (3.6);
    the Last 4 found it much easier than the First 8
    (4 vs. 3.34).
  • Confident in the relevance judgments (3.5); the
    Last 4 reported clearly higher confidence than the
    First 8 (3.8 vs. 3.3).

39
Results Subjective Assessment
  • Both systems are somewhat easy to search with,
    but overall the AUTO system is easier
  • It is necessary to have query reformulation in
    both systems (AUTO 3.67, MANU 3.83).
  • It is somewhat easy to make relevance judgments
    with both systems (AUTO 3.17, MANU 3.08).
  • A little unfamiliar with the topics
  • A bit difficult to formulate queries
  • Easy to identify unintended translations
  • Helpful to have identified unintended translations
  • The system does provide the intended translation

40
Query Translation Aid - Key Word in Context (KWIC)
  • Provides a sentence containing the query term in
    which the term is used in the same sense as the
    candidate translation of that query term
  • Helps the user identify intended translations
  • Obtained from a sentence-aligned parallel corpus
    (see the sketch below)
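
A minimal sketch of how KWIC examples might be pulled from a
sentence-aligned parallel corpus; the corpus format (a list of
English/German sentence pairs) and the simple token matching
are assumptions made for illustration.

    def kwic_examples(sentence_pairs, query_term, translation, max_examples=3):
        """Return aligned sentence pairs in which the query term
        appears on the English side and the candidate translation on
        the German side, so the English sentence shows the sense in
        which that translation is used."""
        query_term, translation = query_term.lower(), translation.lower()
        examples = []
        for en_sent, de_sent in sentence_pairs:
            if (query_term in en_sent.lower().split()
                    and translation in de_sent.lower().split()):
                examples.append((en_sent, de_sent))
                if len(examples) >= max_examples:
                    break
        return examples

Showing the English side of each pair lets a user with no German
recognize unintended senses of a candidate translation.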

41
Cross-Language Evaluation Forum
  • Cross-Language Evaluation Forum (CLEF) for
    European languages
  • Contact person: Carol Peters
  • Homepage: http://clef.iei.pi.cnr.it:2002/

42
Experiment design - Matrix
  • Within-subject design
  • 4 subjects form one block, because of the
    difficulty of recruiting subjects (an illustrative
    block layout is sketched below)
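
One possible counterbalanced layout for a 4-subject block, in
which each subject uses each system once on a different pair of
topics, with system order and topic pairing rotated. This is an
illustrative assumption, not the actual iCLEF 2002 matrix.

    SYSTEMS = ("AUTO", "MANU")
    TOPIC_PAIRS = ((1, 2), (3, 4))

    def block_assignment():
        """Return {subject: [(system, topics), (system, topics)]}
        for one block of 4 subjects."""
        assignment, subject = {}, 1
        for first_system in (0, 1):      # which system is used first
            for first_pair in (0, 1):    # which topic pair comes first
                assignment[subject] = [
                    (SYSTEMS[first_system], TOPIC_PAIRS[first_pair]),
                    (SYSTEMS[1 - first_system], TOPIC_PAIRS[1 - first_pair]),
                ]
                subject += 1
        return assignment

    for subj, sessions in block_assignment().items():
        print(subj, sessions)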

43
Results Comparison between L4 and F8
  • Only the First 8 indicated a clear difference in
    the ease of searching with the two systems (AUTO
    3.38 vs. MANU 2.75); the Last 4 gave the same score
    (3.5) for both systems.
  • Only the First 8 indicated a clear difference in
    the necessity of query reformulation (AUTO 3.88
    vs. MANU 4.13); the Last 4 gave the same score
    (3.25) for both systems.
  • The First 8 thought the AUTO system was somewhat
    easy (3) for making relevance judgments, whereas
    the MANU system was relatively more difficult
    (2.63). However, the Last 4 thought the MANU system
    was quite easy (4) and the AUTO system more
    difficult (3.5).

44
Results Comparison between L4 and F8
  • The Last 4 found it easier to formulate queries,
    felt more strongly about the helpfulness of having
    identified unintended translations, found it much
    easier to make relevance judgments, and were
    clearly more confident in their judgments.
  • But the Last 4 and the First 8 showed no difference
    in the ease of identifying unintended translations
    or in whether the system provided the intended
    translations.
  • Changes in query iterations between the First 8 and
    the Last 4:
  • Topics 1 and 2 had only a small drop
  • Topics 3 and 4 had a significant drop (both from
    12 down to 5)

45
Interactions within the Search Process (CL)
(Diagram of the cross-language search process; visible stages:
Source Selection, Query Formulation, Query Translation, Search.)
46
Our Research Interests in Interactive CLIR
  • How to support interactions in
  • Query formulation
  • Document selection
  • Query reformulation
  • Experiment designs for these processes
  • Query formulation/reformulation
  • How are the initial queries generated?
  • When are the queries reformulated? And how?
  • Query translation
  • Strategies for translation disambiguation
  • Tools for translation disambiguation
  • Document selection/examination
  • What kind of summary for document selection?
  • Are MT-generated docs sufficient for relevance
    judgment?
  • How confident are the relevance judgments?