Transcript and Presenter's Notes

Title: Some thoughts on failure analysis (success analysis) for CLIR


1
Some thoughts on failure analysis (success
analysis) for CLIR
  • Donna Harman
  • Scientist Emeritus
  • National Institute of Standards and Technology

2
Welcome to the family!!
TREC
FIRE
3
Congratulations!!
  • To the FIRE organizers who actually made this
    happen (in an incredibly short amount of time)
  • To the participants who got their systems
    running, using (probably) insufficient resources,
    and FINISHED!!

4
Now what??
  • Do some success/failure analysis, looking more
    deeply at what worked (and didn't work)
  • Write papers so that others can learn from your
    successes and failures
  • Come up with plans for the next FIRE!!

5
TREC-1 (1992)
  • Hardware
  • Most machines ran at 75 MHz
  • Most machines had 32 MB of memory
  • A 2 GB disk drive cost $5,000
  • Software
  • IR systems previously worked on CACM and other
    small collections
  • This means 2 or 3 thousand documents
  • And those documents were abstracts

6
TREC Ad Hoc Task
50 topics
7
(No Transcript)
8
Some TREC-1 participants and methods
  • Carnegie Mellon University (Evans CLARIT
    system)
  • City University, London (Robertson OKAPI)
  • Cornell University (Salton/Buckley SMART)
  • Universitaet Dortmund (Fuhr SMART)
  • Siemens Corporate Research, Inc
    (Voorhees SMART)
  • New York University (Strzalkowski NLP
    methods)
  • Queens College, CUNY (Kwok PIRCS,
    spreading activation)
  • RMIT (Moffat, Wilkinson, Zobel compression
    study)
  • University of California, Berkeley
    (Cooper/Gey logistic regression)
  • University of Massachusetts, Amherst
    (Croft inference network)
  • University of Pittsburgh (Korfhage genetic
    algorithms)
  • VPISU (Fox combining multiple manual
    searches)
  • Bellcore (Dumais LSI)
  • ConQuest Software, Inc (Nelson)
  • GE R&D Center (Jacobs/Rau Boolean
    approximation)
  • TRW Systems Development Division (Mettler
    hardware array processor)

9
What SMART did in TREC-1
  • Two official runs testing single term indexing vs
    single term plus two-term statistical phrases
  • Many other runs investigating the effects of
    standard procedures
  • Which stemmer to use
  • How large a stopword list
  • Different simple weighting schemes

10
After TREC-1
  • Community had more confidence in the evaluation
    (TREC was unknown before)
  • Training data available
  • Major changes to algorithms for most systems to
    cope with the wide variation in documents, and in
    particular for the very long documents

11
Ad Hoc Technologies
12
Pseudo-relevance feedback
  • Pseudo-relevance feedback pretends that the top
    X documents are relevant and then uses these to
    add expansion terms and/or to reweight the
    original query terms
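As a concrete illustration, here is a minimal sketch of Rocchio-style pseudo-relevance feedback over a toy bag-of-words representation; the function name, weights, and parameter defaults are illustrative assumptions, not any particular TREC system's settings:

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, top_x=10,
                              n_expansion_terms=20, fb_weight=0.5):
    """Pretend the top X retrieved documents are relevant and use their
    terms to expand and reweight the original query (Rocchio-style)."""
    centroid = Counter()
    for doc_terms in ranked_docs[:top_x]:          # each doc is a list of tokens
        doc_len = max(len(doc_terms), 1)
        for term, tf in Counter(doc_terms).items():
            centroid[term] += tf / doc_len         # length-normalized tf

    new_query = Counter({term: 1.0 for term in query_terms})  # original terms
    for term, weight in centroid.most_common(n_expansion_terms):
        new_query[term] += fb_weight * weight      # add/boost feedback terms
    return dict(new_query)
```

The top_x and n_expansion_terms parameters correspond to the "top docs used" and "terms added" settings compared in the TREC-7 table on the next slide.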

13
Query expansion (TREC-7)
Organization    Expansion and Feedback   Top docs used and terms added   Comments
OKAPI           Probabilistic            10/30
AT&T            Rocchio                  10/20                           5 P enrichment
INQUERY         LCA                      30P/50                          Reranking using title
Cornell/SabIR   Rocchio                  30/25                           Clustering, reranking
RMIT/CSIRO      Rocchio                  10/40                           5P passages
BBN             HMM-based                6/unknown                       Differential weighting topic parts
Twenty-one      Rocchio                  3/200
CUNY            LCA                      200P/unknown
14
The TREC Tracks
  • Blog, Spam: Personal documents
  • Legal, Genome: Retrieval in a domain
  • Novelty, QA: Answers, not docs
  • Enterprise, Terabyte, Web, VLC: Web searching, size
  • Video, Speech, OCR: Beyond text
  • X→{X,Y,Z}, Chinese, Spanish: Beyond just English
  • Interactive, HARD: Human-in-the-loop
  • Filtering, Routing: Streamed text
  • Ad Hoc, Robust: Static text
15
TREC Spanish and Chinese
  • Initial approaches to Spanish (1994) used methods
    for English but with new stopword lists and new
    stemmers
  • Initial approaches to Chinese (1996) worked with
    character bi-grams in place of words, and
    sometimes used stoplists (many of the groups had
    no access to speakers of Chinese)

16
CLIR for English, French, and German (1996)
  • Initial collection was Swiss newswire in three
    languages, plus the AP newswire from the TREC
    English collection
  • Initial approaches for monolingual work were
    stemmers and stoplists for French and German, and
    the use of n-grams
  • Initial use of machine-readable bi-lingual
    dictionaries for translation of queries

17
NTCIR-1 (1999)
  • 339,483 documents in Japanese and English
  • 23 groups, of which 17 did monolingual Japanese
    and 10 groups did CLIR
  • Initial approaches worked with known methods for
    English, but had to deal with the issues of
    segmenting Japanese or working with bi-grams or
    n-grams

18
CLEF 2000
  • Co-operative activity across five European
    countries
  • Multilingual, bilingual and monolingual tasks; 40
    topics in 8 European languages
  • 20 groups, over half working in the multilingual
    task; the others were groups new to IR who worked
    monolingually in their own language
  • Many different kinds of resources used for the
    CLIR part of the task

19
Savoy's web page
20
So you finished--now what??
  • Analyze the results, otherwise NOTHING will have
    been learned
  • Do some success/failure analysis, looking more
    deeply at what worked (and didn't work)
  • Try to understand WHY something worked or did not
    work!!

21
Macro-analysis
  • Bugs in the system: if your results are seriously
    worse than others, check for bugs
  • Effects of document length: look at the ranking
    of your documents with respect to their length
    (see the sketch after this list)
  • Effects of topic lengths
  • Effects of different tokenizers/stemmers
  • Baseline monolingual results vs CLIR results:
    both parts should be analyzed separately
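For instance, the document-length check can be scripted. A hypothetical sketch, assuming `run` is a dict of ranked document ids per topic and `doc_lengths` is a dict of document lengths (all names here are illustrative, not part of any particular system):

```python
def length_bias_report(run, doc_lengths, top_k=100):
    """Compare the average length of each topic's top-ranked documents with
    the collection-wide average; a large, consistent gap hints at a
    document-length normalization problem."""
    collection_avg = sum(doc_lengths.values()) / len(doc_lengths)
    for topic_id, ranked_doc_ids in sorted(run.items()):
        lengths = [doc_lengths[d] for d in ranked_doc_ids[:top_k] if d in doc_lengths]
        if lengths:
            top_avg = sum(lengths) / len(lengths)
            print(f"{topic_id}: top-{top_k} avg length {top_avg:.0f} "
                  f"vs collection avg {collection_avg:.0f}")
```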

22
CLEF 2008 Experiments (Paul McNamee)
           words    stems    morf     lcn4     lcn5     4-grams  5-grams
BG         0.2164   --       0.2703   0.2822   0.2442   0.3105   0.2820
CS         0.2270   --       0.3215   0.2567   0.2477   0.3294   0.3223
DE         0.3303   0.3695   0.3994   0.3464   0.3522   0.4098   0.4201
EN         0.4060   0.4373   0.4018   0.4176   0.4175   0.3990   0.4152
ES         0.4396   0.4846   0.4451   0.4485   0.4517   0.4597   0.4609
FI         0.3406   0.4296   0.4018   0.3995   0.4033   0.4989   0.5078
FR         0.3638   0.4019   0.3680   0.3882   0.3834   0.3844   0.3930
HU         0.1520   --       0.2327   0.2274   0.2215   0.3192   0.3061
IT         0.3749   0.4178   0.3474   0.3741   0.3673   0.3738   0.3997
NL         0.3813   0.4003   0.4053   0.3836   0.3846   0.4219   0.4243
PT         0.3162   --       0.3287   0.3418   0.3347   0.3358   0.3524
RU         0.2671   --       0.3307   0.2875   0.3053   0.3406   0.3330
SV         0.3387   0.3756   0.3738   0.3638   0.3467   0.4236   0.4271
Average*   0.3195   --       0.3559   0.3475   0.3431   0.3851   0.3880
Average**  0.3719   0.4146   0.3928   0.3902   0.3883   0.4214   0.4310
(* over all 13 languages; ** over the 8 languages with stemmer results)
23
X to EFGI Results
24
Stemming
  • Performance increases and decreases on a per
    topic basis (out of 225 topics)

Cranfield performance   increased @P10   decreased @P10   increased @P30   decreased @P30
plurals                 34               32               39               33
Porter                  49               37               49               31
Lovins                  51               44               63               33
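Counts like those above come from comparing two runs topic by topic. A small illustrative sketch, assuming each run is given as a dict mapping topic id to its P@10 score (function and parameter names are assumptions for illustration):

```python
def count_per_topic_changes(baseline_p10, variant_p10, min_diff=0.0):
    """Count topics where the variant run (e.g. a stemmed run) increased or
    decreased P@10 relative to the baseline run."""
    increased = decreased = unchanged = 0
    for topic, base_score in baseline_p10.items():
        diff = variant_p10.get(topic, 0.0) - base_score
        if diff > min_diff:
            increased += 1
        elif diff < -min_diff:
            decreased += 1
        else:
            unchanged += 1
    return increased, decreased, unchanged
```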
25
Now dig into a per topic analysis
  • This is a lot of work but it is really the only
    way to understand what is happening

26
Average Precision per Topic
27
Average Precision vs. Number Relevant
28
Micro-analysis, step 1
  • Select specific topics to investigate
  • Look at results on a per topic basis with respect
    to the median of all the groups; pick a small
    number of topics (10?) that did much worse than
    the median; these are the initial set to explore
  • Optionally pick a similar set that did much
    BETTER than the median to see where your
    successes are
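A hypothetical sketch of this selection step, assuming you have per-topic average precision for your run and the per-topic median over all groups; the "much worse" margin and all names are illustrative assumptions:

```python
def pick_topics_to_study(my_ap, median_ap, n=10, margin=0.05):
    """Return the topics where this run is furthest below the per-topic
    median average precision of all groups, and (optionally) the topics
    furthest above it."""
    diffs = {t: my_ap[t] - median_ap[t] for t in my_ap if t in median_ap}
    much_worse = sorted((t for t, d in diffs.items() if d < -margin),
                        key=lambda t: diffs[t])[:n]
    much_better = sorted((t for t, d in diffs.items() if d > margin),
                         key=lambda t: diffs[t], reverse=True)[:n]
    return much_worse, much_better
```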

29
Micro-analysis, step 2
  • Now for each topic, pick a set of documents to
    analyze
  • Look at top X documents (around 20)
  • Look at the non-relevant ones (failure analysis)
  • Optionally, look at the relevant ones (success
    analysis)
  • Also look at the relevant documents that were NOT
    retrieved in the top Y (say 100) documents
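A minimal sketch that assembles these three document sets for one topic, assuming a ranked list of document ids and the set of judged-relevant ids (names are illustrative):

```python
def documents_to_examine(ranked_doc_ids, relevant_doc_ids, top_x=20, top_y=100):
    """Split one topic's results into the three sets described above."""
    top = ranked_doc_ids[:top_x]
    relevant = set(relevant_doc_ids)
    top_nonrelevant = [d for d in top if d not in relevant]   # failure analysis
    top_relevant = [d for d in top if d in relevant]          # success analysis
    retrieved_y = set(ranked_doc_ids[:top_y])
    missed_relevant = [d for d in relevant_doc_ids if d not in retrieved_y]
    return top_nonrelevant, top_relevant, missed_relevant
```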

30
Micro-analysis, step 3
  • For the relevant documents that were NOT retrieved
    in the top Y set, analyze for each document why it
    was not retrieved: what query terms were not in the
    document, very short or long document, etc.
  • Do something similar for the top X non-relevant
    documents: why were they ranked highly?
  • Develop a general hypothesis for this topic as to
    what the problems were
  • Now try to generalize across topics
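A simple illustrative helper for the per-document check, assuming the query and the document are available as token lists; the length thresholds are arbitrary assumptions to tune for your collection:

```python
def diagnose_missed_document(query_terms, doc_terms, short_len=50, long_len=5000):
    """Collect simple clues about why a relevant document was not retrieved:
    which query terms it lacks, and whether it is unusually short or long."""
    doc_vocab = set(doc_terms)
    clues = {
        "missing_query_terms": [t for t in query_terms if t not in doc_vocab],
        "doc_length": len(doc_terms),
    }
    if len(doc_terms) < short_len:
        clues["note"] = "very short document"
    elif len(doc_terms) > long_len:
        clues["note"] = "very long document"
    return clues
```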

31
Possible Monolingual issues
  • Tokenization and stemmer problems
  • Document length normalization problems
  • Abbreviation/common word problems
  • Term weighting problems, such as:
  • Where are the global weights (IDF, etc.) coming
    from?? (see the sketch after this list)
  • Term expansion problems: generally not enough
    expansion (low recall)
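As a reminder of what "global" means here, one common IDF variant computed from collection statistics is sketched below (other systems use different formulas); the point of the question above is which collection, and which language, supplies df and N:

```python
import math

def idf(term, doc_freq, num_docs):
    """One common global weight: idf(t) = log(N / df(t)), computed from the
    statistics of some document collection."""
    df = doc_freq.get(term, 0)
    return math.log(num_docs / df) if df > 0 else 0.0
```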

32
Possible CLIR Problems
  • Bi-lingual dictionary too small or missing too
    many critical words (names, etc.)
  • Multiple translations in the dictionary leading to
    bad precision; particularly important when using
    term expansion techniques (see the sketch after
    this list)
  • Specific issues with cross-language synonyms,
    acronyms, etc.; need a better technique for
    acquiring these
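A toy sketch of dictionary-based query translation that illustrates the multiple-translation problem; splitting each source term's weight evenly across its translations is shown only as one common mitigation (an assumption for illustration, not any particular group's method):

```python
def translate_query(source_terms, bilingual_dict):
    """Naive dictionary-based query translation: replace each source-language
    term by all of its dictionary translations. Giving every translation full
    weight inflates ambiguous terms, so each source term's unit weight is
    split evenly across its translations; terms missing from the dictionary
    (names, acronyms) are kept untranslated."""
    target_weights = {}
    for term in source_terms:
        translations = bilingual_dict.get(term, [term])  # fall back to the source term
        share = 1.0 / len(translations)
        for t in translations:
            target_weights[t] = target_weights.get(t, 0.0) + share
    return target_weights
```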

33
Reliable Information Access Workshop (RIA), 2003
  • Goals: understand/control variability
  • Participating systems: Clairvoyance, Lemur
    (CMU, UMass), MultiText, OKAPI, SMART (2
    versions)
  • Methodology
  • controlled experiments in pseudo-relevance
    feedback across 7 systems
  • massive, cooperative failure analysis
  • http://ir.nist.gov/ria

34
RIA Failure analysis
  • Chose 44 "failure" topics from 150 old TREC topics
  • Mean Average Precision < average
  • Also picking most variance across systems
  • Use results from 6 systems' standard runs
  • For each topic, people spent 45-60 minutes
  • looking at results from their assigned system
  • Short group discussion to come to consensus
  • Individual and overall report on-line.

35
Topic 362
  • Title: Human smuggling
  • Description: Identify incidents of human
    smuggling
  • 39 relevant: FT (3), FBIS (17), LA (19)

            city     cmu      sabir    waterloo
Prec@5      0.600    0.400    0.000    1.000
Prec@10     0.300    0.300    0.200    0.800
Prec@20     0.250    0.250    0.200    0.450
rel@1000    22       31       28       33
MAP         0.1114   0.1333   0.0785   0.3200
36
Topic 362
city   cityR   cmu    sabir   waterloo
FB1    FB10    FT4    FB16    FT5
FB2    FT1     FB15   FB21    FB19
FB3    LA2     LA4    FB22    LA6
FB4    FB11    FB16   FB23    FB28
FB5    FT2     FB17   FB20    FB4
FB6    FB12    FB18   FB24    LA7
FB7    FT3     FT5    FB25    FB25
FB8    FB13    FB19   LA6     FB11
FB9    FB14    FB20   FB26    LA8
LA1    LA3     LA5    FB27    LA9
37
Issues with 362
  • Most documents dealt with smuggling but missed
    the human concept
  • City's title-only run worked OK, but no expansion
  • CMU expansion: smuggle (0.14), incident (0.13),
    identify (0.13), human (0.13)
  • Sabir expansion: smuggl (0.84), incid (0.29),
    identif (0.19), human (0.19)
  • Waterloo's SSR and passages worked well
  • Other important terms: aliens, illegal
    emigrants/immigrants, ...

38
Topic 435
  • Title: curbing population growth
  • Description: What measures have been taken
    worldwide and what countries have been effective
    in curbing population growth?
  • 117 relevant: FT (25), FBIS (81), LA (1)

            city     cmu      sabir    waterloo
Prec@5      0.200    0.200    0.400    0.200
Prec@10     0.200    0.200    0.300    0.300
Prec@20     0.200    0.100    0.400    0.450
rel@1000    81       34       83       67
MAP         0.0793   0.0307   0.2565   0.1124
39
Issues with 435
  • Use of phrases was important here
  • Sabir was the only group using phrases
  • City's use of title only approximated this;
    note that expansion was not important
  • Waterloo's SSR also approximated this, but why
    did they get so few relevant by 1000?

40
Topic 436
  • Title: railway accidents
  • Description: What are the causes of railway
    accidents throughout the world?
  • 180 relevant: FT (49), FR (1), FBIS (5), LA (125)

            city     cmu      sabir    waterloo
Prec@5      0.000    0.600    0.600    0.800
Prec@10     0.200    0.600    0.300    0.700
Prec@20     0.250    0.450    0.250    0.750
rel@1000    34       48       36       77
MAP         0.0356   0.0804   0.0220   0.1748
41
Issues with 436
  • Query expansion is critical here, but tricky to
    pick correct expansion terms
  • city did no expansion; the title was not helpful
  • cmu and sabir did good expansion, but with the full
    documents
  • waterloo's passage-level expansion was good
  • Most relevant documents not retrieved
  • some very short relevant documents (LA)
  • 55 relevant documents contain no query
    keywords

42
CLEF 2008 Experiments (Paul McNamee)
           words    stems    morf     lcn4     lcn5     4-grams  5-grams
BG         0.2164   --       0.2703   0.2822   0.2442   0.3105   0.2820
CS         0.2270   --       0.3215   0.2567   0.2477   0.3294   0.3223
DE         0.3303   0.3695   0.3994   0.3464   0.3522   0.4098   0.4201
EN         0.4060   0.4373   0.4018   0.4176   0.4175   0.3990   0.4152
ES         0.4396   0.4846   0.4451   0.4485   0.4517   0.4597   0.4609
FI         0.3406   0.4296   0.4018   0.3995   0.4033   0.4989   0.5078
FR         0.3638   0.4019   0.3680   0.3882   0.3834   0.3844   0.3930
HU         0.1520   --       0.2327   0.2274   0.2215   0.3192   0.3061
IT         0.3749   0.4178   0.3474   0.3741   0.3673   0.3738   0.3997
NL         0.3813   0.4003   0.4053   0.3836   0.3846   0.4219   0.4243
PT         0.3162   --       0.3287   0.3418   0.3347   0.3358   0.3524
RU         0.2671   --       0.3307   0.2875   0.3053   0.3406   0.3330
SV         0.3387   0.3756   0.3738   0.3638   0.3467   0.4236   0.4271
Average*   0.3195   --       0.3559   0.3475   0.3431   0.3851   0.3880
Average**  0.3719   0.4146   0.3928   0.3902   0.3883   0.4214   0.4310
(* over all 13 languages; ** over the 8 languages with stemmer results)
43
Thoughts to take home
  • You have spent months of time on software and
    runs; now make that effort pay off
  • Analysis is more than statistical tables
  • Failure/Success analysis: look at topics where
    your method failed or worked well
  • Dig REALLY deep to understand WHY something
    worked or didn't work
  • Think about generalization of what you have
    learned
  • Then PUBLISH

44
What needs to be in that paper
  • Basic layout of your FIRE experiment
  • Related work: where did you get your ideas, why
    did you pick these techniques, and what resources
    did you use?
  • What happened when you applied these to a new
    language? What kinds of language-specific issues
    did you find?
  • What worked (and why) and what did not work (and
    why)