Title: Some thoughts on failure analysis (success analysis) for CLIR
1. Some thoughts on failure analysis (success analysis) for CLIR
- Donna Harman
- Scientist Emeritus
- National Institute of Standards and Technology
2. Welcome to the family!!
- TREC
- FIRE
3. Congratulations!!
- To the FIRE organizers who actually made this happen (in an incredibly short amount of time)
- To the participants who got their systems running, using (probably) insufficient resources, and FINISHED!!
4. Now what??
- Do some success/failure analysis, looking more deeply at what worked (and didn't work)
- Write papers so that others can learn from your successes and failures
- Come up with plans for the next FIRE!!
5. TREC-1 (1992)
- Hardware
  - Most machines ran at 75 MHz
  - Most machines had 32 MB of memory
  - A 2 GB disk drive cost $5,000
- Software
  - IR systems previously worked on CACM and other small collections
  - This means 2 or 3 thousand documents
  - And those documents were abstracts
6. TREC Ad Hoc Task
- 50 topics
7. (No transcript: image slide)
8. Some TREC-1 participants and methods
- Carnegie Mellon University (Evans, CLARIT system)
- City University, London (Robertson, OKAPI)
- Cornell University (Salton/Buckley, SMART)
- Universitaet Dortmund (Fuhr, SMART)
- Siemens Corporate Research, Inc. (Voorhees, SMART)
- New York University (Strzalkowski, NLP methods)
- Queens College, CUNY (Kwok, PIRCS, spreading activation)
- RMIT (Moffat, Wilkinson, Zobel, compression study)
- University of California, Berkeley (Cooper/Gey, logistic regression)
- University of Massachusetts, Amherst (Croft, inference network)
- University of Pittsburgh (Korfhage, genetic algorithms)
- VPISU (Fox, combining multiple manual searches)
- Bellcore (Dumais, LSI)
- ConQuest Software, Inc. (Nelson)
- GE R&D Center (Jacobs/Rau, Boolean approximation)
- TRW Systems Development Division (Mettler, hardware array processor)
9. What SMART did in TREC-1
- Two official runs testing single-term indexing vs. single terms plus two-term statistical phrases
- Many other runs investigating the effects of standard procedures
  - Which stemmer to use
  - How large a stopword list
  - Different simple weighting schemes
10. After TREC-1
- Community had more confidence in the evaluation (TREC was unknown before)
- Training data available
- Major changes to algorithms for most systems to cope with the wide variation in documents, and in particular for the very long documents
11. Ad Hoc Technologies
12. Pseudo-relevance feedback
- Pseudo-relevance feedback pretends that the top X documents are relevant and then uses these to add expansion terms and/or to reweight the original query terms
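The mechanism can be sketched in a few lines. This is a minimal frequency-based sketch, not any particular system's method; real systems use Rocchio or probabilistic term weighting, and the token-list document representation is a simplifying assumption.

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, top_x=10, n_expand=20):
    """Pretend the top X documents are relevant and mine them for
    expansion terms (a sketch: real systems also reweight the
    original query terms rather than just appending new ones).

    query_terms : list of query term strings
    ranked_docs : list of token lists, best-ranked document first
    """
    counts = Counter()
    for doc in ranked_docs[:top_x]:
        counts.update(doc)
    # Add the most frequent terms from the assumed-relevant set,
    # skipping terms already present in the query.
    expansion = [t for t, _ in counts.most_common()
                 if t not in query_terms][:n_expand]
    return query_terms + expansion
```

For example, expanding a two-term query with the two best terms from the top two documents keeps the original terms first and appends the new ones.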
13. Query expansion (TREC-7)

Organization    Expansion and Feedback   Top docs used / terms added   Comments
OKAPI           Probabilistic            10/30
AT&T            Rocchio                  10/20                         5 P enrichment
INQUERY         LCA                      30P/50                        Reranking using title
Cornell/SabIR   Rocchio                  30/25                         Clustering, reranking
RMIT/CSIRO      Rocchio                  10/40                         5P passages
BBN             HMM-based                6/unknown                     Differential weighting of topic parts
Twenty-one      Rocchio                  3/200
CUNY            LCA                      200P/unknown
14. The TREC Tracks
- Blog, Spam: personal documents
- Legal, Genome: retrieval in a domain
- Novelty, QA: answers, not docs
- Enterprise, Terabyte, Web, VLC: Web searching, size
- Video, Speech, OCR: beyond text
- X→{X,Y,Z}, Chinese, Spanish: beyond just English
- Interactive, HARD: human-in-the-loop
- Filtering, Routing: streamed text
- Ad Hoc, Robust: static text
15. TREC Spanish and Chinese
- Initial approaches to Spanish (1994) used methods for English but with new stopword lists and new stemmers
- Initial approaches to Chinese (1996) worked with character bi-grams in place of words, and sometimes used stoplists (many of the groups had no access to speakers of Chinese)
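Character n-gram indexing of this kind needs no segmenter at all. A minimal sketch (dropping whitespace before sliding the window is an illustrative choice, not a fixed rule):

```python
def char_ngrams(text, n=2):
    """Index character n-grams instead of words: bi-grams (n=2) for
    unsegmented Chinese, or 4/5-grams for European languages.
    Whitespace is removed before the window slides over the text.
    """
    s = "".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```

A four-character Chinese string yields three overlapping bi-grams; the same function with n=4 or n=5 produces the n-gram indexing terms used for European languages.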
16. CLIR for English, French, and German (1996)
- Initial collection was Swiss newswire in three languages, plus the AP newswire from the TREC English collection
- Initial approaches for monolingual work were stemmers and stoplists for French and German, and the use of n-grams
- Initial use of machine-readable bilingual dictionaries for translation of queries
17. NTCIR-1 (1999)
- 339,483 documents in Japanese and English
- 23 groups, of which 17 did monolingual Japanese and 10 groups did CLIR
- Initial approaches worked with known methods for English, but had to deal with the issues of segmenting Japanese or working with bi-grams or n-grams
18. CLEF 2000
- Co-operative activity across five European countries
- Multilingual, bilingual, and monolingual tasks; 40 topics in 8 European languages
- 20 groups, over half working in the multilingual task; the others were groups new to IR who worked monolingually in their own language
- Many different kinds of resources used for the CLIR part of the task
19. Savoy's web page
20. So you finished--now what??
- Analyze the results, otherwise NOTHING will have been learned
- Do some success/failure analysis, looking more deeply at what worked (and didn't work)
- Try to understand WHY something worked or did not work!!
21. Macro-analysis
- Bugs in system: if your results are seriously worse than others, check for bugs
- Effects of document length: look at the ranking of your documents with respect to their length
- Effects of topic lengths
- Effects of different tokenizers/stemmers
- Baseline monolingual results vs. CLIR results: both parts should be analyzed separately
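The document-length check can be automated as a first pass. The ratio statistic below is a hypothetical diagnostic for illustration, not a standard measure:

```python
def length_bias(ranked_doc_ids, doc_lengths, top_k=100):
    """Macro-level check for document-length effects: compare the
    average length of the top-ranked documents against the
    collection average. A ratio well above 1 suggests the system is
    favoring long documents; well below 1, short ones.

    ranked_doc_ids : doc ids in rank order for one run
    doc_lengths    : dict mapping doc id -> length in tokens
    """
    top = [doc_lengths[d] for d in ranked_doc_ids[:top_k]]
    coll_avg = sum(doc_lengths.values()) / len(doc_lengths)
    return (sum(top) / len(top)) / coll_avg
```

Running this per topic, and plotting it against per-topic average precision, is one way to see whether length normalization problems line up with the failing topics.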
22. CLEF 2008 Experiments (Paul McNamee)
words stems morf lcn4 lcn5 4-grams 5-grams
BG 0.2164 0.2703 0.2822 0.2442 0.3105 0.2820
CS 0.2270 0.3215 0.2567 0.2477 0.3294 0.3223
DE 0.3303 0.3695 0.3994 0.3464 0.3522 0.4098 0.4201
EN 0.4060 0.4373 0.4018 0.4176 0.4175 0.3990 0.4152
ES 0.4396 0.4846 0.4451 0.4485 0.4517 0.4597 0.4609
FI 0.3406 0.4296 0.4018 0.3995 0.4033 0.4989 0.5078
FR 0.3638 0.4019 0.3680 0.3882 0.3834 0.3844 0.3930
HU 0.1520 0.2327 0.2274 0.2215 0.3192 0.3061
IT 0.3749 0.4178 0.3474 0.3741 0.3673 0.3738 0.3997
NL 0.3813 0.4003 0.4053 0.3836 0.3846 0.4219 0.4243
PT 0.3162 0.3287 0.3418 0.3347 0.3358 0.3524
RU 0.2671 0.3307 0.2875 0.3053 0.3406 0.3330
SV 0.3387 0.3756 0.3738 0.3638 0.3467 0.4236 0.4271
Average 0.3195 0.3559 0.3475 0.3431 0.3851 0.3880
0.3719 0.4146 0.3928 0.3902 0.3883 0.4214 0.4310
23. X to EFGI Results
24. Stemming
- Performance increases and decreases on a per-topic basis (out of 225 Cranfield topics)

Stemmer   topics up @P10   topics down @P10   topics up @P30   topics down @P30
plurals   34               32                 39               33
Porter    49               37                 49               31
Lovins    51               44                 63               33
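The "plurals" row refers to minimal plural-only stemming. A sketch in that spirit (these three rules only approximate the published S-stemmer, which has additional exceptions):

```python
def s_stem(word):
    """Minimal plural stemmer: strip common English plural endings
    and leave everything else alone. An approximation of the
    S-stemmer idea, not the exact published rule set.
    """
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # queries -> query
    if word.endswith("es") and len(word) > 3:
        return word[:-2]            # classes -> class
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # accidents -> accident
    return word                     # glass stays glass
```

Because it conflates so little, a plural-only stemmer hurts fewer topics than Porter or Lovins, but it also helps fewer, which is exactly the trade-off the table shows.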
25. Now dig into a per-topic analysis
- This is a lot of work, but it is really the only way to understand what is happening
26. Average Precision per Topic
27. Average Precision vs. Number Relevant
28. Micro-analysis, step 1
- Select specific topics to investigate
- Look at results on a per-topic basis with respect to the median of all the groups; pick a small number of topics (10?) that did much worse than the median; these are the initial set to explore
- Optionally pick a similar set that did much BETTER than the median to see where your successes are
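Step 1 is easy to script once your per-topic scores and the all-groups medians are in hand; the dict-based layout here is an assumption for illustration:

```python
def pick_failure_topics(my_ap, median_ap, k=10):
    """Rank topics by how far your per-topic average precision falls
    below the all-groups median, and return the k worst: the initial
    set to explore. Negate the deltas (or reverse the sort) to get
    the success set instead.

    my_ap, median_ap : dicts mapping topic id -> average precision
    """
    deltas = {t: my_ap[t] - median_ap[t] for t in my_ap}
    return sorted(deltas, key=deltas.get)[:k]
```

The same function, applied to the negated deltas, gives the optional "much BETTER than the median" set.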
29. Micro-analysis, step 2
- Now for each topic, pick a set of documents to analyze
- Look at top X documents (around 20)
  - Look at the non-relevant ones (failure analysis)
  - Optionally, look at the relevant ones (success analysis)
- Also look at the relevant documents that were NOT retrieved in the top Y (say 100) documents
30. Micro-analysis, step 3
- For the relevant documents that were NOT retrieved in the top Y set, analyze for each document why it was not retrieved: what query terms were not in the document, very short or long document, etc.
- Do something similar for the top X non-relevant documents: why were they ranked highly?
- Develop a general hypothesis for this topic as to what the problems were
- Now try to generalize across topics
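The per-document checks in step 3 can be partly mechanized before the manual reading starts. The 50-token "very short" cutoff below is purely illustrative:

```python
def missing_terms_report(query_terms, doc_tokens):
    """For one relevant-but-unretrieved document, report which query
    terms are absent and whether the document is unusually short --
    the two mechanical causes to rule out before reading the text.
    """
    present = set(doc_tokens)
    missing = [t for t in query_terms if t not in present]
    return {
        "missing_terms": missing,
        "no_query_keywords": len(missing) == len(query_terms),
        "very_short": len(doc_tokens) < 50,  # illustrative cutoff
    }
```

Documents flagged with no_query_keywords (like the 55 such documents in topic 436 later in this talk) can only be reached through expansion, which narrows the hypothesis immediately.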
31. Possible Monolingual issues
- Tokenization and stemmer problems
- Document length normalization problems
- Abbreviation/common-word problems
- Term weighting problems, such as:
  - Where are the global weights (IDF, etc.) coming from??
- Term expansion problems: generally not enough expansion (low recall)
32. Possible CLIR Problems
- Bilingual dictionary too small or missing too many critical words (names, etc.)
- Multiple translations in the dictionary leading to bad precision; particularly important when using term expansion techniques
- Specific issues with cross-language synonyms, acronyms, etc.; need a better technique for acquiring these
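A sketch of dictionary-based query translation makes the first two failure modes concrete. The dictionary shape (term to list of translations) and the max_senses cap are illustrative assumptions, not a specific system's design:

```python
def translate_query(query_terms, bilingual_dict, max_senses=2):
    """Dictionary-based query translation with its two classic
    failure modes visible: terms missing from a too-small dictionary
    (names, etc.) are passed through untranslated, and each term
    contributes at most max_senses translations, since taking every
    listed sense is a well-known source of bad precision.
    """
    translated, untranslated = [], []
    for term in query_terms:
        senses = bilingual_dict.get(term)
        if not senses:
            untranslated.append(term)
            translated.append(term)   # pass the source term through
        else:
            translated.extend(senses[:max_senses])
    return translated, untranslated
```

Reporting the untranslated list per topic is a quick way to see whether dictionary coverage, rather than ranking, explains a CLIR failure.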
33. Reliable Information Access Workshop (RIA), 2003
- Goals: understand/control variability
- Participating systems: Clairvoyance, Lemur (CMU, UMass), MultiText, OKAPI, SMART (2 versions)
- Methodology
  - Controlled experiments in pseudo-relevance feedback across 7 systems
  - Massive, cooperative failure analysis
- http://ir.nist.gov/ria
34. RIA Failure analysis
- Chose 44 "failure" topics from 150 old TREC topics
  - Mean Average Precision < average
  - Also picking most variance across systems
- Used results from the 6 systems' standard runs
- For each topic, people spent 45-60 minutes looking at results from their assigned system
- Short group discussion to come to consensus
- Individual and overall reports on-line
35. Topic 362
- Title: Human smuggling
- Description: Identify incidents of human smuggling
- 39 relevant: FT (3), FBIS (17), LA (19)

            city     cmu      sabir    waterloo
Prec@5      0.600    0.400    0.000    1.000
Prec@10     0.300    0.300    0.200    0.800
Prec@20     0.250    0.250    0.200    0.450
rel@1000    22       31       28       33
MAP         0.1114   0.1333   0.0785   0.3200
36. Topic 362

city   cityR   cmu    sabir   waterloo
FB1    FB10    FT4    FB16    FT5
FB2    FT1     FB15   FB21    FB19
FB3    LA2     LA4    FB22    LA6
FB4    FB11    FB16   FB23    FB28
FB5    FT2     FB17   FB20    FB4
FB6    FB12    FB18   FB24    LA7
FB7    FT3     FT5    FB25    FB25
FB8    FB13    FB19   LA6     FB11
FB9    FB14    FB20   FB26    LA8
LA1    LA3     LA5    FB27    LA9
37. Issues with 362
- Most documents dealt with smuggling but missed the human concept
- City's title-only run worked OK, but no expansion
- CMU expansion: smuggle (0.14), incident (0.13), identify (0.13), human (0.13)
- Sabir expansion: smuggl (0.84), incid (0.29), identif (0.19), human (0.19)
- Waterloo SSR and passages worked well
- Other important terms: aliens, illegal emigrants/immigrants, ...
38. Topic 435
- Title: curbing population growth
- Description: What measures have been taken worldwide and what countries have been effective in curbing population growth?
- 117 relevant: FT (25), FBIS (81), LA (1)

            city     cmu      sabir    waterloo
Prec@5      0.200    0.200    0.400    0.200
Prec@10     0.200    0.200    0.300    0.300
Prec@20     0.200    0.100    0.400    0.450
rel@1000    81       34       83       67
MAP         0.0793   0.0307   0.2565   0.1124
39. Issues with 435
- Use of phrases important here
- Sabir the only group using phrases
- City's use of title only approximated this; note that expansion was not important
- Waterloo's SSR also approximated this, but why did they get so few relevant by 1000?
40. Topic 436
- Title: railway accidents
- Description: What are the causes of railway accidents throughout the world?
- 180 relevant: FT (49), FR (1), FBIS (5), LA (125)

            city     cmu      sabir    waterloo
Prec@5      0.000    0.600    0.600    0.800
Prec@10     0.200    0.600    0.300    0.700
Prec@20     0.250    0.450    0.250    0.750
rel@1000    34       48       36       77
MAP         0.0356   0.0804   0.0220   0.1748
41. Issues with 436
- Query expansion is critical here, but tricky to pick correct expansion terms
  - City did no expansion; title was not helpful
  - CMU and Sabir did good expansion, but with the full documents
  - Waterloo's passage-level expansion was good
- Most relevant documents not retrieved
  - Some very short relevant documents (LA)
  - 55 relevant documents contain no query keywords
42. CLEF 2008 Experiments (Paul McNamee)
(Table repeated from slide 22.)
43. Thoughts to take home
- You have spent months of time on software and runs; now make that effort pay off
- Analysis is more than statistical tables
- Failure/success analysis: look at topics where your method failed or worked well
- Dig REALLY deep to understand WHY something worked or didn't work
- Think about generalization of what you have learned
- Then PUBLISH
44. What needs to be in that paper
- Basic layout of your FIRE experiment
- Related work: where did you get your ideas, why did you pick these techniques, and what resources did you use
- What happened when you applied these to a new language: what kinds of language-specific issues did you find
- What worked (and why) and what did not work (and why)