Title: Some thoughts on failure analysis (success analysis) for CLIR
1. Some thoughts on failure analysis (success analysis) for CLIR
- Donna Harman
- Scientist Emeritus
- National Institute of Standards and Technology
2. Welcome to the family!!
- TREC
- FIRE
3. Congratulations!!
- To the FIRE organizers who actually made this happen (in an incredibly short amount of time)
- To the participants who got their systems running, using (probably) insufficient resources, and FINISHED!!
4. Now what??
- Do some success/failure analysis, looking more deeply at what worked (and didn't work)
- Write papers so that others can learn from your successes and failures
- Come up with plans for the next FIRE!!
5. TREC-1 (1992)
- Hardware
  - Most machines ran at 75 MHz
  - Most machines had 32 MB of memory
  - A 2 GB disk drive cost $5,000
- Software
  - IR systems previously worked on CACM and other small collections
  - This means 2 or 3 thousand documents
  - And those documents were abstracts
6. TREC Ad Hoc Task
- 50 topics
7. (No transcript: image slide)
8. Some TREC-1 participants and methods
- Carnegie Mellon University (Evans, CLARIT system)
- City University, London (Robertson, OKAPI)
- Cornell University (Salton/Buckley, SMART)
- Universitaet Dortmund (Fuhr, SMART)
- Siemens Corporate Research, Inc. (Voorhees, SMART)
- New York University (Strzalkowski, NLP methods)
- Queens College, CUNY (Kwok, PIRCS, spreading activation)
- RMIT (Moffat, Wilkinson, Zobel, compression study)
- University of California, Berkeley (Cooper/Gey, logistic regression)
- University of Massachusetts, Amherst (Croft, inference network)
- University of Pittsburgh (Korfhage, genetic algorithms)
- VPISU (Fox, combining multiple manual searches)
- Bellcore (Dumais, LSI)
- ConQuest Software, Inc. (Nelson)
- GE R&D Center (Jacobs/Rau, Boolean approximation)
- TRW Systems Development Division (Mettler, hardware array processor)
9. What SMART did in TREC-1
- Two official runs testing single-term indexing vs. single terms plus two-term statistical phrases
- Many other runs investigating the effects of standard procedures
  - Which stemmer to use
  - How large a stopword list
  - Different simple weighting schemes
10. After TREC-1
- Community had more confidence in the evaluation (TREC was unknown before)
- Training data available
- Major changes to algorithms for most systems to cope with the wide variation in documents, and in particular for the very long documents
11. Ad Hoc Technologies
12. Pseudo-relevance feedback
- Pseudo-relevance feedback pretends that the top X documents are relevant and then uses these to add expansion terms and/or to reweight the original query terms
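The mechanism can be sketched in a few lines. This is a minimal frequency-based sketch, not any particular system's method; real systems use Rocchio or probabilistic term weighting, and the token-list document representation is a simplifying assumption.

```python
from collections import Counter

def pseudo_relevance_feedback(query_terms, ranked_docs, top_x=10, n_expand=20):
    """Pretend the top X documents are relevant and mine them for
    expansion terms (a sketch: real systems also reweight the
    original query terms rather than just appending new ones).

    query_terms : list of query term strings
    ranked_docs : list of token lists, best-ranked document first
    """
    counts = Counter()
    for doc in ranked_docs[:top_x]:
        counts.update(doc)
    # Add the most frequent terms from the assumed-relevant set,
    # skipping terms already present in the query.
    expansion = [t for t, _ in counts.most_common()
                 if t not in query_terms][:n_expand]
    return query_terms + expansion
```

For example, expanding a two-term query with the two best terms from the top two documents keeps the original terms first and appends the new ones.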
13. Query expansion (TREC-7)

Organization    Expansion and Feedback   Top docs used / terms added   Comments
OKAPI           Probabilistic            10/30
AT&T            Rocchio                  10/20                         5 P enrichment
INQUERY         LCA                      30P/50                        Reranking using title
Cornell/SabIR   Rocchio                  30/25                         Clustering, reranking
RMIT/CSIRO      Rocchio                  10/40                         5P passages
BBN             HMM-based                6/unknown                     Differential weighting of topic parts
Twenty-one      Rocchio                  3/200
CUNY            LCA                      200P/unknown
14. The TREC Tracks
- Blog, Spam: personal documents
- Legal, Genome: retrieval in a domain
- Novelty, QA: answers, not docs
- Enterprise, Terabyte, Web, VLC: Web searching, size
- Video, Speech, OCR: beyond text
- X→{X,Y,Z}, Chinese, Spanish: beyond just English
- Interactive, HARD: human-in-the-loop
- Filtering, Routing: streamed text
- Ad Hoc, Robust: static text
15. TREC Spanish and Chinese
- Initial approaches to Spanish (1994) used methods for English but with new stopword lists and new stemmers
- Initial approaches to Chinese (1996) worked with character bi-grams in place of words, and sometimes used stoplists (many of the groups had no access to speakers of Chinese)
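Character n-gram indexing of this kind needs no segmenter at all. A minimal sketch (dropping whitespace before sliding the window is an illustrative choice, not a fixed rule):

```python
def char_ngrams(text, n=2):
    """Index character n-grams instead of words: bi-grams (n=2) for
    unsegmented Chinese, or 4/5-grams for European languages.
    Whitespace is removed before the window slides over the text.
    """
    s = "".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```

A four-character Chinese string yields three overlapping bi-grams; the same function with n=4 or n=5 produces the n-gram indexing terms used for European languages.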
16. CLIR for English, French, and German (1996)
- Initial collection was Swiss newswire in three languages, plus the AP newswire from the TREC English collection
- Initial approaches for monolingual work were stemmers and stoplists for French and German, and the use of n-grams
- Initial use of machine-readable bilingual dictionaries for translation of queries
17. NTCIR-1 (1999)
- 339,483 documents in Japanese and English
- 23 groups, of which 17 did monolingual Japanese and 10 groups did CLIR
- Initial approaches worked with known methods for English, but had to deal with the issues of segmenting Japanese or working with bi-grams or n-grams
18. CLEF 2000
- Co-operative activity across five European countries
- Multilingual, bilingual, and monolingual tasks; 40 topics in 8 European languages
- 20 groups, over half working in the multilingual task; the others were groups new to IR who worked monolingually in their own language
- Many different kinds of resources used for the CLIR part of the task
19. Savoy's web page
20. So you finished--now what??
- Analyze the results, otherwise NOTHING will have been learned
- Do some success/failure analysis, looking more deeply at what worked (and didn't work)
- Try to understand WHY something worked or did not work!!
21. Macro-analysis
- Bugs in system: if your results are seriously worse than others, check for bugs
- Effects of document length: look at the ranking of your documents with respect to their length
- Effects of topic lengths
- Effects of different tokenizers/stemmers
- Baseline monolingual results vs. CLIR results: both parts should be analyzed separately
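The document-length check can be automated as a first pass. The ratio statistic below is a hypothetical diagnostic for illustration, not a standard measure:

```python
def length_bias(ranked_doc_ids, doc_lengths, top_k=100):
    """Macro-level check for document-length effects: compare the
    average length of the top-ranked documents against the
    collection average. A ratio well above 1 suggests the system is
    favoring long documents; well below 1, short ones.

    ranked_doc_ids : doc ids in rank order for one run
    doc_lengths    : dict mapping doc id -> length in tokens
    """
    top = [doc_lengths[d] for d in ranked_doc_ids[:top_k]]
    coll_avg = sum(doc_lengths.values()) / len(doc_lengths)
    return (sum(top) / len(top)) / coll_avg
```

Running this per topic, and plotting it against per-topic average precision, is one way to see whether length normalization problems line up with the failing topics.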
22. CLEF 2008 Experiments (Paul McNamee)
words stems morf lcn4 lcn5 4-grams 5-grams
BG 0.2164 0.2703 0.2822 0.2442 0.3105 0.2820
CS 0.2270 0.3215 0.2567 0.2477 0.3294 0.3223
DE 0.3303 0.3695 0.3994 0.3464 0.3522 0.4098 0.4201
EN 0.4060 0.4373 0.4018 0.4176 0.4175 0.3990 0.4152
ES 0.4396 0.4846 0.4451 0.4485 0.4517 0.4597 0.4609
FI 0.3406 0.4296 0.4018 0.3995 0.4033 0.4989 0.5078
FR 0.3638 0.4019 0.3680 0.3882 0.3834 0.3844 0.3930
HU 0.1520 0.2327 0.2274 0.2215 0.3192 0.3061
IT 0.3749 0.4178 0.3474 0.3741 0.3673 0.3738 0.3997
NL 0.3813 0.4003 0.4053 0.3836 0.3846 0.4219 0.4243
PT 0.3162 0.3287 0.3418 0.3347 0.3358 0.3524
RU 0.2671 0.3307 0.2875 0.3053 0.3406 0.3330
SV 0.3387 0.3756 0.3738 0.3638 0.3467 0.4236 0.4271
Average 0.3195 0.3559 0.3475 0.3431 0.3851 0.3880
0.3719 0.4146 0.3928 0.3902 0.3883 0.4214 0.4310
23. X to EFGI Results
24. Stemming
- Performance increases and decreases on a per-topic basis (out of 225 Cranfield topics)

Stemmer   topics up @P10   topics down @P10   topics up @P30   topics down @P30
plurals   34               32                 39               33
Porter    49               37                 49               31
Lovins    51               44                 63               33
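The "plurals" row refers to minimal plural-only stemming. A sketch in that spirit (these three rules only approximate the published S-stemmer, which has additional exceptions):

```python
def s_stem(word):
    """Minimal plural stemmer: strip common English plural endings
    and leave everything else alone. An approximation of the
    S-stemmer idea, not the exact published rule set.
    """
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # queries -> query
    if word.endswith("es") and len(word) > 3:
        return word[:-2]            # classes -> class
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]            # accidents -> accident
    return word                     # glass stays glass
```

Because it conflates so little, a plural-only stemmer hurts fewer topics than Porter or Lovins, but it also helps fewer, which is exactly the trade-off the table shows.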
25. Now dig into a per-topic analysis
- This is a lot of work, but it is really the only way to understand what is happening
26. Average Precision per Topic
27. Average Precision vs. Number Relevant
28. Micro-analysis, step 1
- Select specific topics to investigate
- Look at results on a per-topic basis with respect to the median of all the groups; pick a small number of topics (10?) that did much worse than the median; these are the initial set to explore
- Optionally pick a similar set that did much BETTER than the median to see where your successes are
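Step 1 is easy to script once your per-topic scores and the all-groups medians are in hand; the dict-based layout here is an assumption for illustration:

```python
def pick_failure_topics(my_ap, median_ap, k=10):
    """Rank topics by how far your per-topic average precision falls
    below the all-groups median, and return the k worst: the initial
    set to explore. Negate the deltas (or reverse the sort) to get
    the success set instead.

    my_ap, median_ap : dicts mapping topic id -> average precision
    """
    deltas = {t: my_ap[t] - median_ap[t] for t in my_ap}
    return sorted(deltas, key=deltas.get)[:k]
```

The same function, applied to the negated deltas, gives the optional "much BETTER than the median" set.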
29. Micro-analysis, step 2
- Now for each topic, pick a set of documents to analyze
- Look at top X documents (around 20)
  - Look at the non-relevant ones (failure analysis)
  - Optionally, look at the relevant ones (success analysis)
- Also look at the relevant documents that were NOT retrieved in the top Y (say 100) documents
30. Micro-analysis, step 3
- For the relevant documents that were NOT retrieved in the top Y set, analyze for each document why it was not retrieved: what query terms were not in the document, very short or long document, etc.
- Do something similar for the top X non-relevant documents: why were they ranked highly?
- Develop a general hypothesis for this topic as to what the problems were
- Now try to generalize across topics
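The per-document checks in step 3 can be partly mechanized before the manual reading starts. The 50-token "very short" cutoff below is purely illustrative:

```python
def missing_terms_report(query_terms, doc_tokens):
    """For one relevant-but-unretrieved document, report which query
    terms are absent and whether the document is unusually short --
    the two mechanical causes to rule out before reading the text.
    """
    present = set(doc_tokens)
    missing = [t for t in query_terms if t not in present]
    return {
        "missing_terms": missing,
        "no_query_keywords": len(missing) == len(query_terms),
        "very_short": len(doc_tokens) < 50,  # illustrative cutoff
    }
```

Documents flagged with no_query_keywords (like the 55 such documents in topic 436 later in this talk) can only be reached through expansion, which narrows the hypothesis immediately.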
31. Possible Monolingual issues
- Tokenization and stemmer problems
- Document length normalization problems
- Abbreviation/common-word problems
- Term weighting problems, such as:
  - Where are the global weights (IDF, etc.) coming from??
- Term expansion problems: generally not enough expansion (low recall)
32. Possible CLIR Problems
- Bilingual dictionary too small or missing too many critical words (names, etc.)
- Multiple translations in the dictionary leading to bad precision; particularly important when using term expansion techniques
- Specific issues with cross-language synonyms, acronyms, etc.; need a better technique for acquiring these
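A sketch of dictionary-based query translation makes the first two failure modes concrete. The dictionary shape (term to list of translations) and the max_senses cap are illustrative assumptions, not a specific system's design:

```python
def translate_query(query_terms, bilingual_dict, max_senses=2):
    """Dictionary-based query translation with its two classic
    failure modes visible: terms missing from a too-small dictionary
    (names, etc.) are passed through untranslated, and each term
    contributes at most max_senses translations, since taking every
    listed sense is a well-known source of bad precision.
    """
    translated, untranslated = [], []
    for term in query_terms:
        senses = bilingual_dict.get(term)
        if not senses:
            untranslated.append(term)
            translated.append(term)   # pass the source term through
        else:
            translated.extend(senses[:max_senses])
    return translated, untranslated
```

Reporting the untranslated list per topic is a quick way to see whether dictionary coverage, rather than ranking, explains a CLIR failure.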
33. Reliable Information Access Workshop (RIA), 2003
- Goals: understand/control variability
- Participating systems: Clairvoyance, Lemur (CMU, UMass), MultiText, OKAPI, SMART (2 versions)
- Methodology
  - Controlled experiments in pseudo-relevance feedback across 7 systems
  - Massive, cooperative failure analysis
- http://ir.nist.gov/ria
34. RIA Failure analysis
- Chose 44 "failure" topics from 150 old TREC topics
  - Mean Average Precision < average
  - Also picking most variance across systems
- Used results from the 6 systems' standard runs
- For each topic, people spent 45-60 minutes looking at results from their assigned system
- Short group discussion to come to consensus
- Individual and overall reports on-line
35. Topic 362
- Title: Human smuggling
- Description: Identify incidents of human smuggling
- 39 relevant: FT (3), FBIS (17), LA (19)

            city     cmu      sabir    waterloo
Prec@5      0.600    0.400    0.000    1.000
Prec@10     0.300    0.300    0.200    0.800
Prec@20     0.250    0.250    0.200    0.450
rel@1000    22       31       28       33
MAP         0.1114   0.1333   0.0785   0.3200
36. Topic 362

city   cityR   cmu    sabir   waterloo
FB1    FB10    FT4    FB16    FT5
FB2    FT1     FB15   FB21    FB19
FB3    LA2     LA4    FB22    LA6
FB4    FB11    FB16   FB23    FB28
FB5    FT2     FB17   FB20    FB4
FB6    FB12    FB18   FB24    LA7
FB7    FT3     FT5    FB25    FB25
FB8    FB13    FB19   LA6     FB11
FB9    FB14    FB20   FB26    LA8
LA1    LA3     LA5    FB27    LA9
37. Issues with 362
- Most documents dealt with smuggling but missed the human concept
- City's title-only run worked OK, but no expansion
- CMU expansion: smuggle (0.14), incident (0.13), identify (0.13), human (0.13)
- Sabir expansion: smuggl (0.84), incid (0.29), identif (0.19), human (0.19)
- Waterloo SSR and passages worked well
- Other important terms: aliens, illegal emigrants/immigrants, ...
38. Topic 435
- Title: curbing population growth
- Description: What measures have been taken worldwide and what countries have been effective in curbing population growth?
- 117 relevant: FT (25), FBIS (81), LA (1)

            city     cmu      sabir    waterloo
Prec@5      0.200    0.200    0.400    0.200
Prec@10     0.200    0.200    0.300    0.300
Prec@20     0.200    0.100    0.400    0.450
rel@1000    81       34       83       67
MAP         0.0793   0.0307   0.2565   0.1124
39. Issues with 435
- Use of phrases important here
- Sabir the only group using phrases
- City's use of title only approximated this; note that expansion was not important
- Waterloo's SSR also approximated this, but why did they get so few relevant by 1000?
40. Topic 436
- Title: railway accidents
- Description: What are the causes of railway accidents throughout the world?
- 180 relevant: FT (49), FR (1), FBIS (5), LA (125)

            city     cmu      sabir    waterloo
Prec@5      0.000    0.600    0.600    0.800
Prec@10     0.200    0.600    0.300    0.700
Prec@20     0.250    0.450    0.250    0.750
rel@1000    34       48       36       77
MAP         0.0356   0.0804   0.0220   0.1748
41. Issues with 436
- Query expansion is critical here, but tricky to pick correct expansion terms
  - City did no expansion; title was not helpful
  - CMU and Sabir did good expansion, but with the full documents
  - Waterloo's passage-level expansion was good
- Most relevant documents not retrieved
  - Some very short relevant documents (LA)
  - 55 relevant documents contain no query keywords
42. CLEF 2008 Experiments (Paul McNamee)
(Table repeated from slide 22.)
43. Thoughts to take home
- You have spent months of time on software and runs; now make that effort pay off
- Analysis is more than statistical tables
- Failure/success analysis: look at topics where your method failed or worked well
- Dig REALLY deep to understand WHY something worked or didn't work
- Think about generalization of what you have learned
- Then PUBLISH
44. What needs to be in that paper
- Basic layout of your FIRE experiment
- Related work: where did you get your ideas, why did you pick these techniques, and what resources did you use
- What happened when you applied these to a new language: what kinds of language-specific issues did you find
- What worked (and why) and what did not work (and why)