Transcript and Presenter's Notes

Title: Indexing and information retrieval


1
Indexing and information retrieval
  • inf384c
  • UT Austin School of Information

2
Introduction to Indexing
  • "The main purpose of indexing and abstracting is
    to construct representations of published items
    in a form suitable for inclusion in some type of
    database." (Lancaster, 1)
  • "The basic assumption is that indexers are able
    to state what a document is about by
    formulating an expression which summarizes the
    content of the document." (Hutchins, 92)

3
Introduction to Indexing
  • Though often quite complex, the basic goal of
    most indexing is to represent the meaning of a
    document (its aboutness) in a compressed form
    that will be amenable to search and retrieval.
  • Traditionally, the notion of a document's
    aboutness is closely related to its topical
    subject matter.

4
Indexing and the IR Process
[Diagram: the indexer interprets doc1 and creates a
surrogate s1, which is stored in the database; the
descriptors are drawn from the indexing vocabulary.]
5
Terms Associated with Indexing
6
The Task of the Indexer
  • Indexing involves two principal steps:
  • Conceptual analysis: interpreting the document to
    ascertain its meaning
  • Translation: converting the document into a set of
    descriptors that the system recognizes

7
Conceptual Analysis
  • In the indexing literature, we see a lot of
    tension about conceptual analysis.
  • But in many cases, people are quite good at
    picking out important ideas from a text, at
    analyzing a document's aboutness.

8
Translation: Assigning Descriptors
  • If we were using a "free text" or "natural
    language" indexing system, then we could just
    write down our descriptors for each document
    after conceptual analysis.
  • But many systems impose more order on the process
    by using a controlled vocabulary.
  • Why do they add this step, this translation?

9
Vocabulary Control: Important Ideas
  • Types of Controlled Vocabularies (not mutually
    exclusive)
  • Subject Headings
  • Thesauri
  • Name Authority Files

10
Authority Control
11
Subject Heading Lists
  • Controlled vocabularies are usually expressed as
    a list of vetted, legitimate terms, frequently
    referred to as subject headings.
  • These may be organized in various ways:
  • Alphabetically
  • Topically
  • In a Thesaurus (via lexical-semantic
    relationships)

12
Thesauri: LCSH Example
  • Cookery

13
Thesauri: LCSH Example
  • Cookery

In this case, "cookery" is our descriptor.
14
Thesauri: LCSH Example
  • Cookery

"Cookery" is a preferred term (PT) for these.
15
Thesauri: LCSH Example
  • Cookery

Broader terms (BTs) expand the search vertically.
16
Thesauri: LCSH Example
  • Cookery

Narrower terms (NTs) focus the scope of the
search.
17
Thesauri: LCSH Example
  • Cookery

Related terms (RTs) expand a search horizontally,
as in the sketch below.
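A thesaurus entry can be modeled as a small data structure
mapping a preferred term to its BTs, NTs, and RTs. Below is a
minimal Python sketch of that idea; the relationship values
listed for "cookery" are hypothetical placeholders, not the
actual LCSH record.

# Minimal sketch: thesaurus relationships and query expansion.
# The BT/NT/RT values for "cookery" are hypothetical, not LCSH.
thesaurus = {
    "cookery": {
        "BT": ["home economics"],             # broader terms: expand vertically
        "NT": ["baking", "outdoor cookery"],  # narrower terms: focus the scope
        "RT": ["food habits", "gastronomy"],  # related terms: expand horizontally
    }
}

def expand(term, relations=("BT", "NT", "RT")):
    """Return the term plus its thesaurus neighbors of the requested kinds."""
    entry = thesaurus.get(term, {})
    expanded = [term]
    for rel in relations:
        expanded.extend(entry.get(rel, []))
    return expanded

print(expand("cookery", relations=("NT",)))        # narrow the search
print(expand("cookery", relations=("BT", "RT")))   # broaden the search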
18
Thesauri: Art and Architecture Thesaurus
  • Most thesauri are domain-specific. That is, they
    control the vocabulary of a particular domain of
    knowledge.
  • N.B. We can consider these to be ontologies,
    representing linguistic knowledge about a
    particular field.
  • Consider the Art and Architecture Thesaurus
    (maintained by the Getty Museum).

19
Some vocabulary review: People are types of
Agents; thus "People" is an NT for "Agent".
20
Some vocabulary review: there are many types of
person within the AAT. Since they appear at the same
level in the hierarchy, terms such as "antiquaries"
and "athletes" are coordinate terms. Likewise,
"acrobats" and "fencers" are coordinate with each
other.
21
Controlled Vocabulary
  • Costs of disambiguation via controlled vocabulary

22
Manual Indexing Overview
  • Two phases: analysis, then translation
  • Vocabulary control vs. free text
  • Sources of evidence?
  • Inter-indexer agreement?

23
Manual Indexing Overview
  • Two phases: analysis, then translation
  • Vocabulary control vs. free text
  • Sources of evidence?
  • Inter-indexer agreement?

Computational approaches to indexing probably
don't solve these problems, but they do change
the equation (maybe for the better).
24
Automatic Indexing
  • Our goal: to derive useful document surrogates
    without undue human intervention.
  • Human attention is expensive and slow.
  • People's time is best spent on doing things
    machines can't do.
  • The leap of faith: we must let the data speak for
    themselves.
  • In place of human intuition, we'll rely on
    rigorous analysis of the document itself (heavily
    empirical).
  • We will import intuition in the form of
    strategies for treating language empirically.

25
The leap of faith
  • (Textual) Documents are composed of words, and
    words convey meaning.
  • Thus we can glean something about meaning by
    analyzing document text.
  • The question becomes: what do we mean by
    "analyzing"?
  • Shallow: efficient, tractable (good enough?)
  • Deep: intensive, computationally feasible? (any
    better than shallow analysis?)

26
Textual Analysis for IR
  • Most IR is based on lexical analysis.
  • i.e. analyzing which words appear in documents
    and which words are most likely to convey topical
    meaning.
  • This is highly language-dependent. We'll assume
    primarily English text.
  • Why are other languages different with respect to
    lexical analysis?
  • Why is it harder (or is it harder) to analyze
    non-textual information (e.g. images)?

27
Inverted index construction
Documents to be indexed.
Friends, Romans, countrymen.
28
Indexer steps
  • Sequence of (Modified token, Document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i'
the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus
hath told you Caesar was ambitious.
29
The Inverted Index
Doc 1: I did enact Julius Caesar: I was killed i'
the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus
hath told you Caesar was ambitious.
30
The Inverted Index
Query: julius AND caesar. Query: caesar AND NOT
julius.
Doc 1: I did enact Julius Caesar: I was killed i'
the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus
hath told you Caesar was ambitious.
31
The Inverted Index
Query: julius AND caesar. Query: caesar AND NOT
julius.
Why is it preferable to use the inverted index?
Why not just search the documents directly?
Doc 1: I did enact Julius Caesar: I was killed i'
the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus
hath told you Caesar was ambitious.
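To make the mechanics concrete, here is a minimal Python
sketch (not the exact indexer pipeline from the slides) that
builds an inverted index over these two documents and answers
the Boolean queries above with set operations on the postings.

import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

# Build the inverted index: term -> set of document IDs (postings).
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z']+", text.lower()):
        index[token].add(doc_id)

# Boolean queries become set operations on postings.
print(index["julius"] & index["caesar"])   # julius AND caesar -> {1}
print(index["caesar"] - index["julius"])   # caesar AND NOT julius -> {2}

Intersecting small postings sets is far cheaper than scanning
every document at query time, which is the usual answer to the
question above.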
32
Ranking Documents for Retrieval
  • Given a database of indexed documents, what
    should we do when a searcher issues a query?
  • Most modern IR systems rank documents in
    decreasing order of estimated relevance to the
    query.
  • How might we rank documents against a query?

33
Naïve document ranking
  • Given a query Q containing words q1, q2, ..., qm,
    we could simply count how many of the query words
    are also in the document.

34
Naïve document ranking
r1(q, d1) = 2; r1(q, d2) = 1
  • Given a query Q containing words q1, q2, ..., qm,
    we could simply count how many of the query words
    are also in the document.

35
Can we be less naïve?
  • Instead of counting how many query words a
    document contains, we could sum over the
    frequency of each query word in each document.

36
Can we be less naïve?
r2(q, d1) = 1 + 1 = 2; r2(q, d2) = 1;
r2(q, d3) = 3 + 1 = 4
  • Instead of counting how many query words a
    document contains, we could sum over the
    frequency of each query word in each document.
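A minimal Python sketch of these two naive ranking functions,
assuming simple whitespace tokenization (the query and document
below are illustrative, not the ones behind the slide's numbers):

def r1(query_terms, doc_tokens):
    """Count how many distinct query words appear in the document."""
    return sum(1 for q in set(query_terms) if q in doc_tokens)

def r2(query_terms, doc_tokens):
    """Sum the frequency of each query word in the document."""
    return sum(doc_tokens.count(q) for q in query_terms)

query = ["caesar", "brutus"]
doc = "brutus killed caesar and brutus fled".split()
print(r1(query, doc))  # 2 distinct query words present
print(r2(query, doc))  # 1 + 2 = 3 total occurrences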

37
Can we be still less naïve?
  • This function assumes that all words are equally
    important. Why is this unrealistic?
  • What might we do to mitigate this error?

38
Term weighting by TF-IDF (term freq. - inverse
doc. freq.)
  • Intuition: when assessing the relevance of a
    document, we should take into account how many
    times it contains each term (term frequency), but
    also how common that term is in general.
  • Reward terms that occur often in the doc. but
    rarely in the collection.

39
Inverse document frequency (IDF)
  • Intuition: we should give more weight to terms
    that are rare in the corpus.
  • If we have N documents, and term j occurs in nj
    of them, we are concerned with N / nj.
  • If nj is small, the quantity is large. Vice
    versa for common terms.

40
Term Weighting
  • Consider the following terms. If these terms
    occurred in a document, how indicative of that
    doc's aboutness do you suspect each term would
    be?

41
Term Weighting
  • Consider the following terms. If these terms
    occurred in a document, how indicative of that
    doc's aboutness do you suspect each term would
    be?

N = 1000
42
Term Weighting
  • Consider the following terms. If these terms
    occurred in a document, how indicative of that
    doc's aboutness do you suspect each term would
    be?

pseudo-IDF (N = 1000)
43
Inverse document frequency (IDF)
  • Given that a word w occurs in dw documents, and
    that our collection consists of N documents total:

44
Inverse document frequency (IDF)
45
Inverse document frequency (IDF)
N.B. stopwords: if dw ≈ N, the IDF weight approaches
its minimum, so very common words contribute little.
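As a concrete illustration, here is a minimal Python sketch
that assumes the common formulation idf(w) = log(N / dw); the
exact variant on the slide (base, smoothing) may differ.

import math

def idf(term, docs):
    """Inverse document frequency, assumed here as log(N / d_w).
    Stopword-like terms (d_w close to N) get weights near zero."""
    n_docs = len(docs)
    d_w = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / d_w)

docs = [
    {"the", "brutus", "caesar"},
    {"the", "caesar", "ambitious"},
    {"the", "noble", "brutus"},
    {"the", "forum"},
]
print(idf("the", docs))    # 0.0 -- appears in every doc (stopword-like)
print(idf("forum", docs))  # log(4/1) ~ 1.39 -- rare, highly weighted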
46
Document ranking by TF-IDF
  • Sum over the frequency of each query word in the
    document times the word's inverse document
    frequency.

47
Document ranking by TF-IDF
48
Document ranking by TF-IDF
r3(q, d1) = 1 × 4.2 + 1 × 4.8 = 9
49
Document ranking by TF-IDF
r3(q, d1) = 1 × 4.2 + 1 × 4.8 = 9
r3(q, d2) = 1 × 4.2 + 2 × 2.2 = 8.6
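A minimal Python sketch of this r3 ranking function. The IDF
values 4.2, 4.8, and 2.2 on the slide come from a collection we
do not see here, so the example below recomputes its own IDFs
over a small toy collection.

import math
from collections import Counter

def r3(query_terms, doc_tokens, collection):
    """Sum over query terms of (term frequency in doc) * idf(term)."""
    n_docs = len(collection)
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        d_w = sum(1 for d in collection if term in d)
        if d_w == 0:
            continue  # a term absent from the collection contributes nothing
        score += tf[term] * math.log(n_docs / d_w)
    return score

collection = [
    "brutus killed caesar".split(),
    "caesar was ambitious said brutus".split(),
    "the capitol stands in rome".split(),
]
query = ["caesar", "capitol"]
for i, doc in enumerate(collection, 1):
    print(f"doc{i}: {r3(query, doc, collection):.2f}")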
50
IR Evaluation: Experimentation and Research
Methods
51
Evaluating IR Systems
  • Assuming we've made some design decisions and
    selected one or more IR models, we would like to
    know how well our decisions are serving us.
  • IR evaluation is concerned with pursuing the
    question: given a particular retrieval situation,
    how well is my IR system working?

52
Evaluating IR Systems
  • In IR research, evaluation usually analyzes at
    least one of these facets of a retrieval system:
  • Effectiveness
  • Efficiency
  • Cost
  • Speed

53
IR Effectiveness
  • Information retrieval systems attempt to discover
    documents in a collection that are relevant to a
    user's stated information need.

54
IR Effectiveness
  • Information retrieval systems attempt to discover
    documents in a collection that are relevant to a
    user's stated information need.

To say that one IR system is more effective than
another, maybe we're saying that it does a better
job of discriminating between relevant and
non-relevant documents? Regardless of the thorns
that lie in this definition, this is commonly the
assumption we work under.
55
IR Effectiveness
  • Information retrieval systems attempt to discover
    documents in a collection that are relevant to a
    user's stated information need.

Perhaps relevance is the most troublesome notion in
the world of information retrieval research. It is
fundamental to IR evaluation, but it is nearly
impossible to define (or operationalize) in a way
that many researchers can agree upon.
56
The Cranfield Paradigm
  • A drastically simplified notion of relevance has
    its roots in a series of experiments undertaken
    in the 1950s.
  • Cyril Cleverdon (a researcher in Cranfield,
    England) wanted to test the effectiveness of
    various methods of indexing
  • e.g. manual vs. automatic
  • To perform these tests, he constructed an
    elaborate (and still widely used) apparatus for
    experimentation.

57
The Cranfield Paradigm
  • The idea behind Cleverdon's design:
  • Compile a corpus of documents
  • Ask potential readers of these documents to
    supply queries that they would like answered by
    consulting the documents
  • In a laborious (and well-documented) process,
    have subject experts judge each document with
    respect to its relevance to each query

58
IR Test Collections
  • These components comprise the basic elements of a
    so-called test collection for IR
    experimentation.
  • A corpus of documents
  • A set of queries
  • A set of "qrels": lists of all documents that are
    relevant to each query

59
Measuring Retrieval Effectiveness
For a particular query q
[Diagram: the N docs in the collection are partitioned into
regions A (retrieved and relevant), B (retrieved but not
relevant), C (relevant but not retrieved), and D (neither);
R = total # of docs in the collection relevant to query q.]
60
Measuring Retrieval Effectiveness
For a particular query q
precision = A / (A + B)
recall = A / (A + C)
61
Measuring Retrieval Effectiveness
For a particular query q
precision = A / (A + B)
recall = A / (A + C)
In other words: precision is the percent of
retrieved docs that are relevant. Recall is the
percent of relevant docs that have been retrieved.
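A minimal Python sketch of these two measures, treating the
retrieved and relevant documents as sets; the numbers used here
mirror the Model 1 example on the following slides (15 retrieved,
20 relevant, 10 in common).

def precision_recall(retrieved, relevant):
    """precision = A / (A + B), recall = A / (A + C),
    where A = retrieved & relevant, A + B = all retrieved,
    and A + C = all relevant."""
    a = len(retrieved & relevant)
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = set(range(15))    # 15 docs retrieved
relevant = set(range(5, 25))  # 20 docs relevant, 10 of them retrieved
print(precision_recall(retrieved, relevant))  # (0.666..., 0.5)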
62
Measuring Retrieval Effectiveness
Model 1
precision = ?? recall = ??
63
Measuring Retrieval Effectiveness
Model 1
precision = 10/15 = 2/3; recall = 10/20 = 1/2
What does this mean? Is it good performance?
64
Measuring Retrieval Effectiveness
Model 1: precision = 10/15 = 2/3; recall = 10/20 = 1/2
Model 2: precision = 5/6; recall = 5/15 = 1/3
65
Measuring Retrieval Effectiveness
Model 1 has better recall than Model 2
Model 2 has better precision than Model 1
Model 1: precision = 10/15 = 2/3; recall = 10/20 = 1/2
Model 2: precision = 5/6; recall = 5/15 = 1/3
66
Precision and Recall
For a particular query q and model M1 ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 20%?
67
Precision and Recall
For a particular query q and model M1 ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 20%? # Rel = 5.
20% of 5 = 1. Thus we want to compute precision
after retrieving 1 relevant document: 1/1 = 1.
68
Precision and Recall
For a particular query q and model M1 ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 40%? # Rel = 5.
40% of 5 = 2. Thus we want to compute precision
after retrieving 2 relevant documents: 2/5 = 0.4.
69
Precision and Recall
For a particular query q and model M1 ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 60%? # Rel = 5.
60% of 5 = 3. Thus we want to compute precision
after retrieving 3 relevant documents: 3/8 = 0.375.
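These per-level precision values can be computed mechanically
from the ranked list of R/N judgments. A minimal Python sketch,
assuming (as on the slides) that there are 5 relevant documents
in total for this query:

def precision_at_recall(ranking, total_relevant, recall_level):
    """Precision at the point where the given fraction of all
    relevant documents has been retrieved."""
    needed = recall_level * total_relevant  # relevant docs required
    found = 0
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            found += 1
        if found >= needed:
            return found / rank
    return 0.0  # this recall level is never reached in the ranking

m1 = ["R", "N", "N", "N", "R", "N", "N", "R", "R", "R"]
for level in (0.2, 0.4, 0.6):
    print(level, precision_at_recall(m1, total_relevant=5, recall_level=level))
# 0.2 -> 1.0 (1/1), 0.4 -> 0.4 (2/5), 0.6 -> 0.375 (3/8)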
70
Precision and Recall
For a particular query q and model M1 ranking 10
docs: R N N N R N N R R R
What is our precision at recall = 60%? # Rel = 5.
60% of 5 = 3. Thus we want to compute precision
after retrieving 3 relevant documents: 3/8 = 0.375.
What's the trend here? As our recall rate goes
up, our precision tends to decline. This is true
in general, though oddities relating to small
numbers of relevant documents can alter this.
But most frequently, we find relationships like
this:
71
(No Transcript)
72
Precision and Recall
For a particular query q, M1 ranks 10 docs:
R N N N R N N R R R
while M2 ranks them:
R R N N R N R N N R
Which model is better???
73
(No Transcript)
74
Deriving a single effectiveness measure: mean
avg. precision
For a particular query q, M1 ranks 10 docs:
R N N N R N N R R R
while M2 ranks them:
R R N N R N R N N R
Which model is better???
75
Deriving a single effectiveness measure: mean
avg. precision
For a particular query q, M1 ranks 10 docs:
R N N N R N N R R R
while M2 ranks them:
R R N N R N R N N R
Which model is better???
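Average precision collapses a ranked list into one number by
averaging the precision at each rank where a relevant document
appears; mean average precision (MAP) then averages this over
queries. A minimal Python sketch of one common formulation,
again assuming 5 relevant documents for this query:

def average_precision(ranking, total_relevant):
    """Average the precision values at the ranks of relevant docs."""
    found = 0
    precisions = []
    for rank, label in enumerate(ranking, start=1):
        if label == "R":
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / total_relevant

m1 = ["R", "N", "N", "N", "R", "N", "N", "R", "R", "R"]
m2 = ["R", "R", "N", "N", "R", "N", "R", "N", "N", "R"]
print(round(average_precision(m1, 5), 3))  # 0.544
print(round(average_precision(m2, 5), 3))  # 0.734 -- M2 ranks relevant docs earlier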
76
IR in our culture: John Battelle's The Search
  • IR began as a quiet branch of library science.
  • In John Battelle's argument, what has search
    become?

77
The database of intentions
78
The database of intentions: start your own hedge
fund
Is one company (Morgan Chase or Merrill) less
correlated with these doomsday words? We have
access to data like nobody imagined. Now what to
do with it?
79
Our challenge: coda
  • "My problem is not finding something. My
    problem is understanding something." -- Danny
    Hillis (Battelle 16)
  • "The goal of Google and other search companies is
    to provide people with information and make it
    useful to them." -- Craig Silverstein (Battelle
    17)
  • In this ambitious arena, what is our role? What
    will it be? What should it be?