Title: Special Topics in Computer Science: The Art of Information Retrieval. Chapter 2: Modeling
1. Special Topics in Computer Science: The Art of Information Retrieval. Chapter 2: Modeling
- Alexander Gelbukh
- www.Gelbukh.com
2. Previous chapter
- User Information Need
- Vague
- Semantic, not formal
- Document Relevance
- Order, not retrieve
- Huge amount of information
- Efficiency concerns
- Tradeoffs
- Art more than science
3. Modeling
- Still a science: computation is formal
- No good methods to work with (vague) semantics
- Thus, simplify to get a (formal) model
- Develop (precise) math over this (simple) model
- Why math, if the model is not precise (simplified)?
- phenomenon → model, step 1, step 2, ..., result
- with math: phenomenon → model → step 1 → step 2 → ... → ?!
4. Modeling in IR: idea
- Tag documents with fields
- As in a (relational) DB: customer name, age, address
- Unlike a DB, very many fields: individual words!
- E.g., bag of words: word1, word2, ... → 3, 5, 0, 0, 2, ... (see the sketch after this slide)
- Define a similarity measure between the query and such a record
- Unlike a DB, order, not retrieve (yes/no)
- Justify your model (optional, but nice)
- Develop math and algorithms for fast access
- as relational algebra in DB
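A minimal sketch of the bag-of-words idea from this slide, assuming a toy vocabulary, document, and query (all names here are illustrative, not from the lecture):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a text as a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Toy example: the "fields" are the individual words of the vocabulary.
vocabulary = ["cat", "dog", "mouse", "house"]
doc = "the cat chased the mouse in the house the cat slept"
query = "cat mouse"

print(bag_of_words(doc, vocabulary))    # [2, 0, 1, 1]
print(bag_of_words(query, vocabulary))  # [1, 0, 1, 0]
```

Both the query and the document end up as comparable records, so a similarity measure between them can then be defined.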
5. Taxonomy of IR systems
6. Aspects of an IR system
- IR model
- Boolean, Vector, Probabilistic
- Logical view of documents
- Full text, bag of words, ...
- User task
- retrieval, browsing
- Independent, though some are more compatible
7. Taxonomy of IR models
- Boolean (set theoretic)
- fuzzy
- extended
- Vector (algebraic)
- generalized vector
- latent semantic indexing
- neural network
- Probabilistic
- inference network
- belief network
8. Taxonomy of other aspects
- Text structure
- Non-overlapping lists
- Proximal nodes model
- Browsing
- Flat
- Structure guided
- hypertext
9. Appropriate models
10. Retrieval operation mode
- Ad-hoc
- static documents
- interactive
- ordered
- Filtering (≈ ad-hoc on new docs)
- changing document collection
- notification
- not interactive
- machine learning techniques can be used
- yes/no
11. Characterization of an IR model
- D = {dj}, collection of formal representations of docs
- e.g., keyword vectors
- Q = {qi}, possible formal representations of user information needs (queries)
- F, framework for modeling these two, and the rationale for the next component
- R(qi, dj): Q × D → ℝ, ranking function
- defines an ordering
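As an illustration of this quadruple, a toy sketch of D, Q, and R (the inner-product ranking function and all names are invented for the example; the framework does not prescribe them):

```python
# Illustrative only: D and Q are sets of formal representations, and R maps
# a (query, document) pair to a real number that induces the ranking.
from typing import Sequence

Doc = Sequence[float]    # e.g., a keyword-weight vector
Query = Sequence[float]  # the same representation for queries in this toy setup

def R(q: Query, d: Doc) -> float:
    """A toy ranking function: inner product of query and document vectors."""
    return sum(qi * di for qi, di in zip(q, d))

D = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # the document collection
q = [1.0, 0.0, 1.0]                      # one query
ranking = sorted(D, key=lambda d: R(q, d), reverse=True)  # the induced ordering
```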
12. Specific IR models
13. IR models
- Classical
- Boolean
- Vector
- Probabilistic
- (clear ideas, but some disadvantages)
- Refined
- Each one with refinements
- Solve many of the problems of the basic models
- Give good examples of possible developments in the area
- Not investigated well
- We can work on this
14. Basic notions
- Document = set of index terms
- Mainly nouns
- Maybe all words; then: full-text logical view
- Term weights
- some terms are better than others
- terms less frequent in this doc and more frequent in other docs are less useful
- Documents → index term vectors (w1j, w2j, ..., wtj)
- weights of the terms in the doc
- t is the number of terms in all docs
- weights of different terms are independent (simplification)
15. Boolean model
- Weights ∈ {0, 1}
- Doc = set of words
- Query = Boolean expression
- R(qi, dj) ∈ {0, 1}
- Good
- clear semantics, neat formalism, simple
- Bad
- no ranking (= data retrieval), retrieves too many or too few
- difficult to translate a User Information Need into a query
- No term weighting
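A minimal sketch of Boolean retrieval as described above, with invented toy documents and a hard-coded query expression:

```python
# Each document is just a set of its words; a query is a Boolean expression.
docs = {
    "d1": {"cat", "mouse", "house"},
    "d2": {"dog", "house"},
    "d3": {"cat", "dog"},
}

def matches(doc_terms):
    # Query: (cat OR dog) AND NOT mouse
    return ("cat" in doc_terms or "dog" in doc_terms) and "mouse" not in doc_terms

retrieved = [name for name, terms in docs.items() if matches(terms)]
print(retrieved)  # ['d2', 'd3'] -- yes/no only, no ranking among the answers
```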
16. Vector model
- Weights (non-binary)
- Ranking, much better results (for User Info Need)
- R(qi, dj) = correlation between the query vector and the doc vector
- E.g., the cosine measure (there is a typo in the book)
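A small sketch of the cosine measure between a query vector and document vectors (toy weights; the weighting scheme itself is discussed on the following slides):

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q  = [1.0, 0.0, 1.0]        # query vector over 3 index terms
d1 = [2.0, 1.0, 0.0]
d2 = [1.0, 0.0, 3.0]
print(cosine(q, d1), cosine(q, d2))  # d2 ranks above d1
```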
17. Projection
18. Weights
- How are the weights wij obtained? Many variants.
- One way: the TF-IDF balance
- TF: term frequency
- How well is the term related to the doc?
- If it appears many times, it is important
- Proportional to the number of times it appears
- IDF: inverse document frequency
- How important is the term for distinguishing documents?
- If it appears in many docs, it is not important
- Inversely proportional to the number of docs where it appears
- Contradictory. How to balance?
19. TF-IDF ranking
- TF: term frequency
- IDF: inverse document frequency
- Balance: TF × IDF (a small sketch follows below)
- Other formulas exist. Art.
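One common way to instantiate the TF × IDF balance is w = tf × log(N / df), with N the number of documents and df the number of documents containing the term; a hedged sketch of this variant (the slide stresses that other formulas exist):

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using the common tf * log(N / df) scheme (one of many variants)."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["cat", "cat", "mouse"], ["dog", "house"], ["cat", "dog", "dog"]]
print(tf_idf_weights(docs)[0])
# 'cat' is damped (it appears in 2 of 3 docs); 'mouse' is boosted (only here)
```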
20. Advantages of the vector model
- One of the best known strategies
- Improves quality (term weighting)
- Allows approximate matching (partial matching)
- Gives ranking by similarity (cosine formula)
- Simple, fast
- But
- Does not consider term dependencies
- considering them in a bad way hurts quality
- no known good way
- No logical expressions (e.g., negation: mouse NOT cat)
21. Probabilistic model
- Assumptions
- set of relevant docs,
- probabilities of docs to be relevant
- After a Bayes calculation: probabilities of terms to be important for defining the relevant docs
- Initial idea: interact with the user.
- Generate an initial set
- Ask the user to mark some of them as relevant or not
- Estimate the probabilities of keywords. Repeat
- Can be done without user
- Just re-calculate the probabilities, assuming the user's acceptance is the same as the predicted ranking
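A hedged sketch of the re-estimation loop described above, using the usual log-odds (binary independence) style term weights; the variable names and probability estimates are illustrative, not the book's exact notation:

```python
import math

def term_weight(p_rel, p_nonrel):
    """Log-odds weight of a term: high if the term is likely in relevant
    docs and unlikely in non-relevant ones."""
    return math.log((p_rel * (1 - p_nonrel)) / ((1 - p_rel) * p_nonrel))

def score(doc_terms, query_terms, p_rel, p_nonrel):
    """Sum the weights of query terms present in the doc (binary weights)."""
    return sum(term_weight(p_rel[t], p_nonrel[t])
               for t in query_terms if t in doc_terms)

# A common initial guess: P(term | relevant) = 0.5 and
# P(term | non-relevant) = fraction of all docs containing the term.
# After ranking, the probabilities are re-estimated from the top documents
# (marked by the user, or simply assumed relevant) and the loop repeats.
p_rel = {"cat": 0.5, "mouse": 0.5}
p_nonrel = {"cat": 0.3, "mouse": 0.1}
print(score({"cat", "mouse", "house"}, ["cat", "mouse"], p_rel, p_nonrel))
```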
22. (Dis)advantages of the Probabilistic model
- Advantage
- Theoretical adequacy: ranks by probabilities
- Disadvantages
- Need to guess the initial ranking
- Binary weights, ignores frequencies
- Independence assumption (not clear if bad)
- Does not perform well (?)
23. Alternative Set Theoretic models: Fuzzy set model
- Takes into account term relationships (thesaurus)
- Bible is related to Church
- Fuzzy belonging of a term to a document
- A document containing Bible also contains a little bit of Church, but not entirely
- Fuzzy set logic applied to such fuzzy belonging
- logical expressions with AND, OR, and NOT
- Provides ranking, not just yes/no
- Not investigated well.
- Why not investigate it?
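A hedged sketch of one common formulation of this fuzzy belonging: a document's membership degree for a query term is 1 minus the product of (1 - correlation) over the document's own terms (the correlation values below are invented):

```python
def membership(doc_terms, query_term, correlation):
    """Fuzzy degree to which a document 'contains' a query term:
    1 - product of (1 - c) over the doc's terms, where c is a term-term
    correlation (e.g., derived from co-occurrence counts or a thesaurus)."""
    result = 1.0
    for t in doc_terms:
        result *= 1.0 - correlation.get((query_term, t), 0.0)
    return 1.0 - result

# Toy thesaurus-like correlations (illustrative numbers only):
corr = {("church", "bible"): 0.6, ("church", "church"): 1.0}
print(membership({"bible", "candle"}, "church", corr))  # ~0.6: a bit of Church
```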
24. Alternative Set Theoretic models: Extended Boolean model
- Combination of Boolean and Vector
- In comparison with the Boolean model, adds distance from the query
- some documents satisfy the query better than others
- In comparison with the Vector model, adds the distinction between AND and OR combinations
- There is a parameter (the degree of the norm) allowing one to adjust the behavior between Boolean-like and Vector-like
- This can even differ within one query
- Not investigated well. Why not investigate it?
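A hedged sketch of the p-norm idea usually used for this adjustable parameter: with p = 1 the AND/OR combinations degenerate to plain averages (vector-like), and as p grows they approach strict min/max (Boolean-like). The weights are toy values:

```python
def or_sim(weights, p):
    """Similarity to an OR of terms: p-norm mean of the term weights."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)

def and_sim(weights, p):
    """Similarity to an AND of terms: 1 - p-norm mean of the complements."""
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / len(weights)) ** (1.0 / p)

w = [0.9, 0.2]              # weights of two query terms in some document
print(and_sim(w, p=1))      # p = 1: behaves like the vector model (plain average)
print(and_sim(w, p=100))    # large p: approaches strict Boolean AND (the minimum)
```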
25. Alternative Algebraic models: Generalized Vector Space model
- Classical independence assumptions
- All combinations of terms are possible, none are equivalent (= a basis in the vector space)
- Pair-wise orthogonal: cos(ki, kj) = 0
- This model relaxes the pair-wise orthogonality: cos(ki, kj) ≠ 0
- Operates on combinations (co-occurrences) of index terms, not individual terms
- More complex, more expensive, not clear if better
- Not investigated well. Why not investigate it?
26. Alternative Algebraic models: Latent Semantic Indexing model
- Index by larger units, concepts ≈ sets of terms used together
- Retrieve a document that shares concepts with a relevant one (even if it does not contain the query terms)
- Group index terms together (map into a lower-dimensional space). So some terms become equivalent.
- Not exactly, but this is the idea
- Eliminates unimportant details
- Depends on a parameter (what details are unimportant?)
- Not investigated well. Why not investigate it?
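A hedged sketch of the LSI idea using a truncated SVD (numpy is used for illustration; the matrix and the number of concepts k are toy values):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
A = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 2., 1.]])

# Truncated SVD: keep only the k largest singular values ("concepts").
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A

# Documents (and queries, after folding them in) are compared in the
# k-dimensional concept space, so a document can match a query even
# without sharing its literal terms.
doc_concepts = np.diag(s[:k]) @ Vt[:k, :]     # each column: one doc as k concepts
```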
27. Alternative Algebraic models: Neural Network model
- NNs are good at matching
- Iteratively uses the found documents as auxiliary queries
- Spreading activation.
- Terms → docs → terms → docs → terms → docs → ...
- Like a built-in thesaurus
- First round gives same result as Vector model
- No evidence if it is good
- Not investigated well. Why not investigate it?
28. Alternative Probabilistic models: Bayesian Inference Network model
- (One of the authors of the book worked on this. In fact, not so important)
- Probability as belief (not as frequency)
- Belief in the importance of terms. Query terms have belief 1.0
- Similar to a Neural Net
- The documents found increase the importance of their terms
- Thus they act as new queries
- But different propagation formulas
- Flexible in combining sources of evidence
- Can be applied to different ranking strategies (Boolean or TF-IDF)
- Good quality of results (Warning! The authors work on this)
29. (No transcript)
30. Alternative Probabilistic models: Belief Network model
- (Introduced by one of the authors of the book.)
- Better network topology
- Separation of document and term space
- More general than Inference model
--------------------------------------------------------------------
- Bayesian network models
- do not include cycles and thus have linear complexity
- unlike Neural Nets
- Combine distinct evidence sources (also user feedback)
- Are a neat formalism.
- Better alternative to combinations of Boolean and
Vector
31. Models for structured text
- Cat in the 3rd chapter. Cat in the same paragraph as Dog
- Non-overlapping lists
- Chapters, sections, paragraphs as regions
- Technically treated much like terms (ranges of positions)
- Sections containing Cat
- Proximal nodes model (suggested by the authors)
- Chapters, sections, paragraphs as objects
(nodes)
32. Models for browsing
- Flat browsing
- Just as a list of papers
- No context cues provided
- Structure guided
- Hierarchy
- Like directory tree in the computer
- Hypertext (Internet!)
- No limitations of sequential writing
- Modeled by a directed graph: links from unit A to unit B
- units = docs, chapters, etc.
- A map (with traversed path) can be helpful
33. The Web
- Internet
- Not hypertext
- The authors reserve the term hypertext for a well-organized hypertext
- The Internet is not a repository but a heap of information
34. Research issues
- How do people judge relevance?
- ranking strategies
- How to combine different sources of evidence?
- What interfaces can help users understand and formulate their Information Need?
- user interfaces: an open issue
- Meta-search engines combine results from different Web search engines
- These results almost do not intersect
- How to combine ranking?
35. Conclusions
- Modeling is needed for formal operations
- Boolean model is the simplest
- Vector model is the best combination of quality and simplicity
- TF-IDF term weighting
- This (or similar) weighting is used in all further models
- Many interesting and not well-investigated variations
- possible future work
36. Thank you! Till October 2