Title: Special Topics in Computer Science: The Art of Information Retrieval. Chapter 2: Modeling
1. Special Topics in Computer Science: The Art of Information Retrieval. Chapter 2: Modeling
- Alexander Gelbukh
- www.Gelbukh.com
2. Previous chapter
- User Information Need
- Vague
- Semantic, not formal
- Document Relevance
- Order, not retrieve
- Huge amount of information
- Efficiency concerns
- Tradeoffs
- Art more than science
3. Modeling
- Still a science: computation is formal
- No good methods to work with (vague) semantics
- Thus, simplify to get a (formal) model
- Develop (precise) math over this (simple) model
- Why math, if the model is not precise (simplified)?
- phenomenon → model, step 1, step 2, ..., result
- with math: phenomenon → model → step 1 → step 2 → ... → ?!
4. Modeling in IR: idea
- Tag documents with fields
- As in a (relational) DB: customer name, age, address
- Unlike a DB, very many fields: individual words!
- E.g., bag of words: word1, word2, ... → 3, 5, 0, 0, 2, ... (see the sketch after this slide)
- Define a similarity measure between the query and such a record
- Unlike a DB, order, not retrieve (yes/no)
- Justify your model (optional, but nice)
- Develop math and algorithms for fast access
- as relational algebra in DB
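A minimal sketch of the bag-of-words idea from this slide, assuming a toy vocabulary, document, and query (all names here are illustrative, not from the lecture):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a text as a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Toy example: the "fields" are the individual words of the vocabulary.
vocabulary = ["cat", "dog", "mouse", "house"]
doc = "the cat chased the mouse in the house the cat slept"
query = "cat mouse"

print(bag_of_words(doc, vocabulary))    # [2, 0, 1, 1]
print(bag_of_words(query, vocabulary))  # [1, 0, 1, 0]
```

Both the query and the document end up as comparable records, so a similarity measure between them can then be defined.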
5. Taxonomy of IR systems
6. Aspects of an IR system
- IR model
- Boolean, Vector, Probabilistic
- Logical view of documents
- Full text, bag of words, ...
- User task
- retrieval, browsing
- Independent, though some are more compatible
7. Taxonomy of IR models
- Boolean (set theoretic)
- fuzzy
- extended
- Vector (algebraic)
- generalized vector
- latent semantic indexing
- neural network
- Probabilistic
- inference network
- belief network
8. Taxonomy of other aspects
- Text structure
- Non-overlapping lists
- Proximal nodes model
- Browsing
- Flat
- Structure guided
- hypertext
9. Appropriate models
10. Retrieval operation mode
- Ad-hoc
- static documents
- interactive
- ordered
- Filtering (≈ ad-hoc on new docs)
- changing document collection
- notification
- not interactive
- machine learning techniques can be used
- yes/no
11. Characterization of an IR model
- D = {dj}, collection of formal representations of docs
- e.g., keyword vectors
- Q = {qi}, possible formal representations of user information needs (queries)
- F, framework for modeling these two, and the rationale for the next component
- R(qi, dj): Q × D → ℝ, ranking function
- defines an ordering
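As an illustration of this quadruple, a toy sketch of D, Q, and R (the inner-product ranking function and all names are invented for the example; the framework does not prescribe them):

```python
# Illustrative only: D and Q are sets of formal representations, and R maps
# a (query, document) pair to a real number that induces the ranking.
from typing import Sequence

Doc = Sequence[float]    # e.g., a keyword-weight vector
Query = Sequence[float]  # the same representation for queries in this toy setup

def R(q: Query, d: Doc) -> float:
    """A toy ranking function: inner product of query and document vectors."""
    return sum(qi * di for qi, di in zip(q, d))

D = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]   # the document collection
q = [1.0, 0.0, 1.0]                      # one query
ranking = sorted(D, key=lambda d: R(q, d), reverse=True)  # the induced ordering
```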
12. Specific IR models
13. IR models
- Classical
- Boolean
- Vector
- Probabilistic
- (clear ideas, but some disadvantages)
- Refined
- Each one with refinements
- Solve many of the problems of the basic models
- Give good examples of possible developments in the area
- Not investigated well
- We can work on this
14. Basic notions
- Document = set of index terms
- Mainly nouns
- Maybe all words; then: full-text logical view
- Term weights
- some terms are better than others
- terms less frequent in this doc and more frequent in other docs are less useful
- Documents → index term vectors (w1j, w2j, ..., wtj)
- weights of the terms in the doc
- t is the number of terms in all docs
- weights of different terms are independent (simplification)
15. Boolean model
- Weights ∈ {0, 1}
- Doc = set of words
- Query = Boolean expression
- R(qi, dj) ∈ {0, 1}
- Good
- clear semantics, neat formalism, simple
- Bad
- no ranking (= data retrieval), retrieves too many or too few
- difficult to translate a User Information Need into a query
- No term weighting
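A minimal sketch of Boolean retrieval as described above, with invented toy documents and a hard-coded query expression:

```python
# Each document is just a set of its words; a query is a Boolean expression.
docs = {
    "d1": {"cat", "mouse", "house"},
    "d2": {"dog", "house"},
    "d3": {"cat", "dog"},
}

def matches(doc_terms):
    # Query: (cat OR dog) AND NOT mouse
    return ("cat" in doc_terms or "dog" in doc_terms) and "mouse" not in doc_terms

retrieved = [name for name, terms in docs.items() if matches(terms)]
print(retrieved)  # ['d2', 'd3'] -- yes/no only, no ranking among the answers
```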
16. Vector model
- Weights (non-binary)
- Ranking, much better results (for User Info Need)
- R(qi, dj) = correlation between the query vector and the doc vector
- E.g., the cosine measure (there is a typo in the book)
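A small sketch of the cosine measure between a query vector and document vectors (toy weights; the weighting scheme itself is discussed on the following slides):

```python
import math

def cosine(q, d):
    """Cosine of the angle between query and document weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q  = [1.0, 0.0, 1.0]        # query vector over 3 index terms
d1 = [2.0, 1.0, 0.0]
d2 = [1.0, 0.0, 3.0]
print(cosine(q, d1), cosine(q, d2))  # d2 ranks above d1
```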
17. Projection
18. Weights
- How are the weights wij obtained? Many variants.
- One way: the TF-IDF balance
- TF: term frequency
- How well is the term related to the doc?
- If it appears many times, it is important
- Proportional to the number of times it appears
- IDF: inverse document frequency
- How important is the term for distinguishing documents?
- If it appears in many docs, it is not important
- Inversely proportional to the number of docs where it appears
- Contradictory. How to balance?
19. TF-IDF ranking
- TF: term frequency
- IDF: inverse document frequency
- Balance: TF × IDF (a small sketch follows below)
- Other formulas exist. Art.
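One common way to instantiate the TF × IDF balance is w = tf × log(N / df), with N the number of documents and df the number of documents containing the term; a hedged sketch of this variant (the slide stresses that other formulas exist):

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document,
    using the common tf * log(N / df) scheme (one of many variants)."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["cat", "cat", "mouse"], ["dog", "house"], ["cat", "dog", "dog"]]
print(tf_idf_weights(docs)[0])
# 'cat' is damped (it appears in 2 of 3 docs); 'mouse' is boosted (only here)
```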
20. Advantages of the vector model
- One of the best known strategies
- Improves quality (term weighting)
- Allows approximate matching (partial matching)
- Gives ranking by similarity (cosine formula)
- Simple, fast
- But
- Does not consider term dependencies
- considering them in a bad way hurts quality
- no known good way
- No logical expressions (e.g., negation: mouse NOT cat)
21. Probabilistic model
- Assumptions
- set of relevant docs,
- probabilities of docs to be relevant
- After a Bayes calculation: probabilities of terms to be important for defining the relevant docs
- Initial idea: interact with the user.
- Generate an initial set
- Ask the user to mark some of them as relevant or not
- Estimate the probabilities of keywords. Repeat
- Can be done without user
- Just re-calculate the probabilities, assuming the user's acceptance is the same as the predicted ranking
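A hedged sketch of the re-estimation loop described above, using the usual log-odds (binary independence) style term weights; the variable names and probability estimates are illustrative, not the book's exact notation:

```python
import math

def term_weight(p_rel, p_nonrel):
    """Log-odds weight of a term: high if the term is likely in relevant
    docs and unlikely in non-relevant ones."""
    return math.log((p_rel * (1 - p_nonrel)) / ((1 - p_rel) * p_nonrel))

def score(doc_terms, query_terms, p_rel, p_nonrel):
    """Sum the weights of query terms present in the doc (binary weights)."""
    return sum(term_weight(p_rel[t], p_nonrel[t])
               for t in query_terms if t in doc_terms)

# A common initial guess: P(term | relevant) = 0.5 and
# P(term | non-relevant) = fraction of all docs containing the term.
# After ranking, the probabilities are re-estimated from the top documents
# (marked by the user, or simply assumed relevant) and the loop repeats.
p_rel = {"cat": 0.5, "mouse": 0.5}
p_nonrel = {"cat": 0.3, "mouse": 0.1}
print(score({"cat", "mouse", "house"}, ["cat", "mouse"], p_rel, p_nonrel))
```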
22. (Dis)advantages of the Probabilistic model
- Advantage
- Theoretical adequacy: ranks by probabilities
- Disadvantages
- Need to guess the initial ranking
- Binary weights, ignores frequencies
- Independence assumption (not clear if bad)
- Does not perform well (?)
23. Alternative Set Theoretic models: Fuzzy set model
- Takes into account term relationships (thesaurus)
- Bible is related to Church
- Fuzzy belonging of a term to a document
- A document containing Bible also contains a little bit of Church, but not entirely
- Fuzzy set logic applied to such fuzzy belonging
- logical expressions with AND, OR, and NOT
- Provides ranking, not just yes/no
- Not investigated well.
- Why not investigate it?
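A hedged sketch of one common formulation of this fuzzy belonging: a document's membership degree for a query term is 1 minus the product of (1 - correlation) over the document's own terms (the correlation values below are invented):

```python
def membership(doc_terms, query_term, correlation):
    """Fuzzy degree to which a document 'contains' a query term:
    1 - product of (1 - c) over the doc's terms, where c is a term-term
    correlation (e.g., derived from co-occurrence counts or a thesaurus)."""
    result = 1.0
    for t in doc_terms:
        result *= 1.0 - correlation.get((query_term, t), 0.0)
    return 1.0 - result

# Toy thesaurus-like correlations (illustrative numbers only):
corr = {("church", "bible"): 0.6, ("church", "church"): 1.0}
print(membership({"bible", "candle"}, "church", corr))  # ~0.6: a bit of Church
```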
24. Alternative Set Theoretic models: Extended Boolean model
- Combination of Boolean and Vector
- In comparison with the Boolean model, adds distance from the query
- some documents satisfy the query better than others
- In comparison with the Vector model, adds the distinction between AND and OR combinations
- There is a parameter (the degree of the norm) allowing one to adjust the behavior between Boolean-like and Vector-like
- This can even differ within one query
- Not investigated well. Why not investigate it?
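A hedged sketch of the p-norm idea usually used for this adjustable parameter: with p = 1 the AND/OR combinations degenerate to plain averages (vector-like), and as p grows they approach strict min/max (Boolean-like). The weights are toy values:

```python
def or_sim(weights, p):
    """Similarity to an OR of terms: p-norm mean of the term weights."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1.0 / p)

def and_sim(weights, p):
    """Similarity to an AND of terms: 1 - p-norm mean of the complements."""
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / len(weights)) ** (1.0 / p)

w = [0.9, 0.2]              # weights of two query terms in some document
print(and_sim(w, p=1))      # p = 1: behaves like the vector model (plain average)
print(and_sim(w, p=100))    # large p: approaches strict Boolean AND (the minimum)
```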
25. Alternative Algebraic models: Generalized Vector Space model
- Classical independence assumptions
- All combinations of terms are possible, none are equivalent (= a basis in the vector space)
- Pair-wise orthogonal: cos(ki, kj) = 0
- This model relaxes the pair-wise orthogonality: cos(ki, kj) ≠ 0
- Operates on combinations (co-occurrences) of index terms, not individual terms
- More complex, more expensive, not clear if better
- Not investigated well. Why not investigate it?
26. Alternative Algebraic models: Latent Semantic Indexing model
- Index by larger units, concepts ≈ sets of terms used together
- Retrieve a document that shares concepts with a relevant one (even if it does not contain the query terms)
- Group index terms together (map into a lower-dimensional space). So some terms become equivalent.
- Not exactly, but this is the idea
- Eliminates unimportant details
- Depends on a parameter (what details are unimportant?)
- Not investigated well. Why not investigate it?
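A hedged sketch of the LSI idea using a truncated SVD (numpy is used for illustration; the matrix and the number of concepts k are toy values):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
A = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 2., 1.]])

# Truncated SVD: keep only the k largest singular values ("concepts").
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of A

# Documents (and queries, after folding them in) are compared in the
# k-dimensional concept space, so a document can match a query even
# without sharing its literal terms.
doc_concepts = np.diag(s[:k]) @ Vt[:k, :]     # each column: one doc as k concepts
```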
27. Alternative Algebraic models: Neural Network model
- NNs are good at matching
- Iteratively uses the found documents as auxiliary queries
- Spreading activation.
- Terms → docs → terms → docs → terms → docs → ...
- Like a built-in thesaurus
- First round gives same result as Vector model
- No evidence if it is good
- Not investigated well. Why not investigate it?
28. Alternative Probabilistic models: Bayesian Inference Network model
- (One of the authors of the book worked on this. In fact, not so important)
- Probability as belief (not as frequency)
- Belief in the importance of terms. Query terms have belief 1.0
- Similar to a Neural Net
- The documents found increase the importance of their terms
- Thus they act as new queries
- But different propagation formulas
- Flexible in combining sources of evidence
- Can be applied to different ranking strategies (Boolean or TF-IDF)
- Good quality of results (Warning! The authors work on this)
29. (No transcript)
30. Alternative Probabilistic models: Belief Network model
- (Introduced by one of the authors of the book.)
- Better network topology
- Separation of document and term space
- More general than Inference model
--------------------------------------------------------------------
- Bayesian network models
- do not include cycles and thus have linear complexity
- unlike Neural Nets
- Combine distinct evidence sources (also user feedback)
- Are a neat formalism.
- Better alternative to combinations of Boolean and
Vector
31. Models for structured text
- Cat in the 3rd chapter. Cat in the same paragraph as Dog
- Non-overlapping lists
- Chapters, sections, paragraphs as regions
- Technically treated much like terms (ranges of positions)
- Sections containing Cat
- Proximal nodes model (suggested by the authors)
- Chapters, sections, paragraphs as objects
(nodes)
32. Models for browsing
- Flat browsing
- Just as a list of papers
- No context cues provided
- Structure guided
- Hierarchy
- Like directory tree in the computer
- Hypertext (Internet!)
- No limitations of sequential writing
- Modeled by a directed graph: links from unit A to unit B
- units = docs, chapters, etc.
- A map (with traversed path) can be helpful
33. The Web
- Internet
- Not hypertext
- The authors reserve the term hypertext for a well-organized hypertext
- The Internet is not a repository but a heap of information
34. Research issues
- How do people judge relevance?
- ranking strategies
- How to combine different sources of evidence?
- What interfaces can help users understand and formulate their Information Need?
- user interfaces: an open issue
- Meta-search engines combine results from different Web search engines
- These results almost do not intersect
- How to combine ranking?
35. Conclusions
- Modeling is needed for formal operations
- Boolean model is the simplest
- Vector model is the best combination of quality and simplicity
- TF-IDF term weighting
- This (or similar) weighting is used in all further models
- Many interesting and not well-investigated variations
- possible future work
36. Thank you! Till October 2