Special Topics in Computer Science: The Art of Information Retrieval. Chapter 2: Modeling


1
Special Topics in Computer Science: The Art of
Information Retrieval. Chapter 2: Modeling
  • Alexander Gelbukh
  • www.Gelbukh.com

2
Previous chapter
  • User Information Need
  • Vague
  • Semantic, not formal
  • Document Relevance
  • Order, not retrieve
  • Huge amount of information
  • Efficiency concerns
  • Tradeoffs
  • Art more than science

3
Modeling
  • Still science: computation is formal
  • No good methods to work with (vague) semantics
  • Thus, simplify to get a (formal) model
  • Develop (precise) math over this (simple) model
  • Why math if the model is not precise
    (simplified)?
  • With math: phenomenon ≈ model = step 1 = step 2
    = ... = result
  • Without math: phenomenon ≈ model ≈ step 1 ≈
    step 2 ≈ ... ≈ ?!

4
Modeling in IR: idea
  • Tag documents with fields
  • As in a (relational) DB: customer = {name, age,
    address}
  • Unlike DB, very many fields: individual words!
  • E.g., bag of words: {word1, word2, ...} = {3, 5,
    0, 0, 2, ...}
  • Define a similarity measure between query and
    such a record
  • Unlike DB, order, not retrieve (yes/no)
  • Justify your model (optional, but nice)
  • Develop math and algorithms for fast access
  • as relational algebra in DB
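
The bag-of-words record described above can be sketched as follows. This is a minimal illustration, assuming a naive lowercase whitespace tokenizer and a tiny hypothetical vocabulary; neither comes from the slides.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the term counts of `text`, one field per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["cat", "dog", "mouse", "house"]
record = bag_of_words("The cat chased the mouse and the cat won", vocabulary)
print(record)  # [2, 0, 1, 0]
```

Each position in the record plays the role of a DB field, but there is one field per word of the vocabulary.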

5
Taxonomy of IR systems
6
Aspects of an IR system
  • IR model
  • Boolean, Vector, Probabilistic
  • Logical view of documents
  • Full text, bag of words, ...
  • User task
  • retrieval, browsing
  • Independent, though some are more compatible

7
Taxonomy of IR models
  • Boolean (set theoretic)
  • fuzzy
  • extended
  • Vector (algebraic)
  • generalized vector
  • latent semantic indexing
  • neural network
  • Probabilistic
  • inference network
  • belief network

8
Taxonomy of other aspects
  • Text structure
  • Non-overlapping lists
  • Proximal nodes model
  • Browsing
  • Flat
  • Structure guided
  • hypertext

9
Appropriate models
10
Retrieval operation mode
  • Ad-hoc
  • static documents
  • interactive
  • ordered
  • Filtering (≈ ad-hoc on new docs)
  • changing document collection
  • notification
  • not interactive
  • machine learning techniques can be used
  • yes/no

11
Characterization of an IR model
  • D = {dj}, collection of formal representations of
    docs
  • e.g., keyword vectors
  • Q = {qi}, possible formal representations of user
    information need (queries)
  • F, framework for modeling these two: the reason
    for the next item
  • R(qi, dj): Q × D → ℝ, ranking function
  • defines ordering
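
How a ranking function defines an ordering can be sketched as below. The function `overlap` is a hypothetical stand-in for R (it counts shared keywords); the slides do not prescribe any particular R at this point.

```python
def rank(query, docs, R):
    """Order document ids by decreasing R(query, doc)."""
    return sorted(docs, key=lambda doc_id: R(query, docs[doc_id]), reverse=True)

# Hypothetical ranking function: number of keywords shared by query and doc.
def overlap(query, doc):
    return float(len(query & doc))

# D: formal representations of docs (here, keyword sets); q: one query.
docs = {"d1": {"cat", "dog"}, "d2": {"cat", "mouse", "house"}, "d3": {"tree"}}
query = {"cat", "mouse"}
print(rank(query, docs, overlap))  # ['d2', 'd1', 'd3']
```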

12
Specific IR models
13
IR models
  • Classical
  • Boolean
  • Vector
  • Probabilistic
  • (clear ideas, but some disadvantages)
  • Refined
  • Each one with refinements
  • Solve many of the problems of the basic models
  • Give good examples of possible developments in
    the area
  • Not investigated well
  • We can work on this

14
Basic notions
  • Document = set of index terms
  • Mainly nouns
  • Maybe all; then full text logical view
  • Term weights
  • some terms are better than others
  • terms less frequent in this doc and more frequent
    in other docs are less useful
  • Document → index term vector {w1j, w2j, ...,
    wtj}
  • weights of terms in the doc
  • t is the number of terms in all docs
  • weights of different terms are independent
    (simplification)

15
Boolean model
  • Weights ∈ {0, 1}
  • Doc = set of words
  • Query = Boolean expression
  • R(qi,dj) ∈ {0, 1}
  • Good
  • clear semantics, neat formalism, simple
  • Bad
  • no ranking (→ data retrieval), retrieves too many
    or too few
  • difficult to translate User Information Need into
    query
  • No term weighting
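
A minimal sketch of Boolean retrieval over sets of words, assuming a small in-memory collection; the documents and the helper `build_index` are hypothetical. It shows why the result is a yes/no set without any ranking.

```python
def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, words in docs.items():
        for word in words:
            index.setdefault(word, set()).add(doc_id)
    return index

docs = {
    "d1": {"cat", "mouse"},
    "d2": {"mouse", "cheese"},
    "d3": {"cat", "dog"},
}
index = build_index(docs)
all_ids = set(docs)

# Query: mouse AND NOT cat. Every doc is either in the answer (1) or not (0).
result = index.get("mouse", set()) & (all_ids - index.get("cat", set()))
print(sorted(result))  # ['d2']
```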

16
Vector model
  • Weights (non-binary)
  • Ranking, much better results (for User Info Need)
  • R(qi,dj) = correlation between query vector and
    doc vector
  • E.g., cosine measure (there is a
    typo in the book)
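
The cosine measure can be sketched as follows. This is the standard definition (dot product divided by the product of the vector lengths), not a transcription of the formula from the slide or the book; the example vectors are made up.

```python
import math

def cosine(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention for an empty vector (an assumption)
    return dot / (norm_q * norm_d)

query = [1, 0, 1]
same_direction = [2, 0, 2]   # same term proportions as the query
no_shared_terms = [0, 3, 0]  # orthogonal to the query
print(cosine(query, same_direction), cosine(query, no_shared_terms))
```

A document with the same term proportions as the query scores close to 1 regardless of its length, and a document sharing no terms scores 0.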

17
Projection
18
Weights
  • How are the weights wij obtained? Many variants.
  • One way: TF-IDF balance
  • TF: Term frequency
  • How well is the term related to the doc?
  • If it appears many times, it is important
  • Proportional to the number of times it appears
  • IDF: Inverse document frequency
  • How important is the term to distinguish
    documents?
  • If it appears in many docs, it is not important
  • Inversely proportional to the number of docs
    where it appears
  • Contradictory. How to balance?

19
TF-IDF ranking
  • TF: Term frequency
  • IDF: Inverse document frequency
  • Balance: TF × IDF
  • Other formulas exist. Art.
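
One possible TF × IDF balance is sketched below. The exact formulas on the slide were not preserved in this transcript, so this uses one common variant as an assumption: TF is the count normalized by the largest count in the document, and IDF is log(N / n), where N is the number of docs and n is the number of docs containing the term.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Number of docs in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        max_count = max(counts.values(), default=1)
        weights.append({
            term: (count / max_count) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical collection, for illustration only.
docs = [["cat", "cat", "mouse"], ["cat", "dog"], ["dog", "house"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In the first document "cat" is more frequent than "mouse", but "mouse" gets the higher weight because it appears in fewer documents.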

20
Advantages of vector model
  • One of the best known strategies
  • Improves quality (term weighting)
  • Allows approximate matching (partial matching)
  • Gives ranking by similarity (cosine formula)
  • Simple, fast
  • But
  • Does not consider term dependencies
  • considering them in a bad way hurts quality
  • no known good way
  • No logical expressions (e.g., negation: mouse
    NOT cat)

21
Probabilistic model
  • Assumptions
  • set of relevant docs,
  • probabilities of docs to be relevant
  • After Bayes calculation: probabilities of terms
    to be important for defining relevant docs
  • Initial idea: interact with the user.
  • Generate an initial set
  • Ask the user to mark some of them as relevant or
    not
  • Estimate the probabilities of keywords. Repeat
  • Can be done without user
  • Just re-calculate the probabilities assuming the
    user's acceptance is the same as the predicted
    ranking
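
The loop without the user can be sketched as below. The slides give no formulas, so everything numeric here is an assumption: terms are binary, the initial guess is P(term | relevant) = 0.5, the top-ranked docs are treated as relevant, and the estimates use 0.5 smoothing, a commonly used textbook form. All function names are hypothetical.

```python
import math

def term_weight(p, q):
    """Log-odds weight; p = P(term | relevant), q = P(term | not relevant)."""
    return math.log(p / (1 - p)) + math.log((1 - q) / q)

def rank_docs(query, docs, p, q):
    """Score each doc by the summed weights of the query terms it contains."""
    scores = {
        doc_id: sum(term_weight(p[t], q[t]) for t in query if t in words)
        for doc_id, words in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def probabilistic_retrieval(query, docs, top=2, rounds=3):
    n = len(docs)
    df = {t: sum(t in words for words in docs.values()) for t in query}
    # Initial guess (assumption): p = 0.5, q = smoothed document frequency.
    p = {t: 0.5 for t in query}
    q = {t: (df[t] + 0.5) / (n + 1) for t in query}
    for _ in range(rounds):
        # Assume the user would accept exactly the top-ranked docs.
        assumed_relevant = rank_docs(query, docs, p, q)[:top]
        v = len(assumed_relevant)
        for t in query:
            v_t = sum(t in docs[d] for d in assumed_relevant)
            p[t] = (v_t + 0.5) / (v + 1)
            q[t] = (df[t] - v_t + 0.5) / (n - v + 1)
    return rank_docs(query, docs, p, q)

docs = {
    "d1": {"cat", "mouse"}, "d2": {"cat"}, "d3": {"dog"},
    "d4": {"mouse", "cheese"}, "d5": {"dog", "house"}, "d6": {"tree"},
}
ranking = probabilistic_retrieval(["cat", "mouse"], docs)
print(ranking[0])  # d1
```

Replacing `assumed_relevant` with the docs the user actually marks as relevant gives the interactive version.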

22
(Dis)advantages of Probabilistic model
  • Advantage
  • Theoretical adequacy: ranks by probabilities
  • Disadvantages
  • Need to guess the initial ranking
  • Binary weights, ignores frequencies
  • Independence assumption (not clear if bad)
  • Does not perform well (?)

23
Alternative Set Theoretic models: Fuzzy set model
  • Takes into account term relationships (thesaurus)
  • Bible is related to Church
  • Fuzzy belonging of a term to a document
  • Document containing Bible also contains a little
    bit of Church, but not entirely
  • Fuzzy set logic applied to such fuzzy belonging
  • logical expressions with AND, OR, and NOT
  • Provides ranking, not just yes/no
  • Not investigated well.
  • Why not investigate it?
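
A sketch of the fuzzy belonging of a term to a document. The slides do not say how the thesaurus is built, so this assumes one possible construction: term relatedness is measured by co-occurrence, c(a, b) = n_ab / (n_a + n_b - n_ab), and the membership of a document in the fuzzy set of a term is 1 minus the product of (1 - c) over the document's terms.

```python
def correlation(term_a, term_b, docs):
    """Co-occurrence correlation between two terms over the collection."""
    n_a = sum(term_a in words for words in docs.values())
    n_b = sum(term_b in words for words in docs.values())
    n_ab = sum(term_a in words and term_b in words for words in docs.values())
    union = n_a + n_b - n_ab
    return n_ab / union if union else 0.0

def membership(term, words, docs):
    """Fuzzy degree to which a doc with `words` belongs to the set of `term`."""
    product = 1.0
    for other in words:
        product *= 1.0 - correlation(term, other, docs)
    return 1.0 - product

# Hypothetical collection, for illustration only.
docs = {
    "d1": {"bible", "church"},
    "d2": {"bible", "church", "priest"},
    "d3": {"bible"},
    "d4": {"football"},
}
for doc_id, words in docs.items():
    print(doc_id, round(membership("church", words, docs), 3))
```

Here d3 contains only "bible", yet it belongs to the fuzzy set of "church" to degree 2/3, while d4 belongs to degree 0.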

24
Alternative Set Theoretic models: Extended Boolean
model
  • Combination of Boolean and Vector
  • In comparison with Boolean model, adds distance
    from query
  • some documents satisfy the query better than
    others
  • In comparison with Vector model, adds the
    distinction between AND and OR combinations
  • A parameter (the degree of the norm) adjusts the
    behavior between Boolean-like and Vector-like
  • This can be even different within one query
  • Not investigated well. Why not investigate it?
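
The effect of the norm parameter can be sketched as below, assuming the usual p-norm form of the extended Boolean model and term weights in [0, 1]; the slides themselves do not show the formulas.

```python
def or_similarity(weights, p):
    """p-norm OR: distance from the point where all term weights are 0."""
    return (sum(w ** p for w in weights) / len(weights)) ** (1 / p)

def and_similarity(weights, p):
    """p-norm AND: one minus the distance from the point where all are 1."""
    return 1 - (sum((1 - w) ** p for w in weights) / len(weights)) ** (1 / p)

# A doc that fully matches one of the two query terms and not the other.
weights = [1.0, 0.0]
for p in (1, 2, 100):
    print(p, round(or_similarity(weights, p), 3), round(and_similarity(weights, p), 3))
```

With p = 1 the AND and OR scores coincide (vector-like averaging); as p grows, OR approaches the maximum of the weights and AND approaches the minimum (Boolean-like).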

25
Alternative Algebraic models: Generalized Vector
Space model
  • Classical independence assumptions
  • All combinations of terms are possible, none are
    equivalent (= basis in the vector space)
  • Pair-wise orthogonal: cos({ki}, {kj}) = 0
  • This model relaxes the pair-wise orthogonality:
    cos({ki}, {kj}) ≠ 0
  • Operates by combinations (co-occurrences) of
    index terms, not individual terms
  • More complex, more expensive, not clear if better
  • Not investigated well. Why not investigate it?

26
Alternative Algebraic models: Latent Semantic
Indexing model
  • Index by larger units, concepts ≈ sets of terms
    used together
  • Retrieve a document that shares concepts with a
    relevant one (even if it does not contain query
    terms)
  • Group index terms together (map into lower
    dimensional space). So some terms are equivalent.
  • Not exactly, but this is the idea
  • Eliminates unimportant details
  • Depends on a parameter (what details are
    unimportant?)
  • Not investigated well. Why not investigate it?

27
Alternative Algebraic models: Neural Network model
  • NNs are good at matching
  • Iteratively uses the found documents as auxiliary
    queries
  • Spreading activation.
  • Terms → docs → terms → docs → terms → docs → ...
  • Like a built-in thesaurus
  • First round gives same result as Vector model
  • No evidence if it is good
  • Not investigated well. Why not investigate it?

28
Alternative Probabilistic models: Bayesian
Inference Network model
  • (One of the authors of the book worked in this.
    In fact not so important)
  • Probability as belief (not as frequency)
  • Belief in importance of terms. Query terms have
    1.0
  • Similar to Neural Net
  • Documents found increase the importance of their
    terms
  • Thus act as new queries
  • But different propagation formulas
  • Flexible in combining sources of evidence
  • Can be applied to different ranking strategies
    (Boolean or TF-IDF)
  • Good quality of results (Warning! Authors work in
    this)

30
Alternative Probabilistic models: Belief Network
model
  • (Introduced by one of the authors of the book.)
  • Better network topology
  • Separation of document and term space
  • More general than Inference model
  • Bayesian network models
  • do not include cycles and thus have linear
    complexity
  • unlike Neural Nets
  • Combine distinct evidence sources (also user
    feedback)
  • Are a neat formalism.
  • Better alternative to combinations of Boolean and
    Vector

31
Models for structured text
  • Example queries: "Cat" in the 3rd chapter; "Cat"
    in the same paragraph as "Dog"
  • Non-overlapping lists
  • Chapters, sections, paragraphs as regions
  • Technically treated much like terms (ranges of
    positions)
  • Sections containing Cat
  • Proximal nodes model (suggested by the authors)
  • Chapters, sections, paragraphs as objects
    (nodes)
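
Treating regions as ranges of positions can be sketched as follows. The document, the region boundaries, and the helper names are hypothetical; each list holds non-overlapping (start, end) word-position ranges, inclusive.

```python
def positions(term, words):
    """Word positions at which `term` occurs."""
    return [i for i, w in enumerate(words) if w == term]

def regions_containing(term, regions, words):
    """Regions (start, end) that contain at least one occurrence of `term`."""
    hits = positions(term, words)
    return [(s, e) for (s, e) in regions if any(s <= pos <= e for pos in hits)]

def same_region(term_a, term_b, regions, words):
    """Regions that contain both terms."""
    with_a = set(regions_containing(term_a, regions, words))
    return [r for r in regions_containing(term_b, regions, words) if r in with_a]

words = "intro text cat here more text dog and cat again".split()
sections = [(0, 3), (4, 6), (7, 9)]            # one non-overlapping list
paragraphs = [(0, 1), (2, 3), (4, 5), (6, 9)]  # another non-overlapping list

print(regions_containing("cat", sections, words))    # [(0, 3), (7, 9)]
print(same_region("cat", "dog", paragraphs, words))  # [(6, 9)]
```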

32
Models for browsing
  • Flat browsing
  • Just as a list of papers
  • No context cues provided
  • Structure guided
  • Hierarchy
  • Like directory tree in the computer
  • Hypertext (Internet!)
  • No limitations of sequential writing
  • Modeled by a directed graph: links from unit A to
    unit B
  • units: docs, chapters, etc.
  • A map (with traversed path) can be helpful
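
A small sketch of hypertext as a directed graph with a traversed path. The units, the links, and the function `browse` are hypothetical.

```python
def browse(links, start, clicks):
    """Follow the chosen links from `start` and return the traversed path."""
    path = [start]
    for target in clicks:
        if target not in links.get(path[-1], []):
            raise ValueError(f"no link from {path[-1]} to {target}")
        path.append(target)
    return path

# Directed graph: each unit maps to the units it links to.
links = {"home": ["ch1", "ch2"], "ch1": ["ch2"], "ch2": ["home"]}
print(browse(links, "home", ["ch1", "ch2", "home"]))
# ['home', 'ch1', 'ch2', 'home']
```

Keeping the path alongside the graph is what lets an interface draw a map that shows where the user has already been.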

33
The Web
  • Internet
  • Not hypertext
  • The authors call only a well-organized hypertext
    a hypertext
  • Internet: not a depository but a heap of
    information

34
Research issues
  • How do people judge relevance?
  • ranking strategies
  • How to combine different sources of evidence?
  • What interfaces can help users to understand and
    formulate their Information Need?
  • user interfaces: an open issue
  • Meta-search engines: combine results from
    different Web search engines
  • Their results almost do not intersect
  • How to combine ranking?

35
Conclusions
  • Modeling is needed for formal operations
  • Boolean model is the simplest
  • Vector model is the best combination of quality
    and simplicity
  • TF-IDF term weighting
  • This (or similar) weighting is used in all
    further models
  • Many interesting and not well-investigated
    variations
  • possible future work

36
Thank you! Till October 2