Information Filtering IR Topics - PowerPoint PPT Presentation

1
Information Filtering IR Topics
  • Tsvi Kuflik
  • Department of Information Systems Engineering,
  • Ben-Gurion University of the Negev
  • Beer-Sheva 84105, Israel
  • tsvikak@bgumail.bgu.ac.il
  • Based partially on the books of Frakes & Baeza-Yates and of Salton & McGill.

2
IF Model
User with Long term Goals, tasks
Producers of Texts
Regular Information Interests
Distributors of Texts
Representation and organization
Representation
Profiles
Text Surrogates, Organized
Comparison or Interaction
  • Retrieved texts

Use and/or evaluation
modification
3
IR Topics
  • IR provides the basics for IF
  • Efficient document representation
  • Efficient query representation
  • Efficient matching methods

4
IF Model
User with Long term Goals, tasks
Producers of Texts
Regular Information Interests
Distributors of Texts
Representation and organization
Representation
Profiles
Text Surrogates, Organized
Comparison or Interaction
  • Retrieved texts

Use and/or evaluation
modification
5
IR Topics
  • Classical IR
  • Text oriented (originally there was text)
  • All manipulation is done on text
  • SMART - Salton, 1983

6
IR Topics
  • Information Retrieval Task
  • Retrieve documents relevant to user query from a
    known collection
  • Expectation of short-term goal
  • Goal should be satisfied in real time
  • Goal will not persist after satisfied
  • Mature, successful methods have been developed
  • Most experience with short text documents in
    static collections

7
IR Topics
  • Steps In Typical IR System
  • Document Preprocessing
  • Document Indexing
  • Query Processing
  • Retrieval of Relevant Documents
  • Presentation
  • Relevance Feedback

8
IR Topics
  • Preprocessing (Content based IF)
  • Parsing/Lexical analysis
  • Analyze text structural aspects
  • Isolate textual segments

9
IR Topics
  • Preprocessing (Content based IF)
  • Using meaningful terms only - dimensionality
    reduction
  • Stop lists

10
IR Topics
  • Stop lists
  • Removal of meaningless terms
  • Inclusion of topic related terms only
  • Issues
  • Stop list content
  • Domain related
  • Phrases containing stop words
  • More..

11
IR Topics
  • Stop lists
  • Frequent words in English (single letters, again,
    be, he, many, etc.)
  • Frequent words in the database (such as "computer" in a computer science DB); the frequency threshold has to be defined.

12
Stop List Example (85 out of 429)
  • different, n, necessary, need, needed, needing, newest, next, no, nobody,
    non, noone, not, noting, now, nowhere, of, off, often, new, old, older,
    oldest, on, once, one, only, open, again, among, already, about, above,
    against, alone, after, also, although, along, always, an, across, b, and,
    another, ask, c, asking, asks, backed, away, a, should, show, came, all,
    almost, before, began, back, backing, be, became, because, becomes, been,
    at, behind, being, best, better, between, big, showed, ended, ending,
    both, but, by, asked, backs, can, cannot, number, numbers...

13
IR Topics
  • A stop list can be implemented by
  • Identifying and removing stop words from the lexical analyzer's output
  • searching for each isolated word in a list (hash table)
  • Filtering stop words as part of the lexical analysis process itself
  • using finite state automata
14
IR Topics
  • Stop lists FSA

15
IR Topics
  • Preprocessing (Content based IF)
  • Using meaningful terms only - dimensionality
    reduction
  • Stemmers

16
IR Topics
  • Stemmers
  • Affix removal, mainly suffixes (Porter)
  • Table lookup
  • Successor variety
  • N-grams
  • Issues
  • Side effects
  • Is the stem a real word?
  • Ambiguity

17
IR Topics
  • A table of all index terms and their stems

18
IR Topics
  • Disadvantages
  • A stemmer table for English does not exist.
  • What about other languages?
  • Storage overhead.
  • Advantage
  • Easy to implement(?), efficient (search time)
  • Could work well for static collections.

19
IR Topics
  • Successor Variety
  • The successor variety of a string is the number of
    different characters that follow it in words in
    the same body of text.
  • Example
  • Corpus: able, axle, accident, ape, about
  • Successor varieties for prefixes of "apple"
  • For "a": 4 (followed by b, x, c, p)
  • For "ap": 1 (followed only by e)

20
IR Topics
  • Successor Variety
  • Implementation method examples
  • Complete word: a break is made after a segment if
    that segment is a complete word in the corpus.
  • Peak and plateau: a segment break is made after a
    character whose successor variety exceeds that of
    the character immediately preceding it and of the
    character immediately following it.

21
IR Topics
  • Successor Variety
  • Example
  • Test word: readable
  • Corpus: able, ape, beatable, fixable, read,
    readable, reading, reads, red, rope, ripe
  • Prefix: successor variety (following letters)
  • r: 3 (e, i, o)
  • re: 2 (a, d)
  • rea: 1 (d)
  • read: 3 (a, i, s)
  • reada: 1 (b)
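The worked example can be reproduced in a few lines (a sketch; the corpus and test word are taken from the slide):

```python
# Successor-variety computation for the slide's corpus, plus
# complete-word segmentation of the test word "readable".
CORPUS = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

def successor_variety(prefix, corpus):
    """Number of distinct characters that follow `prefix` in the corpus."""
    return len({w[len(prefix)] for w in corpus
                if w.startswith(prefix) and len(w) > len(prefix)})

word = "readable"
for i in range(1, 6):
    p = word[:i]
    print(p, successor_variety(p, CORPUS))
# r 3, re 2, rea 1, read 3, reada 1 -- matching the slide

# Complete-word method: break after a segment that is itself a corpus word
breaks = [i for i in range(1, len(word)) if word[:i] in CORPUS]
print(breaks)  # [4] -> segments "read" + "able"
```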

22
IR Topics
  • Successor Variety
  • Example
  • By both methods, readable will be segmented into
    read and able
  • Which will be selected?
  • If the first segment appears in fewer than 12
    words in the corpus, select it; otherwise select
    the second
  • This is due to the observation that very frequent
    segments may be prefixes

23
IR Topics
  • N-grams: the shared digram method
  • Terms are broken into n consecutive letters (n = 2:
    pairs of letters)
  • Association measures are calculated between pairs
    of terms, based on shared unique digrams.
  • Example
  • statistics: st ta at ti is st ti ic cs
  • unique digrams: at cs ic is st ta ti
  • statistical: st ta at ti is st ti ic ca al
  • unique digrams: al at ca ic is st ta ti
  • 6 shared digrams

24
IR Topics
  • Similarity measure (Dice's coefficient): S = 2C / (A + B)
  • A is the number of unique digrams in the first word
  • B is the number of unique digrams in the second word
  • C is the number of shared digrams
  • For the example above: S = 2 × 6 / (7 + 8) = 0.8
  • Similar words are grouped together, represented
    by the shared digrams
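A minimal sketch of the shared-digram method, reproducing the statistics/statistical example (function names are illustrative):

```python
# Dice's coefficient over unique digrams: S = 2C / (A + B).
def digrams(word):
    """Set of unique digrams (adjacent letter pairs) of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def digram_similarity(w1, w2):
    a, b = digrams(w1), digrams(w2)
    shared = len(a & b)
    return 2 * shared / (len(a) + len(b))

print(digram_similarity("statistics", "statistical"))  # 2*6/(7+8) = 0.8
```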

25
IR Topics
  • Affix removal: algorithms remove suffixes and/or
    prefixes from words, leaving a stem
  • Example rules
  • if a word ends in "ies" but not "eies" or "aies",
    then "ies" → "y" (studies → study)
  • if a word ends in "es" but not "aes", "ees" or "oes",
    then "es" → "e" (tables → table; referees is
    not reduced to refere)
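The two example rules can be sketched directly (illustrative only, not a full stemmer):

```python
# The two suffix rules from the slide; order matters, since "ies"
# words also end in "es".
def apply_suffix_rules(word):
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"          # ies -> y
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]                # es -> e (drop the s)
    return word

print(apply_suffix_rules("studies"))   # study
print(apply_suffix_rules("tables"))    # table
print(apply_suffix_rules("referees"))  # referees (unchanged: ends in "ees")
```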

26
IR Topics
  • Longest match vs. simple rules
  • Longest match removes the longest possible
    string of characters
  • Porter's algorithm uses a suffix list for
    suffix stripping.

27
IR Topics
  • Overstemming, e.g. readable → red
  • Understemming, e.g. users → use
  • Accuracy, e.g. skies → sky, not ski; a special
    rule for "k" in plurals is needed.

28
IR Topics
  • Stemming summary
  • May have a positive effect on retrieval performance
  • Will not degrade performance
  • May reduce the size of document representations
    and indices
  • Increases recall at the cost of decreased precision
    (what the heck is he talking about???)

29
IR Topics
  • Preprocessing (Content based IF)
  • Using meaningful terms only - dimensionality
    reduction
  • Dictionaries (Thesaurus, Ontology)

30
IR Topics
  • Dictionary/Thesaurus/Ontology
  • Topic related terms
  • Linguistic correctness of results
  • Issues
  • Ambiguity
  • Context related

31
IR Topics
  • Thesauri
  • Term relationships
  • Equivalence
  • Hierarchical
  • Non hierarchical
  • Specificity of Vocabulary
  • Manual/Automatic construction
  • Based on collections of documents
  • Merging existing Thesauri

32
IR Topics
  • IR classical models
  • Boolean
  • Vector space
  • Probabilistic
  • Model should provide
  • Document and queries representation
  • Matching techniques / similarity measure

33
IR Topics
  • Document Representation - Boolean
  • Boolean Model
  • Based on mutual occurrence of terms in documents
    and queries
  • If sim(dj, q) = 1, then the Boolean model predicts
    that document dj is relevant to query q
    (it might not be). Otherwise, the prediction is
    that the document is not relevant.

34
IR Topics
  • Document Representation - Boolean
  • Boolean Model
  • Clean formal definition.
  • Boolean operators AND, OR, NOT
  • Simple implementation
  • Intuitive
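The simple implementation can be sketched as set operations over an inverted index (the postings lists below are hypothetical toy data):

```python
# Toy Boolean retrieval: AND/OR/NOT map to set intersection,
# union, and difference over postings lists of document IDs.
index = {
    "database":    {1, 2, 4},
    "text":        {1, 3},
    "information": {2, 3, 4},
}

# Query: database AND (text OR information)
hits = index["database"] & (index["text"] | index["information"])
print(sorted(hits))  # [1, 2, 4]

# Query: database AND NOT text
print(sorted(index["database"] - index["text"]))  # [2, 4]
```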

35
IR Topics
  • Document Representation - Boolean
  • Boolean Model
  • Very rigid: AND means all; OR means any.
  • Difficult to express complex user requests.
  • Difficult to control the number of documents
    retrieved.
  • All matched documents will be returned.
  • Difficult to rank output.
  • All matched documents logically satisfy the
    query.

36
IR Topics
  • Document Representation - Statistical
  • A document is typically represented by a bag of
    words (unordered words with frequencies).

37
IR Topics
  • Bag: a set that allows multiple occurrences of the
    same element.
  • The user specifies a set of desired terms with
    optional weights
  • Weighted query terms
  • Q = < database 0.5; text 0.8; information 0.2 >
  • Unweighted query terms
  • Q = < database, text, information >
  • No Boolean conditions are specified in the query.
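The weighted query from the slide can be scored against a document when both are held as sparse vectors (the document weights below are hypothetical):

```python
# Weighted query as a sparse vector (dict term -> weight),
# scored against a document's term weights by inner product.
query = {"database": 0.5, "text": 0.8, "information": 0.2}
doc   = {"text": 1.0, "information": 2.0, "retrieval": 1.0}  # hypothetical

# Inner-product score over shared terms only
score = sum(w * doc[t] for t, w in query.items() if t in doc)
print(round(score, 2))  # 0.8*1.0 + 0.2*2.0 = 1.2
```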

38
IR Topics
  • Document Representation - Statistical
  • Retrieval based on similarity between query and
    documents.
  • Output documents are ranked according to
    similarity to query.
  • Similarity based on occurrence frequencies of
    keywords in query and document.

39
IR Topics
  • Document Representation
  • Vector Space Model
  • Vector of terms (weighted or not)
  • Linear Algebra
  • Vector similarity implies document similarity
  • Issues
  • Document size
  • Vector size
  • Independence
  • Multimedia

40
IR Topics
  • Document Representation
  • Vector Space Model Issues
  • Document size
  • Vector size
  • Independence
  • Multimedia

41
IR Topics
  • Document Representation - Statistical
  • Vector Space Model
  • How to determine important words in a document?
  • Word sense?
  • How to determine the degree of importance of a
    term within a document and within the entire
    collection?
  • How to determine the degree of similarity between
    a document and the query?

42
IR Topics
  • Term independence: a false assumption
  • Are terms really independent?
  • LSI (Latent Semantic Indexing) addresses this

43
IR Topics
  • Document Representation (cont)
  • Boolean
  • Term present/absent
  • TF
  • Term frequency: relevancy/importance of the term
  • DF
  • Document frequency: term usage across documents,
    discrimination power
  • TFIDF
  • Combination of the above

44
IR Topics
  • TFIDF
  • TF normalization
  • by document length
  • by maximum term frequency (tf = freq / max freq)
  • IDF
  • calculation (idf = log(N / df))

45
IR Topics
  • TFIDF Example
  • Given a document containing terms with
    frequencies
  • A(3), B(2), C(1)
  • Assume a collection contains 10,000 documents and
  • document frequencies of these terms are
  • A(50), B(1300), C(250)
  • Then (using the natural log)
  • A: tf = 3/3, idf = log(10000/50) = 5.3, tf-idf = 5.3
  • B: tf = 2/3, idf = log(10000/1300) = 2.0, tf-idf = 1.3
  • C: tf = 1/3, idf = log(10000/250) = 3.7, tf-idf = 1.2
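The example can be recomputed as follows (a sketch; note that the slide's value for B follows from the rounded idf):

```python
import math

# Recomputing the TFIDF example: natural log, tf normalized by the
# maximum in-document frequency (3).
N = 10_000
freq = {"A": 3, "B": 2, "C": 1}
df   = {"A": 50, "B": 1300, "C": 250}

for t in "ABC":
    tf = freq[t] / max(freq.values())
    idf = math.log(N / df[t])
    print(t, round(tf, 2), round(idf, 1), round(tf * idf, 1))
# A: tf=1.0, idf=5.3, tf-idf=5.3; C: tf=0.33, idf=3.7, tf-idf=1.2
# (B comes out 1.4 with full precision; the slide's 1.3 uses the
#  rounded idf of 2.0)
```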

46
IR Topics
  • Similarity between the vectors for document dj
    and query q can be computed as the vector inner
    product
  • sim(dj, q) = dj · q = Σi wij × wiq
  • where wij is the weight of term i in document j
    and wiq is the weight of term i in the query
  • For binary vectors, the inner product is the
    number of matched query terms in the document
    (size of the intersection).
  • For weighted term vectors, it is the sum of the
    products of the weights of the matched terms.

47
IR Topics
  • Similarity between the vectors for document dj
    and query q can be computed as the cosine
    similarity: the cosine of the angle between the
    two vectors

CosSim(dj, q) = (dj · q) / (|dj| |q|) = Σi wij wiq / (√(Σi wij²) × √(Σi wiq²))
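Both the inner product and the cosine measure can be sketched over sparse term vectors (the weights below are hypothetical):

```python
import math

# Dot product and cosine similarity between weighted term vectors,
# held as dicts keyed by term.
def dot(u, v):
    return sum(w * v[t] for t, w in u.items() if t in v)

def cos_sim(u, v):
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot(u, v) / (norm(u) * norm(v))

d = {"database": 2.0, "text": 1.0}       # hypothetical document vector
q = {"database": 1.0, "information": 1.0}  # hypothetical query vector
print(dot(d, q))      # 2.0
print(cos_sim(d, q))  # 2 / (sqrt(5) * sqrt(2)) ≈ 0.632
```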
48
IR Topics
  • Similarity between the vectors for document dj
    and query q can also be computed as the Euclidean
    distance between the vectors
  • Many more techniques exist

49
IR Topics
  • Document Representation
  • Probabilistic
  • Binary Independence Retrieval (BIR) Model
  • T = {t1, ..., tn}: the set of terms in the collection
  • qk: a query
  • dm: a document
  • The BIR model assigns weights to query terms
    appearing in a document
  • Retrieval function ρBIR(qk, dm)

50
IR Topics
  • Document Representation
  • Probabilistic (cont)
  • Document D is composed of a set of index terms
    ti.
  • We will use them to represent the document
  • Index terms can appear in relevant documents as
    well as in non-relevant documents, so we have two
    probabilities for every term

51
IR Topics
  • Document Representation
  • Probabilistic
  • Term weight is based on its frequency in relevant
    vs. non-relevant documents in the corpus, so each
    term has two values
  • If these two values are known, the probability of
    relevance of a new document can be calculated
    from them

52
IR Topics
  • Document Representation
  • Probabilistic (cont)
  • Determines the probability that a document is
    relevant to a specific query.
  • How do we determine whether a given document Dj is
    relevant to query Qi?
  • Use Bayes' theorem: P(R | D) = P(D | R) P(R) / P(D)
  • Considering odds: O(R | D) = P(R | D) / P(NR | D)

53
IR Topics
  • Document Representation
  • Probabilistic (cont)
  • Document D is composed of a set of index terms ti.
  • We will use them to represent the document
  • Index term weights are all binary
  • The odds that a document is relevant are
    O(R | D) = O(R) × P(D | R) / P(D | NR)

54
IR Topics
  • Document Representation
  • Probabilistic (cont)
  • Split according to presence / absence of index
    terms

55
IR Topics
  • Document Representation
  • Probabilistic (cont)
  • pi: the probability that ti occurs in an arbitrary
    relevant document
  • qi: the probability that ti occurs in an arbitrary
    non-relevant document

56
IR Topics
  • Document Representation
  • Probabilistic (cont)
  • Assume that the index terms occur independently of
    one another
  • Then the odds decompose into a product of per-term
    factors
  • Only the first product varies for different documents
    with respect to qk

57
IR Topics
  • Document Representation
  • Probabilistic (cont)
  • Take logarithms to turn the product into a sum
  • Retrieval function
  • ρBIR(qk, dm) = Σ over ti in qk ∩ dm of
    log( pi (1 − qi) / (qi (1 − pi)) )
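A hedged sketch of the standard BIR scoring function, summing log-odds weights over query terms present in the document (the pi and qi values below are hypothetical):

```python
import math

# BIR score: for each query term present in the document, add
# log(p*(1-q) / (q*(1-p))), where p = P(term | relevant) and
# q = P(term | non-relevant).
def bir_score(doc_terms, query_terms, p, q):
    return sum(math.log(p[t] * (1 - q[t]) / (q[t] * (1 - p[t])))
               for t in doc_terms & query_terms)

p = {"database": 0.8, "text": 0.5}  # hypothetical P(term | relevant)
q = {"database": 0.3, "text": 0.4}  # hypothetical P(term | non-relevant)
score = bir_score({"database", "text", "other"}, {"database", "text"}, p, q)
print(round(score, 2))  # ≈ 2.64
```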

58
IR Topics
  • Query Representation
  • Boolean
  • Statistical
  • TFIDF (actually IDF alone)
  • Stemming (optional)
  • Expansion (optional)
  • Issues
  • Small number of terms
  • Exact meaning (context, expansion)

59
IR Topics
  • Similarity
  • Distance
  • Cosine
  • Euclidean distance
  • Dot product
  • ...
  • Probabilistic measures
  • Thresholds

60
IR Topics
  • Presentation
  • Results ordering
  • by similarity
  • in order of decreasing similarity
  • Presentation
  • Top n
  • user-requested number
  • first n

61
IR Topics
  • Relevance
  • Relevance is a subjective judgment and may
    include
  • Being on the proper subject.
  • Being timely (recent information).
  • Being authoritative (from a trusted source).
  • Satisfying the goals of the user and his/her
    intended use of the information (information
    need).

62
IR Topics
  • Relevance
  • Subjective
  • Measurable
  • Ambiguous
  • Helpful

63
IR Topics
  • Discussion
  • Boolean model: the weakest, no partial match
  • Vector space model is very popular, easy to
    implement, and expected to perform better.
  • Probabilistic model has a better theoretical
    background (?) and is considered better than the
    previous two (?)
  • The independence assumption is wrong in both the
    probabilistic model and the vector space model

64
IR Topics
  • Content based IF (IR based)
  • User information needs are defined by a set of
    (possibly weighted) terms (vector
    space/probabilistic).
  • Data-items (e.g. documents) are represented in a
    similar way.
  • User needs and data-item representations
    (vectors) are matched/correlated.