1
Information Retrieval and Recommendation
Techniques

2
Abstraction
  • Reality (the real world) cannot be known in its entirety.
  • Reality is represented by a collection of data
    abstracted from observation of the real world.
  • Information need drives the storage and retrieval
    of information.
  • Relationships among reality, information need,
    data and query (see Figure 1.1).

3
Information Systems
  • Two portions: endosystem and ectosystem.
  • Ectosystem has three human components:
  • User
  • Funder
  • Server: the information professional who operates the system and provides service to the user.
  • Endosystem has four components:
  • Media
  • Devices
  • Algorithms
  • Data structures

4
Measures
  • The performance is dictated by the endosystem but judged by the ectosystem.
  • The user is mainly concerned about effectiveness.
  • The server is more aware of efficiency.
  • The funder is more concerned about the economy of the system.
  • This course concentrates primarily on effectiveness measures.
  • The so-called user satisfaction has many meanings, and different users may use different criteria.
  • A fixed set of criteria must be established for fair comparison.

5
From Signal to Wisdom
  • Five stepping stones:
  • Signal: bit stream, wave, etc.
  • Data: impersonal, available to any user.
  • Information: a set of data matched to a particular information need.
  • Knowledge: coherence of data, concepts, and rules.
  • Wisdom: a balanced judgment in the light of certain value criteria.

6
Chapter 2 Document and Query Forms
7
What is a document?
  • A paper or a book? A section or a chapter?
  • There is no strict definition of the scope and format of a document.
  • The document concept can be extended to include
    programs, files, email messages, images, voices,
    and videos.
  • However, most commercial IR systems handle
    multimedia documents through their textual
    representations.
  • The focus of this course is on text retrieval.

8
Data Structures of Documents
  • Fully formatted documents: typically, these are entities stored in DBMSs.
  • Fully unformatted documents: typically, these are data collected via sensors (e.g., medical monitoring), sound and image data, and raw text from a text editor.
  • Most textual documents, however, are semi-structured, including title, author, source, abstract, and other structural information.

9
Document Surrogates
  • A document surrogate is a limited representation of a full document. It is the main focus of storing and querying for many IR systems.
  • How to generate and evaluate document surrogates in response to users' information needs is an important topic.

10
Ingredients of document surrogates
  • Document identifier: could be a meaningless identifier such as a record id, or a more elaborate identifier such as the Library of Congress classification scheme for books (e.g., T210 C37 1982).
  • Title
  • Names: author, corporate body, publisher
  • Dates: for timeliness and appropriateness
  • Unit descriptors: Introduction, Conclusion, Bibliography

11
Ingredients of document surrogates
  • Keywords
  • Abstract: a brief one- or two-paragraph description of the contents of a paper.
  • Extracts: similar to an abstract but created by someone other than the authors.
  • Review: similar to an extract but meant to be critical. The review itself is a separate document that is worth retrieving.

12
Vocabulary Control
  • It specifies a finite vocabulary to be used for specifying keywords.
  • Advantages:
  • Uniformity throughout the retrieval system
  • More efficient
  • Disadvantages:
  • Authors/users cannot give/retrieve more detailed information.
  • Most IR systems nowadays opt for an uncontrolled vocabulary and rely on a sound internal thesaurus for bringing related terms together.

13
Encoding Standards
  • ASCII: a standard for English text encoding. However, it does not cover characters of different fonts, mathematical symbols, etc.
  • Big-5: traditional Chinese character set with 2 bytes.
  • GB: simplified Chinese character set with XX bytes.
  • CCCII: a full traditional Chinese character set with at most 6 bytes.
  • Unicode: a unified encoding trying to cover characters from multiple nations.

14
Markup languages
  • Initially used by word processors (.doc, .tex) and printers (.ps, .pdf).
  • Recently used for representing a document with hypertext information (HTML, SGML) on the WWW.
  • A document written in a markup language can be segmented into several portions that better represent that document for searching.

15
Query Structures
  • Two types of matches:
  • Exact match (equality match and range match)
  • Approximate match

16
Boolean Queries
  • Based on Boolean algebra
  • Common connectives: AND, OR, NOT
  • E.g., A AND (B OR C) AND D
  • Each term could be expanded by stemming or by a list of related terms from a thesaurus.
  • E.g., inf → information, vegetarian → mideastern countries
  • A XOR B ≡ (A AND NOT B) OR (NOT A AND B)
  • By far the most popular retrieval approach.

17
Boolean Queries (Cont'd)
  • Additional operators:
  • Proximity (e.g., icing within 3 words of chocolate)
  • K out of N terms (e.g., 3 OF (A, B, C))
  • Problems:
  • No good way to weigh terms
  • E.g., for music by Beethoven, preferably sonata: (Beethoven AND sonata) OR (Beethoven)
  • Easy to misuse (e.g., people who would like to have dinner with sports or symphony may specify dinner AND sports AND symphony).

18
Boolean Queries (Cont'd)
  • The order of precedence may not be natural to users (e.g., A OR B AND C). People tend to interpret requests depending on the semantics.
  • E.g., coffee AND croissant OR muffin
  • raincoat AND umbrella OR sunglasses
  • Users may construct highly complex queries.
  • There are techniques for simplifying a given query into disjunctive normal form (DNF) or conjunctive normal form (CNF).
  • It has been shown that every Boolean expression can be converted to an equivalent DNF or CNF.

19
Boolean Queries (Cont'd)
  • DNF: a disjunction of several conjuncts, each of which joins terms by AND.
  • E.g., (A AND B) OR (A AND NOT C)
  • (A AND B AND C) OR (A AND B AND NOT C) is equivalent to (A AND B).
  • CNF: a conjunction of several disjuncts, each of which joins terms by OR.
  • Normalization to DNF can be done by looking at the TRUE rows of the truth table, while normalization to CNF can be done by looking at the FALSE rows.

20
Boolean Queries (Cont'd)
  • The size of the returned set could be explosively large. Solution: return only a limited number of records.
  • Though there are many problems with Boolean queries, they are still popular because people tend to use only two or three terms at a time.

21
Vector Queries
  • Each document is represented as a vector, or a
    list of terms.
  • The similarity between a document and a query is
    based on the presence of terms in both the query
    and the document.
  • The simplest model is the 0-1 vector. A more general model is the weighted vector.
  • Assigning weights to a document or a query is a
    complex process.
  • It is reasonable to assume that more frequent
    terms are more important.

22
Vector Queries (Cont'd)
  • It is better to give users the freedom to assign weights. In this case, a conversion between user weights and system weights must be done (see the conversion equation).
  • There are two types of vector queries (for similarity search):
  • Top-N queries
  • Threshold-based queries

23
Extended Boolean Queries
  • This approach incorporates weights into Boolean queries. A general form is A^w1 op B^w2 (e.g., A^0.2 AND B^0.6).
  • A OR B^0.2 retrieves all documents that contain A, plus those documents in B that are within the top 20% closest to the documents in A.
  • A OR B^1 ≡ A OR B
  • A OR B^0 ≡ A
  • See Figure 3.1 for a diagrammatic illustration.

24
Extended Boolean Queries (Cont'd)
  • A AND B^0.2:
  • A AND B^0 ≡ A
  • A AND B^1 ≡ A AND B
  • See Figure 3.2 for a graphical illustration.
  • A AND NOT B^0.2:
  • A AND NOT B^0 ≡ A
  • A AND NOT B^1 ≡ A AND NOT B
  • See Figure 3.3 for a graphical illustration.
  • A^0.2 OR B^0.6 returns the 20% of the documents in A−B that are closest to B and the 60% of the documents in B−A that are closest to A.

25
Extended Boolean Queries (Cont'd)
  • See Example 3.1.
  • One needs to define the distance between a document and a set of documents (e.g., the set containing A).
  • The computation of an extended Boolean query could be time-consuming.
  • This model has not become popular.

26
Fuzzy Queries
  • It is based on fuzzy sets.
  • In a fuzzy set S, each element in S is associated with a membership grade.
  • Formally, S = { <x, μS(x)> | μS(x) > 0 }.
  • A ∩ B = { x | x ∈ A and x ∈ B }, with μ(x) = min(μA(x), μB(x)).
  • A ∪ B = { x | x ∈ A or x ∈ B }, with μ(x) = max(μA(x), μB(x)).
  • NOT A = { x | x ∉ A }, with μ(x) = 1 − μA(x).
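
A minimal sketch of these fuzzy set operations in Python (the dict-based representation and the names are illustrative, not from the text):

```python
# Fuzzy sets as dicts mapping elements to membership grades in (0, 1].

def fuzzy_and(a, b):
    # Intersection: min of the membership grades
    return {x: min(a[x], b[x]) for x in a if x in b}

def fuzzy_or(a, b):
    # Union: max of the membership grades
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in set(a) | set(b)}

def fuzzy_not(a, universe):
    # Complement over a given universe: 1 - grade
    return {x: 1.0 - a.get(x, 0.0) for x in universe}

d1 = {"retrieval": 0.8, "fuzzy": 0.3}
d2 = {"retrieval": 0.5, "query": 0.9}
print(fuzzy_and(d1, d2))  # {'retrieval': 0.5}
print(fuzzy_or(d1, d2))   # max grade per term across both sets
```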

27
Fuzzy Queries (Cont'd)
  • To use fuzzy queries, documents must be fuzzy
    too.
  • The documents are returned to the users in
    decreasing order of their fuzzy values associated
    with the fuzzy query.

28
Probabilistic Queries
  • Similar to fuzzy queries, but now the membership values are probabilities.
  • The probability of a document in association with a query (or term) can be calculated through probability theory (e.g., Bayes' theorem) after some observation.

29
Natural Language Queries
  • Convenient
  • Imprecise, inaccurate, and frequently ungrammatical.
  • The difficulties lie in obtaining an accurate interpretation of a longer text, which may rely on common sense.
  • A successful system must be restricted to a narrowly defined domain (e.g., not medicine in general, but the diagnosis of illness).

30
Information Retrieval and Database Systems
  • Should one use a database system to handle information retrieval requests?
  • A DBMS is a mature and successful technology for handling precise queries.
  • It is not appropriate for handling imprecise textual elements.
  • OODBs provide augmented functions for textual or image elements and are considered good candidates.

31
The Matching Process
32
Boolean based matching
  • It divides the document space into two parts: those documents satisfying the query and those that do not.
  • Finer grading of the set of retrieved documents can be defined based on the number of terms satisfied (e.g., A OR B OR C).

33
Vector-based matching
  • Measures
  • Based on the idea of distance:
  • Minkowski metric (L_q): L_q(X_i, X_j) = (|X_i1 − X_j1|^q + |X_i2 − X_j2|^q + … + |X_ip − X_jp|^q)^(1/q)
  • Special cases: Manhattan distance (q = 1), Euclidean distance (q = 2), and maximum direction distance (q = ∞).
  • See the example on p. 133.
  • Based on the idea of angle:
  • Cosine function: (Q · D) / (|Q| |D|).
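
For concreteness, a small sketch of both measure families in Python (function names are illustrative):

```python
import math

def minkowski(x, y, q):
    # Minkowski metric L_q; q=1 is Manhattan, q=2 is Euclidean
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def max_direction(x, y):
    # Limit case q -> infinity: the maximum coordinate difference
    return max(abs(a - b) for a, b in zip(x, y))

def cosine(q_vec, d_vec):
    # Cosine of the angle between query and document vectors
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norms = (math.sqrt(sum(a * a for a in q_vec))
             * math.sqrt(sum(b * b for b in d_vec)))
    return dot / norms if norms else 0.0

print(minkowski([1, 3], [3, 1], 1))  # 4.0 (Manhattan)
print(cosine([1, 3], [100, 300]))    # 1.0: identical direction
```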

34
Mapping distance to similarity
  • It is better to map distance (or dissimilarity) into some range, e.g., [0, 1].
  • A simple inversion function is ν = b − u.
  • A more general inversion function is ν = b − p(u), where p(u) is a monotone nondecreasing function such that p(0) = 0.
  • See Fig. 4.1 for a graphical illustration.

35
Distance or cosine?
  • <1, 3>, <100, 300>, <3, 1>: which pair is similar?
  • In practice, distance and angular measures seem to give results of similar quality, because the documents of a cluster all lie in roughly the same direction.

36
Missing terms and term relationships
  • The conventional value 0 is ambiguous; it can mean:
  • Truly missing
  • No information
  • However, if 0 is regarded as undefined, it becomes impossible to measure the distance between two documents (e.g., <3, −> and <−, 4>).
  • Terms used to define the vector model are clearly not independent; e.g., digital and computer have a strong relationship.
  • However, the effect of dependent terms is hardly known.

37
Probability matching
  • For a given query, we can define the probability that a document is relevant as P(rel) = n/N.
  • The discriminant function on the selected set is dis(selected) = P(rel | selected) / P(¬rel | selected).
  • The desirable discriminant function value of a set is at least 1.
  • Let a document be represented by terms t1, …, tn that are statistically independent; then P(selected | rel) = P(t1 | rel) P(t2 | rel) … P(tn | rel).
  • We can use Bayes' theorem to calculate the probability that a document should be selected.
  • See Example 4.1.

38
Fuzzy matching
  • The issue is how to define the fuzzy grade of documents w.r.t. a query.
  • One can define the fuzzy grade based on the closeness of a document to the query.

39
Proximity matching
  • The proximity criterion can be used independently of any other criteria.
  • A modification is to use phrases rather than words. But this causes problems in some cases (e.g., information retrieval vs. the retrieval of information).
  • Another modification is to use the order of words (e.g., junior college vs. college junior). However, this still causes the same problem as before.
  • Many systems introduce a measure on the proximity.

40
Effects of weighting
  • Weights can be given to sets of words, rather than individual words.
  • E.g., (beef and broccoli) = 5, (beef but not broccoli) = 2, (broccoli but not beef) = 2, noodles = 1, snow peas = 1, water chestnuts = 1.

41
Effects of scaling
  • An extensive collection is likely to contain fewer additional relevant documents.
  • Information filtering aims at producing a relatively small set.
  • Another possibility is to use several models together, leading to so-called data fusion.

42
A user-centered view
  • Each user has an individual vocabulary that may not match that of the author, editor, or indexer.
  • Many times, the user does not know how to specify his/her information need: "I'll know it when I see it." Therefore, it is important to allow users direct access to the data (browsing).

43
Text Analysis
44
Indexing
  • Indexing is the act of assigning index terms to a document.
  • Many nonfiction books have indexes created by their authors.
  • The indexing language may be controlled or uncontrolled.
  • For manual indexing, an uncontrolled indexing language is generally used.
  • It lacks consistency (the agreement in index term assignment may be as little as 20%).
  • It is difficult for fast-evolving fields.

45
Indexing (Cont'd)
  • Characteristics of an indexing language:
  • Exhaustivity (the breadth) and specificity (the depth)
  • The ingredients of indexes:
  • Links (terms that occur together)
  • Roles
  • Cross referencing:
  • See: Coal, see Fuel
  • Related terms: microcomputer, see also personal computer
  • Broader term (BT): poodle, BT dog
  • Narrower term (NT): dog, NT poodle, cocker spaniel, pointer

46
Indexing (Cont'd)
  • Automatic indexing will play an ever-increasing role.
  • Approaches for automatic indexing:
  • Word counting
  • Based on deeper linguistic knowledge
  • Based on semantics and concepts within a document collection
  • Often an inverted file is used to store the indexes of documents in a document collection.

47
Matrix Representations
  • Term-document matrix A:
  • Aij indicates the occurrence or the count of term i in document j.
  • Term-term matrix T:
  • Tij indicates the co-occurrence or the co-occurrence count of terms i and j.
  • Document-document matrix D:
  • Dij indicates the degree of term overlap between documents i and j.
  • These matrices are usually sparse and are better stored as lists.

48
Term Extraction and Analysis
  • It has been observed that the frequencies of words in a document follow the so-called Zipf's law (f = k·r^(−1)), i.e., they are proportional to 1, 1/2, 1/3, 1/4, …
  • Many similar observations have been made:
  • Half of a document is made up of 250 distinct words.
  • 20% of the text words account for 70% of term usage.
  • None of the observations are supported by Zipf's law.
  • High-frequency terms are not desirable as index terms because they are so common.
  • Rare words are not desirable because very few documents will be retrieved.

49
Term Association
  • Term association is expanded with the concept of word proximity.
  • A proximity measure depends on:
  • The number of intervening words
  • The number of words appearing in the same sentence
  • Word order
  • Punctuation
  • However, there are risks: compare "The felon's information assured the retrieval of the money", "the retrieval of information", and "information retrieval".

50
Term significance
  • Frequent words in a document collection may not be significant (e.g., digital computer in a computer science collection).
  • Absolute term frequency ignores the size of a document.
  • Relative term frequency is often used:
  • Absolute term frequency / length of the document.
  • Term frequency in a document collection:
  • Total frequency count of a term / total words in the documents of the collection, or
  • Number of documents containing the term / total number of documents.

51
How to adjust the frequency weight of a term
  • Inverse document frequency weight:
  • N: total number of documents.
  • dk: number of documents containing term k.
  • fik: absolute frequency of term k in document i.
  • wik: the weight of term k in document i.
  • idfk = log2(N/dk) + 1
  • wik = fik × idfk
  • This weight assignment is called TF-IDF.
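
A minimal sketch of this TF-IDF assignment in Python, assuming each document is given as a term-frequency dict:

```python
import math

def tfidf_weights(docs):
    # docs: list of term-frequency dicts, one per document
    # returns w_ik = f_ik * (log2(N / d_k) + 1) for each document
    n = len(docs)
    df = {}  # d_k: number of documents containing term k
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    return [
        {term: f * (math.log2(n / df[term]) + 1) for term, f in doc.items()}
        for doc in docs
    ]

docs = [{"digital": 3, "computer": 2}, {"computer": 1, "poodle": 4}]
print(tfidf_weights(docs))  # "computer" is discounted: it occurs in both docs
```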

52
How to adjust the frequency weight of a term (Cont'd)
  • Signal-to-noise:
  • H(p1, p2, …, pn): the information content of a document, with pi being the probability of word i.
  • Requirements:
  • H is a continuous function of the pi.
  • If pi = 1/n for all i, H is a monotone increasing function of n.
  • H preserves the partitioning property:
  • H(1/2, 1/3, 1/6) = H(1/2, 1/2) + 1/2·H(2/3, 1/3) = H(2/3, 1/3) + 2/3·H(3/4, 1/4)
  • The entropy function H = −Σi pi log2 pi satisfies all three requirements.

53
How to adjust the frequency weight of a term (Cont'd)
  • The more frequent a word is, the less information it carries.
  • The noise nk of index term k is defined as nk = Σi (fik/tk)·log2(tk/fik), where tk = Σi fik is the total frequency of term k.
  • The signal sk of index term k is defined as sk = log2(tk) − nk.
  • The weight wik of term k in document i is wik = fik·sk.

54
How to adjust the frequency weight of a term (Cont'd)
  • Term discrimination value:
  • Compute the average similarity of the document collection.
  • A centroid document D is used, where fk = tk/N.
  • The discrimination value of term k is δk, the average similarity with term k removed minus the average similarity.
  • wik = fik·δk

55
Phrases and Proximity
  • Weighting schemes discriminate against phrases.
  • How to compensate?
  • Count both the individual words and the phrase.
  • Account for the number of words in a phrase:
  • 1 + log(number of words in the phrase)
  • How to handle proximity queries?
  • Documents containing the involved words are identified first, followed by the judgment of the proximity criteria.
  • Direct analysis of a document collection can be done by using standard vocabulary analysis (e.g., the Brown corpus).

56
Pragmatic Factors
  • Identifying trigger phrases:
  • Words such as "conclusion" and "finding" identify key points and ideas in a document.
  • Weighting authors
  • Weighting journals
  • Users' pragmatic factors:
  • Education level
  • Novice or expert in an area

57
Document Similarity
  • Similarity metrics for 0-1 vectors.
  • Contingency table for document-to-document match:

             D2 = 1    D2 = 0
  D1 = 1     w         x         n1
  D1 = 0     y         z         N − n1
             n2        N − n2    N
58
Document similarity
  • If D1 and D2 are independent, w/N = (n1/N)(n2/N).
  • We can define the basic comparison between D1 and D2 as δ(D1, D2) = w − n1·n2/N.
  • In general, the similarity between D1 and D2 is defined as δ(D1, D2) divided by a coefficient of association (see the next slide).

59
Various ways for defining the coefficient of association
  • Separation coefficient: N/2.
  • Rectangular distance: max(n1, n2).
  • Conditional probability: min(n1, n2).
  • Vector angle: (n1·n2)^(1/2).
  • Arithmetic mean: (n1 + n2)/2.
  • For more, see p. 128.
  • For the relationships, see Table 5.2.

60
Other close similarity metrics
  • Use only w instead of w − n1·n2/N:
  • Dice's coefficient: 2w/(n1 + n2).
  • Cosine coefficient: w/(n1·n2)^(1/2).
  • Overlap coefficient: w/min(n1, n2).
  • Jaccard's coefficient: w/(N − z).
  • Requirements for distance measures:
  • Non-negative
  • Symmetric
  • Triangle inequality: Dist(A, C) ≤ Dist(A, B) + Dist(B, C)

61
Stop lists
  • A stop list (or negative dictionary) consists of very high-frequency words.
  • A typical stop list contains 150-500 words.
  • Any well-defined field may have its own jargon.
  • Words in the stop list should be excluded from later processing.
  • Queries should also be processed against the stop list.
  • However, phrases that contain stop-list words may not always be eliminated (e.g., "to be or not to be").

62
Stemming
  • Computer, computers, computing, compute, computes, computed, computational, computationally, and computable all deal with closely related concepts.
  • Use a stemming algorithm to strip off word endings (e.g., comput-).
  • Watch out for false stripping:
  • bed → b, breed → bre
  • Keep a minimum acceptable stem length, keep a small list of exceptional words, and keep various word forms.
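
A toy illustration of suffix stripping with the safeguards above (a minimum stem length and an exception list). The suffix and exception lists here are invented for illustration; real systems use an algorithm such as Porter's:

```python
# Suffixes and exceptions below are illustrative only.
SUFFIXES = ["ationally", "ational", "ation", "ers", "ing", "ed", "es", "s"]
EXCEPTIONS = {"bed": "bed", "breed": "breed", "knives": "kniv"}
MIN_STEM = 3  # minimum acceptable stem length

def stem(word):
    if word in EXCEPTIONS:          # small list of exceptional words
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:         # strip the longest matching ending
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word

for w in ["computers", "computing", "bed", "breed"]:
    print(w, "->", stem(w))  # bed/breed are protected from false stripping
```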

63
Stemming (Cont'd)
  • Stemming may not save much space (about 5%).
  • One can also stem only the queries and then use wild cards in matching.
  • Watch the various word forms; e.g., knife should be expanded as knif- and kniv-.

64
Thesauri
  • A thesaurus contains:
  • Synonyms
  • Antonyms
  • Broader terms
  • Narrower terms
  • Closely related terms
  • A thesaurus can be used during query processing to broaden a query.
  • A similar problem arises with respect to homonyms.

65
Mid-term project
  • Lexical analysis and stoplist (Ch7)
  • Stemming algorithms (Ch8)
  • Thesaurus construction (Ch9)
  • String searching algorithms (Ch10)
  • Relevance feedback and other query modification
    techniques (Ch11)
  • Hashing algorithms (Ch13)
  • Ranking algorithms (Ch14)
  • Chinese text segmentation (to be provided)

66
File Structures
67
Inverted File
  • Structures for an inverted file:
  • Sorted array (Figure 3.1 in the supplement)
  • B-tree (Figure 3.2 in the supplement)
  • Trie
  • A straightforward approach:
  • Parse the text to get a list of (word, location) pairs.
  • Sort the list in ascending order of word.
  • Weight each word.
  • See Figures 3.3 and 3.4 in the supplement.
  • Hard to evolve.
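
A minimal sketch of the straightforward approach in Python (weighting omitted; the data layout is illustrative):

```python
from collections import defaultdict

def build_inverted_file(docs):
    # docs: dict of doc_id -> text
    # returns: term -> {doc_id: [positions]}
    pairs = []
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            pairs.append((word, doc_id, pos))   # (word, location) list
    pairs.sort()                                # ascending order of word
    index = defaultdict(lambda: defaultdict(list))
    for word, doc_id, pos in pairs:
        index[word][doc_id].append(pos)
    return {term: dict(postings) for term, postings in index.items()}

docs = {1: "information retrieval systems", 2: "retrieval of information"}
print(build_inverted_file(docs)["retrieval"])  # {1: [1], 2: [0]}
```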

68
Inverted File (Cont'd)
  • The data structure can be improved for faster searching (Figure 3.5 in the supplement):
  • A dictionary, including:
  • The term and its number of postings
  • A posting file, including:
  • A set of lists, one for each term, whose entries contain:
  • The document id
  • The number of postings in the document
  • See Figure 3.5.

69
Inverted File (Cont'd)
  • The dictionary can be implemented as a B-tree.
  • When a term in a new document is identified:
  • A new tree node is created, or
  • The related data of an existing node is modified.
  • The posting file can be implemented as a set of linked lists.
  • See Table 3.1 for some statistics.

70
Signature File
  • A document is partitioned into a set of blocks, each of which has D keywords.
  • Each keyword is represented by a bit pattern (signature) of size F, with m bits set to 1.
  • The block signature is formed by superimposing (OR-ing) the constituent word signatures.
  • Sig(Q) AND Sig(B) = Sig(Q) if block B contains the words in Q.
  • See Figure 4.1 in the supplement.
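
A sketch of superimposed coding in Python; the values of F and m and the use of Python's built-in hash are illustrative stand-ins, not the book's choices:

```python
F, M = 64, 3  # signature width and bits per word (illustrative values)

def word_signature(word):
    # Set M bit positions derived from the word (Python's hash stands in
    # for the triplet-based hashing described below)
    sig = 0
    for i in range(M):
        sig |= 1 << (hash((word, i)) % F)
    return sig

def block_signature(words):
    # Superimpose (OR) the constituent word signatures
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

block = block_signature(["icing", "chocolate", "cake"])
query = word_signature("chocolate")
# The block may contain the query words only if all query bits are set;
# a match can still be a false drop.
print((query & block) == query)  # True
```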

71
Signature File (Cont'd)
  • Which m bits should be set for a given word?
  • For each 3-letter triplet of word W, a hashing function maps it to a position in [0, F−1].
  • If the number of 1s is less than m, randomly set additional bits.
  • How to set m?
  • It has been shown that when m = F·ln 2 / D, the false drop probability is minimized.

72
Signature File (Cont'd)
  • The signature file could be huge, and sequential search takes time.
  • The signature file is often sparse.
  • Three approaches to reduce query time:
  • Compression
  • Vertical partitioning
  • Horizontal partitioning

73
Signature File (Cont'd)
  • Vertical partitioning:
  • Use F different files, one per bit position.
  • For a query with k bits set, we need to examine k files and then AND them.
  • The qualifying blocks have 1s in the resultant vector.
  • Inserting a block requires writing to F files.

74
Signature File (Cont'd)
  • Horizontal partitioning:
  • Two-level signatures:
  • The first level has N document signatures.
  • Several signatures with a common prefix are grouped together.
  • The second level has group signatures, created by superimposing the constituent document signatures.
  • This approach can be generalized to a B-tree-like structure (called an S-tree).

75
User Profiles and Their Use
76
Simple Profiles
  • A simple profile consists of a set of key terms with given weights, much like a query.
  • Such profiles were originally developed for current awareness (CA) or selective dissemination of information (SDI).
  • The purpose of CA (SDI) is to help researchers keep up with the latest developments in their areas.
  • In a CA system, users are asked to file an interest profile, which must be updated periodically.
  • In fact, the interest profile acts as a routing query.

77
Extended Profiles
  • Extended profiles record background information about a person that might help in determining the types of documents of interest:
  • Education level, familiarity with an area, language fluency, journal subscriptions, reading habits, specific preferences.
  • This type of information cannot be used directly in the retrieval process but must be applied to the retrieved set to organize it.

78
Current Awareness Systems
  • It assumes that the user is adequately aware of past work and needs only to keep abreast of current developments.
  • It operates only on current literature, and actively, without user intervention.
  • The user may redefine a profile at any time, and many systems will periodically remind users to review their profiles.
  • Most CA systems make use of only the simple user profile.
  • Current awareness systems are suitable for a dynamic environment.

79
Retrospective Search Systems
  • The effectiveness of a CA system is difficult to measure because users often examine the presented documents off-line.
  • Unlike a CA system, a retrospective search system has a relatively large and stable database and handles ad-hoc queries.
  • Virtually all existing retrospective search systems do not differentiate among users.

80
Modifying the Query By the Profile
  • A reference librarian may help a person with a request by learning more about this person's background and level of knowledge (e.g., for a request on the theory of groups).
  • A given query may be modified according to the person's profile.
  • Three ways to modify a query:
  • Post-filter: the effort to retrieve documents is substantial.
  • Pre-filter: a food query <calories=3, spiciness=7> may be modified for a user with profile <2, 2> to <2.8, 6>.

81
Modifying the Query By the Profile
  • Suppose Q = <q1, q2, …, qn> and P = <p1, p2, …, pn>.
  • Simple linear transformation: qi' = k·pi + (1−k)·qi.
  • Piecewise linear transformation (see the sketch below):
  • Case 1: pi ≠ 0 and qi ≠ 0: ordinary k value.
  • Case 2: pi = 0 and qi ≠ 0: k is very small (5%).
  • Case 3: pi ≠ 0 and qi = 0: k is smaller (50%).
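
A sketch of the piecewise transformation in Python; the k value for the ordinary case is an assumption (k = 0.2 reproduces the pre-filter example on the previous slide):

```python
# k values follow the slide's three cases; K_NORMAL is assumed.
K_NORMAL, K_QUERY_ONLY, K_PROFILE_ONLY = 0.2, 0.05, 0.5

def modify_query(q, p):
    out = []
    for qi, pi in zip(q, p):
        if pi != 0 and qi != 0:
            k = K_NORMAL         # case 1: both express an interest
        elif pi == 0 and qi != 0:
            k = K_QUERY_ONLY     # case 2: trust the query term
        elif pi != 0 and qi == 0:
            k = K_PROFILE_ONLY   # case 3: let the profile contribute
        else:
            out.append(0.0)
            continue
        out.append(k * pi + (1 - k) * qi)
    return out

print(modify_query([3, 7], [2, 2]))  # [2.8, 6.0], as in the example above
```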

82
Query and Profile as Separate Reference Points
  • Query and profile are treated as co-filters.
  • Four approaches:
  • Disjunctive model: ||D, Q|| ≤ d or ||D, P|| ≤ d.
  • Conjunctive model: ||D, Q|| ≤ d and ||D, P|| ≤ d.
  • Ellipsoidal model: ||D, Q|| + ||D, P|| ≤ d; see Figures 6.2 and 6.3.
  • Cassini oval model: ||D, Q|| × ||D, P|| ≤ d; see Figure 6.4.
  • All the above models can be weighted.
  • Empirical experiments showed that query-profile combinations do provide better performance than the query alone.

83
Multiple Reference Point Systems
  • A reference point is a defined point or concept
    against which a document can be judged.
  • Queries, user profiles, known papers or books are
    reference points.
  • A reference point is sometimes called a point of
    interest (POI).
  • Weights and metrics can be applied to general
    reference points as before.

84
Documents and Document Clusters
  • Each favored document can be treated as a
    reference point.
  • Favored documents can also be clustered. Each
    document cluster may be represented as a cluster
    point.
  • Many statistical techniques can be used to
    cluster documents.
  • The centroid or medoid of a document cluster is
    then used as the reference point.

85
The Mathematical Basis
86
GUIDO
  • Graphical User Interface for Document Organization. Rather than using terms as vector dimensions, GUIDO uses each reference point as a dimension, resulting in a low-dimension space.
  • In a 2-D GUIDO, a document is represented as an ordered pair (x, y), where x is the distance from Q and y is the distance from P. Note that ||P − Q|| = δ.
  • P = (δ, 0), Q = (0, δ).
  • Consider points on the line through P and Q. Three cases:
  • ||D, P|| + ||D, Q|| = δ
  • ||D, P|| = ||D, Q|| + δ
  • ||D, Q|| = ||D, P|| + δ

87
GUIDO
  • For any point not on the line through P and Q:
  • ||D, P|| + ||D, Q|| > δ
  • ||D, P|| + δ > ||D, Q||
  • ||D, Q|| + δ > ||D, P||
  • Observation 1: multiple document points are mapped into the same point in the distance space.
  • Observation 2: complex boundary contours are mapped into simpler contours.
  • In the ellipsoidal model, the contour becomes a straight line parallel to the P-Q line.

88
GUIDO
  • In the weighted ellipsoidal model, the contour is still a straight line but at an angle.
  • If we are looking for a document D where the distance ratio of ||D, P|| to ||D, Q|| is a constant, we have:
  • ||D, Q|| ≤ d/fr (see the general model).
  • Therefore, the contour is a circle in the general model.
  • The contour is a straight line crossing the origin in the GUIDO model because ||D, P|| = k·||D, Q||. See Figure 7.5.
  • With different metrics, the size of the distance space and the locations of documents may change, but the basic shape in the distance space remains.

89
VIBE
  • Visual Information Browsing Environment: a user chooses the positions of the reference points arbitrarily on the screen.
  • The location of a document is determined by the ratios of its similarities to the reference points.
  • Each document is represented as a rectangle whose size reflects its importance (sum of similarities?) to the reference points.

90
VIBE
  • In a 2-POI VIBE, documents are displayed on the line connecting the two POIs.
  • In an n-POI VIBE, let p1, p2, …, pn be the coordinates of the POIs and s1, s2, …, sn be the similarities of a document D to these POIs. The coordinate of D is pd = (Σi si·pi) / (Σi si). (See Example 7.2.)
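
A sketch of this placement rule in Python for a 2-D display:

```python
def vibe_position(poi_coords, sims):
    # poi_coords: list of (x, y) POI screen positions
    # sims: similarities of one document to each POI
    total = sum(sims)
    if total == 0:
        return None  # the document is unrelated to every POI
    x = sum(s * px for s, (px, _) in zip(sims, poi_coords)) / total
    y = sum(s * py for s, (_, py) in zip(sims, poi_coords)) / total
    return (x, y)

# With two POIs the document falls on the connecting line, closer to
# the POI it is more similar to (here 3 times more similar to the second).
print(vibe_position([(0, 0), (10, 0)], [1.0, 3.0]))  # (7.5, 0.0)
```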

91
VIBE
  • While GUIDO is based on distance metrics, VIBE is based on similarity metrics.
  • Consider a 2-POI VIBE: a document is located at the position given by a fixed ratio c = s1/s2.
  • If si = 1/di, then c = d2/d1. Thus, a straight line in GUIDO is a point in VIBE.
  • If s = k − d, then c = (k − d1)/(k − d2); the display is further compressed.

92
Boolean VIBE
  • One can think of n+1 POIs as vertices in n dimensions that form a polyhedron.
  • Three POIs A, B, and C form a triangle in a 2-D space, as shown in Figure 7.10.
  • Documents containing all terms of A and B appear on the line A-B. Documents containing all terms of A, B, and C appear inside the triangle.
  • Four POIs form a polyhedron in a 3-D space.

93
Boolean VIBE
  • To render n POIs on a 2-D display, the resulting display consists of 2^n − 1 Boolean points, representing all Boolean combinations except the one that is completely negated; see Figure 7.10.
  • A threshold on the similarity between points needs to be specified for determining document positions; see Table 7.1.

94
Retrieval Effectiveness Measures
95
Goodness of an IR System
  • Goodness is judged by the user for appropriateness to her information need, which is vague.
  • Determine the level of judgment:
  • The question that meets the information need
  • The query that corresponds to the question
  • Determine the measure:
  • Binary: accepted or rejected
  • N-ary: 4 definitely relevant, 3 probably relevant, 2 neutral, 1 probably not relevant, 0 definitely not relevant

96
Goodness of an IR System (Cont'd)
  • Relevance of a document: how well this document responds to the query.
  • Pertinence of a document: how well this document satisfies the information need.
  • Usefulness of a document:
  • The document is not relevant or pertinent to my present need, but it is useful in a different context.
  • The document is relevant, but it is not useful because I've already seen it.

97
Precision and Recall
                 Retrieved     Not retrieved
  Relevant       w             x                 n1 = w + x
  Not relevant   y             z
                 n2 = w + y                      N = w + x + y + z

  • Precision = w/n2.
  • Recall = w/n1.
  • The number of documents returned in response to a query (n2) may be controlled either by a first-K cutoff or by a similarity threshold.
  • If very few documents are returned, precision could be high while recall is very low.
  • If all documents are returned, recall = 1 while precision is very low.
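
A small sketch computing these contingency-table measures (together with the fallout and generality measures defined on the next slide), assuming relevance judgments are available for the whole collection:

```python
def pr_measures(retrieved, relevant, n_collection):
    retrieved, relevant = set(retrieved), set(relevant)
    w = len(retrieved & relevant)   # relevant and retrieved
    y = len(retrieved - relevant)   # nonrelevant but retrieved
    n1, n2 = len(relevant), len(retrieved)
    return {
        "precision": w / n2 if n2 else 0.0,
        "recall": w / n1 if n1 else 0.0,
        "fallout": y / (n_collection - n1),
        "generality": n1 / n_collection,
    }

print(pr_measures(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6], n_collection=20))
# precision 0.5, recall 0.667, fallout 2/17, generality 0.15
```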

98
Precision and Recall (Cont'd)
  • One can plot a precision-recall graph to compare the performance of different IR systems. See Figure 8.1.
  • Two related measures:
  • Fallout: the proportion of nonrelevant documents that are retrieved, F = y/(N − n1).
  • Generality: the proportion of relevant documents within the entire collection, G = n1/N.
  • Precision (P), recall (R), fallout (F), and generality (G) are related: P = R·G / (R·G + F·(1 − G)).

99
Precision and Recall (Cont'd)
  • P/(1 − P) is the ratio of relevant retrieved documents to nonrelevant retrieved documents.
  • G/(1 − G) is the ratio of relevant documents to nonrelevant documents in the collection.
  • R/F > 1 if the IR system does better at locating relevant documents.
  • R/F < 1 if the IR system does better at rejecting non-relevant documents.

100
Precision and Recall (Cont'd)
  • Weaknesses of precision/recall measures:
  • It is generally difficult to get an exact value for recall, because one has to examine the entire collection.
  • It is not clear that recall and precision are significant to the user. Some have argued that precision is more important than recall.
  • Either one alone represents an incomplete picture of the IR system's performance.

101
User-oriented measures
  • The above measures attempt to measure the performance of the entire IR system, regardless of the differences among users.
  • From a user's point of view, her interpretation of the retrieved set of documents could be measured as follows.
  • Let V = # of relevant documents known to the user, Vn = # of relevant, retrieved documents known to the user, and N = # of relevant, retrieved documents.
  • Coverage ratio: Vn/V
  • Novelty ratio: (N − Vn)/N

102
User-oriented measures (Cont'd)
  • Relative recall: # of relevant, retrieved documents / # of desired documents.
  • Recall effort: # of desired documents / # of documents examined.

103
Average precision and recall
  • Fix recall at several points (say, 0.25, 0.5, and 0.75) and compute the average precision at each recall level.
  • If the exact recall is difficult to compute, one can compute the average precision for each fixed number of relevant documents. See Table 8.2.
  • If the exact recall can be computed, a more comprehensive precision/recall table can be obtained. See Table 8.3.

104
Operating Curves
  • Let C be a measurable characteristic, and let P1 and P2 be the sets of relevant and irrelevant documents, respectively.
  • If C distinguishes P1 from P2 well, the curve will have a higher slope.
  • It has been shown that the operating curve of a given IR system is usually a straight line.
  • The distance from <50, 50> to the operating curve, along the line from <0, 100> to <50, 50>, can be used to measure the performance of an IR system; this is called Swets' E measure. See Figure 8.3.

105
Expected search length
  • None of the above measures consider the order of the returned documents.
  • Suppose the set of retrieved documents can be divided into subsets S1, S2, …, Sk with decreasing priority, and Si has ni relevant documents.
  • Given a desired number N of relevant documents, one can compute the expected search length. See Example 8.2.
  • By varying N, one can plot the expected search length performance, as shown in Figure 8.4.

106
Expected search length (Cont'd)
  • An aggregate number can be computed as the average number of documents searched per relevant document; let this number be ei.
  • If the chances of searching for 1, 2, …, 7 documents are equally likely, one can compute the overall expected search length by averaging the corresponding ei values.

107
Normalized recall
  • A typical IR system presents results to the user in a linear list.
  • If a user sees many relevant documents first, she may be more satisfied with the system performance.
  • Rocchio's normalized recall is defined via a step function F, where F(k) = F(k−1) + 1 if the kth document is relevant and F(k) = F(k−1) otherwise. See Figure 8.5.
  • The step function F is defined as:
  • F(0) = 0,
  • F(k+1) = F(k) or F(k) + 1.

108
Normalized recall (Cont'd)
  • Let A be the area between the actual and ideal graphs, n1 be the number of relevant documents, and N be the number of documents examined.
  • Normalized recall = 1 − A/(n1·(N − n1)).
  • However, if two systems behave the same except for the position of the last document, their normalized recall values may differ a lot.
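
A sketch of this computation from a ranked 0/1 relevance list; the area A reduces to comparing the actual ranks of the relevant documents with the ideal ranks 1..n1:

```python
def normalized_recall(relevance):
    # relevance: 0/1 flags in ranked order over the N examined documents
    n = len(relevance)
    actual = [rank for rank, rel in enumerate(relevance, 1) if rel]
    n1 = len(actual)
    if n1 == 0 or n1 == n:
        return 1.0  # degenerate ranking: nothing to reorder
    ideal = range(1, n1 + 1)  # relevant documents ranked first
    area = sum(a - i for a, i in zip(actual, ideal))
    return 1.0 - area / (n1 * (n - n1))

print(normalized_recall([1, 1, 0, 1, 0]))  # 0.833..., relevant docs early
print(normalized_recall([0, 1, 0, 1, 1]))  # 0.166..., same docs, late
```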

109
Sliding ratio
  • Rather than judging a document as either relevant or irrelevant, the sliding ratio assigns a weighted relevance to each document.
  • Let the weight list of the retrieved documents be w1, w2, …, wN, and their sorted list in decreasing order be W1, W2, …, WN. The sliding ratio SR(n) is defined as SR(n) = (Σ i≤n wi) / (Σ i≤n Wi).

110
Satisfaction and frustration
  • Myaeng divides the measure into satisfaction and frustration.
  • Satisfaction is the cumulative sum of the satisfaction weights.
  • Frustration is the cumulative sum of the (2 − satisfaction) weights. See Example 8.4.
  • Total = satisfaction − frustration.

111
Content-based Recommendation
112
NewsWeeder: Learning to Filter Netnews
  • Ken Lang
  • Proceedings of the Conference on Machine Learning, 1995

113
Introduction
  • NewsWeeder is a netnews-filtering system.
  • It allows users to read regular newsgroups.
  • It also creates personal, virtual newsgroups, such as nw.top50.bob for Bob:
  • A list of article summaries sorted by predicted rating.
  • After reading an article, the reader clicks on a rating from one to five.

114
Introduction
  • This way of collecting users' ratings is called active feedback, in contrast to passive feedback, such as time spent reading.
  • The drawback of active feedback is the extra effort required for explicit ratings.
  • Each night, the system uses the collected rating information to learn a new model of each user's interests.
  • How to learn such a model is the subject of this paper.

115
Representation
  • Raw text is parsed into tokens.
  • A vector of token counts is created for each
    document (article).
  • Tokens are not stemmed.
  • The vector is on the order of 20,000 to 100,000
    tokens long.
  • No explicit dimension reduction techniques are
    used to reduce the size of vectors.

116
TF-IDF weighting
  • Motivation:
  • The more times a token t appears in a document d (term frequency tf(t,d)), and
  • The fewer documents token t occurs in (document frequency df(t)),
  • The better t represents the subject of document d.
  • Throw out tokens occurring fewer than 3 times in total.
  • Throw out the M most frequent tokens.
  • The weight of t w.r.t. d is
  • w(t, d) = tf(t,d) × log2(N/df(t)),
  • where N is the total number of documents.

117
TF-IDF weighting
  • Each document is represented by a tf-idf vector
    normalized into unit length.
  • Use cosine function to determine the similarity
    between two documents.
  • Given a category (1..5), a prototype vector is
    computed by averaging the normalized tf-idf
    vectors in the category.

118
TF-IDF weighting
  • Let vp1, vp2, vp3, vp4, vp5 be the prototype vectors of the five categories.
  • A learning model is derived as follows:
  • Predicted-rating(d) = c1·sim(d, vp1) + c2·sim(d, vp2) + c3·sim(d, vp3) + c4·sim(d, vp4) + c5·sim(d, vp5).
  • The coefficients are determined by linear regression on the documents rated by the user.
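
A sketch of this predictor using numpy; it assumes unit-length tf-idf rows and at least one training document per rating category (function names are illustrative):

```python
import numpy as np

def prototypes(vectors, ratings):
    # vectors: (n_docs, n_terms) rows of unit-length tf-idf
    # ratings: integer array with values 1..5; every category non-empty
    return np.vstack([vectors[ratings == c].mean(axis=0) for c in range(1, 6)])

def fit_predictor(vectors, ratings):
    protos = prototypes(vectors, ratings)
    sims = vectors @ protos.T  # cosine similarities: rows are unit length
    coefs, *_ = np.linalg.lstsq(sims, ratings, rcond=None)
    return protos, coefs

def predict(doc_vector, protos, coefs):
    # Predicted rating = weighted sum of similarities to the prototypes
    return float((doc_vector @ protos.T) @ coefs)
```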

119
Minimum Description Length (MDL)
  • A kind of Bayesian classifier, but based on the entropy measure.
  • In information theory, the minimum average length to encode messages occurring with probabilities p1, p2, …, pk is −Σi pi log pi. That is, the number of bits needed to represent message i is −log pi.
  • Let H be a category (hypothesis) and D a document; we want to maximize p(H|D) ∝ p(D|H)·p(H).

120
MDL
  • Equivalently, we can minimize −log p(D|H) − log p(H).
  • This total encoding length includes:
  • The number of bits to encode the hypothesis
  • The number of bits required to encode the data given the hypothesis
  • That is, we find a balance between simpler models and models that produce smaller error when explaining the observed data.

121
MDL applied to Newsweeder
  • Problem description:
  • We are given a document d with token vector Td and number of non-zero entries ld, and a set of previous rating information Dtrain.
  • We would like to find the category ci that maximizes p(ci | Td, ld, Dtrain), or equivalently, minimizes −log p(Td | ci, ld, Dtrain) − log p(ci | ld, Dtrain).

122
MDL applied to Newsweeder
  • Assuming that the tokens in a document are independent, we have
  • p(Td | ci, ld, Dtrain) = Πj p(tj,d | ci, ld, Dtrain),
  • where tj,d (0 or 1) represents whether token j appears in document d.
  • Notations:
  • ti: the total count of token i over all documents.
  • ri,l: a correlation estimate in [0, 1] between ti,d and ld.
  • The above measures can be computed for the entire set of documents or for a particular category, denoted by ck.

123
MDL applied to Newsweeder
  • When ti,d is not related to the length of the document (i.e., ri,l = 0), we have one estimate.
  • When ti,d is strongly related to the length of the document (i.e., ri,l = 1), we have another.

124
MDL applied to Newsweeder
  • In general, the estimate can be modeled as an interpolation between these two cases.
  • Hypothesis: for a given token, either it is special w.r.t. a category or it is unrelated to any category.

125
MDL applied to Newsweeder
  • A token is related to some category if the reduction in encoding length is greater than a small constant (0.1).
  • The intuition is that if, by considering category information, the encoding bits can be reduced, this token plays an important role in deciding the category of a document.

126
Summary
  • Divide the set of articles into a training set and a test set.
  • Parse the training articles, throwing out tokens occurring fewer than 3 times in total.
  • Compute ti and ri,l for each token.
  • For each token t and category c, decide whether to use the category-independent or the category-dependent model.

127
Summary (Cont'd)
  • Compute the similarity of each training article to each rating category by taking the inverse of the number of bits required to encode Td under the category's probabilistic model.
  • Compute a linear regression model from the training articles.

128
Experiments
  • The performance metric is precision.
  • Retrieve the top 10% of articles by predicted rating.
  • Data:
  • See Table 1 for the meaning of the 5 categories.
  • Articles rated as 1 or 2 are considered interesting.
  • Only two users exhibited a sufficient number of ratings; see Table 2.

129
TF-IDF performance
  • Do not use a fixed stop list, because it may not suit a dynamic environment.
  • Instead, the top N most frequent words are removed.
  • Experiments with different partitionings into training/test sets show that removing 100-400 words gives the best performance. See Graph 1.
  • TF-IDF yields about a three-fold improvement over non-filtering.

130
MDL Experiments
  • See Graph 2 for a comparison between TF-IDF and MDL.
  • MDL consistently outperforms TF-IDF.
  • Table 3 shows the predicted ratings and actual ratings of the test articles.
  • The correct prediction rate is 65% (see the diagonal).
  • In general, the performance after the regression step tends to meet or exceed the precision obtained by choosing only the category with maximum probability.

131
Learning and Revising User Profiles: The Identification of Interesting Web Sites
  • M. Pazzani and D. Billsus
  • Machine Learning 27, 1997

132
Introduction
  • The goal is to find information that satisfies long-term recurring interests.
  • Feedback on the interestingness of a set of previously visited sites is used to predict the interestingness of unseen sites.
  • The recommender system is called Syskill & Webert.

133
Syskill & Webert
  • A different profile is learned for each topic.
  • Each user has a set of profiles, one per topic.
  • Each web page is augmented with special controls for selecting user ratings. See Figure 1.
  • Each page is rated as either hot or cold. See Figure 2 for the notations used in recommendations.

134
Learning user profiles
  • Use supervised learning with a set of positive
    examples and negative examples.
  • Each rated web page is converted into a Boolean
    feature vector.
  • The information gain of a word is used to
    determine how informative the word is.

135
Learning user profiles
  • The set of the k most informative words is used as the feature set (k = 128).
  • In addition, words in a stop list of approximately 600 words, as well as HTML tags, are excluded.
  • See Table 1 for feature words on the topic of goats.

136
Naïve Bayesian classifier
  • It assumes that the features are independent.
  • A given example is assigned to the class (hot or cold) with the higher probability.
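
A minimal naïve Bayes sketch over Boolean feature vectors; the Laplace smoothing is an assumption, since the paper's exact probability estimates are not given here:

```python
import math
from collections import defaultdict

def train(examples):
    # examples: list of (tuple of 0/1 features, "hot" or "cold")
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for features, label in examples:
        totals[label] += 1
        for i, present in enumerate(features):
            counts[label][i] += present
    return counts, totals

def classify(features, counts, totals):
    n = sum(totals.values())
    best, best_score = None, -math.inf
    for label in totals:
        score = math.log(totals[label] / n)  # class prior
        for i, present in enumerate(features):
            p = (counts[label][i] + 1) / (totals[label] + 2)  # smoothed
            score += math.log(p if present else 1.0 - p)
        if score > best_score:
            best, best_score = label, score
    return best

data = [((1, 0, 1), "hot"), ((1, 1, 0), "hot"), ((0, 1, 0), "cold")]
counts, totals = train(data)
print(classify((1, 0, 0), counts, totals))  # "hot"
```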

137
Initial experiments
  • See Table 2 for four users and 9 topics.
  • Again, the partitioning into training and test sets is varied.
  • Accuracy is the primary performance metric.
  • Figure 3 displays the average accuracy, which is substantially better than the probability of cold pages.
  • In the biomedical domain, all of the top 10 pages were actually interesting, and all of the bottom 10 pages were actually uninteresting.

138
Initial experiments
  • Among the 21 pages with probabilities above 0.9, 19 were rated interesting.
  • Among the 64 pages with probabilities below 0.1, only one was rated interesting.
  • Table 3 shows how the number of feature words impacts accuracy with 20 training examples.
  • An intermediate number (96) of features performs best.
  • A comprehensive approach to feature selection is not feasible, as it increases the complexity.

139
Alternative machine learning alg.
  • Nearest neighbor: assign the class of the most similar example.
  • PEBLS: the distance between two examples is the sum of the value differences of all attributes, where the difference between values Vjx and Vjy is given by the value difference metric.

140
Machine Learning (Cont'd)
  • Decision trees: ID3, which recursively selects the feature with the highest information gain.
  • Rocchio's algorithm:
  • Use TF-IDF as feature weights (with normalization to unit length).
  • Build the prototype vector of the interesting class by subtracting 0.25 times the average vector of the uninteresting pages from the average vector of the interesting pages.
  • The purpose is to prevent infrequently occurring terms from overly affecting the classification.
  • Pages within a certain distance of the prototype (determined by cosine) are considered interesting.

141
Comparison
  • 20 examples were chosen as the training set because the increase in accuracy after 20 is mild.
  • See Table 4. In each domain, the highest accuracy, as well as those with slightly lower accuracies, is marked.
  • ID3 (or C4.5) is not well suited.
  • Nearest neighbor performs worse (even for k-NN).
  • Backpropagation, the Bayesian classifier, and Rocchio's algorithm are among the best.
  • The Bayesian classifier is chosen because it is fast and adapts well to attribute dependencies.

142
Using predefined user profiles
  • Some users are unwilling to rate many pages before the system gives reliable predictions.
  • An initial profile is solicited as follows:
  • Provide a set of words that indicate interesting pages.
  • Provide another set of words that indicate uninteresting pages. This set is more difficult to obtain.
  • Four probabilities are given for each word: p(wordi present | hot), p(wordi absent | hot), p(wordi present | cold), and p(wordi absent | cold). The default for p(wordi present | hot) is 0.7, and that for p(wordi present | cold) is 0.3.

143
Using predefined user profiles (Cont'd)
  • As more training data becomes available, more belief should be placed in the probability estimates.
  • Conjugate priors are used to update the probabilities from data:
  • The initial probability is assumed to be equivalent to 50 pages.
  • If p(wordi present | hot) = 0.8 and, among the 25 hot pages seen, 10 contain wordi,
  • the probability becomes (40 + 10)/(50 + 25) ≈ 0.67.
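
The update rule as a short sketch:

```python
PRIOR_WEIGHT = 50  # the initial estimate counts as 50 virtual pages

def updated_probability(prior_p, pages_seen, pages_with_word):
    return (PRIOR_WEIGHT * prior_p + pages_with_word) / (PRIOR_WEIGHT + pages_seen)

# p(word present | hot) = 0.8 initially; 10 of 25 hot pages contain it:
print(updated_probability(0.8, 25, 10))  # (40 + 10) / (50 + 25) = 0.666...
```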

144
Experiments
  • Three alternatives:
  • Data: use only the data for estimation. 96 features are obtained purely from the data.
  • Revision: use both the data and the initial profile for estimation. All words in the profile are used as features, supplemented with the most informative words for a total of 96 features.
  • Fixed: use only the words provided by the user as features, and only the initial profiles.

145
Results
  • See Tables 5, 6, and 7 for the probabilities in the initial profiles.
  • Figures 4, 5, and 6 show that the revision strategy performs best. The performance of the fixed strategy is surprisingly good.
  • If we use only the words in the initial user profile and calculate the probabilities from the data, the system still performs well. See Figure 7.

146
Using lexical knowledge
  • Use WordNet as a thesaurus.
  • When there is no relationship between a word and the words in a topic, the word is eliminated. The relationships considered include Hypernym, Antonym, Member-Holonym, Part-Holonym, Similar-to, Pertainym, and Derived-from.
  • Table 8 shows the eliminated words that are unrelated to goats.
  • Figure 8 shows that when the number of examples is small, applying lexical knowledge does help.

147
Comparing Feature-based and Clique-based User Models for Movie Selection
  • J. Alspector, A. Kolcz, and N. Karunanithi
  • Conference on Digital Libraries, 1998

148
Introduction
  • Compare content-based and collaborative approaches for making movie recommendations.
  • Users must provide explicit ratings on some movies.
  • Data set: 7389 movies.
  • Volunteers who rated movies: 242.

149
Clique-based approach
  • A set of users forms a clique if their movie ratings are closely related.
  • The similarity between two users' ratings is defined by the Pearson correlation coefficient (i.e., a cosine function of the mean-centered ratings), computed as follows.
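
A sketch of this user-user similarity, computed over commonly rated movies (the movie titles are illustrative):

```python
import math

def pearson(ratings_a, ratings_b):
    # ratings_a, ratings_b: dicts of movie_id -> rating
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # too little overlap to correlate
    a = [ratings_a[m] for m in common]
    b = [ratings_b[m] for m in common]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a)
                    * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

u1 = {"Alien": 5, "Heat": 3, "Babe": 1}
u2 = {"Alien": 4, "Heat": 2, "Babe": 1}
print(pearson(u1, u2))  # ~0.98: strongly correlated tastes
```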

150
Clique-based approach
  • How to decide the clique of a given user U?
  • Smin: the minimum number of common ratings with U.
  • Cmin: the minimum correlation threshold.
  • In the experiments, Smin is set to the constant 10, and Cmin is varied such that the size of the clique is 40.
  • Once a clique is identified:
  • For a given unseen movie m, let N be the number of clique members that rated m.
  • ci(m) is the rating of movie m given by user i.
  • r(m) is the estimated rating of movie m for user U.

151
Clique-based approach
152
Feature-based approach
  • Extract relevant features from the movies that the user has rated.
  • Build a model for the user by associating the selected features with the ratings.
  • Estimate the rating of an unseen movie for the user by consulting the model.

153
Relevant features
  • Seven features are used:
  • 25 categories (0, 1)
  • 6 MPAA ratings (0, 1)
  • Maltin rating (0..4)
  • Academy award: won = 1, nominated = 0.5, not considered = 0.
  • Origin: USA = 0, USA with foreign collaboration = 0.5, foreign made = 0.
  • Director: each director is represented as a numerical value that is the average rating given by the user to the movies directed by that director.
  • Each feature is normalized into [0, 1].

154
Linear model
  • Use