Information Retrieval and Recommendation Techniques

About This Presentation

Slides: 319

Provided by: SYHU

Transcript and Presenter's Notes

Title: Information Retrieval and Recommendation Techniques

1
Information Retrieval and Recommendation
Techniques


2
Abstraction

Reality (the real world) cannot be known in its
entirety.
Reality is represented by a collection of data
abstracted from observation of the real world.
Information need drives the storage and retrieval
of information.
Relationships among reality, information need,
data and query (see Figure 1.1).

3
Information Systems

Two portions: endosystem and ectosystem.
Ectosystem has three human components:
User
Funder
Server: the information professional who operates
the system and provides service to the user.
Endosystem has four components:
Media
Devices
Algorithms
Data structures

4
Measures

The performance is dictated by the endosystem but
judged by the ectosystem.
User is mainly concerned about effectiveness.
Server is more aware of the efficiency.
Funder is more concerned about the economy of the
system.
This course concentrates primarily on
effectiveness measures.
The so-called user satisfaction has many meanings,
and different users may use different criteria.
A fixed set of criteria must be established for
fair comparison.

5
From Signal to Wisdom

Five stepping stones:
Signal: bit stream, wave, etc.
Data: impersonal, available to any user.
Information: a set of data matched to a
particular information need.
Knowledge: coherence of data, concepts, and
rules.
Wisdom: a balanced judgment in the light of
certain value criteria.

6
Chapter 2 Document and Query Forms
7
What is a document?

A paper or a book? A section or a chapter?
There is no strict definition on the scope and
format of a document.
The document concept can be extended to include
programs, files, email messages, images, voices,
and videos.
However, most commercial IR systems handle
multimedia documents through their textual
representations.
The focus of this course is on text retrieval.

8
Data Structures of Documents

Fully formatted documents: typically, these are
entities stored in DBMSs.
Fully unformatted documents: typically, these are
data collected via sensors (e.g., medical
monitoring, sound and image data) or text typed
into a text editor.
Most textual documents, however, are
semi-structured, including title, author, source,
abstract, and other structural information.

9
Document Surrogates

A document surrogate is a limited representation
of a full document. It is the main focus of
storing and querying for many IR systems.
How to generate and evaluate document surrogates
in response to users' information needs is an
important topic.

10
Ingredients of document surrogates

Document identifier: could be a less meaningful one
such as a record ID, or a more elaborate identifier
such as a Library of Congress classification
for books (e.g., T210 C37 1982).
Title
Names: author, corporate, publisher
Dates: for timeliness and appropriateness
Unit descriptor: Introduction, Conclusion,
Bibliography.

11
Ingredients of document surrogates

Keywords
Abstract: a brief one- or two-paragraph
description of the contents of a paper.
Extract: similar to an abstract but created by
someone other than the authors.
Review: similar to an extract but meant to be
critical. The review itself is a separate
document that is worth retrieving.

12
Vocabulary Control

It specifies a finite vocabulary to be
used for specifying keywords.
Advantages:
Uniformity throughout the retrieval system
More efficient
Disadvantages:
Authors/users cannot give/retrieve more
detailed information.
Most IR systems nowadays opt for an uncontrolled
vocabulary and rely on a sound internal thesaurus
to bring together related terms.

13
Encoding Standards

ASCII: a standard for English text encoding.
However, it does not cover characters of
different fonts, mathematical symbols, etc.
Big-5: a traditional Chinese character set with 2
bytes.
GB: a simplified Chinese character set with XX
bytes.
CCCII: a full traditional Chinese character set
with at most 6 bytes.
Unicode: a unified encoding trying to cover
characters from multiple nations.

14
Markup languages

Initially used by word processors (.doc, .tex) and
printers (.ps, .pdf).
Recently used for representing documents with
hypertext information on the WWW (HTML, SGML).
A document written in markup language can be
segmented into several portions that better
represent that document for searching.

15
Query Structures

Two types of matches
Exact match (equality match and range match)
Approximate match

16
Boolean Queries

Based on Boolean algebra
Common connectives: AND, OR, NOT
E.g., A AND (B OR C) AND D
Each term could be expanded by stemming or a list
of related terms from a thesaurus.
E.g., inf -> information, vegetarian -> Mideastern
countries
A XOR B = (A AND NOT B) OR (NOT A AND B)
By far the most popular retrieval approach.

17
Boolean Queries (Contd)

Additional operators:
Proximity (e.g., icing within 3 words of
chocolate)
K out of N terms (e.g., 3 OF (A, B, C))
Problems:
No good way to weigh terms.
E.g., music by Beethoven, preferably sonata:
(Beethoven AND sonata) OR (Beethoven)
Easy to misuse (e.g., people who like to have
dinner with sports or symphony may specify
dinner AND sports AND symphony).

18
Boolean Queries (Contd)

Order of precedence may not be natural to users
(e.g., A OR B AND C). People tend to interpret
requests depending on the semantics.
E.g., coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
Users may construct a highly complex query.
There are techniques for simplifying a given query
into disjunctive normal form (DNF) or conjunctive
normal form (CNF).
It has been shown that every Boolean expression
can be converted to an equivalent DNF or CNF.

19
Boolean Queries (Contd)

DNF: a disjunction of several conjuncts, each of
which consists of terms (possibly negated)
connected by AND.
E.g., (A AND B) OR (A AND NOT C)
(A AND B AND C) OR (A AND B AND NOT C) is
equivalent to (A AND B).
CNF: a conjunction of several disjuncts, each of
which consists of terms (possibly negated)
connected by OR.
Normalization to DNF can be done by looking at
the TRUE rows of the truth table, while that to
CNF can be done by looking at the FALSE rows.
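
The TRUE-row construction can be sketched in a few lines of Python. This is only an illustration: the function name to_dnf and the way the query is passed in (as a Python predicate) are assumptions made for the example, not part of the course material.

```python
from itertools import product

def to_dnf(expr, names):
    """Build a DNF string from the TRUE rows of expr's truth table."""
    conjuncts = []
    for values in product([True, False], repeat=len(names)):
        if expr(*values):
            # One conjunct per TRUE row: each variable appears plain or negated.
            literals = [n if v else f"NOT {n}" for n, v in zip(names, values)]
            conjuncts.append("(" + " AND ".join(literals) + ")")
    return " OR ".join(conjuncts)

# A AND (B OR C), written as a hypothetical Python predicate
query = lambda a, b, c: a and (b or c)
print(to_dnf(query, ["A", "B", "C"]))
# (A AND B AND C) OR (A AND B AND NOT C) OR (A AND NOT B AND C)
```

The result is the full (unsimplified) DNF; merging conjuncts such as the first two into (A AND B) is a separate simplification step.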

20
Boolean Queries (Contd)

The size of the returned set could be explosively
large. Solution: return only a limited number of
records.
Though there are many problems with Boolean
queries, they are still popular because people
tend to use only two or three terms at a time.

21
Vector Queries

Each document is represented as a vector, or a
list of terms.
The similarity between a document and a query is
based on the presence of terms in both the query
and the document.
The simplest model is 0-1 vector. A more general
model is weighted vector.
Assigning weights to a document or a query is a
complex process.
It is reasonable to assume that more frequent
terms are more important.

22
Vector Queries (Contd)

It is better to give a user the freedom to assign
weights. In this case, a conversion between user
weight and system weight must be done. (Show the
conversion equation.)
There are two types of vector queries (for
similarity search):
Top-N queries
Threshold-based queries
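
The two query types differ only in how the ranked list is cut off. A minimal sketch, assuming documents and queries are stored as sparse weighted vectors (Python dicts) and that similarity is a plain inner product; both choices, and all names below, are assumptions for illustration.

```python
def similarity(query, doc):
    """Inner product of two sparse weighted term vectors (dicts)."""
    return sum(w * doc.get(term, 0.0) for term, w in query.items())

def top_n(query, docs, n):
    """Top-N query: the n documents with the highest similarity."""
    ranked = sorted(docs, key=lambda d: similarity(query, docs[d]), reverse=True)
    return ranked[:n]

def above_threshold(query, docs, threshold):
    """Threshold-based query: every document at least this similar."""
    return [d for d in docs if similarity(query, docs[d]) >= threshold]

docs = {
    "d1": {"retrieval": 0.8, "index": 0.3},
    "d2": {"retrieval": 0.2, "fuzzy": 0.9},
    "d3": {"index": 0.5},
}
query = {"retrieval": 1.0, "index": 0.5}
print(top_n(query, docs, 2))
print(above_threshold(query, docs, 0.5))
```

A top-N query always returns n documents (if available), however poor the match; a threshold query may return none or very many.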

23
Extended Boolean Queries

This approach incorporates weights into Boolean
queries. A general form is $A_{w_1} * B_{w_2}$, where * is
a Boolean operator (e.g., $A_{0.2}$ AND $B_{0.6}$).
A OR $B_{0.2}$ retrieves all documents that contain A
and those documents in B that are within the top 20%
closest to the documents in A.
A OR $B_1$ $\equiv$ A OR B
A OR $B_0$ $\equiv$ A
See Figure 3.1 for a diagrammatic illustration.

24
Extended Boolean Queries (Contd)

A AND $B_{0.2}$
A AND $B_0$ $\equiv$ A
A AND $B_1$ $\equiv$ A AND B
See Figure 3.2 for graphical illustration.
A AND NOT $B_{0.2}$
A AND NOT $B_0$ $\equiv$ A
A AND NOT $B_1$ $\equiv$ A AND NOT B
See Figure 3.3 for graphical illustration.
$A_{0.2}$ OR $B_{0.6}$ returns 20% of the documents in A-B
that are closest to B and 60% of the documents in
B-A that are closest to A.

25
Extended Boolean Queries (Contd)

See Example 3.1.
One needs to define the distance between a
document and a set of documents (those containing A).
The computation of an extended Boolean query
could be time-consuming.
This model has not become popular.

26
Fuzzy Queries

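It is based on fuzzy set theory.
In a fuzzy set S, each element in S is associated
with a membership grade.
Formally, $S = \{\langle x, \mu_S(x)\rangle \mid \mu_S(x) > 0\}$.
$A \cap B = \{x \mid x \in A \text{ and } x \in B\}$, $\mu(x) = \min(\mu_A(x), \mu_B(x))$.
$A \cup B = \{x \mid x \in A \text{ or } x \in B\}$, $\mu(x) = \max(\mu_A(x), \mu_B(x))$.
NOT A: for every x, $\mu(x) = 1 - \mu_A(x)$.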
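
The three fuzzy operators can be sketched as follows, assuming a fuzzy set is stored as a dict from element to membership grade and that a missing element has grade 0. The function names and the explicit universe argument for NOT are assumptions of this sketch.

```python
def fuzzy_and(a, b):
    """Intersection: grade is the minimum of the two grades."""
    return {x: min(a[x], b[x]) for x in a.keys() & b.keys()}

def fuzzy_or(a, b):
    """Union: grade is the maximum of the two grades (missing = 0)."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}

def fuzzy_not(a, universe):
    """Complement over an explicit universe: grade is 1 - grade in a."""
    return {x: 1.0 - a.get(x, 0.0) for x in universe if a.get(x, 0.0) < 1.0}

A = {"d1": 0.8, "d2": 0.3}
B = {"d2": 0.6, "d3": 0.5}
print(fuzzy_and(A, B))
print(fuzzy_or(A, B))
print(fuzzy_not(A, {"d1", "d2", "d3"}))
```

Elements whose resulting grade would be 0 are left out, in line with the requirement that grades in a fuzzy set be positive.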

27
Fuzzy Queries (Contd)

To use fuzzy queries, documents must be fuzzy
too.
The documents are returned to the users in
decreasing order of their fuzzy values associated
with the fuzzy query.

28
Probabilistic Queries

Similar to fuzzy queries, but now the membership
grades are probabilities.
The probability of a document in association with
a query (or term) can be calculated through some
probability theory (e.g., Bayes' theorem) after
some observation.

29
Natural Language Queries

Convenient
Imprecise, inaccurate, and frequently
ungrammatical.
The difficulties lie in obtaining an accurate
interpretation of a longer text, which may rely
on common sense.
A successful system must be restricted to a narrowly
defined domain (e.g., diagnosis of illness rather
than medicine in general).

30
Information Retrieval and Database Systems

Should one use a database system to handle
information retrieval requests?
DBMS is a mature and successful technology for
handling precise queries.
It is not appropriate for handling imprecise textual
elements.
OODBs provide augmented functions for textual
or image elements and are considered good
candidates.

31
The Matching Process
32
Boolean based matching

It divides the document space into two: those
satisfying the query and those that do not.
Finer grading of the set of retrieved documents
can be defined on the number of terms satisfied
(e.g., A OR B OR C).

33
Vector-based matching

Measures
Based on the idea of distance
Minkowski metric ($L_q$):
$L_q = (|X_{i1}-X_{j1}|^q + |X_{i2}-X_{j2}|^q + |X_{i3}-X_{j3}|^q + \cdots + |X_{ip}-X_{jp}|^q)^{1/q}$
Special cases: Manhattan distance ($q = 1$),
Euclidean distance ($q = 2$), and maximum direction
distance ($q = \infty$).
See example in p.133.
Based on the idea of angle
Cosine function: $(Q \cdot D)/(|Q|\,|D|)$.

34
Mapping distance to similarity

It is better to map distance (or dissimilarity)
into some range, e.g. [0, 1].
A simple inversion function is $\sigma = b^{-u}$.
A more general inversion function is $\sigma = b^{-p(u)}$,
where p(u) is a monotone nondecreasing function such
that $p(0) = 0$.
See Fig. 4.1 for graphical illustration.

35
Distance or cosine?

<1, 3>, <100, 300>, <3, 1>: which pair is
similar?
In practice, distance and angular measures seem
to give results of similar quality because the
cluster of documents all roughly lie in the same
direction.
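
The question can be checked numerically. The sketch below implements the Minkowski metric and the cosine function from the vector-based matching slide and applies them to the three vectors; the function names are assumptions for illustration.

```python
import math

def minkowski(x, y, q):
    """L_q distance; q = math.inf gives the maximum direction distance."""
    if math.isinf(q):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

a, b, c = (1, 3), (100, 300), (3, 1)
print(minkowski(a, c, 2), minkowski(a, b, 2))  # distance: a is far closer to c
print(cosine(a, b), cosine(a, c))              # angle: a points the same way as b
```

By Euclidean distance <1, 3> is closest to <3, 1>; by angle it is identical in direction to <100, 300>. Which answer is "right" depends on whether vector length (e.g., document size) should matter.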

36
Missing terms and term relationships

The conventional value 0 means either:
Truly missing
No information
However, if 0 is regarded as undefined, it
becomes impossible to measure the distance
between two documents (e.g., <3, -> and <-, 4>).
Terms used to define the vector model are clearly
not independent, e.g., "digital" and "computer"
have a strong relationship.
However, the effect of dependent terms is hardly
known.

37
Probability matching

For a given query, we can define the probability
that a document is related as $P(\text{rel}) = n/N$.
The discriminant function on the selected set is
$\text{dis}(\text{selected}) = P(\text{rel} \mid \text{selected}) / P(\neg\text{rel} \mid \text{selected})$.
The desirable discriminant function value of a
set is at least 1.
Let a document be represented by terms $t_1, \ldots, t_n$,
and assume they are statistically independent.
$P(\text{selected} \mid \text{rel}) = P(t_1 \mid \text{rel})\,P(t_2 \mid \text{rel}) \cdots P(t_n \mid \text{rel})$.
We can use Bayes' theorem to calculate the
probability that a document should be selected.
See Example 4.1.

38
Fuzzy matching

The issue is on how to define the fuzzy grade of
documents w.r.t. a query.
One can define the fuzzy grade based on the
closeness to a query.

39
Proximity matching

The proximity criteria can be used independently
of any other criteria.
A modification is to use phrases rather than
words. But it causes problems in some cases
(e.g., "information retrieval" vs. "the retrieval
of information").
Another modification is to use the order of words
(e.g., "junior college" vs. "college junior").
However, this still causes the same problem as
before.
Many systems introduce a measure on the proximity.

40
Effects of weighting

Weights can be given on sets of words, rather
than individual words.
E.g., (beef and broccoli): 5; (beef but not
broccoli): 2; (broccoli but not beef): 2;
noodles: 1; snow peas: 1; water chestnuts: 1.

41
Effects of scaling

An extensive collection is likely to contain
fewer additional relevant documents.
Information filtering aims at producing a
relatively small set.
Another possibility is to use several models
together, leading to so-called data fusion.

42
A user-centered view

Each user has an individual vocabulary that may
not match that of the author, editor, or indexer.
Many times, the user does not know how to specify
his/her information need: "I'll know it when I
see it." Therefore, it is important to allow
users direct access to the data (browsing).

43
Text Analysis
44
Indexing

Indexing is the act of assigning index terms to a
document.
Many nonfiction books have indexes created by
authors.
The indexing language may be controlled or
uncontrolled.
For manual indexing, an uncontrolled indexing
language is generally used.
Lack of consistency (the agreement in index term
assignment may be as little as 20%)
Difficult for fast-evolving fields.

45
Indexing (Contd)

Characteristics of an indexing language:
Exhaustivity (the breadth) and specificity (the
depth)
The ingredients of indexes:
Links (occur together)
Roles
Cross referencing
See: coal, see fuel
Related terms: microcomputer, see also personal
computer
Broader term (BT): poodle, BT dog
Narrower term (NT): dog, NT poodle, cocker
spaniel, pointer.

46
Indexing (Contd)

Automatic indexing will play an ever-increasing
role.
Approaches for automatic indexing
Word counting
Based on deeper linguistic knowledge
Based on semantics and concepts within a document
collection.
An inverted file is often used to store the indexes
of documents in a document collection.

47
Matrix Representations

Term-document matrix A:
$A_{ij}$ indicates the occurrence or the count of term
i in document j.
Term-term matrix T:
$T_{ij}$ indicates the co-occurrence or the
co-occurrence count of term i and term j.
Document-document matrix D:
$D_{ij}$ indicates the degree of term overlap
between document i and document j.
These matrices are usually sparse and are better
stored as lists.
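
A minimal sketch of list-based storage, keeping only the nonzero entries of the term-document matrix. Whitespace tokenization and the definition of overlap as the number of shared distinct terms are assumptions made for this example.

```python
from collections import Counter, defaultdict

def term_document_lists(docs):
    """Store only the nonzero A[i][j] entries: term -> {doc_id: count}."""
    matrix = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            matrix[term][doc_id] = count
    return matrix

def doc_overlap(matrix, d1, d2):
    """D[d1][d2], taken here as the number of distinct shared terms."""
    return sum(1 for postings in matrix.values() if d1 in postings and d2 in postings)

docs = {
    "d1": "fuzzy query fuzzy set",
    "d2": "boolean query",
    "d3": "fuzzy set theory",
}
A = term_document_lists(docs)
print(A["fuzzy"])
print(doc_overlap(A, "d1", "d3"))
```

The document-document matrix never has to be materialized; entries are computed on demand from the term lists.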

48
Term Extraction and Analysis

It has been observed that frequencies of words in
a document follow the so-called Zipf's law
($f = k r^{-1}$): 1, 1/2, 1/3, 1/4, ...
Many similar observations have been made:
Half of a document is made up of 250 distinct
words.
20% of the text words account for 70% of term
usage.
None of the observations are supported by Zipf's
law.
High-frequency terms are not desirable because
they are so common.
Rare words are not desirable because very few
documents will be retrieved.

49
Term Association

Term association is expanded with the concept of
word proximity.
Proximity measure depends on:
The number of intervening words
The number of words appearing in the same
sentence
Word order
Punctuation
However, there are risks: "The felon's
information assured the retrieval of the money",
"the retrieval of information", and "information
retrieval".

50
Term significance

Frequent words in a document collection may not
be significant (e.g., "digital computer" in a
computer science collection).
Absolute term frequency ignores the size of a
document.
Relative term frequency is often used:
Absolute term frequency / length of doc.
Term frequency of a document collection:
Total frequency count of a term / total words in
documents of a document collection
Number of documents containing the term / total
number of documents.

51
How to adjust the frequency weight of a term

Inverse document frequency weight
N: total number of documents.
$d_k$: number of documents containing term k.
$f_{ik}$: absolute frequency of term k in doc. i.
$w_{ik}$: the weight of term k in document i.
$\text{idf}_k = \log_2(N/d_k) + 1$
$w_{ik} = f_{ik} \cdot \text{idf}_k$
This weight assignment is called TF-IDF.
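
A short sketch of this weighting, using the slide's definition of idf (base-2 logarithm of N over the document count, plus 1). The input layout (a dict of per-document term counts) and the function name are assumptions for illustration.

```python
import math

def tf_idf(freqs):
    """freqs: doc_id -> {term: f_ik}. Returns doc_id -> {term: w_ik}."""
    n_docs = len(freqs)
    doc_count = {}  # d_k: number of documents containing term k
    for counts in freqs.values():
        for term in counts:
            doc_count[term] = doc_count.get(term, 0) + 1
    idf = {t: math.log2(n_docs / d) + 1 for t, d in doc_count.items()}
    return {doc: {t: f * idf[t] for t, f in counts.items()}
            for doc, counts in freqs.items()}

freqs = {
    "d1": {"retrieval": 3, "the": 10},
    "d2": {"the": 8},
    "d3": {"the": 5, "fuzzy": 2},
    "d4": {"the": 7},
}
weights = tf_idf(freqs)
print(weights["d1"])  # {'retrieval': 9.0, 'the': 10.0}
```

Because "the" occurs in every document its idf is 1, so its weight stays at the raw count, while "retrieval" (one document out of four) is tripled.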

52
How to adjust the frequency weight of a term
(Contd)

Signal-to-noise
$H(p_1, p_2, \ldots, p_n)$: information content of a
document with $p_i$ being the probability of word i.
Requirements:
H is a continuous function of $p_i$.
If $p_i = 1/n$, H is a monotone increasing function of
n.
H preserves the partitioning property:
$H(1/2, 1/3, 1/6) = H(1/2, 1/2) + \frac{1}{2} H(2/3, 1/3)$
$= H(2/3, 1/3) + \frac{2}{3} H(3/4, 1/4)$
Entropy function satisfies all three requirements:
$H = -\sum_{i=1}^{n} p_i \log_2 p_i$
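
The partitioning property can be verified numerically for the Shannon entropy with base-2 logarithms. The helper name entropy is an assumption of this sketch.

```python
import math

def entropy(*p):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

lhs = entropy(1/2, 1/3, 1/6)
# Group {1/3, 1/6}: total mass 1/2, split 2/3 : 1/3 inside the group.
rhs1 = entropy(1/2, 1/2) + 1/2 * entropy(2/3, 1/3)
# Group {1/2, 1/6}: total mass 2/3, split 3/4 : 1/4 inside the group.
rhs2 = entropy(2/3, 1/3) + 2/3 * entropy(3/4, 1/4)
print(lhs, rhs1, rhs2)  # all three agree, about 1.459 bits
```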

53
How to adjust the frequency weight of a term
(Contd)

The more frequent a word is, the less information
it carries.
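The noise $n_k$ of index term k is defined as
$n_k = \sum_{i=1}^{N} (f_{ik}/t_k) \log_2 (t_k/f_{ik})$, where $t_k$ is the
total frequency of term k in the collection.
The signal $s_k$ of index term k is defined as
$s_k = \log_2 t_k - n_k$.
The weight $w_{ik}$ of term k in document i is
$w_{ik} = f_{ik} \cdot s_k$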
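
A small sketch of the signal weighting. The transcript does not preserve the noise formula, so the code assumes the standard definition, n_k = sum over documents of (f_ik / t_k) * log2(t_k / f_ik), with t_k the total frequency of term k in the collection; treat that definition, and the function names, as assumptions.

```python
import math

def noise(freqs):
    """freqs: the frequencies f_ik of one term k across all documents."""
    t_k = sum(freqs)
    return sum((f / t_k) * math.log2(t_k / f) for f in freqs if f > 0)

def signal(freqs):
    """s_k = log2(t_k) - n_k."""
    return math.log2(sum(freqs)) - noise(freqs)

evenly_spread = [2, 2, 2, 2]   # same count in every document
concentrated = [8, 0, 0, 0]    # all occurrences in one document
print(noise(evenly_spread), signal(evenly_spread))  # 2.0 1.0
print(noise(concentrated), signal(concentrated))    # 0.0 3.0
```

Both terms occur 8 times in total, but the evenly spread one is mostly noise, while the concentrated one keeps its full signal and therefore gets the larger weight.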

54
How to adjust the frequency weight of a term
(Contd)

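Term discrimination value
The average similarity $\bar{\sigma}$
A centroid document $D^*$, where $f^*_k = t_k/N$.
$\delta_k = \bar{\sigma}_k - \bar{\sigma}$, where $\bar{\sigma}_k$ is the average
similarity with term k removed.
$w_{ik} = f_{ik} \cdot \delta_k$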

55
Phrases and Proximity

Weighting schemes discriminate against phrases.
How to compensate?
Count both the individual words and the phrase.
Count the number of words in a phrase:
1 + log(number of words in a phrase)
How to handle a proximity query?
Documents with the involved words are identified,
followed by the judgment of proximity criteria.
Direct analysis of a document collection can be
done by using standard vocabulary analysis (e.g.,
Brown corpus).

56
Pragmatic Factors

Identifying trigger phrases:
Words such as "conclusion" and "finding" identify
key points and ideas in a document.
Weighting authors
Weighting journals
Users' pragmatic factors:
Education level
Novice or expert in an area

57
Document Similarity

Similarity metrics of 0-1 vector.
Contingency table for doc. to doc. match:

            D2 = 1    D2 = 0
D1 = 1      w         x          n1
D1 = 0      y         z          N - n1
            n2        N - n2     N

58
Document similarity

If D1 and D2 are independent, $w/N = (n_1/N)(n_2/N)$.
We can define the basic comparison between D1 and
D2 as $\delta(D_1, D_2) = w - (n_1 n_2/N)$.
In general, the similarity between D1 and D2 can
be defined as follows

59
Various ways for defining coefficient of
association

Separation coefficient: $N/2$.
Rectangular distance: $\max(n_1, n_2)$.
Conditional probability: $\min(n_1, n_2)$.
Vector angle: $(n_1 n_2)^{1/2}$.
Arithmetic mean: $(n_1 + n_2)/2$.
For more, see p. 128.
For the relationship, see Table 5.2.

60
Other close similarity metrics

Use only w instead of $w - (n_1 n_2/N)$.
Dice's coefficient: $2w/(n_1 + n_2)$.
Cosine coefficient: $w/(n_1 n_2)^{1/2}$.
Overlap coefficient: $w/\min(n_1, n_2)$.
Jaccard's coefficient: $w/(N - z)$.
Distance measures requirements:
Non-negative
Symmetric
Triangle inequality: $\text{Dist}(A, C) \le \text{Dist}(A, B) + \text{Dist}(B, C)$
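
The four coefficients can be computed directly from the contingency counts of two 0-1 vectors. The function names and the sample vectors are assumptions for illustration.

```python
import math

def contingency(d1, d2):
    """d1, d2: equal-length 0-1 vectors. Returns (w, n1, n2, N, z)."""
    w = sum(1 for a, b in zip(d1, d2) if a and b)          # terms in both
    z = sum(1 for a, b in zip(d1, d2) if not a and not b)  # terms in neither
    return w, sum(d1), sum(d2), len(d1), z

def coefficients(d1, d2):
    w, n1, n2, N, z = contingency(d1, d2)
    return {
        "dice": 2 * w / (n1 + n2),
        "cosine": w / math.sqrt(n1 * n2),
        "overlap": w / min(n1, n2),
        "jaccard": w / (N - z),
    }

d1 = [1, 1, 1, 0, 0, 0]
d2 = [1, 1, 0, 1, 0, 0]
print(coefficients(d1, d2))
```

Here w = 2, n1 = n2 = 3, N = 6 and z = 2, so Dice, cosine and overlap all equal 2/3 while Jaccard is 2/4.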

61
Stop lists

Stop list or negative dictionary consists of very
high frequency words.
Typical stop list contains 150-500 words.
Any well-defined field may have its own jargon.
Words in the stop list should be excluded from
later processing.
Query should also be processed against the stop
list.
However, phrases that contain words in the stop
list should not always be eliminated (e.g., "to be
or not to be").

62
Stemming

Computer, computers, computing, compute,
computes, computed, computational,
computationally, computable all deal with closely
related concepts.
Use a stemming algorithm to strip off word endings
(e.g., comput).
Watch out for false stripping:
bed -> b, breed -> bre
Remedies: keep a minimum acceptable stem length,
keep a small list of exceptional words, and keep
various word forms.
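
A toy suffix stripper illustrating two of the safeguards above: a minimum stem length and an exception list. The suffix list, the exception list and the minimum length of 3 are all made-up values for this sketch; it is not one of the published stemming algorithms.

```python
# Hypothetical suffix list, tried longest first.
SUFFIXES = sorted(
    ["ationally", "ational", "able", "ing", "ers", "er", "es", "ed", "e", "s"],
    key=len, reverse=True)
EXCEPTIONS = {"breed"}   # words that must never be stripped
MIN_STEM = 3             # minimum acceptable stem length

def stem(word):
    word = word.lower()
    if word in EXCEPTIONS:
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: len(word) - len(suffix)]
    return word

words = ["computer", "computers", "computing", "compute", "computes",
         "computed", "computational", "computationally", "computable"]
print({stem(w) for w in words})   # {'comput'}
print(stem("bed"), stem("breed"))  # bed breed
```

"bed" is protected by the minimum stem length, but "breed" would still be cut to "bre" (length 3), which is why the exception list is needed as well.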

63
Stemming (contd)

Stemming may not save much space (5%).
One can also stem only the queries and then use
wild cards in matching.
Watch the various word forms. E.g., "knife" should
be expanded as "knif" and "kniv".

64
Thesauri

A thesaurus contains
Synonyms
Antonyms
Broader terms
Narrower terms
Closely related terms
A thesaurus can be used during the query
processing to broaden a query.
A similar problem arises w.r. t. homonyms.

65
Mid-term project

Lexical analysis and stoplist (Ch7)
Stemming algorithms (Ch8)
Thesaurus construction (Ch9)
String searching algorithms (Ch10)
Relevance feedback and other query modification
techniques (Ch11)
Hashing algorithms (Ch13)
Ranking algorithms (Ch14)
Chinese text segmentation (to be provided)

66
File Structures
67
Inverted File

Structures for inverted file:
Sorted array (Figure 3.1 in the supplement)
B-tree (Figure 3.2 in the supplement)
Trie
A straightforward approach:
Parse the text to get a list of (word, location)
pairs.
Sort the list in ascending order of word.
Weight each word.
See Figure 3.3 and 3.4 in the supplement
Hard to evolve.
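
The parse, sort, weight steps can be sketched as follows. Whitespace tokenization, the use of (document, position) as the location, and the in-document count as the weight are assumptions made for this example.

```python
from itertools import groupby

def build_inverted_file(docs):
    # Parse: one (word, doc_id, position) triple per word occurrence.
    pairs = [(word, doc_id, pos)
             for doc_id, text in docs.items()
             for pos, word in enumerate(text.lower().split())]
    # Sort in ascending order of word (then by location).
    pairs.sort()
    # Weight: collapse each word's run into {doc_id: count}.
    index = {}
    for word, group in groupby(pairs, key=lambda p: p[0]):
        postings = {}
        for _, doc_id, _pos in group:
            postings[doc_id] = postings.get(doc_id, 0) + 1
        index[word] = postings
    return index

docs = {1: "fuzzy query fuzzy set", 2: "boolean query"}
index = build_inverted_file(docs)
print(index)
```

Adding a document means re-parsing and re-sorting the whole list, which is the sense in which this approach is hard to evolve.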

68
Inverted File (Contd)

The data structure can be improved for faster
searching (Figure 3.5 in the supplement)
A dictionary, including:
Term and number of postings
A posting file, including:
A set of lists, one for each term, each entry
holding:
Doc #
Number of postings in the doc.
See Figure 3.5.

69
Inverted File (Contd)

The dictionary can be implemented as a B-tree.
When a term in a new document is identified,
A new tree node is created, or
The related data of an existing node is modified.
The posting file can be implemented as a set of
linked lists.
See Table 3.1 for some statistics.

70
Signature File

A document is partitioned into a set of blocks,
each of which has D keywords.
Each keyword is represented by a bit pattern
(signature) of size F, with m bits set to 1.
The block signature is formed by superimposing
(OR) the constituent word signatures.
Sig(Q) AND Sig(B) = Sig(Q) if B contains the words
in Q.
See Figure 4.1 in the supplement.

71
Signature File (Contd)

Which m bits should be set for a given word?
For each triplet (3 consecutive letters) of word W,
a hashing function maps it to a position in
[0, F-1].
If the number of 1s is less than m, randomly set
additional bits.
How to set m?
It has been shown that when $m = F \ln 2 / D$, the false
drop probability is minimized.
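
A minimal sketch of superimposed coding under stated assumptions: F = 64 and D = 4 are illustrative values, MD5 stands in for the unspecified hashing function, and the "random" extra bits come from a generator seeded with the word so that a word always maps to the same signature.

```python
import hashlib
import math
import random

F = 64                           # signature length in bits (illustrative)
D = 4                            # keywords per block (illustrative)
M = round(F * math.log(2) / D)   # m = F ln2 / D, here 11

def word_signature(word):
    sig = 0
    for i in range(max(1, len(word) - 2)):
        triplet = word[i:i + 3]
        pos = int(hashlib.md5(triplet.encode()).hexdigest(), 16) % F
        sig |= 1 << pos
    rng = random.Random(word)    # seeded: same word, same extra bits
    while bin(sig).count("1") < M:
        sig |= 1 << rng.randrange(F)
    return sig

def block_signature(words):
    """Superimpose (OR) the word signatures of one block."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(block_sig, query_words):
    """True if every query bit is set in the block signature."""
    q = block_signature(query_words)
    return (q & block_sig) == q

block = block_signature(["signature", "file", "hash", "bit"])
print(may_contain(block, ["file", "hash"]))  # True
```

A True answer only means the block may contain the words (a false drop is possible), so candidates still have to be checked against the text. Words with more than m distinct triplet positions end up with more than m bits in this sketch.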

72
Signature File (Contd)

The signature file could be huge. Sequential
search takes time.
The signature file is often sparse.
Three approaches to reduce query time
Compression
Vertical partitioning
Horizontal partitioning

73
Signature File (Contd)

Vertical partitioning
Use F different files, one per bit position.
For a query with k bits set, we need to examine k
files. Then AND these files.
The qualifying blocks will have 1s in the
resultant vector.
Inserting a block requires writing to F files.

74
Signature File (Contd)

Horizontal partitioning
Two-level signatures
The first level has N document signatures.
Several signatures with a common prefix are
grouped into a group.
The second level has group signatures, which are
created by superimposing the constituent document
signatures.
This approach can be generalized to a B-tree-like
structure (called S-tree).

75
User Profiles and Their Use
76
Simple Profiles

A simple profile consists of a set of key terms
with given weights, much like a query.
Such profiles were originally developed for
current awareness (CA) or selective dissemination
of information (SDI).
The purpose of CA (SDI) is to help researchers
keep up with the latest developments in their
areas.
In a CA system, users are asked to file an
interest profile, which must be updated
periodically.
In fact, the interest profile acts as a routing
query.

77
Extended Profiles

Extended profiles record background information
of a person that might help in determining the
interested document types.
Education level, familiarity of an area, language
fluency, journal subscriptions, reading habits,
specific preferences.
This type of information cannot be used directly
in the retrieval process but must be applied to
the retrieval set to organize it.

78
Current Awareness Systems

It assumes that the user is adequately aware of
past work and needs only to keep abreast of
current developments.
It operates only on current literature and
actively, without user intervention.
The user may redefine a profile at any time, and
many systems will periodically remind users to
review their profiles.
Most CA systems make use of only the simple user
profile.
Current awareness systems are suitable for a
dynamic environment.

79
Retrospective Search Systems

The effectiveness of a CA system is difficult to
measure because users often treat the presented
documents off-line.
Unlike a CA system, a retrospective search system
has a relatively large and stable database and
handles ad-hoc queries.
Virtually all existing retrospective search
systems do not differentiate users.

80
Modifying the Query By the Profile

A reference librarian may help a person with a
request by learning more about this person's
background and level of knowledge. E.g., "theory
of groups".
A given query may be modified according to the
person's profile.
Three ways to modify a query:
Post-filter: effort to retrieve documents is
substantial.
Pre-filter: a food query <calories = 3,
spiciness = 7> may be modified for a user with
profile <2, 2> to <2.8, 6>.

81
Modifying the Query By the Profile

Suppose $Q = \langle q_1, q_2, \ldots, q_n\rangle$ and $P = \langle p_1, p_2, \ldots, p_n\rangle$.
Simple linear transformation: $q_i' = k p_i + (1-k) q_i$.
Piecewise linear transformation:
Case 1. $p_i \ne 0$ and $q_i \ne 0$: ordinary k value.
Case 2. $p_i = 0$ and $q_i \ne 0$: k is very small (5%).
Case 3. $p_i \ne 0$ and $q_i = 0$: k is smaller (50%).
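
The transformation can be sketched as below. The value k = 0.2 is inferred from the food-query example (it maps <3, 7> with profile <2, 2> to <2.8, 6>); the parameter names and the reading of the two special-case percentages as k values are assumptions.

```python
def modify_query(query, profile, k=0.2, k_no_profile=0.05, k_no_query=0.5):
    """Move each query weight toward the profile: q' = k*p + (1-k)*q."""
    modified = []
    for q, p in zip(query, profile):
        if p == 0 and q != 0:
            k_i = k_no_profile   # Case 2: profile silent on this term
        elif p != 0 and q == 0:
            k_i = k_no_query     # Case 3: query silent on this term
        else:
            k_i = k              # Case 1: ordinary k
        modified.append(k_i * p + (1 - k_i) * q)
    return modified

print(modify_query([3, 7], [2, 2]))  # approximately [2.8, 6.0]
```

With k = 0 the profile is ignored and with k = 1 the query is replaced by the profile, so k controls how strongly the profile pulls the query.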

82
Query and Profile as Separate Reference Points

Query and profile are treated as co-filters.
Four approaches
Disjunctive model: $|D, Q| \le d$ or $|D, P| \le d$.
Conjunctive model: $|D, Q| \le d$ and $|D, P| \le d$.
Ellipsoidal model: $|D, Q| + |D, P| \le d$, see Figure
6.2, 6.3.
Cassini oval model: $|D, Q| \times |D, P| \le d$, see Figure
6.4.
All the above models can be weighted.
Empirical experiments showed that query-profile
combinations do provide better performance than
the query alone.
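
The four co-filter models differ only in how the two distances are combined. A minimal sketch, assuming Euclidean distance and 2-D points for readability; the function names are made up for the example.

```python
import math

def dist(a, b):
    """Euclidean distance |a, b| (any metric could be substituted)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def selects(model, doc, query, profile, d):
    dq, dp = dist(doc, query), dist(doc, profile)
    if model == "disjunctive":
        return dq <= d or dp <= d
    if model == "conjunctive":
        return dq <= d and dp <= d
    if model == "ellipsoidal":
        return dq + dp <= d
    if model == "cassini":
        return dq * dp <= d
    raise ValueError(model)

Q, P, doc = (0, 0), (4, 0), (1, 0)   # |doc, Q| = 1, |doc, P| = 3
for model in ("disjunctive", "conjunctive", "ellipsoidal", "cassini"):
    print(model, selects(model, doc, Q, P, d=2))
```

With d = 2 only the disjunctive model accepts this document; raising d to 4 makes all four accept it.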

83
Multiple Reference Point Systems

A reference point is a defined point or concept
against which a document can be judged.
Queries, user profiles, known papers or books are
reference points.
A reference point is sometimes called a point of
interest (POI).
Weights and metrics can be applied to general
reference points as before.

84
Documents and Document Clusters

Each favored document can be treated as a
reference point.
Favored documents can also be clustered. Each
document cluster may be represented as a cluster
point.
Many statistical techniques can be used to
cluster documents.
The centroid or medoid of a document cluster is
then used as the reference point.

85
The Mathematical Basis
86
GUIDO

Graphical User Interface for Document
Organization. Rather than using terms as vector
dimensions, GUIDO uses each reference point as a
dimension, resulting in a low-dimensional space.
In a 2-D GUIDO, a document is represented as an
ordered pair (x, y), where x is the distance from
Q and y is the distance from P. Note that
$|P, Q| = \delta$.
$P = (\delta, 0)$, $Q = (0, \delta)$.
Consider the line between P and Q. Three cases:
$|D, P| + |D, Q| = \delta$
$|D, P| - |D, Q| = \delta$
$|D, P| - |D, Q| = -\delta$

87
GUIDO

For any point not on the line between P and Q:
$|D, P| + |D, Q| > \delta$
$|D, P| + \delta > |D, Q|$
$|D, Q| + \delta > |D, P|$
Observation 1: multiple document points are
mapped into the same point in the distance space.
Observation 2: complex boundary contours are mapped
into simpler contours.
In the ellipsoidal model, the contour becomes a
straight line parallel to the P-Q line.

88
GUIDO

In the weighted ellipsoidal model, the contour is
still a straight line but at an angle.
If we are looking for a document D where the
distance ratio of $|D, P|$ to $|D, Q|$ is a constant,
we have
$|D, Q| < d/fr$. (See the general model)
Therefore, the contour is a circle in the general
model.
The contour is a straight line crossing the origin
in the GUIDO model because $|D, P| = k\,|D, Q|$. See
Figure 7.5.
With different metrics, the size of the distance
space and the locations of documents may change, but
the basic shape in the distance space remains.

89
VIBE

Visual Information Browsing Environment: a user
chooses the positions of reference points
arbitrarily on the screen.
The location of a document is determined by the
ratios of its similarities to the reference points.
Each document is represented as a rectangle whose
size is the importance (sum of similarities?) to
the reference points.

90
VIBE

In a 2-POI VIBE, documents are displayed on the
line connecting the two POIs.
In an n-POI VIBE, let $p_1, p_2, \ldots, p_n$ be the
coordinates of the POIs and $s_1, s_2, \ldots, s_n$ be the
similarities of a document D to these POIs. The
coordinate of D, $p_d$, is (see Example 7.2)
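
The formula itself is not preserved in this transcript. A common way to realize "position given by the ratios of similarities" is a similarity-weighted average of the POI coordinates; the sketch below assumes exactly that, and both the formula and the function name should be read as assumptions rather than as the definition from the slide.

```python
def vibe_position(pois, sims):
    """Assumed placement: p_d = sum(s_i * p_i) / sum(s_i), per screen axis."""
    total = sum(sims)
    return tuple(
        sum(s * p[axis] for s, p in zip(sims, pois)) / total
        for axis in range(2)
    )

# Two POIs: the document lies on the line between them, nearer the
# POI it is more similar to.
print(vibe_position([(0, 0), (10, 0)], [3, 1]))            # (2.5, 0.0)
# Three POIs: the document lies inside the triangle.
print(vibe_position([(0, 0), (10, 0), (0, 10)], [1, 1, 2]))  # (2.5, 5.0)
```

Under this assumption only the ratios of the similarities matter: scaling all of them by the same factor leaves the position unchanged.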

91
VIBE

While GUIDO is based on distance metrics, VIBE is
based on similarity metrics.
Consider a 2-POI VIBE: a document is located at a
position determined by the fixed ratio $c = s_1/s_2$.
If $s_i = 1/d_i$, $c = d_2/d_1$. Thus, a straight line in
GUIDO is a point in VIBE.
If $s = k^{-d}$, $c = k^{d_2 - d_1}$. Further compressed.

92
Boolean VIBE

One can think of n+1 POIs as vertices in
n dimensions that form a polyhedron.
Three POIs A, B, and C form a triangle in a 2-D
space as shown in Figure 7.10.
Documents containing all terms of A and B appear
on the line A-B. Documents containing all terms
of A, B, and C appear inside the triangle.
Four POIs form a polyhedron in a 3-D space.

93
Boolean VIBE

To render n POIs on a 2-D display, the resulting
display consists of $2^n - 1$ Boolean points,
representing all Boolean combinations except the
one that is completely negated, see Figure 7.10.
A threshold on the similarity between points needs
to be specified for determining document
positions, see Table 7.1.

94
Retrieval Effectiveness Measures
95
Goodness of an IR System

Judged by the user for appropriateness to her
information need: vague.
Determine the level of judgment:
Question that meets the information need
Query that corresponds to the question.
Determine the measure:
Binary: accepted or rejected
N-ary: 4 = definitely relevant, 3 = probably
relevant, 2 = neutral, 1 = probably not relevant,
0 = definitely not relevant.

96
Goodness of an IR System (Contd)

Relevance of a document: how well this document
responds to the query.
Pertinence of a document: how well this document
satisfies the information need.
Usefulness of a document:
The document is not relevant or pertinent to my
present need, but it is useful in a different
context.
The document is relevant, but it is not useful
because I already know it.

97
Precision and Recall

                Retrieved     Not retrieved
Relevant        w             x                n1 = w + x
Not relevant    y             z
                n2 = w + y                     N = w + x + y + z

Precision = w/n2.
Recall = w/n1.
The number of documents returned in response to a
query (n2) may be controlled by either a first-K
cutoff or a similarity threshold.
If very few documents are returned, precision
could be high, while recall is very low.
If all documents are returned, recall = 1, while
precision is very low.

98
Precision and Recall (contd)

One can plot a precision-recall graph to compare
the performance of different IR systems. See
Figure 8.1.
Two related measures:
Fallout: the proportion of nonrelevant documents
that are retrieved, F = y / (N - n1).
Generality: the proportion of relevant documents
within the entire collection, G = n1/N.
Precision (P), recall (R), fallout (F), and
generality (G) are related:
$P = RG / (RG + F(1 - G))$

99
Precision and Recall (contd)

P/(1-P) is the ratio of relevant retrieved
documents to nonrelevant retrieved documents.
G/(1-G) is the ratio of relevant documents to
nonrelevant documents in the collection.
R/F > 1 if the IR system does better in locating
relevant documents.
R/F < 1 if the IR system does better in rejecting
non-relevant documents.
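
The four measures and the ratio identity that links them, P/(1-P) = (R/F) * G/(1-G), can be checked on a small contingency table. The counts below are invented for illustration.

```python
import math

def measures(w, x, y, z):
    """w, x, y, z as in the retrieved/relevant contingency table."""
    n1, n2, N = w + x, w + y, w + x + y + z
    return {
        "P": w / n2,          # precision
        "R": w / n1,          # recall
        "F": y / (N - n1),    # fallout
        "G": n1 / N,          # generality
    }

m = measures(w=30, x=20, y=10, z=940)
odds_retrieved = m["P"] / (1 - m["P"])    # relevant : nonrelevant, retrieved
odds_collection = m["G"] / (1 - m["G"])   # relevant : nonrelevant, collection
print(m)
print(odds_retrieved, (m["R"] / m["F"]) * odds_collection)  # both 3.0 (up to rounding)
```

R/F is therefore the factor by which retrieval improves the odds of a document being relevant over picking from the collection at random; here it is 57.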

100
Precision and Recall (contd)

Weakness of precision/recall measures
It is generally difficult to get exact value for
recall because one has to examine the entire
collection.
It is not clear that recall and precision are
significant to the user. Some argued that
precision is more important than recall.
Either one represents an incomplete picture of
the IR system's performance.

101
User-oriented measures

The above measures attempt to measure the
performance of the entire IR system, regardless
of the differences on users.
From a user's point of view, her interpretation of
the retrieved set of documents could be:
Let V = # of relevant documents known to the user,
Vn = # of relevant, retrieved documents known to
the user, and N = # of relevant, retrieved documents.
Coverage ratio: Vn/V
Novelty ratio: (N - Vn)/N

102
User-oriented measures (Contd)

Relative recall: # of relevant, retrieved
documents / # of desired documents.
Recall effort: # of desired documents / # of
documents examined.

103
Average precision and recall

Fix recall at several points (say, 0.25, 0.5, and
0.75) and compute the average precision at each
recall level.
If the exact recall is difficult to compute, one
can compute the average precision for each fixed
number of relevant documents. See Table 8.2.
If the exact recall can be computed, a more
comprehensive precision/recall table can be
obtained. See Table 8.3.
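
One way to read the "fixed number of relevant documents" variant is to record the precision at the point where the 1st, 2nd, ... relevant document is retrieved, then average over queries. The sketch below assumes that reading. The layout of Table 8.2 is not reproduced here, and the function names and rankings are hypothetical.

```python
def precision_at_relevant_counts(ranked_relevance):
    """Precision at the rank where the 1st, 2nd, ... relevant document appears."""
    precisions, found = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            found += 1
            precisions.append(found / rank)
    return precisions

def average_over_queries(runs, k):
    """Average precision at the k-th relevant document over queries that reach it."""
    per_query = [precision_at_relevant_counts(run) for run in runs]
    values = [p[k - 1] for p in per_query if len(p) >= k]
    return sum(values) / len(values) if values else None

# Two hypothetical ranked result lists (1 = relevant, 0 = not relevant).
runs = [
    [1, 0, 1, 0, 0],  # relevant documents at ranks 1 and 3
    [0, 1, 0, 1, 0],  # relevant documents at ranks 2 and 4
]
```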

104
Operating Curves

Let C be a measurable characteristic, and let P1 and P2
be the sets of relevant and irrelevant documents,
respectively.
If C distinguishes P1 and P2 well, the curve will
have a higher slope.
It has been shown that the operating curve of a
given IR system is usually a straight line.
The distance from <50, 50> to the operating curve
along the line from <0, 100> to <50, 50> can be used
to measure the performance of an IR system,
called Swets' E measure. See Figure 8.3.

105
Expected search length

None of the above measures considers the order
of returned documents.
Suppose the set of retrieved documents can be
divided into subsets S1, S2, ..., Sk with
decreasing priority and Si has ni relevant
documents.
Given a desired number N of relevant documents,
one can compute the expected search length. See
Example 8.2.
By varying N, one can plot a performance curve based on
the expected search length, as shown in Figure 8.4.
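
A hedged sketch of the computation, assuming Cooper's formulation, in which the search length counts the non-relevant documents examined and documents inside one subset are in random order. Example 8.2 may count differently. The function name and the subset counts are hypothetical.

```python
def expected_search_length(levels, wanted):
    """levels: (relevant, non_relevant) counts per subset, best subset first.
    Returns the expected number of non-relevant documents examined before
    `wanted` relevant documents are found, or None if too few exist."""
    skipped_non_relevant = 0
    still_needed = wanted
    for relevant, non_relevant in levels:
        if relevant >= still_needed:
            # Final subset: on average, non_relevant / (relevant + 1)
            # non-relevant documents precede each relevant one.
            return skipped_non_relevant + non_relevant * still_needed / (relevant + 1)
        still_needed -= relevant
        skipped_non_relevant += non_relevant
    return None

# Hypothetical output: S1 has 1 relevant and 2 non-relevant documents,
# S2 has 2 relevant and 3 non-relevant documents.
levels = [(1, 2), (2, 3)]
lengths = [expected_search_length(levels, n) for n in (1, 2, 3, 4)]
```

Varying the desired number of relevant documents, as in the last line, gives the points for a curve of the kind the slide describes.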

106
Expected search length (Contd)

An aggregate number can be computed as the
average number of documents searched per relevant
document. Let the number be ei.
If searching for 1, 2, ..., 7 documents is
equally likely, one can compute the
overall expected search length by the formula

107
Normalized recall

A typical IR system presents results to the user in
a linear list.
If a user sees many relevant documents first, she
may be more satisfied with the system
performance.
Rocchio's normalized recall is defined via a step
function F, where F(k) = F(k-1) + 1 if the kth
document is relevant and F(k) = F(k-1) otherwise.
See Figure 8.5.
A step function F is defined as
F(0) = 0,
F(k+1) = F(k) or F(k) + 1.

108
Normalized recall (Contd)

Let A be the area between the actual and ideal
graphs, n1 be the number of relevant documents, N
be the number of documents examined.
Normalized recall = 1 - A/(n1(N - n1)).
However, if two systems behave the same except
for the position of the last document, the
normalized recall values may differ a lot.
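
A sketch of the computation, assuming the area A is taken as the sum over positions k of the gap between the ideal step function (all relevant documents ranked first) and the actual one. The function name and the example rankings are hypothetical.

```python
def normalized_recall(ranked_relevance):
    """ranked_relevance: list of 0/1 flags in ranked order (1 = relevant)."""
    N = len(ranked_relevance)
    n1 = sum(ranked_relevance)
    if n1 == 0 or n1 == N:
        return 1.0  # assumption: nothing can be misordered in these cases
    area, seen = 0, 0
    for k, is_relevant in enumerate(ranked_relevance, start=1):
        seen += is_relevant      # actual step function F(k)
        ideal = min(k, n1)       # ideal F(k): relevant documents come first
        area += ideal - seen
    return 1 - area / (n1 * (N - n1))

best = normalized_recall([1, 1, 0, 0])
worst = normalized_recall([0, 0, 1, 1])
mixed = normalized_recall([1, 0, 1, 0])
```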

109
Sliding ratio

Rather than judging a document as either relevant
or irrelevant, sliding ratio assigns weighted
relevance to each document.
Let the weight list of the retrieved documents be
w1, w2, ..., wN, and their sorted list be W1, W2,
..., WN in decreasing order. The sliding ratio
SR(n) is defined as
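
The sliding ratio is commonly written as the sum of the first n actual weights divided by the sum of the first n weights of the ideally sorted list, SR(n) = (w1 + ... + wn) / (W1 + ... + Wn). The sketch below assumes that form. The function name and weights are hypothetical.

```python
def sliding_ratio(weights, n):
    """weights: relevance weights in the order the system returned them."""
    ideal = sorted(weights, reverse=True)  # W1 >= W2 >= ... >= WN
    denom = sum(ideal[:n])
    return sum(weights[:n]) / denom if denom else 0.0

weights = [2, 5, 3]  # hypothetical weights of three retrieved documents
ratios = [sliding_ratio(weights, n) for n in (1, 2, 3)]
```

The ratio reaches 1 at n = N, because both sums then cover the same set of weights.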

110
Satisfaction and frustration

Myaeng divides the measure into satisfaction and
frustration.
Satisfaction is the accumulative sum of
satisfaction weights.
Frustration is the accumulative sum of
2-satisfaction weights. See Example 8.4.
Total = Satisfaction - frustration.

111
Content-based Recommendation
112
NewsWeeder: Learning to Filter Netnews

Ken Lang
Proceedings of the Conference on Machine
Learning, 1995

113
Introduction

NewsWeeder is a netnews-filtering system.
It allows users to read regular newsgroups.
It also creates some personal, virtual newsgroups
such as nw.top50.bob for Bob.
A list of article summaries sorted by predicted
rating.
After reading an article, the reader clicks on a
rating from one to five.

114
Introduction

This way of collecting users' ratings is called
active feedback, in contrast to passive feedback,
such as time spent reading.
The drawback of active feedback is the extra
effort required for explicit rating.
Each night, the system uses the collected rating
information to learn a new model of each user's
interest.
How to learn a new model is the subject of this
paper.

115
Representation

Raw text is parsed into tokens.
A vector of token counts is created for each
document (article).
Tokens are not stemmed.
The vector is on the order of 20,000 to 100,000
tokens long.
No explicit dimension reduction techniques are
used to reduce the size of vectors.

116
TF-IDF weighting

Motivation
The more times a token t appears in a document d
(term frequency, tf(t,d)), and
the fewer documents t occurs in throughout the
collection (document frequency, df(t)),
the better t represents the subject of document
d.
Throw out tokens occurring fewer than 3 times in
total.
Throw out the M most frequent tokens.
The weight of t w.r.t. d, w(t, d), is
w(t, d) = tf(t,d) * log2(N / df(t)),
where N is the total number of documents.
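
A minimal sketch of the weight w(t, d) = tf(t,d) * log2(N / df(t)). The pruning of rare tokens and of the M most frequent tokens is left out, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    N = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        vectors.append({t: tf[t] * math.log2(N / df[t]) for t in tf})
    return vectors

docs = [
    ["ski", "snow", "ski"],
    ["snow", "rain"],
    ["rain", "snow", "wind", "wind"],
    ["ski", "wax"],
]
vectors = tfidf_vectors(docs)
```

Here "wind" gets a high weight in the third document (frequent there, absent elsewhere), while "snow" gets a low weight everywhere because it occurs in three of the four documents.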

117
TF-IDF weighting

Each document is represented by a tf-idf vector
normalized into unit length.
Use the cosine function to determine the similarity
between two documents.
Given a category (1..5), a prototype vector is
computed by averaging the normalized tf-idf
vectors in the category.
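
The normalization, cosine similarity, and prototype steps can be sketched with sparse dictionaries as follows. The helper names and the toy tf-idf vectors are hypothetical, and this is an illustration rather than NewsWeeder's actual code.

```python
import math

def normalize(vec):
    """Scale a sparse vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    a, b = normalize(a), normalize(b)
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def prototype(vectors):
    """Average of the unit-length versions of the vectors in one category."""
    total = {}
    for vec in map(normalize, vectors):
        for t, v in vec.items():
            total[t] = total.get(t, 0.0) + v
    return {t: v / len(vectors) for t, v in total.items()}

# Hypothetical tf-idf vectors of two articles that a user rated 1.
rated_1 = [{"ski": 3.0, "snow": 4.0}, {"ski": 1.0}]
proto = prototype(rated_1)
similarity = cosine({"ski": 2.0, "snow": 1.0}, proto)
```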

118
TF-IDF weighting

Let vp1, vp2, vp3, vp4, vp5 be the prototype
vectors of the five categories.
A learning model is derived as follows
Predicted-rate(d) = c1*sim(d, vp1) + c2*sim(d, vp2)
+ c3*sim(d, vp3) + c4*sim(d, vp4) + c5*sim(d, vp5).
The coefficients of the above model are determined by
linear regression on documents rated by the user.

119
Minimum Description Length (MDL)

A kind of Bayesian classifier, but based on the
entropy measure.
In information theory, the minimum average length
to encode messages with probabilities p1, p2, ..., pk
is -sum_i pi log pi. That is, the
number of bits to represent message i is -log
pi.
Let H be a category and D a document; the best H
maximizes p(H | D) = p(D | H) p(H) / p(D).

120
MDL

Equivalently, we can minimize
-log(p(D | H)) - log(p(H)).
The above total encoding length includes
Number of bits to encode the hypothesis
Number of bits required to encode the data given
the hypothesis.
That is, to find a balance between simpler models
and models that produce smaller error when
explaining the observed data.
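
A small numeric sketch of the two-part encoding length: bits for the hypothesis plus bits for the data given the hypothesis. The probabilities are invented for illustration. The point is only that the minimum total length picks the same hypothesis as the maximum of p(D | H) p(H).

```python
import math

def description_length(p_data_given_h, p_h):
    """Bits to encode the hypothesis plus bits to encode the data given it."""
    return -math.log2(p_h) - math.log2(p_data_given_h)

# Hypothetical probabilities for one document D under three categories H.
candidates = {
    "H1": {"prior": 0.5, "likelihood": 0.01},
    "H2": {"prior": 0.25, "likelihood": 0.08},
    "H3": {"prior": 0.25, "likelihood": 0.02},
}
bits = {h: description_length(c["likelihood"], c["prior"]) for h, c in candidates.items()}
best_by_mdl = min(bits, key=bits.get)
best_by_posterior = max(
    candidates, key=lambda h: candidates[h]["prior"] * candidates[h]["likelihood"]
)
```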

121
MDL applied to Newsweeder

Problem description
We are given a document d with token vector Td
that has ld non-zero entries, and a set of previous
rating information Dtrain.
We would like to find a category ci that maximizes
p(ci | Td, ld, Dtrain), or equivalently, minimizes
-log(p(Td | ci, ld, Dtrain)) - log(p(ci | ld,
Dtrain))

122
MDL applied to Newsweeder

Assuming that words in a document are independent,
we have
p(Td | ci, ld, Dtrain) = prod_j p(tj,d | ci, ld,
Dtrain)
where ti,d (0 or 1) represents whether token i
appears in document d.
Notations
ti = sum_{j in N} ti,j
ri,l = a correlation estimate in [0, 1] between
ti,d and ld.
The above measures can be computed for the entire
document collection or for a particular category,
denoted by ck.

123
MDL applied to Newsweeder

When ti,d is not related to the length of the
document (i.e., ri,l = 0), we have
When ti,d is strongly related to the length of
the document (i.e., ri,l = 1), we have

124
MDL applied to Newsweeder

In general, it can be modeled as
Hypothesis: For a given token, either it is
special w.r.t. a category or it is unrelated to
any category.

125
MDL applied to Newsweeder

A token is related to some category if the
following value is greater than a small constant
(0.1)
The intuition is that if by considering category
information the encoding bits can be reduced,
this token plays an important role in deciding
the category of a document.

126
Summary

Divide the set of articles into training set and
test set.
Parse the training articles, throwing out tokens
occurring fewer than 3 times in total.
Compute ti and ri,l for each token.
For each token t and category c, decide whether
to use the category-independent or the
category-dependent model.

127
Summary (contd)

Compute the similarity of each training article
to each rating category by taking the inverse of
the number of bits required to encode Td under
the category's probabilistic model.
Compute a linear regression model from the
training articles.

128
Experiments

The performance metric is precision.
Retrieve the top 10% of articles with the highest
predicted ratings.
Data
See Table 1 for the meaning of the 5 categories.
Articles rated as 1 or 2 are considered
interesting.
Users: only two users provided enough ratings;
see Table 2.

129
TF-IDF performance

Do not use a fixed stop-list because it may not
suit a dynamic environment.
The top N most frequent words are removed.
Experiments with different partitions into
training/test sets show that removing
100-400 words seems to give the best performance.
See Graph 1.
TF-IDF gives about a threefold improvement over
non-filtering.

130
MDL Experiments

See Graph 2 for a comparison between TF-IDF and
MDL.
MDL consistently outperforms TF-IDF.
Table 3 shows the predicted ratings and actual
ratings of a test article.
The correct prediction is 65 (see the diagonal
line)
In general, the performance after the regression
step tends to meet or exceed the precision
obtained by the method of choosing only the
category with maximum probability.

131
Learning and Revising User Profiles: The
Identification of Interesting Web Sites

M. Pazzani and D. Billsus
Machine Learning 27, 1997

132
Introduction

The goal is to find information that satisfies
long-term recurring interests.
Feedback on the interestingness of a set of
previously visited sites is used to predict the
interestingness of unseen sites.
The recommender system is called Syskill & Webert.

133
Syskill & Webert

A different profile is learned for each topic.
Each user has a set of profiles, one for each
topic.
Each web page is augmented with special controls
for collecting user ratings. See Figure 1.
Each page is rated as either hot or cold. See
Figure 2 for notations for recommendations.

134
Learning user profiles

Use supervised learning with a set of positive
examples and negative examples.
Each rated web page is converted into a Boolean
feature vector.
The information gain of a word is used to
determine how informative the word is.

135
Learning user profiles

The set of k most informative words is used as the
feature set (k = 128).
In addition, words in a stop list with
approximately 600 words and HTML tags are
excluded.
See Table 1 for the feature words on goats.
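
A sketch of expected information gain for a Boolean word feature over hot/cold pages, assuming the usual entropy-based definition. The function names and the four toy pages are hypothetical.

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a two-class distribution given by counts."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

def information_gain(pages, word):
    """pages: list of (set_of_words, is_hot) pairs."""
    def counts(subset):
        hot = sum(1 for _, is_hot in subset if is_hot)
        return hot, len(subset) - hot
    present = [p for p in pages if word in p[0]]
    absent = [p for p in pages if word not in p[0]]
    gain = entropy(*counts(pages))
    for subset in (present, absent):
        if subset:
            gain -= len(subset) / len(pages) * entropy(*counts(subset))
    return gain

pages = [
    ({"goat", "milk"}, True),
    ({"goat", "farm"}, True),
    ({"car", "farm"}, False),
    ({"car", "milk"}, False),
]
vocabulary = {"goat", "milk", "farm", "car"}
# Keep the k most informative words as features (k = 2 in this toy case).
top_k = sorted(vocabulary, key=lambda w: information_gain(pages, w), reverse=True)[:2]
```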

136
Naïve Bayesian classifier

Assumes that features are independent given the class.
A given example is assigned to the class (hot or
cold) with the higher probability.
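
A minimal naive Bayes sketch over Boolean word features. Laplace smoothing is an assumption added here to avoid zero probabilities, and the names and toy pages are hypothetical, so this should not be read as the exact estimator of Syskill & Webert. The sketch also assumes both classes occur in the training data.

```python
import math

def train(pages, features):
    """pages: list of (set_of_words, label) with label 'hot' or 'cold'."""
    model = {}
    for label in ("hot", "cold"):
        docs = [words for words, lab in pages if lab == label]
        model[label] = {
            "prior": len(docs) / len(pages),
            # Laplace smoothing (an assumption) avoids zero probabilities.
            "p_present": {
                f: (sum(f in d for d in docs) + 1) / (len(docs) + 2) for f in features
            },
        }
    return model

def classify(model, words):
    """Return the class with the higher log-probability score."""
    scores = {}
    for label, params in model.items():
        score = math.log(params["prior"])
        for f, p in params["p_present"].items():
            score += math.log(p if f in words else 1 - p)
        scores[label] = score
    return max(scores, key=scores.get)

pages = [
    ({"goat", "milk"}, "hot"),
    ({"goat", "farm"}, "hot"),
    ({"car", "farm"}, "cold"),
    ({"car", "milk"}, "cold"),
]
model = train(pages, features=["goat", "car", "farm"])
```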

137
Initial experiments

See Table 2 for four users on 9 topics.
Again, the partition on training set and test set
is varied.
Accuracy is the primary performance metric.
Figure 3 displays the average accuracy, which is
substantially better than the probability of cold
pages.
In the biomedical domain, all the top 10 pages were
actually interesting, and all the bottom 10 pages
were actually uninteresting.

138
Initial experiments

Among the 21 pages with probabilities above 0.9,
19 were rated interesting.
Among the 64 pages with probability below 0.1,
only one was rated interesting.
Table 3 shows how the number of feature words
impacts accuracy with 20 training examples.
An intermediate number (96) of features performs
the best.
An exhaustive approach to feature selection is
not feasible because of its complexity.

139
Alternative machine learning alg.

Nearest neighbor: Assign the class of the most
similar example.
PEBLS: The distance between two examples is the
sum of the value differences of all attributes.
The difference between Vjx and Vjy is

140
Machine Learning (Contd)

Decision trees: ID3, which recursively selects
the feature with the highest information gain.
Rocchio's algorithm
Use TF-IDF as feature weights (with normalization
to unit length).
Build the prototype vector of the interesting
class by subtracting 0.25 of the average vector
of the uninteresting pages from the average
vector of the interesting pages.
The purpose is to prevent infrequently occurring
terms from overly affecting the classification.
Pages within a certain distance of the prototype
(determined by cosine) are considered
interesting.
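
The prototype construction above (average interesting vector minus 0.25 of the average uninteresting vector) can be sketched with sparse dictionaries. The vectors are hypothetical unit-length tf-idf vectors, and the parameter name beta is an assumption.

```python
def average(vectors):
    """Component-wise average of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, v in vec.items():
            total[t] = total.get(t, 0.0) + v
    return {t: v / len(vectors) for t, v in total.items()}

def rocchio_prototype(interesting, uninteresting, beta=0.25):
    """Prototype of the interesting class: avg(interesting) - beta * avg(uninteresting)."""
    pos, neg = average(interesting), average(uninteresting)
    return {t: pos.get(t, 0.0) - beta * neg.get(t, 0.0) for t in set(pos) | set(neg)}

hot = [{"goat": 1.0}, {"goat": 0.6, "farm": 0.8}]
cold = [{"car": 1.0}, {"farm": 1.0}]
proto = rocchio_prototype(hot, cold)
```

A new page would then be compared with `proto` by cosine similarity and called interesting if it is close enough. The threshold is not given on the slide.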

141
Comparison

20 examples were chosen as the training set because
the increase in accuracy after 20 is mild.
See Table 4. In each domain, the highest accuracy
as well as those with slightly lower accuracies
are marked.
ID3 (or C4.5) is not well suited to this task.
Nearest neighbor performs worse (even for k-NN).
Backpropagation, the Bayesian classifier, and
Rocchio's algorithm are among the best.
The Bayesian classifier is chosen because it is fast
and adapts well to attribute dependencies.

142
Using predefined user profiles

Some users are unwilling to rate many pages
before the system gives reliable prediction.
An initial profile is solicited as follows
Provide a set of words that indicate interesting
pages.
Provide another set of words that indicate
uninteresting pages. This set is more difficult
to get.
Four probabilities for each word are given:
p(wordi present | hot), p(wordi absent | hot),
p(wordi present | cold), p(wordi absent | cold).
The default for p(wordi present | hot) is 0.7 and
that for p(wordi present | cold) is 0.3.

143
Using predefined user profiles (Contd)

As more training data becomes available, more
belief should be placed in the probability
estimates.
Conjugate priors are used to update probabilities
from data.
The initial probability is assumed to be
equivalent to 50 pages.
If P(wordi present | hot) = 0.8 and among 25 hot
pages seen, 10 contain wordi,
the probability becomes (40 + 10)/(50 + 25).
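
The update in this example treats the initial estimate as if it came from 50 previously seen pages and pools it with the observed counts. A one-function sketch follows. The function and parameter names are hypothetical.

```python
def revised_probability(prior_p, prior_weight, pages_seen, pages_with_word):
    """Combine a prior worth `prior_weight` pseudo-pages with observed counts."""
    return (prior_p * prior_weight + pages_with_word) / (prior_weight + pages_seen)

# Values from the slide: prior 0.8 worth 50 pages; 10 of 25 hot pages contain the word.
p = revised_probability(prior_p=0.8, prior_weight=50, pages_seen=25, pages_with_word=10)
```

With these numbers the estimate moves from 0.8 to 50/75, about 0.67, and it keeps moving toward the observed frequency as more pages are rated.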

144
Experiments

Three alternatives
Data: use only data for estimation. 96 features
are obtained purely from data.
Revision: use both data and the initial profile for
estimation. All words in the profile are used as
features, supplemented with the most informative
words for a total of 96 features.
Fixed: use only the words provided by the user as
features and only the initial profiles.

145
Results

See Tables 5, 6, and 7 for probabilities in
initial profiles.
Figures 4, 5, and 6 show that the revision
strategy performs the best. The performance of
fixed is surprisingly good.
If we use only words in initial user profile and
calculate the probability from data, it still
performs well. See Figure 7.

146
Using lexical knowledge

Use WordNet as a thesaurus.
When there is no relationship between a word and
the words in a topic, this word is eliminated. The
relationships considered include Hypernym, Antonym,
Member-Holonym, Part-Holonym, Similar-to, Pertainym,
and Derived-from.
Table 8 shows the eliminated words that are
unrelated to goat.
Figure 8 shows that when the number of examples
is small, applying lexical knowledge does help.

147
Comparing Feature-based and Clique-based User
Models for Movie Selection

J. Alspector, A. Kolcz, and N. Karunanithi
Conf. of Digital Libraries, 1998

148
Introduction

Compare content-based and collaborative
approaches for making recommendations for movies.
Users must provide explicit ratings on some
movies.
Data set: 7389 movies.
Volunteers for rating movies: 242.

149
Clique-based approach

A set of users form a clique if their movie
ratings are closely related.
The similarity between two users' ratings is
defined by the Pearson correlation coefficient (i.e.,
the cosine function on mean-centered ratings) as follows
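
A sketch of the Pearson correlation between two users, computed over the movies both have rated. The function name, the `min_common` parameter, and the ratings are hypothetical. Returning None for too few common movies is an assumption of this sketch.

```python
import math

def pearson(ratings_a, ratings_b, min_common=2):
    """ratings_*: {movie: rating}. Returns None with fewer than min_common shared movies."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < min_common:
        return None
    a = [ratings_a[m] for m in common]
    b = [ratings_b[m] for m in common]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]  # mean-centered ratings of user a
    db = [v - mean_b for v in b]  # mean-centered ratings of user b
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    return sum(x * y for x, y in zip(da, db)) / denom if denom else 0.0

u = {"A": 1, "B": 2, "C": 3}
v = {"A": 2, "B": 4, "C": 6, "D": 1}
w = {"A": 3, "B": 2, "C": 1}
```

Users u and v agree perfectly on their common movies, while u and w rate them in opposite order.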

150
Clique-based approach

How to decide the clique of a given user U?
Smin: minimum number of common ratings with U.
Cmin: minimum correlation threshold.
In the experiments, Smin is set as a constant 10,
and Cmin is varied so that the
size of the clique is 40.
Once a clique is identified,
For a given unseen movie m, let N be the number
of clique members that rate m.
ci(m) is the rating of movie m given by user i.
r(m) is the estimated rating of movie m to the
user U.

151
Clique-based approach
152
Feature-based approach

Extract relevant features from the movies that
user has rated.
Build a model for a user by associating selected
features and the ratings.
Estimate the rating of an unseen movie for a user
by consulting the model.

153
Relevant features

Seven features are used
25 categories (0, 1)
6 MPAA ratings (0, 1)
Maltin rating (0..4)
Academy award: won = 1, nominated = 0.5, not
considered = 0.
Origin: USA = 0, USA with foreign
collaboration = 0.5, foreign made = 1.
Director: each director is represented as a
numerical value that is the average rating given by
the user to the movies directed by that director.
Each feature is normalized to the range [0, 1].

154
Linear model
