Title: CS 430: Information Discovery
1. CS 430 Information Discovery
Lecture 17: Ranking 1
2. Course Administration
Assignment 2 and the Midterm examination were
mailed a week ago. A few questions are still
outstanding. Assignment 3 will be posted
shortly.
3. Midterm Examination -- Question 4
4(a) What is the Dublin Core principle of dumbing-down? Are there any fields in this record that do not satisfy the principle?

"The theory behind this principle is that consumers of metadata should be able to strip off qualifiers and return to the base form of a property. ... this principle makes it possible for client applications to ignore qualifiers in the context of more coarse-grained, cross-domain searches." (Lagoze, 2001)
4. Question 4 (continued)
Dumbing-down failures:

Description.note: Title from home page as viewed on Nov. 1, 2000.
dumbs down to
Description: Title from home page as viewed on Nov. 1, 2000.
which is not a description of the object.

Publisher.place: Nashville, Tenn.
dumbs down to
Publisher: Nashville, Tenn.
which is not the publisher of the object.

Correct dumbing-down:

Subject.class.LCC: E840.8.G65
dumbs down to
Subject: E840.8.G65
which is a subject code.
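The dumbing-down step itself is mechanical, and a short sketch may help. It assumes that qualifiers are appended to the element name with dots, as in the field names on this slide; the function name `dumb_down` is a hypothetical helper, not part of any Dublin Core tool.

```python
def dumb_down(field: str) -> str:
    """Return the base element of a qualified field name (assumed dot-separated)."""
    # "Subject.class.LCC" -> "Subject"; an unqualified name is returned unchanged.
    return field.split(".", 1)[0]


# Field values copied from the record discussed on the slide.
record = {
    "Description.note": "Title from home page as viewed on Nov. 1, 2000.",
    "Publisher.place": "Nashville, Tenn.",
    "Subject.class.LCC": "E840.8.G65",
}

# Dumbed-down view: the values are untouched, only the qualifiers are stripped.
dumbed = {dumb_down(name): value for name, value in record.items()}

for name, value in dumbed.items():
    print(f"{name}: {value}")
```

Stripping the qualifier never changes the value, which is why the first two fields fail: the value only makes sense with its qualifier attached.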
5. Question 4 (continued)
4(b) The metadata in the fields Publisher and Publisher.place end in punctuation marks. Can you suggest any reasons for doing so?

This is a historic curiosity. It comes from the concept that the metadata will be printed, so that the metadata is stored in a printable format.

Publisher: Gore/Lieberman,
Publisher.place: Nashville, Tenn.

is intended to be combined with a date as follows:

Nashville, Tenn. Gore/Lieberman, 2001
6. Question 4 (continued)
4(c) This record has no Creator field. It has a
Contributor.nameCorporate field with value
"Gore/Lieberman, Inc." Do you consider that this
is correct use of Dublin Core? What would you put
in the Creator and Contributor fields? Why?
7. Question 4 (continued)
Specification of Dublin Core

A. All fields are optional. It is not necessary to have a Creator.

B. Definitions of fields

Creator: The person or organization primarily responsible for the intellectual content of the resource.

Contributor: A person or organization not specified in a creator element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a creator element.

Gore/Lieberman, Inc. is the corporate author of this web site and is therefore the Creator.
8. Midterm Examination -- Question 2
2(b) You have the collection of documents that contain the following index terms:

D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

(i) Use an incidence matrix of terms to calculate a similarity matrix for these four documents, with no term weighting.
9. Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf   length
D1    1      1      1        1      1     1        1      √7
D2    1      0      0        1      0     0        1      √3
D3    0      1      1        0      1     1        0      √4
D4    1      0      0        1      0     1        1      √4
10. Document similarity matrix
      D1    D2    D3    D4
D1    -     0.65  0.76  0.76
D2    0.65  -     0.00  0.87
D3    0.76  0.00  -     0.25
D4    0.76  0.87  0.25  -
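The values in this matrix are consistent with the cosine measure: the number of shared terms divided by the product of the two vector lengths. For example, D1 and D2 share three terms, giving 3 / (√7 × √3) ≈ 0.65. The sketch below assumes that reading; the names `incidence_vector` and `cosine` are illustrative only.

```python
import math

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]
DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}


def incidence_vector(text):
    """1 if the term occurs in the document at all, else 0 (no term weighting)."""
    present = set(text.split())
    return [1 if term in present else 0 for term in TERMS]


def cosine(u, v):
    """Dot product divided by the product of the Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


vectors = {name: incidence_vector(text) for name, text in DOCS.items()}

# Off-diagonal entries of the similarity matrix, rounded as on the slide.
similarity = {
    (a, b): round(cosine(vectors[a], vectors[b]), 2)
    for a in DOCS
    for b in DOCS
    if a != b
}

for (a, b), value in similarity.items():
    print(a, b, value)
```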
11. Question 2 (continued)
2(b)(ii) Use a frequency matrix of terms to
calculate a similarity matrix for these
documents, with term weights inversely
proportional to frequency.
12. Frequency Array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf
D1    1      1      1        1      1     1        1
D2    1      0      0        1      0     0        3
D3    0      3      1        0      1     1        0
D4    2      0      0        1      0     1        2
13. Inverse Document Frequency Weighting
Principle:
(a) Weight is proportional to the number of times that the term appears in the document.
(b) Weight is inversely proportional to the number of documents that contain the term.

$w_{ik} = f_{ik} / d_k$

where
$w_{ik}$ is the weight given to term k in document i
$f_{ik}$ is the frequency with which term k appears in document i
$d_k$ is the number of documents that contain term k
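As a minimal sketch of this weighting, the code below computes the frequency of each term in each document, the number of documents containing each term, and their ratio, for the four documents of Question 2. The variable names (`freq`, `doc_count`, `weights`) are illustrative only.

```python
from collections import Counter

DOCS = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}
TERMS = sorted({term for text in DOCS.values() for term in text.split()})

# f_ik: how often term k appears in document i (0 if absent).
freq = {name: Counter(text.split()) for name, text in DOCS.items()}

# d_k: number of documents that contain term k.
doc_count = {
    term: sum(1 for counts in freq.values() if counts[term] > 0) for term in TERMS
}

# w_ik = f_ik / d_k
weights = {
    name: {term: counts[term] / doc_count[term] for term in TERMS}
    for name, counts in freq.items()
}

for name, row in weights.items():
    print(name, " ".join(f"{row[term]:.2f}" for term in TERMS))
```

For example, bravo appears three times in D3 and occurs in two documents, so its weight in D3 is 3 / 2 = 1.50.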
14. Frequency Array with Weights
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf   length
D1    0.33   0.50   0.50     0.33   0.50  0.33     0.33   1.09
D2    0.33   0      0        0.33   0     0        1.00   1.11
D3    0      1.50   0.50     0      0.50  0.33     0      1.69
D4    0.67   0      0        0.33   0     0.33     0.67   1.05
dk    3      2      2        3      2     3        3
15. Document similarity matrix
      D1    D2    D3    D4
D1    -     0.46  0.74  0.58
D2    0.46  -     0.00  0.86
D3    0.74  0.00  -     0.06
D4    0.58  0.86  0.06  -
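The weighted similarity values can be checked with a short calculation. The sketch below assumes the same cosine measure as in part (i), applied to the weighted vectors, with each length taken as the Euclidean norm of its row; the helper name `similarity` is illustrative only.

```python
import math

TERMS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf"]

# Term frequencies f_ik, one row per document, in TERMS order.
FREQ = {
    "D1": [1, 1, 1, 1, 1, 1, 1],
    "D2": [1, 0, 0, 1, 0, 0, 3],
    "D3": [0, 3, 1, 0, 1, 1, 0],
    "D4": [2, 0, 0, 1, 0, 1, 2],
}

# d_k: number of documents containing term k.
doc_count = [
    sum(1 for row in FREQ.values() if row[k] > 0) for k in range(len(TERMS))
]

# w_ik = f_ik / d_k
weights = {name: [f / d for f, d in zip(row, doc_count)] for name, row in FREQ.items()}

# Euclidean length of each weighted document vector.
length = {name: math.sqrt(sum(w * w for w in row)) for name, row in weights.items()}


def similarity(a, b):
    """Cosine of the angle between the weighted vectors of documents a and b."""
    dot = sum(x * y for x, y in zip(weights[a], weights[b]))
    return dot / (length[a] * length[b])


names = list(FREQ)
for a in names:
    cells = ["  -  " if a == b else f"{similarity(a, b):5.2f}" for b in names]
    print(a, " ".join(cells))
```

Because the measure is symmetric, the entry for (D1, D4) and the entry for (D4, D1) are the same number.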
16. Google Ranking algorithm
Concept: The rank of a page is higher if many
pages link to it. Links from highly ranked pages
are given greater weight than links from less
highly ranked pages.
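17. Page Ranks (Google)
Link matrix: rows are cited pages, columns are citing pages (P1 to P6). An entry of 1 means that the citing page links to the cited page.

Cited page   Entries in row (column positions not preserved)
P1           1 1 1
P2           1
P3           1
P4           1 1 1 1
P5           1
P6           1 1

Number of links from each citing page:
P1  P2  P3  P4  P5  P6
2   1   4   1   2   2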
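18. Normalize by Number of Links from Page
Normalized link matrix B: rows are cited pages, columns are citing pages (P1 to P6). Each entry is divided by the number of links from its citing page.

Cited page   Nonzero entries in row, left to right (column positions not preserved)
P1           1, 0.25, 0.5
P2           0.25
P3           0.5
P4           0.5, 0.25, 1, 0.5
P5           0.5
P6           0.25, 0.5

Number of links from each citing page:
P1  P2  P3  P4  P5  P6
2   1   4   1   2   2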
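The normalization can be sketched as follows. The slide gives the values in each row and the number of links from each citing page, but not every column position, so the placement of the 1s in the link matrix `A` below is an assumption: it is one arrangement that reproduces both the row values and the link counts.

```python
PAGES = ["P1", "P2", "P3", "P4", "P5", "P6"]

# A[i][j] = 1 if page j (citing) links to page i (cited).
# ASSUMPTION: column positions chosen to match the row values and the
# per-page link counts (2, 1, 4, 1, 2, 2); other placements are possible.
A = [
    [0, 1, 1, 0, 1, 0],  # P1 is cited by P2, P3, P5
    [0, 0, 1, 0, 0, 0],  # P2 is cited by P3
    [1, 0, 0, 0, 0, 0],  # P3 is cited by P1
    [1, 0, 1, 1, 0, 1],  # P4 is cited by P1, P3, P4, P6
    [0, 0, 0, 0, 0, 1],  # P5 is cited by P6
    [0, 0, 1, 0, 1, 0],  # P6 is cited by P3, P5
]
n = len(PAGES)

# Number of links from each citing page = column sums of A.
out_links = [sum(A[i][j] for i in range(n)) for j in range(n)]

# B[i][j] = A[i][j] / (number of links from page j).
B = [[A[i][j] / out_links[j] for j in range(n)] for i in range(n)]

for page, row in zip(PAGES, B):
    print(page, " ".join(f"{value:4.2f}" for value in row))
```

After normalization every column of B sums to 1, so each page distributes one unit of weight across the pages it links to.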
19. Weighting of Pages
Initially all pages have weight 1:

$w_1 = (1, 1, 1, 1, 1, 1)^T$

Recalculate weights:

$w_2 = B w_1 = (1.75, 0.25, 0.50, 2.25, 0.50, 0.75)^T$
20. Google Ranks
- Iterate $w_k = B w_{k-1}$ until the weights stop changing.
- This w is the principal (high-order) eigenvector of B.
- It ranks the pages by links to them, normalized by the number of citations from each page and weighted by the ranking of the citing pages.
- Google:
  - calculates the ranks for all pages (over one billion)
  - lists hits in rank order
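The iteration can be sketched as repeated matrix-vector multiplication. This is a minimal illustration under stated assumptions, not Google's implementation: the column positions in `B` are one placement consistent with the six-page example (the slides fix the row values but not every column), and `step`, `iterate`, `tol`, and `max_iter` are hypothetical names and parameters.

```python
PAGES = ["P1", "P2", "P3", "P4", "P5", "P6"]

# Normalized link matrix B (rows: cited page, columns: citing page).
# ASSUMPTION: column placement is one arrangement consistent with the example.
B = [
    [0.0, 1.0, 0.25, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.25, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.25, 1.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.25, 0.0, 0.5, 0.0],
]


def step(matrix, w):
    """One recalculation: returns matrix * w."""
    return [sum(b * x for b, x in zip(row, w)) for row in matrix]


def iterate(matrix, w, tol=1e-9, max_iter=1000):
    """Repeat w_k = B w_{k-1} until no weight changes by more than tol."""
    for _ in range(max_iter):
        nxt = step(matrix, w)
        if max(abs(a - b) for a, b in zip(nxt, w)) < tol:
            return nxt
        w = nxt
    return w


w1 = [1.0] * len(PAGES)   # initially all pages have weight 1
w2 = step(B, w1)          # first recalculation
w = iterate(B, w1)        # weights after convergence

# Pages listed in rank order, highest weight first.
ranking = sorted(PAGES, key=lambda p: w[PAGES.index(p)], reverse=True)
print("w2 =", w2)
print("ranking:", ranking)
```

The first step depends only on the row sums of B, so `w2` matches the slide regardless of the column placement. In this small example P4's only outgoing link is to itself, so repeated iteration keeps moving weight toward P4.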