CS 430: Information Discovery - PowerPoint PPT Presentation

Slides: 21
Provided by: wya1
1
CS 430: Information Discovery
Lecture 17: Ranking 1
2
Course Administration
Assignment 2 and Midterm examination were
mailed a week ago. A few questions
outstanding. Assignment 3 will be posted
shortly.
3
Midterm Examination -- Question 4
4(a) What is the Dublin Core principle of
dumbing-down? Are there any fields in this record
that do not satisfy the principle?

"The theory behind this principle is that
consumers of metadata should be able to strip off
qualifiers and return to the base form of a
property. ... this principle makes it possible
for client applications to ignore qualifiers in
the context of more coarse-grained, cross-domain
searches."
Lagoze 2001
4
Question 4 (continued)
Dumbing-down failures:
Description.note: Title from home page as viewed on Nov. 1, 2000.
  -> Description: Title from home page as viewed on Nov. 1, 2000.
  which is not a description of the object
Publisher.place: Nashville, Tenn.
  -> Publisher: Nashville, Tenn.
  which is not the publisher of the object
Correct dumbing-down:
Subject.class.LCC: E840.8.G65
  -> Subject: E840.8.G65
  which is a subject code
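The stripping step can be sketched in Python, assuming qualifiers are written as dot-separated suffixes as in this record. The function name dumb_down is hypothetical, not part of any Dublin Core tool. The sketch only removes qualifiers; it cannot tell whether the value still makes sense under the base field, which is the judgment the question asks for.

```python
def dumb_down(record):
    """Strip qualifiers from field names, e.g. 'Publisher.place' -> 'Publisher'."""
    simple = {}
    for field, value in record.items():
        base = field.split(".", 1)[0]  # keep only the part before the first dot
        simple.setdefault(base, []).append(value)
    return simple


# The three qualified fields discussed on this slide.
record = {
    "Description.note": "Title from home page as viewed on Nov. 1, 2000.",
    "Publisher.place": "Nashville, Tenn.",
    "Subject.class.LCC": "E840.8.G65",
}

simple = dumb_down(record)
for field, values in simple.items():
    print(field, values)
```

After stripping, Publisher holds a place name and Description holds a cataloguing note, while Subject still holds a subject code.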
5
Question 4 (continued)
4(b) The metadata in the fields Publisher and
Publisher.place end in punctuation marks. Can
you suggest any reasons for doing so?

This is a historic curiosity. It comes from the
concept that the metadata will be printed, so
that the metadata is stored in a printable
format.
Publisher: Gore/Lieberman,
Publisher.place: Nashville, Tenn.
are intended to be combined with a date as
follows:
Nashville, Tenn. Gore/Lieberman, 2001
6
Question 4 (continued)
4(c) This record has no Creator field. It has a
Contributor.nameCorporate field with value
"Gore/Lieberman, Inc." Do you consider that this
is correct use of Dublin Core? What would you put
in the Creator and Contributor fields? Why?
7
Question 4 (continued)
Specification of Dublin Core
A. All fields are optional. It is not necessary
to have a Creator.
B. Definitions of fields
Creator: The person or organization primarily
responsible for the intellectual content of the
resource.
Contributor: A person or organization not
specified in a creator element who has made
significant intellectual contributions to the
resource but whose contribution is secondary to
any person or organization specified in a creator
element.
Gore/Lieberman, Inc. is the corporate author of
this web site and is therefore the Creator.
8
Midterm Examination -- Question 2
2(b) You have the collection of documents that
contain the following index terms:
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta
(i) Use an incidence matrix of terms to calculate
a similarity matrix for these four documents,
with no term weighting.
9
Incidence array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf   length
D1    1      1      1        1      1     1        1      √7
D2    1                      1                     1      √3
D3           1      1               1     1               √4
D4    1                      1            1        1      √4
10
Document similarity matrix
      D1    D2    D3    D4
D1          0.65  0.76  0.76
D2    0.65        0.00  0.87
D3    0.76  0.00        0.25
D4    0.76  0.87  0.25
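These values can be checked with a short Python sketch. It assumes the similarity measure is the cosine of the angle between incidence vectors, which reproduces the numbers on this slide. With 0/1 vectors the dot product is the number of shared terms and each vector length is the square root of the number of distinct terms.

```python
import math

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# Incidence: only whether a term occurs, so each document becomes a set of terms.
terms = {name: set(text.split()) for name, text in docs.items()}


def cosine_incidence(a, b):
    """Cosine similarity of two 0/1 vectors represented as term sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))


similarity = {
    (p, q): round(cosine_incidence(terms[p], terms[q]), 2)
    for p in terms
    for q in terms
    if p != q
}

for pair, value in sorted(similarity.items()):
    print(pair, value)
```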
11
Question 2 (continued)
2b(ii) Use a frequency matrix of terms to
calculate a similarity matrix for these
documents, with term weights inversely
proportional to frequency.
12
Frequency Array
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf
D1    1      1      1        1      1     1        1
D2    1                      1                     3
D3           3      1               1     1
D4    2                      1            1        2
13
Inverse Document Frequency Weighting
Principle
(a) Weight is proportional to the number of times
that the term appears in the document
(b) Weight is inversely proportional to the
number of documents that contain the term

w_ik = f_ik / d_k

where
w_ik is the weight given to term k in document i
f_ik is the frequency with which term k appears
in document i
d_k is the number of documents that contain
term k
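A minimal Python sketch of this weighting, applied to the four documents from Question 2. The variable names are illustrative; only the formula weight = frequency / document count comes from the slide.

```python
from collections import Counter

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

# f_ik: how often term k appears in document i.
frequency = {name: Counter(text.split()) for name, text in docs.items()}

# d_k: number of documents that contain term k (each document counts once).
document_count = Counter(term for counts in frequency.values() for term in counts)

# w_ik = f_ik / d_k
weights = {
    name: {term: f / document_count[term] for term, f in counts.items()}
    for name, counts in frequency.items()
}

for name, row in weights.items():
    print(name, {term: round(w, 2) for term, w in sorted(row.items())})
```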
14
Frequency Array with Weights
D1: alpha bravo charlie delta echo foxtrot golf
D2: golf golf golf delta alpha
D3: bravo charlie bravo echo foxtrot bravo
D4: foxtrot alpha alpha golf golf delta

      alpha  bravo  charlie  delta  echo  foxtrot  golf   length
D1    0.33   0.50   0.50     0.33   0.50  0.33     0.33   0.94
D2    0.33                   0.33                  1.00   0.65
D3           1.50   0.50            0.50  0.33            1.08
D4    0.67                   0.33         0.33     0.67   0.76
d_k   3      2      2        3      2     3        3
15
Document similarity matrix
      D1    D2    D3    D4
D1          0.46  0.74  0.58
D2    0.46        0.00  0.86
D3    0.74  0.00        0.06
D4    0.58  0.86  0.06
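The weighted similarities can be reproduced with the sketch below, assuming cosine similarity over the weights w_ik = f_ik / d_k. One caveat: the Euclidean vector lengths this computes (about 1.09, 1.11, 1.69 and 1.05) do not match the "length" values printed with the weighted frequency array, although the resulting similarities agree with the matrix on this slide.

```python
import math
from collections import Counter

docs = {
    "D1": "alpha bravo charlie delta echo foxtrot golf",
    "D2": "golf golf golf delta alpha",
    "D3": "bravo charlie bravo echo foxtrot bravo",
    "D4": "foxtrot alpha alpha golf golf delta",
}

frequency = {name: Counter(text.split()) for name, text in docs.items()}
document_count = Counter(term for counts in frequency.values() for term in counts)
weights = {
    name: {term: f / document_count[term] for term, f in counts.items()}
    for name, counts in frequency.items()
}


def length(vector):
    """Euclidean length of a sparse term-weight vector."""
    return math.sqrt(sum(w * w for w in vector.values()))


def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * b[term] for term, w in a.items() if term in b)
    return dot / (length(a) * length(b))


similarity = {
    (p, q): round(cosine(weights[p], weights[q]), 2)
    for p in weights
    for q in weights
    if p != q
}

for pair, value in sorted(similarity.items()):
    print(pair, value)
```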
16
Google Ranking algorithm
Concept: The rank of a page is higher if many
pages link to it. Links from highly ranked pages
are given greater weight than links from less
highly ranked pages.
17
Page Ranks (Google)
Rows are cited pages; columns are citing pages
P1 to P6.
Entries in each row (cited page):
P1: 1 1 1
P2: 1
P3: 1
P4: 1 1 1 1
P5: 1
P6: 1 1
Number of links from each citing page (P1 to P6):
2 1 4 1 2 2
18
Normalize by Number of Links from Page
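Matrix B: rows are cited pages; columns are
citing pages P1 to P6.
Entries in each row (cited page):
P1: 1 0.25 0.5
P2: 0.25
P3: 0.5
P4: 0.5 0.25 1 0.5
P5: 0.5
P6: 0.25 0.5
Number of links from each citing page (P1 to P6):
2 1 4 1 2 2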
19
Weighting of Pages
Initially all pages have weight 1:
w_1 = (1, 1, 1, 1, 1, 1)
Recalculate weights:
w_2 = B w_1 = (1.75, 0.25, 0.50, 2.25, 0.50, 0.75)
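The normalization and the first reweighting step can be sketched in Python. The link list below is an assumption: the column positions of the link table do not survive in this transcript, so this is one hypothetical assignment chosen to agree with the number of links from each page (2, 1, 4, 1, 2, 2) and with the recalculated weights on this slide. It is not necessarily the link structure of the original figure.

```python
# Hypothetical link list: citing page -> pages it links to (an assumption,
# chosen only to be consistent with the counts and weights on the slides).
links = {
    "P1": ["P4", "P5"],
    "P2": ["P4"],
    "P3": ["P1", "P2", "P4", "P6"],
    "P4": ["P1"],
    "P5": ["P3", "P6"],
    "P6": ["P1", "P4"],
}
pages = sorted(links)

# B[cited][citing] = 1 / (number of links from the citing page), else 0.
B = {cited: {citing: 0.0 for citing in pages} for cited in pages}
for citing, targets in links.items():
    for cited in targets:
        B[cited][citing] = 1.0 / len(targets)

# Start with weight 1 for every page, then apply B once.
w1 = {page: 1.0 for page in pages}
w2 = {
    cited: sum(B[cited][citing] * w1[citing] for citing in pages)
    for cited in pages
}

print([w2[page] for page in pages])
```

Because every page starts with weight 1, this first step is just the row sums of B.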
20
Google Ranks
  • Iterate w_k = B w_{k-1} until the weights
    converge
  • This w is the principal eigenvector of B (the
    eigenvector with the largest eigenvalue)
  • It ranks the pages by links to them, normalized
    by the number of citations from each page and
    weighted by the ranking of the citing pages
  • Google:
  • calculates the ranks for all pages (over one
    billion)
  • lists hits in rank order
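The iteration can be sketched as a plain power iteration in Python. The matrix B here is an assumption: it is one hypothetical normalized link matrix consistent with the link counts and first-step weights in the preceding slides, since the exact column positions are not recoverable from this transcript. The function names, tolerance and iteration cap are illustrative choices, and this is a classroom-scale sketch, not Google's production method.

```python
# Hypothetical normalized link matrix: B[i][j] is the share of page j's links
# that point to page i. Each column sums to 1.
B = [
    [0.0, 0.0, 0.25, 1.0, 0.0, 0.5],  # links into P1
    [0.0, 0.0, 0.25, 0.0, 0.0, 0.0],  # links into P2
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.0],   # links into P3
    [0.5, 1.0, 0.25, 0.0, 0.0, 0.5],  # links into P4
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],   # links into P5
    [0.0, 0.0, 0.25, 0.0, 0.5, 0.0],  # links into P6
]


def multiply(matrix, vector):
    """Matrix-vector product for lists of lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def page_weights(matrix, tolerance=1e-10, max_iterations=1000):
    """Repeat w_k = B w_{k-1}, starting from all ones, until w stops changing."""
    w = [1.0] * len(matrix)
    for _ in range(max_iterations):
        new_w = multiply(matrix, w)
        if max(abs(a - b) for a, b in zip(new_w, w)) < tolerance:
            return new_w
        w = new_w
    return w


weights = page_weights(B)

# Page indices ordered from highest to lowest weight (0 means P1).
ranking = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)

for i in ranking:
    print(f"P{i + 1}", round(weights[i], 3))
```

Because each column of B sums to 1, the total weight stays equal to the number of pages at every step, so the iteration redistributes weight rather than creating it.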