Title: Discussion Class 6
Discussion Classes
Format:
Question
Ask a member of the class to answer
Provide opportunity for others to comment
When answering:
Give your name. Make sure that the TA hears it.
Stand up
Speak clearly so that all the class can hear
Question 1: Inverted Document Frequency (IDF)
In class, I first introduced Salton's original term weighting, known as Inverted Document Frequency:
$w_{ik} = f_{ik} / d_k$
The reading gives Sparck Jones's term weighting, Inverted Document Frequency (IDF):
$IDF_i = \log_2(N/n_i) + 1$
or
$IDF_i = \log_2(maxn/n_i) + 1$
What is the relationship between these alternatives?
Q1 (continued): Definitions of Terms
$w_{ik}$: weight given to term k in document i
$f_{ik}$: frequency with which term k appears in document i
$d_k$: number of documents that contain term k
$N$: number of documents in the collection
$n_i$: total number of occurrences of term i in the collection
$maxn$: maximum frequency of any term in the collection
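The three weightings in Question 1 can be compared numerically with a minimal sketch like the one below. It assumes the "+ 1" form of the two IDF formulas from the reading; the function names and the sample counts are hypothetical and chosen only so that the logarithms come out as whole numbers.

```python
import math


def salton_weight(f_ik, d_k):
    """Salton's original weighting: w_ik = f_ik / d_k."""
    return f_ik / d_k


def idf_by_collection_size(N, n_i):
    """IDF_i = log2(N / n_i) + 1, with N the number of documents."""
    return math.log2(N / n_i) + 1


def idf_by_max_frequency(maxn, n_i):
    """IDF_i = log2(maxn / n_i) + 1, with maxn the largest term frequency."""
    return math.log2(maxn / n_i) + 1


# Hypothetical counts for one term (not taken from any real collection).
w = salton_weight(f_ik=3, d_k=12)                   # 0.25
idf_a = idf_by_collection_size(N=1024, n_i=16)      # log2(64) + 1 = 7.0
idf_b = idf_by_max_frequency(maxn=256, n_i=16)      # log2(16) + 1 = 5.0
```

For a fixed collection, the two IDF variants differ only in the constant inside the logarithm (N versus maxn), so they shift every term's value by the same amount.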
Question 2: Within-Document Frequency
(a) Why does term weighting using within-document frequency improve ranking?
(b) Why is it necessary to normalize within-document frequency?
(c) Explain Croft's normalization:
$cfreq_{ij} = K + (1 - K) \, freq_{ij} / maxfreq_j$
(d) How does Salton and Buckley's recommended term weighting fit with Croft's normalization?
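As an aid for part (c), here is a minimal sketch of Croft's normalization, assuming the form cfreq_ij = K + (1 - K) freq_ij / maxfreq_j. The function name and the default K = 0.5 are illustrative assumptions, not values given on the slide.

```python
def croft_normalize(freq_ij, maxfreq_j, K=0.5):
    """Return cfreq_ij = K + (1 - K) * freq_ij / maxfreq_j.

    freq_ij   : frequency of term i in document j
    maxfreq_j : maximum frequency of any term in document j
    K         : constant between 0 and 1 (0.5 is an assumed default)
    """
    if not 0.0 <= K <= 1.0:
        raise ValueError("K must lie between 0 and 1")
    if maxfreq_j <= 0:
        raise ValueError("maxfreq_j must be positive")
    return K + (1 - K) * freq_ij / maxfreq_j


# The result always falls between K and 1 for terms that occur in the document.
most_frequent = croft_normalize(10, 10)   # 1.0
half_as_often = croft_normalize(5, 10)    # 0.75
```

With K = 0 the weight is the plain ratio freq_ij / maxfreq_j; with K = 1 within-document frequency is ignored and every term gets the same weight.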
Question 3: Salton/Buckley Recommendation
where
and $w_{ij} = freq_{ij} \times IDF_i$
$freq_{iq}$: frequency of term i in query q
$maxfreq_q$: maximum frequency of any term in query q
$IDF_i$: IDF of term i in entire collection
$freq_{ij}$: frequency of term i in document j
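The document-side weight can be sketched as follows, assuming w_ij is the product of freq_ij and the IDF of the same term i (as the definitions list suggests). Only this one weight is covered; the IDF values and term names below are made up for illustration.

```python
def document_weights(term_freqs, idf):
    """Compute w_ij = freq_ij * IDF_i for every term i in one document j.

    term_freqs : dict mapping term -> freq_ij for this document
    idf        : dict mapping term -> IDF_i over the whole collection
    """
    return {term: freq * idf[term] for term, freq in term_freqs.items()}


# Hypothetical document and IDF table (illustrative numbers only).
doc = {"retrieval": 3, "the": 10}
idf = {"retrieval": 7.0, "the": 1.5}
weights = document_weights(doc, idf)
# weights == {"retrieval": 21.0, "the": 15.0}
```

Even though "the" occurs more than three times as often in this toy document, its low IDF leaves it with the smaller weight.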
Question 4: Zipf's Law
"... significant performance improvement using ... the inverted document frequency ... that is based on Zipf's distribution ..."
What has Zipf's law to do with IDF?
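One way to start on this question is to see what IDF looks like when term frequencies follow an idealized Zipf distribution, where the frequency of the term at rank r is C / r. The sketch below assumes that idealization and the IDF form log2(N / n_i) + 1, with n_i taken as the term's total number of occurrences; N and C are hypothetical constants picked so the arithmetic is exact.

```python
import math


def zipf_frequency(rank, C):
    """Idealized Zipf's law: frequency of the term at a given rank is C / rank."""
    return C / rank


def idf(N, n_i):
    """IDF_i = log2(N / n_i) + 1."""
    return math.log2(N / n_i) + 1


N = 1024      # hypothetical number of documents
C = 1024.0    # hypothetical frequency of the most common term

# (rank, frequency, IDF) for a few ranks.
table = [(r, zipf_frequency(r, C), idf(N, zipf_frequency(r, C)))
         for r in (1, 2, 4, 8)]
# IDF values: 1.0, 2.0, 3.0, 4.0
```

Under these assumptions, each doubling of rank adds 1 to the IDF, so the weight grows with the logarithm of the rank rather than with the raw frequency ratio.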
Question 4: Probabilistic Models
The section on probabilistic models is rather
unsatisfactory because it relies on a
mathematical foundation that has been left
out. Can you summarize the basic ideas?
Question 5: TF.IDF compared with Google PageRank
(a) TF.IDF and PageRank are based on fundamentally different considerations. What are the fundamental differences?
(b) Under which circumstances would you expect each to excel?