Title: Under the Hood, Part II: Web-Based Information Architectures

1. Under the Hood, Part II: Web-Based Information Architectures
- MSEC 20-760, Mini II
- Jaime Carbonell
2. Today's Topics
- Term weighting in detail
- Generalized Vector Space Model (GVSM)
- Maximal Marginal Relevance
- Summarization as Passage Retrieval
3. Term Weighting Revisited (1)
- Definitions
- wi, the "ith term": a word, stemmed word, or indexed phrase
- Dj, the "jth document": a unit of indexed text, e.g. a web page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.
4. Term Weighting Revisited (2)
- Definitions
- C, "the collection": the full set of indexed documents (e.g. the New York Times archive, the Web, ...)
- Tf(wi, Dj), "term frequency": the number of times wi occurs in document Dj. Tf is sometimes normalized by dividing by the frequency of the most frequent non-stop term in the document: Tf_norm = Tf / max_Tf.
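To make the normalization concrete, here is a minimal Python sketch; the tokenization and the tiny stoplist are illustrative assumptions, not part of the slides:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}  # toy stoplist

def normalized_tf(doc_tokens):
    """Tf_norm = Tf / max_Tf, where max_Tf is the frequency of the
    most frequent non-stop term in the document."""
    counts = Counter(t for t in doc_tokens if t not in STOP_WORDS)
    max_tf = max(counts.values())
    return {term: tf / max_tf for term, tf in counts.items()}

print(normalized_tf("the cat sat on the cat mat".split()))
# {'cat': 1.0, 'sat': 0.5, 'on': 0.5, 'mat': 0.5}
```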
5. Term Weighting Revisited (3)
- Definitions
- Df(wi, C), "document frequency": the number of documents from C in which wi occurs. Df may be normalized by dividing it by the total number of documents in C.
- IDf(wi, C), "inverse document frequency": (Df(wi, C) / size(C))^-1. Most often log2(IDf) is used, rather than IDf directly.
6. Term Weighting Revisited (4)
- TfIDf Term Weights
- In general: TfIDf(wi, Dj, C) = F1(Tf(wi, Dj)) * F2(IDf(wi, C))
- Usually F1 = 0.5 + log2(Tf), or Tf/Tfmax, or 0.5 + 0.5 * Tf/Tfmax
- Usually F2 = log2(IDf)
- In the SMART IR system: TfIDf(wi, Dj, C) = [0.5 + 0.5 * Tf(wi, Dj)/Tfmax(Dj)] * log2(IDf(wi, C))
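Putting the definitions of slides 3 through 6 together, here is a hedged Python sketch of the SMART-style weight; the function and variable names are mine, and the toy collection is purely illustrative:

```python
import math
from collections import Counter

def smart_tfidf(term, doc_tokens, collection):
    """SMART-style TfIDf from the formula above:
    [0.5 + 0.5 * Tf/Tfmax] * log2(IDf), with IDf = (Df / size(C))^-1."""
    counts = Counter(doc_tokens)
    tf = counts[term]
    if tf == 0:
        return 0.0
    tf_max = max(counts.values())
    df = sum(1 for d in collection if term in d)   # document frequency
    idf = len(collection) / df                     # (Df / size(C))^-1
    return (0.5 + 0.5 * tf / tf_max) * math.log2(idf)

docs = [["heart", "disease", "stroke"],
        ["stroke", "therapy"],
        ["heart", "surgery"]]
print(smart_tfidf("stroke", docs[0], docs))  # rarer terms weigh more
```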
7. Term Weighting beyond TfIDf (1)
- Probabilistic Models
- Old style (see textbooks)
- Improves precision-recall slightly
- Full statistical language modeling (CMU)
- Improves precision-recall more significantly
- Difficult to compute efficiently.
8. Term Weighting beyond TfIDf (2)
- Neural Networks
- Theoretically attractive
- Do not scale up at all, unfortunately
- Fuzzy Sets
- Not deeply researched, scaling difficulties
9. Term Weighting beyond TfIDf (3)
- Natural Language Analysis
- Analyze and understand the Ds and Q first
- The ultimate IR method, in theory
- In general, NL understanding is an unsolved problem
- Scale-up challenges, even if we could do it
- But, shown to improve IR for very limited domains
10. Generalized Vector Space Model (1)
- Principles
- Define terms by their occurrence patterns in documents
- Define query terms in the same way
- Compute similarity by document-pattern overlap for terms in D and Q
- Use standard cosine similarity and either binary or TfIDf weights
11. Generalized Vector Space Model (2)
- Advantages
- Automatically calculates partial similarity
- If "heart disease", "stroke", and "ventricular" co-occur in many documents, then when the query contains only one of these terms, documents containing the others receive partial credit proportional to their document co-occurrence ratio.
- No need to do query expansion or relevance feedback
12. Generalized Vector Space Model (3)
- Disadvantages
- Computationally expensive
- Performance ≈ vector space + query expansion
13. GVSM, How it Works (1)
- Represent the collection as a vector of documents:
- Let C = (D1, D2, ..., Dm)
- Represent each term by its distributional frequency:
- Let ti = (Tf(ti, D1), Tf(ti, D2), ..., Tf(ti, Dm))
- Term-to-term similarity is computed as:
- Sim(ti, tj) = cos(vec(ti), vec(tj))
- Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval
14. GVSM, How it Works (2)
- Query-document similarity is computed as before, Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the dot-product calculation we use a function of the term-to-term similarity computed above. For instance:
- Sim(Q, D) = Sum_i Max_j sim(qi, dj)
- or, normalizing for document and query length:
- Sim_norm(Q, D) = (Sum_i Max_j sim(qi, dj)) / (|Q| * |D|)
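Continuing the sketch above, and assuming the length normalization is a division by |Q| * |D| (my reading of the slide):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def term_vector(term, collection):
    return [doc.count(term) for doc in collection]

def gvsm_sim(query_terms, doc_terms, collection):
    """Sum_i Max_j form: each query term gets credit for its best
    term-to-term match in the document, normalized by |Q| * |D|."""
    total = sum(max(cosine(term_vector(q, collection),
                           term_vector(d, collection)) for d in doc_terms)
                for q in query_terms)
    return total / (len(query_terms) * len(doc_terms))

docs = [["heart", "disease", "stroke"], ["stroke", "therapy"], ["stocks"]]
# "stroke" earns partial credit for a "heart" query via co-occurrence:
print(gvsm_sim(["heart"], ["stroke", "therapy"], docs))
```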
15. GVSM, How it Works (3)
- Primary problem
- More computation (term vectors go from sparse to dense)
- Primary benefit
- Automatic term expansion by corpus
16. A Critique of Pure Relevance (1)
- IR Maximizes Relevance
- Precision and recall are relevance measures
- Quality of documents retrieved is ignored
17. A Critique of Pure Relevance (2)
- Other Important Factors
- What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
- In IR, we really want to maximize
- P(U(f1, ..., fn) | Q, C, U, H)
- where Q = query, C = collection set, U = user profile, H = interaction history
- ...but we don't yet know how. Darn.
18. Maximal Marginal Relevance (1)
- A crude first approximation:
- novelty => minimal redundancy
- Weighted linear combination (redundancy = cost, relevance = benefit)
- Free parameters: k and λ
19. Maximal Marginal Relevance (2)
- MMR(Q, C, R) = Argmax_k [di in C] ( λ * S(Q, di) - (1 - λ) * max [dj in R] S(di, dj) )
- where C is the candidate set, R the set of documents already ranked, and S a similarity measure (e.g. cosine)
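A tiny numeric illustration of the trade-off; the scoring helper and the choice of λ = 0.7 are mine:

```python
def mmr_score(sim_to_query, sims_to_ranked, lam=0.7):
    """One candidate's MMR value: lam * S(Q, di) minus (1 - lam) times
    its worst-case redundancy with the already-ranked set R."""
    redundancy = max(sims_to_ranked, default=0.0)
    return lam * sim_to_query - (1 - lam) * redundancy

# A highly relevant but redundant candidate can lose to a slightly
# less relevant, novel one:
print(mmr_score(0.9, [0.95]))  # 0.345
print(mmr_score(0.8, [0.10]))  # 0.530
```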
20. Maximal Marginal Relevance (MMR) (3)
- COMPUTATION OF MMR RERANKING
- 1. Standard IR retrieval of the top-N docs:
- Let Dr = IR(D, Q, N)
- 2. Rank the di in Dr with max sim(di, Q) as the top doc, i.e.
- Let Ranked = <di>
- 3. Let Dr = Dr \ {di}
- 4. While Dr is not empty, do:
- a. Find the di in Dr with max MMR(Dr, Q, Ranked)
- b. Let Ranked = Ranked . di (i.e. append di)
- c. Let Dr = Dr \ {di}
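The loop above as a self-contained Python sketch; `sim` stands in for whatever similarity the IR engine provides, and the Jaccard toy below is only for the demo:

```python
def mmr_rerank(query, candidates, sim, lam=0.7):
    """Greedy MMR reranking of retrieved documents (steps 2-4 above):
    repeatedly pick the candidate maximizing
    lam * sim(d, Q) - (1 - lam) * max_{r in Ranked} sim(d, r)."""
    pool = list(candidates)          # Dr, the top-N retrieved docs
    first = max(pool, key=lambda d: sim(d, query))
    ranked = [first]                 # step 2: seed with most relevant doc
    pool.remove(first)
    while pool:                      # step 4: take the max-MMR doc each round
        best = max(pool, key=lambda d: lam * sim(d, query)
                   - (1 - lam) * max(sim(d, r) for r in ranked))
        ranked.append(best)
        pool.remove(best)
    return ranked

jaccard = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
docs = [["mmr", "rerank"], ["mmr", "rerank"], ["novel", "topic", "mmr"]]
print(mmr_rerank(["mmr"], docs, jaccard))  # the duplicate drops below the novel doc
```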
21. MMR Ranking vs Standard IR
[Figure: documents plotted around a query, comparing the MMR ranking with the standard IR ranking; λ controls the spiral curl.]
22. Maximal Marginal Relevance (MMR) (4)
- Applications
- Ranking retrieved documents from an IR engine
- Ranking passages for inclusion in summaries
23. Document Summarization in a Nutshell (1)
24. Document Summarization in a Nutshell (2)
- Other Dimensions
- Single- vs. multi-document summarization
- Genre-adaptive vs. one-size-fits-all
- Single-language vs. translingual
- Flat summary vs. hyperlinked pyramid
- Text-only vs. multi-media
- ...
25. Summarization as Passage Retrieval (1)
- For Query-Driven Summaries (see the sketch after this list)
- 1. Divide the document into passages (e.g., sentences, paragraphs, FAQ-pairs, ...).
- 2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy.
- 3. Assemble the retrieved passages into a summary.
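A minimal sketch of this recipe, assuming sentence passages and a crude word-overlap similarity in place of a real IR engine; all names are illustrative:

```python
import re

def summarize(query_terms, document, lam=0.7, k=3):
    """Query-driven summary as passage retrieval: split the document
    into sentence passages, then MMR-select k of them for the query."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    passages = [p for p in (re.findall(r"\w+", s.lower()) for s in sentences) if p]
    sim = lambda a, b: len(set(a) & set(b)) / (len(set(a) | set(b)) or 1)
    summary = []
    while passages and len(summary) < k:
        best = max(passages, key=lambda p: lam * sim(p, query_terms)
                   - (1 - lam) * max((sim(p, s) for s in summary), default=0.0))
        summary.append(best)
        passages.remove(best)
    return [" ".join(p) for p in summary]
```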
26. Summarization as Passage Retrieval (2)
- For Generic Summaries
- 1. Use the title or the top-k TfIDf terms as the query.
- 2. Proceed as in query-driven summarization (a sketch of step 1 follows).
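One way to build the generic query, under the same toy assumptions as the sketches above; the helper name and the raw Tf * log2(IDf) weighting are mine, and the result can be fed to summarize():

```python
import math, re
from collections import Counter

def generic_query(document, collection, k=5):
    """Top-k TfIDf terms of a document, for use as the query of a
    generic summary. Assumes `document` appears in `collection`."""
    counts = Counter(re.findall(r"\w+", document.lower()))
    def tfidf(term):
        df = sum(1 for d in collection if term in re.findall(r"\w+", d.lower()))
        return counts[term] * math.log2(len(collection) / df)
    return sorted(counts, key=tfidf, reverse=True)[:k]
```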
27. Summarization as Passage Retrieval (3)
- For Multidocument Summaries (a sketch follows)
- 1. Cluster documents into topically related groups.
- 2. For each group, divide each document into passages and keep track of the source of each passage.
- 3. Use MMR to retrieve the most relevant non-redundant passages (MMR is essential here, since multiple documents on one topic tend to repeat content).
- 4. Assemble a summary for each cluster.
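A sketch of steps 2 and 3 for a single cluster, in the same toy style as above; clustering (step 1) is assumed already done, and every name here is illustrative:

```python
import re

def multidoc_summary(query_terms, cluster, lam=0.7, k=3):
    """Pool sentence passages from every document in one topical
    cluster, tag each with its source doc, then MMR-select k passages."""
    pool = []  # (doc_id, passage_tokens) pairs keep the source visible
    for doc_id, text in cluster.items():
        for s in re.split(r"(?<=[.!?])\s+", text):
            tokens = re.findall(r"\w+", s.lower())
            if tokens:
                pool.append((doc_id, tokens))
    sim = lambda a, b: len(set(a) & set(b)) / (len(set(a) | set(b)) or 1)
    chosen = []
    while pool and len(chosen) < k:
        best = max(pool, key=lambda sp: lam * sim(sp[1], query_terms)
                   - (1 - lam) * max((sim(sp[1], c[1]) for c in chosen), default=0.0))
        chosen.append(best)
        pool.remove(best)
    return chosen  # each selected passage still carries its doc_id
```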