Title: Under the Hood, Part II: Web-Based Information Architectures

1. Under the Hood, Part II: Web-Based Information Architectures
- MSEC 20-760, Mini II
- Jaime Carbonell
2. Today's Topics
- Term weighting in detail
- Generalized Vector Space Model (GVSM)
- Maximal Marginal Relevance
- Summarization as Passage Retrieval
3. Term Weighting Revisited (1)
- Definitions
- wi, the "ith term": a word, stemmed word, or indexed phrase
- Dj, the "jth document": a unit of indexed text, e.g. a web page, a news report, an article, a patent, a legal case, a book, a chapter of a book, etc.
4. Term Weighting Revisited (2)
- Definitions
- C, "the collection": the full set of indexed documents (e.g. the New York Times archive, the Web, ...)
- Tf(wi, Dj), "term frequency": the number of times wi occurs in document Dj. Tf is sometimes normalized by dividing by the frequency of the most frequent non-stop term in the document: Tf_norm = Tf / max_Tf.
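To make the normalization concrete, here is a minimal Python sketch; the tokenization and the tiny stoplist are illustrative assumptions, not part of the slides:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to"}  # toy stoplist

def normalized_tf(doc_tokens):
    """Tf_norm = Tf / max_Tf, where max_Tf is the frequency of the
    most frequent non-stop term in the document."""
    counts = Counter(t for t in doc_tokens if t not in STOP_WORDS)
    max_tf = max(counts.values())
    return {term: tf / max_tf for term, tf in counts.items()}

print(normalized_tf("the cat sat on the cat mat".split()))
# {'cat': 1.0, 'sat': 0.5, 'on': 0.5, 'mat': 0.5}
```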
5. Term Weighting Revisited (3)
- Definitions
- Df(wi, C), "document frequency": the number of documents from C in which wi occurs. Df may be normalized by dividing it by the total number of documents in C.
- IDf(wi, C), "inverse document frequency": (Df(wi, C) / size(C))^-1. Most often log2(IDf) is used, rather than IDf directly.
6. Term Weighting Revisited (4)
- TfIDf Term Weights
- In general: TfIDf(wi, Dj, C) = F1(Tf(wi, Dj)) * F2(IDf(wi, C))
- Usually F1 = 0.5 + log2(Tf), or Tf/Tfmax, or 0.5 + 0.5 * Tf/Tfmax
- Usually F2 = log2(IDf)
- In the SMART IR system: TfIDf(wi, Dj, C) = [0.5 + 0.5 * Tf(wi, Dj)/Tfmax(Dj)] * log2(IDf(wi, C))
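Putting the definitions of slides 3 through 6 together, here is a hedged Python sketch of the SMART-style weight; the function and variable names are mine, and the toy collection is purely illustrative:

```python
import math
from collections import Counter

def smart_tfidf(term, doc_tokens, collection):
    """SMART-style TfIDf from the formula above:
    [0.5 + 0.5 * Tf/Tfmax] * log2(IDf), with IDf = (Df / size(C))^-1."""
    counts = Counter(doc_tokens)
    tf = counts[term]
    if tf == 0:
        return 0.0
    tf_max = max(counts.values())
    df = sum(1 for d in collection if term in d)   # document frequency
    idf = len(collection) / df                     # (Df / size(C))^-1
    return (0.5 + 0.5 * tf / tf_max) * math.log2(idf)

docs = [["heart", "disease", "stroke"],
        ["stroke", "therapy"],
        ["heart", "surgery"]]
print(smart_tfidf("stroke", docs[0], docs))  # rarer terms weigh more
```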
7. Term Weighting beyond TfIDf (1)
- Probabilistic Models
- Old style (see textbooks)
- Improves precision-recall slightly
- Full statistical language modeling (CMU)
- Improves precision-recall more significantly
- Difficult to compute efficiently.
8. Term Weighting beyond TfIDf (2)
- Neural Networks
- Theoretically attractive
- Do not scale up at all, unfortunately
- Fuzzy Sets
- Not deeply researched, scaling difficulties
9. Term Weighting beyond TfIDf (3)
- Natural Language Analysis
- Analyze and understand the Ds and Q first
- The ultimate IR method, in theory
- In general, NL understanding is an unsolved problem
- Scale-up challenges, even if we could do it
- But, shown to improve IR for very limited domains
10. Generalized Vector Space Model (1)
- Principles
- Define terms by their occurrence patterns in documents
- Define query terms in the same way
- Compute similarity by document-pattern overlap for terms in D and Q
- Use standard cosine similarity and either binary or TfIDf weights
11. Generalized Vector Space Model (2)
- Advantages
- Automatically calculates partial similarity
- If "heart disease", "stroke", and "ventricular" co-occur in many documents, then when the query contains only one of these terms, documents containing the others receive partial credit proportional to their document co-occurrence ratio.
- No need to do query expansion or relevance feedback
12. Generalized Vector Space Model (3)
- Disadvantages
- Computationally expensive
- Performance ≈ vector space + query expansion
13. GVSM, How it Works (1)
- Represent the collection as a vector of documents:
- Let C = (D1, D2, ..., Dm)
- Represent each term by its distributional frequency:
- Let ti = (Tf(ti, D1), Tf(ti, D2), ..., Tf(ti, Dm))
- Term-to-term similarity is computed as:
- Sim(ti, tj) = cos(vec(ti), vec(tj))
- Hence, highly co-occurring terms like "Arafat" and "PLO" will be treated as near-synonyms for retrieval
14. GVSM, How it Works (2)
- Query-document similarity is computed as before, Sim(Q, D) = cos(vec(Q), vec(D)), except that instead of the dot-product calculation we use a function of the term-to-term similarity computed above. For instance:
- Sim(Q, D) = Sum_i Max_j sim(qi, dj)
- or, normalizing for document and query length:
- Sim_norm(Q, D) = (Sum_i Max_j sim(qi, dj)) / (|Q| * |D|)
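Continuing the sketch above, and assuming the length normalization is a division by |Q| * |D| (my reading of the slide):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def term_vector(term, collection):
    return [doc.count(term) for doc in collection]

def gvsm_sim(query_terms, doc_terms, collection):
    """Sum_i Max_j form: each query term gets credit for its best
    term-to-term match in the document, normalized by |Q| * |D|."""
    total = sum(max(cosine(term_vector(q, collection),
                           term_vector(d, collection)) for d in doc_terms)
                for q in query_terms)
    return total / (len(query_terms) * len(doc_terms))

docs = [["heart", "disease", "stroke"], ["stroke", "therapy"], ["stocks"]]
# "stroke" earns partial credit for a "heart" query via co-occurrence:
print(gvsm_sim(["heart"], ["stroke", "therapy"], docs))
```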
15. GVSM, How it Works (3)
- Primary problem
- More computation (term vectors go from sparse to dense)
- Primary benefit
- Automatic term expansion by corpus
16. A Critique of Pure Relevance (1)
- IR Maximizes Relevance
- Precision and recall are relevance measures
- Quality of documents retrieved is ignored
17. A Critique of Pure Relevance (2)
- Other Important Factors
- What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...?
- In IR, we really want to maximize
- P(U(f1, ..., fn) | Q, C, U, H)
- where Q = query, C = collection set, U = user profile, H = interaction history
- ...but we don't yet know how. Darn.
18. Maximal Marginal Relevance (1)
- A crude first approximation:
- novelty => minimal redundancy
- Weighted linear combination (redundancy = cost, relevance = benefit)
- Free parameters: k and λ
19. Maximal Marginal Relevance (2)
- MMR(Q, C, R) = Argmax_k [di in C] ( λ * S(Q, di) - (1 - λ) * max [dj in R] S(di, dj) )
- where C is the candidate set, R the set of documents already ranked, and S a similarity measure (e.g. cosine)
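A tiny numeric illustration of the trade-off; the scoring helper and the choice of λ = 0.7 are mine:

```python
def mmr_score(sim_to_query, sims_to_ranked, lam=0.7):
    """One candidate's MMR value: lam * S(Q, di) minus (1 - lam) times
    its worst-case redundancy with the already-ranked set R."""
    redundancy = max(sims_to_ranked, default=0.0)
    return lam * sim_to_query - (1 - lam) * redundancy

# A highly relevant but redundant candidate can lose to a slightly
# less relevant, novel one:
print(mmr_score(0.9, [0.95]))  # 0.345
print(mmr_score(0.8, [0.10]))  # 0.530
```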
20. Maximal Marginal Relevance (MMR) (3)
- COMPUTATION OF MMR RERANKING
- 1. Standard IR retrieval of the top-N docs:
- Let Dr = IR(D, Q, N)
- 2. Rank the di in Dr with max sim(di, Q) as the top doc, i.e.
- Let Ranked = <di>
- 3. Let Dr = Dr \ {di}
- 4. While Dr is not empty, do:
- a. Find the di in Dr with max MMR(Dr, Q, Ranked)
- b. Let Ranked = Ranked . di (i.e. append di)
- c. Let Dr = Dr \ {di}
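The loop above as a self-contained Python sketch; `sim` stands in for whatever similarity the IR engine provides, and the Jaccard toy below is only for the demo:

```python
def mmr_rerank(query, candidates, sim, lam=0.7):
    """Greedy MMR reranking of retrieved documents (steps 2-4 above):
    repeatedly pick the candidate maximizing
    lam * sim(d, Q) - (1 - lam) * max_{r in Ranked} sim(d, r)."""
    pool = list(candidates)          # Dr, the top-N retrieved docs
    first = max(pool, key=lambda d: sim(d, query))
    ranked = [first]                 # step 2: seed with most relevant doc
    pool.remove(first)
    while pool:                      # step 4: take the max-MMR doc each round
        best = max(pool, key=lambda d: lam * sim(d, query)
                   - (1 - lam) * max(sim(d, r) for r in ranked))
        ranked.append(best)
        pool.remove(best)
    return ranked

jaccard = lambda a, b: len(set(a) & set(b)) / len(set(a) | set(b))
docs = [["mmr", "rerank"], ["mmr", "rerank"], ["novel", "topic", "mmr"]]
print(mmr_rerank(["mmr"], docs, jaccard))  # the duplicate drops below the novel doc
```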
21. MMR Ranking vs Standard IR
[Figure: documents plotted around a query, comparing the MMR ranking with the standard IR ranking; λ controls the spiral curl.]
22. Maximal Marginal Relevance (MMR) (4)
- Applications
- Ranking retrieved documents from an IR engine
- Ranking passages for inclusion in summaries
23. Document Summarization in a Nutshell (1)
24. Document Summarization in a Nutshell (2)
- Other Dimensions
- Single- vs. multi-document summarization
- Genre-adaptive vs. one-size-fits-all
- Single-language vs. translingual
- Flat summary vs. hyperlinked pyramid
- Text-only vs. multi-media
- ...
25. Summarization as Passage Retrieval (1)
- For Query-Driven Summaries (see the sketch after this list)
- 1. Divide the document into passages (e.g., sentences, paragraphs, FAQ-pairs, ...).
- 2. Use the query to retrieve the most relevant passages, or better, use MMR to avoid redundancy.
- 3. Assemble the retrieved passages into a summary.
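A minimal sketch of this recipe, assuming sentence passages and a crude word-overlap similarity in place of a real IR engine; all names are illustrative:

```python
import re

def summarize(query_terms, document, lam=0.7, k=3):
    """Query-driven summary as passage retrieval: split the document
    into sentence passages, then MMR-select k of them for the query."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    passages = [p for p in (re.findall(r"\w+", s.lower()) for s in sentences) if p]
    sim = lambda a, b: len(set(a) & set(b)) / (len(set(a) | set(b)) or 1)
    summary = []
    while passages and len(summary) < k:
        best = max(passages, key=lambda p: lam * sim(p, query_terms)
                   - (1 - lam) * max((sim(p, s) for s in summary), default=0.0))
        summary.append(best)
        passages.remove(best)
    return [" ".join(p) for p in summary]
```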
26. Summarization as Passage Retrieval (2)
- For Generic Summaries
- 1. Use the title or the top-k TfIDf terms as the query.
- 2. Proceed as in query-driven summarization (a sketch of step 1 follows).
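One way to build the generic query, under the same toy assumptions as the sketches above; the helper name and the raw Tf * log2(IDf) weighting are mine, and the result can be fed to summarize():

```python
import math, re
from collections import Counter

def generic_query(document, collection, k=5):
    """Top-k TfIDf terms of a document, for use as the query of a
    generic summary. Assumes `document` appears in `collection`."""
    counts = Counter(re.findall(r"\w+", document.lower()))
    def tfidf(term):
        df = sum(1 for d in collection if term in re.findall(r"\w+", d.lower()))
        return counts[term] * math.log2(len(collection) / df)
    return sorted(counts, key=tfidf, reverse=True)[:k]
```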
27. Summarization as Passage Retrieval (3)
- For Multidocument Summaries (a sketch follows)
- 1. Cluster documents into topically related groups.
- 2. For each group, divide each document into passages and keep track of the source of each passage.
- 3. Use MMR to retrieve the most relevant non-redundant passages (MMR is essential here, since multiple documents on one topic tend to repeat content).
- 4. Assemble a summary for each cluster.
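A sketch of steps 2 and 3 for a single cluster, in the same toy style as above; clustering (step 1) is assumed already done, and every name here is illustrative:

```python
import re

def multidoc_summary(query_terms, cluster, lam=0.7, k=3):
    """Pool sentence passages from every document in one topical
    cluster, tag each with its source doc, then MMR-select k passages."""
    pool = []  # (doc_id, passage_tokens) pairs keep the source visible
    for doc_id, text in cluster.items():
        for s in re.split(r"(?<=[.!?])\s+", text):
            tokens = re.findall(r"\w+", s.lower())
            if tokens:
                pool.append((doc_id, tokens))
    sim = lambda a, b: len(set(a) & set(b)) / (len(set(a) | set(b)) or 1)
    chosen = []
    while pool and len(chosen) < k:
        best = max(pool, key=lambda sp: lam * sim(sp[1], query_terms)
                   - (1 - lam) * max((sim(sp[1], c[1]) for c in chosen), default=0.0))
        chosen.append(best)
        pool.remove(best)
    return chosen  # each selected passage still carries its doc_id
```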