Title: Lecture 16: Text Databases
1. Lecture 16: Text Databases (Information Retrieval, Part II)
Oct. 20, 2006, ChengXiang Zhai
2. The Notion of Relevance
3. What is a Statistical LM?
- A probability distribution over word sequences
- p("Today is Wednesday") ≈ 0.001
- p("Today Wednesday is") ≈ 0.0000000000001
- p("The eigenvalue is positive") ≈ 0.00001
- Context-dependent!
- Can also be regarded as a probabilistic mechanism
for generating text, thus also called a
generative model
4. Why is an LM Useful?
- Provides a principled way to quantify the uncertainties associated with natural language
- Allows us to answer questions like:
  - Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition)
  - Given that we observe "baseball" three times and "game" once in a news article, how likely is it about sports? (text categorization, information retrieval)
  - Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)
5. Basic Issues
- Define the probabilistic model
  - Events, random variables, joint/conditional probabilities
  - p(w1 w2 ... wn) = f(θ1, θ2, ..., θn)
- Estimate model parameters
  - Tune the model to best fit the data and our prior knowledge
  - θi = ?
- Apply the model to a particular task
  - Many applications
6. The Simplest Language Model (Unigram Model)
- Generate a piece of text by generating each word INDEPENDENTLY
- Thus, p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn)
- Parameters: {p(wi)}, with p(w1) + ... + p(wN) = 1 (N is the vocabulary size)
- Essentially a multinomial distribution over words
- A piece of text can be regarded as a sample drawn according to this word distribution (see the sketch below)
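To make the independence assumption concrete, here is a minimal Python sketch (not from the lecture; the word distribution is hypothetical and sums to 1 over a toy vocabulary) that computes p(w1 ... wn) as a product of unigram probabilities.

```python
import math

# Hypothetical unigram distribution; probabilities sum to 1 over this toy vocabulary.
p = {"the": 0.6, "text": 0.2, "mining": 0.1, "food": 0.07,
     "clustering": 0.02, "association": 0.01}

def unigram_log_prob(words, p):
    """log p(w1 ... wn) = sum_i log p(wi) under the unigram independence assumption."""
    return sum(math.log(p[w]) for w in words)

print(math.exp(unigram_log_prob(["text", "mining"], p)))  # 0.2 * 0.1 = 0.02
```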
7. Text Generation with Unigram LM
(Figure: a (unigram) language model θ gives p(w | θ); sampling from it generates a document. Two example models:
Topic 1 (Text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001
Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ...)
A sampling sketch follows.
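As an illustration of this generative view, a minimal sketch that samples a "document" word by word from p(w | θ). The two distributions follow the figure; since only a few words are shown, the remaining probability mass is lumped into a placeholder OTHER token (an assumption of the sketch).

```python
import random

# Word probabilities from the figure (partial); remaining mass goes to "OTHER".
topic1_text_mining = {"text": 0.2, "mining": 0.1, "association": 0.01,
                      "clustering": 0.02, "food": 0.00001}
topic2_health = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

def sample_document(p, length, seed=0):
    """Generate `length` words, each drawn independently from p(w | theta)."""
    rng = random.Random(seed)
    vocab = list(p) + ["OTHER"]
    weights = list(p.values()) + [1.0 - sum(p.values())]
    return rng.choices(vocab, weights=weights, k=length)

print(sample_document(topic1_text_mining, 10))
print(sample_document(topic2_health, 10))
```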
8. Estimation of Unigram LM
(Figure: given a document, estimate its (unigram) language model θ, i.e., p(w | θ) = ? The example document is a text mining paper with 100 words in total and counts:
text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, ...)
A sketch of the natural (maximum-likelihood) estimate follows.
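A minimal sketch of the maximum-likelihood estimate p_ML(w | d) = c(w, d) / |d|, using the counts above (the 100-word total is taken from the slide):

```python
# Word counts from the slide; the document has 100 words in total.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
doc_length = 100

# Maximum-likelihood estimate: p_ML(w | d) = c(w, d) / |d|.
p_ml = {w: c / doc_length for w, c in counts.items()}

print(p_ml["text"])   # 0.1
print(p_ml["query"])  # 0.01 -- and any word not in `counts` implicitly gets 0
```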
9. Empirical Distribution of Words
- There are stable, language-independent patterns in how people use natural languages
- A few words occur very frequently; most occur rarely. E.g., in news articles:
  - Top 4 words: roughly 10-15% of word occurrences
  - Top 50 words: roughly 35-40% of word occurrences
- The most frequent word in one corpus may be rare in another
10. Zipf's Law
- rank × frequency ≈ constant (a quick illustration of the computation follows)
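A small, purely illustrative sketch of the rank × frequency product; the toy token list is hypothetical, and only on a realistically sized corpus does the product stay roughly constant across ranks.

```python
from collections import Counter

def zipf_table(tokens):
    """Return (rank, word, frequency, rank * frequency) sorted by frequency."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

tokens = "the of the a the of to the a of the to and in".split()
for rank, word, freq, product in zipf_table(tokens):
    print(rank, word, freq, product)
```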
11. Language Models for Retrieval (Ponte & Croft 98)
(Figure: a language model is estimated from each document, e.g. a text mining paper and a food nutrition paper, and documents are ranked by how likely each model is to generate the query "data mining algorithms".)
12. Ranking Docs by Query Likelihood
(Figure: documents d1, d2, ..., dN are ranked by the likelihood p(q | di) of the query q under each document's language model.)
13. Retrieval as Language Model Estimation
- Document ranking based on query likelihood (see the sketch below)
- Retrieval problem ≈ estimation of p(wi | d)
- Smoothing is an important issue, and distinguishes different approaches
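A minimal sketch of query-likelihood ranking, under the assumption that each document model is supplied as a callable w -> p(w | d) that is already smoothed (with an unsmoothed model, a single unseen query word would send the score to minus infinity):

```python
import math
from collections import Counter

def query_log_likelihood(query_tokens, p_w_given_d):
    """score(q, d) = sum over query words w of c(w, q) * log p(w | d)."""
    return sum(count * math.log(p_w_given_d(w))
               for w, count in Counter(query_tokens).items())

def rank_documents(query_tokens, doc_models):
    """doc_models: {doc_id: callable w -> smoothed p(w | d)}; best score first."""
    scores = {doc_id: query_log_likelihood(query_tokens, model)
              for doc_id, model in doc_models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```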
14. Problem with the ML Estimator
- What if a word doesn't appear in the text?
- In general, what probability should we give a word that has not been observed?
- If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
- This is what smoothing is about
15. Language Model Smoothing (Illustration)
16. A General Smoothing Scheme
- All smoothing methods try to:
  - discount the probability of words seen in a doc
  - re-allocate the extra probability so that unseen words will have a non-zero probability
- Most use a reference model (collection language model) to discriminate unseen words
17. Smoothing and TF-IDF Weighting
- Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain the decomposition sketched below
- Smoothing with p(w|C) ≈ TF-IDF weighting + length normalization
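A sketch of the decomposition referred to above, under the usual assumption that the smoothed model assigns p_s(w|d) to words seen in d and α_d · p(w|C) to unseen words (n is the query length):

```latex
\log p(q \mid d)
  \;=\; \sum_{w:\, c(w,q)>0,\; c(w,d)>0} c(w,q)\,
        \log \frac{p_s(w \mid d)}{\alpha_d\, p(w \mid C)}
  \;+\; n \log \alpha_d
  \;+\; \sum_{w} c(w,q) \log p(w \mid C)
```

The first sum is a TF-like reward for matched terms divided by p(w|C) (an IDF-like effect), the n log α_d term acts as length normalization, and the last sum is document-independent and can be dropped for ranking.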
18. How to Smooth?
- All smoothing methods try to:
  - discount the probability of words seen in a document
  - re-allocate the extra counts so that unseen words will have a non-zero count
- Method 1 (Additive smoothing): add a constant δ to the count of each word:
  p(w|d) = (c(w,d) + δ) / (|d| + δ|V|)
  where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size; δ = 1 gives "add one" (Laplace) smoothing (sketched in code below)
- Problems?
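A minimal sketch of additive (add-δ) smoothing, assuming the vocabulary is supplied explicitly:

```python
from collections import Counter

def additive_smoothing(doc_tokens, vocabulary, delta=1.0):
    """p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|); delta = 1 is Laplace."""
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + delta * len(vocabulary)
    return {w: (counts[w] + delta) / denom for w in vocabulary}

# Hypothetical usage: every vocabulary word gets a non-zero probability.
vocab = {"text", "mining", "clustering", "food"}
print(additive_smoothing(["text", "mining", "text"], vocab))
```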
19. Other Smoothing Methods
- Method 2 (Absolute discounting): subtract a constant δ from the count of each seen word:
  p(w|d) = (max(c(w,d) - δ, 0) + δ |d|_u · p(w|REF)) / |d|
  where |d|_u is the number of unique words in d and δ is a parameter
- Method 3 (Linear interpolation, Jelinek-Mercer): shrink uniformly toward p(w|REF):
  p(w|d) = (1 - λ) · p_ML(w|d) + λ · p(w|REF)
  where p_ML is the maximum-likelihood estimate and λ is a parameter
(Both methods are sketched in code below.)
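Minimal sketches of the two methods on this slide, assuming p_ref is a callable giving the reference (collection) model p(w|REF); function names and default parameter values are assumptions of the sketch.

```python
from collections import Counter

def jelinek_mercer(doc_tokens, p_ref, lam=0.5):
    """Linear interpolation: p(w|d) = (1 - lam) * p_ML(w|d) + lam * p(w|REF)."""
    counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    return lambda w: (1 - lam) * counts[w] / doc_len + lam * p_ref(w)

def absolute_discounting(doc_tokens, p_ref, delta=0.7):
    """p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|,
    where |d|_u is the number of unique words in d."""
    counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    uniq = len(counts)
    return lambda w: (max(counts[w] - delta, 0) + delta * uniq * p_ref(w)) / doc_len
```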
20. Other Smoothing Methods (cont.)
- Method 4 (Dirichlet prior / Bayesian): assume pseudo counts μ·p(w|REF):
  p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ)
  where μ is a parameter (see the sketch below)
- Method 5 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of seen events in the same way
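A minimal sketch of Dirichlet prior smoothing; μ = 2000 is just a commonly used ballpark value, not one from the lecture, and the uniform reference model in the usage line is hypothetical.

```python
from collections import Counter

def dirichlet_smoothing(doc_tokens, p_ref, mu=2000.0):
    """Dirichlet prior: p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)."""
    counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    return lambda w: (counts[w] + mu * p_ref(w)) / (doc_len + mu)

# Hypothetical usage with a uniform reference model over a 10,000-word vocabulary.
p_d = dirichlet_smoothing(["text", "mining", "text"], p_ref=lambda w: 1 / 10000)
print(p_d("text"), p_d("never_seen_word"))  # seen word > unseen word > 0
```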
21. So, which method is the best?
- It depends on the data and the task!
- Many other sophisticated smoothing methods have been proposed
- Cross-validation is generally used to choose the best method and/or set the smoothing parameters
- For retrieval, the Dirichlet prior performs well
22. Comparison of Three Methods
23. Applications of Basic IR Techniques
24. Some Basic IR Techniques
- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/Unigram representation of text
- Text similarity (e.g., cosine, KL-div)
- Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!
25. Generality of Basic Techniques
(Figure: raw text is processed with the basic techniques above, feeding applications such as clustering.)
26. Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization
27. Information Filtering
- Stable, long-term interest; dynamic info source
- System must make a delivery decision immediately as a document arrives
(Figure: a stream of documents passes through a filtering system that matches each document against "my interest" and decides whether to deliver it.)
28. A Vector-Space Filtering Model
(Figure: each arriving document becomes a doc vector, is scored against a profile vector, and the score is compared to a threshold: "yes" delivers the document, "no" discards it. Delivered documents feed a utility evaluation, e.g. F = 3R - 2N, where R is the number of delivered relevant documents ("yes" was correct) and N is the number of delivered non-relevant documents ("yes" was incorrect).)
A sketch of this loop follows.
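A minimal sketch of the filtering loop in the figure; the dot-product scorer, the delivery threshold, and the F = 3R - 2N utility follow the slide, while the function names, stream format, and fixed threshold are assumptions of the sketch.

```python
def dot(profile, doc_vector):
    """Score a document (term -> weight dict) against the profile vector."""
    return sum(weight * doc_vector.get(term, 0.0) for term, weight in profile.items())

def filter_stream(stream, profile, threshold):
    """stream yields (doc_vector, is_relevant); relevance feedback is only
    available for documents that were actually delivered."""
    R = N = 0  # delivered relevant / delivered non-relevant
    for doc_vector, is_relevant in stream:
        if dot(profile, doc_vector) >= threshold:  # "yes": deliver
            if is_relevant:
                R += 1
            else:
                N += 1
    return 3 * R - 2 * N  # utility F = 3R - 2N
```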
29. Issues in Information Filtering
- Threshold setting
- Crucial for binary decision making
- Must avoid under-delivery or over-delivery
- Initialization
- What threshold should a system start with?
- Learning from limited and biased feedback
- Only delivered documents get feedback info
- How to learn a threshold?
- Exploitation vs. exploration
- Other issues (redundancy, interest shift, etc.)
30. Examples of Information Filtering
- News filtering
- Email filtering
- Recommender systems
- Literature alert
- And many others
31. Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization
32. Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem
(Figure: a categorization system assigns incoming documents to categories such as Sports, Business, Education, Science.)
33. Retrieval-based Categorization
- Treat each category as representing an information need
- Treat the examples in each category as relevant documents
- Use feedback approaches to learn a good query
- Match all the learned queries against a new document
- A document gets the category (or categories) represented by the best matching query (or queries)
34. Prototype-based Classifier
- Key elements (retrieval techniques):
  - Prototype/document representation (e.g., term vector)
  - Document-prototype distance measure (e.g., dot product)
  - Prototype vector learning: Rocchio feedback
- Example (see the sketch below)
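A minimal sketch of a prototype-based classifier: centroids of term vectors serve as prototypes and the dot product as the matching score. Learning the prototype with full Rocchio feedback (i.e. also subtracting negative examples) is omitted here for brevity; names and data formats are assumptions of the sketch.

```python
from collections import defaultdict

def centroid(vectors):
    """Average a list of term-vector dicts into a prototype vector."""
    proto = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            proto[term] += weight / len(vectors)
    return dict(proto)

def train_prototypes(labeled_vectors):
    """labeled_vectors: list of (term_vector, category) training examples."""
    by_category = defaultdict(list)
    for vec, category in labeled_vectors:
        by_category[category].append(vec)
    return {cat: centroid(vecs) for cat, vecs in by_category.items()}

def classify(doc_vector, prototypes):
    """Assign the category whose prototype has the highest dot product."""
    def score(proto):
        return sum(w * doc_vector.get(t, 0.0) for t, w in proto.items())
    return max(prototypes, key=lambda cat: score(prototypes[cat]))
```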
35. K-Nearest Neighbor Classifier
- Keep all training examples
- Find the k examples that are most similar to the new document ("neighbor" documents)
- Assign the category that is most common among these neighbor documents (the neighbors vote for the category)
- Can be improved by taking a neighbor's distance into account (a closer neighbor has more influence)
- Technical elements (retrieval techniques):
  - Document representation
  - Document distance measure
36. Example of K-NN Classifier
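A minimal sketch of the k-NN classifier described above, using cosine similarity between term vectors (dicts mapping term -> weight) as the document similarity measure; the function names and the choice of k are assumptions of the sketch.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-vector dicts."""
    num = sum(u[t] * v.get(t, 0.0) for t in u)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def knn_classify(doc_vector, training_data, k=5):
    """training_data: list of (term_vector, category); the k nearest neighbors vote."""
    neighbors = sorted(training_data,
                       key=lambda example: cosine(doc_vector, example[0]),
                       reverse=True)[:k]
    votes = Counter(category for _, category in neighbors)
    return votes.most_common(1)[0][0]
```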
37. Examples of Text Categorization
- News article classification
- Meta-data annotation
- Automatic Email sorting
- Web page classification
38. Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization
39. The Clustering Problem
- Discover natural structure
- Group similar objects together
- Objects can be documents, terms, or passages
- Example
40. Similarity-based Clustering (as opposed to model-based)
- Define a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarity
41. Similarity-induced Structure
42. How to Compute Group Similarity?
Three popular methods, given two groups g1 and g2 (sketched in code below):
- Single-link algorithm: s(g1, g2) = similarity of the closest pair
- Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
- Average-link algorithm: s(g1, g2) = average similarity over all pairs
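The three rules as a minimal sketch, given a pairwise similarity function sim(x, y) and two groups of objects g1 and g2:

```python
def single_link(g1, g2, sim):
    """s(g1, g2) = similarity of the closest pair."""
    return max(sim(x, y) for x in g1 for y in g2)

def complete_link(g1, g2, sim):
    """s(g1, g2) = similarity of the farthest pair."""
    return min(sim(x, y) for x in g1 for y in g2)

def average_link(g1, g2, sim):
    """s(g1, g2) = average similarity over all pairs."""
    return sum(sim(x, y) for x in g1 for y in g2) / (len(g1) * len(g2))
```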
43. Three Methods Illustrated
44. Examples of Doc/Term Clustering
- Clustering of retrieval results
- Clustering of documents in the whole collection
- Term clustering to define a concept or theme
- Automatic construction of hyperlinks
- In general, very useful for text mining
45. Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization
46. The Summarization Problem
- Essentially semantic compression of text
- Selection-based vs. generation-based summary
- In general, we need a purpose for summarization, but it's hard to define one
47. Retrieval-based Summarization
- Observation: term vector → summary?
- Basic approach:
  - Rank sentences, and select the top N as a summary (see the sketch below)
- Methods for ranking sentences:
  - Based on term weights
  - Based on the position of sentences
  - Based on the similarity between the sentence vector and the document vector
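A minimal sketch of the "rank sentences, select top N" approach, scoring each sentence by the cosine similarity between its term vector and the whole-document vector; the other ranking signals (term weights, sentence position) would plug in the same way, and the names here are assumptions of the sketch.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-vector dicts."""
    num = sum(u[t] * v.get(t, 0) for t in u)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def summarize(sentences, top_n=3):
    """sentences: list of token lists; return the top_n sentences whose term
    vectors are most similar to the whole-document vector, in document order."""
    doc_vector = Counter(tok for sent in sentences for tok in sent)
    scored = [(cosine(Counter(sent), doc_vector), i) for i, sent in enumerate(sentences)]
    best = sorted(scored, reverse=True)[:top_n]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]
```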
48. Simple Discourse Analysis
(Figure: the document is divided into consecutive passages represented by term vectors, vector 1 through vector n, and the similarity between each pair of adjacent vectors is computed to find the segment structure.)
49. A Simple Summarization Method
(Figure: within each segment, the sentence whose vector is most similar to the whole-document vector is selected; the selected sentences, e.g. sentence 1, sentence 2, sentence 3, form the summary.)
50. Examples of Summarization
- News summary
- Summarize retrieval results
- Single doc summary
- Multi-doc summary
- Summarize a cluster of documents (automatic label
creation for clusters)
51. What You Should Know
- Language models are new retrieval models with many advantages
- The retrieval techniques can be used to do more than just search