Title: Lecture 16: Text Databases
1. Lecture 16: Text Databases (Information Retrieval, Part II)
Oct. 20, 2006, ChengXiang Zhai
2. The Notion of Relevance
3. What is a Statistical LM?
- A probability distribution over word sequences
- p("Today is Wednesday") ≈ 0.001
- p("Today Wednesday is") ≈ 0.0000000000001
- p("The eigenvalue is positive") ≈ 0.00001
- Context-dependent!
- Can also be regarded as a probabilistic mechanism
for generating text, thus also called a
generative model
4. Why is an LM Useful?
- Provides a principled way to quantify the uncertainties associated with natural language
- Allows us to answer questions like:
  - Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition)
  - Given that we observe "baseball" three times and "game" once in a news article, how likely is it about sports? (text categorization, information retrieval)
  - Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)
5. Basic Issues
- Define the probabilistic model
  - Events, random variables, joint/conditional probabilities
  - p(w1 w2 ... wn) = f(θ1, θ2, ..., θn)
- Estimate model parameters
  - Tune the model to best fit the data and our prior knowledge
  - θi = ?
- Apply the model to a particular task
  - Many applications
6. The Simplest Language Model (Unigram Model)
- Generate a piece of text by generating each word INDEPENDENTLY
- Thus, p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn)
- Parameters: {p(wi)}, with p(w1) + ... + p(wN) = 1 (N is the vocabulary size)
- Essentially a multinomial distribution over words
- A piece of text can be regarded as a sample drawn according to this word distribution (see the sketch below)
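To make the independence assumption concrete, here is a minimal Python sketch (not from the lecture; the word distribution is hypothetical and sums to 1 over a toy vocabulary) that computes p(w1 ... wn) as a product of unigram probabilities.

```python
import math

# Hypothetical unigram distribution; probabilities sum to 1 over this toy vocabulary.
p = {"the": 0.6, "text": 0.2, "mining": 0.1, "food": 0.07,
     "clustering": 0.02, "association": 0.01}

def unigram_log_prob(words, p):
    """log p(w1 ... wn) = sum_i log p(wi) under the unigram independence assumption."""
    return sum(math.log(p[w]) for w in words)

print(math.exp(unigram_log_prob(["text", "mining"], p)))  # 0.2 * 0.1 = 0.02
```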
7. Text Generation with Unigram LM
(Figure: a (unigram) language model θ gives p(w | θ); sampling from it generates a document. Two example models:
Topic 1 (Text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, ..., food 0.00001
Topic 2 (Health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ...)
A sampling sketch follows.
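As an illustration of this generative view, a minimal sketch that samples a "document" word by word from p(w | θ). The two distributions follow the figure; since only a few words are shown, the remaining probability mass is lumped into a placeholder OTHER token (an assumption of the sketch).

```python
import random

# Word probabilities from the figure (partial); remaining mass goes to "OTHER".
topic1_text_mining = {"text": 0.2, "mining": 0.1, "association": 0.01,
                      "clustering": 0.02, "food": 0.00001}
topic2_health = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

def sample_document(p, length, seed=0):
    """Generate `length` words, each drawn independently from p(w | theta)."""
    rng = random.Random(seed)
    vocab = list(p) + ["OTHER"]
    weights = list(p.values()) + [1.0 - sum(p.values())]
    return rng.choices(vocab, weights=weights, k=length)

print(sample_document(topic1_text_mining, 10))
print(sample_document(topic2_health, 10))
```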
8. Estimation of Unigram LM
(Figure: given a document, estimate its (unigram) language model θ, i.e., p(w | θ) = ? The example document is a text mining paper with 100 words in total and counts:
text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, ...)
A sketch of the natural (maximum-likelihood) estimate follows.
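A minimal sketch of the maximum-likelihood estimate p_ML(w | d) = c(w, d) / |d|, using the counts above (the 100-word total is taken from the slide):

```python
# Word counts from the slide; the document has 100 words in total.
counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
          "algorithm": 2, "query": 1, "efficient": 1}
doc_length = 100

# Maximum-likelihood estimate: p_ML(w | d) = c(w, d) / |d|.
p_ml = {w: c / doc_length for w, c in counts.items()}

print(p_ml["text"])   # 0.1
print(p_ml["query"])  # 0.01 -- and any word not in `counts` implicitly gets 0
```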
9. Empirical Distribution of Words
- There are stable, language-independent patterns in how people use natural languages
- A few words occur very frequently; most occur rarely. E.g., in news articles:
  - Top 4 words: roughly 10-15% of word occurrences
  - Top 50 words: roughly 35-40% of word occurrences
- The most frequent word in one corpus may be rare in another
10. Zipf's Law
- rank × frequency ≈ constant (a quick illustration of the computation follows)
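A small, purely illustrative sketch of the rank × frequency product; the toy token list is hypothetical, and only on a realistically sized corpus does the product stay roughly constant across ranks.

```python
from collections import Counter

def zipf_table(tokens):
    """Return (rank, word, frequency, rank * frequency) sorted by frequency."""
    ranked = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

tokens = "the of the a the of to the a of the to and in".split()
for rank, word, freq, product in zipf_table(tokens):
    print(rank, word, freq, product)
```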
11. Language Models for Retrieval (Ponte & Croft 98)
(Figure: a language model is estimated from each document, e.g. a text mining paper and a food nutrition paper, and documents are ranked by how likely each model is to generate the query "data mining algorithms".)
12. Ranking Docs by Query Likelihood
(Figure: documents d1, d2, ..., dN are ranked by the likelihood p(q | di) of the query q under each document's language model.)
13. Retrieval as Language Model Estimation
- Document ranking based on query likelihood (see the sketch below)
- Retrieval problem ≈ estimation of p(wi | d)
- Smoothing is an important issue, and distinguishes different approaches
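A minimal sketch of query-likelihood ranking, under the assumption that each document model is supplied as a callable w -> p(w | d) that is already smoothed (with an unsmoothed model, a single unseen query word would send the score to minus infinity):

```python
import math
from collections import Counter

def query_log_likelihood(query_tokens, p_w_given_d):
    """score(q, d) = sum over query words w of c(w, q) * log p(w | d)."""
    return sum(count * math.log(p_w_given_d(w))
               for w, count in Counter(query_tokens).items())

def rank_documents(query_tokens, doc_models):
    """doc_models: {doc_id: callable w -> smoothed p(w | d)}; best score first."""
    scores = {doc_id: query_log_likelihood(query_tokens, model)
              for doc_id, model in doc_models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```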
14. Problem with the ML Estimator
- What if a word doesn't appear in the text?
- In general, what probability should we give a word that has not been observed?
- If we want to assign non-zero probabilities to such words, we'll have to discount the probabilities of observed words
- This is what smoothing is about
15. Language Model Smoothing (Illustration)
16. A General Smoothing Scheme
- All smoothing methods try to:
  - discount the probability of words seen in a doc
  - re-allocate the extra probability so that unseen words will have a non-zero probability
- Most use a reference model (collection language model) to discriminate unseen words
17. Smoothing and TF-IDF Weighting
- Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain the decomposition sketched below
- Smoothing with p(w|C) ≈ TF-IDF weighting + length normalization
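A sketch of the decomposition referred to above, under the usual assumption that the smoothed model assigns p_s(w|d) to words seen in d and α_d · p(w|C) to unseen words (n is the query length):

```latex
\log p(q \mid d)
  \;=\; \sum_{w:\, c(w,q)>0,\; c(w,d)>0} c(w,q)\,
        \log \frac{p_s(w \mid d)}{\alpha_d\, p(w \mid C)}
  \;+\; n \log \alpha_d
  \;+\; \sum_{w} c(w,q) \log p(w \mid C)
```

The first sum is a TF-like reward for matched terms divided by p(w|C) (an IDF-like effect), the n log α_d term acts as length normalization, and the last sum is document-independent and can be dropped for ranking.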
18. How to Smooth?
- All smoothing methods try to:
  - discount the probability of words seen in a document
  - re-allocate the extra counts so that unseen words will have a non-zero count
- Method 1 (Additive smoothing): add a constant δ to the count of each word:
  p(w|d) = (c(w,d) + δ) / (|d| + δ|V|)
  where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size; δ = 1 gives "add one" (Laplace) smoothing (sketched in code below)
- Problems?
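A minimal sketch of additive (add-δ) smoothing, assuming the vocabulary is supplied explicitly:

```python
from collections import Counter

def additive_smoothing(doc_tokens, vocabulary, delta=1.0):
    """p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|); delta = 1 is Laplace."""
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + delta * len(vocabulary)
    return {w: (counts[w] + delta) / denom for w in vocabulary}

# Hypothetical usage: every vocabulary word gets a non-zero probability.
vocab = {"text", "mining", "clustering", "food"}
print(additive_smoothing(["text", "mining", "text"], vocab))
```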
19. Other Smoothing Methods
- Method 2 (Absolute discounting): subtract a constant δ from the count of each seen word:
  p(w|d) = (max(c(w,d) - δ, 0) + δ |d|_u · p(w|REF)) / |d|
  where |d|_u is the number of unique words in d and δ is a parameter
- Method 3 (Linear interpolation, Jelinek-Mercer): shrink uniformly toward p(w|REF):
  p(w|d) = (1 - λ) · p_ML(w|d) + λ · p(w|REF)
  where p_ML is the maximum-likelihood estimate and λ is a parameter
(Both methods are sketched in code below.)
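Minimal sketches of the two methods on this slide, assuming p_ref is a callable giving the reference (collection) model p(w|REF); function names and default parameter values are assumptions of the sketch.

```python
from collections import Counter

def jelinek_mercer(doc_tokens, p_ref, lam=0.5):
    """Linear interpolation: p(w|d) = (1 - lam) * p_ML(w|d) + lam * p(w|REF)."""
    counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    return lambda w: (1 - lam) * counts[w] / doc_len + lam * p_ref(w)

def absolute_discounting(doc_tokens, p_ref, delta=0.7):
    """p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|,
    where |d|_u is the number of unique words in d."""
    counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    uniq = len(counts)
    return lambda w: (max(counts[w] - delta, 0) + delta * uniq * p_ref(w)) / doc_len
```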
20. Other Smoothing Methods (cont.)
- Method 4 (Dirichlet prior / Bayesian): assume pseudo counts μ·p(w|REF):
  p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ)
  where μ is a parameter (see the sketch below)
- Method 5 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of seen events in the same way
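A minimal sketch of Dirichlet prior smoothing; μ = 2000 is just a commonly used ballpark value, not one from the lecture, and the uniform reference model in the usage line is hypothetical.

```python
from collections import Counter

def dirichlet_smoothing(doc_tokens, p_ref, mu=2000.0):
    """Dirichlet prior: p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)."""
    counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    return lambda w: (counts[w] + mu * p_ref(w)) / (doc_len + mu)

# Hypothetical usage with a uniform reference model over a 10,000-word vocabulary.
p_d = dirichlet_smoothing(["text", "mining", "text"], p_ref=lambda w: 1 / 10000)
print(p_d("text"), p_d("never_seen_word"))  # seen word > unseen word > 0
```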
21. So, which method is the best?
- It depends on the data and the task!
- Many other sophisticated smoothing methods have been proposed
- Cross-validation is generally used to choose the best method and/or set the smoothing parameters
- For retrieval, the Dirichlet prior performs well
22. Comparison of Three Methods
23. Applications of Basic IR Techniques
24. Some Basic IR Techniques
- Stemming
- Stop words
- Weighting of terms (e.g., TF-IDF)
- Vector/Unigram representation of text
- Text similarity (e.g., cosine, KL-div)
- Relevance/pseudo feedback (e.g., Rocchio)
They are not just for retrieval!
25. Generality of Basic Techniques
(Figure: raw text is processed with the basic techniques above, feeding applications such as clustering.)
26. Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization
27. Information Filtering
- Stable, long-term interest; dynamic info source
- System must make a delivery decision immediately as a document arrives
(Figure: a stream of documents passes through a filtering system that matches each document against "my interest" and decides whether to deliver it.)
28. A Vector-Space Filtering Model
(Figure: each arriving document becomes a doc vector, is scored against a profile vector, and the score is compared to a threshold: "yes" delivers the document, "no" discards it. Delivered documents feed a utility evaluation, e.g. F = 3R - 2N, where R is the number of delivered relevant documents ("yes" was correct) and N is the number of delivered non-relevant documents ("yes" was incorrect).)
A sketch of this loop follows.
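A minimal sketch of the filtering loop in the figure; the dot-product scorer, the delivery threshold, and the F = 3R - 2N utility follow the slide, while the function names, stream format, and fixed threshold are assumptions of the sketch.

```python
def dot(profile, doc_vector):
    """Score a document (term -> weight dict) against the profile vector."""
    return sum(weight * doc_vector.get(term, 0.0) for term, weight in profile.items())

def filter_stream(stream, profile, threshold):
    """stream yields (doc_vector, is_relevant); relevance feedback is only
    available for documents that were actually delivered."""
    R = N = 0  # delivered relevant / delivered non-relevant
    for doc_vector, is_relevant in stream:
        if dot(profile, doc_vector) >= threshold:  # "yes": deliver
            if is_relevant:
                R += 1
            else:
                N += 1
    return 3 * R - 2 * N  # utility F = 3R - 2N
```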
29. Issues in Information Filtering
- Threshold setting
- Crucial for binary decision making
- Must avoid under-delivery or over-delivery
- Initialization
- What threshold should a system start with?
- Learning from limited and biased feedback
- Only delivered documents get feedback info
- How to learn a threshold?
- Exploitation vs. exploration
- Other issues (redundancy, interest shift, etc.)
30. Examples of Information Filtering
- News filtering
- Email filtering
- Recommender systems
- Literature alert
- And many others
31. Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization
32. Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard supervised learning problem
(Figure: a categorization system assigns incoming documents to categories such as Sports, Business, Education, Science.)
33. Retrieval-based Categorization
- Treat each category as representing an information need
- Treat the examples in each category as relevant documents
- Use feedback approaches to learn a good query
- Match all the learned queries against a new document
- A document gets the category (or categories) represented by the best matching query (or queries)
34. Prototype-based Classifier
- Key elements (retrieval techniques):
  - Prototype/document representation (e.g., term vector)
  - Document-prototype distance measure (e.g., dot product)
  - Prototype vector learning: Rocchio feedback
- Example (see the sketch below)
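A minimal sketch of a prototype-based classifier: centroids of term vectors serve as prototypes and the dot product as the matching score. Learning the prototype with full Rocchio feedback (i.e. also subtracting negative examples) is omitted here for brevity; names and data formats are assumptions of the sketch.

```python
from collections import defaultdict

def centroid(vectors):
    """Average a list of term-vector dicts into a prototype vector."""
    proto = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            proto[term] += weight / len(vectors)
    return dict(proto)

def train_prototypes(labeled_vectors):
    """labeled_vectors: list of (term_vector, category) training examples."""
    by_category = defaultdict(list)
    for vec, category in labeled_vectors:
        by_category[category].append(vec)
    return {cat: centroid(vecs) for cat, vecs in by_category.items()}

def classify(doc_vector, prototypes):
    """Assign the category whose prototype has the highest dot product."""
    def score(proto):
        return sum(w * doc_vector.get(t, 0.0) for t, w in proto.items())
    return max(prototypes, key=lambda cat: score(prototypes[cat]))
```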
35. K-Nearest Neighbor Classifier
- Keep all training examples
- Find the k examples that are most similar to the new document ("neighbor" documents)
- Assign the category that is most common among these neighbor documents (the neighbors vote for the category)
- Can be improved by taking a neighbor's distance into account (a closer neighbor has more influence)
- Technical elements (retrieval techniques):
  - Document representation
  - Document distance measure
36. Example of K-NN Classifier
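A minimal sketch of the k-NN classifier described above, using cosine similarity between term vectors (dicts mapping term -> weight) as the document similarity measure; the function names and the choice of k are assumptions of the sketch.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-vector dicts."""
    num = sum(u[t] * v.get(t, 0.0) for t in u)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def knn_classify(doc_vector, training_data, k=5):
    """training_data: list of (term_vector, category); the k nearest neighbors vote."""
    neighbors = sorted(training_data,
                       key=lambda example: cosine(doc_vector, example[0]),
                       reverse=True)[:k]
    votes = Counter(category for _, category in neighbors)
    return votes.most_common(1)[0][0]
```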
37. Examples of Text Categorization
- News article classification
- Meta-data annotation
- Automatic Email sorting
- Web page classification
38. Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization
39. The Clustering Problem
- Discover natural structure
- Group similar objects together
- Objects can be documents, terms, or passages
- Example
40. Similarity-based Clustering (as opposed to model-based)
- Define a similarity function to measure the similarity between two objects
- Gradually group similar objects together in a bottom-up fashion
- Stop when some stopping criterion is met
- Variations: different ways to compute group similarity based on individual object similarity
41. Similarity-induced Structure
42. How to Compute Group Similarity?
Three popular methods, given two groups g1 and g2 (sketched in code below):
- Single-link algorithm: s(g1, g2) = similarity of the closest pair
- Complete-link algorithm: s(g1, g2) = similarity of the farthest pair
- Average-link algorithm: s(g1, g2) = average similarity over all pairs
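The three rules as a minimal sketch, given a pairwise similarity function sim(x, y) and two groups of objects g1 and g2:

```python
def single_link(g1, g2, sim):
    """s(g1, g2) = similarity of the closest pair."""
    return max(sim(x, y) for x in g1 for y in g2)

def complete_link(g1, g2, sim):
    """s(g1, g2) = similarity of the farthest pair."""
    return min(sim(x, y) for x in g1 for y in g2)

def average_link(g1, g2, sim):
    """s(g1, g2) = average similarity over all pairs."""
    return sum(sim(x, y) for x in g1 for y in g2) / (len(g1) * len(g2))
```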
43. Three Methods Illustrated
44. Examples of Doc/Term Clustering
- Clustering of retrieval results
- Clustering of documents in the whole collection
- Term clustering to define a concept or theme
- Automatic construction of hyperlinks
- In general, very useful for text mining
45. Sample Applications
- Information Filtering
- Text Categorization
- Document/Term Clustering
- Text Summarization
46. The Summarization Problem
- Essentially semantic compression of text
- Selection-based vs. generation-based summary
- In general, we need a purpose for summarization, but it's hard to define one
47. Retrieval-based Summarization
- Observation: term vector → summary?
- Basic approach:
  - Rank sentences, and select the top N as a summary (see the sketch below)
- Methods for ranking sentences:
  - Based on term weights
  - Based on the position of sentences
  - Based on the similarity between the sentence vector and the document vector
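A minimal sketch of the "rank sentences, select top N" approach, scoring each sentence by the cosine similarity between its term vector and the whole-document vector; the other ranking signals (term weights, sentence position) would plug in the same way, and the names here are assumptions of the sketch.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-vector dicts."""
    num = sum(u[t] * v.get(t, 0) for t in u)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

def summarize(sentences, top_n=3):
    """sentences: list of token lists; return the top_n sentences whose term
    vectors are most similar to the whole-document vector, in document order."""
    doc_vector = Counter(tok for sent in sentences for tok in sent)
    scored = [(cosine(Counter(sent), doc_vector), i) for i, sent in enumerate(sentences)]
    best = sorted(scored, reverse=True)[:top_n]
    return [sentences[i] for _, i in sorted(best, key=lambda pair: pair[1])]
```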
48. Simple Discourse Analysis
(Figure: the document is divided into consecutive passages represented by term vectors, vector 1 through vector n, and the similarity between each pair of adjacent vectors is computed to find the segment structure.)
49. A Simple Summarization Method
(Figure: within each segment, the sentence whose vector is most similar to the whole-document vector is selected; the selected sentences, e.g. sentence 1, sentence 2, sentence 3, form the summary.)
50. Examples of Summarization
- News summary
- Summarize retrieval results
- Single doc summary
- Multi-doc summary
- Summarize a cluster of documents (automatic label
creation for clusters)
51. What You Should Know
- Language models are new retrieval models with many advantages
- The retrieval techniques can be used to do more than just search