Lecture 16: Text Databases

1
Lecture 16: Text Databases / Information Retrieval, Part II
Oct. 20, 2006, ChengXiang Zhai
2
The Notion of Relevance
3
What is a Statistical LM?
  • A probability distribution over word sequences
  • p("Today is Wednesday") ≈ 0.001
  • p("Today Wednesday is") ≈ 0.0000000000001
  • p("The eigenvalue is positive") ≈ 0.00001
  • Context-dependent!
  • Can also be regarded as a probabilistic mechanism
    for generating text, thus also called a
    generative model

4
Why is a LM Useful?
  • Provides a principled way to quantify the
    uncertainties associated with natural language
  • Allows us to answer questions like
  • Given that we see "John" and "feels", how likely
    are we to see "happy" as opposed to "habit" as the
    next word? (speech recognition)
  • Given that we observe "baseball" three times and
    "game" once in a news article, how likely is the
    article to be about sports? (text categorization,
    information retrieval)
  • Given that a user is interested in sports news,
    how likely is the user to use "baseball" in a
    query? (information retrieval)

5
Basic Issues
  • Define the probabilistic model
  • Event, Random Variables, Joint/Conditional Probs
  • p(w1 w2 ... wn) = f(θ1, θ2, ..., θn)
  • Estimate model parameters
  • Tune the model to best fit the data and our prior
    knowledge
  • θi = ?
  • Apply the model to a particular task
  • Many applications

6
The Simplest Language Model(Unigram Model)
  • Generate a piece of text by generating each word
    INDEPENDENTLY
  • Thus, p(w1 w2 ... wn) = p(w1)p(w2)...p(wn)
  • Parameters: {p(wi)}, with p(w1) + ... + p(wN) = 1 (N is the
    vocabulary size)
  • Essentially a multinomial distribution over words
  • A piece of text can be regarded as a sample drawn
    according to this word distribution
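As a concrete illustration, here is a minimal sketch of how a unigram model assigns a probability to a word sequence; the function name and the toy distribution are assumptions for illustration only.

```python
# Minimal sketch: scoring a word sequence with a unigram LM.
# Under the independence assumption, the sequence probability is the
# product of individual word probabilities (sum of logs for stability).
import math

def unigram_log_prob(words, p):
    """log p(w1 w2 ... wn) = sum_i log p(wi) for a unigram model p."""
    return sum(math.log(p[w]) for w in words)

# Toy distribution (made-up numbers; must sum to 1 over the vocabulary).
p = {"today": 0.05, "is": 0.20, "wednesday": 0.01, "the": 0.30,
     "eigenvalue": 0.001, "positive": 0.005, "OTHER": 0.434}

print(unigram_log_prob(["today", "is", "wednesday"], p))
```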

7
Text Generation with Unigram LM
[Figure: a document is generated by sampling words from a unigram language model θ, i.e., from p(w|θ). Example word distributions:
Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, food 0.00001, ...
Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, ...]
8
Estimation of Unigram LM
[Figure: the reverse direction, estimating the unigram language model θ, i.e., p(w|θ) = ?, from a document. The document is a text mining paper with 100 words in total and counts: text 10, mining 5, association 3, database 3, algorithm 2, query 1, efficient 1, ...]
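A minimal sketch of the maximum-likelihood estimate suggested by this slide, i.e. p(w|θ) = c(w,d)/|d| (relative frequency in the document); the helper name is hypothetical.

```python
# Maximum-likelihood estimate of a unigram LM: p(w|theta) = c(w,d) / |d|.
from collections import Counter

def ml_unigram(doc_tokens):
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {w: c / total for w, c in counts.items()}

# With the slide's counts (text 10, mining 5, ... out of 100 words) this
# would give p(text|theta) = 0.1, p(mining|theta) = 0.05, and so on.
print(ml_unigram("text mining text algorithms text".split()))
```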
9
Empirical distribution of words
  • There are stable language-independent patterns in
    how people use natural languages
  • A few words occur very frequently most occur
    rarely. E.g., in news articles,
  • Top 4 words 1015 word occurrences
  • Top 50 words 3540 word occurrences
  • The most frequent word in one corpus may be rare
    in another

10
Zipf's Law
  • rank × frequency ≈ constant
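A minimal sketch (hypothetical helper) of how one might check this empirically: rank the words of a corpus by frequency and inspect the rank-frequency products.

```python
# Sketch: compute rank * frequency for the top-ranked words of a corpus.
from collections import Counter

def rank_times_freq(tokens, top=10):
    freqs = [c for _, c in Counter(tokens).most_common(top)]
    return [(rank, f, rank * f) for rank, f in enumerate(freqs, start=1)]

# On a large corpus the third column (rank * frequency) stays roughly flat.
```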

11
Language Models for Retrieval (Ponte & Croft 98)
[Figure: query likelihood retrieval. Query: "data mining algorithms". Each candidate document (e.g., a text mining paper, a food nutrition paper) is modeled as a unigram LM, and documents are scored by how likely they are to generate the query.]
12
Ranking Docs by Query Likelihood
[Figure: documents d1, d2, ..., dN are ranked by the likelihood p(q|di) of generating the query q.]
13
Retrieval as Language Model Estimation
  • Document ranking based on query likelihood
  • Retrieval problem ≈ estimation of p(wi|d)
  • Smoothing is an important issue, and
    distinguishes different approaches
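A minimal sketch of query-likelihood ranking under these assumptions (function names are hypothetical; p(w|d) is assumed to be already smoothed so that unseen query words get non-zero probability):

```python
# Score a document by log p(q|d) = sum over query words of c(w,q) * log p(w|d),
# then rank documents by this score.
import math
from collections import Counter

def query_log_likelihood(query_tokens, p_w_given_d):
    # p_w_given_d: callable mapping a word to its (smoothed) probability under d
    return sum(c * math.log(p_w_given_d(w))
               for w, c in Counter(query_tokens).items())

def rank_documents(query_tokens, doc_models):
    # doc_models: {doc_id: callable w -> p(w|d)}
    scores = {d: query_log_likelihood(query_tokens, p) for d, p in doc_models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```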

14
Problem with the ML Estimator
  • What if a word doesn't appear in the text?
  • In general, what probability should we give a
    word that has not been observed?
  • If we want to assign non-zero probabilities to
    such words, we'll have to discount the
    probabilities of observed words
  • This is what smoothing is about

15
Language Model Smoothing (Illustration)
16
A General Smoothing Scheme
  • All smoothing methods try to
  • discount the probability of words seen in a doc
  • re-allocate the extra probability so that unseen
    words will have a non-zero probability
  • Most use a reference model (collection language
    model) to discriminate unseen words
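In symbols, the general scheme can be written as follows (a sketch using standard notation: p_seen for the discounted probability of words seen in d, α_d for the document-dependent mass given to unseen words, and p(w|C) for the collection/reference model):

```latex
p(w \mid d) =
\begin{cases}
  p_{\text{seen}}(w \mid d) & \text{if } c(w,d) > 0,\\[4pt]
  \alpha_d \, p(w \mid C)   & \text{otherwise,}
\end{cases}
\qquad
\alpha_d = \frac{1 - \sum_{w:\, c(w,d) > 0} p_{\text{seen}}(w \mid d)}
                {\sum_{w:\, c(w,d) = 0} p(w \mid C)} .
```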

17
Smoothing & TF-IDF Weighting
  • Plugging the general smoothing scheme into the query
    likelihood retrieval formula, we obtain the rewriting below
  • Smoothing with p(w|C) ≈ TF-IDF + length normalization
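A sketch of the rewriting, assuming the general smoothing scheme above (c(w,q) is the count of w in the query q, and |q| is the query length):

```latex
\log p(q \mid d)
 = \sum_{w \in q,\; c(w,d) > 0} c(w,q)\,
     \log \frac{p_{\text{seen}}(w \mid d)}{\alpha_d\, p(w \mid C)}
 \;+\; |q| \log \alpha_d
 \;+\; \sum_{w \in q} c(w,q) \log p(w \mid C)
```

The first term grows with the within-document frequency (TF-like) and shrinks for words common in the collection (IDF-like), the α_d term acts like document length normalization, and the last term does not depend on d and can be dropped for ranking.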

18
How to Smooth?
  • All smoothing methods try to
  • discount the probability of words seen in a
    document
  • re-allocate the extra counts so that unseen words
    will have a non-zero count
  • Method 1 (Additive smoothing): add a constant δ to the
    count of each word; with δ = 1 this is add-one (Laplace)
    smoothing (see the sketch below)
  • Problems?

Add-one (Laplace) smoothing: p(w|d) = (c(w,d) + 1) / (|d| + |V|),
where c(w,d) is the count of w in d, |d| is the length of d (total
counts), and |V| is the vocabulary size.
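A minimal sketch of the add-one estimate above (the helper name is hypothetical):

```python
# Add-one (Laplace) smoothing: p(w|d) = (c(w,d) + 1) / (|d| + |V|).
from collections import Counter

def laplace_lm(doc_tokens, vocabulary):
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + len(vocabulary)
    return {w: (counts[w] + 1) / denom for w in vocabulary}
```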
19
Other Smoothing Methods
  • Method 2 (Absolute discounting): subtract a
    constant δ from the count of each seen word
  • Method 3 (Linear interpolation, Jelinek-Mercer):
    shrink uniformly toward p(w|REF)

Absolute discounting: p(w|d) = (max(c(w,d) − δ, 0) + δ|d|_u · p(w|REF)) / |d|,
where |d|_u is the number of unique words in d.
Jelinek-Mercer: p(w|d) = (1 − λ) p_ML(w|d) + λ p(w|REF), where λ is the
interpolation parameter and p_ML(w|d) = c(w,d)/|d| is the ML estimate.
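Minimal sketches of both estimates (function names and default parameter values are illustrative assumptions):

```python
# c = c(w,d): count of w in d; doc_len = |d|; num_uniq = |d|_u (unique words);
# p_ref_w = p(w|REF), the reference (collection) model probability of w.

def absolute_discount(c, doc_len, num_uniq, p_ref_w, delta=0.7):
    # p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|
    return (max(c - delta, 0.0) + delta * num_uniq * p_ref_w) / doc_len

def jelinek_mercer(c, doc_len, p_ref_w, lam=0.1):
    # p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)
    return (1 - lam) * (c / doc_len) + lam * p_ref_w
```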
20
Other Smoothing Methods (cont.)
  • Method 4 (Dirichlet Prior/Bayesian): assume
    pseudo counts μ·p(w|REF)
  • Method 5 (Good-Turing): assume the total count of unseen
    events to be n1 (the number of singletons), and adjust the
    seen events in the same way

Dirichlet prior: p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ),
where μ is the smoothing parameter.
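A minimal sketch of the Dirichlet-prior estimate (the default μ is an illustrative value; in practice it is tuned, e.g., by cross validation as noted on the next slide):

```python
# Dirichlet prior smoothing: p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu).

def dirichlet_prior(c, doc_len, p_ref_w, mu=2000.0):
    return (c + mu * p_ref_w) / (doc_len + mu)
```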
21
So, which method is the best?
  • It depends on the data and the task!
  • Many other sophisticated smoothing methods have
    been proposed
  • Cross validation is generally used to choose the
    best method and/or set the smoothing parameters
  • For retrieval, Dirichlet prior performs well

22
Comparison of Three Methods
23
Applications of Basic IR Techniques
24
Some Basic IR Techniques
  • Stemming
  • Stop words
  • Weighting of terms (e.g., TF-IDF)
  • Vector/Unigram representation of text
  • Text similarity (e.g., cosine, KL-div)
  • Relevance/pseudo feedback (e.g., Rocchio)

They are not just for retrieval!
25
Generality of Basic Techniques
[Figure: the same basic techniques turn raw text into term vectors that feed tasks such as clustering.]
26
Sample Applications
  • Information Filtering
  • Text Categorization
  • Document/Term Clustering
  • Text Summarization

27
Information Filtering
  • Stable, long-term interest; dynamic information source
  • System must make a delivery decision immediately
    as a document arrives

[Figure: a stream of incoming documents passes through a filtering system that matches each document against "my interest" and delivers the matching ones.]
28
A Vector-Space Filtering Model
[Figure: a vector-space filtering model. Each arriving document is turned into a doc vector, scored against the profile vector, and compared with a threshold: score above threshold means deliver ("yes"), otherwise discard ("no"). Utility evaluation: F = 3R − 2N, where R = number of delivered documents that are relevant ("yes", correct) and N = number of delivered documents that are not ("yes", incorrect).]
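A minimal sketch of the delivery decision and the utility measure above (vectors are assumed to be dicts mapping terms to weights; names are illustrative):

```python
# Deliver an incoming document if its score against the profile vector
# clears the threshold; evaluate the delivered set with F = 3R - 2N.

def score(profile_vec, doc_vec):
    return sum(profile_vec.get(t, 0.0) * w for t, w in doc_vec.items())

def deliver(profile_vec, doc_vec, threshold):
    return score(profile_vec, doc_vec) >= threshold

def utility(num_relevant_delivered, num_nonrelevant_delivered):
    return 3 * num_relevant_delivered - 2 * num_nonrelevant_delivered
```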
29
Issues in Information Filtering
  • Threshold setting
  • Crucial for binary decision making
  • Must avoid under-delivery or over-delivery
  • Initialization
  • What threshold should a system start with?
  • Learning from limited and biased feedback
  • Only delivered documents get feedback info
  • How to learn a threshold?
  • Exploitation vs. exploration
  • Other issues (redundancy, interest shift, etc.)

30
Examples of Information Filtering
  • News filtering
  • Email filtering
  • Recommender systems
  • Literature alert
  • And many others

31
Sample Applications
  • Information Filtering
  • Text Categorization
  • Document/Term Clustering
  • Text Summarization

32
Text Categorization
  • Pre-given categories and labeled document
    examples (categories may form a hierarchy)
  • Classify new documents
  • A standard supervised learning problem

[Figure: a categorization system assigns new documents to pre-given categories such as Sports, Business, Education, Science.]
33
Retrieval-based Categorization
  • Treat each category as representing an
    information need
  • Treat examples in each category as relevant
    documents
  • Use feedback approaches to learn a good query
  • Match all the learned queries to a new document
  • A document is assigned the category (or categories)
    represented by the best-matching query (or queries)

34
Prototype-based Classifier
  • Key elements (retrieval techniques)
  • Prototype/document representation (e.g., term
    vector)
  • Document-prototype distance measure (e.g., dot
    product)
  • Prototype vector learning: Rocchio feedback
  • Example (see the sketch below)
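A minimal sketch of a prototype (centroid-style) classifier built from the elements listed above; it uses only positive examples per category, whereas full Rocchio feedback would also subtract negative examples. All names are illustrative.

```python
# Build one prototype (centroid) term vector per category from labeled
# examples, then classify a new document by dot product with each prototype.
from collections import defaultdict

def build_prototypes(labeled_docs):
    # labeled_docs: iterable of (category, term_vector), term_vector = {term: weight}
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for cat, vec in labeled_docs:
        counts[cat] += 1
        for t, w in vec.items():
            sums[cat][t] += w
    return {cat: {t: w / counts[cat] for t, w in vec.items()}
            for cat, vec in sums.items()}

def classify(doc_vec, prototypes):
    dot = lambda proto: sum(proto.get(t, 0.0) * w for t, w in doc_vec.items())
    return max(prototypes, key=lambda cat: dot(prototypes[cat]))
```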

35
K-Nearest Neighbor Classifier
  • Keep all training examples
  • Find k examples that are most similar to the new
    document (neighbor documents)
  • Assign the category that is most common in these
    neighbor documents (neighbors vote for the
    category)
  • Can be improved by considering the distance of a
    neighbor (a closer neighbor has more influence)
  • Technical elements (retrieval techniques)
  • Document representation
  • Document distance measure
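A minimal sketch of the k-NN classifier described above, using cosine similarity between term vectors (dicts of term weights); names and the choice of k are illustrative.

```python
# k-NN text categorization: find the k most similar training documents
# and let them vote for the category.
import math
from collections import Counter

def cosine(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_classify(doc_vec, training, k=5):
    # training: list of (category, term_vector) pairs
    neighbors = sorted(training, key=lambda ex: cosine(doc_vec, ex[1]), reverse=True)[:k]
    return Counter(cat for cat, _ in neighbors).most_common(1)[0][0]
```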

36
Example of K-NN Classifier
37
Examples of Text Categorization
  • News article classification
  • Meta-data annotation
  • Automatic Email sorting
  • Web page classification

38
Sample Applications
  • Information Filtering
  • Text Categorization
  • Document/Term Clustering
  • Text Summarization

39
The Clustering Problem
  • Discover natural structure
  • Group similar objects together
  • Objects can be documents, terms, or passages
  • Example

40
Similarity-based Clustering (as opposed to model-based)
  • Define a similarity function to measure
    similarity between two objects
  • Gradually group similar objects together in a
    bottom-up fashion
  • Stop when some stopping criterion is met
  • Variations: different ways to compute group
    similarity based on individual object similarity

41
Similarity-induced Structure
42
How to Compute Group Similarity?
Three popular methods, given two groups g1 and g2:
  • Single-link algorithm: s(g1,g2) = similarity of the closest pair
  • Complete-link algorithm: s(g1,g2) = similarity of the farthest pair
  • Average-link algorithm: s(g1,g2) = average similarity over all pairs
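Minimal sketches of the three definitions, assuming a pairwise similarity function sim(x, y) and groups given as lists of objects:

```python
def single_link(g1, g2, sim):
    return max(sim(x, y) for x in g1 for y in g2)   # closest pair

def complete_link(g1, g2, sim):
    return min(sim(x, y) for x in g1 for y in g2)   # farthest pair

def average_link(g1, g2, sim):
    return sum(sim(x, y) for x in g1 for y in g2) / (len(g1) * len(g2))
```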
43
Three Methods Illustrated
44
Examples of Doc/Term Clustering
  • Clustering of retrieval results
  • Clustering of documents in the whole collection
  • Term clustering to define concept or theme
  • Automatic construction of hyperlinks
  • In general, very useful for text mining

45
Sample Applications
  • Information Filtering
  • Text Categorization
  • Document/Term Clustering
  • Text Summarization

46
The Summarization Problem
  • Essentially semantic compression of text
  • Selection-based vs. generation-based summary
  • In general, we need a purpose for summarization,
    but it is hard to define

47
Retrieval-based Summarization
  • Observation: term vector → summary?
  • Basic approach
  • Rank sentences, and select top N as a summary
  • Methods for ranking sentences
  • Based on term weights
  • Based on position of sentences
  • Based on the similarity of sentence and document
    vector
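A minimal sketch of the third ranking method (similarity of each sentence vector to the whole-document vector), selecting the top N sentences as the summary; names and the use of raw counts as weights are illustrative assumptions.

```python
# Selection-based summarization: rank sentences by cosine similarity of
# their term vector to the document vector, keep the top n in original order.
import math
from collections import Counter

def cosine(v1, v2):
    dot = sum(c * v2.get(t, 0) for t, c in v1.items())
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def summarize(sentences, n=3):
    # sentences: list of token lists; the doc vector is the sum of sentence vectors
    sent_vecs = [Counter(s) for s in sentences]
    doc_vec = sum(sent_vecs, Counter())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(sent_vecs[i], doc_vec), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]
```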

48
Simple Discourse Analysis

[Figure: the document is split into units represented as vectors (vector 1, vector 2, ..., vector n), and the similarity between adjacent vectors is computed.]
49
A Simple Summarization Method

[Figure: within each segment (from the discourse analysis above), the sentence most similar to the doc vector is selected; the selected sentences (sentence 1, sentence 2, sentence 3) form the summary.]
50
Examples of Summarization
  • News summary
  • Summarize retrieval results
  • Single doc summary
  • Multi-doc summary
  • Summarize a cluster of documents (automatic label
    creation for clusters)

51
What You Should Know
  • Language models are new retrieval models with
    many advantages
  • The retrieval techniques can be used to do more
    than just search