Ch 4: Information Retrieval and Text Mining
4.1 Is Information Retrieval a Form of Text Mining?
- What is the principal computer specialty for processing documents and text? Information Retrieval (IR).
- The task of IR is to retrieve relevant documents in response to a query.
- The fundamental technique of IR is measuring similarity.
- A query is examined and transformed into a vector of values to be compared with stored documents.
4.1 (cont.)
- In the prediction problem, similar documents are retrieved and then their properties are measured, i.e. the class labels are counted to see which label should be assigned to a new document.
- The objectives of prediction can be posed in the form of an IR model where documents relevant to a query are retrieved; the query is the new document.
4.1 (cont.)
[Figure 4.2. Key steps in IR]
[Figure 4.3. Predicting from Retrieved Documents, using simple criteria such as document labels]
4.2 Keyword Search
- The technical goal for prediction is to classify new, unseen documents.
- Prediction and IR are unified by the computation of document similarity.
- IR is based on traditional keyword search through a search engine.
- So we should recognize that using a search engine is a special instance of the prediction concept.
- We enter keywords into a search engine and expect relevant documents to be returned.
- These keywords are words in a dictionary created from the document collection and can be viewed as a small document.
- So, we want to measure how similar the new document (the query) is to the documents in the collection.
- So, the notion of similarity is reduced to finding documents with the same keywords as those posed to the search engine.
- But the objective of the search engine is to rank the documents, not to assign a label.
- So we need additional techniques to break the expected ties (all retrieved documents match the search criteria).
4.3 Nearest-Neighbor Methods
- A method that compares vectors and measures similarity.
- In prediction, nearest-neighbor methods collect the k most similar documents and then look at their labels, as in the sketch below.
- In IR, nearest-neighbor methods determine whether a satisfactory response to the search query has been found.
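A minimal sketch of the prediction use, assuming documents reduced to word sets, a shared-word similarity (Section 4.4.1), and majority voting over the k most similar labeled documents; the names here are illustrative, not from the chapter:

```python
from collections import Counter

def knn_predict(new_doc_words, labeled_docs, k=3):
    """Predict a label for a new document by majority vote over
    the labels of its k most similar stored documents.
    labeled_docs: list of (word_set, label) pairs."""
    new_words = set(new_doc_words)
    # Similarity = shared word count (Section 4.4.1).
    scored = [(len(new_words & words), label) for words, label in labeled_docs]
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Example: the new document is most similar to the "tech" documents.
docs = [({"hardware", "index"}, "tech"),
        ({"user", "information"}, "info"),
        ({"hardware", "software"}, "tech")]
print(knn_predict({"hardware", "software", "index"}, docs))  # tech
```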
4.4 Measuring Similarity
- These measures are used to examine how similar documents are; the output is a numerical measure of similarity.
- Three increasingly complex measures:
  - Shared Word Count
  - Word Count and Bonus
  - Cosine Similarity
4.4.1 Shared Word Count
- Counts the words shared between documents.
- The words:
  - In IR we have a global dictionary in which all potential words are included, with the exception of stopwords (see the sketch below).
  - In prediction it is better to preselect the dictionary relative to the label.
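A small sketch of building such a global dictionary, with a toy stopword list and whitespace tokenization standing in for a real preprocessing pipeline:

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def build_dictionary(collection):
    """Collect every non-stopword token in the collection
    into one global dictionary (sorted for a fixed order)."""
    words = set()
    for doc in collection:
        words.update(w for w in doc.lower().split() if w not in STOPWORDS)
    return sorted(words)

print(build_dictionary(["The user manual of the hardware",
                        "Information in the software index"]))
# ['hardware', 'index', 'information', 'manual', 'software', 'user']
```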
Computing Similarity by Shared Words
- Look at all words in the new document.
- For each document in the collection, count how many of these words appear.
- No weighting is used, just a simple count.
- The dictionary has true keywords (weakly predictive words removed).
- The results of this measure are clearly intuitive.
- No one will question why a document was retrieved.
Computing Similarity by Shared Words
- Each document is represented as a vector of keywords (zeros and ones).
- The similarity of two documents is the product of the two vectors, as in the sketch below.
- If two documents have the same keyword, then this word is counted (1 × 1 = 1).
- The performance of this measure depends mainly on the dictionary used.
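A sketch of this computation, assuming a fixed dictionary order; with 0/1 vectors, the product (a dot product) counts exactly the keywords the two documents share:

```python
def to_binary_vector(doc_words, dictionary):
    """Represent a document as zeros and ones over the dictionary."""
    return [1 if term in doc_words else 0 for term in dictionary]

def shared_word_count(v1, v2):
    """Dot product of binary vectors: each shared keyword adds 1 * 1 = 1."""
    return sum(a * b for a, b in zip(v1, v2))

dictionary = ["hardware", "software", "user", "information", "index"]
d1 = to_binary_vector({"hardware", "user", "index"}, dictionary)
d2 = to_binary_vector({"hardware", "information", "index"}, dictionary)
print(shared_word_count(d1, d2))  # 2 (hardware and index are shared)
```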
Computing Similarity by Shared Words
- Shared word count is an exact search:
  - it either retrieves or does not retrieve a document.
- No weighting can be applied to terms:
  - in a query with terms A and B, you cannot specify that A is more important than B.
- Every retrieved document is treated equally.
4.4.2 Word Count and Bonus (1/4)
- TF (term frequency):
  - the number of times a term occurs in a document.
- DF (document frequency):
  - the number of documents that contain the term.
- IDF (inverse document frequency):
  - idf = log(N/df), where N is the total number of documents.
- A vector is a numerical representation of a point in a multi-dimensional space:
  - (x1, x2, ..., xn)
  - the dimensions of the space need to be defined;
  - a measure on the space needs to be defined.
4.4.2 Word Count and Bonus (2/4)
- Each indexing term is a dimension.
- Each document is a vector:
  - Di = (ti1, ti2, ti3, ..., tik)
- Document similarity is defined as:

  Sim(D1, D2) = sum of s(j) for j = 1 to K, where K is the number of words, and
  s(j) = 1 + 1/df(j) if word j occurs in both documents, 0 otherwise.
4.4.2 Word Count and Bonus (3/4)
- The bonus 1/df(j) is a variant of idf: if the word occurs in many documents, the bonus is small.
- This measure is better than the shared word count because it discriminates between weakly and strongly predictive words (see the sketch below).
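A sketch of the measure defined on the two slides above: each shared word j contributes 1 + 1/df(j), so words that occur in many documents earn only a small bonus; document frequencies are computed from the collection passed in (names are illustrative):

```python
def bonus_similarity(doc1_words, doc2_words, collection):
    """Word count with bonus: each shared word j scores 1 + 1/df(j),
    where df(j) is the number of documents containing word j."""
    score = 0.0
    for word in doc1_words & doc2_words:
        df = sum(1 for doc in collection if word in doc)
        score += 1 + 1 / df
    return score

collection = [{"hardware", "user", "index"},
              {"hardware", "software"},
              {"user", "information"}]
query = {"hardware", "user"}
# hardware and user each appear in 2 of 3 documents: (1 + 1/2) * 2 = 3.0
print(bonus_similarity(query, collection[0], collection))  # 3.0
```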
4.4.2 Word Count and Bonus (4/4)
- A document space is defined by five terms: hardware, software, user, information, index.
- The query is: hardware, user, information (new document vector: 1 0 1 1 0).

Labeled spreadsheet and similarity scores with bonus:

Document   hardware  software  user  information  index  | Similarity
D1            1         0       1        0          1    |   2.83
D2            1         1       0        0          0    |   1.33
D3            0         0       0        1          0    |   0
D4            1         0       0        0          1    |   1.33
D5            0         0       1        0          0    |   1.5
D6            0         1       0        1          0    |   1.33
D7            1         1       0        0          1    |   2.67

[Figure 4.4. Computing Similarity Scores with Bonus]
4.4.3 Cosine Similarity: The Vector Space
- A document is represented as a vector:
  - (w1, w2, ..., wn)
- Binary weights:
  - wi = 1 if the corresponding term is in the document;
  - wi = 0 if the term is not in the document.
- TF (term frequency) weights:
  - wi = tfi, where tfi is the number of times the term occurs in the document.
- TF-IDF (term frequency × inverse document frequency) weights:
  - wi = tfi × idfi = tfi × (1 + log(N/dfi)), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
4.4.3 Cosine Similarity: The Vector Space
- vec(D) = (w1, w2, ..., wt)
- Sim(d1, d2) = cos(θ)
  = vec(d1) · vec(d2) / (|d1| |d2|)
  = Σj wd1(j) × wd2(j) / (|d1| |d2|)
- w(j) > 0 whenever term j appears in di
- So, 0 ≤ sim(d1, d2) ≤ 1
- A document is retrieved even if it matches the query terms only partially, as the sketch below shows.
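A sketch of the cosine computation over term-weight vectors; it works for binary, tf, or tf-idf weights, and guards against zero-length vectors (an assumption of this sketch, not stated on the slide):

```python
import math

def cosine_similarity(v1, v2):
    """cos(theta) = (v1 . v2) / (|v1| |v2|); in [0, 1] for
    non-negative term weights."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector shares nothing with any document
    return dot / (norm1 * norm2)

# Binary vectors over (hardware, software, user, information, index):
d1 = [1, 0, 1, 0, 1]
d2 = [1, 0, 1, 1, 0]
print(round(cosine_similarity(d1, d2), 3))  # 0.667: a partial match still scores
```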
4.4.3 Cosine Similarity
- How do we compute the weight wj?
- A good weight must take into account two effects:
  - quantification of intra-document content (similarity):
    - the tf factor, the term frequency within a document;
  - quantification of inter-document separation (dissimilarity):
    - the idf factor, the inverse document frequency.
- wj = tf(j) × idf(j)
4.4.3 Cosine Similarity
- TF in the given document shows how important the term is in that document (it makes words frequent in the document more important).
- IDF makes words that are rare across all documents more important.
- A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low term frequency in all other documents.
- Term weights in a document affect the position of the document vector:
  - di = (wi,1, wi,2, ..., wi,t)
4.4.3 Cosine Similarity
- TF-IDF definitions:
  - fik = number of occurrences of term ti in document Dk
  - tfik = fik / max(fik), the normalized term frequency
  - dfi = number of documents that contain term ti
  - idfi = log(N / dfi), where N is the total number of documents
  - wik = tfik × idfi, the term weight
- Intuition: rare words get more weight, common words less weight.
Example: TF-IDF
- Given a document containing terms with the following frequencies:
  - Kent 3, Ohio 2, University 1
- Assume a collection of 10,000 documents in which the document frequencies of these terms are:
  - Kent 50, Ohio 1300, University 250
- Then:
  - Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
  - Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
  - University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
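The example can be checked with a few lines; the natural log reproduces the idf values shown (e.g. log(10000/50) ≈ 5.3), with small rounding differences possible in the final tf-idf products:

```python
import math

N = 10_000                                           # documents in the collection
freqs = {"Kent": 3, "Ohio": 2, "University": 1}      # in-document term counts
dfs = {"Kent": 50, "Ohio": 1300, "University": 250}  # document frequencies

max_f = max(freqs.values())  # count of the most frequent term (Kent: 3)
for term, f in freqs.items():
    tf = f / max_f                 # normalized term frequency
    idf = math.log(N / dfs[term])  # natural log matches the slide's values
    print(f"{term}: tf = {tf:.2f}, idf = {idf:.1f}, tf-idf = {tf * idf:.1f}")
```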
4.4.3 Cosine Similarity
- Cosine weights:
  - w(j) = tf(j) × idf(j)
  - idf(j) = log(N / df(j))