Ch 4: Information Retrieval and Text Mining
4.1 Is Information Retrieval a Form of Text Mining?
- What is the principal computer specialty for processing documents and text? Information Retrieval (IR).
- The task of IR is to retrieve relevant documents in response to a query.
- The fundamental technique of IR is measuring similarity.
- A query is examined and transformed into a vector of values to be compared with stored documents.
4.1 (cont.)
- In the prediction problem, similar documents are retrieved and then their properties are measured, i.e. the class labels are counted to see which label should be assigned to a new document.
- The objectives of prediction can be posed in the form of an IR model where documents relevant to a query are retrieved; the query is the new document.
4.1 (cont.)
[Figure 4.2. Key steps in IR]
[Figure 4.3. Predicting from Retrieved Documents, using simple criteria such as document labels]
4.2 Keyword Search
- The technical goal for prediction is to classify new, unseen documents.
- Prediction and IR are unified by the computation of document similarity.
- IR is based on traditional keyword search through a search engine.
- So we should recognize that using a search engine is a special instance of the prediction concept.
- We enter keywords into a search engine and expect relevant documents to be returned.
- These keywords are words in a dictionary created from the document collection and can be viewed as a small document.
- So, we want to measure how similar the new document (the query) is to the documents in the collection.
- So, the notion of similarity is reduced to finding documents with the same keywords as those posed to the search engine.
- But the objective of the search engine is to rank the documents, not to assign a label.
- So we need additional techniques to break the expected ties (all retrieved documents match the search criteria).
4.3 Nearest-Neighbor Methods
- A method that compares vectors and measures similarity.
- In prediction, nearest-neighbor methods collect the k most similar documents and then look at their labels, as in the sketch below.
- In IR, nearest-neighbor methods determine whether a satisfactory response to the search query has been found.
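A minimal sketch of the prediction use, assuming documents reduced to word sets, a shared-word similarity (Section 4.4.1), and majority voting over the k most similar labeled documents; the names here are illustrative, not from the chapter:

```python
from collections import Counter

def knn_predict(new_doc_words, labeled_docs, k=3):
    """Predict a label for a new document by majority vote over
    the labels of its k most similar stored documents.
    labeled_docs: list of (word_set, label) pairs."""
    new_words = set(new_doc_words)
    # Similarity = shared word count (Section 4.4.1).
    scored = [(len(new_words & words), label) for words, label in labeled_docs]
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    votes = Counter(label for _, label in top_k)
    return votes.most_common(1)[0][0]

# Example: the new document is most similar to the "tech" documents.
docs = [({"hardware", "index"}, "tech"),
        ({"user", "information"}, "info"),
        ({"hardware", "software"}, "tech")]
print(knn_predict({"hardware", "software", "index"}, docs))  # tech
```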
4.4 Measuring Similarity
- These measures are used to examine how similar documents are; the output is a numerical measure of similarity.
- Three increasingly complex measures:
  - Shared Word Count
  - Word Count and Bonus
  - Cosine Similarity
4.4.1 Shared Word Count
- Counts the words shared between documents.
- The words:
  - In IR we have a global dictionary in which all potential words are included, with the exception of stopwords (see the sketch below).
  - In prediction it is better to preselect the dictionary relative to the label.
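A small sketch of building such a global dictionary, with a toy stopword list and whitespace tokenization standing in for a real preprocessing pipeline:

```python
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}

def build_dictionary(collection):
    """Collect every non-stopword token in the collection
    into one global dictionary (sorted for a fixed order)."""
    words = set()
    for doc in collection:
        words.update(w for w in doc.lower().split() if w not in STOPWORDS)
    return sorted(words)

print(build_dictionary(["The user manual of the hardware",
                        "Information in the software index"]))
# ['hardware', 'index', 'information', 'manual', 'software', 'user']
```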
Computing Similarity by Shared Words
- Look at all words in the new document.
- For each document in the collection, count how many of these words appear.
- No weighting is used, just a simple count.
- The dictionary has true keywords (weakly predictive words removed).
- The results of this measure are clearly intuitive.
- No one will question why a document was retrieved.
Computing Similarity by Shared Words
- Each document is represented as a vector of keywords (zeros and ones).
- The similarity of two documents is the product of the two vectors, as in the sketch below.
- If two documents have the same keyword, then this word is counted (1 × 1 = 1).
- The performance of this measure depends mainly on the dictionary used.
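A sketch of this computation, assuming a fixed dictionary order; with 0/1 vectors, the product (a dot product) counts exactly the keywords the two documents share:

```python
def to_binary_vector(doc_words, dictionary):
    """Represent a document as zeros and ones over the dictionary."""
    return [1 if term in doc_words else 0 for term in dictionary]

def shared_word_count(v1, v2):
    """Dot product of binary vectors: each shared keyword adds 1 * 1 = 1."""
    return sum(a * b for a, b in zip(v1, v2))

dictionary = ["hardware", "software", "user", "information", "index"]
d1 = to_binary_vector({"hardware", "user", "index"}, dictionary)
d2 = to_binary_vector({"hardware", "information", "index"}, dictionary)
print(shared_word_count(d1, d2))  # 2 (hardware and index are shared)
```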
Computing Similarity by Shared Words
- Shared word count is an exact search:
  - it either retrieves or does not retrieve a document.
- No weighting can be applied to terms:
  - in a query with terms A and B, you cannot specify that A is more important than B.
- Every retrieved document is treated equally.
4.4.2 Word Count and Bonus (1/4)
- TF (term frequency):
  - the number of times a term occurs in a document.
- DF (document frequency):
  - the number of documents that contain the term.
- IDF (inverse document frequency):
  - idf = log(N/df), where N is the total number of documents.
- A vector is a numerical representation of a point in a multi-dimensional space:
  - (x1, x2, ..., xn)
  - the dimensions of the space need to be defined;
  - a measure on the space needs to be defined.
4.4.2 Word Count and Bonus (2/4)
- Each indexing term is a dimension.
- Each document is a vector:
  - Di = (ti1, ti2, ti3, ..., tik)
- Document similarity is defined as:

  Sim(D1, D2) = sum of s(j) for j = 1 to K, where K is the number of words, and
  s(j) = 1 + 1/df(j) if word j occurs in both documents, 0 otherwise.
4.4.2 Word Count and Bonus (3/4)
- The bonus 1/df(j) is a variant of idf: if the word occurs in many documents, the bonus is small.
- This measure is better than the shared word count because it discriminates between weakly and strongly predictive words (see the sketch below).
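A sketch of the measure defined on the two slides above: each shared word j contributes 1 + 1/df(j), so words that occur in many documents earn only a small bonus; document frequencies are computed from the collection passed in (names are illustrative):

```python
def bonus_similarity(doc1_words, doc2_words, collection):
    """Word count with bonus: each shared word j scores 1 + 1/df(j),
    where df(j) is the number of documents containing word j."""
    score = 0.0
    for word in doc1_words & doc2_words:
        df = sum(1 for doc in collection if word in doc)
        score += 1 + 1 / df
    return score

collection = [{"hardware", "user", "index"},
              {"hardware", "software"},
              {"user", "information"}]
query = {"hardware", "user"}
# hardware and user each appear in 2 of 3 documents: (1 + 1/2) * 2 = 3.0
print(bonus_similarity(query, collection[0], collection))  # 3.0
```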
4.4.2 Word Count and Bonus (4/4)
- A document space is defined by five terms: hardware, software, user, information, index.
- The query is: hardware, user, information (new document vector: 1 0 1 1 0).

Labeled spreadsheet and similarity scores with bonus:

Document   hardware  software  user  information  index  | Similarity
D1            1         0       1        0          1    |   2.83
D2            1         1       0        0          0    |   1.33
D3            0         0       0        1          0    |   0
D4            1         0       0        0          1    |   1.33
D5            0         0       1        0          0    |   1.5
D6            0         1       0        1          0    |   1.33
D7            1         1       0        0          1    |   2.67

[Figure 4.4. Computing Similarity Scores with Bonus]
4.4.3 Cosine Similarity: The Vector Space
- A document is represented as a vector:
  - (w1, w2, ..., wn)
- Binary weights:
  - wi = 1 if the corresponding term is in the document;
  - wi = 0 if the term is not in the document.
- TF (term frequency) weights:
  - wi = tfi, where tfi is the number of times the term occurs in the document.
- TF-IDF (term frequency × inverse document frequency) weights:
  - wi = tfi × idfi = tfi × (1 + log(N/dfi)), where dfi is the number of documents containing term i, and N is the total number of documents in the collection.
4.4.3 Cosine Similarity: The Vector Space
- vec(D) = (w1, w2, ..., wt)
- Sim(d1, d2) = cos(θ)
  = vec(d1) · vec(d2) / (|d1| |d2|)
  = Σj wd1(j) × wd2(j) / (|d1| |d2|)
- w(j) > 0 whenever term j appears in di
- So, 0 ≤ sim(d1, d2) ≤ 1
- A document is retrieved even if it matches the query terms only partially, as the sketch below shows.
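A sketch of the cosine computation over term-weight vectors; it works for binary, tf, or tf-idf weights, and guards against zero-length vectors (an assumption of this sketch, not stated on the slide):

```python
import math

def cosine_similarity(v1, v2):
    """cos(theta) = (v1 . v2) / (|v1| |v2|); in [0, 1] for
    non-negative term weights."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector shares nothing with any document
    return dot / (norm1 * norm2)

# Binary vectors over (hardware, software, user, information, index):
d1 = [1, 0, 1, 0, 1]
d2 = [1, 0, 1, 1, 0]
print(round(cosine_similarity(d1, d2), 3))  # 0.667: a partial match still scores
```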
4.4.3 Cosine Similarity
- How do we compute the weight wj?
- A good weight must take into account two effects:
  - quantification of intra-document content (similarity):
    - the tf factor, the term frequency within a document;
  - quantification of inter-document separation (dissimilarity):
    - the idf factor, the inverse document frequency.
- wj = tf(j) × idf(j)
4.4.3 Cosine Similarity
- TF in the given document shows how important the term is in that document (it makes words frequent in the document more important).
- IDF makes words that are rare across all documents more important.
- A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low term frequency in all other documents.
- Term weights in a document affect the position of the document vector:
  - di = (wi,1, wi,2, ..., wi,t)
4.4.3 Cosine Similarity
- TF-IDF definitions:
  - fik = number of occurrences of term ti in document Dk
  - tfik = fik / max(fik), the normalized term frequency
  - dfi = number of documents that contain term ti
  - idfi = log(N / dfi), where N is the total number of documents
  - wik = tfik × idfi, the term weight
- Intuition: rare words get more weight, common words less weight.
Example: TF-IDF
- Given a document containing terms with the following frequencies:
  - Kent 3, Ohio 2, University 1
- Assume a collection of 10,000 documents in which the document frequencies of these terms are:
  - Kent 50, Ohio 1300, University 250
- Then:
  - Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3
  - Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3
  - University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
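The example can be checked with a few lines; the natural log reproduces the idf values shown (e.g. log(10000/50) ≈ 5.3), with small rounding differences possible in the final tf-idf products:

```python
import math

N = 10_000                                           # documents in the collection
freqs = {"Kent": 3, "Ohio": 2, "University": 1}      # in-document term counts
dfs = {"Kent": 50, "Ohio": 1300, "University": 250}  # document frequencies

max_f = max(freqs.values())  # count of the most frequent term (Kent: 3)
for term, f in freqs.items():
    tf = f / max_f                 # normalized term frequency
    idf = math.log(N / dfs[term])  # natural log matches the slide's values
    print(f"{term}: tf = {tf:.2f}, idf = {idf:.1f}, tf-idf = {tf * idf:.1f}")
```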
4.4.3 Cosine Similarity
- Cosine weights:
  - w(j) = tf(j) × idf(j)
  - idf(j) = log(N / df(j))