Ch 4 Information Retrieval and Text Mining
  • Hakam Alomari

4.1 Is Information Retrieval a Form of Text Mining?
  • What is the principal computer specialty for
    processing documents and text?
  • Information Retrieval (IR)
  • The task of IR is to retrieve relevant documents
    in response to a query.
  • The fundamental technique of IR is measuring
    similarity
  • A query is examined and transformed into a vector
    of values to be compared with stored documents

Cont. 4.1
  • In the prediction problem, similar documents are
    retrieved and their properties are then measured,
    i.e. the class labels are counted to see which
    label should be assigned to a new document
  • The objective of prediction can be posed in the
    form of an IR model in which documents relevant
    to a query are retrieved; the query is the new
    document

Cont. 4.1
Figure 4.2. Key steps in IR
Figure 4.3. Predicting from Retrieved Documents (using
simple criteria such as document labels)

4.2 Key Word Search
  • The technical goal for prediction is to classify
    new, unseen documents
  • Prediction and IR are unified by the computation
    of similarity between documents
  • IR is based on traditional keyword search through
    a search engine
  • So we should recognize that using a search engine
    is a special instance of the prediction concept

  • We enter key words into a search engine and
    expect relevant documents to be returned
  • These key words are words in a dictionary created
    from the document collection and can be viewed as
    a small document
  • So, we want to measure how similar the new
    document (the query) is to the documents in the
    collection

  • So, the notion of similarity is reduced to
    finding documents with the same keywords as posed
    to the search engine
  • But, the objective of the search engine is to
    rank the documents, not to assign a label
  • So we need additional techniques to break the
    expected ties (all retrieved documents match the
    search criteria)
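
The keyword-matching view of similarity described above can be sketched as follows. This is a minimal illustration, not a real search engine: the function name `keyword_search`, the whitespace tokenization, and the three sample documents are all assumptions made for the example.

```python
def keyword_search(query, documents):
    """Return the indices of documents that contain every query keyword."""
    keywords = set(query.lower().split())
    hits = []
    for idx, text in enumerate(documents):
        words = set(text.lower().split())
        if keywords <= words:  # every keyword appears in the document
            hits.append(idx)
    return hits


# Hypothetical three-document collection.
docs = [
    "text mining finds patterns in documents",
    "search engines rank documents for a query",
    "information retrieval returns relevant documents",
]

matches = keyword_search("documents query", docs)
ties = keyword_search("documents", docs)
print(matches)  # only the second document contains both keywords
print(ties)     # all three documents match, so they are tied
```

The second call shows the tie problem: every document matches the single keyword equally, so some further measure is needed to rank them.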

4.3 Nearest-Neighbor Methods
  • A method that compares vectors and measures
    similarity
  • In Prediction the NNMs will collect the K most
    similar documents and then look at their labels
  • In IR the NNMs will determine whether a
    satisfactory response to the search query has
    been found
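
A minimal sketch of the prediction use of nearest neighbors: collect the K most similar stored documents and take a majority vote over their labels. The similarity used here is a plain shared-word count on binary vectors (any of the measures in Section 4.4 could be substituted); the function names, the labels, and the data are hypothetical.

```python
from collections import Counter


def shared_words(a, b):
    """Number of dictionary words present in both binary vectors."""
    return sum(x * y for x, y in zip(a, b))


def knn_predict(new_doc, stored_docs, labels, k=3):
    """Label a new document by majority vote among its k nearest neighbors."""
    ranked = sorted(
        range(len(stored_docs)),
        key=lambda i: shared_words(new_doc, stored_docs[i]),
        reverse=True,  # most similar first
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical collection: one binary vector per stored document.
stored_docs = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
]
labels = ["sports", "finance", "finance", "sports"]

predicted = knn_predict([1, 1, 0, 0], stored_docs, labels, k=3)
print(predicted)
```

With these values the three nearest neighbors carry the labels sports, sports, and finance, so the vote returns "sports". Tie-breaking between equally similar documents is left to the sort order here, which is a simplification.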

4.4 Measuring Similarity
  • These measures are used to examine how similar
    documents are; the output is a numerical measure
    of similarity
  • Three increasingly complex measures:
  • Shared Word Count
  • Word Count and Bonus
  • Cosine Similarity

4.4.1 Shared Word Count
  • Counts the shared words between documents
  • The words
  • In IR we have a global dictionary where all
    potential words will be included, with the
    exception of stopwords.
  • In prediction, it is better to preselect the
    dictionary relative to the label

Computing similarity by Shared words
  • Look at all words in the new document
  • For each document in the collection count how
    many of these words appear
  • No weighting is used, just a simple count
  • The dictionary has true key words (weakly
    predictive words removed)
  • The results of this measure are clearly intuitive
  • No one will question why a document was retrieved

Computing similarity by Shared words
  • Each document represented as a vector of key
    words (zeros and ones)
  • The similarity of 2 documents is the dot product
    of the 2 vectors
  • If 2 documents share a key word, that word
    contributes 1 to the count (1 × 1 = 1)
  • The performance of this measure depends mainly on
    the dictionary used
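
The binary-vector formulation above can be sketched in a few lines. The dictionary, the helper names, and the two sample texts are assumptions made for illustration; tokenization is a plain whitespace split.

```python
def to_binary_vector(text, dictionary):
    """1 if the dictionary word occurs in the text, otherwise 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in dictionary]


def shared_word_count(vec_a, vec_b):
    """Dot product of two binary vectors = number of shared key words."""
    return sum(a * b for a, b in zip(vec_a, vec_b))


# Hypothetical four-word dictionary.
dictionary = ["retrieval", "query", "label", "vector"]

new_doc = to_binary_vector("a query becomes a vector", dictionary)
stored = to_binary_vector("compare the query vector with each label", dictionary)
score = shared_word_count(new_doc, stored)
print(new_doc, stored, score)
```

Only words in the dictionary can contribute to the score, which is why the choice of dictionary dominates the behavior of this measure.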

Computing similarity by Shared words
  • Shared words is an exact search: a document is
    either retrieved or not retrieved
  • No weighting can be applied to terms: in a query
    with terms A and B, you can't specify that A is
    more important than B
  • Every retrieved document is treated equally

4.4.2 Word Count and Bonus 1/4
  • TF (term frequency): the number of times a term
    occurs in a document
  • DF (document frequency): the number of documents
    that contain the term
  • IDF (inverse document frequency): log(N/df),
    where N is the total number of documents
  • A vector is a numerical representation of a point
    in a multi-dimensional space: (x1, x2, ..., xn)
  • The dimensions of the space need to be defined
  • A measure on the space needs to be defined
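
As a rough illustration of the three quantities defined above, the sketch below computes TF, DF, and IDF for a tiny tokenized collection. The helper names and the sample documents are hypothetical, and the natural logarithm is an assumption, since the definition does not state a base.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """TF: how many times each term occurs in one document."""
    return Counter(tokens)


def document_frequencies(tokenized_docs):
    """DF: in how many documents each term occurs."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    return df


def inverse_document_frequencies(df, n_docs):
    """IDF: log(N / df) for every term (natural log assumed)."""
    return {term: math.log(n_docs / count) for term, count in df.items()}


# Hypothetical collection of three tokenized documents.
docs = [
    "search engine query".split(),
    "query vector query".split(),
    "document vector".split(),
]

tf_doc1 = term_frequencies(docs[1])
df = document_frequencies(docs)
idf = inverse_document_frequencies(df, len(docs))
print(tf_doc1["query"], df["query"], round(idf["query"], 3))
```

A term that appears in fewer documents, such as "search" here, receives a larger IDF than one that appears in most of them.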

4.4.2 Word Count and Bonus 2/4
  • Each indexing term is a dimension
  • Each document is a vector
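  • Di = (ti1, ti2, ti3, ti4, ..., tik)
  • Document similarity is defined as
    $$\mathrm{Similarity}(D_i) = \sum_{j=1}^{K} w(j)$$
    $$w(j) = \begin{cases} 1 + \dfrac{1}{df(j)} & \text{if word } j \text{ occurs in both documents} \\ 0 & \text{otherwise} \end{cases}$$
  • K is the number of words in the dictionary
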
4.4.2 Word Count and Bonus 3/4
  • The bonus 1/df(j) is a variant of idf. Thus, if
    the word occurs in many documents, the bonus is
    small
  • This measure is better than the shared word
    count because it discriminates between weakly
    and strongly predictive words
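
A minimal sketch of the word-count-with-bonus measure: each shared word contributes 1 plus a bonus of 1/df(j). The function names and the small binary collection are hypothetical and are not the data of Figure 4.4.

```python
def document_frequency(collection):
    """df(j): number of stored documents that contain word j."""
    n_terms = len(collection[0])
    return [sum(doc[j] for doc in collection) for j in range(n_terms)]


def similarity_with_bonus(new_doc, stored_doc, df):
    """Add 1 + 1/df(j) for every word j that occurs in both documents."""
    score = 0.0
    for j, (a, b) in enumerate(zip(new_doc, stored_doc)):
        if a and b:  # a shared word always has df(j) >= 1
            score += 1 + 1 / df[j]
    return score


# Hypothetical binary collection: rows are documents, columns are words.
collection = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
new_doc = [1, 1, 0, 0]

df = document_frequency(collection)
scores = [similarity_with_bonus(new_doc, doc, df) for doc in collection]
print(df)
print([round(s, 2) for s in scores])
```

In this data the first word occurs in three documents and the second in only one, so sharing the second word earns a bonus of 1 while sharing the first earns only 1/3.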

4.4.2 Word Count and Bonus 4/4
  • A document space is defined by five terms:
    hardware, software, user, information, index.
  • The query is: hardware, user, information.

Figure 4.4. Computing Similarity Scores with Bonus
(panels: Labeled Spreadsheet, New Document,
Similarity Scores measured with bonus)

Labeled Spreadsheet:
1 0 1 0 1
1 1 0 0 0
0 0 0 1 0
1 0 0 0 1
0 0 1 0 0
0 1 0 1 0
1 1 0 0 1
New Document:
1 1 0 1

4.4.3 Cosine Similarity: The Vector Space
  • A document is represented as a vector
    (W1, W2, ..., Wn)
  • Binary:
    Wi = 1 if the corresponding term is in the
    document
    Wi = 0 if the term is not in the document
  • TF (Term Frequency):
    Wi = tfi, where tfi is the number of times the
    term occurred in the document
  • TF-IDF (Term Frequency, Inverse Document
    Frequency):
    Wi = tfi × idfi = tfi × (1 + log(N/dfi)), where
    dfi is the number of documents that contain
    term i and N is the total number of documents in
    the collection

4.4.3 Cosine Similarity: The Vector Space
  • vec(D) = (w1, w2, ..., wt)
  • Sim(d1, d2) = cos(θ):
    $$\mathrm{Sim}(d_1, d_2) = \cos\theta = \frac{\vec{d_1} \cdot \vec{d_2}}{|\vec{d_1}|\,|\vec{d_2}|} = \frac{\sum_j w_{d_1}(j)\, w_{d_2}(j)}{|\vec{d_1}|\,|\vec{d_2}|}$$
  • w(j) > 0 whenever term j ∈ di
  • So, 0 ≤ Sim(d1, d2) ≤ 1
  • A document is retrieved even if it matches the
    query terms only partially
4.4.3 Cosine Similarity
  • How to compute the weight wj?
  • A good weight must take into account two effects:
  • Quantification of intra-document content: the tf
    factor, the term frequency within a document
  • Quantification of inter-document separation: the
    idf factor, the inverse document frequency
  • wj = tf(j) × idf(j)

4.4.3 Cosine Similarity
  • TF in the given document shows how important the
    term is in this document (it makes words that are
    frequent in the document more important)
  • IDF makes words that are rare across all documents
    more important
  • A high weight in a tf-idf ranking scheme is
    therefore reached by a high term frequency in the
    given document and a low document frequency of
    the term across the collection
  • Term weights in a document affect the position
    of the document vector
  • di = (wi,1, wi,2, ..., wi,t)

4.4.3 Cosine Similarity
  • TF-IDF definitions
  • fik: number of occurrences of term ti in
    document Dk
  • tfik = fik / max(fik): normalized term frequency
    (the maximum is taken over the terms in Dk)
  • dfi: number of documents that contain ti
  • idfi = log(N / dfi), where N is the total number
    of documents
  • wik = tfik × idfi: term weight
  • Intuition: rare words get more weight, common
    words less weight

Example: TF-IDF
  • Given a document containing terms with the given
    frequencies:
  • Kent: 3, Ohio: 2, University: 1
  • and assuming a collection of 10,000 documents in
    which the document frequencies of these terms are:
  • Kent: 50, Ohio: 1300, University: 250
  • THEN (using the natural logarithm)
  • Kent: tf = 3/3, idf = log(10000/50) = 5.30,
    tf-idf = 5.30
  • Ohio: tf = 2/3, idf = log(10000/1300) = 2.04,
    tf-idf = 1.36
  • University: tf = 1/3, idf = log(10000/250) = 3.69,
    tf-idf = 1.23
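
The worked example can be checked with a short script. This is a sketch under two assumptions: the logarithm is natural (the idf of about 5.3 for Kent only comes out that way), and tf is normalized by the largest count in the document. The function name `tf_idf_weights` is hypothetical.

```python
import math


def tf_idf_weights(counts, doc_freqs, n_docs):
    """Weight = (count / max count in the document) * ln(N / df)."""
    max_count = max(counts.values())
    weights = {}
    for term, count in counts.items():
        tf = count / max_count                    # normalized term frequency
        idf = math.log(n_docs / doc_freqs[term])  # natural log assumed
        weights[term] = tf * idf
    return weights


# Values taken from the example above.
counts = {"Kent": 3, "Ohio": 2, "University": 1}
doc_freqs = {"Kent": 50, "Ohio": 1300, "University": 250}

weights = tf_idf_weights(counts, doc_freqs, n_docs=10_000)
for term, weight in weights.items():
    print(f"{term}: {weight:.2f}")
# Kent: 5.30
# Ohio: 1.36
# University: 1.23
```

These values agree with the example to within rounding. Kent is both the most frequent term in the document and the rarest in the collection, so it receives by far the largest weight.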

4.4.3 Cosine Similarity
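  • Cosine:
    $$\mathrm{cosine}(d_1, d_2) = \frac{\sum_j w_{d_1}(j)\, w_{d_2}(j)}{|\vec{d_1}|\,|\vec{d_2}|}$$
  • w(j) = tf(j) × idf(j)
  • idf(j) = log(N / df(j))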
