Text Search System - PowerPoint PPT Presentation

About This Presentation
Title:

Text Search System

Description:

Text Search System. Group 10: Michaela Stadlerova. Jakub Silhavy. Gaojie He. Hanjie Shu ... Document: document id, document name, maximum term frequency, ... – PowerPoint PPT presentation

Slides: 19
Provided by: folk5
Category:
Tags: michaela | search | system | text


Transcript and Presenter's Notes

Title: Text Search System


1
Text Search System
  • Group 10
  • Michaela Stadlerova
  • Jakub Silhavy
  • Gaojie He
  • Hanjie Shu

2
Outline
  • 1. System Architecture
  • 2. Components
  • 3. Demonstration
  • 4. Evaluation

3
System Architecture
4
Components
5
Main Components
  • Query Processing
  • Document Processing
  • Document Index
  • Search Rank
  • Clustering

6
Query and Document Processing
  • Tokenization
  • Stemming
  • Removal of Stop Words
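
The three steps above can be sketched as a small pipeline. This is only an illustration: the slides do not say which stop-word list or stemmer the group used, so `STOP_WORDS` and the toy `stem` function (a crude suffix stripper, not the Porter algorithm) are assumptions, as is the choice to remove stop words before stemming.

```python
import re

# Hypothetical stop-word list; the slides do not name the list that was used.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Toy suffix stripper (an assumption, NOT Porter): drop one common ending."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem the remaining tokens."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]
```

Applying the same `preprocess` function to both queries and documents keeps their terms comparable, which is presumably why the slide treats the two together.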

7
Document Index
  • Step one: Vector Generation
  • Document: document id, document name, maximum
    term frequency, existing terms and their frequencies.
  • Step two: Inverted Files
  • Term: Document ID, Term Frequency.
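
A minimal sketch of the two indexing steps, assuming each document has already been reduced to a list of processed terms. The function names and the dictionary layout are hypothetical; only the stored fields (document id, name, maximum term frequency, per-term frequencies, and per-term postings) come from the slide.

```python
from collections import Counter, defaultdict


def build_document_vector(doc_id, name, terms):
    """Step one: record the fields listed on the slide for one document."""
    freqs = Counter(terms)
    return {
        "doc_id": doc_id,
        "name": name,
        "max_tf": max(freqs.values(), default=0),
        "term_freqs": dict(freqs),
    }


def build_inverted_file(doc_vectors):
    """Step two: map each term to a list of (document id, term frequency)."""
    inverted = defaultdict(list)
    for vec in doc_vectors:
        for term, tf in vec["term_freqs"].items():
            inverted[term].append((vec["doc_id"], tf))
    return dict(inverted)
```

With this layout, the number of documents containing a term is simply the length of its posting list, which is the quantity a ranking step typically needs.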

8
Inverted Files
9
Search Rank
  • Construct Weights Vector
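  • Document: $w_{i,j} = \frac{f_{i,j}}{\max_l f_{l,j}} \cdot \log\frac{N}{n_i}$
  • Query: $w_{i,q} = \left(0.5 + \frac{0.5\,\mathrm{freq}_{i,q}}{\max_l \mathrm{freq}_{l,q}}\right) \cdot \log\frac{N}{n_i}$
  • Cosine Similarity
  • $\mathrm{sim}(q,d) = \frac{\vec{q}}{|\vec{q}|} \cdot \frac{\vec{d}}{|\vec{d}|}$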
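
The weighting and similarity formulas on this slide can be sketched as follows. Vectors are stored as sparse dictionaries, `doc_freq` maps each term i to n_i (the number of documents containing it), and `n_docs` is N; these names and the sparse representation are assumptions for illustration, not the group's actual code.

```python
import math


def document_weights(term_freqs, doc_freq, n_docs):
    """w_ij = (f_ij / max_l f_lj) * log(N / n_i) for each term i in document j."""
    max_tf = max(term_freqs.values(), default=1)
    return {t: (f / max_tf) * math.log(n_docs / doc_freq[t])
            for t, f in term_freqs.items()}


def query_weights(term_freqs, doc_freq, n_docs):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i)."""
    max_tf = max(term_freqs.values(), default=1)
    # Query terms that occur in no document are skipped (an assumption).
    return {t: (0.5 + 0.5 * f / max_tf) * math.log(n_docs / doc_freq[t])
            for t, f in term_freqs.items() if t in doc_freq}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Note that a term appearing in every document gets weight log(N/N) = 0, so it contributes nothing to the similarity score.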

10
Clustering Bottom-up algorithm
Similarity matrix
11
Clustering Bottom-up algorithm
Document weight vector
Cosine similarity (d1, d2) = 0.85
Similarity matrix
12
Clustering Bottom-up algorithm
Document weight vector
Cosine similarity (d1, d2) = 0.85
Similarity matrix
findMax = 0.94
merge d2 and d3
13
Clustering Bottom-up algorithm
Similarity matrix
repeat until . . .
findMax > Constant
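
One way to sketch the bottom-up procedure from slides 10 to 13 is shown below. Two points are assumptions rather than facts from the slides: the condition `findMax > Constant` is read as the test for continuing to merge, and a merged cluster is represented by the average of its members' weight vectors when the similarity matrix is recomputed. The names `centroid`, `bottom_up_clusters`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Average of sparse vectors; used here as the cluster representative."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def bottom_up_clusters(doc_vectors, threshold):
    """Merge the most similar pair while its similarity exceeds threshold."""
    clusters = [[doc_id] for doc_id in doc_vectors]
    while len(clusters) > 1:
        reps = [centroid([doc_vectors[d] for d in c]) for c in clusters]
        # findMax: the largest entry of the (upper-triangular) similarity matrix.
        best, a, b = max(
            (cosine(reps[i], reps[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best <= threshold:
            break
        # Merge cluster b into cluster a (a < b, so deleting b is safe).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

Recomputing every pairwise similarity on each pass is simple but quadratic per merge; that is acceptable for the 20 to 25 retrieved documents mentioned in the evaluation, though a larger collection would call for incremental updates.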
14
Clustering ranking
Cluster weight vector
cluster 1
compute average
Cosine similarity (c1,query)
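
The cluster-ranking step can be sketched like this: each cluster's weight vector is the average of its documents' vectors, and clusters are ordered by cosine similarity to the query. The function names and the sparse-dictionary representation are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_vector(member_vectors):
    """Average the members' weight vectors, as described on the slide."""
    total = {}
    for vec in member_vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(member_vectors) for t, w in total.items()}


def rank_clusters(clusters, doc_vectors, query_vector):
    """Return (score, members) pairs, most similar cluster first."""
    scored = []
    for members in clusters:
        rep = cluster_vector([doc_vectors[d] for d in members])
        scored.append((cosine(rep, query_vector), members))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```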
15
DEMONSTRATION
16
Evaluation of the system
Basic system (data from the first 20 retrieved documents)
Extended system (input: the first 25 documents)
17
Extended System vs. Basic System
  • Different order of documents
  • Improvement of the search experience
  • The documents are grouped according to their
    similarity
  • Generation of labels
  • Precision and recall remain mostly the same
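
For reference, precision and recall for one query can be computed as below. The slides do not give the relevance judgments or the measured values, so the function takes them as inputs; the name `precision_recall` is a hypothetical helper.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / |retrieved|, recall = hits / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Both measures depend only on which documents are returned, not on their order, which is consistent with the slide's observation that reordering and grouping the results leaves them mostly unchanged.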

18
THANK YOU!!!!