WMES3103 : INFORMATION RETRIEVAL - PowerPoint PPT Presentation

Provided by: FSK6
1
WMES3103 INFORMATION RETRIEVAL
  • INDEXING AND SEARCHING

2
INTRODUCTION
  • A basic query can be answered in 2 ways:
  • Scanning the text sequentially (sequential or
    online searching): finding the occurrences of a
    pattern in a text when the text is not
    preprocessed
  • Good when the text is small, the text collection
    is volatile (modified frequently), or no indexing
    space is available
  • Building data structures over the text (indexes)
    to speed up the search
  • Good to build and maintain an index when the text
    collection is large and semi-static (updated at
    reasonably regular intervals)
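As a minimal sketch of sequential (online) searching, the function below scans an unpreprocessed text for every occurrence of a pattern. The name `find_occurrences` and the sample text are illustrative assumptions, not part of the slides.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return the 0-based start offsets of every occurrence of pattern in text."""
    if not pattern:
        return []
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one character so overlapping matches are also reported.
        start = text.find(pattern, start + 1)
    return positions


sample = "a text has many words; a text is made of words"
print(find_occurrences(sample, "text"))
```

No index is built here, so every query rescans the whole text. That is acceptable for small or volatile collections, as noted above.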

3
INDEXING
  • Key weight: frequency dependent; determines
    ranking (best match)
  • tf-idf weighting
  • tf: key frequency in a document
  • idf: the inverse of the number of documents
    containing the key
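The weighting can be sketched as follows. The slides only describe idf as the inverse of the number of documents containing the key; this sketch assumes the common variant idf = log(N / df), where N is the number of documents and df the number of documents containing the key. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each key in each document by tf * log(N / df)."""
    n_docs = len(docs)
    # df: number of documents that contain each key at least once.
    df = Counter(key for doc in docs for key in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf: raw frequency of the key in this document
        weights.append({key: tf[key] * math.log(n_docs / df[key]) for key in tf})
    return weights


docs = [
    ["inverted", "file", "index"],
    ["suffix", "tree", "index"],
    ["signature", "file", "index"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

With this variant, a key that appears in every document (here "index") gets weight 0, while a key unique to one document gets the highest weight.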

4
AUTOMATIC INDEXING PROCESS
  • [Flowchart, layout lost in extraction. Box labels:
    Text, Recognize string, Delete stopwords,
    Identify stems, Replace stems by identifiers,
    Count posting, Use thesaurus and phrases, Weight,
    Text representation]

5
AUTOMATIC INDEXING PROCESS
  • In the process:
  • Stem identification: word normalization, NLP
  • Short codes are used as identifiers
  • Thesaurus: rare stems are clustered
  • Phrases: frequent stems are combined into less
    frequent phrases
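A rough sketch of the first stages of such a process is given below: recognizing strings, deleting stopwords, identifying stems, replacing stems by identifiers, and counting postings. The stopword list, the crude suffix-stripping `toy_stem`, and the integer identifiers are all assumptions made for illustration. The thesaurus, phrase, and weighting stages are left out.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}  # toy list (assumption)


def toy_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_text(text: str, stem_ids: dict[str, int]) -> Counter:
    """Return posting counts keyed by stem identifier for one text."""
    words = re.findall(r"[a-z]+", text.lower())       # recognize strings
    words = [w for w in words if w not in STOPWORDS]  # delete stopwords
    stems = [toy_stem(w) for w in words]              # identify stems
    # Replace stems by short integer identifiers, assigning new ones as needed.
    ids = [stem_ids.setdefault(s, len(stem_ids)) for s in stems]
    return Counter(ids)                               # count postings


stem_ids: dict[str, int] = {}
counts = index_text("Indexing the words and the indexes of a word", stem_ids)
print(stem_ids)
print(counts)
```

Here "Indexing" and "indexes" collapse to one stem, as do "words" and "word", so the text is represented by two identifiers with two postings each.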

6
  • Nowadays, medium-size databases (200 MB) combine
    online and indexed searching
  • 3 main indexing techniques:
  • Inverted files: best choice for most
    applications
  • Suffix trees and arrays: faster for phrase
    searching but harder to build and maintain
  • Signature files: popular in the 1980s but
    outperformed by inverted files
  • Will concentrate on inverted files only

7
INVERTED FILE
  • Inverted file (inverted index): word-oriented
    mechanism for indexing a text collection in order
    to speed up the searching task
  • Composed of 2 elements: vocabulary and
    occurrences
  • Vocabulary: set of all different words in the
    text
  • For each word, a list of all the text positions
    where the word appears is stored
  • Occurrences: the set of all those lists

8
Example
  • A sample text and an inverted index built on it
  • The words are converted to lower-case and some
    are not indexed
  • The occurrences point to character positions in
    the text
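The slide's figure did not survive in this transcript. As a stand-in, the sketch below builds an inverted index over a short sample text chosen here for illustration. The text, the stopword list, and the function name `build_inverted_index` are assumptions. Positions are 1-based character offsets.

```python
import re

STOPWORDS = {"this", "is", "a", "has", "are", "from"}  # assumed, not from the slides


def build_inverted_index(text: str) -> dict[str, list[int]]:
    """Map each indexed word to the 1-based character positions where it starts."""
    index: dict[str, list[int]] = {}
    for match in re.finditer(r"[a-z]+", text.lower()):
        word = match.group()
        if word in STOPWORDS:
            continue  # some words are not indexed
        index.setdefault(word, []).append(match.start() + 1)
    return index


text = "This is a text. A text has many words. Words are made from letters."
for word, positions in sorted(build_inverted_index(text).items()):
    print(word, positions)
# letters [60]
# made [50]
# many [28]
# text [11, 19]
# words [33, 40]
```

The dictionary keys form the vocabulary and the position lists form the occurrences.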

9
INVERTED FILE
  • Positions can refer to words or characters
  • Word positions (e.g. position i refers to the i-th
    word) simplify phrase and proximity queries
  • Character positions (e.g. position i refers to the
    i-th character) facilitate direct access to
    matching text positions
  • Space required for the vocabulary is small - e.g.
    the vocabulary for 1 GB of the TREC-2 collection
    has a size of 5 MB; it can be further reduced by
    stemming and other techniques

10
INVERTED FILE
  • Occurrences require more space: each word in the
    text is referenced once in the structure
  • Building an inverted index from the sample text
  • Refer to the attached Word document

11
Searching on an inverted file
  • Done via 3 basic steps:
  • Vocabulary search: the words and patterns
    present in the query are isolated and searched in
    the vocabulary
  • Retrieval of occurrences: the lists of the
    occurrences of all the words found are retrieved
  • Manipulation of occurrences: the occurrences are
    processed to solve phrases, proximity or Boolean
    operations
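The three steps can be sketched for a phrase query as follows. This is an illustration under stated assumptions: the index stores 0-based word positions, no stopwords are removed, and the names `build_word_index` and `phrase_search` are hypothetical.

```python
import re


def build_word_index(text: str) -> dict[str, list[int]]:
    """Inverted index whose occurrences are word positions (0-based)."""
    index: dict[str, list[int]] = {}
    for i, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        index.setdefault(word, []).append(i)
    return index


def phrase_search(index: dict[str, list[int]], phrase: str) -> list[int]:
    """Return the word positions where the whole phrase starts."""
    words = re.findall(r"[a-z]+", phrase.lower())
    # Step 1: vocabulary search. Every query word must be in the vocabulary.
    if not words or any(w not in index for w in words):
        return []
    # Step 2: retrieval of occurrences. Fetch one list per query word.
    lists = [index[w] for w in words]
    # Step 3: manipulation of occurrences. Keep each start position p such
    # that every following query word occurs at its expected offset from p.
    later = [set(lst) for lst in lists[1:]]
    return [p for p in lists[0]
            if all(p + k + 1 in s for k, s in enumerate(later))]


index = build_word_index("A text has many words. Words are made from letters.")
print(phrase_search(index, "many words"))  # [3]
print(phrase_search(index, "words are"))   # [5]
```

Because positions count words rather than characters, adjacency in a phrase reduces to checking consecutive integers.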

12
TRIES
  • Tries, or digital search trees, are multiway
    trees that store sets of strings. Every edge of
    the tree is labelled with a letter. To search for
    a string in a trie, one starts at the root and
    scans the string character by character,
    descending by the appropriate edge of the trie.
    This continues until a leaf is found (the string
    is in the set) or no appropriate edge exists (the
    string is not in the set).
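A minimal trie sketch is given below. The class and method names are hypothetical. One deliberate difference from the description above: instead of stopping only at leaves, each node carries an end-of-string flag, so a stored string may be a prefix of another stored string.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children: dict[str, "TrieNode"] = {}  # one outgoing edge per letter
        self.is_end = False  # True if a stored string ends at this node


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        """Add word, creating one edge per letter where needed."""
        node = self.root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_end = True

    def contains(self, word: str) -> bool:
        """Descend from the root letter by letter; fail if an edge is missing."""
        node = self.root
        for letter in word:
            node = node.children.get(letter)
            if node is None:
                return False
        return node.is_end


trie = Trie()
for w in ("text", "word", "words"):
    trie.insert(w)
print(trie.contains("word"), trie.contains("wor"))
```

Search time depends on the length of the query string, not on the number of strings stored.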