IR Data Structures - PowerPoint PPT Presentation

Provided by: osirisSun

1
IR Data Structures
  • Making Matching Queries and Documents Effective
    and Efficient

2
Lecture Objectives
  • Learn an algorithm to stem without a dictionary
  • Know principles of other stemming systems
  • Understand other data structures which facilitate
    rapid access from keywords to documents

3
Stemming
  • Reducing morphological variants of words to a
    standard underlying form
  • e.g. calculate, calculates, calculations →
    calculat-
  • improves recall at the expense of precision

4
Porter Stemming Algorithm
  • Well known, effective stemmer, which does not use
    a dictionary
  • uses a measure m, where every word has the form
    [C](VC)^m[V]
  • C is a sequence of consonants
  • V is a sequence of vowels
  • square brackets mark optional parts; m counts the
    VC repetitions
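
The measure can be computed by collapsing a word into alternating runs of consonants and vowels and counting the VC pairs. The following is a minimal sketch, assuming Porter's usual convention that "y" counts as a vowel when it follows a consonant; the function names are illustrative and not from the slides.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' after a consonant counts as a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of VC pairs in the form [C](VC)^m[V]."""
    pattern = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        # Collapse each run of consonants or vowels into one symbol.
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("VC")

for word in ["tree", "by", "trouble", "ivy", "troubles", "private"]:
    print(word, measure(word))
```

Here "tree" and "by" have measure 0, "trouble" and "ivy" have measure 1, and "troubles" and "private" have measure 2.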

5
Porter Algorithm Step 1
  • [Slide figure not recovered; annotation: Stem only vowels]

6
Porter Algorithm Step 2-4
  • Step 2: measure > 0
  • Step 3: measure > 0
  • Step 4: measure > 1
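
The measure conditions act as guards on suffix rules: a suffix is rewritten only if the stem left behind is long enough. The sketch below uses three rules from Porter's published rule set, one per step. It is heavily simplified: the real algorithm runs the steps in sequence, each with many more rules, while this version stops at the first matching suffix. The names `RULES` and `apply_rule` are illustrative.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC pairs in the form [C](VC)^m[V]."""
    pattern = []
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != kind:
            pattern.append(kind)
    return "".join(pattern).count("VC")

# (measure must exceed this value, suffix, replacement)
RULES = [
    (0, "ational", "ate"),  # Step 2: (m > 0) ATIONAL -> ATE
    (0, "icate", "ic"),     # Step 3: (m > 0) ICATE -> IC
    (1, "ance", ""),        # Step 4: (m > 1) ANCE -> (nothing)
]

def apply_rule(word):
    """Apply the first rule whose suffix matches, if the stem passes the guard."""
    for min_m, suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > min_m:
                return stem + replacement
            return word  # suffix matched but the stem is too short
    return word

for word in ["relational", "rational", "triplicate", "allowance", "dance"]:
    print(word, "->", apply_rule(word))
```

With these rules "relational" becomes "relate", but "rational" is left alone because its stem "r" has measure 0.
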
7
Dictionary Based Stemmers
  • Dictionary of stems
  • cf. vector-based methods
  • Dictionary of words
  • effective handling of irregular forms
  • Proper Name/Controlled Vocabulary Lists
  • Equivalent Term/Thesauri

8
Problems with stemming
  • Always worsens precision hoping to improve recall
  • Causes (sometimes odd) misretrieval
  • bled vs. bleeding
  • incorrect term conflation: plastered →
    plaster
  • Do we really want to improve recall on the web?
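
The trade-off can be shown on a toy collection. Everything in this sketch is made up for illustration: the three documents, the relevance judgements, and the crude "strip -ed" stemmer, which is not Porter's algorithm. It only shows how conflating plastered with plaster retrieves more relevant documents and an irrelevant one as well.

```python
# Hypothetical collection and relevance judgements for the query "plaster".
docs = {
    1: "fresh plaster on the wall",
    2: "the builder plastered the ceiling",
    3: "he was plastered after the party",
}
relevant = {1, 2}

def crude_stem(word):
    """Illustrative stemmer only: strip a trailing 'ed'."""
    return word[:-2] if word.endswith("ed") else word

def search(query, stem):
    normalise = crude_stem if stem else (lambda w: w)
    target = normalise(query)
    return {doc_id for doc_id, text in docs.items()
            if target in {normalise(w) for w in text.split()}}

def precision_recall(retrieved):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    return precision, hits / len(relevant)

print(precision_recall(search("plaster", stem=False)))  # precision 1.0, recall 0.5
print(precision_recall(search("plaster", stem=True)))   # precision 2/3, recall 1.0
```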

9
N-Gram structures
  • Store keywords broken down into fixed length
    segments
  • e.g. trigrams: sea colony → sea col olo lon ony
  • useful as an index structure, for stemming and for
    spelling correction
  • e.g. the misspelling compuuter
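
A short sketch of both uses: splitting keywords into trigrams as in the sea colony example, and suggesting a correction for compuuter by trigram overlap. The Dice coefficient and the small vocabulary are assumptions chosen for illustration; the slides do not say which similarity measure is intended.

```python
def ngrams(word, n=3):
    """Return the overlapping n-character segments of one word."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def phrase_ngrams(text, n=3):
    """Break every whitespace-separated keyword into n-grams."""
    return [gram for word in text.split() for gram in ngrams(word, n)]

def dice(a, b, n=3):
    """Dice coefficient over the two words' n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def suggest(word, vocabulary, n=3):
    """Return the vocabulary entry sharing the most n-grams with word."""
    return max(vocabulary, key=lambda candidate: dice(word, candidate, n))

vocabulary = ["computer", "commuter", "compute", "colony"]  # hypothetical word list

print(phrase_ngrams("sea colony"))       # ['sea', 'col', 'olo', 'lon', 'ony']
print(suggest("compuuter", vocabulary))  # computer
```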

10
Index Data Structures
  • Inverted Files
  • PAT Data Structure
  • tree-based substrings
  • Signature Files
  • Hypertext Data Structure

11
Inverted Files
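  • Alice → 1, 5, 51182
  • [Slide figure: documents numbered 1, 2, 5, 42, 887,
    51182; the entry for Alice lists the documents that
    contain it]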
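
An inverted file maps each keyword to the sorted list of documents containing it. The sketch below reuses the document numbers from the slide; the document texts are invented so that Alice occurs in documents 1, 5 and 51182. Tokenising on whitespace is a simplifying assumption.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased term to a sorted list of document ids."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)

def query_and(index, *terms):
    """Documents containing every term (intersection of posting lists)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical texts; only the document numbers come from the slide.
docs = {
    1: "Alice was sitting in the great arm-chair",
    2: "the kitten had a ball of worsted",
    5: "the kitten looked at Alice",
    42: "a grand game of romps",
    887: "the ball of worsted",
    51182: "Alice had been talking to herself",
}

index = build_inverted_index(docs)
print(index["alice"])                       # [1, 5, 51182]
print(query_and(index, "alice", "kitten"))  # [5]
```
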
12
Inverted Files Supporting Proximity
  • Alice → documents 1, 5, 51182
  • Document text: "while Alice was sitting curled up in
    a corner of the great arm-chair, half talking to
    herself and half asleep, the kitten had been having
    a grand game of romps with the ball of worsted
    Alice had"
  • Positions of Alice within the document: 167, 201, ...
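
To support proximity queries, each posting also records where the term occurs inside the document. This sketch stores word offsets, which is an assumption: the positions 167 and 201 on the slide may count something else, such as characters. The second document and the `within` helper are illustrative.

```python
import re
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word offsets]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[term].setdefault(doc_id, []).append(position)
    return index

def within(index, term_a, term_b, k):
    """Documents where term_a and term_b occur at most k words apart."""
    hits = set()
    postings_a, postings_b = index.get(term_a, {}), index.get(term_b, {})
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(a - b) <= k
               for a in postings_a[doc_id] for b in postings_b[doc_id]):
            hits.add(doc_id)
    return hits

docs = {
    1: ("while Alice was sitting curled up in a corner of the great "
        "arm-chair, half talking to herself and half asleep, the kitten "
        "had been having a grand game of romps with the ball of worsted "
        "Alice had"),
    2: "the kitten played with the worsted",  # hypothetical second document
}

index = build_positional_index(docs)
print(index["alice"][1])                     # [1, 36]
print(within(index, "alice", "worsted", 1))  # {1}
print(within(index, "alice", "kitten", 5))   # set()
```
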
13
Hypertext Data Structure
  • Nodes and Links
  • File types imply a program to interpret
    (Display/play) the data
  • Tags in HTML imply how to load referenced data
  • protocol
  • server
  • location at server

14
URL Example
  • protocol: http://
  • server: www.cet.sunderland.ac.uk/
  • location: cs0jel/teaching/com268/Lglass.asc
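
The same three parts can be recovered with Python's standard `urllib.parse` module. The URL string below is reassembled from the slide fragments, assuming the colon after `http` and the unbroken host name were lost in extraction.

```python
from urllib.parse import urlsplit

# Reassembled from the slide; the "://" is an assumption.
url = "http://www.cet.sunderland.ac.uk/cs0jel/teaching/com268/Lglass.asc"

parts = urlsplit(url)
protocol = parts.scheme  # how to fetch the data
server = parts.netloc    # which host to contact
location = parts.path    # where the file lives on that host

print(protocol)  # http
print(server)    # www.cet.sunderland.ac.uk
print(location)  # /cs0jel/teaching/com268/Lglass.asc
```
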
15
The Web
16
Conclusions
  • Stemmers
  • Porter's Algorithm
  • Dictionary Based
  • disadvantages
  • Inverted Files
  • Hypertext
  • N-grams - other Data Structures