Modern Information Retrieval Chapter 1: Introduction

Slides: 26
Provided by: hsinhs

1
Modern Information Retrieval Chapter 1
Introduction
  • Ricardo Baeza-Yates
  • Berthier Ribeiro-Neto

2
Motivation
  • Example of the user information need
  • Topic: NCAA college tennis team
  • Description: Find all the pages (documents)
    containing information on college tennis teams
    which (1) are maintained by a university in the
    USA and (2) participate in the NCAA tennis
    tournament.
  • Narrative: To be relevant, the page must include
    information on the national ranking of the team
    in the last three years and the email or phone
    number of the team coach.

3
IR Research
  • Information retrieval vs Data retrieval
  • Research
  • information search
  • information filtering (routing)
  • document classification and categorization
  • user interfaces and data visualization
  • cross-language retrieval

4
IR History
  • 1970
  • 1990, WWW

5
The User Task
  • Retrieval (Searching)
  • classic information search process where clear
    objectives are defined
  • Browsing
  • a process where one's main objectives are not
    clearly defined and might change during the
    interaction with the system

6
Logical View of the Documents
  • Text Operations
  • reduce the complexity of the document
    representation
  • a full text → a set of index terms
  • Steps
  • 1. Stopword removal
  • 2. Stemming
  • 3. Noun groups
  • 4. ...
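A minimal sketch of the first step (stopword removal), assuming a tiny hand-picked stopword list. The list and the function name `index_terms` are hypothetical and only illustrate how a full text shrinks to a set of index terms:

```python
# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "of", "on", "in", "and", "all", "to", "be"}

def index_terms(text):
    """Reduce a full text to a sorted list of distinct non-stopword terms."""
    words = [w.strip(".,()").lower() for w in text.split()]
    return sorted({w for w in words if w and w not in STOPWORDS})

terms = index_terms(
    "Find all the pages containing information on college tennis teams"
)
print(terms)
```

Stemming and noun-group detection would then be applied to the surviving terms.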

7
Past, Present, and Future
  • Early Development
  • Index
  • Library
  • Author name, title, subject headings, keywords
  • The Web and Digital Libraries
  • Hyperlinks

8
Resources
  • Journals
  • Journal of the American Society for Information
    Science
  • ACM Transactions on Information Systems
  • Information Processing and Management
  • Information Systems (Elsevier)
  • Knowledge and Information Systems (Springer)
  • Conferences
  • ACM SIGIR, DL, CIKM, CHI, etc.
  • Text Retrieval Conference (TREC)

9
Conventional Text-Retrieval Systems
  • Automatic Text Processing, G. Salton,
    Addison-Wesley, 1989 (Chapter 9)

10
Data Retrieval
  • A specified set of attributes is used to
    characterize each record.
    EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)
  • Exact match between the attributes used in query
    formulations and those attached to the document.
    SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = 'John Smith'
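The exact-match behaviour can be tried with Python's built-in sqlite3 module. This is only a sketch: the two rows below are invented for illustration, and only the table layout and the query come from the slide.

```python
import sqlite3

# In-memory database; the row values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE "
    "(NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000-00-0001", "1965-01-09", "1 Example St", "M", 30000, 5),
        ("Jon Smith", "000-00-0002", "1970-03-12", "2 Example St", "M", 25000, 4),
    ],
)

# Exact match: only the record whose NAME equals the query value is returned.
rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()
print(rows)
conn.close()
```

The near-miss name "Jon Smith" is not returned, which is the point of contrast with text retrieval: there is no notion of a partial match.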

11
Text-Retrieval Systems
  • Content identifiers (keywords, index terms,
    descriptors) characterize the stored texts.
  • Degrees of coincidence between the sets of
    identifiers attached to queries and documents
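
One simple way to picture a "degree of coincidence" is to count the identifiers a query and a document share. The slide does not fix a particular measure, so the overlap count below is an assumption, and the documents are made up:

```python
def coincidence(query_ids, doc_ids):
    """Number of identifiers shared by the query and the document."""
    return len(set(query_ids) & set(doc_ids))

query = {"college", "tennis", "ncaa"}

# Hypothetical documents, each described by a set of content identifiers.
documents = {
    "doc_a": {"college", "tennis", "ranking"},
    "doc_b": {"ncaa", "basketball"},
    "doc_c": {"library", "index"},
}

overlaps = {name: coincidence(query, ids) for name, ids in documents.items()}
ranked = sorted(overlaps, key=overlaps.get, reverse=True)
print(overlaps, ranked)
```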

[Figure labels: content analysis; query formulation]

12
Possible Representation
  • Document representation
  • unweighted index terms (term vectors)
  • weighted index terms
  • Query
  • unweighted or weighted index terms
  • Boolean combinations (or, and, not)
  • Search operation must be effective

13
File Structures
  • Main requirements
  • fast-access for various kinds of searches
  • large number of indices
  • Alternatives
  • Inverted Files
  • Signature Files
  • PAT trees

14
Inverted Files
  • File is represented as an array of indexed
    documents.

15
Inverted-file process
  • The document-term array is inverted (transposed).
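
A short sketch of the inversion step, assuming a 4-document, 4-term 0/1 array. The values are invented, except that the rows for term2 and term3 are chosen to agree with the query example on the next slide:

```python
docs = ["D1", "D2", "D3", "D4"]
terms = ["term1", "term2", "term3", "term4"]

# Document-term array: one row per document, one column per term.
doc_term = [
    [1, 1, 0, 1],  # D1
    [0, 1, 1, 0],  # D2
    [1, 0, 1, 1],  # D3
    [0, 0, 1, 1],  # D4
]

# Inversion is a transpose: one row per term, one column per document.
term_doc = [list(column) for column in zip(*doc_term)]

# The same information as posting lists of document identifiers.
inverted = {
    term: [doc for doc, flag in zip(docs, row) if flag]
    for term, row in zip(terms, term_doc)
}
print(term_doc)
print(inverted)
```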

16
Inverted-file process (Continued)
  • Take two or more rows of an inverted
    term-document array, and produce a single
    combined list of document identifiers.
  • Example: Query (term2 and term3)
  • term2    1 1 0 0
    term3    0 1 1 1
    ----------------
    result   0 1 0 0   <-- D2
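
The query example can be reproduced with a bitwise AND over the two rows. The column labels D1 to D4 are an assumption; the slide only confirms that the second column is D2.

```python
docs = ["D1", "D2", "D3", "D4"]  # assumed column labels
term_doc = {
    "term2": [1, 1, 0, 0],
    "term3": [0, 1, 1, 1],
}

def boolean_and(row_a, row_b):
    """Combine two 0/1 rows position by position with AND."""
    return [a & b for a, b in zip(row_a, row_b)]

combined = boolean_and(term_doc["term2"], term_doc["term3"])
answer = [doc for doc, flag in zip(docs, combined) if flag]
print(combined, answer)
```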

17
List-merging for two ordered lists
  • The inverted-index operations to obtain answers
    are based on a list-merging process.
  • Example
    T1: D1, D3
    T2: D1, D2
    Merged(T1, T2): D1, D1, D2, D3
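
A minimal sketch of merging two ordered lists, using the T1 and T2 lists from the example. The last two lines show one common reading that the slide does not spell out: identifiers that appear twice in the merged list match both terms (AND), and the distinct identifiers match either term (OR).

```python
def merge_ordered(list_a, list_b):
    """Merge two sorted lists into one sorted list, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] <= list_b[j]:
            merged.append(list_a[i])
            i += 1
        else:
            merged.append(list_b[j])
            j += 1
    merged.extend(list_a[i:])  # at most one of these two tails is non-empty
    merged.extend(list_b[j:])
    return merged

# String comparison is enough here because the identifiers are single-digit.
t1 = ["D1", "D3"]
t2 = ["D1", "D2"]
merged = merge_ordered(t1, t2)

both = [d for d, nxt in zip(merged, merged[1:]) if d == nxt]  # AND (assumed reading)
either = sorted(set(merged))                                  # OR (assumed reading)
print(merged, both, either)
```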

18
Extensions of Inverted Index Operations (Distance
Constraints)
  • Distance Constraints
  • (A within sentence B): terms A and B must co-occur
    in a common sentence
  • (A adjacent B): terms A and B must occur
    adjacently in the text

19
Extensions of Inverted Index Operations (Distance
Constraints)
  • Implementation
  • include term-location in the inverted indexes
    information: P345, P348, P350, ...
    retrieval: P123, P128, P345, ...
  • include sentence-location in the indexes
    information: (P345, 25), (P345, 37), (P348, 10),
    (P350, 8)
    retrieval: (P123, 5), (P128, 25), (P345, 37),
    (P345, 40)
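
With sentence locations in the index, a "within sentence" query reduces to intersecting (document, sentence) pairs. A small sketch using the postings listed above; the function name `within_sentence` is hypothetical:

```python
# Postings as (document, sentence number) pairs, taken from the slide.
postings = {
    "information": [("P345", 25), ("P345", 37), ("P348", 10), ("P350", 8)],
    "retrieval": [("P123", 5), ("P128", 25), ("P345", 37), ("P345", 40)],
}

def within_sentence(term_a, term_b, index):
    """Return the (document, sentence) pairs in which both terms occur."""
    return sorted(set(index[term_a]) & set(index[term_b]))

matches = within_sentence("information", "retrieval", postings)
print(matches)
```

Only sentence 37 of P345 contains both terms, so it is the single match.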

20
Extensions of Inverted Index Operations (Distance
Constraints)
  • Include paragraph numbers in the indexes,
    sentence numbers within paragraphs, and
    word numbers within sentences
    information: P345, 2, 3, 5
    retrieval: P345, 2, 3, 6
  • Query examples
    (information adjacent retrieval)
    (information within five words retrieval)
  • Cost: the size of indexes
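
A sketch of how the two query examples could be evaluated against word-level postings. It assumes that distance is measured only between occurrences in the same sentence, and that "adjacent" means a word gap of exactly 1 in either order; the slide states neither, and the function names are hypothetical.

```python
# Postings as (document, paragraph, sentence, word) tuples, from the slide.
postings = {
    "information": [("P345", 2, 3, 5)],
    "retrieval": [("P345", 2, 3, 6)],
}

def word_gaps(term_a, term_b, index):
    """Word distances between occurrences that share a sentence."""
    gaps = []
    for doc_a, par_a, sen_a, word_a in index[term_a]:
        for doc_b, par_b, sen_b, word_b in index[term_b]:
            if (doc_a, par_a, sen_a) == (doc_b, par_b, sen_b):
                gaps.append(abs(word_a - word_b))
    return gaps

def adjacent(term_a, term_b, index):
    return any(gap == 1 for gap in word_gaps(term_a, term_b, index))

def within_words(term_a, term_b, limit, index):
    return any(gap <= limit for gap in word_gaps(term_a, term_b, index))

is_adjacent = adjacent("information", "retrieval", postings)
is_within_five = within_words("information", "retrieval", 5, postings)
print(is_adjacent, is_within_five)
```

Every occurrence now carries three extra numbers, which is the index-size cost the slide mentions.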

21
Term Weights
  • Term weights: Di = (Ti1, 0.2; Ti2, 0.5; Ti3, 0.6)
  • Issues
  • How to generate the term weights?
  • How to apply the term weights?
  • Sum the weights of all document terms that match
    the given query.
  • Rank the output documents in descending order
    of the summed weight.
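
A minimal sketch of the sum-and-rank procedure. The document vectors are borrowed from the Boolean example later in these slides; the query terms and the choice of weight 0.0 for a missing term are assumptions.

```python
documents = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}
query_terms = ["T1", "T2"]  # hypothetical query

def score(doc_vector, terms):
    """Sum the weights of the document terms that match the query."""
    return sum(doc_vector.get(term, 0.0) for term in terms)

# Rank documents by summed weight, highest first.
ranking = sorted(
    ((name, score(vector, query_terms)) for name, vector in documents.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, total in ranking:
    print(name, round(total, 2))
```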

22
Boolean Query with Term Weights
  • Transform a Boolean expression into disjunctive
    normal form: T1 and (T2 or T3) = (T1 and T2)
    or (T1 and T3)
  • For each conjunct, compute the minimum term
    weight of any document term in that conjunct.
  • The document weight is the maximum of all the
    conjunct weights.

23
Boolean Query with Term Weights
  • Example: Q = (T1 and T2) or T3

    Document vectors              Conjunct weights     Query weight
                                  (T1 and T2)  (T3)    (T1 and T2) or T3
    D1 (T1,0.2; T2,0.5; T3,0.6)   0.2          0.6     0.6
    D2 (T1,0.7; T2,0.2; T3,0.1)   0.2          0.1     0.2

  • D1 is preferred.
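
The min/max rule can be checked against this example with a short sketch. The query is written directly in disjunctive normal form as a list of conjuncts; treating a missing term as weight 0.0 is an assumption, and the function names are hypothetical.

```python
documents = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},
}

# (T1 and T2) or T3, as a list of conjuncts (disjunctive normal form).
query_dnf = [["T1", "T2"], ["T3"]]

def conjunct_weight(doc_vector, conjunct):
    """Minimum weight among the document's terms in one conjunct."""
    return min(doc_vector.get(term, 0.0) for term in conjunct)

def document_weight(doc_vector, dnf):
    """Maximum over all conjunct weights."""
    return max(conjunct_weight(doc_vector, conjunct) for conjunct in dnf)

scores = {name: document_weight(vec, query_dnf) for name, vec in documents.items()}
preferred = max(scores, key=scores.get)
print(scores, preferred)
```

D1 scores 0.6 and D2 scores 0.2, matching the table.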

24
Stemming
  • Term Truncation
  • Remove suffixes and/or prefixes from context
    terms.
  • Example: PSYCH: psychiatrist, psychiatry,
    psychiatric, psychology, psychological, ...
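
Term truncation can be sketched as a prefix match of the truncated stem against a vocabulary. The vocabulary below is the slide's word list plus one invented non-matching word ("physics"); `expand_truncated` is a hypothetical name.

```python
vocabulary = [
    "psychiatrist",
    "psychiatry",
    "psychiatric",
    "psychology",
    "psychological",
    "physics",  # invented non-match, for contrast
]

def expand_truncated(stem, vocab):
    """Return every vocabulary word that begins with the truncated stem."""
    stem = stem.lower()
    return [word for word in vocab if word.lower().startswith(stem)]

matches = expand_truncated("PSYCH", vocabulary)
print(matches)
```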

25
Summary