CS511 Design of Database Management Systems - PowerPoint PPT Presentation

Provided by: kevinc65
Transcript and Presenter's Notes


1
CS511 Design of Database Management Systems
  • Lecture 13
  • Information Retrieval Overview
  • Kevin C. Chang

2
Announcements
  • MT format
  • Wednesday 2:00-3:15pm
  • open notes, papers, books. Calculator OK (won't
    need it). No PDAs.
  • 75 points (for 75 minutes)
  • 4 problems
  • Prob. 1: True/False problems
  • Prob. 2-4: longer problems
  • Preparation
  • study lecture notes, HW, SGP; use them to review
    papers
  • ask why ask that...
  • discussion with peers
  • think more (beyond stated) and try to relate
    issues

3
Some History
  • Early Days
  • 1945: V. Bush's article "As We May Think"
  • 1957: H. P. Luhn's idea of word counting and
    matching
  • Indexing & Evaluation Methodology (1960s)
  • SMART system (G. Salton's group)
  • Cranfield test collection (C. Cleverdon's group)
  • Indexing: automatic can be as good as manual
  • IR Models (1970s & 1980s)
  • Large-scale Evaluation & Applications (1990s)
  • TREC (D. Harman & E. Voorhees, NIST)
  • Large-scale Web search

4
Text Search vs. Database Queries
  • Two related areas
  • information retrieval (IR)
  • databases
  • traditionally separate, brought together by the
    Web
  • Any differences in
  • data models?
  • query semantics?
  • desirable functionalities?

5
Text vs. Rel. DB: Art vs. Algebra
  • Data models
  • unstructured text vs. well-structured data
  • Query semantics
  • fuzzy vs. well-defined
  • text search: to satisfy "information need" <--
    art
  • DB queries: to perform data computation <--
    algebra
  • relevant vs. correct answers
  • ranked vs. Boolean answers
  • Functionalities
  • read-mostly vs. read-write/transactions/cc ...

6
Recall: Measuring False-Negatives
  • Recall = x / |relevant|, where x = |relevant ∩
    retrieved|
  • e.g., relevant = {D1, D2}, retrieved = {D1, D3,
    D4}
  • recall R = 1 / 2 = 0.5
  • there is 1 false negative: D2
  • How to fool recall?

[Figure: Venn diagram of the collection; x is the overlap of the relevant and retrieved sets]
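
The recall computation on this slide can be sketched in a few lines of Python, using the slide's example sets. The function name `recall` and the four-document collection are illustrative assumptions, not part of the lecture. The last line shows one way to "fool" recall: retrieve the whole collection.

```python
def recall(relevant, retrieved):
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when nothing is relevant")
    return len(relevant & retrieved) / len(relevant)


relevant = {"D1", "D2"}
retrieved = {"D1", "D3", "D4"}
collection = {"D1", "D2", "D3", "D4"}  # assumed: the slide does not list it

r = recall(relevant, retrieved)              # 1 / 2 = 0.5
false_negatives = relevant - retrieved       # {"D2"}
r_everything = recall(relevant, collection)  # retrieving everything gives 1.0
```

Because returning every document always yields recall 1.0, recall alone says little about result quality.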
7
Precision: Measuring False-Positives
  • Precision = x / |retrieved|, where x = |relevant ∩
    retrieved|
  • e.g., relevant = {D1, D2}, retrieved = {D1, D3,
    D4}
  • precision P = 1 / 3 = 0.33
  • there are 2 false positives: D3 and D4

[Figure: Venn diagram of the collection; x is the overlap of the relevant and retrieved sets]
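
A minimal Python sketch of the precision computation, again on the slide's example sets. The function name `precision` is an illustrative choice, not something from the lecture.

```python
def precision(relevant, retrieved):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        raise ValueError("precision is undefined when nothing is retrieved")
    return len(relevant & retrieved) / len(retrieved)


relevant = {"D1", "D2"}
retrieved = {"D1", "D3", "D4"}

p = precision(relevant, retrieved)               # 1 / 3
false_positives = sorted(retrieved - relevant)   # ["D3", "D4"]
```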
8
Models
  • Boolean: criteria-based
  • Vector space: similarity-based
  • Probabilistic: probability-based

9
Boolean Model
  • Query
  • Q1: data AND web
  • Q2: (knowledge OR information) AND base
  • Q3: data NOT info
  • Documents
  • D1: web data and web queries
  • D2: digital data index
  • D3: data base for dummies

10
Boolean Model
  • View: Satisfaction by criteria
  • Query: a Boolean expression
  • Q1: data AND web
  • Document: a Boolean conjunction
  • D1: web data and web queries
  • web AND data AND queries
  • Query results
  • {D | D implies Q}, i.e., all docs that satisfy Q
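
The Boolean view can be sketched in Python with the queries and documents from the previous slide. This is only an illustration: each document is treated as the set of its terms, each query is written as a predicate over that set, and Q3 ("data NOT info") is assumed to mean "data AND NOT info". The names `TERM_SETS`, `QUERIES`, and `evaluate` are hypothetical.

```python
DOCS = {
    "D1": "web data and web queries",
    "D2": "digital data index",
    "D3": "data base for dummies",
}

# A document is viewed as the conjunction (here: the set) of its terms.
TERM_SETS = {doc_id: set(text.split()) for doc_id, text in DOCS.items()}

# Each query is a Boolean predicate over a document's term set.
QUERIES = {
    "Q1": lambda t: "data" in t and "web" in t,
    "Q2": lambda t: ("knowledge" in t or "information" in t) and "base" in t,
    "Q3": lambda t: "data" in t and "info" not in t,  # assumed reading of NOT
}


def evaluate(query):
    """Return all documents D such that D implies (satisfies) the query."""
    return sorted(d for d, terms in TERM_SETS.items() if query(terms))


results = {name: evaluate(q) for name, q in QUERIES.items()}
# Q1 -> ["D1"], Q2 -> [], Q3 -> ["D1", "D2", "D3"]
```

Q2 returns nothing even though D3 is about a "data base", and Q3 returns everything, which illustrates how exact matching can yield too few or too many results.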

11
Boolean Queries: Problems
  • Query matching is exact and not flexible
  • exact matching can result in too few/many matches
  • Hard to formulate the right query
  • what is the query for documents about color
    printers?
  • Results are not ranked/ordered for exploration
  • Boolean is binary: yes or no
  • In short: relevance is not captured
  • traditional DB queries are similarly bad at
    fuzzy concepts
  • new research work in top-k queries

12
Vector Space Model
  • View: Similarity of content
  • Intuitions
  • docs consist of words --> put docs in the word
    space
  • space: n dimensions for n words
  • similarity becomes geometric comparison
  • document-query similarity = vector-vector
    similarity

[Figure: document vector D and query vector Q in the word space]
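
One common way to make the "geometric comparison" concrete is the cosine of the angle between the two vectors. The slide does not name a specific measure, so cosine similarity and raw term counts as vector components are assumptions here, as are the function names. The documents are reused from the Boolean model slides.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as term counts: one dimension per word."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = {
    "D1": "web data and web queries",
    "D2": "digital data index",
    "D3": "data base for dummies",
}
query = to_vector("web data")

scores = {d: cosine(query, to_vector(text)) for d, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # ["D1", "D2", "D3"]
```

Unlike the Boolean model, every document gets a graded score, so the results can be ranked.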
13
Probabilistic Models
  • View: Probability of relevance
  • the probabilistic ranking principle
  • Estimate and rank by P(R | Q, D)
  • or by log-odds

14
Probabilistic Models
  • To rank by
  • I.e., (see next page)
  • Assume p_i is the same for all query terms
  • Assume q_i = n_i / N
  • N is the collection size, i.e., all docs are
    assumed irrelevant
  • Similar to using IDF
  • intuition: e.g., "apple computer" in a computer DB
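
The ranking formula itself did not survive in this transcript. The sketch below assumes the standard binary independence form, in which each query term present in a document contributes log[p_i (1 - q_i) / (q_i (1 - p_i))], with p_i and q_i the probabilities that term i occurs in a relevant and in an irrelevant document. Under the two assumptions above (constant p_i, q_i = n_i / N) the weight behaves much like IDF. The collection size and document frequencies are hypothetical numbers chosen to mimic the "apple computer in a computer DB" intuition.

```python
import math


def term_weight(n_i, N, p=0.5):
    """log[p(1-q) / (q(1-p))] with q = n_i / N (requires 0 < n_i < N)."""
    q = n_i / N
    return math.log(p / (1 - p)) + math.log((1 - q) / q)


def idf(n_i, N):
    """Plain inverse document frequency, for comparison."""
    return math.log(N / n_i)


N = 1000                                   # hypothetical collection size
doc_freq = {"apple": 50, "computer": 900}  # hypothetical n_i values

weights = {t: term_weight(n, N) for t, n in doc_freq.items()}
idfs = {t: idf(n, N) for t, n in doc_freq.items()}
# "apple" is rare in a computer collection, so it gets the larger weight;
# "computer" appears almost everywhere and contributes little or nothing.
```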

15
Probabilistic Models
  • To rank by

16
System Architecture
[Figure: system architecture. Components: docs, INDEXING, Doc Rep, User, query, Query Rep, Ranking, results]
17
Technique: Term Selection/Weighting
  • Basis for matching query with document
  • Query and document should be represented using
    the same units/terms
  • Controlled vocabulary vs. full text indexing

18
What is a good indexing term?
  • Specific (phrases) or general (single word)?
  • Luhn found that words with middle frequency are
    most useful
  • Not too specific
  • Not too general
  • All words or a (controlled) subset?
  • When term weighting is used, it is a matter of
    weighting, not selecting, indexing terms
  • more later
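
Luhn's middle-frequency observation can be illustrated with a small Python sketch. It is only an approximation of the idea: it uses document frequency as the frequency measure, and the cutoffs `low` and `high` as well as the sample texts are hypothetical.

```python
from collections import Counter


def mid_frequency_terms(docs, low, high):
    """Terms whose document frequency df satisfies low <= df <= high."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))  # count each term once per doc
    return sorted(t for t, n in df.items() if low <= n <= high)


docs = [
    "the color printer is on the desk",
    "the laser printer is fast",
    "the color of the sky is blue",
    "the database is fast",
]

selected = mid_frequency_terms(docs, low=2, high=3)
# -> ["color", "fast", "printer"]
# "the" and "is" are too general (in every doc); "desk" is too specific.
```

With term weighting, the hard cutoffs would be replaced by a weight per term, so no term has to be discarded outright.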

19
Technique: Stemming
  • Words with similar meanings should be mapped to
    the same indexing term
  • Stemming: mapping all inflectional forms of words
    to the same root form, e.g.
  • computer -> compute
  • computation -> compute
  • computing -> compute
  • Porter's stemmer is popular for English
  • In general: clustering of synonymous words
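
A toy suffix-stripping stemmer in Python shows the basic idea. It is not Porter's algorithm: the suffix list and the minimum stem length are assumptions made for this example, and the resulting stem is "comput" rather than the slide's "compute". What matters for indexing is only that all forms map to the same term.

```python
# Hypothetical suffix list, checked in order; longer suffixes come first.
SUFFIXES = ("ation", "ing", "er", "s", "e")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


stems = {w: toy_stem(w) for w in ["computer", "computation", "computing"]}
# All three words map to the same indexing term, "comput".
```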

20
Technique: Stopwords
  • A common word that bears little semantic
    content
  • prepositions: for, on, ...
  • articles: a, an, the
  • non-informative words (collection specific)
  • e.g., "database" in this class
  • e.g., "PC" in a computer collection
  • You can search the Web for stopword lists
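
Stopword removal amounts to a set lookup per token. The sketch below uses only the example words from this slide; a real list would be much longer, and the names are illustrative.

```python
GENERAL_STOPWORDS = {"for", "on", "a", "an", "the"}  # examples from the slide
COLLECTION_STOPWORDS = {"pc"}  # collection specific: a computer collection


def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]


tokens = remove_stopwords(
    "A PC for the office",
    GENERAL_STOPWORDS | COLLECTION_STOPWORDS,
)
# -> ["office"]
```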

21
Technique: Relevance Feedback (or Query
Modification)
  • Motivation: it is easier to judge results than to
    formulate the right queries

22
Pseudo Feedback
  • Motivation: top results are often relevant

[Figure: pseudo feedback loop. Query --> Retrieval Engine (over the document collection) --> Results (d1 3.5, d2 2.4, ..., dk 0.5); the top 10 results are taken as the judgments; Feedback --> Updated query --> Retrieval Engine]
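
A minimal Python sketch of the loop, under several assumptions: the retrieval engine is a toy that scores a document by counting query-term occurrences, the top k results are simply assumed relevant, and the query is updated by appending the m most frequent new terms from those results. The slide uses the top 10; k is set to 2 here because the sample collection is tiny. All names and the sample documents are hypothetical.

```python
from collections import Counter


def score(query_terms, text):
    """Toy score: total occurrences of the query terms in the text."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)


def search(query_terms, docs):
    """Rank documents by score, dropping those that match nothing."""
    ranked = sorted(docs, key=lambda d: score(query_terms, docs[d]), reverse=True)
    return [d for d in ranked if score(query_terms, docs[d]) > 0]


def pseudo_feedback(query_terms, docs, k=2, m=1):
    """Assume the top-k results are relevant and expand the query with
    the m most frequent terms in them that are not already in the query."""
    top = search(query_terms, docs)[:k]
    pool = Counter()
    for d in top:
        pool.update(t for t in docs[d].lower().split() if t not in query_terms)
    return list(query_terms) + [t for t, _ in pool.most_common(m)]


docs = {
    "d1": "color printer ink",
    "d2": "printer ink cartridge",
    "d3": "laser toner ink",
    "d4": "database index",
}

before = search(["printer"], docs)           # ["d1", "d2"]
updated = pseudo_feedback(["printer"], docs)  # ["printer", "ink"]
after = search(updated, docs)                # now also finds "d3"
```

No user judgment is involved, so the updated query is only as good as the assumption that the top results were relevant.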
23
Technique: Inverted List
  • ti -> <d1, ...>, ..., <dn, ...>
  • E.g.,
  • color -> <d1, ...>, <d2, ...>, <d5, ...>
  • printer -> <d2, ...>, <d5, ...>, <d8, ...>
  • How to evaluate Q = color AND printer?
  • How to evaluate Q = "color printer"?
  • what info to maintain in each entry?
  • More later
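
One possible answer to these questions, sketched in Python: keep word positions in each entry, evaluate AND by intersecting the two lists, and evaluate the phrase by checking for adjacent positions. Storing positions is an assumption made for this example, not necessarily what the lecture goes on to propose, and the document texts are hypothetical, chosen so that the lists match the slide (color: d1, d2, d5; printer: d2, d5, d8).

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def and_query(index, t1, t2):
    """Documents that appear in both terms' lists."""
    return sorted(set(index.get(t1, {})) & set(index.get(t2, {})))


def phrase_query(index, t1, t2):
    """Documents where t2 occurs immediately after t1."""
    hits = []
    for doc_id in and_query(index, t1, t2):
        second = set(index[t2][doc_id])
        if any(p + 1 in second for p in index[t1][doc_id]):
            hits.append(doc_id)
    return hits


docs = {
    "d1": "color photo",
    "d2": "color laser printer",
    "d5": "color printer",
    "d8": "printer ink",
}
index = build_index(docs)

both = and_query(index, "color", "printer")       # ["d2", "d5"]
phrase = phrase_query(index, "color", "printer")  # ["d5"]
```

Without positions in each entry, the phrase query could not be told apart from the AND query.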

24
DB Meets IR
  • Multimedia databases
  • relational data + text, images, audio, video
  • Fuzzy retrieval for relational data
  • similarity, preference-based queries
  • e.g., product search in e-commerce
  • XML represents text-based data
  • IR type search will be helpful
  • how can we extend it to retrieve XML documents?

25
Web Search
  • Text IR as natural starting point
  • Web as a collection of HTML documents
  • find pages that satisfy an information need
  • Web search as killer app of IR!
  • Web search vs. traditional document search
  • how are they related?
  • any differences or new issues?
  • why do search engines give lousy results?

26
Web Search: New Issues and Challenges
  • Highly topic-heterogeneous documents
  • notion of collection lost
  • stopwords, idf scheme for term selection/weighting
    challenged
  • Structured/semi-structured documents
  • Highly linked pages: collection no longer flat
  • how to use links cleverly: link analysis (more
    in TW2)
  • ideas from social networks for standing or
    importance
  • Extremely large scale: billions of docs and counting
  • Many documents/data hidden behind databases
  • Multi-lingual documents
  • Spamming

27
What's Next
  • Vector space model

28
End of Talk