CS511 Design of Database Management Systems

1
CS511 Design of Database Management Systems

Lecture 13
Information Retrieval Overview
Kevin C. Chang

2
Announcements

MT format
Wednesday 2:00-3:15pm
open notes, papers, books. Calc. OK (won't need).
PDA: no.
75 points (for 75 minutes)
4 problems
Prob. 1: True/False problems
Prob. 2-4: longer problems
Preparation
study lecture notes, HW, SGP: use them to review
papers
ask why ask that...
discussion with peers
think more (beyond stated) and try to relate
issues

3
Some History

Early Days--
1945: V. Bush's article "As We May Think"
1957: H. P. Luhn's idea of word counting and
matching
Indexing & Evaluation Methodology (1960s)
SMART system (G. Salton's group)
Cranfield test collection (C. Cleverdon's group)
Indexing: automatic can be as good as manual
IR Models (1970s & 1980s)
Large-scale Evaluation & Applications (1990s)
TREC (D. Harman & E. Voorhees, NIST)
Large-scale Web search

4
Text Search vs. Database Queries

Two related areas
information retrieval (IR)
databases
traditionally separate-- brought together by the
Web
Any differences in
data models?
query semantics?
desirable functionalities?

5
Text vs. Rel. DB: Art vs. Algebra

Data models
unstructured text vs. well-structured data
Query semantics
fuzzy vs. well-defined
text search: to satisfy "information need" <-- art
DB queries: to perform data computation <-- algebra
relevant vs. correct answers
ranked vs. Boolean answers
Functionalities
read-mostly vs. read-write/transactions/cc ...

6
Recall: Measuring False-Negatives

Recall = |x| / |relevant|
e.g., relevant = {D1, D2}, retrieved = {D1, D3, D4}
recall R = 1 / 2 = 0.5
there is 1 false negative: D2
How to fool recall?

[Figure: Venn diagram of the collection; x is the overlap of the relevant and retrieved sets]
7
Precision: Measuring False-Positives

Precision = |x| / |retrieved|
e.g., relevant = {D1, D2}, retrieved = {D1, D3, D4}
precision P = 1 / 3 = 0.33
there are 2 false positives: D3 and D4

[Figure: Venn diagram of the collection; x is the overlap of the relevant and retrieved sets]
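
The recall and precision examples on slides 6 and 7 can be checked with a few lines of Python. This is a minimal sketch using sets of document ids. The 8-document `collection` is a made-up assumption, added only to show one way to "fool" recall: retrieve everything.

```python
def recall(relevant, retrieved):
    """Fraction of the relevant docs that were retrieved."""
    return len(relevant & retrieved) / len(relevant)


def precision(relevant, retrieved):
    """Fraction of the retrieved docs that are relevant."""
    return len(relevant & retrieved) / len(retrieved)


# Sets from the slides.
relevant = {"D1", "D2"}
retrieved = {"D1", "D3", "D4"}

r = recall(relevant, retrieved)        # 1 / 2
p = precision(relevant, retrieved)     # 1 / 3

# Hypothetical collection (not from the slides): returning every
# document gives perfect recall but very poor precision.
collection = {"D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8"}
r_all = recall(relevant, collection)       # 2 / 2
p_all = precision(relevant, collection)    # 2 / 8
```

Recall alone is therefore not a sufficient measure, which is why it is paired with precision.
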
8
Models

Boolean: criteria-based
Vector space: similarity-based
Probabilistic: probability-based

9
Boolean Model

Query
Q1: data AND web
Q2: (knowledge OR information) AND base
Q3: data NOT info
Documents
D1: web data and web queries
D2: digital data index
D3: data base for dummies

10
Boolean Model

View: Satisfaction by criteria
Query: a Boolean expression
Q1: data AND web
Document: a Boolean conjunction
D1: web data and web queries
web AND data AND queries
Query results
{D | D implies Q}, i.e., all docs that satisfy Q
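
The Boolean model above can be sketched in Python by treating each document as the set of its words and each query as a predicate over that set. Writing the queries as Python lambdas is an illustrative assumption, not something the slides prescribe. Documents and queries are the ones from slide 9.

```python
docs = {
    "D1": "web data and web queries",
    "D2": "digital data index",
    "D3": "data base for dummies",
}


def terms(text):
    """A document viewed as a conjunction of its terms, i.e. a term set."""
    return set(text.lower().split())


# Each query is a Boolean expression over the document's term set.
queries = {
    "Q1": lambda t: "data" in t and "web" in t,
    "Q2": lambda t: ("knowledge" in t or "information" in t) and "base" in t,
    "Q3": lambda t: "data" in t and "info" not in t,
}


def boolean_search(query, docs):
    """Return ids of all documents whose term set satisfies the query."""
    return sorted(d for d, text in docs.items() if query(terms(text)))


results = {name: boolean_search(q, docs) for name, q in queries.items()}
```

On these documents Q1 matches only D1, Q2 matches nothing, and Q3 matches all three, and none of the result sets is ranked.
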

11
Boolean Queries: Problems

Query matching is exact and not flexible
exact matching can result in too few/many matches
Hard to formulate the right query
what is the query for documents about "color
printer"?
Results are not ranked/ordered for exploration
Boolean is binary: yes or no
In short: relevance not captured
traditional DB queries are similarly bad at
fuzzy concepts
new research work in top-k queries

12
Vector Space Model

View: Similarity of content
Intuitions
docs consist of words --> put docs in the word
space
space: n dimensions for n words
similarity becomes geometric comparison
document-query similarity = vector-vector
similarity

[Figure: document vector D and query vector Q in the word space]
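
The "geometric comparison" idea can be sketched with raw term counts as vector coordinates and the cosine of the angle between vectors as the similarity. Cosine is one common choice and is an assumption here, since this slide does not fix a particular measure or weighting. The documents are reused from the Boolean slides, and the query "data web" is made up for illustration.

```python
import math
from collections import Counter


def to_vector(text):
    """Sparse term-count vector: one dimension per distinct word."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = {
    "D1": "web data and web queries",
    "D2": "digital data index",
    "D3": "data base for dummies",
}
query = to_vector("data web")

scores = {d: cosine(query, to_vector(text)) for d, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Unlike the Boolean model, every document gets a score, so D2 and D3 are still returned (ranked below D1) even though they do not contain "web".
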
13
Probabilistic Models

View: Probability of relevance
the probabilistic ranking principle
Estimate and rank by P(R | Q, D)
or by log-odds

14
Probabilistic Models

To rank by
I.e., (see next page)
Assume p_i is the same for all query terms
Assume q_i = n_i / N
N is the collection size, i.e., treating all docs as
irrelevant
Similar to using IDF
intuition: e.g., "apple computer" in a computer DB
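
The ranking formula itself did not survive in this transcript. As a hedged sketch, assume the standard binary independence form, in which each query term that matches the document contributes log[p_i (1 - q_i) / (q_i (1 - p_i))]. Under the two assumptions above (constant p_i, q_i = n_i / N), that weight reduces to a constant plus log((N - n_i) / n_i), which behaves like IDF. The function name and the collection statistics below are made up for illustration.

```python
import math


def term_weight(n_i, N, p_i=0.5):
    """Per-term log-odds weight, assuming q_i = n_i / N and 0 < n_i < N.

    n_i: number of docs containing the term; N: collection size.
    p_i: assumed constant for every query term (0.5 makes its part vanish).
    """
    q_i = n_i / N
    return math.log(p_i / (1 - p_i)) + math.log((1 - q_i) / q_i)


# Hypothetical computer-related collection of 1000 documents.
N = 1000
w_computer = term_weight(900, N)   # very common term: negative weight
w_apple = term_weight(10, N)       # rare term: large positive weight
```

In such a collection "computer" barely discriminates between documents while "apple" does, which matches the IDF intuition on the slide.
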

15
Probabilistic Models

To rank by

16
System Architecture

[Figure: architecture diagram with components docs, INDEXING, Doc Rep, User, query, Query Rep, Ranking, results]

17
Technique: Term Selection/Weighting

Basis for matching query with document
Query and document should be represented using
the same units/terms
Controlled vocabulary vs. full text indexing

18
What is a good indexing term?

Specific (phrases) or general (single word)?
Luhn found that words with middle frequency are
most useful
Not too specific
Not too general
All words or a (controlled) subset?
When term weighting is used, it is a matter of
weighting, not selecting, indexing terms
more later

19
Technique: Stemming

Words with similar meanings should be mapped to
the same indexing term
Stemming: Mapping all inflectional forms of words
to the same root form, e.g.
computer -> compute
computation -> compute
computing -> compute
Porter's Stemmer is popular for English
In general: clustering of synonym words
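
As a rough illustration of the mapping above, here is a toy suffix-stripping stemmer. It is not the Porter algorithm: the three rules and the minimum stem length are assumptions chosen only so that the slide's examples come out as shown.

```python
# Toy rules (assumed for illustration): strip a suffix, then append "e".
SUFFIX_RULES = [("ation", "e"), ("ing", "e"), ("er", "e")]
MIN_STEM_LEN = 3  # do not strip if too little of the word would remain


def toy_stem(word):
    """Map a word to a root form using the first applicable suffix rule."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LEN:
            return stem + replacement
    return word


stems = {w: toy_stem(w) for w in ["computer", "computation", "computing"]}
```

A real stemmer needs many more rules and exceptions. The length check here only prevents the most obvious damage, such as reducing "sing" to "se".
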

20
Technique: Stopwords

A common word that bears little semantic
content
preposition: for, on, ...
article: a, an, the
non-informative words (collection specific)
e.g., "database" in this class
e.g., "PC" in a computer collection
You can search the Web for stopwords list
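
Both kinds of stopwords can be sketched in a few lines. The general list below contains only the words named on this slide. Detecting collection-specific stopwords by document frequency, and the threshold used, are assumptions for illustration. The documents are the ones from the Boolean model slides.

```python
from collections import Counter

GENERAL_STOPWORDS = {"for", "on", "a", "an", "the"}


def collection_stopwords(docs, min_fraction):
    """Terms that occur in at least min_fraction of the documents."""
    df = Counter()
    for text in docs:
        df.update(set(text.lower().split()))  # count each doc once per term
    return {t for t, n in df.items() if n / len(docs) >= min_fraction}


def index_terms(text, stopwords):
    """Words of a document that remain after stopword removal."""
    return [t for t in text.lower().split() if t not in stopwords]


docs = ["web data and web queries", "digital data index", "data base for dummies"]
specific = collection_stopwords(docs, min_fraction=1.0)
kept = index_terms("data base for dummies", GENERAL_STOPWORDS | specific)
```

Here "data" occurs in every document, so it plays the same role as "database" in this class: it cannot distinguish one document from another.
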

21
Technique: Relevance Feedback (or Query
Modification)

Motivation: easier to judge results than to
formulate queries right

22
Pseudo Feedback

Motivation: top results are often relevant

[Figure: feedback loop with components Query, Retrieval Engine, Document collection, Results (d1 3.5, d2 2.4, ..., dk 0.5), Judgments on the top 10 (d1, d2, d3, ..., dk -), Feedback, Updated query]
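
The pseudo feedback loop can be sketched as simple query expansion: treat the top-ranked results as if they had been judged relevant and add their most frequent new terms to the query. The slide does not say how the query is updated, so this expansion rule, the function name, and the sample results are assumptions.

```python
from collections import Counter


def pseudo_feedback(query_terms, ranked_docs, top_k=2, num_new_terms=1):
    """Expand a query with frequent terms from the top_k ranked results."""
    counts = Counter()
    for text in ranked_docs[:top_k]:  # top results are assumed relevant
        counts.update(t for t in text.lower().split() if t not in query_terms)
    new_terms = [t for t, _ in counts.most_common(num_new_terms)]
    return list(query_terms) + new_terms


# Hypothetical ranked results for the query "color printer".
ranked = [
    "color laser printer review",
    "laser printer ink color",
    "ink cartridges",
]
updated_query = pseudo_feedback(["color", "printer"], ranked)
```

The updated query would then be sent back to the retrieval engine. If the top results are not actually relevant, the expansion can drift away from the user's intent.
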
23
Technique: Inverted List

ti -> <d1, ...>, ..., <dn, ...>
E.g.,
color -> <d1, ...>, <d2, ...>, <d5, ...>
printer -> <d2, ...>, <d5, ...>, <d8, ...>
How to evaluate Q = color AND printer?
How to evaluate Q = "color printer"?
what info to maintain in each entry?
More later
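
One possible answer to the two questions above, as a sketch. An AND query can be evaluated by merging the two sorted posting lists. A phrase query additionally needs word positions in each entry. The posting lists are the ones from the slide, while the positions are invented for illustration.

```python
def intersect(p1, p2):
    """Merge-intersect two posting lists of doc ids sorted ascending."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def phrase_match(pos1, pos2):
    """Docs in which the second term occurs right after the first."""
    hits = []
    for doc in sorted(pos1.keys() & pos2.keys()):
        following = set(pos2[doc])
        if any(p + 1 in following for p in pos1[doc]):
            hits.append(doc)
    return hits


# Posting lists from the slide (doc ids only).
index = {"color": [1, 2, 5], "printer": [2, 5, 8]}
and_result = intersect(index["color"], index["printer"])

# Hypothetical positional entries: doc id -> word positions.
positions = {
    "color": {1: [4], 2: [0], 5: [3]},
    "printer": {2: [1], 5: [7], 8: [2]},
}
phrase_result = phrase_match(positions["color"], positions["printer"])
```

Both d2 and d5 satisfy the AND query, but with these made-up positions only d2 contains the two words side by side.
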

24
DB Meets IR

Multimedia databases
relational data + text, images, audio, video
Fuzzy retrieval for relational data
similarity, preference-based queries
e.g., product search in e-commerce
XML represents text-based data
IR type search will be helpful
how can we extend it to retrieve XML documents?

25
Web Search

Text IR as natural starting point
Web as a collection of HTML documents
find pages that satisfy an information need
Web search as killer-app of IR!
Web search vs. traditional document search
how are they related?
any differences or new issues?
why do search engines give lousy results?

26
Web Search: New Issues and Challenges

Highly topic-heterogeneous documents
notion of collection lost
stopwords, idf scheme for term selection/weighting
challenged
Structured/semi-structured documents
Highly-linked pages: collection no longer flat
how to use links cleverly: link analysis (more
in TW2)
ideas from social networks for standing or
importance
Extremely large scale: billions of docs and counting
Many documents/data hidden behind databases
Multi-lingual documents
Spamming

27
What's Next

Vector space model

28
End of Talk
