Title: Towards Unifying Database Systems and Information Retrieval Systems
1. Towards Unifying Database Systems and Information Retrieval Systems
- Jayavel Shanmugasundaram
- Cornell University
2. 10,000-Foot View of Data Management
[Figure: data management landscape. Axes: Data (Structured, Unstructured) and Queries (Ranked Keyword Search, Complex and Structured). Regions: Information Retrieval Systems, Database Systems.]
4. Case Study: Internet Archive
5. Internet Archive Database
Movies

Mid | Name            | Description
10  | Amateur Film    | they stand on the golden gate bridge and
20  | American Thrift | golden gate bridge with statue of liberty

SELECT * FROM Movies M
ORDER BY score(M.description, "golden gate")
FETCH TOP 10 RESULTS ONLY
6. Main Issue
- Traditional IR ranking methods would rank the two movies about the same
  - Example: TF-IDF
  - "Golden Gate" appears exactly once in both descriptions
  - The lengths of the text fields are about the same
  - Hence the same normalized TF-IDF score
- Larger issue: traditional IR scoring methods were developed for stand-alone document collections
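
To make the tie concrete, here is a minimal sketch of a length-normalized TF-IDF score. The two description strings are hypothetical stand-ins of equal length (the real descriptions are longer and only partly shown on the slide), and the formula is one common TF-IDF variant, not necessarily the one used in the talk.

```python
import math

# Hypothetical stand-in descriptions of equal length (9 words each);
# each mentions "golden" and "gate" exactly once.
DOCS = {
    10: "they stand on the golden gate bridge and wave",
    20: "golden gate bridge shown with statue of liberty replica",
}


def tf_idf(query_terms, doc_id, docs):
    """Length-normalized term frequency times a smoothed IDF, summed over terms."""
    words = docs[doc_id].split()
    score = 0.0
    for term in query_terms:
        tf = words.count(term) / len(words)
        df = sum(1 for text in docs.values() if term in text.split())
        idf = math.log(1 + len(docs) / df) if df else 0.0
        score += tf * idf
    return score


scores = {doc_id: tf_idf(["golden", "gate"], doc_id, DOCS) for doc_id in DOCS}
print(scores)  # both documents receive the same score
```

Because term counts and lengths match, nothing in the text alone separates the two movies.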
7. Internet Archive Database
Movies

Mid | Name            | Description
10  | Amateur Film    | they stand on the golden gate bridge and
20  | American Thrift | golden gate bridge with statue of liberty

Structured Value Ranking (SVR)
8. Structured Value Ranking (Guo et al., 2005)
- Use structured data values associated with text columns to score results
- Main technical challenge: structured data values (and hence scores) change frequently and possibly dramatically!
  - Number of visits, downloads, award announcements
  - SlashDot effect
  - Bursts and rapidly changing popularity (Kleinberg)
  - Users still want to see results ordered by latest score values
- Current focus: design efficient inverted lists
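
The idea of scoring by structured values can be sketched as follows. Everything specific here is an assumption for illustration: the `MovieStats` columns, the weights, the weighted-sum aggregation, and the numbers are made up, and the slides do not say how SVR actually combines values.

```python
from dataclasses import dataclass


@dataclass
class MovieStats:
    """Hypothetical structured values stored alongside a movie's text columns."""
    visits: int
    downloads: int
    avg_rating: float  # assumed to be on a 0-5 scale


def svr_score(stats, w_visits=0.001, w_downloads=0.002, w_rating=10.0):
    """Illustrative aggregation: a weighted sum of the structured values."""
    return (w_visits * stats.visits
            + w_downloads * stats.downloads
            + w_rating * stats.avg_rating)


def rank(candidates, stats_by_id):
    """Order keyword-matching documents by SVR score, highest first."""
    return sorted(candidates, key=lambda d: svr_score(stats_by_id[d]), reverse=True)


stats_by_id = {
    10: MovieStats(visits=1_200, downloads=300, avg_rating=3.0),     # made-up values
    20: MovieStats(visits=45_000, downloads=9_000, avg_rating=4.5),  # made-up values
}
ranking_before = rank([10, 20], stats_by_id)

# A sudden burst of visits (for example, a SlashDot link) changes the order.
stats_by_id[10].visits = 500_000
ranking_after = rank([10, 20], stats_by_id)
print(ranking_before, ranking_after)
```

The burst at the end is the hard case named above: one update to a structured value can reorder results, so the index has to absorb such changes cheaply.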
9. System Architecture
[Figure labels: RDBMS; Text Management Component; Relational Query Engine; Relational Tables and Indices.]
10. Index Operations
- Document score updates
  - Handle frequent updates to scores
- Top-k keyword queries
  - Conjunctive and disjunctive keyword queries
  - Include IR-style (TF-IDF) scores
  - Top-k query results
- Content updates, insertions and deletions
  - Updates to document content
  - Document insertions and deletions
11. Naïve Approach 1: ID Method
Inverted List (ordered by Id)
  golden: 10, 12, 18, 21, 34
  gate: 11, 13, 18, 34, 39

Score Table
  Id | Score
  1  | 70.85
  2  | 91.86
  3  | 12.34
  ...

- Score updates efficient (just update the score table)
- Top-k queries inefficient (scan all of the inverted list)
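
A minimal in-memory sketch of the ID Method, assuming Python dictionaries in place of the on-disk lists and tables. The class and method names are hypothetical; the posting lists follow the slide, while the scores are made up because the slide only shows scores for ids 1 to 3. The query assumes a non-empty, conjunctive term list.

```python
import bisect
import heapq


class IdMethodIndex:
    """Inverted lists ordered by document id, plus a separate score table."""

    def __init__(self):
        self.lists = {}   # term -> sorted list of document ids
        self.scores = {}  # score table: document id -> current score

    def add(self, doc_id, terms, score):
        self.scores[doc_id] = score
        for term in terms:
            bisect.insort(self.lists.setdefault(term, []), doc_id)

    def update_score(self, doc_id, new_score):
        # Cheap: only the score table changes; no inverted list is touched.
        self.scores[doc_id] = new_score

    def top_k(self, terms, k):
        # Expensive: every posting of every query term is read, because
        # id order says nothing about where the high scores are.
        matches = set(self.lists.get(terms[0], []))
        for term in terms[1:]:
            matches &= set(self.lists.get(term, []))
        return heapq.nlargest(k, matches, key=lambda d: self.scores[d])


golden = [10, 12, 18, 21, 34]
gate = [11, 13, 18, 34, 39]
made_up_scores = {10: 55.0, 11: 40.0, 12: 90.19, 13: 35.0,
                  18: 60.0, 21: 20.0, 34: 75.0, 39: 10.0}

idx = IdMethodIndex()
for doc_id, score in made_up_scores.items():
    terms = [t for t, ids in (("golden", golden), ("gate", gate)) if doc_id in ids]
    idx.add(doc_id, terms, score)

before = idx.top_k(["golden", "gate"], 1)  # documents 18 and 34 match; 34 scores higher
idx.update_score(18, 99.0)                 # a single dictionary write
after = idx.top_k(["golden", "gate"], 1)
print(before, after)
```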
12. Naïve Approach 2: Score Method
Inverted List (ordered by Score), entries are (Id, Score)
  golden: (156, 98.32), (12, 90.19), (89, 79.52), (54, 77.79)
  gate: (176, 97.19), (12, 90.19), (64, 89.55), (4, 84.63)

- Top-k queries efficient (read the top part of the inverted list)
- Score updates inefficient (reorganize many lists)
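
A minimal in-memory sketch of the Score Method, using the postings from the slide. The class, the `doc_terms` helper, and the merge-style conjunctive query are assumptions for illustration; the query assumes a non-empty term list.

```python
class ScoreMethodIndex:
    """Inverted lists hold (score, doc_id) postings in descending score order."""

    def __init__(self):
        self.lists = {}      # term -> [(score, doc_id)], highest score first
        self.doc_terms = {}  # doc_id -> terms (hypothetical helper for updates)
        self.scores = {}     # doc_id -> current score

    def add(self, doc_id, terms, score):
        self.doc_terms[doc_id] = list(terms)
        self.scores[doc_id] = score
        for term in terms:
            postings = self.lists.setdefault(term, [])
            postings.append((score, doc_id))
            postings.sort(reverse=True)

    def update_score(self, doc_id, new_score):
        # Expensive: the posting moves in the list of every term in the document.
        old = self.scores[doc_id]
        for term in self.doc_terms[doc_id]:
            postings = self.lists[term]
            postings.remove((old, doc_id))
            postings.append((new_score, doc_id))
            postings.sort(reverse=True)
        self.scores[doc_id] = new_score
        return len(self.doc_terms[doc_id])  # number of lists rewritten

    def top_k(self, terms, k):
        # Cheap: all lists share one order, so a merge from the top can stop
        # as soon as k documents common to every list have been found.
        lists = [self.lists.get(t, []) for t in terms]
        pos = [0] * len(lists)
        result = []
        while len(result) < k and all(p < len(l) for p, l in zip(pos, lists)):
            heads = [l[p] for p, l in zip(pos, lists)]
            lowest = min(heads)
            if all(h == lowest for h in heads):
                result.append(lowest[1])
                pos = [p + 1 for p in pos]
            else:
                # A head ranked above the lowest head cannot occur in every list.
                pos = [p + (h > lowest) for p, h in zip(pos, heads)]
        return result


idx = ScoreMethodIndex()
idx.add(156, ["golden"], 98.32)
idx.add(12, ["golden", "gate"], 90.19)
idx.add(89, ["golden"], 79.52)
idx.add(54, ["golden"], 77.79)
idx.add(176, ["gate"], 97.19)
idx.add(64, ["gate"], 89.55)
idx.add(4, ["gate"], 84.63)

top_two_golden = idx.top_k(["golden"], 2)
top_both = idx.top_k(["golden", "gate"], 1)
lists_rewritten = idx.update_score(12, 99.0)
print(top_two_golden, top_both, lists_rewritten)
```

Document 12 contains both keywords, so its score update rewrites two lists; a document with many distinct terms would touch many more.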
13. Dilemma
- Want inverted lists ordered by score
  - For top-k query performance
  - Like in Score Method
- But do not want to touch inverted lists for every score update
  - For score update performance
  - Like in ID Method
- How can we address this apparent dilemma?
14. Score-Threshold Method
- Extends Score Method in two key aspects
- Allow inverted list scores to be out-of-date by up to a threshold
  - Avoids having to frequently update the inverted list
  - Better score update performance
  - Need to scan more of the inverted list (by up to a threshold) to correct for out-of-date scores
  - Slightly reduced query performance
- Use a short inverted list for scores that exceed the threshold
  - More efficient than updating a large inverted list
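
The two aspects above can be sketched in memory as follows. This is an interpretation under stated assumptions, not the authors' implementation: the threshold is taken to be r * score with a hypothetical r = 0.1, scores are assumed non-negative, only score increases are checked against the threshold, `doc_terms` is a hypothetical helper, and a stale long-list posting is skipped when the document's InShortList flag is set.

```python
import heapq


class ScoreThresholdIndex:
    def __init__(self, r):
        self.r = r             # threshold ratio: threshold(score) = r * score
        self.long = {}         # term -> [(list_score, doc_id)], descending
        self.short = {}        # term -> [(list_score, doc_id)], descending
        self.scores = {}       # Score table: doc_id -> current score
        self.list_scores = {}  # ListScore table: doc_id -> (list_score, in_short_list)
        self.doc_terms = {}    # doc_id -> set of terms (hypothetical helper)

    def add(self, doc_id, terms, score):
        self.doc_terms[doc_id] = set(terms)
        self.scores[doc_id] = score
        self.list_scores[doc_id] = (score, False)
        for term in terms:
            postings = self.long.setdefault(term, [])
            postings.append((score, doc_id))
            postings.sort(reverse=True)

    def update_score(self, doc_id, new_score):
        """Return True if an inverted list had to be touched."""
        self.scores[doc_id] = new_score
        list_score, in_short = self.list_scores[doc_id]
        if new_score <= list_score + self.r * list_score:
            return False  # within the allowed staleness: Score table only
        for term in self.doc_terms[doc_id]:
            postings = self.short.setdefault(term, [])
            if in_short:
                postings.remove((list_score, doc_id))
            postings.append((new_score, doc_id))
            postings.sort(reverse=True)
        self.list_scores[doc_id] = (new_score, True)
        return True

    def top_k(self, terms, k):
        first = terms[0]
        merged = heapq.merge(
            ((s, d, True) for s, d in self.short.get(first, [])),
            ((s, d, False) for s, d in self.long.get(first, [])),
            reverse=True,
        )
        best = []  # min-heap of (current_score, doc_id), at most k entries
        for list_score, doc_id, from_short in merged:
            bound = list_score + self.r * list_score
            if len(best) == k and best[0][0] >= bound:
                break  # no unseen document can beat the current k-th result
            if not from_short and self.list_scores[doc_id][1]:
                continue  # stale long-list posting; the short list covers it
            if not all(t in self.doc_terms[doc_id] for t in terms):
                continue  # conjunctive query: document lacks a keyword
            heapq.heappush(best, (self.scores[doc_id], doc_id))
            if len(best) > k:
                heapq.heappop(best)
        return [d for _, d in sorted(best, reverse=True)]


idx = ScoreThresholdIndex(r=0.1)
idx.add(156, ["golden"], 98.32)
idx.add(12, ["golden", "gate"], 90.19)
idx.add(89, ["golden"], 79.52)
idx.add(176, ["gate"], 97.19)
idx.add(64, ["gate"], 89.55)

small_touched = idx.update_score(12, 95.00)  # 95.00 <= 90.19 * 1.1: lists untouched
top_after_small = idx.top_k(["golden"], 1)
big_touched = idx.update_score(12, 105.0)    # 105.0 > 90.19 * 1.1: goes to short lists
top_after_big = idx.top_k(["golden", "gate"], 1)
print(small_touched, top_after_small, big_touched, top_after_big)
```

The query reads past the point where the Score Method would stop, because a posting's stored score may understate the current score by up to the threshold; that extra scanning is the price of cheaper updates.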
15. Score-Threshold Method
Inverted List (ordered by Score), entries are (Id, Score)
  golden: (156, 98.32), (12, 90.19), (89, 79.52)
  gate: (176, 97.19), (12, 90.19), (64, 89.55)

Short list

Score Table
  Id | Score
  1  | 70.85
  12 | 90.19
  ...

ListScore Table
  Id | Score | InShortList
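16. Score-Threshold Method
Inverted List (ordered by Score), entries are (Id, Score)
  golden: (156, 98.32), (12, 90.19), (89, 79.52)
  gate: (176, 97.19), (12, 90.19), (64, 89.55)

Score Table (document 12 is updated from 90.19 to 95.00; its inverted-list entries still read 90.19)
  Id | Score
  1  | 70.85
  12 | 95.00
  ...

ListScore Table
  Id | Score | InShortList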
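18. Score-Threshold Method
Inverted List (ordered by Score), entries are (Id, Score)
  golden: (156, 98.32), (12, 90.19), (89, 79.52)
  gate: (176, 97.19), (12, 90.19), (64, 89.55)

Score Table (document 12 is updated again, to 105.0)
  Id | Score
  1  | 70.85
  12 | 105.0
  ...

ListScore Table
  Id | Score | InShortList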
20. Query-Update Tradeoff
- Choice of threshold function
  - If threshold(score) = 0
    - Every update results in an update to the inverted list
    - Similar to Score Method
  - If threshold(score) = infinity
    - No inverted list update, but scan all of the list
    - Similar to ID Method
- Can control the query-update tradeoff using the threshold function
  - threshold(score) = r * score, r > 0
  - r: threshold ratio
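
A small sketch of how the threshold ratio moves work between updates and queries. The function name is hypothetical, and the update stream is illustrative (document 12 rising from 90.19 through 95.00 and 105.0, plus a made-up final value of 106.0). As an assumption, only score increases are checked against the threshold.

```python
def count_list_updates(initial_scores, updates, r):
    """Count the score updates that would force an inverted-list change."""
    list_score = dict(initial_scores)  # score currently recorded in the lists
    touched = 0
    for doc_id, new_score in updates:
        # threshold(score) = r * score, applied to the recorded list score
        if new_score > list_score[doc_id] + r * list_score[doc_id]:
            list_score[doc_id] = new_score
            touched += 1
    return touched


initial = {12: 90.19}
stream = [(12, 95.00), (12, 105.0), (12, 106.0)]  # last value is made up

counts = {r: count_list_updates(initial, stream, r)
          for r in (0.0, 0.1, float("inf"))}
print(counts)  # {0.0: 3, 0.1: 1, inf: 0}
```

With r = 0 every one of these updates touches the lists, as in the Score Method; with an infinite threshold none does, as in the ID Method, but a query can then no longer stop early. A query may have to read postings whose recorded score is up to a factor of (1 + r) below the current k-th best result.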
21. Experimental Setup
- Two primary performance metrics
  - Time for a score update
  - Time for a top-k query
- Data sets
  - Real (Internet Archive): 60 MB
    - Thanks to Brewster Kahle and Jon Aizen
  - Synthetic: 805 MB
- Compared alternatives
  - Implemented in C on top of BerkeleyDB
  - 2.7 GHz processor, 1 GB memory
22. Varying Updates
[Chart: times in milliseconds.]
23. 10,000-Foot View of Data Management
[Figure: data management landscape, with axes Data (Structured, Unstructured) and Queries (Ranked Keyword Search, Complex and Structured).]
24. XML Keyword Search
- Example applications
  - Accident reports, Shakespeare's plays
- XRank: keyword search over semi-structured XML documents
  - Extends keyword search to work over both structured and unstructured data
  - SIGMOD 2003: Guo, Shao, Botev, Shanmugasundaram
25. 10,000-Foot View of Data Management
[Figure: data management landscape, with axes Data (Structured, Unstructured) and Queries (Ranked Keyword Search, Complex and Structured).]
26. Towards Unifying DB and IR
- Example applications
  - Content management, web querying
- TeXQuery: query language for structured and unstructured data, structured and keyword queries
  - Precursor to W3C XQuery Full-Text
  - WWW 2004: Amer-Yahia, Botev, Shanmugasundaram
27. Related Work
- Integrating DB and IR systems
  - For the most part, treat individual systems as black boxes
  - Our goal is to unify DB and IR systems
- Search over semi-structured data
  - Specialized techniques for searching semi-structured data
  - Our goal is to generalize DB and IR techniques
- Keyword search and ranking in databases
  - BANKS, DBXplorer, DISCOVER
28. Summary
- Many emerging applications require a unification of DB and IR techniques
  - E-commerce, content management, ...
- Argues for a new generation of systems and techniques that seamlessly provide this capability
  - SVR, XRank, TeXQuery, ...
- Educational benefit: present a unified view of data management
  - Currently at graduate level
  - Eventually introduce concepts at undergraduate level