Towards Unifying Database Systems and Information Retrieval Systems

1
Towards Unifying Database Systems and Information
Retrieval Systems
  • Jayavel Shanmugasundaram
  • Cornell University

2
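10,000-foot view of Data Management
[Figure: two-axis chart. Data axis: Structured to Unstructured. Queries axis: Complex and Structured to Ranked Keyword Search. Database Systems: complex and structured queries over structured data. Information Retrieval Systems: ranked keyword search over unstructured data.]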
4
Case Study: Internet Archive
5
Internet Archive Database
Movies
  Mid   Name             Description
  10    Amateur Film     ... they stand on the golden gate bridge and ...
  20    American Thrift  ... golden gate bridge with statue of liberty ...

SELECT * FROM Movies M ORDER BY
score(M.description, "golden gate") FETCH TOP 10
RESULTS ONLY
6
Main Issue
  • Traditional IR ranking methods would rank the two
    movies about the same
  • Example: TF-IDF
  • "Golden Gate" appears exactly once in both
    descriptions
  • The lengths of the text fields are about the same
  • Hence the same normalized TF-IDF score
  • Larger issue: traditional IR scoring methods were
    developed for stand-alone document collections
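
The TF-IDF tie can be checked with a minimal sketch. It assumes a length-normalized term frequency and a smoothed IDF, which is only one of many TF-IDF variants and not necessarily the one the talk had in mind. The two descriptions are the slide's snippets trimmed to equal length; the third document and all function names are hypothetical.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Length-normalized TF times a smoothed IDF (one common variant)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    idf = math.log((1 + len(corpus)) / (1 + df)) + 1
    return tf * idf

def score(query, doc_tokens, corpus):
    """Sum the per-term TF-IDF weights of the query keywords."""
    return sum(tf_idf(t, doc_tokens, corpus) for t in query.split())

corpus = [
    "they stand on the golden gate bridge".split(),       # Amateur Film (trimmed)
    "golden gate bridge with statue of liberty".split(),  # American Thrift
    "a short film about thrift".split(),                  # hypothetical filler document
]

amateur = score("golden gate", corpus[0], corpus)
thrift = score("golden gate", corpus[1], corpus)
print(amateur, thrift)  # the two values are identical
```

Both descriptions contain each keyword once and have the same length, so text statistics alone cannot separate them. That is the gap the structured values are meant to fill.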

7
Internet Archive Database
Movies
  Mid   Name             Description
  10    Amateur Film     ... they stand on the golden gate bridge and ...
  20    American Thrift  ... golden gate bridge with statue of liberty ...

Structured Value Ranking (SVR)
8
Structured Value Ranking (Guo et al., 2005)
  • Use structured data values associated with text
    columns to score results
  • Main technical challenge:
  • Structured data values (and hence scores) change
    frequently and possibly dramatically!
  • Number of visits, downloads, award announcements
  • SlashDot effect
  • Bursts and rapidly changing popularity
    (Kleinberg)
  • Users still want to see results ordered by latest
    score values
  • Current focus: design efficient inverted lists

9
System Architecture
[Figure: an RDBMS comprising a Text Management Component, a Relational Query Engine, and Relational Tables and Indices]
10
Index Operations
  • Document score updates
  • Handle frequent updates to scores
  • Top-k keyword queries
  • Conjunctive and disjunctive keyword queries
  • Include IR-style (TF-IDF) scores
  • Top-k query results
  • Content updates, insertions and deletions
  • Update to document content
  • Document insertions and deletions
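
As a rough illustration of the query side of these operations, the sketch below runs conjunctive and disjunctive top-k keyword queries over a toy inverted index. The scores are hypothetical, the ranking uses a single per-document score (no TF-IDF component), and the names `matches` and `top_k` are invented for this example.

```python
import heapq

# Toy inverted index: term -> set of document ids.
inverted = {
    "golden": {10, 12, 18, 21, 34},
    "gate": {11, 13, 18, 34, 39},
}
# Hypothetical per-document scores.
scores = {10: 3.0, 11: 9.0, 12: 7.5, 13: 1.0, 18: 4.2, 21: 8.8, 34: 6.1, 39: 2.4}

def matches(terms, mode):
    """Documents containing all terms ("and") or at least one term ("or")."""
    lists = [inverted.get(t, set()) for t in terms]
    if not lists:
        return set()
    return set.intersection(*lists) if mode == "and" else set.union(*lists)

def top_k(terms, k, mode="and"):
    """The k highest-scoring matching documents, best first."""
    return heapq.nlargest(k, matches(terms, mode), key=lambda d: scores[d])

print(top_k(["golden", "gate"], 2, "and"))  # [34, 18]
print(top_k(["golden", "gate"], 3, "or"))   # [11, 21, 12]
```

This version materializes every match before ranking. The index designs on the following slides are about avoiding that work while scores keep changing.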

11
Naïve Approach 1: ID Method
Inverted List (ordered by Id)
  golden: 10, 12, 18, 21, 34, ...
  gate:   11, 13, 18, 34, 39, ...
Score Table
  Id   Score
  1    70.85
  2    91.86
  3    12.34
  ...
  • Score updates efficient (just update score
    table)
  • Top-k queries inefficient (scan all of inverted
    list)
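
A minimal sketch of the ID Method, assuming in-memory dictionaries stand in for the inverted lists and the score table. The class name, method names, and example scores are hypothetical.

```python
import bisect
import heapq

class IdMethodIndex:
    def __init__(self):
        self.postings = {}  # term -> list of doc ids, kept sorted by id
        self.scores = {}    # score table: doc id -> current score

    def add_document(self, doc_id, terms, score):
        self.scores[doc_id] = score
        for term in set(terms):
            bisect.insort(self.postings.setdefault(term, []), doc_id)

    def update_score(self, doc_id, new_score):
        # Only the score table changes; no inverted list is touched.
        self.scores[doc_id] = new_score

    def top_k(self, term, k):
        # Ids carry no score order, so every posting must be examined.
        ids = self.postings.get(term, [])
        best = heapq.nlargest(k, ids, key=self.scores.__getitem__)
        return best, len(ids)

idx = IdMethodIndex()
for doc_id, s in [(10, 50.0), (12, 90.19), (18, 20.0), (21, 60.0), (34, 10.0)]:
    idx.add_document(doc_id, ["golden"], s)

before = idx.top_k("golden", 2)   # ([12, 21], 5)
idx.update_score(34, 99.0)        # constant-time update
after = idx.top_k("golden", 2)    # ([34, 12], 5)
```

The second element of each result is the number of postings examined. It equals the full list length regardless of `k`, which is the inefficiency the slide points out.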

12
Naïve Approach 2: Score Method
Inverted List (ordered by Score), entries are Id (Score)
  golden: 156 (98.32), 12 (90.19), 89 (79.52), 54 (77.79), ...
  gate:   176 (97.19), 12 (90.19), 64 (89.55), 4 (84.63), ...
  • Top-k queries efficient (top part of inverted
    list)
  • Score updates inefficient (reorganize many lists)
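
A minimal sketch of the Score Method, using the ids and scores from the slide. Python lists re-sorted on every change stand in for on-disk inverted lists, so the costs are only illustrative; all names are hypothetical.

```python
class ScoreMethodIndex:
    def __init__(self):
        self.postings = {}   # term -> [(score, doc id)], descending by score
        self.doc_terms = {}  # doc id -> terms, to locate the lists to fix
        self.scores = {}     # doc id -> current score

    def add_document(self, doc_id, terms, score):
        self.doc_terms[doc_id] = set(terms)
        self.scores[doc_id] = score
        for term in self.doc_terms[doc_id]:
            plist = self.postings.setdefault(term, [])
            plist.append((score, doc_id))
            plist.sort(reverse=True)

    def update_score(self, doc_id, new_score):
        """Reposition the document in every list it appears in."""
        old = self.scores[doc_id]
        touched = 0
        for term in self.doc_terms[doc_id]:
            plist = self.postings[term]
            plist.remove((old, doc_id))
            plist.append((new_score, doc_id))
            plist.sort(reverse=True)
            touched += 1
        self.scores[doc_id] = new_score
        return touched

    def top_k(self, term, k):
        # The answer is simply the head of the list.
        return [doc_id for _, doc_id in self.postings.get(term, [])[:k]]

idx = ScoreMethodIndex()
idx.add_document(156, ["golden"], 98.32)
idx.add_document(12, ["golden", "gate"], 90.19)
idx.add_document(89, ["golden"], 79.52)
idx.add_document(54, ["golden"], 77.79)
idx.add_document(176, ["gate"], 97.19)
idx.add_document(64, ["gate"], 89.55)
idx.add_document(4, ["gate"], 84.63)

head = idx.top_k("golden", 2)          # [156, 12]
touched = idx.update_score(12, 99.0)   # 2 lists reorganized
new_head = idx.top_k("golden", 2)      # [12, 156]
```

A single score change for document 12 forces maintenance on both the "golden" and "gate" lists. A document with many distinct terms would touch correspondingly many lists.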

13
Dilemma
  • Want inverted lists ordered by score
  • For top-k query performance
  • As in the Score Method
  • But do not want to touch inverted lists for every
    score update
  • For score update performance
  • As in the ID Method
  • How can we address this apparent dilemma?

14
Score-Threshold Method
  • Extends Score Method in two key aspects
  • Allow inverted list scores to be out-of-date by
    up to a threshold
  • Avoids having to frequently update inverted list
  • Better score update performance
  • Need to scan more of inverted list (by up to a
    threshold) to correct for out-of-date score
  • Slightly reduced query performance
  • Use short inverted list for scores that exceed
    threshold
  • More efficient than updating large inverted list
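
The two aspects above can be sketched as follows. This is a simplified single-term illustration consistent with the bullets, not the exact algorithm of Guo et al. It assumes a threshold proportional to the stored list score (`r * score`), treats only score increases as threshold crossings, and keeps the short list as an unordered set. All names are hypothetical.

```python
import heapq

class ScoreThresholdIndex:
    def __init__(self, r):
        self.r = r             # assumed threshold ratio: threshold(s) = r * s
        self.scores = {}       # Score table: doc id -> latest score
        self.list_score = {}   # ListScore table: score the lists know about
        self.in_short = set()  # ListScore table: InShortList flag
        self.doc_terms = {}    # doc id -> terms
        self.long = {}         # term -> [(list score, doc id)], descending
        self.short = {}        # term -> doc ids whose score outgrew the threshold

    def add_document(self, doc_id, terms, score):
        self.scores[doc_id] = self.list_score[doc_id] = score
        self.doc_terms[doc_id] = set(terms)
        for term in self.doc_terms[doc_id]:
            plist = self.long.setdefault(term, [])
            plist.append((score, doc_id))
            plist.sort(reverse=True)

    def update_score(self, doc_id, new_score):
        """Return True only if an inverted list had to be touched."""
        self.scores[doc_id] = new_score      # cheap, always done
        stale = self.list_score[doc_id]
        if new_score <= stale + self.r * stale:
            return False                     # within threshold: lists untouched
        self.list_score[doc_id] = new_score
        self.in_short.add(doc_id)
        for term in self.doc_terms[doc_id]:
            self.short.setdefault(term, set()).add(doc_id)
        return True

    def top_k(self, term, k):
        """Return (top-k doc ids by latest score, long-list entries scanned)."""
        cand = set(self.short.get(term, ()))
        scanned = 0
        for stale, doc_id in self.long.get(term, []):
            best = heapq.nlargest(k, (self.scores[d] for d in cand))
            # Unscanned docs outside the short list score at most stale * (1 + r).
            if len(best) == k and best[-1] >= stale + self.r * stale:
                break
            cand.add(doc_id)
            scanned += 1
        return heapq.nlargest(k, cand, key=self.scores.__getitem__), scanned

idx = ScoreThresholdIndex(r=0.1)
for doc_id, s in [(156, 98.32), (12, 90.19), (89, 79.52), (54, 77.79)]:
    idx.add_document(doc_id, ["golden"], s)

within = idx.update_score(12, 95.00)   # False: 95.00 is within 10% of 90.19
first = idx.top_k("golden", 1)         # ([156], 2): one extra entry scanned
crossed = idx.update_score(12, 105.0)  # True: document 12 joins the short list
second = idx.top_k("golden", 1)        # ([12], 1)
```

The first query reads one entry past the top of the long list, because document 12 might have risen by up to 10% without the list knowing. The second query finds document 12 through the short list, while its stale entry in the long list is left where it was.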

15
Score-Threshold Method
Inverted List (ordered by Score), entries are Id (Score)
  golden: 156 (98.32), 12 (90.19), 89 (79.52), ...
  gate:   176 (97.19), 12 (90.19), 64 (89.55), ...
Short list
Score Table
  Id   Score
  1    70.85
  12   90.19
  ...
ListScore Table (columns: Id, Score, InShortList)
16
Score-Threshold Method
Inverted List (ordered by Score), entries are Id (Score)
  golden: 156 (98.32), 12 (90.19), 89 (79.52), ...
  gate:   176 (97.19), 12 (90.19), 64 (89.55), ...
Score Table
  Id   Score
  1    70.85
  12   95.00
  ...
ListScore Table (columns: Id, Score, InShortList)
18
Score-Threshold Method
Inverted List (ordered by Score), entries are Id (Score)
  golden: 156 (98.32), 12 (90.19), 89 (79.52), ...
  gate:   176 (97.19), 12 (90.19), 64 (89.55), ...
Score Table
  Id   Score
  1    70.85
  12   105.0
  ...
ListScore Table (columns: Id, Score, InShortList)
20
Query-Update Tradeoff
  • Choice of threshold function
  • If threshold(score) = 0
  • Every update results in update to inverted list
  • Similar to Score Method
  • If threshold(score) = infinity
  • No inverted list update, but scan all of list
  • Similar to ID Method
  • Can control query-update tradeoff using threshold
    function
  • threshold(score) = r × score, r > 0
  • r: threshold ratio
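
The effect of the threshold ratio on update cost can be illustrated with a small count. The sketch assumes that only score increases beyond `list_score + r * list_score` force list maintenance and that the stored list score is refreshed at that point. The update stream and the function name are hypothetical.

```python
import math

def list_touches(initial_score, new_scores, r):
    """Count updates that cross the threshold and force list maintenance."""
    list_score = initial_score
    touches = 0
    for score in new_scores:
        if score > list_score + r * list_score:
            list_score = score  # list entry refreshed (assumed behavior)
            touches += 1
    return touches

stream = [95.00, 105.0, 110.0, 120.0]  # hypothetical updates to one document
counts = {r: list_touches(90.19, stream, r) for r in (0.0, 0.1, math.inf)}
print(counts)  # {0.0: 4, 0.1: 2, inf: 0}
```

With r = 0 every update reaches the inverted lists, as in the Score Method. With an infinite ratio none do, as in the ID Method, and queries pay for it by scanning the whole list. Intermediate ratios trade one cost against the other.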

21
Experimental Setup
  • Two primary performance metrics
  • Time for a score update
  • Time for a top-k query
  • Data sets
  • Real (Internet Archive): 60MB
  • Thanks to Brewster Kahle and Jon Aizen
  • Synthetic: 805MB
  • Compared alternatives
  • Implemented in C on top of BerkeleyDB
  • 2.7GHz processor, 1GB memory

22
Varying # Updates
[Chart: times in milliseconds as the number of updates varies]
24
XML Keyword Search
  • Example applications
  • Accident reports, Shakespeare's plays
  • XRank: keyword search over semi-structured XML
    documents
  • Extends keyword search to work over both
    structured and unstructured data
  • SIGMOD 2003: Guo, Shao, Botev, Shanmugasundaram

26
Towards Unifying DB and IR
  • Example applications
  • Content management, web querying
  • TeXQuery: query language for structured and
    unstructured data, structured and keyword queries
  • Precursor to W3C XQuery Full-Text
  • WWW 2004: Amer-Yahia, Botev, Shanmugasundaram

27
Related Work
  • Integrating DB and IR systems
  • For the most part, treat individual systems as
    black boxes
  • Our goal is to unify DB and IR systems
  • Search over Semi-Structured Data
  • Specialized techniques for searching
    semi-structured data
  • Our goal is to generalize DB and IR techniques
  • Keyword search and ranking in databases
  • BANKS, DBXplorer, DISCOVER

28
Summary
  • Many emerging applications require a unification
    of DB and IR techniques
  • E-commerce, content management, ...
  • Argues for a new generation of systems and
    techniques that seamlessly provide this
    capability
  • SVR, XRank, TeXQuery, ...
  • Educational benefit: present a unified view of data
    management
  • Currently at graduate level
  • Eventually introduce concepts at undergraduate
    level