StructuredValue Ranking in UpdateIntensive Relational Databases - PowerPoint PPT Presentation

About This Presentation

Title:

StructuredValue Ranking in UpdateIntensive Relational Databases

Description:

they stand on the golden gate bridge and ... Description. Mid. 20. American Thrift ... golden gate bridge with statue of liberty ... SELECT * FROM Movies M ... – PowerPoint PPT presentation

Number of Views:47

Avg rating:3.0/5.0

Slides: 52

Provided by: jayavelsha

Learn more at: https://www.cs.cornell.edu

Category:

more less

Transcript and Presenter's Notes

Title: StructuredValue Ranking in UpdateIntensive Relational Databases

1
Structured-Value Ranking in Update-Intensive
Relational Databases

Jayavel Shanmugasundaram
Cornell University
(Joint work withLin Guo, Kevin Beyer, Eugene
Shekita)

2
Case Study Internet Archive
3
Internet Archive Database
Movies
Name
Description
Mid
they stand on the golden gate bridge and
10
Amateur Film
20
American Thrift
golden gate bridge with statue of liberty

SELECT FROM Movies M ORDER BY
score(M.description, golden gate) FETCH TOP 10
RESULTS ONLY
4
Main Issue

Traditional IR ranking methods would rank the two
movies about the same
Example TF-IDF
Golden Gate appears exactly once in both
descriptions
Length of the text fields are about the same
Hence same normalized TF-IDF score
Larger issue Traditional IR scoring methods
developed for stand-alone document collections

5
Internet Archive Database
Movies
Name
Description
Mid
they stand on the golden gate bridge and
10
Amateur Film
20
American Thrift
golden gate bridge with statue of liberty

Structured Value Ranking (SVR)
6
Structured Value Ranking

Use structured data values associated with text
columns to score results
Main technical challenge
Structured data value (and hence scores) change
frequently and possibly dramatically!
Number of visits, downloads, award announcements
SlashDot effect
Bursts and rapidly changing popularity
Kleinberg
Users still want to see results ordered by latest
score values

7
Dealing with Score Updates

Traditional top-k algorithms order inverted
lists by score
Top-k queries answered efficiently by scanning
only top part of inverted list
Not efficient if scores are updated
Need to reorder inverted lists
Solution
New family of inverted lists that are maintained
in approximate score order
Correct for approximation during query processing

8
Summary of Proposed Techniques

SQL-based technique for specifying SVR in a
relational database
New family of inverted lists that are robust to
score updates, while still efficient for queries
Can specify update-query tradeoff
Combination of SVR and TF-IDF scores
Can be implemented using existing relational
technology such as B-trees

9
Outline

System Architecture
Indexing and Query Processing
Experimental Evaluation
Related Work
Conclusion

10
Internet Archive Database
Movies
Name
Description
Mid
they stand on the golden gate bridge and
10
Amateur Film
20
American Thrift
golden gate bridge with statue of liberty

11
System Architecture
RDBMS
Text Management Component
Relational Query Engine
Relational Tables and Indices
12
Internet Archive Database
Movies
Name
Description
Mid
they stand on the golden gate bridge and
10
Amateur Film
20
American Thrift
golden gate bridge with statue of liberty

13
SQL-Based SVR Specification
create function S1 (id integer) returns
float return SELECT Avg(R.rating) FROM
Reviews R WHERE R.Mid id
create function S2 (id integer) returns
float return SELECT S.Visits FROM
Statistics S WHERE S.Mid id
create function S3 (id integer) returns
float return SELECT S.Downloads FROM
Statistics S WHERE S.Mid id
14
SQL-Based SVR Specification
create function S1 (id integer) returns
float return SELECT Avg(R.rating) FROM
Reviews R WHERE R.Mid id
create function S2 (id integer) returns
float return SELECT S.Visits FROM
Statistics S WHERE S.Mid id
create function S3 (id integer) returns
float return SELECT S.Downloads FROM
Statistics S WHERE S.Mid id
create function Agg (s1, s2, s3, s4 float)
returns float return (s1100 s2/2 s3 s4/2)
(s4 TFIDF())
15
Efficiently Maintaining SVR Scores

One of key challenges SVR scores can change
frequently
Solution use materialized views
Leverage relational technology
Benefit of SQL-based SVR specification

create materialized view Score as SELECT
Agg(S1(M.Mid), S2(M.Mid), S3(M.Mid)) FROM
Movies M
16
System Architecture
RDBMS
Text Management Component
Relational Query Engine
Relational Tables and Indices
17
Outline

System Architecture
Indexing and Query Processing
Experimental Evaluation
Related Work
Conclusion

18
Index Operations

Document score updates
Handle frequent updates to scores
Top-k keyword queries
Conjunctive and disjunctive keyword queries
Include IR-style (TF-IDF) scores
Top-k query results
Content updates, insertions and deletions
Update to document content
Document insertions and deletions

19
Naïve Approach 1 ID Method
Inverted List
Score Table
Id
Score
golden
10
12
18
21
34

1
70.85
2
91.86
gate
11
13
18
34
39

3
12.34

...
(ordered by Id)

Score updates efficient (just update score
table)
Top-k queries inefficient (scan all of inverted
list)

20
Naïve Approach 2 Score Method
Inverted List
Score
golden
156
12
89
54

98.32
90.19
79.52
77.79

gate
176
12
64
4

97.19
90.19
89.55
84.63

(ordered by Score)

Top-k queries efficient (top part of inverted
list)
Score updates inefficient (reorganize many lists)

21
Dilemma

Want inverted lists ordered by score
For top-k query performance
Like in Score Method
But do not want to touch inverted lists for every
score update
For score update performance
Like in ID Method
How can we address this apparent dilemma?

22
Score-Threshold Method

Extends Score Method in two key aspects
Allow inverted list scores to be out-of-date by
up to a threshold
Avoids having to frequently update inverted list
Better score update performance
Need to scan more of inverted list (by up to a
threshold) to correct for out-of-date score
Slightly reduced query performance
Use short inverted list for scores that exceed
threshold
More efficient than updating large inverted list

23
Score-Threshold Method
Score Table
golden
156
12
89

Id
Score
98.32
90.19
79.52

1
70.85

Short list
12
90.19

...
gate
176
12
64

ListScore Table
97.19
90.19
89.55

Id
Score
InShortList
(ordered by Score)
24
Score-Threshold Method
Score Table
golden
156
12
89

Id
Score
98.32
90.19
79.52

1
70.85

12
95.00

...
gate
176
12
64

ListScore Table
97.19
90.19
89.55

Id
Score
InShortList
(ordered by Score)
25
Score-Threshold Method
Score Table
golden
156
12
89

Id
Score
98.32
90.19
79.52

1
70.85

12
95.00

...
gate
176
12
64

ListScore Table
97.19
90.19
89.55

Id
Score
InShortList
(ordered by Score)
26
Score-Threshold Method
Score Table
golden
156
12
89

Id
Score
98.32
90.19
79.52

1
70.85

12
105.0

...
gate
176
12
64

ListScore Table
97.19
90.19
89.55

Id
Score
InShortList
(ordered by Score)
27
Score-Threshold Method
Score Table
golden
156
12
89

Id
Score
98.32
90.19
79.52

1
70.85

12
105.0

...
gate
176
12
64

ListScore Table
97.19
90.19
89.55

Id
Score
InShortList
(ordered by Score)
28
Query-Update Tradeoff

Choice of threshold function
If threshold(score) score
Every update results in update to inverted list
Similar to Score Method
If threshold(score) infinity
No inverted list update, but scan all of list
Similar to ID Method
Can control query-update tradeoff using threshold
function
threshold(score) r score, r 1
r threshold ratio

29
Score-Threshold Method Critique

Provides good update-query tradeoff
But! Requires score to be stored in inverted list
Increases size of inverted list
Decreases query performance
Can we avoid storing scores in inverted list and
still get update-query tradeoff?

30
Chunk Method

Main idea divide document collection into
chunks based on original document score
Lowest 5000 documents in first chunk
Next higher 3000 documents in second chunk
Next higher 4000 documents in third chunk
Organize inverted list by chunk, but order
documents by Id within a chunk
Ordered approximately by score (chunk) like Score
Method
Avoids storing scores like in ID Method

31
Chunk Method
Score Table
golden
12
156
89

Id
Score
11
10

1
70.85

Short list
12
90.19

...
gate
12
64
156

ListScore Table
11

Id
Score
InShortList
(ordered by Chunk)
32
Chunk Method Details

Setting chunk boundaries
highdoc(c) highest score of document in chunk c
For two successive chunks c1 and c2
highdoc(c1)/highdoc(c2) r
r chunk ratio
Update document in short list only if document
score exceeds 2 chunk boundaries
2 chunks handles boundary cases

33
Chunk-TermScore Method

Support combination of SVR and TF-IDF
Combines Chunk Method with Fancy-ID Method Long
and Suel
In addition to long and short lists (ordered by
chunk), have short fancy list (ordered by TF-IDF)
Combined merge of all three lists
Details in ICDE paper

34
Summary of Alternatives

ID Method
Efficient updates, slow queries
Score Method
Efficient queries, slow updates
Score-Threshold Method
Efficient updates, Intermediate queries
Chunk Method
Efficient updates, Efficient queries
Chunk-TermScore Method
Efficient updates, Efficient queries, TF-IDF SVR

35
Outline

System Architecture
Indexing and Query Processing
Experimental Evaluation
Related Work
Conclusion

36
Experimental Setup

Two primary performance metrics
Time for a score update
Only time to update inverted lists
Time for a top-k query
Data sets
Real (Internet Archive) 60MB
Thanks to Brewster Kahle and Jon Aizen
Synthetic 805MB
Compared all five alternatives ID-TermScore
(baseline for Chunk-TermScore)

37
Implementation Details

Inverted lists implemented in BerkeleyDB
Long inverted lists as CLOBs
Read in a page at a time during query processing
Short inverted lists as clustered B trees
Since short inverted lists are updated
Query algorithms implemented in C

38
Inverted List Size

ID Method 145MB
Score Method 2768MB
Score-Threshold Method 847MB
Chunk Method 146MB
ID-TermScore Method 428MB
Chunk-TermScore Method 430MB

39
Effect of Chunk Ratio
Times in Milliseconds
40
Varying Updates
Times in Milliseconds
41
Varying k in Top-k
42
SVR TF-IDF
Times in Milliseconds
43
Summary of Alternatives

ID Method
Efficient updates, slow queries
Score Method
Efficient queries, slow updates
Score-Threshold Method
Efficient updates, Intermediate queries
Chunk Method
Efficient updates, Efficient queries
Chunk-TermScore Method
Efficient updates, Efficient queries, TF-IDF SVR

44
Outline

System Architecture
Indexing and Query Processing
Experimental Evaluation
Related Work
Conclusion

45
Related Work

SQL/MM
Integrating keyword search with databases
Banks, DBXplorer, Discover
Search across tuples, but simple or traditional
IR ranking
Top-k inverted lists and query processing
Do not handle score updates
Inverted list updates
Handle only content updates, not score updates
Proposed techniques can handle content updates too

46
Outline

System Architecture
Indexing and Query Processing
Experimental Evaluation
Related Work
Conclusion

47
10000 foot view of Data Management
Information Retrieval Systems
Ranked Keyword Search
Queries
Complex and Structured
Database Systems
Structured
Unstructured
Data
48
10000 foot view of Data Management
Information Retrieval Systems
Ranked Keyword Search
Queries
Complex and Structured
Database Systems
Structured
Unstructured
Data
49
10000 foot view of Data Management
Information Retrieval Systems
Ranked Keyword Search
Queries
Complex and Structured
Database Systems
Structured
Unstructured
Data
50
Towards Unifying DB and IR

XRank Keyword search over semi-structured XML
documents
Extends keyword search to work over both
structured and unstructured data
SIGMOD 2003 Guo et al.
TeXQuery Query language for structured and
unstructured data, structured and keyword queries
Precursor to W3C XQuery Full-Text
WWW 2004 Amer-Yahia et al.

51
Questions?

Write a Comment

User Comments (0)