Title: StructuredValue Ranking in UpdateIntensive Relational Databases
1Structured-Value Ranking in Update-Intensive
Relational Databases
- Jayavel Shanmugasundaram
- Cornell University
- (Joint work withLin Guo, Kevin Beyer, Eugene
Shekita)
2Case Study Internet Archive
3Internet Archive Database
Movies
Name
Description
Mid
they stand on the golden gate bridge and
10
Amateur Film
20
American Thrift
golden gate bridge with statue of liberty
SELECT FROM Movies M ORDER BY
score(M.description, golden gate) FETCH TOP 10
RESULTS ONLY
4Main Issue
- Traditional IR ranking methods would rank the two
movies about the same - Example TF-IDF
- Golden Gate appears exactly once in both
descriptions - Length of the text fields are about the same
- Hence same normalized TF-IDF score
- Larger issue Traditional IR scoring methods
developed for stand-alone document collections
5Internet Archive Database
Movies
Name
Description
Mid
they stand on the golden gate bridge and
10
Amateur Film
20
American Thrift
golden gate bridge with statue of liberty
Structured Value Ranking (SVR)
6Structured Value Ranking
- Use structured data values associated with text
columns to score results - Main technical challenge
- Structured data value (and hence scores) change
frequently and possibly dramatically! - Number of visits, downloads, award announcements
- SlashDot effect
- Bursts and rapidly changing popularity
Kleinberg - Users still want to see results ordered by latest
score values
7Dealing with Score Updates
- Traditional top-k algorithms order inverted
lists by score - Top-k queries answered efficiently by scanning
only top part of inverted list - Not efficient if scores are updated
- Need to reorder inverted lists
- Solution
- New family of inverted lists that are maintained
in approximate score order - Correct for approximation during query processing
8Summary of Proposed Techniques
- SQL-based technique for specifying SVR in a
relational database - New family of inverted lists that are robust to
score updates, while still efficient for queries - Can specify update-query tradeoff
- Combination of SVR and TF-IDF scores
- Can be implemented using existing relational
technology such as B-trees
9Outline
- System Architecture
- Indexing and Query Processing
- Experimental Evaluation
- Related Work
- Conclusion
10Internet Archive Database
Movies
Name
Description
Mid
they stand on the golden gate bridge and
10
Amateur Film
20
American Thrift
golden gate bridge with statue of liberty
11System Architecture
RDBMS
Text Management Component
Relational Query Engine
Relational Tables and Indices
12Internet Archive Database
Movies
Name
Description
Mid
they stand on the golden gate bridge and
10
Amateur Film
20
American Thrift
golden gate bridge with statue of liberty
13SQL-Based SVR Specification
create function S1 (id integer) returns
float return SELECT Avg(R.rating) FROM
Reviews R WHERE R.Mid id
create function S2 (id integer) returns
float return SELECT S.Visits FROM
Statistics S WHERE S.Mid id
create function S3 (id integer) returns
float return SELECT S.Downloads FROM
Statistics S WHERE S.Mid id
14SQL-Based SVR Specification
create function S1 (id integer) returns
float return SELECT Avg(R.rating) FROM
Reviews R WHERE R.Mid id
create function S2 (id integer) returns
float return SELECT S.Visits FROM
Statistics S WHERE S.Mid id
create function S3 (id integer) returns
float return SELECT S.Downloads FROM
Statistics S WHERE S.Mid id
create function Agg (s1, s2, s3, s4 float)
returns float return (s1100 s2/2 s3 s4/2)
(s4 TFIDF())
15Efficiently Maintaining SVR Scores
- One of key challenges SVR scores can change
frequently - Solution use materialized views
- Leverage relational technology
- Benefit of SQL-based SVR specification
create materialized view Score as SELECT
Agg(S1(M.Mid), S2(M.Mid), S3(M.Mid)) FROM
Movies M
16System Architecture
RDBMS
Text Management Component
Relational Query Engine
Relational Tables and Indices
17Outline
- System Architecture
- Indexing and Query Processing
- Experimental Evaluation
- Related Work
- Conclusion
18Index Operations
- Document score updates
- Handle frequent updates to scores
- Top-k keyword queries
- Conjunctive and disjunctive keyword queries
- Include IR-style (TF-IDF) scores
- Top-k query results
- Content updates, insertions and deletions
- Update to document content
- Document insertions and deletions
19Naïve Approach 1 ID Method
Inverted List
Score Table
Id
Score
golden
10
12
18
21
34
1
70.85
2
91.86
gate
11
13
18
34
39
3
12.34
...
(ordered by Id)
- Score updates efficient (just update score
table) - Top-k queries inefficient (scan all of inverted
list)
20Naïve Approach 2 Score Method
Inverted List
Score
golden
156
12
89
54
98.32
90.19
79.52
77.79
gate
176
12
64
4
97.19
90.19
89.55
84.63
(ordered by Score)
- Top-k queries efficient (top part of inverted
list) - Score updates inefficient (reorganize many lists)
21Dilemma
- Want inverted lists ordered by score
- For top-k query performance
- Like in Score Method
- But do not want to touch inverted lists for every
score update - For score update performance
- Like in ID Method
- How can we address this apparent dilemma?
22Score-Threshold Method
- Extends Score Method in two key aspects
- Allow inverted list scores to be out-of-date by
up to a threshold - Avoids having to frequently update inverted list
- Better score update performance
- Need to scan more of inverted list (by up to a
threshold) to correct for out-of-date score - Slightly reduced query performance
- Use short inverted list for scores that exceed
threshold - More efficient than updating large inverted list
23Score-Threshold Method
Score Table
golden
156
12
89
Id
Score
98.32
90.19
79.52
1
70.85
Short list
12
90.19
...
gate
176
12
64
ListScore Table
97.19
90.19
89.55
Id
Score
InShortList
(ordered by Score)
24Score-Threshold Method
Score Table
golden
156
12
89
Id
Score
98.32
90.19
79.52
1
70.85
12
95.00
...
gate
176
12
64
ListScore Table
97.19
90.19
89.55
Id
Score
InShortList
(ordered by Score)
25Score-Threshold Method
Score Table
golden
156
12
89
Id
Score
98.32
90.19
79.52
1
70.85
12
95.00
...
gate
176
12
64
ListScore Table
97.19
90.19
89.55
Id
Score
InShortList
(ordered by Score)
26Score-Threshold Method
Score Table
golden
156
12
89
Id
Score
98.32
90.19
79.52
1
70.85
12
105.0
...
gate
176
12
64
ListScore Table
97.19
90.19
89.55
Id
Score
InShortList
(ordered by Score)
27Score-Threshold Method
Score Table
golden
156
12
89
Id
Score
98.32
90.19
79.52
1
70.85
12
105.0
...
gate
176
12
64
ListScore Table
97.19
90.19
89.55
Id
Score
InShortList
(ordered by Score)
28Query-Update Tradeoff
- Choice of threshold function
- If threshold(score) score
- Every update results in update to inverted list
- Similar to Score Method
- If threshold(score) infinity
- No inverted list update, but scan all of list
- Similar to ID Method
- Can control query-update tradeoff using threshold
function - threshold(score) r score, r 1
- r threshold ratio
29Score-Threshold Method Critique
- Provides good update-query tradeoff
- But! Requires score to be stored in inverted list
- Increases size of inverted list
- Decreases query performance
- Can we avoid storing scores in inverted list and
still get update-query tradeoff?
30Chunk Method
- Main idea divide document collection into
chunks based on original document score - Lowest 5000 documents in first chunk
- Next higher 3000 documents in second chunk
- Next higher 4000 documents in third chunk
-
- Organize inverted list by chunk, but order
documents by Id within a chunk - Ordered approximately by score (chunk) like Score
Method - Avoids storing scores like in ID Method
31Chunk Method
Score Table
golden
12
156
89
Id
Score
11
10
1
70.85
Short list
12
90.19
...
gate
12
64
156
ListScore Table
11
Id
Score
InShortList
(ordered by Chunk)
32Chunk Method Details
- Setting chunk boundaries
- highdoc(c) highest score of document in chunk c
- For two successive chunks c1 and c2
- highdoc(c1)/highdoc(c2) r
- r chunk ratio
- Update document in short list only if document
score exceeds 2 chunk boundaries - 2 chunks handles boundary cases
33Chunk-TermScore Method
- Support combination of SVR and TF-IDF
- Combines Chunk Method with Fancy-ID Method Long
and Suel - In addition to long and short lists (ordered by
chunk), have short fancy list (ordered by TF-IDF) - Combined merge of all three lists
- Details in ICDE paper
34Summary of Alternatives
- ID Method
- Efficient updates, slow queries
- Score Method
- Efficient queries, slow updates
- Score-Threshold Method
- Efficient updates, Intermediate queries
- Chunk Method
- Efficient updates, Efficient queries
- Chunk-TermScore Method
- Efficient updates, Efficient queries, TF-IDF SVR
35Outline
- System Architecture
- Indexing and Query Processing
- Experimental Evaluation
- Related Work
- Conclusion
36Experimental Setup
- Two primary performance metrics
- Time for a score update
- Only time to update inverted lists
- Time for a top-k query
- Data sets
- Real (Internet Archive) 60MB
- Thanks to Brewster Kahle and Jon Aizen
- Synthetic 805MB
- Compared all five alternatives ID-TermScore
(baseline for Chunk-TermScore)
37Implementation Details
- Inverted lists implemented in BerkeleyDB
- Long inverted lists as CLOBs
- Read in a page at a time during query processing
- Short inverted lists as clustered B trees
- Since short inverted lists are updated
- Query algorithms implemented in C
38Inverted List Size
- ID Method 145MB
- Score Method 2768MB
- Score-Threshold Method 847MB
- Chunk Method 146MB
- ID-TermScore Method 428MB
- Chunk-TermScore Method 430MB
39Effect of Chunk Ratio
Times in Milliseconds
40Varying Updates
Times in Milliseconds
41Varying k in Top-k
42SVR TF-IDF
Times in Milliseconds
43Summary of Alternatives
- ID Method
- Efficient updates, slow queries
- Score Method
- Efficient queries, slow updates
- Score-Threshold Method
- Efficient updates, Intermediate queries
- Chunk Method
- Efficient updates, Efficient queries
- Chunk-TermScore Method
- Efficient updates, Efficient queries, TF-IDF SVR
44Outline
- System Architecture
- Indexing and Query Processing
- Experimental Evaluation
- Related Work
- Conclusion
45Related Work
- SQL/MM
- Integrating keyword search with databases
- Banks, DBXplorer, Discover
- Search across tuples, but simple or traditional
IR ranking - Top-k inverted lists and query processing
- Do not handle score updates
- Inverted list updates
- Handle only content updates, not score updates
- Proposed techniques can handle content updates too
46Outline
- System Architecture
- Indexing and Query Processing
- Experimental Evaluation
- Related Work
- Conclusion
4710000 foot view of Data Management
Information Retrieval Systems
Ranked Keyword Search
Queries
Complex and Structured
Database Systems
Structured
Unstructured
Data
4810000 foot view of Data Management
Information Retrieval Systems
Ranked Keyword Search
Queries
Complex and Structured
Database Systems
Structured
Unstructured
Data
4910000 foot view of Data Management
Information Retrieval Systems
Ranked Keyword Search
Queries
Complex and Structured
Database Systems
Structured
Unstructured
Data
50Towards Unifying DB and IR
- XRank Keyword search over semi-structured XML
documents - Extends keyword search to work over both
structured and unstructured data - SIGMOD 2003 Guo et al.
- TeXQuery Query language for structured and
unstructured data, structured and keyword queries - Precursor to W3C XQuery Full-Text
- WWW 2004 Amer-Yahia et al.
51Questions?