1
Structured-Value Ranking in Update-Intensive
Relational Databases
  • Jayavel Shanmugasundaram
  • Cornell University
  • (Joint work with Lin Guo, Kevin Beyer, Eugene
    Shekita)

2
Case Study: Internet Archive
3
Internet Archive Database
Movies
Mid   Name              Description
10    Amateur Film      they stand on the golden gate bridge and ...
20    American Thrift   golden gate bridge with statue of liberty ...

SELECT * FROM Movies M
ORDER BY score(M.description, 'golden gate')
FETCH TOP 10 RESULTS ONLY
4
Main Issue
  • Traditional IR ranking methods would rank the two
    movies about the same
  • Example: TF-IDF
  • "golden gate" appears exactly once in both
    descriptions
  • Lengths of the text fields are about the same
  • Hence same normalized TF-IDF score
  • Larger issue: traditional IR scoring methods were
    developed for stand-alone document collections
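
The point about near-identical scores can be illustrated with a minimal Python sketch. It assumes a simple length-normalized term frequency (the slides do not say which TF-IDF variant is meant), and it uses only the truncated description fragments shown on the slide, so the numbers are illustrative rather than the real scores.

```python
def normalized_tf(text: str, term: str) -> float:
    """Occurrences of `term` divided by the number of tokens in `text`."""
    tokens = text.lower().split()
    return tokens.count(term) / len(tokens)

# Description fragments as they appear on the slide (both are truncated there).
descriptions = {
    10: "they stand on the golden gate bridge and",
    20: "golden gate bridge with statue of liberty",
}
query_terms = ["golden", "gate"]

# IDF depends only on the term and the collection, so it is identical for both
# rows; the ranking is decided by the normalized term frequencies alone.
tf_scores = {
    mid: sum(normalized_tf(text, term) for term in query_terms)
    for mid, text in descriptions.items()
}

for mid, score in tf_scores.items():
    print(mid, round(score, 3))
```

Both rows contain each query term once and have similar lengths, so a text-only score has almost nothing with which to separate them.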

5
Internet Archive Database
Movies
Mid   Name              Description
10    Amateur Film      they stand on the golden gate bridge and ...
20    American Thrift   golden gate bridge with statue of liberty ...

Structured Value Ranking (SVR)
6
Structured Value Ranking
  • Use structured data values associated with text
    columns to score results
  • Main technical challenge:
  • Structured data values (and hence scores) change
    frequently and possibly dramatically!
  • Number of visits, downloads, award announcements
  • SlashDot effect
  • Bursts and rapidly changing popularity
    [Kleinberg]
  • Users still want to see results ordered by latest
    score values

7
Dealing with Score Updates
  • Traditional top-k algorithms order inverted
    lists by score
  • Top-k queries answered efficiently by scanning
    only top part of inverted list
  • Not efficient if scores are updated
  • Need to reorder inverted lists
  • Solution
  • New family of inverted lists that are maintained
    in approximate score order
  • Correct for approximation during query processing

8
Summary of Proposed Techniques
  • SQL-based technique for specifying SVR in a
    relational database
  • New family of inverted lists that are robust to
    score updates, while still efficient for queries
  • Can specify update-query tradeoff
  • Combination of SVR and TF-IDF scores
  • Can be implemented using existing relational
    technology such as B-trees

9
Outline
  • System Architecture
  • Indexing and Query Processing
  • Experimental Evaluation
  • Related Work
  • Conclusion

11
System Architecture
RDBMS
Text Management Component
Relational Query Engine
Relational Tables and Indices
13
SQL-Based SVR Specification
create function S1 (id integer) returns float
  return SELECT Avg(R.rating) FROM Reviews R WHERE R.Mid = id

create function S2 (id integer) returns float
  return SELECT S.Visits FROM Statistics S WHERE S.Mid = id

create function S3 (id integer) returns float
  return SELECT S.Downloads FROM Statistics S WHERE S.Mid = id
14
SQL-Based SVR Specification
create function S1 (id integer) returns float
  return SELECT Avg(R.rating) FROM Reviews R WHERE R.Mid = id

create function S2 (id integer) returns float
  return SELECT S.Visits FROM Statistics S WHERE S.Mid = id

create function S3 (id integer) returns float
  return SELECT S.Downloads FROM Statistics S WHERE S.Mid = id

create function Agg (s1, s2, s3, s4 float) returns float
  return (s1*100 + s2/2 + s3 + s4/2)
  (s4: TFIDF())
15
Efficiently Maintaining SVR Scores
  • One of the key challenges: SVR scores can change
    frequently
  • Solution: use materialized views
  • Leverage relational technology
  • Benefit of SQL-based SVR specification

create materialized view Score as
  SELECT Agg(S1(M.Mid), S2(M.Mid), S3(M.Mid))
  FROM Movies M
16
System Architecture
RDBMS
Text Management Component
Relational Query Engine
Relational Tables and Indices
17
Outline
  • System Architecture
  • Indexing and Query Processing
  • Experimental Evaluation
  • Related Work
  • Conclusion

18
Index Operations
  • Document score updates
  • Handle frequent updates to scores
  • Top-k keyword queries
  • Conjunctive and disjunctive keyword queries
  • Include IR-style (TF-IDF) scores
  • Top-k query results
  • Content updates, insertions and deletions
  • Update to document content
  • Document insertions and deletions

19
Naïve Approach 1: ID Method
Inverted List (ordered by Id)
golden   10, 12, 18, 21, 34
gate     11, 13, 18, 34, 39
Score Table
Id   Score
1    70.85
2    91.86
3    12.34
...
  • Score updates efficient (just update score
    table)
  • Top-k queries inefficient (scan all of inverted
    list)
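
The trade-off of the ID Method can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the posting lists are the ones on the slide, while the scores for those document Ids and the helper names `update_score` and `top_k` are hypothetical.

```python
import heapq

# Posting lists ordered by document Id (values from the slide).
inverted = {
    "golden": [10, 12, 18, 21, 34],
    "gate": [11, 13, 18, 34, 39],
}
# Current score per document, kept outside the lists (hypothetical values).
score_table = {10: 70.85, 11: 91.86, 12: 12.34, 13: 55.0,
               18: 64.2, 21: 8.1, 34: 80.0, 39: 33.3}

def update_score(doc_id: int, new_score: float) -> None:
    """A score update touches only the score table, never an inverted list."""
    score_table[doc_id] = new_score

def top_k(terms: list[str], k: int) -> list[int]:
    """Conjunctive query: every list is read in full before ranking by score."""
    lists = [inverted[t] for t in terms]
    common = set(lists[0]).intersection(*lists[1:])
    return heapq.nlargest(k, common, key=score_table.__getitem__)

before = top_k(["golden", "gate"], 1)
update_score(18, 99.0)
after = top_k(["golden", "gate"], 1)
```

Because the lists carry no score information, the query cannot stop early: the best document might be the last posting.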

20
Naïve Approach 2: Score Method
Inverted List (ordered by Score; entries are (Id, Score))
golden   (156, 98.32)  (12, 90.19)  (89, 79.52)  (54, 77.79)
gate     (176, 97.19)  (12, 90.19)  (64, 89.55)  (4, 84.63)
  • Top-k queries efficient (top part of inverted
    list)
  • Score updates inefficient (reorganize many lists)
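
A minimal Python sketch of the Score Method shows the opposite trade-off. It uses the Ids and scores from the slide; the forward index `doc_terms`, used to find the lists that contain a document, is an assumption made for the example.

```python
import bisect

inverted = {}   # term -> [(-score, doc_id)], sorted, i.e. highest score first
doc_terms = {}  # doc_id -> terms (assumed forward index)
scores = {}     # doc_id -> current score

def add_document(doc_id, terms, score):
    scores[doc_id] = score
    doc_terms[doc_id] = list(terms)
    for t in terms:
        bisect.insort(inverted.setdefault(t, []), (-score, doc_id))

def update_score(doc_id, new_score):
    """Relocate the posting in every list containing the document.

    Returns the number of inverted lists that had to be modified.
    """
    old = scores[doc_id]
    for t in doc_terms[doc_id]:
        postings = inverted[t]
        postings.remove((-old, doc_id))
        bisect.insort(postings, (-new_score, doc_id))
    scores[doc_id] = new_score
    return len(doc_terms[doc_id])

def top_k(term, k):
    """Single-term top-k: read only the first k postings."""
    return [doc_id for _, doc_id in inverted[term][:k]]

for doc_id, terms, score in [
    (156, ["golden"], 98.32), (12, ["golden", "gate"], 90.19),
    (89, ["golden"], 79.52), (54, ["golden"], 77.79),
    (176, ["gate"], 97.19), (64, ["gate"], 89.55), (4, ["gate"], 84.63),
]:
    add_document(doc_id, terms, score)

top_before = top_k("golden", 2)
lists_touched = update_score(12, 105.0)
top_after = top_k("golden", 2)
```

A single score change for document 12 rewrites one posting per term the document contains, which is the cost the later methods try to avoid.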

21
Dilemma
  • Want inverted lists ordered by score
  • For top-k query performance
  • Like in Score Method
  • But do not want to touch inverted lists for every
    score update
  • For score update performance
  • Like in ID Method
  • How can we address this apparent dilemma?

22
Score-Threshold Method
  • Extends Score Method in two key aspects
  • Allow inverted list scores to be out-of-date by
    up to a threshold
  • Avoids having to frequently update inverted list
  • Better score update performance
  • Need to scan more of inverted list (by up to a
    threshold) to correct for out-of-date score
  • Slightly reduced query performance
  • Use short inverted list for scores that exceed
    threshold
  • More efficient than updating large inverted list

23
Score-Threshold Method
Inverted List (ordered by Score; entries are (Id, Score))
golden   (156, 98.32)  (12, 90.19)  (89, 79.52)
gate     (176, 97.19)  (12, 90.19)  (64, 89.55)
Short list
Score Table
Id   Score
1    70.85
12   90.19
...
ListScore Table
Id   Score   InShortList
24
Score-Threshold Method
Inverted List (ordered by Score; entries are (Id, Score))
golden   (156, 98.32)  (12, 90.19)  (89, 79.52)
gate     (176, 97.19)  (12, 90.19)  (64, 89.55)
Score Table
Id   Score
1    70.85
12   95.00
...
ListScore Table
Id   Score   InShortList
26
Score-Threshold Method
Inverted List (ordered by Score; entries are (Id, Score))
golden   (156, 98.32)  (12, 90.19)  (89, 79.52)
gate     (176, 97.19)  (12, 90.19)  (64, 89.55)
Score Table
Id   Score
1    70.85
12   105.0
...
ListScore Table
Id   Score   InShortList
28
Query-Update Tradeoff
  • Choice of threshold function
  • If threshold(score) = score
  • Every update results in update to inverted list
  • Similar to Score Method
  • If threshold(score) = infinity
  • No inverted list update, but scan all of list
  • Similar to ID Method
  • Can control query-update tradeoff using threshold
    function
  • threshold(score) = r × score, r ≥ 1
  • r: threshold ratio
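
The Score-Threshold idea with threshold(score) = r × score can be sketched as follows. This is a simplified single-term Python illustration under stated assumptions, not the algorithm from the paper: the value r = 1.1, the forward index `doc_terms`, and all function names are hypothetical, and only score increases are exercised.

```python
import heapq

R = 1.1  # threshold ratio r (assumed value)

def threshold(score):
    return R * score

score_table = {}  # doc_id -> current score
list_score = {}   # doc_id -> score recorded in the lists (may be stale)
long_list = {}    # term -> [(list score, doc_id)], highest first, never rewritten
short_list = {}   # term -> same layout, small, absorbs large score changes
doc_terms = {}    # doc_id -> terms (assumed forward index)

def build(docs):
    for doc_id, (score, terms) in docs.items():
        score_table[doc_id] = list_score[doc_id] = score
        doc_terms[doc_id] = terms
        for t in terms:
            long_list.setdefault(t, []).append((score, doc_id))
    for postings in long_list.values():
        postings.sort(reverse=True)

def update_score(doc_id, new_score):
    """Return True only if a list had to be touched."""
    score_table[doc_id] = new_score
    if new_score <= threshold(list_score[doc_id]):
        return False  # stale list score is still within the allowed bound
    list_score[doc_id] = new_score
    for t in doc_terms[doc_id]:
        entries = [p for p in short_list.get(t, []) if p[1] != doc_id]
        entries.append((new_score, doc_id))
        short_list[t] = sorted(entries, reverse=True)
    return True

def top_k(term, k):
    merged = heapq.merge(long_list[term], short_list.get(term, []), reverse=True)
    best, seen = [], set()  # best: min-heap of (current score, doc_id)
    for ls, doc_id in merged:
        # No unseen document can have a current score above threshold(ls).
        if len(best) == k and best[0][0] >= threshold(ls):
            break
        if doc_id in seen:
            continue  # stale long-list copy of a document already read
        seen.add(doc_id)
        heapq.heappush(best, (score_table[doc_id], doc_id))
        if len(best) > k:
            heapq.heappop(best)
    return [doc_id for _, doc_id in sorted(best, reverse=True)]

build({156: (98.32, ["golden"]), 12: (90.19, ["golden", "gate"]),
       89: (79.52, ["golden"]), 54: (77.79, ["golden"])})
touched_small = update_score(12, 95.00)   # 95.00 <= 1.1 * 90.19
top_after_small = top_k("golden", 1)
touched_large = update_score(12, 105.0)   # 105.0 >  1.1 * 90.19
top_after_large = top_k("golden", 1)
```

The query reads slightly past the k-th posting to correct for stale list scores, and a larger r means fewer list updates but a longer scan.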

29
Score-Threshold Method Critique
  • Provides good update-query tradeoff
  • But! Requires score to be stored in inverted list
  • Increases size of inverted list
  • Decreases query performance
  • Can we avoid storing scores in inverted list and
    still get update-query tradeoff?

30
Chunk Method
  • Main idea: divide document collection into
    chunks based on original document score
  • Lowest 5000 documents in first chunk
  • Next higher 3000 documents in second chunk
  • Next higher 4000 documents in third chunk
  • Organize inverted list by chunk, but order
    documents by Id within a chunk
  • Ordered approximately by score (chunk) like Score
    Method
  • Avoids storing scores like in ID Method

31
Chunk Method
Inverted List (ordered by Chunk)
golden   12, 156, 89   (chunks 11, 10)
gate     12, 64, 156   (chunk 11)
Short list
Score Table
Id   Score
1    70.85
12   90.19
...
ListScore Table
Id   Score   InShortList
32
Chunk Method Details
  • Setting chunk boundaries
  • highdoc(c): highest score of any document in chunk c
  • For two successive chunks c1 and c2:
  • highdoc(c1)/highdoc(c2) = r
  • r: chunk ratio
  • Update document in short list only if document
    score exceeds 2 chunk boundaries
  • Using 2 chunks handles boundary cases
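
A minimal Python sketch of chunk assignment and the update rule follows. The chunk ratio `R`, the lowest boundary `LOW`, and the function names are hypothetical, and "exceeds 2 chunk boundaries" is read here as "the new score lands at least two chunks above the chunk recorded in the lists", which is one possible interpretation of the slide.

```python
R = 2.0     # chunk ratio r (assumed value)
LOW = 10.0  # assumed highest score of the lowest chunk

def chunk_of(score: float) -> int:
    """Chunk 0 covers scores <= LOW; each later chunk's upper bound is
    R times the previous one, so successive bounds have ratio R."""
    chunk, high = 0, LOW
    while score > high:
        high *= R
        chunk += 1
    return chunk

score_table = {156: 98.32, 12: 90.19, 89: 79.52}
list_chunk = {d: chunk_of(s) for d, s in score_table.items()}

def postings(doc_ids):
    """Order by chunk (highest first), then by Id within a chunk.
    No scores are stored in the list itself."""
    return sorted(doc_ids, key=lambda d: (-list_chunk[d], d))

def update_score(doc_id: int, new_score: float) -> bool:
    """Return True only if the document would be rewritten to a short list."""
    score_table[doc_id] = new_score
    new_chunk = chunk_of(new_score)
    if new_chunk - list_chunk[doc_id] >= 2:
        list_chunk[doc_id] = new_chunk
        return True
    return False

golden = postings([156, 12, 89])
moved_small = update_score(12, 170.0)  # crosses one boundary only
moved_large = update_score(12, 400.0)  # crosses two boundaries
```

With these assumed boundaries the posting order for "golden" is 12, 156, 89, and only the jump to 400.0 forces a list update.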

33
Chunk-TermScore Method
  • Support combination of SVR and TF-IDF
  • Combines Chunk Method with Fancy-ID Method
    [Long and Suel]
  • In addition to long and short lists (ordered by
    chunk), have short fancy list (ordered by TF-IDF)
  • Combined merge of all three lists
  • Details in ICDE paper

34
Summary of Alternatives
  • ID Method: efficient updates, slow queries
  • Score Method: efficient queries, slow updates
  • Score-Threshold Method: efficient updates,
    intermediate queries
  • Chunk Method: efficient updates, efficient queries
  • Chunk-TermScore Method: efficient updates,
    efficient queries, TF-IDF + SVR

35
Outline
  • System Architecture
  • Indexing and Query Processing
  • Experimental Evaluation
  • Related Work
  • Conclusion

36
Experimental Setup
  • Two primary performance metrics
  • Time for a score update
  • Only time to update inverted lists
  • Time for a top-k query
  • Data sets
  • Real (Internet Archive): 60 MB
  • Thanks to Brewster Kahle and Jon Aizen
  • Synthetic: 805 MB
  • Compared all five alternatives, plus ID-TermScore
    (baseline for Chunk-TermScore)

37
Implementation Details
  • Inverted lists implemented in BerkeleyDB
  • Long inverted lists as CLOBs
  • Read in a page at a time during query processing
  • Short inverted lists as clustered B trees
  • Since short inverted lists are updated
  • Query algorithms implemented in C

38
Inverted List Size
  • ID Method: 145 MB
  • Score Method: 2768 MB
  • Score-Threshold Method: 847 MB
  • Chunk Method: 146 MB
  • ID-TermScore Method: 428 MB
  • Chunk-TermScore Method: 430 MB

39
Effect of Chunk Ratio
Times in Milliseconds
40
Varying Updates
Times in Milliseconds
41
Varying k in Top-k
42
SVR + TF-IDF
Times in Milliseconds
43
Summary of Alternatives
  • ID Method: efficient updates, slow queries
  • Score Method: efficient queries, slow updates
  • Score-Threshold Method: efficient updates,
    intermediate queries
  • Chunk Method: efficient updates, efficient queries
  • Chunk-TermScore Method: efficient updates,
    efficient queries, TF-IDF + SVR

44
Outline
  • System Architecture
  • Indexing and Query Processing
  • Experimental Evaluation
  • Related Work
  • Conclusion

45
Related Work
  • SQL/MM
  • Integrating keyword search with databases
  • Banks, DBXplorer, Discover
  • Search across tuples, but simple or traditional
    IR ranking
  • Top-k inverted lists and query processing
  • Do not handle score updates
  • Inverted list updates
  • Handle only content updates, not score updates
  • Proposed techniques can handle content updates too

46
Outline
  • System Architecture
  • Indexing and Query Processing
  • Experimental Evaluation
  • Related Work
  • Conclusion

47
10,000-foot view of Data Management
[Diagram: Queries axis (Ranked Keyword Search; Complex and Structured) against Data axis (Structured; Unstructured), locating Information Retrieval Systems and Database Systems]
50
Towards Unifying DB and IR
  • XRank: keyword search over semi-structured XML
    documents
  • Extends keyword search to work over both
    structured and unstructured data
  • [SIGMOD 2003, Guo et al.]
  • TeXQuery: query language for structured and
    unstructured data, structured and keyword queries
  • Precursor to W3C XQuery Full-Text
  • [WWW 2004, Amer-Yahia et al.]

51
Questions?