1
Concurrency Control On Inverted Lists
CS223 Transaction Processing and Distributed
Data Management

Alexander Behm
University of California, Irvine
Instructor: Prof. Sharad Mehrotra
Based on [BMV96], [KAM96], [MOH93]
(see references for details)

2
Overview

Introduction to Inverted Lists
Transactions for Inverted Lists
GOLD Text Indexing Engine
Summary
ARIES/LHS
References

3
Introduction to Inverted Lists
Suppose we have a set of documents with keywords:
Doc1 Keywords: Transaction, Concurrency
Doc2 Keywords: Serializability, Transaction
Doc3 Keywords: Database, Transaction

How can we efficiently answer keyword queries? Such as:
- Get documents with keyword1 AND keyword2 AND ...
- Get documents with keyword1 OR keyword2 OR ...
- Get documents with keyword1 AND keyword2 AND NOT ...
- Etc.

4
Introduction to Inverted Lists
A popular solution in Information Retrieval (IR): create an inverted list index.

Index on keywords: Inverted lists (containing document IDs)
Database: 1 3 5 6
Transaction: 1 2 5
Concurrency: 2 7 8
Serializability: 3 4 7 8
Phantom: 5 7 8

In essence: for each keyword, keep a list of the documents that contain it.

5
Introduction to Inverted Lists
How can we answer queries?
Example: get documents with Database AND Transaction AND Phantom.

Database: 1 3 5 6
Transaction: 1 2 5
Concurrency: 2 7 8
Serializability: 3 4 7 8
Phantom: 5 7 8

Create the set intersection of the Database, Transaction and Phantom lists: 5
DocID 5 is a result!

Other operations are modeled in similar fashion,
e.g. OR by set union, AND NOT by set difference, etc.
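
The set-based evaluation described above can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the dictionary `index` holds the lists from the slide, and the helper names `query_and`, `query_or` and `query_and_not` are assumptions made for this example.

```python
# Inverted index from the slide: keyword -> set of document IDs.
index = {
    "Database": {1, 3, 5, 6},
    "Transaction": {1, 2, 5},
    "Concurrency": {2, 7, 8},
    "Serializability": {3, 4, 7, 8},
    "Phantom": {5, 7, 8},
}


def query_and(index, keywords):
    """AND query: intersect the inverted lists of all keywords."""
    lists = [index.get(k, set()) for k in keywords]
    return set.intersection(*lists) if lists else set()


def query_or(index, keywords):
    """OR query: union of the inverted lists of all keywords."""
    return set().union(*(index.get(k, set()) for k in keywords))


def query_and_not(index, include, exclude):
    """AND NOT query: set difference between the AND part and the excluded lists."""
    return query_and(index, include) - query_or(index, exclude)


print(query_and(index, ["Database", "Transaction", "Phantom"]))
```

Only the lists named in the query are touched; the Concurrency and Serializability lists are never read for this example.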
6
Inverted Lists: Good and Bad

Good
- when answering queries, we only look at inverted lists
- only documents matching the query are retrieved
- if inverted lists are sorted, union, intersection, etc. can be implemented efficiently
- other applications in exact and fuzzy string matching

Bad
- the inverted list structure can become very large
- updates may need to modify several inverted lists at a time, i.e. updates are expensive
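
To illustrate why sorted lists make intersection efficient, here is a minimal two-pointer merge in Python. The function name `intersect_sorted` is an assumption for this sketch; the slides do not prescribe a particular implementation.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of docIDs in O(len(a) + len(b)) steps."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same docID in both lists: it belongs to the intersection.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1  # b[j] cannot appear later in a, so skip it
    return result


database = [1, 3, 5, 6]
transaction = [1, 2, 5]
print(intersect_sorted(database, transaction))
```

Each list is scanned once from front to back, which is what makes the sortedness assumption valuable.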

7
Transactions for Inverted Lists
Characteristics of Information Retrieval Systems

For some IR systems being up to date is not critical; they can be read-only
- E.g. an online shop where new products are (typically) not added every minute
- Read-only systems perform updates offline, e.g. at night in a batch
- No concurrency control needed, no transactions needed

For other systems being up to date is critical, e.g. news systems
- The most relevant documents for a query may be the most recent ones
- Updates may be frequent (but typically still less frequent than reads)
- Concurrency control needed, e.g. model queries as transactions

8
Transactions for Inverted Lists
What about traditional CC mechanisms, e.g. 2PL?
Consider adding this document:
Doc9 Keywords: Database, Concurrency, Phantom

Steps under 2PL:
1. 2PL acquires long-term locks
2. Perform updates
3. 2PL releases locks

Inverted lists:
Database: 1 3 5 6
Transaction: 1 2 5
Concurrency: 2 7 8
Serializability: 3 4 7 8
Phantom: 5 7 8

A read-query accessing Database AND Serializability must wait for the whole update to complete => BAD

9
Transactions for Inverted Lists
Observations by [KAM96]

Write-Write conflicts
- most write operations are actually appends, i.e. we append a new docID to some lists (we discuss deletions/updates later)
- appends are idempotent and commute (consider an inverted list as a set of docIDs)

Read-Write conflicts
- write operations do not depend on reads, i.e. transactions are write-only or read-only
- read operations do not need to wait until the whole update has completed
- they need only wait until the conflicting lists have been updated
- they may read each list directly after the conflicting update has completed
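
The claim that appends are idempotent and commute can be checked with a small Python experiment, assuming (as the slide suggests) that an inverted list is treated as a set of docIDs. The helper `apply_appends` and the sample values are made up for this illustration.

```python
from itertools import permutations


def apply_appends(inverted_list, appends):
    """Apply appends to a copy of the list, treating it as a set of docIDs."""
    result = set(inverted_list)
    for doc_id in appends:
        result.add(doc_id)  # adding an ID twice has no further effect
    return result


base = {1, 3, 5, 6}
appends = [9, 10, 9]  # docID 9 is appended twice on purpose

# Try every execution order of the appends and collect the distinct outcomes.
outcomes = {frozenset(apply_appends(base, order)) for order in permutations(appends)}
print(outcomes)
```

Every ordering ends in the same final list, which is why a scheduler is free to reorder such writes.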

10
Transactions for Inverted Lists
I. Using latches: example
T1: w(Database), w(Transaction)
T2: r(Database), r(Concurrency)
Schedule: w1(Database), r2(Database), r2(Concurrency), w1(Transaction)

Latch events shown in the figure, in order: T1 locks, T1 releases, T2 locks, T2 releases, T2 locks

Inverted lists:
Database: 1 3 5 6
Transaction: 1 2 5
Concurrency: 2 7 8
Serializability: 3 4 7 8
Phantom: 5 7 8

11
Transactions for Inverted Lists
II. Keeping track of read-write dependencies
T1: add docID 9 with keywords Database, Concurrency
T2: get docs containing Database AND Concurrency
Since we are using latches and not e.g. 2PL, the following schedule is possible:
S: w1(Database), r2(Database), r2(Concurrency), w1(Concurrency)
- We miss docID 9 because it has yet to be added to the Concurrency list
- Results of T2 are still correct! Meaning no false answers are returned
- However, we miss the most recent (and possibly most relevant) document!
- To meet the recency requirement we track dependencies for read-transactions
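
The missed-document effect can be reproduced with a tiny single-threaded simulation. This is a sketch under assumptions: the step encoding `(op, keyword, doc_id)`, the helper `run_schedule`, and the starting lists (taken from the earlier slides) are chosen for illustration only.

```python
def fresh_index():
    """Starting lists for the two keywords involved (values from the earlier slides)."""
    return {"Database": {1, 3, 5, 6}, "Concurrency": {2, 7, 8}}


def run_schedule(index, schedule):
    """Run steps in order; each read snapshots its list at that moment.

    Returns the AND result over all lists that were read.
    """
    seen = {}
    for op, keyword, doc_id in schedule:
        if op == "w":
            index[keyword].add(doc_id)           # append docID to the list
        else:
            seen[keyword] = set(index[keyword])  # read the list as it is now
    return set.intersection(*seen.values())


# T1 appends docID 9 to both lists; T2 reads both lists.
interleaved = [("w", "Database", 9), ("r", "Database", None),
               ("r", "Concurrency", None), ("w", "Concurrency", 9)]
serial = [("w", "Database", 9), ("w", "Concurrency", 9),
          ("r", "Database", None), ("r", "Concurrency", None)]

print(run_schedule(fresh_index(), interleaved))  # T2 does not see docID 9
print(run_schedule(fresh_index(), serial))       # T2 sees docID 9
```

The interleaved result is a subset of the serial one: nothing false is returned, but the newest document is missing.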
12
Transactions for Inverted Lists
III. Operation reordering: example
T1: w(x), w(y), w(z)
T2: w(a), w(b), w(z)
T3: r(z), r(b), r(m)

Say we first scheduled T1 and T2:
S: w1(x), w1(y), w1(z), w2(a), w2(b), w2(z)
Now T3 arrives at the scheduler.
We reorder operations to minimize the wait for T3:
S: w1(z), w2(z), r3(z), w2(b), r3(b), r3(m), w1(x), w1(y), w2(a)
Why can we do this?
- We use latches; locks are immediately released
- Writes are independent of reads
- Each transaction writes a data item no more than once
=> the order of execution for writes can be chosen arbitrarily (assuming appends)
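
One simple way to produce such a reordered schedule is sketched below. It is an assumed greedy policy for illustration, not necessarily the scheduler of the cited paper: for each list the reader needs, the pending writes on that list are moved to the front, followed by the read; all remaining writes run afterwards. The function name `reorder` and the tuple encoding are hypothetical.

```python
def reorder(pending_writes, reader_id, read_lists):
    """Reorder pending appends so that a newly arrived reader waits as little as possible.

    pending_writes: list of (transaction_id, list_name) in their original order
    read_lists:     list names the reader accesses, in its own order
    """
    remaining = list(pending_writes)
    schedule = []
    for name in read_lists:
        # Conflicting writes on this list go first, in their original order.
        for txn, target in remaining:
            if target == name:
                schedule.append(f"w{txn}({target})")
        remaining = [w for w in remaining if w[1] != name]
        schedule.append(f"r{reader_id}({name})")
    # Writes the reader does not care about are executed last.
    schedule.extend(f"w{txn}({target})" for txn, target in remaining)
    return schedule


pending = [(1, "x"), (1, "y"), (1, "z"), (2, "a"), (2, "b"), (2, "z")]
print(", ".join(reorder(pending, 3, ["z", "b", "m"])))
```

The sketch relies on the properties listed above: writes are appends, do not depend on reads, and touch each list at most once per transaction.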

13
[Flowchart: processing of a transaction T]

Data structures
- T.ACCESS: inverted lists that need to be locked
- ACTIVE: list of active update transactions
- T.CONFLICT: list of conflicting active update transactions

Steps
1. Determine T.ACCESS
2. Update?
   Y: add T to ACTIVE, then request locks in T.ACCESS
   N: fill T.CONFLICT using ACTIVE, reorder operations, then request locks in T.ACCESS

Update transaction (Y branch)
- Lock request: if the granted list is empty, issue the lock; else put T on the wait queue
- Unlock request: add the transaction ID to PROCESSED and release the lock; if the wait queue is not empty, grant the lock to the next T in the wait queue

Read transaction (N branch)
- Lock request: safety check against the conflict list (and processed list); if safe AND the lock is NOT granted to a write, issue a read lock; else put T on the wait queue
- Unlock request: release the lock; if the wait queue is not empty, grant the lock to the next transaction in the queue
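
The conflict bookkeeping in the flowchart can be sketched as two small Python functions. The names `fill_conflict` and `is_safe` and the data layout are assumptions. So is the exact meaning of "safe", which is interpreted here as: every conflicting update transaction that touches the list has already written it. The lock queues themselves are left out.

```python
def fill_conflict(read_access, active):
    """Return IDs of active update transactions that share a list with the reader.

    read_access: set of list names the reader accesses (T.ACCESS)
    active:      dict update-transaction ID -> set of list names it writes (ACTIVE)
    """
    return {tid for tid, lists in active.items() if lists & read_access}


def is_safe(list_name, conflict, active, processed):
    """Assumed safety check for reading one list.

    processed: dict list name -> set of update-transaction IDs that already wrote it
    """
    pending = {tid for tid in conflict if list_name in active[tid]}
    return pending <= processed.get(list_name, set())


# T1 adds a document with keywords Database and Concurrency; a reader needs both lists.
active = {1: {"Database", "Concurrency"}}
conflict = fill_conflict({"Database", "Concurrency"}, active)
processed = {"Database": {1}}  # T1 has written Database but not yet Concurrency

print(is_safe("Database", conflict, active, processed))     # reader may proceed
print(is_safe("Concurrency", conflict, active, processed))  # reader must wait
```

Under this interpretation the reader waits per list rather than for the whole update, which matches the read-write observations made earlier.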
14
Notes/Thoughts

Deletions/Updates
- Deletions can be handled by the proposed CC mechanism
- Updates can be seen as a collection of appends and deletions

Reducing locking overhead
- Main property: an inverted list is accessed at most once by a transaction
- Appends can be aggregated into mini-batches
- Deletions can be memorized in a purged list
- When the purged list reaches a certain size, the deletions are performed in a batch
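
A minimal sketch of mini-batched appends and a purged list, assuming a single-threaded setting and made-up thresholds. The class name `BatchedIndex` and its methods are hypothetical, and the sketch ignores the case where a delete overtakes a still-buffered append of the same document.

```python
class BatchedIndex:
    """Toy index that batches appends and defers physical deletions."""

    def __init__(self, append_batch=2, purge_batch=2):
        self.lists = {}      # keyword -> set of docIDs
        self.pending = []    # buffered (keyword, doc_id) appends
        self.purged = set()  # docIDs deleted logically but not yet physically
        self.append_batch = append_batch
        self.purge_batch = purge_batch

    def append(self, keyword, doc_id):
        self.pending.append((keyword, doc_id))
        if len(self.pending) >= self.append_batch:
            self.flush_appends()

    def flush_appends(self):
        # Apply the whole mini-batch at once.
        for keyword, doc_id in self.pending:
            self.lists.setdefault(keyword, set()).add(doc_id)
        self.pending.clear()

    def delete(self, doc_id):
        self.purged.add(doc_id)
        if len(self.purged) >= self.purge_batch:
            self.flush_deletes()

    def flush_deletes(self):
        # Physically remove all purged docIDs from every list in one pass.
        for ids in self.lists.values():
            ids.difference_update(self.purged)
        self.purged.clear()

    def lookup(self, keyword):
        # Buffered appends are not visible yet; purged docIDs are filtered out.
        return self.lists.get(keyword, set()) - self.purged


idx = BatchedIndex()
idx.append("Database", 1)
print(idx.lookup("Database"))  # first append is still buffered
idx.append("Database", 3)
print(idx.lookup("Database"))  # batch size reached, both appends are visible
```

The first lookup shows the recency cost of batching: a buffered append stays invisible until its batch is flushed.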

15
Notes/Thoughts

Critical assessment
- Very nice: recency can be traded for performance using mini-batches!
- Concurrency increased and locking overhead reduced
- Many algorithms rely on inverted lists being sorted
  - Appends may not be commutative anymore and cannot be reordered arbitrarily
  - The sorting requirement may impose further restrictions on the scheduling of operations
- In the worst case, read-queries need to check for conflicts with a number of queries equal to the MPL (multi-programming level)

16
GOLD Text Indexing Engine
[Figure: GOLD architecture. A QUERY is served by two components, each of which provides access to its own structures. The figure also carries the storage labels DISK and MEMORY.]

DataStructure Manager
- Inverted Lists: Database: 1 3 5; Transaction: 1 2
- Position Lists: Database: 1,35 3,44 5,50; Transaction: 1,5 2,80
- Documents: DocID1, DocID2, DocID3
- Other structures

OverFlow Manager
- Overflow Inverted Lists: Database: 6 8; Transaction: 7 9
- Insertion List: DocID6, DocID7
- Other structures
17
GOLD Text Indexing Engine
Multi-Layered System

Concurrency Control Details
- Inverted list traversal is done with lock coupling
- L0-L2: CC done via conflict matrix
- Novelty in L3: Timestamp Locking CC protocol

Multi-Level Concurrency Control
- L3: Timestamp Locking
- L0-L2: Locking
18
GOLD Text Indexing Engine
L3 Concurrency Control Protocol
Basic idea: use timestamps and locks at the document level.
- Locks are used to prevent insertions and deletions from creating an inconsistent state.
- Retrieve does not care about locks.
- To avoid incorrect query results, timestamps are used.

Insert
1. Lock document
2. Initialize Doc.TS to a large value
3. Perform insertion
4. Get next TS and set Doc.TS

Delete
1. Lock document
2. Delete from inverted lists but keep document
3. Get next TS
4. Check for active retrieve operations with smaller TS. If any exist, wait.
5. Delete document

Retrieve
1. Get next TS
2. Ignore docs with greater TS
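
The timestamp rule for Insert and Retrieve can be illustrated with a single-threaded Python sketch. Everything here is an assumption made for the example: the class `GoldSketch`, the split of an insertion into `begin_insert` and `finish_insert`, and the use of infinity as the "large value". Document locks and the Delete path are omitted.

```python
import itertools

INF = float("inf")  # stands in for the "large value" of an unfinished insertion


class GoldSketch:
    """Toy model of the Insert/Retrieve timestamp rule (no locks, no deletes)."""

    def __init__(self):
        self._clock = itertools.count(1)  # global timestamp counter
        self.doc_ts = {}                  # docID -> Doc.TS
        self.lists = {}                   # keyword -> set of docIDs

    def next_ts(self):
        return next(self._clock)

    def begin_insert(self, doc_id, keywords):
        self.doc_ts[doc_id] = INF         # initialize Doc.TS to a large value
        for kw in keywords:               # perform the insertion
            self.lists.setdefault(kw, set()).add(doc_id)

    def finish_insert(self, doc_id):
        self.doc_ts[doc_id] = self.next_ts()  # get next TS and set Doc.TS

    def retrieve(self, keyword, ts):
        # Ignore documents whose timestamp is greater than the retrieve's TS.
        return {d for d in self.lists.get(keyword, set()) if self.doc_ts[d] <= ts}


g = GoldSketch()
g.begin_insert(1, ["Database"])
g.finish_insert(1)
t = g.next_ts()                  # a retrieve starts
g.begin_insert(9, ["Database"])  # a concurrent insert is still in progress
print(g.retrieve("Database", t))
```

The retrieve never blocks on the insert; it simply filters out document 9, whose timestamp is still larger than its own.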
19
Summary
Both papers identify the problem of interleaving retrieve and write/delete/update operations.

Solutions

GOLD
1. Insert Ts: set the document timestamp; retrieve operations ignore documents with a greater timestamp
2. Delete Ts: remove entries from the inverted lists, then check for concurrent retrieves with a smaller timestamp and wait before deleting the document

[KAM96]
1. Retrieve Ts: check the set of common inverted lists with update transactions
2. Operations are reordered such that conflicting updates are executed first
3. For each inverted list, retrieve Ts wait until the conflicting update has been performed

In terms of concurrency
INSERT: GOLD > KAM96 (GOLD could miss some recent results)
DELETE: GOLD < KAM96
20
ARIES/LHS
Keywords are typically mapped to the corresponding inverted list by hashing.
Only point-queries are required for keywords, therefore hashing is a good solution.
BUT: What hashing technique can we use? How can we do CC on the hash table?
[MOH93] studies CC for Linear Hashing with Separators (LHS)

Basic operations in LHS
- Insert
- Delete
- Update
- File Contraction
- File Expansion
- Retrieve
May cause record relocation!

Basic Idea
- Writing Ts get an X lock on the records they modify, NOT on the records they relocate
- Uncommitted relocations can be modified by other Ts => high concurrency
- Read Ts get an S lock on records
- Recovery algorithms need to be prepared to handle the above
- The problem becomes ensuring that only correct answers are returned

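
To see why a file expansion relocates records, consider the following sketch of plain linear hashing. Note the substitution: this is textbook linear hashing without separators, without overflow handling and without any concurrency control, so it is not the ARIES/LHS method itself. The class `LinearHash` and its method names are hypothetical, and integer keys are assumed so that `hash(key)` is the key itself.

```python
class LinearHash:
    """Plain linear hashing with unbounded buckets (illustration only)."""

    def __init__(self, initial_buckets=2):
        self.n0 = initial_buckets
        self.level = 0  # number of completed doubling rounds
        self.next = 0   # split pointer: next bucket to be split
        self.buckets = [[] for _ in range(initial_buckets)]

    def _address(self, key):
        h = hash(key)
        b = h % (self.n0 * 2 ** self.level)
        if b < self.next:
            # Bucket b was already split in this round: use the finer hash function.
            b = h % (self.n0 * 2 ** (self.level + 1))
        return b

    def insert(self, key):
        self.buckets[self._address(key)].append(key)

    def contains(self, key):
        return key in self.buckets[self._address(key)]

    def expand(self):
        """Split the bucket at the split pointer; return the keys that were relocated."""
        split = self.next
        old_keys = self.buckets[split]
        self.buckets[split] = []
        self.buckets.append([])  # new bucket that receives part of the split bucket
        self.next += 1
        if self.next == self.n0 * 2 ** self.level:
            self.level += 1      # all buckets of this round are split
            self.next = 0
        relocated = []
        for key in old_keys:
            target = self._address(key)
            self.buckets[target].append(key)
            if target != split:
                relocated.append(key)
        return relocated


lh = LinearHash(2)
for k in range(8):
    lh.insert(k)
moved = lh.expand()
print(moved)  # keys that now live in a different bucket than before
```

The relocated keys were never modified by the expanding operation, which is the situation the Basic Idea above addresses by not X-locking relocated records.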

21
ARIES/LHS

Retrieval of records is supported by an in-memory data structure, the signature table (ST)
- the current_page of a record may be different from its home_page
- the ST helps to quickly identify the current_page for a given query without causing disk I/O
- CC problem: we need to ensure that page signatures and the ST are in sync, i.e. consistent

ST Latch
- File Expansion/Contraction get an X latch on the ST
  - Signatures depend on the number of pages
  - STE latches are used to increase concurrency (details complex)
- Other operations get an S latch on the ST

[Figure: pages, each with its STE latch; T acquires latches]
22
References
[KAM96] Mohan Kamath, Krithi Ramamritham, Efficient Transaction Support for Dynamic Information Retrieval Systems, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996
[BMV96] D. Barbara, S. Mehrotra, P. Vallabhaneni, The Gold Text Indexing Engine, Proceedings of the 12th International Conference on Data Engineering (ICDE'96), 1996
[MOH93] C. Mohan, ARIES/LHS: A Concurrency Control and Recovery Method Using Write-Ahead Logging for Linear Hashing with Separators, Proceedings of the Ninth International Conference on Data Engineering (ICDE), 1993
23
Issue with GOLD (my opinion)
Inverted List Index
A: 1 2 3
Z: 2 3 4

T1: get docs containing A AND B AND C ... AND NOT Z
T2: delete document 3
TS(T1) < TS(T2)

The following interleaving is possible:
r1(A), d2(A,3), r1(B), r1(C), ..., d2(Z,3), r1(Z)
T1 will identify 3 as a result, and since TS(T1) < TS(T2) the document has not been deleted yet!
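
The interleaving can be replayed with a short Python simulation. As a simplifying assumption, the query is reduced to A AND NOT Z, since the slide gives no contents for lists B and C. The function names are made up for this sketch, and T1 denotes the retrieve and T2 the delete.

```python
def interleaved_result():
    """Replay r1(A), d2(A,3), d2(Z,3), r1(Z) for the query A AND NOT Z."""
    lists = {"A": {1, 2, 3}, "Z": {2, 3, 4}}
    seen_a = set(lists["A"])  # r1(A): T1 still sees document 3
    lists["A"].discard(3)     # d2(A,3)
    lists["Z"].discard(3)     # d2(Z,3)
    seen_z = set(lists["Z"])  # r1(Z): document 3 is already gone from Z
    return seen_a - seen_z


def serial_result(delete_first):
    """Result of A AND NOT Z when the delete runs entirely before or after the query."""
    lists = {"A": {1, 2, 3}, "Z": {2, 3, 4}}
    if delete_first:
        for ids in lists.values():
            ids.discard(3)
    return lists["A"] - lists["Z"]


print(interleaved_result())
print(serial_result(delete_first=False), serial_result(delete_first=True))
```

Document 3 appears only in the interleaved result; neither serial order returns it, which is why it counts as a false answer rather than a merely stale one.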
