Title: Concurrency Control On Inverted Lists
1Concurrency Control On Inverted Lists
CS223 Transaction Processing and Distributed
Data Management
- Alexander Behm
- University of California, Irvine
- Instructor Prof. Sharad Mehrotra
- Based on BMV96, KAM96, MOH93
- (see references for details)
2Overview
- Introduction to Inverted Lists
- Transactions for Inverted Lists
- GOLD Text Indexing Engine
- Summary
- ARIES/LHS
- References
3Introduction to Inverted Lists
Suppose we have a set of documents with keywords:
Doc1: Keywords Transaction, Concurrency
Doc2: Keywords Serializability, Transaction
Doc3: Keywords Database, Transaction
- How can we efficiently answer keyword queries? Such as:
  - Get documents with keyword1 AND keyword2 AND ...
  - Get documents with keyword1 OR keyword2 OR ...
  - Get documents with keyword1 AND keyword2 AND NOT ...
  - Etc.
4Introduction to Inverted Lists
A popular solution in Information Retrieval (IR): create an inverted list index.
Index on keywords -> Inverted lists (contain document IDs)
Database -> 1 3 5 6
Transaction -> 1 2 5
Concurrency -> 2 7 8
Serializability -> 3 4 7 8
Phantom -> 5 7 8
In essence: for each keyword, keep a list of the documents that contain it.
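A minimal Python sketch of building such an index. It assumes documents arrive as a mapping from docID to keyword list, and that the three keyword lines on the earlier slide pair with Doc1 to Doc3 in order; the names `build_inverted_index` and `docs` are illustrative and not from the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to a sorted list of the docIDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):            # visiting docIDs in order keeps lists sorted
        for keyword in set(docs[doc_id]):  # set() guards against repeated keywords
            index[keyword].append(doc_id)
    return dict(index)

# Example documents (assumed pairing of keyword lines with Doc1..Doc3).
docs = {
    1: ["Transaction", "Concurrency"],
    2: ["Serializability", "Transaction"],
    3: ["Database", "Transaction"],
}
index = build_inverted_index(docs)
```

Here `index["Transaction"]` lists all three documents, while `index["Database"]` contains only docID 3.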
5Introduction to Inverted Lists
How can we answer queries?
Get documents with Database AND Transaction AND Phantom
Database -> 1 3 5 6
Transaction -> 1 2 5
Concurrency -> 2 7 8
Serializability -> 3 4 7 8
Phantom -> 5 7 8
Create the set intersection of the Database, Transaction and Phantom lists: 5
DocID 5 is a result!
Other operations are modeled in a similar fashion, e.g. OR by set union, AND NOT by set difference, etc.
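The query above can be sketched in Python as follows. This is one possible implementation, assuming each inverted list is a sorted Python list of docIDs; the function names `intersect`, `union` and `difference` are illustrative.

```python
def intersect(a, b):
    """AND: merge-style intersection of two sorted docID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: sorted union of two docID lists."""
    return sorted(set(a) | set(b))

def difference(a, b):
    """AND NOT: docIDs in a that do not appear in b (order of a preserved)."""
    excluded = set(b)
    return [d for d in a if d not in excluded]

index = {
    "Database": [1, 3, 5, 6],
    "Transaction": [1, 2, 5],
    "Concurrency": [2, 7, 8],
    "Serializability": [3, 4, 7, 8],
    "Phantom": [5, 7, 8],
}

# Database AND Transaction AND Phantom
result = intersect(intersect(index["Database"], index["Transaction"]), index["Phantom"])
```

With the lists from the slide, `result` is `[5]`, matching the figure. The merge-style `intersect` relies on both lists being sorted, which is the property the next slide lists under "Good".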
6Inverted Lists Good and Bad
- Good
  - when answering queries, we only look at the inverted lists
  - only documents matching the query are retrieved
  - if inverted lists are sorted, union, intersection, etc. can be implemented efficiently
  - other applications in exact and fuzzy string matching
- Bad
  - the inverted list structure can become very large
  - updates may need to modify several inverted lists at a time, i.e. updates are expensive
7Transactions for Inverted Lists
Characteristics of Information Retrieval Systems
- For some IR systems being up to date is not critical; they can be read-only
  - E.g. an online shop where new products are (typically) not added every minute
  - Read-only systems perform updates offline, e.g. at night in a batch
  - No concurrency control needed, no transactions needed
- For other systems being up to date is critical, e.g. news systems
  - The most relevant documents for a query may be the most recent ones
  - Updates may be frequent (but typically still less frequent than reads)
  - Concurrency control needed, e.g. model queries as transactions
8Transactions for Inverted Lists
What about traditional CC mechanisms, e.g. 2PL?
Consider adding this document: Doc9, Keywords Database, Concurrency, Phantom
1. 2PL acquires long-term locks
2. Perform updates
3. 2PL releases locks
Database -> 1 3 5 6
Transaction -> 1 2 5
Concurrency -> 2 7 8
Serializability -> 3 4 7 8
Phantom -> 5 7 8
A read query accessing Database AND Serializability must wait for the whole update to complete -> BAD
9Transactions for Inverted Lists
Observations by KAM96
- Write-Write conflicts
  - most write operations are actually appends, i.e. we append a new docID to some lists
  - (we discuss deletions/updates later)
  - appends are idempotent and commute (consider an inverted list as a set of docIDs)
- Read-Write conflicts
  - write operations do not depend on reads, i.e. transactions are write-only or read-only
  - read operations do not need to wait until the whole update has completed
  - they need only wait until the conflicting lists have been updated
  - they may read each list directly after the conflicting update has completed
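The claim that appends are idempotent and commute can be checked with a small Python sketch. It assumes, as the slide suggests, that an inverted list is treated as a set of docIDs; `apply_appends` is an illustrative helper, not something from KAM96.

```python
from itertools import permutations

def apply_appends(initial, appends):
    """Apply a sequence of docID appends to a list modeled as a set."""
    postings = set(initial)
    for doc_id in appends:
        postings.add(doc_id)  # adding an existing docID changes nothing (idempotent)
    return postings

# Two distinct appends plus a repeated one, applied in every possible order.
appends = [9, 10, 9]
results = {
    frozenset(apply_appends({1, 3, 5, 6}, order))
    for order in permutations(appends)
}
```

Every ordering produces the same final set, so `results` contains a single element. This is what lets a scheduler pick the order of appends freely.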
10Transactions for Inverted Lists
I. Using latches: example
T1: w(Database), w(Transaction)
T2: r(Database), r(Concurrency)
Schedule: w1(Database), r2(Database), r2(Concurrency), w1(Transaction)
Figure annotations, in order: T1 locks, T1 releases, T2 locks, T2 releases, T2 locks
Database -> 1 3 5 6
Transaction -> 1 2 5
Concurrency -> 2 7 8
Serializability -> 3 4 7 8
Phantom -> 5 7 8
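A rough Python sketch of the latch idea: one short-lived lock per inverted list, held only for a single list access instead of until the end of the transaction. The class `LatchedIndex` and its methods are assumptions made for illustration; `threading.Lock` stands in for a latch.

```python
import threading

class LatchedIndex:
    """Inverted lists protected by one latch (short-duration lock) per list."""

    def __init__(self, lists):
        self.lists = {k: list(v) for k, v in lists.items()}
        self.latches = {k: threading.Lock() for k in lists}

    def append(self, keyword, doc_id):
        with self.latches[keyword]:   # latch released as soon as the append is done
            self.lists[keyword].append(doc_id)

    def read(self, keyword):
        with self.latches[keyword]:   # latch released as soon as the copy is made
            return list(self.lists[keyword])

index = LatchedIndex({
    "Database": [1, 3, 5, 6],
    "Transaction": [1, 2, 5],
    "Concurrency": [2, 7, 8],
})

t2_result = {}

def t1():  # update transaction: w(Database), w(Transaction)
    index.append("Database", 9)
    index.append("Transaction", 9)

def t2():  # read transaction: r(Database), r(Concurrency)
    t2_result["Database"] = index.read("Database")
    t2_result["Concurrency"] = index.read("Concurrency")

threads = [threading.Thread(target=t1), threading.Thread(target=t2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no latch outlives a single list access, T2 can read Database between T1's two writes. Whether T2 sees docID 9 in Database depends on the interleaving, which is the issue the next slide addresses.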
11Transactions for Inverted Lists
II. Keeping track of read-write dependencies
T1: add docID 9 with keywords Database, Concurrency
T2: get docs containing Database AND Concurrency
Since we are using latches and not e.g. 2PL, the following schedule is possible:
S: w1(Database), r2(Database), r2(Concurrency), w1(Concurrency)
We miss docID 9 because it has yet to be added to the Concurrency list.
Results of T2 are still correct! Meaning no false answers are returned.
However, we miss the most recent (and possibly most relevant) document!
To meet the recency requirement we track dependencies for read transactions.
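The effect can be replayed with a small single-threaded simulation in Python. It assumes the Database and Concurrency lists from the earlier slides; the `run` helper and the tuple encoding of operations are illustrative only.

```python
def run(schedule):
    """Replay a schedule of ("w1" | "r2", keyword) steps and return T2's answer."""
    lists = {"Database": [1, 3, 5, 6], "Concurrency": [2, 7, 8]}
    seen = {}
    for op, keyword in schedule:
        if op == "w1":                 # T1 appends docID 9 to the list
            lists[keyword].append(9)
        else:                          # T2 reads a snapshot of the list
            seen[keyword] = set(lists[keyword])
    return seen["Database"] & seen["Concurrency"]

# Latches only: T2 reads Concurrency before T1 has written it.
latch_only = run([
    ("w1", "Database"), ("r2", "Database"),
    ("r2", "Concurrency"), ("w1", "Concurrency"),
])

# With dependency tracking: T2's read of Concurrency waits for T1's write.
tracked = run([
    ("w1", "Database"), ("r2", "Database"),
    ("w1", "Concurrency"), ("r2", "Concurrency"),
])
```

`latch_only` is empty (no false answer, but docID 9 is missed), while `tracked` contains docID 9.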
12Transactions for Inverted Lists
III. Operation reordering: example
T1: w(x), w(y), w(z)
T2: w(a), w(b), w(z)
T3: r(z), r(b), r(m)
- Say we first scheduled T1 and T2
  - S: w1(x), w1(y), w1(z), w2(a), w2(b), w2(z)
- Now T3 arrives at the scheduler
- We reorder operations to minimize the wait for T3
  - S: w1(z), w2(z), r3(z), w2(b), r3(b), r3(m), w1(x), w1(y), w2(a)
- Why can we do this?
  - We use latches; locks are released immediately
  - Writes are independent of reads; each transaction writes a data item no more than once
  - -> the order of execution for writes can be chosen arbitrarily (assuming appends)
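One way to express this reordering in Python is sketched below. It is a simplification under the slide's assumptions (writes are appends and can be executed in any order); the function `reorder` and the tuple representation of pending writes are illustrative, not the scheduler from KAM96.

```python
def reorder(pending_writes, reader_id, read_items):
    """Move writes that conflict with the reader ahead of its reads.

    pending_writes: list of (txn_id, item) writes not yet executed.
    read_items: items the read transaction accesses, in its own order.
    """
    schedule = []
    remaining = list(pending_writes)
    for item in read_items:
        # Execute every pending write on this item first ...
        for txn_id, written in remaining:
            if written == item:
                schedule.append(f"w{txn_id}({written})")
        remaining = [op for op in remaining if op[1] != item]
        # ... then the read can proceed without waiting any further.
        schedule.append(f"r{reader_id}({item})")
    # Writes that do not conflict with the reader are deferred to the end.
    schedule.extend(f"w{txn_id}({written})" for txn_id, written in remaining)
    return schedule

pending = [(1, "x"), (1, "y"), (1, "z"), (2, "a"), (2, "b"), (2, "z")]
schedule = reorder(pending, reader_id=3, read_items=["z", "b", "m"])
```

For the example transactions this yields w1(z), w2(z), r3(z), w2(b), r3(b), r3(m), w1(x), w1(y), w2(a), the reordered schedule shown on the slide.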
13Flowchart: processing a transaction T
Data structures
- T.ACCESS: inverted lists that need to be locked
- ACTIVE: list of active update transactions
- T.CONFLICT: list of conflicting active update transactions
Steps
- Determine T.ACCESS
- Update?
  - Y: add T to ACTIVE, then request locks in T.ACCESS
  - N: fill T.CONFLICT using ACTIVE, reorder operations, then request locks in T.ACCESS
Lock request (update transaction)
- If granted list is empty: issue write lock
- Else: put on wait queue
Lock request (read transaction)
- Safe check: check conflict list (and processed list)
- If safe AND NOT granted to write: issue read lock
- Else: put on wait queue
Unlock request (update transaction)
- Add transaction ID to PROCESSED
- Release lock
- If wait queue NOT empty: grant lock to next T in wait queue
Unlock request (read transaction)
- Release lock
- If wait queue not empty: grant lock to next transaction in queue
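The bookkeeping in this flowchart (ACTIVE, T.CONFLICT, PROCESSED and the safe check) can be sketched in Python. This is a deliberately reduced, single-threaded model: granted lists, wait queues and the removal of finished transactions from ACTIVE are left out, and the names `Txn`, `Scheduler`, `admit`, `safe_to_read` and `write_done` are assumptions made for the sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Txn:
    tid: int
    access: set                 # T.ACCESS: inverted lists the transaction touches
    is_update: bool
    conflict: set = field(default_factory=set)  # T.CONFLICT: ids of conflicting updates

class Scheduler:
    def __init__(self):
        self.active = {}        # ACTIVE: tid -> active update transaction
        self.processed = {}     # per list: ids of update transactions that already wrote it

    def admit(self, txn):
        if txn.is_update:
            self.active[txn.tid] = txn
        else:
            # A reader conflicts with every active update sharing at least one list.
            txn.conflict = {
                u.tid for u in self.active.values() if u.access & txn.access
            }

    def safe_to_read(self, txn, lst):
        # Safe once every conflicting update that touches lst has processed it.
        must_finish = {tid for tid in txn.conflict if lst in self.active[tid].access}
        return must_finish <= self.processed.get(lst, set())

    def write_done(self, txn, lst):
        # Unlock request of an update: record the transaction in PROCESSED.
        self.processed.setdefault(lst, set()).add(txn.tid)

sched = Scheduler()
t1 = Txn(1, {"Database", "Concurrency"}, is_update=True)
t2 = Txn(2, {"Database", "Concurrency"}, is_update=False)
sched.admit(t1)
sched.admit(t2)
before = sched.safe_to_read(t2, "Database")
sched.write_done(t1, "Database")
after = sched.safe_to_read(t2, "Database")
```

In this run T2 may not read Database until T1 has written it, and it still has to wait on Concurrency afterwards.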
14Notes/Thoughts
- Deletions/Updates
  - Deletions can be handled by the proposed CC mechanism
  - Updates can be seen as a collection of appends and deletions
- Reducing locking overhead
  - Main property: an inverted list is accessed at most once by a transaction
  - Appends can be aggregated into mini-batches
  - Deletions can be memorized in a purged list
  - When the purged list reaches a certain size, the deletions are performed in a batch
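A possible shape for mini-batches and a purged list, sketched in Python. The thresholds, the class name `BatchedPostingList`, and the choice to filter purged docIDs at read time are all assumptions for illustration; the slides do not specify these details.

```python
class BatchedPostingList:
    """One inverted list with buffered appends and lazily applied deletions."""

    def __init__(self, doc_ids, batch_size=2, purge_threshold=2):
        self.postings = sorted(doc_ids)
        self.pending = []            # appends waiting for the next mini-batch
        self.purged = set()          # deleted docIDs not yet physically removed
        self.batch_size = batch_size
        self.purge_threshold = purge_threshold

    def append(self, doc_id):
        self.pending.append(doc_id)
        if len(self.pending) >= self.batch_size:
            # One list modification covers the whole mini-batch.
            self.postings = sorted(set(self.postings) | set(self.pending))
            self.pending.clear()

    def delete(self, doc_id):
        self.purged.add(doc_id)
        if len(self.purged) >= self.purge_threshold:
            # Physical deletion happens once per batch.
            self.postings = [d for d in self.postings if d not in self.purged]
            self.purged.clear()

    def read(self):
        # Assumption: readers hide docIDs that are purged but not yet removed.
        return [d for d in self.postings if d not in self.purged]

pl = BatchedPostingList([1, 3, 5, 6])
pl.append(9)
visible_after_one_append = pl.read()
pl.append(10)
visible_after_batch = pl.read()
```

After the first append, docID 9 is still buffered and not visible to readers; it only appears once the mini-batch is applied. This is the recency-for-performance trade-off.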
15Notes/Thoughts
- Critical assessment
  - Very nice: recency can be traded for performance using mini-batches!
  - Concurrency is increased and locking overhead is reduced
  - Many algorithms rely on inverted lists being sorted
  - Appends may not be commutative anymore, so they cannot be reordered arbitrarily
  - The sorting requirement may impose further restrictions on the scheduling of operations
  - In the worst case, read queries need to check for conflicts with a number of queries equal to the MPL (multi-programming level)
16GOLD Text Indexing Engine
(Figure: GOLD architecture)
QUERY -> DataStructure Manager (provides access to Inverted Lists)
QUERY -> OverFlow Manager (provides access to Overflow Inverted Lists)
Inverted Lists: Database -> 1 3 5; Transaction -> 1 2
Overflow Inverted Lists: Database -> 6 8; Transaction -> 7 9
Insertion List: DocID6, DocID7
Position Lists: Database -> 1,35 3,44 5,50; Transaction -> 1,5 2,80
Documents: DocID1, DocID2, DocID3
Other structures
Storage levels shown in the figure: DISK, MEMORY
17GOLD Text Indexing Engine
Multi-Layered System with Multi-Level Concurrency Control
(Figure: four levels; the top level uses Timestamp Locking, the three levels below use Locking)
- Concurrency Control Details
  - Inverted list traversal is done with lock coupling
  - L0-L2: CC done via conflict matrix
  - Novelty in L3: Timestamp-Locking CC protocol
18GOLD Text Indexing Engine
L3 Concurrency Control Protocol
Basic idea: use timestamps and locks at the document level.
Locks are used to prevent insertions and deletions from creating an inconsistent state.
Retrieve does not care about locks. To avoid incorrect query results, timestamps are used.
Delete
1. Lock document
2. Delete from inverted lists but keep document
3. Get next TS
4. Check for active retrieve operations with smaller TS. If any exist, wait.
5. Delete document
Insert
1. Lock document
2. Initialize Doc.TS to a large value
3. Perform insertion
4. Get next TS and set Doc.TS
Retrieve
1. Get next TS
2. Ignore docs with greater TS
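The timestamp part of the protocol can be sketched in Python as a single-threaded model. Document locks and actual waiting are omitted, and the class `GoldL3Sketch` with its method names is an assumption made for illustration rather than GOLD's real interface.

```python
import itertools
import math

class GoldL3Sketch:
    def __init__(self):
        self._clock = itertools.count(1)   # source of increasing timestamps
        self.doc_ts = {}                   # docID -> Doc.TS
        self.lists = {}                    # keyword -> list of docIDs
        self.active_retrieves = set()      # timestamps of running retrieves

    def next_ts(self):
        return next(self._clock)

    def begin_insert(self, doc_id, keywords):
        self.doc_ts[doc_id] = math.inf     # "large value": invisible to all retrieves
        for kw in keywords:
            self.lists.setdefault(kw, []).append(doc_id)

    def finish_insert(self, doc_id):
        self.doc_ts[doc_id] = self.next_ts()

    def begin_retrieve(self):
        ts = self.next_ts()
        self.active_retrieves.add(ts)
        return ts

    def lookup(self, ts, keyword):
        # Ignore documents whose timestamp is greater than the retrieve's.
        return [d for d in self.lists.get(keyword, []) if self.doc_ts[d] <= ts]

    def end_retrieve(self, ts):
        self.active_retrieves.discard(ts)

    def delete_from_lists(self, doc_id):
        for postings in self.lists.values():
            if doc_id in postings:
                postings.remove(doc_id)
        return self.next_ts()              # the document itself is kept for now

    def can_delete_document(self, delete_ts):
        # The delete must wait while any retrieve with a smaller TS is active.
        return not any(ts < delete_ts for ts in self.active_retrieves)

gold = GoldL3Sketch()
gold.begin_insert(1, ["Database"])
early = gold.begin_retrieve()              # starts before the insert has finished
gold.finish_insert(1)
late = gold.begin_retrieve()               # starts after the insert has finished
```

The retrieve that started before the insert finished never sees document 1, while the later one does. A delete issued now would have to wait until both retrieves have ended.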
19Summary
Both papers identify the problem of interleaving retrieve and write/delete/update operations.
Solutions
GOLD
1. Insert Ts set the document timestamp; retrieve operations ignore documents with a greater timestamp
2. Delete Ts remove entries from the inverted lists, then check for concurrent retrieves with a smaller timestamp and wait before deleting the document
KAM96
1. Retrieve Ts check the set of common inverted lists with update transactions
2. Operations are reordered such that conflicting updates are executed first
3. For each inverted list, retrieve Ts wait until the conflicting update has been performed
In terms of concurrency
INSERT: GOLD > KAM96 (GOLD could miss some recent results)
DELETE: GOLD < KAM96
20ARIES/LHS
Keywords are typically mapped to the corresponding inverted list by hashing. Only point queries are required for keywords, therefore hashing is a good solution.
BUT: Which hashing technique can we use? How can we do CC on the hash table?
MOH93 studies CC for Linear Hashing with Separators (LHS).
Basic operations in LHS: Insert, Delete, Update, File Contraction, File Expansion, Retrieve
Operations may cause record relocation!
- Basic Idea
  - Writing Ts get an X lock on the records they modify, NOT on the records they relocate
  - Uncommitted relocations can be modified by other Ts -> high concurrency
  - Read Ts get an S lock on records
  - Recovery algorithms need to be prepared to handle the above
  - The problem becomes ensuring that only correct answers are returned
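To illustrate why a file expansion relocates records, here is a sketch of plain linear hashing in Python. It is not ARIES/LHS: separators, overflow handling, locking and logging are all left out, and the class `LinearHashTable`, its load-factor rule and the use of `zlib.crc32` as the hash function are assumptions made for the example.

```python
import zlib

class LinearHashTable:
    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets
        self.level = 0               # number of completed doubling rounds
        self.split = 0               # next bucket to be split
        self.max_load = max_load
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0
        self.relocated = 0           # records moved by file expansions

    def _address(self, key):
        h = zlib.crc32(key.encode("utf-8"))
        addr = h % (self.n0 * 2 ** self.level)
        if addr < self.split:        # bucket already split in this round
            addr = h % (self.n0 * 2 ** (self.level + 1))
        return addr

    def insert(self, key, value):
        self.buckets[self._address(key)].append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._expand()

    def _expand(self):
        """File expansion: split one bucket and rehash its records."""
        source = self.split
        records, self.buckets[source] = self.buckets[source], []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 * 2 ** self.level:
            self.level += 1
            self.split = 0
        for key, value in records:
            target = self._address(key)
            if target != source:
                self.relocated += 1  # the record now lives in a different bucket
            self.buckets[target].append((key, value))

    def get(self, key):
        return [v for k, v in self.buckets[self._address(key)] if k == key]

table = LinearHashTable()
for i in range(20):
    table.insert(f"kw{i}", i)
```

Every key remains retrievable after the expansions, even though `_expand` may have moved records inserted by earlier operations. Those are the relocations a concurrency control method for linear hashing has to account for.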
21ARIES/LHS
- Retrieval of records is supported by an in-memory data structure, the signature table (ST)
- The current_page of a record may be different from its home_page
- ST helps to quickly identify the current_page for a given query without causing disk I/O
- CC problem: we need to ensure page signatures and ST are in sync, i.e. consistent
ST Latch
- File Expansion/Contraction get an X latch on ST
  - Signatures depend on the number of pages
  - STE latches are used to increase concurrency (details complex)
- Other operations get an S latch on ST
(Figure: T acquires latches; the ST latch, then STE latches associated with pages)
22References
KAM96: Mohan Kamath, Krithi Ramamritham, Efficient Transaction Support for Dynamic Information Retrieval Systems, Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1996
BMV96: D. Barbara, S. Mehrotra, P. Vallabhaneni, The Gold Text Indexing Engine, Proceedings of the 12th International Conference on Data Engineering (ICDE'96), 1996
MOH93: C. Mohan, ARIES/LHS: A Concurrency Control and Recovery Method Using Write-Ahead Logging for Linear Hashing with Separators, Proceedings of the Ninth International Conference on Data Engineering (ICDE), 1993
23Issue with GOLD (my opinion)
Inverted List Index
A -> 1 2 3
Z -> 2 3 4
T1: Get docs containing A AND B AND C ... AND NOT Z
T2: Delete Document 3
TS(T1) < TS(T2)
The following interleaving is possible:
r1(A), d2(A,3), r1(B), r1(C), ..., d2(Z,3), r1(Z)
T1 will identify 3 as a result, and since TS(T1) < TS(T2) the document has not been deleted yet!
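The anomaly can be reproduced with a short Python simulation. It keeps only lists A and Z (B, C, ... are assumed not to filter anything out) and reduces the query to A AND NOT Z; the function names are illustrative.

```python
def interleaved():
    """Retrieve (T1) reads A, the delete (T2) runs, then T1 reads Z."""
    lists = {"A": [1, 2, 3], "Z": [2, 3, 4]}
    seen_a = set(lists["A"])      # r1(A): still contains 3
    lists["A"].remove(3)          # d2(A,3)
    lists["Z"].remove(3)          # d2(Z,3)
    seen_z = set(lists["Z"])      # r1(Z): 3 is already gone
    return seen_a - seen_z        # A AND NOT Z

def serial(delete_first):
    """Run T1 and T2 one after the other, in either order."""
    lists = {"A": [1, 2, 3], "Z": [2, 3, 4]}
    if delete_first:
        lists["A"].remove(3)
        lists["Z"].remove(3)
    return set(lists["A"]) - set(lists["Z"])

anomalous = interleaved()
```

Both serial orders return only docID 1, but the interleaved run also returns docID 3, a document that contains Z and should have been excluded.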