Title: Recovery
1Recovery
Chapter 9, 10, 11.1-11.4 in Gray and Reuter
- Adapted from slides by J. Gray A. Reuter
2Failure Types
- Transaction Failure
- Transaction issues abort
- System Failure
- Volatile memory is corrupted
- Media Failure
- Stable storage is corrupted
3Failure Model (Assumptions)
- Failures can always be detected (Failstop)
- System
- Defensive programming
- Error detecting codes (parity, checksums, etc.)
- Media
- Redundant information (e.g., description of block
in header) - Checksums
4System Failure Recovery
- Goal
- At point of failure, history is H.
- Recovery must restore DB to final state defined
by C(H). - All information needed to accomplish this must be
in stable storage
5Normal (no failure) Transaction Execution
- TM generates the TRID at Begin_Work().
- Coordinates Commit,
- RM joins work, generates log records, allows
commit
6The Resource manager view
- Boolean Prepare(LSN ) / invoked at ? 1. Return
vote on commit / - void Commit() / called at commit ?2 /
- void Abort() / called at failed commit ? 2 or
abort / - void UNDO(LSN) / Undo the log record with this
LSN / - void REDO(LSN) / Redo the log record with this
LSN / - void TM_Startup(LSN) / TM restarting. Passes RM
ckpt LSN / - LSN Checkpoint(LSN low_water) / TM
checkpointing, Return RM ckpt LSN, - set low water LSN /
7The Transaction Manager
- Transaction rollback.
- coordinates transaction rollback to a savepoint
or abort rollbacks can be initiated by any
participant. - Resource manager restart.
- If an RM fails and restarts, TM presents
checkpoint anchor RM undo/redo log - System restart.
- TM drives local RM recovery (like RM restart)
- TM resolves any in-doubt distributed transactions
- Media recovery.
- TM helps RM reconstruct damaged objects by
providing - archive copies of object the log of object
since archived. - Node restart.
- Transaction commit among independent TMs when a
TM fails.
8When a Transaction Aborts
- At transaction rollback
- TM drives undo of each RM joined to the
transaction - Can be to savepoint 0 (abort) or partial rollback.
9The Transaction Manager at Restart/Recovery
- At restart, TM reading the log drives RM
recovery. - Single log scan.
- Single resolver of transactions.
- Multiple logs possible, but more complex/more
work.
10Resource Manager ConceptsUndo Redo Protocol
11Resource Manager Concepts Transaction UNDO
Protocol
declare cursor for transaction_log select
rmid, lsn / a cursor on the transaction's
log / from log / it returns the resource
manager name / where trid trid / and
record id (log sequence number) / descending
lsn / and returns records in LIFO
order / void transaction_undo(TRID trid) /
Undo the specified transaction. / int
sqlcode / event variables set by sql
/ open cursor transaction_log / open an
sql cursor on the trans log / while
(TRUE) / scan trans log backwards undo
each/ / fetch the next most recent log
rec / fetch transaction_log into rmid, lsn
/ / if (sqlcode ! 0) break
/ if no more, trans is undone, end loop /
rmid.undo(lsn) / tell RM to undo that
record / close cursor
transaction_log / Undo scan is complete, close
cursor / / return to caller /
12Resource Manager Concepts Restart REDO Protocol
void log_redo(void) declare cursor for the_log
/ declare cursor from log start
forward / select rmid, lsn / gets RM id and
log record id (lsn) / from log / of all log
records. / ascending lsn / in FIFO
order / open cursor the_log / open an
sql cursor on the log table / while (TRUE) /
Scan log forward redo each record. / fetch
the_log into rmid, lsn / fetch the next log
record / if (sqlcode ! 0) break / if
no more, then all redone, end loop /
rmid.redo(lsn) / tell RM to redo that
record / close cursor the_log / Redo
scan complete, close cursor / /
return to caller /
- Note REDO forwards, UNDO backwards
13Idempotence
- F(F(X)) F(X) Needed in case restart fails
(and restarts) - Redo(Redo(old_state,log), log)
Redo(new_state,log) new_state - Undo(Undo(new_state,log), log)
Undo(old_state,log) old_state
14Testable State Can Tell If It Happened.
- IF operation not idempotent AND state not
testable - THEN recovery is impossible
15Kinds of Logging
- Physical
- Keep old and new value of container (page,
file,...) - Pro Simple
- Allows recovery of physical object (e.g. broken
page) - Con Generates LOTS of log data
- Logical
- Keep call params such that you can compute F(x),
F-1(x) - Pro Sounds simple
- Compact log.
- Con Doesn't work (wrong failure model).
- Operations do not fail cleanly.
16Sample Physical LOG RECORD
struct compressed_log_record_for_page_update /
/ int opcode / opcode will
say compressed page update/ filename fname /
name of file that was updated / long pageno
/ page that was updated / long offset /
offset within page that was updated / long len
gth / length of field that was
updated / char old_valuelength / old
value of field / char new_valuelength
/ new value of field / /
/
- Ordinary sequential insert is OK.
- Update of sorted (B-tree) page
- update LSN
- update page space map
- update pointer to record
- insert record at correct spot (move 1/2 the
others) - Essentially writes whole page (old and new).
- 16KB log records for 100-byte updates.
17Sample Physical LOG RECORD
struct logical_log_record_for_insert /
/ int opcode / opcode will
says insert / filename fname / name of
file that was updated / long length /
length of record that was updated / char record
length / value record / /
/
- Very compact.
- Implies page update(s) for record (may be many
pages long). - Implies index updates (may be many indices on
base table)
18The trouble with Logical Logging
- Logical logging needs to start UNDO/REDO with an
action-consistent state. - Partial Actions
- If an action runs to completion, we can use
inverse action to UNDO operation. - What if action fails part of the way through? How
do we put system in consistent state. - for example insert (table, record)
- ALL or NONE of the indices should be updated when
logical UNDO/REDO is invoked. - Action Consistency
- After a system failure, the state of persistent
storage may not be action consistent. - How can we restore an action consistent state?
19Making Logical Logging Work Shadows
- Keep old copy of each page
- Reset page to old copy at abort (no undo log)
- Discard old copy at commit.
- Handles all online failures due to
- Logic e.g. duplicate key.
- Limit ran out of space
- Contention deadlock
- Problem forces page locking, only one updater
per page. - What about restart?
- Need to atomically write out all changed pages.
20Making Logical Logging Work Shadows
- Perform same shadow trick at disc level.
- Keep shadow copy of old pages.
- Write out new pages.
- In one careful write, write out new page root.
- Makes update atomic
21Shadows
- Pro Simple
- Not such a bad deal with non-volatile ram
- Con page locking
- extra space
- extra overhead (for page maps)
- extra IO
- declusters sequential data
22Logical vs Physio-logical Logging
Note physical log records would be bigger for
sorted pages.
23Physiological Logging Rules
- Complex operations are a sequence of simple
operations on pages. - Each operation is constructed as a
mini-transaction - lock the object in exclusive mode
- transform the object
- generate an UNDO-REDO log record
- record log LSN in object
- unlock the object.
- Action Consistent Object
- When object semaphore free, no ops in progress.
- Log-Consistency
- Log contains log records of all complete page
actions.
24Physiological Logging Rules - Online Operation
- Each operation is structured as a
mini-transaction. - Each operation generates an UNDO record.
- No page operation fails with the semaphore set.
- (exception handler must clean up state and UNFIX
any pages). - Then Rollback can be physical to a page and
logical within page.
25Physiological Logging Rules - Restart Operation
- Need Page-Action consistent persistent state.
- Pages are action consistent.
- Committed actions can be redone from log.
- Uncommitted actions can be undone from log.
- WAL Write Ahead Log
- Write undo/redo log records before overwriting
disk page - Only write action-consistent pages
- Force-Log-At-Commit
- Make transaction log records durable at commit.
26WAL and Force at Commit
- WAL Write Ahead Log
- write page
- get page semaphore
- copy page to buffer
- give page semaphore / avoids holding
semaphore during IO / - Force_log(Page(LSN)) /WAL logic, probably
already flushed/ - Write buffer to disk.
- WAL gives idempotence and testability.
- Force-Log-At-Commit
- At commit phase 1
- Force_log(transaction.max_lsn)
27The One Bit Resource Manager
- Manages an array of transactional bits (the free
space bit map). - i get_bit() / gets a free bit and sets
it / - give_bit(i) / returns a free bit /
28The Bitmap and Its Log Records
- The Data Structure
- struct / layout of the one-bit RM data
structure / - LSN lsn / page LSN for WAL
protocol / - xsemaphore sem / semaphore regulates
access to the page / - Boolean bitBITS / page.biti TRUE gt
biti is free / - page / allocates the page
structure / - The Log Records
- struct / log record format for the one-bit RM /
- int index / index of bit that was
updated / - Boolean value / new value of bitindex
/ - log_rec / log record used
by the one-bit RM / - const int rec_size sizeof(log_rec) /size of
the log record body. /
29Page and Log Consistency for 1-Bit RM
- Data dirty if reflects an uncommitted transaction
update. Otherwise, data is clean. - Page Consistency
- No clean free bit has been given to any
transaction. - Every clean busy bit was given to exactly one
transaction. - Dirty bits locked in X mode by updating
transactions. - The page.lsn reflects most recent log record for
page. - Log Consistency
- Log contains a record for every completed
mini-transaction update to the page.
30 give_bit()
- get_bit() give_bit(i) temporarily violate page
consistency. - Mini-transaction holds semaphore while violating
consistency. - Makes page log mutually consistent before
releasing sem. - each mini-transaction observes a consistent page
state. - void give_bit(int i) / free a bit /
- if (LOCK_GRANTEDlock(i,LOCK_X,LOCK_LONG,0))
/ Lock bit / - Xsem_get(page.sem) / get page sem /
- page.biti TRUE / free the bit /
- log_rec.index i / generate log rec/
- log_rec.value TRUE /saying bit is free/
- page.lsn log_insert(log_rec,rec_size)
/write log recupdate lsn/ - Xsem_give(page.sem) / page consistent/
- else / if lock failed, caller doesn't own
bit, / - Abort_Work() / in that case abort
caller's trans / - return
31get_bit()
- int get_bit(void) / allocate a bit to and
returns bit index / - int i / loop variable /
- Xsem_get(page.sem) / get the page
semaphore / - for ( i 0 iltBITS i) / loop looking for a
free bit / - if (page.biti) / if bit is free, may be
dirty (so locked) / - if (LOCK_GRANTED lock(i,LOCK_X,LOCK_LONG,
0))/ lock bit / - page.biti FALSE / got lock on it /
- log_rec.value FALSE / generate log rec
describing update/ - log_rec.index i
- page.lsn log_insert(log_rec,rec_size) /
write log recupdatelsn / - Xsem_give(page.sem) / page now consistent,
give up sem/ - return i / return to caller /
-
-
- / try next free bit, /
- Xsem_give(page.sem) / if no free bits, give
up semaphore / - Abort_Work() / abort transaction /
- return -1 / returns -1 if no bits are
available. /
32Compensation Logging
- Undo may generate a log record recording undo
step - Makes Page LSN monotonic
331-bit RM UNDO Callback
- void undo(LSN lsn) / undo a one-bit RM
operation / -
- int i / bit index /
- Boolean value / old bit value from log rec to
be undone/ - log_rec_header header / buffer to hold log
record header / - rec_size log_read_lsn(lsn,header,0,log_rec,big)
/ read log rec / - Xsem_get(page.sem) / get the page
semaphore / - i log_rec.index / get bit index from log
record / - value ! log_rec.value / get complement of
new bit value/ - page.biti value / update bit to old
value / - log_rec.value value / make a compensation
log record / - page.lsn log_insert(log_rec,rec_size) /
log it and bump page lsn/ - Xsem_give(page.sem) / free the page
semaphore / - return
-
341-bit RM REDO Callback
- void redo( LSN lsn) / redo an free space
operation / - int i / bit index /
- Boolean value / new bit value from log rec to
be redone/ - log_rec_header header / buffer to hold log
record header / - rec_size log_read_lsn(lsn,header,0,log_rec,big)
/ read log record / - i log_rec.index / Get bit index /
- lock(i,LOCK_X,LOCK_LONG,0) / get lock on the
bit (often not needed) / - Xsem_get(page.sem) / get the page
semaphore / - if (page.lsn lt lsn) / if bit version older
than log record / - value log_rec.value / then redo the op.
get new bit value / - page.biti value / apply new bit value to
bit / - page.lsn lsn / advance the page lsn /
- Xsem_give(page.sem) / free the page
semaphore / - return
351-BIT Rm Noise Callbacks
- Boolean prepare(LSN lsn) / 1-bit RM has no
phase 1 work / - lsn NULLlsn return TRUE / /
- void Commit(void ) / Commit release locks
/ - unlock_class(LOCK_LONG, TRUE, MyRMID()) /
return / - void Abort(void ) / Abort release all locks
/ - unlock_class(LOCK_LONG, TRUE, MyRMID()) /
return / - Boolean savepoint((LSN lsn) / no work to do
at savepoint / - lsn NULLlsn return TRUE
- void UNDO_savepoint(LSN lsn) / rollback work or
abort transaction/ - if (savepoint 0) / if at savepoint zero
(abort) / - unlock_class(LOCK_LONG, TRUE, MyRMID()) /
release all locks/ -
36Summary
- Model Complex actions are a page action
sequence. - LSN Each page carries an LSN and a semaphore.
- ReadFix Read acquires semaphore in shared mode.
- WriteFix Update actions (1) get semaphore in
exclusive mode, (2) generate one or more log
records covering the page, (3) advance the page
LSN to match highest LSN (4) give semaphore - WAL log_flush(page.LSN) before overwriting
persistent page - FORCE AT COMMIT force all log records up to the
commit LSN at commit - Compensation Logging Invalidate undone log
record with a compensating log record. - Idempotence via LSN page LSN makes REDO
idempotent
37Two Phase Commit
- Getting two or more logs to agree
- Getting two or more RMs to agree
- Atomically and Durably
- Even in case one of them fails and restarts.
- The TM phases
- Prepare. Invoke each joined RM asking for its
vote. - Decide. If all vote yes, durably write commit
log record. - Commit. Invoke each joined RM, telling it commit
decision. - Complete. Write commit completion when all RM
ACK.
38Centralized Case of Two Phase Commit
- Each participant (TM RM) goes through a
sequence of states
39Transitions in Case of Restart
Active state not persistent, others are
persistent For both TM and RM. Log records make
them persistent (redo) TM tries to drive states
to the right. (to committed, aborted)
Prepared
Committing
Committed
Active
Null
Aborted
Aborting
40Successful two phase commit
- Call flow from TM to each RM joined to
transaction - If TM and RM share the same log,
- the RM FORCE can piggyback on the TM FORCE
- One IO to commit a transaction (less if commit is
grouped)
41Abort Two Phase Commit
- If RM sends "NO" or no response (timeout), TM
starts abort. - Calls UNDO of each trans log record
- May stop at a savepoint.
- At begin_trans it calls ABORT() callback of each
joined RM
42Full Transaction State Diagram
43CHECKPOINTING
- Commit consistent checkpoints
- Stop admitting new transactions and wait until
all active transactions complete (abort or
commit) - Flush all dirty cache slots
- Write checkpoint record to log
- During recovery, begin forward scan at last
checkpoint record. - After last checkpoint, every element in DB
contained its last committed value. - If an element does not contain its last committed
value, it must have been updated after the
checkpoint
44Fuzzy Checkpointing
- Commit consistent checkpoint has two drawbacks
- A lot of disk I/O is needed (1000 pages _at_
5mS/page 5sec) - Must wait until all active transactions terminate
(2 sec.) - Protocol
- Stop processing new operations- Wait until all
active ones complete. - Flush every cache slot that has not been flushed
since last checkpoint. Stable-LSN lt
checkpoint-LSN. - Update stable-LSN of all buffers flushed.
- Write checkpoint record including-
- Active transaction list
- List of data items and stable-LSNs of all dirty
slots
45Restart Algorithm
- Locate Penultimate Checkpoint (the checkpoint
preceding the last one) - Add all transactions in checkpoint record to
active transaction list. - Forward scan of log (starting at penultimate
checkpoint) - Call rm_redo() for each log record
- On BEGIN_TRANSACTION log record add transaction
to active transaction list - On COMMIT or ABORT log record, remove transaction
from active transaction list.
46Restart Algorithm (Cont.)
- For each transaction in active transaction list
- Retreive the transactions log records in reverse
order (last record of transaction retreived
first) - For each log record, call rm_undo()
47Simple Recovery Method (Bernstein)
- Restart Algorithm
- redone undone ?
- Scan log from last record to first. For each log
record Ti, x, vbefore, vafter do - If x ? (redone ? undone) then
- If Ti is committed then
- restore xs cache slot to vafter
- redone redone ? x
- Otherwise
- restore xs cache slot to vbefore
- undone undone ? x
48Simple Recovery
- Assumptions
- Strict 2PL - locking at granularity of page
- Before and after-image are complete pages
- Every element x is restored to its last committed
value by Restart. - If last update to x is by a committed
transaction, the value it wrote is restored and
no further changes are made. (x ? redone) - If last update to x is by aborted or active
transaction Ti, the before image of x wrt Ti is
restored and no further change to x is made. (x ?
undone) - Because histories are strict, this value was
written by the last transaction to commit and
write x.
49Record Level Locking
- History is not strict with respect to pages
anymore. It is, however, strict with respect to
individual tuples. - A single page LSN is not enough. Consider the
following example - r1, r2, and r3 are records on the same page.
- Log wir1 ci wjr2 wkr3 ck aj
- Page LSN 1 3 4
? - The abort of Tj is processed by restoring the
before image of r2 in the page. - What should we do to the LSN?
- If we leave it at 4 and a system failure occurs,
what will happen? - If we set it to 3 what happens?
50Record Level Locking
- Need to log the undo operations we apply.
- This way, the undo in the history above will
update the LSN of the page to 6, the LSN of the
compensation record. - Log wir1 ci wjr2 wkr3 ck undo(wjr2)
- Page LSN 1 3 4 6
- Recovery when page-LSN is 1
- Redo log records 3, 4 and 6
- Undo log records 6 and 3
- Add records 7 and 8 to end of log
undo(undo(wjr2)) and undo(wjr2) - page LSN
8.
51Record Level Locking (Cont.)
- Recovery when page-LSN is 4
- Redo log record 6.
- Undo log records 6 and 3
- Add records 7 and 8 to end of log
undo(undo(wjr2)) and undo(wjr2) - page LSN
8. - Recovery when page-LSN is 6
- Redo nothing
- Undo log records 6 and 3
- Add records 7 and 8 to end of log
undo(undo(wjr2)) and undo(wjr2) - page LSN
8.
52Record Level Locking (Cont.)
- Recovery when page-LSN is 7 (recovery fails)
- Log wir1 ci wjr2 wkr3 ck
undo(wjr2) wjr2) undo(wjr2) - Page LSN 1 3 4 6
7 8 - Redo log record 8
- Undo log records 8, 7, 6 and 3.
- Add records 9, 10, 11 and 12 to end of log
undo(undo(wjr2)), undo(wjr2),
undo(undo(wjr2)) and undo(wjr2) - page LSN
12.
53ARIES
- Source
- C. Mohan, et.al., ARIES A Transaction Recovery
Method Supporting Fine-Granularity Locking and
Partial Rollbacks Using Write-Ahead Logging in
ACM Transactions on Database Systems, Vol. 17,
No. 1, March 1992. - Context
- ARIES is an integrated set of recovery,
concurrency control, and indexing methods. - Explains state of the practice for Transaction
Processing. In some ways superior to academic
work in TP. - Concurrency control, buffering, and logging
assumptions are similar to what we have been
discussing.
54Issues and Goals
- Types of logging
- Value logging Physical logging
- Page oriented logging Physiological logging
- Operation logging Logical logging
- Buffer management
- Force policy - Force updates of transactions
before they commit. (No REDO required) - Steal policy - Buffer manager can steal pages
dirtied by uncommitted transactions. (requires
UNDO) - STEAL / NO FORCE
- Most desirable as it imposes few constraints on
BM. - This is model we follow in project.
55Commutativity-based Concurrency Control
- Definitions of conflicting operations can be
generalized based on the the notion of
commutativity - Operations p and q conflict if they do not
commute - Different definitions of commutativity are
possible. - Operations p and q commute with respect to ? if
for any two states st1 ? st2, the values returned
by p and q are the same and the state after
executing p q beginning in state st1 is
equivalent (?) to the state after executing q p
beginning in state st2. - If we are interested in states of bank accounts
where two states are equivalent if the bank
account has the same balance - debit operations (that do not return the current
balance commute) - credit operations do not commute in general
- Other examples of commuting operations are
increment / decrement (assuming signed numbers
with no overflow)
56Failure Assumptions
- Transaction / System / Media
- Process failure
- What is it?
- Can it be handled using one of the other three
approaches?
57Record Keeping
- Transaction table
- TRID
- State of transaction (prepared or unprepared)
- LastLSN - Last log record written by transaction
- UndoNxtLSN - LSN of next record to process during
rollback - Dirty page table
- PageID
- RecLSN - Beginning of updates not yet applied to
page (How is this maintained?)
58Normal Processing
- Updates (order of events)
- Order of events
- acquire lock (exclusive)
- acquire latch (exclusive)
- perform update
- insert log record
- update LSN in page being updated
- unlatch page
-
- unlock record
59Normal Processing
- Discussion
- Assume that lock names are (page, directory
offset). - How can we determine the name of the lock we
need? - What happens if we change the order, and latch
first and then lock?
60Rollback
- Update records
- Fix page
- Undo log record
- Write compensation record
- Update page LSN (LSN of compensation record)
- Update transaction table last LSN
- Process PrevLSN next
- Compensation records
- Nothing to undo
- Process UndoNxtLSN next
61Checkpoints
- Done asynchronously
- Write begin_chkpt record
- Write contents of transaction table
- Write dirty page table to log file
- Write end_chkpt record
- Update checkpoint LSN in master record
62Restart Processing
- Analysis pass
- Forward scan of log starting at begin_chkpt
- Process end_chkpt
- transaction table and dirty page table entries
- Process other log records
- If transaction not in transaction table, add it.
- If page is not in dirty page table add it
- Return RedoLSN minimum RecLSN of all entries in
dirty page table. - Checkpoints are asynchronous and so we may have
transactions executing during a checkpoint. - How are effects of transactions that are active
during checkpoint accounted for?
63REDO Pass
- Begin forward scan at RedoLSN
- Process update and compensation records
- PageID is dirty and RecLSN lt Log.LSN
- Page.LSN lt Log.LSN
- Redo log record
- update Page.LSN
- Page.LSN gt Log.LSN
- RecLSN Page.LSN 1
64UNDO Pass
- Find last log record of all active transactions.
- Update record
- Undo update
- Write compensation record
- update page LSN (LSN of compensation record)
- UndoNxtLSN LogRec.PrevLSN
- If no more records for TX
- Write end_transaction record
- Delete transaction from transaction table
65UNDO Pass (Cont.)
- Compensation records
- UndoNxtLSN LogRec.UndoNxtLSN
- (Skip over intermediate records from same
transaction.) - Recovery Example (from paper)
66Nested Top Actions
- Recovery in a B-tree
- Need to log page splits so that they can be
redone if a failure occurs before we finish. - Page allocation
- Copy some records to new page
- Post the new key-pointer pair to parent.
- Once index is restructured, release locks on the
pages that are modified (hold lock on key range)
67Nested Top Actions (Cont.)
- Dont want to undo split if transaction that
caused split aborts. - It will probably be needed again in the near
future. - Another example of problem is page allocation
- Undo of allocation may result in committed
transactions work being undone. Why?
68Requirements
- Actions of incomplete Nested Top Actions (NTA)
are redone regardless of transaction outcome. - Incomplete NTAs are undone regardless of outcome.
- Complete NTAs are not undone even if transaction
fails. Inverse operations (e.g., deallocation)
will be invoked when necessary.
69Aries Solution
- Write to mark beginning of nested top action.
- At end add a CLR which points to log record
just before . - During forward pass the work will be redone.
- During backward pass if is not found, the
work will be undone. - If is found, undo pass will skip operations
in parenthesis.
70Discussion
- What is the advantage of the analyis pass?
- There is a lot of effort spent to justify
replaying history. - What is the disadvantage?
- What advantages does it provide?
- If the recovery method in your project provided
NTAs, how would you use them? - Could you use redo-only log records for the same
purpose?
71TPC Benchmark
- Goal- Provide a method of characterizing
performance of OLTP systems - Accounts for hardware/software costs over 5 year
period - Report both transactions per second (tps) and
/tps - Faster systems (measured in tps) operate with
larger databases.
72TPC
- Models large Bank with multiple branches
- Based on single Deposit transaction
- UPDATE Account SET Abalance Abalance delta
- WHERE Aid Aid
- SELECT Abalance INTO Abalance
- FROM accounts
- WHERE Aid Aid
- UPDATE Tellers SET Tbalance Tbalance delta
- WHERE Tid Tid
- UPDATE Branches SET Bbalance Bbalance delta
- WHERE Bid Bid
- INSERT INTO history(Tid, Bid, Aid, delta, time)
- VALUES (Tid, Bid, Aid, delta, CURRRENT)
73Scaling
- For every tps of the system under test, there are
at least - 1 branch records
- 10 teller records
- 100,000 account records
- What is purpose of this scaling?
- Does it favor systems with low tps?
74Project
- Test case for project
- Initialization - (dbs_tpc_init.c) Create four
tables with index on primary key. - Workload - (dbs_tpc.c)
- multiple clients (2 - 5)
- Each executes 20 transactions and quits
- Verification - (dbs_tpc_test.c)
- Prints each branch balance, sum of all teller
balances in each branch, sum of all account
balances for each branch. - If ACID properties are working, balances should
be same. - Testing
- Run multiple client verify
- Run multiple clients kill RM perform recovery
verify
75Discussion
- What throughput can you achieve with page-level
locking? - What locks must be acquired?
- What is probability of lock conflict?
- How long will conflict persist?
- What throughput would you expect with tuple-level
locking?
76Newer TPC Benchmarks
- More info at http//www-europe.sgi.com/Technology/
tpc.html or http//www.tpc.org - TPC-C
- Models warehouse distribution system
- Nine tables representing warehouse, district,
customer, order, stock, and history information. - Five transaction types - creating a new order,
making a payment to a customer's balance,
determining the status of an order, the delivery
of orders, and determining the current stock
level - Scaling by warehouse
- A 10,000 tpmC requires about half a terabyte of
on-line storage
77Newer TPC Benchmarks
- More info at http//www.tpc.org
- Newer benchmarks
- TPC-H - Ad-hoc Decision support
- TPC-R - Business Reporting Decision support
- TPC-W - Transactional WEB benchmark
- e-commerce and business to business intranet
78Log Manager
- Concept
- Log is a history of all changes to the state.
- Log old state gives new state
- Log new state gives old state
- Log is a sequential file.
- Complete log is the complete history
- Current state is just a "cache" of the log
records. - Performance
- What kind of log performance is important for
- High throughput?
- Low response time?
79The Log Table
- Log table is a sequential set (relation).
- Log Records have standard part and then a log
body. - Often want to query table via one attribute or
another - RMID, TRID, timestamp
- create domain LSN unsigned integer(64) -- log
sequence number (file , rba) - create domain RMID unsigned integer -- resource
manager identifier - create domain TRID char(12) -- transaction
identifier - create table log_table (
- lsn LSN, -- the records log sequence number
- prev_lsn LSN, -- the lsn of the previous
record in log - timestamp TIMESTAMP, -- time log record was
created - resource_manager RMID, -- resource mgr that
wrote this record - trid TRID, -- id of transaction that wrote
this record - tran_prev_lsn LSN, -- prev log record of this
transaction (or 0) - body varchar, -- log data rm understands
it - primary key (lsn) -- lsn is primary key
- foreign key (prev_lsn) -- previous log record
in this table - references a_log_table(lsn), --
80Log is complete history
B files
A files
Log Table
Log Anchor
lsn
prev_lsn
resource_mgr
trid
tran_prev_lsn
body
Archive
- Log anchor points at chain of each transaction.
- May maintain other chains.
- Log records map to sequence of N-plexed files
- Old files are archived.
- Eventually, archive files are discarded (weeks,
months, never)
81The Log LSN
- Each log record has a logical sequence number.
- This number (LSN for Log Sequence Number) plays a
key role in many algorithms. - Key property - If action A happened after action
B then - LSN(A) gt LSN(B).
82Accessing the Log
- Read a log record
- long log_read_lsn( LSN lsn, / lsn of record
to be read / - log_record_header header, / header of
record to be read / - long offset, / offset into body to
start read / - pointer buffer, / buffer to receive log
data / - long n) / length of buffer /
- Find current end of log
- LSN log_max_lsn(void) / returns the current
maximum lsn of the log table./ - Insert Log record - Log manager fills in header.
- LSN log_insert( char buffer, long n) / log
body is buffer0..n-1 / - Force Log - up to a certain LSN to persistent
storage - LSN log_flush( LSN lsn, Boolean lazy) / lazy
waits for a batch write or
timeout boxcar) /
83Accessing the Log (Cont.)
- Many real interfaces allow some of
- empty buffer to allow RM to fill it in (avoids
data copies) - gather take log data from many buffers.
- Few offer SQL access to the log.
84Summary of Log Structure
B file
A file
durable
storage
log page header
Log pages
in buffer pool
empty page in
Pages written in next write
buffer pool
current end of log
end of
header
durable
Log Table
body
log
85Log Anchor Logging and Locking
typedef struct filename tablename / name
of log table / struct log_files/ A
B file prefix names active file
/ xsemaphore lock / semaphore
regulates log write / LSN prev_lsn / LSN
of most recent write / LSN lsn
/ LSN of next record / LSN
durable_lsn / max lsn in durable
storage / LSN TM_anchor_lsn / lsn of
trans mgr's last ckpt / struct / array of
open log parts / long partno /
partition number / int
os_fnum / operating system file /
part MAXOPENS / /
log_anchor / /
- Log records never updated only inserted and read
- So no locks needed on log. - Semaphore (or something) needed on "end" of log
to manage space/growth/LSN for inserts
86Log Insert
- Log semaphore covers
- Incrementing LSN
- Finding the log end
- filling in the page(s)
- allocating space on a page, perhaps allocating
new pages. - Log write Daemons
- Log Semaphore can be a hotspot so No IO under
semaphore - Flush to persistent storage (disc) is done
asynchronously. - Demons driven by timers and by events (requests)
- Demons need not touch end-of-log semaphore
87Group Commit (Boxcaring)
- Batch processing of log writes.
- Forcing log buffer each time requested may write
many incomplete buffers - Most efficient to write large buffers to log
- Must delay some log_flush() requests until buffer
is full. - Example - Boxcaring
- Modern SCSI disks have highest write bandwidth
around 64KB block size - Transfers per second is lt 15ms. (10ms for seek
and rotation) - Each block contains 325 log records of 200 bytes
each. - TPC benchmark generates around four records per
transaction. - This will allow 5416 TPS
- If receive 1,000 log force requests/second why
not just execute 50 of them? - Response time will be the same (20ms).
- IOs will be 20x fewer
- CPU will be 10x smaller (10x fewer dispatches,
20x fewer OS IO). - Without it, systems are limited to about
- 50tps no ping-pong
- 100tps ping-pong.
- With it, systems are limited to disc bandwidth
gtgt10ktps.
88Example - Synchronous write per flush
- Perform log write every time a log_flush() is
received. - Transfer small blocks (4 200 bytes) requires
10ms. - Each block holds log records for one transaction.
- This allows about 100 TPS
89WADS- Giving the Log Disc Zero Latency
- Log disc is dedicated, so only has rotational
latency. - Reserve some cylinders on the disc as scratch.
- For each write
- Write at current position on next track (zero
latency). - When have a full-track (or two) of log data
- consolidate the write in ram
- do a single LARGE write (100KB 1 rotation) to
the log. - cost of this is seek rotation 20ms.
- Reserved area is called Write Ahead Data Set
(WADS). - At restart
- read cylinders
- gather recent log data
- rewrite end of log.
90Synchronization on B-Trees What Is the Problem?
- B-Trees are fully redundant structures, which can
be reconstructed from the tuples therefore, no
synchronization should be required at all. - However, some queries operate on the index only.
This requires all operations on B-trees to be
serializable with the operations on the tuples. - Standard two-phase locking with the nodes as the
objects is not feasible for performance reasons.
91Protecting Tree Traversal
1. semaphore on Q
Node Q at level i
search path
2. follow search path
3. semaphore on R
Node R at level i1
4. release sem. on Q
92B-Trees and Value Locks
B-tree
14 P1 P2 16 P3 17 P4 P5 P6 20 P7 22 P8 P9 P10
93Making Lock Names
- To implement value locking, we need to build lock
names according to the following rule - LockN TableName, IndexName, KeyValue.
- KeyValue in turn is a composite
- KeyValue AttributeValue, TupleIdentifier.
94Key Range Locking on B-Trees
- Details
- Allow for duplicate key values by modeling a
tuple as ltc, kgt where c is the key value and k is
a tuple pointer. - A retreival of c1, c2 retreives all tuples
ltc1, kmingt, ltc2, kmaxgt. - Insertions and deletions apply to a single tuple
ltc1, kigt. - retrieval(c1?? K ? c2) (Range query)
- retrieval(k c) is special case of this -
retrieval(c ?? K ? c) - 1. Get shared semaphore on first leaf page that
contains the range. - 2. Get shared locks on all existing key values
c c0 ? c ? c2 in sequence, where c0 is
largest key such that c0 ? c1 - A key lock on key value c1 protects range of keys
from ltc1, kmingt, ltc2, kmingt) where c2 is next
key value after c1. - 3. Lock tuple in shared mode.
95Key Range Locking on B-Trees
- 4. Use coupling technique when acquiring
semaphores - 5. Locks are held until commit. Semaphores are
released earlier. - insert(ltK, tidgt)
- 1. Get exclusive semaphore on leaf page.
- 2. Get an instant IX lock on c1 where c1 is
largest key value less than or equal to K. - This ensures that no search has started using the
rangeltc1, kmingt, ltc2, kmingt) which includes the
key being inserted. - Multiple insertions into ltc1, kmingt, ltc2, kmingt)
can proceed concurrently. - 3. If c1 ? K get an exclusive lock on K
- This protects the new key range until the
transaction commits. - 4. Get X Lock on tuple.
96Key Range Locking on B-Trees
- delete(ltK, tidgt)
- 1. Get exclusive semaphore on leaf page.
- 2. Let ltc1, kdgt be largest index entry ? ltK,
tidgt. - If c1 K, get an IX lock on K.
- This does not change the key values that exist in
the relation. An ix lock is enough to discourage
a search. - If c1 lt K, get an X lock on c1 and K.
- In this case, the number of key ranges is
reduced. The x lock on K forces searches for K to
wait, and an X lock on c1 will force an insertion
into this range to be delayed. - 3. Get X lock on tuple.
- 4. Hold lock(s) until commit
97B-Tree Recovery
98B-Tree Recovery Based on Physiological Logging
- Cover all B-tree operations with semaphores on
all affected pages. - For each logical update a log record with the
logical UNDO (only) operation must be moved to
the log - While the update operation is being performed,
physical REDO (only) log records are written. - After all REDO records are safely in the log, the
exclusive semaphores can be released.
99The Two Phases of B-Tree-Recovery
- Phase1 Go forward through the log up to its
current end, applying all REDO records to the
tree. - Phase2 Go backward to the Begin of transaction
record of the oldest incomplete transaction,
executing the UNDO operations on the tree for all
losers along the way.
100Discussion
- What happens if failure occurs before all REDO
records are flushed. - Cannot restore Index to physically consistent
state. - Can the same optimization be applied to the case
when a hash bucket overflows?