Recovery - PowerPoint PPT Presentation

About This Presentation
Title:

Recovery

Description:

Recovery Chapter 9, 10, 11.1-11.4 in Gray and Reuter Adapted from s by J. Gray & A. Reuter – PowerPoint PPT presentation

Number of Views:125
Avg rating:3.0/5.0
Slides: 101
Provided by: rutg155
Category:
Tags: fuzzy | joins | recovery

less

Transcript and Presenter's Notes

Title: Recovery


1
Recovery
Chapter 9, 10, 11.1-11.4 in Gray and Reuter
  • Adapted from slides by J. Gray A. Reuter

2
Failure Types
  • Transaction Failure
  • Transaction issues abort
  • System Failure
  • Volatile memory is corrupted
  • Media Failure
  • Stable storage is corrupted

3
Failure Model (Assumptions)
  • Failures can always be detected (Failstop)
  • System
  • Defensive programming
  • Error detecting codes (parity, checksums, etc.)
  • Media
  • Redundant information (e.g., description of block
    in header)
  • Checksums

4
System Failure Recovery
  • Goal
  • At point of failure, history is H.
  • Recovery must restore DB to final state defined
    by C(H).
  • All information needed to accomplish this must be
    in stable storage

5
Normal (no failure) Transaction Execution
  • TM generates the TRID at Begin_Work().
  • Coordinates Commit,
  • RM joins work, generates log records, allows
    commit

6
The Resource manager view
  • Boolean Prepare(LSN ) / invoked at ? 1. Return
    vote on commit /
  • void Commit() / called at commit ?2 /
  • void Abort() / called at failed commit ? 2 or
    abort /
  • void UNDO(LSN) / Undo the log record with this
    LSN /
  • void REDO(LSN) / Redo the log record with this
    LSN /
  • void TM_Startup(LSN) / TM restarting. Passes RM
    ckpt LSN /
  • LSN Checkpoint(LSN low_water) / TM
    checkpointing, Return RM ckpt LSN,
  • set low water LSN /

7
The Transaction Manager
  • Transaction rollback.
  • coordinates transaction rollback to a savepoint
    or abort rollbacks can be initiated by any
    participant.
  • Resource manager restart.
  • If an RM fails and restarts, TM presents
    checkpoint anchor RM undo/redo log
  • System restart.
  • TM drives local RM recovery (like RM restart)
  • TM resolves any in-doubt distributed transactions
  • Media recovery.
  • TM helps RM reconstruct damaged objects by
    providing
  • archive copies of object the log of object
    since archived.
  • Node restart.
  • Transaction commit among independent TMs when a
    TM fails.

8
When a Transaction Aborts
  • At transaction rollback
  • TM drives undo of each RM joined to the
    transaction
  • Can be to savepoint 0 (abort) or partial rollback.

9
The Transaction Manager at Restart/Recovery
  • At restart, TM reading the log drives RM
    recovery.
  • Single log scan.
  • Single resolver of transactions.
  • Multiple logs possible, but more complex/more
    work.

10
Resource Manager ConceptsUndo Redo Protocol
11
Resource Manager Concepts Transaction UNDO
Protocol
declare cursor for transaction_log select
rmid, lsn / a cursor on the transaction's
log / from log / it returns the resource
manager name / where trid trid / and
record id (log sequence number) / descending
lsn / and returns records in LIFO
order / void transaction_undo(TRID trid) /
Undo the specified transaction. / int
sqlcode / event variables set by sql
/ open cursor transaction_log / open an
sql cursor on the trans log / while
(TRUE) / scan trans log backwards undo
each/ / fetch the next most recent log
rec / fetch transaction_log into rmid, lsn
/ / if (sqlcode ! 0) break
/ if no more, trans is undone, end loop /
rmid.undo(lsn) / tell RM to undo that
record / close cursor
transaction_log / Undo scan is complete, close
cursor / / return to caller /
12
Resource Manager Concepts Restart REDO Protocol
void log_redo(void) declare cursor for the_log
/ declare cursor from log start
forward / select rmid, lsn / gets RM id and
log record id (lsn) / from log / of all log
records. / ascending lsn / in FIFO
order / open cursor the_log / open an
sql cursor on the log table / while (TRUE) /
Scan log forward redo each record. / fetch
the_log into rmid, lsn / fetch the next log
record / if (sqlcode ! 0) break / if
no more, then all redone, end loop /
rmid.redo(lsn) / tell RM to redo that
record / close cursor the_log / Redo
scan complete, close cursor / /
return to caller /
  • Note REDO forwards, UNDO backwards

13
Idempotence
  • F(F(X)) F(X) Needed in case restart fails
    (and restarts)
  • Redo(Redo(old_state,log), log)
    Redo(new_state,log) new_state
  • Undo(Undo(new_state,log), log)
    Undo(old_state,log) old_state

14
Testable State Can Tell If It Happened.
  • IF operation not idempotent AND state not
    testable
  • THEN recovery is impossible

15
Kinds of Logging
  • Physical
  • Keep old and new value of container (page,
    file,...)
  • Pro Simple
  • Allows recovery of physical object (e.g. broken
    page)
  • Con Generates LOTS of log data
  • Logical
  • Keep call params such that you can compute F(x),
    F-1(x)
  • Pro Sounds simple
  • Compact log.
  • Con Doesn't work (wrong failure model).
  • Operations do not fail cleanly.

16
Sample Physical LOG RECORD
struct compressed_log_record_for_page_update /
/ int opcode / opcode will
say compressed page update/ filename fname /
name of file that was updated / long pageno
/ page that was updated / long offset /
offset within page that was updated / long len
gth / length of field that was
updated / char old_valuelength / old
value of field / char new_valuelength
/ new value of field / /
/
  • Ordinary sequential insert is OK.
  • Update of sorted (B-tree) page
  • update LSN
  • update page space map
  • update pointer to record
  • insert record at correct spot (move 1/2 the
    others)
  • Essentially writes whole page (old and new).
  • 16KB log records for 100-byte updates.

17
Sample Physical LOG RECORD
struct logical_log_record_for_insert /
/ int opcode / opcode will
says insert / filename fname / name of
file that was updated / long length /
length of record that was updated / char record
length / value record / /
/
  • Very compact.
  • Implies page update(s) for record (may be many
    pages long).
  • Implies index updates (may be many indices on
    base table)

18
The trouble with Logical Logging
  • Logical logging needs to start UNDO/REDO with an
    action-consistent state.
  • Partial Actions
  • If an action runs to completion, we can use
    inverse action to UNDO operation.
  • What if action fails part of the way through? How
    do we put system in consistent state.
  • for example insert (table, record)
  • ALL or NONE of the indices should be updated when
    logical UNDO/REDO is invoked.
  • Action Consistency
  • After a system failure, the state of persistent
    storage may not be action consistent.
  • How can we restore an action consistent state?

19
Making Logical Logging Work Shadows
  • Keep old copy of each page
  • Reset page to old copy at abort (no undo log)
  • Discard old copy at commit.
  • Handles all online failures due to
  • Logic e.g. duplicate key.
  • Limit ran out of space
  • Contention deadlock
  • Problem forces page locking, only one updater
    per page.
  • What about restart?
  • Need to atomically write out all changed pages.

20
Making Logical Logging Work Shadows
  • Perform same shadow trick at disc level.
  • Keep shadow copy of old pages.
  • Write out new pages.
  • In one careful write, write out new page root.
  • Makes update atomic

21
Shadows
  • Pro Simple
  • Not such a bad deal with non-volatile ram
  • Con page locking
  • extra space
  • extra overhead (for page maps)
  • extra IO
  • declusters sequential data

22
Logical vs Physio-logical Logging
Note physical log records would be bigger for
sorted pages.
23
Physiological Logging Rules
  • Complex operations are a sequence of simple
    operations on pages.
  • Each operation is constructed as a
    mini-transaction
  • lock the object in exclusive mode
  • transform the object
  • generate an UNDO-REDO log record
  • record log LSN in object
  • unlock the object.
  • Action Consistent Object
  • When object semaphore free, no ops in progress.
  • Log-Consistency
  • Log contains log records of all complete page
    actions.

24
Physiological Logging Rules - Online Operation
  • Each operation is structured as a
    mini-transaction.
  • Each operation generates an UNDO record.
  • No page operation fails with the semaphore set.
  • (exception handler must clean up state and UNFIX
    any pages).
  • Then Rollback can be physical to a page and
    logical within page.

25
Physiological Logging Rules - Restart Operation
  • Need Page-Action consistent persistent state.
  • Pages are action consistent.
  • Committed actions can be redone from log.
  • Uncommitted actions can be undone from log.
  • WAL Write Ahead Log
  • Write undo/redo log records before overwriting
    disk page
  • Only write action-consistent pages
  • Force-Log-At-Commit
  • Make transaction log records durable at commit.

26
WAL and Force at Commit
  • WAL Write Ahead Log
  • write page
  • get page semaphore
  • copy page to buffer
  • give page semaphore / avoids holding
    semaphore during IO /
  • Force_log(Page(LSN)) /WAL logic, probably
    already flushed/
  • Write buffer to disk.
  • WAL gives idempotence and testability.
  • Force-Log-At-Commit
  • At commit phase 1
  • Force_log(transaction.max_lsn)

27
The One Bit Resource Manager
  • Manages an array of transactional bits (the free
    space bit map).
  • i get_bit() / gets a free bit and sets
    it /
  • give_bit(i) / returns a free bit /

28
The Bitmap and Its Log Records
  • The Data Structure
  • struct / layout of the one-bit RM data
    structure /
  • LSN lsn / page LSN for WAL
    protocol /
  • xsemaphore sem / semaphore regulates
    access to the page /
  • Boolean bitBITS / page.biti TRUE gt
    biti is free /
  • page / allocates the page
    structure /
  • The Log Records
  • struct / log record format for the one-bit RM /
  • int index / index of bit that was
    updated /
  • Boolean value / new value of bitindex
    /
  • log_rec / log record used
    by the one-bit RM /
  • const int rec_size sizeof(log_rec) /size of
    the log record body. /

29
Page and Log Consistency for 1-Bit RM
  • Data dirty if reflects an uncommitted transaction
    update. Otherwise, data is clean.
  • Page Consistency
  • No clean free bit has been given to any
    transaction.
  • Every clean busy bit was given to exactly one
    transaction.
  • Dirty bits locked in X mode by updating
    transactions.
  • The page.lsn reflects most recent log record for
    page.
  • Log Consistency
  • Log contains a record for every completed
    mini-transaction update to the page.

30
give_bit()
  • get_bit() give_bit(i) temporarily violate page
    consistency.
  • Mini-transaction holds semaphore while violating
    consistency.
  • Makes page log mutually consistent before
    releasing sem.
  • each mini-transaction observes a consistent page
    state.
  • void give_bit(int i) / free a bit /
  • if (LOCK_GRANTEDlock(i,LOCK_X,LOCK_LONG,0))
    / Lock bit /
  • Xsem_get(page.sem) / get page sem /
  • page.biti TRUE / free the bit /
  • log_rec.index i / generate log rec/
  • log_rec.value TRUE /saying bit is free/
  • page.lsn log_insert(log_rec,rec_size)
    /write log recupdate lsn/
  • Xsem_give(page.sem) / page consistent/
  • else / if lock failed, caller doesn't own
    bit, /
  • Abort_Work() / in that case abort
    caller's trans /
  • return

31
get_bit()
  • int get_bit(void) / allocate a bit to and
    returns bit index /
  • int i / loop variable /
  • Xsem_get(page.sem) / get the page
    semaphore /
  • for ( i 0 iltBITS i) / loop looking for a
    free bit /
  • if (page.biti) / if bit is free, may be
    dirty (so locked) /
  • if (LOCK_GRANTED lock(i,LOCK_X,LOCK_LONG,
    0))/ lock bit /
  • page.biti FALSE / got lock on it /
  • log_rec.value FALSE / generate log rec
    describing update/
  • log_rec.index i
  • page.lsn log_insert(log_rec,rec_size) /
    write log recupdatelsn /
  • Xsem_give(page.sem) / page now consistent,
    give up sem/
  • return i / return to caller /
  • / try next free bit, /
  • Xsem_give(page.sem) / if no free bits, give
    up semaphore /
  • Abort_Work() / abort transaction /
  • return -1 / returns -1 if no bits are
    available. /

32
Compensation Logging
  • Undo may generate a log record recording undo
    step
  • Makes Page LSN monotonic

33
1-bit RM UNDO Callback
  • void undo(LSN lsn) / undo a one-bit RM
    operation /
  • int i / bit index /
  • Boolean value / old bit value from log rec to
    be undone/
  • log_rec_header header / buffer to hold log
    record header /
  • rec_size log_read_lsn(lsn,header,0,log_rec,big)
    / read log rec /
  • Xsem_get(page.sem) / get the page
    semaphore /
  • i log_rec.index / get bit index from log
    record /
  • value ! log_rec.value / get complement of
    new bit value/
  • page.biti value / update bit to old
    value /
  • log_rec.value value / make a compensation
    log record /
  • page.lsn log_insert(log_rec,rec_size) /
    log it and bump page lsn/
  • Xsem_give(page.sem) / free the page
    semaphore /
  • return

34
1-bit RM REDO Callback
  • void redo( LSN lsn) / redo an free space
    operation /
  • int i / bit index /
  • Boolean value / new bit value from log rec to
    be redone/
  • log_rec_header header / buffer to hold log
    record header /
  • rec_size log_read_lsn(lsn,header,0,log_rec,big)
    / read log record /
  • i log_rec.index / Get bit index /
  • lock(i,LOCK_X,LOCK_LONG,0) / get lock on the
    bit (often not needed) /
  • Xsem_get(page.sem) / get the page
    semaphore /
  • if (page.lsn lt lsn) / if bit version older
    than log record /
  • value log_rec.value / then redo the op.
    get new bit value /
  • page.biti value / apply new bit value to
    bit /
  • page.lsn lsn / advance the page lsn /
  • Xsem_give(page.sem) / free the page
    semaphore /
  • return

35
1-BIT Rm Noise Callbacks
  • Boolean prepare(LSN lsn) / 1-bit RM has no
    phase 1 work /
  • lsn NULLlsn return TRUE / /
  • void Commit(void ) / Commit release locks
    /
  • unlock_class(LOCK_LONG, TRUE, MyRMID()) /
    return /
  • void Abort(void ) / Abort release all locks
    /
  • unlock_class(LOCK_LONG, TRUE, MyRMID()) /
    return /
  • Boolean savepoint((LSN lsn) / no work to do
    at savepoint /
  • lsn NULLlsn return TRUE
  • void UNDO_savepoint(LSN lsn) / rollback work or
    abort transaction/
  • if (savepoint 0) / if at savepoint zero
    (abort) /
  • unlock_class(LOCK_LONG, TRUE, MyRMID()) /
    release all locks/

36
Summary
  • Model Complex actions are a page action
    sequence.
  • LSN Each page carries an LSN and a semaphore.
  • ReadFix Read acquires semaphore in shared mode.
  • WriteFix Update actions (1) get semaphore in
    exclusive mode, (2) generate one or more log
    records covering the page, (3) advance the page
    LSN to match highest LSN (4) give semaphore
  • WAL log_flush(page.LSN) before overwriting
    persistent page
  • FORCE AT COMMIT force all log records up to the
    commit LSN at commit
  • Compensation Logging Invalidate undone log
    record with a compensating log record.
  • Idempotence via LSN page LSN makes REDO
    idempotent

37
Two Phase Commit
  • Getting two or more logs to agree
  • Getting two or more RMs to agree
  • Atomically and Durably
  • Even in case one of them fails and restarts.
  • The TM phases
  • Prepare. Invoke each joined RM asking for its
    vote.
  • Decide. If all vote yes, durably write commit
    log record.
  • Commit. Invoke each joined RM, telling it commit
    decision.
  • Complete. Write commit completion when all RM
    ACK.

38
Centralized Case of Two Phase Commit
  • Each participant (TM RM) goes through a
    sequence of states

39
Transitions in Case of Restart
Active state not persistent, others are
persistent For both TM and RM. Log records make
them persistent (redo) TM tries to drive states
to the right. (to committed, aborted)
Prepared
Committing
Committed
Active
Null
Aborted
Aborting
40
Successful two phase commit
  • Call flow from TM to each RM joined to
    transaction
  • If TM and RM share the same log,
  • the RM FORCE can piggyback on the TM FORCE
  • One IO to commit a transaction (less if commit is
    grouped)

41
Abort Two Phase Commit
  • If RM sends "NO" or no response (timeout), TM
    starts abort.
  • Calls UNDO of each trans log record
  • May stop at a savepoint.
  • At begin_trans it calls ABORT() callback of each
    joined RM

42
Full Transaction State Diagram
43
CHECKPOINTING
  • Commit consistent checkpoints
  • Stop admitting new transactions and wait until
    all active transactions complete (abort or
    commit)
  • Flush all dirty cache slots
  • Write checkpoint record to log
  • During recovery, begin forward scan at last
    checkpoint record.
  • After last checkpoint, every element in DB
    contained its last committed value.
  • If an element does not contain its last committed
    value, it must have been updated after the
    checkpoint

44
Fuzzy Checkpointing
  • Commit consistent checkpoint has two drawbacks
  • A lot of disk I/O is needed (1000 pages _at_
    5mS/page 5sec)
  • Must wait until all active transactions terminate
    (2 sec.)
  • Protocol
  • Stop processing new operations- Wait until all
    active ones complete.
  • Flush every cache slot that has not been flushed
    since last checkpoint. Stable-LSN lt
    checkpoint-LSN.
  • Update stable-LSN of all buffers flushed.
  • Write checkpoint record including-
  • Active transaction list
  • List of data items and stable-LSNs of all dirty
    slots

45
Restart Algorithm
  • Locate Penultimate Checkpoint (the checkpoint
    preceding the last one)
  • Add all transactions in checkpoint record to
    active transaction list.
  • Forward scan of log (starting at penultimate
    checkpoint)
  • Call rm_redo() for each log record
  • On BEGIN_TRANSACTION log record add transaction
    to active transaction list
  • On COMMIT or ABORT log record, remove transaction
    from active transaction list.

46
Restart Algorithm (Cont.)
  • For each transaction in active transaction list
  • Retreive the transactions log records in reverse
    order (last record of transaction retreived
    first)
  • For each log record, call rm_undo()

47
Simple Recovery Method (Bernstein)
  • Restart Algorithm
  • redone undone ?
  • Scan log from last record to first. For each log
    record Ti, x, vbefore, vafter do
  • If x ? (redone ? undone) then
  • If Ti is committed then
  • restore xs cache slot to vafter
  • redone redone ? x
  • Otherwise
  • restore xs cache slot to vbefore
  • undone undone ? x

48
Simple Recovery
  • Assumptions
  • Strict 2PL - locking at granularity of page
  • Before and after-image are complete pages
  • Every element x is restored to its last committed
    value by Restart.
  • If last update to x is by a committed
    transaction, the value it wrote is restored and
    no further changes are made. (x ? redone)
  • If last update to x is by aborted or active
    transaction Ti, the before image of x wrt Ti is
    restored and no further change to x is made. (x ?
    undone)
  • Because histories are strict, this value was
    written by the last transaction to commit and
    write x.

49
Record Level Locking
  • History is not strict with respect to pages
    anymore. It is, however, strict with respect to
    individual tuples.
  • A single page LSN is not enough. Consider the
    following example
  • r1, r2, and r3 are records on the same page.
  • Log wir1 ci wjr2 wkr3 ck aj
  • Page LSN 1 3 4
    ?
  • The abort of Tj is processed by restoring the
    before image of r2 in the page.
  • What should we do to the LSN?
  • If we leave it at 4 and a system failure occurs,
    what will happen?
  • If we set it to 3 what happens?

50
Record Level Locking
  • Need to log the undo operations we apply.
  • This way, the undo in the history above will
    update the LSN of the page to 6, the LSN of the
    compensation record.
  • Log wir1 ci wjr2 wkr3 ck undo(wjr2)
  • Page LSN 1 3 4 6
  • Recovery when page-LSN is 1
  • Redo log records 3, 4 and 6
  • Undo log records 6 and 3
  • Add records 7 and 8 to end of log
    undo(undo(wjr2)) and undo(wjr2) - page LSN
    8.

51
Record Level Locking (Cont.)
  • Recovery when page-LSN is 4
  • Redo log record 6.
  • Undo log records 6 and 3
  • Add records 7 and 8 to end of log
    undo(undo(wjr2)) and undo(wjr2) - page LSN
    8.
  • Recovery when page-LSN is 6
  • Redo nothing
  • Undo log records 6 and 3
  • Add records 7 and 8 to end of log
    undo(undo(wjr2)) and undo(wjr2) - page LSN
    8.

52
Record Level Locking (Cont.)
  • Recovery when page-LSN is 7 (recovery fails)
  • Log wir1 ci wjr2 wkr3 ck
    undo(wjr2) wjr2) undo(wjr2)
  • Page LSN 1 3 4 6
    7 8
  • Redo log record 8
  • Undo log records 8, 7, 6 and 3.
  • Add records 9, 10, 11 and 12 to end of log
    undo(undo(wjr2)), undo(wjr2),
    undo(undo(wjr2)) and undo(wjr2) - page LSN
    12.

53
ARIES
  • Source
  • C. Mohan, et.al., ARIES A Transaction Recovery
    Method Supporting Fine-Granularity Locking and
    Partial Rollbacks Using Write-Ahead Logging in
    ACM Transactions on Database Systems, Vol. 17,
    No. 1, March 1992.
  • Context
  • ARIES is an integrated set of recovery,
    concurrency control, and indexing methods.
  • Explains state of the practice for Transaction
    Processing. In some ways superior to academic
    work in TP.
  • Concurrency control, buffering, and logging
    assumptions are similar to what we have been
    discussing.

54
Issues and Goals
  • Types of logging
  • Value logging Physical logging
  • Page oriented logging Physiological logging
  • Operation logging Logical logging
  • Buffer management
  • Force policy - Force updates of transactions
    before they commit. (No REDO required)
  • Steal policy - Buffer manager can steal pages
    dirtied by uncommitted transactions. (requires
    UNDO)
  • STEAL / NO FORCE
  • Most desirable as it imposes few constraints on
    BM.
  • This is model we follow in project.

55
Commutativity-based Concurrency Control
  • Definitions of conflicting operations can be
    generalized based on the the notion of
    commutativity
  • Operations p and q conflict if they do not
    commute
  • Different definitions of commutativity are
    possible.
  • Operations p and q commute with respect to ? if
    for any two states st1 ? st2, the values returned
    by p and q are the same and the state after
    executing p q beginning in state st1 is
    equivalent (?) to the state after executing q p
    beginning in state st2.
  • If we are interested in states of bank accounts
    where two states are equivalent if the bank
    account has the same balance
  • debit operations (that do not return the current
    balance commute)
  • credit operations do not commute in general
  • Other examples of commuting operations are
    increment / decrement (assuming signed numbers
    with no overflow)

56
Failure Assumptions
  • Transaction / System / Media
  • Process failure
  • What is it?
  • Can it be handled using one of the other three
    approaches?

57
Record Keeping
  • Transaction table
  • TRID
  • State of transaction (prepared or unprepared)
  • LastLSN - Last log record written by transaction
  • UndoNxtLSN - LSN of next record to process during
    rollback
  • Dirty page table
  • PageID
  • RecLSN - Beginning of updates not yet applied to
    page (How is this maintained?)

58
Normal Processing
  • Updates (order of events)
  • Order of events
  • acquire lock (exclusive)
  • acquire latch (exclusive)
  • perform update
  • insert log record
  • update LSN in page being updated
  • unlatch page
  • unlock record

59
Normal Processing
  • Discussion
  • Assume that lock names are (page, directory
    offset).
  • How can we determine the name of the lock we
    need?
  • What happens if we change the order, and latch
    first and then lock?

60
Rollback
  • Update records
  • Fix page
  • Undo log record
  • Write compensation record
  • Update page LSN (LSN of compensation record)
  • Update transaction table last LSN
  • Process PrevLSN next
  • Compensation records
  • Nothing to undo
  • Process UndoNxtLSN next

61
Checkpoints
  • Done asynchronously
  • Write begin_chkpt record
  • Write contents of transaction table
  • Write dirty page table to log file
  • Write end_chkpt record
  • Update checkpoint LSN in master record

62
Restart Processing
  • Analysis pass
  • Forward scan of log starting at begin_chkpt
  • Process end_chkpt
  • transaction table and dirty page table entries
  • Process other log records
  • If transaction not in transaction table, add it.
  • If page is not in dirty page table add it
  • Return RedoLSN minimum RecLSN of all entries in
    dirty page table.
  • Checkpoints are asynchronous and so we may have
    transactions executing during a checkpoint.
  • How are effects of transactions that are active
    during checkpoint accounted for?

63
REDO Pass
  • Begin forward scan at RedoLSN
  • Process update and compensation records
  • PageID is dirty and RecLSN lt Log.LSN
  • Page.LSN lt Log.LSN
  • Redo log record
  • update Page.LSN
  • Page.LSN gt Log.LSN
  • RecLSN Page.LSN 1

64
UNDO Pass
  • Find last log record of all active transactions.
  • Update record
  • Undo update
  • Write compensation record
  • update page LSN (LSN of compensation record)
  • UndoNxtLSN LogRec.PrevLSN
  • If no more records for TX
  • Write end_transaction record
  • Delete transaction from transaction table

65
UNDO Pass (Cont.)
  • Compensation records
  • UndoNxtLSN LogRec.UndoNxtLSN
  • (Skip over intermediate records from same
    transaction.)
  • Recovery Example (from paper)

66
Nested Top Actions
  • Recovery in a B-tree
  • Need to log page splits so that they can be
    redone if a failure occurs before we finish.
  • Page allocation
  • Copy some records to new page
  • Post the new key-pointer pair to parent.
  • Once index is restructured, release locks on the
    pages that are modified (hold lock on key range)

67
Nested Top Actions (Cont.)
  • Dont want to undo split if transaction that
    caused split aborts.
  • It will probably be needed again in the near
    future.
  • Another example of problem is page allocation
  • Undo of allocation may result in committed
    transactions work being undone. Why?

68
Requirements
  • Actions of incomplete Nested Top Actions (NTA)
    are redone regardless of transaction outcome.
  • Incomplete NTAs are undone regardless of outcome.
  • Complete NTAs are not undone even if transaction
    fails. Inverse operations (e.g., deallocation)
    will be invoked when necessary.

69
Aries Solution
  • Write to mark beginning of nested top action.
  • At end add a CLR which points to log record
    just before .
  • During forward pass the work will be redone.
  • During backward pass if is not found, the
    work will be undone.
  • If is found, undo pass will skip operations
    in parenthesis.

70
Discussion
  • What is the advantage of the analyis pass?
  • There is a lot of effort spent to justify
    replaying history.
  • What is the disadvantage?
  • What advantages does it provide?
  • If the recovery method in your project provided
    NTAs, how would you use them?
  • Could you use redo-only log records for the same
    purpose?

71
TPC Benchmark
  • Goal- Provide a method of characterizing
    performance of OLTP systems
  • Accounts for hardware/software costs over 5 year
    period
  • Report both transactions per second (tps) and
    /tps
  • Faster systems (measured in tps) operate with
    larger databases.

72
TPC
  • Models large Bank with multiple branches
  • Based on single Deposit transaction
  • UPDATE Account SET Abalance Abalance delta
  • WHERE Aid Aid
  • SELECT Abalance INTO Abalance
  • FROM accounts
  • WHERE Aid Aid
  • UPDATE Tellers SET Tbalance Tbalance delta
  • WHERE Tid Tid
  • UPDATE Branches SET Bbalance Bbalance delta
  • WHERE Bid Bid
  • INSERT INTO history(Tid, Bid, Aid, delta, time)
  • VALUES (Tid, Bid, Aid, delta, CURRRENT)

73
Scaling
  • For every tps of the system under test, there are
    at least
  • 1 branch records
  • 10 teller records
  • 100,000 account records
  • What is purpose of this scaling?
  • Does it favor systems with low tps?

74
Project
  • Test case for project
  • Initialization - (dbs_tpc_init.c) Create four
    tables with index on primary key.
  • Workload - (dbs_tpc.c)
  • multiple clients (2 - 5)
  • Each executes 20 transactions and quits
  • Verification - (dbs_tpc_test.c)
  • Prints each branch balance, sum of all teller
    balances in each branch, sum of all account
    balances for each branch.
  • If ACID properties are working, balances should
    be same.
  • Testing
  • Run multiple client verify
  • Run multiple clients kill RM perform recovery
    verify

75
Discussion
  • What throughput can you achieve with page-level
    locking?
  • What locks must be acquired?
  • What is probability of lock conflict?
  • How long will conflict persist?
  • What throughput would you expect with tuple-level
    locking?

76
Newer TPC Benchmarks
  • More info at http//www-europe.sgi.com/Technology/
    tpc.html or http//www.tpc.org
  • TPC-C
  • Models warehouse distribution system
  • Nine tables representing warehouse, district,
    customer, order, stock, and history information.
  • Five transaction types - creating a new order,
    making a payment to a customer's balance,
    determining the status of an order, the delivery
    of orders, and determining the current stock
    level
  • Scaling by warehouse
  • A 10,000 tpmC requires about half a terabyte of
    on-line storage

77
Newer TPC Benchmarks
  • More info at http//www.tpc.org
  • Newer benchmarks
  • TPC-H - Ad-hoc Decision support
  • TPC-R - Business Reporting Decision support
  • TPC-W - Transactional WEB benchmark
  • e-commerce and business to business intranet

78
Log Manager
  • Concept
  • Log is a history of all changes to the state.
  • Log old state gives new state
  • Log new state gives old state
  • Log is a sequential file.
  • Complete log is the complete history
  • Current state is just a "cache" of the log
    records.
  • Performance
  • What kind of log performance is important for
  • High throughput?
  • Low response time?

79
The Log Table
  • Log table is a sequential set (relation).
  • Log Records have standard part and then a log
    body.
  • Often want to query table via one attribute or
    another
  • RMID, TRID, timestamp
  • create domain LSN unsigned integer(64) -- log
    sequence number (file , rba)
  • create domain RMID unsigned integer -- resource
    manager identifier
  • create domain TRID char(12) -- transaction
    identifier
  • create table log_table (
  • lsn LSN, -- the records log sequence number
  • prev_lsn LSN, -- the lsn of the previous
    record in log
  • timestamp TIMESTAMP, -- time log record was
    created
  • resource_manager RMID, -- resource mgr that
    wrote this record
  • trid TRID, -- id of transaction that wrote
    this record
  • tran_prev_lsn LSN, -- prev log record of this
    transaction (or 0)
  • body varchar, -- log data rm understands
    it
  • primary key (lsn) -- lsn is primary key
  • foreign key (prev_lsn) -- previous log record
    in this table
  • references a_log_table(lsn), --

80
Log is complete history
B files
A files
Log Table
Log Anchor
lsn
prev_lsn
resource_mgr
trid
tran_prev_lsn

body
Archive
  • Log anchor points at chain of each transaction.
  • May maintain other chains.
  • Log records map to sequence of N-plexed files
  • Old files are archived.
  • Eventually, archive files are discarded (weeks,
    months, never)

81
The Log LSN
  • Each log record has a logical sequence number.
  • This number (LSN for Log Sequence Number) plays a
    key role in many algorithms.
  • Key property - If action A happened after action
    B then
  • LSN(A) gt LSN(B).

82
Accessing the Log
  • Read a log record
  • long log_read_lsn( LSN lsn, / lsn of record
    to be read /
  • log_record_header header, / header of
    record to be read /
  • long offset, / offset into body to
    start read /
  • pointer buffer, / buffer to receive log
    data /
  • long n) / length of buffer /
  • Find current end of log
  • LSN log_max_lsn(void) / returns the current
    maximum lsn of the log table./
  • Insert Log record - Log manager fills in header.
  • LSN log_insert( char buffer, long n) / log
    body is buffer0..n-1 /
  • Force Log - up to a certain LSN to persistent
    storage
  • LSN log_flush( LSN lsn, Boolean lazy) / lazy
    waits for a batch write or
    timeout boxcar) /

83
Accessing the Log (Cont.)
  • Many real interfaces allow some of
  • empty buffer to allow RM to fill it in (avoids
    data copies)
  • gather take log data from many buffers.
  • Few offer SQL access to the log.

84
Summary of Log Structure
B file
A file
durable
storage
log page header
Log pages
in buffer pool
empty page in
Pages written in next write
buffer pool
current end of log
end of
header
durable
Log Table
body
log
85
Log Anchor Logging and Locking
typedef struct filename tablename / name
of log table / struct log_files/ A
B file prefix names active file
/ xsemaphore lock / semaphore
regulates log write / LSN prev_lsn / LSN
of most recent write / LSN lsn
/ LSN of next record / LSN
durable_lsn / max lsn in durable
storage / LSN TM_anchor_lsn / lsn of
trans mgr's last ckpt / struct / array of
open log parts / long partno /
partition number / int
os_fnum / operating system file /
part MAXOPENS / /
log_anchor / /
  • Log records never updated only inserted and read
    - So no locks needed on log.
  • Semaphore (or something) needed on "end" of log
    to manage space/growth/LSN for inserts

86
Log Insert
  • Log semaphore covers
  • Incrementing LSN
  • Finding the log end
  • filling in the page(s)
  • allocating space on a page, perhaps allocating
    new pages.
  • Log write Daemons
  • Log Semaphore can be a hotspot so No IO under
    semaphore
  • Flush to persistent storage (disc) is done
    asynchronously.
  • Demons driven by timers and by events (requests)
  • Demons need not touch end-of-log semaphore

87
Group Commit (Boxcaring)
  • Batch processing of log writes.
  • Forcing log buffer each time requested may write
    many incomplete buffers
  • Most efficient to write large buffers to log
  • Must delay some log_flush() requests until buffer
    is full.
  • Example - Boxcaring
  • Modern SCSI disks have highest write bandwidth
    around 64KB block size
  • Transfers per second is lt 15ms. (10ms for seek
    and rotation)
  • Each block contains 325 log records of 200 bytes
    each.
  • TPC benchmark generates around four records per
    transaction.
  • This will allow 5416 TPS
  • If receive 1,000 log force requests/second why
    not just execute 50 of them?
  • Response time will be the same (20ms).
  • IOs will be 20x fewer
  • CPU will be 10x smaller (10x fewer dispatches,
    20x fewer OS IO).
  • Without it, systems are limited to about
  • 50tps no ping-pong
  • 100tps ping-pong.
  • With it, systems are limited to disc bandwidth
    gtgt10ktps.

88
Example - Synchronous write per flush
  • Perform log write every time a log_flush() is
    received.
  • Transfer small blocks (4 200 bytes) requires
    10ms.
  • Each block holds log records for one transaction.
  • This allows about 100 TPS

89
WADS- Giving the Log Disc Zero Latency
  • Log disc is dedicated, so only has rotational
    latency.
  • Reserve some cylinders on the disc as scratch.
  • For each write
  • Write at current position on next track (zero
    latency).
  • When have a full-track (or two) of log data
  • consolidate the write in ram
  • do a single LARGE write (100KB 1 rotation) to
    the log.
  • cost of this is seek rotation 20ms.
  • Reserved area is called Write Ahead Data Set
    (WADS).
  • At restart
  • read cylinders
  • gather recent log data
  • rewrite end of log.

90
Synchronization on B-Trees What Is the Problem?
  • B-Trees are fully redundant structures, which can
    be reconstructed from the tuples therefore, no
    synchronization should be required at all.
  • However, some queries operate on the index only.
    This requires all operations on B-trees to be
    serializable with the operations on the tuples.
  • Standard two-phase locking with the nodes as the
    objects is not feasible for performance reasons.

91
Protecting Tree Traversal
1. semaphore on Q
Node Q at level i
search path
2. follow search path
3. semaphore on R
Node R at level i1
4. release sem. on Q
92
B-Trees and Value Locks
B-tree
14 P1 P2 16 P3 17 P4 P5 P6 20 P7 22 P8 P9 P10
93
Making Lock Names
  • To implement value locking, we need to build lock
    names according to the following rule
  • LockN TableName, IndexName, KeyValue.
  • KeyValue in turn is a composite
  • KeyValue AttributeValue, TupleIdentifier.

94
Key Range Locking on B-Trees
  • Details
  • Allow for duplicate key values by modeling a
    tuple as ltc, kgt where c is the key value and k is
    a tuple pointer.
  • A retreival of c1, c2 retreives all tuples
    ltc1, kmingt, ltc2, kmaxgt.
  • Insertions and deletions apply to a single tuple
    ltc1, kigt.
  • retrieval(c1?? K ? c2) (Range query)
  • retrieval(k c) is special case of this -
    retrieval(c ?? K ? c)
  • 1. Get shared semaphore on first leaf page that
    contains the range.
  • 2. Get shared locks on all existing key values
    c c0 ? c ? c2 in sequence, where c0 is
    largest key such that c0 ? c1
  • A key lock on key value c1 protects range of keys
    from ltc1, kmingt, ltc2, kmingt) where c2 is next
    key value after c1.
  • 3. Lock tuple in shared mode.

95
Key Range Locking on B-Trees
  • 4. Use coupling technique when acquiring
    semaphores
  • 5. Locks are held until commit. Semaphores are
    released earlier.
  • insert(ltK, tidgt)
  • 1. Get exclusive semaphore on leaf page.
  • 2. Get an instant IX lock on c1 where c1 is
    largest key value less than or equal to K.
  • This ensures that no search has started using the
    rangeltc1, kmingt, ltc2, kmingt) which includes the
    key being inserted.
  • Multiple insertions into ltc1, kmingt, ltc2, kmingt)
    can proceed concurrently.
  • 3. If c1 ? K get an exclusive lock on K
  • This protects the new key range until the
    transaction commits.
  • 4. Get X Lock on tuple.

96
Key Range Locking on B-Trees
  • delete(ltK, tidgt)
  • 1. Get exclusive semaphore on leaf page.
  • 2. Let ltc1, kdgt be largest index entry ? ltK,
    tidgt.
  • If c1 K, get an IX lock on K.
  • This does not change the key values that exist in
    the relation. An ix lock is enough to discourage
    a search.
  • If c1 lt K, get an X lock on c1 and K.
  • In this case, the number of key ranges is
    reduced. The x lock on K forces searches for K to
    wait, and an X lock on c1 will force an insertion
    into this range to be delayed.
  • 3. Get X lock on tuple.
  • 4. Hold lock(s) until commit

97
B-Tree Recovery
98
B-Tree Recovery Based on Physiological Logging
  • Cover all B-tree operations with semaphores on
    all affected pages.
  • For each logical update a log record with the
    logical UNDO (only) operation must be moved to
    the log
  • While the update operation is being performed,
    physical REDO (only) log records are written.
  • After all REDO records are safely in the log, the
    exclusive semaphores can be released.

99
The Two Phases of B-Tree-Recovery
  • Phase1 Go forward through the log up to its
    current end, applying all REDO records to the
    tree.
  • Phase2 Go backward to the Begin of transaction
    record of the oldest incomplete transaction,
    executing the UNDO operations on the tree for all
    losers along the way.

100
Discussion
  • What happens if failure occurs before all REDO
    records are flushed.
  • Cannot restore Index to physically consistent
    state.
  • Can the same optimization be applied to the case
    when a hash bucket overflows?
Write a Comment
User Comments (0)
About PowerShow.com