
Dalí Main-memory Storage Manager
Tomasz Piech
Salvador Dalí - Persistence of Memory (1931)
  • Dalí
  • Implemented at Bell Laboratories
  • Storage manager for persistent data
  • Architecture optimized for databases resident in
    main memory
  • Application: real-time billing and control of
    multimedia content delivery
  • High transaction rates, low latency

  • Dalí Techniques
  • Direct access to data: direct pointers to
    information stored in the database, for high
    performance
  • No interprocess communication: communication
    with the server only during connection and
    disconnection; concurrency control and logging
    provided via shared memory
  • Fault-tolerant: advanced, multi-level
    transaction model; high-concurrency indexing and

  • Dalí
  • Recovery from process failure in addition to
    system failure
  • Use of codewords and memory protection to ensure
    integrity of data (discussed later)
  • Consistency of response time: a key requirement
    for applications with memory-resident data
  • Designed for databases that fit into main memory
    (virtual memory will work, but not as well)

Overview of Presentation
  • Architecture
  • Storage
  • Transaction Management
  • Fault Tolerance
  • Concurrency Control
  • Collections and Indexing
  • Higher Level Interfaces

  • Database files: user data; one or more exist in
  • System database files: database-support data,
    such as locks and logs
  • Files opened by a process are directly mapped
    into its address space
  • mmap'd files or shared-memory segments are used
    to provide the mapping
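
The mapping step above can be illustrated with a minimal sketch. It uses Python's standard mmap module as a stand-in for the C-level mmap call; the helper name map_database_file and the file name in the usage note are hypothetical, not part of Dalí.

```python
import mmap
import os


def map_database_file(path, size):
    """Map a database file into this process's address space.

    Hypothetical helper: creates the file if needed, grows it to `size`
    bytes, and returns a shared, writable mapping.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        if os.fstat(fd).st_size < size:
            os.ftruncate(fd, size)
        # ACCESS_WRITE gives a shared mapping: stores go to the file's
        # pages, so every mapping of the same file observes them.
        return mmap.mmap(fd, size, access=mmap.ACCESS_WRITE)
    finally:
        # The mapping remains valid after the descriptor is closed.
        os.close(fd)
```

Two calls such as map_database_file("db_file_0", 4096) yield two views of the same pages, which is the property that lets processes share data without going through a server.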

Layers of Abstraction
Dalí architecture is organized to support the
toolkit approach
Layers of Abstraction
  • Toolkit approach
  • Logging can be turned off for data which need not
    be persistent
  • Locking can be turned off if data is private to a
  • Multiple interface levels
  • Low-level components are exposed to user for

Pointers and Offsets
  • Each process has a database-offset table
  • Specifies where in memory a file is mapped
  • Implemented as an array indexed by file id
  • Primary Dalí pointer (p)
  • Database file local identifier plus offset within
    the file
  • To dereference, add the offset from p to the
    file's virtual memory address from the offset
    table
  • Secondary pointer
  • When an index lives in one file, store just the
    offset, since the location of the file is known

Storage Allocation
  • Motivation
  • Control data should be stored separately from
    user data
  • Protection of control data from stray pointers
  • Indirection should not exist at the lowest level
  • Indirection adds a level of latching for each
    data access and increases the path length of
    the dereference itself
  • Dalí exposes direct pointers to allocated data,
    providing time and space efficiency

Storage Allocation
  • Motivation
  • Large objects should be stored contiguously
  • Advantage is speed; recreating a file from
    smaller files takes away that advantage
  • Different recovery characteristics should be
    available for different regions of the database
  • Not all data needs to be recovered from a crash
  • Indexes can be rebuilt, etc.

Storage Allocation
  • Two levels of non-recovered data
  • Zeroed memory: remains allocated but is zeroed
  • Transient memory: data no longer allocated upon

Segments and Chunks
  • Segment
  • Contiguous, page-aligned unit of allocation,
    arbitrarily large; database files are composed
    of segments
  • Chunk
  • A collection of segments

Segments and Chunks
  • Allocators
  • Return standard Dalí pointers to allocated space
    within a chunk; indirection is not imposed at
    the storage manager level
  • No record of allocated space is retained
  • Three different allocators
  • Power-of-two: allocates buckets of size 2^i × m
  • Inline power-of-two: as above, but the free-space
    list uses the first few bytes of each free block

Segments and Chunks
  • Allocators (contd)
  • Coalescing allocator: merges adjacent free space;
    uses a free tree
  • Power-of-two and inline power-of-two are faster,
    but neither coalesces adjacent free space,
    leading to fragmentation (thus fixed-size
    records only)
  • Coalescing: uses a free tree based on the T-tree
    to keep track of free space; logarithmic time
    for allocation and freeing
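
The power-of-two scheme can be sketched as follows. This is an illustration under assumptions, not Dalí's allocator: it reads the bucket size as 2^i × m with m a minimum block size, hands out offsets within a single chunk, and keeps the free lists outside the data (the non-inline variant). Since no record of allocated space is retained, the caller passes the size back when freeing.

```python
class PowerOfTwoAllocator:
    """Sketch of a power-of-two allocator over one chunk (hypothetical)."""

    def __init__(self, chunk_size, min_size=8):
        self.chunk_size = chunk_size
        self.min_size = min_size  # the "m" in 2^i * m (assumed meaning)
        self.next_unused = 0      # bump pointer into never-used space
        self.free_lists = {}      # bucket index i -> list of free offsets

    def _bucket(self, size):
        """Smallest i such that 2^i * m >= size."""
        i = 0
        while (self.min_size << i) < size:
            i += 1
        return i

    def alloc(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        i = self._bucket(size)
        free = self.free_lists.get(i)
        if free:
            return free.pop()  # reuse a freed block of this bucket
        block = self.min_size << i
        if self.next_unused + block > self.chunk_size:
            raise MemoryError("chunk exhausted")
        offset = self.next_unused
        self.next_unused += block
        return offset

    def free(self, offset, size):
        # Blocks go back to their own bucket; neighbours are never merged,
        # which is why this scheme fragments with mixed sizes.
        self.free_lists.setdefault(self._bucket(size), []).append(offset)
```

A request is only ever satisfied from its own bucket, so the sketch shows both the speed of the scheme (constant-time list operations) and its weakness (free space in one bucket cannot serve another).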

Page Table and Segment Headers
  • Segment header: associates information about a
    segment/chunk with a physical pointer
  • Allocated when a segment is added to a chunk
  • Can store additional information about data in
    the segment
  • Page table: maps pages to segment headers
  • Pre-allocated based on the maximum number of
    pages in the database

Transaction Management
  • Recovery
  • System Overview
  • Checkpointing

Transaction Management in Dalí
  • Transaction: atomicity, isolation, durability in
  • Regions: logically organized data
  • A tuple, an object, or an arbitrary data
    structure (a tree or a list)
  • Region lock: X (exclusive) or S (shared) lock
    that guards access/updates to a region

Multi-Level Recovery
  • Permits use of weaker operation locks in place of
    X/S region locks
  • Example: index management
  • An update to the index structure (e.g., an insert)
  • Physical undo description must be valid until
    transaction commit
  • Result: an unacceptable level of concurrency

Multi-level Recovery
  • Replace low-level physical undo log records with
    higher-level logical undo log records
    (description at operation level)
  • Insert: a logical-undo record replaces the
    physical-undo record by specifying that the
    inserted key must be deleted
  • Region locks can be released while less
    restrictive operation locks persist, giving a
    higher level of concurrency

Multi-level Recovery
  • An example: find and insert
  • Releasing region locks would allow updates on the
    same region
  • Cascading aborts - rolling back the first
    operation would damage effects of later actions
  • Only a compensating undo operation can be used
    to undo the operation

Multi-level Recovery Example
System Overview
  • Stored on disk
  • Two checkpoint images: Ckpt_A and Ckpt_B
  • cur_ckpt: anchor to the most recent valid
    checkpoint image for the database
  • Single system log containing redo information,
    with its tail in memory
  • end_of_stable_log: pointer; all records prior to
    it have been flushed to the stable system log

System Overview
  • Stored in the system database with each
  • Active Transaction Table (ATT)
  • Stores separate redo and undo logs for each active
  • dpt: dirty page table; records pages updated
    since the last checkpoint
  • ckpt_dpt: the dpt stored in a checkpoint

Transactions and Operations
  • Transaction: a list of operations
  • Each operation has a level Li associated with it
  • An operation at level Li can consist of
    operations of level
  • L0 operations are physical updates to regions
  • Pre-commit: the commit record enters the system
    log in memory
  • Commit: the commit record hits stable storage

Logging Model
  • Updates generate physical undo and redo log
    records, appended to the Tx's undo and redo
    logs (in
  • When a Tx pre-commits, its redo records are
    appended to the system log, and the logical undo
    is included in the operation commit log in the
    system log
  • When an operation pre-commits, the undo log
    records for its sub-operations/updates are
    deleted from the Tx's undo log, and the
    operation's logical undo is appended to the
    Tx's undo log
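
The bookkeeping at operation pre-commit can be sketched as below. The record formats (plain tuples), the class name Transaction, and the method names are assumptions made for illustration; the sketch also assumes the redo records are moved to the in-memory system log at operation pre-commit, which is a simplification of the rules above.

```python
class Transaction:
    """Sketch of per-transaction undo/redo logs (hypothetical names)."""

    def __init__(self, system_log):
        self.system_log = system_log  # shared in-memory tail of the system log
        self.undo_log = []            # private to this transaction
        self.redo_log = []            # private to this transaction
        self._op_marks = []           # undo-log position where each open op began

    def begin_op(self):
        self._op_marks.append(len(self.undo_log))

    def update(self, region, old_value, new_value):
        # A physical update produces one undo and one redo record.
        self.undo_log.append(("physical-undo", region, old_value))
        self.redo_log.append(("physical-redo", region, new_value))

    def precommit_op(self, name, logical_undo):
        mark = self._op_marks.pop()
        # Redo records move to the system log, followed by the operation
        # commit record carrying the logical undo description.
        self.system_log.extend(self.redo_log)
        self.redo_log.clear()
        self.system_log.append(("op-commit", name, logical_undo))
        # The operation's physical undo records are replaced by a single
        # logical undo record in the transaction's undo log.
        del self.undo_log[mark:]
        self.undo_log.append(("logical-undo", name, logical_undo))
```

After precommit_op, the undo log no longer refers to specific regions, which is what allows the region locks to be released while the operation lock persists.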

Logging Model
  • Locks released once Tx/operation pre-commits
  • System log flushed to disk when Tx commits
  • Dirty pages are marked in the dpt by the
    flushing procedure; no page latching

Ping-pong Checkpointing
  • Traditionally, systems implement write-ahead
    logging (WAL) for recovery, but it is impossible
    to enforce WAL without latches
  • Latches increase access cost in main memory and
    interfere with normal processing
  • Solution: store two copies of the database image
    on disk; dirty pages written to alternate
  • Fuzzy checkpointing: no latches used, no
    interference with normal operations

Ping-pong Checkpointing
  • Checkpoints are allowed to be temporarily
    inconsistent: updates may be written out without
    their undo records
  • Redo and undo information from the ATT is written
    out to the checkpoint and brings it to a
    consistent state
  • If failure occurs, the other checkpoint is still
    consistent and can be used for recovery

Ping-pong Checkpointing
  • A log flush is necessary at the end of
    checkpointing, before toggling cur_ckpt: a
    commit might take place before the ATT is
    written out, leaving no undo information if the
    system crashes
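
The alternation between the two images can be sketched as follows. This is a simplified model under assumptions: pages are dictionary entries, the two images are in-memory dictionaries standing in for disk, writing out the ATT is omitted, and the names PingPongCheckpointer, prev_dpt, and flush_log are hypothetical. Each checkpoint writes the pages dirtied since the last checkpoint plus those written by the previous checkpoint, because the image being overwritten missed that round.

```python
class PingPongCheckpointer:
    """Sketch of ping-pong checkpointing with two images (hypothetical)."""

    def __init__(self, memory):
        self.memory = memory  # page id -> contents (the live database)
        self.images = {"Ckpt_A": dict(memory), "Ckpt_B": dict(memory)}
        self.cur_ckpt = "Ckpt_A"  # most recent valid image
        self.dpt = set()          # pages dirtied since the last checkpoint
        self.prev_dpt = set()     # pages written by the previous checkpoint

    def update(self, page, value):
        self.memory[page] = value
        self.dpt.add(page)  # no page latch is taken

    def checkpoint(self, flush_log):
        target = "Ckpt_B" if self.cur_ckpt == "Ckpt_A" else "Ckpt_A"
        # The target image is two checkpoints old, so it needs both sets.
        to_write = self.dpt | self.prev_dpt
        self.prev_dpt, self.dpt = self.dpt, set()
        for page in to_write:
            self.images[target][page] = self.memory[page]
        # (Writing the ATT's undo/redo information would happen here.)
        flush_log()             # the log must be stable before the toggle
        self.cur_ckpt = target  # only now does the new image become valid
```

If a crash interrupts checkpoint before the final assignment, cur_ckpt still names the untouched image, which is the property the scheme relies on.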

Abort Processing
  • Upon abort, undo log records are undone by
    traversing the undo log sequentially from the
    end
  • A new physical-redo log record is created for
    every physical-undo record encountered
  • Similarly, for each logical-undo record a
    compensation operation (a proxy) is executed
  • All undo log records are deleted when the proxy
    commits

Abort Processing
  • The commit record for a proxy is similar to
    compensation log records (CLRs) in ARIES
  • During recovery, a logical-undo log record is
    deleted from the Tx's undo log if a CLR is
    encountered, preventing the Tx from being undone
    again

  • end_of_stable_log is where recovery begins
  • Initializes the ATT and undo logs with copies
    from the last checkpoint
  • Loads the database image and sets dpt to zero
  • Applies all redo log records following
    begin-recovery-point
  • Then all active transactions are rolled back
  • First all completed L0 operations must be rolled
    back, then L1, then L2, and so on

Post-commit Operations
  • Operations guaranteed to be carried out after
    commit of a transaction/operation even if the
    system crashes
  • Some operations cannot be rolled back once
    performed (deletion then allocation of same space
    to different operation)
  • Need to ensure high concurrency on the storage
    allocator: cannot hold locks
  • Solution: perform these operations after the
    transaction commits (keep a post-commit log)

Fault Tolerance
  • Process Death and Its Detection

Fault Tolerance
  • Techniques that help cope with process failure

Process Death
  • Caused by an attempt to access invalid memory, or
    by an operator kill
  • Must return partially updated shared data to a
    consistent state
  • Abort any uncommitted transactions owned by that
  • Cleanup server is primarily responsible for
    cleaning up dead processes

Process Death
  • Active Process Table (APT) keeps track of all
    processes in the system; scanned periodically to
    check if any are dead
  • Low-level cleanup
  • A process registers with the APT any latch it
    acquires
  • If a latch is held by a dead process, the cleanup
    function for that latch is called
  • If it is not possible to clean up the latch, a
    system crash is simulated

Process Death
  • Cleaning up Transactions
  • A cleanup agent scans the Tx table and aborts any
    Tx running on behalf of the dead process, or
    executes post-commit actions for a committed Tx
  • Multiple cleanup agents are spawned if multiple
    processes have died

Protection from Application Errors
  • Memory protection
  • munprotect is called right before an update to a
    page, and mprotect after the Tx commits, to
    protect
  • Codewords
  • Associate a logical parity word with each page of
  • Erroneous writes will update only the physical
    data, not the codeword; a crash is simulated if
    an error is found
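
The codeword idea can be sketched with a simple XOR parity over the 32-bit words of a page. The XOR scheme, the word size, and the names compute_codeword, ProtectedPage, write_word, and audit are assumptions chosen for illustration; the slides only say that a logical parity word is associated with each page.

```python
import struct

WORD = 4  # bytes per word (assumed)


def compute_codeword(page):
    """XOR of all 32-bit little-endian words in the page."""
    codeword = 0
    for (word,) in struct.iter_unpack("<I", page):
        codeword ^= word
    return codeword


class ProtectedPage:
    def __init__(self, size=64):
        self.data = bytearray(size)  # size must be a multiple of WORD
        self.codeword = 0            # parity of an all-zero page

    def write_word(self, index, value):
        """Legitimate update: data and codeword change together."""
        old = struct.unpack_from("<I", self.data, index * WORD)[0]
        struct.pack_into("<I", self.data, index * WORD, value)
        self.codeword ^= old ^ value

    def audit(self):
        """True if the stored codeword still matches the page contents."""
        return compute_codeword(self.data) == self.codeword
```

A write that bypasses write_word, such as a stray pointer store into data, changes the page but not the codeword, so the next audit fails and the system can treat it as a crash.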

Concurrency Control
  • Implementation of Latches

Concurrency Control
  • Concurrency control facilities
  • Latches (low-level locks for mutual exclusion)
  • Queuing locks
  • Latch Implementation
  • Semaphores: too expensive (system-call overhead)
  • Implementation must complement cleanup server

Latch Implementation
  • Processes that wish to acquire a latch keep a
    pointer to that latch in their wants field
  • The cleanup-in-progress flag, when set to True,
    forbids processes from attempting to get a latch
  • The cleanup server waits for processes to set
    their wants fields to null or to another latch,
    or to die
  • If a dead process is a registered owner of the
    latch, the cleanup function is called

Locking System
  • Lock header structure
  • Stores a pointer to a list of locks that have
    been requested (but not released) by transactions
  • Request times out if not granted in a certain
    amount of time
  • New lock modes can be added with the use of
    conflicts and covers relations
  • Covers: the holder of lock A checks for conflicts
    when requesting a new lock of type B, unless A
    covers B
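
A small sketch of how conflicts and covers might be consulted. The tables below contain only the familiar S and X modes and are an assumption for illustration; the function names are hypothetical. A new lock mode would be added by extending the two tables.

```python
# (requested, held_by_other) pairs that cannot coexist.
CONFLICTS = {("S", "X"), ("X", "S"), ("X", "X")}

# (held, requested) pairs where the held mode already implies the request.
COVERS = {("S", "S"), ("X", "S"), ("X", "X")}


def needs_conflict_check(modes_held_by_requester, requested):
    """The check is skipped when a mode already held covers the request."""
    return not any((held, requested) in COVERS
                   for held in modes_held_by_requester)


def can_grant(requested, modes_held_by_others):
    """True if the request conflicts with no lock held by others."""
    return not any((requested, other) in CONFLICTS
                   for other in modes_held_by_others)


def request(modes_held_by_requester, requested, modes_held_by_others):
    if not needs_conflict_check(modes_held_by_requester, requested):
        return True  # covered: grant without looking at other holders
    return can_grant(requested, modes_held_by_others)
```

Keeping the two relations as data rather than code is what makes the set of lock modes extensible.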

Collections and Indexing
  • Heap Files
  • Extendible Hashing

Collections and Indexing
  • Dalí provides higher-level interfaces for
    grouping related data items and for performing
    scans and associative access on items in a group
  • Heap file
  • Abstraction for handling a large number of
    fixed-length data items
  • Scans are supported through bitmaps in segment
  • Entries deleted from the heap are 0 in the bitmap
  • The bitmap mirrors the allocator's free list

Collections and Indexing
  • Extendible hashing
  • Similar to what was covered in CS 432
  • A utilization factor determines when to double
    the directory; more tolerant than a
    bucket-overflow trigger, and avoids
    space/utilization problems

Extendible Hashing
T-tree indexes
  • Briefly: internal nodes, semi-leaf nodes, and
    leaf nodes
  • To search for a value, at each node check whether
    the key is bounded by the node's leftmost and
    rightmost key values. If so, the value is
    returned if it is contained in the node;
    otherwise (the key is not bounded), traverse the
    tree further down
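
The search rule can be sketched as follows. The node layout (a sorted list of keys plus left and right children) and the names TNode and ttree_search are assumptions for illustration; balancing and the distinction between internal, semi-leaf, and leaf nodes are left out.

```python
import bisect


class TNode:
    def __init__(self, keys, left=None, right=None):
        self.keys = sorted(keys)  # several keys per node, kept in order
        self.left = left          # subtree with keys below keys[0]
        self.right = right        # subtree with keys above keys[-1]


def ttree_search(node, key):
    """Return True if `key` is stored in the tree rooted at `node`."""
    while node is not None:
        if key < node.keys[0]:
            node = node.left      # below this node's range: go left
        elif key > node.keys[-1]:
            node = node.right     # above this node's range: go right
        else:
            # The node bounds the key, so the answer is decided here.
            i = bisect.bisect_left(node.keys, key)
            return node.keys[i] == key
    return False
```

Only the two boundary keys are compared on the way down; the binary search inside a node happens once, at the bounding node.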

Higher Level Interfaces
  • Two database management systems built on Dalí
  • Dalí Relational Manager
  • Main Memory ODE: object-oriented database