Chapter 6 DFS Design and Implementation - PowerPoint PPT Presentation

Provided by: ccNct
1
Chapter 6 DFS Design and Implementation
2
Outline
  • Characteristics of a DFS
  • DFS Design and Implementation
  • Transaction and Concurrency Control
  • Data and File Replication

3
Overview
  • A file system is responsible for the naming,
    creation, deletion, retrieval, modification, and
    protection of files
  • DFS (distributed file system): a file system
    consisting of physically dispersed storage sites
    but providing a traditional centralized file
    system view for users
  • Transparency
  • Directory service (name service)
  • Performance and availability → caching and
    replication → cache coherence and replica
    management
  • Access control and protection
  • The unique problems in DFS are due to the need
    for sharing and replication of files

4
Characteristics of A DFS (I)
  • Dispersed clients
  • Login transparency: uniform login procedure,
    uniform view of FS
  • Access transparency: uniform mechanism to access
    local and remote files
  • Dispersed files
  • Location transparency: file names need not
    contain information about the physical locations
    of files
  • Location independence (migration transparency):
    files can be moved from one physical location to
    another without changing their names

5
Characteristics of A DFS (II)
  • Multiplicity of users
  • Concurrency transparency: an update to a file by
    one process should not interfere with other
    processes sharing this file
  • Support concurrency transparency at the
    transaction level
  • Interleaved file accesses by several applications
  • Multiplicity of files
  • Replication transparency: clients are not aware
    that more than one copy of the file exists
  • Perform atomic updates on the replicas
  • Fault tolerance, scalability, heterogeneity

6
DFS Design and Implementation
7
File Concept
  • The OS abstracts from the physical storage devices
    to define a logical storage unit: the file
  • Types
  • Data: numeric, alphabetic, alphanumeric, binary
  • Program: source and object form

8
Logical components of a file
  • File name: symbolic name
  • When accessing a file, its symbolic name is
    mapped to a unique file id (ufid or file handle)
    that can locate the physical file
  • Mapping is the primary function of the directory
    service
  • File attributes: see next slide
  • Data units
  • Flat structure: a stream of bytes or a sequence
    of blocks
  • Hierarchical structure: indexed records

9
File Attributes
  • File handle: unique ID of the file
  • Name: the only information kept in human-readable
    form
  • Type: needed for systems that support different
    types
  • Location: pointer to the file location on the device
  • Size: current file size (and the maximum allowable
    size)
  • Protection: controls who can read, write,
    execute
  • Time, date, and user identification: data for
    protection, security, and usage monitoring
  • Information about files is kept in the directory
    structure, which is maintained on the physical
    storage device

10
Access Methods
  • Sequential access: information is processed in
    order
  • read next
  • write next (append to the end of the file)
  • reset to the beginning of file
  • skip forward or backward n records
  • Direct access: a file is made up of fixed-length
    logical blocks or records
  • read n
  • write n
  • position to n
  • read next
  • write next
  • rewrite n
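
The sequential and direct access operations listed above can be sketched over a file of fixed-length records. This is only an illustration: the class name RecordFile, the 8-byte record size, and the space padding are assumptions, not part of the slides.

```python
import io

RECORD_SIZE = 8  # hypothetical fixed record length in bytes


class RecordFile:
    """Fixed-length records over a byte stream (illustrative sketch)."""

    def __init__(self, stream):
        self.stream = stream

    def _pad(self, record):
        # Pad or cut the record to exactly RECORD_SIZE bytes.
        return record.ljust(RECORD_SIZE, b" ")[:RECORD_SIZE]

    # Sequential access
    def reset(self):
        self.stream.seek(0)

    def read_next(self):
        data = self.stream.read(RECORD_SIZE)
        return data if len(data) == RECORD_SIZE else None

    def write_next(self, record):
        # "write next" appends to the end of the file.
        self.stream.seek(0, io.SEEK_END)
        self.stream.write(self._pad(record))

    # Direct access
    def position_to(self, n):
        self.stream.seek(n * RECORD_SIZE)

    def read(self, n):
        self.position_to(n)
        return self.read_next()

    def write(self, n, record):
        self.position_to(n)
        self.stream.write(self._pad(record))


def demo_access():
    f = RecordFile(io.BytesIO())
    for name in (b"alpha", b"beta", b"gamma"):
        f.write_next(name)
    f.reset()
    first = f.read_next()   # sequential: record 0
    third = f.read(2)       # direct: record 2
    f.write(1, b"BETA")     # rewrite record 1 in place
    second = f.read(1)
    return first, second, third
```

Direct access works here only because every record has the same length, so record n always starts at byte n * RECORD_SIZE.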

11
Access Methods (Cont.)
  • Indexed sequential access
  • Data units are addressed directly by using an
    index (key) associated with each data block
  • Requires the maintenance of a search index on
    the file, which must be searched to locate a
    block address for each access
  • Usually used only by large file systems in
    mainframe computers
  • Indexed sequential access method (ISAM)
  • A two-level scheme to reduce the size of the
    search index
  • Combine the direct and sequential access methods

12
Major Components in A File System
A file system organizes and provides access and
protection services for a collection of files
13
Directory Structure
  • Access to a file must first use a directory
    service to locate the file.
  • A collection of nodes containing information
    about all files.
  • Both the directory structure and the files reside
    on disk.

[Figure: a directory whose entries point to the files F1, F2, F3, F4, ..., Fn]
14
Information in a Directory
  • Name
  • Type: file, directory, symbolic link, special
    file
  • Address: device blocks that store the file
  • Current length
  • Maximum length
  • Date last accessed (for archival)
  • Date last updated (for dump)
  • Owner ID
  • Protection information

15
Operations Performed on Directory
  • Search for a file
  • Create a file
  • Delete a file
  • List a directory
  • Rename a file
  • Traverse the file system

Some kind of name service
16
Tree-Structured Directories: Hierarchical
Structure of a File System
A subdirectory is just a special type of file
17
Authorization Service
  • File access must be regulated to ensure security
  • File owner/creator should be able to control
  • what can be done
  • by whom
  • Types of access
  • Read
  • Write
  • Execute
  • Append
  • Delete
  • List

18
File Service: File Operations
  • Create
  • Allocate space
  • Make an entry in the directory
  • Write
  • Search the directory
  • Write is to take place at the location of the
    write pointer
  • Read
  • Search the directory
  • Read is to take place at the location of the read
    pointer
  • Reposition within file: file seek
  • Set the current file pointer to a given value
  • Delete
  • Search the directory
  • Release all file space
  • Truncate
  • Reset the file to length zero
  • Open(Fi)
  • Search the directory structure
  • Move the content of the directory entry to memory
  • Close(Fi)
  • Move the content in memory to the directory
    structure on disk
  • Get/set file attributes

19
System Service
  • Directory, authorization, and file services are
    user interfaces to a file system (FS)
  • System services are an FS's interface to the
    hardware and are transparent to users of the FS
  • Mapping of logical to physical block addresses
  • Interfacing to services at the device level for
    file space allocation/de-allocation
  • Actual read/write file operations
  • Caching for performance enhancement
  • Replicating for reliability improvement

20
DFS Architecture: NFS Example
21
File Mounting
  • A useful concept for constructing a large file
    system from various file servers and storage
    devices
  • Attach a remote named file system to the client's
    file system hierarchy at the position pointed to
    by a path name (mounting point)
  • A mounting point is usually a leaf of the
    directory tree that contains only an empty
    subdirectory
  • mount claven.lib.nctu.edu.tw/OS /chow/book
  • Once files are mounted, they are accessed by
    using the concatenated logical path names without
    referencing either the remote hosts or local
    devices
  • Location transparency
  • The linked information (mount table) is kept
    until the file systems are unmounted
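
A mount table lookup can be sketched as a longest-prefix match on the logical path name. The table layout and the function name resolve are assumptions made for illustration; the single entry mirrors the mount example above.

```python
# Hypothetical mount table: local mount point -> (remote host, exported path).
MOUNT_TABLE = {
    "/chow/book": ("claven.lib.nctu.edu.tw", "/OS"),
}


def resolve(path, mount_table=MOUNT_TABLE):
    """Map a local logical path to (host, path); host is None for local files."""
    best = None
    for mount_point in mount_table:
        if path == mount_point or path.startswith(mount_point + "/"):
            # Prefer the longest matching mount point.
            if best is None or len(mount_point) > len(best):
                best = mount_point
    if best is None:
        return (None, path)
    host, exported = mount_table[best]
    # Replace the mount point prefix by the exported remote path.
    return (host, exported + path[len(best):])
```

With this table, resolve("/chow/book/DSM") yields the remote name /OS/DSM on the server, while the client only ever uses the concatenated local path.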

22
File Mounting Example
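[Figure: local client tree (root, chow, with paper and book) and remote server tree (root, OS, with DFS and DSM); the server exports OS and the client mounts it at book; example remote path /OS/DSM]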
23
File Mounting (Cont.)
  • Different clients may perceive a different FS
    view
  • To achieve a global FS view: SA enforces
    mounting rules
  • Export: a file server restricts/allows the
    mounting of all or parts of its file system to a
    predefined set of hosts
  • The information is kept in the server's export
    file
  • File system mounting
  • Explicit mounting: clients make explicit mounting
    system calls whenever one is desired
  • Boot mounting: a set of file servers is
    prescribed and all mountings are performed at the
    client's boot time
  • Auto-mounting: mounting of the servers is
    implicitly done on demand when a file is first
    opened by a client

24
Location Transparency
No global naming
25
A Simple Automounter for NFS
26
Server Registration
  • The mounting protocol is not transparent: it
    requires knowledge of the location of file servers
  • When multiple file servers can provide the same
    file service, the location information becomes
    irrelevant to the clients
  • Server registration → name/address resolution
  • File servers register their services with a
    registration service, and clients consult
    the registration server before mounting
  • Clients broadcast mounting requests, and file
    servers respond to clients' requests

27
Stateful and Stateless File Servers
  • Stateless file server: when a client sends a
    request to a server, the server carries out the
    request, sends the reply, and then removes from
    its internal tables all information about the
    request
  • Between requests, no client-specific information
    is kept on the server
  • Each request must be self-contained: full file
    name and offset
  • Stateful file server: file servers maintain
    state information about clients between requests
  • State information may be kept in servers or
    clients
  • Opened files and their clients
  • File descriptors and file handles
  • Current file position pointers
  • Mounting information
  • Lock status
  • Session keys
  • Cache or buffer

Session: a connection for a sequence of requests
and responses between a client and the file server
28
A Comparison between Stateless and Stateful
Servers
29
Issues of A Stateless File Server
  • Idempotency requirement
  • Is it practical to structure all file accesses as
    idempotent operations?
  • File locking mechanism
  • Should the locking mechanism be integrated into
    the transaction service?
  • Session key management
  • Can a one-time session key be used for each file
    access?
  • Cache consistency
  • Is the file server responsible for controlling
    cache consistency among clients?
  • What sharing semantics are to be supported?
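
The idempotency requirement can be illustrated by contrasting a self-contained read (file name, offset, count) with a "read next" that depends on a position kept by the server. Both classes below are hypothetical sketches, not NFS code.

```python
class StatelessServer:
    """Every request carries the full file name and offset; nothing is remembered."""

    def __init__(self, files):
        self.files = files  # file name -> bytes

    def read(self, name, offset, count):
        return self.files[name][offset:offset + count]


class StatefulServer:
    """The server keeps a current position for each open file handle."""

    def __init__(self, files):
        self.files = files
        self.positions = {}  # handle -> (file name, current position)
        self.next_handle = 0

    def open(self, name):
        handle = self.next_handle
        self.next_handle += 1
        self.positions[handle] = (name, 0)
        return handle

    def read_next(self, handle, count):
        name, pos = self.positions[handle]
        data = self.files[name][pos:pos + count]
        # The server-side pointer advances, so a retransmitted request
        # would return different data.
        self.positions[handle] = (name, pos + len(data))
        return data
```

Repeating the stateless read after a lost reply returns the same bytes, which is what makes it safe to retry. Repeating read_next does not, and its state is lost if the server crashes.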

30
File Sharing
  • Overlapping access: multiple copies of the same
    file
  • Space multiplexing of the file
  • Cache or replication
  • Coherency control: managing accesses to the
    replicas, to provide a coherent view of the
    shared file
  • Desirable to guarantee the atomicity of updates
    (to all copies)
  • Interleaving access: multiple granularities of
    data access operations
  • Time multiplexing of the file
  • Simple read/write, Transaction, Session
  • Concurrency control: how to prevent one execution
    sequence from interfering with the others when
    they are interleaved and how to avoid
    inconsistent or erroneous results

31
Space Multiplexing
  • Remote access: no file data is kept in the client
    machine. Each access request is transmitted
    directly to the remote file server through the
    underlying network.
  • Cache access: a small part of the file data is
    maintained in a local cache. A write operation or
    cache miss results in a remote access and an
    update of the cache
  • Download/upload access: the entire file is
    downloaded for local accesses. A remote access or
    upload is performed when updating the remote file

32
Remote Access VS Download/Upload Access
[Figure: remote access compared with download/upload access]
33
Four Places for Caching
[Figure: the four places are the client's main memory, the client's disk (optional), the server's main memory, and the server's disk]
34
Coherency of Replicated Data
  • Four interpretations
  • All replicas are identical at all times
  • Impossible in distributed systems
  • Replicas are perceived as identical only at some
    points in time
  • How to determine the good synchronization points?
  • Users always read the most recent data in the
    replicas
  • How to define most recent?
  • Based on the completion times of write
    operations (the effect of a write operation has
    been reflected in all copies)
  • Write operations are always performed
    immediately and their results are propagated in
    a best-effort fashion
  • Coarse attempt to approximate the third definition

35
Time Multiplexing
  • Simple RW: each read/write operation is an
    independent request/response access to the file
    server
  • Transaction RW: a sequence of read and write
    operations is treated as a fundamental unit of
    file access (to the same file)
  • ACID properties
  • Session RW: a sequence of transaction and simple
    RW operations

36
Space and Time Concurrencies of File Access
37
Semantics of File Sharing
  • On a single processor, when a read follows a
    write, the value returned by the read is the
    value just written (Unix Semantics).
  • In a distributed system with caching, obsolete
    values may be returned.

Solution to coherency and concurrency control
problems depends on the semantics of sharing
required by applications
38
Semantics of File Sharing (Cont.)
39
Version Control
  • Version control under immutable files
  • Implemented as a function of the directory
    service
  • Each file is attached with a version number
  • An open to a file always returns the current
    version
  • Subsequent read/write operations on the opened
    file are made only to the local working copy
  • When the file is closed, the local modified
    version (tentative version) is presented to the
    version control service
  • If the tentative version is based on the current
    version, the update is committed and the
    tentative version becomes the current version
    with a new version number
  • What if the tentative version is based on an
    older version?

40
Version Control (Cont.)
  • Action to be taken if based on an older version
  • Ignore conflict: a new version is created
    regardless of what has happened (equivalent to
    session semantics)
  • Resolve version conflict: the modified data in
    the tentative version are disjoint from those in
    the new current version
  • Merge the updates in the tentative version with
    the current version to yield a new version
    that combines all updates
  • Resolve serializability conflict: the modified
    data in the tentative version were already
    modified by the new current version
  • Abort the tentative version and roll back the
    execution of the client with the new current
    version as its working version
  • The concurrent updates are serialized in some
    arbitrary order
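
The three actions above can be sketched with a toy version control service that remembers which blocks each committed version modified. The class VersionService, its block-level granularity, and the return strings are assumptions for illustration only.

```python
class VersionService:
    """Toy version control service for one file (illustrative sketch)."""

    def __init__(self, data):
        self.version = 0
        self.data = dict(data)   # block number -> contents
        self.changed = {}        # version number -> set of blocks it modified

    def open(self):
        # An open always returns the current version and a working copy.
        return self.version, dict(self.data)

    def close(self, base_version, updates, ignore_conflict=False):
        """updates: {block: new contents} modified in the tentative version."""
        # Blocks modified by versions committed after the one that was opened.
        newer = set()
        for v in range(base_version + 1, self.version + 1):
            newer |= self.changed[v]
        if newer & set(updates) and not ignore_conflict:
            # Serializability conflict: abort; the client restarts
            # from the new current version.
            return "aborted"
        # Based on the current version, disjoint updates (merge),
        # or conflict ignored: create a new version.
        self.data.update(updates)
        self.version += 1
        self.changed[self.version] = set(updates)
        return "committed"
```

Two clients that opened version 0 and modified different blocks both commit (the second is merged), while a third client that touches an already modified block is aborted unless conflicts are ignored.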

41
Transaction and Concurrency Control
  • Apply the idea of transaction to distributed file
    system management

42
The Transaction Model
  • Transaction: a fundamental unit of interaction
    between processes (all-or-nothing)
  • Updating a master tape
  • Withdraw money from one account and deposit it in
    another

43
The Transaction Model (Cont.)
  • Examples of primitives for transactions
  • Support by underlying distributed OS or language
    runtime system

44
The Transaction Model (Cont.)
  • Transaction to reserve three flights commits
  • Transaction aborts when third flight is
    unavailable

45
The ACID Properties
  • Atomicity (indivisibility)
  • Either all of the operations in a transaction are
    performed or none of them are, in spite of
    failures
  • Consistency (serializability) (does not violate
    system invariants)
  • The execution of interleaved transactions is
    equivalent to a serial execution of the
    transactions in some order
  • Isolation (no interference between transactions)
  • Partial results of an incomplete transaction are
    not visible to others before the transaction is
    successfully committed
  • Durability
  • The system guarantees that the results of a
    committed transaction will be made permanent even
    if a failure occurs after the commitment

46
Nested and Distributed Transactions
Durability applies only to the top-level transaction
47
Implementation: Private Workspace
During transactions, reads/writes go to a private
workspace
Performance issue
  • The file index and disk blocks for a three-block
    file
  • The situation after a transaction has modified
    block 0 and appended block 3
  • After committing

48
Implementation: Write-ahead Log
  • Files are actually modified in place, but before
    any block is changed, a record is written to a
    log telling which transaction is making the
    change, which file and block is being changed,
    and what the old and new values are
  • Only after the log has been written successfully
    is the change made to the file
  • The write-ahead log can be used for undo
    (rollback) and redo
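
A minimal in-memory sketch of the write-ahead rule follows. It models a single file as a list of blocks, so the log record omits the file name; the class LoggedFile and its method names are hypothetical.

```python
class LoggedFile:
    """One file modified in place, with a write-ahead log (illustrative sketch)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        # Log records: (transaction id, block number, old value, new value).
        self.log = []

    def write(self, tid, block, new_value):
        # The log record is written first; only then is the block changed.
        self.log.append((tid, block, self.blocks[block], new_value))
        self.blocks[block] = new_value

    def undo(self, tid):
        # Rollback: restore old values, newest log record first.
        for t, block, old, _new in reversed(self.log):
            if t == tid:
                self.blocks[block] = old

    def redo(self, tid):
        # Reapply new values in log order, oldest record first.
        for t, block, _old, new in self.log:
            if t == tid:
                self.blocks[block] = new
```

Undo must scan the log backwards so that a block written twice ends up with its original value. A real log would live on stable storage rather than in a Python list.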

49
Transaction Processing System
Execution Phase and Commit Phase
(Transaction ID private workspace)
Satisfy the ACID property
50
Execution Phase and Commit Phase
Failures and recovery actions for the 2PC protocol
51
Transaction Processing System (Cont.)
  • Layered approach for transaction management
  • Data (object) manager: performs actual read/write
    on data
  • Knows nothing about transactions
  • Atomic update. Cache and replica management
  • Interface to the FS
  • Scheduler: responsible for properly controlling
    concurrency
  • Determines which transaction is allowed to pass
    read/write to DM and at which time
  • Concurrency control protocol (serializable)
  • Enforce isolation and consistency: avoid
    conflicts
  • Transaction manager: guarantees atomicity of
    transactions
  • All-or-none: two-phase commit
  • Maintain write-ahead log and private workspace
    for each transaction

52
Distributed Transaction Processing System
  • General organization of managers for handling
    distributed transactions.

53
Distributed Transaction Processing System (Cont.)
Another view of the previous slide
  • A transaction may invoke operations on remote
    objects (files)
  • The machine that initiates a transaction →
    coordinator
  • The machine on which a remote object is located →
    participant
  • Two-phase commit between coordinator and
    participant

54
Distributed Transaction Processing System (Cont.)
  • A transaction manager serves as the coordinator,
    with the remote transaction managers being the
    participants in the two-phase commit protocol
  • Apply two-phase commit protocol to atomic update
    of replicated objects
  • The object (data) manager where an update is
    requested initiates the two-phase commit protocol
    in conjunction with other object managers that
    are holding a replica of the object
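
The coordinator's side of two-phase commit can be sketched as a vote collection followed by a uniform decision. The Participant class and the method names prepare, commit, and abort are assumptions; failures and timeouts are deliberately not modeled.

```python
class Participant:
    """Hypothetical participant (a remote TM or a replica's object manager)."""

    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "working"

    def prepare(self):
        # Phase 1 vote: promise to commit if asked, or refuse.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run by the coordinator; returns True if the transaction commits."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: send the same decision to every participant.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

A single refusal aborts the transaction at all participants, which gives the all-or-none behavior for both distributed transactions and replica updates.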

55
Concurrency Control: Scheduler's Responsibility
  • The whole idea behind concurrency control
  • Properly schedule conflicting operations to
    ensure consistency
  • Conflicting operations: operations on the same
    data item where at least one of them is a write
    operation
  • Need to acquire/release locks before/after using
    the data items
  • Read-write conflict and Write-write conflict
  • How to discover and handle inconsistency
  • Prevent inconsistency: two-phase locking
  • All access requests are constrained in a certain
    format such that interference among conflicts can
    be prevented
  • TM transforms clients' transactions into the
    restrictive form
  • Use locks → scheduler assumes the lock management
    functions
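
The definition of conflicting operations can be written as a small predicate. The tuple encoding (transaction id, "r" or "w", data item) is an assumption, as is the restriction to operations from different transactions.

```python
def conflicts(op_a, op_b):
    """op = (transaction id, 'r' or 'w', data item)."""
    (tid_a, kind_a, item_a), (tid_b, kind_b, item_b) = op_a, op_b
    # Same data item, different transactions, at least one write.
    return tid_a != tid_b and item_a == item_b and "w" in (kind_a, kind_b)


def conflicting_pairs(schedule):
    """All ordered pairs (earlier, later) of conflicting operations."""
    return [
        (a, b)
        for i, a in enumerate(schedule)
        for b in schedule[i + 1:]
        if conflicts(a, b)
    ]
```

Two reads of the same item never conflict, so only read-write and write-write pairs are reported.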

56
Concurrency Control (Cont.)
  • How to discover and handle inconsistency (Cont.)
  • Avoid inconsistency: timestamp ordering
  • Each individual access operation is checked by
    the scheduler and a decision of whether the
    operation should be accepted, tentatively
    accepted, or rejected to avoid conflicts is made
    by the scheduler
  • Schedulers perform the scheduling of operations
    based on timestamp ordering
  • Validate consistency: optimistic concurrency
    control protocols
  • Conflicts are completely ignored during the
    execution phase of a transaction. The consistency
    is validated at the end of the execution phase.
    Only transactions that can be globally validated
    are allowed to commit. Schedulers act as
    validation managers.

57
Serializability
  • Concurrency control is based on the concept of
    serializability
  • Schedule: the execution order of operations
  • Legal schedule: a schedule that observes the
    internal ordering of operations for each
    transaction and in which no transactions hold
    conflicting locks simultaneously (for locking
    algorithm)
  • Not all legal schedules yield consistent results
    or even complete
  • Serial schedule: a special legal schedule that is
    formed by a strict sequential execution of the
    transactions in some order
  • Each transaction satisfies ACID
  • Ensure the consistency requirement
  • A schedule is serializable if the result of its
    execution is equivalent to that of a serial
    schedule (This is what we want)

58
Serializability Example
Already committed
(1,3) and (2,4) have write-write conflicts
59
Interleaving Schedules
60
Serializability (Cont.)
  • Updates are made permanent only if the execution
    of the transactions satisfies the serializability
    requirement and is successfully committed
  • Sufficient condition for serializability
  • If the interleaved execution of transactions is
    to be equivalent to a serial execution in some
    order, then all conflicting operations in the
    interleaved serializable schedule must also be
    executed in the same order at all object sites
  • Chapter 12 presents the serialization graph model
    to address general serialization problems

61
Two-Phase Locking
  • Using the locking approach, all shared objects in
    a well-formed transaction must be locked before
    they can be accessed and must be released before
    the end of the transaction
  • Two-phase locking
  • A new lock cannot be acquired after the first
    release of a lock
  • Phase 1 growing phase of locking the objects
  • Phase 2 shrinking phase of releasing the objects
  • Extreme two-phase locking
  • Get all locks at the beginning of the transaction
    and release all locks at the same time as the end
    of the transaction
  • Example: Table 6.1
  • 1, 2 are feasible
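
The two-phase rule (no new lock after the first release) can be checked on the lock and unlock sequence of a single transaction. The encoding of operations as ("lock" or "unlock", item) pairs is an assumption for illustration.

```python
def is_two_phase(ops):
    """ops: list of ('lock' | 'unlock', item) issued by one transaction."""
    shrinking = False
    for action, _item in ops:
        if action == "unlock":
            # The first release starts the shrinking phase.
            shrinking = True
        elif action == "lock" and shrinking:
            # A lock acquired after the first release breaks the rule.
            return False
    return True
```

The extreme form described above, where all locks are taken at the start and released together at the end, trivially passes this check.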

62
Two-Phase Locking (Cont.)
Deadlock may happen (e.g., reverse operations 3 and 4
in t2 of Table 6.1)
Scheduler is responsible for granting and
releasing locks in such a way that only valid
schedules result (resolving operation conflicts)
63
Two-Phase Locking (Cont.)
  • Strict two-phase locking.

Deadlock may happen
64
Two-Phase Locking (Cont.)
  • Strict 2PL: only release locks at commit/abort
  • Sacrifices some concurrency but is easy to
    implement
  • Non-strict 2PL: difficult to implement
  • TM does not know when the last lock has been
    requested
  • May cause rolling aborts
  • Transaction T1 updates X, then releases X
  • Transaction T2 reads X → the new X value is read
  • T1 aborts → T2 must abort as well
  • T2 has a commit dependence on T1
  • Commit of T2 must be delayed until the commit of
    T1

65
Two-Phase Locking (Cont.)
  • Two-phase locking and strict two-phase locking
    → deadlock
  • Two-phase locking in a distributed system
  • Centralized 2PL
  • A single site is responsible for granting/releasing
    locks
  • Primary 2PL
  • Each data item is assigned a primary copy. The
    scheduler on the copy's machine is responsible
    for granting/releasing locks
  • Distributed 2PL
  • Data may be replicated across multiple machines.
    The scheduler on each machine is responsible for
    granting/releasing locks and making sure the
    operation is forwarded to the local data manager
66
Pessimistic Timestamp Ordering
  • Basic idea
  • OM follows transaction timestamp order to perform
    operations
  • When an operation on a shared object is invoked,
    OM records the timestamp of the invoking
    transaction
  • When a transaction invokes a conflicting
    operation on the object
  • The transaction has a larger timestamp than the
    one recorded by the object → proceed (and record
    the new timestamp)
  • Otherwise → abort
  • No deadlock
  • Cascade aborts (schedule 5)
  • Tentative write before commit for ensuring
    isolation

67
Pessimistic Timestamp Ordering (Cont.)
  • Timestamp ordering with tentative writes (SCH)
  • Each object is associated with
  • RD: transaction commitment time for the last
    read
  • WR: transaction commitment time for the last
    write
  • A list of tentative times (Ts) for the pending
    transactions with a write operation to the object
  • Tmin: the minimum of Ts

68
Pessimistic Timestamp Ordering (Cont.)
  • Concurrency control using timestamps.

Wait until T3 commits/aborts
And Restart
Different from Figure 6.7
69
Pessimistic Timestamp Ordering (Cont.)
Execution phase: enforce or resolve consistency
Commit phase: enforce atomicity
70
Pessimistic Timestamp Ordering (Cont.)
  • Read (with transaction timestamp T)
  • T < WR → abort (to maintain increasing timestamp
    order)
  • WR < T < Tmin → allow to proceed (before any
    pending write)
  • The read result is put into the TM's work space
    and returned to the client
  • Tmin < T → put in the tentative list and wait for
    the preceding writes to finish (commit or abort)
    (there are already tentative writes)
  • Write
  • T > RD and T > WR → put into the tentative list
  • Inform the TM of the success or failure of the
    tentative write operation
  • Otherwise → abort

Enforce or resolve consistency in execution phase
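
The read and write rules above can be sketched as a per-object scheduler decision. The class name TimestampedObject and the returned strings are hypothetical. The slide does not say what happens when T equals WR or Tmin, so the handling of ties here is an assumption.

```python
class TimestampedObject:
    """Per-object state for timestamp ordering with tentative writes (sketch)."""

    def __init__(self, rd=0, wr=0):
        self.rd = rd          # RD: commit time of the last read
        self.wr = wr          # WR: commit time of the last write
        self.tentative = []   # timestamps of pending tentative writes

    @property
    def tmin(self):
        # Tmin: the smallest tentative timestamp (infinity if none pending).
        return min(self.tentative) if self.tentative else float("inf")

    def read(self, t):
        if t < self.wr:
            return "abort"    # would break increasing timestamp order
        if t < self.tmin:
            return "proceed"  # before any pending write
        return "wait"         # must wait for preceding tentative writes

    def write(self, t):
        if t > self.rd and t > self.wr:
            self.tentative.append(t)
            return "tentative"
        return "abort"
```

The sketch covers only the execution-phase decisions. Updating RD and WR and releasing waiting reads belong to the commit and abort handling.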
71
Pessimistic Timestamp Ordering (Cont.)
  • Abort
  • Read → simply discard the waiting read
  • Write → remove from the tentative list
  • If a waiting read reaches the head of the list →
    perform read
  • Commit → the successful completion of the atomic
    commit in TM
  • Transaction waiting to read → never happens
    (blocked)
  • Transaction with only completed read operations →
    update the object's RD (the larger of the
    transaction timestamp and the object's current
    RD)
  • Transaction with tentative write
  • SCH aborts all pending transactions (both waiting
    reads and tentative writes) ahead of the
    committed transaction
  • Make the update permanent
  • Remove write from tentative list (may allow a
    waiting read to proceed)
  • Replicas exist → call replication manager

Enforce atomicity in abort/commit
72
Pessimistic Timestamp Ordering (Cont.)
Allow more transactions to proceed freely, but
with more aborts, since conflicts sometimes
occur and need to be resolved
Waiting reads and tentative writes abort (to
maintain consistency)
[Figure: commit of a transaction with a tentative write, relative to Tmin]
73
Example
After 2 in t1 and 4 in t2, more unrelated
operations may exist
Schedule 1: 1, 2, 3, 4
[Figure: RD, WR, and Tmin values of the shared objects (t0, t1, t2) as operations 1/2 and 3/4 execute]
If t2 commits first → t1 has to abort and restart!
74
Example (Cont.)
Schedule 3: 3, 1, 4, 2
[Figure: RD, WR, and Tmin values of the shared objects (t0, t1, t2) as operations 3/4 and 1/2 execute]
If t2 commits first → t1 has to abort and restart!
75
Example (Cont.)
Schedule 5: 1, 3, 4, 2
[Figure: RD, WR, and Tmin values for object C (operations 1, 3) and object D (operations 4, 2), with timestamps t0, t1, t2]
If t2 commits first → t1 has to abort and restart!
76
Example (Cont.)
Schedule 3
[Figure: transactions t1, t2, t3 operating on X; shows the object's RD, WR, Tmin, and tentative list (t1, t2, t3) in the order of commit, with a waiting read and pending write first for t1, then for t2 and t3]
77
Example (Cont.)
Schedule 4: x0 (t2), x0 (t3), xx3 commit,
x0 (t1)
Waiting read and pending write for t3 (xx3)
t2 aborts and restarts
t3 commits
t1 aborts and restarts
[Figure: RD, WR, and Tmin values of X before and after t3 commits]
78
Optimistic Timestamp Ordering
  • Optimistic timestamp ordering
  • Just go ahead and do whatever you want to without
    paying attention to what anybody else is doing
  • System keeps track of which data items have been
    read and written
  • At the point of committing
  • Check all other transactions to see if any of its
    items have been changed since the transaction
    started (Validation)
  • Yes → abort
  • No → commit
  • Transaction uses private workspace to store
    shadow copies of data
  • Deadlock free and maximum parallelism
  • If a transaction fails → restart and run again
    (not good for heavy load)
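
The validate-at-commit idea can be sketched on a single site with per-item version numbers. The classes Store and Transaction are hypothetical. The sketch checks whether an item changed since it was first read, which approximates "since the transaction started", and it leaves out the distributed two-phase validation.

```python
class Store:
    """Shared data with a version number per item (illustrative sketch)."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}  # bumped on each committed write


class Transaction:
    def __init__(self, store):
        self.store = store
        self.seen = {}       # item -> version observed at first read
        self.workspace = {}  # private shadow copies of written items

    def read(self, key):
        if key in self.workspace:
            return self.workspace[key]
        self.seen.setdefault(key, self.store.version[key])
        return self.store.data[key]

    def write(self, key, value):
        # Writes go to the private workspace only.
        self.workspace[key] = value

    def commit(self):
        # Validation: abort if any item read has changed in the meantime.
        for key, version in self.seen.items():
            if self.store.version[key] != version:
                return False
        # Update: make the shadow copies permanent.
        for key, value in self.workspace.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
        return True
```

No locks are taken, so there is no deadlock, but the second of two conflicting transactions fails validation and has to be rerun.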

79
Optimistic Timestamp Ordering (Cont.)
  • A transaction consists of three phases
  • Execution phase
  • Just go ahead without paying attention to other
    transactions
  • Need private work space for shadow copies of
    shared objects
  • Validation phase
  • Use a two-phase commit protocol to globally
    validate
  • Once a transaction is validated, it is guaranteed
    to be committed
  • All commitments must follow the order of
    validation time
  • Must be atomic
  • Update phase
  • Make changes permanent in the persistent memory

80
Optimistic Timestamp Ordering (Cont.)
  • Each transaction ti
  • TSi: timestamp of the start time of its execution
    phase
  • TVi: timestamp of the start time of its
    validation phase
  • Ri: the set of data objects read by ti (read
    set)
  • Wi: the set of data objects written by ti (write
    set)
  • Each object Oj
  • RDj: commitment time for the last read operation
  • WRj: commitment time for the last write operation
  • Also called the version number of Oj
  • The transactions are to be serialized w.r.t. the
    timestamps TV of the validated transactions

81
Execution Phase
  • Begins at a TM when it receives a begin
    transaction request from a client
  • Private work space is created and maintained by
    TM
  • Shadow copies of object version numbers
  • Similar to the session semantics of files
  • Abort: delete the transaction and its work space
  • End transaction
  • Request for commit
  • Move to the validation phase

82
Validation Phase
  • TM validates mutual consistency between the
    requested transaction and other distributed
    transactions to ensure serializability
  • Initiate two-phase validation protocol, as
    coordinator
  • Ri, Wi, and TVi are sent to all participating TMs
    for validation
  • A participant can respond with positive
    validation to more than one request
  • Each TM has knowledge of all outstanding
    transactions tk at its local site
  • Validation of mutual consistency between ti and
    tk
  • TVi must be greater than TVk and tk must be
    completed before ti
  • If both are validated → check Wk of tk for
    conflict
  • An accepted validation carries the current
    version number of the shared remote object. It is
    compared with TVi for work-space consistency. All
    WRs must be smaller than TVi

83
Validation Phase (Cont.)
All commitments follow the order of validation
time. The update of the commitment must also be
atomic.
Two-phase commit for validation
84
Optimistic Timestamp Ordering (Cont.)
[Figure: cases numbered 1 to 4]
85
Optimistic Timestamp Ordering (Cont.)
[Figure: timelines of Ti (execution, validation at TVi) and Tk (execution, validation at TVk, update)]
(1) Violates the ordering of validation time
(2) Accepted: already serialized
86
Update Phase
  • The transaction moves to the update phase once it
    has gathered an accepted validation from all
    participating TMs and the state of the work space
    is consistent
  • An accepted validation is equivalent to a
    tentative pre-write in the timestamp ordering
    approach
  • The update phase is similar to the commit phase
    in timestamp ordering
  • Except that a tentative write can be aborted,
    while a validation cannot be denied once it is
    given
  • Updates must be committed in the TV order for
    the validated transactions

87
Data and File Replication
88
Overview
  • Advantages of data and file replication
  • Parallelism transparency: higher performance
    (concurrent access to replicas)
  • Failure transparency: higher availability
    (redundant replicas)
  • Necessity
  • Replication transparency: not aware of the
    existence of replicas
  • Concurrency transparency: no interference among
    sharing clients
  • Atomic update: updates to all replicas must be
    atomic
  • One-copy serializability
  • atomic transaction
  • atomic update of replicas

89
Overview (Cont.)
  • Atomic multicast: multicast messages are reliably
    delivered to all non-faulty group members, and
    the order of message delivery must obey a total
    ordering
  • Atomic transaction: operations in every
    transaction are performed on an all-or-none
    basis, and conflicting operations among
    concurrent transactions are executed in the same
    order (serialized)
  • Atomic update: updates are propagated to all
    replicated objects and are serialized

90
Overview (Cont.)
  • Similarity between atomic multicast, transaction,
    update
  • An atomic multicast is a special transaction
    where every message representing an operation
    conflicts with every other
  • Order of message delivery
  • Atomic update is very much like a transaction
    where every update is a conflicting operation
  • Atomic update is less stringent in consistency
    requirement
  • Failures of replicas may be allowed, as long as
    at least one copy is available
  • A client may not be interested in the global
    coherency of the replicas, as long as it can read
    the most recently written data

91
Architecture for Management of Replicas
92
Options for Read/Write
  • READ
  • Read-one-primary: read from a primary RM
    (consistency)
  • Read-one: read from any RM (concurrency)
  • Read-quorum: read from a quorum of RMs (currency)
  • WRITE
  • Write-one-primary: write to one primary replica
  • Primary RM propagates the updates to all other
    RMs
  • Write-all: atomic updates to all RMs (subsequent
    writes must wait)
  • Write-all-available: atomic updates to all
    available (non-faulty) RMs
  • Failure recovery
  • Write-quorum: atomic updates to a quorum of RMs
  • Write-gossip: updates go to any RM and are lazily
    propagated to the others

93
One-Copy Serializability
  • The execution of transactions on replicated
    objects is equivalent to the execution of the
    same transactions on non-replicated objects
  • Read-one-primary/write-one-primary: no
    replication issue
  • Serialized by primary RM
  • Secondary RMs only for redundancy
  • Read-one/write-all
  • Consistency: two-phase locking or timestamp
    ordering protocols
  • Read/write operations are sub-transactions
  • Read-one/write-all-available
  • Failure may cause problems with one-copy
    serializability (Chap. 12)
  • Read-quorum/write-quorum
  • Conflicts can be preserved if the read set of
    replicas of one transaction overlaps with the
    write set of another transaction

94
One-Copy Serializability (Cont.)
Serial schedule is the only correct execution
Either t1 reads X written by t2, or t2 reads Y
written by t1
Neither t1 nor t2 sees the object written by the
other
Failures and recoveries must also be serialized
w.r.t. transactions (a failure should appear before
the start of a transaction; otherwise, abort)
The failure of Yd in t1 should force t2 to abort
and roll back
95
Quorum Voting
Conflict: two-phase locking or timestamp ordering
Witnesses carry only the necessary information
(file version and id)
96
Quorum Voting (Cont.)
  • Three examples of the voting algorithm
  • A correct choice of read and write set
  • A choice that may lead to write-write conflicts
  • A correct choice, known as ROWA (read one, write
    all)
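
The three cases listed above can be checked mechanically. The sketch below is a minimal illustration, assuming one vote per replica and the usual overlap conditions for quorum voting (read quorum plus write quorum larger than the total, and two write quorums larger than the total). The names QuorumConfig and latest_version and the sizes used in the usage note are hypothetical; they are not taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    """Hypothetical description of a voting setup (one vote per replica)."""
    n: int  # total number of replicas (votes)
    r: int  # read quorum size
    w: int  # write quorum size

    def read_write_overlap(self) -> bool:
        # Every read quorum shares at least one replica with every write quorum.
        return self.r + self.w > self.n

    def write_write_overlap(self) -> bool:
        # Any two write quorums share at least one replica.
        return 2 * self.w > self.n

    def is_valid(self) -> bool:
        sizes_ok = 1 <= self.r <= self.n and 1 <= self.w <= self.n
        return sizes_ok and self.read_write_overlap() and self.write_write_overlap()


def latest_version(replies):
    """Given (version, value) pairs from a read quorum, keep the newest one."""
    return max(replies, key=lambda reply: reply[0])
```

With five replicas, QuorumConfig(5, 3, 3) satisfies both conditions, QuorumConfig(5, 4, 2) overlaps reads with writes but allows two disjoint write quorums (write-write conflicts), and QuorumConfig(5, 1, 5) is the read-one/write-all case.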

97
Gossip Update Propagation
  • If updates are less frequent than reads and
    ordering of updates can be relaxed, updates can
    be propagated lazily among replicas
  • Read-one/write-gossip
  • Support high availability in an environment where
    failures of replicas are likely and reliable
    multicast of updates is impractical
  • Idea
  • Both read and update operations are directed to
    any RM
  • RMs bring their data up to date by exchanging
    gossip information

98
A Gossip Architecture
  • Basic gossip protocol
  • Read/overwrite: no definitive group
  • Causal order gossip protocol
  • Read/modify-update: definitive group

99
Basic Gossip Protocol
  • Timestamps: scalar or vector
  • TSf of FSA: timestamp of the last successful
    access operation
  • TSi of RM i: timestamp of the last update of the
    data object
  • Read
  • TSf < TSi → RM has more recent data
  • Value returned
  • Assign TSi to TSf
  • TSf > TSi → RM has out-of-date data
  • Wait or contact other RMs
  • Update: FSA increments TSf
  • TSf > TSi → execute update
  • Assign TSf to TSi
  • TSf < TSi → update comes too late
  • Overwrite?
  • Read → overwrite
  • Gossip: RMj → RMi
  • Accept if TSj > TSi
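
The read, update, and gossip rules above can be sketched with scalar timestamps as follows. This is a minimal illustration, not the textbook's implementation: the class and method names are hypothetical, a late update is simply rejected (the slide leaves the overwrite choice open), and a read with TSf equal to TSi is treated as up to date, which is an assumption.

```python
class ReplicaManager:
    """Hypothetical RM holding one data object and its scalar timestamp TSi."""

    def __init__(self):
        self.ts = 0        # TSi: timestamp of the last update applied
        self.value = None

    def read(self, ts_f):
        """Return (ok, value, new TSf) for a read carrying the FSA timestamp."""
        if ts_f <= self.ts:            # equality counted as up to date (assumption)
            return True, self.value, self.ts
        return False, None, ts_f       # RM is out of date: wait or try another RM

    def update(self, ts_f, value):
        """Apply an update whose timestamp the FSA has already incremented."""
        if ts_f > self.ts:
            self.ts, self.value = ts_f, value
            return True
        return False                   # update comes too late (rejected here)

    def gossip_from(self, other):
        """Accept gossip from another RM only if it carries newer data."""
        if other.ts > self.ts:
            self.ts, self.value = other.ts, other.value
            return True
        return False


class FileServiceAgent:
    """Hypothetical FSA tracking TSf, the timestamp of its last successful access."""

    def __init__(self):
        self.ts = 0

    def read(self, rm):
        ok, value, ts = rm.read(self.ts)
        if ok:
            self.ts = ts               # assign TSi to TSf
        return ok, value

    def write(self, rm, value):
        self.ts += 1                   # FSA increments TSf before the update
        return rm.update(self.ts, value)
```

Replaying two writes by one agent to two different RMs, followed by gossip in both directions, shows the asymmetry: the RM holding the older timestamp accepts the gossip, and the other rejects it.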

100
Basic Gossip Update Example
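[Figure: basic gossip update example; recoverable steps]
1. FSA 1 (TSf1 = 0, set TSf1 = 1) → RM1 (TS1 = 0, set TS1 = 1): Write OK
2. FSA 1 (TSf1 = 1, set TSf1 = 2) → RM2 (TS2 = 0, set TS2 = 2): Write OK
3. Gossip: Reject
4. Gossip: Accept (RM1: TS1 = 1, set TS1 = 2)
5. FSA 2 (TSf2 = 0) → RM2 (TS2 = 2): Read OK
6. FSA 2 (TSf2 = 2, set TSf2 = 3) → RM2 (set TS2 = 3): Write OK
7. FSA 2 (TSf2 = 3) → RM1 (TS1 = 2): Read Reject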
101
Causal Order Gossip
  • Read/modify-update with definitive group
  • Ex.: multiply by 2, increment by 1 → order is
    important
  • Vector timestamps maintained at each RM
  • V (VAL): timestamp of the current value of the
    object
  • R (WORK): knowledge of update requests in the RM
    group
  • How much work is still to be done
  • Obtained through gossip by merging (pairwise
    maximum) Rs
  • Update log: u, r, and other information
  • u (DEP): timestamp issued by the FSA for an
    update operation
  • r (ts): identifier of the update operation
  • r for RMi is obtained by taking the corresponding
    u and replacing the ith component of u with the
    ith component of R
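
The vector operations named above are small enough to write out. The sketch below is a hedged illustration using tuples as vector timestamps; the function names merge, leq, and assign_identifier are hypothetical. It also assumes that RMi increments its own component of R when it receives an update, which the bullets do not state but the example slides later in this chapter suggest.

```python
def merge(a, b):
    """Pairwise maximum of two vector timestamps."""
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a, b):
    """True if a <= b in every component (a is causally covered by b)."""
    return all(x <= y for x, y in zip(a, b))


def assign_identifier(u, R, i):
    """RM i receives an update with FSA timestamp u.

    Assumption: RM i first increments its own component of R, then builds
    the identifier r by replacing the ith component of u with R[i].
    Returns the new R and the identifier r.
    """
    new_R = R[:i] + (R[i] + 1,) + R[i + 1:]
    r = u[:i] + (new_R[i],) + u[i + 1:]
    return new_R, r
```

For instance, an update with u = (1, 0, 0) arriving at the second RM (index 1) with R = (0, 0, 0) yields R = (0, 1, 0) and r = (1, 1, 0).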

102
Causally-Consistent Lazy Replication
103
Processing Read Operations
Wait for the RM to become up to date (i.e., until
DEP(R) ≤ VAL(i))
  • VAL = V in the textbook
  • WORK = R (how much more work remains)
  • DEP = u in the textbook (issued by the FSA)

104
Processing Write Operations
5. Stable (if many, execute in causal order)
Reject if DEP < VAL
Merge VAL and ts
  • ts = r in the textbook
  • DEP = u
  • VAL = V in the textbook
  • WORK = R (how much more work remains)

105
Gossip
  • A gossip message from RMj to RMi carries RMj's
    vector timestamp Rj and log Lj
  • Rj is merged with Ri
  • Li is joined with Lj, except for those update
    records with r ≤ Vi
  • These have already been accounted for by RMi
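
A gossip exchange of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the textbook's algorithm: the class CausalRM and its method names are hypothetical, an update is treated as stable once u ≤ V, executing it advances V by merging with r, and concurrent stable updates are executed in log order.

```python
def merge(a, b):
    """Pairwise maximum of two vector timestamps."""
    return tuple(max(x, y) for x, y in zip(a, b))


def leq(a, b):
    """True if a <= b in every component."""
    return all(x <= y for x, y in zip(a, b))


class CausalRM:
    """Hypothetical replica manager i in a group of n RMs."""

    def __init__(self, i, n):
        self.i = i
        self.V = (0,) * n      # VAL: timestamp of the current value
        self.R = (0,) * n      # WORK: updates known to exist in the group
        self.log = []          # update records (u, r, name)
        self.executed = []     # names of executed updates, in execution order

    def write(self, u, name):
        """Log an update with FSA timestamp u and execute it if stable."""
        i = self.i
        self.R = self.R[:i] + (self.R[i] + 1,) + self.R[i + 1:]
        r = u[:i] + (self.R[i],) + u[i + 1:]
        self.log.append((u, r, name))
        self._execute_stable()
        return r

    def receive_gossip(self, Rj, Lj):
        """Merge Rj into R and join Lj, skipping records already accounted for."""
        self.R = merge(self.R, Rj)
        known = {r for _, r, _ in self.log}
        for u, r, name in Lj:
            if not leq(r, self.V) and r not in known:
                self.log.append((u, r, name))
        self._execute_stable()

    def _execute_stable(self):
        # Keep scanning until no logged update becomes executable.
        progress = True
        while progress:
            progress = False
            for u, r, name in self.log:
                if name not in self.executed and leq(u, self.V):
                    self.V = merge(self.V, r)
                    self.executed.append(name)
                    progress = True
```

As a usage example, an update issued with u = (1, 0, 0) at an RM whose V is still (0, 0, 0) stays in the log until gossip delivers the update it depends on; both are then executed in causal order.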

106
Example
107
Example (Cont.)
FSA 1 sends update (u, 000) to RM1
RM1 before the write: V = 000, R = 000, Update Log: (empty)
Write
RM1 after logging: V = 000, R = 100, Update Log: (u1 = 000, r1 = 100)
Because u ≤ V, the operation is executed, and V is
advanced by merging with r
RM1 after execution: V = 100, R = 100, Update Log: (u1 = 000, r1 = 100)
The information has to propagate to the other RMs
108
Example (Cont.)
Read
FSA 2 reads from RM1 with TSf = 000
RM1: V = 100, R = 100, Update Log: (u = 000, r = 100)
TSf < V → value at RM1 returned → TSf is updated
to 100 (merge)
Write
FSA 2 sends update (u, 100) to RM2
RM2 before the write: V = 000, R = 000, Update Log: (empty)
RM2 after logging: V = 000, R = 010, Update Log: (u2 = 100, r2 = 110)
u2 > V → RM2 does not have the most up-to-date
value
Write
109
Example (Cont.)
RM1: V = 100, R = 100, Update Log: (u1 = 000, r1 = 100)
RM2: V = 000, R = 010, Update Log: (u2 = 100, r2 = 110)
Gossip from RM1 to RM2
RM2: V = 000, R = 110, Update Log: (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
Execute u1 (u1 ≤ V) (stable)
RM2: V = 100, R = 110, Update Log: (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
Execute u2 (u2 ≤ V) (stable)
RM2: V = 110, R = 110, Update Log: (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
110
Example (Cont.)
Read
FSA 3 reads from RM1 with TSf = 000
RM1: V = 100, R = 100, Update Log: (u = 000, r = 100)
TSf < V → value at RM1 returned → TSf is updated
to 100 (merge)
Write
FSA 3 sends update (u, 100) to RM3
RM3 before the write: V = 000, R = 000, Update Log: (empty)
RM3 after logging: V = 000, R = 001, Update Log: (u3 = 100, r3 = 101)
u3 > V → RM3 does not have the most up-to-date
value
Write
111
Example (Cont.)
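RM2: V = 110, R = 110, Update Log: (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
RM3: V = 000, R = 001, Update Log: (u3 = 100, r3 = 101)
Gossip from RM2 to RM3
RM3: V = 000, R = 111, Update Log: (u3 = 100, r3 = 101), (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
Execute u1 (u1 ≤ V)
RM3: V = 100, R = 111, Update Log: (u3 = 100, r3 = 101), (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
Execute u2 (u2 ≤ V)
RM3: V = 110, R = 111, Update Log: (u3 = 100, r3 = 101), (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)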
112
Example (Cont.)
Execute u3 (u3 < V)
RM3: V = 111, R = 111, Update Log: (u3 = 100, r3 = 101), (u2 = 100, r2 = 110), (u1 = 000, r1 = 100)
More issues: garbage collection of the logs and
optimization of message count (Chapter 12)
113
Cache-Coherence Protocol
  • Cache: a special case of replication
  • Controlled by clients (instead of servers)
  • Coherence detection strategy: when, during a
    transaction, the detection is done
  • On every access (operation)
  • Let the transaction proceed while verification is
    taking place
  • If the assumption later proves to be false → abort
  • Verify only when the transaction commits

114
Cache-Coherence Protocol (Cont.)
  • Coherence enforcement strategy: how caches are
    kept consistent with the copies stored at servers
  • Write-invalidate: server sends invalidation to
    all caches whenever data is modified
  • Write-update: server propagates the update
  • What happens when a process modifies cached data
  • Write-through: immediate write back to the server
  • Write-back: updates to the cache can be batched
    and written back to the server periodically
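
The enforcement choices above can be contrasted in a few lines. The sketch below is a simplified illustration, assuming a single in-process server, write-invalidate on the server side, and a per-client flag that selects write-through or write-back. The class names Server and ClientCache are hypothetical, and an invalidation here simply discards any unflushed local write, which a real protocol would have to handle more carefully.

```python
class Server:
    """Hypothetical file server that invalidates client caches on every write."""

    def __init__(self):
        self.data = {}
        self.caches = []

    def register(self, cache):
        self.caches.append(cache)

    def write(self, key, value, writer=None):
        self.data[key] = value
        for cache in self.caches:
            if cache is not writer:
                cache.invalidate(key)      # write-invalidate


class ClientCache:
    """Hypothetical client cache with write-through or write-back behavior."""

    def __init__(self, server, write_back=False):
        self.server = server
        self.write_back = write_back
        self.lines = {}                    # cached copies
        self.dirty = set()                 # keys modified but not yet flushed
        server.register(self)

    def read(self, key):
        if key not in self.lines:          # miss: fetch from the server
            self.lines[key] = self.server.data[key]
        return self.lines[key]

    def write(self, key, value):
        self.lines[key] = value
        if self.write_back:
            self.dirty.add(key)            # batched; sent on flush()
        else:
            self.server.write(key, value, writer=self)   # write-through

    def flush(self):
        for key in sorted(self.dirty):
            self.server.write(key, self.lines[key], writer=self)
        self.dirty.clear()

    def invalidate(self, key):
        # Simplification: an unflushed local write to this key is dropped.
        self.lines.pop(key, None)
        self.dirty.discard(key)
```

With one write-through client and one write-back client sharing a key, the server sees the write-through update immediately, while the write-back update becomes visible only after flush() is called.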