Large Scale Sharing: GFS and PAST

1
Large Scale Sharing: GFS and PAST
  • Mahesh Balakrishnan

2
Distributed File Systems
  • Traditional Definition
  • Data and/or metadata stored at remote locations,
    accessed by client over the network.
  • Various degrees of centralization from NFS to
    xFS.
  • GFS and PAST
  • Unconventional, specialized functionality
  • Large-scale in data and nodes

3
The Google File System
  • Specifically designed for Google's backend needs
  • Web Spiders append to huge files
  • Application data patterns
  • Multiple producer multiple consumer
  • Many-way merging
  • GFS vs. traditional file systems

4
Design Space Coordinates
  • Commodity Components
  • Very large files: multi-GB
  • Large sequential accesses
  • Co-design of Applications and File System
  • Supports small files, random access writes and
    reads, but not efficiently

5
GFS Architecture
  • Interface
  • Usual: create, delete, open, close, etc.
  • Special: snapshot, record append
  • Files divided into fixed size chunks
  • Each chunk replicated at chunkservers
  • Single master maintains metadata
  • Master, chunkservers, clients: Linux
    workstations, user-level processes

6
Client File Request
  • Client finds chunkid for offset within file
  • Client sends <filename, chunkid> to Master
  • Master returns chunk handle and chunkserver
    locations
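
The lookup on this slide can be sketched in a few lines. This is a minimal illustration, not GFS code: the 64 MB chunk size comes from the GFS paper rather than from the slides, and the `Master` class, its `add_chunk`/`lookup` methods, and the file and server names are hypothetical.

```python
CHUNK_SIZE = 64 * 2**20  # assumed: 64 MB fixed-size chunks (GFS paper, not the slides)


def chunk_index(offset: int) -> int:
    """Index of the fixed-size chunk that contains a byte offset."""
    return offset // CHUNK_SIZE


class Master:
    """Toy in-memory master: (filename, chunk index) -> (handle, locations)."""

    def __init__(self):
        self.table = {}

    def add_chunk(self, filename, index, handle, locations):
        self.table[(filename, index)] = (handle, list(locations))

    def lookup(self, filename, index):
        return self.table[(filename, index)]


def locate(master, filename, offset):
    """Client side: translate an offset, then ask the master once."""
    index = chunk_index(offset)
    handle, locations = master.lookup(filename, index)
    # The client then reads at this offset inside the chunk, from any replica.
    return handle, locations, offset % CHUNK_SIZE
```

Only the small `<filename, chunkid>` request reaches the master; the data transfer itself goes to a chunkserver.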

7
Design Choices: Master
  • Single master maintains all metadata
  • Simple Design
  • Global decision making for chunk replication
    and placement
  • Bottleneck?
  • Single Point of Failure?

8
Design Choices: Master
  • Single master maintains all metadata in memory!
  • Fast master operations
  • Allows background scans of entire data
  • Memory Limit?
  • Fault Tolerance?

9
Relaxed Consistency Model
  • File regions are:
  • Consistent: all clients see the same thing
  • Defined: after a mutation, all clients see exactly
    what the mutation wrote
  • Ordering of Concurrent Mutations
  • For each chunk's replica set, Master gives one
    replica a primary lease
  • Primary replica decides ordering of mutations and
    sends it to other replicas

10
Anatomy of a Mutation
  • Steps 1-2: Client gets chunkserver locations from
    master
  • Step 3: Client pushes data to replicas, in a chain
  • Step 4: Client sends write request to primary;
    primary assigns sequence number to write and
    applies it
  • Steps 5-6: Primary tells other replicas to apply write
  • Step 7: Primary replies to client
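
A small simulation may help show why the primary's sequence numbers give every replica the same mutation order. This is a sketch under simplifying assumptions: the `Replica` and `Primary` classes and the `data_id` handle are hypothetical, there is no network or failure handling, and the lease itself is not modeled.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffered = {}  # data_id -> bytes pushed in step 3, not yet applied
        self.applied = []   # (sequence number, data) in the order applied

    def push(self, data_id, data):
        """Step 3: data arrives and is buffered; nothing is written yet."""
        self.buffered[data_id] = data

    def apply(self, serial, data_id):
        """Apply a buffered mutation under the sequence number given."""
        self.applied.append((serial, self.buffered.pop(data_id)))


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial        # step 4: primary picks the order
        self.next_serial += 1
        self.apply(serial, data_id)
        for replica in self.secondaries:  # steps 5-6: same order everywhere
            replica.apply(serial, data_id)
        return serial                     # step 7: reply to the client
```

Data can reach the replicas in any order during step 3; only the order of write requests at the primary determines what each replica ends up applying.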

11
Connection with Consistency Model
  • Secondary replica encounters error while applying
    write (step 5): region is inconsistent.
  • Client code breaks up single large write into
    multiple small writes: region is consistent, but
    undefined.

12
Special Functionality
  • Atomic Record Append
  • Primary appends to itself, then tells other
    replicas to write at that offset
  • If secondary replica fails to write data (step
    5), client retries the append:
  • Duplicates in successful replicas, padding in
    failed ones
  • Region defined where append successful,
    inconsistent where failed
  • Snapshot
  • Copy-on-write: chunks copied lazily to same
    replica
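
The duplicate-and-padding outcome of a retried record append can be reproduced with a toy model. This is a sketch only: offsets count whole records rather than bytes, `ChunkReplica`, `fail_next`, and `append_with_retry` are hypothetical names, and the retry loop stands in for the client library.

```python
PAD = None  # marker for a region a replica failed to write


class ChunkReplica:
    def __init__(self):
        self.slots = []         # one entry per record-sized region
        self.fail_next = False  # hook to make the next write fail

    def write_at(self, offset, record):
        # Extend with padding so the region at `offset` always exists.
        while len(self.slots) <= offset:
            self.slots.append(PAD)
        if self.fail_next:
            self.fail_next = False
            return False        # region stays as padding on this replica
        self.slots[offset] = record
        return True


def record_append(primary, secondaries, record):
    """One attempt: the primary picks the offset, secondaries write there."""
    offset = len(primary.slots)
    ok = primary.write_at(offset, record)
    for replica in secondaries:
        ok = replica.write_at(offset, record) and ok
    return offset, ok


def append_with_retry(primary, secondaries, record, max_tries=3):
    """Retry until every replica has the record at the same offset."""
    for _ in range(max_tries):
        offset, ok = record_append(primary, secondaries, record)
        if ok:
            return offset
    raise IOError("record append failed on every attempt")
```

If a secondary fails once, the primary ends up holding the record twice while the secondary holds padding followed by the record. The offset returned to the client is the one where all replicas agree.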

13
Master Internals
  • Namespace management
  • Replica Placement
  • Chunk Creation, Re-replication, Rebalancing
  • Garbage Collection
  • Stale Replica Detection

14
Dealing with Faults
  • High availability
  • Fast master and chunkserver recovery
  • Chunk replication
  • Master state replication: read-only shadow
    replicas
  • Data Integrity
  • Chunk broken into 64 KB blocks, each with a 32-bit
    checksum
  • Checksums stored in memory, logged to disk
  • Optimized for appends, since no verification of
    existing data is required
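
The append optimization can be illustrated with a running checksum. This is a sketch under stated assumptions: CRC-32 from Python's `zlib` stands in for whatever 32-bit checksum GFS uses, which the slides do not name, and `ChecksummedChunk` is a hypothetical class that keeps data and checksums in memory.

```python
import zlib

BLOCK = 64 * 1024  # 64 KB checksum blocks, as on the slide


class ChecksummedChunk:
    def __init__(self):
        self.data = bytearray()
        self.sums = []  # one 32-bit checksum per 64 KB block

    def append(self, payload: bytes):
        pos = 0
        while pos < len(payload):
            used = len(self.data) % BLOCK           # bytes in the last block
            piece = payload[pos:pos + BLOCK - used]
            if used == 0:
                self.sums.append(zlib.crc32(piece))  # start a new block
            else:
                # Extend the running checksum of the partial last block
                # without re-reading or verifying the bytes already there.
                self.sums[-1] = zlib.crc32(piece, self.sums[-1])
            self.data += piece
            pos += len(piece)

    def read_block(self, i: int) -> bytes:
        """Reads verify the checksum before returning data."""
        block = bytes(self.data[i * BLOCK:(i + 1) * BLOCK])
        if zlib.crc32(block) != self.sums[i]:
            raise IOError(f"checksum mismatch in block {i}")
        return block
```

If the partial block was already corrupt, the extended checksum will still not match the data, so the corruption is caught on the next read.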

15
Micro-benchmarks

16
Storage Data for Real Clusters

17
Performance

18
Workload Breakdown
  • % of operations for given size
  • % of bytes transferred for given operation size

19
GFS Conclusion
  • Very application-specific: more engineering than
    research

20
PAST
  • Internet-based P2P global storage utility
  • Strong persistence
  • High availability
  • Scalability
  • Security
  • Not a conventional FS
  • Files have unique id
  • Clients can insert and retrieve files
  • Files are immutable

21
PAST Operations
  • Nodes have random unique nodeIds
  • No searching, directory lookup, key distribution
  • Supported Operations
  • Insert (name, key, k, file) → fileId
  • Stores on k nodes closest in id space
  • Lookup (fileId) → file
  • Reclaim (fileId, key)
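
Selecting the k nodes closest in id space might look like the following. This is a sketch that assumes a small 16-bit circular id space for readability; real nodeIds and fileIds are much longer, and `ring_distance` and `k_closest` are hypothetical helper names. Ties are broken by the smaller nodeId, which is also an assumption.

```python
ID_BITS = 16                # assumed: tiny id space, for readability only
ID_SPACE = 1 << ID_BITS


def ring_distance(a: int, b: int) -> int:
    """Numeric distance between two ids on a circular id space."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)


def k_closest(node_ids, file_id, k):
    """The k nodeIds numerically closest to file_id."""
    ranked = sorted(node_ids, key=lambda n: (ring_distance(n, file_id), n))
    return ranked[:k]
```

Because nodeIds are random, the k nodes chosen this way are unlikely to share a location or an owner, which is what makes the replicas useful for availability.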

22
Pastry
  • P2P routing substrate
  • route (key, msg): routes to numerically closest
    node in less than log_{2^b} N steps
  • Routing table size: (2^b - 1) * log_{2^b} N + 2l
  • b: determines tradeoff between per-node state
    and lookup order
  • l: failure tolerance; delivery guaranteed unless
    l/2 nodes with adjacent nodeIds fail
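
To get a feel for the numbers, the state and hop bound can be computed directly. This is a sketch: it rounds the logarithm up to a whole number of routing-table rows, which the slide's formula leaves implicit, and `ceil_log` and `pastry_state` are hypothetical helper names.

```python
def ceil_log(n: int, base: int) -> int:
    """Smallest r with base**r >= n, using integer arithmetic only."""
    r, power = 0, 1
    while power < n:
        power *= base
        r += 1
    return r


def pastry_state(n_nodes: int, b: int, l: int):
    """Return (rows, entries) for N nodes, digit width b, leaf parameter l.

    rows    -- routing-table rows, also the bound on routing steps
    entries -- (2^b - 1) entries per row, plus 2l as on the slide
    """
    rows = ceil_log(n_nodes, 2 ** b)
    entries = (2 ** b - 1) * rows + 2 * l
    return rows, entries
```

For 2250 nodes with b = 4 and an illustrative l = 16 (the slides do not give l), `pastry_state(2250, 4, 16)` returns `(3, 77)`. Raising b shortens routes but widens each row.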

23
Routing Table of Node 10233102
  • Leaf set: L/2 larger and L/2 smaller nodeIds
  • Routing entries
  • Neighborhood set: M closest nodes

24
PAST operations/security
  • Insert
  • Certificate created with fileId, file content
    hash, replication factor and signed with private
    key
  • File and certificate routed through Pastry
  • First node in k closest accepts file and forwards
    to other k-1
  • Security: smartcards
  • Public/Private key
  • Generate and verify certificates
  • Ensure integrity of nodeId and fileId assignments

25
Storage Management
  • Design Goals
  • High global storage utilization
  • Graceful degradation near max utilization
  • PAST tries to
  • Balance free storage space amongst nodes
  • Maintain k closest nodes replication invariant
  • Storage Load Imbalance
  • Variance in number of files assigned to node
  • Variance in size distribution of inserted files
  • Variance in storage capacity of PAST nodes

26
Storage Management
  • Large capacity storage nodes have multiple
    nodeIds
  • Replica Diversion
  • If node A cannot store file, it stores pointer to
    file at leaf set node B which is not in k closest
  • What if A or B fail? Duplicate pointer at the
    (k+1)th closest node
  • Policies for directing and accepting replicas:
    tpri and tdiv thresholds for file size / free
    space.
  • File Diversion
  • If insert fails, client retries with different
    fileId
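
The threshold policy can be sketched as a simple ratio test. This is one reading of the slide, not PAST code: a node is assumed to reject a file when file size divided by its free space exceeds the threshold, with tpri applied at a primary store and tdiv at a diverted one. The function names and return values are hypothetical; the two constants are the fixed values quoted on the later "Effect of tpri" and "Effect of tdiv" slides.

```python
T_PRI = 0.1   # threshold at a node among the k closest
T_DIV = 0.05  # stricter threshold at a node receiving a diverted replica


def accepts(file_size: float, free_space: float, threshold: float) -> bool:
    """Assumed rule: accept only if file_size / free_space <= threshold."""
    return free_space > 0 and file_size / free_space <= threshold


def place_replica(file_size, primary_free, leaf_free):
    """Decide where one replica goes.

    primary_free -- free space at node A, one of the k closest
    leaf_free    -- {node: free space} for leaf-set nodes not in the k closest
    """
    if accepts(file_size, primary_free, T_PRI):
        return "primary"
    if leaf_free:
        # Replica diversion: try the leaf-set node with the most free space.
        node = max(leaf_free, key=leaf_free.get)
        if accepts(file_size, leaf_free[node], T_DIV):
            return ("diverted", node)
    # Neither works: the insert fails and the client retries with a new fileId.
    return "file diversion"
```

Under this rule a nearly full node still accepts small files while turning large ones away, which spreads the remaining free space more evenly.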

27
Storage Management
  • Maintaining replication invariant
  • Failures and joins
  • Caching
  • k-replication in PAST for availability
  • Extra copies stored to reduce client latency,
    network traffic
  • Unused disk space utilized
  • Greedy Dual-Size replacement policy
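
A compact version of GreedyDual-Size replacement is sketched below. The assumptions are mine: every file has cost 1, so the policy favors small files and aims at hit rate, the inflation-value formulation is used instead of aging every entry, and `GreedyDualSizeCache` with its methods is a hypothetical class. Sizes must be positive.

```python
class GreedyDualSizeCache:
    def __init__(self, capacity):
        self.capacity = capacity  # unused disk space available for caching
        self.used = 0
        self.inflation = 0.0      # rises to the H value of each evicted file
        self.entries = {}         # file_id -> (H value, size)

    def _h(self, size, cost=1.0):
        # H = inflation + cost / size: small or recently used files score high.
        return self.inflation + cost / size

    def get(self, file_id):
        """Cache hit refreshes the file's H value; returns False on a miss."""
        if file_id not in self.entries:
            return False
        _, size = self.entries[file_id]
        self.entries[file_id] = (self._h(size), size)
        return True

    def put(self, file_id, size):
        """Insert a file, evicting the lowest-H files until it fits."""
        if size > self.capacity:
            return False
        if file_id in self.entries:
            return self.get(file_id)
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda f: self.entries[f][0])
            h, victim_size = self.entries.pop(victim)
            self.inflation = h    # ages every remaining entry implicitly
            self.used -= victim_size
        self.entries[file_id] = (self._h(size), size)
        self.used += size
        return True
```

Cached copies are expendable, unlike the k replicas, so eviction never affects the replication invariant.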

28
Performance
  • Workloads
  • 8 Web Proxy Logs
  • Combined file systems
  • k = 5, b = 4
  • Number of nodes: 2250
  • 4 normal distributions of node storage sizes
  • Without replica and file diversion
  • 51.1% of insertions failed
  • 60.8% global utilization

29
Effect of Storage Management

30
Effect of tpri
  • Lower tpri: better utilization, more failures
  • tdiv = 0.05, tpri varied

31
Effect of tdiv
  • Trend similar to tpri
  • tpri = 0.1, tdiv varied

32
File and Replica Diversions
  • Ratio of replica diversions vs. utilization
  • Ratio of file diversions vs. utilization

33
Distribution of Insertion Failures
  • File system trace
  • Web logs trace

34
Caching

35
Conclusion