1
CS-556 Distributed Systems
Case Study: Google
  • Manolis Marazakis
  • maraz@csd.uoc.gr

2
Google cluster architecture
  • 15,000 commodity-class PCs
  • running custom fault-tolerant software
  • Replication of services across many different
    machines
  • Automatic fault detection & handling
  • rather than a smaller number of high-end
    servers
  • Overall system throughput vs single-thread
    performance
  • Throughput-oriented workload
  • Easy parallelism
  • Raw data: several tens of TBs
  • Different queries can run on different processors
  • The overall index is partitioned, so that a
    single query can use multiple processors
  • Key metrics
  • Energy efficiency
  • Price/performance ratio

3
Google query-serving architecture
  • Coordination of query execution
  • Formatting of results

DNS-based load balancing among geographically
distributed data centers
H/W-based load balancer at each cluster
4
Query Execution
  • Phase-I
  • Cluster of index servers
  • consult an inverted index that maps each query
    keyword to a set of matching documents
  • Hit-list per keyword
  • Intersection of hit-lists
  • Computation of relevance rankings
  • Result: ordered list of doc-ids (see the phase-I
    sketch after this list)
  • Phase-II
  • Cluster of document servers
  • Fetch each document from disk
  • Extract URL, title & keyword-in-context snippet
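
A minimal Python sketch of the phase-I flow described above (inverted-index lookup, hit-list intersection, relevance ranking); the index layout and the frequency-based score are illustrative assumptions, not Google's actual ranking:

from collections import defaultdict

# Toy inverted index: keyword -> {doc_id: term frequency}.
inverted_index = defaultdict(dict)

def add_document(doc_id, text):
    for word in text.lower().split():
        inverted_index[word][doc_id] = inverted_index[word].get(doc_id, 0) + 1

def search(query):
    keywords = query.lower().split()
    # Hit-list per keyword: the set of documents containing it.
    hit_lists = [set(inverted_index.get(k, {})) for k in keywords]
    if not hit_lists:
        return []
    # Intersection of hit-lists: documents matching every keyword.
    matches = set.intersection(*hit_lists)
    # Naive relevance ranking: sum of term frequencies (a placeholder
    # for the real, far more elaborate ranking signals).
    score = lambda d: sum(inverted_index[k][d] for k in keywords)
    return sorted(matches, key=score, reverse=True)   # ordered doc-ids

add_document(1, "distributed system design at google")
add_document(2, "the google file system and google search")
print(search("google system"))   # -> [2, 1]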

5
Index servers
  • The inverted index is divided into shards
  • each having a randomly chosen subset of
    documents from the overall index
  • Pool of machines per shard
  • Intermediate load balancer
  • Each query goes to a subset of machines assigned
    to each shard
  • If a replica goes down, the load balancer will
    avoid using it, until it is recovered/replaced,
    but all parts of the index still remain available
  • During the downtime, system capacity is reduced
    in proportion to the total fraction of capacity
    that this machine represented.
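
A hedged sketch of how an intermediate load balancer might pick one healthy replica per shard while a failed replica is avoided; the shard layout, health set, and round-robin policy are illustrative assumptions:

import itertools

class ShardPool:
    """One index shard, served by a pool of replica machines."""
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.down = set()                # replicas currently marked as failed
        self._rr = itertools.count()     # round-robin counter

    def pick_replica(self):
        healthy = [r for r in self.replicas if r not in self.down]
        if not healthy:
            raise RuntimeError("no healthy replica for this shard")
        # Capacity shrinks in proportion to the failed machines,
        # but the shard (and hence the whole index) stays available.
        return healthy[next(self._rr) % len(healthy)]

# Two shards, each with its own pool of replicas.
shards = [ShardPool(["idx0-a", "idx0-b", "idx0-c"]),
          ShardPool(["idx1-a", "idx1-b", "idx1-c"])]

shards[0].down.add("idx0-b")             # simulated failure
plan = {i: pool.pick_replica() for i, pool in enumerate(shards)}
print(plan)                              # one healthy replica per shard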

6
Document servers
  • Partition the processing of documents
  • Randomly distribute documents into smaller shards
  • Multiple servers per shard
  • Routing requests through a load balancer
  • Access an online, low-latency copy of the entire
    Web
  • actually, multiple copies!

7
Replication for capacity (I)
  • Mostly read-only accesses to the index & other
    data structures
  • Relatively infrequent updates
  • Sidestep consistency issues, by temporarily
    diverting queries from a server until an update
    is completed
  • Divide the query stream into multiple streams,
    each handled by a cluster
  • lookup of matching documents → many lookups for
    matching documents in much smaller indices,
    followed by a merging step (see the scatter/gather
    sketch below)
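
A minimal scatter/gather sketch of this idea: one lookup becomes many lookups in much smaller per-shard indices, followed by a merge of the ranked partial results (the shard contents and scores are made up):

import heapq

# Each shard holds a much smaller index; a query is answered by
# looking it up in every shard and merging the ranked partial results.
shard_indices = [
    {"d1": 0.9, "d4": 0.3},      # shard 0
    {"d2": 0.7, "d5": 0.5},      # shard 1
    {"d3": 0.8},                 # shard 2
]

def shard_lookup(shard):
    # Per-shard result: (score, doc_id) pairs, best first.
    return sorted(((s, d) for d, s in shard.items()), reverse=True)

# Scatter: query every shard (independent, so trivially parallel).
partial = [shard_lookup(s) for s in shard_indices]

# Gather: merge the already-sorted per-shard lists into one ranking.
merged = heapq.merge(*partial, reverse=True)
print([doc for _, doc in merged])        # ['d1', 'd3', 'd2', 'd5', 'd4']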

8
Replication for capacity (II)
  • Adding machines to each pool increases serving
    capacity
  • Adding shards accommodates index growth
  • Divide the computation among multiple CPUs &
    disks
  • No communication/dependency between shards
  • Nearly linear speedup
  • The CPU speed of individual machines does not
    limit search performance
  • We can increase the number of shards to
    accommodate slower CPUs

9
Design principles
  • Software reliability
  • rather than high-quality components, RAID,
    redundant power supplies, ...
  • Replication for better throughput and
    availability
  • Purchase the CPU generation that currently gives
    the best performance per unit price
  • not the CPUs that give the best absolute
    performance
  • Using commodity PCs reduces the costs of
    computation
  • making it affordable to use more computational
    resources per query
  • Search a larger index
  • Employ elaborate ranking algorithms

10
Rack organization
  • 40-80 x86-class custom-made PCs
  • Several CPU generations are in active service
  • 533 MHz Celeron, ..., dual 1.4 GHz Pentium III
  • IDE 80 GB hard disks, one or more per machine
  • 100 Mbps Ethernet switch for each side of a rack
  • Each switch has one or two uplink connections
    to a gigabit switch that interconnects all the
    racks

Selection criterion: cost per query (capital
expense + operating costs)
Realistically, a server will not last more than
2-3 years, due to disparities in performance
when compared with newer machines
11
A rough cost comparison
  • Rack of 88 dual-CPU 2-GHz Intel Xeon servers, each
    with 2 GB of RAM & one 80-GB HDD: $278,000 →
    monthly capital cost of $7,700 over 3 years
  • Capacity: 176 CPUs, 176 GB of RAM, 7 TB disk
    space
  • Multiprocessor containing 8 2-GHz Xeon CPUs, 64
    GB of RAM, and 8 TB of disk space: $758,000 →
    monthly capital cost of $21,000 over 3 years
  • 22 times fewer CPUs, 3 times less RAM

Much of the cost difference derives from the much
higher interconnect bandwidth & reliability of a
high-end server. Google's highly redundant
architecture does not rely on either of these
attributes.
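
A quick arithmetic check of the monthly capital-cost figures quoted above, assuming straight-line amortization over 36 months and ignoring operating costs:

# Straight-line amortization of the purchase price over 3 years.
months = 3 * 12

rack_price = 278_000     # rack of 88 dual-CPU Xeon servers
smp_price = 758_000      # 8-way Xeon multiprocessor

print(round(rack_price / months))        # 7722  -> "about $7,700/month"
print(round(smp_price / months))         # 21056 -> "about $21,000/month"
print(round(smp_price / rack_price, 1))  # 2.7   -> ~3x the price, 22x fewer CPUs
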
12
Power consumption & cooling
  • Power consumption for a dual 1.4-GHz Pentium III
    server: 90 W of DC power under load
  • 55 W for the two CPUs, 10 W for a disk drive, 25
    W for DRAM & motherboard
  • 120 W of AC power per server
  • Efficiency of ATX power supply: 75%
  • Power consumption per rack: 10 kW
  • A rack fits in 25 ft² of space
  • Power density: 400 W/ft²
  • With higher-end processors, the power density of
    a rack can exceed 700 W/ft²

Typical data center power density: 150 W/ft²
What counts is watts per unit of performance, not
watts alone.
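
The rack figures above follow from simple arithmetic; a small sketch, assuming the 80-server end of the 40-80 range:

# Back-of-the-envelope power figures for one rack.
supply_efficiency = 0.75         # ATX power supply efficiency
dc_watts_per_server = 90         # measured DC power under load
servers_per_rack = 80            # upper end of the 40-80 range
rack_area_sqft = 25              # floor space occupied by a rack

ac_watts_per_server = dc_watts_per_server / supply_efficiency
print(ac_watts_per_server)                             # 120.0 W of AC power
print(servers_per_rack * ac_watts_per_server / 1000)   # 9.6 -> roughly 10 kW/rack
print(servers_per_rack * ac_watts_per_server / rack_area_sqft)   # 384 -> ~400 W/ft²
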
13
H/W-level application characteristics
The index application traverses dynamic data
structures & control flow is data-dependent,
creating a significant number of difficult-to-predict
branches.
There isn't that much exploitable
instruction-level parallelism (ILP) in the
workload.
  • Exploit thread-level parallelism
  • Processing each query shares mostly read-only
    data with the rest of the system, and constitutes
    a work unit that requires little communication
  • SMT: simultaneous multithreading
  • CMP: chip multiprocessor

Early experiments with a dual-context Intel
Xeon: 30% improvement
14
The Google File System (GFS)
  • Component failures are the norm rather than the
    exception
  • constant monitoring, error detection, fault
    tolerance & automatic recovery
  • Huge files
  • Each file typically contains many application
    objects, such as web documents
  • A few million files, of size 100 MB or larger
  • Fast-growing data sets of many TBs
  • Most files are mutated by appending new data
  • rather than overwriting existing data
  • Random writes within a file are practically
    non-existent.
  • Once written, the files are only read, often only
    sequentially.
  • Co-design of applications & the GFS API
  • Relaxed consistency model
  • Atomic file-append operation

15
GFS workloads
  • Large streaming reads
  • Individual operations typically read hundreds of
    KBs, more commonly 1 MB or more
  • Successive operations from the same client often
    read through a contiguous region of a file
  • Small random reads
  • typically read a few KBs at some arbitrary offset
  • Many large sequential writes that append data to
    files
  • Once written, a file is seldom modified again
  • Files are often used as producer-consumer queues,
    or for N-way merging
  • Need for well-defined semantics for multiple
    processes that append to the same file
  • High sustained throughput is more important than
    low latency

16
GFS API
  • Files are organized in a hierarchy of directories
    & identified by pathnames
  • Operations
  • create, delete, open, close, read, write
  • Snapshot
  • Record-append
  • GFS does not implement the POSIX API
  • User-space servers & client library
  • No file data caching at clients
  • because only marginal benefit is expected
  • Only file metadata are cached at clients.
  • No special caching at chunk-servers
  • Chunks are stored as local Linux files, so the OS
    buffer cache is employed
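
Since GFS is accessed through a user-space client library rather than the POSIX API, an application-facing interface could look roughly like the sketch below; every class and method name here is hypothetical (the real library is not public):

# Hypothetical GFS-style client library (all names invented for
# illustration; GFS does not implement the POSIX API and its real
# client library is not public).

class GFSClient:
    def create(self, path): ...
    def delete(self, path): ...
    def open(self, path, mode="r"): ...          # returns an opaque handle
    def close(self, handle): ...
    def read(self, handle, offset, length): ...
    def write(self, handle, offset, data): ...   # write at a client-chosen offset
    def record_append(self, handle, data): ...   # GFS picks the offset, returns it
    def snapshot(self, src_path, dst_path): ...  # copy-on-write snapshot

# Typical producer/consumer idiom: many writers append records to
# one file; readers later stream through it sequentially.
# client = GFSClient()
# h = client.open("/logs/crawl-00", mode="a")
# offset = client.record_append(h, b"one self-identifying record")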

17
GFS Architecture (I)
- Files are divided into fixed-size chunks, each
  stored at a number of chunk-servers (default: 3)
- Immutable, globally unique 64-bit handle per chunk
- Chunk-servers store chunks on local disks as Linux
  files & read or write chunk data specified by a
  chunk handle and byte range.
18
GFS Architecture (II)
  • The master maintains all file system metadata.
  • namespace
  • access control information
  • mapping from files to chunks
  • current locations of chunks
  • Also controls system-wide activities
  • Chunk-lease management
  • garbage collection of orphaned chunks
  • Chunk-migration between chunkservers.
  • The master periodically communicates with each
    chunk-server in HeartBeat messages to give it
    instructions & collect its state.

19
GFS Architecture (III)
  • Clients never read/write file data through the
    master.
  • Instead, a client asks the master which
    chunk-servers it should contact.
  • Using the fixed chunk size, the client translates
    the file name & byte offset specified by the
    application into a chunk index within the file
    (see the sketch below).
  • Then, it sends the master a request containing
    the file name & chunk index.
  • The master replies with the corresponding chunk
    handle & locations of the replicas.
  • In fact, the client typically asks for multiple
    chunks in the same request & the master can also
    include the information for chunks immediately
    following those requested.
  • This extra information sidesteps several future
    client-master interactions at practically no
    extra cost.
  • The client caches this information using the file
    name & chunk index as the key.
  • It caches this information for a limited time &
    interacts with the chunk-servers directly for
    many subsequent operations.
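
A minimal sketch of the client-side translation and metadata cache described above; the master is faked with an in-memory stub, and the 64 MB chunk size is taken from the next slide:

CHUNK_SIZE = 64 * 1024 * 1024            # fixed 64 MB chunk size

class FakeMaster:
    """Stand-in for the GFS master: maps (file name, chunk index)
    to a chunk handle plus the locations of its replicas."""
    def lookup(self, filename, chunk_index):
        handle = hash((filename, chunk_index)) & 0xFFFFFFFFFFFFFFFF
        return handle, ["cs-12", "cs-47", "cs-91"]      # 3 replicas

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}                  # (file name, chunk index) -> metadata

    def locate(self, filename, byte_offset):
        # Translate (file name, byte offset) into a chunk index.
        chunk_index = byte_offset // CHUNK_SIZE
        key = (filename, chunk_index)
        if key not in self.cache:        # ask the master only on a cache miss
            self.cache[key] = self.master.lookup(filename, chunk_index)
        handle, replicas = self.cache[key]
        return chunk_index, handle, replicas

c = Client(FakeMaster())
print(c.locate("/data/crawl-0", 200 * 1024 * 1024))     # offset falls in chunk 3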

20
Chunk size: 64 MB
  • Much larger than in traditional file systems
  • Large chunk size benefits
  • Reduces the need for client-master interaction
  • Sequential access is most common
  • The client can cache the location info of all
    chunks even for a multi-TB working set
  • On a large chunk, the client is more likely to
    perform multiple reads/writes
  • Client keeps a persistent TCP connection to a
    chunk-server over an extended period of time
  • Reduces volume of metadata kept at the master
  • Allows master to keep metadata in-memory
  • Disadvantages
  • Potential for chunk-servers holding small files
    to become hot-spots
  • Waste of disk space, due to fragmentation

21
Metadata
  • Metadata are kept in in-memory data structures
  • Persistent
  • Namespace
  • File-to-chunk mapping
  • Operation log is used to record mutations
  • Stored at local disk and replicated to remote
    machines
  • Volatile
  • Location of replicas
  • Info requested by the master at start-up and when
    a new chunk-server joins the system
  • Periodic state scan
  • Chunk garbage collection
  • Chunk re-replication
  • Handling chunk-server failures
  • Chunk migration
  • Balancing load & disk space usage

22
Master operations
  • Chunk locations are volatile
  • A chunk-server has the final word over what
    chunks it does or does not have on its own disks.
  • There is no point in trying to maintain a
    consistent view of this information on the master
    because errors on a chunk-server may cause chunks
    to vanish spontaneously or an operator may rename
    a chunk-server.
  • Operation log
  • Defines logical timeline
  • Do not make changes visible to clients until
    metadata changes are made persistent
  • Replicate the log on multiple remote machines &
    respond to a client operation only after flushing
    the corresponding log record to disk both locally
    & remotely.
  • The master batches several log records together
    before flushing, thereby reducing the impact of
    flushing and replication on overall system
    throughput
  • Recovery by replay of the operation log
  • Keep log size manageable by checkpointing
  • Checkpoint is a compact B-tree, ready to map into
    memory & use to answer lookup requests
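
A hedged sketch of the "flush locally & remotely before acknowledging" rule together with batching; disks and remote machines are stood in for by Python lists, and all names are illustrative:

class OperationLog:
    """Metadata mutations become visible only after the log record
    has been flushed locally and on every remote log replica."""
    def __init__(self, remote_replicas):
        self.local = []                  # stands in for the local disk
        self.remotes = remote_replicas   # stand in for remote machines
        self.pending = []                # records batched before the next flush

    def append(self, record):
        self.pending.append(record)      # not yet visible to clients

    def flush(self):
        # Batch several records per flush to cut the per-record cost
        # of flushing and replication.
        batch, self.pending = self.pending, []
        self.local.extend(batch)         # "flush" to local disk
        for replica in self.remotes:     # ...and to every remote replica
            replica.extend(batch)
        return batch                     # only now acknowledge the clients

log = OperationLog(remote_replicas=[[], []])
log.append({"op": "create", "path": "/a"})
log.append({"op": "rename", "src": "/a", "dst": "/b"})
acked = log.flush()
print(len(acked), log.local == log.remotes[0])          # 2 True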

23
Consistency Model (I)
  • File namespace mutations are atomic
  • Handled exclusively by the master
  • The state of a file region after a data mutation
    depends on the type of mutation, whether it
    succeeds or fails, and whether there are
    concurrent mutations.

24
Consistency Model (II)
  • Consistent file region
  • All clients see the same data, regardless of
    which replica they read from
  • Defined file region
  • Region is consistent and
  • Clients see what the data mutation writes in its
    entirety
  • The mutation executed without interference from
    concurrent writes
  • Undefined file region
  • Mingled fragments from multiple mutations
  • Different clients may see different data at
    different times

25
Consistency Model (III)
  • Write
  • Causes data to be written at an
    application-defined file offset
  • Record-append
  • Causes data to be appended atomically, at least
    once, even in the presence of concurrent
    mutations, but at an offset of the system's
    choosing
  • GFS may insert duplicates or padding
  • Apply mutations to a chunk in the same order at
    all its replicas
  • Use chunk version numbers to detect stale chunks
    (that have missed mutations)

26
Consistency Model (IV)
  • Application idioms
  • Rely on record-appends rather than overwrites
  • Checkpointing & self-identifying, self-validating
    records
  • Consistent mutation order using leases
  • Master grants a chunk lease to primary replica
  • New lease → increment chunk version number
  • Primary assigns serial numbers to mutations
  • Decouple control-flow from data-flow
  • Data is pushed linearly along a chain of
    chunk-servers in a pipelined fashion
  • Fully utilize each server's outbound B/W
  • Each server forwards data to its closest server
    that has not received it
  • Avoid network bottlenecks & high-latency links
    (e.g. inter-switch links)
  • Pipelining: once a server receives some data, it
    immediately starts forwarding it (see the sketch
    below)
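
A small sketch of the pipelined data push: replicas are chained so that each hop goes to the closest server that has not yet received the data. Network distance is faked here with a one-dimensional coordinate:

def build_chain(client_pos, replicas):
    """Order the replicas so that each hop forwards to the closest
    server that has not received the data yet (greedy nearest
    neighbour); positions stand in for real network topology."""
    chain, current, remaining = [], client_pos, dict(replicas)
    while remaining:
        nxt = min(remaining, key=lambda r: abs(remaining[r] - current))
        chain.append(nxt)
        current = remaining.pop(nxt)
    return chain

replicas = {"cs-a": 12, "cs-b": 3, "cs-c": 9}   # server -> "distance" coordinate
print(build_chain(client_pos=2, replicas=replicas))     # ['cs-b', 'cs-c', 'cs-a']
# With pipelining, each server starts forwarding as soon as data
# arrives, so pushing B bytes over R hops takes roughly B/T + R*L
# (T = per-link throughput, L = per-hop latency).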

27
Consistency Model (V)
  • Atomic record-append
  • Client pushes data to all replicas of the last
    chunk of the file
  • Client sends operation to the primary
  • Primary checks if appending the record would
    cause the chunk to exceed its max. allowable size
  • If so, it pads the chunk, tells replicas to do
    the same, and asks the client to retry on the next
    chunk
  • Otherwise
  • If the append fits the chunk, the primary
    performs the operation & instructs replicas to
    write the data at the exact same file offset
  • If a replica fails, the client will retry the
    operation
  • Thus different replicas of the same chunk may end
    up holding different data!
  • Snapshot
  • Deferred execution, copy-on-write
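
Returning to the record-append rules above, a hedged sketch of the primary's decision: pad and ask the client to retry on the next chunk if the record would not fit, otherwise write at one primary-chosen offset on every replica (chunks are in-memory stand-ins):

MAX_CHUNK = 64 * 1024 * 1024             # maximum chunk size (64 MB)

class Chunk:
    def __init__(self):
        self.data = bytearray()

def record_append(primary, replicas, record):
    """Primary-side decision; returns ('ok', offset) or ('retry', None)."""
    if len(primary.data) + len(record) > MAX_CHUNK:
        pad = MAX_CHUNK - len(primary.data)
        for c in [primary, *replicas]:    # pad the chunk on every replica
            c.data.extend(b"\0" * pad)
        return "retry", None              # client retries on the next chunk
    offset = len(primary.data)            # offset chosen by the primary
    for c in [primary, *replicas]:        # same offset imposed on every replica
        c.data[offset:offset] = record
    return "ok", offset

p, r1, r2 = Chunk(), Chunk(), Chunk()
print(record_append(p, [r1, r2], b"record-1"))          # ('ok', 0)
print(record_append(p, [r1, r2], b"record-2"))          # ('ok', 8)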

28
Replica placement
  • Maximize data availability and
  • Maximize network B/W utilization
  • It is not enough to spread replicas across
    machines.
  • We must also spread replicas across racks
  • Initial placement, based on several factors
  • Place new replicas on servers with below-average
    disk utilization
  • Keep limit on number of recent chunk creations on
    each server
  • keep cloning traffic from overwhelming client
    traffic
  • Re-replication & re-balancing
  • By priority
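
A small sketch of an initial-placement policy in the spirit of the factors above: skip servers with too many recent creations, prefer low disk utilization, and spread the chosen replicas across racks; the thresholds and scoring are assumptions, not GFS's actual policy:

def place_replicas(servers, n=3, max_recent=5):
    """servers: dicts with 'name', 'rack', 'disk_util', 'recent_creations'.
    Returns up to n chunk-server names to host the new replicas."""
    # Skip servers with too many recent creations (keeps cloning traffic
    # from overwhelming client traffic), then prefer lightly used disks.
    candidates = sorted(
        (s for s in servers if s["recent_creations"] < max_recent),
        key=lambda s: s["disk_util"])
    chosen, racks = [], set()
    for s in candidates:
        if s["rack"] in racks:            # spread replicas across racks
            continue
        chosen.append(s["name"])
        racks.add(s["rack"])
        if len(chosen) == n:
            break
    return chosen

servers = [
    {"name": "cs1", "rack": "r1", "disk_util": 0.40, "recent_creations": 1},
    {"name": "cs2", "rack": "r1", "disk_util": 0.20, "recent_creations": 2},
    {"name": "cs3", "rack": "r2", "disk_util": 0.55, "recent_creations": 0},
    {"name": "cs4", "rack": "r3", "disk_util": 0.90, "recent_creations": 9},
]
print(place_replicas(servers))   # ['cs2', 'cs3'] -- cs1 shares cs2's rack, cs4 is busy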

29
Fault tolerance & diagnosis
  • Master
  • Replication of operation log & checkpoints
  • Shadow masters for read-only file access
  • Chunk-servers
  • It would be impractical to detect data corruption
    by comparison with other replicas!
  • Checksums to detect data corruption
  • Chunk-server verifies integrity of data
    independently
  • 32-bit checksum per 64 KB data block
  • Diagnostic tools
  • Logs of significant events & all RPC interactions
  • By matching requests with replies & collating RPC
    records on different machines, we can reconstruct
    the entire interaction history to diagnose a
    problem.
  • Logs readily serve as traces for load testing &
    performance analysis
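
A minimal sketch of the per-block checksumming mentioned above (a 32-bit checksum per 64 KB block, verified by the chunk-server itself); CRC-32 from Python's standard library is used as a convenient 32-bit stand-in, since the slide does not name the actual algorithm:

import zlib

BLOCK = 64 * 1024                        # 64 KB checksum granularity

def block_checksums(data):
    """One 32-bit checksum per 64 KB block of a chunk."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, checksums):
    """A chunk-server checks its own copy before serving a read,
    independently of the other replicas."""
    return block_checksums(data) == checksums

chunk = bytes(200 * 1024)                # a 200 KB chunk -> 4 blocks
sums = block_checksums(chunk)
print(verify(chunk, sums))                               # True
corrupted = chunk[:100] + b"\x01" + chunk[101:]          # flip one byte
print(verify(corrupted, sums))                           # False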

30
References
  • S. Brin & L. Page, The Anatomy of a Large-Scale
    Hypertextual Web Search Engine, Proc. 7th WWW
    Conf., 1998
  • L.A. Barroso, J. Dean & U. Hölzle, Web Search for
    a Planet: The Google Cluster Architecture, IEEE
    Micro, 2003
  • S. Ghemawat, H. Gobioff & S.-T. Leung, The Google
    File System, Proc. ACM SOSP, 2003