Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Case Study: Google
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Google cluster architecture
- 15,000 commodity-class PCs running custom fault-tolerant software
- Replication of services across many different machines
- Automatic fault detection & handling
- rather than a smaller number of high-end servers
- Overall system throughput vs. single-thread performance
- Throughput-oriented workload
- Easy parallelism
- Raw data: several 10s of TBs
- Different queries can run on different processors
- The overall index is partitioned, so that a single query can use multiple processors
- Key metrics
- Energy efficiency
- Price/performance ratio
3. Google query-serving architecture
- Coordination of query execution
- Formatting of results
- DNS-based load balancing among geographically distributed data centers
- H/W-based load balancer at each cluster
4. Query execution
- Phase I
- Cluster of index servers
- consult an inverted index that maps each query keyword to a set of matching documents
- Hit-list per keyword
- Intersection of hit-lists
- Computation of relevance rankings
- Result: ordered list of doc-ids
- Phase II
- Cluster of document servers
- Fetch each document from disk
- Extract URL, title, keyword-in-context snippet
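Phase I above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the index contents are made-up placeholders, and real relevance ranking is far more elaborate than the doc-id ordering used here.

```python
# Minimal sketch of Phase-I query execution: an inverted index maps each
# keyword to a hit-list of doc-ids, and a multi-keyword query intersects
# the hit-lists. Index contents here are illustrative placeholders.
inverted_index = {
    "distributed": [2, 5, 9, 14],
    "systems":     [1, 5, 9, 21],
}

def lookup(query_keywords):
    hit_lists = [set(inverted_index.get(kw, [])) for kw in query_keywords]
    matching = set.intersection(*hit_lists) if hit_lists else set()
    # A real ranker scores each matching document; here the "ranking"
    # is simply ascending doc-id order.
    return sorted(matching)

print(lookup(["distributed", "systems"]))  # [5, 9]
```

With a partitioned index, each shard runs this lookup over its own subset of documents and a front-end merges the per-shard results.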
5. Index servers
- The inverted index is divided into shards
- each having a randomly chosen subset of documents from the overall index
- Pool of machines per shard
- Intermediate load balancer
- Each query goes to a subset of the machines assigned to each shard
- If a replica goes down, the load balancer will avoid using it until it is recovered/replaced, but all parts of the index still remain available
- During the downtime, system capacity is reduced in proportion to the fraction of total capacity that this machine represented.
6. Document servers
- Partition the processing of documents
- Randomly distribute documents into smaller shards
- Multiple servers per shard
- Route requests through a load balancer
- Access an online, low-latency copy of the entire Web
- actually, multiple copies!
7. Replication for capacity (I)
- Mostly read-only accesses to the index & other data structures
- Relatively infrequent updates
- Sidestep consistency issues by temporarily diverting queries from a server until an update is completed
- Divide the query stream into multiple streams, each handled by a cluster
- One lookup for matching documents → many lookups for matching documents in much smaller indices, followed by a merging step
8. Replication for capacity (II)
- Adding machines to each pool increases serving capacity
- Adding shards accommodates index growth
- Divide the computation among multiple CPUs & disks
- No communication/dependency between shards
- Nearly linear speedup
- The CPU speed of individual machines does not limit search performance
- We can increase the number of shards to accommodate slower CPUs
9. Design principles
- Software reliability
- rather than high-quality components: RAID, redundant power supplies, ...
- Replication for better throughput and availability
- Purchase the CPU generation that currently gives the best performance per unit price
- not the CPUs that give the best absolute performance
- Using commodity PCs reduces the cost of computation
- making it affordable to use more computational resources per query
- Search a larger index
- Employ elaborate ranking algorithms
10. Rack organization
- 40-80 x86-class custom-made PCs
- Several CPU generations are in active service
- 533 MHz Celeron, ..., dual 1.4 GHz Pentium III
- IDE 80 GB hard disks, one or more per machine
- 100 Mbps Ethernet switch on each side of a rack
- Each switch has one or two uplink connections to a gigabit switch that interconnects all the racks
Selection criterion: cost per query (capital expense + operating costs)
Realistically, a server will not last more than 2-3 years, due to disparities in performance when compared with newer machines.
11. A rough cost comparison
- Rack of 88 dual-CPU 2-GHz Intel Xeon servers, with 2 GB of RAM & one 80-GB HDD each: $278,000 → monthly capital cost of $7,700 over a period of 3 years
- Capacity: 176 CPUs, 176 GB of RAM, 7 TB of disk space
- Multiprocessor with 8 2-GHz Xeon CPUs, 64 GB of RAM, and 8 TB of disk space: $758,000 → monthly capital cost of $21,000 over a period of 3 years
- 22 times fewer CPUs, 3 times less RAM
Much of the cost difference derives from the much higher interconnect bandwidth & reliability of a high-end server.
- Google's highly redundant architecture does not rely on either of these attributes.
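The quoted monthly figures follow from straight-line amortization of each purchase price over 3 years (36 months); a quick check of the arithmetic:

```python
# Amortizing each purchase price over 36 months reproduces the quoted
# (rounded) monthly capital costs from the comparison above.
rack_price = 278_000       # rack of 88 commodity servers, in dollars
server_price = 758_000     # 8-way high-end multiprocessor, in dollars
months = 36

print(round(rack_price / months))    # 7722  -> quoted as ~$7,700/month
print(round(server_price / months))  # 21056 -> quoted as ~$21,000/month
print(round(server_price / rack_price, 1))  # ~2.7x the price, far less capacity
```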
12. Power consumption & cooling
- Power consumption of a dual 1.4-GHz Pentium III server: 90 W of DC power under load
- 55 W for the two CPUs, 10 W for a disk drive, 25 W for DRAM & motherboard
- 120 W of AC power per server
- Efficiency of the ATX power supply: 75%
- Power consumption per rack: 10 kW
- A rack fits in 25 ft² of space
- Power density: 400 W/ft²
- With higher-end processors, the power density of a rack can exceed 700 W/ft²
Typical data centers: power density of 150 W/ft²
What counts is watts per unit of performance, not watts alone.
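The figures above are internally consistent, as a short computation shows (taking the 80-server end of the 40-80 servers-per-rack range):

```python
# Reproducing the power-density arithmetic from the figures above.
dc_watts = 55 + 10 + 25        # CPUs + disk + DRAM/motherboard = 90 W DC
ac_watts = dc_watts / 0.75     # 75%-efficient ATX supply -> 120 W AC
servers_per_rack = 80          # upper end of the 40-80 range
rack_watts = ac_watts * servers_per_rack   # 9,600 W, i.e. ~10 kW per rack
rack_area_ft2 = 25
print(rack_watts / rack_area_ft2)          # 384 W/ft^2, i.e. ~400 W/ft^2
```

At 384 W/ft², the rack draws over twice the 150 W/ft² that a typical data center is provisioned for, which is why cooling is called out alongside power.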
13. H/W-level application characteristics
The index application traverses dynamic data structures & control flow is data-dependent, creating a significant number of difficult-to-predict branches.
There isn't that much exploitable instruction-level parallelism (ILP) in the workload.
- Exploit thread-level parallelism
- Processing each query shares mostly read-only data with the rest of the system, and constitutes a work unit that requires little communication.
- SMT: simultaneous multithreading
- CMP: chip multiprocessor
Early experiments with a dual-context Intel Xeon: 30% improvement
14. The Google File System (GFS)
- Component failures are the norm rather than the exception
- constant monitoring, error detection, fault tolerance & automatic recovery
- Huge files
- Each file typically contains many application objects, such as web documents
- A few million files, of size 100 MB or larger
- Fast-growing data sets of many TBs
- Most files are mutated by appending new data
- rather than overwriting existing data
- Random writes within a file are practically non-existent.
- Once written, the files are only read, often only sequentially.
- Co-design of applications & the GFS API
- Relaxed consistency model
- Atomic file-append operation
15. GFS workloads
- Large streaming reads
- Individual operations typically read hundreds of KBs, more commonly 1 MB or more
- Successive operations from the same client often read through a contiguous region of a file
- Small random reads
- typically read a few KBs at some arbitrary offset
- Many large sequential writes that append data to files
- Once written, a file is seldom modified again
- Files are often used as producer-consumer queues, or for N-way merging
- Need for well-defined semantics for multiple processes appending to the same file
- High sustained throughput is more important than low latency
16. GFS API
- Files are organized in a hierarchy of directories, identified by pathnames
- Operations: create, delete, open, close, read, write
- Snapshot
- Record-append
- GFS does not implement the POSIX API
- User-space servers & client library
- No file-data caching at clients
- because marginal benefit is expected
- Only file metadata are cached at clients.
- No special caching at chunk-servers
- Chunks are stored as local files, so the OS buffer cache is employed
17. GFS Architecture (I)
- Files are divided into fixed-size chunks, each stored at a number of chunk-servers (default: 3)
- Immutable, globally unique 64-bit handle per chunk
- Chunk-servers store chunks on local disks as Linux files
- read or write chunk data specified by a chunk handle and byte range
18. GFS Architecture (II)
- The master maintains all file system metadata
- namespace
- access control information
- mapping from files to chunks
- current locations of chunks
- Also controls system-wide activities
- Chunk-lease management
- Garbage collection of orphaned chunks
- Chunk migration between chunk-servers
- The master periodically communicates with each chunk-server in HeartBeat messages, to give it instructions & collect its state.
19. GFS Architecture (III)
- Clients never read/write file data through the master.
- Instead, a client asks the master which chunk-servers it should contact.
- Using the fixed chunk size, the client translates the file name & byte offset specified by the application into a chunk index within the file.
- Then, it sends the master a request containing the file name & chunk index.
- The master replies with the corresponding chunk handle & locations of the replicas.
- In fact, the client typically asks for multiple chunks in the same request, & the master can also include the information for chunks immediately following those requested.
- This extra information sidesteps several future client-master interactions at practically no extra cost.
- The client caches this information, using the file name & chunk index as the key.
- It caches this information for a limited time & interacts with the chunk-servers directly for many subsequent operations.
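The client-side translation described above reduces to integer division by the fixed chunk size. A minimal sketch (the file name and function names are illustrative, not from the GFS client library):

```python
# With a fixed chunk size, a (file name, byte offset) pair maps to a
# chunk index; the client sends (file name, chunk index) to the master
# and keeps the offset within the chunk for the chunk-server request.
CHUNK_SIZE = 64 * 2**20  # 64 MB

def to_chunk_request(file_name, byte_offset):
    chunk_index = byte_offset // CHUNK_SIZE
    offset_in_chunk = byte_offset % CHUNK_SIZE
    return (file_name, chunk_index, offset_in_chunk)

# A read at byte 200,000,000 falls in chunk 2, since chunks 0 and 1
# cover the first 128 MB of the file.
print(to_chunk_request("/logs/crawl-00", 200_000_000))
```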
20. Chunk size: 64 MB
- Much larger than in traditional file systems
- Benefits of a large chunk size
- Reduces the need for client-master interaction
- Sequential access is most common
- The client can cache the location info of all chunks, even for a multi-TB working set
- On a large chunk, the client is more likely to perform multiple reads/writes
- The client keeps a persistent TCP connection to a chunk-server over an extended period of time
- Reduces the volume of metadata kept at the master
- Allows the master to keep metadata in memory
- Disadvantages
- Potential for chunk-servers holding small files to become hot-spots
- Waste of disk space, due to fragmentation
21. Metadata
- Metadata are kept in in-memory data structures
- Persistent
- Namespace
- File-to-chunk mapping
- An operation log is used to record mutations
- Stored on local disk and replicated to remote machines
- Volatile
- Locations of replicas
- Info requested by the master at start-up and when a new chunk-server joins the system
- Periodic state scan
- Chunk garbage collection
- Chunk re-replication
- Handling chunk-server failures
- Chunk migration
- Balancing load & disk-space usage
22. Master operations
- Chunk locations are volatile
- A chunk-server has the final word over what chunks it does or does not have on its own disks.
- There is no point in trying to maintain a consistent view of this information on the master, because errors on a chunk-server may cause chunks to vanish spontaneously, or an operator may rename a chunk-server.
- Operation log
- Defines a logical timeline
- Do not make changes visible to clients until the metadata changes are made persistent
- Replicate the log on multiple remote machines & respond to a client operation only after flushing the corresponding log record to disk, both locally & remotely.
- The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.
- Recovery by replay of the operation log
- Keep log size manageable by checkpointing
- A checkpoint is a compact B-tree, ready to be mapped into memory & used to answer lookup requests
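The batching idea above is commonly known as group commit; here is a minimal toy sketch (class and method names are assumptions, not GFS internals) showing how several mutations can share a single flush:

```python
# Toy sketch of batched log flushing ("group commit"): instead of one
# flush per mutation, the master accumulates pending records and makes
# them durable together, amortizing the disk/replication round-trip.
class OperationLog:
    def __init__(self):
        self.pending = []   # records not yet durable
        self.durable = []   # records flushed (locally & to remote replicas)
        self.flushes = 0

    def append(self, record):
        self.pending.append(record)

    def flush(self):
        # In GFS this point corresponds to writing the batch to local
        # disk and to the remote log replicas; only then do the batched
        # operations become visible to clients.
        self.durable.extend(self.pending)
        self.pending.clear()
        self.flushes += 1

log = OperationLog()
for op in ["create /a", "create /b", "delete /a"]:
    log.append(op)
log.flush()                           # one flush covers all three mutations
print(log.flushes, len(log.durable))  # 1 3
```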
23. Consistency Model (I)
- File namespace mutations are atomic
- Handled exclusively by the master
- The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.
24. Consistency Model (II)
- Consistent file region
- All clients see the same data, regardless of which replica they read from
- Defined file region
- The region is consistent, and
- clients see what the data mutation writes in its entirety
- The mutation executed without interference from concurrent writes
- Undefined file region
- Mingled fragments from multiple mutations
- Different clients may see different data at different times
25. Consistency Model (III)
- Write
- Causes data to be written at an application-specified file offset
- Record-append
- Causes data to be appended atomically, at least once, even in the presence of concurrent mutations, but at an offset of the system's choosing
- GFS may insert duplicates or padding
- Apply mutations to a chunk in the same order at all its replicas
- Use chunk version numbers to detect stale replicas (that have missed mutations)
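Stale-replica detection via version numbers can be sketched in a few lines (the replica names and helper function are illustrative assumptions):

```python
# Sketch of stale-replica detection: the master bumps the chunk version
# on each new lease, so a replica that missed a mutation (e.g. because
# its server was down) still carries an older version and is flagged.
def find_stale(master_version, replica_versions):
    return [r for r, v in sorted(replica_versions.items())
            if v < master_version]

replicas = {"cs1": 7, "cs2": 7, "cs3": 6}  # cs3 was down during a mutation
print(find_stale(7, replicas))  # ['cs3']
```

Stale replicas are excluded from the locations the master hands out, and are later garbage-collected and re-replicated.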
26. Consistency Model (IV)
- Application idioms
- Rely on record-appends rather than overwrites
- Checkpointing & self-identifying, self-validating records
- Consistent mutation order using leases
- The master grants a chunk lease to a primary replica
- New lease → increment the chunk version number
- The primary assigns serial numbers to mutations
- Decouple control flow from data flow
- Data is pushed linearly along a chain of chunk-servers, in a pipelined fashion
- Fully utilize each server's outbound bandwidth
- Each server forwards data to its closest server that has not received it yet
- Avoid network bottlenecks & high-latency links (e.g. inter-switch links)
- Pipelining: once a server receives some data, it immediately starts forwarding it
27. Consistency Model (V)
- Atomic record-append
- The client pushes data to all replicas of the last chunk of the file
- The client sends the operation to the primary
- The primary checks if appending the record would cause the chunk to exceed its max. allowable size
- If so, it pads the chunk, tells the replicas to do the same, and asks the client to retry on the next chunk
- Otherwise, if the append fits in the chunk, the primary performs the operation & instructs the replicas to write the data at the exact same file offset
- If a replica fails, the client will retry the operation
- Thus we may get different data for the same chunk!
- Snapshot
- Deferred execution, copy-on-write
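The primary's record-append decision above can be sketched as a single branch (function and return-value names are illustrative, not GFS internals):

```python
# Sketch of the primary's record-append decision: pad and ask the client
# to retry on a fresh chunk if the record would overflow the current
# chunk; otherwise choose the offset and apply it at all replicas.
CHUNK_SIZE = 64 * 2**20  # 64 MB

def record_append(chunk_used, record_len, chunk_size=CHUNK_SIZE):
    if chunk_used + record_len > chunk_size:
        # Pad the remainder of the chunk at all replicas; the client
        # retries on the next chunk. This is one source of the padding
        # that readers must detect and skip.
        return ("pad_and_retry", chunk_size - chunk_used)
    # All replicas write the record at this exact same offset, which is
    # what makes the append atomic and "at least once".
    return ("append_at", chunk_used)

print(record_append(chunk_used=100, record_len=50))
# ('append_at', 100)
print(record_append(chunk_used=CHUNK_SIZE - 10, record_len=50))
# ('pad_and_retry', 10)
```

A failed replica makes the client retry the whole append, which is how duplicates (and replica-to-replica byte differences in padded/duplicated regions) can arise.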
28. Replica placement
- Maximize data availability, and
- maximize network bandwidth utilization
- It is not enough to spread replicas across machines.
- We must also spread replicas across racks.
- Initial placement, based on several factors
- Place new replicas on servers with below-average disk utilization
- Keep a limit on the number of recent chunk creations on each server
- keep cloning traffic from overwhelming client traffic
- Re-replication & re-balancing
- By priority
29. Fault tolerance & diagnosis
- Master
- Replication of the operation log & checkpoints
- Shadow masters for read-only file access
- Chunk-servers
- It would be impractical to detect data corruption by comparison with the other replicas!
- Checksums to detect data corruption
- Each chunk-server verifies the integrity of its data independently
- 32-bit checksum per 64 KB data block
- Diagnostic tools
- Logs of significant events & all RPC interactions
- By matching requests with replies & collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem.
- Logs readily serve as traces for load testing & performance analysis
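The per-block checksumming scheme can be sketched as follows; CRC-32 is used here as a concrete example of a 32-bit checksum, without claiming it is the exact function GFS uses:

```python
# Sketch of per-block integrity checking: one 32-bit checksum is kept
# for each 64 KB block of a chunk, so a chunk-server can verify data it
# serves without consulting other replicas.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB

def block_checksums(chunk_data):
    # One CRC-32 value per 64 KB block (the last block may be shorter).
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

chunk = bytes(200 * 1024)            # a 200 KB chunk -> 4 blocks
sums = block_checksums(chunk)
print(len(sums))                     # 4

# Verification on read: recompute and compare before returning data.
corrupted = bytearray(chunk)
corrupted[70_000] ^= 0xFF            # flip one byte in the second block
print(block_checksums(bytes(corrupted)) == sums)  # False
```

Keeping the checksum granularity at 64 KB means a small read only needs its own blocks verified, and a detected mismatch can be repaired from another replica.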
30. References
- S. Brin & L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Proc. 7th WWW Conf., 1998
- L.A. Barroso, J. Dean & U. Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro, 2003
- S. Ghemawat, H. Gobioff & S.-T. Leung, "The Google File System", Proc. ACM SOSP, 2003