Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Case Study: Google
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Google cluster architecture
- 15,000 commodity-class PCs running custom fault-tolerant software
- Replication of services across many different machines
- Automatic fault detection & handling
- rather than a smaller number of high-end servers
- Overall system throughput vs. single-thread performance
- Throughput-oriented workload
- Easy parallelism
- Raw data: several 10s of TBs
- Different queries can run on different processors
- The overall index is partitioned, so that a single query can use multiple processors
- Key metrics
- Energy efficiency
- Price/performance ratio
3. Google query-serving architecture
- Coordination of query execution
- Formatting of results
- DNS-based load balancing among geographically distributed data centers
- H/W-based load balancer at each cluster
4. Query execution
- Phase I
- Cluster of index servers
- consult an inverted index that maps each query keyword to a set of matching documents
- Hit-list per keyword
- Intersection of hit-lists
- Computation of relevance rankings
- Result: ordered list of doc-ids
- Phase II
- Cluster of document servers
- Fetch each document from disk
- Extract URL, title, keyword-in-context snippet
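Phase I above can be sketched in a few lines. This is a minimal illustration, not Google's implementation: the index contents are made-up placeholders, and real relevance ranking is far more elaborate than the doc-id ordering used here.

```python
# Minimal sketch of Phase-I query execution: an inverted index maps each
# keyword to a hit-list of doc-ids, and a multi-keyword query intersects
# the hit-lists. Index contents here are illustrative placeholders.
inverted_index = {
    "distributed": [2, 5, 9, 14],
    "systems":     [1, 5, 9, 21],
}

def lookup(query_keywords):
    hit_lists = [set(inverted_index.get(kw, [])) for kw in query_keywords]
    matching = set.intersection(*hit_lists) if hit_lists else set()
    # A real ranker scores each matching document; here the "ranking"
    # is simply ascending doc-id order.
    return sorted(matching)

print(lookup(["distributed", "systems"]))  # [5, 9]
```

With a partitioned index, each shard runs this lookup over its own subset of documents and a front-end merges the per-shard results.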
5. Index servers
- The inverted index is divided into shards
- each having a randomly chosen subset of documents from the overall index
- Pool of machines per shard
- Intermediate load balancer
- Each query goes to a subset of the machines assigned to each shard
- If a replica goes down, the load balancer will avoid using it until it is recovered/replaced, but all parts of the index still remain available
- During the downtime, system capacity is reduced in proportion to the fraction of total capacity that this machine represented.
6. Document servers
- Partition the processing of documents
- Randomly distribute documents into smaller shards
- Multiple servers per shard
- Route requests through a load balancer
- Access an online, low-latency copy of the entire Web
- actually, multiple copies!
7. Replication for capacity (I)
- Mostly read-only accesses to the index & other data structures
- Relatively infrequent updates
- Sidestep consistency issues by temporarily diverting queries from a server until an update is completed
- Divide the query stream into multiple streams, each handled by a cluster
- One lookup for matching documents → many lookups for matching documents in much smaller indices, followed by a merging step
8. Replication for capacity (II)
- Adding machines to each pool increases serving capacity
- Adding shards accommodates index growth
- Divide the computation among multiple CPUs & disks
- No communication/dependency between shards
- Nearly linear speedup
- The CPU speed of individual machines does not limit search performance
- We can increase the number of shards to accommodate slower CPUs
9. Design principles
- Software reliability
- rather than high-quality components: RAID, redundant power supplies, ...
- Replication for better throughput and availability
- Purchase the CPU generation that currently gives the best performance per unit price
- not the CPUs that give the best absolute performance
- Using commodity PCs reduces the cost of computation
- making it affordable to use more computational resources per query
- Search a larger index
- Employ elaborate ranking algorithms
10. Rack organization
- 40-80 x86-class custom-made PCs
- Several CPU generations are in active service
- 533 MHz Celeron, ..., dual 1.4 GHz Pentium III
- IDE 80 GB hard disks, one or more per machine
- 100 Mbps Ethernet switch on each side of a rack
- Each switch has one or two uplink connections to a gigabit switch that interconnects all the racks
Selection criterion: cost per query (capital expense + operating costs)
Realistically, a server will not last more than 2-3 years, due to disparities in performance when compared with newer machines.
11. A rough cost comparison
- Rack of 88 dual-CPU 2-GHz Intel Xeon servers, with 2 GB of RAM & one 80-GB HDD each: $278,000 → monthly capital cost of $7,700 over a period of 3 years
- Capacity: 176 CPUs, 176 GB of RAM, 7 TB of disk space
- Multiprocessor with 8 2-GHz Xeon CPUs, 64 GB of RAM, and 8 TB of disk space: $758,000 → monthly capital cost of $21,000 over a period of 3 years
- 22 times fewer CPUs, 3 times less RAM
Much of the cost difference derives from the much higher interconnect bandwidth & reliability of a high-end server.
- Google's highly redundant architecture does not rely on either of these attributes.
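The quoted monthly figures follow from straight-line amortization of each purchase price over 3 years (36 months); a quick check of the arithmetic:

```python
# Amortizing each purchase price over 36 months reproduces the quoted
# (rounded) monthly capital costs from the comparison above.
rack_price = 278_000       # rack of 88 commodity servers, in dollars
server_price = 758_000     # 8-way high-end multiprocessor, in dollars
months = 36

print(round(rack_price / months))    # 7722  -> quoted as ~$7,700/month
print(round(server_price / months))  # 21056 -> quoted as ~$21,000/month
print(round(server_price / rack_price, 1))  # ~2.7x the price, far less capacity
```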
12. Power consumption & cooling
- Power consumption of a dual 1.4-GHz Pentium III server: 90 W of DC power under load
- 55 W for the two CPUs, 10 W for a disk drive, 25 W for DRAM & motherboard
- 120 W of AC power per server
- Efficiency of the ATX power supply: 75%
- Power consumption per rack: 10 kW
- A rack fits in 25 ft² of space
- Power density: 400 W/ft²
- With higher-end processors, the power density of a rack can exceed 700 W/ft²
Typical data centers: power density of 150 W/ft²
What counts is watts per unit of performance, not watts alone.
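The figures above are internally consistent, as a short computation shows (taking the 80-server end of the 40-80 servers-per-rack range):

```python
# Reproducing the power-density arithmetic from the figures above.
dc_watts = 55 + 10 + 25        # CPUs + disk + DRAM/motherboard = 90 W DC
ac_watts = dc_watts / 0.75     # 75%-efficient ATX supply -> 120 W AC
servers_per_rack = 80          # upper end of the 40-80 range
rack_watts = ac_watts * servers_per_rack   # 9,600 W, i.e. ~10 kW per rack
rack_area_ft2 = 25
print(rack_watts / rack_area_ft2)          # 384 W/ft^2, i.e. ~400 W/ft^2
```

At 384 W/ft², the rack draws over twice the 150 W/ft² that a typical data center is provisioned for, which is why cooling is called out alongside power.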
13. H/W-level application characteristics
The index application traverses dynamic data structures & control flow is data-dependent, creating a significant number of difficult-to-predict branches.
There isn't that much exploitable instruction-level parallelism (ILP) in the workload.
- Exploit thread-level parallelism
- Processing each query shares mostly read-only data with the rest of the system, and constitutes a work unit that requires little communication.
- SMT: simultaneous multithreading
- CMP: chip multiprocessor
Early experiments with a dual-context Intel Xeon: 30% improvement
14. The Google File System (GFS)
- Component failures are the norm rather than the exception
- constant monitoring, error detection, fault tolerance & automatic recovery
- Huge files
- Each file typically contains many application objects, such as web documents
- A few million files, of size 100 MB or larger
- Fast-growing data sets of many TBs
- Most files are mutated by appending new data
- rather than overwriting existing data
- Random writes within a file are practically non-existent.
- Once written, the files are only read, often only sequentially.
- Co-design of applications & the GFS API
- Relaxed consistency model
- Atomic file-append operation
15. GFS workloads
- Large streaming reads
- Individual operations typically read hundreds of KBs, more commonly 1 MB or more
- Successive operations from the same client often read through a contiguous region of a file
- Small random reads
- typically read a few KBs at some arbitrary offset
- Many large sequential writes that append data to files
- Once written, a file is seldom modified again
- Files are often used as producer-consumer queues, or for N-way merging
- Need for well-defined semantics for multiple processes appending to the same file
- High sustained throughput is more important than low latency
16. GFS API
- Files are organized in a hierarchy of directories, identified by pathnames
- Operations: create, delete, open, close, read, write
- Snapshot
- Record-append
- GFS does not implement the POSIX API
- User-space servers & client library
- No file-data caching at clients
- because marginal benefit is expected
- Only file metadata are cached at clients.
- No special caching at chunk-servers
- Chunks are stored as local files, so the OS buffer cache is employed
17. GFS Architecture (I)
- Files are divided into fixed-size chunks, each stored at a number of chunk-servers (default: 3)
- Immutable, globally unique 64-bit handle per chunk
- Chunk-servers store chunks on local disks as Linux files
- read or write chunk data specified by a chunk handle and byte range
18. GFS Architecture (II)
- The master maintains all file system metadata
- namespace
- access control information
- mapping from files to chunks
- current locations of chunks
- Also controls system-wide activities
- Chunk-lease management
- Garbage collection of orphaned chunks
- Chunk migration between chunk-servers
- The master periodically communicates with each chunk-server in HeartBeat messages, to give it instructions & collect its state.
19. GFS Architecture (III)
- Clients never read/write file data through the master.
- Instead, a client asks the master which chunk-servers it should contact.
- Using the fixed chunk size, the client translates the file name & byte offset specified by the application into a chunk index within the file.
- Then, it sends the master a request containing the file name & chunk index.
- The master replies with the corresponding chunk handle & locations of the replicas.
- In fact, the client typically asks for multiple chunks in the same request, & the master can also include the information for chunks immediately following those requested.
- This extra information sidesteps several future client-master interactions at practically no extra cost.
- The client caches this information, using the file name & chunk index as the key.
- It caches this information for a limited time & interacts with the chunk-servers directly for many subsequent operations.
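The client-side translation described above reduces to integer division by the fixed chunk size. A minimal sketch (the file name and function names are illustrative, not from the GFS client library):

```python
# With a fixed chunk size, a (file name, byte offset) pair maps to a
# chunk index; the client sends (file name, chunk index) to the master
# and keeps the offset within the chunk for the chunk-server request.
CHUNK_SIZE = 64 * 2**20  # 64 MB

def to_chunk_request(file_name, byte_offset):
    chunk_index = byte_offset // CHUNK_SIZE
    offset_in_chunk = byte_offset % CHUNK_SIZE
    return (file_name, chunk_index, offset_in_chunk)

# A read at byte 200,000,000 falls in chunk 2, since chunks 0 and 1
# cover the first 128 MB of the file.
print(to_chunk_request("/logs/crawl-00", 200_000_000))
```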
20. Chunk size: 64 MB
- Much larger than in traditional file systems
- Benefits of a large chunk size
- Reduces the need for client-master interaction
- Sequential access is most common
- The client can cache the location info of all chunks, even for a multi-TB working set
- On a large chunk, the client is more likely to perform multiple reads/writes
- The client keeps a persistent TCP connection to a chunk-server over an extended period of time
- Reduces the volume of metadata kept at the master
- Allows the master to keep metadata in memory
- Disadvantages
- Potential for chunk-servers holding small files to become hot-spots
- Waste of disk space, due to fragmentation
21. Metadata
- Metadata are kept in in-memory data structures
- Persistent
- Namespace
- File-to-chunk mapping
- An operation log is used to record mutations
- Stored on local disk and replicated to remote machines
- Volatile
- Locations of replicas
- Info requested by the master at start-up and when a new chunk-server joins the system
- Periodic state scan
- Chunk garbage collection
- Chunk re-replication
- Handling chunk-server failures
- Chunk migration
- Balancing load & disk-space usage
22. Master operations
- Chunk locations are volatile
- A chunk-server has the final word over what chunks it does or does not have on its own disks.
- There is no point in trying to maintain a consistent view of this information on the master, because errors on a chunk-server may cause chunks to vanish spontaneously, or an operator may rename a chunk-server.
- Operation log
- Defines a logical timeline
- Do not make changes visible to clients until the metadata changes are made persistent
- Replicate the log on multiple remote machines & respond to a client operation only after flushing the corresponding log record to disk, both locally & remotely.
- The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.
- Recovery by replay of the operation log
- Keep log size manageable by checkpointing
- A checkpoint is a compact B-tree, ready to be mapped into memory & used to answer lookup requests
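The batching idea above is commonly known as group commit; here is a minimal toy sketch (class and method names are assumptions, not GFS internals) showing how several mutations can share a single flush:

```python
# Toy sketch of batched log flushing ("group commit"): instead of one
# flush per mutation, the master accumulates pending records and makes
# them durable together, amortizing the disk/replication round-trip.
class OperationLog:
    def __init__(self):
        self.pending = []   # records not yet durable
        self.durable = []   # records flushed (locally & to remote replicas)
        self.flushes = 0

    def append(self, record):
        self.pending.append(record)

    def flush(self):
        # In GFS this point corresponds to writing the batch to local
        # disk and to the remote log replicas; only then do the batched
        # operations become visible to clients.
        self.durable.extend(self.pending)
        self.pending.clear()
        self.flushes += 1

log = OperationLog()
for op in ["create /a", "create /b", "delete /a"]:
    log.append(op)
log.flush()                           # one flush covers all three mutations
print(log.flushes, len(log.durable))  # 1 3
```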
23. Consistency Model (I)
- File namespace mutations are atomic
- Handled exclusively by the master
- The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.
24. Consistency Model (II)
- Consistent file region
- All clients see the same data, regardless of which replica they read from
- Defined file region
- The region is consistent, and
- clients see what the data mutation writes in its entirety
- The mutation executed without interference from concurrent writes
- Undefined file region
- Mingled fragments from multiple mutations
- Different clients may see different data at different times
25. Consistency Model (III)
- Write
- Causes data to be written at an application-specified file offset
- Record-append
- Causes data to be appended atomically, at least once, even in the presence of concurrent mutations, but at an offset of the system's choosing
- GFS may insert duplicates or padding
- Apply mutations to a chunk in the same order at all its replicas
- Use chunk version numbers to detect stale replicas (that have missed mutations)
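Stale-replica detection via version numbers can be sketched in a few lines (the replica names and helper function are illustrative assumptions):

```python
# Sketch of stale-replica detection: the master bumps the chunk version
# on each new lease, so a replica that missed a mutation (e.g. because
# its server was down) still carries an older version and is flagged.
def find_stale(master_version, replica_versions):
    return [r for r, v in sorted(replica_versions.items())
            if v < master_version]

replicas = {"cs1": 7, "cs2": 7, "cs3": 6}  # cs3 was down during a mutation
print(find_stale(7, replicas))  # ['cs3']
```

Stale replicas are excluded from the locations the master hands out, and are later garbage-collected and re-replicated.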
26. Consistency Model (IV)
- Application idioms
- Rely on record-appends rather than overwrites
- Checkpointing & self-identifying, self-validating records
- Consistent mutation order using leases
- The master grants a chunk lease to a primary replica
- New lease → increment the chunk version number
- The primary assigns serial numbers to mutations
- Decouple control flow from data flow
- Data is pushed linearly along a chain of chunk-servers, in a pipelined fashion
- Fully utilize each server's outbound bandwidth
- Each server forwards data to its closest server that has not received it yet
- Avoid network bottlenecks & high-latency links (e.g. inter-switch links)
- Pipelining: once a server receives some data, it immediately starts forwarding it
27. Consistency Model (V)
- Atomic record-append
- The client pushes data to all replicas of the last chunk of the file
- The client sends the operation to the primary
- The primary checks if appending the record would cause the chunk to exceed its max. allowable size
- If so, it pads the chunk, tells the replicas to do the same, and asks the client to retry on the next chunk
- Otherwise, if the append fits in the chunk, the primary performs the operation & instructs the replicas to write the data at the exact same file offset
- If a replica fails, the client will retry the operation
- Thus we may get different data for the same chunk!
- Snapshot
- Deferred execution, copy-on-write
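The primary's record-append decision above can be sketched as a single branch (function and return-value names are illustrative, not GFS internals):

```python
# Sketch of the primary's record-append decision: pad and ask the client
# to retry on a fresh chunk if the record would overflow the current
# chunk; otherwise choose the offset and apply it at all replicas.
CHUNK_SIZE = 64 * 2**20  # 64 MB

def record_append(chunk_used, record_len, chunk_size=CHUNK_SIZE):
    if chunk_used + record_len > chunk_size:
        # Pad the remainder of the chunk at all replicas; the client
        # retries on the next chunk. This is one source of the padding
        # that readers must detect and skip.
        return ("pad_and_retry", chunk_size - chunk_used)
    # All replicas write the record at this exact same offset, which is
    # what makes the append atomic and "at least once".
    return ("append_at", chunk_used)

print(record_append(chunk_used=100, record_len=50))
# ('append_at', 100)
print(record_append(chunk_used=CHUNK_SIZE - 10, record_len=50))
# ('pad_and_retry', 10)
```

A failed replica makes the client retry the whole append, which is how duplicates (and replica-to-replica byte differences in padded/duplicated regions) can arise.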
28. Replica placement
- Maximize data availability, and
- maximize network bandwidth utilization
- It is not enough to spread replicas across machines.
- We must also spread replicas across racks.
- Initial placement, based on several factors
- Place new replicas on servers with below-average disk utilization
- Keep a limit on the number of recent chunk creations on each server
- keep cloning traffic from overwhelming client traffic
- Re-replication & re-balancing
- By priority
29. Fault tolerance & diagnosis
- Master
- Replication of the operation log & checkpoints
- Shadow masters for read-only file access
- Chunk-servers
- It would be impractical to detect data corruption by comparison with the other replicas!
- Checksums to detect data corruption
- Each chunk-server verifies the integrity of its data independently
- 32-bit checksum per 64 KB data block
- Diagnostic tools
- Logs of significant events & all RPC interactions
- By matching requests with replies & collating RPC records on different machines, we can reconstruct the entire interaction history to diagnose a problem.
- Logs readily serve as traces for load testing & performance analysis
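The per-block checksumming scheme can be sketched as follows; CRC-32 is used here as a concrete example of a 32-bit checksum, without claiming it is the exact function GFS uses:

```python
# Sketch of per-block integrity checking: one 32-bit checksum is kept
# for each 64 KB block of a chunk, so a chunk-server can verify data it
# serves without consulting other replicas.
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB

def block_checksums(chunk_data):
    # One CRC-32 value per 64 KB block (the last block may be shorter).
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

chunk = bytes(200 * 1024)            # a 200 KB chunk -> 4 blocks
sums = block_checksums(chunk)
print(len(sums))                     # 4

# Verification on read: recompute and compare before returning data.
corrupted = bytearray(chunk)
corrupted[70_000] ^= 0xFF            # flip one byte in the second block
print(block_checksums(bytes(corrupted)) == sums)  # False
```

Keeping the checksum granularity at 64 KB means a small read only needs its own blocks verified, and a detected mismatch can be repaired from another replica.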
30. References
- S. Brin & L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Proc. 7th WWW Conf., 1998
- L.A. Barroso, J. Dean & U. Hölzle, "Web Search for a Planet: The Google Cluster Architecture", IEEE Micro, 2003
- S. Ghemawat, H. Gobioff & S.-T. Leung, "The Google File System", Proc. ACM SOSP, 2003