Title: Distributed Data Storage
1. Distributed Data Storage & Access
- Zachary G. Ives
- University of Pennsylvania
- CIS 455 / 555 Internet and Web Systems
- October 11, 2020
Some slide content courtesy Tanenbaum & van Steen
2. Reminders
- Homework 2 Milestone 1 deadline imminent
- Homework 2 Milestone 2 due Monday after Spring Break
- Wed: Marie Jacob on the Q query answering system
- Next week: Spring Break
3. Building Over a DHT
- Message-passing architecture to coordinate behavior among different nodes in an application
- Send a request to the owner of a key
- Request contains a custom-formatted message type
- Each node has an event handler loop (sketched below):
- switch (msg.type)
- case one: ...
- case two: ...
- The request handler may send back a result, as appropriate
- Requires that the message include info about who the requestor was and how to return the data
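A minimal sketch of such a node in Java (the message types, fields, and the send()/store()/lookup() helpers are illustrative, not any particular DHT's API):

    // Hypothetical DHT node: pulls messages off a queue and dispatches
    // on the message type; replies go back to the requestor named in
    // the message. All types and helpers here are illustrative.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class Message {
        enum Type { PUT, GET, GET_REPLY }
        Type type;
        String key, value;
        String requestor;   // who to send any result back to
        Message(Type t, String k, String v, String from) {
            type = t; key = k; value = v; requestor = from;
        }
    }

    class Node {
        private final BlockingQueue<Message> inbox = new LinkedBlockingQueue<>();

        void deliver(Message msg) { inbox.add(msg); }  // called by the network layer

        void eventLoop() throws InterruptedException {
            while (true) {
                Message msg = inbox.take();            // block until a message arrives
                switch (msg.type) {
                    case PUT:
                        store(msg.key, msg.value);
                        break;
                    case GET:
                        // send the result back to the requestor
                        send(msg.requestor, new Message(
                            Message.Type.GET_REPLY, msg.key, lookup(msg.key), selfId()));
                        break;
                    case GET_REPLY:
                        handleReply(msg);
                        break;
                }
            }
        }

        void store(String k, String v) { /* ... */ }
        String lookup(String k) { return null; /* ... */ }
        void send(String to, Message m) { /* network send */ }
        String selfId() { return "node-0"; }
        void handleReply(Message m) { /* ... */ }
    }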
4. Example: How Do We Create a Hash Table (Hash Multiset) Abstraction?
- We want the following:
- put(key, value)
- remove(key)
- valueSet get(key)
- How can we use Pastry to do this? (see the sketch below)
- route()
- deliver()
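One way to layer the multiset on a Pastry-style API, sketched in Java: put/remove/get requests are routed toward hash(key), and the key's owner applies them in deliver(). The DhtMessage type and helper methods are hypothetical; real FreePastry signatures differ.

    // Hash multiset over a Pastry-style DHT: each operation is routed
    // toward hash(key); the node that owns that ID applies it locally.
    import java.util.*;

    class HashMultisetApp {
        private final Map<String, Set<String>> localStore = new HashMap<>();

        // Client side: route a request toward the key's owner
        void put(String key, String value) {
            route(hashToId(key), new DhtMessage(DhtMessage.Op.PUT, key, value));
        }
        void remove(String key) {
            route(hashToId(key), new DhtMessage(DhtMessage.Op.REMOVE, key, null));
        }
        void get(String key) {   // reply comes back asynchronously (not shown)
            route(hashToId(key), new DhtMessage(DhtMessage.Op.GET, key, null));
        }

        // Server side: Pastry invokes deliver() on the node owning hash(key)
        void deliver(DhtMessage msg) {
            switch (msg.op) {
                case PUT:
                    localStore.computeIfAbsent(msg.key, k -> new HashSet<>()).add(msg.value);
                    break;
                case REMOVE:
                    localStore.remove(msg.key);
                    break;
                case GET:
                    // route the value set back to the requestor (not shown)
                    break;
            }
        }

        private byte[] hashToId(String key) { return new byte[20]; /* e.g., SHA-1 */ }
        private void route(byte[] id, DhtMessage m) { /* hand off to Pastry */ }
    }

    class DhtMessage {
        enum Op { PUT, REMOVE, GET }
        final Op op; final String key, value;
        DhtMessage(Op op, String key, String value) {
            this.op = op; this.key = key; this.value = value;
        }
    }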
5. An Alternate Programming Abstraction: GFS & MapReduce
- Abstraction: Instead of sending messages, different pieces of code communicate through files
- Code is going to take a very stylized form: at each stage, each machine will get input from files and send output to files
- Files are generally persistent and name-able (in contrast to DHT messages, which are transient)
- Files consist of blocks, which are the basic unit of partitioning (in contrast to object / data item IDs)
6. Background: Distributed Filesystems
- Many distributed filesystems have been developed
- NFS and SMB are the most prevalent today
- The Andrew File System (AFS) was also fairly popular
- Hundreds of other research filesystems, e.g., Coda and Sprite, with different properties
7. NFS in a Nutshell
- (Single) server, multi-client architecture
- Server is stateless, so clients must send all context (including the position to read from) in each request
- Plugs into the VFS APIs; mostly mimics UNIX semantics
- Opening a file requires a lookup in each directory along the way
- fd = open("/x/y/z.txt") will do a:
- lookup for x from the root handle
- lookup for y from x's handle
- lookup for z from y's handle
- Server must commit writes immediately
- Client does heavy caching; requires frequent polling for validity and/or use of an external locking service
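A sketch of that per-component traversal (LOOKUP is a real NFS operation; the FileHandle type and helper below are illustrative):

    // NFS-style pathname traversal: one LOOKUP RPC per path component,
    // each starting from the handle returned by the previous one.
    class NfsClient {
        FileHandle open(String path, FileHandle rootHandle) {
            FileHandle current = rootHandle;
            for (String component : path.substring(1).split("/")) {
                current = lookup(current, component);  // RPC to the stateless server
            }
            return current;  // handle for the final component, e.g. z.txt
        }
        FileHandle lookup(FileHandle dir, String name) { /* LOOKUP RPC */ return null; }
    }
    class FileHandle { /* opaque, server-issued */ }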
8. The Google File System (GFS)
- Goals:
- Support millions of huge (many-TB) files
- Partition & replicate data across thousands of unreliable machines, in multiple racks (and even data centers)
- Willing to make some compromises to get there
- Modified APIs; doesn't plug into the POSIX APIs
- In fact, relies on being built over the Linux file system
- Doesn't provide transparent consistency to apps!
- App must detect duplicate or bad records, support checkpoints
- Performance is only good with a particular class of apps:
- Stream-based reads
- Atomic record appends
9. GFS Basic Architecture & Lookups
- Files broken into 64MB chunks
- Master stores metadata; 3 chunkservers store each chunk
- A single flat file namespace maps to chunks & replicas
- As with Napster, actual data transfer is from chunkservers to the client
- No client-side caching!
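A sketch of how a client read works under this design, assuming the 64MB chunk size above (the RPC names and ChunkLocations type are illustrative):

    // GFS-style read: the client computes which chunk the offset falls
    // in, asks the master for that chunk's replicas, then reads the data
    // directly from a chunkserver (the master never sees file data).
    class GfsClient {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64MB chunks

        byte[] read(String file, long offset, int length) {
            long chunkIndex  = offset / CHUNK_SIZE;   // which chunk holds this offset
            long chunkOffset = offset % CHUNK_SIZE;   // position within that chunk
            ChunkLocations locs = askMaster(file, chunkIndex);
            return readFromChunkserver(locs, chunkOffset, length);
        }
        ChunkLocations askMaster(String file, long idx) { /* RPC to master */ return null; }
        byte[] readFromChunkserver(ChunkLocations l, long off, int len) { return null; }
    }
    class ChunkLocations { long chunkHandle; int version; String[] replicaServers; }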
10. The Master: Metadata and Versions
- Controls (and locks as appropriate):
- Mapping from files → chunks within each namespace
- Controls reallocation and garbage collection of chunks
- Maintains a log (replicated to backups) of all mutations to the above
- Also knows the mapping from chunk ID → <version, machines>
- Doesn't have persistent knowledge of what's on chunkservers
- Instead, during startup, it polls them
- Or, when one joins, it registers
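One plausible shape for these tables, sketched as in-memory Java maps (names are illustrative; the real master also journals namespace mutations to its replicated log):

    import java.util.*;

    // The master's core metadata, kept in memory and rebuilt at startup
    // by polling chunkservers (chunk locations are NOT persisted).
    class MasterMetadata {
        // namespace: file name -> ordered list of chunk IDs
        Map<String, List<Long>> fileToChunks = new HashMap<>();

        // chunk ID -> <version, machines currently holding a replica>
        Map<Long, ChunkInfo> chunkInfo = new HashMap<>();

        // Called when a chunkserver starts up or joins: it reports what it holds
        void registerChunkserver(String server, Map<Long, Integer> reportedChunks) {
            reportedChunks.forEach((chunkId, version) -> {
                ChunkInfo info = chunkInfo.computeIfAbsent(chunkId, id -> new ChunkInfo());
                if (version >= info.version) {   // ignore stale replicas
                    info.version = version;
                    info.machines.add(server);
                }
            });
        }
    }
    class ChunkInfo {
        int version;
        Set<String> machines = new HashSet<>();
    }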
11. Chunkservers
- Each holds replicas of some of the chunks
- For a given write operation, one of the owners of the chunk gets a lease (becomes the primary) and all others become secondaries
- The primary receives requests for mutations:
- Assigns an order
- Notifies the secondary nodes
- Waits for all to say they received the message
- Responds with a write-succeeded message
- Failure results in inconsistent data!!
12. A Write Operation
- Client asks the master for the lease-owning chunkserver
- Master gives IDs of the primary and secondary chunkservers; client caches them
- Client sends its data to all replicas, in any order
- Once the client gets ACKs, it requests the primary to do a write of those data items; the primary assigns serial numbers to these operations
- Primary forwards the write to the secondaries (in a chain)
- Secondaries reply SUCCESS
- Primary replies to the client
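The client side of this sequence, sketched in Java (all RPC stubs are illustrative):

    // GFS-style write, from the client's perspective: data flows to every
    // replica first; the primary then orders and commits the mutation.
    class GfsWriteClient {
        static class Lease { String primary = ""; String[] secondaries = {}; }
        static class WriteResult { boolean success; }

        void write(long chunkHandle, byte[] data) {
            // 1. Ask the master which chunkserver holds the lease (cacheable)
            Lease lease = findLeaseHolder(chunkHandle);

            // 2. Push the data to all replicas, in any order; each chunkserver
            //    buffers it until the commit in step 3
            pushData(lease.primary, chunkHandle, data);
            for (String s : lease.secondaries) pushData(s, chunkHandle, data);

            // 3. Ask the primary to commit: it assigns a serial number, applies
            //    the write, and forwards it to the secondaries in a chain
            WriteResult r = commitAt(lease.primary, chunkHandle);

            // 4. On failure, some replicas may have applied the write and others
            //    not (an inconsistent region); the client retries or reports it
            if (!r.success) { /* retry with backoff, or surface the error */ }
        }
        Lease findLeaseHolder(long chunk) { /* RPC to master */ return new Lease(); }
        void pushData(String server, long chunk, byte[] d) { /* RPC to chunkserver */ }
        WriteResult commitAt(String primary, long chunk) { /* write RPC */ return new WriteResult(); }
    }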
13. Append
- GFS supports an atomic append that multiple machines can use at the same time
- The primary will interleave the requests in any order
- Each record will be written at least once!
- Primary determines a position for the write and forwards this to the secondaries
14. Failures and the Client
- If there is a failure in a record write or append, the client will generally retry
- If there was partial success in a previous append, there might be more than one copy on some nodes, and inconsistency
- The client must handle this through checksums, record IDs, and periodic checkpointing
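A sketch of what that application-level handling might look like on the read path (entirely illustrative; GFS itself leaves this logic to the app):

    import java.util.*;

    // Readers filter duplicates left behind by retried appends: each writer
    // stamps records with a unique ID, and readers skip IDs they've seen.
    class DedupingReader {
        private final Set<UUID> seen = new HashSet<>();

        // Returns true if the record is new; false if corrupt or a duplicate.
        boolean accept(UUID recordId, byte[] payload, long storedChecksum) {
            if (checksum(payload) != storedChecksum) return false;  // partial/bad record: skip
            return seen.add(recordId);      // false if this record was already processed
        }
        long checksum(byte[] data) {
            java.util.zip.CRC32 crc = new java.util.zip.CRC32();
            crc.update(data);
            return crc.getValue();
        }
    }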
15. GFS Performance
- Many performance numbers in the paper
- Not enough context here to discuss them in much detail; we would need to see how they compare with other approaches!
- But they validate high scalability in terms of concurrent reads and concurrent appends, with data partitioned and replicated across many machines
- They also show fast recovery from failed nodes
- Not the only approach to many of these problems, but one shown to work at industrial strength!
16. A Popular Distributed Programming Model: MapReduce
- In many circles, considered the key building block for much of Google's data analysis
- A programming language built on it: Sawzall, http://labs.google.com/papers/sawzall.html
- "Sawzall has become one of the most widely used programming languages at Google. On one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8 PB) and wrote 9.9×10^12 bytes (9.3 TB)."
- Other similar languages: Yahoo's Pig Latin and Pig; Microsoft's Dryad
- Cloned in open source: Hadoop, http://hadoop.apache.org/core/
- So what is it? What's it good for?
17. MapReduce: Simple Distributed Functional Programming Primitives
- Modeled after Lisp primitives: map (apply a function to all items in a collection) and reduce (apply a function to a set of items with a common key)
- We start with:
- A user-defined function to be applied to all data: map (key, value) → (key, value)
- Another user-specified operation: reduce (key, set of values) → result
- A set of n nodes, each with data
- All nodes run map on all of their data, producing new data with keys
- This data is collected by key, then shuffled and reduced
- Dataflow is through temp files on GFS
18. Some Example Tasks
- Count word occurrences
- Map: output word with count 1
- Reduce: sum the counts
- Distributed grep: all lines matching a pattern
- Map: filter by pattern
- Reduce: output the set
- Count URL access frequency
- Map: output each URL as key, with count 1
- Reduce: sum the counts
- For each IP address, get the document with the most in-links
- Number of queries by IP address (requires multiple steps)
19. MapReduce Dataflow Diagram (Default MapReduce Uses the Filesystem)
(Diagram: a coordinator oversees data partitions split by key, map computation partitions, a redistribution of map outputs by key, and reduce computation partitions)
20. Some Details
- Fewer computation partitions than data partitions
- All data is accessible via a distributed filesystem with replication
- Worker nodes produce data in key order (makes it easy to merge)
- The master is responsible for scheduling, keeping all nodes busy
- The master knows how many data partitions there are, and which have completed atomic commits to disk
- Fault tolerance: the master triggers re-execution of work originally performed by failed nodes, to make their data available again
- Locality: the master tries to do work on nodes that have replicas of the data
21. Hadoop: A Modern Open-Source Clone of MapReduce & GFS
- Underlying Hadoop: HDFS, a page-level replicating filesystem
- Modeled in part after GFS
- Supports streaming page access from each site
- Master/slave: Namenode vs. Datanodes
Source: Hadoop HDFS architecture documentation
22. Hadoop HDFS & MapReduce
Source: Meet Hadoop, Devaraj Das, Yahoo! Bangalore & Apache
23. Hadoop MapReduce Architecture
- Jobtracker (Master)
- Accepts jobs submitted by users
- Gives tasks to Tasktrackers; makes scheduling decisions, co-locates tasks with data
- Monitors task and tracker status; re-executes tasks if needed
- Tasktrackers (Slaves)
- Run Map and Reduce tasks
- Manage storage and transmission of intermediate output
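For reference, a driver that submits a word-count job through the org.apache.hadoop.mapreduce API; the Jobtracker (in classic Hadoop) then schedules its tasks near the HDFS data. TokenizerMapper and IntSumReducer stand for the user-defined mapper and reducer classes (as in Hadoop's standard tutorial example), assumed to be defined elsewhere:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Submits the job to the cluster; the framework handles scheduling,
    // data locality, and re-execution of failed tasks.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);   // user-defined Mapper subclass
            job.setReducerClass(IntSumReducer.class);    // user-defined Reducer subclass
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);        // block until done
        }
    }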
24. How Does this Relate to DHTs?
- Consider replacing the filesystem with the DHT
25. What Does MapReduce Do Well?
- What are its strengths?
- What about weaknesses?
26. MapReduce is a Particular Programming Model
- But it's not especially general (though things like Pig Latin improve it)
- Suppose we have autonomous application components that wish to communicate
- We've already seen a few strategies:
- Request/response from client to server
- HTTP itself
- Asynchronous messages
- Router gossip protocols
- P2P finger tables, etc.
- Are there general mechanisms and principles?
- (Of course!)
- Let's first look at what happens if we need in-order messaging
27. Message-Queuing Model (1)
- Four combinations for loosely-coupled communications using queues
(Figure 2-26 from Tanenbaum & van Steen)
28. Message-Queuing Model (2)
- Basic interface to a queue in a message-queuing system:
- Put: Append a message to a specified queue
- Get: Block until the specified queue is nonempty, and remove the first message
- Poll: Check a specified queue for messages, and remove the first; never block
- Notify: Install a handler to be called when a message is put into the specified queue
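The same four primitives, written as a Java interface (a minimal sketch; real APIs such as JMS spread these across several types):

    // The four message-queuing primitives from the list above.
    interface MessageQueue {
        void put(Message m);                       // append a message to the queue

        Message get() throws InterruptedException; // block until nonempty, remove first

        Message poll();                            // remove the first message, or return
                                                   // null immediately; never blocks

        void notify(MessageHandler handler);       // install a callback invoked
                                                   // whenever a message is put
    }
    interface MessageHandler { void onMessage(Message m); }
    class Message { byte[] body; }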
29. General Architecture of a Message-Queuing System (1)
- The relationship between queue-level addressing and network-level addressing
30. General Architecture of a Message-Queuing System (2)
- The general organization of a message-queuing system with routers
(Figure 2-29 from Tanenbaum & van Steen)
31. Benefits of Message Queueing
- Allows both synchronous (blocking) and asynchronous (polling or event-driven) communication
- Ensures messages are delivered (or at least readable) in the order received
- The basis of many transactional systems
- e.g., Microsoft Message Queuing (MSMQ), IBM MQSeries, etc.