Title: Distributed Data Storage
1. Distributed Data Storage & Access
- Zachary G. Ives
- University of Pennsylvania
- CIS 455 / 555 Internet and Web Systems
- October 11, 2020
Some slide content courtesy Tanenbaum & van Steen
2. Reminders
- Homework 2 Milestone 1 deadline imminent
- Homework 2 Milestone 2 due Monday after Spring Break
- Wed: Marie Jacob on the Q query answering system
- Next week: Spring Break
3. Building Over a DHT
- Message-passing architecture to coordinate behavior among different nodes in an application
- Send a request to the owner of a key
- Request contains a custom-formatted message type
- Each node has an event handler loop (sketched below):
- switch (msg.type)
- case one: ...
- case two: ...
- The request handler may send back a result, as appropriate
- Requires that the message include info about who the requestor was and how to return the data
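A minimal sketch of such a node in Java (the message types, fields, and the send()/store()/lookup() helpers are illustrative, not any particular DHT's API):

    // Hypothetical DHT node: pulls messages off a queue and dispatches
    // on the message type; replies go back to the requestor named in
    // the message. All types and helpers here are illustrative.
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class Message {
        enum Type { PUT, GET, GET_REPLY }
        Type type;
        String key, value;
        String requestor;   // who to send any result back to
        Message(Type t, String k, String v, String from) {
            type = t; key = k; value = v; requestor = from;
        }
    }

    class Node {
        private final BlockingQueue<Message> inbox = new LinkedBlockingQueue<>();

        void deliver(Message msg) { inbox.add(msg); }  // called by the network layer

        void eventLoop() throws InterruptedException {
            while (true) {
                Message msg = inbox.take();            // block until a message arrives
                switch (msg.type) {
                    case PUT:
                        store(msg.key, msg.value);
                        break;
                    case GET:
                        // send the result back to the requestor
                        send(msg.requestor, new Message(
                            Message.Type.GET_REPLY, msg.key, lookup(msg.key), selfId()));
                        break;
                    case GET_REPLY:
                        handleReply(msg);
                        break;
                }
            }
        }

        void store(String k, String v) { /* ... */ }
        String lookup(String k) { return null; /* ... */ }
        void send(String to, Message m) { /* network send */ }
        String selfId() { return "node-0"; }
        void handleReply(Message m) { /* ... */ }
    }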
4. Example: How Do We Create a Hash Table (Hash Multiset) Abstraction?
- We want the following:
- put(key, value)
- remove(key)
- valueSet get(key)
- How can we use Pastry to do this? (see the sketch below)
- route()
- deliver()
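One way to layer the multiset on a Pastry-style API, sketched in Java: put/remove/get requests are routed toward hash(key), and the key's owner applies them in deliver(). The DhtMessage type and helper methods are hypothetical; real FreePastry signatures differ.

    // Hash multiset over a Pastry-style DHT: each operation is routed
    // toward hash(key); the node that owns that ID applies it locally.
    import java.util.*;

    class HashMultisetApp {
        private final Map<String, Set<String>> localStore = new HashMap<>();

        // Client side: route a request toward the key's owner
        void put(String key, String value) {
            route(hashToId(key), new DhtMessage(DhtMessage.Op.PUT, key, value));
        }
        void remove(String key) {
            route(hashToId(key), new DhtMessage(DhtMessage.Op.REMOVE, key, null));
        }
        void get(String key) {   // reply comes back asynchronously (not shown)
            route(hashToId(key), new DhtMessage(DhtMessage.Op.GET, key, null));
        }

        // Server side: Pastry invokes deliver() on the node owning hash(key)
        void deliver(DhtMessage msg) {
            switch (msg.op) {
                case PUT:
                    localStore.computeIfAbsent(msg.key, k -> new HashSet<>()).add(msg.value);
                    break;
                case REMOVE:
                    localStore.remove(msg.key);
                    break;
                case GET:
                    // route the value set back to the requestor (not shown)
                    break;
            }
        }

        private byte[] hashToId(String key) { return new byte[20]; /* e.g., SHA-1 */ }
        private void route(byte[] id, DhtMessage m) { /* hand off to Pastry */ }
    }

    class DhtMessage {
        enum Op { PUT, REMOVE, GET }
        final Op op; final String key, value;
        DhtMessage(Op op, String key, String value) {
            this.op = op; this.key = key; this.value = value;
        }
    }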
5. An Alternate Programming Abstraction: GFS & MapReduce
- Abstraction: Instead of sending messages, different pieces of code communicate through files
- Code is going to take a very stylized form: at each stage, each machine will get input from files and send output to files
- Files are generally persistent and name-able (in contrast to DHT messages, which are transient)
- Files consist of blocks, which are the basic unit of partitioning (in contrast to object / data item IDs)
6. Background: Distributed Filesystems
- Many distributed filesystems have been developed
- NFS and SMB are the most prevalent today
- The Andrew File System (AFS) was also fairly popular
- Hundreds of other research filesystems, e.g., Coda and Sprite, with different properties
7. NFS in a Nutshell
- (Single) server, multi-client architecture
- Server is stateless, so clients must send all context (including the position to read from) in each request
- Plugs into the VFS APIs; mostly mimics UNIX semantics
- Opening a file requires a lookup in each directory along the way
- fd = open("/x/y/z.txt") will do a:
- lookup for x from the root handle
- lookup for y from x's handle
- lookup for z from y's handle
- Server must commit writes immediately
- Client does heavy caching; requires frequent polling for validity and/or use of an external locking service
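A sketch of that per-component traversal (LOOKUP is a real NFS operation; the FileHandle type and helper below are illustrative):

    // NFS-style pathname traversal: one LOOKUP RPC per path component,
    // each starting from the handle returned by the previous one.
    class NfsClient {
        FileHandle open(String path, FileHandle rootHandle) {
            FileHandle current = rootHandle;
            for (String component : path.substring(1).split("/")) {
                current = lookup(current, component);  // RPC to the stateless server
            }
            return current;  // handle for the final component, e.g. z.txt
        }
        FileHandle lookup(FileHandle dir, String name) { /* LOOKUP RPC */ return null; }
    }
    class FileHandle { /* opaque, server-issued */ }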
8. The Google File System (GFS)
- Goals:
- Support millions of huge (many-TB) files
- Partition & replicate data across thousands of unreliable machines, in multiple racks (and even data centers)
- Willing to make some compromises to get there
- Modified APIs; doesn't plug into the POSIX APIs
- In fact, relies on being built over the Linux file system
- Doesn't provide transparent consistency to apps!
- App must detect duplicate or bad records, support checkpoints
- Performance is only good with a particular class of apps:
- Stream-based reads
- Atomic record appends
9. GFS Basic Architecture & Lookups
- Files broken into 64MB chunks
- Master stores metadata; 3 chunkservers store each chunk
- A single flat file namespace maps to chunks & replicas
- As with Napster, actual data transfer is from chunkservers to the client
- No client-side caching!
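A sketch of how a client read works under this design, assuming the 64MB chunk size above (the RPC names and ChunkLocations type are illustrative):

    // GFS-style read: the client computes which chunk the offset falls
    // in, asks the master for that chunk's replicas, then reads the data
    // directly from a chunkserver (the master never sees file data).
    class GfsClient {
        static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64MB chunks

        byte[] read(String file, long offset, int length) {
            long chunkIndex  = offset / CHUNK_SIZE;   // which chunk holds this offset
            long chunkOffset = offset % CHUNK_SIZE;   // position within that chunk
            ChunkLocations locs = askMaster(file, chunkIndex);
            return readFromChunkserver(locs, chunkOffset, length);
        }
        ChunkLocations askMaster(String file, long idx) { /* RPC to master */ return null; }
        byte[] readFromChunkserver(ChunkLocations l, long off, int len) { return null; }
    }
    class ChunkLocations { long chunkHandle; int version; String[] replicaServers; }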
10. The Master: Metadata and Versions
- Controls (and locks as appropriate):
- Mapping from files → chunks within each namespace
- Controls reallocation and garbage collection of chunks
- Maintains a log (replicated to backups) of all mutations to the above
- Also knows the mapping from chunk ID → <version, machines>
- Doesn't have persistent knowledge of what's on chunkservers
- Instead, during startup, it polls them
- Or, when one joins, it registers
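One plausible shape for these tables, sketched as in-memory Java maps (names are illustrative; the real master also journals namespace mutations to its replicated log):

    import java.util.*;

    // The master's core metadata, kept in memory and rebuilt at startup
    // by polling chunkservers (chunk locations are NOT persisted).
    class MasterMetadata {
        // namespace: file name -> ordered list of chunk IDs
        Map<String, List<Long>> fileToChunks = new HashMap<>();

        // chunk ID -> <version, machines currently holding a replica>
        Map<Long, ChunkInfo> chunkInfo = new HashMap<>();

        // Called when a chunkserver starts up or joins: it reports what it holds
        void registerChunkserver(String server, Map<Long, Integer> reportedChunks) {
            reportedChunks.forEach((chunkId, version) -> {
                ChunkInfo info = chunkInfo.computeIfAbsent(chunkId, id -> new ChunkInfo());
                if (version >= info.version) {   // ignore stale replicas
                    info.version = version;
                    info.machines.add(server);
                }
            });
        }
    }
    class ChunkInfo {
        int version;
        Set<String> machines = new HashSet<>();
    }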
11. Chunkservers
- Each holds replicas of some of the chunks
- For a given write operation, one of the owners of the chunk gets a lease (becomes the primary) and all others become secondaries
- The primary receives requests for mutations:
- Assigns an order
- Notifies the secondary nodes
- Waits for all to say they received the message
- Responds with a write-succeeded message
- Failure results in inconsistent data!!
12. A Write Operation
- Client asks the master for the lease-owning chunkserver
- Master gives IDs of the primary and secondary chunkservers; client caches them
- Client sends its data to all replicas, in any order
- Once the client gets ACKs, it requests the primary to do a write of those data items; the primary assigns serial numbers to these operations
- Primary forwards the write to the secondaries (in a chain)
- Secondaries reply SUCCESS
- Primary replies to the client
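The client side of this sequence, sketched in Java (all RPC stubs are illustrative):

    // GFS-style write, from the client's perspective: data flows to every
    // replica first; the primary then orders and commits the mutation.
    class GfsWriteClient {
        static class Lease { String primary = ""; String[] secondaries = {}; }
        static class WriteResult { boolean success; }

        void write(long chunkHandle, byte[] data) {
            // 1. Ask the master which chunkserver holds the lease (cacheable)
            Lease lease = findLeaseHolder(chunkHandle);

            // 2. Push the data to all replicas, in any order; each chunkserver
            //    buffers it until the commit in step 3
            pushData(lease.primary, chunkHandle, data);
            for (String s : lease.secondaries) pushData(s, chunkHandle, data);

            // 3. Ask the primary to commit: it assigns a serial number, applies
            //    the write, and forwards it to the secondaries in a chain
            WriteResult r = commitAt(lease.primary, chunkHandle);

            // 4. On failure, some replicas may have applied the write and others
            //    not (an inconsistent region); the client retries or reports it
            if (!r.success) { /* retry with backoff, or surface the error */ }
        }
        Lease findLeaseHolder(long chunk) { /* RPC to master */ return new Lease(); }
        void pushData(String server, long chunk, byte[] d) { /* RPC to chunkserver */ }
        WriteResult commitAt(String primary, long chunk) { /* write RPC */ return new WriteResult(); }
    }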
13. Append
- GFS supports an atomic append that multiple machines can use at the same time
- The primary will interleave the requests in any order
- Each record will be written at least once!
- Primary determines a position for the write and forwards this to the secondaries
14. Failures and the Client
- If there is a failure in a record write or append, the client will generally retry
- If there was partial success in a previous append, there might be more than one copy on some nodes, and inconsistency
- The client must handle this through checksums, record IDs, and periodic checkpointing
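A sketch of what that application-level handling might look like on the read path (entirely illustrative; GFS itself leaves this logic to the app):

    import java.util.*;

    // Readers filter duplicates left behind by retried appends: each writer
    // stamps records with a unique ID, and readers skip IDs they've seen.
    class DedupingReader {
        private final Set<UUID> seen = new HashSet<>();

        // Returns true if the record is new; false if corrupt or a duplicate.
        boolean accept(UUID recordId, byte[] payload, long storedChecksum) {
            if (checksum(payload) != storedChecksum) return false;  // partial/bad record: skip
            return seen.add(recordId);      // false if this record was already processed
        }
        long checksum(byte[] data) {
            java.util.zip.CRC32 crc = new java.util.zip.CRC32();
            crc.update(data);
            return crc.getValue();
        }
    }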
15. GFS Performance
- Many performance numbers in the paper
- Not enough context here to discuss them in much detail; we would need to see how they compare with other approaches!
- But they validate high scalability in terms of concurrent reads and concurrent appends, with data partitioned and replicated across many machines
- They also show fast recovery from failed nodes
- Not the only approach to many of these problems, but one shown to work at industrial strength!
16. A Popular Distributed Programming Model: MapReduce
- In many circles, considered the key building block for much of Google's data analysis
- A programming language built on it: Sawzall, http://labs.google.com/papers/sawzall.html
- "Sawzall has become one of the most widely used programming languages at Google. On one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8 PB) and wrote 9.9×10^12 bytes (9.3 TB)."
- Other similar languages: Yahoo's Pig Latin and Pig; Microsoft's Dryad
- Cloned in open source: Hadoop, http://hadoop.apache.org/core/
- So what is it? What's it good for?
17. MapReduce: Simple Distributed Functional Programming Primitives
- Modeled after Lisp primitives: map (apply a function to all items in a collection) and reduce (apply a function to a set of items with a common key)
- We start with:
- A user-defined function to be applied to all data: map (key, value) → (key, value)
- Another user-specified operation: reduce (key, set of values) → result
- A set of n nodes, each with data
- All nodes run map on all of their data, producing new data with keys
- This data is collected by key, then shuffled and reduced
- Dataflow is through temp files on GFS
18. Some Example Tasks
- Count word occurrences
- Map: output word with count 1
- Reduce: sum the counts
- Distributed grep: all lines matching a pattern
- Map: filter by pattern
- Reduce: output the set
- Count URL access frequency
- Map: output each URL as key, with count 1
- Reduce: sum the counts
- For each IP address, get the document with the most in-links
- Number of queries by IP address (requires multiple steps)
19. MapReduce Dataflow Diagram (Default MapReduce Uses the Filesystem)
(Diagram: a coordinator oversees data partitions split by key, map computation partitions, a redistribution of map outputs by key, and reduce computation partitions)
20. Some Details
- Fewer computation partitions than data partitions
- All data is accessible via a distributed filesystem with replication
- Worker nodes produce data in key order (makes it easy to merge)
- The master is responsible for scheduling, keeping all nodes busy
- The master knows how many data partitions there are, and which have completed atomic commits to disk
- Fault tolerance: the master triggers re-execution of work originally performed by failed nodes, to make their data available again
- Locality: the master tries to do work on nodes that have replicas of the data
21. Hadoop: A Modern Open-Source Clone of MapReduce & GFS
- Underlying Hadoop: HDFS, a page-level replicating filesystem
- Modeled in part after GFS
- Supports streaming page access from each site
- Master/slave: Namenode vs. Datanodes
Source: Hadoop HDFS architecture documentation
22. Hadoop HDFS & MapReduce
Source: Meet Hadoop, Devaraj Das, Yahoo! Bangalore & Apache
23. Hadoop MapReduce Architecture
- Jobtracker (Master)
- Accepts jobs submitted by users
- Gives tasks to Tasktrackers; makes scheduling decisions, co-locates tasks with data
- Monitors task and tracker status; re-executes tasks if needed
- Tasktrackers (Slaves)
- Run Map and Reduce tasks
- Manage storage and transmission of intermediate output
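For reference, a driver that submits a word-count job through the org.apache.hadoop.mapreduce API; the Jobtracker (in classic Hadoop) then schedules its tasks near the HDFS data. TokenizerMapper and IntSumReducer stand for the user-defined mapper and reducer classes (as in Hadoop's standard tutorial example), assumed to be defined elsewhere:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Submits the job to the cluster; the framework handles scheduling,
    // data locality, and re-execution of failed tasks.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);   // user-defined Mapper subclass
            job.setReducerClass(IntSumReducer.class);    // user-defined Reducer subclass
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);        // block until done
        }
    }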
24. How Does this Relate to DHTs?
- Consider replacing the filesystem with the DHT
25. What Does MapReduce Do Well?
- What are its strengths?
- What about weaknesses?
26. MapReduce is a Particular Programming Model
- But it's not especially general (though things like Pig Latin improve it)
- Suppose we have autonomous application components that wish to communicate
- We've already seen a few strategies:
- Request/response from client to server
- HTTP itself
- Asynchronous messages
- Router gossip protocols
- P2P finger tables, etc.
- Are there general mechanisms and principles?
- (Of course!)
- Let's first look at what happens if we need in-order messaging
27. Message-Queuing Model (1)
- Four combinations for loosely-coupled communications using queues
(Figure 2-26 from Tanenbaum & van Steen)
28. Message-Queuing Model (2)
- Basic interface to a queue in a message-queuing system:
- Put: Append a message to a specified queue
- Get: Block until the specified queue is nonempty, and remove the first message
- Poll: Check a specified queue for messages, and remove the first; never block
- Notify: Install a handler to be called when a message is put into the specified queue
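The same four primitives, written as a Java interface (a minimal sketch; real APIs such as JMS spread these across several types):

    // The four message-queuing primitives from the list above.
    interface MessageQueue {
        void put(Message m);                       // append a message to the queue

        Message get() throws InterruptedException; // block until nonempty, remove first

        Message poll();                            // remove the first message, or return
                                                   // null immediately; never blocks

        void notify(MessageHandler handler);       // install a callback invoked
                                                   // whenever a message is put
    }
    interface MessageHandler { void onMessage(Message m); }
    class Message { byte[] body; }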
29. General Architecture of a Message-Queuing System (1)
- The relationship between queue-level addressing and network-level addressing
30. General Architecture of a Message-Queuing System (2)
- The general organization of a message-queuing system with routers
(Figure 2-29 from Tanenbaum & van Steen)
31. Benefits of Message Queueing
- Allows both synchronous (blocking) and asynchronous (polling or event-driven) communication
- Ensures messages are delivered (or at least readable) in the order received
- The basis of many transactional systems
- e.g., Microsoft Message Queuing (MSMQ), IBM MQSeries, etc.