Transcript and Presenter's Notes

Title: Distributed Data Storage


1
Distributed Data Storage & Access
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 455 / 555 Internet and Web Systems
  • October 11, 2020

Some slide content courtesy of Tanenbaum & van Steen
2
Reminders
  • Homework 2 Milestone 1 deadline imminent
  • Homework 2 Milestone 2 due Monday after Spring
    Break
  • Wed: Marie Jacob on the Q query answering system
  • Next week: Spring Break

3
Building Over a DHT
  • Message-passing architecture to coordinate
    behavior among different nodes in an application
  • Send a request to the owner of a key
  • Request contains a custom-formatted message type
  • Each node has an event handler loop (see the
    sketch below)
  • switch (msg.type) { case ONE: ... case TWO: ... }
  • The request handler may send back a result, as
    appropriate
  • Requires that the message include info about who
    the requestor was and how to return the data
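
A minimal sketch of such an event-handler loop, in Java. The Message format, the incoming queue, and the send() helper are illustrative stand-ins rather than the API of any particular DHT library:

  import java.util.HashMap;
  import java.util.Map;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  // Minimal sketch of a DHT node's event-handler loop.
  // Message, MessageType, send(), and the local store are illustrative stand-ins.
  class NodeLoop {
      enum MessageType { PUT, GET, RESULT }

      static class Message {
          MessageType type;
          String key, value, requestor;   // requestor: who to send the reply to
          Message(MessageType t, String k, String v, String r) {
              type = t; key = k; value = v; requestor = r;
          }
      }

      private final BlockingQueue<Message> incoming = new LinkedBlockingQueue<>();
      private final Map<String, String> store = new HashMap<>();

      void run() throws InterruptedException {
          while (true) {
              Message msg = incoming.take();           // block until a request arrives
              switch (msg.type) {
                  case PUT:
                      store.put(msg.key, msg.value);   // this node owns the key: store locally
                      break;
                  case GET:
                      // Reply to the requestor recorded in the message
                      send(msg.requestor,
                           new Message(MessageType.RESULT, msg.key, store.get(msg.key), null));
                      break;
                  default:
                      break;                           // ignore unknown message types
              }
          }
      }

      void send(String dest, Message m) { /* route over the DHT to dest (placeholder) */ }
  }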

4
Example: How Do We Create a Hash Table (Hash
Multiset) Abstraction?
  • We want the following operations:
  • put (key, value)
  • remove (key)
  • valueSet get (key)
  • How can we use Pastry to do this?
  • route()
  • deliver()
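
One possible layering, assuming a Pastry-style API in which route(key, msg) forwards a message to the node that owns hash(key), and deliver(key, msg) is the upcall invoked at that owner. The Request class and the exact signatures below are assumptions for illustration; real Pastry implementations differ:

  import java.util.*;

  // Sketch: hash multiset over a Pastry-style overlay. route() and deliver()
  // stand in for the overlay's API; real signatures vary by implementation.
  class DhtMultiset {
      static class Request {
          String op, key, value, requestor;
          Request(String op, String key, String value, String requestor) {
              this.op = op; this.key = key; this.value = value; this.requestor = requestor;
          }
      }

      private final Map<String, Set<String>> local = new HashMap<>();  // this node's share

      // Client side: route each operation to the node that owns hash(key).
      void put(String key, String value)     { route(key, new Request("PUT", key, value, null)); }
      void remove(String key)                { route(key, new Request("REMOVE", key, null, null)); }
      void get(String key, String requestor) { route(key, new Request("GET", key, null, requestor)); }

      // Owner side: upcall when a routed message arrives at the key's owner.
      void deliver(String key, Request req) {
          switch (req.op) {
              case "PUT":
                  local.computeIfAbsent(key, k -> new HashSet<>()).add(req.value);
                  break;
              case "REMOVE":
                  local.remove(key);
                  break;
              case "GET":
                  Set<String> values = local.getOrDefault(key, Collections.emptySet());
                  // Ship the value set back by routing to the requestor's node ID
                  route(req.requestor, new Request("RESULT", key, String.join(",", values), null));
                  break;
          }
      }

      void route(String key, Request req) { /* overlay routing to hash(key)'s owner (placeholder) */ }
  }

The client never knows which node stores a key; it only routes, and the owner applies the mutation to its local portion of the multiset.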

5
An Alternate Programming Abstraction: GFS +
MapReduce
  • Abstraction: instead of sending messages,
    different pieces of code communicate through
    files
  • Code takes a very stylized form: at each stage,
    each machine gets input from files and sends
    output to files
  • Files are generally persistent and nameable (in
    contrast to DHT messages, which are transient)
  • Files consist of blocks, which are the basic unit
    of partitioning (in contrast to object / data
    item IDs)

6
Background: Distributed Filesystems
  • Many distributed filesystems have been developed
  • NFS, SMB are the most prevalent today
  • Andrew FileSystem (AFS) was also fairly popular
  • Hundreds of other research filesystems, e.g.,
    Coda, Sprite, with different properties

7
NFS in a Nutshell
  • (Single) server, multi-client architecture
  • Server is stateless, so clients must send all
    context (including position to read from) in each
    request
  • Plugs into VFS APIs, mostly mimics UNIX semantics
  • Opening a file requires opening each dir along
    the way
  • fd = open("/x/y/z.txt") will do a
  • lookup for x from the root handle
  • lookup for y from x's handle
  • lookup for z from y's handle
  • Server must commit writes immediately
  • Client does heavy caching: requires frequent
    polling for validity and/or use of an external
    locking service
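
To make the statelessness concrete, here is an illustrative client-side view of the open/read sequence above. Handle, lookup(), and read() are stand-ins for the NFS LOOKUP and READ RPCs, not a real client library; note that the client supplies the read offset on every call because the server keeps no open-file state:

  // Illustrative client-side view of stateless NFS access; Handle, lookup(),
  // and read() are stand-ins for the real RPCs, not an actual client API.
  class NfsSketch {
      static class Handle { String id; Handle(String id) { this.id = id; } }

      static Handle lookup(Handle dir, String name) { return new Handle(dir.id + "/" + name); }
      static byte[] read(Handle file, long offset, int count) { return new byte[count]; }

      public static void main(String[] args) {
          Handle root = new Handle("root");          // obtained at mount time
          Handle x = lookup(root, "x");              // one LOOKUP RPC per path component
          Handle y = lookup(x, "y");
          Handle z = lookup(y, "z.txt");
          // The client, not the server, tracks the file position and sends it each time.
          byte[] first = read(z, 0, 4096);
          byte[] next  = read(z, 4096, 4096);
      }
  }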

8
The Google File System (GFS)
  • Goals
  • Support millions of huge (many-TB) files
  • Partition & replicate data across thousands of
    unreliable machines, in multiple racks (and even
    data centers)
  • Willing to make some compromises to get there
  • Modified APIs: doesn't plug into POSIX APIs
  • In fact, relies on being built over the Linux file
    system
  • Doesn't provide transparent consistency to apps!
  • App must detect duplicate or bad records, support
    checkpoints
  • Performance is only good with a particular class
    of apps
  • Stream-based reads
  • Atomic record appends

9
GFS Basic Architecture & Lookups
  • Files broken into 64 MB chunks
  • Master stores metadata; 3 chunkservers store each
    chunk
  • A single flat file namespace maps to chunks &
    replicas
  • As with Napster, actual data transfer is from
    chunkservers to the client
  • No client-side caching!
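
Because chunks are a fixed 64 MB, a client can locally compute which chunk a byte offset falls into and ask the master only for that chunk's locations; a small sketch of that client-side arithmetic (the master lookup itself is not shown):

  class GfsClientSketch {
      static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks

      // A client maps a byte offset to a chunk index locally, then asks the master
      // only for that chunk's handle and replica locations (master call not shown).
      static long chunkIndex(long byteOffset) {
          return byteOffset / CHUNK_SIZE;
      }

      public static void main(String[] args) {
          // Reading at offset 200,000,000 falls in chunk index 2 (the third chunk).
          System.out.println(chunkIndex(200_000_000L));   // prints 2
      }
  }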

10
The Master: Metadata and Versions
  • Controls (and locks as appropriate)
  • Mapping from files → chunks within each
    namespace
  • Controls reallocation, garbage collection of
    chunks
  • Maintains a log (replicated to backups) of all
    mutations to the above
  • Also knows the mapping from chunk ID → <version,
    machines>
  • Doesn't have persistent knowledge of what's on
    chunkservers
  • Instead, during startup, it polls them
  • Or when one joins, it registers
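
The master's state can be pictured as two in-memory maps plus a replicated operation log. The sketch below shows that shape with made-up field names; it is not the actual GFS implementation:

  import java.util.*;

  // Illustrative shape of the GFS master's metadata (names invented for clarity).
  class MasterState {
      // Namespace: full path name -> ordered list of chunk handles.
      Map<String, List<Long>> fileToChunks = new HashMap<>();

      // Per-chunk info: version number and current replica locations.
      // This part is NOT persisted; it is rebuilt by polling chunkservers at
      // startup, or when a chunkserver joins and registers.
      static class ChunkInfo { long version; List<String> chunkservers = new ArrayList<>(); }
      Map<Long, ChunkInfo> chunkToInfo = new HashMap<>();

      // Every mutation to the namespace / chunk mapping is appended to an
      // operation log that is replicated to backups before being applied.
      List<String> operationLog = new ArrayList<>();
  }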

11
Chunkservers
  • Each holds replicas of some of the chunks
  • For a given write operation, one of the owners of
    the chunk gets a lease: it becomes the primary, and
    all others become secondaries
  • Receives requests for mutations
  • Assigns an order
  • Notifies the secondary nodes
  • Waits for all to say they received the message
  • Responds with a write-succeeded message
  • Failure results in inconsistent data!!

12
A Write Operation
  1. Client asks the Master for the lease-owning chunkserver
  2. Master gives the IDs of the primary and secondary
    chunkservers; the client caches them
  3. Client sends its data to all replicas, in any
    order
  4. Once the client gets ACKs, it asks the primary to do a
    write of those data items. The primary assigns
    serial numbers to these operations.
  5. Primary forwards the write to the secondaries (in a
    chain).
  6. Secondaries reply SUCCESS
  7. Primary replies to the client
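
Putting the steps together, the client side of a write has roughly the following shape. Every type and helper call here is a hypothetical stand-in for the real RPCs; the point is only the ordering of the steps:

  import java.util.ArrayList;
  import java.util.List;

  // Rough client-side shape of a GFS write; all names are illustrative stand-ins.
  class GfsWriteSketch {
      static class Lease {
          String primary = "chunkserver-1";                  // lease holder
          List<String> secondaries = new ArrayList<>();      // remaining replicas
          List<String> allReplicas() {
              List<String> all = new ArrayList<>(secondaries);
              all.add(primary);
              return all;
          }
      }

      // Steps 1-2: ask the master which chunkserver holds the lease (stubbed).
      Lease findLeaseHolder(String file, long chunkIndex) { return new Lease(); }
      // Step 3: push the raw data to a replica, which buffers it (stubbed).
      void pushData(String replica, byte[] data) { }
      // Steps 4-7: ask the primary to commit; it assigns serial numbers, forwards
      // the write to the secondaries, collects SUCCESS, and replies (stubbed).
      boolean writeRequest(String primary, String file, long chunkIndex) { return true; }

      boolean write(String file, long chunkIndex, byte[] data) {
          Lease lease = findLeaseHolder(file, chunkIndex);   // client caches this
          for (String replica : lease.allReplicas())
              pushData(replica, data);                       // data flows in any order
          return writeRequest(lease.primary, file, chunkIndex);
      }
  }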

13
Append
  • GFS supports atomic append that multiple machines
    can use at the same time
  • Primary will interleave the requests in any order
  • Each record will be written at least once!
  • Primary determines a position for the write,
    forwards this to the secondaries

14
Failures and the Client
  • If there is a failure in a record write or
    append, the client will generally retry
  • If there was partial success in a previous
    append, there might be more than one copy of the
    record on some nodes, leading to inconsistency
  • Client must handle this through checksums, record
    IDs, and periodic checkpointing
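
One common way to cope with at-least-once appends is for the writer to tag each record with a unique ID and for readers to filter duplicates. A minimal sketch under that assumption (the Record format is made up for illustration; checksum validation of partial records is omitted):

  import java.util.*;

  // Minimal sketch of client-side duplicate filtering for at-least-once appends.
  // The Record class and its writer-assigned id field are illustrative, not part of GFS.
  class DedupReader {
      static class Record {
          String id; byte[] payload;
          Record(String id, byte[] p) { this.id = id; payload = p; }
      }

      private final Set<String> seen = new HashSet<>();

      // Returns only the first copy of each record; repeated appends of the same
      // record (identified by its ID) are silently dropped.
      List<Record> filter(List<Record> rawRecords) {
          List<Record> unique = new ArrayList<>();
          for (Record r : rawRecords)
              if (seen.add(r.id))      // add() returns false if the ID was already seen
                  unique.add(r);
          return unique;
      }
  }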

15
GFS Performance
  • Many performance numbers in the paper
  • Not enough context here to discuss them in much
    detail; we would need to see how they compare with
    other approaches!
  • But they validate high scalability in terms of
    concurrent reads and concurrent appends, with data
    partitioned and replicated across many machines
  • They also show fast recovery from failed nodes
  • Not the only approach to many of these problems,
    but one shown to work at industrial strength!

16
A Popular Distributed Programming Model: MapReduce
  • In many circles, considered the key building
    block for much of Google's data analysis
  • A programming language built on it: Sawzall,
    http://labs.google.com/papers/sawzall.html
  • "Sawzall has become one of the most widely used
    programming languages at Google. On one
    dedicated Workqueue cluster with 1500 Xeon CPUs,
    there were 32,580 Sawzall jobs launched, using an
    average of 220 machines each. While running those
    jobs, 18,636 failures occurred (application
    failure, network outage, system crash, etc.) that
    triggered rerunning some portion of the job. The
    jobs read a total of 3.2×10^15 bytes of data
    (2.8 PB) and wrote 9.9×10^12 bytes (9.3 TB)."
  • Other similar languages: Yahoo's Pig Latin and
    Pig; Microsoft's Dryad
  • Cloned in open source: Hadoop,
    http://hadoop.apache.org/core/
  • So what is it? What's it good for?

17
MapReduce: Simple Distributed Functional
Programming Primitives
  • Modeled after Lisp primitives
  • map (apply a function to all items in a
    collection) and reduce (apply a function to a set
    of items with a common key)
  • We start with
  • A user-defined function to be applied to all
    data: map (key, value) → (key, value)
  • Another user-specified operation: reduce (key,
    set of values) → result
  • A set of n nodes, each with data
  • All nodes run map on all of their data, producing
    new data with keys
  • This data is collected by key, then shuffled and
    reduced
  • Dataflow is through temp files on GFS
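
For example, word count in this model reduces to two small functions: map emits (word, 1) for every word in its input, and reduce sums the counts for each word. The sketch below is framework-free Java meant to show the shape of the two functions, not any particular MapReduce API:

  import java.util.*;

  // Word count expressed as the two user-supplied functions of the model.
  class WordCountFunctions {
      // map(key, value) -> list of (key', value'): here key is a line offset
      // (ignored) and value is the line's text; we emit (word, 1) per word.
      static List<Map.Entry<String, Integer>> map(long offset, String line) {
          List<Map.Entry<String, Integer>> out = new ArrayList<>();
          for (String word : line.split("\\s+"))
              if (!word.isEmpty())
                  out.add(new AbstractMap.SimpleEntry<>(word, 1));
          return out;
      }

      // reduce(key, set of values) -> result: sum all the 1s emitted for this word.
      static int reduce(String word, Iterable<Integer> counts) {
          int sum = 0;
          for (int c : counts) sum += c;
          return sum;
      }
  }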

18
Some Example Tasks
  • Count word occurrences
  • Map: output word with count 1
  • Reduce: sum the counts
  • Distributed grep: all lines matching a pattern
  • Map: filter by pattern
  • Reduce: output set
  • Count URL access frequency
  • Map: output each URL as key, with count 1
  • Reduce: sum the counts
  • For each IP address, get the document with the
    most in-links
  • Number of queries by IP address (requires
    multiple steps)

19
MapReduce Dataflow Diagram (Default MapReduce
Uses Filesystem)
  [Diagram: a coordinator oversees data partitions (split by
  key) feeding map computation partitions, a redistribution
  of map outputs by key, and then reduce computation
  partitions]
20
Some Details
  • Fewer computation partitions than data partitions
  • All data is accessible via a distributed
    filesystem with replication
  • Worker nodes produce data in key order (makes it
    easy to merge)
  • The master is responsible for scheduling, keeping
    all nodes busy
  • The master knows how many data partitions there
    are and which have completed atomic commits to
    disk
  • Fault tolerance: the master triggers re-execution
    of work originally performed by failed nodes to
    make their data available again
  • Locality: the master tries to do work on nodes
    that have replicas of the data

21
Hadoop: A Modern Open-Source Clone of
MapReduce & GFS
  • Underlying Hadoop: HDFS, a page-level replicating
    filesystem
  • Modeled in part after GFS
  • Supports streaming page access from each site
  • Master/slave: NameNode vs. DataNodes

Source: Hadoop HDFS architecture documentation
22
Hadoop HDFS + MapReduce
Source: "Meet Hadoop," Devaraj Das, Yahoo!
Bangalore & Apache
23
Hadoop MapReduce Architecture
  • JobTracker (Master)
  • Accepts jobs submitted by users
  • Gives tasks to TaskTrackers; makes scheduling
    decisions, co-locates tasks with data
  • Monitors task and tracker status; re-executes
    tasks if needed
  • TaskTrackers (Slaves)
  • Run Map and Reduce tasks
  • Manage storage and transmission of intermediate
    output
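
For concreteness, here is the word-count example expressed against Hadoop's Mapper/Reducer classes, together with the driver that submits the job for the master to schedule. This follows the standard Hadoop word-count pattern using the newer org.apache.hadoop.mapreduce API; exact class and package details vary across Hadoop versions:

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

      // Map: emit (word, 1) for each token in the input line.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, ONE);
              }
          }
      }

      // Reduce: sum the counts emitted for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) sum += val.get();
              result.set(sum);
              context.write(key, result);
          }
      }

      // Driver: this is the "job submitted by users" that the master schedules
      // onto TaskTrackers, co-locating map tasks with the HDFS blocks they read.
      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
          FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }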

24
How Does this Relate to DHTs?
  • Consider replacing the filesystem with the DHT

25
What Does MapReduce Do Well?
  • What are its strengths?
  • What about weaknesses?

26
MapReduce is a Particular Programming Model
  • But it's not especially general (though things
    like Pig Latin improve it)
  • Suppose we have autonomous application components
    that wish to communicate
  • We've already seen a few strategies:
  • Request/response from client to server
  • HTTP itself
  • Asynchronous messages
  • Router gossip protocols
  • P2P finger tables, etc.
  • Are there general mechanisms and principles?
  • (Of course!)
  • Let's first look at what happens if we need
    in-order messaging

27
Message-Queuing Model (1)
  • Four combinations for loosely-coupled
    communications using queues.

28
Message-Queuing Model (2)
  • Basic interface to a queue in a message-queuing
    system.

Primitive   Meaning
Put         Append a message to a specified queue
Get         Block until the specified queue is nonempty, and remove the first message
Poll        Check a specified queue for messages, and remove the first; never block
Notify      Install a handler to be called when a message is put into the specified queue
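
In Java terms, these four primitives might be captured by an interface like the one below. This is only a sketch of the table above; real message-queuing APIs (e.g., JMS) use different names and richer signatures:

  import java.util.function.Consumer;

  // Sketch of the four queue primitives as a Java interface.
  interface MessageQueue<M> {
      // Put: append a message to the specified queue (non-blocking for the sender).
      void put(String queueName, M message);

      // Get: block until the queue is nonempty, then remove and return the first message.
      M get(String queueName) throws InterruptedException;

      // Poll: check the queue and remove the first message if any; never blocks.
      // Returns null when the queue is empty.
      M poll(String queueName);

      // Notify: install a handler invoked whenever a message is put into the queue.
      void notifyOnArrival(String queueName, Consumer<M> handler);
  }
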
29
General Architecture of a Message-Queuing System
(1)
  • The relationship between queue-level addressing
    and network-level addressing.

30
General Architecture of a Message-Queuing System
(2)
  • The general organization of a message-queuing
    system with routers.

31
Benefits of Message Queueing
  • Allows both synchronous (blocking) and
    asynchronous (polling or event-driven)
    communication
  • Ensures messages are delivered (or at least
    readable) in the order received
  • The basis of many transactional systems
  • e.g., Microsoft Message Queuing (MSMQ), IBM
    MQSeries, etc.