Title: Advanced data management
1. Advanced Data Management
- Jiaheng Lu
- Department of Computer Science
- Renmin University of China
- www.jiahenglu.net
2. Cloud Computing
4. Distributed Systems
5. Outline: Concepts and Terminology
- What is a distributed system?
- Distributed data objects
- Distributed execution
- Three-tier architectures
- Transaction concepts
6. What's a Distributed System?
- Centralized
- everything in one place
- stand-alone PC or Mainframe
- Distributed
- some parts remote
- distributed users
- distributed execution
- distributed data
7. Transparency in Distributed Systems
- Make the distributed system as easy to use and manage as a centralized system
- Give a single-system image
- Location transparency
- hide the fact that an object is remote
- hide the fact that an object has moved
- hide the fact that an object is partitioned or replicated
- The name doesn't change if an object is replicated, partitioned, or moved
8. Naming: The Basics
- Objects have
- Globally Unique Identifiers (GUIDs)
- location(s) / address(es)
- name(s)
- addresses can change
- objects can have many names
- Names are context dependent
- (Jim @ KGB vs. Jim @ CIA)
- Many naming systems (see the sketch below)
- UNC: \\node\device\dir\dir\dir\object
- Internet: http://node.domain.root/dir/dir/dir/object
- LDAP: ldap://ldap.domain.root/o=org,c=US,cn=dir
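The name / GUID / address separation can be made concrete with a minimal Python sketch; every name, GUID, and address below is invented for illustration:

```python
# Hypothetical sketch of the name / GUID / address separation described
# above: names are context dependent, the GUID is the stable identity,
# and addresses can change without invalidating names.

# name -> GUID, per naming context (e.g. "KGB" and "CIA" directories)
contexts = {
    "KGB": {"Jim": "guid-1234"},
    "CIA": {"Jim": "guid-9999"},   # same name, different object
}

# GUID -> current address(es); may be updated when the object moves
locations = {
    "guid-1234": ["node7:/export/objects/1234"],
    "guid-9999": ["node3:/export/objects/9999"],
}

def resolve(context: str, name: str) -> list[str]:
    """Translate (context, name) to addresses via the GUID."""
    guid = contexts[context][name]
    return locations[guid]

print(resolve("KGB", "Jim"))  # the KGB's Jim, wherever he currently lives
```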
9. Name Servers in Distributed Systems
- Name servers translate name + context to address (+ GUID)
- Name servers are partitioned (subtrees of the name space)
- Name servers replicate the root of the name tree
- Name servers form a hierarchy (see the sketch below)
- Distributed data from hell
- high read traffic
- high reliability & availability
- autonomy
(Diagram: North and South name servers, each replicating the root and holding its own region's names)
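A small sketch of how a lookup might walk this partitioned, root-replicated hierarchy; the server names and the regional split are invented:

```python
# Hypothetical name-server tree: every server replicates the root
# entries, but each regional server holds only its own subtree.

ROOT = {"north": "north-server", "south": "south-server"}

SERVERS = {
    # each server: replicated root + its own partition of the name space
    "north-server": {"root": ROOT, "names": {"alice": "guid-n1"}},
    "south-server": {"root": ROOT, "names": {"bob": "guid-s7"}},
}

def lookup(local_server: str, region: str, name: str) -> str:
    """Resolve region/name starting from any server.

    The replicated root lets any server route the request;
    only the owning server's partition holds the actual entry.
    """
    owner = SERVERS[local_server]["root"][region]   # root is replicated
    return SERVERS[owner]["names"][name]            # partition is not

print(lookup("north-server", "south", "bob"))  # routed via the root replica
```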
10. Autonomy in Distributed Systems
- The owner of a site (or node, or application, or database) wants to control it
- If my part is working, I must be able to access and manage it (reorganize, upgrade, add users, ...)
- Autonomy is
- essential
- difficult to implement
- in conflict with global consistency
- examples: naming, authentication, admin
11. Security: The Basics
- Authentication server: subject + authenticator => (Yes + token) or No
- Security matrix
- who can do what to whom
- an access control list is one column of the matrix
- "who" is an authenticated ID
- In a distributed system, "who", "what", and "whom" are distributed objects (see the sketch below)
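A minimal sketch of the security matrix, with the ACL extracted as one column; all subjects, objects, and rights are made up:

```python
# Hypothetical security matrix: rows are authenticated subjects,
# columns are objects. The column for one object is its ACL.

matrix = {
    # (subject, object) -> set of permitted operations
    ("jim@cia", "payroll-db"): {"read"},
    ("ann@cia", "payroll-db"): {"read", "write"},
    ("jim@cia", "printer-3"):  {"print"},
}

def acl(obj: str) -> dict[str, set[str]]:
    """One column of the matrix: the access-control list for obj."""
    return {subj: ops for (subj, o), ops in matrix.items() if o == obj}

def check(subject: str, op: str, obj: str) -> bool:
    """'who' must already be an authenticated ID when this runs."""
    return op in matrix.get((subject, obj), set())

print(acl("payroll-db"))
print(check("jim@cia", "write", "payroll-db"))  # False
```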
12. Security in Distributed Systems
- Security domain: nodes with a shared security server
- Security domains can have trust relationships
- A trusts B: A believes B when it says "this is Jim@B"
- Security domains form a hierarchy
- Delegation: passing authority to a server; when A asks B to do something (e.g., print a file, read a database), B may need A's authority
- Autonomy requires
- each node is an authenticator
- each node does its own security checks
- The Internet today
- no trust among domains (firewalls, many passwords)
- trust based on digital signatures
13. Clusters: The Ideal Distributed System
- A cluster is a distributed system BUT with a single
- location
- manager
- security policy
- relatively homogeneous
- communication that is
- high bandwidth
- low latency
- low error rate
- Clusters use distributed-system techniques for
- load distribution
- storage
- execution
- growth
- fault tolerance
14. Clusters: Shared What?
- Shared-Memory Multiprocessor
- multiple processors, one memory
- all devices are local
- e.g., DEC, SGI, or Sequent 16x nodes
- Shared-Disk Cluster
- an array of nodes
- all share common disks
- e.g., VAXcluster + Oracle
- Shared-Nothing Cluster
- each device is local to a node
- ownership may change
- e.g., Tandem, SP2, Wolfpack
15. Outline: Concepts and Terminology
- Why distribute?
- Distributed data objects
- Partitioned
- Replicated
- Distributed execution
- Three-tier architectures
- Transaction concepts
16. Partitioned Data: Break a File into Disjoint Groups
- Exploit data access locality
- put data near its consumer
- less network traffic
- better response time
- better availability
- owner controls data (autonomy)
- Spread load
- data or traffic may exceed a single store
(Diagram: an Orders file partitioned across N.A., S.A., Europe, and Asia)
17. How to Partition Data?
- How to partition
- by attribute, or
- randomly, or
- by source, or
- by use
- Problem: to find the data you must have
- a (replicated) directory, or
- an algorithm
- This encourages attribute-based partitioning (see the sketch below)
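The two lookup strategies can be sketched side by side; the four regional stores match the diagram above, and the hash scheme is just one possible algorithm:

```python
import hashlib

STORES = ["N.A.", "S.A.", "Europe", "Asia"]

# Option 1: a (replicated) directory mapping a key attribute to a store.
directory = {"order-1001": "Europe", "order-1002": "Asia"}

def find_by_directory(key: str) -> str:
    return directory[key]

# Option 2: an algorithm -- nothing to maintain or replicate, but
# placement is fixed by the hash rather than by access locality.
def find_by_hash(key: str) -> str:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return STORES[h % len(STORES)]

# Attribute-based partitioning ("European orders go to Europe")
# keeps data near its consumer; hashing spreads load evenly.
print(find_by_directory("order-1001"), find_by_hash("order-1001"))
```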
26. Replicated Data: Place Fragments at Many Sites
- Pros
- improves availability
- disconnected (mobile) operation
- distributes load
- reads are cheaper
- Cons
- N times more updates
- N times more storage
- Placement strategies
- dynamic: cache on demand
- static: place at specific sites
(Diagram: a catalog fragment replicated at several sites)
27. Updating Replicated Data
- When a replica is updated, how do changes propagate?
- Master copy, many slave copies (SQL Server; sketched below)
- always know the correct value (the master's)
- change propagation can be
- transactional
- as soon as possible
- periodic
- on demand
- Symmetric, update anytime (Access)
- allows mobile (disconnected) updates
- updates propagated ASAP, periodically, or on demand
- non-serializable
- colliding updates must be reconciled
- hard to know the real value
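A minimal sketch of the master-copy scheme: the master is correct immediately, while slaves lag until propagation runs. The class and its methods are illustrative, not any product's API:

```python
class MasterCopy:
    """Master/slave replication: the master always holds the
    correct value; slaves may lag until propagation runs."""

    def __init__(self, value, n_slaves: int):
        self.value = value
        self.slaves = [value] * n_slaves
        self.dirty = False

    def update(self, new_value):
        self.value = new_value      # master is correct immediately
        self.dirty = True           # slaves are now stale

    def propagate(self):
        """Run transactionally, ASAP, periodically, or on demand."""
        self.slaves = [self.value] * len(self.slaves)
        self.dirty = False

r = MasterCopy(value=10, n_slaves=3)
r.update(11)
print(r.value, r.slaves, r.dirty)   # 11 [10, 10, 10] True (stale reads)
r.propagate()
print(r.slaves)                     # [11, 11, 11]
```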
28. Outline: Concepts and Terminology
- Why distribute?
- Distributed data objects
- Partitioned
- Replicated
- Distributed execution
- remote procedure call
- queues
- Three-tier architectures
- Transaction concepts
29. Distributed Execution: Threads and Messages
- A thread is the unit of execution (the software analog of CPU + memory)
- Threads execute at a node
- Threads communicate via
- shared memory (local)
- messages (local and remote)
30. Peer-to-Peer or Client-Server
- Peer-to-peer is symmetric
- either side can send
- Client-server
- client sends requests
- server sends responses
- a simple subset of peer-to-peer
31. Connection-less or Connected
- Connected (sessions)
- open - request/reply - close
- client authenticated once
- messages arrive in order
- can send many replies (e.g., FTP)
- server keeps client context (context sensitive)
- e.g., Winsock and ODBC
- HTTP is adding connections
- Connection-less (see the sketch below)
- each request contains
- client ID
- client context
- the work request
- client authenticated on each message
- only a single response message
- e.g., HTTP, NFS v1
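A sketch of what "the request contains everything" means in the connection-less style; the field names and handler are invented:

```python
from dataclasses import dataclass

@dataclass
class ConnectionlessRequest:
    """Everything the server needs travels in every message,
    because no session holds the client's context."""
    client_id: str        # authenticated on each message
    client_context: dict  # e.g. a pagination token
    work: str             # the actual request

def handle(req: ConnectionlessRequest) -> str:
    # The server can be restarted between requests: nothing is
    # remembered on its side (HTTP- or NFS-v1-style).
    page = req.client_context.get("page", 0)
    return f"{req.work} for {req.client_id}, page {page}"

print(handle(ConnectionlessRequest("jim", {"page": 2}, "list orders")))
```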
32. Remote Procedure Call: The Key to Transparency
- An object may be local or remote
- Methods on the object work wherever it is
- Local invocation (diagram)
33. Remote Procedure Call: The Key to Transparency
(Diagram: remote invocation of y = pObj->f(x): the call is shipped to the object's node, f() executes there, and the returned val is assigned to y; see the sketch below)
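A toy sketch of the proxy idea behind RPC transparency: the call site looks local, and the proxy forwards the invocation. A real system would marshal arguments over a network; here a local dictionary stands in for the remote side:

```python
class RemoteProxy:
    """Local stand-in for a remote object: y = pObj.f(x) looks like a
    local call, but the proxy ships it to wherever the object lives."""

    def __init__(self, transport, object_id: str):
        self.transport = transport      # stands in for the network
        self.object_id = object_id

    def __getattr__(self, method):
        def stub(*args):
            # marshal -> send -> execute remotely -> return value
            return self.transport(self.object_id, method, args)
        return stub

# A fake "server side" so the sketch runs without a network.
REMOTE_OBJECTS = {"obj-1": {"f": lambda x: x * 2}}

def fake_transport(object_id, method, args):
    return REMOTE_OBJECTS[object_id][method](*args)

pObj = RemoteProxy(fake_transport, "obj-1")
y = pObj.f(21)          # same syntax whether obj-1 is local or remote
print(y)                # 42
```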
34. Object Request Broker (ORB)
- Registers servers
- Manages pools of servers
- Connects clients to servers
- Does naming, request-level authorization, ...
- Provides transaction coordination (new feature)
- Old names
- Transaction Processing Monitor
- Web server
- NetWare
35. Outline: Concepts and Terminology
- Why distribute?
- Distributed data objects
- Distributed execution
- remote procedure call
- queues
- Three-tier architectures
- what
- why
- Transaction concepts
36. Client/Server Interactions: All Can Be Done with RPC
- Request-Response: the response may be many messages
- Conversational: the server keeps client context
- Dispatcher: three-tier; a complex operation at the server
- Queued: de-couples client from server; allows disconnected operation
37. Queued Request/Response
- Time-decouples client and server
- Three transactions (see the sketch below)
- Almost real time, ASAP processing
- Communicate at each other's convenience; allows mobile (disconnected) operation
- Disk queues survive client and server failures
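A sketch of the three-transaction pattern, with in-memory queues standing in for disk queues (which, unlike these, would survive failures); the priority field anticipates the dispatcher example on the next slide:

```python
from collections import deque

# Stand-in for disk queues: illustrative only, not failure-proof.
requests = deque()
responses = deque()

def client_submit(work: str, priority: int = 0):
    """Transaction 1: enqueue and go away (maybe disconnect)."""
    requests.append((priority, work))

def server_poll():
    """Transaction 2: the server processes at its own convenience,
    highest priority first (the ambulance-dispatcher pattern)."""
    if requests:
        priority, work = max(requests)
        requests.remove((priority, work))
        responses.append(f"done: {work}")

def client_collect():
    """Transaction 3: the client picks up the answer later."""
    return responses.popleft() if responses else None

client_submit("print report")
client_submit("dispatch ambulance", priority=9)
server_poll()
print(client_collect())   # the high-priority request is served first
```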
38. Why Queued Processing?
- Prioritize requests: an ambulance dispatcher favors high-priority calls
- Manage workflows
- Deferred processing in mobile apps
39. Google Cloud Computing Techniques
40. The Google File System
41. The Google File System (GFS)
- A scalable distributed file system for large, distributed, data-intensive applications
- Multiple GFS clusters are currently deployed
- The largest ones have
- 1000 storage nodes
- 300 terabytes of disk storage
- and are heavily accessed by hundreds of clients on distinct machines
42. Introduction
- Shares many of the same goals as previous distributed file systems: performance, scalability, reliability, etc.
- The GFS design has been driven by four key observations of Google's application workloads and technological environment
43. Intro: Observations 1
- 1. Component failures are the norm
- constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system
- 2. Huge files (by traditional standards)
- multi-GB files are common
- I/O operations and block sizes must be revisited
44. Intro: Observations 2
- 3. Most files are mutated by appending new data
- this is the focus of performance optimization and atomicity guarantees
- 4. Co-designing the applications and the API benefits the overall system by increasing flexibility
45. The Design
- A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients
46. The Master
- Maintains all file system metadata
- namespace, access control info, file-to-chunk mappings, chunk (and replica) locations, etc.
- Periodically communicates with chunkservers via HeartBeat messages to give instructions and check state
47. The Master
- Helps make sophisticated chunk placement and replication decisions, using global knowledge
- For reading and writing, a client contacts the master to get chunk locations, then deals directly with chunkservers
- The master is therefore not a bottleneck for reads and writes
48. Chunkservers
- Files are broken into chunks; each chunk has an immutable, globally unique 64-bit chunk handle
- the handle is assigned by the master at chunk creation
- Chunk size is 64 MB
- Each chunk is replicated on 3 (default) chunkservers (see the read-path sketch below)
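The master/chunkserver division of labor can be sketched for a read; the 64 MB chunk size is from the slide, while the file names, handles, and function are invented:

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB, per the slide

# Master metadata: file -> list of chunk handles, handle -> replicas.
chunk_map = {"/logs/web.log": ["handle-0", "handle-1"]}
locations = {"handle-0": ["cs1", "cs2", "cs3"],
             "handle-1": ["cs2", "cs4", "cs5"]}

def master_lookup(path: str, offset: int):
    """Metadata only: translate (file, offset) to (handle, replicas)."""
    index = offset // CHUNK_SIZE            # which chunk holds the byte
    handle = chunk_map[path][index]
    return handle, locations[handle]

# The client then reads the data directly from a chunkserver, so the
# master never sits on the data path and is not a read bottleneck.
handle, replicas = master_lookup("/logs/web.log", 70 * 1024 * 1024)
print(handle, "->", replicas)   # handle-1 -> ['cs2', 'cs4', 'cs5']
```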
49. Clients
- Linked into apps using the file system API
- Communicate with the master and chunkservers for reading and writing
- master interactions only for metadata
- chunkserver interactions for data
- Only cache metadata information
- data is too large to cache
50. Chunk Locations
- The master does not keep a persistent record of chunk and replica locations
- It polls chunkservers for this at startup, and whenever chunkservers join or leave
- It stays up to date by controlling the placement of new chunks and through HeartBeat messages (when monitoring chunkservers)
51. Operation Log
- A record of all critical metadata changes
- Stored on the master and replicated on other machines
- Defines the order of concurrent operations
- Changes are not visible to clients until they propagate to all chunk replicas
- Also used to recover the file system state (see the sketch below)
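A minimal sketch of an operation log: metadata mutations recorded in one total order and replayed to rebuild state. The operation format is invented:

```python
# Minimal sketch of an operation log: metadata changes are recorded
# in a single total order, and state can be rebuilt by replaying them.

log = []   # in GFS this is persisted locally and replicated remotely

def apply_op(state: dict, op: tuple) -> None:
    kind, path, value = op
    if kind == "create":
        state[path] = value
    elif kind == "delete":
        state.pop(path, None)

def record(op: tuple) -> None:
    log.append(op)   # the append order defines concurrent-op order

def recover() -> dict:
    """Rebuild file-system metadata by replaying the log."""
    state: dict = {}
    for op in log:
        apply_op(state, op)
    return state

record(("create", "/a", "chunks-a"))
record(("create", "/b", "chunks-b"))
record(("delete", "/a", None))
print(recover())    # {'/b': 'chunks-b'}
```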
52. System Interactions: Leases and Mutation Order
- Leases maintain a consistent mutation order across all chunk replicas
- The master grants a lease to one replica, called the primary
- The primary chooses the serial mutation order, and all replicas follow this order
- This minimizes management overhead for the master
53. System Interactions: Leases and Mutation Order (diagram)
54. Atomic Record Append
- The client specifies the data to write; GFS chooses the offset, appends the data to each replica at least once, and returns the chosen offset
- Heavily used by Google's distributed applications
- No need for a distributed lock manager
- GFS chooses the offset, not the client
55. Atomic Record Append: How?
- Follows a similar control flow as other mutations
- The primary tells the secondary replicas to append at the same offset as the primary
- If the append fails at any replica, it is retried by the client
- So replicas of the same chunk may contain different data, including duplicates, in whole or in part, of the same record (see the sketch below)
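The at-least-once behavior and its consequences are easy to see in a deliberately simplified sketch, where each replica is a Python list and a record is one list element:

```python
# Simplified replicas of one chunk: lists of records at offsets.
replicas = {"cs1": [], "cs2": [], "cs3": []}

def record_append(data, fail_at=None):
    """Primary picks the offset; every replica appends at that offset.
    Returns the offset, or raises to trigger a client retry."""
    offset = max(len(r) for r in replicas.values())
    for name, chunk in replicas.items():
        while len(chunk) < offset:
            chunk.append(None)          # pad: replica regions may differ
        chunk.append(data)
        if name == fail_at:
            raise IOError(f"append failed at {name}")
    return offset

try:
    record_append("rec-1", fail_at="cs2")   # cs1 succeeded, cs2 failed
except IOError:
    pass
off = record_append("rec-1")                # client retries: new offset
print(off, replicas["cs1"])  # cs1 now holds 'rec-1' twice (a duplicate)
```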
56. Atomic Record Append: How?
- GFS does not guarantee that all replicas are bitwise identical
- It only guarantees that the data is written at least once in an atomic unit
- The data must be written at the same offset on all chunk replicas for success to be reported
57. Replica Placement
- The placement policy maximizes data reliability and network bandwidth
- Spread replicas not only across machines, but also across racks
- Guards against machine failures, and against racks getting damaged or going offline
- Reads of a chunk exploit the aggregate bandwidth of multiple racks
- Writes have to flow through multiple racks
- a tradeoff made willingly
58. Chunk Creation
- Chunks are created and placed by the master
- placed on chunkservers with below-average disk utilization
- limit the number of recent creations on each chunkserver
- with creations come lots of writes
59. Detecting Stale Replicas
- The master keeps a chunk version number to distinguish up-to-date replicas from stale ones
- The version is increased when granting a lease
- If a replica is not available, its version is not increased
- The master detects stale replicas when chunkservers report their chunks and versions (see the sketch below)
- Stale replicas are removed during garbage collection
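Version-based staleness detection in sketch form; the data structures and heartbeat format are invented:

```python
# Master's view: authoritative version per chunk, bumped at lease grant.
master_version = {"handle-7": 3}

# What each chunkserver reports in a heartbeat: (handle, version).
reports = {"cs1": ("handle-7", 3),
           "cs2": ("handle-7", 2)}   # cs2 missed the last mutation

def stale_replicas() -> list[str]:
    """Any replica whose reported version lags the master's is stale
    and will be removed during garbage collection."""
    stale = []
    for server, (handle, version) in reports.items():
        if version < master_version[handle]:
            stale.append(server)
    return stale

print(stale_replicas())   # ['cs2']
```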
60. Garbage Collection
- When a client deletes a file, the master logs it like other changes and renames the file to a hidden name
- The master removes files hidden for longer than 3 days when scanning the file system namespace
- their metadata is also erased
- During HeartBeat messages, chunkservers send the master a subset of their chunks, and the master tells them which chunks have no metadata
- The chunkserver removes those chunks on its own
61. Fault Tolerance: High Availability
- Fast recovery
- the master and chunkservers can restart in seconds
- Chunk replication
- Master replication
- "shadow" masters provide read-only access when the primary master is down
- mutations are not done until recorded on all master replicas
62. Fault Tolerance: Data Integrity
- Chunkservers use checksums to detect corrupt data
- Since replicas are not bitwise identical, each chunkserver maintains its own checksums
- For reads, the chunkserver verifies the checksum before sending the data
- Checksums are updated during writes (see the sketch below)
63. Q&A. Thanks!