Transcript and Presenter's Notes

Title: www.jiahenglu.net


1
Cloud Computing and Cloud Data Management
  • Jiaheng Lu
  • Renmin University of China
  • www.jiahenglu.net

2
CLOUD COMPUTING
3
Outline
  • Overview of cloud computing
  • Google cloud computing techniques: GFS, Bigtable and MapReduce
  • Yahoo cloud computing technique: Hadoop
  • ????????

3
4
Cloud computing

5
6
Why do we use cloud computing?
7
Why do we use cloud computing?
  • Case 1
  • Write a file
  • Save it
  • The computer crashes, and the file is lost
  • Files stored in the cloud are never lost

8
Why do we use cloud computing?
  • Case 2
  • Use IE: download, install, use
  • Use QQ: download, install, use
  • Use C: download, install, use
  • Instead, get the service from the cloud

9
What is the cloud, and what is cloud computing?
  • Cloud
  • On-demand resources or services delivered over the
    Internet
  • with the scale and reliability of a data center

10
What is the cloud, and what is cloud computing?
  • Cloud computing is a style of computing in
    which dynamically scalable and often virtualized
    resources are provided as a service over the
    Internet.
  • Users need not have knowledge of, expertise in,
    or control over the technology infrastructure in
    the "cloud" that supports them.

11
Characteristics of cloud computing
  • Virtual
  • Software, databases, Web servers, operating
    systems, storage and networking are delivered as
    virtual servers.
  • On demand
  • Add and subtract processors, memory, network
    bandwidth, and storage as needed.

12
Types of cloud service
SaaS Software as a Service
PaaS Platform as a Service
IaaS Infrastructure as a Service
13
SaaS
  • Software delivery model
  • No hardware or software to manage
  • Service delivered through a browser
  • Customers use the service on demand
  • Instant Scalability

14
SaaS
  • Examples
  • Your current CRM package is not managing the load,
    or you simply don't want to host it in-house. Use
    a SaaS provider such as Salesforce.com.
  • Your email is hosted on an Exchange server in
    your office and it is very slow. Outsource this
    using Hosted Exchange.

15
PaaS
  • Platform delivery model
  • Platforms are built upon Infrastructure, which is
    expensive
  • Estimating demand is not a science!
  • Platform management is not fun!

16
PaaS
  • Examples
  • You need to host a large file (5 MB) on your
    website and make it available to 35,000 users
    for only two months. Use CloudFront from Amazon.
  • You want to start storage services on your
    network for a large number of files and you do
    not have the storage capacity. Use Amazon S3.

17
IaaS
  • Computer infrastructure delivery model
  • A platform virtualization environment
  • Computing resources, such as storage and
    processing capacity
  • Virtualization taken a step further

18
IaaS
  • Examples
  • You want to run a batch job but you don't have
    the infrastructure necessary to run it in a
    timely manner. Use Amazon EC2.
  • You want to host a website, but only for a few
    days. Use Flexiscale.

19
Cloud computing and other computing techniques

20
CLOUD COMPUTING
21
The 21st Century Vision Of Computing
Leonard Kleinrock, one of the chief scientists of
the original Advanced Research Projects Agency
Network (ARPANET) project which seeded the
Internet, said: "As of now, computer networks are
still in their infancy, but as they grow up and
become sophisticated, we will probably see the
spread of computer utilities which, like present
electric and telephone utilities, will service
individual homes and offices across the country."
22
The 21st Century Vision Of Computing
Sun Microsystems co-founder Bill Joy
23
The 21st Century Vision Of Computing
24
Definitions
utility
25
Definitions
Utility computing is the packaging of computing
resources, such as computation and storage, as a
metered service similar to a traditional public
utility
utility
26
Definitions
utility
A computer cluster is a group of linked
computers, working together closely so that in
many respects they form a single computer.
27
Definitions
utility
Grid computing is the application of several
computers to a single problem at the same time
usually to a scientific or technical problem that
requires a great number of computer processing
cycles or access to large amounts of data
28
Definitions
utility
Cloud computing is a style of computing in which
dynamically scalable and often virtualized
resources are provided as a service over the
Internet.
29
Grid Computing vs. Cloud Computing
  • They share a lot of commonality
  • intention, architecture and technology
  • Differences
  • programming model, business model, compute
    model, applications, and virtualization

30
Grid Computing vs. Cloud Computing
  • the problems are mostly the same
  • manage large facilities
  • define methods by which consumers discover,
    request and use resources provided by the central
    facilities
  • implement the often highly parallel computations
    that execute on those resources.

31
Grid Computing vs. Cloud Computing
  • Virtualization
  • Grid
  • Grids do not rely on virtualization as much as
    clouds do; each individual organization maintains
    full control of its resources
  • Cloud
  • Virtualization is an indispensable ingredient of
    almost every cloud

32
33
Any questions or comments?
2015-10-3
33
34
Outline
  • Overview of cloud computing
  • Google cloud computing techniques: GFS, Bigtable and MapReduce
  • Yahoo cloud computing technique: Hadoop
  • ????????

34
35
Google Cloud computing techniques

36
Cloud Systems
  • BigTable-like (BigTable, OSDI'06): HBase, HyperTable
  • MapReduce-based: Hive (VLDB'09), HadoopDB (VLDB'09)
  • DBMS-based: GreenPlum, SQL Azure
  • Others: CouchDB, Voldemort, PNUTS (VLDB'08)
37
The Google File System
38
The Google File System (GFS)
  • A scalable distributed file system for large
    distributed data-intensive applications
  • Multiple GFS clusters are currently deployed.
  • The largest ones have:
  • over 1000 storage nodes
  • over 300 terabytes of disk storage
  • and are heavily accessed by hundreds of clients on
    distinct machines

39
Introduction
  • Shares many of the same goals as previous distributed
    file systems
  • performance, scalability, reliability, etc.
  • The GFS design has been driven by four key
    observations of Google's application workloads and
    technological environment

40
Intro Observations 1
  • 1. Component failures are the norm
  • constant monitoring, error detection, fault
    tolerance and automatic recovery are integral to
    the system
  • 2. Huge files (by traditional standards)
  • Multi-GB files are common
  • I/O operations and block sizes must be revisited

41
Intro Observations 2
  • 3. Most files are mutated by appending new data
  • This is the focus of performance optimization and
    atomicity guarantees
  • 4. Co-designing the applications and APIs
    benefits the overall system by increasing flexibility

42
The Design
  • Cluster consists of a single master and multiple
    chunkservers and is accessed by multiple clients

43
The Master
  • Maintains all file system metadata
  • namespace, access control info, file-to-chunk
    mappings, chunk (and replica) locations,
    etc.
  • Periodically communicates with chunkservers via
    HeartBeat messages to give instructions and check
    state

44
The Master
  • Helps make sophisticated chunk placement and
    replication decisions, using global knowledge
  • For reading and writing, a client contacts the Master
    to get chunk locations, then deals directly with
    chunkservers (see the sketch below)
  • The Master is not a bottleneck for reads/writes

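To make the division of labor concrete, here is a minimal Python sketch of the read path just described. The class and function names (Master, Chunkserver, client_read) are illustrative assumptions, not the real GFS interfaces.

  CHUNK_SIZE = 64 * 2**20  # 64 MB chunks, as in GFS

  class Master:
      """Holds only metadata: file name -> chunk handles -> replica locations."""
      def __init__(self, file_to_chunks, chunk_locations):
          self.file_to_chunks = file_to_chunks      # e.g. {"/logs/a": ["h1", "h2"]}
          self.chunk_locations = chunk_locations    # e.g. {"h1": ["cs1", "cs2", "cs3"]}

      def lookup(self, filename, offset):
          # Translate (file, byte offset) into (chunk handle, replica locations).
          index = offset // CHUNK_SIZE
          handle = self.file_to_chunks[filename][index]
          return handle, self.chunk_locations[handle]

  class Chunkserver:
      def __init__(self, chunks):
          self.chunks = chunks                      # handle -> bytes

      def read(self, handle, start, length):
          return self.chunks[handle][start:start + length]

  def client_read(master, chunkservers, filename, offset, length):
      # 1. Ask the master for metadata only.
      handle, replicas = master.lookup(filename, offset)
      # 2. Read the data directly from one replica (e.g. the first/closest one).
      server = chunkservers[replicas[0]]
      return server.read(handle, offset % CHUNK_SIZE, length)

  # Tiny demo with one chunk on one chunkserver.
  master = Master({"/logs/a": ["h1"]}, {"h1": ["cs1"]})
  servers = {"cs1": Chunkserver({"h1": b"hello gfs"})}
  print(client_read(master, servers, "/logs/a", 0, 5))   # b'hello'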
45
Chunkservers
  • Files are broken into chunks. Each chunk has an
    immutable, globally unique 64-bit chunk handle.
  • The handle is assigned by the master at chunk
    creation
  • Chunk size is 64 MB
  • Each chunk is replicated on 3 (default) servers

46
Clients
  • Linked to apps using the file system API.
  • Communicates with master and chunkservers for
    reading and writing
  • Master interactions only for metadata
  • Chunkserver interactions for data
  • Only caches metadata information
  • Data is too large to cache.

47
Chunk Locations
  • The Master does not keep a persistent record of the
    locations of chunks and replicas.
  • It polls chunkservers for this information at startup
    and when new chunkservers join the cluster.
  • It stays up to date by controlling the placement of
    new chunks and through HeartBeat messages (when
    monitoring chunkservers)

48
Operation Log
  • Record of all critical metadata changes
  • Stored on Master and replicated on other machines
  • Defines order of concurrent operations
  • Also used to recover the file system state

49
System Interactions Leases and Mutation Order
  • Leases maintain a mutation order across all chunk
    replicas
  • The Master grants a lease to one replica, called the
    primary
  • The primary chooses the serial mutation order, and
    all replicas follow this order
  • This minimizes management overhead for the Master

50
Atomic Record Append
  • The client specifies the data to write; GFS chooses
    and returns the offset it writes to, and appends
    the data to each replica at least once
  • Heavily used by Google's distributed applications
  • No need for a distributed lock manager
  • GFS chooses the offset, not the client

51
Atomic Record Append How?
  • Follows a similar control flow as mutations
  • The primary tells the secondary replicas to append at
    the same offset as the primary
  • If an append fails at any replica, it is
    retried by the client
  • So replicas of the same chunk may contain
    different data, including duplicates, in whole or in
    part, of the same record

52
Atomic Record Append How?
  • GFS does not guarantee that all replicas are
    bitwise identical.
  • Only guarantees that data is written at least
    once in an atomic unit.
  • Data must be written at the same offset for all
    chunk replicas for success to be reported (a
    client-side retry-and-dedup pattern is sketched below)

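Since record append is at-least-once, readers may encounter duplicates. The following sketch shows a common client-side pattern consistent with the slides: the writer retries and tags each record with a unique ID, and the reader discards duplicates. All names here are hypothetical; this is not GFS client code.

  import uuid

  def append_with_retry(append_fn, payload, max_retries=3):
      """Retry a record append; the record may end up stored more than once."""
      record = {"id": str(uuid.uuid4()), "data": payload}  # unique ID enables dedup
      for attempt in range(max_retries):
          try:
              append_fn(record)
              return record["id"]
          except IOError:
              continue  # a failed attempt may still have written to some replicas
      raise IOError("append failed after retries")

  def read_deduplicated(records):
      """Readers skip records whose ID was already seen (handles duplicates)."""
      seen, result = set(), []
      for r in records:
          if r["id"] not in seen:
              seen.add(r["id"])
              result.append(r["data"])
      return result

  # Demo: the log ends up containing a duplicate of the same logical record.
  log = []
  rid = append_with_retry(log.append, "event-1")
  log.append({"id": rid, "data": "event-1"})   # simulate a duplicated append
  print(read_deduplicated(log))                # ['event-1']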
53
Detecting Stale Replicas
  • The Master keeps a chunk version number to distinguish
    up-to-date and stale replicas
  • The version is increased when granting a lease
  • If a replica is not available, its version is not
    increased
  • The Master detects stale replicas when chunkservers
    report their chunks and versions
  • Stale replicas are removed during garbage collection

54
Garbage collection
  • When a client deletes a file, the master logs it like
    other changes and renames the file to a hidden
    name.
  • The master removes files hidden for longer than 3
    days when scanning the file system namespace
  • their metadata is also erased
  • During HeartBeat messages, each chunkserver sends
    the master a subset of its chunks, and the
    master tells it which ones no longer have metadata.
  • The chunkserver removes these chunks on its own

55
Fault Tolerance: High Availability
  • Fast recovery
  • The Master and chunkservers can restart in seconds
  • Chunk replication
  • Master replication
  • "shadow" masters provide read-only access when the
    primary master is down
  • mutations are not considered done until recorded on
    all master replicas

56
Fault Tolerance: Data Integrity
  • Chunkservers use checksums to detect corrupt data
  • Since replicas are not bitwise identical,
    chunkservers maintain their own checksums
  • For reads, the chunkserver verifies the checksum
    before sending the chunk
  • Checksums are updated during writes (a minimal
    verification sketch follows)

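A minimal illustration of per-block checksumming in the spirit of the slide above; the 64 KB block size matches the GFS paper, but the functions themselves are made-up Python, not chunkserver code.

  import zlib

  BLOCK = 64 * 1024  # GFS checksums data in 64 KB blocks

  def build_checksums(chunk_bytes):
      return [zlib.crc32(chunk_bytes[i:i + BLOCK])
              for i in range(0, len(chunk_bytes), BLOCK)]

  def verified_read(chunk_bytes, checksums, offset, length):
      # Verify every 64 KB block that the requested range touches.
      first, last = offset // BLOCK, (offset + length - 1) // BLOCK
      for b in range(first, last + 1):
          block = chunk_bytes[b * BLOCK:(b + 1) * BLOCK]
          if zlib.crc32(block) != checksums[b]:
              raise IOError("corrupt block %d: report to master for re-replication" % b)
      return chunk_bytes[offset:offset + length]

  data = b"x" * 200_000
  sums = build_checksums(data)
  print(len(verified_read(data, sums, 100_000, 1000)))  # 1000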
57
Introduction to MapReduce
58
MapReduce Insight
  • Consider the problem of counting the number of
    occurrences of each word in a large collection of
    documents
  • How would you do it in parallel ?

59
MapReduce Programming Model
  • Inspired by the map and reduce operations commonly
    used in functional programming languages like
    Lisp.
  • Users implement an interface with two primary methods:
  • 1. Map: (key1, val1) → list(key2, val2)
  • 2. Reduce: (key2, list(val2)) → val3

60
Map operation
  • Map, a pure function written by the user, takes
    an input key/value pair and produces a set of
    intermediate key/value pairs.
  • e.g. (docid, doc-content)
  • Drawing an analogy to SQL, map can be visualized as
    the group-by clause of an aggregate query.

61
Reduce operation
  • On completion of the map phase, all the intermediate
    values for a given output key are combined
    into a list and given to a reducer.
  • Can be visualized as an aggregate function (e.g.,
    average) computed over all the rows with
    the same group-by attribute.

62
Pseudo-code
  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

(A runnable Python version of this word count follows.)
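For reference, the same word count can be run end to end in plain Python. The tiny run_mapreduce driver below stands in for the framework (it performs the group-by-key shuffle); the function names are illustrative only.

  from collections import defaultdict

  def map_fn(doc_name, doc_contents):
      # Emit ("word", 1) for every word in the document.
      for word in doc_contents.split():
          yield word, 1

  def reduce_fn(word, counts):
      # Sum all partial counts for one word.
      yield word, sum(counts)

  def run_mapreduce(inputs, map_fn, reduce_fn):
      # Shuffle phase: group intermediate values by key.
      groups = defaultdict(list)
      for key, value in inputs:
          for k2, v2 in map_fn(key, value):
              groups[k2].append(v2)
      # Reduce phase: one call per distinct key.
      return dict(kv for k2, vs in groups.items() for kv in reduce_fn(k2, vs))

  docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
  print(run_mapreduce(docs, map_fn, reduce_fn))  # {'the': 3, 'quick': 1, ...}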
63
MapReduce Execution overview

64
MapReduce Example

65
MapReduce in Parallel Example

66
MapReduce Fault Tolerance
  • Handled via re-execution of tasks
  • Task completion is committed through the master
  • What happens if a mapper fails?
  • Re-execute completed and in-progress map tasks
  • What happens if a reducer fails?
  • Re-execute in-progress reduce tasks
  • What happens if the master fails?
  • Potential trouble!!

67
MapReduce
  • Walk through of One more Application

68
69
MapReduce PageRank
  • PageRank models the behavior of a "random surfer":
    PR(p) = (1 - d) + d * Σ_{t links to p} PR(t) / C(t)
  • C(t) is the out-degree of t, and (1 - d) is a
    damping factor (random jump)
  • The random surfer keeps clicking on successive
    links at random, not taking content into
    consideration.
  • A page distributes its rank equally among all
    pages it links to.
  • The damping factor accounts for the surfer getting
    bored and typing an arbitrary URL.

70
PageRank Key Insights
  • Effects at each iteration are local: the (i+1)th
    iteration depends only on the ith iteration
  • At iteration i, the PageRank of individual nodes can
    be computed independently

71
PageRank using MapReduce
  • Use a sparse matrix representation (M)
  • Map each row of M to a list of PageRank "credit"
    to assign to out-link neighbours.
  • These credits are reduced to a single
    PageRank value for each page by aggregating over
    them.

72
PageRank using MapReduce
Source of image: Lin 2008
73
Phase 1 Process HTML
  • Map task takes (URL, page-content) pairs and maps
    them to (URL, (PRinit, list-of-urls))
  • PRinit is the seed PageRank for URL
  • list-of-urls contains all pages pointed to by URL
  • Reduce task is just the identity function

74
Phase 2 PageRank Distribution
  • The reduce task gets (URL, url_list) and many (URL,
    val) values
  • Sum the vals and fix up with d to get the new PR
  • Emit (URL, (new_rank, url_list))
  • Check for convergence using a non-parallel
    component (one iteration is sketched below)

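A compact, illustrative sketch of one PageRank iteration written as map and reduce steps in Python. Function names and the toy graph are assumptions for illustration; the damping constant follows the formula quoted earlier, and the link lists are re-emitted so the graph structure survives the iteration.

  from collections import defaultdict

  D = 0.85  # damping factor

  def pagerank_map(url, state):
      rank, out_links = state
      yield url, ("links", out_links)                 # pass the graph structure along
      for target in out_links:                        # distribute rank to neighbours
          yield target, ("credit", rank / len(out_links))

  def pagerank_reduce(url, values):
      out_links, total = [], 0.0
      for kind, val in values:
          if kind == "links":
              out_links = val
          else:
              total += val
      new_rank = (1 - D) + D * total                  # PR(p) = (1 - d) + d * sum of credits
      yield url, (new_rank, out_links)

  def one_iteration(graph):
      groups = defaultdict(list)
      for url, state in graph.items():
          for k, v in pagerank_map(url, state):
              groups[k].append(v)
      return dict(kv for url, vals in groups.items()
                  for kv in pagerank_reduce(url, vals))

  graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
  for _ in range(10):
      graph = one_iteration(graph)
  print({u: round(r, 3) for u, (r, _) in graph.items()})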
75
MapReduce Some More Apps
  • Distributed Grep.
  • Count of URL Access Frequency.
  • Clustering (K-means)
  • Graph Algorithms.
  • Indexing Systems

MapReduce Programs In Google Source Tree
76
MapReduce Extensions and similar apps
  • PIG (Yahoo)
  • Hadoop (Apache)
  • DryadLinq (Microsoft)

77
Large Scale Systems Architecture using MapReduce
78
BigTable: A Distributed Storage System for Structured Data
79
Introduction
  • BigTable is a distributed storage system for
    managing structured data.
  • Designed to scale to a very large size
  • Petabytes of data across thousands of servers
  • Used for many Google projects
  • Web indexing, Personalized Search, Google Earth,
    Google Analytics, Google Finance, ...
  • A flexible, high-performance solution for all of
    Google's products

80
Motivation
  • Lots of (semi-)structured data at Google
  • URLs: contents, crawl metadata, links, anchors,
    PageRank, ...
  • Per-user data: user preference settings, recent
    queries/search results, ...
  • Geographic locations: physical entities (shops,
    restaurants, etc.), roads, satellite image data,
    user annotations, ...
  • Scale is large
  • Billions of URLs, many versions per page
    (around 20 KB per version)
  • Hundreds of millions of users, thousands of queries/sec
  • Over 100 TB of satellite image data

81
Why not just use commercial DB?
  • The scale is too large for most commercial databases
  • Even if it weren't, the cost would be very high
  • Building internally means the system can be applied
    across many projects for low incremental cost
  • Low-level storage optimizations help performance
    significantly
  • Much harder to do when running on top of a
    database layer

82
Goals
  • Want asynchronous processes to be continuously
    updating different pieces of data
  • Want access to most current data at any time
  • Need to support
  • Very high read/write rates (millions of ops per
    second)
  • Efficient scans over all or interesting subsets
    of data
  • Efficient joins of large one-to-one and
    one-to-many datasets
  • Often want to examine data changes over time
  • E.g. Contents of a web page over multiple crawls

83
BigTable
  • Distributed multi-level map
  • Fault-tolerant, persistent
  • Scalable
  • Thousands of servers
  • Terabytes of in-memory data
  • Petabyte of disk-based data
  • Millions of reads/writes per second, efficient
    scans
  • Self-managing
  • Servers can be added/removed dynamically
  • Servers adjust to load imbalance

84
Building Blocks
  • Building blocks:
  • Google File System (GFS): raw storage
  • Scheduler: schedules jobs onto machines
  • Lock service: distributed lock manager
  • MapReduce: simplified large-scale data processing
  • BigTable's use of the building blocks:
  • GFS: stores persistent data (SSTable file format
    for storage of data)
  • Scheduler: schedules jobs involved in BigTable
    serving
  • Lock service: master election, location
    bootstrapping
  • MapReduce: often used to read/write BigTable data

85
Basic Data Model
  • A BigTable is a sparse, distributed, persistent,
    multi-dimensional sorted map:
  • (row, column, timestamp) → cell contents
  • A good match for most Google applications (see the
    sketch below)

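The sorted-map data model can be pictured with an in-memory Python dictionary. The row key and column names below follow the WebTable example from the Bigtable paper; the read_cell helper is purely illustrative, not the Bigtable API.

  # (row key, column family:qualifier, timestamp) -> cell contents
  webtable = {
      "com.cnn.www": {
          "contents:":         {3: "<html>...v3", 5: "<html>...v5"},
          "anchor:cnnsi.com":  {9: "CNN"},
          "anchor:my.look.ca": {8: "CNN.com"},
      },
  }

  def read_cell(table, row, column, timestamp=None):
      """Return the newest version at or before `timestamp` (newest overall if None)."""
      versions = table[row][column]
      ts = max(t for t in versions if timestamp is None or t <= timestamp)
      return versions[ts]

  print(read_cell(webtable, "com.cnn.www", "contents:"))      # newest version (ts 5)
  print(read_cell(webtable, "com.cnn.www", "contents:", 4))   # version at ts 3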
86
WebTable Example
  • Want to keep a copy of a large collection of web
    pages and related information
  • Use URLs as row keys
  • Various aspects of the web page as column names
  • Store the contents of web pages in the "contents:"
    column under the timestamps when they were
    fetched.

87
Rows
  • Name is an arbitrary string
  • Access to data in a row is atomic
  • Row creation is implicit upon storing data
  • Rows ordered lexicographically
  • Rows close together lexicographically usually on
    one or a small number of machines

88
Rows (cont.)
  • Reads of short row ranges are efficient and
    typically require communication with only a small
    number of machines.
  • Clients can exploit this property by selecting row
    keys so they get good locality for data access.
  • Example (see the helper sketched below):
  • math.gatech.edu, math.uga.edu, phys.gatech.edu,
    phys.uga.edu
  • vs.
  • edu.gatech.math, edu.gatech.phys, edu.uga.math,
    edu.uga.phys

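A small helper showing why the second key scheme clusters pages from the same domain: reversing the host name makes lexicographic order follow the domain hierarchy. The function name is illustrative.

  def row_key(url_host):
      # "math.gatech.edu" -> "edu.gatech.math"
      return ".".join(reversed(url_host.split(".")))

  hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
  print(sorted(row_key(h) for h in hosts))
  # ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']
  # All gatech.edu pages are now adjacent, so short row scans touch few tablets.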
89
Columns
  • Columns have a two-level name structure:
  • family:optional_qualifier (e.g. anchor:cnnsi.com)
  • Column family
  • Unit of access control
  • Has associated type information
  • Qualifier gives unbounded columns
  • Additional levels of indexing, if desired

90
Timestamps
  • Used to store different versions of data in a
    cell
  • New writes default to current time, but
    timestamps for writes can also be set explicitly
    by clients
  • Lookup options
  • Return most recent K values
  • Return all values in timestamp range (or all
    values)
  • Column families can be marked w/ attributes
  • Only retain most recent K values in a cell
  • Keep values until they are older than K seconds

91
Implementation Three Major Components
  • A library linked into every client
  • One master server
  • Responsible for:
  • Assigning tablets to tablet servers
  • Detecting addition and expiration of tablet
    servers
  • Balancing tablet-server load
  • Garbage collection
  • Many tablet servers
  • Tablet servers handle read and write requests to
    their tablets
  • They split tablets that have grown too large

92
Implementation (cont.)
  • Client data doesn't move through the master server.
    Clients communicate directly with tablet servers
    for reads and writes.
  • Most clients never communicate with the master
    server, leaving it lightly loaded in practice.

93
Tablets
  • Large tables are broken into tablets at row
    boundaries
  • A tablet holds a contiguous range of rows
  • Clients can often choose row keys to achieve
    locality
  • Aim for 100 MB to 200 MB of data per tablet
  • Each serving machine is responsible for about 100
    tablets
  • Fast recovery:
  • 100 machines each pick up 1 tablet from a failed
    machine
  • Fine-grained load balancing:
  • Migrate tablets away from an overloaded machine
  • The master makes load-balancing decisions

94
Tablet Location
  • Since tablets move around from server to server,
    given a row, how do clients find the right
    machine?
  • Need to find tablet whose row range covers the
    target row

95
Tablet Assignment
  • Each tablet is assigned to one tablet server at a
    time.
  • The master server keeps track of the set of live
    tablet servers and the current assignment of tablets
    to servers. It also keeps track of unassigned
    tablets.
  • When a tablet is unassigned, the master assigns the
    tablet to a tablet server with sufficient room.

96
API
  • Metadata operations
  • Create/delete tables, column families, change
    metadata
  • Writes (atomic)
  • Set(): write cells in a row
  • DeleteCells(): delete cells in a row
  • DeleteRow(): delete all cells in a row
  • Reads
  • Scanner: read arbitrary cells in a bigtable
  • Each row read is atomic
  • Can restrict returned rows to a particular range
  • Can ask for just data from 1 row, all rows, etc.
  • Can ask for all columns, just certain column
    families, or specific columns
97
Refinements Compression
  • Many opportunities for compression
  • Similar values in the same row/column at
    different timestamps
  • Similar values in different columns
  • Similar values across adjacent rows
  • Two-pass custom compression scheme
  • First pass: compress long common strings across a
    large window
  • Second pass: look for repetitions in a small window
  • Speed is emphasized, but space reduction is still
    good (10-to-1)

98
Refinements Bloom Filters
  • A read operation has to go to disk when the desired
    SSTable isn't in memory
  • Reduce the number of accesses by specifying a Bloom
    filter
  • It allows us to ask whether an SSTable might contain
    data for a specified row/column pair (a minimal
    Bloom filter sketch follows)
  • A small amount of memory for Bloom filters
    drastically reduces the number of disk seeks for
    read operations
  • In practice, most lookups for non-existent
    rows or columns do not need to touch disk

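A minimal, generic Bloom filter sketch (not Bigtable's implementation): it can answer "definitely not present" without touching disk, at the cost of occasional false positives.

  import hashlib

  class BloomFilter:
      def __init__(self, size_bits=1024, num_hashes=4):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(size_bits // 8 + 1)

      def _positions(self, key):
          for i in range(self.num_hashes):
              h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
              yield int(h, 16) % self.size

      def add(self, key):
          for p in self._positions(key):
              self.bits[p // 8] |= 1 << (p % 8)

      def might_contain(self, key):
          return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

  # Index (row, column) pairs present in an SSTable; skip the disk seek when absent.
  bf = BloomFilter()
  bf.add(("com.cnn.www", "anchor:cnnsi.com"))
  print(bf.might_contain(("com.cnn.www", "anchor:cnnsi.com")))  # True
  print(bf.might_contain(("org.example", "contents:")))         # almost surely False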

100
Outline
  • Overview of cloud computing
  • Google cloud computing techniques: GFS, Bigtable and MapReduce
  • Yahoo cloud computing technique: Hadoop
  • ????????

100
101
Yahoo! Cloud computing

102
Search Results of the Future
yelp.com
Gawker
babycenter
New York Times
epicurious
LinkedIn
answers.com
webmd
103
What's in the Horizontal Cloud?
  • Simple Web Service APIs
  • Horizontal Cloud Services
  • Edge Content Services (e.g., YCS, YCPI)
  • Provisioning and Virtualization (e.g., EC2)
  • Batch Storage and Processing (e.g., Hadoop and Pig)
  • Operational Storage (e.g., S3, MObStor, Sherpa)
  • Other Services: Messaging, Workflow, virtual DBs and Webserving
  • ID and Account Management
  • Shared Infrastructure
  • Metering, Billing, Accounting
  • Monitoring and QoS
  • Common approaches to QA, Production Engineering, Performance Engineering, Datacenter Management, and Optimization
104
Yahoo! Cloud Stack
  • EDGE (Horizontal Cloud Services): YCS, YCPI, Brooklyn
  • WEB (Horizontal Cloud Services): VM/OS, yApache, PHP, App Engine
  • APP (Horizontal Cloud Services): VM/OS, Serving Grid, Data Highway
  • STORAGE (Horizontal Cloud Services): Sherpa, MObStor
  • BATCH (Horizontal Cloud Services): Hadoop
  • Cross-layer: Provisioning (Self-serve), Monitoring/Metering/Security
105
Web Data Management
  • Structured record storage (PNUTS/Sherpa): CRUD, point
    lookups and short scans, index-organized tables and
    random I/Os; cost measured per unit of latency
  • Large data analysis (Hadoop): scan-oriented workloads,
    focus on sequential disk I/O; cost measured per CPU cycle
  • Blob storage (SAN/NAS): object retrieval and streaming,
    scalable file storage; cost measured per GB
106
The World Has Changed
  • Web serving applications need
  • Scalability!
  • Preferably elastic
  • Flexible schemas
  • Geographic distribution
  • High availability
  • Reliable storage
  • Web serving applications can do without
  • Complicated queries
  • Strong transactions

107
PNUTS / SHERPA To Help You Scale Your Mountains
of Data
108
Yahoo! Serving Storage Problem
  • Small records: 100 KB or less
  • Structured records: lots of fields, evolving
  • Extreme data scale: tens of TB
  • Extreme request scale: tens of thousands of
    requests/sec
  • Low latency globally: 20+ datacenters worldwide
  • High availability: outages cost millions
  • Variable usage patterns: applications and
    users change

108
109
The PNUTS/Sherpa Solution
  • The next-generation global-scale record store
  • Record orientation: routing and data storage
    optimized for low-latency record access
  • Scale out: add machines to scale throughput
    (while keeping latency low)
  • Asynchrony: pub-sub replication to far-flung
    datacenters to mask propagation delay
  • Consistency model: reduce the complexity of
    asynchrony for the application programmer
  • Cloud deployment model: hosted, managed service
    to reduce app time-to-market and enable on-demand
    scale and elasticity

109
110
What is PNUTS/Sherpa?
CREATE TABLE Parts ( ID VARCHAR, StockNumber
INT, Status VARCHAR )
Structured, flexible schema
Geographic replication
Parallel database
Hosted, managed infrastructure
110
111
What Will It Become?
Indexes and views
CREATE TABLE Parts ( ID VARCHAR, StockNumber
INT, Status VARCHAR )
Geographic replication
Parallel database
Structured, flexible schema
Hosted, managed infrastructure
112
What Will It Become?
Indexes and views
113
Design Goals
  • Scalability
  • Thousands of machines
  • Easy to add capacity
  • Restrict query language to avoid costly queries
  • Geographic replication
  • Asynchronous replication around the globe
  • Low-latency local access
  • High availability and fault tolerance
  • Automatically recover from failures
  • Serve reads and writes despite failures
  • Consistency
  • Per-record guarantees
  • Timeline model
  • Option to relax if needed
  • Multiple access paths
  • Hash table, ordered table
  • Primary, secondary access
  • Hosted service
  • Applications plug and play
  • Share operational cost

113
114
Technology Elements
Applications
Tabular API
PNUTS API
  • PNUTS
  • Query planning and execution
  • Index maintenance
  • Distributed infrastructure for tabular data
  • Data partitioning
  • Update consistency
  • Replication

YCA: Authorization
  • YDOT FS: ordered tables
  • YDHT FS: hash tables
  • Tribble: pub/sub messaging
  • Zookeeper: consistency service

114
115
Data Manipulation
  • Per-record operations
  • Get
  • Set
  • Delete
  • Multi-record operations
  • Multiget
  • Scan
  • Getrange (a toy client sketch of these operations follows)

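A toy, in-memory client sketch of these per-record and multi-record operations. The class and method names are hypothetical stand-ins, not the actual PNUTS/Sherpa API.

  class RecordStoreClient:
      """Toy in-memory stand-in for a PNUTS-style table of records."""
      def __init__(self):
          self.table = {}                      # key -> record (a dict of fields)

      # Per-record operations
      def get(self, key):
          return self.table.get(key)

      def set(self, key, record):
          self.table[key] = record

      def delete(self, key):
          self.table.pop(key, None)

      # Multi-record operations
      def multiget(self, keys):
          return {k: self.table.get(k) for k in keys}

      def scan(self):
          for key in sorted(self.table):       # full scan in key order
              yield key, self.table[key]

      def getrange(self, start, end):
          for key in sorted(self.table):       # range queries need the ordered (YDOT) layout
              if start <= key <= end:
                  yield key, self.table[key]

  c = RecordStoreClient()
  c.set("apple", {"price": 1})
  c.set("banana", {"price": 2})
  c.set("tomato", {"price": 14})
  print(c.multiget(["apple", "tomato"]))
  print(list(c.getrange("a", "c")))            # apple, banana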
115
116
Tablets: Hash Table
Row keys are hashed; boundaries 0x0000, 0x2AF3, 0x911F and 0xFFFF split the hash space into tablets. The records shown:

  Name        Price  Description
  Grape       12     Grapes are good to eat
  Lime        9      Limes are green
  Apple       1      Apple is wisdom
  Strawberry  900    Strawberry shortcake
  Orange      2      Arrgh! Don't get scurvy!
  Avocado     3      But at what price?
  Lemon       1      How much did you pay for this lemon?
  Tomato      14     Is this a vegetable?
  Banana      2      The perfect fruit
  Kiwi        8      New Zealand
116
117
Tablets: Ordered Table
Row keys are kept in sorted order; boundaries A, H, Q and Z split the key space into tablets. The same records, now sorted by Name:

  Name        Price  Description
  Apple       1      Apple is wisdom
  Avocado     3      But at what price?
  Banana      2      The perfect fruit
  Grape       12     Grapes are good to eat
  Kiwi        8      New Zealand
  Lemon       1      How much did you pay for this lemon?
  Lime        9      Limes are green
  Orange      2      Arrgh! Don't get scurvy!
  Strawberry  900    Strawberry shortcake
  Tomato      14     Is this a vegetable?
117
118
Flexible Schema

  Posted date  Listing id  Item   Price  Condition  Color
  6/1/07       424252      Couch  570    Good       -
  6/1/07       763245      Bike   86     -          -
  6/3/07       211242      Car    1123   Fair       Red
  6/5/07       421133      Lamp   15     -          -
119
Detailed Architecture
Local region
Remote regions
Clients
REST API
Routers
Tribble
Tablet Controller
Storage units
119
120
Tablet Splitting and Balancing
Each storage unit has many tablets (horizontal
partitions of the table)
Storage unit may become a hotspot
Tablets may grow over time
Overfull tablets split
Shed load by moving tablets to other servers
120
121
QUERY PROCESSING
121
122
Accessing Data
A get for key k goes to a router, which forwards it to the storage unit (SU) holding the tablet that covers k (a lookup sketch follows).
122
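A sketch of what the router does for the "get key k" path above: keep the tablet boundaries sorted and binary-search them to find the storage unit (SU) that serves the key. Boundaries and SU names are made up for illustration.

  import bisect

  # Tablet boundaries (upper bounds) and the storage unit serving each tablet.
  boundaries = ["h", "q", "z"]                 # tablets: [a, h), [h, q), [q, z]
  storage_units = ["SU-1", "SU-2", "SU-3"]

  def route(key):
      # Find the first boundary greater than the key; that tablet's SU owns it.
      idx = bisect.bisect_right(boundaries, key)
      idx = min(idx, len(storage_units) - 1)   # clamp keys past the last boundary
      return storage_units[idx]

  print(route("grape"))    # SU-1
  print(route("lemon"))    # SU-2
  print(route("tomato"))   # SU-3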
123
Bulk Read
SU
SU
SU
123
124
Range Queries in YDOT
  • Clustered, ordered retrieval of records

Apple Avocado Banana Blueberry
Canteloupe Grape Kiwi Lemon
Lime Mango Orange
Strawberry Tomato Watermelon
Apple Avocado Banana Blueberry
Canteloupe Grape Kiwi Lemon
Lime Mango Orange
Strawberry Tomato Watermelon
125
Updates
Write key k
Sequence for key k
Routers
Message brokers
Write key k
Sequence for key k
SUCCESS
Write key k
125
126
ASYNCHRONOUS REPLICATION AND CONSISTENCY
126
127
Asynchronous Replication
127
128
Consistency Model
  • Goal: make it easier for applications to reason
    about updates and cope with asynchrony
  • What happens to a record with primary key "Alice"?

(Figure: the record's timeline runs from insert through a series of updates to a delete, producing versions v.1 through v.8 within one generation.)

As the record is updated, copies may get out of
sync.
128
129
Example: Social Alice
(Figure: record timeline across the West and East regions. Alice's status is updated in one region from blank to "Busy" to "Free"; replicas in the other region lag behind, so a reader there may briefly see an older status before all copies converge on "Free".)
130
Consistency Model
(Figure: a read request arrives while some replicas hold the current version and others hold stale versions along the v.1 to v.8 timeline.)
In general, reads are served using a local copy.
130
131
Consistency Model
(Figure: the same timeline; a "read up-to-date" request is directed to the current version.)
But an application can request and get the current
version.
131
132
Consistency Model
(Figure: the same timeline; a "read v.6" request returns that specific version.)
Or variations such as "read forward": while copies
may lag the master record, every copy goes
through the same sequence of changes.
132
133
Consistency Model
(Figure: a write advances the record to a new version on the timeline.)
Writes are achieved via a per-record primary-copy
protocol. (To maximize availability, record
masterships are automatically transferred if a site
fails.) This can be selectively weakened to eventual
consistency (local writes that are reconciled
using version vectors).
133
134
Consistency Model
(Figure: a "write if v.7" test-and-set request returns ERROR because the record has already advanced past that version.)
Test-and-set writes facilitate per-record
transactions (sketched below).
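A sketch of the test-and-set idea in Python (illustrative only, not the PNUTS API): the write carries the version the client last read, and it is rejected if the record has already moved on.

  class VersionMismatch(Exception):
      pass

  class Record:
      def __init__(self, value):
          self.value, self.version = value, 1

      def test_and_set(self, new_value, expected_version):
          # Reject the write if another update slipped in since the client's read.
          if self.version != expected_version:
              raise VersionMismatch(f"record is at v.{self.version}, not v.{expected_version}")
          self.value, self.version = new_value, self.version + 1

  alice = Record("Busy")
  v = alice.version                       # client reads v.1
  alice.test_and_set("Free", v)           # succeeds, record moves to v.2
  try:
      alice.test_and_set("Sleeping", v)   # stale version: rejected
  except VersionMismatch as e:
      print("ERROR:", e)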
134
135
Consistency Techniques
  • Per-record mastering
  • Each record is assigned a master region
  • May differ between records
  • Updates to the record forwarded to the master
    region
  • Ensures consistent ordering of updates
  • Tablet-level mastering
  • Each tablet is assigned a master region
  • Inserts and deletes of records forwarded to the
    master region
  • Master region decides tablet splits
  • These details are hidden from the application
  • Except for the latency impact!

136
Mastering
(Figure: three replicas of a tablet hold the same records A through F; each record stores its own master region (E, W, or C), and one of the replicas also serves as the tablet master.)
136
137
Bulk Insert/Update/Replace
  • Client feeds records to bulk manager
  • Bulk loader transfers records to SUs in batches
  • Bypass routers and message brokers
  • Efficient import into storage unit

Client
Bulk manager
Source Data
138
Bulk Load in YDOT
  • YDOT bulk inserts can cause performance hotspots
  • Solution: preallocate tablets

139
Index Maintenance
  • How do we get lots of interesting indexes and
    views without killing performance?
  • Solution: asynchrony!
  • Indexes/views are updated asynchronously when the
    base table is updated

140
SHERPA IN CONTEXT
140
141
Types of Record Stores
  • Query expressiveness, from simple to feature-rich:
  • S3: object retrieval
  • PNUTS: retrieval from a single table of objects/records
  • Oracle: SQL
142
Types of Record Stores
  • Consistency model, from best effort to strong guarantees:
  • S3: eventual consistency
  • PNUTS: timeline consistency
  • Oracle: ACID
  • (The same spectrum runs from object-centric to program-centric consistency.)
143
Types of Record Stores
  • Data model:
  • CouchDB and PNUTS: flexibility and schema evolution, with object-centric consistency
  • Oracle: optimized for fixed schemas, with consistency that spans objects
144
Types of Record Stores
  • Elasticity (ability to add resources on demand):
  • PNUTS and S3: elastic, designed for very large scale distribution/replication (VLSD)
  • Oracle: inelastic, or limited elasticity via data distribution
145
Data Stores Comparison
  • User-partitioned SQL stores (Microsoft Azure SDS, Amazon
    SimpleDB) versus PNUTS: more expressive queries, but users
    must control partitioning and elasticity is limited
  • Multi-tenant application databases (Salesforce.com, Oracle
    on Demand) versus PNUTS: highly optimized for complex
    workloads, but limited flexibility for evolving applications,
    and they inherit the limitations of the underlying data
    management system
  • Mutable object stores (Amazon S3) versus PNUTS: object
    storage versus record management
146
Application Design Space
(Figure: systems placed along two axes, records versus files and "get a few things" versus "scan everything": Sherpa, YMDB, MySQL, Oracle and BigTable on the record side, MObStor and filers on the file side, and Hadoop and Everest toward "scan everything".)
146
147
Alternatives Matrix
(Matrix comparing Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, and Cassandra on consistency model, structured access, global low latency, SQL/ACID, availability, operability, updates, and elasticity.)
147
148
QUESTIONS?
148
149
Hadoop

150
Problem
  • How do you scale up applications?
  • Run jobs processing 100s of terabytes of data
  • Takes 11 days to read on 1 computer
  • Need lots of cheap computers
  • Fixes speed problem (15 minutes on 1000
    computers), but
  • Reliability problems
  • In large clusters, computers fail every day
  • Cluster size is not fixed
  • Need common infrastructure
  • Must be efficient and reliable

151
Solution
  • Open Source Apache Project
  • Hadoop Core includes
  • Distributed File System - distributes data
  • Map/Reduce - distributes application
  • Written in Java
  • Runs on
  • Linux, Mac OS/X, Windows, and Solaris
  • Commodity hardware

152
Hardware Cluster of Hadoop
  • Typically a two-level architecture
  • Nodes are commodity PCs
  • 40 nodes per rack
  • Uplink from each rack is 8 gigabit
  • Rack-internal bandwidth is 1 gigabit

153
Distributed File System
  • A single namespace for the entire cluster
  • Managed by a single namenode
  • Files are single-writer and append-only
  • Optimized for streaming reads of large files
  • Files are broken into large blocks
  • Typically 128 MB
  • Replicated to several datanodes for reliability
  • Access from Java, C, or the command line
154
Block Placement
  • Default is 3 replicas, but settable
  • Blocks are placed (writes are pipelined):
  • on the same node
  • on a different rack
  • on the other rack
  • Clients read from the closest replica
  • If the replication of a block drops below the
    target, it is automatically re-replicated
    (a placement sketch follows this list)

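A simplified, rack-aware placement sketch in the spirit of the list above. The topology, node names, and exact tie-breaking are assumptions; real HDFS placement logic differs in detail.

  import random

  def place_replicas(writer_node, topology, replication=3):
      """topology: dict rack -> list of nodes. Returns the write pipeline targets."""
      local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
      targets = [writer_node]                                  # 1st replica: the writer's node
      remote_rack = random.choice([r for r in topology if r != local_rack])
      remote_nodes = list(topology[remote_rack])
      first_remote = random.choice(remote_nodes)
      targets.append(first_remote)                             # 2nd replica: a different rack
      others = [n for n in remote_nodes if n != first_remote]
      targets.append(random.choice(others) if others else first_remote)  # 3rd: same remote rack
      return targets[:replication]

  topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
  print(place_replicas("n2", topology))   # e.g. ['n2', 'n5', 'n4']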
155
How is Yahoo using Hadoop?
  • Started with building better applications
  • Scale up web-scale batch applications (search,
    ads, ...)
  • Factor out common code from existing systems, so
    new applications will be easier to write
  • Manage the many clusters

156
Running Production WebMap
  • Search needs a graph of the known web
  • Invert edges, compute link text, whole-graph
    heuristics
  • Periodic batch job using Map/Reduce
  • Uses a chain of about 100 map/reduce jobs
  • Scale:
  • about 1 trillion edges in the graph
  • largest shuffle is about 450 TB
  • final output is about 300 TB compressed
  • runs on about 10,000 cores
  • about 5 PB of raw disk used

157
Terabyte Sort Benchmark
  • Started by Jim Gray at Microsoft in 1998
  • Sorting 10 billion 100-byte records
  • Hadoop won the general category in 209 seconds
  • 910 nodes
  • 2 quad-core Xeons @ 2.0 GHz per node
  • 4 SATA disks per node
  • 8 GB RAM per node
  • 1 gigabit Ethernet per node
  • 40 nodes per rack
  • 8 gigabit Ethernet uplink per rack
  • The previous record was 297 seconds
158
Hadoop clusters
  • We have 20,000 machines running Hadoop
  • Our largest clusters are currently 2000 nodes
  • Several petabytes of user data (compressed,
    unreplicated)
  • We run hundreds of thousands of jobs every month

159
Research Cluster Usage
160
Who Uses Hadoop?
  • Amazon/A9
  • AOL
  • Facebook
  • Fox interactive media
  • Google / IBM
  • New York Times
  • PowerSet (now Microsoft)
  • Quantcast
  • Rackspace/Mailtrust
  • Veoh
  • Yahoo!
  • More at http://wiki.apache.org/hadoop/PoweredBy

161
Q&A
  • For more information:
  • Website: http://hadoop.apache.org/core
  • Mailing lists:
  • core-dev@hadoop.apache.org
  • core-user@hadoop.apache.org

162
Outline
  • Overview of cloud computing
  • Google cloud computing techniques: GFS, Bigtable and MapReduce
  • Yahoo cloud computing technique: Hadoop
  • ????????

162
163
Summary of Applications
  • BigTable, HBase, HyperTable, Hive, HadoopDB: data analysis, Internet services, private clouds
  • PNUTS: Web applications, and operations that can tolerate relaxed consistency
164
Architecture
Three architectures are compared:
  • MapReduce-based: BigTable, HBase, Hypertable, Hive
  • DBMS-based: SQL Azure, PNUTS, Voldemort
  • Hybrid of MapReduce and DBMS: HadoopDB
The comparison covers scalability, fault tolerance, performance, the ability to run in a heterogeneous environment, how easy it is to support SQL and to exploit indexes and optimization methods, possible data-storage bottlenecks, and whether data is replicated in the file system or on top of the DBMS.
165
Consistency
  • Two kinds of consistency:
  • strong consistency: ACID (Atomicity, Consistency,
    Isolation, Durability)
  • weak consistency: BASE (Basically Available,
    Soft-state, Eventual consistency)
(Figure: BigTable, HBase, Hive, Hypertable, HadoopDB, PNUTS, and SQL Azure placed on the CAP triangle of Consistency, Availability, and Partition tolerance.)
166
A tailor
(Figure: the features an RDBMS is tailored around: LOCK, ACID, SAFETY, TRANSACTION, 3NF.)
167
Further Reading
  • Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan. Efficient Bulk Insertion into a Distributed Ordered Table. SIGMOD 2008.
  • Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008.
  • Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Raghu Ramakrishnan. Asynchronous View Maintenance for VLSD Databases. SIGMOD 2009.
  • Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava. Cloud Storage Design in a PNUTShell. In Beautiful Data, O'Reilly Media, 2009.
168
Further Reading
  • F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
  • J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
  • G. DeCandia et al. Dynamo: Amazon's highly available key-value store. In SOSP, 2007.
  • S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proc. SOSP, 2003.
  • D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422-469, 2000.
169
Cloud Computing and Cloud Data Management
  • Forthcoming textbook, June 2010 (remaining slide details are garbled in the source)

170
Textbook table of contents (chapter and section titles garbled in the source)
  • Chapter 1 (sections 1.1 to 1.4)
  • Chapter 2 (sections 2.1 to 2.4)
  • Chapter 3 (sections 3.1 to 3.4)

171
Textbook table of contents, continued
  • Chapter 4 (sections 4.1 to 4.3)
  • Chapter 5: CORBA (5.1 to 5.5, including a Java IDL example in 5.4)
  • Chapter 6 (sections 6.1 to 6.4)
  • Chapter 7: Google cloud computing (beginning with 7.1 on Google)
172
???????????????
  • ?8? Yahoo??????
  • 8.1 PNUTS ??????????
  • 8.2 Pig ??????????
  • 8.3 ZooKeeper ??????????????
  • 8.4 ??
  • ?9? Aneka ??????
  • 9.1 Aneka ???
  • 9.2 ????????
  • 9.3 Aneka??????????????
  • 9.4 ??
  • ?10? Greenplum??????
  • 10.1 GreenPlum????
  • 10.2 GreenPlum?????
  • 10.3 GreenPlum???????????
  • 10.4 GreenPlum????????
  • 10.5 ??
  • ?11? Amazon dynamo??????
  • 11.1 Amazon dynamo??
  • 11.2 Amazon dynamo?????

173
Textbook table of contents, continued
  • Chapter 13: Hadoop (13.1 Hadoop, 13.2 Map/Reduce, 13.3 to 13.5)
  • Chapter 14: HBase (sections 14.1 to 14.5)

174
Textbook table of contents, continued
  • Chapter 15: Google Apps and Google App Engine (sections 15.1 to 15.4)
  • Chapter 16: MS Azure / Windows Azure (sections 16.1 to 16.3)
  • Chapter 17: Amazon EC2 (17.1 Amazon Elastic Compute Cloud, 17.2, 17.3)

175
Q&A. Thanks!