Transcript and Presenter's Notes

Title: www.jiahenglu.net


1
Cloud Computing and Cloud Data Management
  • Jiaheng Lu
  • Renmin University of China
  • www.jiahenglu.net

2
CLOUD COMPUTING
3
Outline
  • Overview of cloud computing
  • Google cloud computing techniques: GFS, Bigtable and MapReduce
  • Yahoo cloud computing technique: Hadoop
  • ????????

3
4
Cloud computing

5
6
Why do we use cloud computing?
7
Why do we use cloud computing?
  • Case 1
  • Write a file
  • Save it
  • The computer crashes, and the file is lost
  • Files stored in the cloud are never lost

8
Why do we use cloud computing?
  • Case 2
  • Use IE: download, install, use
  • Use QQ: download, install, use
  • Use C: download, install, use
  • Instead, get the service from the cloud

9
What is the cloud, and what is cloud computing?
  • Cloud
  • On-demand resources or services delivered over the
    Internet
  • with the scale and reliability of a data center

10
What is the cloud, and what is cloud computing?
  • Cloud computing is a style of computing in
    which dynamically scalable and often virtualized
    resources are provided as a service over the
    Internet.
  • Users need not have knowledge of, expertise in,
    or control over the technology infrastructure in
    the "cloud" that supports them.

11
Characteristics of cloud computing
  • Virtual
  • Software, databases, Web servers, operating
    systems, storage and networking are delivered as
    virtual servers.
  • On demand
  • Add and subtract processors, memory, network
    bandwidth, and storage as needed.

12
Types of cloud service
SaaS Software as a Service
PaaS Platform as a Service
IaaS Infrastructure as a Service
13
SaaS
  • Software delivery model
  • No hardware or software to manage
  • Service delivered through a browser
  • Customers use the service on demand
  • Instant Scalability

14
SaaS
  • Examples
  • Your current CRM package is not managing the load,
    or you simply don't want to host it in-house. Use
    a SaaS provider such as Salesforce.com.
  • Your email is hosted on an Exchange server in
    your office and it is very slow. Outsource this
    using Hosted Exchange.

15
PaaS
  • Platform delivery model
  • Platforms are built upon Infrastructure, which is
    expensive
  • Estimating demand is not a science!
  • Platform management is not fun!

16
PaaS
  • Examples
  • You need to host a large file (5 MB) on your
    website and make it available to 35,000 users
    for only two months. Use CloudFront from Amazon.
  • You want to start storage services on your
    network for a large number of files and you do
    not have the storage capacity. Use Amazon S3.

17
IaaS
  • Computer infrastructure delivery model
  • A platform virtualization environment
  • Computing resources, such as storage and
    processing capacity
  • Virtualization taken a step further

18
IaaS
  • Examples
  • You want to run a batch job but you don't have
    the infrastructure necessary to run it in a
    timely manner. Use Amazon EC2.
  • You want to host a website, but only for a few
    days. Use Flexiscale.

19
Cloud computing and other computing techniques

20
CLOUD COMPUTING
21
The 21st Century Vision Of Computing
Leonard Kleinrock, one of the chief scientists of
the original Advanced Research Projects Agency
Network (ARPANET) project which seeded the
Internet, said: "As of now, computer networks are
still in their infancy, but as they grow up and
become sophisticated, we will probably see the
spread of computer utilities which, like present
electric and telephone utilities, will service
individual homes and offices across the country."
22
The 21st Century Vision Of Computing
Sun Microsystems co-founder Bill Joy
23
The 21st Century Vision Of Computing
24
Definitions
utility
25
Definitions
Utility computing is the packaging of computing
resources, such as computation and storage, as a
metered service similar to a traditional public
utility
utility
26
Definitions
utility
A computer cluster is a group of linked
computers, working together closely so that in
many respects they form a single computer.
27
Definitions
utility
Grid computing is the application of several
computers to a single problem at the same time
usually to a scientific or technical problem that
requires a great number of computer processing
cycles or access to large amounts of data
28
Definitions
utility
Cloud computing is a style of computing in which
dynamically scalable and often virtualized
resources are provided as a service over the
Internet.
29
Grid Computing vs. Cloud Computing
  • They share a lot of commonality
  • intention, architecture and technology
  • Differences
  • programming model, business model, compute
    model, applications, and virtualization

30
Grid Computing vs. Cloud Computing
  • the problems are mostly the same
  • manage large facilities
  • define methods by which consumers discover,
    request and use resources provided by the central
    facilities
  • implement the often highly parallel computations
    that execute on those resources.

31
Grid Computing vs. Cloud Computing
  • Virtualization
  • Grid
  • Grids do not rely on virtualization as much as
    clouds do; each individual organization maintains
    full control of its resources
  • Cloud
  • Virtualization is an indispensable ingredient of
    almost every cloud

32
33
Any questions or comments?
2015-10-3
33
34
Outline
  • Overview of cloud computing
  • Google cloud computing techniques: GFS, Bigtable and MapReduce
  • Yahoo cloud computing technique: Hadoop
  • ????????

34
35
Google Cloud computing techniques

36
Cloud Systems
  • BigTable-like (BigTable, OSDI'06): HBase, HyperTable
  • MapReduce-based: Hive (VLDB'09), HadoopDB (VLDB'09)
  • DBMS-based: GreenPlum, SQL Azure
  • Others: CouchDB, Voldemort, PNUTS (VLDB'08)
37
The Google File System
38
The Google File System (GFS)
  • A scalable distributed file system for large
    distributed data-intensive applications
  • Multiple GFS clusters are currently deployed.
  • The largest ones have:
  • over 1000 storage nodes
  • over 300 terabytes of disk storage
  • and are heavily accessed by hundreds of clients on
    distinct machines

39
Introduction
  • Shares many of the same goals as previous distributed
    file systems
  • performance, scalability, reliability, etc.
  • The GFS design has been driven by four key
    observations of Google's application workloads and
    technological environment

40
Intro Observations 1
  • 1. Component failures are the norm
  • constant monitoring, error detection, fault
    tolerance and automatic recovery are integral to
    the system
  • 2. Huge files (by traditional standards)
  • Multi-GB files are common
  • I/O operations and block sizes must be revisited

41
Intro Observations 2
  • 3. Most files are mutated by appending new data
  • This is the focus of performance optimization and
    atomicity guarantees
  • 4. Co-designing the applications and APIs
    benefits the overall system by increasing flexibility

42
The Design
  • Cluster consists of a single master and multiple
    chunkservers and is accessed by multiple clients

43
The Master
  • Maintains all file system metadata
  • namespace, access control info, file-to-chunk
    mappings, chunk (and replica) locations,
    etc.
  • Periodically communicates with chunkservers via
    HeartBeat messages to give instructions and check
    state

44
The Master
  • Helps make sophisticated chunk placement and
    replication decisions, using global knowledge
  • For reading and writing, a client contacts the Master
    to get chunk locations, then deals directly with
    chunkservers (see the sketch below)
  • The Master is not a bottleneck for reads/writes

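To make the division of labor concrete, here is a minimal Python sketch of the read path just described. The class and function names (Master, Chunkserver, client_read) are illustrative assumptions, not the real GFS interfaces.

  CHUNK_SIZE = 64 * 2**20  # 64 MB chunks, as in GFS

  class Master:
      """Holds only metadata: file name -> chunk handles -> replica locations."""
      def __init__(self, file_to_chunks, chunk_locations):
          self.file_to_chunks = file_to_chunks      # e.g. {"/logs/a": ["h1", "h2"]}
          self.chunk_locations = chunk_locations    # e.g. {"h1": ["cs1", "cs2", "cs3"]}

      def lookup(self, filename, offset):
          # Translate (file, byte offset) into (chunk handle, replica locations).
          index = offset // CHUNK_SIZE
          handle = self.file_to_chunks[filename][index]
          return handle, self.chunk_locations[handle]

  class Chunkserver:
      def __init__(self, chunks):
          self.chunks = chunks                      # handle -> bytes

      def read(self, handle, start, length):
          return self.chunks[handle][start:start + length]

  def client_read(master, chunkservers, filename, offset, length):
      # 1. Ask the master for metadata only.
      handle, replicas = master.lookup(filename, offset)
      # 2. Read the data directly from one replica (e.g. the first/closest one).
      server = chunkservers[replicas[0]]
      return server.read(handle, offset % CHUNK_SIZE, length)

  # Tiny demo with one chunk on one chunkserver.
  master = Master({"/logs/a": ["h1"]}, {"h1": ["cs1"]})
  servers = {"cs1": Chunkserver({"h1": b"hello gfs"})}
  print(client_read(master, servers, "/logs/a", 0, 5))   # b'hello'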
45
Chunkservers
  • Files are broken into chunks. Each chunk has an
    immutable, globally unique 64-bit chunk handle.
  • The handle is assigned by the master at chunk
    creation
  • Chunk size is 64 MB
  • Each chunk is replicated on 3 (default) servers

46
Clients
  • Linked to apps using the file system API.
  • Communicates with master and chunkservers for
    reading and writing
  • Master interactions only for metadata
  • Chunkserver interactions for data
  • Only caches metadata information
  • Data is too large to cache.

47
Chunk Locations
  • The Master does not keep a persistent record of the
    locations of chunks and replicas.
  • It polls chunkservers for this information at startup
    and when new chunkservers join the cluster.
  • It stays up to date by controlling the placement of
    new chunks and through HeartBeat messages (when
    monitoring chunkservers)

48
Operation Log
  • Record of all critical metadata changes
  • Stored on Master and replicated on other machines
  • Defines order of concurrent operations
  • Also used to recover the file system state

49
System Interactions Leases and Mutation Order
  • Leases maintain a mutation order across all chunk
    replicas
  • The Master grants a lease to one replica, called the
    primary
  • The primary chooses the serial mutation order, and
    all replicas follow this order
  • This minimizes management overhead for the Master

50
Atomic Record Append
  • The client specifies the data to write; GFS chooses
    and returns the offset it writes to, and appends
    the data to each replica at least once
  • Heavily used by Google's distributed applications
  • No need for a distributed lock manager
  • GFS chooses the offset, not the client

51
Atomic Record Append How?
  • Follows a similar control flow as mutations
  • The primary tells the secondary replicas to append at
    the same offset as the primary
  • If an append fails at any replica, it is
    retried by the client
  • So replicas of the same chunk may contain
    different data, including duplicates, in whole or in
    part, of the same record

52
Atomic Record Append How?
  • GFS does not guarantee that all replicas are
    bitwise identical.
  • Only guarantees that data is written at least
    once in an atomic unit.
  • Data must be written at the same offset for all
    chunk replicas for success to be reported (a
    client-side retry-and-dedup pattern is sketched below)

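Since record append is at-least-once, readers may encounter duplicates. The following sketch shows a common client-side pattern consistent with the slides: the writer retries and tags each record with a unique ID, and the reader discards duplicates. All names here are hypothetical; this is not GFS client code.

  import uuid

  def append_with_retry(append_fn, payload, max_retries=3):
      """Retry a record append; the record may end up stored more than once."""
      record = {"id": str(uuid.uuid4()), "data": payload}  # unique ID enables dedup
      for attempt in range(max_retries):
          try:
              append_fn(record)
              return record["id"]
          except IOError:
              continue  # a failed attempt may still have written to some replicas
      raise IOError("append failed after retries")

  def read_deduplicated(records):
      """Readers skip records whose ID was already seen (handles duplicates)."""
      seen, result = set(), []
      for r in records:
          if r["id"] not in seen:
              seen.add(r["id"])
              result.append(r["data"])
      return result

  # Demo: the log ends up containing a duplicate of the same logical record.
  log = []
  rid = append_with_retry(log.append, "event-1")
  log.append({"id": rid, "data": "event-1"})   # simulate a duplicated append
  print(read_deduplicated(log))                # ['event-1']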
53
Detecting Stale Replicas
  • The Master keeps a chunk version number to distinguish
    up-to-date and stale replicas
  • The version is increased when granting a lease
  • If a replica is not available, its version is not
    increased
  • The Master detects stale replicas when chunkservers
    report their chunks and versions
  • Stale replicas are removed during garbage collection

54
Garbage collection
  • When a client deletes a file, the master logs it like
    other changes and renames the file to a hidden
    name.
  • The master removes files hidden for longer than 3
    days when scanning the file system namespace
  • their metadata is also erased
  • During HeartBeat messages, each chunkserver sends
    the master a subset of its chunks, and the
    master tells it which ones no longer have metadata.
  • The chunkserver removes these chunks on its own

55
Fault Tolerance: High Availability
  • Fast recovery
  • The Master and chunkservers can restart in seconds
  • Chunk replication
  • Master replication
  • "shadow" masters provide read-only access when the
    primary master is down
  • mutations are not considered done until recorded on
    all master replicas

56
Fault Tolerance: Data Integrity
  • Chunkservers use checksums to detect corrupt data
  • Since replicas are not bitwise identical,
    chunkservers maintain their own checksums
  • For reads, the chunkserver verifies the checksum
    before sending the chunk
  • Checksums are updated during writes (a minimal
    verification sketch follows)

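A minimal illustration of per-block checksumming in the spirit of the slide above; the 64 KB block size matches the GFS paper, but the functions themselves are made-up Python, not chunkserver code.

  import zlib

  BLOCK = 64 * 1024  # GFS checksums data in 64 KB blocks

  def build_checksums(chunk_bytes):
      return [zlib.crc32(chunk_bytes[i:i + BLOCK])
              for i in range(0, len(chunk_bytes), BLOCK)]

  def verified_read(chunk_bytes, checksums, offset, length):
      # Verify every 64 KB block that the requested range touches.
      first, last = offset // BLOCK, (offset + length - 1) // BLOCK
      for b in range(first, last + 1):
          block = chunk_bytes[b * BLOCK:(b + 1) * BLOCK]
          if zlib.crc32(block) != checksums[b]:
              raise IOError("corrupt block %d: report to master for re-replication" % b)
      return chunk_bytes[offset:offset + length]

  data = b"x" * 200_000
  sums = build_checksums(data)
  print(len(verified_read(data, sums, 100_000, 1000)))  # 1000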
57
Introduction to MapReduce
58
MapReduce Insight
  • Consider the problem of counting the number of
    occurrences of each word in a large collection of
    documents
  • How would you do it in parallel ?

59
MapReduce Programming Model
  • Inspired by the map and reduce operations commonly
    used in functional programming languages like
    Lisp.
  • Users implement an interface with two primary methods:
  • 1. Map: (key1, val1) → list(key2, val2)
  • 2. Reduce: (key2, list(val2)) → val3

60
Map operation
  • Map, a pure function written by the user, takes
    an input key/value pair and produces a set of
    intermediate key/value pairs.
  • e.g. (docid, doc-content)
  • Drawing an analogy to SQL, map can be visualized as
    the group-by clause of an aggregate query.

61
Reduce operation
  • On completion of the map phase, all the intermediate
    values for a given output key are combined
    into a list and given to a reducer.
  • Can be visualized as an aggregate function (e.g.,
    average) computed over all the rows with
    the same group-by attribute.

62
Pseudo-code
  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

(A runnable Python version of this word count follows.)
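For reference, the same word count can be run end to end in plain Python. The tiny run_mapreduce driver below stands in for the framework (it performs the group-by-key shuffle); the function names are illustrative only.

  from collections import defaultdict

  def map_fn(doc_name, doc_contents):
      # Emit ("word", 1) for every word in the document.
      for word in doc_contents.split():
          yield word, 1

  def reduce_fn(word, counts):
      # Sum all partial counts for one word.
      yield word, sum(counts)

  def run_mapreduce(inputs, map_fn, reduce_fn):
      # Shuffle phase: group intermediate values by key.
      groups = defaultdict(list)
      for key, value in inputs:
          for k2, v2 in map_fn(key, value):
              groups[k2].append(v2)
      # Reduce phase: one call per distinct key.
      return dict(kv for k2, vs in groups.items() for kv in reduce_fn(k2, vs))

  docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
  print(run_mapreduce(docs, map_fn, reduce_fn))  # {'the': 3, 'quick': 1, ...}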
63
MapReduce Execution overview

64
MapReduce Example

65
MapReduce in Parallel Example

66
MapReduce Fault Tolerance
  • Handled via re-execution of tasks
  • Task completion is committed through the master
  • What happens if a mapper fails?
  • Re-execute completed and in-progress map tasks
  • What happens if a reducer fails?
  • Re-execute in-progress reduce tasks
  • What happens if the master fails?
  • Potential trouble!!

67
MapReduce
  • Walk through of One more Application

68
69
MapReduce PageRank
  • PageRank models the behavior of a "random surfer":
    PR(p) = (1 - d) + d * Σ_{t links to p} PR(t) / C(t)
  • C(t) is the out-degree of t, and (1 - d) is a
    damping factor (random jump)
  • The random surfer keeps clicking on successive
    links at random, not taking content into
    consideration.
  • A page distributes its rank equally among all
    pages it links to.
  • The damping factor accounts for the surfer getting
    bored and typing an arbitrary URL.

70
PageRank Key Insights
  • Effects at each iteration are local: the (i+1)th
    iteration depends only on the ith iteration
  • At iteration i, the PageRank of individual nodes can
    be computed independently

71
PageRank using MapReduce
  • Use a sparse matrix representation (M)
  • Map each row of M to a list of PageRank "credit"
    to assign to out-link neighbours.
  • These credits are reduced to a single
    PageRank value for each page by aggregating over
    them.

72
PageRank using MapReduce
Source of image: Lin 2008
73
Phase 1 Process HTML
  • Map task takes (URL, page-content) pairs and maps
    them to (URL, (PRinit, list-of-urls))
  • PRinit is the seed PageRank for URL
  • list-of-urls contains all pages pointed to by URL
  • Reduce task is just the identity function

74
Phase 2 PageRank Distribution
  • The reduce task gets (URL, url_list) and many (URL,
    val) values
  • Sum the vals and fix up with d to get the new PR
  • Emit (URL, (new_rank, url_list))
  • Check for convergence using a non-parallel
    component (one iteration is sketched below)

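A compact, illustrative sketch of one PageRank iteration written as map and reduce steps in Python. Function names and the toy graph are assumptions for illustration; the damping constant follows the formula quoted earlier, and the link lists are re-emitted so the graph structure survives the iteration.

  from collections import defaultdict

  D = 0.85  # damping factor

  def pagerank_map(url, state):
      rank, out_links = state
      yield url, ("links", out_links)                 # pass the graph structure along
      for target in out_links:                        # distribute rank to neighbours
          yield target, ("credit", rank / len(out_links))

  def pagerank_reduce(url, values):
      out_links, total = [], 0.0
      for kind, val in values:
          if kind == "links":
              out_links = val
          else:
              total += val
      new_rank = (1 - D) + D * total                  # PR(p) = (1 - d) + d * sum of credits
      yield url, (new_rank, out_links)

  def one_iteration(graph):
      groups = defaultdict(list)
      for url, state in graph.items():
          for k, v in pagerank_map(url, state):
              groups[k].append(v)
      return dict(kv for url, vals in groups.items()
                  for kv in pagerank_reduce(url, vals))

  graph = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
  for _ in range(10):
      graph = one_iteration(graph)
  print({u: round(r, 3) for u, (r, _) in graph.items()})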
75
MapReduce Some More Apps
  • Distributed Grep.
  • Count of URL Access Frequency.
  • Clustering (K-means)
  • Graph Algorithms.
  • Indexing Systems

MapReduce Programs In Google Source Tree
76
MapReduce Extensions and similar apps
  • PIG (Yahoo)
  • Hadoop (Apache)
  • DryadLinq (Microsoft)

77
Large Scale Systems Architecture using MapReduce
78
BigTable: A Distributed Storage System for Structured Data
79
Introduction
  • BigTable is a distributed storage system for
    managing structured data.
  • Designed to scale to a very large size
  • Petabytes of data across thousands of servers
  • Used for many Google projects
  • Web indexing, Personalized Search, Google Earth,
    Google Analytics, Google Finance, ...
  • A flexible, high-performance solution for all of
    Google's products

80
Motivation
  • Lots of (semi-)structured data at Google
  • URLs: contents, crawl metadata, links, anchors,
    PageRank, ...
  • Per-user data: user preference settings, recent
    queries/search results, ...
  • Geographic locations: physical entities (shops,
    restaurants, etc.), roads, satellite image data,
    user annotations, ...
  • Scale is large
  • Billions of URLs, many versions per page
    (around 20 KB per version)
  • Hundreds of millions of users, thousands of queries/sec
  • Over 100 TB of satellite image data

81
Why not just use commercial DB?
  • The scale is too large for most commercial databases
  • Even if it weren't, the cost would be very high
  • Building internally means the system can be applied
    across many projects for low incremental cost
  • Low-level storage optimizations help performance
    significantly
  • Much harder to do when running on top of a
    database layer

82
Goals
  • Want asynchronous processes to be continuously
    updating different pieces of data
  • Want access to most current data at any time
  • Need to support
  • Very high read/write rates (millions of ops per
    second)
  • Efficient scans over all or interesting subsets
    of data
  • Efficient joins of large one-to-one and
    one-to-many datasets
  • Often want to examine data changes over time
  • E.g. Contents of a web page over multiple crawls

83
BigTable
  • Distributed multi-level map
  • Fault-tolerant, persistent
  • Scalable
  • Thousands of servers
  • Terabytes of in-memory data
  • Petabyte of disk-based data
  • Millions of reads/writes per second, efficient
    scans
  • Self-managing
  • Servers can be added/removed dynamically
  • Servers adjust to load imbalance

84
Building Blocks
  • Building blocks:
  • Google File System (GFS): raw storage
  • Scheduler: schedules jobs onto machines
  • Lock service: distributed lock manager
  • MapReduce: simplified large-scale data processing
  • BigTable's use of the building blocks:
  • GFS: stores persistent data (SSTable file format
    for storage of data)
  • Scheduler: schedules jobs involved in BigTable
    serving
  • Lock service: master election, location
    bootstrapping
  • MapReduce: often used to read/write BigTable data

85
Basic Data Model
  • A BigTable is a sparse, distributed, persistent,
    multi-dimensional sorted map:
  • (row, column, timestamp) → cell contents
  • A good match for most Google applications (see the
    sketch below)

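The sorted-map data model can be pictured with an in-memory Python dictionary. The row key and column names below follow the WebTable example from the Bigtable paper; the read_cell helper is purely illustrative, not the Bigtable API.

  # (row key, column family:qualifier, timestamp) -> cell contents
  webtable = {
      "com.cnn.www": {
          "contents:":         {3: "<html>...v3", 5: "<html>...v5"},
          "anchor:cnnsi.com":  {9: "CNN"},
          "anchor:my.look.ca": {8: "CNN.com"},
      },
  }

  def read_cell(table, row, column, timestamp=None):
      """Return the newest version at or before `timestamp` (newest overall if None)."""
      versions = table[row][column]
      ts = max(t for t in versions if timestamp is None or t <= timestamp)
      return versions[ts]

  print(read_cell(webtable, "com.cnn.www", "contents:"))      # newest version (ts 5)
  print(read_cell(webtable, "com.cnn.www", "contents:", 4))   # version at ts 3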
86
WebTable Example
  • Want to keep a copy of a large collection of web
    pages and related information
  • Use URLs as row keys
  • Various aspects of the web page as column names
  • Store the contents of web pages in the "contents:"
    column under the timestamps when they were
    fetched.

87
Rows
  • Name is an arbitrary string
  • Access to data in a row is atomic
  • Row creation is implicit upon storing data
  • Rows ordered lexicographically
  • Rows close together lexicographically usually on
    one or a small number of machines

88
Rows (cont.)
  • Reads of short row ranges are efficient and
    typically require communication with only a small
    number of machines.
  • Clients can exploit this property by selecting row
    keys so they get good locality for data access.
  • Example (see the helper sketched below):
  • math.gatech.edu, math.uga.edu, phys.gatech.edu,
    phys.uga.edu
  • vs.
  • edu.gatech.math, edu.gatech.phys, edu.uga.math,
    edu.uga.phys

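A small helper showing why the second key scheme clusters pages from the same domain: reversing the host name makes lexicographic order follow the domain hierarchy. The function name is illustrative.

  def row_key(url_host):
      # "math.gatech.edu" -> "edu.gatech.math"
      return ".".join(reversed(url_host.split(".")))

  hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
  print(sorted(row_key(h) for h in hosts))
  # ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']
  # All gatech.edu pages are now adjacent, so short row scans touch few tablets.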
89
Columns
  • Columns have a two-level name structure:
  • family:optional_qualifier (e.g. anchor:cnnsi.com)
  • Column family
  • Unit of access control
  • Has associated type information
  • Qualifier gives unbounded columns
  • Additional levels of indexing, if desired

90
Timestamps
  • Used to store different versions of data in a
    cell
  • New writes default to current time, but
    timestamps for writes can also be set explicitly
    by clients
  • Lookup options
  • Return most recent K values
  • Return all values in timestamp range (or all
    values)
  • Column families can be marked w/ attributes
  • Only retain most recent K values in a cell
  • Keep values until they are older than K seconds

91
Implementation Three Major Components
  • A library linked into every client
  • One master server
  • Responsible for:
  • Assigning tablets to tablet servers
  • Detecting addition and expiration of tablet
    servers
  • Balancing tablet-server load
  • Garbage collection
  • Many tablet servers
  • Tablet servers handle read and write requests to
    their tablets
  • They split tablets that have grown too large

92
Implementation (cont.)
  • Client data doesn't move through the master server.
    Clients communicate directly with tablet servers
    for reads and writes.
  • Most clients never communicate with the master
    server, leaving it lightly loaded in practice.

93
Tablets
  • Large tables are broken into tablets at row
    boundaries
  • A tablet holds a contiguous range of rows
  • Clients can often choose row keys to achieve
    locality
  • Aim for 100 MB to 200 MB of data per tablet
  • Each serving machine is responsible for about 100
    tablets
  • Fast recovery:
  • 100 machines each pick up 1 tablet from a failed
    machine
  • Fine-grained load balancing:
  • Migrate tablets away from an overloaded machine
  • The master makes load-balancing decisions

94
Tablet Location
  • Since tablets move around from server to server,
    given a row, how do clients find the right
    machine?
  • Need to find tablet whose row range covers the
    target row

95
Tablet Assignment
  • Each tablet is assigned to one tablet server at a
    time.
  • The master server keeps track of the set of live
    tablet servers and the current assignment of tablets
    to servers. It also keeps track of unassigned
    tablets.
  • When a tablet is unassigned, the master assigns the
    tablet to a tablet server with sufficient room.

96
API
  • Metadata operations
  • Create/delete tables, column families, change
    metadata
  • Writes (atomic)
  • Set(): write cells in a row
  • DeleteCells(): delete cells in a row
  • DeleteRow(): delete all cells in a row
  • Reads
  • Scanner: read arbitrary cells in a bigtable
  • Each row read is atomic
  • Can restrict returned rows to a particular range
  • Can ask for just data from 1 row, all rows, etc.
  • Can ask for all columns, just certain column
    families, or specific columns
97
Refinements Compression
  • Many opportunities for compression
  • Similar values in the same row/column at
    different timestamps
  • Similar values in different columns
  • Similar values across adjacent rows
  • Two-pass custom compression scheme
  • First pass: compress long common strings across a
    large window
  • Second pass: look for repetitions in a small window
  • Speed is emphasized, but space reduction is still
    good (10-to-1)

98
Refinements Bloom Filters
  • A read operation has to go to disk when the desired
    SSTable isn't in memory
  • Reduce the number of accesses by specifying a Bloom
    filter
  • It allows us to ask whether an SSTable might contain
    data for a specified row/column pair (a minimal
    Bloom filter sketch follows)
  • A small amount of memory for Bloom filters
    drastically reduces the number of disk seeks for
    read operations
  • In practice, most lookups for non-existent
    rows or columns do not need to touch disk

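A minimal, generic Bloom filter sketch (not Bigtable's implementation): it can answer "definitely not present" without touching disk, at the cost of occasional false positives.

  import hashlib

  class BloomFilter:
      def __init__(self, size_bits=1024, num_hashes=4):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(size_bits // 8 + 1)

      def _positions(self, key):
          for i in range(self.num_hashes):
              h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
              yield int(h, 16) % self.size

      def add(self, key):
          for p in self._positions(key):
              self.bits[p // 8] |= 1 << (p % 8)

      def might_contain(self, key):
          return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

  # Index (row, column) pairs present in an SSTable; skip the disk seek when absent.
  bf = BloomFilter()
  bf.add(("com.cnn.www", "anchor:cnnsi.com"))
  print(bf.might_contain(("com.cnn.www", "anchor:cnnsi.com")))  # True
  print(bf.might_contain(("org.example", "contents:")))         # almost surely False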

100
Outline
  • Overview of cloud computing
  • Google cloud computing techniques: GFS, Bigtable and MapReduce
  • Yahoo cloud computing technique: Hadoop
  • ????????

100
101
Yahoo! Cloud computing

102
Search Results of the Future
yelp.com
Gawker
babycenter
New York Times
epicurious
LinkedIn
answers.com
webmd
103
What's in the Horizontal Cloud?
  • Simple Web Service APIs
  • Horizontal Cloud Services
  • Edge Content Services (e.g., YCS, YCPI)
  • Provisioning and Virtualization (e.g., EC2)
  • Batch Storage and Processing (e.g., Hadoop and Pig)
  • Operational Storage (e.g., S3, MObStor, Sherpa)
  • Other Services: Messaging, Workflow, virtual DBs and Webserving
  • ID and Account Management
  • Shared Infrastructure
  • Metering, Billing, Accounting
  • Monitoring and QoS
  • Common approaches to QA, Production Engineering, Performance Engineering, Datacenter Management, and Optimization
104
Yahoo! Cloud Stack
  • EDGE (Horizontal Cloud Services): YCS, YCPI, Brooklyn
  • WEB (Horizontal Cloud Services): VM/OS, yApache, PHP, App Engine
  • APP (Horizontal Cloud Services): VM/OS, Serving Grid, Data Highway
  • STORAGE (Horizontal Cloud Services): Sherpa, MObStor
  • BATCH (Horizontal Cloud Services): Hadoop
  • Cross-layer: Provisioning (Self-serve), Monitoring/Metering/Security
105
Web Data Management
  • Structured record storage (PNUTS/Sherpa): CRUD, point
    lookups and short scans, index-organized tables and
    random I/Os; cost measured per unit of latency
  • Large data analysis (Hadoop): scan-oriented workloads,
    focus on sequential disk I/O; cost measured per CPU cycle
  • Blob storage (SAN/NAS): object retrieval and streaming,
    scalable file storage; cost measured per GB
106
The World Has Changed
  • Web serving applications need
  • Scalability!
  • Preferably elastic
  • Flexible schemas
  • Geographic distribution
  • High availability
  • Reliable storage
  • Web serving applications can do without
  • Complicated queries
  • Strong transactions

107
PNUTS / SHERPA To Help You Scale Your Mountains
of Data
108
Yahoo! Serving Storage Problem
  • Small records: 100 KB or less
  • Structured records: lots of fields, evolving
  • Extreme data scale: tens of TB
  • Extreme request scale: tens of thousands of
    requests/sec
  • Low latency globally: 20+ datacenters worldwide
  • High availability: outages cost millions
  • Variable usage patterns: applications and
    users change

108
109
The PNUTS/Sherpa Solution
  • The next-generation global-scale record store
  • Record orientation: routing and data storage
    optimized for low-latency record access
  • Scale out: add machines to scale throughput
    (while keeping latency low)
  • Asynchrony: pub-sub replication to far-flung
    datacenters to mask propagation delay
  • Consistency model: reduce the complexity of
    asynchrony for the application programmer
  • Cloud deployment model: hosted, managed service
    to reduce app time-to-market and enable on-demand
    scale and elasticity

109
110
What is PNUTS/Sherpa?
CREATE TABLE Parts ( ID VARCHAR, StockNumber
INT, Status VARCHAR )
Structured, flexible schema
Geographic replication
Parallel database
Hosted, managed infrastructure
110
111
What Will It Become?
Indexes and views
CREATE TABLE Parts ( ID VARCHAR, StockNumber
INT, Status VARCHAR )
Geographic replication
Parallel database
Structured, flexible schema
Hosted, managed infrastructure
112
What Will It Become?
Indexes and views
113
Design Goals
  • Scalability
  • Thousands of machines
  • Easy to add capacity
  • Restrict query language to avoid costly queries
  • Geographic replication
  • Asynchronous replication around the globe
  • Low-latency local access
  • High availability and fault tolerance
  • Automatically recover from failures
  • Serve reads and writes despite failures
  • Consistency
  • Per-record guarantees
  • Timeline model
  • Option to relax if needed
  • Multiple access paths
  • Hash table, ordered table
  • Primary, secondary access
  • Hosted service
  • Applications plug and play
  • Share operational cost

113
114
Technology Elements
Applications
Tabular API
PNUTS API
  • PNUTS
  • Query planning and execution
  • Index maintenance
  • Distributed infrastructure for tabular data
  • Data partitioning
  • Update consistency
  • Replication

YCA: Authorization
  • YDOT FS: ordered tables
  • YDHT FS: hash tables
  • Tribble: pub/sub messaging
  • Zookeeper: consistency service

114
115
Data Manipulation
  • Per-record operations
  • Get
  • Set
  • Delete
  • Multi-record operations
  • Multiget
  • Scan
  • Getrange (a toy client sketch of these operations follows)

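A toy, in-memory client sketch of these per-record and multi-record operations. The class and method names are hypothetical stand-ins, not the actual PNUTS/Sherpa API.

  class RecordStoreClient:
      """Toy in-memory stand-in for a PNUTS-style table of records."""
      def __init__(self):
          self.table = {}                      # key -> record (a dict of fields)

      # Per-record operations
      def get(self, key):
          return self.table.get(key)

      def set(self, key, record):
          self.table[key] = record

      def delete(self, key):
          self.table.pop(key, None)

      # Multi-record operations
      def multiget(self, keys):
          return {k: self.table.get(k) for k in keys}

      def scan(self):
          for key in sorted(self.table):       # full scan in key order
              yield key, self.table[key]

      def getrange(self, start, end):
          for key in sorted(self.table):       # range queries need the ordered (YDOT) layout
              if start <= key <= end:
                  yield key, self.table[key]

  c = RecordStoreClient()
  c.set("apple", {"price": 1})
  c.set("banana", {"price": 2})
  c.set("tomato", {"price": 14})
  print(c.multiget(["apple", "tomato"]))
  print(list(c.getrange("a", "c")))            # apple, banana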
115
116
Tablets: Hash Table
Row keys are hashed; boundaries 0x0000, 0x2AF3, 0x911F and 0xFFFF split the hash space into tablets. The records shown:

  Name        Price  Description
  Grape       12     Grapes are good to eat
  Lime        9      Limes are green
  Apple       1      Apple is wisdom
  Strawberry  900    Strawberry shortcake
  Orange      2      Arrgh! Don't get scurvy!
  Avocado     3      But at what price?
  Lemon       1      How much did you pay for this lemon?
  Tomato      14     Is this a vegetable?
  Banana      2      The perfect fruit
  Kiwi        8      New Zealand
116
117
Tablets: Ordered Table
Row keys are kept in sorted order; boundaries A, H, Q and Z split the key space into tablets. The same records, now sorted by Name:

  Name        Price  Description
  Apple       1      Apple is wisdom
  Avocado     3      But at what price?
  Banana      2      The perfect fruit
  Grape       12     Grapes are good to eat
  Kiwi        8      New Zealand
  Lemon       1      How much did you pay for this lemon?
  Lime        9      Limes are green
  Orange      2      Arrgh! Don't get scurvy!
  Strawberry  900    Strawberry shortcake
  Tomato      14     Is this a vegetable?
117
118
Flexible Schema

  Posted date  Listing id  Item   Price  Condition  Color
  6/1/07       424252      Couch  570    Good       -
  6/1/07       763245      Bike   86     -          -
  6/3/07       211242      Car    1123   Fair       Red
  6/5/07       421133      Lamp   15     -          -
119
Detailed Architecture
Local region
Remote regions
Clients
REST API
Routers
Tribble
Tablet Controller
Storage units
119
120
Tablet Splitting and Balancing
Each storage unit has many tablets (horizontal
partitions of the table)
Storage unit may become a hotspot
Tablets may grow over time
Overfull tablets split
Shed load by moving tablets to other servers
120
121
QUERY PROCESSING
121
122
Accessing Data
A get for key k goes to a router, which forwards it to the storage unit (SU) holding the tablet that covers k (a lookup sketch follows).
122
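A sketch of what the router does for the "get key k" path above: keep the tablet boundaries sorted and binary-search them to find the storage unit (SU) that serves the key. Boundaries and SU names are made up for illustration.

  import bisect

  # Tablet boundaries (upper bounds) and the storage unit serving each tablet.
  boundaries = ["h", "q", "z"]                 # tablets: [a, h), [h, q), [q, z]
  storage_units = ["SU-1", "SU-2", "SU-3"]

  def route(key):
      # Find the first boundary greater than the key; that tablet's SU owns it.
      idx = bisect.bisect_right(boundaries, key)
      idx = min(idx, len(storage_units) - 1)   # clamp keys past the last boundary
      return storage_units[idx]

  print(route("grape"))    # SU-1
  print(route("lemon"))    # SU-2
  print(route("tomato"))   # SU-3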
123
Bulk Read
SU
SU
SU
123
124
Range Queries in YDOT
  • Clustered, ordered retrieval of records

Apple Avocado Banana Blueberry
Canteloupe Grape Kiwi Lemon
Lime Mango Orange
Strawberry Tomato Watermelon
Apple Avocado Banana Blueberry
Canteloupe Grape Kiwi Lemon
Lime Mango Orange
Strawberry Tomato Watermelon
125
Updates
Write key k
Sequence for key k
Routers
Message brokers
Write key k
Sequence for key k
SUCCESS
Write key k
125
126
ASYNCHRONOUS REPLICATION AND CONSISTENCY
126
127
Asynchronous Replication
127
128
Consistency Model
  • Goal: make it easier for applications to reason
    about updates and cope with asynchrony
  • What happens to a record with primary key "Alice"?

(Figure: the record's timeline runs from insert through a series of updates to a delete, producing versions v.1 through v.8 within one generation.)

As the record is updated, copies may get out of
sync.
128
129
Example: Social Alice
(Figure: record timeline across the West and East regions. Alice's status is updated in one region from blank to "Busy" to "Free"; replicas in the other region lag behind, so a reader there may briefly see an older status before all copies converge on "Free".)
130
Consistency Model
(Figure: a read request arrives while some replicas hold the current version and others hold stale versions along the v.1 to v.8 timeline.)
In general, reads are served using a local copy.
130
131
Consistency Model
(Figure: the same timeline; a "read up-to-date" request is directed to the current version.)
But an application can request and get the current
version.
131
132
Consistency Model
(Figure: the same timeline; a "read v.6" request returns that specific version.)
Or variations such as "read forward": while copies
may lag the master record, every copy goes
through the same sequence of changes.
132
133
Consistency Model
(Figure: a write advances the record to a new version on the timeline.)
Writes are achieved via a per-record primary-copy
protocol. (To maximize availability, record
masterships are automatically transferred if a site
fails.) This can be selectively weakened to eventual
consistency (local writes that are reconciled
using version vectors).
133
134
Consistency Model
(Figure: a "write if v.7" test-and-set request returns ERROR because the record has already advanced past that version.)
Test-and-set writes facilitate per-record
transactions (sketched below).
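A sketch of the test-and-set idea in Python (illustrative only, not the PNUTS API): the write carries the version the client last read, and it is rejected if the record has already moved on.

  class VersionMismatch(Exception):
      pass

  class Record:
      def __init__(self, value):
          self.value, self.version = value, 1

      def test_and_set(self, new_value, expected_version):
          # Reject the write if another update slipped in since the client's read.
          if self.version != expected_version:
              raise VersionMismatch(f"record is at v.{self.version}, not v.{expected_version}")
          self.value, self.version = new_value, self.version + 1

  alice = Record("Busy")
  v = alice.version                       # client reads v.1
  alice.test_and_set("Free", v)           # succeeds, record moves to v.2
  try:
      alice.test_and_set("Sleeping", v)   # stale version: rejected
  except VersionMismatch as e:
      print("ERROR:", e)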
134
135
Consistency Techniques
  • Per-record mastering
  • Each record is assigned a master region
  • May differ between records
  • Updates to the record forwarded to the master
    region
  • Ensures consistent ordering of updates
  • Tablet-level mastering
  • Each tablet is assigned a master region
  • Inserts and deletes of records forwarded to the
    master region
  • Master region decides tablet splits
  • These details are hidden from the application
  • Except for the latency impact!

136
Mastering
(Figure: three replicas of a tablet hold the same records A through F; each record stores its own master region (E, W, or C), and one of the replicas also serves as the tablet master.)
136
137
Bulk Insert/Update/Replace
  • Client feeds records to bulk manager
  • Bulk loader transfers records to SUs in batches
  • Bypass routers and message brokers
  • Efficient import into storage unit

Client
Bulk manager
Source Data
138
Bulk Load in YDOT
  • YDOT bulk inserts can cause performance hotspots
  • Solution: preallocate tablets

139
Index Maintenance
  • How do we get lots of interesting indexes and
    views without killing performance?
  • Solution: asynchrony!
  • Indexes/views are updated asynchronously when the
    base table is updated

140
SHERPA IN CONTEXT
140
141
Types of Record Stores
  • Query expressiveness, from simple to feature-rich:
  • S3: object retrieval
  • PNUTS: retrieval from a single table of objects/records
  • Oracle: SQL
142
Types of Record Stores
  • Consistency model, from best effort to strong guarantees:
  • S3: eventual consistency
  • PNUTS: timeline consistency
  • Oracle: ACID
  • (The same spectrum runs from object-centric to program-centric consistency.)
143
Types of Record Stores
  • Data model:
  • CouchDB and PNUTS: flexibility and schema evolution, with object-centric consistency
  • Oracle: optimized for fixed schemas, with consistency that spans objects
144
Types of Record Stores
  • Elasticity (ability to add resources on demand):
  • PNUTS and S3: elastic, designed for very large scale distribution/replication (VLSD)
  • Oracle: inelastic, or limited elasticity via data distribution
145
Data Stores Comparison
  • User-partitioned SQL stores (Microsoft Azure SDS, Amazon
    SimpleDB) versus PNUTS: more expressive queries, but users
    must control partitioning and elasticity is limited
  • Multi-tenant application databases (Salesforce.com, Oracle
    on Demand) versus PNUTS: highly optimized for complex
    workloads, but limited flexibility for evolving applications,
    and they inherit the limitations of the underlying data
    management system
  • Mutable object stores (Amazon S3) versus PNUTS: object
    storage versus record management
146
Application Design Space
(Figure: systems placed along two axes, records versus files and "get a few things" versus "scan everything": Sherpa, YMDB, MySQL, Oracle and BigTable on the record side, MObStor and filers on the file side, and Hadoop and Everest toward "scan everything".)
146
147
Alternatives Matrix
(Matrix comparing Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, and Cassandra on consistency model, structured access, global low latency, SQL/ACID, availability, operability, updates, and elasticity.)
147
148
QUESTIONS?
148
149
Hadoop

150
Problem
  • How do you scale up applications?
  • Run jobs processing 100s of terabytes of data
  • Takes 11 days to read on 1 computer
  • Need lots of cheap computers
  • Fixes speed problem (15 minutes on 1000
    computers), but
  • Reliability problems
  • In large clusters, computers fail every day
  • Cluster size is not fixed
  • Need common infrastructure
  • Must be efficient and reliable

151
Solution
  • Open Source Apache Project
  • Hadoop Core includes
  • Distributed File System - distributes data
  • Map/Reduce - distributes application
  • Written in Java
  • Runs on
  • Linux, Mac OS/X, Windows, and Solaris
  • Commodity hardware

152
Hardware Cluster of Hadoop
  • Typically a two-level architecture
  • Nodes are commodity PCs
  • 40 nodes per rack
  • Uplink from each rack is 8 gigabit
  • Rack-internal bandwidth is 1 gigabit

153
Distributed File System
  • A single namespace for the entire cluster
  • Managed by a single namenode
  • Files are single-writer and append-only
  • Optimized for streaming reads of large files
  • Files are broken into large blocks
  • Typically 128 MB
  • Replicated to several datanodes for reliability
  • Access from Java, C, or the command line
154
Block Placement
  • Default is 3 replicas, but settable
  • Blocks are placed (writes are pipelined):
  • on the same node
  • on a different rack
  • on the other rack
  • Clients read from the closest replica
  • If the replication of a block drops below the
    target, it is automatically re-replicated
    (a placement sketch follows this list)

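A simplified, rack-aware placement sketch in the spirit of the list above. The topology, node names, and exact tie-breaking are assumptions; real HDFS placement logic differs in detail.

  import random

  def place_replicas(writer_node, topology, replication=3):
      """topology: dict rack -> list of nodes. Returns the write pipeline targets."""
      local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
      targets = [writer_node]                                  # 1st replica: the writer's node
      remote_rack = random.choice([r for r in topology if r != local_rack])
      remote_nodes = list(topology[remote_rack])
      first_remote = random.choice(remote_nodes)
      targets.append(first_remote)                             # 2nd replica: a different rack
      others = [n for n in remote_nodes if n != first_remote]
      targets.append(random.choice(others) if others else first_remote)  # 3rd: same remote rack
      return targets[:replication]

  topology = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
  print(place_replicas("n2", topology))   # e.g. ['n2', 'n5', 'n4']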
155
How is Yahoo using Hadoop?
  • Started with building better applications
  • Scale up web-scale batch applications (search,
    ads, ...)
  • Factor out common code from existing systems, so
    new applications will be easier to write
  • Manage the many clusters

156
Running Production WebMap
  • Search needs a graph of the known web
  • Invert edges, compute link text, whole-graph
    heuristics
  • Periodic batch job using Map/Reduce
  • Uses a chain of about 100 map/reduce jobs
  • Scale:
  • about 1 trillion edges in the graph
  • largest shuffle is about 450 TB
  • final output is about 300 TB compressed
  • runs on about 10,000 cores
  • about 5 PB of raw disk used

157
Terabyte Sort Benchmark
  • Started by Jim Gray at Microsoft in 1998
  • Sorting 10 billion 100-byte records
  • Hadoop won the general category in 209 seconds
  • 910 nodes
  • 2 quad-core Xeons @ 2.0 GHz per node
  • 4 SATA disks per node
  • 8 GB RAM per node
  • 1 gigabit Ethernet per node
  • 40 nodes per rack
  • 8 gigabit Ethernet uplink per rack
  • The previous record was 297 seconds
158
Hadoop clusters
  • We have 20,000 machines running Hadoop
  • Our largest clusters are currently 2000 nodes
  • Several petabytes of user data (compressed,
    unreplicated)
  • We run hundreds of thousands of jobs every month

159
Research Cluster Usage
160
Who Uses Hadoop?
  • Amazon/A9
  • AOL
  • Facebook
  • Fox interactive media
  • Google / IBM
  • New York Times
  • PowerSet (now Microsoft)
  • Quantcast
  • Rackspace/Mailtrust
  • Veoh
  • Yahoo!
  • More at http://wiki.apache.org/hadoop/PoweredBy

161
Q&A
  • For more information:
  • Website: http://hadoop.apache.org/core
  • Mailing lists:
  • core-dev@hadoop.apache.org
  • core-user@hadoop.apache.org

162
Outline
  • Overview of cloud computing
  • Google cloud computing techniques: GFS, Bigtable and MapReduce
  • Yahoo cloud computing technique: Hadoop
  • ????????

162
163
Summary of Applications
  • BigTable, HBase, HyperTable, Hive, HadoopDB: data analysis, Internet services, private clouds
  • PNUTS: Web applications, and operations that can tolerate relaxed consistency
164
Architecture
Three architectures are compared:
  • MapReduce-based: BigTable, HBase, Hypertable, Hive
  • DBMS-based: SQL Azure, PNUTS, Voldemort
  • Hybrid of MapReduce and DBMS: HadoopDB
The comparison covers scalability, fault tolerance, performance, the ability to run in a heterogeneous environment, how easy it is to support SQL and to exploit indexes and optimization methods, possible data-storage bottlenecks, and whether data is replicated in the file system or on top of the DBMS.
165
Consistency
  • Two kinds of consistency:
  • strong consistency: ACID (Atomicity, Consistency,
    Isolation, Durability)
  • weak consistency: BASE (Basically Available,
    Soft-state, Eventual consistency)
(Figure: BigTable, HBase, Hive, Hypertable, HadoopDB, PNUTS, and SQL Azure placed on the CAP triangle of Consistency, Availability, and Partition tolerance.)
166
A tailor
(Figure: the features an RDBMS is tailored around: LOCK, ACID, SAFETY, TRANSACTION, 3NF.)
167
Further Reading
  • Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, Raghu Ramakrishnan. Efficient Bulk Insertion into a Distributed Ordered Table. SIGMOD 2008.
  • Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008.
  • Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Raghu Ramakrishnan. Asynchronous View Maintenance for VLSD Databases. SIGMOD 2009.
  • Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava. Cloud Storage Design in a PNUTShell. In Beautiful Data, O'Reilly Media, 2009.
168
Further Reading
  • F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
  • J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
  • G. DeCandia et al. Dynamo: Amazon's highly available key-value store. In SOSP, 2007.
  • S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. In Proc. SOSP, 2003.
  • D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422-469, 2000.
169
Cloud Computing and Cloud Data Management
  • Forthcoming textbook, June 2010 (remaining slide details are garbled in the source)

170
Textbook table of contents (chapter and section titles garbled in the source)
  • Chapter 1 (sections 1.1 to 1.4)
  • Chapter 2 (sections 2.1 to 2.4)
  • Chapter 3 (sections 3.1 to 3.4)

171
Textbook table of contents, continued
  • Chapter 4 (sections 4.1 to 4.3)
  • Chapter 5: CORBA (5.1 to 5.5, including a Java IDL example in 5.4)
  • Chapter 6 (sections 6.1 to 6.4)
  • Chapter 7: Google cloud computing (beginning with 7.1 on Google)
172
???????????????
  • ?8? Yahoo??????
  • 8.1 PNUTS ??????????
  • 8.2 Pig ??????????
  • 8.3 ZooKeeper ??????????????
  • 8.4 ??
  • ?9? Aneka ??????
  • 9.1 Aneka ???
  • 9.2 ????????
  • 9.3 Aneka??????????????
  • 9.4 ??
  • ?10? Greenplum??????
  • 10.1 GreenPlum????
  • 10.2 GreenPlum?????
  • 10.3 GreenPlum???????????
  • 10.4 GreenPlum????????
  • 10.5 ??
  • ?11? Amazon dynamo??????
  • 11.1 Amazon dynamo??
  • 11.2 Amazon dynamo?????

173
Textbook table of contents, continued
  • Chapter 13: Hadoop (13.1 Hadoop, 13.2 Map/Reduce, 13.3 to 13.5)
  • Chapter 14: HBase (sections 14.1 to 14.5)

174
Textbook table of contents, continued
  • Chapter 15: Google Apps and Google App Engine (sections 15.1 to 15.4)
  • Chapter 16: MS Azure / Windows Azure (sections 16.1 to 16.3)
  • Chapter 17: Amazon EC2 (17.1 Amazon Elastic Compute Cloud, 17.2, 17.3)

175
Q&A. Thanks!