Grid Computing - PowerPoint PPT Presentation (Transcript)

1
Grid Computing
  • Session 2
  • David Yates
  • January 23, 2004

2
Acknowledgments
  • Most of these slides were written by the authors
    of the papers
  • Most of the editorial comments are mine
  • Session 2 covers additional papers on data
    management (metadata and replication), papers on
    the interaction between data management and
    scheduling, and storage and file systems for Grid
    data

3
Session 2 Papers
A. Chervenak, E. Deelman, I. Foster, L. Guy, W. Hoschek,
A. Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu,
B. Schwartzkopf, H. Stockinger, K. Stockinger, B. Tierney,
"Giggle: A Framework for Constructing Scalable Replica
Location Services." In Supercomputing 2002, November 2002.
http://www.globus.org/research/papers/giggle.pdf

G. Singh, S. Bharathi, A. Chervenak, E. Deelman,
C. Kesselman, M. Manohar, S. Patil, L. Pearlman,
"A Metadata Catalog Service for Data Intensive
Applications." In Supercomputing 2003, Phoenix, AZ,
November 2003.
http://www.globus.org/research/papers/mcs_sc2003.pdf

Matei Ripeanu and Ian Foster, "A Decentralized, Adaptive
Replica Location Service." In Eleventh IEEE International
Symposium on High Performance Distributed Computing,
Edinburgh, Scotland, July 2002.
http://people.cs.uchicago.edu/matei/PAPERS/hpdc-02.pdf

Kavitha Ranganathan and Ian Foster, "Computation Scheduling
and Data Replication Algorithms for Data Grids." Chapter 22
in Grid Resource Management: State of the Art and Future
Trends, Jarek Nabrzyski, Jennifer M. Schopf, and Jan
Weglarz, editors, Kluwer Academic Publishers, 2003.
http://www-unix.mcs.anl.gov/schopf/BookFinal.pdf
4
Session 2 Papers, Continued
George Kola, Tevfik Kosar and Miron Livny, "Run-time
Adaptation of Grid Data-placement Jobs." To appear in
Parallel and Distributed Computing Practices, 2004.
http://www.cs.wisc.edu/condor/stork/papers/runtime_adaptation-pdcp2004.pdf

Renato J. Figueiredo, Nirav H. Kapadia and Jose A. B.
Fortes, "The PUNCH Virtual File System: Seamless Access to
Decentralized Storage Services in a Computational Grid."
In Tenth IEEE International Symposium on High Performance
Distributed Computing, San Francisco, CA, August 2001.
http://punch.purdue.edu/HubInfo/publications/2001/hpdc-renato.pdf

John Bent, Venkateshwaran Venkataramani, Nick LeRoy, Alain
Roy, Joseph Stanley, Andrea Arpaci-Dusseau, Remzi H.
Arpaci-Dusseau and Miron Livny, "Flexibility, Manageability,
and Performance in a Grid Storage Appliance." In Eleventh
IEEE Symposium on High Performance Distributed Computing,
Edinburgh, Scotland, July 2002.
http://www.cs.wisc.edu/condor/nest/papers/nest-hpdc-02.pdf

Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon,
Ben Zhao and John Kubiatowicz, "Pond: the OceanStore
Prototype." In Second USENIX Conference on File and Storage
Technologies, March 2003.
http://oceanstore.cs.berkeley.edu/publications/papers/pdf/fast2003-pond.pdf
5
Giggle Overview
  • Giggle: GIGa-scale Global Location Engine
  • A framework for constructing scalable Replica
    Location Services
  • Data intensive applications replicate data at
    multiple locations
  • A Replica Location Service (RLS) is a distributed
    registry service that records the locations of
    data copies and allows discovery of replicas
  • Maintains mappings between logical identifiers
    and target names (see the sketch after this
    slide)
  • Physical targets: map to exact locations of
    replicated data
  • Logical targets: map to another layer of logical
    names, allowing storage systems to move data
    without informing the RLS
  • Issues:
  • Locating replicas of desired files
  • Creating new replicas
  • Scalability
  • Reliability

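As an editorial illustration of the mapping structure described above, here is a minimal sketch of an RLS-style registry in Python. The class and method names are invented for this sketch and are not the Globus RLS client API; a logical name may resolve either directly to physical locations or through another layer of logical names.

    # Hypothetical sketch of RLS-style two-level mappings (not the Globus RLS API).
    class TinyReplicaCatalog:
        def __init__(self):
            self.logical = {}    # logical name -> set of logical target names
            self.physical = {}   # logical name -> set of physical locations (URLs)

        def add_physical(self, lfn, pfn):
            self.physical.setdefault(lfn, set()).add(pfn)

        def add_logical(self, lfn, target_lfn):
            self.logical.setdefault(lfn, set()).add(target_lfn)

        def resolve(self, lfn, seen=None):
            """Return all physical locations reachable from a logical name."""
            seen = seen or set()
            if lfn in seen:               # guard against cycles in logical mappings
                return set()
            seen.add(lfn)
            found = set(self.physical.get(lfn, set()))
            for target in self.logical.get(lfn, set()):
                found |= self.resolve(target, seen)
            return found

    catalog = TinyReplicaCatalog()
    catalog.add_logical("lfn://experiment/run42", "lfn://site-a/run42")
    catalog.add_physical("lfn://site-a/run42",
                         "gsiftp://storage.site-a.example/run42.dat")
    print(catalog.resolve("lfn://experiment/run42"))

Because the second layer is itself logical, a storage system can move a file and update only its own physical mapping, which is exactly the indirection the slide describes.
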
6
Giggle Architecture
[Diagram: Replica Location Index (RLI) nodes aggregate information from Local Replica Catalogs (LRCs) below them]
  • LRCs contain consistent information about
    logical-to-target mappings at a site
  • RLI nodes aggregate information about LRCs
  • Arbitrary levels of RLI hierarchy (see paper for
    example)

7
Giggle: A Flexible Replica Location Service
Framework
  • Allows users to make tradeoffs among:
  • consistency
  • space overhead
  • reliability
  • update costs
  • query costs
  • By combining five essential elements in
    different ways, the framework supports a variety
    of RLS designs. The five elements are:
  • 1. Consistent Local State
  • 2. Global State with relaxed consistency
  • 3. Soft state mechanisms for maintaining global
    state
  • 4. Compression of state updates
  • 5. Membership and Partitioning information
    maintenance

8
Components of RLS Implementation
  • Front-end Server
  • Multi-threaded
  • Supports Globus Grid Security Infrastructure
    (GSI) authentication
  • Common implementation for LRC and RLI
  • Back-end Server
  • mySQL relational database (or PostgreSQL
    database)
  • Holds logical name to target name mappings
  • Client APIs: C and Java
  • Client: command-line tool

9
Implementation Features
  • Two types of soft state updates from LRCs to
    RLIs:
  • Complete list of logical names registered in the
    LRC
  • Bloom filter compressed summaries of the LRC
  • Immediate mode
  • When active, send updates after 30 seconds
    (configurable) or after a fixed number (100 by
    default) of updates (batching sketch after this
    slide)
  • Send full updates at a reduced rate
  • User-defined attributes
  • May be associated with logical or target names
  • Partitioning (without bloom filters)
  • Divide LRC soft state updates among RLI index
    nodes using pattern matching of logical names
  • Membership service
  • Static configuration only
  • Eventually use OGSA registration techniques

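A minimal sketch of the immediate-mode batching rule quoted above (flush after 30 seconds or after 100 pending updates, both configurable). This is illustrative logic only, not the RLS implementation; a real server would also flush on a timer rather than only when a new update arrives.

    import time

    class ImmediateModeBatcher:
        """Buffer LRC updates and flush them to the RLI when a threshold trips."""

        def __init__(self, send_fn, max_age_s=30, max_pending=100):
            self.send_fn = send_fn        # callable that ships one batch to the RLI
            self.max_age_s = max_age_s
            self.max_pending = max_pending
            self.pending = []
            self.oldest = None

        def add_update(self, lfn, target):
            if not self.pending:
                self.oldest = time.monotonic()
            self.pending.append((lfn, target))
            self._maybe_flush()

        def _maybe_flush(self):
            too_many = len(self.pending) >= self.max_pending
            too_old = (time.monotonic() - self.oldest) >= self.max_age_s
            if too_many or too_old:
                self.send_fn(list(self.pending))
                self.pending.clear()
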
10
Wide Area Complete Soft State Update Performance
  • LRCs in Geneva and Pisa updating RLI at Glasgow
  • Full soft state updates quite slow for large
    databases, dominated by update costs on RLI
    database
  • Performance does not scale as LRCs grow; this
    motivates compression of soft state updates

11
Performance of an LRC Server Updating an RLI
Server
  • Number of SQL operations generated at single RLI
    and LRC servers for complete and incremental
    updates
  • Servers need to be configured (statically or
    dynamically) to use the update scheme that is
    most appropriate for the expected rate of
    updates to the LRC

12
Future Work
  • Continued development of RLS as part of the
    Globus Toolkit
  • http://www.globus.org/rls
  • http://cern.ch/grid-data-management
  • Reliable replication service
  • Replicate data objects and register them in RLS
  • Provide fault tolerance
  • Consistency services
  • Versioning
  • Subscription
  • RLS will become an OGSA grid service
  • Replica location grid service specification will
    be standardized through Global Grid Forum

13
Metadata Catalog Service for Data Intensive
Applications
  • Metadata is information that describes data
    objects
  • Application-specific
  • Temperature, longitude, latitude, depth
  • Time, duration, sensor
  • Application-independent
  • Creator, logical name, time created, access
    control
  • Collections of data objects e.g., data
    collected during an experiment
  • Logical views of data objects allow users to
    group data objects according to their interests

14
Types Of Metadata
  • Physical metadata
  • Depends on location of data object and
    characteristics of the storage system
  • Logical metadata
  • a) How data objects were created or modified
  • By whom, when, using what equipment or
    computational engine
  • By what process: experimental output,
    simulation, or analysis results
  • With what input conditions or parameters
  • b) Description of what the data represent
  • Precipitation over Africa for December 1998
  • Particle collisions in the LHC for a period of
    1 second
  • We restrict the Metadata Catalog Service (MCS)
    schema to logical metadata

15
Why is a Metadata Catalog Service Needed?
  • Essential for scientists and applications to
  • Record information about the creation,
    transformation, meaning and quality of data items
  • Query for data items based on these descriptive
    attributes
  • Identifying data items correctly is essential for
    correct analysis of experimental and simulation
    results
  • Traditionally, scientists have used ad hoc
    methods to keep track of what data items
    represent
  • Descriptive file names, datasets, directories,
    lab notebooks, memory
  • These methods do not scale to terabyte and
    petabyte data sets consisting of millions of data
    items

16
An Example MCS Usage Scenario
17
Data Model
[Diagram: three-level data model: logical data items (files), logical collections, and logical views]
18
MCS Prototype Schema
  • Logical file metadata (see the sketch after
    this slide):
  • logical file name
  • data type
  • version number
  • master copy location
  • container information
  • information about the creator
  • last modifier of the data
  • Logical collection metadata
  • collection name
  • description
  • set of files in a collection
  • annotations on the collection
  • information about the creator and modifier(s)
  • collection hierarchy information (parent
    collection id)
  • Logical view metadata
  • view name, attributes, description, creator /
    modifier(s)

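To make the prototype schema concrete, the following is a hedged sketch of the three record types as plain Python dataclasses. The field names follow the bullets above; the actual MCS table layout and column types are not reproduced here.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class LogicalFile:
        logical_name: str
        data_type: str
        version: int
        master_copy_location: str
        container: Optional[str]
        creator: str
        last_modifier: str

    @dataclass
    class LogicalCollection:
        name: str
        description: str
        files: List[str] = field(default_factory=list)       # logical file names
        annotations: List[str] = field(default_factory=list)
        creator: str = ""
        modifiers: List[str] = field(default_factory=list)
        parent_collection_id: Optional[str] = None            # collection hierarchy

    @dataclass
    class LogicalView:
        name: str
        description: str
        attributes: dict = field(default_factory=dict)        # user-defined attributes
        creator: str = ""
        modifiers: List[str] = field(default_factory=list)
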
19
Prototype Design
  • Initial Prototype
  • Simple, centralized Metadata Service
  • Based on open source web service (Apache / Axis)
    and relational database technology (mySQL)

20
  • Web interface is expensive (incurs 80% overhead)
  • Adding to metadata catalog scales well with
    database size

21
  • Web interface even more expensive (95% overhead)
  • Querying is 2x-20x faster than adding
  • Querying metadata catalog also scales with
    database size

22
  • Complex queries are about 8x-12x slower than
    simple queries
  • The overhead of the web interface and the
    increase in database size both carry a
    performance penalty

23
Status and Future Work
  • Evaluated alternative back end technologies
  • Evaluated methods to reduce web interface
    overhead
  • Initial prototype is relational (mySQL)
  • Requires shredding and reconstructing XML data
  • Difficult tradeoffs between complexity of storing
    XML metadata and query efficiency
  • XML metadata is not a very natural fit for MCS's
    relational database back end
  • But native XML databases have poor query
    performance
  • Evaluate use of native XML databases (Xindice,
    commercial XML databases)
  • New implementation will be based on OGSA Database
    Access and Integration (DAI) Service
  • Being standardized through Global Grid Forum
  • Reference implementation involving IBM, Oracle,
    UK eScience researchers, academic institutions
  • Provides both relational and native XML back ends
  • Provides a grid service front end with grid
    security
  • Provides a general pass-through SQL query
    interface
  • Testing OGSA DAI services with ESG metadata

24
Future Work, Continued
  • Re-evaluate MCS schema
  • How can we better support (multiple)
    domain-specific schema?
  • ESG makes extensive use of user-defined
    attributes to support domain-specific metadata
    schemas
  • Key requirement for metadata services - easily
    extensible
  • Need rich, efficient mechanisms for adding
    user-defined attributes
  • Reconsider usefulness of pre-defined attributes
  • How useful are pre-defined attributes?
  • ESG is not using many of MCS's pre-defined
    attributes
  • Will we use more of these as we integrate further
    with other grid tools for workflow management,
    provenance, etc.?
  • Support for provenance information (describes
    data transformations)
  • Unify MCS schema with Chimera data catalog schema
  • Distribution and federation of heterogeneous
    metadata services
  • Want to federate multiple metadata catalogs
    (e.g., THREDDS)
  • Current work assumes strict consistency is a
    requirement
  • Explore relaxed consistency models: heterogeneous
    metadata services export discovery information to
    aggregating index nodes

25
A Decentralized, Adaptive Replica Location Service
  • Replica location problem
  • Replication often used to improve reliability,
    access latency or availability
  • Need efficient mechanism to locate replicas
  • Map logical ID to replica location(s)
  • Common to cooperative proxy caches and
    distributed object systems
  • In Grids: a client presents an LFN (logical file
    name) and asks for one, many, or all PFNs
    (physical file names)

26
End-to-end Argument
  • Impossible to provide a completely consistent
    view of the system in a distributed, asynchronous
    environment.
  • Giggle presents a framework for building replica
    location services.
  • We argue that the performance of the overall
    system benefits from relaxed consistency
    semantics at lower system levels.
  • Interesting tradeoffs between inconsistency
    levels and operational costs.

27
Example Application Requirements
  • Requirements for data intensive, scientific
    applications (GriPhyN project)
  • Scale: 1 billion replicas by 2006, 10 times
    larger by 2010
  • Decentralization: sites able to operate
    independently (100s of sites)
  • Replica lookup rates are order(s) of magnitude
    higher than update rates
  • Efficient queries for ad-hoc sets of files

28
Lossy Data Compression: Bloom Filters
  • Probabilistic technique for compressed set
    representation
  • Good compression ratios at the cost of a low
    false positive rate

29
Bloom Filters, Continued
  • Simple mathematical model for designing filters
    (worked sketch after this slide)
  • Accuracy/space (bandwidth) tradeoffs can be
    adjusted on the fly

[Figure: tradeoff plot; x-axis: number of hash functions]
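As a concrete illustration of the compressed-summary idea, here is a small, self-contained Bloom filter. The false-positive model in the helper function, p ≈ (1 - e^(-kn/m))^k for k hash functions, m bits and n inserted elements, is the standard textbook approximation; the hashing scheme below is chosen for brevity and is not taken from the Giggle or RLS implementation.

    import hashlib
    import math

    class BloomFilter:
        """Compressed set membership with a tunable false positive rate."""

        def __init__(self, m_bits, k_hashes):
            self.m = m_bits
            self.k = k_hashes
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, item):
            # Derive k bit positions from one SHA-1 digest (illustrative only).
            digest = hashlib.sha1(item.encode()).digest()
            for i in range(self.k):
                yield int.from_bytes(digest[2 * i:2 * i + 4], "big") % self.m

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def expected_false_positive_rate(m, k, n):
        # Standard approximation: p ~ (1 - e^(-k*n/m))^k
        return (1.0 - math.exp(-k * n / m)) ** k

    bf = BloomFilter(m_bits=8 * 1024, k_hashes=5)
    bf.add("lfn://experiment/run42")
    print(bf.might_contain("lfn://experiment/run42"))        # True
    print(expected_false_positive_rate(8 * 1024, 5, 1000))   # ~0.02
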
30
Overlay Networks
  • Generally used to provide functionality not
    available at lower network levels (e.g.,
    multicast and security)
  • Why do we use overlays?
  • The Resilient Overlay Networks (MIT) project
    improves network availability between
    Internet-connected end-points by more than one
    order of magnitude
  • Work well: file-sharing P2P systems have scaled
    to more than 100k nodes (e.g., Gnutella, KaZaa)
  • Easy to adapt to heterogeneity in available
    resources

31
Soft-state Mechanisms
  • Producer sends state to receiver(s) over a
    (lossy) channel
  • Receivers keep state and associated timeouts
  • Advantages (see the sketch after this slide)
  • Decouples state producer and consumer: no
    explicit failure detection and state removal
    messages
  • Eventual full state
  • Adaptive: traditionally fixed, empirically
    determined update rates, but state producers can
    obey more complex rules
  • Works well in practice: RSVP, RIP, MDS-2

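A minimal sketch of the receiver side of a soft-state protocol, assuming the timeout behavior described above: entries expire unless the producer refreshes them, so failed producers simply disappear without explicit removal messages. The class and timeout value are invented for illustration.

    import time

    class SoftStateTable:
        """State entries expire unless refreshed before their timeout."""

        def __init__(self, timeout_s=90):
            self.timeout_s = timeout_s
            self.entries = {}            # key -> (value, time of last refresh)

        def refresh(self, key, value):
            # Producers periodically re-send their state; each message renews the timer.
            self.entries[key] = (value, time.monotonic())

        def expire(self):
            now = time.monotonic()
            for key in [k for k, (_, t) in self.entries.items()
                        if now - t > self.timeout_s]:
                del self.entries[key]    # no explicit "remove" message was needed

        def lookup(self, key):
            self.expire()
            entry = self.entries.get(key)
            return entry[0] if entry else None
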
32
Assembling the Pieces
  • Replica add/delete
  • Digest dissemination
  • Replica lookup
  • Nodes cache responses
  • to benefit from locality in request flow
  • Storage sites and Replica Location Nodes (RLNs)
    join-in and leave
  • Typically one or more RLN per administrative
    domain

[Diagram: a client issuing a lookup to a Replica Location Node (RLN)]
33
Resource Requirements Estimate
  • Compact Muon Solenoid (CMS) high-energy physics
    experiment requirements for 2006
  • 0.5G replicas overall (avg. 10 replicas/file)
  • 100 sites (replica location nodes)
  • overall 10,000 lookups/sec, 10 updates/sec on
    average, update propagation delay 30 sec
  • Translates into (rough check after this slide):
  • Each RLN needs 1 GB of memory (< 0.05 false
    positive rate)
  • Generated traffic < 200 Mbps per overlay link

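A rough editorial back-of-envelope check of the memory figure, assuming each replica location node summarizes the full replica set with a Bloom filter (the compression mechanism the paper proposes). The arithmetic below is illustrative, not taken from the paper.

    import math

    replicas = 0.5e9                 # 0.5G replicas overall (CMS 2006 estimate)
    memory_bits = 1e9 * 8            # 1 GB of memory, in bits

    bits_per_element = memory_bits / replicas        # = 16 bits per replica
    k = round(bits_per_element * math.log(2))        # near-optimal hash count, ~11
    false_positive = (1 - math.exp(-k / bits_per_element)) ** k

    print(bits_per_element, k, false_positive)
    # ~16 bits per replica, k ~ 11, false positive rate ~ 5e-4,
    # comfortably below the < 0.05 bound quoted on the slide.
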
34
Prototype Implementation
  • Python code for fast prototyping
  • Bloom filters
  • False positive rates match theoretical results
  • Fast lookup, add, delete operations

35
Prototype Performance
  • Replica Location Node
  • Lookup rates: 645 req/sec and 7700 req/sec
  • Add and delete: about half of lookup performance
  • Overall performance
  • Tested with 24 nodes on a LAN
  • 50M replicas (about 2M per node)
  • 3 simulated clients per node
  • Peaks at 2000 lookups/sec concurrently with 1200
    updates/sec
  • Propagates update in 30 sec

36
Future Work
  • Improve prototype performance
  • Enhance overlay organization mechanisms to
    reflect various goodness criteria
  • Match infrastructure (reduce generated traffic
    overhead)
  • Match user behavior (file sharing → overlay
    topology); see "Small-World File-Sharing
    Communities" in Infocom 2004
  • Reduce latency
  • Maximize availability
  • Emulation environment to be able to perform
    controlled large scale experiments
  • Test on wide area deployments

37
COMPUTATION SCHEDULING AND DATA REPLICATION
ALGORITHMS FOR DATA GRIDS
  • Scheduling algorithms for large-scale data
    intensive problems in Grids
  • e.g., High Energy Physics experiments like CMS
    (at CERN), which will generate petabytes of data
    per year
  • Challenge:
  • multiple, potentially independent sources of jobs
  • large number of storage, compute, and network
    resources
  • huge amounts of input / output data
  • Decentralized solutions for simplicity and
    feasibility
  • Jobs are data-intensive, so it is important to
    take data location into account while scheduling
  • Replication of data to reduce latency caused by
    remote data access

38
Contributions
  • A general and extensible scheduling framework for
    computational grids
  • A wide variety of scheduling algorithms can be
    implemented using this framework
  • The ChicagoSim simulator uses the framework to
    explore the effectiveness of different scheduling
    approaches / algorithms
  • Paradigm for scheduling: integrated job
    scheduling and data replication

39
System Model
  • Model a Grid as a collection of sites; each site
    has:
  • A certain number of processors
  • Limited storage
  • Users associated with the local site
  • A set of files initially at the site
  • Users generate jobs; each job:
  • Needs certain input files before it can execute
  • Executes on a single processor
  • Has access to all files at its local site

40
Scheduling Framework
[Diagram: N users submit jobs (J) to External Schedulers (ES); each of S sites has a Local Scheduler (LS) with a queue (Q), a DataSet Scheduler (DS), computers, and storage. The DS monitors dataset (D) popularity and migrates data; the LS schedules jobs on idle nodes and may request remote data.]
Different mappings between Users and External
Schedulers lead to different architectures
41
Job and Data Scheduling Algorithms
  • Two distinct functionalities: External Scheduler
    and Dataset Scheduler
  • Job Scheduling algorithms
  • The External Scheduler runs each job at (sketch
    after this slide):
  • Random: a randomly selected site
  • LeastLoaded: the site that currently has the
    least load
  • RandLeastLoaded: a site randomly selected from
    the n least-loaded sites
  • DataPresent: the least loaded site that already
    has the required data
  • Local: the site where the job originated
  • Local scheduling is performed FIFO

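An editorial sketch of the five external-scheduler policies listed above, written as a single Python function over a hypothetical list of site records; this is for illustration only and is not ChicagoSim code.

    import random

    # Each site is a dict such as {"name": "A", "load": 2, "files": {"d1"}};
    # each job is a dict such as {"inputs": {"d1"}, "origin_site": <site>}.
    # These structures are invented for this sketch.

    def choose_site(policy, sites, job, n=3):
        if policy == "Random":
            return random.choice(sites)
        if policy == "LeastLoaded":
            return min(sites, key=lambda s: s["load"])
        if policy == "RandLeastLoaded":
            return random.choice(sorted(sites, key=lambda s: s["load"])[:n])
        if policy == "DataPresent":
            # Least loaded site that already holds all of the job's input files.
            holders = [s for s in sites if job["inputs"] <= s["files"]]
            return min(holders, key=lambda s: s["load"]) if holders else None
        if policy == "Local":
            return job["origin_site"]
        raise ValueError(policy)

    sites = [{"name": "A", "load": 2, "files": {"d1"}},
             {"name": "B", "load": 5, "files": {"d1", "d2"}}]
    job = {"inputs": {"d2"}, "origin_site": sites[0]}
    print(choose_site("DataPresent", sites, job)["name"])    # prints "B"
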
42
Data Scheduling Algorithms
  • Dataset Scheduling algorithms (sketch after this
    slide)
  • Datasets for jobs are replicated as follows:
  • Caching: no active replication takes place;
    datasets are cached and managed LRU
  • DataRandom: replicate popular datasets at a
    random site when the local site's load exceeds a
    threshold
  • DataLeastLoaded: replicate popular datasets at
    the least loaded site when the local site's load
    exceeds a threshold
  • DataRandLeastLoaded: replicate popular datasets
    at a random site picked from the n least loaded
    sites when the local site's load exceeds a
    threshold
  • Datasets are also cached, and storage at each
    site is managed LRU

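A companion sketch of the dataset-scheduler side: when the local site's load crosses a threshold, its most popular dataset is replicated to a site chosen by the configured policy, and every site's storage behaves as an LRU cache. All names and structures here are illustrative, not from ChicagoSim.

    from collections import OrderedDict

    class SiteStorage:
        """Fixed-capacity dataset store with LRU eviction."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.datasets = OrderedDict()          # dataset name -> size, in LRU order

        def touch(self, name, size=1):
            if name in self.datasets:
                self.datasets.move_to_end(name)
            else:
                self.datasets[name] = size
            while sum(self.datasets.values()) > self.capacity:
                self.datasets.popitem(last=False)  # evict the least recently used

    def maybe_replicate(local, sites, popularity, choose_target, threshold):
        """Replicate the locally most popular dataset when load exceeds the threshold."""
        if local["load"] <= threshold or not popularity:
            return None
        dataset, _ = popularity.most_common(1)[0]
        target = choose_target(sites)              # e.g. random or least-loaded site
        target["storage"].touch(dataset)
        return dataset, target["name"]

For DataLeastLoaded, choose_target could simply be lambda sites: min(sites, key=lambda s: s["load"]), and popularity a collections.Counter of dataset accesses at the local site.
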
43
Simulation Parameters
Dataset popularity is modeled by picking input
datasets from a geometric distribution
44
  • Performance from simulation results varies widely
    (6x or more)
  • Integrated approaches (DataPresent + selective
    replication) perform best
  • Data-driven approach without selective
    replication (DataPresent + Caching) performs
    worse than baseline policies (Random and Local)
  • Adding randomization to least loaded job
    scheduling yields significant gain

45
  • Data-driven scheduling approaches (DataPresent +
    any replication policy) perform best
  • Caching always reduces data transferred (no data
    is transferred with DataPresent + Caching)

46
  • Integrated approaches (DataPresent + selective
    replication) perform best
  • Load-based replication, like load-based
    scheduling, is a good idea

47
Summary and Future Work
  • Important to address both job scheduling and data
    replication and impact of one on the other
  • An integrated approach performs best among the
    strategies considered
  • data-driven job scheduling
  • proactive selective dataset replication
  • Future Work
  • Workloads from Fermi Lab user access patterns and
    CMS workload generator
  • Visualization tool for ChicagoSim
  • Experiments gauging sensitivities to
  • Bandwidth, storage / cache size, CPU speed
  • Heterogeneity in Grid
  • (user location, storage, compute elements)
  • Network topology / contention
  • File popularity / job popularity
  • Validate simulation results on real Grid testbeds
  • Explore adaptive algorithms that select
    algorithms dynamically depending on current Grid
    conditions

48
Run-time Adaptation of Grid Data Placement Jobs
  • Grid presents a continuously changing environment
  • Data intensive applications are being run on the
    grid
  • Data intensive applications have two parts
  • Data placement part
  • Computation part

49
Data Placement
[Diagram: a data intensive application: stage in data, compute, stage out data; the stage-in and stage-out steps are data placement]
Data placement encompasses data transfer,
staging, replication, data positioning, space
allocation and de-allocation
50
Current Approach
  • FedEx
  • Hand tuning
  • Network Weather Service
  • Not useful for high-bandwidth, high-latency
    networks
  • TCP auto-tuning
  • 16-bit window size and window-scale option
    limitations

51
Our Approach
  • Full automation
  • Continuously monitor environment characteristics
  • Perform tuning whenever characteristics change
  • Ability to dynamically and automatically choose
    an appropriate protocol
  • Ability to switch to alternate protocol in case
    of failure

52
The Big Picture
53
Profilers
  • Memory Profiler
  • Optimal memory block-size and incremental
    block-size
  • Disk Profiler
  • Optimal disk block-size and incremental
    block-size
  • Network Profiler
  • Determines bandwidth, latency and the number of
    hops between a given pair of hosts
  • Uses pathrate, traceroute and diskrouter
    bandwidth test tool

54
Parameter Tuner
  • Generates optimal parameters for data transfer
    between a given pair of hosts
  • Calculates TCP buffer size as the bandwidth-delay
    product
  • Calculates the optimal disk buffer size based on
    TCP buffer size
  • Uses a heuristic to calculate the number of TCP
    streams (sketch after this slide)
  • No. of streams = 1 + No. of hops with latency >
    10 ms
  • Rounded to an even number

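A hedged sketch of the tuning arithmetic described above. The bandwidth-delay product and the stream-count heuristic come directly from the slide; how the disk buffer is derived from the TCP buffer is not specified in this deck, so the last step is an assumption made for illustration.

    def tune_transfer(bandwidth_bps, rtt_s, hop_latencies_s):
        """Return a suggested TCP buffer size, stream count, and disk buffer size."""
        # TCP buffer = bandwidth-delay product, in bytes.
        tcp_buffer = int(bandwidth_bps / 8 * rtt_s)

        # Streams = 1 + number of hops with latency > 10 ms,
        # rounded to an even number (rounding up here).
        streams = 1 + sum(1 for lat in hop_latencies_s if lat > 0.010)
        if streams % 2:
            streams += 1

        # Assumption for illustration: size the disk buffer to match the TCP buffer.
        disk_buffer = tcp_buffer
        return tcp_buffer, streams, disk_buffer

    # Example: 622 Mb/s path, 60 ms round-trip time, three hops slower than 10 ms.
    print(tune_transfer(622e6, 0.060, [0.002, 0.015, 0.020, 0.012]))
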
55
Data Placement Scheduler
  • Data placement is a real job
  • A meta-scheduler (e.g. DAGMan in Condor) is used
    to coordinate data placement and computation
  • Sample data placement job
  • dap_type = transfer
  • src_url = diskrouter://slic04.sdsc.edu/s/s1
  • dest_url = diskrouter://quest2.ncsa.uiuc.edu/d/d1

56
Data Placement Scheduler
  • Used Stork, a prototype data placement scheduler
  • Tuned parameters are fed to Stork
  • Stork uses the tuned parameters to adapt data
    placement jobs

57
Coordinating DAG
58
Scalability
  • There is no centralized server
  • Parameter tuner can be run on any computation
    resource
  • Profiler data is 100s of bytes per host
  • There can be multiple data placement schedulers

59
Real World Experiment
  • DPOSS data had to be transferred from SDSC
    (San Diego) to NCSA (Chicago)
60
[Diagram: experiment topology: Management Site (skywalker.cs.wisc.edu), SDSC (slic04.sdsc.edu), StarLight (ncdm13.sl.startap.net), NCSA (quest2.ncsa.uiuc.edu)]
61
Data Transfer from SDSC to NCSA using Run-time
Protocol Auto-tuning
[Graph: transfer rate (MB/s) over time, showing a network outage and the point at which auto-tuning was turned on]
62
Parameter Tuning
Network parameters for GridFTP before and after
the auto-tuning feature of Stork was turned on
63
Alternate Protocol Failover
  • dap_type = transfer
  • src_url = diskrouter://slic04.sdsc.edu/s/data1
  • dest_url = diskrouter://quest2.ncsa.uiuc.edu/d/data1
  • alt_protocols = nest-nest, gsiftp-gsiftp
  • In case of DiskRouter failure, Stork will switch
    to the other protocols in the order specified
    (sketch after this slide)

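An editorial sketch of the failover behavior just described: try the primary protocol first, then fall back through the alternates in the order listed. The transfer callables here are placeholders, not Stork internals.

    def transfer_with_failover(src, dest, protocols, transfer_fns):
        """Try each protocol in order until one succeeds.

        protocols    -- e.g. ["diskrouter", "nest", "gsiftp"]
        transfer_fns -- mapping from protocol name to a callable(src, dest)
                        that raises an exception on failure (placeholders here).
        """
        last_error = None
        for proto in protocols:
            try:
                transfer_fns[proto](src, dest)
                return proto                  # report which protocol succeeded
            except Exception as err:          # in practice: timeouts, connection errors
                last_error = err
        raise RuntimeError(f"all protocols failed: {last_error}")
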
64
Testing Alternate Protocol Failover
[Graph: transfer rate (MB/s) over time, showing the points at which the DiskRouter server was killed and later restarted]
65
Conclusion
  • Run-time adaptation has a significant impact (20
    times improvement in our test case)
  • Profiling data has the potential to be used for
    network management and data mining
  • Network misconfigurations
  • Network outages
  • Dynamic protocol selection and alternate protocol
    failover increase resilience and improve overall
    throughput

66
Future Work
  • Enhance dynamic protocol selection in Stork to
    select best protocol
  • Performance (support for different requirements)
  • Security ?
  • Reliability ?
  • Dynamically select which route to use in
    transfers
  • Dynamically deploy diskrouters at Grid nodes
  • Combine route selection and diskrouters to make
    the best use of network bandwidth

67
The PUNCH Virtual File System (PVFS)
  • Seamless Access to Decentralized Storage Services
    in a Computational Grid
  • Goal: computational grids that distribute and
    deliver computing services to users anytime,
    anywhere
  • Challenge: data management

68
PUNCH
[Diagram: PUNCH (punch.purdue.edu) components: web enabling, applications, data, virtual file system, resource management, compute servers]
69
Logical User Accounts
  • Problems with traditional user accounts
  • No support for dynamic access policies
  • Cannot cross administrative domains
  • Complicates resource management
  • Logical user accounts provide a capability that
    allows users to check out accounts dynamically
    via a resource management system
  • Shadow accounts: allocated to users on demand at
    the compute server
  • File accounts: store data for one or more users
    at the file server

70
Traditional vs. Logical user accounts
71
PUNCH Virtual File System PVFS Goals
  • Unmodified applications
  • Unmodified O/S clients, servers
  • Heterogeneous platforms
  • Block-based data transfers
  • => De facto standard: NFS

72
NFS-based Virtual File System
  • Additional functionality is required
  • shadow-file account multiplexing, uid mapping
  • Possible solutions
  • Enhanced NFS clients and/or servers
  • NFS call forwarding via middle tier proxies

73
Network File System (NFS)
74
(No Transcript)
75
Multiplexing and access control
[Diagram: clients A and B reach servers C and D through a file system gateway that multiplexes accounts and enforces access control]
76
Performance Results
Andrew File System Benchmark on PVFS
Note: client machine was a 4-CPU, 480 MHz
UltraSPARC connected to a 2-CPU, 400 MHz
server via 100 Mb/s switched Ethernet.
Data shown is the average across 200 samples.
77
User workload characteristics
Andrew: > 100 transactions/s
78
Related Work
  • Explicit file transfers: Globus (RFT / GridFTP),
    Portable Batch System, others
  • Implicit transfers
  • Condor: custom libraries
  • Legion: custom NFS servers
  • PUNCH v0.5: standard NFS clients/servers
  • SFS: proxy-based, but no account multiplexing

79
Future Work
  • Coarse-grain locality
  • Placement
  • Migration
  • Fine-grain locality
  • Middleware-driven consistency
  • Proxy caching / prefetching

80
Flexibility, Manageability and Performance in a
Grid Storage Appliance
  • Two Trends
  • Data sets
  • Performance
  • Storage appliances address both trends

81
Storage Appliances and -
  • Storage appliances: great for basic file service
  • Easy to manage: plug in and it works
  • Good performance: specialized just for I/O
  • Reliable and available too
  • Storage appliances for the Grid: a mismatch?
  • Inflexible: few, specific protocols (e.g., NFS)
  • Costly: 10x the cost of a PC + a few disks
  • Difficult to integrate: just one piece of the
    puzzle

82
A Solution: NeST
  • NeST: a storage appliance for the Grid
  • Flexible: multiple simultaneous protocols
  • Virtual protocol layer
  • Low-cost: uses commodity machines
  • Dynamic adaptation
  • Grid-aware: integrates with higher-level systems
  • Designed specifically for the Grid

83
NeST Protocol Layer
[Diagram: NeST components: the protocol layer, Dispatcher, Storage Manager, and Transfer Manager sit between the physical network layer and the physical storage layer]
  • Virtualizes different protocols
  • Mediates access to network
84
NeST Dispatcher
  • Mediates interaction between other components
  • Gathers information, advertises

[Diagram: NeST architecture with the Dispatcher highlighted]
85
NeST Storage Manager
  • Space management
  • Access control
  • Virtualizes physical storage
[Diagram: NeST architecture with the Storage Manager highlighted]
86
NeST Transfer Manager
  • Implements scheduling policies
  • Chooses concurrency model
[Diagram: NeST architecture with the Transfer Manager highlighted]
87
Flexibility: Multiple Protocols
  • Problem: how to support multiple protocols?
  • One approach: Just a Bunch of Servers (JBOS)
  • Problems with JBOS:
  • Lack of control (scheduling)
  • Painful administration
  • No shared code
  • Larger memory footprint

[Diagram: a JBOS configuration running separate wu-ftpd, nfsd, and httpd servers]
88
NeST Flexibility By Design
  • NeST: integrate protocols and gain advantages
  • Implementation is VFS-like
  • Integration introduces new challenges
  • Different protocols allow different auth models
  • More expensive to add a new protocol
  • Less fault isolation

89
NeST vs JBOS
Setup: Linux cluster, dual PIII, 1 GB RAM, Linux
2.2.19; each protocol: 4 clients, 10 MB files
[Chart: server bandwidth (MB/s) by protocol (NFS,
Chirp, HTTP, GridFTP, and total), comparing NeST
against the JBOS servers (linux nfsd, Apache,
wu-ftpd)]
  • For each protocol, NeST is comparable to the
    corresponding JBOS server

90
Exerting Scheduling Control
  • Different scheduling policies:
  • FCFS
  • Cache-aware: see "Exploiting Gray-Box Knowledge
    of Buffer-Cache Management" in USENIX 2002
  • Proportional share
  • Proportional share scheduling (sketch after this
    slide)
  • Allows administrators to set protocol proportions
  • e.g., favor NFS
  • Very difficult in JBOS

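A minimal sketch of proportional-share selection among per-protocol request queues: each protocol gets a weight, and over time receives service roughly in proportion to it. This uses a simple weighted lottery for brevity; NeST's actual scheduler is not reproduced here.

    import random
    from collections import deque

    class ProportionalShareScheduler:
        """Pick the next request from per-protocol queues according to weights."""

        def __init__(self, weights):
            # e.g. weights = {"NFS": 2, "HTTP": 1, "GridFTP": 1} to favor NFS
            self.weights = weights
            self.queues = {proto: deque() for proto in weights}

        def submit(self, proto, request):
            self.queues[proto].append(request)

        def next_request(self):
            backlogged = [p for p, q in self.queues.items() if q]
            if not backlogged:
                return None
            # Weighted lottery over protocols that currently have work queued.
            total = sum(self.weights[p] for p in backlogged)
            pick = random.uniform(0, total)
            for proto in backlogged:
                pick -= self.weights[proto]
                if pick <= 0:
                    return proto, self.queues[proto].popleft()
            proto = backlogged[-1]            # floating-point safety net
            return proto, self.queues[proto].popleft()
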
91
Proportional Share
Setup: Linux cluster, dual PIII, 1 GB RAM, Linux
2.2.19; each protocol: 4 clients, 10 MB files
[Chart: server bandwidth (MB/s) under different
scheduling configurations: FCFS and proportional
shares of 1:1:1:1, 1:2:1:1, and 1:1:1:4]
  • In most cases, achieves a Jain's fairness metric
    > 0.98 (1 is perfectly fair)

92
Grid-Aware Mechanisms
  • Basic functionality
  • Users and groups: dynamic creation / deletion
  • does not need administrative intervention
  • Access control: generic AFS-style ACLs
  • Advanced functionality
  • QoS: preferential scheduling
  • Advertises into global scheduling systems
  • Flexible protocol and authentication mechanisms
  • Self-cleaning storage guarantees: lots

93
Storage Guarantees: Lots
  • Characteristics of lots:
  • Capacity: total amount of data a lot can store
  • Duration: time for which data is guaranteed to
    exist
  • Set of files: multiple files may co-exist within
    a lot
  • Self-cleaning
  • Expired lots become best-effort lots
  • Lot management
  • Either a default set created by the
    administrator, OR use a resource management
    protocol to create lots before usage
  • Implementation: file system quotas
  • Advantage: integrates cleanly with local access
    methods
  • Disadvantage: performance hit for large writes
94
Conclusions and Future Work
  • NeST: a storage appliance for the Grid
  • Gain manageability
  • Without sacrificing performance
  • Design goals:
  • Flexibility: virtual protocol architecture
  • Low cost: adaptation mechanisms
  • Grid-aware: space management
  • Current status: release 0.9 available
  • Future work:
  • Hot-deployable NeSTs, lot management extensions

95
Pond: The OceanStore Prototype
The OceanStore Vision
96
The Challenges
  • Maintenance
  • Many components, many administrative domains
  • Constant change
  • Must be self-organizing
  • Must be self-maintaining
  • All resources virtualized; no physical names
  • Security
  • High availability is a hacker's target-rich
    environment
  • Must have end-to-end encryption
  • Must not place too much trust in any one host

97
The Technologies: Tapestry
  • Tapestry performs
  • Distributed Object Location and Routing
  • From any host, find a nearby replica of a data
    object
  • Efficient
  • O(log N) location time, where N is the number of
    hosts in the system
  • Self-organizing, self-maintaining

98
The Technologies: Tapestry (cont.)
99
The Technologies: Erasure Codes
  • More durable than replication for the same space
    (see the comparison sketch after this slide)
  • The technique:

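To make "more durable than replication for the same space" concrete, here is a hedged back-of-envelope comparison under an independent-failure model. The specific numbers (4 full replicas versus 64 fragments of which any 16 reconstruct the object, giving the same 4x space overhead, and a 10% per-server failure probability) are illustrative assumptions, not figures from the Pond paper.

    from math import comb

    def loss_probability_replication(p, copies):
        """The object is lost only if every full copy is lost."""
        return p ** copies

    def loss_probability_erasure(p, n_fragments, m_needed):
        """The object is lost if fewer than m_needed of the n fragments survive."""
        return sum(comb(n_fragments, failed) * p**failed * (1 - p)**(n_fragments - failed)
                   for failed in range(n_fragments - m_needed + 1, n_fragments + 1))

    p = 0.10   # assumed independent per-server failure probability (illustrative)

    # Same 4x storage overhead in both cases:
    #   - 4 full replicas, or
    #   - 64 fragments, each 1/16 of the object, any 16 of which reconstruct it.
    print(loss_probability_replication(p, copies=4))                  # 1e-4
    print(loss_probability_erasure(p, n_fragments=64, m_needed=16))
    # ...which is many orders of magnitude smaller than the replication case.
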
100
The Technologies: Byzantine Agreement
  • Guarantees all non-faulty replicas agree
  • Given N = 3f + 1 replicas, up to f may be
    faulty / corrupt (quick check after this slide)
  • Expensive
  • Requires O(N²) communication
  • Combine with primary-copy replication
  • Small number participate in Byzantine agreement
  • Multicast results of decisions to remainder

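A quick numeric illustration of the N = 3f + 1 bound and the quadratic communication cost. The exact message count depends on the protocol; N(N-1) pairwise exchanges is used below purely to show the O(N²) growth.

    def max_faulty(n_replicas):
        """Largest f such that n_replicas >= 3f + 1."""
        return (n_replicas - 1) // 3

    for n in (4, 7, 10):
        print(f"N={n}: tolerates f={max_faulty(n)} faults, "
              f"~{n * (n - 1)} pairwise messages")
    # N=4 tolerates f=1, N=7 tolerates f=2, N=10 tolerates f=3, which is why
    # Pond keeps the primary replica group small and multicasts results to the rest.
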
101
Putting It All Together: The Path of a Write
102
Prototype Implementation
  • All major subsystems operational
  • Self-organizing Tapestry base
  • Primary replicas use Byzantine agreement
  • Secondary replicas self-organize into multicast
    tree
  • Erasure-coding archive
  • Application interfaces NFS, IMAP/SMTP, HTTP
  • Event-driven architecture
  • Built on SEDA
  • 280K lines of Java (J2SE v1.3)
  • JNI libraries for cryptography, erasure coding

103
Deployment on PlanetLab
  • http://www.planet-lab.org
  • 100 hosts, 40 sites
  • Shared .ssh/authorized_keys file
  • Pond: up to 1000 virtual nodes
  • Using custom Perl scripts
  • 5 minute startup
  • Gives global scale for free

104
Performance Results: Andrew Benchmark
  • Built a loopback file server in Linux
  • Translates kernel NFS calls into the OceanStore
    API
  • Lets us run the Andrew File System Benchmark

105
Performance Results: Andrew Benchmark
  • Ran Andrew on Pond
  • Primary replicas at UCB, UW, Stanford, Intel
    Berkeley
  • Client at UCB
  • Control: NFS server at UW

106
Closer Look: Write Cost
  • Byzantine algorithm adapted from Castro and
    Liskov
  • Gives fault tolerance, security against
    compromise
  • Fast version uses symmetric cryptography
  • Pond uses threshold signatures instead
  • Signature proves that f + 1 primary replicas
    agreed
  • Can be shared among secondary replicas
  • Can also change primaries w/o changing public key
  • Big plus for maintenance costs
  • Results good for all time once signed
  • Replace faulty/compromised servers transparently

107
Closer Look: Write Cost
  • Small writes
  • Signature dominates
  • Threshold sigs. slow!
  • Takes 70 ms to sign
  • Compare to 5 ms for regular sigs.

(times in milliseconds)
108
Closer Look: Write Cost
(run on cluster)
109
Closer Look: Write Cost
  • Throughput in the wide area
  • Wide-area throughput is:
  • Not limited by signatures
  • Not limited by the archive
  • Not limited by Byzantine process bandwidth use
  • Limited by client-to-primary-replica bandwidth

110
Closer Look: Dissemination Tree
111
Closer Look: Dissemination Tree
  • Self-organizing application-level multicast tree
  • Connects all secondary replicas to primary ones
  • Shields primary replicas from request load
  • Saves bandwidth on consistency traffic
  • Tree joining heuristic (first-order solution):
  • Connect to closest replica using Tapestry
  • Takes advantage of Tapestry's locality properties
  • Should minimize use of long-distance links
  • A sort of poor man's CDN

112
Performance Results: Stream Benchmark
  • Goal: measure efficiency of the dissemination
    tree
  • Multicast tree between secondary replicas
  • Ran 500 virtual nodes on PlanetLab
  • Primary replicas in SF Bay Area
  • Other replicas clustered in 7 largest PlanetLab
    sites
  • Streams writes to all replicas
  • One content creator repeatedly appends to one
    object
  • Other replicas read new versions as they arrive
  • Measure network resource consumption

113
Performance Results: Stream Benchmark
  • Dissemination tree uses network resources
    efficiently
  • Most bytes sent across local links as the second
    tier grows
  • Acceptable latency increase over broadcast (33%)

114
Related Work
  • Distributed Storage
  • Traditional: AFS, CODA, Bayou
  • Peer-to-peer: PAST, CFS, Ivy
  • Byzantine fault tolerant storage
  • Castro-Liskov, COCA, Fleet
  • Threshold signatures
  • COCA, Fleet
  • Erasure codes
  • Intermemory, Pasis, Mnemosyne, Free Haven
  • Others
  • Publius, Freenet, Eternity Service, SUNDR

115
Conclusion and Future Work
  • OceanStore designed as a global-scale file system
  • Design meets primary challenges
  • End-to-end encryption for privacy
  • Limited trust in any one host for integrity
  • Self-organizing and maintaining to increase
    usability
  • Pond prototype functional
  • Threshold signatures more expensive than
    expected (to be addressed in future work)
  • Generating erasure-coded fragments is expensive
    (to be addressed in future work)
  • Simple dissemination tree fairly effective
  • A good base for testing new ideas

116
Future Work (cont.)
  • Assess and improve storage cost of virtualization
  • Make more aspects of the system self-maintaining
  • Algorithms for predictive replica placement
  • Efficient detection and repair of lost data
  • Increased stability and fault-tolerance
  • Behavior of Pond / Tapestry when network is
    partitioned