Title: Grid Computing
1Grid Computing
- Session 2
- David Yates
- January 23, 2004
2Acknowledgments
- Most of these slides were written by the authors
of the papers - Most of the editorial comments are mine
- Session 2 covers additional papers on data
management (metadata and replication), papers on
the interaction between data management and
scheduling, and storage and file systems for Grid
data
3Session 2 Papers
- A. Chervenak, E. Deelman, I. Foster, L. Guy, W. Hoschek, A. Iamnitchi, C. Kesselman, P. Kunszt, M. Ripeanu, B. Schwartzkopf, H. Stockinger, K. Stockinger, B. Tierney, "Giggle: A Framework for Constructing Scalable Replica Location Services." In Supercomputing 2002, November 2002. http://www.globus.org/research/papers/giggle.pdf
- G. Singh, S. Bharathi, A. Chervenak, E. Deelman, C. Kesselman, M. Manohar, S. Patil, L. Pearlman, "A Metadata Catalog Service for Data Intensive Applications." In Supercomputing 2003, Phoenix, AZ, November 2003. http://www.globus.org/research/papers/mcs_sc2003.pdf
- Matei Ripeanu and Ian Foster, "A Decentralized, Adaptive, Replica Location Service." In Eleventh IEEE International Symposium on High Performance Distributed Computing, Edinburgh, Scotland, July 2002. http://people.cs.uchicago.edu/matei/PAPERS/hpdc-02.pdf
- Kavitha Ranganathan and Ian Foster, "Computation Scheduling and Data Replication Algorithms for Data Grids." Chapter 22 in Grid Resource Management: State of the Art and Future Trends, Jarek Nabrzyski, Jennifer M. Schopf, and Jan Weglarz, editors, Kluwer Academic Publishers, 2003. http://www-unix.mcs.anl.gov/schopf/BookFinal.pdf
4Session 2 Papers, Continued
- George Kola, Tevfik Kosar and Miron Livny, "Run-time Adaptation of Grid Data-placement Jobs." To appear in Parallel and Distributed Computing Practices, 2004. http://www.cs.wisc.edu/condor/stork/papers/runtime_adaptation-pdcp2004.pdf
- Renato J. Figueiredo, Nirav H. Kapadia and Jose A. B. Fortes, "The PUNCH Virtual File System: Seamless Access to Decentralized Storage Services in a Computational Grid." In Tenth IEEE International Symposium on High Performance Distributed Computing, San Francisco, CA, August 2001. http://punch.purdue.edu/HubInfo/publications/2001/hpdc-renato.pdf
- John Bent, Venkateshwaran Venkataramani, Nick LeRoy, Alain Roy, Joseph Stanley, Andrea Arpaci-Dusseau, Remzi H. Arpaci-Dusseau and Miron Livny, "Flexibility, Manageability, and Performance in a Grid Storage Appliance." In Eleventh IEEE Symposium on High Performance Distributed Computing, Edinburgh, Scotland, July 2002. http://www.cs.wisc.edu/condor/nest/papers/nest-hpdc-02.pdf
- Sean Rhea, Patrick Eaton, Dennis Geels, Hakim Weatherspoon, Ben Zhao and John Kubiatowicz, "Pond: the OceanStore Prototype." In Second USENIX Conference on File and Storage Technologies, March 2003. http://oceanstore.cs.berkeley.edu/publications/papers/pdf/fast2003-pond.pdf
5Giggle Overview
- Giggle = GIGa-scale Global Location Engine
- A framework for constructing scalable Replica Location Services
- Data-intensive applications replicate data at multiple locations
- A Replica Location Service (RLS) is a distributed registry service that records the locations of data copies and allows discovery of replicas
- Maintains mappings between logical identifiers and target names
- Physical targets: map to exact locations of replicated data
- Logical targets: map to another layer of logical names, allowing storage systems to move data without informing the RLS
- Issues
- Locating replicas of desired files
- Creating new replicas
- Scalability
- Reliability
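To make the logical-to-target mapping concrete, here is a minimal sketch, assuming a simple in-memory dictionary, of what a Local Replica Catalog stores; it is illustrative only (the class and example names are not from the paper or the Globus code).

```python
# Minimal, illustrative sketch of an RLS Local Replica Catalog:
# a mapping from logical names to target names (physical or logical targets).
from collections import defaultdict

class LocalReplicaCatalog:
    def __init__(self):
        self._mappings = defaultdict(set)      # logical name -> set of target names

    def add(self, logical_name, target_name):
        """Register a replica; the target may be a physical or a logical target."""
        self._mappings[logical_name].add(target_name)

    def delete(self, logical_name, target_name):
        self._mappings[logical_name].discard(target_name)

    def lookup(self, logical_name):
        """Return all known targets (replica locations) for a logical name."""
        return set(self._mappings.get(logical_name, ()))

# Hypothetical example names:
lrc = LocalReplicaCatalog()
lrc.add("lfn://cms/run42/event.dat", "gsiftp://se01.example.org/data/event.dat")
print(lrc.lookup("lfn://cms/run42/event.dat"))
```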
6Giggle Architecture
[Diagram: a layer of Replica Location Index (RLI) nodes, each aggregating state from several Local Replica Catalogs (LRCs)]
- LRCs contain consistent information about logical-to-target mappings at a site
- RLI nodes aggregate information about LRCs
- Arbitrary levels of RLI hierarchy (see paper for example)
7Giggle: A Flexible Replica Location Service Framework
- Allows users to make tradeoffs among
- consistency
- space overhead
- reliability
- update costs
- query costs
- By combining five essential elements in different ways, the framework supports a variety of RLS designs. The five elements:
- 1. Consistent Local State
- 2. Global State with relaxed consistency
- 3. Soft state mechanisms for maintaining global
state - 4. Compression of state updates
- 5. Membership and Partitioning information
maintenance
8Components of RLS Implementation
- Front-end Server
- Multi-threaded
- Supports Globus Grid Security Infrastructure
(GSI) authentication - Common implementation for LRC and RLI
- Back-end Server
- mySQL relational database (or PostgreSQL database)
- Holds logical name to target name mappings
- Client APIs: C and Java
- Client command-line tool
9Implementation Features
- Two types of soft state updates from LRCs to RLIs
- Complete list of logical names registered in the LRC
- Bloom-filter-compressed summaries of the LRC
- Immediate mode
- When active, send updates after 30 seconds (configurable) or after a fixed number (100 by default) of updates
- Send full updates at a reduced rate
- User-defined attributes
- May be associated with logical or target names
- Partitioning (without Bloom filters)
- Divide LRC soft state updates among RLI index nodes using pattern matching of logical names
- Membership service
- Static configuration only
- Eventually use OGSA registration techniques
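The immediate-mode rule above (send after a 30-second timeout, or once a fixed number of changes have accumulated) can be pictured with a small sketch. This is an illustration only, not the RLS implementation; the rli_send callback and the thresholds are assumed names.

```python
# Illustrative sketch of immediate-mode soft-state updates from an LRC to an RLI:
# ship pending changes when a timeout expires or when enough changes accumulate.
import time

class ImmediateModeUpdater:
    def __init__(self, rli_send, timeout_s=30, max_pending=100):
        self.rli_send = rli_send               # callback that ships an update to the RLI
        self.timeout_s = timeout_s
        self.max_pending = max_pending
        self.pending = set()                   # logical names changed since the last update
        self.last_sent = time.monotonic()

    def record_change(self, logical_name):
        self.pending.add(logical_name)
        self._maybe_send()

    def _maybe_send(self):
        expired = time.monotonic() - self.last_sent >= self.timeout_s
        if self.pending and (expired or len(self.pending) >= self.max_pending):
            self.rli_send(sorted(self.pending))    # incremental soft-state update
            self.pending.clear()
            self.last_sent = time.monotonic()

# Example: with max_pending=3, the third change triggers a send.
updater = ImmediateModeUpdater(rli_send=lambda names: print("update ->", names),
                               timeout_s=30, max_pending=3)
for lfn in ("lfn://a", "lfn://b", "lfn://c"):
    updater.record_change(lfn)
```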
10Wide Area Complete Soft State Update Performance
- LRCs in Geneva and Pisa updating RLI at Glasgow
- Full soft state updates quite slow for large
databases, dominated by update costs on RLI
database
- Performance does not scale as LRCs grow; soft state updates need compression
11Performance of an LRC Server Updating an RLI Server
- Number of SQL operations generated at single RLI
and LRC servers for complete and incremental
updates - Servers need to be configured (statically or
dynamically) to use update scheme that is most
appropriate for expected rate of updates to LRC
12Future Work
- Continued development of RLS as part of the Globus Toolkit
- http://www.globus.org/rls
- http://cern.ch/grid-data-management
- Reliable replication service
- Replicate data objects and register them in RLS
- Provide fault tolerance
- Consistency services
- Versioning
- Subscription
- RLS will become an OGSA grid service
- Replica location grid service specification will
be standardized through Global Grid Forum
13Metadata Catalog Service for Data Intensive
Applications
- Metadata is information that describes data
objects - Application-specific
- Temperature, longitude, latitude, depth
- Time, duration, sensor
- Application-independent
- Creator, logical name, time created, access
control - Collections of data objects e.g., data
collected during an experiment - Logical views of data objects allow users to
group data objects according to their interests
14Types Of Metadata
- Physical metadata
- Depends on location of data object and
characteristics of the storage system - Logical metadata
- a) How data objects were created or modified
- By whom, when, using what equipment or
computational engine - By what process experimental output,
simulation or analysis results - With what input conditions or parameters
- b) Description of what the data represent
- Precipitation over Africa for December 1998
- Particle collisions in the LHC for period of 1
second - We restrict the Metadata Catalog Service (MCS)
schema to logical metadata
15Why is a Metadata Catalog Service Needed?
- Essential for scientists and applications to
- Record information about the creation,
transformation, meaning and quality of data items - Query for data items based on these descriptive
attributes - Identifying data items correctly is essential for
correct analysis of experimental and simulation
results - Traditionally, scientists have used ad hoc
methods to keep track of what data items
represent - Descriptive file names, datasets, directories,
lab notebooks, memory - These methods do not scale to terabyte and
petabyte data sets consisting of millions of data
items
16An Example MCS Usage Scenario
17Data Model
[Diagram: logical data items (files) are grouped into logical collections and logical views]
18MCS Prototype Schema
- Logical file metadata
- logical file name
- data type
- version number
- master copy location
- container information
- information about the creator
- last modifier of the data
- Logical collection metadata
- collection name
- description
- set of files in a collection
- annotations on the collection
- information about the creator and modifier(s)
- collection hierarchy information (parent
collection id) - Logical view metadata
- view name, attributes, description, creator /
modifier(s)
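For illustration only, the prototype schema above could be written down as Python dataclasses; the field names follow the bullets on this slide, while the types and defaults are assumptions rather than the actual MCS table definitions.

```python
# Sketch of the MCS prototype schema; types and defaults are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class LogicalFile:
    logical_file_name: str
    data_type: str
    version_number: int
    master_copy_location: str
    container_info: Optional[str]
    creator: str
    last_modifier: str

@dataclass
class LogicalCollection:
    collection_name: str
    description: str
    files: List[str]                                    # logical files in the collection
    annotations: List[str] = field(default_factory=list)
    creator: str = ""
    modifiers: List[str] = field(default_factory=list)
    parent_collection_id: Optional[str] = None          # collection hierarchy

@dataclass
class LogicalView:
    view_name: str
    attributes: dict
    description: str
    creator: str
    modifiers: List[str] = field(default_factory=list)
```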
19Prototype Design
- Initial Prototype
- Simple, centralized Metadata Service
- Based on open source web service (Apache / Axis)
and relational database technology (mySQL)
20- Web interface is expensive (incurs 80% overhead)
- Adding to the metadata catalog scales well with database size
21- Web interface even more expensive (95% overhead)
- Querying is 2x-20x faster than adding
- Querying the metadata catalog also scales with database size
22- Complex queries are about 8x-12x slower than simple queries
- Overhead of the web interface and increase in database size both carry a performance penalty
23Status and Future Work
- Evaluated alternative back end technologies
- Evaluated methods to reduce web interface
overhead - Initial prototype is relational (mySQL)
- Requires shredding and reconstructing XML data
- Difficult tradeoffs between complexity of storing
XML metadata and query efficiency
- XML metadata is not a very natural fit for MCS's relational database back end
- But native XML databases have poor query
performance - Evaluate use of native XML databases (Xindice,
commercial XML databases) - New implementation will be based on OGSA Database
Access and Integration (DAI) Service - Being standardized through Global Grid Forum
- Reference implementation involving IBM, Oracle,
UK eScience researchers, academic institutions - Provides both relational and native XML back ends
- Provides a grid service front end with grid
security - Provides a general pass-through SQL query
interface - Testing OGSA DAI services with ESG metadata
24Future Work, Continued
- Re-evaluate MCS schema
- How can we better support (multiple)
domain-specific schema? - ESG makes extensive use of user-defined
attributes to support domain-specific metadata
schemas - Key requirement for metadata services - easily
extensible - Need rich, efficient mechanisms for adding
user-defined attributes - Reconsider usefulness of pre-defined attributes
- How useful are pre-defined attributes?
- ESG is not using many of MCS's pre-defined attributes
- Will we use more of these as we integrate further
with other grid tools for workflow management,
provenance, etc.? - Support for provenance information (describes
data transformations) - Unify MCS schema with Chimera data catalog schema
- Distribution and federation of heterogeneous
metadata services - Want to federate multiple metadata catalogs
(e.g., THREDDS) - Current work assumes strict consistency is a
requirement
- Explore relaxed consistency models: heterogeneous metadata services export discovery information to aggregating index nodes
25A Decentralized, Adaptive Replica Location Service
- Replica location problem
- Replication often used to improve reliability,
access latency or availability - Need efficient mechanism to locate replicas
- Map logical ID to replica location(s)
- Common to cooperative proxy caches and distributed object systems
- In Grids, a client presents an LFN (logical file name) and asks for one, many, or all PFNs (physical file names)
26End-to-end Argument
- Impossible to provide a completely consistent
view of the system in a distributed, asynchronous
environment. - Giggle presents a framework for building replica
location services. - We argue that the performance of the overall
system benefits from relaxed consistency
semantics at lower system levels. - Interesting tradeoffs between inconsistency
levels and operational costs.
27Example Application Requirements
- Requirements for data intensive, scientific
applications (GriPhyN project) - Scale 1 billion replicas by 2006, 10 times
larger by 2010 - Decentralization sites able to operate
independently (100s of sites) - Replica lookup rates are order(s) of magnitude
higher than update rates - Efficient queries for ad-hoc sets of files
28Lossy Data Compression: Bloom Filters
- Probabilistic technique for compressed set representation
- Good compression ratios at the cost of a low false-positive rate
29Bloom Filters, Continued
- Simple mathematical model to design filters
- Accuracy/space (bandwidth) tradeoffs can be adjusted on the fly
[Chart: false-positive rate as a function of the number of hash functions]
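A minimal Bloom filter sketch, assuming hash positions derived from SHA-1; it is not the authors' code, but it shows the add/lookup operations and why lookups can return false positives but never false negatives, with m (bits) and k (hash functions) setting the accuracy/space tradeoff.

```python
# Minimal Bloom filter: k hash positions over an m-bit array.
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter(num_bits=8 * 1024, num_hashes=5)
bf.add("lfn://cms/run42/event.dat")
assert "lfn://cms/run42/event.dat" in bf      # added items are always found
print("lfn://something-else" in bf)           # usually False; rarely True (a false positive)
```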
30Overlay Networks
- Generally used to provide functionality not
available at lower network levels (e.g.,
multicast and security) - Why do we use overlays?
- The Resilient Overlay Networks (MIT) project improves network availability between Internet-connected end-points by more than one order of magnitude
- Work well: file-sharing P2P systems have scaled to more than 100k nodes (e.g., Gnutella, KaZaA)
- Easy to adapt to heterogeneity in available
resources
31Soft-state Mechanisms
- Producer sends state to receiver(s) over a
(lossy) channel - Receivers keep state and associated timeouts
- Advantages
- Decouples state producer and consumer: no explicit failure detection and state removal messages
- Eventual full state
- Adaptive: traditionally fixed, empirically determined update rates; however, state producers can obey more complex rules
- Work well in practice: RSVP, RIP, or MDS-2
32Assembling the Pieces
- Replica add/delete
- Digest dissemination
- Replica lookup
- Nodes cache responses
- to benefit from locality in request flow
- Storage sites and Replica Location Nodes (RLNs) join and leave
- Typically one or more RLNs per administrative domain
33Resource Requirements Estimate
- Compact Muon Solenoid (CMS) high-energy physics
experiment requirements for 2006 - 0.5G replicas overall (avg. 10 replicas/file)
- 100 sites (replica location nodes)
- overall 10,000 lookups/sec, 10 updates/sec on
average, update propagation delay 30 sec - Translates into
- Each RLN needs 1 GB of memory (<0.05 false-positive rate)
- Generated traffic: <200 Mbps per overlay link
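As a rough sanity check on estimates of this kind, the standard Bloom filter sizing formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2 can be evaluated for numbers of this order; the snippet below is only illustrative and does not reproduce the authors' exact assumptions.

```python
# Back-of-the-envelope Bloom filter sizing with the standard formulas.
import math

def bloom_size(n_items, false_positive_rate):
    m_bits = -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    k_hashes = (m_bits / n_items) * math.log(2)
    return m_bits, k_hashes

n = 500_000_000      # ~0.5G replicas overall (the slide's 2006 CMS estimate)
p = 0.05             # target false-positive rate
m, k = bloom_size(n, p)
print(f"~{m / 8 / 2**30:.2f} GiB of filter state, ~{round(k)} hash functions")
```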
34Prototype Implementation
- Python code for fast prototyping
- Bloom filters
- False positive rates match theoretical results
- Fast lookup, add, delete operations
35Prototype Performance
- Replica Location Node
- Lookup rates: 645 req/sec and 7700 req/sec
- Add, delete: about half of lookup performance
- Overall performance
- Tested with 24 nodes on a LAN
- 50M replicas (about 2M per node)
- 3 simulated clients per node
- Peaks at 2000 lookups/sec concurrently with 1200
updates/sec - Propagates update in 30 sec
36Future Work
- Improve prototype performance
- Enhance overlay organization mechanisms to
reflect various goodness criteria - Match infrastructure (reduce generated traffic
overhead) - Match user behavior (file sharing ? overlay
topology)Small-World File-Sharing Communities
in Infocom 2004 - Reduce latency
- Maximize availability
- Emulation environment to be able to perform
controlled large scale experiments - Test on wide area deployments
37COMPUTATION SCHEDULING AND DATA REPLICATION
ALGORITHMS FOR DATA GRIDS
- Scheduling algorithms for large-scale data
intensive problems in Grids - e.g. High Energy Physics experiments like CMS (at
CERN) which will generate petabytes of data per
year - Challenge
- multiple, potentially independent sources of jobs
- large number of storage, compute, and network
resources - huge amounts of input / output data
- Decentralized solutions for simplicity and
feasibility
- Jobs are data-intensive → important to take data location into account while scheduling
- Replication of data to reduce latency caused by
remote data access
38Contributions
- A general and extensible scheduling framework for
computational grids - A wide variety of scheduling algorithms can be
implemented using this framework
- Simulator: ChicagoSim uses the framework to explore the effectiveness of different scheduling approaches / algorithms
- Paradigm for scheduling: integrated job scheduling and data replication
39System Model
- Model a Grid as a collection of sites - each site
has - Certain number of processors
- Limited Storage
- Users associated with the local site
- Set of files initially at site
- Users generate jobs - each job
- Needs certain input files before it can execute
- Executes on a single processor
- Has access to all files at its local site
40Scheduling Framework
[Diagram: N users submit jobs (J) to External Schedulers (ES); jobs are dispatched to Local Schedulers (LS) at S sites, each with computers and storage; Dataset Schedulers (DS) monitor dataset (D) popularity, migrate data, and request remote data; local schedulers run jobs on idle nodes]
Different mappings between Users and External
Schedulers lead to different architectures
41Job and Data Scheduling Algorithms
- Two distinct functionalities: External Scheduler and Dataset Scheduler
- Job scheduling algorithms
- The External Scheduler runs a job at:
- Random: a randomly selected site
- LeastLoaded: the site that currently has the least load
- RandLeastLoaded: a site randomly selected from the n least-loaded sites
- DataPresent: the least loaded site that already has the required data
- Local: the site where the job originated
- Local scheduling is performed FIFO
42Data Scheduling Algorithms
- Dataset scheduling algorithms (a sketch of the integrated approach follows this slide)
- Datasets for jobs are replicated as follows:
- Caching: no active replication takes place; datasets are cached and managed LRU
- DataRandom: replicate popular datasets at a random site when the local site's load exceeds a threshold
- DataLeastLoaded: replicate popular datasets at the least loaded site when the local site's load exceeds a threshold
- DataRandLeastLoaded: replicate popular datasets at a random site picked from the n least loaded sites when the local site's load exceeds a threshold
- Datasets are also cached, and storage at each site is managed LRU
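The sketch below illustrates the integrated approach built from the algorithms named above: DataPresent job placement plus DataLeastLoaded-style replication triggered by load. The Site class, the load metric, and the threshold are invented for illustration; this is not ChicagoSim code.

```python
# Illustrative integrated scheduling: DataPresent + load-triggered replication.
import random

class Site:
    def __init__(self, name):
        self.name = name
        self.datasets = set()
        self.queued_jobs = 0                   # simple load metric (assumption)

def data_present(sites, required_dataset):
    """Run the job at the least-loaded site that already holds the data."""
    candidates = [s for s in sites if required_dataset in s.datasets]
    if not candidates:                         # fall back if no replica exists yet
        return random.choice(sites)
    return min(candidates, key=lambda s: s.queued_jobs)

def data_least_loaded(sites, local_site, popular_dataset, load_threshold):
    """Replicate a popular dataset at the least-loaded site when the local
    site's load exceeds a threshold (storage itself would be managed LRU)."""
    if local_site.queued_jobs > load_threshold:
        target = min(sites, key=lambda s: s.queued_jobs)
        target.datasets.add(popular_dataset)

sites = [Site("A"), Site("B"), Site("C")]
sites[0].datasets.add("dsX")
sites[0].queued_jobs = 5
print(data_present(sites, "dsX").name)         # "A": the only site holding dsX
```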
43Simulation Parameters
Dataset popularity is modeled by picking input
datasets from a geometric distribution
44- Performance from simulation results varies widely
(6x or more)
- Integrated approaches (DataPresent + selective replication) perform best
- The data-driven approach without selective replication (DataPresent + Caching) performs worse than baseline policies (Random and Local)
- Adding randomization to least loaded job
scheduling yields significant gain
45- Data-driven scheduling approaches (DataPresent + any replication policy) perform best
- Caching always reduces data transferred (no data is transferred with DataPresent + Caching)
46- Integrated approaches (DataPresent + selective replication) perform best
- Load-based replication, like load-based scheduling, is a good idea
47Summary and Future Work
- Important to address both job scheduling and data
replication and impact of one on the other - An integrated approach performs best among the
strategies considered - data-driven job scheduling
- proactive selective dataset replication
- Future Work
- Workloads from Fermi Lab user access patterns and
CMS workload generator - Visualization tool for ChicagoSim
- Experiments gauging sensitivities to
- Bandwidth, storage / cache size, CPU speed
- Heterogeneity in Grid
- (user location, storage, compute elements)
- Network topology / contention
- File popularity / job popularity
- Validate simulation results on real Grid testbeds
- Explore adaptive algorithms that select
algorithms dynamically depending on current Grid
conditions
48Run-time Adaptation of Grid Data Placement Jobs
- Grid presents a continuously changing environment
- Data intensive applications are being run on the
grid - Data intensive applications have two parts
- Data placement part
- Computation part
49Data Placement
A Data Intensive Application
Stage in data
Data placement
Compute
Stage out data
Data placement encompasses data transfer,
staging, replication, data positioning, space
allocation and de-allocation
50Current Approach
- FedEx
- Hand Tuning
- Network Weather Service
- Not useful for high-bandwidth, high-latency
networks - TCP Auto-tuning
- 16-bit window size and window scale option limitations
51Our Approach
- Full automation
- Continuously monitor environment characteristics
- Perform tuning whenever characteristics change
- Ability to dynamically and automatically choose
an appropriate protocol - Ability to switch to alternate protocol in case
of failure
52The Big Picture
53Profilers
- Memory Profiler
- Optimal memory block-size and incremental
block-size - Disk Profiler
- Optimal disk block-size and incremental
block-size - Network Profiler
- Determines bandwidth, latency and the number of
hops between a given pair of hosts - Uses pathrate, traceroute and diskrouter
bandwidth test tool
54Parameter Tuner
- Generates optimal parameters for data transfer
between a given pair of hosts - Calculates TCP buffer size as the bandwidth-delay
product - Calculates the optimal disk buffer size based on
TCP buffer size - Uses a heuristic to calculate the number of TCP
streams (see the sketch after this slide)
- Number of streams = 1 + number of hops with latency > 10 ms
- Rounded to an even number
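The heuristics on this slide can be written down directly. The function below is only a sketch; the profiler inputs (path bandwidth, round-trip time, per-hop latencies), their units, and the way the disk buffer is derived from the TCP buffer are assumptions.

```python
# Sketch of the parameter tuner's heuristics.
def tune_transfer(bandwidth_bytes_per_s, rtt_s, hop_latencies_s):
    # TCP buffer size = bandwidth-delay product
    tcp_buffer = int(bandwidth_bytes_per_s * rtt_s)
    # Disk buffer derived from the TCP buffer (here simply matched to it; assumption)
    disk_buffer = tcp_buffer
    # Number of streams = 1 + number of hops with latency > 10 ms, rounded to an even number
    streams = 1 + sum(1 for lat in hop_latencies_s if lat > 0.010)
    if streams % 2:
        streams += 1
    return {"tcp_buffer": tcp_buffer, "disk_buffer": disk_buffer, "streams": streams}

# Example: 100 MB/s path, 60 ms RTT, three hops slower than 10 ms.
print(tune_transfer(100e6, 0.060, [0.002, 0.015, 0.020, 0.012]))
```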
55Data Placement Scheduler
- Data placement is a real job
- A meta-scheduler (e.g. DAGMan in Condor) is used
to coordinate data placement and computation - Sample data placement job
- dap_type = transfer
- src_url = diskrouter://slic04.sdsc.edu/s/s1
- dest_url = diskrouter://quest2.ncsa.uiuc.edu/d/d1
56Data Placement Scheduler
- Used Stork, a prototype data placement scheduler
- Tuned parameters are fed to Stork
- Stork uses the tuned parameters to adapt data
placement jobs
57Coordinating DAG
58Scalability
- There is no centralized server
- Parameter tuner can be run on any computation
resource - Profiler data is 100s of bytes per host
- There can be multiple data placement schedulers
59Real World Experiment
- DPOSS data had to be transferred from SDSC, located in San Diego, to NCSA, located in Chicago
60Management Site (skywalker.cs.wisc.edu)
SDSC (slic04.sdsc.edu)
NCSA (quest2.ncsa.uiuc.edu)
StarLight (ncdm13.sl.startap.net)
61Data Transfer from SDSC to NCSA using Run-time
Protocol Auto-tuning
[Graph: transfer rate (MB/s) over time, showing a network outage and the point where auto-tuning is turned on]
62Parameter Tuning
Network parameters for GridFTP before and after the auto-tuning feature of Stork was turned on
63Alternate Protocol Failover
- dap_type = transfer
- src_url = diskrouter://slic04.sdsc.edu/s/data1
- dest_url = diskrouter://quest2.ncsa.uiuc.edu/d/data1
- alt_protocols = nest-nest, gsiftp-gsiftp
- In case of DiskRouter failure, Stork will switch to the other protocols in the order specified
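Conceptually, the failover behavior is a loop over the protocol list from the job description. The sketch below is illustrative only and is not Stork's implementation; the transfer function is a stand-in.

```python
# Illustrative alternate-protocol failover loop.
class TransferFailed(Exception):
    pass

def transfer(protocol, src, dest):
    """Stand-in for a real transfer; assumed to raise TransferFailed on error."""
    print(f"trying {protocol}: {src} -> {dest}")
    raise TransferFailed(protocol)

def transfer_with_failover(src, dest, protocols=("diskrouter", "nest", "gsiftp")):
    for protocol in protocols:                 # preferred protocol first, then alternates
        try:
            transfer(protocol, src, dest)
            return protocol                    # success
        except TransferFailed:
            continue                           # fall through to the next protocol
    raise TransferFailed("all protocols failed")

# transfer_with_failover("diskrouter://slic04.sdsc.edu/s/data1",
#                        "diskrouter://quest2.ncsa.uiuc.edu/d/data1")
```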
64Testing Alternate Protocol Failover
[Graph: transfer rate (MB/s) over time, annotated where the DiskRouter server was killed and later restarted]
65Conclusion
- Run-time adaptation has a significant impact (20
times improvement in our test case) - Profiling data has the potential to be used for
network management and data mining - Network misconfigurations
- Network outages
- Dynamic protocol selection and alternate protocol
failover increase resilience and improve overall
throughput
66Future Work
- Enhance dynamic protocol selection in Stork to
select best protocol - Performance (support for different requirements)
- Security ?
- Reliability ?
- Dynamically select which route to use in
transfers - Dynamically deploy diskrouters at Grid nodes
- Combine route selection and diskrouters to make
the best use of network bandwidth
67The PUNCH Virtual File System (PVFS)
- Seamless Access to Decentralized Storage Services
in a Computational Grid
- Goal: computational grids that distribute and deliver computing services to users anytime, anywhere
- Challenge: data management
68PUNCH
[Diagram: PUNCH (punch.purdue.edu) layers: web enabling, applications, data, virtual file system, resource management, compute servers]
69Logical User Accounts
- Problems with traditional user accounts
- No support for dynamic access policies
- Cannot cross administrative domains
- Complicates resource management
- Logical user accounts provide a capability that
allows users to check out accounts dynamically
via a resource management system
- Shadow accounts: allocated to users on demand at the compute server
- File accounts: store data for one or more users at the file server
70Traditional vs. Logical user accounts
71PUNCH Virtual File System (PVFS) Goals
- Unmodified applications
- Unmodified O/S clients, servers
- Heterogeneous platforms
- Block-based data transfers
- ⇒ De-facto standard: NFS
72NFS-based Virtual File System
- Additional functionality is required
- shadow-file account multiplexing, uid mapping
- Possible solutions
- Enhanced NFS clients and/or servers
- NFS call forwarding via middle tier proxies
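Conceptually, a call-forwarding proxy rewrites credentials and paths between the user's shadow account on the compute server and the file account on the file server. The sketch below is a conceptual illustration only, not PVFS code; the uid and directory mappings are invented.

```python
# Conceptual sketch of shadow-account to file-account remapping in a proxy.
import posixpath

class CallForwardingProxy:
    def __init__(self, uid_map, root_map):
        self.uid_map = uid_map          # shadow uid -> file-account uid
        self.root_map = root_map        # shadow uid -> directory owned by the file account

    def forward(self, shadow_uid, path):
        if shadow_uid not in self.uid_map:
            raise PermissionError("unknown shadow account")
        root = self.root_map[shadow_uid]
        full = posixpath.normpath(posixpath.join(root, path.lstrip("/")))
        if full != root and not full.startswith(root + "/"):   # basic access control
            raise PermissionError("path escapes the user's file space")
        return self.uid_map[shadow_uid], full   # identity and path used at the real server

proxy = CallForwardingProxy(uid_map={5001: 900},
                            root_map={5001: "/fileacct/users/alice"})
print(proxy.forward(5001, "/project/data.txt"))
```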
73Network File System (NFS)
75Multiplexing and access control
[Diagram: clients A and B reach file servers C and D through a file system gateway that multiplexes accounts and enforces access control]
76Performance Results
Andrew File System Benchmark on PVFS
Note: the client machine was a 4-CPU, 480 MHz UltraSPARC connected to a 2-CPU, 400 MHz server via 100 Mb/s switched Ethernet. Data shown is the average across 200 samples.
77User workload characteristics
Andrew: > 100 transactions/s
78Related Work
- Explicit file transfers: Globus (RFT / GridFTP), Portable Batch System, others
- Implicit transfers
- Condor: custom libraries
- Legion: custom NFS servers
- PUNCH v0.5: standard NFS clients/servers
- SFS: proxy-based, but no account multiplexing
79Future Work
- Coarse-grain locality
- Placement
- Migration
- Fine-grain locality
- Middleware-driven consistency
- Proxy caching / prefetching
80Flexibility, Manageability and Performance in a
Grid Storage Appliance
- Two Trends
- Data sets
- Performance
- Storage appliances address both trends
81Storage Appliances and -
- Storage appliances: great for basic file service
- Easy to manage: plug in and it works
- Good performance: specialized just for I/O
- Reliable and available too
- Storage appliances for the Grid: mismatch?
- Inflexible: few, specific protocols (e.g., NFS)
- Costly: 10x the cost of a PC + a few disks
- Difficult to integrate: just one piece of the puzzle
82A Solution: NeST
- NeST: A Storage Appliance for the Grid
- Flexible: multiple simultaneous protocols
- Virtual protocol layer
- Low-cost: use commodity machines
- Dynamic adaptation
- Grid-aware: integrate w/ higher-level systems
- Designed specifically for the Grid
83NeST Protocol Layer
- Virtualizes different protocols
- Mediates access to the network
[Architecture diagram: NeST's protocol layer, Dispatcher, Storage Mgr, and Transfer Mgr sit between the physical network layer and the physical storage layer]
84NeST Dispatcher
- Mediates interaction between other components
- Gathers information, advertises
85NeST Storage Manager
- Space management
- Access control
- Virtualizes physical storage
86NeST Transfer Manager
- Implements scheduling policies
- Chooses concurrency model
87Flexibility: Multiple Protocols
- Problem: how to support multiple protocols?
- One approach: Just a Bunch of Servers (JBOS)
- Problems with JBOS
- Lack of control (scheduling)
- Painful administration
- No shared code
- Larger memory footprint
[Diagram: a JBOS server running separate wu-ftpd, nfsd, and httpd daemons]
88NeST Flexibility By Design
- NeST: integrate protocols and gain advantage
- Implementation like VFS
- Integration introduces new challenges
- Different protocols allow different auth models
- More expensive to add a new protocol
- Less fault isolation
89NeST vs JBOS
[Chart: server bandwidth (MB/s) for NFS, Chirp, HTTP, GridFTP, and total, comparing NeST against a JBOS configuration (linux nfsd, Apache, wu-ftpd). Setup: Linux cluster, dual PIII, 1 GB RAM, Linux 2.2.19; 4 clients per protocol; 10 MB files]
- For each protocol, NeST is comparable to JBOS
server
90Exerting Scheduling Control
- Different scheduling policies
- FCFS
- Cache-aware ("Exploiting Gray-Box Knowledge of Buffer-Cache Management", USENIX 2002)
- Proportional share
- Proportional share scheduling
- Allows administrators to set protocol proportions
- e.g., favor NFS
- Very difficult in JBOS
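One simple way to realize proportional share among protocol queues is lottery-style selection weighted by the configured shares; the sketch below is illustrative only and is not NeST's actual scheduler.

```python
# Illustrative proportional-share selection among protocol request queues.
import random

def serve_next(queues, shares):
    """Pick a protocol with probability proportional to its share, then serve
    the oldest pending request of that protocol."""
    ready = [p for p, q in queues.items() if q]      # protocols with pending requests
    if not ready:
        return None
    protocol = random.choices(ready, weights=[shares[p] for p in ready], k=1)[0]
    return protocol, queues[protocol].pop(0)

queues = {"NFS": ["r1", "r2"], "HTTP": ["r3"], "GridFTP": ["r4"], "Chirp": ["r5"]}
shares = {"NFS": 2, "HTTP": 1, "GridFTP": 1, "Chirp": 1}     # e.g., favor NFS 2:1:1:1
print(serve_next(queues, shares))
```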
91Proportional Share
[Chart: server bandwidth (MB/s) per protocol under different scheduling configurations: FCFS, 1:1:1:1, 1:2:1:1, 1:1:1:4. Setup: Linux cluster, dual PIII, 1 GB RAM, Linux 2.2.19; 4 clients per protocol; 10 MB files]
- In most cases, achieves Jain's fairness metric > 0.98 (1 is fair)
92Grid-Aware Mechanisms
- Basic functionality
- Users and groups: dynamic creation / deletion
- Does not need administrative intervention
- Access control: generic AFS-style ACLs
- Advanced functionality
- QoS: preferential scheduling
- Advertises into global scheduling systems
- Flexible protocol and authentication mechanisms
- Self-cleaning storage guarantees: Lots
93Storage Guarantees: Lots
- Characteristics of Lots (see the sketch after this slide)
- Capacity: total amount of data a lot can store
- Duration: time for which data is guaranteed to exist
- Set of files: multiple files may co-exist within a lot
- Self-cleaning
- Expired lots become best-effort lots
- Lot management
- Either a default set created by the administrator, OR use the resource management protocol to create lots before usage
- Implementation: file system quotas
- Advantage: integrates cleanly with local access methods
- Disadvantage: performance hit for large writes
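A lot can be pictured as a capacity plus an expiry time, after which the guarantee degrades to best effort. The sketch below is illustrative only (field names and units are assumptions); the actual implementation uses file system quotas, as noted above.

```python
# Illustrative sketch of a self-cleaning storage lot.
import time

class Lot:
    def __init__(self, capacity_bytes, duration_s):
        self.capacity = capacity_bytes               # total data the lot may hold
        self.expires_at = time.time() + duration_s   # guarantee duration
        self.files = {}                              # filename -> size

    @property
    def guaranteed(self):
        return time.time() < self.expires_at         # expired lots become best-effort

    def store(self, name, size):
        used = sum(self.files.values())
        if used + size > self.capacity:
            raise IOError("lot capacity exceeded")
        self.files[name] = size

lot = Lot(capacity_bytes=10 * 2**30, duration_s=3600)    # e.g., 10 GiB for one hour
lot.store("results.dat", 2 * 2**30)
print("guaranteed" if lot.guaranteed else "best-effort")
```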
94Conclusions and Future Work
- NeST: a storage appliance for the Grid
- Gain manageability
- Without sacrificing performance
- Design goals
- Flexibility: virtual protocol architecture
- Low-cost: adaptation mechanisms
- Grid-aware: space management
- Current status: release 0.9 available
- Future work
- Hot-deployable NeSTs, lot management extensions
95Pond: The OceanStore Prototype
The OceanStore Vision
96The Challenges
- Maintenance
- Many components, many administrative domains
- Constant change
- Must be self-organizing
- Must be self-maintaining
- All resources virtualized; no physical names
- Security
- High availability is a hacker's target-rich environment
- Must have end-to-end encryption
- Must not place too much trust in any one host
97The Technologies: Tapestry
- Tapestry performs
- Distributed Object Location and Routing
- From any host, find a nearby
- replica of a data object
- Efficient
- O(log N) location time, where N = number of hosts in the system
- Self-organizing, self-maintaining
98The Technologies: Tapestry (cont.)
99The Technologies: Erasure Codes
- More durable than replication for the same space
- The technique: encode an object into n fragments such that any m of them suffice to reconstruct it (see the sketch below)
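The durability claim can be checked with elementary probability: if each stored piece is lost independently with probability p, compare n full replicas against an m-of-n erasure code at the same total storage overhead. The numbers below are purely illustrative.

```python
# Compare object-loss probability: replication vs. m-of-n erasure coding.
from math import comb

def loss_prob_replication(copies, p):
    return p ** copies                   # lost only if every copy fails

def loss_prob_erasure(n_fragments, m_needed, p):
    # Lost if fewer than m_needed fragments survive.
    return sum(comb(n_fragments, k) * (1 - p) ** k * p ** (n_fragments - k)
               for k in range(m_needed))

p = 0.1                                  # per-piece loss probability (illustrative)
# Both schemes below use 4x storage: 4 full replicas, or 32 fragments each 1/8 the size.
print(loss_prob_replication(4, p))       # ~1e-4
print(loss_prob_erasure(32, 8, p))       # astronomically smaller for the same space
```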
100The Technologies: Byzantine Agreement
- Guarantees all non-faulty replicas agree
- Given N = 3f + 1 replicas, up to f may be faulty / corrupt
- Expensive
- Requires O(N^2) communication
- Combine with primary-copy replication
- Small number participate in Byzantine agreement
- Multicast results of decisions to remainder
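The arithmetic behind these bullets is simple enough to write down: the helper below just evaluates f = (N - 1) / 3 and an O(N^2) message count for a few replica-group sizes (the exact message count depends on the protocol; this is only an order-of-magnitude illustration).

```python
# N = 3f + 1: how many Byzantine faults a replica group can tolerate.
def byzantine_capacity(n_replicas):
    f = (n_replicas - 1) // 3
    messages_per_round = n_replicas * (n_replicas - 1)   # all-to-all: O(N^2)
    return f, messages_per_round

for n in (4, 7, 10):
    f, msgs = byzantine_capacity(n)
    print(f"N={n}: tolerates f={f} faulty replicas, ~{msgs} messages per round")
```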
101Putting it all together: The Path of a Write
- All major subsystems operational
- Self-organizing Tapestry base
- Primary replicas use Byzantine agreement
- Secondary replicas self-organize into multicast
tree - Erasure-coding archive
- Application interfaces: NFS, IMAP/SMTP, HTTP
- Event-driven architecture
- Built on SEDA
- 280K lines of Java (J2SE v1.3)
- JNI libraries for cryptography, erasure coding
103Deployment on PlanetLab
- http://www.planet-lab.org
- 100 hosts, 40 sites
- Shared .ssh/authorized_keys file
- Pond: up to 1000 virtual nodes
- Using custom Perl scripts
- 5 minute startup
- Gives global scale for free
104Performance Results: Andrew Benchmark
- Built a loopback file server in Linux
- Translates kernel NFS calls into OceanStore API
- Lets us run the Andrew File System Benchmark
105Performance Results: Andrew Benchmark
- Ran Andrew on Pond
- Primary replicas at UCB, UW, Stanford, Intel
Berkeley - Client at UCB
- Control: NFS server at UW
106Closer Look: Write Cost
- Byzantine algorithm adapted from Castro and Liskov
- Gives fault tolerance, security against
compromise - Fast version uses symmetric cryptography
- Pond uses threshold signatures instead
- Signature proves that f + 1 primary replicas agreed
- Can be shared among secondary replicas
- Can also change primaries w/o changing public key
- Big plus for maintenance costs
- Results good for all time once signed
- Replace faulty/compromised servers transparently
107Closer Look: Write Cost
- Small writes
- Signature dominates
- Threshold sigs. slow!
- Takes 70 ms to sign
- Compare to 5 ms for regular sigs.
(times in milliseconds)
108Closer Look: Write Cost
(run on cluster)
109Closer Look: Write Cost
- Throughput in the wide area
- Wide Area Throughput
- Not limited by signatures
- Not limited by archive
- Not limited by Byzantine process bandwidth use
- Limited by client-to-primary replicas bandwidth
110Closer look: Dissemination Tree
111Closer look: Dissemination Tree
- Self-organizing application-level multicast tree
- Connects all secondary replicas to primary ones
- Shields primary replicas from request load
- Save bandwidth on consistency traffic
- Tree joining heuristic (first-order solution)
- Connect to closest replica using Tapestry
- Take advantage of Tapestry's locality properties
- Should minimize use of long-distance links
- A sort of poor man's CDN
112Performance Results: Stream Benchmark
- Goal: measure efficiency of the dissemination tree
- Multicast tree between secondary replicas
- Ran 500 virtual nodes on PlanetLab
- Primary replicas in SF Bay Area
- Other replicas clustered in 7 largest PlanetLab
sites - Streams writes to all replicas
- One content creator repeatedly appends to one
object - Other replicas read new versions as they arrive
- Measure network resource consumption
113Performance Results: Stream Benchmark
- Dissemination tree uses network resources
efficiently - Most bytes sent across local links as second tier
grows - Acceptable latency increase over broadcast (33)
114Related Work
- Distributed Storage
- Traditional: AFS, CODA, Bayou
- Peer-to-peer: PAST, CFS, Ivy
- Byzantine fault tolerant storage
- Castro-Liskov, COCA, Fleet
- Threshold signatures
- COCA, Fleet
- Erasure codes
- Intermemory, Pasis, Mnemosyne, Free Haven
- Others
- Publius, Freenet, Eternity Service, SUNDR
115Conclusion and Future Work
- OceanStore designed as a global-scale file system
- Design meets primary challenges
- End-to-end encryption for privacy
- Limited trust in any one host for integrity
- Self-organizing and maintaining to increase
usability - Pond prototype functional
- Threshold signatures more expensive than expected (address in future work)
- Generating erasure-encoded fragments is expensive
(address in future work) - Simple dissemination tree fairly effective
- A good base for testing new ideas
116Future Work (cont.)
- Assess and improve storage cost of virtualization
- Make more aspects of the system self-maintaining
- Algorithms for predictive replica placement
- Efficient detection and repair of lost data
- Increased stability and fault-tolerance
- Behavior of Pond / Tapestry when network is
partitioned