Title: Decentralized Data Management Framework for Data Grids
1Decentralized Data Management Framework for Data
Grids
- Houda Lamehamedi
- Computer Science Department
- Rensselaer Polytechnic Institute
2Data in Large Scale Computing
- In an increasing number of scientific disciplines, large data collections are emerging as important community resources
  - data produced and collected at experiment sites, e.g. high energy physics, climate modeling
  - processed data and analysis results
- The geographical distribution of compute and storage resources results in complex and stringent performance demands, which the Data Grid is meant to address
3Data Grid Requirements
- The Data Grid is a Grid where data is treated as a first-class citizen
- Scientific collaborations on the Grid generate queries involving access to large data sets
- Efficient execution of these queries requires
  - careful management of large data caches,
  - gigabit-rate data transfer over wide area networks,
  - creation, management, and strategic placement of replicas
4Replication in Data Grids
- The Globus Toolkit is a standard set of services supporting Grids and Grid applications
- Data management services offered
  - GridFTP offers secure, efficient data transfer in Grid environments
  - Replica Catalog allows users to register files
  - Replica Location Service allows users to register where data is replicated and to locate replicas
- The system only provides users with tools to statically replicate data files
5- Example of a data replication scenario in the Compact Muon Solenoid (CMS) experiment. All data is collected at CERN, the European Organization for Nuclear Research, located in Geneva, Switzerland
6Example of a Replica Model used by the Sloan Digital Sky Survey (SDSS) and at the Laser Interferometer Gravitational-Wave Observatory (LIGO)
7Data Management Issues in Data Grids
- Existing Data Grid frameworks demand extensive administrative oversight and management overhead
- Missing support for dynamic and intermittent participation on the Data Grid hinders scalable growth of collaborative research
- Limited support for replication: data is statically replicated under user guidelines
8Problem Statement
- Ensuring efficient access to huge and widely distributed data is a serious challenge to network and Grid designers
- An automated system is needed to maximize the use of storage, networking, and computing resources
9Proposed Approach
- To address these issues we propose a decentralized, performance-driven, adaptive replica management middleware that
  - uses an overlay network to organize Data Grid nodes
  - dynamically adapts replica placement to changing user and network needs and behavior
  - dynamically evaluates data access costs vs. performance gains before creating a new replica
10Mechanisms
- Adaptive and scalable data management tools that enable users to dynamically join and leave the grid
- Replica management services that intelligently and transparently place data at strategic locations
- Binding data organization to popular and commonly used access patterns in data sharing environments and data-intensive applications
11Major Components
- A theoretical model of data transfer cost and access performance
  - Parameterized by the changing computing environment
- Data monitoring tools that feed current values of resource consumption to the cost function
- Dynamic replica management services
  - Offer transparent replication using the cost function
  - Manage replica placement and discovery
12Middleware Design
- Distributed Data Grid architecture supported by the delegation of management and decision making to all member Grid nodes
- Each participating node is responsible for managing access to its local and contributed resources
- Layered architecture
  - Replica Management Layer
  - Resource Access Layer
  - Communication Layer
13Middleware Architecture
Replica Management Layer: supports the management and transfer of data between Grid nodes and the creation of new replicas. Uses input from lower layers to track users' access patterns and monitor data popularity
Resource Access Layer: provides access to available resources and monitors their usage and availability. Includes a Replica Catalog to support transparent access to data at each Grid node for local and remote users
Communication Layer: consists of data transfer and authentication protocols used to ensure security, verify users' identities, and maintain data integrity. Also provides support for the overlay network structure
14Framework
- Services Offered by the middleware
- Resource Monitoring service
- Replica Creation service
- Replica Location service
- Resource Allocation service
- Routing and Connectivity service
15Resource Monitoring/Allocation
- The monitoring service is responsible for
  - monitoring resource availability at each Grid node
  - collecting statistics about resource usage and data access requests
- The allocation service is responsible for
  - allocating space for newly created replicas,
  - de-allocating space from the least frequently and least recently accessed locally stored replicas
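The de-allocation rule above can be sketched as follows. This is a minimal illustration, assuming the rule means "fewest accesses first, with ties broken by the oldest last access"; the `ReplicaStats` record and the `free_space` function are hypothetical names, not part of the middleware.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStats:
    name: str           # local file name of the replica
    size: int           # storage occupied by the replica
    access_count: int   # number of local requests seen so far
    last_access: float  # timestamp of the most recent request


def free_space(replicas, needed):
    """Choose replicas to evict until at least `needed` units are freed.

    Assumed policy: least frequently accessed first, then least
    recently accessed. Returns the names of the chosen victims.
    """
    victims, freed = [], 0
    ordered = sorted(replicas, key=lambda r: (r.access_count, r.last_access))
    for replica in ordered:
        if freed >= needed:
            break
        victims.append(replica.name)
        freed += replica.size
    if freed < needed:
        raise ValueError("not enough replica space can be reclaimed")
    return victims
```

Sorting on the pair (access count, last access time) keeps the policy in one place, so a different ranking can be swapped in by changing only the sort key.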
16Replica Creation Service
- The service is responsible for creating local replicas based on the evaluation of the incurred cost of creating a local replica
- A cost function is used to evaluate the cost of creating a local replica vs. the cost of transferring data, based on the
  - popularity of the data,
  - network resource availability,
  - size of the data, and
  - storage space availability
17Replica Location Service
- This service is responsible for managing the local replica catalog at each node
- Each newly created file is registered in the catalog
- Supports data location and discovery
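A per-node catalog of this kind can be sketched as a mapping from a logical file name to the set of places where copies are known to exist. The class and method names below are hypothetical and only illustrate registration and lookup; they are not the middleware's actual interface.

```python
class LocalReplicaCatalog:
    """Hypothetical per-node catalog: logical file name -> known locations."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_name, location):
        # Called whenever a new file or replica is created.
        self._entries.setdefault(logical_name, set()).add(location)

    def unregister(self, logical_name, location):
        # Called when a replica is de-allocated; drops empty entries.
        locations = self._entries.get(logical_name, set())
        locations.discard(location)
        if not locations:
            self._entries.pop(logical_name, None)

    def locate(self, logical_name):
        # Returns a sorted list of known locations, empty if unknown.
        return sorted(self._entries.get(logical_name, ()))
```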
18Data Catalog Management
19Replica Distribution Topology
- Our approach is based on using application-level overlay networks to enable scalable growth of Data Grids and support larger numbers of participants
- The overlay network is formed by the set of connections between the participating nodes in the Data Grid
- The topology is the connectivity graph formed by the overlay network
20Node Addition / Data Model Construction
After a node joins, it starts developing a list of preferred neighbors
(Figure legend: request flow, data flow)
21Data Model Construction
- We use a combination of spanning tree and ring topologies
- Grid node insertion
  - When joining the grid, a node is added through an existing grid node by attaching to it as a child node or a sibling
- Node removal
  - When a node leaves the tree, it sends a notification message to its parent, siblings, and children
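The insertion and removal steps can be sketched with a small tree structure. This is an illustration under stated assumptions: sibling links are derived from the shared parent and stand in for the ring connections, and re-attaching orphaned children to the departing node's parent is an assumption, since the slides only specify the notification. `OverlayNode` and its methods are hypothetical names.

```python
class OverlayNode:
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def siblings(self):
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def neighbors(self):
        # Parent, siblings, and children: the nodes notified on departure.
        parent = [self.parent] if self.parent is not None else []
        return parent + self.siblings() + list(self.children)

    def join_as_child(self, contact):
        self.parent = contact
        contact.children.append(self)

    def join_as_sibling(self, contact):
        if contact.parent is None:
            raise ValueError("the root has no parent to share")
        self.join_as_child(contact.parent)

    def leave(self):
        """Detach from the tree and return the names of notified nodes."""
        notified = self.neighbors()
        if self.parent is not None:
            self.parent.children.remove(self)
        for child in self.children:
            # Assumption: orphans are adopted by the former parent.
            child.parent = self.parent
            if self.parent is not None:
                self.parent.children.append(child)
        self.parent, self.children = None, []
        return [n.name for n in notified]
```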
22 23Data Search and Replica Location
- The search starts at the local data catalog to check if the data is stored and available locally
- If it is not, then the node sends a request message to its parent, siblings, and children (a flooding mechanism)
- The local replica management service chooses the best data source
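The search procedure can be sketched as a breadth-first flood over the overlay links. The sketch assumes that neighbors forward the request up to a hop limit and that the "best" source is the holder reached in the fewest hops; both the `ttl` parameter and that selection rule are assumptions, and `locate` is a hypothetical name.

```python
from collections import deque


def locate(start, filename, neighbors, catalogs, ttl=3):
    """Find the nearest node holding `filename`.

    neighbors: dict mapping a node to its parent, siblings, and children
    catalogs:  dict mapping a node to the set of files it stores
    Returns (node, hops), or None if no holder is within `ttl` hops.
    """
    # Step 1: check the local catalog first.
    if filename in catalogs.get(start, set()):
        return start, 0
    # Step 2: flood the request outward, one hop at a time.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == ttl:
            continue  # hop limit reached, do not forward further
        for peer in neighbors.get(node, []):
            if peer in seen:
                continue
            seen.add(peer)
            if filename in catalogs.get(peer, set()):
                return peer, hops + 1
            queue.append((peer, hops + 1))
    return None
```

The `seen` set keeps a node from answering the same request twice, which is the usual way to stop a flood from looping through sibling links.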
24 25Cost Model
- Total data transfer cost for object i at node v is
  $cost_{v,i} = (\lambda_{v,i} + \alpha\,\mu_{v,i}) \cdot size(i) \cdot d(v,r)$
- The incremental data transfer cost for placing a replica at v is
  $cost_i(N, R_i, v) = -\lambda^{t}_{v,i}\, d(v, c(v,R_i)) + \alpha\,(\mu^{t}_{r,i} - \mu^{t}_{v,i}) \cdot size(i) \cdot d(v, c(v,R_i))$
- $T_v$ is the partition of nodes whose root is v, $\lambda^{t}_{v,i}$ the total read rate of all nodes in partition $T_v$, and $\mu^{t}_{v,i}$ the total write rate of the partition
- $N$ is the set of all nodes in the system, $R_i$ the replica set of object i, and $c(v,R_i)$ the replica of object i closest to node v
- Adding r to $R_i$ increases or decreases the read cost of each node in $T_v$
- Our cost function evaluates the above formula at each node vs. the cost of data transfer and the storage space available before replicating any data
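The two building blocks of the model, the closest replica c(v, Ri) and the per-node transfer cost, can be sketched as below. The sketch assumes the transfer cost is the read rate plus a weighted write rate, multiplied by the object size and by the unit transfer cost d; the weight `alpha`, its default value, and both function names are assumptions made for illustration.

```python
def closest_replica(v, replica_set, d):
    """Return the member of `replica_set` with the lowest d(v, r)."""
    return min(replica_set, key=lambda r: d(v, r))


def transfer_cost(read_rate, write_rate, size, unit_cost, alpha=1.0):
    """Assumed form: (reads + alpha * writes) * size * unit transfer cost."""
    return (read_rate + alpha * write_rate) * size * unit_cost


# Usage with a hypothetical symmetric distance table.
distances = {
    frozenset(("v", "cern")): 5.0,
    frozenset(("v", "tier1")): 2.0,
}


def d(a, b):
    return distances[frozenset((a, b))]


source = closest_replica("v", {"cern", "tier1"}, d)
cost = transfer_cost(read_rate=4, write_rate=1, size=100, unit_cost=d("v", source))
```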
26Cost Function
- The transfer cost of file i at node v from node r is
  $cost_i = (R_i + \alpha W_i) \cdot size(i) / d(v,r)$
- $R_i$ is the read rate, $W_i$ the write rate, and $d(v,r)$ the cost of transferring a data unit from v to r
- A replica creation threshold is defined based on the cost of transferring data from known replica sites
- The cost function evaluates the accumulated read and write costs at each node vs. the threshold before creating a replica
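The threshold test can be sketched as a running total that triggers replication once it exceeds the threshold. The sketch assumes the threshold is the one-time cost of copying the file from the cheapest known replica site, and it accepts already-computed access costs so that it does not depend on a particular cost formula. `threshold_from_sites` and `ReplicaDecision` are hypothetical names.

```python
def threshold_from_sites(size, unit_costs):
    """Assumed threshold: file size times the cheapest known unit transfer cost."""
    return size * min(unit_costs)


class ReplicaDecision:
    """Accumulates access costs for one file at one node."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.accumulated = 0.0

    def record(self, access_cost):
        # Add the cost of one remote read or write and report whether
        # the accumulated cost now justifies creating a local replica.
        self.accumulated += access_cost
        return self.accumulated > self.threshold
```

With a threshold of 200, three remote accesses costing 90 each would trigger replication on the third access, when the total reaches 270.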
27Simulation Design
- GridNet is a modular simulator built on top of NS (the network simulator) and written in C++
- NS provides us with the basic grid specification: nodes, links, and messages
- GridNet introduces application-level services implemented on top of the existing NS services and protocols. It allows us to specify
  - different types of nodes (client, cache, and server nodes)
  - node resources (storage capacity, local data files)
  - the cost model and its parameters
  - the replication strategy used
28Simulator Architecture
- Each node in GridNet has three major components
  - replica optimizer: computes the number of read/write requests and the response time, and probes for connection bandwidth
  - replica routing table: parent, sibling, and child nodes
  - storage element: storage space available and files stored
29Simulation Experiments
- We used 6 file sets with file sizes ranging from 100Mb to 1Gb
- All files were originally located at the server
- Access patterns
  - Recently accessed files are more likely to be accessed again
  - Files recently accessed by a client are more likely to be accessed by neighbor clients
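The first access pattern, where recently accessed files are more likely to be requested again, can be sketched with a simple trace generator. This is not the GridNet workload generator; the `locality` probability, the `window` of remembered requests, and the function name are all assumptions chosen for illustration.

```python
import random
from collections import deque


def generate_trace(files, n_requests, locality=0.7, window=5, seed=None):
    """Generate a request trace with temporal locality.

    With probability `locality` the next request repeats one of the last
    `window` requests; otherwise it is drawn uniformly from `files`.
    """
    rng = random.Random(seed)
    recent = deque(maxlen=window)
    trace = []
    for _ in range(n_requests):
        if recent and rng.random() < locality:
            choice = rng.choice(list(recent))
        else:
            choice = rng.choice(files)
        recent.append(choice)
        trace.append(choice)
    return trace
```

Sharing one `recent` window between neighboring clients would be one way to model the second pattern, where a client's recent files become more likely at its neighbors.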
30Simulation Results
31Middleware Deployment
- We used two hierarchical distribution models that represent the most popular and commonly used models
  - Bottom Up: multiple collection sites
  - Top Down: single collection site
- Single collection site data repositories have larger storage capacity
32Access Patterns
- Data access requests are based on patterns commonly observed in scientific and data-sharing environments
  - Files are of similar sizes within an application
  - Spikes are generated by new interesting files
- Users' social organization and interests guide the overlay construction
  - Interest-based adaptive clustering of users
33Top Down Model
34Bottom Up Model
35Top Down Experiments Results
36Bottom Up Experiments Results
37Bottom Up Experiments Results
38Access Performance Evaluation
39Conclusions
- Cost-guided dynamic replication improves data access performance by up to 30% and by at least 10% compared to static, user-initiated replication
- The combination of parameter selection for cost evaluation and resource availability plays a key role in influencing the performance of the system
- Lower storage availability might lead to race conditions where popular data compete for storage space
- The results also show that popular data files benefit the most from dynamic replication
40Contributions
- A solution that combines and adapts approaches previously used in different data sharing environments: the Grid and P2P systems
- Support for the creation of small to medium scale data sharing Grids
- A formulation of the replica creation and placement problem using a mathematical model
- The use of dynamic and adaptive overlay networks and data organization models to connect participating nodes in a Data Grid (access patterns)
- A generic Data Grid simulator, GridNet
- A new data and replica management middleware that scales with the number of users and supports dynamic and intermittent user participation
41Future Directions
- Investigating additional user access patterns and overlay structures, and studying data access performance under these settings
- Studying new combinations of parameters and cost evaluation techniques
- Deploying the middleware in a larger environment with larger applications
- Identifying the relation and effect of data and replica management services on other services