Decentralized Data Management Framework for Data Grids - PowerPoint PPT Presentation

1
Decentralized Data Management Framework for Data
Grids
  • Houda Lamehamedi
  • Computer Science Department
  • Rensselaer Polytechnic Institute

2
Data in Large Scale Computing
  • In an increasing number of scientific
    disciplines, large data collections are emerging
    as important community resources
  • data produced and collected at experiment sites
    e.g. high energy physics, climate modeling
  • processed data and analysis results
  • The geographical distribution of the compute and
    storage resources results in complex and
    stringent performance demands → the Data Grid

3
Data Grid Requirements
  • The Data Grid is a Grid where data is treated as
    a first class citizen
  • Scientific collaborations on the Grid generate
    queries involving access to large data sets
  • Efficient execution of these queries requires
  • careful management of large data caches,
  • gigabit data transfers over wide area networks,
  • creation, management, and strategic placement of
    replicas

4
Replication in Data Grids
  • The Globus Toolkit is a standard set of services
    supporting Grids and Grid applications
  • Data management services offered:
  • GridFTP offers secure efficient data transfer in
    Grid environments
  • Replica Catalog allows users to register files
  • Replica Location Service allows users to
    register where data is replicated and locate
    replicas
  • The system only provides the users with tools to
    statically replicate data files

5
  • Example of a data replication scenario in the
    Compact Muon Solenoid (CMS) experiment. All data
    is collected at CERN, the European Center for
    Nuclear Research, located in Geneva, Switzerland

6
Example of a Replica Model used by the Sloan
Digital Sky Survey and at the Laser Interferometer
Gravitational-Wave Observatory (LIGO)
7
Data Management Issues in Data Grids
  • Existing Data Grid frameworks demand extensive
    administrative oversight and management overhead
  • Missing support for dynamic and intermittent
    participation on the Data Grid hinders scalable
    growth of collaborative research
  • Limited support for replication: data is
    statically replicated under user guidelines

8
Problem Statement
  • Ensuring efficient access to huge and widely
    distributed data is a serious challenge to
    network and Grid designers
  • An automated system is needed to maximize the use
    of storage, networking and computing resources

9
Proposed Approach
  • To address these issues we propose a
    decentralized performance-driven adaptive replica
    management middleware that
  • Uses an overlay network to organize Data Grid
    nodes
  • Dynamically adapts replica placement to changing
    user and network needs and behavior
  • Dynamically evaluates data access costs vs.
    performance gains before creating a new replica

10
Mechanisms
  • Adaptive and scalable data management tools that
    enable users to dynamically join and leave the
    grid
  • Replica management services that intelligently
    and transparently place data at strategic
    locations
  • Binding data organization to popular and
    commonly used access patterns in data sharing
    environments and data intensive applications

11
Major Components
  • A theoretical model of data transfer cost and
    access performance
  • Parameterized by the changing computing
    environment
  • Data monitoring tools that feed current values of
    resource consumption to the cost function
  • Dynamic replica management services
  • Offer transparent replication using the cost
    function
  • Manage replica placement and discovery

12
Middleware Design
  • Distributed Data Grid architecture supported by
    the delegation of management and decision making
    to all member Grid nodes
  • Each participating node is responsible for
    managing access to its local and contributed
    resources
  • Layered Architecture
  • Replica Management Layer
  • Resource Access Layer
  • Communication Layer

13
Middleware Architecture
  • Replica Management Layer supports the management
    and transfer of data between Grid nodes and the
    creation of new replicas. Uses input from the
    lower layers to track users' access patterns and
    monitor data popularity
  • Resource Access Layer provides access to
    available resources and monitors their usage and
    availability. Includes a Replica Catalog to
    support transparent access to data at each Grid
    node for local and remote users
  • Communication Layer consists of the data transfer
    and authentication protocols used to ensure
    security, verify users' identities, and maintain
    data integrity. Provides support for the overlay
    network structure
14
Framework
  • Services Offered by the middleware
  • Resource Monitoring service
  • Replica Creation service
  • Replica Location service
  • Resource Allocation service
  • Routing and Connectivity service

15
Resource Monitoring/Allocation
  • The monitoring service is responsible for
  • monitoring resource availability at each Grid
    node
  • collecting statistics about resource usage and
    data access requests
  • The allocation service is responsible for
  • allocating space for newly created replicas,
  • de-allocating space from the least frequently
    and last accessed locally stored replicas
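The de-allocation policy above (evict the least frequently and least recently accessed replicas first) can be sketched as follows. The `Replica` record, its field names, and the tie-breaking order are illustrative assumptions, not the middleware's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    size: int            # bytes occupied on local storage
    access_count: int    # how often the replica was read (assumed counter)
    last_access: float   # timestamp of the most recent read

def free_space(replicas, needed, capacity):
    """Evict least-frequently, then least-recently, accessed replicas
    until `needed` bytes of free space fit within `capacity`."""
    used = sum(r.size for r in replicas)
    # Victim order: fewest accesses first; oldest access breaks ties.
    victims = sorted(replicas, key=lambda r: (r.access_count, r.last_access))
    evicted = []
    for r in victims:
        if capacity - used >= needed:
            break
        replicas.remove(r)
        used -= r.size
        evicted.append(r.name)
    return evicted
```

For example, with 100 units of storage fully occupied, requesting 60 free units evicts the two coldest replicas and leaves the most popular one in place.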

16
Replica Creation Service
  • The service is responsible for creating local
    replicas based on the evaluation of the incurred
    cost of creating a local replica
  • A cost function is used to evaluate the cost of
    creating a local replica vs. the cost of
    transferring data based on the
  • popularity of the data,
  • network resources availability,
  • size of data, and
  • storage space availability

17
Replica Location Service
  • This service is responsible for managing the
    local replica Catalog at each node
  • Each newly created file is registered in the
    catalog
  • Supports data location and discovery
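The per-node catalog described above is essentially a mapping from logical file names to the set of sites holding a copy. The sketch below is a minimal illustration of that idea; the class and method names are assumptions, not the service's real interface.

```python
class ReplicaCatalog:
    """Minimal sketch of a per-node replica catalog: maps a logical
    file name to the set of physical locations holding a replica."""

    def __init__(self):
        self._entries = {}   # logical name -> set of node ids

    def register(self, logical_name, node_id):
        # Each newly created replica is registered under its logical name.
        self._entries.setdefault(logical_name, set()).add(node_id)

    def unregister(self, logical_name, node_id):
        locs = self._entries.get(logical_name, set())
        locs.discard(node_id)
        if not locs:
            self._entries.pop(logical_name, None)

    def locate(self, logical_name):
        # All known replica sites, or an empty set if the file is unknown.
        return set(self._entries.get(logical_name, set()))
```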

18
Data Catalog Management
19
Replica Distribution Topology
  • Our approach is based on using application level
    overlay networks to enable scalable growth of
    Data Grids and support larger numbers of
    participants
  • The overlay network is formed by the set of
    connections between the participating nodes in
    the Data Grid
  • Topology here means the connectivity graph formed
    by the overlay network

20
Node Addition / Data Model Construction
After a node joins, it starts developing a list
of preferred neighbors
(Diagram: request flow and data flow between nodes)
21
Data Model Construction
  • We use a combination of spanning tree and ring
    topologies
  • Grid Node Insertion
  • When joining the grid, a node is added through an
    existing grid node by attaching to it as a child
    node or a sibling
  • Node Removal
  • When a node leaves the tree, it sends a
    notification message to its parent, siblings, and
    children
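The insertion and removal steps above can be sketched for the spanning-tree part of the topology. The class below is a simplified assumption: it models parent/children links only, treats siblings as the parent's other children, and repairs a departure by re-attaching the leaving node's children to its parent (one plausible policy; the slides do not specify the repair).

```python
class GridNode:
    """Sketch of tree membership: each node tracks its parent and
    children; siblings are the parent's other children."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def siblings(self):
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def join_as_child(self, parent):
        # A joining node attaches to an existing grid node as a child.
        self.parent = parent
        parent.children.append(self)

    def leave(self):
        # "Notify" relatives by splicing children onto the parent so
        # the tree stays connected after the departure.
        for c in list(self.children):
            c.parent = self.parent
            if self.parent:
                self.parent.children.append(c)
        if self.parent:
            self.parent.children.remove(self)
        self.children = []
        self.parent = None
```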

22

23
Data Search and Replica Location
  • The search starts at the local data catalog to
    check if the data is stored and available locally
  • If it is not, then the node sends a request
    message to its parent, siblings, and children →
    flooding mechanism
  • The local replica management service chooses the
    best data source
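The flooding lookup described above amounts to a bounded breadth-first search over a node's overlay neighbors (parent, siblings, and children, folded here into one adjacency map). The hop limit `ttl` and the convention that a holder does not forward further are assumptions for this sketch.

```python
from collections import deque

def flood_search(start, neighbors, has_data, ttl=4):
    """Flooding lookup sketch: forward the request to all overlay
    neighbors until a replica is found or `ttl` hops are exhausted.
    Returns every holder discovered, so the replica management
    service can then pick the best data source."""
    found = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if has_data(node):
            found.append(node)
            continue          # a holder answers instead of forwarding
        if hops == ttl:
            continue          # hop budget spent on this branch
        for n in neighbors[node]:
            if n not in seen:
                seen.add(n)
                queue.append((n, hops + 1))
    return found
```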

24

25
Cost Model
  • Total data transfer cost for object i at node v
    is costv,i (?v,i??v,i)size(i)d(v,r)

The incremental data transfer for placing a
replica at v is costi(N,Ri,v)
-?tv,id(v,c(v,Ri)) ?i(?tr,i-?tv,i)size(i)d(v,c(v
,Ri)))
Tv the partition of nodes which root is v ?tv,i
the total read rate of all nodes at partition
Tv ?tv,i the total write rate of the partition
N the set of all nodes in the system, Ri the
replica set of object i, and c(v,Ri) the
replica of object i closest to node v
Adding r to Ri increases or decreases the read
cost of each node in Tv Our cost function
evaluates the above formula at each node vs. the
cost of data transfer and the storage space
available before replicating any data
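The cost model can be evaluated numerically. The sketch below uses plain variables for the read rate λ, write rate μ, file size, and per-unit transfer cost d; the figures in the test are made up, not measured Grid statistics. A negative incremental cost means the read savings inside the subtree outweigh the extra update traffic, so replication pays off.

```python
def transfer_cost(read_rate, write_rate, size, dist):
    """Total transfer cost of object i at node v:
    (lambda_{v,i} + mu_{v,i}) * size(i) * d(v, r)."""
    return (read_rate + write_rate) * size * dist

def incremental_cost(reads_in_subtree, writes_total, writes_in_subtree,
                     size, dist):
    """Cost change from placing a replica at v: local reads stop
    paying the transfer distance, while writes originating outside
    the subtree T_v must now also update the new copy:
    (-lambda^t_{v,i} + (mu^t_{N,i} - mu^t_{v,i})) * size(i) * d(v, c(v,R_i))."""
    return (-reads_in_subtree + (writes_total - writes_in_subtree)) * size * dist
```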
26
Cost Function
  • The transfer cost of file i at node v from node r
    is cost_i = (R_i + a·W_i) · size(i) · d(v,r)
  • R_i is the read rate, W_i the write rate, a a
    write-weighting factor, and d(v,r) the cost of
    transferring a data unit from v to r
  • A replica creation threshold is defined based on
    the cost of transferring data from known replica
    sites
  • The cost function evaluates the accumulated read
    and write costs at each node vs. the threshold
    before creating a replica
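One plausible sketch of this threshold test is below. The parameter names, the write-weight `a`, and the added storage-space check are assumptions for illustration, not the middleware's actual interface.

```python
def should_replicate(read_rate, write_rate, a, size, dist,
                     threshold, free_space):
    """Decide whether to create a local replica: compare the
    accumulated transfer cost (R_i + a*W_i) * size(i) * d(v,r)
    against the replica creation threshold, and check that the
    replica fits in the available storage space."""
    cost = (read_rate + a * write_rate) * size * dist
    return cost > threshold and size <= free_space
```

A file that is read often across an expensive link clears the threshold; an unpopular or oversized file does not.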

27
Simulation Design
  • GridNet is a modular simulator built on top of NS
    (the network simulator) and written in C++
  • NS provides us with the basic Grid specification:
    nodes, links, and messages
  • GridNet introduces application level services
    implemented on top of the existing NS services
    and protocols. It allows us to specify
  • different types of nodes (client, cache, and
    server nodes)
  • node resources (storage capacity, local data
    files)
  • the cost model and its parameters
  • the replication strategy used

28
Simulator Architecture
  • Each Node in GridNet has three major components
  • replica optimizer: computes the number of
    read/write requests and response times, and
    probes for connection bandwidth
  • replica routing table: parent, sibling, and
    child nodes
  • storage element: storage space available and
    files stored

29
Simulation Experiments
  • We used 6 file sets with file sizes ranging from
    100 MB to 1 GB
  • All files were originally located at the server
  • Access patterns
  • Recently accessed files are more likely to be
    accessed again
  • Files recently accessed by a client are more
    likely to be accessed by neighbor clients
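The temporal-locality pattern above (recently accessed files are more likely to be accessed again) can be sketched as a synthetic request generator. The `locality` probability and the single-file memory are assumptions of this sketch, not the workload model the experiments actually used.

```python
import random

def generate_requests(files, n, locality=0.7, seed=0):
    """Synthetic trace sketch: with probability `locality` the client
    re-requests its most recently accessed file; otherwise it picks
    a file uniformly at random."""
    rng = random.Random(seed)
    trace = []
    last = None
    for _ in range(n):
        if last is not None and rng.random() < locality:
            f = last                 # temporal locality: repeat access
        else:
            f = rng.choice(files)    # fresh, uniformly random request
        trace.append(f)
        last = f
    return trace
```

With high `locality`, most requests in the trace repeat the previous one, which is what makes caching and dynamic replication pay off.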

30
Simulation Results
31
Middleware Deployment
  • We used two hierarchical distribution models that
    represent the most popular and commonly used
    models
  • Bottom Up: multiple collection sites
  • Top Down: single collection site
  • Single-collection-site data repositories have
    larger storage capacity
    larger storage capacity

32
Access Patterns
  • Data access requests are based on patterns
    commonly observed in scientific and data-sharing
    environments
  • Files are of similar sizes within an application
  • Spikes are generated by new "interesting" files
  • Users' social organization and interests guide
    the overlay construction
  • Interest-based adaptive clustering of users

33
Top Down Model
34
Bottom UP Model
35
Top Down Experiments Results
36
Bottom Up Experiments Results
37
Bottom Up Experiments Results
38
Access Performance Evaluation
39
Conclusions
  • Cost-guided dynamic replication improves data
    access performance by up to 30% and at least 10%
    compared to static, user-initiated replication
  • The combination of parameter selection for cost
    evaluation and resource availability play a key
    role in influencing the performance of the
    system.
  • Lower storage availability might lead to race
    conditions where popular data compete for storage
    space.
  • The results also show that popular data files
    benefit the most from dynamic replication

40
Contributions
  • A solution that combines and adapts approaches
    previously used in different data-sharing
    environments: the Grid and P2P systems
  • Support for the creation of small to medium scale
    data sharing Grids
  • A formulation of the replica creation and
    placement problem using a mathematical model
  • The use of dynamic and adaptive overlay networks
    and data organization models to connect
    participating nodes in a Data Grid (access
    patterns)
  • Generic Data Grid simulator, GridNet
  • A new data and replica management middleware that
    scales with the number of users and supports
    dynamic and intermittent user participation

41
Future Directions
  • Investigate additional user access patterns and
    overlay structures and study data access
    performance under these settings
  • Study different and new combinations of
    parameters and cost evaluation techniques
  • Deploy the middleware in a larger environment
    with larger applications
  • Identify the relation and effect of data and
    replica management services on other services