Decentralized Data Management Framework for Data Grids - PowerPoint PPT Presentation

1
Decentralized Data Management Framework for Data
Grids
  • Houda Lamehamedi
  • Computer Science Department
  • Rensselaer Polytechnic Institute

2
Data in Large Scale Computing
  • In an increasing number of scientific
    disciplines, large data collections are emerging
    as important community resources
  • data produced and collected at experiment sites
    e.g. high energy physics, climate modeling
  • processed data and analysis results
  • The geographical distribution of the compute and
    storage resources results in complex and
    stringent performance demands → the Data Grid

3
Data Grid Requirements
  • The Data Grid is a Grid where data is treated as
    a first class citizen
  • Scientific collaborations on the Grid generate
    queries involving access to large data sets
  • Efficient execution of these queries requires
  • careful management of large data caches,
  • gigabit data transfers over wide area networks,
  • creation, management, and strategic placement of
    replicas

4
Replication in Data Grids
  • The Globus Toolkit is a standard set of services
    supporting Grids and Grid applications
  • Data management services offered:
  • GridFTP offers secure efficient data transfer in
    Grid environments
  • Replica Catalog allows users to register files
  • Replica Location Service allows users to
    register where data is replicated and locate
    replicas
  • The system only provides the users with tools to
    statically replicate data files

5
  • Example of a data replication scenario in the
    Compact Muon Solenoid (CMS) experiment. All data
    is collected at CERN, the European Center for
    Nuclear Research, located in Geneva, Switzerland

6
Example of a Replica Model used by the Sloan
Digital Sky Survey and at the Laser Interferometer
Gravitational-Wave Observatory (LIGO)
7
Data Management Issues in Data Grids
  • Existing Data Grid frameworks demand extensive
    administrative oversight and management overhead
  • Missing support for dynamic and intermittent
    participation on the Data Grid hinders scalable
    growth of collaborative research
  • Limited support for replication: data is
    statically replicated under user guidelines

8
Problem Statement
  • Ensuring efficient access to huge and widely
    distributed data is a serious challenge to
    network and Grid designers
  • An automated system is needed to maximize the use
    of storage, networking and computing resources

9
Proposed Approach
  • To address these issues we propose a
    decentralized performance-driven adaptive replica
    management middleware that
  • Uses an overlay network to organize Data Grid
    nodes
  • Dynamically adapts replica placement to changing
    user and network needs and behavior
  • Dynamically evaluates data access costs vs.
    performance gains before creating a new replica

10
Mechanisms
  • Adaptive and scalable data management tools that
    enable users to dynamically join and leave the
    grid
  • Replica management services that intelligently
    and transparently place data at strategic
    locations
  • Binding data organization to popular and
    commonly used access patterns in data sharing
    environments and data intensive applications

11
Major Components
  • A theoretical model of data transfer cost and
    access performance
  • Parameterized by the changing computing
    environment
  • Data monitoring tools that feed current values of
    resource consumption to the cost function
  • Dynamic replica management services
  • Offer transparent replication using the cost
    function
  • Manage replica placement and discovery

12
Middleware Design
  • Distributed Data Grid architecture supported by
    the delegation of management and decision making
    to all member Grid nodes
  • Each participating node is responsible for
    managing access to its local and contributed
    resources
  • Layered Architecture
  • Replica Management Layer
  • Resource Access Layer
  • Communication Layer

13
Middleware Architecture
  • Replica Management Layer supports the management
    and transfer of data between Grid nodes and the
    creation of new replicas. Uses input from the
    lower layers to track users' access patterns and
    monitor data popularity
  • Resource Access Layer provides access to
    available resources and monitors their usage and
    availability. Includes a Replica Catalog to
    support transparent access to data at each Grid
    node for local and remote users
  • Communication Layer consists of the data transfer
    and authentication protocols used to ensure
    security, verify users' identities, and maintain
    data integrity. Provides support for the overlay
    network structure
14
Framework
  • Services Offered by the middleware
  • Resource Monitoring service
  • Replica Creation service
  • Replica Location service
  • Resource Allocation service
  • Routing and Connectivity service

15
Resource Monitoring/Allocation
  • The monitoring service is responsible for
  • monitoring resource availability at each Grid
    node
  • collecting statistics about resource usage and
    data access requests
  • The allocation service is responsible for
  • allocating space for newly created replicas,
  • de-allocating space from the least frequently
    and last accessed locally stored replicas
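The de-allocation policy above (evict the least frequently and least recently accessed replicas first) can be sketched as follows. The `Replica` record, its field names, and the tie-breaking order are illustrative assumptions, not the middleware's actual data structures.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    size: int            # bytes occupied on local storage
    access_count: int    # how often the replica was read (assumed counter)
    last_access: float   # timestamp of the most recent read

def free_space(replicas, needed, capacity):
    """Evict least-frequently, then least-recently, accessed replicas
    until `needed` bytes of free space fit within `capacity`."""
    used = sum(r.size for r in replicas)
    # Victim order: fewest accesses first; oldest access breaks ties.
    victims = sorted(replicas, key=lambda r: (r.access_count, r.last_access))
    evicted = []
    for r in victims:
        if capacity - used >= needed:
            break
        replicas.remove(r)
        used -= r.size
        evicted.append(r.name)
    return evicted
```

For example, with 100 units of storage fully occupied, requesting 60 free units evicts the two coldest replicas and leaves the most popular one in place.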

16
Replica Creation Service
  • The service is responsible for creating local
    replicas based on the evaluation of the incurred
    cost of creating a local replica
  • A cost function is used to evaluate the cost of
    creating a local replica vs. the cost of
    transferring data based on the
  • popularity of the data,
  • network resources availability,
  • size of data, and
  • storage space availability

17
Replica Location Service
  • This service is responsible for managing the
    local replica Catalog at each node
  • Each newly created file is registered in the
    catalog
  • Supports data location and discovery
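The per-node catalog described above is essentially a mapping from logical file names to the set of sites holding a copy. The sketch below is a minimal illustration of that idea; the class and method names are assumptions, not the service's real interface.

```python
class ReplicaCatalog:
    """Minimal sketch of a per-node replica catalog: maps a logical
    file name to the set of physical locations holding a replica."""

    def __init__(self):
        self._entries = {}   # logical name -> set of node ids

    def register(self, logical_name, node_id):
        # Each newly created replica is registered under its logical name.
        self._entries.setdefault(logical_name, set()).add(node_id)

    def unregister(self, logical_name, node_id):
        locs = self._entries.get(logical_name, set())
        locs.discard(node_id)
        if not locs:
            self._entries.pop(logical_name, None)

    def locate(self, logical_name):
        # All known replica sites, or an empty set if the file is unknown.
        return set(self._entries.get(logical_name, set()))
```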

18
Data Catalog Management
19
Replica Distribution Topology
  • Our approach is based on using application level
    overlay networks to enable scalable growth of
    Data Grids and support larger numbers of
    participants
  • The overlay network is formed by the set of
    connections between the participating nodes in
    the Data Grid
  • Topology here means the connectivity graph formed
    by the overlay network

20
Node Addition / Data Model Construction
After a node joins, it starts developing a list
of preferred neighbors
(Diagram: request flow and data flow between nodes)
21
Data Model Construction
  • We use a combination of spanning tree and ring
    topologies
  • Grid Node Insertion
  • When joining the grid, a node is added through an
    existing grid node by attaching to it as a child
    node or a sibling
  • Node Removal
  • When a node leaves the tree, it sends a
    notification message to its parent, siblings, and
    children
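The insertion and removal steps above can be sketched for the spanning-tree part of the topology. The class below is a simplified assumption: it models parent/children links only, treats siblings as the parent's other children, and repairs a departure by re-attaching the leaving node's children to its parent (one plausible policy; the slides do not specify the repair).

```python
class GridNode:
    """Sketch of tree membership: each node tracks its parent and
    children; siblings are the parent's other children."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def siblings(self):
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def join_as_child(self, parent):
        # A joining node attaches to an existing grid node as a child.
        self.parent = parent
        parent.children.append(self)

    def leave(self):
        # "Notify" relatives by splicing children onto the parent so
        # the tree stays connected after the departure.
        for c in list(self.children):
            c.parent = self.parent
            if self.parent:
                self.parent.children.append(c)
        if self.parent:
            self.parent.children.remove(self)
        self.children = []
        self.parent = None
```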

22

23
Data Search and Replica Location
  • The search starts at the local data catalog to
    check if the data is stored and available locally
  • If it is not, then the node sends a request
    message to its parent, siblings, and children →
    flooding mechanism
  • The local replica management service chooses the
    best data source
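The flooding lookup described above amounts to a bounded breadth-first search over a node's overlay neighbors (parent, siblings, and children, folded here into one adjacency map). The hop limit `ttl` and the convention that a holder does not forward further are assumptions for this sketch.

```python
from collections import deque

def flood_search(start, neighbors, has_data, ttl=4):
    """Flooding lookup sketch: forward the request to all overlay
    neighbors until a replica is found or `ttl` hops are exhausted.
    Returns every holder discovered, so the replica management
    service can then pick the best data source."""
    found = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if has_data(node):
            found.append(node)
            continue          # a holder answers instead of forwarding
        if hops == ttl:
            continue          # hop budget spent on this branch
        for n in neighbors[node]:
            if n not in seen:
                seen.add(n)
                queue.append((n, hops + 1))
    return found
```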

24

25
Cost Model
  • Total data transfer cost for object i at node v
    is costv,i (?v,i??v,i)size(i)d(v,r)

The incremental data transfer for placing a
replica at v is costi(N,Ri,v)
-?tv,id(v,c(v,Ri)) ?i(?tr,i-?tv,i)size(i)d(v,c(v
,Ri)))
Tv the partition of nodes which root is v ?tv,i
the total read rate of all nodes at partition
Tv ?tv,i the total write rate of the partition
N the set of all nodes in the system, Ri the
replica set of object i, and c(v,Ri) the
replica of object i closest to node v
Adding r to Ri increases or decreases the read
cost of each node in Tv Our cost function
evaluates the above formula at each node vs. the
cost of data transfer and the storage space
available before replicating any data
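The cost model can be evaluated numerically. The sketch below uses plain variables for the read rate λ, write rate μ, file size, and per-unit transfer cost d; the figures in the test are made up, not measured Grid statistics. A negative incremental cost means the read savings inside the subtree outweigh the extra update traffic, so replication pays off.

```python
def transfer_cost(read_rate, write_rate, size, dist):
    """Total transfer cost of object i at node v:
    (lambda_{v,i} + mu_{v,i}) * size(i) * d(v, r)."""
    return (read_rate + write_rate) * size * dist

def incremental_cost(reads_in_subtree, writes_total, writes_in_subtree,
                     size, dist):
    """Cost change from placing a replica at v: local reads stop
    paying the transfer distance, while writes originating outside
    the subtree T_v must now also update the new copy:
    (-lambda^t_{v,i} + (mu^t_{N,i} - mu^t_{v,i})) * size(i) * d(v, c(v,R_i))."""
    return (-reads_in_subtree + (writes_total - writes_in_subtree)) * size * dist
```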
26
Cost Function
  • The transfer cost of file i at node v from node r
    is cost_i = (R_i + a·W_i) · size(i) · d(v,r)
  • R_i is the read rate, W_i the write rate, a a
    write-weighting factor, and d(v,r) the cost of
    transferring a data unit from v to r
  • A replica creation threshold is defined based on
    the cost of transferring data from known replica
    sites
  • The cost function evaluates the accumulated read
    and write costs at each node vs. the threshold
    before creating a replica
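One plausible sketch of this threshold test is below. The parameter names, the write-weight `a`, and the added storage-space check are assumptions for illustration, not the middleware's actual interface.

```python
def should_replicate(read_rate, write_rate, a, size, dist,
                     threshold, free_space):
    """Decide whether to create a local replica: compare the
    accumulated transfer cost (R_i + a*W_i) * size(i) * d(v,r)
    against the replica creation threshold, and check that the
    replica fits in the available storage space."""
    cost = (read_rate + a * write_rate) * size * dist
    return cost > threshold and size <= free_space
```

A file that is read often across an expensive link clears the threshold; an unpopular or oversized file does not.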

27
Simulation Design
  • GridNet is a modular simulator built on top of NS
    (the network simulator) and written in C++
  • NS provides us with the basic Grid specification:
    nodes, links, and messages
  • GridNet introduces application level services
    implemented on top of the existing NS services
    and protocols. It allows us to specify
  • different types of nodes (client, cache, and
    server nodes)
  • node resources (storage capacity, local data
    files)
  • the cost model and its parameters
  • the replication strategy used

28
Simulator Architecture
  • Each Node in GridNet has three major components
  • replica optimizer: computes the number of
    read/write requests and response times, and
    probes for connection bandwidth
  • replica routing table: parent, sibling, and
    child nodes
  • storage element: storage space available and
    files stored

29
Simulation Experiments
  • We used 6 file sets with file sizes ranging from
    100 MB to 1 GB
  • All files were originally located at the server
  • Access patterns
  • Recently accessed files are more likely to be
    accessed again
  • Files recently accessed by a client are more
    likely to be accessed by neighbor clients
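The temporal-locality pattern above (recently accessed files are more likely to be accessed again) can be sketched as a synthetic request generator. The `locality` probability and the single-file memory are assumptions of this sketch, not the workload model the experiments actually used.

```python
import random

def generate_requests(files, n, locality=0.7, seed=0):
    """Synthetic trace sketch: with probability `locality` the client
    re-requests its most recently accessed file; otherwise it picks
    a file uniformly at random."""
    rng = random.Random(seed)
    trace = []
    last = None
    for _ in range(n):
        if last is not None and rng.random() < locality:
            f = last                 # temporal locality: repeat access
        else:
            f = rng.choice(files)    # fresh, uniformly random request
        trace.append(f)
        last = f
    return trace
```

With high `locality`, most requests in the trace repeat the previous one, which is what makes caching and dynamic replication pay off.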

30
Simulation Results
31
Middleware Deployment
  • We used two hierarchical distribution models that
    represent the most popular and commonly used
    models
  • Bottom Up: multiple collection sites
  • Top Down: single collection site
  • Single-collection-site data repositories have
    larger storage capacity
    larger storage capacity

32
Access Patterns
  • Data access requests are based on patterns
    commonly observed in scientific and data-sharing
    environments
  • Files are of similar sizes within an application
  • Spikes are generated by new "interesting" files
  • Users' social organization and interests guide
    the overlay construction
  • Interest-based adaptive clustering of users

33
Top Down Model
34
Bottom UP Model
35
Top Down Experiments Results
36
Bottom Up Experiments Results
37
Bottom Up Experiments Results
38
Access Performance Evaluation
39
Conclusions
  • Cost-guided dynamic replication improves data
    access performance by up to 30% and at least 10%
    compared to static, user-initiated replication
  • The combination of parameter selection for cost
    evaluation and resource availability play a key
    role in influencing the performance of the
    system.
  • Lower storage availability might lead to race
    conditions where popular data compete for storage
    space.
  • The results also show that popular data files
    benefit the most from dynamic replication

40
Contributions
  • A solution that combines and adapts approaches
    previously used in different data-sharing
    environments: the Grid and P2P systems
  • Support for the creation of small to medium scale
    data sharing Grids
  • A formulation of the replica creation and
    placement problem using a mathematical model
  • The use of dynamic and adaptive overlay networks
    and data organization models to connect
    participating nodes in a Data Grid (access
    patterns)
  • Generic Data Grid simulator, GridNet
  • A new data and replica management middleware that
    scales with the number of users and supports
    dynamic and intermittent user participation

41
Future Directions
  • Investigate additional user access patterns and
    overlay structures and study data access
    performance under these settings
  • Study different and new combinations of
    parameters and cost evaluation techniques
  • Deploy the middleware in a larger environment
    with larger applications
  • Identify the relation and effect of data and
    replica management services on other services