An Effective Disk Caching Algorithm in Data Grid - PowerPoint PPT Presentation

1
An Effective Disk Caching Algorithm in Data Grid
Song Jiang and Xiaodong Zhang, College of William
and Mary, USA
  • Why Disk Caching in Data Grids?
  • Loading data from a Mass Storage System (MSS)
    over a LAN incurs long latency (up to several
    minutes)
  • File transfers for a request over wide-area
    networks (WAN) take a very long time to complete
    (up to a few hours)
  • A researcher's workstation, or even the local
    computer center, may not be able to keep all the
    required datasets for as long as they are needed

2
Disk Caching in Storage Resource Managers (SRM)
[Diagram: HPSS, HRM, Wide Area Network, two DRMs, and their Clients]
HPSS: High Performance Storage System
HRM: Hierarchical Resource Manager
DRM: Disk Resource Manager
3
  • A General Replacement Utility Function based on
    Locality, Sizes, and Transfer Cost

For a file i at time t, $F_i(t) = L_i(t) \cdot C_i / S_i$,
where $L_i(t)$ denotes its locality strength, $S_i$
denotes the file size, and $C_i$ denotes its retrieval
cost.
Locality estimation of files is the most critical
factor determining the hit ratio of disk caching. In
a data grid, replacement is managed by middleware,
so the constraint on maintenance cost can be more
relaxed than in an OS.
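As a minimal sketch of how the general utility function ranks files for replacement, the snippet below evaluates locality × cost / size for three files and picks the lowest-valued one as the victim. The file names and all numbers are hypothetical, and the locality strength is simply given as an input here; how to estimate it is the subject of the following slides.

```python
def utility(locality: float, cost: float, size: float) -> float:
    """General utility F_i(t) = L_i(t) * C_i / S_i for a single file."""
    if size <= 0:
        raise ValueError("file size must be positive")
    return locality * cost / size

# Hypothetical files: (locality strength, retrieval cost in seconds, size in GB)
files = {
    "small_hot": (0.8, 60.0, 1.0),
    "large_hot": (0.8, 60.0, 10.0),
    "large_cold": (0.1, 60.0, 10.0),
}

scores = {name: utility(*attrs) for name, attrs in files.items()}

# The file with the smallest utility is the first replacement candidate.
victim = min(scores, key=scores.get)
print(victim, scores)
```

With equal retrieval cost, a large file needs proportionally stronger locality than a small one to be worth keeping.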
4
  • Existing Replacement Algorithms and Their Limits
  • (1) LRU Based Only
  • - Carries all the weaknesses of LRU
  • No size and transfer cost consideration
  • Easy to implement and low maintenance
    cost.
  • (2) LRU Based Locality Estimation
  • - Carries all the weaknesses of LRU.
  • Easy to implement and low maintenance
    cost.
  • (3) Frequency Based Estimation
  • Based on the number of times an object in the
    file has been accessed since it was fetched.

5
  • Existing Replacement Algorithms and Their Limits
    (Continued)
  • (4) Greedy-Dual-Size (USENIX Internet Tech97)
  • For a time interval, the locality estimator is
    access frequency / number of missed objects
  • - Shares the same weaknesses as frequency-based
    estimation.
  • Easy to implement.
  • (5) Least Cost Beneficial based on K backward
    References (SC02)
  • For a time interval, locality estimator is
  • number of most recent references total
    references
  • - The estimator is time-interval dependent.
  • Easy to implement and low maintenance
    cost.

6
  • Drawbacks in the Existing Locality Estimation
    Methods
  • (1) Recency Based
  • Locality of large file accesses in Data Grids
    is weaker than that of I/O block and Web file
    caching
  • Hard to deal with file requests that have weak
    locality
  • Example: Greedy Dual-Size (GDS)
  • (2) Frequency Based
  • Pollution problem
  • Examples: Greedy Dual-Size with Frequency
    (GDSF), Hybrid, Lowest Relative Value (LRV)
  • (3) Re-use Density Based
  • Overcomes the drawbacks of the previous methods
  • Could be irrelevant to locality because density
    is computed over wall clock time
  • Example: Least Cost Beneficial Based on the K
    backward References (LCB-K)
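The wall-clock concern with re-use density can be illustrated with a small sketch. The estimator below is a simplified stand-in (recent reference count divided by elapsed wall-clock time), not the exact LCB-K formula; the function name, the parameter `k`, and the timestamps are all assumptions made for illustration.

```python
def reuse_density(ref_times, now, k=2):
    """Simplified density: number of the last k references divided by
    the wall-clock time elapsed since the oldest of them."""
    recent = ref_times[-k:]
    return len(recent) / (now - recent[0])

# The same two references to a file, in the same position within the
# request stream. In the second case the system sat idle for an hour
# before the estimate is taken, so no other file was requested either.
busy = reuse_density([10.0, 20.0], now=30.0)
idle = reuse_density([10.0, 20.0], now=3630.0)

print(busy, idle)
```

The order of requests and the cache space demanded are identical in both cases, yet the estimate drops by more than two orders of magnitude only because wall-clock time passed.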

7
  • Our Principle on Disk Replacement
  • Only relevant history information is used in
    locality estimation: the order of file requests
    and cache size demands
  • Total cache size is used with the locality
    estimation to answer the question: does a file
    have enough locality that, if it is cached with
    its specific size, it could be hit on its next
    reference before it is replaced from the cache?

Our Utility Function
Caching time of a file measures how much cache is
consumed since the last reference to the file. A
caching time is maintained for a file for a certain
period of time even after it is replaced, so that the
utility of its next access can be assessed more
accurately.
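One way to read the caching-time definition is as the total size of the distinct files referenced since the last reference to a given file. The sketch below follows that reading; it is an assumption, not the authors' implementation. The class name `CachingTimeStack` is hypothetical, and for simplicity a referenced entry always moves to the stack top.

```python
from collections import OrderedDict

class CachingTimeStack:
    """Tracks files in order of last reference (oldest first)."""

    def __init__(self):
        self._sizes = OrderedDict()  # file name -> size

    def caching_time(self, name):
        """Total size of distinct files referenced after `name`,
        or None if `name` has no recorded history."""
        if name not in self._sizes:
            return None
        total = 0
        for other in reversed(self._sizes):  # walk from the stack top down
            if other == name:
                return total
            total += self._sizes[other]

    def reference(self, name, size):
        """Record a reference; return the caching time seen just before it."""
        observed = self.caching_time(name)
        self._sizes[name] = size
        self._sizes.move_to_end(name)  # simplification: always move to top
        return observed

stack = CachingTimeStack()
for name, size in [("a", 5), ("b", 3), ("c", 2)]:
    stack.reference(name, size)

ct_a = stack.reference("a", 5)  # b and c were referenced since a
ct_b = stack.reference("b", 3)  # c and a were referenced since b
print(ct_a, ct_b)
```

Because the measure counts cache space rather than seconds, an idle period with no requests leaves every caching time unchanged.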
  • Least Value Based on Caching Time (LVCT)
  • (1) If file f is fetched to the cache, push its
    entry to the stack top and set its caching time
    to 0.
  • (2) If file f is hit in the cache, advance the
    entry up to Size(f) positions.
  • (3) If the access to file f is a miss:
  • Select a set of resident files whose utility
    values are the smallest among resident files
  • If the utility value of file f (mainly
    determined by its file size and transfer cost)
    is smaller than that of any of the files in the
    set, f is not cached.
  • Otherwise, f is cached
  • The caching time stack is updated accordingly.
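The miss-handling step (3) can be sketched as an admission decision. This is a minimal illustration under stated assumptions: the candidate set is taken to be the lowest-utility resident files needed to free enough space, "smaller than any of the files in the set" is read as smaller than every candidate, and utility values are supplied by the caller. The function name `handle_miss` and its parameters are hypothetical.

```python
def handle_miss(cache, capacity, name, size, value, utility_of):
    """cache: dict of resident file name -> size (modified in place).
    value: utility of the missed file; utility_of: utilities of residents.
    Returns the list of evicted files, or None if the file is not cached."""
    if size > capacity:
        return None
    free = capacity - sum(cache.values())

    # Candidate set: lowest-utility residents until the new file would fit.
    victims = []
    for resident in sorted(cache, key=lambda n: utility_of[n]):
        if free >= size:
            break
        victims.append(resident)
        free += cache[resident]

    # Assumed reading: reject the file if it is worth less than every candidate.
    if victims and all(value < utility_of[v] for v in victims):
        return None

    for v in victims:
        del cache[v]
    cache[name] = size
    return victims

cache = {"a": 4, "b": 4}
utilities = {"a": 1.0, "b": 5.0}

rejected = handle_miss(cache, 10, "c", 4, 0.5, utilities)  # worth less than "a"
evicted = handle_miss(cache, 10, "d", 4, 3.0, utilities)   # worth more than "a"
print(rejected, evicted, cache)
```

Declining to cache a low-utility file keeps it from displacing residents that are more likely to be hit.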

8
  • Trace Description
  • Collected in an MSS, JASMine, at the Thomas
    Jefferson National Accelerator Facility (JLab)
  • Represents the file access activities over a
    period of about 6 months
  • 207,331 files are accessed in the trace, and
    their total size is 144.9 TBytes
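For a sense of scale, the two trace figures imply an average file size. The short calculation below assumes 1 TByte = 10^12 bytes, which the slide does not specify, so the result is approximate.

```python
files_in_trace = 207_331
total_bytes = 144.9e12  # assumption: 1 TByte = 10**12 bytes

# Average file size in megabytes (10**6 bytes).
avg_mb = total_bytes / files_in_trace / 1e6
print(f"average file size: about {avg_mb:.0f} MB")
```

Under that assumption the average comes to roughly 700 MB per file, far larger than typical I/O blocks or Web objects.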

Simulation Results
  • LVCT Advantage
  • Files with large caching times are replaced
    promptly because they are less likely to
    generate hits even if cached
  • The saved cache space serves files with small
    caching times
  • The improvement is more apparent with small
    cache sizes.

9
  • Main Results
  • Disk caching in data grids exhibits properties
    different from traditional I/O buffering and Web
    file caching
  • We identify a critical drawback in abstracting
    locality information from Grid request streams:
    relying on misleading time measurements.
  • We propose a new locality estimator using more
    relevant access events
  • Real-life workload traces demonstrate the
    effectiveness of our replacement algorithm.