Adaptive File Caching in Distributed Systems
Ekow J. Otoo, Frank Olken, Arie Shoshani
Objectives
- Goals
  - Develop coordinated, optimal file caching and replication of distributed datasets
  - Develop a software module, called the Policy Advisory Module (PAM), as part of Storage Resource Managers (SRMs) and other grid storage middleware
- Examples of application areas
  - Particle Physics Data Grid (PPDG)
  - Earth Science Grid (ESG)
  - Grid Physics Network (GriPhyN)
Managing File Requests at a Single Site
[Figure: multiple clients use a shared disk cache to access a remote Mass Storage System. File requests enter the Storage Resource Manager, which performs queuing and scheduling advised by the Policy Advisory Module, and moves files between the shared disk, the Mass Storage System, and other sites over the network.]
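To make the figure concrete, here is a minimal sketch of the request flow at a single site, assuming a simplified single-threaded event loop; all names (`srm_loop`, `pam`, `disk_cache`, `mss`) are illustrative, not the actual SRM interfaces.

```python
def srm_loop(request_queue, pam, disk_cache, mss):
    """Toy single-site SRM loop (illustrative only).

    PAM's admission policy picks the next request to serve; files are read
    from the shared disk cache when present, otherwise staged in from the
    remote Mass Storage System after PAM's replacement policy frees space.
    """
    while request_queue:
        req = pam.next_request(request_queue, disk_cache)    # admission policy
        request_queue.remove(req)
        if not disk_cache.contains(req.filename):
            while disk_cache.free_space() < req.size:
                victim = pam.eviction_candidate(disk_cache)  # replacement policy
                disk_cache.evict(victim)
            disk_cache.store(req.filename, mss.fetch(req.filename))
        yield disk_cache.read(req.filename)                  # serve the client
```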
Two Principal Components of the Policy Advisory Module
- A disk cache replacement policy
  - Evaluates which files are to be replaced when space is needed
- An admission policy for file requests
  - Determines which request is to be processed next
  - e.g. may prefer to admit requests for files already in cache
- Work done so far concerns
  - Disk cache replacement policies
  - Development of the SRM-PAM interface (sketched below)
  - Some models of file admission policies
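A hypothetical Python rendering of the two-component interface described above; the method names are ours, not the SRM-PAM specification.

```python
from abc import ABC, abstractmethod

class PolicyAdvisoryModule(ABC):
    """Hypothetical interface for the two PAM components (names are ours)."""

    @abstractmethod
    def eviction_candidate(self, cached_files, now):
        """Replacement policy: choose which cached file to replace
        when space is needed."""

    @abstractmethod
    def next_request(self, pending_requests, cached_files, now):
        """Admission policy: choose which pending request to process next,
        e.g. preferring requests for files already in cache."""
```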
New Results Since the Last Meeting
- Implementation of the Greedy Dual Size (GDS) replacement policy
- New experimental runs with new workloads
  - A six-month log of access traces from JLab
  - A synthetic workload with file sizes from 500 KB to 2.14 GB
- Implementation of the SRM-PAM simulation in OMNeT++
- Papers
  - Disk cache replacement algorithm for storage resource managers on the grid (SC2002)
  - Disk file caching algorithms that account for delays in space reservation, file transfers and file processing (to be submitted to the Mass Storage Conference)
  - A discrete event simulation model of a storage resource manager (to be submitted to SIGMETRICS 2002)
Some Known Results in Caching (1)
- Disk-to-memory caching
  - Least Recently Used (LRU) keeps the last reference time
  - Least Frequently Used (LFU) keeps reference counts
  - LRU-K keeps the last reference times up to a maximum of K
    - Best known result (O'Neil et al. 1993)
    - A small K is sufficient (K = 2 or 3)
    - Gains of 5-10% over LRU, depending on the reference pattern (a sketch of LRU-K follows this list)
- Significance of a 10% saving in time per reference
  - Improved response time
  - In the Grid and wide area networks, this translates to
    - Reduced network traffic
    - Reduced load at the source
    - Savings in time to access files
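As a reference point, a minimal sketch of LRU-K as described by O'Neil et al.: the victim is the file whose K-th most recent reference is oldest, with files having fewer than K references ranking oldest of all.

```python
from collections import defaultdict

K = 2  # O'Neil et al. report that K = 2 or 3 suffices

def record_reference(history, name, t, k=K):
    """Keep only the last k reference times per file, most recent first."""
    history[name].insert(0, t)
    del history[name][k:]

def lruk_victim(history, k=K):
    """Evict the file with the oldest (or missing) k-th backward reference."""
    def kth_ref(name):
        times = history[name]
        return times[k - 1] if len(times) >= k else float("-inf")
    return min(history, key=kth_ref)

history = defaultdict(list)
for t, name in enumerate(["a", "b", "a", "c", "b", "a"]):
    record_reference(history, name, t)
print(lruk_victim(history))  # "c": only one reference, so it ranks oldest
```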
Some Known Results in Caching (2)
- File caching from tertiary storage to disk
  - Modeling of robotic tapes (Johnson 1995; Sarawagi 1995; ...)
  - Hazard-rate optimal (Olken 1983)
  - Object LRU (Hahn et al. 1999)
- Web caching
  - Self-adjusted LRU (Aggarwal and Yu 1997)
  - Greedy Dual (Young 1991)
  - Greedy Dual Size (GDS) (Cao and Irani 1997)
Differences Between Environments
- Caching in primary memory
  - Fixed page size
  - Cost (in time) is assumed constant
  - Transfer time is negligible
  - Latency is assumed fixed for disk
  - Memory references are instantaneous
- Caching in a Grid environment
  - Large files with variable sizes
  - Cost of retrieval (in time) varies considerably from one time instant to another, even for the same file
  - Files may be available from different locations in a WAN
  - Transfer time may be comparable to the latency in a WAN
  - The duration of a file reference is significant and cannot be ignored
- The main goal is to reduce network traffic and file access times
Our Theoretical Results on Caching Policies in a Grid Environment
- Latency delays, transfer delays and file sizes impact caching policies in the Grid
- Cache replacement algorithms such as LRU and LRU-K do not take these into account and are therefore inappropriate
- The replacement policy we advocate is based on a cost-beneficial function computed at time $t_0$:

  $$g_i(t_0) = \frac{k_i(t_0)}{t_0 - t_{-K}} \cdot \frac{C_i(t_0)}{S_i}$$

  where $t_0$ is the current time, $k_i(t_0)$ is the count of references to file $i$ up to a maximum of $K$, $C_i(t_0)$ is the cost in time of accessing file $i$, $S_i$ is the size of file $i$, $f_i(t_0)$ is the total count of references to file $i$ over its active time $T$, and $t_{-K}$ is the time of the $K$th backward reference.
- The eviction candidate is the file with the minimum $g_i(t_0)$
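A minimal sketch of this cost-beneficial ranking, assuming the reconstruction above (reference rate over the last K references, weighted by the cost per byte of re-fetching); `FileStats` and its fields are illustrative names, not the authors' code.

```python
from dataclasses import dataclass, field

K = 3  # number of backward reference times retained per file

@dataclass
class FileStats:
    size_bytes: int            # S_i
    access_cost_s: float       # C_i(t0): time to re-fetch the file
    ref_times: list = field(default_factory=list)  # most recent first

def g(stats: FileStats, t0: float, k_max: int = K) -> float:
    """Cost-beneficial value g_i(t0) under the reconstruction above."""
    k = min(len(stats.ref_times), k_max)
    if k == 0:
        return 0.0                                     # unreferenced: evict first
    rate = k / max(t0 - stats.ref_times[k - 1], 1e-9)  # references per second
    return rate * stats.access_cost_s / stats.size_bytes

def eviction_candidate(cache: dict, t0: float) -> str:
    """The file with the minimum g_i(t0) is the eviction candidate."""
    return min(cache, key=lambda name: g(cache[name], t0))
```

Evicting the minimum g favors keeping files that are referenced often and are expensive per byte to bring back.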
Implementations from the Theoretical Results
- Two new practical implementations were developed
- MIT-K: Maximum average Inter-arrival Time, an improved LRU-K
  - MIT-K dominates LRU-K
  - Does not take into account access costs and file size
  - The main ranking function is the average inter-arrival time over the last $K$ references, $(t_0 - t_{-K})/k_i(t_0)$; the eviction candidate is the file with the maximum (a sketch follows)
- LCB-K: Least Cost Beneficial with K backward references
  - Does take into account retrieval delay and file size
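And a sketch of the MIT-K ranking under the same assumption: the rank is the average inter-arrival time over the last K references, and the file with the maximum is evicted.

```python
def mit_rank(ref_times, t0, k_max=3):
    """Average inter-arrival time over the last k references (MIT-K sketch).

    ref_times holds reference timestamps, most recent first; larger ranks
    mean colder files, so the eviction candidate is the MAXIMUM rank. Note
    that, unlike LCB-K, no access cost or file size appears here.
    """
    k = min(len(ref_times), k_max)
    if k == 0:
        return float("inf")          # never referenced: evict first
    return (t0 - ref_times[k - 1]) / k

def mit_victim(history, t0):
    """history: dict name -> reference-time list, most recent first."""
    return max(history, key=lambda name: mit_rank(history[name], t0))
```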
Some Details of the Implementation Algorithms
- Evaluation of replacement policies with no delay considerations involves
  - a reference stream $r_1, r_2, r_3, \ldots, r_N$,
  - a specified cache size $Z$, and
  - two appropriate data structures:
    - one holds information on the referenced files, and
    - a second holds information about the files in cache while also allowing fast selection of the eviction candidate (sketched below)
- [Figure: the cache content is maintained as a priority queue (PQ); for LRU, the referenced files are indexed in a binary search tree (BST).]
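A sketch of the second structure, assuming a binary heap in place of a balanced tree: cached files are ordered by a policy-supplied rank, with lazy invalidation because Python's `heapq` cannot re-key entries in place.

```python
import heapq
import itertools

class RankedCache:
    """Fixed-capacity cache; the minimum-rank file is the eviction candidate."""

    def __init__(self, capacity_bytes, rank):
        self.capacity = capacity_bytes
        self.used = 0
        self.rank = rank          # rank(name, t) -> priority; minimum evicted
        self.cached = {}          # name -> size of files currently in cache
        self.heap = []            # (rank, version, name); stale entries remain
        self.latest = {}          # name -> version of its newest heap entry
        self.counter = itertools.count()

    def _push(self, name, t):
        v = next(self.counter)
        self.latest[name] = v
        heapq.heappush(self.heap, (self.rank(name, t), v, name))

    def _evict_one(self):
        while self.heap:
            _, v, name = heapq.heappop(self.heap)
            if self.latest.get(name) == v:        # skip stale heap entries
                self.used -= self.cached.pop(name)
                del self.latest[name]
                return

    def reference(self, name, size, t):
        """Record a reference; returns True on a hit, False on a miss."""
        if name in self.cached:
            self._push(name, t)                   # lazily re-rank on a hit
            return True
        while self.used + size > self.capacity and self.cached:
            self._evict_one()
        self.cached[name] = size
        self.used += size
        self._push(name, t)
        return False
```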
Implementation When Delays Are Considered
When delays are considered, each reference $r_i$ in the reference stream has five event times: the time of arrival, the time when file caching starts, the time when caching ends, the time when processing begins, and the time when processing ends and the file is released (see the sketch below).
[Figure: a BST over references within the active lifetime $T$, which varies with different policies; a vector of pinned files in cache; and a PQ of unpinned files in cache.]
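A minimal sketch of the per-reference bookkeeping this model needs; the field names are ours.

```python
from dataclasses import dataclass

@dataclass
class Reference:
    """The five event times attached to each reference r_i (names are ours)."""
    arrival: float           # request arrives at the SRM
    caching_start: float     # transfer into the disk cache begins
    caching_end: float       # file is fully cached
    processing_start: float  # client begins processing the file
    processing_end: float    # processing ends and the file is released

def is_pinned(ref: Reference, now: float) -> bool:
    """Pinned files live in the vector and cannot be evicted; only after
    release does a file move to the PQ of unpinned, evictable files."""
    return ref.caching_start <= now < ref.processing_end
```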
Performance Metrics
- Hit Ratio
- Byte Hit Ratio
- Retrieval Cost Per Reference
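For concreteness, the standard definitions of these three metrics (an assumption on our part, consistent with the interpretations on the next slide), with $r_i$ the number of references to file $i$, $h_i$ of those served from cache, $S_i$ the size of file $i$, and $C_i$ its retrieval cost:

```latex
% Assumed standard definitions, consistent with the next slide.
\text{Hit Ratio} = \frac{\sum_i h_i}{\sum_i r_i},
\qquad
\text{Byte Hit Ratio} = \frac{\sum_i h_i\, S_i}{\sum_i r_i\, S_i},
\qquad
\text{Retrieval Cost per Reference} = \frac{\sum_i (r_i - h_i)\, C_i}{\sum_i r_i}
```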
Implications of the Metric Measures
- Hit Ratio
  - Measures the relative savings as a count of the number of file hits
- Byte Hit Ratio
  - Measures the relative savings as the time avoided in data transfers
- Retrieval Cost Per Reference
  - Measures the relative savings as the time avoided in data transfers and in retrieving data from their sources
Parameters of the Simulations
- Real workload from the Jefferson National Accelerator Facility (JLab)
  - A six-month trace log of file accesses to tertiary storage
  - The log contains batched requests
  - The replacement policy used at JLab is LRU
- Synthetic workload based on JLab
  - 250,000 files, with sizes uniformly distributed between 500 KB and 2.147 GB
  - Inter-arrival times are exponentially distributed with a mean of 90 sec
  - About 500,000 references are generated
  - Locality of reference: partition the references into random-size intervals; within each interval, references follow the 80-20 rule (80% of the references go to 20% of the files). A generator sketch follows.
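A sketch of a generator matching these parameters; the interval-length bounds are our assumption, since the slide only says the intervals are of random size.

```python
import random

def synthetic_workload(n_refs=500_000, n_files=250_000, mean_iat=90.0, seed=1):
    """Yield (time, file_id, size_bytes) tuples with the stated parameters:
    sizes uniform on [500 KB, 2.147 GB], exponential inter-arrivals with a
    90 s mean, and the 80-20 rule within each random-size interval."""
    rng = random.Random(seed)
    sizes = [rng.uniform(500e3, 2.147e9) for _ in range(n_files)]
    t, emitted = 0.0, 0
    while emitted < n_refs:
        span = rng.randint(1_000, 10_000)  # assumed interval-length bounds
        hot = rng.sample(range(n_files), n_files // 5)  # this interval's hot 20%
        for _ in range(min(span, n_refs - emitted)):
            t += rng.expovariate(1.0 / mean_iat)
            if rng.random() < 0.8:          # 80% of references hit the hot set
                f = rng.choice(hot)
            else:
                f = rng.randrange(n_files)
            yield t, f, sizes[f]
            emitted += 1
```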
Replacement Policies Compared
- RND: Random
- LFU: Least Frequently Used
- LRU: Least Recently Used
- MIT-K: Maximum Inter-arrival Time based on the last K references
- LCB-K: Least Cost Beneficial based on the last K references
- GDS: Greedy Dual Size (sketched below)
- The active lifetime of a file, T, is set at 5 days
- All results are accumulated with a variance reduction technique.
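For comparison with LCB-K, a minimal sketch of Greedy Dual Size as described by Cao and Irani (1997): each cached file carries H = L + cost/size, and the running inflation value L is raised to the H of each evicted file.

```python
import heapq

def gds_hits(refs, capacity_bytes):
    """Count hits under Greedy Dual Size. refs: iterable of
    (name, size_bytes, fetch_cost); evicts the minimum H = L + cost/size."""
    L, used, hits = 0.0, 0, 0
    H, size = {}, {}
    heap = []                                   # (H, name); stale entries skipped
    for name, s, cost in refs:
        if name in H:
            hits += 1
            H[name] = L + cost / s              # refresh H on a hit
            heapq.heappush(heap, (H[name], name))
            continue
        if s > capacity_bytes:
            continue                            # file cannot fit at all
        while used + s > capacity_bytes:
            h, victim = heapq.heappop(heap)
            if victim in H and H[victim] == h:  # live entry
                L = h                           # inflate L on eviction
                used -= size.pop(victim)
                del H[victim]
        H[name] = L + cost / s
        size[name] = s
        used += s
        heapq.heappush(heap, (H[name], name))
    return hits
```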
Simulation Results for the JLab Workload: Comparison of Hit Ratios
- Higher values represent better performance
- MIT-K and LRU give the best performance
- LCB-K, GDS and RND are comparable
- LFU is the worst
Simulation Results for the JLab Workload: Comparison of Byte Hit Ratios
- Higher values represent better performance
- MIT-K and LRU give slightly better performance
- All policies except LFU are comparable
- LFU is the worst
Simulation Results for the JLab Workload: Comparison of Average Retrieval Time Per Reference
- Lower values represent better performance
- LCB-K and GDS give the best performance
- MIT-K, LRU and RND are comparable
- LFU shows the worst performance
Simulation Results for the Synthetic Workload: Comparison of Average Retrieval Time Per Reference
- Lower values represent better performance
- LCB-K gives the best performance, though not significantly better than GDS
- LFU is still the worst
- Hit ratio and byte hit ratio are not good measures of caching policy effectiveness on the Grid
Summary
- Developed a good replacement policy, LCB-K, for caching in storage resource management on the grid.
- Developed a realistic model for evaluating cache replacement policies that takes into account delays at the data sources, in transfers, and in processing.
- Applied the model in extensive simulations of different policies under synthetic and real workloads of accesses to the mass storage system at JLab.
- We conclude that two worthwhile replacement policies for storage resource management on the Grid are LCB-K and GDS.
- LCB-K gives about a 10% saving in retrieval cost per reference compared with the widely used LRU.
- The cumulative effect can be significant in terms of reduced network traffic and reduced load at the source.