Filecules in High-Energy Physics: Characteristics and Impact on Resource Management

1
Filecules in High-Energy Physics: Characteristics
and Impact on Resource Management
  • Adriana Iamnitchi
  • Shyamala Doraimani
  • Gabriele Garzoglio

2
Motivations
  • Mature GRID deployments
  • Workflow models
  • Re-evaluation of traditional resource management
    techniques

3
Objectives
  • Usage patterns in a large-scale, data-intensive
    scientific collaboration in high-energy physics
  • Focus on data usage
  • Filecules: a proposed new abstraction for
    data resource usage
  • Importance for data caching, replication,
    and placement

4
Data-intensive, scientific projects
  • e.g., GriPhyN, PPDG
  • Files
  • Location
  • Management
  • Data scheduling
  • Computation scheduling

5
DZero Experiment
  • Fermi National Accelerator Laboratory (FermiLab),
    US
  • Tevatron collider
  • 70 institutes, 18 countries
  • Shareable computing and storage resources
  • Results from several petabytes of measured and
    simulated data

6
  • Read only data files
  • Jobs analyze and produce new, processed data
    files
  • Centralized file-based data management
  • Logs
  • January 2003 to May 2005
  • About 234,000 job runs
  • 561 users
  • 34 different Internet domains
  • 11 countries, 3 continents

7
(No Transcript)
8
(No Transcript)
9
  • Particles formed from the annihilation of protons
    and antiprotons at the TeV energy scale
  • Data collected from different layers of the
    detector
  • Events consist of about 250 KB of information and
    are stored in raw data files of about 1GB in
    size
  • Every bit of raw data is accessed for further
    processing/filtering
  • Data derived from pre-processing and filtering is
    then classified based on the physics events it
    represents

10
Computations
  • 1 TB of recorded data/day
  • 1 TB of derived and simulated data/day
  • Three main activities
  • data filtering (data reconstruction)
  • production of simulated events and
  • data analysis
  • Binary data format is detector-hardware dependent
  • Filtering transforms binary data into concepts
    such as particle tracks, charge, and spin
  • Data analysis consists of the selection and
    statistical study of particles with certain
    characteristics

11
Infrastructure
  • SAM (Sequential Access via Metadata)
  • Reliable data storage either directly from the
    detector or from data processing facilities
    around the world
  • Enables data distribution to and from all of the
    collaborating institutions
  • Thoroughly catalogs data for content, provenance,
    status, location, processing history,
    user-defined datasets, and so on.
  • Manages the distributed resources to optimize
    their usage and to enforce the policies of the
    experiment

12
  • Categorizes computation activities into
    application families (reconstruction, analysis, etc.)
  • data is organized in tiers, defined according
    to the format of the physics events
  • raw, reconstructed, thumbnail, and
    root-tuple
  • root-tuple tier identifies typically highly
    processed events in root format
  • Application running on a dataset defines a job or
    project. Projects are initiated by a user on
    behalf of a physics group and typically trigger
    data movement
  • A dataset consists of a set of data useful from
    a physics perspective. It may be constructed by
    an analysis process, or used as input to one.

13
Traces
  • File and application traces
  • File traces show what files have been requested
    with every job run during the period under study
  • Application traces list summary information about
    jobs
  • metadata for the application (application name,
    version, and family)
  • dataset processed (data tier)
  • general data, such as the user name and group
    that initiated the job and the location (node
    name) and start/stop time of the job

14
(No Transcript)
15
Analysis of traces
  • Quantifies resource requirements and allows for
    better resource provisioning (Zipf's law for web
    requests, free riding in Gnutella, etc.)
  • Provides a credible workload for evaluating new
    solutions via simulations or emulations
  • Understanding usage patterns can lead to
  • New ways of improving the overall system
    performance
  • file location mechanisms that exploit the
    relation between stored files
  • information dissemination techniques that exploit
    overlapping user interests in data, and
  • search algorithms adapted to particular overlay
    topologies

16
Filecules
  • Aggregate of one or more files in a definite
    arrangement held together by special forces
    related to their usage. Filecule is the smallest
    unit of data that still retains its usage
    properties
  • One-file filecules as the equivalent of a
    monatomic molecule
  • A set of files forms a filecule if and only if
    all of its files always occur together in job
    requests
  • Any two filecules are disjoint
  • A filecule has at least one file
  • The number of requests for a file is identical
    to the number of requests for the filecule that
    includes that file
  • The popularity distribution over files and over
    filecules is the same

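The definition above can be computed offline from the file traces: group files by the exact set of jobs that requested them; files with identical request sets form one filecule, which makes filecules disjoint by construction. A minimal sketch (the trace format and names are illustrative assumptions):

```python
from collections import defaultdict

def identify_filecules(file_traces):
    """Group files into filecules: files requested by exactly
    the same set of jobs belong to the same filecule.

    file_traces: iterable of (job_id, file_name) pairs.
    Returns a list of filecules (frozensets of file names).
    """
    # Map each file to the set of jobs that requested it.
    jobs_per_file = defaultdict(set)
    for job_id, file_name in file_traces:
        jobs_per_file[file_name].add(job_id)

    # Files with identical request sets form one filecule;
    # distinct request sets yield disjoint filecules.
    groups = defaultdict(set)
    for file_name, jobs in jobs_per_file.items():
        groups[frozenset(jobs)].add(file_name)
    return [frozenset(files) for files in groups.values()]

# Hypothetical trace: a and b always co-occur; c stands alone.
traces = [
    ("job1", "a"), ("job1", "b"),
    ("job2", "a"), ("job2", "b"),
    ("job3", "c"),
]
print(identify_filecules(traces))
```

Here `{a, b}` forms a two-file filecule and `{c}` is a one-file filecule, the "monatomic" case from the slide.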
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
Size of filecules (in MB) in reconstructed tier
22
Size of filecules (in files) in reconstructed
tier
23
Popularity distribution (requests) for filecules
in reconstructed tier
24
(No Transcript)
25
Cache Replacement
  • Filecules vs Files
  • LRU policy
  • Seven cache sizes between 1 TB and 100 TB
  • The largest filecule was 17 TB

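The comparison can be reproduced with a simple simulator that runs LRU at filecule granularity, caching and evicting whole filecules instead of individual files. The sizes and request stream below are illustrative assumptions, not DZero numbers:

```python
from collections import OrderedDict

def lru_miss_rate(requests, sizes, capacity):
    """Simulate an LRU cache at filecule granularity.

    requests: sequence of filecule ids, in access order.
    sizes: dict mapping filecule id -> size (e.g. in GB).
    capacity: total cache size in the same unit.
    Returns the miss rate over the request stream.
    """
    cache = OrderedDict()   # filecule id -> size, in LRU order
    used = 0
    misses = 0
    for fc in requests:
        if fc in cache:
            cache.move_to_end(fc)   # hit: refresh recency
        else:
            misses += 1
            # Evict least-recently-used filecules until it fits.
            while used + sizes[fc] > capacity and cache:
                _, freed = cache.popitem(last=False)
                used -= freed
            if sizes[fc] <= capacity:   # skip filecules larger than the cache
                cache[fc] = sizes[fc]
                used += sizes[fc]
    return misses / len(requests)

sizes = {"A": 2, "B": 2, "C": 2}
reqs = ["A", "B", "A", "C", "A", "B"]
print(lru_miss_rate(reqs, sizes, capacity=4))
```

The file-granularity variant would be the same loop over individual files; comparing the two miss rates over the same trace is the experiment the slide describes.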
26
(No Transcript)
27
  • Miss rate for filecule LRU is significantly lower
    (4 to 5 times lower for large cache sizes)
  • The difference in performance is relatively small
    (about 9.5%) for a cache size of 1 TB

28
BitTorrent
  • Flash crowd situations
  • Sharing community

29
Comparison with BitTorrent
30
(No Transcript)
31
Other scenarios
  • Scheduling data transfers while accounting for
    filecules
  • Proactive data replication
  • based on popularity considerations
  • replication costs
  • membership in filecules, and
  • the status of the filecule (partially replicated
    or not replicated) on the destination storage

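The replication criteria listed above can be combined into a candidate score. The formula and weights below are illustrative assumptions, not the authors' method; the point is only that popularity, transfer cost, and the fraction of the filecule already present at the destination all feed one decision:

```python
def replication_score(popularity, transfer_cost, fraction_present):
    """Toy score for a replication candidate.

    popularity: request count for the filecule.
    transfer_cost: cost of moving the missing data.
    fraction_present: share of the filecule already on the
        destination storage (0.0 = absent, 1.0 = complete).
    """
    # Favor popular filecules that are cheap to move; finishing
    # a partially replicated filecule gets a completion bonus.
    return popularity * (1.0 + fraction_present) / transfer_cost

# Hypothetical candidates: hot/partially-present, hot/absent, cold/absent.
candidates = {
    "fc_hot_partial": replication_score(100, 10.0, 0.8),
    "fc_hot_absent": replication_score(100, 10.0, 0.0),
    "fc_cold_absent": replication_score(5, 10.0, 0.0),
}
best = max(candidates, key=candidates.get)
print(best)
```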
32
Other issues
  • Online instead of offline identification
  • In a distributed environment
  • Without global information, identified filecules
    can only be larger than real filecules
  • Filecules identified from only partial
    information are correct from the perspective of
    the local user base

33
Relationships between files
  • Successor relationship identified from a sequence
    of accesses.
  • Explicit grouping in which files that are used
    one after the other are placed in adjacent
    locations on the disk and accessed as a whole
    group
  • Two files related if they are opened within a
    specified number of file open operations from
    each other

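The last relationship is straightforward to compute from an ordered trace of file opens: two files are related when their positions in the open sequence fall within a fixed window. A small sketch (window size and file names are illustrative):

```python
def related_pairs(open_sequence, window):
    """Return pairs of files opened within `window` file-open
    operations of each other."""
    pairs = set()
    for i, f in enumerate(open_sequence):
        # Compare f against the next `window` opens only.
        for g in open_sequence[i + 1 : i + 1 + window]:
            if f != g:
                pairs.add(frozenset((f, g)))
    return pairs

# Hypothetical open trace; with window=2, "b" and "d" are
# three opens apart and therefore not related.
opens = ["a", "b", "c", "a", "d"]
print(related_pairs(opens, window=2))
```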
34
Could be utilized in
  • Pre-fetching based on access patterns
  • Cache replacement
  • Storage allocation on disks

35
Thanks