Filecules in High-Energy Physics: Characteristics and Impact on Resource Management

1
Filecules in High-Energy Physics: Characteristics
and Impact on Resource Management
  • Adriana Iamnitchi
  • Shyamala Doraimani
  • Gabriele Garzoglio

2
Motivations
  • Mature GRID deployments
  • Workflow models
  • Re-evaluation of traditional resource management
    techniques

3
Objectives
  • Usage patterns in a large-scale, data-intensive
    scientific collaboration in high-energy physics
  • Focus on data usage
  • Filecules: a proposed new abstraction for
    data resource usage
  • Importance for data caching, replication,
    and placement

4
Data-intensive, scientific projects
  • e.g., GriPhyN, PPDG
  • Files
  • Location
  • Management
  • Data scheduling
  • Computation scheduling

5
DZero Experiment
  • Fermi National Accelerator Laboratory (FermiLab),
    US
  • Tevatron collider
  • 70 institutes, 18 countries
  • Shareable computing and storage resources
  • Results from several petabytes of measured and
    simulated data

6
  • Read only data files
  • Jobs analyze and produce new, processed data
    files
  • Centralized file-based data management
  • Logs
  • January 2003 to May 2005
  • About 234,000 job runs
  • 561 users
  • 34 different Internet domains
  • 11 countries, 3 continents

7
(No Transcript)
8
(No Transcript)
9
  • Particles formed from the annihilation of protons
    and antiprotons at the TeV energy scale
  • Data collected from different layers of the
    detector
  • Events consist of about 250 KB of information and
    are stored in raw data files of about 1GB in
    size
  • Every bit of raw data is accessed for further
    processing/filtering
  • Data derived from pre-processing and filtering is
    then classified based on the physics events it
    represents

10
Computations
  • 1 TB of recorded data/day
  • 1 TB of derived and simulated data/day
  • Three main activities
  • data filtering (data reconstruction)
  • production of simulated events and
  • data analysis
  • Binary data format is detector-hardware dependent
  • Filtering transforms binary data into concepts
    such as particle tracks, charge, and spin
  • Data analysis consists of the selection and
    statistical study of particles with certain
    characteristics

11
Infrastructure
  • SAM (Sequential Access via Metadata)
  • Reliable data storage either directly from the
    detector or from data processing facilities
    around the world
  • Enables data distribution to and from all of the
    collaborating institutions
  • Thoroughly catalogs data for content, provenance,
    status, location, processing history,
    user-defined datasets, and so on.
  • Manages the distributed resources to optimize
    their usage and to enforce the policies of the
    experiment

12
  • Categorizes computation activities into
    application families (reconstruction, analysis, etc.)
  • data is organized in tiers, defined according
    to the format of the physics events
  • raw, reconstructed, thumbnail, and
    root-tuple
  • root-tuple tier identifies typically highly
    processed events in root format
  • Application running on a dataset defines a job or
    project. Projects are initiated by a user on
    behalf of a physics group and typically trigger
    data movement
  • A dataset consists of a set of data useful from
    a physics perspective. It may be constructed by
    an analysis process, or used as input to one.

13
Traces
  • File and application traces
  • File traces show what files have been requested
    with every job run during the period under study
  • Application traces list summary information about
    jobs
  • metadata for the application (application name,
    version, and family)
  • dataset processed (data tier)
  • general data, such as the user name and group
    that initiated the job and the location (node
    name) and start/stop time of the job

14
(No Transcript)
15
Analysis of traces
  • Quantifies resource requirements and allows for
    better resource provisioning (Zipf's law for web
    requests, free riding in Gnutella, etc.)
  • Provides a credible workload for evaluating new
    solutions via simulations or emulations
  • Understanding usage patterns can lead to
  • New ways of improving the overall system
    performance
  • file location mechanisms that exploit the
    relation between stored files
  • information dissemination techniques that exploit
    overlapping user interests in data, and
  • search algorithms adapted to particular overlay
    topologies

16
Filecules
  • Aggregate of one or more files in a definite
    arrangement held together by special forces
    related to their usage. Filecule is the smallest
    unit of data that still retains its usage
    properties
  • One-file filecules as the equivalent of a
    monatomic molecule
  • A set of files forms a filecule if and only if
    all of its files always occur together in job
    requests
  • Any two filecules are disjoint
  • A filecule has at least one file
  • The number of requests for a file is identical
    to the number of requests for the filecule that
    includes that file
  • The popularity distribution over files and over
    filecules is the same

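The definition above can be computed offline from the file traces: group files by the exact set of jobs that requested them; files with identical request sets form one filecule, which makes filecules disjoint by construction. A minimal sketch (the trace format and names are illustrative assumptions):

```python
from collections import defaultdict

def identify_filecules(file_traces):
    """Group files into filecules: files requested by exactly
    the same set of jobs belong to the same filecule.

    file_traces: iterable of (job_id, file_name) pairs.
    Returns a list of filecules (frozensets of file names).
    """
    # Map each file to the set of jobs that requested it.
    jobs_per_file = defaultdict(set)
    for job_id, file_name in file_traces:
        jobs_per_file[file_name].add(job_id)

    # Files with identical request sets form one filecule;
    # distinct request sets yield disjoint filecules.
    groups = defaultdict(set)
    for file_name, jobs in jobs_per_file.items():
        groups[frozenset(jobs)].add(file_name)
    return [frozenset(files) for files in groups.values()]

# Hypothetical trace: a and b always co-occur; c stands alone.
traces = [
    ("job1", "a"), ("job1", "b"),
    ("job2", "a"), ("job2", "b"),
    ("job3", "c"),
]
print(identify_filecules(traces))
```

Here `{a, b}` forms a two-file filecule and `{c}` is a one-file filecule, the "monatomic" case from the slide.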
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
Size of filecules (in MB) in reconstructed tier
22
Size of filecules (in files) in reconstructed
tier
23
Popularity distribution (requests) for filecules
in reconstructed tier
24
(No Transcript)
25
Cache Replacement
  • Filecules vs Files
  • LRU policy
  • Seven cache sizes between 1 TB and 100 TB
  • The largest filecule was 17 TB

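The comparison can be reproduced with a simple simulator that runs LRU at filecule granularity, caching and evicting whole filecules instead of individual files. The sizes and request stream below are illustrative assumptions, not DZero numbers:

```python
from collections import OrderedDict

def lru_miss_rate(requests, sizes, capacity):
    """Simulate an LRU cache at filecule granularity.

    requests: sequence of filecule ids, in access order.
    sizes: dict mapping filecule id -> size (e.g. in GB).
    capacity: total cache size in the same unit.
    Returns the miss rate over the request stream.
    """
    cache = OrderedDict()   # filecule id -> size, in LRU order
    used = 0
    misses = 0
    for fc in requests:
        if fc in cache:
            cache.move_to_end(fc)   # hit: refresh recency
        else:
            misses += 1
            # Evict least-recently-used filecules until it fits.
            while used + sizes[fc] > capacity and cache:
                _, freed = cache.popitem(last=False)
                used -= freed
            if sizes[fc] <= capacity:   # skip filecules larger than the cache
                cache[fc] = sizes[fc]
                used += sizes[fc]
    return misses / len(requests)

sizes = {"A": 2, "B": 2, "C": 2}
reqs = ["A", "B", "A", "C", "A", "B"]
print(lru_miss_rate(reqs, sizes, capacity=4))
```

The file-granularity variant would be the same loop over individual files; comparing the two miss rates over the same trace is the experiment the slide describes.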
26
(No Transcript)
27
  • Miss rate for filecule LRU is significantly lower
    (4 to 5 times lower for large cache sizes)
  • The difference in performance is relatively small
    (about 9.5%) for a cache size of 1 TB

28
BitTorrent
  • Flash crowd situations
  • Sharing community

29
Comparison with BitTorrent
30
(No Transcript)
31
Other scenarios
  • Scheduling data transfers while accounting for
    filecules
  • Proactive data replication
  • based on popularity considerations
  • replication costs
  • membership in filecules, and
  • the status of the filecule (partially replicated
    or not replicated) on the destination storage

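The replication criteria listed above can be combined into a candidate score. The formula and weights below are illustrative assumptions, not the authors' method; the point is only that popularity, transfer cost, and the fraction of the filecule already present at the destination all feed one decision:

```python
def replication_score(popularity, transfer_cost, fraction_present):
    """Toy score for a replication candidate.

    popularity: request count for the filecule.
    transfer_cost: cost of moving the missing data.
    fraction_present: share of the filecule already on the
        destination storage (0.0 = absent, 1.0 = complete).
    """
    # Favor popular filecules that are cheap to move; finishing
    # a partially replicated filecule gets a completion bonus.
    return popularity * (1.0 + fraction_present) / transfer_cost

# Hypothetical candidates: hot/partially-present, hot/absent, cold/absent.
candidates = {
    "fc_hot_partial": replication_score(100, 10.0, 0.8),
    "fc_hot_absent": replication_score(100, 10.0, 0.0),
    "fc_cold_absent": replication_score(5, 10.0, 0.0),
}
best = max(candidates, key=candidates.get)
print(best)
```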
32
Other issues
  • Online instead of offline identification
  • In a distributed environment
  • Without global information, identified filecules
    can only be larger than real filecules
  • Filecules identified from only partial
    information are correct from the perspective of
    the local user base

33
Relationships between files
  • Successor relationship identified from a sequence
    of accesses.
  • Explicit grouping in which files that are used
    one after the other are placed in adjacent
    locations on the disk and accessed as a whole
    group
  • Two files related if they are opened within a
    specified number of file open operations from
    each other

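The last relationship is straightforward to compute from an ordered trace of file opens: two files are related when their positions in the open sequence fall within a fixed window. A small sketch (window size and file names are illustrative):

```python
def related_pairs(open_sequence, window):
    """Return pairs of files opened within `window` file-open
    operations of each other."""
    pairs = set()
    for i, f in enumerate(open_sequence):
        # Compare f against the next `window` opens only.
        for g in open_sequence[i + 1 : i + 1 + window]:
            if f != g:
                pairs.add(frozenset((f, g)))
    return pairs

# Hypothetical open trace; with window=2, "b" and "d" are
# three opens apart and therefore not related.
opens = ["a", "b", "c", "a", "d"]
print(related_pairs(opens, window=2))
```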
34
Could be utilized in
  • Pre-fetching based on access patterns
  • Cache replacement
  • Storage allocation on disks

35
Thanks