1
Research and Development on Storage and Data
Access
  • Michael Ernst, Fermilab
    DOE/NSF Review, January 16, 2003

2
Anticipated System Architecture
(Architecture diagram: production cluster (>80 dual nodes); farms POPCRN, IGT, GYOZA, R&D and FRY; Cisco 6509 switch; shared offsite links via MREN (OC3) and ESnet (OC12); Enstore (17 drives, shared); NAS; dCache (>6 TB); user analysis.)
3
Projects
  • Compute & Storage Elements: R&D on components and systems
  • Cluster Management: generic farms, partitioning
  • Will be addressed in H. Wenzel's talk
  • Storage Management and Access
  • Interface standardization and its implementation
  • Data set catalogs, metadata, replication, robust file transfers
  • Networking: terabyte throughput to Tier-2 sites and to CERN
  • Ultrascale network protocol stack
  • Physics Analysis Center
  • Analysis cluster
  • Desktop support
  • Software distribution, software support, user support / helpdesk
  • Collaborative tools
  • Will be addressed in H. Wenzel's talk
  • VO management and security
  • Worked out a plan for R&D and deployment
  • Need to develop an operations scenario

4
Monte Carlo Production with MOP
5
User access to Tier-1 (Jets/Met, Muons)
  • ROOT/POOL interface (TDCacheFile), see the access sketch below
  • AMS server
  • AMS/Enstore interface
  • AMS/dCache interface

(Diagram: users in Wisconsin, at CERN, at FNAL and in Texas access objects in dCache over the network, backed by Enstore and NAS/RAID.)
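For illustration, a minimal access sketch through ROOT's dcap plugin (TDCacheFile), assuming a ROOT build with dCache support; the door host, port, PNFS path and tree name are hypothetical placeholders, not the actual FNAL setup:

```python
# Minimal sketch: open a file stored in dCache through ROOT's dcap plugin
# (TDCacheFile). The door host, PNFS path and tree name are hypothetical.
import ROOT

url = "dcap://door.example.fnal.gov:22125/pnfs/fnal.gov/usr/cms/jetmet/events.root"

f = ROOT.TFile.Open(url)              # dcap:// URLs dispatch to TDCacheFile
if not f or f.IsZombie():
    raise RuntimeError("could not open %s" % url)

tree = f.Get("Events")                # hypothetical tree name
print("entries:", tree.GetEntries() if tree else "n/a")
f.Close()
```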
6
R&D on Storage and Data Access
(Same system architecture diagram as slide 2: production cluster, farms, Cisco 6509, MREN/ESnet links, Enstore, NAS, dCache, user analysis.)
7
dCache Placement
(Diagram: applications such as dccp, xxxFTP and GRID access methods use the dCap library and the PNFS namespace manager to reach dCache and its local disk pools, which front tertiary storage systems such as Enstore, OSM or other HSMs.)
  • Application viewpoint:
  • POSIX-compliant interface
  • Preload library
  • ROOT/POOL interface
  • Unique namespace throughout the HRM
  • Transparent access to tertiary storage (see the dccp sketch below)

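A minimal staging sketch, assuming the dccp client is installed and using hypothetical PNFS and local paths:

```python
# Sketch: copy a file out of the dCache/Enstore HRM with dccp, addressing it by
# its PNFS namespace path. Both paths are hypothetical placeholders.
import subprocess

src = "/pnfs/fnal.gov/usr/cms/production/run1234/hits.root"
dst = "/local/scratch/hits.root"

# dccp speaks dCap to the pool; if the file only lives on tape, dCache/Enstore
# stage it back to disk transparently before the copy proceeds.
subprocess.run(["dccp", src, dst], check=True)
```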
8
Network Attached Storage (NAS)
A growing market offering emerging technologies,
e.g. Zambeel's Aztera system architecture
9
Data Access and Distribution Server
  • Addresses 3 areas:
  • Data server for production and analysis
  • Sufficient space to store a full year of produced data
  • Have implemented an HRM system at Tier-1 based on dCache (DRM) / Enstore (TRM)
  • Shared workgroup space and user home area
  • Replaces AFS at FNAL, which suffers from a lack of flexibility (technical constraints) and insufficient performance and resources
  • Improve performance and reliability
  • Minimize administration and management overhead
  • Backup and archiving of user data
  • Today's limitations: the central backup system covers data in AFS space only; the FNAL MSS (Enstore) is used to manually archive files/datasets
  • Developed a benchmark suite for system evaluation and acceptance testing (a minimal sketch follows below)
  • Benchmark tools along with control and analysis scripts to produce a standardized report
  • Throughput, file operations, data integrity
  • Enormously reduces the effort required for evaluation and provides a fair comparison of in-house developed and commercial products
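A minimal sketch of the kind of check such a suite automates (sequential throughput plus a checksum-based integrity test); the target path, file size and report format are placeholders, not the actual suite:

```python
# Toy benchmark in the spirit of the evaluation suite: sequential write/read
# throughput and a data-integrity check against a mounted volume under test.
# Target path and size are hypothetical; a real run would also drop or bypass
# the page cache between the write and read phases.
import hashlib, os, time

TARGET = "/mnt/nas-under-test/bench.dat"
SIZE_MB = 512
CHUNK = 1 << 20  # 1 MiB

payload = os.urandom(CHUNK)
ref = hashlib.md5()

t0 = time.time()
with open(TARGET, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(payload)
        ref.update(payload)
os.sync()
write_mbps = SIZE_MB / (time.time() - t0)

chk = hashlib.md5()
t0 = time.time()
with open(TARGET, "rb") as f:
    while True:
        block = f.read(CHUNK)
        if not block:
            break
        chk.update(block)
read_mbps = SIZE_MB / (time.time() - t0)

print("write %.1f MB/s, read %.1f MB/s, integrity %s"
      % (write_mbps, read_mbps, "OK" if chk.digest() == ref.digest() else "FAILED"))
```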

10
Storage System Test Suite
11
Benchmark Tools
12
Interactive Product Development Process
(Plots: single-file read performance by client OS (Solaris, Linux), as achieved initially and after system tuning.)
13
Data Access and Distribution Server
  • Benchmark suite for system evaluation and acceptance testing
  • Developed a CMS-application-related test suite (based on ROOT I/O) which we used to evaluate the dCache-based DRM (a simplified measurement sketch appears below)

(Plot: aggregated throughput reading out of dCache with multiple clients.)
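A simplified measurement sketch for the aggregated-throughput test, assuming the files are reachable as ordinary paths (e.g. via an NFS-mounted PNFS or the dCap preload library); the file list and client count are hypothetical, and the real suite was ROOT-I/O based:

```python
# Sketch: N concurrent readers, reporting aggregate throughput. The file list
# and the number of "clients" are hypothetical placeholders.
import time
from multiprocessing import Pool

FILES = ["/pnfs/fnal.gov/usr/cms/test/file%02d.root" % i for i in range(16)]
CHUNK = 1 << 20  # 1 MiB

def read_all(path):
    nbytes = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            nbytes += len(block)
    return nbytes

if __name__ == "__main__":
    t0 = time.time()
    with Pool(processes=8) as pool:           # 8 concurrent clients
        total = sum(pool.map(read_all, FILES))
    dt = time.time() - t0
    print("aggregate: %.1f MB/s over %d files" % (total / 1e6 / dt, len(FILES)))
```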
14
Data Access and Distribution Server
  • Effective Throughput per Client

15
Data Access and Distribution Server
  • The time a Mover is active for each process; with a maximum of 40 Movers, some processes are waiting for a Mover to become available

16
Data Access and Distribution Server
  • Client Rate Distribution when reading out of
    dCache

17
Storage and Data Access
  • Viewpoints of CMS event data in the CMS Data Grid system:
  • High-level data views in the minds of physicists
  • High-level data views in physics analysis tools
  • Virtual data product collections (highest-level common view across CMS)
  • Materialized data product collections
  • File sets (sets of logical files with high-level significance)
  • Logical files
  • Physical files on sites (device-location-independent view)
  • Physical files on storage devices (lowest-level generic view of files)
  • Device-specific files

R&D is focusing on a common interface
18
Storage Resource Management
(Diagram: at the site hosting the application, clients issue a logical query; a Request Interpreter resolves it to logical files via a property index, and a Replica Catalog maps these to site-specific file requests and site-specific files; a Request Executor performs request planning and issues pinning and file-transfer requests through a DRM and MDS, over the network, to the Tier-1 HRM (dCache + Enstore) and to a Tier-2 DRM (dCache, dFarm, DRM or NeST, still to be chosen). Today this is the responsibility of the application, invoking higher-level middleware components (e.g. Condor). A hypothetical sketch of this flow follows below.)
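The flow in the diagram can be summarized in a small, purely hypothetical sketch; the names and data structures below are illustrative only and do not correspond to the real SRM API:

```python
# Hypothetical sketch of the request flow in the diagram above. The property
# index, replica catalog and SURLs are invented placeholders.
PROPERTY_INDEX = {"jets_met_2002": ["lfn:jetmet_001", "lfn:jetmet_002"]}
REPLICA_CATALOG = {"lfn:jetmet_001": "srm://tier1.fnal.gov/pnfs/cms/jetmet_001",
                   "lfn:jetmet_002": "srm://tier2.example.edu/dfarm/jetmet_002"}

def interpret(logical_query):
    """Request Interpreter: logical query -> site-specific file requests."""
    logical_files = PROPERTY_INDEX[logical_query]
    return [REPLICA_CATALOG[lf] for lf in logical_files]

def execute(site_files):
    """Request Executor: plan the request, then pin and transfer via the DRM."""
    for surl in site_files:
        print("pin      ", surl)              # keep the replica on disk while in use
        print("transfer ", surl, "-> local scratch")

execute(interpret("jets_met_2002"))
```

Today this planning and pinning logic lives in the application or in higher-level middleware; the R&D goal is to move it behind a common SRM interface at both Tier-1 and Tier-2.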
19
Storage Resource Management
(Same Storage Resource Management diagram as the previous slide.)
20
R&D on Components for Data Storage and Data Access
  • Approach: develop a storage architecture, define components and interfaces
  • This will include storage and data management, data access, catalogs, robust data movement, etc.
  • Storage-system-related R&D issues:
  • Detailed analysis of the SRM and GridFTP specifications, including identification of the initial version of the protocols to be used and discussion of any connective middleware w.r.t. interoperability; coordination with Tier-0/1/2 and LCG. The goal is to effect transfers and to support replica managers.
  • Protocol elements include features from GridFTP and SRM
  • At Tier-2 centers: selection of a temporary store implementation supporting SRM and GridFTP (incl. evaluation of interoperability issues with the Tier-1 center)
  • dCache, dFarm, DRM, NeST, DAP
  • At the Tier-1 center: provide an SRM/dCache interface for the FNAL dCache implementation, compatible with the criteria above
  • Track compatibility with LCG (incl. the Tier-0 center at CERN) as their plan evolves
  • Have developed a resource-loaded WBS
  • Work will be carried out jointly with CD/CCF and the SRM Collaboration
  • Further planning is required to incorporate Replica Managers / the Replica Location Service

21
The Need for Improved Storage Devices and File
Systems
  • CMS is currently in the process of developing its Data Model
  • Data storage and data access are the most demanding problems
  • The choice of OS and persistency solution can strongly influence the hardware needs (and the human resources required to support them)
  • Moving away from a persistency model based on an OODB
  • Problem: mapping objects to files
  • The new model should be developed with a focus on optimizing the underlying storage architecture and storage technology
  • Classic file systems are at the limit of their scaling capabilities

22
OSD Architecture
(Diagram: the application / file manager sends metadata operations to an object manager, while data transfers go directly over the LAN/SAN to the OSD, an intelligent storage device, with security enforced at the device. A toy sketch follows below.)
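A toy model of the split shown above (metadata operations to the object manager, bulk data directly between client and device); this is an illustration only, not the T10 OSD command set:

```python
# Toy OSD model: the object manager owns the namespace and hands out object
# ids; data then flows directly to/from the intelligent storage device.
class ObjectManager:
    def __init__(self):
        self.namespace = {}                    # path -> (device, object id)
        self.next_id = 0

    def create(self, path, device):
        oid, self.next_id = self.next_id, self.next_id + 1
        self.namespace[path] = (device, oid)
        return oid

    def lookup(self, path):
        return self.namespace[path]

class OSDevice:
    """Intelligent storage device: stores objects and manages layout itself."""
    def __init__(self):
        self.objects = {}

    def write(self, oid, data):
        self.objects[oid] = data               # data path bypasses the manager

    def read(self, oid):
        return self.objects[oid]

dev = OSDevice()
mgr = ObjectManager()
oid = mgr.create("/store/run42/event.dat", dev)  # metadata operation
dev.write(oid, b"raw event payload")             # direct data transfer to the OSD
print(dev.read(mgr.lookup("/store/run42/event.dat")[1]))
```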
23
Data Flow at Regional Center
(Diagram, worked out by the MONARC project: mass storage, disk servers and database servers at the Regional Center; data import/export over the network from CERN and from Tier-2 and simulation centers, plus exchange with local institutes and tapes; production reconstruction (Raw/Sim -> ESD, scheduled and predictable, experiment/physics groups); production analysis (ESD -> AOD, AOD -> DPD, scheduled, physics groups); individual analysis (AOD -> DPD and plots, chaotic, physicists); desktops; physics software development; R&D systems and testbeds; info and code servers; web and telepresence servers; training, consulting, help desk.)
24
Networking
  • Provisioning of offsite network capacity at the Regional Center at FNAL
  • In general: shared wide-area HEP networking
  • A vital resource in the CMS multi-tier computing model
  • Sharing a 622 Mbps link (best effort) to ESnet with all Fermilab experiments, primarily CDF and D0, each with requirements of > 300 Mbps
  • Led to shortfalls during the spring data production for the DAQ TDR, with peak requirements of > 200 Mbps while the link was still at 155 Mbps
  • The upgrade to 622 Mbps was delayed for 5 months while the link was completely saturated by CDF/D0 traffic for many hours per day
  • Uncertain whether the upgrade to OC-48 planned for 2004 will be in time
  • US CMS requirements (Tier-0/1 and Tier-1/2) according to the planned Data Challenges: DC04 (5%) in 2003/4, DC05 (10%) in 2004/5 and DC06 (20%) in 2005/6
  • With probably only 2 Regional Centers involved in DC04 we will have to transfer 1 TB/day starting in Q3/2003 (see the estimate below)
  • (*) Numbers assume 50% link utilization

                           2003   2004   2005
Installed (*) BW in Mbps    300    600    800
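A back-of-envelope check of the 1 TB/day DC04 figure against the table above, using the 50% utilization rule from the footnote:

```python
# Convert ~1 TB/day of DC04 pre-challenge traffic into a sustained rate and
# into the link capacity needed at 50% average utilization.
TB_PER_DAY = 1.0
SECONDS_PER_DAY = 86400
UTILIZATION = 0.5                              # "(*) 50% link utilization"

sustained_mbps = TB_PER_DAY * 1e12 * 8 / SECONDS_PER_DAY / 1e6
provisioned_mbps = sustained_mbps / UTILIZATION

print("sustained rate : %.0f Mbps" % sustained_mbps)    # ~93 Mbps
print("needed capacity: %.0f Mbps" % provisioned_mbps)  # ~185 Mbps, vs. 300 Mbps planned for 2003
```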
25
(No Transcript)
26
Networking Facilities
  • Provisioning of offsite network capacity at the Regional Center at FNAL (cont.)
  • Since CERN is directly connected to StarLight in Chicago (as US collaborators at universities are via Internet2), we propose that, in order to secure the availability of adequate functionality and bandwidth for the CMS R&D program, Fermilab provide direct connectivity at scalable data rates and without intervening Internet Service Providers before the DC04 pre-challenge data production
  • Though we are not limited to a specific implementation, we believe dark fiber between Fermilab and StarLight would be the most suitable way to get prepared for the future

27
Why Fiber?
  • The capacity needed is not otherwise affordable
  • The capabilities needed are not available (in time)
  • Cheaper in the long run
  • Insurance against monopoly behavior
  • Stable and predictable anchor points

28
National Light Rail Project Proposal
(Map: proposed National Light Rail footprint with nodes SEA, POR, SAC, SVL, FRE, LAX, SDG, PHO, OLG, DAL, JAC, WAL, ATL, NAS, RAL, STR, KAN, DEN, OGD, CHI, CLE, PIT, WDC, NYC, BOS.)
Proposed by Tom West
29
National Light Rail Lambda Route Map
(Route map: 15808 LH and ELH long-haul systems plus a 15540 metro system, with terminal, OADM and REGEN sites; Metro 10 GigE and OC-192 links, per-segment counts of 2 to 6; spanning Seattle, Portland, Boise, Sacramento, Sunnyvale, Fresno, Los Angeles, San Diego, Phoenix, Olga, Dallas, Walnut, Kansas, Denver, Ogden, Salt Lake City, Chicago/StarLight, Nashville, Atlanta, Raleigh, Stratford, Cleveland, Pittsburgh, Washington DC, New York City and Boston.)
30
LHCnet Network Late 2002
(Diagram: at CERN, Geneva, a Cisco 7609 (CERN), an Alcatel 7770 and a Cisco 7606 (DataTAG/CERN), a Juniper M10 (DataTAG/CERN), an Alcatel 1670 optical mux/demux and Linux PCs for performance tests and monitoring, with connections to GEANT, SWITCH, IN2P3 and WHO; linked by 2.5 Gbps (R&D) and 622 Mbps (production) circuits to the Caltech/DoE PoP at StarLight, Chicago, which hosts a Cisco 7609 and 7606 (Caltech/DoE), a Juniper M10 (Caltech/DoE), an Alcatel 7770 (DataTAG/CERN), an Alcatel 1670 mux/demux and Linux test/monitoring PCs, with peerings to Abilene, MREN, ESnet, STARTAP and NASA; used for development and tests.)
31
Networking
  • Immediate needs for R&D in three topic areas:
  • End-to-End Performance / Network Performance and Prediction
  • Closely related to work on the Storage Grid (SRM, etc.)
  • Alternative implementations of the TCP/IP stack
  • QoS and Differentiated Services, Bandwidth Brokering
  • Evaluate and eventually utilize the differentiated services framework as being implemented in Abilene and ESnet
  • Evaluate bandwidth brokers (e.g. GARA)
  • Virtual Private Networks (VPN)
  • Evaluate and eventually implement VPN technology over public network infrastructure for the CMS Production Grid
  • Other parties involved are CERN, Caltech, DataTAG, Internet2, ESnet, ...

32
Networking
  • Immediate needs for R&D
  • End-to-End Performance / Network Performance and Prediction
  • Closely related to work on the Storage Grid (SRM, etc.)
  • Transport protocols (e.g. GridFTP; work with the CS community)
  • System optimization w.r.t. concurrent storage/network traffic
  • Currently observed system shortfalls:
  • The Reno stack generally lacks gigabit-speed capabilities
  • End of scale reached due to severe equilibrium and stability problems
  • Using loss probability (i.e. inducing loss) for control leads to wild oscillations
  • The research community is paying a lot of attention to addressing optimization issues based on deployed network stack implementations
  • Customization of network kernel buffers (see the sketch after this list)
  • Greatly increased values yield improved throughput on non-congested links, BUT
  • they behave unfairly when competing with other streams
  • they need a very long time to recover from packet loss (see next slide)
  • they require manual negotiation/configuration of metrics
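The kernel-buffer customization mentioned above is, per socket, a setsockopt call; a minimal sketch (the 16 MB value is a placeholder, and system-wide limits such as net.core.rmem_max on Linux still cap what the kernel actually grants):

```python
# Sketch: request large TCP send/receive buffers so the window can cover a
# large bandwidth-delay product. Values are hypothetical.
import socket

BUF_BYTES = 16 * 1024 * 1024  # 16 MB, e.g. ~1 Gbps x ~120 ms RTT

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_BYTES)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_BYTES)

print("requested %d bytes, kernel granted snd=%d rcv=%d"
      % (BUF_BYTES,
         s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
         s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)))
s.close()
```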

33
Time to recover from a single loss
(Plot: TCP throughput CERN-StarLight, link running at 622 Mbps.)
  • TCP reactivity: due to the basic multiplicative-decrease and additive-increase algorithm used to handle packet loss
  • The time to increase the throughput by 120 Mbit/s is larger than 6 min for a connection between Fermilab and CERN (see the approximation below)
  • A single loss is disastrous
  • A TCP connection reduces its bandwidth use by half after a loss is detected (multiplicative decrease)
  • A TCP connection increases its bandwidth use only slowly (additive increase)
  • TCP is much more sensitive to packet loss in WANs than in LANs
From Sylvain Ravot / Caltech
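The quoted recovery time follows from the additive-increase rule of one MSS per RTT; a common back-of-envelope approximation (the worked numbers below are our rough check against the responsiveness table on the next slide, not taken from the slide itself):

```latex
% Additive increase adds one MSS per RTT, so regaining a bandwidth deficit \Delta B takes
T_{\mathrm{recover}} \;\approx\; \frac{\Delta B \cdot \mathrm{RTT}^{2}}{8 \cdot \mathrm{MSS}}
% After a single loss, multiplicative decrease halves the rate, i.e. \Delta B = C/2.
% CERN--StarLight: C = 622\ \mathrm{Mbps},\ \mathrm{RTT} = 120\ \mathrm{ms},\ \mathrm{MSS} = 1460\ \mathrm{B}
% \Rightarrow T_{\mathrm{recover}} \approx \frac{311\times 10^{6}\,(0.12)^{2}}{8 \cdot 1460} \approx 380\ \mathrm{s} \approx 6\ \mathrm{min}
```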
34
TCP Responsiveness
Case                             | Capacity  | RTT (ms)          | MSS (Byte)         | Responsiveness
Typical LAN in 1988              | 10 Mbps   | 2 (20 worst case) | 1460               | 1.5 ms (154 ms worst case)
Typical WAN in 1988              | 9.6 Kbps  | 40                | 1460               | 0.006 sec
Typical LAN today                | 100 Mbps  | 5 (worst case)    | 1460               | 0.096 sec
Current WAN link CERN-StarLight  | 622 Mbps  | 120               | 1460               | 6 minutes
Future WAN link CERN-StarLight   | 10 Gbit/s | 120               | 1460               | 92 minutes
Future WAN link CERN-StarLight   | 10 Gbit/s | 120               | 8960 (Jumbo Frame) | 15 minutes
From H. Newman
35
Iperf TCP throughput between CERN and StarLight
using the standard Stack
36
Networking
  • Immediate needs for R&D
  • End-to-End Performance / Network Performance and Prediction (cont.)
  • Need to actively pursue network stack implementations supporting ultrascale networking for rapid data transactions and data-intensive dynamic workspaces
  • Maintain statistical multiplexing and end-to-end flow control
  • Maintain functional compatibility with the Reno/TCP implementation
  • The FAST project has shown dramatic improvements over the Reno stack by moving from a loss-based to a delay-based congestion control mechanism (see the toy sketch below)
  • with standard segment size and fewer streams
  • Fermilab/CMS is a FAST partner
  • as a well-supported user with the FAST stack installed on facility R&D data servers (first results look very promising)
  • Aiming at installations/evaluations for integration with the production environment at CERN and Tier-2 sites
  • Work in collaboration with the Fermilab CCF department
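For intuition, a toy comparison of a loss-driven (Reno-style) window update with a delay-driven update in the spirit of FAST/Vegas; this is an illustration only, not the actual FAST algorithm, and the parameter values are arbitrary:

```python
# Toy congestion-window updates, applied once per RTT. Illustration only.
def reno_update(cwnd, loss):
    """Loss-based: halve on loss, otherwise add one segment per RTT."""
    return cwnd / 2 if loss else cwnd + 1

def delay_update(cwnd, rtt, base_rtt, alpha=100, gamma=0.5):
    """Delay-based: steer toward keeping roughly `alpha` packets queued,
    using the ratio of the uncongested (base) RTT to the measured RTT."""
    target = (base_rtt / rtt) * cwnd + alpha
    return (1 - gamma) * cwnd + gamma * target

print("reno cwnd after one loss:", reno_update(1000.0, loss=True))

cwnd = 1000.0
for step in range(5):
    cwnd = delay_update(cwnd, rtt=0.130, base_rtt=0.120)
    print("delay-based cwnd after RTT %d: %.0f" % (step + 1, cwnd))
```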

37
Iperf TCP throughput between CERN and StarLight
using the FAST Stack
38
Milestones
  • Data Storage and Data Access
  • Implementation of a Storage Resource Management system based on the SRM protocol and the respective data movement mechanisms: 08/2003
  • Data Access Optimization
  • Develop and implement a model to optimize data placement and data distribution in conjunction with the new persistency mechanism: 08/2003
  • File Systems and Advanced Disk Storage Technology
  • Development of a storage architecture using cluster file systems with intelligent storage devices; will implement a prototype: 12/2003
  • Resource Management
  • Develop tools for dynamic partitioning of compute elements (farms): 03/2003
  • Networking
  • Research on end-to-end performance optimization (WAN)
  • Develop a standard configuration for Tier-0/1/2 connectivity: 07/2003