Title: Research and Development on Storage and Data Access
1 Research and Development on Storage and Data Access
Michael Ernst, Fermilab
DOE/NSF Review, January 16, 2003
2 Anticipated System Architecture
PRODUCTION CLUSTER (>80 Dual Nodes)
MREN (OC3, shared)
ESNET (OC12, shared)
POPCRN
CISCO 6509
IGT
GYOZA
R&D
FRY
ENSTORE (17 DRIVES, shared)
NAS
DCACHE (>6 TB)
USER ANALYSIS
3 Projects
- Compute & Storage Elements: R&D on components and systems
- Cluster Management: generic farms, partitioning
  - Will be addressed in H. Wenzel's talk
- Storage Management and Access
  - Interface standardization and their implementation
  - Data set catalogs, metadata, replication, robust file transfers
- Networking: Terabyte throughput to T2, to CERN
  - Ultrascale network protocol stack
- Physics Analysis Center
  - Analysis cluster
  - Desktop support
  - Software distribution, software support, user support (Helpdesk)
  - Collaborative tools
  - Will be addressed in H. Wenzel's talk
- VO management and security
- Worked out a plan for R&D and deployment
- Need to develop an operations scenario
4 Monte Carlo Production with MOP
5 User access to Tier-1 (Jets/MET, Muons)
- ROOT/POOL interface (TDCacheFile)
- AMS server
- AMS/Enstore interface
- AMS/dCache interface
dCache
Objects
Network
- Users in
  - Wisconsin
  - CERN
  - FNAL
  - Texas
Enstore
NAS/RAID
6 R&D on Storage and Data Access
PRODUCTION CLUSTER (>80 Dual Nodes)
MREN (OC3, shared)
ESNET (OC12, shared)
POPCRN
CISCO 6509
IGT
GYOZA
R&D
FRY
ENSTORE (17 DRIVES, shared)
NAS
DCACHE
USER ANALYSIS
7 dCache Placement
Tertiary Storage Systems
dCache
Enstore
Application (e.g. dccp)
dCap library
OSM
HSM X
Local Disk
PNFS Namespace Manager
xxxFTP
GRID Access Method(s)
- Application viewpoint
  - POSIX-compliant interface
  - Preload library
  - ROOT/POOL interface
- Unique namespace throughout the HRM
- Transparent access to Tertiary Storage
Applications
8 Network Attached Storage (NAS)
Growing market offering emerging technologies,
e.g. Zambeel's Aztera System Architecture
9 Data Access and Distribution Server
- Addresses 3 areas
  - Data Server for Production and Analysis
    - Sufficient space to store the entire annually produced data
    - Have implemented an HRM system at Tier-1 based on dCache (DRM) / Enstore (TRM)
  - Shared Workgroup Space and User Home Area
    - Replaces AFS at FNAL
      - Lack of flexibility (technical constraints)
      - Insufficient performance and resources
    - Improve performance and reliability
    - Minimize administration and management overhead
  - Backup and Archiving of User Data
    - Today's limitations
      - Central backup system for data in AFS space only
      - Using FNAL MSS (Enstore) to manually archive files/datasets
- Developed a Benchmark Suite for System Evaluation and Acceptance Testing
  - Benchmark tools along with control and analysis scripts to produce a standardized report
  - Throughput, file operations, data integrity
  - Enormously reduces the effort required for evaluation and provides a fair comparison of in-house developed and commercial products
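A minimal version of the throughput part of such a benchmark can be sketched in Python. This is an illustrative stand-in, not the actual FNAL suite; the file and block sizes are arbitrary assumptions:

```python
import os
import time

def write_read_throughput(path, size_mb=64, block_kb=256):
    """Measure sequential write and read throughput in MB/s for one file.

    Illustrative only: a full acceptance test would also exercise file
    operations (create/stat/unlink rates) and verify data integrity.
    """
    block = os.urandom(block_kb * 1024)
    n_blocks = (size_mb * 1024) // block_kb

    t0 = time.time()
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # charge the flush-to-disk to the write time
    write_mbps = size_mb / (time.time() - t0)

    t0 = time.time()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    read_mbps = size_mb / (time.time() - t0)
    return write_mbps, read_mbps
```

Pointing the function at a mount point of the system under test and aggregating results over many concurrent clients yields the kind of standardized report the benchmark suite produces.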
10 Storage System Test Suite
11 Benchmark Tools
12 Interactive Product Development Process
Performance achieved initially and after system tuning
Performance depending on Client OS
Single File Read Performance
Solaris
Linux
13 Data Access and Distribution Server
- Benchmark Suite for System Evaluation and Acceptance Testing
- Developed a CMS application-related test suite (based on ROOT I/O) which we used to evaluate the dCache-based DRM
Aggregated throughput reading out of dCache with multiple clients
14 Data Access and Distribution Server
- Effective throughput per client
15 Data Access and Distribution Server
- Time a Mover is active () for each process. With max. 40 Movers, some processes are waiting for a Mover to become available
16 Data Access and Distribution Server
- Client rate distribution when reading out of dCache
17 Storage and Data Access
- Viewpoints of CMS Event Data in the CMS Data Grid System
  - High-level data views in the minds of physicists
  - High-level data views in physics analysis tools
  - Virtual data product collections (highest-level common view across CMS)
  - Materialized data product collections
  - File sets (sets of logical files with high-level significance)
  - Logical files
  - Physical files on sites (device-location-independent view)
  - Physical files on storage devices (lowest-level generic view of files)
  - Device-specific files
R&D focusing on a common interface
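The layered views above can be made concrete with a toy mapping from the highest-level view down to per-site physical files. Every collection, file, and path name below is hypothetical; real CMS catalogs were of course far richer:

```python
# Illustrative sketch of the layered CMS data views described above.
# All collection, file, and site names here are hypothetical.

virtual_collections = {            # highest-level common view across CMS
    "jets_met_2003": ["fileset_jm_01"],
}
file_sets = {                      # sets of logical files with high-level significance
    "fileset_jm_01": ["lfn:jm_events_001", "lfn:jm_events_002"],
}
replica_catalog = {                # logical file -> physical files on sites
    "lfn:jm_events_001": {"FNAL": "pfn://fnal/dcache/jm_events_001"},
    "lfn:jm_events_002": {"FNAL": "pfn://fnal/dcache/jm_events_002",
                          "CERN": "pfn://cern/castor/jm_events_002"},
}

def resolve(collection, site):
    """Walk the views top-down: collection -> file sets -> logical files
    -> physical files at a given site. Device-specific files sit below
    this level and are handled by the local storage system."""
    pfns = []
    for fs in virtual_collections[collection]:
        for lfn in file_sets[fs]:
            replicas = replica_catalog[lfn]
            if site in replicas:
                pfns.append(replicas[site])
    return pfns
```

A common interface, as the slide notes, would let tools query any of these layers without caring how the layers below are materialized.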
18 Storage Resource Management
Site hosting the Application
Client
Client
Logical Query
Today this is the responsibility of the application (invoking some higher-level middleware components, e.g. Condor)
property index
Logical Files
Request Interpreter
site-specific file requests
site-specific files
Replica Catalog
request planning
Request Executor
DRM
MDS
pinning file transfer requests
Network
Tier1
Tier2
HRM
DRM
dCache
? (dCache, dFarm, DRM, NeST)
Enstore
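The request flow in this diagram can be sketched as a toy pipeline: a Request Interpreter maps logical files to site-specific files via a replica catalog, and a Request Executor pins each file in the DRM before issuing transfer requests. Component behavior is deliberately simplified and all names are hypothetical:

```python
# Toy sketch of the SRM request flow on the slide; not real SRM code.

class ReplicaCatalog:
    def __init__(self, entries):
        self.entries = entries            # lfn -> list of (site, pfn)

    def lookup(self, lfn):
        return self.entries[lfn]

class DRM:
    """Disk Resource Manager: pins files so they stay on disk for a transfer."""
    def __init__(self):
        self.pinned = set()

    def pin(self, pfn):
        self.pinned.add(pfn)

def execute_request(lfns, catalog, drm, preferred_site="Tier1"):
    """Request Interpreter + Executor: pick a replica per logical file
    (preferring one site), pin it, and emit the transfer-request list."""
    transfers = []
    for lfn in lfns:
        replicas = dict(catalog.lookup(lfn))
        pfn = replicas.get(preferred_site) or next(iter(replicas.values()))
        drm.pin(pfn)
        transfers.append((lfn, pfn))
    return transfers

catalog = ReplicaCatalog({
    "lfn:evt_100": [("Tier1", "dcache://tier1/evt_100")],
    "lfn:evt_101": [("Tier2", "dfarm://tier2/evt_101")],
})
drm = DRM()
plan = execute_request(["lfn:evt_100", "lfn:evt_101"], catalog, drm)
```

In the real system, request planning would also consult MDS and the property index shown in the diagram before choosing replicas.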
19 Storage Resource Management
20 R&D on Components for Data Storage and Data Access
- Approach: develop a Storage Architecture, define components and interfaces
  - This will include Storage/Data Management, Data Access, Catalogs, Robust Data Movement, etc.
- Storage-system related R&D issues
  - Detailed analysis of the SRM and GridFTP specifications, including identification of the initial version of the protocols to be used and discussion of any connective middleware w.r.t. interoperability. Coordination with Tier-0/1/2 and LCG. Goal is to effect transfers and support replica managers.
  - Protocol elements include features from GridFTP, SRM
  - At Tier-2 centers: selection of a Temporary Store implementation supporting SRM and GridFTP (incl. evaluation of interoperability issues with the Tier-1 center)
    - dCache, dFarm, DRM, NeST, DAP
  - At the Tier-1 center: provide an SRM/dCache interface for the FNAL/dCache implementation compatible with the criteria above
  - Track compatibility with LCG (incl. the Tier-0 center at CERN) as their plan evolves
- Have developed a resource-loaded WBS
- Work will be carried out jointly with CD/CCF and the SRM Collaboration
- Further planning required to incorporate Replica Managers / Replica Location Service
21 The Need for Improved Storage Devices and File Systems
- CMS is currently in the process of developing the Data Model
  - Data Storage and Data Access are the most demanding problems
- The choice of OS and persistency solution can strongly influence the hardware needs (and the human resources required to support them)
- Moving away from a Persistency Model based on OODB
  - Problem mapping objects to files
- The new model should be developed with a focus on optimization of the underlying storage architecture and storage technology
- Classic filesystems are at the limit of their scaling capabilities
22 OSD Architecture
Application File Manager
Meta Operation
Object Manager
LAN/SAN
Data Transfer
Security
OSD: Intelligent Storage Device
23 Data Flow at Regional Center
Mass Storage, Disk Servers, Database Servers
Tier 2
Network from CERN
Data Export
Data Import
Local institutes
Network from Tier 2, simulation centers
Production Reconstruction: Raw/Sim -> ESD; scheduled, predictable; experiment/physics groups
Production Analysis: ESD -> AOD, AOD -> DPD; scheduled; physics groups
Individual Analysis: AOD -> DPD and plots; chaotic; physicists
CERN
Tapes
Tapes
Desktops
Physics Software Development
R&D Systems and Testbeds
Info servers, Code servers
Web Servers, Telepresence Servers
Training, Consulting, Help Desk
Worked out by the MONARC Project
24 Networking
- Provisioning of offsite network capacity at the Regional Center at FNAL
  - In general: shared wide-area HEP networking
  - A vital resource in the CMS multi-tier based Computing Model
- Sharing a 622 Mbps link (best effort) to ESnet with all Fermilab experiments, primarily CDF and D0, each with requirements of > 300 Mbps
  - Led to shortfalls during spring data production for the DAQ TDR, with peak requirements of > 200 Mbps while the link was still at 155 Mbps
  - The upgrade to 622 Mbps was delayed for 5 months while the link was completely saturated by CDF/D0 traffic over many hours/day
  - Uncertain if the upgrade to OC48 planned for 2004 will be in time
- US CMS requirements (Tier-0/1, Tier-1/2) according to the planned Data Challenges: DC04 (5%) in 2003/4, DC05 (10%) in 2004/5, and DC06 (20%) in 2005/6
  - With probably only 2 Regional Centers involved in DC04, we will have to transfer 1 TB/day starting in Q3/2003
- (*) Numbers assume 50% link utilization

                          2003  2004  2005
Installed (*) BW in Mbps   300   600   800
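As a sanity check on these figures, the sustained rate implied by 1 TB/day, and the installed capacity that requires at 50% link utilization, can be computed directly (a rough estimate, taking 1 TB as 10^12 bytes):

```python
def required_mbps(tb_per_day, utilization=0.5):
    """Installed bandwidth (Mbps) needed to move tb_per_day terabytes/day
    when only `utilization` of the link can be used on average."""
    bits_per_day = tb_per_day * 1e12 * 8
    sustained_mbps = bits_per_day / 86400 / 1e6
    return sustained_mbps / utilization

# 1 TB/day sustains ~93 Mbps; at 50% utilization that means ~185 Mbps
# of installed capacity, comfortably within the 300 Mbps shown for 2003.
print(round(required_mbps(1), 1))
```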
25 (No Transcript)
26 Networking Facilities
- Provisioning of offsite network capacity at the Regional Center at FNAL (cont.)
- Since CERN is directly connected to StarLight in Chicago (as US collaborators at universities are via Internet2), we propose that, in order to secure availability of adequate functionality and bandwidth for the CMS R&D Program, Fermilab provide direct connectivity at scalable data rates and without intervening Internet Service Providers before DC04 Pre-Challenge Data Production
- Though we are not limited to a specific implementation, we believe Dark Fiber between Fermilab and StarLight would be the most suitable way to prepare for the future
27 Why Fiber?
- Capacity needed is not otherwise affordable
- Capabilities needed are not available (in time)
- Cheaper in the long run
- Insurance against monopoly behavior
- Stable and predictable anchor points
28 National Light Rail Project Proposal
SEA
POR
SAC
BOS
NYC
CHI
OGD
DEN
SVL
CLE
WDC
PIT
FRE
KAN
RAL
NAS
STR
LAX
PHO
WAL
ATL
SDG
OLG
DAL
JAC
Proposed by Tom West
29 National Light Rail Lambda Route Map
[Figure: lambda route map; legend: REGEN, TERMINAL, OADM, Metro 10 GigE. Nodes: Seattle, Portland, Boise, Sacramento, Sunnyvale, Fresno, Los Angeles, San Diego, Phoenix, Walnut, Olga, Dallas, Salt Lake City, Ogden, Denver, Kansas, Chicago (StarLight), Nashville, Atlanta, Cleveland, Pittsburgh, Washington DC, New York City, Boston, Stratford, Raleigh. Links built from 15808 LH and ELH systems, a 15540 Metro system, 10 GigE, and OC192, with per-segment lambda counts of 2-6.]
30 LHCnet Network, Late 2002
GEANT
Switch
IN2P3
WHO
CERN - Geneva
Alcatel 7770 DataTAG (CERN)
Cisco 7606 DataTAG (CERN)
Juniper M10 DataTAG (CERN)
Linux PC for performance tests & monitoring
Cisco 7609 CERN
Optical Mux/Demux Alcatel 1670
2.5 Gbps (R&D)
622 Mbps (Prod.)
Linux PC for performance tests & monitoring
Optical Mux/Demux Alcatel 1670
Cisco 7609 Caltech (DoE)
Cisco 7606 Caltech (DoE)
Juniper M10 Caltech (DoE)
Alcatel 7770 DataTAG (CERN)
Caltech/DoE PoP, StarLight Chicago
Abilene
MREN
ESnet
STARTAP
NASA
Development and tests
31 Networking
- Immediate needs for R&D in three topic areas
- End-to-End Performance / Network Performance and Prediction
  - Closely related to work on Storage Grid (SRM etc.)
  - Alternative implementations of the TCP/IP stack
- QoS and Differentiated Services, Bandwidth Brokering
  - Evaluate and eventually utilize the differentiated services framework as being implemented in Abilene and ESnet
  - Evaluate bandwidth brokers (e.g. GARA)
- Virtual Private Networks (VPN)
  - Evaluate and eventually implement VPN technology over public network infrastructure for the CMS Production Grid
- Other parties involved are CERN, Caltech, DataTAG, Internet2, ESnet, ...
32 Networking
- Immediate needs for R&D
- End-to-End Performance / Network Performance and Prediction
  - Closely related to work on Storage Grid (SRM etc.)
  - Transport protocols (e.g. GridFTP; work with the CS community)
  - System optimization w.r.t. concurrent storage/network traffic
- Currently observed system shortfalls
  - The Reno stack in general lacks gigabit-speed capabilities
    - End of scale reached due to severe equilibrium and stability problems
    - Using loss (i.e. inducing loss) probability for control -> wild oscillations
  - The research community is paying a lot of attention to optimization issues based on deployed network stack implementations
  - Customization of network kernel buffers
    - Largely increased values yield improved throughput on non-congested links, BUT
      - behaves unfairly when competing with other streams
      - needs a very long time to recover from packet loss (see next slide)
      - requires manual negotiation/configuration of metrics
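The "largely increased values" for kernel buffers come from the bandwidth-delay product: to keep a long fat pipe busy, the TCP window (and hence the socket buffer) must hold a full BDP of data in flight. A quick calculation for the CERN-StarLight path discussed in these slides (illustrative; the buffer figures are not from the slides):

```python
def bdp_bytes(capacity_bps, rtt_s):
    """Bandwidth-delay product: bytes in flight needed to fill the pipe."""
    return capacity_bps * rtt_s / 8

# 622 Mbps at 120 ms RTT: the socket buffer must hold roughly 9.3 MB,
# far above the default kernel buffers (often 64 KB) of 2003-era stacks,
# which is why manual tuning was required.
mb = bdp_bytes(622e6, 0.120) / 1e6
print(f"{mb:.1f} MB")
```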
33 Time to Recover from a Single Loss
TCP throughput CERN-StarLight (link running at 622 Mbps)
- TCP reactivity: due to the basic Multiplicative-Decrease and Additive-Increase algorithm used to handle packet loss
  - Time to increase the throughput by 120 Mbit/s is larger than 6 min for a connection between Fermilab and CERN
- A single loss is disastrous
  - A TCP connection reduces its bandwidth use by half after a loss is detected (multiplicative decrease)
  - A TCP connection increases its bandwidth use slowly (additive increase)
  - TCP is much more sensitive to packet loss in WANs than in LANs
From Sylvain Ravot / Caltech
34 TCP Responsiveness

Case                             Capacity   RTT (ms)        MSS (Byte)          Responsiveness
Typical LAN in 1988              10 Mbps    2 - 20          1460                1.5 ms - 154 ms
Typical WAN in 1988              9.6 Kbps   40              1460                0.006 sec
Typical LAN today                100 Mbps   5 (worst case)  1460                0.096 sec
Current WAN link CERN-StarLight  622 Mbps   120             1460                6 minutes
Future WAN link CERN-StarLight   10 Gbit/s  120             1460                92 minutes
Future WAN link CERN-StarLight   10 Gbit/s  120             8960 (Jumbo Frame)  15 minutes

From H. Newman
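The responsiveness figures in the table follow directly from AIMD: after one loss the congestion window halves, then grows by one MSS per RTT, so regaining the lost half of the rate takes on the order of C * RTT^2 / (16 * MSS) seconds. A sketch reproducing the magnitude of the WAN rows (small discrepancies with the table are expected, since its exact assumptions are not stated):

```python
def recovery_time_s(capacity_bps, rtt_s, mss_bytes):
    """Time for standard (Reno-style) TCP to regain full rate after one loss:
    the window drops from W to W/2 segments and grows by one MSS per RTT,
    so recovery takes W/2 RTTs, with W = capacity * RTT / (8 * MSS)."""
    w_segments = capacity_bps * rtt_s / (8 * mss_bytes)
    return (w_segments / 2) * rtt_s

# CERN-StarLight, 622 Mbps, 120 ms RTT, MSS 1460: about 6.4 minutes,
# consistent with the "6 minutes" row in the table.
print(round(recovery_time_s(622e6, 0.120, 1460) / 60, 1))
```

The same formula shows why jumbo frames help: the 8960-byte MSS row recovers roughly six times faster than the 1460-byte row at 10 Gbit/s.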
35 Iperf TCP throughput between CERN and StarLight using the standard stack
36 Networking
- Immediate needs for R&D
- End-to-End Performance / Network Performance and Prediction (cont.)
  - Need to actively pursue network stack implementations supporting Ultrascale Networking for rapid data transactions and data-intensive dynamic workspaces
    - Maintain statistical multiplexing and end-to-end flow control
    - Maintain functional compatibility with the Reno/TCP implementation
  - The FAST project has shown dramatic improvements over the Reno stack by moving from loss-based congestion control to a delay-based control mechanism
    - with standard segment size and fewer streams
  - Fermilab/CMS is a FAST partner
    - as a well-supported user having the FAST stack installed on Facility R&D Data Servers (first results look very promising)
  - Aiming at installations/evaluations for integration with the production environment at CERN and Tier-2 sites
  - Work in collaboration with the Fermilab CCF Department
37 Iperf TCP throughput between CERN and StarLight using the FAST stack
38 Milestones
- Data Storage and Data Access
  - Implementation of a Storage Resource Management system based on the SRM protocol and the respective data movement mechanisms (08/2003)
- Data Access Optimization
  - Develop and implement a model to optimize data placement and data distribution in conjunction with the new persistency mechanism (08/2003)
- File Systems and Advanced Disk Storage Technology
  - Development of a storage architecture using cluster file systems with intelligent storage devices; will implement a prototype (12/2003)
- Resource Management
  - Develop tools for dynamic partitioning of Compute Elements (Farms) (03/2003)
- Networking
  - Research on end-to-end performance optimization (WAN)
  - Develop a standard configuration for Tier-0/1/2 connectivity (07/2003)