Research and Development on Storage and Data Access

Transcript and Presenter's Notes
1
Research and Development on Storage and Data
Access
  • Michael Ernst, Fermilab
    DOE/NSF Review, January 16, 2003

2
Anticipated System Architecture
(Diagram) Production cluster (>80 dual nodes) with the POPCRN, IGT, GYOZA, RD, and FRY farms attached to a Cisco 6509 switch; shared WAN links via MREN (OC3) and ESnet (OC12); Enstore (17 drives, shared); NAS; dCache (>6 TB); user analysis systems.
3
Projects
  • Compute & Storage Elements: R&D on Components and Systems
  • Cluster Management: Generic Farms, Partitioning
  • Will be addressed in H. Wenzel's talk
  • Storage Management and Access
  • Interface Standardization and its implementation
  • Data set catalogs, metadata, replication, robust file transfers
  • Networking: Terabyte throughput to T2, to CERN
  • Ultrascale Network Protocol Stack
  • Physics Analysis Center
  • Analysis cluster
  • Desktop support
  • Software Distribution, Software Support, User Support / Helpdesk
  • Collaborative tools
  • Will be addressed in H. Wenzel's talk
  • VO management and security
  • Worked out a plan for R&D and Deployment
  • Need to develop an operations scenario

4
Monte Carlo Production with MOP
5
User access to Tier-1 (Jets/Met, Muons)
  • ROOT/POOL interface (TDCacheFile)
  • AMS server
  • AMS/Enstore interface
  • AMS/dCache interface (access sketch below)

(Diagram) Objects served out of dCache, Enstore, and NAS/RAID over the network to users in Wisconsin, CERN, FNAL, and Texas.
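A minimal sketch of how a client might read such objects through the ROOT/dCap path; the door host, port, PNFS path, and tree name are hypothetical, and this assumes a ROOT build with dCache (TDCacheFile) support and PyROOT available.

```python
# Sketch: reading a ROOT file served by dCache via the dCap protocol.
# The dcap:// URL below is hypothetical; TFile::Open dispatches it to the
# TDCacheFile plugin when ROOT is built with dCache support.
import ROOT

url = "dcap://dcache-door.example.fnal.gov:22125/pnfs/fnal.gov/cms/sample.root"
f = ROOT.TFile.Open(url)          # returns a TDCacheFile behind the TFile interface
if f and not f.IsZombie():
    tree = f.Get("Events")        # "Events" is a placeholder tree name
    if tree:
        print("entries:", tree.GetEntries())
    f.Close()
```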
6
R&D on Storage and Data Access
(Diagram) Same facility layout as slide 2: production cluster (>80 dual nodes), POPCRN, IGT, GYOZA, RD, and FRY farms on a Cisco 6509 switch, shared MREN (OC3) and ESnet (OC12) links, Enstore (17 drives, shared), NAS, dCache, and user analysis systems.
7
dCache Placement
(Diagram) Applications (e.g. dccp, xxxFTP, GRID access methods) reach dCache through the dCap library and the PNFS namespace manager; dCache uses local disk as cache in front of tertiary storage systems such as Enstore, OSM, or another HSM.
  • Application viewpoint
  • POSIX-compliant interface
  • Preload library (see the copy sketch below)
  • ROOT/POOL interface
  • Unique namespace throughout the HRM
  • Transparent access to tertiary storage
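As an illustration of this access path, a small sketch that copies a file out of the dCache/PNFS namespace with the dccp client; the PNFS path and local destination are placeholders.

```python
# Sketch: stage a file out of dCache using the dccp client.
# dccp resolves the file through the PNFS namespace and streams it via dCap.
import subprocess

src = "/pnfs/fnal.gov/usr/cms/production/run123/hits.root"   # hypothetical path
dst = "/scratch/hits.root"

result = subprocess.run(["dccp", src, dst], capture_output=True, text=True)
if result.returncode != 0:
    raise RuntimeError(f"dccp failed: {result.stderr.strip()}")
print(f"copied {src} -> {dst}")
```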
8
Network Attached Storage (NAS)
A growing market offering emerging technologies, e.g. Zambeel's Aztera system architecture.
9
Data Access and Distribution Server
  • Addresses 3 areas
  • Data Server for Production and Analysis
  • Sufficient space to store the entire annually produced data
  • Have implemented an HRM system at Tier-1 based on dCache (DRM) / Enstore (TRM)
  • Shared Workgroup Space and User Home Area
  • Replaces AFS at FNAL
  • Lack of flexibility (technical constraints)
  • Insufficient performance and resources
  • Improve performance and reliability
  • Minimize administration and management overhead
  • Backup and Archiving of User Data
  • Today's limitations
  • Central backup system exists for data in AFS space only
  • Using FNAL MSS (Enstore) to manually archive files/datasets
  • Developed a Benchmark Suite for System Evaluation and Acceptance Testing (see the sketch below)
  • Benchmark tools along with control and analysis scripts to produce a standardized report
  • Throughput, File Operations, Data Integrity
  • Enormously reduces the effort required for evaluation and provides a fair comparison of in-house developed and commercial products
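A minimal sketch of the kind of measurement such a suite automates: sequential write/read throughput plus a checksum-based data-integrity check. The mount point, file size, and report format are illustrative, not the actual benchmark tools.

```python
# Sketch: measure write/read throughput and verify data integrity via checksums.
import hashlib, os, time

target = "/mnt/nas-under-test/bench.dat"   # hypothetical mount point of system under test
size_mb = 512
block = os.urandom(1 << 20)                # 1 MiB of random data

# Expected checksum of the whole file (the same block written size_mb times).
expected = hashlib.md5()
for _ in range(size_mb):
    expected.update(block)

t0 = time.time()
with open(target, "wb") as f:
    for _ in range(size_mb):
        f.write(block)
    f.flush(); os.fsync(f.fileno())
write_mbps = size_mb / (time.time() - t0)

t0 = time.time()
actual = hashlib.md5()
with open(target, "rb") as f:
    while chunk := f.read(1 << 20):
        actual.update(chunk)
read_mbps = size_mb / (time.time() - t0)

ok = actual.hexdigest() == expected.hexdigest()
print(f"write {write_mbps:.1f} MB/s, read {read_mbps:.1f} MB/s, "
      f"integrity {'OK' if ok else 'FAIL'}")
```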

10
Storage System Test Suite
11
Benchmark Tools
12
Interactive Product Development Process
(Charts) Single File Read Performance by client OS (Solaris vs. Linux): performance achieved initially and after system tuning.
13
Data Access and Distribution Server
  • Benchmark Suite for System Evaluation and Acceptance Testing
  • Developed a CMS application related test suite (based on ROOT I/O) which we used to evaluate the dCache based DRM (see the sketch below)

(Chart) Aggregated throughput reading out of dCache with multiple clients
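A sketch of the kind of multi-client read test this refers to: several processes reading the same ROOT tree out of dCache via dCap URLs and reporting their individual rates. The door address, file path, tree name, and client count are hypothetical.

```python
# Sketch: N parallel clients reading a ROOT tree served by dCache over dCap,
# each reporting its effective read rate. URL and tree name are placeholders.
import time
from multiprocessing import Pool

URL = "dcap://dcache-door.example.fnal.gov:22125/pnfs/fnal.gov/cms/events.root"

def read_client(client_id):
    import ROOT                              # import inside the worker process
    f = ROOT.TFile.Open(URL)
    tree = f.Get("Events")                   # placeholder tree name
    t0 = time.time()
    nbytes = sum(tree.GetEntry(i) for i in range(tree.GetEntries()))
    rate = nbytes / (time.time() - t0) / 1e6
    f.Close()
    return client_id, rate

if __name__ == "__main__":
    with Pool(processes=8) as pool:          # 8 concurrent clients
        for cid, rate in pool.map(read_client, range(8)):
            print(f"client {cid}: {rate:.1f} MB/s")
```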
14
Data Access and Distribution Server
  • Effective Throughput per Client

15
Data Access and Distribution Server
  • Time a Mover is active () for each process. With a maximum of 40 Movers, some processes are waiting for a Mover to become available

16
Data Access and Distribution Server
  • Client Rate Distribution when reading out of
    dCache

17
Storage and Data Access
  • Viewpoints of CMS Event Data in the CMS Data Grid System
  • High-level data views in the minds of physicists
  • High-level data views in physics analysis tools
  • Virtual data product collections (highest-level common view across CMS)
  • Materialized data product collections
  • File sets (sets of logical files with high-level significance)
  • Logical files
  • Physical files on sites (device location independent view)
  • Physical files on storage devices (lowest-level generic view of files)
  • Device-specific files

R&D focusing on a common interface (see the sketch below)
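A small sketch of how these layered views could be represented in code. It is purely illustrative: the level names follow the list above, everything else is an assumption.

```python
# Sketch: the layered views of CMS event data, from the most abstract
# (virtual data product collections) down to device-specific files.
from enum import IntEnum

class DataView(IntEnum):
    VIRTUAL_PRODUCT_COLLECTION = 1      # highest-level common view across CMS
    MATERIALIZED_PRODUCT_COLLECTION = 2
    FILE_SET = 3                        # set of logical files with high-level significance
    LOGICAL_FILE = 4
    PHYSICAL_FILE_ON_SITE = 5           # device location independent view
    PHYSICAL_FILE_ON_DEVICE = 6         # lowest-level generic view of files
    DEVICE_SPECIFIC_FILE = 7

def lower_layers(view: DataView):
    """Return the views a common interface would have to map this view onto."""
    return [v for v in DataView if v > view]

print([v.name for v in lower_layers(DataView.LOGICAL_FILE)])
```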
18
Storage Resource Management
(Diagram) Site hosting the application: clients issue a Logical Query; a Request Interpreter uses the property index to map it to Logical Files; request planning, using the Replica Catalog and MDS, produces site-specific file requests; a Request Executer issues pinning and file transfer requests to the local DRM and, over the network, to Tier1/Tier2 storage (Tier1: HRM over dCache and Enstore; Tier2: DRM over dCache, dFarm, DRM, or NeST). Today this is the responsibility of the application, invoking some higher-level middleware components (e.g. Condor). A request-flow sketch follows below.
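The sketch below is a highly simplified rendering of that flow; every name here is a stand-in for a real component (request interpreter, replica catalog, request executer), not an actual SRM API.

```python
# Sketch of the request flow in the diagram above: logical query -> logical
# files -> site-specific replicas -> pin + transfer. All names are stand-ins.
class DummyDRM:
    """Stand-in for a Disk Resource Manager that just logs its actions."""
    def pin(self, url):            print("pin     ", url)
    def transfer(self, url, dest): print("transfer", url, "->", dest)
    def unpin(self, url):          print("unpin   ", url)

def interpret_query(logical_query, property_index):
    """Request Interpreter: map a logical query to logical file names."""
    return [lf for lf, props in property_index.items() if logical_query(props)]

def plan_requests(logical_files, replica_catalog):
    """Request planning: pick a site-specific replica for each logical file."""
    return {lf: replica_catalog[lf][0] for lf in logical_files}   # naive: first replica

def execute(site_files, drm):
    """Request Executer: pin each replica, transfer it, then release the pin."""
    for lf, url in site_files.items():
        drm.pin(url)
        drm.transfer(url, dest=lf)
        drm.unpin(url)

# Tiny illustrative run with hypothetical catalogs.
property_index  = {"run123.evt": {"dataset": "jets"}, "run124.evt": {"dataset": "muons"}}
replica_catalog = {"run123.evt": ["srm://tier1.fnal.gov/cms/run123.evt"],
                   "run124.evt": ["srm://tier2.example.edu/cms/run124.evt"]}

wanted = interpret_query(lambda p: p["dataset"] == "jets", property_index)
execute(plan_requests(wanted, replica_catalog), DummyDRM())
```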
19
Storage Resource Management
Site hosting the Application
Client
Client
Logical Query
property index
Logical Files
Request Interpreter
site-specific file requests
site-specific files
Replica Catalog
request planning
Request Executer
DRM
MDS
pinning file transfer requests
Network
Tier1
Tier2
HRM
DRM
dCache
? (dCache, dFarm, DRM, NeST)
Enstore
20
R&D on Components for Data Storage and Data Access
  • Approach: develop a Storage Architecture, define Components and Interfaces
  • This will include Storage/Data Management, Data Access, Catalogs, Robust Data Movement, etc.
  • Storage System related R&D issues
  • Detailed analysis of the SRM and GridFTP specifications, including identification of the initial version of the protocols to be used and discussion of any connective middleware w.r.t. interoperability. Coordination with Tier0/1/2 and LCG. Goal is to effect transfers and support replica managers (see the retry sketch below).
  • Protocol elements include features from GridFTP, SRM
  • At Tier2 centers: selection of a Temporary Store implementation supporting SRM and GridFTP (incl. evaluation of interoperability issues with the Tier1 center)
  • dCache, dFarm, DRM, NeST, DAP
  • At Tier1 center: provide an SRM/dCache interface for the FNAL/dCache implementation compatible with the criteria above
  • Track compatibility with LCG (incl. Tier0 center at CERN) as their plan evolves
  • Have developed a resource-loaded WBS
  • Work will be carried out jointly with CD/CCF and the SRM Collaboration
  • Further planning required to incorporate Replica Managers / Replica Location Service
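A sketch of what robust data movement can look like at the client level: a retry loop with backoff around a grid transfer command. globus-url-copy is shown as a representative GridFTP client; the URLs, attempt count, and delays are illustrative.

```python
# Sketch: robust wide-area file transfer, retrying a GridFTP copy with
# exponential backoff. URLs and retry policy are placeholders.
import subprocess, time

def robust_copy(src_url, dst_url, attempts=5):
    delay = 30                                       # seconds before first retry
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(["globus-url-copy", src_url, dst_url])
        if proc.returncode == 0:
            return True                              # transfer succeeded
        print(f"attempt {attempt} failed, retrying in {delay}s")
        time.sleep(delay)
        delay *= 2                                   # exponential backoff
    return False

ok = robust_copy(
    "gsiftp://cmsgrid.fnal.gov/data/run123/file.root",       # hypothetical source
    "gsiftp://tier2.example.edu/store/run123/file.root")     # hypothetical destination
```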

21
The Need for Improved Storage Devices and File
Systems
  • CMS is currently in the process of developing the Data Model
  • Data Storage and Data Access are the most demanding problems
  • The choice of OS and persistency solution can strongly influence the hardware needs (and the human resources required to support them)
  • Moving away from a persistency model based on an OODB
  • Problem: mapping objects to files
  • The new model should be developed with a focus on optimization of the underlying storage architecture and storage technology
  • Classic filesystems are at the limit of their scaling capabilities

22
OSD Architecture
(Diagram) The Application / File Manager sends meta operations to an Object Manager, while data transfer (with security) goes directly over the LAN/SAN to the OSD, an intelligent storage device.
23
Data Flow at Regional Center
(Diagram, worked out by the MONARC Project) Data flow at a Regional Center: data import from CERN and from Tier 2 / simulation centers; data export to Tier 2 and local institutes. Mass storage, disk servers, and database servers feed three workloads:
  • Production Reconstruction (Raw/Sim → ESD): scheduled, predictable; experiment / physics groups
  • Production Analysis (ESD → AOD, AOD → DPD): scheduled; physics groups
  • Individual Analysis (AOD → DPD and plots): chaotic; physicists
Supporting services: tapes, desktops, physics software development, R&D systems and testbeds, info/code servers, web/telepresence servers, training, consulting, help desk.
24
Networking
  • Provisioning of Offsite Network Capacity at the Regional Center at FNAL
  • In general: shared wide-area HEP networking
  • Vital resource in the CMS multi-tier based Computing Model
  • Sharing a 622 Mbps link (best effort) to ESnet with all Fermilab experiments, primarily CDF and D0, each with requirements of > 300 Mbps
  • Led to shortfalls during spring data production for the DAQ TDR, with peak requirements of > 200 Mbps when the link was still at 155 Mbps
  • Upgrade to 622 Mbps was delayed for 5 months while the link was completely saturated by CDF/D0 traffic over many hours/day
  • Uncertain if the upgrade to OC48 planned for 2004 will be in time
  • US CMS requirements (Tier0/1 and Tier1/2) according to the planned Data Challenges DC04 (5%) in 2003/4, DC05 (10%) in 2004/5 and DC06 (20%) in 2005/6
  • With probably only 2 Regional Centers involved in DC04 we will have to transfer 1 TB/day starting in Q3/2003
  • (*) Numbers according to 50% link utilization

                           2003   2004   2005
Installed (*) BW in Mbps    300    600    800
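For orientation, the 1 TB/day figure translates into sustained link bandwidth roughly as follows, consistent with the 50% utilization assumption above:

```latex
\[
\frac{1\ \mathrm{TB/day}}{86400\ \mathrm{s}}
  = \frac{8\times 10^{12}\ \mathrm{bit}}{86400\ \mathrm{s}}
  \approx 93\ \mathrm{Mbps}\ \text{average},\qquad
\frac{93\ \mathrm{Mbps}}{0.5} \approx 185\ \mathrm{Mbps}\ \text{of installed capacity at 50\% utilization.}
\]
```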
25
(No Transcript)
26
Networking Facilities
  • Provisioning of Offsite Network Capacity at the Regional Center at FNAL (cont.)
  • Since CERN is directly connected to StarLight in Chicago (as US collaborators at universities are via Internet2), we propose that Fermilab provide direct connectivity at scalable data rates and without intervening Internet Service Providers before the DC04 pre-challenge data production, in order to secure availability of adequate functionality and bandwidth for the CMS R&D program
  • Though we are not limited to a specific implementation, we believe dark fiber between Fermilab and StarLight would be the most suitable way to get prepared for the future

27
Why Fiber?
  • Capacity needed is not otherwise affordable
  • Capabilities needed are not available (in time)
  • Cheaper in the long run
  • Insurance against monopoly behavior
  • Stable and predictable anchor points

28
National Light Rail Project Proposal
(Map) Proposed national fiber footprint with nodes at SEA, POR, SAC, BOS, NYC, CHI, OGD, DEN, SVL, CLE, WDC, PIT, FRE, KAN, RAL, NAS, STR, LAX, PHO, WAL, ATL, SDG, OLG, DAL, and JAC.
Proposed by Tom West
29
National Light Rail Lambda Route Map
(Map) National Light Rail lambda route map: Cisco 15808 LH and ELH long-haul systems, 15540 metro systems, metro 10 GigE and OC192 links, with REGEN, TERMINAL, and OADM sites at Seattle, Portland, Boise, Sacramento, Sunnyvale, Fresno, Los Angeles, San Diego, Salt Lake City, Ogden, Denver, Phoenix, Olga, Walnut, Dallas, Kansas, Chicago (StarLight), Nashville, Atlanta, Raleigh, Pittsburgh, Cleveland, Washington DC, New York City, Boston, and Stratford (per-segment figures not reproduced here).
30
LHCnet Network Late 2002
(Diagram) At CERN (Geneva): Cisco 7609 (CERN), Alcatel 7770 and Cisco 7606 (DataTAG/CERN), Juniper M10 (DataTAG/CERN), Linux PCs for performance tests and monitoring, and an Alcatel 1670 optical mux/demux, with peerings to GEANT, SWITCH, IN2P3, and WHO. Connected over a 2.5 Gbps (R&D) link and a 622 Mbps (production) link to the Caltech/DoE PoP at StarLight Chicago, which hosts a mirrored set of equipment (Alcatel 1670, Cisco 7609/7606 and Juniper M10 operated by Caltech/DoE, Alcatel 7770 DataTAG, Linux test PCs) and peerings to Abilene, MREN, ESnet, STARTAP, and NASA. Used for development and tests.
31
Networking
  • Immediate needs for R&D in three topic areas
  • End-to-End Performance / Network Performance and Prediction
  • Closely related to work on Storage Grid (SRM etc.)
  • Alternative implementations of the TCP/IP stack
  • QoS and Differentiated Services, Bandwidth Brokering
  • Evaluate and eventually utilize the differentiated services framework as being implemented in Abilene and ESnet
  • Evaluate bandwidth brokers (e.g. GARA)
  • Virtual Private Networks (VPN)
  • Evaluate and eventually implement VPN technology over public network infrastructure for the CMS Production Grid
  • Other parties involved are CERN, Caltech, DataTAG, Internet2, ESnet, ...

32
Networking
  • Immediate needs for R&D
  • End-to-End Performance / Network Performance and Prediction
  • Closely related to work on Storage Grid (SRM etc.)
  • Transport protocols (e.g. GridFTP, work w/ CS community)
  • System optimization w.r.t. concurrent storage/network traffic
  • Currently observed system shortfalls
  • The Reno stack is in general lacking gigabit speed capabilities
  • End of scale reached due to severe equilibrium and stability problems
  • Using loss (i.e. inducing loss) probability for control → wild oscillations
  • The research community is paying a lot of attention to optimization issues based on deployed network stack implementations
  • Customization of network kernel buffers (see the estimate below)
  • Largely increased values yield improved throughput on non-congested links, BUT
  • behave unfairly when competing with other streams
  • need a very long time to recover from packet loss (see next slide)
  • require manual negotiation/configuration of metrics
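The buffer sizes in question follow from the bandwidth-delay product; for the CERN-StarLight path discussed on the next slides this is roughly

```latex
\[
\mathrm{BDP} = C \times \mathrm{RTT}
  = 622\ \mathrm{Mbps} \times 120\ \mathrm{ms}
  \approx 7.5\times 10^{7}\ \mathrm{bit}
  \approx 9.3\ \mathrm{MB},
\]
```

so default kernel send/receive buffers of a few tens of kB limit a single stream to a small fraction of the link.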

33
Time to recover from a single loss
(Chart) TCP throughput CERN-StarLight (link running at 622 Mbps)
  • TCP reactivity: due to the basic multiplicative-decrease / additive-increase algorithm used to handle packet loss
  • The time to increase the throughput by 120 Mbit/s is larger than 6 min for a connection between Fermilab and CERN
  • A single loss is disastrous
  • A TCP connection reduces its bandwidth use by half after a loss is detected (multiplicative decrease)
  • A TCP connection increases its bandwidth use slowly (additive increase)
  • TCP is much more sensitive to packet loss in WANs than in LANs

From Sylvain Ravot / Caltech
34
TCP Responsiveness
Case                             | Capacity  | RTT (ms)       | MSS (Byte)         | Responsiveness
Typical LAN in 1988              | 10 Mbps   | 2 (20)         | 1460               | 1.5 ms (154 ms)
Typical WAN in 1988              | 9.6 Kbps  | 40             | 1460               | 0.006 sec
Typical LAN today                | 100 Mbps  | 5 (worst case) | 1460               | 0.096 sec
Current WAN link CERN-StarLight  | 622 Mbps  | 120            | 1460               | 6 minutes
Future WAN link CERN-StarLight   | 10 Gbit/s | 120            | 1460               | 92 minutes
Future WAN link CERN-StarLight   | 10 Gbit/s | 120            | 8960 (Jumbo Frame) | 15 minutes
From H. Newman
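These values are consistent with the standard responsiveness estimate for additive-increase/multiplicative-decrease TCP (the time to recover the full window after a single loss), which for the 622 Mbps CERN-StarLight case gives roughly

```latex
\[
\rho \approx \frac{C \cdot \mathrm{RTT}^{2}}{2 \cdot \mathrm{MSS}}
  = \frac{622\ \mathrm{Mbps}\times(120\ \mathrm{ms})^{2}}{2\times 1460\ \mathrm{B}}
  \approx \frac{7.8\times 10^{7}\ \mathrm{B/s}\times 0.0144\ \mathrm{s}^{2}}{2920\ \mathrm{B}}
  \approx 380\ \mathrm{s} \approx 6\ \mathrm{min}.
\]
```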
35
Iperf TCP throughput between CERN and StarLight
using the standard Stack
36
Networking
  • Immediate needs for R&D
  • End-to-End Performance / Network Performance and Prediction (cont.)
  • Need to actively pursue network stack implementations supporting Ultrascale Networking for rapid data transactions and data-intensive dynamic workspaces
  • Maintain statistical multiplexing and end-to-end flow control
  • Maintain functional compatibility with the Reno/TCP implementation
  • The FAST Project has shown dramatic improvements over the Reno stack by moving from a loss-based to a delay-based congestion control mechanism (see the window-update rule below)
  • with standard segment size and fewer streams
  • Fermilab/CMS is a FAST partner
  • as a well supported user having the FAST stack installed on Facility R&D data servers (first results look very promising)
  • Aiming at installations/evaluations for integration with the production environment at CERN and Tier-2 sites
  • Work in collaboration with the Fermilab CCF Department
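For reference, the delay-based idea can be summarized by the periodic window update described in the FAST TCP literature, quoted here as background rather than as the exact stack deployed at Fermilab; baseRTT is the minimum observed round-trip time, and gamma and alpha are tuning parameters:

```latex
\[
w \;\leftarrow\; \min\!\Bigl\{\,2w,\;(1-\gamma)\,w
  + \gamma\Bigl(\tfrac{\mathrm{baseRTT}}{\mathrm{RTT}}\,w + \alpha\Bigr)\Bigr\},
\qquad 0 < \gamma \le 1.
\]
```

The window grows while queueing delay is small (RTT close to baseRTT) and backs off smoothly as delay builds up, instead of waiting for packet loss.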

37
Iperf TCP throughput between CERN and StarLight
using the FAST Stack
38
Milestones
  • Data Storage and Data Access
  • Implementation of a Storage Resource Management system based on the SRM protocol and the respective data movement mechanisms (08/2003)
  • Data Access Optimization
  • Develop and implement a model to optimize data placement and data distribution in conjunction with the new persistency mechanism (08/2003)
  • File Systems and advanced Disk Storage Technology
  • Development of a storage architecture using cluster file systems with intelligent storage devices; will implement a prototype (12/2003)
  • Resource Management
  • Develop tools for dynamic partitioning of Compute Elements (Farms) (03/2003)
  • Networking
  • Research on end-to-end performance optimization (WAN)
  • Develop standard configuration for Tier-0/1/2 connectivity (07/2003)