1
Scientific Computing Resources
  • Ian Bird, Computer Center
  • Hall A Analysis Workshop
  • December 11, 2001

2
Overview
  • Current Resources
  • Recent evolution
  • Mass storage HW & SW
  • Farm
  • Remote data access
  • Staffing levels
  • Future Plans
  • Expansion/upgrades of current resources
  • Other computing - LQCD
  • Grid Computing
  • What is it? Should you care?

3
Jefferson Lab Scientific Computing
Environment November 2001
2 TB Farm Cache - SCSI RAID 0 on Linux servers
  • Batch Farm Cluster
  • 350 Linux nodes (400 MHz - 1 GHz)
  • 10,000 SPECint95
  • Managed by LSF + Java layer
  • web interface
  • 2 STK silos
  • 10 9940 drives
  • 10 9840 drives
  • 8 Redwood drives
  • 10 Solaris/Linux data movers w/ 300 GB stage
  • Interactive Analysis
  • 2 Sun E450 (4-processor)
  • 2 4-processor Intel/Linux

Gigabit Ethernet Switching Fabric
Grid gateway
bbftp service
JLAB Network Backbone
CUE General Services
JASMine managed Mass Storage Systems
Internet (ESNet OC-3)
  • Lattice QCD Cluster
  • 40 Alpha/Linux (667 MHz)
  • 256 Pentium 4 (Q2 FY02?)
  • Managed by PBS + Web portal

4
JLAB Farm and Mass Storage Systems November 2001
Batch Farm: 350 processors (175 dual nodes), each
connected at 100 Mb to a 24-port switch with a Gb
uplink (8 switches)
Fiber Channel direct from CLAS
  • 2 STK silos
  • 10 9940 drives
  • 10 9840 drives
  • 8 Redwood drives
  • 10 Solaris/Linux data movers, each w/ 300 GB
    stage, Gb uplink

Foundry BigIron 8000 Switch: 256 Gb backplane,
45/60 Gb ports in use
Site Router - CUE and general services
Work disks: 4 MetaStor systems, each with 100 Mb
uplink; total 5 TB SCSI RAID 5
Cache disk farm: 20 Linux servers, each with Gb
uplink; total 16 TB SCSI/IDE RAID 0
CH-Router - incoming data from Halls A & C
Work disk farm: 4 Linux servers, each with Gb
uplink; total 4 TB SCSI RAID 5
5
CPU Resources
  • Farm
  • Upgraded this summer with 60 dual 1 GHz P III (4
    CPUs / 1U rackmount)
  • Retired original 10 dual 300 MHz
  • Now 350 CPUs (400, 450, 500, 750, 1000 MHz)
  • 11,000 SPECint95
  • Deliver > 500,000 SI95-hrs / week
  • Equivalent to 75 1-GHz CPUs
  • Interactive
  • Solaris: 2 E450 (4-processor)
  • Linux: 2 quad systems (4x450 MHz, 4x750 MHz)
  • If required can use batch systems (via LSF) to
    add interactive CPU to these (Linux) front ends

6
Intel Linux Farm
7
Tape storage
  • Added 2nd silo this summer
  • Required move of room of equipment
  • Added 10 9940 drives (5 as part of new silo)
  • Current
  • 8 Redwood, 10 9840, 10 9940
  • Redwood: 50 GB @ 10 MB/s (helical scan, single
    reel)
  • 9840: 20 GB @ 10 MB/s (linear, mid-load cassette
    (fast))
  • 9940: 60 GB @ 10 MB/s (linear, single reel)
  • 9840 & 9940 are very reliable
  • 9840 & 9940 have upgrade paths that use the same
    media
  • 9940 2nd generation: 100 GB @ 20 MB/s ??
  • Add 10 more 9940 this FY (budget..?)
  • Replace Redwoods (reduce to 1-2)
  • Requires copying 4500 tapes (started) - budget
    for tape?
  • Reliability, end of support(!)

8
Disk storage
  • Added cache space
  • For frequently used silo files, to reduce tape
    accesses
  • Now have 22 cache servers
  • 4 dedicated to the farm (2 TB)
  • 16 TB of cache space allocated to expts
  • Some bought and owned by groups
  • Dual Linux systems, Gb network, 1 TB disk, RAID
    0
  • 9 SCSI systems
  • 13 IDE systems
  • Performance approximately equivalent
  • Good match of CPU, network throughput, and disk
    space
  • This is a model that will scale by a factor of a
    few, but probably not by 10 (but there is as yet
    no solution to that)
  • Looking at distributed file systems for the
    future to avoid NFS complications - GFS, etc.,
    but no production-level system yet
  • N.B. Accessing data with jcache does not need NFS,
    and is fault tolerant
  • Added work space
  • Added 4 systems to reduce load on fs3,4,5,6 (orig
    /work)
  • Dual Linux systems, Gb network, 1 TB disk, SCSI
    RAID 5
  • Performance on all systems is now good
  • Problems

9
JASMine
  • JASMine - Mass Storage System software
  • Rationale - why write another MSS?
  • Had been using OSM
  • Not scaleable, not supported, reached the limit
    of the software; had to run 2 instances to get
    sufficient drive capacity
  • Hidden from users by Tapeserver
  • Java layer that
  • Hid complexities of OSM installations
  • Implemented tape-disk buffers (stage)
  • Provided get, put, managed cache (read copies of
    archived data) capabilities
  • Migration from OSM
  • Production environment.
  • Timescales driven by experiment schedules, need
    to add drive capacity
  • Retain user interface
  • Replace osmcp function - tape to disk, drive
    and library management
  • Choices investigated
  • Enstore, Castor, (HPSS)
  • Timescales, support, adaptability (missing
    functionality/philosophy - cache/stage)
  • Provide missing functions within the Tapeserver
    environment, with clean-up and reworking
  • JASMine (JLAB Asynchronous Storage Manager)

10
Architecture
  • JASMine
  • Written in Java
  • For data movement, as fast as C code.
  • JDBC makes using and changing databases easy.
  • Distributed Data Movers and Cache Managers
  • Scaleable to the foreseeable needs of the
    experiments
  • Provides scheduling
  • Optimizing file access requests
  • User and group (and location dependent)
    priorities
  • Off-site cache or ftp servers for data exporting
  • JASMine Cache Software
  • Stand-alone component - can act as a local or
    remote client, allowing remote access to JASMine
  • Can be deployed to a collaborator to manage a
    small disk system, and as a basis for coordinated
    data management between sites
  • Cache manager runs on each cache server.
  • Hardware is not an issue.
  • Need a JVM, network, and a disk to store files.

11
Software cont.
  • MySQL database used by all servers.
  • Fast and reliable.
  • Accessed with standard SQL via JDBC (see the
    sketch after this slide).
  • Data Format
  • ANSI standard labels with extra information
  • Binary data
  • Support to read legacy OSM tapes
  • cpio, no file labels
  • Protocol for file transfers
  • Writes to cache are never NFS
  • Reads from cache may be NFS
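
Slides 10-11 note that JASMine is written in Java, keeps its metadata in MySQL, and reaches the database through JDBC. Below is a minimal sketch of that kind of lookup, assuming an invented table and column layout (cache_files, path, server) and invented connection details; the real JASMine schema is not shown in this talk.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Illustrative JDBC lookup against a MySQL metadata database.
    // Requires the MySQL JDBC driver on the classpath at run time.
    public class CacheLookup {
        private final String jdbcUrl;
        private final String user;
        private final String password;

        public CacheLookup(String jdbcUrl, String user, String password) {
            this.jdbcUrl = jdbcUrl;
            this.user = user;
            this.password = password;
        }

        /** Returns the cache server currently holding the file, or null if it is not cached. */
        public String locate(String filename) throws SQLException {
            String sql = "SELECT server FROM cache_files WHERE path = ?";
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 PreparedStatement stmt = conn.prepareStatement(sql)) {
                stmt.setString(1, filename);
                try (ResultSet rs = stmt.executeQuery()) {
                    return rs.next() ? rs.getString("server") : null;
                }
            }
        }
    }

Because everything above plain SQL goes through JDBC, swapping MySQL for another database is mostly a change of driver and connection URL, which is the point made on slide 10.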

12
[Architecture diagram: Client, Request Manager, Scheduler, Database, Log
Manager, Library Managers, and Data Mover, linked by service, database, and
log connections.]
13
JASMine Services
  • Database
  • Stores metadata
  • also presented to user on an NFS filesystem as
    stubfiles
  • But could equally be presented as e.g. a web
    service, LDAP, etc.
  • Do not need to access the stub files - just need
    to know filenames
  • Tracks status and locations of all requests,
    files, volumes, drives, etc.
  • Request Manager
  • Handles user requests and queries.
  • Scheduler
  • Prioritizes user requests for tape access.
  • priority = share / (0.01 + (num_a *
    ACTIVE_WEIGHT) + (num_c * COMPLETED_WEIGHT))
    (see the sketch after this slide)
  • Host vs User shares, farm priorities
  • Log Manager
  • Writes out log and error files and databases.
  • Sends out notices for failures.
  • Library Manager
  • Mounts and dismounts tapes, as well as other
    library-related tasks.
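
The priority expression quoted above can be written directly as a small Java method. The weight values used here are placeholders (the talk does not give them); only the shape of the calculation comes from the slide.

    // Sketch of the scheduler priority calculation; weights are assumed values.
    public class TapePriority {
        static final double ACTIVE_WEIGHT = 1.0;     // placeholder weight
        static final double COMPLETED_WEIGHT = 0.5;  // placeholder weight

        // priority = share / (0.01 + num_a * ACTIVE_WEIGHT + num_c * COMPLETED_WEIGHT)
        // A larger allocated share raises priority; requests already active or
        // recently completed for the same user/host pull it back down.
        static double priority(double share, int numActive, int numCompleted) {
            return share / (0.01 + numActive * ACTIVE_WEIGHT
                                 + numCompleted * COMPLETED_WEIGHT);
        }

        public static void main(String[] args) {
            System.out.println(priority(10.0, 0, 0));  // idle user: highest priority
            System.out.println(priority(10.0, 5, 2));  // busy user: priority drops
        }
    }

The 0.01 term simply keeps the denominator non-zero for a user with no active or completed requests.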

14
JASMine Services -2
  • Data Mover
  • Dispatcher
  • Keeps track of available local resources and
    starts requests that the local system can work on
    (see the sketch after this slide).
  • Cache Manager
  • Manages a disk or disks for pre-staging data to
    and from tape.
  • Sends and receives data to and from clients.
  • Volume Manager
  • Manages tapes for availability.
  • Drive Manager
  • Manages tape drives for usage.
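
A hedged sketch of the Dispatcher idea: each data mover polls for work and claims only the requests its local drives and stage disk can serve right now. The Request, RequestQueue, and LocalResources types are invented stand-ins, not the real JASMine classes.

    import java.util.Optional;
    import java.util.concurrent.TimeUnit;

    public class Dispatcher implements Runnable {
        interface Request { void run(); }
        interface RequestQueue {
            /** Atomically claim a queued request that fits the offered resources, if any. */
            Optional<Request> claimMatching(boolean tapeDriveFree, long stageBytesFree);
        }
        interface LocalResources { boolean tapeDriveFree(); long stageBytesFree(); }

        private final RequestQueue queue;
        private final LocalResources local;

        Dispatcher(RequestQueue queue, LocalResources local) {
            this.queue = queue;
            this.local = local;
        }

        @Override
        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                Optional<Request> next =
                        queue.claimMatching(local.tapeDriveFree(), local.stageBytesFree());
                if (next.isPresent()) {
                    next.get().run();              // e.g. load tape, copy to stage, send to client
                } else {
                    try {
                        TimeUnit.SECONDS.sleep(5); // nothing this mover can serve; poll again later
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }
        }
    }

Because every mover runs this loop against the shared database, a dead mover simply stops claiming work and its unfinished jobs get picked up elsewhere, which is the fault-tolerance behaviour described on slide 24.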

15
User Access
  • Jput
  • Put one or more files on tape
  • Jget
  • Get one or more files from tape
  • Jcache
  • Copies one or more files from tape to cache
  • Jls
  • Get metadata for one or more files
  • Jtstat
  • Status of the request queue
  • Web interface
  • Query status and statistics for entire system

16
Web interface
17
(No Transcript)
18
Data Access to cache
  • NFS
  • Directory of links points the way.
  • Mounted read-only by the farm.
  • Users can mount read-only on their desktop.
  • Jcache
  • Java client.
  • Checks to see if files are on cache disks.
  • Will get/put files from/to cache disks.
  • More efficient than NFS, avoids NFS hangs if
    server dies, etc., but users like NFS

19
Disk Cache Management
  • Disk Pools are divided into groups
  • Tape staging.
  • Experiments.
  • Pre-staging for the batch farm.
  • Management policy set per group
  • Cache - LRU, files removed as needed.
  • Stage - reference counting.
  • Explicit manual addition and deletion.
  • Policies are pluggable - easy to add (see the
    sketch below).
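
A sketch of what the pluggable per-group policies could look like. The interface and class names are assumptions for illustration; the slide only states that cache groups use LRU, stage groups use reference counting, and that new policies are easy to add.

    import java.util.Comparator;
    import java.util.List;

    public class CachePolicies {
        /** A file currently held in a disk group. */
        record CachedFile(String path, long sizeBytes, long lastAccessMillis, int referenceCount) {}

        /** A policy decides which files may be evicted when space is needed. */
        interface GroupPolicy {
            List<CachedFile> selectForRemoval(List<CachedFile> files, long bytesNeeded);
        }

        /** Cache groups: evict least-recently-used files until enough space is freed. */
        static class LruPolicy implements GroupPolicy {
            public List<CachedFile> selectForRemoval(List<CachedFile> files, long bytesNeeded) {
                List<CachedFile> byAge = files.stream()
                        .sorted(Comparator.comparingLong(CachedFile::lastAccessMillis))
                        .toList();
                long freed = 0;
                int i = 0;
                while (i < byAge.size() && freed < bytesNeeded) {
                    freed += byAge.get(i).sizeBytes();
                    i++;
                }
                return byAge.subList(0, i);
            }
        }

        /** Stage groups: a file is only removable once its reference count drops to zero. */
        static class ReferenceCountPolicy implements GroupPolicy {
            public List<CachedFile> selectForRemoval(List<CachedFile> files, long bytesNeeded) {
                return files.stream().filter(f -> f.referenceCount() == 0).toList();
            }
        }
    }

Adding a new policy for a group is then just another GroupPolicy implementation, with no change to the cache manager itself.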

20
Protocol for file moving
  • Simple, extensible protocol for file copies
  • Messages are Java serialized objects passed over
    streams
  • Bulk data transfer uses raw data transfer over
    TCP
  • Protocol is synchronous - all calls block
  • Asynchrony - multiple requests by threading
  • CRC32 checksums at every transfer (see the sketch
    after this slide)
  • More fair than NFS
  • Session may make many connections
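
A hedged sketch of the transfer pattern described on this slide: a serialized Java control message goes first, the file body follows as raw bytes on the same TCP connection, and a CRC32 is accumulated so the receiver can verify the copy. PutRequest and TransferReply are invented message classes, not the real JASMine protocol types.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.OutputStream;
    import java.io.Serializable;
    import java.net.Socket;
    import java.util.zip.CRC32;

    public class TransferClient {
        /** Control message: tells the server what is coming and how many bytes follow. */
        static class PutRequest implements Serializable {
            final String path;
            final long length;
            PutRequest(String path, long length) { this.path = path; this.length = length; }
        }

        /** Server acknowledgement carrying its own CRC32 of the received bytes. */
        static class TransferReply implements Serializable {
            final boolean ok;
            final long crc;
            TransferReply(boolean ok, long crc) { this.ok = ok; this.crc = crc; }
        }

        /** Sends one file synchronously; blocks until the server acknowledges it. */
        static boolean putFile(String host, int port, String path, long length, InputStream data)
                throws IOException, ClassNotFoundException {
            try (Socket socket = new Socket(host, port)) {
                OutputStream raw = socket.getOutputStream();
                ObjectOutputStream out = new ObjectOutputStream(raw);
                out.writeObject(new PutRequest(path, length));   // serialized control message
                out.flush();

                CRC32 crc = new CRC32();
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = data.read(buf)) != -1) {             // bulk data as raw bytes
                    crc.update(buf, 0, n);
                    raw.write(buf, 0, n);
                }
                raw.flush();

                // The protocol is synchronous: block here for the serialized reply.
                ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
                TransferReply reply = (TransferReply) in.readObject();
                return reply.ok && reply.crc == crc.getValue();
            }
        }
    }

Asynchrony, as the slide notes, comes from issuing several such blocking calls from separate threads rather than from the protocol itself.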

21
Protocol for file moving
  • Cache server extends the basic protocol
  • Add database hooks for cache
  • Add hooks for cache policies
  • Additional message types were added
  • High throughput disk pool
  • Database shared by many servers
  • Any server in the pool can look up a file's
    location,
  • But data transfer is always direct between the
    client and the node holding the file
  • Adding servers and disk to the pool increases
    throughput with no overhead,
  • Provides fault tolerance

22
Example Get from cache
  • cacheClient.getFile(/foo, halla)
  • send locate request to any server
  • receive locate reply
  • contact appropriate server
  • initiate direct xfer
  • Returns true on success

[Sequence diagram: the client asks any cache server "Where is /foo?", is told
the file is on cache4, and issues "Get /foo" directly to cache4, which replies
"Sending /foo". A sketch of this pattern follows.]
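
Under the assumption of a hypothetical client API (the real jcache client classes are not shown in the talk), the locate-then-fetch sequence above could look like this: any server in the pool can answer the locate query because they all share the database, but the data always moves directly from the node holding the file.

    import java.io.IOException;
    import java.nio.file.Path;

    public class GetExample {
        /** Hypothetical client API; names are illustrative only. */
        interface CacheClient {
            /** Ask any cache server in the pool which node holds the file; null if not cached. */
            String locate(String file, String group) throws IOException;
            /** Copy the file directly from the named node into a local destination. */
            boolean fetchFrom(String node, String file, Path dest) throws IOException;
        }

        /** getFile(/foo, halla): locate first, then transfer directly from the holder. */
        static boolean getFile(CacheClient client, String file, String group, Path dest)
                throws IOException {
            String holder = client.locate(file, group);   // "Where is /foo?" -> e.g. "cache4"
            if (holder == null) {
                return false;                             // not on disk; JASMine would re-cache from tape
            }
            return client.fetchFrom(holder, file, dest);  // "Get /foo" goes straight to that node
        }
    }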
23
Example simple put to cache
  • putFile(/quux,halla,123456789)

24
Fault Tolerance
  • Dead machines do not stop the system
  • Data Movers work independently
  • Unfinished jobs will restart on another mover
  • A dead cache server only impacts NFS clients
  • The system recognizes the dead server and will
    re-cache the file from tape
  • If users did not use NFS they would never see a
    failure - just an extended access time
  • Exception handling for the following (see the
    sketch after this slide)
  • Received timeouts
  • Refused connections
  • Broken connections
  • Complete garbage on connections
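
On the client side, the exception handling listed above might look roughly like the sketch below; CacheCopy is a stand-in for the real transfer code, and in the real system a file that cannot be read from any cache server is re-cached from tape rather than simply reported as failed.

    import java.io.IOException;
    import java.net.ConnectException;
    import java.net.SocketTimeoutException;

    public class FaultTolerantGet {
        interface CacheCopy { void copy(String file, String server) throws IOException; }

        /** Tries each candidate server in turn; gives up only when all of them fail. */
        static void copyWithFailover(CacheCopy copier, String file, String... servers)
                throws IOException {
            IOException last = null;
            for (String server : servers) {
                try {
                    copier.copy(file, server);
                    return;                                  // success
                } catch (SocketTimeoutException | ConnectException e) {
                    last = e;                                // dead or refusing server: try the next one
                } catch (IOException e) {
                    last = e;                                // broken connection or garbage mid-transfer
                }
            }
            throw last != null ? last : new IOException("no cache servers available");
        }
    }

The net effect for a jcache user is the extended access time described above, not an error.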

25
Authorization and Authentication
  • Shared secret for each file transfer session
  • Session authorization by policy objects
  • Example - receive 5 files from user@bar
  • Plug-in authenticators
  • Establish shared secret between client and server
    (see the sketch after this slide)
  • No clear text passwords
  • Extend to be compatible with GSI
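
A sketch of a challenge-response check built on a per-session shared secret, so that no clear-text password ever crosses the wire. The use of HMAC-SHA256 here is an assumption for illustration; the actual JASMine authenticator plug-ins and the GSI extension are not detailed in the talk.

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.security.SecureRandom;
    import java.util.HexFormat;

    public class SharedSecretAuth {
        /** Server side: issue a random challenge for this session. */
        static byte[] newChallenge() {
            byte[] challenge = new byte[16];
            new SecureRandom().nextBytes(challenge);
            return challenge;
        }

        /** Both sides: answer = HMAC(sharedSecret, challenge), compared as hex strings. */
        static String respond(byte[] sharedSecret, byte[] challenge) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(sharedSecret, "HmacSHA256"));
            return HexFormat.of().formatHex(mac.doFinal(challenge));
        }

        public static void main(String[] args) throws Exception {
            byte[] secret = "per-session-secret".getBytes(StandardCharsets.UTF_8);
            byte[] challenge = newChallenge();
            String clientAnswer = respond(secret, challenge);   // sent over the wire
            String serverAnswer = respond(secret, challenge);   // computed locally
            System.out.println("authenticated: " + clientAnswer.equals(serverAnswer));
        }
    }

Swapping this for a GSI-based authenticator would then be a matter of providing a different plug-in that establishes the session secret, matching the extension mentioned on this slide.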

26
JASMine Bulk Data Transfers
  • Model supports parallel transfers
  • Many files at once, but not bbftp style (see the
    sketch after this slide)
  • But could replace stream class with a parallel
    stream
  • For bulk data transfer over WANs
  • Firewall issues
  • Client initiates all connections
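
A sketch of the "many files at once" model: each file still travels over its own ordinary connection (not a bbftp-style multi-stream transfer of a single file), and the parallelism comes from running several blocking transfers in threads. FileMover is a placeholder for the real transfer code, which, per this slide, initiates every connection from the client side.

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelTransfers {
        interface FileMover { boolean send(String file) throws Exception; }

        /** Sends the list using a fixed number of concurrent transfers; returns the success count. */
        static int sendAll(FileMover mover, List<String> files, int parallelTransfers)
                throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(parallelTransfers);
            try {
                List<Future<Boolean>> results = pool.invokeAll(
                        files.stream()
                             .map(f -> (Callable<Boolean>) () -> mover.send(f))
                             .toList());
                int ok = 0;
                for (Future<Boolean> r : results) {
                    if (r.get()) ok++;            // blocks until that transfer has finished
                }
                return ok;
            } finally {
                pool.shutdown();
            }
        }
    }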

27
Architecture Disk pool hardware
  • SCSI Disk Servers
  • Dual Pentium III 650 (later 933)MHz CPUs
  • 512 Mbytes 100MHz SDRAM ECC
  • ASUS P2B-D Motherboard
  • NetGear GA620 Gigabit Ethernet PCI NIC
  • Mylex eXtremeRAID 1100, 32 MBytes cache
  • Seagate ST150176LW (Qty. 8) - 50 GBytes Ultra2
    SCSI in Hot Swap Disk Carriers
  • CalPC 8U Rack Mount Case with Redundant 400W
    Power Supplies
  • IDE Disk Servers
  • Dual Pentium III 933MHz CPUs
  • 512 Mbytes 133MHz SDRAM ECC
  • Intel STL2 or ASUS CUR-DLS Motherboard
  • NetGear GA620 or Intel PRO/1000 T Server Gigabit
    Ethernet PCI NIC
  • 3ware Escalade 6800
  • IBM DTLA-307075 (Qty. 12) - 75 GBytes Ultra
    ATA/100 in Hot Swap Disk Carriers
  • CalPC 8U Rack Mount Case with Redundant 400W
    Power Supplies

28
Cache Performance
  • Matches network, disk I/O, and CPU performance
    with size of disk pool
  • 800 GB, 2 x 850 MHz, Gb Ethernet per server

29
Cache status
30
Performance SCSI vs IDE
  • Disk Array/File System: Ext2
  • SCSI Disk Server - 8 x 50 GByte disks in a RAID-0
    stripe over 2 SCSI controllers
  • 68 MBytes/sec single disk write
  • 79 MBytes/sec burst for a single disk write
  • 52 MBytes/sec single disk read
  • 56 MBytes/sec burst for a single disk read
  • IDE Disk Server - 6 x 75 GByte disks in a RAID-0
    stripe
  • 64 MBytes/sec single disk write
  • 77 MBytes/sec burst for a single disk write
  • 48 MBytes/sec single disk read
  • 49 MBytes/sec burst for a single disk read

31
Performance NFS vs Jcache
  • NFS v2 udp - 16 clients,
  • rsize=8192 and wsize=8192
  • Reads
  • SCSI Disk Servers
  • 7700 NFS ops/sec and 80% CPU utilization
  • 11000 NFS ops/sec burst and 83% CPU utilization
  • 32 MBytes/sec and 83% CPU utilization
  • IDE Disk Servers
  • 7700 NFS ops/sec and 72% CPU utilization
  • 11000 NFS ops/sec burst and 92% CPU utilization
  • 32 MBytes/sec and 72% CPU utilization
  • Jcache - 16 clients
  • Reads
  • SCSI Disk Servers
  • 32 MBytes/sec and 100% CPU utilization
  • IDE Disk Servers
  • 32 MBytes/sec and 100% CPU utilization

32
JASMine system performance
  • End-to-end performance
  • i.e. tape load, copy to stage, network copy to
    client
  • Aggregate sustained performance of 50MB/s is
    regularly observed in production
  • During stress tests, up to 120 MB/s was sustained
    for several hours
  • A data mover with 2 drives can handle 15MB/s
    (disk contention is the limit)
  • Expect current system should handle 150 MB/s and
    is scaleable by adding data movers & drives
  • N.B. this is performance to a network client!
  • Data handling
  • Currently the system regularly moves 2-3 TB per
    day total
  • 6000 files per day, 2000 requests

33
(No Transcript)
34
(No Transcript)
35
JASMine performance
36
Tape migration
  • Begin migration of 5000 Redwood tapes to 9940
  • Procedure written
  • Uses any/all available drives
  • Use staging to allow re-packing of tapes
  • Expect it will last 9-12 months

37
Typical Data Flows
Raw Data: < 10 MB/s over Gigabit Ethernet (Halls A & C)
  • Batch Farm Cluster
  • 350 Linux nodes (400 MHz - 1 GHz)
  • 10,000 SPECint95
  • Managed by LSF + Java layer
  • web interface
  • Raw Data: > 20 MB/s over Fiber Channel (Hall B)

[Diagram labels: 25-30 MB/s data flows]
38
How to make optimal use of the resources
  • Plan ahead!
  • As a group
  • Organize data sets in advance (a week ahead) and
    use the cache disks for their intended purpose
  • Hold frequently used data to reduce tape access
  • In a high data rate environment no other strategy
    works
  • When running farm productions
  • Use jsub to submit many jobs in one command, as
    it was designed to do
  • Optimizes tape accesses
  • Gather output files together on work disks and
    make a single jput for a complete tape's worth of
    data

39
Remote data access
  • Tape copying is deprecated
  • Expensive, time consuming (for you and us), and
    inefficient
  • We have an OC-3 (155 Mbps) connection that is
    under-utilized; filling it will get us upgraded
    to OC-12 (622 Mbps)
  • At the moment we do often have to coordinate with
    ESnet and peers to ensure high-bandwidth path,
    but this is improving as Grid development
    continues
  • Use network copies
  • Bbftp service
  • Parallel, secure ftp optimizes use of WAN
    bandwidth
  • Future
  • Remote jcache
  • Cache manager can be deployed remotely -
    demonstration Feb 02.
  • Remote silo access, policy-based (unattended)
    data migration
  • GridFTP, bbftp, bbcp
  • Parallel, secure ftp (or ftp-like)
  • As part of a Grid infrastructure
  • PKI authentication mechanism

40
(Data-) Grid Computing
41
Particle Physics Data Grid - Collaboratory Pilot
Who we are: Four leading Grid Computer Science
projects and six international High Energy and
Nuclear Physics collaborations
The problem at hand today: Petabytes of storage,
Teraops/s of computing, thousands of users,
hundreds of institutions, 10 years of analysis ahead
What we do: Develop and deploy Grid services for
our experiment collaborators, and promote and
provide common Grid software and standards
42
PPDG Experiments
ATLAS - A Toroidal LHC ApparatuS at CERN. Runs 2006 on.
Goals: TeV physics - the Higgs and the origin of mass.
http://atlasinfo.cern.ch/Atlas/Welcome.html
BaBar - at the Stanford Linear Accelerator Center. Running now.
Goals: study CP violation and more.
http://www.slac.stanford.edu/BFROOT/
CMS - the Compact Muon Solenoid detector at CERN. Runs 2006 on.
Goals: TeV physics - the Higgs and the origin of mass.
http://cmsinfo.cern.ch/Welcome.html/
D0 - at the D0 colliding beam interaction region at Fermilab. Runs soon.
Goals: learn more about the top quark, supersymmetry, and the Higgs.
http://www-d0.fnal.gov/
STAR - Solenoidal Tracker At RHIC at BNL. Running now.
Goals: quark-gluon plasma.
http://www.star.bnl.gov/
Thomas Jefferson National Laboratory - Running now.
Goals: understanding the nucleus using electron beams.
http://www.jlab.org/
43
PPDG Computer Science Groups
Condor - develop, implement, deploy, and evaluate
mechanisms and policies that support High Throughput
Computing on large collections of computing resources
with distributed ownership.
http://www.cs.wisc.edu/condor/
Globus - developing fundamental technologies needed to
build persistent environments that enable software
applications to integrate instruments, displays, and
computational and information resources that are
managed by diverse organizations in widespread locations.
http://www.globus.org/
SDM - Scientific Data Management Research Group:
optimized and standardized access to storage systems.
http://gizmo.lbl.gov/DM.html
Storage Resource Broker - client-server middleware that
provides a uniform interface for connecting to
heterogeneous data resources over a network and
cataloging/accessing replicated data sets.
http://www.npaci.edu/DICE/SRB/index.html
44
Delivery of End-to-End Applications and Integrated
Production Systems, to allow thousands of physicists
to share data and computing resources for scientific
processing and analyses
  • PPDG Focus
  • Robust Data Replication
  • Intelligent Job Placement and Scheduling
  • Management of Storage Resources
  • Monitoring and Information of Global Services
  • Relies on Grid infrastructure
  • Security & Policy
  • High Speed Data Transfer
  • Network management

Operators & Users
Resources: Computers, Storage, Networks
45
Project Activities, End-to-End Applications, and
Cross-Cut Pilots
  • Project Activities are focused Experiment -
    Computer Science collaborative developments.
  • Replicated data sets for science analysis -
    BaBar, CMS, STAR
  • Distributed Monte Carlo production services -
    ATLAS, D0, CMS
  • Common storage management and interfaces - STAR,
    JLAB
  • End-to-End Applications used in Experiment data
    handling systems to give real-world requirements,
    testing and feedback.
  • Error reporting and response
  • Fault tolerant integration of complex components
  • Cross-Cut Pilots for common services and policies
  • Certificate Authority policy and authentication
  • File transfer standards and protocols
  • Resource Monitoring - networks, computers,
    storage.

46
Year 0.5-1 Milestones (1)
  • Align milestones to Experiment data challenges
  • ATLAS - production distributed data service -
    6/1/02
  • BaBar - analysis across partitioned dataset
    storage - 5/1/02
  • CMS - distributed simulation production - 1/1/02
  • D0 - distributed analyses across multiple
    workgroup clusters - 4/1/02
  • STAR - automated dataset replication - 12/1/01
  • JLAB - policy-driven file migration - 2/1/02

47
Year 0.5-1 Milestones
  • Common milestones with EDG
  • GDMP - robust file replication layer - joint
    project with EDG Work Package (WP) 2 (Data
    Access)
  • Support of Project Month (PM) 9 WP6 TestBed
    Milestone. Will participate in integration fest
    at CERN - 10/1/01
  • Collaborate on PM21 design for WP2 - 1/1/02
  • Proposed WP8 Application tests using PM9 testbed
    - 3/1/02
  • Collaboration with GriPhyN
  • SC2001 demos will use common resources,
    infrastructure and presentations - 11/16/01
  • Common, GriPhyN-led grid architecture
  • Joint work on monitoring proposed

48
Year 0.5-1 Cross-cuts
  • Grid File Replication Services used by > 2
    experiments
  • GridFTP production releases
  • Integrate with D0-SAM, STAR replication
  • Interfaced through SRB for BaBar, JLAB
  • Layered use by GDMP for CMS, ATLAS
  • SRB and Globus Replication Services
  • Include robustness features
  • Common catalog features and API
  • GDMP/Data Access layer continues to be shared
    between EDG and PPDG.
  • Distributed Job Scheduling and Management used by
    > 1 experiment
  • Condor-G, DAGman, Grid-Scheduler for D0-SAM, CMS
  • Job specification language interfaces to
    distributed schedulers D0-SAM, CMS, JLAB
  • Storage Resource Interface and Management
  • Consensus on API between EDG, SRM, and PPDG
  • Disk cache management integrated with data
    replication services

49
Year 1 other goals
  • Transatlantic Application Demonstrators
  • BaBar data replication between SLAC and IN2P3
  • D0 Monte Carlo Job Execution between Fermilab and
    NIKHEF
  • CMS & ATLAS simulation production between
    Europe/US
  • Certificate exchange and authorization.
  • DOE Science Grid as CA?
  • Robust data replication.
  • fault tolerant
  • between heterogeneous storage resources.
  • Monitoring Services
  • MDS2 (Metacomputing Directory Service)?
  • common framework
  • network, compute and storage information made
    available to scheduling and resource management.

50
PPDG activities as part of the Global Grid
Community
  • Coordination with other Grid Projects in our
    field
  • GriPhyN - Grid Physics Network
  • European DataGrid
  • Storage Resource Management collaboratory
  • HENP Data Grid Coordination Committee
  • Participation in Experiment and Grid deployments
    in our field
  • ATLAS, BaBar, CMS, D0, Star, JLAB experiment data
    handling systems
  • iVDGL/DataTAG - International Virtual Data Grid
    Laboratory
  • Use DTF computational facilities?
  • Active in Standards Committees
  • Internet2 HENP Working Group
  • Global Grid Forum

51
Staffing Levels
  • We are stretched thin
  • But compared with other labs with similar data
    volumes we are efficient
  • Systems support group: 5 (1 vacant)
  • Farms, MSS development: 2
  • HW support / Networks: 3.7
  • Telecom: 2.3
  • Security: 2
  • User services: 3
  • MIS, Database support: 8
  • Support for Engineering: 1
  • We cannot do as much as we would like

52
Future (FY02)
  • Removing Redwoods is a priority
  • Copying tapes, replacing drives w/ 9940s
  • Modest farm upgrades - replace older CPUs as
    budget allows
  • Improve interactive systems
  • Add more /work, /cache
  • Grid developments
  • Visible as efficient WAN data replication
    services
  • After FY02
  • Global filesystems to supersede NFS
  • 10 Gb Ethernet
  • Disk vs. tape? Improved tape densities, data
    rates
  • We welcome (coordinated) input as to what would
    be most useful for your physics needs