Title: Scientific Computing Resources
1. Scientific Computing Resources
- Ian Bird, Computer Center
- Hall A Analysis Workshop
- December 11, 2001
2. Overview
- Current Resources
- Recent evolution
- Mass storage HW & SW
- Farm
- Remote data access
- Staffing levels
- Future Plans
- Expansion/upgrades of current resources
- Other computing: LQCD
- Grid Computing
- What is it? Should you care?
3. Jefferson Lab Scientific Computing Environment, November 2001
(Architecture diagram; main elements listed below.)
- Batch Farm Cluster
- 350 Linux nodes (400 MHz to 1 GHz)
- 10,000 SPECint95
- Managed by LSF + Java layer
- Web interface
- 2 TB farm cache: SCSI RAID 0 on Linux servers
- JASMine-managed mass storage systems
- 2 STK silos
- 10 x 9940, 10 x 9840, 8 Redwood drives
- 10 Solaris/Linux data movers w/ 300 GB stage
- Interactive analysis
- 2 Sun E450s (4 processors each)
- 2 x 4-processor Intel/Linux
- Lattice QCD cluster
- 40 Alpha/Linux (667 MHz)
- 256 Pentium 4 (Q2 FY02?)
- Managed by PBS + web portal
- Gigabit Ethernet switching fabric; JLAB network backbone; CUE general services
- Grid gateway; bbftp service
- Internet (ESnet OC-3)
4. JLAB Farm and Mass Storage Systems, November 2001
(Network/storage diagram; main elements listed below.)
- Batch farm: 350 processors (175 dual nodes), each connected at 100 Mb to a 24-port switch with Gb uplink (8 switches)
- 2 STK silos: 10 x 9940, 10 x 9840, 8 Redwood drives
- 10 Solaris/Linux data movers, each w/ 300 GB stage and Gb uplink
- Fiber Channel direct from CLAS
- CH-Router: incoming data from Halls A & C
- Foundry BigIron 8000 switch: 256 Gb backplane, 45 of 60 Gb ports in use
- Site router: CUE and general services
- Work disks: 4 MetaStor systems, each with 100 Mb uplink; total 5 TB SCSI RAID 5
- Work disk farm: 4 Linux servers, each with Gb uplink; total 4 TB SCSI RAID 5
- Cache disk farm: 20 Linux servers, each with Gb uplink; total 16 TB SCSI/IDE RAID 0
5. CPU Resources
- Farm
- Upgraded this summer with 60 dual 1 GHz PIII nodes (4 CPUs per 1U rackmount)
- Retired original 10 dual 300 MHz nodes
- Now 350 CPUs (400, 450, 500, 750, 1000 MHz)
- 11,000 SPECint95
- Deliver > 500,000 SI95-hrs / week
- Equivalent to 75 x 1 GHz CPUs
- Interactive
- Solaris: 2 E450s (4 processors each)
- Linux: 2 quad systems (4 x 450 MHz, 4 x 750 MHz)
- If required, can use batch systems (via LSF) to add interactive CPU to these (Linux) front ends
6. Intel Linux Farm
7. Tape storage
- Added 2nd silo this summer
- Required moving a room full of equipment
- Added 10 x 9940 drives (5 as part of the new silo)
- Current drives: 8 Redwood, 10 9840, 10 9940
- Redwood: 50 GB @ 10 MB/s (helical scan, single reel)
- 9840: 20 GB @ 10 MB/s (linear, mid-load cassette, fast)
- 9940: 60 GB @ 10 MB/s (linear, single reel)
- 9840 & 9940 are very reliable
- 9840 & 9940 have upgrade paths that use the same media
- 9940 2nd generation: 100 GB @ 20 MB/s ??
- Add 10 more 9940s this FY (budget..?)
- Replace Redwoods (reduce to 1-2)
- Requires copying 4500 tapes (started); budget for tape?
- Reliability, end of support(!)
8. Disk storage
- Added cache space
- For frequently used silo files, to reduce tape accesses
- Now have 22 cache servers
- 4 dedicated to farm (2 TB)
- 16 TB of cache space allocated to experiments
- Some bought and owned by groups
- Dual Linux systems, Gb network, 1 TB disk, RAID 0
- 9 SCSI systems
- 13 IDE systems
- Performance approximately equivalent
- Good match of CPU / network throughput / disk space
- This is a model that will scale by a few factors, but probably not by 10 (but there is as yet no solution to that)
- Looking at distributed file systems for the future to avoid NFS complications (GFS, etc.), but no production-level system yet
- N.B. accessing data with jcache does not need NFS, and is fault tolerant
- Added work space
- Added 4 systems to reduce load on fs3,4,5,6 (original /work)
- Dual Linux systems, Gb network, 1 TB disk, SCSI RAID 5
- Performance on all systems is now good
- Problems
9. JASMine
- JASMine: Mass Storage System software
- Rationale: why write another MSS?
- Had been using OSM
- Not scaleable, not supported, reached limit of the software; had to run 2 instances to get sufficient drive capacity
- Hidden from users by Tapeserver
- A Java layer that:
- Hid complexities of the OSM installations
- Implemented tape disk buffers (stage)
- Provided get, put, and managed cache (read copies of archived data) capabilities
- Migration from OSM
- Production environment
- Timescales driven by experiment schedules, need to add drive capacity
- Retain user interface
- Replace osmcp function, tape-to-disk, drive and library management
- Choices investigated
- Enstore, Castor, (HPSS)
- Timescales, support, adaptability (missing functionality/philosophy: cache/stage)
- Provide missing functions within the Tapeserver environment, clean up and rework
- JASMine (JLAB Asynchronous Storage Manager)
10. Architecture
- JASMine
- Written in Java
- For data movement, as fast as C code
- JDBC makes using and changing databases easy
- Distributed Data Movers and Cache Managers
- Scaleable to the foreseeable needs of the experiments
- Provides scheduling
- Optimizing file access requests
- User and group (and location-dependent) priorities
- Off-site cache or ftp servers for data exporting
- JASMine Cache Software
- Stand-alone component; can act as a local or remote client, allowing remote access to JASMine
- Can be deployed to a collaborator to manage a small disk system, and as the basis for coordinated data management between sites
- Cache manager runs on each cache server
- Hardware is not an issue
- Need a JVM, network, and a disk to store files
11. Software (cont.)
- MySQL database used by all servers, accessed via JDBC (a hedged example follows this slide)
- Fast and reliable.
- SQL
- Data Format
- ANSI standard labels with extra information
- Binary data
- Support to read legacy OSM tapes
- cpio, no file labels
- Protocol for file transfers
- Writes to cache are never NFS
- Reads from cache may be NFS
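As an illustration of the JDBC access to the MySQL catalog described above, here is a minimal sketch of looking up where a file lives. The table and column names, connection URL, credentials, and example path are all hypothetical; the real JASMine schema is not shown in these slides.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Minimal sketch: query a (hypothetical) file catalog in MySQL via JDBC.
    // Table/column names and the connection details are illustrative only.
    public class CatalogLookup {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://dbhost.example.org/jasmine"; // placeholder host/database
            try (Connection conn = DriverManager.getConnection(url, "reader", "secret");
                 PreparedStatement stmt = conn.prepareStatement(
                     "SELECT volume, tape_position FROM stub_files WHERE path = ?")) {
                stmt.setString(1, "/mss/halla/run1234.dat"); // example path, not a real stub file
                try (ResultSet rs = stmt.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("tape volume = " + rs.getString("volume")
                            + ", position = " + rs.getInt("tape_position"));
                    } else {
                        System.out.println("file not in catalog");
                    }
                }
            }
        }
    }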
12. Request Manager
(Diagram: a Client contacts the Request Manager, which works with the Scheduler, Database, Log Manager, Library Managers, and Data Movers; service, database, and log connections are shown.)
13. JASMine Services
- Database
- Stores metadata
- Also presented to users on an NFS filesystem as stub files
- But could equally be presented as e.g. a web service, LDAP, ...
- Do not need to access the stub files, just need to know the filenames
- Tracks status and locations of all requests, files, volumes, drives, etc.
- Request Manager
- Handles user requests and queries
- Scheduler
- Prioritizes user requests for tape access
- priority = share / (0.01 + num_a * ACTIVE_WEIGHT + num_c * COMPLETED_WEIGHT), where num_a and num_c count active and completed requests (a hedged sketch follows this slide)
- Host vs. user shares, farm priorities
- Log Manager
- Writes out log and error files and databases
- Sends out notices for failures
- Library Manager
- Mounts and dismounts tapes, as well as other library-related tasks
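A minimal sketch of the fair-share priority above, assuming num_a and num_c count a requester's active and completed requests; the weight and share values here are placeholders, not the production configuration.

    // Sketch of the scheduler formula from the slide:
    // priority = share / (0.01 + num_a*ACTIVE_WEIGHT + num_c*COMPLETED_WEIGHT)
    // The weight and share values below are illustrative only.
    public class SharePriority {
        static final double ACTIVE_WEIGHT = 1.0;     // placeholder
        static final double COMPLETED_WEIGHT = 0.25; // placeholder

        static double priority(double share, int numActive, int numCompleted) {
            return share / (0.01 + numActive * ACTIVE_WEIGHT + numCompleted * COMPLETED_WEIGHT);
        }

        public static void main(String[] args) {
            // A large-share user with lots of work in flight can rank below
            // a small-share user with an idle queue.
            System.out.println(priority(2.0, 10, 4)); // ~0.18
            System.out.println(priority(1.0, 0, 1));  // ~3.85
        }
    }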
14. JASMine Services (2)
- Data Mover
- Dispatcher
- Keeps track of available local resources and starts requests the local system can work on (a sketch follows this slide).
- Cache Manager
- Manages a disk or disks for pre-staging data to and from tape.
- Sends and receives data to and from clients.
- Volume Manager
- Manages tapes for availability.
- Drive Manager
- Manages tape drives for usage.
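To make the Dispatcher's role concrete, here is a hedged sketch of the kind of loop it might run: poll for pending requests, claim only as many as the local mover has free drive and stage capacity for, and hand them to worker threads. The interfaces and the poll interval are invented for illustration; the slides do not show the actual JASMine classes.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative dispatcher loop; Request, RequestQueue and LocalResources
    // are hypothetical stand-ins for whatever JASMine actually uses.
    public class Dispatcher implements Runnable {
        interface Request { void execute(); }
        interface RequestQueue { List<Request> claimRunnable(int maxJobs); }
        interface LocalResources { int freeDriveAndStageSlots(); }

        private final RequestQueue queue;
        private final LocalResources local;
        private final ExecutorService workers = Executors.newCachedThreadPool();

        Dispatcher(RequestQueue queue, LocalResources local) {
            this.queue = queue;
            this.local = local;
        }

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                int slots = local.freeDriveAndStageSlots();    // what can this mover take on?
                for (Request r : queue.claimRunnable(slots)) { // claim at most that many
                    workers.submit(r::execute);                // each request runs in a worker thread
                }
                try {
                    Thread.sleep(5_000);                       // poll interval is arbitrary
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }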
15. User Access
- Jput
- Put one or more files on tape
- Jget
- Get one or more files from tape
- Jcache
- Copies one or more files from tape to cache
- Jls
- Get metadata for one or more files
- Jtstat
- Status of the request queue
- Web interface
- Query status and statistics for entire system
16. Web interface
17. (no transcript)
18. Data Access to cache
- NFS
- Directory of links points the way.
- Mounted read-only by the farm.
- Users can mount read-only on their desktop.
- Jcache
- Java client.
- Checks to see if files are on cache disks.
- Will get/put files from/to cache disks.
- More efficient than NFS, avoids NFS hangs if
server dies, etc., but users like NFS
19. Disk Cache Management
- Disk pools are divided into groups
- Tape staging
- Experiments
- Pre-staging for the batch farm
- Management policy is set per group
- Cache: LRU, files removed as needed
- Stage: reference counting
- Explicit: manual addition and deletion
- Policies are pluggable, easy to add (an illustrative interface sketch follows this slide)
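Since the deletion policies are described as pluggable, the sketch below shows what such a plug-in interface could look like, with LRU as the example implementation. The interface name and method are assumptions; only the three policies themselves (LRU, reference counting, explicit) come from the slide.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    // Hypothetical plug-in interface for per-group deletion policies.
    public interface CachePolicy {
        // Return files that may be deleted to free at least bytesNeeded bytes.
        List<File> selectVictims(List<File> groupFiles, long bytesNeeded);
    }

    // Example plug-in: least-recently-used, as used for the cache groups.
    class LruPolicy implements CachePolicy {
        public List<File> selectVictims(List<File> groupFiles, long bytesNeeded) {
            List<File> byAge = groupFiles.stream()
                .sorted(Comparator.comparingLong(File::lastModified)) // oldest first; a real LRU
                .collect(Collectors.toList());                        // would track last access in the DB
            List<File> victims = new ArrayList<>();
            long freed = 0;
            for (File f : byAge) {
                if (freed >= bytesNeeded) break;
                victims.add(f);
                freed += f.length();
            }
            return victims;
        }
    }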
20. Protocol for file moving
- Simple, extensible protocol for file copies
- Messages are Java serialized objects passed over streams
- Bulk data transfer uses raw data transfer over TCP (a hedged wire-pattern sketch follows this slide)
- Protocol is synchronous: all calls block
- Asynchrony: multiple requests by threading
- CRC32 checksums at every transfer
- More fair than NFS
- A session may make many connections
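A minimal sketch of the wire pattern described above: one serialized control message, then the raw file bytes over the same TCP stream, with a CRC32 accumulated as the data flows. The message class and framing are invented for illustration; the real JASMine message types are not shown in the slides.

    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.OutputStream;
    import java.io.Serializable;
    import java.util.zip.CRC32;

    // Illustrative only: serialized control message, raw bulk data, CRC32 trailer.
    public class TransferSketch {
        // Hypothetical control message; not an actual JASMine class.
        static class PutFileRequest implements Serializable {
            final String path; final long size;
            PutFileRequest(String path, long size) { this.path = path; this.size = size; }
        }

        static void sendFile(OutputStream socketOut, String localPath, long size) throws IOException {
            ObjectOutputStream oos = new ObjectOutputStream(socketOut);
            oos.writeObject(new PutFileRequest(localPath, size)); // control message (serialized object)
            oos.flush();

            CRC32 crc = new CRC32();
            byte[] buf = new byte[64 * 1024];
            try (FileInputStream in = new FileInputStream(localPath)) {
                int n;
                while ((n = in.read(buf)) > 0) {       // bulk data: raw bytes, no serialization
                    socketOut.write(buf, 0, n);
                    crc.update(buf, 0, n);
                }
            }
            new DataOutputStream(socketOut).writeLong(crc.getValue()); // checksum trailer
            socketOut.flush();
        }
    }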
21. Protocol for file moving (cont.)
- Cache server extends the basic protocol
- Adds database hooks for the cache
- Adds hooks for cache policies
- Additional message types were added
- High-throughput disk pool
- Database shared by many servers
- Any server in the pool can look up a file's location, but data transfer is always direct between the client and the node holding the file
- Adding servers and disk to the pool increases throughput with no overhead, and provides fault tolerance
22. Example: get from cache
- cacheClient.getFile("/foo", "halla")
- Send a locate request to any server
- Receive the locate reply
- Contact the appropriate server
- Initiate a direct transfer
- Returns true on success
(Diagram message flow: the client asks any server "Where is /foo?"; the reply points to cache4; the client sends "Get /foo" to cache4, which answers "Sending /foo". A hedged client-side sketch follows this slide.)
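The steps above can be summarized in code. The sketch below assumes hypothetical LocateRequest/LocateReply message classes and a placeholder port; it shows only the locate-then-direct-transfer pattern, not the real jcache client API.

    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.net.Socket;

    // Sketch of the "get from cache" flow: ask any server in the pool where the
    // file is, then transfer directly from the server that holds it.
    public class GetFromCacheSketch {
        static class LocateRequest implements Serializable {
            final String path, group;
            LocateRequest(String path, String group) { this.path = path; this.group = group; }
        }
        static class LocateReply implements Serializable {
            String holdingServer; boolean found;
        }

        static final int CACHE_PORT = 9000; // placeholder

        public boolean getFile(String path, String group, String anyCacheServer) throws Exception {
            LocateReply reply;
            // 1. Send a locate request to any server (they share one database).
            try (Socket s = new Socket(anyCacheServer, CACHE_PORT)) {
                ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
                out.writeObject(new LocateRequest(path, group));
                out.flush();
                reply = (LocateReply) new ObjectInputStream(s.getInputStream()).readObject();
            }
            if (!reply.found) return false;

            // 2. Contact the server that holds the file (e.g. cache4) and pull
            //    the bytes directly from it.
            try (Socket s = new Socket(reply.holdingServer, CACHE_PORT)) {
                // ... send a get message and read the raw data plus CRC32 trailer ...
                return true;
            }
        }
    }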
23. Example: simple put to cache
- putFile("/quux", "halla", 123456789) (a hedged sketch follows this slide)
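A companion sketch for the put direction, since writes never go through NFS. The third argument is assumed to be the file size in bytes (the slide does not define it), and the host, port, and message format are placeholders.

    import java.io.ObjectOutputStream;
    import java.net.Socket;

    // Illustrative put: announce the file to a cache server, then stream the bytes.
    public class PutToCacheSketch {
        public boolean putFile(String path, String group, long size) throws Exception {
            try (Socket s = new Socket("cache-pool.example.org", 9000)) { // placeholder host/port
                ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
                out.writeObject(new Object[] { "PUT", path, group, size }); // stand-in for a real put message
                out.flush();
                // ... then send the raw file bytes and a CRC32 trailer, as in the earlier sketch ...
                return true;
            }
        }
    }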
24. Fault Tolerance
- Dead machines do not stop the system
- Data Movers work independently
- Unfinished jobs will restart on another mover
- Cache server failures will only impact NFS clients
- The system recognizes a dead server and will re-cache the file from tape
- If users did not use NFS, they would never see a failure, just extended access time
- Exception handling for (see the sketch after this slide):
- Receive timeouts
- Refused connections
- Broken connections
- Complete garbage on connections
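The exception handling listed above amounts to retrying a request against another server when a connection times out, is refused, breaks, or returns garbage. The sketch below shows that pattern with invented names and a placeholder port; it is not the actual jcache retry logic.

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.net.Socket;
    import java.net.SocketTimeoutException;
    import java.util.List;

    // Illustrative fault-tolerant request: try each candidate server in turn.
    public class ResilientClient {
        static final int PORT = 9000;         // placeholder
        static final int TIMEOUT_MS = 30_000; // placeholder receive timeout

        public Object request(Serializable message, List<String> servers) {
            for (String host : servers) {
                try (Socket s = new Socket(host, PORT)) {
                    s.setSoTimeout(TIMEOUT_MS);
                    ObjectOutputStream out = new ObjectOutputStream(s.getOutputStream());
                    out.writeObject(message);
                    out.flush();
                    return new ObjectInputStream(s.getInputStream()).readObject();
                } catch (SocketTimeoutException e) {
                    // receive timeout: try the next server
                } catch (IOException e) {
                    // refused, broken, or corrupted connection: try the next server
                } catch (ClassNotFoundException e) {
                    // unrecognized ("garbage") reply: try the next server
                }
            }
            return null; // all servers failed; the caller can fall back to re-caching from tape
        }
    }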
25. Authorization and Authentication
- Shared secret for each file transfer session
- Session authorization by policy objects
- Example: receive 5 files from user@bar
- Plug-in authenticators
- Establish the shared secret between client and server (one possible scheme is sketched after this slide)
- No clear-text passwords
- Extend to be compatible with GSI
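One common way to use a per-session shared secret without sending clear-text passwords is an HMAC challenge-response, sketched below. This is an assumption about the mechanism: the slides only say that plug-in authenticators establish a shared secret, not how it is then used.

    import java.security.MessageDigest;
    import java.security.SecureRandom;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    // Illustrative challenge-response: the server sends a random challenge, the
    // client returns HMAC(secret, challenge), and the server verifies it.
    // Neither the password nor the secret ever crosses the wire in the clear.
    public class SharedSecretAuth {
        public static byte[] newChallenge() {
            byte[] challenge = new byte[16];
            new SecureRandom().nextBytes(challenge);
            return challenge;
        }

        public static byte[] respond(byte[] sharedSecret, byte[] challenge) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(sharedSecret, "HmacSHA1"));
            return mac.doFinal(challenge);
        }

        public static boolean verify(byte[] sharedSecret, byte[] challenge, byte[] response) throws Exception {
            return MessageDigest.isEqual(respond(sharedSecret, challenge), response);
        }

        public static void main(String[] args) throws Exception {
            byte[] secret = "per-session-secret".getBytes(); // established by a plug-in authenticator
            byte[] challenge = newChallenge();               // sent by the server
            byte[] answer = respond(secret, challenge);      // computed by the client
            System.out.println("authenticated: " + verify(secret, challenge, answer));
        }
    }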
26. JASMine Bulk Data Transfers
- Model supports parallel transfers
- Many files at once, but not bbftp-style
- But could replace the stream class with a parallel stream
- For bulk data transfer over WANs
- Firewall issues
- Client initiates all connections
27. Architecture: disk pool hardware
- SCSI disk servers
- Dual Pentium III 650 (later 933) MHz CPUs
- 512 MBytes 100 MHz SDRAM ECC
- ASUS P2B-D motherboard
- NetGear GA620 Gigabit Ethernet PCI NIC
- Mylex eXtremeRAID 1100, 32 MBytes cache
- Seagate ST150176LW (qty. 8): 50 GBytes Ultra2 SCSI in hot-swap disk carriers
- CalPC 8U rack-mount case with redundant 400 W power supplies
- IDE disk servers
- Dual Pentium III 933 MHz CPUs
- 512 MBytes 133 MHz SDRAM ECC
- Intel STL2 or ASUS CUR-DLS motherboard
- NetGear GA620 or Intel PRO/1000 T Server Gigabit Ethernet PCI NIC
- 3ware Escalade 6800
- IBM DTLA-307075 (qty. 12): 75 GBytes Ultra ATA/100 in hot-swap disk carriers
- CalPC 8U rack-mount case with redundant 400 W power supplies
28. Cache Performance
- Matches network, disk I/O, and CPU performance with the size of the disk pool
- 800 GB
- 2 x 850 MHz
- Gb Ethernet
29. Cache status
30. Performance: SCSI vs IDE
- Disk array / file system: ext2
- SCSI disk server: 8 x 50 GByte disks in a RAID-0 stripe over 2 SCSI controllers
- 68 MBytes/sec single disk write
- 79 MBytes/sec burst for a single disk write
- 52 MBytes/sec single disk read
- 56 MBytes/sec burst for a single disk read
- IDE disk server: 6 x 75 GByte disks in a RAID-0 stripe
- 64 MBytes/sec single disk write
- 77 MBytes/sec burst for a single disk write
- 48 MBytes/sec single disk read
- 49 MBytes/sec burst for a single disk read
31. Performance: NFS vs jcache
- NFS v2 over UDP, 16 clients
- rsize=8192 and wsize=8192
- Reads
- SCSI disk servers
- 7700 NFS ops/sec and 80% CPU utilization
- 11000 NFS ops/sec burst and 83% CPU utilization
- 32 MBytes/sec and 83% CPU utilization
- IDE disk servers
- 7700 NFS ops/sec and 72% CPU utilization
- 11000 NFS ops/sec burst and 92% CPU utilization
- 32 MBytes/sec and 72% CPU utilization
- jcache, 16 clients
- Reads
- SCSI disk servers: 32 MBytes/sec and 100% CPU utilization
- IDE disk servers: 32 MBytes/sec and 100% CPU utilization
32. JASMine system performance
- End-to-end performance
- i.e. tape load, copy to stage, network copy to client
- Aggregate sustained performance of 50 MB/s is regularly observed in production
- During stress tests, up to 120 MB/s was sustained for several hours
- A data mover with 2 drives can handle 15 MB/s (disk contention is the limit)
- Expect the current system to handle 150 MB/s (10 movers at 15 MB/s each) and to scale by adding data movers and drives
- N.B. this is performance to a network client!
- Data handling
- Currently the system regularly moves 2-3 TB per day in total
- 6000 files per day, 2000 requests
33. (no transcript)
34. (no transcript)
35. JASMine performance
36. Tape migration
- Begin migration of 5000 Redwood tapes to 9940
- Procedure written
- Uses any/all available drives
- Uses staging to allow re-packing of tapes
- Expected to last 9-12 months
37. Typical Data Flows
(Diagram; main elements listed below.)
- Raw data: < 10 MB/s over Gigabit Ethernet (Halls A & C)
- Raw data: > 20 MB/s over Fiber Channel (Hall B)
- Batch Farm Cluster: 350 Linux nodes (400 MHz to 1 GHz), 10,000 SPECint95, managed by LSF + Java layer with web interface
- 25-30 MB/s (two flows shown in the diagram)
38. How to make optimal use of the resources
- Plan ahead!
- As a group
- Organize data sets in advance (a week ahead) and use the cache disks for their intended purpose
- Hold frequently used data in cache to reduce tape access
- In a high data rate environment no other strategy works
- When running farm productions
- Use jsub to submit many jobs in one command, as it was designed
- This optimizes tape accesses
- Gather output files together on work disks and make a single jput for a complete tape's worth of data
39. Remote data access
- Tape copying is deprecated
- Expensive, time consuming (for you and for us), and inefficient
- We have an OC-3 (155 Mbps) connection that is under-utilized; filling it will get us upgraded to OC-12 (622 Mbps)
- At the moment we often have to coordinate with ESnet and peers to ensure a high-bandwidth path, but this is improving as Grid development continues
- Use network copies
- bbftp service
- Parallel, secure ftp; optimizes use of WAN bandwidth
- Future
- Remote jcache
- Cache manager can be deployed remotely; demonstration Feb 02
- Remote silo access, policy-based (unattended) data migration
- GridFTP, bbftp, bbcp
- Parallel, secure ftp (or ftp-like)
- As part of a Grid infrastructure
- PKI authentication mechanism
40. (Data-)Grid Computing
41. Particle Physics Data Grid: Collaboratory Pilot
- Who we are: four leading Grid computer science projects and six international high energy and nuclear physics collaborations
- The problem at hand today: petabytes of storage, Teraops/s of computing, thousands of users, hundreds of institutions, 10 years of analysis ahead
- What we do: develop and deploy Grid services for our experiment collaborators, and promote and provide common Grid software and standards
42. PPDG Experiments
- ATLAS (A Toroidal LHC ApparatuS) at CERN. Runs 2006 on. Goals: TeV physics, the Higgs and the origin of mass. http://atlasinfo.cern.ch/Atlas/Welcome.html
- BaBar at the Stanford Linear Accelerator Center. Running now. Goals: study CP violation and more. http://www.slac.stanford.edu/BFROOT/
- CMS (the Compact Muon Solenoid detector) at CERN. Runs 2006 on. Goals: TeV physics, the Higgs and the origin of mass. http://cmsinfo.cern.ch/Welcome.html/
- D0 at the D0 colliding beam interaction region at Fermilab. Runs soon. Goals: learn more about the top quark, supersymmetry, and the Higgs. http://www-d0.fnal.gov/
- STAR (Solenoidal Tracker At RHIC) at BNL. Running now. Goals: quark-gluon plasma. http://www.star.bnl.gov/
- Thomas Jefferson National Laboratory. Running now. Goals: understanding the nucleus using electron beams. http://www.jlab.org/
43. PPDG Computer Science Groups
- Condor: develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing on large collections of computing resources with distributed ownership. http://www.cs.wisc.edu/condor/
- Globus: developing fundamental technologies needed to build persistent environments that enable software applications to integrate instruments, displays, and computational and information resources that are managed by diverse organizations in widespread locations. http://www.globus.org/
- SDM (Scientific Data Management Research Group): optimized and standardized access to storage systems. http://gizmo.lbl.gov/DM.html
- Storage Resource Broker: client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and cataloging/accessing replicated data sets. http://www.npaci.edu/DICE/SRB/index.html
44. Delivery of End-to-End Applications and Integrated Production Systems
- To allow thousands of physicists to share data and computing resources for scientific processing and analyses
- PPDG focus:
- Robust data replication
- Intelligent job placement and scheduling
- Management of storage resources
- Monitoring and information of global services
- Relies on Grid infrastructure:
- Security and policy
- High speed data transfer
- Network management
(Diagram: operators and users; resources: computers, storage, networks.)
45. Project Activities, End-to-End Applications, and Cross-Cut Pilots
- Project Activities are focused Experiment + Computer Science collaborative developments
- Replicated data sets for science analysis: BaBar, CMS, STAR
- Distributed Monte Carlo production services: ATLAS, D0, CMS
- Common storage management and interfaces: STAR, JLAB
- End-to-End Applications are used in experiment data handling systems to give real-world requirements, testing, and feedback
- Error reporting and response
- Fault-tolerant integration of complex components
- Cross-Cut Pilots for common services and policies
- Certificate Authority policy and authentication
- File transfer standards and protocols
- Resource monitoring: networks, computers, storage
46. Year 0.5-1 Milestones (1)
- Align milestones to experiment data challenges
- ATLAS: production distributed data service - 6/1/02
- BaBar: analysis across partitioned dataset storage - 5/1/02
- CMS: distributed simulation production - 1/1/02
- D0: distributed analyses across multiple workgroup clusters - 4/1/02
- STAR: automated dataset replication - 12/1/01
- JLAB: policy-driven file migration - 2/1/02
47. Year 0.5-1 Milestones (2)
- Common milestones with EDG
- GDMP robust file replication layer: joint project with EDG Work Package (WP) 2 (Data Access)
- Support of Project Month (PM) 9 WP6 TestBed milestone; will participate in the integration fest at CERN - 10/1/01
- Collaborate on PM21 design for WP2 - 1/1/02
- Proposed WP8 application tests using the PM9 testbed - 3/1/02
- Collaboration with GriPhyN
- SC2001 demos will use common resources, infrastructure and presentations - 11/16/01
- Common, GriPhyN-led grid architecture
- Joint work on monitoring proposed
48. Year 0.5-1 Cross-cuts
- Grid file replication services used by > 2 experiments
- GridFTP production releases
- Integrate with D0-SAM, STAR replication
- Interfaced through SRB for BaBar, JLAB
- Layered use by GDMP for CMS, ATLAS
- SRB and Globus replication services
- Include robustness features
- Common catalog features and API
- GDMP/Data Access layer continues to be shared between EDG and PPDG
- Distributed job scheduling and management used by > 1 experiment
- Condor-G, DAGMan, Grid-Scheduler for D0-SAM, CMS
- Job specification language interfaces to distributed schedulers: D0-SAM, CMS, JLAB
- Storage resource interface and management
- Consensus on API between EDG, SRM, and PPDG
- Disk cache management integrated with data replication services
49. Year 1: other goals
- Transatlantic application demonstrators
- BaBar data replication between SLAC and IN2P3
- D0 Monte Carlo job execution between Fermilab and NIKHEF
- CMS and ATLAS simulation production between Europe and the US
- Certificate exchange and authorization
- DOE Science Grid as CA?
- Robust data replication
- Fault tolerant
- Between heterogeneous storage resources
- Monitoring services
- MDS2 (Metacomputing Directory Service)?
- Common framework
- Network, compute, and storage information made available to scheduling and resource management
50. PPDG activities as part of the global Grid community
- Coordination with other Grid projects in our field
- GriPhyN (Grid Physics Network)
- European DataGrid
- Storage Resource Management collaboratory
- HENP Data Grid Coordination Committee
- Participation in experiment and Grid deployments in our field
- ATLAS, BaBar, CMS, D0, STAR, JLAB experiment data handling systems
- iVDGL/DataTAG: International Virtual Data Grid Laboratory
- Use DTF computational facilities?
- Active in standards committees
- Internet2 HENP Working Group
- Global Grid Forum
51. Staffing Levels
- We are stretched thin
- But compared with other labs with similar data volumes we are efficient
- Systems support group: 5 (+1 vacant)
- Farms, MSS development: 2
- HW support / networks: 3.7
- Telecom: 2.3
- Security: 2
- User services: 3
- MIS, database support: 8
- Support for engineering: 1
- We cannot do as much as we would like
52. Future (FY02)
- Removing Redwoods is a priority
- Copying tapes, replacing drives with 9940s
- Modest farm upgrades: replace older CPUs as budget allows
- Improve interactive systems
- Add more /work, /cache
- Grid developments
- Visible as efficient WAN data replication services
- After FY02
- Global filesystems to supersede NFS
- 10 Gb Ethernet
- Disk vs. tape? Improved tape densities, data rates
- We welcome (coordinated) input as to what would be most useful for your physics needs