1
Scientific Computing Resources
  • Ian Bird, Computer Center
  • Hall A Analysis Workshop
  • December 11, 2001

2
Overview
  • Current Resources
  • Recent evolution
  • Mass storage HW & SW
  • Farm
  • Remote data access
  • Staffing levels
  • Future Plans
  • Expansion/upgrades of current resources
  • Other computing: LQCD
  • Grid Computing
  • What is it? Should you care?

3
Jefferson Lab Scientific Computing Environment, November 2001
[Environment diagram; key elements:]
  • Batch Farm Cluster: 350 Linux nodes (400 MHz - 1 GHz), 10,000 SPECint95, managed by LSF + Java layer with web interface
  • Farm cache: 2 TB SCSI RAID 0 on Linux servers
  • JASMine-managed mass storage: 2 STK silos; 10 9940, 10 9840, and 8 Redwood drives; 10 Solaris/Linux data movers w/ 300 GB stage
  • Interactive Analysis: 2 Sun E450 (4-processor), 2 4-processor Intel/Linux
  • Lattice QCD Cluster: 40 Alpha/Linux (667 MHz), 256 Pentium 4 (Q2 FY02?), managed by PBS + web portal
  • Network: Gigabit Ethernet switching fabric, JLab network backbone, CUE general services, Grid gateway / bbftp service, Internet (ESNet OC-3)

4
JLab Farm and Mass Storage Systems, November 2001
[Network diagram; key elements:]
  • Batch farm: 350 processors (175 dual nodes), each connected at 100 Mb to a 24-port switch with Gb uplink (8 switches)
  • Mass storage: 2 STK silos; 10 9940, 10 9840, and 8 Redwood drives; 10 Solaris/Linux data movers, each w/ 300 GB stage and Gb uplink
  • Core switch: Foundry BigIron 8000, 256 Gb backplane, 45/60 Gb ports in use
  • Site router: CUE and general services
  • CH-Router: incoming data from Halls A & C; Fiber Channel direct from CLAS (Hall B)
  • Work disks: 4 MetaStor systems, each with 100 Mb uplink; total 5 TB SCSI RAID 5
  • Work disk farm: 4 Linux servers, each with Gb uplink; total 4 TB SCSI RAID 5
  • Cache disk farm: 20 Linux servers, each with Gb uplink; total 16 TB SCSI/IDE RAID 0
5
CPU Resources
  • Farm
  • Upgraded this summer with 60 dual 1 GHz P III (4
    cpu / 1 u rackmount)
  • Retired original 10 dual 300 MHz
  • Now 350 cpu (400, 450, 500, 750, 1000 MHz)
  • 11,000 SPECint95
  • Deliver > 500,000 SI95-hrs / week
  • Equivalent to 75 1 GHz cpu
  • Interactive
  • Solaris: 2 E450 (4-proc)
  • Linux: 2 quad systems (4x450, 4x750 MHz)
  • If required can use batch systems (via LSF) to
    add interactive CPU to these (Linux) front ends

6
Intel Linux Farm
7
Tape storage
  • Added 2nd silo this summer
  • Required moving a room full of equipment
  • Added 10 9940 drives (5 as part of new silo)
  • Current
  • 8 Redwood, 10 9840, 10 9940
  • Redwood: 50 GB @ 10 MB/s (helical scan, single reel)
  • 9840: 20 GB @ 10 MB/s (linear, mid-load cassette (fast))
  • 9940: 60 GB @ 10 MB/s (linear, single reel)
  • 9840 & 9940 are very reliable
  • 9840 & 9940 have upgrade paths that use the same
    media
  • 9940 2nd generation: 100 GB @ 20 MB/s ??
  • Add 10 more 9940 this FY (budget..?)
  • Replace Redwoods (reduce to 1-2)
  • Requires copying 4500 tapes (started); budget
    for tape?
  • Reliability, end of support(!)

8
Disk storage
  • Added cache space
  • For frequently used silo files, to reduce tape
    accesses
  • Now have 22 cache servers
  • 4 dedicated to farm (2 TB)
  • 16 TB of cache space allocated to experiments
  • Some bought and owned by groups
  • Dual Linux systems, Gb network, 1 TB disk, RAID
    0
  • 9 SCSI systems
  • 13 IDE systems
  • Performance approximately equivalent
  • Good match of CPU, network throughput, and disk space
  • This model will scale by a few factors, but probably
    not by 10x (and there is as yet no solution to that)
  • Looking at distributed file systems for the future to
    avoid NFS complications (GFS, etc.), but no
    production-level system yet
  • N.B. Accessing data with jcache does not need NFS,
    and is fault tolerant
  • Added work space
  • Added 4 systems to reduce load on fs3,4,5,6 (orig
    /work)
  • Dual Linux systems, Gb network, 1 TB disk, SCSI
    RAID 5
  • Performance on all systems is now good
  • Problems

9
JASMine
  • JASMine: Mass Storage System software
  • Rationale: why write another MSS?
  • Had been using OSM
  • Not scalable, not supported, reached the limit of the
    software; had to run 2 instances to get sufficient
    drive capacity
  • Hidden from users by Tapeserver
  • Java layer that:
  • Hid complexities of OSM installations
  • Implemented tape-disk buffers (stage)
  • Provided get, put, managed cache (read copies of
    archived data) capabilities
  • Migration from OSM
  • Production environment.
  • Timescales driven by experiment schedules, need
    to add drive capacity
  • Retain user interface
  • Replace osmcp function (tape to disk), drive
    and library management
  • Choices investigated
  • Enstore, Castor, (HPSS)
  • Timescales, support, adaptability (missing
    functionality/philosophy: cache/stage)
  • Provide missing functions within the Tapeserver
    environment, with clean-up and reworking
  • JASMine (JLAB Asynchronous Storage Manager)

10
Architecture
  • JASMine
  • Written in Java
  • For data movement, as fast as C code.
  • JDBC makes using and changing databases easy.
  • Distributed Data Movers and Cache Managers
  • Scaleable to the foreseeable needs of the
    experiments
  • Provides scheduling
  • Optimizing file access requests
  • User and group (and location dependent)
    priorities
  • Off-site cache or ftp servers for data exporting
  • JASMine Cache Software
  • Stand-alone component can act as a local or
    remote client, allows remote access to JASMine
  • Can be deployed to a collaborator to manage small
    disk system and as basis for coordinated data
    management between sites
  • Cache manager runs on each cache server.
  • Hardware is not an issue.
  • Need a JVM, network, and a disk to store files.

11
Software cont.
  • MySQL database used by all servers.
  • Fast and reliable.
  • SQL
  • Data Format
  • ANSI standard labels with extra information
  • Binary data
  • Support to read legacy OSM tapes
  • cpio, no file labels
  • Protocol for file transfers
  • Writes to cache are never NFS
  • Reads from cache may be NFS

12
[Architecture diagram: Client, Request Managers, Scheduler, Database, Log Manager, Library Managers, and Data Mover, linked by service, database, and log connections]
13
JASMine Services
  • Database
  • Stores metadata
  • also presented to user on an NFS filesystem as
    stubfiles
  • But could equally be presented as, e.g., a web
    service, LDAP, etc.
  • Do not need to access stubfiles; just need to
    know filenames
  • Tracks status and locations of all requests,
    files, volumes, drives, etc.
  • Request Manager
  • Handles user requests and queries.
  • Scheduler
  • Prioritizes user requests for tape access.
  • priority = share / (0.01 + (num_a * ACTIVE_WEIGHT) + (num_c * COMPLETED_WEIGHT))
    (see the sketch after this list)
  • Host vs User shares, farm priorities
  • Log Manager
  • Writes out log and error files and databases.
  • Sends out notices for failures.
  • Library Manager
  • Mounts and dismounts tapes, as well as other
    library-related tasks.
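
The rule above is a fair-share formula: requests from a user or host with a
large configured share and few active or recently completed requests come
first. A minimal Java sketch of that calculation, with assumed illustrative
weight values (the slide does not give JASMine's actual constants or code):

    public class FairShareSketch {
        // Assumed illustrative weights; not JASMine's real values.
        static final double ACTIVE_WEIGHT = 1.0;
        static final double COMPLETED_WEIGHT = 0.5;

        // priority = share / (0.01 + num_a * ACTIVE_WEIGHT + num_c * COMPLETED_WEIGHT)
        static double priority(double share, int numActive, int numCompleted) {
            return share / (0.01 + numActive * ACTIVE_WEIGHT
                                 + numCompleted * COMPLETED_WEIGHT);
        }

        public static void main(String[] args) {
            // Equal shares: the requester with fewer active requests ranks higher.
            System.out.println(priority(10.0, 5, 2));  // ~1.66
            System.out.println(priority(10.0, 1, 2));  // ~4.98
        }
    }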

14
JASMine Services -2
  • Data Mover
  • Dispatcher
  • Keeps track of available local resources and
    starts requests the local system can work on.
  • Cache Manager
  • Manages a disk or disks for pre-staging data to
    and from tape.
  • Sends and receives data to and from clients.
  • Volume Manager
  • Manages tapes for availability.
  • Drive Manager
  • Manages tape drives for usage.

15
User Access
  • Jput
  • Put one or more files on tape
  • Jget
  • Get one or more files from tape
  • Jcache
  • Copies one or more files from tape to cache
  • Jls
  • Get metadata for one or more files
  • Jtstat
  • Status of the request queue
  • Web interface
  • Query status and statistics for entire system

16
Web interface
17
(No Transcript)
18
Data Access to cache
  • NFS
  • Directory of links points the way.
  • Mounted read-only by the farm.
  • Users can mount read-only on their desktop.
  • Jcache
  • Java client.
  • Checks to see if files are on cache disks.
  • Will get/put files from/to cache disks.
  • More efficient than NFS, avoids NFS hangs if
    server dies, etc., but users like NFS

19
Disk Cache Management
  • Disk Pools are divided into groups
  • Tape staging.
  • Experiments.
  • Pre-staging for the batch farm.
  • Management policy set per group
  • Cache: LRU; files removed as needed.
  • Stage: reference counting.
  • Explicit manual addition and deletion.
  • Policies are pluggable; easy to add (see the sketch
    after this list)
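
To make "pluggable" concrete, here is a minimal Java sketch of what a
per-group policy interface and an LRU implementation could look like; the
interface, class names, and use of java.io.File are assumptions for
illustration, not JASMine's actual API:

    import java.io.File;
    import java.util.Comparator;
    import java.util.List;

    // Hypothetical policy contract: pick files to delete until enough space is freed.
    interface CachePolicySketch {
        List<File> selectVictims(List<File> files, long bytesNeeded);
    }

    // LRU-style policy for cache groups: least recently touched files go first.
    class LruPolicySketch implements CachePolicySketch {
        public List<File> selectVictims(List<File> files, long bytesNeeded) {
            files.sort(Comparator.comparingLong(File::lastModified)); // oldest first
            long freed = 0;
            int i = 0;
            while (i < files.size() && freed < bytesNeeded) {
                freed += files.get(i).length();
                i++;
            }
            return files.subList(0, i);  // candidates for deletion
        }
    }

A reference-counting policy for the stage group, or a manual-only policy,
would simply be another implementation of the same interface.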

20
Protocol for file moving
  • Simple extensible protocol for file copies
  • Messages are Java serialized objects passed over
    streams
  • Bulk data transfer uses raw data transfer over
    TCP (see the sketch after this list)
  • Protocol is synchronous; all calls block
  • Asynchrony: multiple requests via threading
  • CRC32 checksums at every transfer
  • Fairer than NFS
  • Session may make many connections
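
As a rough illustration of that pattern, the sketch below sends a serialized
control object followed by the raw file bytes on the same TCP connection,
accumulating a CRC32 as the data streams out. The message class and
single-connection layout are assumptions; JASMine's real wire format is not
shown on these slides:

    import java.io.*;
    import java.net.Socket;
    import java.util.zip.CRC32;

    public class TransferSketch {
        // Returns the CRC32 of the bytes sent, for comparison with the receiver's value.
        public static long sendFile(Socket socket, Serializable request, File file)
                throws IOException {
            ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
            out.writeObject(request);      // control message: a serialized Java object
            out.flush();

            CRC32 crc = new CRC32();
            try (InputStream in = new FileInputStream(file)) {
                byte[] buf = new byte[64 * 1024];
                int n;
                while ((n = in.read(buf)) != -1) {
                    crc.update(buf, 0, n); // checksum computed on the fly
                    out.write(buf, 0, n);  // bulk data: raw bytes over TCP
                }
            }
            out.flush();
            return crc.getValue();
        }
    }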

21
Protocol for file moving
  • Cache server extends the basic protocol
  • Add database hooks for cache
  • Add hooks for cache policies
  • Additional message types were added
  • High throughput disk pool
  • Database shared by many servers
  • Any server in the pool can look up a file's location,
  • But data transfer is always direct between the client
    and the node holding the file
  • Adding servers and disk to the pool increases
    throughput with no overhead
  • Provides fault tolerance

22
Example Get from cache
  • cacheClient.getFile("/foo", "halla")
  • send locate request to any server
  • receive locate reply
  • contact appropriate server
  • initiate direct xfer
  • Returns true on success (sketch below)

[Diagram: the client asks any server "Where is /foo?", the locate reply points to cache4, and the client then exchanges "Get /foo" / "Sending /foo" directly with cache4]
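
A minimal Java sketch of the same locate-then-fetch flow; only the two-step
pattern (ask any pool server, then transfer directly from the server holding
the file) comes from the slide, and the helper methods are placeholders for
the real protocol exchange:

    public class CacheClientSketch {
        public boolean getFile(String path, String group) {
            // 1. Send a locate request to any server in the pool; the shared
            //    database lets every server answer for every file.
            String holder = locate(path, group);   // e.g. "cache4"
            if (holder == null) {
                return false;                      // not cached; caller falls back to a tape request
            }
            // 2. Contact the server holding the file and start a direct transfer.
            return transferFrom(holder, path);
        }

        // Placeholder stubs standing in for the real request/reply messages.
        private String locate(String path, String group) { return null; }
        private boolean transferFrom(String server, String path) { return false; }
    }
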
23
Example simple put to cache
  • putFile("/quux", "halla", 123456789)

24
Fault Tolerance
  • Dead machines do not stop the system
  • Data Movers work independently
  • Unfinished jobs will restart on another mover
  • Dead cache servers will only impact NFS clients
  • System recognizes dead server and will re-cache
    file from tape
  • If users did not use NFS they would never see a
    failure, just extended access time
  • Exception handling (see the sketch after this list) for:
  • Receive timeouts
  • Refused connections
  • Broken connections
  • Complete garbage on connections
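
An illustrative Java sketch of how those failure modes map onto exceptions,
with a retry against the next server in the pool; the class, port handling,
and timeout value are assumptions, not JASMine code:

    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.StreamCorruptedException;
    import java.net.ConnectException;
    import java.net.Socket;
    import java.net.SocketTimeoutException;

    public class ResilientRequestSketch {
        static Object askAnyServer(String[] servers, int port) {
            for (String host : servers) {
                try (Socket s = new Socket(host, port)) {
                    s.setSoTimeout(30_000);                  // receive timeout
                    ObjectInputStream in = new ObjectInputStream(s.getInputStream());
                    return in.readObject();                  // reply message
                } catch (SocketTimeoutException e) {
                    // receive timeout: try the next server
                } catch (ConnectException e) {
                    // refused connection: server down, try another
                } catch (StreamCorruptedException | ClassNotFoundException e) {
                    // garbage on the connection: drop it and move on
                } catch (IOException e) {
                    // broken connection mid-stream: retry elsewhere
                }
            }
            return null;  // all servers failed; caller can re-cache from tape
        }
    }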

25
Authorization and Authentication
  • Shared secret for each file transfer session
  • Session authorization by policy objects
  • Example: receive 5 files from user@bar
  • Plug-in authenticators
  • Establish shared secret between client and server
  • No clear text passwords (one possible scheme is
    sketched below)
  • Extend to be compatible with GSI
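
The slides do not specify how the shared secret is used, so the following is
only an assumed illustration of a challenge-response that keeps passwords off
the wire (an HMAC over a server-supplied nonce); JASMine's actual plug-in
authenticators may work differently:

    import java.nio.charset.StandardCharsets;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class SessionAuthSketch {
        // Client proves knowledge of the per-session secret without sending it.
        static byte[] respond(byte[] sharedSecret, byte[] challenge) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(sharedSecret, "HmacSHA1"));
            return mac.doFinal(challenge);  // server recomputes and compares
        }

        public static void main(String[] args) throws Exception {
            byte[] secret = "per-session-secret".getBytes(StandardCharsets.UTF_8);
            byte[] nonce = "random-server-nonce".getBytes(StandardCharsets.UTF_8);
            System.out.println(respond(secret, nonce).length + "-byte response");
        }
    }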

26
JASMine Bulk Data Transfers
  • Model supports parallel transfers
  • Many files at once, but not bbftp style
  • But could replace stream class with a parallel
    stream
  • For bulk data transfer over WANs
  • Firewall issues
  • Client initiates all connections

27
Architecture Disk pool hardware
  • SCSI Disk Servers
  • Dual Pentium III 650 (later 933)MHz CPUs
  • 512 Mbytes 100MHz SDRAM ECC
  • ASUS P2B-D Motherboard
  • NetGear GA620 Gigabit Ethernet PCI NIC
  • Mylex eXtremeRAID 1100, 32 MBytes cache
  • Seagate ST150176LW (Qty. 8) - 50 GBytes Ultra2
    SCSI in Hot Swap Disk Carriers
  • CalPC 8U Rack Mount Case with Redundant 400W
    Power Supplies
  • IDE Disk Servers
  • Dual Pentium III 933MHz CPUs
  • 512 Mbytes 133MHz SDRAM ECC
  • Intel STL2 or ASUS CUR-DLS Motherboard
  • NetGear GA620 or Intel PRO/1000 T Server Gigabit
    Ethernet PCI NIC
  • 3ware Escalade 6800
  • IBM DTLA-307075 (Qty. 12) - 75 GBytes Ultra
    ATA/100 in Hot Swap Disk Carriers
  • CalPC 8U Rack Mount Case with Redundant 400W
    Power Supplies

28
Cache Performance
  • Matches network, disk I/O, and CPU performance
    with size of disk pool
  • 800 GB
  • 2 x 850 MHz
  • Gb Ethernet

29
Cache status
30
Performance SCSI vs IDE
  • Disk Array / File System: ext2
  • SCSI Disk Server - 8 x 50 GByte disks in a RAID-0
    stripe over 2 SCSI controllers
  • 68 MBytes/sec single disk write
  • 79 MBytes/sec burst for a single disk write
  • 52 MBytes/sec single disk read
  • 56 MBytes/sec burst for a single disk read
  • IDE Disk Server - 6 x 75 GByte disks in a RAID-0
    stripe
  • 64 MBytes/sec single disk write
  • 77 MBytes/sec burst for a single disk write
  • 48 MBytes/sec single disk read
  • 49 MBytes/sec burst for a single disk read

31
Performance NFS vs Jcache
  • NFS v2 (UDP) - 16 clients
  • rsize=8192 and wsize=8192
  • Reads
  • SCSI Disk Servers
  • 7700 NFS ops/sec and 80% cpu utilization
  • 11000 NFS ops/sec burst and 83% cpu utilization
  • 32 MBytes/sec and 83% cpu utilization
  • IDE Disk Servers
  • 7700 NFS ops/sec and 72% cpu utilization
  • 11000 NFS ops/sec burst and 92% cpu utilization
  • 32 MBytes/sec and 72% cpu utilization
  • Jcache - 16 clients
  • Reads
  • SCSI Disk Servers
  • 32 MBytes/sec and 100% cpu utilization
  • IDE Disk Servers
  • 32 MBytes/sec and 100% cpu utilization

32
JASMine system performance
  • End-to-end performance
  • i.e. tape load, copy to stage, network copy to
    client
  • Aggregate sustained performance of 50MB/s is
    regularly observed in production
  • During stress tests, up to 120 MB/s was sustained
    for several hours
  • A data mover with 2 drives can handle 15MB/s
    (disk contention is the limit)
  • Expect the current system to handle 150 MB/s; it is
    scalable by adding data movers and drives
  • N.B. this is performance to a network client!
  • Data handling
  • Currently the system regularly moves 2-3 TB per
    day total
  • 6000 files per day, 2000 requests

33
(No Transcript)
34
(No Transcript)
35
JASMine performance
36
Tape migration
  • Begin migration of 5000 Redwood tapes to 9940
  • Procedure written
  • Uses any/all available drives
  • Use staging to allow re-packing of tapes
  • Expect it will last 9-12 months

37
Typical Data Flows
[Data-flow diagram; key figures:]
  • Raw data from Halls A & C: < 10 MB/s over Gigabit Ethernet
  • Raw data from Hall B: > 20 MB/s over Fiber Channel
  • Batch Farm Cluster: 350 Linux nodes (400 MHz - 1 GHz), 10,000 SPECint95, managed by LSF + Java layer with web interface
  • Other flows shown at 25-30 MB/s
38
How to make optimal use of the resources
  • Plan ahead!
  • As a group
  • Organize data sets in advance (a week ahead) and use
    the cache disks for their intended purpose
  • Hold frequently used data to reduce tape access
  • In a high data rate environment no other strategy
    works
  • When running farm productions
  • Use jsub to submit many jobs in one command, as
    it was designed
  • Optimizes tape accesses
  • Gather output files together on work disks and
    make a single jput for a complete tape's worth of
    data

39
Remote data access
  • Tape copying is deprecated
  • Expensive, time consuming (for you and us), and
    inefficient
  • We have OC-3 (155 Mbps) connection that is
    under-utilized, filling it will get us upgraded
    to OC-12 (622 Mbps)
  • At the moment we often have to coordinate with
    ESnet and peers to ensure a high-bandwidth path,
    but this is improving as Grid development
    continues
  • Use network copies
  • Bbftp service
  • Parallel, secure ftp optimizes use of WAN
    bandwidth
  • Future
  • Remote jcache
  • Cache manager can be deployed remotely;
    demonstration Feb 02.
  • Remote silo access, policy-based (unattended)
    data migration
  • GridFTP, bbftp, bbcp
  • Parallel, secure ftp (or ftp-like)
  • As part of a Grid infrastructure
  • PKI authentication mechanism

40
(Data-) Grid Computing
41
Particle Physics Data Grid Collaboratory Pilot
Who we are: four leading Grid computer science projects
and six international High Energy and Nuclear Physics
collaborations.
The problem at hand today: petabytes of storage,
Teraops/s of computing, thousands of users, hundreds of
institutions, 10 years of analysis ahead.
What we do: develop and deploy Grid services for our
experiment collaborators, and promote and provide common
Grid software and standards.
42
PPDG Experiments
ATLAS - A Toroidal LHC ApparatuS at CERN. Runs 2006 on.
Goals: TeV physics - the Higgs and the origin of mass.
http://atlasinfo.cern.ch/Atlas/Welcome.html
BaBar - at the Stanford Linear Accelerator Center.
Running now. Goals: study CP violation and more.
http://www.slac.stanford.edu/BFROOT/
CMS - the Compact Muon Solenoid detector at CERN. Runs
2006 on. Goals: TeV physics - the Higgs and the origin
of mass. http://cmsinfo.cern.ch/Welcome.html/
D0 - at the D0 colliding beam interaction region at
Fermilab. Runs soon. Goals: learn more about the top
quark, supersymmetry, and the Higgs.
http://www-d0.fnal.gov/
STAR - Solenoidal Tracker At RHIC at BNL. Running now.
Goals: quark-gluon plasma. http://www.star.bnl.gov/
Thomas Jefferson National Laboratory - Running now.
Goals: understanding the nucleus using electron beams.
http://www.jlab.org/
43
PPDG Computer Science Groups
Condor - develop, implement, deploy, and evaluate
mechanisms and policies that support High Throughput
Computing on large collections of computing resources
with distributed ownership.
http://www.cs.wisc.edu/condor/
Globus - developing fundamental technologies needed to
build persistent environments that enable software
applications to integrate instruments, displays,
computational and information resources that are managed
by diverse organizations in widespread locations.
http://www.globus.org/
SDM - Scientific Data Management Research Group:
optimized and standardized access to storage systems.
http://gizmo.lbl.gov/DM.html
Storage Resource Broker - client-server middleware that
provides a uniform interface for connecting to
heterogeneous data resources over a network and
cataloging/accessing replicated data sets.
http://www.npaci.edu/DICE/SRB/index.html
44
Delivery of End-to-End Applications and Integrated
Production Systems, to allow thousands of physicists to
share data and computing resources for scientific
processing and analyses
  • PPDG Focus:
  • Robust Data Replication
  • Intelligent Job Placement and Scheduling
  • Management of Storage Resources
  • Monitoring and Information of Global Services
  • Relies on Grid infrastructure:
  • Security and Policy
  • High Speed Data Transfer
  • Network management
[Diagram: operators and users on top; resources (computers, storage, networks) below]
45
Project Activities, End-to-End Applications, and
Cross-Cut Pilots
  • Project Activities are focused Experiment /
    Computer Science collaborative developments.
  • Replicated data sets for science analysis:
    BaBar, CMS, STAR
  • Distributed Monte Carlo production services:
    ATLAS, D0, CMS
  • Common storage management and interfaces: STAR,
    JLAB
  • End-to-End Applications used in Experiment data
    handling systems to give real-world requirements,
    testing and feedback.
  • Error reporting and response
  • Fault tolerant integration of complex components
  • Cross-Cut Pilots for common services and policies
  • Certificate Authority policy and authentication
  • File transfer standards and protocols
  • Resource Monitoring: networks, computers,
    storage.

46
Year 0.5-1 Milestones (1)
  • Align milestones to Experiment data challenges
  • ATLAS production distributed data service:
    6/1/02
  • BaBar analysis across partitioned dataset
    storage: 5/1/02
  • CMS distributed simulation production: 1/1/02
  • D0 distributed analyses across multiple
    workgroup clusters: 4/1/02
  • STAR automated dataset replication: 12/1/01
  • JLAB policy-driven file migration: 2/1/02

47
Year 0.5-1 Milestones
  • Common milestones with EDG
  • GDMP robust file replication layer: joint
    project with EDG Work Package (WP) 2 (Data
    Access)
  • Support of Project Month (PM) 9 WP6 TestBed
    Milestone. Will participate in integration fest
    at CERN - 10/1/01
  • Collaborate on PM21 design for WP2 - 1/1/02
  • Proposed WP8 Application tests using PM9 testbed
    3/1/02
  • Collaboration with GriPhyN
  • SC2001 demos will use common resources,
    infrastructure and presentations: 11/16/01
  • Common, GriPhyN-led grid architecture
  • Joint work on monitoring proposed

48
Year 0.5-1 Cross-cuts
  • Grid File Replication Services used by >2
    experiments
  • GridFTP production releases
  • Integrate with D0-SAM, STAR replication
  • Interfaced through SRB for BaBar, JLAB
  • Layered use by GDMP for CMS, ATLAS
  • SRB and Globus Replication Services
  • Include robustness features
  • Common catalog features and API
  • GDMP/Data Access layer continues to be shared
    between EDG and PPDG.
  • Distributed Job Scheduling and Management used by
    >1 experiment
  • Condor-G, DAGman, Grid-Scheduler for D0-SAM, CMS
  • Job specification language interfaces to
    distributed schedulers D0-SAM, CMS, JLAB
  • Storage Resource Interface and Management
  • Consensus on API between EDG, SRM, and PPDG
  • Disk cache management integrated with data
    replication services

49
Year 1 other goals
  • Transatlantic Application Demonstrators
  • BaBar data replication between SLAC and IN2P3
  • D0 Monte Carlo Job Execution between Fermilab and
    NIKHEF
  • CMS ATLAS simulation production between
    Europe/US
  • Certificate exchange and authorization.
  • DOE Science Grid as CA?
  • Robust data replication.
  • fault tolerant
  • between heterogeneous storage resources.
  • Monitoring Services
  • MDS2 (Metacomputing Directory Service)?
  • common framework
  • network, compute and storage information made
    available to scheduling and resource management.

50
PPDG activities as part of the Global Grid
Community
  • Coordination with other Grid Projects in our
    field
  • GriPhyN Grid for Physics Network
  • European DataGrid
  • Storage Resource Management collaboratory
  • HENP Data Grid Coordination Committee
  • Participation in Experiment and Grid deployments
    in our field
  • ATLAS, BaBar, CMS, D0, Star, JLAB experiment data
    handling systems
  • iVDGL/DataTAG International Virtual Data Grid
    Laboratory
  • Use DTF computational facilities?
  • Active in Standards Committees
  • Internet2 HENP Working Group
  • Global Grid Forum

51
Staffing Levels
  • We are stretched thin
  • But compared with other labs with similar data
    volumes we are efficient
  • Systems support group: 5 (1 vacant)
  • Farms, MSS development: 2
  • HW support / Networks: 3.7
  • Telecom: 2.3
  • Security: 2
  • User services: 3
  • MIS, Database support: 8
  • Support for Engineering: 1
  • We cannot do as much as we would like

52
Future (FY02)
  • Removing Redwoods is a priority
  • Copying tapes, replacing drives w/ 9940s
  • Modest farm upgrades: replace older CPU as
    budget allows
  • Improve interactive systems
  • Add more /work, /cache
  • Grid developments
  • Visible as efficient WAN data replication
    services
  • After FY02
  • Global filesystems to supersede NFS
  • 10 Gb Ethernet
  • Disk vs. tape? Improved tape densities, data
    rates
  • We welcome (coordinated) input as to what would
    be most useful for your physics needs