1
Liverpool HEP - Site Report June 2008
Robert Fay, John Bland
2
Staff Status
  • One member of staff left in the past year
  • Paul Trepka, left March 2008
  • Two full time HEP system administrators
  • John Bland, Robert Fay
  • One full time Grid administrator currently being
    hired
  • Closing date for applications was Friday 13th,
    15 applications received
  • One part time hardware technician
  • Dave Muskett

3
Current Hardware
  • Desktops
  • 100 desktops: Scientific Linux 4.3, Windows XP
  • Minimum spec of 2GHz x86, 1GB RAM, TFT monitor
  • Laptops
  • 60 laptops: mixed architectures, specs and OSes.
  • Batch Farm
  • Software repository (0.7TB), storage (1.3TB)
  • Old batch queue has 10 SL3 dual 800MHz P3s with
    1GB RAM
  • medium, short queues consist of 40 SL4 MAP-2
    nodes (3GHz P4s)
  • 5 interactive nodes (dual Xeon 2.4GHz)
  • Using Torque/PBS (see the submission sketch below)
  • Used for general analysis jobs
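
A minimal submission helper for this setup, as a sketch only: the queue
name "medium" is taken from the list above, while the job script name,
walltime and resource request are illustrative assumptions.

    #!/usr/bin/env python
    """Sketch: submit an analysis job to the local Torque/PBS batch farm."""
    import subprocess

    def submit(job_script, queue="medium", walltime="02:00:00"):
        # qsub prints the new job identifier (e.g. "12345.server") on stdout
        result = subprocess.run(
            ["qsub", "-q", queue,
             "-l", "nodes=1:ppn=1,walltime=" + walltime,
             job_script],
            capture_output=True, text=True, check=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        print(submit("analysis_job.sh"))   # hypothetical job script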

4
Current Hardware continued
  • Matrix
  • 1 dual 2.40GHz Xeon, 1GB RAM
  • 6TB RAID array
  • Used for CDF batch analysis and data storage
  • HEP Servers
  • 4 core servers
  • User file store and bulk storage via NFS (Samba
    front end for Windows)
  • Web (Apache), email (Sendmail) and database
    (MySQL)
  • User authentication via NIS (Samba for Windows)
  • Dual Xeon 2.40GHz shell server and ssh server
  • Core servers have a failover spare

5
Current Hardware - continued
  • LCG Servers
  • CE, SE upgraded to new hardware
  • CE now 8-core Xeon 2 GHz, 8GB RAM
  • SE now 4-core Xeon 2.33GHz, 8GB RAM, RAID10
    array
  • CE, SE, UI all SL4, gLite 3.1
  • MON still SL3, gLite 3.0
  • BDII SL4, gLite 3.0

6
Current Hardware continued
  • MAP2 Cluster
  • 24-rack (960-node) cluster of Dell PowerEdge 650s
  • 4 racks (280 nodes) shared with other departments
  • Each node has 3GHz P4, 1GB RAM, 120GB local
    storage
  • 19 racks (680 nodes) primarily for LCG jobs (5
    racks currently allocated for local
    ATLAS/T2K/Cockcroft batch processing)
  • 1 rack (40 nodes) for general purpose local batch
    processing
  • Front end machines for ATLAS, T2K, Cockcroft
  • Each rack has two 24 port gigabit switches
  • All racks connected into VLANs via Force10
    managed switch

7
Storage
  • RAID
  • All file stores are using at least RAID5. Newer
    servers using RAID6.
  • All RAID arrays using 3ware 7xxx/9xxx controllers
    on Scientific Linux 4.3.
  • Arrays monitored with 3ware 3DM2 software.
  • File stores (a capacity-check sketch follows this
    list)
  • New user and critical software store, RAID6+HS
    (hot spare), 2.25TB
  • 10TB general purpose hepstores for bulk storage
  • 1.4TB batchstore and 0.7TB batchsoft for the
    batch farm cluster
  • 1.4TB hepdata for backups
  • 37TB RAID6 for LCG storage element
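
A simple capacity check for the stores listed above, as a sketch: the
mount points mirror the store names on this slide but are assumptions,
as is the 90% warning threshold.

    #!/usr/bin/env python
    """Sketch: warn when the HEP file stores are close to full."""
    import shutil

    STORES = ["/hepstore", "/hepdata", "/batchstore", "/batchsoft"]  # assumed mounts
    THRESHOLD = 0.90   # warn above 90% used (arbitrary)

    def check(paths=STORES, threshold=THRESHOLD):
        for path in paths:
            total, used, free = shutil.disk_usage(path)
            frac = used / total
            status = "WARN" if frac > threshold else "ok"
            print(f"{status:4s} {path}: {frac:.1%} used, {free / 1e12:.2f} TB free")

    if __name__ == "__main__":
        check()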

8
Storage (continued)
  • 3ware Problems!
  • 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character
    ioctl (0x108) timed out, resetting card.
  • 3w-9xxx: scsi0: ERROR: (0x06:0x001F):
    Microcontroller not ready during reset sequence.
  • 3w-9xxx: scsi0: AEN: ERROR (0x04:0x005F): Cache
    synchronization failed; some data lost: unit 0.
  • Leads to total loss of data access until system
    is rebooted.
  • Sometimes leads to data corruption at array
    level.
  • Seen under iozone load, under normal production
    load, and after a drive failure (a log-watch
    sketch follows this list)
  • Anyone else seen this?
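
One way to catch these before the array wedges is to scan syslog for the
driver messages quoted above. A minimal sketch, assuming the default
Scientific Linux log location and a purely hypothetical alert address:

    #!/usr/bin/env python
    """Sketch: scan syslog for the 3w-9xxx failure signatures above."""
    import re
    import smtplib
    from email.message import EmailMessage

    LOG_FILE = "/var/log/messages"        # default syslog target on SL
    ALERT_TO = "admins@example.ac.uk"     # hypothetical alert address

    PATTERNS = [
        re.compile(r"3w-9xxx: scsi\d+: WARNING: \(0x06:0x0037\)"),   # ioctl timeout
        re.compile(r"3w-9xxx: scsi\d+: ERROR: \(0x06:0x001F\)"),     # reset failure
        re.compile(r"3w-9xxx: scsi\d+: AEN: ERROR \(0x04:0x005F\)"), # cache sync lost
    ]

    def scan(path=LOG_FILE):
        """Return log lines matching the known 3ware failure signatures."""
        with open(path, errors="replace") as log:
            return [line.rstrip() for line in log
                    if any(p.search(line) for p in PATTERNS)]

    def alert(lines):
        msg = EmailMessage()
        msg["Subject"] = "3ware controller errors detected"
        msg["From"] = "raid-monitor@localhost"
        msg["To"] = ALERT_TO
        msg.set_content("\n".join(lines))
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        hits = scan()
        if hits:
            alert(hits)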

9
Network
  • Topology

[Topology diagram: Force10 gigabit switch at the core; 2GB links to MAP2 and to the WAN (through the firewall); LCG servers, offices and servers connected via 1GB links and VLANs]
10
Network (continued)
  • Core Force10 E600 managed switch.
  • Now have 450 gigabit ports (240 at line rate)
  • Used as central departmental switch, using VLANs
  • Increased bandwidth to WAN using link aggregation
    to 2-3Gbit/s
  • Increased departmental backbone to 2Gbit/s
  • Added departmental firewall/gateway
  • Network intrusion monitoring with snort
  • Most office PCs and laptops are on internal
    private network
  • Building network infrastructure is creaking:
    needs rewiring, and old cheap hubs and switches
    need replacing

11
Security Monitoring
  • Security
  • Logwatch (looking to develop filters to reduce
    noise)
  • University firewall, local firewall, network
    monitoring (snort)
  • Secure server room with swipe card access
  • Monitoring
  • Core network traffic usage monitored with ntop
    and cacti (all traffic to be monitored after
    network upgrade)
  • Use sysstat on core servers for recording system
    statistics
  • Rolling out system monitoring on all servers and
    worker nodes, using SNMP, Ganglia, Cacti, and
    Nagios
  • Hardware temperature monitors on water cooled
    racks, to be supplemented by software monitoring
    on nodes via SNMP (see the polling sketch below).
    Still investigating other environment monitoring
    solutions.
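
A minimal polling sketch for the per-node SNMP temperature monitoring
mentioned above. It assumes net-snmp's snmpwalk is installed, that snmpd
on the nodes exposes the lm-sensors table (LM-SENSORS-MIB
lmTempSensorsValue, conventionally in milli-degrees C), and uses
placeholder node names, community string and threshold.

    #!/usr/bin/env python
    """Sketch: poll worker-node temperatures over SNMP and flag hot nodes."""
    import subprocess

    TEMP_OID = "1.3.6.1.4.1.2021.13.16.2.1.3"   # lmTempSensorsValue column (assumed MIB)
    NODES = ["node001", "node002"]              # hypothetical node names
    LIMIT_C = 60.0                              # arbitrary warning threshold

    def node_temps(host, community="public"):
        """Return the sensor temperatures (deg C) reported by one node."""
        out = subprocess.run(
            ["snmpwalk", "-v2c", "-c", community, "-Oqv", host, TEMP_OID],
            capture_output=True, text=True, check=True).stdout
        return [int(v) / 1000.0 for v in out.split()]

    if __name__ == "__main__":
        for node in NODES:
            for temp in node_temps(node):
                flag = "WARN" if temp > LIMIT_C else "ok"
                print(f"{flag:4s} {node}: {temp:.1f} C")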

12
System Management
  • Puppet used for configuration management
  • Dotproject used for general helpdesk
  • RT integrated with Nagios for system management
    (a notification-handler sketch follows)
  • - Nagios automatically creates/updates tickets
    on acknowledgement
  • - Each RT ticket serves as a record for an
    individual system
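
The RT/Nagios coupling is site-specific; one common approach is a Nagios
notification command that mails RT's e-mail gateway so a ticket is
created per alert. A sketch under that assumption, with a hypothetical
queue address and with host/service/state supplied by Nagios macros on
the command line:

    #!/usr/bin/env python
    """Sketch: Nagios notification command that files a ticket via RT's
    e-mail gateway."""
    import smtplib
    import sys
    from email.message import EmailMessage

    RT_QUEUE_ADDR = "sysadmin@rt.example.ac.uk"   # hypothetical RT queue address

    def notify(host, service, state, output):
        msg = EmailMessage()
        msg["Subject"] = "[nagios] %s/%s is %s" % (host, service, state)
        msg["From"] = "nagios@localhost"
        msg["To"] = RT_QUEUE_ADDR
        msg.set_content(output)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        # e.g. notify_rt.py $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$
        notify(*sys.argv[1:5])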

13
Plans
  • Additional storage for the Grid
  • GridPP3 funded
  • Will be approx. 60? TB
  • May switch from dCache to DPM
  • Upgrades to local batch farm
  • Plans to purchase several multi-core (most likely
    8-core) nodes
  • Collaboration with local Computing Services
    Department
  • Share of their newly commissioned multi-core
    cluster available