1
Liverpool HEP - Site Report June 2008
Robert Fay, John Bland
2
Staff Status
  • One member of staff left in the past year
  • Paul Trepka, left March 2008
  • Two full time HEP system administrators
  • John Bland, Robert Fay
  • One full time Grid administrator currently being
    hired
  • Closing date for applications was Friday 13th,
    15 applications received
  • One part time hardware technician
  • Dave Muskett

3
Current Hardware
  • Desktops
  • 100 desktops: Scientific Linux 4.3, Windows XP
  • Minimum spec of 2GHz x86, 1GB RAM, TFT monitor
  • Laptops
  • 60 laptops: mixed architectures, specs and OSes.
  • Batch Farm
  • Software repository (0.7TB), storage (1.3TB)
  • Old batch queue has 10 SL3 dual 800MHz P3s with
    1GB RAM
  • medium, short queues consist of 40 SL4 MAP-2
    nodes (3GHz P4s)
  • 5 interactive nodes (dual Xeon 2.4GHz)
  • Using Torque/PBS (see the submission sketch below)
  • Used for general analysis jobs
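
A minimal submission helper for this setup, as a sketch only: the queue
name "medium" is taken from the list above, while the job script name,
walltime and resource request are illustrative assumptions.

    #!/usr/bin/env python
    """Sketch: submit an analysis job to the local Torque/PBS batch farm."""
    import subprocess

    def submit(job_script, queue="medium", walltime="02:00:00"):
        # qsub prints the new job identifier (e.g. "12345.server") on stdout
        result = subprocess.run(
            ["qsub", "-q", queue,
             "-l", "nodes=1:ppn=1,walltime=" + walltime,
             job_script],
            capture_output=True, text=True, check=True)
        return result.stdout.strip()

    if __name__ == "__main__":
        print(submit("analysis_job.sh"))   # hypothetical job script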

4
Current Hardware continued
  • Matrix
  • 1 dual 2.40GHz Xeon, 1GB RAM
  • 6TB RAID array
  • Used for CDF batch analysis and data storage
  • HEP Servers
  • 4 core servers
  • User file store and bulk storage via NFS (Samba
    front end for Windows)
  • Web (Apache), email (Sendmail) and database
    (MySQL)
  • User authentication via NIS (Samba for Windows)
  • Dual Xeon 2.40GHz shell server and ssh server
  • Core servers have a failover spare

5
Current Hardware - continued
  • LCG Servers
  • CE, SE upgraded to new hardware
  • CE now 8-core Xeon 2 GHz, 8GB RAM
  • SE now 4-core Xeon 2.33GHz, 8GB RAM, RAID10
    array
  • CE, SE, UI all SL4, gLite 3.1
  • MON still SL3, gLite 3.0
  • BDII SL4, gLite 3.0

6
Current Hardware continued
  • MAP2 Cluster
  • 24-rack (960-node) cluster of Dell PowerEdge 650s
  • 4 racks (280 nodes) shared with other departments
  • Each node has 3GHz P4, 1GB RAM, 120GB local
    storage
  • 19 racks (680 nodes) primarily for LCG jobs (5
    racks currently allocated for local
    ATLAS/T2K/Cockcroft batch processing)
  • 1 rack (40 nodes) for general purpose local batch
    processing
  • Front end machines for ATLAS, T2K, Cockcroft
  • Each rack has two 24 port gigabit switches
  • All racks connected into VLANs via Force10
    managed switch

7
Storage
  • RAID
  • All file stores are using at least RAID5. Newer
    servers using RAID6.
  • All RAID arrays using 3ware 7xxx/9xxx controllers
    on Scientific Linux 4.3.
  • Arrays monitored with 3ware 3DM2 software.
  • File stores (a capacity-check sketch follows this
    list)
  • New user and critical software store, RAID6+HS
    (hot spare), 2.25TB
  • 10TB general purpose hepstores for bulk storage
  • 1.4TB batchstore and 0.7TB batchsoft for the
    batch farm cluster
  • 1.4TB hepdata for backups
  • 37TB RAID6 for LCG storage element
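
A simple capacity check for the stores listed above, as a sketch: the
mount points mirror the store names on this slide but are assumptions,
as is the 90% warning threshold.

    #!/usr/bin/env python
    """Sketch: warn when the HEP file stores are close to full."""
    import shutil

    STORES = ["/hepstore", "/hepdata", "/batchstore", "/batchsoft"]  # assumed mounts
    THRESHOLD = 0.90   # warn above 90% used (arbitrary)

    def check(paths=STORES, threshold=THRESHOLD):
        for path in paths:
            total, used, free = shutil.disk_usage(path)
            frac = used / total
            status = "WARN" if frac > threshold else "ok"
            print(f"{status:4s} {path}: {frac:.1%} used, {free / 1e12:.2f} TB free")

    if __name__ == "__main__":
        check()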

8
Storage (continued)
  • 3ware Problems!
  • 3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character
    ioctl (0x108) timed out, resetting card.
  • 3w-9xxx: scsi0: ERROR: (0x06:0x001F):
    Microcontroller not ready during reset sequence.
  • 3w-9xxx: scsi0: AEN: ERROR (0x04:0x005F): Cache
    synchronization failed; some data lost: unit 0.
  • Leads to total loss of data access until system
    is rebooted.
  • Sometimes leads to data corruption at array
    level.
  • Seen under iozone load, under normal production
    load, and after a drive failure (a log-watch
    sketch follows this list)
  • Anyone else seen this?
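
One way to catch these before the array wedges is to scan syslog for the
driver messages quoted above. A minimal sketch, assuming the default
Scientific Linux log location and a purely hypothetical alert address:

    #!/usr/bin/env python
    """Sketch: scan syslog for the 3w-9xxx failure signatures above."""
    import re
    import smtplib
    from email.message import EmailMessage

    LOG_FILE = "/var/log/messages"        # default syslog target on SL
    ALERT_TO = "admins@example.ac.uk"     # hypothetical alert address

    PATTERNS = [
        re.compile(r"3w-9xxx: scsi\d+: WARNING: \(0x06:0x0037\)"),   # ioctl timeout
        re.compile(r"3w-9xxx: scsi\d+: ERROR: \(0x06:0x001F\)"),     # reset failure
        re.compile(r"3w-9xxx: scsi\d+: AEN: ERROR \(0x04:0x005F\)"), # cache sync lost
    ]

    def scan(path=LOG_FILE):
        """Return log lines matching the known 3ware failure signatures."""
        with open(path, errors="replace") as log:
            return [line.rstrip() for line in log
                    if any(p.search(line) for p in PATTERNS)]

    def alert(lines):
        msg = EmailMessage()
        msg["Subject"] = "3ware controller errors detected"
        msg["From"] = "raid-monitor@localhost"
        msg["To"] = ALERT_TO
        msg.set_content("\n".join(lines))
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        hits = scan()
        if hits:
            alert(hits)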

9
Network
  • Topology

[Topology diagram: Force10 gigabit switch at the core; 2GB links to MAP2 and to the WAN (through the firewall); LCG servers, offices and servers connected via 1GB links and VLANs]
10
Network (continued)
  • Core Force10 E600 managed switch.
  • Now have 450 gigabit ports (240 at line rate)
  • Used as central departmental switch, using VLANs
  • Increased bandwidth to WAN using link aggregation
    to 2-3Gbit/s
  • Increased departmental backbone to 2Gbit/s
  • Added departmental firewall/gateway
  • Network intrusion monitoring with snort
  • Most office PCs and laptops are on internal
    private network
  • Building network infrastructure is creaking:
    needs rewiring, and old cheap hubs and switches
    need replacing

11
Security Monitoring
  • Security
  • Logwatch (looking to develop filters to reduce
    noise)
  • University firewall, local firewall, network
    monitoring (snort)
  • Secure server room with swipe card access
  • Monitoring
  • Core network traffic usage monitored with ntop
    and cacti (all traffic to be monitored after
    network upgrade)
  • Use sysstat on core servers for recording system
    statistics
  • Rolling out system monitoring on all servers and
    worker nodes, using SNMP, Ganglia, Cacti, and
    Nagios
  • Hardware temperature monitors on water cooled
    racks, to be supplemented by software monitoring
    on nodes via SNMP (see the polling sketch below).
    Still investigating other environment monitoring
    solutions.
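
A minimal polling sketch for the per-node SNMP temperature monitoring
mentioned above. It assumes net-snmp's snmpwalk is installed, that snmpd
on the nodes exposes the lm-sensors table (LM-SENSORS-MIB
lmTempSensorsValue, conventionally in milli-degrees C), and uses
placeholder node names, community string and threshold.

    #!/usr/bin/env python
    """Sketch: poll worker-node temperatures over SNMP and flag hot nodes."""
    import subprocess

    TEMP_OID = "1.3.6.1.4.1.2021.13.16.2.1.3"   # lmTempSensorsValue column (assumed MIB)
    NODES = ["node001", "node002"]              # hypothetical node names
    LIMIT_C = 60.0                              # arbitrary warning threshold

    def node_temps(host, community="public"):
        """Return the sensor temperatures (deg C) reported by one node."""
        out = subprocess.run(
            ["snmpwalk", "-v2c", "-c", community, "-Oqv", host, TEMP_OID],
            capture_output=True, text=True, check=True).stdout
        return [int(v) / 1000.0 for v in out.split()]

    if __name__ == "__main__":
        for node in NODES:
            for temp in node_temps(node):
                flag = "WARN" if temp > LIMIT_C else "ok"
                print(f"{flag:4s} {node}: {temp:.1f} C")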

12
System Management
  • Puppet used for configuration management
  • Dotproject used for general helpdesk
  • RT integrated with Nagios for system management
    (a notification-handler sketch follows)
  • - Nagios automatically creates/updates tickets
    on acknowledgement
  • - Each RT ticket serves as a record for an
    individual system
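
The RT/Nagios coupling is site-specific; one common approach is a Nagios
notification command that mails RT's e-mail gateway so a ticket is
created per alert. A sketch under that assumption, with a hypothetical
queue address and with host/service/state supplied by Nagios macros on
the command line:

    #!/usr/bin/env python
    """Sketch: Nagios notification command that files a ticket via RT's
    e-mail gateway."""
    import smtplib
    import sys
    from email.message import EmailMessage

    RT_QUEUE_ADDR = "sysadmin@rt.example.ac.uk"   # hypothetical RT queue address

    def notify(host, service, state, output):
        msg = EmailMessage()
        msg["Subject"] = "[nagios] %s/%s is %s" % (host, service, state)
        msg["From"] = "nagios@localhost"
        msg["To"] = RT_QUEUE_ADDR
        msg.set_content(output)
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)

    if __name__ == "__main__":
        # e.g. notify_rt.py $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$
        notify(*sys.argv[1:5])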

13
Plans
  • Additional storage for the Grid
  • GridPP3 funded
  • Will be approx. 60? TB
  • May switch from dCache to DPM
  • Upgrades to local batch farm
  • Plans to purchase several multi-core (most likely
    8-core) nodes
  • Collaboration with local Computing Services
    Department
  • Share of their newly commissioned multi-core
    cluster available