Title: HPCC Status, 4/17/2009
1. HPCC Status, 4/17/2009
2. Changes in the HPCC world
- SGI bankruptcy and subsequent purchase by Rackable
- Apparent end of Western Scientific
- Sun is rumored to be up for sale (IBM the latest)
- The economy is stressing many companies
3. Changes in OUR HPCC world
- Construction continues, should be done by May
- Working on a number of software issues, in particular networking
- Systems have been running fairly busy, in the 90-95% range recently and over 75% overall
4. Recent issues
- SGI SMP
- Green lost a memory DIMM and went down
- White (Green's frontend) went down
- Weird power loss (50 seconds, 300 kVA offline)
- Construction/planned downtime Thursday
5. Recent changes
- New Lustre file system online
- New user file system online (allows for Samba mounts!)
- NFS 4 running on Brody and infrastructure (helps with networking problems)
- Better testing suite to find problems sooner
6. Coming soon
- A single disk image (the same copy of the OS) is being developed to run on every system; it will make using the different clusters much easier
- The environment will be the same on every system (cluster, fat nodes, whatever)
- ssh test-amd05; an Intel version will be available soon as well
7. GLCPC
- www.greatlakesconsortium.org
- Recently ran a survey, which 7 MSU members filled out (thanks!)
- Will hold summer sessions, likely run remotely at the various institutions
- We really want to find out who will use the Blue Waters machine and what they need to know to do so
- Please let me know any questions
8. Staff plus issues
- Staff will present some of the current issues the Center is working on
9. Home Directory Storage (Ed Kryda, Manager)
- Currently 100 TB available / 50 GB default quota
- Customized Sun X4540
- Performance: 200 MB/s write, 1 GB/s read max
- Initial reliability issues
- NFS v4
- Samba/CIFS file sharing!
- Snapshots
10. Lustre Storage (Greg Mason, System Administrator)
- Old Lustre retirement 5/1/09 (/mnt/lustre)
- Eventually repurposed
- New shared scratch space (/mnt/ls09)
- 33 TB
- /mnt/lustre_scratch_2009
- /mnt/scratch
- ONLY TEMPORARY FILES
- Future automatic deletion (see the sketch below)
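The deletion policy has not been announced, so the following is a rough sketch only: a purge job could walk the scratch tree and remove files that have not been touched for some retention window. The 30-day window, the use of /mnt/ls09 as the root, and the dry-run default are placeholder assumptions, not HPCC policy.

    # Illustrative scratch-purge sketch; NOT the HPCC's actual policy or tooling.
    # Deletes regular files under the scratch root whose access time is older
    # than an assumed retention window.
    import os
    import time

    SCRATCH_ROOT = "/mnt/ls09"   # scratch path from the slide, used as an example
    RETENTION_DAYS = 30          # placeholder window; the real policy is TBD
    DRY_RUN = True               # report only; set to False to actually delete

    cutoff = time.time() - RETENTION_DAYS * 24 * 3600
    for dirpath, _dirnames, filenames in os.walk(SCRATCH_ROOT):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue                 # vanished or unreadable; skip it
            if st.st_atime < cutoff:
                print(("would delete " if DRY_RUN else "deleting ") + path)
                if not DRY_RUN:
                    os.remove(path)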
11. User Education and Assistance (Dirk Colbry, Academic Specialist)
http://wiki.hpcc.msu.edu/
- Research Collaborations
- System Level Debugging
- System Level Testing
- University Level Training Classes
- Research Group Level Training Classes
- Face-to-Face Individual Training and Debugging
- Up-to-date Documentation
12. Better Testing (Jim Leikert, System Administrator)
- New scripts for testing node health (a minimal sketch follows below)
- New measures to keep jobs in line
- Job state messages
- Slowly being rolled out
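The actual health-check scripts were not shown; the sketch below is only a hedged illustration of the idea, checking that expected filesystems are mounted and that the load is sane before a node takes jobs. The mount points and threshold are assumptions.

    # Minimal node health-check sketch (illustrative; not the HPCC's scripts).
    # A nonzero exit status could let a scheduler prologue mark the node offline.
    import os
    import sys

    REQUIRED_MOUNTS = ["/mnt/home", "/mnt/ls09"]   # assumed mount points
    MAX_LOAD_PER_CORE = 2.0                        # assumed load threshold

    failures = []
    if not all(os.path.ismount(m) for m in REQUIRED_MOUNTS):
        failures.append("expected filesystem not mounted")
    if os.getloadavg()[0] / (os.cpu_count() or 1) >= MAX_LOAD_PER_CORE:
        failures.append("load average too high")

    if failures:
        print("UNHEALTHY: " + "; ".join(failures))
        sys.exit(1)
    print("healthy")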
13. User Vignettes (Kelly Osborn, Administrative Assistant)
- Improves our public face
- Currently have 12 vignettes
- Looking for additional research to showcase
- kosborn@msu.edu
14. SMP and White (Andy Keen, System Administrator)
- The SMP is off support, has been down twice, and is being repaired by hand
- White was down to two processors
- The SMP's days are numbered
- Need to transition to newer fat nodes
- Recompilation will be required to use the new library links; use queue brody_4s
15. Shorter-term issues
16. Buy replacement SMP nodes
- We have previously discussed buying replacement nodes
- The sweet spot is a box with 32 cores and 256 GB of memory
- We would like to buy on the order of 4-5 of these as replacements for the SMP
- Note that you would have to recompile!
- Same OS image as the clusters, however!
- Your opinions? We'd like to buy soon.
17. More storage
- Rolling our own has been a lot of work
- The transition to NFS 4 has improved performance and reliability, but we need more storage
- Continue with the cheaper, expandable approach or go with a turnkey solution (such as NetApp)?
18. Rack
[Diagram: hardware hierarchy from rack to chassis to nodes to processors/sockets to cores, with examples]
19. Job Scheduling Example
[Diagram: a job queue (job ID, number of cores, duration, priority) mapped onto nodes over time; "Current Jobs" and "New Schedule" panels show a short job backfilled into an idle slot ahead of the current time]
20. Isolating long-running jobs
- Working now on isolating long-running jobs
- Long-running jobs clog the nodes, especially long-running, single-CPU jobs
- Users would prefer to run on a single node for better efficiency
21. Current Scheduling Problem
[Diagram: node occupancy over a one-week window from the current time]
- Long single-core jobs take over nodes
- Middle-sized jobs (8-64 cores) cannot be reliably scheduled on dedicated nodes
- Very large core-count jobs cannot be scheduled at all
22. Changing the scheduling of long jobs
- We propose grouping long-term jobs in the system
- Could involve capping their number (see the sketch after this list)
- For example, reserve ¼ of each cluster (128 256 for 384)
- Improve scheduling of larger jobs, with potentially few side effects
- Discussion?
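As a hedged illustration of the capping idea (the policy is exactly what is up for discussion here), a scheduler-side check might decline to start another long job once long jobs already occupy a quarter of a cluster's nodes. The 24-hour threshold for "long" is an assumption; the ¼ fraction comes from the slide.

    # Sketch of capping long jobs at 1/4 of a cluster (illustrative only).
    LONG_JOB_HOURS = 24          # assumed threshold for calling a job "long"
    RESERVED_FRACTION = 0.25     # quarter of the cluster, per the proposal

    def can_start_long_job(requested_hours, nodes_running_long, cluster_nodes):
        """Allow a long job only while long jobs hold under 1/4 of the nodes."""
        if requested_hours < LONG_JOB_HOURS:
            return True                              # short jobs are unaffected
        cap = int(cluster_nodes * RESERVED_FRACTION)
        return nodes_running_long < cap

    # On a 384-node cluster the cap works out to 96 nodes of long jobs.
    print(can_start_long_job(168, nodes_running_long=95, cluster_nodes=384))  # True
    print(can_start_long_job(168, nodes_running_long=96, cluster_nodes=384))  # False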
23. Discussion: buy-in priority
24. Reinstitute buy-in
- We would like to reinstitute buy-in: users buying nodes to be run by the HPCC
- The recent renovations allow for expansion of the center's facilities for shared HPCC infrastructure (no user hosting!)
- We believe there are many users with equipment money who would like to buy in
25. Rack
[Diagram: the same rack/chassis/nodes/processors/cores hierarchy shown earlier, repeated for reference]
26. Users will buy chassis
- The increment of purchase is a chassis
- Price to be determined, but roughly $1,000/core
- A box will be 8 or 16 cores, depending on deals and prices
- Example deal: 8-core Nehalem, 48 GB memory, about $8,000 (varies)
- Better deals with larger purchases
27. HPCC will provide
- Support for the hardware, networking, disks, power and cooling
- Software, OS, access
- 3 or 5 years (need feedback)
- Most support contracts are 3 years; it could be 5, but there are issues with this
28. HPCC will also purchase chassis
- HPCC does have some funds to purchase general-use nodes as well
- For the next 5 years we will continue to expand within the bounds of the ICER budget
- However, the ICER budget is a sliding scale, providing more support and less hardware over time
29. Priority scheduling of buy-in
- These are points of discussion; we need your feedback
- There are a couple of models, all of which allow unused nodes to be scheduled for larger jobs while still giving buy-in users access
30. First, really two systems
- HPCC provides public nodes for anyone with an HPCC account to schedule
- First come, first served (mostly)
- The researchers who buy in would have reserved access to their nodes, plus the slack of other buy-in users
- No general scheduling in this part of the system (mostly)
31. In the buy-in system, three issues
- How quickly
- How many
- How long
32. How quickly: the Purdue model
- Guarantee access to the number of purchased nodes within X hours (could be 1 hour, 4 hours, 8 hours); Purdue is now at 4
- Buy-in users can get more than they ask for if they don't run longer than 4 hours (or 1 hour, 8 hours, ...)
- We cannot guarantee that the big job will go within some time period, but the timeslice above provides an opportunity (a worked sketch follows below)
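A worked sketch of why the guarantee holds (the numbers are assumptions for illustration, not an adopted policy): if every general job placed on buy-in nodes is capped at the guarantee window, the owner never waits longer than that window for the nodes to drain.

    # Sketch of the "access within X hours" guarantee (illustrative only).
    GUARANTEE_HOURS = 4   # Purdue reportedly uses 4; 1 or 8 were also mentioned

    def max_owner_wait(remaining_walltimes_hours):
        """Worst-case wait for a buy-in owner to reclaim their purchased nodes.

        Because general jobs on buy-in nodes may not request more walltime than
        the guarantee, the owner's wait is bounded by that guarantee."""
        longest = max(remaining_walltimes_hours, default=0)
        return min(longest, GUARANTEE_HOURS)

    # General jobs with 1, 3, and 4 hours left occupy the owner's nodes:
    print(max_owner_wait([1, 3, 4]))   # -> 4 (never more than GUARANTEE_HOURS)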
33. How many: dial-in nodes
- Users can dial in how many of their purchased nodes they need within some time slice (1 day, 1 week, ...)
- Dialing in low gets higher priority or future credit, and frees other nodes now for larger jobs beyond what was purchased
- Must have a reasonable time slice to get good scheduling (a week?)
34. How long: the dial-in area model
- Buy-in users get their nodes 24x7
- Could use the area model under some timeslice; for example:
- You bought 100 cores x 168 hours (a 1-week timeslice)
- You could instead use 200 cores x 84 hours (then wait 84 hours before you can schedule again)
- Only resets every timeslice (a worked sketch follows below)
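A rough illustration of the area model, using the numbers from the slide (the function itself is a sketch, not a committed interface): the purchase defines a core-hour budget per timeslice, and any cores-by-hours rectangle that fits within the remaining budget is allowed.

    # Area-model sketch: a cores x hours budget that resets each timeslice.
    PURCHASED_CORES = 100
    TIMESLICE_HOURS = 168                        # one-week timeslice
    BUDGET = PURCHASED_CORES * TIMESLICE_HOURS   # 16,800 core-hours per week

    def hours_allowed(cores_requested, core_hours_used=0):
        """Hours a job of the given width can still run in this timeslice."""
        remaining = BUDGET - core_hours_used
        return max(remaining // cores_requested, 0)

    print(hours_allowed(100))                            # -> 168 (the full week)
    print(hours_allowed(200))                            # -> 84 (twice the width, half the time)
    print(hours_allowed(200, core_hours_used=200 * 84))  # -> 0 until the slice resets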
35. Others on buy-in nodes
- We would still like to keep utilization up on buy-in nodes, so it is possible that general users would get access to those nodes under two conditions:
- Very short jobs (especially single-CPU)
- Pre-emption
36. Interested?
- Contact Kelly Osborn at kosborn@msu.edu
- Required information:
- Account Number
- Approximate Amount (unit amount unknown)
- Deadlines on spending?
- Contact name
37. Short, single-CPU jobs
- Very short jobs can be used as backfill in the scheduler to fill holes (see the sketch below)
- If jobs are short, no one has to wait very long (5 minutes, say)
- Only if there is slack in the schedule
- Low priority
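A hedged sketch of the backfill test (not the scheduler's real algorithm): a short job may start in an otherwise idle window only if it finishes before the node is needed again, and only if it stays under the short-job cap.

    # Backfill eligibility sketch (illustrative only).
    def can_backfill(job_minutes, idle_window_minutes, max_backfill_minutes=5):
        """True if the job fits the idle window and respects the short-job cap."""
        return (job_minutes <= max_backfill_minutes
                and job_minutes <= idle_window_minutes)

    print(can_backfill(4, 30))    # True: a 4-minute job fills a 30-minute hole
    print(can_backfill(20, 30))   # False: too long to count as backfill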
38. Preemption
- Jobs that label themselves as preemptible get very high priority and can run anywhere at any time
- Preemptible means the job can be stopped at any time
- Once stopped, the job is re-queued at high priority
- The user must recover the state of a stopped job! (see the sketch below)
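Since the user must recover the state of a stopped job, preemptible jobs usually checkpoint themselves. The sketch below only illustrates that pattern; the file name, step loop, and interval are assumptions, not an HPCC-provided interface.

    # Checkpoint/restart sketch for a preemptible job (illustrative only).
    # The job saves progress periodically; after preemption and re-queueing,
    # the next run resumes from the last checkpoint instead of starting over.
    import json
    import os

    CHECKPOINT = "checkpoint.json"   # assumed file in the job's working directory

    def load_step():
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["step"]
        return 0                     # first run: start from the beginning

    def save_step(step):
        with open(CHECKPOINT, "w") as f:
            json.dump({"step": step}, f)

    for step in range(load_step(), 1000):
        # ... one unit of the job's real work would go here ...
        if step % 50 == 0:
            save_step(step)          # cheap insurance against preemption
    save_step(1000)                  # record completion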
39. What about non-buy-in researchers?