
1
Lattice QCD Clusters
  • Amitoj Singh
  • Fermi National Accelerator Laboratory

2
Introduction
  • The LQCD Clusters
  • Cluster monitoring and response
  • Cluster jobs
    • types
    • submission, scheduling and allocation
    • execution
  • Wish List
  • Questions and Answers

3
The LQCD Clusters
Cluster nodes, hardware, and MILC performance:
  • qcd: 127 nodes; 2.8 GHz P4E, Intel E7210 chipset, 1 GB main memory, Myrinet; 1017 MFlops/node (0.1 TFlops aggregate)
  • pion: 518 nodes; 3.2 GHz Pentium 640, Intel E7221 chipset, 1 GB main memory, Infiniband SDR; 1594 MFlops/node (0.8 TFlops aggregate)
  • kaon: 600 nodes; 2.0 GHz dual Opteron, nVidia CK804 chipset, 4 GB main memory, Infiniband DDR; 3832 MFlops/node (2.2 TFlops aggregate)
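As an arithmetic check, the aggregate figures are simply nodes times per-node MILC performance. A quick verification in Python (numbers copied from the table above; the small kaon difference is rounding on the original slide):

  # Aggregate MILC performance = nodes x MFlops/node, converted to TFlops.
  clusters = {
      "qcd":  (127, 1017),   # (nodes, MFlops/node)
      "pion": (518, 1594),
      "kaon": (600, 3832),
  }
  for name, (nodes, mflops) in clusters.items():
      print(f"{name}: {nodes * mflops / 1e6:.1f} TFlops")
  # prints 0.1, 0.8 and 2.3 TFlops (the slide quotes kaon as 2.2)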
4
pion and qcd cluster
pion cluster front
qcd cluster back
pion cluster back
5
kaon cluster
kaon cluster front
kaon cluster back
kaon head-nodes Infiniband spine
6
Cluster monitoring
  • Worker node
  • nannies monitor critical components/processes
    such as
  • health (cpu/system temperature, cpu/system fan
    speeds)
  • batch queue clients (PBS mom)
  • disk space
  • NFS mount points
  • high speed interconnects
  • Nannies report via email any anomalies for which
    no corrective action is defined. A corrective
    action needs to be well-defined, with sufficient
    decision paths to fully automate the error
    diagnosis and recovery process. Users are
    sophisticated enough to report any
    performance-related issues. (A minimal check
    cycle is sketched below.)
  • Head-node
  • nanny monitors critical processes such as
  • mrtg graph plotting scripts
  • automated scripts to generate cluster status
    pages
  • batch queue server (PBS server)
  • NFS server
  • The nanny restarts processes that have exited
    abnormally. All unhealthy nodes are shown
    blinking on the cluster status pages; cluster
    administrators can then analyze the mrtg plots
    to isolate the problem.
  • Network fabric
  • For the high speed network interconnects
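A minimal sketch of such a worker-node nanny check cycle, in Python. The thresholds, sensor path, scratch path, mount list and mail addresses are all assumptions for illustration; the production nannies are not reproduced here.

  import os, shutil, smtplib
  from email.message import EmailMessage

  CPU_TEMP_LIMIT_C = 70                 # assumed threshold
  MIN_FREE_SCRATCH_GB = 10              # assumed threshold
  NFS_MOUNTS = ["/home", "/project"]    # assumed mount points

  def anomalies():
      found = []
      # health: cpu temperature (sysfs path varies by kernel and driver)
      try:
          millic = int(open("/sys/class/thermal/thermal_zone0/temp").read())
          if millic / 1000 > CPU_TEMP_LIMIT_C:
              found.append(f"cpu temperature {millic / 1000:.0f} C")
      except OSError:
          found.append("cannot read cpu temperature sensor")
      # local scratch disk space
      free_gb = shutil.disk_usage("/scratch").free / 2**30
      if free_gb < MIN_FREE_SCRATCH_GB:
          found.append(f"scratch disk low: {free_gb:.1f} GB free")
      # NFS mount points still mounted
      mounts = open("/proc/mounts").read()
      found += [f"NFS mount {m} missing" for m in NFS_MOUNTS
                if f" {m} " not in mounts]
      return found

  def report(found):
      # anomalies with no defined corrective action are mailed to the admins
      msg = EmailMessage()
      msg["Subject"] = f"nanny: {len(found)} anomalies on {os.uname().nodename}"
      msg["From"], msg["To"] = "nanny@cluster", "admins@cluster"  # placeholders
      msg.set_content("\n".join(found))
      with smtplib.SMTP("localhost") as s:
          s.send_message(msg)

  if __name__ == "__main__":
      problems = anomalies()
      if problems:
          report(problems)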

7
Cluster job types
  • A large fraction of the jobs that are run on the
    LQCD clusters are limited by
  • Memory-bandwidth
  • Network-bandwidth

(performance plots: memory-bandwidth-bound and network-bandwidth-bound regimes)
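One rough way to see which resource bounds a job is to compare per-resource time estimates and take the largest. A toy classifier in Python; every rate below is invented for the sketch, not measured on these clusters:

  # Classify a job by its dominant per-resource time estimate (toy model).
  def dominant_resource(flops, mem_bytes, net_bytes,
                        peak_flops=5e9,   # assumed per-node FP rate, flop/s
                        mem_bw=5e9,       # assumed main-memory bytes/s
                        net_bw=1e9):      # assumed interconnect bytes/s
      times = {
          "compute": flops / peak_flops,
          "memory-bandwidth": mem_bytes / mem_bw,
          "network-bandwidth": net_bytes / net_bw,
      }
      return max(times, key=times.get)

  # LQCD kernels move many bytes per flop, so with these assumed rates a
  # typical job lands in the memory-bandwidth bound regime:
  print(dominant_resource(flops=1e12, mem_bytes=2e12, net_bytes=1e11))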
8
Cluster job execution
  • Open PBS (Torque) and the Maui scheduler schedule
    jobs using the "FIFO" algorithm as follows
  • Jobs are queued in the order of submission
  • Maui will run the highest-priority (oldest) jobs
    in the queue in order, except that it will not
    start a job if any of the following are true
  • (a) the job would put the number of running jobs
    by a particular user over the limit
  • (b) the job would put the total number of nodes
    used by a particular user over the limit
  • (c) the job specifies resources that cannot
    currently be fulfilled (e.g. a specific set of
    nodes requested by the user)
  • If a job is blocked by any of the above, Maui
    skips it and runs the next eligible job.
  • Under certain conditions, Maui may run a later
    job when only limit (c) blocks the job at the
    head of the queue. This is called backfilling.
    Maui looks at the state of the queue and the
    running jobs and, based on the requested and
    used wall-clock times, predicts when the job
    blocked by (c) will be able to run. If job(s)
    lower in the queue can run without extending
    the start time of the blocked job, Maui runs
    them (see the scheduling sketch after this
    list).
  • Once a job is ready to run, a set of nodes is
    allocated to it exclusively for the requested
    wall time. Almost all jobs run on the LQCD
    clusters are MPI jobs. Users can refer to the
    PBS_NODEFILE environment variable explicitly,
    or the mpirun launch script handles it for them
    (an example follows).
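A stripped-down sketch of the FIFO-plus-backfill decision described above, in Python. The Job structure and numbers are invented for illustration; Maui's real scheduler also handles priorities, reservations and limits (a) and (b).

  # Toy FIFO + backfill step (illustrative; not Maui's actual algorithm).
  from dataclasses import dataclass

  @dataclass
  class Job:
      user: str
      nodes: int         # nodes requested
      remaining: float   # requested/remaining wall-clock hours

  def earliest_start(blocked, running, total_nodes):
      """Predict when the blocked job could start, assuming running jobs
      free their nodes when their requested walltime expires."""
      free = total_nodes - sum(j.nodes for j in running)
      for j in sorted(running, key=lambda j: j.remaining):
          free += j.nodes
          if free >= blocked.nodes:
              return j.remaining
      return float("inf")

  def next_to_run(queue, running, total_nodes):
      """FIFO: run the head job if it fits; otherwise backfill a later job
      that fits now and finishes before the head job could start anyway."""
      head, rest = queue[0], queue[1:]
      free = total_nodes - sum(j.nodes for j in running)
      if head.nodes <= free:
          return head
      start = earliest_start(head, running, total_nodes)
      for j in rest:
          if j.nodes <= free and j.remaining <= start:
              return j       # backfill without delaying the head job
      return None            # nothing can run yet

  running = [Job("alice", 16, 4.0)]
  queue = [Job("bob", 32, 24.0), Job("carol", 8, 2.0)]
  print(next_to_run(queue, running, total_nodes=40))   # backfills carol's job

On the allocation side, Torque writes the allocated hostnames, one per process slot, to the file named by PBS_NODEFILE. A job script can consume it like this (the binary name is a placeholder and mpirun flags vary by MPI implementation):

  import os, subprocess

  nodefile = os.environ["PBS_NODEFILE"]       # set by Torque inside the job
  slots = open(nodefile).read().split()       # one hostname per slot
  print(f"{len(slots)} slots on {len(set(slots))} nodes")

  # Hand the allocation to mpirun explicitly; site launch scripts often
  # do exactly this on the user's behalf.
  subprocess.run(["mpirun", "-machinefile", nodefile,
                  "-np", str(len(slots)), "./milc_su3_rmd"], check=True)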

9
Cluster job execution (contd)
  • Typical user jobs use 8, 16 or 32 nodes and run
    for a maximum wall time of 24 hours.
  • A user nanny job running on the head-node
    executes job streams (sketched below). Each job
    stream is a PBS job which
  • on the job head-node (MPI node 0) copies a
    lattice (problem) stored in dCache to the local
    scratch disk.
  • divides the lattice among the nodes and copies
    the sub-lattices to each node's local scratch
    disk.
  • launches an MPI process on each node which
    computes its sub-lattice.
  • the main process (MPI process 0) gathers the
    results from each node onto the job head-node
    (MPI node 0) and copies the output into dCache.
  • marks checkpoints at regular intervals for error
    recovery.
  • Output from one job stream is the input lattice
    for the next job stream.
  • If a job stream fails, the nanny job restarts the
    stream from the most recent saved checkpoint.
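A sketch of the head-node nanny loop just described, in Python. The paths, stream count, PBS script name and checkpoint discovery are placeholders; the actual Fermilab scripts are not reproduced here.

  import subprocess

  def run_stream(stream_id, input_lattice):
      """Submit one job stream as a blocking PBS job; return its output path."""
      out = f"/scratch/lattice.{stream_id}.out"        # placeholder path
      subprocess.run(["qsub", "-W", "block=true",      # Torque: wait for exit
                      "-v", f"IN={input_lattice},OUT={out}",
                      "stream.pbs"], check=True)       # placeholder script
      return out

  def latest_checkpoint(stream_id, fallback):
      """Most recent saved checkpoint for a stream (site-specific; stubbed)."""
      return fallback

  lattice = "dcache:/lqcd/start.lat"     # placeholder dCache input lattice
  for stream_id in range(4):             # each stream's output feeds the next
      while True:
          try:
              lattice = run_stream(stream_id, lattice)
              break
          except subprocess.CalledProcessError:
              # restart the failed stream from its most recent checkpoint
              lattice = latest_checkpoint(stream_id, lattice)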

10
Wish List
  • Missing link between the monitoring process and
    the scheduler: the scheduler could do better by
    being node and network aware.
  • Ability to monitor factors that are critical to
    application performance (e.g. thermal
    instabilities can throttle the cpu clock, which
    ultimately hurts performance).
  • Very few automated corrective actions defined for
    components and processes that are currently being
    monitored.
  • Using current health data, the ability to
    predict node failures rather than just updating
    mrtg plots (a toy trend check follows).
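On the last point, even a crude trend fit over health data already being collected would go beyond updating mrtg plots. A toy example in Python (thresholds and horizon are assumptions; needs at least two samples):

  # Flag a node whose temperature trend will cross the throttling range.
  def flag_if_trending_hot(temps_c, horizon=12, limit_c=70.0):
      """temps_c: hourly temperature samples (e.g. scraped from mrtg logs).
      Returns True if the linear trend crosses limit_c within horizon hours."""
      n = len(temps_c)
      mean_x, mean_y = (n - 1) / 2, sum(temps_c) / n
      slope = sum((x - mean_x) * (y - mean_y)
                  for x, y in enumerate(temps_c)) \
              / sum((x - mean_x) ** 2 for x in range(n))
      return temps_c[-1] + slope * horizon > limit_c

  print(flag_if_trending_hot([55, 57, 60, 62, 65]))   # rising trend -> True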