Integrating GPUs into Condor

Slides: 19

Provided by: Tim868

Learn more at: https://research.cs.wisc.edu

1
Integrating GPUs into Condor

Timothy Blattner
Marquette University
Milwaukee, WI
April 22, 2009

2
Outline

Background and Vision
Graphics Cards
Condor Approach
Problems
Conclusions and Future Work

3
Graphics cards

Powerful: NVIDIA Tesla C1060
  240 massively parallel processing cores
  4 GB GDDR3
  CUDA capable
  933 gigaflops (peak, single precision)
  $1,300
Cheap: NVIDIA 9800 GT
  112 massively parallel processing cores
  512 MB GDDR3
  CUDA capable
  $120

4
Vision and Focus

Pool of computers containing graphics cards, managed by Condor
Provide users the ability to use graphics cards identified by Condor

[Figure: Central Manager with three nodes marked "?"]
5
Opportunities

Resources may already be there
  Most machines already contain graphics cards
  GPU resources sit idle while Condor runs jobs on the CPU
Similar work
  GPUGRID.net
    Distributed computing project using NVIDIA graphics cards for all-atom molecular simulations of proteins
    Uses a GPU-enabled BOINC client

6
Prototype Implementation

Linux only
A script queries the operating system and the graphics card
The Hawkeye cron job manager runs the script
The script outputs graphics card information in ClassAd format
A binary queries NVIDIA cards for more specific information
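
The output step above can be sketched in Python. This is only an illustration, not the presentation's gpu.sh script: the function names `gpu_attributes` and `format_classad` and the input dictionary layout are assumptions, and the attribute names follow the script output shown later in the deck. The trailing "-" line mirrors the marker that closes that output.

```python
# Minimal sketch: turn already-collected GPU facts into ClassAd-style lines.
# Function names and the input layout are hypothetical, chosen for illustration.

def gpu_attributes(gpus):
    """Build a flat attribute dict from a list of per-GPU dicts."""
    attrs = {"HasGpu": len(gpus) > 0, "NGpu": len(gpus)}
    for i, gpu in enumerate(gpus):
        attrs[f"Gpu{i}"] = gpu["name"]
        attrs[f"Gpu{i}CudaCapable"] = gpu["cuda_capable"]
        attrs[f"Gpu{i}Mem"] = gpu["mem_bytes"]
    return attrs


def format_classad(attrs):
    """Render attributes as 'Name = Value' lines, closed by a '-' line."""
    lines = []
    for name, value in attrs.items():
        if isinstance(value, bool):  # check bool before other numbers
            rendered = "True" if value else "False"
        elif isinstance(value, str):  # strings are double-quoted
            rendered = '"' + value.replace('"', '\\"') + '"'
        else:
            rendered = str(value)
        lines.append(f"{name} = {rendered}")
    lines.append("-")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [{"name": "Quadro FX 3700", "cuda_capable": True,
               "mem_bytes": 536150016}]
    print(format_classad(gpu_attributes(sample)))
```

In a real deployment the input list would come from querying the operating system and the card, which this sketch leaves out.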

7
Graphics Card Architecture
8
Graphics card APIs

APIs that favor general-purpose computation
CUDA (NVIDIA)
Brook (ATI)
OpenCL (Khronos Group)

9
CUDA Programming Model

Kernels are functions run on the device (GPU)
Host (CPU) code invokes kernels and determines:
  Number of threads
  Thread block structure for organizing threads
Kernel invocations are asynchronous
  Control returns to the CPU immediately
  CUDA provides synchronization primitives
  Some CUDA calls (e.g. memory allocation) are synchronous
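
The host-side decision about thread count and block structure comes down to simple arithmetic, sketched here in Python rather than CUDA. The function names are hypothetical, and the default of 512 threads per block is taken from the Gpu0ThreadsPerBlock value in the script output slide. The sketch shows the usual ceiling division for the number of blocks and the bounds check each thread would perform.

```python
# Minimal sketch of choosing a 1-D launch configuration on the host.
# launch_config and active_indices are illustrative names, not CUDA API calls.

def launch_config(n_items, threads_per_block=512):
    """Return (blocks, threads_per_block) covering n_items work items."""
    if n_items < 0 or threads_per_block <= 0:
        raise ValueError("n_items must be >= 0 and threads_per_block > 0")
    # Ceiling division: the last block may be only partly used.
    blocks = (n_items + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def active_indices(n_items, blocks, threads_per_block):
    """List the global indices that pass the in-kernel bounds check."""
    indices = []
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx
            if i < n_items:  # threads past the end do no work
                indices.append(i)
    return indices


if __name__ == "__main__":
    blocks, tpb = launch_config(1000)
    print(blocks, tpb)  # 2 blocks of 512 threads cover 1000 items
```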

10
Hawkeye Cron Job Manager

Provides a mechanism for collecting, storing, and using information about computers
Periodically executes specified program(s)
Each program writes its output in ClassAd form
The output is added to the machine's ClassAd

11
Hawkeye Implementation

The Hawkeye job is added to the local configuration file
Runs the script every minute
The Condor user must be granted graphics card privileges in order to query the card

STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST), UPDATEGPU
STARTD_CRON_UPDATEGPU_EXECUTABLE = gpu.sh
STARTD_CRON_UPDATEGPU_PERIOD = 1m
STARTD_CRON_UPDATEGPU_MODE = Periodic
STARTD_CRON_UPDATEGPU_KILL = True
12
Script Output

HasGpu = True
NGpu = 1
Gpu0 = "Quadro FX 3700"
Gpu0CudaCapable = True
Gpu0_Major = 1
Gpu0_Minor = 1
Gpu0Mem = 536150016
Gpu0Procs = 14
Gpu0Cores = 112
Gpu0ShareMem = 16384
Gpu0ThreadsPerBlock = 512
Gpu0ClockRate = 1.24
HasCuda = True
-
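
As a quick sanity check on these values, the following Python sketch parses a few of the attributes and derives two figures from them. It assumes the script emits one `Name = Value` pair per line, and `parse_classad` is a hypothetical helper, not part of Condor.

```python
# Minimal sketch: parse 'Name = Value' lines and derive per-GPU figures.
# parse_classad is an illustrative helper, not a Condor API.

SAMPLE = """\
HasGpu = True
Gpu0 = "Quadro FX 3700"
Gpu0Mem = 536150016
Gpu0Procs = 14
Gpu0Cores = 112
Gpu0ClockRate = 1.24
-
"""


def parse_classad(text):
    """Parse simple boolean, string, integer and float attributes."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "-":  # skip blanks and the closing marker
            continue
        name, _, raw = line.partition("=")
        name, raw = name.strip(), raw.strip()
        if raw in ("True", "False"):
            value = raw == "True"
        elif len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            value = raw[1:-1]
        else:
            try:
                value = int(raw)
            except ValueError:
                value = float(raw)
        ad[name] = value
    return ad


if __name__ == "__main__":
    ad = parse_classad(SAMPLE)
    # 112 cores over 14 multiprocessors is 8 cores per multiprocessor.
    print(ad["Gpu0Cores"] // ad["Gpu0Procs"])
    # 536150016 bytes is 511.3125 MiB, slightly under the nominal 512 MB.
    print(ad["Gpu0Mem"] / 2**20)
```
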
13
Job Submission

Users can submit jobs with GPU requirements to Condor
Portable across Linux distributions

Universe = vanilla
Executable = tests/CudaJob
Initialdir = gpuJobs
Requirements = (HasGpu == true) && (Gpu0CudaCapable == true)
Log = gpu_test.log
Error = gpu_test.stderr
Output = gpu_test.stdout
Queue

condor_submit gpu_job.submit
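
The effect of the job's requirements can be illustrated with a toy matcher in Python. This is not Condor's matchmaking code: the `matches` function, the pool dictionary, and the host names are all made up for illustration. A missing attribute is treated as a non-match.

```python
# Toy sketch of matching a GPU job against machine attributes.
# Host names and the matches() helper are hypothetical.

def matches(machine_ad):
    """True only if the machine advertises a CUDA-capable first GPU."""
    return (machine_ad.get("HasGpu") is True
            and machine_ad.get("Gpu0CudaCapable") is True)


POOL = {
    "node-a": {"HasGpu": True, "Gpu0CudaCapable": True},
    "node-b": {"HasGpu": True, "Gpu0CudaCapable": False},
    "node-c": {},  # no GPU attributes advertised at all
}

if __name__ == "__main__":
    eligible = [name for name, ad in POOL.items() if matches(ad)]
    print(eligible)  # only node-a satisfies both conditions
```
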
14
Access Control

/dev/nvidiactl and /dev/nvidia devices need to be readable and writable by the submitting/running user
That user could be:
  Nobody (open access)
  Controlled by a Unix group containing a limited set of users
  Integrated more directly with Condor user control (slot users)
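
Whichever policy is chosen, a job wrapper could verify access before starting. The Python sketch below uses `os.access` to report device files the current user cannot read and write. The helper name `inaccessible_devices` is hypothetical, and `/dev/nvidia0` is an assumed example of a per-card device node.

```python
# Minimal sketch: report device files the current user cannot read and write.
# inaccessible_devices is an illustrative helper; /dev/nvidia0 is an assumed
# example path for the first card.
import os

DEFAULT_DEVICES = ["/dev/nvidiactl", "/dev/nvidia0"]


def inaccessible_devices(paths):
    """Return the paths that are missing or lack read/write permission."""
    return [p for p in paths if not os.access(p, os.R_OK | os.W_OK)]


if __name__ == "__main__":
    blocked = inaccessible_devices(DEFAULT_DEVICES)
    if blocked:
        print("No read/write access to:", ", ".join(blocked))
    else:
        print("GPU device files are accessible")
```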

15
Problems

Preemption
  Jobs running in a GPU kernel cannot be interrupted reliably by Unix signals
Watchdog timer
  After 5 seconds, the job is killed
  One solution: use the general-purpose graphics card as a secondary display
Memory security
  Malicious users who interrupt a job between GPU kernel calls have the opportunity to overwrite or copy GPU memory

16
Summary