Title: Integrating GPUs into Condor
1Integrating GPUs into Condor
- Timothy Blattner
- Marquette University
- Milwaukee, WI
- April 22, 2009
2Outline
- Background and Vision
- Graphics Cards
- Condor Approach
- Problems
- Conclusions and Future Work
3Graphics cards
- Powerful: NVIDIA Tesla C1060
  - 240 massively parallel processing cores
  - 4 GB GDDR3
  - CUDA capable
  - 933 gigaflops (peak, single precision)
  - $1,300
- Cheap: NVIDIA 9800 GT
  - 112 massively parallel processing cores
  - 512 MB GDDR3
  - CUDA capable
  - $120
4Vision and Focus
- Pool of computers containing graphics cards, managed by Condor
- Provide users the ability to utilize graphics cards identified by Condor
[Diagram: Central Manager and three nodes, each marked "?"]
5Opportunities
- Resources may already be there
  - Majority of machines have graphics cards in them
  - GPU resources sit idle while Condor runs on the CPU
- Similar work
  - GPUGRID.net
  - Distributed computing project using NVIDIA graphics cards for full-atom molecular simulations of proteins
  - Uses GPU-enabled BOINC client
6Prototype Implementation
- Linux only
- Script queries operating system and graphics card
- Hawkeye Cron job manager runs script
- Script outputs graphics card information in ClassAd format
- Binary for NVIDIA cards for more specific information
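The query step above might look like the following minimal sketch. It assumes the card list comes from `lspci`-style output; the slides do not say which tool the prototype's script actually uses, and `find_graphics_cards` is a hypothetical helper name.

```python
def find_graphics_cards(lspci_text):
    """Return the description of each VGA controller in lspci-style output.

    Assumption: input lines look like
    '01:00.0 VGA compatible controller: NVIDIA Corporation ... (rev a2)'.
    """
    cards = []
    for line in lspci_text.splitlines():
        # Split '<slot> <class>: <description>' at the first ': '.
        head, sep, description = line.partition(": ")
        if sep and "VGA compatible controller" in head:
            cards.append(description.strip())
    return cards


if __name__ == "__main__":
    # Sample text stands in for the output of running lspci on a pool machine.
    sample = (
        "00:1f.2 SATA controller: Intel Corporation 82801 SATA Controller\n"
        "01:00.0 VGA compatible controller: NVIDIA Corporation G92 [GeForce 9800 GT] (rev a2)\n"
    )
    for card in find_graphics_cards(sample):
        print(card)
```

Parsing text that is passed in, rather than calling the tool inside the function, keeps the sketch usable on machines without a graphics card.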
7Graphics Card Architecture
8Graphics card APIs
- Favor general purpose computations
- CUDA (NVIDIA)
- Brook (ATI)
- OpenCL (Khronos Group)
9CUDA Programming Model
- Kernels are functions run on the device (GPU)
- Host (CPU) code invokes kernels and determines
  - Number of threads
  - Thread block structure for organizing threads
- Kernel invocations are asynchronous
  - Control returns to the CPU immediately
  - CUDA provides synchronization primitives
  - Some CUDA calls (e.g. memory allocation) are synchronous
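To make the host's role concrete, here is a small Python sketch of the arithmetic behind choosing a launch configuration for a one-dimensional problem: the number of blocks is the element count divided by the threads per block, rounded up. It is not CUDA code; `launch_config` and `global_index` are hypothetical names, and 512 threads per block is the default only because it matches the limit reported in the script output slide.

```python
def launch_config(n_elements, threads_per_block=512):
    """Return (blocks, threads_per_block) providing at least n_elements threads."""
    if n_elements < 0 or threads_per_block <= 0:
        raise ValueError("n_elements must be >= 0 and threads_per_block > 0")
    # Ceiling division: round up so a partial final block is still launched.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def global_index(block_idx, thread_idx, threads_per_block):
    """Flat index a thread would compute: blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * threads_per_block + thread_idx


if __name__ == "__main__":
    n = 1000
    blocks, tpb = launch_config(n)
    print("blocks:", blocks, "threads per block:", tpb)
    # Threads in the last block that fall past the end of the data;
    # a kernel has to check its index against n and skip these.
    print("idle threads:", blocks * tpb - n)
```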
10Hawkeye Cron Job Manager
- Provides mechanism for collecting, storing, and using information about computers
- Periodically executes specified program(s)
- Program outputs in form of ClassAd
- Outputs are added to machine's ClassAd
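The last two bullets can be pictured with a minimal sketch: parse `Name = Value` lines from a program's output and fold them into a dictionary standing in for the machine's ClassAd. This illustrates the idea only and is not Condor's implementation; `merge_cron_output` is a hypothetical name, values are kept as raw strings, and a line containing only `-` is treated as the end of the output block.

```python
def merge_cron_output(machine_ad, output):
    """Return a copy of machine_ad updated with attributes parsed from output."""
    merged = dict(machine_ad)
    for line in output.splitlines():
        line = line.strip()
        if line == "-":
            break  # end-of-block marker: ignore anything after it
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not an assignment
        name, _, value = line.partition("=")
        merged[name.strip()] = value.strip()
    return merged


if __name__ == "__main__":
    # Hypothetical starting ad and script output, for illustration only.
    ad = {"Machine": '"node1.example.org"'}
    script_output = 'HasGpu = True\nNGpu = 1\nGpu0 = "Quadro FX 3700"\n-\n'
    for name, value in merge_cron_output(ad, script_output).items():
        print(name, "=", value)
```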
11Hawkeye Implementation
- Added to local configuration file
- Runs script every minute
- Condor user must be granted graphics card
privileges in order to query the card
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST), UPDATEGPU
STARTD_CRON_UPDATEGPU_EXECUTABLE = gpu.sh
STARTD_CRON_UPDATEGPU_PERIOD = 1m
STARTD_CRON_UPDATEGPU_MODE = Periodic
STARTD_CRON_UPDATEGPU_KILL = True
12Script Output
HasGpu = True
NGpu = 1
Gpu0 = "Quadro FX 3700"
Gpu0CudaCapable = True
Gpu0_Major = 1
Gpu0_Minor = 1
Gpu0Mem = 536150016
Gpu0Procs = 14
Gpu0Cores = 112
Gpu0ShareMem = 16384
Gpu0ThreadsPerBlock = 512
Gpu0ClockRate = 1.24
HasCuda = True
-
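Output of this shape could be produced by a helper like the following sketch. It covers only a few of the attributes listed above, and the input dictionary keys (`name`, `cuda`, `mem`) and the function name `format_gpu_ad` are assumptions made for illustration; the prototype itself uses a shell script and an NVIDIA-specific binary whose internals are not shown in the slides.

```python
def format_gpu_ad(gpus):
    """Render GPU descriptions as ClassAd-style 'Name = Value' lines.

    Each entry is a dict with 'name' (str), 'cuda' (bool) and 'mem' (int, bytes).
    """
    def literal(value):
        # Check bool first so True/False are not formatted as numbers.
        if isinstance(value, bool):
            return "True" if value else "False"
        if isinstance(value, str):
            return '"%s"' % value
        return str(value)

    lines = ["HasGpu = " + literal(bool(gpus)), "NGpu = %d" % len(gpus)]
    for i, gpu in enumerate(gpus):
        lines.append("Gpu%d = %s" % (i, literal(gpu["name"])))
        lines.append("Gpu%dCudaCapable = %s" % (i, literal(gpu["cuda"])))
        lines.append("Gpu%dMem = %s" % (i, literal(gpu["mem"])))
    lines.append("HasCuda = " + literal(any(g["cuda"] for g in gpus)))
    lines.append("-")  # terminator line, as in the output above
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_gpu_ad([{"name": "Quadro FX 3700", "cuda": True, "mem": 536150016}]))
```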
13Job Submission
- Users can submit jobs with GPU requirements into Condor
- Portable across Linux distros
Universe = vanilla
Executable = tests/CudaJob
Initialdir = gpuJobs
Requirements = (HasGpu == true) && (Gpu0CudaCapable == true)
Log = gpu_test.log
Error = gpu_test.stderr
Output = gpu_test.stdout
Queue
condor_submit gpu_job.submit
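Conceptually, the Requirements line in the submit file restricts the job to machines whose ads advertise a CUDA-capable GPU. The sketch below mimics that check in plain Python over dictionaries; it is not Condor's matchmaker, and `meets_gpu_requirements` and the sample pool are made up for illustration.

```python
def meets_gpu_requirements(machine_ad):
    """True only if the ad advertises a GPU and marks GPU 0 as CUDA capable.

    Missing attributes count as a non-match in this sketch, so machines that
    never ran the GPU script are skipped.
    """
    return (machine_ad.get("HasGpu") is True
            and machine_ad.get("Gpu0CudaCapable") is True)


if __name__ == "__main__":
    # Hypothetical machine ads reduced to the two attributes the job tests.
    pool = {
        "node1": {"HasGpu": True, "Gpu0CudaCapable": True},
        "node2": {"HasGpu": True, "Gpu0CudaCapable": False},
        "node3": {},  # no GPU attributes advertised
    }
    eligible = [name for name, ad in pool.items() if meets_gpu_requirements(ad)]
    print(eligible)
```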
14Access Control
- /dev/nvidiactl, /dev/nvidia devices need read/write access by submitting/running user
- Could be
  - Nobody, open access
  - Controlled by Unix group, containing limited users
  - Integrated more directly with Condor user control, slot users
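Whichever policy is chosen, a job wrapper could verify access before launching GPU work. The following is a minimal sketch using the standard `os.access` call; `missing_device_access` is a hypothetical helper, and the `/dev/nvidia0` path in the demo is an assumed example of a per-card device node.

```python
import os


def missing_device_access(paths):
    """Return the paths the current user cannot both read and write."""
    return [p for p in paths if not os.access(p, os.R_OK | os.W_OK)]


if __name__ == "__main__":
    # Assumed device nodes; adjust to the devices present on the machine.
    blocked = missing_device_access(["/dev/nvidiactl", "/dev/nvidia0"])
    if blocked:
        print("no read/write access to:", ", ".join(blocked))
    else:
        print("GPU device files are accessible")
```

A nonexistent path is reported the same way as a permission failure, which is enough for a go/no-go check but does not say which of the two causes applies.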
15Problems
- Preemption
  - Jobs running in a GPU kernel cannot be interrupted reliably by Unix signals
- Watchdog timer
  - After 5 seconds, the job is killed
  - A solution: use the general purpose graphics card as a secondary display
- Memory security
  - Malicious users, interrupting a job between GPU kernel calls, have the opportunity to overwrite or copy GPU memory
16Summary
- Condor-based approach for advertising GPU resources
- Linux-based prototype implementation
  - Can access available GPUs
  - Works best on dedicated machines, with no need for preemption
- Current limitations
  - Doesn't report GPU usage
  - Lack of preemption
  - Limited OS and video card support
17Future Work
- Create benchmark and testing suite
- Handle preemption
  - Investigate how watchdog works
- GPU usage reporting
- Integrate memory protection
- Support more operating systems
  - Windows and Mac OS X
- Support alternative architectures and APIs
  - Brook and OpenCL
18Questions?
- Contact
- timothy.blattner_at_marquette.edu
- craig.struble_at_marquette.edu
- https://sourceforge.net/projects/condorgpu/