Title: Job/Data Co-Location with Condor at UWM
1 Job/Data Co-Location with Condor at UWM
- W. Guan, B. Mellado, German Montoya, Sau Lan Wu, Neng Xu (University of Wisconsin-Madison)
2 Why we use Condor DAGMan for data analysis
- Easy for users to reuse their old code.
- We already had Condor installed, with the whole Condor team as our technical backup.
- Users can put lots of analysis jobs in the queue. Easy to maintain a multi-user environment.
- Finding file locations is faster than with Xrootd, especially for large datasets (more than 20,000 files).
- Users can use their own merge methods. The merge method in PROOF-batch is not always working (Gerri is still working on it).
3 Analysis jobs with Condor DAGMan
(Diagram: DAGMan jobs are matched to the hosts that store their input files)
  Job1 -> Host1 : /atlas/xrootd/file1
  Job2 -> Host2 : /atlas/xrootd/file2
  Job3 -> Host3 : /atlas/xrootd/file3
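A minimal sketch of the kind of DAG input file this corresponds to; the file names (analysis.dag, job1.sub, ...) and the merge node are illustrative, not the actual UWM setup:

  # analysis.dag -- one node per input file, plus a final merge node
  # (submitted with: condor_submit_dag analysis.dag)
  JOB  job1   job1.sub
  JOB  job2   job2.sub
  JOB  job3   job3.sub
  JOB  merge  merge.sub
  # the merge node runs only after all analysis nodes finish
  PARENT job1 job2 job3 CHILD merge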
(Diagram: to analyze dataset1, the files in dataset1 and the hosts that store them)
  Host1 : /atlas/xrootd/file1
  Host2 : /atlas/xrootd/file2
  Host3 : /atlas/xrootd/file3
  Host1 : /atlas/xrootd/file4
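A hedged sketch of a per-file submit description that co-locates the job with its data; the host name, script name, and output file names here are assumptions for illustration:

  # job1.sub -- run the analysis on the machine that already stores file1
  universe     = vanilla
  executable   = run_analysis.sh
  arguments    = /atlas/xrootd/file1
  requirements = (Machine == "host1.example.edu")
  output       = job1.out
  error        = job1.err
  log          = dag.log
  queue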
The output HIST files are sent back by Condor.
(Diagram: Submitter, Database, the Xrootd pool)
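Returning the HIST file relies on Condor's standard file transfer; a sketch of the relevant submit-file fragment (the output file name is an assumption):

  # fragment of the submit description: ship the HIST output back to the submitter
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_output_files   = out.HIST.root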
The Merge job runs locally and puts the output into the local Xrootd pool, e.g.
root://higgs10.cs.wisc.edu//atlas/output/file.HIST.root
Users can open it directly with ROOT.
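Since the merge step runs on the submit host, one way it could be expressed as the final DAG node is sketched below; the local universe, script name, and output URL reuse the example above and are not the actual setup:

  # merge.sub -- run the merge on the submit machine and write into the local Xrootd pool
  universe   = local
  executable = merge_hists.sh
  arguments  = root://higgs10.cs.wisc.edu//atlas/output/file.HIST.root
  log        = dag.log
  queue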
4 Condor DAGMan for data analysis
- Condor DAGMan
  - Based on the databases, users only work with datasets; DQ2 and UW_ls provide the file lists.
  - We did some optimizations to reduce the overhead of Condor job submission.
  - This method is good for any I/O-intensive task. Users can even run directly over ESDs or AODs.
  - Users can put lots of jobs in the queue.
  - These DAGMan jobs run in a special fast queue; they will suspend other normal batch jobs.
  - Condor takes care of multi-user job priorities.
5 Some examples
- Use dq2-ls-local inside PROOF
6 Some examples
- Submit a Condor DAGMan job
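For the second example, submission comes down to a single command; a sketch (the DAG file name is the illustrative one used earlier, and the throttle values are arbitrary):

  # submit the DAG, throttling how many node jobs are submitted / left idle at once
  condor_submit_dag -maxjobs 200 -maxidle 50 analysis.dag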
7 Problems with Condor DAGMan
- Problem 1
  - Idle jobs: jobs that never get matched because the destination machine is down.
  - Held jobs: jobs that get held because the output was not produced; they stay in the queue forever.
  - With these two types of jobs, the DAG will never finish.
- Problem 2
  - Performance is slow when there are too many jobs in the DAG.
  - Matching takes time.
  - Too many shadow processes if DAGMan releases too many jobs at once.
8 Solutions with Condor DAGMan
- Problem 1
  - We use
      PeriodicRemove = ((JobStatus == 1 && (CurrentTime - EnteredCurrentStatus) > 300) || (JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 60))
    to remove those unfinished jobs. For Held jobs this is OK, but not for Idle jobs; we check the machine status before creating the jobs.
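In a submit description file the same policy is usually written with the periodic_remove command; a sketch using the thresholds quoted above:

  # remove jobs stuck Idle for >300 s or Held for >60 s (JobStatus 1 = Idle, 5 = Held)
  periodic_remove = (JobStatus == 1 && (CurrentTime - EnteredCurrentStatus) > 300) || (JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 60)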
- Problem 2
  - We reduce the total number of jobs by running multiple files in each job.
  - Less matching, fewer shadow processes, fewer output files.
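One way to run several files per node job is to pass them all as arguments to the same job; a sketch reusing the illustrative names from the earlier diagram (file1 and file4 both live on Host1):

  # job1.sub -- one node job now processes every file stored on host1
  executable   = run_analysis.sh
  arguments    = /atlas/xrootd/file1 /atlas/xrootd/file4
  requirements = (Machine == "host1.example.edu")
  queue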