Resource managers provide control over batch jobs and distributed compute nodes. We use them to submit and control jobs on our departmental clusters. The department runs two resource managers with slightly different syntax: TORQUE and LSF.
In both cases, they combine with MPI to support multi-process distributed processing.
The TORQUE Resource Manager is a distributed resource manager providing control over batch jobs and distributed compute nodes. Its name stands for Terascale Open-Source Resource and QUEue Manager.
To submit a job to TORQUE, you may use this sample script called job.sh:
    #!/bin/bash
    #PBS -l ncpus=16
    echo $PBS_JOBID
    echo "Start time :"
    date
    cd /path/to/job
    mpiexec -np 16 ./job
    echo "End Time :"
    date
The #PBS -l ncpus=16 directive requests 16 processors, and mpiexec -np 16 starts the executable job as 16 MPI processes.
To actually submit the job, we pass job.sh as input to qsub:
    software@abacus:~/samples/job_submission> qsub job.sh
    1843.localhost
qsub returns the job id, here 1843.localhost. By default, standard output and standard error are redirected to the files <JOB_SCRIPT>.o<#JOB_ID> and <JOB_SCRIPT>.e<#JOB_ID> respectively; in our case, job.sh.o1843 and job.sh.e1843. The job id can also be used to manipulate the job with the commands below.
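The naming rule for the default output files can be sketched in a few lines of shell. The helper name torque_output_files is hypothetical, and the sketch assumes that only the numeric part of the job id (everything before the first dot) appears in the file names, as in the job.sh.o1843 example.

```shell
#!/bin/bash
# Hypothetical helper (not part of TORQUE): print the default stdout and
# stderr file names for a job script and the job id that qsub returned.
torque_output_files() {
    local script="$1"
    local job_id="$2"
    # Assumption: only the part before the first dot is used,
    # e.g. "1843.localhost" -> "1843".
    local number="${job_id%%.*}"
    echo "${script}.o${number}"
    echo "${script}.e${number}"
}

# Values taken from the qsub example.
torque_output_files job.sh 1843.localhost
```

In practice the second argument would typically come from the submission itself, for example by capturing the output of qsub job.sh in a variable.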
qstat
Returns the status of our jobs. For example:
    software@abacus:~/samples/job_submission> qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    1843.localhost            job.sh           software        00:00:00 C batch
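When a script needs the state of a single job, the S column of the qstat listing can be picked out with awk. This is a minimal sketch: the helper name qstat_state is hypothetical, and it assumes the default six-column listing shown in the example, where the state letter is the fifth field (C marks a completed job).

```shell
#!/bin/bash
# Hypothetical helper: read a qstat listing on stdin and print the state
# letter (the "S" column, fifth field) of the job whose id is given.
qstat_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Sample listing copied from the qstat example.
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1843.localhost            job.sh           software        00:00:00 C batch'

printf '%s\n' "$sample" | qstat_state 1843.localhost
```

On a cluster, the listing would come from qstat itself, piped into qstat_state with the job id of interest.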
qdel JOB_ID
Deletes the job with job id JOB_ID.
Load Sharing Facility (or simply LSF) is a commercial job scheduler sold by Platform Computing. It can execute batch jobs on networked Unix and Windows systems on many different architectures.
To submit a job to LSF, you may use this sample script called job.lsf:
    #BSUB -L /bin/bash
    #BSUB -J test
    #BSUB -q normal
    #BSUB -o %J.out
    #BSUB -e %J.err
    #BSUB -n 8
    #BSUB -a mvapich
    #BSUB -u email.address@mail.mcgill.ca
    #BSUB -N
    echo $LS_SUBCWD
    cd $LS_SUBCWD
    mpirun.lsf ./job
The #BSUB -n 8 directive requests 8 processors, on which mpirun.lsf runs the executable job.
Then, to submit your job to LSF, issue bsub < job.lsf. It gives output like the following:
    [localuser@hadley second_run]$ bsub < second_run.lsf
    Job <12804> is submitted to queue <normal>.
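Unlike qsub, bsub wraps the job id in a sentence, so a script that wants the id has to extract it. The following is a sketch only: the helper name lsf_job_id is hypothetical, and it assumes the confirmation line has exactly the "Job <ID> is submitted ..." form shown in the example.

```shell
#!/bin/bash
# Hypothetical helper: read bsub's confirmation line on stdin and print
# only the numeric job id found between the first pair of angle brackets.
lsf_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Confirmation line copied from the bsub example.
echo 'Job <12804> is submitted to queue <normal>.' | lsf_job_id
```

On a cluster, the output of bsub < job.lsf could be piped through lsf_job_id and the result stored in a variable for later use with bjobs or bkill.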
bjobs
Returns the status of our jobs. For example:
    [localuser@hadley second_run]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    12805   localus PEND  normal     hadley                  *pich_test May 21 14:59
bkill JOB_ID
Deletes the job with job id JOB_ID.
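The status reported by bjobs can be read in a script by counting fields from the left. In the example listing the pending job has no execution host yet, so its row has fewer fields than the header; STAT is still the third field. The helper name bjobs_state is hypothetical, and the sketch assumes the default column order shown in the example.

```shell
#!/bin/bash
# Hypothetical helper: read a bjobs listing on stdin and print the STAT
# column (third field from the left) for the given job id. Counting from
# the left keeps this correct even when EXEC_HOST is empty.
bjobs_state() {
    awk -v id="$1" '$1 == id { print $3 }'
}

# Sample listing copied from the bjobs example.
sample='JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
12805   localus PEND  normal     hadley                  *pich_test May 21 14:59'

printf '%s\n' "$sample" | bjobs_state 12805
```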