Introduction to Resource Control Systems

Resource control systems provide control over batch jobs and distributed compute nodes. We use them to submit and control jobs on our departmental clusters. The department runs two resource control systems, with slightly different syntax:

  1. Torque
  2. Platform LSF

Both systems work together with MPI to run multi-process distributed jobs.

Torque

The TORQUE Resource Manager is a distributed resource manager providing control over batch jobs and distributed compute nodes. Its name stands for Terascale Open-Source Resource and QUEue Manager.

Job Submission

To submit a job to Torque, you may use this sample script called job.sh:

#!/bin/bash
#PBS -l ncpus=16

echo $PBS_JOBID
echo "Start time :"
date

cd /path/to/job
mpiexec -np 16 ./job

echo "End Time :"
date

This script tries to run the executable job on 16 processors: the #PBS -l ncpus=16 directive requests them from Torque, and mpiexec -np 16 starts 16 MPI processes.

To actually submit the job, pass job.sh as input to qsub:

software@abacus:~/samples/job_submission> qsub job.sh 
1843.localhost

qsub prints the job id, here 1843.localhost. By default, standard output and standard error are redirected to the files <JOB_SCRIPT>.o<#JOB_ID> and <JOB_SCRIPT>.e<#JOB_ID> respectively. In our case: job.sh.o1843 and job.sh.e1843. The job id is also what the job manipulation commands take as their argument.
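
The naming rule above can be sketched in a few lines of bash. This is only an illustration: torque_output_files is a hypothetical helper, not a Torque command, and it assumes the default naming has not been overridden in the job script.

```bash
#!/bin/bash
# Sketch: derive the default Torque output file names from a qsub job id.
# torque_output_files is a hypothetical helper, not part of Torque.
torque_output_files() {
    local script="$1"          # job script name, e.g. job.sh
    local job_id="$2"          # full id as printed by qsub, e.g. 1843.localhost
    local num="${job_id%%.*}"  # keep only the part before the first dot: 1843
    echo "${script}.o${num}"   # standard output file
    echo "${script}.e${num}"   # standard error file
}

# Values taken from the example above.
torque_output_files job.sh 1843.localhost
```

On the cluster, the id could be captured at submission time with job_id=$(qsub job.sh) and passed to the helper in place of the literal value.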

Job Manipulation

  • qstat Returns the status of our jobs. For example,
    software@abacus:~/samples/job_submission> qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    1843.localhost             job.sh           software        00:00:00 C batch    
  • qdel JOB_ID Deletes job with job id JOB_ID
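
When checking a job from a script, the state column (S) of the qstat listing is usually the interesting part. The sketch below extracts it with awk. job_state is a hypothetical helper, and it assumes the default six-column layout shown above, with the job id printed in full in the first column.

```bash
#!/bin/bash
# Sketch: print the state column (S) for one job from default qstat output.
# job_state is a hypothetical helper; it reads qstat output on stdin.
job_state() {
    local job_id="$1"
    # Assumed column layout: Job id, Name, User, Time Use, S, Queue
    awk -v id="$job_id" '$1 == id { print $5 }'
}

# Sample output copied from the listing above, so the sketch runs without a cluster.
sample='Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1843.localhost             job.sh           software        00:00:00 C batch'

printf '%s\n' "$sample" | job_state 1843.localhost
```

On the cluster, the same helper would be used as qstat | job_state 1843.localhost. In Torque, the state C means the job has completed, R that it is running, and Q that it is queued.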

Platform LSF

Load Sharing Facility (or simply LSF) is a commercial job scheduler sold by Platform Computing. It can be used to execute batch jobs on networked Unix and Windows systems on many different architectures.

Job Submission

To submit a job to LSF, you may use this sample script called job.lsf:

#BSUB -L /bin/bash 
#BSUB -J test 
#BSUB -q normal 
#BSUB -o %J.out 
#BSUB -e %J.err 
#BSUB -n 8 
#BSUB -a mvapich 
#BSUB -u email.address@mail.mcgill.ca
#BSUB -N 
echo $LS_SUBCWD 
cd $LS_SUBCWD 
mpirun.lsf ./job

This script tries to run the executable job on 8 processors, as requested by the #BSUB -n 8 directive.

To submit your job to LSF, run bsub < job.lsf. The output looks like the following (this example was recorded with a script named second_run.lsf):

[localuser@hadley second_run]$ bsub < second_run.lsf 
Job <12804> is submitted to queue <normal>.
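
Unlike qsub, bsub wraps the job id in a sentence, so a script that needs the id has to extract it. A minimal sketch, assuming the message format shown above; lsf_job_id is a hypothetical helper, not an LSF command.

```bash
#!/bin/bash
# Sketch: extract the numeric job id from bsub's submission message.
# lsf_job_id is a hypothetical helper; it reads the message on stdin.
lsf_job_id() {
    # Print only the digits between "Job <" and ">"; other lines print nothing.
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Message copied from the example above, so the sketch runs without a cluster.
echo 'Job <12804> is submitted to queue <normal>.' | lsf_job_id
```

On the cluster, this could be combined with the submission itself as job_id=$(bsub < job.lsf | lsf_job_id).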

Job Manipulation

  • bjobs Returns the status of our jobs. For example,
    [localuser@hadley second_run]$ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    12805   localus PEND  normal     hadley                  *pich_test May 21 14:59 
    
  • bkill JOB_ID Deletes job with job id JOB_ID
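
The two commands above are often used together, for example to find jobs that are still pending before deciding whether to remove them. The sketch below lists the ids of pending jobs from a bjobs listing. pending_job_ids is a hypothetical helper, and it assumes the default column layout shown above, with STAT in the third column.

```bash
#!/bin/bash
# Sketch: list the ids of jobs whose STAT column is PEND.
# pending_job_ids is a hypothetical helper; it reads bjobs output on stdin.
pending_job_ids() {
    # Skip the header line, then print JOBID (column 1) where STAT (column 3) is PEND.
    awk 'NR > 1 && $3 == "PEND" { print $1 }'
}

# Sample output copied from the listing above, so the sketch runs without a cluster.
sample='JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
12805   localus PEND  normal     hadley                  *pich_test May 21 14:59'

printf '%s\n' "$sample" | pending_job_ids
```

On the cluster, the helper would be fed from bjobs directly (bjobs | pending_job_ids). Review the resulting ids before passing any of them to bkill.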
it/resource_control_systems.txt · Last modified: 2012/05/21 14:57 by admin