Getting Started with PBS
The CAEN Linux Computing Grid uses Portable Batch System (PBS) software to provide the framework for running compute jobs on CAEN Linux workstations while they are otherwise idle. The Grid is currently configured to run on all CAEN dual-boot PCs running Linux. Idle CAEN PCs reboot into Linux at 2:00 a.m. and, unless otherwise in use, are available for running Grid jobs until 7:00 a.m., at which time they reboot into Windows.
From a user's standpoint, the Grid can be viewed as a single computing resource. With the Grid, CAEN users are able to submit compute jobs to run on multiple machines. Pooled together, these machines make up a powerful, collective resource that can be used for many computational needs. Users submit scripts, or jobs, to the Grid, and the Grid then sends them to available computers and manages them to completion. A job can consist of a single script, multiple scripts running in parallel, or scripts that depend on one another. All that users of the Grid need is the knowledge of how to submit and monitor their jobs. This web page is meant to be a starting point for CAEN users, providing a general introduction to CAEN's Linux Grid implementation.
Submitting Jobs
The Grid does the work of finding an available machine, making sure it meets the specifications you request, and running the job. Jobs are submitted to the Linux Grid from the main submission host: ingrid.engin.umich.edu. A typical job will go through the following steps:
- The job is submitted by the user. At this stage, the job is queued, waiting to be scheduled.
- The Grid will find an available host on which to run the job.
- The Grid will send the job to the host.
- The host will run the job.
- After completion of the job, files containing the job's standard output and error messages are written to the directory from which the job was run.
- Once the job has finished, it is removed from the queue and the process is complete.
To submit a job, use the qsub command on the submission host, ingrid.engin.umich.edu. Throughout this document, an example script (example.sh) will be used. Replace the word YOURUNIQNAME in the example script with your actual U-M uniqname:
#!/bin/sh
#
#This is an example script example.sh
#
#These commands set up the Grid Environment for your job:
#PBS -N ExampleJob
#PBS -l nodes=1,walltime=00:01:00
#PBS -q np_workq
#PBS -M YOURUNIQNAME@umich.edu
#PBS -m abe
#print the time and date
date
#wait 10 seconds
sleep 10
#print the time and date again
date
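The placeholder substitution can also be done from the command line. The following is a minimal sketch, not part of the Grid tooling: fill_uniqname is a hypothetical helper name, and jdoe is a made-up uniqname used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: replace the YOURUNIQNAME placeholder in a job script.
# $1 is the uniqname; the script text is read from standard input and the
# filled-in version is written to standard output.
fill_uniqname() {
    sed "s/YOURUNIQNAME/$1/g"
}

# Demonstration on the one directive that contains the placeholder.
# "jdoe" is a made-up uniqname.
printf '%s\n' '#PBS -M YOURUNIQNAME@umich.edu' | fill_uniqname jdoe
# prints: #PBS -M jdoe@umich.edu
```

To apply it to the whole script, one could redirect the file through the helper, for example fill_uniqname jdoe < example.sh > example.filled.sh (the output file name is arbitrary).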
To submit the example.sh script to the CAEN Computing Grid using the qsub command, first make sure you are in the directory where your script is saved. At a Linux prompt, type:
ingrid% qsub example.sh
If successful, you will receive a response in the following format:
/tmp/qsub.example.sh.submit.YYYYY
ZZZZ.gridlock.engin.umich.edu
The number that appears in place of ZZZZ is the job's ID number.
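If you want to reuse the job ID in later commands, it can be extracted from the response. The sketch below assumes the response has exactly the two-line shape shown above; extract_job_id is a hypothetical helper name, and the sample values are made up so the example runs without access to the Grid.

```shell
#!/bin/sh
# Hypothetical helper: pull the numeric job ID out of qsub's response.
# Assumes the response contains a line shaped like
# "ZZZZ.gridlock.engin.umich.edu", as in the format shown above.
extract_job_id() {
    # Print the leading digits of any line that starts with digits followed
    # by a dot; if several lines match, keep the last one.
    printf '%s\n' "$1" | sed -n 's/^\([0-9][0-9]*\)\..*$/\1/p' | tail -n 1
}

# Sample response in the documented format (the numbers are made up).
sample_response='/tmp/qsub.example.sh.submit.12345
1161.gridlock.engin.umich.edu'

job_id=$(extract_job_id "$sample_response")
echo "Job ID: $job_id"
# prints: Job ID: 1161
```

On the submission host, the same helper could be applied to a real submission with job_id=$(extract_job_id "$(qsub example.sh)").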
Monitoring Jobs
Once you have submitted your job, you can see its current state by running the qstat command. Running qstat will give you a list of all jobs submitted to the Linux Grid that are queued or currently running:
Job id           Name           User           Time Use S Queue
---------------- -------------- -------------- -------- - -----
576.gridlock     test           user1                 0 Q np_workq
1161.gridlock    test1          user2          00:00:15 R np_workq
The most common states for a job to be in after submission are:
- Q means the job is in the queue to be run
- T means the job is attempting to transfer from the queue state to the run state
- R means the job is running
In this example, job 576 is in the queued state, and job 1161 is in the running state. The qstat command also has several options that alter the output:
- The -u uniqname option causes qstat to only show jobs submitted by the specified uniqname.
- The -f option shows full information for each job in the queue.
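- Supplying a job ID as an argument (qstat job_id) causes qstat to show only information pertaining to that job. In standard PBS, the -i option instead limits the listing to jobs that are not running (queued, held, or waiting).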
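For a quick summary of a long listing, the state column can be tallied with awk. This is a sketch under the assumption that qstat prints the default six-column layout shown above, with the state as the second-to-last field; count_states and sample_qstat are hypothetical names, and the sample listing stands in for real qstat output.

```shell
#!/bin/sh
# Hypothetical helper: count jobs per state from qstat's default listing,
# read on standard input. Assumes two header lines followed by one line per
# job, with the state letter as the second-to-last field.
count_states() {
    awk 'NR > 2 && NF >= 5 { n[$(NF-1)]++ }
         END { printf "Q=%d T=%d R=%d\n", n["Q"], n["T"], n["R"] }'
}

# Stand-in for real qstat output, copied from the example listing above.
sample_qstat() {
    cat <<'EOF'
Job id           Name           User           Time Use S Queue
---------------- -------------- -------------- -------- - -----
576.gridlock     test           user1                 0 Q np_workq
1161.gridlock    test1          user2          00:00:15 R np_workq
EOF
}

sample_qstat | count_states
# prints: Q=1 T=0 R=1
```

On the submission host, the equivalent call would be qstat | count_states.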
Deleting Jobs
You can delete any job you have submitted to the Grid with the qdel command. The syntax for the qdel command is:
qdel [-W force] [job_id_list]
To delete one or more jobs, supply the job ID(s) to the qdel command. For example, to delete jobs 100 and 101, you would type:
ingrid% qdel 100 101
Adding the -W force option to qdel will force the Grid to delete your job even if the host on which it is running cannot be reached.
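When many jobs need to be removed, the job ID list can be assembled from a qstat listing. The sketch below is a dry run that only prints the qdel command it would use. It assumes the default qstat layout with the user name in the third column; jobs_for_user and sample_listing are hypothetical names, and the third sample job is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric IDs of one user's jobs, one per
# line, reading qstat's default listing on standard input.
# Assumes two header lines and the user name in the third column.
jobs_for_user() {
    awk -v user="$1" 'NR > 2 && $3 == user { sub(/\..*/, "", $1); print $1 }'
}

# Stand-in for real qstat output; job 1162 is made up for illustration.
sample_listing() {
    cat <<'EOF'
Job id           Name           User           Time Use S Queue
---------------- -------------- -------------- -------- - -----
576.gridlock     test           user1                 0 Q np_workq
1161.gridlock    test1          user2          00:00:15 R np_workq
1162.gridlock    test2          user2                 0 Q np_workq
EOF
}

# Join the IDs with spaces to form qdel's job_id_list.
ids=$(sample_listing | jobs_for_user user2 | paste -s -d ' ' -)

# Dry run: show the command instead of running it.
if [ -n "$ids" ]; then
    echo "qdel $ids"
fi
# prints: qdel 1161 1162
```

Printing the command first gives you a chance to check the list before deleting anything. On the submission host, replacing sample_listing with qstat and removing the echo would run the deletion.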


