Getting started with NQS




Introduction

This document describes the Network Queuing System (NQS) as implemented at Monsanto on Unix workstations. It provides basic information on how to set up jobs, submit them, and monitor their status as they run. It is intended as an overview for new users, not a complete description. For more detailed information, consult the appropriate man pages.

NQS allows one to submit batch jobs to queues on local or remote machines and have the log file returned to the originating machine or another machine. After submitting the request, the user can watch the progress of the request. The user can also affect the job after it is submitted by holding a queued job from being scheduled, releasing a held job, suspending a running request, resuming a suspended request, or deleting a queued or running request.

There are two main types of queues: batch and pipe. A batch queue is an execution queue where the request actually runs. A pipe queue provides routing capabilities; when a request is submitted to a pipe queue it is passed on to another pipe queue for further routing or a batch queue for execution on the same or another machine.

The core NQS user commands are as follows:

Qsub    - Submit an NQS job
Qdel    - Delete an NQS job
Qstat   - Determine the status of a job or a queue

The security mechanism for NQS is the .rhosts file, which is checked to determine if a request from a user on a remote system can be processed. This is checked when a request for status or a job arrives from a remote system. NQS requires that the system name and username both be present on the line separated by whitespace. In some cases it is necessary to have a line with the unqualified host name and a line with the fully qualified host name. If this file is not present on both the local and remote machines, then requests may not transfer to the execution machine, or the log file may not be returnable to the local machine.
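
A quick way to confirm that an entry is present is to scan the file for a line holding both names. The following is a minimal sketch only: the function name rhosts_has_entry is hypothetical and not part of NQS, and it checks nothing beyond the "hostname username" line format described above.

```shell
# Hypothetical helper (not an NQS command): succeed if FILE has a line
# whose first two whitespace-separated fields are HOST and USER.
rhosts_has_entry() {
    file=$1 host=$2 user=$3
    [ -r "$file" ] || return 1
    awk -v h="$host" -v u="$user" \
        '$1 == h && $2 == u { found = 1 } END { exit(found ? 0 : 1) }' "$file"
}

# Example: check both the unqualified and the fully qualified host name.
for h in beaker beaker.monsanto.com; do
    if rhosts_has_entry "$HOME/.rhosts" "$h" jrroma; then
        echo "$h: entry found"
    else
        echo "$h: entry missing"
    fi
done
```

The same check would need to be run on both the local and the remote machine, since the file is consulted at each end.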




How do I submit a job to NQS?

NQS jobs are submitted using the Qsub command. Qsub accepts a script which contains the shell commands to be executed when the job runs. In addition, you can provide instructions to Qsub to modify the characteristics of the request, such as to give the request a name, to indicate where the job is to run, and the like. These switches can be embedded at the beginning of the script, or placed on the Qsub command line.

Here is a sample NQS script with embedded switches:

  # QSUB -eo
  # QSUB -r cvtabc
  # QSUB -q batch
  # QSUB
  .
  .       Various script commands follow here

Note that the lines starting with "#" appear as comments to the shell, but that Qsub interprets the lines starting with "# QSUB" as indicators that a Qsub switch follows. This script indicates that stdout and stderr should be combined into one file (-eo), that the request should be called "cvtabc" (-r), and that the job should be queued to the queue called "batch" (-q). The final QSUB line without any parameters indicates to Qsub that no more switches follow.

If this script was called scriptname.sh, it could be submitted using the command:

  qsub scriptname.sh

If a similar script called anotherscript.sh lacked the embedded switches, it could be submitted with the following command and would run exactly as the script above:

  qsub -eo -r cvtabc -q batch anotherscript.sh

It is also possible to have switches both embedded in the script and on the command line.
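
To see which switches a script carries before submitting it, the embedded lines can be listed with a short filter. This is a sketch under assumptions: the function name embedded_switches is hypothetical, and it only approximates the "# QSUB" convention described above. Qsub's own parser may treat edge cases differently.

```shell
# Hypothetical helper: print the switches embedded in a job script.
# Reads "# QSUB ..." lines and stops at the bare "# QSUB" terminator.
embedded_switches() {
    awk '
        $1 == "#" && $2 == "QSUB" {
            if (NF == 2) exit          # bare "# QSUB": no more switches
            sub(/^[ \t]*#[ \t]*QSUB[ \t]+/, "")
            print
        }
    ' "$1"
}

# Example: the sample script from this section, written to a temporary file.
script=$(mktemp)
cat > "$script" <<'EOF'
# QSUB -eo
# QSUB -r cvtabc
# QSUB -q batch
# QSUB
echo "script commands follow here"
EOF
embedded_switches "$script"    # prints -eo, -r cvtabc, -q batch (one per line)
rm -f "$script"
```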

Here are several of the most often used Qsub switches:

       Switch          Action
       -a              run request after stated time
       -e              direct stderr output to the given destination
       -eo             combine stdout and stderr in one file
       -o              direct stdout output to the given destination
       -r              give the request a name
       -q              indicate to which queue to submit the job

All of the Qsub switches are explained in detail in the Qsub man pages.

The script file is spooled when you submit it, so you can modify the script after submission and not affect the request.

By default the sequence number of the request is printed when Qsub completes its processing. This number combined with the hostname makes up the unique identifier for the request.
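
As an illustration, the identifier can be assembled from the two parts. The helper name request_id is hypothetical, and the sketch assumes that the unqualified host name is used, as in the identifiers shown elsewhere in this document (for example 129.beaker).

```shell
# Hypothetical helper: build a request identifier from a sequence number.
# Assumes the short (unqualified) host name is used, as in "129.beaker".
request_id() {
    seq_no=$1
    host=${2:-$(hostname)}             # default to the local host name
    printf '%s.%s\n' "$seq_no" "${host%%.*}"
}

request_id 129 beaker.monsanto.com    # prints 129.beaker
```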




How do I decide which queue to use?

The queue configuration on any set of machines is very site dependent, so no single description applies to all locations. Instead, please read the man pages on NQSCONFIG, which describe the local configuration, or ask your local NQS system manager.

The general purpose command for determining the status of queues and jobs is Qstat. To find out what queues are present on the local machine, use the following command:

  qstat -x

If you add the "-b" switch you will get a brief version of the information, and if you add the "-l" switch you will see a lot more. Sample output from qstat -x is:

batch@beaker.monsanto.com;  type=BATCH;  [ENABLED, INACTIVE];  pri=16  lim=1
0 exit;   0 run;   0 stage;   0 queued;   0 wait;   0 hold;   0 arrive;
User run limit= 1

helium@beaker.monsanto.com;  type=PIPE;  [ENABLED, INACTIVE];  pri=16  lim=1
0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
Destset = {batch@helium};

The first queue is a batch queue, and jobs actually run in this queue. The second queue is a pipe queue, which means jobs submitted to it are transferred to another queue, either on the same machine or another, to execute. The destset on the helium queue indicates that the jobs submitted to that queue are transferred to the batch queue on the node helium to run.
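
When many queues are defined, a one-line summary per queue can be easier to scan. The filter below is a sketch that assumes the header line format shown in the sample output above. The function name list_queues is hypothetical and not an NQS command.

```shell
# Hypothetical helper: read "qstat -x" style output on stdin and print
# one "queue host type" line per queue header.
list_queues() {
    awk -F';' '
        /@/ && /type=/ {
            split($1, q, "@")              # q[1] = queue, q[2] = host
            t = $2
            sub(/^[ \t]*type=/, "", t)     # strip the leading " type="
            print q[1], q[2], t
        }
    '
}

# Example using lines from the sample output.
list_queues <<'EOF'
batch@beaker.monsanto.com;  type=BATCH;  [ENABLED, INACTIVE];  pri=16  lim=1
0 exit;   0 run;   0 stage;   0 queued;   0 wait;   0 hold;   0 arrive;
helium@beaker.monsanto.com;  type=PIPE;  [ENABLED, INACTIVE];  pri=16  lim=1
Destset = {batch@helium};
EOF
```

On a system with NQS installed, the same filter could be fed directly with "qstat -x | list_queues".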

If you want to learn more about queues on remote machines, use the command of the form:

  qstat -x @ddcs1

which indicates that the request should be forwarded to the machine ddcs1 and the appropriate information printed on your screen.




How do I get the status of my jobs?

Again, Qstat is the command to get the status of NQS requests. You can use various Qstat switches to select which requests are shown. By default, Qstat shows only your own jobs on the local machine, regardless of where they originated. Additional switches can be used:

Qstat switch    Effect
-a              show all requests
-u username     show requests belonging to a specific user
-o              select jobs which originated on the local machine
-d              show jobs on all machines within the local NQS domain

There are also switches which control the format of the output. The default Monsanto format is a single line for each request. The -s switch gives the standard COSMIC NQS format, and the -l switch provides much more detail in a long format.

The systems in the local NQS domain are listed in the file /usr/lib/nqs/nqs-domain (by default). This is a list of systems which can be considered a unit; jobs can be submitted between systems on the list. The -d switch then requests information from each system on the list.

This list can be modified by having a file called .qstat in your home directory which has the same format as the system-wide file, but has only the systems in which you are interested. Then you will get NQS status only from that list of systems.

The Qcat utility is also available to get information on the status of a job. It will list the spooled input script or the available output or error files. Since applications may not flush the stdout or stderr streams frequently, the available information may be limited, but it can be helpful in indicating how a job is progressing.

Here is an example of the default qstat output:

Request         I.D.   Owner    Queue    Start Time   Time Limit  Total Time St
-------------- ------  -------- -------- -----------  ----------  ---------- --
example           129  jrroma   batch    4/30 10:11   4 04:00:00  0 00:00:00 R

The columns are self-explanatory, except perhaps for the last one, which indicates the status of the request. Possible statuses include R for running, Q for queued, H for holding, W for waiting, and S for suspended.
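
In scripts it can be convenient to expand the status letter into a word. The sketch below covers only the five codes listed above. The function name status_name is hypothetical, and any other code is reported as unknown.

```shell
# Hypothetical helper: expand a one-letter status code from the last column.
status_name() {
    case $1 in
        R) echo running ;;
        Q) echo queued ;;
        H) echo holding ;;
        W) echo waiting ;;
        S) echo suspended ;;
        *) echo unknown ;;
    esac
}

# Take the last field of a request line and decode it.
line='example  129  jrroma  batch  4/30 10:11  4 04:00:00  0 00:00:00 R'
status_name "${line##* }"    # prints running
```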

The -s switch gives the following information in the standard COSMIC NQS format:

batch@beaker.monsanto.com;  type=BATCH;  [ENABLED, RUNNING];  pri=16  lim=1
0 exit;   1 run;   0 stage;   0 queued;   0 wait;   0 hold;   0 arrive;
User run limit= 1

   REQUEST NAME        REQUEST ID            USER  PRI    STATE     PGRP
1: example             129.beaker          jrroma   31  RUNNING     7835
helium@beaker.monsanto.com;  type=PIPE;  [ENABLED, INACTIVE];  pri=16  lim=1
0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;

And an example of output from the -l switch is as follows:

batch@beaker.monsanto.com;  type=BATCH;  [ENABLED, RUNNING];  pri=16  lim=1
0 exit;   1 run;   0 stage;   0 queued;   0 wait;   0 hold;   0 arrive;
User run limit= 1

Request    1:  Name=example
Id=129.beaker     Owner=jrroma  Priority=31  RUNNING  Pgrp=7835
Created at Thu Apr 30 10:11:09 CDT 1992
Mail = [NONE]
Mail address = jrroma@beaker
Owner user name at originating machine = jrroma
Request is not restartable, not recoverable.
Broadcast = [NONE]
Per-proc. core file size limit= [32 megabytes, 32 megabytes]
Per-proc. data size limit= [32 megabytes, 32 megabytes]
Per-proc. permanent file size limit= [500 megabytes, 500 megabytes]
Per-proc. execution nice priority = 0
Per-proc. stack size limit= [32 megabytes, 32 megabytes]
Per-proc. CPU time limit= [360000.0, 360000.0]
Per-proc. working set limit= [32 megabytes, 32 megabytes]
Standard-error access mode = EO
Standard-output access mode = SPOOL
Standard-output name = beaker:/usr2/jrroma/tmp/example.o129
Shell = DEFAULT
Umask =  22

helium@beaker.monsanto.com;  type=PIPE;  [ENABLED, INACTIVE];  pri=16  lim=1
0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;

Again, information on the status of jobs on remote machines can be obtained by using the "@node" syntax to indicate where to get the information.
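
A single item can be pulled out of the long format with a small filter. This sketch assumes the "Name = value" line layout shown in the -l example above. The function name long_field is hypothetical, and lines that carry several items (such as the Id/Owner line) are not handled.

```shell
# Hypothetical helper: print the value of one "Name = value" line from
# "qstat -l" style output read on stdin.
long_field() {
    awk -v key="$1" '
        {
            pos = index($0, "=")
            if (pos == 0) next
            name = substr($0, 1, pos - 1)
            value = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim both ends
            gsub(/^[ \t]+|[ \t]+$/, "", value)
            if (name == key) { print value; exit }
        }
    '
}

# Example using lines from the sample output.
long_field "Standard-output name" <<'EOF'
Standard-error access mode = EO
Standard-output access mode = SPOOL
Standard-output name = beaker:/usr2/jrroma/tmp/example.o129
EOF
```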




Why is my request not running?

Occasionally, your job may be in the Waiting or Queued state and it might not be clear why it is not running. Determination of the reason can be complicated. NQS allows system managers to set limits on the number of jobs that can run at a time. There are queue run limits, which limit the total number of jobs that can run in a queue at a time, and queue user run limits, which limit the number of jobs a particular user can run at a time. In the same manner there are global run and user run limits, which determine the number of total jobs that can run on the system and the number of jobs a person can have running at any one time, respectively. An investigation of the interactions of these limits and the mix of jobs on the system should indicate the reason a particular request is not running.
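
The four limits can be pictured as a series of checks, each comparing a current count with its limit. The sketch below is a simplified model for reasoning about a stuck request, not the NQS scheduler itself. The function name can_start and its argument order are hypothetical, and a real system may take other factors into account.

```shell
# Hypothetical model: can_start QRUN QLIM QUSER QUSERLIM GRUN GLIM GUSER GUSERLIM
# Each pair is "current count, limit". Prints the first limit that blocks
# the request, or "ok" when none does.
can_start() {
    if   [ "$1" -ge "$2" ]; then echo "queue run limit"
    elif [ "$3" -ge "$4" ]; then echo "queue user run limit"
    elif [ "$5" -ge "$6" ]; then echo "global run limit"
    elif [ "$7" -ge "$8" ]; then echo "global user run limit"
    else echo ok
    fi
}

# One job already running in a queue with lim=1: the queue run limit blocks.
can_start 1 1 0 1 1 10 0 5    # prints queue run limit
```

The current counts and limits for a queue appear in the Qstat output, for example the "run" count, "lim=" and "User run limit=" values in the samples shown earlier.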




How do I delete a job?

Qdel is the command that deletes NQS jobs. It takes as a parameter the identifier of the job or jobs to be deleted. The identifier consists of the sequence number and the originating host of the job separated by a period. The sequence number will be reported when you submit the job, and it is shown when you do a qstat on the job. So the identifier of a job which is sequence number 217 and was originally submitted on beaker is 217.beaker.

If this job is queued on beaker, the appropriate command is:

  qdel 217.beaker

If the job is running, you must add the -k switch, which indicates that the running job is to be killed.

Local jobs can be deleted by request name with the -r switch. The argument to the -r switch is the request pattern to delete. If the -c switch is used with the -r switch, the user is prompted to confirm the deletion of the job.

If this job submitted from beaker is now running on a remote machine, you will need to add the remote system name:

  qdel -k 217.beaker@ddcs1

where ddcs1 is the name of the remote machine where the job is running. This will send a message to ddcs1 to delete request 217, which originated on beaker.
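
The cases above can be combined into one rule: add -k for a running job, and append @host when the job is on a remote machine. The sketch below only prints the command it would use and runs nothing. The function name qdel_command is hypothetical, and the state argument uses the Qstat status letters.

```shell
# Hypothetical helper: print (not run) the qdel command for a request.
# Usage: qdel_command ID STATE [REMOTE_HOST]
qdel_command() {
    id=$1 state=$2 remote=$3
    cmd=qdel
    if [ "$state" = R ]; then
        cmd="$cmd -k"                  # running jobs need -k
    fi
    if [ -n "$remote" ]; then
        id="$id@$remote"               # job is on another machine
    fi
    echo "$cmd $id"
}

qdel_command 217.beaker Q          # prints qdel 217.beaker
qdel_command 217.beaker R ddcs1    # prints qdel -k 217.beaker@ddcs1
```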




Advanced Information

There are other commands one can use to modify requests, but which are not used as often. These are:

Qhold
Qhold holds all queued or waiting NQS requests given on the command line. Qhold will not hold a running request.

Qrls
Qrls is the inverse operation of Qhold. That is, it releases a held request and makes it eligible to be scheduled to run. Qrls cannot release a job which is not being held.

Qsuspend
Qsuspend takes a running request and causes it to have no access to the cpu. That is, it will no longer run. One cannot suspend a request which is not in the NQS Running state.

Qresume
Qresume is the inverse operation of Qsuspend. It takes a suspended job and lets the cpu run the job again. Qresume cannot resume a job which is not suspended. These two commands have only been tested on SGI machines running IRIX and IBM RS6000s running AIX.

Qlimit
Qlimit does not modify the request, but shows the supported batch limits and shell strategy for the local or remote host. Here is some example output:

    Core file size limit (-lc)
    Data segment size limit (-ld)
    Per-process permanent file size limit (-lf)
    Nice value (-ln)
    Stack segment size limit (-ls)
    Per-process cpu time limit (-lt)
    Working set limit (-lw)

    Shell strategy = FREE

For more information on these commands, consult their respective man pages.
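
The rules for Qhold, Qrls, Qsuspend, and Qresume can be summarized as a small table of allowed state changes. The sketch below is a model of the descriptions in this section, not NQS code. The function name next_state is hypothetical, the states use the Qstat status letters, and a released request is assumed to return to the queued state.

```shell
# Hypothetical model: next_state COMMAND STATE
# Prints the new state, or returns 1 when the command does not apply.
next_state() {
    case "$1:$2" in
        qhold:Q|qhold:W) echo H ;;     # hold a queued or waiting request
        qrls:H)          echo Q ;;     # release a held request (assumed Q)
        qsuspend:R)      echo S ;;     # suspend a running request
        qresume:S)       echo R ;;     # resume a suspended request
        *)               return 1 ;;   # command not valid in this state
    esac
}

next_state qhold Q                     # prints H
next_state qsuspend Q || echo "cannot suspend a request that is not running"
```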